3,253 controlled inference runs across 7 GGUF K-quant variants on Pixel 6a, M4 Mac, and x86, revealing non-monotonic throughput, KV-cache collapse thresholds, and cross-device generalisation of quantization behaviour.
Non-monotonic speed ordering — Q2_K is fastest on ARM despite lowest bit-width
Pixel 6a @ ctx=256 — cliff_sweep (Q2_K, Q3_K_M, Q4_K_S, Q6_K, Q8_0) · standard_sweep (Q4_K_M, Q5_K_M — thermal burst artifact excluded) · M4 Mac GPU (Metal) — Llama ctx=1024 cliff baseline and Qwen tg128 TPS sweep · M4 Mac CPU — clean tg128 TPS sweep (2026-04-15, n=10) · x86 mean of 5 trials @ ctx=256
ARM onset ctx=512 (Q2_K −48%, Q5_K_M −46%) · x86 onset ctx≈1300–1400 · Metal: flat
Shaded band marks the per-device cliff onset: ARM ctx=512 (Q2_K, Q5_K_M); x86 Llama ctx=1300–1400. Metal: no band; M4 Qwen shows a Q2_K decline but no validated cache-collapse threshold. KV-cache quant overlay available for Q3_K_M and Q6_K.
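The per-device cliff onset above can be located mechanically from a decode-TPS-vs-context sweep. A minimal sketch: the data points and the 40% drop threshold are illustrative assumptions (chosen to match drops like ARM's −48% at ctx=512), not the published runs.

```python
# Sketch: detect a KV-cache "cliff" onset from a decode-TPS-vs-context sweep.
# The sweeps below are illustrative, not from the published dataset; the 40%
# drop threshold is an assumption tuned to flag collapses like ARM's -48%.

def cliff_onset(sweep, drop=0.40):
    """Return the first context length where decode TPS falls by more than
    `drop` relative to the previous context point, or None if the curve is flat."""
    points = sorted(sweep.items())  # (ctx, tps) pairs in ascending ctx order
    for (_, prev_tps), (ctx, tps) in zip(points, points[1:]):
        if prev_tps > 0 and (prev_tps - tps) / prev_tps > drop:
            return ctx
    return None

# Illustrative ARM-like sweep: TPS roughly halves between ctx=256 and ctx=512.
arm_q2k = {128: 11.8, 256: 11.5, 512: 6.0, 1024: 5.7}
print(cliff_onset(arm_q2k))  # -> 512

# A Metal-like flat sweep yields no onset.
metal = {128: 60.1, 256: 59.8, 512: 59.5, 1024: 58.9}
print(cliff_onset(metal))  # -> None
```

The per-point relative drop (rather than a drop from the global peak) keeps the detector robust to slow thermal drift across the sweep.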
Accuracy across 6 NLP benchmarks — Q4_K_M beats Q6_K on BoolQ
100-question samples from official benchmark test sets. Exact-match scoring. imatrix = importance-weighted quantization calibration.
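Exact-match scoring as used above reduces to a string comparison per question. A minimal sketch; the normalisation rule (strip + lowercase) is an assumption, not the published harness:

```python
# Sketch of exact-match scoring: a prediction scores only if it equals the
# gold answer after trivial normalisation. Strip + lowercase is an assumed
# normalisation, not necessarily what the benchmark harness applies.

def exact_match_accuracy(preds, golds):
    norm = lambda s: s.strip().lower()
    hits = sum(norm(p) == norm(g) for p, g in zip(preds, golds))
    return hits / len(golds)

preds = ["True", "false ", "yes"]
golds = ["true", "false", "no"]
print(exact_match_accuracy(preds, golds))  # 2 of 3 match -> 0.666...
```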
ARM ordering replicates on M4 · reverses on Metal GPU · x86 intermediate
Llama x86: cliff-sweep data available (n=5 trials per context) · Qwen x86 cliff attempts are excluded from the public release and hidden from this heatmap rather than shown as empty cells · M4 Qwen appears only at the contexts measured in the validated Metal cliff run.
big.LITTLE architecture sweet spot: 4 threads (2× P-cores + 2× E-cores). Using 8 threads regresses throughput due to E-core saturation.
Q4_K_M achieves near-Q8_0 perplexity — quality floor at 3 bits
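Perplexity, the quality metric behind this comparison, is the exponential of the mean negative log-likelihood over evaluated tokens; lower is better. A minimal sketch with made-up token probabilities, not model output:

```python
# Perplexity = exp(mean negative log-likelihood) over evaluated tokens.
# The probabilities below are invented for illustration only.
import math

def perplexity(token_probs):
    nll = [-math.log(p) for p in token_probs]  # per-token negative log-likelihood
    return math.exp(sum(nll) / len(nll))

# A model assigning uniform 0.25 probability to every token scores ~4.0.
print(perplexity([0.25, 0.25, 0.25, 0.25]))  # ~ 4.0
```

"Near-Q8_0 perplexity" for Q4_K_M means this quantity barely rises relative to the 8-bit baseline despite halving the bits per weight.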
Filter and browse all 3,253 published inference rows
| Device | Variant | Model | Context | Decode TPS ↕ | Prefill TPS | Experiment | Threads |
|---|---|---|---|---|---|---|---|