4,000+ controlled inference runs across 7 GGUF K-quant variants on Pixel 6a, M4 Mac, and x86, revealing non-monotonic throughput, KV-cache collapse thresholds, and cross-device generalisation of quantization behaviour.
Non-monotonic speed ordering — Q2_K is fastest on ARM despite lowest bit-width
Pixel 6a @ ctx=256 — cliff_sweep (Q2_K, Q3_K_M, Q4_K_S, Q6_K, Q8_0) · standard_sweep (Q4_K_M, Q5_K_M — thermal burst artifact excluded) · M4 Mac GPU (Metal) @ ctx=1024 canonical cliff sweep (n=5) · M4 Mac CPU — TPS sweep (n_prompt=0, n_gen=128, n=10, thermally settled) · x86 mean of 5 trials @ ctx=256
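The decode-only TPS sweep above (n_prompt=0, n_gen=128, n=10) maps naturally onto llama.cpp's llama-bench flags. Below is a minimal sketch of one such sweep, assuming local GGUF files and a hypothetical `model-<variant>.gguf` naming scheme; it is illustrative, not the project's actual harness.

```python
# Minimal sketch of a decode-only TPS sweep, assuming llama.cpp's llama-bench
# binary at ./llama-bench and a hypothetical "model-<variant>.gguf" naming scheme.
import json
import subprocess

VARIANTS = ["Q2_K", "Q3_K_M", "Q4_K_S", "Q4_K_M", "Q5_K_M", "Q6_K", "Q8_0"]

def run_decode_sweep(model_dir: str, threads: int = 4, reps: int = 10) -> dict:
    """Decode-only throughput (n_prompt=0, n_gen=128) for each quant variant."""
    results = {}
    for variant in VARIANTS:
        cmd = [
            "./llama-bench",
            "-m", f"{model_dir}/model-{variant}.gguf",
            "-p", "0",           # no prefill: isolate decode throughput
            "-n", "128",         # generate 128 tokens per repetition
            "-t", str(threads),
            "-r", str(reps),     # repetitions, reported as mean ± stddev
            "-o", "json",
        ]
        out = subprocess.run(cmd, capture_output=True, text=True, check=True)
        results[variant] = json.loads(out.stdout)
    return results
```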
ARM onset ctx=512 (Q2_K −48%, Q5_K_M −46%) · x86 onset ctx≈1300–1400 · Metal: flat
Shaded band marks the per-device cliff onset: ARM ctx=512 (Q2_K, Q5_K_M); x86 ctx=1300–1400. Metal: no band (no cliff observed). KV-cache quant overlay available for Q3_K_M and Q6_K.
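As a rough illustration of how the shaded onset could be derived, the sketch below flags the first context length whose decode TPS drops by more than a chosen fraction relative to the previous measurement. The threshold and the example numbers are illustrative, not the project's detection rule.

```python
# Hedged sketch: flag the cliff onset as the first context step whose decode
# TPS falls by at least `drop_threshold` relative to the previous context.
def cliff_onset(tps_by_ctx: dict[int, float], drop_threshold: float = 0.30):
    ctxs = sorted(tps_by_ctx)
    for prev, cur in zip(ctxs, ctxs[1:]):
        drop = 1.0 - tps_by_ctx[cur] / tps_by_ctx[prev]
        if drop >= drop_threshold:
            return cur
    return None

# Made-up numbers shaped like the ARM Q2_K result (-48% at ctx=512):
print(cliff_onset({128: 10.2, 256: 9.8, 512: 5.1, 1024: 4.9}))  # -> 512
```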
Accuracy across 6 NLP benchmarks — Q4_K_M beats Q6_K on BoolQ
100-question samples from official benchmark test sets. Exact-match scoring. imatrix = importance-weighted quantization calibration.
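Exact-match scoring reduces to a per-question string comparison. A minimal sketch, assuming predictions and gold answers are plain strings and that normalisation is limited to case and surrounding whitespace (the project's exact rules may differ):

```python
# Minimal exact-match scorer over a 100-question sample; the normalisation
# (lower-casing, whitespace stripping) is an assumption, not the project's rule.
def exact_match(pred: str, gold: str) -> bool:
    return pred.strip().lower() == gold.strip().lower()

def accuracy(preds: list[str], golds: list[str]) -> float:
    assert len(preds) == len(golds), "prediction/gold length mismatch"
    return sum(exact_match(p, g) for p, g in zip(preds, golds)) / len(golds)
```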
ARM ordering replicates on M4 · reverses on Metal GPU · x86 intermediate
Llama x86: cliff sweep data available (n=5 trials per context) · Qwen x86 †: ctx=256 reference only; no multi-context sweep was collected, so the value stays constant across context settings. Both models benchmarked on an Intel i5-1235U with 6 threads.
big.LITTLE architecture sweet spot: 4 threads (2× P-cores + 2× E-cores); 8 threads regresses throughput due to E-core saturation.
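The thread-scaling finding comes from varying only the thread count while holding the workload fixed. A sketch of such a sweep using llama-bench's -t flag follows; the model path and the set of thread counts are illustrative.

```python
# Sketch of a thread-count sweep; the model path and the thread counts tried
# are assumptions, not the project's exact configuration.
import subprocess

for threads in (1, 2, 4, 8):
    subprocess.run([
        "./llama-bench",
        "-m", "model-Q4_K_M.gguf",
        "-p", "0", "-n", "128",
        "-t", str(threads),   # 4 ≈ 2 P-cores + 2 E-cores on the Pixel 6a
        "-o", "csv",
    ], check=True)
```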
Q4_K_M achieves near-Q8_0 perplexity — quality floor at 3 bits
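For reference, perplexity here is the usual exponential of the mean negative log-likelihood over the evaluation tokens; a one-function sketch (the log-probabilities would come from the evaluation harness):

```python
# Perplexity = exp(-(1/N) * sum(log p(token_i))); token_logprobs are natural logs.
import math

def perplexity(token_logprobs: list[float]) -> float:
    return math.exp(-sum(token_logprobs) / len(token_logprobs))
```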
Filter and browse all 4,000+ inference records
| Device | Variant | Model | Context | Decode TPS | Prefill TPS | Experiment | Threads |
|---|---|---|---|---|---|---|---|
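Offline, the same records can be filtered with pandas, assuming they are exported to a CSV with the columns shown above (the file name and filter values are illustrative):

```python
# Hedged sketch of filtering the raw records offline; "inference_records.csv"
# and the column names are assumptions based on the table header above.
import pandas as pd

df = pd.read_csv("inference_records.csv")
subset = (
    df[(df["Device"] == "Pixel 6a")
       & (df["Variant"].isin(["Q2_K", "Q8_0"]))
       & (df["Context"] <= 512)]
    .sort_values("Decode TPS", ascending=False)
)
print(subset[["Variant", "Context", "Decode TPS", "Threads"]].head())
```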