Research Dashboard

GGUF Quantization on Edge Devices

3,253 controlled inference runs across 7 GGUF K-quant variants on Pixel 6a, M4 Mac, and x86, revealing non-monotonic throughput ordering, KV-cache collapse thresholds, and cross-device generalisation of quantization behaviour.

Q2_K ~99% faster than Q6_K on ARM
≥40% TPS collapse from ctx=512 on ARM (Q2_K, Q5_K_M)
Q4_K_M > Q6_K on BoolQ accuracy
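The "~99% faster" headline is a plain TPS ratio; a minimal sketch of that arithmetic (function name and the numbers in the usage note are illustrative, not measured values from the dataset):

```python
def pct_faster(tps_fast: float, tps_slow: float) -> float:
    """Relative decode-speed advantage, as in 'Q2_K ~99% faster than Q6_K'."""
    return (tps_fast / tps_slow - 1.0) * 100.0
```

For example, a variant decoding at 19.9 TPS against one at 10.0 TPS is ~99% faster.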
3,437
Total Records
7
GGUF Variants
3
Devices
6
Quality Benchmarks
01

Decode Throughput

Non-monotonic speed ordering — Q2_K is fastest on ARM despite lowest bit-width

Model
Device

Pixel 6a @ ctx=256 — cliff_sweep (Q2_K, Q3_K_M, Q4_K_S, Q6_K, Q8_0) · standard_sweep (Q4_K_M, Q5_K_M — thermal burst artifact excluded) · M4 Mac GPU (Metal) — Llama ctx=1024 cliff baseline and Qwen tg128 TPS sweep · M4 Mac CPU — clean tg128 TPS sweep (2026-04-15, n=10) · x86 mean of 5 trials @ ctx=256

02

KV-Cache Collapse

ARM onset ctx=512 (Q2_K −48%, Q5_K_M −46%) · x86 onset ctx≈1300–1400 · Metal: flat

Device / Model
Variants

Shaded band marks the per-device cliff onset: ARM ctx=512 (Q2_K, Q5_K_M); x86 Llama ctx=1300–1400. Metal: no band; M4 Qwen shows a Q2_K decline but no validated cache-collapse threshold. KV-cache quant overlay available for Q3_K_M and Q6_K.

03

Quality Benchmarks

Accuracy across 6 NLP benchmarks — Q4_K_M beats Q6_K on BoolQ

Benchmark
Device
Calibration

100-question samples from official benchmark test sets. Exact-match scoring. imatrix = importance-weighted quantization calibration.
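Exact-match scoring over a 100-question sample reduces to normalised string equality; a minimal sketch (the whitespace/case normalisation shown is an assumption about the scoring details, not the published scorer):

```python
def exact_match(pred: str, gold: str) -> bool:
    """Exact-match after trimming whitespace and lowercasing (assumed normalisation)."""
    return pred.strip().lower() == gold.strip().lower()

def accuracy(preds: list[str], golds: list[str]) -> float:
    """Fraction of predictions that exactly match their gold answers."""
    return sum(exact_match(p, g) for p, g in zip(preds, golds)) / len(golds)
```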

04

Cross-Device Comparison

ARM ordering replicates on M4 · reverses on Metal GPU · x86 intermediate

Model
Context Length
ctx=–
Slower
Faster

Llama x86: cliff-sweep data available (n=5 trials per context) · Qwen x86 cliff attempts are excluded from the public release and hidden from this heatmap rather than shown as empty cells · M4 Qwen appears only at contexts measured in the validated Metal cliff run.

Thread Count Impact — Q4_K_M · Pixel 6a · ctx=256

big.LITTLE sweet spot: 4 threads (2 P-cores + 2 E-cores). 8 threads regresses throughput due to E-core saturation.
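A thread sweep like this is easy to script against llama.cpp's llama-bench tool; a sketch that only builds the command lines (the model path is a placeholder, and the flags should be checked against your llama.cpp build):

```python
def bench_cmd(threads: int, model: str = "model-Q4_K_M.gguf") -> list[str]:
    """Build one llama-bench invocation: 256-token prefill, 128-token
    decode (tg128-style), at the given thread count."""
    return ["llama-bench", "-m", model,
            "-t", str(threads), "-p", "256", "-n", "128"]

# Candidate counts for a 2P + 2E (+4 little-core) phone SoC.
sweep = [bench_cmd(t) for t in (1, 2, 4, 8)]
```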

05

Perplexity (WikiText-2)

Q4_K_M achieves near-Q8_0 perplexity — quality floor at 3 bits

✓ All 7 variants measured on the Pixel 6a over the full WikiText-2 corpus (~285K tokens, 568 chunks). x86 full-corpus PPL is retained in the dataset as a supplementary cross-device reference. Hover bars for details.
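Chunked corpus perplexity of this kind reduces to the exponential of a token-weighted mean negative log-likelihood; a minimal sketch (function name and chunking interface are illustrative, not the measurement harness):

```python
import math

def corpus_ppl(chunk_nlls: list[float], chunk_tokens: list[int]) -> float:
    """Perplexity over a chunked corpus: exp of the token-weighted mean
    per-token negative log-likelihood across all chunks."""
    total_nll = sum(nll * n for nll, n in zip(chunk_nlls, chunk_tokens))
    return math.exp(total_nll / sum(chunk_tokens))
```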
06

Dataset Explorer

Filter and browse all 3,253 published inference rows

Device
Variant
Model
Experiment
Search
Device Variant Model Context Decode TPS ↕ Prefill TPS Experiment Threads
Loading…
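The explorer's dropdown filters amount to conjunctive field matches plus a sort on decode TPS; a sketch over plain dicts (the key names are assumptions about the released schema, not its documented columns):

```python
def filter_runs(rows: list[dict], device=None, variant=None, model=None) -> list[dict]:
    """Keep rows matching every supplied filter, sorted by decode TPS
    descending, mirroring the explorer's default ordering."""
    def keep(r: dict) -> bool:
        return ((device is None or r["device"] == device) and
                (variant is None or r["variant"] == variant) and
                (model is None or r["model"] == model))
    return sorted((r for r in rows if keep(r)),
                  key=lambda r: r["decode_tps"], reverse=True)
```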