3,253 controlled inference runs across 7 GGUF K-quant variants on Pixel 6a, M4 Mac, and x86, revealing non-monotonic throughput, KV-cache collapse thresholds, and cross-device generalisation of quantization behaviour.
Non-monotonic speed ordering — Q2_K is fastest on ARM despite lowest bit-width
Pixel 6a @ ctx=256 — cliff_sweep (Q2_K, Q3_K_M, Q4_K_S, Q6_K, Q8_0) · standard_sweep (Q4_K_M, Q5_K_M — thermal burst artifact excluded) · M4 Mac GPU (Metal) — Llama ctx=1024 cliff baseline and Qwen tg128 TPS sweep · M4 Mac CPU — clean tg128 TPS sweep (2026-04-15, n=10) · x86 mean of 5 trials @ ctx=256
ARM onset ctx=512 (Q2_K −48%, Q5_K_M −46%) · x86 onset ctx≈1300–1400 · Metal: flat
Shaded band marks the per-device cliff onset: ARM ctx=512 (Q2_K, Q5_K_M); x86 Llama ctx=1300–1400. Metal: no band; M4 Qwen shows a Q2_K decline but no validated cache-collapse threshold. KV-cache quant overlay available for Q3_K_M and Q6_K.
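The per-device cliff onset above can be located mechanically from a decode-TPS-vs-context sweep. A minimal sketch: the data points and the 40% drop threshold are illustrative assumptions (chosen to match drops like ARM's −48% at ctx=512), not the published runs.

```python
# Sketch: detect a KV-cache "cliff" onset from a decode-TPS-vs-context sweep.
# The sweeps below are illustrative, not from the published dataset; the 40%
# drop threshold is an assumption tuned to flag collapses like ARM's -48%.

def cliff_onset(sweep, drop=0.40):
    """Return the first context length where decode TPS falls by more than
    `drop` relative to the previous context point, or None if the curve is flat."""
    points = sorted(sweep.items())  # (ctx, tps) pairs in ascending ctx order
    for (_, prev_tps), (ctx, tps) in zip(points, points[1:]):
        if prev_tps > 0 and (prev_tps - tps) / prev_tps > drop:
            return ctx
    return None

# Illustrative ARM-like sweep: TPS roughly halves between ctx=256 and ctx=512.
arm_q2k = {128: 11.8, 256: 11.5, 512: 6.0, 1024: 5.7}
print(cliff_onset(arm_q2k))  # -> 512

# A Metal-like flat sweep yields no onset.
metal = {128: 60.1, 256: 59.8, 512: 59.5, 1024: 58.9}
print(cliff_onset(metal))  # -> None
```

The per-point relative drop (rather than a drop from the global peak) keeps the detector robust to slow thermal drift across the sweep.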
Accuracy across 6 NLP benchmarks — Q4_K_M beats Q6_K on BoolQ
100-question samples from official benchmark test sets. Exact-match scoring. imatrix = importance-weighted quantization calibration.
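Exact-match scoring as used above reduces to a string comparison per question. A minimal sketch; the normalisation rule (strip + lowercase) is an assumption, not the published harness:

```python
# Sketch of exact-match scoring: a prediction scores only if it equals the
# gold answer after trivial normalisation. Strip + lowercase is an assumed
# normalisation, not necessarily what the benchmark harness applies.

def exact_match_accuracy(preds, golds):
    norm = lambda s: s.strip().lower()
    hits = sum(norm(p) == norm(g) for p, g in zip(preds, golds))
    return hits / len(golds)

preds = ["True", "false ", "yes"]
golds = ["true", "false", "no"]
print(exact_match_accuracy(preds, golds))  # 2 of 3 match -> 0.666...
```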
ARM ordering replicates on M4 · reverses on Metal GPU · x86 intermediate
Llama x86: cliff-sweep data available (n=5 trials per context) · Qwen x86 cliff attempts are excluded from the public release and hidden from this heatmap rather than shown as empty cells · M4 Qwen appears only at the contexts measured in the validated Metal cliff run.
big.LITTLE architecture sweet spot: 4 threads (2× P-cores + 2× E-cores). Using 8 threads regresses throughput due to E-core saturation.
Q4_K_M achieves near-Q8_0 perplexity — quality floor at 3 bits
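Perplexity, the quality metric behind this comparison, is the exponential of the mean negative log-likelihood over evaluated tokens; lower is better. A minimal sketch with made-up token probabilities, not model output:

```python
# Perplexity = exp(mean negative log-likelihood) over evaluated tokens.
# The probabilities below are invented for illustration only.
import math

def perplexity(token_probs):
    nll = [-math.log(p) for p in token_probs]  # per-token negative log-likelihood
    return math.exp(sum(nll) / len(nll))

# A model assigning uniform 0.25 probability to every token scores ~4.0.
print(perplexity([0.25, 0.25, 0.25, 0.25]))  # ~ 4.0
```

"Near-Q8_0 perplexity" for Q4_K_M means this quantity barely rises relative to the 8-bit baseline despite halving the bits per weight.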
Filter and browse all 3,253 published inference rows
| Device | Variant | Model | Context | Decode TPS ↕ | Prefill TPS | Experiment | Threads |
|---|---|---|---|---|---|---|---|