4,000+ controlled inference runs across 7 GGUF K-quant variants on Pixel 6a, M4 Mac, and x86, revealing non-monotonic throughput, KV-cache collapse thresholds, and cross-device generalisation of quantization behaviour.
Non-monotonic speed ordering — Q2_K is fastest on ARM despite lowest bit-width
Pixel 6a @ ctx=256 — cliff_sweep (Q2_K, Q3_K_M, Q4_K_S, Q6_K, Q8_0) · standard_sweep (Q4_K_M, Q5_K_M — thermal burst artifact excluded) · M4 Mac GPU (Metal) @ ctx=1024 canonical cliff sweep (n=5) · M4 Mac CPU — TPS sweep (n_prompt=0, n_gen=128, n=10, thermally settled) · x86 mean of 5 trials @ ctx=256
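The decode-only TPS sweep above (n_prompt=0, n_gen=128, n=10) maps naturally onto llama.cpp's llama-bench flags. Below is a minimal sketch of one such sweep, assuming local GGUF files and a hypothetical `model-<variant>.gguf` naming scheme; it is illustrative, not the project's actual harness.

```python
# Minimal sketch of a decode-only TPS sweep, assuming llama.cpp's llama-bench
# binary at ./llama-bench and a hypothetical "model-<variant>.gguf" naming scheme.
import json
import subprocess

VARIANTS = ["Q2_K", "Q3_K_M", "Q4_K_S", "Q4_K_M", "Q5_K_M", "Q6_K", "Q8_0"]

def run_decode_sweep(model_dir: str, threads: int = 4, reps: int = 10) -> dict:
    """Decode-only throughput (n_prompt=0, n_gen=128) for each quant variant."""
    results = {}
    for variant in VARIANTS:
        cmd = [
            "./llama-bench",
            "-m", f"{model_dir}/model-{variant}.gguf",
            "-p", "0",           # no prefill: isolate decode throughput
            "-n", "128",         # generate 128 tokens per repetition
            "-t", str(threads),
            "-r", str(reps),     # repetitions, reported as mean ± stddev
            "-o", "json",
        ]
        out = subprocess.run(cmd, capture_output=True, text=True, check=True)
        results[variant] = json.loads(out.stdout)
    return results
```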
ARM onset ctx=512 (Q2_K −48%, Q5_K_M −46%) · x86 onset ctx≈1300–1400 · Metal: flat
Shaded band marks the per-device cliff onset: ARM ctx=512 (Q2_K, Q5_K_M); x86 ctx=1300–1400. Metal: no band (no cliff observed). KV-cache quant overlay available for Q3_K_M and Q6_K.
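As a rough illustration of how the shaded onset could be derived, the sketch below flags the first context length whose decode TPS drops by more than a chosen fraction relative to the previous measurement. The threshold and the example numbers are illustrative, not the project's detection rule.

```python
# Hedged sketch: flag the cliff onset as the first context step whose decode
# TPS falls by at least `drop_threshold` relative to the previous context.
def cliff_onset(tps_by_ctx: dict[int, float], drop_threshold: float = 0.30):
    ctxs = sorted(tps_by_ctx)
    for prev, cur in zip(ctxs, ctxs[1:]):
        drop = 1.0 - tps_by_ctx[cur] / tps_by_ctx[prev]
        if drop >= drop_threshold:
            return cur
    return None

# Made-up numbers shaped like the ARM Q2_K result (-48% at ctx=512):
print(cliff_onset({128: 10.2, 256: 9.8, 512: 5.1, 1024: 4.9}))  # -> 512
```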
Accuracy across 6 NLP benchmarks — Q4_K_M beats Q6_K on BoolQ
100-question samples from official benchmark test sets. Exact-match scoring. imatrix = importance-weighted quantization calibration.
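Exact-match scoring reduces to a per-question string comparison. A minimal sketch, assuming predictions and gold answers are plain strings and that normalisation is limited to case and surrounding whitespace (the project's exact rules may differ):

```python
# Minimal exact-match scorer over a 100-question sample; the normalisation
# (lower-casing, whitespace stripping) is an assumption, not the project's rule.
def exact_match(pred: str, gold: str) -> bool:
    return pred.strip().lower() == gold.strip().lower()

def accuracy(preds: list[str], golds: list[str]) -> float:
    assert len(preds) == len(golds), "prediction/gold length mismatch"
    return sum(exact_match(p, g) for p, g in zip(preds, golds)) / len(golds)
```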
ARM ordering replicates on M4 · reverses on Metal GPU · x86 intermediate
Llama x86: cliff sweep data available (n=5 trials per context) · Qwen x86 †: ctx=256 reference only; no multi-context sweep was collected, so the value stays constant across context settings. Both models benchmarked on an Intel i5-1235U with 6 threads.
big.LITTLE architecture sweet spot: 4 threads (2× P-cores + 2× E-cores); 8 threads regresses throughput due to E-core saturation.
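The thread-scaling finding comes from varying only the thread count while holding the workload fixed. A sketch of such a sweep using llama-bench's -t flag follows; the model path and the set of thread counts are illustrative.

```python
# Sketch of a thread-count sweep; the model path and the thread counts tried
# are assumptions, not the project's exact configuration.
import subprocess

for threads in (1, 2, 4, 8):
    subprocess.run([
        "./llama-bench",
        "-m", "model-Q4_K_M.gguf",
        "-p", "0", "-n", "128",
        "-t", str(threads),   # 4 ≈ 2 P-cores + 2 E-cores on the Pixel 6a
        "-o", "csv",
    ], check=True)
```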
Q4_K_M achieves near-Q8_0 perplexity — quality floor at 3 bits
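For reference, perplexity here is the usual exponential of the mean negative log-likelihood over the evaluation tokens; a one-function sketch (the log-probabilities would come from the evaluation harness):

```python
# Perplexity = exp(-(1/N) * sum(log p(token_i))); token_logprobs are natural logs.
import math

def perplexity(token_logprobs: list[float]) -> float:
    return math.exp(-sum(token_logprobs) / len(token_logprobs))
```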
Filter and browse all 4,000+ inference records
| Device | Variant | Model | Context | Decode TPS | Prefill TPS | Experiment | Threads |
|---|---|---|---|---|---|---|---|
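Offline, the same records can be filtered with pandas, assuming they are exported to a CSV with the columns shown above (the file name and filter values are illustrative):

```python
# Hedged sketch of filtering the raw records offline; "inference_records.csv"
# and the column names are assumptions based on the table header above.
import pandas as pd

df = pd.read_csv("inference_records.csv")
subset = (
    df[(df["Device"] == "Pixel 6a")
       & (df["Variant"].isin(["Q2_K", "Q8_0"]))
       & (df["Context"] <= 512)]
    .sort_values("Decode TPS", ascending=False)
)
print(subset[["Variant", "Context", "Decode TPS", "Threads"]].head())
```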