Research Dashboard

GGUF Quantization on Edge Devices

4,000+ controlled inference runs across 7 GGUF K-quant variants on Pixel 6a, M4 Mac, and x86, revealing non-monotonic throughput, KV-cache collapse thresholds, and cross-device generalisation of quantization behaviour.

Q2_K ~99% faster than Q6_K on ARM
≥40% TPS collapse from ctx=512 on ARM (Q2_K, Q5_K_M)
Q4_K_M > Q6_K on BoolQ accuracy
4,062
Inference Records
7
GGUF Variants
3
Devices
6
Quality Benchmarks
01

Decode Throughput

Non-monotonic speed ordering — Q2_K is fastest on ARM despite lowest bit-width

Model
Device

Pixel 6a @ ctx=256 — cliff_sweep (Q2_K, Q3_K_M, Q4_K_S, Q6_K, Q8_0) · standard_sweep (Q4_K_M, Q5_K_M — thermal burst artifact excluded) · M4 Mac GPU (Metal) @ ctx=1024 canonical cliff sweep (n=5) · M4 Mac CPU — TPS sweep (n_prompt=0, n_gen=128, n=10, thermally settled) · x86 mean of 5 trials @ ctx=256

02

KV-Cache Collapse

ARM onset ctx=512 (Q2_K −48%, Q5_K_M −46%) · x86 onset ctx≈1300–1400 · Metal: flat

Device / Model
Variants

Shaded band marks the per-device cliff onset: ARM ctx=512 (Q2_K, Q5_K_M); x86 ctx=1300–1400. Metal: no band (no cliff observed). KV-cache quant overlay available for Q3_K_M and Q6_K.
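The cliff-onset bands above can be located mechanically: scan a context sweep for the first context length where decode TPS drops by at least 40% versus the previous step. A minimal sketch; the field shapes and the illustrative numbers are assumptions, not the dashboard's real measurements:

```python
# Locate the KV-cache "cliff": the first context length where decode TPS
# falls by >= `drop` relative to the previous context size in the sweep.
def cliff_onset(tps_by_ctx, drop=0.40):
    """tps_by_ctx: dict {context_length: decode_tps} from one device/variant sweep."""
    ctxs = sorted(tps_by_ctx)
    for prev, cur in zip(ctxs, ctxs[1:]):
        if tps_by_ctx[cur] <= (1 - drop) * tps_by_ctx[prev]:
            return cur
    return None  # no cliff observed (e.g. the Metal backend)

# Numbers shaped like the ARM Q2_K sweep (placeholders, not real data):
arm_q2k = {128: 21.0, 256: 20.5, 512: 10.7, 1024: 9.8}
print(cliff_onset(arm_q2k))  # → 512
```

Returning `None` for a flat sweep matches the Metal case, where no band is drawn.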

03

Quality Benchmarks

Accuracy across 6 NLP benchmarks — Q4_K_M beats Q6_K on BoolQ

Benchmark
Device
Calibration

100-question samples from official benchmark test sets. Exact-match scoring. imatrix = importance-weighted quantization calibration.
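Exact-match scoring as described above can be sketched in a few lines. The normalisation rules here (lowercase, strip whitespace and punctuation) are an assumption; the dashboard's actual scoring pipeline may differ:

```python
# Exact-match accuracy over a benchmark sample: a prediction scores 1 only if
# it equals the reference answer after normalisation.
import string

def normalise(text: str) -> str:
    # Assumed normalisation: lowercase, trim, drop ASCII punctuation.
    return text.lower().strip().translate(str.maketrans("", "", string.punctuation))

def exact_match_accuracy(predictions, references):
    hits = sum(normalise(p) == normalise(r) for p, r in zip(predictions, references))
    return hits / len(references)

# Hypothetical BoolQ-style answers (2 of 3 match):
print(exact_match_accuracy(["Yes", "no", "yes"], ["yes", "No.", "no"]))
```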

04

Cross-Device Comparison

ARM ordering replicates on M4 · reverses on Metal GPU · x86 intermediate

Model
Context Length
ctx=–
Slower
Faster
— = no data at this context

Llama x86: cliff sweep data available (n=5 trials per context) · Qwen x86: ctx=256 reference only — no multi-context sweep collected; value is constant across slider positions. Both models benchmarked on Intel i5-1235U, 6 threads.

Thread Count Impact
Q4_K_M · Pixel 6a · ctx=256

Big.LITTLE architecture sweet spot: 4 threads (2× P-cores + 2× E-cores). Throughput regresses at 8 threads due to E-core saturation.
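Reducing a thread sweep to its sweet spot is a one-liner. The TPS values below are placeholders shaped like the Pixel 6a sweep, not the dashboard's measurements:

```python
# Pick the thread count with the highest decode TPS from a thread sweep.
tps_by_threads = {1: 5.1, 2: 9.4, 4: 12.8, 6: 11.9, 8: 10.2}  # placeholder data
best = max(tps_by_threads, key=tps_by_threads.get)
print(best)  # → 4 (the big.LITTLE sweet spot in this sketch)
```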

05

Perplexity (WikiText-2)

Q4_K_M achieves near-Q8_0 perplexity — quality floor at 3 bits

✓ All 7 variants on full WikiText-2 corpus (~290K tokens). Q2_K & Q3_K_M measured on Pixel 6a · 4 threads; Q4_K_S–Q8_0 on x86 i5-1235U · 6 threads. Hover bars for details.
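Perplexity on a corpus like WikiText-2 is the exponentiated average negative log-probability the model assigns to each token. A worked sketch of that formula (toy numbers, not real measurements):

```python
# PPL = exp(-(1/N) * sum_i log p(token_i | context_i))
import math

def perplexity(log_probs):
    """log_probs: natural-log probabilities of each corpus token under the model."""
    return math.exp(-sum(log_probs) / len(log_probs))

# Toy example: uniform probability 1/4 on every token gives PPL = 4.0.
print(perplexity([math.log(0.25)] * 4))  # → 4.0
```

Lower is better; a quant variant whose PPL sits near Q8_0's is effectively at the quality ceiling for its size.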
06

Dataset Explorer

Filter and browse all 4,000+ inference records

Device
Variant
Model
Experiment
Search
Device Variant Model Context Decode TPS ↕ Prefill TPS Experiment Threads