Qwen3.5 GGUF Benchmarks

See how Unsloth Dynamic GGUFs perform, with analysis of perplexity, KL Divergence and MXFP4.

We updated our Qwen3.5-35B Unsloth Dynamic quants, which are now SOTA at nearly all bit widths. We ran over 150 KL Divergence benchmarks, totaling 9TB of GGUFs, and uploaded all research artifacts. We also fixed a tool-calling chat template bug (which affects all quant uploaders).

  • Qwen3.5-35B-A3B GGUFs are updated with the new fixes (112B and 27B are still converting; re-download once they are updated)

  • We tested bartowski, Ubergarm, AesSedai, noctrex and our new Dynamic GGUFs

  • 99.9% KL Divergence shows UD-Q4_K_XL, IQ3_XXS & more are SOTA on the Pareto Frontier.

  • We are retiring MXFP4 from all GGUF quants (Q2_K_XL, Q3_K_XL and Q4_K_XL), except for the pure MXFP4_MOE quant.

Perplexity benchmarks - lower is better
  • Imatrix definitely helps reduce KLD & PPL; I-quants, however, come at the cost of 5-10% slower inference.

  • Quantizing ssm_out (Mamba layers) is not a good idea, and ffn_down_exps is also sensitive.

1) Some tensors are very sensitive to quantization

  • We made over 9TB of research artifacts available for the community to investigate further on our Experiments page. It includes KLD metrics and all 121 configs we tested.

  • We varied bit widths across each tensor type and generated best- and worst-case Pareto Frontier plots (below) of size vs 99.9% KLD.

  • For the best items to quantize: ffn_up_exps and ffn_gate_exps are generally OK to quantize to 3-bit; ffn_down_exps is slightly more sensitive.

  • For the worst items, ssm_out dramatically increases KLD while the disk space savings are minuscule - for example, ssm_out at q2_k does dramatically worse. Quantizing any attn_* tensor is especially sensitive for hybrid architectures, so leaving them in higher precision works well.

Tensor type vs bits on 99.9% KL Divergence

  • We plot all quant levels vs 99.9% KLD, sorted from worst KLD to best. Quantizing ffn_* layers too heavily is not a good idea.

  • However, some bit widths are good, especially 3-bit - for example, leaving ffn_* (down, up, gate) at around iq3_xxs seems to be the best compromise between disk space and 99.9% KLD change. 2-bit causes much more degradation.

MXFP4 is much worse on many tensors - using MXFP4 for attn_gate, attn_q, ssm_beta or ssm_alpha is not a good idea; Q4_K is better. Note that MXFP4 uses 4.25 bits per weight whilst Q4_K uses 4.5 bits per weight, but it's still better to use Q4_K than MXFP4 when choosing between them.

As you can see, MXFP4 is unusually high
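The 4.25 vs 4.5 bits-per-weight figures above follow directly from the block layouts (MXFP4 per the OCP MX spec; Q4_K per ggml's published block size) - a quick sanity check:

```python
# Bits-per-weight arithmetic behind the MXFP4 vs Q4_K comparison above.
# Block layouts follow the OCP MX spec and ggml's block_q4_K; these are the
# standard published block sizes, not values measured from a file.

def bits_per_weight(block_weights: int, block_bits: int) -> float:
    """Total bits stored per block divided by weights per block."""
    return block_bits / block_weights

# MXFP4: 32 FP4 (E2M1) values sharing one 8-bit (E8M0) scale per block.
mxfp4 = bits_per_weight(32, 32 * 4 + 8)

# Q4_K: super-block of 256 weights stored in 144 bytes:
#   128 bytes of 4-bit quants + 12 bytes of 6-bit sub-block scales/mins
#   + 2 fp16 super-block scale factors.
q4_k = bits_per_weight(256, 144 * 8)

print(mxfp4, q4_k)  # 4.25 4.5
```

So Q4_K spends only 0.25 extra bits per weight, mostly on finer-grained sub-block scaling, which is where it gains accuracy over MXFP4.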

2) Imatrix works remarkably well

  • Imatrix definitely helps weight the quantization process in the right way. For example, ssm_out at 2-bit was previously really bad, but imatrix reduces its 99.9% KLD by a lot.

  • Imatrix generally helps on lower bits, and works on all quants and bit widths.
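To see why importance weighting helps, here is a toy sketch (hypothetical illustration, not llama.cpp's actual imatrix code): choose the quantization scale that minimizes importance-weighted squared error instead of plain squared error, so rarely-activated outliers stop dominating the fit.

```python
# Toy importance-weighted quantization (hypothetical sketch, NOT llama.cpp's
# imatrix implementation). We quantize a vector to a signed 2-bit grid
# {-2,-1,0,1} * scale and pick the scale minimizing a (weighted) squared error.

def quantize(x, scale):
    grid = [-2, -1, 0, 1]
    return [scale * min(grid, key=lambda g: abs(v - scale * g)) for v in x]

def err(x, q, w):
    return sum(wi * (xi - qi) ** 2 for xi, qi, wi in zip(x, q, w))

def best_scale(x, w, candidates):
    return min(candidates, key=lambda s: err(x, quantize(x, s), w))

x = [0.1, -0.4, 2.0, 0.05]        # weights with one large outlier
imp = [10.0, 10.0, 0.1, 10.0]     # "importance": activations rarely hit the outlier
ones = [1.0] * len(x)
cands = [i / 100 for i in range(1, 201)]

s_plain = best_scale(x, ones, cands)  # plain error stretches the grid for the outlier
s_imp = best_scale(x, imp, cands)     # imatrix-style weighting favors the small values

# The importance-weighted error can only improve with the importance-chosen scale.
assert err(x, quantize(x, s_imp), imp) <= err(x, quantize(x, s_plain), imp)
```

This mirrors why imatrix matters most at low bit widths: with only four grid points, where the scale lands decides which values survive quantization intact.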

I-quants (iq3_xxs, iq2_s, etc.) make inference 5-10% slower. They're definitely better in terms of efficiency per bit, but there is a tradeoff.

| Type | pp512 (≈) | tg128 (≈) |
| --- | --- | --- |
| mxfp4 | 1978.69 | 90.67 |
| q4_k | 1976.44 | 90.38 |
| q3_k | 1972.61 | 91.36 |
| q6_k | 1964.55 | 90.50 |
| q2_k | 1964.20 | 90.77 |
| q8_0 | 1964.17 | 90.33 |
| q5_k | 1947.74 | 90.72 |
| iq3_xxs | 2030.94 | 85.68 |
| iq2_xxs | 1997.64 | 85.79 |
| iq3_s | 1990.12 | 84.37 |
| iq2_xs | 1967.85 | 85.19 |
| iq2_s | 1952.50 | 85.04 |
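The 5-10% figure can be checked against the decode (tg128, tokens/sec) numbers above by comparing the mean of the I-quant rows to the mean of the non-I-quant rows:

```python
# Rough check of the "I-quants are 5-10% slower" claim using the tg128
# (decode tokens/sec) values from the table above.

k_quants = [90.67, 90.38, 91.36, 90.50, 90.77, 90.33, 90.72]  # mxfp4 .. q5_k
i_quants = [85.68, 85.79, 84.37, 85.19, 85.04]                # iq3_xxs .. iq2_s

mean = lambda xs: sum(xs) / len(xs)
slowdown = 1 - mean(i_quants) / mean(k_quants)
print(f"{slowdown:.1%}")  # ~6% slower decode
```

Note the slowdown shows up in decode speed (tg128); prompt processing (pp512) is roughly unchanged, or even slightly faster, for the I-quants in this run.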

3) Perplexity & KLD can be misleading

Perplexity and KLD can be misleading as they're highly influenced by calibration. Most GGUFs are evaluated on the Wiki test set with a 512-token context window, so results shift a lot if the GGUF's imatrix calibration set also includes Wikipedia-like, 512-token samples (as most GGUFs' do). That's why our GGUFs sometimes show higher perplexity: our imatrix data instead uses long-context chat and tool-calling examples.
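For reference, the two metrics can be sketched as follows (a minimal illustration with made-up token probabilities, not llama.cpp's implementation):

```python
import math

# Minimal sketch of the two metrics (made-up numbers, not llama.cpp's code).

def perplexity(token_probs):
    """exp of the mean negative log-likelihood of the true tokens."""
    return math.exp(-sum(math.log(p) for p in token_probs) / len(token_probs))

def kl_divergence(p, q):
    """KL(p || q) between full-precision (p) and quantized (q) distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Probability the model assigned to each correct token in a text:
probs = [0.50, 0.25, 0.10, 0.40]
print(perplexity(probs))

# Per-token KLD between the fp16 and quantized output distributions; the
# "99.9% KLD" reported here is the 99.9th percentile of these per-token values.
p = [0.7, 0.2, 0.1]   # fp16 distribution over the vocab (toy, 3 tokens)
q = [0.6, 0.3, 0.1]   # quantized distribution
print(kl_divergence(p, q))
```

Both metrics depend on which texts supply the tokens, which is exactly why the calibration overlap described above can skew comparisons.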

Benjamin’s recent MiniMax‑M2.5 analysis shows a case where perplexity and KLD are very misleading. Unsloth Dynamic IQ2_XXS performs better than AesSedai’s IQ3_S on real-world evals (LiveCodeBench v6, MMLU Pro) despite being 11GB smaller. Yet AesSedai’s perplexity and KLD benchmarks suggest the opposite (PPL: 0.3552 vs 0.2441; KLD: 9.0338 vs 8.2849 - lower is better).

KL Divergence - AesSedai
Perplexity - AesSedai

This mismatch shows that lower perplexity or KLD doesn’t necessarily translate to better real-world performance. The graph also shows UD-Q4_K_XL outperforming other Q4 quants while being ~8GB smaller. This doesn’t mean perplexity or KLD are useless - they still provide a rough signal - so, going forward, we’ll publish perplexity and KLD for every quant so the community has some sort of reference.

Full Benchmarks

| Quantizer | Quant Level | Disk Space (GB) | PPL | KLD 99.9% | Mean KLD |
| --- | --- | --- | --- | --- | --- |
| AesSedai | IQ3_S | 12.65 | 6.9152 | 1.8669 | 0.0613 |
| AesSedai | IQ4_XS | 16.4 | 6.6447 | 0.8067 | 0.0235 |
| AesSedai | Q4_K_M | 20.62 | 6.5665 | 0.3171 | 0.0096 |
| AesSedai | Q5_K_M | 24.45 | 6.5356 | 0.21 | 0.0058 |
| Ubergarm | Q4_0 | 19.79 | 6.5784 | 0.4829 | 0.0142 |
| Unsloth | IQ2_XXS | 9.09 | 7.716 | 4.2221 | 0.1846 |
| Unsloth | Q2_K_XL | 12.04 | 7.0438 | 2.9092 | 0.097 |
| Unsloth | IQ3_XXS | 13.12 | 6.7829 | 1.5296 | 0.0501 |
| Unsloth | IQ3_S | 14.13 | 6.7715 | 1.4193 | 0.0457 |
| Unsloth | Q3_K_M | 15.54 | 6.732 | 0.9726 | 0.0324 |
| Unsloth | Q3_K_XL | 16.06 | 6.7245 | 0.9539 | 0.0308 |
| Unsloth | MXFP4_MOE | 18.17 | 6.6 | 0.7789 | 0.0272 |
| Unsloth | Q4_K_M | 18.49 | 6.6053 | 0.5478 | 0.0192 |
| Unsloth | Q4_K_L | 18.82 | 6.5905 | 0.4828 | 0.015 |
| Unsloth | Q4_K_XL | 19.17 | 6.5918 | 0.4097 | 0.0137 |
| Unsloth | Q5_K_XL | 23.22 | 6.5489 | 0.236 | 0.0069 |
| Unsloth | Q6_K_S | 26.56 | 6.5456 | 0.2226 | 0.0065 |
| Unsloth | Q6_K_XL | 28.22 | 6.5392 | 0.1437 | 0.0041 |
| Unsloth | Q8_K_XL | 36.04 | 6.5352 | 0.1033 | 0.0026 |
| bartowski | Qwen_IQ2_XXS | 8.15 | 9.3427 | 6.0607 | 0.3457 |
| bartowski | Qwen_Q2_K_L | 11.98 | 7.5504 | 3.8095 | 0.1559 |
| bartowski | Qwen_IQ3_XXS | 12.94 | 7.0938 | 2.1563 | 0.0851 |
| bartowski | Qwen_Q3_K_M | 14.95 | 6.772 | 1.7779 | 0.0585 |
| bartowski | Qwen_Q3_K_XL | 15.97 | 6.8245 | 1.7516 | 0.0627 |
| bartowski | Qwen_IQ4_XS | 17.42 | 6.6234 | 0.7265 | 0.0234 |
| bartowski | Qwen_Q4_K_M | 19.77 | 6.6097 | 0.5771 | 0.0182 |
| bartowski | Qwen_Q5_K_M | 23.11 | 6.5828 | 0.3549 | 0.0106 |
| noctrex | MXFP4_MOE_BF16 | 20.55 | 6.5948 | 0.7939 | 0.0248 |
| noctrex | MXFP4_MOE_F16 | 20.55 | 6.5937 | 0.7614 | 0.0247 |
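The Pareto-frontier claim can be checked directly from the table: a quant sits on the frontier if no other quant is both smaller on disk and lower in 99.9% KLD. A minimal sketch over the Q4-class rows above:

```python
# Pareto-frontier check over (disk space GB, 99.9% KLD) pairs taken from the
# Q4-class rows of the benchmark table above.

quants = {
    ("Unsloth", "Q4_K_XL"): (19.17, 0.4097),
    ("Unsloth", "Q4_K_M"): (18.49, 0.5478),
    ("AesSedai", "Q4_K_M"): (20.62, 0.3171),
    ("bartowski", "Qwen_Q4_K_M"): (19.77, 0.5771),
    ("Ubergarm", "Q4_0"): (19.79, 0.4829),
}

def pareto_frontier(points):
    """Keep entries no other entry dominates (<= in both axes, < in one)."""
    return {
        name for name, (size, kld) in points.items()
        if not any(
            s <= size and k <= kld and (s < size or k < kld)
            for other, (s, k) in points.items() if other != name
        )
    }

frontier = pareto_frontier(quants)
print(frontier)
```

Here UD-Q4_K_XL stays on the frontier (bartowski's Q4_K_M and Ubergarm's Q4_0 are both larger with higher KLD), matching the claim at the top of this page.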
