Qwen3.5 GGUF Benchmarks

See how Unsloth Dynamic GGUFs perform + analysis of perplexity, KL divergence & MXFP4.

We updated all Qwen3.5 Unsloth Dynamic quants, which are now SOTA at nearly all bit widths. We ran over 150 KL Divergence benchmarks, totaling 9TB of GGUFs, and uploaded all research artifacts.

We also fixed a tool calling chat template issue. This affects all quant uploaders and all quant types, regardless of where you run the model or where you downloaded it from.


Want to see how to run the model + hardware requirements? Read our inference guide.

99.9% KL Divergence shows Unsloth Dynamic Q4_K_XL, IQ3_XXS, etc. are SOTA on the Pareto frontier:

Qwen3.5-122B-A10B Benchmarks
Qwen3.5-35B-A3B Benchmarks
  • Imatrix definitely helps reduce KLD & PPL; i-quants, however, make inference 5-10% slower.

  • We tested our GGUFs against those from many other quant providers.

  • Quantizing ssm_out (Mamba layers) to low bits is not a good idea, and neither is ffn_down_exps.

  • We are retiring MXFP4 from all our GGUF quants (Q2_K_XL, Q3_K_XL and Q4_K_XL), except for the pure MXFP4_MOE quant.

New Qwen3.5-9B GGUF Benchmarks conducted by Benjamin Marie

1) Some tensors are very sensitive to quantization

  • We made over 9TB of research artifacts available for the community to investigate further on our Experiments page. It includes KLD metrics and all 121 configs we tested.

  • We varied bit widths across each tensor type, and generated a best and worst Pareto Frontier plot below vs 99.9% KLD.

  • Among the tensors that quantize well, ffn_up_exps and ffn_gate_exps are generally OK down to 3-bit; ffn_down_exps is slightly more sensitive.

  • Among the worst, ssm_out dramatically increases KLD while the disk space savings are minuscule; for example, ssm_out at q2_k does dramatically worse. attn_* tensors are also especially sensitive to quantization in hybrid architectures, so leaving them in higher precision works well.
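The per-tensor findings above can be summarized as a precision plan. The sketch below is illustrative only (the pattern strings and quant choices are our assumptions for this example, not Unsloth's actual recipe): sensitive tensors stay at higher precision, while the bulk MoE FFN tensors go low.

```python
from fnmatch import fnmatch

# Illustrative per-tensor precision plan (hypothetical, not the actual recipe):
# patterns are checked in order; first match wins.
PRECISION_PLAN = [
    ("*.ssm_out.*",       "q8_0"),    # Mamba output: very sensitive, tiny savings
    ("*.attn_*",          "q6_k"),    # attention: sensitive in hybrid architectures
    ("*.ffn_down_exps.*", "q4_k"),    # slightly more sensitive than up/gate
    ("*.ffn_up_exps.*",   "iq3_xxs"), # generally fine at ~3-bit
    ("*.ffn_gate_exps.*", "iq3_xxs"),
]

def pick_quant(tensor_name: str, default: str = "q4_k") -> str:
    """Return the quant type for a tensor, falling back to a default."""
    for pattern, quant in PRECISION_PLAN:
        if fnmatch(tensor_name, pattern):
            return quant
    return default

print(pick_quant("blk.12.ssm_out.weight"))       # -> q8_0
print(pick_quant("blk.12.ffn_gate_exps.weight")) # -> iq3_xxs
```

llama.cpp's quantizer supports per-tensor overrides along these lines, so a plan like this maps directly onto how mixed-precision GGUFs are actually built.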

Tensor type vs bits on 99.9% KL Divergence

  • We plot all quant levels vs 99.9% KLD, sorted from worst KLD to best. Quantizing ffn_* layers too aggressively is not a good idea.

  • However, some bit widths hold up well, especially 3-bit: leaving ffn_* (down, up, gate) at around iq3_xxs seems to be the best compromise between disk space and 99.9% KLD. 2-bit causes much more degradation.

MXFP4 is much worse on many tensors: using it for attn_gate, attn_q, ssm_beta or ssm_alpha is not a good idea, and Q4_K does better. Note that MXFP4 uses 4.25 bits per weight, whilst Q4_K uses 4.5 bits per weight; even so, Q4_K is the better choice when deciding between the two.
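The bits-per-weight figures follow directly from the block layouts. A quick back-of-envelope check (block sizes taken from the MXFP4 format and llama.cpp's Q4_K structure):

```python
# MXFP4: blocks of 32 FP4 (E2M1) values sharing one 8-bit (E8M0) scale.
mxfp4_bpw = (32 * 4 + 8) / 32
print(mxfp4_bpw)  # -> 4.25

# Q4_K (llama.cpp): super-blocks of 256 weights = 8 sub-blocks of 32,
# with 4-bit quants, a 6-bit scale and 6-bit min per sub-block,
# plus one fp16 scale and one fp16 min per super-block.
q4_k_bits = 256 * 4 + 8 * (6 + 6) + 2 * 16
q4_k_bpw = q4_k_bits / 256
print(q4_k_bpw)  # -> 4.5
```

So Q4_K spends only 0.25 extra bits per weight, mostly on finer-grained scales and mins, which is where its quality advantage over MXFP4 comes from.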

As you can see, MXFP4's KLD is unusually high.

2) Imatrix works very well

  • Imatrix definitely helps steer the quantization process in the right direction. For example, ssm_out at 2-bit was previously really bad, but imatrix reduces its 99.9% KLD by a lot.

  • Imatrix helps most at lower bits, but works across all quants and bit widths.
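The intuition behind imatrix can be shown with a toy example. This is a minimal sketch under our own assumptions (a candidate-scale grid search, not llama.cpp's actual algorithm): instead of minimizing plain squared rounding error, the quantizer minimizes error weighted by per-weight importance derived from calibration activations.

```python
# Pick the quantization scale s minimizing sum_i w_i * (x_i - s*round(x_i/s))^2,
# where w_i is an importance weight (illustrative stand-in for the imatrix).

def quant_error(xs, ws, scale, qmax=7):
    """Importance-weighted squared error of snapping xs to a 4-bit-style grid."""
    err = 0.0
    for x, w in zip(xs, ws):
        q = max(-qmax - 1, min(qmax, round(x / scale)))  # clamp to [-8, 7]
        err += w * (x - scale * q) ** 2
    return err

def best_scale(xs, ws, candidates):
    return min(candidates, key=lambda s: quant_error(xs, ws, s))

xs = [1.0, 0.12, -0.11, 0.13]    # one outlier weight, three small ones
ws_flat = [1.0, 1.0, 1.0, 1.0]   # no imatrix: every weight counts equally
ws_im = [0.1, 5.0, 5.0, 5.0]     # imatrix says the small weights matter most
cands = [0.1, 0.12, 0.15, 0.2]
print(best_scale(xs, ws_flat, cands))  # -> 0.15 (fits the outlier)
print(best_scale(xs, ws_im, cands))    # -> 0.12 (fits the important weights)
```

Without weighting, the scale is pulled toward the outlier; with importance weights, the quantizer spends its precision on the weights that actually affect activations, which is why imatrix helps most at low bit widths.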

I-quants (iq3_xxs, iq2_s, etc.) make inference 5-10% slower. They are definitely better in terms of quality per bit, but there is a tradeoff.

| Type | pp512 (≈) | tg128 (≈) |
| --- | --- | --- |
| mxfp4 | 1978.69 | 90.67 |
| q4_k | 1976.44 | 90.38 |
| q3_k | 1972.61 | 91.36 |
| q6_k | 1964.55 | 90.50 |
| q2_k | 1964.20 | 90.77 |
| q8_0 | 1964.17 | 90.33 |
| q5_k | 1947.74 | 90.72 |
| iq3_xxs | 2030.94 | 85.68 |
| iq2_xxs | 1997.64 | 85.79 |
| iq3_s | 1990.12 | 84.37 |
| iq2_xs | 1967.85 | 85.19 |
| iq2_s | 1952.50 | 85.04 |

3) Perplexity & KLD can be misleading

Perplexity and KLD can be misleading because they are highly influenced by calibration. Most GGUFs are evaluated on the Wikitext test set with 512-token context windows, so results shift a lot if the GGUF's imatrix calibration set includes Wikipedia-like, 512-token samples (as most GGUFs' do). That's why our GGUFs sometimes show higher perplexity: our imatrix data instead uses long-context chat and tool-calling examples.
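For reference, both metrics are simple functions of per-token probabilities, and the "99.9% KLD" used throughout is the 99.9th percentile of per-token KLDs. A minimal sketch with hypothetical numbers (the distributions below are made up, not from any benchmark):

```python
import math

def perplexity(token_logprobs):
    """PPL = exp(mean negative log-likelihood of the reference tokens)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def token_kld(p, q):
    """KL(P || Q) at one token position: sum_i p_i * log(p_i / q_i)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical distributions over a 3-token vocab at two positions
# (P = full-precision model, Q = quantized model):
P = [[0.7, 0.2, 0.1], [0.5, 0.3, 0.2]]
Q = [[0.6, 0.25, 0.15], [0.5, 0.3, 0.2]]
klds = [token_kld(p, q) for p, q in zip(P, Q)]
print(perplexity([math.log(0.7), math.log(0.5)]))  # ~1.69
print(klds[1])  # -> 0.0 (the quant matched P exactly at this position)
```

Note the calibration sensitivity: PPL depends only on the probability assigned to the reference text, so whatever text distribution the imatrix was tuned toward gets flattered by the metric.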

Benjamin’s recent MiniMax‑M2.5 analysis shows a case of how perplexity and KLD can be very misleading. Unsloth Dynamic IQ2_XXS performs better than AesSedai’s IQ3_S on real-world evals (LiveCodeBench v6, MMLU Pro) despite being 11GB smaller. Yet AesSedai’s perplexity and KLD benchmarks suggest the opposite (PPL: 9.0338 vs 8.2849; KLD: 0.3552 vs 0.2441; lower is better).

KL Divergence - AesSedai
Perplexity - AesSedai

This mismatch shows that lower perplexity or KLD doesn’t necessarily translate to better real-world performance. The graph also shows UD-Q4_K_XL outperforming other Q4 quants while being ~8GB smaller. This doesn’t mean perplexity or KLD is useless; they provide a rough signal. So, going forward, we’ll publish perplexity and KLD for every quant so the community has a reference point.

4) March 5th 2026 Update - more robustness

We further enhanced our quantization method for Qwen3.5 MoEs to reduce Maximum KLD directly. The 99.9% KLD is what is generally reported, but for massive outliers, Maximum KLD can be more informative. Our new method generally pushes Maximum KLD down substantially versus the pre-March 5th quants.
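The difference between the two statistics is easy to see on synthetic data: a percentile ignores a single pathological token, while the max is dominated by it. (The per-token KLD values below are made up for illustration.)

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile (simple convention, no interpolation)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# Synthetic per-token KLDs: one pathological token among 10,000.
klds = [0.01] * 9999 + [8.0]
print(percentile(klds, 99.9))  # -> 0.01 (outlier invisible at this percentile)
print(max(klds))               # -> 8.0  (Maximum KLD catches it)
```

This is why optimizing Maximum KLD directly targets robustness: it penalizes exactly the rare tokens that percentile-based metrics hide.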

| Quant | Old GB | New GB | Old Max KLD | New Max KLD |
| --- | --- | --- | --- | --- |
| UD-Q2_K_XL | 12.0 | 11.3 | 8.237 | 8.155 |
| UD-Q3_K_XL | 16.1 | 15.5 | 5.505 | 5.146 |
| UD-Q4_K_XL | 19.2 | 20.7 (+7.8%) | 5.894 | 2.877 (-51%) |
| UD-Q5_K_XL | 23.2 | 24.6 (+6%) | 5.536 | 3.210 (-42%) |

Full Benchmarks

| Quantizer | Quant Level | Disk Space (GB) | PPL | KLD 99.9% | Mean KLD |
| --- | --- | --- | --- | --- | --- |
| AesSedai | IQ3_S | 12.65 | 6.9152 | 1.8669 | 0.0613 |
| AesSedai | IQ4_XS | 16.4 | 6.6447 | 0.8067 | 0.0235 |
| AesSedai | Q4_K_M | 20.62 | 6.5665 | 0.3171 | 0.0096 |
| AesSedai | Q5_K_M | 24.45 | 6.5356 | 0.21 | 0.0058 |
| Ubergarm | Q4_0 | 19.79 | 6.5784 | 0.4829 | 0.0142 |
| Unsloth | IQ2_XXS | 9.09 | 7.716 | 4.2221 | 0.1846 |
| Unsloth | Q2_K_XL | 12.04 | 7.0438 | 2.9092 | 0.097 |
| Unsloth | IQ3_XXS | 13.12 | 6.7829 | 1.5296 | 0.0501 |
| Unsloth | IQ3_S | 14.13 | 6.7715 | 1.4193 | 0.0457 |
| Unsloth | Q3_K_M | 15.54 | 6.732 | 0.9726 | 0.0324 |
| Unsloth | Q3_K_XL | 16.06 | 6.7245 | 0.9539 | 0.0308 |
| Unsloth | MXFP4_MOE | 18.17 | 6.6 | 0.7789 | 0.0272 |
| Unsloth | Q4_K_M | 18.49 | 6.6053 | 0.5478 | 0.0192 |
| Unsloth | Q4_K_L | 18.82 | 6.5905 | 0.4828 | 0.015 |
| Unsloth | Q4_K_XL | 19.17 | 6.5918 | 0.4097 | 0.0137 |
| Unsloth | Q5_K_XL | 23.22 | 6.5489 | 0.236 | 0.0069 |
| Unsloth | Q6_K_S | 26.56 | 6.5456 | 0.2226 | 0.0065 |
| Unsloth | Q6_K_XL | 28.22 | 6.5392 | 0.1437 | 0.0041 |
| Unsloth | Q8_K_XL | 36.04 | 6.5352 | 0.1033 | 0.0026 |
| bartowski | Qwen_IQ2_XXS | 8.15 | 9.3427 | 6.0607 | 0.3457 |
| bartowski | Qwen_Q2_K_L | 11.98 | 7.5504 | 3.8095 | 0.1559 |
| bartowski | Qwen_IQ3_XXS | 12.94 | 7.0938 | 2.1563 | 0.0851 |
| bartowski | Qwen_Q3_K_M | 14.95 | 6.772 | 1.7779 | 0.0585 |
| bartowski | Qwen_Q3_K_XL | 15.97 | 6.8245 | 1.7516 | 0.0627 |
| bartowski | Qwen_IQ4_XS | 17.42 | 6.6234 | 0.7265 | 0.0234 |
| bartowski | Qwen_Q4_K_M | 19.77 | 6.6097 | 0.5771 | 0.0182 |
| bartowski | Qwen_Q5_K_M | 23.11 | 6.5828 | 0.3549 | 0.0106 |
| noctrex | MXFP4_MOE_BF16 | 20.55 | 6.5948 | 0.7939 | 0.0248 |
| noctrex | MXFP4_MOE_F16 | 20.55 | 6.5937 | 0.7614 | 0.0247 |
