# Unsloth Dynamic 2.0 GGUFs

We're excited to introduce [Unsloth](https://github.com/unslothai/unsloth) Dynamic v2.0 quantization, a major upgrade to our previous quants. The new method outperforms leading quantization methods and sets new benchmarks on [Aider Polyglot](/docs/zh/ji-chu/unsloth-dynamic-2.0-ggufs/unsloth-dynamic-ggufs-on-aider-polyglot.md), 5-shot MMLU and KL divergence.

This means you can now run and fine-tune [quantized LLMs](/docs/zh/mo-xing/tutorials.md) while preserving as much accuracy as possible. You can run 2.0 GGUFs on most inference engines, such as llama.cpp and [Unsloth Studio](/docs/zh/xin/studio.md).

{% columns %}
{% column %}
**Update, April 20, 2026:** See our new GGUF benchmarks for [Qwen3.6](/docs/zh/mo-xing/qwen3.6.md#unsloth-gguf-benchmarks) and [Gemma 4](/docs/zh/mo-xing/gemma-4.md#unsloth-gguf-benchmarks).

[Update, February 27, 2026:](/docs/zh/mo-xing/qwen3.5/gguf-benchmarks.md) **Qwen3.5** is out. We fixed some tool-calling chat template issues and benchmarked perplexity and KL divergence for every GGUF. [See the benchmarks!](/docs/zh/mo-xing/qwen3.5/gguf-benchmarks.md)

The **key advantage** of using the [Unsloth package](https://github.com/unslothai/unsloth) and our quantized models is our active role in fixing bugs in major models. We have worked directly with the teams behind [Qwen3](https://www.reddit.com/r/LocalLLaMA/comments/1kaodxu/qwen3_unsloth_dynamic_ggufs_128k_context_bug_fixes/), [Meta (Llama 4)](https://github.com/ggml-org/llama.cpp/pull/12889), [Mistral (Devstral)](https://app.gitbook.com/o/HpyELzcNe0topgVLGCZY/s/xhOjnexMCB3dmuQFQ2Zq/~/changes/618/basics/tutorials-how-to-fine-tune-and-run-llms/devstral-how-to-run-and-fine-tune), [Google (Gemma 1–3)](https://news.ycombinator.com/item?id=39671146) and [Microsoft (Phi-3/4)](https://simonwillison.net/2025/Jan/11/phi-4-bug-fixes), contributing fixes that improve accuracy.
{% endcolumn %}

{% column %}

{% endcolumn %}
{% endcolumns %}

{% hint style="success" %}
Unsloth Dynamic GGUFs are now available in [Unsloth Studio](/docs/zh/xin/studio.md) ✨

{% endhint %}

{% hint style="success" %}
[Update, September 10, 2025:](/docs/zh/ji-chu/unsloth-dynamic-2.0-ggufs/unsloth-dynamic-ggufs-on-aider-polyglot.md) You asked for tougher benchmarks, so here are the Aider Polyglot results! Our Dynamic 3-bit DeepSeek V3.1 GGUF scores **75.6%**, surpassing many full-precision SOTA LLMs. [Read more.](/docs/zh/ji-chu/unsloth-dynamic-2.0-ggufs/unsloth-dynamic-ggufs-on-aider-polyglot.md)

DeepSeek-V3.2 Thinking Aider Benchmarks

{% endhint %}

You can also check out Benjamin Marie's real-world use-case benchmarks on LiveCodeBench v6, MMLU Pro and more:

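You can see that Unsloth GGUFs still perform better despite being about 8GB smaller than the non-Unsloth quants. Our detailed analysis of the benchmarks and evaluations follows below.

### 💡 What's new in Dynamic v2.0?

* **Revamped layer selection for GGUFs + safetensors:** Unsloth Dynamic 2.0 now selectively quantizes layers more intelligently and more extensively. Rather than modifying only a few selected layers, we now dynamically adjust the quantization type of every possible layer, and the combination differs for each layer and each model.
* All currently selected and all future GGUF uploads will use Dynamic 2.0 and our new calibration dataset. The dataset contains more than 1.5M **tokens** (depending on the model) and consists of high-quality, hand-curated and cleaned data, which greatly improves conversational chat performance.
* Previously, our dynamic quantization (DeepSeek-R1 1.58-bit GGUF) was effective only for MoE architectures. **Dynamic 2.0 quantization now works on all models (both MoE and non-MoE).**
* **Model-specific quants:** Each model now uses a custom-tailored quantization scheme. For example, the layers quantized in Gemma 3 differ significantly from those in Llama 4.
* To maximize efficiency, especially on Apple Silicon and ARM devices, we now also add the Q4\_NL, Q5.1, Q5.0, Q4.1 and Q4.0 formats.

To ensure accurate benchmarking, we built an internal evaluation framework that matches the officially reported 5-shot MMLU scores of Llama 4 and Gemma 3. This allowed apples-to-apples comparisons between full precision and Dynamic v2.0, **QAT** and standard **imatrix** GGUF quants.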

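All future GGUF uploads will use Unsloth Dynamic 2.0, and our Dynamic 4-bit safetensor quants will also benefit from it in the future.

## 📊 Why KL divergence?

[Accuracy is Not All You Need](https://arxiv.org/pdf/2407.09141) shows that pruning layers, even ones selected as unnecessary, still produces large differences in terms of "flips". A "flip" is defined as an answer changing from incorrect to correct, or from correct to incorrect. The paper shows that MMLU may not drop when we prune or quantize layers, but only because some incorrect answers "flip" to correct ones. Our goal is to match the original model, so measuring "flips" is a good metric.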
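The flip metric described above, and the mean KL divergence this section argues for, can be sketched in a few lines of plain Python. This is only an illustrative sketch, not our evaluation framework: the function names and the toy data are hypothetical, and it assumes KL divergence is taken from the full-precision distribution to the quantized one, which this page does not spell out.

```python
import math


def accuracy(correct):
    """Fraction of questions answered correctly."""
    return sum(correct) / len(correct)


def count_flips(reference_correct, quantized_correct):
    """Count questions whose correctness changed in either direction."""
    if len(reference_correct) != len(quantized_correct):
        raise ValueError("both runs must cover the same questions")
    return sum(r != q for r, q in zip(reference_correct, quantized_correct))


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two distributions over the same vocabulary."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0.0)


def mean_kl_divergence(reference_dists, quantized_dists):
    """Average KL divergence over token positions."""
    pairs = list(zip(reference_dists, quantized_dists))
    return sum(kl_divergence(p, q) for p, q in pairs) / len(pairs)


# Hypothetical toy data: per-question correctness for two models.
reference_correct = [True, True, False, False]
quantized_correct = [True, False, True, False]

# Accuracy is identical, yet half of the answers changed.
print(accuracy(reference_correct), accuracy(quantized_correct))
print(count_flips(reference_correct, quantized_correct))

# Hypothetical next-token distributions at two positions.
reference_dists = [[0.5, 0.5], [0.9, 0.1]]
quantized_dists = [[0.5, 0.5], [0.6, 0.4]]
print(mean_kl_divergence(reference_dists, quantized_dists))
```

The toy data shows the point made by the paper: accuracy alone hides the two flipped answers, while a distribution-level measure such as KL divergence is zero only when the quantized model matches the original.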

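{% hint style="info" %}
**KL divergence** should be **one of the gold standards for reporting quantization error**, as stated in the paper "Accuracy is Not All You Need". **Using perplexity is incorrect** because output token values can cancel out, so we must use KLD or harder benchmarks such as [Aider](/docs/zh/ji-chu/unsloth-dynamic-2.0-ggufs/unsloth-dynamic-ggufs-on-aider-polyglot.md).
{% endhint %}

The paper also shows that KL divergence is highly correlated with flips, so our goal is to reduce the mean KL divergence while adding as little disk space to the quantization as possible.

## ⚖️ Calibration dataset overfitting

Most frameworks report perplexity and KL divergence on a test set of Wikipedia articles. However, we noticed that using a calibration dataset that is also Wikipedia-related causes quants to overfit and attain lower perplexity scores. We use the [Calibration\_v3](https://gist.github.com/bartowski1182/eb213dccb3571f863da82e99418f81e8) and [Calibration\_v5](https://gist.github.com/tristandruyen/9e207a95c7d75ddf37525d353e00659c/) datasets for fair testing, which include some wikitext data among other data. **Also, instruct models have unique chat templates, and text-only calibration datasets are not effective for instruct models** (they are fine for base models). In fact, most imatrix GGUFs are typically calibrated with these issues. As a result, because the quants are in effect optimized for that domain, they naturally perform better on KL divergence benchmarks that also use Wikipedia data.

To ensure a fair and controlled evaluation, we do not use our own calibration dataset (which is optimized for chat performance) when benchmarking KL divergence. Instead, we test with the same standard Wikipedia datasets, which lets us directly compare our Dynamic 2.0 method against the baseline imatrix approach.

## :1234: MMLU replication adventure

* Replicating MMLU 5-shot was nightmarish. We **could not** replicate MMLU results for many models, including Llama 3.1 (8B) Instruct and Gemma 3 (12B), because of **subtle implementation issues**. Llama 3.1 (8B), for example, should get about 68.2%, while an incorrect implementation attains only **35% accuracy.**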

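* Llama 3.1 (8B) Instruct gets 67.8% 5-shot MMLU accuracy with a naive MMLU implementation. However, we found that Llama **tokenizes "A" and "\_A" (A with a leading space) as different token ids**. If we consider both the spaced and non-spaced tokens, we get 68.2% (+0.4%).
* Interestingly, per Eleuther AI's [LLM Harness](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/llama3/instruct/mmlu/_continuation_template_yaml), Llama 3 also appends **"The best answer is"** to the question, which follows Llama 3's original MMLU benchmarks.
* There are many other subtle issues, so to benchmark everything in a controlled environment we designed our own MMLU implementation from scratch by studying [github.com/hendrycks/test](https://github.com/hendrycks/test) directly, then verified our results across multiple models against the reported numbers.

## :sparkles: Gemma 3 QAT replication, benchmarks

The Gemma team released two QAT (quantization-aware training) versions of Gemma 3:

1. Q4\_0 GGUF: quantizes all layers to Q4\_0 via the formula `w = q * block_scale`, where each block contains 32 weights. See the [llama.cpp wiki](https://github.com/ggml-org/llama.cpp/wiki/Tensor-Encoding-Schemes) for more details.
2. int4 version: presumably [TorchAO int4 style](https://github.com/pytorch/ao/blob/main/torchao/quantization/README.md)?

We benchmarked all Q4\_0 GGUF versions and ran extensive experiments on the 12B model. The **12B Q4\_0 QAT model scores 67.07%**, while the full bfloat16 12B version scores 67.15% on 5-shot MMLU. That's very impressive! The 27B model is nearly there as well!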
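To make the Q4\_0 formula `w = q * block_scale` concrete, here is a minimal pure-Python sketch of quantizing and dequantizing one 32-weight block. It loosely follows the Q4\_0 layout as we understand it (one scale per block, 4-bit codes stored with an offset of 8, so that `q = code - 8`). The function names are hypothetical, and details such as storing the scale in fp16 and packing two codes per byte are omitted.

```python
import math

BLOCK_SIZE = 32  # Q4_0 groups weights into blocks of 32


def quantize_block_q4_0(weights):
    """Return (block_scale, codes) with each code an integer in [0, 15]."""
    if len(weights) != BLOCK_SIZE:
        raise ValueError(f"expected {BLOCK_SIZE} weights, got {len(weights)}")
    # The weight with the largest magnitude is mapped to q = -8.
    extreme = max(weights, key=abs)
    block_scale = extreme / -8.0
    if block_scale == 0.0:
        # All-zero block: every code decodes to 0.
        return 0.0, [8] * BLOCK_SIZE
    inverse = 1.0 / block_scale
    # Round to the nearest code and clamp to the 4-bit range.
    codes = [min(15, int(w * inverse + 8.5)) for w in weights]
    return block_scale, codes


def dequantize_block_q4_0(block_scale, codes):
    """Apply w = q * block_scale with q = code - 8."""
    return [(code - 8) * block_scale for code in codes]


# Hypothetical example block.
weights = [math.sin(i) for i in range(BLOCK_SIZE)]
block_scale, codes = quantize_block_q4_0(weights)
restored = dequantize_block_q4_0(block_scale, codes)
worst_error = max(abs(a - b) for a, b in zip(weights, restored))
print(f"scale={block_scale:.4f} worst_error={worst_error:.4f}")
```

Because every weight in a block shares one scale, a single outlier stretches the grid for the other 31 weights. This is one reason a plain Q4\_0 quant can lose accuracy and why QAT or per-layer quant selection can help.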

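| Metric | 1B | 4B | 12B | 27B |
| ------------ | ------ | ------ | -------------------- | ------------------- |
| MMLU 5-shot | 26.12% | 55.13% | 67.07% (67.15% BF16) | 70.64% (71.5% BF16) |
| Disk space | 0.93GB | 2.94GB | 7.52GB | 16.05GB |
| Efficiency\* | 1.20 | 10.26 | 5.59 | 2.84 |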

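We designed a new **efficiency metric** that measures the usefulness of a model while taking its disk size and 5-shot MMLU score into account:

$$
\text{Efficiency} = \frac{\text{MMLU 5 shot score} - 25}{\text{Disk Space GB}}
$$

{% hint style="warning" %}
We have to **subtract 25** because MMLU has 4 multiple-choice options: A, B, C or D. Suppose we built a model that simply picks answers at random. It would get 25% accuracy and occupy only a few bytes of disk space, but it is clearly not a useful model.
{% endhint %}

For KL divergence relative to the base model, the table below shows the improvements. As a reminder, the closer KL divergence is to 0, the better (0 means identical to the full-precision model).

| Quant | Baseline KLD | GB | New KLD | GB |
| --------- | ------------ | ----- | -------- | ----- |
| IQ1\_S | 1.035688 | 5.83 | 0.972932 | 6.06 |
| IQ1\_M | 0.832252 | 6.33 | 0.800049 | 6.51 |
| IQ2\_XXS | 0.535764 | 7.16 | 0.521039 | 7.31 |
| IQ2\_M | 0.26554 | 8.84 | 0.258192 | 8.96 |
| Q2\_K\_XL | 0.229671 | 9.78 | 0.220937 | 9.95 |
| Q3\_K\_XL | 0.087845 | 12.51 | 0.080617 | 12.76 |
| Q4\_K\_XL | 0.024916 | 15.41 | 0.023701 | 15.64 |

If we plot the ratio of the disk space increase against the change in KL divergence, the benefit is much clearer! Our dynamic 2-bit Q2\_K\_XL reduces KLD quite a bit (around 7.5%).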
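The efficiency metric defined above is simple enough to check by hand. The short sketch below, with a hypothetical `efficiency` helper, recomputes it for three of the Google QAT models using the MMLU scores and disk sizes reported on this page.

```python
RANDOM_BASELINE = 25.0  # percent accuracy of random guessing over 4 options


def efficiency(mmlu_percent, disk_gb):
    """MMLU points gained over random guessing, per GB of disk space."""
    if disk_gb <= 0:
        raise ValueError("disk_gb must be positive")
    return (mmlu_percent - RANDOM_BASELINE) / disk_gb


# (5-shot MMLU %, disk space in GB) taken from the Gemma 3 QAT table.
models = {
    "1B": (26.12, 0.93),
    "12B": (67.07, 7.52),
    "27B": (70.64, 16.05),
}

for name, (score, size_gb) in models.items():
    print(f"{name}: {efficiency(score, size_gb):.2f}")
# prints:
# 1B: 1.20
# 12B: 5.59
# 27B: 2.84
```

A model that scores exactly 25% gets an efficiency of 0 regardless of its size, which is the behavior the subtraction is meant to produce.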

Truncated table of MMLU results for Gemma 3 (27B). See below.

1. **Our dynamic 4-bit version is 2GB smaller while being 1% more accurate than the QAT version!**
2. Efficiency-wise, the 2-bit Q2\_K\_XL and other quants seem to do very well!

| Quant | Unsloth | Unsloth + QAT | Disk size (GB) | Efficiency |
| -------------- | --------- | ------------- | -------------- | ---------- |
| IQ1\_M | 48.10 | 47.23 | 6.51 | 3.42 |
| IQ2\_XXS | 59.20 | 56.57 | 7.31 | 4.32 |
| IQ2\_M | 66.47 | 64.47 | 8.96 | 4.40 |
| Q2\_K\_XL | 68.70 | 67.77 | 9.95 | 4.30 |
| Q3\_K\_XL | 70.87 | 69.50 | 12.76 | 3.49 |
| **Q4\_K\_XL** | **71.47** | **71.07** | **15.64** | **2.94** |
| **Google QAT** | | **70.64** | **17.2** | **2.65** |

Full Google Gemma 3 (27B) QAT benchmarks:

| Model | Unsloth | Unsloth + QAT | Disk size (GB) | Efficiency |
| -------------- | --------- | ------------- | -------------- | ---------- |
| IQ1\_S | 41.87 | 43.37 | 6.06 | 3.03 |
| IQ1\_M | 48.10 | 47.23 | 6.51 | 3.42 |
| IQ2\_XXS | 59.20 | 56.57 | 7.31 | 4.32 |
| IQ2\_M | 66.47 | 64.47 | 8.96 | 4.40 |
| Q2\_K | 68.50 | 67.60 | 9.78 | 4.35 |
| Q2\_K\_XL | 68.70 | 67.77 | 9.95 | 4.30 |
| IQ3\_XXS | 68.27 | 67.07 | 10.07 | 4.18 |
| Q3\_K\_M | 70.70 | 69.77 | 12.51 | 3.58 |
| Q3\_K\_XL | 70.87 | 69.50 | 12.76 | 3.49 |
| Q4\_K\_M | 71.23 | 71.00 | 15.41 | 2.98 |
| **Q4\_K\_XL** | **71.47** | **71.07** | **15.64** | **2.94** |
| Q5\_K\_M | 71.77 | 71.23 | 17.95 | 2.58 |
| Q6\_K | 71.87 | 71.60 | 20.64 | 2.26 |
| Q8\_0 | 71.60 | 71.53 | 26.74 | 1.74 |
| **Google QAT** | | **70.64** | **17.2** | **2.65** |

## :llama: Llama 4 bug fixes + run

We also helped fix a few Llama 4 bugs:

* Llama 4 Scout changed the RoPE Scaling configuration in its official repo. We helped resolve issues in llama.cpp to enable this [change here](https://github.com/ggml-org/llama.cpp/pull/12889).

* The QK Norm epsilon for both Llama 4 Scout and Maverick should come from the config file, which means using 1e-05 and not 1e-06. We helped resolve these in [llama.cpp](https://github.com/ggml-org/llama.cpp/pull/12889) and [transformers](https://github.com/huggingface/transformers/pull/37418).
* The Llama 4 team and vLLM also independently fixed an issue where QK Norm was shared across all heads (it should not be) [here](https://github.com/vllm-project/vllm/pull/16311). MMLU Pro accuracy increased from 68.58% to 71.53%.
* [Wolfram Ravenwolf](https://x.com/WolframRvnwlf/status/1909735579564331016) showed that our GGUFs via llama.cpp attain higher accuracy than third-party inference providers. This was most likely a combination of the issues above, and possibly also due to quantization issues.
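To see why a QK Norm epsilon of 1e-05 versus 1e-06 can matter, the sketch below applies a weightless RMS normalization to each attention head separately. This is an illustrative assumption about what QK Norm computes, not code taken from llama.cpp, transformers or vLLM, and the function names and input vectors are hypothetical.

```python
import math


def rms_norm(vector, eps):
    """Scale a vector by 1 / sqrt(mean(x^2) + eps)."""
    mean_square = sum(x * x for x in vector) / len(vector)
    return [x / math.sqrt(mean_square + eps) for x in vector]


def per_head_qk_norm(heads, eps):
    """Normalize each head with its own statistic, not one shared across heads."""
    return [rms_norm(head, eps) for head in heads]


# Hypothetical query vectors for two heads: one tiny, one of ordinary scale.
heads = [
    [1e-3, -2e-3, 1.5e-3, 0.5e-3],
    [1.0, -2.0, 1.5, 0.5],
]

with_config_eps = per_head_qk_norm(heads, eps=1e-5)
with_wrong_eps = per_head_qk_norm(heads, eps=1e-6)

for index, (good, bad) in enumerate(zip(with_config_eps, with_wrong_eps)):
    ratio = bad[0] / good[0]
    print(f"head {index}: wrong-eps output is {ratio:.3f}x the config-eps output")
```

Under these assumptions, the epsilon barely changes heads with ordinary activation magnitudes but roughly doubles the output of the small-magnitude head, so a mismatched epsilon can shift attention scores in ways that depend on the input.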

As shown in our graph, our 4-bit Dynamic QAT quantization delivers better performance on 5-shot MMLU while also being smaller in size.

### Running Llama 4 Scout

To run Llama 4 Scout, for example, first clone and build llama.cpp:

```bash
# Install build dependencies
apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y

# Clone llama.cpp and build it with CUDA support
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-gguf-split

# Copy the built binaries next to the sources
cp llama.cpp/build/bin/llama-* llama.cpp
```

Then download the new Dynamic v2.0 quant for Scout:

```python
# !pip install huggingface_hub hf_transfer
import os

# Enable faster downloads; must be set before importing huggingface_hub
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

from huggingface_hub import snapshot_download

# Download only the IQ2_XXS files from the repo
snapshot_download(
    repo_id = "unsloth/Llama-4-Scout-17B-16E-Instruct-GGUF",
    local_dir = "unsloth/Llama-4-Scout-17B-16E-Instruct-GGUF",
    allow_patterns = ["*IQ2_XXS*"],
)
```

Now let's run inference!

{% code overflow="wrap" %}

```bash
./llama.cpp/llama-cli \
    --model unsloth/Llama-4-Scout-17B-16E-Instruct-GGUF/Llama-4-Scout-17B-16E-Instruct-UD-IQ2_XXS.gguf \
    --threads 32 \
    --ctx-size 16384 \
    --n-gpu-layers 99 \
    -ot ".ffn_.*_exps.=CPU" \
    --seed 3407 \
    --prio 3 \
    --temp 0.6 \
    --min-p 0.01 \
    --top-p 0.9 \
    -no-cnv \
    --prompt "<|header_start|>user<|header_end|>\n\nCreate a Flappy Bird game.<|eot|><|header_start|>assistant<|header_end|>\n\n"
```

{% endcode %}

{% hint style="success" %}
Read more about running Llama 4 here:
{% endhint %}