> For the complete documentation index, see [llms.txt](https://unsloth.ai/docs/llms.txt). Markdown versions of documentation pages are available by appending `.md` to page URLs; this page is available as [Markdown](https://unsloth.ai/docs/zh/mo-xing/gemma-4/qat.md). # Gemma 4 QAT Gemma 4 QAT（量化感知训练）是 Google DeepMind 的新 [Gemma 4](/docs/zh/mo-xing/gemma-4.md) 变体，旨在 **在保持模型质量的同时减少内存需求**。这使得可以在本地运行更大的模型，例如 **Gemma 4 26B-A4B**，仅需在消费级 GPU 上配备 **16GB RAM**. Gemma 4 QAT 在训练时就考虑了量化，使 4 位格式可具有约**低 72% 的内存占用** 并且 **性能几乎与原版相同**。此外还提供了 2 个特殊的 E2B 和 E4B 移动量化版本，采用了混合量化宽度。转换为 `Q4_0` 从 QAT 直接转换到 Q4\_0 时，26B-A4B 的 top-1 准确率只有 70.2%。 [我们应用了 Unsloth Dynamic 方法](#qat-analysis) 将其提升到 **85.6%（+15.6%），同时还** [**小了 200MB**](#usage-guide)! Gemma 4 QAT 包括： **E2B**, **E4B**, **12B、26B-A4B**，以及 **31B。** 它们是多模态、混合思考模型，支持 140 多种语言，且上下文长度最高可达 **256K。** {% columns %} {% column %} 运行 Gemma 4 QAT QAT 分析 **Gemma-4-E2B** QAT 可在 3GB RAM 上运行， **E4B** 在 5GB 上**，12B** 在 7GB 上**，26-A4B** 在 15GB 上，以及 **31B** 在 18GB 上**.** 我们将 Gemma 4 QAT GGUF 命名为 `UD-Q4_K_XL` 因为我们发现 q4\_0 虽然更大，但会降低准确率。请参阅我们的 [Gemma 4 QAT GGUF](https://huggingface.co/collections/unsloth/gemma-4-qat). 要比较 `int4` 量化，请看下面原版与 QAT 大小的差异。QAT 在几乎保留全部原始准确率的同时，内存占用减少约 72%： {% endcolumn %} {% column %}

{% endcolumn %} {% endcolumns %} | Gemma 4 | QAT（int4）GGUF | 原始 BF16 | 百分比变化 | | ----------- | ------------: | ------: | -----: | | **E2B** | 2.62 GB | 9.31 GB | 71.86% | | **E4B** | 4.22 GB | 15.1 GB | 72.05% | | **12B** | 6.72 GB | 23.8 GB | 71.76% | | **26B A4B** | 14.2 GB | 50.5 GB | 71.88% | | **31B** | 17.3 GB | 61.4 GB | 71.82% | ### 使用指南适用于 E2B 和 E4B 的 Gemma 4 QAT 变体专为手机和笔记本电脑设计，而更大的 26B-A4B 和 31B QAT 模型现在可在笔记本电脑上运行，而不只是强大的家用 GPU。只有 **一个 GGUF 文件** 适用于每个 Gemma 4 模型，因为我们发现，高于已上传 `UD-Q4_K_XL` 版本的精度不仅不会提升，反而会降低准确率。请使用原始的非 QAT Q4\_0 量化版本 [这里](https://huggingface.co/collections/unsloth/gemma-4).

### 硬件要求 **表：Gemma 4 QAT 推理 GGUF 推荐硬件要求** （单位 = 总内存：RAM + VRAM，或统一内存）。 | Gemma 4 QAT | 要求 | | --------------- | ----: | | **E2B** QAT | 3 GB | | **E4B** QAT | 5 GB | | **12B** QAT | 7 GB | | **26B A4B** QAT | 15 GB | | **31B** QAT | 18 GB | ### 推荐设置 QAT 检查点使用相同的推荐 Gemma 4 设置： * `temperature = 1.0` * `top_p = 0.95` * `top_k = 64` {% hint style="info" %} Gemma 4 的最大上下文长度是 **128K** 用于 **E2B**, **E4B** 和 **256K** 用于 **12B**, **26B A4B**, **31B**. {% endhint %} ## QAT 分析我们发现，直接在 llama.cpp 环境中把 QAT Q4\_0 检查点天真地转换为 Q4\_0 实际上会降低准确率，而且并不真正符合 Q4\_0 的 BF16 QAT 格式。我们应用了 Unsloth 动态方法，强制让 llama.cpp 兼容的 Q4\_0 格式与真实的 BF16 QAT Q4\_0 格式更一致，并且最终既把量化版本做得更小了（嵌入不需要 Q6\_K），也更准确！

下面是一张 KLD、Top 1% 准确率和磁盘占用的表。你可以看到，我们的版本在 99.9% KLD 和平均 KLD 上有显著提升。 **例如，E2B 的平均 KLD 在朴素 Q4\_0 量化下为 0.05109，而我们的版本为 0.00173（相对提升 29 倍），而且体积还小了 22%！** 主要问题在于将 QAT BF16 转换为 llama.cpp 的 Q4\_0 格式并不是无损的。llama.cpp 使用的是 F16 缩放，而 QAT BF16 使用的是 BF16 缩放，并且在 llama.cpp 环境中这些缩放值并非最优确定。朴素转换对 BF16 QAT 的字节级精确匹配只有 24.77%，而我们发现借助一些技巧可以将其提升到 99.96%！

模型	方法	磁盘（GB）	99.9% KLD	平均 KLD	Top-1 %
E2B	Unsloth	2.62	0.0557	0.00173	98.16
E2B	Q4_0	3.35	1.0513	0.05109	89.29
E4B	Unsloth	4.22	0.0536	0.00121	98.54
E4B	Q4_0	5.15	0.6722	0.03778	90.94
26B	Unsloth	14.25	2.7087	0.09788	85.63
26B	Q4_0	14.44	4.5420	0.36094	70.20
31B	Unsloth	17.29	1.3659	0.01403	96.67
31B	Q4_0	17.65	3.0030	0.09349	87.91
12B	Unsloth	6.72	9.2740	0.13288	88.76
12B	Q4_0	6.98	14.7323	0.50702	74.08

## 移动混合 QAT Gemma-4 团队还发布了 Gemma-4-E2B-it 和 Gemma-4-E4B-it 的特殊移动混合 QAT 版本。我们也将它们忠实地转换为 llama.cpp 兼容格式，并且同样几乎恢复了全部准确率。我们对 2 位层使用了 TQ2\_0，并且使用了负缩放器。我们为 E2B 和 E4B 都制作了 UD-Q2\_K\_XL 量化版本。 | | E2B 移动版 | E4B 移动版 | | ---------------- | ------------ | ------- | | 大小 | 2.19 GB | 3.22 GB | | 2 位（TQ2\_0）张量 | 61（包括深层 MLP） | 2（仅嵌入层） | | 相对于 BF16 的平均 KLD | 0.00409 | 0.00102 | | Top-1 % | 97.82% | 98.76% | | 基础 PPL | \~103 | 42.4 | 参见 [gemma-4-E2B-it-qat-GGUF](https://huggingface.co/unsloth/gemma-4-E2B-it-qat-GGUF) 和 [gemma-4-E4B-it-qat-GGUF](https://huggingface.co/unsloth/gemma-4-E4B-it-qat-GGUF) 用于 `UD-Q2_K_XL`. ## 运行 Gemma 4 QAT 教程由于 Gemma 4 GGUF 有多种大小，因此小模型的推荐起点是 8 位，而更大的模型则是 **动态 4 位**. [Gemma 4 GGUF](https://huggingface.co/collections/unsloth/gemma-4-qat): | [E2B](https://huggingface.co/unsloth/gemma-4-E2B-it-qat-GGUF) | [E4B](https://huggingface.co/unsloth/gemma-4-E4B-it-qat-GGUF) | [12b](https://huggingface.co/unsloth/gemma-4-12b-it-qat-GGUF) | [26B-A4B](https://huggingface.co/unsloth/gemma-4-26B-A4B-it-qat-GGUF) | [31B](https://huggingface.co/unsloth/gemma-4-31B-it-qat-GGUF) | | ------------------------------------------------------------- | ------------------------------------------------------------- | ------------------------------------------------------------- | --------------------------------------------------------------------- | ------------------------------------------------------------- | 🦥 Unsloth Studio 指南 🦙 Llama.cpp 指南 {% columns %} {% column %} **你可以在我们的** [**Unsloth Studio**](/docs/zh/xin-de/studio.md)✨ **notebook 中免费运行并训练 Gemma 4 QAT：** {% endcolumn %} {% column %} {% embed url="" %} {% endcolumn %} {% endcolumns %} ### 🦥 Unsloth Studio 指南现在可以在 [Unsloth Studio](/docs/zh/xin-de/studio.md)中运行和训练 Gemma 4 QAT，我们新的本地 AI 开源网页 UI。Unsloth Studio 让你可以在本地运行模型，支持 **MacOS、Windows**、Linux，以及： {% columns %} {% column %} * 搜索、下载、 [运行 GGUF](/docs/zh/xin-de/studio.md#run-models-locally) 和 safetensor 模型 * [**自我修复** 工具调用](/docs/zh/xin-de/studio.md#execute-code--heal-tool-calling) + **网页搜索** * [**代码执行**](/docs/zh/xin-de/studio.md#run-models-locally) （Python、Bash） * [自动推理](/docs/zh/xin-de/studio.md#model-arena) 参数调优（temp、top-p 等） * 通过 llama.cpp 实现快速 CPU + GPU 推理 * [训练 LLM](/docs/zh/xin-de/studio.md#no-code-training) 速度快 2 倍，VRAM 减少 70% {% endcolumn %} {% column %}

{% endcolumn %} {% endcolumns %} {% stepper %} {% step %} #### 安装 Unsloth 在终端中运行： **MacOS、Linux、WSL：** ```bash curl -fsSL https://unsloth.ai/install.sh | sh ``` **Windows PowerShell：** ```bash irm https://unsloth.ai/install.ps1 | iex ``` {% endstep %} {% step %} #### 启动 Unsloth **MacOS、Linux、WSL 和 Windows：** ```bash unsloth studio -H 0.0.0.0 -p 8888 ``` 然后打开 `http://127.0.0.1:8888` （或你的具体 URL）。 {% endstep %} {% step %} #### 搜索并下载 Gemma 4 QAT 首次启动时，你需要创建密码来保护你的账户，并再次登录。然后前往 [Unsloth Chat](/docs/zh/xin-de/studio/chat.md) 标签页，在搜索栏中搜索 Gemma 4，并下载你想要的模型和量化版本。 {% endstep %} {% step %} #### 运行 Gemma 4 QAT 使用 Unsloth Studio 时，推理参数应会自动设置，不过你仍然可以手动更改。你也可以编辑上下文长度、聊天模板和其他设置。更多信息请查看我们的 [Unsloth Studio 推理指南](/docs/zh/xin-de/studio/chat.md).

{% endstep %} {% endstepper %} ### 🦙 Llama.cpp 指南对于本指南，无需选择量化类型，因为只有一种： `UD-Q4_K_XL`。参见： [Gemma 4 QAT 集合](https://huggingface.co/collections/unsloth/gemma-4-qat)。对于这些教程，我们将使用 [llama.cpp](llama.cpphttps://github.com/ggml-org/llama.cpp) 进行快速本地推理，尤其是如果你有 CPU。 {% stepper %} {% step %} 获取最新的 `llama.cpp` **于** [**GitHub 这里**](https://github.com/ggml-org/llama.cpp)。你也可以按照下面的构建说明操作。将 `-DGGML_CUDA=ON` 到 `-DGGML_CUDA=OFF` ，如果你没有 GPU，或者只想进行 CPU 推理。 **对于 Apple Mac / Metal 设备**，设置 `-DGGML_CUDA=OFF` 然后照常继续——Metal 支持默认开启。 ```bash apt-get update apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y git clone https://github.com/ggml-org/llama.cpp cmake llama.cpp -B llama.cpp/build \ -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-mtmd-cli llama-server llama-gguf-split cp llama.cpp/build/bin/llama-* llama.cpp ``` {% endstep %} {% step %} 如果你想使用 `llama.cpp` 直接加载模型时，你可以根据每个模型按照下面的命令操作。 `UD-Q4_K_XL` 是唯一的量化类型。你也可以通过 Hugging Face（第 3 步）下载。这类似于 `ollama run` 。使用 `export LLAMA_CACHE="folder"` 来强制 `llama.cpp` 以保存到特定位置。 **26B-A4B：** ```bash export LLAMA_CACHE="unsloth/gemma-4-26B-A4B-it-qat-GGUF" ./llama.cpp/llama-cli \ -hf unsloth/gemma-4-26B-A4B-it-qat-GGUF:UD-Q4_K_XL \\ --temp 1.0 \ --top-p 0.95 \ --top-k 64 ``` **31B：** ```bash export LLAMA_CACHE="unsloth/gemma-4-31B-it-qat-GGUF" ./llama.cpp/llama-cli \ -hf unsloth/gemma-4-31B-it-qat-GGUF:UD-Q4_K_XL \\ --temp 1.0 \ --top-p 0.95 \ --top-k 64 ``` **E4B：** ```bash export LLAMA_CACHE="unsloth/gemma-4-E4B-it-qat-GGUF" ./llama.cpp/llama-cli \ -hf unsloth/gemma-4-E4B-it-qat-GGUF:UD-Q4_K_XL \\ --temp 1.0 \ --top-p 0.95 \ --top-k 64 ``` **E2B：** ```bash export LLAMA_CACHE="unsloth/gemma-4-E2B-it-GGUF" ./llama.cpp/llama-cli \ -hf unsloth/gemma-4-E2B-it-qat-GGUF:UD-Q4_K_XL \\ --temp 1.0 \ --top-p 0.95 \ --top-k 64 ``` {% endstep %} {% step %} 通过以下方式下载模型（在安装 `pip install huggingface_hub hf_transfer` 之后）。你可以选择 `UD-Q4_K_XL` 或其他量化版本，例如 `Q8_0` 。如果下载卡住了，请参见： [Hugging Face Hub、XET 调试](/docs/zh/ji-chu-zhi-shi/troubleshooting-and-faqs/hugging-face-hub-xet-debugging.md) ```bash hf download unsloth/gemma-4-26B-A4B-it-qat-GGUF \\ --local-dir unsloth/gemma-4-26B-A4B-it-qat-GGUF \\ --include "*mmproj-BF16*" \\ --include "*UD-Q4_K_XL*" # 动态 2 位请使用 "*UD-Q2_K_XL*" ``` {% endstep %} {% step %} 然后在对话模式下运行模型（带视觉 `mmproj-F16`): {% code overflow="wrap" %} ```bash ./llama.cpp/llama-cli \ --model unsloth/gemma-4-26B-A4B-it-qat-GGUF/gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf \\ --mmproj unsloth/gemma-4-26B-A4B-it-qat-GGUF/mmproj-BF16.gguf \\ --temp 1.0 \ --top-p 0.95 \ --top-k 64 ``` {% endcode %} {% endstep %} {% step %} ### Llama-server 部署要在 llama-server 上部署 Gemma-4，请使用： ```bash ./llama.cpp/llama-server \ --model unsloth/gemma-4-26B-A4B-it-qat-GGUF/gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf \\ --mmproj unsloth/gemma-4-26B-A4B-it-qat-GGUF/mmproj-BF16.gguf \\ --temp 1.0 \ --top-p 0.95 \ --top-k 64 \ --alias "unsloth/gemma-4-26B-A4B-it-qat-GGUF" \\ --port 8001 \ --chat-template-kwargs '{"enable_thinking":true}' ``` {% hint style="warning" %} 要 [禁用思考/推理](#how-to-enable-or-disable-reasoning-and-thinking)，使用 `--chat-template-kwargs '{"enable_thinking":false}'` 如果你使用的是 **Windows** Powershell，请使用： `--chat-template-kwargs "{\"enable_thinking\":false}"` 可将 'true' 和 'false' 互换使用。 {% endhint %} {% endstep %} {% endstepper %} --- # Agent Instructions This documentation is published with GitBook. GitBook is the documentation platform designed so that both humans and AI agents can read, navigate, and reason over technical content effectively. Learn more at gitbook.com. ## Querying This Documentation If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question. Perform an HTTP GET request on the current page URL with the `ask` query parameter, and the optional `goal` query parameter: ``` GET https://unsloth.ai/docs/zh/mo-xing/gemma-4/qat.md?ask=&goal= ``` `ask` is the immediate question: it should be specific, self-contained, and written in natural language. `goal` is optional and describes the broader end goal you are ultimately trying to accomplish on behalf of the user. GitBook uses it to tailor the answer towards what is most useful for that goal. The response will contain a direct answer to the question and relevant excerpts and sources from the documentation. Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.