# DiffusionGemma - How to Run Locally

DiffusionGemma **26B-A4B** is Google DeepMind's new open **multimodal** model, built on the [Gemma 4](/docs/zh/mo-xing/gemma-4.md) MoE architecture. With **256K context** and support for **140+ languages**, DiffusionGemma is designed for **high-speed text generation** from text, video, and image inputs. DiffusionGemma runs locally on **18GB of memory**, and [fine-tuning](#fine-tune-diffusiongemma) is now supported via [Unsloth](https://github.com/unslothai/unsloth).

Unlike standard token-by-token decoding, DiffusionGemma uses **diffusion generation**: it produces the output in parallel and progressively refines it into the final answer, much like a diffusion image model but for text. Run the model with [Unsloth Studio](/docs/zh/xin-de/studio.md) or llama.cpp. On an RTX 6000, DiffusionGemma can reach **2000+ tokens per second**.

**GGUF:** [diffusiongemma-26B-A4B-it-GGUF](https://huggingface.co/unsloth/diffusiongemma-26B-A4B-it-GGUF)

{% hint style="success" %}
**June 12:** You can now run DiffusionGemma in [Unsloth Studio](#unsloth-studio-guide) ✨, with 1.8x faster inference!
{% endhint %}

### Usage Guide

DiffusionGemma is aimed at users who need faster generation than standard models provide. It suits fast local inference, long-context document analysis, image/video understanding, OCR and document parsing, code generation, tool use, agent workflows, and low-latency small-batch inference.

Unlike standard Gemma 4 models, DiffusionGemma needs an inference runtime with diffusion support. Parameters such as `temperature`, `top_p`, and `top_k` are not enough on their own to reproduce the recommended behavior without the required diffusion sampler.

### Hardware requirements

As a rule, you should have at least 18GB of RAM to run the model in 4-bit precision.

**Table: Recommended hardware requirements for DiffusionGemma inference GGUFs** (units = total memory: RAM + VRAM, or unified memory).

| 4-bit | 5-bit | 6-bit | 8-bit | BF16 / FP16 |
| ----- | ----- | ----- | ----- | ----------- |
| 18 GB | 20 GB | 24 GB | 28 GB | 52 GB |

{% hint style="info" %}
As a rule of thumb, your total available memory should at least exceed the size of the quantized model you download. If it does not, you can still run the model with partial RAM / disk offloading, but generation will be slower. You will also need more resources depending on the context window you use.
{% endhint %}

## Run DiffusionGemma Tutorials

It is best to use at least 4-bit precision, so we will use the 4-bit `Q4_K_M` quant, which needs 18GB of RAM.

**GGUF:** [diffusiongemma-26B-A4B-it-GGUF](https://huggingface.co/unsloth/diffusiongemma-26B-A4B-it-GGUF)

### 🦥 Unsloth Studio Guide

{% hint style="success" %}
You can now run DiffusionGemma in [Unsloth Studio](#unsloth-studio-guide) ✨. Make sure you are using [`v0.1.463-beta`](https://github.com/unslothai/unsloth/tree/v0.1.462-beta) or `2026.6.6`.
{% endhint %}

DiffusionGemma can now be run and trained in [Unsloth Studio](/docs/zh/xin-de/studio.md), our new open-source web UI for local AI. Unsloth Studio lets you run models locally on **MacOS**, **Windows**, and Linux, and:

{% columns %}
{% column %}
* Search, download, and [run GGUFs](/docs/zh/xin-de/studio.md#run-models-locally) and safetensor models
* [**Self-healing** tool calling](/docs/zh/xin-de/studio.md#execute-code--heal-tool-calling) + **web search**
* [**Code execution**](/docs/zh/xin-de/studio.md#run-models-locally) (Python, Bash)
* [Automatic inference](/docs/zh/xin-de/studio.md#model-arena) parameter tuning (temp, top-p, etc.)
* Fast CPU + GPU inference via llama.cpp
* [Train LLMs](/docs/zh/xin-de/studio.md#no-code-training) 2x faster with 70% less VRAM
{% endcolumn %}

{% column %}

{% endcolumn %}
{% endcolumns %}

{% stepper %}
{% step %}
#### Install Unsloth

Make sure you are using the latest [`v0.1.463-beta`](https://github.com/unslothai/unsloth/tree/v0.1.462-beta) or `2026.6.6`. Run in your terminal:

**MacOS, Linux, WSL:**

```bash
curl -fsSL https://unsloth.ai/install.sh | sh
```

**Windows PowerShell:**

```powershell
irm https://unsloth.ai/install.ps1 | iex
```
{% endstep %}

{% step %}
#### Launch Unsloth

**MacOS, Linux, WSL, and Windows:**

```bash
unsloth studio -H 0.0.0.0 -p 8888
```

Then open `http://127.0.0.1:8888` (or your specific URL) in your browser.
{% endstep %}

{% step %}
#### Search for and download DiffusionGemma

On first launch you will need to create a password to secure your account, then sign in again. Next, go to the [Unsloth Chat](/docs/zh/xin-de/studio/chat.md) tab, search for DiffusionGemma in the search bar, and download the model and quant you want.
{% endstep %}

{% step %}
#### Run DiffusionGemma

When you use Unsloth Studio, the inference parameters should be set automatically, but you can still change them manually. You can also edit the context length, chat template, and other settings. For more information, see our [Unsloth Studio inference guide](/docs/zh/xin-de/studio/chat.md).

{% endstep %}
{% endstepper %}

### 🦙 Llama.cpp Guide

In this tutorial we use the dynamic 4-bit `Q4_K_M` quant, which needs 18GB of RAM, together with [llama.cpp](https://github.com/ggml-org/llama.cpp) for fast local inference, especially on CPU.

{% stepper %}
{% step %}
Get the specific `llama.cpp` PR [**on GitHub here**](https://github.com/ggml-org/llama.cpp/pull/24423). You can also follow the build instructions below. Change `-DGGML_CUDA=ON` to `-DGGML_CUDA=OFF` if you do not have a GPU or only want CPU inference. **For Apple Mac / Metal devices**, set `-DGGML_CUDA=OFF` and then continue as usual, since Metal support is on by default.

```bash
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
# Check out the PR branch (requires the GitHub CLI, gh)
gh pr checkout 24423
# Build with CUDA (for CPU-only or Metal builds, use -DGGML_CUDA=OFF)
cmake -B build -DGGML_CUDA=ON
cmake --build build -j --config Release --target llama-diffusion-cli
cd ..
```
{% endstep %}

{% step %}
Download the model as shown below (after running `pip install huggingface_hub`). You can choose `Q4_K_M` or another quant such as `Q8_0`. The commands in this guide use `Q8_0`; for `Q4_K_M`, change the `--include` pattern and the `.gguf` filename in the run command accordingly. If the download gets stuck, see [Hugging Face Hub, XET debugging](/docs/zh/ji-chu-zhi-shi/troubleshooting-and-faqs/hugging-face-hub-xet-debugging.md).

```bash
pip install -U "huggingface_hub[cli]"
hf download unsloth/diffusiongemma-26B-A4B-it-GGUF \
    --local-dir unsloth/diffusiongemma-26B-A4B-it-GGUF \
    --include "*Q8_0*" # use "*Q4_K_M*" for the smaller 16 GB download
```
{% endstep %}
{% endstepper %}

### Chat with DiffusionGemma

Then run the command below from the directory that contains both `llama.cpp/` and the downloaded `unsloth/` folder:

{% code overflow="wrap" %}
```bash
./llama.cpp/build/bin/llama-diffusion-cli \
    -m unsloth/diffusiongemma-26B-A4B-it-GGUF/diffusiongemma-26B-A4B-it-Q8_0.gguf \
    -ngl 99 -cnv -n 2048
```
{% endcode %}

If you enter a prompt such as "Create a Flappy Bird Game", you will see the steps first, followed by the output.

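You can also continue the conversation. Change `-n 2048` to the number of tokens you want to predict; larger values allow longer answers.

### Live visualization of diffusion

To watch the diffusion process in real time, use the arguments below, specifically enabling `--diffusion-visual`:

```bash
./llama.cpp/build/bin/llama-diffusion-cli \
    -m unsloth/diffusiongemma-26B-A4B-it-GGUF/diffusiongemma-26B-A4B-it-Q8_0.gguf \
    -ngl 99 -cnv -n 2048 --diffusion-visual
```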
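
If you would rather launch the CLI from a script, the two invocations above can be assembled from the same flags. The following is a minimal Python sketch, not part of llama.cpp or Unsloth: the helper name `build_cli_args` is hypothetical, and the default binary path assumes `llama.cpp` was cloned into the current directory as in the build step. Only flags that appear on this page are used.

```python
import shlex


def build_cli_args(model_path, n_predict=2048, gpu_layers=99,
                   conversation=True, visual=False,
                   binary="llama.cpp/build/bin/llama-diffusion-cli"):
    """Return the argument list for one llama-diffusion-cli run.

    `binary` is an assumed location; adjust it to wherever you built the CLI.
    """
    args = [binary, "-m", str(model_path),
            "-ngl", str(gpu_layers),   # 99 = offload all layers, 0 = CPU only
            "-n", str(n_predict)]      # target number of tokens
    if conversation:
        args.append("-cnv")            # multi-turn conversation mode
    if visual:
        args.append("--diffusion-visual")  # live canvas denoising view
    return args


if __name__ == "__main__":
    cmd = build_cli_args(
        "unsloth/diffusiongemma-26B-A4B-it-GGUF/"
        "diffusiongemma-26B-A4B-it-Q8_0.gguf",
        visual=True,
    )
    # Print a shell-ready command line; `cmd` can also go to subprocess.run.
    print(shlex.join(cmd))
```

Keeping the arguments as a list avoids shell-quoting problems when the model path contains spaces.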

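All llama.cpp parameters available when using the branch:

* `-n, --n-predict N` - target token count; `--diffusion-blocks` is derived from it, and `-ub` / `-b` / `-c` grow to match.
* `-ngl 99` - offload all layers to the GPU (`-ngl 0` for CPU only).
* `-cnv` - multi-turn conversation mode.
* `--diffusion-visual` - live canvas denoising view.
* The Entropy-Bound sampler is on by default (`--diffusion-eb auto`). Tune it with `--diffusion-eb-max-steps` (default 48), `--diffusion-eb-t-max` / `--diffusion-eb-t-min` (0.8 -> 0.4), `--diffusion-eb-entropy-bound` (0.1), and `--diffusion-eb-confidence` (0.005).
* `--diffusion-kv-cache {auto,on,off}` - prompt-prefix KV cache (auto = on with a single GPU).

## Fine-tune DiffusionGemma

You can now train and fine-tune DiffusionGemma directly with [**Unsloth**](#unsloth-studio-guide). In our example, we show the impact of domain-specific training by fine-tuning the model on Sudoku. The base model initially performs poorly on Sudoku, but after training on a targeted dataset it learns to actually solve the puzzles and solves every example correctly.

You can fine-tune DiffusionGemma with our Colab notebook (A100):

{% embed url="" %}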

## Recommended settings

[Unsloth Studio](#unsloth-studio-guide) automatically sets the best inference parameters for your model. Use the settings below when you need to configure them yourself:

| Category | Setting | Value |
| ----- | -------- | --------------------------- |
| Sampling | Method | `diffusion_sampling` |
| Sampling | Sampler | `entropy_bounded_denoising` |
| Sampling | Max denoising steps | `48` |
| Temperature | Temperature schedule | `linear_decay` |
| Temperature | Start temperature | `0.8` |
| Temperature | End temperature | `0.4` |
| Entropy | Entropy bound | `0.1` |
| Adaptive stopping | Adaptive stopping enabled | `true` |
| Adaptive stopping | Entropy threshold | `0.005` |
| Canvas | Canvas length | `256` |

**Adaptive stopping trigger conditions**

Adaptive stopping should trigger only when **both** conditions are met:

| Condition | Required value |
| ----------------------- | --------- |
| Mean canvas entropy | `< 0.005` |
| Top-probability tokens stay stable for 2 consecutive steps | `true` |

At each denoising step, the sampler should select the lowest-entropy tokens for which the mutual-information bound stays within `entropy_bound = 0.1`. Tokens that are not selected should be fully re-noised before the next denoising step.

### Thinking mode

DiffusionGemma supports Gemma 4 style thinking mode. To enable thinking, add the thinking token at the start of the system prompt:

```
<|think|>
```

With thinking enabled, the model may output an internal reasoning channel before the final answer:

```
<|channel>thought
[internal reasoning]
[final answer]
```

To disable thinking, remove the `<|think|>` token from the system prompt. With thinking disabled, the model may still output an empty thinking channel:

```
<|channel>thought
[final answer]
```

For multi-turn conversations, do **not** include previous hidden thoughts in the conversation history. Include only the final assistant reply before the next user turn.

## DiffusionGemma best practices

### Multimodal prompts

DiffusionGemma supports interleaved multimodal input, including text and images. Video can be handled as a sequence of image frames. For best results in multimodal prompts, place the image or frame content before the text instruction. Example:

```
[image]
Describe the chart and summarize the key trends.
```

For document parsing, OCR, chart understanding, UI understanding, or small-text extraction, use a higher visual token budget. Supported visual token budgets:

| Visual token budget | Best for |
| ----------- | -------------- |
| 70 | Fast classification, simple image captioning |
| 140 | Lightweight visual question answering |
| 280 | General image understanding |
| 560 | OCR, charts, UI screenshots |
| 1120 | Dense documents, small text, detailed extraction |

For video-style input, DiffusionGemma can process up to **60 seconds** when sampled at **1 frame per second**.

### Sampling notes

DiffusionGemma is not an ordinary next-token-only model. It generates a block of tokens, called a **canvas**, by repeatedly refining noisy token predictions:

1. The encoder processes the prompt and builds the context cache.
2. The decoder receives a 256-token generation canvas.
3. The diffusion sampler iteratively denoises the canvas.
4. Confident tokens are selected and kept.
5. Uncertain tokens are re-noised and refined again.
6. Once the canvas is complete, it is appended to the context.
7. The model continues with the next canvas.

This block-wise autoregressive approach lets DiffusionGemma generate more tokens in fewer forward passes than a standard autoregressive model.

## Benchmarks

DiffusionGemma is optimized for speed and multimodal inference, although standard Gemma 4 is stronger on general reasoning benchmarks.

| Benchmark | DiffusionGemma 26B-A4B | Gemma 4 26B-A4B |
| ------------------- | ---------------------: | --------------: |
| MMLU Pro | 77.6% | 82.6% |
| AIME 2026 (no tools) | 69.1% | 88.3% |
| LiveCodeBench v6 | 69.1% | 77.1% |
| Codeforces ELO | 1429 | 1718 |
| GPQA Diamond | 73.2% | 82.3% |
| Tau2 (average) | 56.2% | 68.2% |
| HLE (no tools) | 11.0% | 8.7% |
| HLE (with search) | 11.9% | 17.2% |
| BigBench Extra Hard | 47.6% | 64.8% |
| MMMLU | 81.5% | 86.3% |

| Long-context benchmark | DiffusionGemma 26B-A4B | Gemma 4 26B-A4B |
| ------------------------- | ---------------------: | --------------: |
| MRCR v2 8 needle 128K (average) | 32.0% | 44.1% |

**Vision benchmarks:**

| Vision benchmark | DiffusionGemma 26B-A4B | Gemma 4 26B-A4B |
| --------------------- | ---------------------: | --------------: |
| MMMU Pro | 54.3% | 73.8% |
| OmniDocBench 1.5 (lower is better) | 0.319 | 0.149 |
| MATH-Vision | 70.5% | 82.4% |
| MedXPertQA MM | 49.0% | 58.1% |
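
Returning to the recommended sampler settings and the adaptive stopping conditions: the sketch below shows one way those numbers could fit together in Python. It is an illustration under stated assumptions, not the llama.cpp or Unsloth implementation. All function names are hypothetical, the linear schedule is assumed to reach the end temperature on the last step, and the selection rule is a simple entropy budget (keep the most certain positions while their summed entropy stays within the bound, always keeping at least one), because this page does not spell out the exact formula.

```python
import math

MAX_STEPS = 48            # max denoising steps
T_START, T_END = 0.8, 0.4  # linear_decay temperature schedule
ENTROPY_BOUND = 0.1       # entropy bound for token selection
STOP_ENTROPY = 0.005      # adaptive stopping entropy threshold
STABLE_STEPS = 2          # steps the top tokens must stay unchanged


def temperature_at(step, max_steps=MAX_STEPS):
    """Linear decay from T_START at step 0 to T_END at the last step (assumed)."""
    if max_steps <= 1:
        return T_END
    frac = step / (max_steps - 1)
    return T_START + (T_END - T_START) * frac


def entropy(probs):
    """Shannon entropy in nats of one position's token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_positions(entropies, bound=ENTROPY_BOUND):
    """Return the canvas positions to keep this step, in canvas order.

    Assumed rule: visit positions from lowest to highest entropy and keep
    them while the running entropy total stays within `bound`. The single
    most certain position is always kept so every step makes progress.
    Positions not returned would be re-noised before the next step.
    """
    order = sorted(range(len(entropies)), key=lambda i: entropies[i])
    kept, running = [], 0.0
    for idx in order:
        if kept and running + entropies[idx] > bound:
            break
        kept.append(idx)
        running += entropies[idx]
    return sorted(kept)


def should_stop(mean_entropy, stable_steps):
    """Adaptive stopping: both documented conditions must hold."""
    return mean_entropy < STOP_ENTROPY and stable_steps >= STABLE_STEPS
```

For example, `select_positions([0.02, 0.5, 0.03, 0.04, 1.2])` returns `[0, 2, 3]`: the three most certain positions fit inside the 0.1 budget, and the other two would be re-noised and refined on the next step.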