> For the complete documentation index, see [llms.txt](https://unsloth.ai/docs/llms.txt). Markdown versions of documentation pages are available by appending `.md` to page URLs; this page is available as [Markdown](https://unsloth.ai/docs/zh/mo-xing/mtp.md). # 如何运行 MTP 模型：多 Token 预测指南 MTP，即多 Token 预测，通过让模型一次预测多个即将到来的 token，而不是每一步生成一个 token，从而加速推理。它能在不损失准确率的情况下实现更快推理，尤其适用于 GPU。在本指南中，你将学习如何在本地设备上使用诸如 [Gemma 4](/docs/zh/mo-xing/gemma-4.md) 或 [Qwen3.6](/docs/zh/mo-xing/qwen3.6.md) 之类的 MTP 模型。 MTP 会预测多个未来 token，由主模型并行验证。这减少了生成的前向传播次数，从而加快输出，同时保持质量，因为只保留经过验证的 token。当运行 [GGUF](/docs/zh/ji-chu/unsloth-dynamic-2.0-ggufs.md)时，MTP 可使生成速度 **提升约 1.4 倍到 2.2 倍**。像 Gemma-4-31B 这样的稠密模型受益最大，可达到 **超过 1.4 倍的加速** 。在内存带宽较低的设备上，例如较老的 Mac，提升会更小。你可以直接在 [Unsloth Studio 的界面](/docs/zh/xin/studio.md) 或 llama.cpp 中运行 MTP 模型。 {% hint style="info" %} **MTP 比标准**消耗更多内存，因此请预留大约 2 GB 的额外 RAM/VRAM 空间。 {% endhint %} Gemma 4 MTP Qwen3.6 MTP 我们发现 `--spec-draft-n-max 2` 是最好的起点，不过， **不要假设 `2` 是最优的**，因为性能取决于硬件。你可以尝试从 `1` 到 `6` 之间的任意值，并使用对你的系统最快的那个。Unsloth Studio 会自动为你的特定硬件（Mac、CPU、GPU 等）设置理想的 MTP 参数——之后你仍然可以更改。 ### Gemma 4 MTP Google DeepMind 将 MTP 单独训练，独立于原始 [Gemma 4](/docs/zh/mo-xing/gemma-4/qat.md) 模型，包括 [QAT 变体](/docs/zh/mo-xing/gemma-4/qat.md)。与 Qwen 不同，Google 以 `assistant` 名称发布了特定的 MTP 变体。为了获得最佳效果，我们只上传 3 种精度选项： **8 位** 和 **16 位** （BF16、F16）。对于 QAT——我们采用了 [智能 4 位恢复流程](/docs/zh/mo-xing/gemma-4/qat.md#qat-analysis) ，就像我们对 Gemma 4 QAT 量化版本所做的那样，因此 MTP 量化版本也是通过智能 4 位派生的。我们已将 `mtp-` 前缀的 GGUF 上传到每个仓库，因此你只需使用 **常规的原始 Gemma 4 GGUF**即可，无需单独的仓库。你可以在这里访问 Gemma [MTP 模型](https://huggingface.co/collections/unsloth/gemma-4) ，它们现在可以在 [Unsloth](#unsloth-studio-mtp-guide)中运行。我们对带 MTP 的 Gemma 4 QAT 进行了基准测试，运行速度快了 1.5x - 2.2x：

**表：MTP 硬件需求** （单位 = 总内存：RAM + VRAM，或统一内存） | Gemma 4 变体 | 4 位 | 8 位 | BF16 / FP16 | | --------------- | -------: | -------: | ----------: | | **E2B** | 5 GB | 6–9 GB | 11 GB | | **E4B** | 6.5–7 GB | 10–13 GB | 17 GB | | **12B Unified** | 8–9 GB | 14–15 GB | 26 GB | | **26B A4B** | 17–18 GB | 29–31 GB | 53 GB | | **31B** | 18–21 GB | 35–39 GB | 63 GB | {% hint style="warning" %} **Gemma 4 MTP 会在** [**Unsloth Studio**](#unsloth-studio-mtp-guide)**中自动启用。你只需要下载常规的原始 Gemma 4 GGUF。** 我们已更新 Gemma 4 GGUF 文件，在 GGUF 包内的单独文件夹中加入了额外的 MTP 文件，因此无需单独下载 Gemma 4 assistant GGUF。目前仍需要单独 MTP GGUF 的唯一模型是 Qwen3.6。 {% endhint %} 要运行 Gemma 4 MTP 模型，请按照以下任一方式的步骤进行： [Unsloth Studio](#unsloth-studio-mtp-guide) 或 [llama.cpp](#llama.cpp-mtp-guide). 🦥 在 Unsloth Studio 中运行 🦙 在 llama.cpp 中运行因此下面的命令可以直接使用（这里使用 8 位版本） {% code overflow="wrap" %} ```bash llama-server \ -hf unsloth/gemma-4-31B-it-GGUF \ --spec-type draft-mtp \ --spec-draft-n-max 4 ``` {% endcode %} ### Qwen3.6 MTP Qwen 直接在 [Qwen3.6](/docs/zh/mo-xing/qwen3.6.md) 和 [Qwen3.5](/docs/zh/mo-xing/qwen3.5.md) 模型中训练了 MTP。这使得 Qwen3.6 27B MTP 在 RTX 6000 GPU 上可达到 160 tokens/s，而 Qwen3.6 35B-A3B 可达到 240 tokens/s。GGUF 上传： | [Qwen3.6-27B-MTP-GGUF](https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF) | [Qwen3.6-35B-A3B-MTP-GGUF](https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF) | | --------------------------------------------------------------------------- | ----------------------------------------------------------------------------------- | **表：MTP 硬件需求** （单位 = 总内存：RAM + VRAM，或统一内存）

Qwen3.6	3 位	4 位	6 位	8 位	BF16
27B	16 GB	19 GB	25 GB	31 GB	56 GB
35B-A3B	18 GB	24 GB	31 GB	39 GB	71 GB

下面是 MTP 与非 MTP 的推理吞吐量图：

我们还 [上传了 MTP GGUF](https://huggingface.co/unsloth/models?search=mtp) ，适用于 [Qwen3.5](/docs/zh/mo-xing/qwen3.5.md) **模型系列** ，包括：0.8B、2B、4B、9B、27B、35B-A3B、122B-A10B 和 397B-A17B。Llama.cpp 正在持续改进 MTP 性能，所以预计它会随着时间推移变得更快！要运行 Qwen MTP 模型，请按照以下任一方式的步骤进行： [Unsloth Studio](#unsloth-studio-mtp-guide) 或 [llama.cpp](#llama.cpp-mtp-guide). ### 🦥 Unsloth Studio MTP 指南 Unsloth Studio 会自动为你的特定硬件（Mac、CPU、GPU 等）设置理想的 MTP 参数——之后你仍然可以更改。 {% stepper %} {% step %} #### 安装 Unsloth 在终端中运行： **MacOS、Linux、WSL：** ```bash curl -fsSL https://unsloth.ai/install.sh | sh ``` **Windows PowerShell：** ```bash irm https://unsloth.ai/install.ps1 | iex ``` {% endstep %} {% step %} #### 启动 Unsloth **MacOS、Linux、WSL 和 Windows：** ```bash unsloth studio -H 127.0.0.1 -p 8888 ``` 然后在浏览器中打开 `http://127.0.0.1:8888` （或你的具体 URL）。 {% endstep %} {% step %} #### 搜索并下载你想要的模型首次启动时，你需要创建一个密码来保护你的账户，并在之后再次登录。然后转到 [Unsloth Chat](/docs/zh/xin/studio/chat.md) 选项卡，在搜索栏中搜索 Qwen3.6 MTP 或 Gemma 4，并下载你想要的模型和量化版本。 {% hint style="warning" %} **Gemma 4 MTP 会在 Unsloth 中自动启用。你只需要下载常规的原始 Gemma 4 GGUF。** 我们已更新 Gemma 4 GGUF 文件，在 GGUF 包内的单独文件夹中加入了额外的 MTP 文件，因此无需单独下载 Gemma 4 assistant GGUF。目前仍需要单独 MTP GGUF 的唯一模型是 Qwen3.6。 {% endhint %}

{% endstep %} {% step %} #### 运行你的 MTP 模型推理、MTP 和 speculative **解码设置** 在使用 Unsloth Studio 时应自动设置，不过你仍然可以手动更改。你也可以在右侧边栏中编辑 speculative decoding、上下文长度、聊天模板和其他设置。

如需更多信息，你可以查看我们的 [Unsloth Studio 推理指南](/docs/zh/xin/studio/chat.md)。下面，2 位的 Qwen3.6 MTP GGUF 执行了 10 多次工具调用，搜索了 10 个网站并运行了 Python 代码：

{% endstep %} {% endstepper %} ### 🦙 Llama.cpp MTP 指南 {% stepper %} {% step %} 安装最新版本的 `llama.cpp` 于 [**GitHub 这里**](https://github.com/ggml-org/llama.cpp/pull/22673)。你也可以按照下面的构建说明操作。将 `-DGGML_CUDA=ON` 改为 `-DGGML_CUDA=OFF` ，如果你没有 GPU，或者只想进行 CPU 推理。 **对于 Apple Mac / Metal 设备**，设置 `-DGGML_CUDA=OFF` 然后照常继续——Metal 支持默认开启。 ```bash apt-get update apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y git clone https://github.com/ggml-org/llama.cpp cmake llama.cpp -B llama.cpp/build \ -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-mtmd-cli llama-server llama-gguf-split cp llama.cpp/build/bin/llama-* llama.cpp ``` {% endstep %} {% step %} 如果你想使用 `llama.cpp` 直接加载模型，你可以这样做：（`Q4_K_XL`）是量化类型。你也可以通过 Hugging Face 下载（见第 3 点）。这类似于 `ollama run` 。使用 `export LLAMA_CACHE="folder"` 来强制 `llama.cpp` 保存到特定位置。该模型的最大上下文长度为 256K。请按照以下针对特定模型的命令之一执行： Gemma 4 Qwen3.6 #### Gemma 4 MTP：别忘了 **更改模型名称** 为你想要的 Gemma 4 模型大小，例如 Gemma-4-26B-A4B 等，因为下面的说明是针对 Gemma-4-12B 的。注意，我们提供了一个 `mtp-` 带前缀的 GGUF，因此下面的 `-hf` 命令应会自动下载并使用 MTP。 **思考模式：** ```bash export LLAMA_CACHE="unsloth/gemma-4-12b-it-GGUF" ./llama.cpp/llama-cli \ -hf unsloth/gemma-4-12b-it-GGUF:UD-Q4_K_XL \ --temp 1.0 \ --top-p 0.95 \ --top-k 64 \ --spec-type draft-mtp --spec-draft-n-max 2 ``` {% hint style="info" %} 请参见 Gemma 4 新的 [保留思考](#thinking-enable-disable--preserve-thinking). {% endhint %} **非思考模式**: ```bash export LLAMA_CACHE="unsloth/gemma-4-12b-it-GGUF" ./llama.cpp/llama-cli \ -hf unsloth/gemma-4-12b-it-GGUF:UD-Q4_K_XL \ --temp 1.0 \ --top-p 0.95 \ --top-k 64 \ --spec-type draft-mtp --spec-draft-n-max 2 \ --chat-template-kwargs '{"enable_thinking":false}' ``` #### Qwen3.6 MTP：别忘了 **更改模型名称** 为你想要的 Qwen3.6 变体，例如 Qwen3.6-35B-A3B 或 Qwen3.5 等，因为下面的说明是针对 Qwen3.6-27B 的： **思考模式** （通用任务）**:** ```bash export LLAMA_CACHE="unsloth/Qwen3.6-27B-MTP-GGUF" ./llama.cpp/llama-cli \ -hf unsloth/Qwen3.6-27B-MTP-GGUF:UD-Q4_K_XL \ --temp 1.0 \ --top-p 0.95 \ --top-k 20 \ --min-p 0.00 \ --spec-type draft-mtp --spec-draft-n-max 2 ``` 对于精确编程任务，请更改： `temperature=0.6` {% hint style="info" %} 请参见 Qwen3.6 新的 [保留思考](#thinking-enable-disable--preserve-thinking). {% endhint %} **非思考模式** （通用任务）： ```bash export LLAMA_CACHE="unsloth/Qwen3.6-27B-MTP-GGUF" ./llama.cpp/llama-server \ -hf unsloth/Qwen3.6-27B-MTP-GGUF:UD-Q4_K_XL \ --temp 0.7 \ --top-p 0.8 \ --top-k 20 \ --presence-penalty 1.5 \ --min-p 0.00 \ --spec-type draft-mtp --spec-draft-n-max 2 \ --chat-template-kwargs '{"enable_thinking":false}' ``` {% endstep %} {% step %} #### 手动下载量化版本如果你想手动下载量化版本和 MTP 量化版本，也可以！请通过下面的代码下载模型（在安装 `pip install huggingface_hub hf_transfer`之后）。你可以选择 Q4\_K\_M 或其他量化版本，例如 `UD-Q4_K_XL` 。我们建议至少使用 2 位动态量化 `UD-Q2_K_XL` 以平衡体积和准确性。如果下载卡住，请参见： [Hugging Face Hub，XET 调试](/docs/zh/ji-chu/troubleshooting-and-faqs/hugging-face-hub-xet-debugging.md) #### Gemma 4 MTP： ```bash hf download unsloth/gemma-4-12B-it-qat-GGUF \ --local-dir unsloth/gemma-4-12B-it-qat-GGUF \ --include "*mmproj-F16*" \ --include "mtp-*" \ --include "*UD-Q4_K_XL*" # 动态 2 位请使用 "*UD-Q2_K_XL*" ``` #### Qwen3.6 MTP： ```bash hf download unsloth/Qwen3.6-27B-MTP-GGUF \ --local-dir unsloth/Qwen3.6-27B-MTP-GGUF \ --include "*mmproj-F16*" \ --include "*UD-Q4_K_XL*" # 动态 2 位请使用 "*UD-Q2_K_XL*" ``` {% endstep %} {% step %} 然后以对话模式运行模型： #### Gemma 4 MTP： {% code overflow="wrap" %} ```bash ./llama.cpp/llama-cli \ --model unsloth/gemma-4-12B-it-qat-GGUF/gemma-4-12B-it-qat-UD-Q4_K_XL.gguf \ --mmproj unsloth/gemma-4-12B-it-qat-GGUF/mmproj-F16.gguf \ --model-draft unsloth/gemma-4-12B-it-qat-GGUF/mtp-gemma-4-12B-it.gguf \ --temp 1.0 \ --top-p 0.95 \ --top-k 64 \ --spec-type draft-mtp --spec-draft-n-max 2 ``` {% endcode %} 然后你会看到下面的内容——也请忽略错误信息

#### Qwen3.6 MTP： {% code overflow="wrap" %} ```bash ./llama.cpp/llama-cli \ --model unsloth/Qwen3.6-27B-MTP-GGUF/Qwen3.6-27B-UD-Q4_K_XL.gguf \ --mmproj unsloth/Qwen3.6-27B-MTP-GGUF/mmproj-F16.gguf \ --temp 1.0 \ --top-p 0.95 \ --min-p 0.00 \ --top-k 20 \ --spec-type draft-mtp --spec-draft-n-max 2 ``` {% endcode %} {% endstep %} {% step %} #### Llama-server 部署要在 llama-server 上部署 Gemma-4，请使用： {% code overflow="wrap" %} ```bash ./llama.cpp/llama-server \ --model unsloth/gemma-4-12B-it-qat-GGUF/gemma-4-12B-it-qat-UD-Q4_K_XL.gguf \ --mmproj unsloth/gemma-4-12B-it-qat-GGUF/mmproj-F16.gguf \ --model-draft unsloth/gemma-4-12B-it-qat-GGUF/mtp-gemma-4-12B-it.gguf \ --temp 1.0 \ --top-p 0.95 \ --top-k 64 \ --alias "unsloth/gemma-4-12b-it-qat-GGUF" \ --port 8001 \ --chat-template-kwargs '{"enable_thinking":true}' ``` {% endcode %} {% endstep %} {% endstepper %} --- # Agent Instructions This documentation is published with GitBook. GitBook is the documentation platform designed so that both humans and AI agents can read, navigate, and reason over technical content effectively. Learn more at gitbook.com. ## Querying This Documentation If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question. Perform an HTTP GET request on the current page URL with the `ask` query parameter, and the optional `goal` query parameter: ``` GET https://unsloth.ai/docs/zh/mo-xing/mtp.md?ask=&goal= ``` `ask` is the immediate question: it should be specific, self-contained, and written in natural language. `goal` is optional and describes the broader end goal you are ultimately trying to accomplish on behalf of the user. GitBook uses it to tailor the answer towards what is most useful for that goal. The response will contain a direct answer to the question and relevant excerpts and sources from the documentation. Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.