# Qwen3-Coder: How to Run Locally

Qwen3-Coder is Qwen's new series of coding agent models, available in 30B (**Qwen3-Coder-Flash**) and 480B parameter versions. **Qwen3-Coder-480B-A35B-Instruct** achieves state-of-the-art (SOTA) coding performance that rivals Claude Sonnet-4, GPT-4.1, and [Kimi K2](/docs/zh/mo-xing/tutorials/kimi-k2-thinking-how-to-run-locally.md), scoring 61.8% on Aider Polyglot and supporting a 256K token context (extendable to 1M).

We also uploaded Qwen3-Coder with a native **1M context length**, extended with YaRN, along with full-precision 8-bit and 16-bit versions. [Unsloth](https://github.com/unslothai/unsloth) now also supports fine-tuning and [RL](/docs/zh/kai-shi-shi-yong/reinforcement-learning-rl-guide.md) for Qwen3-Coder.

{% hint style="success" %}
[**UPDATE:** We fixed tool calling for Qwen3-Coder!](#tool-calling-fixes) You can now use tool calling seamlessly in llama.cpp, Ollama, LMStudio, Open WebUI, Jan and others. The issue was universal and affected all uploads (not just Unsloth's), and we have communicated the fix to the Qwen team! [Read more](#tool-calling-fixes)
{% endhint %}

[Run 30B-A3B](#run-qwen3-coder-30b-a3b-instruct) · [Run 480B-A35B](#run-qwen3-coder-480b-a35b-instruct)

{% hint style="success" %}
**Do** [**Unsloth Dynamic Quants**](/docs/zh/ji-chu/unsloth-dynamic-2.0-ggufs.md) **work?** Yes, and very well. In third-party testing on the Aider Polyglot benchmark, the **UD-Q4\_K\_XL (276GB)** dynamic quant nearly matched the **full bf16 (960GB)** Qwen3-Coder model, scoring 60.9% versus 61.8%. [More details here.](https://huggingface.co/unsloth/Qwen3-Coder-480B-A35B-Instruct-GGUF/discussions/8)
{% endhint %}

#### **Qwen3 Coder - Unsloth Dynamic 2.0 GGUFs**:

| Dynamic 2.0 GGUF (to run) | 1M Context Dynamic 2.0 GGUF |
| --- | --- |

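## 🖥️ **Running Qwen3-Coder**

Below are guides for the [**30B-A3B**](#run-qwen3-coder-30b-a3b-instruct) and [**480B-A35B**](#run-qwen3-coder-480b-a35b-instruct) variants of the model.

### :gear: Recommended Settings

Qwen recommends the following inference settings for both models:

`temperature=0.7`, `top_p=0.8`, `top_k=20`, `repetition_penalty=1.05`

* **Temperature of 0.7**
* Top\_K of 20
* Min\_P of 0.00 (optional, but 0.01 also works well; llama.cpp's default is 0.1)
* Top\_P of 0.8
* **Repetition penalty of 1.05**
* Chat template:

```
<|im_start|>user
Hey there!<|im_end|>
<|im_start|>assistant
What is 1+1?<|im_end|>
<|im_start|>user
2<|im_end|>
<|im_start|>assistant
```

* Recommended context output: 65,536 tokens (can be increased).

**Chat template/prompt format with newlines left un-rendered**

{% code overflow="wrap" %}
```
<|im_start|>user\nHey there!<|im_end|>\n<|im_start|>assistant\nWhat is 1+1?<|im_end|>\n<|im_start|>user\n2<|im_end|>\n<|im_start|>assistant\n
```
{% endcode %}

**Chat template for tool calling** (getting the current temperature for San Francisco). More details on how to format tool calls are in the tool-calling section further down this page.

```
<|im_start|>user
What's the temperature in San Francisco now? How about tomorrow?<|im_end|>
<|im_start|>assistant
<tool_call>\n<function=get_current_temperature>\n<parameter=location>\nSan Francisco, CA, USA
</parameter>\n</function>\n</tool_call><|im_end|>
<|im_start|>user
<tool_response>
{"temperature": 26.1, "location": "San Francisco, CA, USA", "unit": "celsius"}
</tool_response>\n<|im_end|>
```

{% hint style="info" %}
As a reminder, this model supports only non-thinking mode and does not generate `<think></think>` blocks in its output. Specifying `enable_thinking=False` is also no longer required.
{% endhint %}

### Run Qwen3-Coder-30B-A3B-Instruct:

To reach inference speeds of 6+ tokens per second with our Dynamic 4-bit quant, you need at least **18GB of unified memory** (VRAM and RAM combined) or **18GB of system RAM** alone. As a rule of thumb, your available memory should match or exceed the size of the model you are using. For example, the UD-Q8\_K\_XL quant (full precision) is 32.5GB and needs at least **33GB of unified memory** (VRAM + RAM) or **33GB of RAM** for optimal performance.

**NOTE:** The model can run on less memory than its total size, but this slows down inference. Maximum memory is only needed for the fastest speeds.

Because this is a non-thinking model, there is no need to set `thinking=False`, and the model does not generate `<think></think>` blocks.

{% hint style="info" %}
Follow the [**best practices above**](#recommended-settings). They are the same as for the 480B model.
{% endhint %}

#### 🦙 Ollama: Run Qwen3-Coder-30B-A3B-Instruct Tutorial

1. Install `ollama` if you haven't already! You can only run models up to 32B in size.

```bash
apt-get update
apt-get install pciutils -y
curl -fsSL https://ollama.com/install.sh | sh
```

2. Run the model! If it fails, start `ollama serve` in another terminal and try again. We include all our fixes and suggested parameters (temperature etc.) in `params` in our Hugging Face upload!

```bash
ollama run hf.co/unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF:UD-Q4_K_XL
```

#### :sparkles: Llama.cpp: Run Qwen3-Coder-30B-A3B-Instruct Tutorial

1. 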
Obtain the latest `llama.cpp` from the [GitHub repository](https://github.com/ggml-org/llama.cpp). You can also follow the build instructions below. Change `-DGGML_CUDA=ON` to `-DGGML_CUDA=OFF` if you don't have a GPU or only want CPU inference. **For Apple Mac / Metal devices**, set `-DGGML_CUDA=OFF` and continue as usual; Metal support is on by default.

```bash
apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp
```

2. You can pull the model directly from Hugging Face:

```bash
./llama.cpp/llama-cli \
    -hf unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF:Q4_K_XL \
    --jinja -ngl 99 --ctx-size 32768 \
    --temp 0.7 --min-p 0.0 --top-p 0.80 --top-k 20 --repeat-penalty 1.05
```

3. Alternatively, download the model with the script below (after running `pip install huggingface_hub hf_transfer`). You can choose UD-Q4\_K\_XL or another quantized version. If the download gets stuck, see [Hugging Face Hub, XET debugging](/docs/zh/ji-chu/troubleshooting-and-faqs/hugging-face-hub-xet-debugging.md).

```python
# !pip install huggingface_hub hf_transfer
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"  # enable faster downloads via hf_transfer
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id = "unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF",
    local_dir = "unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF",
    allow_patterns = ["*UD-Q4_K_XL*"],  # only fetch the chosen quant
)
```

### Run Qwen3-Coder-480B-A35B-Instruct:

To reach inference speeds of 6+ tokens per second with our 1-bit quant, we recommend at least **150GB of unified memory** (VRAM and RAM combined) or **150GB of system RAM** alone. As a rule of thumb, your available memory should match or exceed the size of the model you are using. For example, the Q2\_K\_XL quant is 180GB and needs at least **180GB of unified memory** (VRAM + RAM) or **180GB of RAM** for optimal performance.

**NOTE:** The model can run on less memory than its total size, but this slows down inference. Maximum memory is only needed for the fastest speeds.

{% hint style="info" %}
Follow the [**best practices above**](#recommended-settings). They are the same as for the 30B model.
{% endhint %}

#### 📖 Llama.cpp: Run Qwen3-Coder-480B-A35B-Instruct Tutorial

For Coder-480B-A35B, we will use Llama.cpp specifically, for optimized inference and a wide range of options.

{% hint style="success" %}
If you want a **full precision unquantized version**, use our `Q8_K_XL, Q8_0` or `BF16` versions!
{% endhint %}

1. 
Obtain the latest `llama.cpp` from the [GitHub repository](https://github.com/ggml-org/llama.cpp). You can also follow the build instructions below. Change `-DGGML_CUDA=ON` to `-DGGML_CUDA=OFF` if you don't have a GPU or only want CPU inference.

```bash
apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp
```

2. You can let llama.cpp download the model directly, but I normally suggest using `huggingface_hub`. To use llama.cpp directly, run:

```bash
./llama.cpp/llama-cli \
    -hf unsloth/Qwen3-Coder-480B-A35B-Instruct-GGUF:Q2_K_XL \
    --ctx-size 16384 \
    --n-gpu-layers 99 \
    -ot ".ffn_.*_exps.=CPU" \
    --temp 0.7 \
    --min-p 0.0 \
    --top-p 0.8 \
    --top-k 20 \
    --repeat-penalty 1.05
```

3. Or download the model with the script below (after running `pip install huggingface_hub hf_transfer`). You can choose UD-Q2\_K\_XL or another quantized version.

```python
# !pip install huggingface_hub hf_transfer
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "0"  # can sometimes hit rate limits, so set to 0 to disable
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id = "unsloth/Qwen3-Coder-480B-A35B-Instruct-GGUF",
    local_dir = "unsloth/Qwen3-Coder-480B-A35B-Instruct-GGUF",
    allow_patterns = ["*UD-Q2_K_XL*"],  # only fetch the chosen quant
)
```

4. Run the model in conversation mode and try any prompt.
5. Edit `--threads -1` to set the number of CPU threads, `--ctx-size 262144` to set the context length, and `--n-gpu-layers 99` to set how many layers are offloaded to the GPU. Lower `--n-gpu-layers` if your GPU runs out of memory, and remove it entirely for CPU-only inference.

{% hint style="success" %}
Use `-ot ".ffn_.*_exps.=CPU"` to offload all MoE layers to the CPU! This effectively lets you fit all non-MoE layers on a single GPU, which improves generation speed. If you have more GPU capacity, you can customize the regular expression to keep more layers on the GPU. More options are discussed [here](#improving-generation-speed).
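{% endhint %}

{% code overflow="wrap" %}
```bash
./llama.cpp/llama-cli \
    --model unsloth/Qwen3-Coder-480B-A35B-Instruct-GGUF/UD-Q2_K_XL/Qwen3-Coder-480B-A35B-Instruct-UD-Q2_K_XL-00001-of-00004.gguf \
    --ctx-size 16384 \
    --n-gpu-layers 99 \
    -ot ".ffn_.*_exps.=CPU" \
    --temp 0.7 \
    --min-p 0.0 \
    --top-p 0.8 \
    --top-k 20 \
    --repeat-penalty 1.05
```
{% endcode %}

{% hint style="success" %}
Also don't forget about the new Qwen3 update. Run [**Qwen3-235B-A22B-Instruct-2507**](/docs/zh/mo-xing/tutorials/qwen3-next.md) locally with llama.cpp.
{% endhint %}

#### :tools: Improving generation speed

If you have more VRAM, you can keep more MoE layers on the GPU, or offload whole layers instead.

Normally, `-ot ".ffn_.*_exps.=CPU"` offloads all MoE layers to the CPU! This effectively lets you fit all non-MoE layers on a single GPU, which improves generation speed. If you have more GPU capacity, you can customize the regular expression to keep more layers on the GPU.

If you have a bit more GPU memory, try `-ot ".ffn_(up|down)_exps.=CPU"`. This offloads the up and down projection MoE layers.

If you have even more GPU memory, try `-ot ".ffn_(up)_exps.=CPU"`. This offloads only the up projection MoE layers.

You can also customize the regex. For example, `-ot "\.(6|7|8|9|[0-9][0-9]|[0-9][0-9][0-9])\.ffn_(gate|up|down)_exps.=CPU"` offloads the gate, up and down MoE layers, but only from layer 6 onwards.

The [latest llama.cpp release](https://github.com/ggml-org/llama.cpp/pull/14363) also introduces a high-throughput mode. Use `llama-parallel`; read more about it [here](https://github.com/ggml-org/llama.cpp/tree/master/examples/parallel). You can also **quantize your KV cache to 4 bits**, for example, to reduce data movement between VRAM and RAM, which can also make generation faster.

#### :triangular\_ruler: How to fit long context (256K to 1M)

To fit a longer context, you can use **KV cache quantization** to quantize the K and V caches to lower bit widths. This can also increase generation speed because less data moves between RAM and VRAM. The allowed options for K quantization (default is `f16`) are listed below.

`--cache-type-k f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1`

Prefer the `_1` variants, for example `q4_1` or `q5_1`, for somewhat better accuracy at slightly lower speed.

You can also quantize the V cache, but you will need to **compile llama.cpp with Flash Attention support** via `-DGGML_CUDA_FA_ALL_QUANTS=ON`, and pass `--flash-attn` to enable it.

We also uploaded 1 million context length GGUFs, extended via YaRN scaling, [here](https://huggingface.co/unsloth/Qwen3-Coder-480B-A35B-Instruct-1M-GGUF).

## :toolbox: Tool Calling Fixes

We managed to fix tool calling via `llama.cpp --jinja`, specifically for serving through `llama-server`! If you are downloading our 30B-A3B quants, there is nothing to worry about, as they already include our fixes. For the 480B-A35B model, please:

1. Download the first file (UD-Q2\_K\_XL) and replace your current file
2. Use `snapshot_download` as in the download step above, which will automatically overwrite the old files
3. 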
Use the new chat template via `--chat-template-file`. See the [GGUF chat template](https://huggingface.co/unsloth/Qwen3-Coder-480B-A35B-Instruct-GGUF?chat_template=default) or [chat\_template.jinja](https://huggingface.co/unsloth/Qwen3-Coder-480B-A35B-Instruct/raw/main/chat_template.jinja)
4. As an extra, we also made a single 150GB UD-IQ1\_M file (so that Ollama works)

These steps should resolve the tool-calling issues described above.

### Using Tool Calling

To show how to format prompts for tool calling, let's walk through an example.

I created a Python function called `get_current_temperature`, which should return the current temperature for a location. For now it is a placeholder that always returns 26.1 degrees Celsius. You should replace it with a real implementation!

{% code overflow="wrap" %}
```python
def get_current_temperature(location: str, unit: str = "celsius"):
    """Get the current temperature at a location.

    Args:
        location: The location to get the temperature for, in the format "City, State, Country".
        unit: The unit to return the temperature in. Defaults to "celsius". (choices: ["celsius", "fahrenheit"])

    Returns:
        The temperature, the location, and the unit in a dict
    """
    return {
        "temperature": 26.1,  # PRE_CONFIGURED -> you change this!
        "location": location,
        "unit": unit,
    }
```
{% endcode %}

Then use the tokenizer to create the full prompt:

{% code overflow="wrap" %}
```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("unsloth/Qwen3-Coder-480B-A35B-Instruct")

messages = [
    {'role': 'user', 'content': "What's the temperature in San Francisco now? How about tomorrow?"},
    {'content': '', 'role': 'assistant', 'function_call': None, 'tool_calls': [
        {'id': 'ID', 'function': {'arguments': {"location": "San Francisco, CA, USA"}, 'name': 'get_current_temperature'}, 'type': 'function'},
    ]},
    {'role': 'tool', 'content': '{"temperature": 26.1, "location": "San Francisco, CA, USA", "unit": "celsius"}', 'tool_call_id': 'ID'},
]

# Render the conversation into a single prompt string (no tokenization)
prompt = tokenizer.apply_chat_template(messages, tokenize = False)
```
{% endcode %}

## :bulb: Performance Benchmarks

{% hint style="info" %}
These official benchmarks are for the full BF16 checkpoint. To run it, use the `Q8_K_XL, Q8_0, BF16` checkpoints we uploaded; tricks like MoE offloading still work with these versions!
{% endhint %}

Here are the benchmark results for the 480B model:

#### Agentic Coding

| Benchmark | Qwen3-Coder 480B-A35B-Instruct | Kimi-K2 | DeepSeek-V3-0324 | Claude Sonnet-4 | GPT-4.1 |
| --- | --- | --- | --- | --- | --- |
| Terminal-Bench | 37.5 | 30.0 | 2.5 | 35.5 | 25.3 |
| SWE-bench Verified w/ OpenHands (500 turns) | 69.6 | – | – | 70.4 | – |
| SWE-bench Verified w/ OpenHands (100 turns) | 67.0 | 65.4 | 38.8 | 68.0 | 48.6 |
| SWE-bench Verified w/ Private Scaffolding | – | 65.8 | – | 72.7 | 63.8 |
| SWE-bench Live | 26.3 | 22.3 | 13.0 | 27.7 | – |
| SWE-bench Multilingual | 54.7 | 47.3 | 13.0 | 53.3 | 31.5 |
| Multi-SWE-bench mini | 25.8 | 19.8 | 7.5 | 24.8 | – |
| Multi-SWE-bench flash | 27.0 | 20.7 | – | 25.0 | – |
| Aider-Polyglot | 61.8 | 60.0 | 56.9 | 56.4 | 52.4 |
| Spider2 | 31.1 | 25.2 | 12.8 | 31.1 | 16.5 |
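
As a quick sanity check on the quantization claim near the top of this page, the Aider-Polyglot score above can be set against the 60.9% reported for the UD-Q4\_K\_XL quant. The following is a minimal sketch that uses only the figures quoted on this page; the variable names are illustrative.

```python
# Figures quoted on this page (third-party Aider Polyglot run).
bf16_score, bf16_size_gb = 61.8, 960      # full bf16 checkpoint
quant_score, quant_size_gb = 60.9, 276    # UD-Q4_K_XL dynamic quant

# Fraction of the bf16 benchmark score that the quant retains.
score_retained = quant_score / bf16_score
# Fraction of the bf16 checkpoint size that the quant occupies.
size_fraction = quant_size_gb / bf16_size_gb

print(f"score retained: {score_retained:.1%}")
print(f"size fraction:  {size_fraction:.1%}")
```

In other words, the quant keeps roughly 98.5% of the bf16 score at under 30% of the size, on this one benchmark.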

#### Agentic Browser Use

| Benchmark | Qwen3-Coder 480B-A35B-Instruct | Kimi-K2 | DeepSeek-V3-0324 | Claude Sonnet-4 | GPT-4.1 |
| --- | --- | --- | --- | --- | --- |
| WebArena | 49.9 | 47.4 | 40.0 | 51.1 | 44.3 |
| Mind2Web | 55.8 | 42.7 | 36.0 | 47.4 | 49.6 |

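#### Agentic Tool Use

| Benchmark | Qwen3-Coder 480B-A35B-Instruct | Kimi-K2 | DeepSeek-V3-0324 | Claude Sonnet-4 | GPT-4.1 |
| --- | --- | --- | --- | --- | --- |
| BFCL-v3 | 68.7 | 65.2 | 56.9 | 73.3 | 62.9 |
| TAU-Bench Retail | 77.5 | 70.7 | 59.1 | 80.5 | – |
| TAU-Bench Airline | 60.0 | 53.5 | 40.0 | 60.0 | – |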
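
Tool-use results like these rest on the call-and-respond loop from the tool-calling section earlier on this page. That section shows the messages but not the step in between, where the requested function is actually run. The following is a minimal sketch of that step, assuming tool calls arrive in the same dict shape as in the `messages` example. The `TOOLS` registry and the `run_tool_call` helper are hypothetical names introduced here for illustration and are not part of any library.

```python
import json

def get_current_temperature(location: str, unit: str = "celsius"):
    # Placeholder from the tool-calling section: always returns 26.1.
    return {"temperature": 26.1, "location": location, "unit": unit}

# Hypothetical registry mapping tool names to Python callables.
TOOLS = {"get_current_temperature": get_current_temperature}

def run_tool_call(tool_call: dict) -> dict:
    """Execute one tool call and package the result as a `tool` message."""
    function = tool_call["function"]
    arguments = function["arguments"]
    if isinstance(arguments, str):
        # Some clients pass the arguments as a JSON string instead of a dict.
        arguments = json.loads(arguments)
    result = TOOLS[function["name"]](**arguments)
    return {
        "role": "tool",
        "content": json.dumps(result),
        "tool_call_id": tool_call["id"],
    }

tool_call = {
    "id": "ID",
    "type": "function",
    "function": {
        "name": "get_current_temperature",
        "arguments": {"location": "San Francisco, CA, USA"},
    },
}
tool_message = run_tool_call(tool_call)
print(tool_message["content"])
```

The resulting `tool_message` has the same fields as the `tool` entry in the earlier `messages` list, so it can be appended to the conversation before the chat template is applied again.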

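
A related note on the `-ot` patterns from the generation-speed section: it can help to preview which tensors a pattern would send to the CPU before launching a large model. The sketch below rests on two assumptions that this page does not confirm: that the part before `=CPU` is applied as a regular-expression search against each tensor name, and that tensor names follow the `blk.<layer>.<tensor>` style shown in `TENSORS` (illustrative names, not read from a real GGUF file).

```python
import re

# Patterns quoted from the generation-speed section (the part before "=CPU").
PATTERNS = {
    "all experts": r".ffn_.*_exps.",
    "up+down experts": r".ffn_(up|down)_exps.",
    "up experts only": r".ffn_(up)_exps.",
    "experts from layer 6": r"\.(6|7|8|9|[0-9][0-9]|[0-9][0-9][0-9])\.ffn_(gate|up|down)_exps.",
}

# Assumed, illustrative tensor names in `blk.<layer>.<tensor>` style.
TENSORS = [
    "blk.0.attn_q.weight",
    "blk.0.ffn_gate_exps.weight",
    "blk.0.ffn_up_exps.weight",
    "blk.5.ffn_down_exps.weight",
    "blk.6.ffn_down_exps.weight",
    "blk.61.ffn_gate_exps.weight",
]

def offloaded(pattern: str) -> list[str]:
    """Return the tensor names that the given pattern matches."""
    return [name for name in TENSORS if re.search(pattern, name)]

for label, pattern in PATTERNS.items():
    print(f"{label}: {offloaded(pattern)}")
```

Under these assumptions the attention tensor is never matched, and the last pattern skips the expert tensors in layers 0 to 5, which is what "from layer 6 onwards" describes.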