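# Qwen3.6 - How to Run Locally

Qwen3.6 is Alibaba's new family of multimodal hybrid-reasoning models, which includes Qwen3.6-35B-A3B. It delivers top-tier performance for its size, supports a 256K context across 201 languages, and offers both thinking and non-thinking modes. It excels at agentic coding, vision, and chat tasks. The [35B-A3B GGUF](https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF) can run on a Mac with 22GB of memory.

All uploads use Unsloth [Dynamic 2.0](https://github.com/unslothai/docs/blob/main/basics/unsloth-dynamic-2.0-ggufs) for SOTA quantization performance, so quants are calibrated on real-world use-case datasets and important layers are upcast to higher precision. Thank you to Qwen for giving Unsloth day-zero access.

### :gear: Usage Guide

**Table: Inference hardware requirements** (units = total memory: RAM + VRAM, or unified memory)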

| Qwen3.6 | 3-bit | 4-bit | 6-bit | 8-bit | BF16 |
| ------- | ----- | ----- | ----- | ----- | ----- |
| 35B-A3B | 17 GB | 23 GB | 30 GB | 38 GB | 70 GB |
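
As a quick way to read the table, the sketch below picks the highest-precision quant whose listed requirement fits in a given amount of total memory. The function name `largest_quant_that_fits` is a hypothetical helper for illustration only; the numbers are copied from the table, and real usage also depends on context length and other running processes.

```python
from typing import Optional

# Total-memory requirements (GB) for Qwen3.6-35B-A3B, copied from the table above.
REQUIRED_GB = {"3-bit": 17, "4-bit": 23, "6-bit": 30, "8-bit": 38, "BF16": 70}


def largest_quant_that_fits(total_memory_gb: float) -> Optional[str]:
    """Return the highest-precision quant whose requirement fits, or None."""
    fitting = [name for name, need in REQUIRED_GB.items() if need <= total_memory_gb]
    # The dict is ordered from lowest to highest precision, so the last match wins.
    return fitting[-1] if fitting else None


print(largest_quant_that_fits(24))  # 4-bit
print(largest_quant_that_fits(16))  # None
```

If nothing fits, the model can still run with disk offloading, only more slowly.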

{% hint style="success" %}
For best performance, make sure your total available memory (VRAM + system RAM) exceeds the size of the quantized model file you are downloading. If it does not, llama.cpp can still run by offloading to SSD/HDD, but inference will be slower.
{% endhint %}

### Recommended Settings

* **Maximum context window:** `262,144` (can be extended to 1M via YaRN)
* `presence_penalty = 0.0 to 2.0`: off by default, but you can use it to reduce repetition; higher values may cause a **slight drop in performance**
* **Adequate output length:** `32,768` tokens for most queries

{% hint style="info" %}
If you get gibberish, your context length may be set too low. Alternatively, try `--cache-type-k bf16 --cache-type-v bf16`, which may help.
{% endhint %}

Because Qwen3.6 is a hybrid-reasoning model, thinking and non-thinking modes need different settings:

#### Thinking mode:

| General tasks | Precise coding tasks (e.g. WebDev) |
| ------------------------- | ------------------------- |
| temperature = 1.0 | temperature = 0.6 |
| top\_p = 0.95 | top\_p = 0.95 |
| top\_k = 20 | top\_k = 20 |
| min\_p = 0.0 | min\_p = 0.0 |
| presence\_penalty = 1.5 | presence\_penalty = 0.0 |
| repeat penalty = disabled or 1.0 | repeat penalty = disabled or 1.0 |

{% columns %}
{% column %}
Thinking mode for general tasks:

{% code overflow="wrap" %}
```bash
temperature=1.0, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0
```
{% endcode %}
{% endcolumn %}

{% column %}
Thinking mode for precise coding tasks:

{% code overflow="wrap" %}
```bash
temperature=0.6, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=0.0, repetition_penalty=1.0
```
{% endcode %}
{% endcolumn %}
{% endcolumns %}

#### Instruct (non-thinking) mode settings:

| General tasks | Reasoning tasks |
| ------------------------- | ------------------------- |
| temperature = 0.7 | temperature = 1.0 |
| top\_p = 0.8 | top\_p = 0.95 |
| top\_k = 20 | top\_k = 20 |
| min\_p = 0.0 | min\_p = 0.0 |
| presence\_penalty = 1.5 | presence\_penalty = 1.5 |
| repeat penalty = disabled or 1.0 | repeat penalty = disabled or 1.0 |

{% hint style="warning" %}
To [disable thinking / reasoning](#how-to-enable-or-disable-reasoning-and-thinking), use `--chat-template-kwargs '{"enable_thinking":false}'`
{% endhint %}

{% columns %}
{% column %}
Instruct (non-thinking) mode for general tasks:

{% code overflow="wrap" %}
```bash
temperature=0.7, top_p=0.8, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0
```
{% endcode %}
{% endcolumn %}

{% column %}
Instruct (non-thinking) mode for reasoning tasks:

{% code overflow="wrap" %}
```bash
temperature=1.0, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0
```
{% endcode %}
{% endcolumn %}
{% endcolumns %}

## Qwen3.6 Inference Tutorials:

We will use the Dynamic 4-bit `UD-Q4_K_XL` GGUF variant for inference workloads. Jump to the instructions for your setup:

* Run in Unsloth Studio
* Run in llama.cpp

{% hint style="warning" %}
`presence_penalty = 0.0 to 2.0`: off by default, but you can use it to reduce repetition; higher values may cause a **slight drop in performance.**

**No Qwen3.6 GGUF currently works in Ollama, because the model needs a separate mmproj vision file. Use a llama.cpp-compatible backend instead.**
{% endhint %}

## 🦥 Unsloth Studio Guide

Qwen3.6 can be run and fine-tuned in [Unsloth Studio](https://unsloth.ai/docs/zh/xin-zeng/studio), our new open-source web UI for local AI. Unsloth Studio lets you run models locally on **MacOS, Windows**, and Linux, and lets you:

{% columns %}
{% column %}
* Search, download, and [run GGUFs](https://unsloth.ai/docs/zh/xin-zeng/studio#run-models-locally) and safetensor models
* [**Self-healing** tool calling](https://unsloth.ai/docs/zh/xin-zeng/studio#execute-code--heal-tool-calling) + **web search**
* [**Code execution**](https://unsloth.ai/docs/zh/xin-zeng/studio#run-models-locally) (Python, Bash)
* [Automatic inference](https://unsloth.ai/docs/zh/xin-zeng/studio#model-arena) parameter tuning (temp, top-p, etc.)
* Fast CPU + GPU inference via llama.cpp
* [Train LLMs](https://unsloth.ai/docs/zh/xin-zeng/studio#no-code-training) 2x faster with 70% less VRAM
{% endcolumn %}

{% column %}

{% endcolumn %}
{% endcolumns %}

{% stepper %}
{% step %}

#### Install Unsloth

Run in your terminal:

**MacOS, Linux, WSL:**

```bash
curl -fsSL https://unsloth.ai/install.sh | sh
```

**Windows PowerShell:**

```powershell
irm https://unsloth.ai/install.ps1 | iex
```

{% hint style="success" %}
**Installation is quick and takes about 1-2 minutes.**
{% endhint %}
{% endstep %}

{% step %}

#### Launch Unsloth

**MacOS, Linux, WSL, and Windows:**

```bash
unsloth studio -H 0.0.0.0 -p 8888
```

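Then open `http://localhost:8888` (or your specific URL) in your browser.
{% endstep %}

{% step %}

#### Search for and download Qwen3.6

On first launch you will need to create a password to secure your account and to log back in later. You will then see a short onboarding wizard for choosing a model, dataset, and basic settings. You can skip it at any time.

Then go to the [Studio Chat](https://unsloth.ai/docs/zh/xin-zeng/studio/chat) tab, search for Qwen3.6 in the search bar, and download the model and quant you want.
{% endstep %}

{% step %}

#### Run Qwen3.6

When you use Unsloth Studio, inference parameters should be set automatically, though you can still change them manually. You can also edit the context length, chat template, and other settings. See our [Unsloth Studio inference guide](https://unsloth.ai/docs/zh/xin-zeng/studio/chat) for more information.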

{% endstep %}
{% endstepper %}

## 🦙 Llama.cpp Guide

### Qwen3.6-35B-A3B

In this guide we use the Dynamic 4-bit quant, which runs very well on a 24GB RAM / Mac device for fast inference. Because the model is only around 72GB at full F16 precision, we do not need to worry much about performance. GGUF: [Qwen3.6-35B-A3B-GGUF](https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF)

For these tutorials, we will use [llama.cpp](https://github.com/ggml-org/llama.cpp) for fast local inference, especially if you are running on a CPU.

### 🦙 Llama-server serving & OpenAI's completion library

To deploy Qwen3.6 for production, we use `llama-server`. In a new terminal, for example via tmux, deploy the model with:

{% code overflow="wrap" %}
```bash
./llama.cpp/llama-server \
    --model unsloth/Qwen3.6-35B-A3B-GGUF/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf \
    --mmproj unsloth/Qwen3.6-35B-A3B-GGUF/mmproj-F16.gguf \
    --alias "unsloth/Qwen3.6-35B-A3B" \
    --temp 0.6 \
    --top-p 0.95 \
    --ctx-size 16384 \
    --top-k 20 \
    --min-p 0.00 \
    --port 8001
```
{% endcode %}

Then, in a new terminal, after running `pip install openai`, run:

{% code overflow="wrap" %}
```python
from openai import OpenAI

# Point the client at the local llama-server started above.
openai_client = OpenAI(
    base_url = "http://127.0.0.1:8001/v1",
    api_key = "sk-no-key-required",
)
# The model name must match the --alias passed to llama-server.
completion = openai_client.chat.completions.create(
    model = "unsloth/Qwen3.6-35B-A3B",
    messages = [{"role": "user", "content": "Create a Snake game."},],
)
print(completion.choices[0].message.content)
```
{% endcode %}

### 💡 How to enable or disable thinking

{% columns %}
{% column %}
[**Unsloth Studio**](#unsloth-studio-guide) automatically provides a "Think" toggle for thinking models by default.

In llama.cpp, you can enable or disable thinking with the flags below, setting `enable_thinking` to '`true`' or '`false`' as needed. The flags for `llama-server` are:
{% endcolumn %}

{% column %}

{% endcolumn %} {% endcolumns %}

| llama-server OS: | Enable thinking | Disable thinking |
| --- | --- | --- |
| Linux, MacOS, WSL: | `--chat-template-kwargs '{"enable_thinking":true}'` | `--chat-template-kwargs '{"enable_thinking":false}'` |
| Windows / PowerShell: | `--chat-template-kwargs "{\"enable_thinking\":true}"` | `--chat-template-kwargs "{\"enable_thinking\":false}"` |
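
The only difference between the two rows is shell quoting. As a rough illustration, the sketch below builds the same argument strings programmatically, which can help when a launcher script has to generate the command line. The function `thinking_flag` and its `shell` selector are hypothetical names used only here and are not part of llama.cpp.

```python
import json


def thinking_flag(enable: bool, shell: str = "posix") -> str:
    """Build the --chat-template-kwargs argument for the given shell.

    `shell` is a made-up selector for this sketch:
    "posix" for Linux/MacOS/WSL, "powershell" for Windows.
    """
    # Compact JSON with no spaces, e.g. {"enable_thinking":false}
    payload = json.dumps({"enable_thinking": enable}, separators=(",", ":"))
    if shell == "posix":
        # Single quotes keep the JSON's double quotes intact.
        quoted = "'" + payload + "'"
    elif shell == "powershell":
        # Wrap in double quotes and backslash-escape the inner ones.
        quoted = '"' + payload.replace('"', '\\"') + '"'
    else:
        raise ValueError(f"unknown shell: {shell}")
    return "--chat-template-kwargs " + quoted


print(thinking_flag(False))
print(thinking_flag(True, shell="powershell"))
```

The two printed lines correspond to the "Disable thinking" cell for Linux, MacOS, WSL and the "Enable thinking" cell for Windows / PowerShell.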

For example, to disable thinking for Qwen3.6-35B-A3B (it is enabled by default):

```bash
./llama.cpp/llama-server \
    --model unsloth/Qwen3.6-35B-A3B-GGUF/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf \
    --alias "unsloth/Qwen3.6-35B-A3B-GGUF" \
    --temp 0.6 \
    --top-p 0.95 \
    --ctx-size 16384 \
    --top-k 20 \
    --min-p 0.00 \
    --port 8001 \
    --chat-template-kwargs '{"enable_thinking":false}'
```

Then in Python:

```python
from openai import OpenAI

# Point the client at the local llama-server started above.
openai_client = OpenAI(
    base_url = "http://127.0.0.1:8001/v1",
    api_key = "sk-no-key-required",
)
# The model name must match the --alias passed to llama-server.
completion = openai_client.chat.completions.create(
    model = "unsloth/Qwen3.6-35B-A3B-GGUF",
    messages = [{"role": "user", "content": "What is 2+2?"},],
)
message = completion.choices[0].message
print(message.content)
# reasoning_content is a server-side extension and may be absent when thinking is disabled.
print(getattr(message, "reasoning_content", None))
```

### 👨‍💻 OpenAI Codex & Claude Code

To run the model for local agentic coding workloads, you can [follow our guide](https://unsloth.ai/docs/zh/ji-chu/claude-code). Just change the model name to your "Qwen3.6" variant and make sure you follow the correct Qwen3.6 parameters and usage instructions. Use the `llama-server` we just set up.

{% columns %}
{% column %}
{% content-ref url="../ji-chu/claude-code" %}
[claude-code](https://unsloth.ai/docs/zh/ji-chu/claude-code)
{% endcontent-ref %}
{% endcolumn %}

{% column %}
{% content-ref url="../ji-chu/codex" %}
[codex](https://unsloth.ai/docs/zh/ji-chu/codex)
{% endcontent-ref %}
{% endcolumn %}
{% endcolumns %}

For example, after following the Claude Code instructions, you can ask the model something like `Create a Python chess game`.
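
Outside of Claude Code or Codex, the same prompt can be sent straight to the local server with the sampling settings recommended earlier on this page. The sketch below uses only the Python standard library. It assumes `llama-server` is listening on port 8001 under the alias `unsloth/Qwen3.6-35B-A3B`, and that its OpenAI-compatible endpoint accepts `top_k` and `min_p` in the request body; treat both as assumptions to verify against your llama.cpp build. `PRESETS`, `build_request`, and `send` are hypothetical names for illustration.

```python
import json
import urllib.request

# Sampling presets copied from the "Recommended Settings" tables on this page.
PRESETS = {
    "thinking_general": {"temperature": 1.0, "top_p": 0.95, "top_k": 20, "min_p": 0.0, "presence_penalty": 1.5},
    "thinking_coding": {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "min_p": 0.0, "presence_penalty": 0.0},
    "instruct_general": {"temperature": 0.7, "top_p": 0.8, "top_k": 20, "min_p": 0.0, "presence_penalty": 1.5},
    "instruct_reasoning": {"temperature": 1.0, "top_p": 0.95, "top_k": 20, "min_p": 0.0, "presence_penalty": 1.5},
}


def build_request(prompt: str, preset: str, model: str = "unsloth/Qwen3.6-35B-A3B") -> dict:
    """Assemble a chat-completions request body using one of the presets."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    body.update(PRESETS[preset])
    return body


def send(body: dict, url: str = "http://127.0.0.1:8001/v1/chat/completions") -> str:
    """POST the body to a running llama-server and return the reply text."""
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        reply = json.load(response)
    return reply["choices"][0]["message"]["content"]


request_body = build_request("Create a Python chess game", "thinking_coding")
print(json.dumps(request_body, indent=2))
# With llama-server running, uncomment to get a reply:
# print(send(request_body))
```

Keeping the presets in one dictionary makes it easy to switch between the general and precise-coding settings without restarting the server.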

## 📊 Benchmarks

### Unsloth GGUF benchmarks

KL divergence benchmarks for the Qwen3.6-35B-A3B GGUFs will be added here. Because Qwen3.6 shares its architecture with Qwen3.5, you can refer to our earlier Qwen3.5 benchmarks in the meantime:

{% content-ref url="qwen3.5/gguf-benchmarks" %}
[gguf-benchmarks](https://unsloth.ai/docs/zh/mo-xing/qwen3.5/gguf-benchmarks)
{% endcontent-ref %}

### Official Qwen benchmarks

#### Qwen3.6-35B-A3B
