# Kimi K2.6 - How to Run Locally

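Kimi K2.6 is an open-source model from Moonshot that delivers SOTA performance on vision, coding, agentic, long-context, and chat tasks. This 1T-parameter hybrid thinking model has a 256K context length and needs 610GB of disk space at full precision, while Dynamic 2-bit needs **350GB (-43% size)**. Run Kimi K2.6 with the Unsloth Dynamic [**Kimi-K2.6-GGUFs**](https://huggingface.co/unsloth/Kimi-K2.6-GGUF) in Unsloth Studio or llama.cpp.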

**Dynamic 2-bit** upcasts important layers to 8-bit and needs a setup with **350GB+ of VRAM/RAM**. For **lossless** Kimi K2.6, use Q8 (`UD-Q8_K_XL`), which is only **10GB larger** than Q4 (`UD-Q4_K_XL`). All uploads use [Dynamic 2.0](/docs/zh/ji-chu/unsloth-dynamic-2.0-ggufs.md) for SOTA quantization performance. The Kimi-K2.6 GGUFs also **support vision.**

**Table: Hardware requirements** (unit = total memory: RAM + VRAM, or unified memory)

| Metric     | Dynamic 2-bit | Q4     | Q8 (lossless) |
| ---------- | ------------- | ------ | ------------- |
| Disk space | 340 GB        | 584 GB | 595 GB        |
| Perplexity | 2.4131        | 1.8420 | 1.8419        |

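### 📊 Quantization Analysis

`UD-Q8_K_XL` is lossless because Kimi uses int4 for the MoE weights and BF16 for everything else, and `Q8_K_XL` follows this. `UD-Q4_K_XL` is similar, except the remaining tensors are `Q8_0`, so it is close to full precision and needs 600GB of RAM/VRAM. GGUFs from other, non-Unsloth providers may use the `UD-Q4_K_XL` approach rather than the "truly lossless" `UD-Q8_K_XL`.

We followed [jukofyork](https://github.com/jukofyork)'s finding of using `const float d = max / -7;` instead of the default `const float d = max / -8;`, applied only when quantizing the MoE layers. This bijective patch for INT4-native MoEs lets the `Q4_0` quant type reduce absolute error from 1.8% to nearly 0% (epsilon).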
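
As a rough illustration of why the divisor matters, the sketch below models a Q4_0-style block quantizer in plain Python. It is a simplified stand-in, not the llama.cpp routine or the actual patch: the function names `q4_0_roundtrip` and `max_abs_error`, the scale `0.5`, and the sample block are all assumptions chosen for illustration, and the block is assumed to contain a weight at the edge of a symmetric INT4 range (±7).

```python
def q4_0_roundtrip(block, divisor):
    """Quantize then dequantize one block, Q4_0 style (simplified sketch)."""
    # Take the element with the largest magnitude, keeping its sign.
    signed_max = max(block, key=abs)
    d = signed_max / -divisor  # block scale: 8 is the default, 7 the patched value
    if d == 0:
        return [0.0] * len(block)
    inv_d = 1.0 / d
    restored = []
    for x in block:
        q = min(15, int(x * inv_d + 8.5))  # 4-bit code in 0..15
        restored.append((q - 8) * d)       # dequantized value
    return restored


def max_abs_error(block, divisor):
    """Worst-case absolute round-trip error over the block."""
    restored = q4_0_roundtrip(block, divisor)
    return max(abs(a - b) for a, b in zip(block, restored))


# Hypothetical INT4-native block: integers in [-7, 7] times a scale of 0.5.
scale = 0.5
block = [k * scale for k in (-7, -3, 0, 2, 5, 7, 1, -6)]

err_default = max_abs_error(block, 8)  # quantization grid misses the INT4 values
err_patched = max_abs_error(block, 7)  # quantization grid lines up with them
print(err_default, err_patched)
```

In this toy block the default divisor leaves a nonzero worst-case error, while the patched divisor reproduces every value exactly. Blocks that do not contain a weight at ±7 would not necessarily behave this way.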

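We still had to keep the other layers in BF16, and the plot below shows their error relative to the BF16 baseline. `UD-Q8_K_XL` is indeed "lossless" when converting Q4\_0 to BF16, with only machine-epsilon-level differences. The perplexity of `UD-Q8_K_XL` is 1.8419 ± 0.00721, and that of `UD-Q4_K_XL` is 1.8420 ± 0.00720. Note that the error plot below shows RMSE divided by the bfloat16 epsilon, so it is on a very small error scale.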

<div data-with-frame="true"><figure><img src="/files/6d4f704bd88eb4f640abe42f591cc2d05f1c622f" alt=""><figcaption><p>Comparing <code>Q4_K_XL</code> (blue) with <code>Q8_K_XL</code> (orange): the latter is lossless and 10GB larger.</p></figcaption></figure></div>

### :gear: Usage Guide

**Thinking mode and non-thinking mode need different settings:**

| Default (thinking mode) | Instant mode      |
| ----------------------- | ----------------- |
| temperature = 1.0       | temperature = 0.6 |
| top\_p = 0.95           | top\_p = 0.95     |

* Recommended context length = `98,304` (up to `262,144`)

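If the model fits, you will get >40 tokens/s on a B200. We recommend `UD-Q2_K_XL` (350GB) as a good size/quality balance. Best rule of thumb: RAM+VRAM ≈ quant size; otherwise it will still run, just more slowly because of offloading.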
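
The rule of thumb can be written as a quick check. This is a minimal sketch: the function name `fits_in_memory` and the example machine sizes are hypothetical, and it ignores the extra memory needed for the KV cache and runtime overhead.

```python
def fits_in_memory(quant_size_gb, ram_gb, vram_gb=0.0):
    """Return True when RAM + VRAM covers the quant's size (rule of thumb only)."""
    return ram_gb + vram_gb >= quant_size_gb


UD_Q2_K_XL_GB = 350  # size quoted in this guide

# Hypothetical machines: 256GB RAM + 96GB VRAM, and 128GB RAM + 24GB VRAM.
big_box = fits_in_memory(UD_Q2_K_XL_GB, ram_gb=256, vram_gb=96)
small_box = fits_in_memory(UD_Q2_K_XL_GB, ram_gb=128, vram_gb=24)
print(big_box, small_box)
```

A `False` result does not mean the model cannot run, only that offloading will make it slower.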

#### Chat template for Kimi K2.6

Running `tokenizer.apply_chat_template([{"role": "user", "content": "What is 1+1?"},])` gives:

{% code overflow="wrap" %}

```
<|im_system|>system<|im_middle|>You are Kimi, an AI assistant created by Moonshot AI.<|im_end|><|im_user|>user<|im_middle|>What is 1+1?<|im_end|><|im_assistant|>assistant<|im_middle|><think>
```

{% endcode %}
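
To make the layout of that prompt explicit, here is a hand-written sketch that rebuilds the same layout in plain Python. It is not the model's real chat template, which ships with the tokenizer and handles more cases. `render_prompt` and `DEFAULT_SYSTEM` are illustrative names, and only plain-text `system`, `user`, and `assistant` turns in thinking mode are covered.

```python
DEFAULT_SYSTEM = "You are Kimi, an AI assistant created by Moonshot AI."


def render_prompt(messages):
    """Rebuild the prompt layout shown above (illustrative, not the official template)."""
    # Prepend the default system message when the caller did not supply one.
    if not messages or messages[0]["role"] != "system":
        messages = [{"role": "system", "content": DEFAULT_SYSTEM}] + list(messages)
    parts = []
    for m in messages:
        role = m["role"]
        # Each turn: role marker, role name, separator, content, end marker.
        parts.append(f"<|im_{role}|>{role}<|im_middle|>{m['content']}<|im_end|>")
    # Generation prompt: open the assistant turn and a <think> block.
    parts.append("<|im_assistant|>assistant<|im_middle|><think>")
    return "".join(parts)


prompt = render_prompt([{"role": "user", "content": "What is 1+1?"}])
print(prompt)
```

For real use, prefer the tokenizer's own `apply_chat_template`, since it stays in sync with the model.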

## Kimi K2.6 Run Guide

### 🦥 Run Kimi-K2.6 in Unsloth Studio

Kimi K2.6 can run in [Unsloth Studio](/docs/zh/xin/studio.md), an open-source web UI for local AI. **Unsloth Studio automatically offloads to RAM and detects multi-GPU setups**. With Unsloth Studio, you can run models locally on **MacOS, Windows**, and Linux, and:

{% columns %}
{% column %}

* Search, download, and [run GGUFs](/docs/zh/xin/studio.md#run-models-locally) and safetensor models
* [**Self-healing** tool calling](/docs/zh/xin/studio.md#execute-code--heal-tool-calling) + **web search**
* [**Code execution**](/docs/zh/xin/studio.md#run-models-locally) (Python, Bash)
* [Automatic inference](/docs/zh/xin/studio.md#model-arena) parameter tuning (temp, top-p, etc.)
* Fast CPU + GPU inference via llama.cpp
* [Train LLMs](/docs/zh/xin/studio.md#no-code-training) 2x faster with 70% less VRAM
  {% endcolumn %}

{% column %}

<div data-with-frame="true"><figure><img src="/files/5af4df407c8134f1ff75a4d7535569361c049e51" alt=""><figcaption></figcaption></figure></div>
{% endcolumn %}
{% endcolumns %}

{% stepper %}
{% step %}
**Install and launch Unsloth**

To install, run in your terminal:

MacOS, Linux, WSL:

```bash
curl -fsSL https://unsloth.ai/install.sh | sh
```

Windows PowerShell:

```powershell
irm https://unsloth.ai/install.ps1 | iex
```

**Launch Unsloth**

MacOS, Linux, WSL, and Windows:

```bash
unsloth studio -H 0.0.0.0 -p 8888
```

Then open `http://localhost:8888` in your browser.
{% endstep %}

{% step %}
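**Search for and download Kimi-K2.6**

On first launch, you need to create a password to secure your account, and you will use it to log in again afterwards.

Then go to the [Studio Chat](/docs/zh/xin/studio/chat.md) tab, search for **Kimi-K2.6**, and download the model and quant you want. Make sure you have enough compute to run the model.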

<div data-with-frame="true"><figure><img src="/files/4a9b58e74c0dcb530a3defb93176bb456955f534" alt="" width="563"><figcaption></figcaption></figure></div>
{% endstep %}

{% step %}
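**Run Kimi-K2.6**

When using Unsloth Studio, inference parameters should be set automatically, but you can still change them manually. You can also edit the context length, chat template, and other settings.

For more information, see our [Unsloth Studio inference guide](/docs/zh/xin/studio/chat.md).

<div data-with-frame="true"><figure><img src="/files/23fb40df7f876a1d75f48943c38802ab9e31027f" alt="" width="563"><figcaption><p>Example of running Qwen3.6 with tool calling</p></figcaption></figure></div>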
{% endstep %}
{% endstepper %}

### 🦙 Run Kimi K2.6 in llama.cpp

In this guide we will run the UD-Q2\_K\_XL quant, which needs at least 350GB of RAM. Feel free to change the quant type. GGUF: [**Kimi-K2.6-GGUF**](https://huggingface.co/unsloth/Kimi-K2.6-GGUF)

For these tutorials, we will use [llama.cpp](https://github.com/ggml-org/llama.cpp) for fast local inference, especially if you are running on a CPU.

{% stepper %}
{% step %}
Get the latest `llama.cpp` on [**GitHub here**](https://github.com/ggml-org/llama.cpp). You can also follow the build instructions below. Change `-DGGML_CUDA=ON` to `-DGGML_CUDA=OFF` if you don't have a GPU or only want CPU inference. **For Apple Mac / Metal devices**, set `-DGGML_CUDA=OFF` and continue as usual; Metal support is on by default.

```bash
apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-mtmd-cli llama-server llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp
```

{% endstep %}

{% step %}
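You can now use `llama.cpp` to download and load the model directly, much like `ollama run`. First, choose the quant type you want, for example `Q2_K_XL`. Also use `export LLAMA_CACHE="folder"` to force `llama.cpp` to save to a specific location. Note that this download can be very slow, so the manual download flow described below is usually the better option.

Use one of the commands below, depending on your use case:

**Thinking mode:**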

```bash
export LLAMA_CACHE="unsloth/Kimi-K2.6-GGUF"
./llama.cpp/llama-cli \
    -hf unsloth/Kimi-K2.6-GGUF:UD-Q2_K_XL \
    --temp 1.0 \
    --top-p 0.95
```

**Non-thinking mode (instant):**

```bash
export LLAMA_CACHE="unsloth/Kimi-K2.6-GGUF"
./llama.cpp/llama-cli \
    -hf unsloth/Kimi-K2.6-GGUF:UD-Q2_K_XL \
    --temp 0.6 \
    --top-p 0.95 \
    --chat-template-kwargs '{"enable_thinking":false}'
```
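
The two commands differ only in their sampling flags, which can be captured in a small helper. The sketch below builds the `llama-cli` argument list from the settings in this guide. `SAMPLING_PRESETS` and `build_llama_cli_args` are hypothetical names, and nothing here runs the model.

```python
import json

# Sampling settings from the usage guide, keyed by mode.
SAMPLING_PRESETS = {
    "thinking": {"temp": 1.0, "top_p": 0.95, "enable_thinking": True},
    "instant": {"temp": 0.6, "top_p": 0.95, "enable_thinking": False},
}


def build_llama_cli_args(mode, repo="unsloth/Kimi-K2.6-GGUF", quant="UD-Q2_K_XL"):
    """Return the llama-cli argument list for a mode, mirroring the commands above."""
    preset = SAMPLING_PRESETS[mode]
    args = [
        "./llama.cpp/llama-cli",
        "-hf", f"{repo}:{quant}",
        "--temp", str(preset["temp"]),
        "--top-p", str(preset["top_p"]),
    ]
    if not preset["enable_thinking"]:
        # Thinking is the default, so only instant mode needs the extra flag.
        kwargs = json.dumps({"enable_thinking": False}, separators=(",", ":"))
        args += ["--chat-template-kwargs", kwargs]
    return args


print(build_llama_cli_args("instant"))
```

The resulting list can be handed to `subprocess.run(args, check=True)` once the binary has been built.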

{% endstep %}

{% step %}
If you want to download the model manually, you can do so with the command below after running `pip install huggingface_hub`. If the download gets stuck, see: [Hugging Face Hub, XET debugging](/docs/zh/ji-chu/troubleshooting-and-faqs/hugging-face-hub-xet-debugging.md)

```bash
hf download unsloth/Kimi-K2.6-GGUF \
    --local-dir unsloth/Kimi-K2.6-GGUF \
    --include "*mmproj-F16*" \
    --include "*UD-Q2_K_XL*" # Use "*UD-Q8_K_XL*" for full precision
```
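
The same download can also be scripted from Python. This sketch assumes the `snapshot_download` function from the `huggingface_hub` package and its `allow_patterns` argument. `patterns_for` and `download_quant` are hypothetical helper names, and the import is deferred so that nothing is downloaded until `download_quant` is called.

```python
def patterns_for(quant="UD-Q2_K_XL", with_vision=True):
    """Glob patterns matching the --include flags in the command above."""
    patterns = [f"*{quant}*"]
    if with_vision:
        patterns.insert(0, "*mmproj-F16*")  # vision projector file
    return patterns


def download_quant(quant="UD-Q2_K_XL", local_dir="unsloth/Kimi-K2.6-GGUF"):
    """Download one quant (several hundred GB) into local_dir."""
    # Deferred import: requires `pip install huggingface_hub`.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id="unsloth/Kimi-K2.6-GGUF",
        local_dir=local_dir,
        allow_patterns=patterns_for(quant),
    )


print(patterns_for())
```

Calling `download_quant("UD-Q8_K_XL")` would request the full-precision files instead.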

{% endstep %}

{% step %}
Then run the model in conversation mode:

{% code overflow="wrap" %}

```bash
./llama.cpp/llama-cli \
    --model unsloth/Kimi-K2.6-GGUF/UD-Q2_K_XL/Kimi-K2.6-UD-Q2_K_XL-00001-of-0008.gguf \
    --mmproj unsloth/Kimi-K2.6-GGUF/mmproj-F16.gguf \
    --temp 1.0 \
    --top-p 0.95
```

{% endcode %}
{% endstep %}
{% endstepper %}

### 📊 Benchmarks

You can see more benchmarks in table form below:

<div data-with-frame="true"><figure><img src="/files/e9fdad6e5a8057007bf05c9efb3d3dc464977446" alt="" width="563"><figcaption></figcaption></figure></div>

