# MiniMax-M2.7 - How to Run Locally

MiniMax-M2.7 is a new open model for agentic coding and chat use cases. It achieves SOTA performance on SWE-Pro (56.22%) and Terminal Bench 2 (57.0%).

The **230B parameter** (10B active) model is the successor to [MiniMax-M2.5](/docs/zh/mo-xing/tutorials/minimax-m25.md) and has a **200K context** window. Unquantized bf16 requires **457GB**. The Unsloth Dynamic **4-bit** GGUF reduces the size to **108GB** **(about -76%)**, so it can run on **128GB RAM** devices: [**MiniMax-M2.7 GGUF**](https://huggingface.co/unsloth/MiniMax-M2.7-GGUF)

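All uploads use Unsloth [Dynamic 2.0](/docs/zh/ji-chu/unsloth-dynamic-2.0-ggufs.md) for SOTA quantization performance, so important layers are upcast to higher precision (for example 8-bit or 16-bit). Thank you to MiniMax for providing day-zero access.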

{% hint style="success" %}
New MiniMax-M2.7 GGUF benchmarks are now available! [See them here](#gguf-benchmarks)
{% endhint %}

### :gear: Usage Guide

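The 4-bit dynamic quant `UD-IQ4_XS` uses **108GB** of disk space. This fits nicely on a **128GB unified memory Mac** at around 15+ tokens/s, and runs faster still with **1x16GB GPU and 96GB RAM** at 25+ tokens/s. The **2-bit** quants, up to the largest 2-bit variant, fit on a 96GB device.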

For near **full precision**, use `Q8_0` (8-bit), which takes 243GB, fits on a 256GB RAM device / Mac, and reaches 15+ tokens/s.
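As a rough sanity check on these sizes, the sketch below converts each file size into an average number of bits per parameter. It assumes the sizes quoted here are decimal gigabytes and that all 230B parameters are stored in the file; both are assumptions for illustration, not published figures. The helper name `bits_per_weight` is hypothetical.

```python
# Rough average bits per parameter for each format mentioned in this guide.
# Assumption: sizes are decimal GB (1e9 bytes) and cover all 230B parameters.
PARAMS_BILLIONS = 230

SIZES_GB = {
    "bf16": 457,
    "Q8_0": 243,
    "UD-IQ4_XS": 108,
}

def bits_per_weight(size_gb: float, params_billions: float = PARAMS_BILLIONS) -> float:
    """Average bits stored per parameter: (GB * 8 bits) / billions of parameters."""
    return size_gb * 8 / params_billions

for name, size_gb in SIZES_GB.items():
    print(f"{name}: {bits_per_weight(size_gb):.2f} bits per weight")
```

Under these assumptions the 4-bit dynamic quant works out to a little under 4 bits per weight on average, and `Q8_0` to roughly 8.5.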

{% hint style="success" %}
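For best performance, make sure your total available memory (VRAM + system RAM) exceeds the size of the quantized model file you are downloading. If it does not, llama.cpp can still run by offloading to SSD/HDD, but inference will be slower.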
{% endhint %}
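The memory rule above can be sketched as a small check. The sizes are the ones quoted in this guide; the function names are hypothetical, and the comparison ignores the extra memory needed for the context (KV cache) and the operating system, so treat a narrow pass as a maybe.

```python
# File sizes quoted in this guide, in GB.
QUANT_SIZES_GB = {"UD-IQ4_XS": 108, "Q8_0": 243}

def fits_in_memory(quant: str, vram_gb: float, ram_gb: float) -> bool:
    """True if VRAM + system RAM exceeds the quant's file size."""
    return vram_gb + ram_gb > QUANT_SIZES_GB[quant]

def largest_fitting_quant(vram_gb: float, ram_gb: float):
    """Return the largest listed quant that fits, or None if none does."""
    candidates = [q for q in QUANT_SIZES_GB if fits_in_memory(q, vram_gb, ram_gb)]
    return max(candidates, key=QUANT_SIZES_GB.get, default=None)

print(largest_fitting_quant(vram_gb=16, ram_gb=96))   # 112GB total
print(largest_fitting_quant(vram_gb=0, ram_gb=256))   # 256GB total
print(largest_fitting_quant(vram_gb=0, ram_gb=64))    # 64GB total
```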

### Recommended Settings

MiniMax recommends the following parameters for best performance: `temperature = 1.0`, `top_p = 0.95`, `top_k = 40`.

{% columns %}
{% column %}

| Default settings (most tasks) |
| ----------------------------- |
| `temperature = 1.0`           |
| `top_p = 0.95`                |
| `top_k = 40`                  |

{% endcolumn %}

{% column %}

* **Maximum context window:** `196,608`
* Default system prompt:

{% code overflow="wrap" %}

```
You are a helpful assistant. Your name is MiniMax-M2.7 and you are built by MiniMax.
```

{% endcode %}
{% endcolumn %}
{% endcolumns %}
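If you script your runs, the recommended defaults can be kept in one place and turned into llama.cpp flags. This is a minimal sketch: the flag names are the ones used in the commands on this page, while `RECOMMENDED_SAMPLING` and `sampling_args` are hypothetical names.

```python
# Recommended sampling defaults from this guide.
RECOMMENDED_SAMPLING = {"temperature": 1.0, "top_p": 0.95, "top_k": 40}

# Mapping from setting name to the llama.cpp command-line flag used in this guide.
LLAMA_CPP_FLAGS = {"temperature": "--temp", "top_p": "--top-p", "top_k": "--top-k"}

def sampling_args(settings=None) -> list:
    """Flatten the settings into a list of llama.cpp command-line arguments."""
    settings = RECOMMENDED_SAMPLING if settings is None else settings
    args = []
    for name, value in settings.items():
        args += [LLAMA_CPP_FLAGS[name], str(value)]
    return args

print(" ".join(sampling_args()))
```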

## Run MiniMax-M2.7 Tutorials:

To run MiniMax-M2.7 on a 128GB RAM device, we will use the 4-bit [`UD-IQ4_XS` quant](https://huggingface.co/unsloth/MiniMax-M2.7-GGUF?show_file_info=UD-IQ4_XS%2FMiniMax-M2.7-UD-IQ4_XS-00001-of-00004.gguf). You can now run it in [llama.cpp](#run-in-llama.cpp) and [Unsloth Studio](#run-in-unsloth-studio).

{% hint style="warning" %}
Do not use CUDA 13.2 to run any model, as it may cause gibberish or degraded outputs. NVIDIA is working on a fix.
{% endhint %}

### 🦥 Run in Unsloth Studio

MiniMax-M2.7 can now be run in [Unsloth Studio](/docs/zh/xin/studio.md), our new open-source web UI for local AI. Unsloth Studio lets you run models locally on **MacOS, Windows**, Linux and:

{% columns %}
{% column %}

* Search, download, [run GGUFs](/docs/zh/xin/studio.md#run-models-locally) and safetensor models
* [**Self-healing** tool calling](/docs/zh/xin/studio.md#execute-code--heal-tool-calling) + **web search**
* [**Code execution**](/docs/zh/xin/studio.md#run-models-locally) (Python, Bash)
* [Automatic inference](/docs/zh/xin/studio.md#model-arena) parameter tuning (temp, top-p, etc.)
* Fast CPU + GPU inference and CPU offloading via llama.cpp
  {% endcolumn %}

{% column %}

<div data-with-frame="true"><figure><img src="/files/2dfd7fbf0b551d243091cd1054c69104594c25d5" alt=""><figcaption></figcaption></figure></div>
{% endcolumn %}
{% endcolumns %}

{% stepper %}
{% step %}

#### Install Unsloth

Run in your terminal:

**MacOS, Linux, WSL:**

```bash
curl -fsSL https://unsloth.ai/install.sh | sh
```

**Windows PowerShell:**

```powershell
irm https://unsloth.ai/install.ps1 | iex
```

{% endstep %}

{% step %}

#### Launch Unsloth

**MacOS, Linux, WSL and Windows:**

```bash
unsloth studio -H 0.0.0.0 -p 8888
```

**Then open `http://localhost:8888` in your browser.**
{% endstep %}

{% step %}

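#### Search and download MiniMax-M2.7

On first launch you will need to create a password to secure your account and sign in again afterward. You will then see a brief onboarding wizard for choosing a model, dataset, and basic settings. You can skip it at any time.

You can choose `UD-IQ4_XS` (dynamic 4-bit quant) or other quantized versions such as `UD-Q4_K_XL`. If downloads get stuck, see [Hugging Face Hub, XET debugging](/docs/zh/ji-chu/troubleshooting-and-faqs/hugging-face-hub-xet-debugging.md).

Then go to the [Studio Chat](/docs/zh/xin/studio/chat.md) tab, search for MiniMax-M2.7 in the search bar, and download the model and quant you want. The files are large, so the download will take a while. For fast inference, make sure you have [enough RAM/VRAM](#usage-guide); otherwise inference will still work, but Unsloth will offload to your CPU.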

<div data-with-frame="true"><figure><img src="/files/e9e036f2445291b598f861bc299eb74ac42c3b46" alt=""><figcaption></figcaption></figure></div>
{% endstep %}

{% step %}

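#### Run MiniMax-M2.7

Inference parameters should be set automatically when using Unsloth Studio, but you can still change them manually. You can also edit the context length, chat template, and other settings.

For more information, see our [Unsloth Studio inference guide](/docs/zh/xin/studio/chat.md).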
{% endstep %}
{% endstepper %}

### ✨ Run in llama.cpp

{% hint style="warning" %}
Do not use CUDA 13.2 to run any model, as it may cause gibberish or degraded outputs. NVIDIA is working on a fix.
{% endhint %}

{% stepper %}
{% step %}
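Obtain the latest `llama.cpp` on [GitHub here](https://github.com/ggml-org/llama.cpp). You can also follow the build instructions below. Change `-DGGML_CUDA=ON` to `-DGGML_CUDA=OFF` if you don't have a GPU or only want CPU inference. **For Apple Mac / Metal devices**, set `-DGGML_CUDA=OFF` and continue as usual; Metal support is on by default.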

{% code overflow="wrap" %}

```bash
# Install build dependencies (Debian/Ubuntu)
apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
# Configure; use -DGGML_CUDA=OFF for CPU-only or Apple Metal builds
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON
# Build the tools used in this guide, then copy them next to the sources
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-mtmd-cli llama-server llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp
```

{% endcode %}
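The choice of `-DGGML_CUDA` described above can be expressed as a small helper when the build is scripted. This is a sketch under the assumption that CUDA should be enabled only when an NVIDIA GPU is present and the machine is not a Mac; `cmake_configure_args` is a hypothetical name, and the other arguments are copied from the build command.

```python
import platform

def cmake_configure_args(has_nvidia_gpu: bool, system: str = "") -> list:
    """Build the cmake configure command from the build instructions above.

    CUDA is enabled only when an NVIDIA GPU is present and the OS is not macOS;
    on macOS the guide says to set -DGGML_CUDA=OFF and rely on Metal.
    """
    system = system or platform.system()
    use_cuda = has_nvidia_gpu and system != "Darwin"
    return [
        "cmake", "llama.cpp", "-B", "llama.cpp/build",
        "-DBUILD_SHARED_LIBS=OFF",
        f"-DGGML_CUDA={'ON' if use_cuda else 'OFF'}",
    ]

print(" ".join(cmake_configure_args(has_nvidia_gpu=True, system="Linux")))
print(" ".join(cmake_configure_args(has_nvidia_gpu=False, system="Darwin")))
```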
{% endstep %}

{% step %}
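If you want to use `llama.cpp` to load the model directly, you can do the following; `:UD-IQ4_XS` is the quantization type. You can also download via Hugging Face (step 3). This is similar to `ollama run`. Use `export LLAMA_CACHE="folder"` to force `llama.cpp` to save to a specific location. Remember that the model has a maximum context length of 200K.

Follow this for **most default** use cases: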

```bash
export LLAMA_CACHE="unsloth/MiniMax-M2.7-GGUF"
./llama.cpp/llama-cli \
    -hf unsloth/MiniMax-M2.7-GGUF:UD-IQ4_XS \
    --temp 1.0 \
    --top-p 0.95 \
    --top-k 40
```

{% endstep %}

{% step %}
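Download the model (after installing the tools with `pip install huggingface_hub hf_transfer`). You can choose `UD-IQ4_XS` (dynamic 4-bit quant) or other quantized versions such as `UD-Q6_K_XL`. We recommend the 4-bit dynamic quant `UD-IQ4_XS` to balance size and accuracy. If downloads get stuck, see [Hugging Face Hub, XET debugging](/docs/zh/ji-chu/troubleshooting-and-faqs/hugging-face-hub-xet-debugging.md).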

```bash
hf download unsloth/MiniMax-M2.7-GGUF \
    --local-dir unsloth/MiniMax-M2.7-GGUF \
    --include "*UD-IQ4_XS*" # For 8-bit use "*Q8_0*"
```
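The same download can be done from Python with `huggingface_hub.snapshot_download`, which accepts glob patterns through `allow_patterns`. This is a minimal sketch mirroring the command above; `include_pattern` and `download_quant` are hypothetical helper names.

```python
def include_pattern(quant: str) -> str:
    """Glob that matches every file of one quant, e.g. '*UD-IQ4_XS*'."""
    return f"*{quant}*"

def download_quant(quant: str = "UD-IQ4_XS",
                   repo_id: str = "unsloth/MiniMax-M2.7-GGUF") -> str:
    """Download only the files for one quant and return the local folder path."""
    # Imported lazily so include_pattern works without huggingface_hub installed.
    from huggingface_hub import snapshot_download
    return snapshot_download(
        repo_id=repo_id,
        local_dir=repo_id,  # same folder as --local-dir in the command above
        allow_patterns=[include_pattern(quant)],
    )
```

Calling `download_quant("Q8_0")` would fetch the 8-bit files instead.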

{% endstep %}

{% step %}
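You can set `--threads 32` for the number of CPU threads, `--ctx-size 16384` for the context length, and `--n-gpu-layers 2` for the number of layers offloaded to the GPU. Lower `--n-gpu-layers` if your GPU runs out of memory, and omit it for CPU-only inference.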

{% code overflow="wrap" %}

```bash
./llama.cpp/llama-cli \
    --model unsloth/MiniMax-M2.7-GGUF/UD-IQ4_XS/MiniMax-M2.7-UD-IQ4_XS-00001-of-00004.gguf \
    --temp 1.0 \
    --top-p 0.95 \
    --top-k 40
```

{% endcode %}
{% endstep %}
{% endstepper %}
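To show how the tuning flags combine with the base command, here is a hedged Python sketch that assembles the `llama-cli` invocation. The flag names and model path come from this page; the values 32, 16384 and 2 are the example values from the step above, not tuned recommendations, and the function names are hypothetical.

```python
import subprocess

MODEL_PATH = "unsloth/MiniMax-M2.7-GGUF/UD-IQ4_XS/MiniMax-M2.7-UD-IQ4_XS-00001-of-00004.gguf"

def llama_cli_command(threads=32, ctx_size=16384, n_gpu_layers=2, model=MODEL_PATH):
    """Assemble the llama-cli invocation; pass n_gpu_layers=None for CPU-only."""
    cmd = [
        "./llama.cpp/llama-cli",
        "--model", model,
        "--threads", str(threads),
        "--ctx-size", str(ctx_size),
        "--temp", "1.0", "--top-p", "0.95", "--top-k", "40",
    ]
    if n_gpu_layers is not None:
        cmd += ["--n-gpu-layers", str(n_gpu_layers)]
    return cmd

def run_llama_cli(**kwargs):
    """Launch llama-cli with the assembled arguments (requires the built binary)."""
    return subprocess.run(llama_cli_command(**kwargs), check=True)

# Print the CPU-only variant without running it.
print(" ".join(llama_cli_command(n_gpu_layers=None)))
```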

#### 🦙 Llama-server & OpenAI's completion library

To deploy MiniMax-M2.7 for production, we use `llama-server`, which serves an OpenAI-compatible API. In a new terminal, for example via tmux, deploy the model with:

{% code overflow="wrap" %}

```bash
./llama.cpp/llama-server \
    --model unsloth/MiniMax-M2.7-GGUF/UD-IQ4_XS/MiniMax-M2.7-UD-IQ4_XS-00001-of-00004.gguf \
    --alias "unsloth/MiniMax-M2.7" \
    --prio 3 \
    --temp 1.0 \
    --top-p 0.95 \
    --min-p 0.01 \
    --top-k 40 \
    --port 8001
```

{% endcode %}

Then in a new terminal, after running `pip install openai`, run:

{% code overflow="wrap" %}

```python
from openai import OpenAI

# Point the client at the local llama-server; no real API key is needed
openai_client = OpenAI(
    base_url = "http://127.0.0.1:8001/v1",
    api_key = "sk-no-key-required",
)
completion = openai_client.chat.completions.create(
    model = "unsloth/MiniMax-M2.7",  # matches the --alias passed to llama-server
    messages = [{"role": "user", "content": "Create a Snake game."},],
)
print(completion.choices[0].message.content)
```

{% endcode %}
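If you would rather not install the `openai` package, the same endpoint can be called with the Python standard library. This is a sketch assuming the server started above is listening on port 8001; `build_chat_payload` and `chat` are hypothetical helper names, and only `temperature` and `top_p` are sent per request, leaving the remaining sampling settings to the server's own flags.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8001"

def build_chat_payload(prompt: str, model: str = "unsloth/MiniMax-M2.7") -> dict:
    """Request body for the OpenAI-style /v1/chat/completions endpoint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 1.0,
        "top_p": 0.95,
    }

def chat(prompt: str, timeout: float = 600.0) -> str:
    """POST the payload to the local server and return the reply text."""
    request = urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(build_chat_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Create a Snake game."))` would print the model's reply.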

## 📊 Benchmarks

### GGUF Benchmarks

Below are the 99% KLD benchmarks for MiniMax-M2.7. Closer to the bottom-left is better:

<figure><img src="/files/f2a237398524807c82010f28f74d6245736885fb" alt=""><figcaption></figcaption></figure>
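For readers unfamiliar with the metric, here is a toy sketch of how such a number could be computed. It assumes "99% KLD" means the 99th percentile of the per-token KL divergence between the original and quantized models' next-token distributions; this page does not define the metric, so treat that reading as an assumption. The distributions below are made-up numbers over a three-word vocabulary.

```python
import math

def kl_divergence(p: list, q: list) -> float:
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def percentile(values: list, pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# Toy per-token distributions (illustrative only, not benchmark data).
reference = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]]
quantized = [[0.6, 0.3, 0.1], [0.1, 0.8, 0.1], [0.5, 0.2, 0.3]]

per_token_kld = [kl_divergence(p, q) for p, q in zip(reference, quantized)]
print("99% KLD:", percentile(per_token_kld, 99))
```

A lower value means that even the worst-affected tokens stay close to the original model's predictions.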

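Because MiniMax-M2.7 uses the same architecture as MiniMax-M2.5, GGUF quantization benchmarks for M2.7 should be very similar to those for M2.5. We therefore also refer to our earlier quantization benchmarks for M2.5: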

<figure><img src="/files/fb33d5f655dfe59134b3cc15a5571f8854926e28" alt=""><figcaption></figcaption></figure>

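[Benjamin Marie (third-party) benchmarked](https://x.com/bnjmn_marie/status/2027043753484021810/photo/1) **MiniMax-M2.5** using **Unsloth GGUF quantizations** on a **750-prompt mixed suite** (LiveCodeBench v6, MMLU Pro, GPQA, Math500), reporting both **overall accuracy** and **relative error increase** (how much more often the quantized model makes mistakes than the original).

Unsloth quants, regardless of precision, perform much better than their non-Unsloth counterparts on both accuracy and relative error (despite being 8GB smaller).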

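**Key results:**

* **Best quality/size tradeoff here: `unsloth UD-Q4_K_XL`.**\
  It is the closest to the original: only **6.0 points** down, and "only" **+22.8%** more errors than baseline.
* **Other Unsloth Q4 quants perform closely together (about 64.5–64.9 accuracy).**\
  `IQ4_NL`, `MXFP4_MOE`, and `UD-IQ2_XXS` are basically the same quality on this benchmark, with **about 33–35%** more errors than the original.
* Unsloth GGUFs perform much better than other non-Unsloth GGUFs, for example see `lmstudio-community - Q4_K_M` (despite being 8GB smaller) and `AesSedai - IQ3_S`.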
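To make "relative error increase" concrete, the sketch below assumes it is defined as the growth in error rate (100 minus accuracy) relative to the original model's error rate. Under that assumption, the 6.0-point drop and +22.8% quoted for `UD-Q4_K_XL` together imply an approximate original accuracy. The implied value is back-computed for illustration, not read from the benchmark, and the function names are hypothetical.

```python
def relative_error_increase(original_acc: float, quantized_acc: float) -> float:
    """Fractional growth in error rate when moving from original to quantized.

    Accuracies are percentages, so the error rate is 100 - accuracy.
    """
    original_err = 100.0 - original_acc
    quantized_err = 100.0 - quantized_acc
    return (quantized_err - original_err) / original_err

def implied_original_accuracy(points_dropped: float, rel_increase: float) -> float:
    """Back out the original accuracy from a drop in points and its relative increase."""
    original_err = points_dropped / rel_increase
    return 100.0 - original_err

# Back-computed from the UD-Q4_K_XL figures above (6.0 points, +22.8%); an estimate only.
baseline = implied_original_accuracy(6.0, 0.228)
print(f"implied original accuracy: {baseline:.1f}%")
print(f"check: {relative_error_increase(baseline, baseline - 6.0):.3f}")
```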

### Official Benchmarks

<figure><img src="/files/10d8b0685f9b9ebc9327dc53223901da5b345377" alt=""><figcaption></figcaption></figure>

