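# GLM-5.1 - How to Run Locally

GLM-5.1 is Z.ai's new open model. Compared with [GLM-5](https://unsloth.ai/docs/zh/mo-xing/tutorials/glm-5), it brings major improvements in coding, agentic tool use, reasoning, role-play, long-horizon agentic tasks, and overall chat quality.

The full 744B-parameter (40B active) GLM-5.1 model has a **200K context** window and requires **1.65TB** of disk space. The Unsloth Dynamic 2-bit GGUF reduces the size to **220GB** **(-80%)**, and the Dynamic **1-bit is 200GB (-85%):** [**GLM-5.1-GGUF**](https://huggingface.co/unsloth/GLM-5.1-GGUF)

All uploads use Unsloth [Dynamic 2.0](https://unsloth.ai/docs/zh/ji-chu/unsloth-dynamic-2.0-ggufs) for state-of-the-art quantization performance, so lower-bit quants keep important layers upcast to 8 or 16 bits. Thank you to Z.ai for giving Unsloth day-zero access.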

{% hint style="info" %}
**Do not** use the CUDA 13.2 runtime with any GGUF, as it leads to poor output quality.
{% endhint %}

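#### :gear: Usage Guide

The medium 2-bit dynamic quant `UD-IQ2_M` uses **236GB** of disk space. It fits directly on a **Mac with 256GB of unified memory**, and also works well on **1x24GB GPU** plus **256GB of RAM** with MoE offloading. The **1-bit** quant fits in 220GB of RAM, while 8-bit requires 805GB of RAM.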

{% hint style="success" %}
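For best performance, make sure your total available memory (VRAM + system RAM) exceeds the size of the quantized model file you download. If it does not, llama.cpp can still run by offloading to SSD/HDD, but inference will be slower.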
{% endhint %}
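
As a rough illustration of the rule above, the check below compares total memory against a quant's file size. The helper name `fits_in_memory` is hypothetical, and the sketch assumes the file size is the only cost; in practice the KV cache and runtime buffers need extra headroom.

```python
def fits_in_memory(quant_size_gb: float, vram_gb: float, ram_gb: float) -> bool:
    """Return True when VRAM + system RAM exceeds the quant's file size."""
    return (vram_gb + ram_gb) > quant_size_gb

# Figures quoted above: UD-IQ2_M is 236GB; one 24GB GPU plus 256GB of RAM.
print(fits_in_memory(236, vram_gb=24, ram_gb=256))  # True
# A hypothetical smaller machine: the same quant would spill over to disk.
print(fits_in_memory(236, vram_gb=24, ram_gb=128))  # False
```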

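### Recommended Settings

Use different settings for different use cases:

| Default settings (most tasks) | Terminal Bench         |
| ----------------------------- | ---------------------- |
| `temperature` = 1.0           | `temperature` = 0.7    |
| `top_p` = 0.95                | `top_p` = 1.0          |
| Max new tokens = 131072       | Max new tokens = 16384 |

* **Maximum context window:** `202,752`.
* Thinking is enabled by default in GLM-5.1. To disable thinking: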

{% code expandable="true" %}

```bash
    --chat-template-kwargs '{"enable_thinking":false}'
```

{% endcode %}
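
If you launch llama.cpp from a script, the two presets in the table above can be kept in one place. The sketch below is only an illustration: `PRESETS` and `sampling_flags` are hypothetical names, and the only flags it emits are the `--temp`, `--top-p` and `--chat-template-kwargs` flags already used on this page.

```python
import json

# Values copied from the recommended-settings table above.
PRESETS = {
    "default": {"temperature": 1.0, "top_p": 0.95, "max_new_tokens": 131072},
    "terminal_bench": {"temperature": 0.7, "top_p": 1.0, "max_new_tokens": 16384},
}

def sampling_flags(preset: str, thinking: bool = True) -> list[str]:
    """Build llama.cpp sampling flags for a named preset (hypothetical helper)."""
    p = PRESETS[preset]
    flags = ["--temp", str(p["temperature"]), "--top-p", str(p["top_p"])]
    if not thinking:
        # Same kwargs as the snippet above, serialized without spaces.
        kwargs = json.dumps({"enable_thinking": False}, separators=(",", ":"))
        flags += ["--chat-template-kwargs", kwargs]
    return flags

print(sampling_flags("terminal_bench", thinking=False))
```

The returned list can be appended to a `subprocess.run([...])` argument list, which avoids shell-quoting issues with the JSON string.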

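#### Chat Template Updates

GLM-5.1 uses the same architecture as GLM-5; only `chat_template.jinja` differs.

* Supports Claude's search tool. Tools with `defer_loading=True` do not appear in the system prompt; they are shown in tool results instead.
* Allows empty reasoning blocks (`<think></think>`) in assistant messages. Consecutive assistant messages must stay in the same mode, either thinking or non-thinking.
* Overall, GLM-5.1 mainly improves tool exposure, reasoning-history reconstruction, and tool-message rendering.

## Run GLM-5.1 Tutorials:

You can now run GLM-5.1 in [llama.cpp](#run-in-llama.cpp) and [Unsloth Studio](#run-in-unsloth-studio).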

### 🦥 Run in Unsloth Studio

GLM-5.1 can now be run in [Unsloth Studio](https://unsloth.ai/docs/zh/xin-zeng/studio), our new open-source web UI for local AI. Unsloth Studio runs models locally on **MacOS, Windows**, and Linux, and lets you:

{% columns %}
{% column %}

* Search, download, and [run GGUFs](https://unsloth.ai/docs/zh/xin-zeng/studio#run-models-locally) and safetensor models
* [**Self-healing** tool calling](https://unsloth.ai/docs/zh/xin-zeng/studio#execute-code--heal-tool-calling) + **web search**
* [**Code execution**](https://unsloth.ai/docs/zh/xin-zeng/studio#run-models-locally) (Python, Bash)
* [Automatic inference](https://unsloth.ai/docs/zh/xin-zeng/studio#model-arena) parameter tuning (temp, top-p, etc.)
* Fast CPU + GPU inference via llama.cpp, with CPU offloading
  {% endcolumn %}

{% column %}

<div data-with-frame="true"><figure><img src="https://2657992854-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FxhOjnexMCB3dmuQFQ2Zq%2Fuploads%2FstfdTMsoBMmsbQsgQ1Ma%2Flandscape%20clip%20gemma.gif?alt=media&#x26;token=eec5f2f7-b97a-4c1c-ad01-5a041c3e4013" alt=""><figcaption></figcaption></figure></div>
{% endcolumn %}
{% endcolumns %}

{% stepper %}
{% step %}

#### Install Unsloth

Run in your terminal:

**MacOS, Linux, WSL:**

```bash
curl -fsSL https://unsloth.ai/install.sh | sh
```

**Windows PowerShell:**

```powershell
irm https://unsloth.ai/install.ps1 | iex
```

{% endstep %}

{% step %}

#### Launch Unsloth

**MacOS, Linux, WSL and Windows:**

```bash
unsloth studio -H 0.0.0.0 -p 8888
```

**Then open `http://localhost:8888` in your browser.**
{% endstep %}

{% step %}

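#### Search and download GLM-5.1

On first launch, you will need to create a password to secure your account, and use it to sign in again later. You will then see a brief onboarding wizard for choosing a model, dataset, and basic settings. You can skip it at any time.

You can choose `UD-Q2_K_XL` (dynamic 2-bit quant) or other quantized versions such as `UD-Q4_K_XL`. We <mark style="background-color:green;">**recommend our 2-bit dynamic quant**</mark><mark style="background-color:green;">**&#x20;**</mark><mark style="background-color:green;">**`UD-Q2_K_XL`**</mark><mark style="background-color:green;">**&#x20;**</mark><mark style="background-color:green;">**to balance size and accuracy**</mark>. If downloads get stuck, see [hugging-face-hub-xet-debugging](https://unsloth.ai/docs/zh/ji-chu/troubleshooting-and-faqs/hugging-face-hub-xet-debugging "mention").

Then go to the [Studio Chat](https://unsloth.ai/docs/zh/xin-zeng/studio/chat) tab, search for GLM-5.1 in the search bar, and download the model and quant you want. Because of its size, the download will take a while, so please be patient. For fast inference, make sure you have [enough RAM/VRAM](#usage-guide); otherwise inference will still work, but Unsloth will offload to your CPU.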

<div data-with-frame="true"><figure><img src="https://2657992854-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FxhOjnexMCB3dmuQFQ2Zq%2Fuploads%2Fkmkcl9FVLkAua8UPLnUz%2FScreenshot%202026-04-07%20at%2010.05.26%E2%80%AFAM.png?alt=media&#x26;token=2794e092-a4f2-4209-9b21-1a2410c2631b" alt="" width="563"><figcaption></figcaption></figure></div>
{% endstep %}

{% step %}

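#### Run GLM-5.1

When using Unsloth Studio, inference parameters should be set automatically, but you can still change them manually. You can also edit the context length, chat template, and other settings.

For more information, see our [Unsloth Studio inference guide](https://unsloth.ai/docs/zh/xin-zeng/studio/chat).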
{% endstep %}
{% endstepper %}

### 🦙 Run in llama.cpp

{% stepper %}
{% step %}
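Obtain the latest `llama.cpp` on [**GitHub here**](https://github.com/ggml-org/llama.cpp). You can also follow the build instructions below. Change `-DGGML_CUDA=ON` to `-DGGML_CUDA=OFF` if you do not have a GPU or only want CPU inference. **For Apple Mac / Metal devices**, set `-DGGML_CUDA=OFF` and continue as usual; Metal support is on by default.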

```bash
apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-mtmd-cli llama-server llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp
```

{% endstep %}

{% step %}
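If you want to use `llama.cpp` to load the model directly, you can use the commands below. The suffix after the colon (`:UD-IQ2_M`) is the quantization type. You can also download via Hugging Face (step 3). This is similar to `ollama run`. Use `export LLAMA_CACHE="folder"` to force `llama.cpp` to save to a specific location. Remember that the model has a maximum context length of only 200K.

Follow this for **general instruction** use cases: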

```bash
export LLAMA_CACHE="unsloth/GLM-5.1-GGUF"
./llama.cpp/llama-cli \
    -hf unsloth/GLM-5.1-GGUF:UD-IQ2_M \
    --ctx-size 16384 \
    --temp 0.7 \
    --top-p 1.0
```

Follow this for **tool-calling** use cases:

```bash
export LLAMA_CACHE="unsloth/GLM-5.1-GGUF"
./llama.cpp/llama-cli \
    -hf unsloth/GLM-5.1-GGUF:UD-IQ2_M \
    --ctx-size 16384 \
    --temp 1.0 \
    --top-p 0.95
```

{% endstep %}

{% step %}
Download the model with the commands below (after running `pip install huggingface_hub hf_transfer`). You can choose `UD-Q2_K_XL` (dynamic 2-bit quant) or other quantized versions such as `UD-Q4_K_XL`. We <mark style="background-color:green;">**recommend our 2-bit dynamic quant**</mark><mark style="background-color:green;">**&#x20;**</mark><mark style="background-color:green;">**`UD-Q2_K_XL`**</mark><mark style="background-color:green;">**&#x20;**</mark><mark style="background-color:green;">**to balance size and accuracy**</mark>. If downloads get stuck, see [hugging-face-hub-xet-debugging](https://unsloth.ai/docs/zh/ji-chu/troubleshooting-and-faqs/hugging-face-hub-xet-debugging "mention").

```bash
pip install -U huggingface_hub
hf download unsloth/GLM-5.1-GGUF \
    --local-dir unsloth/GLM-5.1-GGUF \
    --include "*UD-IQ2_M*" # Use "*UD-TQ1_0*" for Dynamic 1-bit
```
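
If you prefer to script the download in Python instead of the `hf download` CLI, a rough equivalent could look like the sketch below. It assumes the `huggingface_hub` package installed above and uses its `snapshot_download` function; the helper names `quant_pattern` and `download_quant` are hypothetical.

```python
def quant_pattern(quant: str) -> str:
    """Glob pattern matching every shard of one quant, e.g. '*UD-IQ2_M*'."""
    return f"*{quant}*"

def download_quant(quant: str = "UD-IQ2_M", local_dir: str = "unsloth/GLM-5.1-GGUF") -> str:
    """Download only the files for `quant`; returns the local folder path."""
    # Imported lazily so that defining this helper does not require the package.
    from huggingface_hub import snapshot_download
    return snapshot_download(
        repo_id="unsloth/GLM-5.1-GGUF",
        local_dir=local_dir,
        allow_patterns=[quant_pattern(quant)],
    )

# download_quant("UD-IQ2_M")  # uncomment to start the (very large) download
```

Restricting `allow_patterns` to one quant matters here, since the repository holds several quantizations that are each hundreds of gigabytes.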

{% endstep %}

{% step %}
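You can edit `--threads 32` to set the number of CPU threads, `--ctx-size 16384` to set the context length, and `--n-gpu-layers 2` to set how many layers are offloaded to the GPU. If your GPU runs out of memory, try adjusting `--n-gpu-layers`. Remove it as well if you are doing CPU-only inference.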

{% code overflow="wrap" %}

```bash
./llama.cpp/llama-cli \
    --model unsloth/GLM-5.1-GGUF/UD-IQ2_M/GLM-5.1-UD-IQ2_M-00001-of-00006.gguf \
    --temp 1.0 \
    --top-p 0.95 \
    --ctx-size 16384 \
    --seed 3407
```

{% endcode %}
{% endstep %}
{% endstepper %}

#### 🦙 Llama-server serving & OpenAI's completion library

To deploy GLM-5.1 for production, we use `llama-server`. In a new terminal, for example via tmux, deploy the model with:

{% code overflow="wrap" %}

```bash
./llama.cpp/llama-server \
    --model unsloth/GLM-5.1-GGUF/UD-IQ2_M/GLM-5.1-UD-IQ2_M-00001-of-00006.gguf \
    --alias "unsloth/GLM-5.1" \
    --prio 3 \
    --temp 1.0 \
    --top-p 0.95 \
    --ctx-size 16384 \
    --port 8001
```

{% endcode %}
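
A model of this size can take a while to load, so a client script may want to wait until the server is ready. The sketch below polls the `/health` endpoint that `llama-server` exposes, using only the standard library; `wait_for_server` is a hypothetical helper, and the timeout values are arbitrary assumptions.

```python
import time
import urllib.error
import urllib.request

def wait_for_server(base_url: str = "http://127.0.0.1:8001",
                    timeout_s: float = 600.0, poll_s: float = 2.0) -> bool:
    """Poll `<base_url>/health` until it returns HTTP 200 or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # not listening yet, or a non-200 status while the model loads
        time.sleep(poll_s)
    return False
```

Calling `wait_for_server()` before the first request avoids connection errors while the weights are still being read from disk.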

Then, in a new terminal, after running `pip install openai`, run:

{% code overflow="wrap" %}

```python
from openai import OpenAI
import json
openai_client = OpenAI(
    base_url = "http://127.0.0.1:8001/v1",
    api_key = "sk-no-key-required",
)
completion = openai_client.chat.completions.create(
    model = "unsloth/GLM-5.1",
    messages = [{"role": "user", "content": "Create a Snake game."},],
)
print(completion.choices[0].message.content)
```

{% endcode %}

You can then call the served model through the OpenAI API:

```python
from openai import AsyncOpenAI, OpenAI
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8001/v1"
client = OpenAI( # or AsyncOpenAI
    api_key = openai_api_key,
    base_url = openai_api_base,
)
```

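### :hammer: Tool calling with GLM-5.1

See [tool-calling-guide-for-local-llms](https://unsloth.ai/docs/zh/ji-chu/tool-calling-guide-for-local-llms "mention") for more details on how to do tool calling. In a new terminal (if using tmux, press CTRL+B, then D to detach), we create some tools, such as adding two numbers, executing Python code, executing Linux commands, and more: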

{% code expandable="true" %}

```python
import json, re, subprocess, random
from typing import Any
def add_number(a: float | str, b: float | str) -> float:
    return float(a) + float(b)
def multiply_number(a: float | str, b: float | str) -> float:
    return float(a) * float(b)
def substract_number(a: float | str, b: float | str) -> float:
    return float(a) - float(b)
def write_a_story() -> str:
    return random.choice([
        "A long time ago in a galaxy far, far away...",
        "There were two friends who loved sloths and code...",
        "The world was ending because every sloth had evolved superhuman intelligence...",
        "Unbeknownst to one friend, the other had accidentally written a program that made sloths evolve...",
    ])
def terminal(command: str) -> str:
    # Match whole words, so that e.g. "add" or "format" is not mistaken for "dd" or "rm"
    if set(re.findall(r"\w+", command)) & {"rm", "sudo", "dd", "chmod"}:
        msg = "Cannot execute 'rm, sudo, dd, chmod' commands since they are dangerous"
        print(msg); return msg
    print(f"Executing terminal command `{command}`")
    try:
        return str(subprocess.run(command, capture_output = True, text = True, shell = True, check = True).stdout)
    except subprocess.CalledProcessError as e:
        return f"Command failed: {e.stderr}"
def python(code: str) -> str:
    data = {}
    exec(code, data)
    del data["__builtins__"]
    return str(data)
MAP_FN = {
    "add_number": add_number,
    "multiply_number": multiply_number,
    "substract_number": substract_number,
    "write_a_story": write_a_story,
    "terminal": terminal,
    "python": python,
}
tools = [
    {
        "type": "function",
        "function": {
            "name": "add_number",
            "description": "Add two numbers.",
            "parameters": {
                "type": "object",
                "properties": {
                    "a": {
                        "type": "string",
                        "description": "The first number.",
                    },
                    "b": {
                        "type": "string",
                        "description": "The second number.",
                    },
                },
                "required": ["a", "b"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "multiply_number",
            "description": "Multiply two numbers.",
            "parameters": {
                "type": "object",
                "properties": {
                    "a": {
                        "type": "string",
                        "description": "The first number.",
                    },
                    "b": {
                        "type": "string",
                        "description": "The second number.",
                    },
                },
                "required": ["a", "b"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "substract_number",
            "description": "Subtract two numbers.",
            "parameters": {
                "type": "object",
                "properties": {
                    "a": {
                        "type": "string",
                        "description": "The first number.",
                    },
                    "b": {
                        "type": "string",
                        "description": "The second number.",
                    },
                },
                "required": ["a", "b"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "write_a_story",
            "description": "Write a random story.",
            "parameters": {
                "type": "object",
                "properties": {},
                "required": [],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "terminal",
            "description": "Perform operations in the terminal.",
            "parameters": {
                "type": "object",
                "properties": {
                    "command": {
                        "type": "string",
                        "description": "The command you wish to run, e.g. `ls`, `rm`, etc.",
                    },
                },
                "required": ["command"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "python",
            "description": "Call a Python interpreter to run some Python code.",
            "parameters": {
                "type": "object",
                "properties": {
                    "code": {
                        "type": "string",
                        "description": "The Python code to run",
                    },
                },
                "required": ["code"],
            },
        },
    },
]
```

{% endcode %}
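
Before wiring the tools to a model, it can help to see how a single tool call is resolved: the model returns a function name plus a JSON string of arguments, and the caller looks the name up in a map like `MAP_FN`. The sketch below is a self-contained illustration with a hypothetical `dispatch_tool` helper; unlike a bare dictionary lookup, it returns an error string for unknown tools or bad arguments so the model can read the failure.

```python
import json
from typing import Any, Callable

def dispatch_tool(tool_map: dict[str, Callable[..., Any]], name: str, arguments_json: str) -> str:
    """Run one tool call and always return a string the model can read."""
    fn = tool_map.get(name)
    if fn is None:
        return f"Unknown tool: {name}"
    try:
        kwargs = json.loads(arguments_json or "{}")
        return str(fn(**kwargs))
    except Exception as e:  # report bad JSON or bad arguments instead of crashing
        return f"Tool {name} failed: {e}"

# Stand-in map for illustration; the real one is MAP_FN from the block above.
demo_map = {"add_number": lambda a, b: float(a) + float(b)}
print(dispatch_tool(demo_map, "add_number", '{"a": "2", "b": "3"}'))  # 5.0
print(dispatch_tool(demo_map, "divide_number", "{}"))  # Unknown tool: divide_number
```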

We then use the function below (copy, paste, and run it), which parses function calls automatically and calls the OpenAI endpoint for any model:

{% code overflow="wrap" expandable="true" %}

```python
from openai import OpenAI
def unsloth_inference(
    messages,
    temperature = 1.0,
    top_p = 0.95,
    top_k = -1,
    min_p = 0.01,
    repetition_penalty = 1.0,
):
    messages = messages.copy()
    openai_client = OpenAI(
        base_url = "http://127.0.0.1:8001/v1",
        api_key = "sk-no-key-required",
    )
    model_name = next(iter(openai_client.models.list())).id
    print(f"Using model = {model_name}")
    has_tool_calls = True
    original_messages_len = len(messages)
    while has_tool_calls:
        print(f"Current messages = {messages}")
        response = openai_client.chat.completions.create(
            model = model_name,
            messages = messages,
            temperature = temperature,
            top_p = top_p,
            tools = tools if tools else None,
            tool_choice = "auto" if tools else None,
            extra_body = {"top_k": top_k, "min_p": min_p, "repetition_penalty" :repetition_penalty,}
        )
        tool_calls = response.choices[0].message.tool_calls or []
        content = response.choices[0].message.content or ""
        tool_calls_dict = [tc.to_dict() for tc in tool_calls] if tool_calls else tool_calls
        messages.append({"role": "assistant", "tool_calls": tool_calls_dict, "content": content,})
        for tool_call in tool_calls:
            fx, args, _id = tool_call.function.name, tool_call.function.arguments, tool_call.id
            out = MAP_FN[fx](**json.loads(args))
            messages.append({"role": "tool", "tool_call_id": _id, "name": fx, "content": str(out),})
        # Loop again only if the model requested tools this round, so it can read their results
        has_tool_calls = len(tool_calls) > 0
    return messages
```

{% endcode %}

After launching GLM-5.1 via `llama-server` as in [#deploy-with-llama-server-and-openais-completion-library](#deploy-with-llama-server-and-openais-completion-library "mention"), or after reading [tool-calling-guide-for-local-llms](https://unsloth.ai/docs/zh/ji-chu/tool-calling-guide-for-local-llms "mention") for more details, we can then do some tool calls.
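
The list returned by `unsloth_inference` is a plain list of message dictionaries, so it is easy to post-process. As a small sketch, the hypothetical helper below collects which tools ran and the last non-empty assistant reply; the example conversation is hand-written in the same shape and is not real model output.

```python
def summarize_conversation(messages: list[dict]) -> dict:
    """Collect the tools that ran and the last non-empty assistant reply."""
    tools_used = [m["name"] for m in messages if m.get("role") == "tool"]
    final_answer = ""
    for m in reversed(messages):
        if m.get("role") == "assistant" and m.get("content"):
            final_answer = m["content"]
            break
    return {"tools_used": tools_used, "final_answer": final_answer}

# Hand-written example in the message shape used above (not real model output)
example = [
    {"role": "user", "content": "What is 2 + 3?"},
    {"role": "assistant", "tool_calls": [{"id": "call_0"}], "content": ""},
    {"role": "tool", "tool_call_id": "call_0", "name": "add_number", "content": "5.0"},
    {"role": "assistant", "tool_calls": [], "content": "2 + 3 = 5"},
]
print(summarize_conversation(example))
```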

### 📊 Benchmarks

You can view the GLM-5.1 benchmarks below, in table form:

<div><figure><img src="https://2657992854-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FxhOjnexMCB3dmuQFQ2Zq%2Fuploads%2FJx4pDC6fWJwaQvk1N8X3%2Fbench_51.png?alt=media&#x26;token=a6d51e6e-4e60-43d3-95de-fd3918fbcf67" alt=""><figcaption></figcaption></figure> <figure><img src="https://2657992854-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FxhOjnexMCB3dmuQFQ2Zq%2Fuploads%2FvgdCMB73JM8P1OGUDrsJ%2FHFUGDWhW8AAUCbw.jpg?alt=media&#x26;token=bef138cf-8afb-4e33-8a74-f4695a8cfe45" alt=""><figcaption></figcaption></figure></div>

| Benchmark                      | GLM-5.1           | GLM-5             | Qwen3.6-Plus | Minimax M2.7      | DeepSeek-V3.2     | Kimi K2.5 | Claude Opus 4.6 | Gemini 3.1 Pro | GPT-5.4          |
| ------------------------------ | ----------------- | ----------------- | ------------ | ----------------- | ----------------- | --------- | --------------- | -------------- | ---------------- |
| HLE                            | 31.0              | 30.5              | 28.8         | 28.0              | 25.1              | 31.5      | 36.7            | **45.0**       | 39.8             |
| HLE (with tools)               | 52.3              | 50.4              | 50.6         | -                 | 40.8              | 51.8      | **53.1**\*      | 51.4\*         | 52.1\*           |
| AIME 2026                      | 95.3              | 95.4              | 95.1         | 89.8              | 95.1              | 94.5      | 95.6            | 98.2           | **98.7**         |
| HMMT Nov. 2025                 | 94.0              | **96.9**          | 94.6         | 81.0              | 90.2              | 91.1      | 96.3            | 94.8           | 95.8             |
| HMMT Feb. 2026                 | 82.6              | 82.8              | 87.8         | 72.7              | 79.9              | 81.3      | 84.3            | 87.3           | **91.8**         |
| IMOAnswerBench                 | 83.8              | 82.5              | 83.8         | 66.3              | 78.3              | 81.8      | 75.3            | 81.0           | **91.4**         |
| GPQA-Diamond                   | 86.2              | 86.0              | 90.4         | 87.0              | 82.4              | 87.6      | 91.3            | **94.3**       | 92.0             |
| SWE-Bench Pro                  | **58.4**          | 55.1              | 56.6         | 56.2              | -                 | 53.8      | 57.3            | 54.2           | 57.7             |
| NL2Repo                        | 42.7              | 35.9              | 37.9         | 39.8              | -                 | 32.0      | **49.8**        | 33.4           | 41.3             |
| Terminal-Bench 2.0 (Terminus-2) | 63.5              | 56.2              | 61.6         | -                 | 39.3              | 50.8      | 65.4            | **68.5**       | -                |
| Terminal-Bench 2.0 (best self-reported) | 66.5 (Claude Code) | 56.2 (Claude Code) | -            | 57.0 (Claude Code) | 46.4 (Claude Code) | -         | -               | -              | **75.1** (Codex) |
| CyberGym                       | **68.7**          | 48.3              | -            | -                 | 17.3              | 41.3      | 66.6            | -              | -                |
| BrowseComp                     | **68.0**          | 62.0              | -            | -                 | 51.4              | 60.6      | -               | -              | -                |
| BrowseComp (with context management) | 79.3              | 75.9              | -            | -                 | 67.6              | 74.9      | 84.0            | **85.9**       | 82.7             |
| τ³-Bench                       | 70.6              | 69.2              | 70.7         | 67.6              | 69.2              | 66.0      | 72.4            | 67.1           | **72.9**         |
| MCP-Atlas (public set)         | 71.8              | 69.2              | **74.1**     | 48.8              | 62.2              | 63.8      | 73.8            | 69.2           | 67.2             |
| Tool-Decathlon                 | 40.7              | 38.0              | 39.8         | 46.3              | 35.2              | 27.8      | 47.2            | 48.8           | **54.6**         |
| Vending Bench 2                | $5,634.00         | $4,432.12         | $5,114.87    | -                 | $1,034.00         | $1,198.46 | **$8,017.59**   | $911.21        | $6,144.18        |

