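GLM-5: How to Run Locally

Run Z.ai's new GLM-5 model on your own local device!

GLM-5 is Z.ai's latest reasoning model. It outperforms GLM-4.7 on coding, agentic, and chat tasks, and is designed for long-context reasoning. It raises Humanity's Last Exam (with tools) to 50.4% (+7.6%), BrowseComp (with context management) to 75.9% (+8.4%), and Terminal-Bench 2.0 to 61.1% (+28.3%).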

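The full 744B-parameter model (40B active) has a 200K context window and was pre-trained on 28.5T tokens. The full GLM-5 model requires 1.65TB of disk space, while the Unsloth Dynamic 2-bit GGUF reduces the size to 241GB (-85%) and the dynamic 1-bit is 176GB (-89%): GLM-5-GGUF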
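The size reductions quoted above can be reproduced with a little arithmetic. This is only a sanity check, and it assumes 1.65TB is taken as 1650GB; the helper name `size_reduction_percent` is ours, not part of any library.

```python
# Disk sizes quoted above, in GB (assumption: 1.65 TB is treated as 1650 GB).
FULL_MODEL_GB = 1650
QUANT_SIZES_GB = {"Dynamic 2-bit": 241, "Dynamic 1-bit": 176}


def size_reduction_percent(quant_gb: float, full_gb: float = FULL_MODEL_GB) -> float:
    """Return how much smaller a quant is than the full model, in percent."""
    return (1 - quant_gb / full_gb) * 100


for name, size_gb in QUANT_SIZES_GB.items():
    print(f"{name}: {size_gb} GB, about {size_reduction_percent(size_gb):.0f}% smaller")
```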

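All uploads use Unsloth Dynamic 2.0 for SOTA quantization performance, so the 1-bit quant keeps important layers upcast to 8 or 16-bit. Thank you to Z.ai for providing Unsloth with day-zero access.

⚙️ Usage Guide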

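The 2-bit dynamic quant UD-IQ2_XXS uses 241GB of disk space. It fits directly on a Mac with 256GB of unified memory, and also runs well on a single 24GB GPU plus 256GB of RAM with MoE offloading. The 1-bit quant runs in 180GB of RAM, and the 8-bit quant requires 805GB of RAM.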

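For best performance, make sure your total available memory (VRAM + system RAM) exceeds the size of the quantized model file you download. If it does not, llama.cpp can still run via SSD/HDD offloading, but inference will be slower.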
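As a rough illustration of the rule above, the check below compares VRAM plus system RAM against a quant's file size. It is a minimal sketch: `fits_in_memory` is a hypothetical helper, and it ignores the extra memory needed for the KV cache and the operating system, so treat a narrow pass as a warning sign.

```python
def fits_in_memory(quant_gb: float, ram_gb: float, vram_gb: float = 0.0) -> bool:
    """True if VRAM + system RAM exceeds the quant's file size."""
    return (ram_gb + vram_gb) > quant_gb


# 2-bit UD-IQ2_XXS (241 GB) on one 24 GB GPU plus 256 GB of RAM.
print(fits_in_memory(241, ram_gb=256, vram_gb=24))

# 8-bit (805 GB) on the same machine: llama.cpp would fall back to disk offloading.
print(fits_in_memory(805, ram_gb=256, vram_gb=24))
```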

Run GLM-5 Tutorials:

✨ Run in llama.cpp

Obtain the latest llama.cpp from GitHub here. You can also follow the build instructions below. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.

apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-mtmd-cli llama-server llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp

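If you want to use llama.cpp directly to load the model, follow the commands below. The suffix after the colon (:UD-IQ2_XXS) is the quantization type. You can also download the model via Hugging Face, as described further down. This works similarly to ollama run. Use export LLAMA_CACHE="folder" to force llama.cpp to save to a specific location. Remember that the model has a maximum context length of 200K.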

Follow this for general instruction use-cases:

export LLAMA_CACHE="unsloth/GLM-5-GGUF"
./llama.cpp/llama-cli \
    -hf unsloth/GLM-5-GGUF:UD-IQ2_XXS \
    --ctx-size 16384 \
    --flash-attn on \
    --temp 0.7 \
    --top-p 1.0 \
    --min-p 0.01

Follow this for tool-calling use-cases:

export LLAMA_CACHE="unsloth/GLM-5-GGUF"
./llama.cpp/llama-cli \
    -hf unsloth/GLM-5-GGUF:UD-IQ2_XXS \
    --ctx-size 16384 \
    --flash-attn on \
    --temp 1.0 \
    --top-p 0.95 \
    --min-p 0.01
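If you talk to the model through an OpenAI-compatible client rather than llama-cli, the two sampling presets above could be kept in one place. The sketch below is an assumption about how you might organize them: `SAMPLING_PRESETS` and `sampling_kwargs` are hypothetical names, and min_p is passed through `extra_body` because it is not a standard OpenAI parameter.

```python
# Sampling presets copied from the two llama-cli commands above.
SAMPLING_PRESETS = {
    "general": {"temperature": 0.7, "top_p": 1.0, "min_p": 0.01},
    "tool_calling": {"temperature": 1.0, "top_p": 0.95, "min_p": 0.01},
}


def sampling_kwargs(use_case: str) -> dict:
    """Build keyword arguments for chat.completions.create for one use case."""
    preset = SAMPLING_PRESETS[use_case]
    return {
        "temperature": preset["temperature"],
        "top_p": preset["top_p"],
        # min_p is server-specific, so it travels in the request's extra body.
        "extra_body": {"min_p": preset["min_p"]},
    }


print(sampling_kwargs("general"))
print(sampling_kwargs("tool_calling"))
```

With a client in hand, the result can be unpacked into the request, for example `client.chat.completions.create(model=..., messages=..., **sampling_kwargs("tool_calling"))`.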

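Download the model with the command below (after installing the dependencies with pip install huggingface_hub hf_transfer). You can choose UD-Q2_K_XL (dynamic 2-bit quant) or other quantized versions such as UD-Q4_K_XL. We recommend our 2-bit dynamic quant UD-Q2_K_XL to balance size and accuracy. The command below fetches UD-IQ2_XXS; change the --include pattern to pick a different quant. If downloads get stuck, see Hugging Face Hub, XET debugging

pip install -U huggingface_hub
hf download unsloth/GLM-5-GGUF \
    --local-dir unsloth/GLM-5-GGUF \
    --include "*UD-IQ2_XXS*" # Use "*UD-TQ1_0*" for Dynamic 1bit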
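The same download can also be scripted from Python. The sketch below assumes the `snapshot_download` function from huggingface_hub with its `allow_patterns` argument; the helper names `include_pattern` and `download_quant` are ours, and nothing is downloaded until you call `download_quant` yourself.

```python
def include_pattern(quant: str) -> str:
    """Glob pattern matching every shard of one quant, e.g. '*UD-IQ2_XXS*'."""
    return f"*{quant}*"


def download_quant(quant: str = "UD-IQ2_XXS", local_dir: str = "unsloth/GLM-5-GGUF") -> str:
    """Download only the files of one quant and return the local folder path."""
    # Imported lazily so include_pattern works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id="unsloth/GLM-5-GGUF",
        local_dir=local_dir,
        allow_patterns=[include_pattern(quant)],
    )


# Example (downloads hundreds of GB, so it is left commented out):
# download_quant("UD-TQ1_0")  # Dynamic 1-bit
print(include_pattern("UD-IQ2_XXS"))
```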

You can edit --threads 32 to set the number of CPU threads, --ctx-size 16384 to set the context length, and --n-gpu-layers 2 to set how many layers are offloaded to the GPU. Lower --n-gpu-layers if your GPU runs out of memory, and remove it entirely for CPU-only inference.

./llama.cpp/llama-cli \
    --model unsloth/GLM-5-GGUF/UD-IQ2_XXS/GLM-5-UD-IQ2_XXS-00001-of-00006.gguf \
    --temp 1.0 \
    --top-p 0.95 \
    --min-p 0.01 \
    --ctx-size 16384 \
    --seed 3407
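To make the optional flags concrete, here is a small sketch that assembles the command above and appends --threads and --n-gpu-layers only when they are set. The function `build_llama_cli_command` is a hypothetical helper for illustration; the flags and paths are the ones used in this guide.

```python
import shlex

MODEL = "unsloth/GLM-5-GGUF/UD-IQ2_XXS/GLM-5-UD-IQ2_XXS-00001-of-00006.gguf"


def build_llama_cli_command(model_path, ctx_size=16384, threads=None, n_gpu_layers=None):
    """Return the llama-cli argument list, adding optional flags only if given."""
    cmd = [
        "./llama.cpp/llama-cli",
        "--model", model_path,
        "--temp", "1.0",
        "--top-p", "0.95",
        "--min-p", "0.01",
        "--ctx-size", str(ctx_size),
        "--seed", "3407",
    ]
    if threads is not None:
        cmd += ["--threads", str(threads)]
    if n_gpu_layers is not None:
        # Leave this out for CPU-only inference.
        cmd += ["--n-gpu-layers", str(n_gpu_layers)]
    return cmd


# Print a copy-pasteable shell command with 32 threads and 2 GPU layers.
print(shlex.join(build_llama_cli_command(MODEL, threads=32, n_gpu_layers=2)))
```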

🦙 Llama-server serving & OpenAI's completion library

To deploy GLM-5 for production, we use llama-server. In a new terminal (for example via tmux), deploy the model with:

./llama.cpp/llama-server \
    --model unsloth/GLM-5-GGUF/UD-IQ2_XXS/GLM-5-UD-IQ2_XXS-00001-of-00006.gguf \
    --alias "unsloth/GLM-5" \
    --prio 3 \
    --temp 1.0 \
    --top-p 0.95 \
    --ctx-size 16384 \
    --port 8001
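A model of this size can take a while to load, so it may help to wait until the server answers before sending requests. The sketch below polls the OpenAI-compatible /v1/models endpoint using only the standard library; `wait_for_server` and its timeout values are assumptions for illustration, not part of llama.cpp.

```python
import json
import time
import urllib.error
import urllib.request


def wait_for_server(base_url="http://127.0.0.1:8001/v1", timeout_s=600.0, poll_s=5.0):
    """Poll GET {base_url}/models until it answers with JSON or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with urllib.request.urlopen(f"{base_url}/models", timeout=5) as resp:
                if resp.status == 200:
                    json.load(resp)  # make sure the body is valid JSON
                    return True
        except (urllib.error.URLError, OSError, ValueError):
            pass  # server not reachable or still loading; retry below
        if time.monotonic() >= deadline:
            return False
        time.sleep(min(poll_s, max(0.0, deadline - time.monotonic())))


# Example: block until the server started above is ready.
# if wait_for_server():
#     print("llama-server is ready")
```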

Then, in another terminal, after running pip install openai, run:

from openai import OpenAI
import json
openai_client = OpenAI(
    base_url = "http://127.0.0.1:8001/v1",
    api_key = "sk-no-key-required",
)
completion = openai_client.chat.completions.create(
    model = "unsloth/GLM-5",
    messages = [{"role": "user", "content": "Create a Snake game."},],
)
print(completion.choices[0].message.content)

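You will get the following Snake game example:

This is a complete, playable Snake game contained in a single HTML file. You can copy this code, save it as a `.html` file (e.g. `snake.html`), and open it in your browser to play.

### Code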

```html
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Classic Snake Game</title>
    <style>
        body {
            display: flex;
            flex-direction: column;
            justify-content: center;
            align-items: center;
            height: 100vh;
            margin: 0;
            background-color: #222;
            color: white;
            font-family: Arial, sans-serif;
        }

        #gameCanvas {
            border: 2px solid #fff;
            background-color: #000;
        }

        h1 {
            margin-bottom: 10px;
        }

        #scoreBoard {
            font-size: 20px;
            margin-bottom: 10px;
        }

        #gameOverMenu {
            position: absolute;
            display: none;
            flex-direction: column;
            justify-content: center;
            align-items: center;
            background: rgba(0, 0, 0, 0.85);
            padding: 20px;
            border-radius: 10px;
            border: 2px solid red;
        }

        button {
            margin-top: 15px;
            padding: 10px 20px;
            font-size: 16px;
            cursor: pointer;
            background-color: #4CAF50;
            color: white;
            border: none;
            border-radius: 5px;
        }
        
        button:hover {
            background-color: #45a049;
        }
    </style>
</head>
<body>

    <h1>Snake Game</h1>
    <div id="scoreBoard">Score: 0</div>
    <canvas id="gameCanvas" width="400" height="400"></canvas>

    <div id="gameOverMenu">
        <h2 style="color: red; margin: 0;">Game Over!</h2>
        <p id="finalScore">Final Score: 0</p>
        <button onclick="resetGame()">Play Again</button>
    </div>

    <script>
        // Game constants
        const canvas = document.getElementById('gameCanvas');
        const ctx = canvas.getContext('2d');
        const scoreBoard = document.getElementById('scoreBoard');
        const gameOverMenu = document.getElementById('gameOverMenu');
        const finalScoreDisplay = document.getElementById('finalScore');

        const gridSize = 20; // Size of each grid square
        const tileCount = canvas.width / gridSize; // Number of squares per row/column

        // Game variables
        let dx = 0; // Horizontal velocity
        let dy = 0; // Vertical velocity
        let score = 0;
        let snake = [];
        let foodX, foodY;
        let gameInterval;
        let isGameRunning = false;

        // Initialize the game
        function initGame() {
            snake = [
                {x: 10, y: 10}, 
                {x: 9, y: 10}, 
                {x: 8, y: 10}
            ];
            score = 0;
            scoreBoard.innerText = 'Score: ' + score;
            dx = 1; // Start moving right immediately
            dy = 0;
            placeFood();
            isGameRunning = true;
            gameOverMenu.style.display = 'none';
            
            // Start the game loop
            if (gameInterval) clearInterval(gameInterval);
            gameInterval = setInterval(gameLoop, 100); // Run the game loop every 100ms
        }

        // Main game loop
        function gameLoop() {
            if (!isGameRunning) return;

            moveSnake();
            if (checkGameOver()) {
                endGame();
                return;
            }
            checkFoodCollision();
            draw();
        }

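        // Move the snake
        function moveSnake() {
            // Create a new head based on the current direction
            const head = {x: snake[0].x + dx, y: snake[0].y + dy};
            
            // Add the new head to the beginning of the array
            snake.unshift(head);

            // Remove the tail (last element) on every move
            // Note: growth after eating food is handled in checkFoodCollision
            snake.pop(); 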
        }

        // Check whether the snake ate the food
        function checkFoodCollision() {
            const head = snake[0];
            
            if (head.x === foodX && head.y === foodY) {
                // Grow the snake: append a tail segment (a copy of the last one)
                snake.push({...snake[snake.length - 1]});
                score += 10;
                scoreBoard.innerText = 'Score: ' + score;
                placeFood();
            }
        }

        // Check for collisions (walls or self)
        function checkGameOver() {
            const head = snake[0];

            // Wall collision
            if (head.x < 0 || head.x >= tileCount || head.y < 0 || head.y >= tileCount) {
                return true;
            }

            // Self collision (start checking at index 4, since the head cannot hit the segments directly behind it)
            for (let i = 4; i < snake.length; i++) {
                if (head.x === snake[i].x && head.y === snake[i].y) {
                    return true;
                }
            }

            return false;
        }

        // Draw everything
        function draw() {
            // Clear the canvas
            ctx.fillStyle = 'black';
            ctx.fillRect(0, 0, canvas.width, canvas.height);

            // Draw the food
            ctx.fillStyle = 'red';
            ctx.fillRect(foodX * gridSize, foodY * gridSize, gridSize - 2, gridSize - 2);

            // Draw the snake
            ctx.fillStyle = 'lime';
            for (let i = 0; i < snake.length; i++) {
                // The head could be drawn differently; here every segment is drawn the same way
                const part = snake[i];
                ctx.fillRect(part.x * gridSize, part.y * gridSize, gridSize - 2, gridSize - 2);
            }
        }

        // Place the food at a random position
        function placeFood() {
            foodX = Math.floor(Math.random() * tileCount);
            foodY = Math.floor(Math.random() * tileCount);

            // Make sure the food does not spawn on the snake
            for (let part of snake) {
                if (part.x === foodX && part.y === foodY) {
                    placeFood(); // Recurse to find a new position
                    return;
                }
            }
        }

        // End-of-game logic
        function endGame() {
            isGameRunning = false;
            clearInterval(gameInterval);
            finalScoreDisplay.innerText = 'Final Score: ' + score;
            gameOverMenu.style.display = 'flex';
        }

        // Reset logic
        function resetGame() {
            initGame();
        }

        // Keyboard controls
        document.addEventListener('keydown', (e) => {
            // Prevent reversing direction (e.g. no turning left while moving right)
            switch(e.key) {
                case 'ArrowUp':
                    if (dy !== 1) { dx = 0; dy = -1; }
                    break;
                case 'ArrowDown':
                    if (dy !== -1) { dx = 0; dy = 1; }
                    break;
                case 'ArrowLeft':
                    if (dx !== 1) { dx = -1; dy = 0; }
                    break;
                case 'ArrowRight':
                    if (dx !== -1) { dx = 1; dy = 0; }
                    break;
                case ' ':
                    if (!isGameRunning && gameOverMenu.style.display !== 'flex') {
                        initGame();
                    }
                    break;
            }
        });

        // Start the game when the page loads
        initGame();
    </script>
</body>
</html>
```

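### How to Play
1.  **Copy the code above**.
2.  Create a new file named `snake.html` on your computer.
3.  **Paste the code** into the file and save it.
4.  **Double-click `snake.html`** to open it in your browser.

### Controls
*   **Arrow keys**: Move up, down, left, and right.
*   **Space bar**: Start the game (if it has not started yet).
*   **Play Again button**: Appears when you crash and restarts the game.

### Features of This Version
*   **Grid-based movement**: Classic retro feel.
*   **Score tracking**: Updates in real time.
*   **Game-over screen**: Shows your final score and lets you restart easily.
*   **Collision detection**: The game ends if you hit a wall or yourself.
*   **Self-collision safety**: The code prevents the snake from biting itself immediately after eating food, a problem caused by the "tail skip" logic found in common simple tutorials.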

💻 vLLM Deployment

You can now serve Z.ai's FP8 version of the model via vLLM. You need 860GB of VRAM or more, so 8xH200 (141x8 = 1128GB) is the recommended minimum. 8xB200 also works well. First, install the vLLM nightly:

uv pip install --upgrade --force-reinstall vllm --torch-backend=auto --extra-index-url https://wheels.vllm.ai/nightly/cu130
uv pip install --upgrade --force-reinstall git+https://github.com/huggingface/transformers.git
uv pip install --force-reinstall numba

The FP8 KV cache reduces KV cache memory usage by 50%. To disable it, remove --kv-cache-dtype fp8

export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:False
vllm serve unsloth/GLM-5-FP8 \
    --served-model-name unsloth/GLM-5-FP8 \
    --kv-cache-dtype fp8 \
    --tensor-parallel-size 8 \
    --tool-call-parser glm47 \
    --reasoning-parser glm45 \
    --enable-auto-tool-choice \
    --dtype bfloat16 \
    --seed 3407 \
    --max-model-len 200000 \
    --gpu-memory-utilization 0.93 \
    --max_num_batched_tokens 4096 \
    --speculative-config.method mtp \
    --speculative-config.num_speculative_tokens 1 \
    --port 8001
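The hardware sizing above can be sanity-checked with simple arithmetic. This sketch treats the 860GB figure as the total VRAM budget (the guide does not say how much of it is weights versus KV cache) and applies the --gpu-memory-utilization value from the command; the variable names are ours.

```python
NUM_GPUS = 8
VRAM_PER_GPU_GB = 141          # H200
REQUIRED_GB = 860              # VRAM requirement quoted in this guide
GPU_MEMORY_UTILIZATION = 0.93  # fraction of each GPU that vLLM may use

total_gb = NUM_GPUS * VRAM_PER_GPU_GB
usable_gb = total_gb * GPU_MEMORY_UTILIZATION

print(f"total = {total_gb} GB, usable by vLLM = {usable_gb:.0f} GB, required = {REQUIRED_GB} GB")
print("fits:", usable_gb >= REQUIRED_GB)
```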

You can then call the served model via the OpenAI API:

from openai import AsyncOpenAI, OpenAI
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8001/v1"
client = OpenAI( # or AsyncOpenAI
    api_key = openai_api_key,
    base_url = openai_api_base,
)

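🔨 Tool Calling with GLM 5

See the Tool Calling Guide for more details on how to do tool calling. In a new terminal (if using tmux, use CTRL+B+D), we create some tools such as adding two numbers, executing Python code, executing Linux commands, and more: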

import json, subprocess, random
from typing import Any
def add_number(a: float | str, b: float | str) -> float:
    return float(a) + float(b)
def multiply_number(a: float | str, b: float | str) -> float:
    return float(a) * float(b)
def substract_number(a: float | str, b: float | str) -> float:
    return float(a) - float(b)
def write_a_story() -> str:
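    return random.choice([
        "Once upon a time, in a galaxy far, far away...",
        "There were two friends who loved sloths and code...",
        "The world was ending because every sloth evolved superhuman intelligence...",
        "Unbeknownst to one friend, the other accidentally coded a program to evolve sloths...",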
    ])
def terminal(command: str) -> str:
    if "rm" in command or "sudo" in command or "dd" in command or "chmod" in command:
        msg = "Cannot execute 'rm, sudo, dd, chmod' commands since they are dangerous"
        print(msg); return msg
    print(f"Executing terminal command `{command}`")
    try:
        return str(subprocess.run(command, capture_output = True, text = True, shell = True, check = True).stdout)
    except subprocess.CalledProcessError as e:
        return f"Command failed: {e.stderr}"
def python(code: str) -> str:
    data = {}
    exec(code, data)
    del data["__builtins__"]
    return str(data)
MAP_FN = {
    "add_number": add_number,
    "multiply_number": multiply_number,
    "substract_number": substract_number,
    "write_a_story": write_a_story,
    "terminal": terminal,
    "python": python,
}
tools = [
    {
        "type": "function",
        "function": {
            "name": "add_number",
            "description": "Add two numbers.",
            "parameters": {
                "type": "object",
                "properties": {
                    "a": {
                        "type": "string",
                        "description": "The first number.",
                    },
                    "b": {
                        "type": "string",
                        "description": "The second number.",
                    },
                },
                "required": ["a", "b"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "multiply_number",
            "description": "Multiply two numbers.",
            "parameters": {
                "type": "object",
                "properties": {
                    "a": {
                        "type": "string",
                        "description": "The first number.",
                    },
                    "b": {
                        "type": "string",
                        "description": "The second number.",
                    },
                },
                "required": ["a", "b"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "substract_number",
            "description": "Subtract two numbers.",
            "parameters": {
                "type": "object",
                "properties": {
                    "a": {
                        "type": "string",
                        "description": "The first number.",
                    },
                    "b": {
                        "type": "string",
                        "description": "The second number.",
                    },
                },
                "required": ["a", "b"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "write_a_story",
            "description": "Write a random story.",
            "parameters": {
                "type": "object",
                "properties": {},
                "required": [],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "terminal",
            "description": "Perform operations from the terminal.",
            "parameters": {
                "type": "object",
                "properties": {
                    "command": {
                        "type": "string",
                        "description": "The command you wish to launch, e.g. `ls`, `rm`, etc.",
                    },
                },
                "required": ["command"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "python",
            "description": "Call a Python interpreter with some Python code that will be run.",
            "parameters": {
                "type": "object",
                "properties": {
                    "code": {
                        "type": "string",
                        "description": "The Python code to run",
                    },
                },
                "required": ["code"],
            },
        },
    },
]
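One thing to be aware of: the `terminal` tool above filters commands with substring checks, so a harmless command such as `ls add_numbers.py` is refused because it contains "dd". A possible refinement, sketched below under our own assumptions, is to match whole tokens instead. `is_blocked` is a hypothetical helper, and it is still only a convenience filter, not a sandbox.

```python
import re

BLOCKED_COMMANDS = {"rm", "sudo", "dd", "chmod"}


def is_blocked(command: str) -> bool:
    """True if any token of the command is exactly one of the blocked programs."""
    # Split on anything that is not a letter, digit, underscore, dot, or dash,
    # so separators such as spaces, ';', '|', '/', and '$(' all end a token.
    tokens = re.split(r"[^A-Za-z0-9_.\-]+", command)
    return any(token in BLOCKED_COMMANDS for token in tokens)


print(is_blocked("rm -rf /tmp/x"))      # blocked: 'rm' is a token
print(is_blocked("echo hi;rm x"))       # blocked: 'rm' follows a separator
print(is_blocked("ls add_numbers.py"))  # allowed: 'dd' only appears inside a name
```

Token matching still refuses harmless commands like `echo rm`, so any serious use would want a real sandbox or an allowlist instead.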

We then use the function below (copy, paste, and execute), which automatically parses the function calls and calls the OpenAI endpoint for any model:

from openai import OpenAI
def unsloth_inference(
    messages,
    temperature = 1.0,
    top_p = 0.95,
    top_k = -1,
    min_p = 0.01,
    repetition_penalty = 1.0,
):
    messages = messages.copy()
    openai_client = OpenAI(
        base_url = "http://127.0.0.1:8001/v1",
        api_key = "sk-no-key-required",
    )
    model_name = next(iter(openai_client.models.list())).id
    print(f"Using model = {model_name}")
    has_tool_calls = True
    original_messages_len = len(messages)
    while has_tool_calls:
        print(f"Current messages = {messages}")
        response = openai_client.chat.completions.create(
            model = model_name,
            messages = messages,
            temperature = temperature,
            top_p = top_p,
            tools = tools if tools else None,
            tool_choice = "auto" if tools else None,
            extra_body = {"top_k": top_k, "min_p": min_p, "repetition_penalty" :repetition_penalty,}
        )
        tool_calls = response.choices[0].message.tool_calls or []
        content = response.choices[0].message.content or ""
        tool_calls_dict = [tc.to_dict() for tc in tool_calls] if tool_calls else tool_calls
        messages.append({"role": "assistant", "tool_calls": tool_calls_dict, "content": content,})
        for tool_call in tool_calls:
            fx, args, _id = tool_call.function.name, tool_call.function.arguments, tool_call.id
            out = MAP_FN[fx](**json.loads(args))
            messages.append({"role": "tool", "tool_call_id": _id, "name": fx, "content": str(out),})
        # Query the model again only if it requested tools this turn; otherwise stop
        has_tool_calls = len(tool_calls) > 0
    return messages

After launching GLM 5 via llama-server as shown in the llama-server section of this guide (see the Tool Calling Guide for more details), we can then make some tool calls.
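The core step inside the inference loop is mapping a tool call's name and JSON arguments onto a Python function. The sketch below isolates that step so it can be tried without a running server. `dispatch_tool_call` and `REGISTRY` are hypothetical stand-ins (the guide's own code uses `MAP_FN`), and the error handling is our addition so that a malformed call is reported back as text instead of raising.

```python
import json


def add_number(a, b):
    """Same behavior as the add_number tool defined earlier in this guide."""
    return float(a) + float(b)


REGISTRY = {"add_number": add_number}  # stand-in for MAP_FN


def dispatch_tool_call(name: str, arguments_json: str, registry: dict) -> str:
    """Run one tool call and return its result (or an error message) as a string."""
    if name not in registry:
        return f"Unknown tool: {name}"
    try:
        kwargs = json.loads(arguments_json)
        return str(registry[name](**kwargs))
    except (json.JSONDecodeError, TypeError, ValueError) as e:
        return f"Tool call failed: {e}"


# Arguments arrive as a JSON string, exactly as in tool_call.function.arguments.
print(dispatch_tool_call("add_number", '{"a": "2", "b": "3"}', REGISTRY))
```

The returned string is what would be appended to the conversation as the content of the "tool" message.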

📊 Benchmarks

You can view further benchmarks in table format below:

| Benchmark | GLM-5 | GLM-4.7 | DeepSeek-V3.2 | Kimi K2.5 | Claude Opus 4.5 | Gemini 3 Pro | GPT-5.2 (xhigh) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| HLE | 30.5 | 24.8 | 25.1 | 31.5 | 28.4 | 37.2 | 35.4 |
| HLE (w/ Tools) | 50.4 | 42.8 | 40.8 | 51.8 | 43.4* | 45.8* | 45.5* |
| AIME 2026 I | 92.7 | 92.9 | 92.7 | 92.5 | 93.3 | 90.6 | - |
| HMMT Nov. 2025 | 96.9 | 93.5 | 90.2 | 91.1 | 91.7 | 93.0 | 97.1 |
| IMOAnswerBench | 82.5 | 82.0 | 78.3 | 81.8 | 78.5 | 83.3 | 86.3 |
| GPQA-Diamond | 86.0 | 85.7 | 82.4 | 87.6 | 87.0 | 91.9 | 92.4 |
| SWE-bench Verified | 77.8 | 73.8 | 73.1 | 76.8 | 80.9 | 76.2 | 80.0 |
| SWE-bench Multilingual | 73.3 | 66.7 | 70.2 | 73.0 | 77.5 | 65.0 | 72.0 |
| Terminal-Bench 2.0 (Terminus 2) | 56.2 / 60.7 † | 41.0 | 39.3 | 50.8 | 59.3 | 54.2 | 54.0 |
| Terminal-Bench 2.0 (Claude Code) | 56.2 / 61.1 † | 32.8 | 46.4 | - | 57.9 | - | - |
| CyberGym | 43.2 | 23.5 | 17.3 | 41.3 | 50.6 | 39.9 | - |
| BrowseComp | 62.0 | 52.0 | 51.4 | 60.6 | 37.0 | 37.8 | - |
| BrowseComp (w/ Context Management) | 75.9 | 67.5 | 67.6 | 74.9 | 67.8 | 59.2 | 65.8 |
| BrowseComp-Zh | 72.7 | 66.6 | 65.0 | 62.3 | 62.4 | 66.8 | 76.1 |
| τ²-Bench | 89.7 | 87.4 | 85.3 | 80.2 | 91.6 | 90.7 | 85.5 |
| MCP-Atlas (Public Set) | 67.8 | 52.0 | 62.2 | 63.8 | 65.2 | 66.6 | 68.0 |
| Tool-Decathlon | 38.0 | 23.8 | 35.2 | 27.8 | 43.5 | 36.4 | 46.3 |
| Vending Bench 2 | $4,432.12 | $2,376.82 | $1,034.00 | $1,198.46 | $4,967.06 | $5,478.16 | $3,591.33 |
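The improvements quoted in the introduction can be recomputed from the GLM-5 and GLM-4.7 benchmark scores. The sketch below assumes that the 61.1 figure for Terminal-Bench 2.0 is the second (†) value of the Claude Code row; the helper name `improvement` is ours.

```python
# (GLM-5 score, GLM-4.7 score) pairs taken from the benchmark results above.
SCORES = {
    "HLE (w/ Tools)": (50.4, 42.8),
    "BrowseComp (w/ Context Management)": (75.9, 67.5),
    # Assumption: the dagger value 61.1 is the one quoted in the introduction.
    "Terminal-Bench 2.0 (Claude Code)": (61.1, 32.8),
}


def improvement(glm5: float, glm47: float) -> float:
    """Absolute gain of GLM-5 over GLM-4.7, in points, rounded to one decimal."""
    return round(glm5 - glm47, 1)


for name, (glm5, glm47) in SCORES.items():
    print(f"{name}: +{improvement(glm5, glm47)} points")
```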
