GLM-5: How to Run Locally Guide
Run the new GLM-5 model by Z.ai on your own local device!
GLM-5 is Z.ai’s latest reasoning model, delivering stronger coding, agent, and chat performance than GLM-4.7, and it is designed for long-context reasoning. It improves over GLM-4.7 on benchmarks such as Humanity's Last Exam (50.4%, +7.6), BrowseComp (75.9%, +8.4) and Terminal-Bench 2.0 (61.1%, +28.3).
The full 744B-parameter (40B active) model has a 200K context window and was pre-trained on 28.5T tokens. The full GLM-5 model requires 1.51TB of disk space, while the Unsloth Dynamic 2-bit GGUF reduces the size to 281GB (-81%) and the dynamic 1-bit to 176GB (-88%): GLM-5-GGUF
All uploads use Unsloth Dynamic 2.0 for SOTA quantization performance, so even the 1-bit quant has important layers upcast to 8- or 16-bit. Thank you to Z.ai for providing Unsloth with day-zero access.
⚙️ Usage Guide
The 2-bit dynamic quant UD-Q2_K_XL uses 281GB of disk space; this works well with a single 24GB GPU and 256GB of RAM with MoE offloading. Otherwise you can use IQ2_M, which fits directly on a 256GB Mac.
Use --jinja for llama.cpp quants, as this enables the correct chat template; you might get incorrect results if you do not use --jinja. Also use --fit on, which automatically fits the GGUF to your hardware.
The 1-bit quants will fit on a single 40GB GPU (with MoE layers offloaded to RAM). Expect around 5 tokens/s with this setup if you also have around 165GB of RAM. For optimal performance you will need at least 205GB of unified memory, or 205GB of combined RAM+VRAM, for 5+ tokens/s. To learn how to increase generation speed and fit longer contexts, read here.
Though not a must, for best performance your combined VRAM + RAM should equal at least the size of the quant you're downloading. If not, hard drive / SSD offloading will still work with llama.cpp, but inference will be slower. Also use --fit on in llama.cpp to automatically enable maximum GPU usage!
Recommended Settings
Use distinct settings for different use cases. Recommended settings for the default and multi-turn agentic use cases are below, followed by a short request sketch:

| Setting | Default | Multi-turn agentic |
| --- | --- | --- |
| temperature | 1.0 | 0.7 |
| top_p | 0.95 | 1.0 |
| max new tokens | 131072 | 16384 |
| repeat penalty | disabled or 1.0 | disabled or 1.0 |

Use --jinja for llama.cpp quants. Maximum context window: 202,752 tokens. For multi-turn agentic tasks (τ²-Bench and Terminal-Bench 2.0), turn on Preserved Thinking mode.
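If you call GLM-5 through an OpenAI-compatible server (llama-server or vLLM, both covered below), the two profiles map onto request parameters roughly as in this minimal sketch. The endpoint, port and model name are placeholders rather than values from this guide, so point them at whatever you are actually serving.

```python
# A rough sketch of the two recommended sampling profiles as OpenAI-style request
# parameters. The base_url and model name are placeholders -- change them to match
# your own llama-server / vLLM instance. Repeat penalty is left at its default
# (disabled / 1.0), as recommended above.
from openai import OpenAI

DEFAULT_PROFILE = {"temperature": 1.0, "top_p": 0.95, "max_tokens": 131072}
AGENTIC_PROFILE = {"temperature": 0.7, "top_p": 1.0, "max_tokens": 16384}

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="sk-no-key-required")

response = client.chat.completions.create(
    model="GLM-5",
    messages=[{"role": "user", "content": "Summarise the rules of tic-tac-toe."}],
    **DEFAULT_PROFILE,  # swap in AGENTIC_PROFILE for multi-turn agentic workloads
)
print(response.choices[0].message.content)
```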
Run GLM-5 Tutorials:
✨ Run in llama.cpp
Obtain the latest llama.cpp and note that you MUST build it with PR 19460 on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or only want CPU inference.
If you want llama.cpp to download and load the model directly, you can do the below; (:Q2_K_XL) is the quantization type. You can also download the model via Hugging Face (point 3). This is similar to ollama run. Use export LLAMA_CACHE="folder" to force llama.cpp to save downloads to a specific location. Remember the model has a maximum context length of 200K tokens.
Follow this for general instruction use-cases:
Follow this for tool-calling use-cases:
Use --fit on for maximum usage of your GPU and CPU.
Optionally, try -ot ".ffn_.*_exps.=CPU" to offload all MoE layers to the CPU! This effectively allows you to fit all non-MoE layers on a single GPU, improving generation speed. You can customize the regular expression to fit more layers on the GPU if you have more GPU capacity.
If you have a bit more GPU memory, try -ot ".ffn_(up|down)_exps.=CPU". This offloads the up and down projection MoE layers.
Try -ot ".ffn_(up)_exps.=CPU" if you have even more GPU memory. This offloads only the up projection MoE layers.
And finally, offload all MoE layers via -ot ".ffn_.*_exps.=CPU". This uses the least VRAM.
You can also customize the regex, for example -ot "\.(6|7|8|9|[0-9][0-9]|[0-9][0-9][0-9])\.ffn_(gate|up|down)_exps.=CPU" means to offload gate, up and down MoE layers but only from the 6th layer onwards.
Download the model via the snippet below (after installing pip install huggingface_hub hf_transfer). You can choose UD-Q2_K_XL (the dynamic 2-bit quant) or other quantized versions like Q4_K_XL. We recommend the ~2.7-bit dynamic quant UD-Q2_K_XL to balance size and accuracy.
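Here is a minimal download sketch using huggingface_hub. The repo id unsloth/GLM-5-GGUF is assumed from the GLM-5-GGUF link above, so double-check the exact name on Hugging Face before running.

```python
# Minimal download sketch. The repo id is an assumption based on the GLM-5-GGUF
# link above -- verify it on Hugging Face first.
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"  # use hf_transfer for faster downloads

from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="unsloth/GLM-5-GGUF",       # assumed repo id
    local_dir="GLM-5-GGUF",
    allow_patterns=["*UD-Q2_K_XL*"],    # or another quant, e.g. "*Q4_K_XL*"
)
```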
You can edit --threads 32 for the number of CPU threads, --ctx-size 16384 for the context length, and --n-gpu-layers 2 for how many layers to offload to the GPU. Try adjusting these if your GPU runs out of memory. Also remove --n-gpu-layers if you are doing CPU-only inference.
🦙 Llama-server serving & OpenAI's completion library
To deploy GLM-5 for production use cases, we use llama-server. In a new terminal (say via tmux), deploy the model via:
Then in a new terminal, after doing pip install openai, do:
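Below is a minimal sketch with the OpenAI client pointed at llama-server's OpenAI-compatible endpoint. The base_url assumes llama-server's default port 8080, so change it if you passed a different --port; the model field is required by the client, but llama-server serves whatever model it was launched with.

```python
# Minimal sketch: call the local llama-server via the OpenAI client.
# Assumes the default llama-server port 8080 -- adjust base_url if needed.
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="sk-no-key-required")

response = client.chat.completions.create(
    model="GLM-5",
    messages=[
        {"role": "user", "content": "Create a Snake game in Python using pygame."}
    ],
    temperature=1.0,   # default settings recommended above
    top_p=0.95,
)
print(response.choices[0].message.content)
```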
And you will get the following example of a Snake game:

💻 vLLM Deployment
You can now serve Z.ai's FP8 version of the model via vLLM. First, install vLLM via the nightly build:
Then serve the model. If you have 1 GPU, use CUDA_VISIBLE_DEVICES='0' and set --tensor-parallel-size 1, or remove this argument. To disable FP8, remove --quantization fp8 --kv-cache-dtype fp8.
You can then call the served model via the OpenAI API:
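Here is a short sketch against vLLM's OpenAI-compatible server. It assumes vLLM's default port 8000 and reads the served model id back from the /v1/models endpoint, so you don't have to hard-code the name you passed to vllm serve.

```python
# Minimal sketch: call the vLLM OpenAI-compatible server (default port 8000 assumed).
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="EMPTY")

model_id = client.models.list().data[0].id  # whatever model name vLLM is serving

response = client.chat.completions.create(
    model=model_id,
    messages=[{"role": "user", "content": "Why is the sky blue? Answer briefly."}],
    temperature=0.7,  # multi-turn agentic profile; use 1.0 / 0.95 for general chat
    top_p=1.0,
)
print(response.choices[0].message.content)
```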
🔨 Tool Calling with GLM-5
See the Tool Calling Guide for more details on how to do tool calling. In a new terminal (if using tmux, use CTRL+B+D), we create some tools such as adding 2 numbers, executing Python code, running Linux commands and much more:
We then use the functions below (copy, paste and execute them), which parse the tool calls automatically and call the OpenAI-compatible endpoint for any model:
After launching GLM-5 via llama-server (see GLM-5 above, or the Tool Calling Guide for more details), we can then make some tool calls, as in the sketch below.
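The following is a hedged, minimal sketch rather than the guide's exact script: it defines one tool that adds two numbers, registers its JSON schema, sends a request to a local llama-server (default port 8080 assumed), executes any tool calls the model returns, and feeds the results back for a final answer.

```python
# Minimal tool-calling sketch against an OpenAI-compatible endpoint (llama-server
# on port 8080 assumed). The tool, schema and dispatcher below are illustrative.
import json
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="sk-no-key-required")

def add_two_numbers(a: float, b: float) -> float:
    """Add two numbers and return the result."""
    return a + b

TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "add_two_numbers",
            "description": "Add two numbers together.",
            "parameters": {
                "type": "object",
                "properties": {
                    "a": {"type": "number", "description": "First number"},
                    "b": {"type": "number", "description": "Second number"},
                },
                "required": ["a", "b"],
            },
        },
    }
]
AVAILABLE_FUNCTIONS = {"add_two_numbers": add_two_numbers}

def chat_with_tools(prompt: str, model: str = "GLM-5") -> str:
    messages = [{"role": "user", "content": prompt}]
    response = client.chat.completions.create(
        model=model, messages=messages, tools=TOOLS, temperature=0.7, top_p=1.0
    )
    message = response.choices[0].message
    if not message.tool_calls:              # model answered directly, no tools used
        return message.content

    # Echo the assistant turn (including its tool calls) back into the history.
    messages.append(
        {
            "role": "assistant",
            "content": message.content or "",
            "tool_calls": [tc.model_dump() for tc in message.tool_calls],
        }
    )

    # Execute each requested tool and append its result as a "tool" message.
    for tool_call in message.tool_calls:
        fn = AVAILABLE_FUNCTIONS[tool_call.function.name]
        args = json.loads(tool_call.function.arguments)
        messages.append(
            {
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": str(fn(**args)),
            }
        )

    # Ask the model for a final answer that uses the tool results.
    final = client.chat.completions.create(model=model, messages=messages, tools=TOOLS)
    return final.choices[0].message.content

print(chat_with_tools("What is 103.7 + 29.45? Use the add_two_numbers tool."))
```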
📊 Benchmarks
You can view the benchmarks in table format below. The first results column is GLM-5 and the second is GLM-4.7; the remaining columns are other comparison models.

| Benchmark | GLM-5 | GLM-4.7 | | | | | |
| --- | --- | --- | --- | --- | --- | --- | --- |
| HLE | 30.5 | 24.8 | 25.1 | 31.5 | 28.4 | 37.2 | 35.4 |
| HLE (w/ Tools) | 50.4 | 42.8 | 40.8 | 51.8 | 43.4* | 45.8* | 45.5* |
| AIME 2026 I | 92.7 | 92.9 | 92.7 | 92.5 | 93.3 | 90.6 | - |
| HMMT Nov. 2025 | 96.9 | 93.5 | 90.2 | 91.1 | 91.7 | 93.0 | 97.1 |
| IMOAnswerBench | 82.5 | 82.0 | 78.3 | 81.8 | 78.5 | 83.3 | 86.3 |
| GPQA-Diamond | 86.0 | 85.7 | 82.4 | 87.6 | 87.0 | 91.9 | 92.4 |
| SWE-bench Verified | 77.8 | 73.8 | 73.1 | 76.8 | 80.9 | 76.2 | 80.0 |
| SWE-bench Multilingual | 73.3 | 66.7 | 70.2 | 73.0 | 77.5 | 65.0 | 72.0 |
| Terminal-Bench 2.0 (Terminus 2) | 56.2 / 60.7 † | 41.0 | 39.3 | 50.8 | 59.3 | 54.2 | 54.0 |
| Terminal-Bench 2.0 (Claude Code) | 56.2 / 61.1 † | 32.8 | 46.4 | - | 57.9 | - | - |
| CyberGym | 43.2 | 23.5 | 17.3 | 41.3 | 50.6 | 39.9 | - |
| BrowseComp | 62.0 | 52.0 | 51.4 | 60.6 | 37.0 | 37.8 | - |
| BrowseComp (w/ Context Manage) | 75.9 | 67.5 | 67.6 | 74.9 | 67.8 | 59.2 | 65.8 |
| BrowseComp-Zh | 72.7 | 66.6 | 65.0 | 62.3 | 62.4 | 66.8 | 76.1 |
| τ²-Bench | 89.7 | 87.4 | 85.3 | 80.2 | 91.6 | 90.7 | 85.5 |
| MCP-Atlas (Public Set) | 67.8 | 52.0 | 62.2 | 63.8 | 65.2 | 66.6 | 68.0 |
| Tool-Decathlon | 38.0 | 23.8 | 35.2 | 27.8 | 43.5 | 36.4 | 46.3 |
| Vending Bench 2 | $4,432.12 | $2,376.82 | $1,034.00 | $1,198.46 | $4,967.06 | $5,478.16 | $3,591.33 |