GLM-5: How to Run Locally Guide
Run the new GLM-5 model by Z.ai on your own local device!
GLM-5 is Z.ai's latest reasoning model, delivering stronger coding, agent, and chat performance than GLM-4.7, and it is designed for long-context reasoning. It improves scores on benchmarks such as Humanity's Last Exam with 50.4% (+7.6%), BrowseComp with 75.9% (+8.4%), and Terminal-Bench 2.0 with 61.1% (+28.3%).
The full 744B-parameter (40B active) model has a 200K context window and was pre-trained on 28.5T tokens. The full GLM-5 model requires 1.65TB of disk space, while the Unsloth Dynamic 2-bit GGUF reduces the size to 241GB (-85%) and the dynamic 1-bit to 176GB (-89%): GLM-5-GGUF
All uploads use Unsloth Dynamic 2.0 for SOTA quantization performance, so even the 1-bit quant has important layers upcast to 8 or 16-bit. Thank you to Z.ai for providing Unsloth with day-zero access.
⚙️ Usage Guide
The 2-bit dynamic quant UD-IQ2_XXS uses 241GB of disk space, so it fits directly on a 256GB unified-memory Mac, and it also works well with a single 24GB GPU plus 256GB of RAM using MoE offloading. The 1-bit quant fits in 180GB of RAM, and the 8-bit quant requires 805GB of RAM.
For best performance, make sure your total available memory (VRAM + system RAM) exceeds the size of the quantized model file you’re downloading. If it doesn’t, llama.cpp can still run via SSD/HDD offloading, but inference will be slower.
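A quick way to sanity-check this rule of thumb; a minimal sketch using the quant file sizes quoted above (in GB):

```python
# Quantized file sizes quoted in this guide, in GB.
QUANT_SIZES_GB = {"1-bit": 176, "2-bit": 241, "8-bit": 805}

def fits_in_memory(quant_gb: float, vram_gb: float, ram_gb: float) -> bool:
    """True if VRAM + system RAM can hold the whole quantized file
    (otherwise llama.cpp falls back to slower SSD/HDD offloading)."""
    return vram_gb + ram_gb >= quant_gb

# A single 24GB GPU plus 256GB RAM comfortably fits the 2-bit quant:
print(fits_in_memory(QUANT_SIZES_GB["2-bit"], vram_gb=24, ram_gb=256))  # True
```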
Recommended Settings
Use distinct settings for different use cases:
| Setting | First profile | Second profile |
|---|---|---|
| temperature | 1.0 | 0.7 |
| top_p | 0.95 | 1.0 |
| max new tokens | 131,072 | 16,384 |
| repeat penalty | disabled or 1.0 | disabled or 1.0 |

Min_P = 0.01 (llama.cpp's default is 0.05)
Maximum context window: 202,752
For multi-turn agentic tasks (τ²-Bench and Terminal-Bench 2.0), please turn on Preserved Thinking mode.
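These settings drop straight into an OpenAI-style request body (llama-server also accepts min_p and repeat_penalty as extra sampling fields). A minimal sketch, assuming the first value of each pair above belongs to the same profile:

```python
def glm5_request(messages, max_tokens=131072):
    """Build an OpenAI-style chat request body with the recommended sampling settings."""
    return {
        "messages": messages,
        "temperature": 1.0,
        "top_p": 0.95,
        "min_p": 0.01,          # llama.cpp's default is 0.05
        "repeat_penalty": 1.0,  # 1.0 == disabled
        "max_tokens": max_tokens,
    }

req = glm5_request([{"role": "user", "content": "Hello GLM-5!"}])
```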
Run GLM-5 Tutorials:
✨ Run in llama.cpp
Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.
If you want to use llama.cpp directly to load models, you can do the below. The suffix after the colon (e.g. :IQ2_XXS) selects the quantization type. You can also download via Hugging Face (point 3). This is similar to ollama run. Use export LLAMA_CACHE="folder" to force llama.cpp to save to a specific location. Remember the model has a maximum context length of 200K.
Follow this for general instruction use-cases:
Follow this for tool-calling use-cases:
Download the model (after installing pip install huggingface_hub hf_transfer). You can choose UD-Q2_K_XL (dynamic 2-bit quant) or other quantized versions like UD-Q4_K_XL. We recommend our 2-bit dynamic quant UD-Q2_K_XL to balance size and accuracy. If downloads get stuck, see Hugging Face Hub, XET debugging
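As a sketch, a helper that downloads only one quant's shards via huggingface_hub's allow_patterns; the repo id unsloth/GLM-5-GGUF is an assumption based on the link above:

```python
from fnmatch import fnmatch

def shard_pattern(quant: str) -> str:
    """Glob pattern that selects only the shards of one quant."""
    return f"*{quant}*"

def download_quant(quant: str = "UD-Q2_K_XL", local_dir: str = "GLM-5-GGUF"):
    # pip install huggingface_hub hf_transfer
    from huggingface_hub import snapshot_download
    snapshot_download(
        repo_id="unsloth/GLM-5-GGUF",  # assumed repo id, per the link above
        local_dir=local_dir,
        allow_patterns=[shard_pattern(quant)],  # skip every other quant in the repo
    )

# The pattern matches only the chosen quant's files:
print(fnmatch("GLM-5-UD-Q2_K_XL-00001-of-00006.gguf", shard_pattern("UD-Q2_K_XL")))  # True
```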
You can edit --threads 32 for the number of CPU threads and --ctx-size 16384 for the context length. --n-gpu-layers 2 sets how many layers to offload to the GPU; lower it if your GPU runs out of memory, and remove it for CPU-only inference.
🦙 Llama-server serving & OpenAI's completion library
To deploy GLM-5 for production, we use llama-server. In a new terminal (say, via tmux), deploy the model via:
Then in a new terminal, after doing pip install openai, do:
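A minimal client sketch, assuming llama-server is listening on its default port 8080; the API key is a placeholder since llama-server does not check it:

```python
def build_messages(prompt: str) -> list:
    """Wrap a user prompt in OpenAI chat format."""
    return [{"role": "user", "content": prompt}]

def chat(prompt: str, base_url: str = "http://127.0.0.1:8080/v1") -> str:
    # pip install openai
    from openai import OpenAI
    client = OpenAI(base_url=base_url, api_key="sk-no-key-needed")
    resp = client.chat.completions.create(
        model="GLM-5",  # a single-model llama-server does not route by name
        messages=build_messages(prompt),
        temperature=1.0,
        top_p=0.95,
    )
    return resp.choices[0].message.content

# e.g. print(chat("Create a Snake game in Python using pygame"))
```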
And you will get the following example of a Snake game:

💻 vLLM Deployment
You can now serve Z.ai's FP8 version of the model via vLLM. You need 860GB of VRAM or more, so at least 8xH200 (8x141GB = 1,128GB) is recommended; 8xB200 also works well. First, install the vLLM nightly:
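As a back-of-the-envelope check on that figure, assuming FP8 weights take roughly one byte per parameter, with the remainder of the 860GB going to KV cache and activations:

```python
params_billion = 744        # total parameter count, in billions
fp8_bytes_per_param = 1     # FP8 stores ~1 byte per weight
weights_gb = params_billion * fp8_bytes_per_param   # ~744 GB for weights alone
cluster_gb = 8 * 141                                # 8 x H200 = 1128 GB total VRAM
print(weights_gb, cluster_gb)
```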
The FP8 KV cache (--kv-cache-dtype fp8) reduces KV-cache memory usage by 50%; remove the flag to disable it.
You can then call the served model via the OpenAI API:
🔨 Tool Calling with GLM-5
See the Tool Calling Guide for more details on how to do tool calling. In a new terminal (if using tmux, press CTRL+B then D), we create some tools, like adding two numbers, executing Python code, executing Linux commands, and much more:
We then use the below functions (copy and paste and execute) which will parse the function calls automatically and call the OpenAI endpoint for any model:
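A sketch of what such a parser can look like; the JSON tool-call shape and the example tools are illustrative assumptions, not the guide's exact functions:

```python
import json
import subprocess
import sys

# Illustrative tool registry (assumed names, not the guide's exact tools)
def add_numbers(a: float, b: float) -> float:
    return a + b

def run_python(code: str) -> str:
    """Run Python code in a subprocess and return its stdout."""
    out = subprocess.run([sys.executable, "-c", code], capture_output=True, text=True)
    return out.stdout.strip()

TOOLS = {"add_numbers": add_numbers, "run_python": run_python}

def dispatch(tool_call_json: str):
    """Parse a model-emitted tool call such as
    {"name": "add_numbers", "arguments": {"a": 2, "b": 3}} and execute it."""
    call = json.loads(tool_call_json)
    return TOOLS[call["name"]](**call["arguments"])

print(dispatch('{"name": "add_numbers", "arguments": {"a": 2, "b": 3}}'))  # 5
```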
After launching GLM-5 via llama-server as above (or see the Tool Calling Guide for more details), we can then make some tool calls.
📊 Benchmarks
The benchmarks are shown in table format below:

| Benchmark | GLM-5 | | | | | | |
|---|---|---|---|---|---|---|---|
| HLE | 30.5 | 24.8 | 25.1 | 31.5 | 28.4 | 37.2 | 35.4 |
| HLE (w/ Tools) | 50.4 | 42.8 | 40.8 | 51.8 | 43.4* | 45.8* | 45.5* |
| AIME 2026 I | 92.7 | 92.9 | 92.7 | 92.5 | 93.3 | 90.6 | - |
| HMMT Nov. 2025 | 96.9 | 93.5 | 90.2 | 91.1 | 91.7 | 93.0 | 97.1 |
| IMOAnswerBench | 82.5 | 82.0 | 78.3 | 81.8 | 78.5 | 83.3 | 86.3 |
| GPQA-Diamond | 86.0 | 85.7 | 82.4 | 87.6 | 87.0 | 91.9 | 92.4 |
| SWE-bench Verified | 77.8 | 73.8 | 73.1 | 76.8 | 80.9 | 76.2 | 80.0 |
| SWE-bench Multilingual | 73.3 | 66.7 | 70.2 | 73.0 | 77.5 | 65.0 | 72.0 |
| Terminal-Bench 2.0 (Terminus 2) | 56.2 / 60.7 † | 41.0 | 39.3 | 50.8 | 59.3 | 54.2 | 54.0 |
| Terminal-Bench 2.0 (Claude Code) | 56.2 / 61.1 † | 32.8 | 46.4 | - | 57.9 | - | - |
| CyberGym | 43.2 | 23.5 | 17.3 | 41.3 | 50.6 | 39.9 | - |
| BrowseComp | 62.0 | 52.0 | 51.4 | 60.6 | 37.0 | 37.8 | - |
| BrowseComp (w/ Context Manage) | 75.9 | 67.5 | 67.6 | 74.9 | 67.8 | 59.2 | 65.8 |
| BrowseComp-Zh | 72.7 | 66.6 | 65.0 | 62.3 | 62.4 | 66.8 | 76.1 |
| τ²-Bench | 89.7 | 87.4 | 85.3 | 80.2 | 91.6 | 90.7 | 85.5 |
| MCP-Atlas (Public Set) | 67.8 | 52.0 | 62.2 | 63.8 | 65.2 | 66.6 | 68.0 |
| Tool-Decathlon | 38.0 | 23.8 | 35.2 | 27.8 | 43.5 | 36.4 | 46.3 |
| Vending Bench 2 | $4,432.12 | $2,376.82 | $1,034.00 | $1,198.46 | $4,967.06 | $5,478.16 | $3,591.33 |