GLM-5: How to Run Locally Guide
Run the new GLM-5 model by Z.ai on your own local device!
GLM-5 is Z.ai’s latest reasoning model, delivering stronger coding, agent, and chat performance than GLM-4.7, and it is designed for long-context reasoning. It improves over GLM-4.7 on benchmarks such as Humanity's Last Exam (50.4%, +7.6), BrowseComp (75.9%, +8.4) and Terminal-Bench 2.0 (61.1%, +28.3).
The full 744B-parameter (40B active) model has a 200K context window and was pre-trained on 28.5T tokens. The full GLM-5 model requires 1.51TB of disk space, while the Unsloth Dynamic 2-bit GGUF reduces the size to 281GB (-81%) and the dynamic 1-bit to 176GB (-88%): GLM-5-GGUF
All uploads use Unsloth Dynamic 2.0 for SOTA quantization performance, so even the 1-bit quant has important layers upcast to 8- or 16-bit. Thank you to Z.ai for providing Unsloth with day-zero access.
⚙️ Usage Guide
The 2-bit dynamic quant UD-Q2_K_XL uses 281GB of disk space; this works well with a single 24GB GPU and 256GB of RAM with MoE offloading. Otherwise you can use IQ2_M, which fits directly on a 256GB Mac.
Use --jinja for llama.cpp quants, as this enables the correct chat template; you might get incorrect results if you do not use --jinja. Also use --fit on, which automatically fits the GGUF to your hardware.
The 1-bit quants will fit on a single 40GB GPU (with MoE layers offloaded to RAM). Expect around 5 tokens/s with this setup if you also have around 165GB of RAM. For optimal performance you will need at least 205GB of unified memory, or 205GB of combined RAM+VRAM, for 5+ tokens/s. To learn how to increase generation speed and fit longer contexts, read here.
Though not a must, for best performance your combined VRAM + RAM should equal at least the size of the quant you're downloading. If not, hard drive / SSD offloading will still work with llama.cpp, but inference will be slower. Also use --fit on in llama.cpp to automatically enable maximum GPU usage!
Recommended Settings
Use distinct settings for different use cases. Recommended settings for the default and multi-turn agentic use cases are below, followed by a short request sketch:

| Setting | Default | Multi-turn agentic |
| --- | --- | --- |
| temperature | 1.0 | 0.7 |
| top_p | 0.95 | 1.0 |
| max new tokens | 131072 | 16384 |
| repeat penalty | disabled or 1.0 | disabled or 1.0 |

Use --jinja for llama.cpp quants. Maximum context window: 202,752 tokens. For multi-turn agentic tasks (τ²-Bench and Terminal-Bench 2.0), turn on Preserved Thinking mode.
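If you call GLM-5 through an OpenAI-compatible server (llama-server or vLLM, both covered below), the two profiles map onto request parameters roughly as in this minimal sketch. The endpoint, port and model name are placeholders rather than values from this guide, so point them at whatever you are actually serving.

```python
# A rough sketch of the two recommended sampling profiles as OpenAI-style request
# parameters. The base_url and model name are placeholders -- change them to match
# your own llama-server / vLLM instance. Repeat penalty is left at its default
# (disabled / 1.0), as recommended above.
from openai import OpenAI

DEFAULT_PROFILE = {"temperature": 1.0, "top_p": 0.95, "max_tokens": 131072}
AGENTIC_PROFILE = {"temperature": 0.7, "top_p": 1.0, "max_tokens": 16384}

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="sk-no-key-required")

response = client.chat.completions.create(
    model="GLM-5",
    messages=[{"role": "user", "content": "Summarise the rules of tic-tac-toe."}],
    **DEFAULT_PROFILE,  # swap in AGENTIC_PROFILE for multi-turn agentic workloads
)
print(response.choices[0].message.content)
```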
Run GLM-5 Tutorials:
✨ Run in llama.cpp
Obtain the latest llama.cpp and note that you MUST build it with PR 19460 on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or only want CPU inference.
If you want llama.cpp to download and load the model directly, you can do the below; (:Q2_K_XL) is the quantization type. You can also download the model via Hugging Face (point 3). This is similar to ollama run. Use export LLAMA_CACHE="folder" to force llama.cpp to save downloads to a specific location. Remember the model has a maximum context length of 200K tokens.
Follow this for general instruction use-cases:
Follow this for tool-calling use-cases:
Use --fit on for maximum usage of your GPU and CPU.
Optionally, try -ot ".ffn_.*_exps.=CPU" to offload all MoE layers to the CPU! This effectively allows you to fit all non-MoE layers on a single GPU, improving generation speed. You can customize the regular expression to fit more layers on the GPU if you have more GPU capacity.
If you have a bit more GPU memory, try -ot ".ffn_(up|down)_exps.=CPU". This offloads the up and down projection MoE layers.
Try -ot ".ffn_(up)_exps.=CPU" if you have even more GPU memory. This offloads only the up projection MoE layers.
And finally, offload all MoE layers via -ot ".ffn_.*_exps.=CPU". This uses the least VRAM.
You can also customize the regex, for example -ot "\.(6|7|8|9|[0-9][0-9]|[0-9][0-9][0-9])\.ffn_(gate|up|down)_exps.=CPU" means to offload gate, up and down MoE layers but only from the 6th layer onwards.
Download the model via the snippet below (after installing pip install huggingface_hub hf_transfer). You can choose UD-Q2_K_XL (the dynamic 2-bit quant) or other quantized versions like Q4_K_XL. We recommend the ~2.7-bit dynamic quant UD-Q2_K_XL to balance size and accuracy.
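Here is a minimal download sketch using huggingface_hub. The repo id unsloth/GLM-5-GGUF is assumed from the GLM-5-GGUF link above, so double-check the exact name on Hugging Face before running.

```python
# Minimal download sketch. The repo id is an assumption based on the GLM-5-GGUF
# link above -- verify it on Hugging Face first.
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"  # use hf_transfer for faster downloads

from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="unsloth/GLM-5-GGUF",       # assumed repo id
    local_dir="GLM-5-GGUF",
    allow_patterns=["*UD-Q2_K_XL*"],    # or another quant, e.g. "*Q4_K_XL*"
)
```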
You can edit --threads 32 for the number of CPU threads, --ctx-size 16384 for the context length, and --n-gpu-layers 2 for how many layers to offload to the GPU. Try adjusting these if your GPU runs out of memory. Also remove --n-gpu-layers if you are doing CPU-only inference.
🦙 Llama-server serving & OpenAI's completion library
To deploy GLM-5 for production use cases, we use llama-server. In a new terminal (say via tmux), deploy the model via:
Then in a new terminal, after doing pip install openai, do:
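Below is a minimal sketch with the OpenAI client pointed at llama-server's OpenAI-compatible endpoint. The base_url assumes llama-server's default port 8080, so change it if you passed a different --port; the model field is required by the client, but llama-server serves whatever model it was launched with.

```python
# Minimal sketch: call the local llama-server via the OpenAI client.
# Assumes the default llama-server port 8080 -- adjust base_url if needed.
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="sk-no-key-required")

response = client.chat.completions.create(
    model="GLM-5",
    messages=[
        {"role": "user", "content": "Create a Snake game in Python using pygame."}
    ],
    temperature=1.0,   # default settings recommended above
    top_p=0.95,
)
print(response.choices[0].message.content)
```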
And you will get the following example of a Snake game:

💻 vLLM Deployment
You can now serve Z.ai's FP8 version of the model via vLLM. First, install vLLM via the nightly build:
Then serve the model. If you have 1 GPU, use CUDA_VISIBLE_DEVICES='0' and set --tensor-parallel-size 1, or remove this argument. To disable FP8, remove --quantization fp8 --kv-cache-dtype fp8.
You can then call the served model via the OpenAI API:
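Here is a short sketch against vLLM's OpenAI-compatible server. It assumes vLLM's default port 8000 and reads the served model id back from the /v1/models endpoint, so you don't have to hard-code the name you passed to vllm serve.

```python
# Minimal sketch: call the vLLM OpenAI-compatible server (default port 8000 assumed).
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="EMPTY")

model_id = client.models.list().data[0].id  # whatever model name vLLM is serving

response = client.chat.completions.create(
    model=model_id,
    messages=[{"role": "user", "content": "Why is the sky blue? Answer briefly."}],
    temperature=0.7,  # multi-turn agentic profile; use 1.0 / 0.95 for general chat
    top_p=1.0,
)
print(response.choices[0].message.content)
```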
🔨 Tool Calling with GLM-5
See the Tool Calling Guide for more details on how to do tool calling. In a new terminal (if using tmux, use CTRL+B+D), we create some tools such as adding 2 numbers, executing Python code, running Linux commands and much more:
We then use the functions below (copy, paste and execute them), which parse the tool calls automatically and call the OpenAI-compatible endpoint for any model:
After launching GLM-5 via llama-server (see GLM-5 above, or the Tool Calling Guide for more details), we can then make some tool calls, as in the sketch below.
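The following is a hedged, minimal sketch rather than the guide's exact script: it defines one tool that adds two numbers, registers its JSON schema, sends a request to a local llama-server (default port 8080 assumed), executes any tool calls the model returns, and feeds the results back for a final answer.

```python
# Minimal tool-calling sketch against an OpenAI-compatible endpoint (llama-server
# on port 8080 assumed). The tool, schema and dispatcher below are illustrative.
import json
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="sk-no-key-required")

def add_two_numbers(a: float, b: float) -> float:
    """Add two numbers and return the result."""
    return a + b

TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "add_two_numbers",
            "description": "Add two numbers together.",
            "parameters": {
                "type": "object",
                "properties": {
                    "a": {"type": "number", "description": "First number"},
                    "b": {"type": "number", "description": "Second number"},
                },
                "required": ["a", "b"],
            },
        },
    }
]
AVAILABLE_FUNCTIONS = {"add_two_numbers": add_two_numbers}

def chat_with_tools(prompt: str, model: str = "GLM-5") -> str:
    messages = [{"role": "user", "content": prompt}]
    response = client.chat.completions.create(
        model=model, messages=messages, tools=TOOLS, temperature=0.7, top_p=1.0
    )
    message = response.choices[0].message
    if not message.tool_calls:              # model answered directly, no tools used
        return message.content

    # Echo the assistant turn (including its tool calls) back into the history.
    messages.append(
        {
            "role": "assistant",
            "content": message.content or "",
            "tool_calls": [tc.model_dump() for tc in message.tool_calls],
        }
    )

    # Execute each requested tool and append its result as a "tool" message.
    for tool_call in message.tool_calls:
        fn = AVAILABLE_FUNCTIONS[tool_call.function.name]
        args = json.loads(tool_call.function.arguments)
        messages.append(
            {
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": str(fn(**args)),
            }
        )

    # Ask the model for a final answer that uses the tool results.
    final = client.chat.completions.create(model=model, messages=messages, tools=TOOLS)
    return final.choices[0].message.content

print(chat_with_tools("What is 103.7 + 29.45? Use the add_two_numbers tool."))
```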
📊 Benchmarks
You can view the benchmarks in table format below. The first results column is GLM-5 and the second is GLM-4.7; the remaining columns are other comparison models.

| Benchmark | GLM-5 | GLM-4.7 | | | | | |
| --- | --- | --- | --- | --- | --- | --- | --- |
| HLE | 30.5 | 24.8 | 25.1 | 31.5 | 28.4 | 37.2 | 35.4 |
| HLE (w/ Tools) | 50.4 | 42.8 | 40.8 | 51.8 | 43.4* | 45.8* | 45.5* |
| AIME 2026 I | 92.7 | 92.9 | 92.7 | 92.5 | 93.3 | 90.6 | - |
| HMMT Nov. 2025 | 96.9 | 93.5 | 90.2 | 91.1 | 91.7 | 93.0 | 97.1 |
| IMOAnswerBench | 82.5 | 82.0 | 78.3 | 81.8 | 78.5 | 83.3 | 86.3 |
| GPQA-Diamond | 86.0 | 85.7 | 82.4 | 87.6 | 87.0 | 91.9 | 92.4 |
| SWE-bench Verified | 77.8 | 73.8 | 73.1 | 76.8 | 80.9 | 76.2 | 80.0 |
| SWE-bench Multilingual | 73.3 | 66.7 | 70.2 | 73.0 | 77.5 | 65.0 | 72.0 |
| Terminal-Bench 2.0 (Terminus 2) | 56.2 / 60.7 † | 41.0 | 39.3 | 50.8 | 59.3 | 54.2 | 54.0 |
| Terminal-Bench 2.0 (Claude Code) | 56.2 / 61.1 † | 32.8 | 46.4 | - | 57.9 | - | - |
| CyberGym | 43.2 | 23.5 | 17.3 | 41.3 | 50.6 | 39.9 | - |
| BrowseComp | 62.0 | 52.0 | 51.4 | 60.6 | 37.0 | 37.8 | - |
| BrowseComp (w/ Context Manage) | 75.9 | 67.5 | 67.6 | 74.9 | 67.8 | 59.2 | 65.8 |
| BrowseComp-Zh | 72.7 | 66.6 | 65.0 | 62.3 | 62.4 | 66.8 | 76.1 |
| τ²-Bench | 89.7 | 87.4 | 85.3 | 80.2 | 91.6 | 90.7 | 85.5 |
| MCP-Atlas (Public Set) | 67.8 | 52.0 | 62.2 | 63.8 | 65.2 | 66.6 | 68.0 |
| Tool-Decathlon | 38.0 | 23.8 | 35.2 | 27.8 | 43.5 | 36.4 | 46.3 |
| Vending Bench 2 | $4,432.12 | $2,376.82 | $1,034.00 | $1,198.46 | $4,967.06 | $5,478.16 | $3,591.33 |