GLM-5: How to Run Locally Guide

Run the new GLM-5 model by Z.ai on your own local device!

GLM-5 is Z.ai's latest reasoning model, delivering stronger coding, agent, and chat performance than GLM-4.7, and it is designed for long-context reasoning. It improves performance on benchmarks such as Humanity's Last Exam with 50.4% (+7.6% over GLM-4.7), BrowseComp with 75.9% (+8.4%), and Terminal-Bench 2.0 with 61.1% (+28.3%).

The full 744B parameter (40B active) model has a 200K context window and was pre-trained on 28.5T tokens. The full GLM-5 model requires 1.51TB of disk space, while the Unsloth Dynamic 2-bit GGUF reduces the size to 281GB (-81%), and the dynamic 1-bit to 176GB (-88%): GLM-5-GGUF

All uploads use Unsloth Dynamic 2.0 for SOTA quantization performance - so even the 1-bit quants have important layers upcast to 8- or 16-bit. Thank you to Z.ai for providing Unsloth with day-zero access.

⚙️ Usage Guide

The 2-bit dynamic quant UD-Q2_K_XL uses 281GB of disk space - this works well with a single 24GB GPU and 256GB of RAM with MoE offloading. Otherwise, you can use IQ2_M, which fits directly on a 256GB Mac.


Use --jinja for llama.cpp quants - this enables the correct chat template! You might get incorrect results if you do not use --jinja. Also use --fit on, which auto-fits the GGUF to your hardware.

The 1-bit quants will fit on a single 40GB GPU (with MoE layers offloaded to RAM). Expect around 5 tokens/s with this setup if you also have an additional 165GB of RAM. For optimal performance (5+ tokens/s), it is recommended to have at least 205GB of unified memory, or 205GB of combined RAM+VRAM (e.g. 40GB VRAM + 165GB RAM = 205GB total). To learn how to increase generation speed and fit longer contexts, read here.


Use distinct settings for different use cases. Recommended settings for default and multi-turn agentic use cases:

| Setting | Default Settings (Most Tasks) | SWE-bench Verified |
| --- | --- | --- |
| temperature | 1.0 | 0.7 |
| top_p | 0.95 | 1.0 |
| max new tokens | 131,072 | 16,384 |
| repeat penalty | disabled or 1.0 | disabled or 1.0 |

  • Use --jinja for llama.cpp variants.

  • Maximum context window: 202,752.

  • For multi-turn agentic tasks (τ²-Bench and Terminal Bench 2), please turn on Preserved Thinking mode.

Run GLM-5 Tutorials:

✨ Run in llama.cpp

1. Obtain the latest llama.cpp, and you MUST install PR 19460 on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or only want CPU inference.
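Below is a representative build flow (a sketch assuming the standard llama.cpp CMake build; the local branch name used for the PR checkout is arbitrary):

```bash
# Install build dependencies (Debian/Ubuntu shown; adapt to your distro).
apt-get update && apt-get install -y build-essential cmake curl libcurl4-openssl-dev
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
# Fetch and check out PR 19460 (standard GitHub PR fetch; "glm5-pr" is an arbitrary local name).
git fetch origin pull/19460/head:glm5-pr && git checkout glm5-pr
# Use -DGGML_CUDA=OFF instead for CPU-only builds.
cmake . -B build -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build build --config Release -j --target llama-cli llama-server
cp build/bin/llama-* .
cd ..
```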

2. If you want to use llama.cpp directly to load models, you can do the below: (:Q2_K_XL) is the quantization type. You can also download via Hugging Face (step 3). This is similar to ollama run. Use export LLAMA_CACHE="folder" to force llama.cpp to save to a specific location. Remember the model has a maximum context length of only 200K tokens.

Follow this for general instruction use-cases:
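For example (a sketch: the unsloth/GLM-5-GGUF repo name is assumed from the link above, and the sampling flags follow the default settings in the table):

```bash
# Cache downloads in a specific folder (optional).
export LLAMA_CACHE="unsloth/GLM-5-GGUF"
./llama.cpp/llama-cli \
    -hf unsloth/GLM-5-GGUF:Q2_K_XL \
    --jinja \
    --fit on \
    --threads -1 \
    --ctx-size 16384 \
    --temp 1.0 \
    --top-p 0.95
```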

Follow this for tool-calling use-cases:
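A sketch using the SWE-bench-style agentic settings from the table above (same assumed repo name):

```bash
export LLAMA_CACHE="unsloth/GLM-5-GGUF"
./llama.cpp/llama-cli \
    -hf unsloth/GLM-5-GGUF:Q2_K_XL \
    --jinja \
    --fit on \
    --threads -1 \
    --ctx-size 16384 \
    --temp 0.7 \
    --top-p 1.0
```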


Use --fit on for maximum usage of your GPU and CPU.

Optionally, try -ot ".ffn_.*_exps.=CPU" to offload all MoE layers to the CPU! This effectively allows you to fit all non-MoE layers on 1 GPU, improving generation speeds. You can customize the regex to fit more layers if you have more GPU capacity.

If you have a bit more GPU memory, try -ot ".ffn_(up|down)_exps.=CPU". This offloads the up and down projection MoE layers.

Try -ot ".ffn_(up)_exps.=CPU" if you have even more GPU memory. This offloads only the up projection MoE layers.

And finally offload all layers via -ot ".ffn_.*_exps.=CPU". This uses the least VRAM.

You can also customize the regex, for example -ot "\.(6|7|8|9|[0-9][0-9]|[0-9][0-9][0-9])\.ffn_(gate|up|down)_exps.=CPU" means to offload the gate, up and down MoE layers, but only from the 6th layer onwards.
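Putting this together, a sketch that keeps all non-MoE layers on the GPU while offloading every MoE expert to the CPU:

```bash
./llama.cpp/llama-cli \
    -hf unsloth/GLM-5-GGUF:Q2_K_XL \
    --jinja \
    --threads -1 \
    --ctx-size 16384 \
    --n-gpu-layers 99 \
    -ot ".ffn_.*_exps.=CPU"
```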

3. Download the model via the snippet below (after installing pip install huggingface_hub hf_transfer). You can choose UD-Q2_K_XL (dynamic 2-bit quant) or other quantized versions like Q4_K_XL. We recommend using our 2.7-bit dynamic quant UD-Q2_K_XL to balance size and accuracy.
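A representative download script (the unsloth/GLM-5-GGUF repo name is assumed from the link above):

```python
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"  # enable faster hf_transfer downloads
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="unsloth/GLM-5-GGUF",
    local_dir="unsloth/GLM-5-GGUF",
    allow_patterns=["*UD-Q2_K_XL*"],  # change to "*Q4_K_XL*" etc. for other quants
)
```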

4. You can edit --threads 32 for the number of CPU threads, --ctx-size 16384 for the context length, and --n-gpu-layers 2 for how many layers to offload to the GPU. Try adjusting it if your GPU runs out of memory. Also remove it for CPU-only inference.
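For example (a sketch; the shard filename is illustrative - point --model at the first .gguf shard you downloaded):

```bash
./llama.cpp/llama-cli \
    --model unsloth/GLM-5-GGUF/UD-Q2_K_XL/GLM-5-UD-Q2_K_XL-00001-of-00007.gguf \
    --jinja \
    --threads 32 \
    --ctx-size 16384 \
    --n-gpu-layers 2 \
    -ot ".ffn_.*_exps.=CPU" \
    --temp 1.0 \
    --top-p 0.95
```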

🦙 Llama-server serving & OpenAI's completion library

To deploy GLM-5 for production, we use llama-server. In a new terminal (for example via tmux), deploy the model via:
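A representative deployment (host and port are illustrative):

```bash
./llama.cpp/llama-server \
    -hf unsloth/GLM-5-GGUF:Q2_K_XL \
    --jinja \
    --fit on \
    --threads -1 \
    --ctx-size 16384 \
    --host 0.0.0.0 \
    --port 8001
```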

Then in a new terminal, after doing pip install openai, do:
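A minimal client sketch (the base_url and port match the illustrative server command above; llama-server does not check the API key):

```python
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8001/v1", api_key="sk-no-key-required")
response = client.chat.completions.create(
    model="GLM-5",  # llama-server serves a single model; this name is arbitrary
    messages=[{"role": "user", "content": "Write a Snake game in Python."}],
    temperature=1.0,
    top_p=0.95,
)
print(response.choices[0].message.content)
```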

And you will get a working Snake game as the output.

💻 vLLM Deployment

You can now serve Z.ai's FP8 version of the model via vLLM. First, install the nightly build:
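For example (a representative command - check vLLM's documentation for the current nightly index):

```bash
pip install -U vllm --extra-index-url https://wheels.vllm.ai/nightly
```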

Then serve the model. If you have 1 GPU, use CUDA_VISIBLE_DEVICES='0' and set --tensor-parallel-size 1, or remove this argument. To disable FP8, remove --quantization fp8 --kv-cache-dtype fp8.
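A sketch of the launch (the zai-org/GLM-5-FP8 model ID is an assumption - verify the exact repo name on Hugging Face):

```bash
CUDA_VISIBLE_DEVICES='0' vllm serve zai-org/GLM-5-FP8 \
    --quantization fp8 \
    --kv-cache-dtype fp8 \
    --tensor-parallel-size 1 \
    --max-model-len 16384 \
    --port 8001
```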

You can then call the served model via the OpenAI API:
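For example with curl (the endpoint and model name match the illustrative launch above):

```bash
curl http://localhost:8001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "zai-org/GLM-5-FP8",
    "messages": [{"role": "user", "content": "What is 2+2?"}],
    "temperature": 1.0,
    "top_p": 0.95
  }'
```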

🔨 Tool Calling with GLM-5

See the Tool Calling Guide for more details on how to do tool calling. In a new terminal (if using tmux, use CTRL+B then D), we create some tools, like adding 2 numbers, executing Python code, executing Linux commands, and much more:
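A minimal sketch of such tools (these helper names are illustrative, not a fixed API):

```python
import subprocess

def add_two_numbers(a: float, b: float) -> float:
    """Add two numbers and return the result."""
    return a + b

def execute_python(code: str) -> str:
    """Run a Python snippet and capture its output. WARNING: executes arbitrary code."""
    result = subprocess.run(["python3", "-c", code], capture_output=True, text=True, timeout=30)
    return result.stdout or result.stderr

def execute_shell(command: str) -> str:
    """Run a Linux command and capture its output. WARNING: executes arbitrary commands."""
    result = subprocess.run(command, shell=True, capture_output=True, text=True, timeout=30)
    return result.stdout or result.stderr

# Registry used by the dispatcher below.
TOOLS = {
    "add_two_numbers": add_two_numbers,
    "execute_python": execute_python,
    "execute_shell": execute_shell,
}
```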

We then use the functions below (copy, paste, and execute them), which will parse the function calls automatically and call the OpenAI endpoint for any model:
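A minimal dispatcher sketch (only one tool schema is shown; add schemas for the other tools the same way):

```python
import json
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8001/v1", api_key="sk-no-key-required")

TOOL_SCHEMAS = [{
    "type": "function",
    "function": {
        "name": "add_two_numbers",
        "description": "Add two numbers and return the result.",
        "parameters": {
            "type": "object",
            "properties": {"a": {"type": "number"}, "b": {"type": "number"}},
            "required": ["a", "b"],
        },
    },
}]

def chat_with_tools(messages, model="GLM-5"):
    """Send one chat turn; execute any returned tool calls and append their results."""
    response = client.chat.completions.create(model=model, messages=messages, tools=TOOL_SCHEMAS)
    message = response.choices[0].message
    messages.append(message.model_dump(exclude_none=True))
    for call in message.tool_calls or []:
        args = json.loads(call.function.arguments)
        result = TOOLS[call.function.name](**args)  # TOOLS from the previous snippet
        messages.append({"role": "tool", "tool_call_id": call.id, "content": str(result)})
    return messages
```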

After launching GLM-5 via llama-server as shown above (or see the Tool Calling Guide for more details), we can then make some tool calls:
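For example (the prompt and expected flow are illustrative):

```python
messages = [{"role": "user", "content": "What is 1923.12 + 581.34? Use the add_two_numbers tool."}]
messages = chat_with_tools(messages)
print(messages[-1])  # the tool result appended by the dispatcher
```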

📊 Benchmarks

You can view the benchmarks in table format below:

| Benchmark | GLM-5 | GLM-4.7 | DeepSeek-V3.2 | Kimi K2.5 | Claude Opus 4.5 | Gemini 3 Pro | GPT-5.2 (xhigh) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| HLE | 30.5 | 24.8 | 25.1 | 31.5 | 28.4 | 37.2 | 35.4 |
| HLE (w/ Tools) | 50.4 | 42.8 | 40.8 | 51.8 | 43.4* | 45.8* | 45.5* |
| AIME 2026 I | 92.7 | 92.9 | 92.7 | 92.5 | 93.3 | 90.6 | - |
| HMMT Nov. 2025 | 96.9 | 93.5 | 90.2 | 91.1 | 91.7 | 93.0 | 97.1 |
| IMOAnswerBench | 82.5 | 82.0 | 78.3 | 81.8 | 78.5 | 83.3 | 86.3 |
| GPQA-Diamond | 86.0 | 85.7 | 82.4 | 87.6 | 87.0 | 91.9 | 92.4 |
| SWE-bench Verified | 77.8 | 73.8 | 73.1 | 76.8 | 80.9 | 76.2 | 80.0 |
| SWE-bench Multilingual | 73.3 | 66.7 | 70.2 | 73.0 | 77.5 | 65.0 | 72.0 |
| Terminal-Bench 2.0 (Terminus 2) | 56.2 / 60.7 † | 41.0 | 39.3 | 50.8 | 59.3 | 54.2 | 54.0 |
| Terminal-Bench 2.0 (Claude Code) | 56.2 / 61.1 † | 32.8 | 46.4 | - | 57.9 | - | - |
| CyberGym | 43.2 | 23.5 | 17.3 | 41.3 | 50.6 | 39.9 | - |
| BrowseComp | 62.0 | 52.0 | 51.4 | 60.6 | 37.0 | 37.8 | - |
| BrowseComp (w/ Context Manage) | 75.9 | 67.5 | 67.6 | 74.9 | 67.8 | 59.2 | 65.8 |
| BrowseComp-Zh | 72.7 | 66.6 | 65.0 | 62.3 | 62.4 | 66.8 | 76.1 |
| τ²-Bench | 89.7 | 87.4 | 85.3 | 80.2 | 91.6 | 90.7 | 85.5 |
| MCP-Atlas (Public Set) | 67.8 | 52.0 | 62.2 | 63.8 | 65.2 | 66.6 | 68.0 |
| Tool-Decathlon | 38.0 | 23.8 | 35.2 | 27.8 | 43.5 | 36.4 | 46.3 |
| Vending Bench 2 | $4,432.12 | $2,376.82 | $1,034.00 | $1,198.46 | $4,967.06 | $5,478.16 | $3,591.33 |
