GLM-4.7-Flash: How To Run Locally

Run & fine-tune GLM-4.7-Flash locally on your device!

GLM-4.7-Flash is Z.ai’s new 30B MoE reasoning model built for local deployment, delivering best-in-class performance for coding, agentic workflows, and chat. It uses ~3.6B parameters, supports 200K context, and leads SWE-Bench, GPQA, and reasoning/chat benchmarks.

GLM-4.7-Flash runs on 24GB RAM/VRAM/unified memory (32GB for full precision), and you can now fine-tune with Unsloth. To run GLM 4.7 Flash with vLLM, see GLM-4.7-Flash in vLLM

Running TutorialFine-tuning

GLM-4.7-Flash GGUF to run: unsloth/GLM-4.7-Flash-GGUF

⚙️ Usage Guide

For best performance, make sure your total available memory (VRAM + system RAM) exceeds the size of the quantized model file you’re downloading. If it doesn’t, llama.cpp can still run via SSD/HDD offloading, but inference will be slower.

After speaking with the Z.ai's team, they recommend using their GLM-4.7 sampling parameters:

Default Settings (Most Tasks)
Terminal Bench, SWE Bench Verified

temperature = 1.0

temperature = 0.7

top_p = 0.95

top_p = 1.0

repeat penalty = disabled or 1.0

repeat penalty = disabled or 1.0

  • For general use-case: --temp 1.0 --top-p 0.95

  • For tool-calling: --temp 0.7 --top-p 1.0

  • If using llama.cpp, set --min-p 0.01 as llama.cpp's default is 0.05

  • Sometimes you'll need to experiment what numbers work best for your use-case.

  • Maximum context window: 202,752

🖥️ Run GLM-4.7-Flash

Depending on your use-case you will need to use different settings. Some GGUFs end up similar in size because the model architecture (like gpt-oss) has dimensions not divisible by 128, so parts can’t be quantized to lower bits.

Because this guide uses 4-bit, you will need around 18GB RAM/unified memory. We recommend using at least 4-bit precision for best performance.

🦥 Unsloth Studio Guide

GLM-4.7-Flash can be run and fine-tuned in Unsloth Studio, our new open-source web UI for local AI. With Unsloth Studio, you can run models locally on MacOS, Windows, Linux and:

1

Install Unsloth

Run in your terminal:

MacOS, Linux, WSL:

Windows PowerShell:

2

Launch Unsloth

MacOS, Linux, WSL and Windows:

Then open http://localhost:8888 in your browser.

3

Search and download GLM-4.7-Flash

On first launch you will need to create a password to secure your account and sign in again later. You’ll then see a brief onboarding wizard to choose a model, dataset, and basic settings. You can skip it at any time.

Then go to the Studio Chat tab and search for GLM-4.7-Flash in the search bar and download your desired model and quant.

4

Run GLM-4.7-Flash

Inference parameters should be auto-set when using Unsloth Studio, however you can still change it manually. You can also edit the context length, chat template and other settings.

For more information, you can view our Unsloth Studio inference guide.

Llama.cpp Tutorial (GGUF):

Instructions to run in llama.cpp (note we will be using 4-bit to fit most devices):

1

Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference. For Apple Mac / Metal devices, set -DGGML_CUDA=OFF then continue as usual - Metal support is on by default.

2

You can directly pull from Hugging Face. You can increase the context to 200K as your RAM/VRAM allows.

You can also try Z.ai's recommended GLM-4.7 sampling parameters:

  • For general use-case: --temp 1.0 --top-p 0.95

  • For tool-calling: --temp 0.7 --top-p 1.0

  • Remember to disable repeat penalty!

Follow this for general instruction use-cases:

Follow this for tool-calling use-cases:

3

Download the model via (after installing pip install huggingface_hub). You can choose UD-Q4_K_XL or other quantized versions. If downloads get stuck, see Hugging Face Hub, XET debugging

4

Then run the model in conversation mode:

Also, adjust context window as required, up to 202752

Reducing repetition and looping

This means you can now use Z.ai's recommended parameters and get great results:

  • For general use-case: --temp 1.0 --top-p 0.95

  • For tool-calling: --temp 0.7 --top-p 1.0

  • If using llama.cpp, set --min-p 0.01 as llama.cpp's default is 0.05

  • Remember to disable repeat penalty! Or set --repeat-penalty 1.0

We added "scoring_func": "sigmoid" to config.json for the main model - see.

🐦Flappy Bird Example with UD-Q4_K_XL

As an example, we did the following long conversation by using UD-Q4_K_XL via ./llama.cpp/llama-cli --model unsloth/GLM-4.7-Flash-GGUF/GLM-4.7-Flash-UD-Q4_K_XL.gguf --fit on --temp 1.0 --top-p 0.95 --min-p 0.01 :

which rendered the following Flappy Bird game in HTML form:

Flappy Bird Game in HTML (Expandable)

And we took some screenshots (4bit works):

🦥 Fine-tuning GLM-4.7-Flash

Unsloth now supports fine-tuning of GLM-4.7-Flash, however you will need to use transformers v5. The 30B model does not fit on a free Colab GPU; however, you can use our notebook. 16-bit LoRA fine-tuning of GLM-4.7-Flash will use around 60GB VRAM:

On fine-tuning MoE's, it's probably not a good idea to fine-tune the router layer so we disabled it by default. If you want to maintain its reasoning capabilities (optional), you can use a mix of direct answers and chain-of-thought examples. Use at least 75% reasoning and 25% non-reasoning in your dataset to make the model retain its reasoning capabilities.

🦙Llama-server serving & deployment

To deploy GLM-4.7-Flash for production, we use llama-server In a new terminal say via tmux, deploy the model via:

Then in a new terminal, after doing pip install openai, do:

Which will print

💻 GLM-4.7-Flash in vLLM

You can now use our new FP8 Dynamic quant of the model for premium and fast inference. First install vLLM from nightly:

Then serve Unsloth's dynamic FP8 version of the model. We enabled FP8 to reduce KV cache memory usage by 50%, and on 4 GPUs. If you have 1 GPU, use CUDA_VISIBLE_DEVICES='0' and set --tensor-parallel-size 1 or remove this argument. To disable FP8, remove --quantization fp8 --kv-cache-dtype fp8

You can then call the served model via the OpenAI API:

vLLM GLM-4.7-Flash Speculative Decoding

We found using the MTP (multi token prediction) module from GLM 4.7 Flash makes generation throughput drop from 13,000 tokens on 1 B200 to 1,300 tokens! (10x slower) On Hopper, it should be fine hopefully.

Only 1,300 tokens / s throughput on 1xB200 (130 tokens / s decoding per user)

And 13,000 tokens / s throughput on 1xB200 (still 130 token /s decoding per user)

🔨Tool Calling with GLM-4.7-Flash

See Tool Calling Guide for more details on how to do tool calling. In a new terminal (if using tmux, use CTRL+B+D), we create some tools like adding 2 numbers, executing Python code, executing Linux functions and much more:

We then use the below functions (copy and paste and execute) which will parse the function calls automatically and call the OpenAI endpoint for any model:

After launching GLM-4.7-Flash via llama-server like in GLM-4.7-Flash or see Tool Calling Guide for more details, we then can do some tool calls:

Tool Call for mathematical operations for GLM 4.7

Tool Call to execute generated Python code for GLM-4.7-Flash

Benchmarks

GLM-4.7-Flash is the best performing 30B model across all benchmarks except AIME 25.

Benchmark
GLM-4.7-Flash
Qwen3-30B-A3B-Thinking-2507
GPT-OSS-20B

AIME 25

91.6

85.0

91.7

GPQA

75.2

73.4

71.5

LCB v6

64.0

66.0

61.0

HLE

14.4

9.8

10.9

SWE-bench Verified

59.2

22.0

34.0

τ²-Bench

79.5

49.0

47.7

BrowseComp

42.8

2.29

28.3

Last updated

Was this helpful?