MiniMax-2.5: How to Run Guide
Run MiniMax-2.5 locally on your own device!
MiniMax-2.5 is a new open LLM achieving SOTA in coding, agentic tool use, search, and office work, scoring 80.2% on SWE-Bench Verified, 51.3% on Multi-SWE-Bench, and 76.3% on BrowseComp.
The full 230B-parameter (10B active) model has a 200K context window, and the 8-bit version requires 243GB. The Unsloth Dynamic 3-bit GGUF reduces the size to 101GB (-62%): MiniMax-2.5 GGUF
All uploads use Unsloth Dynamic 2.0 for SOTA quantization performance - so the 3-bit quant has important layers upcast to 8-bit or 16-bit. You can also fine-tune the model via Unsloth, including with multiple GPUs.
⚙️ Usage Guide
The 3-bit dynamic quant UD-Q3_K_XL uses 101GB of disk space - this fits nicely on a 128GB unified-memory Mac at ~20+ tokens/s, and runs even faster with a single 16GB GPU plus 96GB of RAM at 25+ tokens/s. The 2-bit quants, including the largest 2-bit variant, will fit on a 96GB device.
For near-full precision, use Q8_0 (8-bit), which uses 243GB and will fit on a 256GB RAM device / Mac at 10+ tokens/s.
Though not a must, for best performance have your combined VRAM + RAM at least equal to the size of the quant you're downloading. If not, hard drive / SSD offloading will still work with llama.cpp - inference will just be slower.
Recommended Settings
MiniMax recommends the following sampling parameters for best performance:
temperature = 1.0
top_p = 0.95
top_k = 40
Maximum context window: 196,608 tokens.
Use --jinja for llama.cpp variants.
Default system prompt:
You are a helpful assistant. Your name is MiniMax-M2.5 and is built by MiniMax.
Run MiniMax-2.5 Tutorials:
For these tutorials, we will use the 3-bit UD-Q3_K_XL quant, which fits on a 128GB RAM device.
✨ Run in llama.cpp
Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.
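One way to build it from source is sketched below; this assumes a Linux machine with the CUDA toolkit installed, and the package names and targets are illustrative - adapt them to your platform.
```bash
# Build llama.cpp with CUDA (swap -DGGML_CUDA=ON for -DGGML_CUDA=OFF for CPU-only inference)
sudo apt-get update && sudo apt-get install -y build-essential cmake curl libcurl4-openssl-dev
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-server
cp llama.cpp/build/bin/llama-* llama.cpp/
```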
If you want llama.cpp to download and load the model directly, you can do the below. The suffix (:Q3_K_XL) selects the quantization type. You can also download via Hugging Face first (point 3). This is similar to ollama run. Use export LLAMA_CACHE="folder" to force llama.cpp to save downloads to a specific location. Remember the model has a maximum context length of 200K tokens.
Follow this for most default use-cases:
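A minimal sketch of this default case is below. The repo name unsloth/MiniMax-M2.5-GGUF is an assumption based on Unsloth's usual naming - substitute the actual GGUF repo if it differs.
```bash
export LLAMA_CACHE="unsloth"        # optional: where llama.cpp stores downloaded GGUFs
./llama.cpp/llama-cli \
    -hf unsloth/MiniMax-M2.5-GGUF:Q3_K_XL \   # :Q3_K_XL picks the quantization type
    --jinja \
    --ctx-size 16384 \
    --temp 1.0 --top-p 0.95 --top-k 40
```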
Use --fit on for maximum usage of your GPU and CPU.
Optionally, try -ot ".ffn_.*_exps.=CPU" to offload all MoE layers to the CPU! This effectively lets you fit all non-MoE layers on a single GPU, improving generation speed. You can customize the regex to keep more layers on the GPU if you have more VRAM.
If you have a bit more GPU memory, try -ot ".ffn_(up|down)_exps.=CPU". This offloads the up and down projection MoE layers.
Try -ot ".ffn_(up)_exps.=CPU" if you have even more GPU memory. This offloads only the up projection MoE layers.
And finally, offload all MoE layers via -ot ".ffn_.*_exps.=CPU". This uses the least VRAM.
You can also customize the regex, for example -ot "\.(6|7|8|9|[0-9][0-9]|[0-9][0-9][0-9])\.ffn_(gate|up|down)_exps.=CPU" offloads the gate, up and down MoE layers, but only from the 6th layer onwards. See the example command below.
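For example, a sketch that keeps all MoE expert tensors on the CPU so the remaining layers fit on one GPU (the repo name is an assumption; swap the -ot regex for any of the variants above):
```bash
export LLAMA_CACHE="unsloth"
./llama.cpp/llama-cli \
    -hf unsloth/MiniMax-M2.5-GGUF:Q3_K_XL \
    --jinja \
    --ctx-size 16384 \
    --n-gpu-layers 99 \                 # lower this if the non-expert layers still overflow VRAM
    -ot ".ffn_.*_exps.=CPU" \           # keep all MoE expert layers on the CPU
    --temp 1.0 --top-p 0.95 --top-k 40
```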
Download the model via the snippet below (after running pip install huggingface_hub hf_transfer). You can choose UD-Q3_K_XL (dynamic 3-bit quant) or other quantized versions like UD-Q6_K_XL. We recommend the 3-bit dynamic quant UD-Q3_K_XL to balance size and accuracy.
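A download sketch using huggingface_hub's snapshot_download; the repo id and local directory are assumptions:
```python
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"  # enable faster downloads via hf_transfer

from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="unsloth/MiniMax-M2.5-GGUF",   # assumed repo name - use the actual GGUF repo
    local_dir="MiniMax-M2.5-GGUF",
    allow_patterns=["*UD-Q3_K_XL*"],       # grab only the 3-bit dynamic quant shards
)
```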
In the run command below, you can edit --threads 32 for the number of CPU threads, --ctx-size 16384 for the context length, and --n-gpu-layers 2 for how many layers to offload to the GPU. Adjust it if your GPU runs out of memory, and remove it for CPU-only inference.
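A run sketch using the locally downloaded shards; the GGUF path is an assumption - point --model at the first .gguf shard you downloaded.
```bash
./llama.cpp/llama-cli \
    --model MiniMax-M2.5-GGUF/MiniMax-M2.5-UD-Q3_K_XL-00001-of-00003.gguf \
    --jinja \
    --threads 32 \
    --ctx-size 16384 \
    --n-gpu-layers 2 \                  # increase if you have spare VRAM; remove for CPU-only
    --temp 1.0 --top-p 0.95 --top-k 40
```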
🦙 Llama-server & OpenAI's completion library
To deploy MiniMax-2.5 in a production-style setup, we use llama-server, which exposes an OpenAI-compatible API. In a new terminal (for example inside tmux), deploy the model via:
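A serving sketch is below; the GGUF path is an assumption, and llama-server listens on port 8080 by default.
```bash
./llama.cpp/llama-server \
    --model MiniMax-M2.5-GGUF/MiniMax-M2.5-UD-Q3_K_XL-00001-of-00003.gguf \
    --jinja \
    --threads 32 \
    --ctx-size 16384 \
    --n-gpu-layers 99 \
    -ot ".ffn_.*_exps.=CPU" \           # keep MoE experts on CPU, rest on GPU
    --temp 1.0 --top-p 0.95 --top-k 40 \
    --host 127.0.0.1 --port 8080
```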
Then in a new terminal, after doing pip install openai, do:
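A client sketch using the openai library; the base URL assumes llama-server is listening on localhost:8080 as above, and the model name is informational for llama-server.
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://127.0.0.1:8080/v1",  # llama-server's OpenAI-compatible endpoint
    api_key="sk-no-key-required",         # llama-server does not check the key by default
)

response = client.chat.completions.create(
    model="MiniMax-M2.5",                 # informational; llama-server serves the loaded GGUF
    messages=[
        {"role": "user", "content": "Write a Python function that reverses a linked list."},
    ],
    temperature=1.0,
    top_p=0.95,
    extra_body={"top_k": 40},             # top_k is passed through to llama-server
)
print(response.choices[0].message.content)
```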
📊 Benchmarks
Benchmark results are shown in table format below; the first score column is MiniMax-2.5 and the remaining columns are comparison models:

| Benchmark | MiniMax-2.5 |  |  |  |  |  |
| --- | --- | --- | --- | --- | --- | --- |
| AIME25 | 86.3 | 83.0 | 91.0 | 95.6 | 96.0 | 98.0 |
| GPQA-D | 85.2 | 83.0 | 87.0 | 90.0 | 91.0 | 90.0 |
| HLE w/o tools | 19.4 | 22.2 | 28.4 | 30.7 | 37.2 | 31.4 |
| SciCode | 44.4 | 41.0 | 50.0 | 52.0 | 56.0 | 52.0 |
| IFBench | 70.0 | 70.0 | 58.0 | 53.0 | 70.0 | 75.0 |
| AA-LCR | 69.5 | 62.0 | 74.0 | 71.0 | 71.0 | 73.0 |
| SWE-Bench Verified | 80.2 | 74.0 | 80.9 | 80.8 | 78.0 | 80.0 |
| SWE-Bench Pro | 55.4 | 49.7 | 56.9 | 55.4 | 54.1 | 55.6 |
| Terminal Bench 2 | 51.7 | 47.9 | 53.4 | 55.1 | 54.0 | 54.0 |
| Multi-SWE-Bench | 51.3 | 47.2 | 50.0 | 50.3 | 42.7 | — |
| SWE-Bench Multilingual | 74.1 | 71.9 | 77.5 | 77.8 | 65.0 | 72.0 |
| VIBE-Pro (AVG) | 54.2 | 42.4 | 55.2 | 55.6 | 36.9 | — |
| BrowseComp (w/ctx) | 76.3 | 62.0 | 67.8 | 84.0 | 59.2 | 65.8 |
| Wide Search | 70.3 | 63.2 | 76.2 | 79.4 | 57.0 | — |
| RISE | 50.2 | 34.0 | 50.5 | 62.5 | 36.8 | 50.0 |
| BFCL multi-turn | 76.8 | 37.4 | 68.0 | 63.3 | 61.0 | — |
| τ² Telecom | 97.8 | 87.0 | 98.2 | 99.3 | 98.0 | 98.7 |
| MEWC | 74.4 | 55.6 | 82.1 | 89.8 | 78.7 | 41.3 |
| GDPval-MM | 59.0 | 24.6 | 61.1 | 73.5 | 28.1 | 54.5 |
| Finance Modeling | 21.6 | 17.3 | 30.1 | 33.2 | 15.0 | 20.0 |



