MiniMax-2.5: How to Run Guide

Run MiniMax-2.5 locally on your own device!

MiniMax-2.5 is a new open LLM achieving SOTA in coding, agentic tool use and search and office work, scoring 80.2% in SWE-Bench Verified, 51.3% in Multi-SWE-Bench, and 76.3% in BrowseComp.

The full 230B parameters (10B active) model has a 200K context window and the 8-bit LLM requires 243GB. Unsloth Dynamic 3-bit GGUF reduces the size to 101GB (-62%): MiniMax-2.5 GGUF

All uploads use Unsloth Dynamic 2.0 for SOTA quantization performance - so 3-bit has important layers upcasted to 8 or 16-bit. You can also fine-tune the model via Unsloth, using multiGPUs.

⚙️ Usage Guide

The 4-bit dynamic quant UD-Q3_K_XL uses 101GB of disk space - this fits nicely on a 128GB unified memory Mac for ~20+ tokens/s, and also works faster with a 1x16GB GPU and 96GB of RAM for 25+ tokens/s. 2-bit quants or the biggest 2-bit will fit on a 96GB device.

For near full precision, use Q8_0 (8-bit) which utilizes 243GB and will fit on a 256GB RAM device / Mac for 10+ tokens/s.

Though not a must, for best performance, have your VRAM + RAM combined equal to the size of the quant you're downloading. If not, hard drive / SSD offloading will work with llama.cpp, just inference will be slower.

Recommended Settings

MiniMax recommends using the following parameters for best performance: temperature=1.0, top_p = 0.95, top_k = 40.

Default Settings (Most Tasks)

temperature = 1.0

top_p = 0.95

top_k = 40

Maximum context window: 196,608.
Use --jinja for llama.cpp variants.
Default system prompt:

You are a helpful assistant. Your name is MiniMax-M2.5 and is built by MiniMax.

Run MiniMax-2.5 Tutorials:

For these tutorials, we will be utilizing the 3-bit UD-Q3_K_XL quant which fits in a 128GB RAM device.

✨ Run in llama.cpp

Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.

apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-mtmd-cli llama-server llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp

If you want to use llama.cpp directly to load models, you can do the below: (:Q3_K_XL) is the quantization type. You can also download via Hugging Face (point 3). This is similar to ollama run . Use export LLAMA_CACHE="folder" to force llama.cpp to save to a specific location. Remember the model has only a maximum of 200K context length.

Follow this for most default use-cases:

export LLAMA_CACHE="unsloth/MiniMax-2.5-GGUF"
./llama.cpp/llama-cli \
    -hf unsloth/MiniMax-2.5-GGUF:UD-Q3_K_XL \
    --jinja \
    --ctx-size 16384 \
    --flash-attn on \
    --temp 1.0 \
    --top-p 0.95 \
    --top-k 40 \
    --fit on

Use --fit on for maximum usage of your GPU and CPU.

Optionally, try -ot ".ffn_.*_exps.=CPU" to offload all MoE layers to the CPU! This effectively allows you to fit all non MoE layers on 1 GPU, improving generation speeds. You can customize the regex expression to fit more layers if you have more GPU capacity.

If you have a bit more GPU memory, try -ot ".ffn_(up|down)_exps.=CPU" This offloads up and down projection MoE layers.

Try -ot ".ffn_(up)_exps.=CPU" if you have even more GPU memory. This offloads only up projection MoE layers.

And finally offload all layers via -ot ".ffn_.*_exps.=CPU" This uses the least VRAM.

You can also customize the regex, for example -ot "\.(6|7|8|9|[0-9][0-9]|[0-9][0-9][0-9])\.ffn_(gate|up|down)_exps.=CPU" means to offload gate, up and down MoE layers but only from the 6th layer onwards.

Download the model via (after installing pip install huggingface_hub hf_transfer ). You can choose UD-Q3_K_XL (dynamic 4-bit quant) or other quantized versions like UD-Q6_K_XL . We recommend using our 4bit dynamic quant UD-Q3_K_XL to balance size and accuracy.

hf download unsloth/MiniMax-2.5-GGUF \
    --local-dir unsloth/MiniMax-2.5-GGUF \
    --include "*UD-Q3_K_XL*" # Use "*Q8_0*" for 8-bit

You can edit --threads 32 for the number of CPU threads, --ctx-size 16384 for context length, --n-gpu-layers 2 for GPU offloading on how many layers. Try adjusting it if your GPU goes out of memory. Also remove it if you have CPU only inference.

./llama.cpp/llama-cli \
    --model unsloth/MiniMax-2.5-GGUF/UD-Q3_K_XL/MiniMax-2.5-UD-Q3_K_XL-00001-of-00003.gguf \
    --jinja \
    --temp 1.0 \
    --top-p 0.95 \
    --top-k 40 \
    --ctx-size 16384 \
    --seed 3407 \
    --fit on

🦙 Llama-server & OpenAI's completion library

To deploy MiniMax-2.5 for production, we use llama-server or OpenAI API. In a new terminal say via tmux, deploy the model via:

./llama.cpp/llama-server \
    --model unsloth/MiniMax-2.5-GGUF/UD-Q3_K_XL/MiniMax-2.5-UD-Q3_K_XL-00001-of-00003.gguf \
    --alias "unsloth/MiniMax-2.5" \
    --fit on \
    --prio 3 \
    --temp 1.0 \
    --top-p 0.95 \
    --top-k 40 \
    --ctx-size 16384 \
    --port 8001 \
    --jinja

Then in a new terminal, after doing pip install openai, do:

from openai import OpenAI
import json
openai_client = OpenAI(
    base_url = "http://127.0.0.1:8001/v1",
    api_key = "sk-no-key-required",
)
completion = openai_client.chat.completions.create(
    model = "unsloth/MiniMax-2.5",
    messages = [{"role": "user", "content": "Create a Snake game."},],
)
print(completion.choices[0].message.content)

📊 Benchmarks

You can view further below for benchmarks in table format:

Benchmark

MiniMax-M2.5

MiniMax-M2.1

Claude Opus 4.5

Claude Opus 4.6

Gemini 3 Pro

GPT-5.2 (thinking)

AIME25

86.3

83.0

91.0

95.6

96.0

98.0

GPQA-D

85.2

83.0

87.0

90.0

91.0

90.0

HLE w/o tools

19.4

22.2

28.4

30.7

37.2

31.4

SciCode

44.4

41.0

50.0

52.0

56.0

52.0

IFBench

70.0

58.0

53.0

70.0

75.0

AA-LCR

69.5

62.0

74.0

71.0

73.0

SWE-Bench Verified

80.2

74.0

80.9

80.8

78.0

80.0

SWE-Bench Pro

55.4

49.7

56.9

55.4

54.1

55.6

Terminal Bench 2

51.7

47.9

53.4

55.1

54.0

Multi-SWE-Bench

51.3

47.2

50.0

50.3

42.7

—

SWE-Bench Multilingual

74.1

71.9

77.5

77.8

65.0

72.0

VIBE-Pro (AVG)

54.2

42.4

55.2

55.6

36.9

—

BrowseComp (w/ctx)

76.3

62.0

67.8

84.0

59.2

65.8

Wide Search

70.3

63.2

76.2

79.4

57.0

—

RISE

50.2

34.0

50.5

62.5

36.8

50.0

BFCL multi-turn

76.8

37.4

68.0

63.3

61.0

—

τ² Telecom

97.8

87.0

98.2

99.3

98.0

98.7

MEWC

74.4

55.6

82.1

89.8

78.7

41.3

GDPval-MM

59.0

24.6

61.1

73.5

28.1

54.5

Finance Modeling

21.6

17.3

30.1

33.2

15.0

20.0

PreviousQwen3-Coder-Next NextGLM-4.7-Flash

Last updated 1 hour ago

Was this helpful?

hashtag⚙️ Usage Guide

hashtagRecommended Settings

hashtagRun MiniMax-2.5 Tutorials:

hashtag✨ Run in llama.cpp

hashtag🦙 Llama-server & OpenAI's completion library

hashtag📊 Benchmarks

⚙️ Usage Guide

Recommended Settings

Run MiniMax-2.5 Tutorials:

✨ Run in llama.cpp

🦙 Llama-server & OpenAI's completion library

📊 Benchmarks