MiniMax-2.5: How to Run Guide

Run MiniMax-2.5 locally on your own device!

MiniMax-2.5 is a new open LLM achieving SOTA results in coding, agentic tool use, search, and office work, scoring 80.2% on SWE-Bench Verified, 51.3% on Multi-SWE-Bench, and 76.3% on BrowseComp.

The full 230B-parameter (10B active) model has a 200K context window, and the 8-bit GGUF requires 243GB. The Unsloth Dynamic 3-bit GGUF reduces the size to 101GB (-62%): MiniMax-2.5 GGUF

All uploads use Unsloth Dynamic 2.0 for SOTA quantization performance, so the 3-bit quant keeps important layers upcasted to 8 or 16-bit. You can also fine-tune the model via Unsloth, including on multiple GPUs.

⚙️ Usage Guide

The 3-bit dynamic quant UD-Q3_K_XL uses 101GB of disk space. This fits nicely on a 128GB unified-memory Mac at ~20+ tokens/s, and runs even faster with a single 16GB GPU plus 96GB of RAM at 25+ tokens/s. The 2-bit quants, including the largest 2-bit, will fit on a 96GB device.

For near-full precision, use Q8_0 (8-bit), which uses 243GB and fits on a 256GB RAM device / Mac at 10+ tokens/s.


MiniMax recommends using the following parameters for best performance: temperature=1.0, top_p = 0.95, top_k = 40.

Default Settings (Most Tasks)

temperature = 1.0

top_p = 0.95

top_k = 40

  • Maximum context window: 196,608.

  • Use --jinja for llama.cpp variants.

  • Default system prompt:

You are a helpful assistant. Your name is MiniMax-M2.5 and is built by MiniMax.

Run MiniMax-2.5 Tutorials:

For these tutorials, we will be using the 3-bit UD-Q3_K_XL quant, which fits on a 128GB RAM device.

✨ Run in llama.cpp

1. Obtain the latest llama.cpp from GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.
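A typical build might look like the following sketch (assuming a Linux machine; package names and the CUDA flag are adjustable to your system):

```bash
# Build llama.cpp with CUDA support (switch -DGGML_CUDA=ON to OFF for CPU-only)
apt-get update
apt-get install build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first \
    --target llama-cli llama-server llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp/
```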

2. If you want llama.cpp to download and load the model directly, you can do the below. The suffix (:Q3_K_XL) selects the quantization type, and this is similar to ollama run. You can also download via Hugging Face first (step 3). Use export LLAMA_CACHE="folder" to force llama.cpp to save downloads to a specific location. Remember the model has a maximum context length of 200K tokens.

Follow this for most default use-cases:
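The following is a minimal sketch that streams the quant straight from Hugging Face. The repo name unsloth/MiniMax-M2.5-GGUF and the quant tag are assumptions; swap in the GGUF repo you are actually using.

```bash
# Optional: pin where llama.cpp caches downloaded GGUFs
export LLAMA_CACHE="unsloth/MiniMax-M2.5-GGUF"

./llama.cpp/llama-cli \
    -hf unsloth/MiniMax-M2.5-GGUF:UD-Q3_K_XL \
    --jinja \
    --ctx-size 16384 \
    --temp 1.0 --top-p 0.95 --top-k 40
# Add --n-gpu-layers and the -ot offload flags from the note below to make use of your GPU.
```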


Use --fit on for maximum usage of your GPU and CPU.

Optionally, try -ot ".ffn_.*_exps.=CPU" to offload all MoE layers to the CPU! This effectively allows you to fit all non-MoE layers on a single GPU, improving generation speed. You can customize the regex to keep more layers on the GPU if you have more VRAM.

If you have a bit more GPU memory, try -ot ".ffn_(up|down)_exps.=CPU", which offloads the up and down projection MoE layers.

Try -ot ".ffn_(up)_exps.=CPU" if you have even more GPU memory. This offloads only the up projection MoE layers.

And finally, offload all MoE layers via -ot ".ffn_.*_exps.=CPU". This uses the least VRAM.

You can also customize the regex, for example -ot "\.(6|7|8|9|[0-9][0-9]|[0-9][0-9][0-9])\.ffn_(gate|up|down)_exps.=CPU" offloads the gate, up and down MoE layers, but only from the 6th layer onwards. An example command using this regex is shown below.
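As a concrete sketch, combining the layer-range regex above with a run command could look like this (repo name and quant tag assumed as before):

```bash
# Keep the first few layers fully on the GPU; push MoE experts from layer 6 onwards to the CPU
./llama.cpp/llama-cli \
    -hf unsloth/MiniMax-M2.5-GGUF:UD-Q3_K_XL \
    --jinja \
    --n-gpu-layers 99 \
    -ot "\.(6|7|8|9|[0-9][0-9]|[0-9][0-9][0-9])\.ffn_(gate|up|down)_exps.=CPU"
```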

3. Download the model via the snippet below (after installing pip install huggingface_hub hf_transfer). You can choose UD-Q3_K_XL (dynamic 3-bit quant) or other quantized versions like UD-Q6_K_XL. We recommend the 3-bit dynamic quant UD-Q3_K_XL to balance size and accuracy.
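A download sketch using huggingface_hub's snapshot_download; the repo name unsloth/MiniMax-M2.5-GGUF is an assumption, so point it at the GGUF repo you actually want:

```python
# pip install huggingface_hub hf_transfer
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"  # optional: faster downloads

from huggingface_hub import snapshot_download

snapshot_download(
    repo_id = "unsloth/MiniMax-M2.5-GGUF",     # assumed repo name
    local_dir = "unsloth/MiniMax-M2.5-GGUF",
    allow_patterns = ["*UD-Q3_K_XL*"],         # download only the 3-bit dynamic quant shards
)
```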

4. You can edit --threads 32 for the number of CPU threads, --ctx-size 16384 for context length, and --n-gpu-layers 2 for how many layers to offload to the GPU. Try adjusting it if your GPU runs out of memory, and remove it for CPU-only inference. An example invocation is shown below.
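For instance, running the downloaded quant locally might look like this (the .gguf filename below is illustrative; point --model at the first shard you downloaded in step 3 and llama.cpp will pick up the rest):

```bash
./llama.cpp/llama-cli \
    --model unsloth/MiniMax-M2.5-GGUF/UD-Q3_K_XL/MiniMax-M2.5-UD-Q3_K_XL-00001-of-00003.gguf \
    --jinja \
    --threads 32 \
    --ctx-size 16384 \
    --n-gpu-layers 2 \
    -ot ".ffn_.*_exps.=CPU" \
    --temp 1.0 --top-p 0.95 --top-k 40
```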

🦙 Llama-server & OpenAI's completion library

To deploy MiniMax-2.5 for production use, we use llama-server, which exposes an OpenAI-compatible API. In a new terminal (for example via tmux), deploy the model via:
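A serving sketch (port 8001 and the repo/quant names are assumptions; the offload flags mirror the llama-cli examples above):

```bash
./llama.cpp/llama-server \
    -hf unsloth/MiniMax-M2.5-GGUF:UD-Q3_K_XL \
    --jinja \
    --threads -1 \
    --ctx-size 16384 \
    --n-gpu-layers 99 \
    -ot ".ffn_.*_exps.=CPU" \
    --temp 1.0 --top-p 0.95 --top-k 40 \
    --host 0.0.0.0 --port 8001
```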

Then in a new terminal, after doing pip install openai, do:
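A minimal client sketch against the local server (the port matches the llama-server example above; the model field is passed through, but llama-server serves whichever model it loaded):

```python
# pip install openai
from openai import OpenAI

client = OpenAI(
    base_url = "http://127.0.0.1:8001/v1",  # llama-server's OpenAI-compatible endpoint
    api_key = "sk-no-key-required",         # llama-server does not validate the key
)

completion = client.chat.completions.create(
    model = "MiniMax-M2.5",  # informational only for llama-server
    messages = [
        {"role": "system", "content": "You are a helpful assistant. Your name is MiniMax-M2.5 and is built by MiniMax."},
        {"role": "user", "content": "Write a Flappy Bird game in Python."},
    ],
    temperature = 1.0,
    top_p = 0.95,
    extra_body = {"top_k": 40},  # top_k is not in the OpenAI schema; llama-server accepts it as an extra field
)

print(completion.choices[0].message.content)
```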

📊 Benchmarks

The benchmarks are shown in table format below:

| Benchmark | MiniMax-M2.5 | MiniMax-M2.1 | Claude Opus 4.5 | Claude Opus 4.6 | Gemini 3 Pro | GPT-5.2 (thinking) |
| --- | --- | --- | --- | --- | --- | --- |
| AIME25 | 86.3 | 83.0 | 91.0 | 95.6 | 96.0 | 98.0 |
| GPQA-D | 85.2 | 83.0 | 87.0 | 90.0 | 91.0 | 90.0 |
| HLE w/o tools | 19.4 | 22.2 | 28.4 | 30.7 | 37.2 | 31.4 |
| SciCode | 44.4 | 41.0 | 50.0 | 52.0 | 56.0 | 52.0 |
| IFBench | 70.0 | 70.0 | 58.0 | 53.0 | 70.0 | 75.0 |
| AA-LCR | 69.5 | 62.0 | 74.0 | 71.0 | 71.0 | 73.0 |
| SWE-Bench Verified | 80.2 | 74.0 | 80.9 | 80.8 | 78.0 | 80.0 |
| SWE-Bench Pro | 55.4 | 49.7 | 56.9 | 55.4 | 54.1 | 55.6 |
| Terminal Bench 2 | 51.7 | 47.9 | 53.4 | 55.1 | 54.0 | 54.0 |
| Multi-SWE-Bench | 51.3 | 47.2 | 50.0 | 50.3 | 42.7 | – |
| SWE-Bench Multilingual | 74.1 | 71.9 | 77.5 | 77.8 | 65.0 | 72.0 |
| VIBE-Pro (AVG) | 54.2 | 42.4 | 55.2 | 55.6 | 36.9 | – |
| BrowseComp (w/ctx) | 76.3 | 62.0 | 67.8 | 84.0 | 59.2 | 65.8 |
| Wide Search | 70.3 | 63.2 | 76.2 | 79.4 | 57.0 | – |
| RISE | 50.2 | 34.0 | 50.5 | 62.5 | 36.8 | 50.0 |
| BFCL multi-turn | 76.8 | 37.4 | 68.0 | 63.3 | 61.0 | – |
| τ² Telecom | 97.8 | 87.0 | 98.2 | 99.3 | 98.0 | 98.7 |
| MEWC | 74.4 | 55.6 | 82.1 | 89.8 | 78.7 | 41.3 |
| GDPval-MM | 59.0 | 24.6 | 61.1 | 73.5 | 28.1 | 54.5 |
| Finance Modeling | 21.6 | 17.3 | 30.1 | 33.2 | 15.0 | 20.0 |

[Charts: Coding Core Benchmark Scores · Search and Tool Use · Tasks Completed per 100 · Office Capabilities]
