🥝 Kimi K2.5: How to Run Locally Guide

Guide on running Kimi-K2.5 on your own local device!

Kimi-K2.5 is the new model by Moonshot AI, achieving SOTA performance in vision, coding, agentic and chat tasks. The 1T parameter hybrid reasoning model requires 600GB of disk space, while the quantized Unsloth Dynamic 1.8-bit version reduces this to 240GB (a 60% size reduction): Kimi-K2.5-GGUF

All uploads use Unsloth Dynamic 2.0 for SOTA Aider and 5-shot MMLU performance. See how our Dynamic 1–2 bit GGUFs perform on coding benchmarks.


You need >240GB of disk space to run the 1-bit quant!

The only requirement is that disk space + RAM + VRAM combined is ≥ 240GB. You do not need that much RAM or VRAM (GPU) on its own to run the model; it will just be much slower.

The 1.8-bit (UD-TQ1_0) quant will run on a single 24GB GPU if you offload all MoE layers to system RAM (or a fast SSD). With ~256GB RAM, expect ~10 tokens/s. The full Kimi K2.5 model is 630GB and typically requires at least 4× H200 GPUs.

If the model fully fits in VRAM (for example on a B200), you can expect >40 tokens/s.

To run the model at near full precision, use the 4-bit or 5-bit quants, or any higher quant if you want extra headroom.

For strong performance, aim for >240GB of unified memory (or combined RAM+VRAM) to reach 10+ tokens/s. Below that, it will still work (llama.cpp can run via mmap/disk offload), but speed may fall from ~10 tokens/s to <2 tokens/s.

We recommend UD-Q2_K_XL (375GB) as a good size/quality balance. Best rule of thumb: RAM+VRAM ≈ the quant size; otherwise it’ll still work, just slower due to offloading.

🥝 Run Kimi K2.5 Guide

Kimi-K2.5 requires different sampling parameters for different use-cases.

Currently there is no vision support for the model in llama.cpp, but hopefully it will be supported soon.


How Kimi K2.5 differs from Kimi K2 Thinking

  • Both models use a modified DeepSeek V3 MoE architecture.

  • rope_scaling.beta_fast: K2.5 uses 32.0 vs K2 Thinking's 1.0.

  • K2.5 adds MoonViT, a native-resolution 200M parameter vision encoder, similar to the one used in Kimi-VL-A3B-Instruct.

🌙 Usage Guide:

According to Moonshot AI, these are the recommended settings for Kimi K2.5 inference:

| Setting | Default (Instant Mode) | Thinking Mode |
| --- | --- | --- |
| temperature | 0.6 | 1.0 |
| top_p | 0.95 | 0.95 |
| min_p | 0.01 | 0.01 |

  • Set the temperature to 1.0 (thinking mode) to reduce repetition and incoherence.

  • Suggested context length = 98,304 (up to 256K)

  • Note: Using different tools may require different settings


We recommend setting min_p to 0.01 to suppress unlikely tokens with very low probabilities. If needed, disable the repeat penalty (i.e. set it to 1.0).
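These settings map directly onto llama.cpp CLI flags. A minimal sketch (the model path is a placeholder for whichever quant you download later in this guide):

```bash
# Thinking-mode sampling; for instant mode use --temp 0.6 instead.
./llama.cpp/llama-cli \
    --model path/to/Kimi-K2.5-UD-Q2_K_XL.gguf \
    --temp 1.0 --top-p 0.95 --min-p 0.01 \
    --repeat-penalty 1.0
```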

Chat Template for Kimi K2.5

Running tokenizer.apply_chat_template([{"role": "user", "content": "What is 1+1?"},]) renders the prompt in Kimi K2.5's chat format.
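If you want to inspect the rendered template yourself, here is a minimal sketch using Hugging Face transformers (the repo id is illustrative; point it at wherever you have the Kimi K2.5 tokenizer):

```python
from transformers import AutoTokenizer

# Repo id is illustrative: substitute the actual Kimi K2.5 repo or a local path.
tokenizer = AutoTokenizer.from_pretrained("moonshotai/Kimi-K2.5", trust_remote_code=True)

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What is 1+1?"}],
    tokenize=False,              # return the raw string instead of token ids
    add_generation_prompt=True,  # append the assistant turn header
)
print(prompt)
```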

✨ Run Kimi K2.5 in llama.cpp

For this guide we'll be running the smallest 1-bit quant, which is 240GB in size. Feel free to switch the quantization type to 2-bit, 3-bit etc. To run the model at near full precision, use the 4-bit or 5-bit quants, or any higher quant if you want extra headroom.

  1. Obtain the latest llama.cpp from GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.
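A typical build sequence looks like this (a sketch assuming a Linux machine with the CUDA toolkit installed; on CPU-only systems switch the flag as noted above):

```bash
apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first \
    --target llama-cli llama-server llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp
```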

  2. If you want to use llama.cpp directly to download and run the model, you can do the below. (:UD-TQ1_0) is the quantization type. You can also download the files via Hugging Face first (point 3). This is similar to ollama run. Use export LLAMA_CACHE="folder" to force llama.cpp to save downloads to a specific location.
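A sketch of what that looks like (the tag after the colon selects which quant to fetch; the sampling and offloading flags are explained further below):

```bash
# Optional: choose where llama.cpp caches downloaded models.
export LLAMA_CACHE="unsloth/Kimi-K2.5-GGUF"

# Downloads the UD-TQ1_0 (1.8-bit) quant from Hugging Face and runs it.
./llama.cpp/llama-cli \
    -hf unsloth/Kimi-K2.5-GGUF:UD-TQ1_0 \
    --ctx-size 16384 \
    --temp 1.0 --top-p 0.95 --min-p 0.01 \
    -ot ".ffn_.*_exps.=CPU"
```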

  • --fit on will auto fit the model to your system. If not using --fit on and you have around 360GB of combined GPU memory, remove -ot ".ffn_.*_exps.=CPU" to get maximum speed.


Use --fit on for auto fitting on GPUs and CPUs. If this doesn't work, then see below:

Please try out -ot ".ffn_.*_exps.=CPU" to offload all MoE layers to the CPU! This effectively allows you to fit all non-MoE layers on 1 GPU, improving generation speeds. You can customize the regex to fit more layers if you have more GPU capacity.

If you have a bit more GPU memory, try -ot ".ffn_(up|down)_exps.=CPU", which offloads the up and down projection MoE layers.

Try -ot ".ffn_(up)_exps.=CPU" if you have even more GPU memory. This offloads only the up projection MoE layers.

And finally, offload all MoE layers via -ot ".ffn_.*_exps.=CPU". This uses the least VRAM.

You can also customize the regex, for example -ot "\.(6|7|8|9|[0-9][0-9]|[0-9][0-9][0-9])\.ffn_(gate|up|down)_exps.=CPU" offloads gate, up and down MoE layers, but only from the 6th layer onwards, as in the sketch below.
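For instance, if your GPUs can hold the experts of the first six layers, a command like this keeps those on GPU and pushes the rest to CPU (a sketch; the model path is a placeholder and the regex should be tuned to your hardware):

```bash
./llama.cpp/llama-cli \
    --model path/to/Kimi-K2.5-UD-Q2_K_XL.gguf \
    --n-gpu-layers 99 \
    -ot "\.(6|7|8|9|[0-9][0-9]|[0-9][0-9][0-9])\.ffn_(gate|up|down)_exps.=CPU"
```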

  3. Download the model via the Python snippet below (after pip install huggingface_hub hf_transfer). We recommend using our 2-bit dynamic quant UD-Q2_K_XL to balance size and accuracy. All versions at: huggingface.co/unsloth/Kimi-K2.5-GGUF
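A sketch of the download script (the allow_patterns filter limits the download to one quant; adjust it for other sizes):

```python
# pip install huggingface_hub hf_transfer
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"  # faster downloads on fast connections

from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="unsloth/Kimi-K2.5-GGUF",
    local_dir="unsloth/Kimi-K2.5-GGUF",
    allow_patterns=["*UD-Q2_K_XL*"],  # e.g. use "*UD-TQ1_0*" for the 1.8-bit quant
)
```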

circle-info

If you find that downloads get stuck at 90 to 95% or so, please see our troubleshooting guide.

  4. Run any prompt.

  5. Edit --ctx-size 16384 for context length. You can also leave this out for automatic context-length discovery via --fit on.

  6. As an example, try: "Create a Flappy Bird game in HTML". A full example command is sketched below.
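Putting the pieces together, a sketch of a full run command (the .gguf path points at the first shard of the downloaded quant; the shard file name is illustrative, so check your local folder):

```bash
./llama.cpp/llama-cli \
    --model unsloth/Kimi-K2.5-GGUF/UD-Q2_K_XL/Kimi-K2.5-UD-Q2_K_XL-00001-of-00008.gguf \
    --ctx-size 16384 \
    --n-gpu-layers 99 \
    --temp 1.0 --top-p 0.95 --min-p 0.01 \
    -ot ".ffn_.*_exps.=CPU" \
    -p "Create a Flappy Bird game in HTML"
```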

✨ Deploy with llama-server and OpenAI's completion library


After installing llama.cpp as per the steps above, you can use the below to launch an OpenAI-compatible server:
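A sketch of the server launch (the port and alias are arbitrary choices; the same offloading flags from above apply, and the shard file name is illustrative):

```bash
./llama.cpp/llama-server \
    --model unsloth/Kimi-K2.5-GGUF/UD-Q2_K_XL/Kimi-K2.5-UD-Q2_K_XL-00001-of-00008.gguf \
    --alias "unsloth/Kimi-K2.5" \
    --ctx-size 16384 \
    --n-gpu-layers 99 \
    --temp 1.0 --top-p 0.95 --min-p 0.01 \
    -ot ".ffn_.*_exps.=CPU" \
    --host 0.0.0.0 --port 8001
```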

Then use OpenAI's Python library (after pip install openai):
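A minimal sketch of calling the server with the OpenAI client (the base_url matches the port chosen above; the api_key is a dummy since llama-server does not require one by default):

```python
# pip install openai
from openai import OpenAI

client = OpenAI(
    base_url="http://127.0.0.1:8001/v1",  # llama-server's OpenAI-compatible endpoint
    api_key="sk-no-key-required",         # dummy key; llama-server ignores it
)

response = client.chat.completions.create(
    model="unsloth/Kimi-K2.5",  # must match the --alias passed to llama-server
    messages=[{"role": "user", "content": "What is 1+1?"}],
    temperature=1.0,
    top_p=0.95,
)
print(response.choices[0].message.content)
```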

The script prints the model's reply, and the llama-server screen logs the corresponding request.

📊 Benchmarks

Reasoning & Knowledge

| Benchmark | Kimi K2.5 | GPT-5.2 | Claude 4.5 Opus | Gemini 3 Pro | DeepSeek V3.2 | Qwen3-VL-235B-A22B-Thinking |
| --- | --- | --- | --- | --- | --- | --- |
| HLE-Full | 30.1 | 34.5 | 30.8 | 37.5 | 25.1† | - |
| HLE-Full (w/ tools) | 50.2 | 45.5 | 43.2 | 45.8 | 40.8† | - |
| AIME 2025 | 96.1 | 100 | 92.8 | 95.0 | 93.1 | - |
| HMMT 2025 (Feb) | 95.4 | 99.4 | 92.9* | 97.3* | 92.5 | - |
| IMO-AnswerBench | 81.8 | 86.3 | 78.5* | 83.1* | 78.3 | - |
| GPQA-Diamond | 87.6 | 92.4 | 87.0 | 91.9 | 82.4 | - |
| MMLU-Pro | 87.1 | 86.7* | 89.3* | 90.1 | 85.0 | - |

Image & Video

| Benchmark | Kimi K2.5 | GPT-5.2 | Claude 4.5 Opus | Gemini 3 Pro | DeepSeek V3.2 | Qwen3-VL-235B-A22B-Thinking |
| --- | --- | --- | --- | --- | --- | --- |
| MMMU-Pro | 78.5 | 79.5* | 74.0 | 81.0 | - | 69.3 |
| CharXiv (RQ) | 77.5 | 82.1 | 67.2* | 81.4 | - | 66.1 |
| MathVision | 84.2 | 83.0 | 77.1* | 86.1* | - | 74.6 |
| MathVista (mini) | 90.1 | 82.8* | 80.2* | 89.8* | - | 85.8 |
| ZeroBench | 9 | 9* | 3* | 8* | - | 4* |
| ZeroBench (w/ tools) | 11 | 7* | 9* | 12* | - | 3* |
| OCRBench | 92.3 | 80.7* | 86.5* | 90.3* | - | 87.5 |
| OmniDocBench 1.5 | 88.8 | 85.7 | 87.7* | 88.5 | - | 82.0* |
| InfoVQA (val) | 92.6 | 84* | 76.9* | 57.2* | - | 89.5 |
| SimpleVQA | 71.2 | 55.8* | 69.7* | 69.7* | - | 56.8* |
| WorldVQA | 46.3 | 28.0 | 36.8 | 47.4 | - | 23.5 |
| VideoMMMU | 86.6 | 85.9 | 84.4* | 87.6 | - | 80.0 |
| MMVU | 80.4 | 80.8* | 77.3 | 77.5 | - | 71.1 |
| MotionBench | 70.4 | 64.8 | 60.3 | 70.3 | - | - |
| VideoMME | 87.4 | 86.0* | - | 88.4* | - | 79.0 |
| LongVideoBench | 79.8 | 76.5* | 67.2* | 77.7* | - | 65.6* |
| LVBench | 75.9 | - | - | 73.5* | - | 63.6 |

Coding

| Benchmark | Kimi K2.5 | GPT-5.2 | Claude 4.5 Opus | Gemini 3 Pro | DeepSeek V3.2 | Qwen3-VL-235B-A22B-Thinking |
| --- | --- | --- | --- | --- | --- | --- |
| SWE-Bench Verified | 76.8 | 80.0 | 80.9 | 76.2 | 73.1 | - |
| SWE-Bench Pro | 50.7 | 55.6 | 55.4* | - | - | - |
| SWE-Bench Multilingual | 73.0 | 72.0 | 77.5 | 65.0 | 70.2 | - |
| Terminal Bench 2.0 | 50.8 | 54.0 | 59.3 | 54.2 | 46.4 | - |
| PaperBench | 63.5 | 63.7* | 72.9* | - | 47.1 | - |
| CyberGym | 41.3 | - | 50.6 | 39.9* | 17.3* | - |
| SciCode | 48.7 | 52.1 | 49.5 | 56.1 | 38.9 | - |
| OJBench (cpp) | 57.4 | - | 54.6* | 68.5* | 54.7* | - |
| LiveCodeBench (v6) | 85.0 | - | 82.2* | 87.4* | 83.3 | - |

Long Context

| Benchmark | Kimi K2.5 | GPT-5.2 | Claude 4.5 Opus | Gemini 3 Pro | DeepSeek V3.2 | Qwen3-VL-235B-A22B-Thinking |
| --- | --- | --- | --- | --- | --- | --- |
| Longbench v2 | 61.0 | 54.5* | 64.4* | 68.2* | 59.8* | - |
| AA-LCR | 70.0 | 72.3* | 71.3* | 65.3* | 64.3* | - |

Agentic Search

| Benchmark | Kimi K2.5 | GPT-5.2 | Claude 4.5 Opus | Gemini 3 Pro | DeepSeek V3.2 | Qwen3-VL-235B-A22B-Thinking |
| --- | --- | --- | --- | --- | --- | --- |
| BrowseComp | 60.6 | 65.8 | 37.0 | 37.8 | 51.4 | - |
| BrowseComp (w/ ctx manage) | 74.9 | 65.8 | 57.8 | 59.2 | 67.6 | - |
| BrowseComp (Agent Swarm) | 78.4 | - | - | - | - | - |
| WideSearch (item-f1) | 72.7 | - | 76.2* | 57.0 | 32.5* | - |
| WideSearch (item-f1, Agent Swarm) | 79.0 | - | - | - | - | - |
| DeepSearchQA | 77.1 | 71.3* | 76.1* | 63.2* | 60.9* | - |
| FinSearchCompT2&T3 | 67.8 | - | 66.2* | 49.9 | 59.1* | - |
| Seal-0 | 57.4 | 45.0 | 47.7* | 45.5* | 49.5* | - |

Notes

  • * = score re-evaluated by the authors (not publicly available previously).

  • † = DeepSeek V3.2 score corresponds to its text-only subset.

  • - = not evaluated / not available.
