🥝 Kimi K2.5: How to Run Locally Guide
Guide on running Kimi-K2.5 on your own local device!
Kimi-K2.5 is the new model from Moonshot AI that achieves SOTA performance in vision, coding, agentic, and chat tasks. The 1T-parameter hybrid reasoning model requires 600GB of disk space, while the quantized Unsloth Dynamic 1.8-bit version reduces this to 240GB (a 60% size reduction): Kimi-K2.5-GGUF
All uploads use Unsloth Dynamic 2.0 for SOTA Aider and 5-shot MMLU performance. See how our Dynamic 1–2 bit GGUFs perform on coding benchmarks.
⚙️ Recommended Requirements
You need >240GB of disk space to run the 1-bit quant!
The only hard requirement is disk space + RAM + VRAM ≥ 240GB. That means you do not need 240GB of RAM or VRAM (GPU) to run the model, but inference will be much slower if most of the weights have to be streamed from disk.
The 1.8-bit (UD-TQ1_0) quant will run on a single 24GB GPU if you offload all MoE layers to system RAM (or a fast SSD). With ~256GB RAM, expect ~10 tokens/s. The full Kimi K2.5 model is 630GB and typically requires at least 4× H200 GPUs.
If the model fits entirely in GPU memory (e.g. on a B200), you will get >40 tokens/s.
To run the model at near-full precision, you can use the 4-bit or 5-bit quants. You can go higher just to be safe, but it is rarely necessary.
For strong performance, aim for >240GB of unified memory (or combined RAM+VRAM) to reach 10+ tokens/s. If you're below that, it'll still work (llama.cpp can run via mmap/disk offload), but speed may fall from ~10 tokens/s to <2 tokens/s.
We recommend UD-Q2_K_XL (375GB) as a good size/quality balance. Best rule of thumb: RAM+VRAM ≈ the quant size; otherwise it’ll still work, just slower due to offloading.
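To see what you're working with, a quick check on Linux might look like the following (this assumes an NVIDIA GPU with nvidia-smi installed; the commands are standard tools, not part of the Kimi K2.5 setup):

```bash
# System RAM
free -h
# Total VRAM per NVIDIA GPU
nvidia-smi --query-gpu=memory.total --format=csv
# Free disk space on the current download target
df -h .
```

Add RAM and VRAM together and compare the total against the size of the quant you plan to download.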
🥝 Run Kimi K2.5 Guide
Kimi-K2.5 requires different sampling parameters for different use-cases.
Currently llama.cpp has no vision support for the model, but hopefully it will be added soon.
To run the model at full precision, you only need the 4-bit or 5-bit Dynamic GGUFs (e.g. UD-Q4_K_XL), because the model was originally released in INT4 format.
You can choose a higher-bit quantization just to be safe in case of small quantization differences, but in most cases this is unnecessary.
Kimi K2.5 differences from Kimi K2 Thinking
Both models use a modified DeepSeek V3 MoE architecture.
- rope_scaling.beta_fast: K2.5 uses 32.0 vs K2 Thinking's 1.0.
- MoonViT is the native-resolution 200M-parameter vision encoder, similar to the one used in Kimi-VL-A3B-Instruct.
🌙 Usage Guide:
According to Moonshot AI, these are the recommended settings for Kimi K2.5 inference:
- temperature = 0.6 or 1.0 (the value depends on the use-case; see the note below)
- top_p = 0.95
- min_p = 0.01
Set the temperature to 1.0 to reduce repetition and incoherence.
Suggested context length = 98,304 (up to 256K)
Note: Using different tools may require different settings
We recommend setting min_p to 0.01 to suppress unlikely, low-probability tokens. If needed, disable the repeat penalty by setting it to 1.0.
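As a rough sketch, here is how those settings map onto llama.cpp's sampling flags (the model path is a placeholder, and the temperature should be 0.6 or 1.0 depending on your use-case):

```bash
./llama.cpp/llama-cli \
    --model /path/to/Kimi-K2.5-GGUF-quant.gguf \
    --ctx-size 98304 \
    --temp 1.0 \
    --top-p 0.95 \
    --min-p 0.01 \
    --repeat-penalty 1.0
```

llama-server accepts the same sampling flags as server-side defaults.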
Chat Template for Kimi K2.5
Running tokenizer.apply_chat_template([{"role": "user", "content": "What is 1+1?"}]) produces the model's raw prompt string.
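A minimal sketch for reproducing this with transformers (the repo id moonshotai/Kimi-K2.5 is an assumption here; point it at whichever Kimi K2.5 tokenizer you have access to):

```python
from transformers import AutoTokenizer

# Assumed repo id for the original (non-GGUF) release; adjust if needed.
tokenizer = AutoTokenizer.from_pretrained("moonshotai/Kimi-K2.5", trust_remote_code=True)

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What is 1+1?"}],
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)  # prints the raw chat-template string, including special tokens
```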
✨ Run Kimi K2.5 in llama.cpp
For this guide we'll be running the smallest 1-bit quant, which is 240GB in size. Feel free to change the quantization type to 2-bit, 3-bit etc. To run the model at near-full precision, use the 4-bit or 5-bit quants; anything higher is just extra safety margin.
Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.
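For example, a build sketch along the lines used in other Unsloth GGUF guides (package names and build targets are typical choices, not strict requirements):

```bash
# Build tools and CURL headers (use sudo if needed)
apt-get update
apt-get install build-essential cmake curl libcurl4-openssl-dev -y

git clone https://github.com/ggml-org/llama.cpp

# Use -DGGML_CUDA=OFF for CPU-only inference
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j \
    --target llama-cli llama-server llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp/
```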
If you want to use llama.cpp directly to load models, you can do the below (:UD-TQ1_0 is the quantization type). You can also download via Hugging Face (point 3). This is similar to ollama run. Use export LLAMA_CACHE="folder" to force llama.cpp to save to a specific location.
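A sketch of that direct-load flow (the cache folder and sampling values are just examples):

```bash
# Optional: control where llama.cpp caches downloaded GGUFs
export LLAMA_CACHE="unsloth/Kimi-K2.5-GGUF"

# Downloads the UD-TQ1_0 quant from Hugging Face and starts an interactive chat
./llama.cpp/llama-cli \
    -hf unsloth/Kimi-K2.5-GGUF:UD-TQ1_0 \
    --ctx-size 16384 \
    --temp 1.0 --top-p 0.95 --min-p 0.01
```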
LLAMA_SET_ROWS=1 makes llama.cpp a little bit faster! Use it! --fit on auto fits models on all your GPUs and CPUs optimally.
--fit on will auto fit the model to your system. If not using --fit on and you have around 360GB of combined GPU memory, remove -ot ".ffn_.*_exps.=CPU" to get maximum speed.
Use --fit on for auto fitting on GPUs and CPUs. If this doesn't work, then see below:
Please try out -ot ".ffn_.*_exps.=CPU" to offload all MoE layers to the CPU! This effectively allows you to fit all non-MoE layers on a single GPU, improving generation speed. You can customize the regex to fit more layers if you have more GPU capacity.
If you have a bit more GPU memory, try -ot ".ffn_(up|down)_exps.=CPU" This offloads up and down projection MoE layers.
Try -ot ".ffn_(up)_exps.=CPU" if you have even more GPU memory. This offloads only up projection MoE layers.
And finally offload all layers via -ot ".ffn_.*_exps.=CPU" This uses the least VRAM.
You can also customize the regex, for example -ot "\.(6|7|8|9|[0-9][0-9]|[0-9][0-9][0-9])\.ffn_(gate|up|down)_exps.=CPU" means to offload gate, up and down MoE layers but only from the 6th layer onwards.
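For instance, a minimal invocation with full MoE offload might look like this (the quant tag and context size are placeholders; keep --n-gpu-layers high so the non-MoE layers stay on the GPU):

```bash
./llama.cpp/llama-cli \
    -hf unsloth/Kimi-K2.5-GGUF:UD-TQ1_0 \
    --n-gpu-layers 99 \
    -ot ".ffn_.*_exps.=CPU" \
    --ctx-size 16384 \
    --temp 1.0 --top-p 0.95 --min-p 0.01
```

Swap the -ot regex for any of the variants above as your GPU memory allows.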
Download the model via the snippet below (after installing the dependencies with pip install huggingface_hub hf_transfer). We recommend using our 2-bit dynamic quant UD-Q2_K_XL to balance size and accuracy. All versions at: huggingface.co/unsloth/Kimi-K2.5-GGUF
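A download sketch using huggingface_hub (the allow_patterns glob assumes Unsloth's usual folder layout; swap in whichever quant you want):

```python
# pip install huggingface_hub hf_transfer
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"  # optional: faster downloads

from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="unsloth/Kimi-K2.5-GGUF",
    local_dir="unsloth/Kimi-K2.5-GGUF",
    allow_patterns=["*UD-Q2_K_XL*"],  # e.g. "*UD-TQ1_0*" for the 1-bit quant
)
```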
If you find that downloads get stuck at 90 to 95% or so, please see our troubleshooting guide.
Run any prompt.
Edit --ctx-size 16384 for context length. You can also leave this out for automatic context-length discovery via --fit on.
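Putting it together, a run sketch for the UD-Q2_K_XL download above (the multi-part GGUF filename is illustrative; point --model at the first shard of whichever quant you actually downloaded):

```bash
export LLAMA_SET_ROWS=1  # small speedup, as noted above

# The GGUF path below is illustrative; use your downloaded shard
./llama.cpp/llama-cli \
    --model unsloth/Kimi-K2.5-GGUF/UD-Q2_K_XL/Kimi-K2.5-UD-Q2_K_XL-00001-of-00009.gguf \
    --n-gpu-layers 99 \
    -ot ".ffn_.*_exps.=CPU" \
    --ctx-size 16384 \
    --temp 1.0 --top-p 0.95 --min-p 0.01 \
    --prompt "Create a Flappy Bird game in HTML"
```

Alternatively, drop the -ot line and use --fit on (described above) to auto-fit the model across your GPUs and CPU.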
As an example try: "Create a Flappy Bird game in HTML", and you will get:

✨ Deploy with llama-server and OpenAI's completion library
Using --kv-unified can make inference serving faster in llama.cpp! See https://www.reddit.com/r/LocalLLaMA/comments/1qnwa33/glm_47_flash_huge_performance_improvement_with_kvu/
After installing llama.cpp as described in the Kimi K2.5 section above, you can use the below to launch an OpenAI-compatible server:
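For example (the port, alias, and offload choices below are assumptions to adjust for your hardware, and the GGUF path is illustrative):

```bash
./llama.cpp/llama-server \
    --model unsloth/Kimi-K2.5-GGUF/UD-Q2_K_XL/Kimi-K2.5-UD-Q2_K_XL-00001-of-00009.gguf \
    --alias kimi-k2.5 \
    --host 0.0.0.0 --port 8080 \
    --ctx-size 16384 \
    --n-gpu-layers 99 \
    -ot ".ffn_.*_exps.=CPU" \
    --temp 1.0 --top-p 0.95 --min-p 0.01 \
    --kv-unified
```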
Then use OpenAI's Python library after pip install openai:
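A minimal client sketch against the server above (the base URL, placeholder API key, and model alias are assumptions matching the example launch command):

```python
# pip install openai
from openai import OpenAI

client = OpenAI(
    base_url="http://127.0.0.1:8080/v1",  # llama-server's OpenAI-compatible endpoint
    api_key="sk-no-key-needed",           # llama-server does not check the key by default
)

response = client.chat.completions.create(
    model="kimi-k2.5",  # matches the --alias passed to llama-server
    messages=[{"role": "user", "content": "What is 1+1?"}],
    temperature=1.0,
    top_p=0.95,
)
print(response.choices[0].message.content)
```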
And we get:

And in the other llama-server screen:

📊 Benchmarks
Reasoning & Knowledge

| Benchmark | Kimi K2.5 | Comparison A | Comparison B | Comparison C | DeepSeek-V3.2 | Comparison D |
|---|---|---|---|---|---|---|
| HLE-Full | 30.1 | 34.5 | 30.8 | 37.5 | 25.1† | - |
| HLE-Full (w/ tools) | 50.2 | 45.5 | 43.2 | 45.8 | 40.8† | - |
| AIME 2025 | 96.1 | 100 | 92.8 | 95.0 | 93.1 | - |
| HMMT 2025 (Feb) | 95.4 | 99.4 | 92.9* | 97.3* | 92.5 | - |
| IMO-AnswerBench | 81.8 | 86.3 | 78.5* | 83.1* | 78.3 | - |
| GPQA-Diamond | 87.6 | 92.4 | 87.0 | 91.9 | 82.4 | - |
| MMLU-Pro | 87.1 | 86.7* | 89.3* | 90.1 | 85.0 | - |
Image & Video

| Benchmark | Kimi K2.5 | Comparison A | Comparison B | Comparison C | DeepSeek-V3.2 | Comparison D |
|---|---|---|---|---|---|---|
| MMMU-Pro | 78.5 | 79.5* | 74.0 | 81.0 | - | 69.3 |
| CharXiv (RQ) | 77.5 | 82.1 | 67.2* | 81.4 | - | 66.1 |
| MathVision | 84.2 | 83.0 | 77.1* | 86.1* | - | 74.6 |
| MathVista (mini) | 90.1 | 82.8* | 80.2* | 89.8* | - | 85.8 |
| ZeroBench | 9 | 9* | 3* | 8* | - | 4* |
| ZeroBench (w/ tools) | 11 | 7* | 9* | 12* | - | 3* |
| OCRBench | 92.3 | 80.7* | 86.5* | 90.3* | - | 87.5 |
| OmniDocBench 1.5 | 88.8 | 85.7 | 87.7* | 88.5 | - | 82.0* |
| InfoVQA (val) | 92.6 | 84* | 76.9* | 57.2* | - | 89.5 |
| SimpleVQA | 71.2 | 55.8* | 69.7* | 69.7* | - | 56.8* |
| WorldVQA | 46.3 | 28.0 | 36.8 | 47.4 | - | 23.5 |
| VideoMMMU | 86.6 | 85.9 | 84.4* | 87.6 | - | 80.0 |
| MMVU | 80.4 | 80.8* | 77.3 | 77.5 | - | 71.1 |
| MotionBench | 70.4 | 64.8 | 60.3 | 70.3 | - | - |
| VideoMME | 87.4 | 86.0* | - | 88.4* | - | 79.0 |
| LongVideoBench | 79.8 | 76.5* | 67.2* | 77.7* | - | 65.6* |
| LVBench | 75.9 | - | - | 73.5* | - | 63.6 |
Coding

| Benchmark | Kimi K2.5 | Comparison A | Comparison B | Comparison C | DeepSeek-V3.2 | Comparison D |
|---|---|---|---|---|---|---|
| SWE-Bench Verified | 76.8 | 80.0 | 80.9 | 76.2 | 73.1 | - |
| SWE-Bench Pro | 50.7 | 55.6 | 55.4* | - | - | - |
| SWE-Bench Multilingual | 73.0 | 72.0 | 77.5 | 65.0 | 70.2 | - |
| Terminal Bench 2.0 | 50.8 | 54.0 | 59.3 | 54.2 | 46.4 | - |
| PaperBench | 63.5 | 63.7* | 72.9* | - | 47.1 | - |
| CyberGym | 41.3 | - | 50.6 | 39.9* | 17.3* | - |
| SciCode | 48.7 | 52.1 | 49.5 | 56.1 | 38.9 | - |
| OJBench (cpp) | 57.4 | - | 54.6* | 68.5* | 54.7* | - |
| LiveCodeBench (v6) | 85.0 | - | 82.2* | 87.4* | 83.3 | - |
Long Context

| Benchmark | Kimi K2.5 | Comparison A | Comparison B | Comparison C | DeepSeek-V3.2 | Comparison D |
|---|---|---|---|---|---|---|
| Longbench v2 | 61.0 | 54.5* | 64.4* | 68.2* | 59.8* | - |
| AA-LCR | 70.0 | 72.3* | 71.3* | 65.3* | 64.3* | - |
Agentic Search

| Benchmark | Kimi K2.5 | Comparison A | Comparison B | Comparison C | DeepSeek-V3.2 | Comparison D |
|---|---|---|---|---|---|---|
| BrowseComp | 60.6 | 65.8 | 37.0 | 37.8 | 51.4 | - |
| BrowseComp (w/ ctx manage) | 74.9 | 65.8 | 57.8 | 59.2 | 67.6 | - |
| BrowseComp (Agent Swarm) | 78.4 | - | - | - | - | - |
| WideSearch (item-f1) | 72.7 | - | 76.2* | 57.0 | 32.5* | - |
| WideSearch (item-f1, Agent Swarm) | 79.0 | - | - | - | - | - |
| DeepSearchQA | 77.1 | 71.3* | 76.1* | 63.2* | 60.9* | - |
| FinSearchComp T2&T3 | 67.8 | - | 66.2* | 49.9 | 59.1* | - |
| Seal-0 | 57.4 | 45.0 | 47.7* | 45.5* | 49.5* | - |
Notes
* = score re-evaluated by the authors (not publicly available previously).
† = DeepSeek V3.2 score corresponds to its text-only subset (as noted in the footnotes).
- = not evaluated / not available.