# Kimi K2.6 - How to Run Locally

Kimi K2.6 is an open model by Moonshot that delivers SOTA performance across vision, coding, agentic, long-context, and chat tasks. The 1T-parameter hybrid thinking model has a 256K context length and requires 600GB of disk space. You can run it locally via Unsloth: [**Kimi-K2.6-GGUFs**](https://huggingface.co/unsloth/Kimi-K2.6-GGUF)

### ⚙️ Usage Guide

To run Kimi K2.6 at **full precision**, use the **Q8** quant (`UD-Q8_K_XL`), which is only **10GB larger** than **Q4** (`UD-Q4_K_XL`). The near-full-precision `UD-Q4_K_XL` requires 600GB. All uploads use Unsloth [Dynamic 2.0](https://unsloth.ai/docs/basics/unsloth-dynamic-2.0-ggufs) for SOTA quant performance. Kimi-K2.6 GGUFs **support vision**.

Thinking and non-thinking mode require different settings:

| Default (Thinking Mode)                                            | Instant Mode                                                       |
| ------------------------------------------------------------------ | ------------------------------------------------------------------ |
| <mark style="background-color:green;">**temperature = 1.0**</mark> | <mark style="background-color:green;">**temperature = 0.6**</mark> |
| <mark style="background-color:green;">**top\_p = 0.95**</mark>     | <mark style="background-color:green;">**top\_p = 0.95**</mark>     |

* Suggested context length = `98,304` (up to `262,144`)

`UD-Q8_K_XL` is lossless relative to the original Kimi-K2.6 weights: Kimi uses int4 for the MoE weights and BF16 for the rest, and our `Q8_K_XL` follows the same scheme. `UD-Q4_K_XL` is identical to `UD-Q8_K_XL` except that all other tensors are `Q8_0`, so `UD-Q4_K_XL` is effectively near full precision.

{% hint style="info" %}
For best performance, make sure your total available memory (VRAM + system RAM) exceeds the size of the quantized model file you’re downloading. If it doesn’t, llama.cpp can still run via SSD/HDD offloading, but inference will be slower.
{% endhint %}

If the model fits in memory, you can expect >40 tokens/s on a B200. We recommend `UD-Q4_K_XL` (584GB) as a good size/quality balance. The best rule of thumb: RAM + VRAM should be at least the quant size; otherwise it will still work, just slower due to disk offloading.
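The rule of thumb above is easy to script as a sanity check. A minimal sketch, where `RAM_GB` and `VRAM_GB` are placeholder numbers you should replace with your own hardware totals:

```shell
# Will UD-Q4_K_XL fit? Compare the quant size against RAM + VRAM.
QUANT_GB=584   # UD-Q4_K_XL download size
RAM_GB=512     # replace with your system RAM in GB
VRAM_GB=96     # replace with your total GPU memory in GB
TOTAL_GB=$((RAM_GB + VRAM_GB))
if [ "$TOTAL_GB" -ge "$QUANT_GB" ]; then
    echo "Fits in memory: ${TOTAL_GB}GB >= ${QUANT_GB}GB"
else
    echo "Expect disk offloading: ${TOTAL_GB}GB < ${QUANT_GB}GB"
fi
```

With the placeholder numbers above this prints that the model fits; below the threshold, llama.cpp will still run but will offload to SSD/HDD.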

#### Chat Template for Kimi K2.6

Running `tokenizer.apply_chat_template([{"role": "user", "content": "What is 1+1?"}])` yields:

{% code overflow="wrap" %}

```
<|im_system|>system<|im_middle|>You are Kimi, an AI assistant created by Moonshot AI.<|im_end|><|im_user|>user<|im_middle|>What is 1+1?<|im_end|><|im_assistant|>assistant<|im_middle|><think>
```

{% endcode %}

## Run Kimi K2.6 Guide

### 🦥 Run Kimi-K2.6 in Unsloth Studio

Kimi K2.6 can run in [Unsloth Studio](https://unsloth.ai/docs/new/studio), an open-source web UI for local AI. With Unsloth Studio, you can run models locally on **macOS, Windows**, Linux and:

{% columns %}
{% column %}

* Search, download, [run GGUFs](https://unsloth.ai/docs/new/studio#run-models-locally) and safetensor models
* [**Self-healing** tool calling](https://unsloth.ai/docs/new/studio#execute-code--heal-tool-calling) + **web search**
* [**Code execution**](https://unsloth.ai/docs/new/studio#run-models-locally) (Python, Bash)
* [Automatic inference](https://unsloth.ai/docs/new/studio#model-arena) parameter tuning (temp, top-p, etc.)
* Fast CPU + GPU inference via llama.cpp
* [Train LLMs](https://unsloth.ai/docs/new/studio#no-code-training) 2x faster with 70% less VRAM
  {% endcolumn %}

{% column %}

<div data-with-frame="true"><figure><img src="https://3215535692-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FxhOjnexMCB3dmuQFQ2Zq%2Fuploads%2FFeQ0UUlnjXkDdqhcWglh%2Fskinny%20studio%20chat.png?alt=media&#x26;token=c2ee045f-c243-4024-a8e4-bb4dbe7bae79" alt=""><figcaption></figcaption></figure></div>
{% endcolumn %}
{% endcolumns %}

{% stepper %}
{% step %}
**Install Unsloth**

Run in your terminal:

macOS, Linux, WSL:

```bash
curl -fsSL https://unsloth.ai/install.sh | sh
```

Windows PowerShell:

```bash
irm https://unsloth.ai/install.ps1 | iex
```

{% hint style="success" %}
**Installation is quick, taking approximately 1-2 minutes.**
{% endhint %}
{% endstep %}

{% step %}
**Launch Unsloth**

macOS, Linux, WSL, and Windows:

```bash
unsloth studio -H 0.0.0.0 -p 8888
```

Then open `http://localhost:8888` in your browser.
{% endstep %}

{% step %}
**Search and download Kimi-K2.6**

On first launch you will need to create a password to secure your account and sign in again later. You’ll then see a brief onboarding wizard to choose a model, dataset, and basic settings. You can skip it at any time and go directly to chat.

Then go to the [Studio Chat](https://unsloth.ai/docs/new/studio/chat) tab, search for **Kimi-K2.6** in the search bar, and download your desired model and quant. Ensure you have enough memory to run the model.

<div data-with-frame="true"><figure><img src="https://3215535692-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FxhOjnexMCB3dmuQFQ2Zq%2Fuploads%2FK6m0UwUjRocivKbidBCl%2Fkimi%20screenshot%2026.png?alt=media&#x26;token=403731cc-ab1c-44b0-9fca-cd0745149193" alt="" width="563"><figcaption></figcaption></figure></div>
{% endstep %}

{% step %}
**Run Kimi-K2.6**

Inference parameters are auto-set when using Unsloth Studio, but you can still change them manually. You can also edit the context length, chat template, and other settings.

For more information, you can view our [Unsloth Studio inference guide](https://unsloth.ai/docs/new/studio/chat).

<div data-with-frame="true"><figure><img src="https://3215535692-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FxhOjnexMCB3dmuQFQ2Zq%2Fuploads%2FTbH2CrUTG2TWwgOP74GF%2FGemma%204%20example.gif?alt=media&#x26;token=56409d06-3735-4531-97c0-af9968371a26" alt="" width="563"><figcaption></figcaption></figure></div>
{% endstep %}
{% endstepper %}

### 🦙 Run Kimi K2.6 in llama.cpp

For this guide we'll be running the `UD-Q4_K_XL` quant, which requires at least 600GB of combined RAM + VRAM. The smaller quants are still uploading. Feel free to choose a different quantization type. GGUF: [**Kimi-K2.6-GGUF**](https://huggingface.co/unsloth/Kimi-K2.6-GGUF)

For these tutorials, we will be using [llama.cpp](https://github.com/ggml-org/llama.cpp) for fast local inference, which works well even on CPU-only machines.

{% stepper %}
{% step %}
Obtain the latest `llama.cpp` **on** [**GitHub here**](https://github.com/ggml-org/llama.cpp). You can follow the build instructions below as well. Change `-DGGML_CUDA=ON` to `-DGGML_CUDA=OFF` if you don't have a GPU or just want CPU inference. **For Apple Mac / Metal devices**, set `-DGGML_CUDA=OFF` then continue as usual; Metal support is enabled by default.

```bash
apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-mtmd-cli llama-server llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp
```

{% endstep %}

{% step %}
If you want `llama.cpp` to download and load the model directly, use the commands below; `UD-Q4_K_XL` is the quantization type. You can also download via Hugging Face first (step 3). This is similar to `ollama run`. Use `export LLAMA_CACHE="folder"` to force `llama.cpp` to save models to a specific location. The model supports a maximum context length of `262,144` tokens.

Use one of the specific commands below, according to your use-case:

**Thinking mode:**

```bash
export LLAMA_CACHE="unsloth/Kimi-K2.6-GGUF"
./llama.cpp/llama-cli \
    -hf unsloth/Kimi-K2.6-GGUF:UD-Q4_K_XL \
    --temp 1.0 \
    --top-p 0.95
```

**Non-thinking mode (Instant):**

```bash
export LLAMA_CACHE="unsloth/Kimi-K2.6-GGUF"
./llama.cpp/llama-cli \
    -hf unsloth/Kimi-K2.6-GGUF:UD-Q4_K_XL \
    --temp 0.6 \
    --top-p 0.95 \
    --chat-template-kwargs '{"enable_thinking":false}'
```
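If you prefer serving the model over an OpenAI-compatible API instead of the interactive CLI, the same flags work with `llama-server` (built in step 1). A minimal launch configuration sketch, using thinking-mode settings and the suggested context length (the port is an arbitrary choice):

```shell
./llama.cpp/llama-server \
    -hf unsloth/Kimi-K2.6-GGUF:UD-Q4_K_XL \
    --ctx-size 98304 \
    --temp 1.0 \
    --top-p 0.95 \
    --port 8001
```

Once running, you can point any OpenAI-compatible client at `http://localhost:8001/v1`.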

{% endstep %}

{% step %}
Download the model via the code below (after installing `pip install huggingface_hub hf_transfer`). If downloads get stuck, see: [hugging-face-hub-xet-debugging](https://unsloth.ai/docs/basics/troubleshooting-and-faqs/hugging-face-hub-xet-debugging "mention")

```bash
hf download unsloth/Kimi-K2.6-GGUF \
    --local-dir unsloth/Kimi-K2.6-GGUF \
    --include "*mmproj-F16*" \
    --include "*UD-Q4_K_XL*" # Use "*UD-Q8_K_XL*" for full precision
```
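The `UD-Q4_K_XL` shards total roughly 584GB, so it's worth confirming free disk space before (or during) the download. A minimal check, assuming GNU coreutils `df` (the `--output` flag is not available on BSD/macOS `df`):

```shell
# Show free space on the filesystem holding the current directory
df -h --output=avail . | tail -1
```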

{% endstep %}

{% step %}
Then run the model in conversation mode:

{% code overflow="wrap" %}

```bash
./llama.cpp/llama-cli \
    --model unsloth/Kimi-K2.6-GGUF/UD-Q4_K_XL/Kimi-K2.6-UD-Q4_K_XL-00001-of-00014.gguf \
    --mmproj unsloth/Kimi-K2.6-GGUF/mmproj-F16.gguf \
    --temp 1.0 \
    --top-p 0.95
```

{% endcode %}
{% endstep %}
{% endstepper %}

### 📊 Benchmarks

Benchmark results are shown in table format below:

<div data-with-frame="true"><figure><img src="https://3215535692-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FxhOjnexMCB3dmuQFQ2Zq%2Fuploads%2FdBDLDaRXybr9JMCs33bC%2Fkimibench.jpg?alt=media&#x26;token=040ea87d-09e8-452c-bfb2-4231305a20d2" alt="" width="563"><figcaption></figcaption></figure></div>
