# Cogito v2.1: How to Run Locally

{% hint style="success" %}
Deep Cogito v2.1 is an updated 671B MoE model and the most powerful open-weights model as of 19 November 2025.
{% endhint %}

Cogito v2.1 comes in a single 671B MoE size, while [Deep Cogito](https://www.deepcogito.com/)'s earlier Cogito v2 Preview release spans 4 model sizes ranging from 70B to 671B. By using **IDA (Iterated Distillation & Amplification)**, these models internalize the reasoning process through iterative policy improvement, rather than simply searching longer at inference time (like DeepSeek R1).

Deep Cogito is based in [San Francisco, USA](https://techcrunch.com/2025/04/08/deep-cogito-emerges-from-stealth-with-hybrid-ai-reasoning-models/) (like Unsloth :flag\_us:) and we're excited to provide dynamic quantized models for all 4 model sizes! All uploads use Unsloth [Dynamic 2.0](https://unsloth.ai/docs/basics/unsloth-dynamic-2.0-ggufs) for SOTA 5-shot MMLU and KL Divergence performance, meaning you can run & fine-tune these quantized LLMs with minimal accuracy loss!

**Tutorials navigation:**

<a href="https://docs.unsloth.ai/basics/tutorials-how-to-fine-tune-and-run-llms/cogito-v2-how-to-run-locally#run-cogito-671b-moe-in-llama.cpp" class="button secondary">Run 671B MoE</a><a href="https://docs.unsloth.ai/basics/tutorials-how-to-fine-tune-and-run-llms/cogito-v2-how-to-run-locally#run-cogito-109b-moe-in-llama.cpp" class="button secondary">Run 109B MoE</a><a href="https://docs.unsloth.ai/basics/tutorials-how-to-fine-tune-and-run-llms/cogito-v2-how-to-run-locally#run-cogito-405b-dense-in-llama.cpp" class="button secondary">Run 405B Dense</a><a href="https://docs.unsloth.ai/basics/tutorials-how-to-fine-tune-and-run-llms/cogito-v2-how-to-run-locally#run-cogito-70b-dense-in-llama.cpp" class="button secondary">Run 70B Dense</a>

{% hint style="success" %}
Choose which model size fits your hardware! We upload 1.58bit to 16bit variants for all 4 model sizes!
{% endhint %}

## :gem: Model Sizes and Uploads

There are 4 model sizes:

1. 2 dense models based on Llama 3 - 70B and 405B
2. 2 MoE models based on Llama 4 Scout (109B) and DeepSeek R1 (671B)

<table data-full-width="false"><thead><tr><th>Model Sizes</th><th width="256.9999694824219">Recommended Quant &#x26; Link</th><th>Disk Size</th><th>Architecture</th></tr></thead><tbody><tr><td>70B Dense</td><td><a href="https://huggingface.co/unsloth/cogito-v2-preview-llama-70B-GGUF">UD-Q4_K_XL</a></td><td><strong>44GB</strong></td><td>Llama 3 70B</td></tr><tr><td>109B MoE</td><td><a href="https://huggingface.co/unsloth/cogito-v2-preview-llama-109B-MoE-GGUF">UD-Q3_K_XL</a></td><td><strong>50GB</strong></td><td>Llama 4 Scout</td></tr><tr><td>405B Dense</td><td><a href="https://huggingface.co/unsloth/cogito-v2-preview-llama-405B-GGUF">UD-Q2_K_XL</a></td><td><strong>152GB</strong></td><td>Llama 3 405B</td></tr><tr><td>671B MoE</td><td><a href="https://huggingface.co/unsloth/cogito-v2-preview-deepseek-671B-MoE-GGUF">UD-Q2_K_XL</a></td><td><strong>251GB</strong></td><td>DeepSeek R1</td></tr></tbody></table>
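As a rough sanity check on the table above, disk size divided by parameter count gives an approximate bits-per-weight figure for each quant. This is only a ballpark sketch - it ignores GB vs GiB, metadata, and embedding/output layers kept at higher precision:

```python
def bits_per_weight(disk_gb, params_b):
    """Approximate bits per weight: disk bytes * 8 / parameter count."""
    return disk_gb * 8 / params_b

# 70B dense at UD-Q4_K_XL (44GB) and 671B MoE at UD-Q2_K_XL (251GB)
print(round(bits_per_weight(44, 70), 2))
print(round(bits_per_weight(251, 671), 2))
```

Dynamic quants keep sensitive layers at higher precision, so the effective average sits a little above the nominal bit-width in the quant name.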

{% hint style="success" %}
Though not necessary, for the best performance have your combined VRAM + RAM equal to or greater than the size of the quant you're downloading. If you have less VRAM + RAM, the quant will still function, just much more slowly.
{% endhint %}

## 🐳 Run Cogito 671B MoE in llama.cpp

1. Obtain the latest `llama.cpp` on [GitHub here](https://github.com/ggml-org/llama.cpp). You can follow the build instructions below as well. Change `-DGGML_CUDA=ON` to `-DGGML_CUDA=OFF` if you don't have a GPU or just want CPU inference. **For Apple Mac / Metal devices**, set `-DGGML_CUDA=OFF` then continue as usual - Metal support is on by default.

{% code overflow="wrap" %}

```shellscript
apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-quantize llama-cli llama-gguf-split llama-mtmd-cli
cp llama.cpp/build/bin/llama-* llama.cpp
```

{% endcode %}

2. If you want to use `llama.cpp` directly to load models, you can do the below (the part after the colon, e.g. `:UD-Q2_K_XL`, is the quantization type). You can also download via Hugging Face (point 3). This is similar to `ollama run`. Use `export LLAMA_CACHE="folder"` to force `llama.cpp` to save to a specific location.

{% hint style="success" %}
Please try out `-ot ".ffn_.*_exps.=CPU"` to offload all MoE layers to the CPU! This effectively allows you to fit all non-MoE layers on 1 GPU, improving generation speeds, and uses the least VRAM.

If you have a bit more GPU memory, try `-ot ".ffn_(up|down)_exps.=CPU"`, which offloads only the up and down projection MoE layers.

Try `-ot ".ffn_(up)_exps.=CPU"` if you have even more GPU memory. This offloads only the up projection MoE layers.

You can also customize the regex further, for example `-ot "\.(6|7|8|9|[0-9][0-9]|[0-9][0-9][0-9])\.ffn_(gate|up|down)_exps.=CPU"` offloads the gate, up and down MoE layers, but only from the 6th layer onwards.
{% endhint %}
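The `-ot` patterns above are ordinary regular expressions matched against tensor names. A quick way to sanity-check a pattern before launching is to test it in Python - the tensor names below are illustrative examples following llama.cpp's `blk.N.*` naming scheme:

```python
import re

# Illustrative tensor names in llama.cpp's blk.N.* naming scheme
tensors = [
    "blk.3.ffn_gate_exps.weight",
    "blk.3.ffn_up_exps.weight",
    "blk.7.ffn_down_exps.weight",
    "blk.7.attn_q.weight",
]

def offloaded(pattern, names):
    """Return tensor names a `-ot 'pattern=CPU'` override would send to the CPU."""
    return [n for n in names if re.search(pattern, n)]

print(offloaded(r".ffn_.*_exps.", tensors))            # all MoE expert tensors
print(offloaded(r".ffn_(up|down)_exps.", tensors))     # only up/down projections
print(offloaded(r"\.(6|7|8|9|[0-9][0-9])\.ffn_(gate|up|down)_exps.", tensors))  # layer 6 onwards
```

Note that attention tensors like `attn_q` never match, which is why the non-MoE layers stay on the GPU.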

```shellscript
export LLAMA_CACHE="unsloth/cogito-671b-v2.1-GGUF"
./llama.cpp/llama-cli \
    -hf unsloth/cogito-671b-v2.1-GGUF:UD-Q2_K_XL \
    --n-gpu-layers 99 \
    --temp 0.6 \
    --top_p 0.95 \
    --min_p 0.01 \
    --ctx-size 16384 \
    --seed 3407 \
    --jinja \
    -ot ".ffn_.*_exps.=CPU"
```

3. Download the model via Python (after installing `pip install huggingface_hub hf_transfer`). You can choose `UD-IQ1_S` (dynamic 1.78-bit quant) or other quantized versions like `Q4_K_M`. We <mark style="background-color:green;">**recommend using our 2.7bit dynamic quant `UD-Q2_K_XL` to balance size and accuracy**</mark>. More versions at: <https://huggingface.co/unsloth/cogito-671b-v2.1-GGUF>

{% code overflow="wrap" %}

```python
# !pip install huggingface_hub hf_transfer
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "0" # Can sometimes rate limit, so set to 0 to disable
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id = "unsloth/cogito-671b-v2.1-GGUF",
    local_dir = "unsloth/cogito-671b-v2.1-GGUF",
    allow_patterns = ["*UD-IQ1_S*"], # Dynamic 1bit (168GB) Use "*UD-Q2_K_XL*" for Dynamic 2bit (251GB)
)
```

{% endcode %}
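Large quants are downloaded as multiple split GGUF shards; `llama.cpp` only needs the path to the first shard and loads the rest automatically. A small helper to pick it out (the shard filenames below are hypothetical, following llama.cpp's `-0000X-of-0000Y` split convention):

```python
def first_shard(filenames):
    """Return the first .gguf shard; llama.cpp loads the remaining shards itself."""
    shards = sorted(f for f in filenames if f.endswith(".gguf"))
    return shards[0] if shards else None

# Hypothetical shard names following llama.cpp's split naming convention
files = [
    "cogito-671b-v2.1-UD-IQ1_S-00002-of-00004.gguf",
    "cogito-671b-v2.1-UD-IQ1_S-00001-of-00004.gguf",
]
print(first_shard(files))  # pass this path to llama-cli via --model
```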

4. Edit `--threads 32` for the number of CPU threads, `--ctx-size 16384` for context length, and `--n-gpu-layers 99` for how many layers to offload to the GPU. Lower it (e.g. to 2) if your GPU runs out of memory, or remove it entirely for CPU-only inference.
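When picking `--ctx-size`, the KV cache is the main memory cost that grows with context. A rough sketch for the Llama-based dense models (the 671B MoE uses DeepSeek-style MLA, which caches differently; the 80-layer / 8-KV-head / 128-dim figures match Llama 3 70B's architecture):

```python
def kv_cache_bytes(n_layers, n_ctx, n_kv_heads, head_dim, dtype_bytes=2):
    """Approximate KV cache size: K and V tensors per layer, per context position."""
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * dtype_bytes

# Llama 3 70B-style architecture with an f16 cache at --ctx-size 16384
gb = kv_cache_bytes(80, 16384, 8, 128) / 1e9
print(f"{gb:.1f} GB")
```

Halving `--ctx-size` halves this figure, which is often the easiest fix when a model almost fits.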

## :mouse\_three\_button: Run Cogito 109B MoE in llama.cpp

1. Follow the same instructions as running the [671B model above](#run-cogito-671b-moe-in-llama.cpp).
2. Then run the below:

```shellscript
export LLAMA_CACHE="unsloth/cogito-v2-preview-llama-109B-MoE-GGUF"
./llama.cpp/llama-cli \
    -hf unsloth/cogito-v2-preview-llama-109B-MoE-GGUF:Q3_K_XL \
    --n-gpu-layers 99 \
    --temp 0.6 \
    --min-p 0.01 \
    --top-p 0.9 \
    --ctx-size 16384 \
    --jinja \
    -ot ".ffn_.*_exps.=CPU"
```

## :deciduous\_tree: Run Cogito 405B Dense in llama.cpp

1. Follow the same instructions as running the [671B model above](#run-cogito-671b-moe-in-llama.cpp).
2. Then run the below:

```shellscript
export LLAMA_CACHE="unsloth/cogito-v2-preview-llama-405B-GGUF"
./llama.cpp/llama-cli \
    -hf unsloth/cogito-v2-preview-llama-405B-GGUF:Q2_K_XL \
    --n-gpu-layers 99 \
    --temp 0.6 \
    --min-p 0.01 \
    --top-p 0.9 \
    --jinja \
    --ctx-size 16384
```

## :sunglasses: Run Cogito 70B Dense in llama.cpp

1. Follow the same instructions as running the [671B model above](#run-cogito-671b-moe-in-llama.cpp).
2. Then run the below:

```shellscript
export LLAMA_CACHE="unsloth/cogito-v2-preview-llama-70B-GGUF"
./llama.cpp/llama-cli \
    -hf unsloth/cogito-v2-preview-llama-70B-GGUF:Q4_K_XL \
    --n-gpu-layers 99 \
    --temp 0.6 \
    --min-p 0.01 \
    --top-p 0.9 \
    --jinja \
    --ctx-size 16384
```

See <https://www.deepcogito.com/research/cogito-v2-1> for more details.
