Run DeepSeek-R1-0528 Dynamic 1-bit GGUFs

May 29, 2025 • By Daniel & Michael

DeepSeek-R1-0528 is DeepSeek's new update to their R1 reasoning model. R1-0528 is the world's most powerful open-source model, rivalling OpenAI's GPT-4.5, o3 and Google's Gemini 2.5 Pro.

DeepSeek also released an R1-0528 distilled version, created by fine-tuning Qwen3 (8B). The distill achieves the same performance as Qwen3 (235B). Qwen3 GGUF: DeepSeek-R1-0528-Qwen3-8B-GGUF

You can also fine-tune the Qwen3 distill with Unsloth.
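If you want to try that, here is a minimal sketch of Unsloth's LoRA workflow. The repo id, sequence length, and LoRA hyperparameters are illustrative assumptions, not a fixed recipe; check our Hugging Face page for the exact checkpoint name.

```python
# Minimal sketch: LoRA fine-tuning of the distill with Unsloth.
# The repo id and hyperparameters below are illustrative assumptions.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/DeepSeek-R1-0528-Qwen3-8B",  # assumed repo id
    max_seq_length = 2048,
    load_in_4bit = True,  # 4-bit QLoRA so it fits on a single consumer GPU
)

# Attach LoRA adapters - only these small matrices are trained.
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 16,
)
```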

You can run the model using Unsloth's 1.78-bit Dynamic 2.0 GGUFs on your favorite inference framework. We quantized DeepSeek's 671B-parameter R1 model from 720GB down to 185GB, a roughly 75% size reduction.

Recommended: Read our Complete Guide for a walkthrough on how to run DeepSeek-R1-0528 locally.

To ensure the best tradeoff between accuracy and size, we do not quantize all layers uniformly, but selectively quantize e.g. the MoE layers to lower bits, and leave attention and other layers in 4 or 6-bit.
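As a rough illustration of why this works: most of R1's 671B parameters sit in the MoE expert layers, so pushing just those to very low bit widths drags the average bits-per-weight down sharply. The 90/10 split and the bit widths below are assumed for illustration, not our exact recipe.

```python
# Back-of-the-envelope: why mixing a few higher-precision layers with
# mostly ~2-bit MoE layers still lands near 185GB.
# The 90/10 split and bit widths are ASSUMED for illustration only.
total_params = 671e9  # DeepSeek-R1's parameter count

moe_fraction = 0.90   # assumed share of weights in MoE expert layers
moe_bits     = 1.78   # experts quantized very aggressively
other_bits   = 6.0    # attention and shared layers kept at higher precision

avg_bits = moe_fraction * moe_bits + (1 - moe_fraction) * other_bits
size_gb  = total_params * avg_bits / 8 / 1e9  # bits -> bytes -> GB

print(f"average bits per weight: {avg_bits:.2f}")   # ~2.20
print(f"approximate model size:  {size_gb:.0f} GB") # ~185 GB
```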

And grab our full DeepSeek-R1-0528 GGUFs here.
🐋 How to Run DeepSeek-R1-0528
DeepSeek-R1-0528-Qwen3-8B can fit on pretty much any setup, even those with as little as 20GB of RAM. No preparation is needed beforehand.

The Qwen3 distill and the full R1-0528 model use the same settings and chat template.

According to DeepSeek, these are the recommended inference settings for R1 (R1-0528 should use the same settings); a request-level example follows the list:
- Set the temperature to 0.6 to reduce repetition and incoherence.
- Set top_p to 0.95 (recommended).
- Run multiple tests and average the results for reliable evaluation.
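If you serve the model behind an OpenAI-compatible endpoint (for example llama.cpp's llama-server), the same settings can be passed per request. A minimal sketch, assuming a server already listening on localhost:8080; the port and served model name are placeholders.

```python
# Minimal sketch: pass DeepSeek's recommended sampling settings per request
# through an OpenAI-compatible endpoint (e.g. llama.cpp's llama-server).
# The port and served model name below are placeholders, not fixed values.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed-locally")

response = client.chat.completions.create(
    model="DeepSeek-R1-0528",  # whatever name your server exposes
    messages=[{"role": "user", "content": "What is 1+1?"}],
    temperature=0.6,  # reduces repetition and incoherence
    top_p=0.95,       # recommended nucleus sampling value
)
print(response.choices[0].message.content)
```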

For optimal runtime performance, we recommend using the 2.71-bit Dynamic version and ensuring you have at least 80GB of combined VRAM and system RAM. While it's technically possible to run the model without a GPU, we advise against it, unless you're leveraging Apple's unified memory chips.

For the 1.78-bit quantization:
- On 1x 24GB GPU (with all layers offloaded), you can expect up to 20 tokens/second of throughput and around 4 tokens/second for single-user inference.
- Try to have a combination of RAM + VRAM that adds up to the size of the quant you're downloading (see the fit-check sketch after this list).
- A 24GB GPU like the RTX 4090 should achieve around 3 tokens/second, depending on workload and configuration.
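As a quick sanity check on the RAM + VRAM rule of thumb above, here is a small helper using the quant sizes quoted in this post. It is only approximate: the KV cache and context length add overhead on top of the file size.

```python
# Rough fit check: does RAM + VRAM cover the quant you want to download?
# Sizes are the ones quoted in this post; the KV cache and context
# add overhead on top, so treat a borderline "fits" with caution.
QUANT_SIZES_GB = {
    "UD-IQ1_S":   185,  # Dynamic 1.78-bit
    "UD-Q2_K_XL": 251,  # Dynamic 2.71-bit (recommended)
}

def fits(quant: str, vram_gb: float, ram_gb: float) -> bool:
    """True if combined VRAM + system RAM covers the quant's file size."""
    return vram_gb + ram_gb >= QUANT_SIZES_GB[quant]

print(fits("UD-IQ1_S", vram_gb=24, ram_gb=192))    # True: 216GB >= 185GB
print(fits("UD-Q2_K_XL", vram_gb=80, ram_gb=128))  # False: 208GB < 251GB
```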

🦙 How to Run R1-0528-Qwen3-8B in Ollama:

  • Install ollama if you haven't already! Ollama can only run models up to 32B in size; to run the full 720GB R1-0528 model, see here.

```bash
apt-get update
apt-get install pciutils -y
curl -fsSL https://ollama.com/install.sh | sh
```
  • Run the model! Note you can call `ollama serve` in another terminal if it fails. We include all our fixes and suggested parameters (temperature etc.) in `params` in our Hugging Face upload! (A REST API example follows this list.)

```bash
ollama run hf.co/unsloth/DeepSeek-R1-0528-Qwen3-8B-GGUF:Q4_K_XL
```
  • To disable thinking, append `/no_think` to your prompt (or you can set it in the system prompt):

```
>>> Write your prompt here /no_think
```
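Once the model is up, you can also query it programmatically through Ollama's local REST API (it listens on port 11434 by default) instead of the interactive prompt. A minimal sketch with Python's requests:

```python
# Minimal sketch: query the model through Ollama's local REST API
# (port 11434 by default) instead of the interactive prompt.
import requests

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "hf.co/unsloth/DeepSeek-R1-0528-Qwen3-8B-GGUF:Q4_K_XL",
        "prompt": "What is 1+1?",
        "stream": False,  # return a single JSON object instead of a stream
        "options": {"temperature": 0.6, "top_p": 0.95},  # recommended settings
    },
)
print(response.json()["response"])
```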

✨ How to Run R1-0528 in llama.cpp:

  • Obtain the latest llama.cpp from GitHub here. You can follow the build instructions below. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.
```bash
apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first \
    --target llama-quantize llama-cli llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp
```
  • If you want to use llama.cpp directly to download and load the model, you can do the below. `:IQ1_S` is the quantization type. You can also download via Hugging Face (point 3). This is similar to `ollama run`. Use `export LLAMA_CACHE="folder"` to force llama.cpp to save to a specific location.
```bash
export LLAMA_CACHE="unsloth/DeepSeek-R1-0528-GGUF"
./llama.cpp/llama-cli \
    -hf unsloth/DeepSeek-R1-0528-GGUF:IQ1_S \
    --cache-type-k q4_0 \
    --threads -1 \
    --n-gpu-layers 99 \
    --prio 3 \
    --temp 0.6 \
    --top_p 0.95 \
    --min_p 0.01 \
    --ctx-size 16384 \
    --seed 3407 \
    -ot ".ffn_.*_exps.=CPU"
```
  • Download the model via the script below (after installing `pip install huggingface_hub hf_transfer`). You can choose UD-IQ1_S (dynamic 1.78-bit quant) or other quantized versions like Q4_K_M. We recommend using our 2.71-bit dynamic quant UD-Q2_K_XL to balance size and accuracy.
```python
# !pip install huggingface_hub hf_transfer
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "0"  # Can sometimes rate limit, so set to 0 to disable
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id = "unsloth/DeepSeek-R1-0528-GGUF",
    local_dir = "unsloth/DeepSeek-R1-0528-GGUF",
    allow_patterns = ["*UD-IQ1_S*"],  # Dynamic 1-bit (185GB); use "*UD-Q2_K_XL*" for Dynamic 2-bit (251GB)
)
```
  • Run Unsloth's Flappy Bird test as described in our 1.58bit Dynamic Quant for DeepSeek R1.
  • Edit `--threads` for the number of CPU threads, `--ctx-size` for context length, and `--n-gpu-layers` for how many layers to offload to the GPU. Try lowering it if your GPU runs out of memory; remove it entirely for CPU-only inference. The `-ot ".ffn_.*_exps.=CPU"` flag offloads the MoE expert tensors to the CPU, keeping the smaller attention and shared layers on the GPU.
```bash
./llama.cpp/llama-cli \
    --model unsloth/DeepSeek-R1-0528-GGUF/UD-IQ1_S/DeepSeek-R1-0528-UD-IQ1_S-00001-of-00004.gguf \
    --cache-type-k q4_0 \
    --threads -1 \
    --n-gpu-layers 99 \
    --prio 3 \
    --temp 0.6 \
    --top_p 0.95 \
    --min_p 0.01 \
    --ctx-size 16384 \
    --seed 3407 \
    -ot ".ffn_.*_exps.=CPU" \
    -no-cnv \
    --prompt "<|User|>Create a Flappy Bird game in Python. You must include these things:\n1. You must use pygame.\n2. The background color should be randomly chosen and is a light shade. Start with a light blue color.\n3. Pressing SPACE multiple times will accelerate the bird.\n4. The bird's shape should be randomly chosen as a square, circle or triangle. The color should be randomly chosen as a dark color.\n5. Place on the bottom some land colored as dark brown or yellow chosen randomly.\n6. Make a score shown on the top right side. Increment if you pass pipes and don't hit them.\n7. Make randomly spaced pipes with enough space. Color them randomly as dark green or light brown or a dark gray shade.\n8. When you lose, show the best score. Make the text inside the screen. Pressing q or Esc will quit the game. Restarting is pressing SPACE again.\nThe final game should be inside a markdown section in Python. Check your code for errors and fix them before the final markdown section.<|Assistant|>"
```
  • We also test our dynamic quants via the Heptagon test, which asks the model to create a basic physics engine simulating balls bouncing inside a spinning heptagon.
```bash
./llama.cpp/llama-cli \
    --model unsloth/DeepSeek-R1-0528-GGUF/UD-IQ1_S/DeepSeek-R1-0528-UD-IQ1_S-00001-of-00004.gguf \
    --cache-type-k q4_0 \
    --threads -1 \
    --n-gpu-layers 99 \
    --prio 3 \
    --temp 0.6 \
    --top_p 0.95 \
    --min_p 0.01 \
    --ctx-size 16384 \
    --seed 3407 \
    -ot ".ffn_.*_exps.=CPU" \
    -no-cnv \
    --prompt "<|User|>Write a Python program that shows 20 balls bouncing inside a spinning heptagon:\n- All balls have the same radius.\n- All balls have a number on it from 1 to 20.\n- All balls drop from the heptagon center when starting.\n- Colors are: #f8b862, #f6ad49, #f39800, #f08300, #ec6d51, #ee7948, #ed6d3d, #ec6800, #ec6800, #ee7800, #eb6238, #ea5506, #ea5506, #eb6101, #e49e61, #e45e32, #e17b34, #dd7a56, #db8449, #d66a35\n- The balls should be affected by gravity and friction, and they must bounce off the rotating walls realistically. There should also be collisions between balls.\n- The material of all the balls determines that their impact bounce height will not exceed the radius of the heptagon, but higher than ball radius.\n- All balls rotate with friction, the numbers on the ball can be used to indicate the spin of the ball.\n- The heptagon is spinning around its center, and the speed of spinning is 360 degrees per 5 seconds.\n- The heptagon size should be large enough to contain all the balls.\n- Do not use the pygame library; implement collision detection algorithms and collision response etc. by yourself. The following Python libraries are allowed: tkinter, math, numpy, dataclasses, typing, sys.\n- All codes should be put in a single Python file.<|Assistant|>"
```
💕 Thank you! 
Thank you for the constant support. We hope to have some great news in the coming weeks! 🙏

As always, be sure to join our Reddit page and Discord server for help or just to show your support! You can also follow us on Twitter and subscribe to our newsletter.
Thank you for reading!
Daniel & Michael Han 🦥
29 May 2025
