IBM Granite 4.1 - How to Run Locally
Run IBM Granite-4.1 locally with Unsloth GGUFs and learn how to fine-tune it.
IBM has released the Granite-4.1 models in three sizes: 3B, 8B and 30B. Granite-4.1 is a long-context dense model family built for instruction following, tool calling, chat, RAG and coding use cases. The models are highly competitive for their sizes and were trained on 15T tokens.
Learn how to run Unsloth Granite-4.1 Dynamic GGUFs or fine-tune/RL the model. You can fine-tune Granite-4.1 with our free notebook for a support agent use-case.
Granite-4.1 model family:
Granite-4.1-3B Dense: Lightweight and efficient for local, edge and high-volume tasks. Great for quick classification, extraction, simple RAG, function calling and fine-tuning on smaller GPUs.
Granite-4.1-8B Dense: A balanced model for local assistants, RAG, coding, multilingual chat and tool-use workflows. This is a great default pick if you want stronger quality while keeping memory use practical.
Granite-4.1-30B Dense: The strongest Granite-4.1 model. Best for more demanding enterprise assistants, long-context tasks, complex RAG, coding, multilingual workflows and agentic tool-calling use cases.
⚙️ Usage Guide
Use these settings for deterministic, instruction-following responses:
temperature=0.0, top_p=1.0, top_k=0
Temperature = 0.0
Top_K = 0
Top_P = 1.0
Recommended minimum context: 16,384
Maximum context length window: 131,072 tokens
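As a rough sketch, the settings above can be kept in one place and reused by whatever launcher you write. The names `SAMPLING`, `MIN_RECOMMENDED_CTX`, `MAX_CTX` and `pick_context_length` are illustrative and not part of any library, and raising small requests up to the recommended minimum is a design choice made here, not a requirement of the model.

```python
# Recommended Granite-4.1 settings from the usage guide, kept in one place.
# All names in this block are illustrative, not part of Unsloth or llama.cpp.
SAMPLING = {"temperature": 0.0, "top_p": 1.0, "top_k": 0}
MIN_RECOMMENDED_CTX = 16_384   # recommended minimum context
MAX_CTX = 131_072              # maximum context length window


def pick_context_length(requested: int) -> int:
    """Clamp a requested context length into [recommended minimum, maximum].

    Assumption: requests below the recommended minimum are raised to it,
    and requests above the maximum window are lowered to it.
    """
    if requested <= 0:
        raise ValueError("context length must be a positive integer")
    return max(MIN_RECOMMENDED_CTX, min(requested, MAX_CTX))


print(SAMPLING, pick_context_length(32_768))
```

Keeping the limits as named constants makes it harder to launch a model with a context size outside the supported window by accident.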
Unsloth Granite-4.1 uploads
Run Granite-4.1 Tutorials
Do NOT use CUDA 13.2 as you may get gibberish outputs. NVIDIA is working on a fix.
🦥 Unsloth Studio Guide
For this tutorial, we will use Unsloth Studio, our new web UI for running and training LLMs. With Unsloth Studio, you can run models locally on Mac, Windows and Linux with audio, image and text input, and you can:
Search, download, run GGUFs and safetensor models
Compare models side-by-side
Self-healing tool calling + web search
Code execution (Python, Bash)
Automatic inference parameter tuning (temp, top-p, etc.)
Train LLMs 2x faster with 70% less VRAM

Search and download Granite 4.1
On first launch, you will need to create a password to secure your account; you will use it to sign in again later. Then go to the Studio Chat tab, search for Granite 4.1 in the search bar, and download your desired model and quant.
Run Granite 4.1
Inference parameters should be set automatically when using Unsloth Studio, but you can still change them manually. You can also edit the context length, chat template and other settings.
For more information, you can view our Unsloth Studio inference guide.
🦙 Llama.cpp Tutorial
Obtain the latest llama.cpp. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference. For Apple Mac / Metal devices, set -DGGML_CUDA=OFF then continue as usual; Metal support is on by default.
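The build steps described above can be sketched as a small Python helper that assembles the commands without running them. This is a minimal sketch, not the official build recipe: it assumes `git`, `cmake` and a C++ toolchain are already installed, that the source lives at `https://github.com/ggml-org/llama.cpp`, and that you only need the `llama-cli` and `llama-server` targets. The function name `llama_cpp_build_commands` is made up for this example.

```python
import shlex
from typing import List


def llama_cpp_build_commands(use_cuda: bool = True) -> List[List[str]]:
    """Return the clone/configure/build commands as argument lists.

    use_cuda=False switches -DGGML_CUDA=ON to -DGGML_CUDA=OFF, which is
    what the guide recommends for CPU-only and Apple Metal machines.
    """
    cuda_flag = "-DGGML_CUDA=ON" if use_cuda else "-DGGML_CUDA=OFF"
    return [
        ["git", "clone", "https://github.com/ggml-org/llama.cpp"],
        ["cmake", "llama.cpp", "-B", "llama.cpp/build",
         "-DBUILD_SHARED_LIBS=OFF", cuda_flag],
        # Assumed target list; add other llama.cpp tools if you need them.
        ["cmake", "--build", "llama.cpp/build", "--config", "Release",
         "-j", "--target", "llama-cli", "llama-server"],
    ]


# Print shell-ready commands for a CPU-only (or Metal) build.
for command in llama_cpp_build_commands(use_cuda=False):
    print(shlex.join(command))
```

The printed lines can be pasted into a terminal, or each list can be passed to `subprocess.run(command, check=True)` to run the build from Python.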
If you want to use llama.cpp directly to load models, you can do the below. UD-Q4_K_XL is the quantization type. You can also change it to other quantized versions like Q4_K_M, Q5_K_M, Q8_0 or BF16 full precision if available.
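One way to sketch the direct-load command is to build the `llama-cli` argument list in Python, with the quantization type as a parameter. The repository id `unsloth/granite-4.1-8b-GGUF` is an assumption based on Unsloth's usual naming, so check the actual upload name before using it. The helper name `llama_cli_hf_command` is illustrative; the sampling flags mirror the recommended settings from the usage guide.

```python
import shlex
from typing import List


def llama_cli_hf_command(repo: str, quant: str = "UD-Q4_K_XL") -> List[str]:
    """Build a llama-cli command that pulls a GGUF from Hugging Face.

    The -hf argument takes "<repo>:<quant>", so changing the quantization
    type only changes the part after the colon.
    """
    return [
        "llama-cli",
        "-hf", f"{repo}:{quant}",
        "--temp", "0.0",
        "--top-p", "1.0",
        "--top-k", "0",
    ]


# Assumed repository id; replace it with the real Granite-4.1 upload name.
print(shlex.join(llama_cli_hf_command("unsloth/granite-4.1-8b-GGUF")))
print(shlex.join(llama_cli_hf_command("unsloth/granite-4.1-8b-GGUF", "Q8_0")))
```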
Or download the model via Hugging Face after installing huggingface_hub and hf_transfer.
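A minimal sketch of the Hugging Face download follows, assuming `huggingface_hub` and `hf_transfer` are installed (`pip install huggingface_hub hf_transfer`). The repository id is a guess at the upload name, and `download_kwargs` and `download_gguf` are hypothetical helper names. The filter keeps only the files for one quantization type so you do not download every quant in the repository.

```python
import os
from typing import Any, Dict, Optional


def download_kwargs(repo_id: str, quant: str = "UD-Q4_K_XL",
                    local_dir: Optional[str] = None) -> Dict[str, Any]:
    """Arguments for snapshot_download, restricted to one quantization."""
    return {
        "repo_id": repo_id,
        # Default to a folder named after the repository.
        "local_dir": local_dir or repo_id.split("/")[-1],
        # Only fetch files whose names contain the chosen quant.
        "allow_patterns": [f"*{quant}*"],
    }


def download_gguf(repo_id: str, quant: str = "UD-Q4_K_XL") -> str:
    """Download the selected GGUF files and return the local folder path."""
    # Set before importing huggingface_hub so the faster transfer path is used.
    os.environ.setdefault("HF_HUB_ENABLE_HF_TRANSFER", "1")
    from huggingface_hub import snapshot_download  # imported lazily
    return snapshot_download(**download_kwargs(repo_id, quant))


# Assumed repository id; nothing is downloaded until download_gguf is called.
print(download_kwargs("unsloth/granite-4.1-8b-GGUF"))
```

Calling `download_gguf("unsloth/granite-4.1-8b-GGUF")` would then start the actual download, provided that repository name is correct.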
Run Unsloth's Flappy Bird test.
Edit --threads 32 for the number of CPU threads, --ctx-size 16384 for context length, and --n-gpu-layers 99 for GPU offloading. Try adjusting GPU layers if your GPU goes out of memory. Remove --n-gpu-layers if you are using CPU-only inference.
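The three hardware flags above can be sketched as a small command builder. The model path in the example is hypothetical, as is the helper name `llama_cli_run_command`; passing `n_gpu_layers=None` drops `--n-gpu-layers` entirely for CPU-only inference, as the text suggests.

```python
import shlex
from typing import List, Optional


def llama_cli_run_command(model_path: str,
                          threads: int = 32,
                          ctx_size: int = 16384,
                          n_gpu_layers: Optional[int] = 99,
                          prompt: Optional[str] = None) -> List[str]:
    """Build a llama-cli command for a local GGUF file."""
    cmd = [
        "llama-cli",
        "--model", model_path,
        "--threads", str(threads),      # number of CPU threads
        "--ctx-size", str(ctx_size),    # context length in tokens
    ]
    if n_gpu_layers is not None:
        # Lower this value if the GPU runs out of memory.
        cmd += ["--n-gpu-layers", str(n_gpu_layers)]
    if prompt is not None:
        cmd += ["--prompt", prompt]
    return cmd


# Hypothetical local path; point it at the file you actually downloaded.
model = "granite-4.1-8b-GGUF/granite-4.1-8b-UD-Q4_K_XL.gguf"
print(shlex.join(llama_cli_run_command(model)))                     # GPU offload
print(shlex.join(llama_cli_run_command(model, n_gpu_layers=None)))  # CPU only
```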
For conversation mode:
Fine-tuning Granite-4.1 in Unsloth
Unsloth supports Granite-4.1 models including 3B, 8B and 30B for fine-tuning. Training is 2x faster, uses less VRAM and supports longer context lengths. Granite-4.1-3B and Granite-4.1-8B are the best starting points for local fine-tuning, while Granite-4.1-30B is the strongest model for higher-accuracy enterprise workflows.
Granite-4.0 free fine-tuning notebook (change model name to Granite-4.1)
This notebook trains a model to become a support agent that understands customer interactions, complete with analysis and recommendations. This setup allows you to train a bot that provides real-time assistance to support agents. We also show you how to train a model using data stored in a Google Sheet.
Unsloth config for Granite-4.1
If you have an old version of Unsloth and/or are fine-tuning locally, install the latest version of Unsloth:
To force reinstall the latest Unsloth and Unsloth Zoo:
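Both install variants can be sketched from Python by building the `pip` command for the current interpreter. The package names `unsloth` and `unsloth_zoo` are the published ones; the helper name `pip_upgrade_command` and the exact flag combination are assumptions for illustration, not the official install line.

```python
import shlex
import subprocess
import sys
from typing import List


def pip_upgrade_command(force_reinstall: bool = False) -> List[str]:
    """Build a pip command that upgrades Unsloth and Unsloth Zoo.

    With force_reinstall=True, pip reinstalls the packages even if they
    are already current and skips its local cache.
    """
    cmd = [sys.executable, "-m", "pip", "install", "--upgrade"]
    if force_reinstall:
        cmd += ["--force-reinstall", "--no-cache-dir"]
    return cmd + ["unsloth", "unsloth_zoo"]


def run_pip(cmd: List[str]) -> None:
    """Run the command and raise if pip reports an error."""
    subprocess.check_call(cmd)


# Show both variants without installing anything.
print(shlex.join(pip_upgrade_command()))
print(shlex.join(pip_upgrade_command(force_reinstall=True)))
```

Using `sys.executable -m pip` installs into the same environment that is running the script, which avoids upgrading a different Python installation by mistake.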
You can change the model name to any Granite-4.1 model:
For the 30B model, use a larger GPU or a multi-GPU setup. If you run out of memory, reduce max_seq_length or use more aggressive quantization (for example 4-bit loading).
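The model-name switch and the memory settings mentioned above can be sketched as follows. The repository ids in `GRANITE_4_1_MODELS` are assumptions based on Unsloth's usual naming, so verify them against the actual uploads. `granite_load_kwargs` and `load_granite` are hypothetical helpers, and the import of `unsloth` is deferred so the block can be read and checked on a machine without a GPU.

```python
from typing import Any, Dict

# Assumed repository ids; verify against the real Granite-4.1 uploads.
GRANITE_4_1_MODELS = {
    "3b": "unsloth/granite-4.1-3b",
    "8b": "unsloth/granite-4.1-8b",
    "30b": "unsloth/granite-4.1-30b",
}


def granite_load_kwargs(size: str = "8b",
                        max_seq_length: int = 2048,
                        load_in_4bit: bool = True) -> Dict[str, Any]:
    """Arguments for FastLanguageModel.from_pretrained.

    Lower max_seq_length or keep load_in_4bit=True to reduce memory use,
    which matters most for the 30B model.
    """
    return {
        "model_name": GRANITE_4_1_MODELS[size],
        "max_seq_length": max_seq_length,
        "load_in_4bit": load_in_4bit,
    }


def load_granite(size: str = "8b", **overrides: Any):
    """Load the model and tokenizer with Unsloth (requires a GPU setup)."""
    from unsloth import FastLanguageModel  # imported lazily
    kwargs = {**granite_load_kwargs(size), **overrides}
    return FastLanguageModel.from_pretrained(**kwargs)


print(granite_load_kwargs("30b", max_seq_length=1024))
```

In a training script, `model, tokenizer = load_granite("8b")` would replace the notebook's model-loading cell, assuming the repository id is correct.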