IBM Granite 4.1 - How to Run Locally

Run IBM Granite-4.1 with Unsloth GGUFs and learn how to fine-tune it!

IBM has released the Granite-4.1 models in three sizes: 3B, 8B and 30B. Granite-4.1 is a long-context dense model family built for instruction following, tool calling, chat, RAG and coding use cases. The models are highly competitive for their sizes and were trained on 15T tokens.

Learn how to run Unsloth Granite-4.1 Dynamic GGUFs, or how to fine-tune the model or train it with reinforcement learning (RL). You can fine-tune Granite-4.1 with our free notebook for a support agent use case.

Granite-4.1 model family:

  • Granite-4.1-3B Dense: Lightweight and efficient for local, edge and high-volume tasks. Great for quick classification, extraction, simple RAG, function calling and fine-tuning on smaller GPUs.

  • Granite-4.1-8B Dense: A balanced model for local assistants, RAG, coding, multilingual chat and tool-use workflows. This is a great default pick if you want stronger quality while keeping memory use practical.

  • Granite-4.1-30B Dense: The strongest Granite-4.1 model. Best for more demanding enterprise assistants, long-context tasks, complex RAG, coding, multilingual workflows and agentic tool-calling use cases.

⚙️ Usage Guide

Use these settings for deterministic, instruction-following responses:

temperature=0.0, top_p=1.0, top_k=0

  • Temperature = 0.0

  • Top_K = 0

  • Top_P = 1.0

  • Recommended minimum context: 16,384 tokens

  • Maximum context length: 131,072 tokens
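To see why these settings give deterministic output, here is a minimal sketch of a token sampler that follows the usual conventions: a temperature of 0 means greedy (argmax) decoding, `top_k=0` disables the top-k limit, and `top_p=1.0` keeps the full distribution. The `sample_token` function and the example logits are hypothetical and purely illustrative; real inference engines implement sampling differently in detail.

```python
import math
import random


def sample_token(logits, temperature=0.0, top_k=0, top_p=1.0, rng=None):
    """Pick a token index from raw logits (illustrative sampler, not a real engine)."""
    # Rank token indices from most to least likely.
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)

    # temperature == 0 is treated as greedy decoding: always take the argmax.
    if temperature <= 0.0:
        return order[0]

    # top_k == 0 means "no top-k limit"; otherwise keep only the k best tokens.
    if top_k > 0:
        order = order[:top_k]

    # Softmax over the kept tokens, scaled by temperature.
    peak = logits[order[0]]
    weights = [math.exp((logits[i] - peak) / temperature) for i in order]
    total = sum(weights)
    probs = [w / total for w in weights]

    # top_p == 1.0 keeps everything; smaller values keep the shortest
    # prefix whose cumulative probability reaches top_p.
    kept_indices, kept_probs, cumulative = [], [], 0.0
    for idx, p in zip(order, probs):
        kept_indices.append(idx)
        kept_probs.append(p)
        cumulative += p
        if cumulative >= top_p:
            break

    rng = rng or random.Random()
    return rng.choices(kept_indices, weights=kept_probs, k=1)[0]


logits = [1.2, 3.4, 0.5, 2.9]  # made-up scores for four candidate tokens
greedy_picks = {sample_token(logits) for _ in range(100)}
print(greedy_picks)  # the same index every time with the default settings
```

With the defaults, the temperature check returns before `top_k` and `top_p` are ever consulted, so setting them to 0 and 1.0 simply makes it explicit that no extra filtering is applied.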

Unsloth Granite-4.1 uploads

Run Granite-4.1 Tutorials


🦥 Unsloth Studio Guide

For this tutorial, we will be using Unsloth Studio, our new web UI for running and training LLMs. With Unsloth Studio, you can run models locally on macOS, Windows and Linux with audio, image and text input.

1. Install Unsloth

macOS, Linux, WSL:

Windows PowerShell:

2. Set up Unsloth Studio (one time)

Setup automatically installs Node.js (via nvm), builds the frontend, installs all Python dependencies, and builds llama.cpp with CUDA support.

WSL users: you will be prompted for your sudo password to install build dependencies (cmake, git, libcurl4-openssl-dev).

3. Launch Unsloth

macOS, Linux, WSL:

Windows PowerShell:

Then open http://localhost:8888 in your browser.

4. Search for and download Granite 4.1

On first launch, create a password to secure your account; you will use it to sign in again later. Then go to the Studio Chat tab, search for Granite 4.1 in the search bar, and download your desired model and quant.

5. Run Granite 4.1

Inference parameters should be set automatically when using Unsloth Studio, but you can still change them manually. You can also edit the context length, chat template and other settings.

For more information, you can view our Unsloth Studio inference guide.

🦙 Llama.cpp Tutorial

  1. Obtain the latest llama.cpp. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference. For Apple Mac / Metal devices, set -DGGML_CUDA=OFF then continue as usual; Metal support is on by default.

  2. If you want to use llama.cpp directly to load models, you can do the below. UD-Q4_K_XL is the quantization type. You can also change it to other quantized versions like Q4_K_M, Q5_K_M, Q8_0 or BF16 full precision if available.

  3. OR download the model via Hugging Face after installing huggingface_hub and hf_transfer.

  4. Run Unsloth's Flappy Bird test.

     Edit --threads 32 for the number of CPU threads, --ctx-size 16384 for context length, and --n-gpu-layers 99 for GPU offloading. Try lowering the number of GPU layers if your GPU runs out of memory. Remove --n-gpu-layers if you are using CPU-only inference.

  5. For conversation mode:
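As a rough illustration of how the flags described above (--threads, --ctx-size, --n-gpu-layers) combine with the recommended sampling settings, the sketch below assembles a llama-cli argument list in Python. The helper name `build_llama_cli_args` and the GGUF file name are hypothetical, and the sketch assumes the `llama-cli` binary is on your PATH; adjust the path to wherever your build placed it.

```python
import shlex


def build_llama_cli_args(model_path, threads=32, ctx_size=16384,
                         n_gpu_layers=99, cpu_only=False):
    """Assemble a llama-cli command line (hypothetical helper, for illustration)."""
    args = [
        "llama-cli",                    # assumed to be on PATH
        "--model", model_path,          # path to a local GGUF file
        "--threads", str(threads),      # number of CPU threads
        "--ctx-size", str(ctx_size),    # context length in tokens
        "--temp", "0.0",                # recommended deterministic settings
        "--top-p", "1.0",
        "--top-k", "0",
    ]
    # Leave the GPU offload flag out entirely for CPU-only inference.
    if not cpu_only:
        args += ["--n-gpu-layers", str(n_gpu_layers)]
    return args


# Hypothetical file name; substitute the GGUF you actually downloaded.
args = build_llama_cli_args("granite-4.1-8b-UD-Q4_K_XL.gguf")
print(shlex.join(args))
```

The printed string can be pasted into a shell, or the list can be passed directly to `subprocess.run`. If the GPU runs out of memory, pass a smaller `n_gpu_layers` value.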

Fine-tuning Granite-4.1 in Unsloth

Unsloth supports Granite-4.1 models including 3B, 8B and 30B for fine-tuning. Training is 2x faster, uses less VRAM and supports longer context lengths. Granite-4.1-3B and Granite-4.1-8B are the best starting points for local fine-tuning, while Granite-4.1-30B is the strongest model for higher-accuracy enterprise workflows.

Our free notebook trains a model to become a support agent that understands customer interactions, complete with analysis and recommendations. This setup lets you train a bot that provides real-time assistance to support agents. We also show you how to train a model using data stored in a Google Sheet.

Unsloth config for Granite-4.1

If you have an old version of Unsloth and/or are fine-tuning locally, install the latest version of Unsloth:

To force reinstall the latest Unsloth and Unsloth Zoo:

You can change the model name to any Granite-4.1 model:

For the 30B model, use a larger GPU or multi-GPU setup, and reduce max_seq_length or increase quantization if you run out of memory.
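To get a feel for why the 30B model needs more memory and why quantization helps, here is a back-of-the-envelope estimate of the memory taken by the weights alone. It is a simplification under stated assumptions: it uses the nominal parameter counts (3B, 8B, 30B), counts 1 GB as 1e9 bytes, and ignores activations, optimizer state, adapters and the KV cache, all of which add to real usage and grow with max_seq_length. The helper name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(params_billion, bits_per_weight):
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    # params_billion * 1e9 weights * (bits / 8) bytes each, divided by 1e9.
    return params_billion * bits_per_weight / 8


for size in (3, 8, 30):
    full = weight_memory_gb(size, 16)   # 16-bit weights (e.g. BF16)
    quant = weight_memory_gb(size, 4)   # 4-bit quantized weights
    print(f"{size}B: 16-bit ~{full:.1f} GB, 4-bit ~{quant:.1f} GB")
```

By this estimate the 30B weights take roughly 60 GB at 16-bit and roughly 15 GB at 4-bit, before any training overhead.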
