How to Run Local LLMs with Claude Code

A guide to using open models with Claude Code on your local device.

This step-by-step guide shows you how to connect open LLMs and APIs to Claude Code entirely locally, complete with screenshots. You can run any open model, such as Qwen3.6, DeepSeek, or Gemma.

For this tutorial, we'll use two open models, Gemma 4 and Qwen3.5, which are strong agentic and coding models that run on a device with 24GB of RAM or unified memory. For inference, we'll use Unsloth Studio and llama.cpp, which let you run and serve LLMs on macOS, Linux, and Windows. You can swap in any other model; just update the model names in your scripts.

For model quants, we use Unsloth Dynamic GGUFs, which let you run quantized LLMs while retaining as much accuracy as possible.

Claude Code Setup

Before setting up our local LLM, we need to install Claude Code. Claude Code is a terminal-based coding agent that understands your codebase and handles complex Git workflows using natural language.

Install Claude Code:

Paste into your terminal to install Claude Code:

curl -fsSL https://claude.ai/install.sh | bash

After installation, navigate to your project folder, then type claude into the shell to begin.

cd ~/projects/my-project 
claude

🕵️Fixing 90% slower inference in Claude Code

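Inference with local models can be about 90% slower in Claude Code while its attribution header is enabled. To fix this, edit ~/.claude/settings.json and set CLAUDE_CODE_ATTRIBUTION_HEADER to "0" inside the "env" section.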

Using export CLAUDE_CODE_ATTRIBUTION_HEADER=0 DOES NOT work!

For example, run cat > ~/.claude/settings.json, paste a JSON object whose "env" section contains "CLAUDE_CODE_ATTRIBUTION_HEADER" : "0", then press ENTER followed by CTRL+D to save it. Note that this overwrites the file. If you already have a ~/.claude/settings.json file, just add "CLAUDE_CODE_ATTRIBUTION_HEADER" : "0" to the "env" section and leave the rest of the settings file unchanged.
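A minimal sketch of that settings file, assuming you have no other settings to preserve. The SETTINGS_FILE variable is a hypothetical convenience introduced here so you can try the snippet on a different path first; it is not something Claude Code reads. The snippet refuses to overwrite an existing file.

```shell
# Target file. SETTINGS_FILE is a hypothetical override for dry runs.
SETTINGS_FILE="${SETTINGS_FILE:-$HOME/.claude/settings.json}"
mkdir -p "$(dirname "$SETTINGS_FILE")"

if [ -e "$SETTINGS_FILE" ]; then
  # Do not clobber existing settings; merge the key by hand instead.
  echo "Found $SETTINGS_FILE: add \"CLAUDE_CODE_ATTRIBUTION_HEADER\": \"0\" under \"env\" manually."
else
  # Write a settings file that contains only the env override.
  cat > "$SETTINGS_FILE" <<'EOF'
{
  "env": {
    "CLAUDE_CODE_ATTRIBUTION_HEADER": "0"
  }
}
EOF
  echo "Wrote $SETTINGS_FILE"
fi
```

Restart Claude Code afterwards so it picks up the new settings.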

📖 Quickstart Tutorials

Before we begin, we need to complete setup for the specific model you're going to use. We use Unsloth (a web UI) and llama.cpp, which are open-source frameworks for running and serving LLMs on Mac, Linux, and Windows devices.

Unsloth also has unique self-healing tool-calling and web search capabilities.

🦥 Unsloth Tutorial

For this tutorial, we will serve local models and connect them to Claude Code through Unsloth's UI. Unsloth works on Windows, WSL, Linux, and macOS.

See below for install instructions:

Example of Qwen3.6 2-bit running in Unsloth.

Step 1: Setup Unsloth

Open a terminal (for example, Terminal on a Mac), then install Unsloth by entering the command below.

Unsloth will start setting up the environment and installing the required packages as shown below. Type Y and press Enter when asked whether to allow Studio to start now. This starts Unsloth on local port 8888.

If you chose not to start Unsloth during installation, you can start it at any time with unsloth studio -p 8888. To make your Unsloth instance accessible to clients on other machines, add -H 0.0.0.0 to the unsloth studio command.
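As a quick sanity check, the sketch below tests whether anything is answering on the Studio port and prints the start commands from the text if not. STUDIO_PORT and STUDIO_URL are hypothetical helper variables used only in this snippet.

```shell
# Port taken from the text above; change it if you started Studio elsewhere.
STUDIO_PORT="${STUDIO_PORT:-8888}"
STUDIO_URL="http://127.0.0.1:${STUDIO_PORT}"

# curl exits non-zero when nothing is listening, so the else branch doubles as a hint.
if curl -s -o /dev/null --max-time 5 "$STUDIO_URL"; then
  echo "Unsloth is answering at $STUDIO_URL"
else
  echo "Nothing is answering at $STUDIO_URL. Start it with one of:"
  echo "  unsloth studio -p $STUDIO_PORT              (local only)"
  echo "  unsloth studio -p $STUDIO_PORT -H 0.0.0.0   (reachable from other machines)"
fi
```

Binding to 0.0.0.0 exposes the instance to your network, so only use it on networks you trust.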

Step 2: Start Unsloth

Open your browser and go to http://127.0.0.1:8888. If this is your first time running Unsloth, you will be taken to the Password page, where you need to create a new password. Afterwards, Unsloth opens on the Chat page as shown below.

Model Loading + API Guide

1. Select Model

Before using the API, load a model from the Select model dropdown in the top-left corner of the Chat page.

In this guide, we’ll use: unsloth/gemma-4-26B-A4B-it-GGUF with the recommended UD-Q4_K_XL quantization.

2. Test the Model

Before connecting a client, send a quick message in the chat:

This confirms that the model loaded correctly and is ready to respond.

3. Unsloth API key

In Studio, open Settings → API to view or create your API key.

Treat your API key like a password and avoid exposing it in screenshots or repositories.

⚙️ Connect Claude Code

Now that the local LLM is set up, we configure Claude Code to work with Unsloth or llama.cpp. We start by setting the following environment variables. These variables do not persist between sessions by default.

Config: Set the local API URL:

Copy your key from Unsloth Studio → Settings → API, then set it:

Optional: Use the name of the model currently loaded in Unsloth as a default.

The model name should match the model currently loaded in Unsloth Studio.
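The three settings above might look like the following sketch. ANTHROPIC_BASE_URL, ANTHROPIC_API_KEY, and ANTHROPIC_MODEL are environment variables that Claude Code reads; the values are assumptions. In particular, it assumes Studio serves its API on the same port as the UI (8888), and the key is a placeholder you must replace with your own.

```shell
# Assumption: the Studio API lives on the same host and port as the UI.
export ANTHROPIC_BASE_URL="http://127.0.0.1:8888"

# Placeholder: paste the key from Unsloth Studio -> Settings -> API.
export ANTHROPIC_API_KEY="<your-unsloth-api-key>"

# Optional default model; it should match the model loaded in Studio.
export ANTHROPIC_MODEL="gemma-4-26B-A4B-it-GGUF"

echo "Claude Code will talk to $ANTHROPIC_BASE_URL using model $ANTHROPIC_MODEL"
```

Because these are plain exports, they last only for the current shell session; add them to your shell profile if you want them to persist.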

Start Claude Code

Start Claude Code with the model that is currently loaded in Unsloth.

We will use gemma-4-26B-A4B-it-GGUF, but you can use any Unsloth-compatible model.
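A hedged sketch of the launch step, using Claude Code's --model flag. MODEL_NAME is a hypothetical helper variable; it falls back to the model named in this guide when ANTHROPIC_MODEL is not set.

```shell
# Use the default model from the environment if present, else the one from this guide.
MODEL_NAME="${ANTHROPIC_MODEL:-gemma-4-26B-A4B-it-GGUF}"

# Only launch when the claude binary is actually installed.
if command -v claude >/dev/null 2>&1; then
  claude --model "$MODEL_NAME"
else
  echo "claude is not on PATH; install Claude Code first."
fi
```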

Claude Code should open and display the selected model.

Try this prompt to research and rank high-quality SFT datasets.

After you submit the prompt, the agent will search the web, evaluate findings, and write the final report. This may take a few minutes.

Some workflows may require you to approve actions or answer follow-up prompts.

Once complete, the generated sft_report.md will look similar to this.

🦙 Llama.cpp Tutorial

Before we begin, we need to complete setup for the specific model you're going to use. We use llama.cpp, an open-source framework for running LLMs on Mac, Linux, Windows, and other devices. llama.cpp includes llama-server, which lets you serve and deploy LLMs efficiently. The model will be served on port 8001, with all agent tools routed through a single OpenAI-compatible endpoint.

Qwen3.5 Tutorial

We'll be using Qwen3.5-35B-A3B with specific settings for fast, accurate coding tasks. If you are short on VRAM or want a smarter model, Qwen3.5-27B is a great choice, though it will be ~2x slower than 35B-A3B. You can also use smaller Qwen3.5 variants such as 9B, 4B, or 2B, or Qwen3-Coder-Next, which is fantastic if you have enough VRAM.

1. Install llama.cpp

We need to install llama.cpp to serve local LLMs for use in Claude Code and similar tools. We follow the official build instructions for correct GPU bindings and maximum performance. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference. For Apple Mac / Metal devices, also set -DGGML_CUDA=OFF and continue as usual; Metal support is on by default.
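The flag choice described above can be sketched as follows. This is not the full official build recipe, just one way to pick the CUDA flag automatically; using nvidia-smi as the GPU test and the CUDA_FLAG variable name are assumptions of this sketch. It only builds when a llama.cpp checkout is present in the current directory.

```shell
# Pick the CUDA flag: ON when an NVIDIA driver tool is present, OFF otherwise
# (CPU-only machines and Macs, where Metal is enabled by default).
if command -v nvidia-smi >/dev/null 2>&1; then
  CUDA_FLAG="-DGGML_CUDA=ON"
else
  CUDA_FLAG="-DGGML_CUDA=OFF"
fi
echo "Using $CUDA_FLAG"

if [ -d llama.cpp ]; then
  # Configure, then build only the server binary.
  cmake llama.cpp -B llama.cpp/build -DBUILD_SHARED_LIBS=OFF "$CUDA_FLAG"
  cmake --build llama.cpp/build --config Release -j --target llama-server
else
  echo "No llama.cpp checkout here. Clone it first:"
  echo "  git clone https://github.com/ggml-org/llama.cpp"
fi
```

With this layout the server binary ends up in llama.cpp/build/bin/llama-server.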

2. Download and use models locally

Download the model via huggingface_hub in Python (after installing it with pip install huggingface_hub hf_transfer). We use the UD-Q4_K_XL quant for the best balance of size and accuracy. You can find all Unsloth GGUF uploads in our Collection here. If downloads get stuck, see Hugging Face Hub, XET debugging.
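As an alternative to the Python API, recent huggingface_hub releases also ship an hf command-line tool. The sketch below wraps it in a hypothetical helper, download_quant, that fetches only the files for one quantization. The repository id in the usage comment is an assumption; check the Unsloth collection for the exact name.

```shell
# download_quant REPO_ID QUANT
# Fetch only the files whose names contain QUANT, into a folder named after the repo.
download_quant() {
  repo_id="$1"
  quant="$2"
  hf download "$repo_id" \
    --include "*${quant}*" \
    --local-dir "$repo_id"
}

# Usage (large download; the repo id is an assumption):
#   download_quant unsloth/Qwen3.5-35B-A3B-GGUF UD-Q4_K_XL
```

Filtering with --include avoids pulling every quantization in the repository, which would take far more disk space than the single quant you need.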

3. Start llama-server

To deploy Qwen3.5 for agentic workloads, we use llama-server. We apply Qwen's recommended sampling parameters for thinking mode: temperature 0.6, top_p 0.95, top_k 20. Keep in mind that these values differ for non-thinking mode and for other tasks.

Run this command in a new terminal (for example, in tmux). The command below should fit on a 24GB GPU such as an RTX 4090 (it uses about 23GB). --fit on will also offload automatically, but if you see bad performance, reduce --ctx-size.

We used --cache-type-k q8_0 --cache-type-v q8_0 to quantize the KV cache and reduce VRAM usage. For full precision, use --cache-type-k bf16 --cache-type-v bf16. Note that a bf16 KV cache might be slightly slower on some machines.
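Putting the settings from this step together, a launch might look like the sketch below. The binary path, the GGUF file name, the alias, and the context size are assumptions of this sketch; adjust them to match your build and download. The server only starts if both the binary and the model file exist.

```shell
# Assumed locations; override via the environment if yours differ.
LLAMA_SERVER="${LLAMA_SERVER:-./llama.cpp/build/bin/llama-server}"
MODEL_PATH="${MODEL_PATH:-unsloth/Qwen3.5-35B-A3B-GGUF/Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf}"
PORT=8001
CTX_SIZE="${CTX_SIZE:-32768}"   # assumed value; lower it if performance is poor

# Build the argument list: thinking-mode sampling, q8_0 KV cache, auto offload.
set -- \
  --model "$MODEL_PATH" \
  --alias "unsloth/Qwen3.5-35B-A3B" \
  --port "$PORT" \
  --temp 0.6 --top-p 0.95 --top-k 20 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --fit on \
  --ctx-size "$CTX_SIZE"

if [ -x "$LLAMA_SERVER" ] && [ -f "$MODEL_PATH" ]; then
  "$LLAMA_SERVER" "$@"
else
  echo "Missing $LLAMA_SERVER or $MODEL_PATH; build llama.cpp and download the model first."
fi
```

The --alias value is the model name clients will see, so whatever you pass to Claude Code later should match it.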

Start Claude Code with llama-server

Navigate to your project folder (mkdir project ; cd project) and run:
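A hedged sketch of this step, assuming llama-server is listening on port 8001 and that your llama.cpp build exposes an endpoint Claude Code can talk to at the server root. The API key is a dummy placeholder on the assumption that the server was started without key checking, and MODEL_ALIAS is a hypothetical variable that should match the name the server reports for the model.

```shell
# Point Claude Code at the local server instead of the hosted API.
export ANTHROPIC_BASE_URL="http://localhost:8001"
# Dummy value; assumes the local server does not check keys.
export ANTHROPIC_API_KEY="sk-no-key-required"

# Assumed model name; use the alias your server was started with.
MODEL_ALIAS="unsloth/Qwen3.5-35B-A3B"

if command -v claude >/dev/null 2>&1; then
  claude --model "$MODEL_ALIAS"
else
  echo "claude is not on PATH; install Claude Code first."
fi
```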

To use Qwen3.6-35B-A3B, simply change it to:

To let Claude Code execute commands without any approvals, use the following (BEWARE: this lets Claude Code run any code it likes without asking for approval!):
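One way to do this is Claude Code's --dangerously-skip-permissions flag, sketched below. MODEL_ALIAS is a hypothetical variable with an assumed default; set it to the model name your server exposes. Consider running this only inside a throwaway directory, container, or VM.

```shell
# Assumed model name; override MODEL_ALIAS to match your server.
MODEL_ALIAS="${MODEL_ALIAS:-unsloth/Qwen3.5-35B-A3B}"

if command -v claude >/dev/null 2>&1; then
  # Skips every approval prompt: the agent can edit files and run commands freely.
  claude --model "$MODEL_ALIAS" --dangerously-skip-permissions
else
  echo "claude is not on PATH; install Claude Code first."
fi
```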

Try this prompt to install and run a simple Unsloth finetune:

After waiting a bit, Unsloth will be installed in a venv via uv, and loaded up:

and finally you will see a successfully finetuned model with Unsloth!
