How to Run Local LLMs with Claude Code
A guide to using open models with Claude Code on your local device.
This step-by-step guide shows you how to connect open LLMs and APIs to Claude Code entirely locally, complete with screenshots. Run using any open model like Qwen3.6, DeepSeek and Gemma.
For this tutorial, we'll use the open models Gemma 4 and Qwen3.5, which are strong agentic and coding models (they work on a device with 24GB of RAM or unified memory). For inference, we'll use Unsloth Studio and llama.cpp, which let you run and serve LLMs on macOS, Linux, and Windows. You can swap in any other model; just update the model names in your scripts.
For model quants, we use Unsloth Dynamic GGUFs, which let you run quantized LLMs while retaining as much accuracy as possible.
Claude Code Setup
Before setting up our local LLM, we need to install Claude Code. Claude Code is a terminal-based coding agent that understands your codebase and handles complex Git workflows using natural language.
🕵️Fixing 90% slower inference in Claude Code
Recent versions of Claude Code prepend a Claude Code Attribution header to requests, which invalidates the KV cache and makes inference 90% slower with local models.
To solve this, edit ~/.claude/settings.json to include CLAUDE_CODE_ATTRIBUTION_HEADER and set it to 0 within "env".
Using export CLAUDE_CODE_ATTRIBUTION_HEADER=0 DOES NOT work!
For example, run cat > ~/.claude/settings.json, paste a JSON object whose "env" section sets "CLAUDE_CODE_ATTRIBUTION_HEADER" to "0", then press ENTER followed by CTRL+D to save it. Note that cat > overwrites the file, so if you already have a ~/.claude/settings.json, instead add "CLAUDE_CODE_ATTRIBUTION_HEADER" : "0" to its "env" section and leave the rest of the settings file unchanged.
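The edit described above can also be scripted so that an existing settings file is merged rather than overwritten. This is a minimal sketch, not an official tool: the function name set_attribution_header is hypothetical, and it assumes the settings file is plain JSON with an optional "env" object, as described in this section.

```python
import json
from pathlib import Path


def set_attribution_header(settings_path: Path) -> dict:
    """Set CLAUDE_CODE_ATTRIBUTION_HEADER to "0" inside the "env" section.

    Every other key already present in the file is left unchanged.
    (Hypothetical helper, written for illustration.)
    """
    settings = {}
    if settings_path.exists():
        text = settings_path.read_text(encoding="utf-8")
        if text.strip():
            settings = json.loads(text)

    # Create the "env" section if it is missing, then set the one key.
    env = settings.setdefault("env", {})
    env["CLAUDE_CODE_ATTRIBUTION_HEADER"] = "0"

    settings_path.parent.mkdir(parents=True, exist_ok=True)
    settings_path.write_text(json.dumps(settings, indent=2) + "\n", encoding="utf-8")
    return settings
```

To apply it to the real file, call set_attribution_header(Path.home() / ".claude" / "settings.json"). The value is written as the string "0", matching the form given above.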
📖 Quickstart Tutorials
Before we begin, we first need to complete setup for the specific model you're going to use. We use Unsloth (a web UI) and llama.cpp, which are open-source frameworks for running and serving LLMs on Mac, Linux, and Windows devices.
Unsloth also has unique self-healing tool-calling and web search capabilities. See right for Claude Code connected to Unsloth:

🦥 Unsloth Tutorial
For this tutorial, we will serve local models and connect them to Claude Code through a UI by using Unsloth. Unsloth works on Windows, WSL, Linux and macOS.
Search, download, run GGUFs and safetensors models
Self-healing tool calling + web search
Code execution (Python, Bash)
Automatic inference parameter selection (temp, top-p, etc.)
Fast CPU + GPU inference via llama.cpp
Train LLMs 2x faster with 70% less VRAM
See below for install instructions:

Step 1: Set up Unsloth (macOS)
Open Terminal on your Mac, then install Unsloth by entering the command below.
Unsloth will start setting up the environment and installing the required packages as shown below. Type Y and press Enter when asked if you want to allow Studio to start now. This will start Unsloth on local port 8888.

If you chose not to start Unsloth during the installation process, you can always start the Unsloth app using unsloth studio -p 8888. If you would like your Unsloth instance to be accessible by clients outside of your computer, add -H 0.0.0.0 to the unsloth studio command.
Step 2: Start Unsloth (macOS)
Open your browser of choice and type http://127.0.0.1:8888 in the URL box. If this is your first time installing Unsloth, you will be forwarded to the Password page, where you need to create a new password. Afterwards, Unsloth opens on the Chat page as shown below.

Step 1: Set up Unsloth (Windows)
Open the Start Menu, search for PowerShell, and launch it. Copy and enter the install command:
Installation begins automatically. After it finishes, PowerShell will ask if you want to start Unsloth Studio.

You can also launch it later with the unsloth studio -p 8888 command.
If you would like your instance to be accessible by clients outside of your computer, add -H 0.0.0.0 to the unsloth studio command.
Step 2: Start Unsloth (Windows)
Open http://127.0.0.1:8888 in your browser. On first launch, create a new password to continue to the Chat page. Unsloth Studio is now installed and ready to use.

Step 1: Set up Unsloth (Linux / WSL)
On Linux, open your terminal application. You can launch it by pressing Ctrl + Alt + T, or by searching for Terminal in your system's application menu.
On WSL, click the Windows Start Menu, type the name of your installed distro (e.g. Ubuntu), then open it.
On WSL, also make sure your NVIDIA drivers are installed on Windows (not inside WSL) and that the CUDA toolkit is installed inside your WSL distro. See the System Requirements below for details.
To install, copy and run the install command:
Then:
Click inside the terminal window
Paste the command with Ctrl + Shift + V
Press Enter
Unsloth will start setting up the environment and installing the required packages as shown below. Type Y and press Enter when asked if you want to allow Studio to start now. This will start Unsloth on local port 8888.

If you chose not to start Unsloth during the installation process, you can always start the Unsloth app using unsloth studio -p 8888. If you would like your Unsloth instance to be accessible by clients outside of your computer, add -H 0.0.0.0 to the unsloth studio command.
Step 2: Start Unsloth (Linux / WSL)
Open your browser of choice and type http://127.0.0.1:8888 in the URL box. If this is your first time installing Unsloth, you will be forwarded to the Password page, where you need to create a new password. Afterwards, Unsloth opens on the Chat page as shown below.

Model Loading + API Guide
⚙️ Connect Claude Code
Now that the local LLM is set up, we configure Claude Code to work with Unsloth or llama.cpp. We start by setting the following environment variables. These variables do not persist between sessions by default.
Config: Set the local API URL:
Copy your key from Unsloth Studio → Settings → API, then set it:
Optional: Use the name of the model currently loaded in Unsloth as a default.
The model name should match the model currently loaded in Unsloth Studio.
Config: Set the local API URL in PowerShell:
Copy your key from Unsloth Studio → Settings → API, then set it:
Optional: Use the name of the model currently loaded in Unsloth as a default.
The model name should match the model currently loaded in Unsloth Studio.
Start Claude Code
Start Claude Code with the model that is currently loaded in Unsloth.
We will use gemma-4-26B-A4B-it-GGUF, but you can use any Unsloth-compatible model.
Claude Code should open and display the selected model.
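Because the environment variables above do not persist between sessions, one option is to set them only for the Claude Code process itself. The sketch below illustrates that idea under stated assumptions: ANTHROPIC_BASE_URL is the variable named in this guide, while ANTHROPIC_API_KEY as the variable that carries the Studio key is an assumption, and the helper names claude_env and launch_claude are hypothetical.

```python
import os
import subprocess


def claude_env(base_url, api_key, base_env=None):
    """Return a copy of the environment with the local endpoint settings added.

    ANTHROPIC_API_KEY is an assumption here; use whichever variable your
    setup expects for the key copied from Unsloth Studio.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["ANTHROPIC_BASE_URL"] = base_url
    env["ANTHROPIC_API_KEY"] = api_key
    return env


def launch_claude(model, base_url, api_key):
    """Start Claude Code with the given model for a single session.

    The variables exist only in the child process, so nothing needs to be
    unset afterwards in your shell.
    """
    return subprocess.run(
        ["claude", "--model", model],
        env=claude_env(base_url, api_key),
    )
```

For example, launch_claude("gemma-4-26B-A4B-it-GGUF", "http://127.0.0.1:8888", "<your key>") would start a session against the Studio instance from Step 2, assuming the API is served at that address and the model name matches the one loaded in Unsloth Studio.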

See Fixing 90% slower inference in Claude Code first to fix open models being 90% slower due to KV Cache invalidation.
Try this prompt to research and rank high-quality SFT datasets.
After you submit the prompt, the agent will search the web, evaluate findings, and write the final report. This may take a few minutes.

Some workflows may require you to approve actions or answer follow-up prompts.
Once complete, the generated sft_report.md will look similar to this.

If you see Unable to connect to API (ConnectionRefused), remember to clear the variable by running unset ANTHROPIC_BASE_URL.
If you find open models to be 90% slower, see Fixing 90% slower inference in Claude Code first to fix the KV cache being invalidated.
🦙 Llama.cpp Tutorial
Before we begin, we first need to complete setup for the specific model you're going to use. We use llama.cpp, which is an open-source framework for running LLMs on Mac, Linux, Windows and other devices. llama.cpp contains llama-server, which allows you to serve and deploy LLMs efficiently. The model will be served on port 8001, with all agent tools routed through a single OpenAI-compatible endpoint.
Qwen3.5 Tutorial
We'll be using Qwen3.5-35B-A3B with specific settings for fast, accurate coding tasks. If you want a smarter model or don't have enough VRAM, Qwen3.5-27B is a great choice, although it will be ~2x slower than 35B-A3B. You can also use other Qwen3.5 variants like 9B, 4B or 2B.
If you have enough VRAM, Qwen3-Coder-Next is also a fantastic option.
Install llama.cpp
We need to install llama.cpp to deploy and serve local LLMs for use in Claude Code and other tools. We follow the official build instructions for correct GPU bindings and maximum performance. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference. For Apple Mac / Metal devices, set -DGGML_CUDA=OFF then continue as usual; Metal support is on by default.
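As a rough illustration of how the -DGGML_CUDA switch fits into the build, the sketch below assembles the clone, configure, and build steps as argument lists. It is not the guide's own install script: the function names are hypothetical, the repository URL and the llama-server build target are taken from the upstream llama.cpp project, and system prerequisites (compiler, CMake, CUDA toolkit) are assumed to be installed already.

```python
import subprocess


def llama_cpp_build_commands(use_cuda=True):
    """Return the clone, configure, and build commands as argument lists.

    use_cuda=False corresponds to -DGGML_CUDA=OFF (CPU-only or Apple Metal).
    """
    cuda = "ON" if use_cuda else "OFF"
    return [
        ["git", "clone", "https://github.com/ggml-org/llama.cpp"],
        ["cmake", "llama.cpp", "-B", "llama.cpp/build", f"-DGGML_CUDA={cuda}"],
        ["cmake", "--build", "llama.cpp/build", "--config", "Release",
         "-j", "--target", "llama-server"],
    ]


def build_llama_cpp(use_cuda=True):
    """Run each step in order, stopping at the first failure."""
    for command in llama_cpp_build_commands(use_cuda):
        subprocess.run(command, check=True)
```

Calling build_llama_cpp(use_cuda=False) would perform a CPU or Metal build; with this layout the resulting binary is expected under llama.cpp/build/bin.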

Download and use models locally
Download the model via huggingface_hub in Python (after installing it with pip install huggingface_hub hf_transfer). We use the UD-Q4_K_XL quant for the best size/accuracy balance. You can find all Unsloth GGUF uploads in our Collection. If downloads get stuck, see Hugging Face Hub, XET debugging.
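A minimal sketch of such a download, assuming the GGUF files for a quant carry the quant name in their filenames so that a glob pattern selects them. The helper names and the choice of local directory are illustrative; snapshot_download and its repo_id, local_dir and allow_patterns parameters come from huggingface_hub.

```python
def quant_patterns(quant):
    """Glob patterns that select only the files for one quant.

    Assumes the quant name (e.g. UD-Q4_K_XL) appears in the filenames.
    """
    return [f"*{quant}*"]


def download_gguf(repo_id="unsloth/Qwen3.5-35B-A3B-GGUF",
                  quant="UD-Q4_K_XL",
                  local_dir=None):
    """Download one quant of a GGUF repo and return the local folder path."""
    # Imported lazily so this module still loads before the package is installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=repo_id,
        local_dir=local_dir or repo_id,  # illustrative default: mirror the repo name
        allow_patterns=quant_patterns(quant),
    )
```

Calling download_gguf() fetches the quant named above; pass a different repo_id or quant to try another model.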

We used unsloth/Qwen3.5-35B-A3B-GGUF, but you can use another variant like 27B or any other model like unsloth/Qwen3-Coder-Next-GGUF.

Start the Llama-server
To deploy Qwen3.5 for agentic workloads, we use llama-server. We apply Qwen's recommended sampling parameters for thinking mode: temp 0.6, top-p 0.95, top-k 20. Keep in mind these numbers change if you use non-thinking mode or other tasks.
Run this command in a new terminal (use tmux or open a new terminal). The command below should fit in a 24GB GPU such as an RTX 4090 (it uses 23GB). --fit on will also offload automatically, but if you see bad performance, reduce --ctx-size.
We used --cache-type-k q8_0 --cache-type-v q8_0 to quantize the KV cache and reduce VRAM usage. For full precision, use --cache-type-k bf16 --cache-type-v bf16. Note that a bf16 KV cache might be slightly slower on some machines.
You can also disable thinking for Qwen3.5, which can improve performance for agentic coding tasks. To disable thinking with llama.cpp, add this to the llama-server command:
--chat-template-kwargs "{\"enable_thinking\": false}"
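The settings discussed in this section can be collected into a single argument list, which also avoids hand-escaping the JSON passed to --chat-template-kwargs. This is a sketch rather than the exact command used in this guide: the function names are hypothetical, the model path, alias and context size are placeholders you must supply, and llama-server is assumed to be on your PATH.

```python
import json
import shlex


def llama_server_args(model_path, alias, ctx_size, port=8001,
                      kv_cache_type="q8_0", enable_thinking=True):
    """Build a llama-server command using the flags named in this section."""
    args = [
        "llama-server",
        "--model", model_path,
        "--alias", alias,
        "--port", str(port),
        "--ctx-size", str(ctx_size),
        # Sampling parameters quoted above for thinking mode.
        "--temp", "0.6",
        "--top-p", "0.95",
        "--top-k", "20",
        # KV cache quantization; pass kv_cache_type="bf16" for full precision.
        "--cache-type-k", kv_cache_type,
        "--cache-type-v", kv_cache_type,
        "--fit", "on",
    ]
    if not enable_thinking:
        # json.dumps produces the JSON text without manual escaping.
        args += ["--chat-template-kwargs", json.dumps({"enable_thinking": False})]
    return args


def as_shell(args):
    """Render the argument list as a copy-pasteable shell command."""
    return shlex.join(args)
```

For instance, print(as_shell(llama_server_args("path/to/model.gguf", "my-model", 32768, enable_thinking=False))) prints a command you can paste into a terminal; the path, alias and context size here are placeholders, not recommended values. The sampling values are the thinking-mode ones, so adjust them if you disable thinking.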

Start Claude Code with llama-server
We used unsloth/GLM-4.7-Flash-GGUF, but you can use anything like unsloth/Qwen3.6-27B-GGUF.
See Fixing 90% slower inference in Claude Code first to fix open models being 90% slower due to KV Cache invalidation.
Navigate to your project folder (mkdir project ; cd project) and run:
To use Qwen3.6-35B-A3B, simply change it to:

To let Claude Code execute commands without any approvals, run the following (BEWARE: this lets Claude Code run and execute code however it likes, without asking for approval!):
Try this prompt to install and run a simple Unsloth finetune:

After waiting a bit, Unsloth will be installed in a venv via uv, and loaded up:

Finally, you will see a successfully finetuned model with Unsloth!

If you see Unable to connect to API (ConnectionRefused), remember to clear the variable by running unset ANTHROPIC_BASE_URL.
If you find open models to be 90% slower, see Fixing 90% slower inference in Claude Code first to fix the KV cache being invalidated.
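When diagnosing a ConnectionRefused error, it can help to first check whether llama-server itself is reachable before looking at Claude Code. The sketch below queries the server's /health endpoint with the standard library only; the function name is hypothetical and the default address assumes the port 8001 setup described in this tutorial.

```python
import urllib.error
import urllib.request


def server_is_healthy(base_url="http://localhost:8001", timeout=3.0):
    """Return True if GET <base_url>/health answers with HTTP 200.

    Connection errors, timeouts and non-200 answers (for example while the
    model is still loading) all count as not healthy.
    """
    url = base_url.rstrip("/") + "/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False
```

If server_is_healthy() returns False, the server is probably not running or is listening on a different port; if it returns True, the remaining suspects are the ANTHROPIC_BASE_URL value and the model name passed to Claude Code.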





