# How to Run Local LLMs with Claude Code

This step-by-step guide shows you how to connect open LLMs and APIs to Claude Code entirely locally, complete with screenshots. You can run any open model, such as Qwen3.6, DeepSeek, or Gemma.

For this tutorial, we'll use the open models [Gemma 4](/docs/models/gemma-4.md) and [Qwen3.5](/docs/models/qwen3.5.md), which are strong agentic and coding models (both run on a device with 24GB of RAM/unified memory). For inference, we'll use [Unsloth Studio](https://github.com/unslothai/unsloth) and [`llama.cpp`](https://github.com/ggml-org/llama.cpp), which let you run and serve LLMs on macOS, Linux, and Windows. You can swap in [any other model](/docs/models/tutorials.md); just update the model names in your scripts.

<a href="/pages/w020xJgdCTBtTvfHtvye#claude-code-setup" class="button primary" data-icon="claude">Claude Code Setup</a><a href="/pages/w020xJgdCTBtTvfHtvye#quickstart-tutorials" class="button primary">📖 Setup Local Model Tutorial</a>

For model quants, we will use Unsloth [Dynamic GGUFs](/docs/basics/unsloth-dynamic-2.0-ggufs.md) to run quantized LLMs while retaining as much accuracy as possible.

## <i class="fa-claude">:claude:</i> Claude Code Setup

Before setting up our local LLM, we need to install Claude Code. Claude Code is a terminal-based coding agent that understands your codebase and handles complex Git workflows using natural language.

{% tabs %}
{% tab title="macOS, Linux, WSL" %}

#### **Install Claude Code:**

Paste into your terminal to install Claude Code:

```bash
curl -fsSL https://claude.ai/install.sh | bash
```

After installing, navigate to your project folder, then run `claude` in your shell to begin.

```bash
cd ~/projects/my-project 
claude
```

{% endtab %}

{% tab title="Windows" %}

#### **Install Claude Code:**

Paste into PowerShell to install Claude Code:

```powershell
irm https://claude.ai/install.ps1 | iex
```

After installing, navigate to your project folder, then run `claude` in PowerShell to begin.

<pre class="language-powershell"><code class="lang-powershell"><strong>cd /path/to/your/project
</strong>claude
</code></pre>

<div data-with-frame="true"><figure><img src="/files/07ztAhrJciHKIJHCKJCh" alt="" width="563"><figcaption></figcaption></figure></div>
{% endtab %}
{% endtabs %}

### :detective: Fixing 90% slower inference in Claude Code

{% hint style="warning" %}
Claude Code recently started prepending a Claude Code Attribution header to requests, which **invalidates the KV cache, making inference 90% slower with local models**.
{% endhint %}

To solve this, edit `~/.claude/settings.json` to include `CLAUDE_CODE_ATTRIBUTION_HEADER` and set it to `"0"` inside the `"env"` block.

{% hint style="info" %}
Using `export CLAUDE_CODE_ATTRIBUTION_HEADER=0` **DOES NOT** work!
{% endhint %}

For example, run `cat > ~/.claude/settings.json`, paste the JSON below, then press ENTER followed by CTRL+D to save it. If you already have a `~/.claude/settings.json` file, just add `"CLAUDE_CODE_ATTRIBUTION_HEADER": "0"` to the `"env"` section and leave the rest of the settings file unchanged.

<pre class="language-json"><code class="lang-json">{
  "promptSuggestionEnabled": false,
  "env": {
    "CLAUDE_CODE_ENABLE_TELEMETRY": "0",
    "CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC": "1",
    <a data-footnote-ref href="#user-content-fn-1">"CLAUDE_CODE_ATTRIBUTION_HEADER" : "0"</a>
  },
  "attribution": {
    "commit": "",
    "pr": ""
  },
  "plansDirectory" : "./plans",
  "prefersReducedMotion" : true,
  "terminalProgressBarEnabled" : false,
  "effortLevel" : "high"
}
</code></pre>
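
If you already have a settings file and prefer not to retype it, another option is to merge the key in from the command line; a minimal sketch, assuming `jq` is installed:

```bash
# Merge CLAUDE_CODE_ATTRIBUTION_HEADER="0" into an existing ~/.claude/settings.json.
jq '.env.CLAUDE_CODE_ATTRIBUTION_HEADER = "0"' ~/.claude/settings.json > /tmp/claude-settings.json \
  && mv /tmp/claude-settings.json ~/.claude/settings.json
```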

## 📖 Quickstart Tutorials

{% columns %}
{% column %}
Before we begin, we first need to complete setup for the specific model you're going to use. We use [Unsloth](/docs/new/studio.md) (a web UI) and llama.cpp, open-source frameworks for running and serving LLMs on your Mac, Linux, and Windows devices.

Unsloth also has unique self-healing [tool-calling](/docs/new/studio/chat.md#auto-healing-tool-calling) and [web search](/docs/new/studio/chat.md#code-execution) capabilities. See right for Claude Code connected to Unsloth:
{% endcolumn %}

{% column %}

<div data-with-frame="true"><figure><img src="/files/Z3eIk2YCloY1lJy73JHS" alt=""><figcaption></figcaption></figure></div>
{% endcolumn %}
{% endcolumns %}

<a href="/pages/w020xJgdCTBtTvfHtvye#connect-claude-code" class="button primary" data-icon="claude">Connect Claude Code</a><a href="/pages/w020xJgdCTBtTvfHtvye#unsloth-tutorial" class="button primary">🦥 Unsloth Tutorial</a><a href="/pages/w020xJgdCTBtTvfHtvye#llama.cpp-tutorial" class="button primary"> llama.cpp Tutorial</a>

## 🦥 Unsloth Tutorial

For this tutorial, we will serve and connect local models to Claude Code via a UI using [Unsloth](https://github.com/unslothai/unsloth). Unsloth works on Windows, WSL, Linux, and macOS.

{% columns %}
{% column %}

* Search, download, [run GGUFs](/docs/new/studio.md#run-models-locally) and safetensor models
* [**Self-healing** tool calling](/docs/new/studio.md#execute-code--heal-tool-calling) + **web search**
* [**Code execution**](/docs/new/studio.md#run-models-locally) (Python, Bash)
* [Automatic inference](/docs/new/studio.md#model-arena) parameter selection (temp, top-p, etc.)
* Fast CPU + GPU inference via llama.cpp
* [Train LLMs](/docs/new/studio.md#no-code-training) 2x faster with 70% less VRAM

See below for install instructions:
{% endcolumn %}

{% column %}

<div data-with-frame="true"><figure><img src="/files/WUtoS2GRLOCn4S5mOvve" alt=""><figcaption><p>Example of Qwen3.6 2-bit running in Unsloth.</p></figcaption></figure></div>
{% endcolumn %}
{% endcolumns %}

{% tabs %}
{% tab title="MacOS" %}

#### Step 1: Setup Unsloth

Launch the Terminal app on your Mac, then install Unsloth by entering the command below.

```bash
curl -fsSL https://unsloth.ai/install.sh | sh
```

Unsloth will start setting up the environment and installing the required packages as shown below. Type **Y** and press `Enter` when asked if you want to allow Studio to start now. This will start Unsloth on local port **8888**.

<figure><img src="/files/kAxiYilqsmP233htYNpi" alt="" width="375"><figcaption></figcaption></figure>

{% hint style="info" %}
If you chose not to start Unsloth during installation, you can always start it later with `unsloth studio -p 8888`. If you would like your Unsloth instance to be accessible from outside your PC, add `-H 0.0.0.0` to the `unsloth studio` command.
{% endhint %}

#### Step 2: Start Unsloth

Open your browser of choice and enter `http://127.0.0.1:8888` in the URL bar. If this is your first time installing Unsloth, you will be taken to the Password page, where you need to create a new password. Afterwards, Unsloth will open on the Chat page as shown below.

<figure><img src="/files/ryuI6lvessgKynLGfv1K" alt="" width="375"><figcaption></figcaption></figure>
{% endtab %}

{% tab title="Windows" %}

#### Step 1: Setup Unsloth

Open the Start Menu, search for `PowerShell`, and launch it. Copy & enter the install command:

```powershell
irm https://unsloth.ai/install.ps1 | iex
```

It will begin installing automatically. After installation finishes, PowerShell will ask if you want to start Unsloth Studio.

<figure><img src="/files/kAxiYilqsmP233htYNpi" alt="" width="375"><figcaption></figcaption></figure>

You can also launch it with the following command:

```powershell
unsloth studio -H 0.0.0.0 -p 8888
```

{% hint style="info" %}
If you would like your instance to be accessible from outside your PC, add `-H 0.0.0.0` to the `unsloth studio` command (as shown above).
{% endhint %}

#### Step 2: Start Unsloth

Open `http://127.0.0.1:8888` in your browser. On first launch, create a new password to continue to the Chat page. **Unsloth Studio** is now installed and ready to use.

<figure><img src="/files/ryuI6lvessgKynLGfv1K" alt="" width="375"><figcaption></figcaption></figure>
{% endtab %}

{% tab title="Linux, WSL" %}

#### Step 1: Setup Unsloth

{% tabs %}
{% tab title="Linux" %}
Open your terminal application. You can launch it by pressing `Ctrl + Alt + T`, or by searching for `Terminal` in your system's application menu.
{% endtab %}

{% tab title="WSL" %}
Click the Windows Start Menu, type the name of your installed distro (e.g. `Ubuntu`), then open it.

{% hint style="warning" %}
On **WSL**, make sure your **NVIDIA drivers** are installed on **Windows** (not inside WSL) and that the **CUDA toolkit** is installed inside your WSL distro. See the System Requirements below for details.
{% endhint %}
{% endtab %}
{% endtabs %}

To install, copy and run the install command:

```bash
curl -fsSL https://unsloth.ai/install.sh | sh
```

Then:

1. Click inside the terminal window
2. Paste the command with `Ctrl + Shift + V`
3. Press `Enter`

Unsloth will start setting up the environment and installing the required packages as shown below. Type **Y** and press `Enter` when asked if you want to allow Studio to start now. This will start Unsloth on local port **8888**.

<figure><img src="/files/uQP4sGPAd6C4MBSFdUTm" alt=""><figcaption></figcaption></figure>

{% hint style="info" %}
If you chose not to start Unsloth during installation, you can always start it later with `unsloth studio -p 8888`. If you would like your Unsloth instance to be accessible from outside your PC, add `-H 0.0.0.0` to the `unsloth studio` command.
{% endhint %}

#### Step 2: Start Unsloth

Open your browser of choice and enter `http://127.0.0.1:8888` in the URL bar. If this is your first time installing Unsloth, you will be taken to the Password page, where you need to create a new password. Afterwards, Unsloth will open on the Chat page as shown below.

<figure><img src="/files/CresfHYJ3aP1rTTlj3YF" alt="" width="375"><figcaption></figcaption></figure>
{% endtab %}
{% endtabs %}

### Model Loading + API Guide

{% stepper %}
{% step %}

#### Select Model

Before using the API, load a model from the **Select model** dropdown in the top-left corner of the Chat page.

<figure><img src="/files/qjw3MAiRKvLO7rcVlB9r" alt=""><figcaption></figcaption></figure>

In this guide, we’ll use: `unsloth/gemma-4-26B-A4B-it-GGUF` with the recommended `UD-Q4_K_XL` quantization.
{% endstep %}

{% step %}

#### Test the Model

Before connecting the client, send a quick test message:

<div data-with-frame="true"><figure><img src="/files/QXM0lfihazCXbxesdwDi" alt="" width="563"><figcaption></figcaption></figure></div>

{% hint style="info" %}
This confirms that the model loaded correctly and is ready to respond.
{% endhint %}
{% endstep %}

{% step %}

#### **Unsloth API key**

In Studio, open **Settings → API** to view or create your API key.

<figure><img src="/files/lnHH6JFk2bFBG8Nh96hf" alt=""><figcaption></figcaption></figure>

Treat your API key like a password and avoid exposing it in screenshots or repositories.
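
Optionally, you can sanity-check the key from a terminal. This is only a sketch: it assumes Studio serves an Anthropic-compatible `/v1/messages` endpoint on port 8888 (the endpoint Claude Code will call via `ANTHROPIC_BASE_URL` below) and that it accepts a bearer token; the exact route and header may differ.

```bash
# Hedged sketch: test the API key against the local endpoint Claude Code will use.
# Substitute your own key from Settings → API and the model you loaded.
curl -s http://localhost:8888/v1/messages \
  -H "Authorization: Bearer sk-unsloth-xxxxxxxxxxxx" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{"model": "gemma-4-26B-A4B-it-GGUF", "max_tokens": 64, "messages": [{"role": "user", "content": "Hello"}]}'
```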
{% endstep %}
{% endstepper %}

## ⚙️ Connect Claude Code

Now that we have set up the local LLM, we can configure Claude Code to work with Unsloth or llama.cpp. We start by setting the following environment variables. Note that these variables will not persist between sessions by default.

{% tabs %}
{% tab title="MacOS, Linux, WSL" %}
**Config:** Set the local API URL:

```bash
export ANTHROPIC_BASE_URL="http://localhost:8888"
```

Copy your key from Unsloth Studio → Settings → API, then set it:

```bash
export ANTHROPIC_AUTH_TOKEN="sk-unsloth-xxxxxxxxxxxx"
```

Optional: Use the name of the model currently loaded in Unsloth as a default.

```bash
export ANTHROPIC_MODEL="gemma-4-26B-A4B-it-GGUF"
```

The model name should match the model currently loaded in Unsloth Studio.
{% endtab %}

{% tab title="Windows" %}
**Config:** Set the local API URL in PowerShell:

```powershell
$env:ANTHROPIC_BASE_URL = "http://localhost:8888"
```

Copy your key from **Unsloth Studio → Settings → API**, then set it:

```powershell
$env:ANTHROPIC_AUTH_TOKEN = "sk-unsloth-xxxxxxxxxxxx"
```

**Optional:** Set the model currently loaded in Unsloth as the default.

```powershell
$env:ANTHROPIC_MODEL = "gemma-4-26B-A4B-it-GGUF"
```

{% hint style="info" %}
The model name should match the model currently loaded in Unsloth Studio.
{% endhint %}
{% endtab %}
{% endtabs %}
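
These exports only last for the current shell session. If you want them to persist, one option is to append them to your shell profile; a sketch for bash (use `~/.zshrc` on macOS, and substitute your own token and model name):

```bash
# Optional: persist the local Claude Code settings across terminal sessions.
cat >> ~/.bashrc << 'EOF'
export ANTHROPIC_BASE_URL="http://localhost:8888"
export ANTHROPIC_AUTH_TOKEN="sk-unsloth-xxxxxxxxxxxx"
export ANTHROPIC_MODEL="gemma-4-26B-A4B-it-GGUF"
EOF
```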

### Start Claude Code

Start Claude Code with the model that is currently loaded in Unsloth.

We will use `gemma-4-26B-A4B-it-GGUF`, but you can use any Unsloth-compatible model.

```shellscript
claude --model unsloth/gemma-4-26B-A4B-it-GGUF
```

Claude Code should open and display the selected model.

<figure><img src="/files/sXGuItodanagp8U9Oqn0" alt=""><figcaption></figcaption></figure>

{% hint style="warning" %}
See [#fixing-90-slower-inference-in-claude-code](#fixing-90-slower-inference-in-claude-code "mention") first to fix open models being 90% slower due to KV Cache invalidation.
{% endhint %}

Try this prompt to research and rank high-quality SFT datasets.

{% code overflow="wrap" %}

```
You can only work in project/. Do not search for CLAUDE.md — this is it. Use web search to find 10 real instruction/chat/SFT datasets on Hugging Face, briefly summarize your findings and explain why each dataset is relevant for SFT as you research, then create sft_report.md as a polished markdown report containing the rank, dataset name, creator, 3–5 relevant tags, a short plain-English summary, and why it is useful for SFT. Keep everything concise and readable with no giant metadata dumps, pasted raw descriptions, oversized tag lists, or unrelated datasets. Task is complete once sft_report.md contains 10 clean, well-written dataset entries, and finish with: “Successfully finetuned a model with Unsloth!”
```

{% endcode %}

After you submit the prompt, the agent will search the web, evaluate findings, and write the final report. This may take a few minutes.

<figure><img src="/files/qbFpKunX1xs70vOC6SSB" alt="" width="563"><figcaption></figcaption></figure>

{% hint style="info" %}
Some workflows may require you to approve actions or answer follow-up prompts.
{% endhint %}

Once complete, the generated `sft_report.md` will look similar to this.

<figure><img src="/files/qvb24Ck0dJ37z9nRANZp" alt="" width="375"><figcaption></figcaption></figure>

{% hint style="warning" %}
If you see `Unable to connect to API (ConnectionRefused)`, remember to unset `ANTHROPIC_BASE_URL` via `unset ANTHROPIC_BASE_URL`.

If you find open models to be 90% slower, [see here first](#fixing-90-slower-inference-in-claude-code) to fix KV cache being invalidated.
{% endhint %}

## 🦙 Llama.cpp Tutorial

Before we begin, we first need to complete setup for the specific model you're going to use. We use `llama.cpp`, an open-source framework for running LLMs on Mac, Linux, and Windows devices. llama.cpp includes `llama-server`, which lets you serve and deploy LLMs efficiently. The model will be served on port 8001, with all agent tools routed through a single OpenAI-compatible endpoint.

#### Qwen3.5 Tutorial

We'll be using [Qwen3.5](/docs/models/qwen3.5.md)-35B-A3B with specific settings for fast, accurate coding tasks. If you don't have enough VRAM or want a **smarter** model, **Qwen3.5-27B** is a great choice, though it will be \~2x slower; you can also use other Qwen3.5 variants like 9B, 4B, or 2B.

{% hint style="info" %}
Use Qwen3.5-27B if you want a **smarter** model or if you don't have enough VRAM. It will be \~2x slower than 35B-A3B however. Or you can use [**Qwen3-Coder-Next**](/docs/models/qwen3-coder-next.md) which is fantastic if you have enough VRAM.
{% endhint %}

{% stepper %}
{% step %}

#### Install llama.cpp

We need to install `llama.cpp` to deploy/serve local LLMs for use in Claude Code and other clients. We follow the official build instructions for correct GPU bindings and maximum performance. Change `-DGGML_CUDA=ON` to `-DGGML_CUDA=OFF` if you don't have a GPU or just want CPU inference. **For Apple Mac / Metal devices**, set `-DGGML_CUDA=OFF` then continue as usual; Metal support is on by default.

```bash
apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev git-all -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-mtmd-cli llama-server llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp
```
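
For reference, a CPU-only or Apple Metal build uses the same commands with the CUDA flag flipped, as noted above (on Apple silicon, Metal support is picked up automatically); a sketch:

```bash
# CPU-only / Apple Metal build: identical to the CUDA build, just with -DGGML_CUDA=OFF.
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=OFF
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-mtmd-cli llama-server llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp
```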

<figure><img src="/files/fxSckSyuT1ERATAqVSpm" alt="" width="563"><figcaption></figcaption></figure>
{% endstep %}

{% step %}

#### Download and use models locally

Download the model with the `hf` CLI, which ships with `huggingface_hub` (install via `pip install huggingface_hub hf_transfer`). We use the **UD-Q4\_K\_XL** quant for the best size/accuracy balance. You can find all Unsloth GGUF uploads in our [Collection here](/docs/get-started/unsloth-model-catalog.md). If downloads get stuck, see [Hugging Face Hub, XET debugging](/docs/basics/troubleshooting-and-faqs/hugging-face-hub-xet-debugging.md).

```bash
hf download unsloth/Qwen3.5-35B-A3B-GGUF \
    --local-dir unsloth/Qwen3.5-35B-A3B-GGUF \
    --include "*UD-Q4_K_XL*" # Use "*UD-Q2_K_XL*" for Dynamic 2bit
```

<figure><img src="/files/RxdMVSV6D87hD1PoIDrQ" alt=""><figcaption></figcaption></figure>

{% hint style="success" %}
We used `unsloth/Qwen3.5-35B-A3B-GGUF` , but you can use another variant like 27B or any other model like `unsloth/`[`Qwen3-Coder-Next`](/docs/models/qwen3-coder-next.md)`-GGUF`.
{% endhint %}

<figure><img src="/files/pJIOLpqznnfZi5l4yJeI" alt="" width="563"><figcaption></figcaption></figure>
{% endstep %}

{% step %}

#### Start the Llama-server

To deploy Qwen3.5 for agentic workloads, we use `llama-server`. We apply [Qwen's recommended sampling parameters](/docs/models/qwen3.5.md#recommended-settings) for thinking mode: `temp 0.6`, `top_p 0.95`, `top_k 20`. Keep in mind these numbers change if you use non-thinking mode or other tasks.

Run this command in a new terminal (or a `tmux` session). The setup below should **fit perfectly on a 24GB GPU such as an RTX 4090 (it uses about 23GB)**. `--fit on` will also auto-offload, but if you see bad performance, reduce `--ctx-size`.

{% hint style="info" %}
We used `--cache-type-k q8_0 --cache-type-v q8_0` for KV cache quantization to reduce VRAM usage. For full precision, use `--cache-type-k bf16 --cache-type-v bf16`. Note that a bf16 KV cache might be slightly slower on some machines.
{% endhint %}

```bash
./llama.cpp/llama-server \
    --model unsloth/Qwen3.5-35B-A3B-GGUF/Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf \
    --alias "unsloth/Qwen3.5-35B-A3B" \
    --temp 0.6 \
    --top-p 0.95 \
    --top-k 20 \
    --min-p 0.00 \
    --port 8001 \
    --kv-unified \
    --cache-type-k q8_0 --cache-type-v q8_0
```
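
Once the server is running, you can optionally sanity-check it from another terminal before connecting Claude Code. llama-server exposes an OpenAI-compatible API, so listing models should return the alias we set with `--alias`:

```bash
# Quick check that llama-server is up and serving on port 8001.
curl -s http://localhost:8001/v1/models
# The response should include the alias "unsloth/Qwen3.5-35B-A3B".
```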

{% hint style="success" %}
You can also disable thinking for Qwen3.5, which can improve performance for agentic coding tasks. To disable thinking with llama.cpp, add this to the llama-server command:

`--chat-template-kwargs "{\"enable_thinking\": false}"`

<img src="/files/YpzKOQF1D92X7IubHAuV" alt="" data-size="original">
{% endhint %}
{% endstep %}
{% endstepper %}

### Start Claude Code with llama-server

{% hint style="success" %}
For this example we used `unsloth/GLM-4.7-Flash-GGUF`, but you can use anything, such as the `unsloth/Qwen3.5-35B-A3B-GGUF` we set up above or `unsloth/Qwen3.6-27B-GGUF`.
{% endhint %}

{% hint style="warning" %}
See [#fixing-90-slower-inference-in-claude-code](#fixing-90-slower-inference-in-claude-code "mention") first to fix open models being 90% slower due to KV Cache invalidation.
{% endhint %}
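
One more detail before launching: if you set `ANTHROPIC_BASE_URL` in the Unsloth section above, it still points at port 8888, while `llama-server` from the previous step listens on port 8001. A minimal sketch to point Claude Code at the llama-server endpoint (the placeholder token is an assumption; `llama-server` only checks it if you started the server with `--api-key`, but Claude Code may still require the variable to be set):

```bash
# Point Claude Code at the llama-server endpoint started above (port 8001).
export ANTHROPIC_BASE_URL="http://localhost:8001"
# Placeholder token: llama-server ignores it unless launched with --api-key.
export ANTHROPIC_AUTH_TOKEN="sk-local-llama-cpp"
```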

Navigate to your project folder (`mkdir project ; cd project`) and run:

```bash
claude --model unsloth/GLM-4.7-Flash
```

To use Qwen3.5-35B-A3B (the model we served above), simply change it to:

```bash
claude --model unsloth/Qwen3.5-35B-A3B
```

<div data-with-frame="true"><figure><img src="/files/nAOVnL1pPAwojL1NsvJ6" alt="" width="563"><figcaption></figcaption></figure></div>

To let Claude Code execute commands without any approvals, run the following. **BEWARE: this will let Claude Code run and execute code however it likes, without any approvals!**

{% code overflow="wrap" %}

```bash
claude --model unsloth/GLM-4.7-Flash --dangerously-skip-permissions
```

{% endcode %}

Try this prompt to install and run a simple Unsloth finetune:

{% code overflow="wrap" %}

```
You can only work in the cwd project/. Do not search for CLAUDE.md - this is it. Install Unsloth via a virtual environment via uv. Use `python -m venv unsloth_env` then `source unsloth_env/bin/activate` if possible. See https://unsloth.ai/docs/get-started/install/pip-install on how (get it and read). Then do a simple Unsloth finetuning run described in https://github.com/unslothai/unsloth. You have access to 1 GPU.
```

{% endcode %}

<div data-with-frame="true"><figure><img src="/files/Brb1lUcwGnls4FE7c10L" alt="" width="563"><figcaption></figcaption></figure></div>

After waiting a bit, Unsloth will be installed in a venv via uv, and loaded up:

<div data-with-frame="true"><figure><img src="/files/Sqd1nyIuQNHfwgO2YdM3" alt="" width="563"><figcaption></figcaption></figure></div>

Finally, you will see a successfully finetuned model with Unsloth!

<div data-with-frame="true"><figure><img src="/files/ijszTl7Hinfu2TJieY3k" alt="" width="563"><figcaption></figcaption></figure></div>

{% hint style="warning" %}
If you see `Unable to connect to API (ConnectionRefused)`, remember to unset `ANTHROPIC_BASE_URL` via `unset ANTHROPIC_BASE_URL`.

If you find open models to be 90% slower, [see here first](#fixing-90-slower-inference-in-claude-code) to fix KV cache being invalidated.
{% endhint %}

[^1]: Must use this!

