# How to use Unsloth as an API endpoint

You can use **local LLMs** with tools like [Claude Code](/docs/basics/claude-code.md) and [Codex](/docs/basics/codex.md) by connecting them to Unsloth’s **OpenAI-compatible API endpoint**. This lets you run models like [Qwen](/docs/models/qwen3.6.md) and [Gemma](/docs/models/gemma-4.md) locally for agentic coding. Unsloth also provides features such as self-healing **tool calling**, **code execution**, and **web search**.

Unsloth makes it easy to deploy a fast API inference endpoint that provides:

* [**Self-healing tool calling**](/docs/new/studio/chat.md#auto-healing-tool-calling), which helps reduce broken or malformed tool calls by 50%.
* [**Code execution**](/docs/new/studio/chat.md#code-execution) support, allowing Bash and Python execution for more accurate code outputs.
* **Advanced** [**web search**](/docs/new/studio/chat.md#advanced-web-search) that visits and actually reads webpages to gather in-depth info.
* [**Automatic inference settings**](/docs/new/studio/chat.md#auto-parameter-tuning) for GGUF models (temperature, top-k, etc.).

{% columns %}
{% column %}
Models loaded in Unsloth (including GGUFs) are exposed as an **authenticated API** via `llama-server`. A long API key is generated for security, much like the keys OpenAI issues.

Your **local models** can then be used directly in your preferred AI agent, SDK, or chat client. Unsloth speaks two dialects on the same port. Both support streaming, tool calling (OpenAI `tools` / Anthropic `tools`), and vision inputs:
{% endcolumn %}

{% column %}

<figure><img src="/files/Z3eIk2YCloY1lJy73JHS" alt=""><figcaption></figcaption></figure>
{% endcolumn %}
{% endcolumns %}

* **Anthropic-compatible `/v1/messages`** for Claude Code, OpenClaw, the Anthropic SDK, and any client that expects the Messages API.
* **OpenAI-compatible `/v1/chat/completions`** and **`/v1/responses`** for the OpenAI SDK, OpenCode, Cursor, Continue, Cline, Open WebUI, SillyTavern, and any OpenAI-compatible tool.

### ⚡ Quickstart

1. **Install or update** [**Unsloth Studio**](/docs/new/studio.md)**.** Then launch Unsloth.
2. **Load a model.** Click **New Chat**, pick or search a model (GGUF), and wait for it to finish loading.
3. **Create an API key.** Click your **Unsloth** avatar in the bottom-left → **Settings** → **API** → type a key name → **Create**. Copy the `sk-unsloth-…` value that appears. Unsloth only shows it once.
4. **Point your client at Unsloth.** Use `http://localhost:PORT` as the base URL and your `sk-unsloth-…` key for auth. Jump to the recipe for your tool below.
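
To confirm everything is wired up, send a minimal request. A sketch, assuming Studio booted on port `8888`; the key and model ID below are placeholders, so substitute the values your instance shows:

```bash
# Minimal sanity check against the OpenAI-compatible endpoint.
curl http://localhost:8888/v1/chat/completions \
  -H "Authorization: Bearer sk-unsloth-xxxxxxxxxxxx" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-4-26B-A4B-it-GGUF",
    "messages": [{"role": "user", "content": "Say hello in one word."}]
  }'
```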

### 🔑 Creating an API key

1. Open the sidebar, click your **Unsloth** avatar at the bottom-left.
2. Go to **Settings** → **API** (🌐 globe icon).
3. Enter a friendly name (e.g. `claude-code-macbook`) and optionally set an expiry.
4. Click **Create**.
5. **Copy the key.** Unsloth stores only a hash and you won't be able to view it again.

<div data-with-frame="true"><figure><img src="/files/h74Myk0Gm7aygIC5vAsH" alt="" width="375"><figcaption></figcaption></figure></div>

All keys start with the `sk-unsloth-` prefix. Revoke a key from the same page at any time. Requests made with a revoked key will fail with `401 Unauthorized`.
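
A quick way to check whether a key is still valid is to call the models endpoint and look at the status code (assuming the default port):

```bash
# -i prints the status line: 200 for a valid key,
# 401 Unauthorized for a missing, wrong, or revoked key.
curl -i http://localhost:8888/v1/models \
  -H "Authorization: Bearer sk-unsloth-xxxxxxxxxxxx"
```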

{% hint style="warning" %}
Treat your API key like a password. Anyone with the key and network access to your Unsloth instance can send requests to your loaded model.
{% endhint %}

### 🖥️ Unsloth run command

You can load a model and have an API key created for you automatically using the `unsloth` CLI tool. When the model finishes loading, the endpoint URL and API key are printed to your console. Copy them into your client of choice and you're ready to go.

#### Before you start

Make sure you're on a recent version of [Unsloth Studio](/docs/new/studio.md), as earlier versions don't expose the external API.

#### The quick way

Open a terminal and load a GGUF model:

```bash
unsloth run --model unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_XL
```

That's it. This starts the server on the default port, loads the UI, and prints your endpoint URL and API key.

#### How the model name works

You can point at a model in a few different ways. Pick the one you find easiest:

```bash
# Combined: repo and quantization variant in one string (recommended — shortest)
unsloth run --model unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_XL

# Separate: repo and variant as two flags (the older style, still works)
unsloth run --model unsloth/gemma-4-26B-A4B-it-GGUF --gguf-variant UD-Q4_K_XL

# Using -hf / --hf-repo (matches llama.cpp's spelling, handy if you're coming from there)
unsloth run -hf unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_XL
```

#### Tuning the run (optional)

You don't need any of this for a basic load, but if you want more control, you can pass extra flags and they'll be forwarded straight to the underlying `llama-server`. Your values override Studio's defaults.

A few examples:

```bash
# Set a larger context window and pin threads
unsloth run --model unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_XL -c 131072 --threads 32

# Adjust sampling
unsloth run --model unsloth/Qwen3-1.7B-GGUF --top-k 20 --seed 42

# Use a custom chat template
unsloth run --model unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_XL \
  --chat-template-file /path/to/template.jinja
```

Most `llama-server` flags work: context size, GPU layers, sampling parameters, KV cache types, reasoning settings, and so on. See the [llama-server docs](https://github.com/ggml-org/llama.cpp/tree/master/tools/server) for the full list.

#### **Server-side tool policy**

`unsloth run` controls whether server-side tools (web search, code execution, etc.) are exposed by the inference server. Defaults are based on the bind address:

* **`127.0.0.1` (localhost)** — tools **on** by default. Only your machine can reach the server.
* **`0.0.0.0` or any non-loopback address** — tools **off** by default. A leaked API key on a network-exposed server means arbitrary code execution on the host.

**Flags:**

* `--enable-tools` / `--disable-tools` — force on or off. On `0.0.0.0`, `--enable-tools` shows a y/N security prompt.
* `--yes` / `-y` — skip the prompt (for automation).

The resolved policy is a process-level hard override — individual requests cannot bypass it via `enable_tools=true` in the request body.
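
For example (a sketch, assuming the bind address is set via a `--host` flag forwarded to `llama-server` as described above):

```bash
# Localhost bind: tools are on by default; turn them off explicitly.
unsloth run --model unsloth/Qwen3-1.7B-GGUF --disable-tools

# Network-exposed bind: tools are off by default. Forcing them on
# triggers the y/N security prompt; --yes answers it for automation.
unsloth run --model unsloth/Qwen3-1.7B-GGUF --host 0.0.0.0 --enable-tools --yes
```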

<div data-with-frame="true"><figure><img src="/files/MaID7j6ybUcV10UN2VhO" alt=""><figcaption></figcaption></figure></div>

### 🌐 **Endpoints**

Studio exposes these endpoints on whichever port it booted on (typically `http://localhost:8000` or `http://localhost:8888`):

| Endpoint                    | Compatible with             | Use it from                                                           |
| --------------------------- | --------------------------- | --------------------------------------------------------------------- |
| `POST /v1/messages`         | Anthropic Messages API      | Claude Code, Anthropic SDK, OpenClaw, anything that speaks Anthropic  |
| `POST /v1/chat/completions` | OpenAI Chat Completions API | OpenAI SDK, opencode, Cursor, Continue, Cline, Open WebUI, curl, etc. |
| `GET /v1/models`            | OpenAI models list          | List the models currently loaded in Unsloth                           |

Authenticate with an `Authorization: Bearer sk-unsloth-…` header on every request.
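
As an illustration, the same port also answers the Anthropic dialect. A sketch against `/v1/messages` (port, key, and model ID are placeholders; `max_tokens` is required by the Messages API format):

```bash
curl http://localhost:8888/v1/messages \
  -H "Authorization: Bearer sk-unsloth-xxxxxxxxxxxx" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-4-26B-A4B-it-GGUF",
    "max_tokens": 256,
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```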

{% hint style="info" %}
You don't need to run different servers for the two formats. Studio handles both on the same port.
{% endhint %}

### 🖇️ Connecting your client

Unsloth lets you run local LLMs with most frameworks, including [Claude Code](/docs/basics/claude-code.md), [Codex](/docs/basics/codex.md), [OpenClaw](/docs/integrations/openclaw.md), [OpenCode](/docs/integrations/opencode.md), and more. Click a tool below for its guide:

{% columns %}
{% column width="50%" %}
{% content-ref url="/pages/w020xJgdCTBtTvfHtvye" %}
[Claude Code](/docs/basics/claude-code.md)
{% endcontent-ref %}

{% content-ref url="/pages/PCjZ57h5pE0QccKyJMYD" %}
[OpenAI Codex](/docs/basics/codex.md)
{% endcontent-ref %}

{% content-ref url="/pages/1cwX0SOqPoqLx7fQ2sIS" %}
[Curl & HTTP](/docs/integrations/connect-curl-and-http-to-unsloth.md)
{% endcontent-ref %}
{% endcolumn %}

{% column width="50%" %}
{% content-ref url="/pages/CwQEpEmkKPmyEYdnEngt" %}
[OpenClaw](/docs/integrations/openclaw.md)
{% endcontent-ref %}

{% content-ref url="/pages/qaA8ZjTxsH2GTuBOHyra" %}
[OpenCode](/docs/integrations/opencode.md)
{% endcontent-ref %}

{% content-ref url="/pages/viZvzp58ObzZkXtCm0qv" %}
[Python SDK](/docs/integrations/connect-python-sdk-to-unsloth.md)
{% endcontent-ref %}
{% endcolumn %}
{% endcolumns %}

### 🧰 Tool calling

Both endpoints support function / tool calling in their native format, plus an Unsloth-specific shorthand for Studio's built-in tools.

**OpenAI-style tools:** send `tools` and `tool_choice` to `/v1/chat/completions` exactly as you would with OpenAI. Claude Code (via `/v1/messages`), opencode, Cursor, Continue, and Cline all work out of the box.

**Anthropic-style tools:** send `tools` (with `input_schema`) and `tool_choice` to `/v1/messages` exactly as you would with Claude.
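
To make the client-side shape concrete, here's a sketch of an OpenAI-style request with one tool. The `get_weather` function is purely hypothetical, and the port, key, and model ID are placeholders:

```bash
curl http://localhost:8888/v1/chat/completions \
  -H "Authorization: Bearer sk-unsloth-xxxxxxxxxxxx" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-4-26B-A4B-it-GGUF",
    "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }],
    "tool_choice": "auto"
  }'
```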

**Studio server side tools:** Studio can execute Python, web search, and bash *server-side* and stream the results back as `tool_result` events. Opt in by adding these extra fields to either endpoint:

```json
{
  "messages": [{"role": "user", "content": "What is 123 * 456? Use Python."}],
  "stream": true,
  "enable_tools": true,
  "enabled_tools": ["python", "web_search","terminal"],
  "session_id": "my-session"
}
```

The model sees each tool's output on its next turn. For deeper coverage (schemas, streaming events, chaining), see the [Studio chat docs](/docs/new/studio/chat.md).
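
Putting it together, a streaming request that opts into the server-side Python tool might look like this sketch (port, key, and model ID are placeholders; `-N` keeps curl from buffering the SSE stream):

```bash
curl -N http://localhost:8888/v1/chat/completions \
  -H "Authorization: Bearer sk-unsloth-xxxxxxxxxxxx" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-4-26B-A4B-it-GGUF",
    "messages": [{"role": "user", "content": "What is 123 * 456? Use Python."}],
    "stream": true,
    "enable_tools": true,
    "enabled_tools": ["python"],
    "session_id": "my-session"
  }'
```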

{% hint style="info" %}
If you're using the Anthropic `/v1/messages` endpoint, `tool_choice` maps cleanly: Anthropic `auto` → OpenAI `auto`, Anthropic `any` → OpenAI `required`, Anthropic `{type: "tool", name: "x"}` → OpenAI `{type: "function", function: {name: "x"}}`, Anthropic `none` → OpenAI `none`.
{% endhint %}

### ❔ Troubleshooting

**`401 Unauthorized`**: either the `Authorization` header is missing or the key is wrong. Keys must be passed as `Authorization: Bearer sk-unsloth-…`. If you lost the key, create a new one from **Settings → API**. Studio doesn't show old keys after creation.

**`Lost connection to the model server`**: Studio couldn't reach the underlying llama.cpp server. Usually the model finished loading but crashed, or the model tab was closed inside Studio. Reload the model from **New Chat** and retry.

**Claude Code shows the default Anthropic model, not my local one**: check that all three env vars are exported in the **same** shell where you run `claude`:

```bash
echo $ANTHROPIC_BASE_URL
echo $ANTHROPIC_AUTH_TOKEN
echo $ANTHROPIC_MODEL
```

Then run `/model` inside Claude Code to confirm. On Windows PowerShell use `$env:ANTHROPIC_BASE_URL` etc.

**`stream: true` returns a single JSON blob instead of SSE**: make sure you're hitting the right path (`/v1/messages` or `/v1/chat/completions`) and that your HTTP client is actually consuming the response as a stream, not buffering it.
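
With curl, for instance, `-N` disables output buffering so you can watch the SSE events arrive one by one (placeholders as usual):

```bash
curl -N http://localhost:8888/v1/chat/completions \
  -H "Authorization: Bearer sk-unsloth-xxxxxxxxxxxx" \
  -H "Content-Type: application/json" \
  -d '{"model": "gemma-4-26B-A4B-it-GGUF", "stream": true,
       "messages": [{"role": "user", "content": "Count to five."}]}'
```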

**I can't find the name of the model to add to opencode (or OpenClaw / any other client)**: ask Studio directly. `GET /v1/models` returns the exact model ID you need to plug into the client's "Model ID" field:

```bash
curl http://localhost:8888/v1/models \
  -H "Authorization: Bearer sk-unsloth-xxxxxxxxxxxx"
```

You'll get back a JSON payload of the form `{"data": [{"id": "gemma-4-26B-A4B-it-GGUF", ...}]}`. Copy the `id` value; that's the string opencode's **Model ID** field (left column) and OpenClaw's `models[].id` expect. The display name on the right is whatever you want users to see.
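
If you have `jq` installed, you can pull out just the IDs:

```bash
curl -s http://localhost:8888/v1/models \
  -H "Authorization: Bearer sk-unsloth-xxxxxxxxxxxx" | jq -r '.data[].id'
```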

**Tool calls aren't executed**: the model needs to support tool calling for client-side tools (`tools` / `tool_choice`). For Studio's built-in tools, remember to set `enable_tools: true` **and** list the ones you want in `enabled_tools` (e.g. `["python", "web_search"]`).

