# Connect vLLM to Unsloth for Local Chat Inference

Learn how to connect **vLLM to** [**Unsloth**](https://github.com/unslothai/unsloth) using vLLM’s **OpenAI-compatible API** so you can serve models and chat with them locally inside a open-source UI chat interface. This guide walks through installing vLLM, launching a local vLLM server, configuring the API base URL, loading available model IDs, and selecting your hosted vLLM model.

By the end, your vLLM-served models will appear alongside local models, giving you a fast and flexible way to run external LLM inference from a UI chat interface.

### Setup

{% stepper %}
{% step %}

### Install vLLM

Install vLLM first so you can run the vllm serve command.

Follow the official vLLM installation guide for your platform and hardware:

* [Install vLLM](https://docs.vllm.ai/en/stable/getting_started/installation/)

After installing, check that vLLM works in your terminal:

`vllm --help`
{% endstep %}

{% step %}

### Choose a model

vLLM serves models from Hugging Face.

For example, start a vLLM server with an Unsloth model:

```bash
vllm serve unsloth/gemma-4-26B-A4B-it 
\ --dtype auto
```

This exposes an API endpoint at:

`http://localhost:8000/v1`

To require an API key, add:

```bash
--api-key token-abc123
```

{% endstep %}

{% step %}

### Connect vLLM to Unsloth

Open **Settings → Connections**, then click **Add Provider**.

Select **vLLM**, then enter your server details.

<figure><img src="/files/BdMZR4w9uCCFMuVkKflR" alt=""><figcaption></figcaption></figure>

Enter your vLLM server details:

* **API key:** leave empty unless you started vLLM with --api-key
* **Base URL:** for example, <http://localhost:8000/v1>
* **Reasoning model:** enable this if the served model supports thinking
* **Model IDs:** click **Load Models**, or enter custom IDs manually

After you click **Add Provider**, the models you enabled will appear under **External** in the model dropdown.
{% endstep %}

{% step %}

### Ready to Chat

After saving the connection, your vLLM model will appear under **External** in the model dropdown. Select it to start chatting through your vLLM server.

<figure><img src="/files/ocXVg6g06AAJQrmgyze8" alt=""><figcaption></figcaption></figure>

{% hint style="info" %}
If your vLLM server is slow to respond (especially during model loading), you can adjust the timeout:

```bash
AIOHTTP_CLIENT_TIMEOUT_MODEL_LIST=30
```

{% endhint %}
{% endstep %}
{% endstepper %}

### Common vLLM arguments

The example above uses the core serving settings. You can add more vllm serve arguments depending on your model and hardware.

Common options include:

```bash
vllm serve unsloth/gemma-4-26B-A4B-it \
  --dtype auto \
  --host 0.0.0.0 \
  --port 8000 \
  --api-key token-abc123 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.9
```

For the full list of vLLM server arguments, see the official vLLM OpenAI-compatible server docs:

<https://docs.vllm.ai/en/stable/serving/openai_compatible_server/>


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://unsloth.ai/docs/integrations/connections/vllm.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
