For the complete documentation index, see llms.txt. This page is also available as Markdown.

Connect vLLM to Unsloth for Local Chat Inference

Learn how to connect vLLM to Unsloth using vLLM’s OpenAI-compatible API so you can serve models and chat with them locally inside a open-source UI chat interface. This guide walks through installing vLLM, launching a local vLLM server, configuring the API base URL, loading available model IDs, and selecting your hosted vLLM model.

By the end, your vLLM-served models will appear alongside local models, giving you a fast and flexible way to run external LLM inference from a UI chat interface.

Setup

1

Install vLLM

Install vLLM first so you can run the vllm serve command. Follow the official vLLM install guide for your platform and hardware.

After installing, check that vLLM works in your terminal: vllm --help

2

Choose a model

vLLM serves models from Hugging Face.

For example, start a vLLM server with an Unsloth model:

vllm serve unsloth/gemma-4-26B-A4B-it 
\ --dtype auto

This exposes an API endpoint at:

http://localhost:8000/v1

To require an API key, add:

--api-key token-abc123
3

Connect vLLM to Unsloth

Open Settings → Connections, then click Add Connection.

Select vLLM, then enter your server details.

Enter your vLLM server details:

  • API key: leave empty unless you started vLLM with --api-key

  • Base URL: for example, http://localhost:8000/v1

  • Reasoning model: enable this if the served model supports thinking

  • Model IDs: click Load Models, or enter custom IDs manually

After you click Add Connection, the models you enabled will appear under Connection in the model dropdown.

4

Ready to Chat

After saving the connection, your vLLM model will appear under Connected in the model dropdown. Select it to start chatting through your vLLM server.

If your vLLM server is slow to respond (especially during model loading), you can adjust the timeout:

AIOHTTP_CLIENT_TIMEOUT_MODEL_LIST=30

Common vLLM arguments

The example above uses the core serving settings. You can add more vllm serve arguments depending on your model and hardware.

Common options include:

vllm serve unsloth/gemma-4-26B-A4B-it \
  --dtype auto \
  --host 0.0.0.0 \
  --port 8000 \
  --api-key token-abc123 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.9

For the full list of vLLM server arguments, see the official vLLM OpenAI-compatible server docs.

Last updated

Was this helpful?