For the complete documentation index, see llms.txt. This page is also available as Markdown.

Connect llama.cpp to Unsloth: Run GGUFs with llama-server

Llama.cpp is an open-source inference engine for running GGUF models efficiently on local hardware, and Unsloth makes it easy to run those models directly into a open-source UI chat interface. By starting a local llama-server, you can serve a GGUF model from your machine or Hugging Face, connect it to Unsloth, and use it like any other external chat model.

This guide walks through installing llama.cpp, launching llama-server, connecting it to Unsloth, enabling your model, and configuring prompt caching, context length, API keys, FA, and chat templates.

Setup

1

Install llama.cpp

Install llama.cpp first so you can run the llama-server command.

Use one of the official install options:

After installing, check that llama-server works in your terminal:

llama-server --help

2

Choose a GGUF model

llama-server can load a local .gguf file or download a GGUF model from Hugging Face.

To serve a Hugging Face GGUF repo directly, use the repo and quant name:

llama-server -hf unsloth/Qwen3.6-27B-GGUF:UD-Q4_K_XL

If you wish to load a local model, you can also follow the steps below.

Start llama-server with the model you want to serve:

This exposes an API endpoint at: http://localhost:8080/v1

To require an API key, add:

3

Connect Llama.cpp to Unsloth

Open Settings → Connections, then click Add Connection. Select llama.cpp, then enter your server details:

if you did not start llama-server with --api-key, leave the API key field empty. Enter the base URL of your server, e.g. http://localhost:8080/v1 Click Load Models to fetch available model IDs, or enter model IDs manually if your server does not expose /models.

Then, after you click Add Connection, The models you enabled will now appear under Connected in the Select Model dropdown.

4

Ready to Chat

After saving the connection, your llama.cpp model will appear under Connection in the model dropdown. Select it to start chatting through you llama-server.

Prompt Caching

Prompt caching reduces latency and cost when requests reuse the same long prefix. Use the Prompt caching setting in the Unsloth side panel to control caching behaviour for supported connections.

With llama.cpp, prompt caching is enabled by default and can be disabled when starting llama-server with:

Common llama-server arguments

The example above only uses the required connection settings. You can add more llama-server arguments depending on your model and hardware.

Common options include:

For the full list of server arguments, see the official llama.cpp server README.

Last updated

Was this helpful?