How to Run Local LLMs with OpenAI Codex
Use open models with OpenAI Codex on your device locally.
This step-by-step guide shows you how to connect open LLMs and APIs to OpenAI Codex entirely locally, complete with screenshots. Codex only needs a local endpoint that speaks the OpenAI Responses API. Run using any open model like Qwen, DeepSeek, Gemma, and more.
For this tutorial, we'll use the open models Gemma 4 and Qwen3.5, which are strong agentic and coding models that run on a device with 24GB of RAM or unified memory. For inference, we'll use Unsloth Studio and llama.cpp, which let you run and serve LLMs on macOS, Linux, and Windows. You can swap in any other model; just update the model names in your scripts and Codex config.
For model quants, we'll use Unsloth Dynamic GGUFs so you can run quantized GGUF models while retaining as much accuracy as possible.
Codex has changed quite a lot since Jan 2026. It now uses the OpenAI Responses API exclusively, and Chat Completions support (wire_api = "chat") has been removed. Unsloth Studio supports both APIs, so we'll use wire_api = "responses" throughout this guide.
Setup Codex
Codex is OpenAI's official coding agent that runs locally. While designed for ChatGPT, it supports custom API endpoints, which makes it work for local LLMs. We'll later point it to Unsloth Studio's /v1/responses endpoint once Studio is up.
On Linux, run in your terminal:
sudo apt update
sudo apt install nodejs npm -y
npm install -g @openai/codex
On Windows, run in PowerShell:
winget install --id OpenAI.Codex
Prefer the Codex desktop app? Install it from the Microsoft Store:
winget install --id 9PLM9XGG6VKS --source msstore
You can also install it directly from the Microsoft Store app. The desktop app reads the same %USERPROFILE%\.codex\config.toml, so the provider config we set up later applies either way.
Prefer WSL? Open PowerShell as admin, run wsl --install, restart, then follow the Linux instructions above inside Ubuntu. You'll need a small networking trick to reach Studio on the Windows host - see the WSL hint in the Connect Codex section.
On macOS, run in your terminal:
brew install --cask codex
That's it for the install - don't run codex yet. Running it bare drops you into OpenAI's "Sign in with ChatGPT" picker, which is modal (quit with Ctrl+C if you land there). Once we wire up a local profile, codex -p unsloth_api or codex -p llama_cpp skips that screen entirely because custom providers default to requires_openai_auth = false. Start the local model server first, then launch Codex against it.
📖 Quickstart Tutorials
Before we begin, we first need to set up the specific model you're going to use. We use Unsloth (a web UI) and llama.cpp, which are open-source frameworks for running and serving LLMs on Mac, Linux, and Windows devices.
Unsloth also has unique self-healing tool-calling and web search capabilities. See right for Claude Code connected to Unsloth:

🦥 Unsloth Tutorial
For this tutorial, we will serve and connect local models to Codex through a UI by using Unsloth. Unsloth works on Windows, WSL, Linux, and macOS.
Search, download, and run GGUF and safetensors models
Self-healing tool calling + web search
Code execution (Python, Bash)
Automatic inference parameter tuning (temp, top-p, etc.)
Fast CPU + GPU inference via llama.cpp
Train LLMs 2x faster with 70% less VRAM
See below for install instructions:

Step 1: Setup Unsloth (macOS)
Open Terminal on your Mac, then install Unsloth by entering the install command.
Unsloth will start setting up the environment and installing the required packages as shown below. Type Y and press Enter when asked if you want to allow Studio to start now. This starts Unsloth on local port 8888.

If you chose not to start Unsloth during installation, you can always start it later with unsloth studio -p 8888. To make your Unsloth instance accessible to clients outside your computer, add -H 0.0.0.0 to the unsloth studio command.
Step 2: Start Unsloth (macOS)
Open your browser of choice and go to http://127.0.0.1:8888. If this is your first time installing Unsloth, you will be forwarded to the Password page, where you need to create a new password. After that, Unsloth opens on the Chat page as shown below.

Step 1: Setup Unsloth (Windows)
Open the Start Menu, search for PowerShell, and launch it. Copy and enter the install command:
It will begin installing automatically. After installation finishes, PowerShell will ask if you want to start Unsloth Studio.

You can also launch it later with unsloth studio -p 8888. To make your instance accessible to clients outside your computer, add -H 0.0.0.0 to the unsloth studio command.
Step 2: Start Unsloth (Windows)
Open http://127.0.0.1:8888 in your browser. On first launch, create a new password to continue to the Chat page. Unsloth Studio is now installed and ready to use.

Step 1: Setup Unsloth (Linux / WSL)
On Linux, open your terminal application. You can launch it by pressing Ctrl + Alt + T, or by searching for Terminal in your system's application menu.
On WSL, click the Windows Start Menu, type the name of your installed distro (e.g. Ubuntu), then open it.
On WSL, make sure your NVIDIA drivers are installed on Windows (not inside WSL) and that the CUDA toolkit is installed inside your WSL distro. See the System Requirements below for details.
To install, copy and run the install command:
Then:
Click inside the terminal window
Paste the command with Ctrl + Shift + V
Press Enter
Unsloth will start setting up the environment and installing the required packages as shown below. Type Y and press Enter when asked if you want to allow Studio to start now. This starts Unsloth on local port 8888.

If you chose not to start Unsloth during installation, you can always start it later with unsloth studio -p 8888. To make your Unsloth instance accessible to clients outside your computer, add -H 0.0.0.0 to the unsloth studio command.
Step 2: Start Unsloth (Linux / WSL)
Open your browser of choice and go to http://127.0.0.1:8888. If this is your first time installing Unsloth, you will be forwarded to the Password page, where you need to create a new password. After that, Unsloth opens on the Chat page as shown below.

Model Loading + API Guide
⚙️ Connect Codex
This section is the same whether you used Unsloth Studio, llama.cpp, or another OpenAI-compatible local server. Codex needs three values: the API key, the base URL, and the model name. The example below uses Unsloth Studio; for llama.cpp, use the same shape with the llama_cpp profile in the llama.cpp section.
Configure the Unsloth provider
Codex looks for ~/.codex/config.toml on macOS/Linux/WSL or %USERPROFILE%\.codex\config.toml on Windows. Create or edit it:
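As a rough sketch of what that file could contain, the helper below prints a provider block and a profile block built from the keys this guide describes. The display name, the port 8888 in base_url, and the model slug are assumptions for illustration; replace them with what your own server reports.

```shell
# Prints a Codex provider + profile for Unsloth Studio as TOML.
# Assumed values: the display name, port 8888, and the model slug.
print_unsloth_profile() {
  cat <<'EOF'
[model_providers.unsloth_api]
name = "Unsloth Studio"
base_url = "http://localhost:8888/v1"
env_key = "UNSLOTH_STUDIO_AUTH_TOKEN"
wire_api = "responses"
requires_openai_auth = false

[profiles.unsloth_api]
model_provider = "unsloth_api"
model = "unsloth/gemma-4-26B-A4B"
EOF
}

# Review the output first, then append it to your config, for example:
#   mkdir -p ~/.codex && print_unsloth_profile >> ~/.codex/config.toml
print_unsloth_profile
```

Printing to stdout instead of writing the file directly keeps an existing config.toml safe until you decide to append.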
This config creates a Codex profile called unsloth_api, points it at Studio, and tells Codex to read the API key from an environment variable named UNSLOTH_STUDIO_AUTH_TOKEN. You'll set the real key in the next step.
| Key | Meaning |
| --- | --- |
| base_url | Your local server endpoint + /v1 |
| env_key | Name of the env var Codex reads your API key from. This is not the key itself. |
| wire_api | responses. Codex now exclusively uses OpenAI's Responses API. |
| requires_openai_auth | false makes Codex skip the "Sign in with ChatGPT" screen for this provider. Default is already false, but be explicit. |
| model | The model ID your server exposes. Hit GET <base_url>/models to confirm the exact string. |
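To confirm the exact model string, a small helper along these lines could query <base_url>/models and print the IDs. This is a sketch: it assumes the server returns the usual OpenAI-style JSON with "id" fields and accepts the token as a Bearer header, and both function names are hypothetical.

```shell
# Pulls every "id" value out of an OpenAI-style /models JSON response on stdin.
extract_model_ids() {
  grep -o '"id"[[:space:]]*:[[:space:]]*"[^"]*"' | sed 's/.*"\([^"]*\)"$/\1/'
}

# Queries <base_url>/models. Sending the token as a Bearer header is an assumption.
list_model_ids() {
  base_url="${1:-http://localhost:8888/v1}"
  curl -fsS -H "Authorization: Bearer ${UNSLOTH_STUDIO_AUTH_TOKEN:-}" "$base_url/models" \
    | extract_model_ids
}

# Example (needs the server running): list_model_ids http://localhost:8888/v1
```

Copy the printed ID into the model field of your profile exactly as shown, including any organization prefix.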
OpenAI removed wire_api = "chat" support. Always use wire_api = "responses". If you set wire_api = "chat", Codex refuses to start with the error: `wire_api = "chat"` is no longer supported. How to fix: set `wire_api = "responses"` in your provider config.
You can add multiple profiles to the same file, one for each Unsloth model you swap between. Codex picks them up automatically.
Set the API key env var
Use the same env var name you wrote in env_key. In the Unsloth Studio example above, env_key = "UNSLOTH_STUDIO_AUTH_TOKEN", so set UNSLOTH_STUDIO_AUTH_TOKEN in the same terminal you will run Codex from:
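On macOS, Linux, or WSL, setting the variable might look like the sketch below. YOUR_TOKEN is a placeholder for the key your server gives you, not a real value.

```shell
# Placeholder value: replace YOUR_TOKEN with your real API key.
export UNSLOTH_STUDIO_AUTH_TOKEN="YOUR_TOKEN"

# Confirm the variable is set without printing the secret itself.
if [ -n "${UNSLOTH_STUDIO_AUTH_TOKEN:-}" ]; then
  echo "UNSLOTH_STUDIO_AUTH_TOKEN is set"
fi
```

In Windows PowerShell the equivalent is an $env:UNSLOTH_STUDIO_AUTH_TOKEN = "YOUR_TOKEN" line.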
If you renamed env_key, rename the variable in the commands too. For example, a llama.cpp profile that uses env_key = "LLAMA_CPP_API_KEY" needs LLAMA_CPP_API_KEY, not UNSLOTH_STUDIO_AUTH_TOKEN.
Session vs persistent: the commands above apply to the current terminal only. To persist:
macOS / Linux / WSL: add the export line to ~/.bashrc (bash) or ~/.zshrc (zsh).
Windows: run setx UNSLOTH_STUDIO_AUTH_TOKEN "YOUR_TOKEN" once, or add the $env: line to your PowerShell $PROFILE.
Running Codex inside WSL with Unsloth on Windows? WSL is a separate network namespace, so localhost from inside WSL doesn't reach Unsloth. Edit your config.toml to use the Windows host IP instead:
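One common way to find that IP from inside WSL is to read the default gateway, which under WSL2's default NAT networking is the Windows host. The sketch below assumes the ip tool is available in your distro; default_gateway is a hypothetical helper name.

```shell
# Extracts the gateway address from `ip route` output on stdin.
default_gateway() {
  awk '$1 == "default" { print $3; exit }'
}

# Under WSL2's default NAT networking, the default gateway is the Windows host.
if command -v ip >/dev/null 2>&1; then
  ip route show default | default_gateway
fi
```

The address can change between WSL restarts, so re-run this if connections start failing.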
Then set base_url = "http://<that-ip>:8888/v1". If you have WSL2 mirrored networking enabled (.wslconfig → networkingMode=mirrored), localhost works as on native Windows.
Launch Codex
On first launch in a new directory, Codex asks "Do you trust the contents of this directory?" - pick Yes, continue. This is the per-directory trust prompt, not the ChatGPT login (that one is skipped because of `requires_openai_auth = false`). Subsequent launches in the same directory skip this prompt.

Seeing "Model metadata for unsloth/gemma-4-26B-A4B not found. Defaulting to fallback metadata"? Codex ships with a built-in table of context windows, tool support, and input modalities for OpenAI's own models. For anything else, it falls back to safe defaults. The warning fires once per session for every non-OpenAI slug. Everything still works, so you can ignore it.
To fix it: add model_context_window = 131072 to the top of ~/.codex/config.toml so Codex uses Gemma 4's real 128K context instead of its fallback guess. For full control over tool support and input modalities too, point model_catalog_json inside [profiles.unsloth_api] at a JSON file containing a custom ModelInfo entry for your slug.
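The "top of the file" part matters because top-level TOML keys must appear before the first [table] header. A minimal sketch of a helper that prepends the key only when it is missing (set_context_window is a hypothetical name, and 131072 is the value from the text above):

```shell
# Adds model_context_window as a top-level key in a Codex config file.
# The line is prepended, not appended, so it stays above every [table] header.
set_context_window() {
  cfg="$1"
  tokens="${2:-131072}"
  touch "$cfg"
  if grep -q '^model_context_window' "$cfg"; then
    echo "model_context_window already set in $cfg"
    return 0
  fi
  tmp="$(mktemp)"
  { printf 'model_context_window = %s\n' "$tokens"; cat "$cfg"; } > "$tmp"
  # Write back in place so the original file permissions are kept.
  cat "$tmp" > "$cfg"
  rm -f "$tmp"
}

# Example: set_context_window ~/.codex/config.toml 131072
```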
The -p unsloth_api flag tells Codex to use the profile you just added. The model name appears in Codex's status bar.

Add --search to enable web search:
To bypass all approval prompts (warning: this lets Codex run commands and change files however it likes, without asking for approval):
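The launch variants described above can be wrapped in small shell functions. This is a sketch: the function names are hypothetical, and the bypass flag shown is the one I understand the current Codex CLI to use, so check codex --help on your version before relying on it.

```shell
# Thin wrappers around the launch variants; extra arguments are passed through.
# The profile name matches the [profiles.unsloth_api] block in config.toml.
codex_local() { codex -p unsloth_api "$@"; }

# Same profile with web search enabled.
codex_local_search() { codex -p unsloth_api --search "$@"; }

# Skips every approval prompt AND the sandbox: only for throwaway environments.
codex_local_unsafe() {
  codex -p unsloth_api --dangerously-bypass-approvals-and-sandbox "$@"
}
```

Start the model server first, then run codex_local from the project directory you want Codex to work in.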
Try a real task
Try this prompt to install and run a simple Unsloth finetune:
If we wait a little longer, we will see a model successfully fine-tuned with Unsloth!

Disconnect or revert
Launch Codex without -p unsloth_api and it'll use its default provider. Or delete the [profiles.unsloth_api] and [model_providers.unsloth_api] blocks from ~/.codex/config.toml.
You can leave Unsloth Studio running or shut it down. It doesn't intercept anything when stopped.
Troubleshooting
| Symptom | Cause | Fix |
| --- | --- | --- |
| Model metadata for ... not found | Non-OpenAI slug, no built-in metadata | Harmless warning. To silence the side-effects, set model_context_window = 131072 in ~/.codex/config.toml, or point |
| Codex says it's GPT | Codex injects an OpenAI-referencing system prompt; local models mirror it | Not a routing bug. Verify via Studio's activity panel. Override the system prompt to change self-report. |
| Connection refused | Studio isn't running or wrong port | Confirm Studio is up at http://localhost:8888; check base_url in config.toml |
| wire_api = "chat" is no longer supported | Legacy wire_api = "chat" in config | Switch to wire_api = "responses" |
| model not found | Model ID typo | GET http://localhost:8888/v1/models and copy the exact ID |
| OOM mid-generation | Context too large for VRAM | Reduce context in Studio Settings → Inference, or use a smaller quant |
| Codex shows "Sign in with ChatGPT" picker | Launched bare codex (no -p) | Quit (Ctrl+C), then re-launch with codex -p unsloth_api. Custom providers skip that |
| WSL: Connection refused to localhost | WSL network namespace | Use the Windows host IP in base_url, or enable WSL2 mirrored networking |
🦙 llama.cpp Tutorial
We can also use llama.cpp directly. We need to deploy llama-server, the server bundled with llama.cpp, an open-source framework for running and serving LLMs efficiently on Mac, Linux, and Windows devices. The model will be served on port 8001, with all agent tool calls routed through that single OpenAI-compatible endpoint.
The llama.cpp endpoint will be on port 8001 instead of 8888 (Unsloth Studio's default). Adjust your Codex base_url accordingly in ~/.codex/config.toml.
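A llama.cpp profile could follow the same shape as the Studio one. The sketch below prints such a profile; the display name and the model string are assumptions, and the model value should match whatever ID your llama-server reports (for example the value you pass to --alias).

```shell
# Prints a Codex provider + profile for a local llama-server on port 8001.
# Assumed values: the display name and the model string.
print_llama_cpp_profile() {
  cat <<'EOF'
[model_providers.llama_cpp]
name = "llama.cpp"
base_url = "http://localhost:8001/v1"
env_key = "LLAMA_CPP_API_KEY"
wire_api = "responses"
requires_openai_auth = false

[profiles.llama_cpp]
model_provider = "llama_cpp"
model = "unsloth/gemma-4-26B-A4B"
EOF
}

# Append after review, for example:
#   print_llama_cpp_profile >> ~/.codex/config.toml
print_llama_cpp_profile
```

Because the profile names an env_key, set LLAMA_CPP_API_KEY in your terminal before launching with codex -p llama_cpp, even if it is only a placeholder value.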
Install llama.cpp
We need to install llama.cpp to deploy/serve local LLMs to use in Codex. We follow the official build instructions for correct GPU bindings and maximum performance. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference. For Apple Mac / Metal devices, set -DGGML_CUDA=OFF then continue as usual - Metal support is on by default.
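A build along those lines might look like the sketch below. It assumes git, cmake, and a C++ toolchain are already installed, and it uses the presence of nvcc as a rough heuristic for picking the CUDA flag; both function names are hypothetical.

```shell
# Picks the CUDA flag: ON when the CUDA compiler is on PATH, OFF otherwise.
# This is only a heuristic; set the flag by hand if it guesses wrong.
cuda_flag() {
  if command -v nvcc >/dev/null 2>&1; then
    echo "-DGGML_CUDA=ON"
  else
    echo "-DGGML_CUDA=OFF"
  fi
}

# Clones llama.cpp and builds only the llama-server target.
build_llama_cpp() {
  git clone https://github.com/ggml-org/llama.cpp &&
    cmake llama.cpp -B llama.cpp/build -DBUILD_SHARED_LIBS=OFF "$(cuda_flag)" &&
    cmake --build llama.cpp/build --config Release -j --target llama-server
}

# Run the build with: build_llama_cpp
```

After a successful build, the binary should be at llama.cpp/build/bin/llama-server.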
Download and use models locally
Download the model via the hf CLI (pip install huggingface_hub hf_transfer). We use the UD-Q4_K_XL quant for the best size/accuracy balance. You can find all Unsloth GGUF uploads in our Collection. If downloads get stuck, see the Hugging Face Hub XET debugging guide (hugging-face-hub-xet-debugging.md).
Want vision support? Add --include "*mmproj-BF16*" to also pull the vision projector, then pass --mmproj unsloth/gemma-4-26B-A4B-it-GGUF/mmproj-BF16.gguf to llama-server. Codex itself is text-only, so this is optional.
We used unsloth/gemma-4-26B-A4B-it-GGUF, but you can use anything like unsloth/Qwen3.6-35B-A3B-GGUF - see Qwen3.6-35B-A3B.
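As a sketch, the download step could be wrapped like this. It assumes the hf CLI is installed and uses the repo name as the local directory, which matches the mmproj path mentioned above; download_model is a hypothetical helper name.

```shell
# Repo to pull from; swap in another Unsloth GGUF repo if you prefer.
REPO="unsloth/gemma-4-26B-A4B-it-GGUF"

# Downloads only the UD-Q4_K_XL files into a directory named after the repo.
download_model() {
  hf download "$REPO" --local-dir "$REPO" --include "*UD-Q4_K_XL*"
}

# Run the download with: download_model
```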
Start llama-server
To deploy Gemma-4-26B-A4B for agentic workloads, we use llama-server. We apply Google's recommended sampling parameters (temp 1.0, top_p 0.95, top_k 64) and enable --jinja for proper tool calling support.
Run this command in a new terminal (use tmux or open a new terminal). The below should fit comfortably in a 24GB GPU (RTX 4090) at ~18GB. --fit on will also auto offload, but if you see bad performance, reduce --ctx-size.
We used --cache-type-k q8_0 --cache-type-v q8_0 for KV cache quantization to reduce VRAM use. If you see reduced quality, use bf16 instead (--cache-type-k bf16 --cache-type-v bf16), but KV cache memory use roughly doubles.
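Putting the flags from this section together, a launch command might look like the sketch below. The context size, the alias, and the binary path are assumptions for illustration, and the GGUF file is located by pattern because its exact filename is not given here. MODEL_DIR and LLAMA_SERVER are hypothetical override variables.

```shell
# Starts llama-server on port 8001 with the sampling and KV cache settings above.
# Assumed values: --ctx-size, --alias, and the default binary path.
start_llama_server() {
  model_dir="${MODEL_DIR:-unsloth/gemma-4-26B-A4B-it-GGUF}"
  server="${LLAMA_SERVER:-./llama.cpp/build/bin/llama-server}"
  # Find the downloaded quant by pattern instead of hard-coding its filename.
  model_file="$(find "$model_dir" -name '*UD-Q4_K_XL*.gguf' | sort | head -n 1)"
  if [ -z "$model_file" ]; then
    echo "no UD-Q4_K_XL GGUF found in $model_dir" >&2
    return 1
  fi
  "$server" \
    --model "$model_file" \
    --alias "unsloth/gemma-4-26B-A4B" \
    --port 8001 \
    --ctx-size 65536 \
    --fit on \
    --temp 1.0 --top-p 0.95 --top-k 64 \
    --cache-type-k q8_0 --cache-type-v q8_0 \
    --jinja
}

# Run in its own terminal or tmux pane with: start_llama_server
```

Setting --alias to the same string as the model field in your Codex profile keeps the two in sync. If memory is tight, lower --ctx-size first.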
Disabling thinking can improve performance for agentic coding tasks. Gemma 4 enables thinking by default via the chat template - to disable it, add the following flag to the llama-server command:
MacOS / Linux / WSL:
--chat-template-kwargs '{"enable_thinking":false}'
Windows PowerShell:
--chat-template-kwargs "{\"enable_thinking\":false}"