How to Run Local LLMs with OpenAI Codex

Use open models with OpenAI Codex locally on your own device.

This guide walks you through connecting open LLMs to the Codex CLI entirely locally. It works with any OpenAI-API-compatible local model setup, including DeepSeek, Qwen, Gemma, and more.

In this tutorial, we’ll use GLM-4.7-Flash (a 30B MoE agentic + coding model), which fits nicely on a 24GB RAM/unified-memory device, to autonomously fine-tune an LLM using Unsloth. Prefer a different model? Swap in any other model by updating the model names in the scripts.


For model quants, we’ll use Unsloth Dynamic GGUFs so you can run quantized GGUF models while preserving as much quality as possible.

We’ll use llama.cpp, an open-source runtime for running LLMs on macOS, Linux, and Windows. Its llama-server component lets you serve models efficiently via a single OpenAI-compatible HTTP endpoint. In this setup, the model is served on port 8001, and all agent tool calls are routed through that one endpoint.

📖 #1: Setup Tutorial

1. Install llama.cpp

We need to install llama.cpp to serve local LLMs for use in Codex. We follow the official build instructions for correct GPU bindings and maximum performance. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or only want CPU inference.

apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev git-all -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-mtmd-cli llama-server llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp
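If the build succeeded, the binaries now run from the llama.cpp folder. A quick sanity check (--version prints version and build info in recent llama.cpp builds):

./llama.cpp/llama-server --version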

2. Download and use models locally

Download the model via huggingface_hub in Python (after installing it with pip install huggingface_hub hf_transfer). We use the UD-Q4_K_XL quant for the best size/accuracy balance. You can find all Unsloth GGUF uploads in our Hugging Face collection. If downloads get stuck, see our Hugging Face Hub XET debugging guide.

import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"  # enable hf_transfer for faster downloads
from huggingface_hub import snapshot_download

# Download only the UD-Q4_K_XL quant files from the repo
snapshot_download(
    repo_id = "unsloth/GLM-4.7-Flash-GGUF",
    local_dir = "unsloth/GLM-4.7-Flash-GGUF",
    allow_patterns = ["*UD-Q4_K_XL*"],
)
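Before serving, you can confirm the quant landed where the next step expects it; you should see a *UD-Q4_K_XL*.gguf file:

ls unsloth/GLM-4.7-Flash-GGUF/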

3. Start the llama-server

To deploy GLM-4.7-Flash for agentic workloads, we use llama-server. We apply Z.ai's recommended sampling parameters (temp 1.0, top_p 0.95) and enable --jinja for proper tool calling support.

Run this command in a new terminal (or inside tmux). The settings below fit in a 24GB GPU such as an RTX 4090 (about 23GB used). --fit on will also auto-offload, but if you see poor performance, reduce --ctx-size. We use --cache-type-k q8_0 --cache-type-v q8_0 to quantize the KV cache and reduce VRAM usage.

./llama.cpp/llama-server \
    --model unsloth/GLM-4.7-Flash-GGUF/GLM-4.7-Flash-UD-Q4_K_XL.gguf \
    --alias "unsloth/GLM-4.7-Flash" \
    --jinja \
    --temp 1.0 \
    --top-p 0.95 \
    --min-p 0.01 \
    --port 8001 \
    --kv-unified \
    --cache-type-k q8_0 --cache-type-v q8_0 \
    --flash-attn on \
    --batch-size 4096 --ubatch-size 1024 \
    --ctx-size 131072
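Once the server is up, you can sanity-check the OpenAI-compatible endpoint with curl before wiring up Codex. llama-server exposes /v1/chat/completions, and the --alias set above is used as the model name:

curl http://127.0.0.1:8001/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "unsloth/GLM-4.7-Flash", "messages": [{"role": "user", "content": "Say hello"}]}'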

OpenAI Codex CLI Tutorial

Codex is OpenAI's official coding agent that runs locally. While designed for ChatGPT, it supports custom API endpoints, making it a great fit for local LLMs. To install on Windows, it's best to use WSL.

Install

Mac (Homebrew):

brew install --cask codex

Universal (npm), e.g. on Linux:

apt update
apt install nodejs npm -y
npm install -g @openai/codex
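To confirm the install, print the CLI help (this is also where you can check the flag names used later in this guide):

codex --help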

Configure

First run codex to log in and complete initial setup, then create or edit the configuration file at ~/.codex/config.toml (Mac/Linux) or %USERPROFILE%\.codex\config.toml (Windows).

Use cat > ~/.codex/config.toml for Linux / Mac:
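A minimal sketch of the contents (the provider id llamacpp is our own naming; model matches the --alias set in llama-server, and base_url points at the local port 8001 endpoint):

cat > ~/.codex/config.toml << 'EOF'
# Route Codex to the local llama-server instead of ChatGPT
model = "unsloth/GLM-4.7-Flash"
model_provider = "llamacpp"

[model_providers.llamacpp]
name = "llama.cpp"
base_url = "http://127.0.0.1:8001/v1"
wire_api = "chat"
EOF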

Navigate to your project folder (mkdir project ; cd project) and run:
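With the config above in place, plain codex should pick up the local provider and start the interactive agent (no extra flags needed):

codex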

Or you can allow Codex to execute any code without approvals. (BEWARE: this makes Codex write and run code however it likes, without any approval prompts!)
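One way to do this (the exact flag name is an assumption on our part; confirm it with codex --help on your version):

codex --dangerously-bypass-approvals-and-sandbox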

And you will see the Codex interface start up in your terminal.


Try this prompt to install and run a simple Unsloth finetune:
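The exact wording is up to you; a prompt along these lines works (our own phrasing, with unsloth/Llama-3.2-1B-Instruct as an assumed example model):

"Install unsloth and its dependencies, then write and run a Python script that LoRA fine-tunes unsloth/Llama-3.2-1B-Instruct on a small open dataset for a few steps, and report the final training loss."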

You will see Codex install the dependencies and start working through the task, and if we wait a little longer, we finally get a completed Unsloth finetune.
