# NVIDIA Nemotron 3 Nano Omni - How To Run Locally

NVIDIA Nemotron-3-Nano-Omni-30B-A3B is an open hybrid reasoning MoE model with 30B total parameters (3B active), built for multimodal agentic workloads. It takes **audio**, **video**, text, images and documents as input and produces text output. The model runs in **25GB RAM** at 4-bit and 36GB at 8-bit.

With a **256K context**, Nemotron 3 Nano Omni is the **strongest omni** model for its size and the highest-efficiency open multimodal model. We collaborated with NVIDIA for day zero support!\
**GGUF:** [Nemotron-3-Nano-Omni-30B-A3B-Reasoning](https://huggingface.co/unsloth/Nemotron-3-Nano-30B-A3B-GGUF)

### ⚙️ Usage Guide

NVIDIA recommends these settings for inference:

{% columns %}
{% column %}
**General chat/instruction (default):**

* `temperature = 1.0`
* `top_p = 1.0`
  {% endcolumn %}

{% column %}
**Tool calling use-cases:**

* `temperature = 0.6`
* `top_p = 0.95`
  {% endcolumn %}
  {% endcolumns %}

**Notes for multimodal use:**

* In multimodal prompts, images, audio and sampled video frames also consume context, so budget your context length accordingly.
* Do not manually type image or audio tokens. Use the model processor, the `llama.cpp` multimodal projector, or your serving backend to attach images/audio/video frames (see the sketch below).
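
A minimal sketch of attaching an image through the Hugging Face processor rather than typing tokens by hand; the repo id and chat-template usage here are assumptions, so check the model card for the confirmed API:

```python
# Minimal sketch -- repo id and multimodal chat-template support are
# assumptions; consult the model card for the confirmed usage.
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained(
    "nvidia/NVIDIA-Nemotron-3-Nano-Omni-30B-A3B"  # hypothetical repo id
)
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "screenshot.png"},
        {"type": "text", "text": "Describe this screenshot."},
    ],
}]
# The processor inserts the correct image tokens for you
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
)
```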

{% hint style="warning" %}
Do NOT use CUDA 13.2 as you may get gibberish outputs. NVIDIA is working on a fix.
{% endhint %}

### Run Nemotron-3-Nano-Omni

Depending on your use-case, you will need to use [different settings](#usage-guide). Some GGUFs end up similar in size because the model architecture (like [gpt-oss](https://unsloth.ai/docs/models/gpt-oss-how-to-run-and-fine-tune)) has dimensions not divisible by 128, so parts can’t be quantized to lower bits. **GGUF:** [Nemotron-3-Nano-Omni-30B-A3B-Reasoning](https://huggingface.co/unsloth/Nemotron-3-Nano-30B-A3B-GGUF)

The 4-bit versions of the model require \~25GB RAM, while 8-bit requires 36GB. For these guides, we will be using `UD-Q4_K_XL`, which is a good balance between size and accuracy.

<a href="#unsloth-studio-guide" class="button primary">Run in Unsloth Studio</a><a href="#llama.cpp-tutorial" class="button secondary">Run in llama.cpp</a>

{% hint style="warning" %}
Currently no multimodal/vision GGUF works in **Ollama** because of the separate `mmproj` vision files. Use llama.cpp-compatible backends instead.
{% endhint %}

### 🦥 Unsloth Studio Guide

For this tutorial, we will be using [Unsloth Studio](https://unsloth.ai/docs/new/studio), our new web UI for running and training LLMs. With Unsloth Studio, you can run models with **audio**, image and text input locally on **Mac, Windows**, and Linux, and:

{% columns %}
{% column %}

* Search, download, [run GGUFs](https://unsloth.ai/docs/new/studio#run-models-locally) and safetensor models
* **Compare** models **side-by-side**
* [**Self-healing** tool calling](https://unsloth.ai/docs/new/studio#execute-code--heal-tool-calling) + **web search**
* [**Code execution**](https://unsloth.ai/docs/new/studio#run-models-locally) (Python, Bash)
* [Automatic inference](https://unsloth.ai/docs/new/studio#model-arena) parameter tuning (temp, top-p, etc.)
* [Train LLMs](https://unsloth.ai/docs/new/studio#no-code-training) 2x faster with 70% less VRAM
  {% endcolumn %}

{% column %}

<div data-with-frame="true"><figure><img src="https://3215535692-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FxhOjnexMCB3dmuQFQ2Zq%2Fuploads%2FFeQ0UUlnjXkDdqhcWglh%2Fskinny%20studio%20chat.png?alt=media&#x26;token=c2ee045f-c243-4024-a8e4-bb4dbe7bae79" alt=""><figcaption></figcaption></figure></div>
{% endcolumn %}
{% endcolumns %}

{% stepper %}
{% step %}

#### Install Unsloth

**MacOS, Linux, WSL:**

```bash
curl -fsSL https://unsloth.ai/main/install.sh | sh
```

**Windows PowerShell:**

```bash
irm https://unsloth.ai/install.ps1 | iex
```

{% endstep %}

{% step %}

#### Setup Unsloth Studio (one time)

Setup automatically installs Node.js (via nvm), builds the frontend, installs all Python dependencies, and builds llama.cpp with CUDA support.

{% hint style="info" %}
**WSL users:** you will be prompted for your `sudo` password to install build dependencies (`cmake`, `git`, `libcurl4-openssl-dev`).
{% endhint %}
{% endstep %}

{% step %}

#### Launch Unsloth

**MacOS, Linux, WSL:**

```bash
source unsloth_studio/bin/activate
unsloth studio -H 0.0.0.0 -p 8888
```

**Windows PowerShell:**

```bash
& .\unsloth_studio\Scripts\unsloth.exe studio -H 0.0.0.0 -p 8888
```

<div data-with-frame="true"><figure><img src="https://3215535692-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FxhOjnexMCB3dmuQFQ2Zq%2Fuploads%2Fd1yMMNa65Ccz50Ke0E7r%2FScreenshot%202026-03-17%20at%2012.32.38%E2%80%AFAM.png?alt=media&#x26;token=9369cfe7-35b1-4955-b8cb-42f7ecb43780" alt="" width="375"><figcaption></figcaption></figure></div>

**Then open `http://localhost:8888` in your browser.**
{% endstep %}

{% step %}

#### Search and download NVIDIA-Nemotron-3-Nano-30B-A3B-Omni

On first launch, you will need to create a password to secure your account; use it to sign in later. Then go to the [Studio Chat](https://unsloth.ai/docs/new/studio/chat) tab, search for Nemotron-3-Nano-Omni in the search bar, and download your desired model and quant.

<div data-with-frame="true"><figure><img src="https://3215535692-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FxhOjnexMCB3dmuQFQ2Zq%2Fuploads%2FQUTU2gI4DNuscVEuiT8f%2FScreenshot%202026-03-20%20at%201.28.50%E2%80%AFAM.png?alt=media&#x26;token=74d5fd9e-a229-4ddc-a96d-abe68e1ca6a3" alt="" width="375"><figcaption></figcaption></figure></div>
{% endstep %}

{% step %}

#### Run Nemotron-3-Nano-30B-A3B-Omni

Inference parameters are auto-set when using Unsloth Studio, but you can still change them manually. You can also edit the context length, chat template and other settings.

For more information, you can view our [Unsloth Studio inference guide](https://unsloth.ai/docs/new/studio/chat).

<div data-with-frame="true"><figure><img src="https://3215535692-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FxhOjnexMCB3dmuQFQ2Zq%2Fuploads%2FXPQGEEr1YoKofrTatAKK%2Ftoolcallingif.gif?alt=media&#x26;token=25d68698-fb13-4c46-99b2-d39fb025df08" alt="" width="563"><figcaption></figcaption></figure></div>
{% endstep %}
{% endstepper %}

### 🦙 Llama.cpp Tutorial

Instructions to run in llama.cpp (we will be using the 4-bit quant so it fits on most devices):

{% stepper %}
{% step %}
Obtain the latest `llama.cpp` on [GitHub here](https://github.com/ggml-org/llama.cpp). You can follow the build instructions below as well. Change `-DGGML_CUDA=ON` to `-DGGML_CUDA=OFF` if you don't have a GPU or just want CPU inference. **For Apple Mac / Metal devices**, set `-DGGML_CUDA=OFF` then continue as usual - Metal support is on by default.

{% code overflow="wrap" %}

```bash
apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-mtmd-cli llama-server llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp
```

{% endcode %}
{% endstep %}

{% step %}
You can pull the model directly from Hugging Face. You can increase the context up to `262144` tokens via the `--ctx-size` flag, as your RAM/VRAM allows.

Follow this for **text only** use-cases:

```bash
./llama.cpp/llama-cli \
    -hf unsloth/NVIDIA-Nemotron-3-Nano-Omni-30B-A3B-Reasoning-GGUF:UD-Q4_K_XL \
    --temp 1.0 \
    --top-p 1.0
```

For **tool-calling** use-cases, change to `--temp 0.6` and `--top-p 0.95`. Same command, different sampling:
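
```bash
./llama.cpp/llama-cli \
    -hf unsloth/NVIDIA-Nemotron-3-Nano-Omni-30B-A3B-Reasoning-GGUF:UD-Q4_K_XL \
    --temp 0.6 \
    --top-p 0.95
```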

**Multimodal image/audio** use cases:

```bash
./llama.cpp/llama-mtmd-cli \
    -hf unsloth/NVIDIA-Nemotron-3-Nano-Omni-30B-A3B-Reasoning-GGUF:UD-Q4_K_XL \
    --image screenshot.png \
    --audio meeting.wav \
    -p "Summarize what is shown and said. Return key actions as bullet points." \
    --temp 1.0 \
    --top-p 1.0
```

Image-only use cases:

```bash
./llama.cpp/llama-mtmd-cli \
    -hf unsloth/NVIDIA-Nemotron-3-Nano-Omni-30B-A3B-Reasoning-GGUF:UD-Q4_K_XL \
    --image chart.png \
    -p "Read this chart and explain the main trend." \
    --temp 1.0 \
    --top-p 1.0
```

{% endstep %}

{% step %}
Alternatively, download the model with the snippet below (after running `pip install huggingface_hub hf_transfer`). You can choose `UD-Q4_K_XL` or other quantized versions; the `mmproj` file is also needed for image/audio input.

```python
# !pip install huggingface_hub hf_transfer
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id = "unsloth/NVIDIA-Nemotron-3-Nano-Omni-30B-A3B-Reasoning-GGUF",
    local_dir = "unsloth/NVIDIA-Nemotron-3-Nano-Omni-30B-A3B-Reasoning-GGUF",
    # Also grab the mmproj file, needed for image/audio input
    allow_patterns = ["*UD-Q4_K_XL*", "*mmproj*"],
)
```
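
Alternatively, the same download via the Hugging Face CLI, which ships with `huggingface_hub`:

```bash
huggingface-cli download unsloth/NVIDIA-Nemotron-3-Nano-Omni-30B-A3B-Reasoning-GGUF \
    --include "*UD-Q4_K_XL*" "*mmproj*" \
    --local-dir unsloth/NVIDIA-Nemotron-3-Nano-Omni-30B-A3B-Reasoning-GGUF
```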

{% endstep %}

{% step %}
Then run the model in conversation mode:

{% code overflow="wrap" %}

```bash
./llama.cpp/llama-cli \
    --model unsloth/NVIDIA-Nemotron-3-Nano-Omni-30B-A3B-Reasoning-GGUF/NVIDIA-Nemotron-3-Nano-Omni-30B-A3B-Reasoning-UD-Q4_K_XL.gguf \
    --mmproj unsloth/NVIDIA-Nemotron-3-Nano-Omni-30B-A3B-Reasoning-GGUF/mmproj-F16.gguf \
    --temp 1.0 \
    --top-p 1.0
```

{% endcode %}
{% endstep %}

{% step %}
**Video-style workflows**

For local video-style workflows, sample frames from a video and pass them as multiple images.

{% code expandable="true" %}

```bash
# Sample one frame every 2 seconds, capped at 1280px wide
mkdir -p frames
ffmpeg -i demo.mp4 -vf "fps=1/2,scale=1280:-1" frames/frame_%04d.png

# llama-mtmd-cli takes one --image flag per file, so build the
# argument list from the first 16 sampled frames
IMAGE_ARGS=()
count=0
for f in frames/frame_*.png; do
    [ "$count" -ge 16 ] && break
    IMAGE_ARGS+=(--image "$f")
    count=$((count + 1))
done

./llama.cpp/llama-mtmd-cli \
    -hf unsloth/NVIDIA-Nemotron-3-Nano-Omni-30B-A3B-Reasoning-GGUF:UD-Q4_K_XL \
    "${IMAGE_ARGS[@]}" \
    -p "Analyze these sampled video frames. Summarize the sequence of events and any important visual details." \
    --temp 1.0 \
    --top-p 1.0
```

{% endcode %}
{% endstep %}
{% endstepper %}

#### Llama-server serving & deployment

To serve Nemotron 3 Nano Omni locally, use `llama-server`. In a new terminal (for example via `tmux`), launch the model:

```bash
./llama.cpp/llama-server \
    -hf unsloth/NVIDIA-Nemotron-3-Nano-Omni-30B-A3B-Reasoning-GGUF:UD-Q4_K_XL \
    --alias "unsloth/NVIDIA-Nemotron-3-Nano-Omni-30B-A3B-Reasoning" \
    --prio 3 \
    --temp 1.0 \
    --top-p 1.0 \
    --port 8001
```

If you downloaded the model manually, use:

```bash
./llama.cpp/llama-server \
    --model unsloth/NVIDIA-Nemotron-3-Nano-Omni-30B-A3B-Reasoning-GGUF/NVIDIA-Nemotron-3-Nano-Omni-30B-A3B-Reasoning-UD-Q4_K_XL.gguf \
    --mmproj unsloth/NVIDIA-Nemotron-3-Nano-Omni-30B-A3B-Reasoning-GGUF/mmproj-F16.gguf \
    --alias "unsloth/NVIDIA-Nemotron-3-Nano-Omni-30B-A3B-Reasoning" \
    --prio 3 \
    --temp 1.0 \
    --top-p 1.0 \
    --port 8001
```
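
You can sanity-check the server with a quick `curl` request before wiring up a client:

```bash
curl http://127.0.0.1:8001/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "unsloth/NVIDIA-Nemotron-3-Nano-Omni-30B-A3B-Reasoning",
        "messages": [{"role": "user", "content": "Hello!"}]
    }'
```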

Then in a new terminal, after installing the OpenAI client with `pip install openai`:

```python
from openai import OpenAI

openai_client = OpenAI(
    base_url="http://127.0.0.1:8001/v1",
    api_key="sk-no-key-required",
)

completion = openai_client.chat.completions.create(
    model="unsloth/NVIDIA-Nemotron-3-Nano-Omni-30B-A3B-Reasoning",
    messages=[
        {"role": "user", "content": "What is 2+2?"},
    ],
)

print(completion.choices[0].message.content)
```

This will print something like:

```
2 + 2 = 4.
```

#### Image input through the OpenAI-compatible server
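
`llama-server` accepts images through the standard OpenAI `image_url` content part. The snippet below base64-encodes a local file into a data URL and sends it alongside a text prompt: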

{% code expandable="true" %}

```python
from openai import OpenAI
import base64
import mimetypes


def file_to_data_url(path: str) -> str:
    mime = mimetypes.guess_type(path)[0] or "application/octet-stream"
    with open(path, "rb") as f:
        data = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime};base64,{data}"


openai_client = OpenAI(
    base_url="http://127.0.0.1:8001/v1",
    api_key="sk-no-key-required",
)

completion = openai_client.chat.completions.create(
    model="unsloth/NVIDIA-Nemotron-3-Nano-Omni-30B-A3B-Reasoning",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Read this screenshot and explain what the agent should do next.",
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": file_to_data_url("screenshot.png"),
                    },
                },
            ],
        }
    ],
)

print(completion.choices[0].message.content)
```

{% endcode %}

### 🦥 Fine-tuning Nemotron 3 Nano Omni

Unsloth supports the entire [Nemotron](https://unsloth.ai/docs/models/nemotron-3) model family. Nemotron 3 Nano Omni is useful for multimodal agent datasets. You can train on audio, vision or text via Unsloth. **Video input** fine-tuning is currently not supported.

For text-only training and ready-made notebooks, you can start from the existing [Nemotron 3 Nano fine-tuning flow](https://unsloth.ai/docs/nemotron-3#fine-tuning-nemotron-3-and-rl). For multimodal adapters, make sure your dataset includes the modality your agent actually needs:

* **Computer use:** screenshots, UI state, cursor/context, expected next action
* **Document intelligence:** PDFs, screenshots, charts, tables, structured extraction targets
* **Audio understanding:** audio clips, sampled frames, summaries, timestamps, events and follow-up questions
* **Agent loops:** observation → reasoning → action → validation examples

For Omni, do not blindly reuse text-only VRAM numbers. Multimodal encoders, projector weights, image tokens, audio chunks and long context all increase memory use. Start with shorter contexts and smaller batch sizes, then scale up.
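
As a concrete starting point, here is a minimal text-only loading sketch with Unsloth; the model id, sequence length and LoRA settings below are assumptions, so check the Nemotron 3 notebooks for the supported configuration:

```python
# Minimal sketch -- model id and hyperparameters are assumptions;
# see the Nemotron 3 fine-tuning notebooks for confirmed settings.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/NVIDIA-Nemotron-3-Nano-30B-A3B",  # hypothetical id
    max_seq_length = 4096,   # start small, scale up as memory allows
    load_in_4bit = True,     # 4-bit loading to cut VRAM (QLoRA-style)
)
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    lora_alpha = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
)
```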

### Benchmarks

Nemotron 3 Nano Omni is the strongest omni model for its size. It is also the highest-efficiency open multimodal model with leading accuracy. The model surpasses Qwen3-Omni-30B-A3B on every benchmark.

<div data-with-frame="true"><figure><img src="https://3215535692-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FxhOjnexMCB3dmuQFQ2Zq%2Fuploads%2FVeilBuCszSJr4yoJDaGm%2Funnamed.png?alt=media&#x26;token=5fe8205c-6f2b-40c3-a0e2-ef254a357d2b" alt="" width="563"><figcaption></figcaption></figure></div>

