🧩NVIDIA Nemotron 3 Nano Omni - How To Run Locally

Run & fine-tune Nemotron-3-Nano-Omni-30B-A3B locally on your device!

NVIDIA Nemotron-3-Nano-Omni-30B-A3B is an open 30B parameter, 3B active hybrid reasoning MoE model built for multimodal agentic workloads including audio, video, text, images and docs as input, with text output. The model runs on 25GB RAM for 4-bit and 36GB for 8-bit.

With a 256K context, Nemotron 3 Nano Omni is the strongest omni model for its size and the highest-efficiency open multimodal model. We collaborated with NVIDIA for day zero support! GGUF: Nemotron-3-Nano-Omni-30B-A3B-Reasoning

⚙️ Usage Guide

NVIDIA recommends these settings for inference:

General chat/instruction (default):

  • temperature = 1.0

  • top_p = 1.0

Tool calling use-cases:

  • temperature = 0.6

  • top_p = 0.95

Run Nemotron-3-Nano-Omni

Depending on your use-case you will need to use different settings. Some GGUFs end up similar in size because the model architecture (like gpt-oss) has dimensions not divisible by 128, so parts can’t be quantized to lower bits. GGUF: Nemotron-3-Nano-Omni-30B-A3B-Reasoning

The 4-bit versions of the model requires ~25GB RAM. 8-bit requires 36GB. For these guides, we will be using UD-Q4-K-XL which is a good balance between size and accuracy.

Run in Unsloth StudioRun in llama.cpp

🦥 Unsloth Studio Guide

For this tutorial, we will be using Unsloth Studio, which is our new web UI for running and training LLMs. With Unsloth Studio, you can run models and input audio, image and text locally on Mac, Windows, and Linux and:

1

Install Unsloth

MacOS, Linux, WSL:

Windows PowerShell:

2

Setup Unsloth Studio (one time)

Setup automatically installs Node.js (via nvm), builds the frontend, installs all Python dependencies, and builds llama.cpp with CUDA support.

WSL users: you will be prompted for your sudo password to install build dependencies (cmake, git, libcurl4-openssl-dev).

3

Launch Unsloth

MacOS, Linux, WSL:

Windows Powershell:

Then open http://127.0.0.1:8888 in your browser.

4

Search and download NVIDIA-Nemotron-3-Nano-30B-A3B-Omni

On first launch you will need to create a password to secure your account and sign in again later. Then go to the Studio Chat tab and search for Nemotron-3-Nano-Omni in the search bar and download your desired model and quant.

5

Run Nemotron-3-Nano-30B-A3B-Omni

Inference parameters should be auto-set when using Unsloth Studio, however you can still change it manually. You can also edit the context length, chat template and other settings.

For more information, you can view our Unsloth Studio inference guide.

🦙 Llama.cpp Tutorial:

Instructions to run in llama.cpp (note we will be using 4-bit to fit most devices):

1

Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference. For Apple Mac / Metal devices, set -DGGML_CUDA=OFF then continue as usual - Metal support is on by default.

2

Let's first get an image! You can also upload images as well. We shall use https://raw.githubusercontent.com/unslothai/unsloth/refs/heads/main/images/unsloth%20made%20with%20love.png, which is just our mini logo showing how finetunes are made with Unsloth:

Let's get the 2nd image at https://files.worldwildlife.org/wwfcmsprod/images/Sloth_Sitting_iStock_3_12_2014/story_full_width/8l7pbjmj29_iStock_000011145477Large_mini__1_.jpg

3

Download the model via the code below (after installing pip install huggingface_hub). You can choose Q4_K_M or other quantized versions like UD-Q4_K_XL . We recommend using at least 2-bit dynamic quant UD-Q2_K_XL to balance size and accuracy. If downloads get stuck, see: Hugging Face Hub, XET debugging

4

Then run the model in conversation mode:

5

You will then see the below:

6

Then use /image to load both images in and ask "What is this image":

7

And for the sloth image:

Llama-server serving & deployment

To deploy Nemotron 3 Nano Omni locally, use llama-server. In a new terminal, for example via tmux, deploy the model:

If you downloaded the model manually, use:

Then in a new terminal, after installing the OpenAI client with pip install openai:

Which will show something like the below:

Image input through the OpenAI-compatible server

Let's use picture.png which was the sloth image like in 🦙 Llama.cpp Tutorial:

Which will show something like below:

🦥 Fine-tuning Nemotron 3 Nano Omni

Unsloth supports the entire Nemotron model family. Nemotron 3 Nano Omni is useful for multimodal agent datasets. You can train on audio, vision or text via Unsloth. Video input fine-tuning is currently not supported.

For text-only and notebooks, you can start from the existing Nemotron 3 Nano fine-tuning flow. For multimodal adapters, make sure your dataset includes the modality your agent actually needs:

  • Computer use: screenshots, UI state, cursor/context, expected next action

  • Document intelligence: PDFs, screenshots, charts, tables, structured extraction targets

  • Audio understanding: audio clips, sampled frames, summaries, timestamps, events and follow-up questions

  • Agent loops: observation → reasoning → action → validation examples

For Omni, do not blindly reuse text-only VRAM numbers. Multimodal encoders, projector weights, image tokens, audio chunks and long context all increase memory use. Start with shorter contexts and smaller batch sizes, then scale up.

Benchmarks

Nemotron 3 Nano Omni is the strongest omni model for its size. It is also the highest-efficiency open multimodal model with leading accuracy. The model surpasses Qwen3-Omni-30B-A3B on every benchmark.

Last updated

Was this helpful?