🧩NVIDIA Nemotron 3 Nano - How To Run Guide
Run & fine-tune NVIDIA Nemotron 3 Nano locally on your device!
NVIDIA has released Nemotron-3-Nano-4B, a 4B open hybrid MoE model that follows Nemotron-3-Super-120B-A12B and Nemotron-3-Nano-30B-A3B. The Nemotron family is designed for fast, accurate coding, math, and agentic workloads. The models feature a 1M-token context window and are competitive across reasoning, chat, and throughput benchmarks.
Nemotron-3-Nano-4B runs on 5GB of RAM, VRAM, or unified memory. Nemotron-3-Nano-30B-A3B runs on 24GB of RAM. Nemotron 3 can now be fine-tuned locally via Unsloth. Thanks to NVIDIA for giving Unsloth day-zero support.
Jump to: Nemotron-3-Nano-4B · Nemotron-3-Nano-30B-A3B · Fine-tuning Nemotron 3
⚙️ Usage Guide
NVIDIA recommends these settings for inference:
General chat/instruction (default):
temperature = 1.0, top_p = 1.0
Tool calling use-cases:
temperature = 0.6, top_p = 0.95
For most local use, set:
max_new_tokens = 32,768 to 262,144 for standard prompts, with a max of 1M tokens. Increase for deep reasoning or long-form generation as your RAM/VRAM allows.
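The recommended settings above can be wrapped in a small helper that picks the right sampling parameters per use case. This helper and its use-case labels are our own sketch, not an official API:

```python
# Recommended Nemotron 3 sampling settings (from NVIDIA's guidance above).
# The helper and its use-case labels are illustrative, not an official API.
def sampling_params(use_case: str = "chat", max_new_tokens: int = 32_768) -> dict:
    presets = {
        "chat": {"temperature": 1.0, "top_p": 1.0},           # general chat/instruction
        "tool_calling": {"temperature": 0.6, "top_p": 0.95},  # tool-calling workloads
    }
    params = dict(presets[use_case])
    params["max_new_tokens"] = max_new_tokens  # raise toward 262,144 as memory allows
    return params

print(sampling_params("tool_calling"))
```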
The chat template format can be seen by running the code below:
tokenizer.apply_chat_template([
{"role" : "user", "content" : "What is 1+1?"},
{"role" : "assistant", "content" : "2"},
{"role" : "user", "content" : "What is 2+2?"}
], add_generation_prompt = True, tokenize = False,
)

Because the model was trained with NoPE, you only need to change max_position_embeddings. The model doesn't use explicit positional embeddings, so YaRN isn't needed.
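Since only max_position_embeddings needs changing, extending the context can be as simple as editing the checkpoint's config.json. The sketch below edits a stand-in config file; with a real checkpoint you would open <model_dir>/config.json instead:

```python
import json
import pathlib
import tempfile

# Stand-in config.json illustrating the edit; with a real checkpoint you
# would open <model_dir>/config.json instead.
cfg_path = pathlib.Path(tempfile.mkdtemp()) / "config.json"
cfg_path.write_text(json.dumps({"max_position_embeddings": 262144}))

cfg = json.loads(cfg_path.read_text())
cfg["max_position_embeddings"] = 1_048_576  # extend context toward 1M tokens
# No rope_scaling / YaRN entry is needed because the model uses NoPE.
cfg_path.write_text(json.dumps(cfg, indent=2))

print(json.loads(cfg_path.read_text())["max_position_embeddings"])
```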
Nemotron 3 chat template format:
Nemotron 3 uses <think> with token ID 12 and </think> with token ID 13 for reasoning. Use --special to see the tokens for llama.cpp. You might also need --verbose-prompt to see <think> since it's prepended.
🖥️ Run Nemotron-3-Nano-4B
Depending on your use-case you will need to use different settings. Some GGUFs end up similar in size because the model architecture (like gpt-oss) has dimensions not divisible by 128, so parts can’t be quantized to lower bits.
The 4-bit versions of the model require ~3GB of RAM; 8-bit requires 5GB.
🦥 Unsloth Studio Guide
For this tutorial, we will be using Unsloth Studio, which is our new web UI for running and training LLMs. With Unsloth Studio, you can run models locally on Mac, Windows, and Linux and:
Search, download, and run GGUF and safetensors models
Compare models side-by-side
Self-healing tool calling + web search
Code execution (Python, Bash)
Automatic inference parameter tuning (temp, top-p, etc.)
Train LLMs 2x faster with 70% less VRAM

Setup Unsloth Studio (one time)
Setup automatically installs Node.js (via nvm), builds the frontend, installs all Python dependencies, and builds llama.cpp with CUDA support.
First install may take 5-10 minutes. This is normal as llama.cpp needs to compile binaries. Do not cancel it.
WSL users: you will be prompted for your sudo password to install build dependencies (cmake, git, libcurl4-openssl-dev).
Search and download Nemotron-3-Nano-4B
On first launch you will need to create a password to secure your account and sign in again later. You’ll then see a brief onboarding wizard to choose a model, dataset, and basic settings. You can skip it at any time.
Then go to the Studio Chat tab, search for Nemotron-3-Nano-4B in the search bar, and download your desired model and quant.

Run Nemotron-3-Nano-4B
Inference parameters should be auto-set when using Unsloth Studio, but you can still change them manually. You can also edit the context length, chat template, and other settings.
For more information, you can view our Unsloth Studio inference guide.

Llama.cpp Tutorial:
Instructions to run in llama.cpp (we'll be using 8-bit for near-full precision):
Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.
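A typical build on a CUDA-capable machine might look like the following (flip -DGGML_CUDA=ON to OFF for CPU-only or Apple Metal machines):

```shell
# Clone and build llama.cpp; swap -DGGML_CUDA=ON for OFF on CPU-only boxes.
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first \
    --target llama-cli llama-server
cp llama.cpp/build/bin/llama-* llama.cpp/
```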
You can directly pull from Hugging Face. You can increase the context to 1M as your RAM/VRAM allows.
Follow this for general instruction use-cases:
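A minimal invocation sketch, assuming the GGUF is published under a repo like unsloth/Nemotron-3-Nano-4B-GGUF (the repo id is our assumption; check Hugging Face for the actual name):

```shell
# General chat/instruction settings: temp 1.0, top-p 1.0.
./llama.cpp/llama-cli \
    -hf unsloth/Nemotron-3-Nano-4B-GGUF:Q8_0 \
    --jinja --ctx-size 262144 --n-gpu-layers 99 \
    --temp 1.0 --top-p 1.0
```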
Follow this for tool-calling use-cases:
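For tool calling, the same sketch with the tool-calling sampling settings (repo id again assumed; --jinja enables the chat template needed for tool calls):

```shell
# Tool-calling settings: temp 0.6, top-p 0.95.
./llama.cpp/llama-cli \
    -hf unsloth/Nemotron-3-Nano-4B-GGUF:Q8_0 \
    --jinja --ctx-size 262144 --n-gpu-layers 99 \
    --temp 0.6 --top-p 0.95
```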
Download the model (after installing the prerequisites with pip install huggingface_hub hf_transfer). You can choose Q8_0 or other quantized versions.
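One way to fetch just the quant you want with huggingface_hub; the repo id and file pattern below are assumptions, so adjust them to the actual GGUF repo:

```python
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"  # faster downloads via hf_transfer

from huggingface_hub import snapshot_download

# Repo id and filename pattern are assumptions; adjust to the actual GGUF repo.
snapshot_download(
    repo_id="unsloth/Nemotron-3-Nano-4B-GGUF",
    local_dir="Nemotron-3-Nano-4B-GGUF",
    allow_patterns=["*Q8_0*"],  # or another quant
)
```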
Then run the model in conversation mode:
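Conversation mode is enabled with -cnv; the local GGUF filename below is an assumption based on the quant chosen earlier:

```shell
# -cnv starts llama-cli in interactive conversation mode.
./llama.cpp/llama-cli \
    --model Nemotron-3-Nano-4B-GGUF/Nemotron-3-Nano-4B-Q8_0.gguf \
    --ctx-size 262144 --n-gpu-layers 99 \
    --temp 1.0 --top-p 1.0 -cnv
```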
Also, adjust context window as required. Ensure your hardware can handle more than a 256K context window. Setting it to 1M may trigger CUDA OOM and crash, which is why the default is 262,144.
🖥️ Run Nemotron-3-Nano-30B-A3B
Depending on your use-case you will need to use different settings. Some GGUFs end up similar in size because the model architecture (like gpt-oss) has dimensions not divisible by 128, so parts can’t be quantized to lower bits.
The 4-bit versions of the model require ~24GB of RAM; 8-bit requires 36GB.
🦥 Unsloth Studio Guide
For this tutorial, we will be using Unsloth Studio, which is our new web UI for running and training LLMs. With Unsloth Studio, you can run models locally on Mac, Windows, and Linux and:
Search, download, and run GGUF and safetensors models
Compare models side-by-side
Self-healing tool calling + web search
Code execution (Python, Bash)
Automatic inference parameter tuning (temp, top-p, etc.)
Train LLMs 2x faster with 70% less VRAM

Setup Unsloth Studio (one time)
Setup automatically installs Node.js (via nvm), builds the frontend, installs all Python dependencies, and builds llama.cpp with CUDA support.
First install may take 5-10 minutes. This is normal as llama.cpp needs to compile binaries. Do not cancel it.
WSL users: you will be prompted for your sudo password to install build dependencies (cmake, git, libcurl4-openssl-dev).
Search and download Nemotron-3-Nano-30B-A3B
On first launch you will need to create a password to secure your account and sign in again later. You’ll then see a brief onboarding wizard to choose a model, dataset, and basic settings. You can skip it at any time.
Then go to the Studio Chat tab, search for Nemotron-3-Nano-30B-A3B in the search bar, and download your desired model and quant.

Run Nemotron-3-Nano-30B-A3B
Inference parameters should be auto-set when using Unsloth Studio, but you can still change them manually. You can also edit the context length, chat template, and other settings.
For more information, you can view our Unsloth Studio inference guide.

Llama.cpp Tutorial:
Instructions to run in llama.cpp (note we will be using 4-bit to fit most devices):
Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference. For Apple Mac / Metal devices, set -DGGML_CUDA=OFF then continue as usual - Metal support is on by default.
You can directly pull from Hugging Face. You can increase the context to 1M as your RAM/VRAM allows.
Follow this for general instruction use-cases:
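A sketch for the 30B model, assuming a repo like unsloth/Nemotron-3-Nano-30B-A3B-GGUF (the repo id is our assumption):

```shell
# General chat/instruction settings: temp 1.0, top-p 1.0.
./llama.cpp/llama-cli \
    -hf unsloth/Nemotron-3-Nano-30B-A3B-GGUF:UD-Q4_K_XL \
    --jinja --ctx-size 262144 --n-gpu-layers 99 \
    --temp 1.0 --top-p 1.0
```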
Follow this for tool-calling use-cases:
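For tool calling, the same sketch with the tool-calling sampling settings (repo id again assumed):

```shell
# Tool-calling settings: temp 0.6, top-p 0.95.
./llama.cpp/llama-cli \
    -hf unsloth/Nemotron-3-Nano-30B-A3B-GGUF:UD-Q4_K_XL \
    --jinja --ctx-size 262144 --n-gpu-layers 99 \
    --temp 0.6 --top-p 0.95
```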
Download the model (after installing the prerequisites with pip install huggingface_hub hf_transfer). You can choose UD-Q4_K_XL or other quantized versions.
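As before, huggingface_hub can fetch just the chosen quant; the repo id and file pattern are assumptions to adjust against the actual GGUF repo:

```python
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"  # faster downloads via hf_transfer

from huggingface_hub import snapshot_download

# Repo id and filename pattern are assumptions; adjust to the actual GGUF repo.
snapshot_download(
    repo_id="unsloth/Nemotron-3-Nano-30B-A3B-GGUF",
    local_dir="Nemotron-3-Nano-30B-A3B-GGUF",
    allow_patterns=["*UD-Q4_K_XL*"],  # or another quant
)
```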
Then run the model in conversation mode:
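Conversation mode again uses -cnv; the local GGUF filename is an assumption based on the quant downloaded above:

```shell
# -cnv starts llama-cli in interactive conversation mode.
./llama.cpp/llama-cli \
    --model Nemotron-3-Nano-30B-A3B-GGUF/Nemotron-3-Nano-30B-A3B-UD-Q4_K_XL.gguf \
    --ctx-size 262144 --n-gpu-layers 99 \
    --temp 1.0 --top-p 1.0 -cnv
```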
Also, adjust context window as required. Ensure your hardware can handle more than a 256K context window. Setting it to 1M may trigger CUDA OOM and crash, which is why the default is 262,144.
Nemotron 3 uses <think> with token ID 12 and </think> with token ID 13 for reasoning. Use --special to see the tokens for llama.cpp. You might also need --verbose-prompt to see <think> since it's prepended.
🦥 Fine-tuning Nemotron 3 and RL
Unsloth now supports fine-tuning of all Nemotron models, including Nemotron 3 Super and Nano.
The 4B model fits on a free Colab GPU; the 30B model does not, so we also made an 80GB A100 Colab notebook for you to fine-tune with. 16-bit LoRA fine-tuning of Nemotron 3 Nano will use around 60GB of VRAM:
On fine-tuning MoEs: it's usually not a good idea to fine-tune the router layer, so we disable it by default. If you want to maintain the model's reasoning capabilities (optional), use a mix of direct answers and chain-of-thought examples, with at least 75% reasoning and 25% non-reasoning examples in your dataset.
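The 75/25 reasoning mix above can be sketched as a simple dataset-assembly step; the example records and helper are illustrative, not part of Unsloth's API:

```python
import random

def mix_dataset(reasoning, direct, reasoning_frac=0.75, seed=0):
    """Blend chain-of-thought and direct-answer examples at a target ratio.

    Keeps all reasoning examples and samples only enough direct answers so
    that reasoning makes up at least `reasoning_frac` of the final dataset.
    """
    rng = random.Random(seed)
    max_direct = int(len(reasoning) * (1 - reasoning_frac) / reasoning_frac)
    mixed = list(reasoning) + rng.sample(direct, min(max_direct, len(direct)))
    rng.shuffle(mixed)
    return mixed

cot = [{"text": f"cot-{i}"} for i in range(75)]       # chain-of-thought examples
plain = [{"text": f"direct-{i}"} for i in range(100)]  # direct-answer examples
ds = mix_dataset(cot, plain)
print(len(ds))  # 75 reasoning + 25 direct = 100
```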
✨Reinforcement Learning + NeMo Gym
We worked with the open-source NVIDIA NeMo Gym team to help democratize RL environments. Our collaboration enables single-turn rollout RL training for many domains of interest, including math, coding, and tool use, using training environments and datasets from NeMo Gym:
Also check out our latest collaboration guide published on NVIDIA's official Developer blog:
🦙Llama-server serving & deployment
To deploy Nemotron 3 for production, we use llama-server. In a new terminal (for example via tmux), deploy the model via:
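A deployment sketch that serves the model with llama-server's OpenAI-compatible API; the repo id and port are assumptions:

```shell
# Serve the model with an OpenAI-compatible API on port 8001.
./llama.cpp/llama-server \
    -hf unsloth/Nemotron-3-Nano-4B-GGUF:Q8_0 \
    --jinja --ctx-size 262144 --n-gpu-layers 99 \
    --host 0.0.0.0 --port 8001
```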
When you run the above, you will get:

Then, in a new terminal, after running pip install openai, do:
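A minimal client sketch using the openai package against the local server; the port and model name are assumptions that must match your llama-server flags:

```python
from openai import OpenAI

# Point the OpenAI client at the local llama-server endpoint; the port and
# model name are assumptions that must match your llama-server flags.
client = OpenAI(base_url="http://127.0.0.1:8001/v1", api_key="sk-no-key-needed")

response = client.chat.completions.create(
    model="Nemotron-3-Nano-4B",  # llama-server serves whichever model it loaded
    messages=[{"role": "user", "content": "What is 2+2?"}],
    temperature=1.0,
    top_p=1.0,
)
print(response.choices[0].message.content)
```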
Which will print
Benchmarks
Nemotron-3-Nano-4B is the best-performing model for its size, including in throughput.

Nemotron-3-Nano-30B-A3B is the best-performing model across all benchmarks, including throughput.
