🧩NVIDIA Nemotron 3 Nano - How To Run Guide
Run & fine-tune NVIDIA Nemotron 3 Nano locally on your device!
NVIDIA releases Nemotron-3-Nano-4B, a 4B open hybrid MoE model that follows Nemotron-3-Super-120B-A12B and Nemotron-3-Nano-30B-A3B. The Nemotron family is designed for fast, accurate coding, math, and agentic workloads. The models feature a 1M-token context window and are competitive across reasoning, chat, and throughput benchmarks.
Nemotron-3-Nano-4B runs on 5GB of RAM, VRAM, or unified memory. Nemotron-3-Nano-30B-A3B runs on 24GB of RAM. Nemotron 3 can now be fine-tuned locally via Unsloth. Thanks to NVIDIA for giving Unsloth day-zero support.
⚙️ Usage Guide
NVIDIA recommends these settings for inference:
General chat/instruction (default):
temperature = 1.0, top_p = 1.0
Tool calling use-cases:
temperature = 0.6, top_p = 0.95
For most local use, set:
max_new_tokens = 32,768 to 262,144 for standard prompts, with a maximum of 1M tokens. Increase this for deep reasoning or long-form generation as your RAM/VRAM allows.
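As a concrete sketch, the recommended settings can be collected into generation-kwarg presets for Hugging Face transformers' `model.generate` (the preset names below are ours, not NVIDIA's):

```python
# Illustrative presets; the values are NVIDIA's recommendations above,
# the dictionary names are our own.
CHAT_SETTINGS = {"temperature": 1.0, "top_p": 1.0, "do_sample": True}
TOOL_SETTINGS = {"temperature": 0.6, "top_p": 0.95, "do_sample": True}

# Raise max_new_tokens (32,768 up to 262,144) as your RAM/VRAM allows.
GENERATION_KWARGS = {**CHAT_SETTINGS, "max_new_tokens": 32768}

# Usage with a loaded model: model.generate(**inputs, **GENERATION_KWARGS)
print(GENERATION_KWARGS)
```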
You can inspect the chat template format by running:
tokenizer.apply_chat_template([
{"role" : "user", "content" : "What is 1+1?"},
{"role" : "assistant", "content" : "2"},
{"role" : "user", "content" : "What is 2+2?"}
], add_generation_prompt = True, tokenize = False,
)
Because the model was trained with NoPE (it uses no explicit positional embeddings), you only need to change max_position_embeddings to extend the context; YaRN isn't needed.
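Assuming a standard Hugging Face config.json layout, that context extension amounts to editing a single field; a minimal sketch:

```python
def extend_context(config: dict, new_max: int = 1_048_576) -> dict:
    """Return a copy of the config.json contents with a longer context.

    Since Nemotron 3 Nano uses NoPE, there is no rope_scaling / YaRN
    entry to touch; only max_position_embeddings changes.
    """
    cfg = dict(config)
    cfg["max_position_embeddings"] = new_max
    return cfg

# Example: stretch the default 262,144 window to 1M tokens.
print(extend_context({"max_position_embeddings": 262_144}))
```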
Nemotron 3 chat template format:
Nemotron 3 uses <think> (token ID 12) and </think> (token ID 13) for reasoning. Pass --special to llama.cpp to print the special tokens; you may also need --verbose-prompt to see <think>, since it is prepended automatically.
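If you post-process model output yourself, the reasoning span can be stripped with a small helper (a sketch; the tag strings are the ones noted above):

```python
import re

def strip_reasoning(text: str) -> str:
    """Drop <think>...</think> spans from decoded model output."""
    text = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL)
    # The opening <think> is prepended automatically, so it can appear
    # without a closing tag while the model is still reasoning.
    return text.replace("<think>", "").strip()

print(strip_reasoning("<think>1+1 equals 2.</think>The answer is 2."))
# The answer is 2.
```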
🖥️ Run Nemotron-3-Nano-4B
Depending on your use-case you will need to use different settings. Some GGUFs end up similar in size because the model architecture (like gpt-oss) has dimensions not divisible by 128, so parts can’t be quantized to lower bits.
The 4-bit versions of the model require ~3GB of RAM; 8-bit requires 5GB.
Llama.cpp Tutorial (GGUF):
Instructions to run in llama.cpp (we'll be using 8-bit for near full precision):
Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.
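The build steps can be sketched as follows (the standard llama.cpp CMake flow; swap the CUDA flag as noted above for CPU-only inference):

```shell
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build -DGGML_CUDA=ON -DBUILD_SHARED_LIBS=OFF
cmake --build llama.cpp/build --config Release -j
# Binaries (llama-cli, llama-server, ...) land in llama.cpp/build/bin
```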
You can directly pull from Hugging Face. You can increase the context to 1M as your RAM/VRAM allows.
Follow this for general instruction use-cases:
Follow this for tool-calling use-cases:
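The two invocations above can be sketched together as follows. The unsloth/Nemotron-3-Nano-4B-GGUF repo id and :Q8_0 tag are assumptions, so substitute the actual GGUF repo name:

```shell
# General chat / instruction (temperature = 1.0, top_p = 1.0)
./llama.cpp/build/bin/llama-cli -hf unsloth/Nemotron-3-Nano-4B-GGUF:Q8_0 \
    --jinja --temp 1.0 --top-p 1.0 --ctx-size 262144

# Tool calling (temperature = 0.6, top_p = 0.95)
./llama.cpp/build/bin/llama-cli -hf unsloth/Nemotron-3-Nano-4B-GGUF:Q8_0 \
    --jinja --temp 0.6 --top-p 0.95 --ctx-size 262144
```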
Download the model via (after installing pip install huggingface_hub hf_transfer ). You can choose Q8_0 or other quantized versions.
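One way to script the download with huggingface_hub (the repo id below is an assumption; adjust the pattern to pick a different quantization):

```python
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"  # faster downloads via hf_transfer

from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="unsloth/Nemotron-3-Nano-4B-GGUF",  # assumed repo id
    local_dir="Nemotron-3-Nano-4B-GGUF",
    allow_patterns=["*Q8_0*"],  # fetch only the 8-bit files
)
```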
Then run the model in conversation mode:
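A sketch of the conversation-mode launch, assuming the Q8_0 file downloaded above (the exact .gguf filename will vary):

```shell
./llama.cpp/build/bin/llama-cli \
    -m Nemotron-3-Nano-4B-GGUF/Nemotron-3-Nano-4B-Q8_0.gguf \
    --jinja --temp 1.0 --top-p 1.0 --ctx-size 262144 -cnv
```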
Also, adjust the context window as required, and make sure your hardware can handle going beyond a 256K context window. Setting it to 1M may trigger a CUDA out-of-memory crash, which is why the default is 262,144.
🖥️ Run Nemotron-3-Nano-30B-A3B
Depending on your use-case you will need to use different settings. Some GGUFs end up similar in size because the model architecture (like gpt-oss) has dimensions not divisible by 128, so parts can’t be quantized to lower bits.
The 4-bit versions of the model require ~24GB of RAM; 8-bit requires 36GB.
Llama.cpp Tutorial (GGUF):
Instructions to run in llama.cpp (note we will be using 4-bit to fit most devices):
Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference. For Apple Mac / Metal devices, set -DGGML_CUDA=OFF then continue as usual - Metal support is on by default.
You can directly pull from Hugging Face. You can increase the context to 1M as your RAM/VRAM allows.
Follow this for general instruction use-cases:
Follow this for tool-calling use-cases:
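Sketched together, the two pulls might look like this. The unsloth/Nemotron-3-Nano-30B-A3B-GGUF repo id is an assumption; -ngl 99 offloads all layers to the GPU if you have one:

```shell
# General chat / instruction (temperature = 1.0, top_p = 1.0)
./llama.cpp/build/bin/llama-cli -hf unsloth/Nemotron-3-Nano-30B-A3B-GGUF:UD-Q4_K_XL \
    --jinja --temp 1.0 --top-p 1.0 --ctx-size 262144 -ngl 99

# Tool calling (temperature = 0.6, top_p = 0.95)
./llama.cpp/build/bin/llama-cli -hf unsloth/Nemotron-3-Nano-30B-A3B-GGUF:UD-Q4_K_XL \
    --jinja --temp 0.6 --top-p 0.95 --ctx-size 262144 -ngl 99
```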
Download the model via (after installing pip install huggingface_hub hf_transfer ). You can choose UD-Q4_K_XL or other quantized versions.
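A scripted download with huggingface_hub, selecting the UD-Q4_K_XL files (the repo id is an assumption):

```python
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"  # faster downloads via hf_transfer

from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="unsloth/Nemotron-3-Nano-30B-A3B-GGUF",  # assumed repo id
    local_dir="Nemotron-3-Nano-30B-A3B-GGUF",
    allow_patterns=["*UD-Q4_K_XL*"],  # fetch only the 4-bit dynamic quant
)
```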
Then run the model in conversation mode:
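A sketch of the conversation-mode launch with the downloaded file (the exact .gguf filename will vary; drop -ngl for CPU-only runs):

```shell
./llama.cpp/build/bin/llama-cli \
    -m Nemotron-3-Nano-30B-A3B-GGUF/Nemotron-3-Nano-30B-A3B-UD-Q4_K_XL.gguf \
    --jinja --temp 1.0 --top-p 1.0 --ctx-size 262144 -ngl 99 -cnv
```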
Also, adjust the context window as required, and make sure your hardware can handle going beyond a 256K context window. Setting it to 1M may trigger a CUDA out-of-memory crash, which is why the default is 262,144.
🦥 Fine-tuning Nemotron 3 and RL
Unsloth now supports fine-tuning of all Nemotron models, including Nemotron 3 Super and Nano.
The 4B model fits on a free Colab GPU; the 30B model does not, so we also made an 80GB A100 Colab notebook for you to fine-tune with. 16-bit LoRA fine-tuning of Nemotron 3 Nano will use around 60GB of VRAM:
On fine-tuning MoEs: it's usually not a good idea to fine-tune the router layer, so we disable it by default. If you want the model to retain its reasoning capabilities (optional), use a mix of direct answers and chain-of-thought examples, with roughly 75% reasoning to 25% non-reasoning in your dataset.
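The 75/25 split can be sketched as a tiny dataset-mixing helper (function and argument names are ours, purely illustrative):

```python
import random

def mix_datasets(reasoning, direct, reasoning_frac=0.75, seed=3407):
    """Blend chain-of-thought and direct-answer examples at ~75/25.

    Keeps every reasoning example and samples just enough direct-answer
    examples to hit the target fraction.
    """
    n_direct = min(len(direct),
                   int(len(reasoning) * (1 - reasoning_frac) / reasoning_frac))
    rng = random.Random(seed)
    mixed = list(reasoning) + rng.sample(list(direct), n_direct)
    rng.shuffle(mixed)
    return mixed

mixed = mix_datasets(["cot"] * 750, ["direct"] * 500)
print(len(mixed))  # 750 reasoning + 250 direct = 1000
```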
✨Reinforcement Learning + NeMo Gym
We worked with the open-source NVIDIA NeMo Gym team to help democratize RL environments. Our collaboration enables single-turn rollout RL training for many domains of interest, including math, coding, and tool use, using training environments and datasets from NeMo Gym:
Also check out our latest collab guide published on NVIDIA’s official Developer blog:
🦙Llama-server serving & deployment
To deploy Nemotron 3 for production, we use llama-server. In a new terminal (for example via tmux), deploy the model via:
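A sketch of such a deployment, assuming the 4B Q8_0 GGUF from earlier (the path and port are illustrative):

```shell
./llama.cpp/build/bin/llama-server \
    -m Nemotron-3-Nano-4B-GGUF/Nemotron-3-Nano-4B-Q8_0.gguf \
    --host 0.0.0.0 --port 8001 --ctx-size 262144 \
    --jinja --temp 1.0 --top-p 1.0
```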
When you run the above, you will get:

Then in a new terminal, after doing pip install openai, do:
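A minimal client sketch against that endpoint. The port must match whatever llama-server was started with; llama-server serves a single model, so the model field is mostly informational:

```python
from openai import OpenAI

# Point the OpenAI client at the local llama-server endpoint.
client = OpenAI(base_url="http://localhost:8001/v1", api_key="sk-no-key-required")

response = client.chat.completions.create(
    model="nemotron-3-nano",  # name is not checked by llama-server
    messages=[{"role": "user", "content": "What is 2+2?"}],
    temperature=1.0,
    top_p=1.0,
)
print(response.choices[0].message.content)
```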
which will print the model's response.
Benchmarks
Nemotron-3-Nano-4B is the best-performing model for its size, including throughput.

Nemotron-3-Nano-30B-A3B is the best-performing model across all benchmarks, including throughput.
