🧩 NVIDIA Nemotron-3-Super: How To Run Guide

Run & fine-tune NVIDIA Nemotron-3-Super-120B-A12B locally on your device!

NVIDIA releases Nemotron-3-Super-120B-A12B, a 120B open hybrid reasoning MoE model with 12B active parameters, following the earlier launch of Nemotron-3-Nano, its 30B counterpart. Nemotron-3-Super is designed for high efficiency and accuracy for multi-agent AI. With a 1M-token context window, it leads its size class on AIME 2025, Terminal Bench and SWE-Bench Verified benchmarks, while achieving the highest throughput.

Nemotron-3-Super runs on a device with 64GB of RAM, VRAM, or unified memory and can now be fine-tuned locally. Thanks to NVIDIA for giving Unsloth day-zero support.


GGUF: NVIDIA-Nemotron-3-Super-120B-A12B-GGUF

⚙️ Usage Guide

NVIDIA recommends these settings for inference:

General chat/instruction (default):

  • temperature = 1.0

  • top_p = 1.0

Tool calling use-cases:

  • temperature = 0.6

  • top_p = 0.95

For most local use, set:

  • max_new_tokens = 32,768 to 262,144 for standard prompts; the model supports up to 1M tokens.

  • Increase for deep reasoning or long-form generation as your RAM/VRAM allows.

You can inspect the chat template format by applying it to a short conversation:

# assumes `tokenizer` was already loaded, e.g. via AutoTokenizer.from_pretrained(...)
print(tokenizer.apply_chat_template([
    {"role" : "user", "content" : "What is 1+1?"},
    {"role" : "assistant", "content" : "2"},
    {"role" : "user", "content" : "What is 2+2?"},
], add_generation_prompt = True, tokenize = False,
))

Nemotron 3 chat template format:


Nemotron 3 uses <think> (token ID 12) and </think> (token ID 13) for reasoning. In llama.cpp, use --special to display these tokens. You might also need --verbose-prompt to see <think>, since it is prepended to the prompt.
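For example (the unsloth GGUF repo id and quant tag below are assumptions), the two flags can be combined in a llama-cli invocation:

```shell
# Show special tokens such as <think>/</think> in llama.cpp output
./llama.cpp/llama-cli \
    -hf unsloth/NVIDIA-Nemotron-3-Super-120B-A12B-GGUF:UD-Q4_K_XL \
    --jinja \
    --special \
    --verbose-prompt
```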

🖥️ Run Nemotron-3-Super-120B-A12B

Depending on your use-case, you will need different settings. Some GGUFs end up similar in size because the model architecture (like gpt-oss) has dimensions not divisible by 128, so those parts can't be quantized to lower bits. Access GGUFs here.

The 4-bit versions of the model require ~64–72GB of RAM; 8-bit requires 128GB.
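These figures follow from simple arithmetic. A rough size estimate (an approximation that ignores quantization overhead and the unquantizable dimensions mentioned above) is parameters × bits-per-weight ÷ 8:

```python
# Rough GGUF size estimate: params (billions) * bits-per-weight / 8 -> gigabytes.
# Ignores per-tensor overhead and layers kept at higher precision, so real
# files come out somewhat larger than this lower bound.
def approx_size_gb(params_billion: float, bits: int) -> float:
    return params_billion * bits / 8

print(approx_size_gb(120, 4))  # 60.0 -> ~64-72GB in practice with overhead
print(approx_size_gb(120, 8))  # 120.0 -> ~128GB in practice
```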

Llama.cpp Tutorial (GGUF):

Instructions to run in llama.cpp (note we will be using 4-bit to fit most devices):

1. Obtain the latest llama.cpp from GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.
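A typical build sketch, assuming a CUDA toolchain is installed (flip -DGGML_CUDA=OFF for CPU-only):

```shell
# Clone and build llama.cpp with CUDA support
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON
cmake --build llama.cpp/build --config Release -j --target llama-cli llama-server
cp llama.cpp/build/bin/llama-* llama.cpp/
```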

2. You can directly pull from Hugging Face. You can increase the context to 1M as your RAM/VRAM allows.

Follow this for general instruction use-cases:
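A sketch of a llama-cli invocation with the recommended chat settings (the unsloth GGUF repo id and quant tag are assumptions):

```shell
# General chat/instruction: temperature 1.0, top_p 1.0
./llama.cpp/llama-cli \
    -hf unsloth/NVIDIA-Nemotron-3-Super-120B-A12B-GGUF:UD-Q4_K_XL \
    --jinja \
    --temp 1.0 \
    --top-p 1.0 \
    -c 262144 \
    -ngl 99
```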

Follow this for tool-calling use-cases:
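The same invocation with the tool-calling sampling settings (again, repo id and quant tag are assumptions):

```shell
# Tool calling: temperature 0.6, top_p 0.95
./llama.cpp/llama-cli \
    -hf unsloth/NVIDIA-Nemotron-3-Super-120B-A12B-GGUF:UD-Q4_K_XL \
    --jinja \
    --temp 0.6 \
    --top-p 0.95 \
    -c 262144 \
    -ngl 99
```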

3. Download the model (after running pip install huggingface_hub hf_transfer). You can choose Q4_K_M or other quantized versions like UD-Q4_K_XL. We recommend using at least the 2-bit dynamic quant UD-Q2_K_XL to balance size and accuracy. If downloads get stuck, see: Hugging Face Hub, XET debugging
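One way to fetch just one quant with huggingface_hub (the unsloth repo id, local directory, and quant pattern are assumptions):

```python
# Download only the UD-Q4_K_XL files from the GGUF repo
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"  # faster downloads via hf_transfer
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id = "unsloth/NVIDIA-Nemotron-3-Super-120B-A12B-GGUF",
    local_dir = "NVIDIA-Nemotron-3-Super-120B-A12B-GGUF",
    allow_patterns = ["*UD-Q4_K_XL*"],  # swap the pattern for Q4_K_M etc.
)
```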

4. Then run the model in conversation mode:
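A sketch using the locally downloaded quant (the directory and file name are assumptions; large quants may be split into parts, in which case point -m at the first part):

```shell
# Run in conversation mode from the local GGUF
./llama.cpp/llama-cli \
    -m NVIDIA-Nemotron-3-Super-120B-A12B-GGUF/NVIDIA-Nemotron-3-Super-120B-A12B-UD-Q4_K_XL.gguf \
    --jinja \
    --temp 1.0 \
    --top-p 1.0 \
    -c 262144 \
    -ngl 99 \
    -cnv
```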

Also, adjust the context window as required, and ensure your hardware can handle more than a 256K context window before raising it. Setting it to 1M may trigger a CUDA OOM crash, which is why the default here is 262,144.

🦥 Fine-tuning Nemotron 3 and RL

Unsloth now supports fine-tuning of all Nemotron models, including Nemotron 3 Super and Nano. For notebook examples of Nano, see our Nemotron 3 Nano fine-tuning guide.

Nemotron 3 Super

  • Router-layer fine-tuning is disabled by default for stability.

  • Nemotron-3-Super-120B: bf16 LoRA works on 256GB of VRAM. If you're using multiple GPUs, add device_map = "balanced" or follow our multi-GPU guide.
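A minimal LoRA setup sketch with Unsloth; the Hugging Face model id, sequence length, and LoRA hyperparameters here are illustrative assumptions:

```python
# Sketch: bf16 LoRA fine-tuning with Unsloth across multiple GPUs
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B",  # assumed model id
    max_seq_length = 8192,
    load_in_4bit = False,      # bf16 LoRA needs ~256GB of VRAM
    device_map = "balanced",   # spread layers across the available GPUs
)
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    lora_alpha = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"],  # illustrative subset
)
```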

🦙Llama-server serving & deployment

To deploy Nemotron 3 for production, we use llama-server. In a new terminal (e.g., via tmux), deploy the model via:
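A sketch of the server invocation (the model path and port are assumptions):

```shell
# Serve the model over an OpenAI-compatible HTTP API
./llama.cpp/llama-server \
    -m NVIDIA-Nemotron-3-Super-120B-A12B-GGUF/NVIDIA-Nemotron-3-Super-120B-A12B-UD-Q4_K_XL.gguf \
    --jinja \
    --host 0.0.0.0 \
    --port 8001 \
    -c 262144 \
    -ngl 99
```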

When you run the above, llama-server loads the model and starts listening on the configured port.

Then in a new terminal, after doing pip install openai, do:
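A client sketch, assuming llama-server is listening locally on port 8001 (the port and model name are assumptions; llama-server does not check the API key):

```python
# Query the llama-server OpenAI-compatible endpoint
from openai import OpenAI

client = OpenAI(
    base_url = "http://127.0.0.1:8001/v1",
    api_key = "sk-no-key-required",  # any placeholder works for llama-server
)
completion = client.chat.completions.create(
    model = "local",  # llama-server serves whatever model it was started with
    messages = [{"role": "user", "content": "What is 2+2?"}],
    temperature = 1.0,
    top_p = 1.0,
)
print(completion.choices[0].message.content)
```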

This will print the model's reply.

Benchmarks

Compared to similarly sized models, Nemotron 3 Super performs competitively while providing the highest throughput.
