# NVIDIA Nemotron-3-Super: How To Run Guide

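NVIDIA has released **Nemotron-3-Super-120B-A12B**, an open hybrid reasoning MoE model with 120B total parameters and 12B active parameters. It follows the earlier release of [Nemotron-3-Nano](/docs/zh/mo-xing/nemotron-3.md), its 30B counterpart. Nemotron-3-Super is designed for efficient, accurate multi-agent AI. With a **1M-token** context window, it leads its size class on AIME 2025, Terminal Bench and SWE-Bench Verified while also delivering the highest throughput.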

Nemotron-3-Super runs on a device with **64GB** of RAM, VRAM or unified memory, and can now be fine-tuned locally. Thanks to NVIDIA for providing Unsloth with day-zero support.

<a href="/pages/744e7d433d981b3fd86d1a7ad5e4f1d406c1c0eb#run-nemotron-3-super-120b" class="button primary">Nemotron 3 Super</a><a href="/pages/744e7d433d981b3fd86d1a7ad5e4f1d406c1c0eb" class="button secondary">Nemotron 3 Nano</a>

GGUF: [Nemotron-3-Super-120B-A12B-GGUF](https://huggingface.co/unsloth/NVIDIA-Nemotron-3-Super-120B-A12B-GGUF) • [NVFP4](https://huggingface.co/unsloth/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4) • [FP8](https://huggingface.co/unsloth/NVIDIA-Nemotron-3-Super-120B-A12B-FP8) • [BF16](https://huggingface.co/unsloth/NVIDIA-Nemotron-3-Super-120B-A12B)

### ⚙️ Usage Guide

NVIDIA recommends the following settings for inference:

{% columns %}
{% column %}
**General chat/instruction (default):**

* `temperature = 1.0`
* `top_p = 1.0`
  {% endcolumn %}

{% column %}
**Tool-calling use cases:**

* `temperature = 0.6`
* `top_p = 0.95`
  {% endcolumn %}
  {% endcolumns %}

**For most local use, set:**

* `max_new_tokens` = `32,768` to `262,144` for standard prompts, with a maximum of 1M tokens
* Increase it for deep reasoning or long-form generation as your RAM/VRAM allows.
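The recommended values above can be kept in one place and selected per use case. The following is a minimal sketch, not an official API: `SAMPLING_PRESETS` and `generation_kwargs` are hypothetical names, and `do_sample` is included on the assumption that the result is passed to a Hugging Face `generate()` call.

```python
# Recommended sampling settings from the guide, keyed by use case.
SAMPLING_PRESETS = {
    "chat": {"temperature": 1.0, "top_p": 1.0},
    "tool_calling": {"temperature": 0.6, "top_p": 0.95},
}


def generation_kwargs(use_case: str = "chat", max_new_tokens: int = 32_768) -> dict:
    """Build generation arguments for a use case (hypothetical helper)."""
    if use_case not in SAMPLING_PRESETS:
        raise ValueError(f"unknown use case: {use_case!r}")
    kwargs = dict(SAMPLING_PRESETS[use_case])  # copy so the preset stays untouched
    kwargs["max_new_tokens"] = max_new_tokens
    kwargs["do_sample"] = True  # assumption: sampling must be enabled explicitly
    return kwargs


print(generation_kwargs("tool_calling"))
```

Raise `max_new_tokens` toward `262_144` only when your memory budget allows it.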

The chat template format can be seen by running:

{% code overflow="wrap" %}

```python
tokenizer.apply_chat_template([
    {"role" : "user", "content" : "What is 1+1?"},
    {"role" : "assistant", "content" : "2"},
    {"role" : "user", "content" : "What is 2+2?"}
    ], add_generation_prompt = True, tokenize = False,
)
```

{% endcode %}

{% hint style="success" %}
Because the model was trained with NoPE, you only need to change `max_position_embeddings`. The model does not use explicit positional embeddings, so YaRN is not needed.
{% endhint %}

#### Nemotron 3 chat template format:

{% hint style="info" %}
Nemotron 3 uses `<think>` (token ID 12) and `</think>` (token ID 13) for reasoning. Use `--special` to see the tokens in llama.cpp. You may also need `--verbose-prompt` to see `<think>`, since it is prepended.
{% endhint %}

{% code overflow="wrap" lineNumbers="true" %}

```
<|im_start|>system\n<|im_end|>\n<|im_start|>user\nWhat is 1+1?<|im_end|>\n<|im_start|>assistant\n<think></think>2<|im_end|>\n<|im_start|>user\nWhat is 2+2?<|im_end|>\n<|im_start|>assistant\n<think>\n
```

{% endcode %}
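To make the layout above concrete, here is a minimal sketch that rebuilds the same string by hand. It only mirrors the example shown (an empty system turn, `<think></think>` before earlier assistant replies, and an opened `<think>` at the generation prompt). `render_nemotron3_prompt` is a hypothetical helper, and the tokenizer's own chat template remains the authoritative source, especially for tool calls or other cases not covered here.

```python
def render_nemotron3_prompt(messages, add_generation_prompt=True):
    """Render messages in the layout shown above (illustrative only)."""
    system = ""
    if messages and messages[0]["role"] == "system":
        # Assumption: a leading system message fills the system turn.
        system = messages[0]["content"]
        messages = messages[1:]
    parts = [f"<|im_start|>system\n{system}<|im_end|>\n"]
    for message in messages:
        if message["role"] == "assistant":
            # Earlier assistant turns carry an empty reasoning block.
            parts.append(
                f"<|im_start|>assistant\n<think></think>{message['content']}<|im_end|>\n"
            )
        else:
            parts.append(
                f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
            )
    if add_generation_prompt:
        # The reasoning block is opened for the model to continue.
        parts.append("<|im_start|>assistant\n<think>\n")
    return "".join(parts)


prompt = render_nemotron3_prompt([
    {"role": "user", "content": "What is 1+1?"},
    {"role": "assistant", "content": "2"},
    {"role": "user", "content": "What is 2+2?"},
])
print(repr(prompt))
```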

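### 🖥️ Run Nemotron-3-Super-120B-A12B

Different use cases call for different settings. Some GGUFs end up similar in size because the model architecture (like [gpt-oss](/docs/zh/mo-xing/gpt-oss-how-to-run-and-fine-tune.md)) has dimensions that are not divisible by 128, so some parts cannot be quantized to lower bits. The GGUFs are available [here](https://huggingface.co/unsloth/NVIDIA-Nemotron-3-Super-120B-A12B-GGUF).

The 4-bit version of the model needs roughly 64GB to 72GB of RAM. The 8-bit version needs 128GB.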
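As a rough sanity check on these figures, weight memory scales with parameter count times bits per weight. The sketch below assumes a flat 4 or 8 bits per weight and ignores the KV cache, activations and runtime overhead. Real quantized files mix bit widths, so they land above this floor. `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# All 120B parameters must be resident, even though only 12B are active per token.
TOTAL_PARAMS = 120e9

for bits in (4, 8):
    print(f"{bits}-bit: ~{weight_memory_gb(TOTAL_PARAMS, bits):.0f} GB of weights")
```

This gives about 60 GB at 4-bit and 120 GB at 8-bit, which sits just under the 64GB to 72GB and 128GB requirements once overhead is added.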

#### Llama.cpp Tutorial (GGUF):

Instructions to run in llama.cpp (note that we use 4-bit to fit most devices):

{% stepper %}
{% step %}
Obtain the latest `llama.cpp` from [GitHub here](https://github.com/ggml-org/llama.cpp). You can also follow the build instructions below. Change `-DGGML_CUDA=ON` to `-DGGML_CUDA=OFF` if you don't have a GPU or just want CPU inference.

{% code overflow="wrap" %}

```bash
apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-mtmd-cli llama-server llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp
```

{% endcode %}
{% endstep %}

{% step %}
You can pull the model directly from Hugging Face. You can increase the context to 1M as your RAM/VRAM allows.

Follow this for **general instruction** use cases:

```bash
./llama.cpp/llama-cli \
    -hf unsloth/NVIDIA-Nemotron-3-Super-120B-A12B-GGUF:UD-Q4_K_XL \
    --ctx-size 16384 \
    --temp 1.0 --top-p 1.0
```

Follow this for **tool-calling** use cases:

```bash
./llama.cpp/llama-cli \
    -hf unsloth/NVIDIA-Nemotron-3-Super-120B-A12B-GGUF:UD-Q4_K_XL \
    --ctx-size 32768 \
    --temp 0.6 --top-p 0.95
```

{% endstep %}

{% step %}
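Download the model with the command below (after running `pip install huggingface_hub hf_transfer`). You can choose Q4\_K\_M or other quantized versions such as `UD-Q4_K_XL`. We recommend using at least the 2-bit dynamic quant `UD-Q2_K_XL` to balance size and accuracy. If downloads get stuck, see: [Hugging Face Hub, XET debugging](/docs/zh/ji-chu/troubleshooting-and-faqs/hugging-face-hub-xet-debugging.md)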

```bash
hf download unsloth/NVIDIA-Nemotron-3-Super-120B-A12B-GGUF \
    --local-dir unsloth/NVIDIA-Nemotron-3-Super-120B-A12B-GGUF \
    --include "*UD-Q4_K_XL*" # Use "*UD-Q2_K_XL*" for dynamic 2-bit
```
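The same download can be scripted from Python. This is a minimal sketch, assuming `huggingface_hub` is installed and using its `snapshot_download` function with `allow_patterns` in place of `--include`. `download_kwargs` and `download` are hypothetical helper names.

```python
REPO_ID = "unsloth/NVIDIA-Nemotron-3-Super-120B-A12B-GGUF"


def download_kwargs(quant: str = "UD-Q4_K_XL") -> dict:
    """Arguments mirroring the `hf download` command above."""
    return {
        "repo_id": REPO_ID,
        "local_dir": REPO_ID,  # same folder layout as --local-dir
        "allow_patterns": [f"*{quant}*"],  # same filter as --include
    }


def download(quant: str = "UD-Q4_K_XL") -> str:
    """Download the selected quant and return the local folder path."""
    # Imported lazily so download_kwargs() works without the package installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(**download_kwargs(quant))


# Example (downloads tens of GB, so it is left commented out):
# path = download("UD-Q2_K_XL")
```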

{% endstep %}

{% step %}
Then run the model in conversation mode:

{% code overflow="wrap" %}

```bash
./llama.cpp/llama-cli \
    --model unsloth/NVIDIA-Nemotron-3-Super-120B-A12B-GGUF/UD-Q4_K_XL/NVIDIA-Nemotron-3-Super-120B-A12B-UD-Q4_K_XL-00001-of-00003.gguf \
    --ctx-size 16384 \
    --seed 3407 \
    --prio 2 \
    --temp 0.6 \
    --top-p 0.95
```

{% endcode %}

<figure><img src="/files/9ed516f64cfc3648a1a6dc08c481b21ffeccab47" alt=""><figcaption></figcaption></figure>

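Also adjust the **context window** as required. Make sure your hardware can handle a context window larger than 256K. Setting it to 1M may trigger a CUDA OOM and crash, which is why the default is 262,144.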
{% endstep %}
{% endstepper %}

### 🦥 Fine-tuning Nemotron 3 and RL

Unsloth now supports fine-tuning all Nemotron models, including Nemotron 3 Super and Nano. For Nano notebook examples, see our Nemotron 3 [Nano fine-tuning guide](/docs/zh/mo-xing/nemotron-3.md).

#### Nemotron 3 Super

* Router-layer fine-tuning is disabled by default for stability.
* Nemotron-3-Super-120B bf16 LoRA works on 256GB of VRAM. If you're using multiple GPUs, add `device_map = "balanced"` or follow our [multi-GPU guide](/docs/zh/ji-chu/multi-gpu-training-with-unsloth.md).

### 🦙 Llama-server serving & deployment

To deploy Nemotron 3 for production, we use `llama-server`. In a new terminal, for example via tmux, deploy the model with:

{% code overflow="wrap" %}

```bash
./llama.cpp/llama-server \
    --model unsloth/NVIDIA-Nemotron-3-Super-120B-A12B-GGUF/UD-Q4_K_XL/NVIDIA-Nemotron-3-Super-120B-A12B-UD-Q4_K_XL-00001-of-00003.gguf \
    --alias "unsloth/NVIDIA-Nemotron-3-Super-120B-A12B" \
    --prio 3 \
    --min-p 0.01 \
    --temp 0.6 \
    --top-p 0.95 \
    --ctx-size 16384 \
    --port 8001
```

{% endcode %}

When you run the command above, you will get:

<figure><img src="/files/8e48ce2194b66023c3a0765a6de6bb2d3db7f501" alt=""><figcaption></figcaption></figure>

Then, in a new terminal, after running `pip install openai`, run:

{% code overflow="wrap" %}

```python
from openai import OpenAI
import json
openai_client = OpenAI(
    base_url = "http://127.0.0.1:8001/v1",
    api_key = "sk-no-key-required",
)
completion = openai_client.chat.completions.create(
    model = "unsloth/NVIDIA-Nemotron-3-Super-120B-A12B",
    messages = [{"role": "user", "content": "What is 2+2?"},],
)
print(completion.choices[0].message.reasoning_content)
print(completion.choices[0].message.content)
```

{% endcode %}

This will print:

{% code overflow="wrap" %}

```
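Okay, the user asked "What is 2+2?" This looks like a very basic arithmetic question.

Hmm, maybe they're testing whether I'm paying attention, or it could be a child learning math. It could also be someone checking whether I'll overcomplicate a simple question.

I should keep it direct, since nothing in the question suggests a trick. The answer is definitely 4; there's no need to second-guess basic addition.

Still, I do wonder whether they're setting up a joke (like "2+2=5 for large values of 2"), but since they gave no context, I'll assume it's a sincere question.

Best to answer clearly and warmly; if they're learning, that may encourage them to keep asking. No need to ramble, though; just state the fact helpfully.

2 + 2 equals **4**.

This is a basic arithmetic fact in decimal notation. If you meant a different context (such as modular arithmetic, binary, or a joke/reference), feel free to say so. I'm happy to adapt! 😊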
```

{% endcode %}
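The server can also be called without the `openai` package. The sketch below uses only the standard library and assumes the server from the command above is listening on port 8001, exposes the OpenAI-compatible `/v1/chat/completions` route, and accepts per-request `temperature` and `top_p` values. `build_chat_request` and `ask` are hypothetical helper names.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8001/v1"  # matches --port 8001 above
MODEL = "unsloth/NVIDIA-Nemotron-3-Super-120B-A12B"  # matches --alias above


def build_chat_request(prompt, temperature=0.6, top_p=0.95):
    """Build a POST request for the chat completions route."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,  # assumption: overrides the server default
        "top_p": top_p,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, **sampling):
    """Send the request and return the final answer text (needs a running server)."""
    with urllib.request.urlopen(build_chat_request(prompt, **sampling)) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Example, with the server running:
# print(ask("What is 2+2?", temperature=1.0, top_p=1.0))
```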

### Benchmarks

Nemotron 3 Super is competitive with similarly sized models while delivering the highest throughput.

<figure><img src="/files/cc7ea788a7ac51c7862da7480c982b50542de550" alt=""><figcaption></figcaption></figure>

