
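# NVIDIA Nemotron 3 Ultra - How to Run Locally

NVIDIA Nemotron 3 Ultra is an open **550B-parameter, 55B-active** frontier reasoning model and NVIDIA's **largest model** released to date. Nemotron-3-Ultra-550B-A55B is built for long-running autonomous agents and for reasoning in coding and deep-research workflows. It is the **strongest Western open model** and ships under the new Open Model, Weights & Data license.

With up to **1M context**, Nemotron 3 Ultra uses a Hybrid Transformer-Mamba MoE architecture and can retain long-horizon agent state, logs, and plans across sustained sessions. The dynamic 1-bit GGUF of [Nemotron-3-Ultra-550B-A55B](https://huggingface.co/unsloth/NVIDIA-Nemotron-3-Ultra-550B-A55B-GGUF) takes 189GB of disk space. The model was also pretrained in NVFP4. We also ran [GGUF KLD benchmarks](#kld-benchmarks).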

### ⚙️ Usage Guide

NVIDIA recommends the following inference settings:

* `temperature = 1.0`
* `top_p = 0.95`

| Detail         | Nemotron 3 Ultra                                                                                             |
| -------------- | ------------------------------------------------------------------------------------------------------------ |
| Model size     | 550B total / 55B active parameters                                                                           |
| Context length | Up to 1M tokens                                                                                              |
| Architecture   | Hybrid Transformer-Mamba MoE with Latent MoE, Multi-Token Prediction (MTP is not currently supported in GGUF) |
| Model I/O      | Text in, text out                                                                                            |

The chat template looks like this:

{% code overflow="wrap" %}

```
<|im_start|>system\n<|im_end|>\n<|im_start|>user\nWhat is 1+1?<|im_end|>\n<|im_start|>assistant\n<think></think>2<|im_end|>\n<|im_start|>assistant\n<think>\n
```

{% endcode %}
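
To make the layout concrete, here is a minimal Python sketch that rebuilds the prompt string shown above from a list of messages. The helper name `render_prompt` is hypothetical, and the rule that earlier assistant turns get an empty `<think></think>` prefix is an assumption read off this single example, not NVIDIA's official template. llama.cpp and Unsloth Studio normally apply the template stored in the GGUF for you, so this is only for illustration.

```python
def render_prompt(messages, system=""):
    """Render chat messages into the ChatML-style layout shown above."""
    parts = [f"<|im_start|>system\n{system}<|im_end|>\n"]
    for message in messages:
        content = message["content"]
        if message["role"] == "assistant":
            # Assumption: past assistant turns carry an empty reasoning block.
            content = "<think></think>" + content
        parts.append(f"<|im_start|>{message['role']}\n{content}<|im_end|>\n")
    # Open a new assistant turn and start its reasoning block.
    parts.append("<|im_start|>assistant\n<think>\n")
    return "".join(parts)


prompt = render_prompt([
    {"role": "user", "content": "What is 1+1?"},
    {"role": "assistant", "content": "2"},
])
print(prompt)
```

Printing `prompt` shows the same text as the template above, with each `\n` rendered as a real line break.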

### Run Nemotron-3-Ultra

The 3-bit version of the model needs roughly 256GB of RAM, 4-bit needs about 300GB, and 8-bit needs 600GB. These guides use the 3-bit `UD-IQ3_XXS` quant, which fits on a 256GB device and strikes a good balance between size and accuracy. Depending on your use case, you will need [different settings](#usage-guide). **GGUF:** [Nemotron-3-Ultra-550B-A55B](https://huggingface.co/unsloth/NVIDIA-Nemotron-3-Ultra-550B-A55B-GGUF)

<a href="/pages/9f0ec0c76e850149511575cd28cb1e3e1d462753#unsloth-studio-guide" class="button primary">Run in Unsloth Studio</a><a href="/pages/9f0ec0c76e850149511575cd28cb1e3e1d462753#llama.cpp-tutorial" class="button secondary">Run in llama.cpp</a>
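
As a rough aid for choosing a quant, the RAM figures above can be turned into a small lookup. This is only a sketch: the function name is hypothetical, the thresholds are the approximate numbers quoted in this section, and pairing the 4-bit figure with `UD-Q4_K_XL` is an assumption (the text gives no quant name for 8-bit, so none is listed).

```python
# Approximate RAM needs from this section, smallest first.
# Pairing "4-bit" with UD-Q4_K_XL is an assumption; 8-bit has no name given.
RAM_REQUIREMENTS_GB = [
    ("3-bit (UD-IQ3_XXS)", 256),
    ("4-bit (UD-Q4_K_XL)", 300),
    ("8-bit", 600),
]


def largest_quant_that_fits(available_gb):
    """Return the highest-precision quant whose RAM estimate fits, or None."""
    best = None
    for label, required_gb in RAM_REQUIREMENTS_GB:
        if available_gb >= required_gb:
            best = label
    return best


for ram in (192, 256, 512, 1024):
    print(ram, "GB ->", largest_quant_that_fits(ram))
```

Real memory use also depends on context length and offloading choices, so treat the result as a starting point.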

### 🦥 Unsloth Studio Guide

In this tutorial we use [Unsloth Studio](/docs/zh/xin-de/studio.md), our UI for running and training LLMs. With Unsloth Studio you can run models locally on **Mac, Windows**, and Linux with image and text input, and you can:

{% columns %}
{% column %}

* Search, download, and [run GGUF](/docs/zh/xin-de/studio.md#run-models-locally) and safetensor models
* Compare **models** **side by side**
* [**Self-healing** tool calling](/docs/zh/xin-de/studio.md#execute-code--heal-tool-calling) + **web search**
* [**Code execution**](/docs/zh/xin-de/studio.md#run-models-locally) (Python, Bash)
* [Automatic inference](/docs/zh/xin-de/studio.md#model-arena) parameter tuning (temp, top-p, etc.)
* [Train LLMs](/docs/zh/xin-de/studio.md#no-code-training) 2x faster with 70% less VRAM
  {% endcolumn %}

{% column %}

<div data-with-frame="true"><figure><img src="/files/5af4df407c8134f1ff75a4d7535569361c049e51" alt=""><figcaption></figcaption></figure></div>
{% endcolumn %}
{% endcolumns %}

{% stepper %}
{% step %}

#### Install Unsloth

**MacOS, Linux, WSL:**

```bash
curl -fsSL https://unsloth.ai/install.sh | sh
```

**Windows PowerShell:**

```powershell
irm https://unsloth.ai/install.ps1 | iex
```

{% endstep %}

{% step %}

#### Set up Unsloth Studio (one-time)

Setup automatically installs Node.js (via nvm), builds the frontend, installs all Python dependencies, and builds llama.cpp with CUDA support.

{% hint style="info" %}
**WSL users:** you will be prompted for your `sudo` password to install the build dependencies (`cmake`, `git`, `libcurl4-openssl-dev`).
{% endhint %}
{% endstep %}

{% step %}

#### Launch Unsloth

**MacOS, Linux, WSL:**

```bash
source unsloth_studio/bin/activate
unsloth studio -H 0.0.0.0 -p 8888
```

**Windows PowerShell:**

```powershell
unsloth studio
```

<div data-with-frame="true"><figure><img src="/files/7fd4b2ed7fb55df6d31b4dd1ce1181d57613709b" alt="" width="375"><figcaption></figcaption></figure></div>

Then open `http://127.0.0.1:8888` in your browser.
{% endstep %}

{% step %}

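#### Search for and download Nemotron-3-Ultra

On first launch you need to create a password to protect your account and to log back in later. Then go to the [Unsloth Chat](/docs/zh/xin-de/studio/chat.md) tab, search for Nemotron-3-Ultra in the search bar, and download the model and quant you want.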
{% endstep %}

{% step %}

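#### Run Nemotron-3-Ultra

When you use Unsloth Studio, inference parameters should be set automatically, but you can still change them manually. You can also edit the context length, chat template, and other settings.

For more information, see our [Unsloth Studio inference guide](/docs/zh/xin-de/studio/chat.md).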
{% endstep %}

{% step %}

#### Deploy Nemotron-3-Ultra

You can also use `unsloth studio run` to serve the model through llama-server, like so:

{% code overflow="wrap" %}

```bash
unsloth studio run --model unsloth/NVIDIA-Nemotron-3-Ultra-550B-A55B-GGUF:UD-Q4_K_XL
```

{% endcode %}
{% endstep %}
{% endstepper %}

### 🦙 Llama.cpp Tutorial:

Instructions for running in llama.cpp (note that we use the 3-bit `UD-IQ3_XXS` quant so that it fits more devices):

{% stepper %}
{% step %}
Get the latest `llama.cpp` from [GitHub here](https://github.com/ggml-org/llama.cpp). You can also follow the build instructions below. Change `-DGGML_CUDA=ON` to `-DGGML_CUDA=OFF` if you have no GPU or only want CPU inference. **For Apple Mac / Metal devices**, set `-DGGML_CUDA=OFF` and continue as usual; Metal support is on by default.

{% code overflow="wrap" %}

```bash
apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-mtmd-cli llama-server llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp
```

{% endcode %}
{% endstep %}

{% step %}
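Download the model with the code below (after running `pip install huggingface_hub`). You can choose Q4\_K\_M or another quant such as `UD-Q4_K_XL`. We recommend at least the 2-bit dynamic quant `UD-Q2_K_XL` to balance size and accuracy. If the download stalls, see [Hugging Face Hub, XET debugging](/docs/zh/ji-chu-zhi-shi/troubleshooting-and-faqs/hugging-face-hub-xet-debugging.md).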

{% code overflow="wrap" %}

```bash
pip install huggingface_hub
hf download unsloth/NVIDIA-Nemotron-3-Ultra-550B-A55B-GGUF \
    --local-dir unsloth/NVIDIA-Nemotron-3-Ultra-550B-A55B-GGUF \
    --include "*UD-IQ3_XXS*" # For dynamic 2-bit, use "*UD-Q2_K_XL*"
```

{% endcode %}
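
If you prefer to script the download, the same filter can be expressed in Python with `snapshot_download` from `huggingface_hub`. This is a sketch of an equivalent call rather than an official recipe; the helper names `download_kwargs` and `download_quant` are hypothetical, and the download itself is very large, so it is left as an explicit call.

```python
REPO_ID = "unsloth/NVIDIA-Nemotron-3-Ultra-550B-A55B-GGUF"


def download_kwargs(quant="UD-IQ3_XXS"):
    """Build the arguments passed to snapshot_download for one quant."""
    return {
        "repo_id": REPO_ID,
        "local_dir": REPO_ID,  # same folder the CLI command writes to
        "allow_patterns": [f"*{quant}*"],  # only fetch files for this quant
    }


def download_quant(quant="UD-IQ3_XXS"):
    """Download the shards for one quant and return the local folder path."""
    # Imported here so download_kwargs works without the package installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(**download_kwargs(quant))


print(download_kwargs())
# To start the (very large) download, call: download_quant("UD-IQ3_XXS")
```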
{% endstep %}

{% step %}
Then run the model in conversation mode:

{% code overflow="wrap" %}

```bash
./llama.cpp/llama-cli \
    --model unsloth/NVIDIA-Nemotron-3-Ultra-550B-A55B-GGUF/UD-IQ3_XXS/NVIDIA-Nemotron-3-Ultra-550B-A55B-UD-IQ3_XXS-00001-of-00006.gguf \
    --temp 1.0 \
    --top-p 0.95 \
    --min-p 0.01
```

{% endcode %}
{% endstep %}
{% endstepper %}
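
The `--model` path in the run command points at the first shard of a split GGUF (`-00001-of-00006`). If you switch quants, the folder and shard count may differ, so a small helper can locate that first file for you. This is a sketch with a hypothetical function name; it assumes the download keeps the `<quant>-00001-of-<N>.gguf` naming seen in the command.

```python
from pathlib import Path


def first_shard(local_dir, quant):
    """Return the path of the first split file (-00001-of-N.gguf) for a quant."""
    matches = sorted(Path(local_dir).rglob(f"*{quant}*-00001-of-*.gguf"))
    if not matches:
        raise FileNotFoundError(f"no first shard for {quant} under {local_dir}")
    return matches[0]


# Example, after the download step has finished:
# path = first_shard("unsloth/NVIDIA-Nemotron-3-Ultra-550B-A55B-GGUF", "UD-IQ3_XXS")
# print(path)  # pass this value to --model
```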

#### Serving and deployment with llama-server

To deploy Nemotron-3-Ultra locally, use `llama-server`. In a new terminal, for example via `tmux`, deploy the model:

```bash
./llama.cpp/llama-server \
    -hf unsloth/NVIDIA-Nemotron-3-Ultra-550B-A55B-GGUF:UD-IQ3_XXS \
    --alias "unsloth/NVIDIA-Nemotron-3-Ultra-550B-A55B" \
    --temp 1.0 \
    --top-p 0.95 \
    --port 8001
```

If you downloaded the model manually, use:

{% code overflow="wrap" %}

```bash
./llama.cpp/llama-server \
    --model unsloth/NVIDIA-Nemotron-3-Ultra-550B-A55B-GGUF/UD-IQ3_XXS/NVIDIA-Nemotron-3-Ultra-550B-A55B-UD-IQ3_XXS-00001-of-00006.gguf \
    --alias "unsloth/NVIDIA-Nemotron-3-Ultra-550B-A55B" \
    --temp 1.0 \
    --top-p 0.95 \
    --port 8001
```

{% endcode %}

Then, in a new terminal, after installing the OpenAI client with `pip install openai`, run:

```python
from openai import OpenAI
openai_client = OpenAI(
    base_url = "http://127.0.0.1:8001/v1",
    api_key = "sk-no-key-required",
)
completion = openai_client.chat.completions.create(
    model = "unsloth/NVIDIA-Nemotron-3-Ultra-550B-A55B",
    messages = [
        {"role": "user", "content": "What is 2+2?"},
    ],
)
print(completion.choices[0].message.reasoning_content)
print(completion.choices[0].message.content)
```

<figure><img src="/files/2a32a6055af7d3c1271209209d8441204db39ab0" alt=""><figcaption></figcaption></figure>
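
If you would rather not install the OpenAI client, the same endpoint can be reached with the Python standard library. The sketch below sends the recommended `temperature` and `top_p` explicitly in the request body. The helper names are hypothetical, and the URL assumes the server was started with `--port 8001` and the alias used above.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8001"  # matches --port 8001 in the server command


def build_payload(prompt, temperature=1.0, top_p=0.95):
    """Build a chat completion request body with the recommended sampling."""
    return {
        "model": "unsloth/NVIDIA-Nemotron-3-Ultra-550B-A55B",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "top_p": top_p,
    }


def ask(prompt):
    """POST the payload to the server and return the assistant message dict."""
    request = urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]


print(json.dumps(build_payload("What is 2+2?"), indent=2))
# With the server running: message = ask("What is 2+2?"); print(message["content"])
```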

On 4 B200s, generation reaches roughly 40 tokens/s!

<figure><img src="/files/299ffe1c060b287002fe3a21bcc6ee8283c2c46f" alt=""><figcaption></figcaption></figure>

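### Unsloth GGUF Benchmarks

We also ran a KLD analysis of the GGUF quants, shown here on a log-scale mean KLD axis. Thanks to our [dynamic method](/docs/zh/ji-chu-zhi-shi/unsloth-dynamic-2.0-ggufs.md), the more important layers keep higher precision while the remaining layers use fewer bits.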

<figure><img src="/files/01b83c606d741fc21e0ff27daae7fd86677e9cc4" alt=""><figcaption></figcaption></figure>

On a linear scale:

<figure><img src="/files/a24045f6d76f13d4f19cbb0cb2dadbba55ed4ac1" alt=""><figcaption></figcaption></figure>
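
For readers unfamiliar with the metric, KLD here is taken to mean the Kullback-Leibler divergence between the next-token distribution of a reference model and that of a quantized model, averaged over token positions. The sketch below shows that calculation on made-up logits. It illustrates the usual definition only and is not a description of the exact pipeline behind the charts above.

```python
import math


def softmax(logits):
    """Convert logits to probabilities, shifting by the max for stability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) in nats for two discrete distributions."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def mean_kld(reference_logits, quantized_logits):
    """Average per-token KL between reference and quantized predictions."""
    values = [
        kl_divergence(softmax(ref), softmax(quant))
        for ref, quant in zip(reference_logits, quantized_logits)
    ]
    return sum(values) / len(values)


# Toy logits for two token positions over a 3-word vocabulary (made up).
reference = [[2.0, 1.0, 0.1], [0.5, 2.5, 0.3]]
quantized = [[1.9, 1.1, 0.1], [0.4, 2.4, 0.6]]
print(mean_kld(reference, reference))  # identical outputs give 0.0
print(mean_kld(reference, quantized))  # small positive number
```

A lower mean KLD means the quantized model's predictions stay closer to the reference model's.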

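### Official Benchmarks

Nemotron 3 Ultra is NVIDIA's largest Nemotron 3 reasoning model, designed to deliver leading accuracy on frontier reasoning, coding, and agentic tasks while optimizing time to task completion through high throughput.

Ultra is especially well suited to workloads where task success depends on sustained reasoning rather than short single-turn responses:

* Autonomous coding sessions across large code repositories
* Deep research across multiple sources with conflicting evidence
* Enterprise workflows with sustained tool-use loops
* EDA / chip design verification and failure analysis

As Figures 1 and 2 show, Nemotron 3 Ultra leads in accuracy on agent productivity, instruction following, and long-context tasks, and delivers leading overall performance with up to 30% cost savings compared with other leading open models.

Figure 1: Nemotron 3 Ultra leads open models on agentic benchmarks covering agent productivity, coding, and instruction following.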

<div align="center" data-with-frame="true"><figure><img src="/files/f5d349c877f7fd18698aeb2d77e63272ffab12af" alt="Image of a table showing Nemotron 3 Ultra leading among open models on agentic benchmarks for agent productivity, coding, and instruction following." width="536"><figcaption></figcaption></figure></div>

Figure 2: Nemotron 3 Ultra saves up to 30% in cost and leads on the cost-efficiency frontier.

<div data-with-frame="true"><img src="/files/9689b0c1746b25427347f5ba6c38657484abcb56" alt="Image showing Nemotron 3 Ultra saving up to 30% in cost and leading on the cost-efficiency frontier" width="563"></div>

More benchmarks from NVIDIA:

| Benchmark                                       | N-3-Ultra 550B-A55B | MiniMax-2.7 230B-A10B | GLM-5.1 744B-A40B | Kimi-K2.6 1T-A32B |       |       |      |
| ----------------------------------------------- | :-----------------: | :-------------------: | :---------------: | :---------------: | :---: | :---: | :--: |
| **Agentic**                                     |                     |                       |                   |                   |       |       |      |
| Terminal Bench 2.1                              |         56.4        |          55.5         |        59.3       |        67.2       |  49.9 |  49.2 | 54.2 |
| GDPVal                                          |         46.7        |          47.6         |        54.7       |        50.4       |  34.6 |  54.6 | 50.2 |
| SWE-Bench Verified                              |         71.9        |          72.2         |        73.8       |        69.5       |  69.9 |  74.0 | 72.4 |
| SWE-Bench Multilingual                          |         67.7        |          69.2         |        73.8       |        65.9       |  67.7 |  71.9 | 72.1 |
| ProfBench (Search)                              |         56.0        |          52.0         |        46.0       |        56.0       |  53.0 |  59.9 | 57.0 |
| PinchBench                                      |         90.0        |          77.6         |        81.2       |        90.2       |  86.6 |  88.6 | 91.3 |
| TauBench V3                                     |                     |                       |                   |                   |       |       |      |
| Airline                                         |         81.5        |          75.3         |        85.0       |        85.8       |  76.5 |  80.8 | 80.8 |
| Retail                                          |         86.4        |          84.9         |        84.1       |        82.9       |  88.5 |  88.9 | 89.1 |
| Telecom                                         |         92.9        |          89.6         |        96.9       |        97.8       |  98.0 |  96.3 | 98.3 |
| Banking                                         |         22.6        |          14.6         |        12.8       |        23.1       |  20.9 |  25.9 | 26.7 |
| Average                                         |         70.9        |          66.1         |        69.7       |        72.4       |  71.0 |  73.2 | 73.7 |
| BrowseComp                                      |         44.4        |          54.1         |        59.4       |        61.3       |  40.5 |  59.4 | 46.9 |
| Vals.ai Finance Agent 1.1                       |                     |                       |                   |                   |       |       |      |
| Without web search                              |         60.1        |          51.3         |        60.2       |        54.0       |  61.3 |  58.9 | 58.4 |
| With web search                                 |         53.7        |          50.5         |        60.7       |        58.8       |  59.0 |  62.3 | 60.1 |
| **Reasoning & Knowledge**                       |                     |                       |                   |                   |       |       |      |
| IOI 2025                                        |        570.0        |           --          |       456.5       |       585.0       | 441.3 | 580.1 |  --  |
| LiveCodeBench (v6)                              |         89.0        |          77.2         |        85.7       |        90.2       |  79.3 |  92.5 | 90.9 |
| IMOAnswerBench (no tools)                       |         88.6        |          68.3         |        86.8       |        91.1       |  83.1 |  93.0 | 91.1 |
| IMOAnswerBench (with tools)                     |         92.3        |          75.1         |        91.1       |       93.71       | 84.51 |  85.4 | 89.6 |
| Apex-Shortlist (no tools)                       |         74.9        |          28.9         |        71.1       |        77.4       |  61.4 |  85.8 | 82.4 |
| Apex-Shortlist (with tools)                     |         84.8        |          51.9         |        79.0       |        73.2       |  60.4 |  86.5 | 82.0 |
| GPQA (no tools)                                 |         87.0        |          86.6         |        86.1       |        91.0       |  87.1 |  87.8 | 88.5 |
| SciCode (subtask)                               |         44.6        |          38.3         |        47.7       |        52.0       |  48.0 |  50.5 | 48.2 |
| HLE (no tools)                                  |         26.7        |          23.1         |        27.2       |        34.8       |  28.5 |  37.7 | 32.2 |
| HLE (with tools)                                |         37.4        |           --          |        50.4       |        54.0       |  48.3 |  48.2 | 45.1 |
| CritPt (no tools)                               |         3.1         |          0.6          |        3.7        |        9.1        |  2.4  |  14.0 | 10.6 |
| MMLU-Pro                                        |         86.8        |          81.9         |        85.9       |        88.1       |  88.3 |  87.5 | 86.4 |
| OmniScience Accuracy                            |         24.1        |          20.5         |        31.3       |        35.5       |  35.9 |  46.8 | 39.9 |
| OmniScience Non-Hallucination                   |         78.7        |          74.4         |        66.8       |        67.1       |  7.4  |  5.7  |  2.8 |
| **Chat & Instruction Following**                |                     |                       |                   |                   |       |       |      |
| IFBench (prompt loose)                          |         81.7        |          74.6         |        76.6       |        73.7       |  78.2 |  79.1 | 82.0 |
| Multi-Challenge                                 |         63.8        |          42.5         |        63.0       |        63.1       |  63.9 |  64.1 | 63.5 |
| **Long Context**                                |                     |                       |                   |                   |       |       |      |
| AA-LCR                                          |         65.4        |          69.8         |        66.9       |        70.2       |  68.3 |  67.3 | 62.7 |
| RULER (1M)                                      |         94.7        |           --          |         --        |         --        |  90.1 |  94.2 | 87.7 |
| Longbench v2 (≤ 1M)                             |         61.9        |           --          |         --        |         --        |  68.9 |  62.1 | 57.0 |
| **Multilingual**                                |                     |                       |                   |                   |       |       |      |
| MMLU-ProX (avg. en/de/fr/es/it/ja/zh/hi/pt/ko)  |         83.0        |          78.4         |        85.8       |        85.0       |  86.4 |  85.6 | 84.3 |
| WMT24++ (en→xx)                                 |         83.7        |          82.8         |        84.4       |        84.5       |  86.8 |  85.9 | 85.9 |

