# NVIDIA Nemotron-3-Super: How To Run Guide

NVIDIA has released **Nemotron-3-Super-120B-A12B**, an open 120B hybrid reasoning MoE model with 12B active parameters. It follows the earlier release of [Nemotron-3-Nano](/docs/jp/moderu/nemotron-3.md), its 30B counterpart. Nemotron-3-Super is designed to deliver high efficiency and accuracy for multi-agent AI. With a **1M-token** context window, it leads its class on the AIME 2025, Terminal Bench, and SWE-Bench Verified benchmarks while also achieving the highest throughput.

Nemotron-3-Super runs on a device with **64GB** of RAM, VRAM, or unified memory, and can now be fine-tuned locally. Thanks to NVIDIA for providing Unsloth with day-zero support.

<a href="/pages/6b8d31fd5301efc60f6ff33d32a66c700ccee8ba#run-nemotron-3-super-120b" class="button primary">Nemotron 3 Super</a><a href="/pages/6b8d31fd5301efc60f6ff33d32a66c700ccee8ba" class="button secondary">Nemotron 3 Nano</a>

GGUF: [Nemotron-3-Super-120B-A12B-GGUF](https://huggingface.co/unsloth/NVIDIA-Nemotron-3-Super-120B-A12B-GGUF) • [NVFP4](https://huggingface.co/unsloth/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4) • [FP8](https://huggingface.co/unsloth/NVIDIA-Nemotron-3-Super-120B-A12B-FP8) • [BF16](https://huggingface.co/unsloth/NVIDIA-Nemotron-3-Super-120B-A12B)

### ⚙️ Usage Guide

NVIDIA recommends the following settings for inference:

{% columns %}
{% column %}
**General chat/instruction (default):**

* `temperature = 1.0`
* `top_p = 1.0`
  {% endcolumn %}

{% column %}
**Tool-calling use-cases:**

* `temperature = 0.6`
* `top_p = 0.95`
  {% endcolumn %}
  {% endcolumns %}
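
The two presets above can be kept in one place and picked per request. This is a minimal sketch; `SAMPLING_PRESETS` and `sampling_for` are hypothetical helper names, not part of any NVIDIA or Unsloth API:

```python
# Hypothetical helper: NVIDIA's recommended sampling settings, keyed by use-case.
SAMPLING_PRESETS = {
    "general": {"temperature": 1.0, "top_p": 1.0},        # chat / instruction (default)
    "tool_calling": {"temperature": 0.6, "top_p": 0.95},  # tool-calling workloads
}


def sampling_for(use_case: str = "general") -> dict:
    """Return a copy of the preset so callers can tweak it without side effects."""
    if use_case not in SAMPLING_PRESETS:
        raise ValueError(f"unknown use case: {use_case!r}")
    return dict(SAMPLING_PRESETS[use_case])


print(sampling_for("tool_calling"))
```

The returned dict can be unpacked into any client call that accepts `temperature` and `top_p` keyword arguments.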

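**For most local use, set:**

* `max_new_tokens` = `32,768` to `262,144` for standard prompts, with a maximum of 1M tokens
* Increase it for deep reasoning or long-form generation, as far as your RAM/VRAM allows.

You can inspect the chat template format with the following: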

{% code overflow="wrap" %}

```python
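from transformers import AutoTokenizer

# Assumption: load the tokenizer from the BF16 repo linked at the top of this guide.
tokenizer = AutoTokenizer.from_pretrained("unsloth/NVIDIA-Nemotron-3-Super-120B-A12B")
prompt = tokenizer.apply_chat_template([
    {"role" : "user", "content" : "What is 1+1?"},
    {"role" : "assistant", "content" : "2"},
    {"role" : "user", "content" : "What is 2+2?"}
    ], add_generation_prompt = True, tokenize = False,
)
print(prompt)  # show the rendered prompt string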
```

{% endcode %}

{% hint style="success" %}
Because the model is trained with NoPE, the only thing you need to change is `max_position_embeddings`. The model does not use explicit positional embeddings, so YaRN is not needed.
{% endhint %}
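
As a rough illustration of the note above, extending the context amounts to editing a single field. The sketch below works on a plain dict standing in for a Hugging Face style `config.json`; `extend_context` is a hypothetical helper, and the target length of 1,000,000 is only an illustrative value:

```python
import json


def extend_context(config: dict, new_length: int) -> dict:
    """Return a copy of an HF-style config dict with a longer context limit."""
    if new_length <= 0:
        raise ValueError("new_length must be positive")
    updated = dict(config)
    updated["max_position_embeddings"] = new_length
    # No rope_scaling / YaRN entry is added: per the note above, NoPE needs none.
    return updated


# Toy stand-in for a real config.json, starting from the 262,144 default.
config = {"max_position_embeddings": 262144}
print(json.dumps(extend_context(config, 1_000_000)))
```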

#### Nemotron 3 chat template format:

{% hint style="info" %}
Nemotron 3 reasons using `<think>` (token ID 12) and `</think>` (token ID 13). Use `--special` to see these tokens in llama.cpp. You may also need `--verbose-prompt` to see `<think>`, since it is prepended to the prompt.
{% endhint %}

{% code overflow="wrap" lineNumbers="true" %}

```
<|im_start|>system\n<|im_end|>\n<|im_start|>user\nWhat is 1+1?<|im_end|>\n<|im_start|>assistant\n<think></think>2<|im_end|>\n<|im_start|>user\nWhat is 2+2?<|im_end|>\n<|im_start|>assistant\n<think>\n
```

{% endcode %}
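
To make the layout above easier to read, here is an illustrative re-creation in plain Python. It is only a sketch that reproduces the example string; the real template is the Jinja template shipped with the tokenizer, and how a non-empty system message is rendered is an assumption here:

```python
def render_chat(messages: list[dict], add_generation_prompt: bool = True) -> str:
    """Illustrative re-creation of the format shown above (not the real Jinja template)."""
    parts = []
    # The example above opens with an empty system turn when no system message is given.
    if not messages or messages[0]["role"] != "system":
        parts.append("<|im_start|>system\n<|im_end|>\n")
    for msg in messages:
        content = msg["content"]
        if msg["role"] == "assistant":
            # Past assistant turns carry an empty reasoning block.
            content = "<think></think>" + content
        parts.append(f"<|im_start|>{msg['role']}\n{content}<|im_end|>\n")
    if add_generation_prompt:
        # The prompt ends with an opened <think> tag so the model starts by reasoning.
        parts.append("<|im_start|>assistant\n<think>\n")
    return "".join(parts)


messages = [
    {"role": "user", "content": "What is 1+1?"},
    {"role": "assistant", "content": "2"},
    {"role": "user", "content": "What is 2+2?"},
]
print(repr(render_chat(messages)))
```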

### 🖥️ Run Nemotron-3-Super-120B-A12B

You will need different settings depending on your use-case. For some GGUFs, the model architecture (as with [gpt-oss](/docs/jp/moderu/gpt-oss-how-to-run-and-fine-tune.md)) has dimensions that are not divisible by 128, so some parts cannot be quantized to lower bits. Access the GGUFs [here](https://huggingface.co/unsloth/NVIDIA-Nemotron-3-Super-120B-A12B-GGUF).

The 4-bit version of this model needs about 64GB to 72GB of RAM. The 8-bit version needs 128GB.
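
These figures are in line with simple back-of-envelope arithmetic on the weights alone. The sketch below is a rough estimate, not a measurement: it ignores the KV cache, runtime buffers, and the fact that dynamic quants keep some layers at higher precision, which is presumably why the quoted requirements sit above the raw numbers. `approx_weight_gb` is a hypothetical helper:

```python
def approx_weight_gb(params_billions: float, bits_per_weight: float) -> float:
    """Weights-only size in GB (1 GB = 1e9 bytes)."""
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


for bits in (4, 8, 16):
    print(f"{bits}-bit: ~{approx_weight_gb(120, bits):.0f} GB of weights")
```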

#### Llama.cpp Tutorial (GGUF):

Instructions for running in llama.cpp (we use 4-bit so the model fits on most devices):

{% stepper %}
{% step %}
Get the latest `llama.cpp` from [GitHub here](https://github.com/ggml-org/llama.cpp). You can also follow the build instructions below. Change `-DGGML_CUDA=ON` to `-DGGML_CUDA=OFF` if you don't have a GPU or only want CPU inference.

{% code overflow="wrap" %}

```bash
apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-mtmd-cli llama-server llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp
```

{% endcode %}
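
If you script the build, the choice between the two CMake options can be automated. The sketch below uses the presence of `nvidia-smi` on `PATH` as a heuristic for "an NVIDIA GPU is available"; that heuristic is an assumption (a CUDA build also needs the CUDA toolkit installed), and `cuda_flag` is a hypothetical helper:

```python
import shutil


def cuda_flag(has_nvidia_gpu: bool) -> str:
    """Map GPU availability to the CMake option used in the build above."""
    return "-DGGML_CUDA=ON" if has_nvidia_gpu else "-DGGML_CUDA=OFF"


# Heuristic (an assumption): treat an `nvidia-smi` binary on PATH as "GPU present".
has_gpu = shutil.which("nvidia-smi") is not None
print(cuda_flag(has_gpu))
```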
{% endstep %}

{% step %}
You can pull the model directly from Hugging Face. You can increase the context to 1M as far as your RAM/VRAM allows.

Follow this for **general instruction** use-cases:

```bash
./llama.cpp/llama-cli \
    -hf unsloth/NVIDIA-Nemotron-3-Super-120B-A12B-GGUF:UD-Q4_K_XL \
    --ctx-size 16384 \
    --temp 1.0 --top-p 1.0
```

Follow this for **tool-calling** use-cases:

```bash
./llama.cpp/llama-cli \
    -hf unsloth/NVIDIA-Nemotron-3-Super-120B-A12B-GGUF:UD-Q4_K_XL \
    --ctx-size 32768 \
    --temp 0.6 --top-p 0.95
```
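
If you launch these commands from a script, the two invocations differ only in context size and sampling values. The sketch below builds the argument list from the flags shown above; `llama_cli_command` is a hypothetical helper and assumes the binary sits at `./llama.cpp/llama-cli` as in the build step:

```python
import shlex


def llama_cli_command(quant: str, ctx_size: int, temp: float, top_p: float) -> list[str]:
    """Build the argv for the llama-cli invocations shown above."""
    repo = "unsloth/NVIDIA-Nemotron-3-Super-120B-A12B-GGUF"
    return [
        "./llama.cpp/llama-cli",
        "-hf", f"{repo}:{quant}",
        "--ctx-size", str(ctx_size),
        "--temp", str(temp),
        "--top-p", str(top_p),
    ]


general = llama_cli_command("UD-Q4_K_XL", 16384, 1.0, 1.0)
tools = llama_cli_command("UD-Q4_K_XL", 32768, 0.6, 0.95)
print(shlex.join(general))
print(shlex.join(tools))
```

Either list can be handed to `subprocess.run(..., check=True)` to start the process.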

{% endstep %}

{% step %}
Download the model (after installing `pip install huggingface_hub hf_transfer`). You can choose Q4\_K\_M or other quantized versions such as `UD-Q4_K_XL`. To balance size and accuracy, we recommend using at least the 2-bit dynamic quant `UD-Q2_K_XL`. If downloads get stuck, see: [Hugging Face Hub, XET debugging](/docs/jp/ji-ben/troubleshooting-and-faqs/hugging-face-hub-xet-debugging.md)

```bash
hf download unsloth/NVIDIA-Nemotron-3-Super-120B-A12B-GGUF \
    --local-dir unsloth/NVIDIA-Nemotron-3-Super-120B-A12B-GGUF \
    --include "*UD-Q4_K_XL*" # Use "*UD-Q2_K_XL*" for Dynamic 2-bit
```
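
The same download can be done from Python with `huggingface_hub.snapshot_download`, whose `allow_patterns` argument plays the role of `--include`. This is a minimal sketch: `quant_pattern` and `download_quant` are hypothetical helper names, and the download itself is wrapped in a function so that nothing is fetched until you call it:

```python
from fnmatch import fnmatch

REPO_ID = "unsloth/NVIDIA-Nemotron-3-Super-120B-A12B-GGUF"


def quant_pattern(quant: str) -> str:
    """Glob that matches every shard of one quantization."""
    return f"*{quant}*"


def download_quant(quant: str = "UD-Q4_K_XL") -> str:
    """Download only the files for `quant`; returns the local directory."""
    from huggingface_hub import snapshot_download  # imported lazily; needs network access
    return snapshot_download(
        repo_id=REPO_ID,
        local_dir=REPO_ID,
        allow_patterns=[quant_pattern(quant)],
    )


# Sanity-check the pattern against the shard path used elsewhere in this guide.
shard = "UD-Q4_K_XL/NVIDIA-Nemotron-3-Super-120B-A12B-UD-Q4_K_XL-00001-of-00003.gguf"
print(fnmatch(shard, quant_pattern("UD-Q4_K_XL")))
```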

{% endstep %}

{% step %}
Then run the model in conversation mode:

{% code overflow="wrap" %}

```bash
./llama.cpp/llama-cli \
    --model unsloth/NVIDIA-Nemotron-3-Super-120B-A12B-GGUF/UD-Q4_K_XL/NVIDIA-Nemotron-3-Super-120B-A12B-UD-Q4_K_XL-00001-of-00003.gguf \
    --ctx-size 16384 \
    --seed 3407 \
    --prio 2 \
    --temp 0.6 \
    --top-p 0.95
```

{% endcode %}

<figure><img src="/files/391ac26f0e28649083f8e89214b3518cddabd153" alt=""><figcaption></figcaption></figure>

Also adjust the **context window** as required. Make sure your hardware can handle a context window larger than 256K. Setting it to 1M may trigger a CUDA OOM and crash, which is why the default is 262,144.
{% endstep %}
{% endstepper %}

### 🦥 Fine-tuning Nemotron 3 and RL

Unsloth now supports fine-tuning of all Nemotron models, including Nemotron 3 Super and Nano. For Nano notebook examples, see our Nemotron 3 [Nano fine-tuning guide](/docs/jp/moderu/nemotron-3.md).

#### Nemotron 3 Super

* Router-layer fine-tuning is disabled by default for stability.
* Nemotron-3-Super-120B - bf16 LoRA works on 256GB of VRAM. If you are using multiple GPUs, add `device_map = "balanced"` or follow our [multi-GPU guide](/docs/jp/ji-ben/multi-gpu-training-with-unsloth.md).
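
As a rough sketch of where `device_map = "balanced"` goes, the snippet below collects the loading arguments for a bf16 LoRA run and passes them to Unsloth's `FastLanguageModel.from_pretrained`. Treat it as an assumption-laden outline: the `max_seq_length` value is purely illustrative, the exact set of keyword arguments may differ between Unsloth versions, and `load_model` is a hypothetical wrapper that is not called here because it needs the hardware described above:

```python
LOAD_KWARGS = {
    "model_name": "unsloth/NVIDIA-Nemotron-3-Super-120B-A12B",  # BF16 checkpoint linked above
    "max_seq_length": 4096,    # illustrative value, not a recommendation
    "load_in_4bit": False,     # bf16 LoRA, as described above
    "device_map": "balanced",  # spread layers across the available GPUs
}


def load_model():
    """Load model and tokenizer; requires `unsloth` and enough VRAM."""
    from unsloth import FastLanguageModel  # imported lazily so this file parses anywhere
    return FastLanguageModel.from_pretrained(**LOAD_KWARGS)


print(LOAD_KWARGS["device_map"])
```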

### 🦙 Llama-server serving & deployment

To deploy Nemotron 3 for production, we use `llama-server`. In a new terminal, for example via tmux, deploy the model with:

{% code overflow="wrap" %}

```bash
./llama.cpp/llama-server \
    --model unsloth/NVIDIA-Nemotron-3-Super-120B-A12B-GGUF/UD-Q4_K_XL/NVIDIA-Nemotron-3-Super-120B-A12B-UD-Q4_K_XL-00001-of-00003.gguf \
    --alias "unsloth/NVIDIA-Nemotron-3-Super-120B-A12B" \
    --prio 3 \
    --min-p 0.01 \
    --temp 0.6 \
    --top-p 0.95 \
    --ctx-size 16384 \
    --port 8001
```

{% endcode %}

When you run the above, you will see:

<figure><img src="/files/e53bfce01a50bb3e3b41e38d1e04e48d31ece54c" alt=""><figcaption></figcaption></figure>
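
Loading a model of this size can take a while, so a client script may want to wait until the server is ready before sending requests. The sketch below polls llama-server's `/health` endpoint using only the standard library; `wait_until_ready` is a hypothetical helper, and the timeout and polling interval are arbitrary defaults:

```python
import time
import urllib.error
import urllib.request


def wait_until_ready(url: str = "http://127.0.0.1:8001/health",
                     timeout_s: float = 600.0, interval_s: float = 2.0) -> bool:
    """Poll `url` until it returns HTTP 200 or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # not accepting connections yet, or still loading the model
        time.sleep(interval_s)
    return False
```

Call `wait_until_ready()` once after starting the server and only proceed when it returns `True`.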

Then, in a new terminal, after running `pip install openai`, do the following:

{% code overflow="wrap" %}

```python
from openai import OpenAI
import json
openai_client = OpenAI(
    base_url = "http://127.0.0.1:8001/v1",
    api_key = "sk-no-key-required",
)
completion = openai_client.chat.completions.create(
    model = "unsloth/NVIDIA-Nemotron-3-Super-120B-A12B",
    messages = [{"role": "user", "content": "What is 2+2?"},],
)
print(completion.choices[0].message.reasoning_content)
print(completion.choices[0].message.content)
```

{% endcode %}

This prints:

{% code overflow="wrap" %}

```
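Okay, the user asked "What is 2+2?" This seems like a very basic arithmetic question.

Hmm, maybe they are testing whether I'm paying attention, or perhaps this is a young child learning arithmetic. They could also be checking whether I overcomplicate a simple question.

There is no sign of a trick in the question, so answering plainly seems best. The answer is definitely 4. No need to second-guess basic addition.

That said, I wonder if this might be a setup for a joke like "2+2=5 for large values of 2," but nothing in the context suggests that, so I'll assume it is a serious question.

Answering clearly and warmly seems best. If they are learning, that may make it easier for them to ask more questions. No extra flourishes needed, but let's state the facts kindly.

2 + 2 equals **4**.

This is a basic arithmetic fact in decimal notation. If you are asking in a different context (for example modular arithmetic, binary, or a joke or reference), feel free to let me know. I'm happy to adapt! 😊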
```

{% endcode %}
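
The reasoning and the final answer arrive as separate fields on the response message. If you prefer not to depend on the `openai` package, the same endpoint can be called with the standard library. The sketch below only builds the request and shows how to split a response body; `build_chat_request` and `split_message` are hypothetical helpers, and the sample response contains placeholder strings rather than real model output:

```python
import json
import urllib.request


def build_chat_request(prompt: str,
                       base_url: str = "http://127.0.0.1:8001/v1") -> urllib.request.Request:
    """Build a POST to the OpenAI-compatible chat endpoint served by llama-server."""
    payload = {
        "model": "unsloth/NVIDIA-Nemotron-3-Super-120B-A12B",  # matches --alias above
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def split_message(response: dict) -> tuple[str, str]:
    """Return (reasoning, answer) from a chat-completions response body."""
    message = response["choices"][0]["message"]
    return message.get("reasoning_content") or "", message.get("content") or ""


# Shape-only example; the strings are placeholders, not real model output.
fake = {"choices": [{"message": {"reasoning_content": "thinking", "content": "4"}}]}
print(split_message(fake))
```

To send the request, pass it to `urllib.request.urlopen` and feed `json.load(resp)` into `split_message`.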

### Benchmarks

Nemotron 3 Super performs competitively against models of similar size while also delivering the highest throughput.

<figure><img src="/files/6ff0ba2ff5d509df62c559ee196fe806ad910f90" alt=""><figcaption></figcaption></figure>

