# NVIDIA Nemotron 3 Nano Omni - How to Run Locally

NVIDIA Nemotron-3-Nano-Omni-30B-A3B is an open hybrid reasoning MoE model with 30B total parameters and 3B active parameters, built for multimodal agentic workloads. It accepts **audio**, **video**, text, images and documents as input and outputs text. The model runs in **25GB of RAM** at 4-bit and needs 36GB at 8-bit. With a **256K context** window, Nemotron 3 Nano Omni is the **strongest omni model in its size class** and the most efficient open multimodal model. We collaborated with NVIDIA to provide day-one support!\
**GGUF:** [Nemotron-3-Nano-Omni-30B-A3B-Reasoning](https://huggingface.co/unsloth/Nemotron-3-Nano-30B-A3B-GGUF)

### ⚙️ Usage Guide

NVIDIA recommends the following settings for inference:

{% columns %}
{% column %}
**General chat/instruction (default):**

* `temperature = 1.0`
* `top_p = 1.0`
{% endcolumn %}

{% column %}
**Tool calling use-cases:**

* `temperature = 0.6`
* `top_p = 0.95`
{% endcolumn %}
{% endcolumns %}

{% hint style="warning" %}
Do not use CUDA 13.2, as you may get gibberish outputs. NVIDIA is working on a fix.
{% endhint %}

### Run Nemotron-3-Nano-Omni

Depending on your use case, you will need [different settings](#usage-guide). Some GGUFs end up similar in size because the model architecture (like [gpt-oss](/docs/zh/mo-xing/gpt-oss-how-to-run-and-fine-tune.md)) has dimensions that are not divisible by 128, so some parts cannot be quantized to lower bit widths.

The 4-bit version of the model needs about 25GB of RAM, and the 8-bit version needs 36GB. These guides use `UD-Q4_K_XL`, which strikes a good balance between size and accuracy. You can run the model in Unsloth Studio or in llama.cpp; both are covered below.

{% hint style="warning" %}
Currently no multimodal/vision GGUFs work in **Ollama**, because they use a separate `mmproj` vision file. Use a llama.cpp-compatible backend instead. Do not use **CUDA 13.2**, as you may get gibberish outputs. NVIDIA is working on a fix.
{% endhint %}

### 🦥 Unsloth Studio Guide

This tutorial uses [Unsloth Studio](/docs/zh/xin/studio.md), our new web UI for running and training LLMs. With Unsloth Studio you can run models locally on **audio**, image and text inputs on **Mac, Windows** and Linux, and:

{% columns %}
{% column %}
* Search, download and [run GGUFs](/docs/zh/xin/studio.md#run-models-locally) and safetensor models
* Compare **models** **side by side**
* [**Self-healing** tool calling](/docs/zh/xin/studio.md#execute-code--heal-tool-calling) + **web search**
* [**Code execution**](/docs/zh/xin/studio.md#run-models-locally) (Python, Bash)
* [Automatic inference](/docs/zh/xin/studio.md#model-arena) parameter tuning (temp, top-p, etc.)
* [Train LLMs](/docs/zh/xin/studio.md#no-code-training) 2x faster with 70% less VRAM
{% endcolumn %}

{% column %}

{% endcolumn %}
{% endcolumns %}

{% stepper %}
{% step %}
#### Install Unsloth

**MacOS, Linux, WSL:**

```bash
curl -fsSL https://unsloth.ai/install.sh | sh
```

**Windows PowerShell:**

```powershell
irm https://unsloth.ai/install.ps1 | iex
```
{% endstep %}

{% step %}
#### Set Up Unsloth Studio (one time only)

Setup automatically installs Node.js (via nvm), builds the frontend, installs all Python dependencies, and builds llama.cpp with CUDA support.

{% hint style="info" %}
**WSL users:** you will be prompted for your `sudo` password to install build dependencies (`cmake`, `git`, `libcurl4-openssl-dev`).
{% endhint %}
{% endstep %}

{% step %}
#### Launch Unsloth

**MacOS, Linux, WSL:**

```bash
source unsloth_studio/bin/activate
unsloth studio -H 0.0.0.0 -p 8888
```

**Windows PowerShell:**

```powershell
& .\unsloth_studio\Scripts\unsloth.exe studio -H 0.0.0.0 -p 8888
```

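Then open `http://127.0.0.1:8888` in your browser.
{% endstep %}

{% step %}
#### Search for and Download NVIDIA-Nemotron-3-Nano-30B-A3B-Omni

On first launch you will need to create a password to secure your account and to sign in again later. Then go to the [Studio Chat](/docs/zh/xin/studio/chat.md) tab, search for Nemotron-3-Nano-Omni in the search bar, and download the model and quantization you want.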

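{% endstep %}

{% step %}
#### Run Nemotron-3-Nano-30B-A3B-Omni

Inference parameters should be set automatically when you use Unsloth Studio, but you can still change them manually. You can also edit the context length, chat template and other settings. For more information, see our [Unsloth Studio inference guide](/docs/zh/xin/studio/chat.md).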

{% endstep %}
{% endstepper %}

### 🦙 Llama.cpp Tutorial

Instructions for running in llama.cpp (note that we use 4-bit so the model fits on most devices):

{% stepper %}
{% step %}
Get the latest `llama.cpp` from [GitHub here](https://github.com/ggml-org/llama.cpp). You can also follow the build instructions below. Change `-DGGML_CUDA=ON` to `-DGGML_CUDA=OFF` if you do not have a GPU or only want CPU inference. **For Apple Mac / Metal devices**, set `-DGGML_CUDA=OFF` and continue as usual; Metal support is on by default.

{% code overflow="wrap" %}
```bash
apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-mtmd-cli llama-server llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp
```
{% endcode %}
{% endstep %}

{% step %}
**Let's first get an image!** You can also upload your own images. We will use our mini logo, which shows how finetunes are made with Unsloth:

{% code overflow="wrap" %}
```bash
wget https://raw.githubusercontent.com/unslothai/unsloth/refs/heads/main/images/unsloth%20made%20with%20love.png -O unsloth.png
```
{% endcode %}

Let's get a second image from the following address:

{% code overflow="wrap" %}
```bash
wget https://files.worldwildlife.org/wwfcmsprod/images/Sloth_Sitting_iStock_3_12_2014/story_full_width/8l7pbjmj29_iStock_000011145477Large_mini__1_.jpg -O picture.png
```
{% endcode %}
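The second URL ends in `.jpg` while the file is saved as `picture.png`, so the extension may not match the real format. As an optional sanity check, the sketch below reads the first bytes of a file and reports the format from its magic number. `sniff_image_type` and `sniff_file` are hypothetical helpers written for this guide; they are not part of llama.cpp or Unsloth.

```python
from pathlib import Path

# Magic-number prefixes for the two image formats used in this guide.
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"
JPEG_MAGIC = b"\xff\xd8\xff"


def sniff_image_type(header: bytes) -> str:
    """Return 'png', 'jpeg' or 'unknown' based on the leading bytes."""
    if header.startswith(PNG_MAGIC):
        return "png"
    if header.startswith(JPEG_MAGIC):
        return "jpeg"
    return "unknown"


def sniff_file(path: str) -> str:
    """Read only the first 8 bytes of a file and classify them."""
    with open(path, "rb") as f:
        return sniff_image_type(f.read(8))


if __name__ == "__main__":
    # File names match the wget commands in this step.
    for name in ("unsloth.png", "picture.png"):
        if Path(name).exists():
            print(name, "->", sniff_file(name))
        else:
            print(name, "not found; run the wget commands first")
```

Image decoders generally identify the format from the file contents rather than the name, so a mismatch is usually harmless, but it is worth knowing about when a MIME type is derived from the extension.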

{% endstep %}

{% step %}
Download the model with the commands below (after running `pip install huggingface_hub`). You can choose Q4\_K\_M or another quantization such as `UD-Q4_K_XL`. We recommend at least the 2-bit dynamic quant `UD-Q2_K_XL` to balance size and accuracy. If downloads get stuck, see [Hugging Face Hub, XET debugging](/docs/zh/ji-chu/troubleshooting-and-faqs/hugging-face-hub-xet-debugging.md).

{% code overflow="wrap" %}
```bash
pip install huggingface_hub
hf download unsloth/NVIDIA-Nemotron-3-Nano-Omni-30B-A3B-Reasoning-GGUF \
    --local-dir unsloth/NVIDIA-Nemotron-3-Nano-Omni-30B-A3B-Reasoning-GGUF \
    --include "*mmproj-BF16*" \
    --include "*UD-Q4_K_XL*" # Use "*UD-Q2_K_XL*" for Dynamic 2bit
```
{% endcode %}
{% endstep %}

{% step %}
Then run the model in conversation mode:

{% code overflow="wrap" %}
```bash
./llama.cpp/llama-cli \
    --model unsloth/NVIDIA-Nemotron-3-Nano-Omni-30B-A3B-Reasoning-GGUF/NVIDIA-Nemotron-3-Nano-Omni-30B-A3B-Reasoning-UD-Q4_K_XL.gguf \
    --mmproj unsloth/NVIDIA-Nemotron-3-Nano-Omni-30B-A3B-Reasoning-GGUF/mmproj-BF16.gguf \
    --temp 1.0 \
    --top-p 1.0 \
    --min-p 0.01
```
{% endcode %}
{% endstep %}

{% step %}
Once the model has loaded, you will see the interactive chat prompt.

{% endstep %}

{% step %}
Then use `/image` to load the two images, and ask "What is this image?" about each.

{% endstep %}

{% step %}
Then do the same for the sloth image (`picture.png`).

{% endstep %}
{% endstepper %}

#### Llama-server serving & deployment

To deploy Nemotron 3 Nano Omni locally, use `llama-server`. In a new terminal, for example via `tmux`, deploy the model:

```bash
./llama.cpp/llama-server \
    -hf unsloth/NVIDIA-Nemotron-3-Nano-Omni-30B-A3B-Reasoning-GGUF:UD-Q4_K_XL \
    --alias "unsloth/NVIDIA-Nemotron-3-Nano-Omni-30B-A3B-Reasoning" \
    --prio 3 \
    --temp 1.0 \
    --top-p 1.0 \
    --port 8001
```

If you downloaded the model manually, use:

{% code overflow="wrap" %}
```bash
./llama.cpp/llama-server \
    --model unsloth/NVIDIA-Nemotron-3-Nano-Omni-30B-A3B-Reasoning-GGUF/NVIDIA-Nemotron-3-Nano-Omni-30B-A3B-Reasoning-UD-Q4_K_XL.gguf \
    --mmproj unsloth/NVIDIA-Nemotron-3-Nano-Omni-30B-A3B-Reasoning-GGUF/mmproj-BF16.gguf \
    --alias "unsloth/NVIDIA-Nemotron-3-Nano-Omni-30B-A3B-Reasoning" \
    --prio 3 \
    --temp 1.0 \
    --top-p 1.0 \
    --port 8001
```
{% endcode %}

Then, in a new terminal, after installing the OpenAI client with `pip install openai`, run:

```python
from openai import OpenAI

# Point the client at the local llama-server; no real API key is needed.
openai_client = OpenAI(
    base_url = "http://127.0.0.1:8001/v1",
    api_key = "sk-no-key-required",
)
completion = openai_client.chat.completions.create(
    model = "unsloth/NVIDIA-Nemotron-3-Nano-Omni-30B-A3B-Reasoning",
    messages = [
        {"role": "user", "content": "What is 2+2?"},
    ],
)
# Print the reasoning trace first, then the final answer.
print(completion.choices[0].message.reasoning_content)
print(completion.choices[0].message.content)
```

This prints the model's reasoning followed by its final answer.
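The server commands above start with the general-chat defaults, while the usage guide recommends different values for tool calling. One option is to override sampling per request through the client's standard `temperature` and `top_p` arguments. The sketch below is one way to organise that; `SAMPLING_PRESETS`, `sampling_for` and `build_request` are illustrative names, not part of any library.

```python
# Sampling values taken from the usage guide on this page.
SAMPLING_PRESETS = {
    "chat": {"temperature": 1.0, "top_p": 1.0},
    "tool_calling": {"temperature": 0.6, "top_p": 0.95},
}

# Must match the --alias passed to llama-server.
MODEL_ALIAS = "unsloth/NVIDIA-Nemotron-3-Nano-Omni-30B-A3B-Reasoning"


def sampling_for(use_case: str) -> dict:
    """Return a copy of the recommended sampling settings for a use case."""
    if use_case not in SAMPLING_PRESETS:
        raise ValueError(f"unknown use case: {use_case!r}")
    return dict(SAMPLING_PRESETS[use_case])


def build_request(prompt: str, use_case: str = "chat") -> dict:
    """Build keyword arguments for a chat completion request."""
    return {
        "model": MODEL_ALIAS,
        "messages": [{"role": "user", "content": prompt}],
        **sampling_for(use_case),
    }


if __name__ == "__main__":
    print(build_request("List the files in the current directory.", "tool_calling"))
```

The resulting dictionary can be unpacked into the client call, for example `openai_client.chat.completions.create(**build_request("...", "tool_calling"))`.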

#### Image input via the OpenAI-compatible server

Let's use `picture.png`, the sloth picture from [#llama.cpp-tutorial](#llama.cpp-tutorial "mention"):

{% code expandable="true" %}
```python
from openai import OpenAI
import base64
import mimetypes

image_link = "picture.png"

def file_to_data_url(path: str) -> str:
    # Encode a local file as a base64 data URL the server can decode.
    mime = mimetypes.guess_type(path)[0] or "application/octet-stream"
    with open(path, "rb") as f:
        data = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime};base64,{data}"

openai_client = OpenAI(
    base_url = "http://127.0.0.1:8001/v1",
    api_key = "sk-no-key-required",
)
completion = openai_client.chat.completions.create(
    model = "unsloth/NVIDIA-Nemotron-3-Nano-Omni-30B-A3B-Reasoning",
    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What is this image?",
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": file_to_data_url(image_link),
                    },
                },
            ],
        }
    ],
)
# Print the reasoning trace first, then the final answer.
print(completion.choices[0].message.reasoning_content)
print(completion.choices[0].message.content)
```
{% endcode %}

This prints the model's reasoning followed by its answer about the image.
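Since the llama.cpp walkthrough loads two images, a natural extension is to send both in one request. The sketch below only builds the message and does not contact the server. It assumes the server accepts several `image_url` parts in a single user message, which has not been verified here for this model. `build_image_message` is a hypothetical helper.

```python
import base64
import mimetypes
from pathlib import Path


def file_to_data_url(path: str) -> str:
    """Encode a local file as a base64 data URL."""
    mime = mimetypes.guess_type(path)[0] or "application/octet-stream"
    with open(path, "rb") as f:
        data = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime};base64,{data}"


def build_image_message(prompt: str, image_paths: list) -> dict:
    """Build one user message with a text part followed by one part per image."""
    content = [{"type": "text", "text": prompt}]
    for path in image_paths:
        content.append(
            {"type": "image_url", "image_url": {"url": file_to_data_url(path)}}
        )
    return {"role": "user", "content": content}


if __name__ == "__main__":
    # File names match the wget commands in the llama.cpp tutorial.
    paths = [p for p in ("unsloth.png", "picture.png") if Path(p).exists()]
    message = build_image_message("What do these images show?", paths)
    print(f"built a message with {len(message['content']) - 1} image part(s)")
```

The returned dictionary can be passed as the single entry of `messages` in the request.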

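### 🦥 Fine-tuning Nemotron 3 Nano Omni

Unsloth supports the entire [Nemotron](/docs/zh/mo-xing/nemotron-3.md) model family. Nemotron 3 Nano Omni is suited to multimodal agent datasets. You can train on audio, vision or text with Unsloth. Fine-tuning on **video input** is not currently supported.

For text-only training and notebooks, you can start from the existing [Nemotron 3 Nano fine-tuning workflow](/docs/zh/mo-xing/nemotron-3.md#fine-tuning-nemotron-3-and-rl). For multimodal adapters, make sure your dataset contains the modalities your agent actually needs:

* **Computer use:** screenshots, UI state, cursor/context, expected next action
* **Document intelligence:** PDFs, screenshots, charts, tables, structured extraction targets
* **Audio understanding:** audio clips, sampled frames, summaries, timestamps, events and follow-up questions
* **Agent loops:** observe → reason → act → verify examples

For Omni, do not blindly reuse text-only VRAM numbers. The multimodal encoder, projector weights, image tokens, audio chunks and long context all increase memory use. Start with a shorter context and a smaller batch size, then scale up.

### Benchmarks

Nemotron 3 Nano Omni is the strongest omni model in its size range. It is also the most efficient open multimodal model, with leading accuracy. The model outperforms Qwen3-Omni-30B-A3B on all benchmarks.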
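Returning to the fine-tuning data advice, the sketch below shows one possible way to represent an observe → reason → act → verify example and to check that it carries the modality its task needs. The `AgentExample` schema, its field names and the task labels are purely illustrative assumptions; Unsloth does not prescribe this format.

```python
from dataclasses import dataclass, field


@dataclass
class AgentExample:
    """One observe -> reason -> act -> verify example (hypothetical schema)."""
    task: str                                        # e.g. "computer_use" or "audio"
    observation_text: str = ""                       # UI state, context, transcript
    image_paths: list = field(default_factory=list)  # screenshots, charts
    audio_paths: list = field(default_factory=list)  # audio clips
    reasoning: str = ""
    action: str = ""
    verification: str = ""


# Which field must be non-empty for each task label (an assumption for this sketch).
REQUIRED_MODALITY = {
    "computer_use": "image_paths",
    "audio": "audio_paths",
}


def problems(example: AgentExample) -> list:
    """Return human-readable issues; an empty list means the example looks complete."""
    issues = []
    needed = REQUIRED_MODALITY.get(example.task)
    if needed and not getattr(example, needed):
        issues.append(f"task '{example.task}' has no {needed}")
    for step in ("reasoning", "action", "verification"):
        if not getattr(example, step).strip():
            issues.append(f"missing {step}")
    return issues


if __name__ == "__main__":
    example = AgentExample(
        task="computer_use",
        observation_text="Settings window is open",
        image_paths=["screenshot_001.png"],
        reasoning="The Save button is disabled until a name is entered.",
        action="type('name_field', 'demo')",
        verification="Save button becomes enabled.",
    )
    print(problems(example))
```

Running a check like this over a dataset before training can catch examples that claim a modality but do not include it.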
