# Qwen3-VL：如何运行指南 Qwen3-VL 是 Qwen 的新视觉模型，具有 **指令** 和 **思考** 版本。2B、4B、8B 和 32B 模型是稠密模型，而 30B 和 235B 是 MoE。235B 思考 LLM 提供 SOTA 视觉和编码性能，可与 GPT-5（high）和 Gemini 2.5 Pro 媲美。\ \ Qwen3-VL 具备视觉、视频和 OCR 能力，以及 256K 上下文（可扩展至 1M）。\ \ [Unsloth](https://github.com/unslothai/unsloth) 支持 **Qwen3-VL 微调和** [**RL**](/docs/zh/kai-shi-shi-yong/reinforcement-learning-rl-guide/vision-reinforcement-learning-vlm-rl.md)。使用我们的 [notebooks](#fine-tuning-qwen3-vl). 运行 Qwen3-VL 微调 Qwen3-VL ## 🖥️ **运行 Qwen3-VL** 要在 llama.cpp、vLLM、Ollama 等中运行该模型，推荐设置如下： ### :gear: 推荐设置 Qwen 为两个模型都推荐这些设置（Instruct 与 Thinking 略有不同）： | Instruct 设置： | Thinking 设置： | | ------------------------------------------------------------------------ | ------------------------------------------------------------------------ | | **Temperature = 0.7** | **Temperature = 1.0** | | **Top\_P = 0.8** | **Top\_P = 0.95** | | **presence\_penalty = 1.5** | **presence\_penalty = 0.0** | | 输出长度 = 32768（最多 256K） | 输出长度 = 40960（最多 256K） | | Top\_K = 20 | Top\_K = 20 | Qwen3-VL 还使用了下面这些设置来得到他们的基准测试结果，如所提到的 [在 GitHub 上](https://github.com/QwenLM/Qwen3-VL/tree/main?tab=readme-ov-file#generation-hyperparameters). {% columns %} {% column %} Instruct 设置： ```bash export greedy='false' export seed=3407 export top_p=0.8 export top_k=20 export temperature=0.7 export repetition_penalty=1.0 export presence_penalty=1.5 export out_seq_length=32768 ``` {% endcolumn %} {% column %} Thinking 设置： ```bash export greedy='false' export seed=1234 export top_p=0.95 export top_k=20 export temperature=1.0 export repetition_penalty=1.0 export presence_penalty=0.0 export out_seq_length=40960 ``` {% endcolumn %} {% endcolumns %} ### :bug:聊天模板修复在 Unsloth，我们最重视准确性，所以我们调查了为什么在运行 Thinking 模型的第二轮后，llama.cpp 会崩溃，如下所示： {% columns %} {% column %}

{% endcolumn %} {% column %} 错误代码： ``` terminate called after throwing an instance of 'std::runtime_error' what(): 值不可调用：第 63 行，第 78 列处为 null： {%- if '' in content %} {%- set reasoning_content = ((content.split('')|first).rstrip('\n').split('')|last).lstrip('\n') %} ^ ``` {% endcolumn %} {% endcolumns %} 我们已经成功修复了 VL 模型的 Thinking 聊天模板，因此我们重新上传了所有 Thinking 量化版本以及 Unsloth 的量化版本。现在它们都应该能在第二轮对话后正常工作 - **其他量化版本将在第二轮对话后加载失败。** ### **Qwen3-VL Unsloth 上传**: 自 2025 年 10 月 30 日起，llama.cpp 已支持 Qwen3-VL 的 GGUF，因此你现在可以在本地运行它们！ | 动态 GGUF（用于运行） | 4-bit BnB Unsloth Dynamic | 16 位全精度 | | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | |

| ### 📖 Llama.cpp：运行 Qwen3-VL 教程 1. 获取最新的 `llama.cpp` 在 [GitHub 这里](https://github.com/ggml-org/llama.cpp)。你也可以按照下面的构建说明进行操作。将 `-DGGML_CUDA=ON` 改为 `-DGGML_CUDA=OFF` 如果你没有 GPU，或者只想进行 CPU 推理。 **对于 Apple Mac / Metal 设备**，设置 `-DGGML_CUDA=OFF` 然后照常继续 - Metal 支持默认已开启。 ```bash apt-get update apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y git clone https://github.com/ggml-org/llama.cpp cmake llama.cpp -B llama.cpp/build \\ -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON cmake --build llama.cpp/build --config Release -j --clean-first cp llama.cpp/build/bin/llama-* llama.cpp ``` 2. **先获取一张图片！** 你也可以上传图片。我们将使用，这只是我们的迷你标志，展示了 finetunes 是如何由 Unsloth 制作的：

3. 让我们下载这张图片 {% code overflow="wrap" %} ```bash wget https://raw.githubusercontent.com/unslothai/unsloth/refs/heads/main/images/unsloth%20made%20with%20love.png -O unsloth.png ``` {% endcode %} 4. 让我们获取第二张图片：

{% code overflow="wrap" %} ```bash wget https://files.worldwildlife.org/wwfcmsprod/images/Sloth_Sitting_iStock_3_12_2014/story_full_width/8l7pbjmj29_iStock_000011145477Large_mini__1_.jpg -O picture.png ``` {% endcode %} 5. 然后，让我们使用 llama.cpp 的自动模型下载功能，8B Instruct 模型可以尝试这样： ```bash ./llama.cpp/llama-mtmd-cli \\ -hf unsloth/Qwen3-VL-8B-Instruct-GGUF:UD-Q4_K_XL \\ --n-gpu-layers 99 \\ --jinja \\ --top-p 0.8 \\ --top-k 20 \\ --temp 0.7 \\ --min-p 0.0 \\ --flash-attn on \\ --presence-penalty 1.5 \\ --ctx-size 8192 ``` 6. 进入后，你会看到下面的界面：

7. 通过以下方式加载图片 `/image PATH` 即 `/image unsloth.png` 然后按 ENTER

8. 当你按下 ENTER 时，它会显示“unsloth.png image loaded”

9. 现在让我们问一个问题，比如“这是什么图片？”：

10. 现在通过以下方式加载图片 2 `/image picture.png` 然后按下 ENTER 并询问“这是什么图片？”

11. 最后让我们问一下这两张图片之间有什么关系（它有效！） {% code overflow="wrap" %} ``` 这两张图片直接相关，因为它们都展示了**树懒**，而树懒是“made with unsloth”项目的核心主题。 - 第一张图片是“made with unsloth”项目的**官方标志**。它展示了一个风格化的卡通树懒角色，位于绿色圆圈内，旁边有“made with unsloth”文字。这就是该项目的视觉标识。 - 第二张图片是一只真实树懒在其自然栖息地中的**照片**。这张照片捕捉了这种动物在野外的外观和行为。这两张图片之间的关系是：标志（图片 1）是用于推广“made with unsloth”项目的数字化表示或符号，而照片（图片 2）是真实世界中实际树懒的描绘。该项目很可能将标志中的角色用作图标或吉祥物，而这张照片则用于展示树懒在自然环境中的样子。 ``` {% endcode %}

12. 你也可以通过（在安装后）下载模型 `pip install huggingface_hub hf_transfer` ）HuggingFace 的 `snapshot_download` 这对于大型模型下载很有用， **因为 llama.cpp 的自动下载器可能会变慢。** 你可以选择 Q4\_K\_M，或其他量化版本。 ```python # !pip install huggingface_hub hf_transfer import os os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1" from huggingface_hub import snapshot_download snapshot_download( repo_id = "unsloth/Qwen3-VL-8B-Instruct-GGUF", # 或 "unsloth/Qwen3-VL-8B-Thinking-GGUF" local_dir = "unsloth/Qwen3-VL-8B-Instruct-GGUF", # 或 "unsloth/Qwen3-VL-8B-Thinking-GGUF" allow_patterns = ["*UD-Q4_K_XL*", "*mmproj-F16*"], ) ``` 13. 运行模型并尝试任意提示词。 **对于 Instruct：** ```bash ./llama.cpp/llama-mtmd-cli \\ --model unsloth/Qwen3-VL-8B-Instruct-GGUF/Qwen3-VL-8B-Instruct-UD-Q4_K_XL.gguf \\ --mmproj unsloth/Qwen3-VL-8B-Instruct-GGUF/mmproj-F16.gguf \\ --n-gpu-layers 99 \\ --jinja \\ --top-p 0.8 \\ --top-k 20 \\ --temp 0.7 \\ --min-p 0.0 \\ --flash-attn on \\ --presence-penalty 1.5 \\ --ctx-size 8192 ``` 14. **对于 Thinking**: ```bash ./llama.cpp/llama-mtmd-cli \\ --model unsloth/Qwen3-VL-8B-Thinking-GGUF/Qwen3-VL-8B-Thinking-UD-Q4_K_XL.gguf \\ --mmproj unsloth/Qwen3-VL-8B-Thinking-GGUF/mmproj-F16.gguf \\ --n-gpu-layers 99 \\ --jinja \\ --top-p 0.95 \\ --top-k 20 \\ --temp 1.0 \\ --min-p 0.0 \\ --flash-attn on \\ --presence-penalty 0.0 \\ --ctx-size 8192 ``` ### :magic\_wand:运行 Qwen3-VL-235B-A22B 和 Qwen3-VL-30B-A3B 对于 Qwen3-VL-235B-A22B，我们将使用 llama.cpp 进行优化推理，并提供大量选项。 1. 我们沿用与上面类似的步骤，不过这次还需要执行额外步骤，因为模型非常大。 2. 通过以下方式下载模型（在安装后） `pip install huggingface_hub hf_transfer` ）。你可以选择 UD-Q2\_K\_XL，或其他量化版本。。 ```python # !pip install huggingface_hub hf_transfer import os os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1" from huggingface_hub import snapshot_download snapshot_download( repo_id = "unsloth/Qwen3-VL-235B-A22B-Instruct-GGUF", local_dir = "unsloth/Qwen3-VL-235B-A22B-Instruct-GGUF", allow_patterns = ["*UD-Q2_K_XL*", "*mmproj-F16*"], ) ``` 3. 运行模型并尝试一个提示词。为 Thinking 与 Instruct 设置正确的参数。 **Instruct：** {% code overflow="wrap" %} ```bash ./llama.cpp/llama-mtmd-cli \\ --model unsloth/Qwen3-VL-235B-A22B-Instruct-GGUF/UD-Q2_K_XL/Qwen3-VL-235B-A22B-Instruct-UD-Q2_K_XL-00001-of-00002.gguf \\ --mmproj unsloth/Qwen3-VL-235B-A22B-Instruct-GGUF/mmproj-F16.gguf --jinja \\ --top-p 0.8 \\ --top-k 20 \\ --temp 0.7 \\ --min-p 0.0 \\ --flash-attn on \\ --presence-penalty 1.5 \\ --ctx-size 8192 \\ ``` {% endcode %} **Thinking：** {% code overflow="wrap" %} ```bash ./llama.cpp/llama-mtmd-cli \\ --model unsloth/Qwen3-VL-235B-A22B-Thinking-GGUF/UD-Q2_K_XL/Qwen3-VL-235B-A22B-Thinking-UD-Q2_K_XL-00001-of-00002.gguf \\ --mmproj unsloth/Qwen3-VL-235B-A22B-Thinking-GGUF/mmproj-F16.gguf \\ --n-gpu-layers 99 \\ --jinja \\ --top-p 0.95 \\ --top-k 20 \\ --temp 1.0 \\ --min-p 0.0 \\ --flash-attn on \\ --presence-penalty 0.0 \\ --ctx-size 8192 \\ -ot ".ffn_.*_exps.=CPU" ``` {% endcode %} 4. 编辑， `--ctx-size 16384` 用于上下文长度， `--n-gpu-layers 99` 用于将多少层卸载到 GPU。如果你的 GPU 显存不足，请尝试调整它。如果你只有 CPU 推理，也请移除它。 {% hint style="success" %} **使用 `--fit on` 于 2025 年 12 月 15 日引入，以最大化利用你的 GPU 和 CPU。** 可选地，使用 `-ot ".ffn_.*_exps.=CPU"` 将所有 MoE 层卸载到 CPU！这实际上使你可以将所有非 MoE 层放在 1 张 GPU 上，从而提高生成速度。如果你有更多 GPU 容量，可以自定义正则表达式以放入更多层。 {% endhint %} ### 🐋 Docker：运行 Qwen3-VL 如果你已经安装了 Docker desktop，要从 Hugging Face 运行 Unsloth 的模型，只需运行下面的命令即可： ```bash docker model pull hf.co/unsloth/Qwen3-VL-8B-Instruct-GGUF:UD-Q4_K_XL ``` 或者你也可以运行 Docker 上已上传的 Qwen3-VL 模型： ```bash docker model run ai/qwen3-vl ``` ## 🦥 **微调 Qwen3-VL** Unsloth 支持对 Qwen3-VL 进行微调和强化学习（RL），包括更大的 32B 和 235B 模型。这还包括对视频和目标检测微调的支持。和往常一样，Unsloth 使 Qwen3-VL 模型训练速度提高 1.7 倍，显存占用减少 60%，上下文长度扩大 8 倍且没有准确率下降。\ \ 我们制作了两个 Qwen3-VL（8B）训练笔记本，你可以在 Colab 上免费训练： * [常规 SFT 微调笔记本](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen3_VL_\(8B\)-Vision.ipynb) * [GRPO/GSPO RL 笔记本](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen3_VL_\(8B\)-Vision-GRPO.ipynb) {% hint style="success" %} **将 Qwen3-VL 保存为 GGUF 现在可以正常工作了，因为 llama.cpp 刚刚支持了它！** 如果你想使用其他任何 Qwen3-VL 模型，只需将 8B 模型改为 2B、32B 等版本即可。 {% endhint %} GRPO 笔记本的目标是让一个视觉语言模型在给定如下图所示的图像输入时，通过 RL 解决数学问题：

这项 Qwen3-VL 支持还集成了我们最新的更新，使 RL 更省内存、更快速，其中包括我们的 [待机功能](/docs/zh/kai-shi-shi-yong/reinforcement-learning-rl-guide/memory-efficient-rl.md#unsloth-standby)，与其他实现相比，它能独特地限制速度下降。你可以通过我们的 [VLM GRPO 指南](/docs/zh/kai-shi-shi-yong/reinforcement-learning-rl-guide/vision-reinforcement-learning-vlm-rl.md). ### 多图训练为了使用多张图片对 Qwen3-VL 进行微调或训练，最直接的更改是将 ```python ds_converted = ds.map( convert_to_conversation, ) ``` 改为： ```python ds_converted = [convert_to_converation(sample) for sample in dataset] ``` 使用 map 会触发数据集标准化和 arrow 处理规则，而这些规则可能更严格，也更复杂。 --- # Agent Instructions: Querying This Documentation If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question. Perform an HTTP GET request on the current page URL with the `ask` query parameter: ``` GET https://unsloth.ai/docs/zh/mo-xing/tutorials/qwen3-how-to-run-and-fine-tune/qwen3-vl-how-to-run-and-fine-tune.md?ask= ``` The question should be specific, self-contained, and written in natural language. The response will contain a direct answer to the question and relevant excerpts and sources from the documentation. Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.