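💜 Qwen3.5 - How to Run Locally Guide

Run the new Qwen3.5 LLMs, including Qwen3.5-397B-A17B, on your local device!

Qwen3.5 is Alibaba's new model family, including Qwen3.5-397B-A17B, a multimodal reasoning model with 397B parameters (17B active) whose performance is comparable to Gemini 3 Pro, Claude Opus 4.5, and GPT-5.2. It supports a 256K context window (extensible to 1M) across 201 languages, offers thinking and non-thinking modes, and excels at coding, vision, agents, chat, and long-context tasks.

The full Qwen3.5-397B-A17B model is ~807GB on disk. You can run the 3-bit quant on a 192GB Mac / RAM device, or the 4-bit quant on a 256GB Mac: Qwen3.5-397B-A17B GGUF

All uploads use Unsloth Dynamic 2.0 for SOTA quantization performance, so the 4-bit quants have important layers upcast to 8 or 16 bits. Thank you to Qwen for giving Unsloth day-zero access.

⚙️ Usage Guide

Unsloth's 4-bit dynamic quant UD-Q4_K_XL uses 214GB of disk space. It fits directly on a 256GB M3 Ultra, and it also reaches 25+ tokens/s on a single 24GB GPU with 256GB of RAM when MoE offloading is used. The 3-bit quant fits in 192GB of RAM, and the 8-bit quant requires 512GB of RAM/VRAM.

For best performance, make sure your combined VRAM + RAM is at least the size of the quant you download. If it is not, llama.cpp can still run the model through hard disk / SSD offloading, but inference will be slower.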
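
The sizing rule above can be sketched as a quick check. This is only an illustration under stated assumptions: the helper names `fits_in_memory` and `bits_per_weight` are hypothetical (they are not part of llama.cpp or Unsloth), and the sizes are the approximate figures quoted in this guide, read as decimal GB.

```python
def fits_in_memory(quant_size_gb: float, vram_gb: float, ram_gb: float) -> bool:
    """Return True when VRAM + RAM is at least the quant's size on disk."""
    return vram_gb + ram_gb >= quant_size_gb


def bits_per_weight(size_gb: float, params_billion: float) -> float:
    """Approximate average bits stored per parameter (decimal GB assumed)."""
    return size_gb * 8 / params_billion


# Figures quoted in this guide: UD-Q4_K_XL is about 214GB, the model has 397B parameters.
print(fits_in_memory(214, vram_gb=24, ram_gb=256))  # one 24GB GPU plus 256GB RAM
print(fits_in_memory(214, vram_gb=0, ram_gb=192))   # 192GB machine: needs disk offload
print(round(bits_per_weight(214, 397), 2))          # average bits per weight of the quant
```

The average comes out slightly above 4 bits per weight, which is consistent with a dynamic quant that keeps some layers at higher precision.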

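Qwen3.5-397B-A17B Tutorial:

In this guide we use the Dynamic MXFP4_MOE quant, which runs well on a 256GB RAM / Mac device for fast inference:

✨ Run in llama.cpp

Obtain the latest llama.cpp from its GitHub repository (https://github.com/ggml-org/llama.cpp), or follow the build instructions below. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you do not have a GPU or only want CPU inference.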

apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-mtmd-cli llama-server llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp

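If you want llama.cpp to load the model directly, use the commands below. The suffix after the colon (:MXFP4_MOE in these commands) is the quantization type. You can also download the files from Hugging Face first, as described further down. This works much like ollama run. Use export LLAMA_CACHE="folder" to force llama.cpp to save to a specific location. Remember that the model's native maximum context length is 256K.

Use the following for thinking mode: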

export LLAMA_CACHE="unsloth/Qwen3.5-397B-A17B-GGUF"
./llama.cpp/llama-cli \
    -hf unsloth/Qwen3.5-397B-A17B-GGUF:MXFP4_MOE \
    --ctx-size 16384 \
    --temp 0.6 \
    --top-p 0.95 \
    --top-k 20 \
    --min-p 0.00

Use the following for non-thinking mode:

export LLAMA_CACHE="unsloth/Qwen3.5-397B-A17B-GGUF"
./llama.cpp/llama-cli \
    -hf unsloth/Qwen3.5-397B-A17B-GGUF:MXFP4_MOE \
    --ctx-size 16384 \
    --temp 0.7 \
    --top-p 0.8 \
    --top-k 20 \
    --min-p 0.00 \
    --chat-template-kwargs "{\"enable_thinking\": false}"
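
The two commands above differ only in their sampling values and in the extra chat-template argument. A minimal sketch of that difference, assuming you launch llama-cli from Python: the `PRESETS` table and the `llama_cli_args` helper are hypothetical names, while every flag and value is copied from the commands above.

```python
import json
import shlex

# Sampling presets copied from the two commands above.
PRESETS = {
    "thinking": {"temp": 0.6, "top-p": 0.95, "top-k": 20, "min-p": 0.0},
    "non-thinking": {"temp": 0.7, "top-p": 0.8, "top-k": 20, "min-p": 0.0},
}


def llama_cli_args(mode: str, ctx_size: int = 16384) -> list[str]:
    """Build the llama-cli argument list for one of the two modes."""
    args = [
        "./llama.cpp/llama-cli",
        "-hf", "unsloth/Qwen3.5-397B-A17B-GGUF:MXFP4_MOE",
        "--ctx-size", str(ctx_size),
    ]
    for flag, value in PRESETS[mode].items():
        args += [f"--{flag}", str(value)]
    if mode == "non-thinking":
        # json.dumps produces the same JSON the shell command escapes by hand.
        args += ["--chat-template-kwargs", json.dumps({"enable_thinking": False})]
    return args


print(shlex.join(llama_cli_args("non-thinking")))
```

The resulting list can be passed to `subprocess.run` unchanged, which avoids shell-quoting the JSON argument.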

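Download the model as shown below (after running pip install huggingface_hub hf_transfer). You can choose MXFP4_MOE (dynamic 4-bit) or another quantized version such as UD-Q4_K_XL. To balance size and accuracy, we recommend using at least the 2-bit dynamic quant, UD-Q2_K_XL.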

hf download unsloth/Qwen3.5-397B-A17B-GGUF \
    --local-dir unsloth/Qwen3.5-397B-A17B-GGUF \
    --include "*MXFP4_MOE*" # Use "*UD-Q2_K_XL*" for Dynamic 2bit
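
The same download can also be done from Python. This is a sketch, assuming the `huggingface_hub` package installed above; `download_quant` is a hypothetical helper name, and `allow_patterns` plays the role of `--include` in the command above.

```python
def download_quant(pattern: str = "*MXFP4_MOE*") -> str:
    """Download only the files matching `pattern` and return the local folder."""
    # Imported lazily so the function can be defined before the package is installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id="unsloth/Qwen3.5-397B-A17B-GGUF",
        local_dir="unsloth/Qwen3.5-397B-A17B-GGUF",
        allow_patterns=[pattern],  # for Dynamic 2-bit use "*UD-Q2_K_XL*"
    )
```

Calling `download_quant()` fetches the MXFP4_MOE shards. The download is several hundred GB, so check free disk space first.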

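You can edit --threads 32 to set the number of CPU threads, --ctx-size 16384 to set the context length, and --n-gpu-layers 2 to set how many layers are offloaded to the GPU. Lower --n-gpu-layers if your GPU runs out of memory, and remove it entirely for CPU-only inference.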

./llama.cpp/llama-cli \
    --model unsloth/Qwen3.5-397B-A17B-GGUF/MXFP4_MOE/Qwen3.5-397B-A17B-MXFP4_MOE-00001-of-00006.gguf \
    --ctx-size 16384 \
    --temp 0.6 \
    --top-p 0.95 \
    --top-k 20 \
    --min-p 0.00 \
    --seed 3407
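
The --model path above points at the first of several shards. A small sketch for checking that all shards are present before launching, assuming shard names follow the -0000i-of-0000N.gguf pattern visible in that path; `expected_shards` is a hypothetical helper, not a llama.cpp function.

```python
import re
from pathlib import Path


def expected_shards(first_shard: str) -> list[Path]:
    """List every shard path implied by a name ending in -00001-of-0000N.gguf."""
    path = Path(first_shard)
    match = re.search(r"-(\d{5})-of-(\d{5})\.gguf$", path.name)
    if match is None:
        return [path]  # single-file GGUF, nothing to expand
    total = int(match.group(2))
    prefix = path.name[: match.start()]
    return [
        path.with_name(f"{prefix}-{index:05d}-of-{total:05d}.gguf")
        for index in range(1, total + 1)
    ]


first = "unsloth/Qwen3.5-397B-A17B-GGUF/MXFP4_MOE/Qwen3.5-397B-A17B-MXFP4_MOE-00001-of-00006.gguf"
missing = [p for p in expected_shards(first) if not p.exists()]
print(f"{len(missing)} of {len(expected_shards(first))} shards missing")
```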

To disable thinking / reasoning, use --chat-template-kwargs "{\"enable_thinking\": false}"

🦙 Llama-server serving & OpenAI's completion library

To deploy Qwen3.5-397B-A17B for production, we use llama-server. In a new terminal, for example inside tmux, deploy the model with:

./llama.cpp/llama-server \
    --model unsloth/Qwen3.5-397B-A17B-GGUF/MXFP4_MOE/Qwen3.5-397B-A17B-MXFP4_MOE-00001-of-00006.gguf \
    --alias "unsloth/Qwen3.5-397B-A17B" \
    --temp 0.6 \
    --top-p 0.95 \
    --ctx-size 16384 \
    --top-k 20 \
    --min-p 0.00 \
    --port 8001
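
A model of this size can take a while to load, so it may help to wait until the server is ready before sending requests. The sketch below assumes llama-server exposes a GET /health endpoint that returns 200 once the model is loaded, which is believed to hold for recent llama.cpp builds but should be verified against your version. `wait_for_server` is a hypothetical helper that uses only the standard library.

```python
import time
import urllib.request


def wait_for_server(base_url: str = "http://127.0.0.1:8001",
                    timeout_s: float = 600.0, poll_s: float = 2.0) -> bool:
    """Poll GET {base_url}/health until it answers 200 or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            # Connection refused, or an HTTP error reply while the model loads.
            pass
        time.sleep(poll_s)
    return False
```

Call `wait_for_server()` after starting llama-server and continue only when it returns True.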

Then, in a new terminal, after running pip install openai, run:

from openai import OpenAI
import json
openai_client = OpenAI(
    base_url = "http://127.0.0.1:8001/v1",
    api_key = "sk-no-key-required",
)
completion = openai_client.chat.completions.create(
    model = "unsloth/Qwen3.5-397B-A17B",
    messages = [{"role": "user", "content": "Create a Snake game."},],
)
print(completion.choices[0].message.content)
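
Long generations are easier to follow when streamed. A minimal sketch using the standard `stream=True` option of the OpenAI client; `stream_chat` is a hypothetical helper, and it assumes the server streams chat completions in the usual OpenAI chunk format.

```python
def stream_chat(client, model: str, prompt: str) -> str:
    """Print tokens as they arrive and return the full reply text."""
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,  # standard OpenAI streaming flag
    )
    parts = []
    for chunk in stream:
        if not chunk.choices:
            continue  # some chunks carry no choices (for example usage-only chunks)
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)
            parts.append(delta)
    print()
    return "".join(parts)
```

With the client created above, this would be called as `stream_chat(openai_client, "unsloth/Qwen3.5-397B-A17B", "Create a Snake game.")`.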

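🔨 Tool Calling with Qwen3.5

See the Tool Calling Guide for more details on how to do tool calling. In a new terminal (if you are using tmux, detach first with CTRL+B, then D), we create some tools, such as adding two numbers, executing Python code, and running Linux commands: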

import json, subprocess, random
from typing import Any
def add_number(a: float | str, b: float | str) -> float:
    return float(a) + float(b)
def multiply_number(a: float | str, b: float | str) -> float:
    return float(a) * float(b)
def substract_number(a: float | str, b: float | str) -> float:
    return float(a) - float(b)
def write_a_story() -> str:
    return random.choice([
        "Once upon a time in a galaxy far, far away...",
        "There were two friends who loved sloths and code...",
        "The world was ending because every sloth had evolved superhuman intelligence...",
        "Unbeknownst to one friend, the other accidentally wrote a program that makes sloths evolve...",
    ])
def terminal(command: str) -> str:
    if "rm" in command or "sudo" in command or "dd" in command or "chmod" in command:
        msg = "Cannot execute 'rm, sudo, dd, chmod' commands since they are dangerous"
        print(msg); return msg
    print(f"Executing terminal command `{command}`")
    try:
        return str(subprocess.run(command, capture_output = True, text = True, shell = True, check = True).stdout)
    except subprocess.CalledProcessError as e:
        return f"Command failed: {e.stderr}"
def python(code: str) -> str:
    data = {}
    exec(code, data)
    del data["__builtins__"]
    return str(data)
MAP_FN = {
    "add_number": add_number,
    "multiply_number": multiply_number,
    "substract_number": substract_number,
    "write_a_story": write_a_story,
    "terminal": terminal,
    "python": python,
}
tools = [
    {
        "type": "function",
        "function": {
            "name": "add_number",
            "description": "Add two numbers.",
            "parameters": {
                "type": "object",
                "properties": {
                    "a": {
                        "type": "string",
                        "description": "The first number.",
                    },
                    "b": {
                        "type": "string",
                        "description": "The second number.",
                    },
                },
                "required": ["a", "b"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "multiply_number",
            "description": "Multiply two numbers.",
            "parameters": {
                "type": "object",
                "properties": {
                    "a": {
                        "type": "string",
                        "description": "The first number.",
                    },
                    "b": {
                        "type": "string",
                        "description": "The second number.",
                    },
                },
                "required": ["a", "b"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "substract_number",
            "description": "Subtract two numbers.",
            "parameters": {
                "type": "object",
                "properties": {
                    "a": {
                        "type": "string",
                        "description": "The first number.",
                    },
                    "b": {
                        "type": "string",
                        "description": "The second number.",
                    },
                },
                "required": ["a", "b"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "write_a_story",
            "description": "Write a random story.",
            "parameters": {
                "type": "object",
                "properties": {},
                "required": [],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "terminal",
            "description": "Perform operations from the terminal.",
            "parameters": {
                "type": "object",
                "properties": {
                    "command": {
                        "type": "string",
                        "description": "The command you wish to launch, e.g. `ls`, `rm`, ...",
                    },
                },
                "required": ["command"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "python",
            "description": "Call a Python interpreter with some Python code that will be run.",
            "parameters": {
                "type": "object",
                "properties": {
                    "code": {
                        "type": "string",
                        "description": "The Python code to run",
                    },
                },
                "required": ["code"],
            },
        },
    },
]

We then use the function below (copy, paste, and execute), which parses the function calls automatically and calls the OpenAI endpoint for any model:

from openai import OpenAI
def unsloth_inference(
    messages,
    temperature = 0.6,
    top_p = 0.95,
    top_k = 20,
    min_p = 0.00,
    repetition_penalty = 1.0,
):
    messages = messages.copy()
    openai_client = OpenAI(
        base_url = "http://127.0.0.1:8001/v1",
        api_key = "sk-no-key-required",
    )
    model_name = next(iter(openai_client.models.list())).id
    print(f"Using model = {model_name}")
    has_tool_calls = True
    original_messages_len = len(messages)
    while has_tool_calls:
        print(f"Current messages = {messages}")
        response = openai_client.chat.completions.create(
            model = model_name,
            messages = messages,
            temperature = temperature,
            top_p = top_p,
            tools = tools if tools else None,
            tool_choice = "auto" if tools else None,
            extra_body = {"top_k": top_k, "min_p": min_p, "repetition_penalty" :repetition_penalty,}
        )
        tool_calls = response.choices[0].message.tool_calls or []
        content = response.choices[0].message.content or ""
        tool_calls_dict = [tc.to_dict() for tc in tool_calls] if tool_calls else tool_calls
        messages.append({"role": "assistant", "tool_calls": tool_calls_dict, "content": content,})
        for tool_call in tool_calls:
            fx, args, _id = tool_call.function.name, tool_call.function.arguments, tool_call.id
            out = MAP_FN[fx](**json.loads(args))
            messages.append({"role": "tool", "tool_call_id": _id, "name": fx, "content": str(out),})
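        # Keep looping while the model requests tools, so it can read their results;
        # a for/else without break would always stop after the first round.
        has_tool_calls = len(tool_calls) > 0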
    return messages
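
The dispatch step inside the loop above, `MAP_FN[fx](**json.loads(args))`, raises if the model names an unknown tool or emits malformed arguments. A hedged sketch of a more forgiving variant follows. `run_tool_call` is a hypothetical helper, and returning the error text to the model (instead of raising) is a design choice assumed here, not something the original code does.

```python
import json


def run_tool_call(name: str, raw_args: str, registry: dict) -> str:
    """Execute one tool call and always return a string for the "tool" message."""
    if name not in registry:
        return f"Error: unknown tool '{name}'"
    try:
        kwargs = json.loads(raw_args or "{}")
        return str(registry[name](**kwargs))
    except (json.JSONDecodeError, TypeError, ValueError) as e:
        # Malformed JSON or wrong arguments: report it back instead of crashing.
        return f"Error calling '{name}': {e}"


registry = {"add_number": lambda a, b: float(a) + float(b)}
print(run_tool_call("add_number", '{"a": "2", "b": "3"}', registry))
print(run_tool_call("add_number", '{"a": "2"}', registry))
print(run_tool_call("divide", "{}", registry))
```

Passing `MAP_FN` as the registry would let the loop feed such error strings back to the model, which can then retry with corrected arguments.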

After launching Qwen3.5 via llama-server as shown above (see the Tool Calling Guide for more details), we can then make some tool calls.

📊 Benchmarks

The benchmarks for Qwen3.5-397B-A17B are listed in table format below:

Language Benchmarks

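Knowledge

| Benchmark | GPT5.2 | Claude 4.5 Opus | Gemini-3 Pro | Qwen3-Max-Thinking | K2.5-1T-A32B | Qwen3.5-397B-A17B |
| --- | --- | --- | --- | --- | --- | --- |
| MMLU-Pro | 87.4 | 89.5 | 89.8 | 85.7 | 87.1 | 87.8 |
| MMLU-Redux | 95.0 | 95.6 | 95.9 | 92.8 | 94.5 | 94.9 |
| SuperGPQA | 67.9 | 70.6 | 74.0 | 67.3 | 69.2 | 70.4 |
| C-Eval | 90.5 | 92.2 | 93.4 | 93.7 | 94.0 | 93.0 |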
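
As a small worked example, the Knowledge scores above can be compared programmatically. This is a sketch: the `summarize` helper is hypothetical, the values are copied from the scores listed above, and the model order is assumed to match the order in which the models are listed.

```python
MODELS = ["GPT5.2", "Claude 4.5 Opus", "Gemini-3 Pro",
          "Qwen3-Max-Thinking", "K2.5-1T-A32B", "Qwen3.5-397B-A17B"]

# Knowledge scores as listed above, one value per model in MODELS order.
KNOWLEDGE = {
    "MMLU-Pro":   [87.4, 89.5, 89.8, 85.7, 87.1, 87.8],
    "MMLU-Redux": [95.0, 95.6, 95.9, 92.8, 94.5, 94.9],
    "SuperGPQA":  [67.9, 70.6, 74.0, 67.3, 69.2, 70.4],
    "C-Eval":     [90.5, 92.2, 93.4, 93.7, 94.0, 93.0],
}


def summarize(scores: dict[str, list[float]], model: str) -> dict[str, tuple[str, float]]:
    """For each benchmark, return the leading model and `model`'s gap to it."""
    col = MODELS.index(model)
    out = {}
    for bench, row in scores.items():
        best = max(row)
        out[bench] = (MODELS[row.index(best)], round(best - row[col], 1))
    return out


for bench, (leader, gap) in summarize(KNOWLEDGE, "Qwen3.5-397B-A17B").items():
    print(f"{bench}: leader={leader}, gap={gap}")
```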

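Instruction Following

| Benchmark | GPT5.2 | Claude 4.5 Opus | Gemini-3 Pro | Qwen3-Max-Thinking | K2.5-1T-A32B | Qwen3.5-397B-A17B |
| --- | --- | --- | --- | --- | --- | --- |
| IFEval | 94.8 | 90.9 | 93.5 | 93.4 | 93.9 | 92.6 |
| IFBench | 75.4 | 58.0 | 70.4 | 70.9 | 70.2 | 76.5 |
| MultiChallenge | 57.9 | 54.2 | 64.2 | 63.3 | 62.7 | 67.6 |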

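Long Context

| Benchmark | GPT5.2 | Claude 4.5 Opus | Gemini-3 Pro | Qwen3-Max-Thinking | K2.5-1T-A32B | Qwen3.5-397B-A17B |
| --- | --- | --- | --- | --- | --- | --- |
| AA-LCR | 72.7 | 74.0 | 70.7 | 68.7 | 70.0 | 68.7 |
| LongBench v2 | 54.5 | 64.4 | 68.2 | 60.6 | 61.0 | 63.2 |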

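STEM

| Benchmark | GPT5.2 | Claude 4.5 Opus | Gemini-3 Pro | Qwen3-Max-Thinking | K2.5-1T-A32B | Qwen3.5-397B-A17B |
| --- | --- | --- | --- | --- | --- | --- |
| GPQA | 92.4 | 87.0 | 91.9 | 87.4 | 87.6 | 88.4 |
| HLE | 35.5 | 30.8 | 37.5 | 30.2 | 30.1 | 28.7 |

Rows whose blank cells were lost (scores in listed order, column positions unknown):
HLE-Verified¹: 43.3, 38.8, 37.6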

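Reasoning

| Benchmark | GPT5.2 | Claude 4.5 Opus | Gemini-3 Pro | Qwen3-Max-Thinking | K2.5-1T-A32B | Qwen3.5-397B-A17B |
| --- | --- | --- | --- | --- | --- | --- |
| LiveCodeBench v6 | 87.7 | 84.8 | 90.7 | 85.9 | 85.0 | 83.6 |
| HMMT Feb 25 | 99.4 | 92.9 | 97.3 | 98.0 | 95.4 | 94.8 |
| IMOAnswerBench | 86.3 | 84.0 | 83.3 | 83.9 | 81.8 | 80.9 |

Rows whose blank cells were lost (scores in listed order, column positions unknown):
HMMT Nov 25: 100, 93.3, 94.7, 91.1, 92.7
AIME26: 96.7, 93.3, 90.6, 93.3, 91.3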

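General Agent

| Benchmark | GPT5.2 | Claude 4.5 Opus | Gemini-3 Pro | Qwen3-Max-Thinking | K2.5-1T-A32B | Qwen3.5-397B-A17B |
| --- | --- | --- | --- | --- | --- | --- |
| BFCL-V4 | 63.1 | 77.5 | 72.5 | 67.7 | 68.3 | 72.9 |
| TAU2-Bench | 87.1 | 91.6 | 85.4 | 84.6 | 77.0 | 86.7 |
| VITA-Bench | 38.2 | 56.3 | 51.6 | 40.9 | 41.9 | 49.7 |
| DeepPlanning | 44.6 | 33.9 | 23.3 | 28.7 | 14.5 | 34.3 |
| Tool Decathlon | 43.8 | 43.5 | 36.4 | 18.8 | 27.8 | 38.3 |
| MCP-Mark | 57.5 | 42.3 | 53.9 | 33.5 | 29.5 | 46.1 |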

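Search Agent³

| Benchmark | GPT5.2 | Claude 4.5 Opus | Gemini-3 Pro | Qwen3-Max-Thinking | K2.5-1T-A32B | Qwen3.5-397B-A17B |
| --- | --- | --- | --- | --- | --- | --- |
| HLE w/ tool | 45.5 | 43.4 | 45.8 | 49.8 | 50.2 | 48.3 |
| BrowseComp | 65.8 | 67.8 | 59.2 | 53.9 | --/74.9 | 69.0/78.6 |
| WideSearch | 76.8 | 76.4 | 68.0 | 57.9 | 72.7 | 74.0 |
| Seal-0 | 45.0 | 47.7 | 45.5 | 46.9 | 57.4 | 46.9 |

Rows whose blank cells were lost (scores in listed order, column positions unknown):
BrowseComp-zh: 76.1, 62.4, 66.8, 60.9, 70.3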

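Multilingualism

| Benchmark | GPT5.2 | Claude 4.5 Opus | Gemini-3 Pro | Qwen3-Max-Thinking | K2.5-1T-A32B | Qwen3.5-397B-A17B |
| --- | --- | --- | --- | --- | --- | --- |
| MMMLU | 89.5 | 90.1 | 90.6 | 84.4 | 86.0 | 88.5 |
| MMLU-ProX | 83.7 | 85.7 | 87.7 | 78.5 | 82.3 | 84.7 |
| INCLUDE | 87.5 | 86.2 | 90.5 | 82.3 | 83.3 | 85.6 |
| Global PIQA | 90.9 | 91.6 | 93.2 | 86.0 | 89.3 | 89.8 |
| PolyMATH | 62.5 | 79.0 | 81.6 | 64.7 | 43.1 | 73.3 |
| MAXIFE | 88.4 | 79.2 | 87.5 | 84.0 | 72.8 | 88.2 |

Rows whose blank cells were lost (scores in listed order, column positions unknown):
NOVA-63: 54.6, 56.7, 54.2, 56.0, 59.1
WMT24++: 78.8, 79.7, 80.7, 77.6, 78.9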

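Coding Agent

| Benchmark | GPT5.2 | Claude 4.5 Opus | Gemini-3 Pro | Qwen3-Max-Thinking | K2.5-1T-A32B | Qwen3.5-397B-A17B |
| --- | --- | --- | --- | --- | --- | --- |
| SWE-bench Verified | 80.0 | 80.9 | 76.2 | 75.3 | 76.8 | 76.4 |
| SWE-bench Multilingual | 72.0 | 77.5 | 65.0 | 66.7 | 73.0 | 72.0 |
| SecCodeBench | 68.7 | 68.6 | 62.4 | 57.5 | 61.3 | 68.3 |
| Terminal Bench 2 | 54.0 | 59.3 | 54.2 | 22.5 | 50.8 | 52.5 |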

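Notes
HLE-Verified: a verified and revised version of Humanity's Last Exam (HLE), accompanied by a transparent, item-by-item verification protocol and a fine-grained error taxonomy. The dataset is open-sourced at https://huggingface.co/datasets/skylenage/HLE-Verified
TAU2-Bench: we follow the official setup except for the airline domain, where all models are evaluated with the fixes proposed in the Claude Opus 4.5 system card.
MCPMark: the GitHub MCP server uses v0.30.3 from api.githubcopilot.com; Playwright tool responses are truncated at 32k tokens.
Search Agent: most search agents built on our model adopt a simple context-folding strategy (256k): once the cumulative tool response length reaches a preset threshold, earlier tool responses are pruned from the history to keep the context within limits.
BrowseComp: we tested two strategies. Simple context folding scored 69.0, while the discard-all strategy also used by DeepSeek-V3.2 and Kimi K2.5 scored 78.6.
WideSearch: we use a 256k context window without any context management.
MMLU-ProX: we report the average accuracy over 29 languages.
WMT24++: a harder subset of WMT24 obtained after difficulty labeling and rebalancing; we report the average score over 55 languages using XCOMET-XXL.
MAXIFE: we report accuracy on English + multilingual original prompts (23 settings in total).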

Empty cells (--) indicate scores that are not yet available or not applicable.
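
The context-folding strategy described in the Search Agent note can be sketched as follows. This is an illustration under assumptions, not the evaluation code: the threshold is measured in characters as a stand-in for tokens, pruned responses are replaced by a short placeholder (so each tool message keeps its tool_call_id), and `fold_context` and `PRUNED` are hypothetical names.

```python
PRUNED = "[earlier tool response pruned]"


def fold_context(messages: list[dict], max_tool_chars: int) -> list[dict]:
    """Blank out the oldest tool responses until the rest fit the budget."""
    tool_idx = [i for i, m in enumerate(messages) if m.get("role") == "tool"]
    total = sum(len(messages[i]["content"]) for i in tool_idx)
    folded = [dict(m) for m in messages]  # shallow copies, input stays untouched
    for i in tool_idx:  # oldest tool response first
        if total <= max_tool_chars:
            break
        total -= len(folded[i]["content"])
        folded[i]["content"] = PRUNED
    return folded


history = [
    {"role": "user", "content": "q"},
    {"role": "tool", "tool_call_id": "1", "content": "a" * 50},
    {"role": "tool", "tool_call_id": "2", "content": "b" * 50},
    {"role": "tool", "tool_call_id": "3", "content": "c" * 50},
]
folded = fold_context(history, max_tool_chars=100)
print([m["content"][:10] for m in folded])
```

Only the oldest response is pruned in this example, because the two most recent ones already fit the 100-character budget.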

Vision Language Benchmarks

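STEM and Puzzle

| Benchmark | GPT5.2 | Claude 4.5 Opus | Gemini-3 Pro | Qwen3-VL-235B-A22B | K2.5-1T-A32B | Qwen3.5-397B-A17B |
| --- | --- | --- | --- | --- | --- | --- |
| MMMU | 86.7 | 80.7 | 87.2 | 80.6 | 84.3 | 85.0 |
| MMMU-Pro | 79.5 | 70.6 | 81.0 | 69.3 | 78.5 | 79.0 |
| MathVision | 83.0 | 74.3 | 86.6 | 74.6 | 84.2 | 88.6 |
| Mathvista(mini) | 83.1 | 80.0 | 87.9 | 85.8 | 90.1 | 90.3 |
| We-Math | 79.0 | 70.0 | 86.9 | 74.8 | 84.7 | 87.9 |
| DynaMath | 86.8 | 79.7 | 85.1 | 82.8 | 84.4 | 86.3 |
| ZEROBench_sub | 33.2 | 28.4 | 39.0 | 28.4 | 33.5 | 41.0 |
| BabyVision | 34.4 | 14.2 | 49.7 | 22.2 | 36.5 | 52.3/43.3 |

ZEROBench is listed without scores.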

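General VQA

| Benchmark | GPT5.2 | Claude 4.5 Opus | Gemini-3 Pro | Qwen3-VL-235B-A22B | K2.5-1T-A32B | Qwen3.5-397B-A17B |
| --- | --- | --- | --- | --- | --- | --- |
| RealWorldQA | 83.3 | 77.0 | 83.3 | 81.3 | 81.0 | 83.9 |
| MMStar | 77.1 | 73.2 | 83.1 | 78.7 | 80.5 | 83.8 |
| HallusionBench | 65.2 | 64.1 | 68.6 | 66.7 | 69.8 | 71.4 |
| MMBench (EN-DEV-v1.1) | 88.2 | 89.2 | 93.7 | 89.7 | 94.2 | 93.7 |
| SimpleVQA | 55.8 | 65.7 | 73.2 | 61.3 | 71.2 | 67.1 |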

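Text Recognition and Document Understanding

| Benchmark | GPT5.2 | Claude 4.5 Opus | Gemini-3 Pro | Qwen3-VL-235B-A22B | K2.5-1T-A32B | Qwen3.5-397B-A17B |
| --- | --- | --- | --- | --- | --- | --- |
| OmniDocBench1.5 | 85.7 | 87.7 | 88.5 | 84.5 | 88.8 | 90.8 |
| CharXiv(RQ) | 82.1 | 68.5 | 81.4 | 66.1 | 77.5 | 80.8 |
| CC-OCR | 70.3 | 76.9 | 79.0 | 81.5 | 79.7 | 82.0 |
| AI2D_TEST | 92.2 | 87.7 | 94.1 | 89.2 | 90.8 | 93.9 |
| OCRBench | 80.7 | 85.8 | 90.4 | 87.5 | 92.3 | 93.1 |

Rows whose blank cells were lost (scores in listed order, column positions unknown):
MMLongBench-Doc: 61.9, 60.5, 56.2, 58.5, 61.5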

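Spatial Intelligence

| Benchmark | GPT5.2 | Claude 4.5 Opus | Gemini-3 Pro | Qwen3-VL-235B-A22B | K2.5-1T-A32B | Qwen3.5-397B-A17B |
| --- | --- | --- | --- | --- | --- | --- |
| CountBench | 91.9 | 90.6 | 97.3 | 93.7 | 94.1 | 97.2 |
| EmbSpatialBench | 81.3 | 75.7 | 61.2 | 84.3 | 77.4 | 84.5 |
| LingoQA | 68.8 | 78.8 | 72.8 | 66.8 | 68.2 | 81.6 |
| (name missing) | 75.9 | 67.0 | 88.0 | 85.9 | 77.0 | 95.8/91.1 |

Rows whose blank cells were lost (scores in listed order, column positions unknown):
ERQA: 59.8, 46.8, 70.5, 52.5, 67.5
RefCOCO(avg): 84.1, 91.1, 87.8, 92.3
ODInW13: 46.3, 43.2, 47.0
RefSpatialBench: 65.5, 69.9, 73.6
Hypersim: 11.0, 12.5
SUNRGBD: 34.9, 38.3
Nuscene: 13.9, 16.0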

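Video Understanding

| Benchmark | GPT5.2 | Claude 4.5 Opus | Gemini-3 Pro | Qwen3-VL-235B-A22B | K2.5-1T-A32B | Qwen3.5-397B-A17B |
| --- | --- | --- | --- | --- | --- | --- |
| VideoMME (without subtitles) | 85.8 | 81.4 | 87.7 | 79.0 | 83.2 | 83.7 |
| VideoMMMU | 85.9 | 84.4 | 87.6 | 80.0 | 86.6 | 84.7 |
| MLVU (M-Avg) | 85.6 | 81.7 | 83.0 | 83.8 | 85.0 | 86.7 |
| MVBench | 78.1 | 67.2 | 74.1 | 75.2 | 73.5 | 77.6 |
| LVBench | 73.7 | 57.3 | 76.2 | 63.6 | 75.9 | 75.5 |
| MMVU | 80.8 | 77.3 | 77.5 | 71.1 | 80.4 | 75.4 |

Rows whose blank cells were lost (scores in listed order, column positions unknown):
VideoMME (with subtitles): 77.6, 88.4, 83.8, 87.4, 87.5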

Visual Agent

Every row in this group lost its blank cells (scores in listed order, column positions unknown):
ScreenSpot Pro: 45.7, 72.7, 62.0, 65.6
OSWorld-Verified: 38.2, 66.3, 38.1, 63.3, 62.2
AndroidWorld: 63.7, 66.8

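Medical

| Benchmark | GPT5.2 | Claude 4.5 Opus | Gemini-3 Pro | Qwen3-VL-235B-A22B | K2.5-1T-A32B | Qwen3.5-397B-A17B |
| --- | --- | --- | --- | --- | --- | --- |
| VQA-RAD | 69.8 | 65.6 | 74.5 | 65.4 | 79.9 | 76.3 |
| SLAKE | 76.9 | 76.4 | 81.3 | 54.7 | 81.6 | 79.9 |
| OM-VQA | 72.9 | 75.5 | 80.3 | 65.4 | 87.4 | 85.1 |
| PMC-VQA | 58.9 | 59.9 | 62.3 | 41.2 | 63.3 | 64.2 |
| MedXpertQA-MM | 73.3 | 63.6 | 76.0 | 47.6 | 65.3 | 70.0 |

Notes
MathVision: our model's score is evaluated with a fixed prompt, e.g. "Please reason step by step, and put your final answer within \boxed{}." For other models, we report the higher score between runs with and without the \boxed{} formatting.
BabyVision: our model's score is reported with CI (Code Interpreter) enabled; without CI, the result is 43.3.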
