💜 Qwen3.5 - How to Run Locally Guide
Run the new Qwen3.5 LLMs including Qwen3.5-397B-A17B on your local device!
Qwen3.5 is Alibaba’s new model family, including Qwen3.5-397B-A17B, a 397B-parameter (17B active) multimodal reasoning model with performance comparable to Gemini 3 Pro, Claude Opus 4.5, and GPT-5.2. It supports 256K context (extendable to 1M) across 201 languages, offers thinking and non-thinking modes, and excels in coding, vision, agents, chat, and long-context tasks.
The full Qwen3.5-397B-A17B model takes ~807GB of disk space. You can run the 3-bit quant on a 192GB Mac or 192GB RAM device, or the 4-bit MXFP4 quant on a 256GB Mac: Qwen3.5-397B-A17B GGUF
All uploads use Unsloth Dynamic 2.0 for SOTA quantization performance - so the 4-bit quant has important layers upcast to 8 or 16-bit. Thank you Qwen for providing Unsloth with day zero access.
⚙️ Usage Guide
The Unsloth 4-bit dynamic quant UD-Q4_K_XL uses 214GB of disk space - this fits directly on a 256GB M3 Ultra, and also works well with a single 24GB GPU and 256GB of RAM using MoE offloading, reaching 25+ tokens/s. The 3-bit quant will fit in 192GB of RAM, and the 8-bit quant requires 512GB of RAM/VRAM.
For best performance, have your combined VRAM + RAM equal to the size of the quant you're downloading. If not, llama.cpp can offload to hard drive / SSD, but inference will be slower.
Recommended Settings
As Qwen3.5 is a hybrid reasoning model, thinking and non-thinking mode require different settings:

| Setting | Thinking mode | Non-thinking mode |
| --- | --- | --- |
| temperature | 0.6 | 0.7 |
| top_p | 0.95 | 0.8 |
| top_k | 20 | 20 |
| min_p | 0 | 0 |
| repeat_penalty | disabled or 1.0 | disabled or 1.0 |

Maximum context window: 262,144 tokens
presence_penalty = 0.0 to 2.0: off by default; you can use it to reduce repetitions, but a higher value may result in a slight decrease in performance.
Thinking: temperature=0.6, top_p=0.95, top_k=20, min_p=0
Non-Thinking: temperature=0.7, top_p=0.8, top_k=20, min_p=0
Adequate output length: 32,768 tokens for most queries
Qwen3.5-397B-A17B Tutorial:
For this guide we will use the Dynamic MXFP4_MOE quant, which fits nicely on a 256GB RAM / Mac device for fast inference:
✨ Run in llama.cpp
Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.
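For reference, a typical Linux build looks like this (a sketch following llama.cpp's standard CMake instructions; package names are for Debian/Ubuntu and flags may change between versions):

```bash
# Install build dependencies
apt-get update
apt-get install -y build-essential cmake curl libcurl4-openssl-dev

# Clone and build llama.cpp with CUDA (use -DGGML_CUDA=OFF for CPU-only inference)
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-server llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp/
```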
If you want to use llama.cpp directly to download and load the model, you can do the below, where :Q3_K_XL is the quantization type. You can also download the model via Hugging Face first (point 3). This is similar to ollama run. Use export LLAMA_CACHE="folder" to force llama.cpp to save to a specific location. Remember the model has a maximum context length of 262,144 tokens (256K).
Follow this for thinking mode:
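A minimal sketch, assuming the GGUF upload lives at unsloth/Qwen3.5-397B-A17B-GGUF (verify on the model page) and using the thinking-mode sampling settings from above; swap :Q3_K_XL for the quant you want:

```bash
export LLAMA_CACHE="unsloth/Qwen3.5-397B-A17B-GGUF"   # optional: where llama.cpp caches the download
# --n-gpu-layers 99 plus -ot ".ffn_.*_exps.=CPU" keeps attention on GPU and offloads MoE experts to RAM
./llama.cpp/llama-cli \
    -hf unsloth/Qwen3.5-397B-A17B-GGUF:Q3_K_XL \
    --jinja \
    --threads 32 \
    --ctx-size 16384 \
    --n-gpu-layers 99 \
    -ot ".ffn_.*_exps.=CPU" \
    --temp 0.6 \
    --top-p 0.95 \
    --top-k 20 \
    --min-p 0.0
```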
Follow this for non-thinking mode:
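The same sketch with the non-thinking sampling settings, and reasoning disabled through the chat template as described further below:

```bash
./llama.cpp/llama-cli \
    -hf unsloth/Qwen3.5-397B-A17B-GGUF:Q3_K_XL \
    --jinja \
    --chat-template-kwargs "{\"enable_thinking\": false}" \
    --threads 32 \
    --ctx-size 16384 \
    --n-gpu-layers 99 \
    -ot ".ffn_.*_exps.=CPU" \
    --temp 0.7 \
    --top-p 0.8 \
    --top-k 20 \
    --min-p 0.0
```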
Download the model via the snippet below (after installing the requirements with pip install huggingface_hub hf_transfer). You can choose MXFP4_MOE (dynamic 4-bit) or other quantized versions like UD-Q4_K_XL. We recommend using at least the 2-bit dynamic quant UD-Q2_K_XL to balance size and accuracy.
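A minimal download sketch, assuming the GGUF repo is named unsloth/Qwen3.5-397B-A17B-GGUF (check the actual model page) and fetching the MXFP4_MOE quant used in this tutorial:

```python
# pip install huggingface_hub hf_transfer
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"  # optional: faster downloads

from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="unsloth/Qwen3.5-397B-A17B-GGUF",   # assumed repo name - verify on Hugging Face
    local_dir="unsloth/Qwen3.5-397B-A17B-GGUF",
    allow_patterns=["*MXFP4_MOE*"],             # or e.g. "*UD-Q2_K_XL*", "*UD-Q4_K_XL*"
)
```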
You can edit --threads 32 for the number of CPU threads, --ctx-size 16384 for the context length, and --n-gpu-layers 2 for how many layers to offload to the GPU. Try adjusting it if your GPU runs out of memory, and remove it for CPU-only inference.
To disable thinking / reasoning, use --chat-template-kwargs "{\"enable_thinking\": false}"
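Putting those flags together, a local run might look like the following (a sketch; the GGUF path is illustrative - point --model at the first shard of whichever quant you actually downloaded):

```bash
# Adjust the filename below to match the files you downloaded
# Add --chat-template-kwargs "{\"enable_thinking\": false}" to disable thinking
./llama.cpp/llama-cli \
    --model unsloth/Qwen3.5-397B-A17B-GGUF/MXFP4_MOE/Qwen3.5-397B-A17B-MXFP4_MOE-00001-of-00005.gguf \
    --jinja \
    --threads 32 \
    --ctx-size 16384 \
    --n-gpu-layers 2 \
    --temp 0.6 \
    --top-p 0.95 \
    --top-k 20 \
    --min-p 0.0
```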
🦙 Llama-server serving & OpenAI's completion library
To deploy Qwen3.5-397B-A17B for production, we use llama-server. In a new terminal (say via tmux), deploy the model via:
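A serving sketch under the same assumptions as above (the model path, alias and port are illustrative; adjust them to your setup):

```bash
# Point --model at the first shard of the quant you downloaded
./llama.cpp/llama-server \
    --model unsloth/Qwen3.5-397B-A17B-GGUF/MXFP4_MOE/Qwen3.5-397B-A17B-MXFP4_MOE-00001-of-00005.gguf \
    --alias "qwen3.5-397b" \
    --threads 32 \
    --ctx-size 16384 \
    --n-gpu-layers 99 \
    -ot ".ffn_.*_exps.=CPU" \
    --jinja \
    --host 0.0.0.0 \
    --port 8001 \
    --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0
```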
Then in a new terminal, after running pip install openai, do:
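For example (the port 8001 and the "qwen3.5-397b" alias are the illustrative values from the serving sketch above):

```python
from openai import OpenAI

# Point the client at the local llama-server endpoint; the API key is unused locally
client = OpenAI(base_url="http://127.0.0.1:8001/v1", api_key="sk-no-key-needed")

completion = client.chat.completions.create(
    model="qwen3.5-397b",  # must match the served model name / alias
    messages=[
        {"role": "user", "content": "What is 2 + 2? Explain briefly."},
    ],
    temperature=0.6,
    top_p=0.95,
)
print(completion.choices[0].message.content)
```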
🔨 Tool Calling with Qwen3.5
See the Tool Calling Guide for more details on how to do tool calling. In a new terminal (if using tmux, press CTRL+B then D to detach), we create some tools, such as adding 2 numbers, executing Python code, executing Linux commands and much more:
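For illustration, here is a sketch of two such tools in the OpenAI tools format (the add-two-numbers and Python-execution examples mentioned above; treat execute_python with care, since it runs arbitrary code):

```python
import subprocess
import sys

def add_numbers(a: float, b: float) -> str:
    """Add two numbers and return the result as a string."""
    return str(a + b)

def execute_python(code: str) -> str:
    """Run a Python snippet in a subprocess and return its stdout/stderr."""
    result = subprocess.run([sys.executable, "-c", code],
                            capture_output=True, text=True, timeout=30)
    return result.stdout + result.stderr

# JSON schemas the model sees, in the OpenAI tools format
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "add_numbers",
            "description": "Add two numbers together.",
            "parameters": {
                "type": "object",
                "properties": {
                    "a": {"type": "number"},
                    "b": {"type": "number"},
                },
                "required": ["a", "b"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "execute_python",
            "description": "Execute a Python code snippet and return its output.",
            "parameters": {
                "type": "object",
                "properties": {"code": {"type": "string"}},
                "required": ["code"],
            },
        },
    },
]

AVAILABLE_FUNCTIONS = {"add_numbers": add_numbers, "execute_python": execute_python}
```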
We then use the functions below (copy, paste and execute them), which parse the function calls automatically and call the OpenAI endpoint for any model:
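A minimal dispatch loop in that spirit (a sketch, not the exact helper from the Tool Calling Guide): it sends the conversation, executes any tool calls the model returns, appends the results, and asks the model to finish.

```python
import json

def run_with_tools(client, model, messages, tools=TOOLS, functions=AVAILABLE_FUNCTIONS):
    """Send messages, execute any returned tool calls, then return the final answer."""
    response = client.chat.completions.create(model=model, messages=messages, tools=tools)
    message = response.choices[0].message

    while message.tool_calls:
        # Record the assistant turn that requested the tools
        messages.append({
            "role": "assistant",
            "content": message.content or "",
            "tool_calls": [tc.model_dump() for tc in message.tool_calls],
        })
        # Execute each requested tool locally and append its result
        for tool_call in message.tool_calls:
            fn = functions[tool_call.function.name]
            args = json.loads(tool_call.function.arguments)
            messages.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": str(fn(**args)),
            })
        response = client.chat.completions.create(model=model, messages=messages, tools=tools)
        message = response.choices[0].message

    return message.content
```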
After launching Qwen3.5 via llama-server as shown above (see the Tool Calling Guide for more details), we can then make some tool calls:
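For example, against the server and tools defined above (the client, alias and port are the illustrative values from the earlier sketches):

```python
messages = [{"role": "user", "content": "Use the add_numbers tool to compute 2318 + 4521."}]
print(run_with_tools(client, model="qwen3.5-397b", messages=messages))
```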
📊 Benchmarks
Qwen3.5-397B-A17B benchmarks are shown in table format below:

Language Benchmarks
Knowledge

| Benchmark | GPT5.2 | Claude 4.5 Opus | Gemini-3 Pro | Qwen3-Max-Thinking | K2.5-1T-A32B | Qwen3.5-397B-A17B |
| --- | --- | --- | --- | --- | --- | --- |
| MMLU-Pro | 87.4 | 89.5 | 89.8 | 85.7 | 87.1 | 87.8 |
| MMLU-Redux | 95.0 | 95.6 | 95.9 | 92.8 | 94.5 | 94.9 |
| SuperGPQA | 67.9 | 70.6 | 74.0 | 67.3 | 69.2 | 70.4 |
| C-Eval | 90.5 | 92.2 | 93.4 | 93.7 | 94.0 | 93.0 |
Instruction Following

| Benchmark | GPT5.2 | Claude 4.5 Opus | Gemini-3 Pro | Qwen3-Max-Thinking | K2.5-1T-A32B | Qwen3.5-397B-A17B |
| --- | --- | --- | --- | --- | --- | --- |
| IFEval | 94.8 | 90.9 | 93.5 | 93.4 | 93.9 | 92.6 |
| IFBench | 75.4 | 58.0 | 70.4 | 70.9 | 70.2 | 76.5 |
| MultiChallenge | 57.9 | 54.2 | 64.2 | 63.3 | 62.7 | 67.6 |
Long Context

| Benchmark | GPT5.2 | Claude 4.5 Opus | Gemini-3 Pro | Qwen3-Max-Thinking | K2.5-1T-A32B | Qwen3.5-397B-A17B |
| --- | --- | --- | --- | --- | --- | --- |
| AA-LCR | 72.7 | 74.0 | 70.7 | 68.7 | 70.0 | 68.7 |
| LongBench v2 | 54.5 | 64.4 | 68.2 | 60.6 | 61.0 | 63.2 |
STEM

| Benchmark | GPT5.2 | Claude 4.5 Opus | Gemini-3 Pro | Qwen3-Max-Thinking | K2.5-1T-A32B | Qwen3.5-397B-A17B |
| --- | --- | --- | --- | --- | --- | --- |
| GPQA | 92.4 | 87.0 | 91.9 | 87.4 | 87.6 | 88.4 |
| HLE | 35.5 | 30.8 | 37.5 | 30.2 | 30.1 | 28.7 |
| HLE-Verified¹ | 43.3 | 38.8 | 48 | 37.6 | -- | 37.6 |
Reasoning

| Benchmark | GPT5.2 | Claude 4.5 Opus | Gemini-3 Pro | Qwen3-Max-Thinking | K2.5-1T-A32B | Qwen3.5-397B-A17B |
| --- | --- | --- | --- | --- | --- | --- |
| LiveCodeBench v6 | 87.7 | 84.8 | 90.7 | 85.9 | 85.0 | 83.6 |
| HMMT Feb 25 | 99.4 | 92.9 | 97.3 | 98.0 | 95.4 | 94.8 |
| HMMT Nov 25 | 100 | 93.3 | 93.3 | 94.7 | 91.1 | 92.7 |
| IMOAnswerBench | 86.3 | 84.0 | 83.3 | 83.9 | 81.8 | 80.9 |
| AIME26 | 96.7 | 93.3 | 90.6 | 93.3 | 93.3 | 91.3 |
General Agent

| Benchmark | GPT5.2 | Claude 4.5 Opus | Gemini-3 Pro | Qwen3-Max-Thinking | K2.5-1T-A32B | Qwen3.5-397B-A17B |
| --- | --- | --- | --- | --- | --- | --- |
| BFCL-V4 | 63.1 | 77.5 | 72.5 | 67.7 | 68.3 | 72.9 |
| TAU2-Bench | 87.1 | 91.6 | 85.4 | 84.6 | 77.0 | 86.7 |
| VITA-Bench | 38.2 | 56.3 | 51.6 | 40.9 | 41.9 | 49.7 |
| DeepPlanning | 44.6 | 33.9 | 23.3 | 28.7 | 14.5 | 34.3 |
| Tool Decathlon | 43.8 | 43.5 | 36.4 | 18.8 | 27.8 | 38.3 |
| MCP-Mark | 57.5 | 42.3 | 53.9 | 33.5 | 29.5 | 46.1 |
Search Agent³

| Benchmark | GPT5.2 | Claude 4.5 Opus | Gemini-3 Pro | Qwen3-Max-Thinking | K2.5-1T-A32B | Qwen3.5-397B-A17B |
| --- | --- | --- | --- | --- | --- | --- |
| HLE w/ tool | 45.5 | 43.4 | 45.8 | 49.8 | 50.2 | 48.3 |
| BrowseComp | 65.8 | 67.8 | 59.2 | 53.9 | --/74.9 | 69.0/78.6 |
| BrowseComp-zh | 76.1 | 62.4 | 66.8 | 60.9 | -- | 70.3 |
| WideSearch | 76.8 | 76.4 | 68.0 | 57.9 | 72.7 | 74.0 |
| Seal-0 | 45.0 | 47.7 | 45.5 | 46.9 | 57.4 | 46.9 |
Multilingualism

| Benchmark | GPT5.2 | Claude 4.5 Opus | Gemini-3 Pro | Qwen3-Max-Thinking | K2.5-1T-A32B | Qwen3.5-397B-A17B |
| --- | --- | --- | --- | --- | --- | --- |
| MMMLU | 89.5 | 90.1 | 90.6 | 84.4 | 86.0 | 88.5 |
| MMLU-ProX | 83.7 | 85.7 | 87.7 | 78.5 | 82.3 | 84.7 |
| NOVA-63 | 54.6 | 56.7 | 56.7 | 54.2 | 56.0 | 59.1 |
| INCLUDE | 87.5 | 86.2 | 90.5 | 82.3 | 83.3 | 85.6 |
| Global PIQA | 90.9 | 91.6 | 93.2 | 86.0 | 89.3 | 89.8 |
| PolyMATH | 62.5 | 79.0 | 81.6 | 64.7 | 43.1 | 73.3 |
| WMT24++ | 78.8 | 79.7 | 80.7 | 77.6 | 77.6 | 78.9 |
| MAXIFE | 88.4 | 79.2 | 87.5 | 84.0 | 72.8 | 88.2 |
Coding Agent

| Benchmark | GPT5.2 | Claude 4.5 Opus | Gemini-3 Pro | Qwen3-Max-Thinking | K2.5-1T-A32B | Qwen3.5-397B-A17B |
| --- | --- | --- | --- | --- | --- | --- |
| SWE-bench Verified | 80.0 | 80.9 | 76.2 | 75.3 | 76.8 | 76.4 |
| SWE-bench Multilingual | 72.0 | 77.5 | 65.0 | 66.7 | 73.0 | 72.0 |
| SecCodeBench | 68.7 | 68.6 | 62.4 | 57.5 | 61.3 | 68.3 |
| Terminal Bench 2 | 54.0 | 59.3 | 54.2 | 22.5 | 50.8 | 52.5 |
Notes
HLE-Verified: a verified and revised version of Humanity’s Last Exam (HLE), accompanied by a transparent, component-wise verification protocol and a fine-grained error taxonomy. We open-source the dataset at https://huggingface.co/datasets/skylenage/HLE-Verified.
TAU2-Bench: we follow the official setup except for the airline domain, where all models are evaluated by applying the fixes proposed in the Claude Opus 4.5 system card.
MCP-Mark: GitHub MCP server uses v0.30.3 from api.githubcopilot.com; Playwright tool responses are truncated at 32k tokens.
Search Agent: most Search Agents built on our model adopt a simple context-folding strategy (256k): once the cumulative Tool Response length reaches a preset threshold, earlier Tool Responses are pruned from the history to keep the context within limits.
BrowseComp: we tested two strategies, simple context-folding achieved a score of 69.0, while using the same discard-all strategy as DeepSeek-V3.2 and Kimi K2.5 achieved 78.6.
WideSearch: we use a 256k context window without any context management.
MMLU-ProX: we report the averaged accuracy on 29 languages.
WMT24++: a harder subset of WMT24 after difficulty labeling and rebalancing; we report the averaged scores on 55 languages using XCOMET-XXL.
MAXIFE: we report the accuracy on English + multilingual original prompts (totally 23 settings).
Empty cells (--) indicate scores not yet available or not applicable.
Vision Language Benchmarks
STEM and Puzzle

| Benchmark | GPT5.2 | Claude 4.5 Opus | Gemini-3 Pro | Qwen3-VL-235B-A22B | K2.5-1T-A32B | Qwen3.5-397B-A17B |
| --- | --- | --- | --- | --- | --- | --- |
| MMMU | 86.7 | 80.7 | 87.2 | 80.6 | 84.3 | 85.0 |
| MMMU-Pro | 79.5 | 70.6 | 81.0 | 69.3 | 78.5 | 79.0 |
| MathVision | 83.0 | 74.3 | 86.6 | 74.6 | 84.2 | 88.6 |
| MathVista (mini) | 83.1 | 80.0 | 87.9 | 85.8 | 90.1 | 90.3 |
| We-Math | 79.0 | 70.0 | 86.9 | 74.8 | 84.7 | 87.9 |
| DynaMath | 86.8 | 79.7 | 85.1 | 82.8 | 84.4 | 86.3 |
| ZEROBench | 9 | 3 | 10 | 4 | 9 | 12 |
| ZEROBench_sub | 33.2 | 28.4 | 39.0 | 28.4 | 33.5 | 41.0 |
| BabyVision | 34.4 | 14.2 | 49.7 | 22.2 | 36.5 | 52.3/43.3 |
General VQA

| Benchmark | GPT5.2 | Claude 4.5 Opus | Gemini-3 Pro | Qwen3-VL-235B-A22B | K2.5-1T-A32B | Qwen3.5-397B-A17B |
| --- | --- | --- | --- | --- | --- | --- |
| RealWorldQA | 83.3 | 77.0 | 83.3 | 81.3 | 81.0 | 83.9 |
| MMStar | 77.1 | 73.2 | 83.1 | 78.7 | 80.5 | 83.8 |
| HallusionBench | 65.2 | 64.1 | 68.6 | 66.7 | 69.8 | 71.4 |
| MMBench (EN-DEV-v1.1) | 88.2 | 89.2 | 93.7 | 89.7 | 94.2 | 93.7 |
| SimpleVQA | 55.8 | 65.7 | 73.2 | 61.3 | 71.2 | 67.1 |
Text Recognition and Document Understanding

| Benchmark | GPT5.2 | Claude 4.5 Opus | Gemini-3 Pro | Qwen3-VL-235B-A22B | K2.5-1T-A32B | Qwen3.5-397B-A17B |
| --- | --- | --- | --- | --- | --- | --- |
| OmniDocBench1.5 | 85.7 | 87.7 | 88.5 | 84.5 | 88.8 | 90.8 |
| CharXiv (RQ) | 82.1 | 68.5 | 81.4 | 66.1 | 77.5 | 80.8 |
| MMLongBench-Doc | -- | 61.9 | 60.5 | 56.2 | 58.5 | 61.5 |
| CC-OCR | 70.3 | 76.9 | 79.0 | 81.5 | 79.7 | 82.0 |
| AI2D_TEST | 92.2 | 87.7 | 94.1 | 89.2 | 90.8 | 93.9 |
| OCRBench | 80.7 | 85.8 | 90.4 | 87.5 | 92.3 | 93.1 |
Spatial Intelligence

| Benchmark | GPT5.2 | Claude 4.5 Opus | Gemini-3 Pro | Qwen3-VL-235B-A22B | K2.5-1T-A32B | Qwen3.5-397B-A17B |
| --- | --- | --- | --- | --- | --- | --- |
| ERQA | 59.8 | 46.8 | 70.5 | 52.5 | -- | 67.5 |
| CountBench | 91.9 | 90.6 | 97.3 | 93.7 | 94.1 | 97.2 |
| RefCOCO (avg) | -- | -- | 84.1 | 91.1 | 87.8 | 92.3 |
| ODInW13 | -- | -- | 46.3 | 43.2 | -- | 47.0 |
| EmbSpatialBench | 81.3 | 75.7 | 61.2 | 84.3 | 77.4 | 84.5 |
| RefSpatialBench | -- | -- | 65.5 | 69.9 | -- | 73.6 |
| LingoQA | 68.8 | 78.8 | 72.8 | 66.8 | 68.2 | 81.6 |
| V* | 75.9 | 67.0 | 88.0 | 85.9 | 77.0 | 95.8/91.1 |
| Hypersim | -- | -- | -- | 11.0 | -- | 12.5 |
| SUNRGBD | -- | -- | -- | 34.9 | -- | 38.3 |
| Nuscene | -- | -- | -- | 13.9 | -- | 16.0 |
Video Understanding

| Benchmark | GPT5.2 | Claude 4.5 Opus | Gemini-3 Pro | Qwen3-VL-235B-A22B | K2.5-1T-A32B | Qwen3.5-397B-A17B |
| --- | --- | --- | --- | --- | --- | --- |
| VideoMME (w sub.) | 86 | 77.6 | 88.4 | 83.8 | 87.4 | 87.5 |
| VideoMME (w/o sub.) | 85.8 | 81.4 | 87.7 | 79.0 | 83.2 | 83.7 |
| VideoMMMU | 85.9 | 84.4 | 87.6 | 80.0 | 86.6 | 84.7 |
| MLVU (M-Avg) | 85.6 | 81.7 | 83.0 | 83.8 | 85.0 | 86.7 |
| MVBench | 78.1 | 67.2 | 74.1 | 75.2 | 73.5 | 77.6 |
| LVBench | 73.7 | 57.3 | 76.2 | 63.6 | 75.9 | 75.5 |
| MMVU | 80.8 | 77.3 | 77.5 | 71.1 | 80.4 | 75.4 |
Visual Agent

| Benchmark | GPT5.2 | Claude 4.5 Opus | Gemini-3 Pro | Qwen3-VL-235B-A22B | K2.5-1T-A32B | Qwen3.5-397B-A17B |
| --- | --- | --- | --- | --- | --- | --- |
| ScreenSpot Pro | -- | 45.7 | 72.7 | 62.0 | -- | 65.6 |
| OSWorld-Verified | 38.2 | 66.3 | -- | 38.1 | 63.3 | 62.2 |
| AndroidWorld | -- | -- | -- | 63.7 | -- | 66.8 |
Medical

| Benchmark | GPT5.2 | Claude 4.5 Opus | Gemini-3 Pro | Qwen3-VL-235B-A22B | K2.5-1T-A32B | Qwen3.5-397B-A17B |
| --- | --- | --- | --- | --- | --- | --- |
| VQA-RAD | 69.8 | 65.6 | 74.5 | 65.4 | 79.9 | 76.3 |
| SLAKE | 76.9 | 76.4 | 81.3 | 54.7 | 81.6 | 79.9 |
| OM-VQA | 72.9 | 75.5 | 80.3 | 65.4 | 87.4 | 85.1 |
| PMC-VQA | 58.9 | 59.9 | 62.3 | 41.2 | 63.3 | 64.2 |
| MedXpertQA-MM | 73.3 | 63.6 | 76.0 | 47.6 | 65.3 | 70.0 |
Notes
MathVision: our model’s score is evaluated using a fixed prompt, e.g., “Please reason step by step, and put your final answer within \boxed{}.” For other models, we report the higher score between runs with and without the \boxed{} formatting.
BabyVision: our model’s score is reported with CI (Code Interpreter) enabled; without CI, the result is 43.3.
V*: our model’s score is reported with CI (Code Interpreter) enabled; without CI, the result is 91.1.
Empty cells (--) indicate scores not yet available or not applicable.