💜Qwen3.5 - How to Run Locally Guide
Run the new Qwen3.5 LLMs including new Medium series: Qwen3.5-35B-A3B, 27B, 122B-A10B, and 397B-A17B on your local device!
Qwen3.5 is Alibaba’s new model family, including Qwen3.5-35B-A3B, 27B, 122B-A10B and 397B-A17B. These multimodal hybrid reasoning LLMs deliver the strongest performance for their sizes. They support 256K context across 201 languages, have thinking and non-thinking modes, and excel in agentic coding, vision, chat, and long-context tasks. The 35B and 27B models run on a device with 21GB of RAM or unified memory (e.g. a Mac). See all GGUFs here.
All Qwen3.5 Medium models are now available!
Looping / over-thinking issues? Please use the correct inference settings.
Qwen3.5-397B-A17B is comparable to Gemini 3 Pro, Claude Opus 4.5, and GPT-5.2. The full 397B model is ~807GB on disk, and 3-bit runs on a 192GB Mac / RAM device or 4-bit MXFP4 on a 256GB Mac. See quantization benchmarks for our GGUFs!
All uploads use Unsloth Dynamic 2.0 for SOTA quantization performance - so 4-bit has important layers upcasted to 8 or 16-bit. Thank you Qwen for providing Unsloth with day zero access. You can also fine-tune Qwen3.5 with Unsloth.
⚙️ Usage Guide
Table: Inference hardware requirements (units = total memory: RAM + VRAM, or unified memory)
For best performance, make sure your total available memory (VRAM + system RAM) exceeds the size of the quantized model file you’re downloading. If it doesn’t, llama.cpp can still run via SSD/HDD offloading, but inference will be slower.
Between 27B and 35B-A3B, choose 27B if you want slightly more accurate results and it fits on your device. Go for 35B-A3B if you want much faster inference.
Recommended Settings
Maximum context window: 262,144 tokens (can be extended to 1M via YaRN)
presence_penalty: 0.0 to 2.0 (off by default; you can use it to reduce repetitions, but a higher value may slightly decrease performance)
Adequate output length: 32,768 tokens for most queries
As Qwen3.5 is a hybrid reasoning model, thinking and non-thinking modes have different settings:
Thinking mode:

| Setting | General tasks | Precise coding tasks |
| --- | --- | --- |
| temperature | 1.0 | 0.6 |
| top_p | 0.95 | 0.95 |
| top_k | 20 | 20 |
| min_p | 0.0 | 0.0 |
| presence_penalty | 1.5 | 0.0 |
| repeat_penalty | disabled or 1.0 | disabled or 1.0 |
Instruct (non-thinking) mode settings:

| Setting | General tasks | Reasoning tasks |
| --- | --- | --- |
| temperature | 0.7 | 1.0 |
| top_p | 0.8 | 0.95 |
| top_k | 20 | 20 |
| min_p | 0.0 | 0.0 |
| presence_penalty | 1.5 | 1.5 |
| repeat_penalty | disabled or 1.0 | disabled or 1.0 |

To disable thinking / reasoning, use `--chat-template-kwargs "{\"enable_thinking\": false}"`
Qwen3.5 Inference Tutorials:
Because Qwen3.5 comes in many different sizes, we'll be using Dynamic 4-bit MXFP4_MOE GGUF variants for all inference workloads. Click below to navigate to designated model instructions:
- Qwen3.5-35B-A3B
- Qwen3.5-27B
- Qwen3.5-122B-A10B
- Qwen3.5-397B-A17B
Unsloth Dynamic GGUF uploads:
Qwen3.5-35B-A3B
For this guide we will be utilizing Dynamic 4-bit, which works great on a 24GB RAM / Mac device for fast inference. Because the model is only around 72GB at full F16 precision, we won't need to worry much about performance. GGUF: Qwen3.5-35B-A3B-GGUF
✨ Run in llama.cpp
Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.
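As a sketch, the standard llama.cpp CMake build flow looks like this (package names assume Debian/Ubuntu; adjust `-DGGML_CUDA` for your hardware):

```shell
# Install build tools, then clone and build llama.cpp with CUDA enabled.
apt-get update
apt-get install -y build-essential cmake curl libcurl4-openssl-dev
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
# Build only the CLI and server binaries in Release mode.
cmake --build llama.cpp/build --config Release -j --clean-first \
    --target llama-cli llama-server
cp llama.cpp/build/bin/llama-* llama.cpp/
```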
If you want to use llama.cpp directly to download and load models, you can do the below, where `:Q3_K_XL` is the quantization type (you can also download via Hugging Face; see point 3). This is similar to `ollama run`. Use `export LLAMA_CACHE="folder"` to force llama.cpp to save to a specific location. The model has a maximum of 256K context length.
Follow one of the specific commands below, according to your use-case:
Thinking mode:
Precise coding tasks (e.g. WebDev):
General tasks:
Non-thinking mode:
General tasks:
Reasoning tasks:
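A minimal sketch of a thinking-mode invocation for general tasks, using the sampler settings from the table above (the `-hf` quant tag, context size, and GPU layer count are illustrative; adjust them for your hardware):

```shell
# Thinking mode, general tasks: temperature 1.0, top_p 0.95, top_k 20,
# min_p 0.0, presence_penalty 1.5.
./llama.cpp/llama-cli \
    -hf unsloth/Qwen3.5-35B-A3B-GGUF:UD-Q4_K_XL \
    --jinja \
    --ctx-size 16384 \
    --n-gpu-layers 99 \
    --temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.0 \
    --presence-penalty 1.5
```

For non-thinking mode, add `--chat-template-kwargs "{\"enable_thinking\": false}"` and switch to the instruct sampler settings.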
Download the model (after installing the dependencies via pip install huggingface_hub hf_transfer). You can choose Q4_K_M or other quantized versions like UD-Q4_K_XL. We recommend using at least the 2-bit dynamic quant UD-Q2_K_XL to balance size and accuracy. If downloads get stuck, see: Hugging Face Hub, XET debugging
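A minimal download sketch using `huggingface_hub.snapshot_download` (the `allow_patterns` glob is illustrative; change it to the quant you want):

```python
# Enable the faster hf_transfer backend before importing huggingface_hub.
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

from huggingface_hub import snapshot_download

# Download only the UD-Q4_K_XL shard(s) of the repo into a local folder.
snapshot_download(
    repo_id="unsloth/Qwen3.5-35B-A3B-GGUF",
    local_dir="unsloth/Qwen3.5-35B-A3B-GGUF",
    allow_patterns=["*UD-Q4_K_XL*"],
)
```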
Then run the model in conversation mode:
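Pointing llama-cli at the downloaded file might look like this (the path assumes the download layout above; `-cnv` forces conversation mode):

```shell
# Run the locally downloaded GGUF in interactive conversation mode.
./llama.cpp/llama-cli \
    --model unsloth/Qwen3.5-35B-A3B-GGUF/UD-Q4_K_XL/Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf \
    --jinja -cnv \
    --ctx-size 16384 \
    --n-gpu-layers 99 \
    --temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.0 \
    --presence-penalty 1.5
```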
Qwen3.5-27B
For this guide we will be utilizing Dynamic 4-bit, which works great on an 18GB RAM / Mac device for fast inference. GGUF: Qwen3.5-27B-GGUF
Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.
If you want to use llama.cpp directly to download and load models, you can do the below, where `:Q3_K_XL` is the quantization type (you can also download via Hugging Face; see point 3). This is similar to `ollama run`. Use `export LLAMA_CACHE="folder"` to force llama.cpp to save to a specific location. The model has a maximum of 256K context length.
Follow one of the specific commands below, according to your use-case:
Thinking mode:
Precise coding tasks (e.g. WebDev):
General tasks:
Non-thinking mode:
General tasks:
Reasoning tasks:
Download the model (after installing the dependencies via pip install huggingface_hub hf_transfer). You can choose MXFP4_MOE or other quantized versions like UD-Q4_K_XL. We recommend using at least the 2-bit dynamic quant UD-Q2_K_XL to balance size and accuracy. If downloads get stuck, see: Hugging Face Hub, XET debugging
Then run the model in conversation mode:
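A conversation-mode sketch for the 27B dense model (quant tag and context size are illustrative; swap in the instruct sampler settings for non-thinking use):

```shell
# Run Qwen3.5-27B interactively with thinking-mode general-task settings.
./llama.cpp/llama-cli \
    -hf unsloth/Qwen3.5-27B-GGUF:UD-Q4_K_XL \
    --jinja -cnv \
    --ctx-size 16384 \
    --n-gpu-layers 99 \
    --temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.0 \
    --presence-penalty 1.5
```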
Qwen3.5-122B-A10B
For this guide we will be utilizing Dynamic 4-bit which works great on a 70GB RAM / Mac device for fast inference. GGUF: Qwen3.5-122B-A10B-GGUF
Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.
If you want to use llama.cpp directly to download and load models, you can do the below, where `:Q3_K_XL` is the quantization type (you can also download via Hugging Face; see point 3). This is similar to `ollama run`. Use `export LLAMA_CACHE="folder"` to force llama.cpp to save to a specific location. The model has a maximum of 256K context length.
Follow one of the specific commands below, according to your use-case:
Thinking mode:
Precise coding tasks (e.g. WebDev):
General tasks:
Non-thinking mode:
General tasks:
Reasoning tasks:
Download the model (after installing the dependencies via pip install huggingface_hub hf_transfer). You can choose MXFP4_MOE (dynamic 4-bit) or other quantized versions like UD-Q4_K_XL. We recommend using at least the 2-bit dynamic quant UD-Q2_K_XL to balance size and accuracy. If downloads get stuck, see: Hugging Face Hub, XET debugging
Then run the model in conversation mode:
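For the 122B MoE model, a sketch that offloads the MoE expert tensors to CPU so the attention layers fit in a smaller GPU (the `-ot` regex is the usual llama.cpp override-tensor pattern; all other flags are illustrative):

```shell
# Run Qwen3.5-122B-A10B, keeping MoE experts in system RAM via -ot.
./llama.cpp/llama-cli \
    -hf unsloth/Qwen3.5-122B-A10B-GGUF:UD-Q4_K_XL \
    --jinja -cnv \
    --ctx-size 16384 \
    --n-gpu-layers 99 \
    -ot ".ffn_.*_exps.=CPU" \
    --temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.0 \
    --presence-penalty 1.5
```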
Qwen3.5-397B-A17B
For 397B-A17B, the Unsloth 4-bit dynamic quant UD-Q4_K_XL uses 214GB of disk space. This fits directly on a 256GB M3 Ultra, and also works well with a single 24GB GPU plus 256GB of RAM using MoE offloading, for 25+ tokens/s. The 3-bit quant fits in 192GB of RAM, and 8-bit requires 512GB of RAM/VRAM. GGUF: Qwen3.5-397B-A17B-GGUF
Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.
If you want to use llama.cpp directly to download and load models, you can do the below, where `:Q3_K_XL` is the quantization type (you can also download via Hugging Face; see point 3). This is similar to `ollama run`. Use `export LLAMA_CACHE="folder"` to force llama.cpp to save to a specific location. Remember the model has a maximum of 256K context length.
Follow this for thinking mode:
Follow this for non-thinking mode:
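A sketch covering both modes for 397B-A17B, with MoE experts offloaded to CPU (quant tag, context size, and offload regex are illustrative):

```shell
# Thinking mode: temperature 1.0, presence_penalty 1.5.
./llama.cpp/llama-cli \
    -hf unsloth/Qwen3.5-397B-A17B-GGUF:UD-Q4_K_XL \
    --jinja -cnv --ctx-size 16384 --n-gpu-layers 99 \
    -ot ".ffn_.*_exps.=CPU" \
    --temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.0 --presence-penalty 1.5

# Non-thinking mode: disable reasoning in the chat template, use instruct settings.
./llama.cpp/llama-cli \
    -hf unsloth/Qwen3.5-397B-A17B-GGUF:UD-Q4_K_XL \
    --jinja -cnv --ctx-size 16384 --n-gpu-layers 99 \
    -ot ".ffn_.*_exps.=CPU" \
    --chat-template-kwargs "{\"enable_thinking\": false}" \
    --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0.0 --presence-penalty 1.5
```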
Download the model (after installing the dependencies via pip install huggingface_hub hf_transfer). You can choose MXFP4_MOE (dynamic 4-bit) or other quantized versions like UD-Q4_K_XL. We recommend using at least the 2-bit dynamic quant UD-Q2_K_XL to balance size and accuracy. If downloads get stuck, see: Hugging Face Hub, XET debugging
You can edit --threads 32 to set the number of CPU threads, --ctx-size 16384 to set the context length, and --n-gpu-layers 2 to set how many layers are offloaded to the GPU. Try adjusting it if your GPU runs out of memory, and remove it for CPU-only inference.
🦙 Llama-server serving & OpenAI's completion library
To deploy Qwen3.5-397B-A17B for production, we use llama-server. In a new terminal (e.g. via tmux), deploy the model via:
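A llama-server sketch (port 8001 and all other flags are illustrative assumptions; match them to your setup):

```shell
# Serve Qwen3.5-397B-A17B over an OpenAI-compatible HTTP API.
./llama.cpp/llama-server \
    -hf unsloth/Qwen3.5-397B-A17B-GGUF:UD-Q4_K_XL \
    --jinja --threads -1 --ctx-size 16384 --n-gpu-layers 99 \
    -ot ".ffn_.*_exps.=CPU" \
    --host 0.0.0.0 --port 8001
```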
Then in a new terminal, after doing pip install openai, do:
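A minimal client sketch using the OpenAI Python library (the base URL and port assume the llama-server invocation above; the API key is a dummy since llama-server does not require one by default):

```python
from openai import OpenAI

# Point the OpenAI client at the local llama-server endpoint.
client = OpenAI(base_url="http://127.0.0.1:8001/v1", api_key="sk-no-key-required")

completion = client.chat.completions.create(
    model="unsloth/Qwen3.5-397B-A17B-GGUF",
    messages=[{"role": "user", "content": "What is 2+2?"}],
    # Thinking-mode general-task settings from the usage guide.
    temperature=1.0,
    top_p=0.95,
)
print(completion.choices[0].message.content)
```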
To disable thinking / reasoning, use --chat-template-kwargs "{\"enable_thinking\": false}"
👾 OpenAI Codex & Claude Code
To run the model via local agentic coding workloads, you can follow our guide. Just change the model name 'GLM-4.7-Flash' to your desired 'Qwen3.5' model and ensure you follow the correct Qwen3.5 parameters and usage instructions. Use the llama-server we just set up.
After following the instructions for Claude Code, for example, you will see:

We can then ask, for example, `Create a Python game for Chess`:



🔨Tool Calling with Qwen3.5
See the Tool Calling Guide for more details on how to do tool calling. In a new terminal (if using tmux, press CTRL+B then D), we create some tools, such as adding 2 numbers, executing Python code, executing Linux commands, and much more:
We then use the below functions (copy and paste and execute) which will parse the function calls automatically and call the OpenAI endpoint for any model:
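A self-contained sketch of the tool-calling pattern: a hypothetical `add_two_numbers` tool, the JSON schema you would advertise to the model via the OpenAI-compatible endpoint, and a small executor that parses a returned tool call and runs it (the tool name and schema are illustrative, not part of any Qwen3.5 API):

```python
import json

# Hypothetical tool: add two numbers.
def add_two_numbers(a: float, b: float) -> float:
    """Add two numbers and return the result."""
    return a + b

# Registry mapping tool names to Python callables.
TOOLS = {"add_two_numbers": add_two_numbers}

# JSON schema passed as `tools=` to the chat completions endpoint.
TOOL_SPECS = [{
    "type": "function",
    "function": {
        "name": "add_two_numbers",
        "description": "Add two numbers together.",
        "parameters": {
            "type": "object",
            "properties": {
                "a": {"type": "number"},
                "b": {"type": "number"},
            },
            "required": ["a", "b"],
        },
    },
}]

def execute_tool_call(tool_call: dict):
    """Parse one tool call (shaped like choices[0].message.tool_calls entries)
    and dispatch it to the matching local function."""
    name = tool_call["function"]["name"]
    args = json.loads(tool_call["function"]["arguments"])
    return TOOLS[name](**args)
```

In practice you would pass `TOOL_SPECS` as the `tools` argument to `client.chat.completions.create`, then feed each returned tool call through `execute_tool_call` and append the result as a `"tool"` role message.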
After launching Qwen3.5 via llama-server as above, we can then make some tool calls. See the Tool Calling Guide for more details.
📊 Benchmarks
Unsloth GGUF Benchmarks
Qwen3.5-397B-A17B Benchmarks

Benjamin Marie (third-party) benchmarked Qwen3.5-397B-A17B using Unsloth GGUFs on a 750-prompt mixed suite (LiveCodeBench v6, MMLU Pro, GPQA, Math500), reporting both overall accuracy and relative error increase (how much more often the quantized model makes mistakes vs. the original).
Key results (accuracy; change vs. original; relative error increase):
Original weights: 81.3%
UD-Q4_K_XL: 80.5% (−0.8 points; +4.3% relative error increase)
UD-Q3_K_XL: 80.7% (−0.6 points; +3.5% relative error increase)
UD-Q4_K_XL and UD-Q3_K_XL stay extremely close to the original, well under a 1-point accuracy drop on this suite, suggesting you can sharply reduce the memory footprint (~500 GB less) with little to no practical loss on the tested tasks.
How to choose: Q3 scoring slightly higher than Q4 here is completely plausible as normal run-to-run variance at this scale, so treat Q3 and Q4 as effectively similar quality in this benchmark:
Pick Q3 if you want the smallest footprint / best memory savings
Pick Q4 if you want a slightly more conservative option with similar results
All listed quants use our dynamic methodology. Even UD-IQ2_M uses the same dynamic methodology, but its conversion process differs from UD-Q2_K_XL: K_XL quants are usually faster than UD-IQ2_M even though they are bigger, which is why UD-IQ2_M may perform better than UD-Q2_K_XL.
Official Qwen Benchmarks
Qwen3.5-35B-A3B, 27B and 122B-A10B Benchmarks
The benchmarks are shown in table format below. Each row lists scores for six compared models; the column labels were not preserved in this export.

Language
Knowledge

| Benchmark |  |  |  |  |  |  |
| --- | --- | --- | --- | --- | --- | --- |
| MMLU-Pro | 83.7 | 80.8 | 84.4 | 86.7 | 86.1 | 85.3 |
| MMLU-Redux | 93.7 | 91.0 | 93.8 | 94.0 | 93.2 | 93.3 |
| C-Eval | 82.2 | 76.2 | 92.1 | 91.9 | 90.5 | 90.2 |
| SuperGPQA | 58.6 | 54.6 | 64.9 | 67.1 | 65.6 | 63.4 |
Instruction Following

| Benchmark |  |  |  |  |  |  |
| --- | --- | --- | --- | --- | --- | --- |
| IFEval | 93.9 | 88.9 | 87.8 | 93.4 | 95.0 | 91.9 |
| IFBench | 75.4 | 69.0 | 51.7 | 76.1 | 76.5 | 70.2 |
| MultiChallenge | 59.0 | 45.3 | 50.2 | 61.5 | 60.8 | 60.0 |
Long Context

| Benchmark |  |  |  |  |  |  |
| --- | --- | --- | --- | --- | --- | --- |
| AA-LCR | 68.0 | 50.7 | 60.0 | 66.9 | 66.1 | 58.5 |
| LongBench v2 | 56.8 | 48.2 | 54.8 | 60.2 | 60.6 | 59.0 |
STEM & Reasoning

| Benchmark |  |  |  |  |  |  |
| --- | --- | --- | --- | --- | --- | --- |
| HLE w/ CoT | 19.4 | 14.9 | 18.2 | 25.3 | 24.3 | 22.4 |
| GPQA Diamond | 82.8 | 80.1 | 81.1 | 86.6 | 85.5 | 84.2 |
| HMMT Feb 25 | 89.2 | 90.0 | 85.1 | 91.4 | 92.0 | 89.0 |
| HMMT Nov 25 | 84.2 | 90.0 | 89.5 | 90.3 | 89.8 | 89.2 |
Coding

| Benchmark |  |  |  |  |  |  |
| --- | --- | --- | --- | --- | --- | --- |
| SWE-bench Verified | 72.0 | 62.0 | -- | 72.0 | 72.4 | 69.2 |
| Terminal Bench 2 | 31.9 | 18.7 | -- | 49.4 | 41.6 | 40.5 |
| LiveCodeBench v6 | 80.5 | 82.7 | 75.1 | 78.9 | 80.7 | 74.6 |
| CodeForces | 2160 | 2157 | 2146 | 2100 | 1899 | 2028 |
| OJBench | 40.4 | 41.5 | 32.7 | 39.5 | 40.1 | 36.0 |
| FullStackBench en | 30.6 | 58.9 | 61.1 | 62.6 | 60.1 | 58.1 |
| FullStackBench zh | 35.2 | 60.4 | 63.1 | 58.7 | 57.4 | 55.0 |
General Agent

| Benchmark |  |  |  |  |  |  |
| --- | --- | --- | --- | --- | --- | --- |
| BFCL-V4 | 55.5 | -- | 54.8 | 72.2 | 68.5 | 67.3 |
| TAU2-Bench | 69.8 | -- | 58.5 | 79.5 | 79.0 | 81.2 |
| VITA-Bench | 13.9 | -- | 31.6 | 33.6 | 41.9 | 31.9 |
| DeepPlanning | 17.9 | -- | 17.1 | 24.1 | 22.6 | 22.8 |
Search Agent

| Benchmark |  |  |  |  |  |  |
| --- | --- | --- | --- | --- | --- | --- |
| HLE w/ tool | 35.8 | 19.0 | -- | 47.5 | 48.5 | 47.4 |
| Browsecomp | 48.1 | 41.1 | -- | 63.8 | 61.0 | 61.0 |
| Browsecomp-zh | 49.5 | 42.9 | -- | 69.9 | 62.1 | 69.5 |
| WideSearch | 47.2 | 40.4 | -- | 60.5 | 61.1 | 57.1 |
| Seal-0 | 34.2 | 45.1 | -- | 44.1 | 47.2 | 41.4 |
Multilingualism

| Benchmark |  |  |  |  |  |  |
| --- | --- | --- | --- | --- | --- | --- |
| MMMLU | 86.2 | 78.2 | 83.4 | 86.7 | 85.9 | 85.2 |
| MMLU-ProX | 78.5 | 74.5 | 77.9 | 82.2 | 82.2 | 81.0 |
| NOVA-63 | 51.9 | 51.1 | 55.4 | 58.6 | 58.1 | 57.1 |
| INCLUDE | 81.8 | 74.0 | 81.0 | 82.8 | 81.6 | 79.7 |
| Global PIQA | 88.5 | 84.1 | 85.7 | 88.4 | 87.5 | 86.6 |
| PolyMATH | 67.3 | 54.0 | 60.1 | 68.9 | 71.2 | 64.4 |
| WMT24++ | 80.7 | 74.4 | 75.8 | 78.3 | 77.6 | 76.3 |
| MAXIFE | 85.3 | 83.7 | 83.2 | 87.9 | 88.0 | 86.6 |
Notes (Language)
- CodeForces: evaluated on an internal query set.
- TAU2-Bench: followed the official setup except for airline domain fixes per the Claude Opus 4.5 system card.
- Search Agent: context-folding (256K) with pruning of earlier tool responses after a threshold.
- WideSearch: 256K context window without context management.
- MMLU-ProX: averaged accuracy over 29 languages.
- WMT24++: averaged over 55 languages using XCOMET-XXL.
- MAXIFE: accuracy over English + multilingual original prompts (23 settings).
- `--` means not available / not applicable.
Vision Language
STEM and Puzzle

| Benchmark |  |  |  |  |  |  |
| --- | --- | --- | --- | --- | --- | --- |
| MMMU | 79.0 | 79.6 | 80.6 | 83.9 | 82.3 | 81.4 |
| MMMU-Pro | 67.3 | 68.4 | 69.3 | 76.9 | 75.0 | 75.1 |
| MathVision | 71.9 | 71.1 | 74.6 | 86.2 | 86.0 | 83.9 |
| Mathvista(mini) | 79.1 | 79.8 | 85.8 | 87.4 | 87.8 | 86.2 |
| DynaMath | 81.4 | 78.8 | 82.8 | 85.9 | 87.7 | 85.0 |
| ZEROBench | 3 | 4 | 4 | 9 | 10 | 8 |
| ZEROBench_sub | 27.3 | 26.3 | 28.4 | 36.2 | 36.2 | 34.1 |
| VlmsAreBlind | 75.8 | 85.5 | 79.5 | 96.7 | 96.9 | 97.0 |
| BabyVision | 20.9 / 34.5 | 18.6 / 34.5 | 22.2 / 34.5 | 40.2 / 34.5 | 44.6 / 34.8 | 38.4 / 29.6 |
General VQA

| Benchmark |  |  |  |  |  |  |
| --- | --- | --- | --- | --- | --- | --- |
| RealWorldQA | 79.0 | 70.3 | 81.3 | 85.1 | 83.7 | 84.1 |
| MMStar | 74.1 | 73.8 | 78.7 | 82.9 | 81.0 | 81.9 |
| MMBench EN-DEV-v1.1 | 86.8 | 88.3 | 89.7 | 92.8 | 92.6 | 91.5 |
| SimpleVQA | 56.8 | 57.6 | 61.3 | 61.7 | 56.0 | 58.3 |
| HallusionBench | 63.2 | 59.9 | 66.7 | 67.6 | 70.0 | 67.9 |
Text Recognition and Document Understanding

| Benchmark |  |  |  |  |  |  |
| --- | --- | --- | --- | --- | --- | --- |
| OmniDocBench1.5 | 77.0 | 85.8 | 84.5 | 89.8 | 88.9 | 89.3 |
| CharXiv(RQ) | 68.6 | 67.2 | 66.1 | 77.2 | 79.5 | 77.5 |
| MMLongBench-Doc | 50.3 | -- | 56.2 | 59.0 | 60.2 | 59.5 |
| CC-OCR | 70.8 | 68.1 | 81.5 | 81.8 | 81.0 | 80.7 |
| AI2D_TEST | 88.2 | 87.0 | 89.2 | 93.3 | 92.9 | 92.6 |
| OCRBench | 821 | 766 | 87.5 | 92.1 | 89.4 | 91.0 |
Spatial Intelligence

| Benchmark |  |  |  |  |  |  |
| --- | --- | --- | --- | --- | --- | --- |
| ERQA | 54.0 | 45.0 | 52.5 | 62.0 | 60.5 | 64.8 |
| CountBench | 91.0 | 90.0 | 93.7 | 97.0 | 97.8 | 97.8 |
| RefCOCO(avg) | -- | -- | 91.1 | 91.3 | 90.9 | 89.2 |
| ODInW13 | -- | -- | 43.2 | 44.5 | 41.1 | 42.6 |
| EmbSpatialBench | 80.7 | 71.8 | 84.3 | 83.9 | 84.5 | 83.1 |
| RefSpatialBench | 9.0 | 2.2 | 69.9 | 69.3 | 67.7 | 63.5 |
| LingoQA | 62.4 | 12.8 | 66.8 | 80.8 | 82.0 | 79.2 |
| Hypersim | -- | -- | 11.0 | 12.7 | 13.0 | 13.1 |
| SUNRGBD | -- | -- | 34.9 | 36.2 | 35.4 | 33.4 |
| Nuscene | -- | -- | 13.9 | 15.4 | 15.2 | 14.6 |
Video Understanding

| Benchmark |  |  |  |  |  |  |
| --- | --- | --- | --- | --- | --- | --- |
| VideoMME (w sub.) | 83.5 | 81.1 | 83.8 | 87.3 | 87.0 | 86.6 |
| VideoMME (w/o sub.) | 78.9 | 75.3 | 79.0 | 83.9 | 82.8 | 82.5 |
| VideoMMMU | 82.5 | 77.6 | 80.0 | 82.0 | 82.3 | 80.4 |
| MLVU | 83.3 | 72.8 | 83.8 | 87.3 | 85.9 | 85.6 |
| MVBench | -- | -- | 75.2 | 76.6 | 74.6 | 74.8 |
| LVBench | -- | -- | 63.6 | 74.4 | 73.6 | 71.4 |
| MMVU | 69.8 | 70.6 | 71.1 | 74.7 | 73.3 | 72.3 |
Visual Agent

| Benchmark |  |  |  |  |  |  |
| --- | --- | --- | --- | --- | --- | --- |
| ScreenSpot Pro | -- | 36.2 | 62.0 | 70.40 | 70.28 | 68.60 |
| OSWorld-Verified | -- | 61.4 | 38.1 | 58.01 | 56.15 | 54.49 |
| AndroidWorld | -- | -- | 63.7 | 66.4 | 64.2 | 71.1 |
Tool Calling

| Benchmark |  |  |  |  |  |  |
| --- | --- | --- | --- | --- | --- | --- |
| TIR-Bench | 24.6 / 42.5 | 27.6 / 42.5 | 29.8 / 42.5 | 53.2 / 42.5 | 59.8 / 42.3 | 55.5 / 38.0 |
| V* | 71.7 / 90.1 | 58.6 / 89.0 | 85.9 / 89.5 | 93.2 / 90.1 | 93.7 / 89.0 | 92.7 / 89.5 |
Medical VQA

| Benchmark |  |  |  |  |  |  |
| --- | --- | --- | --- | --- | --- | --- |
| SLAKE | 70.5 | 73.6 | 54.7 | 81.6 | 80.0 | 78.7 |
| PMC-VQA | 36.3 | 55.9 | 41.2 | 63.3 | 62.4 | 62.0 |
| MedXpertQA-MM | 34.4 | 54.0 | 47.6 | 67.3 | 62.4 | 61.4 |
Notes (Vision Language)
- MathVision: the Qwen score uses a fixed prompt; other models use the higher of runs with/without `\boxed{}` formatting.
- BabyVision: scores are reported as "with CI / without CI".
- TIR-Bench and V*: scores are reported as "with CI / without CI".
- `--` means not available / not applicable.
Qwen3.5-397B-A17B Benchmarks
Qwen3.5-397B-A17B benchmarks are shown in table format below:

Language Benchmarks
Knowledge

| Benchmark | GPT5.2 | Claude 4.5 Opus | Gemini-3 Pro | Qwen3-Max-Thinking | K2.5-1T-A32B | Qwen3.5-397B-A17B |
| --- | --- | --- | --- | --- | --- | --- |
| MMLU-Pro | 87.4 | 89.5 | 89.8 | 85.7 | 87.1 | 87.8 |
| MMLU-Redux | 95.0 | 95.6 | 95.9 | 92.8 | 94.5 | 94.9 |
| SuperGPQA | 67.9 | 70.6 | 74.0 | 67.3 | 69.2 | 70.4 |
| C-Eval | 90.5 | 92.2 | 93.4 | 93.7 | 94.0 | 93.0 |
Instruction Following

| Benchmark | GPT5.2 | Claude 4.5 Opus | Gemini-3 Pro | Qwen3-Max-Thinking | K2.5-1T-A32B | Qwen3.5-397B-A17B |
| --- | --- | --- | --- | --- | --- | --- |
| IFEval | 94.8 | 90.9 | 93.5 | 93.4 | 93.9 | 92.6 |
| IFBench | 75.4 | 58.0 | 70.4 | 70.9 | 70.2 | 76.5 |
| MultiChallenge | 57.9 | 54.2 | 64.2 | 63.3 | 62.7 | 67.6 |
Long Context

| Benchmark | GPT5.2 | Claude 4.5 Opus | Gemini-3 Pro | Qwen3-Max-Thinking | K2.5-1T-A32B | Qwen3.5-397B-A17B |
| --- | --- | --- | --- | --- | --- | --- |
| AA-LCR | 72.7 | 74.0 | 70.7 | 68.7 | 70.0 | 68.7 |
| LongBench v2 | 54.5 | 64.4 | 68.2 | 60.6 | 61.0 | 63.2 |
STEM

| Benchmark | GPT5.2 | Claude 4.5 Opus | Gemini-3 Pro | Qwen3-Max-Thinking | K2.5-1T-A32B | Qwen3.5-397B-A17B |
| --- | --- | --- | --- | --- | --- | --- |
| GPQA | 92.4 | 87.0 | 91.9 | 87.4 | 87.6 | 88.4 |
| HLE | 35.5 | 30.8 | 37.5 | 30.2 | 30.1 | 28.7 |
| HLE-Verified¹ | 43.3 | 38.8 | 48 | 37.6 | -- | 37.6 |
Reasoning

| Benchmark | GPT5.2 | Claude 4.5 Opus | Gemini-3 Pro | Qwen3-Max-Thinking | K2.5-1T-A32B | Qwen3.5-397B-A17B |
| --- | --- | --- | --- | --- | --- | --- |
| LiveCodeBench v6 | 87.7 | 84.8 | 90.7 | 85.9 | 85.0 | 83.6 |
| HMMT Feb 25 | 99.4 | 92.9 | 97.3 | 98.0 | 95.4 | 94.8 |
| HMMT Nov 25 | 100 | 93.3 | 93.3 | 94.7 | 91.1 | 92.7 |
| IMOAnswerBench | 86.3 | 84.0 | 83.3 | 83.9 | 81.8 | 80.9 |
| AIME26 | 96.7 | 93.3 | 90.6 | 93.3 | 93.3 | 91.3 |
General Agent

| Benchmark | GPT5.2 | Claude 4.5 Opus | Gemini-3 Pro | Qwen3-Max-Thinking | K2.5-1T-A32B | Qwen3.5-397B-A17B |
| --- | --- | --- | --- | --- | --- | --- |
| BFCL-V4 | 63.1 | 77.5 | 72.5 | 67.7 | 68.3 | 72.9 |
| TAU2-Bench | 87.1 | 91.6 | 85.4 | 84.6 | 77.0 | 86.7 |
| VITA-Bench | 38.2 | 56.3 | 51.6 | 40.9 | 41.9 | 49.7 |
| DeepPlanning | 44.6 | 33.9 | 23.3 | 28.7 | 14.5 | 34.3 |
| Tool Decathlon | 43.8 | 43.5 | 36.4 | 18.8 | 27.8 | 38.3 |
| MCP-Mark | 57.5 | 42.3 | 53.9 | 33.5 | 29.5 | 46.1 |
Search Agent³

| Benchmark | GPT5.2 | Claude 4.5 Opus | Gemini-3 Pro | Qwen3-Max-Thinking | K2.5-1T-A32B | Qwen3.5-397B-A17B |
| --- | --- | --- | --- | --- | --- | --- |
| HLE w/ tool | 45.5 | 43.4 | 45.8 | 49.8 | 50.2 | 48.3 |
| BrowseComp | 65.8 | 67.8 | 59.2 | 53.9 | --/74.9 | 69.0/78.6 |
| BrowseComp-zh | 76.1 | 62.4 | 66.8 | 60.9 | -- | 70.3 |
| WideSearch | 76.8 | 76.4 | 68.0 | 57.9 | 72.7 | 74.0 |
| Seal-0 | 45.0 | 47.7 | 45.5 | 46.9 | 57.4 | 46.9 |
Multilingualism

| Benchmark | GPT5.2 | Claude 4.5 Opus | Gemini-3 Pro | Qwen3-Max-Thinking | K2.5-1T-A32B | Qwen3.5-397B-A17B |
| --- | --- | --- | --- | --- | --- | --- |
| MMMLU | 89.5 | 90.1 | 90.6 | 84.4 | 86.0 | 88.5 |
| MMLU-ProX | 83.7 | 85.7 | 87.7 | 78.5 | 82.3 | 84.7 |
| NOVA-63 | 54.6 | 56.7 | 56.7 | 54.2 | 56.0 | 59.1 |
| INCLUDE | 87.5 | 86.2 | 90.5 | 82.3 | 83.3 | 85.6 |
| Global PIQA | 90.9 | 91.6 | 93.2 | 86.0 | 89.3 | 89.8 |
| PolyMATH | 62.5 | 79.0 | 81.6 | 64.7 | 43.1 | 73.3 |
| WMT24++ | 78.8 | 79.7 | 80.7 | 77.6 | 77.6 | 78.9 |
| MAXIFE | 88.4 | 79.2 | 87.5 | 84.0 | 72.8 | 88.2 |
Coding Agent

| Benchmark | GPT5.2 | Claude 4.5 Opus | Gemini-3 Pro | Qwen3-Max-Thinking | K2.5-1T-A32B | Qwen3.5-397B-A17B |
| --- | --- | --- | --- | --- | --- | --- |
| SWE-bench Verified | 80.0 | 80.9 | 76.2 | 75.3 | 76.8 | 76.4 |
| SWE-bench Multilingual | 72.0 | 77.5 | 65.0 | 66.7 | 73.0 | 72.0 |
| SecCodeBench | 68.7 | 68.6 | 62.4 | 57.5 | 61.3 | 68.3 |
| Terminal Bench 2 | 54.0 | 59.3 | 54.2 | 22.5 | 50.8 | 52.5 |
Notes
- HLE-Verified: a verified and revised version of Humanity’s Last Exam (HLE), accompanied by a transparent, component-wise verification protocol and a fine-grained error taxonomy. We open-source the dataset at https://huggingface.co/datasets/skylenage/HLE-Verified.
- TAU2-Bench: we follow the official setup except for the airline domain, where all models are evaluated by applying the fixes proposed in the Claude Opus 4.5 system card.
- MCPMark: GitHub MCP server uses v0.30.3 from api.githubcopilot.com; Playwright tool responses are truncated at 32k tokens.
- Search Agent: most search agents built on our model adopt a simple context-folding strategy (256k): once the cumulative tool-response length reaches a preset threshold, earlier tool responses are pruned from the history to keep the context within limits.
- BrowseComp: we tested two strategies; simple context-folding achieved a score of 69.0, while using the same discard-all strategy as DeepSeek-V3.2 and Kimi K2.5 achieved 78.6.
- WideSearch: we use a 256k context window without any context management.
- MMLU-ProX: we report the averaged accuracy on 29 languages.
- WMT24++: a harder subset of WMT24 after difficulty labeling and rebalancing; we report the averaged scores on 55 languages using XCOMET-XXL.
- MAXIFE: we report the accuracy on English + multilingual original prompts (23 settings in total).
- Empty cells (`--`) indicate scores not yet available or not applicable.
Vision Language Benchmarks
STEM and Puzzle

| Benchmark | GPT5.2 | Claude 4.5 Opus | Gemini-3 Pro | Qwen3-VL-235B-A22B | K2.5-1T-A32B | Qwen3.5-397B-A17B |
| --- | --- | --- | --- | --- | --- | --- |
| MMMU | 86.7 | 80.7 | 87.2 | 80.6 | 84.3 | 85.0 |
| MMMU-Pro | 79.5 | 70.6 | 81.0 | 69.3 | 78.5 | 79.0 |
| MathVision | 83.0 | 74.3 | 86.6 | 74.6 | 84.2 | 88.6 |
| Mathvista(mini) | 83.1 | 80.0 | 87.9 | 85.8 | 90.1 | 90.3 |
| We-Math | 79.0 | 70.0 | 86.9 | 74.8 | 84.7 | 87.9 |
| DynaMath | 86.8 | 79.7 | 85.1 | 82.8 | 84.4 | 86.3 |
| ZEROBench | 9 | 3 | 10 | 4 | 9 | 12 |
| ZEROBench_sub | 33.2 | 28.4 | 39.0 | 28.4 | 33.5 | 41.0 |
| BabyVision | 34.4 | 14.2 | 49.7 | 22.2 | 36.5 | 52.3/43.3 |
General VQA

| Benchmark | GPT5.2 | Claude 4.5 Opus | Gemini-3 Pro | Qwen3-VL-235B-A22B | K2.5-1T-A32B | Qwen3.5-397B-A17B |
| --- | --- | --- | --- | --- | --- | --- |
| RealWorldQA | 83.3 | 77.0 | 83.3 | 81.3 | 81.0 | 83.9 |
| MMStar | 77.1 | 73.2 | 83.1 | 78.7 | 80.5 | 83.8 |
| HallusionBench | 65.2 | 64.1 | 68.6 | 66.7 | 69.8 | 71.4 |
| MMBench (EN-DEV-v1.1) | 88.2 | 89.2 | 93.7 | 89.7 | 94.2 | 93.7 |
| SimpleVQA | 55.8 | 65.7 | 73.2 | 61.3 | 71.2 | 67.1 |
Text Recognition and Document Understanding

| Benchmark | GPT5.2 | Claude 4.5 Opus | Gemini-3 Pro | Qwen3-VL-235B-A22B | K2.5-1T-A32B | Qwen3.5-397B-A17B |
| --- | --- | --- | --- | --- | --- | --- |
| OmniDocBench1.5 | 85.7 | 87.7 | 88.5 | 84.5 | 88.8 | 90.8 |
| CharXiv(RQ) | 82.1 | 68.5 | 81.4 | 66.1 | 77.5 | 80.8 |
| MMLongBench-Doc | -- | 61.9 | 60.5 | 56.2 | 58.5 | 61.5 |
| CC-OCR | 70.3 | 76.9 | 79.0 | 81.5 | 79.7 | 82.0 |
| AI2D_TEST | 92.2 | 87.7 | 94.1 | 89.2 | 90.8 | 93.9 |
| OCRBench | 80.7 | 85.8 | 90.4 | 87.5 | 92.3 | 93.1 |
Spatial Intelligence

| Benchmark | GPT5.2 | Claude 4.5 Opus | Gemini-3 Pro | Qwen3-VL-235B-A22B | K2.5-1T-A32B | Qwen3.5-397B-A17B |
| --- | --- | --- | --- | --- | --- | --- |
| ERQA | 59.8 | 46.8 | 70.5 | 52.5 | -- | 67.5 |
| CountBench | 91.9 | 90.6 | 97.3 | 93.7 | 94.1 | 97.2 |
| RefCOCO(avg) | -- | -- | 84.1 | 91.1 | 87.8 | 92.3 |
| ODInW13 | -- | -- | 46.3 | 43.2 | -- | 47.0 |
| EmbSpatialBench | 81.3 | 75.7 | 61.2 | 84.3 | 77.4 | 84.5 |
| RefSpatialBench | -- | -- | 65.5 | 69.9 | -- | 73.6 |
| LingoQA | 68.8 | 78.8 | 72.8 | 66.8 | 68.2 | 81.6 |
| V* | 75.9 | 67.0 | 88.0 | 85.9 | 77.0 | 95.8/91.1 |
| Hypersim | -- | -- | -- | 11.0 | -- | 12.5 |
| SUNRGBD | -- | -- | -- | 34.9 | -- | 38.3 |
| Nuscene | -- | -- | -- | 13.9 | -- | 16.0 |
Video Understanding

| Benchmark | GPT5.2 | Claude 4.5 Opus | Gemini-3 Pro | Qwen3-VL-235B-A22B | K2.5-1T-A32B | Qwen3.5-397B-A17B |
| --- | --- | --- | --- | --- | --- | --- |
| VideoMME (w sub.) | 86 | 77.6 | 88.4 | 83.8 | 87.4 | 87.5 |
| VideoMME (w/o sub.) | 85.8 | 81.4 | 87.7 | 79.0 | 83.2 | 83.7 |
| VideoMMMU | 85.9 | 84.4 | 87.6 | 80.0 | 86.6 | 84.7 |
| MLVU (M-Avg) | 85.6 | 81.7 | 83.0 | 83.8 | 85.0 | 86.7 |
| MVBench | 78.1 | 67.2 | 74.1 | 75.2 | 73.5 | 77.6 |
| LVBench | 73.7 | 57.3 | 76.2 | 63.6 | 75.9 | 75.5 |
| MMVU | 80.8 | 77.3 | 77.5 | 71.1 | 80.4 | 75.4 |
Visual Agent

| Benchmark | GPT5.2 | Claude 4.5 Opus | Gemini-3 Pro | Qwen3-VL-235B-A22B | K2.5-1T-A32B | Qwen3.5-397B-A17B |
| --- | --- | --- | --- | --- | --- | --- |
| ScreenSpot Pro | -- | 45.7 | 72.7 | 62.0 | -- | 65.6 |
| OSWorld-Verified | 38.2 | 66.3 | -- | 38.1 | 63.3 | 62.2 |
| AndroidWorld | -- | -- | -- | 63.7 | -- | 66.8 |
Medical

| Benchmark | GPT5.2 | Claude 4.5 Opus | Gemini-3 Pro | Qwen3-VL-235B-A22B | K2.5-1T-A32B | Qwen3.5-397B-A17B |
| --- | --- | --- | --- | --- | --- | --- |
| VQA-RAD | 69.8 | 65.6 | 74.5 | 65.4 | 79.9 | 76.3 |
| SLAKE | 76.9 | 76.4 | 81.3 | 54.7 | 81.6 | 79.9 |
| OM-VQA | 72.9 | 75.5 | 80.3 | 65.4 | 87.4 | 85.1 |
| PMC-VQA | 58.9 | 59.9 | 62.3 | 41.2 | 63.3 | 64.2 |
| MedXpertQA-MM | 73.3 | 63.6 | 76.0 | 47.6 | 65.3 | 70.0 |
Notes
- MathVision: our model’s score is evaluated using a fixed prompt, e.g. “Please reason step by step, and put your final answer within `\boxed{}`.” For other models, we report the higher score between runs with and without the `\boxed{}` formatting.
- BabyVision: our model’s score is reported with CI (Code Interpreter) enabled; without CI, the result is 43.3.
- V*: our model’s score is reported with CI (Code Interpreter) enabled; without CI, the result is 91.1.
- Empty cells (`--`) indicate scores not yet available or not applicable.

