💜Qwen3.5 - How to Run Locally Guide

Run the new Qwen3.5 LLMs including new Medium series: Qwen3.5-35B-A3B, 27B, 122B-A10B, and 397B-A17B on your local device!

Qwen3.5 is Alibaba’s new model family, including Qwen3.5-35B-A3B, 27B, 122B-A10B and 397B-A17B. These multimodal hybrid reasoning LLMs deliver the strongest performance for their size. They support 256K context across 201 languages, have thinking and non-thinking modes, and excel in agentic coding, vision, chat, and long-context tasks. The 35B and 27B models work on a 21GB Mac / RAM device. See all GGUFs here.


Qwen3.5-397B-A17B is comparable to Gemini 3 Pro, Claude Opus 4.5, and GPT-5.2. The full 397B model is ~807GB on disk, and 3-bit runs on a 192GB Mac / RAM device or 4-bit MXFP4 on a 256GB Mac. See quantization benchmarks for our GGUFs!

All uploads use Unsloth Dynamic 2.0 for SOTA quantization performance - so 4-bit has important layers upcasted to 8 or 16-bit. Thank you Qwen for providing Unsloth with day zero access. You can also fine-tune Qwen3.5 with Unsloth.


⚙️ Usage Guide

Table: Inference hardware requirements (units = total memory: RAM + VRAM, or unified memory)

| Qwen3.5 | 3-bit | 4-bit | 6-bit | 8-bit | BF16 |
| --- | --- | --- | --- | --- | --- |
| 27B | 14 GB | 17 GB | 24 GB | 30 GB | 54 GB |
| 35B-A3B | 17 GB | 22 GB | 30 GB | 38 GB | 70 GB |
| 122B-A10B | 60 GB | 70 GB | 106 GB | 132 GB | 245 GB |
| 397B-A17B | 180 GB | 214 GB | 340 GB | 512 GB | 810 GB |


Between 27B and 35B-A3B, use 27B if you want slightly more accurate results and it fits on your device. Go for 35B-A3B if you want much faster inference.

  • Maximum context window: 262,144 (can be extended to 1M via YaRN)

  • presence_penalty = 0.0 to 2.0 (default 0.0, i.e. disabled). Raise it to reduce repetition, but higher values may slightly degrade performance

  • Adequate Output Length: 32,768 tokens for most queries

As Qwen3.5 is hybrid reasoning, thinking and non-thinking mode have different settings:

Thinking mode:

| Setting | General tasks | Precise coding tasks (e.g. WebDev) |
| --- | --- | --- |
| temperature | 1.0 | 0.6 |
| top_p | 0.95 | 0.95 |
| top_k | 20 | 20 |
| min_p | 0.0 | 0.0 |
| presence_penalty | 1.5 | 0.0 |
| repeat_penalty | disabled or 1.0 | disabled or 1.0 |

Thinking mode for general tasks:

Thinking mode for precise coding tasks:

Instruct (non-thinking) mode settings:

| Setting | General tasks | Reasoning tasks |
| --- | --- | --- |
| temperature | 0.7 | 1.0 |
| top_p | 0.8 | 0.95 |
| top_k | 20 | 20 |
| min_p | 0.0 | 0.0 |
| presence_penalty | 1.5 | 1.5 |
| repeat_penalty | disabled or 1.0 | disabled or 1.0 |


To disable thinking / reasoning, use --chat-template-kwargs "{\"enable_thinking\": false}"

Instruct (non-thinking) for general tasks:

Instruct (non-thinking) for reasoning tasks:

Qwen3.5 Inference Tutorials:

Because Qwen3.5 comes in many different sizes, we'll be using Dynamic 4-bit MXFP4_MOE GGUF variants for all inference workloads. Click below to navigate to the instructions for each model:


Unsloth Dynamic GGUF uploads:


Qwen3.5-35B-A3B

For this guide we will be using the Dynamic 4-bit quant, which works great on a 24GB RAM / Mac device for fast inference. Because the model is only around 72GB at full F16 precision, we won't need to worry much about performance. GGUF: Qwen3.5-35B-A3B-GGUF

✨ Run in llama.cpp

1

Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.

2

If you want llama.cpp to download and run the model directly, use the command below. (:Q3_K_XL) specifies the quantization type; you can also download the model via Hugging Face first (step 3). This works similarly to ollama run. Use export LLAMA_CACHE="folder" to force llama.cpp to save to a specific location. The model has a maximum context length of 256K.

Follow one of the specific commands below, according to your use-case:

Thinking mode:

Precise coding tasks (e.g. WebDev):

General tasks:
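A minimal llama-cli invocation for this preset might look like the following sketch, using the sampling settings from the table above. The :Q3_K_XL tag and context size are illustrative; adjust them to your quant choice and hardware:

```shell
./llama.cpp/llama-cli \
    -hf unsloth/Qwen3.5-35B-A3B-GGUF:Q3_K_XL \
    --jinja \
    --ctx-size 32768 \
    --temp 1.0 \
    --top-p 0.95 \
    --top-k 20 \
    --min-p 0.0 \
    --presence-penalty 1.5
```

For the precise-coding preset, swap in --temp 0.6 and --presence-penalty 0.0 per the table.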

Non-thinking mode:

General tasks:

Reasoning tasks:

3

Download the model with the command below (after running pip install huggingface_hub hf_transfer). You can choose Q4_K_M or other quantized versions like UD-Q4_K_XL. We recommend at least the 2-bit dynamic quant UD-Q2_K_XL to balance size and accuracy. If downloads get stuck, see: Hugging Face Hub, XET debugging
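As a sketch of the download step (the repo id follows the GGUF link above; the allow_patterns filter and local_dir are illustrative):

```python
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"  # enable faster hf_transfer downloads

from huggingface_hub import snapshot_download

# Download only the UD-Q4_K_XL files from the repo (pattern is illustrative).
snapshot_download(
    repo_id="unsloth/Qwen3.5-35B-A3B-GGUF",
    local_dir="Qwen3.5-35B-A3B-GGUF",
    allow_patterns=["*UD-Q4_K_XL*"],
)
```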

4

Then run the model in conversation mode:

Qwen3.5-27B

For this guide we will be using the Dynamic 4-bit quant, which works great on an 18GB RAM / Mac device for fast inference. GGUF: Qwen3.5-27B-GGUF

1

Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.

2

If you want llama.cpp to download and run the model directly, use the command below. (:Q3_K_XL) specifies the quantization type; you can also download the model via Hugging Face first (step 3). This works similarly to ollama run. Use export LLAMA_CACHE="folder" to force llama.cpp to save to a specific location. The model has a maximum context length of 256K.

Follow one of the specific commands below, according to your use-case:

Thinking mode:

Precise coding tasks (e.g. WebDev):

General tasks:

Non-thinking mode:

General tasks:

Reasoning tasks:

3

Download the model with the command below (after running pip install huggingface_hub hf_transfer). You can choose MXFP4_MOE or other quantized versions like UD-Q4_K_XL. We recommend at least the 2-bit dynamic quant UD-Q2_K_XL to balance size and accuracy. If downloads get stuck, see: Hugging Face Hub, XET debugging

4

Then run the model in conversation mode:
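A conversation-mode invocation might look like the following sketch. The quant tag and sampling flags are illustrative (thinking mode, general tasks per the table above); adjust them to your use-case:

```shell
./llama.cpp/llama-cli \
    -hf unsloth/Qwen3.5-27B-GGUF:UD-Q4_K_XL \
    --jinja \
    --ctx-size 32768 \
    --temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.0 \
    --presence-penalty 1.5
```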

Qwen3.5-122B-A10B

For this guide we will be using the Dynamic 4-bit quant, which works great on a 70GB RAM / Mac device for fast inference. GGUF: Qwen3.5-122B-A10B-GGUF

1

Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.

2

If you want llama.cpp to download and run the model directly, use the command below. (:Q3_K_XL) specifies the quantization type; you can also download the model via Hugging Face first (step 3). This works similarly to ollama run. Use export LLAMA_CACHE="folder" to force llama.cpp to save to a specific location. The model has a maximum context length of 256K.

Follow one of the specific commands below, according to your use-case:

Thinking mode:

Precise coding tasks (e.g. WebDev):

General tasks:

Non-thinking mode:

General tasks:

Reasoning tasks:

3

Download the model with the command below (after running pip install huggingface_hub hf_transfer). You can choose MXFP4_MOE (dynamic 4-bit) or other quantized versions like UD-Q4_K_XL. We recommend at least the 2-bit dynamic quant UD-Q2_K_XL to balance size and accuracy. If downloads get stuck, see: Hugging Face Hub, XET debugging

4

Then run the model in conversation mode:

Qwen3.5-397B-A17B

For 397B-A17B, Unsloth's 4-bit dynamic quant UD-Q4_K_XL uses 214GB of disk space - this fits directly on a 256GB M3 Ultra, and also works well with a single 24GB GPU plus 256GB of RAM using MoE offloading, at 25+ tokens/s. The 3-bit quant fits in 192GB RAM, and 8-bit requires 512GB of RAM/VRAM. GGUF: Qwen3.5-397B-A17B-GGUF

1

Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.

2

If you want llama.cpp to download and run the model directly, use the command below. (:Q3_K_XL) specifies the quantization type; you can also download the model via Hugging Face first (step 3). This works similarly to ollama run. Use export LLAMA_CACHE="folder" to force llama.cpp to save to a specific location. Remember the model has a maximum context length of 256K.

Follow this for thinking mode:

Follow this for non-thinking mode:

3

Download the model with the command below (after running pip install huggingface_hub hf_transfer). You can choose MXFP4_MOE (dynamic 4-bit) or other quantized versions like UD-Q4_K_XL. We recommend at least the 2-bit dynamic quant UD-Q2_K_XL to balance size and accuracy. If downloads get stuck, see: Hugging Face Hub, XET debugging

4

You can edit --threads 32 for the number of CPU threads, --ctx-size 16384 for context length, and --n-gpu-layers 2 for how many layers to offload to the GPU. Lower it if your GPU runs out of memory, and remove it for CPU-only inference.
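Putting those flags together, a sketch of a full invocation might look like this. The -ot pattern is the regex commonly used to keep MoE expert tensors on CPU while the rest sits on GPU; treat the quant tag and all values as starting points, not fixed settings:

```shell
./llama.cpp/llama-cli \
    -hf unsloth/Qwen3.5-397B-A17B-GGUF:UD-Q4_K_XL \
    --jinja \
    --threads 32 \
    --ctx-size 16384 \
    --n-gpu-layers 2 \
    -ot ".ffn_.*_exps.=CPU" \
    --temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.0 \
    --presence-penalty 1.5
```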

🦙 Llama-server serving & OpenAI's completion library

To deploy Qwen3.5-397B-A17B for production, we use llama-server. In a new terminal (e.g. via tmux), deploy the model via:

Then in a new terminal, after doing pip install openai, do:
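A minimal client sketch, assuming llama-server is listening on its default port 8080 (the model name is arbitrary for llama-server, and top_k / min_p go through extra_body since they are not standard OpenAI parameters):

```python
from openai import OpenAI

# Point the OpenAI client at the local llama-server endpoint.
client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="sk-no-key-required")

completion = client.chat.completions.create(
    model="qwen3.5-397b-a17b",  # name is not checked by llama-server
    messages=[{"role": "user", "content": "Hello! Solve 12 * 17."}],
    temperature=1.0,            # thinking mode, general tasks
    top_p=0.95,
    presence_penalty=1.5,
    extra_body={"top_k": 20, "min_p": 0.0},
)
print(completion.choices[0].message.content)
```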


To disable thinking / reasoning, use --chat-template-kwargs "{\"enable_thinking\": false}"

👾 OpenAI Codex & Claude Code

To run the model via local agentic coding workloads, you can follow our guide. Just change the model name 'GLM-4.7-Flash' to your desired 'Qwen3.5' model and ensure you follow the correct Qwen3.5 parameters and usage instructions. Use the llama-server we just set up.

After following the instructions for Claude Code, for example, you will see:

We can then ask, say, Create a Python game for Chess:

🔨Tool Calling with Qwen3.5

See Tool Calling Guide for more details on how to do tool calling. In a new terminal (if using tmux, use CTRL+B+D), we create some tools like adding 2 numbers, executing Python code, executing Linux functions and much more:

We then use the below functions (copy and paste and execute) which will parse the function calls automatically and call the OpenAI endpoint for any model:
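As an illustrative stand-alone sketch of that parsing step (the add_two_numbers tool and the call payload are hypothetical, but the payload shape matches OpenAI-style function calls as returned by llama-server):

```python
import json

# Hypothetical tool: the "adding 2 numbers" example from above.
def add_two_numbers(a: float, b: float) -> float:
    """Add two numbers and return the result."""
    return a + b

# Registry mapping tool names to Python callables.
TOOLS = {"add_two_numbers": add_two_numbers}

def dispatch_tool_call(tool_call: dict):
    """Parse one OpenAI-style tool call and execute the matching function."""
    name = tool_call["function"]["name"]
    args = json.loads(tool_call["function"]["arguments"])
    return TOOLS[name](**args)

# Example tool call as an OpenAI-compatible endpoint would return it:
call = {
    "id": "call_0",
    "type": "function",
    "function": {"name": "add_two_numbers", "arguments": '{"a": 3, "b": 4}'},
}
result = dispatch_tool_call(call)
print(result)  # 7
```

The same dispatch loop extends to more dangerous tools (executing Python or shell commands) by adding entries to the registry, though those should be sandboxed.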

After launching Qwen3.5 via llama-server as shown above (or see the Tool Calling Guide for more details), we can then make some tool calls.

📊 Benchmarks

Unsloth GGUF Benchmarks

Qwen3.5-397B-A17B Benchmarks

Benjamin Marie (third-party) benchmarkedarrow-up-right Qwen3.5-397B-A17B using Unsloth GGUFs on a 750-prompt mixed suite (LiveCodeBench v6, MMLU Pro, GPQA, Math500), reporting both overall accuracy and relative error increase (how much more often the quantized model makes mistakes vs. the original).

Key results (accuracy; change vs. original; relative error increase):

  • Original weights: 81.3%

  • UD-Q4_K_XL: 80.5% (−0.8 points; +4.3% relative error increase)

  • UD-Q3_K_XL: 80.7% (−0.6 points; +3.5% relative error increase)

UD-Q4_K_XL and UD-Q3_K_XL stay extremely close to the original (well under a 1-point accuracy drop on this suite), suggesting you can sharply reduce the memory footprint (~500 GB less) with little to no practical loss on the tested tasks.

How to choose: Q3 scoring slightly higher than Q4 here is plausibly just normal run-to-run variance at this scale, so treat Q3 and Q4 as effectively similar quality on this benchmark:

  • Pick Q3 if you want the smallest footprint / best memory savings

  • Pick Q4 if you want a slightly more conservative option with similar results

All listed quants use our dynamic methodology. UD-IQ2_M follows the same dynamic methodology but a different conversion process from UD-Q2_K_XL: K_XL quants are usually faster than UD-IQ2_M despite being bigger, which is why UD-IQ2_M may perform better than UD-Q2_K_XL.

Official Qwen Benchmarks

Qwen3.5-35B-A3B, 27B and 122B-A10B Benchmarks

You can view the benchmarks in table format below:

Language

Knowledge

| Benchmark | GPT-5-mini 2025-08-07 | GPT-OSS-120B | Qwen3-235B-A22B | Qwen3.5-122B-A10B | Qwen3.5-27B | Qwen3.5-35B-A3B |
| --- | --- | --- | --- | --- | --- | --- |
| MMLU-Pro | 83.7 | 80.8 | 84.4 | 86.7 | 86.1 | 85.3 |
| MMLU-Redux | 93.7 | 91.0 | 93.8 | 94.0 | 93.2 | 93.3 |
| C-Eval | 82.2 | 76.2 | 92.1 | 91.9 | 90.5 | 90.2 |
| SuperGPQA | 58.6 | 54.6 | 64.9 | 67.1 | 65.6 | 63.4 |

Instruction Following

| Benchmark | GPT-5-mini 2025-08-07 | GPT-OSS-120B | Qwen3-235B-A22B | Qwen3.5-122B-A10B | Qwen3.5-27B | Qwen3.5-35B-A3B |
| --- | --- | --- | --- | --- | --- | --- |
| IFEval | 93.9 | 88.9 | 87.8 | 93.4 | 95.0 | 91.9 |
| IFBench | 75.4 | 69.0 | 51.7 | 76.1 | 76.5 | 70.2 |
| MultiChallenge | 59.0 | 45.3 | 50.2 | 61.5 | 60.8 | 60.0 |

Long Context

| Benchmark | GPT-5-mini 2025-08-07 | GPT-OSS-120B | Qwen3-235B-A22B | Qwen3.5-122B-A10B | Qwen3.5-27B | Qwen3.5-35B-A3B |
| --- | --- | --- | --- | --- | --- | --- |
| AA-LCR | 68.0 | 50.7 | 60.0 | 66.9 | 66.1 | 58.5 |
| LongBench v2 | 56.8 | 48.2 | 54.8 | 60.2 | 60.6 | 59.0 |

STEM & Reasoning

| Benchmark | GPT-5-mini 2025-08-07 | GPT-OSS-120B | Qwen3-235B-A22B | Qwen3.5-122B-A10B | Qwen3.5-27B | Qwen3.5-35B-A3B |
| --- | --- | --- | --- | --- | --- | --- |
| HLE w/ CoT | 19.4 | 14.9 | 18.2 | 25.3 | 24.3 | 22.4 |
| GPQA Diamond | 82.8 | 80.1 | 81.1 | 86.6 | 85.5 | 84.2 |
| HMMT Feb 25 | 89.2 | 90.0 | 85.1 | 91.4 | 92.0 | 89.0 |
| HMMT Nov 25 | 84.2 | 90.0 | 89.5 | 90.3 | 89.8 | 89.2 |

Coding

| Benchmark | GPT-5-mini 2025-08-07 | GPT-OSS-120B | Qwen3-235B-A22B | Qwen3.5-122B-A10B | Qwen3.5-27B | Qwen3.5-35B-A3B |
| --- | --- | --- | --- | --- | --- | --- |
| SWE-bench Verified | 72.0 | 62.0 | -- | 72.0 | 72.4 | 69.2 |
| Terminal Bench 2 | 31.9 | 18.7 | -- | 49.4 | 41.6 | 40.5 |
| LiveCodeBench v6 | 80.5 | 82.7 | 75.1 | 78.9 | 80.7 | 74.6 |
| CodeForces | 2160 | 2157 | 2146 | 2100 | 1899 | 2028 |
| OJBench | 40.4 | 41.5 | 32.7 | 39.5 | 40.1 | 36.0 |
| FullStackBench en | 30.6 | 58.9 | 61.1 | 62.6 | 60.1 | 58.1 |
| FullStackBench zh | 35.2 | 60.4 | 63.1 | 58.7 | 57.4 | 55.0 |

General Agent

| Benchmark | GPT-5-mini 2025-08-07 | GPT-OSS-120B | Qwen3-235B-A22B | Qwen3.5-122B-A10B | Qwen3.5-27B | Qwen3.5-35B-A3B |
| --- | --- | --- | --- | --- | --- | --- |
| BFCL-V4 | 55.5 | -- | 54.8 | 72.2 | 68.5 | 67.3 |
| TAU2-Bench | 69.8 | -- | 58.5 | 79.5 | 79.0 | 81.2 |
| VITA-Bench | 13.9 | -- | 31.6 | 33.6 | 41.9 | 31.9 |
| DeepPlanning | 17.9 | -- | 17.1 | 24.1 | 22.6 | 22.8 |

Search Agent

| Benchmark | GPT-5-mini 2025-08-07 | GPT-OSS-120B | Qwen3-235B-A22B | Qwen3.5-122B-A10B | Qwen3.5-27B | Qwen3.5-35B-A3B |
| --- | --- | --- | --- | --- | --- | --- |
| HLE w/ tool | 35.8 | 19.0 | -- | 47.5 | 48.5 | 47.4 |
| Browsecomp | 48.1 | 41.1 | -- | 63.8 | 61.0 | 61.0 |
| Browsecomp-zh | 49.5 | 42.9 | -- | 69.9 | 62.1 | 69.5 |
| WideSearch | 47.2 | 40.4 | -- | 60.5 | 61.1 | 57.1 |
| Seal-0 | 34.2 | 45.1 | -- | 44.1 | 47.2 | 41.4 |

Multilingualism

| Benchmark | GPT-5-mini 2025-08-07 | GPT-OSS-120B | Qwen3-235B-A22B | Qwen3.5-122B-A10B | Qwen3.5-27B | Qwen3.5-35B-A3B |
| --- | --- | --- | --- | --- | --- | --- |
| MMMLU | 86.2 | 78.2 | 83.4 | 86.7 | 85.9 | 85.2 |
| MMLU-ProX | 78.5 | 74.5 | 77.9 | 82.2 | 82.2 | 81.0 |
| NOVA-63 | 51.9 | 51.1 | 55.4 | 58.6 | 58.1 | 57.1 |
| INCLUDE | 81.8 | 74.0 | 81.0 | 82.8 | 81.6 | 79.7 |
| Global PIQA | 88.5 | 84.1 | 85.7 | 88.4 | 87.5 | 86.6 |
| PolyMATH | 67.3 | 54.0 | 60.1 | 68.9 | 71.2 | 64.4 |
| WMT24++ | 80.7 | 74.4 | 75.8 | 78.3 | 77.6 | 76.3 |
| MAXIFE | 85.3 | 83.7 | 83.2 | 87.9 | 88.0 | 86.6 |

Notes (Language)

  • CodeForces: evaluated on an internal query set.

  • TAU2-Bench: followed official setup except airline domain fixes per Claude Opus 4.5 system card.

  • Search Agent: context-folding (256k) with pruning of earlier tool responses after a threshold.

  • WideSearch: 256k context window without context management.

  • MMLU-ProX: averaged accuracy over 29 languages.

  • WMT24++: averaged over 55 languages using XCOMET-XXL.

  • MAXIFE: accuracy over English + multilingual original prompts (23 settings).

  • -- means not available / not applicable.

Vision Language

STEM and Puzzle

| Benchmark | GPT-5-mini 2025-08-07 | Claude-Sonnet-4.5 | Qwen3-VL-235B-A22B | Qwen3.5-122B-A10B | Qwen3.5-27B | Qwen3.5-35B-A3B |
| --- | --- | --- | --- | --- | --- | --- |
| MMMU | 79.0 | 79.6 | 80.6 | 83.9 | 82.3 | 81.4 |
| MMMU-Pro | 67.3 | 68.4 | 69.3 | 76.9 | 75.0 | 75.1 |
| MathVision | 71.9 | 71.1 | 74.6 | 86.2 | 86.0 | 83.9 |
| Mathvista(mini) | 79.1 | 79.8 | 85.8 | 87.4 | 87.8 | 86.2 |
| DynaMath | 81.4 | 78.8 | 82.8 | 85.9 | 87.7 | 85.0 |
| ZEROBench | 3 | 4 | 4 | 9 | 10 | 8 |
| ZEROBench_sub | 27.3 | 26.3 | 28.4 | 36.2 | 36.2 | 34.1 |
| VlmsAreBlind | 75.8 | 85.5 | 79.5 | 96.7 | 96.9 | 97.0 |
| BabyVision | 20.9 / 34.5 | 18.6 / 34.5 | 22.2 / 34.5 | 40.2 / 34.5 | 44.6 / 34.8 | 38.4 / 29.6 |

General VQA

| Benchmark | GPT-5-mini 2025-08-07 | Claude-Sonnet-4.5 | Qwen3-VL-235B-A22B | Qwen3.5-122B-A10B | Qwen3.5-27B | Qwen3.5-35B-A3B |
| --- | --- | --- | --- | --- | --- | --- |
| RealWorldQA | 79.0 | 70.3 | 81.3 | 85.1 | 83.7 | 84.1 |
| MMStar | 74.1 | 73.8 | 78.7 | 82.9 | 81.0 | 81.9 |
| MMBench EN-DEV-v1.1 | 86.8 | 88.3 | 89.7 | 92.8 | 92.6 | 91.5 |
| SimpleVQA | 56.8 | 57.6 | 61.3 | 61.7 | 56.0 | 58.3 |
| HallusionBench | 63.2 | 59.9 | 66.7 | 67.6 | 70.0 | 67.9 |

Text Recognition and Document Understanding

| Benchmark | GPT-5-mini 2025-08-07 | Claude-Sonnet-4.5 | Qwen3-VL-235B-A22B | Qwen3.5-122B-A10B | Qwen3.5-27B | Qwen3.5-35B-A3B |
| --- | --- | --- | --- | --- | --- | --- |
| OmniDocBench1.5 | 77.0 | 85.8 | 84.5 | 89.8 | 88.9 | 89.3 |
| CharXiv(RQ) | 68.6 | 67.2 | 66.1 | 77.2 | 79.5 | 77.5 |
| MMLongBench-Doc | 50.3 | -- | 56.2 | 59.0 | 60.2 | 59.5 |
| CC-OCR | 70.8 | 68.1 | 81.5 | 81.8 | 81.0 | 80.7 |
| AI2D_TEST | 88.2 | 87.0 | 89.2 | 93.3 | 92.9 | 92.6 |
| OCRBench | 821 | 766 | 87.5 | 92.1 | 89.4 | 91.0 |

Spatial Intelligence

| Benchmark | GPT-5-mini 2025-08-07 | Claude-Sonnet-4.5 | Qwen3-VL-235B-A22B | Qwen3.5-122B-A10B | Qwen3.5-27B | Qwen3.5-35B-A3B |
| --- | --- | --- | --- | --- | --- | --- |
| ERQA | 54.0 | 45.0 | 52.5 | 62.0 | 60.5 | 64.8 |
| CountBench | 91.0 | 90.0 | 93.7 | 97.0 | 97.8 | 97.8 |
| RefCOCO(avg) | -- | -- | 91.1 | 91.3 | 90.9 | 89.2 |
| ODInW13 | -- | -- | 43.2 | 44.5 | 41.1 | 42.6 |
| EmbSpatialBench | 80.7 | 71.8 | 84.3 | 83.9 | 84.5 | 83.1 |
| RefSpatialBench | 9.0 | 2.2 | 69.9 | 69.3 | 67.7 | 63.5 |
| LingoQA | 62.4 | 12.8 | 66.8 | 80.8 | 82.0 | 79.2 |
| Hypersim | -- | -- | 11.0 | 12.7 | 13.0 | 13.1 |
| SUNRGBD | -- | -- | 34.9 | 36.2 | 35.4 | 33.4 |
| Nuscene | -- | -- | 13.9 | 15.4 | 15.2 | 14.6 |

Video Understanding

| Benchmark | GPT-5-mini 2025-08-07 | Claude-Sonnet-4.5 | Qwen3-VL-235B-A22B | Qwen3.5-122B-A10B | Qwen3.5-27B | Qwen3.5-35B-A3B |
| --- | --- | --- | --- | --- | --- | --- |
| VideoMME (w sub.) | 83.5 | 81.1 | 83.8 | 87.3 | 87.0 | 86.6 |
| VideoMME (w/o sub.) | 78.9 | 75.3 | 79.0 | 83.9 | 82.8 | 82.5 |
| VideoMMMU | 82.5 | 77.6 | 80.0 | 82.0 | 82.3 | 80.4 |
| MLVU | 83.3 | 72.8 | 83.8 | 87.3 | 85.9 | 85.6 |
| MVBench | -- | -- | 75.2 | 76.6 | 74.6 | 74.8 |
| LVBench | -- | -- | 63.6 | 74.4 | 73.6 | 71.4 |
| MMVU | 69.8 | 70.6 | 71.1 | 74.7 | 73.3 | 72.3 |

Visual Agent

| Benchmark | GPT-5-mini 2025-08-07 | Claude-Sonnet-4.5 | Qwen3-VL-235B-A22B | Qwen3.5-122B-A10B | Qwen3.5-27B | Qwen3.5-35B-A3B |
| --- | --- | --- | --- | --- | --- | --- |
| ScreenSpot Pro | -- | 36.2 | 62.0 | 70.40 | 70.28 | 68.60 |
| OSWorld-Verified | -- | 61.4 | 38.1 | 58.01 | 56.15 | 54.49 |
| AndroidWorld | -- | -- | 63.7 | 66.4 | 64.2 | 71.1 |

Tool Calling

| Benchmark | GPT-5-mini 2025-08-07 | Claude-Sonnet-4.5 | Qwen3-VL-235B-A22B | Qwen3.5-122B-A10B | Qwen3.5-27B | Qwen3.5-35B-A3B |
| --- | --- | --- | --- | --- | --- | --- |
| TIR-Bench | 24.6 / 42.5 | 27.6 / 42.5 | 29.8 / 42.5 | 53.2 / 42.5 | 59.8 / 42.3 | 55.5 / 38.0 |
| V* | 71.7 / 90.1 | 58.6 / 89.0 | 85.9 / 89.5 | 93.2 / 90.1 | 93.7 / 89.0 | 92.7 / 89.5 |

Medical VQA

| Benchmark | GPT-5-mini 2025-08-07 | Claude-Sonnet-4.5 | Qwen3-VL-235B-A22B | Qwen3.5-122B-A10B | Qwen3.5-27B | Qwen3.5-35B-A3B |
| --- | --- | --- | --- | --- | --- | --- |
| SLAKE | 70.5 | 73.6 | 54.7 | 81.6 | 80.0 | 78.7 |
| PMC-VQA | 36.3 | 55.9 | 41.2 | 63.3 | 62.4 | 62.0 |
| MedXpertQA-MM | 34.4 | 54.0 | 47.6 | 67.3 | 62.4 | 61.4 |

Notes (Vision Language)

  • MathVision: Qwen score uses a fixed prompt; other models use the higher of runs with/without \boxed{} formatting.

  • BabyVision: scores are reported as “with CI / without CI”.

  • TIR-Bench and V*: scores are reported as “with CI / without CI”.

  • -- means not available / not applicable.

Qwen3.5-397B-A17B Benchmarks

You can view the Qwen3.5-397B-A17B benchmarks in table format below:

Language Benchmarks

Knowledge

| Benchmark | GPT5.2 | Claude 4.5 Opus | Gemini-3 Pro | Qwen3-Max-Thinking | K2.5-1T-A32B | Qwen3.5-397B-A17B |
| --- | --- | --- | --- | --- | --- | --- |
| MMLU-Pro | 87.4 | 89.5 | 89.8 | 85.7 | 87.1 | 87.8 |
| MMLU-Redux | 95.0 | 95.6 | 95.9 | 92.8 | 94.5 | 94.9 |
| SuperGPQA | 67.9 | 70.6 | 74.0 | 67.3 | 69.2 | 70.4 |
| C-Eval | 90.5 | 92.2 | 93.4 | 93.7 | 94.0 | 93.0 |

Instruction Following

| Benchmark | GPT5.2 | Claude 4.5 Opus | Gemini-3 Pro | Qwen3-Max-Thinking | K2.5-1T-A32B | Qwen3.5-397B-A17B |
| --- | --- | --- | --- | --- | --- | --- |
| IFEval | 94.8 | 90.9 | 93.5 | 93.4 | 93.9 | 92.6 |
| IFBench | 75.4 | 58.0 | 70.4 | 70.9 | 70.2 | 76.5 |
| MultiChallenge | 57.9 | 54.2 | 64.2 | 63.3 | 62.7 | 67.6 |

Long Context

| Benchmark | GPT5.2 | Claude 4.5 Opus | Gemini-3 Pro | Qwen3-Max-Thinking | K2.5-1T-A32B | Qwen3.5-397B-A17B |
| --- | --- | --- | --- | --- | --- | --- |
| AA-LCR | 72.7 | 74.0 | 70.7 | 68.7 | 70.0 | 68.7 |
| LongBench v2 | 54.5 | 64.4 | 68.2 | 60.6 | 61.0 | 63.2 |

STEM

| Benchmark | GPT5.2 | Claude 4.5 Opus | Gemini-3 Pro | Qwen3-Max-Thinking | K2.5-1T-A32B | Qwen3.5-397B-A17B |
| --- | --- | --- | --- | --- | --- | --- |
| GPQA | 92.4 | 87.0 | 91.9 | 87.4 | 87.6 | 88.4 |
| HLE | 35.5 | 30.8 | 37.5 | 30.2 | 30.1 | 28.7 |
| HLE-Verified¹ | 43.3 | 38.8 | 48 | 37.6 | -- | 37.6 |

Reasoning

| Benchmark | GPT5.2 | Claude 4.5 Opus | Gemini-3 Pro | Qwen3-Max-Thinking | K2.5-1T-A32B | Qwen3.5-397B-A17B |
| --- | --- | --- | --- | --- | --- | --- |
| LiveCodeBench v6 | 87.7 | 84.8 | 90.7 | 85.9 | 85.0 | 83.6 |
| HMMT Feb 25 | 99.4 | 92.9 | 97.3 | 98.0 | 95.4 | 94.8 |
| HMMT Nov 25 | 100 | 93.3 | 93.3 | 94.7 | 91.1 | 92.7 |
| IMOAnswerBench | 86.3 | 84.0 | 83.3 | 83.9 | 81.8 | 80.9 |
| AIME26 | 96.7 | 93.3 | 90.6 | 93.3 | 93.3 | 91.3 |

General Agent

| Benchmark | GPT5.2 | Claude 4.5 Opus | Gemini-3 Pro | Qwen3-Max-Thinking | K2.5-1T-A32B | Qwen3.5-397B-A17B |
| --- | --- | --- | --- | --- | --- | --- |
| BFCL-V4 | 63.1 | 77.5 | 72.5 | 67.7 | 68.3 | 72.9 |
| TAU2-Bench | 87.1 | 91.6 | 85.4 | 84.6 | 77.0 | 86.7 |
| VITA-Bench | 38.2 | 56.3 | 51.6 | 40.9 | 41.9 | 49.7 |
| DeepPlanning | 44.6 | 33.9 | 23.3 | 28.7 | 14.5 | 34.3 |
| Tool Decathlon | 43.8 | 43.5 | 36.4 | 18.8 | 27.8 | 38.3 |
| MCP-Mark | 57.5 | 42.3 | 53.9 | 33.5 | 29.5 | 46.1 |

Search Agent³

| Benchmark | GPT5.2 | Claude 4.5 Opus | Gemini-3 Pro | Qwen3-Max-Thinking | K2.5-1T-A32B | Qwen3.5-397B-A17B |
| --- | --- | --- | --- | --- | --- | --- |
| HLE w/ tool | 45.5 | 43.4 | 45.8 | 49.8 | 50.2 | 48.3 |
| BrowseComp | 65.8 | 67.8 | 59.2 | 53.9 | --/74.9 | 69.0/78.6 |
| BrowseComp-zh | 76.1 | 62.4 | 66.8 | 60.9 | -- | 70.3 |
| WideSearch | 76.8 | 76.4 | 68.0 | 57.9 | 72.7 | 74.0 |
| Seal-0 | 45.0 | 47.7 | 45.5 | 46.9 | 57.4 | 46.9 |

Multilingualism

| Benchmark | GPT5.2 | Claude 4.5 Opus | Gemini-3 Pro | Qwen3-Max-Thinking | K2.5-1T-A32B | Qwen3.5-397B-A17B |
| --- | --- | --- | --- | --- | --- | --- |
| MMMLU | 89.5 | 90.1 | 90.6 | 84.4 | 86.0 | 88.5 |
| MMLU-ProX | 83.7 | 85.7 | 87.7 | 78.5 | 82.3 | 84.7 |
| NOVA-63 | 54.6 | 56.7 | 56.7 | 54.2 | 56.0 | 59.1 |
| INCLUDE | 87.5 | 86.2 | 90.5 | 82.3 | 83.3 | 85.6 |
| Global PIQA | 90.9 | 91.6 | 93.2 | 86.0 | 89.3 | 89.8 |
| PolyMATH | 62.5 | 79.0 | 81.6 | 64.7 | 43.1 | 73.3 |
| WMT24++ | 78.8 | 79.7 | 80.7 | 77.6 | 77.6 | 78.9 |
| MAXIFE | 88.4 | 79.2 | 87.5 | 84.0 | 72.8 | 88.2 |

Coding Agent

| Benchmark | GPT5.2 | Claude 4.5 Opus | Gemini-3 Pro | Qwen3-Max-Thinking | K2.5-1T-A32B | Qwen3.5-397B-A17B |
| --- | --- | --- | --- | --- | --- | --- |
| SWE-bench Verified | 80.0 | 80.9 | 76.2 | 75.3 | 76.8 | 76.4 |
| SWE-bench Multilingual | 72.0 | 77.5 | 65.0 | 66.7 | 73.0 | 72.0 |
| SecCodeBench | 68.7 | 68.6 | 62.4 | 57.5 | 61.3 | 68.3 |
| Terminal Bench 2 | 54.0 | 59.3 | 54.2 | 22.5 | 50.8 | 52.5 |

Notes

  • HLE-Verified: a verified and revised version of Humanity’s Last Exam (HLE), accompanied by a transparent, component-wise verification protocol and a fine-grained error taxonomy. We open-source the dataset at https://huggingface.co/datasets/skylenage/HLE-Verified.

  • TAU2-Bench: we follow the official setup except for the airline domain, where all models are evaluated by applying the fixes proposed in the Claude Opus 4.5 system card.

  • MCPMark: GitHub MCP server uses v0.30.3 from api.githubcopilot.com; Playwright tool responses are truncated at 32k tokens.

  • Search Agent: most Search Agents built on our model adopt a simple context-folding strategy (256k): once the cumulative Tool Response length reaches a preset threshold, earlier Tool Responses are pruned from the history to keep the context within limits.

  • BrowseComp: we tested two strategies: simple context-folding achieved a score of 69.0, while using the same discard-all strategy as DeepSeek-V3.2 and Kimi K2.5 achieved 78.6.

  • WideSearch: we use a 256k context window without any context management.

  • MMLU-ProX: we report the averaged accuracy on 29 languages.

  • WMT24++: a harder subset of WMT24 after difficulty labeling and rebalancing; we report the averaged scores on 55 languages using XCOMET-XXL.

  • MAXIFE: we report the accuracy on English + multilingual original prompts (totally 23 settings).

  • Empty cells (--) indicate scores not yet available or not applicable.

Vision Language Benchmarks

STEM and Puzzle

| Benchmark | GPT5.2 | Claude 4.5 Opus | Gemini-3 Pro | Qwen3-VL-235B-A22B | K2.5-1T-A32B | Qwen3.5-397B-A17B |
| --- | --- | --- | --- | --- | --- | --- |
| MMMU | 86.7 | 80.7 | 87.2 | 80.6 | 84.3 | 85.0 |
| MMMU-Pro | 79.5 | 70.6 | 81.0 | 69.3 | 78.5 | 79.0 |
| MathVision | 83.0 | 74.3 | 86.6 | 74.6 | 84.2 | 88.6 |
| Mathvista(mini) | 83.1 | 80.0 | 87.9 | 85.8 | 90.1 | 90.3 |
| We-Math | 79.0 | 70.0 | 86.9 | 74.8 | 84.7 | 87.9 |
| DynaMath | 86.8 | 79.7 | 85.1 | 82.8 | 84.4 | 86.3 |
| ZEROBench | 9 | 3 | 10 | 4 | 9 | 12 |
| ZEROBench_sub | 33.2 | 28.4 | 39.0 | 28.4 | 33.5 | 41.0 |
| BabyVision | 34.4 | 14.2 | 49.7 | 22.2 | 36.5 | 52.3/43.3 |

General VQA

| Benchmark | GPT5.2 | Claude 4.5 Opus | Gemini-3 Pro | Qwen3-VL-235B-A22B | K2.5-1T-A32B | Qwen3.5-397B-A17B |
| --- | --- | --- | --- | --- | --- | --- |
| RealWorldQA | 83.3 | 77.0 | 83.3 | 81.3 | 81.0 | 83.9 |
| MMStar | 77.1 | 73.2 | 83.1 | 78.7 | 80.5 | 83.8 |
| HallusionBench | 65.2 | 64.1 | 68.6 | 66.7 | 69.8 | 71.4 |
| MMBench (EN-DEV-v1.1) | 88.2 | 89.2 | 93.7 | 89.7 | 94.2 | 93.7 |
| SimpleVQA | 55.8 | 65.7 | 73.2 | 61.3 | 71.2 | 67.1 |

Text Recognition and Document Understanding

| Benchmark | GPT5.2 | Claude 4.5 Opus | Gemini-3 Pro | Qwen3-VL-235B-A22B | K2.5-1T-A32B | Qwen3.5-397B-A17B |
| --- | --- | --- | --- | --- | --- | --- |
| OmniDocBench1.5 | 85.7 | 87.7 | 88.5 | 84.5 | 88.8 | 90.8 |
| CharXiv(RQ) | 82.1 | 68.5 | 81.4 | 66.1 | 77.5 | 80.8 |
| MMLongBench-Doc | -- | 61.9 | 60.5 | 56.2 | 58.5 | 61.5 |
| CC-OCR | 70.3 | 76.9 | 79.0 | 81.5 | 79.7 | 82.0 |
| AI2D_TEST | 92.2 | 87.7 | 94.1 | 89.2 | 90.8 | 93.9 |
| OCRBench | 80.7 | 85.8 | 90.4 | 87.5 | 92.3 | 93.1 |

Spatial Intelligence

| Benchmark | GPT5.2 | Claude 4.5 Opus | Gemini-3 Pro | Qwen3-VL-235B-A22B | K2.5-1T-A32B | Qwen3.5-397B-A17B |
| --- | --- | --- | --- | --- | --- | --- |
| ERQA | 59.8 | 46.8 | 70.5 | 52.5 | -- | 67.5 |
| CountBench | 91.9 | 90.6 | 97.3 | 93.7 | 94.1 | 97.2 |
| RefCOCO(avg) | -- | -- | 84.1 | 91.1 | 87.8 | 92.3 |
| ODInW13 | -- | -- | 46.3 | 43.2 | -- | 47.0 |
| EmbSpatialBench | 81.3 | 75.7 | 61.2 | 84.3 | 77.4 | 84.5 |
| RefSpatialBench | -- | -- | 65.5 | 69.9 | -- | 73.6 |
| LingoQA | 68.8 | 78.8 | 72.8 | 66.8 | 68.2 | 81.6 |
| V* | 75.9 | 67.0 | 88.0 | 85.9 | 77.0 | 95.8/91.1 |
| Hypersim | -- | -- | -- | 11.0 | -- | 12.5 |
| SUNRGBD | -- | -- | -- | 34.9 | -- | 38.3 |
| Nuscene | -- | -- | -- | 13.9 | -- | 16.0 |

Video Understanding

| Benchmark | GPT5.2 | Claude 4.5 Opus | Gemini-3 Pro | Qwen3-VL-235B-A22B | K2.5-1T-A32B | Qwen3.5-397B-A17B |
| --- | --- | --- | --- | --- | --- | --- |
| VideoMME (w sub.) | 86 | 77.6 | 88.4 | 83.8 | 87.4 | 87.5 |
| VideoMME (w/o sub.) | 85.8 | 81.4 | 87.7 | 79.0 | 83.2 | 83.7 |
| VideoMMMU | 85.9 | 84.4 | 87.6 | 80.0 | 86.6 | 84.7 |
| MLVU (M-Avg) | 85.6 | 81.7 | 83.0 | 83.8 | 85.0 | 86.7 |
| MVBench | 78.1 | 67.2 | 74.1 | 75.2 | 73.5 | 77.6 |
| LVBench | 73.7 | 57.3 | 76.2 | 63.6 | 75.9 | 75.5 |
| MMVU | 80.8 | 77.3 | 77.5 | 71.1 | 80.4 | 75.4 |

Visual Agent

| Benchmark | GPT5.2 | Claude 4.5 Opus | Gemini-3 Pro | Qwen3-VL-235B-A22B | K2.5-1T-A32B | Qwen3.5-397B-A17B |
| --- | --- | --- | --- | --- | --- | --- |
| ScreenSpot Pro | -- | 45.7 | 72.7 | 62.0 | -- | 65.6 |
| OSWorld-Verified | 38.2 | 66.3 | -- | 38.1 | 63.3 | 62.2 |
| AndroidWorld | -- | -- | -- | 63.7 | -- | 66.8 |

Medical

| Benchmark | GPT5.2 | Claude 4.5 Opus | Gemini-3 Pro | Qwen3-VL-235B-A22B | K2.5-1T-A32B | Qwen3.5-397B-A17B |
| --- | --- | --- | --- | --- | --- | --- |
| VQA-RAD | 69.8 | 65.6 | 74.5 | 65.4 | 79.9 | 76.3 |
| SLAKE | 76.9 | 76.4 | 81.3 | 54.7 | 81.6 | 79.9 |
| OM-VQA | 72.9 | 75.5 | 80.3 | 65.4 | 87.4 | 85.1 |
| PMC-VQA | 58.9 | 59.9 | 62.3 | 41.2 | 63.3 | 64.2 |
| MedXpertQA-MM | 73.3 | 63.6 | 76.0 | 47.6 | 65.3 | 70.0 |

Notes

  • MathVision: our model’s score is evaluated using a fixed prompt, e.g., “Please reason step by step, and put your final answer within \boxed{}.” For other models, we report the higher score between runs with and without the \boxed{} formatting.

  • BabyVision: our model’s score is reported with CI (Code Interpreter) enabled; without CI, the result is 43.3.

  • V*: our model’s score is reported with CI (Code Interpreter) enabled; without CI, the result is 91.1.

  • Empty cells (--) indicate scores not yet available or not applicable.
