🌠 Qwen3-Coder-Next: How to Run Locally

Guide to run Qwen3-Coder-Next locally on your device!

Qwen releases Qwen3-Coder-Next, an 80B MoE model (3B active parameters) with 256K context for fast agentic coding and local use. Its performance is comparable to that of models with 10–20× more active parameters.

It runs in 46GB of RAM/VRAM/unified memory (85GB for 8-bit) and is non-reasoning, for ultra-quick code responses. The model excels at long-horizon reasoning, complex tool use, and recovery from execution failures.


You’ll also learn to run the model on Codex & Claude Code. For fine-tuning, Qwen3-Coder-Next fits on a single B200 GPU for bf16 LoRA in Unsloth.

Qwen3-Coder-Next Unsloth Dynamic GGUFs to run: unsloth/Qwen3-Coder-Next-GGUF

Run GGUF Tutorial | Codex & Claude Code | FP8 vLLM Tutorial

⚙️ Usage Guide

Don't have 46GB of RAM or unified memory? No worries, you can run our smaller quants like 3-bit. It is best to have the sum of your compute at least equal to the quant size (disk space + RAM + VRAM ≥ size of quant). If your quant fully fits on your device, expect 20+ tokens/s. If it doesn't fit, it'll still work by offloading, but it will be slower.

To achieve optimal performance, Qwen recommends these settings:

  • Temperature = 1.0

  • Top_P = 0.95

  • Top_K = 40

  • Min_P = 0.01 (llama.cpp's default is 0.05)

  • Repeat penalty = 1.0 (i.e. disabled)

The model natively supports up to 262,144 tokens of context, but you can set it to 32,768 tokens for less memory use.

🖥️ Run Qwen3-Coder-Next

Depending on your use case, you will need different settings. Because this guide uses 4-bit, you will need around 46GB of RAM/unified memory. We recommend using at least 3-bit precision for best performance.


NOTE: This model supports only non-thinking mode and does not generate <think></think> blocks in its output. So specifying enable_thinking=False is no longer required.

Llama.cpp Tutorial (GGUF):

Instructions to run in llama.cpp (note we will be using 4-bit to fit most devices):

1. Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.
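
A sketch of the standard build, assuming a Debian/Ubuntu-style system (adapt the package manager as needed):

```bash
# Install build tooling (assumes apt; adapt for your distro).
apt-get update
apt-get install build-essential cmake curl libcurl4-openssl-dev -y
# Clone and build llama.cpp with CUDA; set -DGGML_CUDA=OFF for CPU-only inference.
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first \
    --target llama-cli llama-server
# Copy the built binaries next to the repo root for convenience.
cp llama.cpp/build/bin/llama-* llama.cpp
```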

2. You can directly pull from Hugging Face. You can increase the context to 256K if your RAM/VRAM can fit it; using --fit on will also auto-determine the context length. Use the recommended parameters: temperature=1.0, top_p=0.95, top_k=40.
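
A minimal sketch of the direct pull (the UD-Q4_K_XL quant tag is one of the options in the repo; others work the same way):

```bash
./llama.cpp/llama-cli \
    -hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_XL \
    --ctx-size 32768 \
    --temp 1.0 \
    --top-p 0.95 \
    --top-k 40 \
    --min-p 0.01 \
    --repeat-penalty 1.0
# Raise --ctx-size toward 262144 if your RAM/VRAM allows, or use --fit on
# (per the note above) to auto-determine the context length.
```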

3. Download the model via the snippet below (after installing the package with pip install huggingface_hub). You can choose UD-Q4_K_XL or other quantized versions. If downloads get stuck, see our Hugging Face Hub XET debugging guide.
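
A sketch using huggingface_hub's snapshot_download (the local_dir name is our choice):

```bash
pip install huggingface_hub
python3 - <<'PY'
from huggingface_hub import snapshot_download

# Download only the UD-Q4_K_XL files; change the pattern for other quants.
snapshot_download(
    repo_id="unsloth/Qwen3-Coder-Next-GGUF",
    local_dir="Qwen3-Coder-Next-GGUF",
    allow_patterns=["*UD-Q4_K_XL*"],
)
PY
```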

4. Then run the model in conversation mode, adjusting the context window as required (up to 262,144 tokens):
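
A sketch of the run command (the exact GGUF file name inside the download directory is an assumption; if the quant is split into shards, point at the first one):

```bash
./llama.cpp/llama-cli \
    --model Qwen3-Coder-Next-GGUF/UD-Q4_K_XL/Qwen3-Coder-Next-UD-Q4_K_XL.gguf \
    --ctx-size 32768 \
    --temp 1.0 \
    --top-p 0.95 \
    --top-k 40 \
    --min-p 0.01 \
    --repeat-penalty 1.0
```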


🦙 Llama-server serving & deployment

To deploy Qwen3-Coder-Next for production, we use llama-server. Open a new terminal (say, via tmux), then deploy the model via:
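
A minimal sketch (the port and GGUF file name are assumptions; --jinja enables the chat template, which the tool calling section later relies on):

```bash
./llama.cpp/llama-server \
    --model Qwen3-Coder-Next-GGUF/UD-Q4_K_XL/Qwen3-Coder-Next-UD-Q4_K_XL.gguf \
    --host 0.0.0.0 \
    --port 8001 \
    --ctx-size 32768 \
    --temp 1.0 --top-p 0.95 --top-k 40 --min-p 0.01 \
    --jinja
```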

Then in a new terminal, after running pip install openai, we can query the model:
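
A sketch of the client call (the port matches the server above; the Flappy Bird prompt mirrors the example mentioned below):

```bash
pip install openai
python3 - <<'PY'
from openai import OpenAI

# llama-server exposes an OpenAI-compatible endpoint; the API key is unused.
client = OpenAI(base_url="http://localhost:8001/v1", api_key="sk-no-key-needed")
response = client.chat.completions.create(
    model="Qwen3-Coder-Next",
    messages=[{"role": "user", "content": "Create a Flappy Bird game in a single HTML file."}],
    temperature=1.0,
    top_p=0.95,
)
print(response.choices[0].message.content)
PY
```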

This will print the model's generated response.

We extracted the HTML and ran it, and the example Flappy Bird game it generated worked well!

👾 OpenAI Codex & Claude Code

To run the model in local agentic coding workloads, you can follow our guide. Just change the model name 'GLM-4.7-Flash' to 'Qwen3-Coder-Next' and ensure you follow the correct Qwen3-Coder-Next parameters and usage instructions. Use the llama-server endpoint we set up above.
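
For OpenAI Codex, a hypothetical sketch of the setup (the provider id, config layout, and port are illustrative, not taken from the linked guide; check that guide for exact steps):

```bash
# Register the local llama-server as a Codex model provider (sketch).
mkdir -p ~/.codex
cat > ~/.codex/config.toml <<'TOML'
model = "Qwen3-Coder-Next"
model_provider = "llamacpp"

[model_providers.llamacpp]
name = "llama.cpp"
base_url = "http://localhost:8001/v1"
TOML
codex
```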

After following the instructions for Claude Code, for example, you will see:

We can then ask, say, "Create a Python game for Chess":

If you see API Error: 400 {"error":{"code":400,"message":"request (16582 tokens) exceeds the available context size (16384 tokens), try increasing it","type":"exceed_context_size_error","n_prompt_tokens":16582,"n_ctx":16384}}, that means you need to increase the context length, or see 📐 How to fit long context.

🎱 FP8 Qwen3-Coder-Next in vLLM

You can now use our new FP8 Dynamic quant of the model for fast, high-quality inference. First, install vLLM from nightly. Change --extra-index-url https://wheels.vllm.ai/nightly/cu130 to match your CUDA version, found via nvidia-smi; only cu129 and cu130 are currently supported.
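
A sketch of the nightly install (swap cu130 for your CUDA version as noted above):

```bash
# --pre allows pip to pick up the nightly (pre-release) vLLM wheels.
pip install -U vllm --pre \
    --extra-index-url https://wheels.vllm.ai/nightly/cu130
```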

Then serve Unsloth's dynamic FP8 version of the model. You can also enable FP8 KV cache to reduce KV cache memory usage by 50% by adding --kv-cache-dtype fp8. We served it on 4 GPUs, but if you have 1 GPU, use CUDA_VISIBLE_DEVICES='0' and set --tensor-parallel-size 1 or remove this argument. Use tmux to launch the below in a new terminal, then press CTRL+B followed by D to detach; use tmux attach-session -t 0 to return to it.
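
A sketch of the serve command (the FP8 repo name is an assumption based on the link above; drop --tensor-parallel-size or set it to 1 for a single GPU):

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve unsloth/Qwen3-Coder-Next-FP8 \
    --tensor-parallel-size 4 \
    --max-model-len 262144 \
    --kv-cache-dtype fp8 \
    --port 8001
```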

You should see something like below. See Tool Calling with Qwen3-Coder-Next for how to actually use Qwen3-Coder-Next using the OpenAI API and tool calling - this works for vLLM and llama-server.

🔧 Tool Calling with Qwen3-Coder-Next

In a new terminal, we create some tools, such as adding 2 numbers, executing Python code, executing Linux commands, and much more:
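
As a minimal sketch of the idea, here is one hypothetical add tool registered via the standard OpenAI tools schema against the llama-server from earlier (started with --jinja):

```bash
curl -s http://localhost:8001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3-Coder-Next",
    "messages": [{"role": "user", "content": "What is 1923 + 4578? Use the add tool."}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "add",
        "description": "Add two numbers together",
        "parameters": {
          "type": "object",
          "properties": {
            "a": {"type": "number"},
            "b": {"type": "number"}
          },
          "required": ["a", "b"]
        }
      }
    }]
  }' | tee response.json
```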

We then use the below functions (copy, paste, and execute them), which will parse the function calls automatically and call the OpenAI endpoint for any model:
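
The full helper functions aren't reproduced here; as a rough stand-in, this sketch parses the tool call out of response.json (per the standard OpenAI response schema, assumes jq is installed) and executes it locally:

```bash
# Extract the first tool call the model produced.
NAME=$(jq -r '.choices[0].message.tool_calls[0].function.name' response.json)
ARGS=$(jq -r '.choices[0].message.tool_calls[0].function.arguments' response.json)

if [ "$NAME" = "add" ]; then
  A=$(echo "$ARGS" | jq -r '.a')
  B=$(echo "$ARGS" | jq -r '.b')
  # Run the real tool; in a full loop you would send this result back as a
  # role:"tool" message so the model can compose its final answer.
  echo "add result: $((A + B))"
fi
```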

Now we'll showcase multiple methods of running tool-calling for many different use-cases below:

Execute generated Python code

Execute arbitrary terminal functions

We checked that the file was created, and it was!

See Tool Calling Guide for more examples for tool calling.

🛠️ Improving generation speed


If you have more VRAM, you can try offloading more MoE layers, or offloading whole layers themselves.

Normally, -ot ".ffn_.*_exps.=CPU" offloads all MoE layers to the CPU! This effectively allows you to fit all non-MoE layers on 1 GPU, improving generation speeds. You can customize the regular expression to fit more layers if you have more GPU capacity.

If you have a bit more GPU memory, try -ot ".ffn_(up|down)_exps.=CPU". This offloads the up and down projection MoE layers.

Try -ot ".ffn_(up)_exps.=CPU" if you have even more GPU memory. This offloads only the up projection MoE layers.

You can also customize the regex, for example -ot "\.(6|7|8|9|[0-9][0-9]|[0-9][0-9][0-9])\.ffn_(gate|up|down)_exps.=CPU" means to offload gate, up and down MoE layers but only from the 6th layer onwards.
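
Putting it together, a sketch of a full command (the GGUF file name, as before, is an assumption):

```bash
# Keep all dense/attention layers on GPU, push every MoE expert to CPU.
./llama.cpp/llama-server \
    --model Qwen3-Coder-Next-GGUF/UD-Q4_K_XL/Qwen3-Coder-Next-UD-Q4_K_XL.gguf \
    --n-gpu-layers 99 \
    -ot ".ffn_.*_exps.=CPU"
```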

The latest llama.cpp release also introduces a high-throughput mode. Use llama-parallel. Read more about it here. You can also quantize the KV cache to 4 bits, for example, to reduce VRAM / RAM movement, which can also make the generation process faster. The next section talks about KV cache quantization.

📐 How to fit long context

To fit longer context, you can use KV cache quantization to quantize the K and V caches to lower bits. This can also increase generation speed due to reduced RAM / VRAM data movement. The allowed options for K quantization (default is f16) include the below.

--cache-type-k f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1

You should use the _1 variants for somewhat increased accuracy, albeit slightly slower; for example, q4_1 or q5_1. So try out --cache-type-k q4_1.

You can also quantize the V cache, but you will need to compile llama.cpp with Flash Attention support via -DGGML_CUDA_FA_ALL_QUANTS=ON, and use --flash-attn to enable it. With Flash Attention enabled, you can then use --cache-type-v q4_1.
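
A sketch combining these options (recent llama.cpp builds take --flash-attn on; older ones use the bare --flash-attn switch):

```bash
./llama.cpp/llama-server \
    --model Qwen3-Coder-Next-GGUF/UD-Q4_K_XL/Qwen3-Coder-Next-UD-Q4_K_XL.gguf \
    --ctx-size 131072 \
    --flash-attn on \
    --cache-type-k q4_1 \
    --cache-type-v q4_1
```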

If you are using our Dynamic FP8 quants (see 🎱 FP8 Qwen3-Coder-Next in vLLM), FP8 KV cache quantization can approximately double the supported context length. Add --kv-cache-dtype fp8.

📐 Benchmarks

GGUF Quantization Benchmarks

Here are some quantization benchmarks conducted by third-party assessors.

Benchmarks were run by third-party contributors, comparing Unsloth GGUF quantizations on the Aider Polyglot benchmark (score vs. VRAM). Notably, the 3-bit UD-IQ3_XXS quant comes close to BF16 performance, making 3-bit a sensible minimum for most use cases.

NVFP4 slightly outperforms the BF16 reference, which may be sampling noise due to limited runs; however, the overall pattern of 1-bit → 2-bit → 3-bit → 6-bit steadily improving suggests the benchmark is capturing meaningful quality differences across Unsloth GGUFs. The non-Unsloth FP8 seems to perform worse than both UD-IQ3_XXS and UD-Q6_K_XL, which could reflect differences in the quantization pipeline or, again, insufficient sampling.

Third-party results conducted by Benjamin Marie evaluating Qwen3.5-397B-A17B with Unsloth GGUF quants on a 750-prompt mixed suite (LiveCodeBench v6, MMLU Pro, GPQA, Math500). Both UD-Q4_K_XL and UD-Q3_K_XL track the original weights very closely: Original = 81.3%, UD-Q4_K_XL = 80.5% (-0.8; +4.3% relative error increase), and UD-Q3_K_XL = 80.7% (-0.6; +3.5% relative error increase). In other words, the observed degradation is well under 1 accuracy point, supporting Benjamin’s conclusion that you can dramatically reduce memory footprint (he reports ~500 GB less) with little to no practical loss on the tasks tested.

Note that Q3 scoring slightly higher than Q4 here is plausible as normal measurement variance (margin of error) at this scale, so treat Q3 vs Q4 as similar in quality for this run, and pick based on your memory/throughput goals (Q3 for minimum footprint; Q4 for a slightly more conservative option with similar results).

Qwen3-Coder-Next Benchmarks

Qwen3-Coder-Next is the best-performing model for its size, and its performance is comparable to models with 10–20× more active parameters.

| Benchmark | Qwen3-Coder-Next (80B) | DeepSeek-V3.2 (671B) | GLM-4.7 (358B) | MiniMax M2.1 (229B) |
| --- | --- | --- | --- | --- |
| SWE-Bench Verified (w/ SWE-Agent) | 70.6 | 70.2 | 74.2 | 74.8 |
| SWE-Bench Multilingual (w/ SWE-Agent) | 62.8 | 62.3 | 63.7 | 66.2 |
| SWE-Bench Pro (w/ SWE-Agent) | 44.3 | 40.9 | 40.6 | 34.6 |
| Terminal-Bench 2.0 (w/ Terminus-2 json) | 36.2 | 39.3 | 37.1 | 32.6 |
| Aider | 66.2 | 69.9 | 52.1 | 61.0 |
