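🌠 Qwen3-Coder-Next: How to Run Locally

A guide to running Qwen3-Coder-Next locally on your own device!

Qwen released Qwen3-Coder-Next, an 80B MoE model (3B active parameters) with 256K context, built for fast agentic coding and local use. Its performance is comparable to models with 10–20× more active parameters.

It runs on 46GB of RAM/VRAM/unified memory (85GB for 8-bit) and is a non-reasoning model, which keeps code responses very fast. The model excels at long-horizon reasoning, complex tool use, and recovery from execution failures.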

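Feb 19 update: Tool calling should now be more reliable after llama.cpp fixed a parsing issue.

New! See the quantization benchmarks for our dynamic GGUFs!

Are Q6 or Q8 GGUFs failing in LM Studio? LM Studio has pushed a fix, so please update and re-download.

Feb 4: llama.cpp fixed a bug in the calculation of the vectorized key_gdiff. This fixes the earlier looping and output issues. We have updated the GGUFs; please re-download them and update llama.cpp for better outputs.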

You will also learn how to run the model with Codex & Claude Code. For fine-tuning, bf16 LoRA of Qwen3-Coder-Next in Unsloth fits on a single B200 GPU.

Qwen3-Coder-Next Unsloth dynamic GGUFs to run: unsloth/Qwen3-Coder-Next-GGUF

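⚙️ Usage Guide

Don't have 46GB of RAM or unified memory? No worries, you can run our smaller quants, such as 3-bit. As a rule of thumb, your total capacity should cover the model size (disk space + RAM + VRAM ≥ size of the quant). If the quant fits fully in your device's memory, expect 20+ tokens/s. If it doesn't, it will still work through offloading, but it will be slower.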
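The rule of thumb above can be sketched as a quick check. This is only an illustration: the function name and the example machine sizes are assumptions, and only the 46GB figure comes from this guide.

```python
def placement(quant_gb: float, ram_gb: float, vram_gb: float, disk_gb: float) -> str:
    """Classify how a quant of size quant_gb would run on a machine (hypothetical helper)."""
    if quant_gb <= ram_gb + vram_gb:
        return "fits in memory"        # the fast case described above
    if quant_gb <= ram_gb + vram_gb + disk_gb:
        return "runs with offloading"  # still works, but slower
    return "does not fit"

# Example machines (made-up sizes) against the 46GB 4-bit requirement
print(placement(46, ram_gb=32, vram_gb=24, disk_gb=500))
print(placement(46, ram_gb=16, vram_gb=8, disk_gb=500))
```

If the second case applies to you, a smaller quant such as 3-bit is the usual way back to the fast case.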

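For best performance, Qwen recommends these settings:

Temperature = 1.0
Top_P = 0.95
Top_K = 40
Min_P = 0.01 (llama.cpp's default is 0.05)
Repeat penalty = disabled or 1.0

The model natively supports up to 262,144 tokens of context, but you can set it to 32,768 tokens to reduce memory use.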

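🖥️ Run Qwen3-Coder-Next

You will need different settings depending on your use case. Because this guide uses 4-bit, you will need around 46GB of RAM/unified memory. We recommend at least 3-bit precision for best performance.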

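Note: this model supports only non-thinking mode and does not generate <think></think> blocks in its output, so specifying enable_thinking=False is no longer required.

Llama.cpp Tutorial (GGUF):

Instructions for running in llama.cpp (note that we use 4-bit so the model fits on most devices):

Get the latest llama.cpp from GitHub. You can also follow the build instructions below. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or only want CPU inference.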

apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-mtmd-cli llama-server llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp

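You can pull the model directly from Hugging Face. If you have enough RAM/VRAM, you can increase the context to 256K. Using --fit on will also determine the context length automatically.

You can use the recommended parameters: temperature=1.0, top_p=0.95, top_k=40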

./llama.cpp/llama-cli \
    -hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_XL \
    --ctx-size 16384 \
    --temp 1.0 --top-p 0.95 --min-p 0.01 --top-k 40

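Alternatively, download the model with the commands below (after running pip install huggingface_hub). You can choose UD-Q4_K_XL or another quantized version. If the download gets stuck, see Hugging Face Hub, XET debugging.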

pip install -U huggingface_hub
hf download unsloth/Qwen3-Coder-Next-GGUF \
    --local-dir unsloth/Qwen3-Coder-Next-GGUF \
    --include "*UD-Q4_K_XL*"
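The same download can be done from Python. This is a minimal sketch, assuming the huggingface_hub package is installed; snapshot_download and its allow_patterns argument belong to that library, while quant_patterns and download_quant are illustrative names introduced here.

```python
def quant_patterns(quant: str) -> list[str]:
    """Glob patterns that select only the files of one quant, e.g. UD-Q4_K_XL."""
    return [f"*{quant}*"]

def download_quant(quant: str = "UD-Q4_K_XL") -> str:
    """Download one quant of the GGUF repo and return the local directory."""
    # Imported lazily so the helper above works without the package installed
    from huggingface_hub import snapshot_download
    return snapshot_download(
        repo_id="unsloth/Qwen3-Coder-Next-GGUF",
        local_dir="unsloth/Qwen3-Coder-Next-GGUF",
        allow_patterns=quant_patterns(quant),
    )

print(quant_patterns("UD-Q4_K_XL"))
```

Calling download_quant() starts the actual download, so expect it to take a while and to need enough free disk space for the chosen quant.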

Then run the model in conversation mode:

./llama.cpp/llama-cli \
    --model unsloth/Qwen3-Coder-Next-GGUF/Qwen3-Coder-Next-UD-Q4_K_XL.gguf \
    --seed 3407 \
    --temp 1.0 \
    --top-p 0.95 \
    --min-p 0.01 \
    --top-k 40

Also adjust the context window as needed, up to 262,144 tokens.

🦙 Llama-server serving & deployment

To deploy Qwen3-Coder-Next for production, we use llama-server. In a new terminal (for example via tmux), deploy the model with:

./llama.cpp/llama-server \
    --model unsloth/Qwen3-Coder-Next-GGUF/Qwen3-Coder-Next-UD-Q4_K_XL.gguf \
    --alias "unsloth/Qwen3-Coder-Next" \
    --seed 3407 \
    --temp 1.0 \
    --top-p 0.95 \
    --min-p 0.01 \
    --top-k 40 \
    --port 8001

Then, in a new terminal, after running pip install openai, we can query the model:

from openai import OpenAI
import json
openai_client = OpenAI(
    base_url = "http://127.0.0.1:8001/v1",
    api_key = "sk-no-key-required",
)
completion = openai_client.chat.completions.create(
    model = "unsloth/Qwen3-Coder-Next",
    messages = [{"role": "user", "content": "Create a Flappy Bird game in HTML"},],
)
print(completion.choices[0].message.content)

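This will output:

Here is a complete, runnable Flappy Bird game packaged in a single file.

I used **HTML5 Canvas** for the graphics and **JavaScript** for the physics (gravity, collision detection, and scoring). No external images or downloads are needed; the game draws the bird and pipes in code.

### How to run:
1.  Copy the code block below.
2.  Create a new file named `game.html` on your computer.
3.  Paste the code into that file and save it.
4.  Double-click `game.html` to open it in your browser.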

```html
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Simple Flappy Bird</title>
    <style>
        body {
            margin: 0;
            padding: 0;
            display: flex;
            flex-direction: column;
            justify-content: center;
            align-items: center;
            height: 100vh;
            background-color: #333;
            font-family: 'Courier New', Courier, monospace;
            color: white;
        }

        h1 {
            margin-bottom: 10px;
        }

        #game-container {
            position: relative;
            box-shadow: 0 0 20px rgba(0,0,0,0.5);
        }

        canvas {
            display: block;
            background-color: #70c5ce; /* Sky blue */
            border: 4px solid #000;
        }

        #ui-layer {
            position: absolute;
            top: 0;
            left: 0;
            width: 100%;
            height: 100%;
            pointer-events: none; /* Let clicks pass through to canvas */
            display: flex;
            flex-direction: column;
            justify-content: center;
            align-items: center;
            text-align: center;
        }

        .message {
            background: rgba(255, 255, 255, 0.8);
            padding: 20px;
            border-radius: 10px;
            color: #333;
        }

        #score-board {
            position: absolute;
            top: 20px;
            width: 100%;
            text-align: center;
            font-size: 40px;
            font-weight: bold;
            color: white;
            text-shadow: 2px 2px 0 #000;
            z-index: 10;
        }
    </style>
</head>
<body>

    <h1>Flappy Bird Clone</h1>
    
    <div id="game-container">
        <div id="score-board">0</div>
        <canvas id="birdCanvas" width="320" height="480"></canvas>
        
        <div id="ui-layer">
            <div id="start-screen" class="message">
                <h2>Ready to Fly?</h2>
                <p>Press <strong>Space</strong> or <strong>Click</strong> to Jump</p>
                <p>Press Space to Start</p>
            </div>
            <div id="game-over-screen" class="message" style="display: none;">
                <h2>Game Over</h2>
                <p>Score: <span id="final-score">0</span></p>
                <p>Best: <span id="best-score">0</span></p>
                <p>Press <strong>Space</strong> to Restart</p>
            </div>
        </div>
    </div>

    <script>
        // --- Configuration ---
        const canvas = document.getElementById('birdCanvas');
        const ctx = canvas.getContext('2d');
        const scoreElement = document.getElementById('score-board');
        const startScreen = document.getElementById('start-screen');
        const gameOverScreen = document.getElementById('game-over-screen');
        const finalScoreSpan = document.getElementById('final-score');
        const bestScoreSpan = document.getElementById('best-score');

        // Game Variables
        let frames = 0;
        let score = 0;
        let highScore = localStorage.getItem('flappyHighScore') || 0;
        let gameState = 'START'; // START, PLAYING, GAMEOVER
        const gravity = 0.25;
        const speed = 2; // Speed of pipes moving left

        // --- The Bird Object ---
        const bird = {
            x: 50,
            y: 150,
            width: 30,
            height: 30,
            velocity: 0,
            jumpStrength: 4.5,
            radius: 15,
            draw: function() {
                ctx.fillStyle = "#FFD700"; // Gold color
                ctx.beginPath();
                ctx.arc(this.x + this.radius, this.y + this.radius, this.radius, 0, Math.PI * 2);
                ctx.fill();
                ctx.lineWidth = 2;
                ctx.stroke();

                // Eye
                ctx.fillStyle = "white";
                ctx.beginPath();
                ctx.arc(this.x + this.radius + 5, this.y + this.radius - 5, 5, 0, Math.PI * 2);
                ctx.fill();
                ctx.fillStyle = "black";
                ctx.beginPath();
                ctx.arc(this.x + this.radius + 7, this.y + this.radius - 5, 2, 0, Math.PI * 2);
                ctx.fill();
                
                // Beak
                ctx.fillStyle = "orange";
                ctx.beginPath();
                ctx.moveTo(this.x + this.radius + 10, this.y + this.radius);
                ctx.lineTo(this.x + this.radius + 20, this.y + this.radius + 5);
                ctx.lineTo(this.x + this.radius + 10, this.y + this.radius + 10);
                ctx.fill();
                ctx.stroke();
            },
            update: function() {
                this.velocity += gravity;
                this.y += this.velocity;

                // Floor Collision
                if (this.y + this.height >= canvas.height) {
                    this.y = canvas.height - this.height;
                    gameOver();
                }
                
                // Ceiling Collision (Optional: prevents flying over pipes)
                if (this.y < 0) {
                    this.y = 0;
                    this.velocity = 0;
                }
            },
            jump: function() {
                this.velocity = -this.jumpStrength;
            },
            reset: function() {
                this.y = 150;
                this.velocity = 0;
            }
        };

        // --- The Pipes Array ---
        const pipes = {
            position: [],
            width: 50,
            gap: 120, // Space between top and bottom pipe
            dx: 2, // Movement speed

            draw: function() {
                for (let i = 0; i < this.position.length; i++) {
                    let p = this.position[i];
                    let topY = p.y;
                    let bottomY = p.y + this.gap;

                    ctx.fillStyle = "#228B22"; // Forest Green

                    // Top Pipe
                    ctx.fillRect(p.x, 0, this.width, topY);
                    ctx.strokeRect(p.x, 0, this.width, topY);

                    // Bottom Pipe
                    ctx.fillRect(p.x, bottomY, this.width, canvas.height - bottomY);
                    ctx.strokeRect(p.x, bottomY, this.width, canvas.height - bottomY);
                }
            },

            update: function() {
                // Add new pipe every 120 frames (approx 2 seconds)
                if (frames % 120 === 0) {
                    // Random height for top pipe
                    // Min height 50, Max height canvas - gap - 50
                    let maxY = canvas.height - this.gap - 50;
                    let randomY = Math.floor(Math.random() * (maxY - 50 + 1) + 50);
                    
                    this.position.push({
                        x: canvas.width,
                        y: randomY
                    });
                }

                for (let i = 0; i < this.position.length; i++) {
                    let p = this.position[i];
                    p.x -= this.dx;

                    // Collision Detection
                    // Logic: Check if bird X is within pipe X range
                    if (bird.x + bird.width > p.x && bird.x < p.x + this.width) {
                        // Logic: Check if bird Y is hitting Top Pipe OR Bottom Pipe
                        if (bird.y < p.y || bird.y + bird.height > p.y + this.gap) {
                            gameOver();
                        }
                    }

                    // Score Update (when bird passes pipe)
                    if (p.x + this.width < bird.x && !p.passed) {
                        score++;
                        scoreElement.innerText = score;
                        p.passed = true;
                    }

                    // Remove pipes that have gone off screen
                    if (p.x + this.width <= 0) {
                        this.position.shift();
                        // Decrement i because array length changed
                        i--; 
                    }
                }
            },
            
            reset: function() {
                this.position = [];
            }
        };

        // --- Background (Clouds/Grass) ---
        const background = {
            draw: function() {
                // Draw Grass
                ctx.fillStyle = "#7cfc00"; // Lawn Green
                ctx.fillRect(0, canvas.height - 20, canvas.width, 20);
                ctx.beginPath();
                ctx.moveTo(0, canvas.height - 20);
                ctx.lineTo(canvas.width, canvas.height - 20);
                ctx.stroke();
            }
        };

        // --- Game Control Functions ---

        function loop() {
            // Clear Canvas
            ctx.clearRect(0, 0, canvas.width, canvas.height);

            // Draw Background
            background.draw();

            if (gameState === 'START') {
                bird.draw();
                // Draw a ground line
                ctx.fillStyle = "#ded895";
                ctx.fillRect(0, canvas.height - 10, canvas.width, 10);
            } 
            else if (gameState === 'PLAYING') {
                bird.update();
                bird.draw();
                pipes.update();
                pipes.draw();
                frames++;
            } 
            else if (gameState === 'GAMEOVER') {
                pipes.draw();
                bird.draw();
                // Don't update frames or position, just freeze
            }

            requestAnimationFrame(loop);
        }

        function startGame() {
            gameState = 'PLAYING';
            startScreen.style.display = 'none';
            gameOverScreen.style.display = 'none';
            score = 0;
            frames = 0;
            scoreElement.innerText = score;
            bird.reset();
            pipes.reset();
        }

        function gameOver() {
            gameState = 'GAMEOVER';
            
            // Update High Score
            if (score > highScore) {
                highScore = score;
                localStorage.setItem('flappyHighScore', highScore);
            }

            finalScoreSpan.innerText = score;
            bestScoreSpan.innerText = highScore;
            gameOverScreen.style.display = 'block';
        }

        // --- Input Handling ---

        function handleInput(e) {
            // Prevent default scrolling behavior for Space
            if (e.type === 'keydown' && e.code === 'Space') {
                e.preventDefault();
            }

            if (e.code === 'Space' || e.type === 'mousedown' || e.type === 'touchstart') {
                switch (gameState) {
                    case 'START':
                        startGame();
                        bird.jump();
                        break;
                    case 'PLAYING':
                        bird.jump();
                        break;
                    case 'GAMEOVER':
                        startGame();
                        bird.jump();
                        break;
                }
            }
        }

        window.addEventListener('keydown', handleInput);
        canvas.addEventListener('mousedown', handleInput);
        canvas.addEventListener('touchstart', handleInput);

        // Initialize
        loop();

    </script>
</body>
</html>
```

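### Features of this version:
1.  **Physics engine:** Realistic gravity and jump mechanics.
2.  **Collision detection:** The game ends if you hit a pipe, the ground, or the ceiling.
3.  **Scoring system:** You earn 1 point for every pipe you pass.
4.  **High score:** Uses the browser's LocalStorage to remember your best score, even after a page refresh.
5.  **Responsive controls:** Supports the **spacebar**, **mouse clicks**, or **touch** (mobile devices).
6.  **Graphics:** The bird is drawn in code (including the eye and beak) and the pipes have outlines, so there are no broken image links.

We extracted the HTML and ran it, and the generated Flappy Bird game worked well!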

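👾 OpenAI Codex & Claude Code

To run the model in local agentic coding workloads, you can follow our guides. Just change the model name 'GLM-4.7-Flash' to 'Qwen3-Coder-Next', and make sure you follow the correct Qwen3-Coder-Next parameters and usage instructions. Use the llama-server we just set up.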

Claude Code

OpenAI Codex

For example, after following the Claude Code instructions you will see:

We can then ask it, for example, to create a Python game for chess:

If you see API Error: 400 {"error":{"code":400,"message":"request (16582 tokens) exceeds the available context size (16384 tokens), try increasing it","type":"exceed_context_size_error","n_prompt_tokens":16582,"n_ctx":16384}} this means you need to increase the context length (--ctx-size), or see the section on how to fit long context.

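🎱 FP8 Qwen3-Coder-Next in vLLM

You can now use our new FP8 Dynamic quant of the model for high-quality, fast inference. First install vLLM from the nightly build. Change --extra-index-url https://wheels.vllm.ai/nightly/cu130 to match the CUDA version reported by nvidia-smi; only cu129 and cu130 are currently supported.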

# Install uv if you don't have it, for faster environment installs
curl -LsSf https://astral.sh/uv/install.sh | sh

# Make a new Python environment - not needed if you want to change your whole system
uv venv unsloth_fp8 --python 3.12 --seed
source unsloth_fp8/bin/activate

uv pip install --upgrade --force-reinstall vllm --torch-backend=auto --extra-index-url https://wheels.vllm.ai/nightly/cu130
uv pip install --upgrade --force-reinstall git+https://github.com/huggingface/transformers.git
uv pip install --force-reinstall numba

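Then serve Unsloth's dynamic FP8 version of the model. You can also enable an FP8 KV cache, which reduces KV cache memory usage by 50%, by adding --kv-cache-dtype fp8. We served it on 4 GPUs; if you have 1 GPU, use CUDA_VISIBLE_DEVICES='0' and set --tensor-parallel-size 1, or remove that argument. Use tmux to launch the command below in a new terminal, then press CTRL+B followed by D to detach. Use tmux attach-session -t0 to return to the session.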

export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:False
CUDA_VISIBLE_DEVICES='0,1,2,3' vllm serve unsloth/Qwen3-Coder-Next-FP8-Dynamic \
    --served-model-name unsloth/Qwen3-Coder-Next \
    --tensor-parallel-size 4 \
    --tool-call-parser qwen3_coder \
    --enable-auto-tool-choice \
    --dtype bfloat16 \
    --seed 3407 \
    --max-model-len 200000 \
    --gpu-memory-utilization 0.93 \
    --port 8001

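You should see something like the following. See Tool Calling with Qwen3-Coder-Next for how to actually use Qwen3-Coder-Next through the OpenAI API with tool calling; this works for both vLLM and llama-server.

🔧 Tool Calling with Qwen3-Coder-Next

In a new terminal, we create some tools, such as adding two numbers, executing Python code, executing Linux commands, and so on: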

import json, subprocess, random
from typing import Any
def add_number(a: float | str, b: float | str) -> float:
    return float(a) + float(b)
def multiply_number(a: float | str, b: float | str) -> float:
    return float(a) * float(b)
def substract_number(a: float | str, b: float | str) -> float:
    return float(a) - float(b)
def write_a_story() -> str:
    return random.choice([
        "A long time ago in a galaxy far, far away...",
        "There were two friends who loved sloths and code...",
        "The world was collapsing because every sloth evolved superhuman intelligence...",
        "Unbeknownst to one friend, the other accidentally wrote a program that makes sloths evolve...",
    ])
def terminal(command: str) -> str:
    if "rm" in command or "sudo" in command or "dd" in command or "chmod" in command:
        msg = "Cannot execute 'rm, sudo, dd, chmod' commands since they are dangerous"
        print(msg); return msg
    print(f"Executing terminal command `{command}`")
    try:
        return str(subprocess.run(command, capture_output = True, text = True, shell = True, check = True).stdout)
    except subprocess.CalledProcessError as e:
        return f"Command failed: {e.stderr}"
def python(code: str) -> str:
    data = {}
    exec(code, data)
    del data["__builtins__"]
    return str(data)
MAP_FN = {
    "add_number": add_number,
    "multiply_number": multiply_number,
    "substract_number": substract_number,
    "write_a_story": write_a_story,
    "terminal": terminal,
    "python": python,
}
tools = [
    {
        "type": "function",
        "function": {
            "name": "add_number",
            "description": "Add two numbers.",
            "parameters": {
                "type": "object",
                "properties": {
                    "a": {
                        "type": "string",
                        "description": "The first number.",
                    },
                    "b": {
                        "type": "string",
                        "description": "The second number.",
                    },
                },
                "required": ["a", "b"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "multiply_number",
            "description": "Multiply two numbers.",
            "parameters": {
                "type": "object",
                "properties": {
                    "a": {
                        "type": "string",
                        "description": "The first number.",
                    },
                    "b": {
                        "type": "string",
                        "description": "The second number.",
                    },
                },
                "required": ["a", "b"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "substract_number",
            "description": "Subtract the second number from the first.",
            "parameters": {
                "type": "object",
                "properties": {
                    "a": {
                        "type": "string",
                        "description": "The first number.",
                    },
                    "b": {
                        "type": "string",
                        "description": "The second number.",
                    },
                },
                "required": ["a", "b"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "write_a_story",
            "description": "Writes a random story.",
            "parameters": {
                "type": "object",
                "properties": {},
                "required": [],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "terminal",
            "description": "Perform operations from the terminal.",
            "parameters": {
                "type": "object",
                "properties": {
                    "command": {
                        "type": "string",
                        "description": "The command you wish to launch, e.g. `ls`, `rm`, ...",
                    },
                },
                "required": ["command"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "python",
            "description": "Call a Python interpreter to run some Python code.",
            "parameters": {
                "type": "object",
                "properties": {
                    "code": {
                        "type": "string",
                        "description": "The Python code to run",
                    },
                },
                "required": ["code"],
            },
        },
    },
]
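One caveat about the terminal tool above: its guard is a substring test, so it also rejects harmless commands such as ls firmware (which contains "rm"). A word-based check is one possible refinement. The sketch below uses shlex from the standard library; blocked_words is a hypothetical helper, and neither version is a real sandbox, so only run model-issued commands in an environment you can afford to break.

```python
import shlex

BLOCKED = {"rm", "sudo", "dd", "chmod"}

def blocked_words(command: str) -> set[str]:
    """Return the blocked program names that appear as whole words in command."""
    try:
        tokens = shlex.split(command)
    except ValueError:
        # Unbalanced quotes: treat the command as suspicious and block it
        return set(BLOCKED)
    return {token for token in tokens if token in BLOCKED}

print(blocked_words("ls firmware"))         # nothing blocked
print(blocked_words("sudo rm -rf /tmp/x"))  # both sudo and rm are flagged
```

This still misses cases such as /bin/rm or commands glued together without spaces, which is why isolation (a container or a throwaway VM) matters more than the filter itself.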

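We then use the function below (copy, paste, and execute it), which parses function calls automatically and calls the OpenAI-compatible endpoint for any model: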

from openai import OpenAI
def unsloth_inference(
    messages,
    temperature = 1.0,
    top_p = 0.95,
    top_k = 40,
    min_p = 0.01,
    repetition_penalty = 1.0,
):
    messages = messages.copy()
    openai_client = OpenAI(
        base_url = "http://127.0.0.1:8001/v1",
        api_key = "sk-no-key-required",
    )
    model_name = next(iter(openai_client.models.list())).id
    print(f"Using model = {model_name}")
    has_tool_calls = True
    original_messages_len = len(messages)
    while has_tool_calls:
        print(f"Current messages = {messages}")
        response = openai_client.chat.completions.create(
            model = model_name,
            messages = messages,
            temperature = temperature,
            top_p = top_p,
            tools = tools if tools else None,
            tool_choice = "auto" if tools else None,
            extra_body = {"top_k": top_k, "min_p": min_p, "repetition_penalty" :repetition_penalty,}
        )
        tool_calls = response.choices[0].message.tool_calls or []
        content = response.choices[0].message.content or ""
        tool_calls_dict = [tc.to_dict() for tc in tool_calls] if tool_calls else tool_calls
        messages.append({"role": "assistant", "tool_calls": tool_calls_dict, "content": content,})
        for tool_call in tool_calls:
            fx, args, _id = tool_call.function.name, tool_call.function.arguments, tool_call.id
            out = MAP_FN[fx](**json.loads(args))
            messages.append({"role": "tool", "tool_call_id": _id, "name": fx, "content": str(out),})
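        # Keep looping only while the model is still requesting tools;
        # a for-else here would always run and stop after the first round
        has_tool_calls = len(tool_calls) > 0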
    return messages
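In the loop above, MAP_FN[fx](**json.loads(args)) raises if the model names an unknown tool, sends malformed JSON, or omits an argument, which ends the whole conversation. Since the model is described as good at recovering from execution failures, one option is to hand the error back as the tool result instead. A minimal sketch, where safe_dispatch and demo_registry are illustrative names and not part of the original code:

```python
import json

def safe_dispatch(name: str, raw_args: str, registry: dict) -> str:
    """Run one tool call and always return a string the model can read."""
    if name not in registry:
        return f"Error: unknown tool '{name}'"
    try:
        args = json.loads(raw_args or "{}")
        return str(registry[name](**args))
    except Exception as exc:
        # Report the failure to the model instead of crashing the loop
        return f"Error: {type(exc).__name__}: {exc}"

# Stand-in for MAP_FN so this example runs on its own
demo_registry = {"add_number": lambda a, b: float(a) + float(b)}

print(safe_dispatch("add_number", '{"a": "2", "b": "3"}', demo_registry))
print(safe_dispatch("add_number", '{"a": "2"}', demo_registry))
print(safe_dispatch("divide", "{}", demo_registry))
```

To try it, the line out = MAP_FN[fx](**json.loads(args)) could be swapped for out = safe_dispatch(fx, args, MAP_FN).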

Now we will showcase several ways of running tool calling for different use cases:

Execute generated Python code

messages = [{
    "role": "user",
    "content": [{"type": "text", "text": "Create a Fibonacci function in Python and find fib(20)."}],
}]
unsloth_inference(messages, temperature = 1.0, top_p = 0.95, top_k = 40, min_p = 0.00)

Execute arbitrary terminal commands

messages = [{
    "role": "user",
    "content": [{"type": "text", "text": "Write 'I'm a happy Sloth' to a file, then print it back to me."}],
}]
messages = unsloth_inference(messages, temperature = 1.0, top_p = 1.0, top_k = 40, min_p = 0.00)

We checked whether the file was created, and it was!

See the Tool Calling Guide for more tool calling examples.

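🛠️ Improving generation speed

If you use vLLM / SGLang, try our FP8-Dynamic quants, which can boost throughput by 25% or more! See FP8 Qwen3-Coder-Next in vLLM.

If you have more VRAM, you can try offloading more MoE layers, or offloading whole layers themselves.

Normally, -ot ".ffn_.*_exps.=CPU" offloads all MoE layers to the CPU! This effectively lets you fit all non-MoE layers on 1 GPU, which improves generation speed. If you have more GPU capacity, you can customize the regular expression to keep more layers on the GPU.

If you have a bit more GPU memory, try -ot ".ffn_(up|down)_exps.=CPU". This offloads the up and down projection MoE layers.

If you have even more GPU memory, try -ot ".ffn_(up)_exps.=CPU". This offloads only the up projection MoE layers.

You can also customize the regular expression. For example, -ot "\.(6|7|8|9|[0-9][0-9]|[0-9][0-9][0-9])\.ffn_(gate|up|down)_exps.=CPU" offloads the gate, up, and down MoE layers, but only from the 6th layer onwards.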
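Before launching a long run, it can help to preview which tensors a pattern would select. The sketch below applies the regular-expression part of the last -ot example with Python's re module. Two things are assumptions here: the tensor names follow the usual blk.<layer>.<tensor>.weight GGUF naming, and Python's regex engine treats this simple pattern the same way llama.cpp does.

```python
import re

# Regular-expression part of the -ot example (everything before "=CPU")
PATTERN = re.compile(r"\.(6|7|8|9|[0-9][0-9]|[0-9][0-9][0-9])\.ffn_(gate|up|down)_exps.")

# Assumed tensor names for illustration
names = [
    "blk.5.ffn_gate_exps.weight",
    "blk.6.ffn_gate_exps.weight",
    "blk.47.ffn_down_exps.weight",
    "blk.47.attn_q.weight",
]

def offloaded(tensor_names: list[str]) -> list[str]:
    """Return the tensor names the pattern would send to the CPU."""
    return [name for name in tensor_names if PATTERN.search(name)]

for name in names:
    target = "CPU" if PATTERN.search(name) else "default device"
    print(f"{name} -> {target}")
```

Layer 5 and the attention tensor stay on the default device, while the expert tensors from layer 6 upwards are matched, which is the behavior described above.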

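The latest llama.cpp release also introduces a high-throughput mode. Use llama-parallel. Read more about it here. You can also quantize the KV cache to 4 bits, for example, to reduce VRAM / RAM movement, which can also make generation faster. The next section covers KV cache quantization.

📐 How to fit long context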

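To fit longer context, you can use KV cache quantization to quantize the K and V caches to lower bit widths. This can also increase generation speed because less data moves between RAM and VRAM. The allowed options for K quantization (the default is f16) are:

--cache-type-k f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1

Use the _1 variants (for example q4_1 or q5_1) for somewhat higher accuracy, although they are slightly slower. So try --cache-type-k q4_1.

You can also quantize the V cache, but you will need to compile llama.cpp with Flash Attention support via -DGGML_CUDA_FA_ALL_QUANTS=ON and pass --flash-attn to enable it. Once Flash Attention is available, you can use --cache-type-v q4_1.

If you are using our Dynamic FP8 quants (see FP8 Qwen3-Coder-Next in vLLM), FP8 KV cache quantization can roughly double the supported context length. Add --kv-cache-dtype fp8.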
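To get a feel for how much each cache type saves, the sketch below compares storage per value against the f16 default. The block sizes are an assumption based on the standard ggml block layouts (32 values per block plus scale or offset bytes); this gives relative sizes only and does not estimate the absolute cache size for this model.

```python
# Bytes used per block of 32 values (assumed ggml block layouts)
BLOCK_BYTES = {"f16": 64, "q8_0": 34, "q5_1": 24, "q5_0": 22, "q4_1": 20, "q4_0": 18}

def relative_size(cache_type: str) -> float:
    """Cache size for cache_type as a fraction of the f16 default."""
    return BLOCK_BYTES[cache_type] / BLOCK_BYTES["f16"]

for cache_type in BLOCK_BYTES:
    print(f"{cache_type}: {relative_size(cache_type):.4f} of f16")
```

Under these assumptions q4_1 needs about 31% of the f16 cache and q4_0 about 28%, so the accuracy gain of the _1 variants costs only a small amount of extra memory.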

📐 GGUF Quantization Benchmarks

Below are some quantization benchmark results from third-party evaluators.

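The benchmarks were run by third-party contributors on the Aider Polyglot server and compare Unsloth GGUF quants on the Aider Polyglot benchmark (score vs. VRAM). Notably, the 3-bit UD-IQ3_XXS quant comes close to BF16 performance, which makes 3-bit a reasonable minimum for most use cases. NVFP4 slightly outperforms the BF16 reference, which is probably sampling noise from the limited number of runs. However, the overall pattern of steady improvement from 1-bit → 2-bit → 3-bit → 6-bit suggests the benchmark is capturing meaningful quality differences between the Unsloth GGUFs. The non-Unsloth FP8 appears to perform worse than UD-Q6_K_XL, which may reflect differences in the quantization pipeline or, again, too few samples.

Qwen3-Coder-Next Benchmarks

Qwen3-Coder-Next is the best-performing model for its size, and its performance is comparable to models with 10–20× more active parameters.

| Benchmark | Qwen3-Coder-Next (80B) | DeepSeek-V3.2 (671B) | GLM-4.7 (358B) | MiniMax M2.1 (229B) |
|---|---|---|---|---|
| SWE-Bench Verified (w/ SWE-Agent) | 70.6 | 70.2 | 74.2 | 74.8 |
| SWE-Bench Multilingual (w/ SWE-Agent) | 62.8 | 62.3 | 63.7 | 66.2 |
| SWE-Bench Pro (w/ SWE-Agent) | 44.3 | 40.9 | 40.6 | 34.6 |
| Terminal-Bench 2.0 (w/ Terminus-2 json) | 36.2 | 39.3 | 37.1 | 32.6 |
| Aider | 66.2 | 69.9 | 52.1 | 61.0 |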
