GLM-4.7-Flash: How to Run Locally

How to run and fine-tune GLM-4.7-Flash locally!

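GLM-4.7-Flash is Z.ai's new 30B MoE reasoning model built for local deployment, delivering best-in-class performance for coding, agentic workflows, and chat. It activates about 3.6B parameters, supports a 200K context, and leads on SWE-Bench, GPQA, and reasoning/chat benchmarks.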

GLM-4.7-Flash runs on 24GB of RAM/VRAM/unified memory (32GB for full precision), and you can now fine-tune it with Unsloth. To run GLM-4.7-Flash with vLLM, see the vLLM section below.

Jan 21 update: llama.cpp fixed a bug that specified the wrong scoring_func: "softmax" (it should be "sigmoid"). This caused looping and poor outputs. We have updated the GGUFs, so please re-download the model for better outputs.

You can now get great results with Z.ai's recommended parameters:

For general use cases: --temp 1.0 --top-p 0.95
For tool calling: --temp 0.7 --top-p 1.0
Repeat penalty: disable it, or set --repeat-penalty 1.0

Jan 22: the CUDA FA (flash attention) fix has been merged, so inference is now faster.


GLM-4.7-Flash GGUF to run: unsloth/GLM-4.7-Flash-GGUF

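⚙️ Usage Guide

For best performance, make sure your total available memory (VRAM + system RAM) exceeds the size of the quantized model file you are downloading. If it does not, llama.cpp can still run via SSD/HDD offloading, but inference will be slower.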
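The memory rule above can be sketched as a quick check. This is a minimal illustration and not part of llama.cpp: the function name `fits_in_memory` and the example sizes are assumptions, and you would plug in your own VRAM, RAM, and GGUF file sizes.

```python
def fits_in_memory(model_file_gb: float, vram_gb: float, ram_gb: float) -> bool:
    """Return True when VRAM + system RAM exceeds the quantized file size."""
    return (vram_gb + ram_gb) > model_file_gb


# Hypothetical machines and a hypothetical 18 GB GGUF file.
print(fits_in_memory(18.0, vram_gb=8.0, ram_gb=16.0))  # True: 24 GB total > 18 GB
print(fits_in_memory(18.0, vram_gb=0.0, ram_gb=16.0))  # False: would rely on disk offloading
```

When the check fails, the model can still run through disk offloading as described above, only more slowly.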

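After consulting with the Z.ai team, we recommend the following sampling parameters for GLM-4.7:

| Default settings (most tasks) | Terminal Bench, SWE-Bench Verified |
| --- | --- |
| temperature = 1.0 | temperature = 0.7 |
| top_p = 0.95 | top_p = 1.0 |
| repeat penalty = disabled or 1.0 | repeat penalty = disabled or 1.0 |

For general use cases: --temp 1.0 --top-p 0.95
For tool calling: --temp 0.7 --top-p 1.0
If you use llama.cpp, set --min-p 0.01 (llama.cpp's default is 0.05)
You may need to experiment to find the best values for your use case.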
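One way to keep these recommendations consistent across scripts is a small helper that turns a preset into llama.cpp flags. This is only a sketch: the preset names and the function `llama_cpp_sampling_flags` are made up for illustration, while the flag names and values are the ones listed above.

```python
# Values copied from the recommendations above; preset names are hypothetical.
PRESETS = {
    "general": {"temp": 1.0, "top_p": 0.95},
    "tool_calling": {"temp": 0.7, "top_p": 1.0},
}


def llama_cpp_sampling_flags(preset: str, min_p: float = 0.01) -> list[str]:
    """Build the sampling flags for a preset, with repeat penalty disabled."""
    p = PRESETS[preset]
    return [
        "--temp", str(p["temp"]),
        "--top-p", str(p["top_p"]),
        "--min-p", str(min_p),
        "--repeat-penalty", "1.0",
    ]


print(" ".join(llama_cpp_sampling_flags("general")))
# --temp 1.0 --top-p 0.95 --min-p 0.01 --repeat-penalty 1.0
```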

For now, we do not recommend running this GGUF with Ollama because of potential chat template compatibility issues. The GGUF works well with llama.cpp (or backends built on it, such as LM Studio and Jan).

Remember to disable repeat penalty! Or set --repeat-penalty 1.0

Maximum context window: 202,752

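🖥️ Run GLM-4.7-Flash

Different use cases need different settings. Some GGUFs end up similar in size because the model architecture (for example gpt-oss) has dimensions that are not divisible by 128, so parts of the model cannot be quantized to lower bits.

This guide uses 4-bit, which needs about 18GB of RAM/unified memory. We recommend using at least 4-bit precision for best performance.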
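A rough back-of-the-envelope estimate shows why a 4-bit quant of a 30B model lands near that figure. The effective 4.8 bits per weight used here is an assumption (4-bit K-quants keep some tensors at higher precision), so treat the result as an order-of-magnitude check rather than the exact file size.

```python
def approx_model_size_gb(n_params_billion: float, bits_per_weight: float) -> float:
    """Weights only: parameters x bits / 8, in GB (1 GB = 1e9 bytes)."""
    return n_params_billion * bits_per_weight / 8


# 30B parameters at an assumed ~4.8 effective bits per weight.
print(round(approx_model_size_gb(30, 4.8), 1))  # 18.0
```

The estimate ignores the KV cache and runtime buffers, which grow with the context length you choose.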

Remember to disable repeat penalty! Or set --repeat-penalty 1.0

Llama.cpp Tutorial (GGUF):

Instructions to run in llama.cpp (we use 4-bit so it fits on most devices):

Get the latest llama.cpp from GitHub. You can also follow the build instructions below. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you do not have a GPU or only want CPU inference. For Apple Mac/Metal devices, set -DGGML_CUDA=OFF and continue as usual; Metal support is enabled by default.

apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-mtmd-cli llama-server llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp

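You can pull the model directly from Hugging Face. You can increase the context up to 200K as your RAM/VRAM allows.

You can also try Z.ai's recommended GLM-4.7 sampling parameters:

For general use cases: --temp 1.0 --top-p 0.95
For tool calling: --temp 0.7 --top-p 1.0
Remember to disable repeat penalty!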

Follow this for general instruction use cases:

./llama.cpp/llama-cli \
    -hf unsloth/GLM-4.7-Flash-GGUF:UD-Q4_K_XL \
    --temp 1.0 --top-p 0.95 --min-p 0.01

Follow this for tool-calling use cases:

./llama.cpp/llama-cli \
    -hf unsloth/GLM-4.7-Flash-GGUF:UD-Q4_K_XL \
    --temp 0.7 --top-p 1.0 --min-p 0.01

Alternatively, download the model with the command below (after running pip install huggingface_hub hf_transfer). You can choose UD-Q4_K_XL or another quantized version. If the download gets stuck, see Hugging Face Hub, XET debugging.

hf download unsloth/GLM-4.7-Flash-GGUF \
    --local-dir unsloth/GLM-4.7-Flash-GGUF \
    --include "*UD-Q4_K_XL*"

Then run the model in conversation mode:

./llama.cpp/llama-cli \
    --model unsloth/GLM-4.7-Flash-GGUF/GLM-4.7-Flash-UD-Q4_K_XL.gguf \
    --seed 3407 \
    --temp 1.0 \
    --top-p 0.95 \
    --min-p 0.01

Also adjust the context window as needed, up to a maximum of 202,752.

➿ Reducing repetition and looping

Jan 21 update: llama.cpp fixed a bug that specified the wrong value. "scoring_func": "softmax" caused looping and poor outputs (it should be sigmoid). We have updated the GGUFs. Please re-download the model for better outputs.

This means you can now get great results with Z.ai's recommended parameters:

For general use cases: --temp 1.0 --top-p 0.95
For tool calling: --temp 0.7 --top-p 1.0
If you use llama.cpp, set --min-p 0.01 (llama.cpp's default is 0.05)
Remember to disable repeat penalty! Or set --repeat-penalty 1.0

We added "scoring_func": "sigmoid" to config.json for the main model.
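To see why the wrong scoring function changes behaviour, here is a small illustration of how the two functions score the same router logits. The logits are made up, and this is not the model's or llama.cpp's routing code; it only shows that softmax scores compete and sum to 1, while sigmoid scores each expert independently.

```python
import math


def softmax(logits):
    """Normalised scores: they compete with each other and sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(logits):
    """Independent scores: each lies in (0, 1) regardless of the others."""
    return [1.0 / (1.0 + math.exp(-x)) for x in logits]


router_logits = [2.0, 1.0, 0.0, -1.0]  # hypothetical logits for 4 experts
soft = softmax(router_logits)
sig = sigmoid(router_logits)
print("softmax:", [round(s, 3) for s in soft], "sum =", round(sum(soft), 3))
print("sigmoid:", [round(s, 3) for s in sig], "sum =", round(sum(sig), 3))
```

Both functions rank the experts in the same order, but the weights applied to the selected experts differ, so a model trained with one and run with the other sees different mixtures.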

🐦 Flappy Bird example with UD-Q4_K_XL

As an example, we had the following long conversation using UD-Q4_K_XL via ./llama.cpp/llama-cli --model unsloth/GLM-4.7-Flash-GGUF/GLM-4.7-Flash-UD-Q4_K_XL.gguf --fit on --temp 1.0 --top-p 0.95 --min-p 0.01 :

Hi
What is 2+2
Create a Flappy Bird game in Python
Create a totally different game in Rust
Find bugs in both
Make the 1st game I mentioned but in a standalone HTML file
Find bugs and show the fixed game

This produced the following Flappy Bird game as a standalone HTML file:
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no">
    <title>Flappy Bird Fixed</title>
    <style>
        body {
            margin: 0;
            min-height: 100vh;
            display: flex;
            justify-content: center;
            align-items: center;
            background-color: #222;
            font-family: 'Segoe UI', sans-serif;
            user-select: none;
            -webkit-user-select: none;
            touch-action: none; /* prevent zoom on mobile */
        }

        #game-container {
            position: relative;
            box-shadow: 0 0 20px rgba(0,0,0,0.5);
            background-color: #87CEEB;
        }

        canvas {
            display: block;
            border-radius: 4px;
        }

        /* UI overlay */
        #ui-layer {
            position: absolute;
            top: 0;
            left: 0;
            width: 100%;
            height: 100%;
            pointer-events: none;
        }

        #score-display {
            position: absolute;
            top: 40px;
            left: 50%;
            transform: translateX(-50%);
            color: #fff;
            font-size: 48px;
            font-weight: bold;
            text-shadow: 2px 2px 0 #000;
            z-index: 10;
            font-family: 'Courier New', Courier, monospace;
        }

        #start-screen, #game-over-screen {
            position: absolute;
            top: 0;
            left: 0;
            width: 100%;
            height: 100%;
            background: rgba(0, 0, 0, 0.7);
            color: #fff;
            display: flex;
            flex-direction: column;
            justify-content: center;
            align-items: center;
            text-align: center;
            pointer-events: auto; /* allow clicks */
            cursor: pointer;
        }

        #game-over-screen { display: none; }

        h1 { margin: 0 0 10px 0; font-size: 60px; text-shadow: 4px 4px 0 #000; line-height: 1; }
        p { font-size: 22px; margin: 10px 0; color: #ddd; }

        .btn {
            background: linear-gradient(to bottom, #ffeb3b, #fbc02d);
            border: 3px solid #fff;
            color: #333;
            padding: 15px 40px;
            font-size: 28px;
            cursor: pointer;
            border-radius: 8px;
            box-shadow: 0 6px 0 #c49000, 0 10px 10px rgba(0,0,0,0.3);
            text-transform: uppercase;
            transition: all 0.1s;
            margin-top: 10px;
        }

        .btn:active {
            transform: translateY(4px);
            box-shadow: 0 2px 0 #c49000, 0 4px 4px rgba(0,0,0,0.3);
        }

        .score-board {
            background: #ded895;
            border: 4px solid #543847;
            padding: 20px 40px;
            border-radius: 10px;
            box-shadow: 4px 4px 0 #543847;
            margin-bottom: 30px;
            display: none;
        }

        .score-board h2 { margin: 0 0 5px 0; color: #e86101; font-size: 40px; }
        .score-board span { font-size: 20px; color: #543847; display: block; text-align: center; }
    </style>
</head>

<body>
    <div id="game-container">
        <canvas id="gameCanvas" width="400" height="600"></canvas>

        <div id="score-display">0</div>

        <div id="ui-layer">
            <div id="start-screen">
                <h1>FLAPPY<br>BIRD</h1>
                <p>Tap or press Space to start</p>
            </div>

            <div id="game-over-screen">
                <h1>GAME OVER</h1>
                <div class="score-board" id="score-board">
                    <h2>Score: <span id="final-score">0</span></h2>
                </div>
                <button class="btn" id="restart-btn">Try Again</button>
            </div>
        </div>
    </div>

<script>
    const canvas = document.getElementById('gameCanvas');
    const ctx = canvas.getContext('2d');

    // --- Constants ---
    const GRAVITY = 0.35; // slightly stronger gravity for a better feel
    const JUMP_STRENGTH = -6.5;
    const PIPE_GAP = 180;
    const PIPE_WIDTH = 60;
    const PIPE_SPEED = 2.5;
    const PIPE_SPAWN_RATE = 100;

    // --- State ---
    let frames = 0;
    let score = 0;
    let pipes = [];
    let isPlaying = false;
    let isGameOver = false;
    let gameLoopId;

    const ui = {
        startScreen: document.getElementById('start-screen'),
        gameOverScreen: document.getElementById('game-over-screen'),
        scoreDisplay: document.getElementById('score-display'),
        scoreBoard: document.getElementById('score-board'),
        finalScore: document.getElementById('final-score'),
        restartBtn: document.getElementById('restart-btn')
    };

    // Draw a filled circle in the current coordinate system
    function fillCircle(x, y, r, color) {
        ctx.fillStyle = color;
        ctx.beginPath();
        ctx.arc(x, y, r, 0, Math.PI * 2);
        ctx.fill();
    }

    const bird = {
        x: 80,
        y: 150,
        radius: 12, // fixed radius
        velocity: 0,

        draw: function() {
            // Rotate the bird based on its velocity for visual effect
            let angle = Math.min(Math.PI / 4, Math.max(-Math.PI / 4, (this.velocity * 0.1)));

            ctx.save();
            ctx.translate(this.x, this.y);
            ctx.rotate(angle);

            fillCircle(0, 0, this.radius, '#FFD700'); // body
            fillCircle(4, -4, 4, 'white');            // eye
            fillCircle(6, -4, 2, 'black');            // pupil
            fillCircle(-4, 4, 5, '#FFA500');          // wing

            ctx.restore();
        },

        update: function() {
            this.velocity += GRAVITY;
            this.y += this.velocity;
        },

        jump: function() {
            this.velocity = JUMP_STRENGTH;
        },

        reset: function() {
            this.y = 150;
            this.velocity = 0;
        }
    };

    function createPipe() {
        const minHeight = 50;
        const maxPos = canvas.height - PIPE_GAP - minHeight;
        const topHeight = Math.floor(Math.random() * (maxPos - minHeight + 1)) + minHeight;

        pipes.push({
            x: canvas.width,
            topHeight: topHeight,
            bottomY: topHeight + PIPE_GAP,
            width: PIPE_WIDTH,
            passed: false
        });
    }

    function drawPipes() {
        ctx.strokeStyle = '#27ae60';
        ctx.lineWidth = 2;

        pipes.forEach(pipe => {
            ctx.fillStyle = '#2ecc71';

            // Top pipe
            ctx.fillRect(pipe.x, 0, pipe.width, pipe.topHeight);
            ctx.strokeRect(pipe.x, 0, pipe.width, pipe.topHeight);

            // Bottom pipe
            ctx.fillRect(pipe.x, pipe.bottomY, pipe.width, canvas.height - pipe.bottomY);
            ctx.strokeRect(pipe.x, pipe.bottomY, pipe.width, canvas.height - pipe.bottomY);

            // Caps
            const capH = 20;
            ctx.fillStyle = '#27ae60';
            ctx.fillRect(pipe.x - 2, pipe.topHeight - capH, pipe.width + 4, capH);
            ctx.fillRect(pipe.x - 2, pipe.bottomY, pipe.width + 4, capH);
        });
    }

    function updatePipes() {
        if (frames % PIPE_SPAWN_RATE === 0) createPipe();

        for (let i = 0; i < pipes.length; i++) {
            let p = pipes[i];
            p.x -= PIPE_SPEED;

            // --- Collision detection fix ---
            // Use the bounding box of the bird's circle of radius 'bird.radius'
            // Pipes are rectangles: x, x+w, y_top, y_bottom
            let birdLeft = bird.x - bird.radius;
            let birdRight = bird.x + bird.radius;
            let birdTop = bird.y - bird.radius;
            let birdBottom = bird.y + bird.radius;

            // Horizontal overlap
            if (birdRight > p.x && birdLeft < p.x + p.width) {
                // Vertical overlap (hits the top pipe or the bottom pipe)
                if (birdTop < p.topHeight || birdBottom > p.bottomY) {
                    gameOver();
                    return;
                }
            }

            // --- Scoring fix ---
            // The pipe is off the left of the screen and has not been scored yet
            if (p.x + p.width < 0 && !p.passed) {
                p.passed = true;
                score++;
                ui.scoreDisplay.innerText = score;
            }

            if (p.x < -PIPE_WIDTH) {
                pipes.shift();
                i--;
            }
        }
    }

    function checkCollisions() {
        // Floor
        if (bird.y + bird.radius >= canvas.height) {
            gameOver();
            return;
        }
        // Ceiling
        if (bird.y - bird.radius <= 0) {
            bird.y = bird.radius;
            bird.velocity = 0;
        }
    }

    function drawBackground() {
        // Clear the canvas with the sky colour
        ctx.fillStyle = '#87CEEB';
        ctx.fillRect(0, 0, canvas.width, canvas.height);

        // Floor
        ctx.fillStyle = '#654321';
        ctx.fillRect(0, canvas.height - 10, canvas.width, 10);

        // Clouds
        const cloud = 'rgba(255, 255, 255, 0.6)';
        for (let i = 0; i < 4; i++) {
            let x = (frames * 0.5 + i * 150) % (canvas.width + 100) - 50;
            let y = (i * 40) + 20;
            let scale = 1 + (Math.sin(frames * 0.02 + i) * 0.1);
            let size = 30 * scale;
            fillCircle(x, y, size, cloud);
            fillCircle(x + 20 * scale, y - 10 * scale, size * 1.2, cloud);
            fillCircle(x + 40 * scale, y, size, cloud);
        }
    }

    function update() {
        bird.update();
        updatePipes();
        if (!isPlaying) return; // a pipe collision already ended the game
        checkCollisions();
        frames++;
    }

    function draw() {
        drawBackground();
        drawPipes();
        bird.draw();
    }

    function loop() {
        if (!isPlaying) return;
        update();
        draw();
        if (isPlaying) {
            gameLoopId = requestAnimationFrame(loop);
        }
    }

    function startGame() {
        isPlaying = true;
        isGameOver = false;

        // UI
        ui.startScreen.style.display = 'none';
        ui.gameOverScreen.style.display = 'none';
        ui.scoreBoard.style.display = 'none';

        // Logic
        bird.reset();
        pipes = [];
        frames = 0;
        score = 0;
        ui.scoreDisplay.innerText = '0';

        cancelAnimationFrame(gameLoopId); // avoid running two loops at once
        loop();
    }

    function gameOver() {
        isPlaying = false;
        isGameOver = true;
        cancelAnimationFrame(gameLoopId);

        ui.finalScore.innerText = score;
        ui.gameOverScreen.style.display = 'flex';
        ui.scoreBoard.style.display = 'block';
    }

    // --- Input handling ---
    function handleInput(e) {
        if (e.type === 'keydown' && e.code === 'Space') e.preventDefault();

        if (isPlaying) {
            bird.jump();
        } else if (!isGameOver) {
            // Click on the start screen (or any click before the game has started)
            startGame();
        }
    }

    // Keyboard
    window.addEventListener('keydown', (e) => {
        if (e.code === 'Space') handleInput(e);
    });

    // Mouse / touch
    window.addEventListener('mousedown', handleInput);
    window.addEventListener('touchstart', handleInput, {passive: false});

    // UI interaction
    ui.restartBtn.addEventListener('click', (e) => {
        e.stopPropagation();
        startGame();
    });

    // Allow clicking the game-over overlay to restart
    ui.gameOverScreen.addEventListener('mousedown', (e) => {
        if (e.target === ui.gameOverScreen) startGame();
    });
    ui.gameOverScreen.addEventListener('touchstart', (e) => {
        if (e.target === ui.gameOverScreen) {
            e.preventDefault();
            startGame();
        }
    });

    // Initial draw
    drawBackground();
    bird.draw();
</script>
</body>
</html>

We also took some screenshots (it works at 4-bit):

🦥 Fine-tuning GLM-4.7-Flash

Unsloth now supports fine-tuning GLM-4.7-Flash, but you need transformers v5 to use it. The 30B model does not fit on a free Colab GPU, but you can still use our notebook. 16-bit LoRA fine-tuning of GLM-4.7-Flash uses about 60GB of VRAM:

GLM-4.7-Flash SFT LoRA notebook

With an A100 40GB VRAM GPU you may occasionally run out of memory. For smoother runs, use an H100 or A100 with 80GB of VRAM.

When fine-tuning MoEs, it is probably not a good idea to fine-tune the router layer, so we disable it by default. If you want to preserve reasoning capabilities (optional), you can mix direct answers with chain-of-thought examples. Use at least 75% reasoning and 25% non-reasoning examples in your dataset so the model retains its reasoning capabilities.
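The 75% / 25% mix described above can be sketched with the standard library. This is a minimal illustration, not Unsloth's data pipeline: `mix_examples`, the `kind` field, and the toy records are assumptions made for the example.

```python
import random


def mix_examples(reasoning, non_reasoning, total, reasoning_fraction=0.75, seed=3407):
    """Sample a shuffled dataset with the requested share of reasoning examples."""
    rng = random.Random(seed)
    n_reason = round(total * reasoning_fraction)
    n_direct = total - n_reason
    if n_reason > len(reasoning) or n_direct > len(non_reasoning):
        raise ValueError("not enough examples for the requested mix")
    mixed = rng.sample(reasoning, n_reason) + rng.sample(non_reasoning, n_direct)
    rng.shuffle(mixed)
    return mixed


# Toy records standing in for chain-of-thought and direct-answer examples.
reasoning = [{"kind": "cot", "id": i} for i in range(100)]
direct = [{"kind": "direct", "id": i} for i in range(100)]
mixed = mix_examples(reasoning, direct, total=40)
print(sum(ex["kind"] == "cot" for ex in mixed), "reasoning examples out of", len(mixed))
# 30 reasoning examples out of 40
```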

🦙 Llama-server serving & deployment

To deploy GLM-4.7-Flash for production, we use llama-server. In a new terminal (for example via tmux), deploy the model with:

./llama.cpp/llama-server \
    --model unsloth/GLM-4.7-Flash-GGUF/GLM-4.7-Flash-UD-Q4_K_XL.gguf \
    --alias "unsloth/GLM-4.7-Flash" \
    --seed 3407 \
    --temp 1.0 \
    --top-p 0.95 \
    --min-p 0.01 \
    --port 8001

Then, in a new terminal, after running pip install openai, run:

from openai import OpenAI
import json
openai_client = OpenAI(
    base_url = "http://127.0.0.1:8001/v1",
    api_key = "sk-no-key-required",
)
completion = openai_client.chat.completions.create(
    model = "unsloth/GLM-4.7-Flash",
    messages = [{"role": "user", "content": "What is 2+2?"},],
)
print(completion.choices[0].message.content)

This prints:

The user asks a simple question: "What is 2+2?" The answer is 4. Provide the answer.

2 + 2 = 4.

💻 GLM-4.7-Flash in vLLM

You can now use our new FP8 Dynamic quant of the model for premium, fast inference. First install vLLM from nightly:

uv pip install --upgrade --force-reinstall vllm --torch-backend=auto --extra-index-url https://wheels.vllm.ai/nightly/cu130
uv pip install --upgrade --force-reinstall git+https://github.com/huggingface/transformers.git
uv pip install --force-reinstall numba

Then serve Unsloth's dynamic FP8 version of the model. We enable FP8 to reduce KV cache memory usage by 50%, and the command assumes 4 GPUs. If you have 1 GPU, use CUDA_VISIBLE_DEVICES='0' and set --tensor-parallel-size 1, or remove that argument. To disable FP8, remove --quantization fp8 --kv-cache-dtype fp8 (the command below only sets --kv-cache-dtype fp8).

export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:False
CUDA_VISIBLE_DEVICES='0,1,2,3' vllm serve unsloth/GLM-4.7-Flash-FP8-Dynamic \
    --served-model-name unsloth/GLM-4.7-Flash \
    --tensor-parallel-size 4 \
    --tool-call-parser glm47 \
    --reasoning-parser glm45 \
    --enable-auto-tool-choice \
    --dtype bfloat16 \
    --seed 3407 \
    --max-model-len 200000 \
    --gpu-memory-utilization 0.95 \
    --max_num_batched_tokens 16384 \
    --port 8001 \
    --kv-cache-dtype fp8
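The 50% KV cache saving follows from storing each cached value in 1 byte (FP8) instead of 2 bytes (bfloat16). The sketch below uses the textbook formula for a standard attention layout with made-up dimensions; the layer count, KV head count, and head size are hypothetical and are not GLM-4.7-Flash's actual configuration, which may store its cache differently.

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, context_len, bytes_per_value):
    """Keys + values for one sequence, in GB (1 GB = 1e9 bytes)."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value / 1e9


# Hypothetical dimensions, chosen only to show the ratio.
dims = dict(n_layers=48, n_kv_heads=4, head_dim=128, context_len=200_000)
bf16 = kv_cache_gb(**dims, bytes_per_value=2)
fp8 = kv_cache_gb(**dims, bytes_per_value=1)
print(f"bf16: {bf16:.2f} GB, fp8: {fp8:.2f} GB, ratio: {fp8 / bf16:.2f}")
```

Whatever the real dimensions are, the ratio stays at 0.5 because only the bytes per value change.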

You can then call the served model through the OpenAI API:

from openai import AsyncOpenAI, OpenAI
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8001/v1"
client = OpenAI( # or AsyncOpenAI
    api_key=openai_api_key,
    base_url=openai_api_base,
)

⭐ vLLM GLM-4.7-Flash Speculative Decoding

We found that using the MTP (multi-token prediction) module of GLM 4.7 Flash drops generation throughput from 13,000 tokens/s to 1,300 tokens/s on 1x B200 (10x slower). It should be fine on Hopper.

    --speculative-config.method mtp \
    --speculative-config.num_speculative_tokens 1

With MTP: only 1,300 tokens/s throughput on 1x B200 (130 tokens/s decoding per user).

Without MTP: 13,000 tokens/s throughput on 1x B200 (still 130 tokens/s decoding per user).
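The numbers above can be read with some simple arithmetic. The slowdown factor is a direct ratio; the concurrency figure is only an interpretation, assuming aggregate throughput is roughly the per-user decoding speed multiplied by the number of requests served at once (the source does not state the request count).

```python
def implied_concurrency(aggregate_tok_s: float, per_user_tok_s: float) -> float:
    """Concurrent streams implied by aggregate vs. per-user throughput."""
    return aggregate_tok_s / per_user_tok_s


without_mtp = 13_000  # tokens/s on 1x B200, from the text above
with_mtp = 1_300      # tokens/s on 1x B200 with MTP enabled
per_user = 130        # tokens/s decoding per user in both cases

print(without_mtp / with_mtp)                      # 10.0 (slowdown factor)
print(implied_concurrency(without_mtp, per_user))  # 100.0
print(implied_concurrency(with_mtp, per_user))     # 10.0
```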

🔨 Tool calling with GLM-4.7-Flash

See the Tool Calling Guide for more details. In a new terminal (if you are using tmux, detach with CTRL+B+D), we create some tools, such as adding two numbers, executing Python code, and running Linux commands:

import json, subprocess, random
from typing import Any
def add_number(a: float | str, b: float | str) -> float:
    return float(a) + float(b)
def multiply_number(a: float | str, b: float | str) -> float:
    return float(a) * float(b)
def substract_number(a: float | str, b: float | str) -> float:
    return float(a) - float(b)
def write_a_story() -> str:
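    return random.choice([
        "A long time ago in a galaxy far far away...",
        "There were 2 friends who loved sloths and code...",
        "The world was ending because every sloth evolved to have superhuman intelligence...",
        "Unbeknownst to one friend, the other accidentally coded a program to evolve sloths...",
    ])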
def terminal(command: str) -> str:
    if "rm" in command or "sudo" in command or "dd" in command or "chmod" in command:
        msg = "Cannot execute 'rm, sudo, dd, chmod' commands since they are dangerous"
        print(msg); return msg
    print(f"Executing terminal command `{command}`")
    try:
        return str(subprocess.run(command, capture_output = True, text = True, shell = True, check = True).stdout)
    except subprocess.CalledProcessError as e:
        return f"Command failed: {e.stderr}"
def python(code: str) -> str:
    data = {}
    exec(code, data)
    del data["__builtins__"]
    return str(data)
MAP_FN = {
    "add_number": add_number,
    "multiply_number": multiply_number,
    "substract_number": substract_number,
    "write_a_story": write_a_story,
    "terminal": terminal,
    "python": python,
}
tools = [
    {
        "type": "function",
        "function": {
            "name": "add_number",
            "description": "Add two numbers.",
            "parameters": {
                "type": "object",
                "properties": {
                    "a": {
                        "type": "string",
                        "description": "The first number.",
                    },
                    "b": {
                        "type": "string",
                        "description": "The second number.",
                    },
                },
                "required": ["a", "b"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "multiply_number",
            "description": "Multiply two numbers.",
            "parameters": {
                "type": "object",
                "properties": {
                    "a": {
                        "type": "string",
                        "description": "The first number.",
                    },
                    "b": {
                        "type": "string",
                        "description": "The second number.",
                    },
                },
                "required": ["a", "b"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "substract_number",
            "description": "Subtract two numbers.",
            "parameters": {
                "type": "object",
                "properties": {
                    "a": {
                        "type": "string",
                        "description": "The first number.",
                    },
                    "b": {
                        "type": "string",
                        "description": "The second number.",
                    },
                },
                "required": ["a", "b"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "write_a_story",
            "description": "Writes a random story.",
            "parameters": {
                "type": "object",
                "properties": {},
                "required": [],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "terminal",
            "description": "Perform operations from the terminal.",
            "parameters": {
                "type": "object",
                "properties": {
                    "command": {
                        "type": "string",
                        "description": "The command you wish to launch, e.g. `ls`, `rm`, ...",
                    },
                },
                "required": ["command"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "python",
            "description": "Call a Python interpreter with some Python code that will be run.",
            "parameters": {
                "type": "object",
                "properties": {
                    "code": {
                        "type": "string",
                        "description": "The Python code to run",
                    },
                },
                "required": ["code"],
            },
        },
    },
]
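The substring check in terminal() above is deliberately crude: "rm" in command also matches harmless commands such as ls /tmp/format. A token-based check is one possible refinement. The sketch below is an illustration only (`is_blocked` and `BLOCKED` are hypothetical names), and it is still not a sandbox: with shell = True, shell metacharacters can get around any filter like this.

```python
import shlex

BLOCKED = {"rm", "sudo", "dd", "chmod"}  # same commands the terminal() tool refuses


def is_blocked(command: str) -> bool:
    """Return True if any whitespace-separated token names a blocked command."""
    try:
        tokens = shlex.split(command)
    except ValueError:
        return True  # unbalanced quotes: refuse rather than guess
    # Compare the last path component so /bin/rm is caught as well.
    return any(tok.rsplit("/", 1)[-1] in BLOCKED for tok in tokens)


print(is_blocked("ls /tmp/format"))  # False (the substring check would refuse this)
print(is_blocked("/bin/rm -rf x"))   # True
```

For anything beyond a demo, running tools inside a container or restricted user account is a safer design than filtering command strings.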

We then use the function below (copy, paste, and run it), which parses the function calls automatically and calls the OpenAI endpoint for the model:

from openai import OpenAI
def unsloth_inference(
    messages,
    temperature = 0.7,
    top_p = 1.0,
    top_k = -1,
    min_p = 0.01,  # used in extra_body below and passed by the example calls
    repetition_penalty = 0.0,
):
    messages = messages.copy()
    openai_client = OpenAI(
        base_url = "http://127.0.0.1:8001/v1",
        api_key = "sk-no-key-required",
    )
    model_name = next(iter(openai_client.models.list())).id
    print(f"Using model = {model_name}")
    has_tool_calls = True
    original_messages_len = len(messages)
    while has_tool_calls:
        print(f"Current messages = {messages}")
        response = openai_client.chat.completions.create(
            model = model_name,
            messages = messages,
            temperature = temperature,
            top_p = top_p,
            tools = tools if tools else None,
            tool_choice = "auto" if tools else None,
            extra_body = {"top_k": top_k, "min_p": min_p, "dry_multiplier" :repetition_penalty,}
        )
        tool_calls = response.choices[0].message.tool_calls or []
        content = response.choices[0].message.content or ""
        tool_calls_dict = [tc.to_dict() for tc in tool_calls] if tool_calls else tool_calls
        messages.append({"role": "assistant", "tool_calls": tool_calls_dict, "content": content,})
        if not tool_calls:
            # No tool calls left: the model has produced its final answer
            has_tool_calls = False
        for tool_call in tool_calls:
            fx, args, _id = tool_call.function.name, tool_call.function.arguments, tool_call.id
            out = MAP_FN[fx](**json.loads(args))
            messages.append({"role": "tool", "tool_call_id": _id, "name": fx, "content": str(out),})
    return messages
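The dispatch step inside that loop (parse the JSON arguments, look up the function, wrap the result as a tool message) can be tried without a running server. This is a self-contained sketch with a stand-in tool table; `run_tool_call` and `TOOLS` are illustrative names, and the error handling is an addition that the function above does not have.

```python
import json


def add_number(a, b):
    return float(a) + float(b)


TOOLS = {"add_number": add_number}  # stand-in for MAP_FN above


def run_tool_call(name: str, arguments_json: str) -> dict:
    """Parse JSON arguments, call the matching function, and build a tool message."""
    if name not in TOOLS:
        result = f"Unknown tool: {name}"
    else:
        try:
            result = TOOLS[name](**json.loads(arguments_json))
        except (json.JSONDecodeError, TypeError, ValueError) as e:
            result = f"Tool call failed: {e}"
    return {"role": "tool", "name": name, "content": str(result)}


print(run_tool_call("add_number", '{"a": "2", "b": 3}'))
# {'role': 'tool', 'name': 'add_number', 'content': '5.0'}
```

Returning the error text as the tool result, instead of raising, lets the model see what went wrong and retry with corrected arguments.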

After launching GLM-4.7-Flash via llama-server as shown in the llama-server section (or see the Tool Calling Guide for more details), we can make some tool calls:

Tool call for mathematical operations with GLM 4.7

messages = [{
    "role": "user",
    "content": [{"type": "text", "text": "What is today's date plus 3 days?"}],
}]
unsloth_inference(messages, temperature = 1.0, top_p = 0.95, top_k = -1, min_p = 0.01)

Tool call to execute generated Python code with GLM-4.7-Flash

messages = [{
    "role": "user",
    "content": [{"type": "text", "text": "Create a Fibonacci function in Python and find fib(20)."}],
}]
unsloth_inference(messages, temperature = 1.0, top_p = 0.95, top_k = -1, min_p = 0.01)

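Benchmarks

GLM-4.7-Flash is the best-performing 30B model on every benchmark below except AIME 25 and LCB v6.

| Benchmark | GLM-4.7-Flash | Qwen3-30B-A3B-Thinking-2507 | GPT-OSS-20B |
| --- | --- | --- | --- |
| AIME 25 | 91.6 | 85.0 | 91.7 |
| GPQA | 75.2 | 73.4 | 71.5 |
| LCB v6 | 64.0 | 66.0 | 61.0 |
| HLE | 14.4 | 9.8 | 10.9 |
| SWE-bench Verified | 59.2 | 22.0 | 34.0 |
| τ²-Bench | 79.5 | 49.0 | 47.7 |
| BrowseComp | 42.8 | 2.29 | 28.3 |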
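The per-benchmark leader can be checked mechanically from the scores listed above. This is a small sketch using only those numbers; the names `SCORES`, `MODELS`, and `leader` are illustrative.

```python
MODELS = ("GLM-4.7-Flash", "Qwen3-30B-A3B-Thinking-2507", "GPT-OSS-20B")

# Scores copied from the benchmark figures above, in the same model order.
SCORES = {
    "AIME 25": (91.6, 85.0, 91.7),
    "GPQA": (75.2, 73.4, 71.5),
    "LCB v6": (64.0, 66.0, 61.0),
    "HLE": (14.4, 9.8, 10.9),
    "SWE-bench Verified": (59.2, 22.0, 34.0),
    "τ²-Bench": (79.5, 49.0, 47.7),
    "BrowseComp": (42.8, 2.29, 28.3),
}


def leader(row):
    """Name of the model with the highest score in a row."""
    return MODELS[max(range(len(row)), key=lambda i: row[i])]


leaders = {bench: leader(row) for bench, row in SCORES.items()}
not_glm = [bench for bench, model in leaders.items() if model != "GLM-4.7-Flash"]
print(not_glm)  # ['AIME 25', 'LCB v6']
```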
