GLM-4.7-Flash: How to Run Locally

Run and fine-tune GLM-4.7-Flash locally on your device!

GLM-4.7-Flash is Z.ai's new 30B MoE reasoning model for local deployment, delivering top-tier performance for coding, agentic workflows, and chat. It uses ~3.6B active parameters, supports 200K context, and leads on SWE-Bench, GPQA, and reasoning/chat benchmarks.

GLM-4.7-Flash runs on 24GB RAM/VRAM/unified memory (32GB for full precision), and you can now fine-tune it with Unsloth. To run GLM-4.7-Flash with vLLM, see GLM-4.7-Flash in vLLM.

Update, Jan 21: llama.cpp fixed a bug where scoring_func was incorrectly set to "softmax" (it should be "sigmoid"). This caused looping and poor outputs. We have updated the GGUFs. Please re-download the model for much better results.

You can now use Z.ai's recommended parameters and get great results:

For general use cases: --temp 1.0 --top-p 0.95
For tool calling: --temp 0.7 --top-p 1.0
Repeat penalty: disable it, or set --repeat-penalty 1.0

Jan 22: Faster inference is available now that the Flash Attention (FA) fix for CUDA has been merged.


GLM-4.7-Flash GGUF to run: unsloth/GLM-4.7-Flash-GGUF

⚙️ Usage Guide

For best performance, make sure your total available memory (VRAM + system RAM) is larger than the size of the quantized model file you download. If it is not, llama.cpp can still run via SSD/HDD offloading, but inference will be slower.
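
As a rough illustration of the rule above, the following sketch compares a model file's size with the combined memory budget. The function names and the example byte counts are hypothetical, and the check ignores KV cache and runtime overhead, so passing it is a necessary condition rather than a guarantee.

```python
import os

GIB = 1024 ** 3

def model_file_bytes(path: str) -> int:
    """Size of a local GGUF file in bytes."""
    return os.path.getsize(path)

def fits_in_memory(model_bytes: int, vram_bytes: int, ram_bytes: int) -> bool:
    """True if the model file is smaller than VRAM + system RAM combined."""
    return model_bytes < vram_bytes + ram_bytes

# Hypothetical machines and an 18 GiB model file.
print(fits_in_memory(18 * GIB, 8 * GIB, 16 * GIB))  # True
print(fits_in_memory(18 * GIB, 8 * GIB, 8 * GIB))   # False
```

When the check fails, the offloading path described above still applies, only with slower inference.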

After consulting the Z.ai team, these are the GLM-4.7 sampling parameters they recommend:

| Setting | Default settings (most tasks) | Terminal Bench, SWE Bench Verified |
| --- | --- | --- |
| temperature | 1.0 | 0.7 |
| top_p | 0.95 | 1.0 |
| repeat penalty | disabled or 1.0 | disabled or 1.0 |

For general use cases: --temp 1.0 --top-p 0.95
For tool calling: --temp 0.7 --top-p 1.0
If you use llama.cpp, set --min-p 0.01, since llama.cpp defaults to 0.05
You may need to experiment to find which values work best for your use case.
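
The recommendations above can be captured in a small helper that emits llama.cpp flags. This is only a sketch: the preset labels and the function name are made up for illustration, while the flag names and values are the ones listed in this guide.

```python
# Values copied from the recommendations above; the keys are made-up labels.
SAMPLING_PRESETS = {
    "general": {"temp": 1.0, "top_p": 0.95},
    "tool_calling": {"temp": 0.7, "top_p": 1.0},
}

def llama_cpp_sampling_flags(use_case: str, min_p: float = 0.01) -> list[str]:
    """Return llama.cpp sampling flags for a preset, with repeat penalty neutralized."""
    preset = SAMPLING_PRESETS[use_case]
    return [
        "--temp", str(preset["temp"]),
        "--top-p", str(preset["top_p"]),
        "--min-p", str(min_p),
        "--repeat-penalty", "1.0",
    ]

print(" ".join(llama_cpp_sampling_flags("general")))
# --temp 1.0 --top-p 0.95 --min-p 0.01 --repeat-penalty 1.0
```

The returned list can be appended to a llama-cli or llama-server argument list, for example when launching the binary with subprocess.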

For now, we do not recommend running this GGUF with Ollama because of possible chat template compatibility issues. The GGUF works well on llama.cpp (or backends built on it, e.g. LM Studio, Jan).

Remember to disable the repeat penalty, or set --repeat-penalty 1.0

Maximum context window: 202,752

🖥️ Run GLM-4.7-Flash

You will need different settings depending on your use case. Some GGUFs end up similar in size because the model architecture (like gpt-oss) has dimensions that are not divisible by 128, so parts of the model cannot be quantized to lower bit widths.

Because this guide uses 4-bit, you will need about 18GB RAM/unified memory. We recommend at least 4-bit precision for best performance.
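
The memory figure above follows from simple arithmetic on the parameter count. The sketch below is a back-of-the-envelope estimate only: the 4.8 bits per weight is an assumed effective average for a mixed 4-bit quant, chosen here because it reproduces the ~18GB figure, and it is not a published property of any specific GGUF.

```python
def estimate_model_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (10^9 bytes): parameters * bits / 8."""
    return params_billions * bits_per_weight / 8

# Assumed effective bit width for a mixed 4-bit quant (illustrative only).
print(round(estimate_model_gb(30, 4.8), 1))  # 18.0
# Plain 16-bit weights for comparison.
print(round(estimate_model_gb(30, 16), 1))   # 60.0
```

The estimate covers weights only; the context window adds KV cache memory on top.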

Remember to disable the repeat penalty, or set --repeat-penalty 1.0

Llama.cpp Tutorial (GGUF):

Instructions for running in llama.cpp (note: we use 4-bit so the model fits on most devices):

Get the latest llama.cpp from GitHub (https://github.com/ggml-org/llama.cpp). You can also follow the build instructions below. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you do not have a GPU or just want CPU inference. For Apple Mac / Metal devices, set -DGGML_CUDA=OFF and continue as usual. Metal support is enabled by default.

apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-mtmd-cli llama-server llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp

You can pull the model directly from Hugging Face. You can increase the context up to 200K, as far as your RAM/VRAM allows.

You can also try Z.ai's recommended GLM-4.7 sampling parameters:

For general use cases: --temp 1.0 --top-p 0.95
For tool calling: --temp 0.7 --top-p 1.0
Remember to disable the repeat penalty!

Use this for general instruction use cases:

./llama.cpp/llama-cli \
    -hf unsloth/GLM-4.7-Flash-GGUF:UD-Q4_K_XL \
    --ctx-size 16384 \
    --temp 1.0 --top-p 0.95 --min-p 0.01

Use this for tool-calling use cases:

./llama.cpp/llama-cli \
    -hf unsloth/GLM-4.7-Flash-GGUF:UD-Q4_K_XL \
    --ctx-size 16384 \
    --temp 0.7 --top-p 1.0 --min-p 0.01

Download the model with the command below (after installing huggingface_hub via pip install huggingface_hub). You can choose UD-Q4_K_XL or another quantized version. If downloads get stuck, see Hugging Face Hub, XET debugging.

pip install -U huggingface_hub
hf download unsloth/GLM-4.7-Flash-GGUF \
    --local-dir unsloth/GLM-4.7-Flash-GGUF \
    --include "*UD-Q4_K_XL*"
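
If you prefer to download from Python, the same filter can be expressed with huggingface_hub's snapshot_download. This is a minimal sketch; the helper names are made up here, and the quant pattern is a parameter so that it can match whichever file you intend to run.

```python
def gguf_download_kwargs(quant: str = "UD-Q4_K_XL") -> dict:
    """Build keyword arguments for huggingface_hub.snapshot_download."""
    return {
        "repo_id": "unsloth/GLM-4.7-Flash-GGUF",
        "local_dir": "unsloth/GLM-4.7-Flash-GGUF",
        "allow_patterns": [f"*{quant}*"],  # only fetch files for this quant
    }

def download_gguf(quant: str = "UD-Q4_K_XL") -> str:
    """Download the matching GGUF files and return the local folder path."""
    # Imported lazily so the helper above works without the package installed.
    from huggingface_hub import snapshot_download
    return snapshot_download(**gguf_download_kwargs(quant))

# download_gguf()  # uncomment to start the download
```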

Then run the model in conversation mode:

./llama.cpp/llama-cli \
    --model unsloth/GLM-4.7-Flash-GGUF/GLM-4.7-Flash-UD-Q4_K_XL.gguf \
    --ctx-size 16384 \
    --seed 3407 \
    --temp 1.0 \
    --top-p 0.95 \
    --min-p 0.01

Also adjust the context window as needed, up to 202752.

➿ Reducing repetition and looping

UPDATE, JAN 21: llama.cpp fixed a bug where "scoring_func": "softmax" was set incorrectly, which caused looping and poor outputs (it should be sigmoid). We have updated the GGUFs. Please re-download the model for much better results.

This means you can now use Z.ai's recommended parameters and get great results:

For general use cases: --temp 1.0 --top-p 0.95
For tool calling: --temp 0.7 --top-p 1.0
If you use llama.cpp, set --min-p 0.01, since llama.cpp defaults to 0.05
Remember to disable the repeat penalty, or set --repeat-penalty 1.0

We added "scoring_func": "sigmoid" to config.json for the main model.
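
To see why the choice of scoring function matters, here is a simplified, generic sketch of MoE routing with the two options. It is not llama.cpp's or Z.ai's implementation (real routers may add bias terms, expert grouping, or scaling), and the logits are made-up values. Both functions are monotonic, so the same experts are picked, but the mixing weights differ.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(logits):
    return [1.0 / (1.0 + math.exp(-x)) for x in logits]

def route(logits, k, scoring_func):
    """Pick the top-k experts and renormalize their scores into mixing weights."""
    scores = sigmoid(logits) if scoring_func == "sigmoid" else softmax(logits)
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    norm = sum(scores[i] for i in top)
    return {i: scores[i] / norm for i in top}

router_logits = [2.0, 1.0, -1.0, 0.5]  # made-up values for four experts
print(route(router_logits, 2, "softmax"))
print(route(router_logits, 2, "sigmoid"))
```

Because the expert outputs are blended with these weights, using the wrong function changes every MoE layer's output even though the selected experts are identical.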

🐦 Flappy Bird example with UD-Q4_K_XL

As an example, we had the following long conversation using UD-Q4_K_XL via ./llama.cpp/llama-cli --model unsloth/GLM-4.7-Flash-GGUF/GLM-4.7-Flash-UD-Q4_K_XL.gguf --fit on --temp 1.0 --top-p 0.95 --min-p 0.01 :

Hi
What is 2+2
Create a Python Flappy Bird game
Create a totally different game in Rust
Find bugs in both
Make the first game mentioned, but as a standalone HTML file
Find bugs and show the fixed game

which produced the following Flappy Bird game in HTML form:

Flappy Bird game in HTML

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no">
    <title>Flappy Bird Fixed</title>
    <style>
        body {
            margin: 0;
            display: flex;
            justify-content: center;
            align-items: center;
            height: 100vh;
            background-color: #222;
            font-family: 'Arial', sans-serif;
            overflow: hidden;
            user-select: none;
            -webkit-user-select: none;
            touch-action: none; /* Prevents zooming on mobile devices */
        }

        #game-container {
            position: relative;
            box-shadow: 0 0 20px rgba(0,0,0,0.5);
        }

        canvas {
            background-color: #87CEEB;
            display: block;
            border-radius: 4px;
        }

        /* UI Overlays */
        #ui-layer {
            position: absolute;
            top: 0;
            left: 0;
            width: 100%;
            height: 100%;
            pointer-events: none;
            display: flex;
            flex-direction: column;
            justify-content: center;
            align-items: center;
            text-align: center;
        }

        #score-display {
            position: absolute;
            top: 40px;
            left: 50%;
            transform: translateX(-50%);
            font-size: 48px;
            font-weight: bold;
            color: white;
            text-shadow: 3px 3px 0 #000;
            z-index: 10;
            font-family: 'Courier New', Courier, monospace;
        }

        #start-screen, #game-over-screen {
            background: rgba(0, 0, 0, 0.7);
            width: 100%;
            height: 100%;
            display: flex;
            flex-direction: column;
            justify-content: center;
            align-items: center;
            color: white;
            pointer-events: auto; /* Allow clicks */
            cursor: pointer;
        }

        h1 { margin: 0 0 10px 0; font-size: 60px; text-shadow: 4px 4px 0 #000; line-height: 1; }
        p { font-size: 22px; margin: 10px 0; color: #ddd; }
        
        .btn {
            background: linear-gradient(to bottom, #ffeb3b, #fbc02d);
            border: 3px solid #fff;
            color: #333;
            padding: 15px 40px;
            font-size: 28px;
            font-weight: bold;
            cursor: pointer;
            border-radius: 8px;
            box-shadow: 0 6px 0 #c49000, 0 10px 10px rgba(0,0,0,0.3);
            text-transform: uppercase;
            transition: all 0.1s;
            margin-top: 10px;
        }

        .btn:active {
            transform: translateY(4px);
            box-shadow: 0 2px 0 #c49000, 0 4px 4px rgba(0,0,0,0.3);
        }

        .score-board {
            background: #ded895;
            border: 2px solid #543847;
            padding: 20px 40px;
            border-radius: 10px;
            box-shadow: 4px 4px 0 #543847;
            margin-bottom: 30px;
            display: none;
            border: 4px solid #543847;
        }
        
        .score-board h2 { margin: 0 0 5px 0; color: #e86101; font-size: 40px; }
        .score-board span { font-size: 20px; color: #543847; display: block; text-align: center; }

    </style>
</head>
<body>

    <div id="game-container">
        <canvas id="gameCanvas" width="400" height="600"></canvas>
        
        <div id="score-display">0</div>

        <div id="ui-layer">
            <div id="start-screen">
                <h1>FLAPPY<br>BIRD</h1>
                <p>Tap or press Space to start</p>
                <button class="btn" style="display:none;" id="touch-instruction">Click to start</button>
            </div>

            <div id="game-over-screen">
                <h1>GAME OVER</h1>
                <div class="score-board" id="score-board">
                    <h2>Score: <span id="final-score">0</span></h2>
                </div>
                <button class="btn" id="restart-btn">Try again</button>
            </div>
        </div>
    </div>

<script>
    const canvas = document.getElementById('gameCanvas');
    const ctx = canvas.getContext('2d');

    // --- Constants ---
    const GRAVITY = 0.35; // Slightly stronger gravity for a better game feel
    const JUMP_STRENGTH = -6.5;
    const PIPE_GAP = 180;
    const PIPE_WIDTH = 60;
    const PIPE_SPEED = 2.5;
    const PIPE_SPAWN_RATE = 100;

    // --- State ---
    let frames = 0;
    let score = 0;
    let isGameOver = false;
    let isPlaying = false;
    let gameLoopId;

    const ui = {
        startScreen: document.getElementById('start-screen'),
        gameOverScreen: document.getElementById('game-over-screen'),
        scoreDisplay: document.getElementById('score-display'),
        scoreBoard: document.getElementById('score-board'),
        finalScore: document.getElementById('final-score'),
        restartBtn: document.getElementById('restart-btn')
    };

    const bird = {
        x: 80,
        y: 150,
        radius: 12, // Fixed radius
        velocity: 0,
        
        draw: function() {
            // Rotate the bird based on its velocity for visual effect
            let angle = Math.min(Math.PI / 4, Math.max(-Math.PI / 4, (this.velocity * 0.1)));
            
            ctx.save();
            ctx.translate(this.x, this.y);
            ctx.rotate(angle);
            
            // Draw body
            ctx.fillStyle = '#FFD700';
            ctx.beginPath();
            ctx.arc(0, 0, this.radius, 0, Math.PI * 2);
            ctx.fill();
            
            // Eye
            ctx.fillStyle = 'white';
            ctx.beginPath();
            ctx.arc(4, -4, 4, 0, Math.PI * 2);
            ctx.fill();
            ctx.fillStyle = 'black';
            ctx.beginPath();
            ctx.arc(6, -4, 2, 0, Math.PI * 2);
            ctx.fill();
            
            // Wing
            ctx.fillStyle = '#FFA500';
            ctx.beginPath();
            ctx.arc(-4, 4, 5, 0, Math.PI * 2);
            ctx.fill();

            ctx.restore();
        },

        update: function() {
            this.velocity += GRAVITY;
            this.y += this.velocity;
        },

        jump: function() {
            this.velocity = JUMP_STRENGTH;
        },

        reset: function() {
            this.y = 150;
            this.velocity = 0;
        }
    };

    let pipes = [];

    function createPipe() {
        const minHeight = 50;
        const maxPos = canvas.height - PIPE_GAP - minHeight;
        const topHeight = Math.floor(Math.random() * (maxPos - minHeight + 1)) + minHeight;
        
        pipes.push({
            x: canvas.width,
            topHeight: topHeight,
            bottomY: topHeight + PIPE_GAP,
            width: PIPE_WIDTH,
            passed: false
        });
    }

    function drawPipes() {
        ctx.fillStyle = '#2ecc71';
        ctx.strokeStyle = '#27ae60';
        ctx.lineWidth = 2;
        
        pipes.forEach(pipe => {
            // Top pipe
            ctx.fillRect(pipe.x, 0, pipe.width, pipe.topHeight);
            ctx.strokeRect(pipe.x, 0, pipe.width, pipe.topHeight);
            
            // Bottom pipe
            ctx.fillRect(pipe.x, pipe.bottomY, pipe.width, canvas.height - pipe.bottomY);
            ctx.strokeRect(pipe.x, pipe.bottomY, pipe.width, canvas.height - pipe.bottomY);

            // Cap
            const capH = 20;
            ctx.fillStyle = '#27ae60'; 
            ctx.fillRect(pipe.x - 2, pipe.topHeight - capH, pipe.width + 4, capH);
            ctx.fillRect(pipe.x - 2, pipe.bottomY, pipe.width + 4, capH);
        });
    }

    function updatePipes() {
        if (frames % PIPE_SPAWN_RATE === 0) createPipe();

        for (let i = 0; i < pipes.length; i++) {
            let p = pipes[i];
            p.x -= PIPE_SPEED;

            // --- FIXED COLLISION DETECTION ---
            // Treat the bird as a circle with radius 'bird.radius'
            // The pipe is a rectangle: x, x+w, y_top, y_bottom
            let birdLeft = bird.x - bird.radius;
            let birdRight = bird.x + bird.radius;
            let birdTop = bird.y - bird.radius;
            let birdBottom = bird.y + bird.radius;

            // Horizontal overlap
            if (birdRight > p.x && birdLeft < p.x + p.width) {
                // Vertical overlap (hit the top pipe OR hit the bottom pipe)
                if (birdTop < p.topHeight || birdBottom > p.bottomY) {
                    gameOver();
                }
            }

            // --- FIXED SCORING ---
            // If the pipe has left the screen on the left and has not been scored yet
            if (p.x + p.width < 0 && !p.passed) {
                score++;
                p.passed = true;
                ui.scoreDisplay.innerText = score;
            }

            if (p.x < -60) {
                pipes.shift();
                i--;
            }
        }
    }

    function checkCollisions() {
        // Ground
        if (bird.y + bird.radius >= canvas.height) {
            gameOver();
        }
        // Ceiling
        if (bird.y - bird.radius <= 0) {
            bird.y = bird.radius;
            bird.velocity = 0;
        }
    }

    function drawBackground() {
        // Clear
        ctx.clearRect(0, 0, canvas.width, canvas.height);
        
        // Ground
        ctx.fillStyle = '#654321';
        ctx.fillRect(0, canvas.height - 10, canvas.width, 10);
        
        // Clouds
        ctx.fillStyle = "rgba(255, 255, 255, 0.6)";
        for(let i=0; i<4; i++) {
            let x = (frames * 0.5 + i * 150) % (canvas.width + 100) - 50;
            let y = (i * 40) + 20;
            let scale = 1 + (Math.sin(frames * 0.02 + i) * 0.1);
            let size = 30 * scale;
            ctx.beginPath();
            ctx.arc(x, y, size, 0, Math.PI * 2);
            ctx.arc(x + 20*scale, y - 10*scale, size * 1.2, 0, Math.PI * 2);
            ctx.arc(x + 40*scale, y, size, 0, Math.PI * 2);
            ctx.fill();
        }
    }

    function update() {
        if (!isPlaying) return;
        bird.update();
        updatePipes();
        checkCollisions();
        frames++;
    }

    function draw() {
        drawBackground();
        drawPipes();
        bird.draw();
    }

    function loop() {
        update();
        draw();
        if (isPlaying || !isGameOver) {
            gameLoopId = requestAnimationFrame(loop);
        }
    }

    function startGame() {
        isPlaying = true;
        isGameOver = false;
        
        // UI
        ui.startScreen.style.display = 'none';
        ui.gameOverScreen.style.display = 'none';
        ui.scoreBoard.style.display = 'none';
        
        // Logic
        bird.reset();
        pipes = [];
        score = 0;
        frames = 0;
        ui.scoreDisplay.innerText = '0';
        
        loop();
    }

    function gameOver() {
        isPlaying = false;
        isGameOver = true;
        cancelAnimationFrame(gameLoopId);
        
        ui.finalScore.innerText = score;
        ui.gameOverScreen.style.display = 'flex';
        ui.scoreBoard.style.display = 'block';
    }

    // --- Input handling ---

    function handleInput(e) {
        if (e.type === 'keydown' && e.code === 'Space') e.preventDefault();

        if (isPlaying) {
            bird.jump();
        } else if (!isGameOver) {
            // Click on the start screen (or any click if the game has not started yet)
            startGame();
        }
    }

    // Keyboard
    window.addEventListener('keydown', (e) => {
        if (e.code === 'Space') handleInput(e);
    });

    // Mouse / Touch
    window.addEventListener('mousedown', handleInput);
    window.addEventListener('touchstart', (e) => {
        // Prevents zooming/scrolling
        // e.preventDefault(); 
        handleInput(e);
    }, {passive: false});

    // UI interactions
    ui.restartBtn.addEventListener('click', (e) => {
        e.stopPropagation();
        startGame();
    });
    
    // Allow a click on the game-over overlay to restart
    ui.gameOverScreen.addEventListener('mousedown', (e) => {
        if(e.target === ui.gameOverScreen) startGame();
    });
    ui.gameOverScreen.addEventListener('touchstart', (e) => {
        if(e.target === ui.gameOverScreen) {
            e.preventDefault();
            startGame();
        }
    });

    // Initial render
    drawBackground();
    bird.reset();
    bird.draw();

</script>
</body>
</html>

And we took some screenshots (4-bit works):

🦥 Fine-tuning GLM-4.7-Flash

Unsloth now supports fine-tuning GLM-4.7-Flash, but you will need to use transformers v5. The 30B model does not fit on a free Colab GPU; however, you can use our notebook. 16-bit LoRA fine-tuning of GLM-4.7-Flash uses about 60GB VRAM:

GLM-4.7-Flash SFT LoRA notebook

On an A100 with 40GB VRAM you may occasionally run out of memory. Use an H100/A100 with 80GB VRAM for smoother runs.


When fine-tuning MoEs, it is probably not a good idea to fine-tune the router layer, so we disabled it by default. If you want to preserve the model's reasoning capabilities (optional), you can use a mix of direct answers and chain-of-thought examples. Use at least 75% reasoning and 25% non-reasoning examples in your dataset so the model retains its reasoning capabilities.
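
One way to build such a mix is sketched below. This is an illustrative helper, not part of Unsloth: the function name, the `kind` field, and the toy data are assumptions, and 0.75 is used as the minimum reasoning share mentioned above.

```python
import random

def mix_examples(reasoning, direct, reasoning_fraction=0.75, seed=3407):
    """Return a shuffled list in which `reasoning_fraction` of rows are reasoning examples."""
    if not 0 < reasoning_fraction < 1:
        raise ValueError("reasoning_fraction must be strictly between 0 and 1")
    # Largest total size achievable at this ratio with the data available.
    total = int(min(len(reasoning) / reasoning_fraction,
                    len(direct) / (1 - reasoning_fraction)))
    n_reasoning = round(total * reasoning_fraction)
    rng = random.Random(seed)
    mixed = rng.sample(reasoning, n_reasoning) + rng.sample(direct, total - n_reasoning)
    rng.shuffle(mixed)
    return mixed

# Toy data: 300 chain-of-thought rows and 200 direct-answer rows.
cot_rows = [{"kind": "reasoning", "id": i} for i in range(300)]
direct_rows = [{"kind": "direct", "id": i} for i in range(200)]
dataset = mix_examples(cot_rows, direct_rows)
share = sum(row["kind"] == "reasoning" for row in dataset) / len(dataset)
print(len(dataset), share)  # 400 0.75
```

Raising `reasoning_fraction` above 0.75 stays within the recommendation; lowering it does not.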

🦙 Llama-server serving & deployment

To deploy GLM-4.7-Flash for production, we use llama-server. In a new terminal, e.g. via tmux, serve the model with:

./llama.cpp/llama-server \
    --model unsloth/GLM-4.7-Flash-GGUF/GLM-4.7-Flash-UD-Q4_K_XL.gguf \
    --alias "unsloth/GLM-4.7-Flash" \
    --seed 3407 \
    --temp 1.0 \
    --top-p 0.95 \
    --min-p 0.01 \
    --ctx-size 16384 \
    --port 8001

Then, in a new terminal, after running pip install openai, run:

from openai import OpenAI
import json
openai_client = OpenAI(
    base_url = "http://127.0.0.1:8001/v1",
    api_key = "sk-no-key-required",
)
completion = openai_client.chat.completions.create(
    model = "unsloth/GLM-4.7-Flash",
    messages = [{"role": "user", "content": "What is 2+2?"},],
)
print(completion.choices[0].message.content)

Which outputs:

User asks a simple question: "What is 2+2?" The answer is 4. Provide the answer.

2 + 2 = 4.

💻 GLM-4.7-Flash in vLLM

You can now use our new FP8 dynamic quant of the model for premium, fast inference. First install vLLM from the nightly build:

uv pip install --upgrade --force-reinstall vllm --torch-backend=auto --extra-index-url https://wheels.vllm.ai/nightly/cu130
uv pip install --upgrade --force-reinstall git+https://github.com/huggingface/transformers.git
uv pip install --force-reinstall numba

Then serve Unsloth's dynamic FP8 version of the model. We enabled an FP8 KV cache to reduce KV cache memory usage by 50%, and we serve on 4 GPUs. If you have 1 GPU, use CUDA_VISIBLE_DEVICES='0' and set --tensor-parallel-size 1 or remove that argument. To disable FP8, remove --quantization fp8 --kv-cache-dtype fp8 (the command below only sets --kv-cache-dtype fp8).

export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:False
CUDA_VISIBLE_DEVICES='0,1,2,3' vllm serve unsloth/GLM-4.7-Flash-FP8-Dynamic \
    --served-model-name unsloth/GLM-4.7-Flash \
    --tensor-parallel-size 4 \
    --tool-call-parser glm47 \
    --reasoning-parser glm45 \
    --enable-auto-tool-choice \
    --dtype bfloat16 \
    --seed 3407 \
    --max-model-len 200000 \
    --gpu-memory-utilization 0.95 \
    --max_num_batched_tokens 16384 \
    --port 8001 \
    --kv-cache-dtype fp8
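
The 50% KV cache saving mentioned above comes purely from storing each cached value in 1 byte instead of 2. The sketch below shows that arithmetic. The layer count and per-token width are placeholder numbers, not GLM-4.7-Flash's real configuration, and the function name is made up for this example.

```python
def kv_cache_gib(tokens: int, layers: int, kv_values_per_token_per_layer: int,
                 bytes_per_value: float) -> float:
    """Approximate KV cache size in GiB for one sequence."""
    total_bytes = tokens * layers * kv_values_per_token_per_layer * bytes_per_value
    return total_bytes / 1024 ** 3

# Placeholder shape, NOT the model's real configuration.
shape = dict(tokens=200_000, layers=48, kv_values_per_token_per_layer=1024)
bf16 = kv_cache_gib(**shape, bytes_per_value=2)  # 16-bit cache
fp8 = kv_cache_gib(**shape, bytes_per_value=1)   # --kv-cache-dtype fp8
print(f"bf16: {bf16:.2f} GiB, fp8: {fp8:.2f} GiB, ratio: {fp8 / bf16:.2f}")
```

Whatever the real shape is, the ratio stays at 0.5, which is why the FP8 cache leaves room for longer contexts or more concurrent requests.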

You can then call the served model through the OpenAI API:

from openai import AsyncOpenAI, OpenAI
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8001/v1"
client = OpenAI( # or AsyncOpenAI
    api_key=openai_api_key,
    base_url=openai_api_base,
)
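
With a client created as above, a single request could look like the following sketch. The helper name `ask` is made up for this example; the model name matches the --served-model-name used in the serve command, and the sampling values are the general-use settings from this guide.

```python
def ask(client, prompt: str, model: str = "unsloth/GLM-4.7-Flash") -> str:
    """Send one user message and return the assistant's text reply."""
    response = client.chat.completions.create(
        model=model,  # must match --served-model-name
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0,
        top_p=0.95,
    )
    return response.choices[0].message.content

# Usage, assuming `client` exists and the server is running:
# print(ask(client, "What is 2+2?"))
```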

⭐ vLLM GLM-4.7-Flash speculative decoding

We found that using the MTP (Multi-Token Prediction) module of GLM-4.7-Flash reduces generation throughput on 1 B200 from 13,000 tokens/s to 1,300 tokens/s (10x slower)! On Hopper, it should hopefully be fine.

    --speculative-config.method mtp \
    --speculative-config.num_speculative_tokens 1

With MTP: only 1,300 tokens/s throughput on 1xB200 (130 tokens/s decoding per user)

Without MTP: 13,000 tokens/s throughput on 1xB200 (still 130 tokens/s decoding per user)

🔨 Tool calling with GLM-4.7-Flash

See the Tool Calling Guide for more details on how to do tool calling. In a new terminal (if you are using tmux, press CTRL+B, then D), we create some tools such as adding 2 numbers, executing Python code, running Linux commands, and more:

import json, subprocess, random
from typing import Any
def add_number(a: float | str, b: float | str) -> float:
    return float(a) + float(b)
def multiply_number(a: float | str, b: float | str) -> float:
    return float(a) * float(b)
def substract_number(a: float | str, b: float | str) -> float:
    return float(a) - float(b)
def write_a_story() -> str:
    return random.choice([
        "Once upon a time in a galaxy far, far away...",
        "There were two friends who loved sloths and code...",
        "The world was ending because every sloth evolved superhuman intelligence...",
        "Unbeknownst to one friend, the other had accidentally written a program to evolve sloths...",
    ])
def terminal(command: str) -> str:
    if "rm" in command or "sudo" in command or "dd" in command or "chmod" in command:
        msg = "Cannot execute 'rm, sudo, dd, chmod' commands since they are dangerous"
        print(msg); return msg
    print(f"Executing terminal command `{command}`")
    try:
        return str(subprocess.run(command, capture_output = True, text = True, shell = True, check = True).stdout)
    except subprocess.CalledProcessError as e:
        return f"Command failed: {e.stderr}"
def python(code: str) -> str:
    data = {}
    exec(code, data)
    del data["__builtins__"]
    return str(data)
MAP_FN = {
    "add_number": add_number,
    "multiply_number": multiply_number,
    "substract_number": substract_number,
    "write_a_story": write_a_story,
    "terminal": terminal,
    "python": python,
}
tools = [
    {
        "type": "function",
        "function": {
            "name": "add_number",
            "description": "Add two numbers.",
            "parameters": {
                "type": "object",
                "properties": {
                    "a": {
                        "type": "string",
                        "description": "The first number.",
                    },
                    "b": {
                        "type": "string",
                        "description": "The second number.",
                    },
                },
                "required": ["a", "b"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "multiply_number",
            "description": "Multiply two numbers.",
            "parameters": {
                "type": "object",
                "properties": {
                    "a": {
                        "type": "string",
                        "description": "The first number.",
                    },
                    "b": {
                        "type": "string",
                        "description": "The second number.",
                    },
                },
                "required": ["a", "b"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "substract_number",
            "description": "Subtract two numbers.",
            "parameters": {
                "type": "object",
                "properties": {
                    "a": {
                        "type": "string",
                        "description": "The first number.",
                    },
                    "b": {
                        "type": "string",
                        "description": "The second number.",
                    },
                },
                "required": ["a", "b"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "write_a_story",
            "description": "Writes a random story.",
            "parameters": {
                "type": "object",
                "properties": {},
                "required": [],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "terminal",
            "description": "Performs operations from the terminal.",
            "parameters": {
                "type": "object",
                "properties": {
                    "command": {
                        "type": "string",
                        "description": "The command you want to run, e.g. `ls`, `rm`, ...",
                    },
                },
                "required": ["command"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "python",
            "description": "Calls a Python interpreter with some Python code that will be executed.",
            "parameters": {
                "type": "object",
                "properties": {
                    "code": {
                        "type": "string",
                        "description": "The Python code to run",
                    },
                },
                "required": ["code"],
            },
        },
    },
]
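
The `terminal` tool above rejects commands with a substring test, which also blocks harmless input such as `ls ./formats` (the word contains "rm"). A token-based check is one possible refinement, sketched below. The names `BLOCKED_COMMANDS` and `is_blocked` are made up for this example, it assumes Python 3.8 or newer, and it is still not a sandbox: shell features such as backticks can hide commands from it.

```python
import shlex

BLOCKED_COMMANDS = {"rm", "sudo", "dd", "chmod"}

def is_blocked(command: str) -> bool:
    """True if any shell token (ignoring a leading path) is a blocked command name."""
    lexer = shlex.shlex(command, posix=True, punctuation_chars=True)
    lexer.whitespace_split = True  # split on whitespace and on ; | & ( ) < >
    try:
        tokens = list(lexer)
    except ValueError:
        return True  # unbalanced quotes: refuse rather than guess
    return any(token.rsplit("/", 1)[-1] in BLOCKED_COMMANDS for token in tokens)

print(is_blocked("rm -rf build"))      # True
print(is_blocked("ls ./formats"))      # False
print(is_blocked("echo hi; sudo ls"))  # True
```

For anything beyond a demo, running tools inside a container or a restricted user account is the safer design choice.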

We then use the function below (copy, paste, and run), which parses function calls automatically and calls the OpenAI endpoint for any model:

from openai import OpenAI
def unsloth_inference(
    messages,
    temperature = 0.7,
    top_p = 1.0,
    top_k = -1,
    min_p = 0.01,
    repetition_penalty = 0.0,
):
    # Relies on `json`, `tools`, and `MAP_FN` defined in the previous block
    messages = messages.copy()
    openai_client = OpenAI(
        base_url = "http://127.0.0.1:8001/v1",
        api_key = "sk-no-key-required",
    )
    model_name = next(iter(openai_client.models.list())).id
    print(f"Using model = {model_name}")
    has_tool_calls = True
    while has_tool_calls:
        print(f"Current messages = {messages}")
        response = openai_client.chat.completions.create(
            model = model_name,
            messages = messages,
            temperature = temperature,
            top_p = top_p,
            tools = tools if tools else None,
            tool_choice = "auto" if tools else None,
            extra_body = {"top_k": top_k, "min_p": min_p, "dry_multiplier": repetition_penalty,}
        )
        tool_calls = response.choices[0].message.tool_calls or []
        content = response.choices[0].message.content or ""
        tool_calls_dict = [tc.to_dict() for tc in tool_calls]
        messages.append({"role": "assistant", "tool_calls": tool_calls_dict, "content": content,})
        # Run every requested tool and feed its result back to the model
        for tool_call in tool_calls:
            fx, args, _id = tool_call.function.name, tool_call.function.arguments, tool_call.id
            out = MAP_FN[fx](**json.loads(args))
            messages.append({"role": "tool", "tool_call_id": _id, "name": fx, "content": str(out),})
        # Stop once the model answers without requesting a tool
        if not tool_calls:
            has_tool_calls = False
    return messages
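
The dispatch step inside the loop assumes the model always names a known tool and sends valid JSON arguments. A more defensive variant of that single step might look like the sketch below. The helper `dispatch_tool_call` and the demo registry are made up for this example; in the guide, the registry role is played by MAP_FN.

```python
import json

def dispatch_tool_call(name: str, arguments_json: str, registry: dict) -> str:
    """Run one tool call and always return a string the model can read."""
    if name not in registry:
        return f"Error: unknown tool '{name}'"
    try:
        kwargs = json.loads(arguments_json or "{}")
        return str(registry[name](**kwargs))
    except (json.JSONDecodeError, TypeError, ValueError) as exc:
        return f"Error calling '{name}': {exc}"

# Stand-in registry with a single tool.
demo_registry = {"add_number": lambda a, b: float(a) + float(b)}
print(dispatch_tool_call("add_number", '{"a": "2", "b": "3"}', demo_registry))  # 5.0
print(dispatch_tool_call("missing", "{}", demo_registry))  # Error: unknown tool 'missing'
```

Returning the error text as the tool result, instead of raising, gives the model a chance to correct its call on the next turn.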

Once GLM-4.7-Flash has been launched via llama-server as in GLM-4.7-Flash (or see the Tool Calling Guide for more details), we can make some tool calls:

Tool call for mathematical operations for GLM 4.7

messages = [{
    "role": "user",
    "content": [{"type": "text", "text": "What is today's date plus 3 days?"}],
}]
unsloth_inference(messages, temperature = 1.0, top_p = 0.95, top_k = -1, min_p = 0.01)

Tool call to execute generated Python code for GLM-4.7-Flash

messages = [{
    "role": "user",
    "content": [{"type": "text", "text": "Create a Fibonacci function in Python and find fib(20)."}],
}]
unsloth_inference(messages, temperature = 1.0, top_p = 0.95, top_k = -1, min_p = 0.01)

Benchmarks

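GLM-4.7-Flash is the best-performing model in this comparison on every benchmark except AIME 25, where GPT-OSS-20B is 0.1 points ahead, and LCB v6, where Qwen3-30B-A3B-Thinking-2507 leads by 2.0 points.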

| Benchmark | GLM-4.7-Flash | Qwen3-30B-A3B-Thinking-2507 | GPT-OSS-20B |
| --- | --- | --- | --- |
| AIME 25 | 91.6 | 85.0 | 91.7 |
| GPQA | 75.2 | 73.4 | 71.5 |
| LCB v6 | 64.0 | 66.0 | 61.0 |
| HLE | 14.4 | 9.8 | 10.9 |
| SWE-bench Verified | 59.2 | 22.0 | 34.0 |
| τ²-Bench | 79.5 | 49.0 | 47.7 |
| BrowseComp | 42.8 | 2.29 | 28.3 |
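
To double-check which model leads each row, the scores from the table can be compared programmatically. This is a small sketch; the dictionary simply restates the numbers above, and the helper name is made up for this example.

```python
# Columns: GLM-4.7-Flash, Qwen3-30B-A3B-Thinking-2507, GPT-OSS-20B
MODELS = ["GLM-4.7-Flash", "Qwen3-30B-A3B-Thinking-2507", "GPT-OSS-20B"]
SCORES = {
    "AIME 25": [91.6, 85.0, 91.7],
    "GPQA": [75.2, 73.4, 71.5],
    "LCB v6": [64.0, 66.0, 61.0],
    "HLE": [14.4, 9.8, 10.9],
    "SWE-bench Verified": [59.2, 22.0, 34.0],
    "τ²-Bench": [79.5, 49.0, 47.7],
    "BrowseComp": [42.8, 2.29, 28.3],
}

def leader(benchmark: str) -> str:
    """Name of the model with the highest score on a benchmark."""
    row = SCORES[benchmark]
    return MODELS[row.index(max(row))]

for name in SCORES:
    print(f"{name}: {leader(name)}")
```

Running it shows GLM-4.7-Flash ahead on five of the seven rows, with GPT-OSS-20B leading AIME 25 and Qwen3-30B-A3B-Thinking-2507 leading LCB v6.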
