📙 Devstral: How to Run & Fine-tune

Run and fine-tune Mistral Devstral 1.1, including Small-2507 and 2505.

Devstral-Small-2507 (Devstral 1.1) is Mistral's new agentic LLM for software engineering. It excels at tool calling, exploring codebases, and powering coding agents. Mistral AI released the original 2505 version in May 2025.

Fine-tuned from Mistral-Small-3.1, Devstral supports a 128k context window. Devstral Small 1.1 improves on its predecessor, scoring 53.6% on SWE-bench Verified, which made it the #1 open model on the benchmark as of July 10, 2025.

The Unsloth Devstral 1.1 GGUFs include tool-calling support and chat template fixes. Devstral 1.1 still works well with OpenHands but now generalizes better to other prompts and coding environments.

As a text-only model, Devstral had its vision encoder removed before fine-tuning. We have added optional vision support for the model.

We also worked behind the scenes with Mistral to help debug, test, and fix possible bugs and issues! Make sure to download Mistral's official releases or Unsloth's GGUFs / dynamic quants to get the correct implementation (i.e. correct system prompt, correct chat template, etc.).

Please use --jinja in llama.cpp to enable the system prompt!

All Devstral uploads use our Unsloth Dynamic 2.0 methodology, delivering the best performance on the 5-shot MMLU and KL Divergence benchmarks. This means you can run and fine-tune quantized Mistral LLMs with minimal accuracy loss!

Devstral - Unsloth Dynamic quants:

Devstral 2507 (new)
GGUF: Devstral-Small-2507-GGUF
4-bit BnB: Devstral-Small-2507-unsloth-bnb-4bit

Devstral 2505
GGUF: Devstral-Small-2505-GGUF
4-bit BnB: Devstral-Small-2505-unsloth-bnb-4bit

🖥️ Running Devstral

⚙️ Official Recommended Settings

According to Mistral AI, these are the recommended settings for inference:

Temperature from 0.0 to 0.15
Min_P of 0.01 (optional, but 0.01 works well; the llama.cpp default is 0.1)
Use --jinja to enable the system prompt.

A system prompt is recommended; it is derived from the OpenHands system prompt. The full system prompt is provided here.

You are Devstral, a helpful agentic model trained by Mistral AI and using the OpenHands scaffold. You can interact with a computer to solve tasks.

<ROLE>
Your primary role is to assist users by executing commands, modifying code, and solving technical problems effectively. You should be thorough, methodical, and prioritize quality over speed.
* If the user asks a question, like "why is X happening", don't try to fix the problem. Just give an answer to the question.
</ROLE>

.... SYSTEM PROMPT CONTINUES ....

Our dynamic uploads have the 'UD' prefix. Those without it are not dynamic but still use our calibration dataset.

🦙 Tutorial: How to Run Devstral in Ollama

Install ollama if you haven't already!

apt-get update
apt-get install pciutils -y
curl -fsSL https://ollama.com/install.sh | sh

Run the model with our dynamic quant. Note that you can run ollama serve & in another terminal if it fails! We include all suggested parameters (temperature, etc.) in params in our Hugging Face upload!
Devstral also supports 128K context lengths, so it is best to enable KV cache quantization. We use 8-bit quantization, which saves about 50% of the KV cache memory. You can also try "q4_0".

export OLLAMA_KV_CACHE_TYPE="q8_0"
ollama run hf.co/unsloth/Devstral-Small-2507-GGUF:UD-Q4_K_XL
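
To give a feel for why KV cache quantization matters at 128K context, here is a rough sizing sketch. The layer count, KV-head count, and head dimension are assumptions based on a Mistral-Small-style configuration and are not stated in this guide, so check them against the model's config.json before relying on the numbers. The bytes-per-element values follow the llama.cpp block layouts (q8_0 stores 32 int8 values plus one fp16 scale; q4_0 stores 32 4-bit values plus one fp16 scale).

```python
# Rough KV-cache sizing at full context.
# ASSUMED architecture values (Mistral-Small-style); verify in config.json.
N_LAYERS = 40     # assumed number of transformer layers
N_KV_HEADS = 8    # assumed number of key/value heads (GQA)
HEAD_DIM = 128    # assumed per-head dimension
CTX = 131072      # 128K tokens, the context length used in this guide

# Average bytes per cached element for each cache type.
BYTES_PER_ELEMENT = {
    "f16": 2.0,
    "q8_0": 34 / 32,  # 32 int8 values + one fp16 scale per block
    "q4_0": 18 / 32,  # 32 4-bit values + one fp16 scale per block
}

def kv_cache_gib(cache_type: str, ctx: int = CTX) -> float:
    """Return the combined size of the K and V caches in GiB."""
    elements = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * ctx  # 2 = K and V
    return elements * BYTES_PER_ELEMENT[cache_type] / 2**30

for name in BYTES_PER_ELEMENT:
    print(f"{name}: {kv_cache_gib(name):.3f} GiB")
```

Under these assumptions the cache shrinks from 20 GiB at f16 to roughly 10.6 GiB at q8_0 (a saving of about 47%, close to the 50% figure above) and to about 5.6 GiB at q4_0.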

📖 Tutorial: How to Run Devstral in llama.cpp

Get the latest llama.cpp on GitHub here. You can also follow the build instructions below. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference. For Apple Mac / Metal devices, set -DGGML_CUDA=OFF and continue as usual; Metal support is enabled by default.

apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-quantize llama-cli llama-gguf-split llama-mtmd-cli
cp llama.cpp/build/bin/llama-* llama.cpp

If you want to use llama.cpp directly to load models, you can do the following, where the suffix after the colon (:UD-Q4_K_XL) is the quantization type. You can also download the model via Hugging Face first, as shown in the next step. This is similar to ollama run.

./llama.cpp/llama-cli -hf unsloth/Devstral-Small-2507-GGUF:UD-Q4_K_XL --jinja

Or download the model with the script below (after running pip install huggingface_hub hf_transfer). You can choose UD-Q4_K_XL, Q4_K_M, or other quantized versions (or full-precision BF16).

# !pip install huggingface_hub hf_transfer
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id = "unsloth/Devstral-Small-2507-GGUF",
    local_dir = "unsloth/Devstral-Small-2507-GGUF",
    allow_patterns = ["*Q4_K_XL*", "*mmproj-F16*"], # For Q4_K_XL
)
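
Once the download finishes, the commands later in this guide need the path of the model file inside local_dir. The helper below is a small sketch for locating it; find_gguf is a hypothetical name introduced here for illustration, and it simply assumes the quant tag appears in the file name and that vision projector files start with "mmproj".

```python
from pathlib import Path

def find_gguf(local_dir: str, quant: str = "UD-Q4_K_XL") -> Path:
    """Return the first model GGUF under local_dir whose name contains
    the quant tag, skipping mmproj (vision projector) files."""
    matches = sorted(
        p for p in Path(local_dir).rglob("*.gguf")
        if quant in p.name and not p.name.startswith("mmproj")
    )
    if not matches:
        raise FileNotFoundError(f"No {quant} GGUF found under {local_dir}")
    return matches[0]

# Example (after the download above has completed):
# model_path = find_gguf("unsloth/Devstral-Small-2507-GGUF")
```

The returned path can then be passed to --model.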

Run the model.
Edit --threads -1 for the maximum number of CPU threads, --ctx-size 131072 for the context length (Devstral supports a 128K context length!), and --n-gpu-layers 99 for how many layers to offload to the GPU. Try lowering it if your GPU runs out of memory, and remove it for CPU-only inference. We also use 8-bit quantization for the K cache to reduce memory usage.
For conversation mode:

./llama.cpp/llama-cli \
    --model unsloth/Devstral-Small-2507-GGUF/Devstral-Small-2507-UD-Q4_K_XL.gguf \
    --threads -1 \
    --ctx-size 131072 \
    --cache-type-k q8_0 \
    --n-gpu-layers 99 \
    --seed 3407 \
    --prio 2 \
    --temp 0.15 \
    --repeat-penalty 1.0 \
    --min-p 0.01 \
    --top-k 64 \
    --top-p 0.95 \
    --jinja
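
If you launch llama-cli from a script, the same flags can be assembled programmatically. This is only a sketch: llama_cli_args is a hypothetical helper, it covers a subset of the flags shown above, and rejecting temperatures outside the recommended 0.0 to 0.15 range is a choice made for this example, not a llama.cpp requirement.

```python
import shlex

def llama_cli_args(model_path, temp=0.15, min_p=0.01,
                   ctx_size=131072, n_gpu_layers=99):
    """Build a llama-cli argument list using flags from the command above."""
    if not 0.0 <= temp <= 0.15:
        raise ValueError("temperature outside the recommended 0.0-0.15 range")
    args = [
        "./llama.cpp/llama-cli",
        "--model", model_path,
        "--ctx-size", str(ctx_size),
        "--cache-type-k", "q8_0",
        "--temp", str(temp),
        "--min-p", str(min_p),
        "--jinja",  # enables the system prompt
    ]
    if n_gpu_layers > 0:  # omit the flag for CPU-only inference
        args += ["--n-gpu-layers", str(n_gpu_layers)]
    return args

cmd = llama_cli_args(
    "unsloth/Devstral-Small-2507-GGUF/Devstral-Small-2507-UD-Q4_K_XL.gguf"
)
print(shlex.join(cmd))
# To launch it: import subprocess; subprocess.run(cmd, check=True)
```

Passing n_gpu_layers=0 drops --n-gpu-layers entirely, which matches the CPU-only advice above.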

For non-conversation mode, to test our Flappy Bird prompt:

./llama.cpp/llama-cli \
    --model unsloth/Devstral-Small-2507-GGUF/Devstral-Small-2507-UD-Q4_K_XL.gguf \
    --threads -1 \
    --ctx-size 131072 \
    --cache-type-k q8_0 \
    --n-gpu-layers 99 \
    --seed 3407 \
    --prio 2 \
    --temp 0.15 \
    --repeat-penalty 1.0 \
    --min-p 0.01 \
    --top-k 64 \
    --top-p 0.95 \
    -no-cnv \
    --prompt "[SYSTEM_PROMPT]You are Devstral, a helpful agentic model trained by Mistral AI and using the OpenHands scaffold. You can interact with a computer to solve tasks.\n\n<ROLE>\nYour primary role is to assist users by executing commands, modifying code, and solving technical problems effectively. You should be thorough, methodical, and prioritize quality over speed.\n* If the user asks a question, like \"why is X happening\", don\'t try to fix the problem. Just give an answer to the question.\n</ROLE>\n\n<EFFICIENCY>\n* Each action you take is somewhat expensive. Wherever possible, combine multiple actions into a single action, e.g. combine multiple bash commands into one, using sed and grep to edit/view multiple files at once.\n* When exploring the codebase, use efficient tools like find, grep, and git commands with appropriate filters to minimize unnecessary operations.\n</EFFICIENCY>\n\n<FILE_SYSTEM_GUIDELINES>\n* When a user provides a file path, do NOT assume it\'s relative to the current working directory. First explore the file system to locate the file before working on it.\n* If asked to edit a file, edit the file directly, rather than creating a new file with a different filename.\n* For global search-and-replace operations, consider using `sed` instead of opening file editors multiple times.\n</FILE_SYSTEM_GUIDELINES>\n\n<CODE_QUALITY>\n* Write clean, efficient code with minimal comments. 
Avoid redundancy in comments: Do not repeat information that can be easily inferred from the code itself.\n* When implementing solutions, focus on making the minimal changes needed to solve the problem.\n* Before implementing any changes, first thoroughly understand the codebase through exploration.\n* If you are adding a lot of code to a function or file, consider splitting the function or file into smaller pieces when appropriate.\n</CODE_QUALITY>\n\n<VERSION_CONTROL>\n* When configuring git credentials, use "openhands" as the user.name and "openhands@all-hands.dev" as the user.email by default, unless explicitly instructed otherwise.\n* Exercise caution with git operations. Do NOT make potentially dangerous changes (e.g., pushing to main, deleting repositories) unless explicitly asked to do so.\n* When committing changes, use `git status` to see all modified files, and stage all files necessary for the commit. Use `git commit -a` whenever possible.\n* Do NOT commit files that typically shouldn\'t go into version control (e.g., node_modules/, .env files, build directories, cache files, large binaries) unless explicitly instructed by the user.\n* If unsure about committing certain files, check for the presence of .gitignore files or ask the user for clarification.\n</VERSION_CONTROL>\n\n<PULL_REQUESTS>\n* When creating pull requests, create only ONE per session/issue unless explicitly instructed otherwise.\n* When working with an existing PR, update it with new commits rather than creating additional PRs for the same issue.\n* When updating a PR, preserve the original PR title and purpose, updating description only when necessary.\n</PULL_REQUESTS>\n\n<PROBLEM_SOLVING_WORKFLOW>\n1. EXPLORATION: Thoroughly explore relevant files and understand the context before proposing solutions\n2. ANALYSIS: Consider multiple approaches and select the most promising one\n3. 
TESTING:\n   * For bug fixes: Create tests to verify issues before implementing fixes\n   * For new features: Consider test-driven development when appropriate\n   * If the repository lacks testing infrastructure and implementing tests would require extensive setup, consult with the user before investing time in building testing infrastructure\n   * If the environment is not set up to run tests, consult with the user first before investing time to install all dependencies\n4. IMPLEMENTATION: Make focused, minimal changes to address the problem\n5. VERIFICATION: If the environment is set up to run tests, test your implementation thoroughly, including edge cases. If the environment is not set up to run tests, consult with the user first before investing time to run tests.\n</PROBLEM_SOLVING_WORKFLOW>\n\n<SECURITY>\n* Only use GITHUB_TOKEN and other credentials in ways the user has explicitly requested and would expect.\n* Use APIs to work with GitHub or other platforms, unless the user asks otherwise or your task requires browsing.\n</SECURITY>\n\n<ENVIRONMENT_SETUP>\n* When user asks you to run an application, don\'t stop if the application is not installed. Instead, please install the application and run the command again.\n* If you encounter missing dependencies:\n  1. First, look around in the repository for existing dependency files (requirements.txt, pyproject.toml, package.json, Gemfile, etc.)\n  2. If dependency files exist, use them to install all dependencies at once (e.g., `pip install -r requirements.txt`, `npm install`, etc.)\n  3. Only install individual packages directly if no dependency files are found or if only specific packages are needed\n* Similarly, if you encounter missing dependencies for essential tools requested by the user, install them when possible.\n</ENVIRONMENT_SETUP>\n\n<TROUBLESHOOTING>\n* If you\'ve made repeated attempts to solve a problem but tests still fail or the user reports it\'s still broken:\n  1. 
Step back and reflect on 5-7 different possible sources of the problem\n  2. Assess the likelihood of each possible cause\n  3. Methodically address the most likely causes, starting with the highest probability\n  4. Document your reasoning process\n* When you run into any major issue while executing a plan from the user, please don\'t try to directly work around it. Instead, propose a new plan and confirm with the user before proceeding.\n</TROUBLESHOOTING>[/SYSTEM_PROMPT][INST]Create a Flappy Bird game in Python. You must include these things:\n1. You must use pygame.\n2. The background color should be randomly chosen and is a light shade. Start with a light blue color.\n3. Pressing SPACE multiple times will accelerate the bird.\n4. The bird\'s shape should be randomly chosen as a square, circle or triangle. The color should be randomly chosen as a dark color.\n5. Place on the bottom some land colored as dark brown or yellow chosen randomly.\n6. Make a score shown on the top right side. Increment if you pass pipes and don\'t hit them.\n7. Make randomly spaced pipes with enough space. Color them randomly as dark green or light brown or a dark gray shade.\n8. When you lose, show the best score. Make the text inside the screen. Pressing q or Esc will quit the game. Restarting is pressing SPACE again.\nThe final game should be inside a markdown section in Python. Check your code for error[/INST]"

Do not add a <bos> token to the prompt yourself, since Devstral automatically adds one! Also use --jinja to enable the system prompt!
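
The raw prompt passed to --prompt above wraps the system prompt in [SYSTEM_PROMPT]...[/SYSTEM_PROMPT] and the user message in [INST]...[/INST]. A minimal sketch of assembling that string is shown below; build_raw_prompt is a hypothetical helper, and it only covers the single-turn layout visible in the command above.

```python
def build_raw_prompt(system_prompt: str, user_message: str) -> str:
    """Assemble a single-turn raw prompt in the layout used above.

    No beginning-of-sequence token is added here, since one is
    inserted automatically at inference time.
    """
    return (
        f"[SYSTEM_PROMPT]{system_prompt}[/SYSTEM_PROMPT]"
        f"[INST]{user_message}[/INST]"
    )

prompt = build_raw_prompt(
    "You are Devstral, a helpful agentic model trained by Mistral AI.",
    "Create a Flappy Bird game in Python.",
)
print(prompt)
```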

👀 Experimental Vision Support

Xuan-Son from Hugging Face showed in their GGUF repo how it is actually possible to "graft" the vision encoder from Mistral 3.1 Instruct onto Devstral 2507. We have also uploaded our mmproj files, which allow you to use the following:

./llama.cpp/llama-mtmd-cli \
    --model unsloth/Devstral-Small-2507-GGUF/Devstral-Small-2507-UD-Q4_K_XL.gguf \
    --mmproj unsloth/Devstral-Small-2507-GGUF/mmproj-F16.gguf \
    --threads -1 \
    --ctx-size 131072 \
    --cache-type-k q8_0 \
    --n-gpu-layers 99 \
    --seed 3407 \
    --prio 2 \
    --temp 0.15

For example:

[Figure: instruction and output code]
[Figure: rendered code]

🦥 Fine-tuning Devstral with Unsloth

Just like standard Mistral models, including Mistral Small 3.1, Unsloth supports fine-tuning Devstral. Training is 2x faster, uses 70% less VRAM, and supports 8x longer context lengths. Devstral fits comfortably in an L4 GPU with 24GB of VRAM.

Unfortunately, Devstral slightly exceeds the memory limits of 16GB of VRAM, so free fine-tuning on Google Colab isn't possible for now. However, you can fine-tune the model for free using our Kaggle notebook, which offers access to dual GPUs. Just change the notebook's Magistral model name to the Devstral model.

If you have an old version of Unsloth and/or are fine-tuning locally, install the latest version of Unsloth:

pip install --upgrade --force-reinstall --no-cache-dir unsloth unsloth_zoo
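
As a starting point, loading Devstral for LoRA fine-tuning might look like the sketch below. It uses Unsloth's FastLanguageModel.from_pretrained and FastLanguageModel.get_peft_model; the "unsloth/" repository prefix, the sequence length, and every LoRA hyperparameter are assumptions chosen for illustration, not values recommended by this guide. The import sits inside the function so the file can be read or imported on a machine without a GPU.

```python
# ASSUMED repo id: the 4-bit upload listed above, under the "unsloth" org.
MODEL_NAME = "unsloth/Devstral-Small-2507-unsloth-bnb-4bit"
MAX_SEQ_LENGTH = 2048  # assumed; raise it if your VRAM allows

# Illustrative LoRA settings (assumptions, tune for your task).
LORA_KWARGS = dict(
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",
    random_state=3407,
)

def load_devstral_for_lora():
    """Load the 4-bit model and attach LoRA adapters (needs a GPU)."""
    from unsloth import FastLanguageModel  # imported lazily on purpose

    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=MODEL_NAME,
        max_seq_length=MAX_SEQ_LENGTH,
        load_in_4bit=True,
    )
    model = FastLanguageModel.get_peft_model(model, **LORA_KWARGS)
    return model, tokenizer

# model, tokenizer = load_devstral_for_lora()
```

From there the model and tokenizer can be handed to the trainer used in the Kaggle notebook mentioned above.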
