Mistral 3.5 - How To Run Locally
Guide for Mistral Mistral 3.5 models, to run or fine-tune locally on your device
Mistral releases Mistral-Medium-3.5-128B, their new dense 128B parameter, multimodal, hybrid reasoning model. It supports text and image input, text output, a 256K context window and excels at reasoning, coding, long-context, tool use, agentic workflows, and multimodal doc/image understanding.
Mistral Medium 3.5 offers highly competitive performance for models 5x its size. Run locally on ~64GB RAM. GGUF: Mistral-Medium-3.5-128B-GGUF
May 1, 2026 Update: We worked with Mistral to fix Mistral Medium 3.5 inference affecting some implementations, and released updated GGUFs with the fix (NOT related to Unsloth or our quants). The issue was caused by a YaRN parsing quirk affecting several implementations, including transformers and llama.cpp. Changing mscale_all_dim from 1 to 0 resolved it. We also fixed mmproj files not being generated correctly.
Mistral has now pushed our fixes to their official repo!
Usage Guide
Vision for GGUFs it now supported for now. Support will come later.
Table: Mistral Medium 3.5 recommended hardware requirements. Units are total memory: RAM + VRAM, or unified memory.
Medium 3.5 128B
64 GB
80 GB
128-170 GB
Your total available memory should at least exceed the size of the quantized model you download. If it does not, llama.cpp can still run with partial RAM / disk offload, but generation will be slower. You will also need more memory for long context, larger batches, tool-heavy agent runs and image prompts.
Recommended Settings
Use Mistral's recommended reasoning settings:
reasoning_effort="none"→ fast instant replies, chat, extraction and simple instructions.reasoning_effort="high"→ reasoning mode, recommended for complex prompts, coding, research, math and agentic usage.
Recommended sampling defaults:
Use
temperature = 0.7forreasoning_effort="high".Use
temperature = 0.0to0.7forreasoning_effort="none", depending on the task.Keep repetition and presence penalties disabled or at
1.0unless you see looping.Maximum context length of
262,144
Reasoning Mode
Mistral Medium 3.5 supports instant instruct mode and reasoning mode with a 'high' option.
To enable high reasoning for llama.cpp / llama-server:
To disable reasoning:
If you're on Windows PowerShell, use:
Run Mistral 3.5 Tutorials
Because Mistral Medium 3.5 is a dense 128B model, the recommended starting point is Dynamic 4-bit GGUFs for local inference. GGUF: unsloth/Mistral-Medium-3.5-128B-GGUF
Run in Unsloth StudioRun in llama.cpp
Currently no multimodal/vision GGUF works in Ollama due to separate mmproj vision files. Use llama.cpp compatible backends.
Do NOT use CUDA 13.2 as you may get gibberish outputs. NVIDIA is working on a fix.
🦥 Unsloth Studio Guide
For this tutorial, we will be using Unsloth Studio, which is our new web UI for running and training LLMs. With Unsloth Studio, you can run models and input audio, image and text locally on Mac, Windows, and Linux and:
Search, download, run GGUFs and safetensor models
Compare models side-by-side
Self-healing tool calling + web search
Code execution (Python, Bash)
Automatic inference parameter tuning (temp, top-p, etc.)
Train LLMs 2x faster with 70% less VRAM

Search and download Mistral Medium 3.5
On first launch you will need to create a password to secure your account and sign in again later. Then go to the Studio Chat tab and search for Mistral 3.5 in the search bar and download your desired model and quant.
Run Mistral 3.5
Inference parameters should be auto-set when using Unsloth Studio, however you can still change it manually. You can also edit the context length, chat template and other settings.
For more information, you can view our Unsloth Studio inference guide.
🦙 Llama.cpp Guide
For this guide we will use Unsloth Dynamic 4-bit for Mistral Medium 3.5. See: unsloth/Mistral-Medium-3.5-128B-GGUF.
For these tutorials, we will use llama.cpp for fast local inference, especially if you have a CPU or high-memory unified-memory machine.
1. Build llama.cpp
Obtain the latest llama.cpp on GitHub. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference. For Apple Mac / Metal devices, set -DGGML_CUDA=OFF; Metal support is on by default.
2. Run directly from Hugging Face
For high reasoning mode:
3. Download the model manually
After installing huggingface_hub and hf_transfer:
If downloads get stuck, set:
4. Run the local GGUF
If a multimodal projector GGUF is included, use:
Llama-server deployment
To deploy Mistral Medium 3.5 on llama-server, use:
For reasoning mode:
If you're on Windows PowerShell, use:
You can ping llama-server with an OpenAI-compatible request:
Mistral 3.5 Best Practices
Prompting examples
Simple reasoning prompt
Use reasoning_effort="high" for this style of prompt.
OCR / document prompt
For OCR and document extraction, put the image first and ask for structured output.
Multi-modal comparison prompt
Coding agent prompt
Use reasoning_effort="high" and tool calling for codebase exploration.
JSON / function calling prompt
Benchmarks


Last updated
Was this helpful?


