Mistral 3.5 - How To Run Locally

Guide for Mistral Mistral 3.5 models, to run or fine-tune locally on your device

Mistral releases Mistral-Medium-3.5-128B, their new dense 128B parameter, multimodal, hybrid reasoning model. It supports text and image input, text output, a 256K context window and excels at reasoning, coding, long-context, tool use, agentic workflows, and multimodal doc/image understanding.

Mistral Medium 3.5 offers highly competitive performance for models 5x its size. Run locally on ~64GB RAM. GGUF: Mistral-Medium-3.5-128B-GGUF

Usage Guide

Vision for GGUFs it now supported for now. Support will come later.

Table: Mistral Medium 3.5 recommended hardware requirements. Units are total memory: RAM + VRAM, or unified memory.

Mistral 3.5
3-bit
4-bit
8-bit

Medium 3.5 128B

64 GB

80 GB

128-170 GB

Your total available memory should at least exceed the size of the quantized model you download. If it does not, llama.cpp can still run with partial RAM / disk offload, but generation will be slower. You will also need more memory for long context, larger batches, tool-heavy agent runs and image prompts.

Use Mistral's recommended reasoning settings:

  • reasoning_effort="none" → fast instant replies, chat, extraction and simple instructions.

  • reasoning_effort="high" → reasoning mode, recommended for complex prompts, coding, research, math and agentic usage.

Recommended sampling defaults:

  • Use temperature = 0.7 for reasoning_effort="high".

  • Use temperature = 0.0 to 0.7 for reasoning_effort="none", depending on the task.

  • Keep repetition and presence penalties disabled or at 1.0 unless you see looping.

  • Maximum context length of 262,144

Reasoning Mode

Mistral Medium 3.5 supports instant instruct mode and reasoning mode with a 'high' option.

To enable high reasoning for llama.cpp / llama-server:

To disable reasoning:

If you're on Windows PowerShell, use:

Run Mistral 3.5 Tutorials

Because Mistral Medium 3.5 is a dense 128B model, the recommended starting point is Dynamic 4-bit GGUFs for local inference. GGUF: unsloth/Mistral-Medium-3.5-128B-GGUF

Run in Unsloth StudioRun in llama.cpp

🦥 Unsloth Studio Guide

For this tutorial, we will be using Unsloth Studio, which is our new web UI for running and training LLMs. With Unsloth Studio, you can run models and input audio, image and text locally on Mac, Windows, and Linux and:

1

Install Unsloth

MacOS, Linux, WSL:

Windows PowerShell:

2

Setup Unsloth Studio (one time)

Setup automatically installs Node.js (via nvm), builds the frontend, installs all Python dependencies, and builds llama.cpp with CUDA support.

WSL users: you will be prompted for your sudo password to install build dependencies (cmake, git, libcurl4-openssl-dev).

3

Launch Unsloth

MacOS, Linux, WSL:

Windows Powershell:

Then open http://localhost:8888 in your browser.

4

Search and download Mistral Medium 3.5

On first launch you will need to create a password to secure your account and sign in again later. Then go to the Studio Chat tab and search for Mistral 3.5 in the search bar and download your desired model and quant.

5

Run Mistral 3.5

Inference parameters should be auto-set when using Unsloth Studio, however you can still change it manually. You can also edit the context length, chat template and other settings.

For more information, you can view our Unsloth Studio inference guide.

🦙 Llama.cpp Guide

For this guide we will use Unsloth Dynamic 4-bit for Mistral Medium 3.5. See: unsloth/Mistral-Medium-3.5-128B-GGUF.

For these tutorials, we will use llama.cpp for fast local inference, especially if you have a CPU or high-memory unified-memory machine.

1. Build llama.cpp

Obtain the latest llama.cpp on GitHub. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference. For Apple Mac / Metal devices, set -DGGML_CUDA=OFF; Metal support is on by default.

2. Run directly from Hugging Face

For high reasoning mode:

3. Download the model manually

After installing huggingface_hub and hf_transfer:

If downloads get stuck, set:

4. Run the local GGUF

If a multimodal projector GGUF is included, use:

Llama-server deployment

To deploy Mistral Medium 3.5 on llama-server, use:

For reasoning mode:

If you're on Windows PowerShell, use:

You can ping llama-server with an OpenAI-compatible request:

Mistral 3.5 Best Practices

Prompting examples

Simple reasoning prompt

Use reasoning_effort="high" for this style of prompt.

OCR / document prompt

For OCR and document extraction, put the image first and ask for structured output.

Multi-modal comparison prompt

Coding agent prompt

Use reasoning_effort="high" and tool calling for codebase exploration.

JSON / function calling prompt

Benchmarks

Last updated

Was this helpful?