# Mistral 3.5 - How to Run Locally

Mistral has released Mistral-Medium-3.5-128B, a new dense, 128B-parameter, multimodal hybrid reasoning model. It accepts text and image input, produces text output, and supports a 256K context window. It excels at reasoning, coding, long context, tool use, agentic workflows, and multimodal document/image understanding.

Mistral Medium 3.5 delivers performance that is highly competitive with models 5x its size. It can run locally on about 64 GB of RAM. GGUF: [Mistral-Medium-3.5-128B-GGUF](https://huggingface.co/unsloth/Mistral-Medium-3.5-128B-GGUF)

{% hint style="success" %}
**Update, May 1, 2026:** We worked with Mistral to fix an inference issue in Mistral Medium 3.5 and have released updated GGUFs that include the fix. The issue was **not related to Unsloth** or to our quants. It was caused by a YaRN parsing quirk and affected several implementations, including `transformers` and `llama.cpp`. It was resolved by changing `mscale_all_dim` from `1` to `0`. We also fixed an issue where the `mmproj` file was not generated correctly.

<mark style="background-color:$success;">**Mistral has now applied our fixes to its official repository!**</mark>
{% endhint %}

### Usage Guide

{% hint style="info" %}
Vision is now supported for the GGUFs. Further support will be added later.
{% endhint %}

Table: Recommended hardware requirements for Mistral Medium 3.5. Figures are total memory: RAM + VRAM, or unified memory.

| Mistral 3.5     | 3-bit | 4-bit | 8-bit      |
| --------------- | ----- | ----- | ---------- |
| Medium 3.5 128B | 64 GB | 80 GB | 128-170 GB |

{% hint style="info" %}
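Your total available memory should at least exceed the size of the quantized model you download. If it does not, llama.cpp can still run with partial offloading to RAM or disk, but generation will be slower. Long contexts, large batches, tool-heavy agentic runs, and image prompts need additional memory.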
{% endhint %}
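
As a rough illustration of the memory rule above, the sketch below compares a quant's file size with the machine's total memory. `fits_in_memory` is a hypothetical helper, and the 75 GB, 24 GB, and 64 GB figures are placeholders rather than measured values; substitute the size of the file you actually downloaded.

```bash
# Hypothetical helper: succeeds (exit 0) when total memory exceeds the model size.
# Both arguments are whole numbers of gigabytes.
fits_in_memory() {
    model_gb="$1"
    total_gb="$2"
    [ "$total_gb" -gt "$model_gb" ]
}

# Placeholder figures: a 75 GB quant on a machine with 24 GB VRAM + 64 GB RAM.
quant_gb=75
memory_gb=$((24 + 64))

if fits_in_memory "$quant_gb" "$memory_gb"; then
    echo "fits: ${memory_gb} GB total > ${quant_gb} GB model"
else
    echo "expect partial offload: ${memory_gb} GB total <= ${quant_gb} GB model"
fi
```

Remember that this only covers the weights; context, batches, and image inputs add to the total.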

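#### Recommended Settings

Use Mistral's recommended inference settings:

* `reasoning_effort="none"` → fast instant replies, chat, extraction, and simple instructions.
* `reasoning_effort="high"` → reasoning mode. Recommended for complex prompts, coding, research, math, and agentic use.

Recommended sampling defaults:

* Use `temperature = 0.7` with `reasoning_effort="high"`.
* Use `temperature = 0.0` to `0.7` with `reasoning_effort="none"`, depending on the task.
* Leave the repetition and presence penalties disabled or at `1.0`, unless you see looping.
* The maximum context length is `262,144`.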
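
The defaults above can be sketched as a small shell helper that prints the matching llama.cpp flags. `sampling_flags` is a hypothetical name introduced for illustration; `--temp` and `--chat-template-kwargs` are the flags used elsewhere in this guide.

```bash
# Hypothetical helper: print llama.cpp flags for a given reasoning effort.
#   $1 = "high" or "none"
#   $2 = optional temperature for instant mode (0.0 to 0.7, defaults to 0.7)
sampling_flags() {
    case "$1" in
        high) temp="0.7" ;;          # reasoning mode uses 0.7
        none) temp="${2:-0.7}" ;;    # instant mode: pick per task
        *) echo "unknown reasoning_effort: $1" >&2; return 1 ;;
    esac
    printf '%s\n' "--temp $temp --chat-template-kwargs '{\"reasoning_effort\":\"$1\"}'"
}

sampling_flags high
sampling_flags none 0.0
```

The printed flags can be pasted onto a `llama-cli` or `llama-server` command line.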

#### **Reasoning Mode**

Mistral Medium 3.5 supports an instant instruct mode and a reasoning mode with a 'high' option.

To enable high reasoning in llama.cpp / llama-server:

```bash
--chat-template-kwargs '{"reasoning_effort":"high"}'
```

To disable reasoning:

```bash
--chat-template-kwargs '{"reasoning_effort":"none"}'
```

On Windows PowerShell, use the following instead:

```powershell
--chat-template-kwargs "{\"reasoning_effort\":\"none\"}"
```

## Run Mistral 3.5 Tutorials

Because Mistral Medium 3.5 is a dense 128B model, we recommend the Dynamic 4-bit GGUF as a starting point for local inference. GGUF: `unsloth/Mistral-Medium-3.5-128B-GGUF`

<a href="/pages/d5ea4312f45148d1f1e8083c9c508a3ee23914a4#unsloth-studio-guide" class="button primary">Run in Unsloth Studio</a><a href="/pages/d5ea4312f45148d1f1e8083c9c508a3ee23914a4#llama.cpp-guide" class="button secondary">Run in llama.cpp</a>

{% hint style="warning" %}
Currently, no multimodal/vision GGUF works in **Ollama**, because the vision weights ship as a separate `mmproj` file. Use a llama.cpp-compatible backend instead.

Do not use **CUDA 13.2**, as it can produce gibberish output. NVIDIA is working on a fix.
{% endhint %}

### 🦥 Unsloth Studio Guide

This tutorial uses [Unsloth Studio](/docs/jp/xin-zhe/studio.md), our new web UI for running and training LLMs. With Unsloth Studio you can run models locally on **Mac, Windows**, and Linux with **audio**, image, and text input, and you can also:

{% columns %}
{% column %}

* Search for, download, and [run GGUFs](/docs/jp/xin-zhe/studio.md#run-models-locally) and safetensor models
* **Compare** models **side by side**
* [**Self-healing** tool calling](/docs/jp/xin-zhe/studio.md#execute-code--heal-tool-calling) + **web search**
* [**Code execution**](/docs/jp/xin-zhe/studio.md#run-models-locally) (Python, Bash)
* [Automatic inference](/docs/jp/xin-zhe/studio.md#model-arena) parameter tuning (temp, top-p, etc.)
* [Train LLMs](/docs/jp/xin-zhe/studio.md#no-code-training) 2x faster with 70% less VRAM
  {% endcolumn %}

{% column %}

<div data-with-frame="true"><figure><img src="/files/c32867f999db074387ac16732ce548485cc593de" alt=""><figcaption></figcaption></figure></div>
{% endcolumn %}
{% endcolumns %}

{% stepper %}
{% step %}

#### Install Unsloth

**MacOS, Linux, WSL:**

```bash
curl -fsSL https://unsloth.ai/install.sh | sh
```

**Windows PowerShell:**

```powershell
irm https://unsloth.ai/install.ps1 | iex
```

{% endstep %}

{% step %}

#### Set up Unsloth Studio (one time only)

Setup automatically installs Node.js (via nvm), builds the frontend, installs all required Python dependencies, and builds llama.cpp with CUDA support.

{% hint style="info" %}
**WSL users:** You will be prompted for your `sudo` password so that the build dependencies (`cmake`, `git`, `libcurl4-openssl-dev`) can be installed.
{% endhint %}
{% endstep %}

{% step %}

#### Launch Unsloth

**MacOS, Linux, WSL:**

```bash
source unsloth_studio/bin/activate
unsloth studio -H 0.0.0.0 -p 8888
```

**Windows PowerShell:**

```powershell
& .\unsloth_studio\Scripts\unsloth.exe studio -H 0.0.0.0 -p 8888
```

<div data-with-frame="true"><figure><img src="/files/698ae7636b7c9b8a8122c6fbdabc1bd2273fdb2c" alt="" width="375"><figcaption></figcaption></figure></div>

**Then open `http://localhost:8888` in your browser.**
{% endstep %}

{% step %}

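#### Search for and download Mistral Medium 3.5

On first launch you need to create a password to secure your account, then sign in again. After that, go to the [Studio Chat](/docs/jp/xin-zhe/studio/chat.md) tab, search for Mistral 3.5 in the search bar, and download the model and quant you want.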
{% endstep %}

{% step %}

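#### Run Mistral 3.5

Inference parameters should be set automatically when you use Unsloth Studio, but you can still change them manually. You can also edit the context length, chat template, and other settings.

For more information, see the [Unsloth Studio inference guide](/docs/jp/xin-zhe/studio/chat.md).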
{% endstep %}
{% endstepper %}

### 🦙 Llama.cpp Guide

This guide uses the Unsloth Dynamic 4-bit quant of Mistral Medium 3.5. See: `unsloth/Mistral-Medium-3.5-128B-GGUF`.

These tutorials use llama.cpp for fast local inference, which is especially useful if you have a CPU-only machine or a machine with a large amount of unified memory.

**1. Build llama.cpp**

Get the latest `llama.cpp` from GitHub. Change `-DGGML_CUDA=ON` to `-DGGML_CUDA=OFF` if you do not have a GPU or only want CPU inference. On Apple Mac / Metal devices, set `-DGGML_CUDA=OFF`; Metal support is enabled by default.

```bash
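# Install build dependencies (Debian/Ubuntu; run as root or prefix with sudo)
apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
# Configure; use -DGGML_CUDA=OFF for CPU-only or Apple Metal builds
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON
# Build only the binaries used in this guide
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-mtmd-cli llama-server llama-gguf-split
# Copy the binaries into llama.cpp/ so the later commands can find them
cp llama.cpp/build/bin/llama-* llama.cpp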
```
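
The flag choice described above can be sketched as a small helper. `ggml_cuda_flag` is a hypothetical function, and probing for `nvidia-smi` to detect an NVIDIA GPU is an assumption made for illustration, not something llama.cpp itself requires.

```bash
# Hypothetical helper: choose the -DGGML_CUDA value for this machine.
# Assumption: the presence of nvidia-smi means a usable NVIDIA GPU and driver.
ggml_cuda_flag() {
    if [ "$(uname -s)" = "Darwin" ]; then
        echo "-DGGML_CUDA=OFF"    # Apple devices: Metal is enabled by default
    elif command -v nvidia-smi >/dev/null 2>&1; then
        echo "-DGGML_CUDA=ON"
    else
        echo "-DGGML_CUDA=OFF"    # CPU-only inference
    fi
}

# Print the configure command instead of running it, so it can be reviewed first.
echo "cmake llama.cpp -B llama.cpp/build -DBUILD_SHARED_LIBS=OFF $(ggml_cuda_flag)"
```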

**2. Run directly from Hugging Face**

```bash
# Directory where llama.cpp caches models downloaded with -hf
export LLAMA_CACHE="unsloth/Mistral-Medium-3.5-128B-GGUF"

# Instant (non-reasoning) mode
./llama.cpp/llama-cli \
    -hf unsloth/Mistral-Medium-3.5-128B-GGUF:UD-Q4_K_XL \
    --temp 0.7 \
    --chat-template-kwargs '{"reasoning_effort":"none"}'
```

For high reasoning mode:

```bash
./llama.cpp/llama-cli \
    -hf unsloth/Mistral-Medium-3.5-128B-GGUF:UD-Q4_K_XL \
    --temp 0.7 \
    --chat-template-kwargs '{"reasoning_effort":"high"}'
```

**3. Download the model manually**

Install `huggingface_hub` and `hf_transfer`, then download the quant:

```bash
pip install huggingface_hub hf_transfer

# Fetch only the 4-bit dynamic quant and the multimodal projector
hf download unsloth/Mistral-Medium-3.5-128B-GGUF \
    --local-dir unsloth/Mistral-Medium-3.5-128B-GGUF \
    --include "*UD-Q4_K_XL*" \
    --include "*mmproj*"
```

If the download stalls, set the following:

```bash
export HF_HUB_ENABLE_HF_TRANSFER=1
```

**4. Run the local GGUF**

```bash
./llama.cpp/llama-cli \
    --model unsloth/Mistral-Medium-3.5-128B-GGUF/Mistral-Medium-3.5-128B-UD-Q4_K_XL.gguf \
    --temp 0.7 \
    --chat-template-kwargs '{"reasoning_effort":"none"}'
```

If the multimodal projector GGUF is included, use the following instead:

```bash
./llama.cpp/llama-cli \
    --model unsloth/Mistral-Medium-3.5-128B-GGUF/Mistral-Medium-3.5-128B-UD-Q4_K_XL.gguf \
    --mmproj unsloth/Mistral-Medium-3.5-128B-GGUF/mmproj-BF16.gguf \
    --temp 0.7 \
    --chat-template-kwargs '{"reasoning_effort":"none"}'
```

#### Llama-server deployment

To deploy Mistral Medium 3.5 with llama-server, use:

```bash
./llama.cpp/llama-server \
    -hf unsloth/Mistral-Medium-3.5-128B-GGUF:UD-Q4_K_XL \
    --alias "mistral-medium-3.5" \
    --host 0.0.0.0 \
    --port 8001 \
    --temp 0.7 \
    --chat-template-kwargs '{"reasoning_effort":"none"}'
```

For reasoning mode, replace the last flag with:

```bash
--chat-template-kwargs '{"reasoning_effort":"high"}'
```

On Windows PowerShell, use the following instead:

```powershell
--chat-template-kwargs "{\"reasoning_effort\":\"high\"}"
```

You can query llama-server with an OpenAI-compatible request:

```bash
curl http://localhost:8001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistral-medium-3.5",
    "messages": [
      {"role": "user", "content": "Explain the main differences between instant mode and reasoning mode."}
    ],
    "temperature": 0.7
  }'
```
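
Reasoning effort may also be selectable per request rather than only at launch. The sketch below assumes that your llama-server build accepts a `chat_template_kwargs` field in the request body; verify this against your build before relying on it. `build_payload` is a hypothetical helper that only assembles the JSON, so it runs without a server.

```bash
# Hypothetical helper: assemble a chat completion request body.
#   $1 = reasoning effort ("none" or "high")
#   $2 = user prompt (must not contain double quotes or backslashes)
build_payload() {
    printf '{"model":"mistral-medium-3.5","messages":[{"role":"user","content":"%s"}],"temperature":0.7,"chat_template_kwargs":{"reasoning_effort":"%s"}}\n' "$2" "$1"
}

payload="$(build_payload high "What is 17 * 23?")"
echo "$payload"

# To send it (requires the llama-server instance from this guide on port 8001):
# curl http://localhost:8001/v1/chat/completions \
#   -H "Content-Type: application/json" -d "$payload"
```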

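### Mistral 3.5 Best Practices

#### Prompt examples

**Simple reasoning prompt**

```
System:
You are a precise reasoning assistant. Solve carefully and present only the final answer and a short explanation.

User:
A train departed at 8:15 AM and arrived at 11:47 AM. How long did the journey take?
```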

Use `reasoning_effort="high"` for this type of prompt.
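
To check the model's answer to the train prompt, the expected duration can be worked out with plain shell arithmetic. This is only a verification aid for the example prompt, not part of any Mistral or llama.cpp tooling.

```bash
# Convert both clock times to minutes since midnight, then subtract.
depart=$((8 * 60 + 15))     # 8:15 AM  -> 495 minutes
arrive=$((11 * 60 + 47))    # 11:47 AM -> 707 minutes
elapsed=$((arrive - depart))

echo "$((elapsed / 60)) h $((elapsed % 60)) min"   # prints: 3 h 32 min
```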

**OCR / document prompt**

For OCR and document extraction, put the image first and ask for structured output.

```
[image first]
Extract all text from this receipt. Return merchant, date, line_items, and total as JSON.
```

**Multimodal comparison prompt**

```
[image 1]
[image 2]
Compare these two screenshots and tell me which one is more likely to confuse a new user. Give three specific reasons.
```

**Coding agent prompt**

```
You are a coding agent working inside a repository.
First inspect the relevant files, then propose a minimal patch.
Return your final answer with a summary, files changed, tests run, and risks.
```

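Use `reasoning_effort="high"` together with tool calling for codebase exploration.

**JSON / function calling prompt**

```
If a calculation or lookup is needed, always use the provided tools.
Return valid JSON only. Do not include any explanatory text outside the JSON object.
```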

### Benchmarks

<div><figure><img src="/files/601f6c8ad15c03f0a4086427e68090dc38e749a3" alt=""><figcaption></figcaption></figure> <figure><img src="/files/7202d2c147553a738f109dbd1856c2a40f5e95ce" alt=""><figcaption></figcaption></figure></div>

