# Gemma 4 - 如何本地运行

Gemma 4 是 Google DeepMind 推出的新一代开放模型家族，包括 **E2B**, **E4B**, **26B-A4B**，以及 **31B。** 这些多模态、混合思维模型支持 140 多种语言，最高可达 **256K 上下文**，并提供稠密版和 MoE 版本。Gemma 4 采用 Apache-2.0 许可证，并可在你的本地设备上运行。

{% columns %}
{% column %} <a href="/pages/10f714f4a513e0d0a86b6f9d5945f9014729b035#run-gemma-4-tutorials" class="button primary">运行 Gemma 4</a><a href="/pages/33fa9e3bb3ccf6a5c0011aa600e98abbe3a829e3" class="button secondary">微调 Gemma 4</a>

**Gemma-4-E2B** 和 **E4B** 支持图像和音频。可运行于 **5GB 内存** （4 位）或 15GB（完整 16 位）。查看我们的 [Gemma 4 GGUF](https://huggingface.co/collections/unsloth/gemma-4).

**Gemma-4-26B-A4B** 运行于 **18GB** （4 位）或 28GB（8 位）。 **Gemma-4-31B** 需要 **20GB 内存** （4 位）或 34GB（8 位）。
{% endcolumn %}

{% column %}

<div data-with-frame="true"><figure><img src="/files/2dfd7fbf0b551d243091cd1054c69104594c25d5" alt=""><figcaption></figcaption></figure></div>
{% endcolumn %}
{% endcolumns %}

{% hint style="success" %}
**4 月 20 日：** 我们进行了 [Gemma 4 GGUF 基准测试](#unsloth-gguf-benchmarks) 以帮助你选择最佳量化版本。

**4 月 11 日更新：** Gemma 4 现已更新，包含 Google 更新后的聊天模板和 llama.cpp 修复。\
**不要** 对任何 GGUF 使用 CUDA 13.2 运行时，因为这会导致输出质量较差。

你现在可以运行 GGUF 并微调 Gemma 4，在 [Unsloth Studio](#unsloth-studio-guide)✨
{% endhint %}

### 使用指南

Gemma 4 在推理、编程、工具使用、长上下文与代理式工作流以及多模态任务方面表现出色。较小的 E2B 和 E4B 版本专为手机和笔记本设计，而较大的模型则面向中高端 CPU / VRAM 系统，例如配备 NVIDIA RTX GPU 的 PC。

| Gemma 4 版本  | 详情                                          | 最适合                   |
| ----------- | ------------------------------------------- | --------------------- |
| **E2B**     | <p>Dense + PLE（128K 上下文）<br>支持：文本、图像、音频</p> | 适用于手机 / 边缘推理、ASR、语音翻译 |
| **E4B**     | <p>Dense + PLE（128K 上下文）<br>支持：文本、图像、音频</p> | 适合笔记本和快速本地多模态使用的小模型   |
| **26B-A4B** | <p>MoE（256K 上下文）<br>支持：文本、图像</p>            | 计算机使用中的最佳速度 / 质量折中    |
| **31B**     | <p>Dense（256K 上下文）<br>支持：文本、图像</p>          | 最强性能，但推理较慢            |

**查看 Gemma 4：** [**性能基准测试**](#official-gemma-benchmarks) **和** [**GGUF 基准测试**](#unsloth-gguf-benchmarks)**.**

**我该选 26B-A4B 还是 31B？**

* **26B-A4B** - 在速度和准确性之间取得平衡。其 MoE 设计使其比 31B 更快，活跃参数为 4B。如果内存有限，并且你愿意用少许质量换取速度，就选它。
* **31B** - 目前最强的 Gemma 4 模型。如果你有足够的内存并能接受稍慢的速度，想要最高质量就选它。

### 硬件要求

**表：Gemma 4 推理 GGUF 推荐硬件要求** （单位 = 总内存：RAM + VRAM，或统一内存）。你可以在 MacOS、NVIDIA RTX GPU 等设备上使用 Gemma 4。

| Gemma 4 版本  |      4 位 |      8 位 | BF16 / FP16 |
| ----------- | -------: | -------: | ----------: |
| **E2B**     |     4 GB |   5–8 GB |       10 GB |
| **E4B**     | 5.5–6 GB |  9–12 GB |       16 GB |
| **26B A4B** | 16–18 GB | 28–30 GB |       52 GB |
| **31B**     | 17–20 GB | 34–38 GB |       62 GB |

{% hint style="info" %}
经验法则是，你可用的总内存至少应超过所下载量化模型的大小。如果没有达到，llama.cpp 仍可通过部分 RAM / 磁盘卸载运行，但生成速度会更慢。根据你使用的上下文窗口大小，你还会需要更多算力。
{% endhint %}

### 推荐设置

建议使用 Google 的默认 Gemma 4 参数：

* `temperature = 1.0`
* `top_p = 0.95`
* `top_k = 64`

本地推理的推荐实用默认值：

* 从 **32K 上下文** 开始以获得更好的响应速度，然后再增加
* 保持 **重复惩罚 / 存在惩罚** 禁用或设为 1.0，除非你看到循环输出。
* 句子结束标记是 `<turn|>`

{% hint style="info" %}
Gemma 4 的最大上下文是 **128K** 用于 **E2B / E4B** 和 **256K** 用于 **26B A4B / 31B**.
{% endhint %}

#### 思维模式

与较旧的 Gemma 聊天模板相比，Gemma 4 使用标准的 **`system`**, **`assistant`**，以及 **`user`** 角色，并增加了显式思维控制。

**如何启用思维：**

添加令牌 **`<|think|>`** 在 **系统提示词开头**.

{% columns %}
{% column %}
**已启用思维**

```
<|think|>
你是一位严谨的编程助手。请清晰地解释你的答案。
```

{% endcolumn %}

{% column %}
**已禁用思维**

```
你是一位严谨的编程助手。请清晰地解释你的答案。
```

{% endcolumn %}
{% endcolumns %}

**输出行为：**

{% columns %}
{% column %}
启用思维时，模型会在最终答案前输出其内部推理通道。

```
<|channel>thought
[内部推理]
<channel|>
[最终答案]
```

{% endcolumn %}

{% column %}
禁用思维时，较大的模型仍可能输出一个 **空的 thought 块** 然后才给出最终答案。

```
<|channel>thought
<channel|>
[最终答案]
```

{% endcolumn %}
{% endcolumns %}

**例如使用“**&#x6CD5;国的首都是哪里？”：

{% code overflow="wrap" %}

```
<bos><|turn>system\n<|think|><turn|>\n<|turn>user\n法国的首都是哪里？<turn|>\n<|turn>model\n
```

{% endcode %}

**然后它会输出：**

{% code overflow="wrap" %}

```
<|channel>thought\n用户在询问法国的首都。\n法国的首都是巴黎。<channel|>法国的首都是巴黎。<turn|>
```

{% endcode %}

**多轮聊天规则：**

对于多轮对话， **只在聊天历史中保留最终可见答案**。 **不要** 把之前的 thought 块再次输入到下一轮。

{% code overflow="wrap" %}

```
<bos><|turn>user\n1+1 等于几？<turn|>\n<|turn>model\n2<turn|>\n<|turn>user\n1+1 等于几？<turn|>\n<|turn>model\n2<turn|>\n<|turn>user\n1+1 等于几？<turn|>\n<|turn>model\n2<turn|>\n<|turn>user\n1+1 等于几？<turn|>\n<|turn>model\n2<turn|>\n
```

{% endcode %}

**如何禁用思维：**

注意 `llama-cli` 可能无法稳定工作，因此请使用 `llama-server` 来禁用推理：

{% hint style="warning" %}
要 [禁用思维 / 推理](#how-to-enable-or-disable-reasoning-and-thinking)，请使用 `--chat-template-kwargs '{"enable_thinking":false}'`

如果你使用的是 **Windows** Powershell，请用： `--chat-template-kwargs "{\"enable_thinking\":false}"`

可以交替使用 'true' 和 'false'。
{% endhint %}

## 运行 Gemma 4 教程

由于 Gemma 4 GGUF 有多种大小，小模型推荐的起点是 8 位，而大模型推荐的是 **动态 4 位**. [Gemma 4 GGUF](https://huggingface.co/collections/unsloth/gemma-4) 或 [MLX](#mlx-dynamic-quants):

| [gemma-4-E2B](https://huggingface.co/unsloth/gemma-4-E2B-it-GGUF) | [gemma-4-E4B](https://huggingface.co/unsloth/gemma-4-E4B-it-GGUF) | [gemma-4-26B-A4B](https://huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF) | [gemma-4-31B](https://huggingface.co/unsloth/gemma-4-31B-it-GGUF) |
| ----------------------------------------------------------------- | ----------------------------------------------------------------- | ------------------------------------------------------------------------- | ----------------------------------------------------------------- |

<a href="/pages/10f714f4a513e0d0a86b6f9d5945f9014729b035#unsloth-studio-guide" class="button primary">🦥 Unsloth Studio 指南</a><a href="/pages/10f714f4a513e0d0a86b6f9d5945f9014729b035#llama.cpp-guide" class="button primary">🦙 Llama.cpp 指南</a>

{% columns %}
{% column %}
**你可以在我们的** [**Unsloth Studio**](/docs/zh/xin/studio.md)✨ **notebook：**
{% endcolumn %}

{% column %}
{% embed url="<https://colab.research.google.com/github/unslothai/unsloth/blob/main/studio/Unsloth_Studio_Colab.ipynb>" %}
{% endcolumn %}
{% endcolumns %}

### 🦥 Unsloth Studio 指南

Gemma 4 现在可以在 [Unsloth Studio](/docs/zh/xin/studio.md)中运行和微调，这是我们全新的本地 AI 开源 Web UI。Unsloth Studio 让你可以在本地运行模型，支持 **MacOS、Windows**、Linux，以及：

{% columns %}
{% column %}

* 搜索、下载， [运行 GGUF](/docs/zh/xin/studio.md#run-models-locally) 和 safetensor 模型
* [**自愈式** 工具调用](/docs/zh/xin/studio.md#execute-code--heal-tool-calling) + **网页搜索**
* [**代码执行**](/docs/zh/xin/studio.md#run-models-locally) （Python、Bash）
* [自动推理](/docs/zh/xin/studio.md#model-arena) 参数调优（temp、top-p 等）
* 通过 llama.cpp 实现快速 CPU + GPU 推理
* [训练 LLM](/docs/zh/xin/studio.md#no-code-training) 速度提升 2 倍，VRAM 减少 70%
  {% endcolumn %}

{% column %}

<div data-with-frame="true"><figure><img src="/files/650cd087ac9ab1b567e284813a7713806d466601" alt=""><figcaption></figcaption></figure></div>
{% endcolumn %}
{% endcolumns %}

{% stepper %}
{% step %}

#### 安装 Unsloth

在终端中运行：

**MacOS、Linux、WSL：**

```bash
curl -fsSL https://unsloth.ai/install.sh | sh
```

**Windows PowerShell：**

```bash
irm https://unsloth.ai/install.ps1 | iex
```

{% endstep %}

{% step %}

#### 启动 Unsloth

**MacOS、Linux、WSL 和 Windows：**

```bash
unsloth studio -H 0.0.0.0 -p 8888
```

然后打开 `http://127.0.0.1:8888` 在你的浏览器中。
{% endstep %}

{% step %}

#### 搜索并下载 Gemma 4

首次启动时，你需要创建一个密码来保护你的账户，并在之后重新登录。然后你会看到一个简短的引导向导，用于选择模型、数据集和基本设置。你可以随时跳过。

然后前往 [Studio Chat](/docs/zh/xin/studio/chat.md) 标签页，在搜索栏中搜索 Gemma 4，并下载你想要的模型和量化版本。

<div data-with-frame="true"><figure><img src="/files/ae392b7077a8f5857a60be994eb52447f286483f" alt="" width="375"><figcaption></figcaption></figure></div>
{% endstep %}

{% step %}

#### 运行 Gemma 4

在使用 Unsloth Studio 时，推理参数应会自动设置，不过你仍然可以手动更改。你也可以编辑上下文长度、聊天模板和其他设置。

更多信息请查看我们的 [Unsloth Studio 推理指南](/docs/zh/xin/studio/chat.md).

<div data-with-frame="true"><figure><img src="/files/650cd087ac9ab1b567e284813a7713806d466601" alt="" width="563"><figcaption></figcaption></figure></div>
{% endstep %}
{% endstepper %}

### 🦙 Llama.cpp 指南

在本指南中，我们将对 26B-A4B 和 31B 使用动态 4 位，对 E2B 和 E4B 使用 8 位。参见： [Gemma 4 GGUF 集合](https://huggingface.co/collections/unsloth/gemma-4)

在这些教程中，我们将使用 [llama.cpp](llama.cpphttps://github.com/ggml-org/llama.cpp) 进行快速本地推理，特别是当你有 CPU 时。

{% stepper %}
{% step %}
获取最新的 `llama.cpp` **，在** [**GitHub 这里**](https://github.com/ggml-org/llama.cpp)。你也可以按照下面的构建说明进行操作。如果你没有 GPU，或者只想进行 CPU 推理，请将 `-DGGML_CUDA=ON` 改为 `-DGGML_CUDA=OFF` 。 **对于 Apple Mac / Metal 设备**，设置 `-DGGML_CUDA=OFF` 后继续按常规操作——Metal 支持默认已开启。

```bash
apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \\
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-mtmd-cli llama-server llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp
```

{% endstep %}

{% step %}
如果你想直接使用 `llama.cpp` 来加载模型，你可以根据每个模型按照下面的命令操作。 `UD-Q4_K_XL` 是量化类型。你也可以通过 Hugging Face 下载（第 3 步）。这类似于 `ollama run` 。使用 `export LLAMA_CACHE="folder"` 以强制 `llama.cpp` 保存到指定位置。无需设置上下文长度，因为 llama.cpp 会自动使用精确所需的数量。

**26B-A4B：**

```bash
export LLAMA_CACHE="unsloth/gemma-4-26B-A4B-it-GGUF"
./llama.cpp/llama-cli \\
    -hf unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_XL \
    --temp 1.0 \\
    --top-p 0.95 \
    --top-k 64
```

**31B：**

```bash
export LLAMA_CACHE="unsloth/gemma-4-31B-it-GGUF"
./llama.cpp/llama-cli \\
    -hf unsloth/gemma-4-31B-it-GGUF:UD-Q4_K_XL \
    --temp 1.0 \\
    --top-p 0.95 \
    --top-k 64
```

**E4B：**

```bash
export LLAMA_CACHE="unsloth/gemma-4-E4B-it-GGUF"
./llama.cpp/llama-cli \\
    -hf unsloth/gemma-4-E4B-it-GGUF:Q8_0 \
    --temp 1.0 \\
    --top-p 0.95 \
    --top-k 64
```

**E2B：**

```bash
export LLAMA_CACHE="unsloth/gemma-4-E2B-it-GGUF"
./llama.cpp/llama-cli \\
    -hf unsloth/gemma-4-E2B-it-GGUF:Q8_0 \
    --temp 1.0 \\
    --top-p 0.95 \
    --top-k 64
```

{% endstep %}

{% step %}
通过以下方式下载模型（安装 `pip install huggingface_hub hf_transfer` 之后）。你可以选择 `UD-Q4_K_XL` 或其他量化版本，例如 `Q8_0` 。如果下载卡住，请参见： [Hugging Face Hub，XET 调试](/docs/zh/ji-chu/troubleshooting-and-faqs/hugging-face-hub-xet-debugging.md)

```bash
hf download unsloth/gemma-4-26B-A4B-it-GGUF \
    --local-dir unsloth/gemma-4-26B-A4B-it-GGUF \
    --include "*mmproj-BF16*" \\
    --include "*UD-Q4_K_XL*" # 动态 2bit 请使用 "*UD-Q2_K_XL*"
```

{% endstep %}

{% step %}
然后以对话模式运行模型（带视觉 `mmproj-F16`):

{% code overflow="wrap" %}

```bash
./llama.cpp/llama-cli \\
    --model unsloth/gemma-4-26B-A4B-it-GGUF/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf \
    --mmproj unsloth/gemma-4-26B-A4B-it-GGUF/mmproj-BF16.gguf \
    --temp 1.0 \\
    --top-p 0.95 \
    --top-k 64
```

{% endcode %}
{% endstep %}

{% step %}

### Llama-server 部署

要在 llama-server 上部署 Gemma-4，请使用：

```bash
./llama.cpp/llama-server \\
    --model unsloth/gemma-4-26B-A4B-it-GGUF/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf \
    --mmproj unsloth/gemma-4-26B-A4B-it-GGUF/mmproj-BF16.gguf \
    --temp 1.0 \\
    --top-p 0.95 \
    --top-k 64 \
    --alias "unsloth/gemma-4-26B-A4B-it-GGUF" \
    --port 8001 \
    --chat-template-kwargs '{"enable_thinking":true}'
```

{% hint style="warning" %}
要 [禁用思维 / 推理](#how-to-enable-or-disable-reasoning-and-thinking)，请使用 `--chat-template-kwargs '{"enable_thinking":false}'`

如果你使用的是 **Windows** Powershell，请用： `--chat-template-kwargs "{\"enable_thinking\":false}"`

可以交替使用 'true' 和 'false'。
{% endhint %}
{% endstep %}
{% endstepper %}

### MLX 动态量化

我们还上传了动态 4 位和 8 位量化版本，作为 MacOS 设备上的首次试验！

{% hint style="success" %}
现在已支持 **视觉** ！
{% endhint %}

| Gemma 4 | 4 位 MLX                                                             | 8 位 MLX                                                          |
| ------- | ------------------------------------------------------------------- | ---------------------------------------------------------------- |
| 31B     | [链接](https://huggingface.co/unsloth/gemma-4-31b-it-UD-MLX-4bit)     | [链接](https://huggingface.co/unsloth/gemma-4-31b-it-MLX-8bit)     |
| 26B-A4B | [链接](https://huggingface.co/unsloth/gemma-4-26b-a4b-it-UD-MLX-4bit) | [链接](https://huggingface.co/unsloth/gemma-4-26b-a4b-it-MLX-8bit) |
| E4B     | [链接](https://huggingface.co/unsloth/gemma-4-E4B-it-UD-MLX-4bit)     | [链接](https://huggingface.co/unsloth/gemma-4-E4B-it-MLX-8bit)     |
| E2B     | [链接](https://huggingface.co/unsloth/gemma-4-E2B-it-UD-MLX-4bit)     | [链接](https://huggingface.co/unsloth/gemma-4-E2B-it-MLX-8bit)     |

要试用它们，请使用：

{% code overflow="wrap" %}

```bash
curl -fsSL https://raw.githubusercontent.com/unslothai/unsloth/refs/heads/main/scripts/install_gemma4_mlx.sh | sh
source ~/.unsloth/unsloth_gemma4_mlx/bin/activate
python -m mlx_vlm.chat --model unsloth/gemma-4-26b-a4b-it-UD-MLX-4bit
```

{% endcode %}

## Gemma 4 最佳实践

### 提示词示例

#### 简单推理提示词

```
系统：
<|think|>
你是一位精确的推理助手。

用户：
一列火车上午 8:15 出发，11:47 到达。旅程持续了多久？
```

#### OCR / 文档提示词

对于 OCR，请使用 **较高的视觉令牌预算** 例如 **560** 或 **1120**.

```
[图像优先]
提取这张收据中的所有文本。以 JSON 返回行项目、总额、商家和日期。
```

#### 多模态比较提示词

```
[图像 1]
[图像 2]
比较这两张截图，并告诉我哪一张更可能让新用户感到困惑。
```

#### 音频 ASR 提示词

```
[音频优先]
将以下英语语音片段转写为英语文本。

请遵循以下特定格式要求：
* 只输出转写内容，不要换行。
* 转写数字时，使用阿拉伯数字，即写 1.7 而不是 one point seven，并写 3 而不是 three。
```

#### 音频翻译提示词

```
[音频优先]
先将以下西班牙语语音片段转写出来，然后将其翻译成英语。
格式化答案时，先输出西班牙语转写内容，然后换一行，再输出字符串 'English: '，然后输出英语翻译。
```

### 多模态设置

为了在多模态提示中获得最佳结果，请将多模态内容放在前面：

* 将 **图像和/或音频放在文本之前**.
* 对于视频，先传入一系列帧，再给出指令。

#### 可变图像分辨率

Gemma 4 支持多种视觉令牌预算：

* `70`
* `140`
* `280`
* `560`
* `1120`

像这样使用它们：

* **70 / 140**：分类、图像描述、快速视频理解
* **280 / 560**：通用多模态聊天、图表、屏幕、UI 推理
* **1120**：OCR、文档解析、手写体、小文字

#### 音频和视频限制

* **音频** 仅在 **E2B** 和 **E4B** 上可用。
* 音频最多支持 **30 秒**.
* 视频最多支持 **60 秒** 假设 **每秒 1 帧** 处理。

#### 音频提示模板

**ASR 提示词**

```
将以下 {LANGUAGE} 语音片段转写为 {LANGUAGE} 文本。

请遵循以下特定格式要求：
* 只输出转写内容，不要换行。
* 转写数字时，使用阿拉伯数字，即写 1.7 而不是 one point seven，并写 3 而不是 three。
```

**语音翻译提示词**

```
先将以下 {SOURCE_LANGUAGE} 语音片段转写出来，然后将其翻译成 {TARGET_LANGUAGE}。
格式化答案时，先输出 {SOURCE_LANGUAGE} 的转写内容，然后换一行，再输出字符串 '{TARGET_LANGUAGE}: '，然后输出 {TARGET_LANGUAGE} 的翻译。
```

## 📊 基准测试

### Unsloth GGUF 基准测试

我们对各家提供商的 Gemma 4 GGUF 进行了平均 KL 散度基准测试，以帮助你选择最佳量化版本（越低越好）。

* KL 散度将所有 Unsloth GGUF 置于最先进的帕累托前沿上
* KLD 展示了量化模型与原始 BF16 输出分布的匹配程度，表明其保留的准确性。

<div data-with-frame="true"><figure><img src="/files/165ac9ca11098b16b371d7ca880c4b6b77335e1f" alt=""><figcaption><p>26B A4B - KLD 基准测试（越低越好）</p></figcaption></figure></div>

### 官方 Gemma 基准测试

| Gemma 4     | MMLU Pro | AIME 2026（无工具） | LiveCodeBench v6 | MMMU Pro |
| ----------- | -------: | -------------: | ---------------: | -------: |
| **31B**     |    85.2% |          89.2% |            80.0% |    76.9% |
| **26B A4B** |    82.6% |          88.3% |            77.1% |    73.8% |
| **E4B**     |    69.4% |          42.5% |            52.0% |    52.6% |
| **E2B**     |    60.0% |          37.5% |            44.0% |    44.2% |

<div data-with-frame="true"><figure><img src="/files/c08e39442f65ecbed28e4b7974151644bf4f22ce" alt=""><figcaption></figcaption></figure></div>


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://unsloth.ai/docs/zh/mo-xing/gemma-4.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.