# Gemma 4 Fine-tuning Guide

You can now train Google's [Gemma 4](/docs/zh/mo-xing/gemma-4.md) E2B, E4B, 26B-A4B and 31B with [**Unsloth**](https://github.com/unslothai/unsloth). Unsloth supports all of Gemma 4's vision, text, audio and reinforcement learning (RL) fine-tuning.

* Unsloth trains Gemma 4 **~1.5x faster** with **~60% less VRAM** than FA2 setups (no accuracy loss)
* We fixed many universal [Gemma 4 training bugs](#bug-fixes--tips) (they did not originate from Unsloth).
* Gemma 4 E2B trains on **8GB VRAM**. E4B requires 10GB VRAM.

<a href="/pages/33fa9e3bb3ccf6a5c0011aa600e98abbe3a829e3#quickstart" class="button primary" data-icon="bolt">Quickstart</a><a href="/pages/33fa9e3bb3ccf6a5c0011aa600e98abbe3a829e3#bug-fixes--tips" class="button secondary" data-icon="sparkle">Bug Fixes + Tips</a>

Fine-tune Gemma 4 with our **free** **Google Colab notebooks**:

| [**E4B + E2B** (Studio)](https://colab.research.google.com/github/unslothai/unsloth/blob/main/studio/Unsloth_Studio_Colab.ipynb) | [**31B** (Kaggle)](https://www.kaggle.com/code/danielhanchen/gemma4-31b-unsloth) | [E4B **(Vision + Text)**](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Gemma4_\(E4B\)-Vision.ipynb) | [E4B **(Audio)**](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Gemma4_\(E4B\)-Audio.ipynb) | [E2B **(RL GRPO)**](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Gemma4_\(E2B\)_Reinforcement_Learning_Sudoku_Game.ipynb) |
| -------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------ |

{% columns %}
{% column %}
You can now run and train Gemma 4 for free in our UI with the [Unsloth Studio](/docs/zh/xin/studio.md)✨ notebook:

You can also find more [notebooks here](#unsloth-core-code-based-guide).
{% endcolumn %}

{% column %}
{% embed url="<https://colab.research.google.com/github/unslothai/unsloth/blob/main/studio/Unsloth_Studio_Colab.ipynb>" %}
{% endcolumn %}
{% endcolumns %}

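* You can also train Gemma 4 with [reinforcement learning](#reinforcement-learning-rl) (RL) on 9GB VRAM.
* Gemma 4 E2B LoRA works on 8-10GB VRAM. E4B LoRA requires 17GB VRAM.
* **31B QLoRA works on 22GB**, while 26B-A4B LoRA needs >40GB
* **Exporting**/saving models to GGUF and other formats, as well as full fine-tuning **(FFT)**, also works.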

### :bug: Bug Fixes + Tips

{% hint style="success" %}
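If you see **a loss of 13-15 for Gemma-4 E2B and E4B, this is perfectly normal** - it is a common quirk of multimodal models. The same happened with Gemma-3N, Llama Vision, Mistral vision models and others.

**Gemma 26B and 31B have a lower loss, at 1-3 or below. Vision tasks will be about 2x higher, so 3-5**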
{% endhint %}

#### :grapes:Gradient accumulation can inflate your loss

{% columns %}
{% column %}

<div data-with-frame="true"><figure><img src="/files/9dccd2ec361315ebfc342fc7a74f7f37af9da602" alt=""><figcaption></figcaption></figure></div>
{% endcolumn %}

{% column %}

<div data-with-frame="true"><figure><img src="/files/27c627eb5ef78e283ed8831a6c6b2ceff9496e34" alt=""><figcaption></figcaption></figure></div>
{% endcolumn %}
{% endcolumns %}

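If you see a loss above 13-15 (for example 100 or 300), gradient accumulation is most likely not being handled correctly. We have **fixed this in Unsloth and Unsloth Studio.**

To learn more about gradient accumulation, see our gradient accumulation bug fix blog: <https://unsloth.ai/blog/gradient>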
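
As a rough illustration of how accumulated loss can drift from the full-batch value, the sketch below compares two ways of combining micro-batches whose token counts differ. The function names and the per-token loss values are made up for this example, and it is not the actual Unsloth fix: it only shows that averaging per-micro-batch means is not the same as dividing the summed token losses by the total token count.

```python
def naive_accumulated_loss(micro_batches):
    """Average the mean loss of each micro-batch (ignores token counts)."""
    means = [sum(batch) / len(batch) for batch in micro_batches]
    return sum(means) / len(means)


def token_weighted_loss(micro_batches):
    """Divide the sum of all token losses by the total number of tokens."""
    total_loss = sum(sum(batch) for batch in micro_batches)
    total_tokens = sum(len(batch) for batch in micro_batches)
    return total_loss / total_tokens


# Hypothetical per-token losses: a short sequence and a long sequence.
micro_batches = [
    [2.0, 2.0],                      # 2 tokens
    [4.0, 4.0, 4.0, 4.0, 4.0, 4.0],  # 6 tokens
]

naive = naive_accumulated_loss(micro_batches)   # (2.0 + 4.0) / 2
weighted = token_weighted_loss(micro_batches)   # (4.0 + 24.0) / 8
print(naive, weighted)
```

The two numbers only agree when every micro-batch contains the same number of trained tokens, which is rarely the case with variable-length chat data.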

#### :interrobang:IndexError in Gemma-4 31B and 26B-A4B inference

When running inference with 31B and 26B-A4B, you may see this error:

```python
File "/.../cache_utils.py", line 937, in update
    keys, values = self.layers[layer_idx].update(...)
IndexError: list index out of range
```

The culprit is the following code:

```python
if hasattr(decoder_config, "num_kv_shared_layers"):
    layer_types = layer_types[: -decoder_config.num_kv_shared_layers]
```

Gemma-4 31B and 26B-A4B ship with `num_kv_shared_layers = 0`. In Python, `-0 == 0`, so `layer_types[:-0]` collapses to `layer_types[:0] == []`. The cache is then built with zero layer slots, and the very first attention forward pass fails inside `Cache.update`.
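
The slicing pitfall is easy to reproduce in plain Python. The sketch below uses an illustrative `layer_types` list and a hypothetical helper, `trim_shared_layers`, to show one way of guarding the slice; it is not the exact upstream patch.

```python
# Illustrative layer list; the real one comes from the model config.
layer_types = ["sliding_attention", "sliding_attention", "full_attention"]
num_kv_shared_layers = 0

# -0 is 0, so this is layer_types[:0], an empty list.
buggy = layer_types[:-num_kv_shared_layers]


def trim_shared_layers(layer_types, num_kv_shared_layers):
    """Drop the trailing KV-shared layers, leaving the list intact when there are none."""
    if num_kv_shared_layers > 0:
        return layer_types[:-num_kv_shared_layers]
    return list(layer_types)


fixed = trim_shared_layers(layer_types, num_kv_shared_layers)
print(len(buggy), len(fixed))
```

With zero shared layers the guarded version keeps all three entries, so a cache built from it has one slot per layer.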

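#### :no\_entry: `use_cache = False` produced gibberish for E2B, E4B

[See the issue](https://github.com/huggingface/transformers/issues/45242): "\[Gemma 4] `use_cache=False` corrupts attention computation, producing garbage logits #45242"

Gemma-4 E2B and E4B share KV state across layers (`num_kv_shared_layers = 20` and `18`). The cache is the only place where earlier layers store KV for later layers to reuse. When `use_cache=False` (as every QLoRA tutorial sets it, and as `gradient_checkpointing=True` forces), `Gemma4TextModel.forward` skips building the cache, so the KV-shared layers fall back to recomputing K and V locally from the current hidden states. The logits become garbage and the training loss diverges.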

**Before the fix (`unsloth/gemma-4-E2B-it`, prompt "What is 1+1?"):**

```
use_cache=True  -> '1 + 1 = **2**'
use_cache=False -> 'BROAD\肯. Specificallyboard K supposed\_n통  \'
max_abs_logit_diff: 48.937500
```

**After the fix:**

```
use_cache=True  -> '1 + 1 = **2**'
use_cache=False -> '1 + 1 = **2**'
max_abs_logit_diff: 0.000000     (bit-exact, all 9 tokens identical)
```
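
The `max_abs_logit_diff` figure is the largest element-wise gap between the logits of the two runs. A minimal sketch with plain Python lists follows; the helper name and the toy values are assumptions for illustration, and with PyTorch tensors the same quantity is `(a - b).abs().max()`.

```python
def max_abs_logit_diff(logits_a, logits_b):
    """Largest absolute element-wise difference between two equal-shaped logit tables."""
    if len(logits_a) != len(logits_b):
        raise ValueError("logit tables must have the same number of rows")
    return max(
        abs(a - b)
        for row_a, row_b in zip(logits_a, logits_b)
        for a, b in zip(row_a, row_b)
    )


# Toy logits (made-up numbers): one row per token, one column per vocabulary entry.
cached = [[0.5, -1.25], [2.0, 0.0]]
uncached_matching = [[0.5, -1.25], [2.0, 0.0]]
uncached_broken = [[0.5, 3.75], [2.0, 0.0]]

print(max_abs_logit_diff(cached, uncached_matching))
print(max_abs_logit_diff(cached, uncached_broken))
```

A value of exactly zero means both code paths produce identical logits; any large value signals that the two paths compute different attention.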

#### :radio:Audio float16 overflow

`Gemma4AudioAttention` uses `config.attention_invalid_logits_value = -1e9` in a `masked_fill` call. On fp16 (Tesla T4), the magnitude of -1e9 exceeds the fp16 maximum of 65504, which causes:

```python
RuntimeError: value cannot be converted to type c10::Half without overflow
```

This is caused by `self.config.attention_invalid_logits_value`:

```python
attn_weights = attn_weights.masked_fill(
    attention_mask.logical_not(), self.config.attention_invalid_logits_value
)
```
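
The overflow can be checked without a GPU. The sketch below uses the standard library's half-precision `struct` format to show that -1e9 does not fit in fp16, and that clamping the fill value to the fp16 range does. The helper names are hypothetical and this is not the actual fix that was applied; in PyTorch the equivalent bound is `torch.finfo(torch.float16).min`.

```python
import struct

FP16_MAX = 65504.0  # largest finite float16 value


def fits_in_fp16(value):
    """Return True if value can be packed as an IEEE 754 half-precision float."""
    try:
        struct.pack("<e", value)
        return True
    except OverflowError:
        return False


def clamp_to_fp16(value):
    """Clamp a mask fill value into the finite float16 range."""
    return max(-FP16_MAX, min(FP16_MAX, value))


invalid_logits_value = -1e9
safe_value = clamp_to_fp16(invalid_logits_value)

print(fits_in_fp16(invalid_logits_value))  # the original fill value
print(fits_in_fp16(safe_value))            # the clamped fill value
```

Any sufficiently negative finite value works as a mask fill, since softmax drives those positions to zero either way.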

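#### 💡Gemma-4 Tips

1. If you want to **preserve reasoning** ability, you can mix reasoning-style examples with direct answers (keep at least 75% reasoning). Otherwise you can omit reasoning entirely.\
   \
   Use `gemma-4` for the non-thinking chat template and `gemma-4-thinking` for the thinking variant.\
   We recommend the thinking variant for the larger 26B and 31B models, and the non-thinking variant for the small models.<br>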

   ```python
   from unsloth.chat_templates import get_chat_template
   tokenizer = get_chat_template(
       tokenizer,
       chat_template = "gemma-4-thinking", # or "gemma-4"
   )
   ```
2. To turn thinking mode on or off, pass `enable_thinking = True / False` to `tokenizer.apply_chat_template`<br>

   Thinking enabled:

   <pre class="language-python" data-overflow="wrap"><code class="lang-python">processor.tokenizer.apply_chat_template([
       {"role" : "user", "content" : "What is 2+2?"},
   ], tokenize = False, enable_thinking = True, add_generation_prompt = True)
   </code></pre>

   Will output `<bos><|turn>system\n<|think|><turn|>\n<|turn>user\nWhat is 2+2?<turn|>\n<|turn>model\n`<br>

   Thinking disabled:

   ```python
   processor.tokenizer.apply_chat_template([
       {"role" : "user", "content" : "What is 2+2?"},
   ], tokenize = False, enable_thinking = False, add_generation_prompt = True)
   ```

   Will output `<bos><|turn>user\nWhat is 2+2?<turn|>\n<|turn>model\n<|channel>thought\n<channel|>`
3. Gemma 4 is well suited to multilingual fine-tuning, since it supports 140 languages.
4. We recommend training **E4B QLoRA** rather than **E2B LoRA**, because E4B is larger and the accuracy loss from quantization is minimal. Gemma 4 E4B LoRA is even better.
5. After fine-tuning, you can export to [GGUF](#saving-export-your-fine-tuned-model) (for llama.cpp/Unsloth/Ollama etc.)
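
For the reasoning mix in tip 1, one possible way to cap the share of direct-answer examples is sketched below. `mix_reasoning` is a hypothetical helper written for this page, not part of Unsloth, and it assumes both datasets are plain Python lists of examples.

```python
import random


def mix_reasoning(reasoning, direct, min_reasoning_frac=0.75, seed=3407):
    """Return a shuffled mix in which reasoning examples make up at least min_reasoning_frac."""
    if not reasoning:
        raise ValueError("need at least one reasoning example")
    # Largest number of direct examples that keeps the reasoning share at or above the target.
    max_direct = int(len(reasoning) * (1 - min_reasoning_frac) / min_reasoning_frac)
    mixed = list(reasoning) + list(direct)[:max_direct]
    random.Random(seed).shuffle(mixed)
    return mixed


# Toy data: 75 reasoning-style rows and 100 direct-answer rows.
reasoning_rows = [{"kind": "reasoning", "id": i} for i in range(75)]
direct_rows = [{"kind": "direct", "id": i} for i in range(100)]

mixed = mix_reasoning(reasoning_rows, direct_rows)
share = sum(row["kind"] == "reasoning" for row in mixed) / len(mixed)
print(len(mixed), share)
```

Here all 75 reasoning rows are kept and only as many direct rows are added as the 75% floor allows.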

### ⚡Quickstart

#### 🦥 Unsloth Studio Guide

{% columns %}
{% column %}
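Gemma 4 can be run and fine-tuned in [Unsloth Studio](/docs/zh/xin/studio.md), our new open-source web UI for local AI.

With Unsloth Studio you can run models locally on **MacOS, Windows** and Linux, and train on NVIDIA GPUs. Intel, MLX and AMD training support is coming this month.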
{% endcolumn %}

{% column %}

<div data-with-frame="true"><figure><img src="/files/50f000d1bf4775dd4acd552c14d38600cd6e8c39" alt=""><figcaption></figcaption></figure></div>
{% endcolumn %}
{% endcolumns %}

{% stepper %}
{% step %}

#### Install Unsloth

Run in your terminal:

**MacOS, Linux, WSL:**

```bash
curl -fsSL https://unsloth.ai/install.sh | sh
```

**Windows PowerShell：**

```bash
irm https://unsloth.ai/install.ps1 | iex
```

{% hint style="success" %}
**Installation is quick and takes about 1-2 minutes.**
{% endhint %}
{% endstep %}

{% step %}

#### Launch Unsloth

**MacOS, Linux, WSL and Windows:**

```bash
unsloth studio -H 0.0.0.0 -p 8888
```

**Then open `http://localhost:8888` in your browser.**
{% endstep %}

{% step %}

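#### Train Gemma 4

On first launch you will need to create a password to secure your account, then sign in again. You will then see a short onboarding wizard for choosing a model, dataset and basic settings. You can skip it at any time.

Search for Gemma 4 in the search bar and select the model and dataset you want. Then adjust the hyperparameters and context length as needed.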

<div data-with-frame="true"><figure><img src="/files/50f000d1bf4775dd4acd552c14d38600cd6e8c39" alt="" width="563"><figcaption></figcaption></figure></div>
{% endstep %}

{% step %}

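#### Monitor training progress

After you click Start Training, you can monitor and observe the model's training progress. The training loss should decrease steadily.\
Once training finishes, the model is saved automatically.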

<div data-with-frame="true"><figure><img src="/files/8307e19c6d02f6c28d84f5386706c3b37f958067" alt="" width="563"><figcaption></figcaption></figure></div>
{% endstep %}

{% step %}

#### Export your fine-tuned model

Once training is done, Unsloth Studio lets you export the model to GGUF, safetensors and other formats.

<div data-with-frame="true"><figure><img src="/files/d1130dc8880b70d1db60cadb771bd69c69088e90" alt="" width="563"><figcaption></figcaption></figure></div>
{% endstep %}

{% step %}

#### Compare the fine-tuned model with the original

Click `Compare Mode` to compare the LoRA adapter with the original model.

<div data-with-frame="true"><figure><img src="/files/80d70deeeeff75ab5fc4b65e9206475c99471833" alt="" width="563"><figcaption></figcaption></figure></div>
{% endstep %}
{% endstepper %}

#### 🦥 Unsloth Core (code-based) Guide

We made free notebooks for Gemma 4:

| [E4B **(Inference + Text)**](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Gemma4_\(E4B\)-Text.ipynb) | [E4B **(Vision + Text)**](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Gemma4_\(E4B\)-Vision.ipynb) | [E4B **(Audio)**](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Gemma4_\(E4B\)-Audio.ipynb) |
| ------------------------------------------------------------------------------------------------------------------------ | -------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------- |
| [**31B** (Kaggle)](https://www.kaggle.com/code/danielhanchen/gemma4-31b-unsloth)                                         | [E2B **(Vision + Text)**](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Gemma4_\(E2B\)-Vision.ipynb) | [E2B **(Audio)**](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Gemma4_\(E2B\)-Audio.ipynb) |

And for reinforcement learning (RL): [E2B **(RL GRPO)**](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Gemma4_\(E2B\)_Reinforcement_Learning_Sudoku_Game.ipynb)

We also made notebooks for the larger Gemma 4 models, but they require an A100:

| [Gemma-4-26B-A4B](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Gemma4_\(26B_A4B\)-Vision.ipynb) - A100 GPU | [Gemma-4-31B](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Gemma4_\(31B\)-Vision.ipynb) - A100 GPU |
| --------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------- |

{% hint style="info" %}
**If you want to do** [**GRPO**](/docs/zh/kai-shi-shi-yong/reinforcement-learning-rl-guide.md)**, it works in Unsloth as long as you disable fast vLLM inference and use Unsloth inference instead. See our** [**Vision RL**](/docs/zh/kai-shi-shi-yong/reinforcement-learning-rl-guide/vision-reinforcement-learning-vlm-rl.md) **notebook examples.**
{% endhint %}

Below is a standalone text SFT recipe for Gemma-4-26B-A4B-it. It covers text only; see also our [vision fine-tuning](/docs/zh/ji-chu/vision-fine-tuning.md) section for more details.

{% code expandable="true" %}

````python
from unsloth import FastModel
import torch

model, tokenizer = FastModel.from_pretrained(
    model_name = "unsloth/gemma-4-26B-A4B-it", # Change this to unsloth/gemma-4-E2B-it etc.
    dtype = None, # None for auto detection
    max_seq_length = 8192, # Choose any for long context!
    load_in_4bit = True,  # 4-bit quantization to reduce memory
    full_finetuning = False, # [NEW!] We have full finetuning now!
    # token = "YOUR_HF_TOKEN", # HF token for gated models
)

"""# Gemma 4 can process text, vision and audio!

Let's first see how Gemma 4 handles multimodal inputs. We use Gemma 4's recommended settings of `temperature = 1.0, top_p = 0.95, top_k = 64`
"""

from transformers import TextStreamer
# Helper function for inference
def do_gemma_4_inference(messages, max_new_tokens = 128):
    _ = model.generate(
        **tokenizer.apply_chat_template(
            messages,
            add_generation_prompt = True, # Must add for generation
            tokenize = True,
            return_dict = True,
            return_tensors = "pt",
        ).to("cuda"),
        max_new_tokens = max_new_tokens,
        use_cache=True,
        temperature = 1.0, top_p = 0.95, top_k = 64,
        streamer = TextStreamer(tokenizer, skip_prompt = True),
    )

"""# Gemma 4 can see images!

<img src="https://files.worldwildlife.org/wwfcmsprod/images/Sloth_Sitting_iStock_3_12_2014/story_full_width/8l7pbjmj29_iStock_000011145477Large_mini__1_.jpg" alt="Alt text" height="256">
"""

sloth_link = "https://files.worldwildlife.org/wwfcmsprod/images/Sloth_Sitting_iStock_3_12_2014/story_full_width/8l7pbjmj29_iStock_000011145477Large_mini__1_.jpg"

messages = [{
    "role" : "user",
    "content": [
        { "type": "image", "image" : sloth_link },
        { "type": "text",  "text" : "Which films does this animal feature in?" }
    ]
}]
# You might have to wait 1 minute for Unsloth's auto compiler
do_gemma_4_inference(messages, max_new_tokens = 256)

"""Let's write a poem about sloths!"""

messages = [{
    "role": "user",
    "content": [{ "type" : "text",
                  "text" : "Write a poem about sloths." }]
}]
do_gemma_4_inference(messages)

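"""# Let's finetune Gemma 4!

You can now choose to finetune the vision and text parts. The audio part can also be finetuned, and we are working on making it selectable as well!

We now add LoRA adapters so we only need to update a small number of parameters!
"""

model = FastModel.get_peft_model(
    model,
    finetune_vision_layers     = False, # Turn off for just text!
    finetune_language_layers   = True,  # Should leave on!
    finetune_attention_modules = True,  # Attention good for GRPO
    finetune_mlp_modules       = True,  # Should leave on always!

    r = 8,           # Larger = higher accuracy, but might overfit
    lora_alpha = 8,  # Recommended alpha == r at least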
    lora_dropout = 0,
    bias = "none",
    random_state = 3407,
)

"""<a name="Data"></a>
### Data Prep
We now use the `Gemma-4` format for conversation-style finetunes. We use Maxime Labonne's [FineTome-100k](https://huggingface.co/datasets/mlabonne/FineTome-100k) dataset in ShareGPT style. Gemma-4 renders multi-turn conversations like below:

```
<bos><|turn>user
Hello<turn|>
<|turn>model
Hey there!<turn|>
```
We use our `get_chat_template` function to get the correct chat template. We support `zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, phi3, llama3, phi4, qwen2.5, gemma3, gemma-4` and more.
"""

from unsloth.chat_templates import get_chat_template
tokenizer = get_chat_template(
    tokenizer,
    chat_template = "gemma-4-thinking",
)

"""We take the first 3000 rows of the dataset"""

from datasets import load_dataset
dataset = load_dataset("mlabonne/FineTome-100k", split = "train[:3000]")

"""We now use `standardize_data_formats` to try converting the dataset to the correct format for finetuning!"""

from unsloth.chat_templates import standardize_data_formats
dataset = standardize_data_formats(dataset)

"""Let's see what row 100 looks like!"""

dataset[100]

"""We now have to apply the chat template for `Gemma-4` onto the conversations and save it to `text`. We remove the `<bos>` token using removeprefix(`'<bos>'`) since we are finetuning. The processor will add this token before training, and the model expects only one."""

def formatting_prompts_func(examples):
   convos = examples["conversations"]
   texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False).removeprefix('<bos>') for convo in convos]
   return { "text" : texts, }

dataset = dataset.map(formatting_prompts_func, batched = True)

"""Let's see how the chat template did! Notice there is no `<bos>` token, as the processor tokenizer will add one."""

dataset[100]["text"]

"""<a name="Train"></a>
### Train the model
Now let's train our model. We do 60 steps to speed things up, but for a full run you can set `num_train_epochs=1` and `max_steps=None`.
"""

from trl import SFTTrainer, SFTConfig
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    eval_dataset = None, # Can set up evaluation!
    args = SFTConfig(
        dataset_text_field = "text",
        per_device_train_batch_size = 1,
        gradient_accumulation_steps = 4, # Use GA to mimic batch size!
        warmup_steps = 5,
        # num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 60,
        learning_rate = 2e-4, # Reduce to 2e-5 for long training runs
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.001,
        lr_scheduler_type = "linear",
        seed = 3407,
        report_to = "none", # Use TrackIO/WandB etc.
    ),
)

"""We also use Unsloth's `train_on_responses_only` method to train only on the assistant outputs and ignore the loss on the user's inputs. This helps increase finetuning accuracy!"""

from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(
    trainer,
    instruction_part = "<|turn>user\n",
    response_part = "<|turn>model\n",
)

"""Let's verify that the instruction part is masked correctly! We print row 100 again. Notice there is only one `<bos>` in the sample, as expected!"""

tokenizer.decode(trainer.train_dataset[100]["input_ids"])

"""Now let's print the masked-out example. You should see only the answer:"""

tokenizer.decode([tokenizer.pad_token_id if x == -100 else x for x in trainer.train_dataset[100]["labels"]]).replace(tokenizer.pad_token, " ")

"""# Let's train the model!

To resume a training run, set `trainer.train(resume_from_checkpoint = True)`
"""

trainer_stats = trainer.train()
````

{% endcode %}
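
To put the settings in the recipe above in perspective, the arithmetic below works out the effective batch size and how much of the 3000-row subset a 60-step run actually touches. `training_budget` is a hypothetical helper for this page, and it assumes a single GPU and one row per sample.

```python
def training_budget(per_device_batch, grad_accum_steps, max_steps, dataset_rows, num_devices=1):
    """Return (effective batch size, samples seen, fraction of the dataset covered)."""
    effective_batch = per_device_batch * grad_accum_steps * num_devices
    samples_seen = effective_batch * max_steps
    return effective_batch, samples_seen, samples_seen / dataset_rows


# Values taken from the SFTConfig above: batch size 1, 4 accumulation steps, 60 steps, 3000 rows.
effective_batch, samples_seen, fraction = training_budget(1, 4, 60, 3000)
print(effective_batch, samples_seen, fraction)
```

So the 60-step demo run sees 240 samples, well under one epoch, which is why a full run switches to `num_train_epochs`.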

{% hint style="info" %}
If you run out of memory (OOM):

* Reduce `per_device_train_batch_size` to **1** and/or lower `max_seq_length`.
* Keep `use_`[`gradient_checkpointing`](/docs/zh/bo-ke/500k-context-length-fine-tuning.md#unsloth-gradient-checkpointing-enhancements)`="unsloth"` on (it is designed to reduce VRAM use and extend context length).
  {% endhint %}

**Loading example for MoE (bf16 LoRA):**

```python
import os
import torch
from unsloth import FastModel

model, tokenizer = FastModel.from_pretrained(
    model_name = "unsloth/gemma-4-26B-A4B-it",
    max_seq_length = 2048,
    load_in_4bit = False,     # MoE QLoRA is not recommended; the dense 31B is fine
    load_in_16bit = True,     # bf16/16-bit LoRA
    full_finetuning = False,
)
```

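After loading, you attach LoRA adapters and train in much the same way as the SFT example above.

### Reinforcement Learning (RL)

You can now train Gemma 4 with RL, GSPO, GRPO and more using [our free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Gemma4_\(E2B\)_Reinforcement_Learning_Sudoku_Game.ipynb).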

{% columns %}
{% column %}
Gemma 4 E2B RL works on 9GB.

{% embed url="<https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Gemma4_(E2B)_Reinforcement_Learning_Sudoku_Game.ipynb>" %}

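The notebook's goal is to have Gemma 4 learn to solve Sudoku puzzles using [GRPO](/docs/zh/kai-shi-shi-yong/reinforcement-learning-rl-guide.md#from-rlhf-ppo-to-grpo-and-rlvr).

The model devises a strategy for filling in the empty cells, and we reward it for correct placements and for completing a valid puzzle.

Even if vLLM does not support it, you can still run Gemma 4 RL with Unsloth by setting `fast_inference=False` when loading the model: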
{% endcolumn %}

{% column %}

<figure><img src="/files/7d4bccab277c9bcd1776a34163650b1598f908ef" alt=""><figcaption></figcaption></figure>
{% endcolumn %}
{% endcolumns %}

```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/gemma-4-E2B-it",
    fast_inference=False,
)
```

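### MoE Fine-tuning (26B-A4B)

The **26B-A4B** model is the speed / quality middle ground of the Gemma 4 family. Because it is a **MoE** model, only a subset of its parameters is active for each token, so a conservative fine-tuning approach is:

* Use **LoRA** rather than full fine-tuning
* Prefer **16-bit / bf16 LoRA** if memory allows
* Start with shorter contexts and a smaller rank
* Scale up only once the pipeline is stable

If your goal is the highest quality and you have more memory, use **31B** instead.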

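### Multimodal Fine-tuning (E2B / E4B)

Because **E2B** and **E4B** support both **image** and **audio** inputs, they are the main Gemma 4 variants for multimodal fine-tuning.

* Load multimodal models with `FastVisionModel`
* Keep `finetune_vision_layers = False` at first
* Fine-tune only the language, attention and MLP layers
* Enable the vision or audio layers later if the task needs them

#### Gemma 4 multimodal LoRA example: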

{% code expandable="true" %}

````python
from unsloth import FastVisionModel # FastLanguageModel for LLMs
import torch

model, processor = FastVisionModel.from_pretrained(
    "unsloth/gemma-4-E4B-it", # E2B / E4B are the multimodal variants this section covers
    load_in_4bit = True, # Use 4-bit to reduce memory use. Set to False for 16-bit LoRA.
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for long context
)

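"""We now add LoRA adapters for parameter-efficient fine-tuning, which lets us train only 1% of all model parameters efficiently.

**[NEW]** We also support fine-tuning only the vision component, only the language component, or both. Additionally, you can choose to fine-tune the attention modules, the MLP layers, or both!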
"""

model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers     = True, # False if not finetuning vision layers
    finetune_language_layers   = True, # False if not finetuning language layers
    finetune_attention_modules = True, # False if not finetuning attention layers
    finetune_mlp_modules       = True, # False if not finetuning MLP layers

    r = 32,                           # The larger, the higher the accuracy, but might overfit
    lora_alpha = 32,                  # Recommended alpha == r at least
    lora_dropout = 0,
    bias = "none",
    random_state = 3407,
    use_rslora = False,               # We support rank stabilized LoRA
    loftq_config = None,               # And LoftQ
    target_modules = "all-linear",    # Optional now! Can specify a list if needed
)

"""<a name="Data"></a>
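### Data Prep
We'll use a sampled dataset of handwritten math formulas. The goal is to convert these images into a computer-readable format, specifically LaTeX, so they can be rendered. This is particularly useful for complex expressions.

You can access the dataset [here](https://huggingface.co/datasets/unsloth/LaTeX_OCR). The full dataset is [here](https://huggingface.co/datasets/linxy/LaTeX_OCR).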
"""

from datasets import load_dataset
dataset = load_dataset("unsloth/LaTeX_OCR", split = "train")

"""Let's get an overview of the dataset. We'll look at the image at index 2 and its corresponding caption."""

dataset

dataset[2]["image"]

dataset[2]["text"]

"""We can also render the LaTeX directly in the browser!"""

from IPython.display import display, Math, Latex

latex = dataset[3]["text"]
display(Math(latex))

"""To format the dataset, all vision finetuning tasks should follow this format:

```python
[
    {
        "role": "user",
        "content": [
            {"type": "text", "text": instruction},
            {"type": "image", "image": sample["image"]},
        ],
    },
    {
        "role": "assistant",
        "content": [
            {"type": "text", "text": sample["text"]},
        ],
    },
]
```
"""

instruction = "Write the LaTeX representation for this image."

def convert_to_conversation(sample):
    conversation = [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": instruction},
                {"type": "image", "image": sample["image"]},
            ],
        },
        {"role": "assistant", "content": [{"type": "text", "text": sample["text"]}]},
    ]
    return {"messages": conversation}
pass

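"""Let's convert the dataset into the "correct" format for finetuning:"""

converted_dataset = [convert_to_conversation(sample) for sample in dataset]

"""The first example looks like this:"""

converted_dataset[0]

"""Let's take the Gemma 4 instruction chat template and use it in our base model"""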

from unsloth import get_chat_template

processor = get_chat_template(
    processor,
    "gemma-4-thinking"
)

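"""Before fine-tuning, let's evaluate the base model's performance. We do not expect strong results, since it has not seen this chat template before."""

image = dataset[2]["image"]
instruction = "Write the LaTeX representation for this image."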

messages = [
    {
        "role": "user",
        "content": [{"type": "image"}, {"type": "text", "text": instruction}],
    }
]
input_text = processor.apply_chat_template(messages, add_generation_prompt = True)
inputs = processor(
    image,
    input_text,
    add_special_tokens = False,
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer

text_streamer = TextStreamer(processor, skip_prompt = True)
result = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128,
                        use_cache = True, temperature = 1.0, top_p = 0.95, top_k = 64)

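"""You can see it's absolutely terrible! It doesn't follow the instructions at all

<a name="Train"></a>
### Train the model
Now let's train our model. We do 60 steps to speed things up, but for a full run you can set `num_train_epochs=1` and `max_steps=None`. We also support `DPOTrainer` and `GRPOTrainer` for reinforcement learning!!

We use our new `UnslothVisionDataCollator`, which helps with our vision finetuning setup.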
"""

from unsloth.trainer import UnslothVisionDataCollator
from trl import SFTTrainer, SFTConfig

trainer = SFTTrainer(
    model = model,
    train_dataset = converted_dataset,
    processing_class = processor.tokenizer,
    data_collator = UnslothVisionDataCollator(model, processor),
    args = SFTConfig(
        per_device_train_batch_size = 1,
        gradient_accumulation_steps = 4,
        max_grad_norm = 0.3,
        warmup_ratio = 0.03,
        max_steps = 60,
        # num_train_epochs = 2, # Use this instead of max_steps for full training runs
        learning_rate = 2e-4,
        logging_steps = 1,
        save_strategy = "steps",
        optim = "adamw_8bit",
        weight_decay = 0.001,
        lr_scheduler_type = "cosine",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # For Weights and Biases or others

        # You MUST put the below items for vision finetuning:
        remove_unused_columns = False,
        dataset_text_field = "",
        dataset_kwargs = {"skip_prepare_dataset": True},
        max_length = 2048,
    )
)

trainer_stats = trainer.train()
````

{% endcode %}

#### Image example format

Remember: for Gemma 4 multimodal prompts, put the image **before** the text instruction.

{% code expandable="true" %}

```json
{
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "image", "image": "/path/to/image OR object"},
        {"type": "text", "text": "Extract all text from this receipt. Return the line items, total, merchant and date as JSON."}
      ]
    },
    {
      "role": "assistant",
      "content": [
        {"type": "text", "text": "{\"merchant\": \"Example Store\", \"total\": \"19.99\"}"}
      ]
    }
  ]
}
```

{% endcode %}

#### Audio example format

Audio is supported on **E2B / E4B** only. Keep audio clips short and the task explicit.

{% code expandable="true" %}

```json
{
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "audio", "audio": "/path/to/audio OR object"},
        {"type": "text", "text": "Transcribe the following English speech segment into English text. Output only the transcription."}
      ]
    },
    {
      "role": "assistant",
      "content": [
        {"type": "text", "text": "Hello everyone, welcome back."}
      ]
    }
  ]
}
```

{% endcode %}
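
The two formats above can be produced and checked with a small helper. The sketch below is illustrative only: `build_example` and `media_precedes_text` are hypothetical names, not Unsloth APIs, and the file paths are placeholders.

```python
def build_example(user_text, answer_text, image=None, audio=None):
    """Build one training example with any media placed before the text instruction."""
    content = []
    if image is not None:
        content.append({"type": "image", "image": image})
    if audio is not None:
        content.append({"type": "audio", "audio": audio})
    content.append({"type": "text", "text": user_text})
    return {
        "messages": [
            {"role": "user", "content": content},
            {"role": "assistant", "content": [{"type": "text", "text": answer_text}]},
        ]
    }


def media_precedes_text(example):
    """Return True if no image or audio part comes after a text part in any message."""
    for message in example["messages"]:
        types = [part["type"] for part in message["content"]]
        if "text" in types:
            first_text = types.index("text")
            if any(t != "text" for t in types[first_text + 1:]):
                return False
    return True


receipt = build_example("Extract all text from this receipt.", "{}", image="/path/to/image")
clip = build_example("Transcribe this clip.", "Hello everyone.", audio="/path/to/audio")
print(media_precedes_text(receipt), media_precedes_text(clip))
```

Running the check over a whole dataset before training is a cheap way to catch rows where the instruction was placed ahead of the media.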

### Saving / Export Your Fine-tuned Model

You can see our dedicated inference / deployment guides for [Unsloth Studio](/docs/zh/xin/studio/export.md), [llama.cpp](/docs/zh/ji-chu/inference-and-deployment/saving-to-gguf.md), [vLLM](/docs/zh/ji-chu/inference-and-deployment/vllm-guide.md), [llama-server](/docs/zh/ji-chu/inference-and-deployment/llama-server-and-openai-endpoint.md), [Ollama](/docs/zh/ji-chu/inference-and-deployment/saving-to-ollama.md) or [SGLang](/docs/zh/ji-chu/inference-and-deployment/sglang-guide.md).

#### Save to GGUF

Unsloth supports saving directly to GGUF:

```python
model.save_pretrained_gguf("directory", tokenizer, quantization_method = "q4_k_m")
model.save_pretrained_gguf("directory", tokenizer, quantization_method = "q8_0")
model.save_pretrained_gguf("directory", tokenizer, quantization_method = "f16")
```

Or push the GGUF to Hugging Face:

```python
model.push_to_hub_gguf("hf_username/directory", tokenizer, quantization_method = "q4_k_m")
model.push_to_hub_gguf("hf_username/directory", tokenizer, quantization_method = "q8_0")
```

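If the exported model behaves worse in another runtime, the most common cause Unsloth points to is the **wrong chat template / EOS token at inference time** (you must use the same chat template you trained with).

For more details, read our inference guides: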

{% columns %}
{% column width="50%" %}
{% content-ref url="/pages/9a72670992feb75def412a693565c84a88c8a266" %}
[Inference & Deployment](/docs/zh/ji-chu/inference-and-deployment.md)
{% endcontent-ref %}

{% content-ref url="/pages/b83d88f106d75c3396c46f5342fb401501910093" %}
[GGUF & llama.cpp](/docs/zh/ji-chu/inference-and-deployment/saving-to-gguf.md)
{% endcontent-ref %}
{% endcolumn %}

{% column width="50%" %}
{% content-ref url="/pages/f7c3389bdba9af3050e66a941596d827cdb11e0b" %}
[Model Export](/docs/zh/xin/studio/export.md)
{% endcontent-ref %}

{% content-ref url="/pages/9f0e22d200c9105481e4854b8473aba99ca44835" %}
[vLLM](/docs/zh/ji-chu/inference-and-deployment/vllm-guide.md)
{% endcontent-ref %}
{% endcolumn %}
{% endcolumns %}

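### Gemma 4 Data Best Practices

Gemma 4 has a few formatting details you need to be aware of.

#### 1. Use standard chat roles

Gemma 4 uses the standard roles:

* `system`
* `user`
* `assistant`

This means your SFT dataset should be written in the regular chat format rather than the older Gemma-specific role format.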

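#### 2. Thinking mode is explicit

If you want to preserve thinking-style behavior during SFT:

* Keep the format consistent
* Decide whether you want to train on **visible thinking blocks** or **final answers only**
* Do **not** mix multiple incompatible thinking formats in the same dataset

For most production assistants, the simplest setup is to train only on the **final visible answer**.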

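#### 3. Multi-turn rule

For multi-turn conversations, keep only the **final visible answer** in the conversation history. Do **not** feed earlier thinking blocks back into later turns.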
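
A minimal sketch of this rule is shown below. It assumes assistant turns are stored as plain strings and that a thinking block sits between the `<|channel>thought` and `<channel|>` markers that appear in the chat template output earlier on this page; `strip_thoughts` and `clean_history` are hypothetical helpers, not Unsloth functions.

```python
import re

# Assumed delimiters, taken from the template output shown earlier on this page.
THOUGHT_BLOCK = re.compile(r"<\|channel>thought\n.*?<channel\|>", re.DOTALL)


def strip_thoughts(text):
    """Remove thinking blocks and keep only the visible answer."""
    return THOUGHT_BLOCK.sub("", text).strip()


def clean_history(messages):
    """Return a copy of the conversation with thinking removed from assistant turns."""
    cleaned = []
    for message in messages:
        if message["role"] == "assistant":
            message = {**message, "content": strip_thoughts(message["content"])}
        cleaned.append(message)
    return cleaned


history = [
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "<|channel>thought\nAdd the two numbers.<channel|>4"},
    {"role": "user", "content": "And 3+3?"},
]
cleaned = clean_history(history)
print(cleaned[1]["content"])
```

User turns pass through untouched, and the original list is left unmodified so the raw data stays available.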

