Gemma 3 - How to Run Guide

How to run Gemma 3 effectively with our GGUFs on llama.cpp, Ollama and Open WebUI, and how to fine-tune it with Unsloth!

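Google has released Gemma 3 with a new 270M model, alongside the previous 1B, 4B, 12B and 27B sizes. The 270M and 1B models are text-only, while the larger models handle both text and vision. We provide GGUFs, a guide on how to run them effectively, and instructions for fine-tuning and reinforcement learning (RL) with Gemma 3!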

New update (Aug 14, 2025): Try our Gemma 3 (270M) fine-tuning notebook and the GGUFs for running it.

Also see our Gemma 3n guide.


Unsloth is the only framework that supports Gemma 3 inference and training on float16 machines. This means Colab notebooks with free Tesla T4 GPUs also work!

Use our free Colab notebook

According to the Gemma team, the optimal config for inference is temperature = 1.0, top_k = 64, top_p = 0.95, min_p = 0.0

Unsloth Gemma 3 uploads with optimal configs:

GGUF

Unsloth Dynamic 4-bit Instruct

16-bit Instruct

270M (new)
1B
4B
12B
27B

⚙️ Recommended Inference Settings

According to the Gemma team, the official recommended settings for inference are:

Temperature of 1.0
Top_K of 64
Min_P of 0.00 (optional, but 0.01 works well; the llama.cpp default is 0.1)
Top_P of 0.95
Repetition Penalty of 1.0 (1.0 means disabled in llama.cpp and transformers)
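As a small sketch, the settings above can be kept in one place and turned into command-line flags. The flag names (--temp, --top-k, --top-p, --min-p, --repeat-penalty) are the ones used in the llama.cpp commands in this guide; the names GEMMA3_SAMPLING and to_llama_cpp_args are hypothetical helpers introduced only for illustration.

```python
# Recommended Gemma 3 sampling settings, keyed by the llama.cpp flag
# names that appear in this guide's commands.
GEMMA3_SAMPLING = {
    "temp": 1.0,
    "top-k": 64,
    "top-p": 0.95,
    "min-p": 0.0,           # optional; this guide's commands use 0.01
    "repeat-penalty": 1.0,  # 1.0 means disabled
}


def to_llama_cpp_args(settings):
    """Flatten a {flag: value} dict into ['--flag', 'value', ...]."""
    args = []
    for name, value in settings.items():
        args += [f"--{name}", str(value)]
    return args


if __name__ == "__main__":
    # Prints the flags on one line, ready to append to a llama-cli call.
    print(" ".join(to_llama_cpp_args(GEMMA3_SAMPLING)))
```

Keeping the values in a single dictionary makes it easy to override one of them (for example min-p) without retyping the rest.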

Chat template:

<bos><start_of_turn>user\nHello!<end_of_turn>\n<start_of_turn>model\nHey there!<end_of_turn>\n<start_of_turn>user\nWhat is 1+1?<end_of_turn>\n<start_of_turn>model\n

The same chat template with the \n newlines rendered (except for the last one):

<bos><start_of_turn>user
Hello!<end_of_turn>
<start_of_turn>model
Hey there!<end_of_turn>
<start_of_turn>user
What is 1+1?<end_of_turn>
<start_of_turn>model\n

llama.cpp and other inference engines automatically add a <bos> token, so never add two <bos> tokens! You should omit the <bos> when prompting the model!
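A minimal sketch of building a prompt in this format by hand, leaving out <bos> so that the inference engine can add it. The function name format_gemma3_prompt and the OpenAI-style message dictionaries (including mapping the "assistant" role to Gemma's "model" turn) are assumptions made for illustration, not part of any library.

```python
def format_gemma3_prompt(messages):
    """Render chat messages in the Gemma 3 turn format, without <bos>.

    `messages` is a list of {"role": ..., "content": ...} dicts.
    The role "assistant" is mapped to Gemma's "model" turn (assumption).
    """
    parts = []
    for message in messages:
        role = "model" if message["role"] == "assistant" else message["role"]
        parts.append(f"<start_of_turn>{role}\n{message['content']}<end_of_turn>\n")
    # Leave an open model turn so the model writes the next reply.
    parts.append("<start_of_turn>model\n")
    return "".join(parts)


if __name__ == "__main__":
    chat = [
        {"role": "user", "content": "Hello!"},
        {"role": "assistant", "content": "Hey there!"},
        {"role": "user", "content": "What is 1+1?"},
    ]
    print(format_gemma3_prompt(chat))
```

For the three-message conversation above, the result matches the chat template shown earlier with the leading <bos> removed.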

✨Running Gemma 3 on your phone

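To run the models on your phone, we recommend any mobile app that can run GGUFs locally on edge devices such as phones. After fine-tuning, you can export the model to GGUF and run it locally on your phone. Make sure your phone has enough RAM and compute for the model, since it can overheat; for this use case we recommend Gemma 3 270M or the Gemma 3n models. You can try the open-source AnythingLLM mobile app, which is available for Android, or ChatterUI. Both are great apps for running GGUFs on a phone.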

Remember, you can change the model name 'gemma-3-27b-it-GGUF' to any Gemma model, such as 'gemma-3-270m-it-GGUF:Q8_K_XL', in all of the tutorials.

🦙 Tutorial: How to Run Gemma 3 in Ollama

Install ollama if you haven't already!

apt-get update
apt-get install pciutils -y
curl -fsSL https://ollama.com/install.sh | sh

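Run the model! Note that if it fails, you can run ollama serve in another terminal! We include all our fixes and suggested parameters (temperature and so on) in params in our Hugging Face upload! You can change the model name 'gemma-3-27b-it-GGUF' to any Gemma model, such as 'gemma-3-270m-it-GGUF:Q8_K_XL'.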

ollama run hf.co/unsloth/gemma-3-27b-it-GGUF:Q4_K_XL

📖 Tutorial: How to Run Gemma 3 27B in llama.cpp

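Obtain the latest llama.cpp from GitHub (https://github.com/ggerganov/llama.cpp). You can also follow the build instructions below. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or only want CPU inference.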

apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggerganov/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=ON -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-quantize llama-cli llama-gguf-split llama-mtmd-cli
cp llama.cpp/build/bin/llama-* llama.cpp

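If you want to use llama.cpp directly to load models, you can do the following: (:Q4_K_XL) is the quantization type. You can also download the model via Hugging Face (point 3). This is similar to ollama run.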

./llama.cpp/llama-mtmd-cli \
    -hf unsloth/gemma-3-4b-it-GGUF:Q4_K_XL

Or download the model as follows (after running pip install huggingface_hub hf_transfer). You can choose Q4_K_M or other quantized versions (such as BF16 full precision). More versions are at: https://huggingface.co/unsloth/gemma-3-27b-it-GGUF

# !pip install huggingface_hub hf_transfer
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id = "unsloth/gemma-3-27b-it-GGUF",
    local_dir = "unsloth/gemma-3-27b-it-GGUF",
    allow_patterns = ["*Q4_K_XL*", "mmproj-BF16.gguf"], # For Q4_K_XL; change the pattern to pick another quant
)

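Run Unsloth's Flappy Bird test.
Edit --threads 32 for the number of CPU threads, --ctx-size 16384 for the context length (Gemma 3 supports a 128K context length!), and --n-gpu-layers 99 for the number of layers to offload to the GPU. Try lowering it if your GPU runs out of memory, and remove it for CPU-only inference.
For conversation mode: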

./llama.cpp/llama-mtmd-cli \
    --model unsloth/gemma-3-27b-it-GGUF/gemma-3-27b-it-Q4_K_XL.gguf \
    --mmproj unsloth/gemma-3-27b-it-GGUF/mmproj-BF16.gguf \
    --ctx-size 16384 \
    --n-gpu-layers 99 \
    --seed 3407 \
    --prio 2 \
    --temp 1.0 \
    --repeat-penalty 1.0 \
    --min-p 0.01 \
    --top-k 64 \
    --top-p 0.95

For non-conversation mode, to test Flappy Bird:

./llama.cpp/llama-cli \
    --model unsloth/gemma-3-27b-it-GGUF/gemma-3-27b-it-Q4_K_XL.gguf \
    --ctx-size 16384 \
    --n-gpu-layers 99 \
    --seed 3407 \
    --prio 2 \
    --temp 1.0 \
    --repeat-penalty 1.0 \
    --min-p 0.01 \
    --top-k 64 \
    --top-p 0.95 \
    -no-cnv \
    --prompt "<start_of_turn>user\nCreate a Flappy Bird game in Python. You must include these things:\n1. You must use pygame.\n2. The background color should be randomly chosen and is a light shade. Start with a light blue color.\n3. Pressing SPACE multiple times will accelerate the bird.\n4. The bird's shape should be randomly chosen as a square, circle or triangle. The color should be randomly chosen as a dark color.\n5. Place on the bottom some land colored as dark brown or yellow chosen randomly.\n6. Make a score shown on the top right side. Increment if you pass pipes and don't hit them.\n7. Make randomly spaced pipes with enough space. Color them randomly as dark green or light brown or a dark gray shade.\n8. When you lose, show the best score. Make the text inside the screen. Pressing q or Esc will quit the game. Restarting is pressing SPACE again.\nThe final game should be inside a markdown section in Python. Check your code for errors and fix them before the final markdown section.<end_of_turn>\n<start_of_turn>model\n"

The full input, taken from our 1.58-bit blog post at https://unsloth.ai/blog/deepseekr1-dynamic, is:

Remember to remove <bos>, since a <bos> is added automatically!

<start_of_turn>user
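Create a Flappy Bird game in Python. You must include these things:
1. You must use pygame.
2. The background color should be randomly chosen and is a light shade. Start with a light blue color.
3. Pressing SPACE multiple times will accelerate the bird.
4. The bird's shape should be randomly chosen as a square, circle or triangle. The color should be randomly chosen as a dark color.
5. Place on the bottom some land colored as dark brown or yellow chosen randomly.
6. Make a score shown on the top right side. Increment if you pass pipes and don't hit them.
7. Make randomly spaced pipes with enough space. Color them randomly as dark green or light brown or a dark gray shade.
8. When you lose, show the best score. Make the text inside the screen. Pressing q or Esc will quit the game. Restarting is pressing SPACE again.
The final game should be inside a markdown section in Python. Check your code for errors and fix them before the final markdown section.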

🦥 Fine-tuning Gemma 3 in Unsloth

Unsloth is the only framework that supports Gemma 3 inference and training on float16 machines. This means Colab notebooks with free Tesla T4 GPUs also work!

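Try our new Gemma 3 (270M) notebook, which makes the 270M-parameter model very good at chess and able to predict the next move.
Fine-tune Gemma 3 (4B) with our notebooks for: Text or Vision
Or fine-tune Gemma 3n (E4B) with Text • Vision • Audio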

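When attempting full fine-tuning (FFT) of Gemma 3, all layers default to float32 on float16 devices. Unsloth expects float16 and upcasts dynamically. To fix this, run model.to(torch.float16) after loading, or use a GPU with bfloat16 support.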

Unsloth Fine-tuning Fixes

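Our solution in Unsloth has three parts:

Keep all intermediate activations in bfloat16 format. They can also be float32, but this uses 2x more VRAM or RAM (done via Unsloth's async gradient checkpointing).
Do all matrix multiplies in float16 with tensor cores, but upcast and downcast manually rather than relying on PyTorch's mixed-precision autocast.
Upcast all other operations that don't need matrix multiplies (such as layernorms) to float32.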
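To illustrate why the third point matters, here is a pure-Python toy, not Unsloth's implementation. It emulates float16 and float32 rounding with the standard struct module and computes the mean of squares that a normalization layer needs. The helper names (to_float16, to_float32, mean_square) and the sample activation values are hypothetical and chosen only for illustration.

```python
import math
import struct


def _round_trip(fmt, x):
    """Round x to the given struct float format; overflow becomes +/-inf."""
    try:
        return struct.unpack(fmt, struct.pack(fmt, x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)


def to_float16(x):
    return _round_trip("<e", x)  # IEEE half precision


def to_float32(x):
    return _round_trip("<f", x)  # IEEE single precision


def mean_square(values, cast):
    """Mean of squares, rounding every intermediate result with `cast`."""
    total = 0.0
    for v in values:
        total = cast(total + cast(v * v))
    return cast(total / len(values))


if __name__ == "__main__":
    # Each value fits comfortably in float16, but 300 * 300 = 90000 does not.
    activations = [300.0, -250.0, 120.0, 80.0]
    print("float16 mean square:", mean_square(activations, to_float16))
    print("float32 mean square:", mean_square(activations, to_float32))
    # After normalizing in float32, the outputs are small again and can be
    # stored back in a 16-bit format.
    rms = math.sqrt(mean_square(activations, to_float32))
    print("normalized:", [to_float16(v / rms) for v in activations])
```

The inputs and the normalized outputs both fit in 16 bits; only the intermediate sum of squares needs the wider format, which is why upcasting just the normalization step is enough.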

🤔 Gemma 3 Fixes Analysis

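First, before we fine-tuned or ran Gemma 3, we found that with float16 mixed precision, gradients and activations unfortunately become infinity. This happens on T4, RTX 20x series and V100 GPUs, which only have float16 tensor cores.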

Newer GPUs such as the RTX 30x series or higher, A100s, H100s and so on have bfloat16 tensor cores, so this problem does not occur! But why?

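Float16 can only represent numbers up to 65504, whereas bfloat16 can represent numbers up to about 10^38! Yet both number formats use only 16 bits. The difference is that float16 allocates more bits to the fraction, so it represents small decimals more precisely, while bfloat16 spends those bits on the exponent and cannot represent fractions as well.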
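The two limits follow directly from how each format splits its 16 bits. The sketch below assumes the standard layouts (float16: 1 sign, 5 exponent and 10 fraction bits; bfloat16: 1 sign, 8 exponent and 7 fraction bits); the function names are hypothetical.

```python
def max_finite(exponent_bits, fraction_bits):
    """Largest finite value of an IEEE-style binary floating-point format."""
    bias = 2 ** (exponent_bits - 1) - 1
    return (2 - 2.0 ** -fraction_bits) * 2.0 ** bias


def spacing_near_one(fraction_bits):
    """Gap between 1.0 and the next representable number."""
    return 2.0 ** -fraction_bits


if __name__ == "__main__":
    for name, exp_bits, frac_bits in [("float16", 5, 10), ("bfloat16", 8, 7)]:
        print(
            f"{name}: max = {max_finite(exp_bits, frac_bits):.6g}, "
            f"spacing near 1.0 = {spacing_near_one(frac_bits)}"
        )
```

With these layouts, float16 tops out at 65504 but resolves steps of about 0.001 near 1.0, while bfloat16 reaches roughly 3.4 × 10^38 with steps of about 0.008.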

But why not float32? Let's just use float32! Unfortunately, float32 matrix multiplication on GPUs is very slow, sometimes 4 to 10x slower! So we cannot do that.
