Gemma 4 Fine-tuning Guide

Train Gemma 4 by Google with Unsloth.

You can now fine-tune Google's Gemma 4 E2B, E4B, 26B-A4B, and 31B with Unsloth. Support includes vision, text, audio, and RL fine-tuning.

  • Fine-tune Gemma 4 via our free Google Colab notebooks.

  • If you want to preserve reasoning ability, mix reasoning-style examples with direct answers (keep at least 75% reasoning). Otherwise, you can omit reasoning entirely.

  • Full fine-tuning (FFT) works as well. It will use 4x more VRAM.

  • Gemma 4 is powerful for multilingual fine-tuning as it supports 140 languages.

  • After fine-tuning, you can export to GGUF (for llama.cpp/Unsloth/Ollama/etc.)

If you’re on an older version (or fine-tuning locally), update first:

Unsloth Studio:

Unsloth code-based:

pip install --upgrade --force-reinstall --no-cache-dir unsloth unsloth_zoo

Quickstart

🦥 Unsloth Studio Guide

Gemma 4 can be run and fine-tuned in Unsloth Studio, our new open-source web UI for local AI. With Unsloth Studio, you can run models locally on macOS, Windows, and Linux:

1. Install Unsloth

Run in your terminal:

macOS, Linux, WSL:

curl -fsSL https://unsloth.ai/install.sh | sh

Windows PowerShell:

irm https://unsloth.ai/install.ps1 | iex
2. Launch Unsloth

macOS, Linux, WSL, and Windows:

unsloth studio -H 0.0.0.0 -p 8888

Then open http://localhost:8888 in your browser.

3. Train Gemma 4

On first launch, you'll create a password to secure your account; use it to sign in later. A brief onboarding wizard then helps you choose a model, dataset, and basic settings. You can skip it at any time.

Search for Gemma 4 in the search bar and select your desired model and dataset. Next, adjust your hyperparameters and context length as desired.

4. Monitor training progress

After you click start training, you can monitor the model's training progress. The training loss should decrease steadily. Once done, the model is saved automatically.

5. Export your fine-tuned model

Once done, Unsloth Studio lets you export the model to GGUF, safetensors, and other formats.

🦥 Unsloth Core (code-based) Guide

Below is a minimal SFT recipe (works for “text-only” fine-tuning). See also our vision fine-tuning section.
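A minimal sketch of such a recipe, assuming Unsloth's `FastLanguageModel` API and TRL's `SFTTrainer`; the checkpoint name and dataset are placeholders to substitute with your own:

```python
from unsloth import FastLanguageModel
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset

# Hypothetical checkpoint name -- substitute the Gemma 4 variant you selected.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/gemma-4-e4b",
    max_seq_length=4096,
    load_in_4bit=True,
)

# Attach LoRA adapters to the usual attention and MLP projections.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",  # reduces VRAM, extends context
)

dataset = load_dataset("your_dataset", split="train")  # placeholder

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=SFTConfig(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        max_steps=60,            # short run for a first sanity check
        learning_rate=2e-4,
        output_dir="outputs",
    ),
)
trainer.train()
```

The hyperparameters above are starting points, not recommendations; tune them for your dataset and hardware.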


If you'd like to do GRPO, it works in Unsloth if you disable fast vLLM inference and use Unsloth inference instead. Follow our Vision RL notebook examples.


If you OOM:

  • Drop per_device_train_batch_size to 1 and/or reduce max_seq_length.

  • Keep use_gradient_checkpointing="unsloth" on (it’s designed to reduce VRAM use and extend context length).
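Applied to the training config, those OOM mitigations might look like this (a sketch with hypothetical values; `SFTConfig` is TRL's trainer configuration):

```python
from trl import SFTConfig

# Low-VRAM settings sketch: smaller per-device batch, longer accumulation.
args = SFTConfig(
    per_device_train_batch_size=1,   # drop to 1 if you OOM
    gradient_accumulation_steps=8,   # keeps the effective batch size up
    max_seq_length=2048,             # reduce further if you still OOM
    output_dir="outputs",
)
```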

Loader example for MoE (bf16 LoRA):
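A sketch of the loader, assuming Unsloth's `FastLanguageModel` API; the checkpoint name is hypothetical:

```python
from unsloth import FastLanguageModel

# Hypothetical 26B-A4B checkpoint name -- substitute the actual repo id.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/gemma-4-26b-a4b",
    max_seq_length=4096,
    load_in_4bit=False,  # bf16 LoRA: keep weights in 16-bit
    dtype=None,          # auto-detect bf16 where the GPU supports it
)
```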

Once loaded, you’ll attach LoRA adapters and train similarly to the SFT example above.

MoE fine-tuning (26B-A4B)

The 26B-A4B model is the speed / quality middle ground in the Gemma 4 lineup. Since it is an MoE model with only a subset of parameters active per token, a conservative fine-tuning approach is:

  • use LoRA rather than full fine-tuning

  • prefer 16-bit / bf16 LoRA if memory allows

  • start with shorter contexts and smaller ranks first

  • scale up only after the pipeline is stable

If your goal is the highest quality and you have more memory, use 31B instead.

Multimodal fine-tuning (E2B / E4B)

Because E2B and E4B support image and audio, they are the main Gemma 4 variants for multimodal fine-tuning.

  • load the multimodal model with FastVisionModel

  • keep finetune_vision_layers = False first

  • fine-tune only the language, attention, and MLP layers

  • enable vision or audio layers later if your task needs it

Gemma 4 Multimodal LoRA example:
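A sketch following the guidance above, assuming Unsloth's `FastVisionModel` API and a hypothetical checkpoint name:

```python
from unsloth import FastVisionModel

# Hypothetical E4B checkpoint name -- substitute the actual repo id.
model, tokenizer = FastVisionModel.from_pretrained(
    "unsloth/gemma-4-e4b",
    load_in_4bit=True,
)

model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers=False,     # keep False first, enable later if needed
    finetune_language_layers=True,    # fine-tune only language,
    finetune_attention_modules=True,  # attention,
    finetune_mlp_modules=True,        # and MLP layers
    r=16,
    lora_alpha=16,
)
```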

Image example format

Remember: for Gemma 4 multimodal prompts, put the image before the text instruction.
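As a sketch, a training sample in a common multimodal chat format (field names and file paths are illustrative placeholders) with the image part placed before the text part:

```python
# Hypothetical chat-format sample: the image entry precedes the text entry.
sample = {
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": "path/to/image.jpg"},   # image first
                {"type": "text", "text": "Describe this image."},  # text second
            ],
        },
        {
            "role": "assistant",
            "content": [{"type": "text", "text": "A photo of ..."}],
        },
    ]
}
```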

Audio example format

Audio is for E2B / E4B only. Keep clips short and task-specific.
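A matching sketch for audio (again with illustrative field names and paths), with the audio part before the text instruction:

```python
# Hypothetical chat-format sample: the audio entry precedes the text entry.
audio_sample = {
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "audio", "audio": "path/to/clip.wav"},      # audio first
                {"type": "text", "text": "Transcribe this clip."},   # text second
            ],
        },
        {
            "role": "assistant",
            "content": [{"type": "text", "text": "Hello and welcome."}],
        },
    ]
}
```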

Saving / export fine-tuned model

You can view our specific inference / deployment guides for Unsloth Studio, llama.cpp, vLLM, llama-server, Ollama or SGLang.

Save to GGUF

Unsloth supports saving directly to GGUF:
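For example (a sketch; `model` and `tokenizer` come from your training run, the output name is a placeholder, and `q8_0` is one possible quantization choice):

```python
# Sketch: save the fine-tuned model directly to GGUF.
model.save_pretrained_gguf(
    "gemma-4-finetune",          # placeholder output directory
    tokenizer,
    quantization_method="q8_0",  # one possible quantization choice
)
```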

Or push GGUFs to Hugging Face:
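A sketch of pushing directly to the Hub (repo id and token are placeholders):

```python
# Sketch: quantize and upload GGUF to a Hugging Face repo.
model.push_to_hub_gguf(
    "your-username/gemma-4-finetune",  # placeholder repo id
    tokenizer,
    quantization_method="q4_k_m",      # one possible quantization choice
    token="hf_...",                    # your Hugging Face write token
)
```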

If the exported model behaves worse in another runtime, the most common cause is a wrong chat template or EOS token at inference time: you must use the same chat template you trained with.

For more details read our inference guides:

Gemma 4 data best practices

Gemma 4 has a few formatting details you need to keep in mind.

1. Use standard chat roles

Gemma 4 uses the standard:

  • system

  • user

  • assistant

This means your SFT dataset should be written in regular chat format rather than older Gemma-specific role formats.

2. Thinking mode is explicit

To enable thinking mode, put <|think|> at the start of the system prompt.

Thinking enabled:

Thinking disabled:
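As a sketch, both variants as chat messages (the system and user prompts are illustrative; only the `<|think|>` prefix comes from the rule above):

```python
# Thinking enabled: <|think|> at the start of the system prompt.
thinking_enabled = [
    {"role": "system", "content": "<|think|>You are a helpful assistant."},
    {"role": "user", "content": "What is 17 * 24?"},
]

# Thinking disabled: plain system prompt, no <|think|> prefix.
thinking_disabled = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 17 * 24?"},
]
```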

If you want to preserve thinking-style behavior during SFT:

  • keep the format consistent

  • decide whether you want to train on visible thought blocks or on final answers only

  • do not mix multiple incompatible thought formats in the same dataset

For most production assistants, the simplest setup is to fine-tune on the final visible answer only.

3. Multi-turn rule

For multi-turn conversations, only keep the final visible answer in the conversation history. Do not feed earlier thought blocks back into later turns.
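A sketch of this rule, assuming a hypothetical internal record format where each assistant turn stores its thought and its visible answer separately:

```python
# Hypothetical records: each assistant turn has a thought and a visible answer.
turns = [
    {"user": "What is 2 + 2?", "thought": "2 + 2 = 4.", "answer": "4"},
    {"user": "And times 3?", "thought": "4 * 3 = 12.", "answer": "12"},
]

# Build the conversation history fed back to the model: earlier thoughts
# are dropped, and only the final visible answers are kept.
history = []
for turn in turns:
    history.append({"role": "user", "content": turn["user"]})
    history.append({"role": "assistant", "content": turn["answer"]})
```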

4. Multimodal content should come first

For Gemma 4 multimodal prompts, put:

  • image before text

  • audio before text

  • video frames before text

This should be reflected in your training data too.
