Fine-tune & Run Gemma 3n

Jun 26, 2025 • By Daniel & Michael

Gemma 3n is Google's new family of multimodal (text, vision & audio) models. Available in 2B (E2B) and 4B (E4B) sizes, Gemma 3n has a 32K context window, multilingual support, and is now supported in Unsloth.
  • Fine-tune Gemma-3n-E4B for free using our Colab notebook. Also, view our Gemma 3 (1B) GRPO notebook
  • Unsloth makes Gemma 3n finetuning 1.6x faster, uses 60% less VRAM, and enables 6x longer context lengths than environments with Flash Attention 2 on a 24GB GPU.
  • We uploaded all versions of Gemma 3n, including Dynamic GGUFs, 4-bit, and 16-bit versions, on Hugging Face here. Currently GGUFs only support text.
  • Read our detailed guide on How to Run & Fine-tune Gemma 3n here.
A big thank you to the Gemma team for all their support! We're also super excited to meet you all at today’s Gemma Developer Meetup with Google!

Get the latest stable Unsloth to use Gemma 3n:
pip install --upgrade unsloth unsloth_zoo
✨ Gemma 3n Fine-tuning
Gemma 3n E4B finetuning with Unsloth fits in under 12GB of VRAM! It's also 1.6x faster and, by default, uses Unsloth dynamic 4-bit quants for superior accuracy.
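As a rough illustration, here is a minimal loading-plus-LoRA sketch using Unsloth's FastModel API. The model id, sequence length and LoRA rank below are illustrative; see the Colab notebook for the full, tested configuration:

```python
from unsloth import FastModel

# Minimal sketch: model name and hyperparameters are illustrative,
# not the exact notebook configuration.
model, tokenizer = FastModel.from_pretrained(
    model_name="unsloth/gemma-3n-E4B-it",  # assumed Unsloth upload id
    max_seq_length=1024,
    load_in_4bit=True,  # Unsloth dynamic 4-bit quantization
)

# Attach LoRA adapters so only a small fraction of weights is trained;
# see the Colab notebook for the full set of arguments.
model = FastModel.get_peft_model(model, r=32, lora_alpha=32)
```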

We also heard a lot of you asking for a Gemma 3 (4B) Vision notebook, so you can now try it in our free Google Colab notebook here. To view all our notebooks and model uploads, please visit our documentation.

Performance benchmarks

| Model | VRAM | 🦥 Unsloth speed | 🦥 VRAM reduction | 🦥 Longer context | 🤗 Hugging Face + FA2 |
| --- | --- | --- | --- | --- | --- |
| Gemma-3n-E4B | 24GB | 1.7x | >60% | 6x longer | 1x |
We tested using the Alpaca Dataset, a batch size of 2, gradient accumulation steps of 4, rank = 32, and applied QLoRA on all linear layers (q, k, v, o, gate, up, down).
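For reference, a hedged sketch of roughly equivalent trainer settings with TRL's SFTTrainer. The `dataset` variable, learning rate and step count are placeholders rather than the exact benchmark script:

```python
from trl import SFTConfig, SFTTrainer

# Roughly the benchmark settings above; "dataset" stands for the Alpaca dataset
# formatted for SFT, and learning rate / step count are placeholders.
trainer = SFTTrainer(
    model=model,
    processing_class=tokenizer,  # older TRL versions use tokenizer= instead
    train_dataset=dataset,
    args=SFTConfig(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,  # effective batch size of 8
        learning_rate=2e-4,
        max_steps=60,
        output_dir="outputs",
    ),
)
trainer.train()
```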
🔮 Gemma 3n Analysis

Here's an in-depth analysis of Gemma 3n's MatFormer architecture:
So what is so special about Gemma 3n, you ask? It is based on the Matryoshka Transformer (MatFormer) architecture, meaning that each transformer layer/block embeds/nests FFNs of progressively smaller sizes. Think of it like progressively smaller cups nested inside one another. Training is done so that, at inference time, you can choose the size you want and still retain most of the performance of the bigger model.

There is also Per-Layer Embedding (PLE), which can be cached to reduce memory usage at inference time. The 2B model (E2B) is a sub-network inside the 4B (actually 5.44B-parameter) model, obtained through Per-Layer Embedding caching and by skipping the audio and vision components to focus solely on text.

The MatFormer architecture is typically trained with exponentially spaced sub-models, i.e. of sizes S, S/2, S/4, S/8, etc. in each of the layers. At training time, inputs are randomly forwarded through one of these sub-blocks, giving every sub-block an equal chance to learn. The advantage is that, at inference time, if you want the model to be 1/4 of the original size, you can pick the S/4-sized sub-block in each layer, as sketched below.
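Here is a toy sketch of the idea (not Gemma 3n's actual implementation): an FFN whose smaller sub-FFNs are prefixes of the full weight matrices, with the sub-size chosen at random per training step.

```python
import random
import torch
import torch.nn as nn

class NestedFFN(nn.Module):
    """Toy MatFormer-style FFN: smaller sub-FFNs are prefixes of the full weights."""
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff)
        self.down = nn.Linear(d_ff, d_model)

    def forward(self, x, frac=1.0):
        k = int(self.up.out_features * frac)  # use only the first k hidden units
        h = torch.relu(x @ self.up.weight[:k].T + self.up.bias[:k])
        return h @ self.down.weight[:, :k].T + self.down.bias

# During training, each batch is routed through a randomly chosen sub-size,
# so every nested sub-network (S, S/2, S/4, S/8) gets a chance to learn.
ffn = NestedFFN()
x = torch.randn(4, 512)
frac = random.choice([1.0, 0.5, 0.25, 0.125])
out = ffn(x, frac=frac)
```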

You can also choose to Mix and Match, where you pick, say, the S/4-sized sub-block in one layer, the S/2-sized sub-block in another, and the S/8-sized sub-block in yet another. In fact, you can even change which sub-models you pick based on the input itself, if you fancy. It's basically a choose-your-own-structure at every layer. So by training a model of one particular size, you create exponentially many smaller models. No learning goes to waste. Pretty neat, huh?
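Continuing the toy sketch above, Mix and Match simply means assigning a (possibly different) fraction to each layer at inference time:

```python
# Mix and Match: pick a different sub-FFN size for each layer at inference time.
layers = [NestedFFN() for _ in range(4)]
fracs = [0.25, 0.5, 0.125, 1.0]  # one choice per layer; any combination works

def run(x, layer_fracs):
    for layer, frac in zip(layers, layer_fracs):
        x = x + layer(x, frac=frac)  # residual connection around each FFN
    return x

out = run(torch.randn(4, 512), fracs)
```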

💕 Thank you! 
A huge thank you to the Google Gemma team for enabling us to have Day 0 support. Also thanks to everyone for using & sharing Unsloth - we really appreciate it. 🙏

As always, be sure to join our Reddit page and Discord server for help or just to show your support! You can also follow us on Twitter and join our newsletter.
Thank you for reading!
Daniel & Michael Han 🦥
26 Jun 2025

Fine-tune Gemma 3n now!

Join Our Discord