2.4x faster Gemma + 58% less VRAM

Feb 26, 2024 • By Daniel & Michael


[Benchmark charts: Gemma 7b and Gemma 2b finetuning on 1x A100]

You can now finetune Gemma 7b 2.43x faster than HF + Flash Attention 2 with 57.5% less VRAM use. When compared to vanilla HF, Unsloth is 2.53x faster and uses 70% less VRAM.

On 1x A100 80GB GPU, Unsloth can fit 40K total tokens (8192 * bsz of 5), whilst FA2 can fit ~15K tokens and vanilla HF can fit 9K tokens.

We also uploaded 2 Colab notebooks to finetune on a free Tesla T4! Gemma 7b Notebook and a Gemma 2b Notebook. We uploaded pre-quantized 4bit models for 4x faster downloading to our HF page.

Below, we break down how we made Gemma faster, what we learnt about Gemma, and some bugs we found along the way! Our other releases:
  • Unsloth now natively supports 2x faster inference! (inc. Gemma!)
  • We've added new chat templates meaning you can now finetune conversational style datasets! ChatML, Vicuna, Zephyr + more chat templates all work: Notebook
  • We’ve also revamped our Github page for easier navigation. Don't forget to ⭐Star us ❤️
💡 Gemma Learnings
At first glance, Gemma might look very similar to Llama's architecture, but on closer inspection, we find the following notable differences:
  • GeGLU vs SwiGLU activation function
    We had to rewrite our manual autograd engine to accept other activation functions. We had to use Wolfram Alpha to derive the derivative for GeGLU!
  • Tied Embeddings
    Interestingly, unlike Mistral and Llama, Gemma's embedding and lm_head share identical weights.
  • 256K Vocab Size
    Gemma's vocab size is gigantic: Llama and Mistral only have 32K tokens, whilst Gemma uses 256K. Although the embedding is tied to the lm_head, we had to rewrite our Cross Entropy kernel to work on all vocab sizes. Originally our kernel only worked up to CUDA's max block size of 2^16, or 65536.
  • Much more VRAM usage
    Gemma's MLP size is 24576, whilst Llama's is 11008 and Mistral's is 14336. Gemma's Attention size is 3072, projected to 4096 before attention. Llama is 4096, and Mistral is also 4096 but uses grouped query attention (so 512 repeated until 4096). This causes Gemma's memory usage to explode. On 1x A100, Unsloth can handle 40K total tokens, whilst FA2 can handle at most 15K, and vanilla HF only 9K.
  • +1 to RMS Layernorm
    Llama and Mistral's RMS Layernorm doesn't use +1, whilst Gemma scales by (w + 1). Llama initializes the weights to 1, whilst Gemma can initialize them to 0. We doubt this has any effect on performance.
  • 256 Head Size
    Flash Attention 2 only recently added support for head sizes larger than 192 on consumer GPUs, due to Gemma.
  • Embedding * sqrt(hd)
    Embeddings are also scaled by the sqrt of the hidden dimension.
  • <start_of_turn> <end_of_turn>
    Interestingly, <start_of_turn> is id 106 and <end_of_turn> is id 107. We're working on a ChatML template to automatically map <|im_start|> and <|im_end|> to these tokens.
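To illustrate the GeGLU derivative work above, here's a minimal pure-Python sketch (a simplification, not our actual Triton kernel) of GeGLU with the exact erf-based GELU and its hand-derived gradient, checked against a finite difference:

```python
import math

SQRT_2 = math.sqrt(2.0)
INV_SQRT_2PI = 1.0 / math.sqrt(2.0 * math.pi)

def gelu(x):
    # Exact GELU: x * Phi(x), with Phi the standard normal CDF
    return x * 0.5 * (1.0 + math.erf(x / SQRT_2))

def gelu_grad(x):
    # d/dx [x * Phi(x)] = Phi(x) + x * phi(x), phi the normal PDF
    Phi = 0.5 * (1.0 + math.erf(x / SQRT_2))
    phi = INV_SQRT_2PI * math.exp(-0.5 * x * x)
    return Phi + x * phi

def geglu(gate, up):
    # GeGLU gates the up-projection with GELU instead of SiLU (SwiGLU)
    return gelu(gate) * up

# Check the analytic gradient w.r.t. the gate against a central difference
g, u, eps = 0.7, 1.3, 1e-6
numeric = (geglu(g + eps, u) - geglu(g - eps, u)) / (2 * eps)
analytic = gelu_grad(g) * u
assert abs(numeric - analytic) < 1e-5
```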
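The 256K-vocab bullet above is why the cross-entropy reduction has to be chunked. A minimal pure-Python sketch of a chunked (online) logsumexp, standing in for the fused kernel, where no single pass exceeds the 2^16 block limit:

```python
import math

MAX_BLOCK = 65536  # CUDA's maximum block size of 2**16

def chunked_logsumexp(logits, chunk=MAX_BLOCK):
    # Online logsumexp over vocab chunks: track a running max m and
    # a running rescaled sum s, so each pass stays within one block.
    m, s = float("-inf"), 0.0
    for i in range(0, len(logits), chunk):
        block = logits[i:i + chunk]
        bm = max(block)
        bs = sum(math.exp(v - bm) for v in block)
        if bm > m:
            s = s * math.exp(m - bm) + bs
            m = bm
        else:
            s += bs * math.exp(bm - m)
    return m + math.log(s)

def cross_entropy(logits, target, chunk=MAX_BLOCK):
    # -log softmax(logits)[target] = logsumexp(logits) - logits[target]
    return chunked_logsumexp(logits, chunk) - logits[target]

# A tiny vocab with a tiny chunk matches the unchunked formula exactly
logits = [0.1, 2.0, -1.0, 0.5, 1.5]
full = math.log(sum(math.exp(v) for v in logits)) - logits[1]
assert abs(cross_entropy(logits, 1, chunk=2) - full) < 1e-9
```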
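The (w + 1) difference in the RMS Layernorm bullet above can be sketched in a few lines of plain Python (again a simplification of the real fused kernel). With Gemma's weights initialized to 0, the (w + 1) form reduces to the same thing as Llama's form with weights initialized to 1:

```python
import math

def rmsnorm(x, w, add_unit_offset=False, eps=1e-6):
    # x, w: lists of floats. Gemma scales by (w + 1); Llama/Mistral by w.
    ms = sum(v * v for v in x) / len(x)
    inv = 1.0 / math.sqrt(ms + eps)
    scale = [wi + 1.0 for wi in w] if add_unit_offset else w
    return [xi * inv * si for xi, si in zip(x, scale)]

x = [1.0, 2.0, 3.0]
gemma = rmsnorm(x, [0.0, 0.0, 0.0], add_unit_offset=True)   # w init = 0
llama = rmsnorm(x, [1.0, 1.0, 1.0], add_unit_offset=False)  # w init = 1
assert all(abs(a - b) < 1e-12 for a, b in zip(gemma, llama))
```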
We also noticed a bug in Hugging Face, and we're working with the HF team to resolve it. Essentially, using bfloat16 causes RoPE embeddings to lose precision on long contexts: the current HF implementation for Llama and Gemma outputs positions of [8192, 8192, 8192], whilst the correct values should be [8189, 8190, 8191]. We fixed this in Unsloth, so the losses are correct. You can see our PR to transformers here.

Below is our full table of VRAM comparisons between vanilla HF, HF with Flash Attention 2, and Unsloth. At maximum, Unsloth uses 70% less VRAM than vanilla HF, and 57.5% less VRAM than HF with FA2.
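The precision loss above is easy to reproduce without a GPU: bfloat16 keeps only 8 mantissa bits, so integers just below 8192 are only representable in steps of 32, and positions 8189 to 8191 all round to 8192. A small pure-Python sketch of bfloat16 rounding (truncating a float32 to its top 16 bits, round-to-nearest-even) demonstrates the collapse:

```python
import struct

def to_bfloat16(x):
    # Round a Python float to bfloat16: keep the top 16 bits of the
    # float32 encoding, rounding to nearest with ties to even.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    lower = bits & 0xFFFF
    upper = bits >> 16
    if lower > 0x8000 or (lower == 0x8000 and (upper & 1)):
        upper += 1
    return struct.unpack("<f", struct.pack("<I", upper << 16))[0]

# The last positions of an 8192-token context all collapse to 8192 in
# bfloat16, so RoPE angles computed from bf16 position ids are wrong.
positions = [8189.0, 8190.0, 8191.0]
assert [to_bfloat16(p) for p in positions] == [8192.0, 8192.0, 8192.0]
```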
Peak VRAM usage in MB (reduction is Unsloth vs FA2):

Batch size | Seq len | 🤗Hugging Face | FA2    | 🦥Unsloth | Reduction
1          | 2048    | 19,793         | 17,423 | 10,809    | 38.0%
2          | 2048    | 31,829         | 31,155 | 14,504    | 53.4%
1          | 4096    | 33,669         | 29,155 | 14,505    | 50.2%
2          | 4096    | 62,023         | 51,459 | 21,859    | 57.5%
1          | 8192    | 72,587         | 47,843 | 21,859    | 54.3%
2          | 8192    | OOM            | OOM    | 36,565    | n/a
4          | 8192    | OOM            | OOM    | 66,019    | n/a
5          | 8192    | OOM            | OOM    | 80,771    | n/a
🦥Unsloth Studio (Beta)
We've recently been working hard on making our Colab notebooks more accessible. Unsloth Studio (Beta) has been in the works since we launched and will allow users to do fine-tuning in just a single click on Google Colab. Simply upload your dataset, choose your model, and run our notebook to get your fine-tuned model ready for download. That's it!

We initially considered Gradio for an enhanced UI, but complications with Automatic1111’s stable diffusion UI and Colab led us to reconsider. Unsloth Studio (Beta) will be launching pretty soon, so be on the lookout!
🚀 2x faster inference
We're working on even faster inference, but for now, this will help most of you get things done faster. All QLoRA, LoRA, and non-LoRA inference paths are 2x faster.

This requires no change of code or any new dependencies! Check out our video GIF comparison below for a live example of speed differences:
🌠Other updates 
  • Uploaded more 4-bit models which have 4x faster downloading. Check out our Gemma, Yi (6b, 34b), Phi, Llama Chat, Codellama, Phi-2, Solar & more of our models on 🤗Hugging Face
  • We support LoftQ, RSLoRA (Rank stabilized LoRA) and all PEFT arguments
  • We now support all other GGUF formats (iq2_xxs, iq3_xxs, etc)
  • And as usual, many many bug fixes and optimizations!
💕 Support us! 
As a team of just 2 brothers with 0 revenue or funding, it would be amazing if you could support us via our Ko-fi donation page. A huge shout out to: duffahtolla, Gerald, Alexis, Raghu Ganti, Brandon, just a person, Malcome Sharpe, P_Dog, LostGoatOnHill, yogi, TheRealAneesh and iulius666 who have already supported us! 🙏

As always, be sure to join our Discord server for help or just to show your support! You can also follow us on Twitter and Substack. We appreciate your continued love and support!
Thank you for reading!
Daniel & Michael Han 🦥
26 February 2024
