Finetune Llama 3 - 2x faster + 6x longer context + 68% less VRAM

Apr 23, 2024 • By Daniel & Michael


- Llama-3 8B on 1x L4 24GB: 205% faster, -63% VRAM
- Llama-3 70B on 1x A100 80GB: 183% faster, -68% VRAM
You can now finetune Meta’s latest Llama 3 (8B) model 2x faster and use 63% less memory than Flash Attention 2 (FA2) + Hugging Face (HF). Llama 3 (70B) is 1.8x faster and uses 68% less VRAM.

On a 1xA100 80GB GPU, Llama-3 70B with Unsloth can fit 48K total tokens vs 7K tokens without Unsloth. That's 6x longer context lengths!

We uploaded a Colab notebook to finetune Llama-3 8B on a free Tesla T4: Llama-3 8b Notebook. We also uploaded pre-quantized 4bit models for 4x faster downloading to our Hugging Face page which includes Llama-3 70b Instruct and Base in 4bit form.

Someone from our community tested LoRA fine-tuning of bf16 Llama 3 8B and it only used 16GB of VRAM.

P.S. Don't forget to ⭐Star us on Github and join our Discord server ❤️

Llama 3 performance benchmarks

| Model | VRAM | 🦥 Unsloth speed | 🦥 VRAM reduction | 🦥 Longer context | 🤗 Hugging Face + FA2 |
| --- | --- | --- | --- | --- | --- |
| Llama-3 8B | 24GB | 2x | 63% | 3x longer | 1x |
| Llama-3 70B | 80GB | 1.8x | 68% | 6x longer | 1x |
We tested using the Alpaca Dataset, a batch size of 2, gradient accumulation steps of 4, rank = 32, and applied QLoRA on all linear layers (q, k, v, o, gate, up, down).
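The benchmark configuration above can be written down as a small sketch (the module names follow the standard Hugging Face naming for Llama's linear layers; the effective batch size is simply batch size times gradient accumulation steps):

```python
# Hyperparameters from the benchmark description above.
batch_size = 2
gradient_accumulation_steps = 4
lora_rank = 32

# "All linear layers (q, k, v, o, gate, up, down)" in standard
# Hugging Face Llama module naming.
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                  "gate_proj", "up_proj", "down_proj"]

# The effective batch size the optimizer sees per update step.
effective_batch_size = batch_size * gradient_accumulation_steps
print(effective_batch_size)  # 8
```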
🦙 6x longer context lengths
By using Unsloth’s latest long context support, Llama-3 70B can now easily fit on a 48GB GPU card, allowing you to finetune with ~7K context lengths, whilst HF + FA2 simply runs out of memory (OOM).

On an A100 80GB SXM machine, Unsloth allows 6x longer context lengths with only +1.9% overhead, allowing you to finetune on 48K sequence lengths vs 7.5K. Below is the VRAM vs context length data we gathered experimentally, showing the stark advantage of Unsloth over HF + FA2 for long context finetuning.

Llama 3 (70B) max. context length

| GPU VRAM | Unsloth (New) | Unsloth (Old) | Hugging Face + FA2 |
| --- | --- | --- | --- |
| 48 GB | 7,698 | 2,875 | OOM |
| 80 GB | 48,053 | 18,332 | 7,433 |
In all our experiments, we used QLoRA with a rank of 32 and applied LoRA adapters to all linear layers (q, k, v, o, gate, up, down). We used a batch size of 1, and repeated the data so that it filled the maximum context window.
🦙 Llama 3 (8B) finetuning fits in 8GB
By using a batch size of 1 and a LoRA rank of 32 on all linear layers, HF + FA2 unfortunately OOMs on 8GB GPU cards (it needs ~9GB of memory), whilst Unsloth comfortably allows 2K context lengths. On a 24GB consumer card, Unsloth allows 20K context lengths, or 3.5x longer contexts than HF + FA2.

Below shows the VRAM consumption vs context lengths tested on a L4 GPU via Colab.

Llama 3 (8B) max. context length

| GPU VRAM | Unsloth (New) | Unsloth (Old) | Hugging Face + FA2 |
| --- | --- | --- | --- |
| 8 GB | 1,983 | 1,594 | OOM |
| 12 GB | 6,638 | 5,352 | 1,044 |
| 16 GB | 11,292 | 9,110 | 2,663 |
| 24 GB | 20,601 | 16,626 | 5,901 |
| 40 GB | 39,219 | 31,657 | 12,377 |
| 48 GB | 48,528 | 39,172 | 15,615 |
| 80 GB | 85,765 | 69,235 | 28,567 |
🦙 Llama 3 Quirks
There are a few weird “bugs” and quirks with Llama-3 as well! First, the tokenizer does not add the BOS token, unlike Llama-2. Hugging Face added a fix today, and we quickly resolved it inside Unsloth! We did test both scenarios, and saw virtually no difference between adding and not adding the BOS token.

A more unfortunate “bug” or quirk is that Llama-3’s base (not instruct) model has untrained tokens, namely:
- <|reserved_special_token_{0->250}|>
- <|eot_id|>
- <|start_header_id|>
- <|end_header_id|>
We tweeted about this a few days ago here.
Essentially, if one uses these untrained tokens (for example, by applying the instruct template to the base model), the gradients will be NaN. As first noticed by Geronimo, one simply has to set these untrained tokens to the mean embedding vector.

However, from our investigations, you cannot simply take the mean over the whole embedding matrix, since it is biased by the untrained entries. You must first set the untrained token embeddings to 0 (in bfloat16 these vectors are not exactly 0 but around 1e-23), then sum all rows, and then divide by the number of trained tokens (total tokens minus untrained tokens). We found 287 untrained tokens in total.
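The unbiased mean fix above can be sketched as follows (a minimal NumPy illustration with a toy embedding matrix and a hypothetical list of untrained row indices; Unsloth's actual implementation operates on the model's torch embedding weights):

```python
import numpy as np

def fix_untrained_tokens(embeddings: np.ndarray, untrained_ids: list[int]) -> np.ndarray:
    """Replace untrained embedding rows with the mean of the trained rows."""
    emb = embeddings.copy()
    # 1) Zero the untrained rows first (in bfloat16 they are ~1e-23, not exactly 0).
    emb[untrained_ids] = 0.0
    # 2) Sum over ALL rows: the zeroed untrained rows now contribute nothing.
    total = emb.sum(axis=0)
    # 3) Divide by the number of *trained* tokens, not the full vocab size,
    #    otherwise the mean is biased towards zero.
    n_trained = emb.shape[0] - len(untrained_ids)
    mean_vector = total / n_trained
    emb[untrained_ids] = mean_vector
    return emb

# Toy example: a 4-token vocab where the last 2 tokens are "untrained".
E = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [1e-23, 1e-23],
              [1e-23, 1e-23]])
fixed = fix_untrained_tokens(E, [2, 3])
# Trained rows are untouched; untrained rows become the trained mean [2.0, 3.0].
```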

Unsloth’s new release now automatically fixes this for you during finetuning.
💕 Thank you! 
Feel free to support us via our Ko-fi donation page. Huge shout out to: Nguyen, Mo, Icecream102, arthrod, Teto, Chimiste, Martin & FullOff_Bad_Ideas who are new supporters! 🙏

As always, be sure to join our Discord server for help or just to show your support! You can also follow us on Twitter and Substack.
Thank you for reading!
Daniel & Michael Han 🦥
5 May 2024
