Meta's new Llama 3.3 (70B) model delivers similar performance to Llama 3.1 (405B), and Unsloth makes Llama 3.3 (70B) fine-tuning 2x faster while using 70% less memory than Flash Attention 2 (FA2) + Hugging Face (HF).
New: Unsloth can now fine-tune Llama 3.3 (70B) with context lengths up to 89,000 tokens, 13x longer than the 6,900 that HF + FA2 supports on an 80GB GPU. This new ultra-long context support is 1.85x longer than previous versions of Unsloth. It is powered by our gradient checkpointing algorithm and by Apple's new Cut Cross Entropy (CCE) algorithm, which we collaborated with Apple to integrate.
Please note this new update works for all models, not just Llama. We also uploaded GGUFs and pre-quantized 4bit models to our Hugging Face page for 4x faster downloading.
For Llama 3.1 (8B), Unsloth can now do a whopping 342,000 context length, which exceeds the 128K context length Llama 3.1 natively supports. Hugging Face + FA2 can only do 28,000 on an 80GB GPU, so Unsloth supports 12x longer contexts. On 8GB GPUs, HF + FA2 goes out of memory, whilst Unsloth now supports up to 2,900 tokens of context, up from 1,500.
| Model | VRAM | 🦥 Unsloth speed | 🦥 VRAM reduction | 🦥 Longer context | 🤗 Hugging Face + FA2 |
| --- | --- | --- | --- | --- | --- |
| Llama 3.3 (70B) | 80GB | 2x | >75% | 13x longer | 1x |
| Llama 3.1 (8B) | 80GB | 2x | >70% | 12x longer | 1x |
We tested using the Alpaca Dataset with a batch size of 2, gradient accumulation steps of 4, rank = 32, and QLoRA applied to all linear layers (q, k, v, o, gate, up, down).
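This configuration maps directly onto Unsloth's API. Below is a minimal sketch of a comparable setup; the 4bit repo name, the Alpaca dataset repo, and the prompt format are illustrative assumptions, and exact trl/transformers argument names may vary across versions:

```python
from unsloth import FastLanguageModel
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments

# Load a pre-quantized 4bit model (repo name is illustrative; see our Hugging Face page)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    max_seq_length = 2048,
    load_in_4bit = True,
)

# QLoRA with rank = 32 on all linear layers, as in the benchmark above
model = FastLanguageModel.get_peft_model(
    model,
    r = 32,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing = "unsloth",  # offload activations to system RAM
)

# Collapse instruction/input/output into one "text" column (prompt format is illustrative)
dataset = load_dataset("yahma/alpaca-cleaned", split = "train")
dataset = dataset.map(lambda ex: {"text":
    f"### Instruction:\n{ex['instruction']}\n### Input:\n{ex['input']}\n"
    f"### Response:\n{ex['output']}" + tokenizer.eos_token
})

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = 2048,
    args = TrainingArguments(
        per_device_train_batch_size = 2,   # batch size of 2
        gradient_accumulation_steps = 4,   # as in the benchmark
        max_steps = 60,
        output_dir = "outputs",
    ),
)
trainer.train()
```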
🦙 Llama 3.3 (70B) finetuning fits in 41GB
Llama 3.3 (70B) max. context length

| GPU VRAM | Unsloth (+ Apple CCE) | Unsloth (Old) | Hugging Face + FA2 |
| --- | --- | --- | --- |
| 48 GB | 12,106 | 7,385 | OOM |
| 80 GB | 89,389 | 48,447 | 6,916 |
We tested Llama 3.3 (70B) Instruct on an 80GB A100 and applied 4bit QLoRA to all linear layers (Q, K, V, O, gate, up and down) with rank = 32 and a batch size of 1. We padded all sequences to a fixed maximum sequence length to mimic long context finetuning workloads.
Apple Cut Cross Entropy
We collaborated with the authors of Apple's Cut Cross Entropy, including Erik Wijmans, to bring their memory-efficient cross entropy kernel, written in Triton, into Unsloth. We also made it work on older GPUs like the Tesla T4 and the RTX 20 series. You can read their paper here.
By not materializing the logits, but instead doing the matrix multiplications on the fly (as in Flash Attention), memory usage is vastly reduced. By also smartly ignoring extremely small numbers during the gradient calculation, performance improves without affecting accuracy. CCE increases long context support by another 3.6x for Llama 3.1 (8B) and 1.85x for Llama 3.3 (70B) Instruct. So, multiplying the benefits of Unsloth Gradient Checkpointing and Cut Cross Entropy together, Unsloth's new version supports 12-13x longer context lengths than HF + FA2!
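To give some intuition, here is a simplified PyTorch sketch of the memory idea; this is not Apple's fused Triton kernel, which also blocks over the vocabulary dimension and filters negligible gradients. Instead of materializing the full [tokens, vocab] logit matrix, the loss is computed over chunks of tokens so only a small slice of logits is ever live:

```python
import torch
import torch.nn.functional as F

def chunked_cross_entropy(hidden, lm_head_weight, labels, chunk_size=1024):
    """Cross entropy loss without materializing the full [tokens, vocab] logits.

    hidden:         [tokens, hidden_size] final hidden states
    lm_head_weight: [vocab, hidden_size] LM head weight matrix
    labels:         [tokens] target token ids
    """
    total_loss = 0.0
    for start in range(0, hidden.shape[0], chunk_size):
        h = hidden[start : start + chunk_size]
        y = labels[start : start + chunk_size]
        logits = h @ lm_head_weight.T  # only a [chunk, vocab] slice exists at once
        total_loss = total_loss + F.cross_entropy(logits.float(), y, reduction="sum")
    return total_loss / labels.numel()
```

For scale: Llama 3's vocabulary is 128,256 tokens, so at a 128K context the full bf16 logit matrix alone would be roughly 131,072 x 128,256 x 2 bytes ≈ 31GB, before even counting the fp32 upcast or its gradient.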
To leverage long context finetuning for Llama 3.3 (70B), you can download our 4bit pre-quantized bitsandbytes versions here, which reduce VRAM fragmentation and download 4x faster.
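For example, loading with Unsloth's FastLanguageModel (the repo name below is our assumed 4bit upload; check our Hugging Face page for the exact name):

```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.3-70B-Instruct-bnb-4bit",  # pre-quantized 4bit upload
    max_seq_length = 89_000,  # long context finetuning now fits on one 80GB GPU
    load_in_4bit = True,
)
```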
🦥 Unsloth Gradient Checkpointing
In April 2024, we introduced our Unsloth Gradient Checkpointing algorithm, which smartly offloads activations to system RAM with only a tiny performance overhead. You can read our past blog post about it here.
Our method allows context lengths to be 7x longer than HF+FA2 for Llama 3.3 (70B) Instruct, and 3.3x longer for Llama 3.1 (8B). We continue leveraging this for extreme long context finetuning.
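Conceptually, the trick is to save each layer's input in system RAM during the forward pass and bring it back for recomputation during the backward pass. Below is a minimal sketch of that idea; the actual implementation is far more careful about pinned memory, CUDA streams, and overlapping transfers with compute:

```python
import torch

class OffloadedCheckpoint(torch.autograd.Function):
    @staticmethod
    def forward(ctx, layer, x):
        ctx.layer = layer
        # Park the activation in system RAM, freeing GPU VRAM
        ctx.cpu_input = x.detach().to("cpu", non_blocking=True)
        with torch.no_grad():
            return layer(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Bring the activation back to the GPU and recompute the forward pass
        x = ctx.cpu_input.to(grad_output.device, non_blocking=True).requires_grad_(True)
        with torch.enable_grad():
            output = ctx.layer(x)
        output.backward(grad_output)
        return None, x.grad

# Usage: y = OffloadedCheckpoint.apply(decoder_layer, hidden_states)
```

In Unsloth this is enabled by passing use_gradient_checkpointing = "unsloth" to get_peft_model, as in the training sketch earlier.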
💾 System RAM usage
Just a reminder that you will need a bit more system RAM, since Unsloth smartly offloads activations there!
For Llama 3.3 (70B) Instruct, which has 80 hidden layers and a hidden size of 8192, an 89,000-token sequence will need at least 89,000 x 80 x 8192 x 2 bytes = 109GB of system RAM.
For Llama 3.1 (8B), which has 32 hidden layers and a hidden size of 4096, a 128K context length will need 32GB of system RAM. Unsloth's maximum context length of 342,000 will need 84GB of system RAM.
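The same arithmetic works for any model. A small helper, assuming one 2-byte bf16/fp16 activation of shape [seq_len, hidden_size] is offloaded per hidden layer:

```python
def offloaded_activation_gib(seq_len, num_layers, hidden_size, bytes_per_value=2):
    # One bf16/fp16 activation of [seq_len, hidden_size] saved per hidden layer
    return seq_len * num_layers * hidden_size * bytes_per_value / 2**30

print(offloaded_activation_gib(89_000, 80, 8192))      # Llama 3.3 (70B): ~109GB
print(offloaded_activation_gib(128 * 1024, 32, 4096))  # Llama 3.1 (8B) at 128K: 32GB
print(offloaded_activation_gib(342_000, 32, 4096))     # Llama 3.1 (8B) at 342K: ~84GB
```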
🦙 Llama 3.1 (8B) benchmarks
We tested Llama 3.1 (8B) Instruct and applied 4bit QLoRA to all linear layers (Q, K, V, O, gate, up and down) with rank = 32 and a batch size of 1. We padded all sequences to a fixed maximum sequence length to mimic long context finetuning workloads.
Llama 3.1 (8B) max. context length

| GPU VRAM | Unsloth (+ Apple CCE) | Unsloth (Old) | Hugging Face + FA2 |
| --- | --- | --- | --- |
| 8 GB | 2,972 | 1,586 | OOM |
| 12 GB | 21,848 | 6,744 | 932 |
| 16 GB | 40,724 | 11,903 | 2,551 |
| 24 GB | 78,475 | 22,220 | 5,789 |
| 40 GB | 153,977 | 42,855 | 12,264 |
| 48 GB | 191,728 | 53,173 | 15,502 |
| 80 GB | 342,733 | 94,442 | 28,454 |
💕 Thank you!
As usual, a huge thank you to everyone for using & sharing Unsloth - we really appreciate it. 🙏 We would like to thank Apple's team, including Erik Wijmans, Brody Huval, Alexander Hertzberg, Vladlen Koltun, and Philipp Kr., for their research, and Erik for his help.
As always, be sure to join our Reddit page and Discord server for help or just to show your support! You can also follow us on Twitter and Substack.