Meta's new Llama 3.2 models come in 1B, 3B, 11B and 90B sizes with 128K context lengths. Unsloth makes Llama 3.2 (3B) finetuning 2x faster and uses 60% less memory than Flash Attention 2 (FA2) + Hugging Face (HF). Llama 3.2 (90B) finetuning is 2x faster and uses 65% less VRAM.
We uploaded Google Colab notebooks to finetune Llama 3.2 on a free Tesla T4: Llama 3.2 (3B) and Llama 3.2 (11B) Vision. We also have a new UI on Google Colab for chatting with your Llama 3.1 Instruct models, which uses our own 2x faster inference engine.
We also uploaded pre-quantized 4bit models for 4x faster downloading to our Hugging Face page, which includes Llama 3.2 Instruct (1B, 3B, 11B and 90B) and Base (1B, 3B, 11B and 90B) in 4bit bnb form.
👁️Vision/multimodal models now supported
One of Unsloth's most highly requested features is now supported! You can now fine-tune Llama 3.2's vision models using Unsloth so be sure to experiment and post your results!
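If you want a starting point, below is a minimal sketch of what a vision fine-tuning setup can look like with Unsloth's FastVisionModel API. The checkpoint name and LoRA settings are illustrative assumptions, not the exact configuration from our notebooks:

```python
# Sketch only: checkpoint name and LoRA hyperparameters are illustrative.
from unsloth import FastVisionModel

model, tokenizer = FastVisionModel.from_pretrained(
    "unsloth/Llama-3.2-11B-Vision-Instruct",  # assumed checkpoint name
    load_in_4bit = True,                      # QLoRA-style 4bit loading
)

# Attach LoRA adapters to both the vision and language parts of the model
model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers   = True,
    finetune_language_layers = True,
    r          = 16,
    lora_alpha = 16,
)
```

From here, training proceeds like any other Unsloth LoRA run; the Colab notebook mentioned above walks through the full pipeline.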
🦙 Llama 3.2 Benchmarks
| Model | VRAM | 🦥 Unsloth speed | 🦥 VRAM reduction | 🦥 Longer context | 🤗 Hugging Face + FA2 |
|-------|------|------------------|-------------------|-------------------|-----------------------|
| Llama 3.2 (1B) | 24GB | 2x | 60% | 3x longer | 1x |
| Llama 3.2 (3B) | 24GB | 2x | 65% | 6x longer | 1x |
| Llama 3.2 (11B) | 80GB | 2x | 65% | 6x longer | 1x |
| Llama 3.2 (90B) | 80GB | 2x | 65% | 6x longer | 1x |
We tested using the Alpaca Dataset, a batch size of 2, gradient accumulation steps of 4, rank = 32, and applied QLoRA on all linear layers (q, k, v, o, gate, up, down).
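For context, that kind of setup looks roughly like the sketch below. The checkpoint and dataset identifiers are illustrative assumptions (one of our pre-quantized 4bit uploads plus a cleaned Alpaca mirror), and exact argument names can shift between trl versions:

```python
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

# Pre-quantized 4bit checkpoint: downloads ~4x faster than full-precision weights
model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/Llama-3.2-3B-Instruct-bnb-4bit",  # assumed checkpoint name
    max_seq_length = 2048,
    load_in_4bit   = True,
)

# QLoRA: rank-32 LoRA adapters on all linear layers
model = FastLanguageModel.get_peft_model(
    model,
    r = 32,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 32,
)

# Alpaca-style dataset, flattened into a single "text" field
def to_text(example):
    return {"text": (
        f"### Instruction:\n{example['instruction']}\n\n"
        f"### Input:\n{example['input']}\n\n"
        f"### Response:\n{example['output']}" + tokenizer.eos_token
    )}

dataset = load_dataset("yahma/alpaca-cleaned", split = "train").map(to_text)  # assumed dataset repo

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = 2048,
    args = TrainingArguments(
        per_device_train_batch_size = 2,   # batch size 2
        gradient_accumulation_steps = 4,   # gradient accumulation steps 4
        max_steps = 60,
        learning_rate = 2e-4,
        output_dir = "outputs",
    ),
)
trainer.train()
```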
🧶 6x longer context lengths
Unsloth significantly enhances long context support for Llama 3.1 (70B), fitting it on a 48GB GPU and enabling fine-tuning at ~7K context lengths. In comparison, HF + FA2 can only handle much shorter context lengths or hits out-of-memory (OOM) errors outright. Meta's update increases context lengths to 128K but requires more VRAM.
With 80GB of VRAM, Unsloth supports 6x longer context lengths with just +1.9% overhead, allowing fine-tuning at 48K sequence lengths versus 7.5K for HF + FA2. Our experiments show Unsloth's clear advantage over HF + FA2 for long context fine-tuning.
Llama 3.1 (70B) max. context length
| GPU VRAM | Unsloth | Hugging Face + FA2 |
|----------|---------|--------------------|
| 48 GB | 7,698 | OOM |
| 80 GB | 48,053 | 7,433 |
In all our experiments, we used QLoRA with a rank of 32 and applied LoRA adapters to all linear layers (q, k, v, o, gate, up, down). We used a batch size of 1 and repeated the data so it filled the maximum context window.
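The "repeated the data" step simply tiles one tokenized example until it fills the target sequence length. A minimal sketch, assuming a Llama tokenizer is available locally (the helper below is for illustration only, not part of the benchmark scripts):

```python
from transformers import AutoTokenizer

def repeat_to_length(text, tokenizer, max_seq_length):
    """Tile one example's tokens until the row is exactly max_seq_length long."""
    ids = tokenizer(text, add_special_tokens = False)["input_ids"]
    return (ids * (max_seq_length // len(ids) + 1))[:max_seq_length]

# Example: build a single 48K-token row for the 80GB long-context test
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-70B")  # assumed tokenizer repo
row = repeat_to_length("Below is an instruction that describes a task. ...", tokenizer, 48_000)
```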
🦙 Llama 3.1 (8B) finetuning fits in 8GB
Using a batch size of 1 and a LoRA rank of 32 on all linear layers, HF + FA2 runs out of memory (OOM) on 8GB GPU cards, since it needs around 9GB of memory. In contrast, Unsloth comfortably supports 2K context lengths on the same 8GB cards. On a 24GB consumer card, Unsloth allows for 20K context lengths, which is 3.5 times longer than HF + FA2.
Below is the maximum context length vs. GPU VRAM, tested on an L4 GPU via Colab:
Llama 3.1 (8B) max. context length
| GPU VRAM | Unsloth | Hugging Face + FA2 |
|----------|---------|--------------------|
| 8 GB | 1,983 | OOM |
| 12 GB | 6,638 | 1,044 |
| 16 GB | 11,292 | 2,663 |
| 24 GB | 20,601 | 5,901 |
| 40 GB | 39,219 | 12,377 |
| 48 GB | 48,528 | 15,615 |
| 80 GB | 85,765 | 28,567 |
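If you want to sanity-check VRAM numbers like these on your own GPU, PyTorch exposes peak memory counters. One simple way to read them (a sketch, not our exact benchmarking harness):

```python
import torch

# Reset the peak-memory counter before the run you want to measure
torch.cuda.reset_peak_memory_stats()

# ... run trainer.train() or a forward/backward pass here ...

# Report peak reserved VRAM in GB
peak_gb = torch.cuda.max_memory_reserved() / 1024**3
print(f"Peak reserved VRAM: {peak_gb:.2f} GB")
```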
💕 Thank you!
As usual, a huge thank you to everyone for using & sharing Unsloth - we really appreciate it. Also a huge shout out to: Marshall from Jun, John, Steven & Aaron who are new supporters! 🙏
As always, be sure to join our Discord server for help or just to show your support! You can also follow us on Twitter and Substack.