Meta's update to their Llama 3 models makes them the most advanced models to date. Llama 3.1 was trained on 15.6T tokens, expands context lengths to 128K, and now supports new languages. Unsloth makes Llama 3.1 (8B) finetuning 2.1x faster and uses 60% less memory than Flash Attention 2 (FA2) + Hugging Face (HF). Llama 3.1 (70B) is 1.9x faster and uses 65% less VRAM.
We uploaded a Google Colab notebook to finetune Llama 3.1 (8B) on a free Tesla T4: Llama 3.1 (8B) Notebook. We also have a new UI on Google Colab for chatting with your Llama 3.1 Instruct models, which uses our own 2x faster inference engine.
Unsloth provides 6x longer context lengths for Llama 3.1 training. On a single 80GB A100 GPU, Llama 3.1 (70B) with Unsloth can fit 48K total tokens (8192 * bsz of 5) vs 7K tokens without Unsloth.
We also uploaded pre-quantized 4bit models for 4x faster downloading to our Hugging Face page, which includes Llama 3.1 Instruct (8B, 70B and 405B) and Base (8B, 70B and 405B) in 4bit bnb format.
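For reference, here is a minimal sketch of loading one of these pre-quantized 4bit checkpoints with `FastLanguageModel` (the repo id is assumed to follow our usual naming scheme):

```python
from unsloth import FastLanguageModel

# Pre-quantized 4bit Instruct checkpoint (repo id assumed from the usual naming scheme)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name     = "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    max_seq_length = 2048,
    load_in_4bit   = True,   # bnb 4bit weights = roughly 4x smaller download
    dtype          = None,   # auto-detect: bf16 on Ampere+, fp16 otherwise
)
```

Because the weights are already quantized, there is no on-the-fly quantization step and the download is roughly a quarter of the 16bit size.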
💎 Introducing Unsloth Run UI
We created a new chat UI using Gradio where users can upload and chat with their Llama 3.1 Instruct models online for free on Google Colab. The chat UI is a work in progress and we will add support for all models later. It's entirely powered by our own inference engine which provides 2x faster inference than Hugging Face.
This release is just a taste of what to expect from Unsloth Studio (Beta), our upcoming fine-tuning UI.
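The UI wraps all of this in Gradio, but if you want the same faster inference path directly in a notebook, a rough sketch looks like the following (the prompt and generation settings are placeholders):

```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name     = "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    max_seq_length = 2048,
    load_in_4bit   = True,
)
FastLanguageModel.for_inference(model)   # switch on the faster inference path

messages  = [{"role": "user", "content": "Why is the sky blue?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt = True, return_tensors = "pt"
).to(model.device)

output = model.generate(input_ids = input_ids, max_new_tokens = 128)
print(tokenizer.decode(output[0], skip_special_tokens = True))
```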
🦙 Llama 3.1 Benchmarks
| Model | VRAM | 🦥 Unsloth speed | 🦥 VRAM reduction | 🦥 Longer context | 🤗 Hugging Face + FA2 |
|---|---|---|---|---|---|
| Llama 3.1 (8B) | 24GB | 2.1x | 60% | 3x longer | 1x |
| Llama 3.1 (70B) | 80GB | 1.9x | 65% | 6x longer | 1x |
We tested using the Alpaca Dataset, a batch size of 2, gradient accumulation steps of 4, rank = 32, and applied QLoRA on all linear layers (q, k, v, o, gate, up, down).
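For readers who want to reproduce a setup close to this, the sketch below mirrors the benchmark configuration (rank 32 QLoRA on all linear projections, batch size 2, gradient accumulation 4). The Alpaca prompt template, learning rate, step count and dataset mirror ("yahma/alpaca-cleaned") are illustrative choices, and the SFTTrainer keyword arguments reflect the trl version current at the time of writing:

```python
from unsloth import FastLanguageModel
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name     = "unsloth/Meta-Llama-3.1-8B-bnb-4bit",   # 4bit base model for QLoRA
    max_seq_length = 2048,
    load_in_4bit   = True,
)

# Rank 32 LoRA adapters on all linear layers, as in the benchmarks above
model = FastLanguageModel.get_peft_model(
    model,
    r = 32,
    lora_alpha = 32,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing = "unsloth",
)

# Alpaca dataset, flattened into a single "text" column
prompt = "### Instruction:\n{}\n\n### Input:\n{}\n\n### Response:\n{}"
def to_text(example):
    return {"text": prompt.format(example["instruction"], example["input"],
                                  example["output"]) + tokenizer.eos_token}
dataset = load_dataset("yahma/alpaca-cleaned", split = "train").map(to_text)

trainer = SFTTrainer(
    model              = model,
    tokenizer          = tokenizer,
    train_dataset      = dataset,
    dataset_text_field = "text",
    max_seq_length     = 2048,
    args = TrainingArguments(
        per_device_train_batch_size = 2,   # batch size 2
        gradient_accumulation_steps = 4,   # grad accumulation 4
        learning_rate               = 2e-4,
        max_steps                   = 60,
        logging_steps               = 1,
        output_dir                  = "outputs",
    ),
)
trainer.train()
```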
🧶 6x longer context lengths
Unsloth significantly enhances long-context support for Llama 3.1 (70B), fitting it on a 48GB GPU and enabling fine-tuning at ~7K context lengths. In comparison, HF + FA2 can only manage tiny context lengths on the same card before hitting out-of-memory (OOM) errors. Meta's update increases context lengths to 128K, but the longer contexts require more VRAM.
With 80GB of VRAM, Unsloth supports 6x longer context lengths with just a +1.9% overhead, allowing fine-tuning on 48K sequence lengths versus 7.5K with HF + FA2. The experimental data below shows Unsloth's clear advantage over HF + FA2 for long-context fine-tuning.
Llama 3.1 (70B) max. context length

| GPU VRAM | Unsloth | Hugging Face + FA2 |
|---|---|---|
| 48 GB | 7,698 | OOM |
| 80 GB | 48,053 | 7,433 |
In all our experiments, we used QLoRA with a rank of 32 and applied LoRA adapters to all linear layers (q, k, v, o, gate, up, down). We used a batch size of 1 and repeated the data so that it filled the maximum context window.
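As a rough sketch of that long-context setup (the 70B repo id is assumed to follow our usual naming; pair this with a batch size of 1 in your trainer arguments, as in the earlier snippet):

```python
from unsloth import FastLanguageModel

max_seq_length = 48000   # just under the 48,053 maximum from the table above (80GB GPU)

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name     = "unsloth/Meta-Llama-3.1-70B-bnb-4bit",
    max_seq_length = max_seq_length,
    load_in_4bit   = True,
)

model = FastLanguageModel.get_peft_model(
    model,
    r = 32,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    # Unsloth's offloaded gradient checkpointing helps the long context fit in VRAM
    use_gradient_checkpointing = "unsloth",
)
```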
🦙 Llama 3.1 (8B) finetuning fits in 8GB
Using a batch size of 1 and a LoRA rank of 32 on all linear layers, HF + FA2 runs out of memory (OOM) on 8GB GPU cards, since it needs around 9GB of memory. In contrast, Unsloth comfortably supports 2K context lengths on the same 8GB cards. On a 24GB consumer card, Unsloth allows 20K context lengths, 3.5 times longer than HF + FA2.
The table below shows VRAM consumption vs. context length, tested on an L4 GPU via Colab:
Llama 3.1 (8B) max. context length

| GPU VRAM | Unsloth | Hugging Face + FA2 |
|---|---|---|
| 8 GB | 1,983 | OOM |
| 12 GB | 6,638 | 1,044 |
| 16 GB | 11,292 | 2,663 |
| 24 GB | 20,601 | 5,901 |
| 40 GB | 39,219 | 12,377 |
| 48 GB | 48,528 | 15,615 |
| 80 GB | 85,765 | 28,567 |
🔎 Llama 3.1 Analysis
Though the architecture mostly remains the same as Llama 3, there are some key differences. Outputs from all 3.1 models can now be used to train other models (not just Llama models), and the 3.1 release adds fp8 quantization for inference. Here is our tweet and a list of our findings:
New RoPE extension method: uses an interesting low and high frequency scaling factor and scales the inv_freq vector. It can be computed in one go, so there is no need for dynamic recomputation (see the sketch after this list). Meta used a 6-stage ramp-up approach from 8K tokens to 128K tokens, trained on 800B tokens.
Training: 38% to 43% MFU using bfloat16. Pipeline parallelism plus FSDP was used. Model averaging for the RM, SFT & DPO stages.
Data mixture: 50% general knowledge, 25% maths & reasoning, 17% code data and tasks, 8% multilingual data.
Preprocessing steps: uses RoBERTa, DistilRoBERTa and fastText to filter for good-quality data, plus lots of de-duplication and heuristics to remove bad data.
Float8 quantization: quantizes both the weights and the inputs to fp8, then multiplies by scaling factors; the fp8 x fp8 matmul produces a bf16 output. Faster for inference and uses less VRAM.
Vision & speech experiments: the Llama 3.1 team also trained vision & speech adapters. They weren't released, but very cool!
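To make the RoPE finding above more concrete, here is a minimal sketch of the low/high-frequency scaling of the inv_freq vector, loosely following the llama3 rope-scaling logic in Hugging Face transformers. The default values (factor 8, low_freq_factor 1, high_freq_factor 4, original 8192 context) come from the released Llama 3.1 config; the function name is ours:

```python
import math
import torch

def llama3_scale_inv_freq(
    inv_freq: torch.Tensor,          # standard RoPE inverse frequencies
    factor: float = 8.0,             # overall context extension factor
    low_freq_factor: float = 1.0,
    high_freq_factor: float = 4.0,
    old_context_len: int = 8192,     # Llama 3's original context window
) -> torch.Tensor:
    wavelen = 2 * math.pi / inv_freq
    low_freq_wavelen  = old_context_len / low_freq_factor
    high_freq_wavelen = old_context_len / high_freq_factor

    # Long wavelengths (low frequencies) are stretched by `factor`;
    # short wavelengths (high frequencies) are left untouched.
    scaled = torch.where(wavelen > low_freq_wavelen, inv_freq / factor, inv_freq)

    # Wavelengths in between are smoothly interpolated between the two regimes.
    smooth = (old_context_len / wavelen - low_freq_factor) / (high_freq_factor - low_freq_factor)
    smoothed = (1 - smooth) * (inv_freq / factor) + smooth * inv_freq
    is_medium = (wavelen >= high_freq_wavelen) & (wavelen <= low_freq_wavelen)
    return torch.where(is_medium, smoothed, scaled)

# The scaling is applied once at initialisation - no dynamic recomputation needed.
head_dim = 128
base     = 500000.0   # Llama 3 family RoPE theta
inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
inv_freq_scaled = llama3_scale_inv_freq(inv_freq)
```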
💕 Thank you!
Meta released an article by Mark Zuckerberg addressing the importance of open-source: "We need to train, fine-tune, and distill our own models. Every organization has different needs that are best met with models of different sizes that are trained or fine-tuned with their specific data. On-device tasks and classification tasks require small models, while more complicated tasks require larger models. Now you’ll be able to take the most advanced Llama models, continue training them with your own data and then distill them down to a model of your optimal size – without us or anyone else seeing your data." So a big thank you to the Meta team as always for supporting open-source!
Feel free to support us via our Ko-fi donation page. Huge shout out to: Marshall from NASA, Anthony, John, Pichet & Steven who are new supporters! 🙏
As always, be sure to join our Discord server for help or just to show your support! You can also follow us on Twitter and Substack.