Fine-tune & Run Llama 3.2 with Unsloth

Sep 25, 2024 • By Daniel & Michael

Llama 3.2 (3B) on 1x L4 24GB: 210% faster, -60% VRAM
Llama 3.2 (90B) on 1x A100 80GB: 190% faster, -65% VRAM

Meta's new Llama 3.2 models come in 1B, 3B, 11B and 90B sizes with 128K context lengths. Unsloth makes Llama 3.2 (3B) finetuning 2x faster with 60% less memory than Flash Attention 2 (FA2) + Hugging Face (HF). Llama 3.2 (90B) finetuning is 2x faster with 65% less VRAM.

We uploaded Google Colab notebooks to finetune Llama 3.2 on a free Tesla T4: Llama 3.2 (3B) and Llama 3.2 (11B) Vision. We also have a new UI on Google Colab for chatting with your Llama 3.1 Instruct models, which uses our own 2x faster inference engine.

We also uploaded pre-quantized 4bit models (for 4x faster downloads) to our Hugging Face page, which includes Llama 3.2 Instruct (1B, 3B, 11B and 90B) and Base (1B, 3B, 11B and 90B) in 4bit bnb (bitsandbytes) format.
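For example, here is a minimal sketch of loading one of those 4bit uploads directly with Unsloth (the exact repo name below is an assumption; swap in whichever size you need from our Hugging Face page):

```python
from unsloth import FastLanguageModel

# The repo name below is an assumption; any of the pre-quantized 4bit
# uploads can be substituted here.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-3B-Instruct-bnb-4bit",
    max_seq_length = 2048,   # illustrative; Llama 3.2 supports up to 128K
    load_in_4bit = True,     # load the bnb 4bit weights directly (4x faster download)
)
```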
👁️Vision/multimodal models now supported
One of Unsloth's most highly requested features is now supported! You can now fine-tune Llama 3.2's vision models using Unsloth so be sure to experiment and post your results!
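For a rough idea of what this looks like, here is a minimal sketch using Unsloth's vision API (the repo name, ranks and exact argument names are illustrative and may differ slightly between Unsloth versions; the Llama 3.2 (11B) Vision notebook linked above is the reference):

```python
from unsloth import FastVisionModel

# Illustrative 4bit load of the 11B Vision model (repo name is an assumption)
model, tokenizer = FastVisionModel.from_pretrained(
    "unsloth/Llama-3.2-11B-Vision-Instruct-bnb-4bit",
    load_in_4bit = True,
)

# Attach LoRA adapters; you can choose which parts of the model to train
model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers   = True,   # adapt the vision encoder
    finetune_language_layers = True,   # adapt the language model
    r = 16,                            # illustrative LoRA rank
    lora_alpha = 16,
)
```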
🦙 Llama 3.2 Benchmarks
| Model | VRAM | 🦥 Unsloth speed | 🦥 VRAM reduction | 🦥 Longer context | 🤗 Hugging Face + FA2 |
|---|---|---|---|---|---|
| Llama 3.2 (1B) | 24GB | 2x | 60% | 3x longer | 1x |
| Llama 3.2 (3B) | 24GB | 2x | 65% | 6x longer | 1x |
| Llama 3.2 (11B) | 80GB | 2x | 65% | 6x longer | 1x |
| Llama 3.2 (90B) | 80GB | 2x | 65% | 6x longer | 1x |
We tested using the Alpaca dataset with a batch size of 2, gradient accumulation steps of 4, a LoRA rank of 32, and QLoRA applied to all linear layers (q, k, v, o, gate, up, down).
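For reference, here is a sketch of that benchmark-style configuration in Unsloth. The repo name, dataset repo, lora_alpha, learning rate and step count are illustrative placeholders, not our exact benchmark script:

```python
import torch
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer
from unsloth import FastLanguageModel

# QLoRA: 4bit base weights + LoRA adapters (repo name is illustrative)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-3B-Instruct-bnb-4bit",
    max_seq_length = 2048,
    load_in_4bit = True,
)

# LoRA rank 32 on all linear layers, matching the benchmark setup
model = FastLanguageModel.get_peft_model(
    model,
    r = 32,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 32,   # illustrative
)

# Alpaca-style data mapped into a single "text" field (dataset repo is illustrative)
dataset = load_dataset("yahma/alpaca-cleaned", split = "train")
def to_text(row):
    return {"text": f"### Instruction:\n{row['instruction']}\n\n"
                    f"### Input:\n{row['input']}\n\n"
                    f"### Response:\n{row['output']}" + tokenizer.eos_token}
dataset = dataset.map(to_text)

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = 2048,
    args = TrainingArguments(
        per_device_train_batch_size = 2,   # batch size of 2
        gradient_accumulation_steps = 4,   # gradient accumulation of 4
        max_steps = 60,                    # illustrative; not the benchmark run length
        learning_rate = 2e-4,              # illustrative
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        output_dir = "outputs",
    ),
)
trainer.train()
```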
🧶 6x longer context lengths
Unsloth significantly enhances long-context support for Llama 3.1 (70B), fitting it on a 48GB GPU and enabling fine-tuning at ~7K context lengths. In comparison, HF + FA2 can only manage much shorter lengths or simply goes out of memory (OOM). Meta's update increases context lengths to 128K, but longer contexts require more VRAM.

With 80GB VRAM, Unsloth supports 6x longer context lengths with just a +1.9% overhead, allowing fine-tuning on 48K sequence lengths versus 7.5K lengths. Experimental data shows Unsloth’s clear advantage over HF + FA2 for long context fine-tuning.

Llama 3.1 (70B) max. context length

| GPU VRAM | Unsloth | Hugging Face + FA2 |
|---|---|---|
| 48 GB | 7,698 | OOM |
| 80 GB | 48,053 | 7,433 |
In all our experiments, we used QLoRA with a rank of 32 and applied LoRA adapters to all linear layers (q, k, v, o, gate, up, down). We used a batch size of 1 and repeated the data so it filled the maximum context window.
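As a rough sketch, a long-context run like the 80GB one above mainly comes down to raising max_seq_length and enabling Unsloth's gradient checkpointing. The repo name and exact length below are placeholders:

```python
from unsloth import FastLanguageModel

# Illustrative: fit Llama 3.1 (70B) QLoRA on an 80GB GPU at long context
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-70B-Instruct-bnb-4bit",  # assumed repo name
    max_seq_length = 47_000,   # just under the ~48K maximum we measured on 80GB
    load_in_4bit = True,
)

model = FastLanguageModel.get_peft_model(
    model,
    r = 32,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing = "unsloth",  # offloads activations for long context
)
```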
🦙 Llama 3.1 (8B) finetuning fits in 8GB
Using a batch size of 1 and a LoRA rank of 32 on all linear layers, HF + FA2 runs out of memory (OOM) on 8GB GPU cards, since it needs around 9GB. In contrast, Unsloth comfortably supports 2K context lengths on the same 8GB cards. On a 24GB consumer card, Unsloth allows 20K context lengths, 3.5 times longer than HF + FA2.

Below we show the maximum context lengths we measured for various amounts of GPU VRAM:

Llama 3.1 (8B) max. context length

| GPU VRAM | Unsloth | Hugging Face + FA2 |
|---|---|---|
| 8 GB | 1,983 | OOM |
| 12 GB | 6,638 | 1,044 |
| 16 GB | 11,292 | 2,663 |
| 24 GB | 20,601 | 5,901 |
| 40 GB | 39,219 | 12,377 |
| 48 GB | 48,528 | 15,615 |
| 80 GB | 85,765 | 28,567 |
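If you want to check numbers like these on your own GPU, a simple way to record peak VRAM after a run is to query PyTorch's allocator. This is an illustrative snippet, not our exact measurement script:

```python
import torch

# After a fine-tuning run (e.g. trainer.train()), read the peak reserved VRAM
peak_gb = torch.cuda.max_memory_reserved() / 1024**3
print(f"Peak reserved VRAM: {peak_gb:.2f} GB")

# Reset the counter before trying a different context length
torch.cuda.reset_peak_memory_stats()
```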
💕 Thank you! 
As usual, a huge thank you to everyone for using & sharing Unsloth - we really appreciate it. Also a huge shout out to Marshall from Jun, John, Steven & Aaron, who are new supporters! 🙏

As always, be sure to join our Discord server for help or just to show your support! You can also follow us on Twitter and Substack.
Thank you for reading!
Daniel & Michael Han 🦥
25 Sep 2024
