Finetune Mistral 14x faster
Dec 14, 2023 • By Daniel Han
We’re excited to release QLoRA support for Mistral 7B, CodeLlama 34B, and all other models based on the Llama architecture! We added sliding window attention, preliminary Windows and DPO support, and will share all 59 notebooks for reproducing our numbers.
You can now QLoRA finetune Mistral 7B 2.2x faster on 1x A100 with 62% less memory, or 12.4GB peak VRAM. CodeLlama 34B is 1.9x faster on 1x A100, using 32% less memory, or 27GB peak VRAM. It finally doesn't OOM!

Our PRO version can finetune Mistral 7B a whopping 14x faster on 1x A100, using 70% less peak VRAM, and CodeLlama 34B is 13x faster on 1x A100, using 50% less VRAM, or 20GB peak VRAM. Llama 7B is also 21x faster on 1x A100, with a crazy 71% reduction in peak VRAM usage. On 2x Tesla T4s, Llama 7B is 28x faster via DDP support.

We also added installing Flash Attention v2 directly rather than through Xformers. If you have an RTX 3060 or higher (A100, H100, etc.), use the "ampere" install path:
pip install "unsloth[cu118_ampere] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu121_ampere] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[colab_ampere] @ git+https://github.com/unslothai/unsloth.git"
As requested, we provide a preliminary breakdown of how we made things faster with Unsloth!
Benchmarking
| 1 A100 40GB | Hugging Face | Flash Attention 2 | Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
|---|---|---|---|---|---|---|
| Alpaca | 1x | 1.04x | 1.98x | 2.48x | 5.32x | 15.64x |
| LAION Chip2 | 1x | 0.92x | 1.61x | 1.84x | 7.05x | 20.73x |
| OASST | 1x | 1.19x | 2.17x | 2.66x | 5.04x | 14.83x |
| Slim Orca | 1x | 1.18x | 2.22x | 2.64x | 5.04x | 14.82x |
We benchmark Unsloth against Hugging Face’s original implementation, and against adding Flash Attention 2 support, on 1x A100 via Google Colab. Flash Attention at most speeds up training by 1.2x, whilst with Unsloth’s open source package, training is 2.2x faster. “Unsloth Equal” is our PRO version, but under the condition that all settings and the loss curve stay the same. Under this scenario, we further boost training to 2.7x. Our MAX version can boost speeds on the LAION dataset to 21x!
All benchmarks use the following setup (unless some tests OOM, in which case we decrease the batch size for all tests); a sketch of how these settings map onto a standard TRL training script follows the list:
QLoRA nf4 layers = [
"q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj",
]
QLoRA rank = 16, alpha = 16, dropout = 0
max_seq_length = 2048
learning_rate = 2e-4
weight_decay = 0.01
max_steps = 240
warmup_steps = 10
batch_size = 4
gradient_accumulation_steps = 4
lr_scheduler_type = "linear"
optimizer = "adamw_8bit", bfloat16
use_gradient_checkpointing = True
random_state = 3407
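For concreteness, here is a minimal sketch of how the settings above could be wired into a TRL SFTTrainer run. This is not our exact benchmark notebook: the pre-quantized checkpoint name, the Alpaca dataset choice, and the FastLanguageModel calls are illustrative assumptions.

```python
# Hedged sketch only, not the exact benchmark script. Model/dataset names and
# the FastLanguageModel usage are assumptions for illustration.
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/mistral-7b-bnb-4bit",  # assumed pre-quantized checkpoint
    max_seq_length = 2048,
    load_in_4bit = True,                         # QLoRA nf4
)
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, lora_alpha = 16, lora_dropout = 0,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing = True,
    random_state = 3407,
)

dataset = load_dataset("tatsu-lab/alpaca", split = "train")  # ships a ready-made "text" field

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = 2048,
    args = TrainingArguments(
        per_device_train_batch_size = 4,
        gradient_accumulation_steps = 4,
        max_steps = 240,
        warmup_steps = 10,
        learning_rate = 2e-4,
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        optim = "adamw_8bit",
        bf16 = True,
        seed = 3407,
        output_dir = "outputs",
    ),
)
trainer.train()
```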
| 1 A100 40GB | Hugging Face | Flash Attention 2 | Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
|---|---|---|---|---|---|---|
| Mistral 7B Slim Orca | 1x | 1.15x | 2.15x | 2.53x | 4.61x | 13.69x |
| code | Code | Code | Code | Code | | |
| seconds | 1813 | 1571 | 842 | 718 | 393 | 132 |
| memory MB | 32853 | 19385 | 12465 | 10271 | | |
| memory saved % | | 40.99 | 62.06 | 68.74 | | |
On Mistral 7B for 1x A100, Flash Attention v2 boosts training by 1.15x, whilst Unsloth Open boosts it by 2.15x and reduces memory by 62%. Again, "Unsloth Equal", which runs an equalized training run, boosts speeds by 2.53x and uses 69% less peak VRAM. Unsloth MAX boosts training by 13.7x.
You can click on "Code" to access our shared notebooks for reproducibility purposes. "Unsloth Equal" only shows the training losses and obscures our other codepaths.
| 1 A100 40GB | Hugging Face | Flash Attention 2 | Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
|---|---|---|---|---|---|---|
| Code Llama 34B | OOM ❌ | 0.99x | 1.87x | 2.61x | 4.27x | 12.82x |
| code | Code | Code | Code | Code | | |
| seconds | 1953 | 1982 | 1043 | 748 | 458 | 152 |
| memory MB | 40000 | 33217 | 27413 | 22161 | | |
| memory saved % | | 16.96 | 31.47 | 44.60 | | |
On CodeLlama 34B, we used a batch size of 1 with a sequence length of 4096. Unfortunately, Hugging Face's original implementation OOMs at batch size = 2, so we had to benchmark at bsz = 1. Flash Attention v2 does not noticeably change the runtime, whilst Unsloth Open is 1.87x faster and uses 32% less peak VRAM. "Unsloth Equal" is 2.6x faster and uses 45% less memory, and MAX is 12.8x faster.
| 2 T4 DDP | Hugging Face | Flash Attention | Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
|---|---|---|---|---|---|---|
| LAION Chip2 | 1x | 1.12x | 5.28x | 4.21x | 10.01x | 28.32x |
| code | Code | Code | Code | | | |
| seconds | 5418 | 4854 | 1027 | 1286 | 541 | 191 |
| memory MB | 7316 | 7316 | 5732 | 5934 | | |
| memory saved % | | 0.00 | 21.65 | 18.89 | | |
Via Kaggle's 2x Tesla T4 instance, we find that Unsloth Open trains 5.3x faster while using only 1 GPU. To be fair, we multiply the gradient accumulation steps by 2, since the open source version does not support multi GPU. "Unsloth Equal" is 4.21x faster via DDP; DDP has an overhead, since gradients must be synchronized at each step. Unsloth MAX trains 28x faster!
At the end of this blog post, we provide the full tables of all benchmarks, and all 59 notebook links for reproducibility purposes.
Performance breakdowns bit by bit
| No. | Method | Time (s) | Peak VRAM (GB) | Time saved (%) | VRAM saved (%) | Final error |
|---|---|---|---|---|---|---|
| 1 | Huggingface Original PEFT QLoRA | 594 | 16.7 | | | 1.0202 |
| 2 | Reduce data upcasting | 465 | 15.5 | 21.7% | 7.2% | 1.0203 |
| 3 | Bitsandbytes bfloat16 | 424 | 15.3 | 8.9% | 1.3% | 1.0208 |
| 4 | SDPA | 418 | 14.9 | 1.4% | 2.6% | 1.0214 |
| 5 | SDPA causal = True | 384 | 14.9 | 8.1% | 0.0% | 1.0219 |
| 6 | Xformers | 353 | 9.1 | 8.1% | 38.9% | 1.021 |
| 7 | Flash Attention 2 | 353 | 9.1 | 0.0% | 0.0% | 1.0215 |
| 8 | Fast RoPE Embeddings | 326 | 9 | 7.6% | 1.1% | 1.0211 |
| 9 | Fast RMS Layernorm | 316 | 9 | 3.1% | 0.0% | 1.021 |
| 10 | Fast Cross Entropy Loss | 315 | 7.4 | 0.4% | 17.8% | 1.021 |
| 11 | Manual Autograd MLP | 302 | 6.8 | 4.0% | 8.1% | 1.0222 |
| 12 | Manual Autograd QKV | 297 | 6.8 | 1.7% | 0.0% | 1.0217 |
We provide a breakdown of each optimization we did for the open source version of Unsloth, on an A10G GPU via AWS. By implementing our changes, we can speed up training by 2x and use 61% less peak VRAM. This section is somewhat more maths heavy, so be warned!
1. Reduce data upcasting
By reducing upcasting of weights during QLoRA, we can easily save 7.2% of VRAM and make training take 21.7% less time.
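As an aside, a quick way to see where upcasting is costing memory is to tally parameter bytes by dtype. The helper below is a generic inspection utility we wrote for this post, not Unsloth code:

```python
# Generic helper (not Unsloth code) to tally parameter memory by dtype,
# which makes unintended float32 upcasts in a QLoRA model easy to spot.
from collections import Counter
import torch

def vram_by_dtype(model: torch.nn.Module) -> dict:
    totals = Counter()
    for p in model.parameters():
        totals[str(p.dtype)] += p.numel() * p.element_size()
    return {dtype: f"{n / 2**20:.1f} MiB" for dtype, n in totals.items()}

# Example: print(vram_by_dtype(model)) after loading a 4-bit model.
```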
2. Bitsandbytes bfloat16
Bitsandbytes internally uses float16, so we have to do an extra memory copy to convert it to bfloat16. We fix this internally, saving 9% of time.
3. Scaled Dot Product Attention
We use Pytorch's fast implementation of attention (scaled dot product attention), saving 1.4% of time.
4, 5, 6. Causal Masking, Xformers, Flash Attention 2
By using a causal mask and not a separate attention mask, we made things 8.1% faster, since we don't need to read a separate mask matrix. We then switch over to Xformers, which makes things another 8.1% faster and saves a whopping 39% of VRAM usage. Switching to Flash Attention v2 had no noticeable effect, since Xformers calls FA2 internally anyway.
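To illustrate the causal masking point, here is a small standalone PyTorch snippet (not Unsloth's kernels) showing that asking scaled_dot_product_attention for causal masking directly matches passing an explicit lower-triangular mask, without ever materialising that mask:

```python
import torch
import torch.nn.functional as F

q = torch.randn(1, 8, 128, 64)   # (batch, heads, seq_len, head_dim)
k = torch.randn(1, 8, 128, 64)
v = torch.randn(1, 8, 128, 64)

# Building and reading an explicit (seq, seq) causal mask costs extra memory
# traffic; asking the kernel for causal masking avoids materialising it.
mask = torch.tril(torch.ones(128, 128, dtype=torch.bool))
out_mask   = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
out_causal = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(torch.allclose(out_mask, out_causal, atol=1e-6))  # True
```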
7. Fast RoPE Embeddings
By implementing RoPE embeddings in OpenAI's Triton, we save another 7.6% of time. But to do so, we must find the derivative of the RoPE function through manual inspection. Notice that RoPE can be rewritten as a matrix multiplication between a rotation matrix R and the original matrix Q. If we do this, the derivative is simply R transposed.
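A quick numerical check of that claim in plain PyTorch (for intuition only, not the Triton kernel): since R is a rotation matrix, R transposed is the same rotation with the angle negated, so the backward pass is just RoPE applied to the upstream gradient with flipped sign.

```python
# Backward of a RoPE rotation = the same rotation applied with -phi (i.e. R^T).
import torch

def rotate(x, phi):
    # x: (..., 2) pairs of features; rotate each pair by angle phi
    c, s = torch.cos(phi), torch.sin(phi)
    x1, x2 = x[..., 0], x[..., 1]
    return torch.stack((x1 * c - x2 * s, x1 * s + x2 * c), dim=-1)

q   = torch.randn(8, 2, dtype=torch.double, requires_grad=True)
phi = torch.rand(8, dtype=torch.double)      # one angle per position

out = rotate(q, phi)
g   = torch.randn_like(out)                  # upstream gradient dL/dY
out.backward(g)

manual = rotate(g, -phi)                     # R^T @ g = rotation by -phi
print(torch.allclose(q.grad, manual))        # True
```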
8. Fast RMS Layernorm
Unfortunately, the RMS Layernorm's derivative is much more involved. If you carefully apply the chain rule, you get an ugly derivative for the RMS Layernorm. We again implement this in OpenAI's Triton language, which boosts training by 3.1%.
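For the curious, here is the hand-derived backward pass checked against autograd in plain PyTorch (a sketch for intuition, not the Triton kernel): with r = 1/sqrt(mean(x^2) + eps), the input gradient works out to r * (g*w - x * mean(g*w*x) * r^2).

```python
# Check the hand-derived RMS Layernorm backward against autograd.
import torch

def rmsnorm(x, w, eps=1e-6):
    r = torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps)  # 1 / RMS(x)
    return w * x * r

x = torch.randn(4, 16, dtype=torch.double, requires_grad=True)
w = torch.randn(16, dtype=torch.double)

y = rmsnorm(x, w)
g = torch.randn_like(y)        # upstream gradient dL/dy
y.backward(g)

with torch.no_grad():
    r  = torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + 1e-6)
    gw = g * w
    # dL/dx = r * (g*w - x * mean(g*w*x) * r^2)
    manual = r * (gw - x * (gw * x).mean(-1, keepdim=True) * r * r)

print(torch.allclose(x.grad, manual))  # True
```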
9. Fast Cross Entropy Loss
The Cross Entropy Loss is again a bit more involved. We use the log trick, where x = exp(log(x)), to derive the derivative. We also use Wikipedia to find that the derivative of the infamous logsumexp function is in fact the softmax function! We slash VRAM usage by 17%.
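That identity is easy to confirm numerically in plain PyTorch (again just a sanity check, not our Triton kernel):

```python
# Sanity check: d/dx logsumexp(x) = softmax(x), and for cross entropy with a
# hard label the input gradient is softmax(x) - one_hot(label).
import torch
import torch.nn.functional as F

x = torch.randn(5, dtype=torch.double, requires_grad=True)
torch.logsumexp(x, dim=0).backward()
print(torch.allclose(x.grad, torch.softmax(x.detach(), dim=0)))  # True

label = torch.tensor([2])
z = torch.randn(1, 5, dtype=torch.double, requires_grad=True)
F.cross_entropy(z, label).backward()
expected = torch.softmax(z.detach(), dim=-1) - F.one_hot(label, 5).double()
print(torch.allclose(z.grad, expected))  # True
```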
10, 11. Manual Autograd
By bracketing matrix products correctly, we can massively reduce the actual number of FLOPs during LoRA finetuning! Normally Pytorch's autograd engine backpropagates through the graph from the end to the start. We find that by fusing multiple operations into one, and bracketing correctly through chained matrix multiplication, the actual number of FLOPs is reduced.

If you bracket incorrectly, like Pytorch's autograd currently does, you first do the multiplication of X.T and dW. Take X to be of size (bsz, seq_len, d). We then reshape X to be of size (m, d), where m is simply bsz * seq_len and d is the attention dimension: in Llama 7b it's 4096, whilst in Llama 70b it's 8192. The upstream gradient dW is of size (m, h), where h is the MLP intermediate size: for Llama 7b it's 11,008, and for Llama 70b it's 28,672. And B.T is the LoRA weight of size (h, r), where r is the rank of the LoRA matrix, which can be a small 16 or 64.

We find that the slow path takes around (h * d)(m + r) FLOPs, whilst the fast path, where we instead bracket around the 2nd term, takes (m * r)(h + d) FLOPs. Dividing the slow path by the fast path gives the speedup fraction:

speedup = (h * d)(m + r) / ((m * r)(h + d))

To simplify this, notice that r is normally quite small, say 16 or 64, whilst m can be very big: a batch size of 4 and a sequence length of 4096 make m = 4 * 4096 = 16,384. This makes the r in (m + r) insignificant, so we drop that addition term. We cannot do this for (m * r), since multiplying by 16 is much bigger than adding 16. Doing so gives a simplified expression, where the speedup is a function of the MLP intermediate size h, the attention size d and the LoRA rank r:

speedup ≈ (h * d) / (r * (h + d))

For Llama 7b, where h = 11,008, d = 4,096 and r = 16, we get a speedup of 186.58.
For Llama 70b, where h = 28,672, d = 8,192 and r = 16, we get a speedup of 398.22.
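These numbers are easy to reproduce with a couple of lines of Python (a back-of-the-envelope check, not Unsloth code):

```python
# Back-of-the-envelope check of the bracketing argument above.
def lora_grad_speedup(h, d, r, m):
    slow = (h * d) * (m + r)   # (X.T @ dW) first, then project onto the LoRA factor
    fast = (m * r) * (h + d)   # bracket the other way round
    return slow / fast

def simplified(h, d, r):
    return (h * d) / (r * (h + d))   # drop the +r term, as in the text

m = 4 * 4096                                   # bsz = 4, seq_len = 4096
print(lora_grad_speedup(11008, 4096, 16, m))   # ~186.8 for Llama 7b
print(simplified(11008, 4096, 16))             # 186.58, matching the text
print(lora_grad_speedup(28672, 8192, 16, m))   # ~398.6 for Llama 70b
print(simplified(28672, 8192, 16))             # 398.22, matching the text
```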
Other features
- 152334H managed to make Unsloth work with DPO! It's still preliminary support, but it seems like it works via TRL.
- RandomInternetPreson managed to make Unsloth work on WSL! So Windows support is currently preliminary!
- Other bug fixes: we now support all vocab sizes up to 2^16 (65,536), and grouped query attention now works correctly.
- GQA on older GPUs is now fully supported via Xformers - we had to manually reshape K and V to trick Xformers into doing a normal attention calculation, since Xformers sadly does not support the backward pass for GQA. See the sketch after this list.
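A minimal illustration of that reshape trick (our own sketch, not Unsloth's exact code): the grouped K/V heads are expanded so that an attention kernel which only understands plain multi-head attention sees matching head counts for Q, K and V.

```python
# Expand grouped K/V heads so a plain multi-head attention kernel can be used.
import torch

def repeat_kv(kv: torch.Tensor, n_rep: int) -> torch.Tensor:
    # kv: (batch, n_kv_heads, seq_len, head_dim)
    # -> (batch, n_kv_heads * n_rep, seq_len, head_dim)
    b, h_kv, s, d = kv.shape
    return kv[:, :, None, :, :].expand(b, h_kv, n_rep, s, d).reshape(b, h_kv * n_rep, s, d)

k = torch.randn(2, 8, 128, 64)      # 8 KV heads
k_full = repeat_kv(k, 4)            # match 32 query heads
print(k_full.shape)                 # torch.Size([2, 32, 128, 64])
```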
FAQ
- Q: Do you support Mixtral?
  A: We're working on it!
- Q: How do we buy PRO or MAX?
  A: We're working on a platform now. Stay tuned!
- Q: Do we reduce FLOPs?
  A: Yes.
- Q: Does full finetuning work on the Open Source version?
  A: No. See Issue. All optimizations are turned off, so you will see no noticeable speed improvement other than from Flash Attention and some Triton kernels.
- Q: Is LoRA (so not QLoRA) supported?
  A: Yes. Pass in load_in_4bit = False.
- Q: Does the PRO / MAX version support full finetuning and pretraining?
  A: Yes.
Thank you for reading! 🦥
Daniel Han
13 December 2023
Full benchmarking tables
| 1 T4 16GB | Hugging Face | Flash Attention | Unsloth Open | Unsloth Pro Equal | Unsloth Pro | Unsloth Max |
|---|---|---|---|---|---|---|
| Alpaca | 1x | 1.09x | 1.69x | 1.79x | 2.93x | 8.3x |
| code | Code | Code | Code | Code | | |
| seconds | 1599 | 1468 | 942 | 894 | 545 | 193 |
| memory MB | 7199 | 7059 | 6459 | 5443 | | |
| memory saved % | | 1.94 | 10.28 | 24.39 | | |

| 1 T4 16GB | Hugging Face | Flash Attention | Unsloth Open | Unsloth Pro Equal | Unsloth Pro | Unsloth Max |
|---|---|---|---|---|---|---|
| LAION Chip2 | 1x | 0.99x | 1.80x | 1.75x | 4.15x | 11.75x |
| code | Code | Code | Code | Code | | |
| seconds | 952 | 955 | 529 | 543 | 229 | 81 |
| memory MB | 6037 | 6033 | 5797 | 4855 | | |
| memory saved % | | 0.07 | 3.98 | 19.58 | | |

| 1 T4 16GB | Hugging Face | Flash Attention | Unsloth Open | Unsloth Pro Equal | Unsloth Pro | Unsloth Max |
|---|---|---|---|---|---|---|
| OASST | 1x | 1.19x | 1.95x | 1.86x | 2.58x | 7.3x |
| code | Code | Code | Code | Code | | |
| seconds | 2640 | 2222 | 1355 | 1421 | 1024 | 362 |
| memory MB | 14827 | 10391 | 8413 | 7031 | | |
| memory saved % | | 29.92 | 43.26 | 52.58 | | |

| 1 T4 16GB | Hugging Face | Flash Attention | Unsloth Open | Unsloth Pro Equal | Unsloth Pro | Unsloth Max |
|---|---|---|---|---|---|---|
| Slim Orca | 1x | 1.21x | 1.77x | 1.85x | 2.71x | 7.67x |
| code | Code | Code | Code | Code | | |
| seconds | 2735 | 2262 | 1545 | 1478 | 1009 | 356 |
| memory MB | 13933 | 10489 | 7661 | 6563 | | |
| memory saved % | | 24.72 | 45.02 | 52.90 | | |
| 2 T4 DDP | Hugging Face | Flash Attention | Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
|---|---|---|---|---|---|---|
| Alpaca | 1x | 0.99x | 4.95x | 4.44x | 7.28x | 20.61x |
| code | Code | Code | Code | | | |
| seconds | 9882 | 9946 | 1996 | 2227 | 1357 | 480 |
| memory MB | 9176 | 9128 | 6904 | 6782 | | |
| memory saved % | | 0.52 | 24.76 | 26.09 | | |

| 2 T4 DDP | Hugging Face | Flash Attention | Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
|---|---|---|---|---|---|---|
| LAION Chip2 | 1x | 1.12x | 5.28x | 4.21x | 10.01x | 28.32x |
| code | Code | Code | Code | | | |
| seconds | 5418 | 4854 | 1027 | 1286 | 541 | 191 |
| memory MB | 7316 | 7316 | 5732 | 5934 | | |
| memory saved % | | 0.00 | 21.65 | 18.89 | | |

| 2 T4 DDP | Hugging Face | Flash Attention | Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
|---|---|---|---|---|---|---|
| OASST (bsz=1) | 1x | 1.14x | 5.56x | 5.09x | 5.64x | 15.97x |
| code | Code | Code | Code | | | |
| seconds | 4503 | 3955 | 811 | 885 | 798 | 282 |
| memory MB | 11896 | 11628 | 6616 | 7105 | | |
| memory saved % | | 2.25 | 44.38 | 40.27 | | |

| 2 T4 DDP | Hugging Face | Flash Attention | Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
|---|---|---|---|---|---|---|
| Slim Orca (bsz=1) | 1x | 0.97x | 5.54x | 4.68x | 6.88x | 19.46x |
| code | Code | Code | Code | | | |
| seconds | 4042 | 4158 | 729 | 863 | 588 | 208 |
| memory MB | 11010 | 11042 | 6492 | 7410 | | |
| memory saved % | | -0.29 | 41.04 | 32.70 | | |

| 2 T4 DDP | Hugging Face | Flash Attention | Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
|---|---|---|---|---|---|---|
| OASST (bsz=2) | OOM ❌ | OOM ❌ | ✓ | ✓ | ✓ | ✓ |
| code | Code | Code | Code | | | |
| seconds | OOM | OOM | 2719 | 3391 | 2794 | 987 |
| memory MB | OOM | OOM | 8134 | 9600 | | |
| memory saved % | OOM | OOM | | | | |

| 2 T4 DDP | Hugging Face | Flash Attention | Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
|---|---|---|---|---|---|---|
| Slim Orca (bsz=2) | OOM ❌ | OOM ❌ | ✓ | ✓ | ✓ | ✓ |
| code | Code | Code | Code | | | |
| seconds | OOM | OOM | 2990 | 3444 | 2351 | 831 |
| memory MB | OOM | OOM | 7594 | 8881 | | |
| memory saved % | OOM | OOM | | | | |
| 1 A100 40GB | Hugging Face | Flash Attention 2 | Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
|---|---|---|---|---|---|---|
| Alpaca | 1x | 1.04x | 1.98x | 2.48x | 5.32x | 15.64x |
| code | Code | Code | Code | Code | | |
| seconds | 1040 | 1001 | 525 | 419 | 196 | 67 |
| memory MB | 18235 | 15365 | 9631 | 8525 | | |
| memory saved % | | 15.74 | 47.18 | 53.25 | | |

| 1 A100 40GB | Hugging Face | Flash Attention 2 | Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
|---|---|---|---|---|---|---|
| LAION Chip2 | 1x | 0.92x | 1.61x | 1.84x | 7.05x | 20.73x |
| code | Code | Code | Code | Code | | |
| seconds | 581 | 631 | 361 | 315 | 82 | 28 |
| memory MB | 7763 | 8047 | 7763 | 6441 | | |
| memory saved % | | -3.66 | 0.00 | 17.03 | | |

| 1 A100 40GB | Hugging Face | Flash Attention 2 | Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
|---|---|---|---|---|---|---|
| OASST | 1x | 1.19x | 2.17x | 2.66x | 5.04x | 14.83x |
| code | Code | Code | Code | Code | | |
| seconds | 1852 | 1558 | 852 | 696 | 367 | 125 |
| memory MB | 26431 | 16565 | 12267 | 11223 | | |
| memory saved % | | 37.33 | 53.59 | 57.54 | | |

| 1 A100 40GB | Hugging Face | Flash Attention 2 | Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
|---|---|---|---|---|---|---|
| Slim Orca | 1x | 1.18x | 2.22x | 2.64x | 5.04x | 14.82x |
| code | Code | Code | Code | Code | | |
| seconds | 1824 | 1545 | 821 | 691 | 362 | 123 |
| memory MB | 24557 | 15681 | 10595 | 9007 | | |
| memory saved % | | 36.14 | 56.86 | 63.32 | | |

| 1 A100 40GB | Hugging Face | Flash Attention 2 | Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
|---|---|---|---|---|---|---|
| Mistral 7B Slim Orca | 1x | 1.15x | 2.15x | 2.53x | 4.61x | 13.69x |
| code | Code | Code | Code | Code | | |
| seconds | 1813 | 1571 | 842 | 718 | 393 | 132 |
| memory MB | 32853 | 19385 | 12465 | 10271 | | |
| memory saved % | | 40.99 | 62.06 | 68.74 | | |

| 1 A100 40GB | Hugging Face | Flash Attention 2 | Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
|---|---|---|---|---|---|---|
| Code Llama 34B | OOM ❌ | 0.99x | 1.87x | 2.61x | 4.27x | 12.82x |
| code | Code | Code | Code | Code | | |
| seconds | 1953 | 1982 | 1043 | 748 | 458 | 152 |
| memory MB | 40000 | 33217 | 27413 | 22161 | | |
| memory saved % | | 16.96 | 31.47 | 44.60 | | |