[Image: 6 cute pastel-coloured sloths happily staring at their computer screens]
Finetune Mistral 14x faster

Dec 14, 2023 • By Daniel Han


Unsloth open source:

| Model | GPU | Speedup |
|---|---|---|
| Mistral 7B | 1x A100 | 2.2x faster |
| Code Llama 34B | 1x A100 | 1.9x faster |
| Llama 7B | 1x A100 | 2.2x faster |
| Llama 7B | 1x T4 | 2x faster |

We’re excited to release QLoRA support for Mistral 7B, CodeLlama 34B, and all other models based on the Llama architecture! We added sliding window attention, preliminary Windows and DPO support, and will share all 59 notebooks for reproducing our numbers.

You can now QLoRA finetune Mistral 7B 2.2x faster on 1x A100 with 62% less memory, or 12.4GB peak VRAM. CodeLlama 34B is 1.9x faster on 1x A100, using 32% less memory or 27GB peak VRAM. It finally doesn't OOM!
Unsloth Pro version:

| Model | GPU | Speedup |
|---|---|---|
| Mistral 7B | 1x A100 | 14x faster |
| Code Llama 34B | 1x A100 | 13x faster |
| Llama 7B | 1x A100 | 21x faster |
| Llama 7B | 2x T4 | 28x faster |

Our PRO version can finetune Mistral 7B a whopping 14x faster on 1x A100, using 70% less peak VRAM, and CodeLlama 34B is 13x faster on 1x A100 using 50% less VRAM or 20GB peak VRAM. Llama 7B is also 21x faster on 1x A100, with a crazy 71% reduction in peak VRAM usage. On 2x Tesla T4s, Llama 7B is 28x faster via DDP support.
Unsloth Pro version, peak VRAM savings:

| Model | GPU | Peak VRAM |
|---|---|---|
| Mistral 7B | 1x A100 | -70% |
| Code Llama 34B | 1x A100 | -50% |
| Llama 7B | 1x A100 | -71% |
| Llama 7B | 2x T4 | -44% |

We also added the option to install Flash Attention v2 directly rather than through Xformers. If you have an RTX 3060 or higher (A100, H100, etc.), use the "ampere" install path:
pip install "unsloth[cu118_ampere] @ git+https://github.com/unslothai/unsloth.git"

pip install "unsloth[cu121_ampere] @ git+https://github.com/unslothai/unsloth.git"

pip install "unsloth[colab_ampere] @ git+https://github.com/unslothai/unsloth.git"
As requested, we provide a preliminary breakdown of how we made things faster with Unsloth!
Benchmarking
| 1x A100 40GB | Hugging Face | Flash Attention 2 | Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
|---|---|---|---|---|---|---|
| Alpaca | 1x | 1.04x | 1.98x | 2.48x | 5.32x | 15.64x |
| LAION Chip2 | 1x | 0.92x | 1.61x | 1.84x | 7.05x | 20.73x |
| OASST | 1x | 1.19x | 2.17x | 2.66x | 5.04x | 14.83x |
| Slim Orca | 1x | 1.18x | 2.22x | 2.64x | 5.04x | 14.82x |
We benchmark Unsloth against Hugging Face's original implementation, and against adding Flash Attention 2 support, on 1x A100 via Google Colab. Flash Attention speeds up training by at most 1.2x, whilst Unsloth's open source package trains 2.2x faster. "Unsloth Equal" is our PRO version under the condition that all settings and the loss curve stay the same; under this scenario, we further boost training to 2.7x. Our MAX version can boost speeds on the LAION dataset to 21x!

All benchmarks use the following setup (unless some tests OOM, in which case we decrease the batch size for all tests):
QLoRA nf4 layers = [
"q_proj", "k_proj", "v_proj", "o_proj", 
"gate_proj", "up_proj", "down_proj",
]
QLoRA rank = 16, alpha = 16, dropout = 0
max_seq_length = 2048
learning_rate = 2e-4
weight_decay = 0.01
max_steps = 240
warmup_steps = 10
batch_size = 4
gradient_accumulation_steps = 4
lr_scheduler_type = "linear"
optimizer = "adamw_8bit", bfloat16
use_gradient_checkpointing = True
random_state = 3407
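For reference, here is a minimal sketch of how this configuration maps onto an Unsloth + TRL training script. The model name, dataset, and prompt formatting below are illustrative placeholders, not our exact benchmark scripts:

```python
# A sketch of the benchmark configuration above using Unsloth with TRL's SFTTrainer.
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/mistral-7b-bnb-4bit",   # any Llama-architecture model works
    max_seq_length = 2048,
    load_in_4bit = True,                          # QLoRA nf4
)
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, lora_alpha = 16, lora_dropout = 0,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing = True,
    random_state = 3407,
)

dataset = load_dataset("yahma/alpaca-cleaned", split = "train")
def to_text(row):   # collapse instruction/input/output into one training string (toy template)
    return {"text": f"{row['instruction']}\n{row['input']}\n{row['output']}" + tokenizer.eos_token}
dataset = dataset.map(to_text)

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = 2048,
    args = TrainingArguments(
        per_device_train_batch_size = 4,
        gradient_accumulation_steps = 4,
        max_steps = 240,
        warmup_steps = 10,
        learning_rate = 2e-4,
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        optim = "adamw_8bit",
        bf16 = True,
        seed = 3407,
        output_dir = "outputs",
    ),
)
trainer.train()
```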
| 1x A100 40GB, Mistral 7B Slim Orca | Hugging Face | Flash Attention 2 | Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
|---|---|---|---|---|---|---|
| speedup | 1x | 1.15x | 2.15x | 2.53x | 4.61x | 13.69x |
| code | Code | Code | Code | Code | Code | |
| seconds | 1813 | 1571 | 842 | 718 | 393 | 132 |
| peak memory (MB) | 32853 | 19385 | 12465 | 10271 | | |
| memory saved (%) | | 40.99 | 62.06 | 68.74 | | |
On Mistral 7B with 1x A100, Flash Attention v2 boosts training by 1.15x, whilst Unsloth Open boosts it by 2.15x and reduces memory by 62%. "Unsloth Equal", which runs an equalized training run, boosts speeds by 2.53x and uses 69% less peak VRAM. Unsloth MAX boosts training by 13.7x.

You can click on "Code" to access our shared notebooks for reproducibility purposes. "Unsloth Equal" only shows the training losses and obscures our other codepaths.
| 1x A100 40GB, Code Llama 34B | Hugging Face | Flash Attention 2 | Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
|---|---|---|---|---|---|---|
| speedup | OOM ❌ | 0.99x | 1.87x | 2.61x | 4.27x | 12.82x |
| code | Code | Code | Code | Code | Code | |
| seconds | 1953 | 1982 | 1043 | 748 | 458 | 152 |
| peak memory (MB) | 40000 | 33217 | 27413 | 22161 | | |
| memory saved (%) | | 16.96 | 31.47 | 44.60 | | |
On CodeLlama 34B, we used a batch size of 1 with a sequence length of 4096. Unfortunately, Hugging Face's original implementation OOMs at batch size = 2, so we had to benchmark at bsz = 1. Flash Attention v2 does not noticeably change the runtime, whilst Unsloth Open is 1.87x faster and uses 32% less peak VRAM. "Unsloth Equal" is 2.6x faster and uses 45% less memory, and MAX is 12.8x faster.
| 2x T4 (DDP), LAION Chip2 | Hugging Face | Flash Attention | Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
|---|---|---|---|---|---|---|
| speedup | 1x | 1.12x | 5.28x | 4.21x | 10.01x | 28.32x |
| code | Code | Code | Code | Code | | |
| seconds | 5418 | 4854 | 1027 | 1286 | 541 | 191 |
| peak memory (MB) | 7316 | 7316 | 5732 | 5934 | | |
| memory saved (%) | | 0.00 | 21.65 | 18.89 | | |
Via Kaggle's 2x Tesla T4 instance, we find that Unsloth Open trains 5.3x faster, although on 1 GPU only; to be fair, we double its gradient accumulation steps, since the open source version does not support multi-GPU. "Unsloth Equal" is 4.21x faster via DDP. DDP has an overhead, since gradients must be synchronized at each step. Unsloth MAX trains 28x faster!

At the end of this blog post, we provide the whole table of all benchmarks, and all 59 notebook links for reproducibility purposes.
Performance breakdowns bit by bit
| No. | Method | Time (s) | Peak VRAM (GB) | Time saved (%) | VRAM saved (%) | Final error |
|---|---|---|---|---|---|---|
| 1 | Huggingface Original PEFT QLoRA | 594 | 16.7 | | | 1.0202 |
| 2 | Reduce data upcasting | 465 | 15.5 | 21.7 | 7.2 | 1.0203 |
| 3 | Bitsandbytes bfloat16 | 424 | 15.3 | 8.9 | 1.3 | 1.0208 |
| 4 | SDPA | 418 | 14.9 | 1.4 | 2.6 | 1.0214 |
| 5 | SDPA causal = True | 384 | 14.9 | 8.1 | 0.0 | 1.0219 |
| 6 | Xformers | 353 | 9.1 | 8.1 | 38.9 | 1.021 |
| 7 | Flash Attention 2 | 353 | 9.1 | 0.0 | 0.0 | 1.0215 |
| 8 | Fast RoPE Embeddings | 326 | 9.0 | 7.6 | 1.1 | 1.0211 |
| 9 | Fast RMS Layernorm | 316 | 9.0 | 3.1 | 0.0 | 1.021 |
| 10 | Fast Cross Entropy Loss | 315 | 7.4 | 0.4 | 17.8 | 1.021 |
| 11 | Manual Autograd MLP | 302 | 6.8 | 4.0 | 8.1 | 1.0222 |
| 12 | Manual Autograd QKV | 297 | 6.8 | 1.7 | 0.0 | 1.0217 |
We provide a breakdown of each optimization we made for the open source version of Unsloth, measured on an A10G GPU via AWS. By applying all of our changes, we speed up training by 2x and use 61% less peak VRAM. This section is somewhat more maths-heavy, so be warned!
1. Reduce data upcasting
By reducing upcasting of weights during QLoRA, we can easily save 7.2% of VRAM, and make training take 21.7% less time.
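As a toy illustration of why upcasting costs memory (this is not Unsloth's actual code path): holding an extra float32 copy of a single bfloat16, Llama-7B-sized MLP weight triples its footprint for that tensor.

```python
# Toy example: the cost of one unnecessary float32 upcast of a (4096, 11008) bf16 weight.
import torch

w = torch.randn(4096, 11008, dtype=torch.bfloat16)            # ~86 MB in bf16
w_fp32 = w.float()                                            # extra ~172 MB fp32 copy
print(w.nelement() * w.element_size() / 2**20, "MB (bf16)")
print(w_fp32.nelement() * w_fp32.element_size() / 2**20, "MB (extra fp32 copy)")
```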
2. Bitsandbytes bfloat16
Bitsandbytes internally uses float16, so we have to do an extra memory copy to convert it to bfloat16. We fix this internally, saving 9% time.
3. Scaled Dot Product Attention
We use Pytorch's fast implementation of attention, saving 1.4% time.
4, 5, 6. Causal Masking, Xformers, Flash Attention 2
By using a causal mask and not a separate attention mask, we made things 8.1% faster, since we don't need to read the attention matrix. We then switch to Xformers, which makes things another 8.1% faster and saves a whopping 39% of VRAM. Switching to Flash Attention v2 had no noticeable effect, since Xformers calls FA2 internally anyways.
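A sketch of the attention code paths compared in steps 3 through 6 (shapes are illustrative; it assumes a CUDA GPU with xformers installed, and xformers dispatches to Flash Attention v2 when available):

```python
# SDPA with is_causal=True (steps 3-5) vs. xformers memory-efficient attention (step 6).
import torch
import torch.nn.functional as F
import xformers.ops as xops

bsz, n_heads, seq_len, head_dim = 2, 32, 2048, 128
q = torch.randn(bsz, n_heads, seq_len, head_dim, device="cuda", dtype=torch.bfloat16)
k, v = torch.randn_like(q), torch.randn_like(q)

# PyTorch SDPA: is_causal=True avoids building and reading an explicit attention mask.
out_sdpa = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Xformers expects (batch, seq, heads, head_dim); LowerTriangularMask gives causal attention.
q_x, k_x, v_x = (t.transpose(1, 2) for t in (q, k, v))
out_xf = xops.memory_efficient_attention(q_x, k_x, v_x, attn_bias=xops.LowerTriangularMask())
```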
7. Fast RoPE Embeddings
By implementing RoPE embeddings in OpenAI's Triton, we save another 7.6% of time. But to do so, we must derive the RoPE function's gradient by hand. Notice that RoPE can be rewritten as a matrix multiplication between a rotation matrix R and the original matrix Q; the backward pass is then simply multiplication by R transpose.
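Written out in our notation: each 2D pair of query/key features is rotated by an angle θ, and since a rotation is orthogonal, the backward pass just applies the transposed (negated-angle) rotation to the incoming gradient.

\begin{align}
q' = R(\theta)\, q, \qquad R(\theta) = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix} \\
\frac{\partial L}{\partial q} = R(\theta)^T \frac{\partial L}{\partial q'} = R(-\theta)\, \frac{\partial L}{\partial q'}
\end{align}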
8. Fast RMS Layernorm
Unfortunately, the RMS Layernorm's derivative is much more involved. Carefully applying the chain rule yields a rather ugly expression, which we again implement in OpenAI's Triton language, boosting training by 3.1%.
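For reference, here is one way to write that gradient (our derivation; g denotes the incoming gradient ∂L/∂y and n the hidden size):

\begin{align}
y_i = \frac{w_i x_i}{r}, \qquad r = \sqrt{\tfrac{1}{n}\sum_j x_j^2 + \varepsilon} \\
\frac{\partial L}{\partial x_i} = \frac{w_i g_i}{r} - \frac{x_i}{n r^3} \sum_j g_j w_j x_j
\end{align}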
9. Fast Cross Entropy Loss
\begin{align}
\text{loss} &= \text{CrossEntropyLoss}(\text{logits}, \text{labels}) \\
CE_{loss} &= \frac{1}{n} \sum{-y_i \log{p_i}} \\
CE_{loss} &= \frac{1}{n} \sum{-y_i \log{\frac{\exp(x_i)}{\sum{\exp(x_i)}}}} \\
CE_i &= -y_i \log{\frac{\exp(x_i)}{\sum{\exp(x_i)}}} \\
&= -y_i \left( x_i - \log{\sum{\exp(x_i)}} \right) \\
&= y_i \left( \text{logsumexp}(x) - x_i \right) \\
&= \begin{cases} 0 & \text{if } y_i = 0 \\ \text{logsumexp}(x) - x_i & \text{otherwise} \end{cases} \\
\frac{dC}{dx_i} &= y_i \cdot \exp\big(x_i - \text{logsumexp}(x)\big) - \frac{d}{dx_i} x_k \cdot y_i
\end{align}
The Cross Entropy Loss is again a bit more involved. We use the log trick, x = exp(log(x)), to derive the derivative, and Wikipedia confirms that the derivative of the infamous logsumexp function is in fact the softmax function! We slash VRAM usage by 17%.
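A quick numerical check of this relationship (ours, not Unsloth's Triton kernel): for one-hot labels, the gradient of the mean-reduced cross entropy with respect to the logits is (softmax(x) − one_hot(y)) / n.

```python
# Verify d(CE)/d(logits) == (softmax(logits) - one_hot(labels)) / n for mean reduction.
import torch
import torch.nn.functional as F

n, vocab_size = 8, 32000
logits = torch.randn(n, vocab_size, requires_grad=True)
labels = torch.randint(0, vocab_size, (n,))

F.cross_entropy(logits, labels).backward()
manual = (F.softmax(logits.detach(), dim=-1) - F.one_hot(labels, vocab_size).float()) / n
print(torch.allclose(logits.grad, manual, atol=1e-6))   # True
```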
10, 11. Manual Autograd
By bracketing correctly, we can massively reduce the actual number of FLOPs during LoRA finetuning! Normally Pytorch's autograd engine backpropagates through the graph from the end to the start. We find by fusing multiple operations into 1, and bracketing correctly through Chained Matrix Multiplication, the actual # of FLOPs is reduced.
(X_{(m,d)}^T \times dW_{(m,h)}) \times B_{(h,r)}^T
If you bracket incorrectly, as PyTorch's autograd currently does, you first multiply X.T and dW. Take X to be of size (bsz, seq_len, d). We then reshape X to size (m, d), where m is simply bsz * seq_len and d is the attention dimension: 4096 in Llama 7B, and 8192 in Llama 70B.

dW is of size (m, h) where h is the MLP intermediate size. For Llama 7b it's 11,008 and Llama 70b it's 28,672. And B.T is the LoRA weight of size (h, r), where r is the rank of the LoRA matrix, which can be a small 16 or 64.
\begin{align}
(X_{(m,d)}^T \times dW_{(m,h)}) \times B_{(h,r)}^T \\
X_{(m,d)}^T \times dW_{(m,h)} \text{ takes } (m \times d \times h) \text{ FLOPs} \\
(X^T \times dW)_{(d,h)} \times B_{(h,r)}^T \text{ takes } (h \times r \times d) \text{ FLOPs} \\
\text{FLOPs} = (m \times d \times h) + (h \times r \times d) = (h \times d)(m + r)
\end{align}
We find that the slow path takes around (h * d)(m + r) FLOPs.
\begin{align}
X_{(m,d)}^T \times (dW_{(m,h)} \times B_{(h,r)}^T) \\
dW_{(m,h)} \times B_{(h,r)}^T \text{ takes } (h \times m \times r) \text{ FLOPs} \\
X_{(m,d)}^T \times (dW \times B^T)_{(m,r)} \text{ takes } (m \times d \times r) \text{ FLOPs} \\
\text{FLOPs} = (h \times m \times r) + (m \times d \times r) = (m \times r)(h + d)
\end{align}
And the fast path, where we instead bracket on the 2nd term takes (m * r)(h + d) FLOPs. We can then divide the slow path by the fast path to get a speedup fraction:
\frac{\text{Slow}}{\text{Fast}} = \frac{(m + r)(h \times d)}{(m \times r)(h + d)}
To simplify the above, notice that r is normally quite small, say 16 or 64, while m can be very big: a batch size of 4 with a sequence length of 4096 gives m = 4 * 4096 = 16,384. This makes the + r in (m + r) insignificant, so we drop it. We cannot do the same for (m × r), since multiplying by 16 changes the term far more than adding 16 does.
\begin{align}
\frac{\text{Slow}}{\text{Fast}} \approx \frac{m (h \times d)}{(m \times r)(h + d)} \\
\frac{\text{Slow}}{\text{Fast}} \approx \frac{h \times d}{r(h + d)}
\end{align}
If we do this, we get a simplified expression, where the speedup is a function of the MLP intermediate size h, the attention size d and the LoRA rank r.

For Llama 7b where h = 11,008 and d = 4,096 and r = 16, we get a speedup of 186.58.
For Llama 70b where h = 28,672 and d = 8,192 and r = 16, we get a speedup of 398.22.
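A small sanity check of the bracketing argument (ours): both orderings produce the same matrix, and plugging the Llama 7B sizes into the two FLOP formulas reproduces roughly the 186.58 ratio quoted above (the exact ratio is ~186.8 before approximating m + r ≈ m).

```python
# Verify both bracketings agree at toy sizes, then compute the FLOP ratio for Llama 7B sizes.
import torch

m, d, h, r = 256, 64, 128, 16                       # toy sizes for the numeric check
X, dW, B_T = torch.randn(m, d), torch.randn(m, h), torch.randn(h, r)
slow = (X.T @ dW) @ B_T                             # costs h*d*(m + r) FLOPs
fast = X.T @ (dW @ B_T)                             # costs m*r*(h + d) FLOPs
print(torch.allclose(slow, fast, rtol=1e-3, atol=1e-3))   # same result, different cost

m, d, h, r = 4 * 4096, 4096, 11008, 16              # bsz=4, seq_len=4096, Llama 7B MLP
print(h * d * (m + r) / (m * r * (h + d)))          # ~186.8
```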
Other features
  • 152334H managed to make Unsloth work with DPO! It's still preliminary support, but it seems like it works via TRL.
  • RandomInternetPreson managed to make Unsloth work on WSL, so preliminary Windows support is here!
  • Other bug fixes: all vocab sizes up to 2^16 (65,536) are supported, and grouped query attention now works correctly.
  • GQA on older GPUs is now fully supported via Xformers - we manually reshape K and V to trick Xformers into doing a normal attention calculation (see the sketch below). Sadly, Xformers does not support the backward pass for GQA.
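A sketch of that trick (ours, not Unsloth's exact code): repeat the K/V heads so that a kernel which only understands standard multi-head attention sees matching head counts.

```python
# Expand grouped K/V heads to match the number of query heads (the "repeat_kv" trick).
import torch

def repeat_kv(x: torch.Tensor, n_rep: int) -> torch.Tensor:
    # (batch, seq_len, n_kv_heads, head_dim) -> (batch, seq_len, n_kv_heads * n_rep, head_dim)
    b, s, n_kv, hd = x.shape
    return x[:, :, :, None, :].expand(b, s, n_kv, n_rep, hd).reshape(b, s, n_kv * n_rep, hd)

k = torch.randn(2, 2048, 8, 128)        # e.g. 8 KV heads
print(repeat_kv(k, n_rep = 4).shape)    # torch.Size([2, 2048, 32, 128]), matching 32 Q heads
```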
FAQ
  • Q: Do you support Mixtral?
    We're working on it!
  • Q: How do we buy PRO or MAX?
    We're working on a platform now. Stay tuned!
  • Q: Do we reduce FLOPs?
    Yes.
  • Q: Does full finetuning work on the Open Source version?
    No. See Issue. All optimizations are turned off, so you will see no noticeable speed improvement, other than from Flash Attention and some Triton kernels.
  • Q: Is LoRA (without QLoRA) supported?
    Yes. Pass load_in_4bit = False (see the snippet after this FAQ).
  • Q: Does the PRO / MAX support full finetuning, pretraining?
    Yes.
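For the LoRA question above, a minimal sketch of the 16-bit path (the model name is just an example):

```python
# 16-bit LoRA instead of 4-bit QLoRA: load the base model without quantization.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "mistralai/Mistral-7B-v0.1",
    max_seq_length = 2048,
    load_in_4bit = False,    # plain LoRA on 16-bit weights
)
```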
Thank you for reading! 🦥
Daniel Han
13 December 2023
Full benchmarking tables
| 1x T4 16GB, Alpaca | Hugging Face | Flash Attention | Unsloth Open | Unsloth Pro Equal | Unsloth Pro | Unsloth Max |
|---|---|---|---|---|---|---|
| speedup | 1x | 1.09x | 1.69x | 1.79x | 2.93x | 8.3x |
| code | Code | Code | Code | Code | Code | |
| seconds | 1599 | 1468 | 942 | 894 | 545 | 193 |
| peak memory (MB) | 7199 | 7059 | 6459 | 5443 | | |
| memory saved (%) | | 1.94 | 10.28 | 24.39 | | |

| 1x T4 16GB, LAION Chip2 | Hugging Face | Flash Attention | Unsloth Open | Unsloth Pro Equal | Unsloth Pro | Unsloth Max |
|---|---|---|---|---|---|---|
| speedup | 1x | 0.99x | 1.80x | 1.75x | 4.15x | 11.75x |
| code | Code | Code | Code | Code | Code | |
| seconds | 952 | 955 | 529 | 543 | 229 | 81 |
| peak memory (MB) | 6037 | 6033 | 5797 | 4855 | | |
| memory saved (%) | | 0.07 | 3.98 | 19.58 | | |

| 1x T4 16GB, OASST | Hugging Face | Flash Attention | Unsloth Open | Unsloth Pro Equal | Unsloth Pro | Unsloth Max |
|---|---|---|---|---|---|---|
| speedup | 1x | 1.19x | 1.95x | 1.86x | 2.58x | 7.3x |
| code | Code | Code | Code | Code | Code | |
| seconds | 2640 | 2222 | 1355 | 1421 | 1024 | 362 |
| peak memory (MB) | 14827 | 10391 | 8413 | 7031 | | |
| memory saved (%) | | 29.92 | 43.26 | 52.58 | | |

| 1x T4 16GB, Slim Orca | Hugging Face | Flash Attention | Unsloth Open | Unsloth Pro Equal | Unsloth Pro | Unsloth Max |
|---|---|---|---|---|---|---|
| speedup | 1x | 1.21x | 1.77x | 1.85x | 2.71x | 7.67x |
| code | Code | Code | Code | Code | Code | |
| seconds | 2735 | 2262 | 1545 | 1478 | 1009 | 356 |
| peak memory (MB) | 13933 | 10489 | 7661 | 6563 | | |
| memory saved (%) | | 24.72 | 45.02 | 52.90 | | |
| 2x T4 (DDP), Alpaca | Hugging Face | Flash Attention | Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
|---|---|---|---|---|---|---|
| speedup | 1x | 0.99x | 4.95x | 4.44x | 7.28x | 20.61x |
| code | Code | Code | Code | Code | | |
| seconds | 9882 | 9946 | 1996 | 2227 | 1357 | 480 |
| peak memory (MB) | 9176 | 9128 | 6904 | 6782 | | |
| memory saved (%) | | 0.52 | 24.76 | 26.09 | | |

| 2x T4 (DDP), LAION Chip2 | Hugging Face | Flash Attention | Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
|---|---|---|---|---|---|---|
| speedup | 1x | 1.12x | 5.28x | 4.21x | 10.01x | 28.32x |
| code | Code | Code | Code | Code | | |
| seconds | 5418 | 4854 | 1027 | 1286 | 541 | 191 |
| peak memory (MB) | 7316 | 7316 | 5732 | 5934 | | |
| memory saved (%) | | 0.00 | 21.65 | 18.89 | | |

| 2x T4 (DDP), OASST (bsz=1) | Hugging Face | Flash Attention | Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
|---|---|---|---|---|---|---|
| speedup | 1x | 1.14x | 5.56x | 5.09x | 5.64x | 15.97x |
| code | Code | Code | Code | Code | | |
| seconds | 4503 | 3955 | 811 | 885 | 798 | 282 |
| peak memory (MB) | 11896 | 11628 | 6616 | 7105 | | |
| memory saved (%) | | 2.25 | 44.38 | 40.27 | | |

| 2x T4 (DDP), Slim Orca (bsz=1) | Hugging Face | Flash Attention | Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
|---|---|---|---|---|---|---|
| speedup | 1x | 0.97x | 5.54x | 4.68x | 6.88x | 19.46x |
| code | Code | Code | Code | Code | | |
| seconds | 4042 | 4158 | 729 | 863 | 588 | 208 |
| peak memory (MB) | 11010 | 11042 | 6492 | 7410 | | |
| memory saved (%) | | -0.29 | 41.04 | 32.70 | | |

| 2x T4 (DDP), OASST (bsz=2) | Hugging Face | Flash Attention | Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
|---|---|---|---|---|---|---|
| speedup | OOM ❌ | OOM ❌ | | | | |
| code | Code | Code | Code | Code | | |
| seconds | OOM | OOM | 2719 | 3391 | 2794 | 987 |
| peak memory (MB) | OOM | OOM | 8134 | 9600 | | |
| memory saved (%) | OOM | OOM | | | | |

| 2x T4 (DDP), Slim Orca (bsz=2) | Hugging Face | Flash Attention | Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
|---|---|---|---|---|---|---|
| speedup | OOM ❌ | OOM ❌ | | | | |
| code | Code | Code | Code | Code | | |
| seconds | OOM | OOM | 2990 | 3444 | 2351 | 831 |
| peak memory (MB) | OOM | OOM | 7594 | 8881 | | |
| memory saved (%) | OOM | OOM | | | | |
| 1x A100 40GB, Alpaca | Hugging Face | Flash Attention 2 | Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
|---|---|---|---|---|---|---|
| speedup | 1x | 1.04x | 1.98x | 2.48x | 5.32x | 15.64x |
| code | Code | Code | Code | Code | Code | |
| seconds | 1040 | 1001 | 525 | 419 | 196 | 67 |
| peak memory (MB) | 18235 | 15365 | 9631 | 8525 | | |
| memory saved (%) | | 15.74 | 47.18 | 53.25 | | |

| 1x A100 40GB, LAION Chip2 | Hugging Face | Flash Attention 2 | Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
|---|---|---|---|---|---|---|
| speedup | 1x | 0.92x | 1.61x | 1.84x | 7.05x | 20.73x |
| code | Code | Code | Code | Code | Code | |
| seconds | 581 | 631 | 361 | 315 | 82 | 28 |
| peak memory (MB) | 7763 | 8047 | 7763 | 6441 | | |
| memory saved (%) | | -3.66 | 0.00 | 17.03 | | |

| 1x A100 40GB, OASST | Hugging Face | Flash Attention 2 | Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
|---|---|---|---|---|---|---|
| speedup | 1x | 1.19x | 2.17x | 2.66x | 5.04x | 14.83x |
| code | Code | Code | Code | Code | Code | |
| seconds | 1852 | 1558 | 852 | 696 | 367 | 125 |
| peak memory (MB) | 26431 | 16565 | 12267 | 11223 | | |
| memory saved (%) | | 37.33 | 53.59 | 57.54 | | |

| 1x A100 40GB, Slim Orca | Hugging Face | Flash Attention 2 | Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
|---|---|---|---|---|---|---|
| speedup | 1x | 1.18x | 2.22x | 2.64x | 5.04x | 14.82x |
| code | Code | Code | Code | Code | Code | |
| seconds | 1824 | 1545 | 821 | 691 | 362 | 123 |
| peak memory (MB) | 24557 | 15681 | 10595 | 9007 | | |
| memory saved (%) | | 36.14 | 56.86 | 63.32 | | |

| 1x A100 40GB, Mistral 7B Slim Orca | Hugging Face | Flash Attention 2 | Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
|---|---|---|---|---|---|---|
| speedup | 1x | 1.15x | 2.15x | 2.53x | 4.61x | 13.69x |
| code | Code | Code | Code | Code | Code | |
| seconds | 1813 | 1571 | 842 | 718 | 393 | 132 |
| peak memory (MB) | 32853 | 19385 | 12465 | 10271 | | |
| memory saved (%) | | 40.99 | 62.06 | 68.74 | | |

| 1x A100 40GB, Code Llama 34B | Hugging Face | Flash Attention 2 | Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
|---|---|---|---|---|---|---|
| speedup | OOM ❌ | 0.99x | 1.87x | 2.61x | 4.27x | 12.82x |
| code | Code | Code | Code | Code | Code | |
| seconds | 1953 | 1982 | 1043 | 748 | 458 | 152 |
| peak memory (MB) | 40000 | 33217 | 27413 | 22161 | | |
| memory saved (%) | | 16.96 | 31.47 | 44.60 | | |

Unslow AI training now!

Join Our Discord