Introducing Unsloth: 30x faster LLM training

Dec 1, 2023 • By Daniel Han

We’re excited to introduce our AI startup focused on creating cool AI products! Our first launch is Unsloth, which makes LLM training 30x faster! Some highlights:
  • 30x faster. Alpaca takes 3 hours instead of 85.
  • 60% less memory usage, allowing 6x larger batches.
  • 0% loss in accuracy or +20% increased accuracy with our Max offering.
  • No need for new hardware - only software changes.
  • Supports NVIDIA, Intel and AMD GPUs with our Max offering.
  • Manual autograd and chained matrix multiplication optimizations.
  • Rewrote all kernels in OpenAI's Triton language.
  • Flash Attention via xformers and Tri Dao's implementation.
  • Free open source version makes finetuning 2x faster with 50% less memory.
Have a try with our open source version!

pip install "unsloth[cu118] @ git+https://github.com/unslothai/unsloth.git"

pip install "unsloth[cu121] @ git+https://github.com/unslothai/unsloth.git"
We also have Google Colab notebooks for the Alpaca and SlimOrca datasets, and a Kaggle notebook for the LAION dataset.
Manual autograd
$$
\begin{aligned}
Q &= X \tilde{W}_q = X (W_q + A_q B_q) \\
K &= X \tilde{W}_k = X (W_k + A_k B_k) \\
V &= X \tilde{W}_v = X (W_v + A_v B_v) \\
A(X) &= \sigma\!\left( \tfrac{1}{\sqrt{d}} Q K^T + M \right) V
\end{aligned}
$$
PyTorch's autograd is reasonably efficient for most tasks, but if you want extreme performance, you have to derive the matrix differentials yourself. In the infamous attention mechanism coupled with LoRA adapters, we have to derive 6 matrix differentials, since we freeze the original weight matrices W but train A and B. D(W) is just notation for the derivative during the backward pass. We must also derive the derivative with respect to the inputs of the attention head, since we apply LoRA to all 32 layers of Llama, and so we need to backpropagate the gradients all the way down to layer 1.
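For instance, here is a sketch of how the Q terms fall out of the chain rule, using the D(·) notation above (the K and V projections follow the same pattern; this is illustrative, not an exhaustive listing of every differential we derive):

$$
\begin{aligned}
D(A_q) &= X^T \, D(Q) \, B_q^T, \qquad D(B_q) = (X A_q)^T \, D(Q), \\
D(X) &\mathrel{+}= D(Q)\, W_q^T + \left( D(Q)\, B_q^T \right) A_q^T,
\end{aligned}
$$

where D(Q) is the gradient flowing back from the attention output, and the D(X) line shows only the Q contribution - the K and V projections add analogous terms.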

After we derive the differentials ourselves, we notice that the LoRA weights are super skinny - ranks up to around 128, and sometimes as small as 8 - while the Llama weight dimensions are multiples of 1024, with most being 4096 or larger. This means correctly placing brackets during chained matrix multiplication is critical for performance.
Through some simple FLOP calculations on the matrix dimensions, we find that bracketing the LoRA weight multiplications first is dramatically cheaper (see the sketch below). Likewise, the final dC/dX should be computed with torch's in-place operations to conserve memory. We apply these manual differentiation optimizations to all other layers (MLP, lm_head, layernorms, RoPE embeddings) as well!
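As a back-of-the-envelope illustration of why the bracketing matters (the shapes below are just examples, not tied to any particular benchmark):

def matmul_flops(m, k, n):
    # An (m x k) @ (k x n) matmul costs roughly 2*m*k*n floating point operations.
    return 2 * m * k * n

# Example shapes: 4096 tokens in the batch, Llama hidden size d = 4096, LoRA rank r = 16.
n_tokens, d, r = 4096, 4096, 16

# X @ (A @ B): materialises the full d x d LoRA update before multiplying.
slow = matmul_flops(d, r, d) + matmul_flops(n_tokens, d, d)

# (X @ A) @ B: keeps every intermediate at rank r.
fast = matmul_flops(n_tokens, d, r) + matmul_flops(n_tokens, r, d)

print(f"X(AB): {slow / 1e9:.1f} GFLOPs, (XA)B: {fast / 1e9:.1f} GFLOPs")
# With these shapes, (XA)B is roughly 128x cheaper.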
Performance Analysis
On the infamous Alpaca dataset on a single Tesla T4 GPU, Hugging Face's original implementation takes a long 23 hours 15 minutes, whilst our Max offering takes 2 hours 34 minutes - 8.8x faster. On SlimOrca, 391 hours is shaved down to 51 hours, a 7.6x speedup.

On 2 Tesla T4 GPUs via DDP, LAION's Chip2 dataset takes 164 hours whilst ours takes 5 hours (31x faster). And on SlimOrca, 1,301 hours (around 54 days) is slashed to 54 hours - 24x faster.

On memory usage for the Open Assistant dataset, we show on 1 A10 GPU (bfloat16 support), peak memory usage is slashed to 6.9GB from 16.7GB (59% less), whilst on a Tesla T4, peak memory is reduced to 7.5GB from 14.6GB (49% less).
OpenAI Triton kernels
@triton.jit
def _rope_embedding(
    Q, Q_row_stride,
    cos, cos_row_stride,
    sin, sin_row_stride,
    seqlen, head_dim,
    BACKWARD_PASS: tl.constexpr,
    BLOCK_SIZE: tl.constexpr,
):
    # One program instance rotates one (row, head) slice of Q in place.
    row_position  = tl.program_id(0)
    head_position = tl.program_id(1)
    col_offsets   = tl.arange(0, BLOCK_SIZE)
    half_head_dim = head_dim // 2
    mask = col_offsets < half_head_dim

    # Rows index flattened (batch, position) pairs; recover the position.
    rot_position = row_position % seqlen

    Q   += row_position*Q_row_stride + head_position*head_dim
    cos += rot_position*cos_row_stride
    sin += rot_position*sin_row_stride

    # Load the two halves of the head and the rotation coefficients.
    Q1   = tl.load(Q   + half_head_dim*0 + col_offsets, mask = mask, other = 0)
    sin1 = tl.load(sin + half_head_dim*0 + col_offsets, mask = mask, other = 0)
    cos1 = tl.load(cos + half_head_dim*0 + col_offsets, mask = mask, other = 0)
    Q2   = tl.load(Q   + half_head_dim*1 + col_offsets, mask = mask, other = 0)

    if BACKWARD_PASS:
        # The backward pass rotates by -theta: cos is unchanged, sin flips sign.
        sin1 = -sin1

    # Apply the rotation [Q1, Q2] -> [Q1*cos - Q2*sin, Q2*cos + Q1*sin] in place.
    tl.store(Q + half_head_dim*0 + col_offsets, Q1*cos1 - Q2*sin1, mask = mask)
    tl.store(Q + half_head_dim*1 + col_offsets, Q2*cos1 + Q1*sin1, mask = mask)
pass
We also rewrote all our kernels in OpenAI's Triton language. For example, the above is our implementation of the RoPE embedding kernel, including the backward pass. We tried our best to make it super clean, readable and fast. You can explore more of our hand-written kernels in our GitHub repo.
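As a rough sketch of how a kernel like this gets launched (the wrapper below and its assumed tensor layout are illustrative, not the exact wrapper in our repo):

import triton

def rope_embedding_(Q, cos, sin):
    # Illustrative launcher. Assumed layout: Q reshaped to
    # (batch * seq_len, n_heads, head_dim) and contiguous,
    # cos / sin of shape (seq_len, head_dim // 2).
    n_rows, n_heads, head_dim = Q.shape
    seq_len = cos.shape[0]

    # BLOCK_SIZE must be a power of two covering half the head dimension.
    BLOCK_SIZE = triton.next_power_of_2(head_dim // 2)

    # One program instance per (row, head) pair; the kernel rotates Q in place.
    _rope_embedding[(n_rows, n_heads)](
        Q, Q.stride(0),
        cos, cos.stride(0),
        sin, sin.stride(0),
        seq_len, head_dim,
        BACKWARD_PASS = False,
        BLOCK_SIZE = BLOCK_SIZE,
    )
    return Q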
Our asks
  • Try out our open source package Unsloth with 2X faster finetuning!
  • Our Pro offering provides multi-GPU support, even bigger speedups and more. Our Max offering also provides kernels for full training of LLMs from scratch, and allows our code to be ported to AMD and Intel GPUs. We successfully ported bitsandbytes' 4-bit quantization methods to Triton! Contact us for more info!
  • We were planning to maybe launch a training and inference platform - if there is enough interest, we shall do so! Currently we're using ChatGPT / Llama and other LLMs to build the frontend / backend, since I (Daniel) have no experience with providing Cloud GPUs to people. If you can help us with that, please email us at info@unsloth.ai.
  • We're trying to bootstrap our startup, but if anyone is willing to cooperate with us on building company partnerships, possibly buying our code, or wants to chat about investment - email us! We'd super appreciate it!
Future plans
  • Make inference faster on NVIDIA, Intel and AMD GPUs.
  • Sqrt gradient checkpointing - further reduces memory usage by 25%.
  • Join the AI-MO competition.
  • Make DPO faster as well. OpenAI's ChatGPT uses PPO for RLHF, whilst DPO has been shown to be just as powerful.
  • And we have multiple product launches (a Finance LLM, recession predictor, data science consultant agent etc) planned in the coming months!!!
Thank you for reading! 🦥
Daniel Han
November 30, 2023

Unslow AI training now!

Join Our Discord