Bugs in LLM Training - Gradient Accumulation Fix
Oct 15, 2024 • By Daniel & Michael
This past week, we've been fixing a universal issue in Gradient Accumulation that negatively impacts everyone's training, pre-training & finetuning runs for sequence models like LLMs. Unsloth's Gradient Accumulation fix ensures training runs and loss calculations are performed accurately and correctly.
The goal of gradient accumulation is to mimic full batch training with reduced VRAM usage. Gradient accumulation is also used in DDP and multi GPU setups, so this issue affects large scale training runs as well. P.S. If you’ve enjoyed our work, don't forget to ⭐Star us on GitHub and join our awesome community on Discord - your support means the world to us! 🦥
Back in 2021, Zhaofeng first discovered the issue, and Benjamin Marie rediscovered it last week. They showed that if you use gradient accumulation, you can get a higher loss than if you used full batch training:
We managed to formulate a new methodology that solves the issue - and it's now in Unsloth! Please update Unsloth (pip install --upgrade unsloth) and use unsloth_train! We have a free Colab notebook to finetune Llama 3.2 1/3B with our fixed trainer here - Colab notebook. And a free Kaggle notebook.

from unsloth import unsloth_train
# trainer_stats = trainer.train() << Buggy gradient accumulation
trainer_stats = unsloth_train(trainer)
💡 Our Findings
Replicating the issue
Before we tried fixing the issue, could we first reproduce the error? The theory was that gradient accumulation should be mathematically equivalent to full batch training. We trained with an effective full batch size of 16, so bsz * ga was held constant at 16. We tested bsz = 1, 2, 4, 8 and 16, and could unfortunately replicate the issue - the training loss for larger gradient accumulation steps was always higher. (A sketch of such a sweep is shown below.)
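As a rough illustration only, here is how such a sweep could be set up with Hugging Face's TrainingArguments - the output directory, step count and learning rate below are placeholders, not our exact settings:

from transformers import TrainingArguments

# Keep the effective batch size fixed at 16 while varying how it is split
# between the per-device batch size and gradient accumulation steps.
EFFECTIVE_BATCH_SIZE = 16

configs = []
for bsz in (1, 2, 4, 8, 16):
    ga = EFFECTIVE_BATCH_SIZE // bsz
    configs.append(
        TrainingArguments(
            output_dir=f"outputs_bsz{bsz}_ga{ga}",   # placeholder
            per_device_train_batch_size=bsz,
            gradient_accumulation_steps=ga,
            max_steps=100,                           # placeholder
            learning_rate=2e-4,                      # placeholder
            logging_steps=1,
        )
    )

# If gradient accumulation were truly equivalent to full batch training,
# all five runs should produce (nearly) identical loss curves.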
What is gradient accumulation?
During training or finetuning, one selects some number of random rows from the training dataset to update the model's weights at every step. But how many rows? For very large pretraining jobs, the batch size could be in the many millions, as in Llama 3.1, to reduce overfitting and improve generalization. For finetuning jobs like in Unsloth's Llama 3.2 notebook, the batch size could be as small as 32.
The issue is that the memory usage for large batches is quite large. If a batch size of 1 uses 1 unit of memory, then a batch size of 1 million would use 1 million units of memory. How do we mimic large batch training without using a lot of memory?
Enter gradient accumulation! We instead create the gradients on the fly, computing them each time a new mini batch comes in. We then simply add all the mini gradients up, do some scaling, and we have the final large batch gradient. A simplified version of this loop is sketched below.
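A minimal, self-contained PyTorch sketch of the standard accumulation loop - a toy linear model and random data stand in for a real LLM and dataloader, and this is not Unsloth's actual trainer:

import torch
from torch import nn

# Toy stand-ins: a tiny linear model and random data.
model = nn.Linear(8, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
dataloader = [(torch.randn(2, 8), torch.randn(2, 1)) for _ in range(8)]
loss_fn = nn.MSELoss()

accumulation_steps = 4  # G: number of mini batches summed into one optimizer step

optimizer.zero_grad()
for step, (x, y) in enumerate(dataloader):
    loss = loss_fn(model(x), y)              # mean loss over this mini batch
    (loss / accumulation_steps).backward()   # scale so the summed gradients mimic one big batch
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                     # one "large batch" update every G mini batches
        optimizer.zero_grad()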
Possible explanations

One popular theory was that gradient accumulation has numerical errors during the accumulation step. But researchers found that accumulating in float32 had the same effect. Our findings show there is in fact a tiny accumulation error, but it is not the main culprit.
The second theory was that there is in fact a bug in the loss calculation, which we found to be the case.

Mathematically the same?
Are gradient accumulation and full batch training mathematically the same? Sadly not if you do it naively by simply adding the losses up! We first note that the cross entropy loss is calculated as:

$$ L = \frac{\sum_i \ell_i}{\sum_i \mathbb{I}(y_i \neq -100)} $$

Notice the denominator counts the number of non-padded or non-ignored tokens - ie it normalizes the loss by the number of trained tokens each text piece has. The sum of the indicator function is simply the total number of unpadded tokens, which is the sum of all the sequence lengths, ie (for 4 sequences):

$$ \sum_i \mathbb{I}(y_i \neq -100) = m_1 + m_2 + m_3 + m_4 $$

So we get the final equation as:

$$ L = \frac{L_1 + L_2 + L_3 + L_4}{m_1 + m_2 + m_3 + m_4} $$

We then multiply the numerator and denominator by 1/n - this is allowed since the two factors cancel out:

$$ L = \frac{\frac{1}{n}(L_1 + L_2 + L_3 + L_4)}{\frac{1}{n}(m_1 + m_2 + m_3 + m_4)} = \frac{\bar{L}}{\bar{m}} $$

This means the final loss is the mean loss value divided by the mean of all unpadded sequence lengths. Since we're doing gradient accumulation, we now calculate each loss separately, then add them up to get the final loss, using the mean loss $\bar{L}_g$ and mean sequence length $\bar{m}_g$ of each partition $g$:

$$ \hat{L} = \sum_{g=1}^{G} \frac{\bar{L}_g}{\bar{m}_g} $$
But we see that this final sum does not equal the original full batch loss - in fact it's roughly G times bigger (where G is the number of gradient accumulation steps):

$$ \sum_{g=1}^{G} \frac{\bar{L}_g}{\bar{m}_g} \approx G \cdot \frac{\bar{L}}{\bar{m}} = G \cdot L $$

So in gradient accumulation, we have to scale each mini gradient accumulator by the number of gradient accumulation steps so we get the desired result:

$$ \hat{L} = \frac{1}{G} \sum_{g=1}^{G} \frac{\bar{L}_g}{\bar{m}_g} $$

This generally works well in expectation for large batches. But what happens if the sequence lengths are different - wouldn't that cause issues? We tested this by removing the denominator entirely - ie instead of using the normalized cross entropy loss, we simply used an un-normalized loss to confirm whether gradient accumulation still works. The training losses of a modified Unsloth training run are below:
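To make the mismatch concrete, here is a tiny numerical sketch with made-up per-sequence losses and token counts (not real training data):

# Four sequences with different numbers of trained (non-padded) tokens.
losses  = [4.0, 12.0, 9.0, 7.0]   # summed cross entropy per sequence (made up)
lengths = [2,   10,   6,   5]     # non-padded token counts (made up)

# Full batch: one global normalization over all tokens.
full_batch = sum(losses) / sum(lengths)

# Naive gradient accumulation with G = 2: each mini batch of 2 sequences
# normalizes by its own token count, then the two results are averaged.
G = 2
mini1 = (losses[0] + losses[1]) / (lengths[0] + lengths[1])
mini2 = (losses[2] + losses[3]) / (lengths[2] + lengths[3])
naive_ga = (mini1 + mini2) / G

print(full_batch)  # ~1.3913
print(naive_ga)    # ~1.3939 - differs from full batch when lengths differ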
Miraculously, we see all training loss curves match up! This means the denominator is definitely the culprit: naively averaging over each gradient accumulation step is wrong; instead, we must derive the denominator beforehand.
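A minimal sketch of the idea behind the fix, assuming the G mini batches of one optimizer step are available up front so their non-ignored tokens can be counted first - this mirrors the concept, not Unsloth's actual unsloth_train implementation:

import torch
import torch.nn.functional as F

def train_one_accumulated_step(model, optimizer, mini_batches):
    """One optimizer step over G mini batches with the corrected denominator.

    mini_batches is a list of (input_ids, labels) pairs; labels use -100 for
    padded / ignored positions, as in Hugging Face. The model is assumed to
    return logits directly (a simplification for this sketch).
    """
    # Derive the denominator first: total trained tokens across ALL mini batches.
    total_tokens = sum((labels != -100).sum().item() for _, labels in mini_batches)

    optimizer.zero_grad()
    for input_ids, labels in mini_batches:
        logits = model(input_ids)
        # Sum (not mean) the per-token losses, then normalize once by the
        # global token count, so the accumulated gradient matches full batch training.
        loss = F.cross_entropy(
            logits.view(-1, logits.size(-1)),
            labels.view(-1),
            ignore_index=-100,
            reduction="sum",
        ) / total_tokens
        loss.backward()
    optimizer.step()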
We implemented this fix in Unsloth, and all loss curves now match up, showing that gradient accumulation is indeed equivalent to full batch training.
Numerical differences
Another point to consider is whether this error actually makes a difference to the final weights. So, we trained a LoRA adapter using Unsloth with the full batch (bsz=16, ga=1) versus a gradient accumulated version (bsz=1, ga=16).
We ran all combinations (bsz=1, ga=16 all the way to bsz=16, ga=1), compared the LoRA weights to the full batch version (bsz=16, ga=1), and obtained the L2 norm of the difference.
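For reference, the comparison itself is straightforward - something along these lines, where the checkpoint file names are purely illustrative:

import torch

def lora_l2_difference(state_dict_a, state_dict_b):
    """L2 norm of the difference between two sets of LoRA weights.

    Both state dicts are assumed to contain the same LoRA parameter names
    (e.g. the lora_A / lora_B matrices of each adapted layer).
    """
    squared = 0.0
    for name, weight_a in state_dict_a.items():
        weight_b = state_dict_b[name]
        squared += (weight_a.float() - weight_b.float()).pow(2).sum().item()
    return squared ** 0.5

# Usage: compare e.g. the bsz=1, ga=16 adapter against the full batch bsz=16, ga=1 one.
# diff = lora_l2_difference(torch.load("lora_bsz1_ga16.pt"), torch.load("lora_bsz16_ga1.pt"))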
We show that (1) there is an inherent accumulation error due to floating point arithmetic (0.0068 L2 norm), and (2) as the number of gradient accumulation steps increases, the L2 norm increases (from 0.0196 to 0.0286 L2 norm).
This essentially means gradient accumulation does inherently have a tiny floating point addition penalty, and naively, the larger the number of gradient accumulation steps, the higher the discrepancy. By using our fixed Unsloth gradient accumulation version, the L2 norm error can be reduced by over an order of magnitude.

So, please update Unsloth (pip install --upgrade unsloth) and use unsloth_train! We have a free Colab notebook to finetune Llama 3.2 1/3B with our fixed trainer here - Colab notebook.

from unsloth import unsloth_train
# trainer_stats = trainer.train() << Buggy gradient accumulation
trainer_stats = unsloth_train(trainer)
Extra - mathematical proofs
Assume we have a batch size of 2 and gradient accumulation steps of 2. Then we have the final loss with both no gradient accumulation (full batch training) and gradient accumulation as below:

$$ L_{\text{full}} = \frac{L_1 + L_2 + L_3 + L_4}{m_1 + m_2 + m_3 + m_4}, \qquad L_{\text{acc}} = \frac{1}{2}\left(\frac{L_1 + L_2}{m_1 + m_2} + \frac{L_3 + L_4}{m_3 + m_4}\right) $$

The goal is to show that standard or naive gradient accumulation always has a different (higher or lower) loss than full batch training. We first prove this for a specific batch size of 2 and gradient accumulation steps of 2; the argument is similar for other sizes. So, we need to show that:

$$ \frac{1}{2}\left(\frac{L_1 + L_2}{m_1 + m_2} + \frac{L_3 + L_4}{m_3 + m_4}\right) \ge \frac{L_1 + L_2 + L_3 + L_4}{m_1 + m_2 + m_3 + m_4} $$

We then prove the other inequality direction in a similar manner. Instead of proving it directly, we show that it suffices to prove the ratio of the two losses is greater than 1. We can only do this because the sequence lengths m are always greater than 0. We also replace each loss with the average loss $\bar{L}$. By simplifying and doing some algebra, we get:

$$ \frac{L_{\text{acc}}}{L_{\text{full}}} = \frac{m_1 + m_2 + m_3 + m_4}{4}\left(\frac{1}{m_1 + m_2} + \frac{1}{m_3 + m_4}\right) $$

Now let us assume all sequence lengths are the same, ie $m_1 = m_2 = m_3 = m_4 = m$. We should expect full batch training to be the same as gradient accumulation:

$$ \frac{L_{\text{acc}}}{L_{\text{full}}} = \frac{4m}{4}\left(\frac{1}{2m} + \frac{1}{2m}\right) = m \cdot \frac{1}{m} = 1 $$

We can see we get what we expected - full batch training and gradient accumulation are the same!
But what happens if just one sequence length is bigger than the rest by a small epsilon, say $m_1 = m + \epsilon$? What would that do?

$$ \frac{L_{\text{acc}}}{L_{\text{full}}} = \frac{4m + \epsilon}{4}\left(\frac{1}{2m + \epsilon} + \frac{1}{2m}\right) = \frac{(4m + \epsilon)^2}{8m(2m + \epsilon)} = 1 + \frac{\epsilon^2}{8m(2m + \epsilon)} \ge 1 $$

We see that there's an epsilon squared term, which is always greater than or equal to 0! But we also need to prove this holds if one sequence length is slightly smaller than the rest, ie $m_1 = m - \epsilon$:

$$ \frac{L_{\text{acc}}}{L_{\text{full}}} = \frac{4m - \epsilon}{4}\left(\frac{1}{2m - \epsilon} + \frac{1}{2m}\right) = \frac{(4m - \epsilon)^2}{8m(2m - \epsilon)} = 1 + \frac{\epsilon^2}{8m(2m - \epsilon)} \ge 1 $$

In both cases the inequality holds, since epsilon squared is always greater than or equal to 0 (and the denominators are positive for small epsilon). This essentially proves that naive or standard gradient accumulation will always have a higher loss than full batch training for bsz=2, ga=2. The proof generalizes to other combinations of bsz and ga, though it gets more involved.
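As a quick sanity check of the bsz=2, ga=2 case, one can plug numbers into the two formulas - the per-sequence losses and lengths below are arbitrary:

# Sanity check for bsz=2, ga=2: the naive accumulated loss only matches the
# full batch loss when the sequence lengths are identical.
def full_batch(losses, lengths):
    return sum(losses) / sum(lengths)

def naive_ga(losses, lengths, G=2):
    # Split the 4 sequences into 2 mini batches of 2, normalize each separately.
    l1 = (losses[0] + losses[1]) / (lengths[0] + lengths[1])
    l2 = (losses[2] + losses[3]) / (lengths[2] + lengths[3])
    return (l1 + l2) / G

losses = [3.0, 3.0, 3.0, 3.0]  # use the same average loss per sequence

print(naive_ga(losses, [8, 8, 8, 8]) / full_batch(losses, [8, 8, 8, 8]))  # 1.0: equal lengths
print(naive_ga(losses, [9, 8, 8, 8]) / full_batch(losses, [9, 8, 8, 8]))  # > 1.0: one length +epsilon
print(naive_ga(losses, [7, 8, 8, 8]) / full_batch(losses, [7, 8, 8, 8]))  # > 1.0: one length -epsilon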
We also have to prove the inequality holds the other way as well - ie the overall goal is to prove that general naive gradient accumulation is not equivalent to full batch training.

Update 17th October
We worked with our friends at Hugging Face to fix the issue in HF's trainers - see the PR here. We're also aware that other training frameworks are actively working to resolve this issue, and we're collaborating with some of them on fixes.

💕 Thank you!
As usual, a huge thank you to everyone for using & sharing Unsloth - we really appreciate it. Also a huge shout out to: Dario, Bronson, Jun, John, Steven & Aaron who are new supporters! 🙏
We are hiring by the way, so feel free to reach out via support@unsloth.ai! As always, be sure to join our Reddit page or Discord server for help or just to show your support! You can also follow us on Twitter and Substack.

Thank you for reading!
Daniel & Michael Han 🦥
15 Oct 2024