The goal of gradient accumulation is to mimic full-batch training with reduced VRAM usage. Gradient accumulation is also used in DDP and multi-GPU setups, so this issue affects large-scale training runs as well.

P.S. If you've enjoyed our work, don't forget to ⭐Star us on GitHub and join our awesome community on Discord - your support means the world to us! 🦥

Previously in 2021, Zhaofeng first discovered the issue, and Benjamin Marie rediscovered it last week. They showed that if you use gradient accumulation, you will get a higher loss than if you used full-batch training.

We managed to formulate a new methodology that solves the issue - and it's now in Unsloth! Please update Unsloth and use `unsloth_train`:

```
from unsloth import unsloth_train
# trainer_stats = trainer.train() << Buggy gradient accumulation
trainer_stats = unsloth_train(trainer)
```

💡 Our Findings

Enter gradient accumulation! Instead of materializing the full batch, we compute gradients on the fly as each new mini batch comes in. We then simply add all the mini gradients up, do some scaling, and we have the final large-batch gradient.
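As a minimal sketch (a toy scalar model of our own, not Unsloth's code), accumulating the mini-batch gradients and scaling by 1/G reproduces the full-batch gradient when the mini batches are the same size:

```
# Toy model: per-example loss (w - x)**2, so the mean-loss gradient
# over a batch xs is the mean of 2*(w - x).
def grad(w, xs):
    return sum(2 * (w - x) for x in xs) / len(xs)

data = [1.0, 2.0, 3.0, 4.0]
w = 0.5
G = 2  # gradient accumulation steps
minis = [data[0:2], data[2:4]]  # equal-sized mini batches

# Full-batch gradient.
full = grad(w, data)

# Accumulate mini-batch gradients, then scale by 1/G.
acc = sum(grad(w, mb) for mb in minis) / G

assert abs(acc - full) < 1e-12
```

The equality only needs the mini batches to contribute the same number of examples (or tokens) - which is exactly the assumption that breaks in practice.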

Our second theory was that there is in fact a bug in the loss calculation - and we found this to be the case.

But we see below that the final sum does not equal the original full-batch loss - in fact it's G times bigger (where G is the number of gradient accumulation steps). So in gradient accumulation, we have to scale each mini gradient accumulator by the number of gradient accumulation steps so we get the desired result. This generally works well in expectation for large batches.
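The failure mode is easy to see numerically. In this toy example (the per-token loss values are made up), averaging the per-mini-batch mean losses differs from the full-batch mean whenever sequence lengths differ, while the general idea behind the fix - summing the unnormalized losses and dividing by the total token count - matches it exactly:

```
# Toy per-token losses for two sequences of unequal length (made-up values).
seq_a = [2.0, 2.0, 2.0]  # 3 tokens
seq_b = [1.0]            # 1 token

# Full batch: one mean over ALL tokens.
all_tokens = seq_a + seq_b
full_batch = sum(all_tokens) / len(all_tokens)                   # 1.75

# Naive accumulation (G=2): average the per-mini-batch means.
naive = (sum(seq_a) / len(seq_a) + sum(seq_b) / len(seq_b)) / 2  # 1.5

# Fixed: sum unnormalized losses, divide by the total token count.
fixed = (sum(seq_a) + sum(seq_b)) / (len(seq_a) + len(seq_b))    # 1.75

assert fixed == full_batch
assert naive != full_batch  # the short sequence is over-weighted
```

The naive version effectively gives every mini batch equal weight regardless of how many tokens it contains, which over-weights short sequences.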

We implemented this fix in Unsloth, and all loss curves now match up, showing indeed gradient accumulation is equivalent to full batch training.

We ran all combinations (bsz=1, ga=16 all the way to bsz=16, ga=1), compared the LoRA weights to the full-batch version (bsz=16, ga=1), and obtained the L2 norm of the difference.

We show that (1) there is an inherent accumulation error due to floating point arithmetic (0.0068 L2 norm), and (2) as the number of gradient accumulation steps increases, the L2 norm increases (from 0.0196 to 0.0286 L2 norm).

This essentially means gradient accumulation does inherently have a tiny floating point addition penalty, and naively, the larger the number of gradient accumulation steps, the higher the discrepancy. By using our fixed Unsloth gradient accumulation version, the L2 norm error can be reduced by over an order of magnitude. So, please update Unsloth and use `unsloth_train`:
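For reference, the L2 norm difference between two weight snapshots can be computed in a few lines (a generic sketch with hypothetical weight values, not Unsloth's internal benchmarking code):

```
import math

def l2_norm_diff(weights_a, weights_b):
    # L2 norm of the elementwise difference between two flat weight lists.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(weights_a, weights_b)))

# Hypothetical flattened LoRA weights from two training runs.
run_full  = [0.10, -0.20, 0.30]
run_accum = [0.11, -0.19, 0.30]

print(l2_norm_diff(run_full, run_accum))  # ≈ 0.0141
```

In practice you would flatten and concatenate every LoRA adapter tensor from each run before comparing.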

```
from unsloth import unsloth_train
# trainer_stats = trainer.train() << Buggy gradient accumulation
trainer_stats = unsloth_train(trainer)
```

But what happens if one sequence length (just one) is bigger than the rest by a small epsilon? What would that do? We see that there's an epsilon squared term, which is nonnegative - and strictly positive whenever epsilon is nonzero! But we also need to prove this holds if there is one sequence length which is slightly smaller than the rest. In both cases the inequality holds, since epsilon squared is always greater than or equal to zero. This essentially proves that naive or standard gradient accumulation will always have a higher loss than full-batch training for bsz=2, ga=2. We then generalize the proof to other combinations of bsz and ga, which can get more involved.
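A worked version of that bsz=2, ga=2 case, under our own labeling (each of the four sequences has summed loss $L$; three have length $m$ and one has length $m+\epsilon$):

```
\mathcal{L}_{\text{full}} = \frac{4L}{4m+\epsilon}, \qquad
\mathcal{L}_{\text{acc}} = \frac{1}{2}\left(\frac{2L}{2m} + \frac{2L}{2m+\epsilon}\right)

\mathcal{L}_{\text{acc}} - \mathcal{L}_{\text{full}}
  = L\left(\frac{1}{2m} + \frac{1}{2m+\epsilon} - \frac{4}{4m+\epsilon}\right)
  = \frac{L\,\epsilon^{2}}{2m\,(2m+\epsilon)(4m+\epsilon)} \;\ge\; 0
```

Since all the denominators are positive for $|\epsilon| < 2m$, the sign of the gap is controlled entirely by $\epsilon^{2}$, which covers both the "slightly longer" and "slightly shorter" cases at once.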

We also have to prove the inequality holds the other way as well - i.e. the goal is to prove that general naive gradient accumulation is not equivalent to full-batch training.

💕 Thank you!

As usual, a huge thank you to everyone for using & sharing Unsloth - we really appreciate it. Also a huge shout out to Dario, Bronson, Jun, John, Steven & Aaron, who are new supporters! 🙏

Thank you for reading!

Daniel & Michael Han 🦥
15 Oct 2024

© 2024 unsloth. All rights reserved.