387% faster TinyLlama, 6x faster GGUF conversion

Jan 18, 2024 • By Daniel & Michael

TinyLlama, Colab T4: 387% faster
TinyLlama, Colab T4: -74% VRAM
DPO Zephyr, 1xA100: 188% faster
DPO Zephyr, 1xA100: -11.6% VRAM

Hey there, it’s been a while, but we’ve got lots of new things to talk about in this release! Happy New Year as well!

You can finetune TinyLlama 387% faster and use 74% less memory: 1 epoch of Alpaca's 52K dataset takes 84 minutes on a free Google Colab instance with packing support! We also automatically extend the context window from 2048 to 4096 tokens! Notebook
With packing support through 🤗Hugging Face, TinyLlama is not 387% faster but a whopping 6,700% faster than non-packing!! Shocking!
We pre-quantized Llama-7b, Mistral-7b, Codellama-34b, etc. to 4-bit, which makes downloading 4x faster and cuts VRAM use by 500MB to 1GB by reducing fragmentation. No more OOMs! Notebook for Mistral-7b.
You can save Unsloth-trained models directly to float16 for vLLM, push them to the 🤗HF Hub, or save to GGUF directly 6x faster using save_pretrained_merged or save_pretrained_gguf! Notebook for Mistral-7b. Scroll down the notebook for saving.
For an easy UI, Unsloth is integrated into Llama Factory, with help from the lovely team!
We’ve achieved 188% faster DPO training with 12% less VRAM. Notebook
As highly requested by many of you, all Llama/Mistral models, including Yi, Deepseek, Starling, and Qwen, are now supported. Just try your favorite model out! We'll error out if it doesn't work :)
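
To give a feel for why packing helps so much on short-example datasets like Alpaca, here is a toy sketch of the idea: tokenized examples are concatenated (with an end-of-sequence marker between them) and cut into fixed-length blocks, so almost no positions are wasted on padding. This is only an illustration under simplifying assumptions, not the 🤗Hugging Face implementation; `pack_sequences`, `padding_fraction`, and the token ids are made up for the example.

```python
from typing import Iterable, List


def pack_sequences(tokenized: Iterable[List[int]], seq_length: int, eos_id: int) -> List[List[int]]:
    """Concatenate tokenized examples (each followed by EOS) and cut into fixed-size blocks."""
    stream: List[int] = []
    for ids in tokenized:
        stream.extend(ids)
        stream.append(eos_id)  # mark the boundary between examples
    # Keep only full blocks; the trailing remainder is dropped in this sketch.
    n_full = len(stream) // seq_length
    return [stream[i * seq_length:(i + 1) * seq_length] for i in range(n_full)]


def padding_fraction(tokenized: List[List[int]], seq_length: int) -> float:
    """Share of positions that are padding when every example is padded to seq_length on its own."""
    lengths = [min(len(ids), seq_length) for ids in tokenized]
    return 1 - sum(lengths) / (len(lengths) * seq_length)


examples = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10]]
blocks = pack_sequences(examples, seq_length=4, eos_id=0)
print(blocks)                         # three full blocks of 4 tokens each
print(padding_fraction(examples, 4))  # 0.375 of positions would be padding without packing
```

Without packing, 4 sequences of length 4 are processed and 37.5% of the positions are padding; with packing, the same tokens fit in 3 fully used blocks.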

Hugging Face 🤗

In case you missed it, we've also written up a blog post on Hugging Face. With Unsloth directly integrated, users can now get 2x faster finetuning and use 50% less memory just by installing our package. A huge thanks to the Hugging Face team and Younes Belkada for making this possible. We look forward to more collabs in the future! We're also in 🤗Hugging Face's docs!

1 A100 40GB	Dataset	🤗Hugging Face	🤗 + Flash Attention 2	🦥Unsloth	🦥 VRAM reduction
Code Llama 34b	Slim Orca	1x	1.01x	1.94x	-22.7%
Llama-2 7b	Slim Orca	1x	0.96x	1.87x	-39.3%
Mistral 7b	Slim Orca	1x	1.17x	1.88x	-65.9%
Tiny Llama 1.1b	Alpaca	1x	1.55x	2.74x	-57.8%
DPO with Zephyr	Ultra Chat	1x	1.24x	1.88x	-11.6%

Unsloth was benchmarked across 59 runs using 4 datasets on Tesla T4 and A100 Google Colab instances. QLoRA was applied to all linear layers (attention and MLP) with a rank of 16, and gradient checkpointing was on. Tested against the latest Transformers version (4.36), which has SDPA natively integrated if you have PyTorch 2.1.1, Unsloth is up to 2.7x faster and uses up to 74% less memory. We also tested Unsloth on a free Google Colab instance (low RAM, 1 T4 GPU, PyTorch 2.1.0, CUDA 12.1). All 59 notebooks are provided for full reproducibility, and more benchmarking details are available here
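
For a sense of scale, the rank-16 setup on all linear layers can be worked out by hand. The sketch below counts the LoRA parameters this adds to Llama-2 7b, assuming its published shapes (hidden size 4096, MLP intermediate size 11008, 32 layers) and the usual seven projection matrices per layer; the helper name `lora_params` is just for illustration.

```python
def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """A LoRA adapter adds A (rank x in_features) and B (out_features x rank)."""
    return rank * (in_features + out_features)


# Assumed Llama-2 7b shapes: hidden size, MLP intermediate size, number of layers.
HIDDEN, INTERMEDIATE, LAYERS, RANK = 4096, 11008, 32, 16

# (in_features, out_features) for each linear layer in one decoder block.
linear_shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

per_layer = sum(lora_params(i, o, RANK) for i, o in linear_shapes.values())
total = per_layer * LAYERS
print(f"{per_layer:,} LoRA parameters per layer, {total:,} in total")
```

That comes to roughly 40 million trainable parameters, a small fraction of the 7b base model, which stays frozen in 4-bit.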

Free Colab T4	Dataset	🤗 Hugging Face	🤗 + Pytorch 2.1.1	🦥 Unsloth	🦥 VRAM reduction
Llama-2 7b	OASST	1x	1.19x	1.95x	-43.3%
Mistral 7b	Alpaca	1x	1.07x	1.56x	-13.7%
Tiny Llama 1.1b	Alpaca	1x	2.06x	3.87x	-73.8%
DPO with Zephyr	Ultra Chat	1x	1.09x	1.55x	-18.6%

Other important updates

Updated LoRA to include Bias and Dropout functionality
We support LoftQ, RSLoRA (Rank stabilized LoRA) and all PEFT arguments
For our Jupyter Notebook users, we’ve introduced DPO streaming stats. We hope this helps you keep a closer eye on your model’s performance
We’ve also enabled native text streaming in all notebooks, making it easier for you to manage your text data

Support us! 💕

As a team of just 2 brothers with 0 revenue or funding, we’ve decided to open a donation page. You can now support us directly on Ko-fi! Any donations will be greatly appreciated and members will get exclusive early access to some of our future releases.

As always, be sure to join our Discord server for help or just to show your support! You can also follow us on Twitter and Substack. We appreciate your continued patronage!

Code

6x faster GGUF conversion and QLoRA to float16 merging support:

model.save_pretrained_merged("dir", save_method = "merged_16bit")
model.save_pretrained_merged("dir", save_method = "merged_4bit")
model.save_pretrained_gguf("dir", tokenizer, quantization_method = "q4_k_m")
model.save_pretrained_gguf("dir", tokenizer, quantization_method = "fast_quantized")

model.push_to_hub_merged("hf_username/dir", save_method = "merged_16bit")
model.push_to_hub_merged("hf_username/dir", save_method = "merged_4bit")
model.push_to_hub_gguf("hf_username/dir", tokenizer, quantization_method = "q4_k_m")
model.push_to_hub_gguf("hf_username/dir", tokenizer, quantization_method = "fast_quantized")
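
If you switch between these formats often, a small wrapper can pick the right call for you. The sketch below is a hypothetical helper (`export_model` is not part of Unsloth); it only forwards to the two local save methods listed above, with the same arguments, and assumes `model` and `tokenizer` come from an Unsloth finetuning run.

```python
MERGED_METHODS = ("merged_16bit", "merged_4bit")
GGUF_METHODS = ("q4_k_m", "fast_quantized")


def export_model(model, tokenizer, out_dir: str, fmt: str = "merged_16bit") -> None:
    """Hypothetical helper: choose one of the save calls shown above by format name."""
    if fmt in MERGED_METHODS:
        # Merge the LoRA weights into the base model and save it.
        model.save_pretrained_merged(out_dir, save_method=fmt)
    elif fmt in GGUF_METHODS:
        # Convert to GGUF with the requested quantization method.
        model.save_pretrained_gguf(out_dir, tokenizer, quantization_method=fmt)
    else:
        raise ValueError(f"unsupported export format: {fmt!r}")
```

For example, `export_model(model, tokenizer, "dir", "q4_k_m")` is equivalent to the third line of the first group above.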

4x faster model downloading + >= 500MB less GPU fragmentation by pre-quantized models:

    "unsloth/mistral-7b-bnb-4bit",
    "unsloth/llama-2-7b-bnb-4bit",
    "unsloth/llama-2-13b-bnb-4bit",
    "unsloth/codellama-34b-bnb-4bit",
    "unsloth/tinyllama-bnb-4bit",

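Loading one of these looks roughly like the sketch below. It assumes the `FastLanguageModel.from_pretrained` entry point used in our notebooks; `load_prequantized` is a hypothetical wrapper, and the import is done lazily because it needs the unsloth package and a CUDA GPU.

```python
PREQUANTIZED = [
    "unsloth/mistral-7b-bnb-4bit",
    "unsloth/llama-2-7b-bnb-4bit",
    "unsloth/llama-2-13b-bnb-4bit",
    "unsloth/codellama-34b-bnb-4bit",
    "unsloth/tinyllama-bnb-4bit",
]


def load_prequantized(model_name: str, max_seq_length: int = 4096):
    """Hypothetical wrapper: load one of the pre-quantized 4-bit checkpoints listed above."""
    if model_name not in PREQUANTIZED:
        raise ValueError(f"not a known pre-quantized model: {model_name!r}")
    # Lazy import: requires the unsloth package and a CUDA GPU.
    from unsloth import FastLanguageModel

    # Returns a (model, tokenizer) pair.
    return FastLanguageModel.from_pretrained(
        model_name=model_name,
        max_seq_length=max_seq_length,
        dtype=None,         # auto-detect float16 / bfloat16
        load_in_4bit=True,  # the checkpoint is already stored in 4-bit
    )
```

On a GPU machine, `model, tokenizer = load_prequantized("unsloth/tinyllama-bnb-4bit")` downloads the already-quantized weights instead of the full-precision ones.
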
Thank you for reading!

Daniel & Michael Han 🦥
18 January 2024

Unsloth Studio coming soon!
