
This magic could be recreated through GRPO, an RL algorithm that optimizes responses efficiently without requiring a value function, unlike Proximal Policy Optimization (PPO), which relies on one. In our notebooks, we train a model with GRPO, aiming for it to develop its own self-verification and search abilities autonomously, creating a mini "aha moment".
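Concretely, a GRPO run with Unsloth looks roughly like the minimal sketch below. It assumes TRL's GRPOTrainer and GRPOConfig; the prompts, the reward_len reward function, and the hyperparameters are toy placeholders rather than the settings from our notebooks, which use a reasoning dataset and correctness/format-based rewards.

# Minimal GRPO sketch, import unsloth before trl so its patches apply
from unsloth import FastLanguageModel
from trl import GRPOConfig, GRPOTrainer
from datasets import Dataset

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-3B-Instruct",
    max_seq_length = 1024,
    load_in_4bit = True,
    fast_inference = True,  # vLLM backend for fast rollouts
)
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_alpha = 16,
)

# Toy prompts, real runs use a reasoning dataset such as GSM8K
dataset = Dataset.from_dict({"prompt": ["What is 2 + 2?", "What is 3 * 7?"]})

# GRPO scores every sampled completion in a group. Placeholder reward
# that prefers shorter answers, real rewards check correctness and format
def reward_len(completions, **kwargs):
    return [-float(len(c)) for c in completions]

trainer = GRPOTrainer(
    model = model,
    processing_class = tokenizer,
    reward_funcs = [reward_len],
    args = GRPOConfig(
        use_vllm = True,         # generate rollouts with vLLM
        num_generations = 4,     # completions sampled per prompt (the "group")
        max_completion_length = 128,
        learning_rate = 5e-6,
        per_device_train_batch_size = 4,
        max_steps = 10,
        output_dir = "grpo_outputs",
    ),
    train_dataset = dataset,
)
trainer.train()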
In addition to GRPO, we now also have support for Online DPO, PPO, and RLOO! More details are in Keith’s post and blog, which include the GitHub fork showing how he got Online DPO working, and the initial draft of the GRPO changes on Google Colab can be seen in Joey’s tweet. Both of their contributions allowed us to add support for the other generation-based RL methods. See below for a graph comparing Unsloth's Online DPO VRAM consumption with standard Hugging Face + FA2.
pip install unsloth vllm
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-3B-Instruct",
    fast_inference = True,  # load a vLLM engine alongside the model
)

# Generate through vLLM instead of the standard Hugging Face path
model.fast_generate(["Hello!"])
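Since fast_generate is backed by vLLM, you can also pass vLLM sampling parameters to control decoding. A small sketch; the temperature, top_p, and token budget below are arbitrary example values:

from vllm import SamplingParams

# Example sampling settings, tune these for your own use case
sampling_params = SamplingParams(temperature = 0.8, top_p = 0.95, max_tokens = 64)
model.fast_generate(["Hello!"], sampling_params = sampling_params)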