Vision Reinforcement Learning

Aug 28, 2025 • By Daniel & Michael


Unsloth now supports vision/multimodal RL with Gemma 3 and Qwen2.5-VL. Thanks to Unsloth's unique weight sharing and custom kernels, VLM RL is 1.5–2× faster, uses 90% less VRAM, and supports 10× longer context lengths than FA2 setups, with no accuracy loss. This update also adds support for Qwen's GSPO algorithm.
Unsloth can train Qwen2.5-VL-7B with GRPO on a free Colab T4 GPU. Other VLMs work too, but may need larger GPUs.
  • Gemma requires newer GPUs than the T4 because vLLM restricts it to bfloat16, so we recommend an NVIDIA L4 on Colab. Our notebooks train on numerical math problems involving images and diagrams.
  • We have also added native vLLM VLM integration into Unsloth, so to use vLLM inference you only need to pass the fast_inference=True flag when initializing the model (a minimal loading sketch follows this list).
  • This VLM support also incorporates our latest update for even more memory-efficient and faster RL, including our Standby feature, which uniquely limits speed degradation compared to other implementations.
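For reference, here is a minimal sketch of loading a vision model this way. It assumes the FastVisionModel loader; the model name and LoRA settings are illustrative, so consult the notebooks for the full GRPO reward setup and training loop.

```python
from unsloth import FastVisionModel

# Minimal vision-RL loading sketch. The model name and LoRA settings
# below are illustrative -- see the official notebooks for the full
# GRPO configuration.
model, tokenizer = FastVisionModel.from_pretrained(
    model_name     = "unsloth/Qwen2.5-VL-7B-Instruct",
    load_in_4bit   = True,   # QLoRA so it fits on a free Colab T4
    fast_inference = True,   # enable Unsloth's native vLLM integration
)

# Attach LoRA adapters so only a small set of weights is trained during RL.
model = FastVisionModel.get_peft_model(
    model,
    r          = 16,
    lora_alpha = 16,
)
```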
🦥Introducing Unsloth Standby
We're excited to introduce more efficient reinforcement learning (RL) in Unsloth with multiple algorithmic advancements:
  • 1.2–1.7× longer context lengths with no slowdown and no extra memory usage
  • 10% faster RL training runs with revamped kernels and async data movement
  • 2× faster torch.compile times during model loading

Unsloth already increases RL training speed and context length while reducing VRAM usage by 50–90% versus other FA2 setups, and Standby improves this even further. Our Standby feature uniquely limits speed degradation compared to other implementations and sometimes makes training even faster!
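Enabling Standby is a one-line change made before Unsloth is imported. The environment variable name below follows our memory-efficient RL documentation; treat it as an assumption and verify it against the latest docs:

```python
import os

# Enable Standby so training and vLLM inference weights can share memory.
# The variable name follows the memory-efficient RL docs -- treat it as
# an assumption and verify against the latest documentation.
os.environ["UNSLOTH_VLLM_STANDBY"] = "1"

import unsloth  # the import (and model loading) must come after the flag is set
```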

Qwen3-32B LoRA (16-bit) can now reach a 6,144-token context versus 3,600 before (1.7× longer) on a single H100 80GB GPU. Llama-3.1-8B QLoRA (4-bit) can reach 47,500 tokens versus 42,000 before (1.13× longer).

We made RL runs 10% faster through various kernel optimizations, and removed the LoRA communication channel between the CPU and GPU when switching from training to inference mode. Finally, we used custom torch.compile flags to make vLLM's rollout faster by 10%, and reduced compilation time by 2x.
If you prefer to merge the model and push it directly to the Hugging Face Hub instead, you can do so using:

```python
# repo_name is e.g. "your-username/your-model"; hf_token is a Hugging Face write token.
model.push_to_hub_merged(repo_name, tokenizer = tokenizer, token = hf_token)
```
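To keep the merged weights locally instead, a similar call can be used; the output directory name is illustrative and the save_method value shown is the usual 16-bit merge option:

```python
# Save merged 16-bit weights to a local directory instead of the Hub.
# The directory name is illustrative; save_method = "merged_16bit" is
# the usual full 16-bit merge option.
model.save_pretrained_merged(
    "merged_model",
    tokenizer,
    save_method = "merged_16bit",
)
```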

✨Fine-tuning gpt-oss directly

We also added support for directly fine-tuning gpt-oss models by implementing patches that allow loading the native MXFP4 quantized format. This makes it possible to load openai/gpt-oss-20b with less than 24GB of VRAM and QLoRA fine-tune it. Simply load the model using:

```python
from unsloth import FastLanguageModel

max_seq_length = 1024  # Choose any for long context!
dtype = None           # None for auto detection

model, tokenizer = FastLanguageModel.from_pretrained(
    # model_name = "unsloth/gpt-oss-20b-BF16",
    model_name = "unsloth/gpt-oss-20b",
    dtype = dtype,
    max_seq_length = max_seq_length,
    load_in_4bit = True,      # 4-bit quantization to reduce memory
    full_finetuning = False,  # [NEW!] We have full finetuning now!
    # token = "hf_...",       # use one if using gated models
)
```
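After loading, LoRA adapters can be attached before training. The rank and target modules below mirror the benchmark setup later in this post and are otherwise illustrative:

```python
# Attach LoRA adapters on all linear layers (Q, K, V, O, gate, up, down),
# matching the rank-32 configuration used in the benchmarks below.
# Values are illustrative -- adjust them for your own run.
model = FastLanguageModel.get_peft_model(
    model,
    r = 32,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 32,
    use_gradient_checkpointing = "unsloth",  # reduces VRAM for long context
)
```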
🐛Bug Fixes for gpt-oss
We recently collaborated with Hugging Face to resolve inference issues by using OpenAI’s kernels and ensuring that swiglu_limit = 7.0 is correctly applied during MXFP4 inference.

Based on user feedback, we discovered that extended QLoRA training runs (beyond 60 steps) could cause the loss to diverge and eventually error out. This issue only occurred on devices that do not support bfloat16 and instead fall back to float16 (e.g., T4 GPUs). Importantly, it did not impact QLoRA training on A100 or H100 GPUs, nor LoRA training on float16 GPUs.

After extensive investigation, we’ve now aligned training loss behavior across all GPU setups, including GPUs limited to F16. If you were previously experiencing issues because of this, we recommend using our new updated gpt-oss notebook!
We ran many experiments to make float16's training loss curve match that of bfloat16 machines. We found the following:
  • Pure float16 training goes to infinity at step 50
  • The down projections in the MoE layers have huge outliers
  • Activations must be saved in bfloat16 or float32
The absolute activation magnitudes for gpt-oss-20b contain some truly large spikes; these overflow on float16 machines, since float16's maximum representable value is 65,504.
We fixed this in Unsloth, so all float16 training works out of the box!
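As a quick standalone illustration (not our training code), the snippet below shows an outlier-sized activation overflowing in float16 while remaining finite in bfloat16 and float32:

```python
import torch

# float16 tops out at 65504, so activation outliers beyond that overflow.
print(torch.finfo(torch.float16).max)   # 65504.0

x = torch.tensor([70000.0])             # an outlier-sized activation value
print(x.to(torch.float16))              # tensor([inf], dtype=torch.float16)
print(x.to(torch.bfloat16))             # ~70144 -- large but finite
print(x.to(torch.float32))              # 70000. -- exact
```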
📈gpt-oss-20b benchmarks
We tested gpt-oss-20b with LoRA on all linear layers (Q, K, V, O, gate, up and down) with rank = 32 and a batch size of 1. We padded all sequences to a fixed maximum sequence length to mimic long-context finetuning workloads.

gpt-oss-20b BF16 LoRA: Context Length vs. GPU VRAM (GB)

| Context Length | Unsloth (+ Flex Attention) | Official Cookbook + FA3 | Official Cookbook |
|---|---|---|---|
| 1,024 | 45.2 | 46.6 | 47.3 |
| 2,048 | 45.94 | 49.7 | 51.3 |
| 4,096 | 47.07 | 56.1 | 71.1 |
| 8,192 | 49.27 | 68.7 | OOM |
| 16,384 | 54 | OOM | OOM |
| 32,768 | 63.73 | OOM | OOM |
| 61,234 | 80 | OOM | OOM |
💕 Thank you! 
As usual, a huge thank you to everyone for using & sharing Unsloth - we really appreciate it. 🙏

As always, be sure to join our Reddit page and Discord server for help or just to show your support! You can also follow us on Twitter and subscribe to our newsletter on Substack.
Thank you for reading!
Daniel & Michael Han 🦥
Aug 28, 2025
