💎 Fine-tune MoE Models 12x Faster with Unsloth

A guide to training MoE LLMs locally with Unsloth.

We’re introducing ~12x faster Mixture of Experts (MoE) LLM training with >35% less VRAM and ~6x longer context with our new MoE Triton kernels and new mathematical optimizations, all with no loss in accuracy.

  • Unsloth now supports fast training for MoE architectures including gpt-oss, Qwen3 (30B, 235B, VL, Coder), DeepSeek R1, V3 and GLM (4.6, 4.7, Flash).

  • gpt-oss-20b fine-tunes in 12.8 GB VRAM. Qwen3-30B-A3B (16-bit LoRA) uses 63GB.

  • Our kernels work on data-center GPUs (B200, H100) as well as consumer and older GPUs (e.g., RTX 3090), and support FFT, LoRA and QLoRA.

In collaboration with Hugging Face, we standardized all MoE training runs on PyTorch’s new torch._grouped_mm function. Transformers v5 was recently optimized to be ~6x faster for MoE than v4, and Unsloth pushes this even further with custom Triton grouped‑GEMM + LoRA kernels for an additional ~2x speedup, >35% VRAM reduction and >6x longer context (a 12-30x overall speedup vs v4).

Try our Unsloth Notebooks for fast MoE training:

🦥 Unsloth MoE Triton Kernels

Alongside torch._grouped_mm (see ❓What is torch._grouped_mm?), we created custom Triton MoE kernels that can be even faster in some cases. They are also backwards compatible with older hardware like A100, and older PyTorch versions.

On A100, our Triton kernels are ~2.5× faster than torch._grouped_mm. The kernels also have a one‑time autotune step to pick the best kernel config.

Auto-tuning takes ~2 minutes once at the start of training, but can speed up the full run by up to ~35% on A100, which is well worth it for longer runs.


🧭 Automatic backend selection

Our main innovation is our Split LoRA approach for efficient MoE, which uses ~35% less memory and trains ~2x faster than Transformers v5 + torch._grouped_mm. Combined, torch._grouped_mm and our custom Triton kernels are ~12-30x faster than Transformers v4.

Unsloth will automatically select one of the following backends depending on your hardware:

| Backend | Optimizations |
| --- | --- |
| grouped_mm | torch._grouped_mm - available from the T4 all the way to the B200, but optimized for H100 and newer. |
| unsloth_triton | Unsloth Triton kernels - turned on automatically for A100s and older PyTorch versions. |
| native_torch | Native PyTorch. ~12x slower, but our VRAM reductions still apply! |

You can also toggle them yourself:
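As a minimal sketch (the UNSLOTH_MOE_BACKEND environment variable is described later in this post; we assume the accepted values mirror the backend names in the table above, and that it should be set before importing Unsloth):

```python
import os

# Force one of the MoE backends listed above.
# Assumed values: "grouped_mm", "unsloth_triton", or "native_torch".
# Set this before importing unsloth so the choice is picked up at load time.
os.environ["UNSLOTH_MOE_BACKEND"] = "unsloth_triton"

from unsloth import FastLanguageModel

# Any supported MoE checkpoint works here; the model name is just an example.
model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/Qwen3-30B-A3B",
    max_seq_length = 8192,
    load_in_4bit = True,   # QLoRA; set False for 16-bit LoRA
)
```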


❓What is torch._grouped_mm?

Previously, Mixture-of-Experts (MoE) weights were stored as a ModuleList of per‑expert linear layers. The only practical way to run a forward pass was a for‑loop over experts, which is expensive and suboptimal.

PyTorch recently introduced grouped_mm to address this exact bottleneck. In parallel, we provide our own MoE‑optimized Triton kernels. This also lines up with a key Transformers change: as of Transformers v5, expert weights are stored as a single nn.Parameter, making grouped_mm a natural fit for faster MoE training and inference.

So transformers 4.57.6 changed:
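A schematic sketch of the old layout (not the exact Transformers source): each expert is its own nn.Linear inside a ModuleList, so the forward pass has to loop over experts in Python:

```python
import torch
import torch.nn as nn

class MoEMLPv4Style(nn.Module):
    """Old per-expert layout: a ModuleList of Linears, one per expert."""
    def __init__(self, num_experts: int, hidden: int, intermediate: int):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Linear(hidden, intermediate, bias=False) for _ in range(num_experts)]
        )

    def forward(self, x: torch.Tensor, expert_ids: torch.Tensor) -> torch.Tensor:
        # x: (tokens, hidden); expert_ids: (tokens,) routing decision per token
        out = x.new_empty(x.shape[0], self.experts[0].out_features)
        for e, expert in enumerate(self.experts):   # expensive Python-level loop
            mask = expert_ids == e
            if mask.any():
                out[mask] = expert(x[mask])
        return out
```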

to transformers 5.0.0 style:
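And a sketch of the v5-style layout: all expert weights for a projection live in one stacked nn.Parameter (exact attribute names and shapes in Transformers may differ), so the per-expert loop can be replaced by a single grouped/batched matmul such as torch._grouped_mm:

```python
import torch
import torch.nn as nn

class MoEMLPv5Style(nn.Module):
    """v5-style layout: one stacked weight tensor per projection."""
    def __init__(self, num_experts: int, hidden: int, intermediate: int):
        super().__init__()
        # A single 3D nn.Parameter holding every expert's weight.
        self.gate_up_proj = nn.Parameter(torch.empty(num_experts, hidden, intermediate))
        nn.init.normal_(self.gate_up_proj, std=0.02)

    def forward(self, x: torch.Tensor, expert_ids: torch.Tensor) -> torch.Tensor:
        # x: (tokens, hidden); expert_ids: (tokens,)
        # Reference path: gather each token's expert weight and batch the matmul.
        # The fast path instead sorts tokens by expert and issues one grouped GEMM
        # (e.g. torch._grouped_mm; its exact signature depends on the PyTorch version).
        w = self.gate_up_proj[expert_ids]                  # (tokens, hidden, intermediate)
        return torch.bmm(x.unsqueeze(1), w).squeeze(1)     # (tokens, intermediate)
```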

torch._grouped_mm works on GPUs starting with the NVIDIA T4, and we’ve verified it on H100, A100, B200, and RTX 6000 Pro, so support is broadly available.

We also previously introduced Unsloth Flex Attention for gpt-oss; combined with these MoE optimizations, gpt-oss training becomes even more efficient.

📊 Kernel Results + Benchmarks

Below is a comparison across sequence lengths for training speed and memory usage versus Transformers v5 (which already uses torch._grouped_mm for MoE). For gpt-oss BF16 MoE training, we see 7x faster training and 36% VRAM reduction on NVIDIA B200. For Qwen3-30B-A3B, it's 1.8x faster, and GLM 4.7 Flash is 2.1x faster on RTX PRO 6000. All benchmarks are done with LoRA rank = 64 and all LoRA modules on MoE layers (gate, up, down).

gpt-oss Benchmarks

We fine-tuned unsloth/gpt-oss-20b-BF16 for benchmarking. Unsloth is 7x faster and uses 36% less VRAM at 16K context length. Transformers v5 + TRL goes out of memory whilst Unsloth does not. The speedup also increases with sequence length in this case thanks to our Flex Attention implementation and our MoE kernels.

Comparison with Transformers v5 (a comparison that also includes Transformers v4 is in the More Benchmarks section below)
| Context Length | Unsloth (ms) | TF v5 (ms) | Unsloth Mem (GB) | TF v5 Mem (GB) | Speed Up | VRAM Saving |
| --- | --- | --- | --- | --- | --- | --- |
| 1024 | 275.35 | 376.99 | 40.91 | 43.88 | 1.4x | 6.76% |
| 2048 | 292.88 | 696.57 | 41.83 | 44.93 | 2.4x | 6.89% |
| 4096 | 370.30 | 1785.89 | 43.68 | 49.86 | 4.8x | 12.39% |
| 8192 | 712.33 | 5226.86 | 47.43 | 73.80 | 7.3x | 35.73% |
| 16384 | 1775.80 | OOM | 55.13 | OOM | N/A | N/A |

Qwen3 Benchmarks

On an NVIDIA B200, we see ~1.7x speedup and ~35% better memory efficiency with Qwen3-30B-A3B LoRA, with memory savings improving further at longer sequence lengths.

On an H100 GPU, we perform significantly better than the baseline, reaching up to 1.77x faster training while also saving ~5.3 GB when fine-tuning at 4K context length. We scale seamlessly to 8192 context, where Transformers v5 + TRL OOMs. Note that we use less memory at 8K than the baseline does at 4K, so we can keep pushing the context length further.

| Context Length | Unsloth (ms) | TF v5 (ms) | Unsloth Mem (GB) | TF v5 Mem (GB) | Speed Up | VRAM Saving |
| --- | --- | --- | --- | --- | --- | --- |
| 1024 | 366.3 | 628.3 | 80.88 | 104.80 | 1.7x | 2.06% |
| 2048 | 467.0 | 745.3 | 80.88 | 104.81 | 1.6x | 2.57% |
| 4096 | 711.6 | 975.5 | 80.89 | 104.80 | 1.4x | 5.08% |
| 8192 | 1376.6 | 1633.5 | 80.90 | 104.81 | 1.2x | 9.17% |
| 16384 | 3182.2 | 3407.9 | 85.53 | 116.61 | 1.1x | 15.26% |

GLM 4.7 Benchmarks

Unsloth achieves up to 2.6x faster training with up to ~15% less VRAM for GLM 4.7 Flash across the tested context lengths. GLM 4.7 Flash is a 30B MoE (3B active parameters) agentic & coding model that employs a configuration similar to the DeepSeek MoE style, featuring 64 routed experts and 1 shared expert. We benchmarked Unsloth MoE training against the newly optimized Transformers v5.

Use our new Colab notebook for GLM 4.7 Flash below:

GLM 4.7 Flash MoE Notebook

A100 80GB

| Context Length | Unsloth (ms) | TF v5 (ms) | Unsloth Mem (GB) | TF v5 Mem (GB) | Speed Up | VRAM Saving |
| --- | --- | --- | --- | --- | --- | --- |
| 512 | 1145.0 | 2992.1 | 57.81 | 60.89 | 2.6x | 6.51% |
| 1024 | 1298.9 | 3323.3 | 58.76 | 62.55 | 2.6x | 6.22% |
| 2048 | 1831.9 | 4119.3 | 60.09 | 67.32 | 2.3x | 9.46% |
| 4096 | 2883.9 | 5646.1 | 63.34 | 76.78 | 2x | 14.83% |

⚡Faster LoRA, QLoRA MoE training

In Transformers/PEFT, the usual approach is to merge the LoRA adapter into the base weight and then run the MoE computation (especially since MoE often uses nn.Parameter instead of nn.Linear). The problem is that this merge materializes the LoRA delta (lora_B @ lora_A) for all the experts, which is very memory-hungry.

Unsloth avoids that. We previously used the same idea to optimize generic LoRA training and inference, and we’ve now applied it to MoE + LoRA as well. The math is identical, so the loss, gradients, and outputs stay the same. The only change is the order of operations, made possible by matrix-multiplication associativity. With this reordering, we get major speedups and memory reductions.
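A minimal sketch of the reordering for a single dense projection (PEFT-style shapes assumed: lora_A is (r, in), lora_B is (out, r)); both paths produce the same output, but the split path never materializes the (out, in) delta:

```python
import torch

torch.manual_seed(0)
s, m, n, r = 512, 2048, 768, 64                       # tokens, in-dim, out-dim, LoRA rank

X = torch.randn(s, m, dtype=torch.float64)
W = torch.randn(n, m, dtype=torch.float64)            # base weight, (out, in) as in nn.Linear
A = torch.randn(r, m, dtype=torch.float64) * 0.01     # lora_A
B = torch.randn(n, r, dtype=torch.float64) * 0.01     # lora_B

# Merge-then-matmul (PEFT-style): materializes the full (n, m) delta.
# In an MoE layer this happens for every expert, which is the memory-hungry part.
delta = B @ A                                          # (n, m)
y_merged = X @ (W + delta).T

# Split LoRA: identical math, reordered via associativity.
# Only small (s, r) and (s, n) intermediates are ever created.
y_split = X @ W.T + (X @ A.T) @ B.T

print(torch.allclose(y_merged, y_split))               # True
```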

These optimizations are enabled by default when training MoE models with Unsloth (notably Qwen-3 MoE, gpt-oss, and the models mentioned above). You can switch implementations via the UNSLOTH_MOE_BACKEND environment variable: torch._grouped_mm, our Triton kernels, or a basic PyTorch for-loop, depending on compatibility and preference. We default to grouped_mm for the best performance and broad support.

📚 Details of implementation

LoRA is a parameter-efficient fine-tuning method: instead of updating the full weight matrix, you train a low-rank “adapter” with far fewer parameters, which drastically reduces optimizer memory.

If the original weight has shape (m, n), LoRA adds two trainable matrices with shapes (m, r) and (r, n). Their product is (m, n), but you only track optimizer states and gradients for:

  • m*r + r*n parameters (LoRA) instead of

  • m*n parameters (full fine-tuning)

For typical MLP layers (m ≈ 4096, n ≈ 12k, r = 64), that’s roughly 1M LoRA parameters vs ~48M full parameters - about 2% - often with minimal to no accuracy loss.
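A quick back-of-the-envelope check of those numbers (illustrative shapes, with n taken as 12,000):

```python
# LoRA vs full fine-tuning parameter counts for a typical MLP projection
m, n, r = 4096, 12_000, 64        # weight shape (m, n), LoRA rank r

lora_params = m * r + r * n       # what you actually train / keep optimizer state for
full_params = m * n               # what a full fine-tune would train

print(f"LoRA:  {lora_params:,}")                  # ~1.0M
print(f"Full:  {full_params:,}")                  # ~48M
print(f"Ratio: {lora_params / full_params:.1%}")  # ~2%
```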

MoE LoRA changes things

MoE layers are different because you have E expert MLPs in parallel, so any per‑expert change (like adding LoRA) scales across all experts.

Take Qwen3‑30B‑A3B: hidden size m=2048, intermediate size n=768, E=128 experts with k=8 activated per token. Per expert:

  • gate_proj and up_proj: (m, n) = (2048, 768)

  • down_proj: (n, m) = (768, 2048)

With LoRA rank r=64, each projection adds r*(m+n) = 64*(2048+768) = 180,224 parameters per expert (≈ 11% of a 2048×768 matrix). The core issue is that r/n = 64/768 is large compared to typical MLP setups, e.g., r/n = 64/25600 in the similarly sized Qwen3-32B.

If you materialize this across all experts, memory adds up quickly. And since gate_proj and up_proj are often fused as gate_up_proj, you typically materialize both together, roughly doubling the overhead/peak memory.
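To make the scaling concrete, here is an illustrative calculation (assuming bf16, i.e. 2 bytes per element, and the Qwen3-30B-A3B shapes above) of the LoRA parameter count versus what a fused gate_up merge would materialize per MoE layer:

```python
E, m, n, r = 128, 2048, 768, 64    # experts, hidden, intermediate, LoRA rank
bytes_per_el = 2                   # bf16

lora_per_proj = r * (m + n)                  # 180,224 LoRA params per expert per projection
lora_total    = E * 3 * lora_per_proj        # gate, up, down across all experts (~69M)

# Merging LoRA materializes a full (m, n) delta per expert; with gate_proj and
# up_proj fused as gate_up_proj, two such deltas are materialized together.
gate_up_delta_bytes = E * 2 * m * n * bytes_per_el

print(f"LoRA params per expert/projection: {lora_per_proj:,}")
print(f"LoRA params across all experts:    {lora_total:,}")
print(f"Materialized gate_up delta:        {gate_up_delta_bytes / 2**30:.2f} GiB per MoE layer")
```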

In terms of memory, for a sequence length s, E experts, and top-k routing, the base activation memory is common to both approaches; the difference lies in the extra tensors each one materializes. This is where things diverge: PEFT’s merge materializes the full LoRA delta for every expert, whereas Unsloth’s Split LoRA only materializes the small intermediates for the routed token-expert pairs. Now let’s take the case of Qwen3-30B-A3B: E = 128, k = 8, m = 2048, n = 768. Plugging these in, Split LoRA wins whenever s < 32K.

$$
\begin{aligned}
\text{PEFT materializes} &: \; Emn \text{ extra elements} \\
\text{Unsloth Split LoRA materializes} &: \; ks(r+n) \text{ extra elements} \\
\text{In typical LoRA} &: \; r \ll n \\
\text{Split LoRA is better when} &: \; Emn > ksn \iff Em > ks \\
\text{For Qwen3-30B-A3B} &: \; E = 128, \quad k = 8, \quad m = 2048, \quad n = 768 \\
\text{so Split LoRA is better when} &: \; s < \frac{Emn}{kn} = \frac{Em}{k} = 32\text{K}
\end{aligned}
$$
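A quick numeric sanity check of that crossover (using the r ≪ n approximation from the derivation; purely illustrative):

```python
E, k, m, n, r = 128, 8, 2048, 768, 64

# Extra elements materialized per projection by each approach (r dropped since r << n)
peft_extra  = E * m * n                  # merged LoRA delta across all experts
split_extra = lambda s: k * s * n        # routed token-expert intermediates only

crossover = E * m // k                   # Split LoRA wins on memory when s < this
print(f"Crossover sequence length: {crossover:,}")    # 32,768 (~32K)

for s in (4_096, 8_192, 16_384, 32_768):
    print(f"s={s:>6}:  PEFT {peft_extra/1e6:6.1f}M  vs  Split LoRA {split_extra(s)/1e6:6.1f}M elements")
```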

In terms of compute, for a sequence length s, E experts and top k chosen, we're doing:

$$
\begin{aligned}
\Delta = AB, \quad A \in \mathbb{R}^{m \times r},\; B \in \mathbb{R}^{r \times n} &\;\Rightarrow\; 2mnr \text{ flops per expert LoRA} \\
W' = W + \Delta &\;\Rightarrow\; mn \text{ flops} \\
XW', \quad X \in \mathbb{R}^{s \times m},\; W' \in \mathbb{R}^{m \times n} &\;\Rightarrow\; 2smn \text{ flops} \\
\text{MoE PEFT LoRA flops} &= E\big(2mnr + mn\big) + 2ksmn
\end{aligned}
$$

For the Unsloth Split LoRA approach described above, we have

$$
\begin{aligned}
XW &\;\Rightarrow\; 2smn \text{ flops} \\
Y = XA &\;\Rightarrow\; 2smr \text{ flops (applied only to routed token--expert pairs)} \\
Z = YB &\;\Rightarrow\; 2srn \text{ flops} \\
\text{MoE Split LoRA flops} &= 2k\big(smn + smr + srn\big) \\
\text{Crossover condition} &: \; 2ksr(m+n) > 2Emn\left(r + \tfrac{1}{2}\right) \;\Rightarrow\; s > \frac{Emn}{k(m+n)}\left(1 + \frac{1}{2r}\right) \\
\text{For Qwen3-30B-A3B } \big(E = 128,\; m = 2048,\; n = 768,\; k = 8\big) &\;\Rightarrow\; s \approx 16\text{K tokens}
\end{aligned}
$$

Analytically, Split LoRA therefore does less compute as long as s < Emn / [k(m+n)], which is on the order of 16K tokens for a Qwen3-30B-A3B-style model.

Finally, some speedups come from reduced memory traffic: modern GPUs are often bandwidth‑bound, so transferring less data can matter more than FLOPs. A rough speedup estimate is Emn / [k·s·(m+n)], so it depends strongly on s, E, k, and the matrix shapes.

🔮 Model Support

Unsloth supports faster MoE training for Qwen, gpt-oss, DeepSeek and GLM models:

  • Qwen3 (Thinking and Instruct): VL • 2507 • Coder

  • gpt-oss: 20B • 120B • safeguard

  • GLM: 4.5 • 4.6 • 4.6-Air • 4.7 • 4.7-Flash

  • DeepSeek: V3 • R1 • V3.1 • V3.2

We may not have uploaded some MoE models yet, but Unsloth should still support them.

📈 More Benchmarks

gpt-oss BF16 Benchmarks

Training speed (including Transformers v4)

| Context Length | Unsloth (ms) | TF v5 (ms) | TF v4 (ms) | Speed Up (vs TF v5) |
| --- | --- | --- | --- | --- |
| 1024 | 275.35 | 376.99 | 2111.18 | 1.37x |
| 2048 | 292.88 | 696.57 | 2626.80 | 2.38x |
| 4096 | 370.30 | 1785.89 | 4027.93 | 4.82x |
| 8192 | 712.33 | 5226.86 | 8513.52 | 7.34x |
| 16384 | 1775.80 | OOM | OOM | N/A |

Memory (VRAM) usage

| Context Length | Unsloth Mem (GB) | TF v5 Mem (GB) | TF v4 Mem (GB) | VRAM Saving (vs TF v5) |
| --- | --- | --- | --- | --- |
| 1024 | 40.91 | 43.88 | 89.75 | 6.76% |
| 2048 | 41.83 | 44.93 | 90.47 | 6.89% |
| 4096 | 43.68 | 49.86 | 92.72 | 12.39% |
| 8192 | 47.43 | 73.80 | 100.3 | 35.73% |
| 16384 | 55.13 | OOM | OOM | N/A |

🎉 Important Unsloth Updates

  1. As part of our MoE release, we also made Gemma-3 use Flex Attention by default, and this works in float16 settings as well (we fixed the float16 infinities a while back). Gemma-3 now uses O(N) memory instead of O(N^2) and trains >3x faster (scaling even better with context length). Previous Unsloth versions would OOM.

| Context | Old Peak VRAM | New Peak VRAM | VRAM Saving |
| --- | --- | --- | --- |
| 1K | 20.1 GB | 20.1 GB | 0 GB (0%) |
| 2K | 21.5 GB | 21.1 GB | 0.3 GB (2%) |
| 4K | 27.7 GB | 23.3 GB | 4.5 GB (16%) |
| 8K | 52.3 GB | 27.5 GB | 24.8 GB (47%) |
| 16K | OOM | 36.0 GB | -- |
| 24K | OOM | 44.6 GB | -- |
| 32K | OOM | 53.1 GB | -- |
| 48K | OOM | 38.4 GB | -- |
| 64K | OOM | 44.7 GB | -- |

  2. Vision fine-tuning now accepts mixed datasets containing both image and text-only samples!

  3. trl==0.27.1 and transformers==5.1.0 are now well supported - coverage was previously 30% of our 120 notebooks and is now >80%; we plan to reach 100% over the next few days.


Acknowledgements

We thank the Hugging Face team for collaborating with us on improving MoE training for the community.

We also sincerely thank the torchao team, especially Vasily Kuznetsov (vkuzo), for helping us enable grouped_mm support for float16 so it works on the T4, and for backward compatibility with the A100.
