# GSPO Reinforcement Learning

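We're introducing GSPO, a variant of [GRPO](/docs/zh/kai-shi-shi-yong/reinforcement-learning-rl-guide.md#from-rlhf-ppo-to-grpo-and-rlvr) proposed by the Qwen team at Alibaba. They observed that GRPO applies an importance weight to every token, even though the advantage is assigned per sequence and does not scale or change from token to token. That observation led to GSPO, which places the importance weight on the sequence likelihood rather than on the individual token likelihoods.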

* Try our free GSPO notebooks: [**gpt-oss-20b**](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/gpt-oss-\(20B\)-GRPO.ipynb) and [**Qwen2.5-VL**](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen2_5_7B_VL_GRPO.ipynb)

To enable GSPO in Unsloth, set `importance_sampling_level = "sequence"` in the GRPO config. The difference between the two algorithms is shown below; both figures come from the GSPO paper by Qwen and Alibaba:

<figure><img src="/files/144d00a8e4cb7a64c6513cadffe7ff26b79f226f" alt="" width="563"><figcaption><p>GRPO algorithm, source: <a href="https://arxiv.org/abs/2507.18071">Qwen</a></p></figcaption></figure>

<figure><img src="/files/17975acecdcbf15315e2f339f6a7ed32ad546592" alt="" width="563"><figcaption><p>GSPO algorithm, source: <a href="https://arxiv.org/abs/2507.18071">Qwen</a></p></figcaption></figure>

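In Equation 1, the advantages scale each row of token log-probability ratios before that tensor is summed. In effect, every token in a row receives the same scaling, even though that scaling was assigned to the whole sequence rather than to each individual token. A simple diagram of this is shown below: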

<figure><img src="/files/679a927132dfac050b75671d1fb7653bd0564ecc" alt="" width="286"><figcaption><p>GRPO log-probability ratios scaled row-wise by the advantages</p></figcaption></figure>
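The row-wise scaling in Equation 1 can be sketched in plain Python. This is a minimal illustration, not the Unsloth or TRL implementation: the function name `token_level_terms` and the log-probability values are hypothetical, and clipping is left out for brevity.

```python
import math

def token_level_terms(new_logps, old_logps, advantage):
    """GRPO-style terms: one importance ratio per token,
    each multiplied by the same sequence-level advantage."""
    ratios = [math.exp(n - o) for n, o in zip(new_logps, old_logps)]
    return [r * advantage for r in ratios]

# Hypothetical log-probabilities for one 3-token completion
new_logps = [-1.0, -0.5, -2.0]   # under the current policy
old_logps = [-1.2, -0.5, -1.5]   # under the policy that sampled the completion
advantage = 2.0                  # one value for the whole sequence

terms = token_level_terms(new_logps, old_logps, advantage)
print(terms)  # three values: a different ratio per token, the same advantage on each
```

Each token carries its own ratio, so a single token with an extreme ratio can dominate the sum even though the advantage says nothing about individual tokens.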

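Equation 2 shows that, once the log-probability ratios are computed, they are first summed over each sequence (normalized by the sequence length in the paper's formulation) and exponentiated. Only the resulting sequence ratios are then multiplied row-wise by the advantages.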

<figure><img src="/files/10fbe0234dd03c48b2df0033eeb235248beca2f6" alt="" width="313"><figcaption><p>GSPO sequence ratios scaled row-wise by the advantages</p></figcaption></figure>
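The sequence-level computation in Equation 2 can be sketched the same way. This is an illustrative sketch under stated assumptions, not the Unsloth or TRL implementation: the function names and values are hypothetical, clipping is omitted, and the division by sequence length follows the GSPO paper's definition of the sequence ratio.

```python
import math

def sequence_ratio(new_logps, old_logps):
    """GSPO-style ratio: exponentiate the length-normalized sum
    of per-token log-probability ratios (one value per sequence)."""
    log_ratios = [n - o for n, o in zip(new_logps, old_logps)]
    mean_log_ratio = sum(log_ratios) / len(log_ratios)
    return math.exp(mean_log_ratio)

def sequence_level_term(new_logps, old_logps, advantage):
    """Scale the single sequence ratio by the sequence advantage."""
    return sequence_ratio(new_logps, old_logps) * advantage

# Hypothetical log-probabilities for one 3-token completion
new_logps = [-1.0, -0.5, -2.0]   # under the current policy
old_logps = [-1.2, -0.5, -1.5]   # under the policy that sampled the completion
advantage = 2.0

ratio = sequence_ratio(new_logps, old_logps)
term = sequence_level_term(new_logps, old_logps, advantage)
print(ratio, term)
```

Here the whole completion contributes one ratio, so the importance weight and the advantage are both defined at the sequence level.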

Enabling GSPO is simple: set the `importance_sampling_level = "sequence"` flag in the GRPO config, as in the following example.

```python
from trl import GRPOConfig

training_args = GRPOConfig(
    output_dir = "vlm-grpo-unsloth",
    per_device_train_batch_size = 8,
    gradient_accumulation_steps = 4,
    learning_rate = 5e-6,
    adam_beta1 = 0.9,
    adam_beta2 = 0.99,
    weight_decay = 0.1,
    warmup_ratio = 0.1,
    lr_scheduler_type = "cosine",
    optim = "adamw_8bit",
    # beta = 0.00,
    epsilon = 3e-4,
    epsilon_high = 4e-4,
    num_generations = 8,    
    max_prompt_length = 1024,
    max_completion_length = 1024,
    log_completions = False,
    max_grad_norm = 0.1,
    temperature = 0.9,
    # report_to = "none", # Set to "wandb" to log to Weights & Biases
    num_train_epochs = 2, # For a quick test run; increase for full training
    
    # GSPO is enabled below:
    importance_sampling_level = "sequence",
    
    # Dr GRPO / GAPO etc.
    loss_type = "dr_grpo",
)
```
