GSPO Reinforcement Learning
Train with GSPO (Group Sequence Policy Optimization) in Unsloth.
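GSPO differs from GRPO in where the importance ratio is computed: GRPO uses a per-token ratio, while GSPO uses one length-normalized ratio for the whole sequence, which is what importance_sampling_level = "sequence" switches on in the config below. As a sketch of the idea, following the GSPO paper (the notation here is ours): for a query x and a sampled response y_i of length |y_i|,

$$
s_i(\theta) = \left(\frac{\pi_\theta(y_i \mid x)}{\pi_{\theta_{\mathrm{old}}}(y_i \mid x)}\right)^{1/|y_i|}
= \exp\!\left(\frac{1}{|y_i|}\sum_{t=1}^{|y_i|} \log \frac{\pi_\theta(y_{i,t} \mid x, y_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(y_{i,t} \mid x, y_{i,<t})}\right).
$$

Clipping is then applied to this sequence-level ratio, which is why the epsilon / epsilon_high values below are much smaller than the token-level clip ranges typically used with GRPO.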
from trl import GRPOConfig

training_args = GRPOConfig(
output_dir = "vlm-grpo-unsloth",
per_device_train_batch_size = 8,
gradient_accumulation_steps = 4,
learning_rate = 5e-6,
adam_beta1 = 0.9,
adam_beta2 = 0.99,
weight_decay = 0.1,
warmup_ratio = 0.1,
lr_scheduler_type = "cosine",
optim = "adamw_8bit",
# beta = 0.00, # KL penalty coefficient (leave commented out to use the default)
epsilon = 3e-4, # lower clip bound; GSPO clips the sequence-level ratio, so bounds are far tighter than GRPO's token-level defaults
epsilon_high = 4e-4, # upper clip bound
num_generations = 8,
max_prompt_length = 1024,
max_completion_length = 1024,
log_completions = False,
max_grad_norm = 0.1,
temperature = 0.9,
# report_to = "none", # Set to "wandb" if you want to log to Weights & Biases
num_train_epochs = 2, # For a quick test run; increase for a full training run
report_to = "none",
# GSPO is below:
importance_sampling_level = "sequence", # importance ratios per sequence, not per token
# Dr GRPO / DAPO etc.
loss_type = "dr_grpo",
)
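With the config in place, training runs through TRL's GRPOTrainer as usual. A minimal sketch, where model, tokenizer, train_dataset, and my_reward_func are hypothetical names for objects from your own setup (e.g. a model loaded with Unsloth's FastLanguageModel or FastVisionModel and a dataset of prompts):

from trl import GRPOTrainer

trainer = GRPOTrainer(
    model = model,                    # assumed: model loaded earlier (e.g. via Unsloth)
    processing_class = tokenizer,     # assumed: the matching tokenizer/processor
    reward_funcs = [my_reward_func],  # assumed: your reward function(s)
    args = training_args,             # the GRPOConfig above; "sequence" level makes this GSPO
    train_dataset = train_dataset,    # assumed: your prompt dataset
)
trainer.train()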