# FP16 vs BF16 for RL

### Float16 vs Bfloat16

A paper titled "**Defeating the Training-Inference Mismatch via FP16**" <https://arxiv.org/pdf/2510.26788> shows that using float16 precision for reinforcement learning works significantly better than using bfloat16.

<figure><img src="/files/38a1957c134849963488e5c8ef42250b47907041" alt=""><figcaption></figcaption></figure>

In fact, the longer the generation length, the worse things get with bfloat16:

<figure><img src="/files/cd03955424e7cf397efaf1ab9f1d8f88c79f416b" alt=""><figcaption></figcaption></figure>

We investigated this ourselves and **did find float16 to be more stable**, with much smaller gradient norms than bfloat16. See <https://x.com/danielhanchen/status/1985557028295827482> and <https://x.com/danielhanchen/status/1985562902531850472>

{% columns %}
{% column width="50%" %}

<figure><img src="/files/1530b7836b71b51371b794a17b033a0ed6a29346" alt=""><figcaption></figcaption></figure>

<figure><img src="/files/d9ad2177dd3a592187c457643df0da1a2e5f1b51" alt=""><figcaption></figcaption></figure>
{% endcolumn %}

{% column width="50%" %}

<figure><img src="/files/8b31bd945b1366dfd1d04cc01fed5cd0e45f9d1f" alt=""><figcaption></figcaption></figure>
{% endcolumn %}
{% endcolumns %}
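
To build intuition for why this happens: float16 carries 10 mantissa bits versus bfloat16's 7, so for values inside float16's dynamic range its rounding error is roughly 8x smaller. Below is a minimal sketch (not Unsloth code) that measures the round-trip quantization error of each dtype:

```python
import torch

# Sample values in a typical activation/logit range.
x = torch.randn(1_000_000, dtype=torch.float32)

for dtype in (torch.float16, torch.bfloat16):
    # Round-trip through the lower precision and measure the error.
    err = (x - x.to(dtype).to(torch.float32)).abs().mean()
    print(f"{dtype}: mean abs round-trip error = {err.item():.2e}")
```

The smaller round-trip error means the training engine and the inference engine disagree less about the same weights and activations, which is exactly the training-inference mismatch the paper describes.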

### :exploding\_head:A100 Cascade Attention Bug

According to <https://x.com/RichardYRLi/status/1984858850143715759> and <https://yingru.notion.site/When-Speed-Kills-Stability-Demystifying-RL-Collapse-from-the-Training-Inference-Mismatch-271211a558b7808d8b12d403fd15edda>, older vLLM versions (before 0.11.0) had broken attention on A100 and similar GPUs. Please update vLLM! During Unsloth reinforcement learning we also disable vLLM's cascade attention by default if we detect an older vLLM version.

<figure><img src="/files/692683dea746ea4172a0c0da8f1cba0cc1adbce6" alt=""><figcaption></figcaption></figure>

Hardware also changes the results: newer, more expensive GPUs show a smaller KL difference between the inference side and the training side:

<figure><img src="/files/347990be67a6a524673981e881cfcdca166bbd92" alt=""><figcaption></figcaption></figure>

### :fire:Using float16 in Unsloth RL

To use float16 precision in Unsloth GRPO and RL, all you need to do is set `dtype = torch.float16` and we handle the rest!

{% code overflow="wrap" %}

```python
# Install dependencies first: pip install unsloth vllm
import torch

max_seq_length = 2048 # Can increase for longer reasoning traces
lora_rank = 32        # Larger rank = smarter, but slower

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Qwen3-4B-Base",
    max_seq_length = max_seq_length,
    load_in_4bit = False,  # False for LoRA 16bit
    fast_inference = True, # Enable vLLM fast inference
    max_lora_rank = lora_rank,
    gpu_memory_utilization = 0.9, # Reduce if out of memory
    dtype = torch.float16, # Use torch.float16 or torch.bfloat16
)
```

{% endcode %}
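
From here you would attach LoRA adapters and hand the model to a GRPO trainer as usual; the dtype choice needs no further changes downstream. Here is a minimal sketch of the next steps, assuming TRL's `GRPOTrainer` and a reward function `my_reward_fn` plus a prompt dataset `dataset` that you define yourself:

```python
from trl import GRPOConfig, GRPOTrainer

# Attach LoRA adapters to the loaded model.
model = FastLanguageModel.get_peft_model(
    model,
    r = lora_rank,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = lora_rank,
)

training_args = GRPOConfig(
    learning_rate = 5e-6,
    max_completion_length = max_seq_length // 2,
    num_generations = 4, # completions sampled per prompt
    max_steps = 100,
)

trainer = GRPOTrainer(
    model = model,
    args = training_args,
    processing_class = tokenizer,
    reward_funcs = [my_reward_fn], # hypothetical: your reward function
    train_dataset = dataset,       # hypothetical: your prompt dataset
)
trainer.train()
```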

