# Preference Optimization Training - DPO, ORPO & KTO

DPO (Direct Preference Optimization), ORPO (Odds Ratio Preference Optimization), PPO, KTO, and reward modeling all work with Unsloth.

We have Google Colab notebooks for reproducing GRPO, ORPO, DPO Zephyr, KTO and SimPO:

* [GRPO notebooks](/docs/zh/kai-shi-shi-yong/unsloth-notebooks.md#grpo-reasoning-rl-notebooks)
* [ORPO notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_\(8B\)-ORPO.ipynb)
* [DPO Zephyr notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Zephyr_\(7B\)-DPO.ipynb)
* [KTO notebook](https://colab.research.google.com/drive/1MRgGtLWuZX4ypSfGguFgC-IblTvO2ivM?usp=sharing)
* [SimPO notebook](https://colab.research.google.com/drive/1Hs5oQDovOay4mFA6Y9lQhVJ8TnbFLFh2?usp=sharing)

We're also in 🤗Hugging Face's official docs! See the [SFT docs](https://huggingface.co/docs/trl/main/en/sft_trainer#accelerate-fine-tuning-2x-using-unsloth) and the [DPO docs](https://huggingface.co/docs/trl/main/en/dpo_trainer#accelerate-dpo-fine-tuning-using-unsloth).

## DPO Code
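The training script in this section takes a preference dataset as `train_dataset`. As a minimal sketch, assuming TRL's explicit-prompt preference format (string columns `prompt`, `chosen`, `rejected`), the rows could look like the following. The example rows and the `validate_preference_rows` helper are hypothetical and only illustrate the expected shape; they are not part of Unsloth or TRL.

```python
# Column names assumed from TRL's explicit-prompt preference format.
REQUIRED_COLUMNS = ("prompt", "chosen", "rejected")

# Hypothetical example rows, for illustration only.
example_rows = [
    {
        "prompt": "What is the capital of France?",
        "chosen": "The capital of France is Paris.",
        "rejected": "France does not have a capital.",
    },
    {
        "prompt": "Explain LoRA in one sentence.",
        "chosen": "LoRA trains small low-rank update matrices instead of the full weights.",
        "rejected": "LoRA is a kind of GPU.",
    },
]

def validate_preference_rows(rows):
    """Return a list of problems found; an empty list means the rows look usable."""
    problems = []
    for i, row in enumerate(rows):
        # Every required column must be a non-empty string.
        for col in REQUIRED_COLUMNS:
            value = row.get(col)
            if not isinstance(value, str) or not value.strip():
                problems.append(f"row {i}: column '{col}' is missing or empty")
        # A pair carries no preference signal if both completions are the same.
        chosen = row.get("chosen")
        if chosen is not None and chosen == row.get("rejected"):
            problems.append(f"row {i}: 'chosen' and 'rejected' are identical")
    return problems

print(validate_preference_rows(example_rows))
```

If the `datasets` library is installed, `datasets.Dataset.from_list(example_rows)` should turn rows like these into an object that can be passed as `train_dataset`.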

```python
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0" # Optional: set the GPU device ID

from unsloth import FastLanguageModel, PatchDPOTrainer
from unsloth import is_bfloat16_supported
PatchDPOTrainer()
import torch
from trl import DPOTrainer, DPOConfig  # DPOConfig replaces TrainingArguments

max_seq_length = 2048 # Assumed example value; choose what fits your data and VRAM

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/zephyr-sft-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = None,
    load_in_4bit = True,
)

# Patch the model and add fast LoRA weights
model = FastLanguageModel.get_peft_model(
    model,
    r = 64,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 64,
    lora_dropout = 0, # Supports any value, but = 0 is optimized
    bias = "none",    # Supports any value, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM and fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    max_seq_length = max_seq_length,
)

dpo_trainer = DPOTrainer(
    model = model,
    ref_model = None,
    args = DPOConfig( # Use DPOConfig
        per_device_train_batch_size = 4,
        gradient_accumulation_steps = 8,
        warmup_ratio = 0.1,
        num_train_epochs = 3,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        seed = 42,
        output_dir = "outputs",
    ),
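    # Newer TRL releases expect beta, max_length and max_prompt_length inside
    # DPOConfig, and processing_class in place of tokenizer; check your TRL version.
    beta = 0.1,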
    train_dataset = YOUR_DATASET_HERE,
    # eval_dataset = YOUR_DATASET_HERE,
    tokenizer = tokenizer,
    max_length = 1024,
    max_prompt_length = 512,
)

dpo_trainer.train()
```
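To give a feel for what `beta` controls, here is a minimal sketch of the standard sigmoid DPO loss for a single preference pair, written in plain Python. It assumes you already have the summed log-probabilities of the chosen and rejected completions under the policy and the reference model; the `dpo_loss` function and the numbers passed to it are hypothetical, and this is not a copy of the trainer's batched implementation.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Sigmoid DPO loss for one preference pair, from sequence log-probabilities."""
    # How much more (or less) likely each completion is under the policy
    # than under the reference model.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # beta scales the margin: larger beta penalizes drifting from the reference more.
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) equals log(1 + exp(-margin)); branch for numerical stability.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Policy identical to the reference: margin is 0, so the loss is log(2).
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))

# Policy has shifted toward the chosen completion: the loss drops below log(2).
print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

The loss starts at `log(2)` when the policy matches the reference and falls as the policy assigns relatively more probability to the chosen completion than the reference does.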

