# Preference Optimization Training - DPO, ORPO & KTO

DPO (Direct Preference Optimization), ORPO (Odds Ratio Preference Optimization), PPO (Proximal Policy Optimization), KTO (Kahneman-Tversky Optimization), and reward modelling all work with Unsloth.

We provide Google Colab notebooks to reproduce GRPO, ORPO, DPO Zephyr, KTO, and SimPO:

* [GRPO notebooks](/docs/de/loslegen/unsloth-notebooks.md#grpo-reasoning-rl-notebooks)
* [ORPO notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_\(8B\)-ORPO.ipynb)
* [DPO Zephyr notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Zephyr_\(7B\)-DPO.ipynb)
* [KTO notebook](https://colab.research.google.com/drive/1MRgGtLWuZX4ypSfGguFgC-IblTvO2ivM?usp=sharing)
* [SimPO notebook](https://colab.research.google.com/drive/1Hs5oQDovOay4mFA6Y9lQhVJ8TnbFLFh2?usp=sharing)

Unsloth is also featured in the official 🤗 Hugging Face docs: see the [SFT documentation](https://huggingface.co/docs/trl/main/en/sft_trainer#accelerate-fine-tuning-2x-using-unsloth) and the [DPO documentation](https://huggingface.co/docs/trl/main/en/dpo_trainer#accelerate-dpo-fine-tuning-using-unsloth).

## DPO Code
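For orientation, the quantity that the `beta = 0.1` setting in the training script scales can be written out in a few lines. The following is a minimal sketch of the per-pair DPO loss in plain Python, not Unsloth or TRL code; the function names `log_sigmoid` and `dpo_loss` and all log-probability values are hypothetical and chosen only for illustration.

```python
import math

def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for a single (chosen, rejected) pair of sequence log-probabilities."""
    # How much more (or less) likely each answer became relative to the reference model.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # beta scales the margin between the two log-ratios before the sigmoid.
    return -log_sigmoid(beta * (chosen_logratio - rejected_logratio))

# Made-up log-probabilities (assumption, for illustration only).
loss_neutral = dpo_loss(-10.0, -12.0, -10.0, -12.0)   # policy identical to reference
loss_improved = dpo_loss(-9.0, -13.0, -10.0, -12.0)   # policy prefers the chosen answer more
print(loss_neutral, loss_improved)
```

When the policy matches the reference model the margin is zero and the loss equals log 2; the loss falls as the policy raises the chosen answer's likelihood relative to the rejected one. A larger `beta` makes the loss react more strongly to the same margin.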

```python
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0" # Optional: set the GPU device ID

from unsloth import FastLanguageModel, PatchDPOTrainer
from unsloth import is_bfloat16_supported
PatchDPOTrainer()
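import torch
from trl import DPOTrainer, DPOConfig  # DPOConfig replaces TrainingArguments

# Example value (an assumption, not specified in the original); adjust to your data and VRAM
max_seq_length = 4096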

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/zephyr-sft-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = None,
    load_in_4bit = True,
)

# Patch the model and add fast LoRA weights
model = FastLanguageModel.get_peft_model(
    model,
    r = 64,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 64,
    lora_dropout = 0, # Supports any value, but 0 is optimized
    bias = "none",    # Supports any value, but "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM and fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    max_seq_length = max_seq_length,
)

dpo_trainer = DPOTrainer(
    model = model,
    ref_model = None, # With a LoRA model, the base weights (adapters disabled) serve as the reference
    args = DPOConfig(
        per_device_train_batch_size = 4,
        gradient_accumulation_steps = 8,
        warmup_ratio = 0.1,
        num_train_epochs = 3,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        seed = 42,
        output_dir = "outputs",
        # DPO-specific settings are DPOConfig fields; recent TRL releases
        # no longer accept them as DPOTrainer arguments
        beta = 0.1,
        max_length = 1024,
        max_prompt_length = 512,
    ),
    train_dataset = YOUR_DATASET_HERE,
    # eval_dataset = YOUR_DATASET_HERE,
    tokenizer = tokenizer, # Newer TRL releases name this argument processing_class
)

dpo_trainer.train()
```
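The `YOUR_DATASET_HERE` placeholder above has to be filled with preference pairs. The following is a minimal sketch, assuming the common TRL preference layout with string columns `prompt`, `chosen`, and `rejected`; the example rows and the helper `validate_preference_rows` are hypothetical and serve only to show the expected shape.

```python
# Hypothetical example rows in the prompt/chosen/rejected preference layout.
preference_rows = [
    {
        "prompt": "What is the capital of France?",
        "chosen": "The capital of France is Paris.",
        "rejected": "France does not have a capital.",
    },
    {
        "prompt": "Name a prime number greater than 10.",
        "chosen": "11 is a prime number greater than 10.",
        "rejected": "12 is a prime number greater than 10.",
    },
]

REQUIRED_KEYS = ("prompt", "chosen", "rejected")

def validate_preference_rows(rows):
    """Return the rows unchanged if every row has non-empty string fields
    and a chosen answer that differs from the rejected one."""
    for i, row in enumerate(rows):
        for key in REQUIRED_KEYS:
            value = row.get(key)
            if not isinstance(value, str) or not value.strip():
                raise ValueError(f"row {i}: field {key!r} must be a non-empty string")
        if row["chosen"] == row["rejected"]:
            raise ValueError(f"row {i}: chosen and rejected must differ")
    return rows

rows = validate_preference_rows(preference_rows)
# If the `datasets` library is installed, the rows could then be wrapped with
# datasets.Dataset.from_list(rows) and passed as train_dataset.
```

Checking the rows before training catches empty fields or identical answer pairs early, which is cheaper than finding them after a long run has started.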

