# Fine-tuning LLMs on Intel GPUs with Unsloth

You can now use Unsloth to fine-tune large language models (LLMs) locally on Intel hardware! Read our guide to learn how to start training your own custom models.

Before you begin, make sure you have:

* **Intel GPU:** Data Center GPU Max Series, Arc Series, or an Intel Ultra AIPC
* **Operating system:** Linux (Ubuntu 22.04+ suggested) or Windows 11 (recommended)
* **Windows only:** install the Intel oneAPI Base Toolkit (select version 2025.2.1)
* **Intel graphics driver:** the latest recommended driver for Windows/Linux
* **Python:** 3.10+

### Build Unsloth with Intel support

{% stepper %}
{% step %}

#### Create a new conda environment (optional)

```bash
conda create -n unsloth-xpu python==3.10
conda activate unsloth-xpu
```

{% endstep %}

{% step %}

#### Install Unsloth

```bash
git clone https://github.com/unslothai/unsloth.git
cd unsloth
pip install .[intel-gpu-torch290]
```
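
To confirm the package installed, you can query its version from the package metadata (a quick sanity check; it reads metadata only and does not touch the GPU):

```bash
python -c "import importlib.metadata as m; print(m.version('unsloth'))"
```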

{% hint style="info" %}
Linux only: install [vLLM](https://unsloth.ai/docs/zh/ji-chu/inference-and-deployment/vllm-guide) (optional)\
You can also install it for [inference](https://unsloth.ai/docs/zh/ji-chu/inference-and-deployment) and [reinforcement learning](https://unsloth.ai/docs/zh/kai-shi-shi-yong/reinforcement-learning-rl-guide). Follow [vLLM's guide](https://docs.vllm.ai/en/latest/getting_started/installation/gpu/#intel-xpu).
{% endhint %}
{% endstep %}

{% step %}

#### Verify your environment

```python
import torch
print(f"PyTorch version: {torch.__version__}")
print(f"XPU available: {torch.xpu.is_available()}")
print(f"XPU device count: {torch.xpu.device_count()}")
print(f"XPU device name: {torch.xpu.get_device_name(0)}")
```
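
If the checks above pass, you can optionally run a tiny operation on the XPU to confirm the device is actually usable for compute (a minimal sketch):

```python
import torch

x = torch.randn(3, 3, device = "xpu")  # allocate a tensor directly on the Intel GPU
print((x @ x.T).cpu())                 # run a matmul on the XPU and copy the result back
```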

{% endstep %}

{% step %}

#### Start fine-tuning

You can use our Unsloth [notebooks](https://unsloth.ai/docs/zh/kai-shi-shi-yong/unsloth-notebooks) directly, or check out our dedicated [fine-tuning](https://unsloth.ai/docs/zh/kai-shi-shi-yong/fine-tuning-llms-guide) and [reinforcement learning](https://unsloth.ai/docs/zh/kai-shi-shi-yong/reinforcement-learning-rl-guide) guides.
{% endstep %}
{% endstepper %}

### Windows only - runtime configuration

From an administrator Command Prompt, enable long path support in the Windows registry:

```bash
powershell -Command "Set-ItemProperty -Path 'HKLM:\SYSTEM\CurrentControlSet\Control\FileSystem' -Name 'LongPathsEnabled' -Value 1"
```
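
To check that the value was written, you can query it afterwards (no administrator rights are needed for reading):

```bash
reg query "HKLM\SYSTEM\CurrentControlSet\Control\FileSystem" /v LongPathsEnabled
```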

This command only needs to be run once per machine; you do not need to set it again before each run. Then:

1. Download level-zero-win-sdk-1.20.2.zip from [GitHub](https://github.com/oneapi-src/level-zero/releases/tag/v1.20.2)
2. Unzip level-zero-win-sdk-1.20.2.zip
3. In a Command Prompt, with the unsloth-xpu conda environment activated, run:

```bash
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" -
set ZE_PATH=path\to\the\unzipped\level-zero-win-sdk-1.20.2
```

### Example 1: QLoRA fine-tuning with SFT

This example demonstrates how to fine-tune the Qwen3-32B model with 4-bit QLoRA on an Intel GPU. QLoRA dramatically reduces memory requirements, making it possible to fine-tune large models on consumer hardware.

{% code expandable="true" %}

```python
from unsloth import FastLanguageModel
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset

max_seq_length = 2048 # Supports RoPE scaling internally, so choose freely!
# Get the LAION dataset
url = "https://huggingface.co/datasets/laion/OIG/resolve/main/unified_chip2.jsonl"
dataset = load_dataset("json", data_files = {"train": url}, split = "train")

# 4-bit pre-quantized models we support, for fast downloads without OOM (out of memory).
fourbit_models = [
    "unsloth/Qwen3-32B-bnb-4bit",
    "unsloth/Qwen3-14B-bnb-4bit",
    "unsloth/Qwen3-8B-bnb-4bit",
    "unsloth/Qwen3-4B-bnb-4bit",
    "unsloth/Qwen3-1.7B-bnb-4bit",
    "unsloth/Qwen3-0.6B-bnb-4bit",
    # "unsloth/Qwen2.5-32B-bnb-4bit",
    # "unsloth/Qwen2.5-14B-bnb-4bit",
    # "unsloth/Qwen2.5-7B-bnb-4bit",
    # "unsloth/Qwen2.5-3B-bnb-4bit",
    # "unsloth/Qwen2.5-1.5B-bnb-4bit",
    # "unsloth/Qwen2.5-0.5B-bnb-4bit",
    # "unsloth/Llama-3.2-3B-bnb-4bit",
    # "unsloth/Llama-3.2-1B-bnb-4bit",
    # "unsloth/Llama-3.1-8B-bnb-4bit",
    # "unsloth/Llama-3.1-70B-bnb-4bit",
    # "unsloth/mistral-7b-bnb-4bit",
    # "unsloth/Phi-4",
    # "unsloth/Phi-3.5-mini-instruct",
    # "unsloth/Phi-3-medium-4k-instruct",
    # "unsloth/Phi-3-mini-4k-instruct",
    # "unsloth/gemma-2-9b-bnb-4bit",
    # "unsloth/gemma-2-27b-bnb-4bit",
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Qwen3-32B-bnb-4bit",
    max_seq_length = max_seq_length,
    load_in_4bit = True,
    # token = "hf_...", # use a token when loading gated models like meta-llama/Llama-2-7b-hf
)

model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0! 8, 16, 32, 64, 128 are suggested
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",
    ],
    lora_alpha = 16,
    lora_dropout = 0, # Any value is supported, but = 0 is optimized
    bias = "none",    # Any value is supported, but = "none" is optimized
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 1, # recommended on Windows
    packing = False, # Can make training 5x faster for short sequences.
    args = SFTConfig(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 60,
        learning_rate = 2e-4,
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
    ),
)

trainer.train()
```

{% endcode %}
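
After training completes, you will usually want to persist the LoRA adapter. A minimal sketch (the directory name `lora_model` is just an example):

```python
# Save only the LoRA adapter weights and the tokenizer (small files, not the full base model)
model.save_pretrained("lora_model")
tokenizer.save_pretrained("lora_model")
```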

### Example 2: Reinforcement learning with GRPO

GRPO is a [reinforcement learning](https://unsloth.ai/docs/zh/kai-shi-shi-yong/reinforcement-learning-rl-guide) technique for aligning language models with human preferences. This example shows how to use multiple reward functions to train a model to follow a specific XML output format.

#### What is GRPO?

GRPO improves on traditional RLHF by:

* Using group-based normalization for more stable training (see the sketch below)
* Supporting multiple reward functions for multi-objective optimization
* Being more memory-efficient than PPO
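
To illustrate the group-based normalization idea before the full example, here is a simplified sketch (not the trainer's exact internals): several completions of the same prompt form a group, and each completion's reward is normalized against that group's statistics.

```python
import torch

# Total rewards for 4 completions of the same prompt (one "group"),
# e.g. summed over several reward functions
rewards = torch.tensor([2.0, 0.5, 0.0, 1.5])

# Advantage of each completion relative to its own group
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-4)
print(advantages)
```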

{% code expandable="true" %}

```python
from unsloth import FastLanguageModel
import re
from trl import GRPOConfig, GRPOTrainer
from datasets import load_dataset, Dataset

max_seq_length = 1024  # Can increase for longer reasoning traces
lora_rank = 32  # Larger rank = smarter, but slower
max_prompt_length = 256

# Load and prepare the dataset
SYSTEM_PROMPT = """
Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>
"""

XML_COT_FORMAT = """\
<reasoning>
{reasoning}
</reasoning>
<answer>
{answer}
</answer>
"""


def extract_xml_answer(text: str) -> str:
    answer = text.split("<answer>")[-1]
    answer = answer.split("</answer>")[0]
    return answer.strip()


def extract_hash_answer(text: str) -> str | None:
    if "####" not in text:
        return None
    return text.split("####")[1].strip()


# Uncomment the middle messages for 1-shot prompting
def get_gsm8k_questions(split: str = "train") -> Dataset:
    data = load_dataset("openai/gsm8k", "main")[split]  # type: ignore
    data = data.map(
        lambda x: {  # type: ignore
            "prompt": [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": x["question"]},
            ],
            "answer": extract_hash_answer(x["answer"]),
        }
    )  # type: ignore
    return data  # type: ignore


# Reward functions
def correctness_reward_func(prompts, completions, answer, **kwargs) -> list[float]:
    responses = [completion[0]["content"] for completion in completions]
    q = prompts[0][-1]["content"]
    extracted_responses = [extract_xml_answer(r) for r in responses]
    print(
        "-" * 20,
        f"Question:\n{q}",
        f"\nAnswer:\n{answer[0]}",
        f"\nResponse:\n{responses[0]}",
        f"\nExtracted:\n{extracted_responses[0]}",
    )
    return [2.0 if r == a else 0.0 for r, a in zip(extracted_responses, answer)]


def int_reward_func(completions, **kwargs) -> list[float]:
    responses = [completion[0]["content"] for completion in completions]
    extracted_responses = [extract_xml_answer(r) for r in responses]
    return [0.5 if r.isdigit() else 0.0 for r in extracted_responses]


def strict_format_reward_func(completions, **kwargs) -> list[float]:
    """Reward function that checks whether the completion exactly matches the expected format."""
    pattern = r"^<reasoning>\n.*?\n</reasoning>\n<answer>\n.*?\n</answer>\n$"
    responses = [completion[0]["content"] for completion in completions]
    matches = [re.match(pattern, r) for r in responses]
    return [0.5 if match else 0.0 for match in matches]


def soft_format_reward_func(completions, **kwargs) -> list[float]:
    """Reward function that checks whether the completion loosely matches the expected format."""
    pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
    responses = [completion[0]["content"] for completion in completions]
    matches = [re.match(pattern, r) for r in responses]
    return [0.5 if match else 0.0 for match in matches]


def count_xml(text: str) -> float:
    count = 0.0
    if text.count("<reasoning>\n") == 1:
        count += 0.125
    if text.count("\n</reasoning>\n") == 1:
        count += 0.125
    if text.count("\n<answer>\n") == 1:
        count += 0.125
    count -= len(text.split("\n</answer>\n")[-1]) * 0.001
    if text.count("\n</answer>") == 1:
        count += 0.125
    count -= (len(text.split("\n</answer>")[-1]) - 1) * 0.001
    return count


def xmlcount_reward_func(completions, **kwargs) -> list[float]:
    contents = [completion[0]["content"] for completion in completions]
    return [count_xml(c) for c in contents]


if __name__ == "__main__":
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name="unsloth/Qwen3-0.6B",
        max_seq_length=max_seq_length,
        load_in_4bit=False,  # False for LoRA in 16-bit
        fast_inference=False,  # set True to enable vLLM fast inference (Linux only)
        max_lora_rank=lora_rank,
        gpu_memory_utilization=0.7,  # Reduce if you run out of memory
        device_map="xpu:0",
    )

    model = FastLanguageModel.get_peft_model(
        model,
        r=lora_rank,  # Choose any number > 0! 8, 16, 32, 64, 128 are suggested
        target_modules=[
            "q_proj",
            "k_proj",
            "v_proj",
            "o_proj",
            "gate_proj",
            "up_proj",
            "down_proj",
        ],  # Remove QKVO if you run out of memory
        lora_alpha=lora_rank,
        use_gradient_checkpointing="unsloth",  # Enable long-context fine-tuning
        random_state=3407,
    )

    dataset = get_gsm8k_questions()

    training_args = GRPOConfig(
        learning_rate=5e-6,
        adam_beta1=0.9,
        adam_beta2=0.99,
        weight_decay=0.1,
        warmup_ratio=0.1,
        lr_scheduler_type="cosine",
        optim="adamw_torch",
        logging_steps=1,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=1,  # Increase to 4 for smoother training
        num_generations=4,  # Decrease if you run out of memory
        max_prompt_length=max_prompt_length,
        max_completion_length=max_seq_length - max_prompt_length,
        # num_train_epochs=1,  # Set to 1 for a full training run
        max_steps=20,
        save_steps=250,
        max_grad_norm=0.1,
        report_to="none",  # Can use Weights & Biases
        output_dir="outputs",
    )

    trainer = GRPOTrainer(
        model=model,
        processing_class=tokenizer,
        reward_funcs=[
            xmlcount_reward_func,
            soft_format_reward_func,
            strict_format_reward_func,
            int_reward_func,
            correctness_reward_func,
        ],
        args=training_args,
        train_dataset=dataset,
        dataset_num_proc=1,  # recommended on Windows
    )

    trainer.train()

```

{% endcode %}
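
Once training finishes, you can eyeball whether the model has learned the XML format with a quick generation pass. A minimal sketch, assuming you append it at the end of the `if __name__ == "__main__":` block above (or run it in the same session), so `model`, `tokenizer`, and `SYSTEM_PROMPT` are still defined:

```python
from unsloth import FastLanguageModel

FastLanguageModel.for_inference(model)  # switch Unsloth into its faster inference mode

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": "What is 12 times 7?"},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to("xpu")

output = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```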

## Troubleshooting

### Out-of-memory (OOM) errors

If you run out of memory, try the following (for an example, see the sketch after this list):

1. **Reduce the batch size:** lower `per_device_train_batch_size`.
2. **Use a smaller model:** start from a smaller model to reduce memory requirements.
3. **Reduce the sequence length:** lower `max_seq_length`.
4. **Lower the LoRA rank:** use `r=8` instead of `r=16` or `r=32`.
5. **For GRPO, reduce the number of generations:** lower `num_generations`.
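
A reduced-memory variant of the SFT example above, with illustrative values (assumed, not tuned for any particular Intel GPU):

```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Qwen3-4B-bnb-4bit",  # smaller model than Qwen3-32B
    max_seq_length = 1024,                     # shorter sequences
    load_in_4bit = True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r = 8,  # lower LoRA rank
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 8,
)
```

In the `SFTConfig`, you would likewise drop `per_device_train_batch_size` to 1 and raise `gradient_accumulation_steps` to keep the effective batch size.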

### (Windows only) Intel Ultra AIPC iGPU shared memory

On Intel Ultra AIPCs running Windows with recent GPU drivers, the integrated GPU's shared graphics memory typically defaults to **57%** of system memory. For larger models (for example, **Qwen3-32B**), or when using longer maximum sequence lengths, larger batch sizes, LoRA adapters with higher ranks, and so on, you can increase the memory available during fine-tuning by raising the percentage of system memory allocated to the iGPU.

You can adjust this by editing the registry (a command sketch follows below):

* Path: `Computer\HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers\MemoryManager`
* Key to change:\
  `SystemPartitionCommitLimitPercentage` (set it to a larger percentage)
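
For example, from an administrator Command Prompt (a sketch only: the value type is assumed to be a DWORD, 90 is just an illustrative percentage, and a reboot may be required for the change to take effect; back up the registry before changing it):

```bash
reg add "HKLM\SYSTEM\CurrentControlSet\Control\GraphicsDrivers\MemoryManager" /v SystemPartitionCommitLimitPercentage /t REG_DWORD /d 90 /f
```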

