# vLLM Deployment and Inference Guide

### :computer:Installing vLLM

For NVIDIA GPUs, use uv and run:

```bash
pip install --upgrade pip
pip install uv
uv pip install -U vllm --torch-backend=auto
```

For AMD GPUs, use the nightly Docker image: `rocm/vllm-dev:nightly`

For the nightly build on NVIDIA GPUs, run:

{% code overflow="wrap" %}

```bash
pip install --upgrade pip
pip install uv
uv pip install -U vllm --torch-backend=auto --extra-index-url https://wheels.vllm.ai/nightly
```

{% endcode %}

See the [vLLM documentation](https://docs.vllm.ai/en/stable/getting_started/installation) for more details.

### :truck:Deploying a model with vLLM

After saving your fine-tune, you can simply run:

```bash
vllm serve unsloth/gpt-oss-120b
```
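`vllm serve` exposes an OpenAI-compatible HTTP API, by default on port 8000. As a minimal sketch, you can query it from Python using only the standard library; the model name below matches the example above, and `build_chat_request` / `query_vllm` are helper names invented here for illustration:

```python
import json
import urllib.request

def build_chat_request(model: str, prompt: str, max_tokens: int = 128) -> dict:
    """Build an OpenAI-style /v1/chat/completions payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def query_vllm(payload: dict, base_url: str = "http://localhost:8000") -> dict:
    """POST the payload to a running vLLM server and return the parsed JSON."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

payload = build_chat_request("unsloth/gpt-oss-120b", "Hello!")
# reply = query_vllm(payload)   # requires the server above to be running
# print(reply["choices"][0]["message"]["content"])
```

The same endpoint also works with the official `openai` client by pointing its `base_url` at the server.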

### :fire\_engine:vLLM server flags, engine arguments and options

Some important server flags are covered in [#vllm-deployment-server-flags-engine-arguments-and-options](#vllm-deployment-server-flags-engine-arguments-and-options "mention")

### 🦥 Deploying Unsloth fine-tunes with vLLM

After fine-tuning [fine-tuning-llms-guide](https://unsloth.ai/docs/zh/kai-shi-shi-yong/fine-tuning-llms-guide "mention") or using one of our notebooks at [unsloth-notebooks](https://unsloth.ai/docs/zh/kai-shi-shi-yong/unsloth-notebooks "mention"), you can save or deploy your model directly with vLLM in a single workflow. Below is an example Unsloth fine-tuning script:

```python
from unsloth import FastLanguageModel
import torch
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/gpt-oss-20b",
    max_seq_length = 2048,
    load_in_4bit = True,
)
model = FastLanguageModel.get_peft_model(model)
```

**To save to 16-bit for vLLM, use:**

{% code overflow="wrap" %}

```python
model.save_pretrained_merged("finetuned_model", tokenizer, save_method = "merged_16bit")
# Or upload to Hugging Face:
model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")
```

{% endcode %}

**To save just the LoRA adapters**, either use:

```python
model.save_pretrained("finetuned_lora")
tokenizer.save_pretrained("finetuned_lora")
```

Or just use our built-in function to do that:

{% code overflow="wrap" %}

```python
model.save_pretrained_merged("finetuned_model", tokenizer, save_method = "lora")
# Or upload to Hugging Face:
model.push_to_hub_merged("hf/model", tokenizer, save_method = "lora", token = "")
```

{% endcode %}
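If you saved only the LoRA adapters, one option is to serve the base model and attach the adapter via vLLM's LoRA flags. A sketch, assuming the `finetuned_lora` directory from the example above (`my_lora` is a hypothetical adapter name chosen here):

```bash
# Serve the base model with LoRA support enabled, registering the saved
# adapter under the name "my_lora".
vllm serve unsloth/gpt-oss-20b \
    --enable-lora \
    --lora-modules my_lora=./finetuned_lora
```

Requests can then target the adapter by setting `"model": "my_lora"` in the request body.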

To merge to 4-bit for loading on Hugging Face, first call `merged_4bit`. Only use `merged_4bit_forced` if you are certain you want to merge to 4-bit; this is strongly discouraged unless you know exactly what you will do with the 4-bit model (e.g. DPO training, or Hugging Face's online inference engine).

{% code overflow="wrap" %}

```python
model.save_pretrained_merged("finetuned_model", tokenizer, save_method = "merged_4bit")
# Upload to Hugging Face:
model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit", token = "")
```

{% endcode %}

Then, in another terminal, load the fine-tuned model with vLLM:

```bash
vllm serve finetuned_model
```

If the above doesn't work, you may need to provide the full path, e.g.:

```bash
vllm serve /mnt/disks/daniel/finetuned_model
```
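Serving often fails simply because the directory is not a complete merged checkpoint. A minimal sanity check you can run before pointing `vllm serve` at a path (a hypothetical helper, not part of Unsloth or vLLM):

```python
from pathlib import Path

def looks_like_merged_model(path: str) -> bool:
    """Heuristic: a merged HF-format checkpoint should contain a config.json
    plus at least one weights file (*.safetensors or *.bin)."""
    p = Path(path)
    has_config = (p / "config.json").is_file()
    has_weights = any(p.glob("*.safetensors")) or any(p.glob("*.bin"))
    return has_config and has_weights

# Example: check the directory before serving it.
# looks_like_merged_model("/mnt/disks/daniel/finetuned_model")
```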

See also:

### [vllm-engine-arguments](https://unsloth.ai/docs/zh/ji-chu/inference-and-deployment/vllm-guide/vllm-engine-arguments "mention")

### [lora-hot-swapping-guide](https://unsloth.ai/docs/zh/ji-chu/inference-and-deployment/vllm-guide/lora-hot-swapping-guide "mention")

