# Guide to Fine-tuning Embedding Models with Unsloth

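Fine-tuning an embedding model can substantially improve retrieval and RAG performance on specific tasks. It aligns the model's vectors with your domain and with the kind of "similarity" that actually matters for your use case, which improves search, RAG, clustering, and recommendations on your data.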

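For example, the headlines "Google launches Pixel 10" and "Qwen releases Qwen3" might be embedded as similar if you are only labeling both as "tech", but they are not similar if you are doing semantic search, because they are about different things. Fine-tuning helps the model learn the "right" kind of similarity for your use case, reducing errors and improving results.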

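[**Unsloth**](https://github.com/unslothai/unsloth) now supports training embedding, **classifier**, **BERT**, and **reranker** models [**~1.8-3.3x faster**](#unsloth-benchmarks) with 20% less memory and 2x longer context than other Flash Attention 2 implementations, with no loss of accuracy. EmbeddingGemma-300M needs only **3GB of VRAM**. You can use your trained **model** anywhere: transformers, LangChain, Ollama, vLLM, llama.cpp, and more.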

Unsloth uses [SentenceTransformers](https://github.com/huggingface/sentence-transformers) to support compatible models such as Qwen3-Embedding and BERT. **Models are supported even if they have no notebook or upload.**

**We created free fine-tuning notebooks covering 3 main use cases:**

| [EmbeddingGemma (300M)](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/EmbeddingGemma_\(300M\).ipynb) | [Qwen3-Embedding 4B](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen3_Embedding_\(4B\).ipynb) • [0.6B](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen3_Embedding_\(0_6B\).ipynb) | [BGE M3](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/BGE_M3.ipynb)                        |
| ------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------- |
| [ModernBERT](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/bert_classification.ipynb) - classification | [All-MiniLM-L6-v2](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/All_MiniLM_L6_v2.ipynb)                                                                                                                            | [ModernBERT-large](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/bert_classification.ipynb) |

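* `All-MiniLM-L6-v2`: produce compact, domain-specific sentence embeddings for semantic search, retrieval, and clustering, tuned on your own data.
* `tomaarsen/miriad-4.4M-split`: embed medical questions and biomedical papers for high-quality medical semantic search and RAG.
* `electroglyph/technical`: better capture meaning and semantic similarity in technical text (docs, specs, and engineering discussions).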

You can view the rest of our uploaded models in [our collection here](https://huggingface.co/collections/unsloth/embedding-models).

> Many thanks to Unsloth contributor [**electroglyph**](https://github.com/unslothai/unsloth/pull/3719), whose work was essential to supporting this feature. You can find electroglyph's custom models on Hugging Face [here](https://huggingface.co/electroglyph).

### 🦥 Unsloth Features

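* LoRA/QLoRA or full fine-tuning for embeddings, without rewriting your pipeline
* Best support for encoder-only `SentenceTransformer` models (those that include a `modules.json`)
* Cross-encoder models are confirmed to train correctly, even on the fallback path
* This release also supports `transformers v5`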

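Models without a `modules.json` have limited support (we automatically assign a default `SentenceTransformers` pooling module). If you are doing something custom (custom heads, non-standard pooling), double-check your outputs, such as the behavior of the pooled embeddings.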
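One way to see which path a model is likely to take is to check whether its local folder ships a `modules.json`. The sketch below assumes the model has already been downloaded to a local directory. `describe_modules` is a hypothetical helper for illustration, not part of Unsloth or SentenceTransformers, and it relies only on the usual layout of that file (a JSON list of entries with a `type` field).

```python
import json
from pathlib import Path


def describe_modules(model_dir):
    """Return the module types listed in modules.json, or None if the file is absent.

    Hypothetical helper for illustration; it only inspects files on disk.
    """
    modules_path = Path(model_dir) / "modules.json"
    if not modules_path.is_file():
        # No modules.json: the model would rely on a default pooling module.
        return None
    with modules_path.open(encoding="utf-8") as f:
        modules = json.load(f)
    # Each entry is expected to look like:
    # {"idx": 0, "name": "0", "path": "", "type": "sentence_transformers.models.Transformer"}
    return [entry.get("type", "<unknown>") for entry in modules]
```

If the helper returns `None`, or the list has no pooling entry, that is a cue to inspect the pooled embeddings by hand before trusting the results.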

Some models that needed custom patches, such as MPNet or DistilBERT, were enabled by patching gradient checkpointing into the `transformers` models.

### 🛠️ Fine-tuning Workflow

The new fine-tuning flow is centered on `FastSentenceTransformer`.

The main save/push methods:

* `save_pretrained()` saves the **LoRA adapter** to a local folder
* `save_pretrained_merged()` saves the **merged model** to a local folder
* `push_to_hub()` pushes the **LoRA adapter** to Hugging Face
* `push_to_hub_merged()` pushes the **merged model** to Hugging Face
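The choice between the four methods can be wrapped in a small helper. This is a minimal sketch, not Unsloth's own code: `export_model`, `local_dir`, and `repo_id` are hypothetical names, and it assumes each method takes the target folder or repository id as its first positional argument, so check the notebooks for the exact signatures.

```python
def export_model(model, local_dir, repo_id=None, merged=False):
    """Save a fine-tuned model locally and optionally push it to Hugging Face.

    Assumes `model` exposes the four save/push methods and that each one
    accepts the target folder or repo id as its first argument (an assumption).
    """
    if merged:
        model.save_pretrained_merged(local_dir)  # full merged model
    else:
        model.save_pretrained(local_dir)         # LoRA adapter only
    if repo_id is not None:
        if merged:
            model.push_to_hub_merged(repo_id)
        else:
            model.push_to_hub(repo_id)
    return local_dir
```

Saving the adapter alone keeps the artifact small, while the merged model is the more convenient form when a downstream tool expects a standalone checkpoint.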

**One more very important detail: loading for inference requires `for_inference=True`**

`from_pretrained()` works like Unsloth's other fast classes, with **one exception**:

* To load a model for **inference** with `FastSentenceTransformer`, you **must** pass `for_inference=True`

So your inference load should look like this:

```python
from unsloth import FastSentenceTransformer

model = FastSentenceTransformer.from_pretrained(
    "sentence-transformers/all-MiniLM-L6-v2",
    for_inference=True,
)
```

For Hugging Face authentication, if you run:

```
hf auth login
```

in the same virtualenv before calling the hub methods, then:

* `push_to_hub()` and `push_to_hub_merged()` **do not need a token argument**.

### ✅ Run Inference and Deploy Anywhere! <a href="#docs-internal-guid-c10bfa80-7fff-446e-714d-732eebcd72d6" id="docs-internal-guid-c10bfa80-7fff-446e-714d-732eebcd72d6"></a>

Your fine-tuned Unsloth model can be used and deployed with all major tools: transformers, LangChain, Weaviate, sentence-transformers, Text Embeddings Inference (TEI), vLLM and llama.cpp, custom embedding APIs, pgvector, FAISS/vector databases, and any RAG framework.

There is no lock-in: the fine-tuned model can be downloaded afterwards and run locally on your own device.

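```python
from sentence_transformers import SentenceTransformer

# 1. Load a pretrained Sentence Transformer model
model = SentenceTransformer("<your-unsloth-finetuned-model>")

query = "Which planet is known as the Red Planet?"
documents = [
    "Venus is often called Earth's twin because of its similar size and proximity.",
    "Mars, known for its reddish appearance, is often referred to as the Red Planet.",
    "Jupiter, the largest planet in our solar system, has a prominent red spot.",
    "Saturn, famous for its rings, is sometimes mistaken for the Red Planet.",
]

# 2. Encode via encode_query and encode_document so the correct prompts are used automatically when needed
query_embedding = model.encode_query(query)
document_embedding = model.encode_document(documents)
print(query_embedding.shape, document_embedding.shape)

# 3. Compute similarity, e.g. via the built-in similarity helper
similarity = model.similarity(query_embedding, document_embedding)
print(similarity)
```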
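For intuition, the similarity step can be reproduced by hand. The sketch below assumes cosine similarity, which is a common default for SentenceTransformer models but can differ per model. The vectors are made-up toy values rather than real embeddings, and `cosine_similarity` and `rank_documents` are hypothetical helper names.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs):
    """Return document indices sorted from most to least similar to the query."""
    scores = [cosine_similarity(query_vec, d) for d in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)


# Toy 3-dimensional vectors (hypothetical values for illustration only).
query_vec = [1.0, 0.0, 0.0]
doc_vecs = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [0.9, 0.1, 0.0],  # nearly parallel to the query
    [0.5, 0.5, 0.0],  # in between
]
print(rank_documents(query_vec, doc_vecs))  # prints [1, 2, 0]
```

Fine-tuning changes where the vectors land, so the same ranking code returns a different order once the model has learned which documents should count as close.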

### 📊 Unsloth Benchmarks

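Speed is one of Unsloth's advantages for embedding fine-tuning. We show that Unsloth is consistently **1.8 to 3.3x faster** across a variety of embedding models and at sequence lengths from 128 to 2048 and beyond.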

EmbeddingGemma-300M QLoRA needs only **3GB of VRAM**, and LoRA runs in just 6GB of VRAM.

Below is a heatmap benchmark comparing Unsloth with `SentenceTransformers` + Flash Attention 2 (FA2) for 4bit QLoRA. **For 4bit QLoRA, Unsloth is 1.8x to 2.6x faster:**

<figure><img src="/files/ce6401986affc46c6142842bdeec1ff4bb84eee9" alt=""><figcaption></figcaption></figure>

Below is how Unsloth compares with `SentenceTransformers` + Flash Attention 2 (FA2) for 16bit LoRA. **For 16bit LoRA, Unsloth is 1.2x to 3.3x faster:**

<figure><img src="/files/7508bd7760cd35438a45e7b7047e96084aacf4d1" alt=""><figcaption></figcaption></figure>
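To read these factors as wall-clock savings, a short arithmetic sketch helps. It assumes that "N x faster" means the same job finishes in 1/N of the baseline time. The 10-hour baseline is a hypothetical figure, not a measured result, and `time_saved_fraction` is a hypothetical helper name.

```python
def time_saved_fraction(speedup):
    """Fraction of baseline training time saved at a given speedup factor."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


baseline_hours = 10.0  # hypothetical baseline run, not a measured result
for speedup in (1.8, 3.3):
    hours = baseline_hours / speedup
    print(f"{speedup}x faster: {hours:.1f} h, {time_saved_fraction(speedup):.0%} less time")
```

Under that assumption, the 1.8x and 3.3x ends of the range correspond to roughly 44% and 70% less training time.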

### 🔮 Model Support

Here are some popular embedding models that Unsloth supports (not all models are listed here):

```
Alibaba-NLP/gte-modernbert-base
BAAI/bge-large-en-v1.5
BAAI/bge-m3
BAAI/bge-reranker-v2-m3
Qwen/Qwen3-Embedding-0.6B
answerdotai/ModernBERT-base
answerdotai/ModernBERT-large
google/embeddinggemma-300m
intfloat/e5-large-v2
intfloat/multilingual-e5-large-instruct
mixedbread-ai/mxbai-embed-large-v1
sentence-transformers/all-MiniLM-L6-v2
sentence-transformers/all-mpnet-base-v2
Snowflake/snowflake-arctic-embed-l-v2.0
```

Most [common models](https://huggingface.co/models?library=sentence-transformers) are already supported. If an encoder-only model you want is not yet supported, feel free to open a [GitHub issue](https://github.com/unslothai/unsloth/issues) to request it.

