# 保存为 GGUF

将模型保存为 16 位以用于 GGUF，这样你就可以将其用于 [Unsloth Studio](/docs/zh/xin/studio.md)、Ollama、llama.cpp 等更多工具！

{% tabs %}
{% tab title="本地" %}
要保存为 GGUF，请使用下面的方法本地保存：

```python
model.save_pretrained_gguf("directory", tokenizer, quantization_method = "q4_k_m")
model.save_pretrained_gguf("directory", tokenizer, quantization_method = "q8_0")
model.save_pretrained_gguf("directory", tokenizer, quantization_method = "f16")
```

要推送到 Hugging Face hub：

```python
model.push_to_hub_gguf("hf_username/directory", tokenizer, quantization_method = "q4_k_m")
model.push_to_hub_gguf("hf_username/directory", tokenizer, quantization_method = "q8_0")
```

所有受支持的量化选项，适用于 `quantization_method` 如下所列：

```python
# https://github.com/ggml-org/llama.cpp/blob/master/examples/quantize/quantize.cpp#L19
ALLOWED_QUANTS = \
{
    "not_quantized"  : "推荐。转换速度快。推理速度慢，文件大。",
    "fast_quantized" : "推荐。转换速度快。推理还可以，文件大小还可以。",
    "quantized"      : "推荐。转换速度慢。推理速度快，文件小。",
    "f32"     : "不推荐。保留 100% 准确度，但速度极慢且非常占内存。",
    "f16"     : "转换速度最快 + 保留 100% 准确度。速度慢且占内存。",
    "q8_0"    : "转换速度快。资源占用高，但通常可接受。",
    "q4_k_m"  : "推荐。对 attention.wv 和 feed_forward.w2 张量的一半使用 Q6_K，其余使用 Q4_K",
    "q5_k_m"  : "推荐。对 attention.wv 和 feed_forward.w2 张量的一半使用 Q6_K，其余使用 Q5_K",
    "q2_k"    : "对 attention.vw 和 feed_forward.w2 张量使用 Q4_K，对其他张量使用 Q2_K。",
    "q3_k_l"  : "对 attention.wv、attention.wo 和 feed_forward.w2 张量使用 Q5_K，其余使用 Q3_K",
    "q3_k_m"  : "对 attention.wv、attention.wo 和 feed_forward.w2 张量使用 Q4_K，其余使用 Q3_K",
    "q3_k_s"  : "对所有张量使用 Q3_K",
    "q4_0"    : "原始量化方法，4 位。",
    "q4_1"    : "比 q4_0 精度更高，但不如 q5_0 高。不过推理速度比 q5 模型更快。",
    "q4_k_s"  : "对所有张量使用 Q4_K",
    "q4_k"    : "q4_k_m 的别名",
    "q5_k"    : "q5_k_m 的别名",
    "q5_0"    : "更高的精度、更高的资源占用和更慢的推理。",
    "q5_1"    : "更高的精度、资源占用和更慢的推理。",
    "q5_k_s"  : "对所有张量使用 Q5_K",
    "q6_k"    : "对所有张量使用 Q8_K",
    "iq2_xxs" : "2.06 bpw 量化",
    "iq2_xs"  : "2.31 bpw 量化",
    "iq3_xxs" : "3.06 bpw 量化",
    "q3_k_xs" : "3 位超小量化",
}
```

{% endtab %}

{% tab title="手动保存" %}
首先将你的模型保存为 16 位：

```python
model.save_pretrained_merged("merged_model", tokenizer, save_method = "merged_16bit",)
```

然后使用终端并执行：

{% code overflow="wrap" %}

```bash
apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-mtmd-cli llama-server llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp

python llama.cpp/convert-hf-to-gguf.py FOLDER --outfile OUTPUT --outtype f16
```

{% endcode %}

或者按照 <https://rentry.org/llama-cpp-conversions#merging-loras-into-a-model> 上的步骤，使用模型名称 "merged\_model" 合并为 GGUF。
{% endtab %}
{% endtabs %}

### 在 Unsloth 中运行效果很好，但导出并在其他平台上运行后，结果很差

你有时可能会遇到这样的问题：模型在 Unsloth 上运行并产生良好结果，但当你在另一个平台如 Ollama 或 vLLM 上使用它时，结果很差，或者你可能得到胡言乱语、无限/无尽生成 *或* 重复输&#x51FA;**.**

* 这种错误最常见的原因是使用了 <mark style="background-color:blue;">**错误的聊天模板**</mark>**.** 务必要使用与在 Unsloth 中训练模型时相同的聊天模板，并且在之后你在另一个框架中运行它时也要使用相同的模板，例如 llama.cpp 或 Ollama。从已保存的模型进行推理时，应用正确的模板至关重要。
* 你必须使用正确的 `eos 令牌`。否则，在较长的生成中你可能会得到胡言乱语。
* 这也可能是因为你的推理引擎添加了一个不必要的“序列起始”令牌（或者相反地缺少它），所以请确保两种假设都检查一下！
* <mark style="background-color:green;">**使用我们的对话笔记本来强制应用聊天模板——这将修复大多数问题。**</mark>
  * Qwen-3 14B 对话笔记本 [**在 Colab 中打开**](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen3_\(14B\)-Reasoning-Conversational.ipynb)
  * Gemma-3 4B 对话笔记本 [**在 Colab 中打开**](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Gemma3_\(4B\).ipynb)
  * Llama-3.2 3B 对话笔记本 [**在 Colab 中打开**](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_\(1B_and_3B\)-Conversational.ipynb)
  * Phi-4 14B 对话笔记本 [**在 Colab 中打开**](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Phi_4-Conversational.ipynb)
  * Mistral v0.3 7B 对话笔记本 [**在 Colab 中打开**](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Mistral_v0.3_\(7B\)-Conversational.ipynb)
  * **更多笔记本请见我们的** [**笔记本文档**](/docs/zh/kai-shi-shi-yong/unsloth-notebooks.md)

### 保存为 GGUF / vLLM 16 位时崩溃

你可以尝试通过更改以下设置来降低保存期间的最大 GPU 使用量 `maximum_memory_usage`.

默认值是 `model.save_pretrained(..., maximum_memory_usage = 0.75)`。将其降低到例如 0.5，以使用 GPU 峰值内存的 50% 或更少。这可以减少保存期间的 OOM 崩溃。

### 我如何手动保存为 GGUF？

首先通过以下方式将你的模型保存为 16 位：

```python
model.save_pretrained_merged("merged_model", tokenizer, save_method = "merged_16bit",)
```

像下面这样从源代码编译 llama.cpp：

{% code overflow="wrap" %}

```bash
apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-mtmd-cli llama-server llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp
```

{% endcode %}

然后，将模型保存为 F16：

```bash
python llama.cpp/convert_hf_to_gguf.py merged_model \
    --outfile model-F16.gguf --outtype f16 \
    --split-max-size 50G
```

```bash
# 对于 BF16：
python llama.cpp/convert_hf_to_gguf.py merged_model \
    --outfile model-BF16.gguf --outtype bf16 \
    --split-max-size 50G
    
# 对于 Q8_0：
python llama.cpp/convert_hf_to_gguf.py merged_model \
    --outfile model-Q8_0.gguf --outtype q8_0 \
    --split-max-size 50G
```


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://unsloth.ai/docs/zh/ji-chu/inference-and-deployment/saving-to-gguf.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.