# Continued Pretraining

* The [Text Completion notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Mistral_\(7B\)-Text_Completion.ipynb) is for continued pretraining on raw text.
* The [Continued Pretraining notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Mistral_v0.3_\(7B\)-CPT.ipynb) is for learning another language.

You can read more about continued pretraining and our release in our [blog post](https://unsloth.ai/blog/contpretraining).

## What is Continued Pretraining?

Continued or continual pretraining (CPT) is necessary to "steer" a language model to understand new domains of knowledge, or out-of-distribution domains. Base models like Llama-3 8b or Mistral 7b are first pretrained on gigantic datasets of trillions of tokens (for example, 15 trillion for Llama-3).

But sometimes these models have not been well trained on other languages, or on text from specific domains such as law, medicine, or other areas. Continued pretraining (CPT) is therefore needed to make the language model learn new tokens or new datasets.
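Unlike instruction finetuning, CPT trains on raw, unstructured text. A common preprocessing step is to concatenate tokenized documents and pack them into fixed-length blocks so that no compute is wasted on padding. The sketch below is a minimal, framework-free illustration of that packing idea; the `pack_into_blocks` helper and the toy token IDs are hypothetical, not part of the Unsloth API:

```python
from itertools import chain

def pack_into_blocks(tokenized_docs, block_size):
    """Concatenate tokenized documents into one stream, then split the
    stream into fixed-length blocks, dropping the incomplete remainder."""
    stream = list(chain.from_iterable(tokenized_docs))
    n_blocks = len(stream) // block_size
    return [stream[i * block_size : (i + 1) * block_size]
            for i in range(n_blocks)]

# Toy "token IDs" standing in for real tokenizer output.
docs = [[1, 2, 3, 4], [5, 6, 7], [8, 9, 10, 11, 12]]
blocks = pack_into_blocks(docs, block_size=5)
print(blocks)  # → [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]
```

In practice the trainer handles this for you when given raw text; the sketch only shows why a sufficiently long `max_seq_length` matters for CPT throughput.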

## Advanced Features:

### Loading LoRA adapters for continued finetuning

If you saved a LoRA adapter through Unsloth, you can also continue training with your LoRA weights. Note that the optimizer state will be reset. To load even the optimizer states to continue finetuning, see the next section.

```python
from unsloth import FastLanguageModel

# "LORA_MODEL_NAME" is the path or hub name under which your LoRA adapter was saved
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "LORA_MODEL_NAME",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)
trainer = Trainer(...)  # your trainer, configured as in the original run
trainer.train()
```

### Continued pretraining & finetuning the `lm_head` and `embed_tokens` matrices

Add `lm_head` and `embed_tokens` to the target modules. On Colab, Llama-3 8b can sometimes run out of memory with both; if so, just add `lm_head`.

```python
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",
                      "lm_head", "embed_tokens",],  # also train the output head and embeddings
    lora_alpha = 16,
)
```

Then use two different learningning rates: a 2-10x smaller one for `lm_head` and `embed_tokens`, like so:

```python
from unsloth import UnslothTrainer, UnslothTrainingArguments

trainer = UnslothTrainer(
    ....
    args = UnslothTrainingArguments(
        ....
        learning_rate = 5e-5,
        embedding_learning_rate = 5e-6, # 2-10x smaller than learning_rate
    ),
)
```
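Under the hood, a separate embedding learning rate amounts to putting the `embed_tokens` and `lm_head` parameters into their own optimizer parameter group. The sketch below is an illustrative stand-in for that grouping logic, not Unsloth's actual implementation; the `split_param_groups` helper and toy parameter names are hypothetical:

```python
def split_param_groups(named_params, lr, embedding_lr):
    """Give embedding/head parameters a smaller learning rate than the
    rest of the model by placing them in a separate parameter group."""
    embed_names = ("embed_tokens", "lm_head")
    regular, embeddings = [], []
    for name, param in named_params:
        (embeddings if any(k in name for k in embed_names) else regular).append(param)
    return [
        {"params": regular,    "lr": lr},
        {"params": embeddings, "lr": embedding_lr},
    ]

# Toy named parameters standing in for model.named_parameters().
params = [("model.layers.0.q_proj.weight", "W_q"),
          ("model.embed_tokens.weight",    "W_emb"),
          ("lm_head.weight",               "W_head")]
groups = split_param_groups(params, lr=5e-5, embedding_lr=5e-6)
print([g["lr"] for g in groups])  # → [5e-05, 5e-06]
```

In a plain PyTorch setup, a list of groups like this could be passed directly to an optimizer constructor (e.g. `torch.optim.AdamW(groups)`); `UnslothTrainingArguments` handles this for you via `embedding_learning_rate`.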

