# Qwen3-2507: Run Locally Guide

Qwen released 2507 (July 2025) updates for their [Qwen3](https://unsloth.ai/docs/models/tutorials/qwen3-how-to-run-and-fine-tune) 4B, 30B and 235B models, introducing both "thinking" and "non-thinking" variants. The non-thinking models '**Qwen3-30B-A3B-Instruct-2507**' and '**Qwen3-235B-A22B-Instruct-2507**' feature a 256K context window, improved instruction following, multilingual capabilities and better alignment.

The thinking models '**Qwen3-30B-A3B-Thinking-2507**' and '**Qwen3-235B-A22B-Thinking-2507**' excel at reasoning, with the 235B achieving SOTA results in logic, math, science, coding, and advanced academic tasks.

[Unsloth](https://github.com/unslothai/unsloth) also now supports fine-tuning and [Reinforcement Learning (RL)](https://unsloth.ai/docs/get-started/reinforcement-learning-rl-guide) of Qwen3-2507 models — 2x faster, with 70% less VRAM, and 8x longer context lengths.

<a href="#run-qwen3-30b-a3b-2507-tutorials" class="button secondary">Run 30B-A3B</a><a href="#run-qwen3-235b-a22b-2507-tutorials" class="button secondary">Run 235B-A22B</a><a href="#fine-tuning-qwen3-2507-with-unsloth" class="button secondary">Fine-tune Qwen3-2507</a>

**Unsloth** [**Dynamic 2.0**](https://unsloth.ai/docs/basics/unsloth-dynamic-2.0-ggufs) **GGUFs:**

| Model                    | GGUFs to run:                                                                                                                                                 |
| ------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Qwen3-**4B-2507**        | [Instruct](https://huggingface.co/unsloth/Qwen3-4B-Instruct-2507-GGUF) • [Thinking](https://huggingface.co/unsloth/Qwen3-4B-Thinking-2507-GGUF)               |
| Qwen3-**30B-A3B**-2507   | [Instruct](#llama.cpp-run-qwen3-30b-a3b-instruct-2507-tutorial) • [Thinking](https://huggingface.co/unsloth/Qwen3-30B-A3B-Thinking-2507-GGUF)                 |
| Qwen3-**235B-A22B**-2507 | [Instruct](https://huggingface.co/unsloth/Qwen3-235B-A22B-Instruct-2507-GGUF) • [Thinking](https://huggingface.co/unsloth/Qwen3-235B-A22B-Thinking-2507-GGUF) |

## ⚙️Best Practices

{% hint style="success" %}
The settings for the Thinking and Instruct models are different.\
The Thinking model uses temperature = 0.6, while the Instruct model uses temperature = 0.7.\
The Thinking model uses top\_p = 0.95, while the Instruct model uses top\_p = 0.8.
{% endhint %}

To achieve optimal performance, Qwen recommends these settings:

| Instruct Model Settings:                                                                                      | Thinking Model Settings:                                                                                      |
| ------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------- |
| <mark style="background-color:blue;">`Temperature = 0.7`</mark>                                               | <mark style="background-color:blue;">`Temperature = 0.6`</mark>                                               |
| `Min_P = 0.00` (llama.cpp's default is 0.1)                                                                   | `Min_P = 0.00` (llama.cpp's default is 0.1)                                                                   |
| `Top_P = 0.80`                                                                                                | `Top_P = 0.95`                                                                                                |
| `TopK = 20`                                                                                                   | `TopK = 20`                                                                                                   |
| `presence_penalty = 0.0 to 2.0` (llama.cpp default turns it off, but to reduce repetitions, you can use this) | `presence_penalty = 0.0 to 2.0` (llama.cpp default turns it off, but to reduce repetitions, you can use this) |

**Adequate Output Length**: Use an output length of `32,768` tokens, which is adequate for most queries.
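
If you run the model behind an OpenAI-compatible endpoint (for example `llama-server` or vLLM), the same settings map directly onto the request body. Below is a minimal sketch assuming a local server on port 8080; the base URL and model name are placeholders, not part of Qwen's or Unsloth's instructions:

```python
# Minimal sketch: sending the recommended sampling settings to a local
# OpenAI-compatible server. The URL, API key and model name are assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="Qwen3-30B-A3B-Instruct-2507",      # whatever name your server exposes
    messages=[{"role": "user", "content": "What is 1+1?"}],
    temperature=0.7,       # Instruct model; use 0.6 for the Thinking model
    top_p=0.8,             # Instruct model; use 0.95 for the Thinking model
    presence_penalty=1.0,  # optional, helps reduce endless repetitions
    max_tokens=32768,      # adequate output length for most queries
    extra_body={"top_k": 20, "min_p": 0.0},   # non-standard params, if supported
)
print(response.choices[0].message.content)
```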

The chat template for both the Thinking and Instruct models is shown below (the Thinking model additionally wraps its reasoning in `<think></think>` tags):

```
<|im_start|>user
What is 1+1?<|im_end|>
<|im_start|>assistant
2<|im_end|>
<|im_start|>user
Hey there!<|im_end|>
<|im_start|>assistant
```
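
To inspect exactly what the model receives, you can render the template with the Hugging Face tokenizer. A minimal sketch, assuming `transformers` is installed and you use the Instruct repo:

```python
# Minimal sketch: render the Qwen3-2507 chat template locally.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("unsloth/Qwen3-30B-A3B-Instruct-2507")
messages = [{"role": "user", "content": "What is 1+1?"}]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,  # appends the trailing <|im_start|>assistant turn
)
print(prompt)
# For the Thinking model, the rendered prompt also opens a <think> block,
# which is why its raw output may only contain the closing </think> tag.
```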

## 📖 Run Qwen3-30B-A3B-2507 Tutorials

Below are guides for the [Thinking](#thinking-qwen3-30b-a3b-thinking-2507) and [Instruct](#instruct-qwen3-30b-a3b-instruct-2507) versions of the model.

### Instruct: Qwen3-30B-A3B-Instruct-2507

Given that this is a non-thinking model, there is no need to set `thinking=False`, and the model does not generate `<think></think>` blocks.

#### ⚙️Best Practices

To achieve optimal performance, Qwen recommends the following settings:

* We suggest using `temperature=0.7, top_p=0.8, top_k=20, and min_p=0.0`, with `presence_penalty` between 0 and 2 (if your framework supports it) to reduce endless repetitions.
* **`temperature = 0.7`**
* `top_k = 20`
* `min_p = 0.00` (llama.cpp's default is 0.1)
* **`top_p = 0.80`**
* `presence_penalty = 0.0 to 2.0` (llama.cpp's default turns it off, but you can use this to reduce repetitions; try 1.0, for example)
* Supports up to `262,144` tokens of context natively, but you can set it to `32,768` tokens to use less RAM

#### 🦙 Ollama: Run Qwen3-30B-A3B-Instruct-2507 Tutorial

1. Install `ollama` if you haven't already! You can only run models up to 32B in size.

```bash
apt-get update
apt-get install pciutils -y
curl -fsSL https://ollama.com/install.sh | sh
```

2. Run the model! Note you can call `ollama serve` in another terminal if it fails. We include all our fixes and suggested parameters (temperature etc.) in `params` in our Hugging Face upload!

```bash
ollama run hf.co/unsloth/Qwen3-30B-A3B-Instruct-2507-GGUF:UD-Q4_K_XL
```
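
Ollama picks up our suggested parameters from the upload's `params`, but you can override them inside the interactive session if needed. A sketch, assuming the standard `ollama run` REPL:

```
>>> /set parameter temperature 0.7
>>> /set parameter top_p 0.8
>>> /set parameter num_ctx 32768
```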

#### :sparkles: Llama.cpp: Run Qwen3-30B-A3B-Instruct-2507 Tutorial

1. Obtain the latest `llama.cpp` on [GitHub here](https://github.com/ggml-org/llama.cpp). You can follow the build instructions below as well. Change `-DGGML_CUDA=ON` to `-DGGML_CUDA=OFF` if you don't have a GPU or just want CPU inference. **For Apple Mac / Metal devices**, set `-DGGML_CUDA=OFF` then continue as usual - Metal support is on by default.

```bash
apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp
```

2. You can directly pull from Hugging Face via:

   ```bash
   ./llama.cpp/llama-cli \
       -hf unsloth/Qwen3-30B-A3B-Instruct-2507-GGUF:Q4_K_XL \
       --jinja -ngl 99 --ctx-size 32768 \
       --temp 0.7 --min-p 0.0 --top-p 0.80 --top-k 20 --presence-penalty 1.0
   ```
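
   If you prefer an OpenAI-compatible HTTP endpoint instead of the interactive CLI, the same flags also work with `llama-server`. A sketch, assuming you add `llama-server` to the `--target` list when building and serve on port 8080:

   ```bash
   # Assumes llama-server was also built (add it to the --target list above).
   ./llama.cpp/llama-server \
       -hf unsloth/Qwen3-30B-A3B-Instruct-2507-GGUF:Q4_K_XL \
       --jinja -ngl 99 --ctx-size 32768 --port 8080 \
       --temp 0.7 --min-p 0.0 --top-p 0.80 --top-k 20 --presence-penalty 1.0
   ```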
3. Alternatively, download the model via `huggingface_hub` (after running `pip install huggingface_hub hf_transfer`). You can choose `UD-Q4_K_XL` or other quantized versions.

```python
# !pip install huggingface_hub hf_transfer
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id = "unsloth/Qwen3-30B-A3B-Instruct-2507-GGUF",
    local_dir = "unsloth/Qwen3-30B-A3B-Instruct-2507-GGUF",
    allow_patterns = ["*UD-Q4_K_XL*"],
)
```
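
Once downloaded, point `llama-cli` at the local GGUF instead of using `-hf`. A sketch; adjust the path and filename to match whatever `snapshot_download` actually fetched:

```bash
# The exact filename depends on the quant you downloaded; check the UD-Q4_K_XL folder.
./llama.cpp/llama-cli \
    --model unsloth/Qwen3-30B-A3B-Instruct-2507-GGUF/UD-Q4_K_XL/Qwen3-30B-A3B-Instruct-2507-UD-Q4_K_XL.gguf \
    --jinja -ngl 99 --ctx-size 32768 \
    --temp 0.7 --min-p 0.0 --top-p 0.80 --top-k 20 --presence-penalty 1.0
```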

### Thinking: Qwen3-30B-A3B-Thinking-2507

This model supports only thinking mode and a 256K context window natively. The default chat template adds `<think>` automatically, so you may see only a closing `</think>` tag in the output.
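
If you post-process the output yourself, you can split on the closing tag to separate the reasoning from the final answer. A minimal sketch, assuming the raw text contains at most one `</think>` tag:

```python
# Minimal sketch: separate reasoning from the final answer for the Thinking model.
raw = "...model output..."  # whatever your inference framework returned
if "</think>" in raw:
    reasoning, answer = raw.split("</think>", 1)
else:
    reasoning, answer = "", raw
print(answer.strip())
```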

#### ⚙️Best Practices

To achieve optimal performance, Qwen recommends the following settings:

* We suggest using `temperature=0.6, top_p=0.95, top_k=20, and min_p=0.0`, with `presence_penalty` between 0 and 2 (if your framework supports it) to reduce endless repetitions.
* **`temperature = 0.6`**
* `top_k = 20`
* `min_p = 0.00` (llama.cpp's default is 0.1)
* **`top_p = 0.95`**
* `presence_penalty = 0.0 to 2.0` (llama.cpp's default turns it off, but you can use this to reduce repetitions; try 1.0, for example)
* Supports up to `262,144` tokens of context natively, but you can set it to `32,768` tokens to use less RAM

#### 🦙 Ollama: Run Qwen3-30B-A3B-Thinking-2507 Tutorial

1. Install `ollama` if you haven't already! You can only run models up to 32B in size. To run the full 235B-A22B models, [see here](#run-qwen3-235b-a22b-instruct-2507).

```bash
apt-get update
apt-get install pciutils -y
curl -fsSL https://ollama.com/install.sh | sh
```

2. Run the model! Note you can call `ollama serve` in another terminal if it fails. We include all our fixes and suggested parameters (temperature etc.) in `params` in our Hugging Face upload!

```bash
ollama run hf.co/unsloth/Qwen3-30B-A3B-Thinking-2507-GGUF:UD-Q4_K_XL
```

#### :sparkles: Llama.cpp: Run Qwen3-30B-A3B-Thinking-2507 Tutorial

1. Obtain the latest `llama.cpp` on [GitHub here](https://github.com/ggml-org/llama.cpp). You can follow the build instructions below as well. Change `-DGGML_CUDA=ON` to `-DGGML_CUDA=OFF` if you don't have a GPU or just want CPU inference. **For Apple Mac / Metal devices**, set `-DGGML_CUDA=OFF` then continue as usual - Metal support is on by default.

```bash
apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp
```

2. You can directly pull from Hugging Face via:

   ```bash
   ./llama.cpp/llama-cli \
       -hf unsloth/Qwen3-30B-A3B-Thinking-2507-GGUF:Q4_K_XL \
       --jinja -ngl 99 --ctx-size 32768 \
       --temp 0.6 --min-p 0.0 --top-p 0.95 --top-k 20 --presence-penalty 1.0
   ```
3. Alternatively, download the model via `huggingface_hub` (after running `pip install huggingface_hub hf_transfer`). You can choose `UD-Q4_K_XL` or other quantized versions.

```python
# !pip install huggingface_hub hf_transfer
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id = "unsloth/Qwen3-30B-A3B-Thinking-2507-GGUF",
    local_dir = "unsloth/Qwen3-30B-A3B-Thinking-2507-GGUF",
    allow_patterns = ["*UD-Q4_K_XL*"],
)
```

## 📖 Run **Qwen3-235B-A22B-2507** Tutorials

Below are guides for the [Thinking](#run-qwen3-235b-a22b-thinking-via-llama.cpp) and [Instruct](#run-qwen3-235b-a22b-instruct-via-llama.cpp) versions of the model.

### Thinking: Qwen3-**235B-A22B**-Thinking-2507

This model supports only thinking mode and a 256K context window natively. The default chat template adds `<think>` automatically, so you may see only a closing `</think>` tag in the output.

#### :gear: Best Practices

To achieve optimal performance, Qwen recommends these settings for the Thinking model:

* **`temperature = 0.6`**
* `top_k = 20`
* `min_p = 0.00` (llama.cpp's default is 0.1)
* `top_p = 0.95`
* `presence_penalty = 0.0 to 2.0` (llama.cpp's default turns it off, but you can use this to reduce repetitions; try 1.0, for example)
* **Adequate Output Length**: Use an output length of `32,768` tokens, which is adequate for most queries.

#### :sparkles: Run Qwen3-235B-A22B-Thinking via llama.cpp:

For Qwen3-235B-A22B, we will specifically use Llama.cpp for optimized inference and a plethora of options.

{% hint style="success" %}
If you want a **full-precision, unquantized version**, use our `BF16` version; `Q8_K_XL` and `Q8_0` are near-lossless alternatives!
{% endhint %}

1. Obtain the latest `llama.cpp` on [GitHub here](https://github.com/ggml-org/llama.cpp). You can follow the build instructions below as well. Change `-DGGML_CUDA=ON` to `-DGGML_CUDA=OFF` if you don't have a GPU or just want CPU inference. **For Apple Mac / Metal devices**, set `-DGGML_CUDA=OFF` then continue as usual - Metal support is on by default.

   ```bash
   apt-get update
   apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
   git clone https://github.com/ggml-org/llama.cpp
   cmake llama.cpp -B llama.cpp/build \
       -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
   cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-gguf-split
   cp llama.cpp/build/bin/llama-* llama.cpp
   ```
2. You can use llama.cpp to download and run the model directly, but we normally suggest using `huggingface_hub` (see step 3). To use llama.cpp directly, do:

   ```bash
   ./llama.cpp/llama-cli \
       -hf unsloth/Qwen3-235B-A22B-Thinking-2507-GGUF:Q2_K_XL \
       --ctx-size 16384 \
       --n-gpu-layers 99 \
       -ot ".ffn_.*_exps.=CPU" \
       --temp 0.6 \
       --min-p 0.0 \
       --top-p 0.95 \
       --top-k 20 \
       --presence-penalty 1.0
   ```
3. Alternatively, download the model via `huggingface_hub` (after running `pip install huggingface_hub hf_transfer`). You can choose `UD-Q2_K_XL` or other quantized versions.

   ```python
   # !pip install huggingface_hub hf_transfer
   import os
   os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "0" # Can sometimes rate limit, so set to 0 to disable
   from huggingface_hub import snapshot_download
   snapshot_download(
       repo_id = "unsloth/Qwen3-235B-A22B-Thinking-2507-GGUF",
       local_dir = "unsloth/Qwen3-235B-A22B-Thinking-2507-GGUF",
       allow_patterns = ["*UD-Q2_K_XL*"],
   )
   ```
4. Run the model and try any prompt.
5. Edit `--threads -1` for the number of CPU threads, `--ctx-size 262144` for the context length, and `--n-gpu-layers 99` for how many layers to offload to the GPU. Lower it if your GPU runs out of memory, and remove it entirely for CPU-only inference.

{% hint style="success" %}
Use `-ot ".ffn_.*_exps.=CPU"` to offload all MoE layers to the CPU! This effectively allows you to fit all non MoE layers on 1 GPU, improving generation speeds. You can customize the regex expression to fit more layers if you have more GPU capacity.
{% endhint %}

{% code overflow="wrap" %}

```bash
./llama.cpp/llama-cli \
    --model unsloth/Qwen3-235B-A22B-Thinking-2507-GGUF/UD-Q2_K_XL/Qwen3-235B-A22B-Thinking-2507-UD-Q2_K_XL-00001-of-00002.gguf \
    --ctx-size 16384 \
    --n-gpu-layers 99 \
    -ot ".ffn_.*_exps.=CPU" \
    --seed 3407 \
    --temp 0.6 \
    --min-p 0.0 \
    --top-p 0.95 \
    --top-k 20 \
    --presence-penalty 1.0
```

{% endcode %}

### Instruct: Qwen3-**235B-A22B**-Instruct-2507

Given that this is a non-thinking model, there is no need to set `thinking=False`, and the model does not generate `<think></think>` blocks.

#### ⚙️Best Practices

To achieve optimal performance, we recommend the following settings:

**1. Sampling Parameters**: We suggest using `temperature=0.7, top_p=0.8, top_k=20, and min_p=0.0`, with `presence_penalty` between 0 and 2 (if your framework supports it) to reduce endless repetitions.

2\. **Adequate Output Length**: We recommend using an output length of `16,384` tokens for most queries, which is adequate for instruct models.

3\. **Standardize Output Format:** We recommend using prompts to standardize model outputs when benchmarking.

* **Math Problems**: Include `Please reason step by step, and put your final answer within \boxed{}.` in the prompt.
* **Multiple-Choice Questions**: Add the following instruction to the prompt to standardize responses: "Please show your choice in the `answer` field with only the choice letter, e.g., `"answer": "C"`." (see the example below)
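
For example, a standardized multiple-choice prompt might look like this (illustrative only):

```
Which planet is known as the Red Planet?
A. Venus  B. Mars  C. Jupiter  D. Saturn

Please show your choice in the `answer` field with only the choice letter,
e.g., "answer": "C".
```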

#### :sparkles: Run Qwen3-235B-A22B-Instruct via llama.cpp:

For Qwen3-235B-A22B, we will specifically use Llama.cpp for optimized inference and a plethora of options.

{% hint style="info" %}
If you want a **full-precision, unquantized version**, use our `BF16` version; `Q8_K_XL` and `Q8_0` are near-lossless alternatives!
{% endhint %}

1. Obtain the latest `llama.cpp` on [GitHub here](https://github.com/ggml-org/llama.cpp). You can follow the build instructions below as well. Change `-DGGML_CUDA=ON` to `-DGGML_CUDA=OFF` if you don't have a GPU or just want CPU inference. **For Apple Mac / Metal devices**, set `-DGGML_CUDA=OFF` then continue as usual - Metal support is on by default.

```bash
apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp
```

2\. You can use llama.cpp to download and run the model directly, but we normally suggest using `huggingface_hub` (see step 3). To use llama.cpp directly, do:

```bash
./llama.cpp/llama-cli \
    -hf unsloth/Qwen3-235B-A22B-Instruct-2507-GGUF:Q2_K_XL \
    --ctx-size 16384 \
    --n-gpu-layers 99 \
    -ot ".ffn_.*_exps.=CPU" \
    --temp 0.7 \
    --min-p 0.0 \
    --top-p 0.8 \
    --top-k 20 \
    --repeat-penalty 1.0
```

3\. Alternatively, download the model via `huggingface_hub` (after running `pip install huggingface_hub hf_transfer`). You can choose `UD-Q2_K_XL` or other quantized versions.

```python
# !pip install huggingface_hub hf_transfer
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "0" # Can sometimes rate limit, so set to 0 to disable
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id = "unsloth/Qwen3-235B-A22B-Instruct-2507-GGUF",
    local_dir = "unsloth/Qwen3-235B-A22B-Instruct-2507-GGUF",
    allow_patterns = ["*UD-Q2_K_XL*"],
)
```

4\. Run the model and try any prompt.

5\. Edit `--threads -1` for the number of CPU threads, `--ctx-size 262144` for the context length, and `--n-gpu-layers 99` for how many layers to offload to the GPU. Lower it if your GPU runs out of memory, and remove it entirely for CPU-only inference.

{% hint style="success" %}
Use `-ot ".ffn_.*_exps.=CPU"` to offload all MoE layers to the CPU! This effectively allows you to fit all non MoE layers on 1 GPU, improving generation speeds. You can customize the regex expression to fit more layers if you have more GPU capacity.
{% endhint %}

{% code overflow="wrap" %}

```bash
./llama.cpp/llama-cli \
    --model unsloth/Qwen3-235B-A22B-Instruct-2507-GGUF/UD-Q2_K_XL/Qwen3-235B-A22B-Instruct-2507-UD-Q2_K_XL-00001-of-00002.gguf \
    --ctx-size 16384 \
    --n-gpu-layers 99 \
    -ot ".ffn_.*_exps.=CPU" \
    --temp 0.7 \
    --min-p 0.0 \
    --top-p 0.8 \
    --top-k 20
```

{% endcode %}

### 🛠️ Improving generation speed <a href="#improving-generation-speed" id="improving-generation-speed"></a>

If you have more VRAM, you can try offloading more MoE layers, or offloading whole layers themselves.

Normally, `-ot ".ffn_.*_exps.=CPU"` offloads all MoE layers to the CPU! This effectively allows you to fit all non-MoE layers on 1 GPU, improving generation speeds. You can customize the regex to fit more layers on the GPU if you have more capacity.

If you have a bit more GPU memory, try `-ot ".ffn_(up|down)_exps.=CPU"`, which offloads the up and down projection MoE layers.

Try `-ot ".ffn_(up)_exps.=CPU"` if you have even more GPU memory. This offloads only up projection MoE layers.

You can also customize the regex, for example `-ot "\.(6|7|8|9|[0-9][0-9]|[0-9][0-9][0-9])\.ffn_(gate|up|down)_exps.=CPU"` means to offload gate, up and down MoE layers but only from the 6th layer onwards.

The [latest llama.cpp release](https://github.com/ggml-org/llama.cpp/pull/14363) also introduces high throughput mode. Use `llama-parallel`. Read more about it [here](https://github.com/ggml-org/llama.cpp/tree/master/examples/parallel). You can also **quantize the KV cache to 4bits** for example to reduce VRAM / RAM movement, which can also make the generation process faster. The [next section](#how-to-fit-long-context-256k-to-1m) talks about KV cache quantization.

### 📐How to fit long context <a href="#how-to-fit-long-context-256k-to-1m" id="how-to-fit-long-context-256k-to-1m"></a>

To fit longer context, you can use **KV cache quantization** to quantize the K and V caches to lower bits. This can also increase generation speed due to reduced RAM / VRAM data movement. The allowed options for K quantization (default is `f16`) include the below.

`--cache-type-k f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1`

You should use the `_1` variants (for example `q4_1`, `q5_1`) for somewhat increased accuracy, albeit slightly slower. So try out `--cache-type-k q4_1`.

You can also quantize the V cache, but you will need to **compile llama.cpp with Flash Attention support** via `-DGGML_CUDA_FA_ALL_QUANTS=ON`, and use `--flash-attn` to enable it. With Flash Attention enabled, you can then use `--cache-type-v q4_1`.
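
Putting this together, here is a sketch of a long-context run with both caches quantized. It assumes llama.cpp was rebuilt with `-DGGML_CUDA_FA_ALL_QUANTS=ON`, and the context size shown is only an illustration:

```bash
./llama.cpp/llama-cli \
    --model unsloth/Qwen3-235B-A22B-Instruct-2507-GGUF/UD-Q2_K_XL/Qwen3-235B-A22B-Instruct-2507-UD-Q2_K_XL-00001-of-00002.gguf \
    --ctx-size 131072 \
    --n-gpu-layers 99 \
    -ot ".ffn_.*_exps.=CPU" \
    --flash-attn \
    --cache-type-k q4_1 \
    --cache-type-v q4_1 \
    --temp 0.7 --min-p 0.0 --top-p 0.8 --top-k 20
```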

## 🦥 Fine-tuning Qwen3-2507 with Unsloth

Unsloth makes [Qwen3](https://unsloth.ai/docs/models/tutorials/qwen3-how-to-run-and-fine-tune/..#fine-tuning-qwen3-with-unsloth) and Qwen3-2507 fine-tuning 2x faster, with 70% less VRAM, and supports 8x longer context lengths. To fine-tune the 30B-A3B variant with QLoRA (4-bit), you will need about a 40GB A100 GPU.

For a notebook: because the 30B model cannot fit in Colab's free 16GB GPUs, you will need to use a 40GB A100. You can use our Conversational notebook and replace the dataset with your own. This time you do not need to combine reasoning into your dataset, as the model has no reasoning.

* [Qwen3 (14B) Reasoning + Conversational notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen3_\(14B\)-Reasoning-Conversational.ipynb)

If you have an old version of Unsloth and/or are fine-tuning locally, install the latest version of Unsloth:

```bash
pip install --upgrade --force-reinstall --no-cache-dir unsloth unsloth_zoo
```

### Fine-tuning Qwen3-2507 MoE models

Fine-tuning support includes the MoE models: 30B-A3B and 235B-A22B. Qwen3-30B-A3B works on 30GB of VRAM with Unsloth. When fine-tuning MoEs, it's probably not a good idea to fine-tune the router layer, so we disable it by default.

**Qwen3-2507-4B notebooks for:** [Thinking](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen3_\(4B\)-Thinking.ipynb) and [Instruct](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen3_\(4B\)-Instruct.ipynb)

The 30B-A3B fits in 30GB VRAM, but you may lack RAM or disk space, since the full 16-bit model must be downloaded and converted to 4-bit on the fly for QLoRA fine-tuning. This is due to issues importing 4-bit BnB MoE models directly, and it only affects MoE models.

{% hint style="warning" %}
If you're fine-tuning the MOE models, please use `FastModel` and not `FastLanguageModel`
{% endhint %}

```python
from unsloth import FastModel
import torch
model, tokenizer = FastModel.from_pretrained(
    model_name = "unsloth/Qwen3-30B-A3B-Instruct-2507",
    max_seq_length = 2048, # Choose any for long context!
    load_in_4bit = True,  # 4 bit quantization to reduce memory
    load_in_8bit = False, # [NEW!] A bit more accurate, uses 2x memory
    full_finetuning = False, # [NEW!] We have full finetuning now!
    # token = "hf_...", # use one if using gated models
)
```
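
After loading, you would typically attach LoRA adapters before training. The sketch below uses illustrative defaults (the rank, alpha and target modules are assumptions, not prescriptions); note that the router/gating layers are left out, matching Unsloth's default of not training the router on MoE models:

```python
# Minimal sketch: attach LoRA adapters for QLoRA fine-tuning.
# Values are illustrative; adjust rank/alpha/modules for your setup.
model = FastModel.get_peft_model(
    model,
    r = 16,                  # LoRA rank; higher = more capacity, more VRAM
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    target_modules = [       # attention + MLP projections; router layers excluded
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    use_gradient_checkpointing = "unsloth",  # reduces VRAM for longer contexts
    random_state = 3407,
)
```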

<figure><img src="https://3215535692-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FxhOjnexMCB3dmuQFQ2Zq%2Fuploads%2Fgit-blob-be34c101c627020c7a6cfb6cd249f2462587d235%2Fqwen33%20mascot.png?alt=media" alt=""><figcaption></figcaption></figure>

