# Qwen3-Coder: How to Run Locally

Qwen3-Coder is Qwen's new series of coding agent models, available in 30B (**Qwen3-Coder-Flash**) and 480B parameters. **Qwen3-Coder-480B-A35B-Instruct** achieves SOTA coding performance rivalling Claude Sonnet-4, GPT-4.1, and [Kimi K2](/docs/jp/moderu/tutorials/kimi-k2-thinking-how-to-run-locally.md), scoring 61.8% on Aider Polyglot and supporting a 256K-token context (extendable to 1M).

We also uploaded versions of Qwen3-Coder with the context length extended to **1M** via YaRN, as well as full-precision 8-bit and 16-bit versions. [Unsloth](https://github.com/unslothai/unsloth) also now supports fine-tuning and [RL](/docs/jp/meru/reinforcement-learning-rl-guide.md) of Qwen3-Coder.

{% hint style="success" %}
[**UPDATE:** We fixed tool calling for Qwen3-Coder!](#tool-calling-fixes) You can now use tool calling seamlessly in llama.cpp, Ollama, LMStudio, Open WebUI, Jan and others. The issue was universal and affected all uploads (not just Unsloth's), and we have been in touch with the Qwen team about our fixes! [Read more](#tool-calling-fixes)
{% endhint %}

[Run 30B-A3B](#run-qwen3-coder-30b-a3b-instruct) · [Run 480B-A35B](#run-qwen3-coder-480b-a35b-instruct)

{% hint style="success" %}
**Do** [**Unsloth Dynamic Quants**](/docs/jp/ji-ben/unsloth-dynamic-2.0-ggufs.md) **work?** Yes, and very well. In third-party testing on the Aider Polyglot benchmark, the **UD-Q4\_K\_XL (276GB)** dynamic quant nearly matched the **full bf16 (960GB)** Qwen3-Coder model, scoring 60.9% vs 61.8%. [More details here.](https://huggingface.co/unsloth/Qwen3-Coder-480B-A35B-Instruct-GGUF/discussions/8)
{% endhint %}

#### **Qwen3 Coder - Unsloth Dynamic 2.0 GGUFs**:

| Dynamic 2.0 GGUF (to run) | 1M Context Dynamic 2.0 GGUF |
| ------------------------- | --------------------------- |

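## 🖥️ **Running Qwen3-Coder**

Below are guides for the [**30B-A3B**](#run-qwen3-coder-30b-a3b-instruct) and [**480B-A35B**](#run-qwen3-coder-480b-a35b-instruct) variants of the model.

### :gear: Recommended Settings

Qwen recommends the following inference settings for both models:

`temperature=0.7`, `top_p=0.8`, `top_k=20`, `repetition_penalty=1.05`

* **Temperature of 0.7**
* Top\_K of 20
* Min\_P of 0.00 (optional, but 0.01 also works well; the llama.cpp default is 0.1)
* Top\_P of 0.8
* **Repetition penalty of 1.05**
* Chat template:

  ```
  <|im_start|>user
  Hey there!<|im_end|>
  <|im_start|>assistant
  What is 1+1?<|im_end|>
  <|im_start|>user
  2<|im_end|>
  <|im_start|>assistant
  ```
* Recommended context output: 65,536 tokens (can be increased).

**Chat template/prompt format with newlines un-rendered**

{% code overflow="wrap" %}
```
<|im_start|>user\nHey there!<|im_end|>\n<|im_start|>assistant\nWhat is 1+1?<|im_end|>\n<|im_start|>user\n2<|im_end|>\n<|im_start|>assistant\n
```
{% endcode %}

**Chat template for tool calling** (getting the current temperature for San Francisco). See the tool calling section further down this page for how to format tool calls.

```
<|im_start|>user
What's the temperature in San Francisco now? How about tomorrow?<|im_end|>
<|im_start|>assistant
<tool_call>\n<function=get_current_temperature>\n<parameter=location>\nSan Francisco, CA, USA
</parameter>\n</function>\n</tool_call><|im_end|>
<|im_start|>user
<tool_response>
{"temperature": 26.1, "location": "San Francisco, CA, USA", "unit": "celsius"}
</tool_response>\n<|im_end|>
```

{% hint style="info" %}
This model supports only non-thinking mode and does not generate `<think></think>` blocks in its output. Specifying `enable_thinking=False` is no longer required.
{% endhint %}

### Run Qwen3-Coder-30B-A3B-Instruct:

To achieve inference speeds of 6+ tokens per second with our Dynamic 4-bit quant, you need at least **18GB of unified memory** (combined VRAM and RAM) or **18GB of system RAM** alone. As a rule of thumb, your available memory should match or exceed the size of the model you are using. For example, the UD-Q8\_K\_XL quant (full precision) is 32.5GB, so it needs at least **33GB of unified memory** (VRAM + RAM) or **33GB of RAM** for optimal performance.

**NOTE:** The model can run on less memory than its total size, but inference will be slower. The full amount of memory is only needed for the fastest speeds. Since this is a non-thinking model, there is no need to set `enable_thinking=False`, and the model does not generate `<think></think>` blocks.

{% hint style="info" %}
Follow the [**best practices above**](#recommended-settings). They are the same as for the 480B model.
{% endhint %}

#### 🦙 Ollama: Run Qwen3-Coder-30B-A3B-Instruct Tutorial

1. Install `ollama` if you haven't already! You can only run models up to 32B in size.

   ```bash
   apt-get update
   apt-get install pciutils -y
   curl -fsSL https://ollama.com/install.sh | sh
   ```
2. 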
   Run the model! If it fails, note that you can call `ollama serve` in another terminal. All of our fixes and suggested parameters (temperature etc.) are included in `params` in our Hugging Face upload!

   ```bash
   ollama run hf.co/unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF:UD-Q4_K_XL
   ```

#### :sparkles: Llama.cpp: Run Qwen3-Coder-30B-A3B-Instruct Tutorial

1. Obtain the latest `llama.cpp` from [GitHub here](https://github.com/ggml-org/llama.cpp). You can also follow the build instructions below. Change `-DGGML_CUDA=ON` to `-DGGML_CUDA=OFF` if you don't have a GPU or only want CPU inference. **For Apple Mac / Metal devices**, set `-DGGML_CUDA=OFF` and then continue as usual. Metal support is on by default.

   ```bash
   apt-get update
   apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
   git clone https://github.com/ggml-org/llama.cpp
   cmake llama.cpp -B llama.cpp/build \
       -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
   cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-gguf-split
   cp llama.cpp/build/bin/llama-* llama.cpp
   ```
2. You can pull the model directly from Hugging Face via:

   ```bash
   ./llama.cpp/llama-cli \
       -hf unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF:Q4_K_XL \
       --jinja -ngl 99 --ctx-size 32768 \
       --temp 0.7 --min-p 0.0 --top-p 0.80 --top-k 20 --repeat-penalty 1.05
   ```
3. 
   Alternatively, download the model with the script below (after running `pip install huggingface_hub hf_transfer`). You can choose UD-Q4\_K\_XL or another quantized version. If downloads get stuck, see [Hugging Face Hub, XET debugging](/docs/jp/ji-ben/troubleshooting-and-faqs/hugging-face-hub-xet-debugging.md).

   ```python
   # !pip install huggingface_hub hf_transfer
   import os

   # Enable hf_transfer for faster downloads (set before importing huggingface_hub)
   os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

   from huggingface_hub import snapshot_download

   snapshot_download(
       repo_id = "unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF",
       local_dir = "unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF",
       allow_patterns = ["*UD-Q4_K_XL*"],  # only fetch the UD-Q4_K_XL files
   )
   ```

### Run Qwen3-Coder-480B-A35B-Instruct:

To achieve inference speeds of 6+ tokens per second with our 1-bit quant, we recommend at least **150GB of unified memory** (combined VRAM and RAM) or **150GB of system RAM** alone. As a rule of thumb, your available memory should match or exceed the size of the model you are using. For example, the Q2\_K\_XL quant is 180GB, so it needs at least **180GB of unified memory** (VRAM + RAM) or **180GB of RAM** for optimal performance.

**NOTE:** The model can run on less memory than its total size, but inference will be slower. The full amount of memory is only needed for the fastest speeds.

{% hint style="info" %}
Follow the [**best practices above**](#recommended-settings). They are the same as for the 30B model.
{% endhint %}

#### 📖 Llama.cpp: Run Qwen3-Coder-480B-A35B-Instruct Tutorial

For Coder-480B-A35B, we specifically use llama.cpp for its optimized inference and wide range of options.

{% hint style="success" %}
If you want a **full-precision unquantized version**, use our `Q8_K_XL, Q8_0` or `BF16` versions!
{% endhint %}

1. Obtain the latest `llama.cpp` from [GitHub here](https://github.com/ggml-org/llama.cpp). You can also follow the build instructions below. Change `-DGGML_CUDA=ON` to `-DGGML_CUDA=OFF` if you don't have a GPU or only want CPU inference.

   ```bash
   apt-get update
   apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
   git clone https://github.com/ggml-org/llama.cpp
   cmake llama.cpp -B llama.cpp/build \
       -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
   cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-gguf-split
   cp llama.cpp/build/bin/llama-* llama.cpp
   ```
2. 
   You can use llama.cpp directly to download the model, but we normally suggest using `huggingface_hub`. To use llama.cpp directly, run:

   ```bash
   ./llama.cpp/llama-cli \
       -hf unsloth/Qwen3-Coder-480B-A35B-Instruct-GGUF:Q2_K_XL \
       --ctx-size 16384 \
       --n-gpu-layers 99 \
       -ot ".ffn_.*_exps.=CPU" \
       --temp 0.7 \
       --min-p 0.0 \
       --top-p 0.8 \
       --top-k 20 \
       --repeat-penalty 1.05
   ```
3. Or download the model with the script below (after running `pip install huggingface_hub hf_transfer`). You can choose UD-Q2\_K\_XL or another quantized version.

   ```python
   # !pip install huggingface_hub hf_transfer
   import os

   # hf_transfer can sometimes get rate limited, so it is disabled here ("0")
   os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "0"

   from huggingface_hub import snapshot_download

   snapshot_download(
       repo_id = "unsloth/Qwen3-Coder-480B-A35B-Instruct-GGUF",
       local_dir = "unsloth/Qwen3-Coder-480B-A35B-Instruct-GGUF",
       allow_patterns = ["*UD-Q2_K_XL*"],  # only fetch the UD-Q2_K_XL shards
   )
   ```
4. Run the model in conversation mode and try any prompt.
5. Edit `--threads -1` for the number of CPU threads, `--ctx-size 262144` for the context length, and `--n-gpu-layers 99` for the number of layers to offload to the GPU. Try lowering it if your GPU runs out of memory, and remove it entirely for CPU-only inference.

{% hint style="success" %}
Use `-ot ".ffn_.*_exps.=CPU"` to offload all MoE layers to the CPU! This lets you fit all non-MoE layers on a single GPU, which improves generation speed. If you have more GPU capacity, you can adjust the regular expression to keep more layers on the GPU. More options are discussed [here](#improving-generation-speed).
{% endhint %}

{% code overflow="wrap" %}
```bash
./llama.cpp/llama-cli \
    --model unsloth/Qwen3-Coder-480B-A35B-Instruct-GGUF/UD-Q2_K_XL/Qwen3-Coder-480B-A35B-Instruct-UD-Q2_K_XL-00001-of-00004.gguf \
    --ctx-size 16384 \
    --n-gpu-layers 99 \
    -ot ".ffn_.*_exps.=CPU" \
    --temp 0.7 \
    --min-p 0.0 \
    --top-p 0.8 \
    --top-k 20 \
    --repeat-penalty 1.05
```
{% endcode %}

{% hint style="success" %}
Also don't forget about the new Qwen3 update. Run [**Qwen3-235B-A22B-Instruct-2507**](/docs/jp/moderu/tutorials/qwen3-next.md) locally with llama.cpp.
{% endhint %}

#### :tools: Improving generation speed

If you have more VRAM, you can keep more MoE layers on the GPU, or place whole layers on the GPU.

Normally, `-ot ".ffn_.*_exps.=CPU"` offloads all MoE layers to the CPU! This lets you fit all non-MoE layers on a single GPU, which improves generation speed, and it uses the least VRAM. If you have more GPU capacity, you can adjust the regular expression to keep more layers on the GPU.

If you have a bit more GPU memory, try `-ot ".ffn_(up|down)_exps.=CPU"`. This offloads only the up and down projection MoE layers.

If you have even more GPU memory, try `-ot ".ffn_(up)_exps.=CPU"`. This offloads only the up projection MoE layers.

You can also customize the regex. For example, `-ot "\.(6|7|8|9|[0-9][0-9]|[0-9][0-9][0-9])\.ffn_(gate|up|down)_exps.=CPU"` offloads the gate, up and down MoE layers, but only from layer 6 onwards.

This [llama.cpp release](https://github.com/ggml-org/llama.cpp/pull/14363) also introduces a high-throughput mode: use `llama-parallel`. Read more about it [here](https://github.com/ggml-org/llama.cpp/tree/master/examples/parallel). You can also **quantize the KV cache to 4 bits**, for example, to reduce data movement between VRAM and RAM, which can further speed up generation.

#### :triangular\_ruler: How to fit long context (256K to 1M)

To fit a longer context, you can use **KV cache quantization** to quantize the K and V caches to lower bit widths. This also improves generation speed because less data moves between RAM and VRAM. The allowed options for K quantization (the default is `f16`) include:

`--cache-type-k f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1`

Use the `_1` variants, e.g. `q4_1, q5_1`, for somewhat better accuracy at slightly lower speed.

You can also quantize the V cache, but you will need to **compile llama.cpp with Flash Attention support** via `-DGGML_CUDA_FA_ALL_QUANTS=ON`, and pass `--flash-attn` to enable it.

We also uploaded 1 million context length GGUFs via YaRN scaling [here](https://unsloth.ai/docs/jp/).
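To get a feel for how much memory the `--cache-type-k` / `--cache-type-v` choices above can save, here is a minimal back-of-the-envelope sketch. It is not part of llama.cpp: `kv_cache_gib` is a hypothetical helper, the bytes-per-element values follow ggml's 32-value block layouts, and the layer count, KV head count and head dimension in the example are assumptions for illustration only (check the model's `config.json` for the real values). Actual usage also includes compute buffers, so treat the numbers as rough lower bounds.

```python
# Approximate storage cost per cached value for each llama.cpp cache type.
# Quantized types pack 32 values per block, e.g. q8_0 uses 34 bytes per block.
BYTES_PER_ELEMENT = {
    "f32": 4.0,
    "f16": 2.0,
    "bf16": 2.0,
    "q8_0": 34 / 32,
    "q5_1": 24 / 32,
    "q5_0": 22 / 32,
    "q4_1": 20 / 32,
    "q4_0": 18 / 32,
    "iq4_nl": 18 / 32,
}


def kv_cache_gib(n_layers, n_kv_heads, head_dim, ctx_size,
                 type_k="f16", type_v="f16"):
    """Estimate the KV cache size in GiB for a given context length."""
    # Number of values stored in the K cache (the V cache has the same count).
    elements = n_layers * n_kv_heads * head_dim * ctx_size
    total_bytes = elements * (BYTES_PER_ELEMENT[type_k] + BYTES_PER_ELEMENT[type_v])
    return total_bytes / 1024**3


if __name__ == "__main__":
    # ASSUMED model dimensions, for illustration only.
    dims = dict(n_layers=62, n_kv_heads=8, head_dim=128, ctx_size=262144)
    for cache_type in ("f16", "q8_0", "q4_0"):
        size = kv_cache_gib(**dims, type_k=cache_type, type_v=cache_type)
        print(f"{cache_type}: {size:.1f} GiB")
    # f16: 62.0 GiB
    # q8_0: 32.9 GiB
    # q4_0: 17.4 GiB
```

Under these assumed dimensions, moving both caches from `f16` to `q4_0` cuts the estimate by roughly 3.5x, which is why cache quantization matters most at 256K context and beyond.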
## :toolbox: Tool Calling Fixes

We managed to fix tool calling via `llama.cpp --jinja`, specifically for serving through `llama-server`! If you are downloading our 30B-A3B quants, there is no need to worry, as they already include our fixes. For the 480B-A35B model, please:

1. Download the first file for UD-Q2\_K\_XL and replace your current file
2. Use `snapshot_download` as usual, which will automatically overwrite the old files
3. Use the new chat template via `--chat-template-file`. See the [GGUF chat template](https://huggingface.co/unsloth/Qwen3-Coder-480B-A35B-Instruct-GGUF?chat_template=default) or [chat\_template.jinja](https://huggingface.co/unsloth/Qwen3-Coder-480B-A35B-Instruct/raw/main/chat_template.jinja)
4. As an extra, we also made a single 150GB UD-IQ1\_M file that works with Ollama

This should resolve the tool calling issues that were reported.

### Using Tool Calling

To show the prompt format for tool calling, let's walk through an example. We created a Python function called `get_current_temperature`, which should return the current temperature for a location. For now it is a placeholder that always returns 26.1 degrees Celsius. You should change this to a real function!

{% code overflow="wrap" %}
```python
def get_current_temperature(location: str, unit: str = "celsius"):
    """Get current temperature at a location.

    Args:
        location: The location to get the temperature for, in the format "City, State, Country".
        unit: The unit to return the temperature in. Defaults to "celsius". (choices: ["celsius", "fahrenheit"])

    Returns:
        the temperature, the location, and the unit in a dict
    """
    return {
        "temperature": 26.1,  # PRE_CONFIGURED -> change this!
        "location": location,
        "unit": unit,
    }
```
{% endcode %}

Then use the tokenizer to create the entire prompt:

{% code overflow="wrap" %}
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("unsloth/Qwen3-Coder-480B-A35B-Instruct")

messages = [
    {'role': 'user', 'content': "What's the temperature in San Francisco now? How about tomorrow?"},
    {'content': "", 'role': 'assistant', 'function_call': None, 'tool_calls': [
        {'id': 'ID', 'function': {'arguments': {"location": "San Francisco, CA, USA"}, 'name': 'get_current_temperature'}, 'type': 'function'},
    ]},
    {'role': 'tool', 'content': '{"temperature": 26.1, "location": "San Francisco, CA, USA", "unit": "celsius"}', 'tool_call_id': 'ID'},
]

# Render the conversation into a prompt string without tokenizing it
prompt = tokenizer.apply_chat_template(messages, tokenize = False)
```
{% endcode %}

## :bulb: Performance Benchmarks

{% hint style="info" %}
These official benchmarks are for the full BF16 checkpoint. To use it, simply use the `Q8_K_XL, Q8_0, BF16`
checkpoints we uploaded. You can still use tricks like MoE offloading with these versions as well!
{% endhint %}

Here are the benchmarks for the 480B model:

#### Agentic Coding

| Benchmark | Qwen3-Coder 480B-A35B-Instruct | Kimi-K2 | DeepSeek-V3-0324 | Claude Sonnet-4 | GPT-4.1 |
| --- | --- | --- | --- | --- | --- |
| Terminal-Bench | 37.5 | 30.0 | 2.5 | 35.5 | 25.3 |
| SWE-bench Verified w/ OpenHands (500 turns) | 69.6 | – | – | 70.4 | – |
| SWE-bench Verified w/ OpenHands (100 turns) | 67.0 | 65.4 | 38.8 | 68.0 | 48.6 |
| SWE-bench Verified w/ Private Scaffolding | – | 65.8 | – | 72.7 | 63.8 |
| SWE-bench Live | 26.3 | 22.3 | 13.0 | 27.7 | – |
| SWE-bench Multilingual | 54.7 | 47.3 | 13.0 | 53.3 | 31.5 |
| Multi-SWE-bench mini | 25.8 | 19.8 | 7.5 | 24.8 | – |
| Multi-SWE-bench flash | 27.0 | 20.7 | – | 25.0 | – |
| Aider-Polyglot | 61.8 | 60.0 | 56.9 | 56.4 | 52.4 |
| Spider2 | 31.1 | 25.2 | 12.8 | 31.1 | 16.5 |

#### Agentic Browser Use

| Benchmark | Qwen3-Coder 480B-A35B-Instruct | Kimi-K2 | DeepSeek-V3-0324 | Claude Sonnet-4 | GPT-4.1 |
| --- | --- | --- | --- | --- | --- |
| WebArena | 49.9 | 47.4 | 40.0 | 51.1 | 44.3 |
| Mind2Web | 55.8 | 42.7 | 36.0 | 47.4 | 49.6 |

#### Agentic Tool Use

| Benchmark | Qwen3-Coder 480B-A35B-Instruct | Kimi-K2 | DeepSeek-V3-0324 | Claude Sonnet-4 | GPT-4.1 |
| --- | --- | --- | --- | --- | --- |
| BFCL-v3 | 68.7 | 65.2 | 56.9 | 73.3 | 62.9 |
| TAU-Bench Retail | 77.5 | 70.7 | 59.1 | 80.5 | – |
| TAU-Bench Airline | 60.0 | 53.5 | 40.0 | 60.0 | – |
