✨Gemma 4 - How to Run Locally
Run Google’s new Gemma 4 models locally, including E2B, E4B, 26B-A4B, and 31B.
Gemma 4 is Google DeepMind’s new family of open models, including E2B, E4B, 26B-A4B, and 31B. These multimodal, hybrid-thinking models support 140+ languages, up to 256K context, and come in both dense and MoE variants. E2B and E4B also support image and audio. Released under the Apache-2.0 license, Gemma 4 can run locally on your device and be fine-tuned in Unsloth Studio. Gemma-4-E2B and E4B run on 5GB RAM (4-bit) or 15GB (full 16-bit precision). Gemma-4-26B-A4B runs on 18GB (4-bit) or 28GB (8-bit). Gemma-4-31B needs 20GB RAM (4-bit) or 34GB (8-bit). See: Unsloth Gemma 4 GGUFs
Usage Guide
Gemma 4 excels at reasoning, coding, tool use, long-context tasks, agentic workflows, and multimodal tasks. The smaller E2B and E4B variants are built for phones and laptops.
E2B
Dense + PLE (128K context) Supports: Text, Image, Audio
For phone / edge inference, ASR, speech translation
E4B
Dense + PLE (128K context) Supports: Text, Image, Audio
Small model for laptops and fast local multimodal use
26B-A4B
MoE (256K context) Supports: Text, Image
Best speed / quality tradeoff for computer use
31B
Dense (256K context) Supports: Text, Image
Strongest performance at slower inference
Should I pick 26B-A4B or 31B?
26B-A4B - balances speed and accuracy. Its MoE design makes it faster than 31B, with 4B active parameters. Pick it if RAM is limited and you are fine trading a bit of quality for speed.
31B - currently the strongest Gemma 4 model. Pick it for maximum quality if you have enough memory and can accept slightly slower speeds.
Gemma 4 Benchmarks
| Model |  |  |  |  |
| --- | --- | --- | --- | --- |
| 31B | 85.2% | 89.2% | 80.0% | 76.9% |
| 26B-A4B | 82.6% | 88.3% | 77.1% | 73.8% |
| E4B | 69.4% | 42.5% | 52.0% | 52.6% |
| E2B | 60.0% | 37.5% | 44.0% | 44.2% |
Hardware requirements
Table: Gemma 4 inference GGUF recommended hardware requirements (units = total memory: RAM + VRAM, or unified memory). You can run Gemma 4 on macOS, NVIDIA RTX GPUs, etc.
| Model | 4-bit | 8-bit | 16-bit |
| --- | --- | --- | --- |
| E2B | 4 GB | 5–8 GB | 10 GB |
| E4B | 5.5–6 GB | 9–12 GB | 16 GB |
| 26B-A4B | 16–18 GB | 28–30 GB | 52 GB |
| 31B | 17–20 GB | 34–38 GB | 62 GB |
As a rule of thumb, your total available memory should at least exceed the size of the quantized model you download. If it does not, llama.cpp can still run using partial RAM / disk offload, but generation will be slower. Longer context windows also require additional memory and compute.
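The rule of thumb above can be sketched as a quick estimator: quantized weight size is roughly parameter count times bits per weight, divided by 8 bits per byte (KV cache and runtime overhead not included, which is why the recommendations in the table run a few GB higher):

```python
def quantized_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Rough on-disk/in-memory size of the quantized weights alone."""
    return params_billions * bits_per_weight / 8

# 31B at 4-bit: ~15.5 GB of weights before KV cache and overhead,
# consistent with the 17-20 GB total-memory recommendation above.
print(round(quantized_size_gb(31, 4), 1))  # 15.5
```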
Recommended Settings
It is recommended to use Google's default Gemma 4 parameters:
temperature = 1.0
top_p = 0.95
top_k = 64
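These defaults map directly onto llama.cpp's sampling flags; a sketch (the .gguf filename is a placeholder for whichever quant you downloaded):

```shell
# Google's recommended Gemma 4 defaults as llama.cpp sampling flags.
./llama.cpp/llama-cli \
    --model gemma-4-model.gguf \
    --temp 1.0 \
    --top-p 0.95 \
    --top-k 64
```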
Recommended practical defaults for local inference:
Start with 32K context for responsiveness, then increase if needed.
Keep repetition/presence penalty disabled or 1.0 unless you see looping.
The end-of-sentence (EOS) token is <turn|>
Gemma 4's max context is 128K for E2B / E4B and 256K for 26B A4B / 31B.
Thinking Mode
Compared to older Gemma chat templates, Gemma 4 uses the standard system, assistant, and user roles and adds explicit thinking control.
How to enable thinking:
Add the token <|think|> at the start of the system prompt.
Thinking enabled
Thinking disabled
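A minimal sketch of the toggle, assuming the standard role-based message format described above (the exact message schema your runtime expects may differ):

```python
# Prepend <|think|> to the system prompt to enable thinking; omit it to
# disable thinking. (Message dicts here are hypothetical, for illustration.)
def build_messages(user_prompt: str, system_prompt: str, thinking: bool) -> list:
    if thinking:
        system_prompt = "<|think|>" + system_prompt
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]

msgs = build_messages("What is the capital of France?", "You are helpful.", thinking=True)
print(msgs[0]["content"])  # <|think|>You are helpful.
```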
Output behavior:
When thinking is enabled, the model outputs its internal reasoning channel before the final answer.
When thinking is disabled, the larger models may still emit an empty thought block before the final answer.
For example, when asked "What is the capital of France?" with thinking enabled, the model first emits its reasoning channel and then the final answer.
Multi-turn chat rule:
For multi-turn conversations, only keep the final visible answer in chat history. Do not feed prior thought blocks back into the next turn.
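The multi-turn rule above can be sketched as a small history-cleaning helper. The thought-block delimiters used here are an assumption for illustration; adjust the pattern to whatever markers your runtime actually emits:

```python
import re

# Hypothetical delimiters -- the real thought-block markers depend on the
# chat template your runtime uses; adjust the pattern accordingly.
THOUGHT_BLOCK = re.compile(r"<\|think\|>.*?<\|/think\|>", re.DOTALL)

def visible_answer(assistant_output: str) -> str:
    """Strip thought blocks so only the final answer re-enters chat history."""
    return THOUGHT_BLOCK.sub("", assistant_output).strip()

print(visible_answer("<|think|>reasoning...<|/think|>Paris."))  # Paris.
```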
Run Gemma 4 Tutorials
Because Gemma 4 GGUFs come in several sizes, the recommended starting point is 8-bit for the small models and Dynamic 4-bit for the larger ones. Gemma 4 GGUFs:
🦥 Unsloth Studio Guide
🦙 Llama.cpp Guide
Run Gemma 4 for free via our Unsloth Studio Google Colab notebook:
🦙 Llama.cpp Guide
For this guide we will be utilizing Dynamic 4-bit for the 26B-A4B and 31B and 8-bit for E2B and E4B. See: Gemma 4 GGUF collection
For these tutorials, we will be using llama.cpp for fast local inference, which works well even on CPU-only machines.
Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference. For Apple Mac / Metal devices, set -DGGML_CUDA=OFF then continue as usual - Metal support is on by default.
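The build steps described above can be sketched as follows (flip the CUDA flag per your hardware, as noted):

```shell
# Clone and build llama.cpp with CUDA enabled; set -DGGML_CUDA=OFF for
# CPU-only or Apple Metal builds (Metal is on by default on macOS).
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first
```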
If you want to use llama.cpp directly to load models, follow the commands below for each model. UD-Q4_K_XL is the quantization type; you can also download via Hugging Face (step 3). This is similar to ollama run . Use export LLAMA_CACHE="folder" to make llama.cpp save downloads to a specific location. There is no need to set the context length, as llama.cpp automatically uses the exact amount required.
26B-A4B:
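A hedged sketch, assuming the repo is published as unsloth/gemma-4-26b-a4b-GGUF (check the GGUF collection linked above for the exact name):

```shell
# -hf downloads the quant from Hugging Face and caches it locally.
./llama.cpp/llama-cli -hf unsloth/gemma-4-26b-a4b-GGUF:UD-Q4_K_XL
```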
31B:
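Assuming a repo name of unsloth/gemma-4-31b-GGUF (verify against the collection):

```shell
./llama.cpp/llama-cli -hf unsloth/gemma-4-31b-GGUF:UD-Q4_K_XL
```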
E4B:
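Assuming a repo name of unsloth/gemma-4-e4b-GGUF, with the 8-bit quant recommended above for the small models:

```shell
./llama.cpp/llama-cli -hf unsloth/gemma-4-e4b-GGUF:Q8_0
```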
E2B:
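Assuming a repo name of unsloth/gemma-4-e2b-GGUF, again using the recommended 8-bit quant:

```shell
./llama.cpp/llama-cli -hf unsloth/gemma-4-e2b-GGUF:Q8_0
```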
Download the model after installing the required packages via pip install huggingface_hub hf_transfer . You can choose UD-Q4_K_XL or other quantized versions like Q8_0 . If downloads get stuck, see: Hugging Face Hub, XET debugging
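A sketch of the download step using huggingface_hub (the repo id is an assumption; substitute the actual one from the GGUF collection):

```python
from huggingface_hub import snapshot_download

# Repo id is hypothetical -- replace with the actual Gemma 4 GGUF repo.
snapshot_download(
    repo_id="unsloth/gemma-4-26b-a4b-GGUF",
    local_dir="gemma-4-26b-a4b-GGUF",
    allow_patterns=["*UD-Q4_K_XL*"],  # or e.g. "*Q8_0*" for 8-bit
)
```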
Then run the model in conversation mode (with vision mmproj-F16):
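The run step might look like this; filenames are placeholders for the files downloaded above, and multimodal inference goes through llama.cpp's multimodal CLI, which accepts the --mmproj projector file:

```shell
# Conversation mode with vision: pass the mmproj projector alongside the model,
# using Google's recommended sampling defaults and a 32K starting context.
./llama.cpp/llama-mtmd-cli \
    --model gemma-4-26b-a4b-GGUF/gemma-4-26b-a4b-UD-Q4_K_XL.gguf \
    --mmproj gemma-4-26b-a4b-GGUF/mmproj-F16.gguf \
    --temp 1.0 --top-p 0.95 --top-k 64 \
    --ctx-size 32768
```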
🦥 Unsloth Studio Guide
Gemma 4 can now be run and fine-tuned in Unsloth Studio, our new open-source web UI for local AI. Gemma 4 integration is still being finalized and will be available shortly. Unsloth Studio lets you run models locally on macOS, Windows, and Linux and:
Search, download, run GGUFs and safetensor models
Self-healing tool calling + web search
Code execution (Python, Bash)
Automatic inference parameter tuning (temp, top-p, etc.)
Fast CPU + GPU inference via llama.cpp
Train LLMs 2x faster with 70% less VRAM

Search and download Gemma 4
On first launch you will need to create a password to secure your account and sign in again later. You’ll then see a brief onboarding wizard to choose a model, dataset, and basic settings. You can skip it at any time.
Then go to the Studio Chat tab, search for Gemma 4 in the search bar, and download your desired model and quant.
Run Gemma 4
Inference parameters are auto-set when using Unsloth Studio, but you can still change them manually. You can also edit the context length, chat template, and other settings.
For more information, you can view our Unsloth Studio inference guide.

Gemma 4 Best Practices
Prompting examples
Simple reasoning prompt
OCR / document prompt
For OCR, use a high visual token budget like 560 or 1120.
Multi-modal comparison prompt
Audio ASR prompt
Audio translation prompt
Multi-modal Settings
For best results with multimodal prompts, put multimodal content first:
Put image and/or audio before text.
For video, pass a sequence of frames first, then the instruction.
Variable image resolution
Gemma 4 supports multiple visual token budgets:
70, 140, 280, 560, and 1120
Use them like this:
70 / 140: classification, captioning, fast video understanding
280 / 560: general multimodal chat, charts, screens, UI reasoning
1120: OCR, document parsing, handwriting, small text
Audio and video limits
Audio is available on E2B and E4B only.
Audio supports a maximum of 30 seconds.
Video supports a maximum of 60 seconds assuming 1 frame per second processing.
Audio prompt templates
ASR prompt
Speech translation prompt
Resources and links