Gemma 4 - How to Run Locally

Run Google’s new Gemma 4 models locally, including E2B, E4B, 26B-A4B, and 31B.

Gemma 4 is Google DeepMind’s new family of open models, including E2B, E4B, 26B-A4B, and 31B. These multimodal, hybrid-thinking models support 140+ languages and up to 256K context, and come in both dense and MoE variants; E2B and E4B also support image and audio. Released under the Apache-2.0 license, Gemma 4 can run locally on your device and be fine-tuned in Unsloth Studio. Gemma-4-E2B and E4B run on 5 GB RAM (4-bit) or 15 GB (full 16-bit precision), Gemma-4-26B-A4B runs on 18 GB (4-bit) or 28 GB (8-bit), and Gemma-4-31B needs 20 GB RAM (4-bit) or 34 GB (8-bit). See: Unsloth Gemma 4 GGUFs


Usage Guide

Gemma 4 excels at reasoning, coding, tool use, long-context tasks, agentic workflows, and multimodal tasks. The smaller E2B and E4B variants are built for phones and laptops.

| Gemma 4 Variant | Details | Best fit |
| --- | --- | --- |
| E2B | Dense + PLE (128K context). Supports: Text, Image, Audio | Phone / edge inference, ASR, speech translation |
| E4B | Dense + PLE (128K context). Supports: Text, Image, Audio | Small model for laptops and fast local multimodal use |
| 26B-A4B | MoE (256K context). Supports: Text, Image | Best speed / quality tradeoff for computer use |
| 31B | Dense (256K context). Supports: Text, Image | Strongest performance at slower inference |

Should I pick 26B-A4B or 31B?

  • 26B-A4B - balances speed and accuracy. Its MoE design activates only 4B parameters per token, making it faster than 31B. Pick it if RAM is limited and you are fine trading a bit of quality for speed.

  • 31B - currently the strongest Gemma 4 model. Pick it for maximum quality if you have enough memory and can accept slightly slower speeds.

Gemma 4 Benchmarks

| Gemma 4 | MMLU Pro | AIME 2026 (no tools) | LiveCodeBench v6 | MMMU Pro |
| --- | --- | --- | --- | --- |
| 31B | 85.2% | 89.2% | 80.0% | 76.9% |
| 26B-A4B | 82.6% | 88.3% | 77.1% | 73.8% |
| E4B | 69.4% | 42.5% | 52.0% | 52.6% |
| E2B | 60.0% | 37.5% | 44.0% | 44.2% |

Hardware requirements

Table: recommended hardware for Gemma 4 GGUF inference (units = total memory: RAM + VRAM, or unified memory). You can run Gemma 4 on macOS, NVIDIA RTX GPUs, etc.

| Gemma 4 variant | 4-bit | 8-bit | BF16 / FP16 |
| --- | --- | --- | --- |
| E2B | 4 GB | 5–8 GB | 10 GB |
| E4B | 5.5–6 GB | 9–12 GB | 16 GB |
| 26B-A4B | 16–18 GB | 28–30 GB | 52 GB |
| 31B | 17–20 GB | 34–38 GB | 62 GB |


As a rule of thumb, your total available memory should at least exceed the size of the quantized model you download. If it does not, llama.cpp can still run with partial RAM / disk offload, but generation will be slower. Longer context windows also require additional memory and compute.

It is recommended to use Google's default Gemma 4 parameters:

  • temperature = 1.0

  • top_p = 0.95

  • top_k = 64
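With llama.cpp, for example, these defaults map onto llama-cli's sampling flags (the model path below is a placeholder for whichever quant you downloaded):

```shell
# Google's recommended sampling settings, passed explicitly to llama-cli.
# gemma-4.gguf is a placeholder model path, not a real file name.
./llama.cpp/llama-cli -m gemma-4.gguf \
    --temp 1.0 \
    --top-p 0.95 \
    --top-k 64
```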

Recommended practical defaults for local inference:

  • Start with 32K context for responsiveness, then increase as needed.

  • Keep repetition/presence penalty disabled or 1.0 unless you see looping.

  • The end-of-turn (EOS) token is <turn|>


Gemma 4's max context is 128K for E2B / E4B and 256K for 26B A4B / 31B.

Thinking Mode

Compared to older Gemma chat templates, Gemma 4 uses the standard system, assistant, and user roles and adds explicit thinking control.

How to enable thinking:

Add the token <|think|> at the start of the system prompt.


Output behavior:

When thinking is enabled, the model outputs its internal reasoning channel before the final answer.

When thinking is disabled, the larger models may still emit an empty thought block before the final answer.

For example, given "What is the capital of France?" with thinking enabled, the model first emits a short reasoning trace and then the final answer.
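An illustrative transcript (the <think>...</think> delimiters around the reasoning channel are our assumption; check the exact tags your chat template emits):

```
system: <|think|> You are a helpful assistant.
user: What is the capital of France?

assistant:
<think>
Simple geography question. The capital of France is Paris.
</think>
The capital of France is Paris.
```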

Multi-turn chat rule:

For multi-turn conversations, only keep the final visible answer in chat history. Do not feed prior thought blocks back into the next turn.
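This pruning step can be sketched as a small shell helper (again assuming hypothetical <think>...</think> delimiters; substitute whatever tags your template actually emits):

```shell
# Drop the reasoning channel from an assistant reply before appending it
# to the chat history for the next turn. GNU sed's -z flag treats the
# whole input as one string, so the match can span newlines. The greedy
# .* removes everything from the first <think> to the last </think>.
strip_thoughts() {
  printf '%s' "$1" | sed -z 's/<think>.*<\/think>//g'
}
```

For example, `strip_thoughts "$reply"` returns only the visible answer, which is what you feed back in the next turn.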

Run Gemma 4 Tutorials

Because Gemma 4 GGUFs come in several sizes, the recommended starting point is 8-bit for the small models and Dynamic 4-bit for the larger models. See: Gemma 4 GGUFs


Run Gemma 4 for free via our Unsloth Studio Google Colab notebook.

🦙 Llama.cpp Guide

For this guide we will use Dynamic 4-bit for 26B-A4B and 31B, and 8-bit for E2B and E4B. See: Gemma 4 GGUF collection

For these tutorials, we will use llama.cpp for fast local inference, which works well even if you only have a CPU.

1

Obtain the latest llama.cpp from GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference. For Apple Mac / Metal devices, set -DGGML_CUDA=OFF and continue as usual; Metal support is on by default.
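A typical Linux build, following llama.cpp's standard CMake flow (the apt packages are for Debian/Ubuntu; install the equivalents on other distros):

```shell
# Install build tools, then build llama.cpp from source with CUDA.
apt-get update
apt-get install build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first
# Copy the built binaries (llama-cli, llama-mtmd-cli, ...) up one level
cp llama.cpp/build/bin/llama-* llama.cpp/
```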

2

If you want to use llama.cpp directly to load models, you can follow the commands below for each model. UD-Q4_K_XL is the quantization type. You can also download via Hugging Face (step 3). This is similar to `ollama run`. Use export LLAMA_CACHE="folder" to force llama.cpp to save downloads to a specific location. There is no need to set the context length; llama.cpp automatically uses the exact amount required.

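Those per-model commands can be sketched as follows, assuming the GGUF repositories follow Unsloth's usual unsloth/&lt;model&gt;-GGUF:&lt;quant&gt; naming (the exact repository names are an assumption; check the Gemma 4 GGUF collection):

```shell
# Serve each model straight from Hugging Face. Repo names are illustrative.
export LLAMA_CACHE="unsloth"   # where llama.cpp caches downloads

# 26B-A4B (Dynamic 4-bit):
./llama.cpp/llama-cli -hf unsloth/Gemma-4-26B-A4B-GGUF:UD-Q4_K_XL

# 31B (Dynamic 4-bit):
./llama.cpp/llama-cli -hf unsloth/Gemma-4-31B-GGUF:UD-Q4_K_XL

# E4B (8-bit):
./llama.cpp/llama-cli -hf unsloth/Gemma-4-E4B-GGUF:Q8_0

# E2B (8-bit):
./llama.cpp/llama-cli -hf unsloth/Gemma-4-E2B-GGUF:Q8_0
```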

3

Download the model (after running pip install huggingface_hub hf_transfer). You can choose UD-Q4_K_XL or other quantized versions such as Q8_0. If downloads get stuck, see: Hugging Face Hub, XET debugging
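A sketch of the download step using huggingface-cli (the repository name is illustrative; swap the --include pattern for the quant you want):

```shell
pip install huggingface_hub hf_transfer
# Optional: faster downloads on high-bandwidth connections
export HF_HUB_ENABLE_HF_TRANSFER=1
# Fetch only the UD-Q4_K_XL shards plus the vision projector
huggingface-cli download unsloth/Gemma-4-26B-A4B-GGUF \
    --local-dir Gemma-4-26B-A4B-GGUF \
    --include "*UD-Q4_K_XL*" --include "mmproj*"
```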

4

Then run the model in conversation mode (with vision mmproj-F16):
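This step can be sketched as follows (file names are illustrative and depend on what you downloaded in step 3):

```shell
# llama-mtmd-cli handles multimodal GGUF inference; --mmproj attaches
# the vision projector so the model can accept images.
./llama.cpp/llama-mtmd-cli \
    -m Gemma-4-26B-A4B-GGUF/Gemma-4-26B-A4B-UD-Q4_K_XL.gguf \
    --mmproj Gemma-4-26B-A4B-GGUF/mmproj-F16.gguf \
    --temp 1.0 --top-p 0.95 --top-k 64 \
    --ctx-size 32768
```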

🦥 Unsloth Studio Guide

Gemma 4 can now be run and fine-tuned in Unsloth Studio, our new open-source web UI for local AI. Unsloth Studio lets you run models locally on macOS, Windows, and Linux.

1

Install Unsloth

Run in your terminal:

MacOS, Linux, WSL:

Windows PowerShell:
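A minimal sketch of the install step, assuming Unsloth Studio ships with the standard unsloth pip package (the Studio package name is an assumption; check the Unsloth docs for your platform):

```shell
# macOS / Linux / WSL -- and the same line works in Windows PowerShell
# once Python is on PATH. Package name is an assumption.
pip install --upgrade unsloth
```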

2

Launch Unsloth

MacOS, Linux, WSL and Windows:

Then open http://localhost:8888 in your browser.

3

Search and download Gemma 4

On first launch you will need to create a password to secure your account; you’ll use it to sign in later. You’ll then see a brief onboarding wizard to choose a model, dataset, and basic settings, which you can skip at any time.

Then open the Studio Chat tab, search for Gemma 4 in the search bar, and download your desired model and quantization.

4

Run Gemma 4

Inference parameters are auto-set when using Unsloth Studio, but you can still change them manually. You can also edit the context length, chat template, and other settings.

For more information, you can view our Unsloth Studio inference guide.

Gemma 4 Best Practices

Prompting examples

Simple reasoning prompt

OCR / document prompt

For OCR, use a high visual token budget like 560 or 1120.

Multi-modal comparison prompt

Audio ASR prompt

Audio translation prompt
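Illustrative stand-ins for the prompt categories above (our wording, not Google's; adapt freely):

```
Simple reasoning:       "A train leaves at 09:40 and arrives at 11:05. How long is the trip?"
OCR / document:         "Transcribe all text in this image exactly as written, preserving layout."
Multi-modal comparison: "Compare these two charts and summarize the key difference."
Audio ASR:              "Transcribe this audio clip verbatim."
Audio translation:      "Translate this speech into English."
```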

Multi-modal Settings

For best results with multimodal prompts, put multimodal content first:

  • Put image and/or audio before text.

  • For video, pass a sequence of frames first, then the instruction.

Variable image resolution

Gemma 4 supports multiple visual token budgets:

  • 70

  • 140

  • 280

  • 560

  • 1120

Use them like this:

  • 70 / 140: classification, captioning, fast video understanding

  • 280 / 560: general multimodal chat, charts, screens, UI reasoning

  • 1120: OCR, document parsing, handwriting, small text

Audio and video limits

  • Audio is available on E2B and E4B only.

  • Audio supports a maximum of 30 seconds.

  • Video supports a maximum of 60 seconds assuming 1 frame per second processing.

Audio prompt templates

ASR prompt

Speech translation prompt
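Illustrative templates (our wording; keep the audio clip first, per the multimodal ordering rule above):

```
ASR:                <audio clip> "Transcribe this audio clip verbatim."
Speech translation: <audio clip> "Translate this speech into English."
```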
