Google's Gemma 4 Makes Powerful AI Run on Your Phone
Gemma 4 brings multimodal AI models to phones and laptops with clever architecture tricks that make 5B parameters perform like much larger models.
Written by AI. Yuki Okonkwo
April 28, 2026

Photo: AI Engineer / YouTube
The most interesting thing about Google DeepMind's new Gemma 4 models isn't that they're powerful—it's where they can run.
Last week, researcher Cassidy Hardin walked through the technical details of Gemma 4 at an AI Engineer conference, and what stuck with me wasn't the benchmark numbers (though they're wild). It's the engineering choices that let you run genuinely capable AI models on your phone without melting it.
The headline numbers are actually kinda bonkers
Gemma 4 comes in four sizes, and the performance jumps are... significant. The 31B (31 billion parameter) dense model ranked third globally on the LM Arena leaderboard. As Hardin put it: "This is outperforming models over 20 times its size."
The 26B mixture-of-experts model only activates 3.9 billion parameters during any forward pass, despite having access to 128 different expert networks. Both larger models hit the top six of all open-source models.
But here's what actually matters for developers: the two smaller "effective" models—E2B and E4B—are designed to run locally on phones, tablets, and laptops. No API calls. No cloud dependency. Just... running on the device in your pocket.
The "effective" trick is genuinely clever
When Hardin says a model is "effectively 2B," she's talking about the gap between operational parameters and representational depth. The E2B needs only 2.3 billion parameters loaded to actually run, but it carries 5.1 billion parameters' worth of representational capacity.
The magic happens through something called per-layer embeddings (PLE). Here's the thing that made me go "oh, that's smart": instead of storing one big embedding table for all tokens, each layer gets its own embedding table. The token "hi" has a different embedding representation at layer 1 versus layer 35.
The crucial bit? These per-layer embedding tables live in flash memory instead of VRAM. VRAM is the bottleneck on phones—you run out fast. Flash memory is cheaper and more abundant. By moving these tables to flash and keeping them small (256 dimensions instead of 1536+), the models can maintain rich representations without eating your phone's memory budget.
"VRAM is one of the largest constraints on-device, where you quickly run out of memory in phones and laptops," Hardin explained. So they... just didn't use VRAM for that part. Sometimes the elegant solution is the obvious one.
The attention mechanism got smarter about being expensive
Attention is computationally expensive—every token potentially looking at every other token adds up fast. Gemma 4 uses a mix of local and global attention layers in a 5:1 ratio (4:1 for the smaller models).
Local layers use a sliding window—512 tokens for small models, 1,024 for larger ones. They only look at nearby context. Global layers see everything that came before. The key architectural choice: the final layer is always global, ensuring the model can integrate information across the full context when making predictions.
But global layers are memory-intensive. This is where grouped-query attention comes in. In local layers, two queries share the same key-value heads. In global layers, eight queries share key-value heads. To compensate for that reduction, they doubled the key-value head dimension in global layers (512 vs. 256).
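Here's what that layer layout could look like, sketched as a config. The ratios, window sizes, and head-sharing factors are the ones from the talk; the layer count and the field names are my own placeholders, not Gemma's actual configuration.

```python
from dataclasses import dataclass

# Rough sketch of the attention layout described above, not Gemma's real config.

@dataclass
class AttentionLayout:
    num_layers: int
    local_per_global: int            # 5 local layers per global layer (4 for small models)
    local_window: int                # sliding window: 512 tokens small, 1024 large
    local_queries_per_kv: int = 2    # grouped-query attention in local layers
    global_queries_per_kv: int = 8   # heavier query sharing in global layers
    local_kv_head_dim: int = 256
    global_kv_head_dim: int = 512    # doubled to compensate for the extra sharing

    def layer_types(self) -> list[str]:
        """Alternate local/global layers; the final layer is always global."""
        period = self.local_per_global + 1
        types = ["global" if (i + 1) % period == 0 else "local"
                 for i in range(self.num_layers)]
        types[-1] = "global"  # last layer integrates the full context
        return types

layout = AttentionLayout(num_layers=48, local_per_global=5, local_window=1024)
print(layout.layer_types()[:12])  # ['local', 'local', 'local', 'local', 'local', 'global', ...]
```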
The result, according to Hardin: "significant performance improvements without massive memory cost and inference increases."
Multimodal got way more flexible
Gemma 3 introduced vision. Gemma 4 makes it actually usable.
The big change: variable aspect ratios and variable resolutions. Previously, if you fed Gemma an image, it would split it into squares, pad what didn't fit, and process it as multiple separate images ("pan and scan"). Wildly inefficient.
Now you can choose how many tokens to allocate to each image across five different resolution settings. Doing OCR or object detection? Allocate 1,120 tokens and get high-resolution processing. Just need basic visual understanding? Use 70 tokens and save your context budget for text.
The vision encoder splits images into 16x16 pixel patches, but now it understands spatial positioning. A patch in the top-left corner of a 4x2 image gets different positional encoding than the same patch number in a 3x3 image. Small detail, huge impact.
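Here's a toy sketch of both ideas: a token budget per image, and patch positions that depend on the image's own grid shape. Only the 70 and 1,120 token figures come from the talk; the middle tier, the resize rule, and the function names are my own illustration, not Gemma's preprocessing code.

```python
import math

# Toy illustration only: (1) pick a token budget per image, (2) give each 16x16
# patch a position that depends on the image's own grid shape.
TOKEN_BUDGETS = {"low": 70, "medium": 256, "high": 1120}  # middle tier is a placeholder
PATCH = 16

def patch_grid(width: int, height: int, budget: str) -> tuple[int, int]:
    """Scale the image so its patch count fits the chosen token budget,
    preserving its aspect ratio instead of padding it to a square."""
    max_patches = TOKEN_BUDGETS[budget]
    patches_w, patches_h = width / PATCH, height / PATCH
    scale = min(1.0, math.sqrt(max_patches / (patches_w * patches_h)))
    return max(1, int(patches_w * scale)), max(1, int(patches_h * scale))

def patch_positions(cols: int, rows: int) -> list[tuple[int, int]]:
    """(row, col) for every patch: patch #3 in a 4x2 grid sits somewhere
    different than patch #3 in a 3x3 grid."""
    return [(r, c) for r in range(rows) for c in range(cols)]

cols, rows = patch_grid(1920, 1080, "high")  # a wide image keeps its aspect ratio
print(cols, rows, len(patch_positions(cols, rows)))
```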
The E2B and E4B models also add audio—mel spectrograms processed through a conformer architecture with 35 million parameters. Text, vision, audio, all in models small enough to run on your phone.
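For a sense of what that audio front end consumes, here's a minimal log-mel spectrogram computed with librosa. The parameter values (sample rate, mel bins, hop length) are generic speech defaults, not Gemma's actual settings, and the file path is a placeholder.

```python
import librosa

# Generic speech-processing defaults, not Gemma's actual audio front end.
waveform, sr = librosa.load("clip.wav", sr=16_000)  # "clip.wav" is a placeholder path
mel = librosa.feature.melspectrogram(
    y=waveform, sr=sr, n_fft=400, hop_length=160, n_mels=80
)
log_mel = librosa.power_to_db(mel)  # log-compress before feeding the encoder
print(log_mel.shape)                # (n_mels, time_frames), the input to the conformer
```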
Apache 2.0 matters more than it sounds
Hardin was explicit about the license change: "This was deliberately done in order to make our models more accessible for the everyday developer."
Gemma's previous license had restrictions that made it tricky for some commercial use cases. Apache 2.0 is the "just use it" license. You can integrate it, modify it, deploy it, sell products built on it—the friction is gone.
For developers, this means you can actually build a product lifecycle around these models without legal uncertainty. Prototype with the cloud-hosted versions on AI Studio or Vertex. Test with self-hosted versions from Hugging Face or Ollama. Deploy wherever makes sense.
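The self-hosted path looks like any other Hugging Face model load. The model id below is a placeholder, so check the actual Gemma 4 repository names on the Hub, but the pattern is the standard transformers one.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# The model id is a placeholder; look up the real Gemma 4 repo names on the Hub.
MODEL_ID = "google/gemma-4-e2b"  # hypothetical identifier

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

prompt = "Explain why per-layer embeddings save VRAM."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```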
The mixture-of-experts architecture is its own thing
The 26B MoE model is Gemma's first mixture-of-experts implementation, and the design is interesting. One shared expert (three times the size of the regular experts) activates on every forward pass. The router then selects eight experts from a pool of 128 for each token.
This is different from some MoE approaches that treat all experts equally. Having a constantly-active shared expert means there's a baseline of computation that always happens, with specialization layered on top. Whether this is better than alternatives isn't clear yet—MoE architectures are still an active research area.
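Here's a stripped-down sketch of that routing pattern: an always-on shared expert plus a learned top-8 router over 128 smaller experts. The dimensions are illustrative and the per-token loop is written for clarity rather than speed; this is not Gemma's implementation, just the general shape of it.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative sizes only; the real Gemma 4 MoE dimensions are much larger.
MODEL_DIM, EXPERT_DIM = 512, 256
NUM_EXPERTS, TOP_K = 128, 8

def make_expert(hidden: int) -> nn.Module:
    return nn.Sequential(nn.Linear(MODEL_DIM, hidden), nn.GELU(), nn.Linear(hidden, MODEL_DIM))

class SharedExpertMoE(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        # The shared expert is ~3x the size of a routed expert and runs on every token.
        self.shared = make_expert(3 * EXPERT_DIM)
        self.experts = nn.ModuleList(make_expert(EXPERT_DIM) for _ in range(NUM_EXPERTS))
        self.router = nn.Linear(MODEL_DIM, NUM_EXPERTS)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, MODEL_DIM)
        weights = F.softmax(self.router(x), dim=-1)    # per-token routing scores
        top_w, top_idx = weights.topk(TOP_K, dim=-1)   # pick 8 of the 128 experts
        routed = torch.zeros_like(x)
        for t in range(x.size(0)):                     # per-token loop, clarity over speed
            for w, idx in zip(top_w[t], top_idx[t]):
                routed[t] += w * self.experts[idx](x[t])
        return self.shared(x) + routed                 # shared expert is always active

tokens = torch.randn(4, MODEL_DIM)
print(SharedExpertMoE()(tokens).shape)                 # torch.Size([4, 512])
```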
What this means for the on-device AI narrative
There's been a lot of talk about on-device AI. Most of it has been either tiny models that can't do much, or demos that technically run on-device but aren't actually useful.
Gemma 4's E2B and E4B models are the first time I've seen the engineering actually work out. Multimodal. Reasonable performance. Actually runs on consumer hardware without thermal throttling or killing your battery in 20 minutes.
The open question is whether this matters. Cloud APIs are cheap and getting cheaper. Is on-device AI solving a problem people actually have, or is it solving a problem AI researchers think people should have?
Privacy is the obvious answer—your data never leaves your device. Latency is another—no network round-trip. Offline functionality matters for some use cases.
But mostly, I think it's about optionality. Developers can now actually choose where their AI runs based on their specific constraints, not based on what's technically feasible. That's new.
Yuki Okonkwo covers AI and machine learning for Buzzrag.
Watch the Original Video
Open Models at Google DeepMind — Cassidy Hardin, Google DeepMind
AI Engineer
19m 3s