Google's Gemma 4 Makes Powerful AI Run on Your Phone
Gemma 4 brings multimodal AI models to phones and laptops with clever architecture tricks that make 5B parameters perform like much larger models.
Written by AI. Yuki Okonkwo
April 28, 2026

Photo: AI Engineer / YouTube
The most interesting thing about Google DeepMind's new Gemma 4 models isn't that they're powerful—it's where they can run.
Last week, researcher Cassidy Hardin walked through the technical details of Gemma 4 at an AI Engineer conference, and what stuck with me wasn't the benchmark numbers (though they're wild). It's the engineering choices that let you run genuinely capable AI models on your phone without melting it.
The headline numbers are actually kinda bonkers
Gemma 4 comes in four sizes, and the performance jumps are... significant. The 31B (31 billion parameter) dense model ranked third globally on the LM Arena leaderboard. As Hardin put it: "This is outperforming models over 20 times its size."
The 26B mixture-of-experts model only activates 3.9 billion parameters during any forward pass, despite having access to 128 different expert networks. Both larger models hit the top six of all open-source models.
But here's what actually matters for developers: the two smaller "effective" models—E2B and E4B—are designed to run locally on phones, tablets, and laptops. No API calls. No cloud dependency. Just... running on the device in your pocket.
The "effective" trick is genuinely clever
When Hardin says a model is "effectively 2B," she's talking about the gap between operational parameters and representational depth. The E2B needs only 2.3 billion parameters loaded to actually run, but it carries 5.1 billion parameters' worth of representational capacity.
The magic happens through something called per-layer embeddings (PLE). Here's the thing that made me go "oh, that's smart": instead of storing one big embedding table for all tokens, each layer gets its own embedding table. The token "hi" has a different embedding representation at layer 1 versus layer 35.
The crucial bit? These per-layer embedding tables live in flash memory instead of VRAM. VRAM is the bottleneck on phones—you run out fast. Flash memory is cheaper and more abundant. By moving these tables to flash and keeping them small (256 dimensions instead of 1536+), the models can maintain rich representations without eating your phone's memory budget.
"VRAM is one of the largest constraints on-device, where you quickly run out of memory in phones and laptops," Hardin explained. So they... just didn't use VRAM for that part. Sometimes the elegant solution is the obvious one.
The attention mechanism got smarter about being expensive
Attention is computationally expensive—every token potentially looking at every other token adds up fast. Gemma 4 uses a mix of local and global attention layers in a 5:1 ratio (4:1 for the smaller models).
Local layers use a sliding window—512 tokens for small models, 1,024 for larger ones. They only look at nearby context. Global layers see everything that came before. The key architectural choice: the final layer is always global, ensuring the model can integrate information across the full context when making predictions.
But global layers are memory-intensive. This is where grouped-query attention comes in. In local layers, two queries share the same key-value heads. In global layers, eight queries share key-value heads. To compensate for that reduction, they doubled the key-value head dimension in global layers (512 vs. 256).
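Here's what that layer layout could look like, sketched as a config. The ratios, window sizes, and head-sharing factors are the ones from the talk; the layer count and the field names are my own placeholders, not Gemma's actual configuration.

```python
from dataclasses import dataclass

# Rough sketch of the attention layout described above, not Gemma's real config.

@dataclass
class AttentionLayout:
    num_layers: int
    local_per_global: int            # 5 local layers per global layer (4 for small models)
    local_window: int                # sliding window: 512 tokens small, 1024 large
    local_queries_per_kv: int = 2    # grouped-query attention in local layers
    global_queries_per_kv: int = 8   # heavier query sharing in global layers
    local_kv_head_dim: int = 256
    global_kv_head_dim: int = 512    # doubled to compensate for the extra sharing

    def layer_types(self) -> list[str]:
        """Alternate local/global layers; the final layer is always global."""
        period = self.local_per_global + 1
        types = ["global" if (i + 1) % period == 0 else "local"
                 for i in range(self.num_layers)]
        types[-1] = "global"  # last layer integrates the full context
        return types

layout = AttentionLayout(num_layers=48, local_per_global=5, local_window=1024)
print(layout.layer_types()[:12])  # ['local', 'local', 'local', 'local', 'local', 'global', ...]
```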
The result, according to Hardin: "significant performance improvements without massive memory cost and inference increases."
Multimodal got way more flexible
Gemma 3 introduced vision. Gemma 4 makes it actually usable.
The big change: variable aspect ratios and variable resolutions. Previously, if you fed Gemma an image, it would split it into squares, pad what didn't fit, and process it as multiple separate images ("pan and scan"). Wildly inefficient.
Now you can choose how many tokens to allocate to each image across five different resolution settings. Doing OCR or object detection? Allocate 1,120 tokens and get high-resolution processing. Just need basic visual understanding? Use 70 tokens and save your context budget for text.
The vision encoder splits images into 16x16 pixel patches, but now it understands spatial positioning. A patch in the top-left corner of a 4x2 image gets different positional encoding than the same patch number in a 3x3 image. Small detail, huge impact.
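Here's a toy sketch of both ideas: a token budget per image, and patch positions that depend on the image's own grid shape. Only the 70 and 1,120 token figures come from the talk; the middle tier, the resize rule, and the function names are my own illustration, not Gemma's preprocessing code.

```python
import math

# Toy illustration only: (1) pick a token budget per image, (2) give each 16x16
# patch a position that depends on the image's own grid shape.
TOKEN_BUDGETS = {"low": 70, "medium": 256, "high": 1120}  # middle tier is a placeholder
PATCH = 16

def patch_grid(width: int, height: int, budget: str) -> tuple[int, int]:
    """Scale the image so its patch count fits the chosen token budget,
    preserving its aspect ratio instead of padding it to a square."""
    max_patches = TOKEN_BUDGETS[budget]
    patches_w, patches_h = width / PATCH, height / PATCH
    scale = min(1.0, math.sqrt(max_patches / (patches_w * patches_h)))
    return max(1, int(patches_w * scale)), max(1, int(patches_h * scale))

def patch_positions(cols: int, rows: int) -> list[tuple[int, int]]:
    """(row, col) for every patch: patch #3 in a 4x2 grid sits somewhere
    different than patch #3 in a 3x3 grid."""
    return [(r, c) for r in range(rows) for c in range(cols)]

cols, rows = patch_grid(1920, 1080, "high")  # a wide image keeps its aspect ratio
print(cols, rows, len(patch_positions(cols, rows)))
```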
The E2B and E4B models also add audio—mel spectrograms processed through a conformer architecture with 35 million parameters. Text, vision, audio, all in models small enough to run on your phone.
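For a sense of what that audio front end consumes, here's a minimal log-mel spectrogram computed with librosa. The parameter values (sample rate, mel bins, hop length) are generic speech defaults, not Gemma's actual settings, and the file path is a placeholder.

```python
import librosa

# Generic speech-processing defaults, not Gemma's actual audio front end.
waveform, sr = librosa.load("clip.wav", sr=16_000)  # "clip.wav" is a placeholder path
mel = librosa.feature.melspectrogram(
    y=waveform, sr=sr, n_fft=400, hop_length=160, n_mels=80
)
log_mel = librosa.power_to_db(mel)  # log-compress before feeding the encoder
print(log_mel.shape)                # (n_mels, time_frames), the input to the conformer
```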
Apache 2.0 matters more than it sounds
Hardin was explicit about the license change: "This was deliberately done in order to make our models more accessible for the everyday developer."
Gemma's previous license had restrictions that made it tricky for some commercial use cases. Apache 2.0 is the "just use it" license. You can integrate it, modify it, deploy it, sell products built on it—the friction is gone.
For developers, this means you can actually build a product lifecycle around these models without legal uncertainty. Prototype with the cloud-hosted versions on AI Studio or Vertex. Test with self-hosted versions from Hugging Face or Ollama. Deploy wherever makes sense.
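The self-hosted path looks like any other Hugging Face model load. The model id below is a placeholder, so check the actual Gemma 4 repository names on the Hub, but the pattern is the standard transformers one.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# The model id is a placeholder; look up the real Gemma 4 repo names on the Hub.
MODEL_ID = "google/gemma-4-e2b"  # hypothetical identifier

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

prompt = "Explain why per-layer embeddings save VRAM."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```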
The mixture-of-experts architecture is its own thing
The 26B MoE model is Gemma's first mixture-of-experts implementation, and the design is interesting. One shared expert (three times the size of the regular experts) activates on every forward pass. The router then selects eight experts from a pool of 128 for each token.
This is different from some MoE approaches that treat all experts equally. Having a constantly-active shared expert means there's a baseline of computation that always happens, with specialization layered on top. Whether this is better than alternatives isn't clear yet—MoE architectures are still an active research area.
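Here's a stripped-down sketch of that routing pattern: an always-on shared expert plus a learned top-8 router over 128 smaller experts. The dimensions are illustrative and the per-token loop is written for clarity rather than speed; this is not Gemma's implementation, just the general shape of it.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative sizes only; the real Gemma 4 MoE dimensions are much larger.
MODEL_DIM, EXPERT_DIM = 512, 256
NUM_EXPERTS, TOP_K = 128, 8

def make_expert(hidden: int) -> nn.Module:
    return nn.Sequential(nn.Linear(MODEL_DIM, hidden), nn.GELU(), nn.Linear(hidden, MODEL_DIM))

class SharedExpertMoE(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        # The shared expert is ~3x the size of a routed expert and runs on every token.
        self.shared = make_expert(3 * EXPERT_DIM)
        self.experts = nn.ModuleList(make_expert(EXPERT_DIM) for _ in range(NUM_EXPERTS))
        self.router = nn.Linear(MODEL_DIM, NUM_EXPERTS)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, MODEL_DIM)
        weights = F.softmax(self.router(x), dim=-1)    # per-token routing scores
        top_w, top_idx = weights.topk(TOP_K, dim=-1)   # pick 8 of the 128 experts
        routed = torch.zeros_like(x)
        for t in range(x.size(0)):                     # per-token loop, clarity over speed
            for w, idx in zip(top_w[t], top_idx[t]):
                routed[t] += w * self.experts[idx](x[t])
        return self.shared(x) + routed                 # shared expert is always active

tokens = torch.randn(4, MODEL_DIM)
print(SharedExpertMoE()(tokens).shape)                 # torch.Size([4, 512])
```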
What this means for the on-device AI narrative
There's been a lot of talk about on-device AI. Most of it has been either tiny models that can't do much, or demos that technically run on-device but aren't actually useful.
Gemma 4's E2B and E4B models are the first time I've seen the engineering actually work out. Multimodal. Reasonable performance. Actually runs on consumer hardware without thermal throttling or killing your battery in 20 minutes.
The open question is whether this matters. Cloud APIs are cheap and getting cheaper. Is on-device AI solving a problem people actually have, or is it solving a problem AI researchers think people should have?
Privacy is the obvious answer—your data never leaves your device. Latency is another—no network round-trip. Offline functionality matters for some use cases.
But mostly, I think it's about optionality. Developers can now actually choose where their AI runs based on their specific constraints, not based on what's technically feasible. That's new.
Yuki Okonkwo covers AI and machine learning for Buzzrag.
Watch the Original Video
Open Models at Google DeepMind — Cassidy Hardin, Google DeepMind
AI Engineer
19m 3s