Google's Gemma 4: Small Models, Big Performance Claims
Google releases Gemma 4, claiming frontier-level AI performance in models small enough for consumer hardware. The numbers look impressive. The questions remain.
Written by AI · Bob Reynolds
April 4, 2026

Photo: Matthew Berman / YouTube
Google released Gemma 4 this week, and the performance charts tell an interesting story. The company claims its 31-billion-parameter model ranks third globally among open models—behind only systems with hundreds of billions or even a trillion parameters. If accurate, that's not incremental progress. That's a different approach to the problem.
The pitch is straightforward: smaller models that perform like larger ones. Gemma 4 comes in four sizes, from an "effective" 2 billion parameters up to 31 billion. The largest version supposedly matches models ten times its size on standard benchmarks. Tech YouTuber Matthew Berman, who covers AI developments, put it bluntly: "These are not massive models. These are actually relatively small models, perfect models to fit on your GPU."
I've covered enough AI releases to know that "unprecedented" and "state-of-the-art" appear in nearly every announcement. What matters is whether the claims hold up under scrutiny and what trade-offs were made to get there.
The Numbers Game
The performance data centers on Elo scoring, a rating system borrowed from chess. On Google's charts, Gemma 4's 31B model sits near the top-left corner—the desirable position indicating high performance with relatively few parameters. It scores comparably to Qwen 3.5, which requires 397 billion parameters with 17 billion active. That's a meaningful difference in computational requirements.
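For readers unfamiliar with the system: Elo ratings map a score gap between two players—or two models, when the "match" is a human preference vote—to an expected win probability. This is the standard Elo formula, not anything specific to Google's charts:

```python
def elo_expected(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

# A 100-point gap implies roughly a 64% win rate for the higher-rated model.
print(round(elo_expected(1300, 1200), 2))  # 0.64
```

The practical upshot: two models separated by a few dozen Elo points are nearly interchangeable in head-to-head preference, which is why a small model landing near giants on the chart is notable.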
Berman tested the model and confirmed it runs on "medium to high-end normal consumer hardware." Not everyone owns a desktop with 64GB of RAM, but such machines exist in offices and homes. Compare that to models requiring specialized server hardware that costs six figures.
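A rough sizing check shows why 64GB of RAM is the relevant threshold. Using the standard back-of-envelope estimate—parameter count times bytes per parameter, ignoring KV cache and activations—a 31B model lands just under that ceiling at 16-bit precision:

```python
def weight_footprint_gb(params_billions: float, bytes_per_param: float) -> float:
    """Approximate memory to hold the weights alone
    (excludes KV cache, activations, and runtime overhead)."""
    return params_billions * 1e9 * bytes_per_param / 2**30

for label, bpp in [("fp16", 2.0), ("int8", 1.0), ("4-bit", 0.5)]:
    print(f"31B @ {label}: ~{weight_footprint_gb(31, bpp):.0f} GB")
```

At fp16 the weights alone need roughly 58 GB; quantized to 8 or 4 bits, the model fits comfortably on far more modest machines. That arithmetic, not marketing, is what "consumer hardware" means here.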
The smaller "effective" models use a technique called per-layer embeddings. Instead of adding more layers, each decoder layer gets its own small embedding for every token. The embedding tables are large but only used for quick lookups, which is why the effective parameter count differs from the total. I had to look that up. It's a clever optimization, assuming it doesn't degrade quality in ways the benchmarks don't capture.
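A toy sketch of the idea as described—each decoder layer holding its own small per-token embedding table—might look like the following. This is an illustration of the concept, not Google's implementation; all names and dimensions are invented:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, n_layers, d_model, d_ple = 1000, 4, 64, 8  # toy sizes, not Gemma's

# A shared input embedding plus one small table per decoder layer.
shared_embedding = rng.standard_normal((vocab, d_model))
per_layer_tables = rng.standard_normal((n_layers, vocab, d_ple))

def layer_input(token_ids: np.ndarray, layer: int):
    """Each layer receives the hidden state plus its own cheap lookup.
    The tables are large in total, but a forward pass touches only one
    row per token, which is why the 'effective' (active) parameter
    count is smaller than the total parameter count."""
    hidden = shared_embedding[token_ids]        # (seq, d_model)
    extra = per_layer_tables[layer][token_ids]  # (seq, d_ple)
    return hidden, extra

h, e = layer_input(np.array([1, 5, 9]), layer=2)
print(h.shape, e.shape)  # (3, 64) (3, 8)
```

The design trade is storage for compute: the embedding tables inflate the model's size on disk, but lookups are nearly free at inference time.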
What Gets Sacrificed
No engineering achievement comes without compromise. Gemma 4's context window maxes out at 256,000 tokens for the larger models and 128,000 for the edge versions. That's adequate for most tasks but notably smaller than some competing models that handle a million tokens or more.
For code generation—one of AI's most practical applications—Berman was frank about the limitations: "If you are doing coding, you are most likely using a hosted frontier model. If I am writing code, I want to use the best model on the planet." Local models work for coding, he noted, but it's not his preference. That's useful honesty in a field prone to overselling capabilities.
The tool-calling benchmarks look cleaner. Gemma 4 31B scored perfectly on the Tool Call 15 benchmark, suggesting it can reliably interact with external APIs and execute structured workflows. That matters for building agents that actually work rather than agents that demo well.
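Tool-calling benchmarks measure whether a model can emit a valid, parseable call against a declared schema. A minimal dispatcher sketch shows the shape of the task—the tool name, schema, and model output here are invented for illustration, and real frameworks vary in their exact formats:

```python
import json

# Hypothetical tool schema, in the JSON-Schema style many frameworks use.
weather_tool = {
    "name": "get_weather",
    "description": "Look up current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

# What a well-behaved model's structured output might look like.
model_output = '{"tool": "get_weather", "arguments": {"city": "Austin"}}'

def dispatch(raw: str, tools: dict) -> str:
    """Parse the model's structured call and route it to the matching function."""
    call = json.loads(raw)
    return tools[call["tool"]](**call["arguments"])

result = dispatch(model_output, {"get_weather": lambda city: f"72F in {city}"})
print(result)  # 72F in Austin
```

A perfect benchmark score means every emitted call parses and matches its schema—the boring reliability that separates working agents from impressive demos.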
The Distribution Question
Google made Gemma 4 available through every major platform: HuggingFace, Ollama, Nvidia NIM, LM Studio, and a half-dozen others. The Apache 2.0 license permits commercial use without restrictions. That's significant. Truly open models let developers fine-tune, modify, and deploy without seeking permission or paying per token.
The edge models—the 2B and 4B versions—are designed for phones, Raspberry Pis, and other devices with limited power budgets. Berman speculates they might appear in Apple devices, though that's speculation, not reporting. What's documented is collaboration with Qualcomm and MediaTek, the chip suppliers for most Android phones.
Running AI models locally eliminates latency from network requests and keeps data on-device. Whether consumers care about on-device AI enough to justify the engineering effort remains unclear. Apple has made that bet with its Intelligence features. Google now offers developers the tools to make similar capabilities available across hardware platforms.
The Context That Matters
I remember when "artificial intelligence" meant expert systems running on mainframes. Then it meant chess programs. Then spam filters. The definition keeps shifting toward whatever hasn't been solved yet. Today's AI hype focuses on large language models and the computational race to make them bigger.
Gemma 4 represents a different approach: efficient models that run where people actually work. Not everything needs GPT-4's capabilities or its cost structure. Most tasks might work fine with something smaller—if that something is good enough.
The question worth asking isn't whether Gemma 4 is revolutionary. It's whether Google can sustain this release cadence with genuinely open models while its competitors either close their systems or release hobbled versions. Berman credited Google for "continuing to push the frontier of open-source open weights models" and noted "not every company is doing that."
That's the story, really. The technical specifications matter less than the strategic decision to keep releasing capable models without restrictions. Whether Google maintains that approach depends on factors beyond engineering—business models, competitive pressure, regulatory requirements.
I've seen enough technology cycles to know that openness often loses to control when money gets serious. For now, developers have another option. How long that lasts is the more interesting question than what the benchmarks say.
Bob Reynolds is Senior Technology Correspondent for Buzzrag.
Watch the Original Video
Google just dropped Gemma 4... (WOAH)
Matthew Berman
9m 47s
About This Source
Matthew Berman
Matthew Berman is a leading voice in the digital realm, amassing over 533,000 subscribers since launching his YouTube channel in October 2025. His mission is to demystify the world of Artificial Intelligence (AI) and emerging technologies for a broad audience, transforming complex technical concepts into accessible content. Berman's channel serves as a bridge between AI innovation and public comprehension, providing insights into what he describes as the most significant technological shift of our lifetimes.