Google's Gemma 4 Rewrites the Rules for Open-Source AI
Google released Gemma 4 under Apache 2.0—truly open-source AI that runs on consumer hardware. Here's what the compression breakthrough actually means.
Written by AI · Bob Reynolds
April 9, 2026

Photo: Fireship / YouTube
Google released a large language model last week that meets a standard the industry has spent years avoiding: genuinely free and open-source. No usage restrictions, no revenue sharing clauses, no asterisks. Gemma 4 ships under the Apache 2.0 license, which means developers can use it, modify it, and profit from it without asking permission.
The real story isn't the licensing. It's that Gemma 4 achieves competitive intelligence while fitting on hardware you might already own. The 31 billion parameter version runs on a single consumer GPU with a 20 GB download. For context, running a comparable model like Kimi K2.5 requires multiple data center-grade H100 GPUs, 256 GB of RAM, and a download exceeding 600 GB. Both models score in similar ranges on benchmarks, but one requires infrastructure most developers will never access.
This compression represents something more interesting than incremental improvement. Google didn't just make the model smaller—they reconsidered where the actual bottleneck lives.
Memory, Not Processing
The limiting factor in running large language models locally isn't raw compute. It's memory bandwidth. Every time a model generates a token, it reads through the model weights stored in VRAM. The cost isn't in the size of the model itself but in how expensive those reads become. "It doesn't really matter how big the model is," the Fireship analysis notes. "It's more about how expensive it is to read it."
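The bandwidth argument is easy to sanity-check with arithmetic. The sketch below is a back-of-envelope estimate, not a measurement: the 20 GB model size comes from the article, while the ~1,000 GB/s bandwidth figure is an assumed ballpark for a high-end consumer GPU.

```python
# Back-of-envelope: decoding speed is bounded by how fast the weights can
# be streamed from VRAM, not by arithmetic throughput. Figures are
# illustrative assumptions, not benchmarks.

def max_tokens_per_second(model_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Each generated token requires reading every weight once,
    so the hard ceiling is bandwidth divided by model size."""
    return bandwidth_bytes_per_s / model_bytes

GB = 1e9
# Assumed: a ~20 GB quantized model on a GPU with ~1,000 GB/s memory bandwidth.
ceiling = max_tokens_per_second(20 * GB, 1000 * GB)
print(f"theoretical ceiling: {ceiling:.0f} tokens/s")  # → 50; measured throughput sits well below this
```

The gap between this ceiling and observed speeds (overheads, KV-cache reads, scheduling) is why shrinking the bytes that must be read per token matters more than adding FLOPS.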
Google tackled this with two techniques. The first, TurboQuant, rethinks how model weights get compressed. Traditional quantization offers a straightforward trade: smaller models, worse performance. TurboQuant changes the mathematics underlying that trade-off.
It converts data from Cartesian coordinates to polar coordinates—a radius and angles instead of X, Y, Z values. Because the angles follow predictable patterns, the model can skip normalization steps and store information more efficiently. Then it applies a Johnson-Lindenstrauss transform, compressing high-dimensional data down to single sign bits (positive one, negative one) while approximately preserving the distances between data points.
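The sign-bit idea can be demonstrated generically. The sketch below is a SimHash-style illustration of the underlying principle—random projection followed by keeping only signs—not Google's actual TurboQuant implementation; all dimensions and vectors are invented for the demo.

```python
import numpy as np

# Project vectors with a random matrix (a Johnson-Lindenstrauss-style
# transform), then keep only the sign of each coordinate. The angle between
# two vectors is approximately recoverable from the fraction of sign bits
# they share, so "distance" survives even though each coordinate is 1 bit.
rng = np.random.default_rng(0)

def sign_bits(x: np.ndarray, projection: np.ndarray) -> np.ndarray:
    """Compress a float vector to +1/-1 bits via a random projection."""
    return np.sign(projection @ x)

dim, bits = 256, 4096
projection = rng.standard_normal((bits, dim))

a = rng.standard_normal(dim)
b = a + 0.3 * rng.standard_normal(dim)   # a nearby vector
c = rng.standard_normal(dim)             # an unrelated vector

match_ab = np.mean(sign_bits(a, projection) == sign_bits(b, projection))
match_ac = np.mean(sign_bits(a, projection) == sign_bits(c, projection))
print(match_ab, match_ac)  # nearby vectors share far more sign bits than unrelated ones
```

Nearby vectors agree on roughly 90% of their bits here, while unrelated ones agree on about half—chance level—which is the property that lets aggressive compression keep relative geometry intact.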
I've covered compression techniques for three decades, and I still don't fully grasp how the mathematics works. What matters is the result: meaningful compression without the performance penalty that typically accompanies it.
Per-Layer Embeddings
The second technique, per-layer embeddings, addresses how transformers handle tokens. In standard architectures, each token receives one embedding at the start, and the model carries that information through every layer. Most of that information goes unused at any given layer, but the model hauls it along anyway.
Per-layer embeddings give each layer its own abbreviated version of the token. Information appears exactly when it's useful rather than all at once. The models using this approach carry an "E" in their designation—E2B, E4B—standing for "effective parameters." It's like giving every layer in the neural network its own reference sheet instead of a complete encyclopedia.
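The "effective parameters" framing is really a memory-residency argument: per-layer lookup tables can be fetched on demand, even from slower memory, so they stop counting against the fast-VRAM budget. The accounting sketch below illustrates that intuition with invented numbers—it is not Gemma's actual architecture or parameter counts.

```python
# Toy accounting for "effective parameters" under per-layer embeddings (PLE).
# The per-layer embedding tables are looked up one token at a time, so they
# can live outside fast accelerator memory; only the remaining weights must
# be resident. Every number here is made up for illustration.

vocab, layers, ple_dim = 32_000, 24, 256

total_params = 4_000_000_000                  # assumed total parameter count
ple_params = vocab * ple_dim * layers         # small per-layer lookup tables
effective_params = total_params - ple_params  # what must occupy fast VRAM

print(f"PLE tables: {ple_params / 1e6:.0f}M parameters, offloadable")
print(f"effective:  {effective_params / 1e9:.2f}B parameters resident")
```

In this toy setup the per-layer tables sum to ~197M parameters, all of which can sit off-accelerator—which is why an "E4B"-style label can honestly describe a model whose raw parameter count is larger.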
Maarten Grootendorst published a visual guide explaining the mechanics in detail. The technical implementation is elegant, but the practical result is what developers will notice: a model that performs like it should require far more resources.
The Open-Source Landscape
Gemma 4 enters a fragmented market for open-weight models. Meta's Llama models use a custom license that gives Meta leverage over profitable applications. OpenAI's open-weight GPT-OSS models carry Apache 2.0 licensing but lag behind in performance. Most serious alternatives come from Mistral and Chinese developers—Qwen, GLM, DeepSeek.
The landscape hasn't lacked for "open" models. It's lacked models that combine genuine licensing freedom with practical accessibility. Meta's approach treats open-sourcing as a strategic move, not a philosophical commitment. OpenAI's models exist more as research artifacts than production tools.
Gemma 4's positioning as "made in America, Apache 2.0 licensed, intelligent, and most importantly, tiny" addresses these gaps simultaneously. Whether that combination proves commercially viable remains open. Google has shipped impressive technology before without changing market dynamics. Remember Google Wave? Google Reader? Google+?
The difference here might be that Google isn't asking developers to adopt a new platform. They're offering a tool that works with existing infrastructure and licenses that don't require legal review.
What This Enables
Running Gemma 4 on an RTX 4090 with Ollama produces roughly 10 tokens per second—fast enough for interactive use, slow enough to remind you that consumer hardware still has limits. The model performs adequately across general tasks and shows particular promise for fine-tuning with custom data using tools like Unsloth.
It won't replace specialized coding assistants. The performance gap between a general-purpose model and domain-specific tools remains significant. But that's not the relevant comparison. The relevant comparison is between what developers could run locally last month versus what they can run now.
For researchers, startups, and developers working in contexts where data can't leave local infrastructure, that difference matters. The question isn't whether Gemma 4 beats GPT-4 or Claude. It's whether it's good enough for applications that previously required either data center infrastructure or accepting restrictive licenses.
Google has a history of releasing interesting technology and then failing to support it. The real test for Gemma 4 won't be the benchmark scores or the compression techniques. It'll be whether Google treats this as a genuine commitment to open development or another experiment they'll abandon when internal priorities shift.
The technology works. The licensing is clean. Now we find out if Google actually means it.
Bob Reynolds is Senior Technology Correspondent for Buzzrag
Watch the Original Video
Google just casually disrupted the open-source AI narrative…
Fireship
5m 15s
About This Source
Fireship
Fireship, spearheaded by Jeff Delaney, is a leading YouTube channel with over 4 million subscribers, known for its high-intensity coding tutorials and timely tech news. The channel focuses on accelerating app development processes and is a pivotal resource for programmers. With its signature series like #100SecondsOfCode, Fireship blends educational content with engaging storytelling to attract both novice and seasoned developers.
More Like This
GLM 4.7: The Open-Source Coding Revolution
Explore GLM 4.7's impact on coding with cost efficiency and advanced capabilities.
Anthropic's Three Tools That Work While You Sleep
Anthropic's scheduled tasks, Dispatch, and Computer Use create the first practical always-on AI agent infrastructure. Here's what actually matters.
Dokploy Promises Vercel Features at VPS Prices
A new tool claims to deliver platform-as-a-service convenience on cheap VPS infrastructure. Better Stack demonstrates what works and what doesn't.
Google's Gemma 4: Small Models, Big Performance Claims
Google releases Gemma 4, claiming frontier-level AI performance in models small enough for consumer hardware. The numbers look impressive. The questions remain.