DiffusionGemma Generates Text Like an Image Model
Google DeepMind's DiffusionGemma borrows from image diffusion to generate 700–1,000+ tokens/sec. Here's how the architecture works—and where it falls short.
Written by AI. Yuki Okonkwo

Photo: AI. Saskia Aaltonen
There's a bottleneck that everyone running AI models locally knows intimately, even if they don't have a name for it. Your GPU—the expensive, powerful one—spends most of its time waiting. Not computing. Waiting. Loading weights out of memory, producing one token, sitting idle, loading again. It's a bit like hiring a sous chef for a Michelin-star kitchen and then only asking them to chop one onion at a time.
This is what Google DeepMind decided to go after with DiffusionGemma. And the approach they landed on is genuinely strange in the best way: they borrowed the core trick from AI image generators and applied it to text.
The memory-bound problem, explained without the handwaving
To understand why DiffusionGemma is interesting, you need to understand the thing it's trying to fix.
Every language model you've used—ChatGPT, Claude, Gemini, whatever's running locally on your machine—works the same fundamental way. It's autoregressive, which is a technical term for "generates one token at a time, left to right." The model writes a word, looks at everything it's written so far, predicts the next word, repeat. That's the whole loop.
For big cloud providers, this is fine. As Andress from Better Stack explains in his breakdown of DiffusionGemma, when a server generates a token, "most of the time isn't spent on computing, it's spent loading the model's weights out of memory." Cloud providers spread that cost by batching hundreds of users together: one memory load, 256 users served. Economically elegant.
Locally? You're the only user. There's nobody to batch with. Your GPU loads a massive chunk of weights, does a tiny computation for one token, then idles until it does it all again. The technical term for this is being memory-bound—the GPU's compute capacity is sitting underutilized because the bottleneck is memory bandwidth, not raw processing power.
DeepMind's question was essentially: what if instead of serving 256 people one token each, you served one person 256 tokens at once?
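To put rough numbers on the memory-bound argument, here is a back-of-the-envelope sketch. The 27B parameter count comes from the article; the 16-bit weights and the ~3.35 TB/s memory bandwidth are assumptions chosen for illustration, and the estimate ignores compute time, the KV cache, and everything else except streaming the weights once per forward pass.

```python
# Back-of-the-envelope ceiling on decoding speed when weight loading dominates.
# Hardware and precision figures are illustrative assumptions, not measurements.

def weight_bytes(params: float, bytes_per_param: float) -> float:
    """Bytes that must be streamed from memory for one forward pass."""
    return params * bytes_per_param

def max_passes_per_sec(bandwidth_bytes_per_sec: float, model_bytes: float) -> float:
    """Upper bound on forward passes per second if weight loading is the only cost."""
    return bandwidth_bytes_per_sec / model_bytes

PARAMS = 27e9          # 27B parameters (figure from the article)
BYTES_PER_PARAM = 2    # assumption: 16-bit weights
BANDWIDTH = 3.35e12    # assumption: ~3.35 TB/s of GPU memory bandwidth

model_bytes = weight_bytes(PARAMS, BYTES_PER_PARAM)
passes = max_passes_per_sec(BANDWIDTH, model_bytes)

# One local user, autoregressive decoding: one token per weight load.
single_user_tps = passes * 1

# Cloud-style batching: 256 users share each weight load.
batched_tps = passes * 256

print(f"weights streamed per pass: {model_bytes / 1e9:.0f} GB")
print(f"single-user ceiling:       {single_user_tps:.0f} tokens/sec")
print(f"256-way batch ceiling:     {batched_tps:.0f} tokens/sec (aggregate)")
```

Under these assumptions the single-user ceiling comes out at about 62 tokens per second, while the batched figure is 256 times higher for the same memory traffic. That multiplier is the headroom a parallel canvas is trying to claim for one user.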
Noise, but for words
The answer they landed on comes from a completely different field. Diffusion models, the technique behind image generators such as Stable Diffusion, Midjourney, and DALL-E, work by starting with pure noise (a static-filled image) and progressively cleaning it up over multiple passes until something coherent emerges. The model is trained by corrupting real images with noise and then learning to reverse that process.
The clever-and-nontrivial question is: how do you apply this to text?
With images, "adding noise" is intuitive. Make a pixel a bit more red. Increase the grain. But words are discrete—there's no sensible way to make "the" slightly less "the." As Andress puts it: "What does that noise even mean for a word?"
DeepMind's answer is called uniform state diffusion. Instead of corrupting pixels, you corrupt text by replacing real tokens with random ones drawn from the vocabulary. The model's job is to figure out which tokens are garbage and replace them with correct ones, over multiple passes. It starts with a full "canvas" of 256 random tokens (pure noise, word-style) and iteratively cleans it up until it becomes coherent text.
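The corruption step described above can be sketched in a few lines. This is a toy illustration, not DeepMind's training code: the function name `corrupt_uniform`, the eight-word vocabulary, and the single `noise_level` knob are all assumptions made for the example.

```python
import random

def corrupt_uniform(tokens, vocab, noise_level, rng):
    """Replace each token with a uniformly random vocabulary token
    with probability noise_level; otherwise keep it unchanged."""
    return [
        rng.choice(vocab) if rng.random() < noise_level else tok
        for tok in tokens
    ]

# Hypothetical toy vocabulary and sentence.
vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran", "blue"]
sentence = ["the", "cat", "sat", "on", "the", "mat"]

rng = random.Random(0)  # seeded so repeated runs match
for level in (0.0, 0.5, 1.0):
    print(level, corrupt_uniform(sentence, vocab, level, rng))
```

At `noise_level=0.0` the sentence comes back untouched, and at `1.0` every position holds a random draw, which is the word-style equivalent of a pure-noise canvas. A denoiser would be trained to map the corrupted sequences back toward the original.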
There's a simpler version of this called mask diffusion, which just blanks tokens out (think fill-in-the-blank). But mask diffusion has a fatal flaw: once the model commits to filling in a blank, that word is locked. It inherits the same rigidity as autoregressive models. Uniform state diffusion fixes this by always keeping some token in every position, which means the model can revisit a word it accepted two passes ago, decide it doesn't fit the emerging context, and swap it out. Genuine self-correction all the way through.
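The difference between the two schemes comes down to which positions a later pass is allowed to rewrite. The sketch below makes that concrete with hypothetical helpers; the `proposals` list stands in for whatever a real denoiser would predict and is not output from any actual model.

```python
MASK = "<mask>"

def editable_positions_mask(canvas):
    """Mask-style: only positions that are still blank can change."""
    return [i for i, tok in enumerate(canvas) if tok == MASK]

def editable_positions_uniform(canvas):
    """Uniform-state style: every position holds a token and stays revisable."""
    return list(range(len(canvas)))

def apply_pass(canvas, proposals, editable):
    """Write proposed tokens, but only at the editable positions."""
    allowed = set(editable)
    return [proposals[i] if i in allowed else tok for i, tok in enumerate(canvas)]

# Suppose an earlier pass put "dog" at position 1, and the model now prefers "cat".
proposals = ["the", "cat", "sat", "down"]  # stand-in for model predictions

mask_canvas = ["the", "dog", "sat", MASK]
uniform_canvas = ["the", "dog", "sat", "ran"]

after_mask = apply_pass(mask_canvas, proposals, editable_positions_mask(mask_canvas))
after_uniform = apply_pass(uniform_canvas, proposals, editable_positions_uniform(uniform_canvas))

print(after_mask)     # "dog" is locked in; only the blank was filled
print(after_uniform)  # "dog" was revised to "cat"
```

In the mask-style run the earlier choice survives because it is no longer a blank. In the uniform-state run the same position is still open, so the later pass can overwrite it.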
The architecture underneath
DiffusionGemma is built on top of Google's existing 27B-parameter Gemma 4 model, with what DeepMind is calling an encoder-denoiser patch. When generating a response, the model operates in two modes:
Encoder mode reads your prompt, extracts context and semantic guidance, and stores everything in a KV cache (essentially a working memory for attention). That cached context then gets handed to the denoiser.
Denoiser mode is where the actual generation happens. Two key departures from standard LLMs make this possible. First, a normal LLM produces confidence scores (logits) for every token position, but during generation it uses only the last position's scores to pick the next token and throws the rest away. DiffusionGemma keeps all of those scores, because every position on the canvas needs its own prediction. Second, the denoiser replaces causal attention (the rule that tokens can only look backward at previous tokens) with bidirectional attention: every token can see every other token, in both directions, simultaneously.
That bidirectionality is what allows the multi-pass cleaning to actually work. Token 254 can see what token 12 is becoming, and adjust accordingly. The canvas gets refined holistically rather than assembled sequentially.
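Both departures are easy to see in miniature. The sketch below builds a causal and a bidirectional attention mask as plain nested lists, then contrasts reading only the last position's logits with reading every position's. The helper names and the tiny logits table are illustrative assumptions, not DiffusionGemma internals.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def bidirectional_mask(n):
    """Every position may attend to every other position."""
    return [[True] * n for _ in range(n)]

def visible(mask, i):
    """Positions that position i is allowed to look at."""
    return [j for j, ok in enumerate(mask[i]) if ok]

def argmax(row):
    """Index of the largest score in a row of logits."""
    return max(range(len(row)), key=row.__getitem__)

def next_token_autoregressive(logits):
    """Standard decoding: only the last position's scores are used."""
    return argmax(logits[-1])

def canvas_tokens_denoiser(logits):
    """Canvas decoding: every position's scores yield a prediction."""
    return [argmax(row) for row in logits]

n = 4
print(visible(causal_mask(n), 1))         # position 1 sees only 0 and 1
print(visible(bidirectional_mask(n), 1))  # position 1 sees all four positions

# Made-up logits: 3 positions, 3-token vocabulary.
logits = [
    [0.1, 0.9, 0.0],
    [0.7, 0.2, 0.1],
    [0.0, 0.3, 0.7],
]
print(next_token_autoregressive(logits))  # one token id
print(canvas_tokens_denoiser(logits))     # one token id per position
```

With the causal mask, an early position never sees what later positions are turning into. With the bidirectional mask it does, which is what lets each pass adjust any token against the whole canvas.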
The result, theoretically: 1,000+ tokens per second on an H100 GPU, because the GPU is finally doing the kind of dense, parallel matrix math it was actually designed for. You've flipped it from memory-bound to compute-bound.
What actually happened when someone tested it
Better Stack's Andress deployed DiffusionGemma on an H100 via RunPod—the model's weights are open-sourced under Apache 2.0 on Hugging Face, so this is replicable—and ran two real-world benchmarks: building a personal finance dashboard and generating an arcade-style game.
The speed was immediately notable. "Instantly, it starts streaming right away. Look how blazingly fast that is. Holy moly," he says, which is admittedly not the most rigorous measurement, but the logs backed it up: ~700 tokens per second during the output phase. The finance dashboard came back fast but was partially broken—categories worked, expense updating didn't. The arcade game, however, was fully functional in 14 seconds.
The gap between 700 observed and 1,000+ marketed is worth sitting with. Andress acknowledges it directly: "although their marketing page said that we could expect 1,000 token per second speeds on the H100... that was not my observation." He's careful not to blame the model—there might be template or prompt configurations to tune—but it's a real gap, and one that any developer benchmarking for production should verify themselves.
Where the tradeoff lives
Here's where it gets genuinely interesting from an architecture perspective: DiffusionGemma is not trying to be the best model. It's explicitly trading quality for speed.
"For maximum quality work, standard Gemma 4 is still a better pick," Andress notes. DiffusionGemma is built for specific contexts where speed matters more than perfection—inline code editing, fill-in-the-middle completions, rapid iteration loops. And there's a class of problems where the bidirectional, non-linear generation approach actually gives it a structural advantage over autoregressive models: filling in the middle of a code block, for instance, or solving constraint puzzles like Sudoku, where you need to reason about all positions simultaneously rather than left-to-right.
This is a real architectural distinction, not marketing spin. Autoregressive models have a genuine weakness on nonlinear tasks. DiffusionGemma's whole design is oriented toward contexts where that linearity was always a limitation.
What it opens up, beyond the model itself, is a question about the broader design space. If you can bolt a diffusion-style denoiser onto an existing pretrained transformer and unlock parallel generation without retraining from scratch, that's a meaningful result. It suggests the paradigm might be portable—that other model families could potentially get similar treatments.
The more interesting version of this story isn't whether DiffusionGemma hits 1,000 tokens/sec in practice. It's whether the approach scales, generalizes, and eventually produces a generation of models where local inference feels less like you're patiently waiting and more like the model is thinking alongside you in real time.
Yuki Okonkwo is Buzzrag's AI & Machine Learning correspondent. She covers the systems being built, the tradeoffs being made, and the people who will live with both.