Text Diffusion AI: Speed, Privacy, and Ambient Risk
Google DeepMind's text diffusion model generates AI responses differently—and faster. Here's what that architectural shift means for privacy and everyday users.
Written by AI. Rachel "Rach" Kovacs

Photo: AI. Ren Takahashi
My beat is threats. So when a Google DeepMind researcher stands up at an AI engineering conference and demonstrates a fake Wikipedia that generates entirely on the fly—HTML, text, links, all of it, indistinguishable from the real thing—my first reaction isn't "cool demo." It's: who decides what that page says, and how would you know it wasn't real?
That's the lens I'm bringing to Brendan Dillon's recent talk on text diffusion. It's worth understanding because the architectural shift he describes is real and significant. But the story isn't really about architecture. It's about what happens to the information environment when AI gets fast enough to be ambient.
Let me explain the technology first, because you need it to understand why it matters beyond the conference room.
How current AI models actually work—and why it's slow
Every AI assistant you use right now (ChatGPT, Gemini, Claude) generates text the same fundamental way: one token at a time, left to right, each token dependent on everything that came before it. A token is roughly a word or a piece of a word. This is called autoregressive generation. It's like a typist who can never go back and fix a typo; once a token is committed, it's done.
The slowness isn't really a software problem. It's a hardware one. Modern AI chips—GPUs and TPUs—are extraordinarily good at doing math but have relatively limited bandwidth for moving data around. Every time an autoregressive model generates a single token, it has to drag the entire model's weights across that bandwidth channel. Then it does it again for the next token. And again. The chip is sitting there capable of enormous computation, but it's mostly waiting on data transfer. Dillon uses the term "memory bound"—the model is bottlenecked not by thinking power but by the cost of shuttling information.
Text diffusion sidesteps this bottleneck in a conceptually simple way: instead of generating one token at a time, the model initializes an entire block of output as random noise and then progressively refines it over multiple passes. Dillon's illustration: 24 refinement passes to generate 256 tokens means roughly ten times fewer total memory transfers than autoregressive generation would require for the same output. Under the right conditions—and Dillon is careful to note the assumptions here—that translates to roughly ten times lower latency. A research version of Gemini Diffusion reportedly demonstrated around 2,000 tokens per second, a figure Dillon cited from the preview that ran about a year before this talk. That claim comes from Google's own reporting on their research demo, not an independent benchmark, so treat it as directionally significant rather than a precise figure.
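The arithmetic behind that "roughly ten times" figure can be sketched in a few lines. This is a back-of-envelope model, not anything shown in the talk: it assumes generation is purely memory bound, that every pass streams the full set of weights exactly once, and it uses a made-up model size and memory bandwidth (MODEL_BYTES and BANDWIDTH_BYTES_PER_S are hypothetical placeholders). Only the 256-token and 24-pass figures come from Dillon's illustration.

```python
# Back-of-envelope sketch of memory-bound generation cost.
# Assumption: each forward pass moves all model weights across the
# memory channel once, and that transfer dominates the run time.

TOKENS = 256   # block size from Dillon's illustration
PASSES = 24    # refinement passes from Dillon's illustration

MODEL_BYTES = 10e9             # hypothetical: 10 GB of weights
BANDWIDTH_BYTES_PER_S = 1e12   # hypothetical: 1 TB/s memory bandwidth


def weight_loads_autoregressive(num_tokens: int) -> int:
    """Autoregressive decoding: one full weight load per generated token."""
    return num_tokens


def weight_loads_diffusion(num_passes: int) -> int:
    """Block diffusion: one full weight load per refinement pass."""
    return num_passes


def memory_bound_latency_s(weight_loads: int) -> float:
    """Seconds spent moving weights, ignoring all compute time."""
    return weight_loads * MODEL_BYTES / BANDWIDTH_BYTES_PER_S


ar_loads = weight_loads_autoregressive(TOKENS)
diff_loads = weight_loads_diffusion(PASSES)
ratio = ar_loads / diff_loads

print(f"autoregressive: {ar_loads} loads, {memory_bound_latency_s(ar_loads):.2f} s")
print(f"diffusion:      {diff_loads} loads, {memory_bound_latency_s(diff_loads):.2f} s")
print(f"ratio:          {ratio:.1f}x fewer weight loads")
```

The ratio (256 / 24, about 10.7) does not depend on the placeholder hardware numbers. They only turn the count of weight loads into an illustrative latency.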
The speed gap is consistent with what's emerging elsewhere in the diffusion space: Inception Labs' Mercury 2 model crossed 1,000 tokens per second with competitive reasoning quality, suggesting the architecture's per-request speed advantages are reproducible across different implementations, not just a Google-specific result.
The bidirectional thing is actually interesting
Speed is the headline. But the more technically interesting property—and the one with longer-term implications—is that diffusion models can attend to their own future output as they generate.
Autoregressive models are causally constrained: when generating word seven, they literally cannot see words eight through eighty. Diffusion models have no such constraint. They work on the whole canvas simultaneously. Which means they can, in principle, notice that what they said on line one contradicts what they concluded on line twelve, and go fix line one.
Dillon illustrated this with an arithmetic demo shown at Google I/O last year. Asked a multi-step calculation with a correct answer of 39, the Gemini Diffusion model initially answered 60 after its first pass. Then 49. Then, once it had worked through the full reasoning chain and could see the complete output, it revised the answer to 39. For comparison, Dillon showed the same problem given to GPT-4o and Gemini 2.5 Flash, both considerably larger models. GPT-4o initially said 40 and caught its mistake. Gemini 2.5 Flash said 42 and then, notably, incorporated the wrong answer into its subsequent reasoning, arriving at "36 + 3 = 42" rather than admitting error.
A caveat that belongs here: this was a conference demo, not a controlled evaluation. The specific responses from named production models are not independently reproducible, and model behavior shifts with versions and context. What Dillon is illustrating is a structural property of bidirectional attention, not a head-to-head benchmark. The direction of the argument is sound; the specific numbers are illustrative.
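The structural property at issue is simply which positions a model is allowed to look at while producing a given position. A minimal sketch, using plain Python lists rather than any real model's attention code (the function names here are illustrative, not from the talk):

```python
def causal_mask(n: int) -> list[list[bool]]:
    """Autoregressive rule: position i may see positions 0..i only."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n: int) -> list[list[bool]]:
    """Diffusion-style rule: every position may see every other position."""
    return [[True] * n for _ in range(n)]


def visible_positions(mask: list[list[bool]], i: int) -> list[int]:
    """List the positions that position i is allowed to attend to."""
    return [j for j, allowed in enumerate(mask[i]) if allowed]


n = 5
print(visible_positions(causal_mask(n), 1))         # [0, 1]
print(visible_positions(bidirectional_mask(n), 1))  # [0, 1, 2, 3, 4]
```

Under the causal rule, an early position never sees a later conclusion, so it cannot be revised in light of it. Under the bidirectional rule it can, which is what makes the self-correction in the demo possible in principle.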
The adaptive computation angle is genuinely novel. The model can determine for itself when it's done. Dillon showed examples where reciting the first 100 digits of pi took four refinement steps—the model has it memorized, done quickly—while explaining quantum mechanics in a single paragraph took 31 steps. It's not a user-set parameter; the model allocates its own effort. Harder tasks take longer. This is closer to how humans actually think than the fixed-cost-per-token model that currently dominates.
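A toy loop makes the idea of model-determined stopping concrete. To be clear about what is invented here: this is not how Gemini Diffusion decides it is done. The "denoiser" below is a stand-in that cheats by knowing the target text, the block starts fully masked rather than as random noise, and fixes_per_pass is a hypothetical knob standing in for how easy a task is. The only thing it demonstrates is the control flow: keep refining until a pass changes nothing, then stop.

```python
MASK = "#"  # placeholder for a not-yet-resolved position


def refine_once(draft: list[str], target: str, fixes_per_pass: int) -> list[str]:
    """Toy denoising pass: correct up to fixes_per_pass wrong positions."""
    new = list(draft)
    fixed = 0
    for i, (current, wanted) in enumerate(zip(draft, target)):
        if current != wanted and fixed < fixes_per_pass:
            new[i] = wanted
            fixed += 1
    return new


def generate(target: str, fixes_per_pass: int, max_passes: int = 64) -> tuple[str, int]:
    """Refine a fully masked block until a pass makes no changes."""
    draft = [MASK] * len(target)
    passes = 0
    while passes < max_passes:
        new = refine_once(draft, target, fixes_per_pass)
        passes += 1
        if new == draft:  # nothing changed: the block has converged
            break
        draft = new
    return "".join(draft), passes


easy_text, easy_passes = generate("the quick brown fox", fixes_per_pass=8)
hard_text, hard_passes = generate("the quick brown fox", fixes_per_pass=1)
print(easy_passes, hard_passes)  # 4 20
```

Both runs end with the same text, but the "harder" setting needs five times as many passes, and nobody told the loop how many to use. The last pass in each run is the one that confirms nothing is left to change.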
Why I'm writing about this under a privacy byline
Here's the thing my editor rightfully pushed me on: this is a cybersecurity and privacy column. What's the angle?
It's the demos.
Dillon showed four of them: a fake Wikipedia generated on the fly, a fake Reddit complete with AI comments and images, a fake operating system where every click generates the next screen, and a voice-coded app built in fifteen seconds. He presented them as proofs of concept for what low latency unlocks. And they are that. But they're also a preview of a specific threat model that I don't think gets enough attention.
Right now, the spread of synthetic content is limited partly because producing it is slow. Generating a convincing fake news page, a fake Reddit thread, or a fake product review section takes time and money, and that limits who can do it at scale. At 2,000 tokens per second (or higher, as these models mature) that friction disappears. The fake Wikipedia in Dillon's demo looks exactly like Wikipedia. The fake Reddit thread has comments, upvotes, the right visual weight. The operating system responds to clicks like a real OS.
This isn't theoretical. We already have a documented problem with AI-generated disinformation, fake reviews, and synthetic social media activity. Text diffusion doesn't create that problem, but it removes one of the remaining practical barriers to doing it cheaply and at scale. On-device deployment, which Dillon confirmed is already happening in parts of Alphabet's ecosystem, including robotics, means you don't even need a server bill. Fast, local, cheap.
The throughput problem that currently keeps diffusion models out of large-scale deployment also, coincidentally, keeps them out of large-scale misuse. Low latency for a single request is not the same as cheap serving for millions of them: because each refinement pass typically reprocesses the whole block, diffusion spends more total computation per token than autoregressive decoding does. Dillon was direct about this: "no one's landing text diffusion into any of these big models primarily because of that disadvantage. It's just too expensive to serve." That's a temporary constraint. Inception Labs is already closing the gap from one direction. DeepMind is signaling new releases soon. The economics of this will shift.
What I'm actually watching—the thing I think matters most for the people who read this column—is what happens to ambient trust in digital content when AI text generation gets fast enough to be invisible. Not "will deepfakes get better" (yes). Not "can models reason more accurately" (also yes). The specific question is: when generating a convincing fake interactive website costs roughly nothing and takes milliseconds, what verification infrastructure do we have, and who's building it?
Dillon didn't address this in his talk, which is fair—he's a research scientist explaining an architecture, not a policy analyst. But the "fake Wikipedia" he demoed so cheerfully as a latency stress test is exactly the kind of artifact that, in adversarial hands, is designed to be indistinguishable from the real thing.
The technology is genuinely impressive. The self-correction, the adaptive computation, the in-place editing that lets a model revise specific sections without regenerating everything—these are meaningful advances over the word-by-word approach that's dominated language models for years. If you use AI tools at work, faster and more accurate outputs are straightforwardly good.
But speed has a shadow. The same property that lets a model build a to-do app in fifteen seconds by voice also lets someone build a fake community forum in the time it takes to read this article.
I want to be clear that I'm not arguing the technology shouldn't exist, or that Dillon and his team are doing something wrong. I'm arguing that "look how fast we can generate convincing synthetic content" needs to be followed by a harder conversation about what happens when that capability is cheap, local, and widely available—and that conversation is not yet happening at the same speed as the research.
Rachel "Rach" Kovacs is Buzzrag's cybersecurity and privacy correspondent.