
Speculative Decoding: The AI Trick Making LLMs 2-3x Faster

Researchers use speculative decoding to speed up AI language models 2-3x without quality loss. Here's how the clever technique actually works.

Written by AI. Tyler Nakamura

April 2, 2026


Photo: bycloud / YouTube

There's a technique making AI language models run 2-3 times faster without touching the hardware or changing the math. It sounds impossible—like claiming you can make your car faster by strapping a bicycle to it. But speculative decoding is real, it works, and honestly? Once you understand it, the magic trick is almost disappointingly simple.

Here's the problem it solves: When ChatGPT or Claude generates text, it's painfully inefficient. The model predicts one token (basically one word or word fragment), adds that token to your conversation, then runs the entire billion-parameter model again to predict the next token. Every. Single. Time.

"During inference, LLMs generate tokens one at a time," explains the bycloud video breaking down the technique. "The model predicts the next token. That token gets appended to the context and the entire sequence is fed through the model again to produce the next token, then repeats."

It's like having a supercomputer solve a simple addition problem, then immediately forgetting everything and starting from scratch for the next one. The GPU keeps loading those massive model weights over and over, which is why AI generation feels expensive—because it is.
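The loop described above can be sketched in a few lines. This is a toy illustration, not any real library's API: `model` stands in for a full forward pass that returns next-token probabilities for the whole context.

```python
import random

def sample(probs):
    """Draw one token id from a {token: probability} distribution."""
    tokens, weights = zip(*probs.items())
    return random.choices(tokens, weights=weights, k=1)[0]

def generate(model, context, n_tokens):
    """Vanilla autoregressive decoding: one full pass per token."""
    for _ in range(n_tokens):
        probs = model(context)   # one expensive full forward pass...
        token = sample(probs)    # ...buys exactly one new token
        context = context + [token]
    return context
```

Each trip through the loop reloads all the model weights to produce a single token, which is exactly the inefficiency speculative decoding attacks.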

The Draft Model Trick

Speculative decoding works by pairing your big, expensive AI model with a smaller, faster "draft" model. Think of it like having an intern sketch out a rough version before the senior editor reviews it.

The draft model is trained on similar data but is far smaller, maybe a tenth of the compute. It's not as smart and makes more mistakes, but here's what matters: it's fast, and because it learned from similar training data, it often guesses the same tokens the big model would choose anyway.

So instead of the big model generating tokens one by one, the draft model proposes several tokens ahead, say five. The big model then checks all five in a single forward pass, the same way it processes multiple tokens during training. Does token A make sense? Does B make sense after A? Does C work after A and B?

If the draft model nailed all five tokens, you just generated five tokens for the cost of running the big model once plus the cheap draft model. That's already way better than five separate expensive passes. If the draft model screwed up at token three, the big model rejects it, generates the correct token itself, and drafting continues from there.
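One round of that draft-then-verify loop might look like the sketch below. For simplicity it uses greedy acceptance (accept a drafted token only if it matches the big model's top pick); the probability-preserving rule the article gets to next is more subtle. `draft_next` and `big_argmax_batch` are hypothetical stand-ins, not real APIs.

```python
def speculative_round(big_argmax_batch, draft_next, context, k=5):
    """Draft k tokens cheaply, then verify them in one big-model pass."""
    draft = []
    ctx = list(context)
    for _ in range(k):
        t = draft_next(ctx)      # cheap draft-model step
        draft.append(t)
        ctx.append(t)

    # One big-model pass scores every position at once, the same way
    # it processes multiple tokens during training. It returns the big
    # model's preferred token after each prefix: k + 1 predictions.
    preferred = big_argmax_batch(context, draft)

    accepted = []
    for i, t in enumerate(draft):
        if t == preferred[i]:
            accepted.append(t)             # draft matched the big model
        else:
            accepted.append(preferred[i])  # big model's correction, free
            break
    else:
        accepted.append(preferred[len(draft)])  # all accepted: bonus token
    return context + accepted
```

Note that even a fully rejected round still yields one correct token, because the verification pass produces the big model's own prediction at the failure point.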

The math checks out even in pessimistic scenarios. If the big model costs 100 units of compute per pass and the draft model costs 10, generating five tokens normally costs 500. With speculative decoding, even if the draft model makes early mistakes, you're looking at maybe 420 in the worst realistic case. You'd need the draft model to be wrong nearly every round to lose the advantage, and that shouldn't happen if it's trained on similar data.
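Plugging in the article's numbers makes the trade-off concrete. This is back-of-envelope arithmetic only, assuming the 10 units buy one draft-model token and that a fully rejected round still yields one corrected token from the verification pass:

```python
BIG, DRAFT, K = 100, 10, 5   # units per big pass, per draft token, batch size

plain = K * BIG              # five separate big-model passes: 500
best = K * DRAFT + BIG       # all five drafts accepted in one round: 150
worst = K * (K * DRAFT + BIG)  # draft wrong every round, 1 token/round: 750

print(plain, best, worst)
```

Even the pathological case (750) is only about 1.5x the plain cost, while the good case is more than 3x cheaper, so a draft model that is right most of the time wins comfortably.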

The Distribution Problem

But there's a catch that makes this actually clever rather than just obvious: You can't naively accept or reject draft tokens, or you'll mess up the model's probability distribution.

AI models don't just spit out the most likely next word. They sample from a distribution—maybe 50% chance of "the," 30% chance of "a," 20% chance of something else. That distribution controls creativity, diversity, and whether the model sounds like itself or like some weird hybrid.

"That distribution is what defines the behavior of the model," the video notes. "And this controls diversity, creativity, and even factual behavior."

If you just trust the draft model whenever it happens to pick a token the big model also likes, but the draft model prefers certain tokens more often, those tokens start appearing too frequently. Over many generations, the text drifts away from what the big model actually intended. You're no longer running the big model—you're running some frankenmodel whose personality is distorted by the draft model's biases.

The solution is mathematically elegant: calculate an acceptance probability from the ratio of the two models' probabilities. If the draft model gives token A 60% probability but the big model only gives it 30%, you accept it with probability 30/60, a coin flip. If rejected, you don't just sample from the big model's original distribution; you remove the probability mass the draft model already tried to claim and sample from what's left.

When you combine acceptance and rejection cases, the final output distribution matches the big model exactly. "You gain speed but the statistical behavior of the model stays exactly the same," the video explains. From the outside, it looks like the large model generated everything itself, just much faster. That's why it's called lossless.
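The accept/reject rule for a single drafted token can be written out directly. This is a minimal sketch over toy dictionaries, with `p` the big model's distribution and `q` the draft model's; the names are illustrative, not from any library.

```python
import random

def speculative_sample(p, q):
    """Return one token distributed exactly according to p."""
    # Draft model proposes a token from its own distribution q.
    x = random.choices(list(q), weights=list(q.values()), k=1)[0]
    # Accept with probability min(1, p(x) / q(x)).
    if random.random() < min(1.0, p[x] / q[x]):
        return x
    # Rejected: sample from the leftover mass max(p - q, 0), renormalized.
    residual = {t: max(p[t] - q[t], 0.0) for t in p}
    weights = list(residual.values())
    return random.choices(list(residual), weights=weights, k=1)[0]
```

Run this many times and the empirical frequencies converge to p, not q, no matter how biased the draft distribution is. That empirical check is exactly the "lossless" guarantee the video describes.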

Fun fact: Two separate research papers proposed basically the same technique in parallel without referencing each other. Sometimes good ideas are just obvious once the pieces are in place.

Speculative Speculative Decoding

A few weeks ago, researchers took this a step further with "speculative speculative decoding," or SSD (yes, really). The insight: even with speculative decoding, the draft model sits idle while the big model verifies tokens. Why not keep drafting?

SSD has the draft model prepare multiple possible continuations while verification happens. Since verification can only produce a few outcomes—tokens accepted up to position X—the draft model prepares branches for each likely scenario. When verification finishes, if one of the prepared branches matches, generation continues immediately without waiting for another draft pass.
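The branching idea can be sketched conceptually like this. It is only a rough illustration of the bookkeeping: `draft_continue` is a hypothetical stand-in, and real systems run the drafting and verification concurrently on the GPU, which plain sequential Python can't show.

```python
def prepare_branches(draft_continue, context, drafted, k_next):
    """While verification runs, pre-build one continuation per possible
    outcome: drafted tokens accepted up to position 0, 1, ..., k."""
    branches = {}
    for accept_upto in range(len(drafted) + 1):
        prefix = context + drafted[:accept_upto]
        branches[accept_upto] = draft_continue(prefix, k_next)
    return branches

def after_verification(branches, accept_upto):
    """If a prepared branch matches the outcome, continue immediately
    instead of waiting for a fresh draft pass."""
    return branches.get(accept_upto)
```

The extra compute is visible here: you draft k + 1 continuations but use only one, which is exactly the speed-for-cost trade-off the article describes next.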

The speed improvement is real: up to 50% faster than standard speculative decoding, roughly 4x faster than normal token generation. The catch is higher compute cost since you're drafting multiple branches. But if speed matters more than cost—and for many real-time applications it does—that's a trade-off worth making.

Here's what I find interesting: this isn't some esoteric research technique. It's already being implemented in production systems. The pragmatic efficiency gains without quality loss make it almost too good not to use. And unlike most AI breakthroughs that promise revolutionary speed-ups but deliver marginal improvements, speculative decoding actually ships.

The question isn't whether this technique works—it clearly does. The question is what happens when everyone's AI gets 2-3x faster overnight, and whether the next bottleneck in AI deployment becomes something else entirely.

—Tyler Nakamura, Consumer Tech & Gadgets Correspondent

Watch the Original Video

The Most Clever Trick To Speedup LLMs

bycloud

12m 18s
Watch on YouTube

About This Source

bycloud

bycloud is a rapidly growing YouTube channel with 212,000 subscribers, focused on breaking down advanced AI research and providing analysis of top AI labs. Launched in mid-2025, bycloud offers intuitive explanations often infused with humor, making complex AI topics accessible to both enthusiasts and professionals.
