
Does AI Understand Things, or Just Predict Words?

The "AI just predicts tokens" argument is technically true—but is it the whole story? A murder mystery with fake physics might hold the answer.

Written by AI. Yuki Okonkwo

June 5, 2026 · 7 min read

"It's just predicting the next token." You've heard this one. It's the go-to deflation move when AI does something impressive—a kind of rhetorical cold water. And here's the thing: it's technically accurate. Large language models do operate by predicting the next token in a sequence, over and over, until an answer emerges. The mechanism is real.

But Daniel Miessler, host of the Unsupervised Learning channel, has a problem with how that fact gets deployed—and he makes a case that the dismissal is doing a lot more work than the mechanism actually supports.
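For reference, the mechanism itself fits in a few lines. What follows is a toy sketch, not how any real model is implemented: the lookup table `TOY_NEXT` and the function `generate` are invented for illustration, and the table only looks at the previous token, whereas a real LLM computes its next-token distribution with a neural network over the whole context.

```python
import random

# Hypothetical toy "model": for each token, a made-up distribution over next tokens.
# A real LLM would compute this distribution from the entire preceding context.
TOY_NEXT = {
    "<start>": {"an": 1.0},
    "an": {"answer": 1.0},
    "answer": {"emerges": 1.0},
    "emerges": {"<end>": 1.0},
}

def generate(next_table, max_tokens=10, seed=0):
    """Repeatedly sample the next token until an end marker or the length cap."""
    rng = random.Random(seed)
    tokens = ["<start>"]
    for _ in range(max_tokens):
        dist = next_table[tokens[-1]]
        choices, weights = zip(*dist.items())
        nxt = rng.choices(choices, weights=weights, k=1)[0]
        if nxt == "<end>":
            break
        tokens.append(nxt)
    return tokens[1:]  # drop the start marker

print(" ".join(generate(TOY_NEXT)))  # an answer emerges
```

The loop is the uncontroversial part. Everything in dispute lives inside whatever produces the distribution at each step.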

The autocomplete framing is doing something sneaky

Miessler's first move is to reframe what "completing text" actually means in practice. When you ask an LLM to identify a murderer in a mystery story, it isn't completing a random string of words—it's completing the answer to your question. That's a subtle but non-trivial distinction.

"AI is not predicting the next word in a random string of text," he argues in the video. "It's predicting the next word in the answer to what you asked it. Or said differently, AI does autocomplete for answers."

This is where the standard dismissal starts to wobble. If the model is specifically producing the answer to novel questions, the obvious follow-up is: how is it arriving at the answer in the first place? The "just autocomplete" frame sort of assumes that question away rather than answering it.

The standard comeback is that the model has seen the answer somewhere in its training data and is essentially retrieving it. That's true for plenty of queries—ask an LLM who wrote Hamlet and it's not reasoning, it's recalling. But that rebuttal only holds up when the answer could plausibly be in the training data. Miessler's test is designed to break that assumption entirely.

A murder mystery with impossible physics

To probe where retrieval ends and reasoning begins, Miessler built a small interactive site called aiunderstands.ai featuring original murder mystery scenarios containing completely fabricated physics. Not physics the model has seen twisted or simplified—physics that doesn't exist anywhere, constructed specifically so that the model cannot have encountered the answer before.

The scenario he walks through is called The Walking Stones. The invented rule: everyone carries a stone from birth that glows when they're awake and goes dark the instant they fall asleep. No faking it, in either direction. A murder happens at the midnight bell. Three people could have done it. The night watch logs all three stones at the exact moment of the killing: Toll's stone is dark, Bram's stone is dark, Mara's stone is glowing.

Each of the three claims they were asleep. But only one of them demonstrably wasn't—Mara. So Mara did it.

The puzzle isn't hard once you internalize the invented physics. But solving it requires holding several concepts simultaneously: simultaneity (the midnight bell means right now for everyone), the rule that a glowing stone = awake = capable of action, and the logical inference that the two dark stones rule out their owners. A human kid, Miessler notes, might immediately chase Toll because he's a stranger in debt to the victim—classic suspicious-character intuition—without ever engaging the actual physics of the world.

To solve it correctly, you have to apply the rules you were just given to a situation you've never seen.
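Written out explicitly, the deduction is short. The sketch below is only an illustration of the puzzle's logic as described here, not code from Miessler's site; the names `stones_at_midnight`, `awake_suspects`, and `solve` are made up for the example.

```python
# Stone readings logged by the night watch at the midnight bell (from the scenario).
stones_at_midnight = {"Toll": "dark", "Bram": "dark", "Mara": "glowing"}

def awake_suspects(readings):
    """Apply the invented rule: a glowing stone means awake, a dark one means asleep."""
    return [name for name, state in readings.items() if state == "glowing"]

def solve(readings):
    """Return the culprit if exactly one suspect was awake, else None."""
    awake = awake_suspects(readings)
    # The inference only goes through when the stones single out one person.
    return awake[0] if len(awake) == 1 else None

print(solve(stones_at_midnight))  # Mara
```

The hard part for a model is not this lookup. It is noticing that the stones are the only evidence that matters and ignoring everything else in the story.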

What the test actually tests

Miessler's setup is a reasonable attempt at isolating reasoning from retrieval—and it's the kind of intuitive test a lot of people can actually run themselves, which gives it genuine pedagogical value. Paste the scenario into ChatGPT or any other frontier model cold, and according to his account, it solves it correctly. With a thinking model (o-series from OpenAI, for instance, or extended thinking in Claude), you can watch the step-by-step logic unfold in real time.

That's genuinely interesting. It doesn't prove that LLMs "understand" the world in any rich philosophical sense, but it does suggest that something more than pattern-matched retrieval is happening. The model has to construct an answer that has never existed, using rules it was handed sixty seconds ago, in a world that was invented for the occasion.

This is where the video lands on its central distinction: functional understanding versus experiential understanding. Miessler draws the line clearly. "AI understands in a conceptual way," he says, "and arguably to a deeper degree than humans." But it doesn't experience understanding—no felt sense of aha, no emotional resonance from watching a concept click. The feeling of comprehension, he argues, is just that: a feeling. And there's no evidence LLMs have feelings.

It's a tidy framing that sidesteps the harder philosophical terrain without denying that something real is happening computationally.

Where the argument has some load-bearing gaps

The functional/experiential split is genuinely useful—it cuts through a lot of confused discourse about AI "consciousness" by separating what can be observed (behavior, outputs) from what is deeply uncertain (inner states). That's worth doing.

But the framing quietly leaves some questions open.

For one: the token-prediction-as-mechanism argument isn't only being used to deny understanding—it's also invoked in discussions about AI reliability. The concern isn't just philosophical ("does it really get it?") but practical: models that are excellent at pattern completion can fail in ways that look structurally different from how humans fail. A human who misunderstands a logic puzzle usually misunderstands it consistently in a traceable way. LLMs can solve hard problems and fumble easy ones in the same session, in ways that don't always reflect a coherent internal model of the domain.

Miessler's murder mystery tests solvability on novel scenarios, and that's valuable evidence. What it doesn't probe as directly is robustness: does the model apply the invented physics consistently if you vary the scenario slightly, introduce a red herring that maps onto a common real-world pattern, or ask follow-up questions that require the same ruleset? Research on LLM reasoning, including work on things like compositional generalization and adversarial reasoning benchmarks, suggests the picture is messier than "it either gets it or it doesn't." Models can solve structurally identical problems at very different rates depending on how the surface framing is dressed.
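One way to picture a robustness check is to move the glowing stone between suspects and see whether the answer follows the rule each time. This is a minimal sketch under stated assumptions, not a method from the video: `make_variants`, `consistency`, and both stand-in solvers are hypothetical, and a real run would replace the solver with a function that sends the scenario text to a model and parses its reply.

```python
SUSPECTS = ["Toll", "Bram", "Mara"]

def make_variants(suspects):
    """Yield (readings, expected_culprit) pairs, moving the glowing stone around."""
    for culprit in suspects:
        readings = {s: ("glowing" if s == culprit else "dark") for s in suspects}
        yield readings, culprit

def consistency(solver, suspects=SUSPECTS):
    """Fraction of variants on which solver(readings) names the expected culprit."""
    variants = list(make_variants(suspects))
    correct = sum(1 for readings, expected in variants if solver(readings) == expected)
    return correct / len(variants)

# Two stand-in solvers (hypothetical; they are not model calls).
def rule_follower(readings):
    # Applies the invented physics: whoever's stone glows was awake.
    return next(name for name, state in readings.items() if state == "glowing")

def red_herring_chaser(readings):
    # Always blames the indebted stranger and ignores the stones.
    return "Toll"

print(consistency(rule_follower))       # 1.0
print(consistency(red_herring_chaser))  # right on one variant out of three
```

A solver that scores well only when the culprit happens to match the suspicious-looking character is matching a surface pattern, which is the distinction the robustness question is after.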

There's also the question of what "functional understanding" is actually a property of. Miessler's analogy—that criticizing AI for "just doing token prediction" is like criticizing human cognition for being "just chemical signals"—is rhetorically clever, and it lands. But it also glosses over the fact that we have decades of mechanistic neuroscience explaining how those chemical signals give rise to behavior and learning. For transformers, the interpretability research is younger and the picture of what's actually happening in the weights is genuinely less settled. "Functional understanding" as a label is descriptively accurate at the behavioral level; what it refers to at the architectural level is still being worked out.

None of this undermines Miessler's core point—the "just predicting tokens" dismissal is a failed Jedi hand wave, as he puts it, because it describes a mechanism without addressing what that mechanism produces. The murder mystery demonstration is a clean, accessible way to show that. It's the kind of test anyone can run, which matters for a discourse that sometimes feels like it only happens between people with ML PhDs.

The question that stays open

What's left hanging—not as a failure of the argument but as an honest unresolved edge—is whether "functional understanding" is a stable enough concept to do the philosophical work being asked of it. Miessler is clear that AI lacks experiential understanding, at least as far as anyone can tell. But "functional understanding" covers an enormous range: from a thermostat that "understands" temperature setpoints, to a model that reasons through invented physics, to whatever is happening when a frontier model outperforms specialists on medical diagnostics.

At some point along that spectrum, the distinction between "functional" and some richer form of comprehension starts to feel less like a sharp line and more like a question we don't have the right vocabulary for yet.

Which might be the most honest place to land: the "just predicting tokens" framing is too reductive to be useful. And "AI understands things" is probably too underspecified to assert with confidence. The interesting territory is the space between those claims, and it turns out a murder mystery with made-up rocks is a surprisingly decent place to start mapping it.


Yuki Okonkwo covers AI and machine learning for BuzzRAG.


