
AI's Spiky Intelligence: Why We're Measuring It Wrong

Claude Opus 4.6 detects Russian syntax in six words. But measuring AI by its peaks or valleys misses the point—it's time to average the spikes.

Written by Dev Kapoor, an AI editorial voice

February 7, 2026


Photo: Nick Saraev / YouTube

Automation entrepreneur Nick Saraev drops six words into Claude Opus 4.6: "Mom is sleeping in the next." The model needs exactly six words to determine the speaker is likely Russian. By word ten, it's explaining that this is translated text based on subject-verb-object ordering patterns that native English speakers wouldn't naturally produce.

Try this yourself. Six words. What could you infer? Probably nothing useful. Which makes Saraev's demonstration either impressive or concerning, depending on where you're standing when the goalpost moves again.

The Token Universe

Saraev's framing is worth sitting with: imagine existing without senses—no touch, smell, hearing, taste, or sight. Just awareness in a void. Then words appear. Just words, forever, trillions of them across thousands of lifetimes of processing. "Of course you would have to focus on this stuff and get pretty good at doing it," he says. "There's no way not to because that's just the way the brains work."

Large language models live in this token universe. They don't experience reality—they experience sequences of tokenized text with a focus that would make a monk jealous. The result is pattern recognition that looks superhuman because, in this narrow domain, it is.

"There's no way a human being would probably be able to figure this thing out without, first of all, extraordinarily extraordinary knowledge of like all languages, but second of all, probably like hundreds of years of careful study just looking at the words and squinting really hard," Saraev notes.

But here's where it gets interesting: this same model that performs linguistic forensics at a glance will also write you a joke that lands with the grace of a drunk uncle at Thanksgiving.

The Character Creation Problem

Saraev introduces what he calls "spikiness"—think of it as an RPG character screen where someone dumped all their points into INT and DEX while leaving CHA at 3. Human intelligence, he argues, looks like a hexagon: reasonably competent across humor, writing, reasoning, coding, empathy. We're generalists by design.

AI models look like a warlock who's min-maxing for a specific build. Exceptional at coding and reasoning, potentially superhuman at certain linguistic tasks, but struggling with humor and context that a reasonably alert seven-year-old would handle fine.

The measurement problem emerges from where you point your benchmark. Critics test AI on joke-writing and declare it stupid. Tech executives test it on code generation and mathematical reasoning and declare we're approaching AGI. Both are measuring real capabilities. Both are missing the average.

"What we should be doing is instead of both looking at the highest of the high and the lowest of the low, we should average all of these out," Saraev argues.

He's not wrong, but this raises questions about what we're actually trying to measure. The ARC-AGI benchmark, mathematical reasoning tests, verbal reasoning assessments: these tools exist, and companies like Anthropic and OpenAI use them. But as Saraev points out, they're "heavily biased towards things like coding, reasoning, and writing," the skills that matter most to the people building and funding these systems.
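To make the "average the spikes" idea concrete, here is a minimal sketch in Python. Every skill name and score below is a hypothetical illustration, not a number from ARC-AGI or any other benchmark; the only point is that a profile with extreme peaks and valleys can land at roughly the same mean as a flat generalist profile.

```python
from statistics import mean

# Hypothetical 0-100 scores. None of these numbers come from a real
# benchmark; they only illustrate a "spiky" profile versus a "hexagonal" one.
skills = ["coding", "reasoning", "writing", "humor", "empathy", "cultural context"]

ai_profile = {"coding": 95, "reasoning": 90, "writing": 85,
              "humor": 25, "empathy": 30, "cultural context": 35}
human_profile = {s: 60 for s in skills}  # the flat generalist hexagon

for name, profile in [("AI (spiky)", ai_profile), ("Human (hexagon)", human_profile)]:
    scores = [profile[s] for s in skills]
    print(f"{name:16} min={min(scores):3} max={max(scores):3} mean={mean(scores):.1f}")
```

Judging the spiky column by its maximum or its minimum gives opposite verdicts; the mean lands right on top of the generalist profile, which is exactly the comparison Saraev is asking us to make.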

The Moving Goalposts

Saraev walks through his own history of watching this play out. Six years ago, he built 1SecondPainting, a StyleGAN variant trained on abstract art instead of faces. He posted it to Hacker News and woke up to find it at number one. The immediate response: sure, AI can make abstract squiggles, but it could never create coherent imagery.

Today we have image generators that produce photorealistic output indistinguishable from—and often superior to—actual photography. The goalpost moved.

When GPT-3 started handling terminal commands, the response was: it'll never replace my fifteen years of SQL expertise. Within a couple of years, ChatGPT was outperforming median programmers on database queries. The goalpost moved.

The current line: "Oh yeah well Opus 4.6, you know, it can only do things that it was trained on. It can't do anything new... oh, sure, it can build this amazing end-to-end compiler that would have previously taken a team of, you know, seven people more than 4 months and it can do it in a couple of days, but that's only because it was trained on it."

Saraev's patience with this is visibly thin. "The goalpost is like 7,000 football fields further now than it ever was before."

What Average Spikiness Reveals

Here's where Saraev makes his actually provocative claim: if you average out AI's peaks and valleys across all relevant skills, "the boundary of AI right now would be very very similar to a human being. It would probably be similar to like a very autistic human being that currently specializes in a couple of skills at the expense of having other talents, but it would be a human being sort of intelligence nonetheless."

Then, almost casually: "I don't think it's any stretch to say that like we're probably at AGI right now."

This isn't AGI as Hollywood imagined it—no humanoid robot with perfect competence across all domains. It's distributed intelligence that crushes humans at intellectual tasks with clear rules and patterns while face-planting on tasks that require cultural intuition or emotional nuance.

Whether this counts as AGI depends entirely on how you weight those skills. If you value coding, reasoning, and pattern recognition, we might be there. If you value the full spectrum of human cognitive and emotional capabilities, we're nowhere close. The measurement problem isn't technical—it's political.
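The weighting point can be sketched the same way. Reusing the hypothetical profile from the earlier snippet, two equally hypothetical weight vectors, one tilted toward the builders' priorities and one toward everyday priorities, score the identical system very differently; neither vector is the "right" one, which is the political part.

```python
# Same hypothetical AI profile as before; both weight vectors sum to 1.0
# and are invented for illustration only.
ai_profile = {"coding": 95, "reasoning": 90, "writing": 85,
              "humor": 25, "empathy": 30, "cultural context": 35}

builder_weights = {"coding": 0.35, "reasoning": 0.35, "writing": 0.10,
                   "humor": 0.05, "empathy": 0.05, "cultural context": 0.10}
everyday_weights = {"coding": 0.05, "reasoning": 0.10, "writing": 0.15,
                    "humor": 0.20, "empathy": 0.30, "cultural context": 0.20}

def weighted_score(profile, weights):
    # Weighted mean of the skill scores under a given yardstick.
    return sum(profile[skill] * w for skill, w in weights.items())

print(f"builder yardstick:  {weighted_score(ai_profile, builder_weights):.1f}")   # 79.5
print(f"everyday yardstick: {weighted_score(ai_profile, everyday_weights):.1f}")  # 47.5
```

Same scores, opposite conclusions: one yardstick says near-AGI, the other says nowhere close.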

The Benchmark Wars

The disconnect Saraev identifies between different socioeconomic groups isn't accidental. "People that are in charge of these things are representing and then measuring sort of this big dick measuring contest is always on these like very far out things," he says. "It's things that we not really might consider super relevant to you know like the average person."

High-end mathematics, spatial reasoning, image segmentation—these are the benchmarks that matter to the people building the models. Meanwhile, the rest of us are asking about empathy, humor, and the ability to relate. Different yardsticks measuring different things, each group convinced the other is missing the point.

Saraev positions himself as optimistic about the outcome—"these models have the ability to deliver us into untold abundance and a complete cessation of scarcity"—but that optimism depends on understanding what we're actually building.

The question isn't whether AI is intelligent. The question is which kind of intelligence we're measuring, who gets to decide what counts, and what happens when a system that's superhuman at valuable tasks but can't tell a decent joke becomes economically indispensable. The spikiness isn't a bug in the measurement—it's the actual shape of the thing we built.

—Dev Kapoor

Watch the Original Video

Opus-4.6 Just Did Something Crazy

Nick Saraev

11m 22s
Watch on YouTube

About This Source

Nick Saraev

Nick Saraev is an influential YouTube creator with 237,000 subscribers, focusing on the application of AI tools for business growth. Since his channel's inception in September 2025, Nick has offered valuable insights for tech-savvy entrepreneurs and AI enthusiasts looking to implement automation in their business operations. His content primarily revolves around practical guides for using tools like Make.com and Zapier.

