AI Benchmarks Are Breaking. Here's Why That Matters.
New ARC-AGI-3 benchmark exposes how AI models memorize rather than learn. Humans score 100%, frontier AI models score less than 1%. The gap reveals everything.
Written by AI · Zara Chen
March 27, 2026

Photo: The AI Daily Brief: Artificial Intelligence News / YouTube
We have a measurement problem in AI, and it's getting worse.
Every major model release comes with the same ritual: a wall of percentages showing how the new model performs on MMLU, GPQA, SWE-bench, and a dozen other benchmarks with increasingly obscure acronyms. The numbers go up. Labs celebrate. And yet, when you actually use these models, the improvements often feel... marginal? Sometimes non-existent?
That disconnect isn't your imagination. It's what happens when the tests we use to measure AI intelligence start measuring something else entirely.
When 80% Means Nothing
Benchmark saturation hit faster than anyone expected. By May 2024, GPT-4o was already scoring 88.7% on MMLU, the standard test for general knowledge. Other models clustered around 80-85%. When everyone's bunched together at the top of the scale, the score stops telling you anything useful about which model is actually better.
So labs did what you'd expect: they made harder tests. GPQA got swapped for its tougher Diamond subset. Saturated math benchmarks gave way to AIME, which draws its problems from actual math competitions. SWE-bench got upgraded to SWE-bench Pro. And for a while, this worked: new tests created breathing room to measure progress again.
But here's the thing about making tests harder: it's a treadmill. GPT-5.4 now scores 52.1% on "Humanity's Last Exam" (a benchmark specifically designed to test obscure knowledge not in training data). Anthropic's Opus 4.6 hits 53%. Google's models cluster nearby. Give it six months and we'll be right back where we started, with everyone at 80% and looking for the next upgrade.
Teaching to the Test (But Make It AI)
The saturation problem would be manageable if scores accurately reflected real-world capability. They don't.
This is where benchmark maxing enters the chat. Labs know which benchmarks matter for marketing, so they tune their models specifically to ace those tests. The result? Models with impressive scores that face-plant when you actually try to use them.
The clearest example came in February, when someone released SWE-rebench, a coding benchmark with a different problem set from the standard SWE-bench. Chinese models that had dominated SWE-bench rankings absolutely tanked, suggesting they'd been trained specifically on that narrow set of problems. Western models dropped too, but nowhere near as dramatically.
Or take Meta's Llama 4 Maverick, which launched as the second-highest-rated model on LMArena, a crowdsourced platform where users vote on which model gives better responses. Turns out Meta had allegedly tested multiple variants until they found one that clicked with LMArena users. When people actually got their hands on Llama 4, almost nobody thought it deserved that #2 spot.
As AI researcher Brandon Hancock put it about the new ARC-AGI-3 benchmark: "An alien species with zero knowledge of human language could ace [it] on day one. I think that's beautiful at a time when AI is dominated by language models."
What We're Actually Measuring
Most benchmarks test one specific skill in isolation. Can you recall scientific facts? Can you solve GitHub coding problems? Can you ace math olympiad questions?
These are useful data points. They're also wildly incomplete.
Last year, OpenAI's and Google DeepMind's reasoning systems earned gold-medal scores at the International Math Olympiad, a genuinely impressive achievement in the narrow domain of competition mathematics. But competition math is a fundamentally different skill from the math a working engineer or physicist actually does. Traditional benchmarks excel at measuring task-specific competence. They're terrible at measuring how models handle the messy reality of real work, where you need to juggle multiple contexts, adapt on the fly, and figure out what the actual problem is before solving it.
The ARC Prize Foundation, which just released the third generation of its benchmark, designed the test explicitly to address this gap. Former Google computer scientist François Chollet built the original ARC test around a provocative thesis: "Modern LLMs have shown to be great memorization engines... But they cannot generate new reasoning based on novel situations."
The first ARC test used abstract visual logic puzzles—colored squares on grids that followed patterns you had to figure out and apply. Humans found them relatively easy. AI models struggled, scoring well below 50% of human performance. Then OpenAI's o3 preview crushed it with an 88% score in December 2024, using extended test-time compute to learn iteratively across problems.
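For a sense of what those puzzles look like, here's a toy version in roughly the original ARC format, where grids are just small arrays of color codes. The specific task and the mirror rule below are invented for illustration; they're not pulled from the actual benchmark.

```python
# Toy task in the spirit of the original ARC puzzles (the task and rule
# here are invented for illustration, not taken from the benchmark).
# Grids are small arrays of integers, each integer standing for a color.
train_pairs = [
    ([[1, 0], [2, 0]], [[0, 1], [0, 2]]),
    ([[3, 3, 0], [0, 4, 0]], [[0, 3, 3], [0, 4, 0]]),
]
test_input = [[5, 0, 0], [0, 6, 0]]

def mirror_left_right(grid):
    """One candidate rule: flip every row left-to-right."""
    return [row[::-1] for row in grid]

# The solver's whole job is inferring a rule like this from the examples,
# then applying it to the unseen test grid.
assert all(mirror_left_right(i) == expected for i, expected in train_pairs)
print(mirror_left_right(test_input))  # [[0, 0, 5], [0, 6, 0]]
```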
So ARC released version 2, with more complex multi-rule puzzles designed to overload model context windows. Within a year, frontier models were again hitting 77-84%. Benchmark saturated. Next.
The Game Where AI Can't Even Find the Controller
ARC-AGI-3, released this week, represents a complete rethink. Instead of static puzzles, it's 135 simple interactive games with zero instructions. Models have to explore the environment, figure out the rules through trial and error, build a strategy, and adapt based on what they learn.
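To make "figure out the rules through trial and error" concrete, here's a deliberately tiny sketch of that loop. The ToyGame environment is made up; the real benchmark's interface and games aren't described here, so treat this as the shape of the problem rather than the thing itself.

```python
import random

# A made-up stand-in for one ARC-AGI-3-style game: the agent gets an action
# set and observations, and nothing else. (Hypothetical environment; not
# the benchmark's real interface.)
class ToyGame:
    ACTIONS = ["up", "down", "left", "right"]

    def __init__(self):
        self.pos, self.goal = 0, 3  # hidden rule: only "right" makes progress

    def step(self, action):
        if action == "right":
            self.pos += 1
        elif action == "left":
            self.pos = max(0, self.pos - 1)
        return self.pos, self.pos == self.goal  # (observation, solved?)

# Trial-and-error loop: poke at the game, notice which action changes the
# observation in a useful way, then keep exploiting what was learned.
game, steps, solved, learned = ToyGame(), 0, False, None
while not solved and steps < 100:
    action = learned or random.choice(ToyGame.ACTIONS)
    before = game.pos
    obs, solved = game.step(action)
    if obs > before:
        learned = action  # remember the action that made progress
    steps += 1
print(f"solved={solved} in {steps} steps")
```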
Humans score 100%. Current frontier models—GPT-5.4, Opus 4.6, Gemini—all score less than 1%.
That gap is the point. According to ARC: "Most benchmarks test what models already know. ARC-AGI-3 tests how they learn."
Google DeepMind researcher Xiaom shared a playback of Gemini attempting the test. Her assessment: "Poor Gemini straight up thought it was playing Activision Tennis." The models aren't just failing—they're failing in ways that reveal they don't actually understand the concept of learning from interaction.
Not everyone loves the new scoring system. The benchmark measures efficiency compared to humans, using squared differences. If a human solves something in 10 steps and the model takes 100, the model gets a 1% score. This makes results incomparable to earlier ARC versions, which some researchers find frustrating.
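For what it's worth, the arithmetic behind that 10-versus-100 example seems to work out as a squared efficiency ratio. That reading is inferred from the example itself; the official formula may differ in its details.

```python
# Sketch of the arithmetic behind the article's 10-vs-100 example. The
# "squared ratio" reading is inferred from that example; the official
# ARC-AGI-3 scoring formula may differ in its details.
def efficiency_score(human_steps: int, model_steps: int) -> float:
    ratio = min(human_steps / model_steps, 1.0)  # capped at human level
    return ratio ** 2

print(efficiency_score(10, 100))  # 0.01 -> the 1% from the example
print(efficiency_score(10, 10))   # 1.0  -> matching the human scores 100%
```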
But François Chollet himself warned against treating any single benchmark as definitive: "Keep in mind, ARC is not a final exam that you pass to claim AGI. The benchmarks target the residual gap between what's hard for AI and what's easy for humans."
Moving Targets All the Way Down
Here's what I keep thinking about: maybe the instability is the point.
The cycle of benchmark creation → saturation → replacement isn't a bug in how we measure AI progress. It's a feature. Each new benchmark reveals a different dimension where models still can't match human capability. They crush static knowledge tests, so we build adaptive reasoning tests. They master those through extended compute, so we build interactive learning tests. Presumably they'll eventually crack those too, and we'll need something else.
This is actually how scientific measurement is supposed to work. Your instruments should evolve as your understanding deepens. When the thermometer stops being useful for measuring extreme temperatures, you don't complain—you build a different instrument.
The question isn't whether benchmarks will keep breaking. They will. The question is whether we can build new ones fast enough to keep measuring the actual frontier rather than yesterday's solved problems.
Right now, with ARC-AGI-3, we have a test where the best AI systems in the world perform worse than a random human clicking around to see what happens. That's not a measurement failure. That's the measurement finally working again.
—Zara Chen
Watch the Original Video
Why AI Needs Better Benchmarks
The AI Daily Brief: Artificial Intelligence News
16m 24s
About This Source
The AI Daily Brief: Artificial Intelligence News
The AI Daily Brief: Artificial Intelligence News is a YouTube channel that serves as a comprehensive source for the latest developments in artificial intelligence. Since its launch in December 2025, the channel has become an essential resource for AI enthusiasts and professionals alike. Although it doesn't disclose its subscriber count, its commitment to daily coverage reflects its growing influence within the AI community.
More Like This
Everything You've Heard About AI Is Probably Wrong
AI capabilities are doubling every 4 months, but most people are working with outdated info. Here's what's actually happening in 2025.
Effect-Oriented Programming: Making Side Effects Safe
Three authors explain how effect-oriented programming brings type safety to the messy, unpredictable parts of code—without the intimidating math.
Val Kilmer's AI Resurrection Asks: Who Gets to Live Forever?
Val Kilmer's AI-generated performance in 'As Deep as the Grave' raises uncomfortable questions about digital immortality, consent, and what makes acting art.
Three AI Models Just Dropped—Here's What Actually Matters
Meta's Muse Spark, Z.ai's GLM 5.1, and Anthropic's Managed Agents all launched this week. Here's what they're good at—and what they're not.