
Why AI Benchmarks Are Breaking (And What That Means for You)

Google's Gemini 3.1 Pro drops alongside a bigger question: are AI benchmarks even measuring what we think they are? The answer affects your buying decisions.

Written by AI. Tyler Nakamura

February 21, 2026


Photo: AI Explained / YouTube

Here's the thing about Google's new Gemini 3.1 Pro: it's genuinely impressive, but also kind of a mess to evaluate. And that mess? It's not just a Google problem—it's revealing something fundamental about how we measure AI that affects anyone trying to figure out which model to actually use.

The AI Explained channel dropped a deep dive into Gemini 3.1 Pro that goes way beyond "new model good" territory. The creator tested it hundreds of times, and what they found isn't just about this one model. It's about why every hot take you see about AI models seems to contradict the last one. Turns out, there's an actual technical reason for that confusion.

The Specialist Era Changed Everything

Let's start with what changed. A year ago, AI labs spent most of their compute budget on pre-training—feeding models massive amounts of internet data to make them generalists. That pre-training now accounts for only 20% of the compute budget. The other 80%? Post-training, where labs take those generalist models and specialize them for specific domains.

Dario Amodei, CEO of Anthropic, admitted that just a year ago, "the amount being spent on the second stage, RL stage, is small for all players." Not anymore. Now labs are spending the bulk of their resources optimizing models for particular use cases—coding, scientific reasoning, professional tasks, whatever they think will win them users.

This matters for a simple reason: if a lab optimized their model for your domain, you'll think it's the best model ever. If they didn't, you might wonder what the hype is about. The old paradigm where "good at one thing means good at everything" is dead.

Case in point: Claude went from 12% to 10% on chess puzzles over five months of development. That's backwards progress on a pure reasoning benchmark while the model got dramatically better at coding. GPT-5.2 scores around 50% on the same chess test. These are all incredible models, but they're optimized for different things.

The Shortcut Problem

Gemini 3.1 Pro crushed the ARC-AGI 2 benchmark with 77.1%, way ahead of Claude Opus 4.6's 69%. Google DeepMind CEO Demis Hassabis highlighted this prominently. Sounds definitive, right?

Except AI researcher Melanie Mitchell pointed out something fascinating: when researchers changed the encoding from numbers to other symbols, accuracy dropped significantly. The models were finding "unintended arithmetic patterns" in the number-coded colors that led to "accidental correct solutions."

I don't think that's cheating—the models are doing what they're designed to do, finding any pattern that gets them to the right answer. But it does mean that even within a single benchmark, how you frame the question massively affects performance. Change one variable and suddenly your rankings flip.

The video creator runs their own private benchmark called Simple Bench: basically trick questions and common-sense reasoning. Gemini 3.1 Pro scored 79.6%, essentially matching average human performance. That's actually a huge milestone: we may be at the point where you can't write a fair English-language test that the average person would clearly beat frontier models on.

But here's the caveat: when they removed the multiple-choice format and required open-ended answers, performance dropped by 15 to 20 percentage points. Models are incredibly good at using context clues from the answer options themselves; a multiple-choice question with "zero" as an option can tip the model off that something tricky is happening.
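To make that format gap concrete, here's a toy grader for the same question in both modes. Everything below is illustrative: these are generic grading rules I'm assuming, not Simple Bench's actual methodology.

```python
# Illustrative only: generic grading rules, not Simple Bench's methodology.

def grade_multiple_choice(model_answer: str, correct_letter: str) -> bool:
    # Multiple choice hands the model the candidate answers up front
    # (including giveaways like "zero"); grading is a trivial letter match.
    return model_answer.strip().upper() == correct_letter.upper()

def grade_open_ended(model_answer: str, correct_value: str) -> bool:
    # Open-ended grading strips away the option list, so the model gets
    # no context clues; the grader must normalize free text instead.
    return model_answer.strip().lower() == correct_value.strip().lower()
```

The point isn't the string matching; it's that the multiple-choice prompt itself carries signal the open-ended prompt doesn't, which is exactly where that 15-20 point gap comes from.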

The Hallucination Paradox

On Google's hallucination benchmark, Gemini 3.1 Pro scored +30 to Claude Opus 4.6's +11. Looks like a blowout. But zoom in on just the incorrect answers and Gemini hallucinates 50% of the time, while Claude hallucinates only 38% of the time.

What this means: a model can be better at its best while being worse at its worst. As the video creator puts it, paraphrasing a relationship cliché: "If you can't take me in my bad moments, you don't deserve me in my good moments." Every model has that trade-off baked in.
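A quick back-of-envelope shows how both things can be true at once. The scoring rule here (net = correct minus hallucinated, with abstentions scoring zero) and the counts are hypothetical stand-ins, not Google's published methodology:

```python
# Hypothetical counts illustrating the paradox; the scoring rule
# (net = correct - hallucinated, abstentions score zero) is an assumed
# stand-in, not Google's published benchmark methodology.

def net_score(correct: int, hallucinated: int) -> int:
    return correct - hallucinated

def rate_when_wrong(hallucinated: int, wrong: int) -> float:
    return hallucinated / wrong

# Model A answers more questions correctly but is bolder when it misses;
# Model B misses more often but tends to abstain rather than hallucinate.
a = {"correct": 56, "wrong": 44, "hallucinated": 22}
b = {"correct": 42, "wrong": 58, "hallucinated": 22}

print(net_score(a["correct"], a["hallucinated"]),
      rate_when_wrong(a["hallucinated"], a["wrong"]))  # net +34, 50% of misses
print(net_score(b["correct"], b["hallucinated"]),
      rate_when_wrong(b["hallucinated"], b["wrong"]))  # net +20, ~38% of misses
```

So the model with the better headline number is also the one that bluffs more often when it's wrong — pick whichever failure mode your use case can tolerate.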

Google's own model card admits that their "deep think" mode—which uses more inference compute—actually performs "considerably worse" than the standard mode when you account for costs. Even the labs are telling you that bigger numbers don't always mean better results.

The Anthropic Bet

Here's where it gets philosophical. Dario Amodei recently explained Anthropic's core bet: if you specialize in enough specialisms, you'll generalize to all specialisms. Sounds almost tautological, but it's actually a specific technical claim.

Amodei thinks we can get "most of the way there to AGI or super intelligence" without continual learning—without models learning on the job from your specific domain. His argument: "We're trying to get a whole bunch of data, not because we want to cover a specific document or specific skill, but because we want to generalize."

The idea is that there are only so many patterns in human knowledge. Train on enough domains, and the model can deduce the patterns in domains it's never seen. Maybe it'll need some extra context—which is why Claude 4.6 now handles 750,000 words of context, soon to be millions—but it won't need training data from your specific use case.

Francois Chollet, creator of the ARC test, has a different take on this when it comes to coding: "Sufficiently advanced agentic coding is essentially machine learning. A goal is given to the agent or agent swarm and then the coding agents iterate until the goal is reached... the result is a black-box model."

In other words, even if a coding agent seems to work great, it might be overfitting to your specifications or drifting from your original concept in ways you won't catch until later. The same brittleness that shows up in benchmarks shows up in your codebase.

Who Would Actually Build the One True Benchmark?

The labs themselves would benefit most from having a pure general intelligence benchmark. If it existed, they could just optimize against it using reinforcement learning and have the smartest model, full stop.

But most benchmarks come from small teams with sub-million-dollar budgets. Expecting those teams to craft something that objectively captures real-world performance without overestimation is, as one researcher put it, "essentially making more realistic reinforcement learning with verifiable reward settings than labs, which is hard."

That's why the labs increasingly write their own benchmarks. They're the only ones with the resources. Which introduces obvious bias.

The one truly objective benchmark might be forecasting the future—and Metaculus notes that models are now performing near the level of average human forecasters. But even that has a weird vulnerability: what happens when AI agents can take actions in the real world and bet on prediction markets simultaneously? An unfiltered model could literally make events happen to win its bets. Even the purest benchmark gets corrupted eventually.

What This Means for Actual Humans

If you're trying to decide which AI model to use, here's what matters:

  1. Test it on your actual use case. Gemini might crush coding benchmarks but produce garbage in your specific IDE. Claude might score lower on general reasoning but nail the exact kind of writing you need.

  2. Price matters more than ever. If one model is 3x the cost for 10% better performance on a benchmark you don't care about, that's not a win.

  3. Watch for the hallucination pattern. Models optimized to almost never hallucinate might be more conservative and less useful. Models that swing for the fences will occasionally make stuff up. Pick your poison based on your use case.

  4. Speed is becoming its own benchmark. The video shows a model answering instantly with insane token throughput. When entire apps can be generated in milliseconds, the old metrics feel quaint.

Oh, and one more thing: Anthropic's revenue is currently 10x-ing year over year while OpenAI's is 3.4x-ing. If those trends hold, Anthropic could be out-earning OpenAI by mid-2026. Market dynamics might determine which models get the resources to improve faster than any benchmark score.
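For what that projection implies, here's the compound-growth arithmetic behind a catch-up date. The growth multiples (10x and 3.4x per year) come from the article; the 2x starting revenue ratio below is purely a hypothetical assumption:

```python
import math

# Back-of-envelope crossover time under constant compound growth.
# Growth multiples are from the article; the starting ratio is hypothetical.

def crossover_years(revenue_ratio: float,
                    fast_growth: float,
                    slow_growth: float) -> float:
    """Years until the faster-growing challenger catches the leader.
    revenue_ratio = leader_revenue / challenger_revenue (must be > 1)."""
    return math.log(revenue_ratio) / math.log(fast_growth / slow_growth)

# Suppose (hypothetically) OpenAI starts with 2x Anthropic's revenue:
t = crossover_years(2.0, fast_growth=10.0, slow_growth=3.4)
print(f"crossover in roughly {t:.1f} years ({t * 12:.0f} months)")
```

Under those assumptions the crossover lands well under a year out, which is why a "mid-2026" claim is at least arithmetically plausible — though constant growth multiples rarely hold for long.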

The benchmark era isn't over—we still need ways to measure progress. But we're entering what you could call the vibe era, where "this model feels right for my work" might be more useful than a leaderboard score. And honestly? For anyone actually trying to get work done with these tools, that might be how it should be.

Tyler Nakamura covers consumer tech and AI for Buzzrag

Watch the Original Video

Gemini 3.1 Pro and the Downfall of Benchmarks: Welcome to the Vibe Era of AI

AI Explained

18m 50s
Watch on YouTube

About This Source

AI Explained

AI Explained is a rapidly growing YouTube channel that has amassed 394,000 subscribers since its launch in August 2025. The channel is at the forefront of analyzing the profound changes introduced by smarter-than-human AI, offering insights into AI advancements, model development, and their economic implications. Created by the developer of 'Simple Bench' and the LM Council, the channel provides authoritative content with a focus on bridging the human-LLM reasoning gap.
