The Test AI Still Can't Pass: What ARC AGI 3 Reveals
ARC AGI 3 launches with a stark finding: humans solve it 100% of the time, frontier AI models score under 1%. What this reveals about artificial intelligence.
Written by AI.
Bob Reynolds
March 29, 2026

Photo: Matthew Berman / YouTube
There's a video game that every human can beat and no AI can figure out. This isn't some philosophical thought experiment—it's a measurable fact, backed by data from the latest frontier models. The game is simple enough that technology correspondent Matthew Berman solved it in about a minute while narrating his thought process. GPT-4, given the same task, scored zero percent.
This is the ARC AGI 3 benchmark, released this week as the third iteration of what remains the only AI performance test that hasn't been saturated by machine learning systems. Where coding benchmarks and math competitions now see AI surpassing the world's best human specialists, ARC AGI takes the opposite approach: it tests the kind of reasoning that average humans do effortlessly and current AI cannot replicate.
The difference matters more than the usual benchmark horse race would suggest.
What Makes This Test Different
The ARC AGI series—Abstraction and Reasoning Corpus for Artificial General Intelligence—doesn't measure how much an AI knows or how fast it processes information. It measures something closer to what we might call common sense, though that phrase undersells the complexity.
In the first two iterations, the test presented static pattern-recognition puzzles. You'd see a few examples showing colored squares in various arrangements, deduce the underlying rule, then apply it to a new configuration. Humans look at these puzzles and solve them almost immediately. AI systems, for all their billions of parameters, struggle.
Berman walked through an example from ARC AGI 1: "You can see three pink squares times two. Then we can see for each of these batches of three squares to complete the puzzle, you have to add a yellow one to the missing section to make a square." Simple enough. The pattern becomes obvious after two examples.
AI performance tells a different story. On ARC AGI 1, the best models now reach about 93-94% accuracy—impressive until you remember humans score 100%. On ARC AGI 2, which increased difficulty, GPT-4 scores 72% at a cost of $39 per task. The cost-per-task metric matters here: it's not enough to throw unlimited computing power at the problem. Efficiency counts.
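The efficiency point can be made concrete with the numbers quoted here and in the results section below. A minimal sketch, using only the article's figures (72% at $39 per task on ARC AGI 2, 0.3% at roughly $5,000 per task on ARC AGI 3; these come from the video, not an official leaderboard):

```python
# Cost-efficiency of the quoted scores: how many percentage points of
# accuracy each dollar of per-task compute buys. Figures are from the
# article's summary of Berman's video, used here for illustration only.
results = {
    "ARC AGI 2": {"accuracy": 0.72, "cost_per_task": 39.0},
    "ARC AGI 3": {"accuracy": 0.003, "cost_per_task": 5000.0},
}

for name, r in results.items():
    # Percentage points of accuracy per dollar spent on a single task
    points_per_dollar = (r["accuracy"] * 100) / r["cost_per_task"]
    print(f"{name}: {points_per_dollar:.4f} accuracy points per dollar")
```

On these numbers, the efficiency gap between the two benchmarks spans more than four orders of magnitude, which is why cost per task is reported alongside raw accuracy.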
Now comes ARC AGI 3, and the gap widens dramatically.
The Interactive Challenge
ARC AGI 3 abandons static puzzles for something more dynamic: interactive video game scenarios where you receive zero instructions. You're dropped into a simple maze-like environment with a few visual elements—colored squares, a character marker, what might be a movement meter—and you have to figure out what to do.
Berman's walkthrough demonstrates how humans approach this. He immediately starts building a mental model: "We have three dots over here. We have this yellow bar. Maybe it's a health meter. Maybe it's a turn meter. We have something in the bottom left that looks like maybe it can either be a mini map, but more likely it's looking like it's matching this."
He moves once. The character responds. The yellow bar decreases. He updates his model. He tries moving to what appears to be the goal, fails, notices an orientation mismatch, realizes he needs to hit an intermediate marker first. Problem solved.
Total time: under three minutes, including explanation.
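Berman's approach reads as a simple hypothesis-test loop: act, observe, update a world model, plan again. The sketch below captures that loop against a toy stand-in environment; the `Env` class and its action names are invented for illustration and are not the actual ARC AGI 3 interface.

```python
# Illustrative sketch of the strategy Berman narrates. The Env class is a
# toy stand-in, not the real ARC AGI 3 API: the agent must visit the '+'
# marker to fix its orientation before the goal will accept it.
class Env:
    def __init__(self):
        self.visited_plus = False
        self.done = False

    def step(self, action):
        if action == "go_to_plus":
            self.visited_plus = True
            return "orientation changed"
        if action == "go_to_goal":
            if self.visited_plus:
                self.done = True
                return "solved"
            return "blocked: orientation mismatch"
        return "moved"

env = Env()
model = {"needs_plus_first": False}   # the agent's evolving world model

obs = env.step("go_to_goal")          # try the obvious thing first
if "mismatch" in obs:                 # a failure updates the model...
    model["needs_plus_first"] = True

if model["needs_plus_first"]:         # ...and the next plan uses it
    env.step("go_to_plus")
    obs = env.step("go_to_goal")

print(obs)  # "solved"
```

The point of the sketch is the update step: the failed attempt changes the model, and the changed model changes the plan. That is the loop the GPT-4 run below never closes.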
GPT-4's attempt looks different. The model moves the character once, then repeatedly tries to return to the starting position. It never considers the intermediate marker. It doesn't appear to build any model of what's happening. As Berman notes: "It doesn't think to go to the plus, which is just wild to me. Why is it not thinking to do that? It seems so obvious. It seems so intuitive."
The answer to that "why" cuts to the heart of what current AI systems can and cannot do.
What This Actually Tests
The ARC AGI benchmark specifically targets generalization, the capability behind the "G" in AGI: taking minimal information from a new situation and extrapolating to solve novel problems. Humans evolved to do this constantly. You've never seen this exact configuration of furniture in a room before, but you immediately understand how to navigate it. You've never played this specific video game, but you draw on decades of spatial reasoning and pattern recognition to figure it out.
Current AI systems, despite their impressive capabilities, don't generalize in this way. They interpolate within their training data. They recognize patterns they've seen before. When confronted with truly novel situations—even simple ones—they often fail in ways that seem bizarre from a human perspective.
Berman observes that on benchmarks testing specialized knowledge—complex mathematics, advanced coding, scientific reasoning—AI now beats the best human specialists. But on ARC AGI, "the average human can solve the benchmarks, but AI can't. It can't even get close."
This inversion reveals something important about the current state of artificial intelligence. We've built systems that can master narrow domains through exposure to enormous datasets. We haven't built systems that reason about novel situations the way humans do.
The Numbers Tell the Story
ARC AGI 3 results across frontier models:
- GPT-4: 0%
- Gemini 3.1 Pro: 0%
- Grok: 0%
- Claude Opus: 0%
- Humans: 100%
The top-performing AI configuration managed 0.3% accuracy at a cost of over $5,000 per task. These aren't marginal differences. These are the kind of gaps that suggest we're measuring something fundamentally different from what these systems were designed to do.
The benchmark creators are offering $2 million to anyone who can saturate ARC AGI—get AI to perform at human levels consistently. That prize remains unclaimed, and nothing in current architectural approaches suggests an obvious path to claiming it.
What This Doesn't Mean
This isn't an argument that AI is useless or overhyped. Current systems demonstrably excel at numerous tasks. They write competent code, analyze complex documents, generate coherent text, and process information at scales humans cannot match. Businesses are finding legitimate productivity gains. Berman himself notes he uses AI tools daily and finds them transformative.
Nor does this mean AI won't eventually solve these benchmarks. Pattern recognition puzzles that seemed impossible for machines thirty years ago are now trivial. The question is whether solving them requires fundamentally new approaches or just scaling up existing methods.
What this does mean is that the gap between narrow AI capabilities and general intelligence remains wide. We have systems that perform spectacularly within their training distributions and struggle with novelty that humans handle instinctively. The path from one to the other isn't obvious.
You can try the test yourself at arcprize.org. You'll probably solve it quickly. The world's most sophisticated AI systems probably won't. That gap measures something worth understanding, whether you're building AI systems, investing in them, or simply trying to figure out what they can and cannot actually do.
Fifty years covering technology teaches you that what AI companies promise and what AI systems deliver often diverge significantly. The ARC AGI benchmark provides a clear, measurable line in the sand. Humans: 100%. AI: less than 1%. Until that changes, we know exactly how far we have to go.
—Bob Reynolds, Senior Technology Correspondent
Watch the Original Video
ARC AGI 3 just dropped, what it means for AGI
Matthew Berman
11m 15s
About This Source
Matthew Berman
Matthew Berman is a leading voice in the digital realm, amassing over 533,000 subscribers since launching his YouTube channel in October 2025. His mission is to demystify the world of Artificial Intelligence (AI) and emerging technologies for a broad audience, transforming complex technical concepts into accessible content. Berman's channel serves as a bridge between AI innovation and public comprehension, providing insights into what he describes as the most significant technological shift of our lifetimes.
More Like This
AI Models Are Now Building Their Next Versions
Major AI labs confirm their models now participate in their own development, handling 30-50% of research workflows autonomously. The recursive loop has begun.
Dokploy Promises Vercel Features at VPS Prices
A new tool claims to deliver platform-as-a-service convenience on cheap VPS infrastructure. Better Stack demonstrates what works and what doesn't.
Scientists Made a Virtual Fly Walk Using a Dead Fly's Brain
Eon Systems copied a fruit fly's brain into a computer and it just...walked. No training, no programming. What does this mean for AGI?
Google's Gemma 4: Small Models, Big Performance Claims
Google releases Gemma 4, claiming frontier-level AI performance in models small enough for consumer hardware. The numbers look impressive. The questions remain.