
New AI Benchmarks Expose the Gap Between Hype and Reality

OpenAI and Anthropic promise breakthrough models, but the ARC-AGI-3 benchmark reveals AI still scores under 0.5% where humans hit 100%. What gives?

Written by AI. Marcus Chen-Ramirez

March 27, 2026


Photo: AI Explained / YouTube

Jensen Huang says we've achieved AGI. OpenAI is shelving side projects to focus compute on its upcoming Spud model, which CEO Sam Altman reportedly calls "very strong." Anthropic is warning government officials that its next Claude release will "supercharge both offensive and defensive cyber capabilities." Meanwhile, a new benchmark just dropped showing the best AI models scoring less than half a percent where humans hit 100%.

Welcome to the messy middle of artificial intelligence development, where the distance between marketing and measurement has never been wider.

The Resource Reallocation

OpenAI's recent moves tell you something about priorities. According to reports from the Financial Times and The Information, the company has shelved its Sora video app—the one that generated viral AI videos and presumably cost billions to optimize. The reason? They need the compute for Spud.

It's a familiar pattern in tech: kill the demo, ship the product. But it also signals something about the current state of AI development. These companies aren't just iterating anymore. They're making hard choices about what gets resources and what doesn't.

Anthropic faces a different kind of resource constraint: political capital. The company's relationship with the Pentagon hit turbulence recently, with a six-month deadline set on government use of Claude. But according to Axios, that might be changing. The promise of enhanced cyber capabilities—both offensive and defensive—has apparently renewed Pentagon interest.

One detail worth noting: Brad Gerstner, an adviser to Anthropic CEO Dario Amodei, is the architect of "Trump accounts," which provide $1,000 to every enrolled newborn. There's speculation that Anthropic might part-fund these accounts as a goodwill gesture. Call it universal equity, or call it creative lobbying—either way, it's an unusual intersection of AI development and social policy.

The Benchmark That Matters

But here's where things get interesting: ARC-AGI-3 just launched, and it's designed specifically to be ungameable.

The previous versions of ARC-AGI got saturated fairly quickly. Frontier models learned to score well through a combination of genuine reasoning advances (like chain-of-thought prompting) and something more subtle: training on synthetic tasks that densely sampled what the private test set might look like. Not direct memorization, but a higher-level form of benchmark hacking.

ARC-AGI-3 responds by making the public test set meaningfully different from the private one. It's interactive, game-like, testing exploration, planning, memory, and goal-setting simultaneously. The puzzles don't tell you the rules—you have to figure out that the plus symbol rotates shapes, that you need to move icons to manipulate the environment, that your goal is to make one shape match another.

The scoring methodology is deliberately adversarial. Performance is measured not by how many levels you complete, but by how efficiently you complete them compared to humans. And inefficiency is quadratically penalized—if you take twice as many actions as a human, you don't get 50% credit, you get 25%.
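As a back-of-the-envelope sketch, assume the penalty is simply the squared ratio of human actions to agent actions, capped at the human baseline (the report's exact formula may differ, but this matches the 2x-actions, 25%-credit example):

```python
def efficiency_score(agent_actions: int, human_actions: int) -> float:
    """Quadratic efficiency penalty, capped at the 100% human baseline.

    Hypothetical reconstruction: the article says inefficiency is
    quadratically penalized and that taking twice as many actions as a
    human yields 25% credit, which matches squaring the action ratio.
    """
    ratio = human_actions / agent_actions
    return min(ratio, 1.0) ** 2

# Twice the human action count: 25% credit, not 50%.
assert efficiency_score(agent_actions=200, human_actions=100) == 0.25
```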

One quirk: scores are capped at the 100% human baseline, so a model can at best match humans, while any shortfall counts as evidence against AGI. It's asymmetric by design. As the technical report notes, "As long as there is a gap between AI and human learning, we do not have AGI."

Current state-of-the-art models? Gemini 3.1 scores 0.37%. The best performance reported is still under 0.5%.

What Gaming a Benchmark Actually Looks Like

The ARC-AGI-3 paper includes fascinating details about attempts to game it. A group called Symbolica AI built a "harness"—essentially one model controlling another, with sub-agents producing summaries to prevent context overload. It worked. They solved all three public environments.

The benchmark authors' response? Harnesses are now banned. The goal, they write, is "not to measure the amount of human intelligence that went into designing an ARC-AGI-3 specific system." Models get minimal context: "You are playing a game. Your goal is to win. Reply with the exact action you want to take."

No hints about efficiency. No reminders about the scoring system. Just solve the puzzle.
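For flavor, here is roughly what that harness-free loop looks like. The prompt text is the one quoted above; the environment interface and query_model callable are illustrative stand-ins, not the benchmark's actual SDK:

```python
# Only the prompt string is taken from the benchmark's instructions;
# env.reset/env.step and query_model are hypothetical interfaces.
MINIMAL_PROMPT = (
    "You are playing a game. Your goal is to win. "
    "Reply with the exact action you want to take."
)

def play_episode(env, query_model, max_actions: int = 1000) -> bool:
    """Run one episode with no scaffolding beyond the bare prompt."""
    observation = env.reset()
    for _ in range(max_actions):
        # The model sees only the minimal prompt plus the raw observation:
        # no efficiency hints, no scoring rules, no sub-agent summaries.
        reply = query_model(f"{MINIMAL_PROMPT}\n\nObservation:\n{observation}")
        observation, done = env.step(reply.strip())
        if done:
            return True
    return False
```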

It's worth noting ARC-AGI-3 isn't the only unsaturated benchmark. NetHack, a game-based intelligence test, has remained unsaturated for six years. Google DeepMind's Tim Rocktäschel pointed this out, slightly deflating the "revolutionary new benchmark" narrative. Gemini 3 Pro scores 6.8% on NetHack—better than ARC-AGI-3, but still nowhere near human performance.

The Automation Promise and Its Limits

MIT Technology Review reports that OpenAI's new north star is building "a fully automated AI researcher." The goal: an AI that can tackle large, complex research problems independently. They want an "intern-level AI" by September.

One OpenAI leader frames it this way: "Nobody really edits code all the time anymore. Instead, you manage a group of code agents." The implication is clear—AI research will follow the same path as software engineering. AI does the grunt work, humans review.

But here's what the automation narrative consistently misses: even when the flip happens—when AI-first drafting becomes more efficient than human-first drafting—the productivity gains are incremental, not exponential. OpenAI's own research suggests speedups in the 40% range for economically valuable tasks.

And despite predictions of mass displacement, engineering job openings at tech companies have grown by more than 50% over the past three years, from under 40,000 to 67,000 globally. Both OpenAI and Anthropic are still hiring prolifically.

This doesn't mean automation won't eventually transform these roles. It means the transformation is slower and messier than the pitches suggest.

Security in the Age of Agency

A recent security incident underscores why this messiness matters. A key open-source Python library was compromised so that simply updating it would exfiltrate all your secrets and keys to the dark web. The risk: AI agent swarms automatically updating dependencies without catching the malicious code.

Nvidia's Jim Fan notes that "claws need shells, probably many layers of nested shells." He's right. As models become capable enough to act autonomously, they also become capable enough to cause autonomous harm—or to hack the very benchmarks designed to measure them.
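What one of those shells might look like in practice: gate every dependency update behind a digest check against a reviewed lockfile, rather than letting agents install whatever the package index serves. This is a minimal sketch; the file name and digest are placeholders, and pip's own --require-hashes mode enforces the same property end to end:

```python
import hashlib
import pathlib

# Placeholder pin for illustration only; real digests would come from a
# human-reviewed lockfile, not from the agent doing the update.
PINNED_SHA256 = {
    "somelib-1.2.3-py3-none-any.whl": "<expected sha256 digest>",
}

def verify_artifact(path: str) -> bool:
    """Refuse any downloaded package whose digest doesn't match its pin."""
    artifact = pathlib.Path(path)
    expected = PINNED_SHA256.get(artifact.name)
    if expected is None:
        return False  # unpinned packages are rejected outright
    digest = hashlib.sha256(artifact.read_bytes()).hexdigest()
    return digest == expected
```

A check like this would not have stopped the compromise at its source, but it would stop a swarm from silently pulling the poisoned update.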

We're in what you might call the capable-but-unreliable phase. AI is a better first drafter than humans for many tasks, but its outputs remain full of holes. It generalizes well on lower-level topics across coding and human languages, but struggles with higher-level concepts like adaptive goal-setting or—as recent events demonstrate—security hygiene.

The gap between what these systems can do in controlled demonstrations and what they can do reliably in production remains stubbornly wide. New benchmarks like ARC-AGI-3 are useful precisely because they make that gap visible and measurable, even as companies prepare to ship models they describe as revolutionary.

The revolution might still come. But it's not here yet, no matter what the benchmark scores—or the press releases—might claim.

Marcus Chen-Ramirez is senior technology correspondent at Buzzrag.

Watch the Original Video

Two AI Models Set to “stir government urgency”, But Will This Challenge Undo Them?

AI Explained

16m 27s

About This Source

AI Explained

AI Explained is a rapidly growing YouTube channel that has amassed 394,000 subscribers since its launch in August 2025. The channel is at the forefront of analyzing the profound changes introduced by smarter-than-human AI, offering insights into AI advancements, model development, and their economic implications. Created by the developer of 'Simple Bench' and the LM Council, the channel provides authoritative content with a focus on bridging the human-LLM reasoning gap.
