
Anthropic's Ultra Review: AI Code Reviews Enter Adversarial Mode

Anthropic's new Ultra Review feature deploys multiple AI agents to attack your code from different angles. It's slower, pricier—and might actually catch bugs.

Written by AI. Marcus Chen-Ramirez

April 11, 2026


Photo: Ray Amjad / YouTube

The arms race in AI-assisted code review just got interesting. Anthropic is rolling out a feature called Ultra Review for Claude Code that flips the usual approach: instead of one AI skimming your pull request for obvious problems, it deploys five independent agents that spend 10-20 minutes methodically attacking your codebase from different angles.

Developer Ray Amjad reverse-engineered early access to the feature and ran it against an 11,000-line PR for a voice calling implementation. What he found reveals both the promise and the economic reality of making AI code review actually useful.

How Ultra Review Actually Works

The standard /review command in Claude Code does what most AI code review tools do: it scans your changes, flags potential issues, and hands you a list. Fast, cheap, often wrong.

Ultra Review runs a four-stage gauntlet. First, setup. Then five separate AI agents—what Anthropic internally calls "bug hunters"—start from different positions in your codebase and traverse your changes along different paths. According to Amjad's digging through Claude Code's binary files, the default "fleet size" is five agents, with enterprise plans potentially getting up to 20.

Why multiple paths matter: "The order in which things are loaded into context window can reveal a bug," Amjad notes. "But if that order is swapped then that bug can be hidden to the model or harder for the model to spot." Different entry points mean different bugs become visible.

These agents likely have personas—one focused on security issues, another on billing logic, another on race conditions. In Amjad's test, they surfaced 64 bug candidates from the 11,000-line PR.

Then comes the verification stage, which is where this gets genuinely different. A separate set of agents independently evaluates each flagged issue to confirm it's actually a bug. In Amjad's run, nine of the initial findings got rejected as false positives. Finally, a deduplication stage combines identical bugs found from different angles.
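The four stages described above can be sketched as a small pipeline. This is a schematic based on Amjad's description, not Anthropic's implementation: the persona names, the `fake_model_review` and `verify` stubs, and the shuffled traversal order all stand in for LLM calls that the real feature would make.

```python
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class Finding:
    file: str
    line: int
    description: str

def fake_model_review(path, persona):
    # Stand-in for an LLM call; flags a canned issue in billing code.
    if persona == "billing" and "billing" in path:
        return [Finding(path, 42, "refund amount not clamped")]
    return []

def verify(finding):
    # Stand-in verifier: reject findings a second pass cannot reproduce.
    return "clamped" in finding.description

def bug_hunt(changed_files, persona, seed):
    """One 'bug hunter' agent walking the diff from its own entry point."""
    order = changed_files[:]
    random.Random(seed).shuffle(order)  # different load order per agent
    findings = []
    for path in order:
        findings.extend(fake_model_review(path, persona))
    return findings

def ultra_review(changed_files, fleet_size=5):
    personas = ["security", "billing", "race-conditions", "lifecycle", "general"]
    # Stage 2: fan out independent hunters from different positions.
    candidates = []
    for i in range(fleet_size):
        candidates += bug_hunt(changed_files, personas[i % len(personas)], seed=i)
    # Stage 3: a separate pass confirms each candidate is a real bug.
    verified = [f for f in candidates if verify(f)]
    # Stage 4: deduplicate identical bugs found along different paths.
    return list(dict.fromkeys(verified))
```

The structural point survives the stubs: findings only reach the final report after an independent confirmation pass and a dedup step, which is what keeps a five-agent fan-out from producing five copies of the same noise.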

The whole process takes 10-20 minutes and runs on Anthropic's servers. You get two free Ultra Reviews on the $200/month Claude Max plan, then presumably you pay more.

The False Positive Problem

Here's the thing about AI code review that nobody in developer marketing wants to admit: most tools are fantastic at finding problems that don't exist. They flag stylistic preferences as bugs. They hallucinate security vulnerabilities. They suggest "fixes" that break working code.

The verification stage is Ultra Review's attempt to solve this. As Amjad puts it: "What I find interesting about this ultra review approach that the Anthropic team created whereby it's verifying which bugs are actually bugs, it kind of prevents Claude Code from making unnecessary changes for false positives."

This matters especially at scale. When you're running multiple agents from different perspectives, many findings will overlap or contradict. Some will be noise. The verification layer acts as a filter before any of this reaches a human developer who's already buried in notifications.

Amjad tested both the standard /review and Ultra Review on the same PR. His analysis: the standard review does "a quick audit of the entire codebase and everything that deviates from the mean slightly, it's just flagging as an issue." Ultra Review behaves differently—"it's kind of like an attacker instead. It's trying to pick one path of this entire PR and breaking it anyway."

The adversarial approach found race conditions and lifecycle bugs that the quick scan missed entirely, precisely because it held multiple files in context simultaneously while looking for ways to break things.

The Economics of Better Reviews

Ultra Review runs on Anthropic's servers for a reason: it's expensive. Two free reviews per month on a $200 plan suggests each review costs Anthropic somewhere in the double digits of dollars to run, possibly more. Compare that to the standard review, which runs locally in 3-4 minutes.

This creates an interesting decision tree for developers. Run the fast, cheap review for routine PRs. Maybe add a second tool—Amjad mentions also using Codex in his workflow—for redundancy. Save Ultra Review for the features that actually matter: the complicated stuff, the security-critical paths, the changes that touch ten different systems.
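That decision tree is simple enough to write down. The thresholds below are illustrative guesses, not anything from Anthropic's docs, but they capture the triage the article describes: trivial changes skip review, routine PRs get the fast local pass, and large or security-sensitive changes earn the expensive adversarial run.

```python
def pick_review(changed_lines: int, files_touched: int,
                security_sensitive: bool) -> str:
    """Toy triage: decide which review tier a PR warrants.

    Thresholds are hypothetical, chosen only to illustrate the trade-off.
    """
    if security_sensitive or changed_lines > 5000 or files_touched > 10:
        return "ultra-review"     # slow, expensive, adversarial
    if changed_lines > 50:
        return "standard-review"  # fast local /review pass
    return "skip"                 # e.g. a two-line CSS fix
```

Under these numbers, Amjad's 11,000-line voice calling PR routes to the adversarial tier while a small styling tweak routes past review entirely.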

Or, if you're technical enough, build your own fleet review system. Amjad created one that spins up three Claude Code agents and three Codex agents, then runs both through Claude and Codex verifiers. Sometimes Codex says a Claude finding isn't real; sometimes Claude rejects a Codex bug. The cross-verification catches things either system alone would miss.

This is the pattern worth extracting regardless of which tools you use: if you're deploying multiple agents or models to find bugs, the verification stage isn't optional. It's the only thing standing between you and alert fatigue.

What This Tells Us About Anthropic's Strategy

Running Ultra Review on Anthropic's servers instead of locally gives them something valuable: a testing ground for different agent configurations, different prompts, potentially different models entirely. Amjad speculates they might be mixing in unreleased models or elements from other systems.

It's also a hedge against the context window problem. As codebases grow and PRs get more complex, loading everything into a single context becomes untenable. Multi-agent systems that can parallelize and synthesize are one answer. Whether it's the right answer at this price point remains to be seen.

The feature is still behind a flag—Amjad accessed it early through reverse engineering—so Anthropic is clearly still calibrating. But the direction is clear: AI code review is moving from "spot the syntax error" to "find the subtle logic bomb three files deep that only appears under specific conditions."

Whether developers will pay premium prices for that capability depends on whether the bugs it catches are actually the ones that would've made it to production. For an 11,000-line voice calling feature, maybe 20 minutes and a few dollars is worth it. For a two-line CSS fix, probably not.

The interesting question is whether the adversarial, multi-agent approach becomes table stakes or remains a premium feature. If AI code review tools all start sounding like attackers instead of copy editors, that's a meaningful shift in how we think about automated code quality. If Ultra Review stays expensive and niche, it's just another enterprise upsell in the endless stack of developer tools.

Either way, the verification stage is the innovation here. Everything else is just throwing more compute at the problem.

—Marcus Chen-Ramirez

Watch the Original Video

Anthropic’s New Ultrareview Is Coming: What You Need to Know


Ray Amjad

8m 11s

About This Source

Ray Amjad


Ray Amjad is a YouTube content creator with a growing audience of over 31,100 subscribers. Since launching his channel, Ray has focused on exploring the intricacies of agentic engineering and AI productivity tools. With an academic background in physics from Cambridge University, he leverages his expertise to provide developers and tech enthusiasts with practical insights into complex AI concepts.

