
AI Agents Are Running Way Below Their Actual Capability

Anthropic's new study reveals that typical AI agent sessions last just 45 seconds, even though the same systems can work autonomously for 45 minutes or more.

By Zara Chen

February 20, 2026


Photo: The AI Daily Brief: Artificial Intelligence News / YouTube

Here's something weird about AI agents right now: they can technically handle way more than we're letting them.

Anthropic just dropped a study that looks at how people actually use Claude Code and their public API, and the gap between capability and deployment is... significant. Most Claude Code sessions last around 45 seconds. But when researchers looked at the 99.9th percentile—basically the power users pushing the limits—those sessions were hitting 45 minutes of autonomous work.

That's not a typo. We're talking about a 60x difference between typical use and what the tech can actually handle.
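
If you want to see how a gap like that falls out of a heavy-tailed distribution, here's a quick sketch. The durations below are made up (Anthropic didn't publish the raw data), but the median-versus-tail math works the same way:

```python
import numpy as np

# Hypothetical session durations (seconds) from a heavy-tailed lognormal
# distribution -- NOT Anthropic's data. The median is pinned at 45 s and
# sigma is chosen so the tail lands near the study's reported 60x gap.
rng = np.random.default_rng(0)
durations = rng.lognormal(mean=np.log(45), sigma=1.33, size=1_000_000)

median = np.percentile(durations, 50)
tail = np.percentile(durations, 99.9)

print(f"median session:    {median:7.0f} s")
print(f"99.9th percentile: {tail:7.0f} s (~{tail / 60:.0f} min)")
print(f"ratio:             {tail / median:7.0f}x")
```

The point isn't the exact distribution, which we don't know; it's that a short median and a long tail coexist naturally, so "45 seconds" and "45 minutes" can both be true at once.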

Why This Matters More Than Model Benchmarks

You've probably seen the METR benchmark chart floating around, the one that tracks how long AI can work on tasks rated by human-equivalent duration. It's been the go-to metric for agent autonomy, and honestly, it kept a lot of people optimistic through last year's AI-bubble talk.

But here's the thing: METR measures what models can do in a perfect lab environment with zero human interaction and zero real-world consequences. As Anthropic points out, "the METR evaluation captures what a model is capable of in an idealized setting with no human interaction and no real world consequences."

That's... not how anyone actually uses this stuff.

The Anthropic study flips the script by looking at real deployment data. They defined an agent as "an AI system equipped with tools that allow it to take actions" and analyzed tool calls from both their API and Claude Code. Simple definition, but it lets them track what's happening in the wild without trying to reverse-engineer everyone's custom agent architectures.
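
In code, that definition is about as simple as it sounds: a model in a loop that can emit tool calls, plus something that executes them. Here's a rough, self-contained sketch of the pattern. The stub, message shapes, and tool names are invented for illustration; this is not Anthropic's SDK or wire format:

```python
import json

# Scripted stub standing in for a real LLM API call. It returns the next
# canned reply based on how many tool results have come back so far.
SCRIPT = [
    {"type": "tool_use", "tool": "list_files", "input": {}},
    {"type": "text", "content": "Found 2 files; nothing to fix."},
]

def model_step(messages):
    done = len([m for m in messages if m["role"] == "tool"])
    return SCRIPT[min(done, len(SCRIPT) - 1)]

# The tools are what make this an "agent" under the study's definition:
# an AI system equipped with tools that allow it to take actions.
TOOLS = {"list_files": lambda: ["main.py", "test_main.py"]}

def run_agent(task, max_turns=20):
    messages = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        reply = model_step(messages)
        if reply["type"] != "tool_use":      # model finished (or is asking)
            return reply["content"]
        result = TOOLS[reply["tool"]](**reply["input"])   # one "tool call"
        messages.append({"role": "tool", "content": json.dumps(result)})
    return "hit turn limit"

print(run_agent("Check the repo for failing tests"))
```

Counting those tool calls, rather than dissecting everyone's custom architecture, is what let the researchers measure agents across wildly different deployments.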

The Trust Curve Is Real

One of the most interesting patterns in the data is how user behavior evolves over time. New Claude Code users enable full auto-approval about 20% of the time. Experienced users? That number doubles to 40%.

Meanwhile, interruption rates follow the opposite pattern. New users interrupt Claude around 5% of the time, while experienced users interrupt almost twice as much—around 9%.

At first this seems contradictory. But think about it: if you're approving every single action manually, you don't need to interrupt much because you're already gatekeeping each step. Once you start using auto-approval more liberally, you're monitoring differently—checking in while work happens rather than before it starts.

Anthropic suggests experienced users develop "more honed instincts for when their intervention is needed." It's like the difference between micromanaging a junior employee versus knowing exactly when to check in on someone you trust.

The junior employee metaphor actually holds up really well here. The 20% to 40% auto-approval shift is Claude earning trust over time. But the increased interruptions aren't about distrust—they're about optimization. You're steering more actively because you understand the system better.
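
Mechanically, the difference comes down to where the oversight sits. A tiny hypothetical sketch of the approval gate makes the tradeoff visible:

```python
def run_action(action: str, auto_approve: bool) -> bool:
    """Hypothetical per-action gate, illustrating the two oversight modes.

    auto_approve=False (typical new user): every step is vetted before it
    runs, so there's little reason to interrupt mid-run.
    auto_approve=True (typical experienced user): steps run immediately,
    and oversight shifts to monitoring the run as it happens.
    """
    if not auto_approve:
        # Gatekeeping up front: the human approves each action beforehand.
        return input(f"Allow '{action}'? [y/N] ").strip().lower() == "y"
    # Auto-approved: execute now; the human watches and may interrupt later.
    return True
```

With the gate on, questionable actions get filtered before they run; with it off, the only lever left is the interrupt, which is exactly the pattern the data shows.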

Claude Interrupts Itself Too

Here's where it gets interesting: human intervention is only half the autonomy equation. Claude also stops itself to ask for clarification, and it does this more often as task complexity increases.

For high-complexity tasks, humans interrupted 7.1% of the time. Claude asked for clarification 16.4% of the time—more than double. For simple tasks, those numbers were much closer: 5.5% human interrupts, 6.6% Claude clarifications.

But they're interrupting for different reasons. When humans step in, it's usually to provide missing context or corrections (32% of the time) or because Claude was hanging (17%). When Claude stops itself, it's most often—35% of the time—to present users with a choice between different approaches.

That last bit is fascinating because it's not really about capability limitations. Claude could theoretically just pick an approach and run with it. Instead, it's optimizing for human alignment upfront, which might actually be the smarter play for building trust.

Where Agents Are Actually Being Deployed

Software engineering dominates at around 50% of tool calls, which makes sense given that Claude Code is heavily represented in this data. But the other 50% is already spread across business functions: back-office automation (9.1%), marketing and copywriting (4.4%), sales and CRM (4.3%), finance and accounting (4.0%).

Even this early, roughly half of agentic use is outside pure coding territory. That distribution is basically a map of where agent automation is headed next.

The Capability Overhang

David Hendrickson had maybe the best take on this: "What's most surprising from the paper is that real-world AI agents are currently given much less autonomy than they could technically handle."

We had to look at the 99.9th percentile to see what Claude can actually do, even though the median turn is just 45 seconds. That's a massive capability overhang—the technology can handle way more than we're currently asking of it.

Part of this is just adoption curve stuff. Between January and mid-February, Claude Code's user base doubled. When you scale that fast, you're bringing in tons of new users who haven't built up that trust relationship yet. The recent dip from 45 minutes back to 40 minutes at the 99.9th percentile? Probably explained by that influx of newcomers.

But there's also this interesting October-to-January stretch where autonomous turn duration climbed from 25 to 45 minutes across multiple model releases. Anthropic notes that the increase was smooth, "suggesting that autonomy is not purely a function of model capability."

In other words: context matters as much as the model itself. How agents are deployed, how humans interact with them, what trust patterns develop—all of that shapes real-world autonomy as much as raw technical capability.

What Comes Next

There's an open question about whether the future looks like better interactions with current paradigms or a total shift to long-duration autonomy. OpenAI's Sherwin Woo recently argued that the next leap isn't smarter models but agents you can dispatch for 6+ hours of independent work.

Right now, as Anthropic's data shows, that's not remotely how people are using these tools. But the gap between what's possible and what's happening suggests we might get there faster than the current user behavior implies.

Yong Riu framed it well: "Autonomy is not just steps taken. It is permission, scope, and ability to change state."

That's the shift we're watching unfold—not just better models, but better frameworks for humans and agents to collaborate. The technology is ready. We're still figuring out the handshake.

—Zara Chen, Tech & Politics Correspondent

Watch the Original Video

How People Actually Use AI Agents

The AI Daily Brief: Artificial Intelligence News

12m 50s
Watch on YouTube

About This Source

The AI Daily Brief: Artificial Intelligence News

The AI Daily Brief: Artificial Intelligence News is a YouTube channel that serves as a comprehensive source for the latest developments in artificial intelligence. Since its launch in December 2025, the channel has become an essential resource for AI enthusiasts and professionals alike. Despite an undisclosed subscriber count, its commitment to daily coverage reflects its growing influence within the AI community.

