
Opus 4.6 Is Smarter But Lost Its Soul, Says Developer

Anthropic's Opus 4.6 crushes benchmarks but feels slower and more robotic. Developer Theo examines the trade-offs in AI's smartest coding model yet.

Written by AI. Yuki Okonkwo

February 6, 2026


Photo: Theo - t3.gg / YouTube

Anthropic dropped Opus 4.6 today, and developer Theo (from t3.gg) spent all day testing it. His verdict? It's the smartest AI coding model ever made, and somehow, that might be the problem.

The benchmarks are genuinely wild. Opus 4.6 scored 76% on needle-in-haystack retrieval tests where its predecessor got 18.5%. It crushed ARC-AGI V2, the benchmark designed around tasks humans find easy but AI struggles with, hitting close to 70% when most models can barely crack 50%. It also ships with a 1 million token context window (in beta), meaning it can theoretically hold entire large codebases in memory.

Theo watched it build a full Rust-based C compiler from scratch capable of compiling the Linux kernel across multiple architectures. That took nearly 2,000 Claude Code sessions and $20,000 in API costs, but it worked. The model produced 100,000 lines of functional compiler code.

So why does he sound... underwhelmed?

The Speed-Intelligence Trade-off

Here's the thing nobody tells you about "smarter" models: they're often slower. Theo reports that tasks taking Opus 4.5 one to two minutes now take Opus 4.6 five to ten minutes. When you're iterating on code, that difference compounds fast.

"As great as these benchmarks are, as great as it is at code... I have a problem," Theo explains in his breakdown. "It feels slower. It's taken like 5 to 10 minutes to do things that Opus 4.5 would do in 1 to 2."

But speed isn't even the main complaint. The model also feels more... robotic. Theo ran two separate codebase audits on different projects with different prompts. Both times, Opus 4.6 ended with the exact same question: "Want me to fix any of these?" Same format, same phrasing, different contexts.

"This model feels a little more like it was pushed into a template for how it speaks and interacts," he notes. "And while it is smarter, it is less varied because of this."

That's probably intentional. As Theo points out, nobody's paying premium prices for Opus because it's charming. They're paying because it's the smartest option available. If Anthropic can make it even smarter at the cost of personality, that's likely a trade-off they're willing to make. Charm doesn't show up in benchmarks.

The Pricing Problem Nobody Wants to Talk About

Let's get real about cost. Opus 4.6 charges $5 per million input tokens and $25 per million output tokens. That's 4x higher on input and 2.5x higher on output compared to GPT-5 and 5.1 ($1.25 in, $10 out).

Even compared to OpenAI's more recent 5.2 models ($1.75 in, $14 out), Opus costs roughly two to three times as much per token. And if you want to use that fancy million-token context window? The prices double again: $10 in, nearly $40 out.
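To make those rates concrete, here is a minimal cost sketch. The model names and per-million-token prices are taken from the figures quoted above (the $40 output rate for the large-context tier is an approximation of "nearly $40"); the function itself is plain arithmetic, not any vendor's API.

```python
# (input $/1M tokens, output $/1M tokens), as quoted in this article
PRICES = {
    "opus-4.6": (5.00, 25.00),
    "opus-4.6-1m-context": (10.00, 40.00),  # "nearly $40 out"
    "gpt-5.1": (1.25, 10.00),
    "gpt-5.2": (1.75, 14.00),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request at the quoted per-million-token rates."""
    in_rate, out_rate = PRICES[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# A hypothetical coding-agent turn: 80k tokens of context in, 4k tokens out.
for model in PRICES:
    print(f"{model:22s} ${request_cost(model, 80_000, 4_000):.3f}")
```

At those assumed token counts, the same turn costs about $0.50 on Opus 4.6 versus $0.14 on GPT-5.1, which is where the "2-4x" framing below comes from.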

"I can tell you from a lot of experience running a lot of these models over the API with T3 Chat that Anthropic is a disproportionately high bill relative to the amount of usage those models get on our platform," Theo says. "They just cost more. It is what it is."

The question becomes: is Opus actually 2-4x better than the competition? Maybe for specific use cases. But for general coding work, that's a hard sell.

The Context Window Trap

That million-token context window sounds impressive until you learn about context rot. The more tokens you stuff into a model's memory, the worse it gets at retrieving specific information. It's like trying to find one email in an inbox with 50,000 messages versus 500.

Theo points to Anthropic's own research showing their models struggle with retrieval in large contexts. Haiku 4.5 scored the same as a lobotomized Grok model on retrieval tests—both around 60%, while GPT-5 Nano hit over 90%.

Last year, there was drama when users thought Claude had gotten dumber. Anthropic later revealed they were accidentally routing people to the large-context version of the model, which performed worse. "So again, spending more money to use more tokens, which inherently lowers the likelihood of success because of context rot, but also the version of the model that can do that is worse," Theo explains.

The entire industry has been moving away from just throwing more context at models. There's a reason for that.

The Features That Might Actually Matter

The genuinely interesting additions are the experimental team orchestration features. Opus 4.6 can coordinate multiple Claude Code instances working as a team—spinning up parallel agents to explore different approaches, then consolidating results.

Theo had it audit a large codebase by deploying five separate agents simultaneously. They even offer a tmux-style multiplex view to watch all the agents working.
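The fan-out-and-consolidate pattern described above can be sketched with ordinary thread-pool plumbing. Everything in this sketch is a hypothetical stand-in: `run_agent` is a placeholder for a Claude Code instance, not Anthropic's actual orchestration API, which isn't public in this form.

```python
from concurrent.futures import ThreadPoolExecutor

def run_agent(agent_id: int, task: str) -> dict:
    """Stand-in for one agent auditing its assigned slice of a codebase."""
    # A real agent would explore the repo and return its findings here.
    return {"agent": agent_id, "task": task, "findings": [f"issue in {task}"]}

def orchestrate(tasks: list[str]) -> list[dict]:
    """Fan parallel agents (up to five) out over the tasks, then consolidate."""
    with ThreadPoolExecutor(max_workers=5) as pool:
        results = list(pool.map(run_agent, range(len(tasks)), tasks))
    # Consolidation step: merge every agent's findings into one ordered report.
    return sorted(results, key=lambda r: r["agent"])

report = orchestrate(["auth/", "db/", "api/", "ui/", "infra/"])
print(len(report), "agents reported")
```

The hard part, as the crash reports below suggest, isn't the fan-out; it's keeping long-running agents from silently diverging before the consolidation step.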

The catch? It crashes a lot. "It does break things pretty consistently right now," Theo admits. "And I can't get it to trigger reliably enough to confidently tell you how it is to use."

Cursor is building similar long-running agent systems. They recently ran a week-long session that peaked at over 1,000 commits per hour across hundreds of agents. But the longer these systems run, the higher the chance of unnoticed errors that become catastrophic later. Plus, when something takes 10 hours to run and costs thousands of dollars, you're way less likely to restart it when it fails.

"Let the rich people figure it out as we all dwindle with our $200 a month subs and hope for the best," Theo quips.

The Sonnet 5 Mystery

Here's where it gets conspiratorial: Where the hell is Sonnet 5?

Anthropic usually rotates through its model families: Haiku, Sonnet, Opus. It's odd that the company released two Opus models back to back. There's speculation (including from Theo) that Sonnet 5 may have become Opus 4.6.

A tweet went around claiming sources inside Anthropic said Sonnet 5's benchmarks would "mass retire every model released in 2025" and that they delayed it twice because the safety team couldn't explain emergent capabilities. Real or fake? Nobody knows.

What we do know: Historically, Sonnet has been Anthropic's most-used model by far. But when Opus 4.5 became both more capable and cheaper than previous Opus models, usage patterns shifted. Anthropic might be enjoying those higher-margin Opus revenues.

The pricing structure even supports this theory. The large context window in Opus 4.6 has the same constraints as Sonnet 4.5, which is... interesting timing.

Where This Leaves Developers

Opus 4.6 is demonstrably the smartest coding model available right now. The benchmarks don't lie. But intelligence isn't the only variable that matters when you're actually shipping code.

Speed matters. Cost matters. The feeling of flow when you're iterating matters. And yeah, maybe personality matters too when you're spending hours working with an AI.

The future of autonomous agents doing multi-day coding tasks is genuinely exciting. But we're not there yet. The error rates compound, the costs spiral, and the motivation to fix things after a 20-hour run crashes at hour 19 approaches zero.

For now, Opus 4.6 is a tool with specific use cases where its strengths outweigh its costs—literally and figuratively. Just don't expect it to chat about your day.

—Yuki Okonkwo

Watch the Original Video

Opus 4.6 Is The Best Coding Model Ever Made*

Theo - t3.gg

30m 52s
Watch on YouTube

About This Source

Theo - t3.gg

Theo - t3.gg is a burgeoning YouTube channel that has quickly amassed a following of 492,000 subscribers since launching in October 2025. Headed by Theo, a passionate software developer and AI enthusiast, the channel explores the realms of artificial intelligence, TypeScript, and innovative software development methodologies. Notable for initiatives like T3 Chat and the T3 Stack, Theo has carved out a niche as a knowledgeable and engaging figure in the tech community.

