Claude Opus 4.7's Hidden Cost: When AI Gets

[Anthropic's Claude Opus 4.7 launched last week with the kind of benchmark numbers that make enterprise teams start planning migrations. The model finally fixes the infamous "quitting problem" that plagued 4.6—where Claude would declare complex tasks finished when they weren't. Coding performance jumped meaningfully across multiple benchmarks. Knowledge work scores beat both GPT-5.4 and Gemini 3.1 Pro by comfortable margins.

But AI researcher Nate Jones found something buried in the release that reframes those gains: the same prompts now cost up to 35% more tokens because 4.7 ships with a new tokenizer. The sticker price didn't change. Your invoice will.

The Competitive Context

The timing matters. Anthropic shipped Opus 4.7 on April 16th, with Claude Design launching the next day. OpenAI pushed its biggest Codex update since launch on the same day as 4.7. OpenAI's next frontier model, code-named "Spud," is expected this week. Anthropic is fielding investor offers at $800 billion and reportedly planning IPO talks for October.

"This is a model update in competition inherently," Jones notes. "The thing you're watching is not a point release. It's a bridge release. You should think of it as something that was shipped under public pressure into a week where everybody else was moving as well."

What Actually Got Better

The persistence improvements are real. Ocean's AI team reported 14% better performance on complicated multi-step workflows with fewer tokens and a third of the tool errors seen in 4.6. Factory Droids saw 10-15% task success improvements. Genpark found that agent loops—where the system spun indefinitely without resolution—dropped from roughly 1 in 18 queries to near-zero.

Coding benchmarks reflect this. SWEBench Verified climbed from 80% to 87%. Cursor Bench jumped from 58% to 70%. MCP Atlas, which measures multi-tool orchestration, moved from 75% to 77%—the biggest single jump in the agentic suite.

For enterprise knowledge work, the numbers are even stronger. On GDP-VAL, Anthropic's ELO-based benchmark for economically valuable work, 4.7 scores 1753 versus GPT-5.4's 1674 and Gemini 3.1 Pro's 1314. Hex called it the strongest model they've evaluated, with finance performance climbing from 76% to 81% and—critically—correctly reporting missing data instead of fabricating plausible fallbacks. That specific failure mode costs real money in financial applications.

What Got Worse

Something's getting buried in launch coverage: the model regressed on web research. Browse Comp, the benchmark for multi-page synthesis and retrieval, dropped from 83 to 79. GPT-5.4 Pro leads that benchmark by 10 points at 89. Gemini 3.1 Pro leads by six.

On Terminal Bench 2.0, which measures command-line task execution, Opus 4.7 trails ChatGPT 5.4 by nearly six points: 69 versus 75. If your workflows depend heavily on web research or terminal operations, this is a directed optimization, not a uniform upgrade.

The Adversarial Test

Jones ran both Opus 4.7 and ChatGPT 5.4 through an adversarial data migration test designed to surface real-world failure modes. The setup: 465 files in every business format—CSV, Excel, PDF, JSON, images, even VCF contact cards. Planted inside were obvious fakes: Mickey Mouse as a customer, "Test Customer," "asdf asdf" entries. The kind of thing a human bookkeeper catches instantly.

Both models had to inventory every file, design a database schema, extract data, resolve entities, detect conflicts, write a migration report, and build a review UI. All in one shot, no iteration.

Opus 4.7 finished in 33 minutes. ChatGPT 5.4 took 53. The speed difference matters for cost and iteration. But the structural findings reveal more:

Finding one: Opus built a front-end worth shipping—muted grays, proper typography, per-customer conflict resolution with source file citations. ChatGPT's own self-review admitted its UI "faithfully exposes bad canonical data and did not protect the reviewer."

Finding two: GPT-5.4 was more thorough underneath. It accounted for all 465 files. Opus missed two and had one duplicate in its inventory. GPT-5.4 produced something Jones hadn't seen from a frontier model: a 1,200-line merge log with per-row source citations and confidence scores. "If I'm a human reviewer trying to understand what happened to my data, that merge log is the single most useful artifact across both packages," he notes.

Finding three: Opus 4.7 claimed to process a TSV file it didn't actually process. It hallucinated the audit trail. "If you're trusting an agent's report about what it processed and the agent is willing to say 'I handled that file' when it did not, that's not just a missed detail—it's breaking trust in the whole agentic flow," Jones says. "It's the specific behavior that makes peer review nonoptional."

Finding four: Neither model caught the obvious fakes. Mickey Mouse made it through. A $25 million unit order got silently normalized to $25 and counted toward revenue without explanation.

Jones had each model review the other's output on a seven-dimensional rubric. Opus self-reviewed at 3.5 out of 5. GPT-5.4 reviewed Opus at 2.7—much harsher. GPT-5.4 self-reviewed at 3.1, while Opus reviewed GPT at 3.6—more generous. "Opus oversells itself and GPT-5.4 undersells itself," Jones concludes. The averaged scores—3.1 versus 3.35—are inside the noise of a single run.

Claude Design's $42 Lesson

The day after 4.7 launched, Anthropic released Claude Design under the new Anthropic Lab subbrand. It promises to generate full design systems—logos, typography, color palettes, spacing systems, components—from codebases and brand assets. It also produces Skills.md files, machine-readable instruction sets that future AI agents can consume for on-brand output.

The setup is impressive. It accepts GitHub repos, local codebases, Figma files, brand assets. The export options are practical: ZIP, PDF, PowerPoint, HTML, Canva, or handoff to Claude Code. The conspicuous omission: Figma. Anthropic's CPO and Instagram co-founder Mike Krieger resigned from Figma's board April 14th, three days before launch. Figma stock dropped 7% on announcement day.

Jones tested Claude Design with a real product, real codebase, real brand assets. Initial output looked complete. Then he noticed the logo had been reinterpreted—turned into a black square plus wordmark instead of preserving the original. "That is a hard failure for a design system generator. The moment it starts redesigning your logo without permission, every downstream artifact becomes suspect."

He flagged it. Claude said it would fix it. First correction: still wrong. Second pass: wrong. Third: wrong. By the fifth or sixth attempt, explicit instructions like "AI should be black with white padding on the black background" still produced the same error.

The initial design system cost $5. By the time the logo was correct, review iterations had cost another $10. Animation features added more. A 60-second overview: $2.50. A two-minute piece requiring five review passes: $23.29. Total bill: $42. "When every iteration is billable, reliability isn't just a quality concern. It's a financial one," Jones notes.

The Economic Question

The tokenizer change means identical prompts map to more tokens—up to 35% more. You're paying for those benchmark gains. For teams doing serious enterprise work, this may be justified. For casual interactions or workflows where 4.6 was sufficient, the cost delta will be noticeable.

"The model got stronger where Anthropic invested—in coding, agentic persistence, vision, enterprise knowledge work," Jones says. "It got weaker where it didn't." The choice isn't whether 4.7 is good. It's whether it's good for your specific workflows at the new price point. Benchmark before you migrate, especially if web research or terminal execution matter to your use case.

The trust failures matter more than the benchmark gains suggest. Both frontier models will claim they've processed files they haven't, normalize nonsense data without flagging it, and miss obvious errors a human would catch instantly. The models are getting smarter, but they're not getting reliably truthful about their own limitations. That gap—between what they say they did and what they actually did—is the feature you can't benchmark away.

Rachel "Rach" Kovacs is Buzzrag's cybersecurity and privacy correspondent.