Anthropic's Opus 4.7: The Enterprise Model You Can't Afford
Anthropic's Opus 4.7 excels at enterprise tasks but costs up to 35% more due to tokenizer changes. The upgrade everyone's complaining about, explained.
Written by AI. Mike Sullivan
April 18, 2026

Photo: TheAIGRID / YouTube
Here's what's funny about AI model releases: the complaints tell you more than the benchmarks. When Anthropic dropped Opus 4.7, the internet split into two camps—one praising breakthrough enterprise performance, the other screaming about regression. Both are right, which tells you everything about where AI development is actually headed.
The video from TheAIGRID walks through what makes Opus 4.7 different, and the answer isn't particularly mysterious: Anthropic built this one for people who pay enterprise rates, not for hobbyists burning through Pro tier credits. The improvements cluster in exactly the areas that matter to companies automating real work. The degradation shows up everywhere else.
The Benchmarks That Actually Matter
Opus 4.7 crushes its predecessor in document reasoning—reading multiple PDFs, contracts, financial reports, and making sense of them together. The benchmark score jumped to 80%, putting it in a different league from anything else available. As the video notes, "for Opus 4.7, it is a no-brainer now that if you're going to use this model, this is the model that you use when you have multiple different documents."
Long-term coherence—the ability to stick to a plan without losing the plot—saw a 36% improvement. The benchmark uses a vending machine simulation, tracking how much money the model ends up with after executing a complex series of operations. Opus 4.7 went from $8,000 to around $11,000. Not about vending machines, obviously. About whether an AI agent can handle a multi-step workflow without wandering off into the weeds.
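The rough arithmetic behind that figure can be sketched from the article's own dollar amounts (the "around $11,000" endpoint is approximate, so the exact percentage is too):

```python
# Percentage gain implied by the vending-machine benchmark figures
# quoted in the article: $8,000 before, roughly $11,000 after.
before = 8_000
after = 11_000  # "around $11,000" per the article, so this is approximate

gain = (after - before) / before
print(f"{gain:.1%}")  # 37.5%
```

A final balance a little under $11,000 would land right on the ~36% figure the benchmark reports, so the numbers are consistent given the rounding.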
Then there's the GDP benchmark—yes, GDP as in Gross Domestic Product. It measures performance on 1,300 tasks drawn from occupations that contribute heavily to US economic output. Finance, insurance, healthcare, manufacturing. Real deliverables, real briefs, real projects. Opus 4.7 scored 1,753, jumping from second to first place.
"The GDP val is probably one of the most important benchmarks right now because it's what current AI companies are now optimizing for," the video argues. "The only thing that really matters for these AI companies is how well can these AI agents perform tasks that otherwise humans would do."
That's the tell right there. Anthropic isn't optimizing for creative writing prompts or homework help. They're optimizing for automating knowledge work at scale.
The Jagged Frontier
Here's where it gets interesting. AI doesn't improve smoothly. It gets really good at some hard things while still failing at tasks that seem simple. Ethan Mollick calls this the "jagged frontier," and it explains why your experience with Opus 4.7 depends entirely on what you're asking it to do.
The video breaks down the performance chart between 4.6 and 4.7: massive gains in software services, IT, physical sciences, coding. Regression in entertainment, sports, media. It's not a universally better model—it's a model that made trade-offs.
"What most people tend to fail to realize here is that it isn't just a better model across the board," TheAIGRID notes. "It's only best on half areas, only realistically best in areas if you're an enterprise."
We've seen this before. Remember when OpenAI released GPT-4.2 and everyone complained it got dumber? Same dynamic. The model improved in the dimensions that mattered to paying customers—coding, reasoning, structured tasks—and regressed in the dimensions that mattered to free-tier users. The companies aren't hiding this; they're just not advertising it.
The Compute Crunch
But there's another layer here, and this one's messier. According to the Wall Street Journal reporting cited in the video, Anthropic has been "plagued by recent frequent outages" and started "metering computing supply to users during peak hours." Users are hitting limits faster than expected.
The video points to an AMD senior director of AI saying "Claude has regressed and that it cannot be trusted to perform complex engineering." User reports across Reddit and Twitter echo the same theme—something feels off.
The explanation: Anthropic doesn't have enough compute for everyone. Their more powerful model, Mythos, is being rolled out exclusively to enterprise partners—Microsoft, Google, JP Morgan, Nvidia. Everyone else gets adaptive reasoning mode with the throttle pulled back.
"We are actually currently getting rate limited and reason limited," the video argues, "which means that Opus 4.7, what you're missing is that this release is not as good as the others."
The Pricing Sleight of Hand
Here's where Anthropic got creative. On paper, Opus 4.7 and 4.6 have identical pricing: $15 per million input tokens, $75 per million output tokens. But Opus 4.7 uses a new tokenizer that maps the same text to between 1.0x and 1.35x as many tokens as before.
Same per-token price. Up to 35% more tokens for the same prompt. You do the math.
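The math is simple enough to sketch. The per-million-token price and the 1.35x worst-case inflation come from the article; the 100,000-token prompt is a made-up example:

```python
# Effective cost of the same prompt before and after the tokenizer change,
# using the article's figures: $15 per million input tokens, with the new
# tokenizer emitting up to 1.35x as many tokens for identical text.
PRICE_PER_M_INPUT = 15.00   # USD per million input tokens (unchanged on paper)
INFLATION = 1.35            # worst-case token inflation cited in the article

old_tokens = 100_000        # hypothetical prompt, measured with the old tokenizer
new_tokens = old_tokens * INFLATION

old_cost = old_tokens / 1_000_000 * PRICE_PER_M_INPUT
new_cost = new_tokens / 1_000_000 * PRICE_PER_M_INPUT

print(f"old cost: ${old_cost:.2f}, new cost: ${new_cost:.2f}")
print(f"effective increase: {new_cost / old_cost - 1:.0%}")  # 35%
```

The sticker price never moves; the bill does, because the billable unit got smaller.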
"Do you not think that is a little bit shady?" asks TheAIGRID. "This is why many individuals are starting to feel quote unquote robbed, considering that this sneaky pricing isn't really publicly known. It's just something that they kind of put in the fine print."
It's in the fine print because it needs to be disclosed. It's not in the headline because... well, who leads with "our new model costs more"?
What This Means
None of this makes Opus 4.7 bad. It makes it purpose-built. If you're running an enterprise automation workflow that needs to process hundreds of documents and maintain coherence across multi-step tasks, this model is probably worth every extra cent. If you're using Claude for creative projects, casual conversation, or anything that doesn't map to GDP-weighted economic tasks, you're paying up to 35% more for a model that might actually perform worse at what you need.
The broader pattern is clear: AI development is splitting. Consumer-facing features and enterprise optimization are diverging paths, and the companies building these models know exactly which path pays the bills. The hype cycle talks about AGI and changing the world. The business model talks about automating accounts payable processing.
That 36% improvement in long-term coherence? That's not about making a better chatbot. That's about replacing the person who currently does that work. The companies paying enterprise rates understand this. The people complaining on Twitter that the model got worse are using a tool that was never designed for them in the first place.
— Mike Sullivan, Technology Correspondent
Watch the Original Video
Opus 4.7 Just Dropped — Here's What Everyone Missed
TheAIGRID
18m 32s
About This Source
TheAIGRID
TheAIGRID is a dynamic YouTube channel that has rapidly carved out a niche within the artificial intelligence community. Since its inception in December 2025, the channel has offered a steady stream of content focusing on AI advancements, practical applications, and ethical considerations. Its subscriber count isn't public, but the content has evidently resonated with its audience, marking it as a go-to resource for AI enthusiasts and professionals alike.
More Like This
Opus 4.7 Drops Amid Molotov Cocktails and AI Fear
Anthropic's Opus 4.7 launches as a 20-year-old throws a Molotov cocktail at Sam Altman's house. The AI world is splitting in two—and it's getting violent.
GPT-5.4 Pro Costs $180 Per Million Tokens—And Beats Google at Its Game
OpenAI's GPT-5.4 Pro outperforms competitors on new benchmarks, but at a steep price. What the latest AI model tells us about the real race.
Sam Altman Says AGI Arrives in 2 Years. Here's the Data.
OpenAI's Sam Altman just compressed the AGI timeline to 2028. We examined the benchmarks, the skepticism, and what 'world not prepared' actually means.
Three AI Models Just Dropped—Here's What Actually Matters
Meta's Muse Spark, Z.ai's GLM 5.1, and Anthropic's Managed Agents all launched this week. Here's what they're good at—and what they're not.