
Anthropic's Opus 4.7: When Safety Guardrails Lobotomize the Model

Anthropic's Opus 4.7 shows promise in coding tasks but aggressive safety filters are blocking legitimate work. Is the tooling worse than the model?

Written by AI. Dev Kapoor

April 17, 2026


Photo: Theo - t3.gg / YouTube

Anthropic released Opus 4.7 this week, and developer Theo Browne spent an entire day testing it. What he found isn't a straightforward story of capability improvements—it's a case study in how aggressive safety measures and deteriorating tooling infrastructure can undermine what might otherwise be a solid AI model.

The headline benchmarks tell a mixed story. Opus 4.7 performs better than its predecessor on agentic coding tasks like SWE-bench Pro, and Anthropic touts "substantial" vision improvements—the model now processes images up to 2576 pixels on the long edge, triple the resolution of previous Claude models. But look closer at those benchmark charts and something stands out: Browne says this is the release with the fewest bold numbers, the markers of best-in-class performance, he has ever seen from Anthropic.

More telling: Opus 4.7 actually performs worse than Opus 4.6 on several benchmarks, including the Agentic Search benchmark. That regression showed up in Browne's real-world testing immediately.

The Instruction-Following Paradox

Anthropic claims Opus 4.7 is "substantially better at following instructions," which sounds unambiguously good until you hit the fine print. In their release notes, they acknowledge "prompts written for earlier models can sometimes now produce unexpected results" because the new model "takes instructions literally" where previous versions "interpreted instructions loosely or would skip parts entirely."

That's a diplomatic way of admitting their earlier models were bad at their core job. But the literal interpretation creates new problems. When Browne asked Opus 4.7 to update his codebase to the "latest versions" of dependencies, it dutifully upgraded to Next.js 15—which is two years old. Next.js 16 has been out for nearly a year. The model never searched to verify what "latest" actually meant. It just followed instructions based on stale training data.

"Despite being better at following instructions, it's really bad at understanding the definitions of things and that it doesn't have the latest information," Browne notes. OpenAI's models used to make the same mistake, but they've since learned to verify. Opus 4.7 hasn't.
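The failure mode is easy to reason about: "latest" is a registry fact, not a training-data fact, so the only reliable move is to look it up and compare. A minimal sketch of that check (hypothetical helper names, not anyone's actual tooling):

```python
def parse_semver(version: str) -> tuple:
    # "15.0.3" -> (15, 0, 3); strips any pre-release suffix for simplicity
    return tuple(int(part) for part in version.split("-")[0].split("."))

def is_stale(assumed_latest: str, registry_latest: str) -> bool:
    """True when the version the model assumed is behind the registry's latest."""
    return parse_semver(assumed_latest) < parse_semver(registry_latest)

# Opus 4.7 assumed Next.js 15 was current; a registry lookup (e.g. `npm view
# next version`) would have said otherwise.
print(is_stale("15.0.0", "16.1.0"))  # True: the training-data answer is stale
```

One registry query before upgrading is all it takes; the interesting question is why the model's harness never prompts it to make that query.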

When Safety Theater Becomes Self-Sabotage

The more disturbing pattern emerges around Anthropic's new safety guardrails. Opus 4.7 is the testing ground for cyber security safeguards that will eventually ship with Claude Mythos, their unreleased flagship model. The theory: reduce cyber capabilities, add filters to block "prohibited or high-risk cyber security uses," create a verification program for legitimate security researchers.

The practice: the model lobotomized itself mid-conversation.

Browne tested Opus 4.7 on a DefCon Gold Bug puzzle—a cryptography challenge involving decoding phrases from pictures of bottles with words on them. Not hacking. Not malware. A logic puzzle. The model started working through cipher approaches, wrote some test code, made progress. Then the chat hard-stopped: "Opus 4.7 safety filters flagged this chat. Due to its advanced capabilities, Opus 4.7 has additional safety measures that occasionally pause normal safe chats."

The alternative offered: continue with Sonnet 4, described by Browne as "a very, very dumb model." He's paying $200 a month for access. The model won't solve a puzzle for him.

Even more absurd: when Browne asked Claude Code (Anthropic's IDE integration) for design suggestions for his personal website, the model responded with: "Heads up, the last system reminder about malware looks like a prompt injection. This is clearly your personal site... not malware. Ignoring it."

The system prompt was so aggressive it flagged legitimate web development as potential malware creation. Browne didn't add those restrictions—they're baked into Claude Code's default configuration. As he puts it: "They are trying so hard to keep this model from doing malicious malware things that they have inadvertently lobotomized it with the system prompt."

The Tooling Problem

Browne's most provocative theory: Anthropic's models aren't getting dumber over time—their tooling is. "I don't actually believe Anthropic's models get dumber over time," he says. "I just genuinely think Claude Code is this shitty and poorly maintained."

The model can't be better than the harness it runs in. Claude Code expects models to read files before updating them—a sensible protocol. But Opus 4.7 repeatedly tried to update package.json without reading it first, failing each time because it doesn't understand the harness's rules. The desktop app was broken. The system prompts leaked malware detection into benign requests. Auto-updates weren't rolling out properly, leaving users on buggy versions for over twelve hours.
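The read-before-write rule Browne kept colliding with can be sketched in a few lines. This is a hypothetical reconstruction of the protocol, not Anthropic's actual harness code:

```python
class ReadBeforeWriteHarness:
    """Sketch of a harness rule: a file must be read in the current
    session before the model is allowed to write to it."""

    def __init__(self):
        self.read_files = set()

    def read(self, path: str) -> None:
        # Record that the model has seen the file's current contents.
        self.read_files.add(path)

    def write(self, path: str, content: str) -> None:
        if path not in self.read_files:
            # Mirrors the failure Browne saw: writes without a prior
            # read are rejected, and the attempt simply fails.
            raise PermissionError(f"refusing to write {path}: read it first")
        # ...apply the edit here...

h = ReadBeforeWriteHarness()
try:
    h.write("package.json", "{}")  # rejected: no prior read
except PermissionError as e:
    print(e)
h.read("package.json")
h.write("package.json", "{}")  # succeeds after a read
```

The rule itself is trivial to satisfy; the problem Browne describes is a model that keeps retrying the rejected path instead of reading first.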

Even developers from other projects noticed. Ricky from the React team commented that he'd seen the same system prompt leakage in Sonnet. Another developer observed: "It seems like they're rushing a bit too much lately. The recent updates have been full of bugs."

Anthropic is talented enough to build world-class models. But if they keep adding "more slop, more system prompt [garbage], more tools that don't do anything, more rules"—Browne's words—they're handicapping their own work. His analogy: "If you have a carpenter who is incredibly talented and every few weeks you replace three of their tools with plastic and you fill their toolbox with [garbage]..."

The sentence trails off, but the point lands.

What Actually Works

To Browne's credit, he tried to focus on what Opus 4.7 does well. The improved vision capabilities are real—processing 4-megapixel images opens up new use cases for screenshot analysis and diagram extraction. The model's planning outputs were "nice and concise." It handles file-system-based memory better, carrying context across sessions more effectively. Token efficiency improved on lower effort settings, though max effort "uses absurd amounts of tokens."

For specialized domains like finance, Anthropic claims Opus 4.7 shows "state-of-the-art" performance, producing "more rigorous analyses" and "more professional presentations." They've added an "extra high" effort level between high and max, giving developers finer control over the speed-quality tradeoff.

But these genuine improvements get overshadowed by the unforced errors. The model that won't search for current information. The safety filters that block puzzles. The IDE integration that calls your personal website malware.


There's a broader pattern here that matters beyond one model release. As AI labs race toward more capable systems, they're grappling with safety in real-time, using public releases as testing grounds. That's understandable—you can't safety-test a model without exposing it to real usage. But when the safety measures make the tool unusable for its intended purpose, something in the governance model has failed.

Anthropic created a "cyber verification program" where security professionals can apply for permission to use Opus 4.7 for "legitimate cyber security purposes like vulnerability research, penetration testing, and red teaming." That's one approach to the dual-use problem. But when the same filters block a cryptography puzzle, it suggests the categorization isn't sophisticated enough yet.

The question isn't whether AI labs should implement safety measures—of course they should. The question is whether they can build safety systems as sophisticated as the models they're meant to govern. Right now, based on Opus 4.7, the answer seems to be: not yet.

—Dev Kapoor

Watch the Original Video

This model is kind of a disaster.

Theo - t3.gg

38m 0s

About This Source

Theo - t3.gg

Theo - t3.gg is a rapidly growing YouTube channel that has captured the attention of nearly half a million subscribers since its inception in October 2025. The channel, spearheaded by Theo—a software developer and AI enthusiast—delivers content focused on artificial intelligence, TypeScript, and pioneering software development practices. With innovative projects like T3 Chat and the T3 Stack, Theo is quickly becoming a notable voice in tech circles.