GPT-5.5 Is Great, But You Might Not Notice—Here's Why
OpenAI's GPT-5.5 dominates benchmarks and handles complex coding tasks, but many users won't feel the upgrade. We dig into the paradox.
Written by Yuki Okonkwo
April 25, 2026

Photo: The AI Daily Brief: Artificial Intelligence News / YouTube
Here's the weird thing about GPT-5.5: it's objectively better than its predecessors by almost every measure, but a lot of people using it won't actually feel different. And that's not a criticism—it's maybe the most interesting thing about where we are with frontier AI models right now.
OpenAI dropped GPT-5.5 (aka "Spud") on Friday, billing it as "a new class of intelligence for real work." The benchmarks back that up. It scored 82.7% on Terminal Bench 2.0 compared to Claude Opus 4.7's 69.4%. It topped Artificial Analysis's intelligence index by three points, becoming the first model to score in the 60s. The company positioned it squarely as a knowledge work model—writing, debugging code, analyzing data, moving across tools until tasks are done.
But then you get takes like this from Matt Schumer: "I've been using GPT-5.5 for the last few weeks, it's a massive leap forward. But the weird thing is for 99% of users, it probably won't matter."
That sounds contradictory until you dig into what he means. The previous generation of models—GPT-5.4, Claude Opus—were already crushing most normal work. "If I ask it to build something normal, it crushes it," Schumer wrote. "But GPT-5.3 Codex already crushed it. GPT-5.4 already crushed it. Opus often crushed it. The ceiling is getting so high that a lot of normal work does not stress the models anymore."
We've hit a plateau of competence where the delta between "really good" and "slightly more really good" doesn't register in day-to-day workflows. Unless you're doing something that genuinely pushes the boundaries—complex coding, scientific research, multi-hour autonomous tasks—you might not notice the upgrade.
The Benchmark Wars Get Messier
Of course, not every number told a clean story. GPT-5.5 significantly underperformed Claude Opus 4.7 on SWE-bench Pro, a coding benchmark. OpenAI included a footnote suggesting Anthropic's model showed "signs of memorization" on some problems, which... yeah, that's doing some heavy lifting.
Tibo from OpenAI's Codex team pushed back hard: "You'll be missing out if you think SWE-bench is representative of anything real." He pointed to an OpenAI article from February arguing that SWE-bench Verified no longer measures frontier coding capabilities.
This is the messy reality of AI evaluation right now: benchmarks are imperfect proxies, and companies have incentives to emphasize the ones where they perform well. What matters more than any single score is the actual user experience—and on that front, the coding feedback has been overwhelmingly positive.
Flavio Adamo, an entrepreneur and engineer, captured it well: "GPT-5.5 is better than 5.4 at code. Yes, not because it suddenly turns every prompt into some magical perfect implementation, but because it seems to understand the shape of the request better. It writes cleaner code. It touches fewer things it does not need to touch."
That last bit matters more than you'd think. Anyone who's used AI coding assistants knows the frustration of asking for a small fix and getting back an over-engineered solution that touches unrelated files and adds abstractions you didn't ask for. "A model can be smart and still tiring to use," Adamo wrote. "GPT-5.5 feels less tiring."
Where 5.5 Actually Shines
The most compelling improvements seem to be around stamina and reliability for long-running tasks. Peter Gøsta from arena.ai reported having a migration run for over seven hours—"this literally never happened before." Another OpenAI engineer described setting a task before hanging out with friends for a few days, returning to find it had worked autonomously for 31 hours.
That's not a marginal improvement. That's the difference between using AI as a helpful assistant and using it as a genuinely autonomous agent that can handle complex multi-step work while you're offline.
Cost is another area where the picture gets interesting. At $5 per million tokens in and $30 out, GPT-5.5 is double the price of GPT-5.4 and 20% more expensive than Claude Opus 4.7. But looking purely at token pricing misses the efficiency gains. As Noam Brown from OpenAI pointed out, "What matters is intelligence per token or per dollar." By that measure, GPT-5.5 "completely dominates the cost performance frontier."
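Brown's "intelligence per token" point is easy to make concrete with a little arithmetic: if a model finishes a task in fewer tokens (fewer retries, less back-and-forth), a higher per-token price can still mean a lower cost per task. A minimal sketch, using the article's per-token prices; the per-task token counts below are invented purely for illustration:

```python
# Per-task cost: what matters is tokens actually burned, not just the rate card.
# Prices are the article's figures (USD per million tokens); GPT-5.4 is listed
# at half of GPT-5.5's rate. The token counts per task are hypothetical.

PRICES = {
    "gpt-5.5": {"in": 5.00, "out": 30.00},
    "gpt-5.4": {"in": 2.50, "out": 15.00},
}

def task_cost(model: str, tokens_in: int, tokens_out: int) -> float:
    """Dollar cost of one task given input/output token counts."""
    p = PRICES[model]
    return (tokens_in * p["in"] + tokens_out * p["out"]) / 1_000_000

# Suppose (hypothetically) the newer model solves a task in one shot while
# the older one needs retries and burns 3x the tokens overall:
new = task_cost("gpt-5.5", tokens_in=20_000, tokens_out=5_000)   # $0.250
old = task_cost("gpt-5.4", tokens_in=60_000, tokens_out=15_000)  # $0.375
print(f"GPT-5.5: ${new:.3f} per task, GPT-5.4: ${old:.3f} per task")
```

Under those assumed numbers, the model that costs twice as much per token comes out a third cheaper per completed task, which is the shape of the "cost performance frontier" argument.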
Design and planning remain areas where Claude Opus 4.7 seems to hold advantages. Multiple reviewers noted that Opus writes better plans and has superior aesthetic sense. The emerging workflow for some users: Opus for planning and design concepting, GPT-5.5 for execution.
The Mythos Shadow
All of this is happening against the backdrop of Anthropic's Mythos model—a reportedly powerful model that Anthropic says is too dangerous to release publicly. The decision has generated... let's say mixed reactions. Some skepticism centers on whether cybersecurity concerns are the real reason, or if compute constraints are doing more of the work.
OpenAI's messaging around GPT-5.5 seemed carefully crafted as a counterpoint. Sam Altman emphasized "iterative deployment" and "democratization," writing that the company wants "people to be able to use lots of AI" and believes "the world will be best equipped to win at the team sport of AI resilience" through broad access.
Whether 5.5 truly matches Mythos is unknowable until Anthropic actually releases their model. As one commenter put it: "Mythos benchmarks do not matter until released to the public. As far as I'm concerned, it does not exist."
What we can say is that GPT-5.5 represents OpenAI's clearest bid to reclaim narrative territory around "real work"—the coding, knowledge work, and agentic capabilities where Claude had been gaining ground. The company that once seemed to be pursuing every possible application (video generation with Sora, browsing, consumer features) is now laser-focused on being the tool that knowledge workers and developers reach for first.
The paradox remains: this is a genuinely impressive model that many people won't fully appreciate, precisely because we've gotten so accustomed to AI being genuinely impressive. The ceiling keeps rising, but most of our work doesn't require us to hit it.
—Yuki Okonkwo
Watch the Original Video
What I Learned Testing GPT 5 5
The AI Daily Brief: Artificial Intelligence News
32m 56s

About This Source
The AI Daily Brief: Artificial Intelligence News
Launched in December 2025, The AI Daily Brief: Artificial Intelligence News is a YouTube channel committed to delivering daily updates and insights on the dynamic field of artificial intelligence. Despite its relatively recent debut, the channel has quickly become a key player in the AI information landscape, consistently engaging viewers with a wide array of AI-related content. Subscriber numbers remain undisclosed, yet the channel's active posting and diverse topic coverage underscore its growing role in the AI community.