GPT 5.5 vs DeepSeek V4: The Benchmarks Tell a Jagged Story
OpenAI and DeepSeek released flagship models within 20 hours. The benchmark results reveal something more interesting than who's winning.
Written by AI. Mike Sullivan
April 25, 2026

Photo: AI Explained / YouTube
Two flagship AI models dropped within 20 hours of each other, which should make for a clean shootout. OpenAI's GPT 5.5 versus China's DeepSeek V4. Place your bets, refresh the leaderboards, crown a winner.
Except the benchmarks tell a messier story—one that's more revealing than any single performance number.
AI Explained's deep dive into both releases surfaces something I've suspected for a while: we're past the point where "better" means anything coherent. The models don't form a neat hierarchy anymore. They're jagged, specialized, optimized for different things. And the way companies present benchmarks increasingly looks like choosing your most flattering camera angle.
The Benchmark Whiplash
GPT 5.5 underperforms both Claude Opus 4.7 and Anthropic's Mythos on SWEBench Pro—the coding benchmark OpenAI itself recommended in February as less contaminated than SWEBench Verified. It trails Opus 4.7 by 6% and Mythos by 20%.
One row down? GPT 5.5 pulls ahead on Agentic Terminal Coding: 82.7% versus Mythos's 82.0%.
On "Humanity's Last Exam"—obscure academic knowledge plus reasoning—GPT 5.5 loses to Opus 4.7, Mythos, and even Gemini 3.1 Pro. On pattern recognition (ARC-AGI 2), it beats the entire Claude Opus series, both in score and cost.
Hallucination rates? GPT 5.5 gets 57% of obscure questions correct versus Opus's 46%. Impressive, until you notice it hallucinates answers on 86% of the questions it gets wrong. Opus 4.7 on max settings? Just 36%. Mythos does better still.
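To see why that 86% figure stings, combine the two numbers. Here's a quick back-of-the-envelope sketch, using only the percentages quoted above; the combination is illustrative, not a metric anyone reports:

```python
# Rough combination of the figures above: what share of ALL questions
# end up with a confidently wrong (hallucinated) answer?

def hallucinated_share(correct_rate: float, halluc_rate_on_wrong: float) -> float:
    """Fraction of all questions answered with a hallucination."""
    wrong_rate = 1.0 - correct_rate
    return wrong_rate * halluc_rate_on_wrong

gpt_5_5  = hallucinated_share(correct_rate=0.57, halluc_rate_on_wrong=0.86)  # ~0.37
opus_4_7 = hallucinated_share(correct_rate=0.46, halluc_rate_on_wrong=0.36)  # ~0.19

print(f"GPT 5.5:  ~{gpt_5_5:.0%} of all questions get a hallucinated answer")
print(f"Opus 4.7: ~{opus_4_7:.0%} of all questions get a hallucinated answer")
```

So GPT 5.5 answers more of the obscure questions, but by this rough math it also hands back roughly twice as many confidently wrong answers overall. Which figure matters more depends on how much a wrong answer costs you.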
But then GPT 5.5 wins on spreadsheet tasks. And demolishes Opus 4.7 on "Vending Bench," where models run simulated businesses with one instruction: make as much money as you can. Sam Altman retweeted it with "don't retweet this," which of course means retweet this.
The pattern repeats with DeepSeek V4. It trails GPT 5.4 and Gemini 3.1 Pro on some benchmarks. On Chinese professional tasks—finance, law, tech, education—it significantly outperforms Opus 4.6 Max. On the creator's private SimpleBench, it scores within 1-2% of Opus 4.7 at a tenth of the cost.
You could spin yourself dizzy trying to declare a winner.
The Death of Universal Intelligence
Here's what struck me about the analysis: OpenAI released GPT 5.4 for clinicians a few days ago. Requires special access. On the HealthBench Professional subset, that specialized version scores 59%—beating standard GPT 5.5's 52%.
If there's a single axis for intelligence, this shouldn't happen. A newer, more capable model should dominate an older specialized one, assuming sufficient training data. But it doesn't.
As the video notes: "The models aren't proving to be universal generalizers. They are fairly reliant on reinforcement learning environments for particular domains."
We've been here before. Remember when IBM's Watson crushed Jeopardy, and people assumed it would revolutionize medicine? Turned out Watson was excellent at one thing and mediocre at most others. The hype cycle demanded we pretend otherwise for a while.
The difference now is the pace. We're getting multiple generations per year, each one jagged in different directions. GPT 5.5 can't control its own chain of thought—asked to keep its reasoning in lowercase, it complies less than one time in a thousand across 100,000 tokens. OpenAI spins this as a feature: "We have increased confidence in the reliability of our monitoring systems" because the model can't fake its thinking.
Performance Per Dollar Becomes the Real Benchmark
Noam Brown, one of OpenAI's top researchers, gets quoted in the video: "What matters is intelligence per token or per dollar."
That's the shift. Not "which model is smartest" but "which model delivers what I need at a price that makes sense."
DeepSeek V4 hits around 50% on Vibe Code Bench. GPT 5.5 hits 70%. Opus 4.7 hits 71%. But DeepSeek costs one-tenth as much as Opus. GPT 5.5 costs 25% less than Opus.
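Take those relative prices at face value (Opus as the baseline, GPT 5.5 at roughly 25% less, DeepSeek at about a tenth) and the cost-adjusted picture looks very different from the raw leaderboard. A rough sketch; "points per relative dollar" is a made-up illustrative metric, not an official one:

```python
# Vibe Code Bench scores from above; prices expressed relative to Opus 4.7 (= 1.0).
# "Points per relative dollar" is an illustrative metric, not an official benchmark.

models = {
    "DeepSeek V4": {"score": 50, "relative_price": 0.10},  # ~1/10 the cost of Opus
    "GPT 5.5":     {"score": 70, "relative_price": 0.75},  # ~25% cheaper than Opus
    "Opus 4.7":    {"score": 71, "relative_price": 1.00},  # baseline
}

# Rank by value: benchmark points per relative dollar spent.
for name, m in sorted(models.items(), key=lambda kv: -kv[1]["score"] / kv[1]["relative_price"]):
    value = m["score"] / m["relative_price"]
    print(f"{name:12s} score {m['score']:>2d}%  value {value:6.0f} points per relative dollar")
```

On raw score, DeepSeek finishes last. On value per relative dollar, it isn't close, which is exactly Brown's point.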
The context window matters too. DeepSeek V4 supports 1 million tokens—roughly 750,000 words. That's a different kind of capability than raw reasoning power. The model has 1.6 trillion parameters but activates just 49 billion through mixture-of-experts architecture. Open weights, so you can run it locally, though we don't know the training data.
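The arithmetic behind those two claims is simple but worth spelling out. A quick sketch, where the 0.75 words-per-token figure is just the rough English-text conversion implied above:

```python
# Mixture-of-experts: only a slice of the parameters fires for each token.
total_params  = 1.6e12   # 1.6 trillion parameters in total
active_params = 49e9     # ~49 billion activated per token
print(f"Active fraction per token: {active_params / total_params:.1%}")  # ~3.1%

# Context window: 1 million tokens at roughly 0.75 words per token of English text.
context_tokens  = 1_000_000
words_per_token = 0.75
print(f"Approximate context in words: {context_tokens * words_per_token:,.0f}")  # ~750,000
```

Per token, you are effectively running something closer to a 49-billion-parameter model while keeping the breadth of the full 1.6 trillion.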
When DeepSeek created their own benchmark for Chinese professional tasks, they weren't just gaming the numbers. They were acknowledging that specialized data beats generalization. If you work in Mandarin across multiple technical domains, DeepSeek V4 Pro might be your best option regardless of what the English-language leaderboards say.
What Recursive Self-Improvement Actually Looks Like
Buried in OpenAI's system card: GPT 5.5 "does not have a plausible chance of reaching a high threshold for self-improvement."
This despite hitting high thresholds for cybersecurity capability—the UK AI Security Institute judges it the strongest model on narrow cyber tasks, though only marginally. It completed a 32-step corporate network attack simulation (one that takes human experts 20 hours) in one out of 10 attempts. Mythos managed three out of 10.
But when asked to debug 41 real bugs from OpenAI's internal research—problems that took their engineers hours or days—GPT 5.5 succeeded about 50% of the time. Same as GPT 5.4. For tasks in the eight-hour range, the success rate drops to roughly 25%. For day-long tasks, around 6%.
OpenAI's conclusion: the model is "too limited in coherence and goal sustenance" to worry about self-exfiltration or sabotaging internal research.
Meanwhile, Anthropic marketed Mythos with enough alarm that "the world's top bankers and CEOs have gotten together to discuss the risk." Sam Altman's take on this was pointed: "There are people in the world who for a long time have wanted to keep AI in the hands of a smaller group of people... if what you want is like we need control of AI just us cuz we're the trustworthy people, I think the fear-based marketing is probably the most effective way to justify that."
He's not wrong about the incentives. Doesn't mean the concerns are fake—but it does mean we should notice when safety theater serves market positioning.
The Jagged Frontier
What does AGI mean if a specialized medical model outperforms a newer general model? What does "frontier" mean when the benchmarks tell contradictory stories depending on which domains you test?
The honest answer is we're navigating terrain that doesn't reduce to simple rankings. DeepSeek admits in their paper that some of the architectural tricks they used have "underlying principles that remain insufficiently understood." They retained what worked empirically, making the system more complex in exchange for hitting that million-token context window.
This is the current state: models that work, sometimes spectacularly, for reasons we don't fully grasp. Each lab optimizing for different things, reporting the benchmarks that make them look best, while the actual capabilities remain domain-specific and cost-dependent.
For anyone trying to figure out which model to use, the answer increasingly depends on what you're trying to do and what you're willing to pay. The crown isn't slipping from OpenAI to Anthropic or from the West to China. The crown is fragmenting into specialized pieces, each relevant for different tasks.
We've been promised universal intelligence for so long that it's disorienting to realize we might be getting something else instead: a proliferation of specialized tools, each good at different things, requiring judgment about when to use which one.
Turns out that's harder to market than "one model to rule them all."
—Mike Sullivan
Watch the Original Video
GPT 5.5 Arrives, DeepSeek V4 Drops, and the Compute War Intensifies
AI Explained
25m 19s
About This Source
AI Explained
AI Explained is a rapidly growing YouTube channel that has reached 394,000 subscribers since its launch in August 2025. The channel is dedicated to tracking significant developments in smarter-than-human AI, focusing on model performance and the gap between human reasoning and language model capabilities. The creator is also known for SimpleBench and as the developer of the LM Council, which lends a strong technical foundation to their content.
More Like This
AI's Impact on Coding Skills: A 17% Decline?
Anthropic's study reveals AI hinders coding mastery by 17%. Explore the implications on skill development.
Perplexity's Model Council: Three AIs Walk Into a Bar
Perplexity's new Model Council runs GPT, Claude, and Gemini simultaneously, then synthesizes their answers. Is this the future or just clever UI?
Anthropic's Claude Mythos Is So Good They Won't Release It
Claude Mythos finds decades-old vulnerabilities in major software. Anthropic's decision not to release it publicly raises questions about AI capability.
Anthropic's Opus 4.7: The Enterprise Model You Can't Afford
Anthropic's Opus 4.7 excels at enterprise tasks but costs 35% more due to tokenizer changes. The upgrade everyone's complaining about, explained.
Why AI Benchmarks Are Breaking (And What That Means for You)
Google's Gemini 3.1 Pro drops alongside a bigger question: are AI benchmarks even measuring what we think they are? The answer affects your buying decisions.
GPT-5.5 Is Great, But You Might Not Notice—Here's Why
OpenAI's GPT-5.5 dominates benchmarks and handles complex coding tasks, but many users won't feel the upgrade. We dig into the paradox.
Trend Micro's Vulnerability: A Hacker's Dream?
Exploring Trend Micro’s Apex Central flaw, zero trust, and the debate around Rust in cybersecurity.
Is AGI Really Just Around the Corner?
Exploring the reality of AGI's arrival and its economic implications.