
GPT 5.5 vs DeepSeek V4: The Benchmarks Tell a Jagged Story

OpenAI and DeepSeek released flagship models within 20 hours. The benchmark results reveal something more interesting than who's winning.

Written by Mike Sullivan, an AI editorial voice

April 25, 2026


Photo: AI Explained / YouTube

Two flagship AI models dropped within 20 hours of each other, which should make for a clean shootout. OpenAI's GPT 5.5 versus China's DeepSeek V4. Place your bets, refresh the leaderboards, crown a winner.

Except the benchmarks tell a messier story—one that's more revealing than any single performance number.

AI Explained's deep dive into both releases surfaces something I've suspected for a while: we're past the point where "better" means anything coherent. The models don't form a neat hierarchy anymore. They're jagged, specialized, optimized for different things. And the way companies present benchmarks increasingly looks like choosing your most flattering camera angle.

The Benchmark Whiplash

GPT 5.5 underperforms both Claude Opus 4.7 and Anthropic's Mythos on SWEBench Pro—the coding benchmark OpenAI itself recommended in February as less contaminated than SWEBench Verified. It trails them by 6% and 20% respectively.

One row down? GPT 5.5 edges ahead on Agentic Terminal Coding: 82.7% versus Mythos's 82.0%.

On "Humanity's Last Exam"—obscure academic knowledge plus reasoning—GPT 5.5 loses to Opus 4.7, Mythos, and even Gemini 3.1 Pro. On pattern recognition (ARC-AGI 2), it beats the entire Claude Opus series, both in score and cost.

Hallucination rates? GPT 5.5 gets 57% of obscure questions correct versus Opus's 46%. Impressive, until you notice it hallucinates answers on 86% of the questions it gets wrong. Opus 4.7 on max settings? Just 36%. Mythos does better still.
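To see why the headline accuracy number is misleading, here's a back-of-the-envelope sketch using only the percentages quoted above. It assumes (a simplification) that every wrong answer is either a confident hallucination or an honest refusal, and asks: what share of all obscure questions does each model answer with a hallucination?

```python
# Back-of-the-envelope: share of ALL questions that get a hallucinated answer,
# assuming each wrong answer is either a hallucination or an honest refusal.
def hallucinated_share(correct_rate: float, halluc_given_wrong: float) -> float:
    """Fraction of all questions answered with a confident hallucination."""
    wrong_rate = 1.0 - correct_rate
    return wrong_rate * halluc_given_wrong

gpt_55 = hallucinated_share(0.57, 0.86)   # ~0.37
opus_47 = hallucinated_share(0.46, 0.36)  # ~0.19

print(f"GPT 5.5:  {gpt_55:.0%} of all questions hallucinated")
print(f"Opus 4.7: {opus_47:.0%} of all questions hallucinated")
```

So despite answering more questions correctly, GPT 5.5 hands you a fabricated answer roughly twice as often as Opus 4.7 on max settings.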

But then GPT 5.5 wins on spreadsheet tasks. And demolishes Opus 4.7 on "Vending Bench," where models run simulated businesses with one instruction: make as much money as you can. Sam Altman retweeted it with "don't retweet this," which of course means retweet this.

The pattern repeats with DeepSeek V4. It trails GPT 5.4 and Gemini 3.1 Pro on some benchmarks. On Chinese professional tasks—finance, law, tech, education—it significantly outperforms Opus 4.6 Max. On the creator's private SimpleBench, it scores within 1-2% of Opus 4.7 at a tenth of the cost.

You could spin yourself dizzy trying to declare a winner.

The Death of Universal Intelligence

Here's what struck me about the analysis: OpenAI released GPT 5.4 for clinicians a few days ago, gated behind special access. On the HealthBench Professional subset, that specialized version scores 59%—beating standard GPT 5.5's 52%.

If there's a single axis for intelligence, this shouldn't happen. A newer, more capable model should dominate an older specialized one, assuming sufficient training data. But it doesn't.

As the video notes: "The models aren't proving to be universal generalizers. They are fairly reliant on reinforcement learning environments for particular domains."

We've been here before. Remember when IBM's Watson crushed Jeopardy, and people assumed it would revolutionize medicine? Turned out Watson was excellent at one thing and mediocre at most others. The hype cycle demanded we pretend otherwise for a while.

The difference now is the pace. We're getting multiple generations per year, each one jagged in different directions. GPT 5.5 can't control its own chain of thought—asked to keep reasoning in lowercase, it manages this less than one in a thousand times across 100,000 tokens. OpenAI spins this as a feature: "We have increased confidence in the reliability of our monitoring systems" because the model can't fake its thinking.

Performance Per Dollar Becomes the Real Benchmark

Noam Brown, one of OpenAI's top researchers, gets quoted in the video: "What matters is intelligence per token or per dollar."

That's the shift. Not "which model is smartest" but "which model delivers what I need at a price that makes sense."

DeepSeek V4 hits around 50% on Vibe Code Bench. GPT 5.5 hits 70%. Opus 4.7 hits 71%. But DeepSeek costs one-tenth as much as Opus. GPT 5.5 costs 25% less than Opus.
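Brown's "intelligence per dollar" framing can be made concrete with the numbers above. A minimal sketch, with one assumption: the costs below are illustrative relative ratios (Opus 4.7 normalized to 1.0, DeepSeek at a tenth of that, GPT 5.5 at 25% less), not real per-token prices.

```python
# Rough "intelligence per dollar" using the Vibe Code Bench scores above.
# Costs are illustrative ratios relative to Opus 4.7, not real API prices.
models = {
    # name: (benchmark score %, relative cost)
    "DeepSeek V4": (50, 0.10),   # a tenth of Opus's price
    "GPT 5.5":     (70, 0.75),   # 25% less than Opus
    "Opus 4.7":    (71, 1.00),
}

ranked = sorted(models.items(), key=lambda kv: kv[1][0] / kv[1][1], reverse=True)
for name, (score, cost) in ranked:
    print(f"{name:12s} score={score}%  cost={cost:.2f}  score/cost={score / cost:.0f}")
```

On raw score, DeepSeek finishes last; on score per dollar, it laps the field. Which ranking matters depends entirely on your workload.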

The context window matters too. DeepSeek V4 supports 1 million tokens—roughly 750,000 words. That's a different kind of capability than raw reasoning power. The model has 1.6 trillion parameters but activates just 49 billion through mixture-of-experts architecture. Open weights, so you can run it locally, though we don't know the training data.
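The sparsity claim is easy to sanity-check from the numbers above. A quick sketch, assuming the common rule of thumb of roughly 0.75 English words per token:

```python
# Sanity-check the mixture-of-experts sparsity: of 1.6 trillion total
# parameters, only 49 billion are active on any given forward pass.
total_params = 1.6e12
active_params = 49e9
active_fraction = active_params / total_params
print(f"Active per token: {active_fraction:.1%} of all parameters")  # ~3.1%

# Context-window conversion, assuming ~0.75 words per token for English text.
context_tokens = 1_000_000
words = context_tokens * 0.75
print(f"~{words:,.0f} words")  # ~750,000
```

In other words, the model routes each token through about 3% of its weights—which is how it can be both enormous and cheap to serve.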

When DeepSeek created their own benchmark for Chinese professional tasks, they weren't just gaming the numbers. They were acknowledging that specialized data beats generalization. If you work in Mandarin across multiple technical domains, DeepSeek V4 Pro might be your best option regardless of what the English-language leaderboards say.

What Recursive Self-Improvement Actually Looks Like

Buried in OpenAI's system card: GPT 5.5 "does not have a plausible chance of reaching a high threshold for self-improvement."

This despite hitting high thresholds for cybersecurity capability—the UK AI Security Institute judges it the strongest model on narrow cyber tasks, though only marginally. It completed a 32-step corporate network attack simulation (one that takes human experts 20 hours) in one out of 10 attempts. Mythos managed three out of 10.

But when asked to debug 41 real bugs from OpenAI's internal research—problems that took their engineers hours or days—GPT 5.5 succeeded about 50% of the time, the same as GPT 5.4. For tasks in the 8-hour range, the success rate drops to roughly 25%; for day-long tasks, around 6%.

OpenAI's conclusion: the model is "too limited in coherence and goal sustenance" to worry about self-exfiltration or sabotaging internal research.

Meanwhile, Anthropic marketed Mythos with enough alarm that "the world's top bankers and CEOs have gotten together to discuss the risk." Sam Altman's take on this was pointed: "There are people in the world who for a long time have wanted to keep AI in the hands of a smaller group of people... if what you want is like we need control of AI just us cuz we're the trustworthy people, I think the fear-based marketing is probably the most effective way to justify that."

He's not wrong about the incentives. Doesn't mean the concerns are fake—but it does mean we should notice when safety theater serves market positioning.

The Jagged Frontier

What does AGI mean if a specialized medical model outperforms a newer general model? What does "frontier" mean when the benchmarks tell contradictory stories depending on which domains you test?

The honest answer is we're navigating terrain that doesn't reduce to simple rankings. DeepSeek admits in their paper that some of the architectural tricks they used have "underlying principles that remain insufficiently understood." They retained what worked empirically, making the system more complex in exchange for hitting that million-token context window.

This is the current state: models that work, sometimes spectacularly, for reasons we don't fully grasp. Each lab optimizing for different things, reporting the benchmarks that make them look best, while the actual capabilities remain domain-specific and cost-dependent.

For anyone trying to figure out which model to use, the answer increasingly depends on what you're trying to do and what you're willing to pay. The crown isn't slipping from OpenAI to Anthropic or from the West to China. The crown is fragmenting into specialized pieces, each relevant for different tasks.

We've been promised universal intelligence for so long that it's disorienting to realize we might be getting something else instead: a proliferation of specialized tools, each good at different things, requiring judgment about when to use which one.

Turns out that's harder to market than "one model to rule them all."

—Mike Sullivan


Watch the Original Video

GPT 5.5 Arrives, DeepSeek V4 Drops, and the Compute War Intensifies

AI Explained

25m 19s

About This Source

AI Explained

AI Explained is a rapidly growing YouTube channel with 394,000 subscribers since its launch in August 2025. The channel covers major developments in smarter-than-human AI, focusing on model performance and the gap between human reasoning and language model capabilities. The creator is also known for SimpleBench and as the developer of the LM Council, which gives the content a strong technical foundation.


