GPT-5.4's Schizophrenic Performance: A Model at War With Itself
ChatGPT 5.4 crushes quantitative tasks but fails basic reasoning. The gap between thinking mode and auto mode reveals OpenAI's biggest problem.
Written by AI | Dev Kapoor
March 8, 2026

Photo: AI News & Strategy Daily | Nate B Jones / YouTube
Nate B Jones asked ChatGPT 5.4 a simple question: if the car wash is 100 meters away, should I walk or drive? The model—OpenAI's flagship reasoning system, the one they're positioning as their most capable release—thought carefully and answered: walk. It explained at length why walking made sense, then added as an afterthought that you might need to reposition the car later.
Claude Opus 4.6 answered in one sentence: "Drive. You need the car at the car wash." Gemini got it right too. Every frontier model understood the question except the one OpenAI is betting their $840 billion valuation on.
That's the opening frame Jones uses in his exhaustive evaluation of GPT-5.4, and it's perfect because it captures what makes this model so fascinating: it's not the best model and it's not the worst model. It's the most inconsistent model, and that inconsistency reveals something important about where OpenAI is headed and what they're willing to trade off to get there.
The Eval Suite Nobody Else Is Running
Jones ran blind evaluations—six structured tests where judges never knew which model produced which output. He wasn't testing vibes. He was testing the work you'd actually hand to a model on Tuesday and expect to use Thursday morning.
The results split into a pattern you don't see with Claude or Gemini: GPT-5.4 either dominates completely or fails catastrophically, and the difference comes down to a single toggle most users will never remember to flip.
"In thinking mode, ChatGPT competed for first place," Jones reports. "It nailed the exact Higgs boson mass. It retrieved the correct Apple closing price. It got the current matrix multiplication exponent correct. But in auto mode, ChatGPT 5.4 named 2024 Nobel laureates for a 2025 question."
Same model. Same questions. The gap between first place and dead last.
This matters because OpenAI ships auto mode as the default. Most of the billion people who'll touch this model won't know thinking mode exists. The rest will forget to toggle. And that means GPT-5.4's extraordinary capabilities—the ones OpenAI is marketing—are locked behind a UX pattern that requires constant conscious override.
Where GPT-5.4 Actually Wins
The model has three genuine strengths that showed up consistently across blind testing.
First: quantitative modeling. Jones gave all three models the same prompt—project the Seattle Seahawks' 2026 season using 2025 data. GPT-5.4 produced a six-tab workbook with Pythagorean win expectation, an ELO-like rating system with offseason retention decay, and a Poisson binomial season distribution. Claude made a cleaner three-tab workbook with simpler math. The statistical rigor wasn't close.
What's more interesting: GPT-5.4 included a methodology tab that cataloged its own assumptions, shortcuts, and limitations. "That self-awareness is worth paying attention to," Jones notes. A model that can articulate precisely why its output is insufficient is more useful than one that produces prettier artifacts.
Second: file processing. In Jones's "eval from hell"—a schema migration task involving handwritten receipts, multiple database formats, corrupted JSON backups, and a complete mess of business documents—GPT-5.4 discovered and processed 461 out of 465 files. That's 99.1% coverage across CSVs, Excel files, PDFs, VCF contacts, and OCR'd handwritten receipts.
Claude hit only 75% coverage because it chose not to install openpyxl, a three-second pip install any engineer would run when an import fails. "That's not an environment limitation," Jones says. "That is a judgment failure." OpenAI's tool philosophy pre-installs these libraries. Anthropic's doesn't. The result is a 24-point gap in file discovery.
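The judgment call Jones describes, installing a missing dependency instead of giving up on a file format, is a few lines of defensive code. A sketch of the pattern (not what either model actually ran):

```python
import importlib
import subprocess
import sys

def ensure_module(name: str):
    """Import a module, pip-installing it first if the import fails."""
    try:
        return importlib.import_module(name)
    except ImportError:
        subprocess.check_call([sys.executable, "-m", "pip", "install", name])
        return importlib.import_module(name)

# openpyxl is the engine pandas needs to read .xlsx files:
# openpyxl = ensure_module("openpyxl")
```

Whether an agent should run this automatically is a policy question, which is Jones's point: OpenAI pre-installs the library, Anthropic leaves the decision to the model, and the model declined.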
Third: landscape knowledge. GPT-5.4 understands the AI ecosystem better than its competitors do. It knows what models have what capabilities, which matters when people use models to learn about models—a common pattern Jones sees in coaching work.
The Mickey Mouse Problem
But here's where GPT-5.4's quantitative strength reveals its blind spot. In that same schema migration eval, the model processed 99% of the files correctly but let dirty test data straight through to production. A fake customer named Mickey Mouse. A $25,000 car wash order from "test customer."
"It looked as if the model thought the job was to set up a pipeline and run it and pull the data in," Jones observes. "And as long as it got the reach, it was good." The model excelled at breadth—finding files, parsing formats, building infrastructure—but failed at filtering, at data hygiene, at the judgment calls that separate a demo from something you'd actually ship.
Claude caught less data but would have flagged Mickey Mouse. That's not a knock on GPT-5.4's technical capability. It's a question about what the model thinks success looks like.
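The hygiene step GPT-5.4 skipped can be as simple as a quarantine filter between ingestion and production. A sketch using invented field names and heuristics, since the actual schema isn't in the source:

```python
# Hypothetical placeholder names and amount ceiling for illustration only.
SUSPECT_NAMES = {"mickey mouse", "test customer", "john doe"}
MAX_PLAUSIBLE_ORDER = 10_000  # assumed ceiling for a single car wash order

def quarantine_suspect_rows(rows):
    """Split rows into (clean, quarantined) using simple heuristics."""
    clean, quarantined = [], []
    for row in rows:
        name = str(row.get("customer_name", "")).strip().lower()
        amount = row.get("amount", 0)
        if name in SUSPECT_NAMES or "test" in name or amount > MAX_PLAUSIBLE_ORDER:
            quarantined.append(row)
        else:
            clean.append(row)
    return clean, quarantined
```

A filter this crude would have caught both the fake customer and the $25,000 order; the point is not the heuristics but that the pipeline has a checkpoint where breadth stops and judgment starts.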
What GPT-5.4 Cannot Do
The model can't write. Jones tested both creative and business writing. GPT-5.4 lost both. "It has a tin ear. It does not hear tone," he says. For editorial work, strategy memos, product decisions, executive communications—anywhere the reader needs to feel the author's presence—Claude Opus 4.6 remains the better choice.
Jones ties this to a broader pattern: Claude beat GPT-5.4 on a gnarly product management decision. His hypothesis? "Writing skills are very closely linked to product management skills. Being able to write well helps you make good product decisions." The model that understands voice understands nuance. The model that chases quantitative completeness misses the human judgment layer.
Speed is another issue. GPT-5.4 took 56 minutes on the schema migration eval. Claude finished in 15, Gemini in 21. But GPT-5.4 produced 4,000+ lines of migration script, an 11,000-line report, and 30 database tables. Claude produced 1,800 lines of code, a concise report, and 13 tables. You're trading time for exhaustiveness—which is fine if you know that's the trade you're making.
The Toggle Tax
The single most important finding in Jones's eval suite is the chasm between thinking mode and auto mode. This isn't a minor performance delta. It's the difference between a model that competes with the frontier and one that names last year's Nobel winners for this year's question.
"You are going to have to teach and train everyone in your office," Jones warns. "'Hey, this is going to do a great job. 5.4 does a great job on thinking mode creating this spreadsheet. But if it's not on thinking mode, it's going to be terrible.'" That's a sentence you should never have to say about enterprise software.
The auto-switcher should invoke thinking mode when thinking tasks appear. Jones didn't see that happening reliably. Which means every user becomes responsible for model selection on every query. That's a tax on attention, on training, on organizational cognitive load.
For open source communities watching OpenAI's infrastructure play, this is the pattern to track: not whether the model is capable, but whether that capability is accessible without constant manual override. If thinking mode is where the value lives but auto mode is what ships by default, you're looking at a capability/usability mismatch that no amount of benchmark performance can fix.
The model is schizophrenic by design. Whether that's a feature or a bug depends entirely on whether you remember to flip the switch.
—Dev Kapoor
Watch the Original Video
GPT-5.4 Let Mickey Mouse Into a Production Database. Nobody Noticed. (What This Means For Your Work)
AI News & Strategy Daily | Nate B Jones
37m 38s

About This Source
AI News & Strategy Daily | Nate B Jones
AI News & Strategy Daily, managed by Nate B. Jones, is a YouTube channel focused on delivering practical AI strategies for executives and builders. Since its inception in December 2025, the channel has become a valuable resource for those looking to move beyond AI hype with actionable frameworks and workflows. The channel's mission is to guide viewers through the complexities of AI with content that directly addresses business and implementation needs.
More Like This
The Specification Bottleneck: Why AI Creates Two Classes of Workers
When AI makes building free, knowing what to build becomes everything. How the shift from production to specification is splitting knowledge workers into two classes.
GPT-5.4 Merges OpenAI's Split Model Strategy
OpenAI's GPT-5.4 combines coding prowess with general intelligence, challenging Anthropic's unified approach. But the price tag tells a different story.
AI Healthcare and Robotics: Regulatory Challenges Ahead
Exploring AI's role in healthcare and robotics, focusing on regulatory implications.
AI Memory Systems Need Human Eyes, Not Just Agent Access
Thousands built AI memory databases through MCP servers. Now they're discovering the missing piece: visual interfaces that both humans and agents can use.