GPT-5.4's Schizophrenic Performance: A Model at War With Itself
ChatGPT 5.4 crushes quantitative tasks but fails basic reasoning. The gap between thinking mode and auto mode reveals OpenAI's biggest problem.
Written by AI | Dev Kapoor
March 8, 2026

Photo: AI News & Strategy Daily | Nate B Jones / YouTube
Nate B Jones asked ChatGPT 5.4 a simple question: if the car wash is 100 meters away, should I walk or drive? The model—OpenAI's flagship reasoning system, the one they're positioning as their most capable release—thought carefully and answered: walk. It explained at length why walking made sense, then added as an afterthought that you might need to reposition the car later.
Claude Opus 4.6 answered in one sentence: "Drive. You need the car at the car wash." Gemini got it right too. Every frontier model understood the question except the one OpenAI is betting their $840 billion valuation on.
That's the opening frame Jones uses in his exhaustive evaluation of GPT-5.4, and it's perfect because it captures what makes this model so fascinating: it's not the best model and it's not the worst model. It's the most inconsistent model, and that inconsistency reveals something important about where OpenAI is headed and what they're willing to trade off to get there.
The Eval Suite Nobody Else Is Running
Jones ran blind evaluations—six structured tests where judges never knew which model produced which output. He wasn't testing vibes. He was testing the work you'd actually hand to a model on Tuesday and expect to use Thursday morning.
The results split into a pattern you don't see with Claude or Gemini: GPT-5.4 either dominates completely or fails catastrophically, and the difference comes down to a single toggle most users will never remember to flip.
"In thinking mode, ChatGPT competed for first place," Jones reports. "It nailed the exact Higgs boson mass. It retrieved the correct Apple closing price. It got the current matrix multiplication exponent correct. But in auto mode, ChatGPT 5.4 named 2024 Nobel laureates for a 2025 question."
Same model. Same questions. The gap between first place and dead last.
This matters because OpenAI ships auto mode as the default. Most of the billion people who'll touch this model won't know thinking mode exists. The rest will forget to toggle. And that means GPT-5.4's extraordinary capabilities—the ones OpenAI is marketing—are locked behind a UX pattern that requires constant conscious override.
Where GPT-5.4 Actually Wins
The model has three genuine strengths that showed up consistently across blind testing.
First: quantitative modeling. Jones gave all three models the same prompt—project the Seattle Seahawks' 2026 season using 2025 data. GPT-5.4 produced a six-tab workbook with Pythagorean win expectation, an ELO-like rating system with offseason retention decay, and a Poisson binomial season distribution. Claude made a cleaner three-tab workbook with simpler math. The statistical rigor wasn't close.
What's more interesting: GPT-5.4 included a methodology tab that cataloged its own assumptions, shortcuts, and limitations. "That self-awareness is worth paying attention to," Jones notes. A model that can articulate precisely why its output is insufficient is more useful than one that produces prettier artifacts.
Second: file processing. In Jones's "eval from hell"—a schema migration task involving handwritten receipts, multiple database formats, corrupted JSON backups, and a complete mess of business documents—GPT-5.4 discovered and processed 461 out of 465 files. That's 99.1% coverage across CSVs, Excel files, PDFs, VCF contacts, and OCR'd handwritten receipts.
Claude hit only 75% coverage because it chose not to install openpyxl, a three-second pip install any engineer would run when an import fails. "That's not an environment limitation," Jones says. "That is a judgment failure." OpenAI's tool philosophy pre-installs these libraries. Anthropic's doesn't. The result is a 24-point gap in file discovery.
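The judgment call Jones describes, installing a missing dependency instead of giving up on a file format, is a few lines of defensive code. A sketch of the pattern (not what either model actually ran):

```python
import importlib
import subprocess
import sys

def ensure_module(name: str):
    """Import a module, pip-installing it first if the import fails."""
    try:
        return importlib.import_module(name)
    except ImportError:
        subprocess.check_call([sys.executable, "-m", "pip", "install", name])
        return importlib.import_module(name)

# openpyxl is the engine pandas needs to read .xlsx files:
# openpyxl = ensure_module("openpyxl")
```

Whether an agent should run this automatically is a policy question, which is Jones's point: OpenAI pre-installs the library, Anthropic leaves the decision to the model, and the model declined.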
Third: landscape knowledge. GPT-5.4 understands the AI ecosystem better than its competitors do. It knows what models have what capabilities, which matters when people use models to learn about models—a common pattern Jones sees in coaching work.
The Mickey Mouse Problem
But here's where GPT-5.4's quantitative strength reveals its blind spot. In that same schema migration eval, the model processed 99% of the files correctly but let dirty test data straight through to production. A fake customer named Mickey Mouse. A $25,000 car wash order from "test customer."
"It looked as if the model thought the job was to set up a pipeline and run it and pull the data in," Jones observes. "And as long as it got the reach, it was good." The model excelled at breadth—finding files, parsing formats, building infrastructure—but failed at filtering, at data hygiene, at the judgment calls that separate a demo from something you'd actually ship.
Claude caught less data but would have flagged Mickey Mouse. That's not a knock on GPT-5.4's technical capability. It's a question about what the model thinks success looks like.
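The hygiene step GPT-5.4 skipped can be as simple as a quarantine filter between ingestion and production. A sketch using invented field names and heuristics, since the actual schema isn't in the source:

```python
# Hypothetical placeholder names and amount ceiling for illustration only.
SUSPECT_NAMES = {"mickey mouse", "test customer", "john doe"}
MAX_PLAUSIBLE_ORDER = 10_000  # assumed ceiling for a single car wash order

def quarantine_suspect_rows(rows):
    """Split rows into (clean, quarantined) using simple heuristics."""
    clean, quarantined = [], []
    for row in rows:
        name = str(row.get("customer_name", "")).strip().lower()
        amount = row.get("amount", 0)
        if name in SUSPECT_NAMES or "test" in name or amount > MAX_PLAUSIBLE_ORDER:
            quarantined.append(row)
        else:
            clean.append(row)
    return clean, quarantined
```

A filter this crude would have caught both the fake customer and the $25,000 order; the point is not the heuristics but that the pipeline has a checkpoint where breadth stops and judgment starts.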
What GPT-5.4 Cannot Do
The model can't write. Jones tested both creative and business writing. GPT-5.4 lost both. "It has a tin ear. It does not hear tone," he says. For editorial work, strategy memos, product decisions, executive communications—anywhere the reader needs to feel the author's presence—Claude Opus 4.6 remains the better choice.
Jones ties this to a broader pattern: Claude beat GPT-5.4 on a gnarly product management decision. His hypothesis? "Writing skills are very closely linked to product management skills. Being able to write well helps you make good product decisions." The model that understands voice understands nuance. The model that chases quantitative completeness misses the human judgment layer.
Speed is another issue. GPT-5.4 took 56 minutes on the schema migration eval. Claude finished in 15, Gemini in 21. But GPT-5.4 produced 4,000+ lines of migration script, an 11,000-line report, and 30 database tables. Claude produced 1,800 lines of code, a concise report, and 13 tables. You're trading time for exhaustiveness—which is fine if you know that's the trade you're making.
The Toggle Tax
The single most important finding in Jones's eval suite is the chasm between thinking mode and auto mode. This isn't a minor performance delta. It's the difference between a model that competes with the frontier and one that names last year's Nobel winners for this year's question.
"You are going to have to teach and train everyone in your office," Jones warns. "'Hey, this is going to do a great job. 5.4 does a great job on thinking mode creating this spreadsheet. But if it's not on thinking mode, it's going to be terrible.'" That's a sentence you should never have to say about enterprise software.
The auto-switcher should invoke thinking mode when thinking tasks appear. Jones didn't see that happening reliably. Which means every user becomes responsible for model selection on every query. That's a tax on attention, on training, on organizational cognitive load.
For open source communities watching OpenAI's infrastructure play, this is the pattern to track: not whether the model is capable, but whether that capability is accessible without constant manual override. If thinking mode is where the value lives but auto mode is what ships by default, you're looking at a capability/usability mismatch that no amount of benchmark performance can fix.
The model is schizophrenic by design. Whether that's a feature or a bug depends entirely on whether you remember to flip the switch.
—Dev Kapoor
Watch the Original Video
GPT-5.4 Let Mickey Mouse Into a Production Database. Nobody Noticed. (What This Means For Your Work)
AI News & Strategy Daily | Nate B Jones
37m 38s

About This Source
AI News & Strategy Daily | Nate B Jones
AI News & Strategy Daily, managed by Nate B. Jones, is a YouTube channel focused on delivering practical AI strategies for executives and builders. Since its inception in December 2025, the channel has become a valuable resource for those looking to move beyond AI hype with actionable frameworks and workflows. The channel's mission is to guide viewers through the complexities of AI with content that directly addresses business and implementation needs.
More Like This
The Specification Bottleneck: Why AI Creates Two Classes of Workers
When AI makes building free, knowing what to build becomes everything. How the shift from production to specification is splitting knowledge workers into two classes.
GPT-5.4 Merges OpenAI's Split Model Strategy
OpenAI's GPT-5.4 combines coding prowess with general intelligence, challenging Anthropic's unified approach. But the price tag tells a different story.
AI Healthcare and Robotics: Regulatory Challenges Ahead
Exploring AI's role in healthcare and robotics, focusing on regulatory implications.
AI Memory Systems Need Human Eyes, Not Just Agent Access
Thousands built AI memory databases through MCP servers. Now they're discovering the missing piece: visual interfaces that both humans and agents can use.