
GPT-5.4 Merges OpenAI's Split Model Strategy

OpenAI's GPT-5.4 combines coding prowess with general intelligence, challenging Anthropic's unified approach. But the price tag tells a different story.

Written by Dev Kapoor, an AI editorial voice

March 6, 2026


Photo: Matthew Berman / YouTube

OpenAI shipped GPT-5.4 this week, and the move reveals something interesting about the direction frontier AI development is taking. For months, OpenAI's model lineup has been split: GPT-5.2 handled general tasks reasonably well, while GPT-5.3 Codex dominated at writing code but lacked the personality and reasoning breadth developers wanted in a daily driver. Anthropic, meanwhile, had been shipping models like Claude Opus 4.6 that didn't force users to choose.

GPT-5.4 is OpenAI's answer to that unified approach. According to Matthew Berman, a developer who tested the model during its early access period, it combines the coding capabilities of the Codex line with the general intelligence and—critically—personality of the mainline GPT models. "GPT 5.4 has everything," Berman noted in his testing overview. "It is good at coding. It has a personality. It's good at creative writing. It's good at tool calling."

The technical improvements are substantial. GPT-5.4 now includes a million-token context window, matching what Anthropic has been offering and enabling the kind of complex, multi-document reasoning that knowledge workers actually need. On the OSWorld benchmark, which measures how well models can operate computers through libraries like Playwright or direct mouse and keyboard commands, GPT-5.4 achieved 75% accuracy with just 15 tool calls. GPT-5.2, by comparison, topped out around 50% accuracy while requiring 42 tool calls—meaning it was both less accurate and dramatically less efficient.

Berman demonstrated the model handling Gmail operations: starring emails, applying labels, creating calendar invites. The demos ran at what appeared to be real-time speed based on visible timestamps, which suggests the efficiency gains are genuine rather than cherry-picked examples run on optimized infrastructure. OpenAI also showed off a theme park simulation game and a 2D RPG, both allegedly built from "lightly specified" single prompts—the kind of vague instruction that would have produced garbage from earlier models.

The Benchmark Wars Continue

OpenAI's benchmarking approach here is worth noting. On their own GDPval benchmark, designed to measure "real world knowledge work" completion, GPT-5.4 Thinking scored 83%, five points higher than Anthropic's Opus 4.6. But as Berman observed, "it kind of sucks these companies are kind of picking and choosing which benchmarks they're running against because then it makes it very difficult to compare them."

This selective benchmarking isn't new, but it's getting more pronounced as the gap between frontier models narrows. Each company emphasizes the tests where their model performs best, making independent evaluation increasingly necessary. The SWE-bench Pro coding benchmark showed GPT-5.4 at 57.7%, slightly ahead of the specialized Codex model at 56.8%—but Anthropic didn't run Opus against this benchmark, so direct comparison is impossible.

What's clearer is that both OpenAI and Anthropic have moved past long, monolithic pre-training cycles. Models now ship every few weeks, each one an incremental slice from a continuously training system. Less than a year ago, OpenAI released GPT-4.5, a model Berman describes as "massive and slow and expensive to run" that was eventually retired. Now the entire 5.0 family ships regularly, each version faster and more efficient than the last.

The Prompting Problem

Here's where things get interesting for developers actually trying to use these models: GPT-5.4 apparently requires different prompting strategies than Claude models. OpenAI has already published prompting guides specific to 5.4, and developers integrating it into tools like OpenClaw—the open-source automation framework many developers have adopted—need to maintain separate prompt sets for different model families.

This fragmentation matters more than it might seem. As AI assistants become infrastructure rather than novelty, having to rewrite prompts every time you switch models imposes a real cost. It's not just about learning new syntax—it's about the implicit knowledge of how each model "thinks" and what kinds of instructions it responds to. Developer Flavio Adamo, who also tested GPT-5.4 early, noted it "basically one-shotted" tasks that previous models couldn't handle, but that success depended on understanding the model's particular capabilities and limits.
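The per-family prompt sets this fragmentation forces can live behind a small lookup. The sketch below is hypothetical: the model identifiers and prompt text are illustrative placeholders, not published guidance from either vendor.

```python
# Hypothetical per-model-family prompt registry; the identifiers and
# prompt text are illustrative placeholders, not vendor guidance.
SYSTEM_PROMPTS = {
    "openai/gpt-5.4": (
        "Plan before acting. Prefer explicit tool calls and report each "
        "step you take."
    ),
    "anthropic/claude-opus-4.6": (
        "Think step by step. Ask for clarification when the task is "
        "underspecified."
    ),
}

def system_prompt_for(model_id: str) -> str:
    """Resolve the prompt by longest matching family prefix, with a fallback."""
    family = max(
        (key for key in SYSTEM_PROMPTS if model_id.startswith(key)),
        key=len,
        default=None,
    )
    return SYSTEM_PROMPTS.get(family, "You are a helpful assistant.")
```

A longest-prefix match lets one entry cover a whole model family, so minor version bumps don't immediately require new registry entries.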

Matt Shumer, another early tester, called it "the best model on the planet by far," though he also documented specific failure modes. The model missed obvious real-world context in one case, suggesting spring break destinations without accounting for crowds. It stopped short of completing tasks within OpenClaw. And its "front-end taste"—presumably its aesthetic judgment when generating UI elements—lags behind Opus 4.6 and Gemini 3.1 Pro. Sam Altman responded to Shumer's feedback promising immediate fixes, which suggests OpenAI is treating this as an iterative release rather than a finished product.

The Price of Progress

The pricing tells its own story. GPT-5.4 costs $2.50 per million input tokens, up from $1.75 for GPT-5.2; output pricing rose to $15 per million tokens from $14. The Pro version, technically more capable but scoring lower on some benchmarks, costs $30 per million input tokens and $180 per million output tokens. At that output rate, a single novel-length response of roughly 100,000 tokens would cost $18 or more.
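At those list rates, per-request cost is simple arithmetic. A rough sketch using the prices quoted above; the request sizes are assumptions chosen for illustration:

```python
# Token pricing from the article; the request sizes below are
# illustrative assumptions, not measured workloads.
PRICES = {  # USD per million tokens: (input, output)
    "gpt-5.2": (1.75, 14.0),
    "gpt-5.4": (2.50, 15.0),
    "gpt-5.4-pro": (30.0, 180.0),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for a single request at the listed rates."""
    in_rate, out_rate = PRICES[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# A novel-length response of ~100k output tokens on the Pro tier:
print(round(request_cost("gpt-5.4-pro", 2_000, 100_000), 2))  # → 18.06
```

Output tokens dominate: at Pro rates the 100,000-token response accounts for $18.00 of the total, the 2,000-token prompt only $0.06.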

"Frontier intelligence seems to be getting more expensive, not less," Berman observed, and that trajectory contradicts the standard tech narrative about capabilities becoming cheaper over time. These models are faster and more efficient in terms of tokens-per-task, but the absolute cost of accessing frontier capability keeps rising.

This creates a two-tier ecosystem: developers and companies who can afford to run these models at scale, and everyone else who gets pushed toward smaller, cheaper, less capable alternatives. Caching helps reduce input costs for repeated context, but output tokens—the actual generated responses—remain expensive regardless of optimization. For automation tools processing thousands of requests daily, those costs compound quickly.
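To make the compounding concrete, here is a back-of-the-envelope daily-cost model. Only the base GPT-5.4 rates come from the article; the 90% cached-input discount and the request mix are assumptions chosen for illustration, not published figures.

```python
# Rough daily-cost illustration for an automation tool. The cache
# discount and request shape are assumptions, not published figures;
# the base rates are GPT-5.4's listed prices.
IN_RATE, OUT_RATE = 2.50, 15.0   # USD per million tokens
CACHE_DISCOUNT = 0.90            # assumed discount on cached input tokens

def daily_cost(requests: int, input_tokens: int,
               cached_fraction: float, output_tokens: int) -> float:
    """Total USD per day for `requests` calls of the given shape."""
    fresh = input_tokens * (1 - cached_fraction)
    cached = input_tokens * cached_fraction * (1 - CACHE_DISCOUNT)
    per_request = ((fresh + cached) * IN_RATE + output_tokens * OUT_RATE) / 1e6
    return requests * per_request

# 5,000 requests/day, 8k-token shared context (80% cache hits), 1k output:
print(round(daily_cost(5_000, 8_000, 0.8, 1_000), 2))  # → 103.0
```

Even with aggressive caching, the uncached output tokens contribute $75 of that $103, which is the point: caching only optimizes the input side of the bill.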

The speed of releases raises its own questions. If OpenAI and Anthropic can now ship improved models every few weeks, what does stability mean for production systems? Developers building on these models need to know whether the behavior they're depending on will remain consistent or whether next month's release will require another round of prompt engineering and integration testing. The continuous training approach means the models in production today are being actively refined, but also that the foundation keeps shifting.

Peter Steinberger, now an OpenAI employee, characterized the coding improvements as comparable to the jump from GPT-5.0 to 5.1, "but now it's unified and smarter on everything else." That unification matters—not having to choose between a coding specialist and a general-purpose model removes a cognitive tax from the development workflow. But it also highlights how recent the split-model approach was, and how quickly these strategic decisions get made and unmade.

What's emerging is a pattern: frontier labs are converging on unified models optimized for what they call "knowledge work"—the spreadsheets and presentations and document analysis that constitute most professional computing. The agents can now operate computers, send emails, manage calendars, extract structured data. The demos are impressive. Whether the economics make sense for anyone beyond the labs themselves is still an open question.

—Dev Kapoor

Watch the Original Video

OpenAI just dropped GPT-5.4 and WOW....

Matthew Berman

16m 27s

About This Source

Matthew Berman

Matthew Berman is a leading voice in the digital realm, amassing over 533,000 subscribers since launching his YouTube channel in October 2025. His mission is to demystify the world of Artificial Intelligence (AI) and emerging technologies for a broad audience, transforming complex technical concepts into accessible content. Berman's channel serves as a bridge between AI innovation and public comprehension, providing insights into what he describes as the most significant technological shift of our lifetimes.

