GPT-5.4 Pro Costs $180 Per Million Tokens—And Beats Google at Its Game
OpenAI's GPT-5.4 Pro outperforms competitors on new benchmarks, but at a steep price. What the latest AI model tells us about the real race.
By Bob Reynolds
March 6, 2026

Photo: TheAIGRID / YouTube
OpenAI released GPT-5.4 Pro this week, and the headline is predictable: it's the best model yet. The more interesting story is what "best" actually means now, and what you'll pay for it.
The model costs $30 per million input tokens and $180 per million output tokens. That's not a typo. For context, Anthropic's Claude Opus 4.6 charges less, and the standard GPT-5.4, without the reasoning features, costs considerably less still. Remember when people in this field couldn't stop talking about "intelligence too cheap to meter"? That phrase has aged poorly.
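To make those rates concrete, here's a minimal sketch of what a single API call costs at $30 per million input tokens and $180 per million output tokens. The token counts are hypothetical, chosen to resemble one long analysis request:

```python
# Cost of one GPT-5.4 Pro call at the quoted per-token rates.
# The token counts below are hypothetical, for illustration only.
INPUT_RATE = 30.0 / 1_000_000    # dollars per input token
OUTPUT_RATE = 180.0 / 1_000_000  # dollars per output token

def call_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single call at GPT-5.4 Pro pricing."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# A long document in, a detailed reasoning-heavy answer out:
cost = call_cost(input_tokens=20_000, output_tokens=5_000)
print(f"${cost:.2f}")  # 20k in + 5k out -> $1.50 per call
```

At these rates the output side dominates: 5,000 generated tokens cost $0.90, half again as much as the 20,000 tokens of input. Run a few thousand such queries a day and the bill is no longer a rounding error.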
Price matters because GPT-5.4 Pro isn't incrementally better—it's better in ways that suggest the race has shifted terrain. Take the BrowseComp benchmark, which tests whether an AI can pull real-time data from across the web accurately. You'd expect Google's Gemini 3.1 Pro to dominate here. Google owns the search engine. But GPT-5.4 Pro edges it out at 89.3%. Not a massive margin, but enough to notice.
That's the pattern across multiple benchmarks: OpenAI winning in areas where competitors were supposed to have home field advantage. The question isn't whether GPT-5.4 Pro is marginally better at tasks we already tested. The question is whether it's crossing thresholds that matter.
When Benchmarks Actually Mean Something
Most AI benchmarks have saturated. Models score so high that incremental improvements tell you almost nothing. The benchmarks that matter now are the ones designed to resist gaming—tests built by professionals to simulate actual work.
Frontier Math is one. Professional mathematicians designed these problems to require genuine research-level thinking, not pattern matching from training data. When it launched, top models scored around 2%. GPT-5.4 Pro recently scored 38% on the hardest tier.
That percentage matters less than what happened during testing. A mathematician named Bartos gave GPT-5.4 Pro a problem he'd been working on for twenty years. The model solved it. He called it his "personal move 37"—a reference to AlphaGo's famous 2016 move that no human Go player would have considered, but which turned out to be brilliant.
"It finally happened," Bartos wrote. "The solution is very nice, clean, and feels almost human."
This wasn't a benchmark problem you could scrape from the internet. This was novel work. That's a different category of capability.
The Professional Services Test
The Apex benchmark, built by a company called Merkco, measures something more directly threatening to white-collar employment: how well AI handles the actual daily work of investment bankers, consultants, and lawyers. Real professionals spent five to ten days building 480 realistic scenarios across 33 simulated work environments. Models get one shot to complete each task—no asking for clarification, no iteration.
When Merkco launched this benchmark in January, the best score was 24%. GPT-5.4 Pro hit 52% roughly six weeks later. That's not incremental progress. That's more than doubling performance on tasks explicitly designed to represent professional work.
OpenAI's internal GDP-val benchmark tells a similar story. It tests AI against human knowledge workers across 44 occupations in the nine industries that contribute most to US GDP. GPT-5.4 matches or beats human professionals 83% of the time on single-shot tasks. The models complete these tasks roughly 100 times faster and 100 times cheaper than human experts.
The caveat: these are well-defined, one-shot tasks. Real professional work involves iteration, context-building, ambiguity. But the trajectory is clear. These aren't high school homework problems anymore.
The Price-Performance Equation
Here's where it gets interesting. GPT-5.4 Pro costs significantly more than competitors, but the standard GPT-5.4 model—without the intensive reasoning features—beats Claude Opus 4.6 on many tasks while costing less. That creates a market dynamic we haven't seen before.
Which model wins won't necessarily be the smartest one. It might be whichever offers the best price-to-performance ratio for specific use cases. If you're running thousands of queries for financial analysis, cost per task matters more than theoretical capability on research mathematics.
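One way to frame that trade-off is cost per successful task rather than cost per token. A minimal sketch, with entirely illustrative per-query costs and success rates (the 52% echoes the Apex score above; the cheaper model's figures are made up), assuming failed queries are simply retried:

```python
# Cost per *successful* task: the price/accuracy trade-off.
# All numbers here are illustrative assumptions, not published figures.

def cost_per_success(cost_per_query: float, success_rate: float) -> float:
    """Expected spend to obtain one successful completion,
    assuming failed queries are retried independently."""
    return cost_per_query / success_rate

# A premium reasoning model vs. a cheaper, less accurate one:
pro = cost_per_success(cost_per_query=1.50, success_rate=0.52)
standard = cost_per_success(cost_per_query=0.15, success_rate=0.40)
print(f"Pro: ${pro:.2f}/success, standard: ${standard:.2f}/success")
```

Under these made-up numbers the cheaper model wins on cost per success despite a lower hit rate, which is exactly the dynamic the high-volume financial-analysis case turns on. The frame breaks down, of course, for tasks where a failure is expensive to detect or impossible to retry.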
OpenAI explicitly optimized GPT-5.4 for finance workflows. They're working with industry practitioners to target tasks that "often take analysts days or hours to complete." That's not general artificial intelligence. That's vertical integration into high-value professional services.
What This Actually Tells Us
I've covered technology long enough to know that "smartest AI in the world" is a marketing claim with a short shelf life. What matters is the pace of improvement on tasks that weren't possible six months ago.
GPT-5.4 Pro performing computer tasks in real-time—navigating desktop environments, reading invoices, filling forms—represents native computer use capabilities that previous general-purpose models couldn't match. The OSWorld benchmark measures this specifically, and GPT-5.4 hit 75%, exceeding its predecessor by a significant margin.
The model also apparently fixed OpenAI's creative writing problem. Earlier versions optimized so heavily for math and coding that they became, in Sam Altman's own words on stage, giant calculators you couldn't actually talk to. GPT-5.4 ranks second in creative writing on user votes, though with only 390 votes recorded so far.
That suggests OpenAI learned something about trade-offs. You can't just maximize for benchmark scores. The model needs to remain usable for humans who aren't feeding it Frontier Math problems.
The trajectory here isn't subtle. When professional-services benchmarks double in six weeks, when problems unsolved for twenty years get cracked, when costs rise while capabilities expand faster—that's not hype. That's data. The question is what happens when these models cross from "impressive in testing" to "deployed at scale in actual professional environments."
We'll find out soon enough. The models don't need to be perfect. They just need to be good enough, fast enough, and cheap enough to justify the workflow changes. On at least two of those three metrics, we're already there.
—Bob Reynolds
Watch the Original Video
OpenAI’s New GPT-5.4 Pro Is Now The Smartest AI In The World.
TheAIGRID
23m 50s

About This Source
TheAIGRID
TheAIGRID is a growing YouTube channel covering artificial intelligence, with a focus on recent research, practical applications, and ethical questions. Launched in December 2025, its subscriber count is not publicly listed, but its coverage has drawn a dedicated audience.
More Like This
Google's Gemini 3.1 Pro: When Benchmark Wins Stop Mattering
Gemini 3.1 Pro tops AI benchmarks, but the real story is cost efficiency and multimodal capabilities—not another 'world's most powerful model' claim.
OpenAI Researcher Quits Over Ads: A Pattern Emerges
Zoe Hitzig's resignation from OpenAI reveals deeper tensions about AI monetization, user trust, and the company's evolving priorities.
OpenAI's AI Pen: Innovation or Another Hype Cycle?
Exploring OpenAI's AI pen, its innovation potential, market challenges and privacy implications.
OpenAI's $8 Gamble: ChatGPT's New Strategy
OpenAI introduces $8 tier & ads in ChatGPT amid competition. Explore strategic shifts & broader tech trends.