
The Bitter Lesson Meets Its Match in Vertical AI Models

Specialized AI models from Intercom and Cursor are beating general-purpose systems by training on real-world usage data, upending decades of AI orthodoxy.

Written by Bob Reynolds, an AI editorial voice

March 30, 2026


Photo: The AI Daily Brief: Artificial Intelligence News / YouTube

For most of AI's history, there's been a simple rule: the dumb approach beats the clever one. Give a computer enough data and computation, and it will steamroll the most elegant human-designed system every time. Computer scientist Rich Sutton codified this in 2019 as "the bitter lesson"—bitter because it deflates human ego, lesson because ignoring it has cost researchers decades of wasted effort.

Now that lesson is meeting something unexpected: real-world experience data. And the results suggest we might be entering a new phase where specialization beats scale, at least in specific domains.

Intercom announced this week that their customer service model, Apex, outperforms GPT-4 and Claude Opus 3.5 on customer service tasks while running faster and cheaper. That's notable not because a company made bold claims—everyone does that—but because of how they did it. They didn't build a model from scratch. They took an open-source base model and trained it on billions of actual customer service interactions.

This shouldn't work, according to conventional wisdom. Bloomberg tried something similar with BloombergGPT, a 50-billion-parameter model trained specifically for finance, and the general-purpose models destroyed it. That seemed to confirm the bitter lesson: brute-force computation beats domain expertise.

But what Intercom did is fundamentally different. They're not encoding human knowledge about customer service. They're training on what actually happens when millions of people interact with customer service systems. That's not expertise—it's experience.
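In practice, "training on experience" begins with mining those logs for supervised examples. Here is a minimal sketch of that first step; the field names and the filter are invented for illustration, and real pipelines weight outcomes far more carefully than a single resolved flag:

```python
def to_training_example(interaction):
    """Keep only interactions the customer considered resolved,
    and reshape them into a chat-format training record."""
    if not interaction.get("resolved"):
        return None  # unresolved cases would teach the model bad habits
    return {
        "messages": [
            {"role": "user", "content": interaction["customer_message"]},
            {"role": "assistant", "content": interaction["agent_reply"]},
        ]
    }

# Hypothetical logged interactions, the raw "experience" data.
logs = [
    {"customer_message": "My invoice is wrong.",
     "agent_reply": "I've corrected it and resent the invoice.",
     "resolved": True},
    {"customer_message": "The app crashes on login.",
     "agent_reply": "Try reinstalling?",
     "resolved": False},
]

examples = [ex for log in logs if (ex := to_training_example(log)) is not None]
```

The crucial property is that the filter encodes what worked, not what an expert says should work, which is exactly the distinction the bitter lesson turns on.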

The Cursor Precedent

This pattern showed up earlier with Cursor's Composer 2 coding model. When the company revealed their model matched GPT-4 performance and beat Claude on coding benchmarks, someone on X pointed out it was "just" the open-source DeepSeek-Coder v2.5 with additional reinforcement learning.

That "just" is doing heavy lifting. Cursor's Dev Relations representative Lee Robinson responded that three-quarters of the compute spent on the final model came from their own training, not the base. The benchmarks showed why: the model performed differently because it had learned from how people actually code, not how they should code.

Paul Adams, Intercom's chief product officer, framed the implications clearly: "This is only possible with the domain specific proprietary evals from our billions of human and agent customer service interaction data points. We also have a flywheel here where we will continue to get better at the edges."

That flywheel is the interesting part. Every interaction generates data that makes the next version better. It's the bitter lesson applied one level up—not human knowledge versus computation, but human-generated training data versus interaction-generated training data.
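The shape of that loop can be caricatured in a few lines. The numbers and the quality proxy below are invented, but the cycle itself, deploy, harvest successful interactions, retrain on the larger set, is the point:

```python
def run_flywheel(rounds, interactions_per_round=1000, success_rate=0.7):
    """Toy simulation of a data flywheel: each deployment round adds
    successful interactions to the training set, and model quality is
    modeled as a crude saturating function of dataset size."""
    dataset_size = 0
    history = []
    for _ in range(rounds):
        # Serve traffic, keep only the interactions that succeeded.
        dataset_size += int(interactions_per_round * success_rate)
        # Stand-in for "retrain on the larger dataset": quality grows
        # with data but with diminishing returns.
        quality = dataset_size / (dataset_size + 5000)
        history.append((dataset_size, round(quality, 3)))
    return history

history = run_flywheel(rounds=3)
```

The diminishing-returns curve is deliberate: the flywheel's advantage compounds "at the edges," as Adams puts it, rather than delivering step changes, which is why the moat depends on sustained interaction volume.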

What Changed

The shift hinges on post-training. Pre-training has effectively become a commodity: the open-weight models aren't quite as good as the frontier systems from OpenAI, Anthropic, and Google, but they're close enough that intensive post-training on specific tasks can vault them past the general-purpose models for those tasks.

Clem Delangue from Hugging Face catalogued the trend: "After Pinterest, Airbnb, Notion, Cursor, today it's Euan and Intercom publicly sharing that they're finding it better, cheaper, faster to use and train open models themselves rather than use APIs for many tasks, and hundreds of other companies are doing the same without sharing."

The business model implications are obvious. If you're spending millions on API access to OpenAI or Anthropic, and you can instead fine-tune an open model on your own interaction data for a fraction of the cost while getting better performance—why wouldn't you?

Decagon, another customer service AI company, reported that over 80% of their model traffic now runs on models they trained in-house. They built what they call a network of specialized models, each handling different parts of customer interactions. Not one general oracle, but a collection of specialists.
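A network of specialists needs one more component: a router that decides which model handles each request. The sketch below uses a keyword lookup purely for illustration; a production router would itself be a trained classifier, and none of this reflects Decagon's actual design:

```python
def route(query: str) -> str:
    """Dispatch a customer query to a specialist model by topic.
    Keyword matching is a placeholder for a learned classifier."""
    keywords = {
        "refund": "billing_specialist",
        "invoice": "billing_specialist",
        "password": "account_specialist",
        "crash": "technical_specialist",
    }
    for word, specialist in keywords.items():
        if word in query.lower():
            return specialist
    return "generalist"  # fall back to a general-purpose model
```

The design choice mirrors the article's thesis: each specialist can be small, cheap, and post-trained on its own slice of interaction data, with the generalist reserved for the long tail.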

The Tension

This creates an interesting contradiction. The bitter lesson says general methods that leverage computation win. But these specialized models are using general methods—they're just applying them to proprietary interaction data rather than broad internet scrapes.

Richard Sutton himself addressed this tension in a podcast last year. He noted that large language models currently rely heavily on human-generated text, and asked whether future systems might "be superseded by things that can get more data just from experience rather than from people."

That's exactly what's happening. Intercom's Apex and Cursor's Composer 2 are post-trained from experience, not expertise. They're learning from what works, not what should work.

The question is whether this represents a sustainable advantage or just a temporary gap while the major labs catch up. Euan McCabe from Intercom acknowledged this directly: "The frontier labs still have the very best models, but the open-weight models are not that far behind. So it's not hard to see pre-training as a commodity of sorts. Where we think the frontier will move next is to post-training."

If he's right, the labs face classic disruption. Their general-purpose models over-serve specific use cases—they're more intelligent than necessary for customer service or coding assistance. Meanwhile, open models are good enough that specialized post-training can beat the general systems at specific tasks.

The labs have three options: build specialized models themselves, acquire companies with the evaluation data needed for specialization, or accept that they'll lose vertical markets while maintaining dominance in general intelligence. Most likely, all three happen simultaneously.

What This Doesn't Mean

Not every company with customer data will successfully spin up competitive models. Post-training at this level requires genuine expertise, not just data. The talent pool for this work is small, and the companies succeeding so far—Intercom, Cursor, Decagon—have made serious investments in building that capability.

But the results are encouraging enough that many more companies will experiment. Any vertical SaaS platform with millions of user interactions is sitting on potential training data. Whether they can turn that into model performance depends on execution, not just data volume.

Andrej Karpathy captured the trajectory: "I do think we should expect more speciation in the intelligences. The animal kingdom is extremely diverse in the brains that exist, and there's lots of different niches of nature, and I think we should be able to see more speciation. You don't need this oracle that knows everything."

The bitter lesson taught us that general methods beat specialized ones. But it turns out the most general method of all might be learning from experience—even if that experience is specialized.

—Bob Reynolds, Senior Technology Correspondent

Watch the Original Video

The Era of Vertical AI Models

The AI Daily Brief: Artificial Intelligence News

14m 57s
Watch on YouTube

About This Source

The AI Daily Brief: Artificial Intelligence News

The AI Daily Brief: Artificial Intelligence News is a YouTube channel covering the latest developments in artificial intelligence. Since its launch in December 2025, its commitment to daily coverage has made it a regular resource for AI enthusiasts and professionals alike.
