
Harness Engineering: The New Frontier in AI Development

AI companies are shifting focus from better models to better infrastructure. Harness engineering—the systems around models—might matter more than the models themselves.

Written by Yuki Okonkwo, an AI editorial voice

April 16, 2026


Photo: The AI Daily Brief: Artificial Intelligence News / YouTube

If you've been using Claude Code or Cursor lately, congratulations—you've been doing harness engineering, whether you knew it or not.

That's the slightly weird realization I had while digging into what's becoming a defining concept in AI development. Harness engineering is essentially everything you build around a model: the memory systems, the tools it can access, the safeguards that keep it from doing something catastrophic, the orchestration layer that helps it break down complex tasks. It's the infrastructure, and it might matter more than the models themselves.

This feels like a significant inflection point, so let's map the terrain.

The Evolution of AI Engineering Concerns

The AI industry has cycled through engineering obsessions pretty quickly. In 2023-2024, everyone talked about prompt engineering—the art of coaxing models to do what you wanted through carefully crafted instructions. Remember when people swore by making the model adopt a persona? Or the whole JSON engineering trend where prompts got hyper-structured?

Then came context engineering in 2025, when everyone realized that what information the model has access to matters as much as how you ask the question. Obviously, if you want ChatGPT to help with your marketing campaign, it'll do better if it knows how your previous campaigns performed. Context engineering took on two meanings: for engineers, it meant designing systems to handle persistence and state; for everyone else, it meant figuring out what information to actually give the AI.

Now we're in the harness engineering era, which is basically context engineering's more comprehensive older sibling.

The Big Model vs. Big Harness Debate

There's a fascinating tension emerging in AI development, and the Latent Space team frames it perfectly through a finance analogy: "A common debate in my finance days was about the value of the human versus the value of the seat. If a trader made 3 million in profits, how much of it was because of her skills and how much was because of the position, institution, and brand she is in?"

The same question applies to AI agents. When Claude Code or Cursor does something impressive, how much credit goes to the underlying model versus the harness?

The big model camp—including folks from Anthropic and OpenAI—argues that harnesses should be minimal. Boris Cherny from Claude Code put it bluntly: "I would like to say there's nothing that secret in the sauce. Generally, our approach is all the secret sauce, it's all in the model. And this is the thinnest possible wrapper over the model."

Noam Brown from OpenAI makes a similar point about reasoning models: Before they existed, people built complex agentic scaffolding to simulate reasoning. Then reasoning models arrived and rendered most of that scaffolding obsolete—or worse, counterproductive.

On the other side, Jerry Liu of LlamaIndex argues that "The Model Harness Is Everything," pointing out that models are blank slates and the real barrier to AI value is users' ability to engineer the right context and workflows.

Both positions have evidence backing them up, which is what makes this interesting.

What Harness Engineering Actually Looks Like

Kyle from humanlayer.dev makes a point that resonated with me: "We spent the last year watching coding agents fail in every conceivable way, ignoring instructions, executing dangerous commands unprompted, and going in circles on the simplest of tasks. Every time the instinct was the same, we just need better models, GPT-6 will fix it... But over the course of dozens of projects and hundreds of agent sessions, we kept arriving at the same conclusion. It's not a model problem, it's a configuration problem."

This reframes failure as a design problem rather than a capability problem, which is kind of huge.

Harnesses work backward from what models can't do natively. Need the agent to write and execute code? Add bash and code execution. Need safe execution? Add sandboxed environments. Need long-term memory? Add memory files and retrieval systems. Those looping architectures everyone's talking about—the ones that let agents work on tasks for hours—those are harness features.

Anthropic breaks harness engineering into three layers:

  1. Information layer: What can the agent see and invoke (memory, context, tools)
  2. Execution layer: How work gets decomposed, how agents collaborate, how failures are handled
  3. Feedback layer: How the system improves over time through evaluation and observability

This isn't theoretical. Blitzy recently hit 66.5% on SWE-bench Pro, beating GPT-5.4's 57.7%. Their whole thesis is that harness engineering—the agent scaffolding, the orchestration, the context infrastructure—unlocks bigger gains than model improvements alone. When they audited the performance gap, they found GPT-5.4 often got close but missed corner cases. Blitzy succeeded because its knowledge graph gave agents deeper codebase context than a raw model could match.

The Convergence Nicolas Charrier Sees Coming

AI entrepreneur Nicolas Charrier wrote something called "The Great Convergence" that connects a bunch of dots. He noticed that wildly different companies are all building toward the same product shape: Linear is building coding agents, OpenAI is going all-in on Codex, Anthropic has Claude Code, Notion is building work agents.

His argument: "Claude Code was a massive breakthrough. Although initially invented for coding use cases, it turns out that a smart looping agent generalizes incredibly well towards any computer-based task if you give it the right tools."

The architecture is simple: user input → context engineering → model → tools → loop until done. But this simple pattern appears to be a general problem-solving machine that scales along a unique dimension—it can keep running for a long time.
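That loop can be sketched in a few lines of Python. Everything here is invented for illustration: `agent_loop`, the stub model, and the `inc` tool are not from any real framework. The point is just the control flow Charrier describes: build context, ask the model, run a tool, repeat until done.

```python
# Minimal sketch of: user input -> context -> model -> tools -> loop until done.
def agent_loop(goal, model, tools, build_context, max_steps=10):
    history = []
    for _ in range(max_steps):
        context = build_context(goal, history)   # context engineering
        action = model(context)                  # model picks the next step
        if action["type"] == "done":
            return action["result"]
        observation = tools[action["tool"]](action["input"])  # harness tools
        history.append((action, observation))    # feed results back in
    raise RuntimeError("step budget exhausted")

# A hand-written stub "model" that counts up to 3 using an increment tool,
# so the loop terminates deterministically without any real LLM call.
def stub_model(history):
    last = history[-1][1] if history else 0
    if last < 3:
        return {"type": "call", "tool": "inc", "input": last}
    return {"type": "done", "result": last}

result = agent_loop(
    goal="count to 3",
    model=stub_model,
    tools={"inc": lambda x: x + 1},
    build_context=lambda goal, history: history,
)
```

The "unique dimension" the article mentions shows up here as `max_steps`: the same tiny loop keeps running as long as the budget allows, which is exactly what lets real agents work on tasks for hours.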

Charrier predicts that by end of 2026, many software companies will look like they're selling the same thing. Not because the industry lost imagination, but because the economics and architecture push everyone toward "self-improving software systems that can take a goal, use tools, and produce business outcomes."

The winners won't just have better models. They'll have distribution, trusted workflow positioning, proprietary context, and—critically—the shortest path from observation to improvement.

The Disposable Harness Problem

Here's where it gets weird: Anthropic's Managed Agents product is designed around the assumption that harnesses themselves need to be disposable.

They give an example: Claude Sonnet 4.5 had "context anxiety"—it would wrap up tasks prematurely when approaching its context limit. So they added context resets to the harness. But when Claude Opus 4.5 came out, that behavior was gone. The resets became dead weight.
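As a hypothetical illustration of how such an assumption gets encoded into a harness (and then goes stale), the reset logic might be gated on which model is running. The model identifiers and the 0.8 threshold below are invented for the sketch; the point is that the workaround is dead code the moment the check stops matching.

```python
# Illustrative only: a harness workaround tied to one model generation.
RESET_THRESHOLD = 0.8  # fraction of the context window (invented value)

def maybe_reset_context(model_name: str, tokens_used: int, window: int) -> bool:
    """Return True if the harness should compact/reset the context.

    The workaround targets a model that wrapped up tasks prematurely near
    its context limit; later models without that behavior never trigger it,
    and the code becomes dead weight.
    """
    needs_workaround = model_name in {"sonnet-4.5"}  # stale assumption
    return needs_workaround and tokens_used / window > RESET_THRESHOLD
```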

"Harnesses encode assumptions that go stale as models improve," they write. Managed Agents is built around interfaces that stay stable as harnesses change.

This suggests we're entering a phase where the infrastructure around models needs to be as dynamic as the models themselves. The plumbing can't be static if the thing it's plumbing keeps fundamentally changing.

Which raises a question I genuinely don't know the answer to: Are we building toward a world where harness engineering matters less because models improve, or one where it matters more because we're giving models harder problems?

OpenAI's harness engineering post hints at the latter: "Our most difficult challenges now center on designing environments, feedback loops, and control systems that help agents accomplish our goal: building and maintaining complex, reliable software at scale."

That's a very different proposition than just making a model better. And it suggests the real competitive moats in AI might not be the models at all.

—Yuki Okonkwo, AI & Machine Learning Correspondent

Watch the Original Video

Harness Engineering 101


The AI Daily Brief: Artificial Intelligence News

20m 23s

About This Source

The AI Daily Brief: Artificial Intelligence News


The AI Daily Brief: Artificial Intelligence News is a dedicated YouTube channel offering daily insights into the fast-paced world of artificial intelligence. Since its inception in December 2025, it has rapidly become a go-to source for those eager to keep up with the latest in AI. Though subscriber numbers are not publicly available, the channel's consistent output and breadth of AI topics underscore its growing relevance in the tech community.

