AI Agents in Production: What Actually Works

The gap between a demo and a deployed system is where most AI ambitions go to die. You've probably seen the demos—an agent that browses the web, writes code, books a meeting, all in one fluid chain of reasoning. Impressive. And almost completely beside the point of what enterprises actually need.

That's the framing Shailaja Patel-Pranav opens with in a recent IBM Technology video on building AI agents for real-world workflows, and it's a useful corrective at a moment when "agentic AI" has become the industry's favorite incantation. "The real question isn't whether we can build them or not," she says. "The real question is what it takes to make an AI agent effective in real-world environments."

Her answer isn't glamorous. It involves state management, timing constraints, escalation paths, and policy enforcement: the connective tissue of how organizations actually function.

The Demo-to-Production Chasm

The failure mode Patel-Pranav identifies is specific and worth sitting with. AI agents don't fail in production because the underlying models aren't capable enough. They fail because the problems they're asked to solve are "complex, constrained, and interconnected"—and demos, almost by definition, aren't. A demo has clean inputs, cooperative systems, and a forgiving audience. Production has legacy software, conflicting approval chains, and someone's paycheck on the line.

This isn't a new observation about AI specifically. It's the old truth about enterprise software writ large: the 20% of edge cases consume 80% of the engineering effort. What's interesting is how Patel-Pranav maps this onto the specific architecture of AI agents, and what design choices follow from taking it seriously.

Her core argument is that agents should function as coordination layers, not autonomous decision-makers. "Successful agents aren't standalone decision makers. They act like coordination layers, maintaining context, orchestrating actions across systems, enforcing rules, and determining when control needs to be transitioned to a human." That's a narrower mandate than the discourse around AI agents often implies—and probably a more honest one.

Four Patterns Worth Knowing

Patel-Pranav structures the case around four recurring patterns, and the taxonomy is genuinely useful for anyone thinking through where agents fit in an existing organization.

Multi-system workflow coordination. This is the employee onboarding use case. On the surface it sounds simple: new person starts, give them access to things. In practice, it's a sequence of dependent steps—provisioning entitlements, ordering hardware, scheduling orientation, assigning training—spread across systems that don't naturally talk to each other. An agent here doesn't replace the HR coordinator; it sequences the actions, monitors whether each step completed correctly, and flags deviations. "The hard part isn't reasoning," Patel-Pranav notes. "It's reliably orchestrating multiple systems while respecting policy and timing constraints."
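To make the sequencing idea concrete, here is a minimal sketch of dependency-ordered orchestration. It is not taken from the video: the `Step` class, the status strings, and the onboarding step names are all assumptions made for illustration, and each `action` stands in for a call to a real system.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter
from typing import Callable


@dataclass
class Step:
    name: str
    action: Callable[[], bool]  # returns True if the step completed correctly
    depends_on: tuple[str, ...] = ()


def run_workflow(steps: list[Step]) -> dict[str, str]:
    """Run steps in dependency order and record a status for each one."""
    by_name = {step.name: step for step in steps}
    graph = {step.name: set(step.depends_on) for step in steps}
    status: dict[str, str] = {}
    for name in TopologicalSorter(graph).static_order():
        step = by_name[name]
        if any(status[dep] != "done" for dep in step.depends_on):
            status[name] = "blocked"  # an upstream step did not complete
            continue
        try:
            completed = step.action()
        except Exception:
            completed = False  # treat a crashing system call as a failed step
        status[name] = "done" if completed else "failed"
    return status


def deviations(status: dict[str, str]) -> list[str]:
    """Steps a human coordinator should look at."""
    return sorted(name for name, state in status.items() if state != "done")


# Hypothetical onboarding run in which the hardware order fails.
onboarding = [
    Step("provision_entitlements", lambda: True),
    Step("order_hardware", lambda: False),
    Step("schedule_orientation", lambda: True, ("provision_entitlements",)),
    Step("assign_training", lambda: True, ("schedule_orientation", "order_hardware")),
]
result = run_workflow(onboarding)
```

In this run the failed hardware order leaves training blocked while orientation proceeds, and `deviations(result)` lists exactly the two steps that need attention. The agent sequences and reports; the coordinator decides what to do about the failure.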

Policy-governed action execution. IT support is the example. Incoming requests range from trivially low-risk (reset a password) to genuinely sensitive (grant elevated system access). An agent in this pattern reads intent, matches it against applicable policy, executes what's permitted automatically, and escalates what isn't. The value isn't speed for its own sake—it's consistent application of rules that humans, under volume pressure, tend to apply inconsistently. The agent makes the control boundary explicit and enforces it at scale.
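An explicit control boundary can be as plain as a lookup table. The sketch below is illustrative only: the intent names, the `POLICY` table, and the default-to-escalate rule are assumptions, and the step where a model reads intent from a free-text request is left out.

```python
from enum import Enum


class Decision(Enum):
    EXECUTE = "execute"
    ESCALATE = "escalate"


# Hypothetical policy table: intent -> what the agent may do on its own.
POLICY = {
    "reset_password": Decision.EXECUTE,
    "unlock_account": Decision.EXECUTE,
    "grant_elevated_access": Decision.ESCALATE,
}


def decide(intent: str) -> Decision:
    """Anything the policy does not explicitly permit goes to a human."""
    return POLICY.get(intent, Decision.ESCALATE)


def handle(intent: str, requester: str) -> str:
    """Act on permitted intents; hand everything else to an approver."""
    if decide(intent) is Decision.EXECUTE:
        return f"executed {intent} for {requester}"
    return f"escalated {intent} from {requester} to a human approver"
```

The design choice that matters is the default. An intent the table has never seen is escalated rather than executed, so gaps in the policy fail toward human review.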

Exception-handling within structured processes. Invoice processing is the illustration here. The "happy path"—document comes in, matches records, gets routed, done—isn't where agents earn their keep. It's the mismatched fields, the missing data, the vendor whose name is stored three different ways across three systems. Agents handle the routine reliably; they surface only genuine anomalies for human attention. The human's cognitive load drops because they're no longer triaging every invoice, just the ones that actually need judgment.
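A rough sketch of that split between routine and anomaly follows. Everything in it is assumed for illustration: the `Invoice` fields, the suffix list used to normalize vendor names, and the purchase-order lookup shaped as a dictionary of `(vendor, amount_cents)` pairs.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Invoice:
    vendor: Optional[str]
    po_number: Optional[str]
    amount_cents: Optional[int]


SUFFIXES = {"inc", "llc", "ltd", "corp", "co"}  # assumed list, not exhaustive


def normalize_vendor(name: str) -> str:
    """Lowercase, drop punctuation and company suffixes so variants compare equal."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", name.lower())
    return " ".join(word for word in cleaned.split() if word not in SUFFIXES)


def find_anomalies(invoice: Invoice, purchase_orders: dict[str, tuple[str, int]]) -> list[str]:
    """Return reasons this invoice needs a human; an empty list means routine."""
    missing = [f for f in ("vendor", "po_number", "amount_cents") if getattr(invoice, f) is None]
    if missing:
        return [f"missing field: {f}" for f in missing]
    record = purchase_orders.get(invoice.po_number)
    if record is None:
        return ["unknown purchase order"]
    po_vendor, po_amount_cents = record
    issues = []
    if normalize_vendor(invoice.vendor) != normalize_vendor(po_vendor):
        issues.append("vendor mismatch")
    if invoice.amount_cents != po_amount_cents:
        issues.append("amount mismatch")
    return issues


def route(invoice: Invoice, purchase_orders: dict[str, tuple[str, int]]) -> str:
    return "human_review" if find_anomalies(invoice, purchase_orders) else "auto_process"
```

With a purchase order stored under "ACME Inc", an invoice from "Acme, Inc." for the same amount routes to `auto_process`, because the name variants normalize to the same string. A different amount or a missing field routes to `human_review`, with the reason attached.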

Triage and routing at volume. Customer service is the obvious domain. When ticket volume spikes, consistent prioritization breaks down—urgent requests get buried, context gets lost in handoffs, teams get routed the wrong work. An agent that analyzes, categorizes, and routes incoming requests doesn't resolve customer problems; it makes sure the humans who do have the right context and the right priority queue. Unglamorous infrastructure work, but the kind that compounds.
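Here is one way the prioritize-and-route step might look, as a sketch under stated assumptions. The keyword lists are a deliberately crude stand-in for whatever model does the real classification, and the team names and the `TriageQueue` class are hypothetical.

```python
import heapq
from itertools import count

# Assumed keyword rules standing in for a real classifier.
URGENT_TERMS = ("outage", "cannot log in", "charged twice")
TEAM_KEYWORDS = {
    "billing": ("invoice", "refund", "charged"),
    "technical": ("outage", "error", "log in"),
}


def classify(text: str) -> tuple[int, str]:
    """Return (priority, team); priority 0 is urgent, 1 is normal."""
    lowered = text.lower()
    priority = 0 if any(term in lowered for term in URGENT_TERMS) else 1
    team = next(
        (team for team, words in TEAM_KEYWORDS.items() if any(w in lowered for w in words)),
        "general",
    )
    return priority, team


class TriageQueue:
    """Urgent tickets come out first; ties are broken by arrival order."""

    def __init__(self) -> None:
        self._heap: list[tuple[int, int, str, str]] = []
        self._arrival = count()

    def add(self, text: str) -> None:
        priority, team = classify(text)
        heapq.heappush(self._heap, (priority, next(self._arrival), team, text))

    def next_ticket(self) -> tuple[str, str]:
        _, _, team, text = heapq.heappop(self._heap)
        return team, text
```

The arrival counter is what keeps prioritization consistent under load: two urgent tickets are served in the order they came in, and a routine question that arrived first still waits behind both.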

What's notable across all four patterns is that the agent is never the terminal actor. Humans resolve the customer issue. Humans approve the sensitive access request. Humans review the mismatched invoice. The agent's job is to make the human's job more tractable—to filter signal from noise, enforce consistency, and hand off at the right moment.

The "Narrowly Scoped" Principle

One of the cleaner through-lines in Patel-Pranav's framework is the argument for narrow scope as a feature, not a limitation. The successful agents she describes are designed for specific integration points, not general autonomy. They're "designed for integration, not isolation."

This cuts against a lot of the popular framing around AI agents, which tends to valorize breadth—the agent that can do anything, go anywhere, operate without guardrails. The production reality, at least as IBM tells it, is that the most reliable systems are the ones with the most explicit constraints. "These systems don't feel like flashy AI features," Patel-Pranav says. "They feel like well-designed components of a larger architecture."

That framing is worth examining. "Well-designed component of a larger architecture" is not how you sell a product on stage at a conference. It is how you describe software that actually works. The tension between those two things—marketability and reliability—runs through almost every enterprise AI conversation right now.

It's a tension Amazon's builders know well. Deploying multi-agent systems at scale led them to a similar conclusion: human oversight isn't a concession to AI's limitations, it's a design requirement. And Anthropic's research into agent architecture drift found a 17% failure rate in production deployments, a number that concentrates the mind when you're deciding which tasks agents should own outright and which they should flag and defer.

What This Framework Doesn't Resolve

Patel-Pranav's case is internally consistent and practically grounded. It's also, notably, a vendor's framework: IBM has an obvious interest in enterprises adopting agent infrastructure, and in framing that adoption as methodical and manageable rather than risky. That doesn't make the argument wrong, but it's worth noting that the hardest questions go largely unexamined.

Who decides where the policy boundaries are drawn? In the IT support example, an agent "automatically executes permitted actions" and "escalates ambiguous or high-risk cases." But who defines what counts as ambiguous? Policy-setting in large organizations is itself a political process, and encoding those policies into an agent's decision logic makes them both more consistent and harder to contest. The agent doesn't have bias in the human sense—but it faithfully reproduces whatever biases were baked into the rules it enforces.

There's also the question of failure transparency. When a well-orchestrated agent silently misroutes an invoice or misclassifies a ticket, who notices, and how fast? The efficiency gains from automation can create brittleness that's hard to detect precisely because everything looks like it's running smoothly. The failure-rate problem in enterprise AI isn't just about projects that never launch; it's also about systems that launch and quietly underperform in ways that take months to surface.

None of this invalidates the coordination-layer approach. It does suggest that the design work doesn't stop at the architecture diagram. Monitoring, auditability, and exception review processes aren't afterthoughts—they're the difference between an agent that makes your organization more capable and one that makes your organization's existing errors more efficient.
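One possible shape for that review process is an audit record for every automated decision, plus a repeatable sample of the "successful" ones sent to a human for spot checks. This is a sketch, not something the video describes: the function names, the record fields, and the 5% default are assumptions.

```python
import hashlib
import json
from datetime import datetime, timezone


def sampled_for_review(decision_id: str, percent: int) -> bool:
    """Deterministically pick roughly `percent`% of decisions for a human spot check."""
    digest = hashlib.sha256(decision_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100 < percent


def audit_record(decision_id: str, action: str, outcome: str, percent: int = 5) -> str:
    """Serialize one agent decision as a JSON line for an append-only log."""
    return json.dumps(
        {
            "decision_id": decision_id,
            "action": action,
            "outcome": outcome,
            "needs_spot_check": sampled_for_review(decision_id, percent),
            "logged_at": datetime.now(timezone.utc).isoformat(),
        },
        sort_keys=True,
    )
```

Hashing the decision ID rather than drawing a random number means the same decision is always in or out of the sample, so a reviewer can reproduce the selection later. Sampling decisions that looked fine is the point: it is how a silent misroute gets noticed in days rather than months.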

Patel-Pranav's closing formulation is worth quoting in full: "When agents are designed around coordination, rules, and accountability, they stop being experiments and start operating as reliable components in production systems."

Coordination, rules, and accountability. That's a long way from the breathless demos. It's also, probably, the actual job.

Marcus Chen-Ramirez covers AI, software development, and the intersection of technology and society for Buzzrag.