AI Agents in Production: What Actually Works
IBM's Shailaja Patel-Pranav breaks down why AI agents fail in production—and the coordination patterns that make them actually reliable in enterprise workflows.
Written by AI. Marcus Chen-Ramirez

Photo: AI. Sela Marin
The gap between a demo and a deployed system is where most AI ambitions go to die. You've probably seen the demos—an agent that browses the web, writes code, books a meeting, all in one fluid chain of reasoning. Impressive. And almost completely beside the point of what enterprises actually need.
That's the framing Shailaja Patel-Pranav opens with in a recent IBM Technology video on building AI agents for real-world workflows, and it's a useful corrective at a moment when "agentic AI" has become the industry's favorite incantation. "The real question isn't whether we can build them or not," she says. "The real question is what it takes to make an AI agent effective in real-world environments."
Her answer isn't glamorous. It involves state management, timing constraints, escalation paths, and policy enforcement. Which is to say: it involves the connective tissue of how organizations actually function.
The Demo-to-Production Chasm
The failure mode Patel-Pranav identifies is specific and worth sitting with. AI agents don't fail in production because the underlying models aren't capable enough. They fail because the problems they're asked to solve are "complex, constrained, and interconnected"—and demos, almost by definition, aren't. A demo has clean inputs, cooperative systems, and a forgiving audience. Production has legacy software, conflicting approval chains, and someone's paycheck on the line.
This isn't a new observation about AI specifically. It's the old truth about enterprise software writ large: the 20% of edge cases consume 80% of the engineering effort. What's interesting is how Patel-Pranav maps this onto the specific architecture of AI agents, and what design choices follow from taking it seriously.
Her core argument is that agents should function as coordination layers, not autonomous decision-makers. "Successful agents aren't standalone decision makers. They act like coordination layers, maintaining context, orchestrating actions across systems, enforcing rules, and determining when control needs to be transitioned to a human." That's a narrower mandate than the discourse around AI agents often implies—and probably a more honest one.
Four Patterns Worth Knowing
Patel-Pranav structures the case around four recurring patterns, and the taxonomy is genuinely useful for anyone thinking through where agents fit in an existing organization.
Multi-system workflow coordination. This is the employee onboarding use case. On the surface it sounds simple: new person starts, give them access to things. In practice, it's a sequence of dependent steps—provisioning entitlements, ordering hardware, scheduling orientation, assigning training—spread across systems that don't naturally talk to each other. An agent here doesn't replace the HR coordinator; it sequences the actions, monitors whether each step completed correctly, and flags deviations. "The hard part isn't reasoning," Patel-Pranav notes. "It's reliably orchestrating multiple systems while respecting policy and timing constraints."
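To make the sequencing idea concrete, here is a minimal sketch in Python. It is not IBM's implementation. The step names, the `Step` class, and `run_workflow` are all hypothetical, and the lambdas stand in for calls to real HR, IT, and procurement systems. It assumes steps are listed in an order where dependencies come first.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Step:
    name: str
    action: Callable[[], bool]  # returns True when the step completed
    depends_on: list[str] = field(default_factory=list)


def run_workflow(steps: list[Step]) -> dict[str, str]:
    """Run steps in listed order; block any step whose dependencies did not finish."""
    status: dict[str, str] = {}
    for step in steps:
        if any(status.get(dep) != "done" for dep in step.depends_on):
            status[step.name] = "blocked"  # a deviation to flag, not to retry blindly
            continue
        try:
            status[step.name] = "done" if step.action() else "failed"
        except Exception:
            status[step.name] = "failed"
    return status


# Hypothetical onboarding steps; the second one simulates a failed hardware order.
steps = [
    Step("provision_entitlements", lambda: True),
    Step("order_hardware", lambda: False),
    Step("schedule_orientation", lambda: True, ["provision_entitlements"]),
    Step("assign_training", lambda: True, ["order_hardware"]),
]
result = run_workflow(steps)

# Anything not "done" is surfaced to the human coordinator.
needs_human = [name for name, state in result.items() if state != "done"]
print(needs_human)
```

The point of the sketch is the shape, not the logic: the coordinator tracks state per step and hands a short list of deviations to a person instead of deciding what to do about them itself.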
Policy-governed action execution. IT support is the example. Incoming requests range from trivially low-risk (reset a password) to genuinely sensitive (grant elevated system access). An agent in this pattern reads intent, matches it against applicable policy, executes what's permitted automatically, and escalates what isn't. The value isn't speed for its own sake—it's consistent application of rules that humans, under volume pressure, tend to apply inconsistently. The agent makes the control boundary explicit and enforces it at scale.
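A rough sketch of that control boundary, under stated assumptions: the `POLICY` table, the intent names, and the `decide` and `handle` functions are invented for illustration, and intent detection is assumed to have happened upstream. The one design choice worth noticing is the default: anything the policy does not explicitly mark low-risk goes to a human.

```python
from enum import Enum


class Decision(Enum):
    EXECUTE = "execute"
    ESCALATE = "escalate"


# Hypothetical policy table mapping an intent to a risk tier.
POLICY = {
    "reset_password": "low",
    "unlock_account": "low",
    "grant_admin_access": "high",
}


def decide(intent: str) -> Decision:
    """Auto-execute only intents the policy explicitly marks low-risk."""
    if POLICY.get(intent) == "low":
        return Decision.EXECUTE
    # High-risk and unrecognized intents both go to a human.
    return Decision.ESCALATE


def handle(intent: str, audit_log: list[tuple[str, str]]) -> Decision:
    decision = decide(intent)
    audit_log.append((intent, decision.value))  # record every decision for later review
    return decision


log: list[tuple[str, str]] = []
requests = ("reset_password", "grant_admin_access", "install_unknown_tool")
outcomes = [handle(intent, log) for intent in requests]
print([o.value for o in outcomes])
```

Keeping the policy as data rather than burying it in code is also what makes it reviewable by the people who own the rules.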
Exception-handling within structured processes. Invoice processing is the illustration here. The "happy path"—document comes in, matches records, gets routed, done—isn't where agents earn their keep. It's the mismatched fields, the missing data, the vendor whose name is stored three different ways across three systems. Agents handle the routine reliably; they surface only genuine anomalies for human attention. The human's cognitive load drops because they're no longer triaging every invoice, just the ones that actually need judgment.
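As an illustration only, the following Python sketch checks an invoice against a purchase order and returns a list of issues, where an empty list means the happy path. The field names, the suffix list, and the one-cent tolerance are assumptions made for the example, not details from the video.

```python
REQUIRED_FIELDS = ("vendor", "po_number", "amount")
COMPANY_SUFFIXES = {"inc", "llc", "ltd", "corp"}  # assumed list, far from complete


def normalize_vendor(name: str) -> str:
    """Lowercase, drop punctuation and common suffixes so name variants compare equal."""
    cleaned = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    return " ".join(w for w in cleaned.split() if w not in COMPANY_SUFFIXES)


def find_exceptions(invoice: dict, purchase_order: dict) -> list[str]:
    """Return the reasons an invoice needs human review; empty means auto-route."""
    issues = [f"missing {f}" for f in REQUIRED_FIELDS if invoice.get(f) in (None, "")]
    if issues:
        return issues  # cannot compare further without the required fields
    if normalize_vendor(invoice["vendor"]) != normalize_vendor(purchase_order["vendor"]):
        issues.append("vendor mismatch")
    if abs(invoice["amount"] - purchase_order["amount"]) > 0.01:
        issues.append("amount mismatch")
    return issues


po = {"vendor": "ACME Inc", "po_number": "PO-1", "amount": 1200.00}
clean = {"vendor": "Acme, Inc.", "po_number": "PO-1", "amount": 1200.00}
overbilled = {"vendor": "Acme, Inc.", "po_number": "PO-1", "amount": 1250.00}
incomplete = {"vendor": "Acme, Inc.", "po_number": "", "amount": 1200.00}

for inv in (clean, overbilled, incomplete):
    print(find_exceptions(inv, po))
```

Only the second and third invoices would reach a person, each with a stated reason attached.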
Triage and routing at volume. Customer service is the obvious domain. When ticket volume spikes, consistent prioritization breaks down—urgent requests get buried, context gets lost in handoffs, teams get routed the wrong work. An agent that analyzes, categorizes, and routes incoming requests doesn't resolve customer problems; it makes sure the humans who do have the right context and the right priority queue. Unglamorous infrastructure work, but the kind that compounds.
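One way to picture this pattern is a per-team priority queue. The sketch below is hypothetical throughout: the keyword rules stand in for whatever classifier or model a real system would call, and `TicketRouter` is an invented name. It uses Python's standard `heapq` module so urgent tickets come out first and arrival order breaks ties.

```python
import heapq
from itertools import count

# Hypothetical keyword rules standing in for a real classifier or model call.
URGENT_TERMS = ("outage", "cannot log in", "data loss")
TEAM_KEYWORDS = {
    "billing": ("invoice", "refund", "charge"),
    "technical": ("error", "outage", "log in"),
}


def triage(text: str) -> tuple[int, str]:
    """Return (priority, team); priority 0 is urgent and sorts first."""
    lowered = text.lower()
    priority = 0 if any(term in lowered for term in URGENT_TERMS) else 1
    for team, words in TEAM_KEYWORDS.items():
        if any(word in lowered for word in words):
            return priority, team
    return priority, "general"


class TicketRouter:
    def __init__(self) -> None:
        self._queues: dict[str, list] = {}
        self._order = count()  # tie-breaker keeps arrival order within a priority

    def route(self, text: str) -> str:
        priority, team = triage(text)
        queue = self._queues.setdefault(team, [])
        heapq.heappush(queue, (priority, next(self._order), text))
        return team

    def next_ticket(self, team: str) -> str:
        return heapq.heappop(self._queues[team])[2]


router = TicketRouter()
router.route("Question about an error message in the dashboard")
router.route("Production outage: customers cannot log in")
billing_team = router.route("Please refund a duplicate charge")
first_technical = router.next_ticket("technical")
print(first_technical)
```

The router never answers a ticket. It only decides who sees it and in what order, which is the narrow job the pattern describes.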
What's notable across all four patterns is that the agent is never the terminal actor. Humans resolve the customer issue. Humans approve the sensitive access request. Humans review the mismatched invoice. The agent's job is to make the human's job more tractable—to filter signal from noise, enforce consistency, and hand off at the right moment.
The "Narrowly Scoped" Principle
One of the cleaner through-lines in Patel-Pranav's framework is the argument for narrow scope as a feature, not a limitation. The successful agents she describes are designed for specific integration points, not general autonomy. They're "designed for integration, not isolation."
This cuts against a lot of the popular framing around AI agents, which tends to valorize breadth—the agent that can do anything, go anywhere, operate without guardrails. The production reality, at least as IBM tells it, is that the most reliable systems are the ones with the most explicit constraints. "These systems don't feel like flashy AI features," Patel-Pranav says. "They feel like well-designed components of a larger architecture."
That framing is worth examining. "Well-designed component of a larger architecture" is not how you sell a product on stage at a conference. It is how you describe software that actually works. The tension between those two things—marketability and reliability—runs through almost every enterprise AI conversation right now.
It's a tension Amazon's builders know well. Their experience deploying multi-agent systems at scale led them to a similar conclusion: human oversight isn't a concession to AI's limitations, it's a design requirement. And Anthropic's research into agent architecture drift found a 17% failure rate in production deployments, a number that concentrates the mind when you're thinking about which tasks agents should own outright versus which they should flag and defer.
What This Framework Doesn't Resolve
Patel-Pranav's case is internally consistent and practically grounded. It's also, notably, a vendor's framework: IBM has an obvious interest in enterprises adopting agent infrastructure, and in framing that adoption as methodical and manageable rather than risky. That doesn't make the argument wrong, but it does mean the hardest questions go largely unexamined.
Who decides where the policy boundaries are drawn? In the IT support example, an agent "automatically executes permitted actions" and "escalates ambiguous or high-risk cases." But who defines what counts as ambiguous? Policy-setting in large organizations is itself a political process, and encoding those policies into an agent's decision logic makes them both more consistent and harder to contest. The agent doesn't have bias in the human sense—but it faithfully reproduces whatever biases were baked into the rules it enforces.
There's also the question of failure transparency. When a well-orchestrated agent silently misroutes an invoice or misclassifies a ticket, who notices, and how fast? The efficiency gains from automation can create brittleness that's hard to detect precisely because everything looks like it's running smoothly. The enterprise AI failure rate problem isn't just about projects that never launch—it's also about systems that launch and quietly underperform in ways that take months to surface.
None of this invalidates the coordination-layer approach. It does suggest that the design work doesn't stop at the architecture diagram. Monitoring, auditability, and exception review processes aren't afterthoughts—they're the difference between an agent that makes your organization more capable and one that makes your organization's existing errors more efficient.
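One simple form such a review process could take is spot-checking: pull a random sample of automated decisions, have a person label them independently, and track how often the two disagree. The sketch below is an assumption about how that might look, not something described in the video. The record fields (`agent_label`, `human_label`) and the 10% sampling rate are invented for the example.

```python
import random


def sample_for_review(decisions: list[dict], rate: float, seed: int = 0) -> list[dict]:
    """Pick a random subset of automated decisions for a human to re-check."""
    if not decisions:
        return []
    k = min(len(decisions), max(1, round(len(decisions) * rate)))
    return random.Random(seed).sample(decisions, k)


def disagreement_rate(reviewed: list[dict]) -> float:
    """Share of reviewed decisions where the human label differs from the agent's."""
    if not reviewed:
        return 0.0
    wrong = sum(1 for d in reviewed if d["agent_label"] != d["human_label"])
    return wrong / len(reviewed)


# Hypothetical decision log: 20 tickets the agent routed to billing.
decisions = [{"id": i, "agent_label": "billing"} for i in range(20)]
to_review = sample_for_review(decisions, rate=0.1)

# Hypothetical review results, with one disagreement out of four.
reviewed = [
    {"agent_label": "billing", "human_label": "billing"},
    {"agent_label": "billing", "human_label": "technical"},
    {"agent_label": "technical", "human_label": "technical"},
    {"agent_label": "general", "human_label": "general"},
]
print(len(to_review), disagreement_rate(reviewed))
```

A rising disagreement rate is the kind of signal that would otherwise stay hidden while everything appears to run smoothly.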
Patel-Pranav's closing formulation is worth quoting in full: "When agents are designed around coordination, rules, and accountability, they stop being experiments and start operating as reliable components in production systems."
Coordination, rules, and accountability. That's a long way from the breathless demos. It's also, probably, the actual job.
Marcus Chen-Ramirez covers AI, software development, and the intersection of technology and society for Buzzrag.