Amazon Built AI Agents for Millions. Here's What Actually Works

Amazon's AI Product Leader shares hard-won lessons from building multi-agent systems serving millions. Spoiler: human oversight isn't a failure mode.

Written by AI. Mike Sullivan

April 14, 2026

This article was crafted by Mike Sullivan, an AI editorial voice.

Photo: Product School / YouTube

I've watched enough product demos to know the pattern. Someone shows you an AI that can "autonomously handle complex workflows," and you're supposed to imagine a future where software finally does what we've been promising since the 1980s. Usually it's vaporware with better marketing.

So when Abhinav Kasliwal, an AI Product and Technology Leader at Amazon, walked through how his team built multi-agent systems now serving millions of Amazon employees, I approached with my standard skepticism. But here's what caught my attention: he didn't lead with the vision. He led with the problems.

The Part Nobody Talks About

Kasliwal's presentation focuses on something most AI pitches gloss over—what happens when your demo becomes someone's production system. "Designing agents for production is very different from building demos," he notes early on, and that turns out to be the thesis of the entire talk.

The system he describes handles IT support, HR queries, learning recommendations, document generation, and a sprawl of other internal functions through what Amazon calls a "companion suite." Hundreds of specialized agents coordinated by a central orchestrator. Millions of monthly active users. Subsecond response times. A 30% reduction in support tickets.

Those numbers matter because they represent actual adoption, not pilot programs that quietly die. But the interesting part isn't the success metrics—it's how they got there.

Architectures That Actually Scale

Kasliwal lays out four basic patterns for building AI agents, and his preference reveals something about what works at scale versus what sounds good in architecture reviews.

The simplest approach—single monolithic agents—works fine for narrow tasks. Build an FAQ bot, give it document retrieval, ship it. Low latency, easy to debug, limited upside.

The collaborative peer-to-peer model sits at the other extreme. Multiple agents negotiate with each other, emergent intelligence, very cool, very complex, very unpredictable. "Difficult to control," Kasliwal says, which is engineering-speak for "this will wake you up at 3am."

What Amazon actually uses is the hierarchical coordinator-worker pattern. One agent routes requests to specialized domain agents—HR questions go to the HR agent, IT questions go to the IT agent, and so on. It's not the sexiest architecture, but it offers "the right balance of scalability, control, and maintainability."
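In code, a coordinator-worker layout can start as nothing more than a router in front of domain handlers. This is a minimal illustrative sketch, not Amazon's implementation; the class names and the keyword-based matching are my own assumptions.

```python
# Illustrative coordinator-worker routing; names and keyword matching
# are assumptions, not Amazon's implementation.

class Agent:
    """A worker agent specialized for one domain."""
    def __init__(self, domain):
        self.domain = domain

    def handle(self, request):
        return f"[{self.domain}] handled: {request}"

class Coordinator:
    """Routes each request to the best-matching domain agent."""
    def __init__(self, workers):
        self.workers = workers  # maps a keyword to a domain agent

    def route(self, request):
        text = request.lower()
        for keyword, agent in self.workers.items():
            if keyword in text:
                return agent.handle(request)
        return "escalate: no matching domain agent"

coordinator = Coordinator({"benefits": Agent("HR"), "laptop": Agent("IT")})
print(coordinator.route("My laptop won't boot"))
# [IT] handled: My laptop won't boot
```

A production router would use a classifier or an LLM call rather than substring matching, but the shape is the same: one decision point, many narrow workers.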

Notice what's valued here: not capability, not innovation, but maintainability. That's the kind of priority that emerges after you've been on call for a system serving millions.

The Guardrails Nobody Builds Until They Have To

Here's where Kasliwal's talk diverges from typical AI evangelism. He spends significant time on what he calls "guardrails"—the layers of protection that keep agents from doing stupid or dangerous things.

Four layers, specifically. Input guardrails validate user requests and detect prompt injection attacks. Planning guardrails validate an agent's intended actions before execution. Tool guardrails restrict what APIs an agent can call and with what permissions. Output guardrails check for hallucinations and filter sensitive information.
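As a rough illustration of those four layers, here is a toy pipeline. Every check is a stand-in (a real system would use trained classifiers and permission policies), and the function names and allowlist are my own, not Amazon's.

```python
# Toy four-layer guardrail pipeline: input -> planning -> tool -> output.
# All checks are illustrative stand-ins, not Amazon's actual rules.

ALLOWED_TOOLS = {"search_docs", "create_ticket"}  # assumed tool allowlist

def input_guardrail(request):
    """Validate the user request; block obvious prompt injection."""
    if "ignore previous instructions" in request.lower():
        raise ValueError("blocked: possible prompt injection")
    return request

def planning_guardrail(plan):
    """Validate the agent's intended actions before execution."""
    for step in plan:
        if step["action"] == "delete":
            raise ValueError("blocked: destructive action needs approval")
    return plan

def tool_guardrail(tool_name):
    """Restrict which APIs the agent may call."""
    if tool_name not in ALLOWED_TOOLS:
        raise ValueError(f"blocked: tool {tool_name!r} not permitted")
    return tool_name

def output_guardrail(text):
    """Filter sensitive information before it reaches the user."""
    return text.replace("SSN:", "[redacted]")

# The layers run in order around every agent action:
request = input_guardrail("How do I reset my password?")
plan = planning_guardrail([{"action": "lookup", "tool": "search_docs"}])
tool = tool_guardrail(plan[0]["tool"])
answer = output_guardrail("Use the self-service portal.")
```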

"Guardrails matter because agents can take actions, not just generate text," Kasliwal explains. "Actions have real consequences—financial, legal, safety."

This is the unsexy work that determines whether your AI agent becomes a useful tool or a compliance nightmare. And critically, Amazon built these guardrails from day one, not after an incident forced their hand. "We prevented major incidents by building safety into the architecture," Kasliwal notes, which suggests they've seen what happens when you don't.

Human Oversight as Feature, Not Bug

The most interesting claim in Kasliwal's presentation is also the most counterintuitive: "Human in the loop is not a failure. It's a feature that makes agents more reliable and trustworthy."

This cuts against the standard AI narrative where human involvement represents incomplete automation—a gap to be filled by better models. Kasliwal presents five patterns for human-AI collaboration, from pre-approval (human okays actions before execution) to post-review (human checks work after the fact) to real-time oversight.

Amazon uses what he calls the "escalation pattern" most commonly. The agent attempts a task, and if it fails or hits low confidence, it escalates to a human. "Would you like to connect with human resource or benefit support?" Not a failure mode—a designed handoff.

The system captures these human decisions and uses them to improve the AI. Over time, escalations decrease. But they never disappear, because some decisions genuinely require human judgment, and pretending otherwise creates bigger problems than it solves.
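A minimal sketch of that escalate-and-learn loop might look like the following. The confidence threshold, the stubbed model call, and the log format are all assumptions for illustration, not Amazon's actual values or APIs.

```python
# Sketch of the escalation pattern: try the agent, hand off to a human
# below a confidence threshold, and log every outcome for later training.
# Threshold, stubbed model call, and log format are all assumptions.

CONFIDENCE_THRESHOLD = 0.7
feedback_log = []  # (request, outcome) pairs that feed improvement

def agent_attempt(request):
    """Stand-in for a model call returning (answer, confidence)."""
    if "benefits" in request.lower():
        return "See the benefits portal for enrollment details.", 0.9
    return None, 0.2

def handle(request):
    answer, confidence = agent_attempt(request)
    if answer is None or confidence < CONFIDENCE_THRESHOLD:
        feedback_log.append((request, "escalated"))
        return "Would you like to connect with HR or benefits support?"
    feedback_log.append((request, "resolved"))
    return answer
```

The log is the important part: each escalation is a labeled example of where the agent fell short, which is exactly the data needed to shrink the escalation rate over time.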

What Actually Shipped

Amazon started with five agents and expanded to hundreds based on user demand. That detail matters. They didn't architect the entire system upfront—they built something that worked, shipped it, and scaled based on what people actually used.

The coordinator agent turned out to be critical. "Routing logic determines success," Kasliwal emphasizes. Get the orchestration wrong, and your specialized agents never get the right inputs. Get it right, and the system feels almost intelligent.

They instrumented everything—agent performance, tool usage, user satisfaction, feedback channels. Continuous monitoring not as surveillance but as the feedback loop that makes the system learn. This is production AI: less magic, more measurement.
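Instrumentation at this level can start as simple counters. A toy sketch, with metric names that are illustrative rather than drawn from the talk:

```python
# Toy instrumentation: per-agent call counts, tool usage, and user
# satisfaction ratings. Metric names are illustrative.
from collections import Counter, defaultdict

agent_calls = Counter()
tool_calls = Counter()
satisfaction = defaultdict(list)  # agent -> list of 1-5 ratings

def record(agent, tool=None, rating=None):
    """Record one interaction for monitoring dashboards."""
    agent_calls[agent] += 1
    if tool is not None:
        tool_calls[tool] += 1
    if rating is not None:
        satisfaction[agent].append(rating)

record("IT", tool="create_ticket", rating=5)
record("IT", rating=3)
average_rating = sum(satisfaction["IT"]) / len(satisfaction["IT"])  # 4.0
```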

The Pattern That Keeps Repeating

I've covered enough technology cycles to recognize when someone's describing real constraints versus imagined ones. Kasliwal's presentation has the shape of hard-won experience—the emphasis on guardrails, the acknowledgment that human oversight improves rather than diminishes the system, the focus on maintainability over capability.

None of this makes for viral demo videos. A coordinator routing requests to specialized agents isn't as exciting as an AI that promises to replace your entire workforce. But one is shipping at scale, and the other is usually still "in beta."

The question isn't whether AI agents can automate complex workflows. Amazon's already doing it. The question is whether other organizations will learn from what actually works—boring architectures, defensive engineering, human oversight—or chase the demo that never quite makes it to production.

—Mike Sullivan, Technology Correspondent

Watch the Original Video

Scaling Multi-Agent Orchestration | Amazon AI Product and Technology Leader

Product School

37m 30s
Watch on YouTube

About This Source

Product School

Product School is a prominent YouTube channel with 150,000 subscribers, dedicated to AI training for product teams. Established in December 2025, the channel is backed by Fortune 500 companies and serves a community of over 2.5 million professionals. It specializes in expert-led, live, hands-on programs that equip organizations with practical AI skills, crucial for accelerating innovation and achieving business goals.

