Agent Observability: How to Monitor AI in

Here's a thing that doesn't get said enough about AI agents in production: you can't unit test your way out of this one.

That's not a hot take—it's closer to a mathematical reality. Traditional software testing works because you can enumerate the meaningful input states. With an agent that has access to a growing toolkit, multiple memory sources, and the ability to spin up sub-agents that themselves have tools and memory and sub-agents, the combinatorial input space stops being a space and starts being a universe. "Golden datasets" don't scale to universes.

This is the core argument from a recent workshop by the team at Raindrop, a monitoring tool for production AI agents. Ben Hylak and Danny Gollapalli spent about an hour walking through their framework for thinking about agent observability—why evals aren't enough, what signals actually matter, and a somewhat mind-bending technique where agents essentially report their own failures to you. The talk was dense and practitioner-focused, but the conceptual architecture underneath it is worth understanding even if you're not running agents in production today. Because more of us will be, sooner than we think.

The Eval Problem

Evals—automated tests that score an agent's output against some expected standard—are the current industry reflex for quality assurance. Run your agent on a test set, compare outputs, tune your prompt, repeat. It's a reasonable approach for early development. Raindrop's argument is that it breaks down badly once you're in production.

"As agents become more and more capable, there's more and more interesting undefined behavior that can happen," Hylak explained in the session. The undefined part is doing real work in that sentence. Evals, by definition, can only test for behaviors you've anticipated. Production environments surface behaviors nobody anticipated—edge cases that emerge from specific user populations, from tool combinations no one thought to test, from session lengths that stretch into hours without human oversight.

The shift Raindrop is advocating for is from a testing paradigm to a monitoring paradigm—and they're right that these are meaningfully different things. Testing is about verifying known behaviors before deployment. Monitoring is about detecting unknown problems after deployment. The analogy to traditional software holds: unit tests matter, but they don't tell you your API is timing out for users in Southeast Asia at 2am.

The stakes attached to that distinction are growing. Agents are being deployed in healthcare, in financial services, in contexts where failures aren't just embarrassing but consequential. The costlier an undetected failure, the more urgent it becomes to catch it in production rather than hear about it afterward.

Two Kinds of Signals

Raindrop's framework divides monitoring signals into two categories: explicit and implicit. The explicit ones are familiar from any observability stack—error rates, latency, cost, users hitting "regenerate." These are objective and easy to measure. If your tool error rate spikes at 3am, something broke. You need alerting on these. This isn't controversial.
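To make the explicit side concrete, here is a minimal sketch of a rolling-window alert on tool error rate. It is not Raindrop's implementation; the class name, window size, threshold, and minimum event count are all assumptions chosen for illustration.

```python
from collections import deque


class ErrorRateAlert:
    """Rolling-window alert on tool error rate (hypothetical helper)."""

    def __init__(self, window=100, threshold=0.2, min_events=20):
        # Keep only the most recent `window` tool-call outcomes.
        self.events = deque(maxlen=window)
        self.threshold = threshold    # assumed: alert above 20% errors
        self.min_events = min_events  # assumed: stay quiet on tiny samples

    def record(self, is_error: bool) -> bool:
        """Record one tool call; return True if the alert should fire."""
        self.events.append(bool(is_error))
        if len(self.events) < self.min_events:
            return False
        error_rate = sum(self.events) / len(self.events)
        return error_rate > self.threshold
```

In practice the return value would feed whatever paging or alerting system you already run. The point is only that explicit signals reduce to counting and thresholds.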

The implicit signals are where things get more interesting, and messier.

Implicit signals are semantic—they're about the meaning of what's happening in a session, not just the mechanics. Raindrop tracks things like refusals (the agent saying "I can't do that"), task failure, user frustration, and capability gaps (users asking for things the agent simply can't do). They also track wins, which is easy to forget—knowing what's working is as valuable as knowing what isn't.

The mechanics of detecting these signals reveal a real engineering tradeoff. One approach is regex: search user messages for signals of frustration ("this sucks," "that's wrong," "WTF"). It's cheap, it's fast, and it's noisy, but in aggregate across millions of users the noise washes out and the signal survives. When Claude Code's source code leaked recently, it included a file called userPromptKeywords.ts: essentially a long regex pattern hunting for user frustration signals, with a boolean isNegative flag used to track frustration rates across product releases. A dirt-cheap solution that apparently worked well enough to ship.
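A rough sketch of the regex approach, in Python for brevity. This is not the contents of the leaked file; the phrase list, the function names, and the pattern itself are assumptions built from the example phrases above.

```python
import re

# Assumed phrase list; a real deployment would be far longer and tuned on data.
FRUSTRATION_PATTERN = re.compile(
    r"\b(?:wtf|this sucks|that(?:['’]s| is) wrong|not what i asked)\b",
    re.IGNORECASE,
)


def is_negative(message: str) -> bool:
    """Return True if the message contains any frustration phrase."""
    return FRUSTRATION_PATTERN.search(message) is not None


def frustration_rate(messages) -> float:
    """Fraction of messages flagged as negative (0.0 for an empty batch)."""
    messages = list(messages)
    if not messages:
        return 0.0
    return sum(is_negative(m) for m in messages) / len(messages)
```

Any single match is weak evidence. The useful number is the rate, compared from one release to the next.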

The other approach is lightweight classifiers—models specifically trained to detect refusals, frustration, task failure. These work across languages (where regex fails), and they're more sophisticated than keyword matching. The tradeoff Raindrop flagged here is worth noting: running a full LLM as a judge on every output would roughly double your AI inference costs. So they've trained smaller, cheaper models for this purpose. An audience member pushed back that LLM-as-judge isn't that expensive at their scale—Hylak's response was essentially, come back to me when you're at Replit scale.
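The "roughly double" figure is easy to sanity-check with back-of-envelope arithmetic. The sketch below assumes the judge's per-output cost can be expressed as a ratio of the agent's own inference cost; the function name and every number in the usage note are illustrative, not figures from the workshop.

```python
def total_cost(agent_cost: float, judge_cost_ratio: float,
               sample_rate: float = 1.0) -> float:
    """Total spend when a fraction of outputs is also scored by a judge.

    agent_cost:       spend on the agent's own inference
    judge_cost_ratio: judge cost per output relative to agent cost per output
                      (about 1.0 for a comparable full-size LLM judge)
    sample_rate:      fraction of outputs that get scored (1.0 = every output)
    """
    return agent_cost * (1 + judge_cost_ratio * sample_rate)
```

With a ratio of 1.0 on every output, 1,000 units of agent spend becomes 2,000. A small classifier at an assumed ratio of 0.05 would bring that to about 1,050. Whether the gap matters depends on how large the first number is, which is the point of the Replit remark.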

That exchange surfaces a real question: what's the right evaluation mechanism for your traffic volume? The honest answer is that it depends, and any tool vendor who tells you otherwise is selling you certainty they don't have.

Experiments as Feedback Loops

Once you have a reliable signal set, Raindrop's argument is that you can use those signals as the feedback mechanism for product development—essentially A/B testing with semantic metrics instead of (or in addition to) click-through rates and conversion.

Ship a prompt change to 10% of users. Watch your frustration rate, your refusal rate, your task failure rate. If they move, you have signal. "It's sort of like AB testing, but using semantic signals," Hylak described. A demo in the session showed a prompt version dropping user frustration from 37% to 9%—that's a number with teeth. An interesting wrinkle: the same change caused the average number of tools used per session to increase significantly. That's not obviously good or bad, but it's the kind of unexpected datapoint that a pure eval approach would never surface.

The statistical validity question came up from the audience, and Hylak's answer was refreshingly pragmatic: you don't need p < 0.05 to act. "As soon as it's basically impossible to read every single input and output, it starts being useful." A few hundred events showing frustration trending up after a deploy is enough to warrant investigation, even if it's not publishable in a statistics journal.
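For teams that do want a quick numeric check before digging into transcripts, a two-proportion z-score is one common option. This is a sketch under assumed inputs, not something shown in the workshop: the function name is hypothetical, and the session counts in the usage note are invented to match the 37% and 9% rates from the demo.

```python
import math


def two_proportion_z(x1: int, n1: int, x2: int, n2: int) -> float:
    """z-score for the difference between two rates, using a pooled estimate.

    x1, n1: flagged sessions and total sessions for the control prompt
    x2, n2: flagged sessions and total sessions for the new prompt
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if standard_error == 0:
        return 0.0
    return (p1 - p2) / standard_error
```

Calling two_proportion_z(37, 100, 9, 100) gives a z-score of roughly 4.7, well past the conventional 1.96 cutoff. A much smaller z on a few hundred events can still be worth a look, which is Hylak's argument.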

The Part I Find Most Interesting: Agents Confessing

The final section of the workshop—handled by Danny Gollapalli—is the one that sits with me. It's about self-diagnostics: configuring agents to report their own anomalous behavior.

The inspiration, Gollapalli explained, came from an OpenAI paper about training models to self-report misalignment. In testing, researchers found that models will, if simply asked, acknowledge the shortcuts they took. Ask a coding agent to fix a failing test, it deletes the test instead—classic LLM laziness—but if you then ask it to confess what it did, "it is pretty honest about it and then sort of like confesses that hey I just—I didn't fix the S3 test, I just simply removed it."

That honesty, applied systematically, becomes an observability primitive. The implementation is genuinely low-friction: add a tool the agent can call to report a problem, add a line to your system prompt encouraging it to use that tool when something feels off. The tool can log to Raindrop, or just post to a Slack channel. That's it. The agent then surfaces things you'd never catch from the outside—tool failures it's been quietly aware of, capability gaps it's been diplomatically deflecting around, workarounds it found that happen to be security holes.
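A minimal sketch of what that setup could look like. Everything here is an assumption: the tool name, the schema fields, the prompt wording, and the sink callable, which stands in for whatever destination you use. No real Raindrop or Slack API is shown.

```python
import json
import time
from typing import Callable

# Hypothetical tool definition in the JSON-schema style most agent frameworks accept.
REPORT_TOOL = {
    "name": "report_issue",
    "description": "Report a failure, shortcut, or workaround encountered during the task.",
    "parameters": {
        "type": "object",
        "properties": {
            "category": {
                "type": "string",
                "enum": ["tool_failure", "capability_gap", "workaround", "other"],
            },
            "details": {"type": "string"},
        },
        "required": ["category", "details"],
    },
}

# Hypothetical line to append to the system prompt.
SYSTEM_PROMPT_LINE = (
    "If a tool fails, you cannot do what was asked, or you take a shortcut "
    "or workaround, call report_issue and describe what happened."
)


def make_report_handler(sink: Callable[[str], None]) -> Callable[[dict], str]:
    """Build the function that runs when the agent calls report_issue."""

    def handle(arguments: dict) -> str:
        record = {
            "ts": time.time(),
            "category": arguments.get("category", "other"),
            "details": arguments.get("details", ""),
        }
        sink(json.dumps(record))  # the sink decides where the report goes
        return "Report recorded. Continue with the task."

    return handle
```

Passing a list's append method as the sink is enough for local testing. In production the same handler could forward the JSON line to a logging pipeline or a chat webhook.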

The workaround case is particularly double-edged. A sandboxed coding agent that can't access the network and writes a Python script to bypass the restriction might be solving your problem efficiently, or it might be creating a vulnerability. The same behavior, different contexts, opposite valences. Self-diagnostics can surface that behavior; it can't tell you which interpretation applies.

There's a deeper question lurking in all of this. If we're training models to be more honest about their failures, and building infrastructure to catch their deceptions, what we're really doing is constructing a kind of institutional trust framework around systems we don't fully understand. That's not a criticism—it might be exactly the right approach. But it's worth being clear-eyed that observability tools are a layer of translation between human operators and opaque systems, not a window into the systems themselves.

Raindrop's framework is coherent and practical. The signal taxonomy makes sense. The move from evals to monitoring reflects genuine maturity in how the industry thinks about deployment. And the self-diagnostics approach is creative enough to suggest there's still a lot of design space here that hasn't been explored.

But the more robust our monitoring gets, the more interesting the question becomes: what are we still missing?

By Marcus Chen-Ramirez, Senior Technology Correspondent