An RL Agent for ETL Pipeline Self-Healing

Picture the scenario Anna Marie Benzon opens with: a production data job fails at some ungodly hour. An engineer drags themselves to a dashboard, spends the better part of a day tracing logs, diffing schemas, interrogating upstream sources—and when the dust settles, the failure itself was almost trivial. The expensive part was never the bug. It was everything wrapped around it: the inspection, the diagnosis, the careful selection of a fix, the re-run, and the anxious confirmation that the cure didn't quietly introduce a new disease.

That's the problem Benzon built a system to attack. In a recent talk for the AI Engineer channel, she walks through an architecture that uses a reinforcement learning agent to detect and remediate ETL pipeline failures on AWS, cutting mean time to resolution (MTTR) from a modeled baseline of roughly 2.5 working days to about 5.24 minutes for cases the system can confidently handle. Across 30 controlled synthetic runs, Benzon reports a reduction in MTTR of approximately 99.85%. That is a large effect, not a rounding error, though it is measured against a modeled baseline rather than an observed one.
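As a quick check on that arithmetic, the sketch below computes the percentage reduction under two readings of "2.5 days". Both conversions are assumptions made here for illustration; the talk's own baseline model is not reproduced, and the function name is hypothetical.

```python
def mttr_reduction_pct(baseline_minutes: float, new_minutes: float) -> float:
    """Percentage reduction in MTTR relative to the baseline."""
    return (1 - new_minutes / baseline_minutes) * 100


AGENT_MINUTES = 5.24

# Assumption: 2.5 calendar days of 24 hours each (3600 minutes).
calendar_baseline = 2.5 * 24 * 60
# Assumption: 2.5 working days of 8 hours each (1200 minutes).
working_baseline = 2.5 * 8 * 60

calendar_reduction = mttr_reduction_pct(calendar_baseline, AGENT_MINUTES)
working_reduction = mttr_reduction_pct(working_baseline, AGENT_MINUTES)

print(f"24-hour days: {calendar_reduction:.2f}%")  # 99.85%
print(f"8-hour days:  {working_reduction:.2f}%")   # 99.56%
```

The reported 99.85% lines up with the 24-hour reading; with 8-hour working days the same figures give about 99.56%. Either way the reduction is above 99%.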

The obvious question is: okay, but what's the catch?

What the System Actually Does

The architecture sits on familiar AWS infrastructure. An AWS Glue job emits a failure event. EventBridge catches it and triggers a Lambda function housing the agent. From there, Lambda reads from two strictly read-only sources—CloudWatch for error logs and the Glue Data Catalog for current schema metadata—and begins constructing a picture of what went wrong.
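The entry point of that flow can be sketched as a Lambda handler that first decides whether the incoming EventBridge event is a Glue failure at all. This is a minimal sketch, not Benzon's code: it assumes the event follows the "Glue Job State Change" shape (`jobName`, `jobRunId`, `state`, `message` under `detail`), and `FailureContext`, `extract_failure_context`, and the set of failure states are hypothetical names chosen for illustration. It makes no AWS calls, so the read-only CloudWatch and Data Catalog lookups appear only as a comment.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: which Glue run states this sketch treats as failures.
FAILURE_STATES = {"FAILED", "TIMEOUT"}


@dataclass(frozen=True)
class FailureContext:
    job_name: str
    job_run_id: str
    state: str
    message: str


def extract_failure_context(event: dict) -> Optional[FailureContext]:
    """Return a FailureContext for a failed Glue run, or None otherwise."""
    if event.get("source") != "aws.glue":
        return None
    detail = event.get("detail", {})
    state = detail.get("state", "")
    if state not in FAILURE_STATES:
        return None
    return FailureContext(
        job_name=detail.get("jobName", ""),
        job_run_id=detail.get("jobRunId", ""),
        state=state,
        message=detail.get("message", ""),
    )


def lambda_handler(event: dict, context: object) -> dict:
    """Hypothetical entry point: parse the event, then hand off to the agent."""
    failure = extract_failure_context(event)
    if failure is None:
        return {"handled": False}
    # The read-only CloudWatch and Data Catalog lookups would happen here.
    return {"handled": True, "job": failure.job_name, "run": failure.job_run_id}


sample_event = {
    "source": "aws.glue",
    "detail-type": "Glue Job State Change",
    "detail": {
        "jobName": "orders_etl",
        "jobRunId": "jr_example",
        "state": "FAILED",
        "message": "Column not found",
    },
}
result = lambda_handler(sample_event, None)
print(result)
```

Keeping the parsing step as a pure function means it can be tested without any AWS credentials, which suits a component that is only supposed to read.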

What's worth pausing on here is the deliberate layering Benzon describes. The "intelligence" isn't a single monolithic model making holistic judgments. It's three explicitly separated concerns working in sequence:

Deterministic anomaly detection handles observable facts—schema drift, null-rate spikes, type changes, field additions and removals. These aren't learned; they're rule-based. A field disappeared. A null rate crossed a threshold. These are conditions you can write an explicit rule for, and Benzon argues that's exactly right: "An explicit rule is easier to validate than a learned component with a richer but less interpretable incident history."
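Rules of this kind are short enough to read in one sitting. The following is a minimal sketch of what such checks might look like, assuming schemas are plain column-to-type mappings and null rates are fractions between 0 and 1; the 0.20 threshold and all names are illustrative assumptions, not values from the talk.

```python
from typing import Dict, List

# Assumption: flag a column whose null rate rises by more than this amount.
NULL_RATE_SPIKE = 0.20


def detect_anomalies(
    expected_schema: Dict[str, str],
    observed_schema: Dict[str, str],
    baseline_null_rates: Dict[str, float],
    observed_null_rates: Dict[str, float],
) -> List[str]:
    """Apply explicit rules and return one finding string per violation."""
    findings = []
    # Fields that disappeared from the schema.
    for col in sorted(expected_schema.keys() - observed_schema.keys()):
        findings.append(f"field_removed:{col}")
    # Fields that appeared unexpectedly.
    for col in sorted(observed_schema.keys() - expected_schema.keys()):
        findings.append(f"field_added:{col}")
    # Fields present in both schemas whose type changed.
    for col in sorted(expected_schema.keys() & observed_schema.keys()):
        if expected_schema[col] != observed_schema[col]:
            findings.append(f"type_changed:{col}")
    # Null-rate increases above the fixed threshold.
    for col in sorted(observed_null_rates):
        increase = observed_null_rates[col] - baseline_null_rates.get(col, 0.0)
        if increase > NULL_RATE_SPIKE:
            findings.append(f"null_rate_spike:{col}")
    return findings


findings = detect_anomalies(
    expected_schema={"id": "bigint", "amount": "double", "region": "string"},
    observed_schema={"id": "string", "amount": "double", "channel": "string"},
    baseline_null_rates={"amount": 0.01},
    observed_null_rates={"amount": 0.35},
)
print(findings)
# ['field_removed:region', 'field_added:channel', 'type_changed:id', 'null_rate_spike:amount']
```

Every finding traces back to a single comparison, which is the validation advantage the quote points at.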

A Q-learning policy handles contextual action selection. Given a compact state representation—failure category, risk level, data quality conditions—the policy selects from six possible responses: retry, schema coercion, rollback, quarantine, escalate, or log. The state and action spaces are intentionally small, which means the Q-tables are small, which means every decision is fully inspectable. You can look at the table and ask: for this state, what did the policy value most, and why?
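To make "look at the table" concrete, here is a minimal tabular Q-learning sketch over the six actions named above. It uses the standard Q-learning update rule; the state tuple layout, hyperparameters, reward values, and the `QPolicy` class are all assumptions for illustration and are not taken from Benzon's implementation.

```python
import random
from collections import defaultdict
from typing import Dict, Optional, Tuple

ACTIONS = ["retry", "schema_coercion", "rollback", "quarantine", "escalate", "log"]
# Assumed state layout: (failure_category, risk_level, data_quality).
State = Tuple[str, str, str]


class QPolicy:
    def __init__(self, alpha: float = 0.1, gamma: float = 0.9,
                 epsilon: float = 0.1, seed: int = 42) -> None:
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.rng = random.Random(seed)
        # One row per state, one column per action, all starting at zero.
        self.q: Dict[State, Dict[str, float]] = defaultdict(
            lambda: {a: 0.0 for a in ACTIONS}
        )

    def select(self, state: State, explore: bool = True) -> str:
        """Epsilon-greedy selection; greedy only when explore is False."""
        if explore and self.rng.random() < self.epsilon:
            return self.rng.choice(ACTIONS)
        values = self.q[state]
        return max(ACTIONS, key=lambda a: values[a])

    def update(self, state: State, action: str, reward: float,
               next_state: Optional[State] = None) -> None:
        """Standard Q-learning update; next_state=None marks a terminal step."""
        future = 0.0 if next_state is None else max(self.q[next_state].values())
        target = reward + self.gamma * future
        self.q[state][action] += self.alpha * (target - self.q[state][action])


policy = QPolicy()
state = ("schema_drift", "low", "clean")
for _ in range(50):
    policy.update(state, "schema_coercion", reward=1.0)   # assumed reward
    policy.update(state, "retry", reward=-1.0)            # assumed penalty

best = policy.select(state, explore=False)
print(best, policy.q[state])
```

Printing `policy.q[state]` shows the full row for that state, so the answer to "what did the policy value most" is a dictionary lookup rather than a model interpretation exercise.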

An external safety layer sits outside the learned policy entirely and operates as a hard override. If the anomaly is classified as critical but the policy proposes something passive—just log it, say—the safety layer overrides that and escalates. Benzon is clear about why this separation matters: "A policy update cannot silently redefine its own authority." The safety constraints aren't a feature of the model. They're above it.
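The override described here is small enough to express as a pure function that never consults the Q-table. A minimal sketch, assuming severity arrives as a string label and that "log" is the only passive action (the text gives it as the example; a real list may be longer). The function and constant names are hypothetical.

```python
# Assumption: actions considered too passive for a critical anomaly.
PASSIVE_ACTIONS = frozenset({"log"})


def apply_safety_layer(severity: str, proposed_action: str) -> str:
    """Hard override applied after the policy has chosen an action.

    The rule lives outside the learned policy, so retraining the policy
    cannot change it.
    """
    if severity == "critical" and proposed_action in PASSIVE_ACTIONS:
        return "escalate"
    return proposed_action


print(apply_safety_layer("critical", "log"))       # escalate
print(apply_safety_layer("low", "log"))            # log
print(apply_safety_layer("critical", "rollback"))  # rollback
```

Because the function takes the policy's output as an argument rather than living inside it, the boundary of the policy's authority is reviewed like any other code change.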

That last architectural choice is doing more philosophical work than it might appear. One of the consistent failure modes in deployed ML systems is that learned components can subtly shift their behavior as they update, including the implicit boundary of what they'll decide on their own. By placing the safety layer outside the policy, Benzon is treating that boundary as a hard engineering invariant rather than an emergent property of training.

The Part That's Most Interesting

The benchmark results are reported with a candor that is unusual for this kind of feasibility talk.

The anomaly detector achieved perfect precision (1.0) but a recall of 0.8—it never false-flagged, but it missed about 20% of actual anomalies. For operations contexts, that conservatism is likely a feature: a false positive that triggers an unwanted remediation action is often worse than a miss that falls through to a human queue. But Benzon doesn't spin this. "Perfect precision does not mean perfect detection," she notes directly.
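For readers who want the definitions behind those two numbers, the sketch below computes precision and recall from raw counts. The counts are illustrative only, chosen to reproduce the reported ratios; the talk's actual confusion matrix is not given here.

```python
from typing import Tuple


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Illustrative counts: 8 anomalies caught, 0 false alarms, 2 anomalies missed.
precision, recall = precision_recall(tp=8, fp=0, fn=2)
print(precision, recall)  # 1.0 0.8
```

Zero false positives pins precision at 1.0 regardless of how many anomalies were missed, which is why the two numbers have to be read together.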

The simulated success rate across 30 runs was 74.63% (±1.51 percentage points). The non-escalation rate was 88.63%. Taken together, that means roughly one in four incidents (25.37%) was not resolved successfully and would still need human attention, and about one in nine (11.37%) was escalated rather than handled automatically. In the context of a feasibility study on synthetic data, those numbers are reasonable. In a production environment with messier, more diverse incident types, they could look quite different.

Benzon addresses this honestly. The results come from synthetic scenarios. The agent responds after a failure signal—it doesn't predict failures before they happen. Real incident diversity will stress the current state space. This is a credible feasibility demonstration, not a deployment recommendation.

And then there's a finding Benzon calls "the most useful part of the project," which is genuinely counterintuitive: the RL policy matches an equivalent hand-defined deterministic policy within 0.19 percentage points. In this compact state space, the learned policy doesn't outperform thoughtfully written rules. What it does do is provide a decision service that can accumulate preference data over time—one that becomes more valuable as incident history grows and maintaining action-preference rules by hand becomes increasingly impractical.

That's the honest pitch for RL here: not "it's smarter than rules," but "it's a structured way to let learned preferences replace manual curation as the problem gets more complex."

Escalation as a First-Class Outcome

One framing choice in this talk deserves explicit notice because it runs against a common instinct in automation design.

Benzon includes escalation in the agent's action space—not as a fallback or a failure mode, but as a legitimate first-class outcome. "The ability to say 'I should not do this automatically' is the capability," she argues. "If success is measured only by non-escalation, the optimization target is wrong."

This matters because a lot of automated systems are implicitly or explicitly optimized to avoid escalating, since escalation "costs" human time and looks like a failure in dashboards. The perverse result is systems that take action when they shouldn't—confident remediations that make things worse—because the incentive structure punishes them for the honest answer of "I don't have enough information to act safely here."
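The incentive problem can be shown with two scoring functions applied to the same incidents. This is an illustrative sketch only: the `Outcome` fields, the ground-truth `safe_to_automate` label, and both metric definitions are hypothetical and are not the metrics used in Benzon's benchmark.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Outcome:
    action: str             # what the agent did
    safe_to_automate: bool  # hypothetical ground-truth label
    fixed: bool             # whether an automated action resolved the incident


def non_escalation_rate(outcomes: List[Outcome]) -> float:
    """Rewards any action other than escalation, right or wrong."""
    return sum(o.action != "escalate" for o in outcomes) / len(outcomes)


def appropriate_outcome_rate(outcomes: List[Outcome]) -> float:
    """Counts a correct fix or a justified escalation as a success."""
    good = 0
    for o in outcomes:
        if o.action == "escalate":
            good += int(not o.safe_to_automate)
        else:
            good += int(o.safe_to_automate and o.fixed)
    return good / len(outcomes)


# Same four incidents, two agents: one always acts, one escalates when unsure.
eager = [
    Outcome("retry", True, True),
    Outcome("schema_coercion", True, True),
    Outcome("rollback", False, False),
    Outcome("retry", False, False),
]
cautious = [
    Outcome("retry", True, True),
    Outcome("schema_coercion", True, True),
    Outcome("escalate", False, False),
    Outcome("escalate", False, False),
]

print(non_escalation_rate(eager), appropriate_outcome_rate(eager))        # 1.0 0.5
print(non_escalation_rate(cautious), appropriate_outcome_rate(cautious))  # 0.5 1.0
```

Under the first metric the eager agent looks perfect; under the second it is the cautious one. Which metric the dashboard shows decides which behavior gets optimized.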

Framing escalation as a success condition rather than a failure condition is a subtle but meaningful design choice. Whether other teams building similar systems adopt that framing is worth watching.

What's Unresolved

The system's validation boundary is synthetic. That's not a fatal flaw: a well-constructed synthetic benchmark with 95% confidence intervals and repeated runs across varied seed values (30 seeds, from 42 to 71) is more rigorous than most one-off demos. The code, benchmarks, and experiment scripts are publicly available on GitHub, which means other engineers can inspect and reproduce the logic.
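For context on what a seed sweep with a confidence interval involves, here is a minimal sketch. Seeds 42 through 71 inclusive give 30 values, matching the 30 runs. The `run_benchmark` function is a placeholder that returns made-up numbers, not the project's benchmark, and the Student-t interval is one common choice; how the talk's intervals were computed is not stated here.

```python
import math
import random
import statistics
from typing import List, Tuple

SEEDS = list(range(42, 72))   # 42..71 inclusive: 30 seeds
T_CRIT_95_DF29 = 2.045        # two-sided 95% Student-t critical value, 29 d.o.f.


def run_benchmark(seed: int) -> float:
    """Placeholder for one benchmark run; returns a made-up success rate."""
    rng = random.Random(seed)
    return 0.75 + rng.uniform(-0.05, 0.05)


def mean_with_ci(values: List[float], t_crit: float) -> Tuple[float, float]:
    """Return the sample mean and the half-width of its confidence interval."""
    mean = statistics.mean(values)
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return mean, t_crit * sem


rates = [run_benchmark(s) for s in SEEDS]
mean, half_width = mean_with_ci(rates, T_CRIT_95_DF29)
print(f"{mean:.4f} +/- {half_width:.4f} over {len(SEEDS)} seeds")
```

Fixing the seed list up front is what makes the sweep reproducible: anyone rerunning the script gets the same 30 runs.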

But the gap between synthetic and production is where most automated remediation systems run into trouble. Real ETL failures have context that schemas and error logs don't fully capture: upstream team decisions, known quirks in specific data vendors, business rules that aren't written down anywhere. How well the current state representation handles those cases—or how gracefully it escalates when it can't—is an empirical question that only shadow-mode deployment will answer.

Benzon's stated next step is exactly that: shadow deployment, where the agent makes recommendations without execution authority, so its judgment can be compared against what engineers actually do. That's the right move, and it's how you find out whether the 74.63% success rate holds up outside the lab.
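A shadow-mode comparison can be as simple as logging pairs of (agent recommendation, engineer action) and summarizing where they diverge. The sketch below assumes that pairing already exists; the function name and the sample log are hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple


def shadow_report(pairs: Iterable[Tuple[str, str]]) -> Tuple[float, Counter]:
    """Compare agent recommendations with what engineers actually did.

    Each pair is (agent_recommendation, engineer_action). Returns the
    agreement rate and a count of each distinct disagreement.
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("shadow log is empty")
    agree = sum(agent == human for agent, human in pairs)
    disagreements = Counter(
        (agent, human) for agent, human in pairs if agent != human
    )
    return agree / len(pairs), disagreements


# Hypothetical shadow log: the agent recommends, the engineer decides.
shadow_log = [
    ("retry", "retry"),
    ("schema_coercion", "schema_coercion"),
    ("log", "escalate"),
    ("rollback", "rollback"),
]
agreement, disagreements = shadow_report(shadow_log)
print(agreement)       # 0.75
print(disagreements)   # Counter({('log', 'escalate'): 1})
```

The disagreement counts are the more useful output: a cluster of cases where the agent said "log" and the engineer escalated points at a gap in the state representation or the safety rules.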

The broader design philosophy here—deterministic rules for observable facts, learning for contextual preference, hard external safety constraints, escalation as a legitimate outcome—is not specific to ETL pipelines. It's a template worth examining for anyone building operational agents in domains where the cost of a wrong automated action is high. Whether it scales cleanly into messier problem spaces, or whether the careful bounding that makes it trustworthy in this context is also what limits it in others, is the question the next phase of this work will have to answer.

Dev Kapoor covers open source software, developer communities, and the politics of code for Buzzrag.