
CNN's Strategy for AI That Doesn't Break User Trust

CNN's SVP of Product reveals how they architect AI systems with human judgment positioned where it actually matters—not everywhere, not nowhere.

Written by AI. Rachel "Rach" Kovacs

February 27, 2026


Photo: Product School / YouTube

There's a problem with how most organizations deploy human review in their AI systems: they're using it as a safety blanket instead of a scalpel.

Ashley Nutter, SVP of Product at CNN, laid out the failure mode at a recent Product School talk. When companies lack confidence in their AI agents, they insert human reviewers. The humans become bottlenecks. The system scales linearly—more decisions mean more humans mean more cost. Different reviewers make different calls. Workarounds proliferate. And critically, the team never fixes the underlying system because humans are masking its failures.

"Human in the loop is one way that we often mitigate those risks," Nutter explained. "But the problem is the way that human in the loop is employed, it often serves as a bottleneck that slows us down or a sort of gate that keeps us from getting to where we want to be on that spectrum."

The question isn't whether to involve humans. It's where that involvement has outsized impact.

The Risk-Confidence Trap

Most AI systems trigger human review when the agent's confidence drops below a threshold. This sounds reasonable until you map it to actual product risk.

Nutter's example: an AI agent managing subscriptions encounters an edge case—a user who moved between geographic markets mid-billing cycle and lost access to geo-gated content. The agent isn't sure how to handle this. Low confidence triggers human review. Someone spends time on a $5 refund decision.

Meanwhile, the same agent confidently processes a refund for a corporate account representing significant revenue. No review happens because the agent felt sure about its choice.

"This is less about how confident is your agent and more about how bad is it if your agent is wrong," Nutter said.
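The two trigger styles differ by a single predicate. A minimal sketch of the contrast, using a simplified refund scenario with hypothetical thresholds (neither function reflects CNN's actual system):

```python
# Common pattern: escalate to a human whenever the agent feels unsure.
def confidence_trigger(confidence: float, floor: float = 0.6) -> bool:
    return confidence < floor

# Risk-based pattern: escalate whenever being wrong is expensive,
# regardless of how confident the agent is.
def risk_trigger(refund_amount: float, blast_radius: float = 500.0) -> bool:
    return refund_amount >= blast_radius
```

Under the confidence trigger, the uncertain $5 edge case gets a reviewer while the confident corporate refund sails through; under the risk trigger, it is the other way around.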

The distinction matters more as AI systems gain agency—the ability to make decisions and take actions without constant supervision. Organizations want this. Agency means scale, personalization, new capabilities. But agency also means unpredictability, and unpredictability threatens the user trust that brings people to your product in the first place.

For CNN, that trust is non-negotiable. One hundred percent of their journalism remains human-accountable. But that doesn't mean agents can't support production. It means being precise about where judgment matters.

Three Zones Where Judgment Actually Counts

Nutter identifies three scenarios where human involvement delivers disproportionate value:

First: ambiguous outcomes requiring nuance, context, or values judgment. CNN applies this to fairness standards—detecting bias in interview questions, for instance—not to fact-checking whether an event occurred at a specific time.

Second: irreversible consequences or reputationally costly errors. CNN's archival decisions involve millions of hours of footage from journalists worldwide. Deleting the wrong segment is irreversible, but it won't carry the same reputational weight as errors in published journalism.

Third: situations where the judgment actually improves the system over time. This is the leverage point most organizations miss.

Four Design Questions That Change Everything

Nutter's framework breaks down to four questions that determine whether human-in-the-loop becomes leverage or liability:

When does a human step in? The trigger should be explicit and testable—not based on vibes. Examples include external publication (CNN's editors review all journalism before it goes out), one-way doors (irreversible actions), blast radius (decisions affecting X users or Y revenue), regulated domains (financial or medical data), or scenario-based flags from red team testing.

CNN flags AI-generated product summaries involving firearms, alcohol, or religion for human review. These aren't confidence-based triggers. They're risk-based triggers that you can write tests for.
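Triggers like these can be expressed as plain predicates, which is exactly what makes them testable. The field names, topic list, and limits below are illustrative assumptions, not CNN's production rules:

```python
# Illustrative risk-based triggers: explicit rules you can unit-test,
# rather than a confidence score.
SENSITIVE_TOPICS = {"firearms", "alcohol", "religion"}

def review_triggers(action: dict) -> list[str]:
    """Return the name of every review trigger an action trips."""
    fired = []
    if action.get("external_publication"):
        fired.append("external_publication")
    if action.get("irreversible"):
        fired.append("one_way_door")
    if action.get("affected_users", 0) > 10_000:
        fired.append("blast_radius")
    if SENSITIVE_TOPICS & set(action.get("topics", [])):
        fired.append("sensitive_topic")
    return fired
```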

The trigger also needs calibration. Fire it too rarely and people add manual checks "just in case," undermining agency. Fire it too often and you get rubber-stamping: reviewers approve everything because they're seeing too many low-stakes decisions.

What is the human actually doing? This is a design choice with clear trade-offs. Binary feedback (approve/reject, archive/delete) scales well and maintains consistency, but provides weak signal for improving the agent. It works when decision criteria are clear and risk is well-understood—does this user behavior violate policy?

Open-ended feedback (comments, edits, explanations) captures judgment and improves the agent, but it's expensive and harder to standardize. Use it when rules are evolving or involve nuance. CNN applies this to editorial judgment calls.

In practice, most systems need a hybrid: binary decisions with open-ended feedback only on rejections, or staging where early training uses rich feedback that becomes more binary over time.
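One way to sketch that hybrid, assuming a simple review-capture API of my own invention: approvals stay binary, while rejections must carry a rationale the team can mine for improvements.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ReviewResult:
    approved: bool
    rationale: Optional[str] = None

def record_review(approved: bool, rationale: Optional[str] = None) -> ReviewResult:
    """Binary decision everywhere; rich feedback only where it teaches."""
    if not approved and not rationale:
        raise ValueError("rejections must carry a rationale "
                         "so the system can be improved")
    return ReviewResult(approved, rationale)
```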

What happens after the human acts? If the judgment isn't captured in system improvements—updated prompts, new evaluation criteria, training data adjustments—you're just creating permanent overhead.

Nutter's team noticed their editorial assistant repeatedly referring to Kamala Harris as "Vice President Harris" in 2025. Editors kept correcting it to "former Vice President Harris." That repetition signaled a gap in how their system updated for new information—a system-level fix, not a one-off correction.

"If your human reviewers are providing the same feedback over and over again, that should be encoded into the system," she noted.
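The signal she describes is easy to detect mechanically. A toy sketch, with an arbitrary repetition threshold:

```python
from collections import Counter

def repeated_corrections(corrections: list[tuple[str, str]],
                         min_count: int = 3) -> list[tuple[str, str]]:
    """Return (wrong, right) pairs seen at least min_count times --
    candidates for a system-level fix rather than one-off edits."""
    counts = Counter(corrections)
    return [pair for pair, n in counts.items() if n >= min_count]

# Four identical edits in the log flag the pair once for encoding:
flagged = repeated_corrections(
    [("Vice President Harris", "former Vice President Harris")] * 4
)
```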

How does it change over time? The exit strategy matters from day one. Without explicit criteria for downgrading human review, it becomes permanent. This is an organizational failure, not a technical one. Teams need to define success upfront in ways they can evaluate with data.

Good evaluation frameworks reveal whether human review is compensating for poor system design or whether risk is genuinely decreasing. They answer: are we ready to remove this review layer, and if we did, what's the potential blast radius?
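An explicit downgrade criterion can be as simple as a measured override rate over a minimum sample size. A sketch with illustrative numbers, not a prescribed policy:

```python
def ready_to_downgrade(overrides: int, reviews: int,
                       max_override_rate: float = 0.02,
                       min_reviews: int = 500) -> bool:
    """True when reviewers rarely override the agent across enough
    samples to trust the measurement."""
    if reviews < min_reviews:
        return False  # not enough evidence yet
    return overrides / reviews <= max_override_rate
```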

What This Actually Looks Like

The framework isn't theoretical. CNN operates in an environment where errors carry immediate reputational costs to a brand built on trust. They're deploying agents anyway because the alternative—doing everything manually—means delivering less value to fewer users.

Their approach treats judgment as a finite resource to be allocated strategically. Not everywhere. Not nowhere. Where it counts.

This reframes the entire conversation around AI safety and human oversight. The question stops being "should humans review this?" and becomes "what type of human involvement makes this system better?"

Binary review gates specific actions when criteria are clear. Open-ended review teaches the system when rules are ambiguous or evolving. Triggers based on product risk catch what actually matters. Repetitive corrections get encoded into the system. Success criteria determine when to step back.

The result is AI that scales without sacrificing the trust users came for in the first place. Which matters more as these systems move from making recommendations to making decisions—from suggesting a headline to generating metadata, from flagging content to publishing it.

"As AI systems become more agentic, we as product leaders are going to be deciding where does judgment live in the system," Nutter said. "It isn't going to be everywhere and it isn't going to be nowhere, but we have to figure out where it really counts."

That's the actual challenge. Not building AI that works. Building AI that works and preserves the thing that made people trust you before the AI arrived.

Rachel "Rach" Kovacs

Watch the Original Video

SVP of Product at CNN | Architecting Human-in-the-Loop Agentic Workflows to Scale Judgment

Product School

19m 53s
Watch on YouTube

About This Source

Product School

Product School is a widely recognized YouTube channel with 150,000 subscribers, established in December 2025. It is a leading resource for AI training tailored to product teams, endorsed by Fortune 500 companies and a vast community of over 2.5 million professionals. The channel specializes in delivering expert-led, live, and hands-on programs that are designed to equip organizations with practical AI skills, aiming to accelerate innovation and achieve tangible business outcomes.
