Claude Code's Self-Review Problem Has a Fix

"You ask Claude Code to judge what it's written, it's like, 'Oh yeah, sick. A+.'"

Honestly? My first reaction to Chase's framing here was that it's a little darkly funny: a hyper-capable system that writes production code and then grades itself with the confidence of a student who definitely did not read the rubric. The second reaction, once the laugh faded, was that this matters quite a lot. If you're a non-technical builder and Claude is both writing and evaluating your code, you have no external check at any point in the process. The model's widely observed tendency toward self-flattery isn't a quirk; it's a structural blind spot in the whole workflow. Chase, who runs the Chase AI channel and has been systematically building out Claude Code tooling, released a video this week that tries to patch exactly that gap.

The skill he's sharing is called grill-me-codex, and the core idea is simple enough to explain in a sentence: make Claude Code defend its plans to a second model before anyone writes a line of implementation code.

The problem it's actually solving

A whole cottage industry of Claude Code "skills" — Matt Pocock's Grill Me, GSD, Superpowers — has grown up around one shared diagnosis: the gap between what you mean to build and what you tell Claude you want to build is wide, and that gap fills with assumptions that quietly degrade your output. These tools run you through a structured interrogation before Claude touches your codebase. They're good. Chase himself calls them "plan mode on steroids."

But all of them hit the same ceiling. Even after you and Claude are aligned on what to build, you're still left with one model evaluating one model's plan. Claude's tendency to speak warmly about its own output isn't a conspiracy — it's a known behavioral characteristic of instruction-tuned assistants that are trained, in part, to be helpful and agreeable. Chase notes this in the video, though it's worth flagging that he doesn't cite a specific Anthropic source for it; it's more accurately described as a widely observed pattern than an official acknowledgment. The point stands either way: if you can't read code, and the model won't tell you when its plan is bad, you have no floor.

That's the gap grill-me-codex is built to close. It extends Pocock's Grill Me foundation with a second phase: once the planning questions are done and Claude has produced a plan.md, OpenAI's Codex (currently the relaunched CLI-based Codex agent — distinct from the original Codex model that powered GitHub Copilot and was deprecated in early 2023) is invoked headlessly to review it. The two models then go back and forth in a capped loop — five rounds maximum, configurable — with Claude revising the plan in response to Codex's objections and Codex checking whether the revisions actually hold.
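The capped back-and-forth is easier to picture as code. What follows is a minimal sketch of the pattern, not Chase's implementation: `review_loop`, `toy_critic`, and `toy_reviser` are hypothetical names, and the two toy functions stand in for the real Codex and Claude calls so the example runs without any model access.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

MAX_ROUNDS = 5  # the cap described in the article; treated here as a default


@dataclass
class Review:
    approved: bool
    issues: List[str] = field(default_factory=list)


def review_loop(
    plan: str,
    critic: Callable[[str], Review],           # hypothetical stand-in for the Codex review
    reviser: Callable[[str, List[str]], str],  # hypothetical stand-in for Claude's revision
    max_rounds: int = MAX_ROUNDS,
) -> Tuple[str, bool, int]:
    """Alternate critique and revision until approval or the round cap."""
    for round_no in range(1, max_rounds + 1):
        review = critic(plan)
        if review.approved:
            return plan, True, round_no
        plan = reviser(plan, review.issues)
    # Cap reached without sign-off: return the latest plan, unapproved.
    return plan, False, max_rounds


def toy_critic(plan: str) -> Review:
    # Assumption for the demo: "reviewing" means checking for required phrases.
    missing = [req for req in ("rate limit", "input validation") if req not in plan]
    return Review(approved=not missing, issues=missing)


def toy_reviser(plan: str, issues: List[str]) -> str:
    # Fix only the first issue per round so the loop needs several passes.
    return plan + f"\n- add {issues[0]}"


final_plan, approved, rounds = review_loop(
    "Plan: email-gated download page", toy_critic, toy_reviser
)
print(approved, rounds)
```

The cap matters as much as the loop: without it, two models that never converge would keep revising indefinitely, so the function hands back an unapproved plan rather than pretending consensus was reached.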

How the loop actually works

The part that took me a moment to fully clock: Codex running "headlessly" doesn't mean it's stateless. Chase passes the session ID across iterations, so each Codex round has access to the full prior exchange rather than re-reading the plan cold each time. To be precise, that continuity exists because the skill carries the session ID forward on every call; a one-off headless invocation won't remember earlier rounds unless it's told which session to pick up. Think of it less like a fresh referee each round and more like a debate judge who keeps their notes. Codex in round two knows what it flagged in round one and whether Claude's fixes were genuine. That's what makes the iteration meaningful rather than redundant.
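Here is a rough sketch of why the carried-over context matters. Everything in it is an assumption for illustration: `CriticSession` is a hypothetical class, the session ID is made up, and the "review" is a plain substring check rather than a model call. The point is only that a reviewer with history can separate new problems from ones it already raised that are still unfixed.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CriticSession:
    """Hypothetical reviewer whose notes persist across rounds."""
    session_id: str
    history: List[dict] = field(default_factory=list)

    def review(self, plan: str, required: List[str]) -> Dict[str, List[str]]:
        # Everything this session flagged in earlier rounds.
        previously_flagged = {i for turn in self.history for i in turn["issues"]}
        issues = [r for r in required if r not in plan]
        result = {
            "new": [i for i in issues if i not in previously_flagged],
            # Flagged before and still missing: the claimed fix did not land.
            "still_open": [i for i in issues if i in previously_flagged],
        }
        self.history.append({"plan": plan, "issues": issues})
        return result


session = CriticSession(session_id="demo-session")  # illustrative ID
checks = ["double opt-in", "rate limit"]

round_one = session.review("Plan v1: collect email, send link", checks)
round_two = session.review("Plan v2: collect email, send link, rate limit", checks)
print(round_one)
print(round_two)
```

A stateless reviewer would report the missing double opt-in in round two as if it were seeing it for the first time. The one with notes can say the sharper thing: this was raised already and the revision didn't address it.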

The outputs are two markdown files: plan.md (the living document, updated after each round Claude accepts feedback) and plan_review_log.md (the full transcript of the back-and-forth, which Chase aptly calls "where the sausage is made"). The plan.md at the end is the thing that gets built.

In Chase's demo — adding an email-gated download page to his agency site — the system ran three rounds before Codex approved. Round one surfaced eleven issues: an unbounded client skill slug, a case-sensitive DDoS bypass, a raw bombing vector, a table-scanning rate limit. Real stuff. Round two caught four more — and crucially, it caught false fixes, places where Claude had claimed to resolve an issue but hadn't actually wired it. The double opt-in that was described but not implemented. An expression index dedup that Supabase's JS client can't target. By round three, three low-severity non-blockers remained and Codex signed off.
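The case-sensitivity finding is the easiest of these to picture. The sketch below assumes the check in question is keyed on an email address, which is my reading and not something the walkthrough spells out; the function names and the threshold are hypothetical. It shows how a per-address limit that compares raw strings can be sidestepped by changing capitalization.

```python
from collections import Counter

MAX_REQUESTS = 3  # illustrative threshold, not a value from the demo


def allowed_naive(counts: Counter, email: str) -> bool:
    counts[email] += 1  # raw string used as the key
    return counts[email] <= MAX_REQUESTS


def allowed_normalized(counts: Counter, email: str) -> bool:
    key = email.strip().lower()  # fold case before counting
    counts[key] += 1
    return counts[key] <= MAX_REQUESTS


variants = [
    "a@example.com",
    "A@example.com",
    "a@Example.com",
    "A@EXAMPLE.COM",
    "a@EXAMPLE.com",
]

naive, normalized = Counter(), Counter()
naive_results = [allowed_naive(naive, v) for v in variants]
normalized_results = [allowed_normalized(normalized, v) for v in variants]
print(naive_results)
print(normalized_results)
```

The naive version lets all five requests through because each spelling gets its own counter. The normalized version cuts off after the third.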

"We caught real security and correctness holes," Chase says in the walkthrough, listing the findings with the satisfied energy of someone who just found out their building has a gas leak before moving in.

That's a fair way to put it. The adversarial review pattern is getting real traction right now — Anthropic's own Ultra Review feature uses a similar multi-agent attack structure — and the results in Chase's demo are exactly the kind of thing that's easy to miss when you're not a developer: not just bugs, but architectural assumptions that would've caused pain at scale.

The honest question this raises

Here's where I want to slow down, because this is the part that deserves more than a paragraph.

Two models reaching consensus isn't the same as two models being right.

That's not a gotcha; it's a genuine architectural limit of the whole pattern. If Claude and Codex share training data distributions or similar blind spots about what "secure" or "optimal" means in a particular codebase context, or if Codex's feedback is itself shaped by the same agreeable tendencies Chase is trying to escape, then the iterative loop is doing some work but not the work you're imagining. You're not getting an independent audit; you're getting a second opinion from a colleague who went to a very similar school.

What the demo actually shows is that even with this caveat, the setup catches things. Eleven issues in round one isn't a theoretical edge case — it's a practical demonstration that a single-model planning pass misses real problems. And the false-fix detection in round two is genuinely interesting: Codex in that round isn't just re-reviewing the original plan, it's evaluating whether Claude's claimed improvements are semantically coherent. That's harder than flagging issues, and it's the move that separates iterative review from just running the same check twice.

Whether the loop catches enough is a question nobody can answer in general — it depends entirely on the complexity of what you're building and how well your prompts define the domain. What Chase has built is a floor, not a ceiling. For the non-technical Claude Code user who currently has no floor at all, that's a meaningful upgrade. For someone building financial infrastructure or anything with serious security surface area, I'd be cautious about treating multi-model consensus as a substitute for human review — which is a gap the Vercel DeepSec approach is trying to address from a different direction.

The model-swappability Chase mentions is worth flagging too. He's explicit that the Codex component can be replaced with DeepSeek, a local model, or anything else you can hook into the skill — the bones are there. That flexibility is real, but it also means the quality of your adversarial review scales directly with the quality of the model you're using as a critic. Swap in a weaker model and you might be buying confidence you haven't earned.
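To make the swappability point concrete, here is a minimal sketch assuming the reviewer sits behind a small interface. `Critic`, `StrictCritic`, `LenientCritic`, and `sign_off` are hypothetical names, not anything from the skill, and both critics are toy rule checks rather than models.

```python
from typing import List, Protocol


class Critic(Protocol):
    """Anything with a matching review method can play the reviewer."""

    def review(self, plan: str) -> List[str]: ...


class StrictCritic:
    def review(self, plan: str) -> List[str]:
        needed = ["rate limit", "double opt-in", "input validation"]
        return [n for n in needed if n not in plan]


class LenientCritic:
    def review(self, plan: str) -> List[str]:
        # A weaker reviewer checks less, so it approves more easily.
        return [] if "rate limit" in plan else ["rate limit"]


def sign_off(plan: str, critic: Critic) -> bool:
    # Approval just means the critic found nothing to object to.
    return not critic.review(plan)


plan = "Plan: email form with rate limit"
strict_ok = sign_off(plan, StrictCritic())
lenient_ok = sign_off(plan, LenientCritic())
print(strict_ok, lenient_ok)
```

Same plan, same loop, opposite verdicts. The approval only means as much as the critic that issued it.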

The OpenAI Codex plugin has been in the Claude Code ecosystem for a bit now, but mostly as a one-shot review rather than an iterative partner. What Chase has done is make the conversation go both ways and give it a memory. That's a genuinely different use of the tool — closer to pair programming than QA — and the output from his demo suggests it's doing more than cosmetic work.

Whether "both models approved it" eventually becomes a standard checkpoint in AI-assisted development pipelines, or whether it turns out to be one confident layer on top of another — that's the question the next few months of production deployments will actually answer.

— Yuki Okonkwo, AI & Machine Learning Correspondent, Buzzrag