
AI That Improves Itself: Autoresearch Meets Claude Code

Andrej Karpathy's autoresearch framework now optimizes AI prompts autonomously. Developer Nick Saraev demonstrates how it works—and what it costs.

Written by Marcus Chen-Ramirez, an AI editorial voice.

March 14, 2026


Photo: Nick Saraev / YouTube

Nick Saraev's Claude Code skills work about 70% of the time. The other 30%? "A bag of rocks," he says. So he built a system that fixes them while he sleeps.

The technique comes from Andrej Karpathy, a founding member of OpenAI and former director of AI at Tesla, who released an "autoresearch" framework on GitHub last week. Karpathy designed it to optimize machine learning training runs; Saraev adapted it to do something more immediately practical: make AI prompts better through automated trial and error.

The concept is straightforward. You give an AI system three things: a measurable objective (like "generate diagrams with legible text"), an automated way to test results, and permission to modify its own instructions. Then you let it run. The system generates outputs, evaluates them against your criteria, tweaks the prompt, and repeats—hundreds of times if necessary—until it hits your target.
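The loop described above can be sketched in a few lines. This is a minimal illustration, not Saraev's actual implementation: the three helpers (`generate`, `evaluate`, `revise`) are hypothetical placeholders where a real setup would call a generation model, an automated checker, and a prompt-rewriting step.

```python
# Minimal sketch of the autoresearch loop: generate, evaluate, revise,
# repeat until the score hits the target. All three helpers are toy
# placeholders standing in for model API calls.

def generate(prompt: str) -> str:
    # Placeholder for a model call that produces an output from a prompt.
    return f"output for: {prompt}"

def evaluate(output: str) -> int:
    # Placeholder scorer (0-40). Here it simply rewards longer, more
    # specific prompts so the loop visibly converges.
    return min(40, 30 + len(output) // 20)

def revise(prompt: str) -> str:
    # Placeholder for asking the model to rewrite its own instructions
    # based on which checks failed.
    return prompt + " (keep text legible)"

def autoresearch(prompt: str, target: int = 39, max_iters: int = 100):
    best_prompt, best_score = prompt, evaluate(generate(prompt))
    for _ in range(max_iters):
        if best_score >= target:
            break
        candidate = revise(best_prompt)
        score = evaluate(generate(candidate))
        if score >= best_score:  # accept ties so the loop keeps moving
            best_prompt, best_score = candidate, score
    return best_prompt, best_score
```

The key property is that nothing in the loop needs a human: as long as `evaluate` is automated, the system can run unattended for as many iterations as the budget allows.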

The Economics of Self-Improvement

Saraev demonstrated this with a diagram generator that creates whiteboard-style visuals. His testing criteria: text must be legible and grammatically correct, colors should be soft pastels, layout should flow linearly, and numbers should be absent. Four binary checks per diagram, ten diagrams per test, maximum score of 40.
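The scoring scheme above (four binary checks per diagram, ten diagrams per run, maximum of 40) is simple enough to spell out. The check functions here are illustrative stand-ins; in practice each would be a yes/no question posed to a vision model about a generated image.

```python
# Scoring harness for the diagram generator: four yes/no checks per
# diagram, ten diagrams per test run, so a perfect run scores 40.
# The diagram dicts are hypothetical stand-ins for real evaluations.

CHECKS = [
    lambda d: d["text_legible"],        # text must be legible and correct
    lambda d: d["grammar_correct"],
    lambda d: d["pastel_colors"],       # colors should be soft pastels
    lambda d: not d["contains_numbers"],  # numbers should be absent
]

def score_diagram(diagram: dict) -> int:
    """Count how many of the four binary criteria this diagram passes."""
    return sum(1 for check in CHECKS if check(diagram))

def score_run(diagrams: list[dict]) -> int:
    """Total score for a test run of diagrams (max 4 * len(diagrams))."""
    return sum(score_diagram(d) for d in diagrams)
```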

The first automated run scored 32 out of 40. After several iterations, the system hit 39—a 97.5% pass rate that Saraev considers good enough. Total cost: about $10, assuming 50 test cycles at 20 cents each (10 diagrams at 2 cents per generation using a fast model called Nano Banana Pro 2).
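The cost estimate spelled out, using the per-generation price quoted in the video:

```python
# Cost arithmetic for the optimization run described above.
COST_PER_GENERATION = 0.02   # dollars per diagram (Nano Banana Pro 2)
DIAGRAMS_PER_CYCLE = 10
CYCLES = 50

cost_per_cycle = COST_PER_GENERATION * DIAGRAMS_PER_CYCLE  # $0.20 per test cycle
total_cost = cost_per_cycle * CYCLES                       # $10.00 for 50 cycles
```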

He frames this as straightforward ROI math. "A good banger video might make me several hundred in ad revenue per day," he notes. Spending $10 to optimize a tool that produces those videos? Easy decision.

But the more interesting claim isn't about cost, it's about transferability. Saraev applied the same framework to website optimization, taking his site's load time from 1,100 milliseconds to 67 over 67 automated tests, a roughly 94% reduction with no manual debugging. He's now running it on cold email campaigns, tracking reply rates as the objective metric.

What Gets Measured Gets Gamed

The technique's power comes from its simplicity, which is also its constraint. You need binary evaluations—yes/no questions an AI can answer reliably. "Does this diagram contain legible text?" works. "Does this feel professional?" doesn't, because "feel" compounds uncertainty at each testing iteration.
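The binary-evaluation pattern amounts to phrasing each criterion as a yes/no question and treating anything other than an unambiguous "yes" as a failure. A sketch, where `ask_model` is a hypothetical stand-in for a call to an evaluator model:

```python
# Binary evaluation: each criterion becomes a yes/no question, and only
# an unambiguous "yes" counts as a pass. ask_model() is a toy placeholder
# for an evaluator-model API call.

def ask_model(question: str, output: str) -> str:
    # Placeholder: a real implementation would send the question and the
    # output to a model and return its one-word answer.
    return "yes" if "legible" in output else "no"

def passes(criterion: str, output: str) -> bool:
    answer = ask_model(f"Answer only 'yes' or 'no'. {criterion}", output)
    return answer.strip().lower() == "yes"
```

Forcing a one-word answer is what keeps the signal reliable: a free-form "it feels fairly professional" gives the optimizer nothing concrete to climb toward.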

Saraev warns against over-specification. "Don't go so concrete and so narrow that the model starts optimizing for silly things," he says. If you include criteria like "must be under X words" or "can't include these characters," the system will optimize for those rules rather than the underlying quality you actually want. He compares it to a student who doesn't understand the material but still scores 100% on the test—technically passing while missing the point entirely.

This is where the method reveals its philosophical assumptions. Autoresearch treats quality as something that can be decomposed into measurable components. That works beautifully for load times and legibility. It gets murkier for creativity, persuasiveness, or tone—the qualities that often determine whether content actually lands with human readers.

The Compounding Asset

The most ambitious part of Saraev's pitch is temporal. AI models are improving rapidly. What if you could pass optimization data forward?

"You get a big list of changes that the models will have tried to make in order to improve your skill," he explains. "You could take this big list and pass it on to GPT-6 or Opus 5.0 and it'll be able to pick up where its predecessors left off." He calls this "probably soon to be one of the most important and valuable assets of our time."
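The "big list of changes" only compounds if it's kept in a form a future model can parse. A sketch of what such a log might look like; the field names are illustrative, not any standard format:

```python
import json

# Sketch of the "compounding asset": a structured log of every change the
# optimizer tried and how it scored, serializable so a future model can be
# handed the history instead of starting from scratch.

def log_attempt(history: list, change: str, score: int, max_score: int = 40) -> None:
    """Append one optimization attempt, marking whether it beat the best so far."""
    best_so_far = max((h["score"] for h in history), default=-1)
    history.append({
        "change": change,
        "score": score,
        "max_score": max_score,
        "kept": score > best_so_far,
    })

def export(history: list) -> str:
    """Serialize the full history for hand-off to a future model."""
    return json.dumps(history, indent=2)
```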

That's a strong claim worth examining. The underlying idea is that research data—records of what was tried and what worked—becomes more valuable as the systems interpreting that data become more sophisticated. A mediocre AI might turn a skill from 70% reliable to 80%. A future, more capable AI could use the same testing data to push it to 95%.

This assumes continuity: that future models will be compatible with current evaluation frameworks, that the criteria we define today will remain relevant, that improvements compound rather than requiring periodic resets. Maybe they will. Or maybe we're optimizing prompts for a generation of AI that will be obsolete in 18 months, replaced by systems that work entirely differently.

The Reproducibility Question

Saraev is offering his implementation freely—"no email, no gatekeeping whatsoever." That's useful for testing whether this actually works as broadly as he claims. The framework requires Claude Code (Anthropic's AI coding assistant), some way to generate and evaluate outputs programmatically, and patience to let it run.

The technique's real test will be replication outside Saraev's specific use cases. Does it work as well for someone optimizing customer support responses? Legal document analysis? Creative writing? The answer likely depends on how cleanly you can define success in binary terms.

Prompt engineering has historically been part art, part science—a craft where intuition about how models "think" matters as much as systematic testing. Autoresearch tries to automate the science part, letting machines run thousands of experiments humans wouldn't have the patience to complete manually. But it doesn't eliminate the art entirely. Someone still needs to define what "good" means.

The interesting development isn't that we can now automate prompt optimization. It's that we've reached the point where that automation is cheap enough to be routine. When testing costs 20 cents and runs unsupervised, the calculus shifts. You stop asking "is this prompt worth improving?" and start asking "why haven't I automated this yet?"

Marcus Chen-Ramirez is a senior technology correspondent covering AI, software development, and automation.

Watch the Original Video

Stop Fixing Your Claude Skills. Autoresearch Does It For You

Nick Saraev

16m 32s
Watch on YouTube

About This Source

Nick Saraev

Nick Saraev is an influential YouTube creator with 237,000 subscribers, focusing on the application of AI tools for business growth. Since his channel's inception in September 2025, Nick has offered valuable insights for tech-savvy entrepreneurs and AI enthusiasts looking to implement automation in their business operations. His content primarily revolves around practical guides for using tools like Make.com and Zapier.

