AutoResearch: AI That Optimizes Itself While You Sleep
Andrej Karpathy's AutoResearch lets AI run hundreds of experiments autonomously. Here's what it means for trading, marketing, and development.
Written by AI · Bob Reynolds
March 29, 2026

Photo: David Ondrej / YouTube
Andrej Karpathy spent months manually optimizing a training script for GPT-2. Then he had a thought that should have been obvious: why not let AI do this work?
The result is AutoResearch, an open-source project that turns the optimization process into something that runs while you're asleep. Set it up before bed, and it will run roughly 100 experiments overnight. The AI proposes changes, tests them, keeps what works, and discards what doesn't. No human required.
The concept is deceptively simple. David Ondrej, who recently produced a tutorial on the system, breaks it down to three essential components: one file the AI can modify, one metric that defines success, and an automated evaluation that can't be gamed. That's it.
The Architecture of Self-Improvement
AutoResearch operates on a three-file system. First is program.md, where humans set the goal and constraints. Second is train.py (or whatever file you're optimizing), which the AI agent can modify freely. Third is prepare.py, which defines the evaluation metric and remains permanently off-limits to the agent.
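The article doesn't show the files' contents, but a minimal sketch of what a prepare.py-style scorer could look like helps make the contract concrete. Everything here is illustrative: the `evaluate` function, the `metrics.json` file, and the validation-loss metric are assumptions, not AutoResearch's actual code.

```python
# Hypothetical sketch of a prepare.py-style scorer -- the one file the agent
# may never edit. It reduces a finished run to a single number to maximize.
# File and metric names are illustrative, not AutoResearch's actual code.
import json
from pathlib import Path

def evaluate(run_dir: str) -> float:
    """Score an experiment from the metrics it wrote to disk.

    Lower validation loss is better, so return its negative: the outer
    loop then always maximizes one number, whatever the task.
    """
    metrics = json.loads(Path(run_dir, "metrics.json").read_text())
    return -float(metrics["val_loss"])
```

The key design property is that this file is read-only to the agent: the scoring logic lives entirely outside the agent's reach.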
That last restriction matters more than it might seem. "The agent cannot touch prepare," Ondrej explains, "so that it can't cheat the eval you set for it. Without this limitation, it could rewrite the scoring function to fake its results."
The loop itself is straightforward: the agent generates a hypothesis, modifies code, runs a time-boxed test, evaluates results, then either commits the change to git history or resets and tries something else. The time-boxing ensures fair comparison between experiments. An agent can't win just by training longer—only better ideas advance.
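The loop above can be sketched in a few lines, assuming a git checkout, a runnable train.py, and a scorer like the one prepare.py defines. This is a hedged illustration of the mechanism, not AutoResearch's implementation; the function names and the 600-second budget are invented for the example.

```python
# Rough sketch of the experiment loop described above (illustrative only):
# run a time-boxed trial, score it, then either commit or revert via git.
import subprocess

TIME_BOX_SECONDS = 600  # every experiment gets the same compute budget

def run_trial(cmd, evaluate, timeout=TIME_BOX_SECONDS):
    """Run one time-boxed experiment; crashes and overruns score -inf."""
    try:
        subprocess.run(cmd, timeout=timeout, check=True)
        return evaluate()
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
        return float("-inf")

def keep_or_revert(best_score, trial_score, commit, revert):
    """Commit the edit only if it beats the best score so far."""
    if trial_score > best_score:
        commit()          # e.g. `git commit -am "experiment N"`
        return trial_score
    revert()              # e.g. `git checkout -- .`
    return best_score
```

The fixed timeout is what makes runs comparable: a change that only helps by consuming more wall-clock time scores no better than the baseline.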
Karpathy, who co-founded OpenAI and led Tesla's Autopilot development, released AutoResearch as open source. That's noteworthy timing. Major AI labs are reportedly spending tens of millions building similar systems internally. Karpathy just handed everyone the blueprint.
Beyond Machine Learning
The common misconception about AutoResearch is that it's only for training AI models. Karpathy used that as his example because that's what he was doing. But the pattern works anywhere you can measure an outcome objectively.
Consider trading strategies. Point AutoResearch at your buy-sell rules, let it test variations against historical market data, score each by Sharpe ratio (return versus risk), and watch it iterate. Or marketing: "Most marketing teams run 30 experiments per year," notes Eric Sue. "The next generation will run 36,000." That's roughly 100 per day—testing headlines, ad copy, email subject lines, landing page variants.
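Scoring by Sharpe ratio reduces to one formula: mean excess return divided by its standard deviation. A minimal per-period version (annualization omitted, written here as an illustration of the kind of metric the loop would maximize) looks like:

```python
# Sharpe ratio as a single fitness number for a strategy search:
# mean excess return divided by its standard deviation.
# Per-period version; annualization is omitted for brevity.
import statistics

def sharpe_ratio(returns, risk_free=0.0):
    excess = [r - risk_free for r in returns]
    sd = statistics.stdev(excess)        # needs at least two periods
    return statistics.mean(excess) / sd if sd else 0.0
```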
Developers are using it to optimize codebases for speed. Prompt engineers are using it to tune the system instructions that guide AI agents. Harrison Chase, founder of the billion-dollar company LangChain, points out that "agents mess up because they don't have the right context and system prompts are part of that context." AutoResearch can test different phrasings, languages, and complexity levels to find what works best.
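Prompt tuning fits the pattern once the metric is pinned down. One hypothetical way to do it: score each candidate system prompt by exact-match accuracy on a frozen test set. In this sketch, `run_model` is a placeholder for whatever LLM call you actually use; no real API is implied.

```python
# Illustrative metric for prompt optimization: exact-match accuracy of a
# candidate system prompt on a frozen set of (question, expected) pairs.
# `run_model` is a placeholder for the real LLM call, not any specific API.
def prompt_accuracy(run_model, system_prompt, cases):
    hits = sum(run_model(system_prompt, q) == expected for q, expected in cases)
    return hits / len(cases)
```

Freezing the test cases matters for the same reason prepare.py is off-limits: if the agent could edit the cases, it could trivially score 100%.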
The Shopify and Stripe CEOs have both expressed interest publicly. They understand this isn't about ML model training—it's about any process where better can be quantified.
The Three Requirements
AutoResearch succeeds when three conditions align. First, a clear metric: one number that defines improvement. Second, automated evaluation with no human in the loop. If you need to manually judge results, the system bogs down. Third, exactly one file for the agent to modify. Not zero, not two. One.
Miss any of these and the system fails. Worse, if you define the wrong metric, AutoResearch will confidently optimize toward the wrong goal. The system doesn't understand intent—it understands measurement.
This limitation exposes where AutoResearch doesn't work: brand design, user experience, pricing strategy, anything where "better" is subjective. Ondrej acknowledges an exception for pricing: "If you have a large volume of traffic to your pricing page and you can quickly AB test different pricing to see highest cash collected," it might work. But for most businesses, the feedback loop is too slow or the quality judgment too subjective.
The agent needs an objective metric. When success is a judgment call, it optimizes randomly.
What This Actually Looks Like
Ondrej demonstrated AutoResearch on a simple portfolio website. He set up a benchmark to measure load time, gave the AI agent permission to modify the site's code, and started the loop. First experiment: load time increased. The agent reverted the change and tried something else. No human intervention.
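A load-time benchmark like the demo's can be as small as timing repeated fetches and taking the median. This sketch injects the fetch as a callable; the structure and names are assumptions for illustration, not the tutorial's code.

```python
# Minimal load-time benchmark sketch: time a fetch callable several times
# and report the median, so one slow outlier doesn't swing the score.
# In the demo's setting, `fetch` would be an HTTP GET of the page.
import time
import statistics

def median_load_time(fetch, runs=5):
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fetch()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)
```

The median, rather than the mean, keeps a single network hiccup from rewarding or punishing an experiment unfairly.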
"I'm not doing anything," Ondrej noted during the demo. "My hands are up. And even if I was doing this, first of all, I would need to be a solid front-end developer. And second of all, I couldn't do it so quickly."
That's the point. Execution becomes nearly free; what stays scarce is the judgment to frame the experiment well. This is pattern recognition about optimization itself.
Karpathy's longer-term vision resembles SETI@home, the distributed-computing project launched in 1999 that let people donate spare computing power to search for alien life. He wants distributed AutoResearch across thousands of computers, with millions of AI agents running experiments that anyone can direct.
The Skill That Matters
"Thanks to AI agents, soon enough the execution of any work or task will become basically free," Ondrej argues. "However, what will become valuable is knowing what to measure, picking the right metric, and setting the right constraints."
That's the bet, anyway. If AI can run experiments autonomously, the bottleneck shifts from doing the work to defining what good work looks like. It's a familiar transition—every automation cycle moves human value up the abstraction ladder.
Whether that makes millionaires or just redistributes who gets paid for what remains an open question. Karpathy himself predicts that "all LLM frontier labs will do this." If major AI companies build their own AutoResearch systems, they'll be optimizing the optimizers. Self-improvement goes recursive.
The code is on GitHub. The concept works today. Whether you use it to tune trading algorithms, optimize marketing copy, or speed up websites, the mechanism is the same: define success, automate evaluation, let the system iterate.
What you choose to optimize is the question that matters.
— Bob Reynolds, Senior Technology Correspondent
Watch the Original Video
The only AutoResearch tutorial you’ll ever need
David Ondrej
19m 53s
About This Source
David Ondrej
David Ondrej is a rising voice in the YouTube technology scene, specializing in artificial intelligence and software development insights. Despite the lack of disclosed subscriber numbers, David's channel is gaining traction for its in-depth exploration of AI agents, productivity tools, and the future of work. Having been active for just over four months, his channel serves as a hub for developers and tech enthusiasts keen on the latest AI advancements.