Karpathy's Autoresearch: AI That Optimizes Itself
Andrej Karpathy's autoresearch framework creates self-improving AI agents that experiment autonomously. Here's what happens when optimization runs 24/7.
Written by AI · Yuki Okonkwo
March 13, 2026

Photo: Nick Saraev / YouTube
Andrej Karpathy just released something that sounds like sci-fi until you actually run it: an AI system that experiments on itself while you sleep, logging its learnings and (hopefully) getting better without human intervention.
The repo is called autoresearch, and the core idea is deceptively simple. You give an AI agent a metric to optimize, a way to modify something, and the ability to measure results. Then you let it run in a loop: change thing, test thing, keep if better, discard if worse, repeat. "The idea is to give an AI agent a small but real LLM training setup and just let it experiment autonomously overnight," Karpathy explains in the documentation. "It'll modify the code, train for 5 minutes, check if the results improved, keep or discard, and then just repeat."
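That loop fits in a few lines of Python. The sketch below is illustrative, not Karpathy's actual code: `propose_change`, `train_briefly`, and `validation_loss` stand in for whatever experiment harness you wire up.

```python
import copy

def autoresearch_loop(config, propose_change, train_briefly, validation_loss, steps=10):
    """Greedy keep-if-better loop: mutate the config, train briefly,
    evaluate, and keep the change only if the metric improved
    (lower validation loss is better)."""
    best_loss = validation_loss(train_briefly(config))
    for _ in range(steps):
        candidate = propose_change(copy.deepcopy(config))  # agent edits code/hyperparams
        loss = validation_loss(train_briefly(candidate))   # the ~5-minute training run
        if loss < best_loss:
            config, best_loss = candidate, loss            # keep if better
        # discard if worse: the candidate is simply dropped
    return config, best_loss
```

Everything interesting lives in `propose_change` (what the agent is allowed to touch) and `validation_loss` (what counts as better); the loop itself is almost trivially simple.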
For Karpathy, this meant training language models. His agent adjusted hyperparameters, ran quick training sessions, and tracked validation loss (basically: how accurate is this model?). But entrepreneur Nick Saraev saw something different when the repo dropped—a framework that could optimize anything with a measurable outcome.
When Optimization Never Sleeps
Saraev's immediate thought: cold emails. Not exactly as sexy as training neural networks, but way more directly profitable. His setup tracks reply rates—the percentage of strangers who actually respond to unsolicited pitches. The AI agent writes variations of email copy, deploys them through the Instantly API, waits for data, then iterates based on what worked.
The system runs every four hours. Baseline copy ("B") competes against challenger copy ("C"). Whichever performs better becomes the new baseline. But here's where it gets interesting: the agent maintains a growing document called resource.md where it logs everything it learns. "As the models get better and better and better, they log all of their learnings," Saraev notes. "That significantly improves future models' abilities to make changes."
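The promote-and-log step might look something like this. A minimal sketch, not Saraev's actual code: the function name, the reply-rate inputs, and the log format are all made up here.

```python
from datetime import datetime, timezone
from pathlib import Path

def promote_winner(baseline: str, challenger: str, reply_rates: dict,
                   log_path: str = "resource.md") -> str:
    """Compare baseline ('B') and challenger ('C') copy by reply rate,
    promote the winner to be the new baseline, and append the result
    to a growing markdown log the agent reads on future iterations."""
    b_rate, c_rate = reply_rates[baseline], reply_rates[challenger]
    winner = challenger if c_rate > b_rate else baseline
    entry = (f"- {datetime.now(timezone.utc):%Y-%m-%d %H:%M} UTC: "
             f"B={b_rate:.1%} vs C={c_rate:.1%} -> kept {winner!r}\n")
    with Path(log_path).open("a") as f:  # compounding knowledge over time
        f.write(entry)
    return winner
```

The append-only log is the compounding part: each cycle's outcome becomes context for the next cycle's copywriting.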
This isn't just A/B testing. A/B testing requires a human to interpret results and decide what to try next. This is A/B testing that interprets itself, documents its own insights, and compounds knowledge over time. Run it for a few days and you get incrementally better copy. Run it for a year? The system accumulates a year's worth of learned patterns about what makes people reply.
Saraev's next step is tightening the loop from four hours to five minutes and scaling volume 10x. The question becomes less about whether the AI can optimize and more about whether the infrastructure can keep up.
The Recipe for Self-Improvement
The autoresearch pattern needs three things:
First: an objective metric. Not "make this better" or "improve quality"—those are too vague. You need something quantifiable. Reply rate, conversion rate, validation loss, customer satisfaction score, revenue per visitor. Something that goes up or down and tells you definitively whether you're winning.
Second: a thing you can modify programmatically. Email copy, landing page text, model hyperparameters, chatbot scripts, product descriptions, ad creative. If an API can change it or code can alter it, the agent can experiment with it.
Third: a feedback loop. The agent needs to deploy changes, measure the metric, and compare results—ideally without human intervention. This is where APIs become critical. If your email platform has an API (Instantly does), your landing page builder has an API (Webflow does), or your ad platform has an API (Facebook does), you're in business.
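The three ingredients map naturally onto a tiny interface. This is one possible framing, not anything from the repo; the method names are invented for illustration.

```python
from typing import Protocol

class Optimizable(Protocol):
    """Anything with the three ingredients: an objective metric,
    a programmatically modifiable surface, and a deploy step."""
    def measure(self) -> float: ...             # 1. objective metric (higher = better)
    def apply(self, variant: str) -> None: ...  # 2. programmatic modification (API call)
    def deploy(self) -> None: ...               # 3. feedback loop: ship it live

def run_experiment(target: Optimizable, variant: str, baseline_score: float) -> bool:
    """One cycle: modify, deploy, re-measure, report whether the metric improved."""
    target.apply(variant)
    target.deploy()
    return target.measure() > baseline_score
```

Anything satisfying this shape — an email campaign, a landing page, a training config — plugs into the same loop.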
The human becomes optional not because they're useless—Saraev freely admits he'd probably make better optimization decisions than his AI—but because humans don't scale. "I take a lot more time to optimize than a model does," he points out. "I also eat, sleep, have to go to the washroom, and do a variety of other things with my day. AI agents don't."
A human might run two experiments in a day if they're really focused. An AI agent can run 24, 48, 288—depending on how tight you make the loop and how much infrastructure you throw at it. At some point, sheer volume of experimentation matters more than the quality of each individual decision.
Applications That Actually Exist Now
Saraev rattles off possibilities with the energy of someone who's already mentally deployed half of them:
Landing page optimization: The agent modifies copy through your website builder's API, tracks conversion rate, keeps winners. Set it loose for a month and watch it compound micro-improvements.
Ad creative testing: Feed it access to Facebook or Google's ad APIs, give it conversion rate as the north star metric, let it generate and test variations. (Though notably, these platforms already do primitive versions of this—the question is whether Claude or GPT-4 can do it better.)
Chatbot scripts: Track customer satisfaction scores, modify the base template that all support interactions follow, optimize toward higher ratings.
YouTube titles: Saraev mentions this casually—he creates YouTube content and could hook autoresearch into the YouTube Data API v3 to update titles and test them against click-through rates.
Product descriptions: For ecommerce, you might not have a clean API, but you could use Chrome DevTools with Model Context Protocol to programmatically update product pages and track revenue.
Newsletter subject lines, pricing pages, basically anything with a metric. If you can measure it and modify it, you can optimize it autonomously.
The common thread: these aren't speculative. They're things you can build this week if you clone the repo and spend a few hours adapting it.
What This Actually Looks Like
The implementation is surprisingly straightforward. Clone Karpathy's autoresearch repo. Create a test file with your goal, your metric, and your test method. Point an orchestrator agent (think: the conductor coordinating all the smaller specialized agents) at your API credentials. Deploy to something like GitHub Actions so it runs on a schedule without your laptop even being on.
Saraev walks through building a cold email optimizer live, using Claude Code and voice dictation. The agent scaffolds out the entire system: an API client for Instantly, an orchestrator script, utility functions for things like purging old leads, config files for baseline tests and learned resources, and a GitHub Actions workflow to run it hourly.
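The scheduling piece is just a cron-triggered GitHub Actions workflow. Here's a minimal sketch of what that file could look like—the filenames (`orchestrator.py`, `requirements.txt`) and the secret name are assumptions, not Saraev's actual setup.

```yaml
# .github/workflows/optimize.yml -- illustrative sketch; names are made up
name: cold-email-optimizer
on:
  schedule:
    - cron: "0 * * * *"    # top of every hour; use "0 */4 * * *" for every four hours
  workflow_dispatch:        # allow manual runs while debugging
jobs:
  iterate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -r requirements.txt
      - run: python orchestrator.py   # fetch stats, compare B vs C, append to resource.md
        env:
          INSTANTLY_API_KEY: ${{ secrets.INSTANTLY_API_KEY }}
```

Because the runner is ephemeral, anything the agent learns (like resource.md updates) has to be committed back to the repo or stored externally between runs.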
The bottleneck isn't the AI's ability to write this code—it's knowing what to ask for and how to structure the problem. Once you understand the pattern (metric + modification + measurement + loop), adapting it becomes almost mechanical.
The Quiet Part
Here's what's not being said loudly but probably should be: this works better in narrow domains with clear metrics than in fuzzy creative work. Reply rate is unambiguous. Validation loss is quantifiable. "Make the brand voice more compelling" is not.
The system also inherits all the limitations of the underlying model. If Claude or GPT-4 doesn't understand what makes cold emails work, giving it permission to run experiments won't fix that—it'll just let it be wrong faster. The resource.md document helps, but it's ultimately limited by the agent's ability to identify actual causal patterns versus noise.
And there's the infrastructure question. Saraev can run this every four hours because he has the email sending capacity and the lead volume to generate meaningful data in that timeframe. If you're getting 10 website visitors a day, automated landing page optimization will take months to surface anything useful. The system works, but it works proportionally to your scale.
Still. The fact that you can download a repo, adapt it to your specific business problem, and have an AI agent running optimization experiments while you sleep—that's genuinely new. Not in concept (scientists have been automating experiments for decades), but in accessibility. You don't need a research lab. You need a GitHub account and an API key.
Karpathy built this to train better language models. Saraev adapted it to write better cold emails. The question isn't whether autoresearch will transform how AI systems improve themselves—Karpathy's already doing that. The question is what happens when every growth marketer, every product manager, every person with a metric to move can spin up their own self-improving optimization loop in an afternoon.
— Yuki Okonkwo
Watch the Original Video
Claude Code + Karpathy's Autoresearch = The New Meta
Nick Saraev
24m 42s
About This Source
Nick Saraev
Nick Saraev is an influential YouTube creator with 237,000 subscribers, focusing on the application of AI tools for business growth. Since his channel's inception in September 2025, Nick has offered valuable insights for tech-savvy entrepreneurs and AI enthusiasts looking to implement automation in their business operations. His content primarily revolves around practical guides for using tools like Make.com and Zapier.