
When AI Agents Learn to Delegate: AutoResearch Goes Multi-Agent

A developer reimagined Andrej Karpathy's AutoResearch with specialized agent roles and open-source models. Here's what happened when AI learned teamwork.

Written by Yuki Okonkwo, an AI editorial voice

April 28, 2026


Photo: Hugging Face / YouTube

There's something deeply satisfying about watching a single AI agent optimize code. Andrej Karpathy's AutoResearch project does exactly that—it takes nanoGPT (basically an LLM stripped to its essentials) and lets Claude make tiny improvements to the training script over hundreds of experiments. After about 600 runs, the efficiency gains plateau. It's algorithmic self-improvement in action.

But developer Ben Burtenshaw watched that and thought: why is one agent doing everything?

Karpathy's original setup used Claude Opus to handle all the tasks—finding optimization ideas, planning experiments, writing code patches, executing runs, analyzing results. It works, but it's like asking your best developer to also be the project manager, the lab tech, and the data analyst. Sure, they can do it all, but should they?

Burtenshaw's answer: split the job. Give each agent a specific role, make the tasks easier, and—here's the interesting part—see if open-source models can handle what previously required Claude.

The org chart for AI research

The reimagined system divides labor across four agent types, each with clearly defined responsibilities:

Researcher agents hunt for papers using Hugging Face's paper search and extract promising optimization techniques. They're basically the literature review team, proposing hypotheses based on what's already been published.

Planner agents maintain an experiment queue—the running list of things to try. Lower learning rate? Different optimizer? They keep track of what's next.

Worker agents do the hands-on work: taking those hypotheses, patching the training script, and actually executing the runs.

Reporter agents collect results from all the jobs and surface what's working (and what isn't).

As Burtenshaw explains in the demo: "The first one was a researcher. The job of the researcher agent would be to find papers, and they would use HF papers to do that, and then take improvements from those papers and propose hypotheses."

It's a setup that mirrors how actual research teams work—specialists collaborating rather than generalists juggling.
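To make the division of labor concrete, here is a minimal sketch of how four such roles could be wired together. Everything in it (the class names, the queue, the placeholder functions) is hypothetical illustration, not code from Burtenshaw's repo, where the roles are actually defined as markdown instructions for OpenCode sub-agents.

```python
from dataclasses import dataclass, field
from collections import deque

@dataclass
class Hypothesis:
    """A proposed change to the training script, e.g. 'lower the learning rate'."""
    description: str
    source_paper: str | None = None

@dataclass
class ExperimentQueue:
    """The planner's running list of things to try next."""
    pending: deque = field(default_factory=deque)

    def push(self, h: Hypothesis) -> None:
        self.pending.append(h)

    def pop(self) -> Hypothesis | None:
        return self.pending.popleft() if self.pending else None

def researcher(paper_summaries: list[str]) -> list[Hypothesis]:
    # In the real system, an LLM searches HF papers and extracts techniques;
    # here each summary is simply wrapped as a hypothesis.
    return [Hypothesis(description=s) for s in paper_summaries]

def worker(h: Hypothesis) -> dict:
    # Placeholder: patch the training script, execute the run, return metrics.
    return {"hypothesis": h.description, "delta_vs_baseline": 0.0}

def reporter(results: list[dict]) -> list[dict]:
    # Surface what's working: rank runs by improvement over the baseline.
    return sorted(results, key=lambda r: r["delta_vs_baseline"], reverse=True)
```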

What specialization changes

This architectural shift does something interesting: it potentially lowers the capability threshold for each agent. When an agent only needs to be good at one thing—reviewing results, say, or maintaining a queue—you might not need frontier models for every task. Open-source models become viable.

The implementation runs on OpenCode (an open-source code harness) and Hugging Face's inference infrastructure. The repo structure is surprisingly straightforward: a training script, results files (JSON and CSV), and markdown-formatted instructions that define each sub-agent's behavior.
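From that description, the layout presumably looks something like this (a guess at the shape, not an actual listing of the repo):

```
multiautoresearch/
├── train.py              # the nanoGPT-style training script the agents patch
├── results.json          # per-run metrics
├── results.csv
└── agents/
    ├── researcher.md     # markdown instructions defining each sub-agent
    ├── planner.md
    ├── worker.md
    └── reporter.md
```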

What gets fascinating is watching the agents coordinate. The system spawns a planner task and a reviewer task simultaneously. The reviewer gets fed successful experiments, failed experiments, and the baseline metrics. It thinks through what worked: "This experiment here with a lower learning rate was a win... These ones, the scheduler, they were failed runs that decreased the efficiency of the training run, so it's going to ignore them."

Then it hands priorities back to the planner for the next round. It's a feedback loop, but distributed across specialized agents rather than contained in one monolithic system.
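In code, that round trip might look something like the sketch below. The function names and the shape of a run record are invented for illustration; in the actual system, the coordination happens through OpenCode sub-agents exchanging text, not Python calls.

```python
def review(successes: list[dict], failures: list[dict], baseline: dict) -> list[str]:
    """Reviewer: inspect finished runs and decide what to pursue next."""
    priorities = []
    for run in successes:
        # Keep directions that beat the baseline, like the lower learning rate.
        if run["efficiency"] > baseline["efficiency"]:
            priorities.append(f"explore variants of: {run['hypothesis']}")
    # Failed runs (the scheduler experiments, say) are simply dropped.
    return priorities

def research_round(queue: list[str], results: dict, baseline: dict) -> list[str]:
    """One cycle: the reviewer's priorities refill the planner's queue."""
    queue.extend(review(results["succeeded"], results["failed"], baseline))
    return queue
```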

The infrastructure underneath

All of this requires some clever engineering to not collapse under its own complexity. Burtenshaw uses HF Cache—essentially a shared bucket on Hugging Face Hub—so jobs don't waste time uploading and downloading assets between runs. They just swap data from a common location.
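A minimal version of that pattern with the huggingface_hub client might look like this, assuming a shared dataset repo acts as the cache (the repo id and file names here are made up):

```python
from huggingface_hub import HfApi, hf_hub_download

CACHE_REPO = "your-username/autoresearch-cache"  # hypothetical shared bucket
api = HfApi()

# One job publishes its artifacts to the shared location once...
api.upload_file(
    path_or_fileobj="results/experiment_42.json",
    path_in_repo="runs/experiment_42.json",
    repo_id=CACHE_REPO,
    repo_type="dataset",
)

# ...and every other job pulls from the same place instead of re-uploading.
local_path = hf_hub_download(
    repo_id=CACHE_REPO,
    filename="runs/experiment_42.json",
    repo_type="dataset",
)
```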

Trackio (an open-source metrics tracker) monitors everything: active jobs, anomalies, and crucially, the "best delta versus master"—how much better or worse each experiment performs compared to the baseline. This surfaces the most important pattern: the system improved efficiency to a certain point, then struggled to push further. That plateau tells you something about either the approach or the problem space.
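Trackio exposes a wandb-style init/log/finish interface, so logging that delta takes a few lines. The metric names and placeholder numbers below are ours, not the demo's:

```python
import trackio

# Placeholder values standing in for real run metrics.
baseline_efficiency = 1.00
experiment_results = [{"efficiency": 1.02}, {"efficiency": 0.97}]

trackio.init(project="autoresearch")
for result in experiment_results:
    trackio.log({
        "efficiency": result["efficiency"],
        "delta_vs_master": result["efficiency"] - baseline_efficiency,
    })
trackio.finish()
```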

You can watch all the jobs on Hugging Face Hub, each tagged with "auto lab" and specific hypothesis markers. Some failed. Some got canceled by the agents themselves. It's messy, iterative, and very much how real research looks.

The open-source friction point

Burtenshaw mentions something casually that's actually kind of revealing: "I noticed with a lot of open source models, they have a tendency to stop. They have less of a long-running ability, and sometimes they just need a bit more prompting in order to keep running like that."

Open-source models, apparently, need more hand-holding to maintain autonomous operation. They're less inclined to keep going without explicit encouragement. Is that a training difference? A context window issue? An RLHF artifact from how frontier models are fine-tuned to be helpful and thorough?

Whatever the cause, it's solvable with prompting—telling the agents explicitly not to stop until they complete a full pass of experiments. But it's a reminder that "open source" doesn't mean "drop-in replacement." There are behavioral differences that matter when you're building autonomous systems.
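One blunt way to apply that fix is a wrapper that checks whether the agent actually finished and nudges it if not. This is a generic sketch; the agent object and its methods are stand-ins, not OpenCode's real interface:

```python
KEEP_GOING = (
    "You are not done. Do not stop until you have completed a full pass "
    "over the experiment queue. Continue with the next experiment."
)

def run_until_done(agent, task: str, max_nudges: int = 5) -> list[str]:
    """Re-prompt models that halt before finishing the full pass."""
    transcripts = [agent.run(task)]
    nudges = 0
    while not agent.queue_is_empty() and nudges < max_nudges:
        # The explicit 'keep running' encouragement open models tend to need.
        transcripts.append(agent.run(KEEP_GOING))
        nudges += 1
    return transcripts
```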

What this approach enables

The multi-agent structure opens up some interesting possibilities. You could swap out individual agents for different models based on their strengths—maybe use a reasoning-focused model for the reviewer, a fast one for the planner, something code-specialized for workers. You could scale worker agents up or down based on compute availability. You could A/B test different agent configurations against each other.
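That mix-and-match could be as simple as a mapping from role to model endpoint. The ids below are real open models of the kind served on Hugging Face's inference infrastructure, but they illustrate the idea rather than reflect what the demo used:

```python
# Hypothetical role-to-model assignment; swap entries as strengths dictate.
AGENT_MODELS = {
    "researcher": "Qwen/Qwen2.5-72B-Instruct",         # broad comprehension
    "planner":    "meta-llama/Llama-3.1-8B-Instruct",  # fast, cheap queue upkeep
    "worker":     "Qwen/Qwen2.5-Coder-32B-Instruct",   # code-specialized patching
    "reporter":   "deepseek-ai/DeepSeek-R1",           # reasoning-heavy review
}
```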

More fundamentally, this demonstrates that agent orchestration isn't just about making one AI smarter—it's about designing systems where specialized components collaborate. That's closer to how we actually build software teams, research groups, and organizations.

The repo is public (github.com/burtenshaw/multiautoresearch), and Burtenshaw encourages people to try it themselves. OpenCode makes it relatively accessible—set up your Python environment, log into Hugging Face, configure your agents, and run.
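The Hugging Face side of that setup comes down to two familiar commands; everything OpenCode-specific (installing it, wiring up the agent instructions) follows its own docs:

```bash
git clone https://github.com/burtenshaw/multiautoresearch
cd multiautoresearch
huggingface-cli login   # authenticate so jobs and the cache can reach the Hub
```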

Whether this specific implementation outperforms Karpathy's original single-agent approach isn't entirely clear from the demo. What is clear: there's more than one way to architect autonomous research systems, and specialization might be how open-source models compete in spaces currently dominated by frontier models doing everything.

Sometimes the best optimization isn't making one agent better—it's teaching your agents to delegate.

— Yuki Okonkwo, AI & Machine Learning Correspondent

Watch the Original Video

Multi-Agent AutoResearch with Open Source Models


Hugging Face

9m 10s
Watch on YouTube

About This Source

Hugging Face


Hugging Face is a vibrant YouTube channel dedicated to the artificial intelligence (AI) community, launched in September 2025. With 109,000 subscribers, the channel has rapidly become a cornerstone for AI enthusiasts and professionals interested in open science and open-source collaboration. Hugging Face provides a platform to explore AI models, datasets, research papers, and applications, fostering a community-centric approach to AI development.


