The Karpathy Loop: When AI Runs 700 Experiments Overnight
Andrej Karpathy's AI agent ran 700 experiments while he slept, found bugs he missed, and cut training time by 11%. Here's what that means for everyone else.
Written by AI. Tyler Nakamura
April 18, 2026

Photo: AI News & Strategy Daily | Nate B Jones / YouTube
Andrej Karpathy went to sleep on March 8th after pointing an AI agent at his own training code. Two days later, it had run 700 experiments, discovered 20 genuine improvements, and shaved 11% off training time—on code one of the world's best ML researchers had already hand-optimized for months. It even found a bug in his attention implementation.
Not because the agent was smarter than Karpathy. Because it didn't get bored after the 15th failed attempt.
This is the Karpathy loop, and the mechanism matters more than the headline. Nate B Jones walks through why in a new video breaking down what happens when auto-optimization moves from research novelty to business infrastructure. The short version: constraints are the feature, not the bug.
The Magic Is in the Minimalism
Karpathy's setup is deliberately simple. Three files total. The agent can only touch one—train.py. It proposes an edit, runs a five-minute experiment, checks a single metric, and either commits the change or reverts. That's the whole loop.
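The loop Jones describes can be sketched in a few lines. This is an illustrative reconstruction, not Karpathy's actual code; the function names and the toy "training metric" are invented for the demo:

```python
import random

def optimization_loop(evaluate, propose, baseline, n_trials=700):
    """Commit-or-revert loop: propose one change, run one short
    experiment, check one metric, keep the change only if it helps."""
    config = baseline
    best = evaluate(config)          # metric on the untouched code
    kept = 0
    for _ in range(n_trials):
        candidate = propose(config)  # the agent's proposed edit
        score = evaluate(candidate)  # the five-minute experiment
        if score < best:             # lower = faster training
            config, best = candidate, score
            kept += 1                # commit
        # otherwise: revert (discard the candidate)
    return config, best, kept

# Toy demo: "training time" is (x - 3)^2 and the agent nudges x at random.
random.seed(0)
cfg, best, kept = optimization_loop(
    evaluate=lambda x: (x - 3.0) ** 2,
    propose=lambda x: x + random.uniform(-0.5, 0.5),
    baseline=0.0,
    n_trials=200,
)
print(kept, "of 200 proposals committed; metric fell from 9.0 to", round(best, 4))
```

Even with a low hit rate per proposal, the accept-if-better rule means the metric only ever moves in one direction, which is why raw iteration count matters more than per-proposal quality.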
"The minimalism isn't a limitation. It's the entire point," Jones explains. "By constraining the search space to one file and one metric, Karpathy made the problem tractable for an agent in a way that a very sprawling multi-file system wouldn't be."
The hit rate wasn't impressive—maybe 20 improvements out of 700 tries. But the iteration rate was inhuman. Twelve experiments per hour. A hundred overnight. A productive human researcher might manage 8-10 cycles in a full workday, most of it spent waiting for GPUs.
Shopify CEO Tobi Lütke tried the same pattern on internal data: a 19% performance gain from 37 experiments in 8 hours. SkyPilot pointed it at a 16-GPU Kubernetes cluster and ran 910 experiments in 8 hours for under $300 in compute. The agent spontaneously figured out that scaling model width mattered more than individual parameters and taught itself to use faster GPUs for validation.
From Training Code to Agent Scaffolding
Optimizing training code is useful but narrow. What happened in early April is bigger.
Kevin Goo's team at Third Layer (a small YC startup) took the same loop and applied it to harness engineering—the prompts, tool definitions, routing logic, and orchestration strategy that determine how agents behave. Instead of tweaking model weights, a meta-agent reads failure traces from a task agent, diagnoses what went wrong, modifies the scaffolding, and runs the benchmark again.
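Structurally, that is an outer loop wrapped around the task agent. A toy sketch of the shape (the harness dict, benchmark, and patch logic below are invented stand-ins, not Third Layer's system):

```python
def meta_loop(harness, run_benchmark, diagnose_and_patch, rounds=10):
    """Meta-agent loop: read the task agent's failure traces, patch the
    harness, re-run the benchmark, keep the patch only if it scores better."""
    score, traces = run_benchmark(harness)
    for _ in range(rounds):
        candidate = diagnose_and_patch(harness, traces)
        new_score, new_traces = run_benchmark(candidate)
        if new_score > score:
            harness, score, traces = candidate, new_score, new_traces
    return harness, score

def run_benchmark(harness):
    # Toy benchmark: pass rate rises with the retry budget, saturating at 3.
    retries = min(harness["max_retries"], 3)
    score = 0.4 + 0.15 * retries
    traces = [] if retries >= 3 else ["task 7: tool call timed out"]
    return score, traces

def diagnose_and_patch(harness, traces):
    # Stand-in for the meta-agent: read failure traces, edit the scaffolding.
    patched = dict(harness)
    if any("timed out" in t for t in traces):
        patched["max_retries"] += 1
    return patched

final, score = meta_loop({"max_retries": 0}, run_benchmark, diagnose_and_patch)
```

Note that nothing in the loop touches model weights; only the scaffolding around the task agent changes, which is exactly the harness-engineering layer the article describes.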
The claimed results: 96.5% on SpreadsheetBench, 55.1% on TerminalBench. First place on both. Jones notes these scores haven't hit official leaderboards yet, and the gap between claimed performance and verified state-of-the-art is substantial. But the direction matters more than any single benchmark.
"Every company deploying agents has to have a harness," Jones points out. "And most of those harnesses right now, they're designed by we humans who are doing our very best with the knowledge we have. The auto agent pattern suggests those harnesses can be systematically optimized by a meta agent that understands the inner model better than a human engineer would."
Goo's team discovered something non-obvious: separating meta-agent and task agent roles works better than self-improvement. Being good at a domain and being good at improving at that domain are different capabilities. They also found that same-model pairings outperform cross-model ones—a Claude meta-agent writes better harnesses for a Claude task agent than for a ChatGPT task agent, because it understands the failure modes from the inside.
And the meta-agent invented strategies nobody programmed. Spot-checking to save compute. Forced verification loops. Progressive disclosure when context windows overflow. Task-specific sub-agents with handoff logic. None of this was specified. The agent figured it out by analyzing its own failure traces.
Local Hard Takeoff
Jones introduces a term: local hard takeoff. Not the AI safety version where intelligence explodes beyond control. Something more mundane and immediately useful.
"A local hard takeoff is what happens when an optimization loop closes on a specific business system and compounds improvements faster than the surrounding organization can necessarily track it," he explains. Your pricing engine rewrites its own heuristics over the weekend and comes back 30% more accurate. Your fraud detection model finds patterns human analysts wouldn't try. Your customer service agent builds verification loops that cut resolution time in half.
Steep, sudden, compounding, largely autonomous—but bounded to a specific domain, metric, and sandbox. It doesn't escape or generalize. It just gets really good at one thing really fast.
The difference between companies that capture this advantage and those that don't comes down to infrastructure most organizations haven't built. Traces, specifically.
"When Goo's team only gave the meta agent scores without reasoning trajectories, the improvement rate dropped really fast," Jones notes. "Understanding why something improved seems to matter as much as knowing that it improved."
An optimization loop that only sees outcomes (revenue up, churn down) produces random improvements. One that sees the full reasoning chain makes surgical edits. The quality of trace infrastructure determines the quality of auto-improvement.
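Concretely, the difference is what gets logged per experiment. A trace record rich enough to support surgical edits might look like this (a hypothetical schema, not Goo's actual format):

```python
from dataclasses import dataclass, field

@dataclass
class ExperimentTrace:
    """One experiment, logged with enough context for a meta-agent
    to learn why it worked, not just that it worked."""
    change: str                  # what was edited
    hypothesis: str              # why the agent expected it to help
    metric_before: float
    metric_after: float
    reasoning: list[str] = field(default_factory=list)  # full trajectory

    @property
    def improved(self) -> bool:
        return self.metric_after < self.metric_before   # lower is better

t = ExperimentTrace(
    change="fuse the three projection matmuls into one",
    hypothesis="fewer kernel launches should cut step time",
    metric_before=412.0,  # ms per training step
    metric_after=396.5,
    reasoning=["profiled a step", "saw three small matmuls", "fused them"],
)
```

An outcome-only log keeps just the two metric fields; the `hypothesis` and `reasoning` fields are what let the next round of edits build on why this one landed.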
The Enterprise Reality Check
Everything Jones describes assumes your organization can deploy agents at all. Most can't.
The context layer problem is foundational. Without structured external memory—a persistent representation of goals, state, and constraints that survives across sessions—every agent session reinvents what "done" means. Auto-improvement layered on top of bad memory architecture means a meta-agent optimizing in the dark, unable to distinguish between "this change improved the harness" and "this change happened to work on three tasks before the context window got polluted."
Then there's the technical gap. Auto-improvement requires eval harnesses, sandbox environments where hundreds of experiments run without human intervention, scoring functions that reflect actual business value. Most teams struggle to write reliable eval suites for current deployments. They measure activity instead of outcomes, or use metrics that don't correlate with what they actually care about.
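The scoring-function trap is easy to make concrete. A hedged sketch of the distinction (the field names and thresholds are illustrative, not from any real deployment):

```python
def score_activity(episode):
    # Anti-pattern: rewards busy agents, not effective ones.
    return len(episode["tool_calls"]) + len(episode["messages"])

def score_outcome(episode):
    """Score what the business cares about: did the task resolve, and how
    fast. An auto-improvement loop optimizing this moves the real metric."""
    if not episode["resolved"]:
        return 0.0
    # Full credit under 10 minutes, decaying proportionally after that.
    return min(1.0, 600.0 / max(episode["seconds_to_resolve"], 600.0))

# A busy-but-slow episode outscores a fast one on activity, and vice versa
# on outcome -- the two metrics pull the optimization loop in opposite ways.
fast = {"resolved": True, "seconds_to_resolve": 300,
        "tool_calls": [1] * 9, "messages": [1] * 5}
slow = {"resolved": True, "seconds_to_resolve": 1800,
        "tool_calls": [1] * 20, "messages": [1] * 10}
failed = {"resolved": False, "seconds_to_resolve": 60,
          "tool_calls": [], "messages": []}
```

Point an overnight loop at `score_activity` and it will learn to make more tool calls; point it at `score_outcome` and it has to actually close tickets faster.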
And governance: Who owns the output of an auto-improvement loop? Who reviews the 47th experiment at 3 a.m.? Who decides which optimizations go to production?
"Auto improvement is like a graduate level capability when most orgs are struggling with agents 101," Jones says. "It requires that you've already solved agent deployment."
The flip side: small, agile teams have a structural advantage right now. Karpathy's auto-research was built by one person. The auto-agent harness work came from a tiny YC startup. SkyPilot scaled the approach for under $300 in compute. A three-person team with $500 can run the same optimization loop that would take a 20-person enterprise team months to spec, approve, procure infrastructure for, and execute.
The iteration speed advantage isn't marginal. It's orders of magnitude. And it currently favors teams that can move without approval gates and procurement cycles.
This doesn't mean small teams will beat enterprises at everything. It means on the specific dimension of rapid iterative optimization, complexity is the enemy. The pattern rewards simplicity: define the metric, build the harness, let the loop run.
Major labs are pursuing the same pattern at larger scale. Anthropic wants Claude N to build Claude N+1. OpenAI aims for a fully automated AI researcher by 2028, with an AI research intern by 2026. The loop is the same—propose, test, evaluate, keep or discard. Only the scope differs.
The question isn't whether this works. Karpathy's 700 experiments answered that. The question is how far it scales, how fast, and whether your organization's infrastructure is ready when it does.
—Tyler Nakamura
Watch the Original Video
Karpathy's Agent Ran 700 Experiments While He Slept. It's Coming For You.
AI News & Strategy Daily | Nate B Jones
27m 25s
About This Source
AI News & Strategy Daily | Nate B Jones
AI News & Strategy Daily, led by Nate B. Jones, is a YouTube channel providing actionable AI strategies tailored for industry professionals looking to integrate AI into their operations. With over two decades of experience as a product leader and AI strategist, Nate offers practical frameworks and workflows designed to cut through the typical AI hype. Since its inception in December 2025, the channel has emerged as a trusted resource for those seeking to apply AI in real-world business contexts.