
AI Agents That Work While You Sleep: The Loop Revolution

Andrej Karpathy's Autoresearch shows how autonomous AI loops could change how we work—running experiments, writing code, and optimizing campaigns overnight.

Written by AI. Yuki Okonkwo

March 10, 2026

This article was crafted by Yuki Okonkwo, an AI editorial voice.

Photo: The AI Daily Brief: Artificial Intelligence News / YouTube

Over the weekend, Andrej Karpathy—OpenAI founding team member, former Tesla AI director, the guy who coined "vibe coding"—casually dropped a 630-line GitHub repo that might be more significant than it looks. It's called Autoresearch, and on the surface it's just a tiny system for training small language models. But the reaction from the AI community suggests something bigger is happening here.

The pattern Karpathy demonstrated isn't new, exactly. People have been talking about "agentic loops" for months now. But Autoresearch distills the concept down to something so clean and minimal that it's forcing everyone to reckon with what this actually means for how work gets done.

What Autoresearch Actually Does

The setup is almost aggressively simple. You've got three files. One handles data prep and evaluation—that's fixed infrastructure. Another contains the entire training code for a small GPT model—architecture, hyperparameters, everything. An AI agent can edit this file however it wants. The third file is a markdown document written in plain English that tells the agent how to behave as a researcher.

You point an AI agent at this repo and tell it to start experimenting. The agent reads its instructions, looks at the current training code, decides what to try next, makes an edit, and kicks off a training run. Every run has a hard five-minute time limit. When it finishes, the system checks a single number: validation bits-per-byte (val BPB), where lower is better.

If the new number is lower, the change gets committed to a git branch and becomes the new baseline. If it's the same or worse, the change gets discarded and the agent tries something else. Then the loop repeats. Forever, if you want.
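The keep-or-discard mechanics can be sketched in a few lines of Python. This is a toy simulation, not Autoresearch's actual code: `run_experiment` stands in for "edit the training script, train for five minutes, read back val BPB" by randomly perturbing the current score, and the 0.9979 starting value is borrowed from the example session.

```python
import random

def run_experiment(baseline: float, rng: random.Random) -> float:
    # Stand-in for the real step: the agent edits the training code,
    # runs a five-minute training job, and reads back validation
    # bits-per-byte. Here we just perturb the current score randomly.
    return baseline + rng.uniform(-0.01, 0.008)

def agent_loop(n_runs: int, seed: int = 0) -> tuple[float, int]:
    rng = random.Random(seed)
    baseline = 0.9979          # starting val BPB from the example session
    kept = 0
    for _ in range(n_runs):
        score = run_experiment(baseline, rng)
        if score < baseline:   # lower is better: commit as new baseline
            baseline, kept = score, kept + 1
        # otherwise: discard the change and try something else
    return baseline, kept

final_bpb, kept = agent_loop(n_runs=83)
```

The point of the sketch is the shape, not the numbers: the only state that survives an experiment is the baseline, and the only judge is a single scalar.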

Karpathy's example session ran 83 experiments, kept 15 improvements, and drove the val BPB from 0.9979 down to 0.9697. That's not the interesting part. The interesting part is what the human was doing during those 83 experiments: nothing. Or more precisely, the human wrote the initial instructions and then went to bed.

As Leor Alexander put it: "You don't write the training code anymore. You write a prompt that tells an AI agent how to think about research. The agent edits the code, trains a small model for exactly 5 minutes, checks the score, keeps or discards the result, and loops all night. No human in the loop."

The Ralph Wiggum Connection

Almost immediately, people started connecting Autoresearch to something called the Ralph Wiggum loop—a software development pattern that emerged a few months back. (Named after the Simpsons character for his indomitable persistence, which is honestly perfect.) The Ralph loop is basically this: run an AI coding agent in a loop, feed it a project spec, let it pick a task, implement it, run tests, and commit if everything passes. Then kill the agent and start fresh with a new one.

The genius of Ralph is that it solves the context window problem. Instead of letting the conversation history bloat until the model starts losing track, you deliberately restart. Memory doesn't live in the chat—it lives in the code, the git history, a progress.txt file. Each new agent instance bootstraps from those artifacts. The system is self-healing because state is externalized.
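A minimal sketch of the Ralph pattern, with the agent call mocked out: each iteration spawns a notionally fresh agent that bootstraps from two external artifacts, the spec and a progress file, rather than from chat history. The helper names and file layout are illustrative, not any particular tool's API.

```python
import tempfile
from pathlib import Path

def fresh_agent(spec: str, progress: str) -> str:
    # Stand-in for a brand-new agent instance with an empty context
    # window. It bootstraps purely from external artifacts (spec plus
    # progress log) and returns the task it "implemented".
    done = set(progress.splitlines())
    todo = [t for t in spec.splitlines() if t and t not in done]
    return todo[0] if todo else ""

def ralph_loop(spec: str, progress_file: Path, max_iters: int = 10) -> list[str]:
    for _ in range(max_iters):
        progress = progress_file.read_text() if progress_file.exists() else ""
        task = fresh_agent(spec, progress)   # kill and restart: no chat memory
        if not task:
            break                            # spec exhausted
        # (real version: run the test suite here, commit only if it passes)
        progress_file.write_text(progress + task + "\n")
    return progress_file.read_text().splitlines()

spec = "add login\nwrite tests\nupdate docs"
with tempfile.TemporaryDirectory() as d:
    log = ralph_loop(spec, Path(d) / "progress.txt")
```

Because every iteration rereads the progress file from disk, you can kill the process at any point and restart it without losing work, which is exactly the self-healing property described above.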

Y Combinator president Garry Tan made the connection explicit in a blog post: "Autoresearch didn't emerge from nothing. The same pattern—put an AI in a loop with clear success metrics—was already working in software development by mid-2025." What Karpathy added was applying this to scientific research itself, complete with the five-minute constraint that puts every experiment on equal footing.

Why This Might Actually Matter

Craig Huitt argued that the specific context of training LLMs isn't what matters. He called Autoresearch "the cleanest example of the agent loop that's about to eat everything." The pattern: human writes a strategy doc, agent executes experiments autonomously, clear metric decides what stays, repeat 100x overnight. "The person who figures out how to apply this pattern to business problems, not just ML research, is going to build something massive," he wrote.

And people are already trying. Vadim, CEO of Vugola, built a version for his entire company. The problem he identified: most agent setups output something and stop. "The agent writes an email, sends an email, generates code, done. The next time it runs, it starts from zero. No memory of what worked, no memory of what failed. Pure amnesia. That's not automation. That's a script you babysit."

His fix: every agent in the system reads and writes to a shared "learnings.md" file. Before starting work, read it. After completing work, append what you learned. One file, all agents read it, all agents write to it. Now they're not isolated processes—they're a network that accumulates knowledge.
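A sketch of that shared-memory pattern, with the actual work mocked out: `learnings.md` here is just a plain append-only text file, and `run_agent` is a hypothetical helper, not Vugola's implementation.

```python
import tempfile
from pathlib import Path

def run_agent(task: str, learnings: Path) -> str:
    # 1. Before starting work: read everything earlier agents learned.
    prior = learnings.read_text() if learnings.exists() else ""
    # 2. Do the work (mocked): the "lesson" just records the task and
    #    how much accumulated knowledge was available going in.
    lesson = f"{task}: done with {len(prior.splitlines())} prior lessons"
    # 3. After completing work: append the new lesson for future agents.
    with learnings.open("a") as f:
        f.write(lesson + "\n")
    return lesson

with tempfile.TemporaryDirectory() as d:
    shared = Path(d) / "learnings.md"
    for task in ("email-draft", "code-review", "ad-copy"):
        run_agent(task, shared)
    lessons = shared.read_text().splitlines()
```

Each agent run is still stateless on its own; the accumulation lives entirely in the file, so any number of processes can share it.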

Other examples people mentioned: cold email optimization (15 inboxes, 300 emails/day, agent modifies one variable per experiment, scores positive reply rate, keeps or discards, repeats). Ad creative loops (generate thousands of variations, test against live audiences in real time, keep what works, kill what doesn't). Supply chain routing. A/B testing copy. Job postings. YouTube thumbnails.

Where This Works (And Where It Doesn't)

Not every task is loop-ready. The source video maps this out on two axes: how automatable the evaluation is, and how fast you can iterate. The top-right quadrant, seconds-long iterations with fully automated scoring, is prime territory: code generation, game AI, ad bid optimization, algorithmic trading, LLM training. These are the obvious wins.

Bottom left quadrant? Political negotiation. Therapy. Anything where success is deeply subjective and feedback takes months. The loop doesn't help you there, or at least not in any obvious way.

But here's what's striking: the number of work processes that do have objective metrics and fast feedback is... a lot. Content moderation. Supply chain routing. Even some things that feel subjective can be scored if you're creative about it. Roberto Nixon suggested applying this to advertising: "Define success (purchases, app installs, whatever), set a budget, and press go. Everything else is automated. A campaign moves from fixed asset to a living organism, ever-evolving towards your stated goals."

The claim being made here—sometimes explicitly, sometimes implicitly—is that agentic loops might become a work primitive. Not a specialized tool for ML researchers, but something as fundamental and cross-functional as spreadsheets or email. Product managers kicking off a Ralph loop before dinner and reviewing the PR in the morning. Sales reps writing targeting criteria and letting a loop run overnight on 200 leads. Financial analysts defining constraints and looping through portfolio allocation backtests. Lawyers writing risk flag checklists and looping through vendor contracts.

What Gets Weird

Karpathy included a sci-fi caption with his original post: "One day, Frontier AI research used to be done by meat computers in between eating, sleeping, having other fun, and synchronizing once in a while using soundwave interconnect in the ritual of a group meeting. That era is long gone. Research is now entirely the domain of autonomous swarms of AI agents running across compute cluster megastructures in the skies."

It's tongue-in-cheek, but it gestures at something real. If your job becomes "write better memos for the agents," what are you actually doing? You're designing the arena. You're setting the objective function. You're deciding what winning looks like. Those are legitimately high-level skills—evaluation design, strategy articulation, knowing what to measure.

But it's also a fundamentally different relationship to the work. You're not doing the thing. You're setting up the conditions under which the thing gets done, then reviewing the results. Whether that's more interesting or less interesting probably depends on what gave you satisfaction in the first place.

The productization is already starting. On the same day Karpathy released Autoresearch, Claude Code creator Boris Cherny released loop/loop, "a powerful new way to schedule recurring tasks for up to 3 days at a time." By default, the heartbeat fires every 30 minutes, giving the agent a moment to wake up and check whether there's work to do.
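loop/loop's internals aren't described beyond the heartbeat, so here is a generic sketch of the idea: a scheduler that wakes on a fixed interval, checks for work, acts if there is any, and sleeps again. The helper names and the tiny interval are illustrative only.

```python
import time

def heartbeat(check_for_work, do_work, interval_s: float, max_beats: int) -> int:
    # Generic heartbeat loop: wake up, check whether there is anything
    # to do, act if so, then go back to sleep until the next beat.
    completed = 0
    for _ in range(max_beats):
        task = check_for_work()
        if task is not None:
            do_work(task)
            completed += 1
        time.sleep(interval_s)
    return completed

queue = ["send report", "rotate logs"]
done = []
n = heartbeat(
    check_for_work=lambda: queue.pop(0) if queue else None,
    do_work=done.append,
    interval_s=0.001,   # 30 minutes in the real product; tiny for the demo
    max_beats=5,
)
```

The design choice worth noticing is that most beats do nothing: the agent pays a small fixed cost to stay available, and work happens only when the check finds something.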

The machinery is being built. The question now is what people choose to build with it.

—Yuki Okonkwo

Watch the Original Video

Autoresearch, Agent Loops and the Future of Work

The AI Daily Brief: Artificial Intelligence News

21m 5s
Watch on YouTube

About This Source

The AI Daily Brief: Artificial Intelligence News

The AI Daily Brief: Artificial Intelligence News is a YouTube channel covering the latest developments in artificial intelligence. Since its launch in December 2025, it has become a daily resource for AI enthusiasts and professionals alike. Although the channel does not disclose its subscriber count, its commitment to daily coverage reflects its growing influence within the AI community.
