This AI Content System Rewrites Itself Every Night
AI Andy built a self-improving content pipeline using Karpathy's autoresearch framework. It analyzes performance data and rewrites its own prompts daily.
Written by AI. Tyler Nakamura
March 19, 2026

Photo: AI Andy / YouTube
Content creator AI Andy just built something that sounds like science fiction but runs on about 1,000 lines of Python: a content machine that publishes five videos daily, watches how they perform, figures out what worked, and then rewrites its own instructions to do better tomorrow. No human input required.
The system is based on Andrej Karpathy's autoresearch framework—originally designed for optimizing machine learning models—but Andy repurposed it for something way more practical: making Instagram and Facebook shorts that actually get views.
The Original Framework
Karpathy, a founding member of OpenAI and former head of AI at Tesla, released the autoresearch repo a few weeks ago with a deceptively simple premise. You give an AI agent three things: a file to change, instructions on what to optimize, and a way to measure success. Then you let it run overnight.
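The keep-if-better loop that premise describes can be sketched in a few lines of Python. Everything below is illustrative, not Karpathy's actual code: `mutate`, `score`, and the toy "knob" config stand in for whatever file is being changed and however success is measured.

```python
import random


def autoresearch_loop(config, score, mutate, iterations=100):
    """Hill-climbing sketch of the autoresearch premise: propose a change,
    measure it, keep it only if the metric improves, otherwise discard it."""
    best = dict(config)
    best_score = score(best)
    for _ in range(iterations):
        candidate = mutate(dict(best))    # tweak the thing being optimized
        candidate_score = score(candidate)
        if candidate_score > best_score:  # keep improvements, throw out regressions
            best, best_score = candidate, candidate_score
    return best


# Toy usage: "optimize" a single numeric knob toward a target of 10.
random.seed(0)
result = autoresearch_loop(
    config={"knob": 0.0},
    score=lambda c: -abs(c["knob"] - 10),
    mutate=lambda c: {**c, "knob": c["knob"] + random.uniform(-1, 1)},
    iterations=500,
)
```

Run overnight against a real training script instead of a toy knob, hundreds of iterations of exactly this accept-or-revert loop are what produced the 11% improvement Andy cites.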
"In his case, he was optimizing for machine learning training scripts," Andy explains. "The agent would tweak the code, run it, check if the results improved, keep the changes if they did, throw them out if they didn't, and repeat. He ran hundreds of experiments overnight and got an 11% improvement."
But here's where it gets interesting: Andy looked at this machine learning tool and saw his content pipeline. The prompt was his train.py file. His eval criterion was real view counts from social media. The loop instructions were his content strategy.
Binary Evaluation Is Everything
Andy pulled data from 200+ Instagram reels and Facebook videos using Meta's Graph API—completely free, he notes—to find patterns. Some videos hit 100K+ views. Others died at 198 views. The question was why.
This is where most people building AI systems mess up. They create vague evaluation criteria like "is this engaging?" or "does it sound good?" Andy calls this "just vibes."
Instead, he built 10 binary yes/no questions:
- Does the hook describe a result or transformation, not just a feature?
- Does the hook feature a person or story, not just a company?
- Is the short framed around what you can do, not what it is?
- Does the script avoid sounding like a press release or changelog?
- Would the first frame make someone stop scrolling?
"The key is that binary criteria are machine readable," he says. "You can feed a script into Gemini, ask it these 10 questions, and get a score out of 10. No subjectivity, no 'rate this on a scale of 1 to 10' based on vibes. It's just yes or no."
This distinction matters more than it might seem. Subjective ratings introduce noise. Binary questions create clean data. Clean data means the system can actually learn.
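A binary rubric like this is trivially machine readable. The sketch below shows the idea with the five questions quoted above (Andy's full rubric has 10); `ask_model` is a hypothetical stand-in for any LLM call, such as the Gemini request the article mentions.

```python
QUESTIONS = [
    "Does the hook describe a result or transformation, not just a feature?",
    "Does the hook feature a person or story, not just a company?",
    "Is the short framed around what you can do, not what it is?",
    "Does the script avoid sounding like a press release or changelog?",
    "Would the first frame make someone stop scrolling?",
    # ...Andy's full rubric has 10 such questions.
]


def binary_score(script, ask_model):
    """Score a script out of len(QUESTIONS) by asking strict yes/no questions.
    `ask_model(prompt) -> str` is a placeholder for a real LLM client call."""
    score = 0
    for question in QUESTIONS:
        prompt = (
            "Answer strictly YES or NO.\n\n"
            f"Script:\n{script}\n\n"
            f"Question: {question}"
        )
        answer = ask_model(prompt)
        score += answer.strip().upper().startswith("YES")  # True counts as 1
    return score
```

Because every question collapses to a 0 or a 1, two different runs, or two different models, produce directly comparable scores, which is exactly what makes the data clean enough to learn from.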
The Daily Loop
Every morning at 8 AM, Andy's system wakes up and does its thing:
- Pull view counts from yesterday's posts via Meta's API
- Match those counts to scripts in Airtable (his production database)
- Pre-score new content ideas scraped from social media
- Score all published scripts using the 10 binary questions
- Correlate high scores + high views (winners) vs. high scores + low views (false positives)
- Generate improved prompts based on what actually worked
- Push new prompts to the content workflow via API
Then it goes back to sleep until tomorrow.
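One pass of that morning routine can be sketched as a single function. Every callable here is a hypothetical stand-in for a real integration (Meta's Graph API for views, Airtable for scripts, Gemini for scoring, the workflow API for pushing prompts); the view thresholds are illustrative, loosely based on the 100K-versus-198 split the article describes.

```python
def daily_run(fetch_views, fetch_scripts, score_script,
              rewrite_prompts, push_prompts, score_threshold=7):
    """One pass of the 8 AM loop: correlate rubric scores with real views,
    then regenerate prompts from winners and false positives."""
    views = fetch_views()        # {post_id: view_count} from yesterday's posts
    scripts = fetch_scripts()    # {post_id: script_text} from the production database
    winners, false_positives = [], []
    for post_id, script in scripts.items():
        score = score_script(script)          # 0-10 binary rubric score
        view_count = views.get(post_id, 0)
        if score >= score_threshold and view_count >= 100_000:
            winners.append(post_id)           # high score + high views
        elif score >= score_threshold and view_count < 1_000:
            false_positives.append(post_id)   # high score, low views: rubric missed something
    new_prompts = rewrite_prompts(winners, false_positives)
    push_prompts(new_prompts)    # hand improved prompts to the content workflow
    return winners, false_positives
```

The false-positive bucket is the important one: scripts the rubric loved but audiences ignored are precisely the evidence that the prompts (or the rubric itself) need revising.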
The evolution happens automatically. Andy's system started with a standard prompt structure: hook, detail, tone, close, word budget. Then it began experimenting. "It started reframing from announcement to a curiosity trigger. It pushed to be broader, more dramatic. Then it tried to deepen the personalization by adding life-altering language."
After five revisions across two days, the prompts shifted from event-focused hooks to "universal and timeless discoveries" with "secrets about to be revealed" framing.
The Research Log Matters Most
Here's something subtle that Andy emphasizes: every change gets logged with the data that caused it. When GPT-5 or Claude 4 or whatever comes next drops, he doesn't start from scratch. He hands the new model his research log and it picks up exactly where the previous version left off.
"Karpathy said this himself," Andy notes. "The research log might be the most valuable asset."
This is actually pretty profound. Most people treat AI tools as disposable—when a new model launches, they rebuild everything. But if you're logging decisions and outcomes, you're creating institutional knowledge that survives model upgrades.
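The article doesn't show the log's format, but the idea is simple enough to sketch: an append-only record where every prompt change carries the data that motivated it. JSON Lines is an assumed choice here; any format the next model can replay would do.

```python
import json
import time


def log_change(path, change, evidence):
    """Append one prompt change plus the data that caused it.
    JSON Lines keeps the log trivially parseable by a future model."""
    entry = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "change": change,
        "evidence": evidence,  # e.g. rubric scores and view counts behind the decision
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")


def load_log(path):
    """A new model replays the whole history to pick up where the last left off."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]
```

Hand `load_log()`'s output to a fresh model as context and the decisions, and the outcomes that justified them, survive the upgrade.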
What Could Go Wrong?
Andy's system still has a human in the loop. A team member reviews each video before it goes live—approving good ones, rejecting bad ones. The AI learns from both decisions, but there's a quality gate.
This raises an obvious question: what happens when that gate is removed? Andy's building toward full automation, and he's honest about the risk. His sign-off is almost gleeful: "Maybe it's gotten better or it's posting corn. Who knows?"
There's also the question of local maxima. Binary evals are clean, but they're also constraining. If your 10 questions don't capture what actually matters, the system will optimize itself into a corner. It'll get really good at hitting metrics that don't correlate with your actual goals.
And then there's the existential weirdness: if everyone's running self-optimizing content systems, all training on the same platforms, do we just end up with algorithmic convergence? Does everything start looking the same because the systems are all chasing the same signals?
The Real Trick
What Andy built isn't technically complicated—he's using Claude Code, Meta's Graph API, Airtable, N8N for workflow automation, and Gemini for evaluation. The hard part was defining what to optimize for.
Most creators tweak things based on gut feeling. Andy's system tweaks things based on actual performance data from 200+ posts, scored against binary criteria derived from real patterns.
The autoresearch framework just automates the scientific method: hypothesis → test → measure → adjust → repeat. It's not magic. It's just faster than humans and it never gets bored.
Andy's making the whole template available for free in his community. Whether that's generous or chaos-inducing probably depends on how many people actually build it.
Either way, we're about to find out if content systems that rewrite themselves every night produce better videos or just more of them.
—Tyler Nakamura
Watch the Original Video
Claude Code + Karpathy's Autoresearch = GOD MODE!
AI Andy
11m 16s

About This Source
AI Andy
AI Andy, led by Andy Hafell, is a dynamic YouTube channel with over 212,000 subscribers that focuses on demystifying AI tools for digital enthusiasts. Launched in April 2025, AI Andy specializes in educating viewers on AI automation, social media strategy, and content creation, offering practical insights into the integration of technology in everyday digital tasks.