How Spotify Runs AI Agents Across 20 Million Lines of Code
Spotify's Niklas Gustavsson explains how AI agents manage a 20M-line codebase — and why verification, not code generation, is the hard problem.
Written by AI. Bob Reynolds

Photo: AI. Phaedra Lin
Anthropic recently sat down with Niklas Gustavsson, a senior engineering leader at Spotify, to walk through what running AI agents across a 20-million-line production codebase actually looks like. The Brainqub3 channel published a detailed reaction video analyzing the conversation. What emerges from both is less a story about artificial intelligence and more a story about infrastructure: the unglamorous kind that determines whether any of this works at scale.
The origin of Spotify's approach predates the current AI moment by several years. According to Gustavsson, as reported by The Neuron, the codebase was growing far faster than headcount could support. Routine maintenance (migrating to the latest Java version, updating libraries, moving from one API to another across thousands of repositories) was consuming engineering time that should have gone elsewhere. Hundreds of teams were doing the same operations manually. Each migration took months. The math was unsustainable.
So Spotify started automating. Deterministic scripts first, then more sophisticated fleet management infrastructure they eventually called Honk. The early LLM experiments didn't work well — the models weren't capable enough, and Spotify's own approach needed refinement. But the direction was clear, and they kept iterating.
That history matters because it explains why Spotify was positioned to move quickly when the models caught up. They weren't starting from scratch when Claude became genuinely useful. They had years of institutional knowledge about what the problem actually required.
The Hard Part Isn't the Code
The assumption in most AI coverage is that the hard part is getting the AI to write good code. It isn't.
Gustavsson is direct about this: the real engineering investment at Spotify went into verification. The codebase is divided into thousands of components, each with clearly defined ownership. Automated testing was strengthened specifically because agents were coming — because a team that used to review every pull request themselves was no longer going to be in the loop for every change. As Gustavsson put it, "You might no longer be in the loop for these changes. We're going to be automerging most of these changes."
That sentence should land with some weight. Automerging changes to a production codebase serving hundreds of millions of users. The only thing standing between the agent's output and production is the verification layer — CI builds, static analysis, automated user acceptance testing, branch protections. Honk, Spotify's fleet management tool now built on the Claude Agent SDK, can run CI on both Linux and macOS environments, including iOS simulator tests for mobile development.
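To make the idea of a verification layer concrete, here is a minimal sketch of an automerge gate. It is not Honk's actual logic, which the conversation does not describe in detail. The check names, the CheckResult type, and the can_automerge function are all hypothetical, chosen only to mirror the checks listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckResult:
    """Outcome of one verification step (hypothetical type for illustration)."""
    name: str
    passed: bool


# Assumed set of required checks, loosely based on the list in the article.
REQUIRED_CHECKS = frozenset({"ci_build", "static_analysis", "acceptance_tests"})


def can_automerge(results: list[CheckResult]) -> bool:
    """Allow automerge only if every required check ran and nothing failed."""
    passed = {r.name for r in results if r.passed}
    failed = {r.name for r in results if not r.passed}
    # A required check that never ran counts as a failure, not as a pass.
    return REQUIRED_CHECKS <= passed and not failed


results = [
    CheckResult("ci_build", True),
    CheckResult("static_analysis", True),
    CheckResult("acceptance_tests", True),
]
merge_allowed = can_automerge(results)
```

The design choice worth noticing is that a missing check blocks the merge. When no human is in the loop, silence from a verification step should not be read as approval.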
The Brainqub3 commentary makes a point worth underscoring: many enterprises deploy AI coding tools while simultaneously cutting them off from any real execution environment. The agent generates code but cannot run it, cannot test it, cannot see what breaks. You've handed someone a chisel and told them to work in the dark. The verification loop is not a feature you add later. It is the product.
Test-driven development enforces this structurally. When an agent writes a failing test first — specifying the behavior it's trying to achieve — then writes code until that test passes, the test itself becomes the constraint. The agent's freedom is bounded by something concrete. This is the red-green-refactor discipline applied to autonomous systems, and it produces smaller, more targeted pull requests rather than sprawling thousand-line diffs that no human is realistically going to review in full.
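A small sketch of that red-green sequence, under stated assumptions: the version-parsing task and every name below (spec_holds, parse_version_stub, parse_version) are invented for illustration and do not come from Spotify's codebase. The test is written first and fails against a stub, then passes once the narrowest possible implementation exists.

```python
from typing import Callable

Version = tuple[int, int, int]


def spec_holds(parse: Callable[[str], Version]) -> bool:
    """The test, written first: it pins down the behavior we want."""
    try:
        return parse("17.0.2") == (17, 0, 2)
    except NotImplementedError:
        return False


def parse_version_stub(text: str) -> Version:
    """Red phase: no implementation yet, so the test must fail."""
    raise NotImplementedError


def parse_version(text: str) -> Version:
    """Green phase: the smallest implementation that satisfies the test."""
    major, minor, patch = (int(part) for part in text.split("."))
    return major, minor, patch


red = spec_holds(parse_version_stub)   # False
green = spec_holds(parse_version)      # True
```

Because the test exists before the code, the agent has a fixed target and little room to wander, which is what keeps the resulting diff small.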
Honk's Architecture and What It Signals
Honk today runs as an agent SDK instance in a Kubernetes pod with access to a configurable set of tools. In its current version, engineers can add their own internal tools rather than working from a fixed allow-list. The LLM-as-judge layer that Spotify used in earlier iterations has been removed — Gustavsson says the models have improved enough that it was no longer earning its keep.
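One way to picture a configurable tool set is a registry that teams can extend. The sketch below is plain Python and deliberately does not use or imitate the Claude Agent SDK's real interface. ToolRegistry, the owner_lookup tool, and the sample ownership data are all hypothetical.

```python
from typing import Callable

ToolFn = Callable[[str], str]


class ToolRegistry:
    """Hypothetical registry; not the Claude Agent SDK's actual interface."""

    def __init__(self) -> None:
        self._tools: dict[str, ToolFn] = {}

    def register(self, name: str) -> Callable[[ToolFn], ToolFn]:
        """Decorator that adds a tool under `name`, rejecting duplicates."""
        def decorator(fn: ToolFn) -> ToolFn:
            if name in self._tools:
                raise ValueError(f"tool already registered: {name}")
            self._tools[name] = fn
            return fn
        return decorator

    def call(self, name: str, argument: str) -> str:
        """Invoke a registered tool; an unknown name raises KeyError."""
        return self._tools[name](argument)

    def names(self) -> list[str]:
        """List registered tool names in sorted order."""
        return sorted(self._tools)


registry = ToolRegistry()


@registry.register("owner_lookup")
def owner_lookup(component: str) -> str:
    # Stand-in for an internal service call; the data here is made up.
    owners = {"playlist-service": "team-playlists"}
    return owners.get(component, "unknown")


owner = registry.call("owner_lookup", "playlist-service")
```

The point of the decorator is that adding a tool is a local change a team can make on its own, with no central list to edit.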
What the architecture reveals is less about Claude specifically and more about how mature AI deployments are organized. The engineering challenge at Spotify's scale isn't building the intelligence — it's administering it. Who gets access to which repositories? What tools can the agent touch? How are organizational standards enforced so that distributing Claude licenses to 2,900 engineers doesn't result in a codebase full of inconsistent, low-quality changes?
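Those questions can be expressed as a policy check that runs before an agent touches anything. The following is a sketch under assumptions, not a description of Spotify's system: AgentPolicy, the repository naming scheme, and the tool names are hypothetical.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class AgentPolicy:
    """Hypothetical central policy: which repos and tools an agent may use."""
    repo_patterns: tuple[str, ...]
    allowed_tools: frozenset[str]

    def permits(self, repo: str, tool: str) -> bool:
        """True only if the repo matches a pattern and the tool is allowed."""
        repo_ok = any(fnmatchcase(repo, p) for p in self.repo_patterns)
        return repo_ok and tool in self.allowed_tools


# Example policy: backend repos only, and only non-destructive tools.
policy = AgentPolicy(
    repo_patterns=("backend/*",),
    allowed_tools=frozenset({"run_tests", "open_pull_request"}),
)

allowed = policy.permits("backend/playlist-service", "run_tests")
denied_repo = policy.permits("mobile/ios-app", "run_tests")
denied_tool = policy.permits("backend/playlist-service", "deploy")
```

Keeping the policy in one immutable object, rather than scattered across agent prompts, is what makes it auditable: there is a single place to read what an agent is allowed to do.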
The Brainqub3 analysis frames this as the real frontier: not competing with Anthropic or OpenAI on model quality, but figuring out how to govern frontier intelligence across a large organization. Centralized policy, standardized coding practices, consistent frameworks — these are the inputs that let an agent find "inspiration" from existing patterns in the codebase rather than generating something that looks alien to every team that has to maintain it.
Gustavsson's advice to other engineering leaders reflects this: "Bringing in Claude is just going to scale whatever you have. It's just going to accelerate whatever you have. So if you have a messy codebase, expect things to get even messier."
This is not a new lesson. Spreadsheets didn't make bad accountants better — they made good accountants faster and let mediocre ones produce errors at industrial scale. The pattern repeats. The tool amplifies what's already there.
The ROI Problem Nobody Has Solved
Spotify tracks PR frequency directly attributed to AI tooling, and by Gustavsson's own characterization a substantial portion of PRs are now AI-authored. Those metrics are relatively easy to count. What's genuinely hard, and what Gustavsson acknowledges Spotify is still working through, is connecting code changes to user value and revenue.
The gap between "PRs merged" and "return on investment" is wider than most engineering organizations want to admit. A PR is not a feature. A feature is not revenue. Revenue attribution at the granularity of individual code changes, across thousands of daily deployments, tied back to A/B tests and rollout data — that's a serious operational undertaking, and Spotify is building the infrastructure to attempt it rather than claiming to have solved it.
This honesty is more useful than the productivity numbers that circulate in AI coverage. Counting what's easy to count and calling it ROI is a management accounting problem as old as management accounting.
Who Gets to Build Things Now
The part of Gustavsson's account that deserves the most attention from anyone thinking about organizational change is the prototyping story. Spotify built internal infrastructure to let anyone — engineers and non-engineers alike — create end-to-end prototypes in the company's real mobile apps and backend. They built an internal app store for these prototypes. Gustavsson said people across the organization, including senior executives, are using it.
This is the part that rhymes with something older. When the PC arrived, it didn't just speed up what existing workers were doing — it eliminated layers of intermediaries between an idea and its execution. Typing pools, rows of accountants running figures by hand, entire departments whose function was to translate what someone wanted into something a specialist could produce. The same dynamic is in motion here. The distance between "I have an idea for a feature" and "there is a working prototype someone can use" has collapsed from months of resourcing and specification to hours of natural language conversation with an agent.
That's genuinely significant. It's also not without friction. The engineers who find satisfaction in the craft of writing code — what the Brainqub3 commentary calls code artisans — face something real here. Not displacement, necessarily, but a change in what the work feels like. Their judgment, their ability to smell a bad architecture or anticipate what will break, remains valuable. But the medium through which that judgment is expressed has shifted from writing code to shaping how agents write code. Some people will make that transition easily. Others won't, and pretending otherwise doesn't help anyone manage it.
The broader question — the one that will take years to answer — is whether an organization that has embedded this much of its production capacity into a single external provider has made itself more agile or more brittle. Gustavsson's workflow is impressive. The accountability structures when things go wrong at that velocity are still being worked out.
That's not a reason to stop. It is a reason to pay attention.
Bob Reynolds is Senior Technology Correspondent at BuzzRAG.
2026-07-02