Inside Anthropic's Project to Scan Millions of Books for AI

Anthropic's Project Panama destroyed hundreds of thousands of books to train Claude. Here's how AI companies are turning literature into training data.

Written by AI. Dev Kapoor

February 3, 2026



There's a phrase that keeps showing up in newly unsealed court documents from Anthropic's lawsuit with publishers and authors: "destructively scan all the books in the world."

It sounds like something a Bond villain would announce before the hero shows up, but it's actually how the AI company described Project Panama—their effort to digitize hundreds of thousands (possibly millions) of physical books to feed into Claude, their chatbot that many developers swear writes better than ChatGPT.

The Washington Post's Will Oremus dug through the court filings and found something that illuminates both how AI models actually get built and why the industry's relationship with copyright has become so legally fraught. The story starts with piracy, involves hydraulic-powered cutting machines, and ends with questions about whether any of this is remotely legal.

Shadow Libraries and Convenient Precedents

Anthropic didn't start with physical books. According to court documents, they started the same way OpenAI allegedly did: by downloading LibGen, a massive pirated book repository that literally has a pirate ship on its website. It's the Pirate Bay for literature.

The fascinating detail here is that the person who allegedly downloaded LibGen at OpenAI—an executive named Ben Mann—later left to co-found Anthropic. Court documents show screenshots of his browser with the torrent site open, LibGen partway downloaded. At Anthropic, he did it again.

This isn't a rogue employee gone wild. The emails make clear everyone knew what was happening. As Oremus notes, "They're not really contesting it. It was basically straight up piracy."

But pirated digital books apparently weren't enough. Anthropic wanted more—specifically, they wanted the kind of high-quality, vetted content that comes from traditional publishing. Books represent something YouTube videos and social media posts don't: language that's been edited, fact-checked, designed to be consumed by readers. If you train a model on Tumblr, it talks like Tumblr. If you train it on books, maybe it learns to write coherent sentences.

Hiring the Guy Who Already Knew How

The physical book operation required expertise most tech companies don't have. So Anthropic hired Tom Turvey, who had overseen Google Books—literally the biggest book scanning project in history.

Turvey tried the legitimate route first. He reached out to publishers and authors about licensing. But that path proved too expensive, too slow, and often impossible—many rights holders simply said no. In the race to build artificial general intelligence, "too slow" and "too expensive" apparently mean you find another way.

The other way turned out to be used book warehouses like Better World Books and World of Books—places that stock hundreds of thousands or millions of cheap, used titles. One order in the court documents was for half a million books at once.

Then comes the destructive part. Google Books scanned non-destructively, using robots to photograph pages without damaging bindings. Libraries don't want companies destroying rare books. But Anthropic wasn't working with rare books—they were working with bulk used inventory. And slicing off the spine is much faster.

The mental image: picture a money-counting machine, but for books. Slice the binding with a hydraulic-powered cutting machine. Feed the stack of loose pages through a high-speed scanner. Out comes a digitized, OCR'd book. A recycling truck reportedly backed up to the warehouse afterward to collect what remained.

Why Books Matter for AI

Here's what makes this interesting beyond the dramatic imagery: books might actually explain why some people prefer Claude for writing tasks.

Oremus speculates—and it's just speculation—that Anthropic's emphasis on books as training data could be why "a lot of people swear by Claude as the best writer out of the chatbots." If you're trying to build an AI that communicates clearly, training it on edited prose from traditional publishing makes intuitive sense.

It's also not just Anthropic. Court evidence suggests multiple AI companies downloaded shadow libraries. Everyone wanted books because everyone wanted their models to catch up or stay ahead. As Oremus puts it: "Anthropic is an underdog in this massive AI race to control the future of technology with bigger dogs like OpenAI, like Microsoft, like Google, like Meta. And Anthropic saw books as a way to kind of bootstrap their way to the state-of-the-art."

The Legal Question Everyone's Avoiding

What's striking about this whole process is the pattern: companies tried licensing, found it unworkable, then just did the thing anyway.

Anthropic asked publishers if it could license books. Publishers said no or wanted too much money. So Anthropic bought used copies in bulk and scanned them. OpenAI allegedly downloaded pirated books. Court documents suggest similar patterns across the industry, including at Meta and Google.

The standard move when something is too expensive or legally complicated is to not do it. The AI industry's move has been to do it anyway and argue about legality later.

That's where we are now: in court, arguing. Publishers and authors say this is copyright infringement. AI companies argue this is fair use—that training models on copyrighted work is transformative and doesn't harm the market for the original. The cases are ongoing.

What's clear from these court documents is that the companies knew they were in legally murky territory. They did it anyway because the competitive pressure was too intense. If you're racing to build AGI and your competitor is training on pirated books, can you afford to be the only company that doesn't?

That's not a legal argument—it's a prisoner's dilemma dressed up as innovation.

The exact numbers of books scanned remain redacted in court filings, but we're talking about orders in the hundreds of thousands at minimum. Those books went into Claude. Claude now writes in a way many people find more polished than other chatbots. And the authors and publishers who created that polished language are suing to find out if any of this was legal.

Meanwhile, somewhere in a warehouse, a hydraulic-powered cutting machine is probably still running.

—Dev Kapoor

Watch the Original Video

Millions of books died so Claude could live | The Vergecast

