AI Training Data: The Legal Vacuum No One Is

Every AI model has a name. ChatGPT. Claude. Gemini. Alex Reisner, a staff writer at The Atlantic who has spent the better part of two years reverse-engineering how these systems were built, has a different proposal: name them after what's actually inside them.

"These models have names like ChatGPT and Claude," Reisner told the Vergecast this week, "but I think you could make an argument that the right name for a model is actually the description of the data it was trained on — because that is the description of its capabilities. That's what it can output."

It's a useful provocation. And it surfaces the question that the AI industry has worked methodically to prevent anyone from answering: what, exactly, is inside these things?

The Infrastructure of Opacity

Reisner's investigation traces a supply chain most people have never thought about. Before any model generates a word, a song, or an image, someone had to assemble the raw material — the books, blog posts, YouTube videos, Reddit threads, and music recordings that the model learns from. That assembly is not incidental. It is, Reisner argues, the central act that determines what the model can do and whose work it has consumed.

The companies treat that information as proprietary. The competitive-advantage argument is real: if Anthropic's data-curation choices are genuinely superior to OpenAI's, disclosing them would hand rivals a roadmap. But Reisner identifies a second, less flattering reason for the secrecy. "They have gone about acquiring a lot of this data in ways that the people who created the data — the authors of the books and the creators of the videos and the music — would not be happy about," he told the Vergecast. "And in a lot of cases, they just don't know that their work is being used."

The research-paper pipeline, which once offered a partial window into company practices, has narrowed. Reisner describes an AI industry that resembles academia in one specific way — researchers want publication credits — but the lawyers have been tightening what can be disclosed. "It's not 2021 anymore," he observed. "The research papers read very differently now than they did a few years ago." He still gleans material from them: a Google paper, he noted, disclosed training on "tens of millions of songs." But the trajectory is toward less, not more, transparency.

Common Crawl and the Laundering Layer

The plumbing of this system runs through organizations most people have never heard of. Common Crawl, a nonprofit that has been scraping the public web since roughly the late 2000s and publishes updated datasets monthly, provided the foundation for virtually every early large language model. Reisner is careful about the nonprofit's self-presentation: Common Crawl's executive director, Rich Skrenta, has said that websites refusing AI crawlers are essentially opting out of the internet's future, a posture that sits somewhere between business argument and threat.

But Common Crawl is only part of the story. Reisner describes what he calls a "data laundering network" in which AI companies route data acquisition through universities and nonprofits, then disclaim direct responsibility for what gets scraped. The canonical structure: a university downloads millions of images or articles on behalf of an AI company, and the company says the work was academic in nature.

One organization worth naming precisely here is LAION — the Large-scale Artificial Intelligence Open Network, a German nonprofit — which Reisner's reporting identifies as having assembled a dataset of approximately 12 million songs drawn from YouTube. LAION has faced copyright scrutiny in Europe that goes beyond hypothetical: German and UK courts have been working through questions of whether training-data scraping at commercial scale qualifies as research or as reproduction requiring a license. The EU's AI Act, which entered into force in 2024, includes transparency obligations requiring general-purpose AI providers to publish summaries of training data used — a provision the industry has been lobbying to interpret as narrowly as possible. Whether LAION-assembled datasets satisfy those disclosure requirements is a live question in Brussels, not a settled one.

YouTube's role in all of this is substantial and almost comically unpoliced. Reisner notes that YouTube content is particularly attractive to AI developers because the platform's relative lack of DRM makes downloading straightforward in ways that, say, Spotify is not. YouTube has stated that downloading its content violates its terms of service. It has not, in any meaningful operational sense, stopped it from happening. The question of whether YouTube's terms-of-service language creates any legal exposure for AI companies — or whether Google, YouTube's parent, has its own reasons for not pressing the issue aggressively — has not been resolved.

What Courts Are Actually Doing

The policy vacuum Reisner describes is not absolute. It is being negotiated, unevenly and expensively, through litigation.

The most consequential active cases involve publishers and authors suing OpenAI and other companies for copyright infringement. The New York Times v. OpenAI suit, filed in the Southern District of New York in December 2023, is the highest-profile stress test of whether training on copyrighted material constitutes infringement — and whether the fair use doctrine, which has historically protected transformative uses, extends to commercial AI training at scale. OpenAI's core defense is that training is transformative; the Times argues that the outputs compete directly with the original works. A ruling that goes against OpenAI would not just affect one company — it would destabilize the legal foundation on which every major model was built. The case is in discovery.

Separately, a consolidated group of author suits, including cases brought by John Grisham, George R.R. Martin, and other Authors Guild members, is working through similar questions in the Southern District of New York. Music industry litigation against AI audio companies Suno and Udio, filed in mid-2024 by the major labels, raises the same underlying question in a different medium: does ingesting copyrighted recordings to train a generative model constitute reproduction requiring a license?

Congress, characteristically, has watched. Several bills have gestured toward the problem. The AI Transparency Act proposed mandatory disclosure of training data sources. The TRAIN Act (Training Real AI with Integrity Now, an acronym that tells you everything about how these bills get named) would require AI companies to obtain consent before using copyrighted works for training. Neither has advanced past committee. Legislation on data privacy more broadly has been similarly stalled for years, which is part of why the courts are carrying so much weight right now.

The Synthetic Data Escape Hatch That Doesn't Exist

The industry has a prepared answer to the data problem: synthetic data. If models can generate new content, why not train the next generation on that content, eliminating the need for human-created material entirely?

Reisner is unambiguous about what the research actually shows. "I don't think there's any evidence that that actually works," he said. "There's a lot of research out there on a phenomenon called model collapse — what happens when you train a model on its own outputs. It very quickly degrades." His explanation is intuitive: AI functions as a statistical averaging machine. Feed it averages of averages and the output converges toward a kind of content gruel — all the rough edges, the idiosyncrasies, the genuine strangeness of human-created work flattened out. Synthetic data, in this account, doesn't replace human creative output. It demonstrates why that output was irreplaceable in the first place.

This matters for the policy debate because synthetic data has been the industry's implicit answer to the consent problem. If you don't need human data anymore, you don't need to negotiate with humans for it. That answer is not available.

The Jaron Lanier Problem

The conversation that Reisner wants to have, about who owns data, who should profit from it, and what legal structure would enforce that, is not new. He cites Jaron Lanier, whose data-dignity arguments (developed across essays and interviews from the mid-2000s onward, crystallized in You Are Not a Gadget in 2010) held that individuals should be compensated for the data their online activity generates. The argument was dismissed as fringe when it was made. The AI industry's current arrangement, in which a musician's recorded catalog can be scraped, used to train a competing music generator, and then returned to the musician as a tool to replace her, is simply the problem that argument described, at industrial scale.

The market has started adapting without waiting for policy to catch up. Reisner reports that after publishing his music training data investigation, he heard from a company claiming to have paid creators more than $10 million to produce content specifically for AI training purposes. An economy of humans writing and composing for AI audiences is already forming. Whether that resolves the underlying equity problem, or simply creates a two-tier system where some creators get paid and most don't, is exactly the kind of question that requires a legal framework to answer, and the U.S. currently lacks one.

The Southern District of New York will issue a ruling in Times v. OpenAI before any relevant legislation passes. That ruling, more than any statement from any company, will define what the rules actually are. Watch for it.

Samira Barnes covers technology policy and regulation for Buzzrag.