💼 Labor & AI

AI Is Training on AI. The Math Says That Ends Badly.

52% of new internet content is now AI-generated. Models trained on their own output degrade irreversibly within 5 generations, losing the tail distributions that represent minority voices and rare knowledge. The internet before November 2022 is a finite resource, and companies are already buying it.

By Nadia Kovac · Labor & AI Policy · March 19, 2026 · ☕ 9 min read

[Image: Abstract ouroboros of digital data streams consuming itself, representing AI model collapse]

On November 30, 2022, something happened to the internet that nobody fully appreciated at the time. ChatGPT launched. Within two months, 100 million people were using it. Within a year, its output was everywhere: blog posts, product descriptions, social media comments, news summaries, student essays, customer service replies, legal filings, dating profiles.

By May 2025, a study by Graphite and Surfer analyzing 65,000 English-language articles found that 52% of newly published web articles were AI-generated. Before ChatGPT, that figure was barely 10%.

Here's why that matters. AI models learn from the internet. And now the internet is mostly AI.

The Ouroboros

In June 2024, a team led by Ilia Shumailov at Oxford published a paper in Nature that should have caused more panic than it did. They proved, mathematically and experimentally, that AI models trained on data generated by other AI models degrade irreversibly. They called it model collapse.

Here's what happens. Take a large language model and train it on regular internet text. It learns to produce sentences that roughly match the distribution of human writing. Now take that model's output, mix it back into the training data, and train a new model on the combined set. Repeat.

By generation 3, the new model starts losing its grip on uncommon words, unusual phrasings, and rare topics. By generation 5, the tail distributions are gone. Not diminished. Gone. The model converges on a narrow band of high-probability outputs and loses the ability to represent anything outside of it.

Shumailov's team demonstrated this across three different architectures: large language models (OPT-125m), variational autoencoders, and Gaussian mixture models. Same result every time. It's not a quirk of one approach. It's a mathematical consequence of compounding three errors: finite sampling, function approximation, and model expressivity limits. Each generation amplifies the distortions of the last.

"Early model collapse" is when minority and edge-case data disappears. "Late model collapse" is when the output converges to a single mode and becomes, functionally, noise. Both stages are irreversible. You can't recover the lost tails by training longer or scaling up.

The Contamination Is Already Majority

This might matter less if AI-generated content were a small fraction of the web. It is not.

Graphite's detection methodology classified 65,000 articles published between January 2020 and May 2025. The trajectory: under 10% before 2023, crossing 40% in 2024, hitting 52% in May 2025. Europol's widely cited 2022 forecast that 90% of online content could be AI-generated by 2026 looked alarmist at the time. Given the current trajectory, it looks conservative.

But the contamination goes deeper than web pages. The humans being paid to label training data are also using AI.

A 2023 study from EPFL, reported by MIT Technology Review, found that 33% to 46% of Amazon Mechanical Turk workers were using ChatGPT to complete AI training tasks. These are the people specifically hired to provide the "human" in human-labeled data. They were automating themselves. And the companies buying their output had no reliable way to detect it.

So the training pipeline is compromised at two levels. The internet data is majority AI-generated. And the human annotation layer meant to validate quality is itself AI-contaminated. The ouroboros isn't a metaphor anymore. It's the actual architecture.

300 Trillion Tokens and Counting Down

Epoch AI estimated in 2024 that roughly 300 trillion tokens of quality human-generated text exist on the public internet. That sounds enormous. It is not.

Modern training runs are aggressive. Meta's Llama 3-70B was trained on 15 trillion tokens, overtrained by a factor of 10x relative to its parameter count. Frontier models in 2026 are training on 30-50 trillion tokens per run. At these scales, with current growth in model size and the standard practice of overtraining by 10-100x, Epoch estimates the stock of quality human text will be fully utilized between 2026 and 2032. Not "we'll run low." Fully utilized. Every useful token, scraped.
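Epoch's "fully utilized" window can be reproduced as back-of-envelope arithmetic. The starting demand and growth rate below are illustrative assumptions, not Epoch's actual model:

```python
# Back-of-envelope sketch: a fixed stock of human text versus
# compounding per-run token demand. All numbers are assumptions.
STOCK = 300e12   # ~300 trillion tokens of quality human text
demand = 15e12   # assume a frontier run consumes ~15T tokens in 2024
growth = 1.5     # assume per-run demand grows ~1.5x per year

year, used = 2024, 0.0
while used + demand < STOCK:
    used += demand
    demand *= growth
    year += 1

print(f"stock fully utilized around {year}")  # → around 2029
```

Under these assumptions exhaustion lands in 2029, inside Epoch's 2026-2032 range; nudging the growth rate shifts the year but not the conclusion that the window is measured in single-digit years.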

And that's only counting the clean data. As AI-generated content floods the web, the fraction of those 300 trillion tokens that's actually human-written shrinks retroactively. You can't easily tell which 2024-era blog post was written by a person and which was written by GPT-4 with a human's name on it. The contamination is unlabeled.

The Land Rush for Pre-2022 Archives

Some companies figured this out early. Reddit signed a $60 million per year deal with Google in early 2024 for access to its archive of user-generated content. Months later, Reddit signed a separate deal with OpenAI, reportedly worth around $70 million per year. Stack Overflow licensed its developer Q&A corpus to multiple AI companies.

Reddit's archive stretches back to 2005. Eighteen years of people arguing about movies, troubleshooting plumbing, debating politics, sharing recipes, and explaining obscure math concepts. All of it indisputably human-written. All of it generated before AI contamination was possible.

Reddit's data is now worth $130 million a year to two companies. Not because Reddit improved its product. Because the internet that created it no longer exists. Pre-November 2022 human text is the new rare earth mineral: finite, non-renewable, and controlled by whoever had the foresight to archive it.

This creates a power asymmetry that compounds over time. Companies that locked in large-scale access to clean human data before the contamination wave have a structural advantage. Everyone else is training on a degrading commons. And the commons can't regenerate because the new content being added to it is increasingly machine-produced.

The Counterargument, At Full Strength

The strongest objection to all of this: synthetic data works. NVIDIA uses synthetic data to train autonomous driving models. Anthropic and OpenAI use AI-generated data for code and math reasoning tasks. Google DeepMind trained AlphaProof partly on machine-generated mathematical proofs. These are real successes, not hypotheticals.

And they're genuinely relevant. For narrow, verifiable domains where you can automatically check whether the output is correct (does the code compile? does the proof hold? did the car crash?), synthetic data is a legitimate and powerful tool. Nobody credible argues otherwise.
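The verifiable-domain filtering idea can be sketched in a few lines. This is a minimal illustration, not any company's actual pipeline: a synthetic code sample is kept only if it passes an automatic check, here Python's built-in `compile()` (a real pipeline would also run tests, check proofs, and so on).

```python
# Minimal sketch of verifiable-domain filtering for synthetic data:
# keep a generated code sample only if it passes an automatic check.

def passes_verifier(snippet: str) -> bool:
    """Return True if the snippet is at least syntactically valid Python."""
    try:
        compile(snippet, "<synthetic>", "exec")
        return True
    except SyntaxError:
        return False

synthetic_batch = [
    "def add(a, b):\n    return a + b",  # valid -> kept
    "def broken(:\n    return",          # invalid -> discarded
]
clean = [s for s in synthetic_batch if passes_verifier(s)]
print(len(clean))  # → 1
```

The crucial point is that no such verifier exists for general prose: there is no `compile()` for a blog post about grief.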

But model collapse isn't about narrow task-specific training. It's about what happens when the general internet becomes the training set and that internet is unknowingly contaminated with AI output. You can verify a math proof. You can't verify whether a blog post about grief or a Reddit thread about immigration policy authentically represents a human perspective or was generated by a model trained on similar posts.

The Mechanical Turk study is the canary. If the humans hired to produce ground-truth labels are outsourcing to AI, the distinction between "human-labeled" and "machine-labeled" becomes a legal fiction. And the companies buying this data don't know which batches are contaminated, because the contamination is designed to be undetectable.

What Disappears First

Shumailov's paper is precise about what model collapse erases: the tails of the distribution. Statistically, that means rare events, unusual patterns, low-frequency data points.

Translated to language, it means: minority dialects. Regional expressions. Niche expertise. The way a grandmother in Appalachia talks about canning tomatoes. The particular vocabulary of competitive pigeon racing. Medical case reports about rare diseases. Indigenous language fragments preserved in community forums.

These are exactly the forms of knowledge that make language models useful beyond generic summarization. The long tail is where the value lives. And it's the first thing to vanish when the ouroboros starts eating.

There's a bitter irony in this. AI companies pitched their products as democratizing access to knowledge. But the mechanism that delivers that access is systematically destroying the diverse, idiosyncratic, beautifully messy human internet that made the knowledge worth accessing. The product consumes its own supply chain.

Limitations

Several important caveats. Shumailov's experiments used OPT-125m, a relatively small model. Whether frontier-scale models (hundreds of billions of parameters) degrade at the same rate is unknown, and there are theoretical reasons to think larger models might resist collapse longer. No one has published a model collapse study at GPT-4 scale because the training cost would be in the hundreds of millions of dollars. The math says it should happen eventually. "Eventually" could mean 5 generations or 50.

Graphite's 52% figure relies on AI detection tools, which have known false-positive rates of 5-10%, so the exact percentage carries real uncertainty. But even if the true figure is closer to 35-40%, the contamination trend is undeniable.
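How much a detector's error rates move the headline number can be sketched with the standard Rogan-Gladen prevalence correction. The sensitivity figure below is an illustrative assumption, not a measured property of Graphite's tools:

```python
def corrected_prevalence(observed, sensitivity, fpr):
    """Rogan-Gladen estimator: recover the true positive rate from an
    imperfect classifier's observed flag rate."""
    return (observed - fpr) / (sensitivity - fpr)

# Assume a detector with 95% sensitivity. At an 8% false-positive rate
# the observed 52% barely moves; reaching the 35-40% range would require
# a much higher effective false-positive rate (both scenarios illustrative).
print(f"{corrected_prevalence(0.52, 0.95, 0.08):.0%}")  # → 51%
print(f"{corrected_prevalence(0.52, 0.95, 0.25):.0%}")  # → 39%
```

The estimator makes the uncertainty concrete: the corrected figure is sensitive to assumptions about the detector, but no plausible assumption erases the trend.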

And we should note what this analysis cannot prove: whether the current generation of foundation models has already been meaningfully degraded by AI-contaminated training data. That question requires controlled experiments that no company has incentive to publish.

The Bottom Line

The internet split into two eras on November 30, 2022, and there is no going back. Before that date, the web was human-written by default. After it, you can't be sure. This distinction now has a dollar value: $130 million a year for Reddit's archive alone. It will only go up.

Model collapse is not a prediction. The math has been proven and the experimental evidence published in Nature. The open question is whether the AI industry can build filters and curation systems fast enough to avoid training on its own exhaust. Given that half the internet is already AI-generated and a third of the human labelers are using ChatGPT, the evidence suggests: probably not.