Citation

Outlier and collapse: The Enron Corpus and foundation model training data

Author:
Zimmer, Zac
Publication:
Big Data & Society
Year:
2026

The Enron Corpus is a canonical training dataset representing one of the first scale jumps in the size of natural language data for machine learning (ML) research. That corpus was built from 500,000 internal Enron emails released by the Federal Energy Regulatory Commission in the wake of the Enron prosecution. This article traces the historical and genealogical link between Enron and contemporary foundation models. Foundation model training sets are currently so large that they include almost all of the data available on the Internet, and researchers anticipate that future models will incorporate even more tokens. One apparent solution is to feed AI-generated data back into the models to train future generations, but doing so poses an existential problem known as model collapse. This essay investigates how and why the corporate collapse documented by one of the earliest “massive” ML corpora resonates with the model collapse syndrome that threatens the integrity of LLMs and multimodal generative AI today. Model collapse can be understood as the phenomenon where a model becomes poisoned by its own projection of reality; this also describes the conditions that led to the collapse of Enron Corp. The Enron Corpus thus poses a more fundamental question: Why is it that theft, scandal, and fraud lie at the heart of so many of the most prominent training sets? Is the Enron story just an outlier in ML history, or is there a genealogical trace of collapse?
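The dynamic of model collapse can be made concrete with a toy simulation (a sketch for illustration, not code from the article itself): each generation fits a simple Gaussian model to samples drawn from the previous generation's fit. Because the tails of the distribution are undersampled at every step, the estimated variance tends to shrink across generations until the model converges on its own narrowed projection of reality.

```python
import random
import statistics

random.seed(0)

mu, sigma = 0.0, 1.0   # generation 0: the "real" data distribution
n_samples = 25         # small samples make the loss of the tails visible quickly

for generation in range(16):
    # Sample synthetic data from the current generation's model...
    data = [random.gauss(mu, sigma) for _ in range(n_samples)]
    # ...then train the next generation on that synthetic data alone.
    mu = statistics.fmean(data)
    sigma = statistics.stdev(data)
    print(f"gen {generation:2d}: mu={mu:+.3f}  sigma={sigma:.3f}")
```

Run over many generations, the printed sigma drifts toward zero: each model is trained only on what the previous model could generate, the statistical signature of the collapse the essay describes.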