The dog days of August call up, for the historically minded, anniversaries of some notable disasters: The “guns of August,” for instance, boomed a century ago this year, as Europe lumbered into World War I.
I’m thinking, though, of a more recent disaster, with no special peg to August, other than the memory of California’s “rolling blackouts” in summer 2001.
The collapse of the Texas energy firm Enron was, at the time, the biggest corporate bankruptcy in American history. It led to the restructuring of the accounting industry and to major legal changes for American public companies. “Enron” even became a musical, winning critical raves in London, if not so much in New York.
But Enron also left a legacy for those who study language. The “Enron e-mail dataset” is a gold mine for researchers in computational linguistics.
As Jessica Leber explained in MIT Technology Review, the Federal Energy Regulatory Commission, on March 26, 2003, dumped online more than 1.6 million e-mails to and from Enron’s executives from 2000 through 2002.
Sharing all this e-mail, unearthed during FERC’s investigation of Enron, was meant to serve the public interest. In its original form, though, it was way too much information. FERC removed much of the most sensitive stuff.
Then scholars acquired the raw data and massaged it into a more usable form. As Ms. Leber notes, “the ‘Enron e-mail corpus,’ as the cleaned-up version is now known, remains the largest public domain database of real e-mails in the world – by far.”
Corpus is a scientific term for this kind of “body” of words gathered for study. The Enron corpus originally held 517,431 e-mails; by 2004 it had been trimmed down to 200,000.
Computational linguistics is the computerized study of language. A huge preassembled “corpus” to work on lets scholars focus on analysis, not data collection.
As Leber wrote, because the Enron corpus “is a rich example of how real people in a real organization use e-mail ... it has become the foundation of hundreds of research studies in fields as diverse as machine learning and workplace gender studies.”
The researcher who first mentioned the Enron dataset to me had been studying gender and power. A University of Memphis team sought clues in the Enron messages as to who was lying, and when. “Apparently liars wanted to dissociate themselves from their words ... and made an attempt to create a story that seemed less complex ... and more concrete.” Another group has studied the flow of gossip up and down within an organization.
The Enron corpus is a product of a particular time: after e-mail had come into wide use but before the privacy and security implications of a move like FERC’s original data dump were understood. Enron’s bankruptcy record has long since been broken. But the Enron dataset is likely to remain unique.