Verbal Energy

Enron's gift to students of language

The Texas energy giant’s record for largest corporate bankruptcy has long since been overtaken, but linguists will be feasting on the Enron e-mail dataset for years.

The E is taken off one of the final remaining Enron Field signs outside the formerly named ballpark in Houston in 2002.

The dog days of August call up, for the historically minded, anniversaries of some notable disasters: The “guns of August,” for instance, boomed a century ago this year, as Europe lumbered into World War I.

I’m thinking, though, of a more recent disaster, with no special peg to August, other than the memory of California’s “rolling blackouts” in summer 2001.

The collapse of the Texas energy firm Enron was, at the time, the biggest corporate bankruptcy in American history. It led to the restructuring of the accounting industry and to major legal changes for American public companies. “Enron” even became a musical, winning critical raves in London, if not so much in New York.

But Enron also left a legacy for those who study language. The “Enron e-mail dataset” is a gold mine for researchers in computational linguistics.

As Jessica Leber explained in MIT Technology Review, the Federal Energy Regulatory Commission, on March 26, 2003, dumped online more than 1.6 million e-mails to and from Enron’s executives from 2000 through 2002. 

Sharing all this e-mail, unearthed during FERC’s investigation of Enron, was meant to serve the public interest. In its original form, though, it was way too much information. FERC removed much of the most sensitive stuff. 

Then scholars acquired the raw data and massaged it into a more usable form. As Ms. Leber notes, “the ‘Enron e-mail corpus,’ as the cleaned-up version is now known, remains the largest public domain database of real e-mails in the world – by far.”

Corpus is a scientific term for this kind of “body” of words gathered for study. Originally 517,431 e-mails, by 2004 the Enron corpus had been trimmed down to 200,000. 

Computational linguistics is the computerized study of language. A huge preassembled “corpus” to work on lets scholars focus on analysis, not data collection.

As Leber wrote, because the Enron corpus “is a rich example of how real people in a real organization use e-mail ... it has become the foundation of hundreds of research studies in fields as diverse as machine learning and workplace gender studies.”

The researcher who first mentioned the Enron dataset to me had been studying gender and power. A University of Memphis team sought clues in the Enron messages as to who was lying, and when. “Apparently liars wanted to dissociate themselves from their words ... and made an attempt to create a story that seemed less complex ... and more concrete.” Another group has studied the flow of gossip up and down within an organization. 

The Enron corpus is a product of a particular time: after e-mail had come into wide use but before the privacy and security implications of a move like FERC’s original data dump were understood. Enron’s bankruptcy record has long since been broken. But the Enron dataset is likely to remain unique.
