Enron's gift to students of language

The Texas energy giant’s record for largest corporate bankruptcy has long since been overtaken, but linguists will be feasting on the Enron e-mail dataset for years.

Brett Coomer/AP
The E is taken off one of the final remaining Enron Field signs outside the formerly named ballpark in Houston in 2002.

The dog days of August call up, for the historically minded, anniversaries of some notable disasters: The “guns of August,” for instance, boomed a century ago this year, as Europe lumbered into World War I.

I’m thinking, though, of a more recent disaster, with no special peg to August, other than the memory of California’s “rolling blackouts” in summer 2001.

The collapse of the Texas energy firm Enron was, at the time, the biggest corporate bankruptcy in American history. It led to the restructuring of the accounting industry and to major legal changes for American public companies. “Enron” even became a musical, winning critical raves in London, if not so much in New York.

But Enron also left a legacy for those who study language. The “Enron e-mail dataset” is a gold mine for researchers in computational linguistics.

As Jessica Leber explained in MIT Technology Review, the Federal Energy Regulatory Commission, on March 26, 2003, dumped online more than 1.6 million e-mails to and from Enron’s executives from 2000 through 2002. 

Sharing all this e-mail, unearthed during FERC’s investigation of Enron, was meant to serve the public interest. In its original form, though, it was way too much information. FERC removed much of the most sensitive stuff. 

Then scholars acquired the raw data and massaged it into a more usable form. As Ms. Leber notes, “the ‘Enron e-mail corpus,’ as the cleaned-up version is now known, remains the largest public domain database of real e-mails in the world – by far.”

Corpus is a scientific term for this kind of “body” of words gathered for study. The Enron corpus originally contained 517,431 e-mails; by 2004 it had been trimmed down to 200,000.

Computational linguistics is the computerized study of language. A huge preassembled “corpus” to work on lets scholars focus on analysis, not data collection.

As Leber wrote, because the Enron corpus “is a rich example of how real people in a real organization use e-mail ... it has become the foundation of hundreds of research studies in fields as diverse as machine learning and workplace gender studies.”

The researcher who first mentioned the Enron dataset to me had been studying gender and power. A University of Memphis team sought clues in the Enron messages as to who was lying, and when. “Apparently liars wanted to dissociate themselves from their words ... and made an attempt to create a story that seemed less complex ... and more concrete.” Another group has studied the flow of gossip up and down within an organization. 

The Enron corpus is a product of a particular time: after e-mail had come into wide use but before the privacy and security implications of a move like FERC’s original data dump were understood. Enron’s bankruptcy record has long since been broken. But the Enron dataset is likely to remain unique.
