The Enron email corpus is a collection of hundreds of thousands of email messages from the infamous Enron corporation. Researchers have been using it to improve and evaluate techniques for analyzing email, such as NLP analysis, information extraction, sentiment detection, social network analysis, and information flow. It’s become important because it is the only substantial collection of real email that is public. In the ebiquity lab, for example, Akshay Java has worked with UMBC’s Institute for Language and Information Technologies to apply their NLP technology to the messages.
InBoxer has put up an Enron Email site that lets anyone explore and search the collection on the Web. InBoxer is not a research group but a company that sells an “anti-risk appliance” that detects when email that is about to be sent, or has already been sent, violates policy. (There should be a good market for this in the Government, too!)
You can also surf the corpus via a simple database interface at UC Berkeley.
William Cohen of CMU describes the collection:
This dataset was collected and prepared by the CALO Project (A Cognitive Assistant that Learns and Organizes). It contains data from about 150 users, mostly senior management of Enron, organized into folders. The corpus contains a total of about 0.5M messages. This data was originally made public, and posted to the web, by the Federal Energy Regulatory Commission during its investigation. … The dataset here does not include attachments, and some messages have been deleted “as part of a redaction effort due to requests from affected employees”.
Now it’s convenient to explore corporate malfeasance on the Web.
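If you would rather poke at a local copy than use the Web interfaces, the per-user, per-folder organization Cohen describes lends itself to a simple directory walk. The sketch below is only an illustration under stated assumptions: it assumes the corpus has been unpacked so that each user has a top-level directory containing folder subdirectories of one-message-per-file plain-text emails, and the function name `count_messages_per_user` is made up for this example. It uses only Python's standard `email` and `pathlib` modules.

```python
import email
from collections import Counter
from pathlib import Path


def count_messages_per_user(root):
    """Count email-like files per user under root/<user>/<folder>/.../<file>.

    Assumed layout (not confirmed by the text above): one directory per
    user, folders below it, one message per file.
    """
    root = Path(root)
    counts = Counter()
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        parts = path.relative_to(root).parts
        # Ignore stray files sitting directly in the root directory.
        if len(parts) < 2:
            continue
        user = parts[0]  # first path component is taken as the mailbox owner
        with path.open("rb") as fh:
            msg = email.message_from_binary_file(fh)
        # Treat a file as a message only if it has a From header.
        if "From" in msg:
            counts[user] += 1
    return counts
```

Calling `count_messages_per_user("maildir")`, where `maildir` is a hypothetical path to your unpacked copy, returns a `Counter` mapping each user directory to its message count, and `counts.most_common(10)` would list the ten largest mailboxes.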