UMBC ebiquity research group Building intelligent systems in open, heterogeneous, dynamic, distributed environments
Archive for the 'NLP' Category

Search the Enron email corpus online

February 5th, 2006, by Tim Finin, posted in AI, NLP

The Enron email corpus is a collection of hundreds of thousands of email messages from the infamous Enron corporation that researchers have been using to improve and evaluate techniques for analyzing email, e.g., NLP analysis, information extraction, sentiment detection, social network analysis, information flow, etc. It’s become important because it is the only substantial collection of real email that is public. In the ebiquity lab, for example, Akshay Java has worked with UMBC’s Institute for Language and Information Technologies to apply their NLP technology to the messages.

InBoxer has put up an Enron Email site that lets anyone explore and search the collection on the Web. InBoxer is not a research group but a company that sells an “anti-risk appliance” for detecting when email that is about to be sent, or has already been sent, violates policy. (There should be a good market for this in the Government, too!)

You can also surf the corpus via a simple database interface at UC Berkeley.

William Cohen of CMU describes the collection:

This dataset was collected and prepared by the CALO Project (A Cognitive Assistant that Learns and Organizes). It contains data from about 150 users, mostly senior management of Enron, organized into folders. The corpus contains a total of about 0.5M messages. This data was originally made public, and posted to the web, by the Federal Energy Regulatory Commission during its investigation. … The dataset here does not include attachments, and some messages have been deleted “as part of a redaction effort due to requests from affected employees”.

Now it’s convenient to explore corporate malfeasance on the Web.

SemNews: NLP system generates Semantic Web representation of news summaries

January 12th, 2006, by Tim Finin, posted in AI, NLP, Ontologies, Semantic Web, Swoogle, Web

SemNews is a prototype application being developed by UMBC Ph.D. student Akshay Java that uses a sophisticated text understanding system to interpret summaries of news stories, publishes the results on the Semantic Web, and provides browsing and query services over them. The project is the result of a collaboration between UMBC’s Institute for Language and Information Technologies and the ebiquity laboratory, with partial support from the Lockheed Martin Corporation.

SemNews monitors a number of news source RSS feeds and processes new stories as they are published. After extracting a story’s metadata, its news summary is interpreted by the OntoSem text analyzer, which performs a syntactic, semantic, and pragmatic analysis of the text and produces its text meaning representation, or TMR. A TMR is a language-neutral description (an interlingua) of the meaning conveyed in a natural language text. In addition to providing information about the lexical-semantic dependencies in the text, the TMR represents stylistic factors, discourse relations, speaker attitudes, and other pragmatic factors present in the discourse structure. The TMR thus captures not only the meaning of individual elements in the text but also the relations between those elements, covering both propositional and non-propositional components of textual meaning. OntoSem’s TMRs are represented in a custom frame-based representation language and grounded in the Mikrokosmos ontology, an extensive ontology with over 30K concepts and nearly 400K entities.
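The first step, watching a feed and picking out stories that have not been seen yet, can be sketched roughly as follows. This is only an illustration using Python's standard library and not SemNews's actual code: the sample feed, the function names `extract_stories` and `new_stories`, and the choice of the link as the story's identity are all assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed; a real monitor would fetch this over HTTP.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example News</title>
    <item>
      <title>Story one</title>
      <link>http://example.org/story1</link>
      <description>Summary of story one.</description>
    </item>
    <item>
      <title>Story two</title>
      <link>http://example.org/story2</link>
      <description>Summary of story two.</description>
    </item>
  </channel>
</rss>"""


def extract_stories(rss_text):
    """Return one metadata dict per <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_text)
    stories = []
    for item in root.iter("item"):
        stories.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "summary": item.findtext("description", default=""),
        })
    return stories


def new_stories(rss_text, seen_links):
    """Return stories whose link is not in seen_links, then record them."""
    fresh = [s for s in extract_stories(rss_text)
             if s["link"] not in seen_links]
    seen_links.update(s["link"] for s in fresh)
    return fresh


seen = set()
first_pass = new_stories(SAMPLE_RSS, seen)   # both stories are new
second_pass = new_stories(SAMPLE_RSS, seen)  # nothing new the second time
```

In a pipeline like the one described, each story's `summary` field is what would be handed on to the text analyzer.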

Each story’s metadata and TMR are translated into the Semantic Web language OWL via the OntoSem2OWL translator developed for this project. The results are then added to a special collection indexed by the Swoogle search engine and also put into an RDF triple store. These support several services that let people and agents semantically browse, query and visualize the stories in the collection, giving access to information that would otherwise not be easy to find using simple keyword-based search.
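To give a feel for what moving from frames to triples involves, here is a toy sketch. It is not the OntoSem2OWL mapping: the frame layout, the concept names `BOMB` and `NATION`, the slot names, and the `example.org` namespace are all made up for illustration, and a real translator would emit proper OWL rather than Python tuples.

```python
# Hypothetical namespace and frame layout, for illustration only.
NS = "http://example.org/semnews#"


def frame_to_triples(frame_id, frame):
    """Flatten one frame (a concept plus its slots) into (s, p, o) triples."""
    subject = NS + frame_id
    triples = [(subject, "rdf:type", NS + frame["concept"])]
    for slot, filler in frame["slots"].items():
        # A filler of the form {"ref": ...} points at another frame;
        # anything else is kept as a plain literal value.
        obj = NS + filler["ref"] if isinstance(filler, dict) else filler
        triples.append((subject, NS + slot, obj))
    return triples


# An invented two-frame meaning representation: a bombing located in a nation.
tmr = {
    "bomb-1": {"concept": "BOMB", "slots": {"location": {"ref": "nation-1"}}},
    "nation-1": {"concept": "NATION", "slots": {"has-name": "Afghanistan"}},
}

triples = [t for fid, f in tmr.items() for t in frame_to_triples(fid, f)]
```

Each frame becomes one typed resource and each slot becomes one property assertion, which is the shape a triple store expects.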

For example, one can browse through the story collection via the ontology to find stories that involve certain concepts, such as a terrorist organization; find all stories that involve entities in OntoSem’s onomasticon, such as al Qaeda or Karbala; visualize the stories on a map based on the locations they reference; or construct an arbitrary query, such as finding “stories in which the nation named Afghanistan was the location of a bombing event.” Users can also define semantic “alerts” as queries over the RDF triple store and/or the Swoogle collection. For each alert, SemNews will generate an RSS feed of the results.
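The Afghanistan query above is essentially a conjunction of triple patterns that share variables. A minimal sketch of that idea, assuming a tiny in-memory list of triples rather than the real store, might look like this. The data, the predicate names, and the `match` and `query` helpers are invented for the example; an actual deployment would use an RDF query language against the triple store.

```python
def match(pattern, triple, bindings):
    """Unify one pattern with one triple; return extended bindings or None."""
    result = dict(bindings)
    for p, t in zip(pattern, triple):
        if p.startswith("?"):
            # A variable must keep the same value everywhere it appears.
            if result.setdefault(p, t) != t:
                return None
        elif p != t:
            return None
    return result


def query(patterns, triples):
    """Return all variable bindings that satisfy every pattern."""
    solutions = [{}]
    for pattern in patterns:
        extended = []
        for bindings in solutions:
            for triple in triples:
                b = match(pattern, triple, bindings)
                if b is not None:
                    extended.append(b)
        solutions = extended
    return solutions


# Invented sample data: two stories, each describing one bombing event.
TRIPLES = [
    ("story1", "describes", "event1"),
    ("event1", "type", "BOMB"),
    ("event1", "location", "place1"),
    ("place1", "type", "NATION"),
    ("place1", "name", "Afghanistan"),
    ("story2", "describes", "event2"),
    ("event2", "type", "BOMB"),
    ("event2", "location", "place2"),
    ("place2", "type", "NATION"),
    ("place2", "name", "Freedonia"),
]

# "Stories in which the nation named Afghanistan was the location of a bombing."
QUERY = [
    ("?story", "describes", "?event"),
    ("?event", "type", "BOMB"),
    ("?event", "location", "?place"),
    ("?place", "type", "NATION"),
    ("?place", "name", "Afghanistan"),
]

stories = sorted({b["?story"] for b in query(QUERY, TRIPLES)})
```

An alert, in this picture, is just a stored query of this kind that is re-run as new stories arrive, with the matching stories written out as feed items.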

The SemNews system is currently a research prototype that is being used to refine the underlying technologies and to explore how the sophisticated automatic linguistic processing of text can be integrated into the Semantic Web and conventional web applications. Ongoing work on SemNews includes an evaluation of its semantic recall and precision as well as a service that can group and cluster stories based on their semantic representations.
