UMBC ebiquity

Information Extraction via Automatic Generation of Semantic Classifiers

Speaker: Zareen Syed

Start: Tuesday, September 16, 2008, 10:30AM

End: Tuesday, September 16, 2008, 12:00PM

Location: ITE 346

Abstract: Information extraction is an important unsolved problem in natural language processing (NLP). It is the task of extracting entities (such as people, organizations, or locations) and named relations between entities (such as "People born-in Country") from text documents. An important challenge in information extraction is labeling the training data, which is usually done manually and is therefore very expensive.

This talk introduces a new "model" for generating training data with minimal manual intervention. Our approach uses structured data available in the Encarta encyclopedia to generate the training data. Encarta articles are categorized and linked to related articles by experts. We harvest this structured data and use it in an intuitive way to generate classifiers automatically. The classifiers were applied to the following information extraction tasks:

  • Entity Classification
  • Entity Clustering
  • Relation Extraction

We also tested the classifiers built automatically from Encarta on Wikipedia articles. In addition, we conducted experiments to evaluate the performance of features extracted using MindNet, a lexical knowledge base that can be constructed fully automatically from text and that, in this case, was built from Encarta.

The talk will also cover the challenges faced in using the Encarta and MindNet resources and give an overview of promising directions for future work.

Web Site:

Tags: information extraction, natural language processing, encarta