Extracting Wikipedia infobox values from text

January 27th, 2009

This year’s Text Analysis Conference (TAC) has an interesting track focused on processing text to populate Wikipedia infoboxes, both for existing entities with missing values and for newly discovered entities.

TAC is run by the US National Institute of Standards and Technology (NIST) to encourage research in natural language processing and related applications. As in the NIST-sponsored MUC, TREC and ACE workshops, this is done by providing a large test collection, common evaluation procedures, and a forum for organizations to share their results. The first TAC was held in 2008 and included 65 teams from 20 countries, who participated in three tracks: question answering, summarization and recognizing textual entailment.

TAC 2009 will include a new track on Knowledge Base Population coordinated by Paul McNamee of the Johns Hopkins University Human Language Technology Center of Excellence.

“The goal of the new Knowledge Base Population track is to augment an existing knowledge representation with information about entities that is discovered from a collection of documents. A snapshot of Wikipedia infoboxes will be used as the original knowledge source, and participants will be expected to fill in empty slots for entities that do exist, add missing entities and their learnable attributes, and provide links between entities and references to text supporting extracted information. The KBP task lies at the intersection of Question Answering and Information Extraction and is expected to be of particular interest to groups that have participated in ACE or TREC QA.”
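To make the slot-filling part of the task concrete, here is a minimal, purely illustrative sketch of what a baseline system might do: fill empty infobox slots by matching patterns in a document and recording the supporting text span. The slot names, patterns, and example sentence are all hypothetical; real KBP systems would combine linguistic processing and learned extractors rather than a few hand-written regexes.

```python
import re

# Hypothetical infobox for an entity, with two empty slots to fill.
infobox = {"entity": "John Smith", "born": None, "occupation": None}

# Toy extraction patterns keyed by slot name. A real system would use
# many cues per slot, not a single regular expression.
PATTERNS = {
    "born": re.compile(r"born (?:on |in )?([A-Z][a-z]+ \d{1,2}, \d{4})"),
    "occupation": re.compile(r"is an? ([a-z]+(?: [a-z]+)?)(?=[,.])"),
}

def fill_slots(infobox, document):
    """Fill empty slots and keep the supporting text for each value,
    since the KBP task asks for links to text justifying each extraction."""
    filled = dict(infobox)
    support = {}
    for slot, pattern in PATTERNS.items():
        if filled.get(slot) is None:
            match = pattern.search(document)
            if match:
                filled[slot] = match.group(1)
                support[slot] = document[match.start():match.end()]
    return filled, support

doc = "John Smith, born on March 4, 1970, is an astronomer, and works in Chile."
filled, support = fill_slots(infobox, doc)
# filled["born"] is "March 4, 1970"; filled["occupation"] is "astronomer"
```

The point of the sketch is the shape of the problem, not the method: each slot value must come with provenance, which is why `fill_slots` returns the matched span alongside the value.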

This is an exciting task, and doing well in it will require a mixture of language processing, knowledge-based processing and (probably) machine learning.

The TAC 2009 workshop will be co-located with TREC and held 16-17 November in Gaithersburg, MD. If you are interested in participating, you should register by March 3.