Wikipedia as an ontology for describing documents

Monday, October 29, 2007, 11:30am - Monday, October 29, 2007, 1:00pm

325b ITE

Tags: information retrieval, ontology, social media, wikipedia

Identifying the topics and concepts associated with a document or collection of documents is a common task for many applications. It can help in the annotation and categorization of documents in a corpus. Knowing the topics of the documents a user has selected and viewed, on the Web or in a collection, makes it possible to model the user's current topical interests, which in turn can improve search results, support business intelligence, or guide the selection of appropriate advertisements.

We are exploring the idea of using Wikipedia's articles and associated pages as a topic ontology. The benefits of this approach are that the terms in the derived ontology are kept current, represent the consensus of a large community, and can be understood by ordinary people by reading the associated Web pages.

We have investigated the use of the text of Wikipedia articles, the category links graph, and the article links graph for predicting common concepts related to a set of documents. We describe several heuristics and algorithms that we implemented and evaluated to aggregate and refine results, including the use of a spreading activation approach on the graphs.
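
As a rough illustration of what spreading activation on a link graph can look like, here is a minimal Python sketch. It is not the implementation evaluated in this work: the function names, the decay factor, the number of pulses, and the choice to split activation evenly across outgoing links are all assumptions made for the example, and the tiny graph is invented.

```python
from collections import defaultdict


def spread_activation(graph, seeds, decay=0.5, pulses=2):
    """Propagate activation from seed nodes along outgoing links.

    graph  : dict mapping a node to a list of linked nodes (hypothetical format)
    seeds  : dict mapping a seed node to its initial activation
    decay  : fraction of activation passed on at each pulse (assumed value)
    pulses : number of propagation rounds (assumed value)
    """
    activation = defaultdict(float, seeds)
    frontier = dict(seeds)
    for _ in range(pulses):
        next_frontier = defaultdict(float)
        for node, value in frontier.items():
            neighbors = graph.get(node, [])
            if not neighbors:
                continue
            # Split the decayed activation evenly across outgoing links.
            share = decay * value / len(neighbors)
            for neighbor in neighbors:
                next_frontier[neighbor] += share
        for node, value in next_frontier.items():
            activation[node] += value
        frontier = next_frontier
    return dict(activation)


def rank_concepts(activation, seeds):
    """Return non-seed nodes ordered by accumulated activation, highest first."""
    candidates = [(n, a) for n, a in activation.items() if n not in seeds]
    return [n for n, _ in sorted(candidates, key=lambda item: (-item[1], item[0]))]


# Invented toy graph: two articles pointing to a shared category,
# which in turn points to a broader one.
toy_graph = {
    "Lion": ["Felines"],
    "Tiger": ["Felines"],
    "Felines": ["Mammals"],
}
toy_seeds = {"Lion": 1.0, "Tiger": 1.0}

scores = spread_activation(toy_graph, toy_seeds)
print(rank_concepts(scores, toy_seeds))
```

In this sketch the seed nodes stand for Wikipedia articles matched to the input documents, and the nodes that accumulate the most activation are the candidate concepts. The same function could in principle be run over either the category links or the article links, depending on which edges are placed in the graph.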

The Wikipedia category graph can be used to predict generalized concepts, while the article links graph can help predict more specific concepts or concepts that do not exist in the category hierarchy. Our experiments on Wikipedia show that it is possible to predict common concepts that do not exist as Wikipedia categories by utilizing the page links graph. Such predicted concepts could in turn be used to define new categories or subcategories within Wikipedia. The results of our preliminary experiments are encouraging and give us a direction for future research and experimentation along these lines.
