Using Information Extraction to Automatically Generate Probabilistic Ontologies


Tuesday, April 25, 2006, 12:00pm - 2:00pm


learning, ontology, phd proposal

The Semantic Web is a rapidly developing research area that promises to deliver Tim Berners-Lee's vision of a world where agents can communicate, reason, and act to complete complex tasks for their users. Ontology languages have evolved as the de facto representation languages for the Semantic Web. Today there are over one million Semantic Web Documents indexed in the Swoogle database collection. This seemingly impressive number is dwarfed by the more than nine billion pages in the Google database. Many complain that the only result of the information revolution is that we are swimming in information but unable to use it effectively.

Publishing data for the Semantic Web is a time-consuming process requiring individuals who possess both domain-specific knowledge and expertise with Description Logic languages. This is becoming the single greatest challenge to future development of the Semantic Web. There is a strong need for an autonomous agent capable of translating the vast amount of loosely organized data currently available on the web and in databases into a formal ontological representation. Recently, there have been several key innovations in the fields of Text Mining, Information Extraction, and Concept Learning that have increased the accuracy of these methods.

Previous approaches to ontology generation using information extraction techniques rely on crisp ontology languages. However, uncertainty, generally the result of noise in the inputs, pervades the process from beginning to end and is a challenge for crisp description logics (DLs). BayesOWL is a probabilistic ontology language that allows concepts and role relations to be asserted with a degree of belief in the assertion. We propose that a framework can be developed that will automatically create taxonomic ontologies from an existing corpus of relevant documents, using techniques from Information Extraction and Text Mining to extract concepts from these documents. Relevant concepts can be placed in a hierarchy using a semantic dictionary (such as WordNet), and a final BayesOWL ontology can be marked up using probabilities derived from frequency counts observed in the corpus.
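As a rough illustration of the last step, the sketch below estimates a conditional probability for each child-parent pair in a small taxonomy from corpus frequency counts. It is only a minimal sketch under stated assumptions: the `HYPERNYMS` table is a hand-written, hypothetical stand-in for a semantic dictionary such as WordNet, concept extraction is reduced to exact token matching, and the function names are invented for this example. It does not show how the proposed framework or BayesOWL would actually encode these values.

```python
import re
from collections import Counter

# Hypothetical stand-in for a semantic dictionary such as WordNet:
# each concept maps to its immediate parent (hypernym).
HYPERNYMS = {
    "dog": "mammal",
    "cat": "mammal",
    "mammal": "animal",
    "sparrow": "bird",
    "bird": "animal",
}


def extract_concepts(documents):
    """Count corpus mentions of every term known to the dictionary.

    Assumption: exact lowercase token matching stands in for a real
    information extraction step.
    """
    vocab = set(HYPERNYMS) | set(HYPERNYMS.values())
    counts = Counter()
    for doc in documents:
        for token in re.findall(r"[a-z]+", doc.lower()):
            if token in vocab:
                counts[token] += 1
    return counts


def subtree_counts(counts):
    """Propagate each mention up to all ancestors of the concept,
    so a mention of 'dog' also counts toward 'mammal' and 'animal'."""
    totals = Counter()
    for concept, n in counts.items():
        node = concept
        while node is not None:
            totals[node] += n
            node = HYPERNYMS.get(node)
    return totals


def conditional_probabilities(totals):
    """Estimate P(child | parent) as the ratio of subtree counts."""
    return {
        (child, parent): totals[child] / totals[parent]
        for child, parent in HYPERNYMS.items()
        if totals[parent] > 0
    }
```

Calling `conditional_probabilities(subtree_counts(extract_concepts(docs)))` on a list of document strings yields a dictionary keyed by (child, parent) pairs. Relative frequencies of this kind are one plausible source for the degrees of belief attached to subclass assertions; smoothing and handling of noisy extractions are left out.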

Yun Peng

OWL

UMBC ebiquity