TABEL - A Domain Independent and Extensible Framework for Inferring the Semantics of Tables
January 8, 2015
Tables are an integral part of documents, reports and Web pages, compactly encoding important information that can be difficult to express in text. Table-like structures outside documents, such as spreadsheets, CSV files, log files and databases, are widely used to represent and share information, and many scientific and technical domains rely on tables for the same reason. However, tables remain beyond the scope of regular text processing systems, which rely on sentence and grammatical structure, as well as the context from surrounding words, to understand the meaning of text. These systems also ignore the structure of a table, which humans use both to encode and to understand the information inside it.
This dissertation presents TABEL, a domain-independent and extensible framework to infer the semantics of tables and represent them as RDF Linked Data. TABEL captures the intended meaning of a table by mapping header cells to classes, data cell values to existing entities, and pairs of columns to relations from a given ontology and knowledge base. The core of the framework is a module that represents a table as a probabilistic graphical model in order to jointly infer the semantics of headers, data cells and the relations between headers. We also introduce a novel Semantic Message Passing scheme, which incorporates semantics into message passing, to perform joint inference over this model. The techniques we present are both extensible and domain agnostic. The framework allows "preprocessing" modules to be added without affecting existing ones, which makes TABEL extensible. It allows the inferred semantics to be represented as RDF triples using either the framework's ontology or a user's custom ontology. TABEL also allows the background knowledge bases to be adapted and changed based on the domain of the table, which makes it domain independent. Finally, we introduce and explore a "human-in-the-loop" paradigm, presenting different models of user interaction with TABEL and their impact on the quality of the inferred semantics.
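As a rough illustration of the output described above, the sketch below turns an already-inferred table interpretation into RDF triples in N-Triples syntax. It is not TABEL's implementation: the `TableInterpretation` class, the `to_ntriples` function and the `example.org` URIs are all hypothetical names chosen for this example, and the inference step itself (the graphical model and Semantic Message Passing) is assumed to have already produced the mappings.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


@dataclass
class TableInterpretation:
    """Hypothetical container for the three kinds of inferred mappings."""
    column_classes: Dict[int, str]                # column index -> class URI
    cell_entities: Dict[Tuple[int, int], str]     # (row, column) -> entity URI
    column_relations: Dict[Tuple[int, int], str]  # (subject col, object col) -> property URI


def to_ntriples(interp: TableInterpretation) -> List[str]:
    """Serialize an interpretation as a list of N-Triples statements."""
    triples = []
    # Each linked cell becomes an instance of its column's class.
    for (row, col), entity in sorted(interp.cell_entities.items()):
        cls = interp.column_classes.get(col)
        if cls:
            triples.append(f"<{entity}> <{RDF_TYPE}> <{cls}> .")
    # Each column-pair relation yields one triple per row where both cells are linked.
    for (s_col, o_col), prop in sorted(interp.column_relations.items()):
        rows = sorted({r for (r, c) in interp.cell_entities if c == s_col})
        for r in rows:
            subj = interp.cell_entities.get((r, s_col))
            obj = interp.cell_entities.get((r, o_col))
            if subj and obj:
                triples.append(f"<{subj}> <{prop}> <{obj}> .")
    return triples


# Illustrative one-row table with "City" and "Country" columns (placeholder URIs).
example = TableInterpretation(
    column_classes={0: "http://example.org/ontology/City",
                    1: "http://example.org/ontology/Country"},
    cell_entities={(0, 0): "http://example.org/resource/Baltimore",
                   (0, 1): "http://example.org/resource/United_States"},
    column_relations={(0, 1): "http://example.org/ontology/country"},
)

for line in to_ntriples(example):
    print(line)
```

Keeping the three mappings separate mirrors the three kinds of assignment named in the abstract (classes, entities, relations), so swapping in a custom ontology would only change the URIs, not the serialization step.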
We demonstrate the extensibility and domain independence of our techniques by developing an application of TABEL in the healthcare domain. Evidence-based medicine addresses questions such as the efficacy of drug dosages, correlations among medical factors, and correlations between drugs by performing meta-analyses (i.e., systematic reviews) over evidence and data previously published in the scientific literature and in clinical trial studies. We develop a proof-of-concept system that can automatically produce meta-analysis reports, replacing the existing manual and tedious process. Tables from medical research reports (medical tables) pose not only domain challenges but also structural ones, as they are multi-dimensional in nature. We extend and adapt TABEL by adding a new domain-specific preprocessing module and domain-specific knowledge bases. We then use TABEL to infer the semantics of medical tables and build a proof-of-concept interactive system that uses the inferred semantics to help researchers discover, extract and integrate data from relevant studies to produce meta-analysis reports.
A thorough evaluation, with experiments over datasets of tables from the Web and from medical research reports, shows promising results. Our experiments also show that limited user feedback can have a significant impact on the quality of the semantics inferred by our framework.
PhD Thesis
University of Maryland, Baltimore County