PhD defense: Varish Mulwad — Inferring the Semantics of Tables

December 29th, 2014


Dissertation Defense

TABEL — A Domain Independent and Extensible Framework
for Inferring the Semantics of Tables

Varish Vyankatesh Mulwad

8:00am Thursday, 8 January 2015, ITE325b

Tables are an integral part of documents, reports and Web pages in many scientific and technical domains, compactly encoding important information that can be difficult to express in text. Table-like structures outside documents, such as spreadsheets, CSV files, log files and databases, are widely used to represent and share information. However, tables remain beyond the scope of regular text processing systems which often treat them like free text.

This dissertation presents TABEL, a domain independent and extensible framework to infer the semantics of tables and represent them as RDF Linked Data. TABEL captures the intended meaning of a table by mapping header cells to classes, data cell values to existing entities and pairs of columns to relations from a given ontology and knowledge base. The core of the framework consists of a module that represents a table as a graphical model to jointly infer the semantics of headers, data cells and relations between headers. We also introduce a novel Semantic Message Passing scheme, which incorporates semantics into message passing, to perform joint inference over the probabilistic graphical model. We also develop and explore a “human-in-the-loop” paradigm, presenting plausible models of user interaction with our framework and its impact on the quality of inferred semantics.
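To make the target representation concrete, here is a minimal sketch of what a finished table interpretation might look like once it is written out as RDF triples. It is not TABEL's code: the mappings are hand-written rather than inferred, and the `http://example.org/` namespace, the `locatedIn` relation and the function names are all hypothetical, chosen only for illustration.

```python
# Illustrative only: the mappings below are hand-written assumptions,
# whereas a system like TABEL would infer them jointly from the table.
EX = "http://example.org/"  # hypothetical namespace
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

table = {
    "headers": ["City", "Country"],
    "rows": [["Baltimore", "USA"], ["Pune", "India"]],
}

# Header cell -> ontology class (assumed)
header_class = {"City": EX + "City", "Country": EX + "Country"}

# Data cell value -> knowledge base entity (assumed)
cell_entity = {
    "Baltimore": EX + "Baltimore",
    "USA": EX + "United_States",
    "Pune": EX + "Pune",
    "India": EX + "India",
}

# Pair of columns -> relation (assumed)
column_relation = {("City", "Country"): EX + "locatedIn"}


def table_to_triples(table, header_class, cell_entity, column_relation):
    """Turn a table plus its interpretation into (subject, predicate, object) triples."""
    triples = []
    headers = table["headers"]
    for row in table["rows"]:
        # Each linked cell is typed with the class of its column header.
        for header, value in zip(headers, row):
            triples.append((cell_entity[value], RDF_TYPE, header_class[header]))
        # Each related column pair yields one relation triple per row.
        for (subj_col, obj_col), relation in column_relation.items():
            subj = cell_entity[row[headers.index(subj_col)]]
            obj = cell_entity[row[headers.index(obj_col)]]
            triples.append((subj, relation, obj))
    return triples


def to_ntriples(triples):
    """Serialize IRI-only triples in N-Triples syntax."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


triples = table_to_triples(table, header_class, cell_entity, column_relation)
print(to_ntriples(triples))
```

The hard part that the dissertation addresses, choosing these mappings jointly and under uncertainty, is exactly what this sketch leaves out.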

We present techniques that are both extensible and domain agnostic. Our framework supports the addition of preprocessing modules without affecting existing ones, making TABEL extensible. It also allows background knowledge bases to be adapted and changed based on the domains of the tables, thus making it domain independent. We demonstrate the extensibility and domain independence of our techniques by developing an application of TABEL in the healthcare domain. We develop a proof of concept for an application to generate meta-analysis reports automatically, which is built on top of the semantics inferred from tables found in medical literature.

A thorough evaluation with experiments over datasets of tables from the Web and medical research reports shows promising results.

Committee: Drs. Tim Finin (chair), Tim Oates, Anupam Joshi, Yun Peng, Indrajit Bhattacharya (IBM Research) and L. V. Subramaniam (IBM Research)

DOCTOR for BBN LISP, circa 1966

December 21st, 2014

Jeff Shrager’s Genealogy of Eliza project has added a BBN LISP version of DOCTOR from 1966 that was recovered from a paper tape. Eliza is the classic conversational program written by Joseph Weizenbaum and described in a 1966 CACM paper, “ELIZA–a computer program for the study of natural language communication between man and machine”. Weizenbaum wrote Eliza in his Lisp-like SLIP programming language, which ran on an IBM 7094 computer.

BBNer Bernie Cosell wrote this first Lisp version in BBN LISP and based it on the description and examples he read in the CACM paper. The recovered code is in Jeff’s GitHub repository and an emulator that can run it is promised soon.

This is probably pretty close to the MACLISP version of DOCTOR that I played with in the early 1970s. I still have some DECtapes with old files from those days — maybe I’ll find that version of DOCTOR on one of them.

Semantics for Privacy and Shared Context

December 15th, 2014

Roberto Yus, Primal Pappachan, Prajit Das, Tim Finin, Anupam Joshi, and Eduardo Mena, Semantics for Privacy and Shared Context, Workshop on Society, Privacy and the Semantic Web-Policy and Technology, held at Int. Semantic Web Conf., Oct. 2014.

Capturing, maintaining, and using context information helps mobile applications provide better services and generates data useful in specifying information sharing policies. Obtaining the full benefit of context information requires a rich and expressive representation that is grounded in shared semantic models. We summarize some of our past work on representing and using context models and briefly describe Triveni, a system for cross-device context discovery and enrichment. Triveni represents context in RDF and OWL and reasons over context models to infer additional information and detect and resolve ambiguities and inconsistencies. A unique feature, its ability to create and manage “contextual groups” of users in an environment, enables their members to share context information using wireless ad-hoc networks. Thus, it enriches the information about a user’s context by creating mobile ad hoc knowledge networks.

UMBC seeks nine new computing faculty

December 13th, 2014


UMBC has a total of nine open full-time positions for computing faculty including five tenure track professors, a professor of the practice and three lecturers.

UMBC’s Computer Science and Electrical Engineering department is seeking to fill five positions for the coming year. They include two tenure track positions in Computer Science and up to three full-time lecturers. See the CSEE jobs page for more information.

The College of Engineering and Information Technology has a position for a full-time lecturer or Professor of Practice to focus on the needs of incoming computing majors through teaching, advising, and helping develop programs in computing. This person will work closely with faculty in the Computer Science and Electrical Engineering Department and Information Systems Department.

UMBC’s Information Systems department is accepting applications for three tenure track faculty positions in data science, software engineering and human-centered computing.

Amir Karami on fuzzy approach topic models for medical corpora

December 2nd, 2014

In this week’s Ebiquity meeting (10am Wed 12/3 in ITE346), Amir Karami will talk about “Fuzzy Approach Topic Models for Medical Corpus”.

Abstract: Looking for ways to automatically retrieve the enormous amount of medical knowledge has always been an intriguing topic. The massive flow of medical documents, including scholarly publications and clinical notes, has benefited experts by providing easy access to a huge amount of text data. However, due to this amount of data, medical experts are finding it increasingly difficult to locate information of interest, and finding relevant documents has become harder. Effective text mining systems should be able to extract and exploit not only explicitly stated information but also implied and inferred data. The bag-of-words representation leads to a sparse, high-dimensional problem that yields low performance and high computational cost. Dimension reduction techniques, especially topic models, are among the useful techniques for overcoming the problems of bag-of-words. This research proposes a novel approach for topic modeling using fuzzy clustering. To evaluate our model, we experiment with two text datasets of medical documents. Evaluation through document classification, document modeling, and document clustering shows that our approach performs better than LDA, the most-cited topic model on Google Scholar, indicating that fuzzy set theory can improve the performance of topic models in the medical domain. Our approach addresses the redundancy issue in the medical domain and can discover the relations between topics in a document. In addition, previous research on fuzzy clustering can help to solve challenges of topic modeling such as defining the number of topics.
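For readers unfamiliar with fuzzy clustering, the sketch below runs standard fuzzy c-means on a toy document-term matrix. It is not the model proposed in the talk, only the textbook algorithm that fuzzy approaches build on; the term counts, the initial centers and the function name are made up for illustration. The point is that each document gets a degree of membership in every cluster, which is what lets a document belong partly to several topics.

```python
import math


def fuzzy_c_means(points, centers, m=2.0, iterations=20):
    """Textbook fuzzy c-means; returns (memberships, centers).

    memberships[i][j] is the degree to which point i belongs to cluster j,
    and each row sums to 1. m > 1 is the fuzzifier.
    """
    centers = [list(c) for c in centers]
    exponent = 2.0 / (m - 1.0)
    memberships = []
    for _ in range(iterations):
        # Membership update: closer centers get higher membership.
        memberships = []
        for p in points:
            # Clamp distances so a point sitting on a center does not divide by zero.
            dists = [max(math.dist(p, c), 1e-12) for c in centers]
            row = [
                1.0 / sum((d_j / d_k) ** exponent for d_k in dists)
                for d_j in dists
            ]
            memberships.append(row)
        # Center update: mean of the points weighted by membership ** m.
        for j in range(len(centers)):
            weights = [row[j] ** m for row in memberships]
            total = sum(weights)
            centers[j] = [
                sum(w * p[dim] for w, p in zip(weights, points)) / total
                for dim in range(len(points[0]))
            ]
    return memberships, centers


# Toy term counts per document for the terms (tumor, insulin, dose); invented data.
docs = [[3, 0, 1], [2, 0, 1], [0, 3, 1], [0, 2, 1]]

# Deterministic start: use the first and third documents as initial centers.
memberships, centers = fuzzy_c_means(docs, centers=[docs[0], docs[2]])

for doc, row in zip(docs, memberships):
    print(doc, [round(u, 3) for u in row])
```

With hard clustering each document would be assigned to exactly one cluster; here the memberships are graded, so a document that mixes vocabulary from both groups would receive intermediate values.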