Schema Free Querying of Semantic Data
August 1, 2014
Developing interfaces that enable casual, non-expert users to query complex structured data has been the subject of much research over the past forty years. Since such interfaces allow users to freely query data without understanding its schema, knowing how to refer to objects, or mastering the appropriate formal query language, we call them schema-free query interfaces. Schema-free query interface systems address a fundamental problem in NLP, databases, and AI: bridging the user's conceptual world and the machine's representation.
However, schema-free query interface systems are challenged by three hard problems. First, we still lack a practical interface. A Natural Language Interface (NLI) is easy for users but hard for machines: today's NLP techniques still cannot reliably extract the relational structure from natural language questions. A keyword query interface, on the other hand, has limited expressiveness and inherits the ambiguity of the natural language terms used as keywords. Second, people have many different ways to express or model the same meaning, which can result in vocabulary and structure mismatches between the user's query and the machine's representation. This is often referred to as the semantic heterogeneity problem, and today we still rely heavily on ad hoc and labor-intensive approaches to deal with it. Third, the Web has seen increasing amounts of open domain semantic data with heterogeneous or unknown schemas, which daunts traditional NLI systems that require a well-defined schema. Some modern systems have given up translating the user query into a formal query at the schema level and instead search the entity network (ABox) directly for matches to the user query. This approach, however, is computationally expensive and tends to be ad hoc.
In this thesis, we develop a novel approach to address these three hard problems. We introduce a new schema-free query interface, which we call the SFQ interface, in which the user explicitly specifies the relational structure of the query as a graphical "skeleton" and annotates it with freely chosen words, phrases and entity names. By using the SFQ interface, we work around the unreliable step of extracting complete relations from natural language queries.
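As a rough illustration of what such an annotated skeleton might look like as data, the following minimal Python sketch models a query as nodes and links labeled with free-text terms. The class names (`Node`, `Link`, `SFQQuery`), the `"?"` marker for an output variable, and the example query are all assumptions made for illustration; they are not taken from the thesis.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Node:
    """A concept in the skeleton, annotated with a user-chosen term."""
    label: str          # free-text class term, e.g. "Author"
    value: str = "?"    # "?" marks an output variable; otherwise an entity name


@dataclass(frozen=True)
class Link:
    """A relation between two nodes, annotated with a user-chosen term."""
    source: Node
    relation: str       # free-text relation term, e.g. "wrote"
    target: Node


@dataclass
class SFQQuery:
    """Hypothetical container for the skeleton: just a list of links."""
    links: List[Link] = field(default_factory=list)

    def nodes(self) -> List[Node]:
        # Collect distinct nodes in first-seen order.
        seen: List[Node] = []
        for link in self.links:
            for node in (link.source, link.target):
                if node not in seen:
                    seen.append(node)
        return seen

    def outputs(self) -> List[Node]:
        # Nodes whose value the user wants returned.
        return [n for n in self.nodes() if n.value == "?"]

    def terms(self) -> List[str]:
        # Every free-text term that would later need mapping to a schema.
        out: List[str] = []
        for link in self.links:
            for term in (link.source.label, link.relation, link.target.label):
                if term not in out:
                    out.append(term)
        return out


# Example: "Which books did Mark Twain write?" with the structure given explicitly.
author = Node("Author", "Mark Twain")
book = Node("Book")
query = SFQQuery([Link(author, "wrote", book)])
```

The point of the sketch is that the relational structure (which node links to which) is supplied by the user, so only the terms themselves remain to be interpreted.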
We describe a framework for interpreting these SFQ queries over open domain semantic data that automatically translates them into formal queries. First, we learn a schema statistically from the entity network. The schema itself is also represented as a network, which we call the schema network. Our mapping algorithms run on the schema network rather than the entity network, which makes them much more scalable. We define the probability of "observing" a path on the schema network. Building on this, we create two statistical association models that are used to carry out disambiguation. We develop novel mapping algorithms that exploit semantic similarity measures and association measures to address the structure and vocabulary mismatch problems. Our approach is fully computation-based and requires no lexicons, mapping rules, domain-specific syntactic or semantic parsers, thesauri or hard-coded semantics.
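To make the idea of a statistically learned schema network concrete, here is a minimal Python sketch. It aggregates a toy entity network into class-level edges with counts, then scores a schema path by multiplying relative frequencies of each step. The abstract does not state how the path probability is defined, so the relative-frequency model used here is a stand-in assumption, and the toy data and function names are hypothetical.

```python
from collections import Counter

# Toy entity network (ABox): entity types plus relation triples. All hypothetical.
types = {
    "a1": "Author", "a2": "Author",
    "p1": "Paper", "p2": "Paper",
    "c1": "Conference",
}
triples = [
    ("a1", "wrote", "p1"),
    ("a1", "wrote", "p2"),
    ("a2", "wrote", "p2"),
    ("p1", "publishedIn", "c1"),
    ("p2", "publishedIn", "c1"),
    ("p2", "cites", "p1"),
]


def build_schema_network(types, triples):
    """Count how often each (class, property, class) edge occurs in the data."""
    schema = Counter()
    for subj, prop, obj in triples:
        schema[(types[subj], prop, types[obj])] += 1
    return schema


def step_probability(schema, src, prop, dst):
    """Assumed model: share of src's outgoing edges that are (prop, dst)."""
    total = sum(n for (cls, _, _), n in schema.items() if cls == src)
    if total == 0:
        return 0.0
    return schema[(src, prop, dst)] / total


def path_probability(schema, path):
    """Assumed model: product of step probabilities along the path."""
    prob = 1.0
    for src, prop, dst in path:
        prob *= step_probability(schema, src, prop, dst)
    return prob


schema = build_schema_network(types, triples)
path = [("Author", "wrote", "Paper"), ("Paper", "publishedIn", "Conference")]
score = path_probability(schema, path)
```

Because the schema network has one edge per class-level pattern rather than one per entity pair, any search over it touches far fewer edges than a search over the entity network, which is the scalability argument made above.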
We evaluate our approach on two large datasets, DBLP+ and DBpedia. DBLP+ is a dataset we developed by augmenting the DBLP dataset with data from CiteSeerX and ArnetMiner. We created 220 SFQ queries on the DBLP+ dataset. For DBpedia, we asked three human subjects who were not familiar with DBpedia to translate 33 natural language questions from the 2011 QALD workshop into 99 SFQ queries. We carried out cross-validation on the 220 DBLP+ queries and cross-domain validation on the 99 DBpedia queries, in which the parameters tuned for the DBLP+ queries were applied to the DBpedia queries. The evaluation results on the two datasets show that our system is both effective and efficient.
PhdThesis
University of Maryland, Baltimore County