UMBC ebiquity

Archive for the 'KR' Category

Discovering and Querying Hybrid Linked Data

June 5th, 2015, by Tim Finin, posted in Big data, KR, Machine Learning, Semantic Web


New paper: Zareen Syed, Tim Finin, Muhammad Rahman, James Kukla and Jeehye Yun, Discovering and Querying Hybrid Linked Data, Third Workshop on Knowledge Discovery and Data Mining Meets Linked Open Data, held in conjunction with the 12th Extended Semantic Web Conference, Portoroz, Slovenia, June 2015.

In this paper, we present a unified framework for discovering and querying hybrid linked data. We describe our approach to developing a natural language query interface for Wikitology, a hybrid knowledge base, and present that as a case study for accessing hybrid information sources with structured and unstructured data through natural language queries. We evaluate our system on a publicly available dataset and demonstrate improvements over a baseline system. We describe limitations of our approach and also discuss cases where our system can complement other structured data querying systems by retrieving additional answers not available in structured sources.

PhD defense: Varish Mulwad — Inferring the Semantics of Tables

December 29th, 2014, by Tim Finin, posted in KR, Machine Learning, NLP, Ontologies, Semantic Web


Dissertation Defense

TABEL — A Domain Independent and Extensible Framework
for Inferring the Semantics of Tables

Varish Vyankatesh Mulwad

8:00am Thursday, 8 January 2015, ITE325b

Tables are an integral part of documents, reports and Web pages in many scientific and technical domains, compactly encoding important information that can be difficult to express in text. Table-like structures outside documents, such as spreadsheets, CSV files, log files and databases, are widely used to represent and share information. However, tables remain beyond the scope of regular text processing systems which often treat them like free text.

This dissertation presents TABEL, a domain independent and extensible framework to infer the semantics of tables and represent them as RDF Linked Data. TABEL captures the intended meaning of a table by mapping header cells to classes, data cell values to existing entities and pairs of columns to relations from a given ontology and knowledge base. The core of the framework consists of a module that represents a table as a graphical model to jointly infer the semantics of headers, data cells and relations between headers. We also introduce a novel Semantic Message Passing scheme, which incorporates semantics into message passing, to perform joint inference over the probabilistic graphical model. We also develop and explore a “human-in-the-loop” paradigm, presenting plausible models of user interaction with our framework and its impact on the quality of inferred semantics.
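To make the target representation concrete, here is a minimal Python sketch of what an interpreted table might look like once it is written out as RDF. It does not implement TABEL's graphical model or Semantic Message Passing: the two-row table, the DBpedia-style identifiers and the helper names `table_to_triples` and `to_ntriples` are all assumptions made for illustration, and the mappings a real system would infer are simply supplied by hand.

```python
# Hypothetical, hand-written interpretation of a two-column table.
# A real system would infer these mappings; here they are simply given.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
DBR = "http://dbpedia.org/resource/"
DBO = "http://dbpedia.org/ontology/"

rows = [["Baltimore", "Maryland"], ["Boston", "Massachusetts"]]

# Header cells -> classes (illustrative choices).
column_classes = {0: DBO + "City", 1: DBO + "AdministrativeRegion"}

# Data cell values -> existing entities (illustrative choices).
cell_entities = {
    "Baltimore": DBR + "Baltimore",
    "Maryland": DBR + "Maryland",
    "Boston": DBR + "Boston",
    "Massachusetts": DBR + "Massachusetts",
}

# Pairs of columns -> relations (illustrative choice).
column_relations = {(0, 1): DBO + "isPartOf"}


def table_to_triples(rows, column_classes, cell_entities, column_relations):
    """Turn an annotated table into a sorted list of (s, p, o) IRI triples."""
    triples = set()
    for row in rows:
        for col, cls in column_classes.items():
            triples.add((cell_entities[row[col]], RDF_TYPE, cls))
        for (subj_col, obj_col), rel in column_relations.items():
            triples.add(
                (cell_entities[row[subj_col]], rel, cell_entities[row[obj_col]])
            )
    return sorted(triples)


def to_ntriples(triples):
    """Serialize IRI-only triples, one N-Triples statement per line."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


triples = table_to_triples(rows, column_classes, cell_entities, column_relations)
print(to_ntriples(triples))
```

Each row contributes one type statement per annotated column and one statement per annotated column pair, so this toy table yields six triples.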

We present techniques that are both extensible and domain agnostic. Our framework supports the addition of preprocessing modules without affecting existing ones, making TABEL extensible. It also allows background knowledge bases to be adapted and changed based on the domains of the tables, thus making it domain independent. We demonstrate the extensibility and domain independence of our techniques by developing an application of TABEL in the healthcare domain. We develop a proof of concept for an application to generate meta-analysis reports automatically, which is built on top of the semantics inferred from tables found in medical literature.

A thorough evaluation with experiments over datasets of tables from the Web and medical research reports shows promising results.

Committee: Drs. Tim Finin (chair), Tim Oates, Anupam Joshi, Yun Peng, Indrajit Bhattacharya (IBM Research) and L. V. Subramaniam (IBM Research)

:BaseKB offered as a better Freebase version

July 15th, 2014, by Tim Finin, posted in Big data, KR, Ontologies, RDF, Semantic Web


In The trouble with DBpedia, Paul Houle talks about the problems he sees in DBpedia, Freebase and Wikidata and offers up :BaseKB as a better “generic database” that models concepts that are in people’s shared consciousness.

:BaseKB is a purified version of Freebase which is compatible with industry-standard RDF tools. By removing hundreds of millions of duplicate, invalid, or unnecessary facts, :BaseKB users speed up their development cycles dramatically when compared to the source Freebase dumps.

:BaseKB is available for commercial and academic use under a CC-BY license. Weekly versions (:BaseKB Now) can be downloaded from Amazon S3 on a “requester-paid basis”, estimated at US$3.00 per download. There are also :BaseKB Gold releases, which are periodic :BaseKB Now snapshots. These can be downloaded free via BitTorrent or purchased as a Blu-ray disc.

It looks like it’s worth checking out!

Jan 30 Ontology Summit: Tools, Services, and Techniques

January 30th, 2014, by Tim Finin, posted in KR, Ontologies, Semantic Web

Today’s online meeting (Jan 30, 12:30-2:30 EST) in the 2014 Ontology Summit series is part of the Tools, Services, and Techniques track and features presentations by

  • Dr. Chris Welty (IBM Research) on “Inside the Mind of Watson – a Natural Language Question Answering Service Powered by the Web of Data and Ontologies”
  • Prof. Alan Rector (U. Manchester) on “Axioms & Templates: Distinctions and Transformations amongst Ontologies, Frames, & Information Models”
  • Prof. Till Mossakowski (U. Magdeburg) on “Challenges in Scaling Tools for Ontologies to the Semantic Web: Experiences with Hets and OntoHub”

Audio via phone (206-402-0100) or Skype. See the session page for details and access to slides.

Ontology Summit: Use and Reuse of Semantic Content

January 23rd, 2014, by Tim Finin, posted in KR, Ontologies, Semantic Web

The first online session of the 2014 Ontology Summit on “Big Data and Semantic Web Meet Applied Ontology” takes place today (Thursday January 23) from 12:30pm to 2:30pm (EST, UTC-5) with topic Common Reusable Semantic Content — The Problems and Efforts to Address Them. The session will include four presentations followed by discussion.

Audio connection is via phone (206-402-0100, 141184#) or Skype with a shared screen and participant chatroom. See the session page for more details.

2014 Ontology Summit: Big Data and Semantic Web Meet Applied Ontology

January 14th, 2014, by Tim Finin, posted in Big data, KR, Ontologies, Semantic Web


The ninth Ontology Summit starts on Thursday, January 16 with the theme “Big Data and Semantic Web Meet Applied Ontology.” The event kicks off a three-month series of weekly online meetings on Thursdays that feature presentations from expert panels and discussions with all of the participants. The series will culminate with a two-day symposium on April 28-29 in Arlington, VA. The sessions are free and open to all, including researchers, practitioners and students.

The first virtual meeting will be held 12:30-2:30 (EST) on Thursday, January 16 and will introduce the nine different topical tracks in the series, their goals and organizers. Audio connection is via phone (206-402-0100, 141184#) or Skype with a shared screen and participant chatroom. See the session page for more details.

This year’s Ontology Summit is an opportunity for building bridges between the Semantic Web, Linked Data, Big Data, and Applied Ontology communities. On the one hand, the Semantic Web, Linked Data, and Big Data communities can bring a wide array of real problems (such as performance and scalability challenges and the variety problem in Big Data) and technologies (automated reasoning tools) that can make use of ontologies. On the other hand, the Applied Ontology community can bring a large body of common reusable content (ontologies) and ontological analysis techniques. Identifying and overcoming ontology engineering bottlenecks is critical for all communities.

The 2014 Ontology Summit is chaired by Michael Gruninger and Leo Obrst.

Google Top Charts uses the Knowledge Graph for entity recognition and disambiguation

May 23rd, 2013, by Tim Finin, posted in AI, Google, KR, NLP, OWL, Semantic Web

Top Charts is a new feature for Google Trends that identifies the popular searches within a category, e.g., books or actors. What’s interesting about it, from a technology standpoint, is that it uses Google’s Knowledge Graph to provide a universe of things and the categories into which they belong. This is a great example of “Things, not strings”, Google’s clever slogan to explain the importance of the Knowledge Graph.

Here’s how it’s explained in the Trends Top Charts FAQ.

“Top Charts relies on technology from the Knowledge Graph to identify when search queries seem to be about particular real-world people, places and things. The Knowledge Graph enables our technology to connect searches with real-world entities and their attributes. For example, if you search for ice ice baby, you’re probably searching for information about the musician Vanilla Ice or his music. Whereas if you search for vanilla ice cream recipe, you’re probably looking for information about the tasty dessert. Top Charts builds on work we’ve done so our systems do a better job finding you the information you’re actually looking for, whether tasty desserts or musicians.”

One thing to note about the Knowledge Graph, which is said to have more than 18 billion facts about 570 million objects, is that its objects include more than the traditional named entities (e.g., people, places, things). For example, there is a top chart for Animals that shows that dogs are the most popular animal in Google searches followed by cats (no surprises here) with chickens at number three on the list (could their high rank be due to recipe searches?). The dog object, in most knowledge representation schemes, would be modeled as a concept or class as opposed to an object or instance. In some representation systems, the same term (e.g., dog) can be used to refer to both a class of instances (a class that includes Lassie) and also to an instance (e.g., an instance of the class animal types). Which sense of the term dog is meant (class vs. instance) is determined by the context. In the semantic web representation language OWL 2, the ability to use the same term to refer to a class or a related instance is called punning.
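The class-versus-instance point can be sketched in a few lines of Python. This is a toy set of triples and not an OWL reasoner; the `ex:` names and the helper functions are made up for illustration. It only shows what it means for one name to appear in both roles, with the role read off the position in which the name is used.

```python
# Toy triple set illustrating punning: one name, two roles.
# All ex: names are hypothetical; this is not an OWL reasoner.
TYPE = "rdf:type"

triples = {
    ("ex:Lassie", TYPE, "ex:Dog"),      # ex:Dog used as a class
    ("ex:Dog", TYPE, "ex:AnimalType"),  # ex:Dog used as an instance
    ("ex:Cat", TYPE, "ex:AnimalType"),
}


def used_as_class(term):
    """True if something is declared to be an instance of `term`."""
    return any(p == TYPE and o == term for _, p, o in triples)


def used_as_instance(term):
    """True if `term` is itself declared to be an instance of something."""
    return any(s == term and p == TYPE for s, p, _ in triples)


def is_punned(term):
    """True if `term` shows up in both roles."""
    return used_as_class(term) and used_as_instance(term)


print(is_punned("ex:Dog"))     # both roles appear
print(is_punned("ex:Lassie"))  # only ever an instance
```

In this sketch the context that selects the sense is simply whether the name sits in the subject or the object position of a type statement.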

Of course, when doing this kind of mapping of terms to objects, we only want to consider concepts that commonly have words or short phrases used to denote them. Not all concepts do, such as animals that from a long way off look like flies.

A second observation is that once you have a nice knowledge base like the Knowledge Graph, you have a new problem: how can you recognize mentions of its instances in text? In the DBpedia knowledge base (derived from Wikipedia) there are nine individuals named Michael Jordan, and two of them were professional basketball players in the NBA. So, when you enter a search query like “When did Michael Jordan play for Penn”, we have to use information in the query, its context and what we know about the possible referents (e.g., those nine Michael Jordans) to decide (1) if this is likely to be a reference to any of the objects in our knowledge base, and (2) if so, to which one. This task, which is a fundamental one in language processing, is not trivial, but luckily, in applications like Top Charts, we don’t have to do it with perfect accuracy.
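The two decisions can be sketched as a small scoring function in Python. This is a deliberately crude illustration and not how Google or any particular system does it: the candidate identifiers, popularity priors, context words, the additive score and the threshold are all invented for the example.

```python
# Hypothetical candidates for one name: (identifier, popularity prior,
# words we associate with that referent). All values are made up.
CANDIDATES = {
    "michael jordan": [
        ("kb:michael_jordan_1", 0.90, {"bulls", "chicago", "basketball"}),
        ("kb:michael_jordan_2", 0.05, {"penn", "basketball"}),
        ("kb:michael_jordan_3", 0.05, {"berkeley", "statistics", "learning"}),
    ]
}


def link(mention, query, threshold=0.5):
    """Return the best candidate identifier for `mention`, or None.

    Score = popularity prior + number of query words shared with the
    candidate's context words. Returning None covers decision (1): the
    mention is probably not a reference to anything in the knowledge base.
    """
    tokens = set(query.lower().replace("?", "").split())
    best_id, best_score = None, 0.0
    for ident, prior, context in CANDIDATES.get(mention.lower(), []):
        score = prior + len(tokens & context)
        if score > best_score:
            best_id, best_score = ident, score
    return best_id if best_score >= threshold else None


print(link("Michael Jordan", "When did Michael Jordan play for Penn"))
print(link("Michael Jordan", "Michael Jordan"))
print(link("John Doe", "Who is John Doe"))
```

With no context the popular candidate wins on its prior alone, while a single matching context word is enough here to pull the query toward a less popular one.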

Google’s Top Charts is a simple, but effective, example that demonstrates the potential usefulness of semantic technology to make our information systems better in the near future.

Entity Disambiguation in Google Auto-complete

September 23rd, 2012, by Varish Mulwad, posted in AI, Google, KR, Ontologies, Semantic Web

Google has added an “entity disambiguation” feature along with auto-complete when you type in your search query. For example, when I search for George Bush, I get the following additional information in auto-complete.

As you can see, Google is able to identify that there are two George Bushes, the 41st and the 43rd presidents, and accordingly suggests that the user select the appropriate one. Similarly, if you search for Johns Hopkins, you get suggestions for Johns Hopkins the university, the entrepreneur and the hospital.  In the case of the Hopkins query, it’s the same entity name but with different types, and thus Google appends a different entity type to the entity name.

However, searching for Michael Jordan produces no entity disambiguation. If you are looking for Michael Jordan, the UC Berkeley professor, you will have to search for “Michael I Jordan”. Other examples that Google is not handling right now include queries such as apple {fruit, company} and jaguar {animal, car}.  It seems to me that Google is only including disambiguation between popular entities in its auto-complete. While there are six different George Bushes and ten different Michael Jordans on Wikipedia, Google includes only two and none, respectively, when it disambiguates George Bush and Michael Jordan.

Google talked about using its knowledge graph to produce this information.  One can envision the knowledge graph maintaining a unique identity for each entity in its collection, which will allow it to disambiguate entities with similar names (in the Semantic Web world, we call this assigning a unique URI to each unique thing or entity). With the Hopkins query, we can also see that the knowledge graph is maintaining entity type information along with each entity (e.g., Person, City, University, Sports Team, etc.).  While folks at Google have tried to steer clear of the Semantic Web, one can draw parallels between the underlying principles of the Semantic Web and the ones used in constructing the Google knowledge graph.

Google releases dataset linking strings and concepts

May 19th, 2012, by Tim Finin, posted in AI, Google, KR, NLP, Ontologies, Semantic Web, Wikipedia

Yesterday Google announced a very interesting resource with 175M short, unique text strings that were used to refer to one of 7.6M Wikipedia articles. This should be very useful for research on information extraction from text.

“We consider each individual Wikipedia article as representing a concept (an entity or an idea), identified by its URL. Text strings that refer to concepts were collected using the publicly available hypertext of anchors (the text you click on in a web link) that point to each Wikipedia page, thus drawing on the vast link structure of the web. For every English article we harvested the strings associated with its incoming hyperlinks from the rest of Wikipedia, the greater web, and also anchors of parallel, non-English Wikipedia pages. Our dictionaries are cross-lingual, and any concept deemed too fine can be broadened to a desired level of generality using Wikipedia’s groupings of articles into hierarchical categories.

The data set contains triples, each consisting of (i) text, a short, raw natural language string; (ii) url, a related concept, represented by an English Wikipedia article’s canonical location; and (iii) count, an integer indicating the number of times text has been observed connected with the concept’s url. Our database thus includes weights that measure degrees of association.”
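As a rough illustration of how such weights might be used, the Python sketch below normalizes the counts for each string into an estimate of how likely each concept is given that string. The rows are invented and only mimic the (text, url, count) shape described in the quote; the function name `concept_given_text` is an assumption and the released files may need parsing that is not shown here.

```python
from collections import defaultdict

# Made-up rows in the (text, url, count) shape described above.
rows = [
    ("jaguar", "Jaguar", 60),
    ("jaguar", "Jaguar_Cars", 30),
    ("jaguar", "Jacksonville_Jaguars", 10),
    ("big cat", "Jaguar", 5),
]


def concept_given_text(rows):
    """Estimate P(url | text) by normalizing counts within each text string."""
    totals = defaultdict(int)
    for text, _, count in rows:
        totals[text] += count

    probs = defaultdict(dict)
    for text, url, count in rows:
        # Accumulate so that a repeated (text, url) pair is still handled.
        probs[text][url] = probs[text].get(url, 0.0) + count / totals[text]
    return probs


probs = concept_given_text(rows)
for url, p in sorted(probs["jaguar"].items(), key=lambda item: -item[1]):
    print(f"{url}\t{p:.2f}")
```

Normalizing in the other direction, over all strings that point to one url, would give an estimate of how a concept tends to be mentioned instead.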

The details of the data and how it was constructed are in an LREC 2012 paper by Valentin Spitkovsky and Angel Chang, A Cross-Lingual Dictionary for English Wikipedia Concepts. Get the data here.

Google Knowledge Graph: first impressions

May 19th, 2012, by Tim Finin, posted in AI, Google, KR, NLP, Ontologies, Semantic Web

Google’s Knowledge Graph showed up for me this morning; it has been slowly rolling out since the announcement on Wednesday. It builds on lots of research from human language technology (e.g., entity recognition and linking) and the semantic web (graphs of linked data). The slogan, “things not strings”, is brilliant and easily understood.

My first impression is that it’s fast, useful and a great accomplishment but leaves lots of room for improvement and expansion. That last bit is a good thing, at least for those of us in the R&D community. Here are some comments based on some initial experimentation.

GKG only works on searches that are simple entity mentions like people, places, organizations. It doesn’t do products (Toyota Camry), events (World War II), or diseases (diabetes) but does recognize that ‘Mercury’ could be a planet or an element.

It’s a bit aggressive about linking: when searching for “John Smith” it zeros in on the 17th century English explorer. Poor Professor Michael Jordan never gets a chance, and providing context by adding Berkeley just suppresses the GKG sidebar. “Mitt” goes right to you know who. “George Bush” does lead to a disambiguation sidebar, though. Given that GKG doesn’t seem to allow for context information, the only disambiguating evidence it has is popularity (i.e., pagerank).

Speaking of context, the GKG results seem not to draw on user-specific information, like my location or past search history. When I search for “Columbia” from my location here in Maryland, it suggests “Columbia University” and “Columbia, South Carolina” and not “Columbia, Maryland” which is just five miles away from me.

Places include not just GPEs (geo-political entities) but also locations (Mars, Patapsco River) and facilities (MoMA, Empire State Building). To the GKG, the White House is just a place.

Organizations seem like a weak spot. It recognizes schools (UCLA) but company mentions seem not to be directly handled, not even for “Google”. A search for “NBA” suggests three “people associated with NBA” and “National Basketball Association” is not recognized. Forget finding out about the Cult of the Dead Cow.

Mike Bergman has some insights based on his exploration of the GKG in Deconstructing the Google Knowledge Graph.

The use of structured and semi-structured knowledge in search is an exciting area. I expect we will see much more of this showing up in search engines, including Bing.

Got a problem? There’s a code for that

September 15th, 2011, by Tim Finin, posted in Google, KR, Ontologies, OWL, Semantic Web, Social media

The Wall Street Journal article Walked Into a Lamppost? Hurt While Crocheting? Help Is on the Way describes the International Classification of Diseases, 10th Revision that is used to describe medical problems.

“Today, hospitals and doctors use a system of about 18,000 codes to describe medical services in bills they send to insurers. Apparently, that doesn’t allow for quite enough nuance. A new federally mandated version will expand the number to around 140,000—adding codes that describe precisely what bone was broken, or which artery is receiving a stent. It will also have a code for recording that a patient’s injury occurred in a chicken coop.”

We want to see the search engine companies develop and support a Microdata vocabulary for ICD-10. An ICDM-10 OWL DL ontology has already been done, but a Microdata version might add a lot of value. We could use it on our blogs and Facebook posts to catalog those annoying problems we encounter each day, like W59.22XA (Struck by turtle, initial encounter), or Y07.53 (Teacher or instructor, perpetrator of maltreatment and neglect).

Humor aside, a description logic representation (e.g., in OWL) makes the coding system seem less ridiculous. Instead of appearing as a catalog of 140K ground tags, it would emphasize that it is a collection of a much smaller number of classes that can be combined in productive ways to produce them or used to create general descriptions (e.g., bitten by an animal).
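A small Python sketch of that idea: a handful of building-block classes are combined to generate the ground entries, and a general description is just a pattern that leaves some parts unspecified. The vocabularies, the `Injury` class and the `matches` helper are invented for illustration and do not reflect the actual structure of ICD-10 or of any OWL ontology for it.

```python
from dataclasses import dataclass
from itertools import product

# Hypothetical building blocks: nine small classes in total.
CONTACTS = ["bitten by", "struck by"]
ANIMALS = ["turtle", "dog", "duck", "macaw"]
ENCOUNTERS = ["initial encounter", "subsequent encounter", "sequela"]


@dataclass(frozen=True)
class Injury:
    contact: str
    animal: str
    encounter: str

    def label(self):
        return f"{self.contact.capitalize()} {self.animal}, {self.encounter}"


# Every combination is a "ground tag": 2 * 4 * 3 = 24 of them.
ground = [Injury(c, a, e) for c, a, e in product(CONTACTS, ANIMALS, ENCOUNTERS)]


def matches(injury, contact=None, animal=None, encounter=None):
    """General description: parts left as None act as wildcards."""
    wanted = [
        (contact, injury.contact),
        (animal, injury.animal),
        (encounter, injury.encounter),
    ]
    return all(want is None or want == got for want, got in wanted)


# "Bitten by an animal": any animal, any encounter.
bitten = [i for i in ground if matches(i, contact="bitten by")]

print(len(ground), len(bitten))
print(Injury("struck by", "turtle", "initial encounter").label())
```

The count of ground tags grows as the product of the vocabulary sizes while the number of building blocks grows only as their sum, which is why a very large flat catalog can come from a much smaller set of classes.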

Mid-Atlantic student colloquium on speech, language and learning

September 2nd, 2011, by Tim Finin, posted in AI, Conferences, KR, Machine Learning, NLP

The First Mid-Atlantic Student Colloquium on Speech, Language and Learning is a one-day event to be held at the Johns Hopkins University in Baltimore on Friday, 23 September 2011. Its goal is to bring together students taking computational approaches to speech, language, and learning, so that they can introduce their research to the local student community, give and receive feedback, and engage each other in collaborative discussion. Attendance is open to all and free but space is limited, so online registration is requested by September 16. The program runs from 10:00am to 5:00pm and will include oral presentations, poster sessions, and breakout sessions.
