Entity Disambiguation in Google Auto-complete

September 23rd, 2012

Google has added an “entity disambiguation” feature along with auto-complete when you type in your search query. For example, when I search for George Bush, I get the following additional information in auto-complete.

As you can see, Google is able to identify that there are two George Bushes, the 41st and the 43rd presidents, and accordingly suggests that the user select the appropriate one. Similarly, if you search for Johns Hopkins, you get suggestions for Johns Hopkins the university, the entrepreneur and the hospital. In the case of the Hopkins query, it’s the same entity name but with different types, and thus Google appends the entity type to the entity name.

However, searching for Michael Jordan produces no entity disambiguation. If you are looking for Michael Jordan, the UC Berkeley professor, you will have to search for “Michael I Jordan”. Other examples that Google is not handling right now include queries such as apple {fruit, company} and jaguar {animal, car}. It seems to me that Google is only including disambiguation between popular entities in its auto-complete. While there are six different George Bushes and ten different Michael Jordans on Wikipedia, Google includes only two and none respectively when it disambiguates George Bush and Michael Jordan.

Google talked about using its knowledge graph to produce this information. One can envision the knowledge graph maintaining a unique identity for each entity in its collection, which allows it to disambiguate entities with similar names (in the Semantic Web world, we call this assigning a unique URI to each unique thing or entity). With the Hopkins query, we can also see that the knowledge graph is maintaining entity type information along with each entity (e.g. Person, City, University, Sports Team etc). While folks at Google have tried to steer clear of the Semantic Web, one can draw parallels between the underlying principles of the Semantic Web and the ones used in constructing the Google knowledge graph.
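To make the idea concrete, here is a minimal sketch of how a store with one unique identifier and one type per entity could drive this kind of suggestion. It is not how Google does it; the example.org URIs, the popularity scores and the 0.5 cut-off are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    uri: str            # unique identifier (hypothetical example.org URIs)
    name: str           # surface name, shared by several entities
    entity_type: str    # type label shown next to the name
    popularity: float   # invented score in [0, 1], for illustration only


ENTITIES = [
    Entity("http://example.org/e/1", "George Bush", "41st US President", 0.90),
    Entity("http://example.org/e/2", "George Bush", "43rd US President", 0.95),
    Entity("http://example.org/e/3", "Johns Hopkins", "University", 0.90),
    Entity("http://example.org/e/4", "Johns Hopkins", "Entrepreneur", 0.60),
    Entity("http://example.org/e/5", "Johns Hopkins", "Hospital", 0.80),
    Entity("http://example.org/e/6", "Michael Jordan", "Basketball player", 0.99),
    Entity("http://example.org/e/7", "Michael Jordan", "Professor", 0.20),
]


def suggest(query, min_popularity=0.5):
    """Return 'name (type)' suggestions when two or more popular entities share the name."""
    wanted = query.strip().lower()
    matches = [
        e for e in ENTITIES
        if e.name.lower() == wanted and e.popularity >= min_popularity
    ]
    if len(matches) < 2:
        # Nothing to disambiguate among the popular entities.
        return []
    matches = sorted(matches, key=lambda e: e.popularity, reverse=True)
    return [f"{e.name} ({e.entity_type})" for e in matches]


if __name__ == "__main__":
    for q in ("George Bush", "Johns Hopkins", "Michael Jordan"):
        print(q, "->", suggest(q))
```

With a popularity cut-off like this, the less popular Michael Jordan never makes it into the candidate list, which would be consistent with the behaviour described above.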

From tables to 5 star linked data

December 25th, 2010

The goal and vision of the Semantic Web is to create a Web of connected and interlinked data (items) which can be shared and reused by all. Sharing and opening up “raw data” is great, but the Semantic Web isn’t just about sharing data. To create a Web of data, one needs interlinking between data. In 2006, Sir Tim Berners-Lee introduced the notion of linked data, in which he outlined the best practices for creating and sharing data on the Web. To encourage people and governments to share data, he recently developed a five-star rating system.

The highest rating is for data that links to other people’s data to provide context. While the Semantic Web has been growing steadily, there is a lot of data that is still in raw format. A study by Google researchers shows that there are 154 million tables with high quality relational data on the World Wide Web. The US government, along with 7 other nations, has started sharing data publicly. Not all of this data is RDF or conforms to the best practices of publishing and sharing linked data.

Here in the Ebiquity Research Lab, we have been focusing on converting data in tables and spreadsheets into RDF. Our focus is not on generating just RDF, however, but on generating high quality linked data (or “5 star data”, as Berners-Lee now calls it). Our goal is to build a completely automated framework for interpreting tables and generating linked data from them.

As part of our preliminary research, we have already developed a baseline framework which can link the table column headers to classes from ontologies in the linked data cloud datasets, link the table cells to entities in the linked data cloud and identify relations between table columns and map them to properties in the linked data cloud. You can read papers related to our preliminary research at [1]. We will use this blog as a medium to publish updates in our pursuit of creating “5-star” data for the Semantic Web.
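As a rough illustration of the three linking steps described above, the sketch below turns a tiny table into subject-predicate-object triples. It is not our framework: the three lookup dictionaries are hypothetical stand-ins for the actual header, cell and relation linking, and the DBpedia URIs are only there as examples.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hypothetical, hand-written lookups standing in for the real linking steps.
HEADER_TO_CLASS = {
    "City": "http://dbpedia.org/ontology/City",
    "Country": "http://dbpedia.org/ontology/Country",
}
CELL_TO_ENTITY = {
    "Baltimore": "http://dbpedia.org/resource/Baltimore",
    "United States": "http://dbpedia.org/resource/United_States",
    "Paris": "http://dbpedia.org/resource/Paris",
    "France": "http://dbpedia.org/resource/France",
}
COLUMN_PAIR_TO_PROPERTY = {
    ("City", "Country"): "http://dbpedia.org/ontology/country",
}


def table_to_triples(headers, rows):
    """Emit (subject, predicate, object) triples for the cells that could be linked."""
    triples = []
    for row in rows:
        # Link every cell to an entity URI; unknown cells become None.
        uris = [CELL_TO_ENTITY.get(cell) for cell in row]

        # Column header -> class: type each linked cell.
        for header, uri in zip(headers, uris):
            cls = HEADER_TO_CLASS.get(header)
            if uri and cls:
                triples.append((uri, RDF_TYPE, cls))

        # Relation between two columns -> property between their cells.
        for (subj_col, obj_col), prop in COLUMN_PAIR_TO_PROPERTY.items():
            if subj_col in headers and obj_col in headers:
                subj = uris[headers.index(subj_col)]
                obj = uris[headers.index(obj_col)]
                if subj and obj:
                    triples.append((subj, prop, obj))
    return triples


if __name__ == "__main__":
    table = [["Baltimore", "United States"], ["Paris", "France"]]
    for triple in table_to_triples(["City", "Country"], table):
        print(triple)
```

Cells that cannot be linked are simply skipped here; deciding what to do with them is part of what makes the real problem hard.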

If you are a data publisher, go grab some Linked Data star badges at [2]. You can show your support for the open data movement by getting t-shirts, mugs and bumper stickers from [3]! (All profits go to W3C.)

Happy Holidays! Let 2011 be yet another step forward in the open data movement!

[1] – https://ebiquity.umbc.edu/person/html/Varish/Mulwad/?pub=on#pub

[2] – http://lab.linkeddata.deri.ie/2010/lod-badges/

[3] – http://www.cafepress.co.uk/w3c_shop

Provenance Tracking in Science Data Processing Systems

May 28th, 2008

Maybe we should think of data provenance as being like a recipe. Recipes for preparing food are more than just a list of ingredients: they specify, often in great detail, how the ingredients are combined, cooked and served, as well as the cooking implements and their settings.

Curt Tilmes presented his PhD dissertation proposal yesterday on “Provenance Tracking in Science Data Processing Systems”. Curt works at the NASA Goddard Space Flight Center and is responsible for managing the data processing of earth science climate research data. Curt has some very good ideas about how to capture all of the relevant provenance data for sophisticated scientific data. He’s using, of course, the Semantic Web languages (RDF and OWL) to express and share the provenance data.

Part of the problem is that you have to capture not just the inputs to a dataset, but how the inputs were processed to produce the dataset, including (ideally) the algorithms, software and hardware. As an easily grasped example to illustrate this, he referred to a recent post by Ray Pierrehumbert on the RealClimate blog, How to cook a graph in three easy lessons. This post demonstrates how Roy Spencer processes inputs from two common climate datasets (the Southern Oscillation and Pacific Decadal Oscillation indexes) to get the results that support the conclusion that global warming is due to natural causes and not human activity.
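The recipe analogy can be sketched as a small data structure that records the ingredients (inputs) together with each processing step and the algorithm, software and hardware behind it. This is only an illustration in plain Python, not Curt's design (which uses RDF and OWL); every class, field and value below is a hypothetical placeholder.

```python
from dataclasses import dataclass, field


@dataclass
class ProcessingStep:
    algorithm: str                      # what was done to the data
    software: str                       # program and version that did it
    hardware: str                       # machine it ran on
    parameters: dict = field(default_factory=dict)


@dataclass
class ProvenanceRecord:
    dataset: str                        # the product being described
    inputs: list                        # the "ingredients"
    steps: list = field(default_factory=list)

    def recipe(self):
        """Render the record as a human-readable recipe."""
        lines = [
            f"Dataset: {self.dataset}",
            "Inputs: " + ", ".join(self.inputs),
        ]
        for number, step in enumerate(self.steps, start=1):
            lines.append(
                f"Step {number}: {step.algorithm} "
                f"[{step.software} on {step.hardware}] "
                f"params={step.parameters}"
            )
        return "\n".join(lines)


if __name__ == "__main__":
    # All names and settings here are placeholders, not real processing runs.
    record = ProvenanceRecord(
        dataset="smoothed_index",
        inputs=["index_a_monthly", "index_b_monthly"],
        steps=[
            ProcessingStep("running mean", "smoother 1.0", "cluster-node-7", {"window": 5}),
            ProcessingStep("linear combination", "combine 0.3", "cluster-node-7"),
        ],
    )
    print(record.recipe())
```

Two datasets built from the same inputs but with a different smoothing window would then have visibly different recipes, which is exactly the kind of difference the cooked-graph example turns on.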

Faviki uses Wikipedia and DBpedia for semantic tagging

May 26th, 2008

Faviki is a new social bookmarking system that uses Wikipedia articles for tags. It actually uses URLs in the DBpedia namespace that correspond to Wikipedia pages. The immediate benefits of this approach are several:

  • Users select tags from a large, common tag space. The ‘meaning’ of each tag can be understood by reading the associated Wikipedia page. This makes it more likely that resources that share a tag, even if assigned by different people, are actually related.
  • Since the universe of tags is derived from Wikipedia, it is generated, kept current and maintained by a large and diverse set of people.
  • The tags have structured information associated with them and are part of a broader-than, narrower-than lattice. It is not clear to me how much reasoning Faviki does with the linked data or when. But there is clearly a lot of potential here.
  • There is an opportunity to make the tagging system multi-lingual, since Wikipedia has articles in multiple languages and supports a way to link equivalent articles expressed in different languages.

The downside, of course, is that you lose the freedom and ease of most open tagging approaches — using the words and phrases that come immediately to mind.
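To illustrate the idea of tagging with DBpedia URLs, here is a small sketch that maps an English Wikipedia page URL to the corresponding DBpedia resource URL and uses it as a shared tag. I don't know how Faviki implements this; the function names and the in-memory store are my own assumptions, and the sketch only handles English Wikipedia.

```python
from collections import defaultdict
from urllib.parse import urlparse

DBPEDIA_RESOURCE = "http://dbpedia.org/resource/"


def wikipedia_to_dbpedia(url):
    """Map an English Wikipedia article URL to its DBpedia resource URL."""
    parsed = urlparse(url)
    if parsed.netloc != "en.wikipedia.org" or not parsed.path.startswith("/wiki/"):
        raise ValueError(f"not an English Wikipedia article URL: {url}")
    title = parsed.path[len("/wiki/"):]
    return DBPEDIA_RESOURCE + title


# Hypothetical in-memory store: tag URL -> set of bookmarked pages.
bookmarks_by_tag = defaultdict(set)


def tag_bookmark(page_url, wikipedia_url):
    """Tag a bookmarked page with the DBpedia URL for a Wikipedia article."""
    tag = wikipedia_to_dbpedia(wikipedia_url)
    bookmarks_by_tag[tag].add(page_url)
    return tag


if __name__ == "__main__":
    # Two different users picking the same article end up with the same tag.
    tag_bookmark("http://example.org/page-a", "http://en.wikipedia.org/wiki/Semantic_Web")
    tag_bookmark("http://example.org/page-b", "https://en.wikipedia.org/wiki/Semantic_Web")
    print(dict(bookmarks_by_tag))
```

Because both users end up with the same URL as their tag, there is no need to reconcile spellings such as "semweb" and "semantic-web" after the fact.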

The Faviki system is related to our own Wikitology project, which is exploring the use of Wikipedia terms as an ontology, and also to Harry Chen’s Gnizr, which is an RDF-based social tagging system. Our current Wikitology work is focused on mapping text and entities from text into a set of terms derived from Wikipedia and salted with additional data from DBpedia and Freebase.

One interesting research question is whether it’s possible to combine the ease of using user-generated tags with the power of mapping them into tags in a structured or semi-structured knowledge base.

Deriving knowledge bases from Wikipedia and using them in innovative ways is a very exciting topic that is sure to receive a lot of work in the coming years.

(spotted on ReadWriteWeb)

Int. Semantic Web Conf. workshop details

May 23rd, 2008

The 7th International Semantic Web Conference (ISWC) has an exciting program of thirteen one-day workshops that will be held on October 26 and 27. The deadlines for submitting papers vary. See the individual workshop pages for detailed information on their scope and structure and for information on submitting papers and participating.

The final scheduling of the workshops, assigning them to the 26th or 27th, has not yet been done.

PhD proposal: Context and Policies in Declarative Networked Systems

May 19th, 2008

UMBC PhD student Palanivel Kodeswaran will present his dissertation proposal on Use of Context and Policies in Declarative Networked Systems at 3:30 on Tuesday May 20 in ITE 325. Dissertation proposals are public and visitors are welcome. If you are a PhD student and are (or should be!) working on your own proposal, going to these is a good way to prepare. You can see what’s involved, what works and what doesn’t, and what kind of questions you can expect. See the link above for the full abstract, but here is a teaser.

“In this thesis, we propose to build a declarative framework that can reason over the requirements of applications, the current network context, operator policies, and appropriately configure the network to provide better network support for applications. … In particular, the contributions of this thesis are (i) Developing a framework for using context and policies in declarative networked systems (ii) Runtime adaptation of network configuration based on application requirements and node/operator policy (iii) Formalize cross layer interactions as opposed to ad hoc optimizations (iv) Simulation and test bed implementations to validate and evaluate proposed approach.”