Entity Disambiguation in Google Auto-complete

September 23rd, 2012

Google has added an “entity disambiguation” feature along with auto-complete when you type in your search query. For example, when I search for George Bush, I get the following additional information in auto-complete.

As you can see, Google recognizes that there are two George Bushes, the 41st and the 43rd presidents, and accordingly suggests that the user select the appropriate one. Similarly, if you search for Johns Hopkins, you get suggestions for Johns Hopkins the university, the entrepreneur and the hospital. In the case of the Hopkins query, it’s the same entity name but with different types, and thus Google appends the entity type to the entity name.

However, searching for Michael Jordan produces no entity disambiguation. If you are looking for Michael Jordan, the UC Berkeley professor, you will have to search for “Michael I Jordan”. Other examples that Google is not handling right now include queries such as apple {fruit, company} and jaguar {animal, car}. It seems to me that Google is only including disambiguation between popular entities in its auto-complete. While there are six different George Bushes and ten different Michael Jordans on Wikipedia, Google includes only two and none respectively when it disambiguates George Bush and Michael Jordan.

Google talked about using its knowledge graph to produce this information. One can envision the knowledge graph maintaining a unique identity for each entity in its collection, which allows it to disambiguate entities with similar names (in the Semantic Web world, we call this assigning a unique URI to each unique thing or entity). With the Hopkins query, we can also see that the knowledge graph is maintaining entity type information along with each entity (e.g. Person, City, University, Sports Team etc). While folks at Google have tried to steer clear of the Semantic Web, one can draw parallels between the underlying principles of the Semantic Web and the ones used in constructing the Google knowledge graph.
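To make the idea concrete, here is a minimal sketch of how unique identifiers plus type information could drive this kind of suggestion. It is only an illustration of the guess above, not a description of how Google's knowledge graph works: the `ex:` URIs, the `popularity` scores and the `min_popularity` threshold are all invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    uri: str          # unique identifier (hypothetical "ex:" namespace)
    name: str         # surface name shared by several entities
    entity_type: str  # type label shown next to the name
    popularity: int   # made-up score, only for illustration


# Toy collection; URIs and scores are assumptions, not real data.
ENTITIES = [
    Entity("ex:george_h_w_bush", "George Bush", "41st U.S. President", 90),
    Entity("ex:george_w_bush", "George Bush", "43rd U.S. President", 95),
    Entity("ex:johns_hopkins_university", "Johns Hopkins", "University", 80),
    Entity("ex:johns_hopkins_hospital", "Johns Hopkins", "Hospital", 70),
    Entity("ex:johns_hopkins_person", "Johns Hopkins", "Entrepreneur", 60),
    Entity("ex:michael_jordan_basketball", "Michael Jordan", "Basketball player", 99),
    Entity("ex:michael_i_jordan", "Michael Jordan", "Professor", 20),
]


def suggest(query, min_popularity=50):
    """Return disambiguated suggestions, or [] when fewer than two
    sufficiently popular entities share the queried name."""
    matches = [
        e for e in ENTITIES
        if e.name.lower() == query.lower() and e.popularity >= min_popularity
    ]
    if len(matches) < 2:
        return []  # nothing to disambiguate between
    matches.sort(key=lambda e: e.popularity, reverse=True)
    return [f"{e.name} ({e.entity_type})" for e in matches]


print(suggest("George Bush"))
print(suggest("Michael Jordan"))
```

With these invented scores, the George Bush query yields two typed suggestions, while the Michael Jordan query yields none because only one entity clears the popularity threshold. That is one way the behavior observed above could arise.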

Wikimatix explains Google’s hot search trends

August 16th, 2008

Who is alisyn camerota and why are so many people suddenly interested in her?

Pundits often describe the Web as humanity’s giant, collective brain. So what are we thinking about today? Yesterday UMBC PhD student Akshay Java launched a new service, Wikimatix, that shows Google’s 100 hot search terms for the past hour and tries to explain each with information extracted from Wikipedia.

At the top of his form, Akshay whipped up the system in a three-hour break from writing his dissertation as a mashup of Google’s Hot Trends, Wikipedia, and Disqus.

Google’s hot trends gives an hourly list of the 100 Google search queries whose frequency has increased relative to the recent past. When you scan the list, the meaning of many of these is obvious, such as Michael Phelps or hurricane fay. But who or what is alisyn camerota, and what’s up with the interest in opossum? Here’s where Wikimatix helps, by annotating each of the hot terms with a snippet about it extracted from Wikipedia, along with links to the full article, a search for blog posts about the topic, and a place for users to add comments.
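As a rough illustration of what “frequency has increased relative to the recent past” could mean, the sketch below ranks queries by the ratio of the current hour’s count to the average of earlier hours. Google has not published its method here, so the scoring formula, the `smoothing` parameter and all of the counts are assumptions made up for the example.

```python
def hot_queries(current, history, top_n=3, smoothing=1.0):
    """Rank queries by how much the current count exceeds the recent average.

    current: dict mapping query -> count in the current hour
    history: dict mapping query -> list of counts from earlier hours
    smoothing: added to both sides so brand-new queries do not divide by zero
    """
    scores = {}
    for query, count in current.items():
        past = history.get(query, [])
        baseline = sum(past) / len(past) if past else 0.0
        scores[query] = (count + smoothing) / (baseline + smoothing)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Invented counts: "weather" is popular but steady, the others are spiking.
current = {"weather": 1000, "alisyn camerota": 300, "opossum": 120}
history = {
    "weather": [950, 1020, 990],
    "alisyn camerota": [4, 6, 5],
    "opossum": [20, 25, 15],
}

print(hot_queries(current, history, top_n=2))
```

The point of a relative measure is that a steadily popular query such as “weather” ranks below a rare query that suddenly spikes, which is why the hot list is full of unfamiliar names.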

Here’s an example that shows some of the hot search trends from earlier this morning.

Screenshot from Wikimatix explaining Google hot search trends

Akshay’s post, Wikimatix: Wikify and Disqus Hot Search Keywords, has the details and ideas for improving the service.

Dell trying to trademark cloud computing

August 3rd, 2008

Cloud computing is a hot topic this year, with IBM, Microsoft, Google, Yahoo, Intel, HP and Amazon all offering, using or developing high-end computing services typically described as “cloud computing”. We’ve started using it in our lab, like many research groups, via the Hadoop software framework and Amazon’s Elastic Compute Cloud services.

Bill Poser notes in a post (Trademark Insanity) on Language Log that Dell has applied for a trademark on the term “cloud computing”.

It’s bad enough that we have to deal with struggles over the use of trademarks that have become generic terms, like “Xerox” and “Coke”, and trademarks that were already generic terms among specialists, such as “Windows”, but a new low in trademarking has been reached by the joint efforts of Dell and the US Patent and Trademark Office. Cyndy Aleo-Carreira reports that Dell has applied for a trademark on the term “cloud computing”. The opposition period has already passed and a notice of allowance has been issued. That means that it is very likely that the application will soon receive final approval.

It’s clear, at least to me, that ‘cloud computing’ has become a generic term in general use for “data centers and mega-scale computing environments” that make it easy to dynamically focus a large number of computers on a computing task. It would be a shame to have one company claim it as a trademark. On Wikipedia a redirect for the Cloud Computing page was created several weeks before Dell’s USPTO application. A Google search produces many uses of cloud computing in news articles before 2007, although it’s clear that its use didn’t take off until mid-2007.

An examination of a Google Trends map shows that searches for ‘cloud computing’ (blue) began in September 2007 and have increased steadily, eclipsing searches for related terms like Hadoop, ‘map reduce’ and EC2 over the past ten months.

Here’s a document giving the current status of Dell’s trademark application (USPTO #77139082), which was submitted on March 23, 2007. According to the Wikipedia article on cloud computing, Dell

“… must file a ‘Statement of Use’ or ‘Extension Request’ within 6 months (by January 8, 2009) in order to proceed to registration, and thereafter must enforce the trademark to prevent removal for ‘non-use’. This may be used to prevent other vendors (eg Google, HP, IBM, Intel, Yahoo) from offering certain products and services relating to data centers and mega-scale computing environments under the cloud computing moniker.”

HealthMap mines text for a global disease alert map

July 8th, 2008

HealthMap is an interesting Web site that displays a “global disease alert map” based on information extracted from a variety of text sources on the Web, including news, WHO and NGOs. HealthMap was developed as a research project by Clark Freifeld and John Brownstein of the Children’s Hospital Informatics Program, part of the Harvard-MIT Division of Health Sciences & Technology.

Their site says

“HealthMap brings together disparate data sources to achieve a unified and comprehensive view of the current global state of infectious diseases and their effect on human and animal health. This freely available Web site integrates outbreak data of varying reliability, ranging from news sources (such as Google News) to curated personal accounts (such as ProMED) to validated official alerts (such as World Health Organization). Through an automated text processing system, the data is aggregated by disease and displayed by location for user-friendly access to the original alert. HealthMap provides a jumping-off point for real-time information on emerging infectious diseases and has particular interest for public health officials and international travelers.”

The work was done in part with support from Google, as described in a story on ABC News, Researchers Track Disease With Google News, Google.org Money.

Microsoft rumored to buy semantic search startup Powerset

June 26th, 2008

Venture Beat reports that Microsoft will acquire Powerset for a price “rumored to be slightly more than $100 million”. Powerset has been developing a Web search system that uses natural language processing technology acquired from PARC to more fully understand users’ queries and the text of the documents it indexes.

“By buying Powerset, Microsoft is hoping to close the perceived quality gap with Google’s search engine. The move comes as Microsoft CEO Steve Ballmer continues to argue that improving search is Microsoft’s most important task. Microsoft’s market share in search has steadily declined, dropping further and further behind first-place Google and second place Yahoo.

Google has generally dismissed Powerset’s semantic, or “natural language” approach as being only marginally interesting, even though Google has hired some semantic specialists to work on that approach in limited fashion. Google’s search results are still based primarily on the individual words you type into its search bar, and its approach does very little to understand the possible meaning created by joining two or more words together.”

If you put the query “Where is Mount Kilimanjaro” into the beta version of Powerset, it answers “Mount Kilimanjaro: Contained by Tanzania” in addition to showing web pages extracted from Wikipedia. That’s a pretty good answer.

Its response to “what is the Serengeti” is a little less precise. It reports seven things it knows about Serengeti: that it replaced ‘desert, Platinum, twilight and Caribbean Blue’, that it hosted ‘migration’, that it provided ‘draw’, that it gained ‘fame’, that it recorded ‘explorations’, that it rutted ‘season’ and that it boasted ‘Blue Wildebeests’. I’m just glad I don’t have a school report on the Serengeti due tomorrow!

Asking “Who is the president of Zimbabwe” results only in the fallback answer — which appears to be just the set of Wikipedia pages that the query words produce in an IR query. Compare this with the results of the Google query who is the president of zimbabwe site:wikipedia.org.

By the way, the AskWiki system often does a better job on these kinds of questions. Asking “where is the Serengeti” produces the answer “The Serengeti ecosystem is located in north-western Tanzania and extends to south-western Kenya between latitudes 1 and 3 S and longitudes 34 and 36 E. It spans some 30,000 km.” It’s a bit of a hack, though. It seems to work by selecting the sentence or two in Wikipedia that best serves as an answer. See our post on AskWiki from last fall for more examples.
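For the curious, here is a toy version of that sentence-selection idea. It is not AskWiki’s actual algorithm, which we have not seen. It simply scores each sentence by how many content words it shares with the question, plus a few cue words tied to the question word. The `CUES` table, the stopword list and the sample article text are all assumptions made up for this sketch.

```python
import re

# Small illustrative word lists; both are assumptions, not AskWiki's.
STOPWORDS = {"the", "is", "a", "an", "of", "in", "and", "to"}
CUES = {
    "where": {"located", "north", "south", "east", "west", "between"},
}


def words(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def best_sentence(question, article):
    """Return the sentence of `article` that scores highest for `question`."""
    q_words = words(question)
    cues = set()
    for w in q_words:
        cues |= CUES.get(w, set())
    # Content words: question tokens that are neither stopwords nor question words.
    content = {w for w in q_words if w not in STOPWORDS and w not in CUES}
    sentences = re.split(r"(?<=[.!?])\s+", article.strip())

    def score(sentence):
        tokens = set(words(sentence))
        return len(tokens & content) + len(tokens & cues)

    return max(sentences, key=score)


# Made-up article text, loosely modeled on the answer quoted above.
ARTICLE = (
    "The Serengeti hosts a large annual migration of wildebeest. "
    "The Serengeti ecosystem is located in north-western Tanzania "
    "and extends to south-western Kenya. "
    "It spans some 30,000 km."
)

print(best_sentence("where is the Serengeti", ARTICLE))
```

Even this crude scoring picks the “is located in” sentence for a “where” question, which suggests why a sentence-selection hack can look surprisingly good on simple factual queries.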

Still, Powerset is an ambitious system that shows promise. What they are trying to do is important and will eventually be done. They have shown real progress in the past two years, more than I had expected. I hope Microsoft can accelerate the development and find practical ways to improve Web search even if the ultimate goal of full language understanding is many years away.

Feedburner to include AdSense ads starting next week

May 30th, 2008

A post on the Feedburner blog, Into the wild: AdSense for feeds, announced that Google will start integrating AdSense ads into feeds next week.

“… publishers already in the FeedBurner Ad Network will continue to see premium CPM ads directly sold onto their content, but with the added bonus of contextually targeted ads that will fill up the remainder of their inventory. … And with AdSense, you’ll know that your back-filled ads are using the strongest contextual ad engine, ensuring the most relevant and profitable ads are delivered to your subscribers. … For publishers who are not yet placing ads in their feeds, any publisher who meets the requirements to join the AdSense program will also be able to use AdSense for feeds. You will be able to manage your feed ad units directly from AdSense Setup tab, and track performance right on the AdSense Report tab. …”

Does the Google AdSense bot have a sense of irony?

August 4th, 2006

Earlier this week we got a message from Google’s AdSense bot (model T-800) about our post on Google’s anti c1ick fraud techniques.

“While reviewing your account, we noticed that you are currently displaying Google ads in a manner that is not compliant with our policies. For instance, we found violations of AdSense policies on pages such as http://…. Publishers are not permitted to encourage users to c1ick on Google ads or bring excessive attention to ad units. For example, your site cannot contain phrases such as “c1ick the ads,” “support our sponsors,” “visit these recommended links,” or other similar language that could apply to the Google ads on your site. …”

We can only guess that our post had too many occurrences of the word c1ick and that this drew the bot’s attention. The bot wants us to change the post and warned “I’ll be back”.

“Once you update your site, we will automatically detect the changes and ad serving will not be affected. If you choose not to make the changes to your account within the next three days, your account will remain active but you will no longer be able to display ads on the site. Please note, however, that we may disable your account if further violations are found in the future.”

We sympathize with the bot. Google’s AdSense policy is reasonable and for the greater good. It’s surely a challenging problem to automatically detect possible violations of it, and having a person screen them all is probably an expensive solution. We’ve replied to the email, of course, asking a person to look and verify that we are following the policy. But we’ve not heard back and the clock is ticking. We may have to de-c1ickify the post.

UPDATE (5 Aug): We did hear back from the Google AdSense team and our interpretation of the problem was wrong.