Google Knowledge Graph: first impressions

May 19th, 2012

Google’s Knowledge Graph showed up for me this morning — it’s been slowly rolling out since the announcement on Wednesday. It builds on lots of research from human language technology (e.g., entity recognition and linking) and the semantic web (graphs of linked data). The slogan, “things not strings”, is brilliant and easily understood.

My first impression is that it’s fast, useful and a great accomplishment, but it leaves lots of room for improvement and expansion. That last bit is a good thing, at least for those of us in the R&D community. Here are some comments based on my initial experimentation.

The GKG only works on searches that are simple entity mentions, like people, places and organizations. It doesn’t do products (Toyota Camry), events (World War II) or diseases (diabetes), but it does recognize that ‘Mercury’ could be a planet or an element.

It’s a bit aggressive about linking: when searching for “John Smith” it zeros in on the 17th-century English explorer. Poor Professor Michael Jordan never gets a chance, and providing context by adding Berkeley just suppresses the GKG sidebar. “Mitt” goes right to you know who. “George Bush” does lead to a disambiguation sidebar, though. Given that the GKG doesn’t seem to allow for context information, the only disambiguating evidence it has is popularity (i.e., PageRank).
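
To make that concrete, here is a toy sketch of popularity-only entity linking in Python. It is purely illustrative: the candidate lists and scores are invented and are not Google’s data.

```python
# Toy sketch of popularity-only entity linking: with no context available,
# each mention simply maps to its most "popular" candidate.
CANDIDATES = {
    "john smith": [("John Smith (explorer)", 0.61), ("John Smith (economist)", 0.07)],
    "michael jordan": [("Michael Jordan (basketball player)", 0.93),
                       ("Michael I. Jordan (Berkeley professor)", 0.04)],
    "mercury": [("Mercury (planet)", 0.48), ("Mercury (element)", 0.44)],
}

def link(mention, margin=0.10):
    """Return the most popular candidate, or all close candidates when it's ambiguous."""
    ranked = sorted(CANDIDATES.get(mention.lower(), []), key=lambda c: -c[1])
    if not ranked:
        return None
    close = [c for c in ranked if ranked[0][1] - c[1] < margin]
    return close if len(close) > 1 else ranked[0]

print(link("Michael Jordan"))  # the professor never gets a chance
print(link("Mercury"))         # scores are close, so both the planet and the element come back
```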

Speaking of context, the GKG results seem not to draw on user-specific information, like my location or past search history. When I search for “Columbia” from my location here in Maryland, it suggests “Columbia University” and “Columbia, South Carolina”, but not “Columbia, Maryland”, which is just five miles away from me.

Places include not just GPEs (geo-political entities) but also locations (Mars, the Patapsco River) and facilities (MoMA, the Empire State Building). To the GKG, the White House is just a place.

Organizations seem like a weak spot. It recognizes schools (UCLA), but company mentions seem not to be handled directly, not even “Google”. A search for “NBA” suggests three “people associated with NBA”, while “National Basketball Association” is not recognized at all. Forget finding out about the Cult of the Dead Cow.

Mike Bergman has some insights based on his exploration of the GKG in Deconstructing the Google Knowledge Graph.

The use of structured and semi-structured knowledge in search is an exciting area. I expect we will see much more of this showing up in search engines, including Bing.


Google Knowledge Graph: things, not strings

May 16th, 2012

Google announced its “knowledge graph” today and describes it as “an intelligent model—in geek-speak, a ‘graph’ — that understands real-world entities and their relationships to one another: things, not strings. … It currently contains more than 500 million objects, as well as more than 3.5 billion facts about and relationships between these different objects. And it’s tuned based on what people search for, and what we find out on the web.” Information from the knowledge graph will initially augment search results — the feature is already being rolled out to US English users. A short video explains more.

A CNET article quotes KG project manager Jack Menzel, noting that Menzel pitches Knowledge Graph without using the word “semantic” even once. While he says, “I dream of the semantic Web,” he takes pains to point out that what Google is announcing today is not what people talk about when they discuss semantic Web concepts. “We do continue to work on how to make search semantic,” he says, “but talking about it brings out the crazy people.” I hope this did not come out the way he intended it to.


Google on Rich Snippets

April 19th, 2012

Google’s Webmasters blog has a post on rich snippets and structured data. While this is from Google, Microsoft’s Bing search engine has a very similar approach.

Snippets are “the few lines of text that appear under every search result” that are designed to “give users a sense for what’s on the page and why it’s relevant to their query.” The post points out that the search engine needs to understand the content on a page in order to produce the snippets.

“The snippet for a restaurant might show the average review and price range; the snippet for a recipe page might show the total preparation time, a photo, and the recipe’s review rating; and the snippet for a music album could list songs along with a link to play each song.”

You can help search engines understand the key information in your content by adding structured data in one of several formats (Google looks for Microdata (preferred), Microformats and RDFa) for a small set of topics, namely those covered by schema.org: Reviews, People, Products, Businesses and organizations, Recipes, Events and Music.
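
To make the idea concrete, here is a minimal sketch of what that markup and its extraction might look like: a hypothetical page fragment using schema.org Microdata (the values are invented) and a few lines of Python that pull out the annotated properties roughly the way a crawler might. This is only an illustration, not Google’s code.

```python
from html.parser import HTMLParser

# A hypothetical page fragment marked up with schema.org Microdata for one of
# the supported topics (a recipe); the values are invented for illustration.
PAGE = """
<div itemscope itemtype="http://schema.org/Recipe">
  <h1 itemprop="name">Chocolate Chip Cookies</h1>
  <span itemprop="prepTime">PT20M</span>
  <span itemprop="recipeYield">24 cookies</span>
</div>
"""

class MicrodataScraper(HTMLParser):
    """Collect (itemprop, text) pairs, roughly as a rich-snippets crawler might."""
    def __init__(self):
        super().__init__()
        self.prop = None
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemprop" in attrs:
            self.prop = attrs["itemprop"]

    def handle_data(self, data):
        if self.prop and data.strip():
            self.pairs.append((self.prop, data.strip()))
            self.prop = None

scraper = MicrodataScraper()
scraper.feed(PAGE)
print(scraper.pairs)  # [('name', 'Chocolate Chip Cookies'), ('prepTime', 'PT20M'), ('recipeYield', '24 cookies')]
```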

See the full post for details and links.


Google semantic web search

March 15th, 2012

The Wall Street Journal’s Amir Efrati has an article (Google Gives Search a Refresh) and blog post (What Google’s Search Changes Might Mean for You) on upcoming changes Google is making to its search engine to exploit semantic data.

“Google is undergoing a major, long-term overhaul of its search-engine, using what’s called semantic Web search to enhance the current system in the coming years. The move, starting over the next few months, will impact the way people can use the search engine as well as how the search engine examines sites across the Web before ranking them in search results.

A Google spokesman said the company wouldn’t comment on future search-engine features. But people familiar with the initiative say that Google users will be able to browse through the company’s “knowledge graph,” or its ever-expanding database of information about “entities”—people, places and things—the “attributes” of those entities and how different entities are connected to one another.

Some open standards come from the W3C Semantic Web and Schema.org, which the major search engine players, including Google, have agreed to recognize, Cornett said.


Are Apple and Google creating a crisis for the open Web?

February 10th, 2012

A CNET article, W3C co-chair: Apple, Google power causing Open Web crisis, says that “The dominance of Apple and Google mobile browsers is leading to a situation that’s even worse for Web programming than the former dominance of Internet Explorer, a standards group leader warned today.”

The problem is that both the Safari and Chrome browsers, and their counterparts on Android, iPhone and iPad, use the WebKit layout engine. WebKit supports many non-standard CSS features and Web developers are building sites and pages that take advantage of them.

Daniel Glazman, co-chairman of the CSS Working Group, described it this way.

Not so long ago, IE6 was the over-dominant browser on the Web. Technically, the Web was full of works-only-in-IE6 web sites and the other browsers’ users were crying. IE6 is dead, that time is gone, and all browser vendors, including Microsoft itself, rejoice. Gone? Not entirely… IE6 is gone, but the problem is back.

WebKit, the rendering engine at the heart of Safari and Chrome, living in iPhones, iPads and Android devices, is now the over-dominant browser on the mobile Web and technically, the mobile Web is full of works-only-in-WebKit web sites while other browsers and their users are crying.

He issued a call to action that describes the steps that the web community of authors, designers and developers can take to support an open web based on standards.


Got a problem? There’s a code for that

September 15th, 2011

The Wall Street Journal article Walked Into a Lamppost? Hurt While Crocheting? Help Is on the Way describes the International Classification of Diseases, 10th Revision (ICD-10), the system used to code medical problems.

“Today, hospitals and doctors use a system of about 18,000 codes to describe medical services in bills they send to insurers. Apparently, that doesn’t allow for quite enough nuance. A new federally mandated version will expand the number to around 140,000—adding codes that describe precisely what bone was broken, or which artery is receiving a stent. It will also have a code for recording that a patient’s injury occurred in a chicken coop.”

We want to see the search engine companies develop and support a Microdata vocabulary for ICD-10. An ICD-10 OWL DL ontology has already been developed, but a Microdata version might add a lot of value. We could use it on our blogs and Facebook posts to catalog those annoying problems we encounter each day, like W59.22XD (Struck by turtle, initial encounter), or Y07.53 (Teacher or instructor, perpetrator of maltreatment and neglect).

Humor aside, a description logic representation (e.g., in OWL) makes the coding system seem less ridiculous. Instead of appearing as a catalog of 140K ground tags, it would be seen as a much smaller collection of classes that can be combined in productive ways to produce those tags, or to create more general descriptions (e.g., bitten by an animal).
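
As a rough sketch of that compositional view, the following few lines of Python use rdflib to express a “struck by turtle” code as an OWL intersection of reusable classes and restrictions rather than as an opaque tag. The namespace and the class and property names are invented for illustration; they are not taken from the published ICD-10 ontology.

```python
from rdflib import Graph, Namespace, BNode, RDF
from rdflib.namespace import OWL
from rdflib.collection import Collection

# Hypothetical namespace and class/property names, invented for illustration;
# they are not taken from the actual ICD-10 OWL ontology.
EX = Namespace("http://example.org/icd10/")

g = Graph()
g.bind("owl", OWL)
g.bind("ex", EX)

# Model W59.22 ("struck by turtle") as an intersection of a reusable injury
# class and a property restriction, rather than as an opaque ground tag.
struck_by_turtle = BNode()
g.add((struck_by_turtle, RDF.type, OWL.Restriction))
g.add((struck_by_turtle, OWL.onProperty, EX.struckBy))
g.add((struck_by_turtle, OWL.someValuesFrom, EX.Turtle))

members = BNode()
Collection(g, members, [EX.ExternalCauseInjury, struck_by_turtle])

g.add((EX.W59_22, RDF.type, OWL.Class))
g.add((EX.W59_22, OWL.intersectionOf, members))

print(g.serialize(format="turtle"))
```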


Google lobbies Nevada to allow self-driving cars

May 11th, 2011

A story in yesterday’s NYT, Google Lobbies Nevada To Allow Self-Driving Cars, reports that Google has hired a Nevada lobbyist to promote two bills related to autonomous vehicles that are expected to be voted on this summer.

“Google hired David Goldwater, a lobbyist based in Las Vegas, to promote the two measures, which are expected to come to a vote before the Legislature’s session ends in June. One is an amendment to an electric-vehicle bill providing for the licensing and testing of autonomous vehicles, and the other is the exemption that would permit texting.”

Arguments the lobbyist offered included that “the autonomous technology would be safer than human drivers, offer more fuel-efficient cars and promote economic development.”

I’d add that the Google Bot has a clean driving record, exhibits an excellent sense of direction, will obey any laws inserted into a state’s robots.txt, and does not drink. However, the Google Bot’s current cars are all Toyotas and an Audi. Maybe the Nevada legislature should find a way to encourage it to support the US auto industry and buy some American cars.

I liked project leader Sebastian Thrun’s example of a potential benefit of autonomous vehicles.

“In frequent public statements, he has said robotic vehicles would increase energy efficiency while reducing road injuries and deaths. And he has called for sophisticated systems for car sharing that, he says, could cut the number of cars in the United States in half. “What if I could take out my phone and say, ‘Zipcar, come here,’ ” he asked an industry conference last year, “and a moment later the Zipcar came around the corner?””


Google recipe search exploits semantic web data in RDFa

February 26th, 2011

Many people now use the Web to find recipes rather than their own collection of cookbooks, and it is estimated that about one percent of all Google searches are for recipes. This past Thursday, Google released Recipe View in the US, letting you limit results to pages that are recipes and further narrow your search by ingredients, cooking time and calories. This feature is powered by semantic metadata encoded in RDFa and other formats.


Google describes the new recipe search in a post on the Official Google Blog:

“Recipe View lets you narrow your search results to show only recipes, and helps you choose the right recipe amongst the search results by showing clearly marked ratings, ingredients and pictures. To get to Recipe View, click on the “Recipes” link in the left-hand panel when searching for a recipe. You can search for specific recipes like [chocolate chip cookies], or more open-ended topics—like [strawberry] to find recipes that feature strawberries, or even a holiday or event, like [cinco de mayo]. In fact, you can try searching for all kinds of things and still find interesting results: a favorite chef like [ina garten], something very specific like [spicy vegetarian curry with coconut and tofu] or even something obscure like [strange salad].”

Recipe View extracts data embedded in Web pages that is encoded in Google’s rich snippets format. This includes both the W3C Semantic Web standard RDFa and microformats. Google recognizes a simple recipe vocabulary with fourteen properties.
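
Once those properties have been extracted from a page, the faceted narrowing that Recipe View offers is easy to picture. Here is a toy sketch of filtering extracted recipe records by ingredient, cooking time and calories; the records and field names are invented and this is not Google’s implementation.

```python
# Invented recipe records, standing in for properties extracted from pages'
# RDFa or microformat markup.
recipes = [
    {"name": "Strawberry shortcake", "ingredients": {"strawberry", "flour", "cream"},
     "cook_minutes": 30, "calories": 350},
    {"name": "Spicy vegetarian curry", "ingredients": {"coconut", "tofu", "curry paste"},
     "cook_minutes": 45, "calories": 420},
]

def narrow(recipes, ingredient=None, max_minutes=None, max_calories=None):
    """Keep only the recipes that match the requested facets."""
    out = []
    for r in recipes:
        if ingredient and ingredient not in r["ingredients"]:
            continue
        if max_minutes and r["cook_minutes"] > max_minutes:
            continue
        if max_calories and r["calories"] > max_calories:
            continue
        out.append(r)
    return out

print(narrow(recipes, ingredient="strawberry", max_minutes=40))  # only the shortcake survives
```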

This is a great example of the potential of semantic web technology that can be understood and appreciated by anyone with an interest in cooking. Or eating.


Test drive a (free) Google Cr-48 Chrome notebook

December 7th, 2010

Google is looking for some people who “Live on the Web” to take part in a pilot program for their new Cr-48 Chrome notebook. If you are accepted, you get a free Chrome notebook in return for providing feedback.

“We have a limited number of Chrome notebooks to distribute, and we need to ensure that they find good homes. That’s where you come in. Everything is still very much a work in progress, and it’s users, like you, that often give us our best ideas about what feels clunky or what’s missing. So if you live in the United States, are at least 18 years old, and would like to be considered for our small Pilot program, please fill this out. It should take about 15 minutes. We’ll review the requests that come in and contact you if you’ve been selected. This application will be open until 11:59:59 PM PST on December 21, 2010.”

The Cr-48’s features are said to include:

  • 12-inch display
  • Built-in Wi-Fi and 3G service via Verizon
  • Full-sized keyboard w/o function keys
  • Oversized clickable touchpad
  • ~3.8 pounds
  • Solid state hard drive
  • Eight hours of active usage with a week of standby power

See the Google Chrome OS site for details and to apply.


Lisp bots win Planet Wars Google AI Challenge

December 2nd, 2010

[Chart: top programming languages in Planet Wars]

The Google-supported Planet Wars AI Challenge had over 4,000 entries that used AI and game theory to compete against one another. C at the R-Chart blog analyzed the programming languages used by the contestants, with some interesting results.

The usual suspects were the most popular languages used: Java, C++, Python, C# and PHP. The winner, Hungarian Gábor Melis, was just one of 33 contestants who used Lisp. Even less common were entries in C, but the 18 “C hippies” did remarkably well.

Blogger C wonders if Lisp was the special sauce:

Paul Graham has stated that Java was designed for “average” programmers while other languages (like Lisp) are for good programmers. The fact that the winner of the competition wrote in Lisp seems to support this assertion. Or should we see Mr. Melis as an anomaly who happened to use Lisp for this task?


Android to support near field communication

November 15th, 2010

As TechCrunch and others report, Google’s Eric Schmidt announced that the next version of Android (Gingerbread 2.3) will support near field communication. What?

Wikipedia explains that NFC refers to RFID and RFID-like technology commonly used for contactless smart cards, mobile ticketing, and mobile payment systems.

“Near Field Communication, or NFC, is a short-range, high-frequency wireless communication technology which enables the exchange of data between devices over about a 10 centimeter (around 4 inches) distance.”

The next iPhone is rumored to have something similar.

Support for NFC in popular smartphones could unleash lots of interesting applications, many of which have already been explored in research prototypes in labs around the world. One interesting possibility is using NFC to let Android devices share RDF queries and data with other nearby devices.


Recorded Future analyzes streaming Web data to predict the future

October 30th, 2010

Recorded Future is a Boston-based startup, backed by Google and In-Q-Tel, that uses sophisticated linguistic and statistical algorithms to extract time-related information about entities and events from streams of Web data. Its goal is to help clients understand how the relationships between entities and events of interest are changing over time and to make predictions about the future.

[Image: Recorded Future system architecture]

A recent Technology Review article, See the Future with a Search, describes it this way.

“Conventional search engines like Google use links to rank and connect different Web pages. Recorded Future’s software goes a level deeper by analyzing the content of pages to track the “invisible” connections between people, places, and events described online.
   “That makes it possible for me to look for specific patterns, like product releases expected from Apple in the near future, or to identify when a company plans to invest or expand into India,” says Christopher Ahlberg, founder of the Boston-based firm.
   A search for information about drug company Merck, for example, generates a timeline showing not only recent news on earnings but also when various drug trials registered with the website clinicaltrials.gov will end in coming years. Another search revealed when various news outlets predict that Facebook will make its initial public offering.
   That is done using a constantly updated index of what Ahlberg calls “streaming data,” including news articles, filings with government regulators, Twitter updates, and transcripts from earnings calls or political and economic speeches. Recorded Future uses linguistic algorithms to identify specific types of events, such as product releases, mergers, or natural disasters, the date when those events will happen, and related entities such as people, companies, and countries. The tool can also track the sentiment of news coverage about companies, classifying it as either good or bad.”

Pricing for access to their online services and API starts at $149 a month, but there is a free Futures email alert service through which you can get the results of some standing queries on a daily or weekly basis. You can also explore the capabilities they offer through their page on the 2010 US Senate Races.

“Rather than attempt to predict how the races will turn out, we have drawn from our database the momentum, best characterized as online buzz, and sentiment, both positive and negative, associated with the coverage of the 29 candidates in 14 interesting races. This dashboard is meant to give the view of a campaign strategist, as it measures how well a campaign has done in getting the media to speak about the candidate, and whether that coverage has been positive, in comparison to the opponent.”

Their blog reveals some insights on the technology they are using and much more about the business opportunities they see. Clearly the company is leveraging named entity recognition, event recognition and sentiment analysis. Their short White Paper on Temporal Analytics has some details on their overall approach.