A post on Microsoft’s Bing blog, Understand Your World with Bing, announced that an update to its Satori knowledge base allows Bing to do a better job of identifying queries that mention a known entity, i.e., a person, place, or organization. Bing’s use of Satori parallels the efforts of Google and Facebook in developing graph-based knowledge bases to move from “strings” to “things”.
Microsoft is using data from Satori to provide “snapshots” with data about an entity when it detects a likely mention of it in a query. This is very similar to how Google is using its Knowledge Graph KB.
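As a toy illustration of the idea (not Bing’s or Google’s actual pipeline; the entities and facts below are invented for the example), an entity “snapshot” is essentially a lookup from a detected mention in the query to structured facts about the thing mentioned:

```python
# Toy sketch of entity "snapshot" lookup -- not Satori's or the Knowledge
# Graph's implementation; the knowledge base contents are illustrative only.
KNOWLEDGE_BASE = {
    "eiffel tower": {"type": "Place", "city": "Paris", "height_m": 330},
    "marie curie": {"type": "Person", "born": 1867, "field": "physics"},
}

def snapshot(query: str):
    """Return structured facts for the first known entity mentioned in query."""
    q = query.lower()
    for name, facts in KNOWLEDGE_BASE.items():
        if name in q:
            return {"entity": name, **facts}
    return None  # no known entity detected

print(snapshot("how tall is the Eiffel Tower"))
# -> {'entity': 'eiffel tower', 'type': 'Place', 'city': 'Paris', 'height_m': 330}
```

The real systems, of course, do statistical entity linking and disambiguation rather than substring matching, but the move from matching “strings” to returning facts about “things” is the same.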
One interesting thing that Satori is now doing is importing data from LinkedIn — data that neither Google’s Knowledge Graph nor Facebook’s Open Graph has. Another difference is that Satori uses RDF as its native model, or at least appears to, based on this description from 2012.
Google, Bing and Yahoo! are cooperating on an approach to representing structured data in Web pages via the launch of schema.org. The approach is based on microdata, and the schema.org site documents the schemas that are currently supported.
“This site provides a collection of schemas, i.e., html tags, that webmasters can use to markup their pages in ways recognized by major search providers. Search engines including Bing, Google and Yahoo! rely on this markup to improve the display of search results, making it easier for people to find the right web pages. Many sites are generated from structured data, which is often stored in databases. When this data is formatted into HTML, it becomes very difficult to recover the original structured data. Many applications, especially search engines, can benefit greatly from direct access to this structured data. On-page markup enables search engines to understand the information on web pages and provide richer search results in order to make it easier for users to find relevant information on the web. Markup can also enable new tools and applications that make use of the structure. A shared markup vocabulary makes it easier for webmasters to decide on a markup schema and get the maximum benefit for their efforts. So, in the spirit of sitemaps.org, Bing, Google and Yahoo! have come together to provide a shared collection of schemas that webmasters can use.”
That’s the good news. The bad news, or at least the less good news, is that it is based on microdata and not RDFa. Microdata is a relatively new way to embed semantic information in HTML, designed to be part of the HTML5 suite. It is less expressive than RDFa but also simpler. Its main advantage over microformats is that it is extensible: you can define new semantic vocabulary terms. Here is how the three companies described the choice.
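To make the markup concrete, here is a minimal microdata snippet using schema.org’s Person and Organization types (the person and company names are made up; the itemtype URLs and property names come from schema.org):

```html
<!-- schema.org microdata: itemscope/itemtype declare an entity,
     itemprop marks its properties; nested itemscopes link entities -->
<div itemscope itemtype="http://schema.org/Person">
  <span itemprop="name">Jane Doe</span> works as a
  <span itemprop="jobTitle">software engineer</span> at
  <span itemprop="affiliation" itemscope itemtype="http://schema.org/Organization">
    <span itemprop="name">Example Corp</span>
  </span>.
</div>
```

The extensibility comes from the itemtype URL: anyone can mint a new type and its properties, which is exactly what microformats’ fixed class vocabularies don’t allow.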
Google: “Historically, we’ve supported three different standards for structured data markup: microdata, microformats, and RDFa. We’ve decided to focus on just one format for schema.org to create a simpler story for webmasters and to improve consistency across search engines relying on the data.”
Yahoo!: “Today’s announcement offers tremendous opportunity for growth. In addition to consolidating the schemas for the vocabularies we already support, there are schemas for more than a hundred newly created categories including movies, music, organizations, TV shows, products, places and more. We will continue to expand these categories by listening to feedback from the community and will continue publishing new schemas on a regular basis. Don’t worry if your site has already added RDFa or microformats currently supported by our Enhanced Displays program, that site will still appear with an Enhanced Display on Yahoo! – no changes required.”
Bing: “At Bing we understand the significant investment required to implement markup, and feel strongly that by partnering with Google and Yahoo! on standard schemas webmasters can be more efficient with the time they invest… Bing accepts a wide variety of markup formats today (Open Graph, microformat, etc.) for features like Tiles and will continue to do so, but by standardizing on schema.org we are looking to simplify the markup choices for webmasters and amplify the value they receive in return.”
The schema.org site has a FAQ that includes the question “Q: Why microdata? Why not RDFa or microformats?”, which is answered as follows:
“Focusing on microdata was a pragmatic decision. Supporting multiple syntaxes makes documentation for webmasters more complex and introduces more overhead in terms of defining new formats. Microformats are concise and easy to understand, but they don’t offer an open extensibility mechanism and the reuse of the class tag can cause conflicts with website CSS. RDFa is extensible and very expressive, but the substantial complexity of the language has contributed to slower adoption. Microdata is the most recent well-known standard, created along with HTML5. It strikes a balance between extensibility and simplicity, and is most suitable for building the schema.org. Google and Yahoo! have in the past supported both microformats and RDFa for certain schemas and will continue to support these syntaxes for those schemas. We will also be monitoring the web for RDFa and microformats adoption and if they pick up, we will look into supporting these syntaxes. Also read the section on the data model for more on RDFa.”
Guha has a generous comment in his post on the official Google blog:
“While this collaborative initiative is new, we draw heavily from the decades of work in the database and knowledge representation communities, from projects such as Jim Gray’s SDSS Skyserver, Cyc and from ongoing efforts such as dbpedia.org and linked data. We feel privileged to build upon this great work. We look forward to seeing structured markup continue to grow on the web, powering richer search results and new kinds of applications.”
I’ve not studied microdata yet, so I don’t know how I feel about the expressiveness/simplicity tradeoffs it has made. I wonder, for example, if it is possible to add an OWL-like layer on top of microdata.
Recorded Future is a Boston-based startup, backed by Google and In-Q-Tel, that uses sophisticated linguistic and statistical algorithms to extract time-related information from streams of Web data about entities and events. Their goal is to help their clients understand how the relationships between entities and events of interest are changing over time and to make predictions about the future.
“Conventional search engines like Google use links to rank and connect different Web pages. Recorded Future’s software goes a level deeper by analyzing the content of pages to track the “invisible” connections between people, places, and events described online.
“That makes it possible for me to look for specific patterns, like product releases expected from Apple in the near future, or to identify when a company plans to invest or expand into India,” says Christopher Ahlberg, founder of the Boston-based firm.
A search for information about drug company Merck, for example, generates a timeline showing not only recent news on earnings but also when various drug trials registered with the website clinicaltrials.gov will end in coming years. Another search revealed when various news outlets predict that Facebook will make its initial public offering.
That is done using a constantly updated index of what Ahlberg calls “streaming data,” including news articles, filings with government regulators, Twitter updates, and transcripts from earnings calls or political and economic speeches. Recorded Future uses linguistic algorithms to identify specific types of events, such as product releases, mergers, or natural disasters, the date when those events will happen, and related entities such as people, companies, and countries. The tool can also track the sentiment of news coverage about companies, classifying it as either good or bad.”
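The article doesn’t describe Recorded Future’s algorithms, but the basic idea of pulling event types, dates, and sentiment out of text can be caricatured with a few regular expressions (the patterns and sentiment lexicons below are toy examples, nothing like the company’s real system):

```python
import re

# Crude sketch of temporal/event/sentiment extraction -- illustrative only.
EVENT_PATTERNS = {
    "ipo": re.compile(r"\b(initial public offering|IPO)\b", re.I),
    "product_release": re.compile(r"\brelease[sd]?\b", re.I),
}
DATE_PATTERN = re.compile(r"\b(?:in|on|by)\s+((?:Q[1-4]\s+)?\d{4})\b")
GOOD = {"growth", "record", "strong"}
BAD = {"loss", "recall", "lawsuit"}

def analyze(sentence: str):
    """Tag a sentence with event types, future dates, and a sentiment score."""
    events = [name for name, pat in EVENT_PATTERNS.items() if pat.search(sentence)]
    dates = DATE_PATTERN.findall(sentence)
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    sentiment = len(words & GOOD) - len(words & BAD)
    return {"events": events, "dates": dates, "sentiment": sentiment}

print(analyze("Analysts expect Facebook's initial public offering in 2012 amid strong growth."))
# -> {'events': ['ipo'], 'dates': ['2012'], 'sentiment': 2}
```

Doing this robustly over news articles, regulatory filings, tweets and transcripts, and resolving relative dates (“next quarter”) against publication dates, is where the hard NLP work lies.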
Pricing for access to their online services and API starts at $149 a month, but there is a free Futures email alert service through which you can get the results of some standing queries on a daily or weekly basis. You can also explore the capabilities they offer through their page on the 2010 US Senate Races.
“Rather than attempt to predict how the races will turn out, we have drawn from our database the momentum, best characterized as online buzz, and sentiment, both positive and negative, associated with the coverage of the 29 candidates in 14 interesting races. This dashboard is meant to give the view of a campaign strategist, as it measures how well a campaign has done in getting the media to speak about the candidate, and whether that coverage has been positive, in comparison to the opponent.”
Their blog reveals some insights into the technology they are using and much more about the business opportunities they see. Clearly the company is leveraging named entity recognition, event recognition and sentiment analysis. Their short White Paper on Temporal Analytics has some details on their overall approach.
Microsoft’s Bing team announced on their blog that the Bing search engine is “powering Yahoo!’s search results” in the US and Canada for English queries. Yahoo also has a post on their Yahoo! Search Blog.
“Tuesday, nearly 13 months after Yahoo and Microsoft announced plans to collaborate on Internet search in hopes of challenging Google’s market dominance, the two companies announced that the results of all Yahoo English language searches made in the United States and Canada are coming from Microsoft’s Bing search engine. The two companies are still racing to complete the transition of paid search, the text advertising links that run beside and above the standard search results, before the make-or-break holiday period — a much more difficult task.”
Combining the traffic from Microsoft and Yahoo will give Bing a more significant share of the Web search market. That should help by providing both companies with a larger stream of search-related data that can be exploited to improve search relevance, ad placement and trend spotting. It will also help foster competition with Google focused on developing better search technology.
Hopefully, Bing will be able to benefit from the good work done at Yahoo! on adding more semantics to Web search.
Google announced today that it has acquired Metaweb, the company behind Freebase — a free, semantic database of “over 12 million people, places, and things in the world.” This is from their announcement on the Official Google blog:
“Over time we’ve improved search by deepening our understanding of queries and web pages. The web isn’t merely words — it’s information about things in the real world, and understanding the relationships between real-world entities can help us deliver relevant information more quickly. … With efforts like rich snippets and the search answers feature, we’re just beginning to apply our understanding of the web to make search better. Type [barack obama birthday] in the search box and see the answer right at the top of the page. Or search for [events in San Jose] and see a list of specific events and dates. We can offer this kind of experience because we understand facts about real people and real events out in the world. But what about [colleges on the west coast with tuition under $30,000] or [actors over 40 who have won at least one oscar]? These are hard questions, and we’ve acquired Metaweb because we believe working together we’ll be able to provide better answers.”
In their announcement, Google promises to continue to maintain Freebase “as a free and open database for the world” and invites other web companies to use and contribute to it.
Freebase is a system very much in the linked open data spirit, even though RDF is not its native representation. Its content is available as RDF and there are many links that bind it to the LOD cloud. Moreover, Freebase has a very good wiki-like interface that allows people to upload, extend and edit both its schema and data.
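Because the content is exposed as RDF, it can in principle be queried with standard Semantic Web tooling. A hypothetical SPARQL query over such an export might look like the following (the prefix and property names here are illustrative, not Freebase’s actual vocabulary):

```sparql
# Hypothetical query: find people and their birth dates in an RDF export.
# The ex: prefix and property names are made up for illustration.
PREFIX ex: <http://example.org/freebase/>

SELECT ?person ?birthDate
WHERE {
  ?person a ex:Person ;
          ex:birthDate ?birthDate .
}
LIMIT 10
```

This interoperability with SPARQL and the LOD cloud is what makes the acquisition interesting from a Semantic Web perspective, whatever Metaweb’s internal representation is.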
Here’s a video on the concepts behind Metaweb, which are, of course, also those underlying the Semantic Web. What’s the difference? I’d say a combination of representational details and centralized (Metaweb) vs. distributed (Semantic Web) approaches.
Yong Yu and Rudi Studer are editing a special issue of the Journal of Web Semantics on semantic search that will appear in the summer of 2010. The special issue will cover interdisciplinary topics between the Semantic Web and search. See the call for papers for a list of relevant topics and details on how to submit papers, which are due by 20 January 2010.
“IDGNS: What’s the status of semantic search at Google? You have said in the past that through “brute force” — analyzing massive amounts of queries and Web content — Google’s engine can deliver results that make it seem as if it understood things semantically, when it really functions using other algorithmic approaches. Is that still the preferred approach?
Mayer: We believe in building intelligent systems that learn off of data in an automated way, [and then] tuning and refining them. When people talk about semantic search and the semantic Web, they usually mean something that is very manual, with maps of various associations between words and things like that. We think you can get to a much better level of understanding through pattern-matching data, building large-scale systems. That’s how the brain works. That’s why you have all these fuzzy connections, because the brain is constantly processing lots and lots of data all the time.
IDGNS: A couple of years ago or so, some experts were predicting that semantic technology would revolutionize search and blindside Google, but that hasn’t happened. It seems that semantic search efforts have hit a wall, especially because semantic engines are hard to scale.
Mayer: The problem is that language changes. Web pages change. How people express themselves changes. And all those things matter in terms of how well semantic search applies. That’s why it’s better to have an approach that’s based on machine learning and that changes, iterates and responds to the data. That’s a more robust approach. That’s not to say that semantic search has no part in search. It’s just that for us, we really prefer to focus on things that can scale. If we could come up with a semantic search solution that could scale, we would be very excited about that. For now, what we’re seeing is that a lot of our methods approximate the intelligence of semantic search but do it through other means.”
I interpret these comments to mean that Google’s management still views the concept of semantic search (and the Semantic Web) as involving better understanding of the intended meaning of text in documents and queries. The W3C’s web of data model is still not on their radar.
Yong Yu and Rudi Studer are editing a special issue of the Journal of Web Semantics on Semantic Search that will appear in the summer of 2010. Papers are due 20 January 2010 and decisions will be sent two months later. Relevant topics include:
Information retrieval tasks on the Semantic Web
Incentives and interaction paradigms for resource annotation
Interaction paradigms for semantic search
Semantic technologies for query interpretation, refinement and routing
Modeling expressive resource descriptions
Natural language processing and information extraction for the acquisition of resource descriptions
Scalable repositories and infrastructures for semantic search
Crawling, storing and indexing of expressive resource descriptions
Fusion of semantic search results on the Semantic Web
Algorithms for matching expressive queries and resource descriptions
Algorithms and procedures to deal with vagueness, incompleteness and inconsistencies in semantic search
Evaluation methodologies for semantic search
Standard datasets and benchmarks for semantic search
Who’s got the best basic web search engine? One way to approach that question is to conduct an experiment in which subjects rank the results returned by several engines without knowing which is which.
BlindSearch is a simple and neat site that collects ‘objective’ opinions on search quality by showing query results from Google, Yahoo and Bing side by side without identifying which is which and inviting you to select the best.
“Type in a search query above, hit search then vote for the column which you believe best matches your query. The columns are randomised with every query.
The goal of this site is simple, we want to see what happens when you remove the branding from search engines. How differently will you perceive the results?”
As of this writing there have been 1679 votes for preferred results, with Google getting 39%, Bing 39% and Yahoo 22%.
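The blind-test mechanics are simple to sketch: shuffle the three result columns so the branding is hidden, record which column each user prefers, and tally the votes. A minimal sketch (not BlindSearch’s actual code):

```python
import random
from collections import Counter

# Toy sketch of a BlindSearch-style blind comparison -- not the site's code.
ENGINES = ["Google", "Bing", "Yahoo"]

def present_columns(rng: random.Random):
    """Return the engines in a random column order, hiding the branding."""
    order = ENGINES[:]
    rng.shuffle(order)
    return order

def tally(votes):
    """votes: list of engine names users preferred; returns percentages."""
    counts = Counter(votes)
    total = len(votes)
    return {engine: round(100 * counts[engine] / total) for engine in ENGINES}

# Simulated votes matching the reported split
votes = ["Google"] * 39 + ["Bing"] * 39 + ["Yahoo"] * 22
print(tally(votes))  # -> {'Google': 39, 'Bing': 39, 'Yahoo': 22}
```

Randomizing the column order per query is the important part: it prevents position bias from contaminating the brand-free comparison.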
How’s this for truth in advertising? The Chromium blog announces early versions of Google Chrome for Mac OS X and Linux, but warns people not to try them in a post titled Danger: Mac and Linux builds available.
“In order to get more feedback from developers, we have early developer channel versions of Google Chrome for Mac OS X and Linux, but whatever you do, please DON’T DOWNLOAD THEM! Unless of course you are a developer or take great pleasure in incomplete, unpredictable, and potentially crashing software. How incomplete? So incomplete that, among other things, you won’t yet be able to view YouTube videos, change your privacy settings, set your default search provider, or even print.”
Of course, they know that this will make trying them irresistible to some of us. If that includes you, go get the Mac or Linux version.
Microsoft’s new Bing search engine is getting a lot of interest. Glenn McDonald posts about a nice side-by-side Bing vs. Google comparator that he developed. It makes it easy to compare how the two services do on a range of different types of searches. Here are the ones that Glenn said he found useful in developing his initial opinion.
I sense from some of these queries that he is probing the systems where an advanced search engine can exploit a little bit of semantic knowledge. For example, recognizing that a user’s query “boston to asheville” matches a common pattern “<place> to <place>” suggests that she is probably interested in information about how to travel from the first location to the second. It seems like Google has been working on adding more such patterns, at least for the low-hanging fruit.
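This kind of query-pattern matching can be sketched with a single regular expression (a toy illustration, not Google’s or Bing’s actual query parser):

```python
import re

# Toy travel-intent pattern: "<place> to <place>" -- illustrative only.
TRAVEL = re.compile(r"^\s*([a-z .]+?)\s+to\s+([a-z .]+?)\s*$", re.I)

def parse_travel_query(query: str):
    """Return a travel interpretation of the query, or None if it doesn't fit."""
    m = TRAVEL.match(query)
    if not m:
        return None
    return {"intent": "travel", "from": m.group(1), "to": m.group(2)}

print(parse_travel_query("boston to asheville"))
# -> {'intent': 'travel', 'from': 'boston', 'to': 'asheville'}
print(parse_travel_query("semantic web"))  # -> None
```

The hard part, of course, is not the pattern but knowing that “boston” and “asheville” are places and not, say, band names, which is where the entity knowledge bases discussed above come in.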
Of course, if everyone hits on this site it may get throttled or blocked by either or both of the search engines. @Glenn — would you be willing to share your code?