WWGD: Understanding Google’s Technology Stack

December 24th, 2008

It’s popular to ask “What Would Google Do” these days; The Google reports over 7,000 results for the phrase. Of course, it’s not just about Google, which we all use as the archetype for a new, Web-centered way of building and thinking about information systems. Asking WWGD can be productive, but only if we know how to implement and exploit the insights the answer gives us. This in turn requires us (well, some of us, anyway) to understand the algorithms, techniques, and software technology that Google and other large-scale Web-oriented companies use. We need to ask “How Would Google Do It?”

Michael Nielsen has a nice post on using your laptop to compute PageRank for millions of webpages. His post reviews PageRank and how to compute it, and shows a short but reasonably efficient Python program that can easily handle a graph with a few million nodes. While that is not sufficient for the Web itself, there are lots of interesting and significant graphs this small Python program can handle: Wikipedia pages, DBLP publications, RDF namespaces, BGP routers, Twitter followers, etc.
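To make the core algorithm concrete, here is a minimal sketch of PageRank computed by power iteration. This is my own illustration, not Nielsen’s program: the damping factor (s = 0.85), the convergence tolerance, and the toy three-page graph are conventional, made-up choices.

    # PageRank by power iteration: repeatedly redistribute rank over
    # out-links, mixing in a uniform "teleportation" term with weight 1-s.
    import numpy as np

    def pagerank(links, s=0.85, tol=1e-8, max_iter=100):
        """links maps every node to the list of nodes it links to."""
        nodes = sorted(links)
        index = {node: i for i, node in enumerate(nodes)}
        n = len(nodes)
        p = np.full(n, 1.0 / n)              # start from the uniform distribution
        for _ in range(max_iter):
            new = np.full(n, (1.0 - s) / n)  # teleportation term
            for node, outs in links.items():
                if outs:                     # spread this node's rank over its out-links
                    share = s * p[index[node]] / len(outs)
                    for out in outs:
                        new[index[out]] += share
                else:                        # dangling node: spread its rank everywhere
                    new += s * p[index[node]] / n
            done = np.abs(new - p).sum() < tol
            p = new
            if done:                         # stop once the ranks settle
                break
        return dict(zip(nodes, p))

    # Toy graph: 'b' is linked to by both other pages, so it ranks highest.
    print(pagerank({'a': ['b'], 'b': ['a', 'c'], 'c': ['b']}))

For a graph with a few million nodes you would replace the inner Python loops with a sparse matrix-vector product, but the fixed point being computed is the same.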

The post is part of a series Nielsen is writing on the Google Technology Stack, covering PageRank, MapReduce, BigTable, and GFS. The posts are a byproduct of a series of weekly lectures he began giving earlier this month in Waterloo. Here’s how Nielsen describes the series.

“Part of what makes Google such an amazing engine of innovation is their internal technology stack: a set of powerful proprietary technologies that makes it easy for Google developers to generate and process enormous quantities of data. According to a senior Microsoft developer who moved to Google, Googlers work and think at a higher level of abstraction than do developers at many other companies, including Microsoft: “Google uses Bayesian filtering the way Microsoft uses the if statement” (Credit: Joel Spolsky). This series of posts describes some of the technologies that make this high level of abstraction possible.”
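The quip is worth unpacking: a hand-coded rule is a brittle if statement, while a Bayesian filter learns its decision from labeled data. As a generic illustration (a textbook naive Bayes sketch with made-up training data, not anything from Google or from Nielsen’s series):

    # Naive Bayes text classification with add-one smoothing: the
    # "Bayesian filtering" alternative to a hard-coded keyword rule.
    from collections import Counter
    import math

    def train(examples):
        """examples: list of (word_list, label) pairs, label in {'spam', 'ham'}."""
        counts = {'spam': Counter(), 'ham': Counter()}
        totals = Counter()
        for words, label in examples:
            counts[label].update(words)
            totals[label] += 1
        return counts, totals

    def classify(words, counts, totals):
        vocab = set(counts['spam']) | set(counts['ham'])
        scores = {}
        for label in ('spam', 'ham'):
            score = math.log(totals[label] / sum(totals.values()))  # class prior
            denom = sum(counts[label].values()) + len(vocab)
            for w in words:  # add-one smoothing avoids zero probabilities
                score += math.log((counts[label][w] + 1) / denom)
            scores[label] = score
        return max(scores, key=scores.get)

    examples = [(['cheap', 'pills'], 'spam'), (['lunch', 'tomorrow'], 'ham'),
                (['cheap', 'flights'], 'spam'), (['meeting', 'tomorrow'], 'ham')]
    counts, totals = train(examples)
    print(classify(['cheap', 'pills', 'tomorrow'], counts, totals))  # -> spam

The point of the quote is that the learned version adapts as the data changes, while the if statement must be rewritten by hand.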

Videos of the first two lectures, Introduction to PageRank and Building our PageRank Intuition, are available online. Nielsen illustrates the concepts and algorithms with well-written Python code and provides exercises to help readers master the material, as well as “more challenging and often open-ended problems” that he has worked on but not completely solved.

Nielsen was trained as a theoretical physicist but has shifted his attention to “the development of new tools for scientific collaboration and publication”. As far as I can see, he is offering these free public lectures out of a desire to share his knowledge and also to help (or maybe force) himself to deepen his own understanding of the topics and develop better ways of explaining them. Either way, it’s an admirable and inspiring example for us all, and one appropriate for the holiday season. Merry Christmas!

Journal requires authors to include Wikipedia article with submissions

December 18th, 2008

Scientific journals are undergoing rapid evolution as they adapt to the Web and various forms of social media. As reported by Nature (Publish in Wikipedia or perish) and in ReadWriteWeb, the journal RNA Biology is experimenting with a connection to Wikipedia. Articles submitted for publication about new RNA molecules must also include a draft Wikipedia page that summarizes the work. The journal will then peer review the page before publishing it in Wikipedia.

Here are the guidelines from the RNA Biology site:

“To be eligible for publication the Supplementary Material must contain: (1) a link to a Wikipedia article preferably in a User’s space. Upon acceptance this can easily be moved into Wikipedia itself together with a reference to the published article.

At least one stub article (essentially an extended abstract) for the paper should be added to either an author’s userspace at Wikipedia (preferred route) or added directly to the main Wikipedia space (be sure to add literature references to avoid speedy deletion). This article will be reviewed alongside the manuscript and may require revision before acceptance. Upon acceptance the former articles can easily be exported to the main Wikipedia space. See below for guidelines on how to do this. Existing articles can be updated in accordance with the latest published results.”

This is definitely an interesting and forward-looking idea. Yet I can’t help having the cynical thought that it’s also a great way for the journal to boost its PageRank.