The Web has become the repository of most the world’s pubic knowledge. Almost all of it is still bound up in text, images, audio and video, which are easy for people to understand but less accessible for machines. While the computer interpretation of visual and audio information is still challenging, text is within reach. The Web’s infrastructure makes access to all this information trivial, opening up tremendous opportunities to mine text to extract information that can be republished in a more structured representation (e.g., RDF, databases) or used by machine learning systems to discover new knowledge. Current technologies for human language understanding are far from perfect, but can harvest the low hanging fruit and are constantly improving. All that’s needed is an Internet connection and cycles — lots of them.
Business Week had an article a few weeks ago, The Two Flavors of Google, that touches on some of the recent developments, including Hadoop and IBM and Google’s university cloud computing program. Hadoop is the produce of an Apache Lucene project that provides a Java-based software framework to distribute processing over a cluster of processors. The BW article notes
“Cutting, a 44-year-old search veteran, started developing Hadoop 18 months ago while running the nonprofit Nutch Foundation. After he later joined Yahoo, he says, the Hadoop project (named after his son’s stuffed elephant) was just “sitting in the corner.” But in short order, Yahoo saw Hadoop as a tool to enhance the operations of its own search engine and to power its own computing clouds.” (source)
and adds this significant anecdote
“In early November, for example, the tech team at The New York Times rented computing power on Amazon’s cloud and used Hadoop to convert 11 million archived articles, dating back to 1851, to digital and searchable documents. They turned around in a single day a job that otherwise would have taken months.” (source)
The NYT’s Derek Gottfrid described he process in some detail in a post on the NTY Open blog, Self-service, Prorated Super Computing Fun!.
The Hadoop Quickstart page describes how to run it on a single node, enabling any high school geek who knows Java and has a laptop to try it out before finding (or renting) time on a cluster. This is just what we need in for several upcoming projects and I am looking forward to trying it out soon. One requires processing the 1M documents in the Trec 8 collection and another the 10K documents in ACE 2008 collection.