How Google processes 20 petabytes of data each day
Tim Finin, 9:20am 9 January 2008The latest CACM has an article by Google fellows Jeffrey Dean and Sanjay Ghemawat with interesting details on Google’s text processing engines. Niall Kennedy summarized it this way on his blog post, Google processes over 20 petabytes of data per day.
“Google currently processes over 20 petabytes of data per day through an average of 100,000 MapReduce jobs spread across its massive computing clusters. The average MapReduce job ran across approximately 400 machines in September 2007, crunching approximately 11,000 machine years in a single month.”
If big numbers numb your mind, 20 petabytes is 20,000,000,000,000,000 bytes (or 22,517,998,136,852,480 for the obsessive-compulsives among us) — enough data to fill up over five million 4G ipods a day, which, if laid end to end would …
Kevin Burton has a copy of the paper on his blog.
Jeffrey Dean and Sanjay Ghemawat, MapReduce: simplified data processing on large clusters, Communications of the ACM, pp 107-113, 51:1, January 2008.
MapReduce is a programming model and an associated implementation for processing and generating large datasets that is amenable to a broad variety of real-world tasks. Users specify the computation in terms of a map and a reduce function, and the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks. Programmers find the system easy to use: more than ten thousand distinct MapReduce programs have been implemented internally at Google over the past four years, and an average of one hundred thousand MapReduce jobs are executed on Google’s clusters every day, processing a total of more than twenty petabytes of data per day.
Dean and Ghemawat conclude their paper by summarizing the key reasons why MapReduce has worked so well for Google.
“First, the model is easy to use, even for programmers without experience with parallel and distributed systems, since it hides the details of parallelization, fault tolerance, locality optimization, and load balancing. Second, a large variety of problems are easily expressible as MapReduce computations. For example, MapReduce is used for the generation of data for Google’s production Web search service, for sorting, data mining, machine learning, and many other systems. Third, we have developed an implementation of MapReduce that scales to large clusters of machines comprising thousands of machines. The implementation makes efficient use of these machine resources and therefore is suitable for use on many of the large computational problems encountered at Google.”

January 9th, 2008 at 12:21 pm
>> If big numbers numb your mind, 20 petabytes is 20,000,000,000,000,000 bytes (or 22,517,998,136,852,480 for the obsessive-compulsives among us)
Why would it be 2^51 * 10? Peta means 10^15 or 1 000 000 000 000 000 so 20 000 000 000 000 000 is correct.
Nonetheless, very impressive.
January 9th, 2008 at 6:13 pm
I have been looking into MapReduce lately and I found Hadoop, which is an open source implementation lucene.apache.org/hadoop/
There is also an interesting article on how to run MapReduce using Amazon EC2
http://wiki.apache.org/lucene-hadoop/AmazonEC2
February 18th, 2008 at 6:52 pm
@Simon: A petabyte can be described as 10^15 bytes or 2^50 bytes, depending on your standard… so obviously 10*2^51 = 20 * 2^50 = 20 petabytes in that particular notation.
Some light reading on the topic:
http://en.wikipedia.org/wiki/Binary_prefix
http://en.wikipedia.org/wiki/Petabyte
And this is how I finally realised why my new 4GB iPod only holds 3.73GB according to Windows… Apple describes it in base 10 (so 4 * 10^9 bytes) and Windows describes it in base 2 (about 3.73 * 2^30). I feel cheated
March 2nd, 2008 at 4:55 pm
the wasted electricity to run these amusements must cost millions, when will people come to terms with efficiency. If the internet becomes the new TV, will it be as mindless, how can it not. Consumption of mass media includes infotainment, music and visual popular culture. Use sparingly, consume less and do more. Afterall everything is an abstraction from something, find the source.
February 26th, 2009 at 12:51 pm
I just read the other day that Google is starting to keep their indexing in RAM, which is stored across 1,000 servers. That is insane.
They are definitely doing some crazy stuff over there.
February 26th, 2009 at 6:54 pm
4GB is actually correct. 4GB = 4000MB = 4000000KB.
However, Windows (and other operating systems) use base 2 since computers use base 2. Why HDD manufacturers use the SI system while everybody else uses the binary system is beyond me. Perhaps it is due to how HDDs are designed?
I’m not big on hardware though, so I’m not sure if pendrives, SD cards, flash cards and other such media as well as RAM use the SI system or the binary system.
Still, the part of ISO/IEC 80000-13 dealing with binary prefixes may be standardized, so I wonder if we should start using the binary prefixes now… After all, IPv6 hasn’t really caught on, despite the fact that it is such a wonderful solution. Maybe binary prefixes will go the same way. Plenty of people who implement the technology, but few people who actually adopt the usage of it…
As for the number “20 petabytes”, you really must wonder whether it was 20 “petabytes” or 20 “pebibytes”.
April 23rd, 2009 at 8:12 pm
ISO/IEC 80000-13:2008 has been adopted by law in some countries. Therefore,
20 PB (petabytes) = 20,000,000,000,000,000 bytes
20 PiB (pebibytes) = 22,517,998,136,852,480 bytes
If you feel the difference is not worth discussing, maybe you’d like to loan the excess out of your bandwidth?
April 23rd, 2009 at 8:14 pm
By the way… a byte is not necessarily eight bites (IEEE 1541-2002). We should be talking about octets! LOL!
April 23rd, 2009 at 8:15 pm
correction: bits
October 12th, 2009 at 5:14 pm
[...] updates energy info every 15 minutes can deliver 400 MB per smart meter per year. For perspective, Google processes about 20 PB of data per day, and 1 PB is equivalent to the amount of data contained in 20 million [...]