UMBC ebiquity
MapReduce compared to parallel SQL databases

Tim Finin, 8:21am 16 April 2009

Here’s an interesting paper that will appear in SIGMOD’09 comparing the MapReduce paradigm to conventional parallel databases. The benchmark study described in the paper showed that the parallel databases ran the benchmark tasks significantly faster, although loading the data into them took longer.

A Comparison of Approaches to Large-Scale Data Analysis, Pavlo, Paulson, Rasin, Abadi, DeWitt, Madden, and Stonebraker.

There is currently considerable enthusiasm around the MapReduce (MR) paradigm for large-scale data analysis. Although the basic control flow of this framework has existed in parallel SQL database management systems (DBMS) for over 20 years, some have called MR a dramatically new computing model. In this paper, we describe and compare both paradigms. Furthermore, we evaluate both kinds of systems in terms of performance and development complexity. To this end, we define a benchmark consisting of a collection of tasks that we have run on an open source version of MR as well as on two parallel DBMSs. For each task, we measure each system’s performance for various degrees of parallelism on a cluster of 100 nodes. Our results reveal some interesting trade-offs. Although the process to load data into and tune the execution of parallel DBMSs took much longer than the MR system, the observed performance of these DBMSs was strikingly better. We speculate about the causes of the dramatic performance difference and consider implementation concepts that future systems should take from both kinds of architectures.

The benchmark details are available so that others can recreate the trials.
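To make the abstract's point about shared control flow concrete, here is a minimal sketch of one grouped aggregation written two ways: as explicit map, shuffle, and reduce steps, and as a SQL GROUP BY. The table name, column names, function names, and toy data are all assumptions made up for illustration; this is not the paper's benchmark code, and it runs in a single process rather than on a cluster.

```python
import sqlite3
from collections import defaultdict

# Hypothetical toy data: (source_ip, ad_revenue) pairs, not taken from the paper.
VISITS = [
    ("10.0.0.1", 5),
    ("10.0.0.2", 3),
    ("10.0.0.1", 2),
    ("10.0.0.3", 7),
    ("10.0.0.2", 1),
]


def map_phase(records):
    """Emit one (key, value) pair per input record."""
    for source_ip, revenue in records:
        yield source_ip, revenue


def shuffle(pairs):
    """Group values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


def revenue_by_ip_mapreduce(records):
    """Run the three phases in sequence on an in-memory list."""
    return reduce_phase(shuffle(map_phase(records)))


def revenue_by_ip_sql(records):
    """Express the same aggregation declaratively with GROUP BY."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE visits (source_ip TEXT, ad_revenue INTEGER)")
    conn.executemany("INSERT INTO visits VALUES (?, ?)", records)
    rows = conn.execute(
        "SELECT source_ip, SUM(ad_revenue) FROM visits GROUP BY source_ip"
    ).fetchall()
    conn.close()
    return dict(rows)
```

Both functions return the same mapping from IP address to total revenue. In the MapReduce version the programmer writes the grouping logic by hand, while in the SQL version the database's query planner decides how to carry it out, which is roughly the development-complexity trade-off the paper evaluates.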

One Response to “MapReduce compared to parallel SQL databases”

  1. vlad Says:

    There was some talk about this paper at yesterday’s Hadoop meeting. It was said that the paper has two major flaws. The first is that the workload selected for the benchmark is of a type that favors the relational database model. A typical MapReduce workload is one where you have to process all of your data, terabytes of it, in batch mode, whereas relational databases are optimized for retrieving relatively small amounts of data very quickly. The second flaw is that the MapReduce code used in the benchmark is not well optimized.

    Also, one has to consider the cost of setting up the processing cluster. The setup described in the paper runs about $100K per terabyte in hardware and licensing costs, whereas a typical Hadoop setup would run about $1K per terabyte.