Correlation Aware Optimizations for Analytic Databases
Friday, March 9, 2012, 13:00pm - Friday, March 9, 2012, 14:00pm
325b ITE, UMBC
Recent years have seen that the analysis of large data-sets is crucially important in a wide range of business, governmental, and scientific applications. For example, research projects in astronomy need to analyze petabytes of image data taken from telescopes. Providing a fast and scalable analytical data management system for such users has become increasingly important.
The major bottlenecks for analytics on such big data are disk- and network-I/O. Because the data is too large to fit in RAM, each query causes substantial disk I/O. Traditional database systems provide indexes to speed up disk reads, but many analytic queries do not benefit from indexes because data is scattered over a large number of disk blocks and disk seeks are prohibitively expensive. Furthermore, such huge data sets need to be partitioned and distributed over hundreds or many thousands of nodes. When a query requires more than one data at once, such as a query involving a JOIN operation, the data management system must transmit a large amount of data over the network. For example, the Shuffle phase in Map-Reduce systems copies file blocks over the network and causes a significant bottleneck in many cases.
Our approach to tackling these challenges in big data analytics is to exploit correlations. I will describe our correlation-aware indexing, replication, and data placement which make big data analytics faster and more scalable.
Finally, if time allows, I will also introduce another on-going project to develop a scalable transactional processing system on modern hardware in collaboration with Hewlett-Packard Laboratories.