UMBC ebiquity

Correlation Aware Optimizations for Analytic Databases

Speaker: Hideaki Kimura

Start: Friday, March 09, 2012, 01:00PM

End: Friday, March 09, 2012, 02:00PM

Location: 325b ITE, UMBC

Abstract:

Recent years have seen that the analysis of large data-sets is crucially important in a wide range of business, governmental, and scientific applications. For example, research projects in astronomy need to analyze petabytes of image data taken from telescopes. Providing a fast and scalable analytical data management system for such users has become increasingly important.

The major bottlenecks for analytics on such big data are disk- and network-I/O. Because the data is too large to fit in RAM, each query causes substantial disk I/O. Traditional database systems provide indexes to speed up disk reads, but many analytic queries do not benefit from indexes because data is scattered over a large number of disk blocks and disk seeks are prohibitively expensive. Furthermore, such huge data sets need to be partitioned and distributed over hundreds or many thousands of nodes. When a query requires more than one data at once, such as a query involving a JOIN operation, the data management system must transmit a large amount of data over the network. For example, the Shuffle phase in Map-Reduce systems copies file blocks over the network and causes a significant bottleneck in many cases.

Our approach to tackling these challenges in big data analytics is to exploit correlations. I will describe our correlation-aware indexing, replication, and data placement which make big data analytics faster and more scalable.

Finally, if time allows, I will also introduce another on-going project to develop a scalable transactional processing system on modern hardware in collaboration with Hewlett-Packard Laboratories.

Web Site: http://www.csee.umbc.edu/2012/03/talk-correlation-aware-optimizations-for-analytic-databases/

Tags: database, big data, analytics

Host: Tim Finin

,