IEEE International Conference on Big Data
A database-based distributed computation architecture with Accumulo and D4M: An application of eigensolver for large sparse matrix
November 30, 2015
NoSQL distributed databases have been devised to tackle the challenges resulting from volume, velocity and variety of big data. Graph representation of datasets requires efficient distributed linear algebra operations for large sparse matrix constructed from big data. Storing the transformed matrix into the database not only speeds up the big data analysis process but also facilitates the computation because of indexing. The Hadoop based approach does not natively support iterative algorithms due to data shuffling during each iteration. This paper presents a novel database-based distributed computation architecture bridging the gap between Hadoop and HPC. The novelty results from exploring the indexing capability of D4M (Dynamic Distributed Dimensional Data Model) to support linear algebra operations in a distributed computation environment. The idea is to store input data and intermediate results in associative array format inside Accumulo table to facilitate the data sharing among working nodes. pMatlab is deployed as the parallel computation engine. Our proposed architecture is proved to be lighter, easier and faster than MapReduce based approach. One example application is calculating top k eigenvalues and eigenvectors for large sparse matrix. Experiments on Graph500 benchmark datasets demonstrate 2X speedup of our architecture as compared to HEIGEN (An eigensolver for billion-scale matrices using MapReduce).
Downloads: 509 downloads