Nimbus: Scalable, Distributed, In-Memory Data Storage

Thursday, June 6, 2013, 10:30am - 3:00pm

325b ITE, UMBC

MS Defense

The Apache Hadoop project provides a framework for reliable, scalable, distributed computing. The storage layer of Hadoop, the Hadoop Distributed File System (HDFS), is an append-only distributed file system designed for commodity hardware. Its append-only nature limits applications' ability to perform random reads and writes of data. Apache HBase and Apache Accumulo address this limitation, both providing fast random access to a highly scalable key/value store.
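To illustrate the random-access model these stores provide, here is a minimal sketch using the HBase client API; the table name, column family, and values are assumptions made for illustration:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class RandomAccessExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("hot_data"))) {
            // Random write: update a single cell, addressed by row key.
            Put put = new Put(Bytes.toBytes("user-42"));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("score"), Bytes.toBytes("17"));
            table.put(put);

            // Random read: fetch the same row directly, no file-level scan needed.
            Result result = table.get(new Get(Bytes.toBytes("user-42")));
            byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("score"));
            System.out.println(Bytes.toString(value));
        }
    }
}
```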

However, these projects still read data from each server's local disk, and therefore cannot deliver the I/O throughput that many applications require. This leaves a gap for "hot" data sets that are too large for the memory of a single machine but do not need the scalability of HBase, i.e., data sets that can be sharded and stored in memory across dozens of machines. These data sets are often referenced by many applications and can be dozens of gigabytes in size.
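To make the sharding idea concrete, a minimal sketch of hash-based partitioning follows; the server list and key type are assumptions for illustration, not part of Nimbus's described design:

```java
import java.util.List;

/** Routes each key to one of N in-memory servers by hashing the key. */
public class Sharder {
    private final List<String> servers; // e.g. host:port of each in-memory node

    public Sharder(List<String> servers) {
        this.servers = servers;
    }

    /** Picks the server responsible for storing the given key. */
    public String serverFor(String key) {
        // Math.floorMod keeps the index non-negative even for negative hash codes.
        int index = Math.floorMod(key.hashCode(), servers.size());
        return servers.get(index);
    }
}
```

With a scheme like this, any client that knows the server list can locate a key's shard without a central lookup; real systems often refine it with consistent hashing so that adding or removing a node moves only a fraction of the keys.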

Nimbus is a project designed for Hadoop that exposes distributed in-memory data structures backed by the reliability of HDFS. A series of I/O benchmarks against HBase validates Nimbus's architecture and implementation, demonstrating its performance advantage for high-throughput data fetch operations. The overall architecture and the design of each component are discussed in light of Nimbus's design goals, along with relevant use cases and future work for the project.
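The abstract does not reproduce Nimbus's API; purely as a hypothetical illustration of the programming model it describes (an in-memory store whose durability comes from HDFS), a client-side interface might look like this. All names below are invented for the sketch:

```java
import java.io.IOException;

/**
 * Hypothetical client view of a distributed in-memory map. Reads and writes
 * go to the in-memory shard that owns the key, while HDFS holds a durable
 * copy that shards can be rebuilt from after a failure.
 */
public interface DistributedMap {
    /** Reads a value from the in-memory shard that owns the key. */
    String get(String key) throws IOException;

    /** Writes to the owning shard; the store also persists the data to HDFS. */
    void put(String key, String value) throws IOException;

    /** Reloads shard contents from the HDFS backing files after a failure. */
    void recoverFromHdfs(String hdfsPath) throws IOException;
}
```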

Committee: Drs. Tim Finin (chair), Anupam Joshi and Konstantinos Kalpakis
