Implications of storage subsystem interactions on processing efficiency in data intensive computing

Koneru, Hanisha, author; Pallickara, Shrideep, advisor; Pallickara, Sangmi, committee member; Arabi, Mazdak, committee member

Files

Koneru_colostate_0053N_13265.pdf (417.07 KB)

Date

2015

Authors

Koneru, Hanisha, author

Pallickara, Shrideep, advisor

Pallickara, Sangmi, committee member

Arabi, Mazdak, committee member

Abstract

Processing frameworks such as MapReduce allow development of programs that operate on voluminous on-disk data. These frameworks typically include support for multiple file/storage subsystems. This decoupling of processing frameworks from the underlying storage subsystem provides a great deal of flexibility in application development. However, as we demonstrate, this flexibility often exacts a price: performance. Given the data volumes, storage subsystems (such as HDFS, MongoDB, and HBase) disperse datasets over a collection of machines. Storage subsystems manage complexity relating to preservation of consistency, redundancy, failure recovery, throughput, and load balancing. Preserving these properties involves message exchanges between distributed subsystem components, updates to in-memory data structures, data movements, and coordination as datasets are staged and system conditions change. Storage subsystems prioritize these properties differently, leading to vastly different network, disk, memory, and CPU footprints for staging and accessing the same dataset. This thesis proposes a methodology for comparing and identifying the storage subsystem suited to the processing being performed on a dataset. We profile the network I/O, disk I/O, memory, and CPU costs introduced by a storage subsystem during data staging, data processing, and generation of results. We perform this analysis with different storage subsystems and with applications that have different disk-I/O to CPU processing ratios.
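
The per-phase profiling described in the abstract can be illustrated with a minimal sketch. This is not the tooling used in the thesis: the class `PhaseProfiler`, the phase names, and the stand-in workload are all hypothetical. It records only wall-clock time, CPU time, and peak Python heap use for the current process. Network and disk I/O counters for a distributed storage subsystem would need OS-level or cluster-level instrumentation, which is not shown.

```python
import time
import tracemalloc
from contextlib import contextmanager


class PhaseProfiler:
    """Hypothetical helper: records wall time, CPU time, and peak Python
    heap use for each named phase of a job, within this process only."""

    def __init__(self):
        self.costs = {}

    @contextmanager
    def phase(self, name):
        tracemalloc.start()              # track Python allocations in this phase
        wall_start = time.perf_counter()
        cpu_start = time.process_time()
        try:
            yield
        finally:
            cpu = time.process_time() - cpu_start
            wall = time.perf_counter() - wall_start
            _, peak = tracemalloc.get_traced_memory()
            tracemalloc.stop()
            self.costs[name] = {
                "wall_s": wall,
                "cpu_s": cpu,
                "peak_bytes": peak,
            }


# Stand-in workload (an assumption, not from the thesis): "staging" builds
# a dataset in memory and "processing" reduces it.
profiler = PhaseProfiler()
with profiler.phase("staging"):
    data = [i * i for i in range(100_000)]
with profiler.phase("processing"):
    total = sum(data)

for name, cost in profiler.costs.items():
    print(name, cost)
```

Running the same phases against each storage subsystem and comparing the recorded costs side by side is one simple way to approximate the kind of comparison the abstract describes.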

Subject

big data

distributed storage systems

Hadoop MapReduce

HBase

HDFS

URI

http://hdl.handle.net/10217/170296

Collections

2000-2019
Theses and Dissertations
