Document Type

Dissertation

Date of Degree

Summer 2016

Degree Name

PhD (Doctor of Philosophy)

Degree In

Electrical and Computer Engineering

First Advisor

Guadalupe M. Canahuate

Abstract

The past few years have brought a major surge in the volume of collected data. More and more enterprises and research institutions find tremendous value in data analysis and exploration. Big Data analytics is used for improving customer experience, performing complex weather data integration and model prediction, and delivering personalized medicine and many other services.

Advances in technology, along with high interest in big data, will only increase the demand for data collection and mining in the years to come.

As a result, and in order to keep up with the data volumes, data processing has become increasingly distributed. However, most distributed processing of large data is done in batch mode, and interactive exploration is hardly an option. To efficiently support queries over large amounts of data, appropriate indexing mechanisms must be in place.

This dissertation proposes an indexing and query processing framework that can run on top of a distributed computing engine to support fast, interactive data exploration in data warehouses. Our data processing layer is built around bit-vector-based indices. This type of indexing features fast bit-wise operations and scales well to high-dimensional data. Additionally, compression can be applied to reduce the index size and thus use less memory and network communication.
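As a rough illustration of the general bit-vector indexing idea, the sketch below builds one uncompressed bit-vector per distinct column value and answers a selection query with bit-wise operations. It is not the dissertation's implementation: Python integers stand in for bit-vectors, no compression or distribution is shown, and the names `build_bitmap_index` and `rows_of` and the sample columns are hypothetical.

```python
from collections import defaultdict


def build_bitmap_index(column):
    """Map each distinct value to an int used as a bit-vector.

    Bit i of the vector for value v is set when row i holds v.
    (Illustrative only: uncompressed, single machine.)
    """
    index = defaultdict(int)
    for row_id, value in enumerate(column):
        index[value] |= 1 << row_id
    return dict(index)


def rows_of(bitvector):
    """Return the row ids whose bits are set, in ascending order."""
    rows = []
    row_id = 0
    while bitvector:
        if bitvector & 1:
            rows.append(row_id)
        bitvector >>= 1
        row_id += 1
    return rows


# Hypothetical sample data: two columns of a five-row table.
color = ["red", "blue", "red", "green", "blue"]
size = ["S", "M", "M", "S", "M"]

color_index = build_bitmap_index(color)
size_index = build_bitmap_index(size)

# Selection: color = 'blue' AND size = 'M' is a single bit-wise AND.
matches = color_index["blue"] & size_index["M"]

# Selection: color = 'red' OR color = 'green' is a single bit-wise OR.
red_or_green = color_index["red"] | color_index["green"]
```

Each predicate combination costs one bit-wise operation over the vectors regardless of how many columns are indexed, which is the property that makes this family of indices attractive for multi-attribute selection queries.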

Our work can be divided into two areas: index compression and query processing.

Two compression schemes are proposed, for sparse and for dense bit-vectors. The design of these encoding methods is hardware-driven, and the query processing is optimized for the available computing hardware. Query algorithms are proposed for selection, aggregation, and other specialized queries. Query processing is supported on single machines as well as on computer clusters.

Keywords

Bit-vector, Database Indexing, Data Compression, Data Exploration, Distributed Database, Query Algorithm

Pages

xiv, 124

Bibliography

120-124

Copyright

Copyright 2016 Gheorghi Guzun
