Date of Degree
PhD (Doctor of Philosophy)
Electrical and Computer Engineering
Guadalupe M. Canahuate
The past few years have brought a major surge in the volumes of collected data. More and more enterprises and research institutions find tremendous value in data analysis and exploration. Big Data analytics is used for improving customer experience, perform complex weather data integration and model prediction, as well as personalized medicine and many other services.
Advances in technology, along with high interest in big data, can only increase the demand on data collection and mining in the years to come.
As a result, and in order to keep up with the data volumes, data processing has become increasingly distributed. However, most of the distributed processing for large data is done by batch processing and interactive exploration is hardly an option. To efficiently support queries over large amounts of data, appropriate indexing mechanisms must be in place.
This dissertation proposes an indexing and query processing framework that can run on top of a distributed computing engine, to support fast, interactive data explorations in data warehouses. Our data processing layer is built around bit-vector based indices. This type of indexing features fast bit-wise operations and scales up well for high dimensional data. Additionally, compression can be applied to reduce the index size, and thus utilize less memory and network communication.
Our work can be divided into two areas: index compression and query processing.
Two compression schemes are proposed for sparse and dense bit-vectors. The design of these encoding methods is hardware-driven, and the query processing is optimized for the available computing hardware. Query algorithms are proposed for selection, aggregation, and other specialized queries. The query processing is supported on single machines, as well as computer clusters.
Bit-vector, Database Indexing, Data Compression, Data Exploration, Distributed Database, Query Algorithm
Copyright 2016 Gheorghi Guzun