Date of Degree
PhD (Doctor of Philosophy)
Big data refers to information which cannot be processed and analyzed using traditional approaches and tools, due to 4 V's - sheer Volume, Velocity at which data is received and processed, and data Variety and Veracity. Today massive volumes of data originate in domains such as geospatial analysis, biological and social networks, etc. Hence, scalable algorithms for effcient processing of this massive data is a signicant challenge in the field of computer science. One way to achieve such effcient and scalable algorithms is by using shared & distributed memory parallel programming models. In this thesis, we present a variety of such algorithms to solve problems in various above mentioned domains. We solve five problems that fall into two categories.
The first group of problems deals with the issue of community detection. Detecting communities in real world networks is of great importance because they consist of patterns that can be viewed as independent components, each of which has distinct features and can be detected based upon network structure. For example, communities in social networks can help target users for marketing purposes, provide user recommendations to connect with and join communities or forums, etc. We develop a novel sequential algorithm to accurately detect community structures in biological protein-protein interaction networks, where a community corresponds with a functional module of proteins. Generally, such sequential algorithms are computationally expensive, which makes them impractical to use for large real world networks. To address this limitation, we develop a new highly scalable Symmetric Multiprocessing (SMP) based parallel algorithm to detect high quality communities in large subsections of social networks like Facebook and Amazon. Due to the SMP architecture, however, our algorithm cannot process networks whose size is greater than the size of the RAM of a single machine. With the increasing size of social networks, community detection has become even more difficult, since network size can reach up to hundreds of millions of vertices and edges. Processing such massive networks requires several hundred gigabytes of RAM, which is only possible by adopting distributed infrastructure. To address this, we develop a novel hybrid (shared + distributed memory) parallel algorithm to efficiently detect high quality communities in massive Twitter and .uk domain networks.
The second group of problems deals with the issue of effciently processing spatial Light Detection and Ranging (LiDAR) data. LiDAR data is widely used in forest and agricultural crop studies, landscape classification, 3D urban modeling, etc. Technological advancements in building LiDAR sensors have enabled highly accurate and dense LiDAR point clouds resulting in massive data volumes, which pose computing issues with processing and storage. We develop the first published landscape driven data reduction algorithm, which uses the slope-map of the terrain as a filter to reduce the data without sacrificing its accuracy. Our algorithm is highly scalable and adopts shared memory based parallel architecture. We also develop a parallel interpolation technique that is used to generate highly accurate continuous terrains, i.e. Digital Elevation Models (DEMs), from discrete LiDAR point clouds.
Algorithms, Big Data Analytics, Distributed Systems, High Performance Parallel Computing
Copyright © 2016 Rahil Sharma