Document Type

Dissertation

Date of Degree

Fall 2017

Degree Name

PhD (Doctor of Philosophy)

Degree In

Biostatistics

First Advisor

Patrick Breheny

First Committee Member

Grant Brown

Second Committee Member

Joseph Cavanaugh

Third Committee Member

Luke Tierney

Fourth Committee Member

Tianbao Yang

Abstract

Sparse machine learning models have become increasingly popular in analyzing high-dimensional data. With the evolving era of Big Data, ultrahigh-dimensional, large-scale data sets are constantly collected in many areas such as genetics, genomics, biomedical imaging, social media analysis, and high-frequency finance. Mining valuable information efficiently from these massive data sets requires not only novel statistical models but also advanced computational techniques. This thesis focuses on the development of scalable sparse machine learning methods to facilitate Big Data analytics.

Built upon the feature screening technique, the first part of this thesis proposes a family of hybrid safe-strong rules (HSSR) that incorporate safe screening rules into the sequential strong rule to remove unnecessary computational burden for solving the \textit{lasso-type} models. We present two instances of HSSR, namely SSR-Dome and SSR-BEDPP, for the standard lasso problem. We further extend SSR-BEDPP to the elastic net and group lasso problems to demonstrate the generalizability of the hybrid screening idea. In the second part, we design and implement an R package called \texttt{biglasso} to extend the lasso model fitting to Big Data in R. Our package \texttt{biglasso} utilizes memory-mapped files to store the massive data on the disk, only reading data into memory when necessary during model fitting, and is thus able to handle \textit{data-larger-than-RAM} cases seamlessly. Moreover, it's built upon our redesigned algorithm incorporated with the proposed HSSR screening, making it much more memory- and computation-efficient than existing R packages. Extensive numerical experiments with synthetic and real data sets are conducted in both parts to show the effectiveness of the proposed methods.

In the third part, we consider a novel statistical model, namely the overlapping group logistic regression model, that allows for selecting important groups of features that are associated with binary outcomes in the setting where the features belong to overlapping groups. We conduct systematic simulations and real-data studies to show its advantages in the application of genetic pathway selection. We implement an R package called \texttt{grpregOverlap} that has HSSR screening built in for fitting overlapping group lasso models.

Keywords

Big data, Feature screening, High-dimensional statistics, Lasso-type models, Sparse machine learning

Pages

xii, 100 pages

Bibliography

Includes bibliographical references (pages 95-100).

Copyright

Copyright © 2017 Yaohui Zeng

Included in

Biostatistics Commons

Share

COinS