DOI

10.17077/etd.lremrcvo

Document Type

Dissertation

Date of Degree

Fall 2016

Degree Name

PhD (Doctor of Philosophy)

Degree In

Statistics

First Advisor

Jian Huang

First Committee Member

Patrick Breheny

Second Committee Member

Kung-Sik Chan

Third Committee Member

Joseph B Lang

Fourth Committee Member

Luke Tierney

Abstract

In fields such as statistics, economics and biology, heterogeneity is an important topic concerning the validity of data inference and the discovery of hidden patterns. This thesis focuses on penalized methods for regression analysis in the presence of heterogeneity in a potentially high-dimensional setting. Two possible strategies to deal with heterogeneity are robust regression methods that provide heterogeneity-resistant coefficient estimation, and direct detection of heterogeneity while simultaneously estimating coefficients accurately.

We consider the first strategy for two robust regression methods, Huber loss regression and quantile regression with Lasso or Elastic-Net penalties, which have been studied theoretically but lack efficient algorithms. We propose a new algorithm, Semismooth Newton Coordinate Descent, to solve them. The algorithm is a novel combination of the Semismooth Newton Algorithm and Coordinate Descent that applies to penalized optimization problems with both a nonsmooth loss and a nonsmooth penalty. We prove its convergence properties and show its computational efficiency through numerical studies.

We also propose a nonconvex penalized regression method, Heterogeneity Discovery Regression (HDR), as a realization of the second strategy. We establish theoretical results that guarantee statistical precision for any local optimum of the objective function with high probability. We also compare the numerical performance of HDR with competitors including Huber loss regression, quantile regression and least squares through simulation studies and a real data example. In these experiments, HDR methods are able to detect heterogeneity accurately, and also largely outperform the competitors in terms of coefficient estimation and variable selection.

Public Abstract

In fields such as statistics, economics and biology, heterogeneity is an important topic concerning the validity of data inference and the discovery of hidden patterns. Our insights into and interpretation of the data can be dramatically influenced by the presence of heterogeneity. This is especially challenging in high-dimensional data, which are becoming increasingly common in areas such as genetics, behavioral sciences, and image and natural language processing. This thesis focuses on penalized methods for regression analysis in the presence of heterogeneity in a potentially high-dimensional setting.

One strategy to deal with heterogeneity is to use robust regression methods that provide heterogeneity-resistant coefficient estimation. We develop a novel algorithm, Semismooth Newton Coordinate Descent, that computes two important classes of penalized robust regression methods efficiently and scales very well to ultra-high dimensions (e.g., 100,000). Another strategy is direct detection of heterogeneity while simultaneously estimating coefficients accurately. We propose a nonconvex penalized regression method, Heterogeneity Discovery Regression (HDR), as a realization of this idea. We establish good theoretical properties for the approach, and demonstrate significant advantages of HDR over alternatives such as robust regression methods through simulation studies. Finally, we illustrate the application of HDR to a building energy data set.

Keywords

heterogeneity detection, high-dimensional, nonconvex regularization, optimization, robust regression, variable selection

Pages

ix, 98 pages

Bibliography

Includes bibliographical references (pages 96-98).

Copyright

Copyright © 2016 Congrui Yi
