DOI

10.17077/etd.z6li-0cy7

Document Type

Dissertation

Date of Degree

Spring 2019

Access Restrictions

Access restricted until 07/29/2021

Degree Name

PhD (Doctor of Philosophy)

Degree In

Biostatistics

First Advisor

Cavanaugh, Joseph

First Committee Member

Breheny, Patrick J.

Second Committee Member

Oleson, Jacob J.

Third Committee Member

Polgreen, Philip M.

Fourth Committee Member

Miller, Aaron C.

Fifth Committee Member

Miller, Ryan E.

Abstract

In this dissertation, we explore and illustrate the concept of ranked sparsity, a phenomenon that often occurs naturally in the presence of derived variables. Ranked sparsity arises in modeling applications when an expected disparity exists in the quality of information between different feature sets. Its presence can cause traditional model selection methods to fail because statisticians commonly presume that each potential parameter is equally worthy of entering into the final model, a principle we call "covariate equipoise". However, this presumption does not always hold, especially in the presence of derived variables. For instance, when all possible interactions are considered as candidate predictors, the presumption of covariate equipoise will often produce misspecified and opaque models. The sheer number of additional candidate variables grossly inflates the number of false discoveries among the interactions, resulting in unnecessarily complex and difficult-to-interpret models with many (truly spurious) interactions. We suggest a modeling strategy that requires a stronger level of evidence before certain variables (e.g., interactions) are allowed into the final model. This ranked sparsity paradigm can be implemented either with a modified Bayesian information criterion (RBIC) or with the sparsity-ranked lasso (SRL).
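As an illustrative sketch only (the notation below is assumed here rather than quoted from the dissertation), the SRL can be viewed as a weighted lasso in which each coefficient's penalty weight reflects the size of the feature set from which it is derived:

    \hat{\beta} = \arg\min_{\beta} \Big\{ -\ell(\beta; y, X) + \lambda \sum_{j} w_j |\beta_j| \Big\}, \qquad w_j = \sqrt{p_{k(j)}},

where p_{k(j)} is the number of candidate terms in the feature set containing variable j (for example, the full collection of pairwise interactions). Under this weighting, larger derived-variable sets face proportionally stiffer penalties and must carry stronger evidence to be selected.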

In Chapter 1, we provide a philosophical motivation for ranked sparsity by describing situations where traditional model selection methods fail. Chapter 1 also reviews the relevant literature and motivates why ranked sparsity methods are necessary in the context of interactions. Finally, we introduce RBIC and SRL as possible remedies. In Chapter 2, we explore the performance of the SRL relative to competing methods for selecting polynomials and interactions in a series of simulations. We show that the SRL is an attractive method because it is fast, accurate, and does not tend to inflate the number of Type I errors among the interactions. We illustrate its utility in an application predicting the survival of lung cancer patients using a set of gene expression measurements and clinical covariates, searching in particular for gene-environment interactions, which are very difficult to find in practice. A brief computational sketch of the weighted-penalty idea follows.
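To make the weighted-penalty idea concrete, the following Python sketch fits a weighted lasso to simulated data by rescaling columns, an equivalence that holds for any fixed set of positive weights. The simulated data, the weight choice (square root of each feature set's size), and the tuning value are illustrative assumptions only; they do not reproduce the dissertation's software or simulations.

    import numpy as np
    from sklearn.linear_model import Lasso

    # Minimal sketch of a sparsity-ranked (weighted) lasso via column rescaling.
    # All names and weight choices are illustrative assumptions, not the
    # dissertation's implementation.
    rng = np.random.default_rng(0)
    n, p = 200, 5
    X_main = rng.normal(size=(n, p))                      # main effects
    pairs = [(i, j) for i in range(p) for j in range(i + 1, p)]
    X_int = np.column_stack([X_main[:, i] * X_main[:, j] for i, j in pairs])
    X = np.column_stack([X_main, X_int])

    # Response depends on two main effects and one true interaction.
    y = (2 * X_main[:, 0] - X_main[:, 1]
         + 1.5 * X_main[:, 0] * X_main[:, 1]
         + rng.normal(scale=1.0, size=n))

    # Penalty weights: each term is weighted by the square root of the size of
    # its feature set (assumed choice), so the many candidate interactions are
    # penalized more heavily than the few main effects.
    w = np.concatenate([np.full(p, np.sqrt(p)),
                        np.full(len(pairs), np.sqrt(len(pairs)))])

    # A weighted lasso is equivalent to an ordinary lasso on columns scaled by
    # 1/w, with the fitted coefficients scaled back by the same factor.
    fit = Lasso(alpha=0.05).fit(X / w, y)
    beta = fit.coef_ / w
    print("selected terms:", np.flatnonzero(beta != 0))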

In Chapter 3, we present three extensions of the SRL in very different contexts. First, we show how the method can be used to optimize for cost and prediction accuracy simultaneously when covariates have differing collection costs. In this setting, the SRL produces what we call "minimally invasive" models, i.e., models that can easily (and cheaply) be applied to new data. Second, we investigate the use of the SRL in the context of time series regression, evaluating our method against several other state-of-the-art techniques in predicting the hourly number of arrivals at the Emergency Department of the University of Iowa Hospitals and Clinics. Finally, we show how the SRL can be utilized to balance model stability and model adaptivity in an application that uses a rich new source of smartphone thermometer data to predict flu incidence in real time.
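One plausible reading of the cost-aware extension, sketched here under our own assumed notation rather than taken from Chapter 3, is to let each covariate's penalty weight also scale with its collection cost c_j, e.g.

    w_j = c_j \sqrt{p_{k(j)}},

so that expensive-to-collect variables must contribute proportionally more predictive value before they are retained, yielding the "minimally invasive" models described above.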

Keywords

derived variables, interactions, lasso, model selection, Occam’s razor, ranked skepticism

Pages

xiii, 100 pages

Bibliography

Includes bibliographical references (pages 98-100).

Copyright

Copyright © 2019 Ryan Andrew Peterson


Included in

Biostatistics Commons
