Document Type

Dissertation

Date of Degree

Fall 2012

Degree Name

PhD (Doctor of Philosophy)

Degree In

Biostatistics

First Advisor

Joseph E. Cavanaugh

Abstract

In this thesis, we consider the clustering of time series data; specifically, time series that can be modeled in the state space framework. Of primary focus is the pairwise discrepancy between two state space time series. The state space model can be formulated in terms of two equations: the state equation, based on a latent process, and the observation equation. Because the unobserved state process is often of interest, we develop discrepancy measures based on the estimated version of the state process. We compare these measures to discrepancies based on the observed data. In all, seven novel discrepancies are formulated.
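For orientation, the two equations mentioned above can be written in the standard linear Gaussian form. This is a sketch under assumed notation; the symbols $x_t$, $y_t$, $\Phi$, $A$, $Q$, and $R$ do not appear in this record and may differ from those used in the thesis.

```latex
% Generic linear Gaussian state space model (notation assumed, not taken from the thesis)
\begin{align}
  x_t &= \Phi\, x_{t-1} + w_t, & w_t &\sim N(0, Q) && \text{(state equation, latent process)} \\
  y_t &= A\, x_t + v_t,        & v_t &\sim N(0, R) && \text{(observation equation)}
\end{align}
```

Here $x_t$ is the unobserved state and $y_t$ is the observed series; the smoothed state estimates referred to in this abstract are estimates of $x_t$ given the full observed record $y_1, \dots, y_n$.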

First, discrepancies derived from Kullback-Leibler (KL) information and Mahalanobis distance (MD) measures are proposed based on the observed data. Next, KL information and MD discrepancies are formulated based on the composite marginal contributions of the smoothed estimates of the unobserved state process. Furthermore, an MD is created based on the joint contributions of the collection of smoothed estimates of the unobserved state process. The cross trajectory distance, a discrepancy heavily influenced by both observed and smoothed data, is also proposed, as is a Euclidean distance based on the smoothed state estimates. The performance of these seven novel discrepancies is compared to that of the often-used Euclidean distance based on the observed data, as well as that of a KL information discrepancy based on the joint contributions of the collection of smoothed state estimates (Bengtsson and Cavanaugh, 2008).
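As background, the textbook definitions underlying these discrepancies are given below. They are the generic forms for $k$-dimensional Gaussian quantities, with assumed symbols $\mu_i$ and $\Sigma_i$; the exact discrepancies constructed in the thesis are not spelled out in this record and may differ, for example through symmetrization or the choice of covariance.

```latex
% Textbook definitions (generic; not the thesis's specific discrepancy formulas)
\begin{align}
  % KL information between N(\mu_1, \Sigma_1) and N(\mu_2, \Sigma_2) in k dimensions
  \mathrm{KL}(1 \,\|\, 2) &= \tfrac{1}{2}\left[
      \operatorname{tr}\!\left(\Sigma_2^{-1}\Sigma_1\right)
      + (\mu_2 - \mu_1)^{\top} \Sigma_2^{-1} (\mu_2 - \mu_1)
      - k
      + \ln\frac{\det \Sigma_2}{\det \Sigma_1}
    \right] \\
  % Mahalanobis distance between two mean vectors under a common covariance \Sigma
  \mathrm{MD}(\mu_1, \mu_2) &= \sqrt{(\mu_1 - \mu_2)^{\top} \Sigma^{-1} (\mu_1 - \mu_2)}
\end{align}
```

KL information is not symmetric in its arguments, so a pairwise clustering discrepancy built on it typically needs some symmetric combination of the two directions.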

We find that the discrepancy measures based on the smoothed estimates of the unobserved state process outperform those based on the observed data. The best performance is achieved by the discrepancies founded upon the joint contributions of the collection of unobserved states, followed by the discrepancies derived from the marginal contributions.

We observe a non-trivial degradation in clustering performance when the parameters of the state space model must be estimated rather than being known. To improve estimation, we propose an iterative estimation and clustering routine based on the notion of finding a series' most similar counterparts, pooling them, and estimating a new set of parameters. Under ideal circumstances, we show that the iterative estimation and clustering algorithm can potentially achieve results that approach those obtained in settings where parameters are known. In practice, the algorithm often improves the performance of the model-based clustering measures.
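The pool-and-re-estimate idea can be illustrated with a deliberately simplified sketch. Everything in it is an assumption made for illustration: a toy AR(1) coefficient stands in for the state space parameters, the absolute difference between estimates stands in for the model-based discrepancies, and the names `ar1_coefficient`, `iterative_pooling`, and `simulate_ar1` are hypothetical. It is not the routine developed in the thesis.

```python
import numpy as np


def ar1_coefficient(pool):
    """Least-squares AR(1) coefficient pooled over a list of 1-D series (toy stand-in)."""
    num = sum(float(np.dot(y[1:], y[:-1])) for y in pool)
    den = sum(float(np.dot(y[:-1], y[:-1])) for y in pool)
    return num / den


def iterative_pooling(series, n_neighbors=2, n_iter=3):
    """Alternate between finding similar series and re-estimating from the pooled data."""
    n = len(series)
    # Start from parameters estimated separately for each series.
    params = np.array([ar1_coefficient([y]) for y in series])
    for _ in range(n_iter):
        # Toy discrepancy: absolute difference between current parameter estimates.
        disc = np.abs(params[:, None] - params[None, :])
        new_params = np.empty(n)
        for i in range(n):
            # Most similar counterparts of series i, excluding i itself.
            order = [j for j in np.argsort(disc[i]) if j != i]
            pool = [series[i]] + [series[j] for j in order[:n_neighbors]]
            # Re-estimate the parameter for series i from the pooled series.
            new_params[i] = ar1_coefficient(pool)
        params = new_params
    disc = np.abs(params[:, None] - params[None, :])
    return params, disc


def simulate_ar1(phi, n, rng):
    """Simulate a zero-mean AR(1) series with unit-variance Gaussian noise."""
    y = np.zeros(n)
    for t in range(1, n):
        y[t] = phi * y[t - 1] + rng.normal()
    return y


# Two groups of four series with clearly different dynamics.
rng = np.random.default_rng(0)
series = [simulate_ar1(0.9, 100, rng) for _ in range(4)]
series += [simulate_ar1(-0.5, 100, rng) for _ in range(4)]
params, disc = iterative_pooling(series)
```

The returned `disc` matrix could then be passed to any standard clustering procedure. In this toy setting, pooling reduces the sampling variability of each estimate, which is the intuition behind the improvement described above.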

We apply our methods to two examples. The first application pertains to the clustering of time course genetic data. We use data from Cho et al. (1998), in which a time course experiment of yeast gene expression was performed in order to study the yeast mitotic cell cycle. We attempt to discover the phase to which each of 219 genes belongs.

The second application seeks to answer whether influenza and pneumonia mortality can be explained geographically. Data from a collection of cities across the U.S. are acquired from the Morbidity and Mortality Weekly Report (MMWR). We cluster the MMWR data without geographic constraints, and compare the results to clusters defined by MMWR geographic regions. We find that influenza and pneumonia mortality cannot be explained by geography.

Keywords

clustering, Kullback-Leibler, Mahalanobis distance, state space, time series

Pages

x, 112 pages

Bibliography

Includes bibliographical references (pages 108-111).

Copyright

Copyright 2012 Eric D. Foster

Included in

Biostatistics Commons
