Document Type


Date of Degree

Fall 2014

Degree Name

PhD (Doctor of Philosophy)

Degree In

Computer Science

First Advisor

Segre, Alberto M

First Committee Member

Polgreen, Philip M

Second Committee Member

Pemmaraju, Sriram

Third Committee Member

Herman, Ted

Fourth Committee Member

Rushton, Gerard

Fifth Committee Member

Valle, Sara Y Del


Traditional disease surveillance systems are instrumental in guiding policy-makers' decisions and understanding disease dynamics. The first study in this dissertation looks at sentinel surveillance network design. We consider three location-allocation models: two based on the maximal coverage model (MCM) and one based on the K-median model. The MCM selects sites that maximize the total number of people within a specified distance to the site. The K-median model minimizes the sum of the distances from each individual to the individual's nearest site. Using a ground truth dataset consisting of two million de-identified Medicaid billing records representing eight complete influenza seasons and an evaluation function based on the Huff spatial interaction model, we empirically compare networks against the existing volunteer-based Iowa Department of Public Health influenza-like illness network by simulating the spread of influenza across the state of Iowa. We compare networks on two metrics: outbreak intensity (i.e., disease burden) and outbreak timing (i.e., the start, peak, and end of the epidemic). We show that it is possible to design a network that achieves outbreak intensity performance identical to the status quo network using two fewer sites. We also show that if outbreak timing detection is of primary interest, it is actually possible to create a network that matches the existing network's performance using 42% fewer sites. Finally, in an effort to demonstrate the generic usefulness of these location-allocation models, we examine primary stroke center selection. We describe the ineffectiveness of the current self-initiated approach and argue for a more organized primary stroke center system.
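The two objectives described above can be made concrete with a small sketch. It assumes Euclidean distance on planar coordinates and a toy population list, and the function names are hypothetical. The greedy selection routine is a common heuristic used here only for illustration. It is not necessarily the solution method used in the dissertation.

```python
import math

def distance(a, b):
    # Euclidean distance between two (x, y) points; an assumption made for
    # illustration (a real study might use road or great-circle distance).
    return math.hypot(a[0] - b[0], a[1] - b[1])

def mcm_coverage(people, sites, radius):
    # MCM objective: total number of people within `radius` of at least one site.
    # `people` is a list of ((x, y), population) pairs.
    return sum(pop for loc, pop in people
               if any(distance(loc, s) <= radius for s in sites))

def k_median_cost(people, sites):
    # K-median objective: sum over individuals of the distance to the nearest
    # site, computed as a population-weighted sum over locations.
    return sum(pop * min(distance(loc, s) for s in sites)
               for loc, pop in people)

def greedy_mcm(people, candidates, k, radius):
    # Greedy heuristic (illustrative only): repeatedly add the candidate site
    # that yields the largest total coverage.
    chosen = []
    for _ in range(k):
        remaining = [c for c in candidates if c not in chosen]
        best = max(remaining,
                   key=lambda c: mcm_coverage(people, chosen + [c], radius))
        chosen.append(best)
    return chosen

# Toy data: three population centers on a line (made-up numbers).
people = [((0, 0), 100), ((1, 0), 50), ((10, 0), 30)]
candidates = [(0, 0), (1, 0), (10, 0)]
network = greedy_mcm(people, candidates, k=2, radius=2)
```

Larger coverage is better for the MCM, while smaller total distance is better for the K-median model, so the two models can prefer different networks for the same population.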

While these traditional disease surveillance systems are important, they have several downsides. First, due to a complex reporting hierarchy, there is generally a reporting lag; for example, most diseases in the United States experience a reporting lag of approximately 1-2 weeks. Second, many regions of the world lack trustworthy or reliable data. As a result, there has been a surge of research looking at using publicly available data on the internet for disease surveillance purposes. The second and third studies in this dissertation analyze Wikipedia's viability in this sphere.

The first of these two studies looks at Wikipedia access logs. Hourly access logs dating back to December 2007 are available for anyone to download completely free of charge. These logs contain, among other things, the total number of accesses for every article in Wikipedia. Using a linear model and a simple article selection procedure, we show that it is possible to nowcast and, in some cases, forecast up to the 28 days tested in 8 of the 14 disease-location contexts considered. We also demonstrate that it may be possible in some cases to train a model in one context and use the same model to nowcast or forecast in another context with poor surveillance data.
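As a rough illustration of the idea, the sketch below fits a one-variable least-squares line that maps the access counts of a single article to case counts, with an optional lag for forecasting. The single-article simplification, the function names, and the numbers are all assumptions made for illustration. The article selection procedure and the exact model form used in the study are not reproduced here.

```python
def lagged_pairs(accesses, cases, lag):
    # Pair accesses at time t with cases at time t + lag.
    # lag = 0 corresponds to nowcasting; lag > 0 to forecasting.
    n = len(accesses) - lag
    return accesses[:n], cases[lag:]

def fit_line(x, y):
    # Ordinary least squares for y = slope * x + intercept.
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def predict(model, accesses_now):
    slope, intercept = model
    return slope * accesses_now + intercept

# Made-up time series, one value per time step.
accesses = [10, 20, 30, 40]
cases = [5, 10, 15, 20]

nowcast_model = fit_line(*lagged_pairs(accesses, cases, lag=0))
forecast_model = fit_line(*lagged_pairs(accesses, cases, lag=1))
```

Under these assumptions, a model fitted in one context could be applied to another context by calling `predict` with that context's access counts, which mirrors the transfer idea described above.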

The second of the Wikipedia studies looks at disease-relevant data found in the article content. A number of disease outbreaks are meticulously tracked on Wikipedia. Case counts, death counts, and hospitalization counts are often provided in the article narrative. Using a dataset created from 14 Wikipedia articles, we trained a named-entity recognizer (NER) to recognize and tag these phrases. The NER achieved an F1 score of 0.753. In addition to these counts in the narrative, we tested the accuracy of tabular data using the 2014 West African Ebola virus disease epidemic. This article, like a number of other disease articles on Wikipedia, contains granular case counts and death counts per country affected by the disease. By computing the root-mean-square error between the Wikipedia time series and a ground truth time series, we show that the Wikipedia time series are both timely and accurate.
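For readers unfamiliar with the two evaluation measures mentioned above, the following minimal sketch shows their standard definitions. The function names and the example numbers are hypothetical and are not taken from the dissertation's data.

```python
import math

def f1_score(true_pos, false_pos, false_neg):
    # F1 is the harmonic mean of precision and recall.
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)

def rmse(series_a, series_b):
    # Root-mean-square error between two equal-length time series.
    if len(series_a) != len(series_b):
        raise ValueError("series must have the same length")
    squared = [(a - b) ** 2 for a, b in zip(series_a, series_b)]
    return math.sqrt(sum(squared) / len(squared))

# Made-up values for illustration only.
example_f1 = f1_score(true_pos=3, false_pos=1, false_neg=1)
example_rmse = rmse([0, 0], [3, 4])
```

A lower RMSE indicates closer agreement between the two time series, and an F1 score closer to 1 indicates better tagging performance.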

Public Abstract

This dissertation presents three studies that aim to improve the current state of disease surveillance.

The first study looks at designing traditional sentinel surveillance systems, which are instrumental in guiding policy-makers' decisions and understanding disease dynamics. We use several popular location-allocation models to algorithmically design surveillance networks of varying sizes. By simulating the spread of influenza across the state of Iowa, we demonstrate that we can generate smaller networks that perform at least as well as the volunteer-based influenza surveillance network used by the Iowa Department of Public Health. We also apply these network design methods to primary stroke center placement.

The second and third studies recognize that while these traditional surveillance systems are important, they have drawbacks, such as reporting lags and untrustworthy, unreliable, or unavailable data in some instances. To help solve these problems, we introduce a novel data source: Wikipedia.

The second study shows how a linear model using time series of article accesses can nowcast and forecast a variety of diseases in a variety of locations.

Finally, the third study demonstrates how disease-related data can be elicited from Wikipedia article content. We show how a named-entity recognizer can be trained to tag case, death, and hospitalization counts in the article narrative. We also analyze tabular time series data and show that they are accurate and timely.


disease surveillance, forecasting, sentinel surveillance, simulation, Wikipedia


xiv, 136 pages


Includes bibliographical references (pages 117-136).


Copyright 2014 Geoffrey Colin Fairchild