Document Type

Dissertation

Date of Degree

Fall 2014

Degree Name

PhD (Doctor of Philosophy)

Degree In

Genetics (Computational Genetics)

First Advisor

Braun, Terry A

Second Advisor

Stone, Edwin M

First Committee Member

Street, William N

Second Committee Member

Manak, John R

Third Committee Member

Scheetz, Todd E


Next-generation sequencing (NGS) technologies have driven explosive growth in DNA sequencing capacity, making it possible to characterize an individual's exome inexpensively. This capability gives clinicians additional tools to evaluate the likely genetic factors underlying heritable diseases. With this added capacity comes a need to identify relationships between the genetic variations observed in a patient and the disease with which the patient presents. This dissertation focuses on computational techniques to inform molecular diagnostics from NGS data. The techniques address three distinct domains in the characterization of disease-associated variants from exome sequencing.

First, strategies for producing complete and non-artifactual candidate variant lists are discussed. The process of converting patient DNA into a list of variants relative to the reference genome is complex, and errors may be introduced at many steps. To address this, a Random Forest classifier was built to capture biases in a laboratory variant calling pipeline, and a C4.5 decision tree was built to enable discovery of thresholds for false positive reduction. Additionally, a strategy for augmenting exome capture experiments through evaluation of RNA-sequencing data is discussed.
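As a rough illustration of how a decision tree can expose a filtering threshold, the sketch below searches a single continuous feature for the split that maximizes information gain, in the spirit of how C4.5 treats continuous attributes. It is not the dissertation's pipeline: the feature name `quality`, the labels, and the toy values are hypothetical, and using the midpoint between neighboring values as the threshold is a simplification.

```python
import math

def entropy(labels):
    """Shannon entropy in bits of a list of boolean labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def best_threshold(values, labels):
    """Return (threshold, gain) for the split `value <= threshold`
    that maximizes information gain on one continuous feature."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    base = entropy([label for _, label in pairs])
    best = (None, 0.0)
    for i in range(1, n):
        low, high = pairs[i - 1][0], pairs[i][0]
        if low == high:
            continue  # no split point between identical values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        gain = base - (len(left) / n) * entropy(left) - (len(right) / n) * entropy(right)
        if gain > best[1]:
            best = ((low + high) / 2, gain)
    return best

# Hypothetical toy data: one quality score per variant call and whether
# the call was later confirmed (True) or found to be a false positive (False).
quality = [12, 15, 18, 22, 35, 40, 48, 55]
is_true_variant = [False, False, False, False, True, True, True, True]

threshold, gain = best_threshold(quality, is_true_variant)
```

On real data, a learned threshold like this would only be a starting point and would need validation on calls that were held out from training.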

Second, a novel positive and unlabeled learning for prioritization (PULP) strategy is proposed to identify candidate variants most likely to be associated with a patient's disease. Using a number of publicly available data sources, PULP ranks genes according to how alike they are to previously discovered disease genes. This strategy is evaluated on a number of candidate lists from the literature, and is demonstrated to significantly enrich ordered candidate variant lists for likely disease-associated variants.
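The idea of ranking candidates by their resemblance to known disease genes can be sketched with a very simple baseline: score each unlabeled gene by its cosine similarity to the centroid of the known positives. This is not the PULP algorithm, whose details are not given in this abstract. The gene names, feature vectors, and the function `rank_by_similarity` are hypothetical and serve only to show the positive-versus-unlabeled setup.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank_by_similarity(positives, unlabeled):
    """Rank unlabeled genes by similarity to the centroid of known positives.

    Both arguments map a gene name to its feature vector. Returns a list of
    (gene, score) pairs sorted from most to least similar.
    """
    dims = len(next(iter(positives.values())))
    centroid = [
        sum(vec[d] for vec in positives.values()) / len(positives)
        for d in range(dims)
    ]
    scored = [(gene, cosine(vec, centroid)) for gene, vec in unlabeled.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Hypothetical feature vectors (e.g. normalized annotation scores per gene).
known = {"KNOWN_1": [0.9, 0.8, 0.1], "KNOWN_2": [0.8, 0.9, 0.2]}
candidates = {
    "CAND_A": [0.1, 0.2, 0.9],
    "CAND_B": [0.85, 0.8, 0.1],
    "CAND_C": [0.5, 0.4, 0.5],
}

ranking = rank_by_similarity(known, candidates)
```

A full positive and unlabeled learner would typically train a classifier that treats the unlabeled genes as a mixture of positives and negatives. The centroid baseline only conveys the ranking idea.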

Finally, the Training for Recognition and Integration of Phenotypes in Ocular Disease (TRIPOD) web utility is introduced as a means of simultaneously training and learning from clinicians about heritable ocular diseases. This tool currently contains a number of case studies documenting a wide range of diseases, and challenges trainees to virtually diagnose patients based on presented image data. Annotations by trainee and expert alike are used to construct rich phenotypic profiles for patients with known disease genotypes.

The strategies presented in this dissertation are specifically applicable to heritable retinal dystrophies, and have resulted in a number of improvements to the accurate molecular diagnosis of patient diseases. However, these works also provide a generalizable framework for disease-associated variant identification in any heritable, genetically heterogeneous disease, and represent the ongoing challenge of accurate diagnosis in the information age.

Public Abstract

Now is a very exciting time in human medicine. Currently, a number of clinical trials are underway to gauge the efficacy of gene replacement therapy for a variety of heritable, and previously incurable, diseases. Before a patient is treated, however, an accurate genetic diagnosis must be made indicating which of their genes causes their disease. This diagnosis is technically challenging, and can be very difficult if a large number of possible genes may cause the disease. To address these challenges, computational techniques are introduced that can assist physicians in sorting through all of a patient’s genetic content to produce an accurate diagnosis.

This dissertation addresses these challenges in three distinct ways. First, methods for reducing the “white noise” in the genetic diagnostic process are described, as well as data-driven methods for looking closer at hidden compartments in genes that may be important. Second, a novel strategy for finding new disease genes that function similarly to already discovered ones is proposed. Finally, a web tool is introduced that can train clinicians to spot medical signs that hint at the likely underlying disease gene.

These tools and techniques represent a broad framework for using medical signs and molecular data to better diagnose the genes that cause disease. Throughout this dissertation, these tools and techniques are demonstrated to improve the genetic diagnostic process for patients with heritable eye diseases. Ultimately, these works add to an ongoing effort, and represent another step towards rapid and accurate genetic diagnoses.


Bioinformatics, Machine Learning, PULP, TRIPOD


xiv, 98 pages


Includes bibliographical references (pages 90-98).


Copyright 2014 Alex Handler Wagner