Date of Degree
PhD (Doctor of Philosophy)
Kin Fai Au
The purpose of this thesis is to explore Markov chain Monte Carlo (MCMC) methodology, a powerful technique in the Bayesian framework, for binary variables. The primary application of interest is phasing haplotypes, which can be treated as categorical variables. A haplotype is the combination of variants present on a single chromosome of an individual's genome, and phasing refers to estimating the true haplotype. By considering only biallelic and heterozygous variants, the haplotype can be expressed as a vector of binary variables. Accounting for differences in haplotypes is essential for the study of associations between genotype and disease.
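As a minimal sketch of the binary encoding described above, the alleles observed on one chromosome copy can be mapped to 0 (reference) or 1 (alternate). The helper names `encode_haplotype` and `complement`, and the example alleles, are illustrative assumptions and do not come from the thesis. At heterozygous biallelic sites the second chromosome copy is the element-wise complement of the first.

```python
from typing import List


def encode_haplotype(alleles: List[str], ref: List[str], alt: List[str]) -> List[int]:
    """Map the alleles on one chromosome copy to 0 (reference) or 1 (alternate).

    Hypothetical helper for illustration; assumes every site is biallelic.
    """
    bits = []
    for observed, r, a in zip(alleles, ref, alt):
        if observed == r:
            bits.append(0)
        elif observed == a:
            bits.append(1)
        else:
            raise ValueError(f"allele {observed!r} is neither {r!r} nor {a!r}")
    return bits


def complement(bits: List[int]) -> List[int]:
    """At heterozygous sites the other chromosome copy carries the opposite allele."""
    return [1 - b for b in bits]


# Made-up example: three heterozygous sites.
hap1 = encode_haplotype(["A", "C", "G"], ref=["A", "T", "G"], alt=["G", "C", "A"])
hap2 = complement(hap1)
```

Under this encoding, phasing n heterozygous sites amounts to choosing one binary vector of length n, with the second haplotype determined by the first.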
MCMC is an extremely popular class of statistical methods for simulating autocorrelated draws from target distributions, including posterior distributions in Bayesian analysis. Techniques for sampling categorical variables in MCMC have been developed in a variety of disparate settings. Samplers include Gibbs, Metropolis-Hastings, and exact Hamiltonian-based samplers. A review of these techniques is presented, and their relevance to the genetic model is discussed.
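To illustrate the simplest of the samplers listed above, here is a minimal sketch of a Gibbs sampler for a vector of binary variables. The target is an assumed toy distribution, p(x) proportional to exp(sum_i h[i] x[i] + sum_{i<j} J[i][j] x[i] x[j]) with x in {0,1}^n, chosen only for illustration; it is not the genetic model of the thesis, and the function name `gibbs_binary` is hypothetical.

```python
import math
import random
from typing import List


def _sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def gibbs_binary(h: List[float], J: List[List[float]], n_sweeps: int, seed: int = 0) -> List[List[int]]:
    """Gibbs sampler for the toy pairwise binary model described in the lead-in.

    J is assumed symmetric; its diagonal is ignored.
    Returns one state per full sweep over the coordinates.
    """
    rng = random.Random(seed)
    n = len(h)
    x = [rng.randint(0, 1) for _ in range(n)]
    draws = []
    for _ in range(n_sweeps):
        for i in range(n):
            # Conditional log-odds of x[i] = 1 given all other coordinates.
            eta = h[i] + sum(J[i][j] * x[j] for j in range(n) if j != i)
            x[i] = 1 if rng.random() < _sigmoid(eta) else 0
        draws.append(list(x))
    return draws


# Two independent coordinates strongly pulled toward 1 and 0 respectively.
draws = gibbs_binary([50.0, -50.0], [[0.0, 0.0], [0.0, 0.0]], n_sweeps=20)
```

Each coordinate update draws from its exact full conditional, so no accept/reject step is needed. Successive draws are autocorrelated, which is why convergence has to be assessed.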
An important consideration in using simulated MCMC draws for inference is whether they have converged to the distribution of interest. Since the distribution is typically of a non-standard form, convergence cannot generally be proven and is instead assessed with convergence diagnostics. The convergence diagnostics developed so far focus on continuous variables and may be inappropriate for binary variables, or for categorical variables in general. Two convergence diagnostics are proposed that are tailor-made for categorical variables; they model the MCMC draws using categorical time series models. The performance of these diagnostics is evaluated in a range of simulation settings.
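The abstract does not specify the two proposed diagnostics, so the following is not a reconstruction of them. It is a generic sketch of the underlying idea of treating a binary MCMC trace as a categorical time series, here an assumed first-order two-state Markov chain. The function name `two_state_summary` and the example trace are hypothetical.

```python
import math
from typing import Dict, Sequence


def two_state_summary(chain: Sequence[int]) -> Dict[str, float]:
    """Fit a first-order two-state Markov chain to a 0/1 trace.

    a = P(0 -> 1), b = P(1 -> 0). For such a chain the stationary
    probability of state 1 is a / (a + b), the lag-1 autocorrelation is
    1 - a - b, and the effective sample size for the mean is
    n * (1 - rho) / (1 + rho).
    """
    from_0 = from_1 = n01 = n10 = 0
    for prev, cur in zip(chain[:-1], chain[1:]):
        if prev == 0:
            from_0 += 1
            n01 += cur
        else:
            from_1 += 1
            n10 += 1 - cur
    if from_0 == 0 or from_1 == 0:
        raise ValueError("the trace must contain transitions out of both states")
    a = n01 / from_0
    b = n10 / from_1
    rho = 1.0 - a - b
    # Perfect alternation (a + b = 2) gives rho = -1 and an unbounded ESS.
    ess = math.inf if a + b >= 2.0 else len(chain) * (a + b) / (2.0 - a - b)
    return {"a": a, "b": b, "stationary_p1": a / (a + b), "rho": rho, "ess": ess}


summary = two_state_summary([0, 0, 1, 1, 0, 0, 1, 1, 0])
```

Comparing such fitted quantities across several chains, or across segments of one chain, is one simple way to check whether binary draws look stationary. The diagnostics in the thesis may differ substantially from this.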
The methodology developed in the thesis is applied to estimate haplotypes. There are two main challenges involved in accounting for haplotype differences. One is estimating the true combination of genetic variants on a single chromosome, known as haplotype phasing. The other is the phenomenon of allele-specific expression (ASE), in which haplotypes can be expressed unequally. No existing method addresses these two intrinsically linked challenges together. Rather, current strategies rely on known haplotypes or family trio data, i.e. data on the subject of interest and their parents. A novel method is presented, named IDP-ASE, which is capable of phasing haplotypes and quantifying ASE using only RNA-seq data. This model leverages the strengths of both Second Generation Sequencing (SGS) data and Third Generation Sequencing (TGS) data. The long read length of TGS data facilitates phasing, while the accuracy and depth of SGS data facilitate estimation of ASE. Moreover, IDP-ASE is capable of estimating ASE at both the gene and isoform level.
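To make the notion of quantifying ASE concrete, the sketch below estimates the fraction of expression coming from one haplotype when reads have already been assigned to haplotypes. It uses a standard Beta-Binomial conjugate update with an assumed uniform prior. This is not the IDP-ASE model, which phases and estimates ASE jointly; the function name `ase_posterior` and the read counts are made up for illustration.

```python
from typing import Tuple


def ase_posterior(
    reads_hap1: int,
    reads_hap2: int,
    prior_a: float = 1.0,
    prior_b: float = 1.0,
) -> Tuple[float, float]:
    """Posterior mean and variance of the haplotype-1 expression fraction.

    Assumes reads are independent and already phased, with a
    Beta(prior_a, prior_b) prior on the fraction (hypothetical setup).
    """
    if reads_hap1 < 0 or reads_hap2 < 0:
        raise ValueError("read counts must be non-negative")
    a = prior_a + reads_hap1
    b = prior_b + reads_hap2
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1.0))
    return mean, var


# Made-up counts: 30 reads support haplotype 1 and 10 support haplotype 2.
mean, var = ase_posterior(30, 10)
```

A posterior mean far from 0.5 with small variance would suggest unequal expression of the two haplotypes. In practice the read-to-haplotype assignment is itself uncertain, which is the coupling between phasing and ASE that the abstract describes.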
allele, binary, convergence, diagnostic, haplotype, MCMC
xiii, 179 pages
Includes bibliographical references (pages 170-179).
Copyright © 2017 Benjamin Enver Deonovic