PhD (Doctor of Philosophy)
Au, Kin Fai
The purpose of this thesis is to explore Markov Chain Monte Carlo (MCMC) methodology, a powerful technique in the Bayesian framework, for binary variables. The primary application of interest is haplotype phasing, in which the haplotype is treated as a categorical variable. A haplotype is the combination of variants present in an individual’s genome, and phasing refers to estimating the true haplotype. When only biallelic, heterozygous variants are considered, the haplotype can be expressed as a vector of binary variables. Accounting for differences in haplotypes is essential for the study of associations between genotype and disease.
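The binary encoding described above can be illustrated with a minimal sketch. It assumes a convention that the abstract does not specify: 0 marks the reference allele and 1 the alternate allele at each site. The function names `encode_haplotype` and `complement` are hypothetical and chosen only for illustration.

```python
def encode_haplotype(alleles_on_copy, reference_alleles):
    """Encode one chromosome copy as a binary vector.

    Assumed convention (not taken from the thesis): 0 means the copy
    carries the reference allele at that site, 1 means it carries the
    alternate allele. Every site is assumed to be biallelic.
    """
    return [0 if allele == ref else 1
            for allele, ref in zip(alleles_on_copy, reference_alleles)]


def complement(haplotype):
    """Return the other chromosome copy at heterozygous sites.

    At a heterozygous biallelic site the two copies carry different
    alleles, so the second copy is the bitwise complement of the first.
    """
    return [1 - bit for bit in haplotype]


reference = ["A", "C", "G"]   # hypothetical reference alleles at three sites
copy_one = ["A", "T", "G"]    # hypothetical alleles observed on one copy

h1 = encode_haplotype(copy_one, reference)
h2 = complement(h1)
print(h1, h2)
```

Under this convention a single binary vector determines both chromosome copies, so phasing over n heterozygous sites amounts to choosing among the 2^n possible binary vectors.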
MCMC is a widely used class of statistical methods for simulating autocorrelated draws from target distributions, including posterior distributions in Bayesian analysis. Techniques for sampling categorical variables in MCMC have been developed in a variety of disparate settings. Samplers include Gibbs, Metropolis-Hastings, and exact Hamiltonian-based samplers. A review of these techniques is presented, and their relevance to the genetic model is discussed.
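As an illustration of the simplest of these samplers, the following is a minimal sketch of a Gibbs sampler on a binary vector. The target distribution is an assumption made for the example and is not the genetic model of the thesis: p(x) is proportional to exp(sum_i b_i x_i + sum_{i<j} W_ij x_i x_j) with x in {0,1}^n and W symmetric, so each full conditional is a Bernoulli distribution with a logistic success probability. The name `gibbs_binary` and its parameters are hypothetical.

```python
import math
import random


def gibbs_binary(bias, coupling, n_iter, seed=0):
    """Gibbs sampler for an assumed pairwise binary target.

    Target (illustrative only):
        p(x) proportional to exp(sum_i bias[i]*x[i]
                                 + sum_{i<j} coupling[i][j]*x[i]*x[j])
    with x[i] in {0, 1} and a symmetric coupling matrix.
    """
    rng = random.Random(seed)
    n = len(bias)
    x = [rng.randint(0, 1) for _ in range(n)]  # arbitrary starting state
    draws = []
    for _ in range(n_iter):
        for i in range(n):
            # Log-odds of x[i] = 1 given all other coordinates.
            field = bias[i] + sum(coupling[i][j] * x[j]
                                  for j in range(n) if j != i)
            p_one = 1.0 / (1.0 + math.exp(-field))
            x[i] = 1 if rng.random() < p_one else 0
        draws.append(list(x))  # store one draw per full sweep
    return draws


# Two independent coordinates with strong opposite biases.
draws = gibbs_binary([5.0, -5.0], [[0.0, 0.0], [0.0, 0.0]], n_iter=2000)
mean_first = sum(d[0] for d in draws) / len(draws)
mean_second = sum(d[1] for d in draws) / len(draws)
print(mean_first, mean_second)
```

Because each coordinate is updated from its exact full conditional, every proposal is accepted; the cost is that strongly coupled coordinates can mix slowly, which is one motivation for the Metropolis-Hastings and Hamiltonian-based alternatives.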
An important consideration in using simulated MCMC draws for inference is whether they have converged to the distribution of interest. Since the distribution is typically of a non-standard form, convergence cannot generally be proven and is instead assessed with convergence diagnostics. The convergence diagnostics developed so far focus on continuous variables and may be inappropriate for binary variables, or for categorical variables in general. Two convergence diagnostics are proposed that are tailor-made for categorical variables; they model the MCMC draws using categorical time series models. The performance of the diagnostics is evaluated under various simulations.
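To give a sense of what modeling binary draws as a categorical time series can look like, here is a minimal sketch that fits a first-order two-state Markov chain to a binary trace. This is not one of the two diagnostics proposed in the thesis, whose details the abstract does not give; it is the simplest model of that kind, and the names `fit_two_state_chain` and `summarize_trace` are hypothetical.

```python
def fit_two_state_chain(trace):
    """Estimate transition probabilities of a two-state Markov chain.

    Returns (a, b) with a = P(0 -> 1) and b = P(1 -> 0), estimated by
    transition counts. This is an illustrative model, not the thesis's.
    """
    counts = [[0, 0], [0, 0]]
    for prev, cur in zip(trace, trace[1:]):
        counts[prev][cur] += 1
    if sum(counts[0]) == 0 or sum(counts[1]) == 0:
        raise ValueError("both states must appear before the final draw")
    a = counts[0][1] / sum(counts[0])
    b = counts[1][0] / sum(counts[1])
    return a, b


def summarize_trace(trace):
    """Summaries implied by the fitted two-state chain."""
    a, b = fit_two_state_chain(trace)
    rho = 1.0 - a - b            # lag-1 autocorrelation of the chain
    pi_one = a / (a + b)         # stationary probability of state 1
    if rho <= -1.0:
        ess = float("inf")       # perfectly alternating chain
    else:
        # Effective sample size under geometric autocorrelation rho**k.
        ess = len(trace) * (1.0 - rho) / (1.0 + rho)
    return {"a": a, "b": b, "rho": rho, "pi_one": pi_one, "ess": ess}


trace = [0, 0, 1, 1] * 25        # hypothetical binary MCMC trace
summary = summarize_trace(trace)
print(summary)
```

Comparing such summaries across independent chains, or across the first and second halves of one chain, is one way a fitted model could feed a convergence check. Whether the thesis proceeds this way is not stated in the abstract.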
The methodology developed in the thesis is applied to estimate haplotypes. There are two main challenges involved in accounting for haplotype differences. One is estimating the true combination of genetic variants on a single chromosome, known as haplotype phasing. The other is the phenomenon of allele-specific expression (ASE), in which haplotypes can be expressed unequally. No existing method addresses these two intrinsically linked challenges together. Rather, current strategies rely on known haplotypes or family trio data, i.e., data on the subject of interest and their parents. A novel method, named IDP-ASE, is presented that is capable of phasing haplotypes and quantifying ASE using only RNA-seq data. This model leverages the strengths of both Second Generation Sequencing (SGS) data and Third Generation Sequencing (TGS) data. The long read length of TGS data facilitates phasing, while the accuracy and depth of SGS data facilitate estimation of ASE. Moreover, IDP-ASE is capable of estimating ASE at both the gene and isoform level.
A categorical variable is a variable that can take only one of a finite number of values. A binary variable is a special type of categorical variable that takes only one of two possible values. This thesis explores statistical models that incorporate binary variables as parameters in a Bayesian model. A Bayesian model is one that treats the parameters of the model as random variables, giving them their own distribution. Markov Chain Monte Carlo (MCMC) is a widely used class of statistical methods for analyzing parameters in the Bayesian framework. A literature review is provided of the existing methodology for the use of MCMC on binary variables. The use of MCMC requires convergence diagnostics. Two convergence diagnostics for binary variables are developed in this thesis, and their performance is analyzed through simulation. Finally, the methodology developed in the thesis is applied to phase haplotypes. A haplotype is the combination of variants present in an individual’s genome, and phasing refers to estimating the true haplotype. When only biallelic, heterozygous variants are considered, the haplotype can be expressed as a vector of binary variables. Accounting for differences in haplotypes is essential for the study of associations between genotype and disease.
allele, binary, convergence, diagnostic, haplotype, MCMC
xiii, 179 pages
Includes bibliographical references (pages 170-179).
Copyright © 2017 Benjamin Enver Deonovic
Deonovic, Benjamin Enver. "MCMC sampling methods for binary variables with application to haplotype phasing and allele specific expression." PhD (Doctor of Philosophy) thesis, University of Iowa, 2017.