Document Type

PhD diss.

Date of Degree

2006

Degree Name

PhD (Doctor of Philosophy)

Department

Biomedical Engineering

First Advisor

Todd E. Scheetz

Second Advisor

Thomas L. Casavant

Abstract

Although high-throughput methods exist to identify many small disease-causing mutations (e.g., substitutions that alter an amino acid), assays to identify classes of larger mutations, such as deletions and duplications, are time-consuming, laborious, and expensive. In addition, no in silico system exists to identify intragene deletion or duplication candidates. We hypothesize that a computational system, SPeeDD (System to Prioritize Deletion or Duplication candidates), utilizing machine learning techniques can be employed to identify the most likely disease-causing deletion or duplication candidates within a gene.

Informative sequence-based features were obtained from a set of genes with known intragene deletions or duplications for data mining. Machine learning techniques were applied to this data. Sensitivity varied from 20% to 74.2% depending on the specific machine learning model used, but specificity exceeded 90% for all methods evaluated. The logistic model tree (LMT) method, which combines a decision tree with logistic regression models, yielded the best results. The SPeeDD system also succeeded in accurately predicting a recently published novel BRCA1 deletion.
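
For readers unfamiliar with the two metrics reported above, the following is a minimal sketch of how sensitivity and specificity are computed from binary predictions. It is not taken from the dissertation: the function name and the toy labels are hypothetical, and the values it produces are unrelated to the SPeeDD results.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels, where 1 = positive.

    Sensitivity = TP / (TP + FN): fraction of true positives that were found.
    Specificity = TN / (TN + FP): fraction of true negatives correctly rejected.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    # Guard against empty classes rather than dividing by zero.
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return sens, spec


# Hypothetical toy data (assumption): 1 = region with a known
# deletion/duplication, 0 = region without one.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
sens, spec = sensitivity_specificity(y_true, y_pred)
print(f"sensitivity={sens:.3f}, specificity={spec:.3f}")
```

A classifier used for prioritization can tolerate moderate sensitivity as long as specificity stays high, because few of the candidates it flags for follow-up assays will be false positives.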

These results suggest that the SPeeDD system provides good sensitivity and specificity and can be used to prioritize candidate genes and gene regions for focused screening. This will reduce the labor and associated costs of the biological assays, and should accelerate the process of mutation discovery.

Pages

ix, 96

Bibliography

88-96

Copyright

Copyright 2006 Krishna Rani Kalari