Variation-aware algorithms for cancer genome analysis
Abstract
In this thesis, I explore variation-aware algorithms for analyzing cancer genomes. The scientific community has extensively catalogued millions of mutations present in cancer cells. This information is rarely used during read alignment and variant calling because of a lack of algorithms for doing so. Rediscovering these variants wastes significant computational time and negatively impacts the sensitivity of detection, motivating the development of new solutions.
The variants in malignant cells can arise from a number of genetic and environmental sources. In the years after the Chernobyl nuclear disaster, thousands of individuals in areas where radionuclides were deposited developed thyroid cancer. The excess relative risk of thyroid cancer has been estimated to be between fifteen- and thirty-fold higher following
In Chapter 3, I develop methods for working with structural variants in variation graphs. Variation graphs have been shown to reduce reference bias and improve alignment to variant sites, though previous work has primarily focused on variants less than fifty base pairs in size. I describe the tradeoffs of various representations of large variants in the graph. I then develop several methods for genotyping and calling large variants in graphs. I construct graphs of both germline and somatic variation and describe how to work with these structures. I develop a method for fast structural variant genotyping using graph mappings and show that this significantly outperforms standard structural variant callers for certain types of variation. I describe how to locate structural variant mismapping signatures on variation graphs and how variation graphs can improve calling of structural variants.
Lastly, in Chapter 4, I demonstrate a new application of an alignment-free algorithm for genome analysis. I describe a MinHash toolkit for viral coinfection analysis and its application to Human Papillomavirus (HPV) samples. This toolkit is able to classify individual reads from multiple sequencing technologies and accurately detect clinically relevant HPV coinfections. Finally, I discuss how these approaches can be applied in other genomic analyses.
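To illustrate the general idea behind MinHash-based read classification, the following is a minimal sketch and not the toolkit described in the thesis. It assumes a bottom-k MinHash over k-mers, and all function names, the k-mer size, the sketch size, and the toy reference sequences are hypothetical choices made for illustration only.

```python
import hashlib


def kmers(seq, k):
    """Return the set of k-mers in a sequence (empty if the sequence is shorter than k)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer with a deterministic hash."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch(seq, k=4, size=16):
    """Bottom-k MinHash sketch: the `size` smallest k-mer hashes of the sequence."""
    return set(sorted(hash_kmer(km) for km in kmers(seq, k))[:size])


def jaccard_estimate(a, b, size=16):
    """Estimate Jaccard similarity from two bottom-k sketches."""
    union = sorted(a | b)[:size]
    if not union:
        return 0.0
    shared = sum(1 for h in union if h in a and h in b)
    return shared / len(union)


def classify_read(read, references, k=4, size=16):
    """Assign a read to the reference whose sketch is most similar to the read's sketch."""
    read_sketch = sketch(read, k, size)
    scores = {
        name: jaccard_estimate(read_sketch, sketch(ref, k, size), size)
        for name, ref in references.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Toy, made-up sequences standing in for two viral reference genomes.
toy_references = {
    "type_A": "AAAAACCCCCAACCAC",
    "type_B": "GGGGGTTTTTGGTTGT",
}
best, scores = classify_read("AAAAACCCCCAACCAC", toy_references)
print(best, scores)
```

Because only sketches are compared, no alignment is performed. In a coinfection setting, per-read labels like these could be tallied across a sample, so that reads assigned to more than one viral type suggest a coinfection. A practical tool would use much larger k-mers and sketches and would need a threshold for reads that match no reference well.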
Advisors
Chanock, Stephen