Variation-aware algorithms for cancer genome analysis

Dawson, Eric Thomas

Repository URI

https://www.repository.cam.ac.uk/handle/1810/304029

Repository DOI

https://doi.org/10.17863/CAM.51110

Files

Thesis (16.08 MB)

Type

Thesis

Authors

Dawson, Eric Thomas

https://orcid.org/0000-0001-5448-1653

Abstract

In this thesis, I explore variation-aware algorithms for analyzing cancer genomes. The scientific community has extensively catalogued millions of mutations present in cancer cells. This information is rarely used during read alignment and variant calling because of a lack of algorithms for doing so. Rediscovering these variants wastes significant computational time and negatively impacts the sensitivity of detection, motivating the development of new solutions.

The variants in malignant cells can arise from a number of genetic and environmental sources. In the years after the Chernobyl nuclear disaster, thousands of individuals in areas where radionuclides were deposited developed thyroid cancer. The excess relative risk of thyroid cancer following $^{131}\mathrm{I}$ exposure has been estimated at between fifteen- and thirty-fold. In Chapter 2, I analyze more than 300 thyroid cancer cases from children and young adults exposed to ionizing radiation from increased levels of $^{131}\mathrm{I}$ originating from Chernobyl. I characterize the mutational landscape of these tumors across the variant size spectrum and compare it to that of sporadic thyroid cancer cases. I investigate possible signs of radiation exposure in the genome, especially large balanced structural variants and small indels.

In Chapter 3, I develop methods for working with structural variants in variation graphs. Variation graphs have been shown to reduce reference bias and improve alignment to variant sites, though previous work has primarily focused on variants less than fifty base pairs in size. I describe the tradeoffs of various representations of large variants in the graph. I then develop several methods for genotyping and calling large variants in graphs. I construct graphs of both germline and somatic variation and describe how to work with these structures. I develop a method for fast structural variant genotyping using graph mappings and show that this significantly outperforms standard structural variant callers for certain types of variation. I describe how to locate structural variant mismapping signatures on variation graphs and how variation graphs can improve calling of structural variants.

Lastly, in Chapter 4, I demonstrate a new application of an alignment-free algorithm for genome analysis. I describe a MinHash toolkit for viral coinfection analysis and its application to Human Papillomavirus (HPV) samples. This toolkit is able to classify individual reads from multiple sequencing technologies and accurately detect clinically relevant HPV coinfections. Finally, I discuss how these approaches can be applied in other genomic analyses.
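
As a rough illustration of the general idea behind MinHash-based read classification, the following minimal Python sketch builds a bottom-s sketch of k-mer hashes for a read and assigns the read to the reference sharing the most sketch hashes. It is not the toolkit described in the thesis: the function names, the SHA-1-based hash, the default parameters, and the shared-hash scoring rule are all assumptions chosen for illustration.

```python
import hashlib


def kmers(seq, k):
    """Yield every k-length substring of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def hash_kmer(kmer):
    """Deterministic 64-bit hash of a k-mer (first 8 bytes of SHA-1; an illustrative choice)."""
    return int.from_bytes(hashlib.sha1(kmer.encode()).digest()[:8], "big")


def sketch(seq, k=4, size=16):
    """Bottom-`size` MinHash sketch: the `size` smallest distinct k-mer hashes of seq."""
    return set(sorted({hash_kmer(km) for km in kmers(seq, k)})[:size])


def classify(read, references, k=4, size=16):
    """Return the name of the reference whose sketch shares the most hashes
    with the read's sketch, or None if no reference shares any hash.

    `references` is a hypothetical mapping of reference name -> sequence.
    """
    read_sketch = sketch(read, k, size)
    best_name, best_shared = None, 0
    for name, ref_seq in references.items():
        shared = len(read_sketch & sketch(ref_seq, k, size))
        if shared > best_shared:
            best_name, best_shared = name, shared
    return best_name
```

In this toy form, a coinfection would show up as reads from one sample being assigned to more than one reference type. Realistic use would need much longer k-mers, larger sketches, and a minimum-match threshold to tolerate sequencing errors.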

Date

2019-09-27

Advisors

Durbin, Richard
Chanock, Stephen

Keywords

Cancer, Genomics, Bioinformatics, Variation Graphs, Structural Variation, Ionizing Radiation, Chernobyl, Algorithms, Kmers, Human Papillomavirus, HPV, MinHash, Thyroid Cancer, Papillary Thyroid Carcinoma, Mutational Signatures, Indels, Genetics, Genetic Variation, Variant Calling, Genome Assembly, Somatic Variation

Qualification

Doctor of Philosophy (PhD)

Awarding Institution

University of Cambridge

Rights

Attribution 4.0 International (CC BY 4.0)

Sponsorship

The majority of this thesis was supported by funds from the US National Institutes of Health Intramural Research Program as well as the Wellcome Sanger Institute. ETD was also supported by an NIH - Cambridge Trust Fellowship.

Collections

Theses - Genetics