Computational methods for the detection of somatic structural variants in cancer genomes using long-read sequencing
Abstract
Accurate detection of somatic structural variants (SVs) is critical for informing the diagnosis and treatment of human cancers. In this thesis, I present SAVANA, a computational method for the analysis of somatic SVs using long-read whole-genome sequencing data from tumours and matched normal samples. SAVANA employs machine learning to distinguish true somatic SVs from germline events and noise. Additionally, I establish best practices for benchmarking SV detection using simulated and sequencing replicates, and apply them to demonstrate SAVANA’s superior sensitivity, specificity, and speed compared to existing methods. I show that SAVANA performs robustly across a variety of clonality levels, genomic regions, SV types, and SV sizes. Using Illumina and Oxford Nanopore whole-genome sequencing data from 99 patient tumours and matched normal samples, I show that SVs reported by SAVANA are highly concordant with those detected using short-read sequencing, including in regions of complex structural variation. I also highlight the enhanced ability of long reads to identify SVs in repetitive regions where short reads cannot be mapped with high confidence. In summary, this thesis introduces SAVANA as a novel computational method to identify somatic SVs in long reads, establishes a robust framework for benchmarking SV detection, and demonstrates the method’s high consistency with, and enhanced sensitivity compared to, short reads across a large patient cohort.
