Probabilistic modelling of somatic alterations in bulk tissue and single cells using repeat DNA
Chromosomal instability characterises several cancer types, in which large-scale structural alterations of the genome accumulate at an increased rate. Somatic copy number alterations (SCNAs) are an important class of structural alteration. SCNAs have been shown to be major drivers of oncogenesis and are associated with prognosis and response to therapies.
Current sequencing and array-based methods that are used to infer SCNAs are cost-prohibitive for widespread clinical use. FAST-SeqS, a recently developed method that amplifies and sequences more than 10,000 repeat regions across the genome, is low-cost, simple and more clinically applicable. However, current computational methods do not make effective use of this assay, which limits its application to clinical medicine and to biomedical research.
In this thesis, I develop conliga, a probabilistic generative model with associated inference algorithms, to infer relative copy number from FAST-SeqS data at the amplicon level. I implement this method in R and C++ and provide the software as an open-source tool. By applying conliga and FAST-SeqS to oesophageal adenocarcinoma and related conditions, I show that this approach performs similarly to QDNAseq applied to low-coverage whole-genome sequencing, a more expensive and laborious alternative for SCNA profiling.
I explore several aspects of FAST-SeqS data and show that sample-specific biases can affect SCNA inferences. By extending the conliga model, I demonstrate that these biases can be inferred jointly with SCNA profiles. I validate these extensions by comparing the results to inferences obtained from whole-genome sequencing in prostate cancer samples.
I show that the variants present in FAST-SeqS data can be used to infer tumour purity, ploidy and allele-specific copy number. This has potential application in large-scale cancer genome studies to identify samples with sufficient purity before performing high-coverage whole-genome sequencing. Finally, I describe preliminary data showing that the FAST-SeqS protocol can be applied to single cells, enabling further extensions of the conliga model which could lead to the inference of SCNAs in single cells.
Morrissey, Edward Robert