Germline and somatic mutational processes across the Tree of Life


Abstract

Pacific Biosciences' circular consensus sequencing (CCS) employs a highly processive, engineered DNA polymerase to repeatedly sequence both the forward and reverse strands of a circular DNA template. By leveraging the redundancy of multiple passes and the complementarity of the strands, a consensus algorithm generates a highly accurate CCS read, achieving a minimum read accuracy of Q20.

In this PhD thesis, I hypothesise that CCS reads, given the similarity of CCS to double-stranded consensus (duplex) library preparation, can achieve accuracy equal to or greater than that of duplex reads, which are widely used for ultra-rare somatic mutation detection. To test this hypothesis, I designed and developed himut, a bioinformatics tool that leverages the read length and high base accuracy of CCS reads to detect and phase somatic mutations in bulk normal tissues, agnostic of species and clonality.

To demonstrate that a subset of CCS bases has sufficiently high base accuracy for single-molecule somatic mutation detection, I analysed a set of positive control samples with characterised somatic mutational processes, along with a negative control sample to establish the limit of detection. The mutational spectra derived from the detected somatic mutations closely matched the expected profiles of the positive controls, supporting the feasibility of single-molecule somatic mutation detection using CCS reads. To empirically define the limit of detection, I analysed cord blood granulocytes, which harbour very few somatic mutations, and estimated that the empirical accuracy of CCS bases reported as Q93 ranges from Q60 to Q90, depending on the substitution type and trinucleotide sequence context. Finally, I demonstrate that the observed false positives arise from inaccurate base quality score estimation during CCS generation, rather than from limitations inherent to the CCS library preparation or sequencing process itself.
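For readers less familiar with the Phred scale used above, a quality score Q corresponds to an error probability of 10^(-Q/10). The following is a minimal sketch of that conversion, applied to the scores mentioned in this abstract; the function names are illustrative and are not taken from himut.

```python
import math


def phred_to_error_probability(q: float) -> float:
    """Convert a Phred quality score Q to an error probability, p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Convert an error probability p (0 < p <= 1) to a Phred score, Q = -10*log10(p)."""
    if not 0 < p <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(p)


if __name__ == "__main__":
    # Scores mentioned in the abstract: the Q20 read-level minimum, the
    # estimated empirical range (Q60 to Q90), and the reported Q93 base quality.
    for q in (20, 60, 90, 93):
        p = phred_to_error_probability(q)
        print(f"Q{q}: error probability {p:.3g} (about 1 error in {1 / p:.3g} bases)")
```

On this scale, each 10-point increase in Q is a tenfold reduction in error probability, so the gap between a reported Q93 and an empirical Q60 spans more than three orders of magnitude.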

Confident in the ability to detect single-molecule somatic mutations in normal human tissues, I extended this approach to approximately 500 eukaryotic species sequenced and assembled as part of the Darwin Tree of Life (DToL) project, which aims to generate reference genomes for around 70,000 eukaryotic species from Great Britain and Ireland. I used DeepVariant and himut to call germline and somatic mutations, respectively, and identified germline and somatic mutational signatures not previously observed in human samples. Phylogenetic analysis of these novel mutational signatures revealed mutational processes conserved across multiple taxonomic levels. One such example is the conserved pattern of C>T somatic mutations at NCG trinucleotides, caused by the spontaneous deamination of 5-methylcytosine to thymine, observed across the Animal, Fungi, and Plant kingdoms.
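To make the "C>T at NCG" notation concrete, the sketch below classifies a single-base substitution by its substitution type and trinucleotide context. It assumes the commonly used convention of reporting each substitution from the pyrimidine (C or T) strand; the abstract does not state which convention the thesis uses, and the function names are hypothetical rather than part of himut.

```python
# Translation table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def classify_substitution(trinucleotide: str, alt: str) -> tuple[str, str]:
    """Return (substitution type, trinucleotide context) on the pyrimidine strand.

    `trinucleotide` is the reference base flanked by its 5' and 3' neighbours;
    `alt` is the mutated base. Assumed convention: if the reference base is a
    purine, both are reverse-complemented so the reference base is C or T.
    """
    tri, alt = trinucleotide.upper(), alt.upper()
    if len(tri) != 3 or len(alt) != 1 or set(tri + alt) - set("ACGT"):
        raise ValueError("expected a 3-base context and a 1-base allele over ACGT")
    if tri[1] == alt:
        raise ValueError("alternate allele must differ from the reference base")
    if tri[1] in "AG":
        tri, alt = reverse_complement(tri), reverse_complement(alt)
    return f"{tri[1]}>{alt}", tri


def is_ncg_c_to_t(trinucleotide: str, alt: str) -> bool:
    """True if the substitution is C>T with a G immediately 3' of the C (NCG)."""
    substitution, context = classify_substitution(trinucleotide, alt)
    return substitution == "C>T" and context[2] == "G"
```

For example, a G>A change in the context CGT is reported here as C>T at ACG, because it is the same event viewed from the opposite strand.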

I believe that the methods and analyses presented in this PhD thesis will enable the discovery of novel mutational processes across all forms of life and aid in identifying their underlying etiologies.

Description

Date

2023-05-15

Advisors

Campbell, Peter
Durbin, Richard

Qualification

Doctor of Philosophy (PhD)

Awarding Institution

University of Cambridge

Rights and licensing

Except where otherwise noted, this item's license is described as All rights reserved

Sponsorship

Wellcome Sanger Institute 4-Year PhD studentships