Compositional models for mutational signature analysis


Type

Thesis

Authors

Morrill Gavarro, Lena 

Abstract

Background: Cancer is a process in which the accumulation of mutations leads to a clonal expansion of cells, forming a tumour mass with the capacity to expand into surrounding tissues. Understanding how these mutations arise in the first place is important from an epidemiological and evolutionary point of view. Several external (exogenous) factors, such as UV light, tobacco smoke, or ionising radiation, can create mutations. Moreover, the process of DNA replication, which is necessary to create two cells from one, is, like any other copying mechanism, not entirely faithful. Therefore, spontaneous (endogenous) mutations are created each time the genetic material of a cell is replicated. Some mechanisms of mutation lead to more mutations than others, and can create them either with varying intensity or at a constant rate throughout life, or even throughout tumour development. Mutational signatures were introduced as a proxy to quantify the number of mutations created by each mutational process. This thesis focuses on statistical methods for the analysis of such mutational signatures. Crucially, for the type of questions that I address, mutational signatures are compositional data, because we are interested in studying their relative contribution to the total mutation load. Because of this, they have to be analysed in a multivariate way and in relative terms.

Approach: I introduce appropriate compositional models to analyse two types of mutational signature: single-nucleotide polymorphism signatures and copy number signatures. The Dirichlet-multinomial mixed-effects model for single-nucleotide data is shown to have good sensitivity compared to existing fixed-effects alternatives, and higher specificity than compositional models that do not allow any overdispersion. Similarly, the incorporation of random intercepts in the logistic-normal model for copy number data increases sensitivity with respect to the fixed-effects version. The models are publicly available on GitHub and are readily applicable to other types of compositional data.
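As an illustration of the distribution named above, the following is a minimal sketch of the Dirichlet-multinomial log-probability for a vector of per-signature mutation counts. It is not the thesis code or the code in the GitHub repository: the function names, the example counts, and the concentration parameters are hypothetical, and the sketch covers only the likelihood of one sample, without the mixed-effects regression structure.

```python
from math import exp, lgamma


def signature_proportions(counts):
    """Close a count vector to proportions that sum to one (a composition)."""
    total = sum(counts)
    return [c / total for c in counts]


def dirichlet_multinomial_logpmf(counts, alpha):
    """Log-probability of `counts` under a Dirichlet-multinomial.

    `alpha` holds the positive concentration parameters, one per category.
    Smaller values of sum(alpha) give more overdispersion relative to a
    plain multinomial with the same mean proportions.
    """
    if len(counts) != len(alpha):
        raise ValueError("counts and alpha must have the same length")
    n = sum(counts)
    a = sum(alpha)
    # Normalising term: n! * Gamma(a) / Gamma(n + a)
    log_p = lgamma(n + 1) + lgamma(a) - lgamma(n + a)
    # Per-category terms: Gamma(x + alpha) / (x! * Gamma(alpha))
    for x, al in zip(counts, alpha):
        log_p += lgamma(x + al) - lgamma(x + 1) - lgamma(al)
    return log_p


# Hypothetical example: mutations attributed to three signatures in one sample.
example_counts = [12, 30, 8]
example_alpha = [2.0, 5.0, 1.5]  # assumed values, for illustration only
print(signature_proportions(example_counts))
print(exp(dirichlet_multinomial_logpmf(example_counts, example_alpha)))
```

Only the proportions, not the total count, carry the compositional information, which is why the abstract stresses analysis in relative terms.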

Results: Firstly, I use the mixed-effects Dirichlet-multinomial model to characterise the differential abundance patterns between clonal and subclonal mutations across 23 cancer types of the PCAWG cohort. There is ubiquitous change, which can already be detected at the nucleotide level. There is higher dispersion (higher variability between samples) of signatures in the subclonal group, possibly indicating the presence of different clones with distinct active mutational processes. The signatures with the clearest differential abundance are signatures of low abundance, many of which are of unknown aetiology and tend to be, to some extent, the result of bleeding from other signatures. Although we should be wary of these signatures, differential abundance persists when they are excluded, and the relative changes between clonal and subclonal mutations, in the form of ranked coefficients of the signatures of highest confidence, are robust to the subset of signatures used.

Secondly, I explore the use of similar models for the study of copy number signatures, to answer three questions about the mutation dynamics in high-grade serous ovarian cancer: whether the relative contribution of mutational processes changes from early-stage to late-stage samples, whether it changes from diagnosis to relapse, and whether it changes from whole-genome-duplicated (WGD) to non-whole-genome-duplicated samples. The copy number signature landscape differs significantly between early-stage and late-stage samples in that there are much higher rates of WGD in late-stage samples. However, there is no noticeable coordinated difference between matched archival and relapsed samples, although a few patients experience WGD between archival and relapse. Overall, the results indicate that there are high levels of heterogeneity in copy number signatures between patients, but less so within patients, with the exception of isolated cases. They therefore suggest that, although the first response to therapy is dictated by the mechanisms of repair, relapse does not occur at the level of the genome at 0.1x resolution. Moreover, using a Support Vector Machine, I show that copy number signatures can be used to distinguish WGD from non-WGD samples with 95% accuracy, using two independently labelled cohorts of TCGA and ICGC samples.

Outlook: This thesis introduces the use of compositional models to study the dynamics of mutational signatures in the comparison of two groups of samples. These models can be readily applied to several other regression settings, in this discipline or in others in which compositional data arise. From both the copy number and the point mutation standpoint, the models indicate that signatures are dynamic. Further work is needed to better elucidate which mutational signatures are behind the changes, and which mutational processes are behind the signatures. Besides the biological insight into DNA mutation and repair, these results have potential clinical relevance, as cancer treatment often targets or takes advantage of impaired mechanisms of repair.

Date

2022-04-29

Advisors

Markowetz, Florian
Brenton, James
Wallace, Chris

Keywords

cancer research, compositional data, high grade serous ovarian carcinoma, multivariate models, mutational signatures

Qualification

Doctor of Philosophy (PhD)

Awarding Institution

University of Cambridge

Sponsorship

Wellcome Trust (108864/B/15/Z)
Wellcome Trust scholarship RG92770