Theses - Wellcome Sanger Institute


Recent Submissions

  • ItemEmbargo
    Probabilistic models to resolve cell identity and tissue architecture
    Kleshchevnikov, Vitalii [0000-0001-9110-7441]
    Cell identity drives cell-cell communication and tissue architecture and is in turn regulated by cell-extrinsic cues. Cell identity is determined by the combination of intrinsic, developmentally established transcription factor (TF) use and constitutive as well as cell communication-dependent TF activities. In my thesis, I developed two probabilistic models that advance the understanding of these processes using single-cell and spatial genomic data. Spatial transcriptomic technologies promise to resolve cellular wiring diagrams of tissues in health and disease, but comprehensive mapping of cell types in situ remains a challenge. I present cell2location, a Bayesian model that can resolve fine-grained cell types in spatial transcriptomic data and create comprehensive cellular maps of diverse tissues. Cell2location accounts for technical sources of variation and borrows statistical strength across locations, thereby enabling the integration of single-cell and spatial transcriptomics with higher sensitivity and resolution than existing tools. We assess cell2location in three different tissues and demonstrate improved mapping of fine-grained cell types. In the mouse brain, we discover fine regional astrocyte subtypes across the thalamus and hypothalamus. In the human lymph node, we spatially map a rare pre-germinal centre B cell population. In the human gut, we resolve fine immune cell populations in lymphoid follicles. Collectively, our results present cell2location as a versatile analysis tool for mapping tissue architectures in a comprehensive manner. Cell identity and plasticity are regulated by a combinatorial code mediated by transcription factors and the cell communication environment. Systematically dissecting how the regulatory code robustly defines the vast complexity of cell populations across tissues is a long-standing challenge.
Measured using the assay for transposase-accessible chromatin with sequencing (ATAC-seq), DNA accessibility provides a readout of intermediate gene regulation steps at single-cell resolution, with technologies measuring both RNA and ATAC providing the necessary evidence to build mechanistic models of regulation. Existing methods address one or several subproblems of modelling DNA accessibility. For example, the DNA sequence-based deep learning models represent combinatorial interactions and in-vivo TF-DNA recognition preferences. In contrast, GRN models use TF abundance profiles across cells and in-vitro-derived TF-DNA recognition preferences, optionally incorporating ATAC-seq data as a filter. All models learn cell-type specific weights/properties and don't generalise to new TF abundance states such as new cell types. Therefore, we are missing an end-to-end mechanistic model that represents all steps of the biological process, that generalises to both new DNA sequences and TF abundance combinations and can simultaneously characterise hundreds to thousands of cell states observed in single-cell genomics atlases. Here, I formulated cell2state, a mechanistic end-to-end probabilistic model of TF recruitment to a chromatin locus and downstream TF effect on DNA accessibility. Cell2state is designed to achieve the generalisation of regulatory predictions to unseen cell types. Cell2state A) estimates TF nuclear protein abundance and models B) how TFs recognise DNA, C) how TF sites in DNA lead to TF recruitment to a chromatin locus, D) how the activity of DNA-associated TFs affects chromatin accessibility. To evaluate generalisation, I defined the computational problem and developed a workflow for predicting the scATAC-seq readout for previously unseen chromosomes and cell types. I show that cell2state outperforms the state-of-the-art deep learning models (ChromDragoNN) at explaining DNA accessibility differences across cells. 
Finally, to look at cell state plasticity, I developed ways to use cell2state to simulate the possible chromatin states given TF abundance of source cell types.
  • ItemEmbargo
    Phylogenetic studies into the development of foetal tissues and their neoplastic derivatives
    Oliver, Thomas
    Over a lifetime, each cell in the human body acquires a unique combination of somatic mutations that encode its ancestry, exposure to mutagens and strategy for optimising survival. Studies into normal and neoplastic tissues in adults have delineated clonal architecture and oncogenesis at an exquisite resolution. However, for over a century, data have indicated that childhood cancer is rather different, most likely emerging as an aberration of foetal development. This thesis explores how foetal tissues and their neoplastic progeny propagate, focusing specifically on the placenta, germ cell tumours and high-grade midline gliomas. In Chapter 1, I introduce the principles and technological advances that allow us to infer development from somatic mutations. I highlight existing evidence for the distinct origins of childhood and adult cancers and discuss the unique mutational forces that prenatal cells endure. Applying whole genome sequencing (WGS) to bulk and microdissected placental tissues, I begin my own lines of enquiry in Chapter 2. I show that placental trophoblast is unique amongst normal tissues in its clonal construction and sustains a pattern and rate of mutation normally seen in cancer. Each placental biopsy represents a driverless expansion of a spatially-fixed early embryonic precursor, making the placenta inherently mosaic. I turn my attention to neoplasia in Chapters 3 and 4. Using WGS from bulk and microdissected germ cell tumours in Chapter 3, I detail differences in cancer genomes by age group, including their mutational exposures. Where tumours form many, apparently normal tissues, RNA sequencing captures underlying foetal transcriptional signals and diversity that cannot be explained by mutation. In Chapter 4, I outline my work using WGS from post-mortem normal and neoplastic tissues of three children with high-grade midline gliomas. 
Germline *NF1* mutation is associated with independent, second *NF1* hits that pervade the macro- and microscopically unremarkable brain and spinal cord. Each glioma is characterised by abundant subclonal drivers with different lineages mutating the same genes recurrently, possibly exacerbated by radiotherapy. I conclude in Chapter 5 by exploring the clinical and histopathological utility of these types of experiments and considering new studies to gauge the impact of mutation on organogenesis. Lastly, I highlight other areas of child health where study of somatic mutation may prove beneficial.
  • ItemOpen Access
    Discovering variation from cell atlases: comparative methods for single-cell genomics
    Dann, Emma
    Single-cell genomics technologies have become the norm to investigate cell-to-cell heterogeneity and collective research efforts have built “cell atlases” to characterize previously unknown cell types and states in tissues. For this task, the collection of cells from multiple donors or tissue sites is often required, and differences between cell populations from different samples are largely attributed to technical variation. More recently, multi-sample and multi-condition experiments are being designed to quantify differences between biological conditions, using atlases from healthy tissues as references. These studies require robust quantification of biological differences between cellular phenotypes while accounting for variability between samples and conserving fine-grained information on cell-to-cell heterogeneity. The work presented in this thesis focuses on development and application of computational strategies and best practices for comparative analysis between samples of different biological conditions profiled with single-cell genomics. Chapter 2 presents Milo, a statistical framework for differential abundance testing on single-cell data. By quantifying differences in cell abundances between conditions in partially overlapping neighborhoods on a k-nearest neighbor graph, Milo can identify perturbations that are obscured by discretizing cells into clusters, and it maintains false discovery rate control even in the presence of batch effects. I present a comprehensive benchmark against alternative differential abundance testing strategies, using simulations and scRNA-seq data. I then demonstrate the utility of Milo by studying perturbations across lineages in a dataset of human liver cirrhosis. Chapters 3 and 4 present a case-study where Milo and other integration and comparative analysis methods were used to study the development of the human immune system as a distributed network across tissues.
In Chapter 3, I describe the integration of scRNA-seq data of almost a million cells from nine prenatal organs across 11 weeks of gestation, to define common and tissue-specific immune cell populations, and how these compare with immune cell states identified in adulthood. Using this integrated view, I show how I used Milo to identify stage- and tissue-specific subpopulations of myeloid and lymphoid cells, and discuss their potential role in maturation of immune function and tissue morphogenesis. Chapter 4 is centered around the analysis of spatial cellular environments across fetal tissues. By integrating our cross-tissue scRNA-seq atlas with spatial transcriptomics data, we identified and compared cellular niches in the developing liver, spleen, thymus and gut. In Chapter 5, I present a systematic meta-analysis to identify best practices to identify cell states altered in human disease using integration and differential analysis on single-cell datasets. In particular, I examined whether atlas datasets are suitable references for disease-state identification or whether matched control samples should be employed, to minimise false discoveries driven by biological and technical confounders. By quantitatively comparing the use of atlas and control datasets as references for identification of disease-associated cell states, on simulations and real disease scRNA-seq datasets, I show that reliance on a single type of reference dataset introduces false positives. Conversely, using an atlas dataset as reference for latent space learning followed by differential analysis against a matched control dataset leads to precise identification of disease-associated cell states. 
I demonstrate how optimized design can guide discovery in two applications: using a cell atlas of blood cells from 12 studies to contextualise data from a case-control COVID-19 cohort to detect cell states associated with infection, and distinguish heterogeneous pathological cell states associated with distinct clinical severities. In parallel, I used a healthy reference lung atlas generated from 14 studies to study single-cell profiles obtained from patients with idiopathic pulmonary fibrosis, characterising two distinct aberrant basal cell states associated with disease and identifying unique marker genes with therapeutic potential. In the final chapter, I discuss how advancements in multi-condition single-cell analysis open up a new phase of population-level tissue genomics studies, and the impact that it will have on biomedicine.
  • ItemOpen Access
    Gene Regulatory Networks at Single-Cell Resolution: an approach to exploring the impact of genomic regulation on cellular heterogeneity
    Xu, Zhihan
    The computational analysis of single-cell RNA sequencing data provides a great opportunity to infer gene regulatory associations in different tissues, cell types and cells. However, there are many challenges still to be overcome. In this thesis, I discuss why this opportunity is fascinating but challenging from a computational point of view, and I build a computational method to demonstrate the plausibility of inferring one gene regulatory network (GRN) for each single cell, with evaluations conducted from multiple perspectives. In Chapter 2, I investigate data-fitting models in existing computational methods which infer GRNs. Taking full advantage of the single-cell resolution in the input sequencing data leads to the use of machine learning approaches, such as instance-wise feature selection models. I assess their eligibility at a methodological level, implement an instance-wise feature selection model, and aim to generate GRNs which learn cell-to-cell variations from single-cell sequencing. Based on the implementation above, I build a new method to infer a GRN at single-cell resolution, termed a single-cell specific GRN (scGRN). The inferred scGRN can be used to explore GRNs across cell types. Since the true underlying biological mechanism in scGRN cannot be known, my method is benchmarked based on available biological network databases which do not reveal cell-to-cell variation on GRN. Analysis of scRNA-seq data can provide biological insights into cellular heterogeneity because of the single-cell resolution; since scGRNs also contain cell-resolution information, scGRNs can also be analysed with respect to their corresponding cell-related properties. Since scGRN is generated from scRNA-seq data, the inherent cell-related patterns in scGRN should not deviate substantially from the results in the existing analysis for scRNA-seq.
A resulting scGRN that suggests very different cellular information would cast doubt on its reliability even before GRN validation is conducted, so cellular information provides a powerful angle to evaluate scGRN. In Chapter 3, I build a pipeline to analyse scGRNs and I refer to three cell-related properties for comparison: cell types, cell-type trajectories and cell-type specific marker genes. The results demonstrate how the cell-related properties derived from scGRN differ from those derived from scRNA-seq. As the ultimate goal of this thesis, the analysis of the resulting scGRNs offers an unprecedented opportunity to explore changes in regulatory patterns along cell types or tissues. Thanks to the single-cell resolution in biologically interpretable scGRN, various analyses can be conducted flexibly, such as deriving implications from cell type specific GRNs. Chapter 4 focuses on the exploration of interactions between regulatory edges and cell subpopulations from scGRN. In other words, scGRNs are analysed to explore the changes in GRN edges across cell types, cellular lineages or organs. Based on the existing biological understanding, some expected patterns are summarised to evaluate scGRN. Although it does not yield a quantitative score indicating the performance of a method, the evaluation of patterns is still sufficient to identify unreasonable scGRN candidates, and this work provides a novel perspective to evaluate GRNs. Besides the utility for evaluation, the observed patterns also suggest some meaningful biological implications about the impact of genomic regulation on cellular differentiation. I hope the methodology developed in this thesis will inspire more method developers to pursue scGRN, and I hope the multi-angle evaluations of scGRN demonstrated here can facilitate more biological insights into the regulatory mechanisms driving cellular heterogeneity.
  • ItemEmbargo
    Developing and applying new methods to understand blood stage growth in Plasmodium falciparum
    Muhwezi, Allan
    *Plasmodium falciparum* parasites cause nearly half a million deaths from malaria each year. There is still no highly effective vaccine, and resistance has emerged or is emerging to all current drugs. All the pathology and symptoms associated with malaria are caused by the growth of these parasites inside human red blood cells. If the parasites are to survive and be successful, they must invade, grow and survive under diverse micro-environmental conditions within red blood cells. Understanding the *P. falciparum* genes that regulate these key developmental processes could lead to the identification of targets for new drugs. However, while sequencing efforts have led to a good understanding of the *P. falciparum* genome and how it evolves over time and space, a disconnect remains between the amount of genome sequence data and what exactly these genes do – the phenotype. This therefore underpins my PhD thesis with a major goal of bridging the gap in our understanding of gene function in *P. falciparum*. I have developed high-throughput approaches combining next generation sequencing, flow cytometry and cell sorting to develop assays to accurately phenotype key aspects of the parasite life cycle such as invasion, cell cycle progression, multiplicity of invasion and replication capacity. These assays have been applied to a panel of *P. falciparum* genes in which genes potentially involved in specific developmental transitions have been deleted and reveal new phenotypes not described elsewhere in malaria literature.
  • ItemOpen Access
    Understanding Development of Human Immunity One Cell at a Time
    Suo, Chenqu [0000-0002-8813-0875]
    The emergence of single-cell and spatial multi-omic technologies has revolutionized our understanding of the immune cells. The international Human Cell Atlas consortium has spearheaded and coordinated a global effort to construct atlases of human tissues across multiple developmental stages. This is revealing the identity and function of cells at unprecedented resolution and depth, enhancing our understanding of the immune system in health and disease. The Human Cell Atlas data is also providing us with valuable prior knowledge to guide *in vitro* engineering of immune cells towards specific cell states. This thesis aims to study human immune cell development from both *in vivo* and *in vitro* angles, utilizing both experimental and newly developed computational approaches. Importantly, I focus on applying insights from single-cell genomics studies of human immune cell development to *in vitro* lymphocyte engineering. Chapter 1 introduces the recent advances in single-cell technologies in the context of previous methods used to study immune cells, summarizes human immune cell development with a focus on lymphocyte development, and presents published efforts in *in vitro* T cell engineering. Chapter 2 covers the assembly of a multi-organ single cell atlas of the developing human immune system and describes the insights derived from this comprehensive atlas. We uncovered system-wide blood and immune cell development in organs other than primary haematopoietic organs, and characterized prenatal innate-like B and T cells, namely B1 cells and unconventional T cells in humans for the first time. Chapter 3 is dedicated to a new computational tool, Dandelion, that we developed for single cell antigen receptor sequencing (scVDJ-seq) data analysis. We also devised a novel strategy to leverage scVDJ-seq data in pseudotime trajectory inference. 
The application of Dandelion improved the alignment of human thymic development trajectories of double positive T cells to mature single-positive CD4/CD8 T cells, and provided novel insights into the origins of human B1 cells and ILC/NK cell development. Chapter 4 introduces another computational tool, Genes2Genes, for aligning two single-cell trajectories. By applying Genes2Genes to *in vivo* and *in vitro* T cell development, we found that *in vitro* single positive (SP) T cells were matched to an immature state of the *in vivo* SP T cells while lacking the final TNFα signaling. Chapter 5 summarizes the insights gained from my work on human immune cell development and highlights potential future directions of research in this area.
  • ItemEmbargo
    The influence of genetic background on drug resistance in the malaria parasite Plasmodium falciparum
    Carpenter, Emma [0000-0002-1911-6842]
    *Plasmodium falciparum* is a parasite that causes the most severe forms of human malaria; with multidrug resistance to modern antimalarials emerging and spreading in certain parasite populations, understanding the mechanisms for acquiring multidrug resistance will be key for predicting which genes may become important for clinical resistance in the future. Transporter proteins have key roles in drug resistance across a variety of organisms — in *P. falciparum*, the spread of resistance to the antimalarial chloroquine in the 1950s and 60s is largely due to mutations in the Chloroquine Resistance Transporter (PfCRT). Variations in the ABC transporter *pfmdr1* modulate sensitivity to multiple antimalarials, including mefloquine and lumefantrine, popular partner drugs in first-line artemisinin combination therapies. During my PhD, I assessed a panel of parasite lines that had undergone *in vitro* evolution experiments, identifying polymorphisms in a poorly characterised ABC transporter, ABCI3, that confer resistance to several experimental antimalarial compounds with diverse chemical scaffolds, suggesting that ABCI3 could mediate resistance to next-generation antimalarial drugs. Next, I examined a novel PfCRT mutation currently spreading in Southeast Asia that confers piperaquine resistance when edited into the wild-type *pfcrt* allele of laboratory lines using CRISPR/Cas9, raising concerns that piperaquine resistance could arise in susceptible populations with a single nucleotide polymorphism. Finally, I introduced unique 11-nucleotide ‘barcodes’ into 37 progeny from the NF54 × Cam3.II genetic cross using CRISPR/Cas9. The African NF54 parasite is considered wild-type and is broadly drug susceptible. The Cambodian Cam3.II parasite is multidrug-resistant owing to numerous mutations, including the R539T mutation in *pfkelch13*, a gene associated with artemisinin resistance. 
R539T has been demonstrated to have a high fitness cost, and similar PfKelch13 mutations display fitness costs that are exacerbated when engineered into non-Southeast Asian parasites. Combination of ‘barcoded’ parasite lines into a single flask, or ‘pool’ creates a highly valuable screening resource, allowing one to observe the change in proportion of each progeny within this pool over time via next-generation sequencing of the barcoded locus; linkage analysis can therefore be performed on data generated within a single experimental run. Using this tool, I note the enrichment of certain haplotypes of interest under various antimalarial pressures, including *pfabci3*, *pfcrt* and *pfkelch13* variants. I discuss the future avenues of research that will reveal the quantitative trait loci responsible for the differential survival of progeny under these antimalarial pressures, which will shed light on the complex interplay between drug resistance mutations, fitness, and genetic background.
  • ItemOpen Access
    Chromosome evolution in Rhabditina (Nematoda) with a focus on programmed DNA elimination
    Gonzalez De La Rosa, Pablo Manuel
    Recent advances in DNA sequencing technologies have enabled research on the mechanisms that shape chromosome evolution in diverse taxa. My thesis focuses on nematodes from the Rhabditina suborder, which exhibit a high rate of chromosome rearrangement and have several species with chromosome-level assemblies available for comparison. To study chromosome evolution in this group, I inferred a set of loci that comprised the ancestral linkage groups, called Nigon elements, in rhabditid nematodes. This showed that the sex chromosome of the model nematode Caenorhabditis elegans is the product of a fusion of an ancestral autosome and the ancestral sex chromosome. As part of this work, I also generated the first telomere-to-telomere genome with no gaps of unknown sequence of a non-Caenorhabditis nematode, Oscheius tipulae. We serendipitously discovered that O. tipulae undergoes programmed DNA elimination (PDE) by identifying that both ends of each chromosome had two alternative terminal telomere repeat arrays. PDE is a developmentally controlled process whereby somatic cells selectively lose part of their genome. In total, less than 0.5% of the 60 Mb O. tipulae genome is eliminated. To gain insights into the evolution of PDE, I generated chromosome-level assemblies for four Oscheius species that are closely related to O. tipulae. These revealed PDE in all species, and linked PDE with variations in telomere length and breakage-associated sequence motifs. I characterised a novel gene family linked to chromosome ends, derived from helitron2-like transposable elements, and variably eliminated regions in Oscheius dolichura involving divergent paralogs of conserved genes. To facilitate comparative research, I propose a framework outlining a standard set of features for future studies. Finally, I applied this framework to characterise PDE in Auanema rhodensis, a species closely related to Oscheius in which over 60% of its genome is eliminated by PDE. 
The eliminated regions in A. rhodensis are also delineated by a sequence motif, contain protein-coding genes, ncRNAs, and share the same large tandem repeats between all autosomes. Together, my research sheds light on the evolution of nematode chromosomes and PDE.
  • ItemOpen Access
    Geographic Migration and Evolution of Streptococcus pneumoniae
    Belman, Sophie
    Streptococcus pneumoniae (the pneumococcus) is a human-obligate, opportunistic bacterium. It resides asymptomatically in the nasopharynx of both children and adults. It occasionally goes on to cause local disease such as otitis media, or severe invasive disease as with pneumonia and meningitis. It is prevalent globally and rates of carriage have an inverse relationship with country income. The pneumococcus comprises a massive diversity of >800 global lineages and 100 antigenically distinct serotypes. Many lineages and serotypes co-circulate endemically in any given location. Highly immunogenic conjugate vaccines have been implemented in >76% of national immunization schedules; we also have many antimicrobials to which the pneumococcus is susceptible. Following the introduction of the pneumococcal conjugate vaccine in the 2000s there have been global ecological expansions and contractions of serotypes, lineages, and antimicrobial resistance (AMR). There are distinct qualitative patterns which characterise the global geographic structure of both the endemically circulating pneumococcus and the constantly changing ecology of its genes and loci. The extent and mechanisms of spread, and vaccine-driven changes in fitness and AMR, remain largely unquantified. Using geolocated genome sequences from South Africa (2000-2014) I developed models to reconstruct spread. I implemented simple statistical frameworks which can account for variable surveillance in both space and time. I also pair detailed human mobility data from Facebook users with genomic data within a mechanistic model to describe how human movement drives the geographic spread of the pathogen. I estimated that pneumococci only become homogeneously mixed across South Africa after about 50 years of transmission, with the slow spread driven by the focal nature of human mobility.
Further, the human population density in the municipality of introduction appears to be important as well — with a more rapid radius of spread when introduced to rural areas. I include both disease and carriage isolates in our models. There are similar ecological shifts among healthy Cambodian children, following vaccination, as among disease isolates. Among South African samples I utilize a logistic model to estimate the population level changes in fitness of strains that are (vaccine type, VT) and are not (non-vaccine type, NVT) as well as differences in strain fitness between those that are and are not resistant to penicillin continuously, across the vaccine period (PCV7: 2009; PCV13: 2011). In the years following vaccine implementation the relative fitness of NVT compared to VT strains increased with an increasing proportion of these NVT strains becoming penicillin resistant. These estimates indicate that initial vaccine-linked decreases in AMR may be transient. Data on human mobility between countries is scarce so to understand between country pneumococcal migration-dynamics I developed a simulation approach using Bayesian optimization for Likelihood-free inference. Incorporating thousands of pneumococcal genomes from South Africa, Malawi, The Gambia, and Kenya I included only neutral sites. I estimated migration rates and directions of migration between countries. I found some heterogeneity between lineages and across demes, and an inverse relationship between migration and the destination country population size. This model could be used to inform modeling frameworks on wider pneumococcal spread and timing and implementation of public health interventions. I also worked to better understand the history and evolution of the hundreds of extant GPSCs by delving into a 5700 year-old metagenome with genomic reads containing high identity to the pneumococcus. 
While this resulted in identifying a streptococcal ancestor with no polysaccharide capsule it did not further our knowledge of extant pneumococci; as such this paper is included in Appendix A. This work has contributed valuable insight into pneumococcal migration and mechanisms of spread. I have quantified important fitness dynamics and incorporated them into our mobility models. Furthermore, I have developed a method which can now estimate migration parameters asymmetrically between countries elucidating paths of global pneumococcal spread. Together this thesis marks the advancement in our understanding of the geographic migration of Streptococcus pneumoniae and lays the groundwork for understanding its diverse, complex, global spread.
  • ItemEmbargo
    Somatic Mutations in Ageing and Degenerative Disease
    Harvey, Luke
    Following the first division of the zygote, somatic mutations begin to accumulate in all human cells. Daughter cells become increasingly mutated leading to mosaic tissues, composed of genetically heterogeneous clonal units. In recent years, large scale sequencing efforts have begun to characterise the process of mutagenesis in normal tissues. These observations have improved our understanding of how somatic cells evolve oncogenic phenotypes and hinted that somatic changes may contribute to the development of age-related phenotypes and degenerative disease. To accurately characterise the mutational processes in normal tissues, we developed nanorate sequencing (NanoSeq), a duplex sequencing protocol with error rates of less than five errors per billion base pairs, allowing for the accurate identification of mutations in single DNA molecules. Using NanoSeq we describe the mutational processes in the nuclear and mitochondrial genome of three distinct cell types across the brain and cardiac muscle. We show that post-mitotic tissues accumulate somatic mutations at comparable rates, and by similar processes to dividing cells. We explore the relationship of mutation rates to transcription and chromatin state, and reveal patterns of transcription-coupled damage and repair in neuronal genomes. In the analysis of somatic mutations in cardiomyocytes, we identify a novel mutational signature that we believe may be caused by hypoxia-induced oxidative stress. Lastly, we demonstrate that neurons isolated from Alzheimer’s disease brains have a lower mutation burden than healthy neurons. To further investigate the role of somatic mutations in degenerative disease, we characterised somatic mutagenesis in rheumatoid arthritis and osteoarthritis. Using laser capture microdissection, we isolated 2000 microbiopsies of intimal lining, sublining, and lymphocytes for whole exome sequencing.
Using these results we show that the synovium is a mostly polyclonal tissue with the capacity for large clonal expansions. We explore how the histopathology of rheumatoid arthritis and osteoarthritis relates to driver mutations, clonal expansions, and immune infiltration. Using NanoSeq, we characterise the mutational processes in isolated synovial cell types. Finally we show the power of NanoSeq for high-throughput driver discovery and identify 15 genes under positive selection in the synovium. This thesis offers novel insights into the mutation rate of distinct cell types in healthy and disease states. This work contributes to a growing body of literature characterising somatic evolution and provides a foundation for further functional studies to directly investigate the role of mutant clones in disease states.
  • ItemEmbargo
    Enabling the in vitro study of long noncoding RNAs to understand their role in Plasmodium falciparum
    Hoshizaki, Johanna
    Long noncoding RNAs (lncRNAs) have been identified in *Plasmodium falciparum*, the parasitic cause of life-threatening malaria, yet their role remains largely undiscovered. Through interactions with nucleic acids and proteins, lncRNAs can modulate gene expression at the transcriptional, post-transcriptional, translational, and post-translational levels. Determining the role of lncRNAs in the regulation of the *P. falciparum* transcriptome and proteome is imperative to further our understanding of gene regulation in the parasite. The characterisation of *P. falciparum* lncRNAs has been hindered by an incomplete annotation and the absence of disruption methods that together would permit high-throughput systematic knockdown of lncRNAs. During my PhD, I addressed these challenges to enable the study of *P. falciparum* lncRNAs *in vitro*. I generated a high-quality lncRNA annotation using manual curation of sequencing data generated at the Sanger Institute, along with supportive datasets from the literature. I evaluated CRISPR-based approaches for *in vitro* disruption of lncRNAs including gene knockout, knockdown, and interference. CRISPR-associated enzymes were explored including commonly used DNA-cutting enzymes (Cas9), inactivated enzymes to block transcription (dCpf1) and enzymes that target RNA directly (Cas13), the latter of which had not been applied to *Plasmodium*. Furthermore, I implemented these tools to demonstrate the feasibility of lncRNA studies in *P. falciparum*. I interrogated a set of lncRNAs that were selected based on predicted biological significance and targetability using dCpf1. LncRNA-depleted parasites were phenotypically characterised by assessing changes in fitness, drug resistance, gametocytogenesis and expression. I identified potential roles for specific lncRNAs in drug resistance and gametocytogenesis. By developing bioinformatics and molecular tools, this work enables future studies elucidating the specific roles of lncRNAs in *P. falciparum*.
Understanding the transcriptome and gene regulation will inform the development of novel interventions for the control and eradication of malaria, which remains a serious global health concern.
  • ItemOpen Access
    Quantitative modelling of CRISPR-Cas editing outcomes
    Pallaseni, Ananth
    Development of the CRISPR-Cas toolkit over the last decade has enabled unprecedented control over the genome and unlocked the capacity for new experiments and gene therapies. However, use of these technologies is made more difficult by variance in the rate and type of individual genetic outcomes generated by each editor. This variance is reproducible and a function of the sequence being targeted, thus making it amenable to quantitative modelling. My research focuses on building such models in a variety of contexts. In this thesis, I recount the history of gene editing from transgenesis to prime editors. I then review the modelling techniques I use in my projects, before covering the state of predictive modelling for genome engineering. In two main results chapters, I discuss my work on modelling base editor outcomes and the effects of DNA repair context on Cas9-induced double-stranded break repair. The final results chapter covers shorter, collaborative studies on other gene editing technologies. I will conclude by discussing the need for computational tools in genome editing and the gaps in our understanding. In my first results chapter, I examine the sequence- and position-specificity of base editor activity. Base editors are a gene editing technology derived from Cas9 that introduce precise base substitutions into a targeted region of the genome. The rate of these substitutions is known to vary between targeted sequences and the determinants of this variation are not completely understood. To untangle the determinants of base editing efficacy, our group performed a large-scale screen where we measured base editing outcomes across 20,000 targeted sequences in multiple cell lines and editors. I processed and analysed the data produced in this experiment and found that both the sequence flanking editable bases and the position of those bases in the sequence affect the rate of observed editing.
I leveraged this understanding to construct a position-specific model of base editing activity for each editor type and used these models to predict the efficacy and specificity of base editors for correcting pathogenic variants found in ClinVar. The second results chapter focuses on the mutational outcomes of Cas9-induced cuts in repair-deficient backgrounds. Cas9 creates a double-stranded break at a targeted location in the genome and the cell repairs this lesion via several pathways which can leave mutations in the repaired sequence. It has been shown in previous studies that the distribution of these mutations is reproducible and dependent on the sequence being cut, but the effect of repair context on this process is not well understood. I planned an experiment to measure Cas9 repair outcomes at over 5000 target sites in 21 mouse cell lines with knockouts of single repair genes, then processed and analysed the data generated. I show that the knockout cells have reproducibly different repair patterns from controls. I highlight Nbn, Lig4 and PolQ as examples of knockouts with consistent effects on certain mutation types. I examine how the known sequence determinants of Cas9 outcomes affect outcome preference in knockout lines. Lastly, I use this understanding to train models that predict the distribution of Cas9 outcomes in various repair backgrounds. My final results chapter discusses two shorter collaborative studies on alternative editing technologies. First is the design of a large-scale screen to profile the behaviour of a new Cas enzyme. I explain the experimental process of profiling Cas outcomes, the decisions involved in designing a guide library and an approach to modelling. The other collaboration is the prediction of editing rates when inserting sequences into the genome with prime editors.
Here, I train a model to predict editing rates, examine which features are most important to predictive performance, and finally determine that collection of more data for training will improve model performance.
  • ItemOpen Access
    Somatic phylogenies: a window into human biology in health and disease
    Spencer Chapman, Michael; Spencer Chapman, Michael [0000-0002-5320-8193]
    All somatic cells in the human body originate from the fertilised egg, or ‘zygote’. Co-ordinated processes of cell division and differentiation result in the incredible complexity observed following development. Thereafter, each tissue must maintain its function for the decades of life that follow, while minimising risks such as cancer. Crucial to understanding the strategies employed to meet these demands is knowing what cell gives rise to what, i.e. the structure of the somatic lineage tree. For this purpose, it is fortuitous that most cell divisions result in the acquisition of somatic mutations that are passed on to all future progeny of the dividing cell. Therefore, theoretically, if all somatic mutations in each cell of an organism were known, the organism’s full somatic lineage tree could be determined. While this is not yet possible at the scale of an entire organism, advances in clonal expansion and whole-genome sequencing technologies have facilitated such experiments on the scale of hundreds of cells in a single individual. This thesis contains three results chapters, each utilising somatic phylogenies in different ways. The first looks at early human development. Due to the small number of cells in early development, a near-complete phylogeny can be determined. Assessing contributions of early lineages to different tissues is used to determine the developmental relationships and origins of tissues. The second looks at the clinically important context of gene therapy for sickle cell disease. Comparisons of somatic phylogenies from before and after the gene therapy procedure give insights into the biology of sickle cell disease and the safety of gene therapy in this context. Finally, the third chapter looks at somatic phylogenies from a different perspective: as a way of testing our assumptions regarding mechanisms of mutation acquisition.
The observation that some mutations do not fit these assumptions leads to the surprising conclusion that some mutation-causing DNA lesions persist unrepaired for months to years. Overall, this thesis demonstrates the power of interrogating somatic phylogenies to answer important questions in diverse biological disciplines. As technological advances allow such experiments on ever larger scales, the power of this approach will only increase.
  • ItemOpen Access
    The natural history of clonal haematopoiesis
    Fabre, Margarete
    Introduction: Human cells acquire somatic mutations throughout life, some of which can drive clonal expansion. Such expansions are frequent in the haematopoietic system of healthy individuals and have been termed clonal haematopoiesis (CH). While CH predisposes to myeloid neoplasia and other diseases, we have limited understanding of its natural history and how this relates to clinical phenotype. Objectives: 1. To characterise the behaviour of CH across the human lifespan; 2. To identify and quantify determinants of clonal behaviour; 3. To understand how CH clonal dynamics relate to malignant progression. Results: By tracking 697 CH clones from 385 individuals aged 55 or older over a median of 13 years, we found that 92.4% of clones expanded at a stable exponential rate in old age, with different mutations driving substantially different growth rates, ranging from 5% (DNMT3A, TP53) to over 50%/yr (SRSF2-P95H). Growth rates of clones with the same mutation differed by ±5%/yr, proportionately impacting “slow” drivers more substantially. Combining these time-series data with phylogenetic analysis of 1,731 whole genome-sequenced haematopoietic colonies from 7 older individuals revealed distinct patterns of lifelong clonal behaviour. DNMT3A-mutant clones preferentially expanded early in life and displayed slower growth in old age, in the context of an increasingly competitive oligoclonal landscape. By contrast, splicing gene mutations only drove expansion later in life, while TET2-mutant clones emerged across all ages. Using a separate cohort of 158 twins, we found that concordance for CH was no higher within monozygotic vs dizygotic pairs, suggesting that the inherited genome does not exert a dominant influence on CH behaviour. The identification of two monozygotic pairs in which both twins harboured identical rare somatic mutations confirmed that the origins of adult CH can be traced back to early life, even in utero.
Finally, by comparing CH growth dynamics with (i) driver mutation selection patterns in large myeloid cancer data sets and (ii) driver-specific AML risk scores, we show that mutations driving faster clonal growth also carry a higher risk of malignant progression. Conclusions: These findings characterise the origins and lifelong natural history of CH and give fundamental insights into the interactions between somatic mutation, ageing and clonal selection.
  • ItemOpen Access
    Technologies to decode the multicellular networks within the human body
    Shilts, Jarrod
    This is a story about how it is possible for collections of cells to physically assemble into coordinated multicellular systems. In other words, how millions of individual cells are able to physically interact with each other in an organized way, as well as how pathogens such as viruses can exploit these interaction points in order to infect the body. Its subject is principally centered around the proteins that cover the surfaces of human cells. These proteins have to bind with specific combinations of surface proteins on nearby cells, thereby establishing a complex ‘code’ of direct interactions possible between different cell populations and tissues. Some of these interactions trigger the exchange of signals that enable collections of multiple cells to coordinate complex behaviors such as immune responses, while others act as adhesive receptors that enable physical structure to emerge out of groups of cells. The influential roles of these surface proteins and their accessibility to systemic medications have also made them among the most effective targets for therapeutics, with surface proteins constituting a majority of all approved drug targets. So far, however, prior research has only pieced together a fragmented picture of the direct receptor links between cells and the functional roles they have. Surface receptors pose unique experimental challenges to study and historically have lacked systematic methods to measure, leading most studies to only consider receptors at small scale without a global view of the larger system. In this thesis, I take a different approach. I will present my work to establish a series of technological tools and strategies that overcome these challenges, in order to make it possible to systematically build up from characterizing the function of individual receptor molecules all the way to reconstructing multicellular interaction networks across entire systems of human cells.
These methodologies can be categorized into three sequential steps. First, testing the binding of pairs of surface proteins across large arrays to decode the ‘interactome’ between two cells. Second, using cell-based assays to annotate the broad functional consequences a surface interaction has. Third, computationally integrating these diverse data sources in order to understand how interacting communities of cells are organized. As my initial case study, I consider the question of how the distributed individual cells of the human immune system interact to produce a cohesive whole. By individually producing recombinant forms of most surface proteins detectable on white blood cells, I could assemble the first systematic and quantitative interaction network of these proteins, and in the process discover several novel interactions and reveal the identities of previously unidentified binding partners for key immunomodulatory receptors. I could then adapt those recombinant proteins to experimentally manipulate live human immune cells in a multiplex microscopy technique, which revealed previously unknown interactions as having prominent roles in immune activation and leukocyte adhesion. I will show how these data can be integrated with high-resolution expression data in order to infer patterns of cell-to-cell connectivity throughout the human body, as well as to formulate a mathematical model that could predict the behavior of interacting cells from molecular first principles. In the second half of my thesis, I will explain how this series of methods I established can be adapted and extended to new contexts. I will describe a large-scale effort I led applying these methods to characterize the cell-to-cell interactions occurring within the human brain, which revealed unexpected new pathways by which glia can directly communicate with cortical neurons. I will then extend my approaches to reveal which interactions may play a causal role in driving human disease.
To do so, I will first show computational methods I devised for leveraging human clinical genetics in order to pinpoint cell-to-cell processes underlying the pathology. In the final section, I will extend this to infectious diseases driven by host-pathogen interactions. For this, I will explain how the tools I established allowed me to rapidly respond to the COVID-19 pandemic by systematically profiling the surface proteins that act as host factors during infection by the novel coronavirus SARS-CoV-2. That work has led to the discovery of two pathogen-host interactions that have subsequently been independently linked to COVID-19 severity, as well as helped clarify the precise host receptors that SARS-CoV-2 utilizes when invading human cells. From the combination of these technologies and approaches, I hope to provide a systematic and mechanistically-grounded foundation for deconstructing the emergence of biological function from the interacting communities of cells that make up the human body.
  • ItemOpen Access
    Clonal dynamics of haematopoiesis across the human lifespan
    Mitchell, Emily
    The haematopoietic system manifests several age-associated phenotypes including anaemia; loss of regenerative capacity, especially in the face of insults such as infection, chemotherapy or blood loss; and increased risk of clonal haematopoiesis and blood cancers. The cellular alterations that underpin these age-related phenotypes, which typically manifest in individuals aged over 70, remain elusive. In my thesis I have aimed to investigate whether changes in HSC population structure with age might underlie any aspects of haematopoietic system ageing. In addition, I have investigated the impact of chemotherapeutic perturbations on haematopoietic stem cell mutation burden and clonal dynamics. To answer the ageing question, I have sequenced 3579 genomes from single-cell-derived colonies of haematopoietic stem cell/multipotent progenitors (HSC/MPPs) from 10 haematologically normal subjects aged 0-81 years. HSC/MPPs accumulated 17 somatic mutations/year after birth with no increased rate of mutation accumulation in the elderly. HSC/MPP telomere length declined by 30 bp/yr. To interrogate changes in HSC population structure with age, I used the pattern of unique and shared mutations between the sampled cells from each individual to reconstruct their phylogenetic relationships. I found that haematopoiesis in adults aged <65 was polyclonal, with high indices of clonal diversity. In contrast, haematopoiesis in individuals aged >75 showed profoundly decreased clonal diversity. In each elderly subject, 30-60% of haematopoiesis was accounted for by 12-18 independent clones, each contributing 1-34% of blood production. Most clones had begun their expansion before age 40, but only 22% had known driver mutations. I used the ratio of non-synonymous to synonymous mutations (dN/dS) to identify any excess of non-synonymous (driver) mutations in the dataset. 
This genome-wide selection analysis estimated that the set of 300-400 HSC/MPPs sampled from each adult individual harboured around 100 driver mutations, over 10-fold higher than the number of known drivers we could identify. Novel drivers affected a wider pool of genes than identified in blood cancers. Simulations from a simple model of haematopoiesis, with constant HSC population size and constant acquisition of driver mutations conferring moderate fitness benefits, entirely explained the abrupt change in clonal structure observed over the age of 70. By old age, the majority of HSCs harbour at least one driver mutation. Our data support the view that dramatically decreased clonal diversity is a universal feature of haematopoiesis in elderly humans, underpinned by pervasive positive selection acting on many more genes than currently known. Finally, I also sequenced haematopoietic progenitor cells from individuals exposed to a wide range of chemotherapeutic agents. I was able to identify an increased mutation burden associated with a number of chemotherapeutic agents, including platinum and alkylating agents, some of which conferred thousands of excess mutations. There was a wide variation in mutation burden conferred by agents within the same class, meaning that there are potential patient benefits from switching drugs in commonly used regimens. I show that chemotherapy given in childhood can profoundly impact clonal dynamics in later life.
  • ItemOpen Access
    Common genetic variation and spliceosome variants in rare developmental disorders
    Wigdor, Emilie
    Although thousands of rare disorders are caused by single, deleterious, protein-coding variants, evidence suggests that common variants also contribute to risk for rare, neurodevelopmental disorders (NDDs). These are likely affecting the penetrance of protein-coding variants as well as expressivity, posing a major challenge in the interpretation of rare variants. An additional challenge is our incomplete understanding of which variants are likely to affect gene function. Due to the high burden of “variants of unknown significance” (VUS), there is a great need to develop molecular biomarkers of individual disorders which could be used as an intermediate phenotype to help determine whether a VUS is pathogenic or benign. For disorders which are due to mutations in spliceosomal components, global patterns of splicing changes may be a useful biomarker. The research presented falls into three main projects. First, I investigated, via genetically predicted gene expression, whether cis-regulatory variants modify the penetrance of inherited, putatively damaging variants in NDD probands in the Deciphering Developmental Disorders (DDD) Study. To determine whether there were overall differences in predicted gene expression between probands and controls, I conducted a Transcriptome Association Study. I then tested whether the predicted expression of genes harbouring inherited, putatively damaging variants is lower in undiagnosed NDD probands compared to controls. Finally, I investigated the modified penetrance of inherited, putatively damaging variants by comparing predicted gene expression between undiagnosed NDD probands and their unaffected, variant-transmitting parents. Second, I further explored the role of common variants in severe NDDs using polygenic scores (PGS) in both DDD and the Genomics England (GEL) 100,000 Genomes project. I tested whether undiagnosed NDD probands over- or under-inherit PGS for NDDs and correlated traits.
I found that NDD probands over-inherit PGS for NDDs and schizophrenia. To put these results into context, I compared unaffected parents of undiagnosed probands’ PGS to both controls and probands. I found that parents’ PGS are significantly different from controls’ PGS, but not from probands. Additionally, I explored sex differences in PGS, by examining both affected and unaffected individuals. I found preliminary evidence of a female protective effect in the context of common variation. Finally, I revisited the question of the modified penetrance of inherited, putatively damaging variants. Third, using whole genome sequencing and bulk RNA sequencing of whole blood from GEL, I investigated differential splicing and gene expression in rare disorder probands with a pathogenic variant in the spliceosome. I found enrichment of differentially expressed genes for processes related to genes containing minor introns. Additionally, I found enrichment of genes involved in spliceosomal components among differentially spliced genes, suggesting a potential feedback loop for regulation of splicing. These studies emphasise the importance of studying the convergence of common and rare variation, as well as the integration of functional data, in the context of rare disease genetics. Moreover, they highlight the need to collect phenotypic and genotypic data on parents and family members of rare disorder probands.
  • ItemOpen Access
    The Contribution of Structural Variants to 2,095 Molecular Phenotypes in 12,354 European Ancestry Individuals
    Howell, Brittany
    Structural Variants (SVs) are large-scale rearrangements of the genome resulting in linear and spatial changes which can profoundly affect the function of the genome. SVs contribute the majority of nucleotide variation among human genomes by number of basepairs and have been linked to various diseases and traits including schizophrenia, autism and obesity. In this thesis, whole genome sequence data at 15X coverage were generated using 12,354 samples from the INTERVAL cohort. A combination of Genome STRiP, Lumpy, CNVnator and svtools was used to call deletions, duplications and inversions. I implemented a stringent, tiered QC procedure to minimise false positives. For duplications and deletions I modelled sequential random forests on read alignment parameters, resulting in 88% sensitivity and 99% specificity for deletions, and 92% specificity and 55% sensitivity for duplications. Final tuning of the overall quality score was modelled to ensure that 90% of carrier genotypes were identical among duplicate samples. Finally, a graph-based procedure was used to collapse SVs with significant overlap in carriers and in genomic coordinates. The final callset consists of 123,801 sites, with each sample containing approximately 3,300 SVs. After rigorous QC, I compared the cohort to similar population cohorts including the 1000 Genomes Project and Hall-SV. The cohort is sensitive, capturing 93% and 92% of common deletions from each cohort respectively. There is less sensitivity at duplications, where INTERVAL captures only 65% and 75% respectively. The majority of detected variants are rare: 95% have MAF < 0.01, and 49% are singletons; both figures are in line with expectations set by other similarly sized cohorts such as gnomAD-SV. Intergenic SVs are more common than all SVs affecting coding regions, and multi-gene SVs are the rarest class, as expected.
SVs are well tagged by SNPs: 88%, 97%, 93% and 46% of deletions, inversions, reference MEIs and duplications, respectively, have at least one SNP in high LD (r² > 0.8), suggesting that genotyping of the SVs is high quality. We evaluated the contribution of SVs to a comprehensive range of phenotypes available in the cohort. These traits include a range of blood cell traits and phenotypes relating to inflammation and immunity including 1,348 metabolites, 92 plasma proteins and 125 full blood count traits. I modelled linear associations between single SVs and each trait, and identified 495 signals across 196 regions. After conditional analysis, I estimate SVs have a causal role in 34 signals, and are the lead variant in 54 signals. Chapter four details several examples of the contribution the SV is making to the association, describes potential genetic mechanisms, and gathers additional clinical data such as electronic health records and gene expression information to comprehensively describe the role of the SV at the association. Finally, there were 481 signals with genome-wide significant SNPs present. At 339 signals, at least one SNP became non-significant when conditioning on the SV, suggesting that many SV signals have been accounted for by proxy previously. SVs are challenging to detect; however, here I demonstrate the importance of including SVs in GWAS and further studies in understanding complex traits.
  • ItemOpen Access
    Factors Influencing the Somatic Mutational Landscape of Ageing Squamous Epithelium
    King, Charlotte
    The incidences of many cancers vary substantially across the world, reflecting genetic differences between populations and exposure to environmental carcinogens. This is illustrated by keratinocyte skin cancers and oesophageal squamous cell carcinoma, which both develop from squamous epithelium, yet are remodelled during ageing by very different mutagenic processes and environmental exposures. In this thesis, I investigate the influence of cancer risk factors on the somatic mutations present in normal aged skin and oesophageal epithelium using a range of sequencing methods. I find sun-exposed facial skin from donors of the UK to have a 4-fold increased mutation burden and 10-fold increase in copy number aberrant clones compared to donors of Singapore, a country with a 17-fold lower incidence of keratinocyte skin cancer. The majority of these mutations in the UK are due to ultraviolet radiation (UV) but, in Singapore, age-related signatures predominate. Mutations in TP53 are more strongly selected in epidermis of the UK, whilst those in NOTCH1 and NOTCH2 are preferentially selected in Singapore, reflecting differences in the level of competition within the tissue. A survey of mutations in UK skin across body sites reveals differences in UV signature and selection between sites. In aged oesophageal epithelium from UK donors, I observe an increase in mutations with an alcohol-associated signature with reported alcohol consumption. Furthermore, mutation burden increases with smoking, without a detectable change in the mutational signature, consistent with tobacco smoke increasing oesophageal cancer risk independent of its mutagenic effects. Finally, in donors over the age of 60, mutations in TP53 and FAT1 are more strongly selected, whilst those in NOTCH3 more weakly selected, suggesting changes to levels of competition within the tissue with age. 
I conclude that the mutational landscapes of normal oesophagus and skin are shaped by age and environmental exposures and that this, in turn, may alter the risk of keratinocyte cancers.
  • ItemOpen Access
    Normal cell signals in human cancer transcriptomes
    Kildisiute, Gerda
    Cancer cells arise from normal cells, and may retain transcriptional patterns present in the cell of origin. Recent advances in single-cell transcriptomics have allowed us to profile both normal and cancer cells at single-cell resolution, enabling direct comparisons between the two. Such comparisons can reveal information about which cells cancers originate from, transcriptional programs that underpin carcinogenesis, and differentiation states in cancer. The work presented in this thesis aims to study normal cell signals in human cancer, utilising single-cell transcriptomics. Chapter 1 provides an overview of transcriptomics as a whole, as well as in the context of cancer, with a focus on single-cell transcriptomics. In Chapter 2, I construct a single-cell normal reference of the fetal adrenal gland, and explore the development of the fetal adrenal medulla, from which the childhood cancer, neuroblastoma, arises. I then compare that reference to neuroblastoma single-cell and bulk transcriptomes in Chapter 3. This comparison reveals sympathoblasts as the normal correlate of neuroblastoma cancer cells, and allows identification of transcripts present in both neuroblastoma and fetal adrenal medulla, but not adult tissues, making these transcripts attractive therapeutic targets. Chapter 4 presents a comparison of fetal and adult references across three different organs (gut, lung and liver) to bulk and single-cell transcriptomes of adult cancers originating in these organs. This analysis assesses the contribution of fetal and adult cell type signals to cancer transcriptomes. Overall, my work delineates the transcriptional relationship between cancer and normal cells.