
Theses - Wellcome Sanger Institute

Recent Submissions

Now showing 1 - 20 of 121
  • ItemEmbargo
    Single-cell atlasing of the human tissues across the lifespan
    Kedlian, Veronika
    Single-cell and spatial sequencing have revolutionised our understanding of human tissue biology in recent years. The Human Cell Atlas (HCA) initiative aims to recover and position all of the cell types in the human body. Being part of HCA, I have worked on building atlases which allow comparison across age groups for two different organs: the human thymus and skeletal muscle. Firstly, I co-led a large multinational effort to create a spatial atlas of the human thymus from fetal and paediatric stages. As a part of this effort, we expanded the census of cell types in the human thymus and mapped them spatially in the tissue. We also developed a new morphological framework (Organ Axis) which allowed us to align and compare cell type locations in fetal vs. paediatric thymus. Secondly, I led an effort to create an ageing human skeletal muscle atlas, which systematically catalogued cell types and states in young adult and elderly skeletal muscle and described age-associated changes. In Chapter 1 I set the stage for the whole thesis by giving background on single-cell and spatial technologies as well as the research question, namely change across the lifespan. I start with a brief introduction to single-cell atlasing and the huge variety of spatial transcriptomic and proteomic technologies, focusing on Visium and IBEX. Next, I provide an overview of the prenatal and postnatal periods of development, followed by a more in-depth discussion of the causes of human ageing. Finally, I conclude with a description of the main events and functional changes in thymus and skeletal muscle across the lifespan. In Chapter 2 I give an overview of the spatial human thymus cell atlas. I give a brief introduction to human thymus development, its cell type composition and function.
I describe major single-cell (single-cell, CITE-seq and TCR-seq) and spatial profiling modalities (Visium and IBEX) that were applied to the thymus and outline methods for imaging data processing, annotation and cortico-medullary axis construction for the thymus. I provide an annotation and spatial mapping of T cells and supporting resident cells, including thymic epithelial cells (TECs), fibroblasts, vascular and myeloid cells. Simultaneously, I use the cortico-medullary axis to compare the spatial position of cell types in the fetal vs. paediatric thymus. I conclude by discussing the differences in cell type localisation between fetal and paediatric thymus and the importance of spatial structure for thymus function. In Chapter 3 I provide insights from single-cell and spatial mapping of Hassall’s body, a keratinised structure in the thymus medulla, which used to be considered a degenerative epithelial structure. Firstly, I map the closest cell types to Hassall’s body and discuss the challenges of associating cell types to Hassall’s vs deep medulla. Next, I identify genes which are uniquely expressed by specific cell types in medulla, which led me to discover an underappreciated heterogeneity within mTECIII population including mucosal and skin-like subtypes. I conclude with a discussion on the putative function of Hassall’s bodies in T-cell development. In Chapter 4 I introduce the human skeletal muscle ageing atlas. I start by summarising knowledge about the main cell types in the muscle and known ageing changes. I describe the experimental setup for the creation of the atlas and give an overview of the recovered cell types and major ageing changes that we observe. Following on from this, I describe fine-grained cell states and ageing changes that occur in the different compartments of skeletal muscle, including muscle stem cells, the myofiber itself and supporting cells of the muscle microenvironment. 
I conclude with a discussion on how processes in different parts of the muscle combine to cause a breakdown in muscle function over age. In Chapter 5 I summarise the major insights from both studies and discuss their potential therapeutic uses. Next, I use my experience to provide suggestions on experimental design and challenges that need to be solved for the next generation of atlases looking to understand changes across human lifespan.
  • ItemOpen Access
    Functional genomics of developmental disorders
    Hampstead, Juliet
    DNA methylation, or the epigenetic modification of primarily cytosine bases within DNA to 5-methylcytosine through the addition of a methyl group, is an epigenetic mark with a variety of biological and cellular roles. Genetic and environmental influences can perturb DNA methylation patterns in humans, and the set of differentially methylated CpG sites perturbed can be collectively called a DNA methylation signature. In this thesis, I characterise DNA methylation signatures as a diagnostic biomarker for children with rare developmental disorders in chromatin-modifying genes. I show that DNA methylation signatures are a general property of these genes, that they have substantial clinical and diagnostic utility, and that they can be used to resolve variants of uncertain significance. I also show that these signatures are robust across scientific centres and can be generated across multiple tissues. Lastly, I compare DNA methylation signatures generated from methylation microarrays to those generated from genome-wide long read sequencing data, and provide evidence that long read sequencing is a reliable and scalable method to profile 5-methylcytosine for DNA methylation signature-based classification. Overall, my work emphasises the need for scalable, cost-effective, and relatively high-throughput biomarkers in the characterisation and diagnosis of rare developmental disorder syndromes.
  • ItemOpen Access
    Understanding Genomes Through Engineered Structural Variation
    Koeppel, Jonas
    Sequencing of the human genome has provided us with a detailed map of its content. While enormous progress has been made towards understanding the 1% of the human genome that is protein coding, we are still mostly in the dark about the function and relevance of the remaining 99%. Progress has been difficult because the non-coding genome is vast, the individual nucleotides hold less information, and we have lacked the tools to engineer and probe it to the necessary extent. This is beginning to change with the advent of ‘search and replace’ genome engineering technologies such as CRISPR prime editing. I leveraged the ability of prime editors to insert recognition sequences for recombinases at high throughput to engineer genomes at an unprecedented scale. In the process, I made discoveries about the biology of genome engineering, structural variation, and gene regulation. I first outlined the determinants of short sequence insertion using prime editing by systematically measuring the frequency of insertion for 3,604 short sequences in four target sites of three human cell lines with varying DNA repair contexts. I characterized how insertion sequence length and two cellular DNA processing pathways affected the incorporation rate. I reaffirmed that DNA mismatch repair suppressed the insertion of shorter sequences and made the discovery that 3’ flap nucleases TREX1 and TREX2 suppressed the insertion of longer sequences. I further delineated the effects of nucleotide composition and secondary structure of the insertion sequence on editing rates. Next, I targeted a prime editor to the high copy number LINE-1 retrotransposon to insert hundreds of recombinase sites into a single human genome. These engineered cell lines provided a latent substrate for large-scale genome randomization. 
After induction with Cre recombinase, I mapped thousands of deletions, inversions, extrachromosomal circular DNA, translocations, and fold-back inversions and tracked their abundance over time. Sequencing surviving variants and comparing them to early ones revealed strong selection pressures against creating non-segregable derivative chromosomes or deleting essential genes. However, it also demonstrated that haploid human cell lines could survive while losing megabases of DNA. I isolated 21 cell clones and linked variants to gene expression changes for three clones with multiple Cre-induced rearrangements. Finally, I used prime editing to insert loxPsym sites into the regulatory region of the *OTX2* developmental transcription factor. Cre recombinase induced stochastic deletions and inversions across the recombinase sites, and created diverse and novel enhancer arrangements. By endogenously fusing *OTX2* with a fluorophore and sorting, I could associate alternative regulatory architectures with *OTX2* expression and track changes in CpG methylation and chromatin accessibility. I discovered that three enhancers in a 20 kb cluster drove 50% of *OTX2* expression and that moving the cluster closer to the transcription start site while simultaneously deleting intermediate regulatory elements resulted in strong *OTX2* expression. The strategies presented here to more efficiently insert short DNA sequences with prime editing, shuffle DNA, and rearrange regulatory regions give a fundamentally new approach to randomizing mammalian genomes which will open new avenues to go beyond the 1% of coding sequence and study the 99% of underexplored regions. The data garnered from molecular phenotyping of novel genome architectures after randomization will allow predictive models to learn parameters beyond the limited diversity of our DNA.
  • ItemEmbargo
    The impact of cytotoxic chemotherapy on somatic mutation in normal human cells
    Dunstone, Eleanor
    Since their advent in the mid-twentieth century, cytotoxic chemotherapy drugs have been used to treat hundreds of millions of people with cancer. These drugs remain the most effective form of treatment in many cases, with over half of newly-diagnosed patients requiring chemotherapeutic intervention. Unfortunately, a small proportion of cancer survivors will go on to develop second tumours as a result of the treatment they received for their first diagnosis. These tumours are usually genetically unrelated to the patient’s first cancer, suggesting that cytotoxic chemotherapy may increase the chance of normal cells becoming malignant. Many chemotherapy drugs are thought to exert their effects on tumours by inducing DNA damage, which can generate somatic mutations in any surviving tumour cells. However, the systemic nature of many treatments means that the patient’s normal cells are also at risk of chemotherapy-induced mutagenesis. Patterns of mutations or ‘mutational signatures’ associated with chemotherapy have been seen in patient tumours, in cells exposed to drugs *in vitro*, and in some normal tissue samples from cancer patients. However, a comprehensive study of the impact of chemotherapy on the somatic mutational landscape of a wide range of normal human tissues has not yet been performed. In this project, I investigate the mutational impact of cytotoxic chemotherapy using two complementary approaches: the identification of chemotherapy-associated mutational signatures *in vitro* using organoid models; and the sequencing of samples of a wide range of tissue types from chemotherapy-treated patients. This work was enabled by recent developments in highly error-corrected duplex sequencing approaches, facilitating accurate detection of mutations at single molecule level. 
The results of this study show that many widely-used chemotherapy drugs are mutagenic in normal cells, generating distinctive patterns of single-base substitutions (SBS), doublet-base substitutions (DBS), and small insertions and deletions (indels). Treatment of normal human cells with chemotherapy drugs and environmental agents *in vitro* demonstrated mutational signatures associated with a wide range of agents. Alkylating and platinum-based agents were the most mutagenic classes of chemotherapeutics, with all 13 alkylating agent drugs and all four platinum-based drugs tested generating SBS signatures. Many alkylating agents and all platinum-based agents also generated an increase in DBS and indels. Previously undescribed signatures discovered include SBS signatures associated with mitomycin C, lomustine/carmustine, busulfan and thiotepa, alongside a DBS signature of mitomycin C and an indel signature associated with a broad range of platinum-based and alkylating agents. Additionally, indel signatures of topoisomerase II inhibitors and bleomycin were described for the first time. The antimetabolite drugs tested generally did not show significant mutagenesis when applied as single treatments; however, a previously-observed signature associated with 5-fluorouracil was recovered when samples were exposed to repeated treatments. The *in vitro* studies also highlighted the complex relationships between compound concentration, cytotoxicity and mutagenicity, with many compounds showing non-linear dose responses or differing relationships between dose and mutagenicity between different organoid tissues-of-origin. Treatment-associated mutagenesis was also shaped by DNA repair, demonstrated by investigating the relationship between *in vitro* temozolomide-induced mutagenesis and the activity of the DNA repair enzyme *O*-6-methylguanine-DNA methyltransferase (MGMT).
Temozolomide exposure was shown to generate ten-fold higher SBS burdens in MGMT-knockout human induced pluripotent stem cells (hiPSCs) than in MGMT-wild-type hiPSCs, showing a different mutational signature depending on MGMT status. The sequencing of normal tissue samples from chemotherapy-treated patients showed that many of these drugs are also mutagenic *in vivo*, with eight of the signatures identified *in vitro* being observed in patient samples. Many alkylating and platinum-based agents were shown to have a major impact on SBS burden in normal human tissue samples, with some patients carrying several times as many mutations as would be expected for a person of their age. High excess SBS burdens are seen across many tissue types, including those composed predominantly of post-mitotic cells such as cardiac muscle, skeletal muscle and the neuron-rich cerebellar granular layer, which have historically been considered not to acquire substantial numbers of somatic mutations during adult life. These burdens are associated with signatures of platinum-based drugs, thiotepa, temozolomide and other alkylating agents. Additionally, DBS signatures associated with platinum-based agents and mitomycin C were observed across many normal human tissue samples, as were indel signatures associated with thiotepa and other cross-linking/alkylating agents. Conversely, some agents that generated signatures *in vitro* appeared not to be mutagenic *in vivo*, including 5-fluorouracil/capecitabine and topoisomerase II inhibitors. Alongside the requirement for repeated treatment *in vitro*, this suggests that antimetabolite-associated mutagenesis may be restricted to dividing cells, and that mechanism of action influences the capacity of different drugs to generate mutations in normal cells.
The prevalence of high chemotherapy-associated mutation burdens in normal tissues from cancer patients presents one possible factor contributing to the increased risk of second tumour development in chemotherapy-treated survivors of cancer, and the widespread DNA damage observed in post-mitotic tissues may inform research into other long-term side effects of chemotherapy treatment such as cardiac and neurological dysfunction. The work also generates insights into the fundamental mechanisms of mutation in human cells and provides a compendium of signatures of exposures to chemotherapy drugs and other environmental mutagens, facilitating future efforts to further characterise the distribution of these mutational patterns in different patients and tissue types.
  • ItemOpen Access
    Dissecting immune interactions in health and disease with multiomics and spatial technologies
    Arutyunyan, Anna
    Cells are the basic building blocks of life, forming the enormous plethora of tissues and living organisms on Earth. They have a high diversity of phenotypes and functions in different environments. The high-throughput tools to profile different modalities from a single cell have grown exponentially in recent years. This now allows us to draw a complete picture of how cells function in different environments for the first time. In the work of my thesis, I use high-throughput multiomics and spatial technologies to create comprehensive cell atlases. I focus on studying how immune cells communicate among themselves and with other cells in the context of disease and development. Chapter 1 starts with an outline of the background on cell biology and the impact that genomic technologies have on how we can study cellular processes. I then discuss the experimental methodology of high-throughput multiomics and spatial techniques, and computational tools for the analysis of such data. Following is the introduction to the two projects comprising the work of my thesis: (i) a multiomics study of Common Variable Immunodeficiency (CVID), and (ii) a spatial multiomics map of the Maternal-Fetal Interface (MFI) in early pregnancy in humans. Chapter 2 outlines materials and methods used in this work, showcasing the workflow for each project. Chapter 3 details the multiomics atlas of Common Variable Immunodeficiency (CVID). This condition is characterised by defects in the function of B cells, a type of adaptive immune cells capable of producing antibodies to fight infections. I analyse gene expression and chromatin accessibility data of B cells from a pair of monozygotic CVID-discordant twins. I uncover potential defects in the epigenome of the affected twin’s B cells. Next, after in vitro stimulation of these twins’ PBMCs, I observe CVID-associated transcriptional dysregulation in immune subsets additional to those in B cells.
I discover defects in the immune cell crosstalk between B cells and other immune compartments of the CVID twin. With an expanded cohort of CVID patients and healthy individuals, I go on to further validate these findings. These results show that, in addition to B-cell-intrinsic alterations, defects in cell-cell communication between B cells and other immune compartments may be compromising the correct immune response in CVID. Chapter 4 presents the work on creating a comprehensive spatial multiomics atlas of the maternal-fetal interface in early pregnancy. Firstly, I characterise the signatures and differentiation trajectories of trophoblast cells - the building blocks of placenta. I then focus on the crosstalk between invading trophoblast and maternal immune cells. I predict putative cell-cell communication events and validate in situ the selected molecules mediating these interactions. I propose a model of arterial transformation facilitated by fetal trophoblast and their communication with maternal cells. This work expands our knowledge about the cellular and molecular players in the maternal-fetal dialog in the first trimester of pregnancy, definitive of its success. Chapter 5 describes the work on modelling the dialog between decidual natural killer (dNK) cells, a type of innate immune cell most abundant in pregnant decidua, and the invading trophoblast at the maternal-fetal interface using primary trophoblast organoids (PTO). I benchmark the PTO system against the in vivo trophoblast atlas I described in chapter 4. After defining trophoblast cell states in vitro, I perform comparative analysis of PTOs stimulated with a cocktail of chemokines that in vivo are secreted by dNK cells and unstimulated PTOs as control. I propose a putative effect of the signals from dNK cells on trophoblast invasion in the first trimester of pregnancy. 
Lastly, Chapter 6 provides an overview of all the described work, as well as a discussion of how the novel high-throughput multiomic and spatial technologies together with in vitro models shape our current view of fundamental biology, and how they will impact future directions of research.
  • ItemEmbargo
    Genetic architecture of transcript splicing in blood and phenotypic consequences
    Tokolyi, Alexander [0000-0003-4222-7484]
    Transcript splicing is a fundamental process which allows for the generation of multiple different isoforms from a single gene body, increasing the functional capacity of our genomes. Bulk RNA-sequencing has allowed us to analyse this at scale by investigating not only the amount of a gene product detected, but where, and by how much, certain parts of genes have been excised. This thesis presents the largest to-date investigation into the genetic architecture of transcript splicing in blood, utilising a deeply-phenotyped cohort of 4,732 healthy adults, in addition to 638 adults presenting to intensive care units with sepsis. I first explore the genetic architecture of transcript splicing through the generation of splicing quantitative trait loci (sQTLs) in a healthy cohort of blood donors from the INTERVAL study. Transcript splicing is quantified through the use of split-reads present in RNA-seq data using the LeafCutter pipeline, allowing the quantification of transcript splicing without regard to established reference annotations. As the derived splice events do not have a 1-to-1 mapping to currently defined isoforms, I created a pipeline to richly annotate these splice events to aid in subsequent analyses. This resulted in 29,514 *cis*-sQTLs in 6,853 *cis*-sGenes, and I demonstrate large overlap with previous findings in addition to a plethora of new associations. Using *cis*-eSNPs derived from the same cohort, I perform a targeted *trans*-sQTL analysis under the hypothesis that *trans*-sSNPs could regulate splicing through the regulation of certain gene products involved in splicing. This validated the few currently known *trans*-sQTL associations, and provides a total of 642 splice events (in 208 sGenes), including known splice factors.
Due to the magnitude and novelty of the created information, I develop an interactive online portal for browsing and exploring these sQTL results, into which subsequent analyses are incorporated, creating an interpretable form of the results generated by this thesis. This portal is publicly available at: https://intervalrna.org.uk/. As the INTERVAL cohort is deeply phenotyped, containing protein measurements in plasma along with metabolites, lipids, and their genetic associations, I perform colocalisation analysis of these with the generated sQTLs to explore their shared genetic architecture. This reveals that many splice events and molecular phenotypes appear to be regulated by shared genetic effects, and through examples I demonstrate how splicing could be modulating these downstream phenotypes through mechanisms such as changes in solubility. As a proof of concept I then compare public GWAS statistics for immune and blood-related diseases with both sQTLs and those of the downstream molecular phenotypes, detailing many splicing-mediated pathways of disease through which risk loci are putatively acting, the majority of which are independent of eQTLs. To investigate the interaction of genetic and environmental effects on transcript splicing in disease, I utilise the GAinS cohort of 638 adults who had blood taken on arrival at the ICU with sepsis. Using this, I explore the transcriptomic differences between these individuals that are explained by transcript splicing and how this information can be used to predict patient status, and subsequently compare the shared genetic architecture of these splicing events with those of the healthy individuals and the previously defined downstream associations with molecular phenotypes. Notably, through colocalisation with summary statistics for COVID-19 susceptibility and severity, I observe risk loci shared with those impacting transcript splicing in the sepsis patients that were not observed in the healthy individuals.
In summary, this thesis provides an in-depth analysis of the genetic architecture of the largest to-date catalogue of transcript splicing, explores their utility in explaining the regulation of downstream molecular phenotypes, and demonstrates how these associations can be used to understand the mechanistic pathways of risk loci.
  • ItemOpen Access
    Probabilistic Dynamical Modelling of Spatiotemporal Cell Trajectories During Neural Development
    Aivazidis, Alexander
    In this PhD thesis I present two new computational models, Cell2fate and CountCorrect, for the analysis of single-cell and spatial transcriptomics data and I show how they can be applied to more effectively map the rules of brain cell development in health and disease. Cell2fate is an RNA velocity model for inference of transcriptional dynamics from spliced and unspliced RNA counts. Unlike existing models, cell2fate is capable of capturing complex biological processes while still being analytically tractable. This is achieved with an implicit factorization of RNA velocity solutions into modules, which also enhances statistical power and interpretability. By evaluating cell2fate in various real-world scenarios, I demonstrate its enhanced ability to capture complex dynamics and weak dynamical signals in rare and mature cell types. Finally, I apply cell2fate to developing mouse and human brain single cell datasets, where I also demonstrate that RNA velocity modules can be mapped to parallel spatial transcriptomics data. The CountCorrect model provides new normalization and cell type mapping methods for the Nanostring WTA spatial transcriptomics technology that take into account background binding of RNA probes. I use CountCorrect to analyze a spatial transcriptomics dataset of the human developing cortex, which revealed spatial autism enrichment patterns, a cortical cell type abundance map and differential gene expression patterns in Cajal-Retzius cells across developmental time and cortical regions.
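    For orientation, the spliced/unspliced dynamics that RNA velocity models generally build on can be written as the pair of rate equations below. This is the commonly used textbook formulation, not cell2fate's own parameterisation or its module factorization; the symbols (u for unspliced counts, s for spliced counts, α for transcription rate, β for splicing rate, γ for degradation rate) are assumptions introduced here for illustration.

```latex
\begin{aligned}
\frac{du}{dt} &= \alpha(t) - \beta\,u(t), \\
\frac{ds}{dt} &= \beta\,u(t) - \gamma\,s(t).
\end{aligned}
```

    Under these assumptions, a constant α gives the steady state u = α/β and s = α/γ, and the sign of ds/dt (the "velocity") indicates whether a gene's spliced abundance is currently rising or falling.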
  • ItemEmbargo
    Probabilistic models to resolve cell identity and tissue architecture
    Kleshchevnikov, Vitalii [0000-0001-9110-7441]
    Cell identity drives cell-cell communication and tissue architecture and is in return regulated by cell-extrinsic cues. Cell identity is determined by the combination of intrinsic developmentally established transcription factor (TF) use and constitutive as well as cell communication-dependent TF activities. In my thesis, I developed two probabilistic models that advance the understanding of these processes using single-cell and spatial genomic data. Spatial transcriptomic technologies promise to resolve cellular wiring diagrams of tissues in health and disease, but comprehensive mapping of cell types in situ remains a challenge. I present cell2location, a Bayesian model that can resolve fine-grained cell types in spatial transcriptomic data and create comprehensive cellular maps of diverse tissues. Cell2location accounts for technical sources of variation and borrows statistical strength across locations, thereby enabling the integration of single-cell and spatial transcriptomics with higher sensitivity and resolution than existing tools. We assess cell2location in three different tissues and demonstrate improved mapping of fine-grained cell types. In the mouse brain, we discover fine regional astrocyte subtypes across the thalamus and hypothalamus. In the human lymph node, we spatially map a rare pre-germinal centre B cell population. In the human gut, we resolve fine immune cell populations in lymphoid follicles. Collectively our results present cell2location as a versatile analysis tool for mapping tissue architectures in a comprehensive manner. Cell identity and plasticity are regulated by a combinatorial code mediated by transcription factors and the cell communication environment. Systematically dissecting how the regulatory code robustly defines the vast complexity of cell populations across tissues is a long-standing challenge.
Measured using the assay for transposase-accessible chromatin with sequencing (ATAC-seq), DNA accessibility provides a readout of intermediate gene regulation steps at single-cell resolution, with technologies measuring both RNA and ATAC providing the necessary evidence to build mechanistic models of regulation. Existing methods address one or several subproblems of modelling DNA accessibility. For example, the DNA sequence-based deep learning models represent combinatorial interactions and in-vivo TF-DNA recognition preferences. In contrast, GRN models use TF abundance profiles across cells and in-vitro-derived TF-DNA recognition preferences, optionally incorporating ATAC-seq data as a filter. All models learn cell-type specific weights/properties and don't generalise to new TF abundance states such as new cell types. Therefore, we are missing an end-to-end mechanistic model that represents all steps of the biological process, that generalises to both new DNA sequences and TF abundance combinations and can simultaneously characterise hundreds to thousands of cell states observed in single-cell genomics atlases. Here, I formulated cell2state, a mechanistic end-to-end probabilistic model of TF recruitment to a chromatin locus and downstream TF effect on DNA accessibility. Cell2state is designed to achieve the generalisation of regulatory predictions to unseen cell types. Cell2state A) estimates TF nuclear protein abundance and models B) how TFs recognise DNA, C) how TF sites in DNA lead to TF recruitment to a chromatin locus, D) how the activity of DNA-associated TFs affects chromatin accessibility. To evaluate generalisation, I defined the computational problem and developed a workflow for predicting the scATAC-seq readout for previously unseen chromosomes and cell types. I show that cell2state outperforms the state-of-the-art deep learning models (ChromDragoNN) at explaining DNA accessibility differences across cells. 
Finally, to look at cell state plasticity, I developed ways to use cell2state to simulate the possible chromatin states given TF abundance of source cell types.
  • ItemEmbargo
    Phylogenetic studies into the development of foetal tissues and their neoplastic derivatives
    Oliver, Thomas
    Over a lifetime, each cell in the human body acquires a unique combination of somatic mutations that encode its ancestry, exposure to mutagens and strategy for optimising survival. Studies into normal and neoplastic tissues in adults have delineated clonal architecture and oncogenesis at an exquisite resolution. However, for over a century, data have indicated that childhood cancer is rather different, most likely emerging as an aberration of foetal development. This thesis explores how foetal tissues and their neoplastic progeny propagate, focusing specifically on the placenta, germ cell tumours and high-grade midline gliomas. In Chapter 1, I introduce the principles and technological advances that allow us to infer development from somatic mutations. I highlight existing evidence for the distinct origins of childhood and adult cancers and discuss the unique mutational forces that prenatal cells endure. Applying whole genome sequencing (WGS) to bulk and microdissected placental tissues, I begin my own lines of enquiry in Chapter 2. I show that placental trophoblast is unique amongst normal tissues in its clonal construction and sustains a pattern and rate of mutation normally seen in cancer. Each placental biopsy represents a driverless expansion of a spatially-fixed early embryonic precursor, making the placenta inherently mosaic. I turn my attention to neoplasia in Chapters 3 and 4. Using WGS from bulk and microdissected germ cell tumours in Chapter 3, I detail differences in cancer genomes by age group, including their mutational exposures. Where tumours form many, apparently normal tissues, RNA sequencing captures underlying foetal transcriptional signals and diversity that cannot be explained by mutation. In Chapter 4, I outline my work using WGS from post-mortem normal and neoplastic tissues of three children with high-grade midline gliomas. 
Germline *NF1* mutation is associated with independent, second *NF1* hits that pervade the macro- and microscopically unremarkable brain and spinal cord. Each glioma is characterised by abundant subclonal drivers with different lineages mutating the same genes recurrently, possibly exacerbated by radiotherapy. I conclude in Chapter 5 by exploring the clinical and histopathological utility of these types of experiments and considering new studies to gauge the impact of mutation on organogenesis. Lastly, I highlight other areas of child health where study of somatic mutation may prove beneficial.
  • ItemOpen Access
    Discovering variation from cell atlases: comparative methods for single-cell genomics
    Dann, Emma
    Single-cell genomics technologies have become the norm to investigate cell-to-cell heterogeneity and collective research efforts have built “cell atlases” to characterize previously unknown cell types and states in tissues. For this task, the collection of cells from multiple donors or tissue sites is often required, and differences between cell populations from different samples are largely attributed to technical variation. More recently, multi-sample and multi-condition experiments are being designed to quantify differences between biological conditions, using atlases from healthy tissues as references. These studies require robust quantification of biological differences between cellular phenotypes while accounting for variability between samples and conserving fine-grained information on cell-to-cell heterogeneity. The work presented in this thesis focuses on development and application of computational strategies and best practices for comparative analysis between samples of different biological conditions profiled with single-cell genomics. Chapter 2 presents Milo, a statistical framework for differential abundance testing on single-cell data. By quantifying differences in cell abundances between conditions in partially overlapping neighborhoods on a k-nearest neighbor graph, Milo can identify perturbations that are obscured by discretizing cells into clusters and it maintains false discovery rate control even in the presence of batch effects. I present a comprehensive benchmark against alternative differential abundance testing strategies, using simulations and scRNA-seq data. I then demonstrate the utility of Milo by studying perturbations across lineages in a dataset of human liver cirrhosis. Chapters 3 and 4 present a case study where Milo and other integration and comparative analysis methods were used to study the development of the human immune system as a distributed network across tissues. 
In Chapter 3, I describe the integration of scRNA-seq data of almost a million cells from nine prenatal organs across 11 weeks of gestation, to define common and tissue-specific immune cell populations, and how these compare with immune cell states identified in adulthood. Using this integrated view, I show how I used Milo to identify stage- and tissue-specific subpopulations of myeloid and lymphoid cells, and discuss their potential role in maturation of immune function and tissue morphogenesis. Chapter 4 is centered around the analysis of spatial cellular environments across fetal tissues. By integrating our cross-tissue scRNA-seq atlas with spatial transcriptomics data, we identified and compared cellular niches in the developing liver, spleen, thymus and gut. In Chapter 5, I present a systematic meta-analysis to identify best practices to identify cell states altered in human disease using integration and differential analysis on single-cell datasets. In particular, I examined whether atlas datasets are suitable references for disease-state identification or whether matched control samples should be employed, to minimise false discoveries driven by biological and technical confounders. By quantitatively comparing the use of atlas and control datasets as references for identification of disease-associated cell states, on simulations and real disease scRNA-seq datasets, I show that reliance on a single type of reference dataset introduces false positives. Conversely, using an atlas dataset as reference for latent space learning followed by differential analysis against a matched control dataset leads to precise identification of disease-associated cell states. 
I demonstrate how optimized design can guide discovery in two applications: using a cell atlas of blood cells from 12 studies to contextualise data from a case-control COVID-19 cohort to detect cell states associated with infection, and distinguish heterogeneous pathological cell states associated with distinct clinical severities. In parallel, I used a healthy reference lung atlas generated from 14 studies to study single-cell profiles obtained from patients with idiopathic pulmonary fibrosis, characterising two distinct aberrant basal cell states associated with disease and identifying unique marker genes with therapeutic potential. In the final chapter, I discuss how advancements in multi-condition single-cell analysis open up a new phase of population-level tissue genomics studies, and the impact that it will have on biomedicine.
  • ItemOpen Access
    Gene Regulatory Networks at Single-Cell Resolution: an approach to exploring the impact of genomic regulation on cellular heterogeneity
    Xu, Zhihan
    The computational analysis of single-cell RNA sequencing data provides a great opportunity to infer gene regulatory associations in different tissues, cell types and cells. However, there are many challenges still to be overcome. In this thesis, I discuss why this opportunity is fascinating but challenging from a computational point of view, and I build a computational method to demonstrate the plausibility of inferring one gene regulatory network (GRN) for each single cell, which I then evaluate from multiple perspectives. In Chapter 2, I investigate data-fitting models in existing computational methods which infer GRNs. Taking full advantage of the single-cell resolution of the input sequencing data leads to the use of machine learning approaches, such as instance-wise feature selection models. I assess their eligibility at a methodological level, implement an instance-wise feature selection model, and aim to generate GRNs which learn cell-to-cell variations from single-cell sequencing. Based on this implementation, I build a new method to infer GRNs at single-cell resolution, or single-cell specific GRNs (scGRNs). The inferred scGRNs can be used to explore GRNs across cell types. Since the true underlying biological mechanism in an scGRN cannot be known, my method is benchmarked against available biological network databases, which do not reveal cell-to-cell variation in GRNs. Analysis of scRNA-seq data can provide biological insights into cellular heterogeneity because of its single-cell resolution; since scGRNs also contain cell-resolution information, they can likewise be analysed with respect to their corresponding cell-related properties. Since scGRNs are generated from scRNA-seq data, the inherent cell-related patterns in scGRNs should not deviate substantially from the results of existing analyses of scRNA-seq data. 
A resulting scGRN suggesting very different cellular information may harm its reliability even before conducting GRN validation, so cellular information provides a powerful angle to evaluate scGRN. In Chapter 3, I build a pipeline to analyse scGRNs and I refer to three cell-related properties for comparison - cell types, cell-type trajectories and cell-type specific marker genes. The results demonstrate the difference in the implication of derived cell-related properties from scGRN and scRNA-seq. As this thesis’s ultimate goal, the analysis for resulting scGRN endows an unprecedented opportunity to explore changes in regulatory patterns along cell types or tissues. Thanks to the single-cell resolution in biologically interpretable scGRN, it is flexible to conduct various analyses such as implications from cell type specific GRNs. Chapter 4 focuses on the exploration of interactions between regulatory edges and cell subpopulations from scGRN. In other words, scGRNs are analysed to explore the changes in GRN edges across cell types, cellular lineages or organs. Based on the existing biological understanding, some expected patterns are summarised to evaluate scGRN. Unlike a quantitative score indicating the performance of a method, the evaluation of patterns is still sufficient to differentiate unreasonable scGRN candidates, and this work provides a novel perspective to evaluate GRNs. Besides the utility for evaluation, the observed patterns also suggest some meaningful biological implications about the impact of genomic regulation on cellular differentiation. I hope the methodology developed in this thesis is helpful to inspire more method developers to pursue scGRN, and I hope multi-angle evaluations of scGRN demonstrated here can facilitate more biological insights into the regulatory mechanisms driving cellular heterogeneity.
  • ItemEmbargo
    Developing and applying new methods to understand blood stage growth in Plasmodium falciparum
    Muhwezi, Allan
    *Plasmodium falciparum* parasites cause nearly half a million deaths from malaria each year. There is still no highly effective vaccine, and resistance has emerged or is emerging to all current drugs. All the pathology and symptoms associated with malaria are caused by the growth of these parasites inside human red blood cells. If the parasites are to survive and be successful, they must invade, grow and survive under diverse micro-environmental conditions within red blood cells. Understanding the *P. falciparum* genes that regulate these key developmental processes could lead to the identification of targets for new drugs. However, while sequencing efforts have led to a good understanding of the *P. falciparum* genome and how it evolves over time and space, a disconnect remains between the amount of genome sequence data and what exactly these genes do – the phenotype. This therefore underpins my PhD thesis with a major goal of bridging the gap in our understanding of gene function in *P. falciparum*. I have developed high-throughput approaches combining next generation sequencing, flow cytometry and cell sorting to develop assays to accurately phenotype key aspects of the parasite life cycle such as invasion, cell cycle progression, multiplicity of invasion and replication capacity. These assays have been applied to a panel of *P. falciparum* genes in which genes potentially involved in specific developmental transitions have been deleted and reveal new phenotypes not described elsewhere in malaria literature.
  • ItemOpen Access
    Understanding Development of Human Immunity One Cell at a Time
    Suo, Chenqu [0000-0002-8813-0875]
    The emergence of single-cell and spatial multi-omic technologies has revolutionized our understanding of the immune cells. The international Human Cell Atlas consortium has spearheaded and coordinated a global effort to construct atlases of human tissues across multiple developmental stages. This is revealing the identity and function of cells at unprecedented resolution and depth, enhancing our understanding of the immune system in health and disease. The Human Cell Atlas data is also providing us with valuable prior knowledge to guide *in vitro* engineering of immune cells towards specific cell states. This thesis aims to study human immune cell development from both *in vivo* and *in vitro* angles, utilizing both experimental and newly developed computational approaches. Importantly, I focus on applying insights from single-cell genomics studies of human immune cell development to *in vitro* lymphocyte engineering. Chapter 1 introduces the recent advances in single-cell technologies in the context of previous methods used to study immune cells, summarizes human immune cell development with a focus on lymphocyte development, and presents published efforts in *in vitro* T cell engineering. Chapter 2 covers the assembly of a multi-organ single cell atlas of the developing human immune system and describes the insights derived from this comprehensive atlas. We uncovered system-wide blood and immune cell development in organs other than primary haematopoietic organs, and characterized prenatal innate-like B and T cells, namely B1 cells and unconventional T cells in humans for the first time. Chapter 3 is dedicated to a new computational tool, Dandelion, that we developed for single cell antigen receptor sequencing (scVDJ-seq) data analysis. We also devised a novel strategy to leverage scVDJ-seq data in pseudotime trajectory inference. 
The application of Dandelion improved the alignment of human thymic development trajectories of double positive T cells to mature single-positive CD4/CD8 T cells, and provided novel insights into the origins of human B1 cells and ILC/NK cell development. Chapter 4 introduces another computational tool, Genes2Genes, for aligning two single-cell trajectories. By applying Genes2Genes to *in vivo* and *in vitro* T cell development, we found that *in vitro* single positive (SP) T cells were matched to an immature state of the *in vivo* SP T cells while lacking the final TNFα signaling. Chapter 5 summarizes the insights gained from my work on human immune cell development and highlights potential future directions of research in this area.
  • ItemEmbargo
    The influence of genetic background on drug resistance in the malaria parasite Plasmodium falciparum
    Carpenter, Emma [0000-0002-1911-6842]
    *Plasmodium falciparum* is a parasite that causes the most severe forms of human malaria; with multidrug resistance to modern antimalarials emerging and spreading in certain parasite populations, understanding the mechanisms for acquiring multidrug resistance will be key for predicting which genes may become important for clinical resistance in the future. Transporter proteins have key roles in drug resistance across a variety of organisms — in *P. falciparum*, the spread of resistance to the antimalarial chloroquine in the 1950s and 60s is largely due to mutations in the Chloroquine Resistance Transporter (PfCRT). Variations in the ABC transporter *pfmdr1* modulate sensitivity to multiple antimalarials, including mefloquine and lumefantrine, popular partner drugs in first-line artemisinin combination therapies. During my PhD, I assessed a panel of parasite lines that had undergone *in vitro* evolution experiments, identifying polymorphisms in a poorly characterised ABC transporter, ABCI3, that confer resistance to several experimental antimalarial compounds with diverse chemical scaffolds, suggesting that ABCI3 could mediate resistance to next-generation antimalarial drugs. Next, I examined a novel PfCRT mutation currently spreading in Southeast Asia that confers piperaquine resistance when edited into the wild-type *pfcrt* allele of laboratory lines using CRISPR/Cas9, raising concerns that piperaquine resistance could arise in susceptible populations with a single nucleotide polymorphism. Finally, I introduced unique 11-nucleotide ‘barcodes’ into 37 progeny from the NF54 × Cam3.II genetic cross using CRISPR/Cas9. The African NF54 parasite is considered wild-type and is broadly drug susceptible. The Cambodian Cam3.II parasite is multidrug-resistant owing to numerous mutations, including the R539T mutation in *pfkelch13*, a gene associated with artemisinin resistance. 
R539T has been demonstrated to have a high fitness cost, and similar PfKelch13 mutations display fitness costs that are exacerbated when engineered into non-Southeast Asian parasites. Combination of ‘barcoded’ parasite lines into a single flask, or ‘pool’ creates a highly valuable screening resource, allowing one to observe the change in proportion of each progeny within this pool over time via next-generation sequencing of the barcoded locus; linkage analysis can therefore be performed on data generated within a single experimental run. Using this tool, I note the enrichment of certain haplotypes of interest under various antimalarial pressures, including *pfabci3*, *pfcrt* and *pfkelch13* variants. I discuss the future avenues of research that will reveal the quantitative trait loci responsible for the differential survival of progeny under these antimalarial pressures, which will shed light on the complex interplay between drug resistance mutations, fitness, and genetic background.
  • ItemOpen Access
    Chromosome evolution in Rhabditina (Nematoda) with a focus on programmed DNA elimination
    Gonzalez De La Rosa, Pablo Manuel
    Recent advances in DNA sequencing technologies have enabled research on the mechanisms that shape chromosome evolution in diverse taxa. My thesis focuses on nematodes from the Rhabditina suborder, which exhibit a high rate of chromosome rearrangement and have several species with chromosome-level assemblies available for comparison. To study chromosome evolution in this group, I inferred a set of loci that comprised the ancestral linkage groups, called Nigon elements, in rhabditid nematodes. This showed that the sex chromosome of the model nematode Caenorhabditis elegans is the product of a fusion of an ancestral autosome and the ancestral sex chromosome. As part of this work, I also generated the first telomere-to-telomere genome with no gaps of unknown sequence of a non-Caenorhabditis nematode, Oscheius tipulae. We serendipitously discovered that O. tipulae undergoes programmed DNA elimination (PDE) by identifying that both ends of each chromosome had two alternative terminal telomere repeat arrays. PDE is a developmentally controlled process whereby somatic cells selectively lose part of their genome. In total, less than 0.5% of the 60 Mb O. tipulae genome is eliminated. To gain insights into the evolution of PDE, I generated chromosome-level assemblies for four Oscheius species that are closely related to O. tipulae. These revealed PDE in all species, and linked PDE with variations in telomere length and breakage-associated sequence motifs. I characterised a novel gene family linked to chromosome ends, derived from helitron2-like transposable elements, and variably eliminated regions in Oscheius dolichura involving divergent paralogs of conserved genes. To facilitate comparative research, I propose a framework outlining a standard set of features for future studies. Finally, I applied this framework to characterise PDE in Auanema rhodensis, a species closely related to Oscheius in which over 60% of its genome is eliminated by PDE. 
The eliminated regions in A. rhodensis are also delineated by a sequence motif, contain protein-coding genes, ncRNAs, and share the same large tandem repeats between all autosomes. Together, my research sheds light on the evolution of nematode chromosomes and PDE.
  • ItemOpen Access
    Geographic Migration and Evolution of Streptococcus pneumoniae
    Belman, Sophie
    Streptococcus pneumoniae (the pneumococcus) is a human-obligate, opportunistic bacterium. It resides asymptomatically in the nasopharynx of both children and adults. It occasionally goes on to cause local disease such as otitis media, or severe invasive disease as with pneumonia and meningitis. It is prevalent globally and rates of carriage have an inverse relationship with country income. The pneumococcus comprises a massive diversity of >800 global lineages and 100 antigenically distinct serotypes. Many lineages and serotypes co-circulate endemically in any given location. Highly immunogenic conjugate vaccines have been implemented in >76% of national immunization schedules; we also have many antimicrobials to which the pneumococcus is susceptible. Following the introduction of the pneumococcal conjugate vaccine in the 2000s there have been global ecological expansions and contractions of serotypes, lineages, and antimicrobial resistance (AMR). There are distinct qualitative patterns which characterise the global geographic structure of both the endemically circulating pneumococcus and the constantly changing ecology of its genes and loci. The extent and mechanisms of spread, and vaccine-driven changes in fitness and AMR, remain largely unquantified. Using geolocated genome sequences from South Africa (2000-2014) I developed models to reconstruct spread. I implemented simple statistical frameworks which can account for variable surveillance in both space and time. I also pair detailed human mobility data from Facebook users with genomic data within a mechanistic model to describe how human movement drives the geographic spread of the pathogen. I estimated that pneumococci only become homogeneously mixed across South Africa after about 50 years of transmission, with the slow spread driven by the focal nature of human mobility. 
Further, the human population density in the municipality of introduction appears to be important as well, with a more rapid radius of spread when introduced to rural areas. I include both disease and carriage isolates in our models. There are similar ecological shifts among healthy Cambodian children, following vaccination, as among disease isolates. Among South African samples I utilize a logistic model to estimate the population-level changes in fitness of strains that are (vaccine type, VT) and are not (non-vaccine type, NVT) targeted by the vaccine, as well as differences in strain fitness between those that are and are not resistant to penicillin, continuously across the vaccine period (PCV7: 2009; PCV13: 2011). In the years following vaccine implementation the relative fitness of NVT compared to VT strains increased, with an increasing proportion of these NVT strains becoming penicillin resistant. These estimates indicate that initial vaccine-linked decreases in AMR may be transient. Data on human mobility between countries is scarce, so to understand between-country pneumococcal migration dynamics I developed a simulation approach using Bayesian optimization for likelihood-free inference. Incorporating thousands of pneumococcal genomes from South Africa, Malawi, The Gambia, and Kenya, I included only neutral sites. I estimated migration rates and directions of migration between countries. I found some heterogeneity between lineages and across demes, and an inverse relationship between migration and the destination country population size. This model could be used to inform modeling frameworks on wider pneumococcal spread and the timing and implementation of public health interventions. I also worked to better understand the history and evolution of the hundreds of extant GPSCs by delving into a 5,700-year-old metagenome with genomic reads containing high identity to the pneumococcus. 
While this resulted in identifying a streptococcal ancestor with no polysaccharide capsule it did not further our knowledge of extant pneumococci; as such this paper is included in Appendix A. This work has contributed valuable insight into pneumococcal migration and mechanisms of spread. I have quantified important fitness dynamics and incorporated them into our mobility models. Furthermore, I have developed a method which can now estimate migration parameters asymmetrically between countries elucidating paths of global pneumococcal spread. Together this thesis marks the advancement in our understanding of the geographic migration of Streptococcus pneumoniae and lays the groundwork for understanding its diverse, complex, global spread.
  • ItemEmbargo
    Somatic Mutations in Ageing and Degenerative Disease
    Harvey, Luke
    Following the first division of the zygote, somatic mutations begin to accumulate in all human cells. Daughter cells become increasingly mutated leading to mosaic tissues, composed of genetically heterogeneous clonal units. In recent years, large scale sequencing efforts have begun to characterise the process of mutagenesis in normal tissues. These observations have improved our understanding of how somatic cells evolve oncogenic phenotypes and hinted that somatic changes may contribute to the development of age-related phenotypes and degenerative disease. To accurately characterise the mutational processes in normal tissues, we developed nanorate sequencing (NanoSeq), a duplex sequencing protocol with error rates of less than five errors per billion base pairs, allowing for the accurate identification of mutations in single DNA molecules. Using NanoSeq we describe the mutational processes in the nuclear and mitochondrial genome of three distinct cell types across the brain and cardiac muscle. We show that post-mitotic tissues accumulate somatic mutations at comparable rates, and by similar processes to dividing cells. We explore the relationship of mutation rates to transcription and chromatin state, and reveal patterns of transcription-coupled damage and repair in neuronal genomes. In the analysis of somatic mutations in cardiomyocytes, we identify a novel mutational signature that we believe may be caused by hypoxia-induced oxidative stress. Lastly, we demonstrate that neurons isolated from Alzheimer’s disease brains have a lower mutation burden than healthy neurons. To further investigate the role of somatic mutations in degenerative disease, we characterised somatic mutagenesis in rheumatoid arthritis and osteoarthritis. Using laser capture microdissection, we isolated 2000 microbiopsies of intimal lining, sublining, and lymphocytes for whole exome sequencing. 
Using these results we show that the synovium is a mostly polyclonal tissue with the capacity for large clonal expansions. We explore how the histopathology of rheumatoid arthritis and osteoarthritis relates to driver mutations, clonal expansions, and immune infiltration. Using NanoSeq, we characterise the mutational processes in isolated synovial cell types. Finally we show the power of NanoSeq for high-throughput driver discovery and identify 15 genes under positive selection in the synovium. This thesis offers novel insights into the mutation rate of distinct cell types in healthy and disease states. This work contributes to a growing body of literature characterising somatic evolution and provides a foundation for further functional studies to directly investigate the role of mutant clones in disease states.
  • ItemEmbargo
    Enabling the in vitro study of long noncoding RNAs to understand their role in Plasmodium falciparum
    Hoshizaki, Johanna
    Long noncoding RNAs (lncRNAs) have been identified in *Plasmodium falciparum*, the parasitic cause for life-threatening malaria, yet their role remains largely undiscovered. Through interactions with nucleic acids and proteins, lncRNAs can modulate gene expression at the transcriptional, post-transcriptional, translational, and post-translational levels. Determining the role of lncRNAs in the regulation of the *P. falciparum* transcriptome and proteome is imperative to further our understanding of gene regulation in the parasite. The characterisation of *P. falciparum* lncRNAs has been hindered by an incomplete annotation and the absence of disruption methods that together would permit high-throughput systematic knockdown of lncRNAs. During my PhD, I addressed these challenges to enable the study of *P. falciparum* lncRNAs *in vitro*. I generated a high-quality lncRNA annotation using manual curation of sequencing data generated at the Sanger Institute, along with supportive datasets from the literature. I evaluated CRISPR-based approaches for *in vitro* disruption of lncRNAs including gene knockout, knockdown, and interference. CRISPR-associated enzymes were explored including commonly used DNA-cutting enzymes (Cas9), inactivated enzymes to block transcription (dCpf1) and enzymes that target RNA directly (Cas13), the latter of which had not been applied to *Plasmodium*. Furthermore, I implemented these tools to demonstrate the feasibility of lncRNA studies in *P. falciparum*. I interrogated a set of lncRNAs that were selected based on predicted biological significance and targetability using dCpf1. LncRNA-depleted parasites were phenotypically characterised by assessing changes in fitness, drug resistance, gametocytogenesis and expression. I identified potential roles for specific lncRNAs in drug resistance and gametocytogenesis. By developing bioinformatics and molecular tools, this work enables future studies elucidating the specific roles of lncRNAs in *P. 
falciparum*. Understanding the transcriptome and gene regulation will inform the development of novel interventions for the control and eradication of malaria, which remains a serious global health concern.
  • ItemOpen Access
    Quantitative modelling of CRISPR-Cas editing outcomes
    Pallaseni, Ananth
    Development of the CRISPR-Cas toolkit over the last decade has enabled unprecedented control over the genome and unlocked the capacity for new experiments and gene-therapies. However, use of these technologies is made more difficult due to variance in the rate and type of individual genetic outcomes generated by each editor. This variance is reproducible and a function of the sequence being targeted, thus making it amenable to quantitative modelling. My research focuses on building such models in a variety of contexts. In this thesis, I recount the history of gene editing from transgenesis to prime editors. I then review the modelling techniques I use in my projects, before covering the state of predictive modelling for genome engineering. In two main results chapters, I discuss my work on modelling base editor outcomes and the effects of DNA repair context on Cas9-induced double-stranded break repair. The final results chapter covers shorter, collaborative studies on other gene editing technologies. I will conclude by discussing the need for computational tools in genome editing and the gaps in our understanding. In my first results chapter, I examine the sequence- and position-specificity of base editor activity. Base editors are a gene editing technology derived from Cas9 that introduce precise base substitutions into a targeted region of the genome. The rate of these substitutions is known to vary between targeted sequences and the determinants of this variation are not completely understood. To untangle the determinants of base editing efficacy, our group performed a large-scale screen where we measured base editing outcomes across 20,000 targeted sequences in multiple cell lines and editors. I processed and analysed the data produced in this experiment and found that both the sequence flanking editable bases and the position of those bases in the sequence affect the rate of observed editing. 
I leveraged this understanding to construct a position-specific model of base editing activity for each editor type and used these models to predict the efficacy and specificity of base editors for correcting pathogenic variants found in ClinVar. The second results chapter focuses on the mutational outcomes of Cas9-induced cuts in repair deficient backgrounds. Cas9 creates a double-stranded break at a targeted location in the genome and the cell repairs this lesion via several pathways which can leave mutations in the repaired sequence. It has been shown in previous studies that the distribution of these mutations is reproducible and dependent on the sequence being cut, but the effect of repair context on this process is not well understood. I planned an experiment to measure Cas9 repair outcomes at over 5000 target sites in 21 mouse cell lines with knockouts of single repair genes, then processed and analysed the data generated. I show that the knockout cells have reproducibly different repair patterns than controls. I highlight Nbn, Lig4 and PolQ as examples of knockouts with consistent effects on certain mutation types. I examine how the known sequence-determinants of Cas9 outcomes affect outcome preference in knockout lines. Lastly, I use this understanding to train models that predict the distribution of Cas9 outcomes in various repair backgrounds. My final results chapter discusses two shorter collaborative studies on alternative editing technologies. First is the design of a large-scale screen to profile the behaviour of a new Cas enzyme. I explain the experimental process of profiling Cas outcomes, the decisions involved in designing a guide library and an approach to modelling. The other collaboration is the prediction of editing rates when inserting sequences into the genome with prime editors. 
Here, I train a model to predict editing rates, examine which features are most important to predictive performance, and finally determine that collection of more data for training will improve model performance.
  • ItemOpen Access
    Somatic phylogenies: a window into human biology in health and disease
    Spencer Chapman, Michael [0000-0002-5320-8193]
    All somatic cells in the human body originate from the fertilised egg, or ‘zygote’. Co-ordinated processes of cell division and differentiation result in the incredible complexity observed following development. Thereafter, each tissue must maintain its function for the decades of life that follow, while minimising risks such as cancer. Crucial to understanding the strategies employed to meet these demands is knowing which cell gives rise to which, i.e. the structure of the somatic lineage tree. For this purpose, it is fortuitous that most cell divisions result in the acquisition of somatic mutations that are passed on to all future progeny of the dividing cell. Therefore, in theory, if all somatic mutations in each cell of an organism were known, the organism’s full somatic lineage tree could be determined. While this is not yet possible at the scale of an entire organism, advances in clonal expansion and whole-genome sequencing technologies have facilitated such experiments on the scale of hundreds of cells in a single individual. This thesis contains three results chapters, each utilising somatic phylogenies in different ways. The first looks at early human development. Due to the small number of cells in early development, a near-complete phylogeny can be determined. Assessing contributions of early lineages to different tissues is used to determine the developmental relationships and origins of tissues. The second looks at the clinically important context of gene therapy for sickle cell disease. Comparisons of somatic phylogenies from before and after the gene therapy procedure give insights into the biology of sickle cell disease and the safety of gene therapy in this context. Finally, the third chapter looks at somatic phylogenies from a different perspective: as a way of testing our assumptions regarding mechanisms of mutation acquisition. 
The observation that some mutations do not fit these assumptions leads to the surprising conclusion that some mutation-causing DNA lesions persist unrepaired for months to years. Overall, this thesis demonstrates the power of interrogating somatic phylogenies to answer important questions in diverse biological disciplines. As technological advances allow such experiments on ever larger scales, the power of this approach will only increase.