Theses - Wellcome Sanger Institute
Recent Submissions
Item Embargo
Sperm sequencing reveals extensive positive selection in the human germline
Neville, Matthew
Over the course of life, cells of the human body accumulate DNA mutations due to damage from intrinsic causes or exposure to mutagens. Mutations that occur in reproductive cell lineages are known as germline mutations and have the potential to be transmitted to offspring. Germline mutations serve as the origin of all heritable genetic variation, making them crucial in the study of evolution and disease. The majority (~80%) of germline mutations in humans are paternal in origin. However, direct observation of mutations in sperm, the gametes of the paternal germline, has been limited by the need for a low error rate sequencing technology. Using the duplex sequencing method known as NanoSeq, which achieves a sufficiently low error rate, we sequenced bulk sperm and whole blood from the same individuals. We show that mutation rates and mutation signatures in sperm are consistent with results from trio studies and we contrast mutation rates in the germline with those in blood. Applying a targeted version of NanoSeq, we then generated a large dataset of coding mutations from sperm. Our findings reveal extensive positive selection in the male germline, implicating new genes, pathways, and mutational mechanisms in this process. Annotation of the positively selected genes identified in our study showed that most are known to cause disorders when transmitted to offspring as germline mutations. Furthermore, we quantified the fraction of sperm carrying pathogenic variants per individual, highlighting an increased disease risk for children born to fathers of advanced age.
These findings shed light on the dynamics of germline mutations and have important implications for our understanding of human disease.

Item Open Access
Defining cell state regulators in cancers using single-cell analysis and CRISPR-Cas9 screening
Edwards, Olivia
As high-throughput transcriptional profiling becomes more sophisticated and accessible, the impact of cancer heterogeneity on clinical outcomes is being realised. Whilst genomic variation has been extremely useful for cancer stratification and development of targeted therapies, it does not always underpin variable therapeutic responses, especially those that display higher levels of plasticity. Development of single-cell profiling has had a particular impact on elucidating the composition of cell states occurring within individual tumours, and offers high-dimensional data which can be utilised for unsupervised signature extraction. The aim of this project was to identify clinically relevant signatures of transcriptional heterogeneity in cancers by mining a pan-cancer single-cell RNA sequencing dataset spanning 198 cell line models across 22 cancer types [1]. A dimensionality reduction method which resolves continuous expression signatures at multiple resolutions was used to resolve a range of behaviours, from consistent intra-sample cell states to cancer subtypes. In the analysis of melanoma models, three main signatures were defined that reflected distinct subtypes characterised by their differential invasive and proliferative properties. Genes highlighted as putative regulators of these subtype signatures were screened using a single-cell RNA-seq coupled CRISPR knock-out approach, with regulator potential uncovered for multiple targets including SOX10, MITF, EIF3G, PRPF19, RPS27A, and CDC20.
Cell line annotations achieved through previous high-throughput screens were also leveraged to uncover associations between the defined heterogeneous expression signatures and features reflecting genetic variants, gene essentiality, and drug response. CRISPR perturbation screening was again used to validate the potential for putative melanoma subtype regulators to modulate the response to Rac inhibition, with results suggesting an overlap in gene regulatory networks between melanoma subtype and context-specific responses to this inhibitor.

Item Embargo
Single-cell transcriptomics identifies key components in metabolic pathways of drug perturbed activated CD4+ T cells
Ke, Ziying
CD4+ T cells play a critical role in the development of autoimmune diseases. During CD4+ T cell activation, remodelling of metabolic pathways is required for the cells to exert their effector functions. The importance of these pathways is highlighted by the successful therapies for immune diseases that target metabolic pathways. Although key metabolic processes have been recognized to affect T cell activation and lineage development, how metabolic interventions can skew the outcome of T cell activation and differentiation remains largely unknown. I first introduce the experiments which guided the selection of 19 compounds targeting various metabolic pathways with measurements on murine CD8+ T cells and both naïve and memory human CD4+ T cells. I proceed to describe drug effects on CD4+ T cell day 3 proliferation and gene expression by single cell transcriptional profiling at resting state, 16 hours and 3 days after stimulation with aCD3/aCD28. My observations reveal cell subpopulations present only at certain activation time points and demonstrate drug effects altering gene expression and pathways within specific cell populations. Then, I characterise distinct pseudotime trajectories representing the progression from naïve to effector memory cells at each time point.
With linear models, I identify the immune-related disease genes regulated by both drug effects and the effector status of cells. Leveraging RNA splicing information, I demonstrate that expression changes of genes targeted by compounds resulted in altered T cell lineage development. Furthermore, my analysis highlights important transcription factors and metabolic pathways regulated by the interactions between metabolic perturbation and effector function. The work in this dissertation presents a unique resource of metabolic perturbations in CD4+ T cells and provides insights into understanding the role of T cell metabolism in immune-mediated diseases.

Item Restricted
Identification of novel therapeutic targets in uveal melanoma using CRISPR-Cas9
Chan, Pui Ying
[Restricted]

Item Open Access
The Malleability of Gene Regulation in Healthy Individuals: Analyzing CRISPR-based Screens with Single-Cell RNA-Sequencing Readout across Genetic Backgrounds
Feng, Claudia
A question that has driven the field of quantitative genetics since its inception is that of understanding the relationship between genetic variation and complex traits. In the last few decades, genome-wide association (GWAS) and quantitative trait studies have implicated thousands of genetic loci in various physical and molecular traits. However, where there was hope that traits could be explained by a single, causal variant driving a genetic condition, this is only rarely the case, with the majority of traits driven by the aggregation of common, low-effect variants causing changes in gene regulation. Most puzzlingly, even where there is a clear disease-causing mutation, different individuals can often present a diverse range of disease phenotypes, from no discernible symptoms to severe disease. Deconvoluting the impact of high-effect variants, the combined effects of genetic background and the environment remains an important question with many practical implications.
Existing methods for teasing apart the many contributing factors require specific data-sets and are infeasible for many rare diseases, nor can the effects of lifestyle and environment be completely distinguished. Recent technological advances in genome engineering, sequencing technology and cell line differentiation can be used to mitigate these challenges. However, these techniques have been limited in scale and do not account for variation in genetic background. In this thesis, I explore four CRISPR-based screens with single-cell RNA sequencing read-out (scRNA-seq) that together map out an evolution of experiments that start from conducting a proof-of-concept CRISPRi screen targeting tens of genes across two cell lines to one targeting thousands of genes pooled across tens of genetic backgrounds. In doing so, we establish experimental and power considerations for knocking down genes with variable essentiality across pooled cell lines, as well as a computational framework for interpreting mean and variation in transcriptomic consequences from gene knockdown. Work done during my PhD and presented in this thesis is the second genome-scale experiment of its kind and is the only study that accounts for the role of genetic background on gene function, establishing a necessary and fundamental building block for conducting population-scale genetic analyses.

Item Embargo
Selective Neuronal Vulnerability in Neocortices from Patients with C9ORF72-related Neurodegeneration
Kwa, Jing Eugene
Amyotrophic lateral sclerosis (ALS) is a rapidly fatal neurodegenerative disorder classically characterised by motor neuron death. However, genetic and histopathological overlap with frontotemporal dementia (FTD) suggests extra-motor involvement, and that pathology likely extends to other neuronal populations (Chapter 1). This may be especially evident in patients with the hexanucleotide repeat expansion in the gene C9ORF72 (c9HRE): the most common genetic cause of both ALS and FTD.
We hypothesise that in c9HRE-related neurodegeneration, neuronal pathology may extend beyond the motor neurons. To date, a comprehensive survey of selective neuronal vulnerability has not yet been performed. With recent advances in single-cell and spatial transcriptomic approaches, our hypothesis can now be investigated in an unbiased fashion and at unprecedented resolution. The work in this thesis focuses on identifying selectively vulnerable neurons in the post-mortem primary motor cortex (M1) of patients with ALS and FTD caused by a . M1 tissue was processed for single nuclei RNA-sequencing (snRNA-seq) and 10X Visium spatial transcriptomics (Chapter 2). I first investigated cortical neuron vulnerability using snRNA-seq (Chapter 3). Upper motor neurons (herein called L5 ETs: layer 5 extratelencephalic-projecting excitatory neurons) were expected to display pathology in c9HRE and served as a positive control. Other vulnerable neuronal populations were identified by examining neurons with similar transcriptomic responses to L5 ETs. In so doing, upper layer excitatory subtypes, PVALB+ fast-spiking basket interneurons, and specific VIP+ interneurons were identified as vulnerable. Transcriptomic changes primarily manifested in neurons as altered mitochondrial, proteostatic, and synaptic function. Minimal evidence of glial involvement was observed. Given the range of developmental origins amongst vulnerable populations and the relative sparing of layer 6, I hypothesised that microcircuit connectivity might determine vulnerability. I investigated this hypothesis using Visium (Chapter 4). Three lines of evidence support the L6 sparing previously inferred from snRNA-seq. First, inferred vulnerable populations are located outside L6. Second, L6 transcriptomic changes are enriched for axon-related genes, and may thus originate from neurons with soma outside L6. 
Third, signature deconvolution suggests expected pathology is significantly higher for c9HRE samples for layers superficial to L5. Taken together, my spatial transcriptomic data support the L6 sparing inferred from snRNA-seq, which in turn reinforces the notion of microcircuitry being predictive of vulnerability (rather than spatial proximity or developmental origin). I also cross-examined the biological pathways highlighted in the snRNA-seq analysis with findings from the literature (Chapter 5). Identified pathways were preserved upon meta-analysis, and suggest a role for mTORC signalling. Importantly, I also observed that existing c9HRE model systems may not be appropriate for understanding post-mortem changes. In summary, this work suggests that selective vulnerability in c9HRE is determined by (and possibly spreads through) neuronal connections. Deeper examination of this hypothesis is limited by existing technology, but nascent approaches that merge connectomics with transcriptomics (Chapter 6) may eventually allow us to investigate this in future.

Item Embargo
Single-cell atlasing of the human tissues across the lifespan
Kedlian, Veronika
Single-cell and spatial sequencing have revolutionised our understanding of human tissue biology in recent years. The Human Cell Atlas (HCA) initiative aims to recover and position all of the cell types in the human body. Being part of HCA, I have worked on building atlases which allow comparison across age groups for two different organs: the human thymus and skeletal muscle. Firstly, I co-led a large multinational effort to create a spatial atlas of the human thymus from fetal and paediatric stages. As a part of this effort, we expanded the census of cell types in the human thymus and mapped them spatially in the tissue. We also developed a new morphological framework (Organ Axis) which allowed us to align and compare cell type locations in fetal vs. paediatric thymus.
Secondly, I led an effort to create an ageing human skeletal muscle atlas, which systematically catalogued cell types and states in young adult and elderly skeletal muscle and described age-associated changes. In Chapter 1 I set the stage for the whole thesis by giving background on single-cell and spatial technologies as well as the research question, namely change across the lifespan. I start with a brief introduction to single-cell atlasing, the huge variety of spatial transcriptomics and proteomic technologies, focusing on Visium and IBEX. Next, I provide an overview of the prenatal and postnatal periods of development, followed by a more in depth discussion on causes of human ageing. Finally, I conclude with a description of the main events and functional changes in thymus and skeletal muscle across the lifespan. In Chapter 2 I give an overview of the spatial human thymus cell atlas. I give a brief introduction to human thymus development, its cell type composition and function. I describe major single-cell (single-cell, CITE-seq and TCR-seq) and spatial profiling modalities (Visium and IBEX) that were applied to the thymus and outline methods for imaging data processing, annotation and cortico-medullary axis construction for the thymus. I provide an annotation and spatial mapping of T cells and supporting resident cells, including thymic epithelial cells (TECs), fibroblasts, vascular and myeloid cells. Simultaneously, I use the cortico-medullary axis to compare the spatial position of cell types in the fetal vs. paediatric thymus. I conclude by discussing the differences in cell type localisation between fetal and paediatric thymus and the importance of spatial structure for thymus function. In Chapter 3 I provide insights from single-cell and spatial mapping of Hassall’s body, a keratinised structure in the thymus medulla, which used to be considered a degenerative epithelial structure. 
Firstly, I map the closest cell types to Hassall’s body and discuss the challenges of associating cell types to Hassall’s vs deep medulla. Next, I identify genes which are uniquely expressed by specific cell types in medulla, which led me to discover an underappreciated heterogeneity within mTECIII population including mucosal and skin-like subtypes. I conclude with a discussion on the putative function of Hassall’s bodies in T-cell development. In Chapter 4 I introduce the human skeletal muscle ageing atlas. I start by summarising knowledge about the main cell types in the muscle and known ageing changes. I describe the experimental setup for the creation of the atlas and give an overview of the recovered cell types and major ageing changes that we observe. Following on from this, I describe fine-grained cell states and ageing changes that occur in the different compartments of skeletal muscle, including muscle stem cells, the myofiber itself and supporting cells of the muscle microenvironment. I conclude with a discussion on how processes in different parts of the muscle combine to cause a breakdown in muscle function over age. In Chapter 5 I summarise the major insights from both studies and discuss their potential therapeutic uses. Next, I use my experience to provide suggestions on experimental design and challenges that need to be solved for the next generation of atlases looking to understand changes across human lifespan.

Item Open Access
Functional genomics of developmental disorders
Hampstead, Juliet
DNA methylation, or the epigenetic modification of primarily cytosine bases within DNA to 5-methylcytosine through the addition of a methyl group, is an epigenetic mark with a variety of biological and cellular roles. Genetic and environmental influences can perturb DNA methylation patterns in humans, and the set of differentially methylated CpG sites perturbed can be collectively called a DNA methylation signature.
In this thesis, I characterise DNA methylation signatures as a diagnostic biomarker for children with rare developmental disorders in chromatin-modifying genes. I show that DNA methylation signatures are a general property of these genes, that they have substantial clinical and diagnostic utility, and that they can be used to resolve variants of uncertain significance. I also show that these signatures are robust across scientific centres and can be generated across multiple tissues. Lastly, I compare DNA methylation signatures generated from methylation microarrays to those generated from genome-wide long read sequencing data, and provide evidence that long read sequencing is a reliable and scalable method to profile 5-methylcytosine for DNA methylation signature-based classification. Overall, my work emphasises the need for scalable, cost-effective, and relatively high-throughput biomarkers in the characterisation and diagnosis of rare developmental disorder syndromes.

Item Open Access
Understanding Genomes Through Engineered Structural Variation
Koeppel, Jonas
Sequencing of the human genome has provided us with a detailed map of its content. While enormous progress has been made towards understanding the 1% of the human genome that is protein coding, we are still mostly in the dark about the function and relevance of the remaining 99%. Progress has been difficult because the non-coding genome is vast, the individual nucleotides hold less information, and we have lacked the tools to engineer and probe it to the necessary extent. This is beginning to change with the advent of ‘search and replace’ genome engineering technologies such as CRISPR prime editing. I leveraged the ability of prime editors to insert recognition sequences for recombinases at high throughput to engineer genomes at an unprecedented scale. In the process, I made discoveries about the biology of genome engineering, structural variation, and gene regulation.
I first outlined the determinants of short sequence insertion using prime editing by systematically measuring the frequency of insertion for 3,604 short sequences in four target sites of three human cell lines with varying DNA repair contexts. I characterized how insertion sequence length and two cellular DNA processing pathways affected the incorporation rate. I reaffirmed that DNA mismatch repair suppressed the insertion of shorter sequences and made the discovery that 3’ flap nucleases TREX1 and TREX2 suppressed the insertion of longer sequences. I further delineated the effects of nucleotide composition and secondary structure of the insertion sequence on editing rates. Next, I targeted a prime editor to the high copy number LINE-1 retrotransposon to insert hundreds of recombinase sites into a single human genome. These engineered cell lines provided a latent substrate for large-scale genome randomization. After induction with Cre recombinase, I mapped thousands of deletions, inversions, extrachromosomal circular DNA, translocations, and fold-back inversions and tracked their abundance over time. Sequencing surviving variants and comparing them to early ones revealed strong selection pressures against creating non-segregable derivative chromosomes or deleting essential genes. However, it also demonstrated that haploid human cell lines could survive while losing megabases of DNA. I isolated 21 cell clones and linked variants to gene expression changes for three clones with multiple Cre-induced rearrangements. Finally, I used prime editing to insert loxPsym sites into the regulatory region of the *OTX2* developmental transcription factor. Cre recombinase induced stochastic deletions and inversions across the recombinase sites, and created diverse and novel enhancer arrangements.
By endogenously fusing *OTX2* with a fluorophore and sorting, I could associate alternative regulatory architectures with *OTX2* expression and track changes in CpG methylation and chromatin accessibility. I discovered that three enhancers in a 20 kb cluster drove 50% of *OTX2* expression and that moving the cluster closer to the transcription start site while simultaneously deleting intermediate regulatory elements resulted in strong *OTX2* expression. The strategies presented here to more efficiently insert short DNA sequences with prime editing, shuffle DNA, and rearrange regulatory regions give a fundamentally new approach to randomizing mammalian genomes which will open new avenues to go beyond the 1% of coding sequence and study the 99% of underexplored regions. The data garnered from molecular phenotyping of novel genome architectures after randomization will allow predictive models to learn parameters beyond the limited diversity of our DNA.

Item Embargo
The impact of cytotoxic chemotherapy on somatic mutation in normal human cells
Dunstone, Eleanor
Since their advent in the mid-twentieth century, cytotoxic chemotherapy drugs have been used to treat hundreds of millions of people with cancer. These drugs remain the most effective form of treatment in many cases, with over half of newly-diagnosed patients requiring chemotherapeutic intervention. Unfortunately, a small proportion of cancer survivors will go on to develop second tumours as a result of the treatment they received for their first diagnosis. These tumours are usually genetically unrelated to the patient’s first cancer, suggesting that cytotoxic chemotherapy may increase the chance of normal cells becoming malignant. Many chemotherapy drugs are thought to exert their effects on tumours by inducing DNA damage, which can generate somatic mutations in any surviving tumour cells.
However, the systemic nature of many treatments means that the patient’s normal cells are also at risk of chemotherapy-induced mutagenesis. Patterns of mutations or ‘mutational signatures’ associated with chemotherapy have been seen in patient tumours, in cells exposed to drugs *in vitro*, and in some normal tissue samples from cancer patients. However, a comprehensive study of the impact of chemotherapy on the somatic mutational landscape of a wide range of normal human tissues has not yet been performed. In this project, I investigate the mutational impact of cytotoxic chemotherapy using two complementary approaches: the identification of chemotherapy-associated mutational signatures *in vitro* using organoid models; and the sequencing of samples of a wide range of tissue types from chemotherapy-treated patients. This work was enabled by recent developments in highly error-corrected duplex sequencing approaches, facilitating accurate detection of mutations at single molecule level. The results of this study show that many widely-used chemotherapy drugs are mutagenic in normal cells, generating distinctive patterns of single-base substitutions (SBS), doublet-base substitutions (DBS), and small insertions and deletions (indels). Treatment of normal human cells with chemotherapy drugs and environmental agents *in vitro* demonstrated mutational signatures associated with a wide range of agents. Alkylating and platinum-based agents were the most mutagenic classes of chemotherapeutics, with all 13 alkylating agent drugs and all four platinum-based drugs tested generating SBS signatures. Many alkylating agents and all platinum-based agents also generated an increase in DBS and indels. Previously undescribed signatures discovered include SBS signatures associated with mitomycin C, lomustine/carmustine, busulfan and thiotepa, alongside a DBS signature of mitomycin C and an indel signature associated with a broad range of platinum-based and alkylating agents. 
Additionally, indel signatures of topoisomerase II inhibitors and bleomycin were described for the first time. The antimetabolite drugs tested generally did not show significant mutagenesis when applied as single treatments; however, a previously-observed signature associated with 5-fluorouracil was recovered when samples were exposed to repeated treatments. The *in vitro* studies also highlighted the complex relationships between compound concentration, cytotoxicity and mutagenicity, with many compounds showing non-linear dose responses or differing relationships between dose and mutagenicity between different organoid tissues-of-origin. Treatment-associated mutagenesis was also shaped by DNA repair, demonstrated by investigating the relationship between *in vitro* temozolomide-induced mutagenesis and the activity of the DNA repair enzyme *O*-6-methylguanine-DNA methyltransferase (MGMT). Temozolomide exposure was shown to generate ten-fold higher SBS burdens in MGMT-knockout human induced pluripotent stem cells (hiPSCs) than in MGMT-wild-type hiPSCs, showing a different mutational signature depending on MGMT status. The sequencing of normal tissue samples from chemotherapy-treated patients showed that many of these drugs are also mutagenic *in vivo*, with eight of the signatures identified *in vitro* being observed in patient samples. Many alkylating and platinum-based agents were shown to have a major impact on SBS burden in normal human tissue samples, with some patients carrying several times as many mutations as would be expected for a person of their age. High excess SBS burdens are seen across many tissue types, including those composed predominantly of post-mitotic cells such as cardiac muscle, skeletal muscle and the neuron-rich cerebellar granular layer, which have historically been considered not to acquire substantial numbers of somatic mutations during adult life.
These burdens are associated with signatures of platinum-based drugs, thiotepa, temozolomide and other alkylating agents. Additionally, DBS signatures associated with platinum-based agents and mitomycin C were observed across many normal human tissue samples, as were indel signatures associated with thiotepa and other cross-linking/alkylating agents. Conversely, some agents that generated signatures *in vitro* appeared not to be mutagenic *in vivo*, including 5-fluorouracil/capecitabine and topoisomerase II inhibitors. Alongside the requirement for repeated treatment *in vitro*, this suggests that antimetabolite-associated mutagenesis may be restricted to dividing cells, and that mechanism of action influences the capacity of different drugs to generate mutations in normal cells. The prevalence of high chemotherapy-associated mutation burdens in normal tissues from cancer patients presents one possible factor contributing to the increased risk of second tumour development in chemotherapy-treated survivors of cancer, and the widespread DNA damage observed in post-mitotic tissues may inform research into other long-term side effects of chemotherapy treatment such as cardiac and neurological dysfunction. The work also generates insights into the fundamental mechanisms of mutation in human cells and provides a compendium of signatures of exposures to chemotherapy drugs and other environmental mutagens, facilitating future efforts to further characterise the distribution of these mutational patterns in different patients and tissue types.

Item Open Access
Dissecting immune interactions in health and disease with multiomics and spatial technologies
Arutyunyan, Anna
Cells are the basic building blocks of life, forming the enormous plethora of tissues and living organisms on Earth. They have a high diversity of phenotypes and functions in different environments.
The high-throughput tools to profile different modalities from a single cell have grown exponentially in recent years. This now allows us to draw a complete picture of how cells function in different environments for the first time. In the work of my thesis, I use high-throughput multiomics and spatial technologies to create comprehensive cell atlases. I focus on studying the immune cell communication among themselves and with other cells in the context of disease and development. Chapter 1 starts with an outline of the background on cell biology and the impact that genomic technologies have on how we can study cellular processes. I then discuss the experimental methodology of high-throughput multiomics and spatial techniques, and computational tools for the analysis of such data. Following is the introduction to the two projects comprising the work of my thesis: (i) a multiomics study of Common Variable Immunodeficiency (CVID) and, (ii) a spatial multiomics map of the Maternal-Fetal Interface (MFI) in early pregnancy in humans. Chapter 2 outlines materials and methods used in this work, showcasing the workflow for each project. Chapter 3 details the multiomics atlas of Common Variable Immunodeficiency (CVID). This condition is characterised by defects in the function of B cells, a type of adaptive immune cells capable of producing antibodies to fight infections. I analyse gene expression and chromatin accessibility data of B cells from a pair of monozygotic CVID-discordant twins. I uncover potential defects in the epigenome of the affected twin’s B cells. Next, after in vitro stimulation of these twins’ PBMCs, I observe CVID-associated transcriptional dysregulation in immune subsets additional to those in B cells. I discover defects in the immune cell crosstalk between B cells and other immune compartments of the CVID twin. With an expanded cohort of CVID patients and healthy individuals, I go on to further validate these findings. 
These results show that, in addition to B-cell-intrinsic alterations, defects in cell-cell communication between B cells and other immune compartments may be compromising the correct immune response in CVID. Chapter 4 presents the work on creating a comprehensive spatial multiomics atlas of the maternal-fetal interface in early pregnancy. Firstly, I characterise the signatures and differentiation trajectories of trophoblast cells - the building blocks of placenta. I then focus on the crosstalk between invading trophoblast and maternal immune cells. I predict putative cell-cell communication events and validate in situ the selected molecules mediating these interactions. I propose a model of arterial transformation facilitated by fetal trophoblast and their communication with maternal cells. This work expands our knowledge about the cellular and molecular players in the maternal-fetal dialog in the first trimester of pregnancy, definitive of its success. Chapter 5 describes the work on modelling the dialog between decidual natural killer (dNK) cells, a type of innate immune cell most abundant in pregnant decidua, and the invading trophoblast at the maternal-fetal interface using primary trophoblast organoids (PTO). I benchmark the PTO system against the in vivo trophoblast atlas I described in chapter 4. After defining trophoblast cell states in vitro, I perform comparative analysis of PTOs stimulated with a cocktail of chemokines that in vivo are secreted by dNK cells and unstimulated PTOs as control. I propose a putative effect of the signals from dNK cells on trophoblast invasion in the first trimester of pregnancy. 
Lastly, Chapter 6 provides an overview of all the described work, as well as a discussion of how the novel high-throughput multiomic and spatial technologies, together with in vitro models, shape our current view of fundamental biology, and how they will impact future directions of research.

Item Embargo
Genetic architecture of transcript splicing in blood and phenotypic consequences
Tokolyi, Alexander [0000-0003-4222-7484]
Transcript splicing is a fundamental process which allows for the generation of multiple different isoforms from a single gene body, increasing the functional capacity of our genomes. Bulk RNA-sequencing has allowed us to analyse this at scale by investigating not only the amount of a gene product detected, but where, and by how much, certain parts of genes have been excised. This thesis presents the largest investigation to date into the genetic architecture of transcript splicing in blood, utilising a deeply-phenotyped cohort of 4,732 healthy adults, in addition to 638 adults presenting to intensive care units with sepsis. I first explore the genetic architecture of transcript splicing through the generation of splicing quantitative trait loci (sQTLs) in a healthy cohort of blood donors from the INTERVAL study. Transcript splicing is quantified through the use of split-reads present in RNA-seq data using the LeafCutter pipeline, allowing the quantification of transcript splicing without regard to established reference annotations. As the derived splice events do not have a 1-to-1 mapping to currently defined isoforms, I created a pipeline to richly annotate these splice events to aid in subsequent analyses. This resulted in 29,514 *cis*-sQTLs in 6,853 *cis*-sGenes, and I demonstrate large overlap with previous findings in addition to a plethora of new associations. 
Using *cis*-eSNPs derived from the same cohort, I perform a targeted *trans*-sQTL analysis under the hypothesis that *trans*-sSNPs could regulate splicing through the regulation of certain gene products involved in splicing. This validates the few currently known *trans*-sQTL associations, and provides a total of 642 splice events (in 208 sGenes), including known splice factors. Given the magnitude and novelty of the information generated, I develop an interactive online portal for browsing and exploring these sQTL results, into which subsequent analyses are incorporated, creating an interpretable form of the results generated by this thesis. This portal is publicly available at: https://intervalrna.org.uk/. As the INTERVAL cohort is deeply phenotyped, containing protein measurements in plasma along with metabolites, lipids, and their genetic associations, I perform colocalisation analysis of these with the generated sQTLs to explore their shared genetic architecture. This reveals that many splice events and molecular phenotypes appear to be regulated by shared genetic effects, and, through examples, I demonstrate how splicing could be modulating these downstream phenotypes through mechanisms such as changes in solubility. As a proof of concept, I then compare public GWAS statistics for immune and blood related diseases with both sQTLs and those of the downstream molecular phenotypes, detailing many splicing-mediated pathways of disease through which risk loci are putatively acting, the majority of which are independent of eQTLs. To investigate the interaction of genetic and environmental effects on transcript splicing in disease, I utilise the GAinS cohort of 638 adults who had blood taken upon arrival at the ICU with sepsis. 
Using this, I explore the transcriptomic differences between these individuals that are explained by transcript splicing and how this information can be used to predict patient status, and subsequently compare the shared genetic architecture of these splicing events with those of the healthy individuals and the previously defined downstream associations with molecular phenotypes. Notably, through colocalisation with summary statistics for COVID-19 susceptibility and severity, I observe risk loci shared with those impacting transcript splicing in the sepsis patients that were not observed in the healthy individuals. In summary, this thesis provides an in-depth analysis of the genetic architecture of the largest catalogue of transcript splicing events to date, explores their utility in explaining the regulation of downstream molecular phenotypes, and demonstrates how these associations can be used to understand the mechanistic pathways of risk loci.

Item Open Access
Probabilistic Dynamical Modelling of Spatiotemporal Cell Trajectories During Neural Development
Aivazidis, Alexander
In this PhD thesis I present two new computational models, Cell2fate and CountCorrect, for the analysis of single-cell and spatial transcriptomics data and I show how they can be applied to more effectively map the rules of brain cell development in health and disease. Cell2fate is an RNA velocity model for inference of transcriptional dynamics from spliced and unspliced RNA counts. Unlike existing models, cell2fate is capable of capturing complex biological processes while still being analytically tractable. This is achieved with an implicit factorization of RNA velocity solutions into modules, which also enhances statistical power and interpretability. By evaluating cell2fate in various real-world scenarios, I demonstrate its enhanced ability to capture complex dynamics and weak dynamical signals in rare and mature cell types. 
Finally, I apply cell2fate to single-cell datasets from the developing mouse and human brain, where I also demonstrate that RNA velocity modules can be mapped to parallel spatial transcriptomics data. The CountCorrect model provides new normalization and cell type mapping methods for the Nanostring WTA spatial transcriptomics technology that take into account background binding of RNA probes. I use CountCorrect to analyze a spatial transcriptomics dataset of the human developing cortex, revealing spatial autism enrichment patterns, a cortical cell type abundance map and differential gene expression patterns in Cajal-Retzius cells across developmental time and cortical regions.

Item Embargo
Probabilistic models to resolve cell identity and tissue architecture
Kleshchevnikov, Vitalii [0000-0001-9110-7441]
Cell identity drives cell-cell communication and tissue architecture and is in turn regulated by cell-extrinsic cues. Cell identity is determined by the combination of intrinsic, developmentally established transcription factor (TF) use and constitutive as well as cell communication-dependent TF activities. In my thesis, I developed two probabilistic models that advance the understanding of these processes using single-cell and spatial genomic data. Spatial transcriptomic technologies promise to resolve cellular wiring diagrams of tissues in health and disease, but comprehensive mapping of cell types in situ remains a challenge. I present cell2location, a Bayesian model that can resolve fine-grained cell types in spatial transcriptomic data and create comprehensive cellular maps of diverse tissues. 
Cell2location accounts for technical sources of variation and borrows statistical strength across locations, thereby enabling the integration of single-cell and spatial transcriptomics with higher sensitivity and resolution than existing tools. We assess cell2location in three different tissues and demonstrate improved mapping of fine-grained cell types. In the mouse brain, we discover fine regional astrocyte subtypes across the thalamus and hypothalamus. In the human lymph node, we spatially map a rare pre-germinal centre B cell population. In the human gut, we resolve fine immune cell populations in lymphoid follicles. Collectively our results present cell2location as a versatile analysis tool for mapping tissue architectures in a comprehensive manner. Cell identity and plasticity are regulated by a combinatorial code mediated by transcription factors and the cell communication environment. Systematically dissecting how the regulatory code robustly defines the vast complexity of cell populations across tissues is a long-standing challenge. Measured using the assay for transposase-accessible chromatin with sequencing (ATAC-seq), DNA accessibility provides a readout of intermediate gene regulation steps at single-cell resolution, with technologies measuring both RNA and ATAC providing the necessary evidence to build mechanistic models of regulation. Existing methods address one or several subproblems of modelling DNA accessibility. For example, the DNA sequence-based deep learning models represent combinatorial interactions and in-vivo TF-DNA recognition preferences. In contrast, GRN models use TF abundance profiles across cells and in-vitro-derived TF-DNA recognition preferences, optionally incorporating ATAC-seq data as a filter. All models learn cell-type specific weights/properties and do not generalise to new TF abundance states such as new cell types. 
Therefore, we are missing an end-to-end mechanistic model that represents all steps of the biological process, generalises to both new DNA sequences and TF abundance combinations, and can simultaneously characterise hundreds to thousands of cell states observed in single-cell genomics atlases. Here, I formulated cell2state, a mechanistic end-to-end probabilistic model of TF recruitment to a chromatin locus and downstream TF effect on DNA accessibility. Cell2state is designed to achieve the generalisation of regulatory predictions to unseen cell types. Cell2state (A) estimates TF nuclear protein abundance and models (B) how TFs recognise DNA, (C) how TF sites in DNA lead to TF recruitment to a chromatin locus, and (D) how the activity of DNA-associated TFs affects chromatin accessibility. To evaluate generalisation, I defined the computational problem and developed a workflow for predicting the scATAC-seq readout for previously unseen chromosomes and cell types. I show that cell2state outperforms the state-of-the-art deep learning models (ChromDragoNN) at explaining DNA accessibility differences across cells. Finally, to look at cell state plasticity, I developed ways to use cell2state to simulate the possible chromatin states given TF abundance of source cell types.

Item Embargo
Phylogenetic studies into the development of foetal tissues and their neoplastic derivatives
Oliver, Thomas
Over a lifetime, each cell in the human body acquires a unique combination of somatic mutations that encode its ancestry, exposure to mutagens and strategy for optimising survival. Studies into normal and neoplastic tissues in adults have delineated clonal architecture and oncogenesis at an exquisite resolution. However, for over a century, data have indicated that childhood cancer is rather different, most likely emerging as an aberration of foetal development. 
This thesis explores how foetal tissues and their neoplastic progeny propagate, focusing specifically on the placenta, germ cell tumours and high-grade midline gliomas. In Chapter 1, I introduce the principles and technological advances that allow us to infer development from somatic mutations. I highlight existing evidence for the distinct origins of childhood and adult cancers and discuss the unique mutational forces that prenatal cells endure. Applying whole genome sequencing (WGS) to bulk and microdissected placental tissues, I begin my own lines of enquiry in Chapter 2. I show that placental trophoblast is unique amongst normal tissues in its clonal construction and sustains a pattern and rate of mutation normally seen in cancer. Each placental biopsy represents a driverless expansion of a spatially-fixed early embryonic precursor, making the placenta inherently mosaic. I turn my attention to neoplasia in Chapters 3 and 4. Using WGS from bulk and microdissected germ cell tumours in Chapter 3, I detail differences in cancer genomes by age group, including their mutational exposures. Where tumours form many, apparently normal tissues, RNA sequencing captures underlying foetal transcriptional signals and diversity that cannot be explained by mutation. In Chapter 4, I outline my work using WGS from post-mortem normal and neoplastic tissues of three children with high-grade midline gliomas. Germline *NF1* mutation is associated with independent, second *NF1* hits that pervade the macro- and microscopically unremarkable brain and spinal cord. Each glioma is characterised by abundant subclonal drivers with different lineages mutating the same genes recurrently, possibly exacerbated by radiotherapy. I conclude in Chapter 5 by exploring the clinical and histopathological utility of these types of experiments and considering new studies to gauge the impact of mutation on organogenesis. 
Lastly, I highlight other areas of child health where study of somatic mutation may prove beneficial.

Item Open Access
Discovering variation from cell atlases: comparative methods for single-cell genomics
Dann, Emma
Single-cell genomics technologies have become the norm to investigate cell-to-cell heterogeneity and collective research efforts have built “cell atlases” to characterize previously unknown cell types and states in tissues. For this task, the collection of cells from multiple donors or tissue sites is often required, and differences between cell populations from different samples are largely attributed to technical variation. More recently, multi-sample and multi-condition experiments are being designed to quantify differences between biological conditions, using atlases from healthy tissues as references. These studies require robust quantification of biological differences between cellular phenotypes while accounting for variability between samples and conserving fine-grained information on cell-to-cell heterogeneity. The work presented in this thesis focuses on development and application of computational strategies and best practices for comparative analysis between samples of different biological conditions profiled with single-cell genomics. Chapter 2 presents Milo, a statistical framework for differential abundance testing on single-cell data. By quantifying differences in cell abundances between conditions in partially overlapping neighborhoods on a k-nearest neighbor graph, Milo can identify perturbations that are obscured by discretizing cells into clusters, and it maintains false discovery rate control even in the presence of batch effects. I present a comprehensive benchmark against alternative differential abundance testing strategies, using simulations and scRNA-seq data. I then demonstrate the utility of Milo by studying perturbations across lineages in a dataset of human liver cirrhosis. 
Chapters 3 and 4 present a case-study where Milo and other integration and comparative analysis methods were used to study the development of the human immune system as a distributed network across tissues. In Chapter 3, I describe the integration of scRNA-seq data of almost a million cells from nine prenatal organs across 11 weeks of gestation, to define common and tissue-specific immune cell populations, and how these compare with immune cell states identified in adulthood. Using this integrated view, I show how I used Milo to identify stage- and tissue-specific subpopulations of myeloid and lymphoid cells, and discuss their potential role in maturation of immune function and tissue morphogenesis. Chapter 4 is centered around the analysis of spatial cellular environments across fetal tissues. By integrating our cross-tissue scRNA-seq atlas with spatial transcriptomics data, we identified and compared cellular niches in the developing liver, spleen, thymus and gut. In Chapter 5, I present a systematic meta-analysis to identify best practices to identify cell states altered in human disease using integration and differential analysis on single-cell datasets. In particular, I examined whether atlas datasets are suitable references for disease-state identification or whether matched control samples should be employed, to minimise false discoveries driven by biological and technical confounders. By quantitatively comparing the use of atlas and control datasets as references for identification of disease-associated cell states, on simulations and real disease scRNA-seq datasets, I show that reliance on a single type of reference dataset introduces false positives. Conversely, using an atlas dataset as reference for latent space learning followed by differential analysis against a matched control dataset leads to precise identification of disease-associated cell states. 
I demonstrate how optimized design can guide discovery in two applications: using a cell atlas of blood cells from 12 studies to contextualise data from a case-control COVID-19 cohort to detect cell states associated with infection, and to distinguish heterogeneous pathological cell states associated with distinct clinical severities. In parallel, I used a healthy reference lung atlas generated from 14 studies to study single-cell profiles obtained from patients with idiopathic pulmonary fibrosis, characterising two distinct aberrant basal cell states associated with disease and identifying unique marker genes with therapeutic potential. In the final chapter, I discuss how advancements in multi-condition single-cell analysis open up a new phase of population-level tissue genomics studies, and the impact that it will have on biomedicine.

Item Open Access
Gene Regulatory Networks at Single-Cell Resolution: an approach to exploring the impact of genomic regulation on cellular heterogeneity
Xu, Zhihan
The computational analysis of single-cell RNA sequencing data provides a great opportunity to infer gene regulatory associations in different tissues, cell types and cells. However, there are many challenges still to be overcome. In this thesis, I discuss why this opportunity is fascinating but challenging from a computational point of view, and I build a computational method to demonstrate the plausibility of inferring one gene regulatory network (GRN) for each single cell, which I evaluate from multiple perspectives. In Chapter 2, I investigate data-fitting models in existing computational methods which infer GRNs. Taking full advantage of the single-cell resolution of the input sequencing data leads to the use of machine learning approaches, such as instance-wise feature selection models. 
I assess the suitability of these models at a methodological level, implement an instance-wise feature selection model, and aim to generate GRNs which learn cell-to-cell variations from single-cell sequencing. Based on this implementation, I build a new method to infer GRNs at single-cell resolution, or single-cell specific GRNs (scGRNs). The inferred scGRNs can be used to explore GRNs across cell types. Since the true underlying biological mechanism in an scGRN cannot be known, my method is benchmarked against available biological network databases, which do not reveal cell-to-cell variation in GRNs. Analysis of scRNA-seq data can provide biological insights into cellular heterogeneity because of its single-cell resolution; since scGRNs also contain cell-resolution information, they can likewise be analysed for their cell-related properties. Since scGRNs are generated from scRNA-seq data, the inherent cell-related patterns in scGRNs should not deviate substantially from the results of existing scRNA-seq analyses. An scGRN suggesting very different cellular information is of doubtful reliability even before GRN validation is conducted, so cellular information provides a powerful angle from which to evaluate scGRNs. In Chapter 3, I build a pipeline to analyse scGRNs and I refer to three cell-related properties for comparison: cell types, cell-type trajectories and cell-type specific marker genes. The results demonstrate how the cell-related properties derived from scGRNs differ from those derived from scRNA-seq. As the ultimate goal of this thesis, the analysis of the resulting scGRNs offers an unprecedented opportunity to explore changes in regulatory patterns across cell types or tissues. Thanks to the single-cell resolution of biologically interpretable scGRNs, various analyses can be conducted flexibly, such as drawing implications from cell-type specific GRNs. Chapter 4 focuses on the exploration of interactions between regulatory edges and cell subpopulations from scGRNs. 
In other words, scGRNs are analysed to explore the changes in GRN edges across cell types, cellular lineages or organs. Based on the existing biological understanding, some expected patterns are summarised to evaluate scGRNs. Although it does not yield a quantitative score indicating the performance of a method, the evaluation of patterns is still sufficient to rule out unreasonable scGRN candidates, and this work provides a novel perspective from which to evaluate GRNs. Besides their utility for evaluation, the observed patterns also suggest some meaningful biological implications about the impact of genomic regulation on cellular differentiation. I hope the methodology developed in this thesis inspires more method developers to pursue scGRNs, and I hope the multi-angle evaluations of scGRNs demonstrated here can facilitate more biological insights into the regulatory mechanisms driving cellular heterogeneity.

Item Embargo
Developing and applying new methods to understand blood stage growth in Plasmodium falciparum
Muhwezi, Allan
*Plasmodium falciparum* parasites cause nearly half a million deaths from malaria each year. There is still no highly effective vaccine, and resistance has emerged or is emerging to all current drugs. All the pathology and symptoms associated with malaria are caused by the growth of these parasites inside human red blood cells. If the parasites are to survive and be successful, they must invade, grow and survive under diverse micro-environmental conditions within red blood cells. Understanding the *P. falciparum* genes that regulate these key developmental processes could lead to the identification of targets for new drugs. However, while sequencing efforts have led to a good understanding of the *P. falciparum* genome and how it evolves over time and space, a disconnect remains between the amount of genome sequence data and what exactly these genes do (the phenotype). 
This disconnect underpins my PhD thesis, whose major goal is to bridge the gap in our understanding of gene function in *P. falciparum*. I have developed high-throughput approaches combining next generation sequencing, flow cytometry and cell sorting to develop assays to accurately phenotype key aspects of the parasite life cycle such as invasion, cell cycle progression, multiplicity of invasion and replication capacity. These assays have been applied to a panel of *P. falciparum* lines in which genes potentially involved in specific developmental transitions have been deleted, and reveal new phenotypes not described elsewhere in the malaria literature.

Item Open Access
Understanding Development of Human Immunity One Cell at a Time
Suo, Chenqu [0000-0002-8813-0875]
The emergence of single-cell and spatial multi-omic technologies has revolutionized our understanding of immune cells. The international Human Cell Atlas consortium has spearheaded and coordinated a global effort to construct atlases of human tissues across multiple developmental stages. This is revealing the identity and function of cells at unprecedented resolution and depth, enhancing our understanding of the immune system in health and disease. The Human Cell Atlas data is also providing us with valuable prior knowledge to guide *in vitro* engineering of immune cells towards specific cell states. This thesis aims to study human immune cell development from both *in vivo* and *in vitro* angles, utilizing both experimental and newly developed computational approaches. Importantly, I focus on applying insights from single-cell genomics studies of human immune cell development to *in vitro* lymphocyte engineering. Chapter 1 introduces the recent advances in single-cell technologies in the context of previous methods used to study immune cells, summarizes human immune cell development with a focus on lymphocyte development, and presents published efforts in *in vitro* T cell engineering. 
Chapter 2 covers the assembly of a multi-organ single cell atlas of the developing human immune system and describes the insights derived from this comprehensive atlas. We uncovered system-wide blood and immune cell development in organs other than primary haematopoietic organs, and characterized prenatal innate-like B and T cells, namely B1 cells and unconventional T cells, in humans for the first time. Chapter 3 is dedicated to a new computational tool, Dandelion, that we developed for single cell antigen receptor sequencing (scVDJ-seq) data analysis. We also devised a novel strategy to leverage scVDJ-seq data in pseudotime trajectory inference. The application of Dandelion improved the alignment of human thymic development trajectories of double positive T cells to mature single-positive CD4/CD8 T cells, and provided novel insights into the origins of human B1 cells and ILC/NK cell development. Chapter 4 introduces another computational tool, Genes2Genes, for aligning two single-cell trajectories. By applying Genes2Genes to *in vivo* and *in vitro* T cell development, we found that *in vitro* single positive (SP) T cells were matched to an immature state of the *in vivo* SP T cells while lacking the final TNFα signaling. Chapter 5 summarizes the insights gained from my work on human immune cell development and highlights potential future directions of research in this area.

Item Embargo
The influence of genetic background on drug resistance in the malaria parasite Plasmodium falciparum
Carpenter, Emma [0000-0002-1911-6842]
*Plasmodium falciparum* is a parasite that causes the most severe forms of human malaria. With multidrug resistance to modern antimalarials emerging and spreading in certain parasite populations, understanding the mechanisms for acquiring multidrug resistance will be key for predicting which genes may become important for clinical resistance in the future. 
Transporter proteins have key roles in drug resistance across a variety of organisms — in *P. falciparum*, the spread of resistance to the antimalarial chloroquine in the 1950s and 60s is largely due to mutations in the Chloroquine Resistance Transporter (PfCRT). Variations in the ABC transporter *pfmdr1* modulate sensitivity to multiple antimalarials, including mefloquine and lumefantrine, popular partner drugs in first-line artemisinin combination therapies. During my PhD, I assessed a panel of parasite lines that had undergone *in vitro* evolution experiments, identifying polymorphisms in a poorly characterised ABC transporter, ABCI3, that confer resistance to several experimental antimalarial compounds with diverse chemical scaffolds, suggesting that ABCI3 could mediate resistance to next-generation antimalarial drugs. Next, I examined a novel PfCRT mutation currently spreading in Southeast Asia that confers piperaquine resistance when edited into the wild-type *pfcrt* allele of laboratory lines using CRISPR/Cas9, raising concerns that piperaquine resistance could arise in susceptible populations with a single nucleotide polymorphism. Finally, I introduced unique 11-nucleotide ‘barcodes’ into 37 progeny from the NF54 × Cam3.II genetic cross using CRISPR/Cas9. The African NF54 parasite is considered wild-type and is broadly drug susceptible. The Cambodian Cam3.II parasite is multidrug-resistant owing to numerous mutations, including the R539T mutation in *pfkelch13*, a gene associated with artemisinin resistance. R539T has been demonstrated to have a high fitness cost, and similar PfKelch13 mutations display fitness costs that are exacerbated when engineered into non-Southeast Asian parasites. 
Combination of ‘barcoded’ parasite lines into a single flask, or ‘pool’ creates a highly valuable screening resource, allowing one to observe the change in proportion of each progeny within this pool over time via next-generation sequencing of the barcoded locus; linkage analysis can therefore be performed on data generated within a single experimental run. Using this tool, I note the enrichment of certain haplotypes of interest under various antimalarial pressures, including *pfabci3*, *pfcrt* and *pfkelch13* variants. I discuss the future avenues of research that will reveal the quantitative trait loci responsible for the differential survival of progeny under these antimalarial pressures, which will shed light on the complex interplay between drug resistance mutations, fitness, and genetic background.