Theses - Wellcome Sanger Institute

Recent Submissions

Now showing 1 - 20 of 133
  • ItemOpen Access
    Transcriptional Characterisation of Human Musculoskeletal Development in vivo and in vitro
    Lawrence, John
    The musculoskeletal system is affected by a wide range of diseases throughout life, often affecting young, economically active people. These diseases are the leading cause of morbidity worldwide. Despite this, this organ system is rarely the focus of large scientific initiatives. For example, it was absent from the original white paper of the Human Cell Atlas (HCA), a global project which aims to compile a reference map of all human cell types to better understand health and disease. Applying the technologies used by the HCA to study the development of the musculoskeletal system at the level of individual cells could further our understanding of diseases that affect it, ranging from congenital limb malformations, which affect one in five hundred live births, to degenerative diseases such as osteoarthritis, which affects eight million individuals in the United Kingdom. This gap in our knowledge was the main stimulus for the work performed in this thesis. Chapter 1 aims to summarise recent developments in methods used to characterise the transcriptome of human tissues, in particular spatial transcriptomics, single cell RNA sequencing and in-situ sequencing, and discuss their application to date in the musculoskeletal system. In Chapter 2, I form a detailed healthy reference tissue ‘atlas’ of the human first trimester hindlimb across a range of gestational ages, and use spatial transcriptomics to explore links between gene expression patterns during limb morphogenesis and congenital limb malformation. In Chapter 3, I extend my investigation of patterning by profiling the development of the human spine along its entire craniocaudal axis, shedding new light on the rostrocaudal expression patterns of HOX genes and exploring their role in organising the dorsoventral axis of the embryonic spinal cord.
In Chapter 4, I enrich the ‘atlas’ of healthy fetal limb development for the osteochondral compartment through sequencing of fetal long bones in order to address an important question in musculoskeletal science: how faithfully does the development of in vitro chondrocytes recapitulate that of their in vivo counterparts? This chapter presents a framework for pursuing high fidelity tissue engineering, using single cell data from human development to drive improvement. I believe this approach holds great promise for understanding the pathophysiology of a broad range of musculoskeletal diseases. I summarise how I plan to build on the work performed during this PhD in the discussion section of this thesis.
  • ItemOpen Access
    Mutational processes in normal human tissues
    Wang, Yichen
    Mutations accumulate in all cells throughout life from the very first cell division of the fertilised egg. These somatic mutations contribute to cancer and other diseases as well as provide insights into ageing and development. Although somatic mutations in cancer genomes have been extensively studied for the past two decades, because of technological limitations, we are still in the process of understanding patterns of somatic mutation in normal cells. This accumulation of somatic mutations is shaped by numerous mutational processes, each generating a distinctive profile of mutations, called a ‘mutational signature’. From mutational signatures, one can infer a history of operating mutational processes that have acted on the genome. Therefore, they can be a useful tool in revealing the historical presence of mutational processes in tissues that may be linked to causes of diseases. During my PhD, I first characterised the operating mutational processes in normal human small intestine through extensive sequencing and phylogenetic reconstruction of multiple biopsies from a group of 39 individuals within the UK. Subsequently, I demonstrated how mutational signatures in normal tissues can be used for global surveillance of mutagenic exposures that may cause cancer, through a collection of normal kidneys from more than 200 individuals from multiple geographic regions. The small intestine epithelium is thought to be one of the most vigorously self-renewing tissues of adult mammals. The base of each small intestinal crypt is occupied by stem cells and the descendants of a single recent ancestor stem cell comprise most cells in each crypt. Therefore, isolation of single crypts provides relatively homogeneous clones of cells from which somatic mutations can be called.
Using laser-capture microdissection to isolate individual crypts followed by whole-genome sequencing, I characterised somatic mutation rates and mutational signatures in the small intestine, and identified the frequent presence of a mutational process commonly found in cancer, which could be explained by collateral damage caused by an RNA editing enzyme involved in lipid transportation. In contrast, normal kidney tissue is polyclonal, maintained largely by quiescent cells with low turnover. Because homogeneous clones cannot be easily isolated, a high-accuracy duplex sequencing approach was applied to distinguish somatic mutations from random sequencing errors. DNA extracts from normal kidney cortex were collected from nine countries with varying kidney cancer incidence rates, and mutational signature analysis revealed geographical variation in types and contributions of mutational signatures, resulting from both known and unknown environmental exposures. In addition, I used laser-microdissection to isolate and then sequence distinct microscopic structures within the normal kidney, including glomeruli, proximal tubules, distal tubules and medulla, to estimate an accurate somatic mutation rate for these structures and how they are affected by different environmental exposures. Together, these two studies describe the distinct somatic mutation landscapes among different normal human tissues with active cell division and low cell division rates, as well as the same type of tissues collected from different geographic regions. These findings inform us about the varying patterns and mechanisms of mutational processes in normal human tissues, demonstrate how normal tissues can provide new insights into mutational processes, and exemplify how normal tissues can be used to identify geographically variable mutagenic exposures.
  • ItemRestricted
    Discovery and characterisation of driver mutations in liver disease
    Brzozowska, Natalia
    [Restricted]
  • ItemEmbargo
    Vibrionaceae genome dynamics
    McGimpsey, Stephanie [0000-0001-9127-2789]
    *Vibrionaceae* is a family in the class *Gammaproteobacteria* that has 11 genera that are mostly found in marine niches. It is an important family from a human health perspective due to *Vibrio cholerae* causing millions of infections every year. Other members of the family cause millions of dollars in damage to the aquaculture industry due to their pathogenesis of fish. Research on the family rarely spans the full range of sequenced species, and this thesis aims to understand the characteristics of individual species in the context of whole-family dynamics. General sequence characteristics of publicly available *Vibrionaceae* such as genome GC%, genome length, codon bias, plasmids, virulence and AMR genes are identified across the family in this research. From this, a subset of genomes focused around *V. cholerae*, *V. vulnificus* and *V. parahaemolyticus* was selected for more in-depth study of gene content and potential horizontal gene transfer. To expand upon the dynamics of gene flow seen within publicly available data, bacterial samples were collected from the English Channel and sequenced to identify potential gene flow in a natural population to species of other bacterial families.
  • ItemEmbargo
    Combining population-scale single-cell transcriptomics with germline genetics to understand the biology of inflammatory bowel diseases
    Alegbe, Oluwatobi
    Inflammatory bowel diseases (IBD) are chronic, incurable diseases of the gastrointestinal tract with a high prevalence in the United Kingdom and a growing incidence worldwide. The two most common forms of IBD are Crohn’s disease (CD) and ulcerative colitis (UC). Genome-wide association studies (GWAS) have discovered over 300 regions of the genome associated with increased risk of susceptibility to CD, UC, or IBD. Simultaneously, single-cell transcriptomic studies have extensively mapped the cell types present in the most commonly afflicted regions - the small and large intestines. Thus far the efforts to align these two fields have been minimal despite the large benefit of doing so. In this thesis I explored the genes, cell types and pathways driving inflammatory bowel diseases with IBD-relevant single-cell datasets whilst guided by the causal anchor of germline genetics. The first results chapter describes the construction of a single-cell atlas of the terminal ileum generated from biopsies of 50 individuals with CD and 71 healthy controls. With this well-powered dataset, 49 cell types could be identified across immune, epithelial, and stromal populations. I then performed a series of transcriptomic analyses including differential gene expression, non-negative matrix factorisation, and heritability partitioning. The results highlighted antigen presentation pathways upregulated in epithelial cells from diseased biopsies whilst T cell, myeloid and B cell populations were enriched in IBD risk heritability. In the second results chapter, I used a larger version of the same single-cell ileal dataset to map expression quantitative trait loci (eQTL) - genetic variants associated with gene expression changes. I examined how these eQTL, and the genes they impact, vary across the cell types present in the terminal ileum. Then I colocalised the eQTL with GWAS to prioritise likely effector genes for dozens of IBD risk loci.
Finally, I compared my findings to public data and highlighted the novelty this dataset adds for interpretation of IBD. My final results chapter builds on the prior two with the addition of single-cell RNA sequencing datasets spanning two more tissues: blood and rectum. The rectum is a more relevant site for ulcerative colitis and the blood contains many immune cell types that are under-represented in the gut. I compared the findings between the three datasets and showed that they complement each other for eQTL discovery and interpreting IBD GWAS. Lastly, I used the individual-level phenotypes to test for interaction eQTL effects and showed that these can be valuable for understanding biology too. Overall, my thesis looks at diverse scRNA-seq data generated from a large cohort and harnesses that power to provide insight into IBD biology. The datasets created will prove a valuable resource within IBD but also for other gastrointestinal diseases such as colorectal cancer, coeliac disease and irritable bowel syndrome.
  • ItemEmbargo
    Molecular and clinical profiling of immune disease genetic loci
    El Garwany, Omar
    In the past two decades, genome-wide association studies (GWAS) have identified numerous genetic variants linked to various traits and diseases, including immune-mediated diseases (IMD). However, understanding the downstream effects of these genetic variations, both at the molecular and clinical levels, has proven more challenging than expected in the post-GWAS era. To address these challenges, my thesis focuses on characterizing the effects of IMD-associated loci at the transcriptomic and disease sub-phenotype levels. Many disease-associated variants are found in non-coding regions of the genome, making their functional interpretation elusive. Recent large-scale functional genomic datasets like GTEx and eQTLGen have linked genetic variation to gene expression differences, but most studies have primarily focused on steady-state gene expression at the tissue level, overlooking the impact of environmental factors on gene regulation in different cell types. Additionally, aspects of transcriptomic regulation such as alternative splicing have been understudied. In the first part of the thesis, I used iPSC-derived macrophages to investigate alternative splicing patterns and identify genetic variants regulating alternative splicing in different macrophage environmental contexts. The study found widespread differential splicing between stimulated and unstimulated macrophages, with context-dependent regulation and a link between IMD risk loci and alternative splicing changes. The second part of the thesis adopts a clinical perspective, focusing on perianal Crohn’s disease (pCD) as a case study. A GWAS meta-analysis between CD patients with and without pCD identified a significant genetic locus in the MHC region associated with pCD, previously linked to CD susceptibility. The study also investigated sporadic perianal manifestations in the general population, finding 12 significant genetic loci associated with sporadic perianal disease.
An initial assessment shows that none of these loci replicate in the pCD meta-analysis, possibly suggesting distinct mechanisms driving both types of perianal manifestations. In summary, the thesis delves into the functional and clinical aspects of IMD-associated genetic variants, emphasizing the importance of alternative splicing and exploring a high-burden clinical sub-phenotype, within both disease and sporadic contexts. This research encourages further exploration of these dimensions of IMD genetics.
  • ItemEmbargo
    Sperm sequencing reveals extensive positive selection in the human germline
    Neville, Matthew
    Over the course of life, cells of the human body accumulate DNA mutations due to damage from intrinsic causes or exposure to mutagens. Mutations that occur in reproductive cell lineages are known as germline mutations and have the potential to be transmitted to offspring. Germline mutations serve as the origin of all heritable genetic variation, making them crucial in the study of evolution and disease. The majority (~80%) of germline mutations in humans are paternal in origin. However, direct observation of mutations in sperm, the gametes of the paternal germline, has been limited by the need for a low error rate sequencing technology. Using the duplex sequencing method known as NanoSeq, which achieves a sufficiently low error rate, we sequenced bulk sperm and whole blood from the same individuals. We show that mutation rates and mutation signatures in sperm are consistent with results from trio studies and we contrast mutation rates in the germline with those in blood. Applying a targeted version of NanoSeq, we then generated a large dataset of coding mutations from sperm. Our findings reveal extensive positive selection in the male germline, implicating new genes, pathways, and mutational mechanisms in this process. Annotation of positively selected genes identified in our study found that most are known to cause pathogenic disorders when transmitted as a germline mutation to an offspring. Furthermore, we quantified the fraction of sperm carrying pathogenic variants per individual, highlighting an increased disease risk for children born to fathers of advanced age. These findings shed light on the dynamics of germline mutations and have important implications for our understanding of human disease.
  • ItemOpen Access
    Defining cell state regulators in cancers using single-cell analysis and CRISPR-Cas9 screening
    Edwards, Olivia
    As high-throughput transcriptional profiling becomes more sophisticated and accessible, the extent of cancer heterogeneity and its impact on clinical outcomes are being realised. Whilst genomic variation has been extremely useful for cancer stratification and development of targeted therapies, it does not always underpin variable therapeutic responses, especially in cancers that display higher levels of plasticity. Development of single-cell profiling has had a particular impact on elucidating the composition of cell states occurring within individual tumours, and offers high-dimensional data which can be utilised for unsupervised signature extraction. The aim of this project was to identify clinically relevant signatures of transcriptional heterogeneity in cancers by mining a pan-cancer single-cell RNA sequencing dataset spanning 198 cell line models across 22 cancer types [1]. A dimensionality reduction method which resolves continuous expression signatures at multiple resolutions was used to resolve a range of behaviours, from consistent intra-sample cell states to cancer subtypes. In the analysis of melanoma models, three main signatures were defined that reflected distinct subtypes characterised by their differential invasive and proliferative properties. Genes highlighted as putative regulators of these subtype signatures were screened using a single-cell RNA-seq coupled CRISPR knock-out approach, with regulator potential uncovered for multiple targets including SOX10, MITF, EIF3G, PRPF19, RPS27A, and CDC20. Cell line annotations achieved through previous high-throughput screens were also leveraged to uncover associations between the defined heterogeneous expression signatures and features reflecting genetic variants, gene essentiality, and drug response.
CRISPR perturbation screening was again used to validate the potential for putative melanoma subtype regulators to modulate the response to Rac inhibition, with results suggesting an overlap in gene regulatory networks between melanoma subtype and context specific responses to this inhibitor.
  • ItemEmbargo
    Single-cell transcriptomics identifies key components in metabolic pathways of drug perturbed activated CD4+ T cells
    Ke, Ziying
    CD4+ T cells play a critical role in the development of autoimmune diseases. During CD4+ T cell activation, remodelling of metabolic pathways is required for the cells to exert their effector functions. The importance of these pathways is highlighted by the successful therapies for immune diseases that target metabolic pathways. Although key metabolic processes have been recognized to affect T cell activation and lineage development, how metabolic interventions can skew the outcome of T cell activation and differentiation remains largely unknown. I first introduce the experiments which guided the selection of 19 compounds targeting various metabolic pathways with measurements on murine CD8+ T cells and both naïve and memory human CD4+ T cells. I proceed to describe drug effects on CD4+ T cell day 3 proliferation and gene expression by single cell transcriptional profiling at resting state, 16 hours and 3 days after stimulation with aCD3/aCD28. My observations reveal cell subpopulations present only at certain activation time points and demonstrate drug effects altering gene expression and pathways within specific cell populations. Then, I characterise distinct pseudotime trajectories representing the progression from naïve to effector memory cells at each time point. With linear models, I identify the immune-related disease genes regulated by both drug effects and the effector status of cells. Leveraging RNA splicing information, I demonstrate that expression changes of genes targeted by compounds resulted in altered T cell lineage development. Furthermore, my analysis highlights important transcription factors and metabolic pathways regulated by the interactions between metabolic perturbation and effector function. The work in the dissertation presents a unique resource of metabolic perturbations in CD4+ T cells and provides insights into understanding the role of T cell metabolism in immune-mediated diseases.
  • ItemOpen Access
    The Malleability of Gene Regulation in Healthy Individuals: Analyzing CRISPR-based Screens with Single-Cell RNA-Sequencing Readout across Genetic Backgrounds
    Feng, Claudia
    A question that has driven the field of quantitative genetics since its inception is that of understanding the relationship between genetic variation and complex traits. In the last few decades, genome-wide association (GWAS) and quantitative trait studies have implicated thousands of genetic loci in various physical and molecular traits. However, where there was hope that traits could be explained by a single, causal variant driving a genetic condition, this is only rarely the case, with the majority of traits driven by the aggregation of common, low-effect variants causing changes in gene regulation. Most puzzlingly, even where there is a clear disease-causing mutation, different individuals can often present a diverse range of disease phenotypes, from no discernible symptoms to severe disease. Deconvoluting the impact of high-effect variants, the combined effects of genetic background and the environment remains an important question with many practical implications. Existing methods for teasing apart the many contributing factors require specific datasets and are infeasible for many rare diseases, nor can the effects of lifestyle and environment be completely distinguished. Recent technological advances in genome engineering, sequencing technology and cell line differentiation can be used to mitigate these challenges. However, these techniques have been limited in scale and do not account for variation in genetic background. In this thesis, I explore four CRISPR-based screens with single-cell RNA sequencing (scRNA-seq) read-out that together map out an evolution of experiments, starting from a proof-of-concept CRISPRi screen targeting tens of genes across two cell lines and progressing to one targeting thousands of genes pooled across tens of genetic backgrounds.
In doing so, we establish experimental and power considerations for knocking down genes with variable essentiality across pooled cell lines, as well as a computational framework for interpreting mean and variation in transcriptomic consequences from gene knockdown. Work done during my PhD and presented in this thesis is the second genome-scale experiment of its kind and is the only study that accounts for the role of genetic background on gene function, establishing a necessary and fundamental building block for conducting population-scale genetic analyses.
  • ItemEmbargo
    Selective Neuronal Vulnerability in Neocortices from Patients with C9ORF72-related Neurodegeneration
    Kwa, Jing Eugene
    Amyotrophic lateral sclerosis (ALS) is a rapidly fatal neurodegenerative disorder classically characterised by motor neuron death. However, genetic and histopathological overlap with frontotemporal dementia (FTD) suggests extra-motor involvement, and that pathology likely extends to other neuronal populations (Chapter 1). This may be especially evident in patients with the hexanucleotide repeat expansion in the gene C9ORF72 (c9HRE): the most common genetic cause of both ALS and FTD. We hypothesise that in c9HRE-related neurodegeneration, neuronal pathology may extend beyond the motor neurons. To date, a comprehensive survey of selective neuronal vulnerability has not yet been performed. With recent advances in single-cell and spatial transcriptomic approaches, our hypothesis can now be investigated in an unbiased fashion and at unprecedented resolution. The work in this thesis focuses on identifying selectively vulnerable neurons in the post-mortem primary motor cortex (M1) of patients with ALS and FTD caused by a c9HRE. M1 tissue was processed for single nuclei RNA-sequencing (snRNA-seq) and 10X Visium spatial transcriptomics (Chapter 2). I first investigated cortical neuron vulnerability using snRNA-seq (Chapter 3). Upper motor neurons (herein called L5 ETs: layer 5 extratelencephalic-projecting excitatory neurons) were expected to display pathology in c9HRE and served as a positive control. Other vulnerable neuronal populations were identified by examining neurons with similar transcriptomic responses to L5 ETs. In so doing, upper layer excitatory subtypes, PVALB+ fast-spiking basket interneurons, and specific VIP+ interneurons were identified as vulnerable. Transcriptomic changes primarily manifested in neurons as altered mitochondrial, proteostatic, and synaptic function. Minimal evidence of glial involvement was observed.
Given the range of developmental origins amongst vulnerable populations and the relative sparing of layer 6, I hypothesised that microcircuit connectivity might determine vulnerability. I investigated this hypothesis using Visium (Chapter 4). Three lines of evidence support the L6 sparing previously inferred from snRNA-seq. First, inferred vulnerable populations are located outside L6. Second, L6 transcriptomic changes are enriched for axon-related genes, and may thus originate from neurons with soma outside L6. Third, signature deconvolution suggests expected pathology is significantly higher for c9HRE samples for layers superficial to L5. Taken together, my spatial transcriptomic data support L6 sparing inferred from snRNA-seq, which in turn reinforces the notion of microcircuitry being predictive of vulnerability (rather than spatial proximity or developmental origin). I also cross-examined the biological pathways highlighted in the snRNA-seq analysis with findings from literature (Chapter 5). Identified pathways were preserved upon meta-analysis, and suggest a role for mTORC signalling. Importantly, I also observed that existing c9HRE model systems may not be appropriate for understanding post-mortem changes. In summary, this work suggests that selective vulnerability in c9HRE is determined by (and possibly spreads through) neuronal connections. Deeper examination of this hypothesis is limited by existing technology, but nascent approaches that merge connectomics with transcriptomics (Chapter 6) may eventually allow us to investigate this in future.
  • ItemEmbargo
    Single-cell atlasing of the human tissues across the lifespan
    Kedlian, Veronika
    Single-cell and spatial sequencing have revolutionised our understanding of human tissue biology in recent years. The Human Cell Atlas (HCA) initiative aims to recover and position all of the cell types in the human body. Being part of HCA, I have worked on building atlases which allow comparison across age groups for two different organs: the human thymus and skeletal muscle. Firstly, I co-led a large multinational effort to create a spatial atlas of the human thymus from fetal and paediatric stages. As a part of this effort, we expanded the census of cell types in the human thymus and mapped them spatially in the tissue. We also developed a new morphological framework (Organ Axis) which allowed us to align and compare cell type locations in fetal vs. paediatric thymus. Secondly, I led an effort to create an ageing human skeletal muscle atlas, which systematically catalogued cell types and states in young adult and elderly skeletal muscle and described age-associated changes. In Chapter 1 I set the stage for the whole thesis by giving background on single-cell and spatial technologies as well as the research question, namely change across the lifespan. I start with a brief introduction to single-cell atlasing, the huge variety of spatial transcriptomics and proteomic technologies, focusing on Visium and IBEX. Next, I provide an overview of the prenatal and postnatal periods of development, followed by a more in depth discussion on causes of human ageing. Finally, I conclude with a description of the main events and functional changes in thymus and skeletal muscle across the lifespan. In Chapter 2 I give an overview of the spatial human thymus cell atlas. I give a brief introduction to human thymus development, its cell type composition and function. 
I describe major single-cell (single-cell, CITE-seq and TCR-seq) and spatial profiling modalities (Visium and IBEX) that were applied to the thymus and outline methods for imaging data processing, annotation and cortico-medullary axis construction for the thymus. I provide an annotation and spatial mapping of T cells and supporting resident cells, including thymic epithelial cells (TECs), fibroblasts, vascular and myeloid cells. Simultaneously, I use the cortico-medullary axis to compare the spatial position of cell types in the fetal vs. paediatric thymus. I conclude by discussing the differences in cell type localisation between fetal and paediatric thymus and the importance of spatial structure for thymus function. In Chapter 3 I provide insights from single-cell and spatial mapping of Hassall’s body, a keratinised structure in the thymus medulla, which used to be considered a degenerative epithelial structure. Firstly, I map the closest cell types to Hassall’s body and discuss the challenges of associating cell types to Hassall’s vs deep medulla. Next, I identify genes which are uniquely expressed by specific cell types in the medulla, which led me to discover an underappreciated heterogeneity within the mTECIII population, including mucosal and skin-like subtypes. I conclude with a discussion on the putative function of Hassall’s bodies in T-cell development. In Chapter 4 I introduce the human skeletal muscle ageing atlas. I start by summarising knowledge about the main cell types in the muscle and known ageing changes. I describe the experimental setup for the creation of the atlas and give an overview of the recovered cell types and major ageing changes that we observe. Following on from this, I describe fine-grained cell states and ageing changes that occur in the different compartments of skeletal muscle, including muscle stem cells, the myofiber itself and supporting cells of the muscle microenvironment.
I conclude with a discussion on how processes in different parts of the muscle combine to cause a breakdown in muscle function with age. In Chapter 5 I summarise the major insights from both studies and discuss their potential therapeutic uses. Next, I use my experience to provide suggestions on experimental design and challenges that need to be solved for the next generation of atlases looking to understand changes across the human lifespan.
  • ItemOpen Access
    Functional genomics of developmental disorders
    Hampstead, Juliet
    DNA methylation, or the epigenetic modification of primarily cytosine bases within DNA to 5-methylcytosine through the addition of a methyl group, is an epigenetic mark with a variety of biological and cellular roles. Genetic and environmental influences can perturb DNA methylation patterns in humans, and the set of differentially methylated CpG sites perturbed can be collectively called a DNA methylation signature. In this thesis, I characterise DNA methylation signatures as a diagnostic biomarker for children with rare developmental disorders in chromatin-modifying genes. I show that DNA methylation signatures are a general property of these genes, that they have substantial clinical and diagnostic utility, and that they can be used to resolve variants of uncertain significance. I also show that these signatures are robust across scientific centres and can be generated across multiple tissues. Lastly, I compare DNA methylation signatures generated from methylation microarrays to those generated from genome-wide long read sequencing data, and provide evidence that long read sequencing is a reliable and scalable method to profile 5-methylcytosine for DNA methylation signature-based classification. Overall, my work emphasises the need for scalable, cost-effective, and relatively high-throughput biomarkers in the characterisation and diagnosis of rare developmental disorder syndromes.
  • ItemOpen Access
    Understanding Genomes Through Engineered Structural Variation
    Koeppel, Jonas
    Sequencing of the human genome has provided us with a detailed map of its content. While enormous progress has been made towards understanding the 1% of the human genome that is protein coding, we are still mostly in the dark about the function and relevance of the remaining 99%. Progress has been difficult because the non-coding genome is vast, the individual nucleotides hold less information, and we have lacked the tools to engineer and probe it to the necessary extent. This is beginning to change with the advent of ‘search and replace’ genome engineering technologies such as CRISPR prime editing. I leveraged the ability of prime editors to insert recognition sequences for recombinases at high throughput to engineer genomes at an unprecedented scale. In the process, I made discoveries about the biology of genome engineering, structural variation, and gene regulation. I first outlined the determinants of short sequence insertion using prime editing by systematically measuring the frequency of insertion for 3,604 short sequences in four target sites of three human cell lines with varying DNA repair contexts. I characterized how insertion sequence length and two cellular DNA processing pathways affected the incorporation rate. I reaffirmed that DNA mismatch repair suppressed the insertion of shorter sequences and made the discovery that 3’ flap nucleases TREX1 and TREX2 suppressed the insertion of longer sequences. I further delineated the effects of nucleotide composition and secondary structure of the insertion sequence on editing rates. Next, I targeted a prime editor to the high copy number LINE-1 retrotransposon to insert hundreds of recombinase sites into a single human genome. These engineered cell lines provided a latent substrate for large-scale genome randomization. 
After induction with Cre recombinase, I mapped thousands of deletions, inversions, extrachromosomal circular DNA, translocations, and fold-back inversions and tracked their abundance over time. Sequencing surviving variants and comparing them to early ones revealed strong selection pressures against creating non-segregable derivative chromosomes or deleting essential genes. However, it also demonstrated that haploid human cell lines could survive while losing megabases of DNA. I isolated 21 cell clones and linked variants to gene expression changes for three clones with multiple Cre-induced rearrangements. Finally, I used prime editing to insert loxPsym sites into the regulatory region of the *OTX2* developmental transcription factor. Cre recombinase induced stochastic deletions and inversions across the recombinase sites, and created diverse and novel enhancer arrangements. By endogenously fusing *OTX2* with a fluorophore and sorting, I could associate alternative regulatory architectures with *OTX2* expression and track changes in CpG methylation and chromatin accessibility. I discovered that three enhancers in a 20 kb cluster drove 50% of *OTX2* expression and that moving the cluster closer to the transcription start site while simultaneously deleting intermediate regulatory elements resulted in strong *OTX2* expression. The strategies presented here to more efficiently insert short DNA sequences with prime editing, shuffle DNA, and rearrange regulatory regions give a fundamentally new approach to randomizing mammalian genomes which will open new avenues to go beyond the 1% of coding sequence and study the 99% of underexplored regions. The data garnered from molecular phenotyping of novel genome architectures after randomization will allow predictive models to learn parameters beyond the limited diversity of our DNA.
  • ItemEmbargo
    The impact of cytotoxic chemotherapy on somatic mutation in normal human cells
    Dunstone, Eleanor
    Since their advent in the mid-twentieth century, cytotoxic chemotherapy drugs have been used to treat hundreds of millions of people with cancer. These drugs remain the most effective form of treatment in many cases, with over half of newly-diagnosed patients requiring chemotherapeutic intervention. Unfortunately, a small proportion of cancer survivors will go on to develop second tumours as a result of the treatment they received for their first diagnosis. These tumours are usually genetically unrelated to the patient’s first cancer, suggesting that cytotoxic chemotherapy may increase the chance of normal cells becoming malignant. Many chemotherapy drugs are thought to exert their effects on tumours by inducing DNA damage, which can generate somatic mutations in any surviving tumour cells. However, the systemic nature of many treatments means that the patient’s normal cells are also at risk of chemotherapy-induced mutagenesis. Patterns of mutations or ‘mutational signatures’ associated with chemotherapy have been seen in patient tumours, in cells exposed to drugs *in vitro*, and in some normal tissue samples from cancer patients. However, a comprehensive study of the impact of chemotherapy on the somatic mutational landscape of a wide range of normal human tissues has not yet been performed. In this project, I investigate the mutational impact of cytotoxic chemotherapy using two complementary approaches: the identification of chemotherapy-associated mutational signatures *in vitro* using organoid models; and the sequencing of samples of a wide range of tissue types from chemotherapy-treated patients. This work was enabled by recent developments in highly error-corrected duplex sequencing approaches, facilitating accurate detection of mutations at single molecule level. 
The results of this study show that many widely-used chemotherapy drugs are mutagenic in normal cells, generating distinctive patterns of single-base substitutions (SBS), doublet-base substitutions (DBS), and small insertions and deletions (indels). Treatment of normal human cells with chemotherapy drugs and environmental agents *in vitro* demonstrated mutational signatures associated with a wide range of agents. Alkylating and platinum-based agents were the most mutagenic classes of chemotherapeutics, with all 13 alkylating agent drugs and all four platinum-based drugs tested generating SBS signatures. Many alkylating agents and all platinum-based agents also generated an increase in DBS and indels. Previously undescribed signatures discovered include SBS signatures associated with mitomycin C, lomustine/carmustine, busulfan and thiotepa, alongside a DBS signature of mitomycin C and an indel signature associated with a broad range of platinum-based and alkylating agents. Additionally, indel signatures of topoisomerase II inhibitors and bleomycin were described for the first time. The antimetabolite drugs tested generally did not show significant mutagenesis when applied as single treatments; however, a previously-observed signature associated with 5-fluorouracil was recovered when samples were exposed to repeated treatments. The *in vitro* studies also highlighted the complex relationships between compound concentration, cytotoxicity and mutagenicity, with many compounds showing non-linear dose responses or differing relationships between dose and mutagenicity between different organoid tissues-of-origin. Treatment-associated mutagenesis was also shaped by DNA repair, demonstrated by investigating the relationship between *in vitro* temozolomide-induced mutagenesis and the activity of the DNA repair enzyme *O*-6-methylguanine-DNA methyltransferase (MGMT). 
Temozolomide exposure was shown to generate ten-fold higher SBS burdens in MGMT-knockout human induced pluripotent stem cells (hiPSCs) than in MGMT-wild-type hiPSCs, showing a different mutational signature depending on MGMT status. The sequencing of normal tissue samples from chemotherapy-treated patients showed that many of these drugs are also mutagenic *in vivo*, with eight of the signatures identified *in vitro* being observed in patient samples. Many alkylating and platinum-based agents were shown to have a major impact on SBS burden in normal human tissue samples, with some patients carrying several times as many mutations as would be expected for a person of their age. High excess SBS burdens are seen across many tissue types, including those composed predominantly of post-mitotic cells such as cardiac muscle, skeletal muscle and the neuron-rich cerebellar granular layer, which have historically been considered not to acquire substantial numbers of somatic mutations during adult life. These burdens are associated with signatures of platinum-based drugs, thiotepa, temozolomide and other alkylating agents. Additionally, DBS signatures associated with platinum-based agents and mitomycin C were observed across many normal human tissue samples, as were indel signatures associated with thiotepa and other cross-linking/alkylating agents. Conversely, some agents that generated signatures *in vitro* appeared not to be mutagenic *in vivo*, including 5-fluorouracil/capecitabine and topoisomerase II inhibitors. Alongside the requirement for repeated treatment *in vitro*, this suggests that antimetabolite-associated mutagenesis may be restricted to dividing cells, and that mechanism of action influences the capacity of different drugs to generate mutations in normal cells. 
The prevalence of high chemotherapy-associated mutation burdens in normal tissues from cancer patients presents one possible factor contributing to the increased risk of second tumour development in chemotherapy-treated survivors of cancer, and the widespread DNA damage observed in post-mitotic tissues may inform research into other long-term side effects of chemotherapy treatment such as cardiac and neurological dysfunction. The work also generates insights into the fundamental mechanisms of mutation in human cells and provides a compendium of signatures of exposures to chemotherapy drugs and other environmental mutagens, facilitating future efforts to further characterise the distribution of these mutational patterns in different patients and tissue types.
  • ItemOpen Access
    Dissecting immune interactions in health and disease with multiomics and spatial technologies
    Arutyunyan, Anna
    Cells are the basic building blocks of life, forming the enormous plethora of tissues and living organisms on Earth. They have a high diversity of phenotypes and functions in different environments. The high-throughput tools to profile different modalities from a single cell have grown exponentially in recent years. This now allows us to draw a complete picture of how cells function in different environments for the first time. In the work of my thesis, I use high-throughput multiomics and spatial technologies to create comprehensive cell atlases. I focus on studying how immune cells communicate among themselves and with other cells in the context of disease and development. Chapter 1 starts with an outline of the background on cell biology and the impact that genomic technologies have on how we can study cellular processes. I then discuss the experimental methodology of high-throughput multiomics and spatial techniques, and computational tools for the analysis of such data. Following is the introduction to the two projects comprising the work of my thesis: (i) a multiomics study of Common Variable Immunodeficiency (CVID) and (ii) a spatial multiomics map of the Maternal-Fetal Interface (MFI) in early pregnancy in humans. Chapter 2 outlines materials and methods used in this work, showcasing the workflow for each project. Chapter 3 details the multiomics atlas of Common Variable Immunodeficiency (CVID). This condition is characterised by defects in the function of B cells, a type of adaptive immune cells capable of producing antibodies to fight infections. I analyse gene expression and chromatin accessibility data of B cells from a pair of monozygotic CVID-discordant twins. I uncover potential defects in the epigenome of the affected twin’s B cells. Next, after in vitro stimulation of these twins’ PBMCs, I observe CVID-associated transcriptional dysregulation in immune subsets additional to those in B cells. 
I discover defects in the immune cell crosstalk between B cells and other immune compartments of the CVID twin. With an expanded cohort of CVID patients and healthy individuals, I go on to further validate these findings. These results show that, in addition to B-cell-intrinsic alterations, defects in cell-cell communication between B cells and other immune compartments may be compromising the correct immune response in CVID. Chapter 4 presents the work on creating a comprehensive spatial multiomics atlas of the maternal-fetal interface in early pregnancy. Firstly, I characterise the signatures and differentiation trajectories of trophoblast cells - the building blocks of placenta. I then focus on the crosstalk between invading trophoblast and maternal immune cells. I predict putative cell-cell communication events and validate in situ the selected molecules mediating these interactions. I propose a model of arterial transformation facilitated by fetal trophoblast and their communication with maternal cells. This work expands our knowledge about the cellular and molecular players in the maternal-fetal dialog in the first trimester of pregnancy, definitive of its success. Chapter 5 describes the work on modelling the dialog between decidual natural killer (dNK) cells, a type of innate immune cell most abundant in pregnant decidua, and the invading trophoblast at the maternal-fetal interface using primary trophoblast organoids (PTO). I benchmark the PTO system against the in vivo trophoblast atlas I described in chapter 4. After defining trophoblast cell states in vitro, I perform comparative analysis of PTOs stimulated with a cocktail of chemokines that in vivo are secreted by dNK cells and unstimulated PTOs as control. I propose a putative effect of the signals from dNK cells on trophoblast invasion in the first trimester of pregnancy. 
Lastly, Chapter 6 provides an overview of all the described work, as well as a discussion of how the novel high-throughput multiomic and spatial technologies together with in vitro models shape our current view of fundamental biology, and how they will impact future directions of research.
  • ItemEmbargo
    Genetic architecture of transcript splicing in blood and phenotypic consequences
    Tokolyi, Alexander [0000-0003-4222-7484]
    Transcript splicing is a fundamental process which allows for the generation of multiple different isoforms from a single gene body, increasing the functional capacity of our genomes. Bulk RNA-sequencing has allowed us to analyse this at scale by investigating not only the amount of a gene product detected, but where, and by how much, certain parts of genes have been excised. This thesis presents the largest to-date investigation into the genetic architecture of transcript splicing in blood, utilising a deeply-phenotyped cohort of 4,732 healthy adults, in addition to 638 adults presenting to intensive care units with sepsis. I first explore the genetic architecture of transcript splicing through the generation of splicing quantitative trait loci (sQTLs) in a healthy cohort of blood donors from the INTERVAL study. Transcript splicing is quantified through the use of split-reads present in RNA-seq data using the LeafCutter pipeline, allowing the quantification of transcript splicing without regard to established reference annotations. As the derived splice events do not have a 1-to-1 mapping to currently defined isoforms, I created a pipeline to richly annotate these splice events to aid in subsequent analyses. This resulted in 29,514 *cis*-sQTLs in 6,853 *cis*-sGenes, and I demonstrate large overlap with previous findings in addition to a plethora of new associations. Using *cis*-eSNPs derived from the same cohort, I perform a targeted *trans*-sQTL analysis under the hypothesis that *trans*-sSNPs could regulate splicing through the regulation of certain gene products involved in splicing. This validated the few currently known *trans*-sQTL associations, and provides a total of 642 splice events (in 208 sGenes), including known splice factors. 
Due to the magnitude and novelty of the created information, I develop an interactive online portal to browse and explore these sQTL results and to incorporate subsequent analyses, creating an interpretable form of the results generated by this thesis. This portal is publicly available at: https://intervalrna.org.uk/. As the INTERVAL cohort is deeply phenotyped, containing protein measurements in plasma along with metabolites, lipids, and their genetic associations, I perform colocalisation analysis of these with the generated spliceQTLs to explore their shared genetic architecture. This reveals that many splice events and molecular phenotypes appear to be regulated by shared genetic effects, and through examples I demonstrate how splicing could be modulating these downstream phenotypes through mechanisms such as changes in solubility. As a proof of concept, I then compare public GWAS statistics for immune and blood related diseases with both spliceQTLs and those of the downstream molecular phenotypes, detailing many splicing-mediated pathways of disease through which risk loci are putatively acting, the majority of which are independent of eQTLs. To investigate the interaction of genetic and environmental effects on transcript splicing in disease, I utilise the GAinS cohort of 638 adults that have had blood taken upon arrival to the ICU with sepsis. Using this, I explore the transcriptomic differences between these individuals that are explained by transcript splicing and how this information can be used to predict patient status, and subsequently compare the shared genetic architecture of these splicing events with those of the healthy individuals and the previously defined downstream associations with molecular phenotypes. Notably, through colocalisation with summary statistics for COVID-19 susceptibility and severity, I observe risk loci shared with those impacting transcript splicing in the sepsis patients that were not observed in the healthy individuals. 
In summary, this thesis provides an in-depth analysis of the genetic architecture of the largest to-date catalogue of transcript splicing, explores their utility in explaining the regulation of downstream molecular phenotypes, and demonstrates how these associations can be used to understand the mechanistic pathways of risk loci.
  • ItemOpen Access
    Probabilistic Dynamical Modelling of Spatiotemporal Cell Trajectories During Neural Development
    Aivazidis, Alexander
    In this PhD thesis, I present two new computational models, Cell2fate and CountCorrect, for the analysis of single-cell and spatial transcriptomics data, and I show how they can be applied to more effectively map the rules of brain cell development in health and disease. Cell2fate is an RNA velocity model for inference of transcriptional dynamics from spliced and unspliced RNA counts. Unlike existing models, cell2fate is capable of capturing complex biological processes while still being analytically tractable. This is achieved with an implicit factorization of RNA velocity solutions into modules, which also enhances statistical power and interpretability. By evaluating cell2fate in various real-world scenarios, I demonstrate its enhanced ability to capture complex dynamics and weak dynamical signals in rare and mature cell types. Finally, I apply cell2fate to developing mouse and human brain single cell datasets, where I also demonstrate that RNA velocity modules can be mapped to parallel spatial transcriptomics data. The CountCorrect model provides new normalization and cell type mapping methods for the Nanostring WTA spatial transcriptomics technology that take into account background binding of RNA probes. I use CountCorrect to analyze a spatial transcriptomics dataset of the human developing cortex, which revealed spatial autism enrichment patterns, a cortical cell type abundance map and differential gene expression patterns in Cajal-Retzius cells across developmental time and cortical regions.
  • ItemEmbargo
    Probabilistic models to resolve cell identity and tissue architecture
    Kleshchevnikov, Vitalii [0000-0001-9110-7441]
    Cell identity drives cell-cell communication and tissue architecture and is in return regulated by cell-extrinsic cues. Cell identity is determined by the combination of intrinsic developmentally established transcription factor (TF) use and constitutive as well as cell communication-dependent TF activities. In my thesis, I developed two probabilistic models that advance the understanding of these processes using single-cell and spatial genomic data. Spatial transcriptomic technologies promise to resolve cellular wiring diagrams of tissues in health and disease, but comprehensive mapping of cell types in situ remains a challenge. I present cell2location, a Bayesian model that can resolve fine-grained cell types in spatial transcriptomic data and create comprehensive cellular maps of diverse tissues. Cell2location accounts for technical sources of variation and borrows statistical strength across locations, thereby enabling the integration of single-cell and spatial transcriptomics with higher sensitivity and resolution than existing tools. We assess cell2location in three different tissues and demonstrate improved mapping of fine-grained cell types. In the mouse brain, we discover fine regional astrocyte subtypes across the thalamus and hypothalamus. In the human lymph node, we spatially map a rare pre-germinal centre B cell population. In the human gut, we resolve fine immune cell populations in lymphoid follicles. Collectively, our results present cell2location as a versatile analysis tool for mapping tissue architectures in a comprehensive manner. Cell identity and plasticity are regulated by a combinatorial code mediated by transcription factors and the cell communication environment. Systematically dissecting how the regulatory code robustly defines the vast complexity of cell populations across tissues is a long-standing challenge. 
Measured using the assay for transposase-accessible chromatin with sequencing (ATAC-seq), DNA accessibility provides a readout of intermediate gene regulation steps at single-cell resolution, with technologies measuring both RNA and ATAC providing the necessary evidence to build mechanistic models of regulation. Existing methods address one or several subproblems of modelling DNA accessibility. For example, the DNA sequence-based deep learning models represent combinatorial interactions and in-vivo TF-DNA recognition preferences. In contrast, GRN models use TF abundance profiles across cells and in-vitro-derived TF-DNA recognition preferences, optionally incorporating ATAC-seq data as a filter. All models learn cell-type specific weights/properties and don't generalise to new TF abundance states such as new cell types. Therefore, we are missing an end-to-end mechanistic model that represents all steps of the biological process, that generalises to both new DNA sequences and TF abundance combinations and can simultaneously characterise hundreds to thousands of cell states observed in single-cell genomics atlases. Here, I formulated cell2state, a mechanistic end-to-end probabilistic model of TF recruitment to a chromatin locus and downstream TF effect on DNA accessibility. Cell2state is designed to achieve the generalisation of regulatory predictions to unseen cell types. Cell2state A) estimates TF nuclear protein abundance and models B) how TFs recognise DNA, C) how TF sites in DNA lead to TF recruitment to a chromatin locus, D) how the activity of DNA-associated TFs affects chromatin accessibility. To evaluate generalisation, I defined the computational problem and developed a workflow for predicting the scATAC-seq readout for previously unseen chromosomes and cell types. I show that cell2state outperforms the state-of-the-art deep learning models (ChromDragoNN) at explaining DNA accessibility differences across cells. 
Finally, to look at cell state plasticity, I developed ways to use cell2state to simulate the possible chromatin states given TF abundance of source cell types.