Theses - Wellcome Sanger Institute
Recent Submissions
Item Open Access
Chromosome evolution in Rhabditina (Nematoda) with a focus on programmed DNA elimination
Gonzalez De La Rosa, Pablo Manuel
Recent advances in DNA sequencing technologies have enabled research on the mechanisms that shape chromosome evolution in diverse taxa. My thesis focuses on nematodes from the Rhabditina suborder, which exhibit a high rate of chromosome rearrangement and have several species with chromosome-level assemblies available for comparison. To study chromosome evolution in this group, I inferred a set of loci that comprised the ancestral linkage groups, called Nigon elements, in rhabditid nematodes. This showed that the sex chromosome of the model nematode Caenorhabditis elegans is the product of a fusion of an ancestral autosome and the ancestral sex chromosome. As part of this work, I also generated the first telomere-to-telomere genome with no gaps of unknown sequence of a non-Caenorhabditis nematode, Oscheius tipulae. We serendipitously discovered that O. tipulae undergoes programmed DNA elimination (PDE) by identifying that both ends of each chromosome had two alternative terminal telomere repeat arrays. PDE is a developmentally controlled process whereby somatic cells selectively lose part of their genome. In total, less than 0.5% of the 60 Mb O. tipulae genome is eliminated. To gain insights into the evolution of PDE, I generated chromosome-level assemblies for four Oscheius species that are closely related to O. tipulae. These revealed PDE in all species, and linked PDE with variations in telomere length and breakage-associated sequence motifs. I characterised a novel gene family linked to chromosome ends, derived from helitron2-like transposable elements, and variably eliminated regions in Oscheius dolichura involving divergent paralogs of conserved genes. To facilitate comparative research, I propose a framework outlining a standard set of features for future studies.
Finally, I applied this framework to characterise PDE in Auanema rhodensis, a species closely related to Oscheius in which over 60% of the genome is eliminated by PDE. The eliminated regions in A. rhodensis are also delineated by a sequence motif, contain protein-coding genes and ncRNAs, and share the same large tandem repeats across all autosomes. Together, my research sheds light on the evolution of nematode chromosomes and PDE.

Item Open Access
Geographic Migration and Evolution of Streptococcus pneumoniae
Belman, Sophie
Streptococcus pneumoniae (the pneumococcus) is a human-obligate, opportunistic bacterium. It resides asymptomatically in the nasopharynx of both children and adults. It occasionally goes on to cause local disease such as otitis media, or severe invasive disease such as pneumonia and meningitis. It is prevalent globally and rates of carriage have an inverse relationship with country income. The pneumococcus comprises a massive diversity of >800 global lineages and 100 antigenically distinct serotypes. Many lineages and serotypes co-circulate endemically in any given location. Highly immunogenic conjugate vaccines have been implemented in >76% of national immunization schedules; we also have many antimicrobials to which the pneumococcus is susceptible. Following the introduction of the pneumococcal conjugate vaccine in the 2000s there have been global ecological expansions and contractions of serotypes, lineages, and antimicrobial resistance (AMR). There are distinct qualitative patterns which characterise the global geographic structure of both the endemically circulating pneumococcus and the constantly changing ecology of its genes and loci. The extent and mechanisms of spread, and vaccine-driven changes in fitness and AMR, remain largely unquantified. Using geolocated genome sequences from South Africa (2000-2014) I developed models to reconstruct spread.
I implemented simple statistical frameworks which can account for variable surveillance in both space and time. I also paired detailed human mobility data from Facebook users with genomic data within a mechanistic model to describe how human movement drives the geographic spread of the pathogen. I estimated that pneumococci only become homogeneously mixed across South Africa after about 50 years of transmission, with the slow spread driven by the focal nature of human mobility. Further, the human population density in the municipality of introduction appears to be important as well, with a more rapid radius of spread when introduced to rural areas. I include both disease and carriage isolates in our models. There are similar ecological shifts among healthy Cambodian children, following vaccination, as among disease isolates. Among South African samples I use a logistic model to estimate population-level changes in the fitness of strains that are covered by the vaccine (vaccine type, VT) and those that are not (non-vaccine type, NVT), as well as differences in fitness between strains that are and are not resistant to penicillin, continuously across the vaccine period (PCV7: 2009; PCV13: 2011). In the years following vaccine implementation the relative fitness of NVT compared to VT strains increased, with an increasing proportion of these NVT strains becoming penicillin resistant. These estimates indicate that initial vaccine-linked decreases in AMR may be transient. Data on human mobility between countries are scarce, so to understand between-country pneumococcal migration dynamics I developed a simulation approach using Bayesian optimization for likelihood-free inference. Incorporating thousands of pneumococcal genomes from South Africa, Malawi, The Gambia, and Kenya, I included only neutral sites. I estimated migration rates and directions of migration between countries.
I found some heterogeneity between lineages and across demes, and an inverse relationship between migration and the destination country population size. This model could be used to inform modeling frameworks on wider pneumococcal spread and the timing and implementation of public health interventions. I also worked to better understand the history and evolution of the hundreds of extant GPSCs by delving into a 5,700-year-old metagenome with genomic reads containing high identity to the pneumococcus. While this resulted in identifying a streptococcal ancestor with no polysaccharide capsule, it did not further our knowledge of extant pneumococci; as such this paper is included in Appendix A. This work has contributed valuable insight into pneumococcal migration and mechanisms of spread. I have quantified important fitness dynamics and incorporated them into our mobility models. Furthermore, I have developed a method which can now estimate migration parameters asymmetrically between countries, elucidating paths of global pneumococcal spread. Together this thesis marks an advance in our understanding of the geographic migration of Streptococcus pneumoniae and lays the groundwork for understanding its diverse, complex, global spread.

Item Embargo
Somatic Mutations in Ageing and Degenerative Disease
Harvey, Luke
Following the first division of the zygote, somatic mutations begin to accumulate in all human cells. Daughter cells become increasingly mutated, leading to mosaic tissues composed of genetically heterogeneous clonal units. In recent years, large-scale sequencing efforts have begun to characterise the process of mutagenesis in normal tissues. These observations have improved our understanding of how somatic cells evolve oncogenic phenotypes and hinted that somatic changes may contribute to the development of age-related phenotypes and degenerative disease.
To accurately characterise the mutational processes in normal tissues, we developed nanorate sequencing (NanoSeq), a duplex sequencing protocol with error rates of fewer than five errors per billion base pairs, allowing for the accurate identification of mutations in single DNA molecules. Using NanoSeq we describe the mutational processes in the nuclear and mitochondrial genome of three distinct cell types across the brain and cardiac muscle. We show that post-mitotic tissues accumulate somatic mutations at comparable rates, and by similar processes, to dividing cells. We explore the relationship of mutation rates to transcription and chromatin state, and reveal patterns of transcription-coupled damage and repair in neuronal genomes. In the analysis of somatic mutations in cardiomyocytes, we identify a novel mutational signature that we believe may be caused by hypoxia-induced oxidative stress. Lastly, we demonstrate that neurons isolated from Alzheimer’s disease brains have a lower mutation burden than healthy neurons. To further investigate the role of somatic mutations in degenerative disease, we characterised somatic mutagenesis in rheumatoid arthritis and osteoarthritis. Using laser capture microdissection, we isolated 2000 microbiopsies of intimal lining, sublining, and lymphocytes for whole exome sequencing. Using these results we show that the synovium is a mostly polyclonal tissue with the capacity for large clonal expansions. We explore how the histopathology of rheumatoid arthritis and osteoarthritis relates to driver mutations, clonal expansions, and immune infiltration. Using NanoSeq, we characterise the mutational processes in isolated synovial cell types. Finally, we show the power of NanoSeq for high-throughput driver discovery and identify 15 genes under positive selection in the synovium. This thesis offers novel insights into the mutation rate of distinct cell types in healthy and disease states.
This work contributes to a growing body of literature characterising somatic evolution and provides a foundation for further functional studies to directly investigate the role of mutant clones in disease states.

Item Embargo
Enabling the in vitro study of long noncoding RNAs to understand their role in Plasmodium falciparum
Hoshizaki, Johanna
Long noncoding RNAs (lncRNAs) have been identified in *Plasmodium falciparum*, the parasite that causes life-threatening malaria, yet their role remains largely undiscovered. Through interactions with nucleic acids and proteins, lncRNAs can modulate gene expression at the transcriptional, post-transcriptional, translational, and post-translational levels. Determining the role of lncRNAs in the regulation of the *P. falciparum* transcriptome and proteome is imperative to further our understanding of gene regulation in the parasite. The characterisation of *P. falciparum* lncRNAs has been hindered by an incomplete annotation and the absence of disruption methods that together would permit high-throughput systematic knockdown of lncRNAs. During my PhD, I addressed these challenges to enable the study of *P. falciparum* lncRNAs *in vitro*. I generated a high-quality lncRNA annotation using manual curation of sequencing data generated at the Sanger Institute, along with supportive datasets from the literature. I evaluated CRISPR-based approaches for *in vitro* disruption of lncRNAs including gene knockout, knockdown, and interference. CRISPR-associated enzymes were explored including commonly used DNA-cutting enzymes (Cas9), inactivated enzymes to block transcription (dCpf1) and enzymes that target RNA directly (Cas13), the latter of which had not been applied to *Plasmodium*. Furthermore, I implemented these tools to demonstrate the feasibility of lncRNA studies in *P. falciparum*. I interrogated a set of lncRNAs that were selected based on predicted biological significance and targetability using dCpf1.
LncRNA-depleted parasites were phenotypically characterised by assessing changes in fitness, drug resistance, gametocytogenesis and expression. I identified potential roles for specific lncRNAs in drug resistance and gametocytogenesis. By developing bioinformatics and molecular tools, this work enables future studies elucidating the specific roles of lncRNAs in *P. falciparum*. Understanding the transcriptome and gene regulation will inform the development of novel interventions for the control and eradication of malaria, which remains a serious global health concern.

Item Open Access
Quantitative modelling of CRISPR-Cas editing outcomes
Pallaseni, Ananth
Development of the CRISPR-Cas toolkit over the last decade has enabled unprecedented control over the genome and unlocked the capacity for new experiments and gene therapies. However, use of these technologies is made more difficult by variance in the rate and type of individual genetic outcomes generated by each editor. This variance is reproducible and a function of the sequence being targeted, thus making it amenable to quantitative modelling. My research focuses on building such models in a variety of contexts. In this thesis, I recount the history of gene editing from transgenesis to prime editors. I then review the modelling techniques I use in my projects, before covering the state of predictive modelling for genome engineering. In two main results chapters, I discuss my work on modelling base editor outcomes and the effects of DNA repair context on Cas9-induced double-stranded break repair. The final results chapter covers shorter, collaborative studies on other gene editing technologies. I conclude by discussing the need for computational tools in genome editing and the gaps in our understanding. In my first results chapter, I examine the sequence- and position-specificity of base editor activity.
Base editors are a gene editing technology derived from Cas9 that introduce precise base substitutions into a targeted region of the genome. The rate of these substitutions is known to vary between targeted sequences and the determinants of this variation are not completely understood. To untangle the determinants of base editing efficacy, our group performed a large-scale screen where we measured base editing outcomes across 20,000 targeted sequences in multiple cell lines and editors. I processed and analysed the data produced in this experiment and found that both the sequence flanking editable bases and the position of those bases in the sequence affect the rate of observed editing. I leveraged this understanding to construct a position-specific model of base editing activity for each editor type and used these models to predict the efficacy and specificity of base editors for correcting pathogenic variants found in ClinVar. The second results chapter focuses on the mutational outcomes of Cas9-induced cuts in repair-deficient backgrounds. Cas9 creates a double-stranded break at a targeted location in the genome and the cell repairs this lesion via several pathways which can leave mutations in the repaired sequence. It has been shown in previous studies that the distribution of these mutations is reproducible and dependent on the sequence being cut, but the effect of repair context on this process is not well understood. I planned an experiment to measure Cas9 repair outcomes at over 5000 target sites in 21 mouse cell lines with knockouts of single repair genes, then processed and analysed the data generated. I show that the knockout cells have reproducibly different repair patterns from controls. I highlight Nbn, Lig4 and PolQ as examples of knockouts with consistent effects on certain mutation types. I examine how the known sequence determinants of Cas9 outcomes affect outcome preference in knockout lines.
Lastly, I use this understanding to train models that predict the distribution of Cas9 outcomes in various repair backgrounds. My final results chapter discusses two shorter collaborative studies on alternative editing technologies. First is the design of a large-scale screen to profile the behaviour of a new Cas enzyme. I explain the experimental process of profiling Cas outcomes, the decisions involved in designing a guide library and an approach to modelling. The other collaboration is the prediction of editing rates when inserting sequences into the genome with prime editors. Here, I train a model to predict editing rates, examine which features are most important to predictive performance, and finally determine that collection of more data for training will improve model performance.

Item Open Access
Somatic phylogenies: a window into human biology in health and disease
Spencer Chapman, Michael [0000-0002-5320-8193]
All somatic cells in the human body originate from the fertilised egg, or ‘zygote’. Co-ordinated processes of cell division and differentiation result in the incredible complexity observed following development. Thereafter, each tissue must maintain its function for the decades of life that follow, while minimising risks such as cancer. Crucial to understanding the strategies employed to meet these demands is knowing which cell gives rise to which, i.e. the structure of the somatic lineage tree. For this purpose, it is fortuitous that most cell divisions result in the acquisition of somatic mutations that are passed on to all of the cell's future progeny. Therefore, theoretically, if all somatic mutations in each cell of an organism were known, the organism’s full somatic lineage tree could be determined. While this is not yet possible at the scale of an entire organism, advances in clonal expansion and whole-genome sequencing technologies have facilitated such experiments on the scale of hundreds of cells in a single individual.
This thesis contains three results chapters, each utilising somatic phylogenies in different ways. The first looks at early human development. Due to the small number of cells in early development, a near complete phylogeny can be determined. Assessing contributions of early lineages to different tissues is used to determine the developmental relationships and origins of tissues. The second looks at the clinically important context of gene therapy for sickle cell disease. Comparisons of somatic phylogenies from before and after the gene therapy procedure give insights into the biology of sickle cell disease and the safety of gene therapy in this context. Finally, the third chapter looks at somatic phylogenies from a different perspective: as a way of testing our assumptions regarding mechanisms of mutation acquisition. The observation that some mutations do not fit these assumptions leads to the surprising conclusion that some mutation-causing DNA lesions persist unrepaired for months to years. Overall, this thesis demonstrates the power of interrogating somatic phylogenies to answer important questions in diverse biological disciplines. As technological advances allow such experiments on ever larger scales, the power of this approach will only increase.

Item Open Access
The natural history of clonal haematopoiesis
Fabre, Margarete
Introduction
Human cells acquire somatic mutations throughout life, some of which can drive clonal expansion. Such expansions are frequent in the haematopoietic system of healthy individuals and have been termed clonal haematopoiesis (CH). While CH predisposes to myeloid neoplasia and other diseases, we have limited understanding of its natural history and how this relates to clinical phenotype.
Objectives
1. To characterise the behaviour of CH across the human lifespan;
2. To identify and quantify determinants of clonal behaviour;
3. To understand how CH clonal dynamics relate to malignant progression.
Results
By tracking 697 CH clones from 385 individuals aged 55 or older over a median of 13 years, we found that 92.4% of clones expanded at a stable exponential rate in old age, with different mutations driving substantially different growth rates, ranging from 5% (DNMT3A, TP53) to over 50%/yr (SRSF2-P95H). Growth rates of clones with the same mutation differed by +/-5%/yr, proportionately impacting “slow” drivers more substantially. Combining these time-series data with phylogenetic analysis of 1,731 whole genome-sequenced haematopoietic colonies from 7 older individuals revealed distinct patterns of lifelong clonal behaviour. DNMT3A-mutant clones preferentially expanded early in life and displayed slower growth in old age, in the context of an increasingly competitive oligoclonal landscape. By contrast, splicing gene mutations only drove expansion later in life, while TET2-mutant clones emerged across all ages. Using a separate cohort of 158 twins, we found that concordance for CH was no higher within monozygotic vs dizygotic pairs, suggesting that the inherited genome does not exert a dominant influence on CH behaviour. The identification of two monozygotic pairs in which both twins harboured identical rare somatic mutations confirmed that the origins of adult CH can be traced back to early life, even in utero. Finally, by comparing CH growth dynamics with (i) driver mutation selection patterns in large myeloid cancer data sets and (ii) driver-specific AML risk scores, we show that mutations driving faster clonal growth also carry a higher risk of malignant progression.
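To give a sense of scale for the growth rates reported above, the following minimal sketch converts an annual exponential growth rate into a doubling time. It assumes the quoted percentages are compounding per-year increases in clone size, so that a clone grows by a factor of 1 + r each year; the abstract does not state the exact parameterisation, and the function name `doubling_time_years` is purely illustrative.

```python
import math


def doubling_time_years(annual_growth: float) -> float:
    """Years needed for a clone to double in size.

    Assumes size(t) = size(0) * (1 + annual_growth) ** t, which is an
    illustrative reading of the "%/yr" figures, not the thesis's model.
    """
    if annual_growth <= 0:
        raise ValueError("annual_growth must be positive")
    # Solve (1 + r) ** t = 2 for t.
    return math.log(2) / math.log1p(annual_growth)


if __name__ == "__main__":
    for label, rate in [("5%/yr", 0.05), ("50%/yr", 0.50)]:
        years = doubling_time_years(rate)
        print(f"{label}: doubles roughly every {years:.1f} years")
```

Under this assumption a 5%/yr clone doubles roughly every 14 years and a 50%/yr clone in under two, which illustrates why a +/-5%/yr difference matters proportionately more for the slow drivers.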
Conclusions
These findings characterise the origins and lifelong natural history of CH and give fundamental insights into the interactions between somatic mutation, ageing and clonal selection.

Item Embargo
Technologies to decode the multicellular networks within the human body
Shilts, Jarrod
This is a story about how it is possible for collections of cells to physically assemble into coordinated multicellular systems. In other words, how millions of individual cells are able to physically interact with each other in an organized way, as well as how pathogens such as viruses can exploit these interaction points in order to infect the body. Its subject is principally centered around the proteins that cover the surfaces of human cells. These proteins have to bind with specific combinations of surface proteins on nearby cells, thereby establishing a complex ‘code’ of direct interactions possible between different cell populations and tissues. Some of these interactions trigger the exchange of signals that enable collections of multiple cells to coordinate complex behaviors such as immune responses, while others act as adhesive receptors that enable physical structure to emerge out of groups of cells. The influential roles of these surface proteins and their accessibility to systemic medications have also made them among the most effective targets for therapeutics, with surface proteins constituting a majority of all approved drug targets. So far, however, prior research has only pieced together a fragmented picture of the direct receptor links between cells and the functional roles they have. Surface receptors pose unique experimental challenges to study and historically have lacked systematic methods to measure, leading most studies to only consider receptors at small scale without a global view of the larger system. In this thesis, I take a different approach.
I will present my work to establish a series of technological tools and strategies that overcome these challenges, in order to make it possible to systematically build up from characterizing the function of individual receptor molecules all the way to reconstructing multicellular interaction networks across entire systems of human cells. These methodologies can be categorized into three sequential steps. First, testing the binding of pairs of surface proteins across large arrays to decode the ‘interactome’ between two cells. Second, using cell-based assays to annotate the broad functional consequences a surface interaction has. And third, to computationally integrate these diverse data sources in order to understand how interacting communities of cells are organized. As my initial case study, I consider the question of how the distributed individual cells of the human immune system interact to produce a cohesive whole. By individually producing recombinant forms of most surface proteins detectable on white blood cells, I could assemble the first systematic and quantitative interaction network of these proteins, and in the process discover several novel interactions and reveal the identities of previously-unidentified binding partners for key immunomodulatory receptors. I could then adapt those recombinant proteins to experimentally manipulate live human immune cells in a multiplex microscopy technique, which revealed previously unknown interactions as having prominent roles in immune activation and leukocyte adhesion. I will show how these data can be integrated with high-resolution expression data in order to infer patterns of cell-to-cell connectivity throughout the human body, as well as to formulate a mathematical model that could predict the behavior of interacting cells from molecular first-principles. In the second half of my thesis, I will explain how this series of methods I established can be adapted and extended to new contexts. 
I will describe a large-scale effort I led applying these methods to characterize the cell-to-cell interactions occurring within the human brain, which revealed unexpected new pathways by which glia can directly communicate with cortical neurons. I will then extend my approaches to reveal which interactions may play a causal role in driving human disease. To do so, I will first show computational methods I devised for leveraging human clinical genetics in order to pinpoint cell-to-cell processes underlying the pathology. In the final section, I will extend this to infectious diseases driven by host-pathogen interactions. For this, I will explain how the tools I established allowed me to rapidly respond to the COVID-19 pandemic by systematically profiling the surface proteins that act as host factors during infection by the novel coronavirus SARS-CoV-2. That work has led to the discovery of two pathogen-host interactions that have subsequently been independently linked to COVID-19 severity, and has helped clarify the precise host receptors that SARS-CoV-2 utilizes when invading human cells. From the combination of these technologies and approaches, I hope to provide a systematic and mechanistically grounded foundation for deconstructing the emergence of biological function from the interacting communities of cells that make up the human body.

Item Open Access
Clonal dynamics of haematopoiesis across the human lifespan
Mitchell, Emily
The haematopoietic system manifests several age-associated phenotypes including anaemia; loss of regenerative capacity, especially in the face of insults such as infection, chemotherapy or blood loss; and increased risk of clonal haematopoiesis and blood cancers. The cellular alterations that underpin these age-related phenotypes, which typically manifest in individuals aged over 70, remain elusive.
In my thesis I have aimed to investigate whether changes in HSC population structure with age might underlie any aspects of haematopoietic system ageing. In addition, I have investigated the impact of chemotherapeutic perturbations on haematopoietic stem cell mutation burden and clonal dynamics. To answer the ageing question, I have sequenced 3579 genomes from single-cell-derived colonies of haematopoietic stem cell/multipotent progenitors (HSC/MPPs) from 10 haematologically normal subjects aged 0-81 years. HSC/MPPs accumulated 17 somatic mutations/year after birth with no increased rate of mutation accumulation in the elderly. HSC/MPP telomere length declined by 30 bp/yr. To interrogate changes in HSC population structure with age, I used the pattern of unique and shared mutations between the sampled cells from each individual to reconstruct their phylogenetic relationships. I found that haematopoiesis in adults aged <65 was polyclonal, with high indices of clonal diversity. In contrast, haematopoiesis in individuals aged >75 showed profoundly decreased clonal diversity. In each elderly subject, 30-60% of haematopoiesis was accounted for by 12-18 independent clones, each contributing 1-34% of blood production. Most clones had begun their expansion before age 40, but only 22% had known driver mutations. I used the ratio of non-synonymous to synonymous mutations (dN/dS) to identify any excess of non-synonymous (driver) mutations in the dataset. This genome-wide selection analysis estimated that the set of 300 - 400 HSC/MPPs sampled from each adult individual harboured around 100 driver mutations, over 10-fold higher than the number of known drivers we could identify. Novel drivers affected a wider pool of genes than identified in blood cancers. 
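The driver estimate above rests on a standard piece of dN/dS arithmetic: when the ratio ω exceeds 1, the fraction of observed non-synonymous mutations in excess of neutral expectation is (ω - 1)/ω. The sketch below shows that calculation only; the input numbers are made up for illustration and are not values from the thesis, and `excess_nonsynonymous` is a hypothetical helper name.

```python
def excess_nonsynonymous(n_nonsyn: int, dnds: float) -> float:
    """Estimate how many observed non-synonymous mutations exceed the
    neutral expectation, given a dN/dS ratio.

    With dN/dS = omega, the neutral expectation is n_nonsyn / omega, so
    the excess is n_nonsyn * (omega - 1) / omega. Values of omega at or
    below 1 give no evidence of an excess, so 0.0 is returned.
    """
    if dnds <= 0:
        raise ValueError("dnds must be positive")
    if dnds <= 1.0:
        return 0.0
    return n_nonsyn * (dnds - 1.0) / dnds


if __name__ == "__main__":
    # Hypothetical inputs: 2000 non-synonymous mutations, dN/dS = 1.05.
    print(f"{excess_nonsynonymous(2000, 1.05):.1f} excess mutations")
```

The point of the example is that even a dN/dS only slightly above 1 can imply a substantial number of selected mutations when the total mutation count is large.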
Simulations from a simple model of haematopoiesis, with constant HSC population size and constant acquisition of driver mutations conferring moderate fitness benefits, entirely explained the abrupt change in clonal structure observed over the age of 70. By old age the majority of HSCs harbour at least one driver mutation. Our data support the view that dramatically decreased clonal diversity is a universal feature of haematopoiesis in elderly humans, underpinned by pervasive positive selection acting on many more genes than currently known. Finally, I also sequenced haematopoietic progenitor cells from individuals exposed to a wide range of chemotherapeutic agents. I was able to identify an increased mutation burden associated with a number of chemotherapeutic agents, including platinum and alkylating agents, some of which conferred thousands of excess mutations. There was wide variation in the mutation burden conferred by agents within the same class, meaning that there are potential patient benefits from switching drugs in commonly used regimens. I show that chemotherapy given in childhood can profoundly impact clonal dynamics in later life.

Item Embargo
Common genetic variation and spliceosome variants in rare developmental disorders
Wigdor, Emilie
Although thousands of rare disorders are caused by single, deleterious, protein-coding variants, evidence suggests that common variants also contribute to risk for rare, neurodevelopmental disorders (NDDs). These likely affect the penetrance and expressivity of protein-coding variants, posing a major challenge in the interpretation of rare variants. An additional challenge is our incomplete understanding of which variants are likely to affect gene function. Due to the high burden of “variants of unknown significance” (VUS), there is a great need to develop molecular biomarkers of individual disorders which could be used as an intermediate phenotype to help determine whether a VUS is pathogenic or benign.
For disorders which are due to mutations in spliceosomal components, global patterns of splicing changes may be a useful biomarker. The research presented falls into three main projects. First, I investigated via genetically-predicted gene expression, whether cis-regulatory variants modify the penetrance of inherited, putatively damaging variants in NDD probands in the Deciphering Developmental Disorders (DDD) Study. To determine whether there were overall differences in predicted gene expression between probands and controls, I conducted a Transcriptome Association Study. I then tested whether the predicted gene expression of genes harbouring inherited, putatively damaging variants, is lower in undiagnosed NDD probands compared to controls. Finally, I investigated the modified penetrance of inherited, putatively damaging variants by comparing predicted gene expression between undiagnosed NDD probands and their unaffected, variant-transmitting parents. Second, I further explored the role of common variants in severe NDDs using polygenic scores (PGS) in both DDD and the Genomics England (GEL) 100,000 Genomes project. I tested whether undiagnosed NDD probands over- or under- inherit PGS for NDDs and correlated traits. I found that NDD probands over-inherit PGS for NDDs and schizophrenia. To put these results into context, I compared unaffected parents of undiagnosed probands’ PGS to both controls and probands. I found that parents’ PGS are significantly different from controls’ PGS, but not from probands. Additionally, I explored sex differences in PGS, by examining both affected and unaffected individuals. I found preliminary evidence of a female protective effect in the context of common variation. Finally, I revisited the question of the modified penetrance of inherited, putatively damaging variants. 
Third, using whole genome sequencing and bulk RNA sequencing of whole blood from GEL, I investigated differential splicing and gene expression in rare disorder probands with a pathogenic variant in the spliceosome. I found enrichment of differentially expressed genes for processes related to genes containing minor introns. Additionally, I found enrichment of genes involved in spliceosomal components among differentially spliced genes, suggesting a potential feedback loop for regulation of splicing. These studies emphasise the importance of studying the convergence of common and rare variation, as well as the integration of functional data, in the context of rare disease genetics. Moreover, they highlight the need to collect phenotypic and genotypic data on parents and family members of rare disorder probands.
Item Open Access
The Contribution of Structural Variants to 2,095 Molecular Phenotypes in 12,354 European Ancestry Individuals
Howell, Brittany
Structural variants (SVs) are large-scale rearrangements of the genome resulting in linear and spatial changes which can profoundly affect the function of the genome. SVs contribute the majority of nucleotide variation among human genomes by number of basepairs and have been linked to various diseases and traits including schizophrenia, autism and obesity. In this thesis whole genome sequence data at 15X coverage were generated using 12,354 samples from the INTERVAL cohort. A combination of Genome STRiP, Lumpy, CNVnator and svtools was used to call deletions, duplications and inversions. I implemented a stringent, tiered QC procedure to minimise false positives. For duplications and deletions I modelled sequential random forests on read alignment parameters, resulting in 88% sensitivity and 99% specificity for deletions, and 55% sensitivity and 92% specificity for duplications. Final tuning of the overall quality score was modelled to ensure that 90% of carrier genotypes were identical among duplicate samples. 
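For readers unfamiliar with the two QC metrics quoted above, the following is a minimal sketch of how sensitivity and specificity are computed from confusion-matrix counts. It is not taken from the thesis: the function name and the counts are hypothetical, chosen only so that the result matches the figures reported for deletions.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity) from confusion-matrix counts.

    sensitivity = true positives / all real variants
    specificity = true negatives / all non-variants
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity


# Hypothetical counts (an assumption for illustration, not data from the thesis).
sens, spec = sensitivity_specificity(tp=88, fn=12, tn=99, fp=1)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```

The asymmetry reported for duplications (lower sensitivity, high specificity) corresponds to a classifier tuned to avoid false positives at the cost of missing some real calls.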
Finally, a graph-based procedure was used to collapse SVs with significant overlap in carriers and in genomic coordinates. The final callset consists of 123,801 sites, with each sample containing approximately 3,300 SVs. After rigorous QC, I compared the cohort to similar population cohorts including the 1000 Genomes Project and Hall-SV. The cohort is sensitive, capturing 93% and 92% of common deletions from each cohort respectively. There is less sensitivity at duplications, where INTERVAL captures only 65% and 75% respectively. The majority of detected variants are rare: 95% have MAF < 0.01, and 49% are singletons; both figures are in line with expectations set by other similarly sized cohorts such as gnomAD-SV. Intergenic SVs are more common than all SVs affecting coding regions, and multi-gene SVs are the rarest class, as expected. SVs are well tagged by SNPs: 88%, 97%, 93% and 46% of deletions, inversions, reference MEIs and duplications, respectively, have at least one SNP in high LD (r2 > 0.8), suggesting that genotyping of the SVs is high quality. We evaluated the contribution of SVs to a comprehensive range of phenotypes available in the cohort. These traits include a range of blood cell traits and phenotypes relating to inflammation and immunity including 1,348 metabolites, 92 plasma proteins and 125 full blood count traits. I modelled linear associations between single SVs and each trait, and identified 495 signals across 196 regions. After conditional analysis, I estimate SVs have a causal role in 34 signals, and are the lead variant in 54 signals. Chapter four details several examples of the contribution the SV is making to the association, describes potential genetic mechanisms, and gathers additional clinical data such as electronic health records and gene expression information to comprehensively describe the role of the SV at the association. Finally, there were 481 signals with genome-wide significant SNPs present. 
At 339 signals, at least one SNP became non-significant when conditioning on the SV, suggesting that many SV signals have been accounted for by proxy previously. SVs are challenging to detect; however, here I demonstrate the importance of including SVs in GWAS and further studies in understanding complex traits.
Item Open Access
Factors Influencing the Somatic Mutational Landscape of Ageing Squamous Epithelium
King, Charlotte
The incidences of many cancers vary substantially across the world, reflecting genetic differences between populations and exposure to environmental carcinogens. This is illustrated by keratinocyte skin cancers and oesophageal squamous cell carcinoma, which both develop from squamous epithelium, yet are remodelled during ageing by very different mutagenic processes and environmental exposures. In this thesis, I investigate the influence of cancer risk factors on the somatic mutations present in normal aged skin and oesophageal epithelium using a range of sequencing methods. I find sun-exposed facial skin from UK donors to have a 4-fold increased mutation burden and a 10-fold increase in copy number aberrant clones compared to donors from Singapore, a country with a 17-fold lower incidence of keratinocyte skin cancer. The majority of these mutations in the UK are due to ultraviolet radiation (UV) but, in Singapore, age-related signatures predominate. Mutations in TP53 are more strongly selected in epidermis from the UK, whilst those in NOTCH1 and NOTCH2 are preferentially selected in Singapore, reflecting differences in the level of competition within the tissue. A survey of mutations in UK skin across body sites reveals differences in UV signature and selection between sites. In aged oesophageal epithelium from UK donors, I observe an increase in mutations with an alcohol-associated signature with reported alcohol consumption. 
Furthermore, mutation burden increases with smoking, without a detectable change in the mutational signature, consistent with tobacco smoke increasing oesophageal cancer risk independent of its mutagenic effects. Finally, in donors over the age of 60, mutations in TP53 and FAT1 are more strongly selected, whilst those in NOTCH3 are more weakly selected, suggesting changes to levels of competition within the tissue with age. I conclude that the mutational landscapes of normal oesophagus and skin are shaped by age and environmental exposures and that this, in turn, may alter the risk of keratinocyte cancers.
Item Open Access
Normal cell signals in human cancer transcriptomes
Kildisiute, Gerda
Cancer cells arise from normal cells, and may retain transcriptional patterns present in the cell of origin. Recent advances in single-cell transcriptomics have allowed us to profile both normal and cancer cells at a single cell resolution, enabling direct comparisons between the two. Such comparisons can reveal information about which cells cancers originate from, transcriptional programs that underpin carcinogenesis, and differentiation states in cancer. The work presented in this thesis aims to study normal cell signals in human cancer, utilising single-cell transcriptomics. Chapter 1 provides an overview of transcriptomics as a whole, as well as in the context of cancer, with a focus on single-cell transcriptomics. In Chapter 2, I construct a single-cell normal reference of the fetal adrenal gland, and explore the development of the fetal adrenal medulla, from which the childhood cancer, neuroblastoma, arises. I then compare that reference to neuroblastoma single-cell and bulk transcriptomes in Chapter 3. This comparison reveals sympathoblasts as the normal correlate of neuroblastoma cancer cells, and allows identification of transcripts present in both neuroblastoma and fetal adrenal medulla, but not adult tissues, making these transcripts an attractive therapeutic target. 
Chapter 4 presents a comparison of fetal and adult references across three different organs (gut, lung and liver) to bulk and single-cell transcriptomes of adult cancers originating in these organs. This analysis assesses the contribution of fetal and adult cell type signals to cancer transcriptomes. Overall, my work delineates the transcriptional relationship between cancer and normal cells.
Item Open Access
Exploring the Population Structure, Recombination Landscape, and Pan-Genome of the Global Neisseria meningitidis Population
Macalasdair, Neil
Neisseria meningitidis is a gram-negative species of bacteria which causes meningitis, septicaemia, urethritis, and pneumonia worldwide. Infections are typically asymptomatic carriage, but those which cause disease are extremely difficult to treat, leading to a high case-fatality rate. As such, there is considerable interest in studying N. meningitidis to understand its spread, what causes development from carriage to invasive disease, and how its evolution impacts efforts to control the disease. The latter has been of particular concern in regions where there have been outbreaks, particularly the ‘meningitis belt’ that spans from West Africa to East Africa, where there is greater disease burden and periodic epidemics which can span the region. Due to difficulties in treatment, the primary method of controlling invasive meningococcal disease is vaccination. Currently, available vaccines target five of the extant serogroups of N. meningitidis, chosen through study of the serogroups most frequently found in disease. However, either the replacement of disease lineages with those of different serogroups or capsular switching within disease-associated lineages may undermine the success of mass vaccination efforts and create the need for additional campaigns. N. meningitidis specifically possesses characteristics which make vaccine escape likely and unpredictable. 
The most important are the adaptations which allow frequent homologous recombination with other Neisseria. The evolutionary consequences of this sporadic partial chromosomal recombination are not well understood, but the transfer of alleles between distant lineages, including those associated with virulence, has been observed. Another gap in our understanding of bacterial evolution is in the evolutionary effect of population structure. Obligately human-parasitic species such as N. meningitidis have a global distribution and opportunities for rapid migration, and therefore may have a complex population structure. To study these problems, I have assembled a collection of over 15,000 whole-genome sequenced N. meningitidis isolates from 70 distinct countries with isolation dates spanning over a hundred years. These data consist of a mixture of publicly available data and three collections of newly sequenced isolates. Using these data, I determine the global population structure of N. meningitidis. Subsequently, I infer phylogenetic trees for, and find patterns of recombination within, major lineages in the global population. Separately, I also infer and analyse the species-wide pan-genome. The results of these analyses indicate that N. meningitidis has a deep well of generally unsampled diversity in an extremely complex population structure which is primarily made up of a few globally distributed lineages. Within these lineages, population bottlenecks are a frequent occurrence. The 25 major lineages differ significantly in both their rates of recombination and the distribution of recombination across their genomes, but evidence suggests that most recombination occurs within N. meningitidis. In a local population, recombination generally acts to reduce the effect of deleterious mutations, although an example also exists of recombination acting in concert with positive selection. 
The pan-genome reveals the extent to which recombination can disrupt tree-like evolution, with most major lineages containing patterns of relatedness in their accessory gene content inconsistent with their whole-genome phylogenies. Trends in the pan-genome indicate that most gene gain is from other N. meningitidis isolates, but is governed primarily by evolutionary forces and not recombination rate. Together, these results demonstrate the profound complexity present in the population structure of N. meningitidis, and distinct evolutionary trends in individual lineages. This work also underscores the importance of carriage sampling and the value of a global perspective when studying a globally-distributed species. Further sampling in regions which are under-sampled and ongoing carriage surveillance will be a crucial part of any long-term efforts to successfully control the disease through vaccination.
Item Open Access
Comparative and population genomic analyses of the parasitic blood flukes
Berger, Duncan
The etiological agents of schistosomiasis are snail-vectored, waterborne, parasitic trematodes of the genus Schistosoma. Schistosomes are a considerable public health concern, infecting over 236 million people across 78 endemic nations. Schistosomiasis is especially common among children in low and middle-income countries, with 90% of infections occurring in sub-Saharan Africa. As a consequence, schistosomes are now targeted by the World Health Organization for control and ultimately elimination through mass-drug administration (MDA), reliant on the only available drug, praziquantel. In this thesis, I first present the results of the first genomic study investigating the potential impact of long-term MDA on Schistosoma mansoni populations, examining populations from within Uganda with contrasting histories of past drug pressure, and sampling both pre- and post-praziquantel treatment. 
In particular, I examine how long-term MDA has impacted the population structure of S. mansoni and whether there is evidence of ongoing drug-induced selection that could represent the appearance of praziquantel drug resistance. In the following chapter, I reexamine this dataset along with additional samples from other genomic studies and newly sequenced samples. Using this more extensive dataset I explored population structure over larger spatial scales and identified clear evidence of hybridization with the closely related species S. rodhaini. In-depth sequencing of parasite infrapopulations permitted examination of the changes in infrapopulation relatedness after praziquantel treatment. Finally, I examined variation in a recently identified putative praziquantel resistance-conferring gene. I then performed genomic characterization of livestock-infective schistosome populations, focusing on the incidence of early generation hybrids between S. bovis and S. curassoni. In the following chapter I present the results of a long-term genome assembly project, aiming to produce genome assemblies and annotations for Schistosoma and other parasitic flatworms. This resulted in 19 chromosomal-scale assemblies for 14 species of parasitic flatworm, with Iso-Seq and RNA-seq based annotations. I then performed a range of comparative genomic analyses aiming to characterize interspecies variation at both the chromosomal and gene family level. The primary aim was to define specific trends in evolution of this genus.
Item Open Access
Somatic mutagenesis in humans with deficient DNA repair
Robinson, Philip S
The accumulation of mutations in normal cells causes the development of cancer and is implicated as a potential mechanism in the physiological process of ageing. In recent years our ability to interrogate the genome of human cancers and the normal tissues from which they arise has expanded greatly. 
These studies have shown that mutations accumulate in normal tissues throughout life and that mutation rates are remarkably similar across individuals. However, the potential impact of increased somatic mutation rates on the risk of developing cancer and the process of ageing is not known. In this thesis, two inherited syndromes associated with intestinal cancer predisposition were selected to investigate the mutation burdens and mutational processes across different normal tissue types. Individuals with these syndromes have a known elevated risk of cancer which is thought to be underpinned by an increased somatic mutation rate. Chapter 3 summarises experiments that investigate somatic mutagenesis in a selection of normal tissue types from individuals with germline heterozygous mutations in the DNA polymerase genes POLE and POLD1. Chapter 4 summarises the investigation of somatic mutagenesis in normal tissues from individuals with germline MUTYH mutations. In Chapter 5 findings from the two cancer predisposition syndromes are compared and the results are placed in the broader context of intestinal cancer predisposition syndromes. Lastly, the observations from this thesis are interpreted with regard to our current understanding of the somatic mutation theory of ageing. In summary, this thesis presents insight into somatic mutagenesis in normal tissues from individuals with known cancer predisposition. The findings may have implications for our understanding of cancer risk in predisposed and non-predisposed individuals. The observation of increased somatic mutation rates in normal healthy tissues is also pertinent to our understanding of the somatic mutation theory of ageing. 
The data presented in the thesis serve as a potential proof-of-concept for the measurement of somatic mutagenesis in normal tissues to improve the care of individuals with inherited DNA repair defects.
Item Open Access
Characterisation of Plasmodium parasite sexual commitment and development
Cudini, Juliana; Cudini, Juliana [0000-0002-8420-0362]
Malaria is a devastating disease responsible for over 400,000 deaths each year. The disease is caused by a single-celled parasite of the genus Plasmodium, which establishes infection via a bite from an Anopheline mosquito. While the parasite progresses through a complex range of life stages, it is the blood stages, or the intraerythrocytic developmental cycle (IDC), that cause the large majority of harmful symptoms. During the course of the IDC, a parasite grows in size within a red blood cell until it is able to multiply itself asexually many times and burst from the cell as individual infectious units, each one then able to infect a new red blood cell and restart the cycle. This pattern of asexual reproduction and re-invasion of fresh cells allows the parasite population to swell to impressive sizes within a host. While the IDC growth cycle can keep a parasite population happily established within the host, it is not able to allow passage between hosts. Thus, as the parasite progresses through the IDC, it must make a decision. Either it can continue into another cycle of asexual growth in that host, or sexually (and terminally) differentiate into gametocytes, the transmissible form of the parasite, and thus gain an opportunity to transfer to a new host. Gametocytogenesis, the formation of these sexual forms, is therefore essential for malaria transmission, and an attractive target for transmission blocking interventions. Despite its importance, we know little about sex-specific gene expression or how the decision to become male or female is made. 
Efforts to understand gametocytogenesis have been hampered by the fact that gametocytes often represent less than 1% of the total population of parasites circulating in a host, meaning any sexual transcriptional signal is lost amidst an abundance of asexuals. Single cell RNA-sequencing has revolutionised our ability to capture rare populations, providing an ideal window into heterogeneity between parasites and developmental processes at high resolution. In this thesis, I use 10x Genomics single cell capture to sample the transcriptome of over 30,000 single cells from time points spanning the sexual developmental pathway of P. falciparum, from asexual growth, to sexual commitment, and into sexual maturity. I first use the data collected to generate a high quality reference atlas for gametocyte development. From this, I profile a number of global changes underlying sexual commitment, development, and maturity into males and females. By mixing two genetically distinct parasite strains (NF54 and 7G8), I place these findings in a larger context, describing differences in development that occur between strains of the same species. Lastly, I complete my profile of transcriptional changes underlying parasite development by exploring the localisation of the less well profiled non-coding transcripts to specific regions of the life cycle, and how they may contribute to transmission.
Item Open Access
Reconstructing Chromothriptic Chromosomes in Oesophageal Adenocarcinomas
Ijaz, Jannat
The epigenetic landscape is regulated by a myriad of factors. This regulation ranges from functional compartmentalisation of genomic sequences into topologically associating domains, to chromosome looping, to short-range promoter-enhancer interactions. The underlying genome sequence contributes to this regulation, likely at a variety of scales; however, the extent of this contribution is not fully understood. 
Chromothripsis is a localised catastrophic genome shattering event that can be used to study how the underlying genomic sequence affects this higher order structuring. Since chromothripsis tends to affect only one of the two alleles, in every cell a direct comparison can be made between the wild-type chromosome and the chromothriptic chromosome. The wild-type chromosome represents the genome sequence and structure before reshuffling and the chromothriptic derivative chromosome can be used to query the direct effects of this reshuffling. Chromothripsis has been seen in up to 32% of cases of oesophageal adenocarcinomas. Therefore, patient-derived oesophageal adenocarcinoma organoids with evidence of chromothripsis restricted to one allele were used to better understand how the genome is regulated. Complex regions of structural variation between alleles in cancer genomes coupled with subclonal variants means haplotype-aware de novo assemblies are essential for contiguous cancer genome assemblies. Our method takes haplotype blocks and assigns PacBio circular consensus sequencing reads to the appropriate allele using B-allele frequencies of single nucleotide polymorphisms and presence of structural variants. The chromosomes are then assembled separately and scaffolded using Hi-C reads, which we also haplotype resolve. This produces contiguous assemblies, even on chromosomes with over 900 structural rearrangements compared to the reference genome. This methodology has been used to reconstruct chromothriptic derivative chromosomes and the associated wild-type chromosomes in five organoid models, as well as other chromosomes with complex rearrangements. All types of structural variant have been reconstructed, other than tandem duplications which are collapsed by current assembly tools. With these cancer-specific reference assemblies, the epigenome of the chromothriptic and wild-type chromosomes can be profiled. 
Hi-C chromosome capture has been used to study topologically associating domains; ATAC-seq to study chromatin accessibility; ChIP-seq to identify CTCF binding and histone modifications (H3K27me3, H3K4me3, H3K27ac); and Iso-seq to phase long read transcripts to their respective chromosomes. There are widespread differences between the chromothriptic and wild-type chromosomes for each epigenetic mark. This indicates that the shattering of the chromosome has dramatic consequences for gene regulation, far beyond what we see when comparing two wild-type alleles of the same chromosome. It highlights that, while underlying genome sequence has a fundamental role in gene regulation, the epigenetic context of that sequence also has a profound impact. The work done to assemble these chromosomes allows for unprecedented insight into the regulatory impact of structural variation.
Item Open Access
Somatic evolution in healthy and chronically inflamed colon and skin
Ólafsson, Sigurgeir
The human body is made up of trillions of cells which cooperate to reproduce their genetic material. While all the cells are a part of a whole, each is also an individual and will selfishly give rise to a clonal expansion of cells within a tissue given the chance, even to the detriment of the organism. This thesis discusses the evolutionary forces acting on cells within the body, specifically on epithelial cells in the colon and skin. After a general introduction of the evolutionary forces acting on normal cells and the methods used to study them, Chapter 2 focuses specifically on genetic drift within the colon, where clones expand through the process of crypt fission. I apply a statistical framework called Approximate Bayesian Computation to estimate the crypt fission rate in the normal colon and in individuals with familial adenomatous polyposis (FAP). I estimate the rate of crypt fission to be one every 27 years in the normal colon and one every 13 years in FAP. 
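As a rough illustration of the Approximate Bayesian Computation idea mentioned above, the sketch below uses plain rejection sampling: draw a candidate fission rate from a prior, simulate data, and keep the candidate only if the simulation lands close to the observation. Everything in it is an assumption made for illustration (the toy per-year fission model, the flat prior, the tolerance, and the observed count); it is not the model or data used in the thesis.

```python
import random


def simulate_fissions(rate_per_year, age_years, n_crypts, rng):
    """Toy model: each crypt divides with probability rate_per_year in each year."""
    return sum(
        rng.random() < rate_per_year
        for _ in range(n_crypts)
        for _ in range(age_years)
    )


def abc_rejection(observed, age_years, n_crypts, n_draws=2000, tolerance=3, seed=0):
    """Keep prior draws whose simulated fission count is within `tolerance` of `observed`."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(n_draws):
        rate = rng.uniform(0.0, 0.2)  # flat prior on the yearly rate (assumption)
        simulated = simulate_fissions(rate, age_years, n_crypts, rng)
        if abs(simulated - observed) <= tolerance:
            accepted.append(rate)
    return accepted


# Hypothetical observation: 40 fission events across 20 crypts from a 54-year-old donor.
posterior = abc_rejection(observed=40, age_years=54, n_crypts=20)
mean_rate = sum(posterior) / len(posterior)
print(f"accepted {len(posterior)} draws; mean rate {mean_rate:.3f} per crypt per year")
```

The accepted draws approximate the posterior distribution of the rate; its reciprocal gives a "one fission every N years" figure of the kind quoted in the abstract. A real analysis would use richer summary statistics than a single count.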
In Chapter 3, I describe somatic evolution in the colon under conditions of chronic inflammation. I used whole-genome sequencing of individual colonic crypts from patients with inflammatory bowel disease (IBD) to show that the IBD-colon is characterized by a higher mutation burden and larger clonal expansions than the healthy colon. I also show that mutations in immune-related genes, including PIGR, ZC3H12A and genes in the interleukin 17 and toll-like receptor pathways, are under positive selection in the colons of IBD patients and may contribute to the disease pathogenesis. In Chapter 4, I focus on the skin. I performed whole-exome sequencing of microbiopsies of epidermis from patients with psoriasis, a second chronic inflammatory disease. In contrast to IBD, I did not find increased mutation burden and clonal spread in psoriasis, except when the skin had been treated with psoralens + UVA (PUVA) phototreatment. The selection landscape of psoriatic skin resembles that of normal skin, and mutations in NOTCH1, FAT1, TP53, PPM1D and NOTCH2 are positively selected. ZFP36L2 was the only gene found to be enriched in mutations that has not been previously reported in normal skin, but it is as yet uncertain if selection of ZFP36L2 mutant cells is a feature specific to psoriatic skin or not. Finally, Chapter 5 discusses my findings in the broader context of cancer and complex-trait genomics. I discuss how a causal relationship between somatic evolution and non-neoplastic diseases may be established and the different ways somatic evolution may affect disease progression for good or ill. 
I further discuss how to design a study to search for germline determinants of somatic evolution and the need for developing methods to enable such studies to be conducted at scale.
Item Open Access
Human intestinal cells across space and time
(2022-01-17) Elmentaite, Rasa
Cells of the human intestinal tract undergo dynamic changes from development to adulthood and in response to environmental stimuli. Recent advances in high-throughput profiling of cells using transcriptomic approaches have vastly expanded the catalogue of cells that are found across the human body. Nonetheless, a holistic view of intestinal cell diversity across distinct intestinal regions (space) and multiple life stages (time) is still lacking. The work presented in this thesis is focused on building a reference of human intestinal cell types using novel genomic approaches including single-cell RNA sequencing (scRNA-seq) and spatial transcriptomics. Importantly, we use these data to gain new insights into intestinal cell organisation and function during in utero development, homeostasis, and in rare and complex diseases. Chapter 1 introduces the advances in single-cell genomic approaches within the last decade and contrasts them with previous technologies used for cataloguing cells. Following an overview of single-cell technologies, the intestinal spatial architecture, as well as key cell types (e.g. epithelium, enteric neurons, stromal and immune cells) required for proper intestinal function, will be summarized. As intestinal surface area, innervation and immunity are established during in utero development, this chapter will also outline the principles of relevant developmental events, namely villus formation, enteric nervous system development, and lymphoid structure formation. Chapter 2 outlines the materials and methods used for the generation and analysis of single-cell and spatial data that will be discussed in the following results chapters. 
Chapter 3 focuses on deciphering the key cell players in the process of human villus formation. This chapter outlines the single-cell profiles of human embryonic and early fetal gut tissue and explores the interactions between epithelial and mesenchymal cells. In addition, these cell type profiles and cell networks are compared and contrasted with cells found in the children diagnosed with chronic inflammation of the small intestines or Crohn’s disease. This analysis reveals similarities in epithelial cell changes during development and regeneration. Chapter 4 is focused on an in-depth analysis of epithelial and neuronal lineage composition and developmental relationships across different intestinal regions and life stages. In this chapter, an integrated view of the gut is presented, encompassing cells from in utero development, childhood and adulthood, and up to 11 intestinal regions. We resolve regional differences of BEST4+ absorptive cells, implicate IgG sensing as a novel function of intestinal tuft cells and resolve differentiation of enteroendocrine cell subsets. Analysis of the epithelial cells is followed by resolving the developing enteric nervous system at the single-cell resolution and showing the patterned expression of Hirschsprung’s disease-associated genes. Chapter 5 is focused on the development of local intestinal immunity. In this chapter, the three key cell types that orchestrate lymph node and gut-associated lymphoid tissue formation in humans are described. The comparisons of these subsets with cells expanded in Crohn’s disease patients reveal re-initiation of this cellular program during inflammation to recruit and retain immune cells to the site of damage. Lastly, Chapter 6 summarises the insights gained through single-cell analysis of intestinal cell types and discusses these results in the light of recent literature on development and inflammation.