
Theses - European Bioinformatics Institute (EMBL-EBI)


Recent Submissions

Now showing 1 - 20 of 62
  • Item (Embargo)
    The evolution of gene regulatory landscapes in mammalian tissues
    Rimoldi, Martina
    Gene regulatory landscapes are highly complex and evolutionarily unstable. Understanding how genetic and regulatory variations affect gene expression evolution is paramount to gaining insight into phenotypic divergence across species. Although gene expression programs are broadly maintained during mammalian evolution, the collection of cis-regulatory elements and transcription factor (TF) binding events has diverged at different rates. In this thesis, I investigated several layers of gene regulatory divergence across mammalian species. In the first part of the thesis, I investigated the coevolution of DNA methylation patterns and transcription factor binding. Mammalian regulatory architecture evolves through widespread rewiring of TF binding events. Reportedly, TF binding is the fastest evolving layer of cis-gene regulation. However, whether DNA methylation patterns are reshaped during TF binding evolution has not been extensively studied. To fill this gap, I analyzed whole-genome bisulfite sequencing (WGBS) and matched chromatin immunoprecipitation assays followed by sequencing (ChIP-seq) of five TFs from the livers of five mammals. I first characterized DNA methylation patterns at transcription factor binding regions and found four prototypical methylation profiles that resolve alternative functional and chromatin contexts. These profiles were highly conserved across species and consistent across TFs. However, CTCF was an exception, with different methylation profiles likely underlying its various roles in both gene regulation and chromatin organization. I then investigated the relationship between DNA methylation patterns and TF binding divergence. Not only do genomic regions with evolutionary loss of binding regain their hypermethylated state, but distinct methylation profiles also follow species conservation patterns that reflect the turnover rate of the genomic context they correspond to (i.e., promoters and enhancers). 
Therefore, these results demonstrate coordinated evolution between DNA methylation patterns and transcription factor binding turnover. In the second part of my thesis, I explored how the three-dimensional organization of the genome is reshaped during mammalian evolution. Technological advancements in the field of 3D genomics now enable the investigation of spatial chromatin interactions underlying gene regulation and genome organization. Thus, in the third chapter of the thesis, I described several strategies for creating custom capture-HiC probe sets to investigate distinct aspects of gene regulatory evolution. Two designs focus on transposable elements and the rewiring of chromatin interactions that underlie their repression and evolutionary co-option into gene regulatory networks. An additional design targets one-to-one orthologous genes across five mammals and cis-regulatory elements (CREs) that have various functional and evolutionary constraints, such as deeply conserved and tissue-specific CREs, as well as CRE sequences that switch activity (promoter-enhancer switchers) either between species or between tissues. This last capture system was designed to conduct a comparative genomic study of five mammals and three tissues (brain, liver, and testis), and thus to investigate how gene regulatory landscapes differ between tissues and throughout evolution. The final results chapter of the thesis focuses on biological insights resulting from an experiment using the capture system described above targeting one-to-one orthologous genes across five mammals and three tissues. This study shows that orthologous promoters confirm established concepts, such as enhancers having more chromatin contacts with genes that have higher expression levels, and cis-regulatory sequences being more frequently distributed around 250 kbp away. However, they also exhibit distinctive, tissue-specific patterns of chromatin interactions. 
Interestingly, while the brain has the highest number of interactions per gene, the testis shows significantly fewer 3D contacts than somatic tissues and an increase in short-range interactions. I present a set of analyses that further investigate these results in the testis and their potential link to the restructuring of the genome that occurs in a large subpopulation of cells going through spermatogenesis. In addition, I provide evidence of pervasive hubs of chromatin interactions, which often result in promoter-promoter networks that connect both active and inactive genes. Finally, the investigation of the tissue-specificity of the promoter-centered chromatin organization shows only modest correlation across species, likely reflecting the evolutionary dynamics of regulatory landscapes. Therefore, I define a framework for studying how chromatin looping rewires at orthologous genes across different species.
  • Item (Open Access)
    Inferring context-specific essentiality networks using large-scale CRISPR-KO screens
    Weidemüller, Paula Helena
    Large-scale genome-wide CRISPR knockout screens, such as the ones from DepMap and Project Score, revealed that many genes are essential, i.e. required, in only a subset of cell lines. These context-essential genes offer insights into vulnerabilities of different cancer types and provide promising targets for personalized cancer therapies. However, the challenge is to systematically identify and define those context-essential genes and to understand how cellular phenotype and interaction networks are altered in these contexts. In this thesis, I present different approaches to understanding context-specific essentiality networks. I developed a Bayesian linear model called PLMCECS which identifies genes important in the context of cancer driver mutations and tissue of origin by modelling important properties of CRISPR knockout data. I validated the performance using simulated data and performed various sanity checks, in the absence of a gold standard benchmark. When analysing genome-wide CRISPR-Cas9 knockout data, I found that gene essentiality was highly variable across tissues in the context of the same cancer driver mutation. Genes essential in the context of the same cancer driver mutation shared similar functions and formed tightly connected functional networks with clusters representing functions that were required in particular tissues. Understanding cancer dependencies involves not only how context affects single-gene essentiality, but also how gene interactions shape cancer-specific essentiality landscapes. Together with collaborators, we performed a large-scale dual-knockout CRISPR-Cas9 screen to identify genetic interactions in *KRAS*-mutant colorectal cancer and triple-negative breast cancer with the aim to nominate combinations for more effective cancer therapies and to counteract tumour resistance. 
Using preliminary data, I discussed different mathematical models to score genetic interactions and highlight context-specific synthetic interactions. Lastly, I nominated promising synthetic lethal gene pairs for follow-up validation in *KRAS*-mutant colorectal cancers. Overall, this thesis contributes important knowledge on context-essential genes and genetic interactions in cancer, enhancing our understanding of cancer vulnerabilities and guiding the development of targeted therapies for improved patient outcomes.
  • Item (Open Access)
    Geometrical Models for 2D Morphometry in Bioimages
    Mandal, Soham
    From understanding how cell shape can be a biomarker in cancer diagnosis and prognosis to how change in cell shape dictates embryogenesis, morphology quantification is crucial in different areas of biology. Morphology-related measurements are extracted from segmented objects in bioimages, and their quality is therefore directly dependent on the number of pixels composing each object. In contrast, geometrical models representing the contour of objects as continuous parametric curves are free from discretisation artefacts and are therefore excellent alternative candidates for morphometry. The extraction of this kind of contour representation from bioimages is, however, tedious and poorly scalable. In this thesis, we investigate how deep learning can be leveraged to infer geometrical models directly from bioimages. We developed SplineDist, a supervised deep learning algorithm to extract geometrical models across a variety of imaging modalities. We show that it can be used as an alternative instance segmentation method with state-of-the-art results, and packaged it as a plugin for the popular microscopy image analysis platform napari to facilitate its use. We also explored the use of geometrical model parameters as direct measures of morphology, with applications to nuclear phenotyping. Finally, we developed a generative approach to sample geometrical model parameters from a known distribution and illustrate its use in the context of synthetic data generation.
  • Item (Open Access)
    Computational Discovery of Bacterial Fibrillar Adhesins and Adhesive Domains
    Monzon, Vivian
    Fibrillar adhesins are filamentous bacterial surface proteins, which can play a key role in host-pathogen interactions. They have a characteristic protein architecture with repeating domains, called stalk domains, folding into a rod-like stalk structure. An adhesive domain is positioned at the tip of the stalk. This study aims to provide a comprehensive characterisation of this protein class as well as to discover novel fibrillar adhesins and adhesive domains. Using a collection of known adhesive and stalk domains, a domain-based search for fibrillar adhesins was conducted yielding over 3,500 protein hits in the UniProt Reference Proteomes widespread across the bacterial tree of life. These proteins were called fibrillar adhesin-like (FA-like) proteins. Investigating them in-depth showed different adhesive and stalk domain combinations and distinct protein architectures between different bacterial phyla. It also resulted in the recognition of identification features for the development of a machine learning (ML)-based discovery approach. This approach was applied to the Firmicutes and Actinobacteria UniProt Reference Proteomes, detecting over 5,000 FA-like proteins, which were missed by the domain-based approach, including proteins without a known adhesive and/or stalk domain. Exploring these proteins with the focus on those lacking a known adhesive domain enabled the discovery of potential novel adhesive domain families. Using AlphaFold2, the structure of these domains was predicted to identify their potential function. The AlphaFold2 release also enabled the detection of FA-like proteins using adhesive and stalk domain structures. Here, the TMalign and Foldseek structure aligners were compared, with Foldseek showing a higher concordance with the sequence-based discovery approaches. Integrating structure features in the ML-based discovery approach improved its precision when testing it on three bacterial proteomes. 
To find out more about the function of the detected FA-like proteins, protein interaction prediction methods were tested with the focus on AlphaFold Multimer. The results showed limitations in confidently predicting a binding target. The challenges encountered were discussed, as well as an AlphaFold-induced increase in the development of alternative methods tackling these challenges.
  • Item (Open Access)
    Cancer Cartography: Mapping Cancer Evolution in Tissue Space
    Lomakin, Artem
    While somatic evolution is widely accepted as fundamental to cancer development, significant gaps in understanding persist. These gaps concern the role of cellular interactions within the tumour microenvironment (TME) and the impact of spatial constraints on cancer evolution. Emerging spatial omics technologies offer the potential to address these gaps, although their application in spatial genomics remains limited. This is particularly crucial because genetic alterations not only drive cancer evolution but also serve as an archive of its history. To address this research gap, this thesis aims to develop computational methods for analysing spatial genomics data, specifically on data generated by base-specific in situ sequencing (BaSISS), a method that enables high-resolution mapping of diverse somatic mutations across large tumour tissue sections. Bayesian algorithms designed in the study generate quantitative clonal maps, effectively tracing cancer evolution in tissue space while accommodating multiple forms of biological and technical variability. Despite inherent assumptions, the algorithm demonstrates robustness and quantitative accuracy. Complementary data types contribute to the findings by applying the Bayesian model to two multifocal breast cancers at different stages of progression. Integration of histology, immunohistochemistry (IHC), and targeted in situ gene expression allows for the phenotypic characterisation of clones in distinct microanatomical niches. Subsequent analyses across various stages of breast cancer, including carcinoma in situ, invasive cancer, and lymph node metastasis reveal clone-specific variations in proliferation, morphology, stroma, hypoxia, and immune microenvironments. In one instance involving ductal carcinoma in situ, polyclonal neoplastic expansions manifest on a macroscopic scale but remain segregated within microanatomical structures. 
In summary, this thesis establishes a robust computational framework for extracting clonal architecture from spatial genomics data. It provides a proof-of-concept that such maps, when integrated with tissue morphology and spatial phenotype data, can offer vital insights into the mechanisms driving both cancer evolution and tissue ecology.
  • Item (Open Access)
    Synthetic ground truth of biological shapes — Simulating variable Nuclear Pore Complexes for Microscopy
    Theiss, Maria
    The Nuclear Pore Complex (NPC) is the only passageway for macromolecules between the nucleus and cytoplasm and an important reference standard in microscopy: it is intrinsic to cells, has a high copy number, and is massive and stereotypically arranged. The average architecture of NPC proteins has been resolved with pseudo-atomic precision; however, observed NPC heterogeneities, such as varying diameters, elongated shapes, 9-fold symmetry, and irregular shapes, evidence a high degree of divergence from this average. Single Molecule Localization Microscopy (SMLM) images NPCs at protein-level resolution, whereupon image analysis methods study NPC variability. However, the true picture of NPC variability is unknown and the biological function of this variability is poorly understood. In quantitative image analysis experiments, it is thus difficult to distinguish intrinsically high SMLM noise from variability of the underlying structure. This thesis introduces a pipeline that synthesizes ground truth datasets of structurally variable NPCs based on architectural models of the true NPC to benchmark image analysis methods and to help elucidate real-life NPC variability. In this pipeline, N- or C-terminally tagged NPC proteins can be selected for single- or multi-channel 3D simulations of geometrically variable NPCs. The NPC is furthermore represented as a spring model such that arbitrary deforming forces, of freely definable magnitudes, simulate shapes that are irregular, yet sufficiently smooth. Such simulations allow one to compare image analysis methods based on the quality and quantity of data required to elucidate specific types of variability. A side-by-side comparison with real data ultimately tests hypotheses about underlying NPC variability. Two clustering approaches are compared on simulations of geometric NPC variability. 
Furthermore, synthetically replicating analyses of real NPC radii reveal that a range of simulated variability parameters can lead to previously observed results. Ultimately, this thesis highlights the need for, and offers a template for, close-to-biology simulations when ground truth is unavailable.
  • Item (Open Access)
    Computational optimisation strategies for targeted DNA sequencing using nanopores
    Weilguny, Lukas
    Long-read DNA sequencing is causing a generational shift in genome sequencing productivity and is revolutionising many aspects of biological discovery. One of the technologies behind this transformation is sequencing using nanopores that act as biosensors measuring fluctuations of an ionic current caused by traversing polynucleotides. By partially blocking the flow of ions, distinct patterns of current are generated as nucleotides pass through the narrowest constriction; these are subsequently bioinformatically demixed into nucleobase sequences. This approach for sequencing has multiple advantages, including reading native molecules that can be up to megabases long without prior amplification, and the ability to analyse data in real time. This allows for ultra-fast time-to-answer investigations, and also manipulation of ongoing experiments by processing the generated data and feeding instructions back to the sequencing machine; a unique feature not realisable with any other existing sequencing technology. Currently, a few methods exist that make use of this by analysing nascent fragments of DNA and testing whether they originate from predetermined areas of a genome marked as targets. Otherwise, the voltage bias across the membrane, into which the nanopores are embedded, can be reversed and the molecule will be ejected from the nanopore, allowing another one to be sequenced in its place with the aim of saving time and enriching for on-target sequences. This process has been termed ‘adaptive sampling’; but prior to the work presented in my thesis, these methods were based entirely on static instructions. In other words, target regions of a genome were defined before an experiment and remained constant throughout. With this thesis, I extend adaptive sampling such that decisions about molecule rejections can incorporate information obtained during an ongoing experiment. 
One of the main motivations is to address the current need for oversampling genomes many-fold to ensure a minimum coverage across a sequenced genome high enough for downstream analyses, which can be wasteful. Similarly, sequencing mixed samples without wasteful oversampling might lead to underrepresented or missing rare species. This thesis describes two approaches for dynamic extensions to adaptive sampling to address these issues, by implementing more versatile real-time analysis and control of sequencing experiments. The first approach is intended for resequencing experiments, where reference sequences of the studied sample are available. For this, I implement an algorithmic framework and software that generates dynamically adapting decision strategies that are continuously updated to steer an active sequencing run. More specifically, this method quantifies uncertainty at each position in a genome and for each novel DNA fragment decides whether the expected decrease in uncertainty warrants fully sequencing it. This way, sequencing can be focused on molecules from areas with the highest uncertainty, e.g. regions of low coverage, thus optimising the information gain. I illustrate the effectiveness of the method by mitigating coverage bias between and within members of a microbial mixture sample. In particular, it adapts to the differential abundances without prior knowledge about sample composition, thereby reducing the interspecies bias and effectively redistributing coverage within species. In some scenarios, the need for reference genomes poses a limitation, e.g. when sample content is unknown. In this case previous implementations are not useful, since underrepresented species cannot be targeted. A second approach I develop in the thesis aims to overcome this limitation by exploring how rejection decisions can be made while simultaneously creating a genome assembly from the fragments read so far. 
Here, the method rejects molecules from regions of genomes that are already well-represented and instead focuses on sequence that either helps to extend a species' assembly or is entirely unknown. I show how refocusing sequencing in this way is useful to increase the detection limit for rare organisms in a mixed sample, leads to higher quality assemblies, and allows for true de novo enrichment of unknown species for the first time. Overall, the data-driven approaches to targeted sequencing with nanopores that I have created expand the applicability of adaptive sampling and could be applied to many other sequencing scenarios. The resulting reduction in the time-to-answer or increased information gain might be critical in clinical settings or for pathogen surveillance.
  • Item (Open Access)
    Structural studies on enzyme active sites
    Riziotis, Ioannis
    Enzymes catalyse a huge variety of biochemical reactions, and often the same function might evolve independently in different organisms. This PhD project zooms into the heart of biological catalysis, the active site, aiming to structurally characterise catalytic residues and their geometric disposition, which facilitates catalysis. A curated dataset of mechanisms and catalytic residue annotations for ∼ 1000 enzyme families (the *Mechanism and Catalytic Site Atlas* – *M-CSA*), integrated with structural data from the *Protein Data Bank*, led to the development of a tailor-made programmatic framework (CSA-3D), which implements new and modified algorithms and metrics used to perform all analyses herein. The work consists of two major parts, one looking into the conformation of the active site within homologous enzymes, and one exploring active site commonalities in functionally convergent and divergent enzymes. The first part is a study on catalytic residue conformational variation as captured in snapshots in enzyme crystal structures. The task is to explore active site flexibility and assess its importance as an intrinsic and essential property of enzymes. Through dynamic active site superposition, structural variability is captured at single-residue level, and geometrical changes driven by ligand binding or mutations are explored. It is shown that active sites exhibit different degrees of inherent flexibility, with the extent of this flexibility often depending on the role of each residue during catalysis. Moreover, the data suggest that ∼ 2/3 of active sites are flexible, although in half of those, flexibility is only observed in the side chains. The goal here is to better understand catalysis as enzymes evolve new functions and bind different substrates. The second part defines the term “catalytic modules”: structurally similar residue arrangements performing a defined function, that may recur in unrelated enzymes. 
After exploring and reviewing 3D templates as tools to identify functional sites in enzymes, again, M-CSA data was used to generate a template library representing compact residue clusters. A fuzzy template-template search identified and catalogued conserved and convergent “modules”, that were characterised in terms of function. A large fraction of modules facilitate metal binding, and some interact with co-factors. Often those modules are the outcome of convergent evolution. A smaller number of convergent modules perform a well-defined catalytic role, such as the catalytic triads (i.e. Ser-His-Asp/Cys-His-Asp) and the saccharide-cleaving Asp/Glu triad. Furthermore, it is shown that enzymes of divergent function retain regions of their active site unaltered during evolution. The ultimate goal of this PhD is to define paradigms of structural variation, and to identify common 3D modules in observed active sites. This work is potentially relevant to the design of novel enzymes and understanding the key structural components governing catalysis in the active site.
  • Item (Open Access)
    The skin microbiome in health and atopic dermatitis
    Saheb Kashaf, Sara
    The skin, as the body’s outermost layer of cells, plays a crucial dual role in protecting against foreign pathogens while providing a habitat for commensal microbes. Despite the skin’s harsh conditions, which include desiccation, acidity, and scarce nutrients, the skin hosts a diverse community of bacteria, fungi, and viruses. Prior work associating fluctuations in the skin microbiome with health and disease has been constrained by our limited understanding of the skin microbiome composition and functions. One way to characterise skin microbial diversity is through metagenomics. A previous investigation of the skin microbiome found that more than half of the sequenced skin metagenomic reads did not align to reference genomes, complicating the analysis of skin metagenomic datasets. To address this issue, we combined bacterial cultivation and metagenomic sequencing to create the Skin Microbial Genome Collection (SMGC), the most comprehensive catalogue of prokaryotic, eukaryotic, and viral genomes from the skin. The SMGC allows for the classification of a median of 85% of skin metagenomic sequencing reads, providing a comprehensive view of skin microbial diversity. Using the SMGC, we investigated the skin microbiome in atopic dermatitis, a prevalent inflammatory skin condition characterized by recurring episodes of red, itchy, and swollen skin. Atopic dermatitis flares have been associated with the proliferation of various staphylococcal species, with only S. aureus strains cultured from atopic dermatitis inducing inflammation in a mouse model. Our extensive genomic survey of the skin microbiome in atopic dermatitis, supported by cultured isolates from the same samples, identified Staphylococcus strains and genomic loci associated with higher disease severity. Our work also showed that the Staphylococcus strains found in atopic dermatitis are influenced by factors such as geography and strain sharing within households. 
Additionally, our examination of the mobilome of multiple Staphylococcus species colonising the same individuals revealed widespread inter-species transfer of genetic material, highlighting the fluid nature of staphylococcal genetic composition. In conclusion, our work shows how novel genomic approaches and the integration of sequencing data can be used to characterise the skin microbiome at an unparalleled resolution, allowing for new insights into how skin microbes vary in health and disease.
  • Item (Open Access)
    Network approaches for data-driven reconstruction of intracellular signalling
    Barker, Charles [0000-0002-5223-6838]
    Intra-cellular signalling determines how cells process information. Through the integration of diverse chemical and physical stimuli, cells can enact transcription, among other changes, to modulate growth, fate and survival. It is through the dysregulation of such processes that many diseases, including cancer, originate. For many years, our study of signalling processes has been based on discrete ‘pathways’, characterised mostly by small-scale studies. However, as more system-wide data becomes available, it is becoming increasingly obvious that intra-cellular signalling is more like a dense and inter-connected network, with intense and functional cross-talk between pathways. In my PhD project, I utilised both new ‘omics’ data and a plethora of network-based approaches to guide our understanding of various signalling-related disease contexts. Networks provide a convenient framework with which to explicitly control the level of prior knowledge required to understand complex ‘omics’ data. Escaping the study bias that has so far dominated our characterisation of intra-cellular signalling is vital if we are to progress in our understanding of complex cellular behaviours.
  • Item (Open Access)
    Japanese courage: a genetic analysis of complex traits in medaka fish and humans
    Brettell, Ian
    This thesis primarily explores how an individual's genes interact with the genes of their social companions to create differences in behaviour, using the Japanese medaka fish as a model organism. Chapter 1 sets out the introduction to the diverse topics covered in this thesis, and is followed by five substantive chapters. Chapter 2 describes several genomic characteristics of the Medaka Inbred Kiyosu-Karlsruhe (MIKK) panel, which comprises 80 inbred lines of medaka that were bred from a wild population from the city of Kiyosu, southern Japan. In this chapter I plot the inbreeding trajectory of the MIKK panel and analyse a number of genomic characteristics relevant to its utility for the genetic mapping of complex traits, including: the panel's evolutionary relationship with other previously-established inbred medaka strains; the degree of homozygosity in the inbred lines; the rate of linkage disequilibrium decay across the panel; and the genomic repeats and structural variation present in their genomes. In Chapter 3, I use a custom behavioural assay to characterise and classify bold-shy behaviours in 5 previously-established inbred medaka strains. I describe the assay, assess its robustness against confounding factors, and apply a hidden Markov model (HMM) to classify the fishes' behaviours across a spectrum of boldness-shyness based on the individuals' distance and angle of travel between pre-defined time intervals. I describe how the strains differ in their behaviours over the course of the assay (a "direct genetic effect") and how the behaviour of a single "reference" strain (*iCab*) differs in the presence of different strains (an "indirect genetic effect"). In Chapter 4, I describe the bioinformatic processes and genetic association models that I used to map the variants associated with variation in the period of somite development, based on an F2-cross between the southern Japanese *iCab* strain and the northern Japanese Kaga strain. 
In Chapter 5, I explain how I ran the custom behavioural assay described in Chapter 3 over the MIKK panel to identify lines that diverge in both their own bold-shy behaviours (the direct genetic effect) and the extent to which they transmit those behaviours onto their tank partners (the indirect genetic effect). I then describe how I used those divergent lines as the parental lines in a multi-way F2-cross to identify the genetic variants associated with both direct and indirect genetic effects. Finally, in Chapter 6, I turn to humans to compare and rank all complex traits in the GWAS Catalog based on the extent to which their associated alleles vary across global populations, using the Fixation Index (FST) as a metric, and the 1000 Genomes dataset as a sample of global genetic variation. I set out the bioinformatic pipelines used to process the data, present the distributions of FST for trait-associated alleles across the genome, and use the Kolmogorov-Smirnov test to compare the distributions of FST across different traits. Altogether, this thesis describes some of the genomic characteristics of both medaka fish and humans, and how those variations relate to differences in complex traits, with a particular focus on the genetic causes of adaptive behaviours and the transmission of those behaviours onto one's social companions.
  • Item (Open Access)
    Genome-graph based genotyping with applications to highly variable genes in P. falciparum
    Letcher, Brice [0000-0002-8921-6005]
    Analysing genetic variation in pathogen genomes is key to understanding their biology, evolution and epidemiology. Typically, this is done by assembling one arbitrary genome, defined as the ‘reference’, and describing other samples as deviations from it. However, this model breaks down in highly diverse regions of the genome, where sample sequencing reads, differing too substantially from the reference, fail to map. This ‘bias against diversity’, due to using a single reference, naturally affects genomic regions under pressure to diversify: this includes the human MHC, a motivating example for the field, and vaccine candidate genes in the malaria parasite Plasmodium falciparum (Pf), the motivating example for this thesis. The growing solution in the field is to build graph-based models, representing not one, but a population of genomes from a species, and using these genome graphs as a substrate for read mapping and genotyping instead. In this thesis, I develop new algorithms and data structures for genotyping highly variable genes in Pf, using genome graphs. In Chapters 2 and 3, I describe methods and code to analyse variation in highly diverse regions of the genome, across many genomes in a cohort. In doing so I provide two main advances on the state-of-the-art: jointly studying small (SNPs, indels) and large (indels >50bp) variation, and accessing variation on multiple references. I validate these methods using different datasets, ultimately genotyping SNPs on diverged haplotypes in two highly variable Pf genes, including one gene from the major methodological and biological motivation for this thesis, paralogs DBLMSP and DBLMSP2 (DBs). In Chapter 4, I study the DBs in greater detail, using a global dataset of >3,500 Pf genomes. Building a genome-graph-based pipeline, I recover variation inaccessible to single-reference based approaches (GATK), before uncovering new biology. 
Expressing each diverged DB haplotype as a mosaic of the others, I find widespread recombination in each gene, and also discover recent evidence of gene conversion between the two genes. In summary, this thesis provides both methodological advances in genome-graph based genotyping, and practical insights into the genome biology of an important human pathogen.
  • ItemOpen Access
    Advances in Time-to-Event Analysis: Big Data Applications in Cancer Risk Prediction
    Jung, Alexander Wolfgang
    The digital transformation of health care provides new opportunities to study disease and gain unprecedented insights into the underlying biology. With the wealth of data generated, new statistical challenges arise. This thesis will address some of them, with a particular focus on Time-to-Event analysis. The Cox hazard model, one of the most widely used statistical tools in biomedicine, is extended to analyses for large-scale and high-dimensional data sets. Built on recent machine learning frameworks, the approach scales readily to big data settings. The method is extensively evaluated in simulation and case studies, showcasing its applicability to different data modalities, ranging from hospital admission episodes to histopathological images of tumour resections. The motivating application of this thesis is electronic health records (EHR), collections of various interlinked data at an individual level. With many countries starting to implement national health data resources, methods that can cope with these datasets become paramount. In particular, cancers could benefit significantly from these developments. The lifetime risk of developing a malignancy is around 50%. However, the associated risks are not equally distributed, with large differences between individuals. Hence, being able to utilise the data available in EHR could potentially help to stratify individuals by their risk profiles and screen or even intervene early. The proposed method is used to build a predictive model for 20 primary cancer sites based on clinical disease histories, basic health parameters, and family histories covering 6.7 million Danish individuals over a combined 193 million life years. The obtained risk score can predict cancer incidence across most organ sites. Further, the information could potentially be used to create cohorts with similar efficiency while screening earlier, creating the possibility for risk-targeted screening programs. 
Additionally, the obtained result could also be transferred between health care systems, as shown here between Denmark and the UK. Taken together, the thesis establishes a method to analyse the extensive amounts of data being generated nowadays and evaluates the potential these data sources can have in the context of cancer risk.
  • ItemOpen Access
    Multiomic Investigation of the Human Gut Toward Insight into Childhood Inflammatory Bowel Disease Pathogenesis
    Edgar, Rachel
    Inflammatory bowel disease (IBD) is an abnormal immune response to the gut microbiome causing chronic debilitating pain in genetically susceptible people. The development of effective therapies for this condition is hampered by a poor understanding of its complex aetiology. Progress in this area will require identification of both the immune and epithelial contributions to disease in susceptible individuals as well as detangling the contributions of different pathways and genes, all in a cell type specific context. This thesis summarizes my research into the gene regulatory architecture of IBD in intestinal epithelial cells. First I assess the utility of intestinal epithelial organoids, a model of the human intestine. This system is well established and has been shown to be comparable to in vivo intestinal tissue in many ways. However, my study revealed DNA methylation (DNAm) is generally lost and becomes more variable the longer organoids are cultured. My findings suggest a major impact of prolonged culturing on global organoid DNAm profiles, highlighting the importance of considering time in culture in organoid experiments. Next I explore DNAm, gene expression and genotype to: define a possible IBD biomarker, explain previously established IBD genotype associations, and identify potential mechanistic candidates for functional follow-up. In both primary epithelial cells and organoids I found widespread differential DNAm with IBD. Of the CpGs differentially DNAm in IBD, changes in the major histocompatibility complex class I (MHC I) signaling pathway are particularly interesting. The MHC I pathway represents a promising candidate mechanism of IBD as it is involved in the communication between cells and the immune system. Finally, I follow up on the association of MHC I and IBD in individual intestinal epithelial cells. As in the bulk expression data, I also see increased expression of MHC I in IBD compared to controls. 
This is consistent across cell types, but interestingly MHC I activity is higher in villus tip cells compared to crypt cells regardless of diagnosis, suggesting activity could be related to exposure to luminal microbiota. I also show MHC I can be activated successfully in organoids with proinflammatory cytokine stimulation, meaning organoids could be a useful model to study MHC I pathway function in IBD. In this thesis, I demonstrate that organoids are a valuable model of the human gut, and use them to identify MHC I activation as a potential mechanism of pediatric IBD pathogenesis.
  • ItemOpen Access
    Recovery and quality estimation of metagenomic assembled genomes of eukaryotes
    Saary, Paul
    Microorganisms are found in virtually all environments. While the majority of microorganisms are often prokaryotes (by biomass there are suspected to be 6 times more prokaryotes than fungi globally; Bar-On et al., 2018), eukaryotes are also important constituents of microbial communities (e.g. on the human skin). Shotgun metagenomics can provide access to the combined genetic information of a community in a culture-independent manner. Using de novo assembly and post-processing methods, it can lead to the generation of so-called metagenomic assembled genomes (MAGs), which provide contextualised access to genes of these elusive organisms. As most studies have focused on the recovery of prokaryotic MAGs, I first examined the limitations and gaps of existing tools with respect to their ability to recover microbial eukaryotic genomes. This led to the development of EukCC, a software to estimate the completeness and contamination of eukaryotic MAGs. Evaluation of this software showed that it is well suited for the fully automated recovery of eukaryotic MAGs. This workflow was applied to datasets obtained from several biomes to recover eukaryotic MAGs. However, I also demonstrate that eukaryotic MAGs can sometimes be fragmented and developed a merging algorithm to create merged MAGs (mMAGs). With the implementation of this algorithm in EukCC 2, I search a large number of datasets from MGnify for known and novel eukaryotic MAGs. Completing the eukaryotic MAG recovery process, I discuss how species-level dereplication for eukaryotes can be approached based on the genetic information alone. In summary, I show that recovery of eukaryotic MAGs is challenging but can be largely automated, allowing large-scale studies to be performed.
  • ItemOpen Access
    Prognostic biomarker discovery from omics data using machine learning approaches
    Garg, Manik
    Prognostic biomarkers can help clinicians identify high-risk patients to administer appropriate therapies. The omics data from patient samples can be used to find such biomarkers. Moreover, molecular biomarkers can help to understand disease mechanisms. Machine learning and related data analysis methods can be applied to omics data for reliable determination of these biomarkers. In this thesis, I aimed to identify reproducible prognostic biomarkers in three different diseases: Alzheimer’s disease (AD), primary melanoma and coronavirus disease 2019 (COVID-19). In my second chapter, I contributed to the discovery of a new metabolic signature, to predict which patients with mild-cognitive impairment would later develop AD. As there were thousands of un-annotated metabolic features potentially differentiating such patients, to overcome the problem of over-training, we shortlisted only those features associated with genetic variants. After annotating the top-ranking features, we hypothesized about their potential links to AD using extensive literature research. My contribution to this chapter mostly was employing the machine learning methods to refine and optimise the signature. In my third chapter, I analysed RNA-sequencing data derived from primary melanomas resected from stage IIB-IIIC (7th edition of the American Joint Committee on Cancer staging manual) patients embedded within a prospective phase III randomized clinical trial. This led to the identification of a 121-gene-based expression signature that can predict poor outcomes and stratify patients with high absolute risk of death in 5 years. The prognostic ability of this signature was validated in 4 independent datasets. 
I also found that patients with a higher signature score (indicating poor outcomes) had fewer tumour-infiltrating lymphocytes, suggesting that these patients have immune-cell-deprived tumours that warrant specialized treatment strategies. In my fourth chapter, I performed a meta-analysis of 10 published single-cell RNA sequencing datasets to validate reported immune response changes associated with COVID-19 progression. I found that 8 out of 20 published immune response changes were consistently reproducible across multiple datasets. In addition, in my fifth chapter, I studied how immune response changes with COVID-19 severity in recovered patients. Here, I showed that while patients recovered from mild/moderate COVID-19 infection had their immune responses close to healthy individuals within 27-47 days of symptom onset, those recovered from severe/critical infection still had their immune response affected. The described results are published in four peer-reviewed journal papers and specific contributions are highlighted in the text. Overall, the work presented in this thesis demonstrates various approaches in which omics data can be used for prognostic biomarker discovery. Further, this work also contributes to the knowledge of current prognostic biomarkers in AD, primary melanoma and COVID-19.
  • ItemOpen Access
    Computational analyses of blood cells: somatic evolution and morphology
    Almeida, José
    Haematopoiesis is the complex process of blood cell production, carried out by haematopoietic stem cells in the bone marrow. Over the human lifespan, it generates quadrillions of cells which participate in essential bodily functions such as immunity (white blood cells), oxygen distribution (red blood cells) and blood coagulation (platelets). However, several conditions affect this process: a lack of nutrients such as iron or folate can lead to alterations to the production of red and white blood cells, whereas the accumulation of somatic mutations contributes to the formation of genetically distinct subpopulations of haematopoietic stem cells (or clones) whose behaviour is partly determined by mutations conferring growth advantages (also known as driver mutations). When mutations in cancer-associated genes are present in the blood of healthy individuals this is known as clonal haematopoiesis, a benign condition that can progress to blood cancers such as myelodysplastic syndromes, characterized by an excess of abnormally developed (or dysplastic) cells in the bone marrow and in the blood. Blood cancers such as these are usually diagnosed by trained experts using a number of complementary analyses, including the inspection of blood cell morphology (cytomorphology), which can have high inter-individual variance, and differential cell counts under the microscope. In this work, I have studied how the haematopoietic system and blood are altered in two distinct settings: the evolution of somatic mutations in healthy individuals and the cytomorphological alterations to blood cells in myelodysplastic syndromes. Firstly, I studied the somatic evolution of the haematopoietic system in healthy individuals using longitudinal sequencing and single-cell-derived colonies. 
With simulations, I developed a computational model and applied it to a cohort of elderly individuals with clonal haematopoiesis and showed i) how driver genetics determine the growth rate of different clones, ii) how clones appear consistently through life (with few exceptions), iii) that clones decelerate due to an increasingly competitive clonal landscape and iv) how mutations which confer greater growth advantages are associated with a greater risk of developing haematological malignancies. Secondly, I used a cohort of digitised whole blood smears from healthy individuals and individuals with anaemia or myelodysplastic syndromes to study how each condition leads to cytomorphological alterations through computational methods. To do this, I i) developed and implemented methods for the high-throughput detection of blood cells from digitised whole blood slides, ii) developed a cellular characterisation protocol that captures morphological features relevant for the prediction of clinically-relevant conditions and the presence of specific mutations in myelodysplastic syndromes and iii) described novel associations between blood cell phenotype and anaemia or myelodysplastic syndrome subtypes. By studying how haematopoietic stem cells evolve in the human body, I contributed to the understanding of early cancer development and to the broader field of human somatic evolution, and by quantitatively studying alterations to cytomorphology in clinically-relevant conditions, I showed how this can reveal novel blood cell phenotypes which can aid in diagnosis and prognosis. Both projects offer different perspectives into haematopoiesis as a dynamic process and contribute to clinical research by highlighting connections between somatic evolution in healthy individuals and cancer onset, and by discovering previously unknown cellular phenotype-disease associations.
  • ItemOpen Access
    Analysis of the understudied parts of the phospho-signalome using machine learning methods
    Petursson, Borgthor
    In order to make decisions and respond appropriately to external stimuli, cells rely on an intricate signalling system. One of the most important and best studied components of this signalling system is the phospho-signalling network. Phosphorylation relays information through adding phosphoryl groups onto substrates such as lipids or proteins, which in turn leads to changes in substrate function. Crucial components of this system include kinases, which phosphorylate the substrate molecule, and phosphatases, which remove the phosphoryl group from the substrate. To date, even though >100K phosphoproteins have been identified through high throughput experiments, the vast majority of phosphosites are of unknown function, while over a third of kinases have no known substrate (Needham et al., 2019). Furthermore, there is a large study bias in our current knowledge, demonstrated by a disproportionate number of interactions between highly cited kinases and substrates (Invergo and Beltrao, 2018). The vast understudied signalling space combined with this study bias makes it difficult to understand the general principles underpinning cell signalling regulation and stresses the need to research the phosphoproteomic signalling system in an unbiased manner. In this thesis the central aim is to use data-driven and unbiased approaches to study the human phosphoproteomic signalling network. The first chapter describes a project where I co-developed a machine learning model to predict signed kinase-kinase regulatory circuits based on kinase specificities and high throughput phosphoproteomics and transcriptomic data. The network was validated using independent high throughput data and used to identify novel kinase-kinase regulatory interactions. This project was done in collaboration with Brandon Invergo, a postdoc in Pedro Beltrao’s research group. 
In the second chapter I expand upon work done in the first chapter. I used various predictors, such as co-expression, kinase specificities and different variables characterising kinase-substrate potential target phosphosites, to predict kinase-substrate relationships and their signs. I then used independent experimental kinase-substrate predictions to validate the predictions and identify high confidence kinase-substrate relationships. I then combined the kinase-substrate predictions with the kinase-kinase regulatory circuits to identify condition-specific signalling networks. To enable easy use of my method and networks and analyses of phosphoproteomics data by non-expert users, I also developed the SELPHI2 server, where the user can extract biological insight from their datasets. SELPHI2 presents a substantial improvement upon the SELPHI server, which was developed in 2015 by my supervisor, Evangelia Petsalaki. Thirdly, to study the architecture of human cell signalling networks at a whole-cell level and address the limited predictive power of the current models of cell signalling such as pathways found in KEGG (Kanehisa, 2019), Reactome (Jassal et al., 2020) and WikiPathways (Slenter et al., 2018), the third chapter aims to identify signalling modules from phosphoproteomic data. These data-extracted modules were found to have a greater predictive power for independent data sets in terms of number of significant enrichments. Furthermore, we sought to predict the probability of module co-membership from predictors such as membership within data-driven modules, co-phosphorylation and co-expression. In summary, the work presented here seeks to explore the understudied phospho-signalling systems through system-wide prediction of kinase-substrate regulation and the identification of phospho-signalling modules through data-driven means.
  • ItemOpen Access
    Modelling the structural, functional and phenotypic consequences of protein coding mutations
    Dunham, Alistair
    Proteins are integral to all cellular processes and underpin the function of all extant organisms, meaning variants impacting them are a primary cause of phenotypic variation. Protein coding variants are a key area of study in biology, with relevance from structural and molecular biology to population genetics. They are also medically important, impacting inherited genetic diseases, cancer and response to pathogens. Recent advances in high-throughput experimental techniques have opened the door to many new approaches in biology, and protein variants are no exception. Deep mutational scanning experiments exhaustively measure the fitness of variants in a protein, which gives us more experimentally validated mutational consequence measurements than ever before. Such advances, together with ever larger sequence and structure databases, have created an opportunity to apply large scale analyses to coding variation, studying the effect on protein structure, function and phenotype. In this thesis I perform three large scale variant analyses. First, I use the consequences of variation to learn about protein structure and function. I compile a dataset from 28 deep mutational scanning studies, covering 6291 positions in 30 proteins, and use the consequences of mutation at each position to define a mutational landscape. I show rich biophysical relationships in this landscape and identify functionally distinct positional subtypes of each amino acid. In the second analysis, I explore genotype to phenotype prediction using a dataset of 1011 S. cerevisiae strains, with genotypes, transcriptomics, proteomics and measured phenotypes, and comprehensive gene deletions in four strains. I show knowledge-based models of mutational consequences and pathway function can be used to associate genes with phenotypes and predict growth phenotypes across 34 growth conditions. 
However, genetic background is found to have a large effect on variant consequences, to such an extent that the same deletion can be highly significant in one strain and have no effect in another. Finally, I analyse computational variant effect prediction, benchmarking current predictors using deep mutational scanning data. I then develop a new end-to-end deep convolutional neural network predictor that predicts consequences directly from sequence and structure and show it improves on current methods. Together these projects advance our knowledge of protein coding variation and enhance our capacity to link variation to impacts on structure, function and phenotype.
  • ItemOpen Access
    Statistical analysis of short template switch mutations in human genomes
    Walker, Conor
    Many complex rearrangements arise in human genomes through template switch mutations, which occur during DNA replication when there is a transient polymerase switch to an alternate template nearby in three-dimensional space. These variants are routinely captured at kilobase-to-megabase scales in studies of genetic variation by using methods for structural variant calling. However, the genomic and evolutionary consequences of replication-based rearrangements remain poorly characterised at smaller scales, where they are usually interpreted as complex clusters of independent substitutions, insertions and deletions. In this thesis, I describe statistical methods for the detection and interpretation of short template switch mutations within DNA sequence data. I then use my methods to explore small-scale template switch mutagenesis within human genome evolution, population variation, and cancer. I show that small-scale, replication-based rearrangements are a ubiquitous feature of the germline and somatic mutational landscape of human genomes.