Theses - European Bioinformatics Institute (EMBL-EBI)
Recent Submissions
Item Open Access
Towards using organoids in functional genomics – A single cell characterisation of mouse gastruloids
Rosen, Leah
Gastrulation is the stage of development during which pluripotent cells differentiate and commit to one of the three main germ layers. While many of the key signalling pathways are well-known, the epigenetic processes underlying this lineage commitment remain poorly understood in mammals. Large-scale perturbation studies are necessary to functionally study the epigenetics of mouse gastrulation. While it has thus far not been feasible to perform such studies at scale in vivo, organoid models of mouse gastrulation are now making them possible. This thesis concerns the characterisation of an organoid model of mouse gastrulation called gastruloids, as a model for such perturbation studies. Gastruloids are grown by aggregating around 300 mESCs for 48 hrs, followed by a 24-hr treatment with a WNT agonist which stimulates the gastruloids to produce all three main germ lineages. However, to use gastruloids in perturbation screens, in particular those with single cell readout, it is important to understand the total gene expression in each of a gastruloid’s cells, as well as how much heterogeneity there is between gastruloids. Using single cell RNA-sequencing of individual mouse gastruloids grown in different conditions and by different researchers between day 3 and day 5 of differentiation (roughly corresponding to E6.5-E8.75), this thesis shows substantial inter-organoid and inter-experiment heterogeneity.
This heterogeneity in differentiation behaviour is important when using organoids as a model to study the mechanisms regulating gene expression changes in early development, and to that end this thesis introduces a computational method to explicitly model the inter-individual null distribution.

Item Open Access
Contrastive representation learning for bioimage quantification
Hugger, Johannes
Deep learning has enabled unprecedented progress towards automating the analysis and quantification of large-scale, high-resolution imaging data. However, the majority of current deep learning systems for bioimage analysis is trained with manual annotations, leading to limitations in their generalization capabilities. To further automate exploration and quantification in imaging volumes, it is essential to identify inductive biases that enable the extraction of useful information, especially in settings where annotations are sparse or entirely absent. A promising approach to tackle this problem is self-supervised representation learning, which utilizes supervision from known priors or the data itself to tune the parameters of deep neural networks. In particular, self-supervised contrastive learning has shown remarkable success in domains such as vision and language. In this thesis, I extend and adapt contrastive learning techniques to address problems in bioimage analysis and beyond. I start by discussing representation learning, biological priors, and how to integrate them through contrastive learning, including data augmentations and similarity relationships. Further, I introduce the necessary technical background, including a more detailed discussion of contrastive joint embedding methods, the InfoNCE loss and its connection to mutual information estimation. The incorporation of priors through similarity relationships motivates the problem of leveraging multiple similar data samples for representation learning.
As the first contribution, a principled generalization of the InfoNCE objective is proposed, called InfoNCE*p*, that can leverage multiple data pairs at once and is consistent with the noise contrastive estimation framework. I examine its theoretical properties, including its connection to mutual information maximization, and demonstrate in experiments its utility on the task of cell-type prediction in a dataset of spatial cell graphs. In the second contribution, the problem of supervised classification is addressed, which is a common final step in many data-driven problems after an initial self-supervised pre-training procedure. Specifically, I combine soft targets with InfoNCE, resulting in a new supervised learning objective which is dubbed SoftInfoNCE. The new loss function can be interpreted as a temperature-scaled energy cross-entropy, and I discuss its hard positive and hard negative mining properties. The conducted experiments demonstrate on two domains its superior performance over soft target and standard cross-entropy as well as InfoNCE. The third and final contribution is a novel self-supervised method for learning representations of cellular shape and texture from volume electron microscopy (EM) data. The method is applied to an EM volume of Platynereis dumerilii, and enables a visually consistent grouping of cells in the learned MorphoFeatures embedding space. Further, when MorphoFeatures are combined with features from spatial neighbours, tissues and organs can be retrieved. I specifically focus on shape embeddings, and describe the developed contrastive learning framework including the design of geometric augmentations and the networks used to process shapes.
The complete pipeline can be seen as a stepping stone towards an automated exploration of large EM volumes.

Item Open Access
Characterising the evolutionary dynamics of hypermutated tumours using single-cell sequencing
Kalyva, Maria
DNA mismatch repair deficiency (MMRD) is caused by the inactivation of the mismatch repair pathway, which corrects DNA replication-associated errors. MMRD results in an elevated rate of point mutations and indels, called hypermutation, which leads to a high burden of neoantigens and increased tumour immunogenicity. As a result, hypermutated cancers show the highest response rates to immune checkpoint blockade (ICB), a type of immunotherapy that has revolutionised the treatment landscape for multiple cancer types. Yet, durable responses are only observed in a subset of patients. Therefore, advancing our understanding of the molecular mechanisms underpinning variable responses to ICB is critical. Previous clinical and preclinical studies of MMRD tumours showed that the clonality rather than the burden of mutations determines the variable responses to ICB observed in the clinic. Given that the clonality of mutations is determined by the evolutionary dynamics and patterns of intra-tumour heterogeneity, it is paramount to elucidate the clonal dynamics of MMRD tumours. Moreover, genomic analyses of MMRD tumour evolution have relied on bulk whole-genome sequencing (WGS), which is limited in its ability to infer the clonal structure and dynamics of tumours due to the impossibility of assigning mutations to subclones and limited sensitivity for subclonal mutation detection. Thus, the evolutionary trajectories and the timing of the onset of the disease remain unknown for MMRD cancers. Recently, the advent of single-cell WGS methodologies allowed for the assignment of mutations to individual cells, making the reconstruction of single-cell lineage histories possible.
As a first contribution in my PhD, I have reconstructed the phylogenetic history of eight patient-derived organoids with MMRD at single-cell resolution using single-cell whole-genome sequencing. I found that the evolution of MMRD tumours is characterised by the accumulation of tens of thousands of clonal mutations over decades, which precedes a phase of rapid clonal expansion. In addition, I showed that MMRD tumours exhibit high levels of genomic heterogeneity, due to the increased rate of subclonal point mutations and indels they accumulate upon clonal expansion of the most recent common ancestor. As a second contribution, I present a computational framework, termed ClonalSim, for inferring the timing of clonal expansions and the growth rate of tumours using single-cell phylogenies inferred from somatic mutations. I applied ClonalSim to two patient-derived organoids of MMRD cancers to infer the timing of clonal expansion and the fitness of cancer cells. First, I determined that MMRD deactivation happens during childhood. Secondly, I predicted the age of onset to be more than a decade or 7 years before diagnosis. As a third contribution, I leveraged single-cell RNA sequencing atlases to study the degree to which cancer cells in human tumours dedifferentiate to a more primitive “stem-like” state, which is a known hallmark of cancer. Since quantitative evidence for this phenomenon is still lacking, I investigated the dedifferentiation stage in colorectal MMRD tumours and then expanded my analysis to 10,000 bulk RNA sequencing data sets from pediatric and common adult tumours.
My results indicate that cancer cells do not fully revert to an embryonic transcriptional state, as their genome-wide expression is closer to expression profiles of post-natal cells than those of gastrulation and embryonic cell types.

Item Embargo
The evolution of gene regulatory landscapes in mammalian tissues
Rimoldi, Martina
Gene regulatory landscapes are highly complex and evolutionarily unstable. Understanding how genetic and regulatory variations affect gene expression evolution is paramount to gaining insight into phenotypic divergence across species. Although gene expression programs are broadly maintained during mammalian evolution, the collection of cis-regulatory elements and transcription factor (TF) binding events has diverged at different rates. In this thesis, I investigated several layers of gene regulatory divergence across mammalian species. In the first part of the thesis, I investigated the coevolution of DNA methylation patterns and transcription factor binding. Mammalian regulatory architecture evolves through widespread rewiring of TF binding events. Reportedly, TF binding is the fastest evolving layer of cis-gene regulation. However, whether DNA methylation patterns are reshaped during TF binding evolution has not been extensively studied. To fill this gap, I analyzed whole-genome bisulfite sequencing (WGBS) and matched chromatin immunoprecipitation assays followed by sequencing (ChIP-seq) of five TFs from the livers of five mammals. I first characterized DNA methylation patterns at transcription factor binding regions and found four prototypical methylation profiles that resolve alternative functional and chromatin contexts. These profiles were highly conserved across species and consistent across TFs. However, CTCF was an exception, with different methylation profiles likely underlying its various roles in both gene regulation and chromatin organization.
I then investigated the relationship between DNA methylation patterns and TF binding divergence. Not only do genomic regions with evolutionary loss of binding regain their hypermethylated state, but distinct methylation profiles also follow species conservation patterns that reflect the turnover rate of the genomic context they correspond to (i.e., promoters and enhancers). Therefore, these results demonstrate coordinated evolution between DNA methylation patterns and transcription factor binding turnover. In the second part of my thesis, I explored how the three-dimensional organization of the genome is reshaped during mammalian evolution. Technological advancements in the field of 3D genomics now enable the investigation of spatial chromatin interactions underlying gene regulation and genome organization. Thus, in the third chapter of the thesis, I described several strategies for creating custom capture-HiC probe sets to investigate distinct aspects of gene regulatory evolution. Two designs focus on transposable elements and the rewiring of chromatin interactions that underlie their repression and evolutionary co-option into gene regulatory networks. An additional design targets one-to-one orthologous genes across five mammals and cis-regulatory elements (CREs) that have various functional and evolutionary constraints, such as deeply conserved and tissue-specific CREs, as well as CRE sequences that switch activity (promoter-enhancer switchers) either between species or between tissues. This last capture system was designed to conduct a comparative genomic study of five mammals and three tissues (brain, liver, and testis), in order to investigate how gene regulatory landscapes differ between tissues and throughout evolution. The final results chapter of the thesis focuses on biological insights resulting from an experiment using the capture system described above targeting one-to-one orthologous genes across five mammals and three tissues.
This study shows that orthologous promoters confirm established concepts, such as enhancers having more chromatin contacts with genes that have higher expression levels, and cis-regulatory sequences being more frequently distributed around 250 kbp away. However, they also exhibit distinctive, tissue-specific patterns of chromatin interactions. Interestingly, while the brain has the highest number of interactions per gene, the testis shows significantly fewer 3D contacts than somatic tissues and an increase in short-range interactions. I present a set of analyses that further investigate these results in the testis and their potential link to the restructuring of the genome that occurs in a large subpopulation of cells going through spermatogenesis. In addition, I provide evidence of pervasive hubs of chromatin interactions, which often result in promoter-promoter networks that connect both active and inactive genes together. Finally, the investigation of the tissue-specificity of the promoter-centered chromatin organization shows only modest correlation across species, likely reflecting the evolutionary dynamics of regulatory landscapes. Therefore, I define a framework for studying how chromatin looping rewires at orthologous genes across different species.

Item Open Access
Inferring context-specific essentiality networks using large-scale CRISPR-KO screens
Weidemüller, Paula Helena
Large-scale genome-wide CRISPR knockout screens, such as the ones from DepMap and Project Score, revealed that a lot of genes are essential, i.e. required, in only a subset of cell lines. These context-essential genes offer insights into vulnerabilities of different cancer types and provide promising targets for personalized cancer therapies. However, the challenge is to systematically identify and define those context-essential genes and to understand how cellular phenotype and interaction networks are altered in these contexts.
In this thesis, I present different approaches to understanding context-specific essentiality networks. I developed a Bayesian linear model called PLMCECS which identifies genes important in the context of cancer driver mutations and tissue of origin by modelling important properties of CRISPR knockout data. I validated the performance using simulated data and performed various sanity checks, in the absence of a gold standard benchmark. When analysing genome-wide CRISPR-Cas9 knockout data, I found that gene essentiality was highly variable across tissues in the context of the same cancer driver mutation. Genes essential in the context of the same cancer driver mutation shared similar functions and formed tightly connected functional networks with clusters representing functions that were required in particular tissues. Understanding cancer dependencies involves not only how context affects single-gene essentiality, but also how gene interactions shape cancer-specific essentiality landscapes. Together with collaborators, we performed a large-scale dual-knockout CRISPR-Cas9 screen to identify genetic interactions in *KRAS*-mutant colorectal cancer and triple-negative breast cancer with the aim to nominate combinations for more effective cancer therapies and to counteract tumour resistance. Using preliminary data, I discussed different mathematical models to score genetic interactions and highlight context-specific synthetic interactions. Lastly, I nominated promising synthetic lethal gene pairs for follow-up validation in *KRAS*-mutant colorectal cancers.
Overall, this thesis contributes important knowledge on context-essential genes and genetic interactions in cancer, enhancing our understanding of cancer vulnerabilities and guiding the development of targeted therapies for improved patient outcomes.

Item Open Access
Geometrical Models for 2D Morphometry in Bioimages
Mandal, Soham
From understanding how cell shape can be a biomarker in cancer diagnosis and prognosis to how change in cell shape dictates embryogenesis, morphology quantification is crucial in different areas of biology. Morphology-related measurements are extracted from segmented objects in bioimages, and their quality is therefore directly dependent on the number of pixels composing each object. In contrast, geometrical models representing the contour of objects as continuous parametric curves are free from discretisation artefacts and are therefore excellent alternative candidates for morphometry. The extraction of this kind of contour representation from bioimages is however tedious and poorly scalable. In this thesis, we investigate how deep learning can be leveraged to infer geometrical models directly from bioimages. We developed SplineDist, a supervised deep learning algorithm to extract geometrical models across a variety of imaging modalities. We show that it can be used as an alternative instance segmentation method with state-of-the-art results, and packaged it as a plugin for the popular microscopy image analysis platform napari to facilitate its use. We also explored the use of geometrical model parameters as direct measures of morphology, with applications to nuclear phenotyping.
Finally, we developed a generative approach to sample geometrical model parameters from a known distribution and illustrate its use in the context of synthetic data generation.

Item Open Access
Computational Discovery of Bacterial Fibrillar Adhesins and Adhesive Domains
Monzon, Vivian
Fibrillar adhesins are filamentous bacterial surface proteins, which can play a key role in host-pathogen interactions. They have a characteristic protein architecture with repeating domains, called stalk domains, folding into a rod-like stalk structure. An adhesive domain is positioned at the tip of the stalk. This study aims to provide a comprehensive characterisation of this protein class as well as to discover novel fibrillar adhesins and adhesive domains. Using a collection of known adhesive and stalk domains, a domain-based search for fibrillar adhesins was conducted yielding over 3,500 protein hits in the UniProt Reference Proteomes widespread across the bacterial tree of life. These proteins were called fibrillar adhesin-like (FA-like) proteins. Investigating them in-depth showed different adhesive and stalk domain combinations and distinct protein architectures between different bacterial phyla. It also resulted in the recognition of identification features for the development of a machine learning (ML)-based discovery approach. This approach was applied on the Firmicutes and Actinobacteria UniProt Reference Proteomes detecting over 5,000 FA-like proteins, which were missed by the domain-based approach, including proteins without a known adhesive and/or stalk domain. Exploring these proteins with the focus on those lacking a known adhesive domain enabled the discovery of potential novel adhesive domain families. Using AlphaFold2, the structure of these domains was predicted to identify their potential function. The AlphaFold2 release also enabled the detection of FA-like proteins using adhesive and stalk domain structures.
Here, the TMalign and Foldseek structure aligners were compared, with Foldseek showing a higher concordance with the sequence-based discovery approaches. Integrating structure features in the ML-based discovery approach improved its precision when testing it on three bacterial proteomes. To find out more about the function of the detected FA-like proteins, protein interaction prediction methods were tested with the focus on AlphaFold Multimer. The results showed limitations in confidently predicting a binding target. The challenges encountered were discussed as well as an AlphaFold-induced increase in the development of alternative methods tackling these challenges.

Item Open Access
Cancer Cartography: Mapping Cancer Evolution in Tissue Space
Lomakin, Artem
While somatic evolution is widely accepted as fundamental to cancer development, significant gaps in understanding persist. These gaps concern the role of cellular interactions within the tumour microenvironment (TME) and the impact of spatial constraints on cancer evolution. Emerging spatial omics technologies offer the potential to address these gaps, although their application in spatial genomics remains limited. This is particularly crucial because genetic alterations not only drive cancer evolution but also serve as an archive of its history. To address this research gap, this thesis aims to develop computational methods for analysing spatial genomics data, specifically on data generated by base-specific in situ sequencing (BaSISS), a method that enables high-resolution mapping of diverse somatic mutations across large tumour tissue sections. Bayesian algorithms designed in the study generate quantitative clonal maps, effectively tracing cancer evolution in tissue space while accommodating multiple forms of biological and technical variability. Despite inherent assumptions, the algorithm demonstrates robustness and quantitative accuracy.
Complementary data types contribute to the findings by applying the Bayesian model to two multifocal breast cancers at different stages of progression. Integration of histology, immunohistochemistry (IHC), and targeted in situ gene expression allows for the phenotypic characterisation of clones in distinct microanatomical niches. Subsequent analyses across various stages of breast cancer, including carcinoma in situ, invasive cancer, and lymph node metastasis reveal clone-specific variations in proliferation, morphology, stroma, hypoxia, and immune microenvironments. In one instance involving ductal carcinoma in situ, polyclonal neoplastic expansions manifest on a macroscopic scale but remain segregated within microanatomical structures. In summary, this thesis establishes a robust computational framework for extracting clonal architecture from spatial genomics data. It provides a proof-of-concept that such maps, when integrated with tissue morphology and spatial phenotype data, can offer vital insights into the mechanisms driving both cancer evolution and tissue ecology.

Item Open Access
Synthetic ground truth of biological shapes — Simulating variable Nuclear Pore Complexes for Microscopy
Theiss, Maria
The Nuclear Pore Complex (NPC) is the only passageway for macromolecules between the nucleus and cytoplasm and an important reference standard in microscopy: it is intrinsic to cells, has a high copy number, it is massive and stereotypically arranged. The average architecture of NPC proteins has been resolved with pseudo-atomic precision, however, observed NPC heterogeneities, such as varying diameters, elongated shapes, 9-fold symmetry, and irregular shapes, evidence a high degree of divergence from this average. Single Molecule Localization Microscopy (SMLM) images NPCs at protein-level resolution, whereupon image analysis methods study NPC variability.
However, the true picture of NPC variability is unknown and the biological function of this variability is poorly understood. In quantitative image analysis experiments, it is thus difficult to distinguish intrinsically high SMLM noise from variability of the underlying structure. This thesis introduces a pipeline that synthesizes ground truth datasets of structurally variable NPCs based on architectural models of the true NPC to benchmark image analysis methods and to help elucidate real-life NPC variability. In this pipeline, N- or C-terminally tagged NPC proteins can be selected for single- or multi-channel 3D simulations of geometrically variable NPCs. The NPC is furthermore represented as a spring model such that arbitrary deforming forces, of freely definable magnitudes, simulate shapes that are irregular, yet sufficiently smooth. Such simulations allow one to compare image analysis methods based on the quality and quantity of data required to elucidate specific types of variability. A side-by-side comparison with real data ultimately tests hypotheses about underlying NPC variability. Two clustering approaches are compared on simulations of geometric NPC variability. Furthermore, synthetically replicating analyses of real NPC radii reveal that a range of simulated variability parameters can lead to previously observed results. Ultimately, this thesis highlights the need for, and offers a template for, close-to-biology simulations when ground truth is unavailable.

Item Open Access
Computational optimisation strategies for targeted DNA sequencing using nanopores
Weilguny, Lukas
Long-read DNA sequencing is causing a generational shift in genome sequencing productivity and is revolutionising many aspects of biological discovery. One of the technologies behind this transformation is sequencing using nanopores that act as biosensors measuring fluctuations of an ionic current caused by traversing polynucleotides.
By partially blocking the flow of ions, distinct patterns of current are generated as nucleotides pass through the narrowest constriction; these are subsequently bioinformatically demixed into nucleobase sequences. This approach for sequencing has multiple advantages, including reading native molecules that can be up to megabases long without prior amplification, and the ability to analyse data in real time. This allows for ultra-fast time-to-answer investigations, and also manipulation of ongoing experiments by processing the generated data and feeding instructions back to the sequencing machine; a unique feature not realisable with any other existing sequencing technology. Currently, a few methods exist that make use of this by analysing nascent fragments of DNA and testing whether they originate from predetermined areas of a genome marked as targets. Otherwise, the voltage bias across the membrane, into which the nanopores are embedded, can be reversed and the molecule will be ejected from the nanopore, allowing another one to be sequenced in its place with the aim of saving time and enriching for on-target sequences. This process has been termed 'adaptive sampling'; but prior to the work presented in my thesis, these methods were based entirely on static instructions. In other words, target regions of a genome were defined before an experiment and remain constant throughout. With this thesis, I extend adaptive sampling such that decisions about molecule rejections can incorporate information obtained during an ongoing experiment. One of the main motivations is to address the current need for oversampling genomes many-fold to ensure a minimum coverage across a sequenced genome high enough for downstream analyses, which can be wasteful. Similarly, sequencing mixed samples without wasteful oversampling might lead to underrepresented or missing rare species.
This thesis describes two approaches for dynamic extensions to adaptive sampling to address these issues, by implementing more versatile real-time analysis and control of sequencing experiments. The first approach is intended for resequencing experiments, where reference sequences of the studied sample are available. For this, I implement an algorithmic framework and software that generates dynamically adapting decision strategies that are continuously updated to steer an active sequencing run. More specifically, this method quantifies uncertainty at each position in a genome and for each novel DNA fragment decides whether the expected decrease in uncertainty warrants fully sequencing it. This way, sequencing can be focused on molecules from areas with the highest uncertainty, e.g. regions of low coverage, thus optimising the information gain. I illustrate the effectiveness of the method by mitigating coverage bias between and within members of a microbial mixture sample. In particular, it adapts to the differential abundances without prior knowledge about sample composition, thereby reducing the interspecies bias and effectively redistributing coverage within species. In some scenarios, the need for reference genomes poses a limitation, e.g. when sample content is unknown. In this case previous implementations are not useful, since underrepresented species cannot be targeted. A second approach I develop in the thesis aims to overcome this limitation by exploring how rejection decisions can be made while simultaneously creating a genome assembly from the fragments read so far. Here, the method rejects molecules from regions of genomes that are already well-represented and instead focuses on sequence that either helps to extend a species' assembly or is entirely unknown. 
I show how refocusing sequencing in this way is useful to increase the detection limit for rare organisms in a mixed sample, leads to higher quality assemblies, and allows for true de novo enrichment of unknown species for the first time. Overall, the data-driven approaches to targeted sequencing with nanopores that I have created expand the applicability of adaptive sampling and could be applied to many other sequencing scenarios. The resulting reduction in the time-to-answer or increased information gain might be critical in clinical settings or for pathogen surveillance.

Item Open Access
Structural studies on enzyme active sites
Riziotis, Ioannis
Enzymes catalyse a huge variety of biochemical reactions, and often the same function might evolve independently in different organisms. This PhD project zooms into the heart of biological catalysis, the active site, aiming to structurally characterise catalytic residues and their geometric disposition, which facilitates catalysis. A curated dataset of mechanisms and catalytic residue annotations for ∼ 1000 enzyme families (the *Mechanism and Catalytic Site Atlas* – *M-CSA*), integrated with structural data from the *Protein Data Bank*, led to the development of a tailor-made programmatic framework (CSA-3D), which implements new and modified algorithms and metrics used to perform all analyses herein. The work consists of two major parts, one looking into the conformation of the active site within homologous enzymes, and one exploring active site commonalities in functionally convergent and divergent enzymes. The first part is a study on catalytic residue conformational variation as captured in snapshots in enzyme crystal structures. The task is to explore active site flexibility and assess its importance as an intrinsic and essential property of enzymes.
Through dynamic active site superposition, structural variability is captured at single-residue level, and geometrical changes driven by ligand binding or mutations are explored. It is shown that active sites exhibit different degrees of inherent flexibility, with the extent of this flexibility often depending on the role of each residue during catalysis. Moreover, the data suggest that ∼ 2/3 of active sites are flexible, although in half of those, flexibility is only observed in the side chains. The goal here is to better understand catalysis as enzymes evolve new functions and bind different substrates. The second part defines the term “catalytic modules”: structurally similar residue arrangements performing a defined function, that may recur in unrelated enzymes. After exploring and reviewing 3D templates as tools to identify functional sites in enzymes, again, M-CSA data was used to generate a template library representing compact residue clusters. A fuzzy template-template search identified and catalogued conserved and convergent “modules”, that were characterised in terms of function. A large fraction of modules facilitate metal binding, and some interact with co-factors. Often those modules are the outcome of convergent evolution. A smaller number of convergent modules perform a well-defined catalytic role, such as the catalytic triads (i.e. Ser-His-Asp/Cys-His-Asp) and the saccharide-cleaving Asp/Glu triad. Furthermore, it is shown that enzymes of divergent function retain regions of their active site unaltered during evolution. The ultimate goal of this PhD is to define paradigms of structural variation, and to identify common 3D modules in observed active sites. 
This work is potentially relevant to the design of novel enzymes and to understanding the key structural components governing catalysis in the active site.Item Open Access The skin microbiome in health and atopic dermatitisSaheb Kashaf, SaraThe skin, as the body’s outermost layer of cells, plays a crucial dual role in protecting against foreign pathogens while providing a habitat for commensal microbes. Despite its harsh conditions, which include desiccation, acidity, and scarce nutrients, the skin hosts a diverse community of bacteria, fungi, and viruses. Prior work associating fluctuations in the skin microbiome with health and disease has been constrained by our limited understanding of the skin microbiome’s composition and functions. One way to characterise skin microbial diversity is through metagenomics. A previous investigation of the skin microbiome found that more than half of the sequenced skin metagenomic reads did not align to reference genomes, complicating the analysis of skin metagenomic datasets. To address this issue, we combined bacterial cultivation and metagenomic sequencing to create the Skin Microbial Genome Collection (SMGC), the most comprehensive catalogue of prokaryotic, eukaryotic, and viral genomes from the skin. The SMGC allows for the classification of a median of 85% of skin metagenomic sequencing reads, providing a comprehensive view of skin microbial diversity. Using the SMGC, we investigated the skin microbiome in atopic dermatitis, a prevalent inflammatory skin condition characterized by recurring episodes of red, itchy, and swollen skin. Atopic dermatitis flares have been associated with the proliferation of various staphylococcal species, with only S. aureus strains cultured from atopic dermatitis inducing inflammation in a mouse model. 
Our extensive genomic survey of the skin microbiome in atopic dermatitis, supported by cultured isolates from the same samples, identified Staphylococcus strains and genomic loci associated with higher disease severity. Our work also showed that the Staphylococcus strains found in atopic dermatitis are influenced by factors such as geography and strain sharing within households. Additionally, our examination of the mobilome of multiple Staphylococcus species colonising the same individuals revealed widespread inter-species transfer of genetic material, highlighting the fluid nature of staphylococcal genetic composition. In conclusion, our work shows how novel genomic approaches and the integration of sequencing data can be used to characterise the skin microbiome at an unparalleled resolution, allowing for new insights into how skin microbes vary in health and disease.Item Open Access Network approaches for data-driven reconstruction of intracellular signallingBarker, Charles; Barker, Charles [0000-0002-5223-6838]Intra-cellular signalling determines how cells process information. Through the integration of diverse chemical and physical stimuli, cells can enact transcription, among other changes, to modulate growth, fate and survival. It is through the dysregulation of such processes that many diseases, including cancer, originate. For many years, our study of signalling processes has been based on discrete ‘pathways’, characterised mostly by small-scale studies. However, as more system-wide data becomes available, it is becoming increasingly obvious that intra-cellular signalling is more like a dense and inter-connected network, with intense and functional cross-talk between pathways. In my PhD project, I utilised both new ‘omics’ data and a plethora of network-based approaches to guide our understanding of various signalling-related disease contexts. 
Networks provide a convenient framework with which to explicitly control the level of prior knowledge required to understand complex ‘omics’ data. Escaping the study bias that has so far dominated our characterisation of intra-cellular signalling is vital if we are to progress in our understanding of complex cellular behaviours.Item Open Access Japanese courage: a genetic analysis of complex traits in medaka fish and humansBrettell, IanThis thesis primarily explores how an individual's genes interact with the genes of their social companions to create differences in behaviour, using the Japanese medaka fish as a model organism. Chapter 1 sets out the introduction to the diverse topics covered in this thesis, and is followed by five substantive chapters. Chapter 2 describes several genomic characteristics of the Medaka Inbred Kiyosu-Karlsruhe (MIKK) panel, which comprises 80 inbred lines of medaka that were bred from a wild population from the city of Kiyosu, southern Japan. In this chapter I plot the inbreeding trajectory of the MIKK panel and analyse a number of genomic characteristics relevant to its utility for the genetic mapping of complex traits, including: the panel's evolutionary relationship with other previously-established inbred medaka strains; the degree of homozygosity in the inbred lines; the rate of linkage disequilibrium decay across the panel; and the genomic repeats and structural variation present in their genomes. In Chapter 3, I use a custom behavioural assay to characterise and classify bold-shy behaviours in 5 previously-established inbred medaka strains. I describe the assay, assess its robustness against confounding factors, and apply a hidden Markov model (HMM) to classify the fishes' behaviours across a spectrum of boldness-shyness based on the individuals' distance and angle of travel between pre-defined time intervals. 
I describe how the strains differ in their behaviours over the course of the assay (a "direct genetic effect") and how the behaviour of a single "reference" strain (*iCab*) differs in the presence of different strains (an "indirect genetic effect"). In Chapter 4, I describe the bioinformatic processes and genetic association models that I used to map the variants associated with variation in the period of somite development, based on an F2-cross between the southern Japanese iCab strain, and the northern Japanese Kaga strain. In Chapter 5, I explain how I ran the custom behavioural assay described in Chapter 3 over the MIKK panel to identify lines that diverge in both their own bold-shy behaviours (the direct genetic effect) and the extent to which they transmit those behaviours onto their tank partners (the indirect genetic effect). I then describe how I used those divergent lines as the parental lines in a multi-way F2-cross to identify the genetic variants associated with both direct and indirect genetic effects. Finally, in Chapter 6, I turn to humans to compare and rank all complex traits in the GWAS Catalog based on the extent to which their associated alleles vary across global populations, using the Fixation Index (FST) as a metric, and the 1000 Genomes dataset as a sample of global genetic variation. I set out the bioinformatic pipelines used to process the data, present the distributions of FST for trait-associated alleles across the genome, and use the Kolmogorov-Smirnov test to compare the distributions of FST across different traits. Altogether, this thesis describes some of the genomic characteristics of both medaka fish and humans, and how those variations relate to differences in complex traits, with a particular focus on the genetic causes of adaptive behaviours and the transmission of those behaviours onto one's social companions.Item Open Access Genome-graph based genotyping with applications to highly variable genes in P. 
falciparumLetcher, Brice; Letcher, Brice [0000-0002-8921-6005]Analysing genetic variation in pathogen genomes is key to understanding their biology, evolution and epidemiology. Typically, this is done by assembling one arbitrary genome, defined as the ‘reference’, and describing other samples as deviations from it. However, this model breaks down in highly diverse regions of the genome, where sample sequencing reads, differing too substantially from the reference, fail to map. This ‘bias against diversity’, due to using a single reference, naturally affects genomic regions under pressure to diversify: this includes the human MHC, a motivating example for the field, and vaccine candidate genes in the malaria parasite Plasmodium falciparum (Pf), the motivating example for this thesis. The growing solution in the field is to build graph-based models, representing not one, but a population of genomes from a species, and using these genome graphs as a substrate for read mapping and genotyping instead. In this thesis, I develop new algorithms and data structures for genotyping highly variable genes in Pf, using genome graphs. In Chapters 2 and 3, I describe methods and code to analyse variation in highly diverse regions of the genome, across many genomes in a cohort. In doing so I provide two main advances on the state-of-the-art: jointly studying small (SNPs, indels) and large (indels >50bp) variation, and accessing variation on multiple references. I validate these methods using different datasets, ultimately genotyping SNPs on diverged haplotypes in two highly variable Pf genes, including one gene from the major methodological and biological motivation for this thesis, paralogs DBLMSP and DBLMSP2 (DBs). In Chapter 4, I study the DBs in greater detail, using a global dataset of >3,500 Pf genomes. Building a genome-graph-based pipeline, I recover variation inaccessible to single-reference based approaches (GATK), before uncovering new biology. 
Expressing each diverged DB haplotype as a mosaic of the others, I find widespread recombination in each gene, and also discover recent evidence of gene conversion between the two genes. In summary, this thesis provides both methodological advances in genome-graph based genotyping, and practical insights into the genome biology of an important human pathogen.Item Open Access Advances in Time-to-Event Analysis: Big Data Applications in Cancer Risk PredictionJung, Alexander WolfgangThe digital transformation of health care provides new opportunities to study disease and gain unprecedented insights into the underlying biology. With the wealth of data generated, new statistical challenges arise. This thesis will address some of them, with a particular focus on Time-to-Event analysis. The Cox hazard model, one of the most widely used statistical tools in biomedicine, is extended to analyses for large-scale and high-dimensional data sets. Built on recent machine learning frameworks, the approach scales readily to big data settings. The method is extensively evaluated in simulation- and case-studies, showcasing its applicability to different data modalities, ranging from hospital admission episodes to histopathological images of tumour resections. The motivating application of this thesis is electronic health records (EHR), collections of various interlinked data at an individual level. With many countries starting to implement national health data resources, methods that can cope with these datasets become paramount. In particular, cancers could benefit significantly from these developments. The lifetime risk of developing a malignancy is around 50%. However, the associated risks are not equally distributed, with large differences between individuals. Hence, being able to utilise the data available in EHR could potentially help to stratify individuals by their risk profiles and screen or even intervene early. 
The proposed method is used to build a predictive model for 20 primary cancer sites based on clinical disease histories, basic health parameters, and family histories covering 6.7 million Danish individuals over a combined 193 million life years. The obtained risk score can predict cancer incidence across most organ sites. Further, the information could potentially be used to create cohorts with similar efficiency while screening earlier, creating the possibility for risk-targeted screening programs. Additionally, the obtained result could also be transferred between health care systems, as shown here between Denmark and the UK. Taken together, the thesis establishes a method to analyse the extensive amounts of data now being generated and evaluates the potential these data sources can have in the context of cancer risk.Item Open Access Multiomic Investigation of the Human Gut Toward Insight into Childhood Inflammatory Bowel Disease PathogenesisEdgar, RachelInflammatory bowel disease (IBD) is an abnormal immune response to the gut microbiome causing chronic debilitating pain in genetically susceptible people. The development of effective therapies for this condition is hampered by a poor understanding of its complex aetiology. Progress in this area will require identification of both the immune and epithelial contributions to disease in susceptible individuals as well as detangling the contributions of different pathways and genes, all in a cell type specific context. This thesis summarizes my research into the gene regulatory architecture of IBD in intestinal epithelial cells. First I assess the utility of intestinal epithelial organoids, a model of the human intestine. This system is well established and has been shown to be comparable to in vivo intestinal tissue in many ways. However, my study revealed DNA methylation (DNAm) is generally lost and becomes more variable the longer organoids are cultured. 
My findings suggest a major impact of prolonged culturing on global organoid DNAm profiles, highlighting the importance of considering time in culture in organoid experiments. Next I explore DNAm, gene expression and genotype to: define a possible IBD biomarker, explain previously established IBD genotype associations, and identify potential mechanistic candidates for functional follow-up. In both primary epithelial cells and organoids I found widespread differential DNAm with IBD. Of the CpGs differentially methylated in IBD, changes in the major histocompatibility complex class I (MHC I) signaling pathway are particularly interesting. The MHC I pathway represents a promising candidate mechanism of IBD as it is involved in the communication between cells and the immune system. Finally, I follow up on the association of MHC I and IBD in individual intestinal epithelial cells. As in the bulk expression data, I also see increased expression of MHC I in IBD compared to controls. This is consistent across cell types, but interestingly MHC I activity is higher in villus tip cells compared to crypt cells regardless of diagnosis, suggesting activity could be related to exposure to luminal microbiota. I also show MHC I can be activated successfully in organoids with proinflammatory cytokine stimulation, meaning organoids could be a useful model to study MHC I pathway function in IBD. In this thesis, I demonstrate that organoids are a valuable model of the human gut, and use them to identify MHC I activation as a potential mechanism of pediatric IBD pathogenesis.Item Open Access Recovery and quality estimation of metagenomic assembled genomes of eukaryotesSaary, PaulMicroorganisms are found in virtually all environments, and while the majority of microorganisms are often prokaryotes (by biomass there are suspected to be 6 times more prokaryotes than fungi globally; Bar-On et al., 2018), eukaryotes are also important constituents of microbial communities (e.g. on the human skin). 
Shotgun metagenomics can provide access to the combined genetic information of a community in a culture-independent manner. Using de novo assembly and post processing methods, it can lead to the generation of so-called metagenomic assembled genomes (MAGs), which provide contextualised access to genes of these elusive organisms. As most studies have focused on the recovery of prokaryotic MAGs, I first examined the limitations and gaps of existing tools with respect to their ability to recover microbial eukaryotic genomes. This led to the development of EukCC, a software to estimate the completeness and contamination of eukaryotic MAGs. Evaluation of this software showed that it is well suited for the fully automated recovery of eukaryotic MAGs. This workflow was applied to datasets obtained from several biomes to recover eukaryotic MAGs. However, I also demonstrate that eukaryotic MAGs can sometimes be fragmented and developed a merging algorithm to create merged MAGs (mMAGs). With the implementation of this algorithm in EukCC 2, I search a large number of datasets from MGnify for known and novel eukaryotic MAGs. Completing the eukaryotic MAG recovery process, I discuss how species-level dereplication for eukaryotes can be approached based on the genetic information alone. In summary I show that recovery of eukaryotic MAGs is challenging but can be largely automated, allowing large-scale studies to be performed.Item Open Access Prognostic biomarker discovery from omics data using machine learning approachesGarg, ManikPrognostic biomarkers can help clinicians identify high-risk patients to administer appropriate therapies. The omics data from patient samples can be used to find such biomarkers. Moreover, molecular biomarkers can help to understand disease mechanisms. 
Machine learning and related data analysis methods can be applied to omics data for reliable determination of these biomarkers. In this thesis, I aimed to identify reproducible prognostic biomarkers in three different diseases: Alzheimer’s disease (AD), primary melanoma and coronavirus disease 2019 (COVID-19). In my second chapter, I contributed to the discovery of a new metabolic signature, to predict which patients with mild-cognitive impairment would later develop AD. As there were thousands of un-annotated metabolic features potentially differentiating such patients, to overcome the problem of over-training, we shortlisted only those features associated with genetic variants. After annotating the top-ranking features, we hypothesized about their potential links to AD using extensive literature research. My contribution to this chapter was mostly employing the machine learning methods to refine and optimise the signature. In my third chapter, I analysed RNA-sequencing data derived from primary melanomas resected from stage IIB-IIIC (7th edition of the American Joint Committee on Cancer staging manual) patients embedded within a prospective phase III randomized clinical trial. This led to the identification of a 121-gene-based expression signature that can predict poor outcomes and stratify patients with high absolute risk of death in 5 years. The prognostic ability of this signature was validated in 4 independent datasets. I also found that patients with higher signature score (indicating poor outcomes) had lower tumour infiltrating lymphocytes, suggesting that these patients have immune cell deprived tumours that warrant specialized treatment strategies. In my fourth chapter, I performed a meta-analysis of 10 published single-cell RNA sequencing datasets to validate the reported immune response changes associated with COVID-19 progression. 
I found that 8 out of 20 published immune response changes were consistently reproducible across multiple datasets. In addition, in my fifth chapter, I studied how immune response changes with COVID-19 severity in recovered patients. Here, I showed that while patients recovered from mild/moderate COVID-19 infection had their immune responses close to healthy individuals within 27-47 days of symptom onset, those recovered from severe/critical infection still had their immune response affected. The described results are published in four peer-reviewed journal papers and specific contributions are highlighted in the text. Overall, the work presented in this thesis demonstrates various approaches in which omics data can be used for prognostic biomarker discovery. Further, this work also contributes to the knowledge of current prognostic biomarkers in AD, primary melanoma and COVID-19.Item Open Access Computational analyses of blood cells: somatic evolution and morphologyAlmeida, JoséHaematopoiesis is the complex process of blood cell production, carried out by haematopoietic stem cells in the bone marrow. Over the human lifespan, it generates quadrillions of cells which participate in essential bodily functions such as immunity (white blood cells), oxygen distribution (red blood cells) and blood coagulation (platelets). However, several conditions affect this process --- a lack of nutrients such as iron or folate can lead to alterations to the production of red and white blood cells, whereas the accumulation of somatic mutations contributes to the formation of genetically distinct subpopulations of haematopoietic stem cells (or clones) whose behaviour is partly determined by mutations conferring growth advantages (also known as driver mutations). 
When mutations in cancer-associated genes are present in the blood of healthy individuals this is known as clonal haematopoiesis, a benign condition that can progress to blood cancers such as myelodysplastic syndromes, characterized by an excess of abnormally developed (or dysplastic) cells in the bone marrow and in the blood. Blood cancers such as these are usually diagnosed by trained experts using a number of complementary analyses, including the inspection of blood cell morphology (cytomorphology) --- which can have high inter-individual variance --- and differential cell counts under the microscope. In this work, I have studied how the haematopoietic system and blood are altered in two distinct settings --- the evolution of somatic mutations in healthy individuals and the cytomorphological alterations to blood cells in myelodysplastic syndromes. Firstly, I studied the somatic evolution of the haematopoietic system in healthy individuals using longitudinal sequencing and single-cell-derived colonies. With simulations, I developed a computational model and applied it to a cohort of elderly individuals with clonal haematopoiesis and show i) how driver genetics determine the growth rate of different clones, ii) how clones appear consistently through life (with few exceptions), iii) that clones decelerate due to an increasingly competitive clonal landscape and iv) how mutations which confer greater growth advantages are associated with a greater risk of developing haematological malignancies. Secondly, I used a cohort of digitalised whole blood smears from healthy individuals and individuals with anaemia or myelodysplastic syndromes to study how each condition leads to cytomorphological alterations through computational methods. 
To do this, i) I developed and implemented methods for the high-throughput detection of blood cells from digitized whole blood slides, ii) developed a cellular characterisation protocol that captures morphological features relevant for the prediction of clinically-relevant conditions and the presence of specific mutations in myelodysplastic syndromes and iii) described novel associations between blood cell phenotype and anaemia or myelodysplastic syndrome subtypes. By studying how haematopoietic stem cells evolve in the human body, I contributed to the understanding of early cancer development and to the broader field of human somatic evolution, and by quantitatively studying alterations to cytomorphology in clinically-relevant conditions, I showed how this can reveal novel blood cell phenotypes which can aid in diagnosis and prognosis. Both projects offer different perspectives into haematopoiesis as a dynamic process and contribute to clinical research by highlighting connections between somatic evolution in healthy individuals and cancer onset, and by discovering previously unknown cellular phenotype-disease associations.