Elucidating the mechanistic impact of single nucleotide variants in model organisms

Wagih, Omar

Repository URI

https://www.repository.cam.ac.uk/handle/1810/271713

Repository DOI

https://doi.org/10.17863/CAM.18705

Files

Thesis (34.41 MB)

Type

Thesis

Authors

Wagih, Omar

Abstract

Understanding how genetic variation propagates to differences in phenotypes between individuals is an ongoing challenge in genetics. Genome-wide association studies have allowed for the identification of many trait-associated genomic loci. However, they are limited by their inability to explain the altered cellular mechanisms. Genetic variation can drive disease by altering a range of mechanisms, including signalling networks, transcription factor (TF) binding, and protein folding. Understanding the impact of variants on such processes has key implications for therapeutics and drug development. This thesis aims to utilise computational predictors to shed light on how cellular mechanisms are altered in the context of genetic variation and to better understand how they drive both molecular and organism-level phenotypes.
Many binding events in the cell are mediated by short sequence motifs. The ability to discover these underlying rules of binding could greatly aid our understanding of variant impact. Kinase–substrate phosphorylation is one of the most prominent post-translational modifications (PTMs) and is mediated by such motifs. We first describe a computational method which utilises interaction and phosphorylation data to predict the sequence preferences of kinases. Our method was applied to 57% of human kinases, capturing both well-characterised and novel kinase specificities. We experimentally validate the predictions for four understudied kinases to show that the predicted models closely resemble true specificities. We further demonstrate that this method can be applied to different organisms and can be used for other phospho-recognition domains. The described approach allows an extended repertoire of sequence specificities to be generated, particularly in organisms for which little data is available.
TF-DNA binding is another mechanism driven by sequence motifs, which is key for the tight regulation of gene expression and can be greatly altered by genetic variation. We have comprehensively benchmarked current methods used to predict non-coding variant effects on TF-DNA binding by employing over 20,000 compiled allele-specific ChIP-seq variants across 94 TFs. We show that machine learning-based approaches significantly outperform more rudimentary methods such as the position weight matrix. We further note that models for many TFs with distinct binding specificities were unable to accurately assess the impact of variants. For these TFs, we explore alternative mechanisms underlying TF binding, such as methylation, co-operative binding, and DNA shape, that could account for the poor performance. Our results demonstrate the complexity of predicting non-coding variant effects and the importance of incorporating alternative mechanisms into models.
Finally, we describe a comprehensive effort to compile and benchmark state-of-the-art sequence- and structure-based predictors of mutational consequences and to predict the effect of coding and non-coding variants in the reference genomes of human, yeast, and E. coli. Predicted mechanisms include the impact on protein stability, interaction interfaces, and PTMs. These variant effects are provided through mutfunc, a fast and intuitive web tool with which users can interactively explore pre-computed mechanistic variant impact predictions. We validate computed predictions by analysing known pathogenic disease variants and provide mechanistic hypotheses for causal variants of unknown function. We further use our predictions to devise gene-level functionality scores in human and yeast individuals, which we then use to uncover novel gene-phenotype associations.
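The position weight matrix (PWM) baseline named in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the count matrix, pseudocount, uniform background, forward-strand-only scan, and all function names below are assumptions chosen for illustration. The idea is to score the reference and alternate sequences with a log-odds PWM and report the change in the best-scoring window.

```python
import math

# Hypothetical count matrix for a 4-bp motif (consensus TGAC); not real TF data.
COUNTS = [
    {"A": 0, "C": 0, "G": 0, "T": 10},
    {"A": 0, "C": 0, "G": 10, "T": 0},
    {"A": 10, "C": 0, "G": 0, "T": 0},
    {"A": 0, "C": 10, "G": 0, "T": 0},
]


def build_pwm(counts, pseudocount=1.0, background=0.25):
    """Convert per-position base counts into log2 odds against a uniform background."""
    pwm = []
    for column in counts:
        total = sum(column.values()) + 4 * pseudocount
        pwm.append({
            base: math.log2(((column[base] + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pwm


def score(pwm, window):
    """Sum the log-odds of each base in a window the same width as the PWM."""
    return sum(column[base] for column, base in zip(pwm, window))


def best_window_score(pwm, seq):
    """Best PWM score over all forward-strand windows (assumption: no reverse strand)."""
    width = len(pwm)
    return max(score(pwm, seq[i:i + width]) for i in range(len(seq) - width + 1))


def variant_delta(pwm, ref_seq, pos, alt):
    """Score change (alt minus ref) for a single-nucleotide substitution at 0-based pos."""
    alt_seq = ref_seq[:pos] + alt + ref_seq[pos + 1:]
    return best_window_score(pwm, alt_seq) - best_window_score(pwm, ref_seq)


PWM = build_pwm(COUNTS)

# A G>T change inside the TGAC site lowers the best score, so the delta is negative.
delta = variant_delta(PWM, "AATGACAA", 3, "T")
print(f"delta score: {delta:.2f}")
```

A negative delta suggests the variant weakens the predicted binding site. Because a PWM treats positions independently, a score like this cannot capture effects such as co-operative binding or DNA shape.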

Date

2018-01-31

Advisors

Beltrao, Pedro

Keywords

Bioinformatics, genomics, mutation, SNP, yeast, human, E. coli, predictions

Qualification

Doctor of Philosophy (PhD)

Awarding Institution

University of Cambridge

Collections

Theses - European Bioinformatics Institute (EMBL-EBI)