
Gene misexpression in humans



Abstract

Gene misexpression is the unexpected transcription of a gene in a context where it is usually inactive. In humans, gene misexpression has been implicated in cancers and several rare diseases, including congenital limb malformations, congenital hyperinsulinism and monogenic severe childhood obesity. Studies of these diseases have identified gain-of-function genetic variants that lead to gene misexpression via specific mechanisms. In chapter 1, I review these mechanisms and conclude that, despite the known pathological consequences of gene misexpression in specific rare diseases, we have a limited understanding of its wider prevalence and genetic mechanisms in humans. To address these gaps, I conducted the first genome-wide analysis of gene misexpression using bulk RNA-seq data from 4,568 blood donors from the INTERVAL study.

In chapter 2, I aimed to assess the prevalence of gene misexpression across genes and samples, identify the characteristics of genes that tolerate misexpression and establish the types of genetic variants that are associated with it. I found that while individual misexpression events occurred rarely, in aggregate they were found in almost all samples and over half of inactive genes. Misexpressed genes were less likely to be implicated in developmental disorders, had fewer enhancers and were shorter than non-misexpressed genes. Using 2,821 paired whole-genome and RNA sequencing samples from the INTERVAL study, I observed that misexpression events were enriched for nearby rare structural variants (SVs) but not rare single nucleotide variants (SNVs) or indels. Rare deletions, duplications and inversions were all enriched within the gene body of the misexpressed gene, but only deletions were enriched outside the gene body. For each SV class, different variant consequences were enriched among misexpression events, and SVs showed a stronger enrichment proximal to co-misexpressed genes than to individually misexpressed genes.

In chapter 3, I aimed to establish the genetic mechanisms by which gene misexpression occurs in healthy individuals. Using 2,821 paired RNA and whole-genome sequencing samples in the INTERVAL cohort, I identified 105 rare SVs associated with consistent gene misexpression. Misexpression-associated SVs were longer, affected more conserved regions and were predicted to be more deleterious than “control” SVs. They were also enriched within distinct regulatory regions such as promoters, transcribed regions and enhancers. Next, I examined specific SVs to understand how they resulted in misexpression. First, I identified 12 deletions and 5 tandem duplications associated with highly aberrant read counts and coverage across the intergenic regions upstream of the misexpressed gene. I concluded that these variants resulted in transcriptional readthrough leading to downstream gene misexpression. Interestingly, 2 deletion carriers also showed evidence of intergenic splicing involving the misexpressed genes RTP1 and OTP. Next, I identified 7 deletions and 3 tandem duplications that created fusion transcripts, including a duplication associated with misexpression of myosin heavy chain 1 (MYH1), a gene normally expressed exclusively in skeletal muscle. Finally, I identified a partial gene inversion leading to transcript-specific misexpression of ROPN1B. Overall, I identified mechanisms for 42% of misexpression events with a nearby associated SV. Surprisingly, I did not observe misexpression occurring via 3D chromatin architecture rearrangements as reported in certain rare diseases.

Chapter 4 was completed during a three-month placement at AstraZeneca. In this chapter, I developed a machine learning model to classify short open reading frames (sORFs) as either functional or non-functional. To do so, I curated a database of 2,112 sORFs with and without experimental evidence of being functional, drawn from a variety of experimental screening approaches. I created an extensive feature annotation pipeline that can label sORFs with 88 features spanning DNA, RNA, regulatory and peptide properties. Next, I trained different model architectures to identify functional sORFs among all sORF types, as well as models for specific sORF types including upstream ORFs (uORFs) and lncRNA-ORFs. Overall, the performance of the best models was modest, with median AUROCs of 0.71, 0.74 and 0.60 for all sORFs, uORFs and lncRNA-ORFs, respectively. Analysing the contribution of different features to model performance highlighted the importance of GC content, RNA secondary structure, and peptide size and structure in functional sORFs. Finally, I validated my models on a set of experimentally validated sORFs. Performance was again modest: 62.5% of these sORFs were correctly classified as functional, whereas only 32% of a set of background sORFs that are likely to be non-functional were classified as functional. This model will improve the functional interpretation of sORFs and could aid the identification of novel sORF therapeutic targets.

In this thesis, I have developed gene misexpression as a novel type of transcriptomic outlier analysis and conducted a genome-wide characterization of the gene misexpression landscape. In doing so, I have extended our understanding of how genetic variants influence gene expression. Future studies that catalogue and interpret the effects of rare SVs in population cohorts should take into account that these variants can induce misexpression outside the rare disease context.


Date

2024-08-21

Advisors

Davenport, Emma

Qualification

Doctor of Philosophy (PhD)

Awarding Institution

University of Cambridge

Rights and licensing

Except where otherwise noted, this item's license is described as All rights reserved