
Methods for Determining the Genetic Causes of Rare Diseases





Greene, Daniel, John


Thanks to the affordability of DNA sequencing, hundreds of thousands of individuals with rare disorders are undergoing whole-genome sequencing in an effort to reveal novel disease aetiologies, increase our understanding of biological processes and improve patient care. However, the power to discover the genetic causes of many unexplained rare diseases is hindered by a paucity of cases with a shared molecular aetiology. This thesis presents research into statistical and computational methods for determining the genetic causes of rare diseases. Methods described herein treat important aspects of the nature of rare diseases, including genetic and phenotypic heterogeneity, phenotypes involving multiple organ systems, Mendelian modes of inheritance and the incorporation of complex prior information such as model organism phenotypes and evolutionary conservation.

The complex nature of rare disease phenotypes and the need to aggregate patient data across many centres have led to the adoption of the Human Phenotype Ontology (HPO) as a means of coding patient phenotypes. The HPO provides a standardised vocabulary and captures relationships between disease features. I developed a suite of software packages dubbed 'ontologyX' in order to simplify analysis and visualisation of such ontologically encoded data, and to enable them to be incorporated into complex analysis methods. An important aspect of the analysis of ontological data is quantifying the semantic similarity between ontologically annotated entities, which is implemented in the ontologyX software. We employed this functionality in a phenotypic similarity regression framework, 'SimReg', which models the relationship between the ontologically encoded phenotypes of individuals and rare variation in a given genomic locus. It does so by evaluating support for a model under which the probability that a person carries rare alleles in a locus depends on the similarity between the person's ontologically encoded phenotype and a latent characteristic phenotype which can be inferred from data. A probability of association is computed by comparing this model with a baseline model under which carrying rare alleles is independent of phenotype, allowing prioritisation of candidate loci for involvement in disease with respect to a heterogeneous collection of disease phenotypes.
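As an illustration of what semantic similarity between ontologically annotated entities can look like, the following Python sketch scores two sets of terms in a toy ontology. It is not the ontologyX implementation: the term names are hypothetical rather than real HPO identifiers, and the measure used here (Jaccard overlap of ancestor sets, combined by a symmetric best-match average) is an assumption chosen for brevity.

```python
# Toy ontology: each term maps to its direct parents.
# Term names are hypothetical and are not real HPO identifiers.
PARENTS = {
    "root": [],
    "blood": ["root"],
    "platelet": ["blood"],
    "bleeding": ["blood"],
    "skeletal": ["root"],
}


def ancestors(term):
    """Return the term together with all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def term_similarity(a, b):
    """Jaccard overlap of the two terms' ancestor sets (an assumed measure)."""
    anc_a, anc_b = ancestors(a), ancestors(b)
    return len(anc_a & anc_b) / len(anc_a | anc_b)


def phenotype_similarity(terms_x, terms_y):
    """Symmetric best-match average between two non-empty sets of terms."""
    def one_way(src, dst):
        return sum(max(term_similarity(s, d) for d in dst) for s in src) / len(src)

    return 0.5 * (one_way(terms_x, terms_y) + one_way(terms_y, terms_x))
```

Under this toy measure, a patient annotated with "platelet" and "bleeding" scores higher against a characteristic phenotype of "platelet" than against one of "skeletal", because the former share more of the ontology's structure.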

SimReg includes a sophisticated treatment of HPO-coded phenotypic data but dichotomises the genetic data at a locus. Therefore, we developed an additional method, 'BeviMed', standing for Bayesian Evaluation of Variant Involvement in Mendelian Disease, which evaluates the evidence of association between allele configurations across rare variants within a genomic locus and a case/control label. It infers the probability of association and, conditional on association, the probability of each mode of inheritance and the probability of involvement of each variant. Inference is performed through a Bayesian comparison of multiple models: under a baseline model, disease risk is independent of allele configuration at the given rare variant sites; under an alternate model, disease risk depends on the configuration of alleles, a latent partition of variants into pathogenic and non-pathogenic groups, and a mode of inheritance. The method can be used to analyse a dataset comprising thousands of individuals genotyped at hundreds of rare variant sites in a fraction of a second, making it much faster than competing methods and facilitating genome-wide application.
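The idea of turning a Bayesian comparison between a baseline and an alternate model into a probability of association can be sketched in Python as follows. This is a deliberately simplified stand-in and not the BeviMed model: it assumes a single carrier/non-carrier split, Beta(1, 1) priors on disease risk and a hypothetical prior probability of association of 0.01, and it omits the latent partition of variants and the mode of inheritance altogether.

```python
from math import exp, lgamma, log


def log_beta(a, b):
    """Logarithm of the Beta function."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def log_marginal(cases, controls, a=1.0, b=1.0):
    """Log marginal likelihood of the case/control labels in one group,
    integrating a shared disease risk over a Beta(a, b) prior (assumed)."""
    return log_beta(a + cases, b + controls) - log_beta(a, b)


def prob_association(carrier_cases, carrier_controls,
                     other_cases, other_controls, prior=0.01):
    """Posterior probability of the alternate model (prior is hypothetical)."""
    # Baseline: one disease risk shared by carriers and non-carriers.
    log_m0 = log_marginal(carrier_cases + other_cases,
                          carrier_controls + other_controls)
    # Alternate: carriers and non-carriers have separate disease risks.
    log_m1 = (log_marginal(carrier_cases, carrier_controls)
              + log_marginal(other_cases, other_controls))
    # Posterior log odds = log Bayes factor + prior log odds.
    log_odds = (log_m1 - log_m0) + log(prior / (1.0 - prior))
    # Numerically stable logistic transform.
    if log_odds >= 0:
        return 1.0 / (1.0 + exp(-log_odds))
    e = exp(log_odds)
    return e / (1.0 + e)
```

For example, `prob_association(5, 0, 5, 990)` describes a locus where all five carriers are cases, while `prob_association(1, 99, 9, 891)` describes one where carriers and non-carriers have the same case rate; the first yields a posterior close to one and the second a posterior below the prior.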





Turro, Ernest
Richardson, Sylvia
Ouwehand, Willem


Bayesian statistics, Rare diseases, Ontologies, Genetic diseases


Doctor of Philosophy (PhD)

Awarding Institution

University of Cambridge