Development of computational approaches for whole-genome sequence variation and deep phenotyping
The rare disease pulmonary arterial hypertension (PAH) results in high blood pressure in the lung caused by narrowing of lung arteries. Genes causative in PAH were discovered through family studies and very often harbour rare variants. However, the genetic cause in heritable (31%) and idiopathic (79%) PAH cases is not yet known but are speculated to be caused by rare variants. Advances in high-throughput sequencing (HTS) technologies made it possible to detect variants in 98% of the human genome. A drop in sequencing costs made it feasible to sequence 10,000 individuals including 1,250 subjects diagnosed with PAH and relatives as part of the NIHR Bioresource – Rare (BR-RD) disease study. This large cohort allows the genome-wide identification of rare variants to discover novel causative genes associated with PAH in a case-control study to advance our understanding of the underlying aetiology. In the first part of my thesis, I establish a phenotype capture system that allows research nurses to record clinical measurements and other patient related information of PAH patients recruited to the NIHR BR-RD study. The implemented extensions provide a programmatic data transfer and an automated data release pipeline for analysis ready data. The second part is dedicated to the discovery of novel disease genes in PAH. I focus on one well characterised PAH disease gene to establish variant filter strategies to enrich for rare disease causing variants. I apply these filter strategies to all known PAH disease genes and describe the phenotypic differences based on clinically relevant values. Genome-wide results from different filter strategies are tested for association with PAH. I describe the findings of the rare variant association tests and provide a detailed interrogation of two novel disease genes. The last part describes the data characteristics of variant information, available non SQL (NoSQL) implementations and evaluates the suitability and scalability of distributed compute frameworks to store and analyse population scale variation data. Based on the evaluation, I implement a variant analysis platform that incrementally merges samples, annotates variants and enables the analysis of 10,000 individuals in minutes. An incremental design for variant merging and annotation has not been described before. Using the framework, I develop a quality score to reduce technical variation and other biases. The result from the rare variant association test is compared with traditional methods.