Sequencing in Isolation: Next-generation sequencing studies in founder populations
Although common variants are routinely assayed in populations, rare mutations and copy-number variants are understudied contributors to the aetiology of complex traits. Isolated populations hold the promise of increased power to detect associations with rare and low-frequency variants that have drifted up in frequency due to founder events and geographical isolation. Population-specific imputation reference panels and very low-depth whole-genome sequencing have been proposed as ways to boost power in next-generation association studies while keeping sequencing costs low. The aim of this work is to leverage the wealth of sequencing data generated as part of the HELIC project to study the allelic architecture of complex phenotypes and identify sequence variants associated with traits of medical relevance. We develop METACARPA, a method that meta-analyses summary statistics from genome-wide association studies. We establish a robust pipeline for the imputation and refinement of 1x whole-genome sequencing data, as well as a quality control and association pipeline for cohort-wide high-depth sequencing. We examine variant selection and weighting methods for genome-wide burden testing of rare variants, and write several tools for the visualisation of single-point and aggregated association results. Finally, we develop UN-CNVc, a fast copy-number variant caller optimised for population-wide sequencing data. Applying METACARPA to a 4-way multi-array and multi-cohort analysis of the HELIC array data led to the discovery of, among others, two lipid-associated loci, including the cardioprotective low-frequency variant rs145556679. In our cohorts, 1x data provided access to more than 100,000 low-frequency variants not discovered using an imputed chip design, and allowed us to replicate a burden of low-frequency and rare cardioprotective variants in the APOC3 gene.
We discover burdens of rare regulatory and coding variants that are independent of known common-variant associations at known loci, such as in the ADIPOQ gene for adiponectin or GGT1 for gamma-glutamyltransferase, as well as novel associations driven entirely by rare variants, such as that of the FAM189B gene with triglycerides. We describe two complex gene deletions, called using UN-CNVc, that influence serum levels of the protein products of the affected genes.