
The Contribution of Structural Variants to 2,095 Molecular Phenotypes in 12,354 European Ancestry Individuals





Howell, Brittany 


Structural variants (SVs) are large-scale rearrangements of the genome that produce linear and spatial changes and can profoundly affect genome function. Measured in base pairs, SVs account for the majority of nucleotide variation among human genomes, and they have been linked to various diseases and traits including schizophrenia, autism and obesity.

In this thesis, whole-genome sequence data at 15X coverage were generated for 12,354 samples from the INTERVAL cohort. A combination of Genome STRiP, Lumpy, CNVnator and svtools was used to call deletions, duplications and inversions. I implemented a stringent, tiered QC procedure to minimise false positives. For deletions and duplications I trained sequential random forests on read alignment parameters, resulting in 88% sensitivity and 99% specificity for deletions, and 55% sensitivity and 92% specificity for duplications. The overall quality score was then tuned so that 90% of carrier genotypes were identical among duplicate samples. Finally, a graph-based procedure was used to collapse SVs with significant overlap in carriers and in genomic coordinates. The final callset consists of 123,801 sites, with each sample containing approximately 3,300 SVs.
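The graph-based collapsing step can be illustrated with a minimal sketch. It is not the thesis implementation: the record layout, the use of reciprocal overlap and carrier Jaccard similarity as edge criteria, and the 0.5 thresholds are all assumptions made for illustration. SVs are nodes, an edge joins two SVs of the same type that overlap sufficiently in both coordinates and carriers, and each connected component is treated as one site.

```python
from itertools import combinations


def reciprocal_overlap(a, b):
    """Overlap length as a fraction of each SV; returns the smaller fraction."""
    overlap = min(a["end"], b["end"]) - max(a["start"], b["start"])
    if overlap <= 0:
        return 0.0
    return min(overlap / (a["end"] - a["start"]),
               overlap / (b["end"] - b["start"]))


def carrier_jaccard(a, b):
    """Jaccard similarity of the two carrier sample sets."""
    union = a["carriers"] | b["carriers"]
    return len(a["carriers"] & b["carriers"]) / len(union) if union else 0.0


def collapse_svs(svs, min_overlap=0.5, min_jaccard=0.5):
    """Group SVs into connected components (thresholds are hypothetical)."""
    parent = list(range(len(svs)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(svs)), 2):
        a, b = svs[i], svs[j]
        if a["chrom"] != b["chrom"] or a["type"] != b["type"]:
            continue
        if (reciprocal_overlap(a, b) >= min_overlap
                and carrier_jaccard(a, b) >= min_jaccard):
            parent[find(i)] = find(j)  # add an edge by merging components

    clusters = {}
    for i, sv in enumerate(svs):
        clusters.setdefault(find(i), []).append(sv["id"])
    return sorted(sorted(ids) for ids in clusters.values())
```

The all-pairs loop is quadratic and only suits small examples; a callset of this size would need an interval index or a sort-and-sweep over coordinates.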

After rigorous QC, I compared the callset to similar population cohorts, including the 1000 Genomes Project and Hall-SV. The callset is sensitive, capturing 93% and 92% of common deletions from these cohorts respectively. Sensitivity is lower for duplications, where INTERVAL captures only 65% and 75% respectively. The majority of detected variants are rare: 95% have MAF < 0.01 and 49% are singletons; both figures are in line with expectations set by other similarly sized cohorts such as gnomAD-SV. Intergenic SVs are more common than all SVs affecting coding regions, and multi-gene SVs are the rarest class, as expected. SVs are well tagged by SNPs: 88%, 97%, 93% and 46% of deletions, inversions, reference MEIs and duplications have at least one SNP in high LD (r² > 0.8), suggesting that genotyping of the SVs is of high quality.
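The tagging check can be sketched as follows. This is an illustration only: it assumes genotypes are coded as 0/1/2 alternate-allele dosages and uses the squared Pearson correlation of dosages, which approximates haplotype-based r² without requiring phased data. The function names are hypothetical and the source does not state how LD was computed.

```python
def genotype_r2(x, y):
    """Squared Pearson correlation between two dosage vectors."""
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("genotype vectors must be non-empty and equal length")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic variant cannot tag anything
    return sxy * sxy / (sxx * syy)


def has_tag_snp(sv_genotypes, snp_genotypes, threshold=0.8):
    """True if any nearby SNP exceeds the r² threshold with the SV."""
    return any(genotype_r2(sv_genotypes, snp) > threshold
               for snp in snp_genotypes)
```

Because r² is squared, a SNP whose alleles are coded in the opposite direction to the SV still counts as a tag.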

I evaluated the contribution of SVs to a comprehensive range of phenotypes available in the cohort. These include blood cell traits and phenotypes relating to inflammation and immunity, including 1,348 metabolites, 92 plasma proteins and 125 full blood count traits. I modelled linear associations between single SVs and each trait, and identified 495 signals across 196 regions. After conditional analysis, I estimate that SVs have a causal role in 34 signals and are the lead variant in 54 signals. Chapter four details several examples of the contribution the SV makes to the association, describes potential genetic mechanisms, and gathers additional data such as electronic health records and gene expression information to comprehensively describe the role of the SV at the association. Finally, genome-wide significant SNPs were present at 481 signals. At 339 of these, at least one SNP became non-significant when conditioning on the SV, suggesting that many SV signals have previously been accounted for by proxy. SVs are challenging to detect; however, here I demonstrate the importance of including SVs in GWAS and in further studies aimed at understanding complex traits.
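The idea behind conditioning a SNP association on an SV can be sketched with ordinary least squares. The sketch rests on several assumptions: it uses a single SV covariate, omits other covariates and significance testing, and relies on the Frisch-Waugh-Lovell result that regressing the SV-adjusted trait on the SV-adjusted SNP gives the same SNP coefficient as a joint model. All function names are hypothetical and this is not the thesis pipeline.

```python
def _slope(y, x):
    """Least-squares slope of y on x, with an intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    if sxx < 1e-12:
        return 0.0  # no variation left in x, so no effect is estimable
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / sxx


def residualise(y, x):
    """Residuals of y after regressing out x (with an intercept)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    beta = _slope(y, x)
    return [b - mean_y - beta * (a - mean_x) for a, b in zip(x, y)]


def marginal_effect(trait, snp):
    """SNP effect on the trait without adjusting for the SV."""
    return _slope(trait, snp)


def conditional_effect(trait, snp, sv):
    """SNP effect on the trait after conditioning on the SV genotype."""
    return _slope(residualise(trait, sv), residualise(snp, sv))
```

If a SNP's marginal effect is clearly non-zero but its conditional effect shrinks towards zero, the SNP signal is consistent with being a proxy for the SV.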





Soranzo, Nicole
Hurles, Matthew


Clinical Phenotype, Genomics, GWAS, Structural Variant


Doctor of Philosophy (PhD)

Awarding Institution

University of Cambridge