Widespread dose-dependent e(cid:27)ects of RNA expression and splicing on complex diseases and traits

The resources generated by the GTEx consortium o(cid:27)er unprece-dented opportunities to advance our understanding of the biology of human traits and diseases. Here, we present an in-depth ex-amination of the phenotypic consequences of transcriptome regulation and a blueprint for the functional interpretation of genetic loci discovered by genome-wide association studies (GWAS). Across a broad set of complex traits and diseases, we (cid:28)nd widespread dose-dependent e(cid:27)ects of RNA expression and splicing, with higher impact on molecular phenotypes translating into higher impact downstream. Using colocalization and association approaches that take into account the observed allelic heterogeneity, we propose potential target genes for 47% (2,519 out of 5,385) of the GWAS loci examined. Our results demonstrate the translational relevance of the GTEx resources and highlight the need to increase their resolution and breadth to further our understanding of the genotype-phenotype link. on traits through multiple lines of evidence. We examine the importance of considering, or correcting for, false functional links attributed to GWAS loci due to neighboring but distinct causal variants. To identify predicted causal e(cid:27)ects among the complex trait associated QTLs, we conduct systematic evaluation across di(cid:27)erent methods. Furthermore, we provide guidelines for employing complementary methods to map the regulatory mechanisms underlying genetic associations with complex traits.


Introduction
In the last decade, the number of reproducible genetic associations with complex traits that have emerged from genome-wide association studies (GWAS) has substantially grown.
Many of the identied associations lie in non-coding regions of the genome, suggesting that they inuence disease pathophysiology and complex traits via gene regulatory changes.
Large-scale international eorts such as the Genotype-Tissue Expression (GTEx) Consortium have provided an atlas of the regulatory landscape of gene expression and splicing variation in a broad collection of primary human tissues (79). Nearly all protein-coding genes in the genome now have at least one local variant identied to be associated with expression and a majority also have common variants aecting splicing (FDR < 5%) (9). 2 . CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under The copyright holder for this preprint (which was not this version posted October 22, 2019. In parallel, there has been an explosive growth in the number of genetic discoveries across a large number of phenotypes, prompting the development of integrative approaches to characterize the function of GWAS ndings (1014). Nevertheless, our understanding of underlying biological mechanisms for most complex traits substantially lags behind the improved eciency of discovery of genetic associations, made possible by large-scale biobanks and GWAS meta-analyses.
One of the primary tools for functional interpretation of GWAS associations has been integrative analysis of molecular QTLs. Colocalization approaches that seek to establish shared causal variants (e.g., eCaviar (15), enloc (16), and coloc (17)), enrichment analysis (S-LDSC (18) and QTLEnrich (11)) or mediation and association methods (SMR (12), TWAS (13) and PrediXcan (19)) have provided important insights, but they are often used in isolation, and there have been limited prior assessments of power and error rates associated with each (20). Their applications often fall short of providing a comprehensive, biologically interpretable view across multiple methods, traits, and tissues or oering guidelines that are generalizable to other contexts. Thus, a comprehensive assessment of expression and splicing QTLs for their contributions to disease susceptibility and other complex traits requires the development of novel methodologies with improved resolution and interpretability.
Here, we develop novel methods, approaches, and resources that elucidate how genetic variants associated with gene expression (cis-eQTLs) or splicing (cis-sQTLs) contribute to, or mediate, the functional mechanisms underlying a wide array of complex diseases and quantitative traits. Since splicing QTLs have largely been understudied, we perform a comprehensive integrative study of this class of QTLs, in a broad collection of tissues, and GWAS associations. We leverage full summary results from 87 GWAS for discovery analyses and use independent datasets for replication and validation. Notably, we nd widespread dose-dependent eect of cis-QTLs on traits through multiple lines of evidence. We examine the importance of considering, or correcting for, false functional links attributed to GWAS loci due to neighboring but distinct causal variants. To identify predicted causal eects among the complex trait associated QTLs, we conduct systematic evaluation across dierent methods. Furthermore, we provide guidelines for employing complementary methods to map the regulatory mechanisms underlying genetic associations with complex traits. 3 . CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under The copyright holder for this preprint (which was not this version posted October 22, 2019. Fig. 1. Overview of workow for mapping complex trait associated QTLs. Full variant association summary statistics results from 114 GWAS were downloaded, standardized, and imputed to the GTEx v8 WGS variant calls (maf>0.01) for analyses. A total of 8.87 million imputed and genotyped variants were investigated to identify trait-associated QTLs. A total of 49 tissues, 87 traits, and 23,268 protein-coding genes and lncRNAs remained after stringent quality assurance protocols and selection criteria. A wide array of complex trait classes, including cardiometabolic, anthropometric, and psychiatric traits, were included.

Dose-dependent eects of expression and splicing regulation on complex traits
The robust enrichment of GWAS variants (g. S5, g. S6) and heritability in eQTLs and sQTLs has been established by multiple studies, including our analysis of GTEx v8 data (9,21). This observation provides strong support for a causal role of expression and splicing regulation in complex traits. Importantly, transcriptome-based PrediXcan/TWAS methods implicitly assume that gene regulation aects complex traits in a dose-dependent manner. Nevertheless, there has been little formal support for this assumption. Here, we tested a dose-dependent eect on traits, i.e., whether e/sVariants with higher impact on gene expression or splicing lead to higher impact on a complex trait and a larger GWAS eect ( Fig. 2A). We note these analyses were performed with ne-mapped variants (21). The dose-dependent eect was quantied by the genic medi- 5 . CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under The copyright holder for this preprint (which was not this version posted October 22, 2019.

Causal gene prediction and prioritization
In addition to genome-wide analyses that shed light on the molecular architecture of complex traits, QTL analysis of GWAS data can identify potential causal genes and molecular changes in individual GWAS loci. Towards this end, we analyzed colocalization and genetically predicted regulation association (Fig. 3A). After evaluating the performance of coloc and enloc (16,17), we chose enloc as our primary approach, due to its use of hierarchical models to estimate colocalization priors (16) and its ability to account for multiple causal variants. The coloc assumption of a single causal variant drastically reduces performance especially in large QTL datasets such as GTEx with widespread allelic 8 . CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under The copyright holder for this preprint (which was not this version posted October 22, 2019. ; https://doi.org/10.1101/814350 doi: bioRxiv preprint heterogeneity (g. S27). We estimated the posterior regional colocalization probability (rcp), using enloc, for 12,072,964 (tissue, gene, GWAS locus, trait)-tuples and 67,943,800 (tissue, splicing event, GWAS locus, trait)-tuples. We used rcp>0.5 as a stringent evidence of colocalization.
To assess the performance of dierent colocalization approaches, we compared the nemapping results based on two large GWAS of height in European-ancestry individuals: GIANT (23) and UK Biobank. Colocalization of signals in two traits occurs when they share ne-mapped variants, i.e. variants with posterior causal probability greater than 0. We found that 85% of ne-mapped variants (posterior inclusion probability > 0. 25) in GIANT had posterior probability of 0 in the UK Biobank, which implies that the colocalization probability contributed by these variants is 0. Notably, 48% of GIANT's ne-mapped loci had no overlap with the UK Biobank's loci, resulting in a colocalization probability of 0. Given the larger sample size in the UK Biobank, this low colocalization cannot be attributed to lack of power but is likely due in part to reference LD dierences.
Thus, colocalization is highly conservative and may miss many causal genes, and low colocalization probability should not be interpreted as evidence of lack of a causal link between the molecular phenotype and the GWAS trait.
A complementary approach to colocalization is to estimate GWAS association for genetically predicted gene expression or splicing (19). The GTEx v8 data provides an important expansion of these analyses, allowing generation of prediction models in 49 tissues with whole genome sequencing data to impute gene expression and splicing variation. We trained prediction models using a variety of approaches and selected the top performing one based on precision, recall, and other metrics (21, 24). Briey, the optimal model uses ne-mapping probabilities for feature selection and exploits global patterns of tissue sharing of regulation (see (21); g. S12) to improve the weights. Multi-SNP prediction models were generated for a total of 686,241 (gene, tissue) pairs and 1,816,703 (splicing event, tissue) pairs. With the increased sample size and improved models, we increased the number of expression models by 14% (median across tissues) relative to the GTEx v7 models Elastic Net models (g. S13). Splicing models are available only for the 9 . CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under The copyright holder for this preprint (which was not this version posted October 22, 2019. ; https://doi.org/10.1101/814350 doi: bioRxiv preprint v8 release. Next, we computed the association between an imputed molecular phenotype (expression or splicing) and a trait to estimate the genic eect on the trait using S-PrediXcan (25).
Given the widespread tissue-sharing of regulatory variation (8), we also computed S-MultiXcan scores (10) to integrate patterns of associations from multiple tissues and increase statistical power (10). Twenty eight percent of the genes tested with S-PrediXcan showed a signicant association with at least one of the 87 traits at Bonferroni-corrected pvalue threshold (p < 0.05/686, 241; g. S14). For splicing, about 15% (20,364 of 138,890) of tested splicing events showed a signicant association (p < 0.05/1, 816, 703). Nearly all traits (94%; 82 out of 87) showed at least one S-PrediXcan signicant gene-level association in at least one tissue (g. S19 and S20). This resource of S-PrediXcan associations can be used to prioritize a list of putatively causal genes for follow-up studies.
To replicate the PrediXcan expression associations in an independent dataset, BioVU, which is a large-scale biobank tied to Electronic Health Records (26, 27), we selected seven traits with predicted high statistical power. Out of 947 gene-tissue-trait discoveries tested, 458 unique gene-tissue-trait triplets (48%) showed replication in this independent biobank (p < 0.05; (21)).
Altogether, these results provide abundant links between gene regulation and GWAS loci. To further quantify this, we considered approximately LD-independent regions (28) with a signicant GWAS variant for each trait, and calculated the proportion of GWAS loci that contain an associated gene from S-PrediXcan (p < 0.05 / # genes, 2 × 10 −6 ) or a colocalized gene from enloc (rcp > 0.5). Across the traits, 72% (3,899/5,385) of GWAS loci had a S-PrediXcan expression association in the same LD region and 55% (2,125/3,899) had evidence of colocalization with an eQTL (table S3, table S4, g. S17).
For splicing, 62% (3,345/5,385) had a S-PrediXcan association and 34% (1,135/3,345) enloc colocalized with an sQTL (g. S18). From the combined list of eGenes and sGenes, 47% of loci have a gene with both enloc and PrediXcan support. The distribution of the proportion of associated and colocalized GWAS loci across 87 traits is summarised in Fig. 3-E; for a typical complex trait, about 20% of GWAS loci contained a colocalized, signicantly associated gene while 11% contained a colocalized, signicantly associated splicing event. These results propose function for a large number of GWAS loci, but most loci remain without candidate genes, highlighting the need to expand the resolution of transcriptome studies. 10 . CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under The copyright holder for this preprint (which was not this version posted October 22, 2019.

Performance of causal gene prediction
Multiple studies have found an excess of deleterious rare variants associated with complex traits in genes that are in the vicinity of common variants associated with the same trait (2931), suggesting that the dose-response curve at the regulatory range may be extrapolated to the rare, loss-of-function end (Fig. 3B). We thus leveraged this rare variant information to analyze the sensitivity and specicity of S-PrediXcan and enloc in nding causal genes in GWAS loci. Towards this end, we curated two silver standard" sets of genes associated with specic traits, based on the OMIM (Online Mendelian Inheritance in Man) database (32) and rare variant association studies (29, 33,34) (g. S21, table S6).
We analyzed the genes, within the silver standard sets, that have a GWAS association for a matched trait in the same LD block (21, 28), resulting in 1,592 OMIM gene-trait pairs and 101 rare variant based gene-trait pairs (table S11, table S12, g. S22). Since only genes in the vicinity of an index gene can be discovered with cis-regulatory information, we selected 229 OMIM genes and 81 genes harboring rare variant associations that are located within the same LD block as the GWAS locus for a matched trait (g. S23).
Both S-PrediXcan and enloc showed high sensitivity to identifying the silver standard genes (Fig. 3C, Fig. 3D). Compared to a random set of genes within the same LD block as the GWAS locus and OMIM gene, S-PrediXcan and enloc showed substantial enrichment of 2.5 and 4.6 folds for expression and 2.5 and 6.1 folds for splicing, respectively. For the rare variant silver standard, we found similar enrichment for PrediXcan (2.2 and 2.19 for expression and splicing, respectively) and enloc (14.7 and 21.7). We note comparison of this enrichment between the methods is not interpretable because the tresholds based on signicance and colocalization probability are not comparable.
For applications such as target selection for drug development or follow-up experiments, another relevant metric is the precision or, equivalently, positive predictive value (PPV) the probability that the gene-trait link is causal given that it is called signicant or colocalized. Using the same threshold as for the sensitivity calculation, we found that 8.5% (73 out of 859) of PrediXcan signicant genes and 11.7% (49 out of 419) of enloc-colocalized genes were also OMIM genes for matched traits.
These enrichment results were corroborated by ROC and precision-recall curves, which demonstrate that enloc and PrediXcan contribute to prediction of causal genes, and that combining enloc and PrediXcan improves the precision-recall trade-o (g. S25). However, the overall prediction performance is modest, which is likely to be partially due to 11 . CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under The copyright holder for this preprint (which was not this version posted October 22, 2019. ; https://doi.org/10.1101/814350 doi: bioRxiv preprint the fact that the OMIM gene list has an inherent bias. Our current understanding of gene function is biased towards protein-coding variants with very large eects, which is reected in the list of OMIM genes. Genes associated to rare severe disease tend to be depleted of regulatory variation (35, 36), which will decrease the performance of a QTLbased method in a way that is unlikely to be generally applicable to GWAS genes that are more tolerant to regulatory variation (36).
To further investigate whether this predictive power could be improved by considering the proximity of the GWAS peaks to the OMIM genes, we performed a joint logistic regression of OMIM gene status on 1) the proximity of the top GWAS variant to the nearest gene, 2) posterior probability of colocalization, and 3) PrediXcan association signicance between QTL and GWAS variants. To make the scale of the three features more comparable, we used their respective ranking within the locus with a threshold for genes with no evidence of colocalization or association. Among the 229 OMIM genes, 28.4% were the closest gene, 22.7% were the most colocalized, and 18.3% were the most signicant g. S24.
All three features were signicant predictors of OMIM gene status, with better ranked genes more likely to be OMIM genes (proximity p = 2.0 × 10 −2 , enloc p = 6.1 × 10 −3 , PrediXcan p = 2.5 × 10 −4 ), indicating that each method provides an independent source of causal evidence. Similar results were obtained using splicing colocalization and association scores and the rare variant based silver standard, as shown in table S7. These results provide further empirical evidence that a combination of colocalization and association methods will perform better than individual ones. The signicance of proximity is an indicator of the missing regulability, i.e. mechanisms that may be uncovered by a gene assignment that assays other tissue or cell type contexts, larger samples, and other molecular traits.
Predicted OMIM genes included well-known ndings such as PCSK9 for LDLR, with PCSK9 signicant and colocalized for relevant GWAS traits (LDL-C levels, coronary artery disease, and self-reported high cholesterol), and Interleukins and HLA subunits for asthma, both signicant and colocalized for related immunological traits. Signicantly associated and colocalized genes that predicted OMIM genes also included FLG (eczema), TPO (hypothyroidism), and NOD2 (inammatory bowel disease) (see table S11 for complete list). Prediction of genes in the rare variant based silver standard was similarly observed (see (21); g. S26). 12 . CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under The copyright holder for this preprint (which was not this version posted October 22, 2019. Performance of enloc, PrediXcan, and SMR on expression (C) and splicing (D) data to predict causal genes using the OMIM silver standard. (E) Proportion of GWAS-associated loci per trait that contain colocalized and S-PrediXcan-associated signals for expression and splicing.
Tissue enrichment of GWAS signals A systematic survey of regulatory variation across 49 human tissues promises to facilitate the identication of the tissues of action for complex traits. However, because of the broad sharing of regulatory variation across tissues and the reduced signicance of tissuespecic eQTLs, causal tissue identication has been challenging. Here we used sparse factors from mashR representing patterns of tissue sharing of eQTLs (21), to classify each gene-trait association into one of 15 tissue classes (g. S28). Using the pattern of tissue classes of non-colocalized genes (rcp = 0) as the expected null, we assessed whether signicantly associated and colocalized genes (PrediXcan signicant and rcp > 0.01) were over-represented in certain tissue classes (Fig. 4). Consistent with previous reports (11, 38), we identied several instances in which the most signicant tissue is supported by current biological knowledge. For example, blood cell count traits were enriched in whole blood, neuroticism and uid intelligence in brain/pituitary, hypothyrodism in thyroid, 13 . CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under The copyright holder for this preprint (which was not this version posted October 22, 2019. ; https://doi.org/10.1101/814350 doi: bioRxiv preprint coronary artery disease in artery, and cholesterol-related traits in liver. Taken together, these results show the potential of leveraging regulatory variation to help identify tissues of relevance for complex traits. Fig. 4. Identifying trait-relevant tissues using tissue-specic enrichment. Enrichment of tisssue-specic association and colocalization compared to the pattern of tissue-specicity of non-colocalized genes. Over-representation of the tissue class for PrediXcan-signicant and colocalized genes is indicated by dark yellow while depletion is indicated by blue. Black dots label the tissue class-trait pairs passing the nominal p-value signicance threshold of 0.05. Abbreviation: S1. Trait category colors: S1.

Discussion
We examined in-depth the phenotypic consequences of transcriptome regulation and provide novel computational methodologies and best-practice guidelines for using the GTEx resources to interpret GWAS results. We provide a systematic empirical demonstration of the widespread dose-dependent eect of expression and splicing on complex traits, i.e., variants with larger impact at the molecular level had larger impact at the trait level.
Furthermore, we found that target genes in GWAS loci identied by enloc and PrediXcan were predictive of OMIM genes for matched traits, implying that for a proportion of the genes, the dose-response curve can be extrapolated to the rare and more severe end of the genotype-trait spectrum. The observation that common regulatory variants target genes also implicated by rare coding variants underscores the extent to which these dierent types of genetic variants converge to mediate a spectrum of similar pathophysiological eects and may provide a powerful approach to drug target discovery. 14 .
CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under The copyright holder for this preprint (which was not this version posted October 22, 2019. ; https://doi.org/10.1101/814350 doi: bioRxiv preprint We implemented association and colocalization methods that leverage the observed allelic heterogeneity of expression traits. After extensive comparison using two independent sets of silver standard gene-trait pairs, we conclude that combining enloc, PrediXcan, and proximity ranking outperforms the individual approaches. The signicance of the proximity ranking is a sign of the missing regulability emphasizing the need to expand the resolution, sample size, and range of contexts of transcriptome studies as well as to examine other molecular mechanisms.
We caution that the increased power oered by this release of the GTEx resources also brings higher risk of false links due to LD contamination and that naive use of eQTL or sQTL association p-values to assign function to a GWAS locus can be misleading.
Colocalization approaches can be eective in weeding out LD contamination but given the current state of the methods and the lack of LD references from source studies, they can also be overtly conservative. Importantly, ne-mapping and colocalization approaches can be highly sensitive to LD misspecication when only summary results are used (39).
The GWAS community has made great progress in recognizing the need to share summary results, but to take full advantage of these data, improved sharing of LD information from the source study as well as from large sequencing reference datasets, is also required. We highlight the importance of considering more than one statistical evidence to determine the causal mechanisms underlying a complex trait.
Finally, we generated several resources that can open the door for addressing key questions in complex trait genomics. We present a catalog of gene-level associations, including potential target genes for nearly half of the GWAS loci investigated here that provides a rich basis for studies on the functional mechanisms of complex diseases and traits. We provide a database of optimal gene expression imputation models that were built on the ne-mapping probabilities for feature selection and that leverage the global patterns of tissue sharing of regulation to improve the weights. These imputation models of expression and splicing, which to date has been challenging to study, provide a foundation for transcriptome-wide association studies of the human phenome the collection of all human diseases and traits to further accelerate discovery of trait-associated genes. Collectively, these data thus represent a valuable resource, enabling novel biological insights and facilitating follow-up studies of causal mechanisms. 15 . CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under    has received research support from AstraZeneca and Goldnch Bio, not related to this work.

Code and data availability
The code for methods applied in this paper can be downloaded from the github repository (https://github.com/hakyimlab/gtex-gwas-analysis).

Data availability statement
Genotype-Tissue Expression (GTEx) project raw whole transcriptome and genome sequencing data are available via dbGaP accession number phs000424. v8 CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under The copyright holder for this preprint (which was not this version posted October 22, 2019. ; https://doi.org/10.1101/814350 doi: bioRxiv preprint . CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under The copyright holder for this preprint (which was not this version posted October 22, 2019.     We did not include trans QTLs in our analyses due to limited power after correcting for confounders and potential pleiotropic eect in complex trait associations. Below, we briey describe the whole-genome sequencing, RNA-sequencing and QTL data processing protocols. Detailed description of subject ascertainment, sample procurement, and sequencing data processing are available elsewhere (9).
Whole-genome sequence data processing and quality control

RNA-Seq data processing and quality control
Whole transcriptome RNA-Seq data were aligned using STAR (v2. 5 (45)) using default parameters to quantify splicing QTLs in cis with intron excision ratios (9).

Genome-wide association studies (GWAS) Harmonization
The process followed for the harmonization and imputation are depicted in g. S2 . CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under

Imputation of GWAS summary statistics
To standardize the number of variants across trait-tissue pairs, all processed GWAS results were imputed. We implemented the Best Linear Unbiased Prediction (BLUP) approach (46,47) in-house (https://github.com/hakyimlab/summary-gwas-imputation) to impute z-scores for those variants reported in GTEx without matching data in the GWAS summary statistics. This algorithm does not impute raw eect sizes (β coecients). The imputation was performed in specic regions assumed to have suciently low correlations between them, dened by approximately independent linkage disequilibrium (LD) blocks (28) lifted over to hg38/GRCh38.
Only GTEx variants with MAF > 0.01 in GTEx-EUR subjects were used in downstream analyses. Covariance matrices (reference LD information) were estimated on these GTEx-EUR subjects. The corresponding (pseudo-)inverse matrices for covariances C were calculated via Singular Value Decomposition (SVD) using ridge-like regularization C + 0.1I. To avoid ambiguous strand issues homogeneously, palindromic variants (i.e. CG) were excluded from the imputation input. Thus, an imputed z-score was generated for palindromic variants available in the original GWAS; for them, we report the absolute value of the original entry with the sign from the imputed z-score. The sample size that we report for the imputed variants is the same as the sample size for the observed ones if it is reported as constant across variants, or their median if it changes across the observed variants, which occurs in the case of meta-analyses.
We initially considered publicly available GWAS summary statistics for 114 complex traits provided by large-scale consortia and the UK Biobank (48) (table S9) CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under The copyright holder for this preprint (which was not this version posted October 22, 2019. ; https://doi.org/10.1101/814350 doi: bioRxiv preprint (table S1). We observed noteworthy association prediction performance across the selected 87 traits (e.g., with a median r 2 = 0.90 (IQR = 0.0268) between the original and imputed zscores on chromosome 1). The median slope was 0.94 (IQR = 0.0164), as the imputed zscore values tend to be more conservative than the original ones. Imputation quality was consistent across traits, depending strongly on the number of input available variants (g. S3).
Supplementary Fig. S1. GWAS trait categories. Categories of the traits with full GWAS summary statistics used in the analysis. See list of traits in S1. 36 . CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under The copyright holder for this preprint (which was not this version posted October 22, 2019.  Table S9. 39 . CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under

NHGRI-EBI GWAS catalog
To investigate the downstream eects of GWAS loci using the resources from the GTEx consortium, we obtained the list of trait-associated SNPs from the GWAS catalog (49) (downloaded on 9/7/2018), which, at download, contained 80,727 entries. To measure the enrichment of e/sQTL in GWAS Catalog, we computed the proportion of e/sQTL in GWAS catalog relative to the proportion of e/sQTL among all GTEx V8 variants. And we obtained the uncertainty measurement of proportion and enrichment fold using block jackknife. See (9) for details.

Validation of ndings in BioVU Biobank
For replication analyses in BioVU (27), we selected 11 complex traits and 10 tissues with the largest sample size and estimated statistical power to detect true associations.

On summarizing across traits and tissues
Many of our analyses generate one statistic for each of the 4,263 (87 × 49) trait-tissue pairs. These can have a complex error structure with a wide range of standard errors and correlation between tissues. Thus, the usual "iid" (independent and identically distributed) assumption behind common statistical tests is not appropriate. For summarizing across traits for a given tissue, we assumed independence across traits but took into account the dierent standard errors. For summarizing across trait-tissue pairs, we allowed both correlation between tissues and correlation between traits, and corrected for dierent standard errors. More specically, let S tp be some statistic estimated in trait p and tissue t with standard error se(S tp ).
Summarizing across traits for a given tissue. Here we describe the procedure to summarize results that have one statistic (along with its standard error) per trait-tissue pair. For each tissue t, we summarized S t1 , · · · , S tP by tting the following linear model: Hence, we obtainμ t S and se(μ t S ) as the summary of S t1 , · · · , S tP estimates aggregated across traits, which is essentially a weighted average across traits.

41
. CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under Summarizing across trait and tissue pairs. Similarly, we summarized S 11 , · · · , S tp , · · · , S T P by tting the following linear model.
where µ t S is the tissue-specic random intercept (this accounts for tissue-specic features common across traits) and µ p S is the trait-specic random intercept (this accounts for trait-specic characteristics and thus accounts for the correlation between tissues for a given trait). The estimatedμ S and se(μ S ) is the average S tp across all trait-tissue pairs accounting for the complex error structure.
Testing whether two statistics have dierent mean. For some analyses, we would like to test whether two quantities are dierent across all trait-tissue pairs (e.g. enrichment signal measured for sQTL as µ S 1 vs. the one measured for eQTL as µ S 2 , etc). For this purpose, we constructed the following paired test. First, we formed the test statistic T tp := S 1,tp − S 2,tp which, under the null H 0 : µ S 1 = µ S 2 , has T tp ∼ N (0, se(S 1,tp ) 2 + se(S 2,tp ) 2 ).
Then, we summarized T tp across all trait-tissue pairs by the procedure described in the previous paragraph where tissue-or trait-specic intercepts are introduced to account for the complicated correlation structure among T tp 's. The resulting statistic T follows T ∼ N (0, se(T )) under the null.

Enrichment across tissues
Detailed discussions regarding the enrichment analyses and methods to address LD contamination are described elsewhere (9, 22). 42 . CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under

Proportion of QTLs by p-value cuto
To estimate the proportion of SNPs considered as associated with expression (for at least one gene) at various p-value thresholds, we used the most signicant p-value (tested using all GTEx individuals) for each SNP from all associations in all tissues (including all genes and variants tested). We plotted the proportion SNPs whose most signicant p-value meets a p-value threshold for varying levels of this threshold (g. S6). To test whether trait-associated SNPs are more likely to be e/sQTLs, we repeated the above procedure for the lead SNPs in the GWAS catalog.

Enrichment of GWAS catalog variants
To investigate the relevance of the QTLs in complex traits, we analyzed the database of trait-associated variants (dened as p < 5 × 10 −8 ), as curated in the NHGRI-EBI GWAS catalog (see Methods; hereafter GWAS catalog), and tested enrichment of single tissue QTLs across complex traits represented in the GWAS catalog. Next, we examined the nominally associated loci that did not attain genome-wide signicance in a subset of studies reported in the GWAS catalog. We observed that the proportion of variants associated with expression and splicing at dierent signicance threshold was much larger for trait-associated variants from the GWAS catalog than for the full set of tested common variants (g. S6. The signicance of this dierence is reported elsewhere (9). Notably, as statistical power improved with increased sample sizes, spurious associations caused by trait-associated QTLs that are distinct from, but in linkage disequilibrium (LD) with, the trait causal variant(s) (LD contamination) also increased (22). At a nominal threshold, the proportion of common variants associated with the expression of a gene in some tissue increased from 92.7% in the V6 release (8) to 97.3% in V8. For splicing the proportion was 97.7%. We should caution that assigning function to a GWAS locus based on QTL association p-value alone, even with a more stringent threshold, could be misleading.
However, ne-mapping of the region for BMI and FTO expression in muscle showed 44 . CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under FTO expression ne-mapping assigned >99% probability of being causal to rs1861867, chr16_53814649_A_G_b38.

Fine-mapping QTL variants
We applied dap-g (14) to the 49 tissues to estimate the degree to which a variant might exert a causal eect on expression or splicing levels, using default parameter values. First, we selected genes annotated as protein-coding, lincRNA or pseudogenes. For each gene, we considered all variants within the cis-window (1Mbps) with M AF > 0.01, and used the same covariates as in the main eQTL analysis to correct for unwanted variation. This yielded a list of clusters (variants related by LD), and posterior inclusion probabilities (pip) that provide an estimate of the probability of a variant being causal. We repeated this process for splicing ratios from Leafcutter, using a cis-window ranging from 1Mbps upstream of the splicing event start location to 1Mbps downstream of the end location.
We used individual-level data for GTEx-EUR subjects both for expression and splicing.
Sample sizes ranged from 65 in kidney cortex to 602 in skeletal muscle tissues.

Modeling eect mediated by regulatory process
We compared the magnitude of GWAS and cis-QTL eect sizes, which is the basis of multi-SNP Mendelian randomization approaches (50).
To formalize the relationship between the GWAS eect size (δ) and the QTL eect size (γ), we assumed an additive genetic model for the GWAS trait. Specically, for variant k, where X k is the allele count of variant k, Y is the trait, and is the un-explained variation.
We decomposed GWAS eect size into its mediated and un-mediated components, 45 . CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under The copyright holder for this preprint (which was not this version posted October 22, 2019. ; https://doi.org/10.1101/814350 doi: bioRxiv preprint where G k represents the set of genes regulated by variant k with corresponding QTL eect size as γ k,g , and ν k is the un-mediated eect of variant k on trait. And β g is the downstream eect of gene g on the trait.

Selection of ne-mapped variants as instrumental variables
To investigate the relationship between GWAS and QTL eect sizes in the transcriptome, we generated a set of ne-mapped QTL signals derived from dap-g ne-mapping performed in the GTEx-EUR individuals (see Section 1.5) to serve as proxy for causal QTLs. For splicing, we utilized sQTLs at the splicing event/variant level rather than the gene/variant level. In particular, we selected the top variant within each 25% credible set of a gene or splicing event and ltered out the QTLs with pip less than 1%. For each of the selected QTLs, we used the QTL eect size estimated from the marginal test (using the GTEx-EUR individuals) and GWAS eect size obtained from the imputed z-score from the GWAS imputation byβ ≈ z/ f (1 − f )N , where f is the allele frequency and N is the GWAS sample size.

Correlation between GWAS and QTL eect sizes
For each trait-tissue pair, we calculated the Pearson correlation of the magnitude of observed GWAS eect size and of cis-eQTL eect size, Cor(|δ k |, |γ k |), for the list of selected ne-mapped QTLs, as described in Section 1. 6. The observed Pearson correlation captures the mediated eect (see details in Section 1.6). To obtain a null distribution for the correlation, we computed the Pearson correlation under the shued data within each LD-score bin dened by quantiles (100 bins were used). The signicance of the dierence between observed and null distribution was calculated using the method described in 1.3.

46
. CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under The copyright holder for this preprint (which was not this version posted October 22, 2019. ; https://doi.org/10.1101/814350 doi: bioRxiv preprint where b 0 , b 1 are the xed eect capturing the un-mediated eect and β g is the downstream eect (mediated eect) of the gene or splicing event g. In short, we assumed an innitesimal model on the downstream eect and aimed at estimating σ 2 gene as the transcriptome-wide contribution of the mediated eect. For each tissue-trait pair, we tted the model using selected ne-mapped QTLs, as described in Section 1.6, along with the correspondingδ k ,γ k,g . To obtain the distribution of σ 2 gene under the null, we performed the same calculation using shued GWAS eect sizes.

Concordance of mediated eects for allelic series of independent eQTLs
Under the mediation model in Eq. 8, we expect that for a given gene with multiple QTL signals, these signals should share the same downstream eect, β g . Since the number of splicing events with multiple QTL signals was limited, we restricted this analysis to eQTLs only. We tested for concordance of downstream eect size obtained from the primary and secondary eQTL of a gene (ranked by QTL signicance or QTL eect size estimate).
Specically, for a given trait and gene g, we dened the observed downstream eect for the kth variant asβ k,g =δ k /γ k,g . Thus, for each gene, we obtainedβ prim andβ sec as the observed downstream eect for the primary and secondary eQTLs if more than one eQTL signal was detected by dap-g. Ideally, for a mediating gene in a causal tissue (or a good proxy tissue), we would expect thatβ prim andβ sec should tend to have consistent value as compared to random. We measured the concordance in two ways: 1) correlation between β prim andβ sec ; 2) percent concordant, dened as the fraction of eQTL pairs having the same sign inβ prim andβ sec Concordance as compared to non-colocalized genes with matched LD. To ensure that the concordance betweenβ prim andβ sec was not driven by LD, we compared the concordance of primary and secondary eQTLs between colocalized and likely non-causal genes with similar LD pattern between primary and secondary eQTLs. Specically, for each trait-tissue pair, we obtained primary and secondary eQTLs based on magnitude of eect size and measured the concordance as Cor(β prim ,β sec ). We computed the concordance for colocalized genes at various enloc rcp cutos (obtained from GTEx-EUR individuals). Furthermore, we randomly sampled the same number of genes with enloc rcp < 0.01 by matched LD and calculated the corresponding concordance as the null. To reduce the eect of outliers on concordance calculation, we removed genes withβ prim or 47 . CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under The copyright holder for this preprint (which was not this version posted October 22, 2019. ; https://doi.org/10.1101/814350 doi: bioRxiv preprint β sec in the top and bottom 5%. We kept only trait-tissue pairs with more than 10 genes observed after removing outliers. Concordance as compared to null with matched LD. The correlation between primary and secondary eQTLs (pairwise LD) could introduce correlation between primary and secondaryδ's and similarly to primary and secondaryγ's, which would potentially contribute to concordance. To account for this confounding, for each gene, we simulated δ prim andδ sec preserving the correlation introduced by pairwise LD with whereR is the sample correlation (from GTEx-EUR individuals) of primary and secondary variants. This simulation scheme is equivalent to simulating phenotype asỸ ∼ N (Xδ prim + Xδ sec , σ 2 ) and running GWAS on the GTEx-EUR genotypes.
Visualizing the concordance among enloc colocalized genes. To visualize the concordance ofβ prim andβ sec , we rst scaledδ andγ by their standard deviation among all eQTLs selected in Section 1. 6. Then, we extracted the set of genes with exactly two dap-g eQTLs (as described in 1.6) and labelled the two eQTLs as primary and secondary based on QTL signicance or QTL eect size. We computedβ prim andβ sec and removed the genes withβ prim orβ sec in the top and bottem 5%. As a control, we also simulated random δ to compute simulated β sim for downstream analysis. We further ltered the genes by selecting only those with enloc rcp > 0. 1.

Widespread dose-dependent eects of expression and splicing regulation on complex traits (cont. from main text)
With multiple ne-mapped QTLs being detected in expression data, we proceeded to look into the eect of primary and secondary eQTLs. A third line of support for the dosedependent eect was provided by the fact that primary eQTLs (ranked by eect size) showed, on average, larger GWAS eect sizes than secondary eQTLs (Fig. 2F). 48 . CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under The copyright holder for this preprint (which was not this version posted October 22, 2019.  49 . CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under  S8. Percent concordance for genes showing colocalization evidence versus LD-matched non-colocalized genes and LD-matched simulatedδ prim andδ sec . The percent concordance (on y-axis) for genes with various enloc colcalization signal (on x-axis) is shown for 87 traits in whole blood in blue. The percent concordance obtained by sampling from non-colocalized genes (rcp < 0.01) with matched LD between primary and secondary QTLs is shown in light blue and that obtained by simulatingδ prim andδ sec with observed pairwise LD is shown in gray.
Providing further support for the dose-dependent eect, the concordance of the mediated eects was consistently observed across traits and tissues and retained concordant directionality ( Fig. 2D-E, gs. S8,S7), especially among colocalized genes (rcp > 0.1). Primary and secondary eQTLs ranked by eQTL eect size instead of p-value yielded similar patterns (g. S11).

Correlation between GWAS and QTL eect sizes and mixed-eects model account for LD contamination
We illustrate the intuition behind the LD-contamination correction when the average mediated eects are estimated using the approximate method (correlation of absolute values) or the mixed-eects approach. 50 . CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under The copyright holder for this preprint (which was not this version posted October 22, 2019. where Eq. 15 holds because the marginal eect size depends on LD. To determine the covariance of the magnitude of the GWAS and QTL estimates for SNP 1, we consider E(|δ 1 ||γ 1 |). E(δ 1γ1 |R) = E((Rδ 2 + GWAS ) · (γ 1 + QTL ) |R) (17) = E(Rδ 2 γ 1 |R) + E( GWAS γ 1 ) + E(Rδ 2 QTL |R) + E( GWAS QTL ) (18) = R · E(δ 2 γ 1 ), (19) where Eq. 19 holds since the last three terms in the previous line are zeros, due to the independence among GWAS , QTL , and true eect sizes, δ and γ.
Hence, the covariance of the GWAS and QTL eect sizes under the LD contamination 51 . CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under The copyright holder for this preprint (which was not this version posted October 22, 2019. ; https://doi.org/10.1101/814350 doi: bioRxiv preprint scenario is where Eq. 25 follows by denition of the mediation model considering no direct eect. So, So, if we consider a gene locus, which naturally conditions on local LD and gene-level 52 . CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under which still captures the distinction between LD contamination and mediation model, since However, if the LD contamination goes into both GWAS and QTL eect sizes, Cor(|δ|, |γ|) We also developed a mixed-eects approach. We model β as a random eect, β ∼ N (0, σ 2 gene ), and instead of averaging β's across the whole transcriptome (this is what Cor(δ,γ) does), we quantify the mediated eect by estimating σ 2 gene . Specically, we t PVE k represented the quality score for using FLASH factor k to explain the cross-tissue pattern of eQTL. The eQTL was assigned to a FLASH factor k if PVE k was maximal among all FLASH factors and PVE k > 0.2 and for those with PVE k ≤ 0.2 in all FLASH factors, NA (short for not assigned) was assigned instead. These "not assigned" eQTLs had more complex tissue-sharing pattern than the factors captured in the FLASH analysis.
To obtain an interpretable tissue-specicity category, we labeled Factor1 as the shared factor, Factor2, Factor13, Factor14, Factor29, and Factor30 as brain-specic factors, and the rest of the factor assignment as other factors.

Smoothing eect size estimates by leveraging global patterns of tissue sharing
We applied the multivariate adaptive shrinkage implemented in mashr (52) to smooth cis-eQTL eect size estimates (obtained from all GTEx individuals) by taking advantage of correlation between tissues. To t the mashr model, we used the set of ∼ 16, 000 cis-eQTLs as stated in Section 1.7 to learn the mashr prior, and then t the mashr model using ∼ 40, 000 randomly selected variant-gene pairs for the same set of eGenes. We learned data-driven mashr priors in three ways: 1) FLASH factors as described in Section 1.7; 2) PCA with number of PC = 3; 3) empirical covariance of observed zscores. The data-driven covariances were further denoised by calling cov_ed in mashr.
Furthermore, we included the set of canonical covariances as described in (52) as an additional mashr prior. We t the mashr model using the set of randomly selected variantgene pairs with the error correlation estimated by applying estimate_null_correlation function in mashr and the priors obtained above. The resulting mashr model was used to compute the posterior mean, standard deviation, and local false sign rate (LFSR) for a given variant-trait pair. 55 . CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under The copyright holder for this preprint (which was not this version posted October 22, 2019. ; https://doi.org/10.1101/814350 doi: bioRxiv preprint Supplementary Fig. S12. Tissue-specic factor estimation using ashr. We performed empirical Bayes matrix factorization (by flashr) on a set of the top cis-eQTLs (per gene), and we restricted factors to have non-negative values. We binarized the resulting factors by thresholding the tissue contribution to TRUE if it is at least 20% of the maximum. The pattern after thresholding is shown.

Causal gene prioritization
Two classes of methods can be used to identify the target genes of GWAS loci. One class is based on the colocalization of GWAS and QTL loci, which seeks to determine whether the causal variant for the trait is the same as the causal variant for the molecular phenotype. The other class is based on the association between the genetically regulated component of gene expression (or splicing) with the trait.

Colocalization
For a given variant associated with multiple traits such as gene expression (eQTL) and complex disease (trait-associated variant), extensive LD makes it challenging to identify the underlying true causal mechanisms. Thus, we conducted colocalization analysis using two independent approaches: coloc (17) and enloc (16)), to estimate whether a gene's 56 . CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under

enloc
We computed Bayesian regional colocalization probability (rcp) using enloc, to estimate the probability of a GWAS region and a gene's cis window sharing causal variants. We leveraged the same dap-g results from 1.5, and split the GWAS summary statistics into approximately LD-independent regions (28), each region dening a GWAS locus. For each trait-tissue combination, we computed the rcp of every overlapping GWAS locus to a gene's or splicing event's cis window with enloc default parameters. The enrichment estimates obtained by enloc are shown in g. S5.
For each trait, we counted the number of GWAS loci that contain a GWAS signicant hit, and among these, the number of loci that additionally contain a gene with enloc colocalization rcp > 0. 5. As shown in g. S17C, across traits, a median 29% of loci with a GWAS signal contain an enloc colocalized signal. Given enloc's conservative nature, we caution that rcp < 0.5 does not mean that there is no causal relationship between the molecular phenotype and the complex trait; rather, it should be interpreted as lack of sucient evidence with current data. We summarize the ndings in g. S18. We observed a smaller proportion of GWAS loci containing a colocalized splicing event (median 11% across traits).

coloc
We computed coloc on all cis-windows with at least one eVariant (cis-eQTL per-tissue q-value< 0.05) or sVariant. For binary traits, case proportion and 'cc' trait type parameters were used. For continuous traits, sample size and 'quant' trait type parameters were used.
In both cases, imputed or calculated z-scores were used as eect coecients in Bayes factor calculations.
Coloc is very sensitive to the choice of priors. We used enloc's enrichment estimates to dene data-based priors in a consistent manner. First, we dened likely LD-independent blocks of variants using denitions provided previously (28). The probability of eQTL signal, Pr(d i = 1), was estimated using dap-g (14). Subsequently, we calculated priors p 1 ,

57
. CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under The copyright holder for this preprint (which was not this version posted October 22, 2019. ; https://doi.org/10.1101/814350 doi: bioRxiv preprint p 2 , and p 12 for colocalization analyses as follows: × Pr(d i = 1), and where α 0 and α 1 indicate intercept eect estimate and log odds ratio estimate for the enrichment using enloc, respectively.
We ran coloc using variants in the cis-window for each gene and the intersection with each GWAS trait, obtaining ve probabilities for each (gene, tissue, trait) tuple: P0 for the probability of neither expression nor GWAS having a causal variant; P1 for the probability of only expression having a causal variant; P2 for only the GWAS having a causal variant; P3 for the GWAS and expression traits to have distinct causal variants; P4 for the GWAS and expression traits to have a shared causal variant. We repeated this process using sQTL results. 1.9 Fine-mapping of GWAS using summary statistics To investigate the robustness of ne-mapping, we ne-mapped "height" from the GIANT GWAS meta-analysis and "standing height" from the UK Biobank using susieR (53). We performed ne-mapping using susie_bhat within each LD block (28). We used GWAS eect sizesβ imputed from z-scores byβ = z/ N f (1 − f ) and se(β) =β/z, where f is allele frequency and N is GWAS sample size. The GTEx-EUR individuals were used as the reference LD panel. We applied the same approach to ne-map the BMI-associated FTO locus using the UK BioBank BMI data.

Association to predicted expression or splicing Prediction models
To predict expression, we constructed linear prediction models (24), using only individuals of European ancestry, and variants with MAF > 0.01, for genes annotated as proteincoding, pseudo-gene, or lncRNA. For each gene-tissue pair, we selected the variants with highest pip in their cluster, and kept those achieving pip > 0.01 in dap-g (14). We used mashr (52) eect sizes (as computed in 1.7) for each selected variant. For each 58 . CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under The copyright holder for this preprint (which was not this version posted October 22, 2019. ; https://doi.org/10.1101/814350 doi: bioRxiv preprint model, we computed the covariance matrix between variants using only individuals of European ancestries, with sample sizes ranging from 65 (kidney -cortex) to 602 (skeletal muscle). This allowed us to build LD panels for every tissue, noting that GWAS studies are conducted on populations of predominantly European ancestries. For every gene, we also computed the covariance of all the variants present across the dierent tissue models, compiling a cross-tissue LD panel to compute the correlation between predicted expression levels across tissues. We refer to these models as mashr models. We compared the number of mashr models to the number of Elastic Net models from GTEx version 7 (g. S13). We generated analogous prediction models for splicing ratios, as computed by Leafcutter (45), applying the same model-building methodology to the data from the sQTL analysis.
We generated a secondary set of prediction models based on ne-mapping information. For every gene-tissue pair, we selected the variants with the highest pip in each cluster achieving pip > 0.01 as explanatory variables. We performed an Elastic Net (54) regression of expression on these variables, for genes annotated as protein coding, pseudogene, or lncRNA. We employed a cross-validated strategy, and kept only models that achieved cross-validated correlation ρ > 0.1 and cross-validated prediction performance p-value p < 0.05. Each variant's eect size was penalized by a factor 1 − pip, so that variants with higher probabilities were more likely to impact the model. Expression phenotypes were adjusted for unwanted variation using covariates such as gender, sequencing plaform, age, the top 3 principal components from genotype data, and PEER factors.
The number of PEER factors was determined from the sample size: 15 for n < 150, 30 for 150 ≤ n < 250, 45 for 250 ≤ n < 350, 60 for 350 ≤ n. We obtained 686,241 models for dierent (gene, tissue) pairs. For each model, we computed tissue-level and cross-tissue covariances as in the mashr models.
We also generated analogous prediction models for splicing ratios, with the same model-building methodology applied to the data from the sQTL analysis, obtaining 1,816,703 (splicing event, tissue) pairs.
We constructed additional sets of prediction models comprised of a single snp, using the top eQTL per gene or the primary, secondary or tertiary eQTL arising from conditional or marginal analysis, in order to assess eect dierence on the complex traits. 59 . CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under The copyright holder for this preprint (which was not this version posted October 22, 2019.
Supplementary Fig. S13. Number of models available in v8 MASHR family of models, compared to v7 Elastic Net family. Tissues are ordered by sample size.

S-PrediXcan
We performed S-PrediXcan analysis (25) on the 87 complex traits, using the GWAS summary statistics described in 1.2, to identify trait-associated genes (typically p < 2.5 × 10 −7 ). We used the 49 models and LD panels described in 1.10, separately on each trait, to obtain 59,485,548 (gene, tissue, trait) tuples. Repeating this process to generate splicing event ratio models, we obtained 154,891,730 (splicing event, tissue, trait) tuples; for each trait, the Bonferroni-signicance threshold was p < 9.5 × 10 −8 .

Colocalized and signicantly associated genes
We assessed how many genes present evidence of trait association and colocalization, using both expression and splicing event. First, we counted the proportion of genes that showed 60 . CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under The copyright holder for this preprint (which was not this version posted October 22, 2019. ; https://doi.org/10.1101/814350 doi: bioRxiv preprint a colocalized expression signal with any trait in any tissue, and observed 15% such genes at rcp> 0. 5. Then, for each gene, we considered the splicing event with highest colocalization value in any trait or tissue, and found evidence for 5% at rcp> 0. 5

.
Then we repeated this process for S-PrediXcan associations at dierent signifcance thresholds. About 30% of genes showed a signicant S-PrediXcan association to any trait, and only 8% when ltered for associations with rcp> 0.5. When using the highest splicing association and colocalization value for a gene, these proportions were 20% and 3%, respectively.
These proportions gauge our power to predict causal genes aecting complex traits on the GTEx resource, with expression yielding more ndings than splicing. 61 . CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under The copyright holder for this preprint (which was not this version posted October 22, 2019. ; https://doi.org/10.1101/814350 doi: bioRxiv preprint Supplementary Fig. S14. Proportion of genes with a colocalized or associated signal using expression or splicing event. A shows the proportion of genes with colocalization evidence in expression data, for dierent rcp thresholds. 3,477 genes show evidence at rcp> 0.5 (15% out of 23,963 genes with enloc results). B shows the proportion of genes with colocalization evidence in splicing data; 1,277 genes (5% of all 23,963) show evidence at rcp> 0. 5. C shows the proportion of genes with association evidence in expression data, additionally ltered by colocalization on dierent thresholds. About 30% of genes show associations at the bonferroni threshold (p < 0.05/686, 241), while 8% also show colocalization evidence. D shows the proportion with association and colocalization evidence in splicing data; about 20% show association evidence (p < 0.05/1, 816, 703) and 3% are also colocalized.

S-MultiXcan
There is substantial sharing of eQTLs across tissues (8). Therefore, we applied S-MultiXcan (10), an approach to exploit the tissue sharing of regulatory variation, to improve our ability 62 . CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under The copyright holder for this preprint (which was not this version posted October 22, 2019. ; https://doi.org/10.1101/814350 doi: bioRxiv preprint to identify trait-associated genes. The method extends the single-tissue S-PrediXcan approach, leveraging GWAS summary statistics and taking into account the correlation between tissues. We obtained association statistics for 1,958,220 (gene, trait) pairs and 11,986,329 (splicing event, trait) pairs.

PrediXcan replication in BioVU
We replicated the signicant gene-level associations for a prioritized list of traits (table S16) using BioVU (27), Vanderbilt University's DNA Biobank tied to a large-scale Electronic Health Records (EHR) database. We sought BioVU replication in the exact discovery tissues for the signicant gene-trait associations. We restricted our analysis to subjects of European ancestries, using principal component analysis as implemented in EIGENSOFT (version 7. 1.2; (55)). First, we estimated the genetically determined component of gene expression in the BioVU individuals using the PrediXcan imputation models. We then conducted association analysis for the prioritized traits using logistic regression, with sex and age as covariates.

Summary-data-based Mendelian Randomization (SMR) and HEIDI
We performed top-eQTL based Summary-data-based Mendelian Randomization (SMR) (12) analysis of the 4,263 tissue-trait pairs. SMR, which integrates summary statistics from GWAS and eQTL data, has been used to prioritize genes underlying GWAS associations. Supplementary Fig. S15. Causal gene prioritization using S-PrediXcan and enloc.
Summary of GWAS loci that also contain an associated S-PrediXcan or enloc signal, for expression (left) and splicing (right), using MASHR models. 63 . CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under The copyright holder for this preprint (which was not this version posted October 22, 2019. ; https://doi.org/10.1101/814350 doi: bioRxiv preprint Panel B shows the number of loci (approximately independent LD regions from (28)) with at least one GWAS-signicant variant (dark gray), and among them those with at least one gene reaching rcp > 0.5 (dark green).
Panel C shows the proportion of loci with at least one GWAS-signicant hit that contain at least one colocalized gene. Across traits, a median of 21% of the GWAS loci contain colocalized results. See trait abbreviation list in Table S1. This gure is also presented in (9). 66 . CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under  Table S1. 67 . CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under  Table S1. 68 . CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under This gure summarizes S-MultiXcan associations for each of the 87 traits using splicing models. The traits are ordered by number of GWAS-signicant variants. Panel A) shows in purple the number of S-MultiXcan signicant splicing events, and in dark green the subset also achieving enloc rcp > 0.5 in any tissue. The proportion of colocalized, signicantly associated splicing events is typically 2%, much lower than the proportion from gene expression (12%). Panel B) shows the number of loci (approximately independent LD regions (28)) with a signicant GWAS association (gray), a signicant S-MultiXcan association (purple), and a signicant S-MultiXcan association that is colocalized (dark green). As in the case of expression models, Anthropometric and Blood traits tend to present the largest number of associated loci. Panel C) shows the proportion of loci with signicant GWAS associations (gray) that contain S-Multixcan (purple) and colocalized S-MultiXcan associations (dark green). Across traits, a median of 63% of GWAS-associated loci show an S-MultiXcan association, while 11% show a colocalized S-MultiXcan association. These proportions are lower than the corresponding ones for expression (70% and 19% respectively). See trait abbreviation list in Table S1. 69 . CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under The copyright holder for this preprint (which was not this version posted October 22, 2019. ; https://doi.org/10.1101/814350 doi: bioRxiv preprint

Curation of causal gene-trait pairs (silver standards)
We curated a list of 1,592 gene-trait pairs with evidence of causal associations in the OMIM database (hereafter, OMIM genes).
After matching traits, we retained 29 unique traits and 631 unique genes that are within the same LD block (28) as the GWAS hit (table S10). As an additional independent evaluation, we also curated a list of`silver standard genes', using 101 gene-trait pairs from 70 . CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under Throughout this section, we limited our scope to only the protein-coding genes. We rst describe the approach to dene the OMIM-based silver standard. And then we describe the construction of a rare variant set based silver standard.

OMIM-based causal genes
Supplementary Fig. S21. Workow of OMIM-based curation. The workow of OMIMbased causal gene curation is shown where each box represents the trait description/identier in dierent databases. The steps to obtain OMIM genes for MAGIC_FastingGlucose, one of our GWAS traits, is shown as a concrete realization of the workow.
To obtain a curated set of trait-gene pairs from the OMIM database (32), we constructed a map between our GWAS traits and the OMIM traits and then mapped the OMIM traits to genes using the`Gene-Phenotype Relationships' available in the OMIM database.
Specically, we constructed a keyword for each of the 114 GWAS traits (see TableThen, we matched the keyword to trait description in the GWAS catalog if the keyword occurred in the description sentence (as shown in g. S21 step 1). The GWAS catalog-to-phecode map (as shown in S21 step 2) was created, using electronic health records (EHR) data (27), for a replication study of GWAS ndings. The implementation of the map is described in detail in (27). Briey, traits from the catalog (represented as free text) were mapped to the closest corresponding phecode. The phecode/catalog trait relationships were classied 71 . CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under The copyright holder for this preprint (which was not this version posted October 22, 2019. ; https://doi.org/10.1101/814350 doi: bioRxiv preprint as "exact", "narrower" (if the phecode was more specic than the catalog trait), "broader" (if the phecode was more general than the catalog trait), and "proxy" (if the catalog trait was for a continuous measurement). All phecode/catalog trait relationships were included in the analysis.
The clinical descriptions from OMIM have been annotated with Human Phenotype Ontology (HPO) (56) terms. We created a map between phecodes and HPO terms used to describe OMIM diseases, as previously described (57), which gave rise to the mapping betweem phecodes and OMIM traits (as shown in S21 step 3). By combining these maps, we were able to relate GWAS catalog traits to OMIM disease descriptions, utilizing phecodes and HPO terms as intermediate steps.
For a subset of datasets with discovery (public) and replication (UKB) results in our collection, we kept the dataset with higher number of GWAS loci to avoid double counting.
The number of GWAS loci was determined based on counting the lead variants, using the PLINK V1.9 command clump-r2 0.2 clump-p1 5e-8 at genome-wide signicance 5 × 10 −8 ) for each trait. Furthermore, for this analysis, we excluded GWAS traits with fewer than 50 GWAS loci. The full list of OMIM based trait-gene pairs is listed in S10.

Rare variant association based causal genes
In addition to the OMIM-based curation, we collected a set of genes in which rare proteincoding variants were reported to be signicantly associated with our list of complex traits.
Here, we focused on rare variant association evidence reported on height and lipid traits (low-density lipid cholesterol, high-density lipid cholesterol, triglycerides, and total cholesterol levels) (29, 33, 34). In particular, we collected signicant coding/splicing variants reported previously (29) and kept variants with eect allele frequency < 0.01 (Table   S6: ExomeChip variants with Pdiscovery <2e-07 in the European-ancestry meta-analysis (N=381,625)). Similarly, we collected signicant variants reported by (33) (Table S12: Association Results for 444 independently associated variants with lipid traits) and ltered out variants with minor allele frequency < 0.01. For the whole-exome sequencing study conducted in Finnish isolates (34), we extracted signicant genes identied by a gene-based test using protein truncating variants (Table S9:  . Workow on constructing the list of candidate genes for silver standard analysis. We started with GWAS summary statistics and extract LD blocks (boundaries of LD block are shown as gray vertical lines) that contain a variant with GWAS association p < 5 × 10 −8 . We further ltered out LD blocks which do not overlap with silver gene (labelled with red triangle). The candidate genes (shown as black horizontal bars) are those overlapping with the leftover LD blocks.

Construction of candidate genes and denition of the scores for various methods
We constructed a set of candidate genes for silver standard analysis following the workow shown in S23. The rationale of the analysis was to focus on a common use case in practice, namely to prioritize genes within GWAS loci. First, for each trait we dened all LD blocks with genome-wide signicant GWAS signals (p < 5 × 10 −8 ) as GWAS loci. Then, we selected the variant with the highest signicance as the GWAS lead variant for the locus (and randomly picked one in case of ties). Here we limited our scope to those GWAS loci containing silver standard genes since the current silver standard did not have enough information to test the rest, which were driven by genes without indication in OMIM database and/or rare variant associations, and potentially had smaller eect size as compared to the former. Thus, we kept only the GWAS loci overlapping with silver standard genes. The full list of silver standard genes, after the ltering procedure, can be found in S14 and S15. The number of GWAS loci and silver standard genes that 74 . CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under The copyright holder for this preprint (which was not this version posted October 22, 2019. ; https://doi.org/10.1101/814350 doi: bioRxiv preprint remained after the above ltering steps can be found in Supplementary Table S6. From these GWAS loci containing silver standard genes, we extracted all genes overlapping with the loci as the set of candidate genes (the number of candidate genes per locus is shown in g. S22).
For each of these candidate genes, we obtained the gene-level statistics for the corresponding traits from the application of various methods, i.e. enloc, coloc, SMR, and PrediXcan-mashr where we collapsed statistics across tissues by taking the`best' scores (highest regional colocalization probability (rcp) in enloc; highest posterior probability under hypothesis 4 in coloc; smallest p-value in SMR and PrediXcan-mashr ).
Regarding the results on splicing (with statistics reported at the intron excision event level), we obtained gene-level statistics by taking the`best' score among all splicing events of the gene. We also summarized the cis-sQTL at the gene level using the same strategy. 75 . CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under The copyright holder for this preprint (which was not this version posted October 22, 2019. Regression-based test on the per-locus rank To investigate the usefulness of the colocalization and association statistics reported by enloc and PrediXcan respectively, we 76 . CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under The copyright holder for this preprint (which was not this version posted October 22, 2019. ; https://doi.org/10.1101/814350 doi: bioRxiv preprint performed logistic regression, as described in , to t log odds of being a 'causal' gene against the ranking of: 1) proximity to GWAS lead variant (from close to distal), 2) rcp from enloc (from high to low), and 3) gene-level association p-value from PrediXcan-mashr or SMR (from signicant to non-signicant). logit(Pr(causal i )) = β 0 + β 1 · rank(proximity i ) + β 2 · rank(rcp i ) + β 3 · rank(P-value i ), (34) in which non-zero β k meant that the kth variable contributed independently on predicting whether a gene was causal. Moreover, negative β k indicated that the direction of contribution of the variable was as expected.
Regression-based test to investigate the independent contribution of proximity, colocalization, and association based methods  Precision-recall and receiver operating characteristic curve In addition to the per-locus analysis, we combined the gene-trait pairs from the perlocus analysis. We labelled the ones with silver standard genes for the corresponding trait as 1 and the rest as 0 so that we could plot PR and ROC curve for each association/colocalization based method by varying a universal threshold (i.e., either p-value or colocalization probability) across all analyzed traits and loci. These curves provide a 77 . CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under   Discussion on the limitation of applying universal cuto in precision-recall/ROC curve To apply a universal threshold to all loci and traits was a limitation of this approach. On the one hand, dierent GWAS traits may have dierent sample sizes so that the colocalization probabilities or association p-values are not comparable across traits. On the other hand, for the same trait, the magnitude of the mediated eect size (gene/splicing-level eect) at dierent loci may vary, which makes the colocalization probability or association p-value less comparable across loci. With these limitations, the comparison between methods was still informative but might favor the one that suers less from the lack of comparability and being more stringent (i.e. enloc). Furthermore, the curves were not directly comparable to per-locus approaches since a per-locus approach intrinsically utilized the information that there was only one signal per locus. However, these curves provide insights into how these methods would perform (in terms of precision/power trade-o ) if we applied a universal cuto across loci, which is a typical use case in practice.
Unique causal gene assumption with silver standard In many practical applications, the investigator will be interested in identifying the causal gene that drives the signal at a given GWAS locus. Here we assume existence and uniqueness, i.e., that the 81 . CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under The copyright holder for this preprint (which was not this version posted October 22, 2019. ; https://doi.org/10.1101/814350 doi: bioRxiv preprint causal gene is on the list of neighboring genes and that there is only one. These assumptions may fail, for example, if regulatory eects are unrelated to the causal mechanism or the causal molecular phenotypes have not been assayed or discovered. Furthermore, the assumption that there is only one causal gene is parsimonious and arguably reasonable, but in general we do not have hard evidence to rule out multiple genes contributing to the trait eect. 1.14 Predicted associations replicated in BioVU (cont.) Among replicated loci are SORT1 (liver, coronary artery disease rcp = 0.952; dicovery p = 2.041 × 10 −19 BioVU p = 3.475 × 10 −4 ), which has a well-established associations to lipid metabolism and cardiovascular traits (58). Chromosome 6p24 region, which contains PHACTR1, has been previously associated with a constellation of vascular diseases, including coronary artery disease (59) and migraine headache (60). Notably, PHACTR1 was signicant in three dierent arteries (aorta artery, coronary artery and tibial artery) in two traits (coronary artery disease and migraine) in the replication analysis. In all six tissue-trait pairs, PHACTR1 showed very high posterior probabilities in discovery analyses (rcp = 0.992 to 1.00). In our replication analysis, PHACTR1 remained signicant only for coronary artery disease associations (table S16, aorta artery, discovery p = 2.246×10 −39 , BioVU p = 7.484×10 −8 ; coronary artery, discovery p = 1.952×10 −37 , BioVU p = 2.047 × 10 −7 ; tibial artery, discovery p = 1.559 × 10 −33 , BioVU p = 9.880 × 10 −7 ). 1.15 Validation of likely causal genes (cont.) Of note, two members of the sterolin family, ABCG5 and ABCG8, showed highly signicant predicted causal associations using both PrediXcan and enloc for LDL-C levels and self-reported high cholesterol levels. ABCG8 showed more signicant associa- CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under The copyright holder for this preprint (which was not this version posted October 22, 2019. ; https://doi.org/10.1101/814350 doi: bioRxiv preprint levels was observed, indicating that disrupting these genes contribute greatly to plasma cholesterol levels (62). The overexpression of human ABCG5 and ABCG8 in transgenic Ldlr-/-mice resulted in 30% reduction in hepatic cholesterol levels and 70% reduced atherosclerotic legion in the aortic root and arch (63) after 6-months on a Western diet.
Several other lipid-associated loci were also consistently predicted as causal across OMIM, the rare variant derived set, PrediXcan and enloc. Rare protein-truncating variants in APOB have been previously associated with reduced LDL-C and triglyceride levels and reduced coronary heart disease risk (64). Interestingly, APOB has been predicted as a causal gene in four related traits, coronary artery disease, LDL-C levels, triglyceride levels, and self-reported high cholesterol levels. Among the four traits, PrediXcan showed the highest association to LDL-C levels (-log10(p PrediXcan ) = 130.89; rcp = 0.485) while self-reported high cholesterol showed the strongest evidence using enloc at nearly maximum posterior probability (-log10(p PrediXcan ) = 93.66; rcp = 0.969). Although APOB has been suggested as a better molecular indicator of predicted cardiac events in place of LDL-C levels (65, 66), its translation has been surprisingly slow in clinical practice (67).
Here, we provide an additional support for the crucial role APOB may play in predicting lipid traits.

Causal tissue analysis
We investigated the cross-tissue pattern of PrediXcan results across 49 tissues. For each trait-gene pair, the PrediXcan z-score can be represented as a 49×1 vector with each entry being the gene-level z-score in the corresponding tissue (if the prediction model of the gene is not available in that tissue, we lled in zero). To explore the tissue-specicity of the PrediXcan z-score vector, we proceeded by assigning the z-score vector to a tissue-pattern category and tested whether certain tissue-pattern categories were over-represented among colocalized PrediXcan genes as compared to non-colocalized genes. In particular, we used the FLASH factors identied from matrix factorization applied to the cis-eQTL eect size matrix, as described in Section 1.7 (as PrediXcan and cis-eQTL shared similar tissuesharing pattern, data not shown). To obtain a set of detailed and biologically interpretable tissue-pattern categories from the 31 FLASH factors, we manually merged them into 18 categories as shown in g. S28. For each trait, we projected the z-score vector of each gene to one of the 31 FLASH factors (as described in Section 1.7) so that the gene was assigned to the corresponding tissue-pattern category. We dened a`positive' set of genes as the 83 . CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under The copyright holder for this preprint (which was not this version posted October 22, 2019. ; https://doi.org/10.1101/814350 doi: bioRxiv preprint ones that met Bonferroni signicance at α = 0.05 in at least one tissue and enloc rcp > 0.01 in at least one tissue, which could be thought as a set of candidate genes aecting the trait through expression level. We also constructed a`negative' set of genes with enloc rcp = 0, which could be thought as a set of genes whose expressions were unlikely to aect the trait. We proceeded to test whether certain tissue-pattern categories were enriched in`positive' set as compared to`negative' set. Since the main focus of this analysis was tissue-specic patterns, we excluded Factor1 (the cross-tissue factor) and Factor25 (likely to be a tissue-shared factor capturing tissues with large sample size). Additionally, we excluded Factor7 (testis), as it was unlikely to be the mediating tissue but might introduce false positives. We tested the enrichment of each tissue-pattern category by Fisher's exact test (`positive'/`negative' sets and in/not in tissue-patter category). Among 87 traits, 82 traits had enloc signal and the enrichment of these was calculated accordingly.
Supplementary Fig. S28. Factor analysis using ashr to identify causal tissues.
Tissue-pattern categories generated from from FLASH applied to the cis-eQTLs are shown. These tissue categories (on y-axis) were the same ones used in the analysis of causal tissue identication. Tissues are ordered by sample size. 84 . CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under The copyright holder for this preprint (which was not this version posted October 22, 2019. ; https://doi.org/10.1101/814350 doi: bioRxiv preprint Table S11: PrediXcan and enloc results for predicted causal genes selected based on OMIM. Columns are: lead_var: the most signicant variant within the LD block, trait: trait name, gene: Ensembl ID for the gene, is_omim: Is included in the OMIM database. TRUE if included, FALSE if not, proximity: 0 if variant is in the gene, otherwise BPS from the gene boundary, rank_proximity: ranking by proximity within LD block (rank starts from 0 and the closer the lower rank), percentage_proximity: rank_proximity / number of genes in the locus, predixcan_mashr_eur_score: -log10 p-value (most signicant across tissues is used) of PrediXcan-MASH trained on European data, enloc_score: rcp (max across tissues), predix-can_mashr_eur_rank: PrediXcan signicance ranking within LD block (rank starts from 0 and the higher signicance the lower rank), enloc_rank: enloc rcp ranking within LD block (rank starts from 0 and the higher rcp the lower rank), predixcan_mashr_eur_percentage: predixcan_mashr_eur_rank / number of genes in the locus, enloc_percentage: enloc_rank / number of genes in the locus, gene_name: Ocial gene symbol, gene_type: Gencode annotsted gene type, chromosome: Chromosome for the gene, start: Gencode annotated gene start position. All isoforms are combined, end: Gencode annotated gene end position. All isoforms are combined, strand: Gencode annotated gene strand. Table S12: PrediXcan and enloc results for presumed causal genes in the rare variant based silver standard. Columns are: lead_var: the most signicant variant within the LD block, trait: trait name, gene: Ensembl ID for the gene, is_ewas: Is included in the EWAS . TRUE if included, FALSE if not, proximity: 0 if variant is in the gene, otherwise BPS from the gene boundary, rank_proximity: ranking by proximity within LD block (rank starts from 0 and the closer the lower rank), percentage_proximity: rank_proximity / number of genes in the locus, predixcan_mashr_score: -log10 p-value (most signicant across tissues is used) of PrediXcan-MASH trained on European data, enloc_score: rcp (max across tissues), predixcan_mashr_rank: PrediXcan signicance ranking within LD block (rank starts from 0 and the higher signicance the lower rank), enloc_rank: enloc rcp ranking within LD block (rank starts from 0 and the higher rcp the lower rank), predixcan_mashr_percentage: predixcan_mashr_eur_rank / number of genes in the locus, enloc_percentage: enloc_rank / number of genes in the locus, gene_name: Ocial gene symbol, gene_type: Gencode annotsted gene type, chromosome: Chromosome for the gene, start: Gencode annotated gene start position. All isoforms are combined, end: Gencode annotated gene end position. All isoforms are combined, strand: Gencode annotated gene strand.   86 . CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under The copyright holder for this preprint (which was not this version posted October 22, 2019. ; https://doi.org/10.1101/814350 doi: bioRxiv preprint  List of Figures   1 FIG -GWAS- . CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under The copyright holder for this preprint (which was not this version posted October 22, 2019. ; https://doi.org/10.1101/814350 doi: bioRxiv preprint