Disentangling the effects of traits with shared clustered genetic predictors using multivariable Mendelian randomization

Abstract When genetic variants in a gene cluster are associated with a disease outcome, the causal pathway from the variants to the outcome can be difficult to disentangle. For example, the chemokine receptor gene cluster contains genetic variants associated with various cytokines. Associations between variants in this cluster and stroke risk may be driven by any of these cytokines. Multivariable Mendelian randomization is an extension of standard univariable Mendelian randomization to estimate the direct effects of related exposures with shared genetic predictors. However, when genetic variants are clustered, due to being located in a single genetic region, a Goldilocks dilemma arises: including too many highly‐correlated variants in the analysis can lead to ill‐conditioning, but pruning variants too aggressively can lead to imprecise estimates or even lack of identification. We propose multivariable methods that use principal component analysis to reduce many correlated genetic variants into a smaller number of orthogonal components that are used as instrumental variables. We show in simulations that these methods result in more precise estimates that are less sensitive to numerical instability due to both strong correlations and small changes in the input data. We apply the methods to demonstrate the most likely causal risk factor for stroke at the chemokine gene cluster is monocyte chemoattractant protein‐1.


| INTRODUCTION
Genetic variants associated with molecular and phenotypic traits can provide evidence on the causal pathways linking the associated trait with a disease outcome . Various analytical approaches, including Mendelian randomization and colocalization, have been proposed that synthesize data on genetic associations to assess the nature of the relationship between a trait and a disease. In Mendelian randomization, it is assumed that the only pathway by which selected genetic variants influence the outcome is via the associated trait (Lawlor et al., 2008). Formally, genetic variants are assumed to satisfy the assumptions of an instrumental variable (Didelez & Sheehan, 2007). If multiple independent variants associated with the same trait show a consistent pattern of associations with the outcome, then it is plausible that the trait has a causal effect on the outcome (Bowden et al., 2016;Lawlor et al., 2016).
We here consider an extension to standard Mendelian randomization known as multivariable Mendelian randomization, which allows genetic variants to be associated with multiple related traits . For instance, it is difficult to find specific genetic predictors of fatfree mass that are not also associated with fat mass. Multivariable Mendelian randomization can be implemented by fitting a multivariable regression model using genetic associations with each of the traits as predictors (Sanderson et al., 2019). Coefficients from this model represent direct effects; that is, the effect of varying one of the traits in the model while keeping other traits constant (Burgess, Thompson, et al., 2017;Carter et al., 2021). Such investigations have suggested that fat mass rather than fat-free mass influences cardiovascular disease risk (Larsson et al., 2020), and that, amongst a set of lipid traits, apolipoprotein B is the primary driver of coronary heart disease risk (Zuber et al., 2021).
Often, Mendelian randomization investigations are conducted using genetic variants from a single genetic region, an approach known as cis-Mendelian randomization (Schmidt et al., 2020). This approach is particularly common when the risk factor is a gene product, such as gene expression or circulating levels of a protein. Such analyses are somewhat fragile, as the evidence is based on a single genetic region and so it is not possible to assess heterogeneity of findings across multiple genetic regions that represent independent datapoints . However, if the function of the gene is well-understood, these analyses can be particularly insightful into the impact of intervening on a specific biological pathway. In some cases, the function of the gene may correspond to the action of a pharmacological agent, and hence the analysis is informative about a druggable pathway . Examples include the use of variants in the HMGCR gene region to inform about the impact of statin drugs (Ference et al., 2015), and variants in the IL6R gene region about the impact of interleukin-6 receptor inhibitors, such as tocilizumab (IL6R Genetics Consortium & Emerging Risk Factors Collaboration, 2012).
However, some genetic regions contain multiple genes (referred to as a gene cluster), and so are associated with multiple gene products. For example, the FADS locus contains genetic predictors of various fatty acids (Lattka et al., 2010), and the IL1RL1-IL18R1 locus (the interleukin-1 receptor cluster) contains protein quantitative trait loci (pQTLs) for several proteins (Sun et al., 2018). Although variants in the interleukin-1 cluster are associated with several autoimmune diseases (Timms et al., 2004;G. Zhu et al., 2008), it is difficult to determine which of the proteins are causal risk factors (Reijmerink et al., 2010). Although a multivariable cis-Mendelian randomization approach has been proposed to disentangle complex gene regions and identify the causal drivers of disease (Porcu et al., 2019), authors of this approach suggest pruning genetic variants to near independence ( r < 0.1 2 ) to avoid potential problems of collinearity. However, it may not be possible to find sufficient near-independent variants for the multivariable regression model to give precise estimates for each trait. Although it is possible to prune at a less strict threshold, we have previously shown that under-pruning can result in illconditioning (Burgess, Zuber, et al., 2017). This represents a Goldilocks dilemma: too much pruning and we get imprecision or even lack of identification; too little pruning and we can get results that are highly sensitive to small changes in the estimated correlation matrix, and can be nonsensical.
We propose two methods for multivariable cis-Mendelian randomization that perform dimension reduction on the available genetic variants at a locus using principal component analysis (PCA). These methods reduce information on large numbers of highly correlated variants into a smaller number of orthogonal components, allowing efficient multivariable analyses to be implemented that are not so sensitive to high correlations between variants or small changes in the inputs. We demonstrate the superior performance of these methods over pruning methods in a simulation study, and illustrate the methods in a applied analysis investigating effects on stroke risk of three cytokines associated with a gene cluster on chromosome 17.

| Overview of the approach
Multivariable Mendelian randomization takes genetic variants that are each associated with at least one of a set of related exposure traits, and satisfy the instrumental variable assumptions for multivariable Mendelian randomization: i. each variant is associated with one or more of the exposures, ii. each exposure is associated with one or more of the genetic variants, iii. each variant is not associated with the outcome via a confounding pathway, and iv. each variant does not affect the outcome directly, only possibly indirectly via one or more of the exposures.
Although the approach was originally developed for use with individual-level data using the established two-stage least squares method , equivalent estimates can be obtained using summarized genetic association data that are typically reported by genome-wide association studies (GWAS) (Sanderson et al., 2019). We use summarized genetic association data (Bowden et al., 2017), and denote the genetic association of variant j with exposure trait k as β Xjk ; this is the betacoefficient from univariable regression of the trait on the variant. We denote the genetic association of variant j with the outcome as β Yj and its standard error as se β (ˆ) Yj ; (2) where bold face represents vectors, and Σ is a variancecovariance matrix with elements representing the correlation between the j 1 th and j 2 th variants. This method was advocated by Porcu et al. (2019) for the analysis of summarized genetic association data on gene expression traits with shared genetic predictors. Estimates are obtained as where β X is the J by K matrix of genetic associations with the traits, and β Y is the J by 1 vector of genetic associations with the outcome. We note that this method is identical to the method referred to as transcriptome-wide Mendelian randomization (TWMR) by Porcu et al. (2019). This calculation requires inversion of the variancecovariance matrix Σ, which can lead to numerical instability if the matrix of correlations between genetic variants is nearsingular. This occurs when there is a set of genetic variants that is close to being linearly dependent. If a set of genetic variants is linearly dependent (i.e., there is at least one variant that can be perfectly predicted based on the other variants), then the correlation matrix will be exactly singular, and so cannot be inverted. If a set of genetic variants is almost but not exactly linearly dependent, then the correlation matrix can be inverted, but some elements of the matrix inverse will be very large in magnitude. This results in an ill-conditioned problem, meaning that small changes in the inputs can lead to large changes in the estimates. The condition number of a matrix is a measure of ill-conditioning; for a positive-definite symmetric matrix, this can be calculated as the ratio of the largest to the smallest eigenvalue. Although there are no universal thresholds, condition numbers over 100 are cause for concern, particularly if the genetic associations are known to a limited degree of precision.
To implement the proposed PCA method, we first consider the matrix Ψ where: This is a weighted version of the variance-covariance matrix, with weights taken as the sum of the absolute values of the genetic associations with the traits. Obtaining principal components of this matrix ensures that the top principal components assign greater weights for variants having larger associations with the traits and more precise associations with the outcome. If genetic associations with the outcome are all estimated in the same data set, then their standard errors will depend on the minor allele frequency of the variants; inclusion of these standard errors in the weighting matrix prioritizes genetic variants that have more precise associations with the outcome, and hence provide the most information to the Mendelian randomization estimate. Considering the PCA decomposition W W Ψ = Λ T , where W is the matrix of eigenvectors (or loadings) and Λ is the diagonal matrix with the eigenvalues λ λ > … > J 1 on the diagonal, let W k be the matrix constructed of the first k columns of W . Then we define: The multivariable inverse-variance weighted principal component analysis (MV-IVW-PCA) estimate is given by This is an adaptation of the MV-IVW method using transformed genetic instruments that represent linear weighted scores comprised of the original genetic variants, where the weights of the scores are the eigenvectors from the PCA decomposition. As the principal components are orthogonal, the transformed variance-covariance matrix should not be near-singular. The standard errors of these estimates are: In our investigations, we set the number of principal components to explain 99% of the variance in the matrix Ψ.

| Limited information maximum likelihood method (LIML)
An alternative method for instrumental variable analysis is the LIML method (Baum et al., 2007). The estimate from the LIML method minimizes the Anderson-Rubin statistic (Anderson & Rubin, 1949), which is a measure of heterogeneity of the estimates based on the different genetic variants. Both the IVW and LIML methods are part of a larger family of methods, known as the generalized method of moments (GMM) (Hansen, 1982). We here derive a multivariable analogue of the LIML method that can be implemented using summarized genetic association data for correlated variants. We refer to this method as the multivariable limited information maximum likelihood (MV-LIML) method. For the MV-LIML method, we require the additional knowledge of the K K × correlation matrix of exposures, which we denote Φ. In the simulation study, we set this matrix to be the identity. We Under the assumption that all J variants are valid instruments, setting θ g 0 ( ) = provides a set of J estimating equations for the K unknown parameters θ k . When J K > , if the genetic variants are linearly independent, it is generally not possible to find an estimator θ that can set θ g 0 (ˆ) = . as the number of equations is greater than the number of unknown parameters. Thus, LIML-based methods take K linear combinations of the J estimating equations, where the weights in the linear combination are chosen to minimize the variance of the resulting estimator θ.
For exposure k, the MV-LIML estimator of θ k is given by the kth element of θ MV LIML − , and its standard error is Theoretical results suggest LIML provides robust estimation when using many weak instruments (Chao & Swanson, 2005). For univariable cis-Mendelian randomization analyses, simulation evidence has further highlighted the low bias properties of LIML-based estimators in finite samples .
We consider a version of the LIML method using PCA to perform dimension reduction on the set of genetic variants by replacing: and minimizing θ Q( ), with W k defined as previously. We refer to this method as MV-LIML-PCA. As per the MV-IVW-PCA method, we set the number of principal components to explain 99% of the variance in the matrix Ψ.

| Simulation study
We perform a simulation study to assess whether the proposed PCA methods are able to detect which out of a set of traits with shared clustered genetic predictors influences an outcome, and whether the causal effects of the traits can be estimated reliably.
We consider a scenario with three traits and 100 correlated genetic variants. We generate 10 000 simulated datasets according to the following data-generating model for 20 000 individuals indexed by i: Uniform(−0.3, 1) for , = 1, …, 100 = cor( ) ( , ) (0.08, 0.01 ) for = 1, …, 15 The genetic variants G j are simulated from a multivariable normal distribution with mean zero and variancecovariance matrix B. The traits X 1 , X 2 , and X 3 are simulated such that 5 variants influence the first trait, the next 5 influence the second trait, the next 5 influence the third trait, and the remaining 85 do not influence any trait. However, due to the moderately large correlations between the genetic variants (which typically range from around −0.1 to +0.6 with an interquartile range from around +0.2 to +0.4), typically each of the 100 variants is associated with all three traits at a genome-wide level of significance. The outcome Y is influenced by traits X 1 and X 3 , with the true effect of X 1 set at +0.4 and the true effect of X 3 set at −0.6. The associations between the traits and the outcome are affected by confounders U 1 , U 2 , and U 3 .
We estimate genetic associations with the exposures on the first 10,000 individuals, and genetic associations with the outcome on the subsequent 10,000 individuals. Correlations between the genetic variants are estimated on the first 10,000 individuals. This represents a twosample scenario, where genetic associations with the exposures and outcome are obtained on nonoverlapping samples (Pierce & Burgess, 2013). The mean instrument strength based on the 5 causal variants for each trait is R = 3.5% 2 , corresponding to a mean univariable F statistic (on 5 and 19,994 degrees of freedom) for each trait of around 145, and a conditional F statistic (on 15 and 19,984 degrees of freedom) for each trait of around 22 (Sanderson et al., 2019). Although these F statistics are the most relevant measure of instrument strength for the oracle analyses, instrument strength in all other analyses will depend on the number and identity of the genetic variants included in each analysis, which will vary between simulations and depending on the pruning threshold.
We compare four different methods: the MV-IVW and MV-LIML methods with various choices of genetic variants as inputs, and the MV-IVW-PCA and MV-LIML-PCA methods described above. For the MV-IVW and MV-LIML methods, we consider pruning the variants at thresholds of   ρ < 0.4,   ρ < 0.6, and   ρ < 0.8 (equivalent to r < 0.16 2 , r < 0.36 2 , and r < 0.64 2 ). We note that pruning at   ρ < 0.1 would often result in 3 or fewer variants being available for analysis, which would not allow multivariable Mendelian randomization to be attempted, as the number of genetic variants needs to be greater than the number of traits. Pruning is performed by first selecting the variant with the lowest p value for association with any of the traits, and then excluding from consideration all variants more strongly correlated with the selected variant than the threshold value. We then select the variant amongst those remaining with the lowest p value, and exclude variants more correlated with that variant than the threshold value. We repeat until all variants have either been selected or excluded. We also consider an oracle choice of variants, in which only the 15 genetic variants that truly influence the traits are used as instruments.
In addition to the main simulation study, we also consider the performance of methods with other parameter settings: (1) weaker instruments: we generate the α j parameters from a normal distribution with mean 0.05 (corresponding to a mean instrument strength of R = 1.4% 2 , mean univariable F = 57, mean conditional F = 9.4); (2) stronger correlations: we generate elements of the A matrix from a uniform distribution on +0.1 to +1.0, resulting in correlations which typically range from around +0.5 to +0.85 with an interquartile range from around +0.65 to +0.75; (3) stronger causal effects, with θ = +0.8 1 and θ = +1.0 3 , and (4) two alternative approaches for generating a correlation matrix taken from the R package clusterGeneration (Joe, 2006), (4) real linkage disequilibrium data: a correlation matrix estimated in UK Biobank participants of European ancestries on variants from the chemokine receptor gene cluster, pruned at a threshold of r < 0.95 2 to avoid very high correlations; and (5) correlated exposures: the exposures were correlated by additionally adding U i2 to X U , i 1 3 to X 2 , and U i1 to X 3 . Further details of these additional scenarios are provided in the Supporting Information.
In the main simulation study, we also consider a conditional selection of genetic variants, in which we first select the variant having the lowest p value for association with any of the traits. In each subsequent step, we select the variant having the lowest p value for association with any of the traits conditional on all previously selected variants. We repeat until no additional variant is conditionally associated with any trait below a threshold of p < 0.001. Although this approach does not guarantee that variants will not be highly correlated, it is unlikely that two very highly correlated variants would both be conditionally associated with a trait. For convenience, in the main simulation study, we implement this method using individual-level data. If we only had access to summarized data, a mathematically equivalent procedure could be implemented using the Genome-wide Complex Trait Analysis conditional & joint association analysis (GCTA-COJO) method .
Further, we consider two variations to the main simulation study that are reflective of potential problems that may arise in applied practice. First, we estimate the variant correlation matrix based on an independent sample. This reflects the common occurrence that the correlation matrix is obtained from a reference sample rather than the data set under analysis, and assesses robustness of the methods to variability in the correlation matrix. Second, we round the genetic associations and their standard errors to three decimal places. This reflects the common occurrence that genetic associations are obtained from a publicly-available source, and hence are not known to absolute precision. Again, this assesses robustness of the methods to variability in the data inputs.
2.5 | Applied example: Chemokine gene cluster and risk of stroke We illustrate our methods using data on genetic associations with three cytokines and stroke risk. Previous research has implicated monocyte chemoattractant protein-1 (MCP-1), which is also called chemokine (C-C motif) ligand 2 (CCL2), in the pathophysiology of stroke (Georgakis, De Lemos, et al., 2021;Georgakis, Gill, et al., 2019;Georgakis, van der Laan, et al., 2021). However, the CCL2 gene that encodes this cytokine is located in a cluster that also includes genes CCL7, and CCL11. Variants in this genetic region are associated with multiple cytokines other than MCP-1, including MCP-3 (also called CCL7), and eotaxin-1 (also called CCL11). Hence, it is not clear from univariable Mendelian randomization (i.e., analyses with a single exposure trait) which of these proteins is driving stroke risk.
We conduct a multivariable cis-Mendelian randomization analysis to disentangle the effects of these cytokines. We take variants from the CCL2 gene region (GRCh38/hg38;chr17:34,255,257,203) plus 500 kilobasepairs either side, genetic associations with the cytokines from a reanalysis of data on three Finnish cohorts by Ahola-Olli et al. (2017) that did not adjust for body mass index (Kalaoja et al., 2021), and genetic associations with all stroke and cardioembolic stroke from European ancestry individuals in the MEGA-STROKE consortium (Malik et al., 2018). Cardioembolic stroke was chosen as genetic associations were stronger with this stroke subtype than for all stroke in a motivating Mendelian randomization analysis that included variants from throughout the genome (Georgakis, Gill, et al., 2019). Correlations between variants were estimated in 376 703 individuals of European ancestries from UK Biobank.

| Simulation study
Results from the simulation study are shown in Table 1. For each method, we display the mean estimate of each parameter, the standard deviation of estimates, the mean standard error, and the empirical power of the 95% confidence interval, which represents the proportion of confidence intervals that exclude the null. For θ 2 , the empirical power is the Type 1 error rate, and should be close to 5%.
Although the MV-IVW and MV-LIML methods perform well under the oracle setting, power to detect a causal effect is substantially reduced when pruning variants at 0.4. Although increasing the pruning threshold to 0.6 increases power, it also results in Type 1 error inflation. For the MV-IVW method, Type 1 error rates increase to 9.1%, and for the MV-LIML method, to 20.2%. When increasing the pruning threshold to 0.8, estimates are completely unreliable, with mean estimates in the MV-IVW method for θ 1 and θ 3 having the wrong sign.
In contrast, the MV-IVW-PCA and MV-LIML-PCA methods perform well throughout, with Type 1 error rates similar to those of the oracle methods, and greater power to detect a causal effect than the methods that rely on pruning. For the MV-IVW-PCA method, the variability and mean standard errors of estimates are similar to those of the oracle methods, although estimates of θ 1 and θ 3 are attenuated. This is a result of weak instrument bias, which affects analyses similarly to non-differential measurement error. With a single exposure in a twosample setting, estimates are biased towards the null due to imprecision in the estimated genetic associations with the exposures. With multiple exposures, this bias may act in any direction, even in a two-sample setting (J. Zhu et al., 2022).
Similar findings were observed when considering weaker instruments (Supporting Information: Table A2), stronger correlations (Supporting Information: Table A3), stronger causal effects (Supporting Information: Table A2), alternative synthetic correlation matrices (Supporting Information: Tables A4 and A5), a correlation matrix based on real genetic data (Supporting Information: Table A6), and with correlated exposures (Supporting Information: Tables A7-A9). Although the power varied between simulation settings, in each case the PCA methods outperformed the pruning methods in terms of power and precision at a threshold of 0.4, whereas at a threshold of 0.6 the MV-IVW and MV-LIML methods had inflated Type 1 error rates. Type 1 error rates for the PCA methods were generally well controlled, although the Type 1 error for the MV-LIML-PCA method was slightly inflated with weaker instruments (6.7% for MV-IVW-PCA, 13.9% for MV-LIML-PCA), with correlated exposures (moderate correlations: 7.5% for MV-IVW-PCA, 10.3% for MV-LIML-PCA; weaker correlations: 7.6% for MV-IVW-PCA, 10.9% for MV-LIML-PCA; stronger correlations: 7.5% for MV-IVW-PCA, 10.0% for MV-LIML-PCA), and the Type 1 error for the MV-IVW-PCA method was slightly inflated with stronger causal effects (12.8% for MV-IVW-PCA, 8.2% for MV-LIML-PCA). This highlights the importance of specifying these trait correlations in the MV-LIML-PCA method, which were set at zero in the simulation study for simplicity. For the correlated exposure scenarios, we repeated the MV-LIML-PCA method accounting for correlations between the exposures: Type 1 error rates were reduced to 9.1% (moderate correlations), 10.3% (weaker correlations), and 8.2% (stronger correlations).
We also considered a conditional approach for the selection of genetic variants. As this approach requires conditional associations with each exposure for each variant to be recalculated at each step of the variant selection algorithm, this approach is considerably more computationally intensive than the pruning approach, and so we only considered estimates for the first 1000 simulated datasets in the main simulation. Results are provided in Table 2. The conditional selection method performed better than the pruning methods, although it was outperformed by the MV-IVW-PCA method in terms of variability and precision of estimates (the conditional selection method had greater standard deviations and mean standard errors), and by the MV-LIML-PCA method in terms of power to detect a causal effect.
We also considered two additional variations to the main simulation reflective of potential problems in applied practice. Table 3 shows results in which the variant correlation matrix was obtained from an independent sample of 10,000 individuals. Results were similar, except that the Type 1 error rate for the MV-IVW method at a pruning threshold of 0.6 was slightly higher at 11.2%. When obtaining the correlation matrix from an independent sample of 1000 individuals, Type 1 error rates for the MV-IVW method were higher still at 18.7% (Supporting Information: Table A10). Table 4 shows results in which the genetic association estimates were rounded to 3 decimal places. Again, results were similar, except that the Type 1 error rate for the MV-IVW method at a pruning threshold of 0.6 was notably higher at 15.1%. In contrast, results from the PCA methods were not sensitive to changes in the variant correlation matrix or rounding of the genetic association estimates.

| Applied example: Chemokine gene cluster and risk of stroke
Genetic associations with each of the cytokines and all stroke were available for 2922 variants, and with cardioembolic stroke for 2904 variants. We compare results from the MV-IVW-PCA method to those from the MV-IVW method at a pruning threshold of   ρ < 0.1 (equivalent to r < 0.01 2 ),   ρ < 0.4 (equivalent to r < 0.16 2 ), and   ρ < 0.6 (equivalent to r < 0.13 2 ). In the MV-IVW-PCA method, we initially pruned at   ρ < 0.95 to remove very highly correlated variants from the analysis. We also excluded variants not associated with any of the cytokines at p < 0.001 from all analyses. For the pruned set of variants at   ρ < 0.95, the genetic BATOOL ET AL.

| 421
associations with the different cytokines were not highly correlated: correlations were 0.21 between genetic associations with MCP-1 and MCP-3; 0.27 between genetic associations with MCP-1 and eotaxin-1; and 0.01 between genetic associations with MCP-3 and eotaxin-1. Estimates for the three cytokines, which represent log odds ratios per 1 standard deviation increase in the cytokine, are provided in Table 5.
For all stroke, at a pruning threshold of 0.1, the MV-IVW method indicates that MCP-1 is the true causal risk factor. However, a researcher may be tempted to consider a less strict pruning threshold to obtain more precise estimates. But at a pruning threshold of 0.4, the MV-IVW method indicates that eotaxin-1 is the true causal risk factor, and at a pruning threshold of 0.6, the MV-IVW method again indicates that MCP-1 has the strongest    evidence of being the true causal risk factor, but the causal estimate is in the other direction. In contrast, the MV-IVW-PCA method indicates that MCP-1 has the strongest evidence of being the true causal risk factor, similarly to the MV-IVW method at the most conservative pruning threshold. Compared with results at this threshold, estimates from the MV-IVW-PCA method have narrower standard errors, although the estimate for MCP-1 is slightly attenuated and so has a slightly higher p value. At a pruning threshold of 0.4, the condition number for the variance-covariance matrix Σ in the MV-IVW method is 1224, whereas the condition number for the transformed variance-covariance matrix Σ in the MV-IVW-PCA method is 24.9; a larger number signals worse problems due to ill-conditioning. We would therefore trust results from the MV-IVW-PCA method and the MV-IVW method at a threshold of 0.1, which both suggest that the strongest evidence is for MCP-1 as the causal risk factor, and the effect is in the harmful direction. For cardioembolic stroke, estimates are more similar amongst the different implementations of the methods, with stronger evidence for MCP-1 as the causal risk factor at this locus, particularly from the MV-IVW-PCA method. These findings add to the existing body of basic science (Georgakis, van der Laan, et al., 2021), observational (Georgakis, De Lemos, et al., 2021;Georgakis, van der Laan, et al., 2019b), and genetic evidence (Georgakis, Gill, et al., 2019) implicating circulating MCP-1 in stroke risk.

| DISCUSSION
In this manuscript, we have introduced two methods for multivariable cis-Mendelian randomization, the MV-IVW-PCA and MV-LIML-PCA methods. Compared to existing methods that rely on pruning, these methods had superior performance: they outperformed pruning methods in terms of power to detect a causal effect, and they generally maintained close to nominal Type 1 error rates across a range of scenarios. They were also less sensitive that the pruning methods to variation in the variant correlation matrix, and to rounding of the genetic association estimates. We applied the MV-IVW-PCA method to disentangle the effects of three similar exposures with shared genetic predictors at a gene cluster; the method gave results that correspond to existing biological understanding of this pathway. We note that these methods are not designed for performing Mendelian randomization in most contexts, as most such analyses are polygenic, using genetic variants from across the genome. These methods are recommended for consideration when performing cis-Mendelian randomization, that is when genetic variants are taken from a single gene region; typically a gene region with functional relevance to the exposures of interest. The approach of multivariable cis-Mendelian randomization has several potential applications. In our applied example, we considered proteins as exposures. Alternative potential applications could include expression of different genes as exposures, or expression of the same gene in different tissues. However, results from the latter case may be difficult to interpret if the genetic predictors of gene expression do not vary between tissues, or if data on variants affecting gene expression in all relevant tissues are not available. Another possible area of application is if there are different aspects of an exposure trait that could be considered as independent risk factors, such as concentration and isoform size of lipoprotein(a) (Saleheen et al., 2017). To obtain estimates for the different exposures, it is not necessary to have genetic predictors that are uniquely associated with each exposure trait, but it is necessary to have some variants that associate more or less strongly with the different traits (Sanderson et al., 2019).
An alternative approach for disentangling clusters of correlated variants associated with multiple traits is colocalization. Colocalization is a method that attempts to distinguish between two scenarios: a scenario in which two traits (which we here conceptualize as an exposure and an outcome) are influenced by distinct genetic variants, but there is overlap in the genetic associations due to correlation between the variants; and an alternative scenario in which the two traits are influenced by the same genetic variant (Wallace et al., 2012). In the latter scenario, it is likely that the two traits are causally related, although the colocalization method is agnostic as to whether the exposure trait influences the outcome trait, the outcome influences the exposure, or the two are influenced by a common causal factor (Solovieff et al., 2013). There are several conceptual differences between colocalization and cis-Mendelian randomization, although there are also similarities. Two specific advantages of the proposed cis-Mendelian randomization method are that it allows for the existence of multiple causal variants, in contrast to some colocalization methods (Foley et al., 2021;Giambartolomei et al., 2014), and it allows for the existence of multiple causal traits. Another feature is that it provides causal estimates, although the value of causal estimates in Mendelian randomization beyond indicating the direction of effect is disputed (VanderWeele et al., 2014).
Although the PCA methods can be implemented with highly correlated variants, we would recommend a minimal level of pruning (say, r < 0.9 2 ) before applying the methods in practice as variants that are very highly BATOOL ET AL. | 425 correlated do not contribute independent information to the analysis, but could provide computational challenges. Even if two variants do not have a high pairwise correlation as measured by the r 2 statistic, they may be highly correlated. For example the case of "tagging variants," where a less common variant is only present when the common variant is present. This leads to collinearity even though the pairwise r 2 may be low. Another possibility is that three of more variants may perfectly or near-perfectly linear dependent even though their pairwise correlations are only moderate in size. This is particularly likely if the genetic variation in a region can be explained by a small number of haplotypes. Although the concept of ill-conditioning is not technically related to the problem of weak instruments (illconditioning occurs when genetic variants are highly correlated, weak instruments refers to the strength of genetic associations with the exposure), the phenomena are likely to be practically related, as if an investigator has strong instruments for a set of traits, then there is no strong motivation to include many genetic variants from a gene region in a cis-Mendelian randomization analysis. Whereas if the instruments are weak (or, in a multivariable context, conditionally weak [Sanderson & Windmeijer, 2016]), then an investigator is more likely to include multiple variants in an analysis in an attempt to bolster power.
Additionally, we have assumed in this paper that the traits under analysis are causally independent. If a trait has a causal effect on the outcome that is fully mediated by one of the other traits in the analysis, then the estimate for that trait would be zero. Hence, the method identifies the proximal causal risk factors for the outcome (Grant & Burgess, 2021). If there are large numbers of traits, the MV-IVW-PCA method could be combined with a Bayesian variable selection method that compares models with different sets of traits on the assumption of a sparse risk factor set (Zuber et al., 2020).
There are several limitations to these methods, which are shared by other methods for Mendelian randomization using summarized data (Bowden et al., 2017;Burgess et al., 2016). Uncertainty in genetic associations with the exposure traits is not accounted for in the analysis. However, this is typically small compared with uncertainty in the genetic associations with the outcome, as variants selected for inclusion in the analysis are typically associated with at least one of the traits at a robust level of statistical significance. The effects of the exposure traits on the outcome are assumed to be linear. This is usually a reasonable assumption, given the small influence of genetic variants on traits, meaning that estimates reflect average causal effects for a small shift in the overall distribution of a trait (Burgess et al., 2014). In our main simulation, all the exposure and outcome traits are continuous. For binary traits, the method can be implemented using genetic association estimates obtained from logistic regression. We have previously shown that multivariable Mendelian randomization methods are still able to make correct inferences in this setting Grant & Burgess, 2021), although the interpretation of estimates is obscured due to non-collapsibility (Burgess & CHD CRP Genetics Collaboration, 2013). The analysis model assumes that variant correlations are the same in all datasets. In the context of the applied example, this may not hold as genetic associations with the exposures were estimated in a Finnish data set, variant associations with the outcome were estimated in European ancestry individuals from the MEGASTROKE consortium, and variant correlations were estimated in European ancestry individuals from UK Biobank. It is most critical that variant correlations are estimated in a similar population group to genetic associations with the outcome, as causal inferences are based on genetic associations with the outcome. We were unable to assess the relevance of the variant correlations estimated in UK Biobank to the MEGASTROKE data set. Finally, estimates are subject to weak instrument bias. Whereas in univariable Mendelian randomization, weak instrument bias in a two-sample setting is towards the null (Pierce & VanderWeele, 2012), in multivariable Mendelian randomization, weak instruments can bias estimates in any direction (Zuber et al., 2020). This is because weak instrument bias is analogous to measurement error, as the genetically-predicted values of the exposures are estimated with uncertainty, which in a multivariable regression model can lead to arbitrary bias (Phillips & Smith, 1991;Thouless, 1939). Hence it is important to balance the inclusion of several genetic variants in the analysis to achieve identification, with the inclusion of only variants strongly associated with the exposures to minimize weak instruments. The optimal balance will depend on the specifics of the analysis (such as the sample size available), but researchers should consider the conditional strength of instruments (via conditional F statistics) as well as the more conventional univariable F statistics (Sanderson et al., 2019). Performance with weak instruments was mixed; mean estimates from the MV-LIML-PCA method were generally less affected than those from the MV-IVW-PCA method, although both methods had slightly elevated Type 1 error rates in one of the scenarios considered. We therefore recommend that both methods are applied when the instruments are weak, and caution is expressed if the methods give divergent results. Overall, we slightly prefer the MV-IVW-PCA method, as the Type 1 error rates from this method were generally slightly lower.
In summary, multivariable cis-Mendelian randomization can be used to disentangle the causal relationships of traits, such as proteins or gene expression measurements, that are influenced by a cluster of correlated genetic variants. The proposed PCA methods provide a compromise between loss of precision resulting from overpruning and numerical instability resulting from underpruning, to allow valid statistical tests that identify the causal traits influencing the outcome.