Department of Epidemiology and Health Statistics, School of Public Health, Shandong University, Jinan 250012, China

CAS-MPG Partner Institute for Computational Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai 200031, China

Key Laboratory of Computational Biology, CAS-MPG Partner Institute for Computational Biology, Chinese Academy of Sciences, Shanghai 200031, China

MRC Epidemiology Unit, Institute of Metabolic Science, Addenbrooke's Hospital, Cambridge, UK

Abstract

Background

In genetic association studies, especially GWAS, gene- or region-based methods have become increasingly popular for detecting associations between multiple SNPs and diseases (or traits). Kernel principal component analysis combined with logistic regression test (KPCA-LRT) has been successfully used in classifying gene expression data. Nevertheless, the purpose of an association study is to detect the correlation between genetic variations and disease rather than to classify the sample, and genomic data are categorical rather than numerical. Although a kernel-based logistic regression model has recently been proposed for association studies, which projects the nonlinear original SNP data into a linear feature space, it is still affected by multicollinearity between the projections, which may lead to loss of power. We therefore propose a KPCA-LRT model to avoid the multicollinearity.

Results

Simulation results showed that KPCA-LRT was always more powerful than principal component analysis combined with logistic regression test (PCA-LRT) at different sample sizes, significance levels and relative risks, especially at the gene-wide level (1E-5) and lower relative risks (RR = 1.2, 1.3). Application to four gene regions of the rheumatoid arthritis (RA) data from Genetic Analysis Workshop 16 (GAW16) indicated that KPCA-LRT performed better than the single-locus test and PCA-LRT.

Conclusions

KPCA-LRT is a valid and powerful gene- or region-based method for the analysis of GWAS data sets, especially under lower relative risks and lower significance levels.

Background

It is commonly believed that genetic factors play an important role in the etiology of common diseases and traits. With rapid improvements in high-throughput genotyping techniques and the growing number of available markers, genome-wide association studies (GWAS) have become a promising approach for identifying common genetic variants. The first successful wave of GWAS has reproducibly identified hundreds of associations of common genetic variants with more than 100 diseases and traits, including age-related macular degenerative diseases

However, given the SNPs allocated into genes or regions, the issue of how to evaluate genetic association for each candidate gene or genome region remains. To examine whether multiple SNPs in the candidate gene or region are associated with disease or trait, several multi-marker analysis methods have been developed, including haplotype-based methods ^{2 }test

However, one cannot assert that linear PCA will always detect all structure in a given genomic data set. If the genomic data contains nonlinear structure, PCA will not be able to detect it

KPCA has been studied intensively in the last several years in the fields of machine learning, face recognition and data classification, and has been reported to be successful in many applications

Methods

PCA

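As a traditional multivariable statistical technique, PCA has been widely applied in genetic analysis, both for reduction of redundant information and interpretation of multiple SNPs. The basic idea of PCA is to efficiently represent the data by decomposing a data space into a linear combination of a small collection of bases consisting of orthogonal axes that maximally decorrelate the data. Assuming that $\{x_i \in \mathbb{R}^M \mid i = 1, 2, \ldots, N\}$ are the centered input vectors, i.e. $\sum_{i=1}^{N} x_i = 0$, PCA diagonalizes the covariance matrix

$$C = \frac{1}{N} \sum_{j=1}^{N} x_j x_j^{T}. \tag{1}$$

To do this, one has to solve the following eigenvalue problem:

$$\lambda v = C v, \tag{2}$$

where $\lambda \ge 0$ are the eigenvalues and $v \in \mathbb{R}^M \setminus \{0\}$ are the corresponding eigenvectors. Since $Cv = \frac{1}{N}\sum_{j=1}^{N} (x_j \cdot v)\, x_j$, all solutions $v$ with $\lambda \neq 0$ lie in the span of $\{x_i \in \mathbb{R}^M \mid i = 1, 2, \ldots, N\}$, so that (2) is equivalent to

$$\lambda\,(x_i \cdot v) = (x_i \cdot Cv) \quad \text{for all } i = 1, \ldots, N,$$

where the dot product of two vectors $a = (a_1, a_2, \ldots, a_N)$ and $b = (b_1, b_2, \ldots, b_N)$ is defined as $a \cdot b = \sum_{i=1}^{N} a_i b_i$.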
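The PCA step described above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming the genotypes are stored as an individuals-by-SNPs array with additive 0/1/2 coding; the function name `pca_scores` and the random toy data are hypothetical and not part of the original analysis.

```python
import numpy as np

def pca_scores(X):
    """Eigenvalues (descending) and principal component scores of X.

    X is an (N, M) array: N individuals, M SNPs (assumed coded 0/1/2).
    """
    Xc = X - X.mean(axis=0)                  # center each SNP column
    C = Xc.T @ Xc / Xc.shape[0]              # covariance matrix, (1/N) * sum of x_j x_j^T
    eigvals, eigvecs = np.linalg.eigh(C)     # eigh returns eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]        # reorder to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = Xc @ eigvecs                    # dot products x_i . v for every eigenvector v
    return eigvals, scores

# Toy data only: 200 individuals, 5 SNPs with random genotypes.
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(200, 5)).astype(float)
eigvals, scores = pca_scores(X)
```

The resulting score columns are mutually uncorrelated, and the variance of the k-th column equals the k-th eigenvalue, which is the decorrelation property the text relies on.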

KPCA

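Given the observations, we first map the data nonlinearly into a feature space $F$ by

$$\Phi: \mathbb{R}^M \to F, \quad x \mapsto \Phi(x).$$

Again, we make the assumption that our data mapped into feature space, $\Phi(x_1), \ldots, \Phi(x_N)$, is centered, i.e. $\sum_{i=1}^{N} \Phi(x_i) = 0$. For the covariance matrix in $F$,

$$\bar{C} = \frac{1}{N} \sum_{j=1}^{N} \Phi(x_j)\,\Phi(x_j)^{T}, \tag{3}$$

we have to find eigenvalues $\lambda \ge 0$ and eigenvectors $\nu \in F \setminus \{0\}$ satisfying $\lambda \nu = \bar{C}\nu$.

By the same argument as above, the solutions $\nu$ lie in the span of $\Phi(x_1), \ldots, \Phi(x_N)$. This implies that we may consider the equivalent equation

$$\lambda\,(\Phi(x_k) \cdot \nu) = (\Phi(x_k) \cdot \bar{C}\nu) \quad \text{for all } k = 1, \ldots, N, \tag{4}$$

and that there exist coefficients $\alpha_i$ ($i = 1, \ldots, N$) such that

$$\nu = \sum_{i=1}^{N} \alpha_i\, \Phi(x_i). \tag{5}$$

Substituting (3) and (5) into (4), we arrive at

$$N \lambda K \alpha = K^{2} \alpha, \tag{6}$$

where $\alpha$ denotes the column vector with entries $\alpha_1, \ldots, \alpha_N$, and $K$ is the $N \times N$ matrix defined by

$$K_{ij} := (\Phi(x_i) \cdot \Phi(x_j)). \tag{7}$$

It has a set of eigenvectors which spans the whole space, thus

$$N \lambda \alpha = K \alpha \tag{8}$$

gives all solutions $\alpha$ of equation (6).

Assume $\lambda_1 \le \lambda_2 \le \ldots \le \lambda_N$ represent the eigenvalues for the matrix $K$, with $\alpha^1, \alpha^2, \ldots, \alpha^N$ being the corresponding complete set of eigenvectors, and $\lambda_p$ is the first nonzero eigenvalue. We do the normalization for the solutions $\alpha^p, \ldots, \alpha^N$ by requiring that the corresponding vectors in $F$ be normalized, i.e. $\nu^k \cdot \nu^k = 1$ for all $k = p, \ldots, N$. By (5), (7) and (8), this translates into

$$1 = \sum_{i,j=1}^{N} \alpha_i^k \alpha_j^k K_{ij} = \alpha^k \cdot K\alpha^k = \lambda_k\,(\alpha^k \cdot \alpha^k). \tag{9}$$

We need to compute projections on the eigenvectors $\nu^k$ in $F$ ($k = p, \ldots, N$). Let $x$ be a test point with image $\Phi(x)$ in $F$; then

$$(\nu^k \cdot \Phi(x)) = \sum_{i=1}^{N} \alpha_i^k\,(\Phi(x_i) \cdot \Phi(x)) \tag{10}$$

are its nonlinear principal components corresponding to Φ.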
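The derivation above works entirely with the matrix of dot products in feature space, so the kernel principal component scores can be sketched from that matrix alone. The code below is an illustration under stated assumptions: the name `kernel_pca_scores` is hypothetical, and the explicit centering of the Gram matrix is a standard step that replaces the text's assumption that the mapped data are already centered.

```python
import numpy as np

def kernel_pca_scores(K):
    """Kernel principal component scores of the training points.

    K is the (N, N) Gram matrix of dot products between mapped observations.
    """
    N = K.shape[0]
    one = np.full((N, N), 1.0 / N)
    # Center the mapped data in feature space without computing the map explicitly.
    Kc = K - one @ K - K @ one + one @ K @ one
    lam, alpha = np.linalg.eigh(Kc)           # ascending eigenvalues, unit-norm eigenvectors
    lam, alpha = lam[::-1], alpha[:, ::-1]    # reorder to descending
    keep = lam > 1e-10 * lam[0]               # drop numerically zero eigenvalues
    lam, alpha = lam[keep], alpha[:, keep]
    alpha = alpha / np.sqrt(lam)              # rescale so that lam_k * (alpha_k . alpha_k) = 1
    scores = Kc @ alpha                       # projection of each mapped point on each eigenvector
    return lam, scores

# With a linear kernel the scores reproduce ordinary PCA scores up to sign.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
lam, scores = kernel_pca_scores(X @ X.T)
```

Using the linear kernel as a check is a design choice for the example only: it makes the result verifiable against ordinary PCA, whereas a nonlinear kernel would be substituted in an actual KPCA analysis.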

Note that neither (7) nor (10) requires $\Phi(x_i)$ in explicit form; they are only needed in dot products. We are therefore able to use kernel functions for computing these dot products without actually performing the map Φ: for some choices of a kernel $k(x_i, x_j)$, it can be shown by methods of functional analysis that there exists a map Φ into some dot product space $F$ such that $k(x_i, x_j)$ computes the dot product in $F$.
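A small numerical check may make this kernel argument concrete. The example assumes a homogeneous polynomial kernel of degree 2 on two-dimensional inputs, chosen only because its feature map can be written down explicitly; it is not the kernel used in this study, and the names `phi` and `poly2_kernel` are hypothetical.

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for a 2-dimensional input."""
    x1, x2 = x
    return np.array([x1 * x1, np.sqrt(2.0) * x1 * x2, x2 * x2])

def poly2_kernel(x, y):
    """Homogeneous polynomial kernel of degree 2: (x . y)^2."""
    return float(np.dot(x, y)) ** 2

x = np.array([1.0, 2.0])
y = np.array([0.0, 1.0])
explicit = float(np.dot(phi(x), phi(y)))   # dot product computed in feature space
implicit = poly2_kernel(x, y)              # same value computed from the inputs alone
```

Both quantities agree, so the dot product in the three-dimensional feature space is obtained without ever forming the mapped vectors.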

Theoretically, a proper function can be created for each data set based on Mercer's theorem of functional analysis

There are two widely used approaches for selecting the parameters of a given kernel function. The first chooses a series of candidate values for the kernel parameter empirically, runs the learning algorithm with each candidate value, and finally assigns to the parameter the value that gives the best performance. The second is cross-validation. However, both approaches are time-consuming and computationally expensive. [...] $\|x_i - x_j\|$ in the set $\{x_k \in \mathbb{R}^M \mid k = 1, 2, \ldots, N\}$
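The passage above refers to the pairwise distances between observations when discussing the kernel parameter. As a hedged illustration only, the sketch below assumes a radial basis function (RBF) kernel whose width is set to the median pairwise distance, a common data-driven heuristic that avoids a grid search. The kernel form, the median rule, and the name `rbf_gram_median` are assumptions and are not confirmed by the text.

```python
import numpy as np

def rbf_gram_median(X):
    """RBF Gram matrix with the width set by the median pairwise distance.

    Assumption: k(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 * sigma^2)), with sigma
    equal to the median of the distances ||x_i - x_j|| over pairs i < j.
    """
    sq_norms = np.sum(X * X, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    sq_dists = np.maximum(sq_dists, 0.0)          # clip tiny negatives caused by rounding
    iu = np.triu_indices(X.shape[0], k=1)         # each unordered pair exactly once
    sigma = np.median(np.sqrt(sq_dists[iu]))
    return np.exp(-sq_dists / (2.0 * sigma ** 2)), sigma

# Toy vectors with pairwise distances 2, 1 and sqrt(5), so the median is 2.
X = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 1.0]])
K, sigma = rbf_gram_median(X)
```

With genotype data many individuals can share identical vectors; if more than half of all pairs were identical the median would be zero, so a real analysis would need a guard for that case.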

Models

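To test the associations between multiple SNPs and disease, the PCA-LRT and KPCA-LRT models are defined as follows:

$$\operatorname{logit} P(D = 1) = \beta_0 + \sum_{l=1}^{L} \beta_l\, PC_l, \qquad \operatorname{logit} P(D = 1) = \beta_0 + \sum_{l=1}^{L} \beta_l\, KPC_l,$$

where $PC_l$ and $KPC_l$ are the $l$^{th} linear and nonlinear (kernel) principal component scores of the SNPs, respectively. The value of $L$ is chosen such that $(\lambda_1 + \lambda_2 + \cdots + \lambda_L)/(\lambda_1 + \lambda_2 + \cdots + \lambda_M)$ exceeds some threshold. For comparison, we set the same threshold of 80% in both PCA-LRT and KPCA-LRT as Gauderman et al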
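The two ingredients of these models, choosing the number of components by the 80% threshold and testing them jointly with a likelihood ratio test, can be sketched as follows. This is an illustration under assumptions: the logistic model is fitted by plain Newton iterations, the test compares the fitted model against an intercept-only model with a chi-square reference distribution on L degrees of freedom, and all function names and the simulated scores are hypothetical.

```python
import math
import numpy as np

def n_components(eigvals, threshold=0.80):
    """Smallest L whose leading eigenvalues (descending) explain more than `threshold`."""
    frac = np.cumsum(eigvals) / np.sum(eigvals)
    return int(np.searchsorted(frac, threshold, side="right") + 1)

def logistic_loglik(Z, y, n_iter=50):
    """Maximized log-likelihood of a logistic regression of y on Z (intercept added)."""
    A = np.column_stack([np.ones(len(y)), Z])
    beta = np.zeros(A.shape[1])
    for _ in range(n_iter):                       # Newton-Raphson iterations
        p = 1.0 / (1.0 + np.exp(-A @ beta))
        step = np.linalg.solve(A.T @ (A * (p * (1.0 - p))[:, None]), A.T @ (y - p))
        beta += step
        if np.max(np.abs(step)) < 1e-10:
            break
    p = 1.0 / (1.0 + np.exp(-A @ beta))
    return float(np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

def lrt_statistic(scores, y):
    """Likelihood ratio statistic: component scores versus intercept only."""
    null = logistic_loglik(np.empty((len(y), 0)), y)
    return 2.0 * (logistic_loglik(scores, y) - null)

def chi2_sf(x, df):
    """Upper tail probability of a chi-square variable with integer df (closed form)."""
    z = x / 2.0
    if df % 2 == 0:
        return math.exp(-z) * sum(z ** j / math.factorial(j) for j in range(df // 2))
    tail = sum(z ** (j - 0.5) / math.gamma(j + 0.5) for j in range(1, df // 2 + 1))
    return math.erfc(math.sqrt(z)) + math.exp(-z) * tail

rng = np.random.default_rng(2)
scores = rng.normal(size=(400, 3))                   # stand-in for PC or kernel PC scores
y = (rng.random(400) < 1.0 / (1.0 + np.exp(-0.8 * scores[:, 0]))).astype(float)
L = n_components(np.array([4.0, 3.0, 2.0, 1.0]))     # cumulative fractions 0.4, 0.7, 0.9, 1.0
stat = lrt_statistic(scores[:, :L], y)
p_value = chi2_sf(stat, L)
```

The same two functions apply unchanged to PCA-LRT and KPCA-LRT; only the score matrix passed in differs.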

Data simulation

To assess the performance of KPCA-LRT and compare it with PCA-LRT, we apply a statistical simulation based on HapMap data under the null hypothesis ($H_0$) and the alternative hypothesis ($H_1$). The corresponding steps for the simulation are as follows:

$R^2$ structure and minor allele frequencies (MAF).

**Pairwise R^{2} among the 11 SNPs in the selected region**. The 11 SNPs are: rs7555634, rs2476600, rs1217395, rs2797415, rs1970559, rs1746853, rs2185827, rs1217406, rs1217407, rs3765598, rs1217408. The triangles mark the three haplotype blocks within this region. The value in each diamond is the R^{2} value and the shading indicates the level of LD between a given pair of SNPs. The values to the right of the 11 dbSNP IDs (rs# IDs) are the corresponding minor allele frequencies.

Under $H_0$, we set the relative risk per allele as 1.0 to assess the type I error. Under $H_1$, different levels of relative risk are set (1.1, 1.2, 1.3, 1.4 and 1.5 per allele) to assess the power. The SNPs in this region are coded according to the additive genetic model.

Under $H_0$, we repeat 10 000 simulations at two significance levels (0.05 and 0.01). Under $H_1$, for each model with a given relative risk, we repeat 10 000 simulations at four significance levels (0.05, 0.01, 1E-5 and 1E-7).
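One way to picture a single replicate of such a simulation is sketched below. It is a simplified stand-in rather than the procedure actually used: it assumes a pool of 0/1 haplotypes (random toy data here in place of HapMap haplotypes), independent pairing of haplotypes, a hypothetical baseline risk of 0.05, and a multiplicative per-allele relative risk at one causal SNP. Setting the relative risk to 1.0 gives a replicate under the null hypothesis.

```python
import numpy as np

def simulate_case_control(haplotypes, causal, rr, n_cases, n_controls,
                          baseline=0.05, rng=None):
    """Draw cases and controls from a haplotype pool under a per-allele relative risk.

    haplotypes: (H, M) array of 0/1 alleles, one row per haplotype.
    causal: column index of the causal SNP; rr: relative risk per risk allele.
    baseline: disease risk for zero risk alleles (hypothetical value).
    """
    rng = np.random.default_rng() if rng is None else rng
    cases, controls = [], []
    while len(cases) < n_cases or len(controls) < n_controls:
        i, j = rng.integers(0, haplotypes.shape[0], size=2)
        genotype = haplotypes[i] + haplotypes[j]              # additive coding 0/1/2
        risk = min(1.0, baseline * rr ** genotype[causal])    # multiplicative per-allele risk
        if rng.random() < risk:
            if len(cases) < n_cases:
                cases.append(genotype)
        elif len(controls) < n_controls:
            controls.append(genotype)
    X = np.vstack(cases + controls)
    y = np.concatenate([np.ones(n_cases), np.zeros(n_controls)])
    return X, y

rng = np.random.default_rng(3)
pool = (rng.random((120, 11)) < 0.3).astype(int)   # toy pool: 120 haplotypes, 11 SNPs
X, y = simulate_case_control(pool, causal=5, rr=1.3,
                             n_cases=100, n_controls=100, rng=rng)
```

Repeating such a replicate many times and recording how often the test rejects at a given significance level yields the empirical type I error (relative risk 1.0) or power (relative risk above 1.0).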

Application

The proposed method is applied to rheumatoid arthritis (RA) data from GAW16 Problem 1. The data consists of 2062 Illumina 550 k SNP chips from 868 RA patients and 1194 normal controls collected by the North American Rheumatoid Arthritis Consortium (NARAC)

To illustrate the performance of PCA-LRT and KPCA-LRT, we focus on four regions on chromosome 1 that involve the genes PTPN22, ANKRD35, DUSP23 and RNF186, respectively. The reasons are as follows: 1) Both the PTPN22 gene (R620W, rs2476601) and the ANKRD35 gene have been reported to be associated with RA

Results

Data simulation

Type I error

Simulation results under $H_0$ are shown in Table

Type I error of PCA-LRT and KPCA-LRT

| **Sample size** | **PCA-LRT, α = 0.05** | **PCA-LRT, α = 0.01** | **KPCA-LRT, α = 0.05** | **KPCA-LRT, α = 0.01** |
|---|---|---|---|---|
| 1000 | 0.052 | 0.011 | 0.049 | 0.012 |
| 2000 | 0.051 | 0.010 | 0.054 | 0.011 |
| 3000 | 0.056 | 0.011 | 0.052 | 0.012 |
| 4000 | 0.048 | 0.014 | 0.051 | 0.011 |
| 5000 | 0.053 | 0.012 | 0.050 | 0.010 |
| 6000 | 0.048 | 0.011 | 0.050 | 0.009 |
| 7000 | 0.051 | 0.009 | 0.052 | 0.011 |
| 8000 | 0.051 | 0.012 | 0.050 | 0.012 |
| 9000 | 0.051 | 0.008 | 0.051 | 0.012 |
| 10000 | 0.051 | 0.011 | 0.052 | 0.012 |
| 11000 | 0.050 | 0.011 | 0.051 | 0.011 |
| 12000 | 0.051 | 0.009 | 0.051 | 0.009 |
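When reading the empirical type I error rates, it helps to know how much they can vary by chance alone. The short calculation below uses the normal approximation to the binomial distribution with the 10 000 replicates stated in the simulation design; the function name `monte_carlo_interval` is hypothetical and the calculation is an added illustration, not part of the original analysis.

```python
import math

def monte_carlo_interval(alpha, n_sim, z=1.96):
    """Approximate 95% range of an empirical rejection rate when the true rate is alpha."""
    se = math.sqrt(alpha * (1.0 - alpha) / n_sim)   # binomial standard error
    return alpha - z * se, alpha + z * se

low05, high05 = monte_carlo_interval(0.05, 10_000)
low01, high01 = monte_carlo_interval(0.01, 10_000)
print(f"alpha = 0.05: [{low05:.4f}, {high05:.4f}]")
print(f"alpha = 0.01: [{low01:.4f}, {high01:.4f}]")
```

Under these assumptions the expected ranges are roughly 0.046 to 0.054 at the 0.05 level and 0.008 to 0.012 at the 0.01 level. Most reported rates fall within or at the edge of these ranges, while a few (for example 0.056 and 0.014 for PCA-LRT) lie slightly above them.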

Power

When defining the 6th SNP (rs1746853) as the causal variant, Figure

**The powers of PCA-LRT and KPCA-LRT under different significance levels at the given relative risk of 1.3 and sample size of 3000**. The horizontal axis denotes the significance levels and the vertical axis denotes the powers of PCA-LRT and KPCA-LRT.

With the same causal variant as above, Figure

**The powers of PCA-LRT and KPCA-LRT under different sample sizes at the given relative risk of 1.3**. The horizontal axis denotes the sample sizes and the vertical axis denotes the powers of PCA-LRT and KPCA-LRT.

**The powers of PCA-LRT and KPCA-LRT under different relative risks at the given sample size of 3000**. The horizontal axis denotes the relative risks and the vertical axis denotes the powers of PCA-LRT and KPCA-LRT.

**The powers of PCA-LRT and KPCA-LRT at the given sample size of 3000 and relative risk of 1.3 when each of the 11 SNPs was set as the causal variant**. The horizontal axis denotes the positions of the causal variant and the vertical axis denotes the powers of PCA-LRT and KPCA-LRT.

These simulation results indicate that the powers of KPCA-LRT are always higher than those of PCA-LRT at the given significance levels, sample sizes and relative risks. In particular, under lower relative risks (1.2 and 1.3) and smaller significance levels (1E-5 and 1E-7), KPCA-LRT is more powerful than PCA-LRT.

Application

Table

The performances of single-locus test, PCA-LRT and KPCA-LRT

| **Region** | **# of SNPs** | **Physical location** | **Gene involved** | **Results: Single\*\*** | **Results: PCA** | **Results: KPCA** |
|---|---|---|---|---|---|---|
| Region 1 | 12 | 114030646-114132504 | PTPN22 | 2.30E-8\* | 4.63E-9\* | 3.14E-9\* |
| Region 2 | 8 | 143025126-143050638 | ANKRD35 | 1.94E-6\* | 0.837 | 4.25E-6\* |
| Region 3 | 13 | 156523590-156572131 | DUSP23 | 2.47E-4 | 6.01E-3 | 7.82E-6\* |
| Region 4 | 15 | 19880889-19929909 | RNF186 | 2.05E-4 | 5.33E-6\* | 2.54E-6\* |

\*significant at the level of 1E-5

\*\*the most significant p value in the corresponding region

Discussion

In genetic association studies, especially GWAS, several groups have proposed PCA-based methods to avoid collinearity among SNPs and to reduce the false positive rate caused by multiple testing, and have found that these methods are typically as powerful as or more powerful than both single-locus tests and haplotype-based tests

To compare the three methods (single-locus test, PCA-LRT and KPCA-LRT), the four regions from the RA data in GAW16 Problem 1 (Table

The four genes involved in the regions for real data analysis were selected based on prior research and Gene Ontology

The proposed method has several limitations. First, only one causal SNP is considered in the present work. Second, how to choose the kernel function and appropriate parameters for each data set is still an open theoretical problem. Third, when the effect size is smaller (relative risk per allele = 1.1, see Figure

Conclusions

In the present study, we have proposed a KPCA-LRT model for testing associations between a candidate gene or genome region and diseases (or traits). Results from both simulation studies and application to real data show that KPCA-LRT with appropriate parameters is always as powerful as or more powerful than PCA-LRT, especially under lower relative risks and lower significance levels.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

QSG, YGH, ZSY, JHZ, BBZ and FZX conceptualized the study, acquired and analyzed the data and prepared the manuscript. All authors approved the final manuscript.

Acknowledgements

This work was supported by a grant from the National Natural Science Foundation of China (30871392). We thank NARAC for providing us with the data.