Multi-bin multi-variant tests for gene-based linear regression analysis of genetic association

Size: px

Start display at page:

Download "Multi-bin multi-variant tests for gene-based linear regression analysis of genetic association"

Beryl Mills
6 years ago
Views:

1 Multi-bin multi-variant tests for gene-based linear regression analysis of genetic association Yun Joo Yoo, Lei Sun, Julia Poirier, and Shelley B. Bull Y. J. Yoo Department of Mathematics Education and Interdisciplinary Program in Bioinformatics Seoul National University Seoul, South Korea L. Sun Department of Statistical Sciences and Division of Biostatistics, Dalla Lana School of Public Health University of Toronto Toronto, Canada J. Poirier Lunenfeld-Tanenbaum Research Institute of Mount Sinai Hospital Toronto, Canada S. B. Bull Department of Statistical Sciences and Division of Biostatistics, Dalla Lana School of Public Health University of Toronto and Lunenfeld-Tanenbaum Research Institute of Mount Sinai Hospital 60 Murray Street, Box No. 8, Toronto, ON, M5T 3L9, Canada 30 December 04 Abstract: By ointly analyzing multiple variants within a gene together, instead of one at a time, gene-based regression analysis can improve power and robustness of genetic association analysis. Extending prior work that examined multi-bin linear combination (MLC) statistics for combined analysis of rare and common variants, here we investigate analysis of common variants more extensively under realistic trait models and conditions. This method exploits the linkage disequilibrium structure in a gene to construct bins of closely correlated variants, and for variants within each bin, corrects the coding scheme to make the maority of pairwise correlations positive. After bin construction and recoding, variant effects within the same bin are combined linearly, and the bin-specific effects are aggregated in a quadratic sum. This produces a test statistic with reduced degrees of freedom (df) as

2 many as the number of bins. By simulation studies based on HapMap Asians and analytic power investigation, we compare type I error and power of the MLC test with the multi-df generalized Wald test as well as with linear combination, MinP, and principal-component-based tests assuming various trait models with multiple causal variants and varying linkage disequilibrium strength. We confirm that MLC tests have good power and robustness whether causal SNPs are excluded or included in the analysis. In particular, we report increased relative performance of MLC tests for stronger correlations between variants and a higher number of causal variants. Key words: multi-bin linear combination test, multi-marker test, genetic association, linkage disequilibrium, functional variants, common variants, GWAS Introduction In genome-wide association studies (GWAS) and large-scale candidate gene studies, researchers typically scan a large number of single nucleotide polymorphism (SNP) markers, one by one, to detect disease-snp association signals. This single-snp-analysis strategy has been preferred as a simple and effective approach assuming the design and scale of the studies are sufficient to capture the marginal direct or indirect association of a SNP with complex disease traits [Risch and Merikangas, 996; Kraft and Cox, 008]. As an alternative to single SNP test statistics, combined analysis of multiple SNPs within a gene (or a specified region) is a natural and interpretable analytic strategy at the gene level, and various methods have been advocated. A gene-based approach offers other analytical merits reduction of multiple testing burden, robustness to population differences regarding LD and allele frequency, and improved ability to replicate associations [Neale and Sham, 004; Liu et al., 00]. One class of methods for gene-based analysis derives a test of global association from multiple single-snp results, that is, from multiple marginal analyses (see Pan 009 for example who proposed a weighted squared sum test of per-snp marginal effects and showed they are powerful for a certain type of alternative). GWAS investigators performing single-snp-based analysis effectively apply an intrinsic gene-based approach by using a maximum statistic, which selects one SNP with the strongest association within a gene region [WTCCC, 007]. This practice implies that the biggest test score (or smallest p-value) has been chosen as a global statistic for a gene, but this approach may be insensitive when multiple independent moderate associations occur within a gene. Another class of gene-based methods applies multiple regression analysis in which each SNP is coded as a covariate [Clayton et al., 004] and a multi-covariate global statistic is constructed to represent a gene. Implementation of a generalized Wald chi-square statistic tests the global null hypothesis for all SNP covariates in the regression model, and the degrees of freedom (df) corresponds to the number of covariates. Multi-SNP regression can serve to reveal independent associations of the SNPs in a gene with the trait, and is robust to the occurrence of both deleterious and protective associations. Instead of including all SNPs in regression-based oint

3 analysis, an alternative is to represent the SNP genotypes by one or more of their principal components, [Gauderman et al., 007]. When a small number of principal components explain most of the variability of the genotypes, a global test with reduced df can be obtained. A global statistic with reduced df can also be constructed from a regression analysis of multiple SNPs using a weighted linear combination of regression coefficients or their corresponding test statistics (the weighted sum method) [Pocock et al., 987; O Brien, 984; Schaid et al., 005; Stram et al., 988; Pan, 009; Wang and Elston, 007]. The per-snp scores are collapsed into a weighted sum and with proper weights the test only requires one df and can be more powerful than the generalized Wald test. However, when the weights for each SNP effect are non-negative and the signs of SNP coefficients are opposing, summing the coefficients can yield a substantially reduced signal. Therefore, the performance of linear combination tests depends on the direction of the per-snp association, and how minor and maor alleles are coded in the analysis [Wang and Elston, 007; Pan, 009, Madsen and Browning, 009]. To improve power and robustness of genetic association analysis to gene structure and trait architecture, we proposed the multi-bin linear combination (MLC) test [Yoo et al., 03]. The MLC test incorporates the LD structure of SNPs in a gene by partitioning the SNPs in a gene into bins of positively correlated SNPs. From a regression analysis of multiple SNPs in a gene, the individual SNP coefficients are combined linearly within each bin and then the bin-specific terms are combined in a quadratic sum constructed across the multiple bins within a gene. Association of the trait with the gene region is then represented by an overall global statistic with df equal to the number of bins. In this way, we improve robustness to the problem of opposing direction of effects and achieve some reduction in df while retaining the advantage of a linear combination of multiple SNP effects within a bin. In Yoo et al. [03], we examined trait model scenarios with one or two low frequency causal SNPs, analyzing only non-causal low-frequency and/or common SNPs. Reduced df MLC tests often showed better power than the full df Wald test when the regression analysis was performed using both low and common frequency SNPs. Multi-variant methods designed for rare variants analysis are well-developed [see Derkach et al. 04 and Lee et al. 04 for recent reviews], and some methods can be applied to both rare and common variants [Ionita-Laza et al., 03; Ayers and Cordell, 03; Wang et al., 03; Yoo et al, 03], but less attention has been paid to evaluation of methods for gene-based analysis of typical GWAS data consisting mainly of genotypes and imputed common variants [Asimit et al., 009; Clayton et al., 004; Gauderman et al., 007; Pan, 009]. Because they are based on multiple regression analysis, MLC tests are well-designed for common variants analysis. Much of the strength of the MLC test comes from incorporating the linkage disequilibrium (LD) between causal and neutral SNPs, which is expected to be higher and more extensive for common causal SNPs. In this article, we investigate the power of MLC tests in comparison with other gene-based statistics for scenarios with multiple common causal variants (minor allele frequency (MAF) > 0.05). We consider realistic LD structure among causal and neutral SNPs based on SNP distributions and LD in HapMap Asian population data for one thousand genes from across the 3

4 genome. For settings in which genotyping data consist of common tag variants, we conduct simulation studies under various genetic models in which analysis models exclude causal SNPs. For settings such as high density GWAS imputation, exome arrays, or filtered whole genome sequencing (WGS) in which genetic data are available for common causal variants, we evaluate asymptotic power for regression analysis including causal SNPs. Throughout we aim to characterize the conditions in which the MLC test performs better than other gene-based tests. Statistical Methods Estimation and Hypothesis Testing in Multi-SNP Regression Models Denote the genotype data of K SNPs as X = X, X,, X ), and the quantitative trait variable as Y. Here, we code the genotype values of 0,, and as a count of the minor allele. Inference concerning overall association between the trait and the SNPs can be obtained via a multi-snp regression model: g { E ( Y )} = β + β X + β X + + β () where E(Y ) is the expected value of Y and g ( ) is the link function. In this report, we focus on a quantitative trait, using the identity function for g ( ), but analogous approaches can be applied to categorical traits, as typical in case-control studies, or time-to-event traits in cohort studies. Hypothesis tests for association in the oint model can be constructed simply using the parameter estimates ˆ β = ( ˆ β,, ˆ β ) and the corresponding estimated variance-covariance matrix Σ B, obtained by least squares/maximum likelihood methods under large-sample assumptions. A generalized Wald test of the global null hypothesis of no association (β = 0 for all ) against the alternative that at least one β 0 is defined as G W ˆ T β Σ ˆ β which has a quadratic form. The asymptotic null distribution of this statistic is chi-squared with K degrees of freedom (df), assuming that the SNP design matrix is of full rank, i.e. no linear dependency among SNPs. Single df components from this quadratic form correspond to Z = ˆ β / Var( ˆ β ) = ˆ β / Σ with a large sample covariance, which can be derived as the correlation matrix of the β estimates. Multi-bin Linear Combination L df Tests Two versions of multi-bin linear combination tests can be constructed using estimates from the multi-snp regression model. One is based on combination of parameter estimates (MLC-B: ˆ β = ( ˆ β,, ˆ β ) ) and the other combines the corresponding test statistics (MLC-Z: K ( K K Σ Z i 0 ( K Z = Z, Z,, Z )). MLC tests incorporate pre-determined bins into the construction of the test statistics; we discuss bin construction in the following sub-section. For now, suppose that L bins are formed from K SNPs (L K). Let J be a K by L matrix where the th column represents assignment to th bin, i.e. J i = if the i th SNP belongs to the th bin and J i =0 if not. We assume that the bins do not overlap. MLC tests of multiple bins are then defined as: MLC-B= = B K X K B 4

5 ( W W T s ˆ)( β W T s ˆ T T T T Σ W ) ( β W ) and MLC-Z= ( W Z)( W Σ W ) ( Z W ) with B T s = ( Σ B J )( J Σ B J s s ) and W, respectively. The usual regression Wald test is a special case in which each bin consists of one SNP i.e. J = diag(,, ), L=K. The null distribution of MLC-B (MLC-Z) is approximately L df chi-squared since T T W βˆ ( βˆ ) is a vector of length L with expected value of 0 and variance T S S W O Var( W ˆ) β = W Var( ˆ β) W T S S = W T S Σ B T T T ( Var ( W Z) = W Var ( Z) W = W Σ W ) under the null hypothesis J T β = 0. The df linear combination tests (LC-B and LC-Z) can also be constructed as special cases of MLC tests where J = (,,, ) T, i.e. all SNPs belong to the same bin. The LC tests follow from early consideration by Pocock et al. [987] and O Brien [984] in the context of multiple outcomes in clinical trial data analysis. We note that LC-B is equivalent to the Wald test statistic described by Stram et al. [988] and the score test statistic proposed by Schaid et al. [005]. These test statistics asymptotically follow chi-square distributions with df, and therefore may be more powerful than the K df Wald test or the L df MLC tests in certain conditions. However, LC tests are sensitive to the direction of alternatives in the parameter space. In particular, LC tests are expected to perform poorly when multiple causal variants in a gene affect the trait positively and negatively. This results in the occurrence of opposite signs among β,, ˆ and possible ˆ cancellation of the effects in the linear combination, producing substantial loss of power. o T O = ( ΣZ J )( J ΣZ J W S O Construction of SNP Bins and Allele Coding Our obective in bin construction is to cluster SNPs into subsets using within-gene LD information such that correlation between additively coded SNPs within a bin is high and positive, while correlation between bins is low. To this end, we conducted bin selection based on an algorithm implemented in LDSelect software [Carlson et al., 004]. As a clustering criterion, LDSelect uses the LD measure r which is calculated as the Pearson correlation of a pair of additively coded SNPs, and can be positive or negative in sign, For a given r threshold value c, the first bin is constituted by the index SNP with the most tags (r > c ) together with the SNPs tagged by it. Excluding SNPs assigned to the first bin, the second bin is constructed by applying the same criteria to the remaining SNPs. This procedure is iterated until it exhausts all SNPs. If c is too strict (high r values) most of the bins consist of a single SNP, but if c is more liberal (low r values) then few bins, each including many SNPs, are created. In the multi-snp regression (), the sign of the regression coefficients is determined by SNP coding of the minor and maor alleles (usually coded as and 0 respectively). Assuming an additive genotype score for an allele, switching the minor and maor alleles changes the sign of the corresponding βˆ estimate and its covariance with other βˆ estimates. This affects the linear i combination test statistics but has no effect on the quadratic Wald statistic. A widely used allele coding method designates the minor allele as the deleterious allele, under the presumption that this will be the case for most of the SNPs. If, however, the SNPs being tested are merely tagging a o Z o O ) o O O β K Z O 5

6 causal SNP or some of the causal effects are actually protective, the signs of the regression coefficients of these SNPs may be reversed. To address potential inconsistencies in effect direction in the context of linear combination tests, we apply a coding correction method suggested by Wang and Elston [007]. According to this algorithm, the allele coding decision, based on the genotype data, is applied within each bin of the MLC test (or to the entire gene in the LC test) so that the number of positively correlated SNPs is maximized within a bin (or within a gene for the LC test). This approach first catalogues SNPs using the number of negative pairwise correlations with other SNPs for a given initial coding scheme (usually minor alleles coded as ). Starting from the SNP with the largest number of negative pairwise correlations with others, the /0 coding is reversed. The correlation status between SNPs after re-coding one SNP is re-examined and the procedure is repeated iteratively until the number of SNPs negatively correlated with one SNP is less than half of the number of SNPs. After applying this coding correction, the resulting matrix of pairwise LD measures r will have a minimized number of negative pairwise correlations between SNPs. Because this adaptive coding procedure is applied only to the genotype data, ignoring the direction of the trait association, it does not inflate the type I error of MLC and LC tests. We apply the recoding procedure in construction of the MLC and LC statistics throughout the subsequent application and evaluations. Other Gene-based Tests The so-called minimum p-value (MinP) test is defined as the p-value corresponding to the max( Z,, ZK ) from the results of oint multi-snp (MinP-J) or marginal (MinP-M) regressions. In contrast to oint modeling in a multi-snp regression, individual SNP associations can be assessed marginally by separate single-snp regression models: g { E( Y)} = β + β, =,, K () In the marginal case, βˆ is replaced by ˆ M ˆ M ˆ M β ( β,, β ) and Σ is replaced by Σ which is an estimate of the variance-covariance matrix for M marginal test statistics ˆ M / ( ˆ M M M Z = β Var β ) is defined as MinP-M for max( Z,, Z K ). M 0 M X = K B β,, ˆ β. The minimum p-value test for For uncorrelated SNPs, the Σ matrix has a diagonal form. Generally, however, Σ can be BM BM obtained using the robust GEE variance method to estimate covariances as suggested by Pan [009, Appendix D]. Although the Bonferroni correction is a simple multiple-testing adustment for MinP, it will be too conservative for correlated SNPs. Alternatively, an adusted p-value can be approximated assuming Z, Z,, Z follow an asymptotic null distribution that is K multivariate normal with mean 0 and covariance matrix [James, 99; Conneely and Boehnke, 007]. In addition, we evaluated two other gene-based statistics based on regression modeling: the SSB statistic [Pan, 009], which is a quadratic statistic obtained from the marginal regression framework () as the squared sum of beta coefficients, and the statistic based on multiple M ˆ Σ Z M K BM 6

7 regression analysis of principal components of the multiple SNP genotypes (PC80) according to Gauderman et al. [007] (See details in the Supplementary Methods). Illustration in the DCCT/EDIC Candidate Gene Study The Diabetes Control and Complications Trial/Epidemiology of Diabetes Interventions and Complications (DCCT/EDIC) study is a long-term follow-up study of randomized trial participants with type diabetes (TD) (DCCT 993; EDIC 999). The DCCT/EDIC Genetics Study was designed to investigate the association of SNPs in a large set of candidate genes with complications of TD. As detailed in the methods of Al-Kateb et al. [008], tagging SNPs within 5kb flanking either side of each of 0 candidate genes were selected not to be in strong linkage disequilibrium and to have MAF greater than 5%. Study participants were genotyped by a custom Illumina GoldenGate Beadarray assay, and data were subected to standard QC procedures. This set of candidate genes includes several recently reported to be associated with high-density lipoprotein (HDL), a measure of blood cholesterol level, in the general population by Teslovitch et al [0]: CETP, APOB, IRS, LPL, ABCA, APOA, LIPC, LCAT, MC4R. It is therefore of interest to assess these and other candidate genes for association in a population with TD. The dataset we analysed for genetic associations with the quantitative HDL trait consists of 36 white probands with genotyping data for 3 SNPs (MAF>5%) in 83 candidate genes (with SNPs per gene). The number of SNPs genotyped per gene ranges from to 47, with a median value of 4. We applied the MLC tests and other gene-based tests to the logged and centered HDL measures obtained at the DCCT baseline assessment. For each gene, we constructed bins using a r threshold value of 0.3, and obtained bin sizes ranging from to 7 with a median of. Out of 83 genes, had at least one test that met a liberal significance criterion of p-value < 0.03, with of these detected by the MLC method (Table S). As expected, there is strong concordance between MLC-B and MLC-Z, and between LC-B and LC-Z. The well-known HDLrelated gene CETP shows genome-wide significant p-values for all the methods except MinP-J, but for the remaining 8 genes, none of the methods reach the Bonferroni adusted p-value criteria of 3x0-4. The ten SNPs in the CETP region cluster into five bins (Table and Figure ). Two of the SNPs with strong marginal signals (SNP4, SNP5) are highly correlated (r=0.7) and consequently the SNP5 signal disappears in oint regression of all ten SNPs. SNP4 (rs709) has been reported as associated with the HDL trait [Enquobahrie et al., 008; Asselbergs et al., 0;Wu et al., 03]. The PC80 test p-value (4 df) is the smallest with the MLC-B and SSB tests yielding similar small p-values for the CETP region. Simulation Studies Analysis of non-causal SNPs Design and Methods To rigorously compare the methods, and to better understand the data and model characteristics that influence the performance of the methods, we conducted a series of simulation studies based on observed human genotypes. Trait data were simulated under null and various 7

8 alternative models using genotype data derived from the HapMap Phase III Asian population (JPT and CHB). To compare the performance of various gene-based tests under various realistic gene structures, we selected 000 genes as follows. A list of genes across autosomes was obtained from the UCSC genome annotation database for NCBI hg8 Build 36. ( database/). Among 6,54 genes with at least one SNP present in the HapMap Asian data, there were 8883 genes with size 4~30 after excluding rare and low frequency SNPs (MAF <0.05) and pruning SNPs in complete LD. From these, 000 genes were randomly selected for inclusion in the simulation studies. For each gene, 000 replicated data sets of genotypes and trait values were generated for 000 individuals under each trait model scenario. To maintain the gene structure observed in the HapMap data, we generated genotype data for 000 individuals using the method of randomly pairing haplotypes from the haplotype pool (for each gene) obtained from the phased genotype data. Based on the genotypes for each individual, a quantitative trait Y was generated under an additive genetic model for a gene with C causal SNPs such that where Y = C i= b i G i + ε ε ~ N(0, σ ), b i is the effect of i th causal SNP, and G i is the number of minor alleles at the i th causal SNP, i.e. G i can be 0,, or. We considered six different quantitative trait models for each gene, including 0, or causal SNPs per gene (Table ). Under each model, the causal SNPs were randomly selected from the set of SNPs for each gene. Under the null hypothesis of no gene effect, all b i were specified to be equal to 0. Under the alternative hypothesis of genetic association, the effect size b i and error variance σ were chosen for each genetic model to maintain Wald test average power in the range 5-80% at a sample size n=000. At this stage, the causal SNPs were not included in the gene-based regression analysis, so associations detected were indirect through the natural LD structure of each gene. The LD structure among the SNPs within a gene therefore arose entirely from the inherent structure in the HapMap haplotype data. Choice of r -threshold Values for the MLC bin construction We conducted a preliminary simulation to evaluate how MLC test power depends on the r - threshold values, attempting to find r -threshold values that usually produced the highest power with best power improvement. For 000 genes selected for simulation study, we compared power at eight different r -threshold values (0.~ 0.8) for five trait models. Considering both the frequency of occurrence of best power (Figure S) and the mean power improvement (Figure S) at the optimizing r -threshold value, we concluded that a threshold from 0.3~0.6 is a reasonable choice for dimension reduction by the binning procedure. In the subsequent simulation studies, we applied an r - threshold of 0.5 in the MLC test evaluations. Type I Error Evaluation Type I error rates of MLC tests and other gene-based tests were estimated for each of 000 genes at two nominal significance levels (α=0.05 and α=0.0). For the maority of the 000 8

9 genes, the estimated type I error rate is within the range of the nominal significance level expected in 000 replicates, with 70-87% within 0.04~0.06 for α=0.05 and 9-98% within 0.003~0.07 for α=0.0 (Table S). The average of the estimated Type I error rates over 000 genes is closest to nominal for the MLC, MinP-J, PC80, and LC tests, while the averages observed for the Wald and SSB tests tend to be inflated, and the distribution for MinP-M is skewed toward elevated values. Power Comparison Between MLC Tests and Other Gene-based Tests The power of each gene-based test was estimated using the critical values for α = 0.05 and α = 0.0 from the asymptotic null distribution for each statistic and averaged across all 000 genes under each trait model (see Table 3, Table S3 for power at α = 0.05, α = 0.0 respectively). The 95% confidence interval (CI) for average MLC test power (both MLC-B and MLC-Z) is strictly higher than that of Wald test power for trait Models -3 in which the causal SNPs are deleterious (Table 3). In Figure (a) which compares power for each of the 000 genes, MLC test power is nearly always above Wald test power for Models -3, and mostly above for Models 4 and 5 in which one causal SNP is deleterious and the other is protective. This suggests consistent benefits of the reduced df MLC-B statistic compared to the Wald statistic, while the exceptions with low power are likely due to difficulties in bin construction and/or in effective recoding within a particular gene bin when positively correlated SNPs have opposing effects. On average, the power of the LC and MinP-J tests is low under all models (Table 3). Although there are more than a few genes where LC-B achieves better power than the Wald and MLC-B statistics (Figure (a)), genes with lower power for the LC statistic predominate, and the power loss is usually substantial when it occurs. The 95% CIs for average MLC power overlap substantially with those for MinP-M and PC80 and tend to be higher than those for the SSB test (Table 3). However, as evident in the gene-bygene comparisons in Figure (b), MLC-B often exhibits substantially better power than MinP and PC80, especially at lower power levels (eg. Models, 4, and 5), with comparable or only slightly lower power than MinP otherwise. This latter difference may be exaggerated by the occurrence of elevated type I error in MinP-M (Table S). Additional comparisons among MLC-B, SSB and Wald tests indicate that SSB can sometimes perform better than Wald and MLC, but there are many genes where SSB loses substantial power compared to MLC and Wald (Fig S3). Analytic Power Comparisons Analysis of causal and neutral SNPs To evaluate the performance of MLC-B, Wald, MinP and LC-B tests under several trait models with the causal SNPs included in the regression analysis, we analytically derived their asymptotic distributions using a simplified genotype structure. Under the alternative, the test statistics follow non-central chi-square distributions with the non-centrality parameter as the expected statistic under a trait model. Let the critical value for a significance level α obtained from m df chi-square distribution be c where P χ > α. If we denote the expected MLC-B, Wald, and α } = m,α { m c m, LC-B and statistics as λ, λ, and λ, their power is then φ = P χ > c ), M M W L M ( L, λ L, α 9

10 φ = P χ > c ), and φ = P χ c ) with L, K and df respectively. We give detailed W W ( K, λ K, α ( L, λ >, α formulations of λ, λ, and λ in terms of trait model, allele frequencies and linkage M W disequilibrium structure in the Supplementary Methods. We obtained adusted critical values for MinP-J and MinP-M statistics corresponding to a significance level α from the multivariate normal distribution with mean vector of 0's and the covariance matrix for Z = Z, Z,, Z ) ( K based on Conneely and Boehnke [007]. Power is φ = P U > c ) where U is the MinP-M MinP or MinP-J statistic and c U,α, is the corresponding critical value. L L ( U,α Performance of MLC Test with LD Patterns from HapMap Data We compared MLC-B, Wald, and LC-B tests under realistic genotype structures based on the same 000 genes in the HapMap Asian population used for the simulation studies. We assumed the causal SNPs are included in the regression analysis and calculated power under Models ~3 (Table ), except that in these evaluations the error variance σ was set to yield 60% power for the Wald test under the given trait model (Figure 3, Table S4). The power values of LC-B for 000 genes under all trait models fluctuated around 60% depending on the LD structure of each gene, but were usually less than 60% under Model 3 in which the two causal SNPs are in different bins. Similarly to the findings from simulation studies (which did not include the causal SNPs in the model), MLC-B maintains similar or higher power than the Wald test for most cases, and the fraction of genes in which MLC-B loses substantial power compared to Wald tests is small: %, 3%, 5% for Models, and 3 respectively. Effect of Linkage Disequilibrium and Number of Causal SNPs To analytically evaluate the effect of LD, we examine simple trait models with one or two causal SNPs and one or two neutral SNPs across a range of allele frequencies (0. to 0.5) and pairwise SNP correlations (-.0 to.0 subect to allele frequency constraints). To evaluate the effect of the number of SNPs, we examine trait models with 0 SNPs, all with MAF=0.3, and one to five causal SNPs. These analytic studies focus on the LC-B and MLC-B tests in comparison to the Wald and MinP tests (see Table S5.A for trait model details). The performance of LC-B is sensitive to within-gene correlation and the underlying causal model. For one causal SNP and high positive correlation with neutral SNPs, LC-B power can exceed Wald power, but it never does better than MinP-M (Figure S4, S5). For two positively correlated deleterious causal SNPs, the LC-B test can be more powerful than the Wald and MinP tests (Figure 4(a), S6). On the other hand, for negative correlations, the coding correction algorithm changes the direction of one causal effect resulting in power loss for LC-B. The power of MLC-B depends on within-gene correlation and the underlying trait model, but is less sensitive to the direction of the causal effects; here we assume that two SNPs are grouped into a bin when their correlation is above 0.6. For one causal SNP and correlation high enough to create a bin, the MLC test power exceeds that of the Wald test due to the reduction in df (Figure S7). For a trait model including two causal SNPs in different bins and one neutral SNP in the 0

11 same bin as the first SNP (r = 0.8), MLC test power consistently exceeds Wald test power regardless of the direction of causal effects. When the causal SNPs are positively correlated, the LC-B test can have somewhat higher power than the MLC-B test (Figure 4(a)). However, when positively correlated causal SNPs have coefficients with opposing effects, the pattern is reversed (Figure 4(b)). Overall the MLC test performs well regardless of the direction of causal effects because it is affected only by allele recoding within a bin. When multiple independent causal SNPs and correlated neutral SNPs are analysed together, we expect improvement in the performance of MLC relative to other tests. Under a trait model with five bins each consisting of two SNPs and no more than one causal SNP per bin (Table S5.B), MLC test power improves faster than Wald and MinP-M power as the within-bin correlation r and the number of causal SNPs increases (Figure 5). For r 0.4, MLC has higher power than MinP and for r 0.6, MLC is as or more powerful than the Wald test. Discussion MLC tests can be seen as a compromise between df linear combination tests and multi df Wald tests, resulting in power improvement and better robustness to varying gene structures and underlying trait models. By restricting the linear combination strategy within a bin of closely and positively correlated SNPs, we aim to minimize the chance that opposing genetic effect estimates will reduce the overall signals while still improving power through df reduction. By means of extensive simulation and numerical studies performed under various trait models for each of 000 representative genes, we evaluated the performance of the MLC test in two general situations. In one, we set up the analysis model to exclude the causal SNPs so as to imitate the complexity of typical GWAS analysis of mainly tag SNPs. In the second, to imitate the settings of GWAS imputation or dense genomic data such as exome array data or WGS, we included both causal and non-causal SNPs. Because our gene panels were selected to represent the overall distribution of gene structure across the genome, the average power and good performance of MLC tests can be expected to represent typical performance in applications. In practical application of regression-based tests, some computational issues need to be considered. First, depending on the linkage disequilibrium pattern in the data set, the oint analysis model using multi-snp regression may encounter linear dependencies among the SNPs and nearsingularity in the variance-covariance matrix. This can arise due to complete pairwise LD between SNPs or higher order multi-snp linear dependence. Most statistical software performing linear regression can deal with the singularity and automatically remove covariates to obtain a nonsingular matrix. By removing the SNPs that are redundant (in complete LD with some other SNP), the possible singularity problem can usually be resolved prior to analysis. Depending on sample size, regression analysis using low frequency SNPs may be problematic. However, if a legitimate regression analysis can be performed for low frequency SNPs, the MLC tests can always be constructed. In this study we successfully used LDSelect to construct bins for the MLC method. Grouping correlated SNPs into bins is an extra procedure and, in principle, any clustering algorithm for

12 correlated SNPs can be used. The SNPs in a bin need not be physically consecutive and although clustering is indirectly related to haplotype structure, the bin itself does not necessarily have to have this specific biological meaning. Clustering of SNPs based on haplotype structure or functional annotation, or by alternative clustering algorithms that can provide some biological interpretation, might be suitable for use with the MLC test and worthwhile to develop in future study. In that instance, the bin-specific statistics may be appropriate to localize the association signal. Based on the entirety of the simulation and numerical results, we conclude that MLC tests can be expected to perform better than Wald, MinP-J, SSB, and LC tests when SNPs are in linkage disequilibrium. Taking the ratio of the number of SNPs over the number of bins per gene as an indirect measure of within-gene LD, because the number of bins is reduced by high correlation between SNPs in a gene, simulation results generated under a model with one or two causal variants suggest that MLC tests can perform better than MinP-M and PC80 at ratios of -.5 while the three statistics performed similarly at higher ratios (Figure S8).. Our conclusions are similar to those of previous authors who found multilocus reduced df tests to be more powerful than MinP-M SNP set tests when SNPs are in LD [Kwee et al 008; Lin et al 0]. The Wald test is robust to the presence of multiple causal variants, but does not utilize the LD information efficiently whereas the MinP test does use LD information to adust p-values but is prone to inflated type I error and is more suitable for scenarios with a single causal SNP (Taub et al., 03). Principal-component-based methods efficiently reduce the df based on the LD structure, and therefore showed good performance in our study and in Peterson et al. [03]. However, each principal component does not have a ready biological interpretation, and some information may be lost by choosing a subset of the principal components. In contrast, the bins in the MLC statistics maintain meaning as a small LD block, and the df are effectively reduced using LD information while averaging correlated SNPs. We therefore recommend the MLC test for genebased analysis of genetic association data as a powerful and robust multi-marker method. In our evaluations, we pre-specified gene-based common-frequency SNP sets. To apply genebased association methods to a whole genome scan, it is necessary to assign the genomic variants to analysis units such as genes, LD blocks, or other biological regions. Furthermore, the number of SNPs that can be ointly analyzed in a statistic may be limited, requiring a windowing approach to determine the analysis units. For case-control gene-based methods, Petersen et al. [03] investigated the influence of non-causal SNPs within the analysis unit, assuming various LD relationships within and between causal and non-causal SNPs. Their evaluations included: a p- value adustment type method similar to the MinP tests, a likelihood-ratio test from principalcomponent-based regression analysis, and a likelihood-ratio test from genotype-based regression analysis which is equivalent to the Wald test. They conclude that if causal SNPs comprise less than 0-5% of the SNPs analysed ointly, gene-based tests can be ineffective, and therefore suggest that only intergenic SNPs that are in high LD be included in the gene-based analysis unit.

13 In their simulation and numerical power analysis of rare variant tests, Derkach et al. [04] found that LD between causal SNPs or causal and neutral SNPs can increase or decrease the power of linear combination tests, which is consistent with our results presented in Figure 4 and Figures S4-S7. Petersen et al. [03] also observed that LD between neutral SNPs can affect the power of multi-marker tests, similar to our results in Figure S7. With clustering of correlated SNPs and application of the coding correction, we were able to retain the benefits of LD, resulting in improved and more robust performance of MLC tests. With accumulating evidence that LD between causal and non-causal SNPs influences the power of gene-based tests, MLC methods specifically designed to utilize the LD information should be considered as a powerful alternative for analysis of dense genetic association data with highly correlated SNPs. Acknowledgement DCCT/EDIC study data: The Diabetes Control and Complications Trial (DCCT) and its followup the Epidemiology of Diabetes Interventions and Complications (EDIC) study were conducted by the DCCT/EDIC Research Group and supported by National Institute of Health grants and contracts and by the General Clinical Research Center Program, NCRR. This manuscript was not prepared under the auspices of the DCCT/EDIC study and does not represent analyses or conclusions either of the DCCT/EDIC study group or the National Institutes of Health. Funding: This work was supported by the Canadian Institutes of Health Research operating grant MOP-8487, and National Research Foundation of Korea (NRF) grant NRF-0RAA3048. References Al-Kateb H, Boright AP, Mirea L, Xie X, Sutradhar R, Mowoodi A, Bhara B, Liu M, Bucksa JM, Arends VL, Steffes MW, Cleary PA, Sun W, Lachin JM, Thorner PS, Ho M, McKnight AJ, Maxwell AP, Savage DA, Kidd KK, Kidd JR, Speed WC, Orchard TJ, Miller RG, Sun L, Bull SB, Paterson AD; Diabetes Control and Complications Trial/Epidemiology of Diabetes Interventions and Complications Research Group Multiple superoxide dismutase /splicing 9 factor serine alanine 5 variants are associated with the development and progression of diabetic nephropathy. Diabetes 57:8-8. Asselbergs FW et al. 0. Large-scale gene-centric meta-analysis across 3 studies identifies multiple lipid loci. Am J Hum Genet 9: Asimit JL, Yoo YJ, Waggott D, Sun L, Bull SB Region-based analysis in GWA of FHS blood lipid traits. BMC Proceedings Suppl 7:S7. Ayers KL, Cordell HJ. 03. Identification of grouped rare and common variants via penalized regression. Genet Epidemiol 37: Carlson CS, Eberle MA, Rieder MJ, Yi Q, Kruglyak L, Nickerson DA Selecting a maximally informative set of single-nucleotide polymorphisms for association analysis using linkage disequilibrium. Am J Hum Genet 74:

14 Clayton D, Chapman J, Cooper J Use of unphased multilocus genotype data in indirect association studies. Genet Epidemiol 7: Conneely KN, Boehnke M So many tests, so little time! Rapid adustment of P-values for multiple correlated tests. Am J Hum Genet 8: Derkach A, Lawless JF, Sun L. 04. Assessment of pooled association tests for rare variants within a unified framework. Statistical Science 9():30-3. Diabetes Control and Complications Trial Research Group (DCCT) The effect of intensive treatment of diabetes on the development and progression of long-term complications in insulin-dependent diabetes mellitus. N Engl J Med 39: Epidemiology of Diabetes Interventions and Complications (EDIC) Design, implementation, and preliminary results of a long-term follow-up of the Diabetes Control and Complications Trial cohort. Diabetes Care :99-. Enquobahrie DA, Smith NL, Bis JC, Carty CL, Rice KM, Lumley T, Hindorff LA, Lemaitre RN, Williams MA, Siscovick DS, Heckbert SR, Psaty BM Cholesterol ester transfer protein, interleukin-8, peroxisome proliferator activator receptor alpha, and Toll-like receptor 4 genetic variations and risk of incident nonfatal myocardial infarction and ischemic stroke. Am J Cardiol 0: Gauderman WJ, Murcray C, Gilliland F, Conti DV Testing association between disease and multiple SNPs in a candidate gene. Genet Epidemiol 3: Ionita-Laza I, Lee S, Makarov V, Buxbaum JD, Lin X. 03. Sequence Kernel Association Tests for the Combined Effect of Rare and Common Variants. Am J Hum Genet 9: James S. 99. Approximate multinormal probabilities applied to correlated multiple endpoints in clinical trials. Stat Med 0:3-35. Kraft P, Cox DG Study designs for genome-wide association studies. Adv Genet 60: Kwee LC, Liu D, Lin X, Ghosh D, Epstein MP A powerful and flexible multilocus association test for quantitative traits. Am J Hum Genet 8: Lee S, Abecasis GR, Boehnke M, Lin X. 04. Rare-variant association analysis: Study designs and statistical tests. Am J Hum Genet 95: 5-3. Lin X, Cai T, Wu MC, Zhou Q, Liu G, Christiani DC, Lin X. 0. Kernel machine SNP-set analysis for censored survival outcomes in genome-wide association studies. Genet Epidemiol 35: Liu JZ, Mcrae AF, Nyholt DR, Medland SE, Wray NR, Brown KM, AMFS Investigators, Hayward NK, Montgomery GW, Visscher PM, Martin NG, Macgregor S. 00. A versatile gene-based test for genome-wide association studies. Am J Hum Genet 87: Madsen BE, Browning SR A groupwise association test for rare mutations using a weighted sum Statistic. PLoS Genet 5:e Neale BM, Sham PC The future of association studies: gene-based analysis and replication. Am J Hum Genet 75:

15 O Brien PC Procedures for comparing samples with multiple endpoints. Biometrics 40: Pan W Asymptotic tests of association with multiple SNPs in linkage disequilibrium, Genet Epidemiol 33: Petersen A, Alvarez C, DeClaire S, Tintle NL. 03. Assessing methods for assigning SNPs to genes in gene-based tests of association using common variants. PLoS One 8:e66. Pocock SJ, Geller NL, Tsiatis AA The analysis of multiple endpoints in clinical trials. Biometrics 43: Risch N, Merikangas K The future of genetic studies of complex human diseases. Science 73: Schaid DJ, McDonnell SK, Hebbring SJ, Cunningham JM, Thibodeau SN Nonparametric tests of association of multiple genes with human disease. Am J Hum Genet 76: Stram DO, Wei LJ, Ware JH Analysis of repeated ordered categorical outcomes with possibly missing observations and time-dependent covariates. J Am Stat Assoc 83: Taub MA, Schwender HR, Youkin SG, Louis TA, Ruczinski I. 03. On multi-marker tests for association in case-control studies. Front Genet 4:5. Wang T, Elston RC Improved power by use of a weighted score test for linkage disequilibrium mapping. Am J Hum Genet 80: Wang X, Morris NJ, Zhu X, Elston RC. 03. A variance component based multi-marker association test using family and unrelated data. BMC Genet 4:7 The Wellcome Trust Case-Control Consortium Genome-wide association study of 4,000 cases of seven common diseases and 3,000 shared controls. Nature 447: Wu Y et al. 03. Trans-ethnic fine-mapping of lipid loci identifies population-specific signals and allelic heterogeneity that increases the trait variance explained. PLoS Genet 9(3): e Yoo YJ, Sun L, Bull SB. 03. Gene-based multiple regression association testing for combined examination of common and low frequency variants in quantitative trait analysis. Front Genet 4:33. 5

16 Table. Application to DCCT/EDIC Genetics Study: Regression analysis and gene-based analysis results for association of the HDL trait with 0 SNPS in the CETP gene. SNP Bin alloca -tion MAF Regression Analysis Results Joint analysis Marginal analysis Beta P-value LC-B per bin Wald per bin Beta P-value SNP SNP (P=0.0003) (P=0.000) SNP SNP* * 0.73 SNP (P=0.6) 7.3 (P=0.068) *.66E SNP SNP E- SNP (P=0.09) (P=5.00E-8) E-0 SNP SNP (P=0.690) 0.08 (P=0.893) Gene-based Analysis Results 0.6 (P=0.690) 0.08 (P=0.893) Joint analysis Marginal analysis **Wald MLC-B MLC-Z MinP-J PC80 LC-B LC-Z SSB MinP-M Stat df P.8E- 3.6E- 6.76E E-.09E-08.7E-08.93E-08.40E-0 *The risk and base allele coding was reversed for SNP by application of the coding correction algorithm by Wang and Elston [007]. The beta coefficients for SNP in this table are the values after coding correction. **Wald : Generalized Wald test MLC-B : Multi-bin linear combination test using beta coefficients MLC-Z : Multi-bin linear combination test using Z statistics LC-B : Linear combination test using beta coefficients LC-Z : Linear combination test using Z statistics MinP-J : Minimum P test based on oint regression analysis MinP-M : Minimum P test based on marginal regression analysis PC80 : Global test based on regression using the minimum number of principal components covering 80% variances SSB : Squared sum of marginal beta coefficients (Pan, 009) 6

17 Table. Gene-based Genetic Models for Simulation Study of a Quantitative Trait Model Name Description Trait Model Parameters * Null Model No SNP association b = 0, σ = 5 Model One causal SNP within a gene b =, σ = 5 Model Two causal SNPs in the same bin, both deleterious b =, b =, σ = 0 Model 3 Two causal SNPs in different bins, both deleterious b =, b =, σ = 8 Model 4 Two causal SNPs in the same bin, one deleterious and one protective b =, b =, σ = 5 Two causal SNPs in different Model 5 bins, one deleterious and one b =, b =, σ = 5 protective C *The trait model is where ε ~ N(0, σ, C is the number of causal SNPs, b i is the Y = i= b i G i + ε ) effect of i th causal SNP, and G i is the number of causal alleles for the i th causal SNP. 7

18 Table 3. Simulation Study Results: Average empirical power of MLC test statistics and other gene-based statistics for 000 genes at nominal level α = 0.05 (N=,000 simulation replicates used to estimate power for each gene) Summary Model Model Model 3 Model 4 Model 5 Wald * Average % CI (0.608,0.639) (0.58,0.56) (0.47,0.44) (0.376,0.4) (0.764,0.795) MLC-B Average % CI (0.669,0.70) (0.585,0.6) (0.483,0.5) (0.377,0.43) (0.79,0.8) MLC-Z Average % CI (0.668,0.70) (0.584,0.69) (0.48,0.509) (0.377,0.43) (0.787,0.89) MinP-M Average % CI (0.67,0.706) (0.6,0.636) (0.49,0.59) (0.37,0.47) (0.783,0.85) PC80 Average % CI (0.658,0.694) (0.59,0.68) (0.484,0.53) (0.357,0.404) (0.756,0.79) SSB Average % CI (0.63,0.667) (0.56,0.597) (0.45,0.48) (0.364,0.4) (0.743,0.778) LC-B Average % CI (0.48,0.53) (0.47,0.53) (0.385,0.48) (0.57,0.99) (0.56,0.605) LC-Z Average % CI (0.476,0.59) (0.457,0.498) (0.377,0.4) (0.54,0.95) (0.555,0.6) MinP-J Average % CI (0.94,0.) (0.95,0.5) (0.6,0.83) (0.93,0.5) (0.3,0.36) *For description of tests, see the footnote for Table 8

19 Figure. Clustering of SNPs in DCCT/EDIC CETP gene data by applying LDSelect algorithm to linkage disequilibrium (r) network. Edges with r <0.3 are removed. Edges with 0.3 r <0.55 are shown in dashed line. SNPs in the same bin have the same color. The r cutpoint value for LDSelect algorithm was set at r =0.3 ( r 0.55). The minor and maor allele coding was reversed for SNP (starred) after application of the coding correction algorithm by Wang and Elston [007]. The r values reflect the coding correction. 0.7 SNP SNP SNP5 0.5 SNP SNP SNP SNP* SNP6 0.6 SNP SNP3 SNP

20 Figure. Simulation Study Results: Comparisons of empirical power for each of 000 genes under Models -5. For each trait model, the genes are ordered by the rank of Wald test power among 000 genes. (a) MLC-B, LC-B and Wald, (b) MLC-B, PC80 and MinP-M. (a) (b) 0

21 Figure 3. Comparison of analytic power of MLC-B, Wald, and LC-B tests for each of 000 genes under Models -3. For each trait model, the genes are ordered by the rank of MLC-B power among 000 genes.

22 Figure 4. Analytic study results for two causal SNPs and one neutral SNP: the power of MLC-B test (red dots), Wald (black dots), and LC-B (blue dots) tests, are calculated for two trait models Y β X + β X + β + ε where β = β =, β 0 (upper plots) and = 3X 3 β, (lower plots) with, for a range of allele =, β =, β3 = 0 ε ~ N(0, σ ) σ = frequencies between 0. and 0.5. The correlation between causal SNP and neutral SNP3 is fixed as 0.8 (SNPs & 3 are in the same bin) and the pairwise correlation between the two causal SNPs and is set at r. The range of correlation r between -0.5 and +0.5 is restricted, depending on the marginal allele frequencies. For the LC test, the sign of a beta coefficient and some negative correlations are reversed with application of the allele coding correction. The allele coding correction does not affect the MLC test since there are no negative correlations present within bins. (a) Upper Panel: Two deleterious causal SNPs 3 = (b) Lower Panel: Two causal SNPS, one deleterious and one protective

23 Figure 5. Analytic study results for ten SNPs including one to five causal SNPs: comparison of power of MLC-B, Wald, LC-B and MinP-M for a range of correlation r between causal and neutral SNP pairs as the number of causal SNPs increases from to 5. Ten SNPs were divided into five bins, each with two SNPs assumed to have a fixed correlation value r (0~0.8). The allele frequency of all SNPs is fixed to be 0.3. In each bin, only one SNP can be assigned to be causal. The power of gene-based tests from analysis of all ten SNPs is obtained under the alternative model in which the beta coefficient for a causal SNP is and the error variance is σ = 00. 3

24 Appendices for "Multi-bin multi-variant tests for gene-based linear regression analysis of genetic association Y. J. Yoo Department of Mathematics Education and Interdisciplinary Program in Bioinformatics Seoul National University Seoul, South Korea L. Sun Department of Statistical Sciences and Division of Biostatistics, Dalla Lana School of Public Health University of Toronto Toronto, Canada J. Poirier Lunenfeld-Tanenbaum Research Institute of Mount Sinai Hospital Toronto, Canada S. B. Bull Department of Statistical Sciences and Division of Biostatistics, Dalla Lana School of Public Health University of Toronto and Lunenfeld-Tanenbaum Research Institute of Mount Sinai Hospital 60 Murray Street, Box No. 8, Toronto, ON, M5T 3L9, Canada 30 December 04 Methods A. SSB The SSB statistic recommended by Pan [009] is a quadratic statistic calculated as the squared sum of beta coefficients obtained from the marginal regressions of each SNP genotype variable. That is, SSB = ˆ β M T ˆ β M = which is asymptotically distributed as a weighted sum of K independent df chi-square distributions with an approximated p- value where the weights are eigenvalues of the covariance matrix [Pan, 009]. Therefore, the SSB statistic is constructed using beta coefficients only but the null distribution reflects the correlation between SNPs. B. PC80 Gene-based statistics also can be formulated using multiple regression analysis of the principal components of the multiple SNP genotypes. Gauderman et al. [007] suggested that K i= ( ˆ β M i ) 4

25 principal components be selected using a threshold criteria such that greater than 80% of the variance is explained by the minimum number of principal components, assuming the principal components are ordered by variance from the largest (P ) to smallest (P K ). If P,..., P S explain greater than 80% of the total SNP variance, then the regression analysis is performed using these principal components with the model Y * * P * P * = β P 0 + β + β + + β S S. Denoting the variance-covariance matrix of ˆ * * * (,,, * * β = β β β S ) as Σ B, we have PC80 = * T * ˆ * β Σ ˆ B β which follows an asymptotic chi-squared distribution with S df. Since the covariances of different principal components are zero, the is a diagonal matrix. S (β Therefore, PC80 can be formulated as PC80= * ) * where b is the variance of β. = b C. Estimation of non-centrality parameter for alternative models Denote observed genotypes as X and a quantitative trait as for i th i, X, i, X Y ik i subect among n unrelated individuals. Assume a trait model with regression parameters β = ( β, β,, βk ) corresponding to the genotypes of K SNPs: Y β + β X + β X i + + β X + ε () = 0 K K where ε ~ N(0, σ ). For the th SNP, the genotype X i is coded as the number of minor alleles, i.e., 0,, or. Gene-based statistics are constructed from the beta estimates ˆ β = ( ˆ β ˆ and the estimated covariance matrix of.,, β ) Σ K B βˆ Under Hardy-Weinberg equilibrium, we assume that the underlying genotype structure is such that is the allele frequency of the th SNP and the linkage disequilibrium measure r k p p k p pk between SNP and SNP k is defined as rk = for, k =,, K p p p )( p ) where p P( X =, X = ). Let v = p p ). Then the mean and variance for X i are k = k ( and v respectively, and the covariance of X i and X ik is r v v. The expected value of X i is equal to np and that of X i X ik is equal to n ( r ), k v vk + p pk which results in a covariance matrix Σ for the genotypes consisting of v in the th diagonal position and r v v as the (,k)-element. Assuming a fixed value for Σ, we use the covariance matrix of ˆ ˆ ˆ ˆ σ β = ( β, β,, β K ): Σ X which is equal to Σ under n B large sample theory, to derive the noncentrality values for the distributions of the Wald. LC-B, and MLC-B statistics. (i) Wald statistic (K df ) The noncentrality parameter under the trait model (), which depends on values assumed for the genotype covariances, is K K K n λ =. W v β + β β k r k v v k σ = k = k > * Σ B k ( k p k k n i= X n i= k k X 5

26 (ii) LC-B statistic ( df ) We denote the half of the sum of all the elements of among X, X K ) as K T S = J Σ X J = v + k= where J = (,,,) (all variances and covariances The weight for the th K SNP in the LC-B statistic is then w, i.e. = v + rk v vk S k=, k the relative sum of the variance of X and the covariances between X and other SNPs over the sum of all the variances and covariance among X, X K. Then the LC-B noncentrality parameter under the trait model () is λ L (iii) MLC-B statistic (L df ) Let be the (,l)-element (for th SNP and l th bin) of the weight matrix W w l T S = ( Σ B J )( J Σ B J =,,, K l,,, L for and. Also let correspond to the l th element of T W βˆ, s = ( w ) V + w w CV where is the variance of βˆ, i.e. the (,) element of ΣB, and CV k is the variance between βˆ and, i.e. the (,k) element of Σ, and = w CV. Then the MLC-B statistic noncentrality parameter under the trait model () is L L L n λm = ( ) sltl + slmtltm σ l= m= l> m For an simplified example of W S, if SNPs S, S, S 3, and S 4 belong to three bins B B, B 3, and B respectively and if we assume zero correlation between SNPs in different bins, then 0 0 w and 0 0 J = W S = w4 0 0 where w ˆ ˆ ˆ ˆ ˆ and so that = ( σ + σ 4 ) /( σ + σ 4 + σ 4 ) w ˆ ˆ ˆ ˆ ˆ 4 = ( σ 4 + σ 4 ) /( σ + σ 4 + σ 4 ) w where is the estimated covariance of X i and X and is the estimated + w4 = σˆ i σˆ i variance of X i for i, =,, K. Σ X K K k= l> k r lk v v K K K K ns ns = w β w β + w wk β βk σ = σ = k= k> = ) B S lm K K s w = k= l l K = km l k = l w l = l k K < k l kl k T = V K β βˆk 6

27 Tables and Figures Table S. Application to DCCT/EDIC study: Candidate gene analysis for HDL phenotype (selected genes with p-values <0.03). The number of SNPs per gene corresponds to the df for the Wald statistic, and the number of bins corresponds to the df for the MLC-B and MLC-Z statistics. Gene # SNP # bin Wald * MLC-B MLC-Z MinP-M PC80 SSB LCB LCZ MinP-J CETP 0 5.8E- 3.6E- 6.76E-.40E-0.39E-.93E-08.09E-08.7E SLCA CTGF LDLR TIMP APOB NPHS FASLG EDN AGER TNF NDUFS VEGF ABCA INSR UQCRFS AGT CRP FLT COL4A PDGFB

28 *Wald : Generalized Wald test MLC-B : Multi-bin linear combination test using beta coefficients MLC-Z : Multi-bin linear combination test using Z statistics LC-B : Linear combination test using beta coefficients LC-Z : Linear combination test using Z statistics MinP-J : Minimum P test based on oint regression analysis MinP-M : Minimum P test based on marginal regression analysis PC80 : Global test based on regression using the minimum number of principal components covering 80% variances SSB : Squared sum of marginal beta coefficients (Pan, 009) 8

29 Table S. Simulation study results: Distribution and average Type I error rate estimates for 000 genes at nominal levels α = 0.05 and α = 0.0 (N=,000 simulation replicates used to estimate type I error for each of 000 genes) α=0.05 Distribution of Type I error estimates <.04.04~ ~ ~.06 >.06 Average SD 95%CI Wald (0.0509,0.058) MLC-B (0.0503,0.05) MLC-Z (0.0503,0.05) MinP-M (0.0533,0.0546) PC (0.050,0.05) SSB (0.05,0.059) LC-B (0.0496,0.0504) LC-Z (0.0497,0.0505) MinP-J (0.0497,0.0509) α=0.0 Distribution of Type I error estimates < ~ ~.03.03~.07 >.07 Average SD 95%CI Wald (0.004,0.008) MLC-B (0.00,0.006) MLC-Z (0.003,0.007) MinP-M (0.09,0.04) PC (0.003,0.006) SSB (0.005,0.009) LC-B (0.0099,0.003) LC-Z (0.0,0.004) MinP-J (0.003,0.008) 9

30 Table S3. Simulation study results: Average empirical power of MLC test statistics and other gene-based statistics for 000 genes at nominal level α = 0.0 (N=,000 simulation replicates used to estimate power for each gene) Model Model Model 3 Model 4 Model 5 Wald Average % CI (0.43,0.454) (0.347,0.378) (0.3,0.54) (0.7,0.35) (0.69,0.666) MLC-B Average % CI (0.50,0.536) (0.4,0.447) (0.9,0.35) (0.79,0.34) (0.67,0.708) MLC-Z Average % CI (0.5,0.534) (0.4,0.446) (0.89,0.34) (0.79,0.34) (0.668,0.706) MinP-M Average % CI (0.58,0.555) (0.439,0.476) (0.30,0.38) (0.77,0.33) (0.663,0.70) PC80 Average % CI (0.50,0.539) (0.46,0.46) (0.9,0.38) (0.66,0.3) (0.635,0.675) SSB Average % CI (0.447,0.484) (0.374,0.409) (0.56,0.8) (0.65,0.3) (0.595,0.636) LC-B Average % CI (0.346,0.388) (0.36,0.364) (0.6,0.53) (0.78,0.8) (0.434,0.479) LC-Z Average % CI (0.34,0.38) (0.34,0.35) (0.9,0.46) (0.74,0.4) (0.48,0.473) MinP-J Average % CI (0.097,0.9) (0.0,0.4) (0.067,0.08) (0.06,0.35) (0.04,0.4) Table S4. Numerical power calculation results: Average power of Wald, LC-B, and MLC-B tests for 000 genes at nominal level α = Power values with or without coding correction are reported. The variance value for each gene under the trait Models -3 (specified in Table ) was chosen to maintain the power of the Wald to be LC MLC Wald No code change Code change No code change Code change Model Average % CI n/a (0.64,0.669) (0.694,0.78) (0.69,0.70) (0.697,0.706) Model Average % CI n/a (0.657,0.685) (0.558,0.596) (0.69,0.703) (0.696,0.706) Model 3 Average % CI n/a (0.3,0.34) (0.447,0.484) (0.659,0.677) (0.664,0.68) 30

31 Table S5: Summary of trait models with analytic power calculations for regressions that include causal SNPs A. Effect of Linkage Disequilibrium Figure #SNPs #causal Disease model Correlation between SNPs Allele frequency Comparison: LC-B, Wald, MinP-J, MinP-M S4 β r =r 3 =r 3 = -.0~ ~0.5 Assuming one causal SNP and one neutral SNP, we =, β = 0 change the correlation between the SNPs and evaluate across a range of MAF values. S5 3 β r =r 3 =r 3 = -.0~ ~0.5 Assuming one causal SNP and two neutral SNPs, we =, β = β3 = 0 change the pairwise correlation among the three SNPs, and evaluate across a range of MAF values. S6 3 β r =r 3 =r 3 = -.0~ ~0.5 Assuming two causal SNPs and one neutral SNP, we = β =, β 3 = 0 change the pairwise correlation among the three SNPs and evaluate across a range of MAF values. S7 (a) 3 β =, β = β3 = 0 r =0.~0.9, r 3 =r 3 =0 0.~0.5 Assuming one causal SNP and two neutral SNPs, we increase the correlation between the causal SNP and one neutral SNP, other correlations are zero (fixed). S7 (b) 3 β =, β = β3 = 0 r =r 3 =0, r 3 =0.~0.9 0.~0.5 Assuming one causal SNP uncorrelated with two neutral SNPs, we increase the neutral SNP correlation, other correlations are zero. 4 (a) 3 β r 3 =0.8, r = r 3 = -.0~.0 0.~0.5 Assuming two causal SNPs and one neutral SNP, = β =, β 3 = 0 with causal effects in the same direction, we change the correlation between the two causal SNPs and fix the correlation between the neutral SNP and one causal SNP. 4 (b) 3 β r 3 =0.8, r = r 3 = -.0~.0 0.~0.5 Assuming two causal SNPs and one neutral SNP, =, β =, β3 = 0 with causal effects in opposite directions, we change the correlation between the two causal SNPs and fix the correlation between the neutral SNP and one causal SNP

32 B. Effect of the Number of Causal SNPs Figure #SNPs #causal Disease models Correlation between SNPs Allele frequency Comparison: MLC-B, Wald, MinP-M 5 0 ~5, others= 0 β = β = β 3 =, other= 0 β = β3 = β5 =, other= 0 β = β3 = β5 = β7 = β = β3 = β5 = β7 = β9 =, other= 0, other= 0 r = r 34 = r 56 = r 78 = r 9,0 = 0, 0., 0.4, 0.6, 0.8 other correlations are all Assuming independent causal SNPs with effects in the same direction, we increase correlation between causalneutral SNP pairs and/or neutral-neutral SNP pairs. 3

Figure S. Simulation Study Results: The frequency plot of r -threshold values where the optimum power of MLC-B is achieved. (a) MLC-B - all 000 genes, (b) MLC-B - genes with power improvement > 0.

33 Figure S. Simulation Study Results: The frequency plot of r -threshold values where the optimum power of MLC-B is achieved. (a) MLC-B - all 000 genes, (b) MLC-B - genes with power improvement > 0. When the optimum occurs at multiple r -threshold values, we take the midpoint as the threshold value that achieves the optimum power. (a) (b) Figure S. Simulation Study Results: Averages of improved power values when the optimum is achieved at different r -threshold values. (a) MLC-B - all 000 genes, (b) MLC-B - genes with power improvement > 0. (a) (b) 33

34 Figure S3. Simulation Study Results: Comparisons of empirical power among MLC-B, SSB and Wald tests for each of 000 genes under Models -5 (as specified in Table ). 34

Association Testing with Quantitative Traits: Common and Rare Variants. Summer Institute in Statistical Genetics 2014 Module 10 Lecture 5

Association Testing with Quantitative Traits: Common and Rare Variants Timothy Thornton and Katie Kerr Summer Institute in Statistical Genetics 2014 Module 10 Lecture 5 1 / 41 Introduction to Quantitative