Multi-bin multi-variant tests for gene-based linear regression analysis of genetic association

Size: px
Start display at page:

Download "Multi-bin multi-variant tests for gene-based linear regression analysis of genetic association"

Transcription

1 Multi-bin multi-variant tests for gene-based linear regression analysis of genetic association Yun Joo Yoo, Lei Sun, Julia Poirier, and Shelley B. Bull Y. J. Yoo Department of Mathematics Education and Interdisciplinary Program in Bioinformatics Seoul National University Seoul, South Korea L. Sun Department of Statistical Sciences and Division of Biostatistics, Dalla Lana School of Public Health University of Toronto Toronto, Canada J. Poirier Lunenfeld-Tanenbaum Research Institute of Mount Sinai Hospital Toronto, Canada S. B. Bull Department of Statistical Sciences and Division of Biostatistics, Dalla Lana School of Public Health University of Toronto and Lunenfeld-Tanenbaum Research Institute of Mount Sinai Hospital 60 Murray Street, Box No. 8, Toronto, ON, M5T 3L9, Canada 30 December 04 Abstract: By ointly analyzing multiple variants within a gene together, instead of one at a time, gene-based regression analysis can improve power and robustness of genetic association analysis. Extending prior work that examined multi-bin linear combination (MLC) statistics for combined analysis of rare and common variants, here we investigate analysis of common variants more extensively under realistic trait models and conditions. This method exploits the linkage disequilibrium structure in a gene to construct bins of closely correlated variants, and for variants within each bin, corrects the coding scheme to make the maority of pairwise correlations positive. After bin construction and recoding, variant effects within the same bin are combined linearly, and the bin-specific effects are aggregated in a quadratic sum. This produces a test statistic with reduced degrees of freedom (df) as

2 many as the number of bins. By simulation studies based on HapMap Asians and analytic power investigation, we compare type I error and power of the MLC test with the multi-df generalized Wald test as well as with linear combination, MinP, and principal-component-based tests assuming various trait models with multiple causal variants and varying linkage disequilibrium strength. We confirm that MLC tests have good power and robustness whether causal SNPs are excluded or included in the analysis. In particular, we report increased relative performance of MLC tests for stronger correlations between variants and a higher number of causal variants. Key words: multi-bin linear combination test, multi-marker test, genetic association, linkage disequilibrium, functional variants, common variants, GWAS Introduction In genome-wide association studies (GWAS) and large-scale candidate gene studies, researchers typically scan a large number of single nucleotide polymorphism (SNP) markers, one by one, to detect disease-snp association signals. This single-snp-analysis strategy has been preferred as a simple and effective approach assuming the design and scale of the studies are sufficient to capture the marginal direct or indirect association of a SNP with complex disease traits [Risch and Merikangas, 996; Kraft and Cox, 008]. As an alternative to single SNP test statistics, combined analysis of multiple SNPs within a gene (or a specified region) is a natural and interpretable analytic strategy at the gene level, and various methods have been advocated. A gene-based approach offers other analytical merits reduction of multiple testing burden, robustness to population differences regarding LD and allele frequency, and improved ability to replicate associations [Neale and Sham, 004; Liu et al., 00]. One class of methods for gene-based analysis derives a test of global association from multiple single-snp results, that is, from multiple marginal analyses (see Pan 009 for example who proposed a weighted squared sum test of per-snp marginal effects and showed they are powerful for a certain type of alternative). GWAS investigators performing single-snp-based analysis effectively apply an intrinsic gene-based approach by using a maximum statistic, which selects one SNP with the strongest association within a gene region [WTCCC, 007]. This practice implies that the biggest test score (or smallest p-value) has been chosen as a global statistic for a gene, but this approach may be insensitive when multiple independent moderate associations occur within a gene. Another class of gene-based methods applies multiple regression analysis in which each SNP is coded as a covariate [Clayton et al., 004] and a multi-covariate global statistic is constructed to represent a gene. Implementation of a generalized Wald chi-square statistic tests the global null hypothesis for all SNP covariates in the regression model, and the degrees of freedom (df) corresponds to the number of covariates. Multi-SNP regression can serve to reveal independent associations of the SNPs in a gene with the trait, and is robust to the occurrence of both deleterious and protective associations. Instead of including all SNPs in regression-based oint

3 analysis, an alternative is to represent the SNP genotypes by one or more of their principal components, [Gauderman et al., 007]. When a small number of principal components explain most of the variability of the genotypes, a global test with reduced df can be obtained. A global statistic with reduced df can also be constructed from a regression analysis of multiple SNPs using a weighted linear combination of regression coefficients or their corresponding test statistics (the weighted sum method) [Pocock et al., 987; O Brien, 984; Schaid et al., 005; Stram et al., 988; Pan, 009; Wang and Elston, 007]. The per-snp scores are collapsed into a weighted sum and with proper weights the test only requires one df and can be more powerful than the generalized Wald test. However, when the weights for each SNP effect are non-negative and the signs of SNP coefficients are opposing, summing the coefficients can yield a substantially reduced signal. Therefore, the performance of linear combination tests depends on the direction of the per-snp association, and how minor and maor alleles are coded in the analysis [Wang and Elston, 007; Pan, 009, Madsen and Browning, 009]. To improve power and robustness of genetic association analysis to gene structure and trait architecture, we proposed the multi-bin linear combination (MLC) test [Yoo et al., 03]. The MLC test incorporates the LD structure of SNPs in a gene by partitioning the SNPs in a gene into bins of positively correlated SNPs. From a regression analysis of multiple SNPs in a gene, the individual SNP coefficients are combined linearly within each bin and then the bin-specific terms are combined in a quadratic sum constructed across the multiple bins within a gene. Association of the trait with the gene region is then represented by an overall global statistic with df equal to the number of bins. In this way, we improve robustness to the problem of opposing direction of effects and achieve some reduction in df while retaining the advantage of a linear combination of multiple SNP effects within a bin. In Yoo et al. [03], we examined trait model scenarios with one or two low frequency causal SNPs, analyzing only non-causal low-frequency and/or common SNPs. Reduced df MLC tests often showed better power than the full df Wald test when the regression analysis was performed using both low and common frequency SNPs. Multi-variant methods designed for rare variants analysis are well-developed [see Derkach et al. 04 and Lee et al. 04 for recent reviews], and some methods can be applied to both rare and common variants [Ionita-Laza et al., 03; Ayers and Cordell, 03; Wang et al., 03; Yoo et al, 03], but less attention has been paid to evaluation of methods for gene-based analysis of typical GWAS data consisting mainly of genotypes and imputed common variants [Asimit et al., 009; Clayton et al., 004; Gauderman et al., 007; Pan, 009]. Because they are based on multiple regression analysis, MLC tests are well-designed for common variants analysis. Much of the strength of the MLC test comes from incorporating the linkage disequilibrium (LD) between causal and neutral SNPs, which is expected to be higher and more extensive for common causal SNPs. In this article, we investigate the power of MLC tests in comparison with other gene-based statistics for scenarios with multiple common causal variants (minor allele frequency (MAF) > 0.05). We consider realistic LD structure among causal and neutral SNPs based on SNP distributions and LD in HapMap Asian population data for one thousand genes from across the 3

4 genome. For settings in which genotyping data consist of common tag variants, we conduct simulation studies under various genetic models in which analysis models exclude causal SNPs. For settings such as high density GWAS imputation, exome arrays, or filtered whole genome sequencing (WGS) in which genetic data are available for common causal variants, we evaluate asymptotic power for regression analysis including causal SNPs. Throughout we aim to characterize the conditions in which the MLC test performs better than other gene-based tests. Statistical Methods Estimation and Hypothesis Testing in Multi-SNP Regression Models Denote the genotype data of K SNPs as X = X, X,, X ), and the quantitative trait variable as Y. Here, we code the genotype values of 0,, and as a count of the minor allele. Inference concerning overall association between the trait and the SNPs can be obtained via a multi-snp regression model: g { E ( Y )} = β + β X + β X + + β () where E(Y ) is the expected value of Y and g ( ) is the link function. In this report, we focus on a quantitative trait, using the identity function for g ( ), but analogous approaches can be applied to categorical traits, as typical in case-control studies, or time-to-event traits in cohort studies. Hypothesis tests for association in the oint model can be constructed simply using the parameter estimates ˆ β = ( ˆ β,, ˆ β ) and the corresponding estimated variance-covariance matrix Σ B, obtained by least squares/maximum likelihood methods under large-sample assumptions. A generalized Wald test of the global null hypothesis of no association (β = 0 for all ) against the alternative that at least one β 0 is defined as G W ˆ T β Σ ˆ β which has a quadratic form. The asymptotic null distribution of this statistic is chi-squared with K degrees of freedom (df), assuming that the SNP design matrix is of full rank, i.e. no linear dependency among SNPs. Single df components from this quadratic form correspond to Z = ˆ β / Var( ˆ β ) = ˆ β / Σ with a large sample covariance, which can be derived as the correlation matrix of the β estimates. Multi-bin Linear Combination L df Tests Two versions of multi-bin linear combination tests can be constructed using estimates from the multi-snp regression model. One is based on combination of parameter estimates (MLC-B: ˆ β = ( ˆ β,, ˆ β ) ) and the other combines the corresponding test statistics (MLC-Z: K ( K K Σ Z i 0 ( K Z = Z, Z,, Z )). MLC tests incorporate pre-determined bins into the construction of the test statistics; we discuss bin construction in the following sub-section. For now, suppose that L bins are formed from K SNPs (L K). Let J be a K by L matrix where the th column represents assignment to th bin, i.e. J i = if the i th SNP belongs to the th bin and J i =0 if not. We assume that the bins do not overlap. MLC tests of multiple bins are then defined as: MLC-B= = B K X K B 4

5 ( W W T s ˆ)( β W T s ˆ T T T T Σ W ) ( β W ) and MLC-Z= ( W Z)( W Σ W ) ( Z W ) with B T s = ( Σ B J )( J Σ B J s s ) and W, respectively. The usual regression Wald test is a special case in which each bin consists of one SNP i.e. J = diag(,, ), L=K. The null distribution of MLC-B (MLC-Z) is approximately L df chi-squared since T T W βˆ ( βˆ ) is a vector of length L with expected value of 0 and variance T S S W O Var( W ˆ) β = W Var( ˆ β) W T S S = W T S Σ B T T T ( Var ( W Z) = W Var ( Z) W = W Σ W ) under the null hypothesis J T β = 0. The df linear combination tests (LC-B and LC-Z) can also be constructed as special cases of MLC tests where J = (,,, ) T, i.e. all SNPs belong to the same bin. The LC tests follow from early consideration by Pocock et al. [987] and O Brien [984] in the context of multiple outcomes in clinical trial data analysis. We note that LC-B is equivalent to the Wald test statistic described by Stram et al. [988] and the score test statistic proposed by Schaid et al. [005]. These test statistics asymptotically follow chi-square distributions with df, and therefore may be more powerful than the K df Wald test or the L df MLC tests in certain conditions. However, LC tests are sensitive to the direction of alternatives in the parameter space. In particular, LC tests are expected to perform poorly when multiple causal variants in a gene affect the trait positively and negatively. This results in the occurrence of opposite signs among β,, ˆ and possible ˆ cancellation of the effects in the linear combination, producing substantial loss of power. o T O = ( ΣZ J )( J ΣZ J W S O Construction of SNP Bins and Allele Coding Our obective in bin construction is to cluster SNPs into subsets using within-gene LD information such that correlation between additively coded SNPs within a bin is high and positive, while correlation between bins is low. To this end, we conducted bin selection based on an algorithm implemented in LDSelect software [Carlson et al., 004]. As a clustering criterion, LDSelect uses the LD measure r which is calculated as the Pearson correlation of a pair of additively coded SNPs, and can be positive or negative in sign, For a given r threshold value c, the first bin is constituted by the index SNP with the most tags (r > c ) together with the SNPs tagged by it. Excluding SNPs assigned to the first bin, the second bin is constructed by applying the same criteria to the remaining SNPs. This procedure is iterated until it exhausts all SNPs. If c is too strict (high r values) most of the bins consist of a single SNP, but if c is more liberal (low r values) then few bins, each including many SNPs, are created. In the multi-snp regression (), the sign of the regression coefficients is determined by SNP coding of the minor and maor alleles (usually coded as and 0 respectively). Assuming an additive genotype score for an allele, switching the minor and maor alleles changes the sign of the corresponding βˆ estimate and its covariance with other βˆ estimates. This affects the linear i combination test statistics but has no effect on the quadratic Wald statistic. A widely used allele coding method designates the minor allele as the deleterious allele, under the presumption that this will be the case for most of the SNPs. If, however, the SNPs being tested are merely tagging a o Z o O ) o O O β K Z O 5

6 causal SNP or some of the causal effects are actually protective, the signs of the regression coefficients of these SNPs may be reversed. To address potential inconsistencies in effect direction in the context of linear combination tests, we apply a coding correction method suggested by Wang and Elston [007]. According to this algorithm, the allele coding decision, based on the genotype data, is applied within each bin of the MLC test (or to the entire gene in the LC test) so that the number of positively correlated SNPs is maximized within a bin (or within a gene for the LC test). This approach first catalogues SNPs using the number of negative pairwise correlations with other SNPs for a given initial coding scheme (usually minor alleles coded as ). Starting from the SNP with the largest number of negative pairwise correlations with others, the /0 coding is reversed. The correlation status between SNPs after re-coding one SNP is re-examined and the procedure is repeated iteratively until the number of SNPs negatively correlated with one SNP is less than half of the number of SNPs. After applying this coding correction, the resulting matrix of pairwise LD measures r will have a minimized number of negative pairwise correlations between SNPs. Because this adaptive coding procedure is applied only to the genotype data, ignoring the direction of the trait association, it does not inflate the type I error of MLC and LC tests. We apply the recoding procedure in construction of the MLC and LC statistics throughout the subsequent application and evaluations. Other Gene-based Tests The so-called minimum p-value (MinP) test is defined as the p-value corresponding to the max( Z,, ZK ) from the results of oint multi-snp (MinP-J) or marginal (MinP-M) regressions. In contrast to oint modeling in a multi-snp regression, individual SNP associations can be assessed marginally by separate single-snp regression models: g { E( Y)} = β + β, =,, K () In the marginal case, βˆ is replaced by ˆ M ˆ M ˆ M β ( β,, β ) and Σ is replaced by Σ which is an estimate of the variance-covariance matrix for M marginal test statistics ˆ M / ( ˆ M M M Z = β Var β ) is defined as MinP-M for max( Z,, Z K ). M 0 M X = K B β,, ˆ β. The minimum p-value test for For uncorrelated SNPs, the Σ matrix has a diagonal form. Generally, however, Σ can be BM BM obtained using the robust GEE variance method to estimate covariances as suggested by Pan [009, Appendix D]. Although the Bonferroni correction is a simple multiple-testing adustment for MinP, it will be too conservative for correlated SNPs. Alternatively, an adusted p-value can be approximated assuming Z, Z,, Z follow an asymptotic null distribution that is K multivariate normal with mean 0 and covariance matrix [James, 99; Conneely and Boehnke, 007]. In addition, we evaluated two other gene-based statistics based on regression modeling: the SSB statistic [Pan, 009], which is a quadratic statistic obtained from the marginal regression framework () as the squared sum of beta coefficients, and the statistic based on multiple M ˆ Σ Z M K BM 6

7 regression analysis of principal components of the multiple SNP genotypes (PC80) according to Gauderman et al. [007] (See details in the Supplementary Methods). Illustration in the DCCT/EDIC Candidate Gene Study The Diabetes Control and Complications Trial/Epidemiology of Diabetes Interventions and Complications (DCCT/EDIC) study is a long-term follow-up study of randomized trial participants with type diabetes (TD) (DCCT 993; EDIC 999). The DCCT/EDIC Genetics Study was designed to investigate the association of SNPs in a large set of candidate genes with complications of TD. As detailed in the methods of Al-Kateb et al. [008], tagging SNPs within 5kb flanking either side of each of 0 candidate genes were selected not to be in strong linkage disequilibrium and to have MAF greater than 5%. Study participants were genotyped by a custom Illumina GoldenGate Beadarray assay, and data were subected to standard QC procedures. This set of candidate genes includes several recently reported to be associated with high-density lipoprotein (HDL), a measure of blood cholesterol level, in the general population by Teslovitch et al [0]: CETP, APOB, IRS, LPL, ABCA, APOA, LIPC, LCAT, MC4R. It is therefore of interest to assess these and other candidate genes for association in a population with TD. The dataset we analysed for genetic associations with the quantitative HDL trait consists of 36 white probands with genotyping data for 3 SNPs (MAF>5%) in 83 candidate genes (with SNPs per gene). The number of SNPs genotyped per gene ranges from to 47, with a median value of 4. We applied the MLC tests and other gene-based tests to the logged and centered HDL measures obtained at the DCCT baseline assessment. For each gene, we constructed bins using a r threshold value of 0.3, and obtained bin sizes ranging from to 7 with a median of. Out of 83 genes, had at least one test that met a liberal significance criterion of p-value < 0.03, with of these detected by the MLC method (Table S). As expected, there is strong concordance between MLC-B and MLC-Z, and between LC-B and LC-Z. The well-known HDLrelated gene CETP shows genome-wide significant p-values for all the methods except MinP-J, but for the remaining 8 genes, none of the methods reach the Bonferroni adusted p-value criteria of 3x0-4. The ten SNPs in the CETP region cluster into five bins (Table and Figure ). Two of the SNPs with strong marginal signals (SNP4, SNP5) are highly correlated (r=0.7) and consequently the SNP5 signal disappears in oint regression of all ten SNPs. SNP4 (rs709) has been reported as associated with the HDL trait [Enquobahrie et al., 008; Asselbergs et al., 0;Wu et al., 03]. The PC80 test p-value (4 df) is the smallest with the MLC-B and SSB tests yielding similar small p-values for the CETP region. Simulation Studies Analysis of non-causal SNPs Design and Methods To rigorously compare the methods, and to better understand the data and model characteristics that influence the performance of the methods, we conducted a series of simulation studies based on observed human genotypes. Trait data were simulated under null and various 7

8 alternative models using genotype data derived from the HapMap Phase III Asian population (JPT and CHB). To compare the performance of various gene-based tests under various realistic gene structures, we selected 000 genes as follows. A list of genes across autosomes was obtained from the UCSC genome annotation database for NCBI hg8 Build 36. ( database/). Among 6,54 genes with at least one SNP present in the HapMap Asian data, there were 8883 genes with size 4~30 after excluding rare and low frequency SNPs (MAF <0.05) and pruning SNPs in complete LD. From these, 000 genes were randomly selected for inclusion in the simulation studies. For each gene, 000 replicated data sets of genotypes and trait values were generated for 000 individuals under each trait model scenario. To maintain the gene structure observed in the HapMap data, we generated genotype data for 000 individuals using the method of randomly pairing haplotypes from the haplotype pool (for each gene) obtained from the phased genotype data. Based on the genotypes for each individual, a quantitative trait Y was generated under an additive genetic model for a gene with C causal SNPs such that where Y = C i= b i G i + ε ε ~ N(0, σ ), b i is the effect of i th causal SNP, and G i is the number of minor alleles at the i th causal SNP, i.e. G i can be 0,, or. We considered six different quantitative trait models for each gene, including 0, or causal SNPs per gene (Table ). Under each model, the causal SNPs were randomly selected from the set of SNPs for each gene. Under the null hypothesis of no gene effect, all b i were specified to be equal to 0. Under the alternative hypothesis of genetic association, the effect size b i and error variance σ were chosen for each genetic model to maintain Wald test average power in the range 5-80% at a sample size n=000. At this stage, the causal SNPs were not included in the gene-based regression analysis, so associations detected were indirect through the natural LD structure of each gene. The LD structure among the SNPs within a gene therefore arose entirely from the inherent structure in the HapMap haplotype data. Choice of r -threshold Values for the MLC bin construction We conducted a preliminary simulation to evaluate how MLC test power depends on the r - threshold values, attempting to find r -threshold values that usually produced the highest power with best power improvement. For 000 genes selected for simulation study, we compared power at eight different r -threshold values (0.~ 0.8) for five trait models. Considering both the frequency of occurrence of best power (Figure S) and the mean power improvement (Figure S) at the optimizing r -threshold value, we concluded that a threshold from 0.3~0.6 is a reasonable choice for dimension reduction by the binning procedure. In the subsequent simulation studies, we applied an r - threshold of 0.5 in the MLC test evaluations. Type I Error Evaluation Type I error rates of MLC tests and other gene-based tests were estimated for each of 000 genes at two nominal significance levels (α=0.05 and α=0.0). For the maority of the 000 8

9 genes, the estimated type I error rate is within the range of the nominal significance level expected in 000 replicates, with 70-87% within 0.04~0.06 for α=0.05 and 9-98% within 0.003~0.07 for α=0.0 (Table S). The average of the estimated Type I error rates over 000 genes is closest to nominal for the MLC, MinP-J, PC80, and LC tests, while the averages observed for the Wald and SSB tests tend to be inflated, and the distribution for MinP-M is skewed toward elevated values. Power Comparison Between MLC Tests and Other Gene-based Tests The power of each gene-based test was estimated using the critical values for α = 0.05 and α = 0.0 from the asymptotic null distribution for each statistic and averaged across all 000 genes under each trait model (see Table 3, Table S3 for power at α = 0.05, α = 0.0 respectively). The 95% confidence interval (CI) for average MLC test power (both MLC-B and MLC-Z) is strictly higher than that of Wald test power for trait Models -3 in which the causal SNPs are deleterious (Table 3). In Figure (a) which compares power for each of the 000 genes, MLC test power is nearly always above Wald test power for Models -3, and mostly above for Models 4 and 5 in which one causal SNP is deleterious and the other is protective. This suggests consistent benefits of the reduced df MLC-B statistic compared to the Wald statistic, while the exceptions with low power are likely due to difficulties in bin construction and/or in effective recoding within a particular gene bin when positively correlated SNPs have opposing effects. On average, the power of the LC and MinP-J tests is low under all models (Table 3). Although there are more than a few genes where LC-B achieves better power than the Wald and MLC-B statistics (Figure (a)), genes with lower power for the LC statistic predominate, and the power loss is usually substantial when it occurs. The 95% CIs for average MLC power overlap substantially with those for MinP-M and PC80 and tend to be higher than those for the SSB test (Table 3). However, as evident in the gene-bygene comparisons in Figure (b), MLC-B often exhibits substantially better power than MinP and PC80, especially at lower power levels (eg. Models, 4, and 5), with comparable or only slightly lower power than MinP otherwise. This latter difference may be exaggerated by the occurrence of elevated type I error in MinP-M (Table S). Additional comparisons among MLC-B, SSB and Wald tests indicate that SSB can sometimes perform better than Wald and MLC, but there are many genes where SSB loses substantial power compared to MLC and Wald (Fig S3). Analytic Power Comparisons Analysis of causal and neutral SNPs To evaluate the performance of MLC-B, Wald, MinP and LC-B tests under several trait models with the causal SNPs included in the regression analysis, we analytically derived their asymptotic distributions using a simplified genotype structure. Under the alternative, the test statistics follow non-central chi-square distributions with the non-centrality parameter as the expected statistic under a trait model. Let the critical value for a significance level α obtained from m df chi-square distribution be c where P χ > α. If we denote the expected MLC-B, Wald, and α } = m,α { m c m, LC-B and statistics as λ, λ, and λ, their power is then φ = P χ > c ), M M W L M ( L, λ L, α 9

10 φ = P χ > c ), and φ = P χ c ) with L, K and df respectively. We give detailed W W ( K, λ K, α ( L, λ >, α formulations of λ, λ, and λ in terms of trait model, allele frequencies and linkage M W disequilibrium structure in the Supplementary Methods. We obtained adusted critical values for MinP-J and MinP-M statistics corresponding to a significance level α from the multivariate normal distribution with mean vector of 0's and the covariance matrix for Z = Z, Z,, Z ) ( K based on Conneely and Boehnke [007]. Power is φ = P U > c ) where U is the MinP-M MinP or MinP-J statistic and c U,α, is the corresponding critical value. L L ( U,α Performance of MLC Test with LD Patterns from HapMap Data We compared MLC-B, Wald, and LC-B tests under realistic genotype structures based on the same 000 genes in the HapMap Asian population used for the simulation studies. We assumed the causal SNPs are included in the regression analysis and calculated power under Models ~3 (Table ), except that in these evaluations the error variance σ was set to yield 60% power for the Wald test under the given trait model (Figure 3, Table S4). The power values of LC-B for 000 genes under all trait models fluctuated around 60% depending on the LD structure of each gene, but were usually less than 60% under Model 3 in which the two causal SNPs are in different bins. Similarly to the findings from simulation studies (which did not include the causal SNPs in the model), MLC-B maintains similar or higher power than the Wald test for most cases, and the fraction of genes in which MLC-B loses substantial power compared to Wald tests is small: %, 3%, 5% for Models, and 3 respectively. Effect of Linkage Disequilibrium and Number of Causal SNPs To analytically evaluate the effect of LD, we examine simple trait models with one or two causal SNPs and one or two neutral SNPs across a range of allele frequencies (0. to 0.5) and pairwise SNP correlations (-.0 to.0 subect to allele frequency constraints). To evaluate the effect of the number of SNPs, we examine trait models with 0 SNPs, all with MAF=0.3, and one to five causal SNPs. These analytic studies focus on the LC-B and MLC-B tests in comparison to the Wald and MinP tests (see Table S5.A for trait model details). The performance of LC-B is sensitive to within-gene correlation and the underlying causal model. For one causal SNP and high positive correlation with neutral SNPs, LC-B power can exceed Wald power, but it never does better than MinP-M (Figure S4, S5). For two positively correlated deleterious causal SNPs, the LC-B test can be more powerful than the Wald and MinP tests (Figure 4(a), S6). On the other hand, for negative correlations, the coding correction algorithm changes the direction of one causal effect resulting in power loss for LC-B. The power of MLC-B depends on within-gene correlation and the underlying trait model, but is less sensitive to the direction of the causal effects; here we assume that two SNPs are grouped into a bin when their correlation is above 0.6. For one causal SNP and correlation high enough to create a bin, the MLC test power exceeds that of the Wald test due to the reduction in df (Figure S7). For a trait model including two causal SNPs in different bins and one neutral SNP in the 0

11 same bin as the first SNP (r = 0.8), MLC test power consistently exceeds Wald test power regardless of the direction of causal effects. When the causal SNPs are positively correlated, the LC-B test can have somewhat higher power than the MLC-B test (Figure 4(a)). However, when positively correlated causal SNPs have coefficients with opposing effects, the pattern is reversed (Figure 4(b)). Overall the MLC test performs well regardless of the direction of causal effects because it is affected only by allele recoding within a bin. When multiple independent causal SNPs and correlated neutral SNPs are analysed together, we expect improvement in the performance of MLC relative to other tests. Under a trait model with five bins each consisting of two SNPs and no more than one causal SNP per bin (Table S5.B), MLC test power improves faster than Wald and MinP-M power as the within-bin correlation r and the number of causal SNPs increases (Figure 5). For r 0.4, MLC has higher power than MinP and for r 0.6, MLC is as or more powerful than the Wald test. Discussion MLC tests can be seen as a compromise between df linear combination tests and multi df Wald tests, resulting in power improvement and better robustness to varying gene structures and underlying trait models. By restricting the linear combination strategy within a bin of closely and positively correlated SNPs, we aim to minimize the chance that opposing genetic effect estimates will reduce the overall signals while still improving power through df reduction. By means of extensive simulation and numerical studies performed under various trait models for each of 000 representative genes, we evaluated the performance of the MLC test in two general situations. In one, we set up the analysis model to exclude the causal SNPs so as to imitate the complexity of typical GWAS analysis of mainly tag SNPs. In the second, to imitate the settings of GWAS imputation or dense genomic data such as exome array data or WGS, we included both causal and non-causal SNPs. Because our gene panels were selected to represent the overall distribution of gene structure across the genome, the average power and good performance of MLC tests can be expected to represent typical performance in applications. In practical application of regression-based tests, some computational issues need to be considered. First, depending on the linkage disequilibrium pattern in the data set, the oint analysis model using multi-snp regression may encounter linear dependencies among the SNPs and nearsingularity in the variance-covariance matrix. This can arise due to complete pairwise LD between SNPs or higher order multi-snp linear dependence. Most statistical software performing linear regression can deal with the singularity and automatically remove covariates to obtain a nonsingular matrix. By removing the SNPs that are redundant (in complete LD with some other SNP), the possible singularity problem can usually be resolved prior to analysis. Depending on sample size, regression analysis using low frequency SNPs may be problematic. However, if a legitimate regression analysis can be performed for low frequency SNPs, the MLC tests can always be constructed. In this study we successfully used LDSelect to construct bins for the MLC method. Grouping correlated SNPs into bins is an extra procedure and, in principle, any clustering algorithm for

12 correlated SNPs can be used. The SNPs in a bin need not be physically consecutive and although clustering is indirectly related to haplotype structure, the bin itself does not necessarily have to have this specific biological meaning. Clustering of SNPs based on haplotype structure or functional annotation, or by alternative clustering algorithms that can provide some biological interpretation, might be suitable for use with the MLC test and worthwhile to develop in future study. In that instance, the bin-specific statistics may be appropriate to localize the association signal. Based on the entirety of the simulation and numerical results, we conclude that MLC tests can be expected to perform better than Wald, MinP-J, SSB, and LC tests when SNPs are in linkage disequilibrium. Taking the ratio of the number of SNPs over the number of bins per gene as an indirect measure of within-gene LD, because the number of bins is reduced by high correlation between SNPs in a gene, simulation results generated under a model with one or two causal variants suggest that MLC tests can perform better than MinP-M and PC80 at ratios of -.5 while the three statistics performed similarly at higher ratios (Figure S8).. Our conclusions are similar to those of previous authors who found multilocus reduced df tests to be more powerful than MinP-M SNP set tests when SNPs are in LD [Kwee et al 008; Lin et al 0]. The Wald test is robust to the presence of multiple causal variants, but does not utilize the LD information efficiently whereas the MinP test does use LD information to adust p-values but is prone to inflated type I error and is more suitable for scenarios with a single causal SNP (Taub et al., 03). Principal-component-based methods efficiently reduce the df based on the LD structure, and therefore showed good performance in our study and in Peterson et al. [03]. However, each principal component does not have a ready biological interpretation, and some information may be lost by choosing a subset of the principal components. In contrast, the bins in the MLC statistics maintain meaning as a small LD block, and the df are effectively reduced using LD information while averaging correlated SNPs. We therefore recommend the MLC test for genebased analysis of genetic association data as a powerful and robust multi-marker method. In our evaluations, we pre-specified gene-based common-frequency SNP sets. To apply genebased association methods to a whole genome scan, it is necessary to assign the genomic variants to analysis units such as genes, LD blocks, or other biological regions. Furthermore, the number of SNPs that can be ointly analyzed in a statistic may be limited, requiring a windowing approach to determine the analysis units. For case-control gene-based methods, Petersen et al. [03] investigated the influence of non-causal SNPs within the analysis unit, assuming various LD relationships within and between causal and non-causal SNPs. Their evaluations included: a p- value adustment type method similar to the MinP tests, a likelihood-ratio test from principalcomponent-based regression analysis, and a likelihood-ratio test from genotype-based regression analysis which is equivalent to the Wald test. They conclude that if causal SNPs comprise less than 0-5% of the SNPs analysed ointly, gene-based tests can be ineffective, and therefore suggest that only intergenic SNPs that are in high LD be included in the gene-based analysis unit.

13 In their simulation and numerical power analysis of rare variant tests, Derkach et al. [04] found that LD between causal SNPs or causal and neutral SNPs can increase or decrease the power of linear combination tests, which is consistent with our results presented in Figure 4 and Figures S4-S7. Petersen et al. [03] also observed that LD between neutral SNPs can affect the power of multi-marker tests, similar to our results in Figure S7. With clustering of correlated SNPs and application of the coding correction, we were able to retain the benefits of LD, resulting in improved and more robust performance of MLC tests. With accumulating evidence that LD between causal and non-causal SNPs influences the power of gene-based tests, MLC methods specifically designed to utilize the LD information should be considered as a powerful alternative for analysis of dense genetic association data with highly correlated SNPs. Acknowledgement DCCT/EDIC study data: The Diabetes Control and Complications Trial (DCCT) and its followup the Epidemiology of Diabetes Interventions and Complications (EDIC) study were conducted by the DCCT/EDIC Research Group and supported by National Institute of Health grants and contracts and by the General Clinical Research Center Program, NCRR. This manuscript was not prepared under the auspices of the DCCT/EDIC study and does not represent analyses or conclusions either of the DCCT/EDIC study group or the National Institutes of Health. Funding: This work was supported by the Canadian Institutes of Health Research operating grant MOP-8487, and National Research Foundation of Korea (NRF) grant NRF-0RAA3048. References Al-Kateb H, Boright AP, Mirea L, Xie X, Sutradhar R, Mowoodi A, Bhara B, Liu M, Bucksa JM, Arends VL, Steffes MW, Cleary PA, Sun W, Lachin JM, Thorner PS, Ho M, McKnight AJ, Maxwell AP, Savage DA, Kidd KK, Kidd JR, Speed WC, Orchard TJ, Miller RG, Sun L, Bull SB, Paterson AD; Diabetes Control and Complications Trial/Epidemiology of Diabetes Interventions and Complications Research Group Multiple superoxide dismutase /splicing 9 factor serine alanine 5 variants are associated with the development and progression of diabetic nephropathy. Diabetes 57:8-8. Asselbergs FW et al. 0. Large-scale gene-centric meta-analysis across 3 studies identifies multiple lipid loci. Am J Hum Genet 9: Asimit JL, Yoo YJ, Waggott D, Sun L, Bull SB Region-based analysis in GWA of FHS blood lipid traits. BMC Proceedings Suppl 7:S7. Ayers KL, Cordell HJ. 03. Identification of grouped rare and common variants via penalized regression. Genet Epidemiol 37: Carlson CS, Eberle MA, Rieder MJ, Yi Q, Kruglyak L, Nickerson DA Selecting a maximally informative set of single-nucleotide polymorphisms for association analysis using linkage disequilibrium. Am J Hum Genet 74:

14 Clayton D, Chapman J, Cooper J Use of unphased multilocus genotype data in indirect association studies. Genet Epidemiol 7: Conneely KN, Boehnke M So many tests, so little time! Rapid adustment of P-values for multiple correlated tests. Am J Hum Genet 8: Derkach A, Lawless JF, Sun L. 04. Assessment of pooled association tests for rare variants within a unified framework. Statistical Science 9():30-3. Diabetes Control and Complications Trial Research Group (DCCT) The effect of intensive treatment of diabetes on the development and progression of long-term complications in insulin-dependent diabetes mellitus. N Engl J Med 39: Epidemiology of Diabetes Interventions and Complications (EDIC) Design, implementation, and preliminary results of a long-term follow-up of the Diabetes Control and Complications Trial cohort. Diabetes Care :99-. Enquobahrie DA, Smith NL, Bis JC, Carty CL, Rice KM, Lumley T, Hindorff LA, Lemaitre RN, Williams MA, Siscovick DS, Heckbert SR, Psaty BM Cholesterol ester transfer protein, interleukin-8, peroxisome proliferator activator receptor alpha, and Toll-like receptor 4 genetic variations and risk of incident nonfatal myocardial infarction and ischemic stroke. Am J Cardiol 0: Gauderman WJ, Murcray C, Gilliland F, Conti DV Testing association between disease and multiple SNPs in a candidate gene. Genet Epidemiol 3: Ionita-Laza I, Lee S, Makarov V, Buxbaum JD, Lin X. 03. Sequence Kernel Association Tests for the Combined Effect of Rare and Common Variants. Am J Hum Genet 9: James S. 99. Approximate multinormal probabilities applied to correlated multiple endpoints in clinical trials. Stat Med 0:3-35. Kraft P, Cox DG Study designs for genome-wide association studies. Adv Genet 60: Kwee LC, Liu D, Lin X, Ghosh D, Epstein MP A powerful and flexible multilocus association test for quantitative traits. Am J Hum Genet 8: Lee S, Abecasis GR, Boehnke M, Lin X. 04. Rare-variant association analysis: Study designs and statistical tests. Am J Hum Genet 95: 5-3. Lin X, Cai T, Wu MC, Zhou Q, Liu G, Christiani DC, Lin X. 0. Kernel machine SNP-set analysis for censored survival outcomes in genome-wide association studies. Genet Epidemiol 35: Liu JZ, Mcrae AF, Nyholt DR, Medland SE, Wray NR, Brown KM, AMFS Investigators, Hayward NK, Montgomery GW, Visscher PM, Martin NG, Macgregor S. 00. A versatile gene-based test for genome-wide association studies. Am J Hum Genet 87: Madsen BE, Browning SR A groupwise association test for rare mutations using a weighted sum Statistic. PLoS Genet 5:e Neale BM, Sham PC The future of association studies: gene-based analysis and replication. Am J Hum Genet 75:

15 O Brien PC Procedures for comparing samples with multiple endpoints. Biometrics 40: Pan W Asymptotic tests of association with multiple SNPs in linkage disequilibrium, Genet Epidemiol 33: Petersen A, Alvarez C, DeClaire S, Tintle NL. 03. Assessing methods for assigning SNPs to genes in gene-based tests of association using common variants. PLoS One 8:e66. Pocock SJ, Geller NL, Tsiatis AA The analysis of multiple endpoints in clinical trials. Biometrics 43: Risch N, Merikangas K The future of genetic studies of complex human diseases. Science 73: Schaid DJ, McDonnell SK, Hebbring SJ, Cunningham JM, Thibodeau SN Nonparametric tests of association of multiple genes with human disease. Am J Hum Genet 76: Stram DO, Wei LJ, Ware JH Analysis of repeated ordered categorical outcomes with possibly missing observations and time-dependent covariates. J Am Stat Assoc 83: Taub MA, Schwender HR, Youkin SG, Louis TA, Ruczinski I. 03. On multi-marker tests for association in case-control studies. Front Genet 4:5. Wang T, Elston RC Improved power by use of a weighted score test for linkage disequilibrium mapping. Am J Hum Genet 80: Wang X, Morris NJ, Zhu X, Elston RC. 03. A variance component based multi-marker association test using family and unrelated data. BMC Genet 4:7 The Wellcome Trust Case-Control Consortium Genome-wide association study of 4,000 cases of seven common diseases and 3,000 shared controls. Nature 447: Wu Y et al. 03. Trans-ethnic fine-mapping of lipid loci identifies population-specific signals and allelic heterogeneity that increases the trait variance explained. PLoS Genet 9(3): e Yoo YJ, Sun L, Bull SB. 03. Gene-based multiple regression association testing for combined examination of common and low frequency variants in quantitative trait analysis. Front Genet 4:33. 5

16 Table. Application to DCCT/EDIC Genetics Study: Regression analysis and gene-based analysis results for association of the HDL trait with 0 SNPS in the CETP gene. SNP Bin alloca -tion MAF Regression Analysis Results Joint analysis Marginal analysis Beta P-value LC-B per bin Wald per bin Beta P-value SNP SNP (P=0.0003) (P=0.000) SNP SNP* * 0.73 SNP (P=0.6) 7.3 (P=0.068) *.66E SNP SNP E- SNP (P=0.09) (P=5.00E-8) E-0 SNP SNP (P=0.690) 0.08 (P=0.893) Gene-based Analysis Results 0.6 (P=0.690) 0.08 (P=0.893) Joint analysis Marginal analysis **Wald MLC-B MLC-Z MinP-J PC80 LC-B LC-Z SSB MinP-M Stat df P.8E- 3.6E- 6.76E E-.09E-08.7E-08.93E-08.40E-0 *The risk and base allele coding was reversed for SNP by application of the coding correction algorithm by Wang and Elston [007]. The beta coefficients for SNP in this table are the values after coding correction. **Wald : Generalized Wald test MLC-B : Multi-bin linear combination test using beta coefficients MLC-Z : Multi-bin linear combination test using Z statistics LC-B : Linear combination test using beta coefficients LC-Z : Linear combination test using Z statistics MinP-J : Minimum P test based on oint regression analysis MinP-M : Minimum P test based on marginal regression analysis PC80 : Global test based on regression using the minimum number of principal components covering 80% variances SSB : Squared sum of marginal beta coefficients (Pan, 009) 6

17 Table. Gene-based Genetic Models for Simulation Study of a Quantitative Trait Model Name Description Trait Model Parameters * Null Model No SNP association b = 0, σ = 5 Model One causal SNP within a gene b =, σ = 5 Model Two causal SNPs in the same bin, both deleterious b =, b =, σ = 0 Model 3 Two causal SNPs in different bins, both deleterious b =, b =, σ = 8 Model 4 Two causal SNPs in the same bin, one deleterious and one protective b =, b =, σ = 5 Two causal SNPs in different Model 5 bins, one deleterious and one b =, b =, σ = 5 protective C *The trait model is where ε ~ N(0, σ, C is the number of causal SNPs, b i is the Y = i= b i G i + ε ) effect of i th causal SNP, and G i is the number of causal alleles for the i th causal SNP. 7

18 Table 3. Simulation Study Results: Average empirical power of MLC test statistics and other gene-based statistics for 000 genes at nominal level α = 0.05 (N=,000 simulation replicates used to estimate power for each gene) Summary Model Model Model 3 Model 4 Model 5 Wald * Average % CI (0.608,0.639) (0.58,0.56) (0.47,0.44) (0.376,0.4) (0.764,0.795) MLC-B Average % CI (0.669,0.70) (0.585,0.6) (0.483,0.5) (0.377,0.43) (0.79,0.8) MLC-Z Average % CI (0.668,0.70) (0.584,0.69) (0.48,0.509) (0.377,0.43) (0.787,0.89) MinP-M Average % CI (0.67,0.706) (0.6,0.636) (0.49,0.59) (0.37,0.47) (0.783,0.85) PC80 Average % CI (0.658,0.694) (0.59,0.68) (0.484,0.53) (0.357,0.404) (0.756,0.79) SSB Average % CI (0.63,0.667) (0.56,0.597) (0.45,0.48) (0.364,0.4) (0.743,0.778) LC-B Average % CI (0.48,0.53) (0.47,0.53) (0.385,0.48) (0.57,0.99) (0.56,0.605) LC-Z Average % CI (0.476,0.59) (0.457,0.498) (0.377,0.4) (0.54,0.95) (0.555,0.6) MinP-J Average % CI (0.94,0.) (0.95,0.5) (0.6,0.83) (0.93,0.5) (0.3,0.36) *For description of tests, see the footnote for Table 8

19 Figure. Clustering of SNPs in DCCT/EDIC CETP gene data by applying LDSelect algorithm to linkage disequilibrium (r) network. Edges with r <0.3 are removed. Edges with 0.3 r <0.55 are shown in dashed line. SNPs in the same bin have the same color. The r cutpoint value for LDSelect algorithm was set at r =0.3 ( r 0.55). The minor and maor allele coding was reversed for SNP (starred) after application of the coding correction algorithm by Wang and Elston [007]. The r values reflect the coding correction. 0.7 SNP SNP SNP5 0.5 SNP SNP SNP SNP* SNP6 0.6 SNP SNP3 SNP

20 Figure. Simulation Study Results: Comparisons of empirical power for each of 000 genes under Models -5. For each trait model, the genes are ordered by the rank of Wald test power among 000 genes. (a) MLC-B, LC-B and Wald, (b) MLC-B, PC80 and MinP-M. (a) (b) 0

21 Figure 3. Comparison of analytic power of MLC-B, Wald, and LC-B tests for each of 000 genes under Models -3. For each trait model, the genes are ordered by the rank of MLC-B power among 000 genes.

22 Figure 4. Analytic study results for two causal SNPs and one neutral SNP: the power of MLC-B test (red dots), Wald (black dots), and LC-B (blue dots) tests, are calculated for two trait models Y β X + β X + β + ε where β = β =, β 0 (upper plots) and = 3X 3 β, (lower plots) with, for a range of allele =, β =, β3 = 0 ε ~ N(0, σ ) σ = frequencies between 0. and 0.5. The correlation between causal SNP and neutral SNP3 is fixed as 0.8 (SNPs & 3 are in the same bin) and the pairwise correlation between the two causal SNPs and is set at r. The range of correlation r between -0.5 and +0.5 is restricted, depending on the marginal allele frequencies. For the LC test, the sign of a beta coefficient and some negative correlations are reversed with application of the allele coding correction. The allele coding correction does not affect the MLC test since there are no negative correlations present within bins. (a) Upper Panel: Two deleterious causal SNPs 3 = (b) Lower Panel: Two causal SNPS, one deleterious and one protective

23 Figure 5. Analytic study results for ten SNPs including one to five causal SNPs: comparison of power of MLC-B, Wald, LC-B and MinP-M for a range of correlation r between causal and neutral SNP pairs as the number of causal SNPs increases from to 5. Ten SNPs were divided into five bins, each with two SNPs assumed to have a fixed correlation value r (0~0.8). The allele frequency of all SNPs is fixed to be 0.3. In each bin, only one SNP can be assigned to be causal. The power of gene-based tests from analysis of all ten SNPs is obtained under the alternative model in which the beta coefficient for a causal SNP is and the error variance is σ = 00. 3

24 Appendices for "Multi-bin multi-variant tests for gene-based linear regression analysis of genetic association Y. J. Yoo Department of Mathematics Education and Interdisciplinary Program in Bioinformatics Seoul National University Seoul, South Korea L. Sun Department of Statistical Sciences and Division of Biostatistics, Dalla Lana School of Public Health University of Toronto Toronto, Canada J. Poirier Lunenfeld-Tanenbaum Research Institute of Mount Sinai Hospital Toronto, Canada S. B. Bull Department of Statistical Sciences and Division of Biostatistics, Dalla Lana School of Public Health University of Toronto and Lunenfeld-Tanenbaum Research Institute of Mount Sinai Hospital 60 Murray Street, Box No. 8, Toronto, ON, M5T 3L9, Canada 30 December 04 Methods A. SSB The SSB statistic recommended by Pan [009] is a quadratic statistic calculated as the squared sum of beta coefficients obtained from the marginal regressions of each SNP genotype variable. That is, SSB = ˆ β M T ˆ β M = which is asymptotically distributed as a weighted sum of K independent df chi-square distributions with an approximated p- value where the weights are eigenvalues of the covariance matrix [Pan, 009]. Therefore, the SSB statistic is constructed using beta coefficients only but the null distribution reflects the correlation between SNPs. B. PC80 Gene-based statistics also can be formulated using multiple regression analysis of the principal components of the multiple SNP genotypes. Gauderman et al. [007] suggested that K i= ( ˆ β M i ) 4

25 principal components be selected using a threshold criteria such that greater than 80% of the variance is explained by the minimum number of principal components, assuming the principal components are ordered by variance from the largest (P ) to smallest (P K ). If P,..., P S explain greater than 80% of the total SNP variance, then the regression analysis is performed using these principal components with the model Y * * P * P * = β P 0 + β + β + + β S S. Denoting the variance-covariance matrix of ˆ * * * (,,, * * β = β β β S ) as Σ B, we have PC80 = * T * ˆ * β Σ ˆ B β which follows an asymptotic chi-squared distribution with S df. Since the covariances of different principal components are zero, the is a diagonal matrix. S (β Therefore, PC80 can be formulated as PC80= * ) * where b is the variance of β. = b C. Estimation of non-centrality parameter for alternative models Denote observed genotypes as X and a quantitative trait as for i th i, X, i, X Y ik i subect among n unrelated individuals. Assume a trait model with regression parameters β = ( β, β,, βk ) corresponding to the genotypes of K SNPs: Y β + β X + β X i + + β X + ε () = 0 K K where ε ~ N(0, σ ). For the th SNP, the genotype X i is coded as the number of minor alleles, i.e., 0,, or. Gene-based statistics are constructed from the beta estimates ˆ β = ( ˆ β ˆ and the estimated covariance matrix of.,, β ) Σ K B βˆ Under Hardy-Weinberg equilibrium, we assume that the underlying genotype structure is such that is the allele frequency of the th SNP and the linkage disequilibrium measure r k p p k p pk between SNP and SNP k is defined as rk = for, k =,, K p p p )( p ) where p P( X =, X = ). Let v = p p ). Then the mean and variance for X i are k = k ( and v respectively, and the covariance of X i and X ik is r v v. The expected value of X i is equal to np and that of X i X ik is equal to n ( r ), k v vk + p pk which results in a covariance matrix Σ for the genotypes consisting of v in the th diagonal position and r v v as the (,k)-element. Assuming a fixed value for Σ, we use the covariance matrix of ˆ ˆ ˆ ˆ σ β = ( β, β,, β K ): Σ X which is equal to Σ under n B large sample theory, to derive the noncentrality values for the distributions of the Wald. LC-B, and MLC-B statistics. (i) Wald statistic (K df ) The noncentrality parameter under the trait model (), which depends on values assumed for the genotype covariances, is K K K n λ =. W v β + β β k r k v v k σ = k = k > * Σ B k ( k p k k n i= X n i= k k X 5

26 (ii) LC-B statistic ( df ) We denote the half of the sum of all the elements of among X, X K ) as K T S = J Σ X J = v + k= where J = (,,,) (all variances and covariances The weight for the th K SNP in the LC-B statistic is then w, i.e. = v + rk v vk S k=, k the relative sum of the variance of X and the covariances between X and other SNPs over the sum of all the variances and covariance among X, X K. Then the LC-B noncentrality parameter under the trait model () is λ L (iii) MLC-B statistic (L df ) Let be the (,l)-element (for th SNP and l th bin) of the weight matrix W w l T S = ( Σ B J )( J Σ B J =,,, K l,,, L for and. Also let correspond to the l th element of T W βˆ, s = ( w ) V + w w CV where is the variance of βˆ, i.e. the (,) element of ΣB, and CV k is the variance between βˆ and, i.e. the (,k) element of Σ, and = w CV. Then the MLC-B statistic noncentrality parameter under the trait model () is L L L n λm = ( ) sltl + slmtltm σ l= m= l> m For an simplified example of W S, if SNPs S, S, S 3, and S 4 belong to three bins B B, B 3, and B respectively and if we assume zero correlation between SNPs in different bins, then 0 0 w and 0 0 J = W S = w4 0 0 where w ˆ ˆ ˆ ˆ ˆ and so that = ( σ + σ 4 ) /( σ + σ 4 + σ 4 ) w ˆ ˆ ˆ ˆ ˆ 4 = ( σ 4 + σ 4 ) /( σ + σ 4 + σ 4 ) w where is the estimated covariance of X i and X and is the estimated + w4 = σˆ i σˆ i variance of X i for i, =,, K. Σ X K K k= l> k r lk v v K K K K ns ns = w β w β + w wk β βk σ = σ = k= k> = ) B S lm K K s w = k= l l K = km l k = l w l = l k K < k l kl k T = V K β βˆk 6

27 Tables and Figures Table S. Application to DCCT/EDIC study: Candidate gene analysis for HDL phenotype (selected genes with p-values <0.03). The number of SNPs per gene corresponds to the df for the Wald statistic, and the number of bins corresponds to the df for the MLC-B and MLC-Z statistics. Gene # SNP # bin Wald * MLC-B MLC-Z MinP-M PC80 SSB LCB LCZ MinP-J CETP 0 5.8E- 3.6E- 6.76E-.40E-0.39E-.93E-08.09E-08.7E SLCA CTGF LDLR TIMP APOB NPHS FASLG EDN AGER TNF NDUFS VEGF ABCA INSR UQCRFS AGT CRP FLT COL4A PDGFB

28 *Wald : Generalized Wald test MLC-B : Multi-bin linear combination test using beta coefficients MLC-Z : Multi-bin linear combination test using Z statistics LC-B : Linear combination test using beta coefficients LC-Z : Linear combination test using Z statistics MinP-J : Minimum P test based on oint regression analysis MinP-M : Minimum P test based on marginal regression analysis PC80 : Global test based on regression using the minimum number of principal components covering 80% variances SSB : Squared sum of marginal beta coefficients (Pan, 009) 8

29 Table S. Simulation study results: Distribution and average Type I error rate estimates for 000 genes at nominal levels α = 0.05 and α = 0.0 (N=,000 simulation replicates used to estimate type I error for each of 000 genes) α=0.05 Distribution of Type I error estimates <.04.04~ ~ ~.06 >.06 Average SD 95%CI Wald (0.0509,0.058) MLC-B (0.0503,0.05) MLC-Z (0.0503,0.05) MinP-M (0.0533,0.0546) PC (0.050,0.05) SSB (0.05,0.059) LC-B (0.0496,0.0504) LC-Z (0.0497,0.0505) MinP-J (0.0497,0.0509) α=0.0 Distribution of Type I error estimates < ~ ~.03.03~.07 >.07 Average SD 95%CI Wald (0.004,0.008) MLC-B (0.00,0.006) MLC-Z (0.003,0.007) MinP-M (0.09,0.04) PC (0.003,0.006) SSB (0.005,0.009) LC-B (0.0099,0.003) LC-Z (0.0,0.004) MinP-J (0.003,0.008) 9

30 Table S3. Simulation study results: Average empirical power of MLC test statistics and other gene-based statistics for 000 genes at nominal level α = 0.0 (N=,000 simulation replicates used to estimate power for each gene) Model Model Model 3 Model 4 Model 5 Wald Average % CI (0.43,0.454) (0.347,0.378) (0.3,0.54) (0.7,0.35) (0.69,0.666) MLC-B Average % CI (0.50,0.536) (0.4,0.447) (0.9,0.35) (0.79,0.34) (0.67,0.708) MLC-Z Average % CI (0.5,0.534) (0.4,0.446) (0.89,0.34) (0.79,0.34) (0.668,0.706) MinP-M Average % CI (0.58,0.555) (0.439,0.476) (0.30,0.38) (0.77,0.33) (0.663,0.70) PC80 Average % CI (0.50,0.539) (0.46,0.46) (0.9,0.38) (0.66,0.3) (0.635,0.675) SSB Average % CI (0.447,0.484) (0.374,0.409) (0.56,0.8) (0.65,0.3) (0.595,0.636) LC-B Average % CI (0.346,0.388) (0.36,0.364) (0.6,0.53) (0.78,0.8) (0.434,0.479) LC-Z Average % CI (0.34,0.38) (0.34,0.35) (0.9,0.46) (0.74,0.4) (0.48,0.473) MinP-J Average % CI (0.097,0.9) (0.0,0.4) (0.067,0.08) (0.06,0.35) (0.04,0.4) Table S4. Numerical power calculation results: Average power of Wald, LC-B, and MLC-B tests for 000 genes at nominal level α = Power values with or without coding correction are reported. The variance value for each gene under the trait Models -3 (specified in Table ) was chosen to maintain the power of the Wald to be LC MLC Wald No code change Code change No code change Code change Model Average % CI n/a (0.64,0.669) (0.694,0.78) (0.69,0.70) (0.697,0.706) Model Average % CI n/a (0.657,0.685) (0.558,0.596) (0.69,0.703) (0.696,0.706) Model 3 Average % CI n/a (0.3,0.34) (0.447,0.484) (0.659,0.677) (0.664,0.68) 30

31 Table S5: Summary of trait models with analytic power calculations for regressions that include causal SNPs A. Effect of Linkage Disequilibrium Figure #SNPs #causal Disease model Correlation between SNPs Allele frequency Comparison: LC-B, Wald, MinP-J, MinP-M S4 β r =r 3 =r 3 = -.0~ ~0.5 Assuming one causal SNP and one neutral SNP, we =, β = 0 change the correlation between the SNPs and evaluate across a range of MAF values. S5 3 β r =r 3 =r 3 = -.0~ ~0.5 Assuming one causal SNP and two neutral SNPs, we =, β = β3 = 0 change the pairwise correlation among the three SNPs, and evaluate across a range of MAF values. S6 3 β r =r 3 =r 3 = -.0~ ~0.5 Assuming two causal SNPs and one neutral SNP, we = β =, β 3 = 0 change the pairwise correlation among the three SNPs and evaluate across a range of MAF values. S7 (a) 3 β =, β = β3 = 0 r =0.~0.9, r 3 =r 3 =0 0.~0.5 Assuming one causal SNP and two neutral SNPs, we increase the correlation between the causal SNP and one neutral SNP, other correlations are zero (fixed). S7 (b) 3 β =, β = β3 = 0 r =r 3 =0, r 3 =0.~0.9 0.~0.5 Assuming one causal SNP uncorrelated with two neutral SNPs, we increase the neutral SNP correlation, other correlations are zero. 4 (a) 3 β r 3 =0.8, r = r 3 = -.0~.0 0.~0.5 Assuming two causal SNPs and one neutral SNP, = β =, β 3 = 0 with causal effects in the same direction, we change the correlation between the two causal SNPs and fix the correlation between the neutral SNP and one causal SNP. 4 (b) 3 β r 3 =0.8, r = r 3 = -.0~.0 0.~0.5 Assuming two causal SNPs and one neutral SNP, =, β =, β3 = 0 with causal effects in opposite directions, we change the correlation between the two causal SNPs and fix the correlation between the neutral SNP and one causal SNP

32 B. Effect of the Number of Causal SNPs Figure #SNPs #causal Disease models Correlation between SNPs Allele frequency Comparison: MLC-B, Wald, MinP-M 5 0 ~5, others= 0 β = β = β 3 =, other= 0 β = β3 = β5 =, other= 0 β = β3 = β5 = β7 = β = β3 = β5 = β7 = β9 =, other= 0, other= 0 r = r 34 = r 56 = r 78 = r 9,0 = 0, 0., 0.4, 0.6, 0.8 other correlations are all Assuming independent causal SNPs with effects in the same direction, we increase correlation between causalneutral SNP pairs and/or neutral-neutral SNP pairs. 3

33 Figure S. Simulation Study Results: The frequency plot of r -threshold values where the optimum power of MLC-B is achieved. (a) MLC-B - all 000 genes, (b) MLC-B - genes with power improvement > 0. When the optimum occurs at multiple r -threshold values, we take the midpoint as the threshold value that achieves the optimum power. (a) (b) Figure S. Simulation Study Results: Averages of improved power values when the optimum is achieved at different r -threshold values. (a) MLC-B - all 000 genes, (b) MLC-B - genes with power improvement > 0. (a) (b) 33

34 Figure S3. Simulation Study Results: Comparisons of empirical power among MLC-B, SSB and Wald tests for each of 000 genes under Models -5 (as specified in Table ). 34

Association Testing with Quantitative Traits: Common and Rare Variants. Summer Institute in Statistical Genetics 2014 Module 10 Lecture 5

Association Testing with Quantitative Traits: Common and Rare Variants. Summer Institute in Statistical Genetics 2014 Module 10 Lecture 5 Association Testing with Quantitative Traits: Common and Rare Variants Timothy Thornton and Katie Kerr Summer Institute in Statistical Genetics 2014 Module 10 Lecture 5 1 / 41 Introduction to Quantitative

More information

Genotype Imputation. Biostatistics 666

Genotype Imputation. Biostatistics 666 Genotype Imputation Biostatistics 666 Previously Hidden Markov Models for Relative Pairs Linkage analysis using affected sibling pairs Estimation of pairwise relationships Identity-by-Descent Relatives

More information

Relationship between Genomic Distance-Based Regression and Kernel Machine Regression for Multi-marker Association Testing

Relationship between Genomic Distance-Based Regression and Kernel Machine Regression for Multi-marker Association Testing Relationship between Genomic Distance-Based Regression and Kernel Machine Regression for Multi-marker Association Testing Wei Pan Division of Biostatistics, School of Public Health, University of Minnesota,

More information

MODEL-FREE LINKAGE AND ASSOCIATION MAPPING OF COMPLEX TRAITS USING QUANTITATIVE ENDOPHENOTYPES

MODEL-FREE LINKAGE AND ASSOCIATION MAPPING OF COMPLEX TRAITS USING QUANTITATIVE ENDOPHENOTYPES MODEL-FREE LINKAGE AND ASSOCIATION MAPPING OF COMPLEX TRAITS USING QUANTITATIVE ENDOPHENOTYPES Saurabh Ghosh Human Genetics Unit Indian Statistical Institute, Kolkata Most common diseases are caused by

More information

Lecture 9: Kernel (Variance Component) Tests and Omnibus Tests for Rare Variants. Summer Institute in Statistical Genetics 2017

Lecture 9: Kernel (Variance Component) Tests and Omnibus Tests for Rare Variants. Summer Institute in Statistical Genetics 2017 Lecture 9: Kernel (Variance Component) Tests and Omnibus Tests for Rare Variants Timothy Thornton and Michael Wu Summer Institute in Statistical Genetics 2017 1 / 46 Lecture Overview 1. Variance Component

More information

Binomial Mixture Model-based Association Tests under Genetic Heterogeneity

Binomial Mixture Model-based Association Tests under Genetic Heterogeneity Binomial Mixture Model-based Association Tests under Genetic Heterogeneity Hui Zhou, Wei Pan Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN 55455 April 30,

More information

Probability of Detecting Disease-Associated SNPs in Case-Control Genome-Wide Association Studies

Probability of Detecting Disease-Associated SNPs in Case-Control Genome-Wide Association Studies Probability of Detecting Disease-Associated SNPs in Case-Control Genome-Wide Association Studies Ruth Pfeiffer, Ph.D. Mitchell Gail Biostatistics Branch Division of Cancer Epidemiology&Genetics National

More information

Power and sample size calculations for designing rare variant sequencing association studies.

Power and sample size calculations for designing rare variant sequencing association studies. Power and sample size calculations for designing rare variant sequencing association studies. Seunggeun Lee 1, Michael C. Wu 2, Tianxi Cai 1, Yun Li 2,3, Michael Boehnke 4 and Xihong Lin 1 1 Department

More information

NIH Public Access Author Manuscript Stat Sin. Author manuscript; available in PMC 2013 August 15.

NIH Public Access Author Manuscript Stat Sin. Author manuscript; available in PMC 2013 August 15. NIH Public Access Author Manuscript Published in final edited form as: Stat Sin. 2012 ; 22: 1041 1074. ON MODEL SELECTION STRATEGIES TO IDENTIFY GENES UNDERLYING BINARY TRAITS USING GENOME-WIDE ASSOCIATION

More information

Test for interactions between a genetic marker set and environment in generalized linear models Supplementary Materials

Test for interactions between a genetic marker set and environment in generalized linear models Supplementary Materials Biostatistics (2013), pp. 1 31 doi:10.1093/biostatistics/kxt006 Test for interactions between a genetic marker set and environment in generalized linear models Supplementary Materials XINYI LIN, SEUNGGUEN

More information

A Robust Test for Two-Stage Design in Genome-Wide Association Studies

A Robust Test for Two-Stage Design in Genome-Wide Association Studies Biometrics Supplementary Materials A Robust Test for Two-Stage Design in Genome-Wide Association Studies Minjung Kwak, Jungnam Joo and Gang Zheng Appendix A: Calculations of the thresholds D 1 and D The

More information

Proportional Variance Explained by QLT and Statistical Power. Proportional Variance Explained by QTL and Statistical Power

Proportional Variance Explained by QLT and Statistical Power. Proportional Variance Explained by QTL and Statistical Power Proportional Variance Explained by QTL and Statistical Power Partitioning the Genetic Variance We previously focused on obtaining variance components of a quantitative trait to determine the proportion

More information

Combining dependent tests for linkage or association across multiple phenotypic traits

Combining dependent tests for linkage or association across multiple phenotypic traits Biostatistics (2003), 4, 2,pp. 223 229 Printed in Great Britain Combining dependent tests for linkage or association across multiple phenotypic traits XIN XU Program for Population Genetics, Harvard School

More information

Statistical Methods in Mapping Complex Diseases

Statistical Methods in Mapping Complex Diseases University of Pennsylvania ScholarlyCommons Publicly Accessible Penn Dissertations Summer 8-12-2011 Statistical Methods in Mapping Complex Diseases Jing He University of Pennsylvania, jinghe@mail.med.upenn.edu

More information

Model-Free Knockoffs: High-Dimensional Variable Selection that Controls the False Discovery Rate

Model-Free Knockoffs: High-Dimensional Variable Selection that Controls the False Discovery Rate Model-Free Knockoffs: High-Dimensional Variable Selection that Controls the False Discovery Rate Lucas Janson, Stanford Department of Statistics WADAPT Workshop, NIPS, December 2016 Collaborators: Emmanuel

More information

(Genome-wide) association analysis

(Genome-wide) association analysis (Genome-wide) association analysis 1 Key concepts Mapping QTL by association relies on linkage disequilibrium in the population; LD can be caused by close linkage between a QTL and marker (= good) or by

More information

Genotype Imputation. Class Discussion for January 19, 2016

Genotype Imputation. Class Discussion for January 19, 2016 Genotype Imputation Class Discussion for January 19, 2016 Intuition Patterns of genetic variation in one individual guide our interpretation of the genomes of other individuals Imputation uses previously

More information

Comparison of Statistical Tests for Disease Association with Rare Variants

Comparison of Statistical Tests for Disease Association with Rare Variants Comparison of Statistical Tests for Disease Association with Rare Variants Saonli Basu, Wei Pan Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN 55455 November

More information

Cover Page. The handle holds various files of this Leiden University dissertation

Cover Page. The handle   holds various files of this Leiden University dissertation Cover Page The handle http://hdl.handle.net/1887/35195 holds various files of this Leiden University dissertation Author: Balliu, Brunilda Title: Statistical methods for genetic association studies with

More information

Lecture 2: Genetic Association Testing with Quantitative Traits. Summer Institute in Statistical Genetics 2017

Lecture 2: Genetic Association Testing with Quantitative Traits. Summer Institute in Statistical Genetics 2017 Lecture 2: Genetic Association Testing with Quantitative Traits Instructors: Timothy Thornton and Michael Wu Summer Institute in Statistical Genetics 2017 1 / 29 Introduction to Quantitative Trait Mapping

More information

Extending the MR-Egger method for multivariable Mendelian randomization to correct for both measured and unmeasured pleiotropy

Extending the MR-Egger method for multivariable Mendelian randomization to correct for both measured and unmeasured pleiotropy Received: 20 October 2016 Revised: 15 August 2017 Accepted: 23 August 2017 DOI: 10.1002/sim.7492 RESEARCH ARTICLE Extending the MR-Egger method for multivariable Mendelian randomization to correct for

More information

On the limiting distribution of the likelihood ratio test in nucleotide mapping of complex disease

On the limiting distribution of the likelihood ratio test in nucleotide mapping of complex disease On the limiting distribution of the likelihood ratio test in nucleotide mapping of complex disease Yuehua Cui 1 and Dong-Yun Kim 2 1 Department of Statistics and Probability, Michigan State University,

More information

Marginal Screening and Post-Selection Inference

Marginal Screening and Post-Selection Inference Marginal Screening and Post-Selection Inference Ian McKeague August 13, 2017 Ian McKeague (Columbia University) Marginal Screening August 13, 2017 1 / 29 Outline 1 Background on Marginal Screening 2 2

More information

Methods for Cryptic Structure. Methods for Cryptic Structure

Methods for Cryptic Structure. Methods for Cryptic Structure Case-Control Association Testing Review Consider testing for association between a disease and a genetic marker Idea is to look for an association by comparing allele/genotype frequencies between the cases

More information

Supplementary Materials for Molecular QTL Discovery Incorporating Genomic Annotations using Bayesian False Discovery Rate Control

Supplementary Materials for Molecular QTL Discovery Incorporating Genomic Annotations using Bayesian False Discovery Rate Control Supplementary Materials for Molecular QTL Discovery Incorporating Genomic Annotations using Bayesian False Discovery Rate Control Xiaoquan Wen Department of Biostatistics, University of Michigan A Model

More information

HERITABILITY ESTIMATION USING A REGULARIZED REGRESSION APPROACH (HERRA)

HERITABILITY ESTIMATION USING A REGULARIZED REGRESSION APPROACH (HERRA) BIRS 016 1 HERITABILITY ESTIMATION USING A REGULARIZED REGRESSION APPROACH (HERRA) Malka Gorfine, Tel Aviv University, Israel Joint work with Li Hsu, FHCRC, Seattle, USA BIRS 016 The concept of heritability

More information

Quantitative Genomics and Genetics BTRY 4830/6830; PBSB

Quantitative Genomics and Genetics BTRY 4830/6830; PBSB Quantitative Genomics and Genetics BTRY 4830/6830; PBSB.5201.01 Lecture 20: Epistasis and Alternative Tests in GWAS Jason Mezey jgm45@cornell.edu April 16, 2016 (Th) 8:40-9:55 None Announcements Summary

More information

p(d g A,g B )p(g B ), g B

p(d g A,g B )p(g B ), g B Supplementary Note Marginal effects for two-locus models Here we derive the marginal effect size of the three models given in Figure 1 of the main text. For each model we assume the two loci (A and B)

More information

Theoretical and computational aspects of association tests: application in case-control genome-wide association studies.

Theoretical and computational aspects of association tests: application in case-control genome-wide association studies. Theoretical and computational aspects of association tests: application in case-control genome-wide association studies Mathieu Emily November 18, 2014 Caen mathieu.emily@agrocampus-ouest.fr - Agrocampus

More information

Lecture 7: Interaction Analysis. Summer Institute in Statistical Genetics 2017

Lecture 7: Interaction Analysis. Summer Institute in Statistical Genetics 2017 Lecture 7: Interaction Analysis Timothy Thornton and Michael Wu Summer Institute in Statistical Genetics 2017 1 / 39 Lecture Outline Beyond main SNP effects Introduction to Concept of Statistical Interaction

More information

Statistics and Probability Letters. Using randomization tests to preserve type I error with response adaptive and covariate adaptive randomization

Statistics and Probability Letters. Using randomization tests to preserve type I error with response adaptive and covariate adaptive randomization Statistics and Probability Letters ( ) Contents lists available at ScienceDirect Statistics and Probability Letters journal homepage: wwwelseviercom/locate/stapro Using randomization tests to preserve

More information

Review of Statistics 101

Review of Statistics 101 Review of Statistics 101 We review some important themes from the course 1. Introduction Statistics- Set of methods for collecting/analyzing data (the art and science of learning from data). Provides methods

More information

Review We have covered so far: Single variant association analysis and effect size estimation GxE interaction and higher order >2 interaction Measurement error in dietary variables (nutritional epidemiology)

More information

Some models of genomic selection

Some models of genomic selection Munich, December 2013 What is the talk about? Barley! Steptoe x Morex barley mapping population Steptoe x Morex barley mapping population genotyping from Close at al., 2009 and phenotyping from cite http://wheat.pw.usda.gov/ggpages/sxm/

More information

Lecture 1: Case-Control Association Testing. Summer Institute in Statistical Genetics 2015

Lecture 1: Case-Control Association Testing. Summer Institute in Statistical Genetics 2015 Timothy Thornton and Michael Wu Summer Institute in Statistical Genetics 2015 1 / 1 Introduction Association mapping is now routinely being used to identify loci that are involved with complex traits.

More information

Computational Systems Biology: Biology X

Computational Systems Biology: Biology X Bud Mishra Room 1002, 715 Broadway, Courant Institute, NYU, New York, USA L#7:(Mar-23-2010) Genome Wide Association Studies 1 The law of causality... is a relic of a bygone age, surviving, like the monarchy,

More information

Optimal Methods for Using Posterior Probabilities in Association Testing

Optimal Methods for Using Posterior Probabilities in Association Testing Digital Collections @ Dordt Faculty Work: Comprehensive List 5-2013 Optimal Methods for Using Posterior Probabilities in Association Testing Keli Liu Harvard University Alexander Luedtke University of

More information

Nature Genetics: doi: /ng Supplementary Figure 1. Number of cases and proxy cases required to detect association at designs.

Nature Genetics: doi: /ng Supplementary Figure 1. Number of cases and proxy cases required to detect association at designs. Supplementary Figure 1 Number of cases and proxy cases required to detect association at designs. = 5 10 8 for case control and proxy case control The ratio of controls to cases (or proxy cases) is 1.

More information

Supplementary Information

Supplementary Information Supplementary Information 1 Supplementary Figures (a) Statistical power (p = 2.6 10 8 ) (b) Statistical power (p = 4.0 10 6 ) Supplementary Figure 1: Statistical power comparison between GEMMA (red) and

More information

Backward Genotype-Trait Association. in Case-Control Designs

Backward Genotype-Trait Association. in Case-Control Designs Backward Genotype-Trait Association (BGTA)-Based Dissection of Complex Traits in Case-Control Designs Tian Zheng, Hui Wang and Shaw-Hwa Lo Department of Statistics, Columbia University, New York, New York,

More information

Analyzing metabolomics data for association with genotypes using two-component Gaussian mixture distributions

Analyzing metabolomics data for association with genotypes using two-component Gaussian mixture distributions Analyzing metabolomics data for association with genotypes using two-component Gaussian mixture distributions Jason Westra Department of Statistics, Iowa State University Ames, IA 50011, United States

More information

Negative Multinomial Model and Cancer. Incidence

Negative Multinomial Model and Cancer. Incidence Generalized Linear Model under the Extended Negative Multinomial Model and Cancer Incidence S. Lahiri & Sunil K. Dhar Department of Mathematical Sciences, CAMS New Jersey Institute of Technology, Newar,

More information

NIH Public Access Author Manuscript Genet Epidemiol. Author manuscript; available in PMC 2014 June 23.

NIH Public Access Author Manuscript Genet Epidemiol. Author manuscript; available in PMC 2014 June 23. NIH Public Access Author Manuscript Published in final edited form as: Genet Epidemiol. 2013 January ; 37(1): 99 109. doi:10.1002/gepi.21691. Adjustment for Population Stratification via Principal Components

More information

SUPPLEMENTARY INFORMATION

SUPPLEMENTARY INFORMATION doi:10.1038/nature25973 Power Simulations We performed extensive power simulations to demonstrate that the analyses carried out in our study are well powered. Our simulations indicate very high power for

More information

1 Springer. Nan M. Laird Christoph Lange. The Fundamentals of Modern Statistical Genetics

1 Springer. Nan M. Laird Christoph Lange. The Fundamentals of Modern Statistical Genetics 1 Springer Nan M. Laird Christoph Lange The Fundamentals of Modern Statistical Genetics 1 Introduction to Statistical Genetics and Background in Molecular Genetics 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

More information

Linear Regression (1/1/17)

Linear Regression (1/1/17) STA613/CBB540: Statistical methods in computational biology Linear Regression (1/1/17) Lecturer: Barbara Engelhardt Scribe: Ethan Hada 1. Linear regression 1.1. Linear regression basics. Linear regression

More information

Integrative analysis of sequencing and array genotype data for discovering disease associations with rare mutations

Integrative analysis of sequencing and array genotype data for discovering disease associations with rare mutations Integrative analysis of sequencing and array genotype data for discovering disease associations with rare mutations Yi-Juan Hu a, Yun Li b,c, Paul L. Auer d, and Dan-Yu Lin b,1 a Department of Biostatistics

More information

Previous lecture. Single variant association. Use genome-wide SNPs to account for confounding (population substructure)

Previous lecture. Single variant association. Use genome-wide SNPs to account for confounding (population substructure) Previous lecture Single variant association Use genome-wide SNPs to account for confounding (population substructure) Estimation of effect size and winner s curse Meta-Analysis Today s outline P-value

More information

Gene mapping in model organisms

Gene mapping in model organisms Gene mapping in model organisms Karl W Broman Department of Biostatistics Johns Hopkins University http://www.biostat.jhsph.edu/~kbroman Goal Identify genes that contribute to common human diseases. 2

More information

Statistical Power of Model Selection Strategies for Genome-Wide Association Studies

Statistical Power of Model Selection Strategies for Genome-Wide Association Studies Statistical Power of Model Selection Strategies for Genome-Wide Association Studies Zheyang Wu 1, Hongyu Zhao 1,2 * 1 Department of Epidemiology and Public Health, Yale University School of Medicine, New

More information

NIH Public Access Author Manuscript Genet Epidemiol. Author manuscript; available in PMC 2014 December 01.

NIH Public Access Author Manuscript Genet Epidemiol. Author manuscript; available in PMC 2014 December 01. NIH Public Access Author Manuscript Published in final edited form as: Genet Epidemiol. 2013 December ; 37(8): 759 767. doi:10.1002/gepi.21759. A General Framework for Association Tests With Multivariate

More information

Least Absolute Value vs. Least Squares Estimation and Inference Procedures in Regression Models with Asymmetric Error Distributions

Least Absolute Value vs. Least Squares Estimation and Inference Procedures in Regression Models with Asymmetric Error Distributions Journal of Modern Applied Statistical Methods Volume 8 Issue 1 Article 13 5-1-2009 Least Absolute Value vs. Least Squares Estimation and Inference Procedures in Regression Models with Asymmetric Error

More information

Support Vector Hazard Regression (SVHR) for Predicting Survival Outcomes. Donglin Zeng, Department of Biostatistics, University of North Carolina

Support Vector Hazard Regression (SVHR) for Predicting Survival Outcomes. Donglin Zeng, Department of Biostatistics, University of North Carolina Support Vector Hazard Regression (SVHR) for Predicting Survival Outcomes Introduction Method Theoretical Results Simulation Studies Application Conclusions Introduction Introduction For survival data,

More information

Pooled Association Tests for Rare Genetic Variants: A Review and Some New Results

Pooled Association Tests for Rare Genetic Variants: A Review and Some New Results Statistical Science 2014, Vol. 29, No. 2, 302 321 DOI: 10.1214/13-STS456 c Institute of Mathematical Statistics, 2014 Pooled Association Tests for Rare Genetic Variants: A Review and Some New Results Andriy

More information

Non-iterative, regression-based estimation of haplotype associations

Non-iterative, regression-based estimation of haplotype associations Non-iterative, regression-based estimation of haplotype associations Benjamin French, PhD Department of Biostatistics and Epidemiology University of Pennsylvania bcfrench@upenn.edu National Cancer Center

More information

SNP Association Studies with Case-Parent Trios

SNP Association Studies with Case-Parent Trios SNP Association Studies with Case-Parent Trios Department of Biostatistics Johns Hopkins Bloomberg School of Public Health September 3, 2009 Population-based Association Studies Balding (2006). Nature

More information

BTRY 4830/6830: Quantitative Genomics and Genetics

BTRY 4830/6830: Quantitative Genomics and Genetics BTRY 4830/6830: Quantitative Genomics and Genetics Lecture 23: Alternative tests in GWAS / (Brief) Introduction to Bayesian Inference Jason Mezey jgm45@cornell.edu Nov. 13, 2014 (Th) 8:40-9:55 Announcements

More information

STA6938-Logistic Regression Model

STA6938-Logistic Regression Model Dr. Ying Zhang STA6938-Logistic Regression Model Topic 2-Multiple Logistic Regression Model Outlines:. Model Fitting 2. Statistical Inference for Multiple Logistic Regression Model 3. Interpretation of

More information

Quantitative Genomics and Genetics BTRY 4830/6830; PBSB

Quantitative Genomics and Genetics BTRY 4830/6830; PBSB Quantitative Genomics and Genetics BTRY 4830/6830; PBSB.5201.01 Lecture 18: Introduction to covariates, the QQ plot, and population structure II + minimal GWAS steps Jason Mezey jgm45@cornell.edu April

More information

27: Case study with popular GM III. 1 Introduction: Gene association mapping for complex diseases 1

27: Case study with popular GM III. 1 Introduction: Gene association mapping for complex diseases 1 10-708: Probabilistic Graphical Models, Spring 2015 27: Case study with popular GM III Lecturer: Eric P. Xing Scribes: Hyun Ah Song & Elizabeth Silver 1 Introduction: Gene association mapping for complex

More information

Supplemental Information Likelihood-based inference in isolation-by-distance models using the spatial distribution of low-frequency alleles

Supplemental Information Likelihood-based inference in isolation-by-distance models using the spatial distribution of low-frequency alleles Supplemental Information Likelihood-based inference in isolation-by-distance models using the spatial distribution of low-frequency alleles John Novembre and Montgomery Slatkin Supplementary Methods To

More information

. Also, in this case, p i = N1 ) T, (2) where. I γ C N(N 2 2 F + N1 2 Q)

. Also, in this case, p i = N1 ) T, (2) where. I γ C N(N 2 2 F + N1 2 Q) Supplementary information S7 Testing for association at imputed SPs puted SPs Score tests A Score Test needs calculations of the observed data score and information matrix only under the null hypothesis,

More information

Case-Control Association Testing. Case-Control Association Testing

Case-Control Association Testing. Case-Control Association Testing Introduction Association mapping is now routinely being used to identify loci that are involved with complex traits. Technological advances have made it feasible to perform case-control association studies

More information

Robust Bayesian Variable Selection for Modeling Mean Medical Costs

Robust Bayesian Variable Selection for Modeling Mean Medical Costs Robust Bayesian Variable Selection for Modeling Mean Medical Costs Grace Yoon 1,, Wenxin Jiang 2, Lei Liu 3 and Ya-Chen T. Shih 4 1 Department of Statistics, Texas A&M University 2 Department of Statistics,

More information

Bustamante et al., Supplementary Nature Manuscript # 1 out of 9 Information #

Bustamante et al., Supplementary Nature Manuscript # 1 out of 9 Information # Bustamante et al., Supplementary Nature Manuscript # 1 out of 9 Details of PRF Methodology In the Poisson Random Field PRF) model, it is assumed that non-synonymous mutations at a given gene are either

More information

Geometric Framework for Evaluating Rare Variant Tests of Association

Geometric Framework for Evaluating Rare Variant Tests of Association Digital Collections @ Dordt Faculty Work: Comprehensive List 5-2013 Geometric Framework for Evaluating Rare Variant Tests of Association Keli Liu Harvard University Shannon Fast St. Olaf College Matthew

More information

Mutual Information between Discrete Variables with Many Categories using Recursive Adaptive Partitioning

Mutual Information between Discrete Variables with Many Categories using Recursive Adaptive Partitioning Supplementary Information Mutual Information between Discrete Variables with Many Categories using Recursive Adaptive Partitioning Junhee Seok 1, Yeong Seon Kang 2* 1 School of Electrical Engineering,

More information

A new strategy for meta-analysis of continuous covariates in observational studies with IPD. Willi Sauerbrei & Patrick Royston

A new strategy for meta-analysis of continuous covariates in observational studies with IPD. Willi Sauerbrei & Patrick Royston A new strategy for meta-analysis of continuous covariates in observational studies with IPD Willi Sauerbrei & Patrick Royston Overview Motivation Continuous variables functional form Fractional polynomials

More information

Causal Hazard Ratio Estimation By Instrumental Variables or Principal Stratification. Todd MacKenzie, PhD

Causal Hazard Ratio Estimation By Instrumental Variables or Principal Stratification. Todd MacKenzie, PhD Causal Hazard Ratio Estimation By Instrumental Variables or Principal Stratification Todd MacKenzie, PhD Collaborators A. James O Malley Tor Tosteson Therese Stukel 2 Overview 1. Instrumental variable

More information

25 : Graphical induced structured input/output models

25 : Graphical induced structured input/output models 10-708: Probabilistic Graphical Models 10-708, Spring 2013 25 : Graphical induced structured input/output models Lecturer: Eric P. Xing Scribes: Meghana Kshirsagar (mkshirsa), Yiwen Chen (yiwenche) 1 Graph

More information

Bayesian Inference of Interactions and Associations

Bayesian Inference of Interactions and Associations Bayesian Inference of Interactions and Associations Jun Liu Department of Statistics Harvard University http://www.fas.harvard.edu/~junliu Based on collaborations with Yu Zhang, Jing Zhang, Yuan Yuan,

More information

Pearson s Test, Trend Test, and MAX Are All Trend Tests with Different Types of Scores

Pearson s Test, Trend Test, and MAX Are All Trend Tests with Different Types of Scores Commentary doi: 101111/1469-180900800500x Pearson s Test, Trend Test, and MAX Are All Trend Tests with Different Types of Scores Gang Zheng 1, Jungnam Joo 1 and Yaning Yang 1 Office of Biostatistics Research,

More information

I Have the Power in QTL linkage: single and multilocus analysis

I Have the Power in QTL linkage: single and multilocus analysis I Have the Power in QTL linkage: single and multilocus analysis Benjamin Neale 1, Sir Shaun Purcell 2 & Pak Sham 13 1 SGDP, IoP, London, UK 2 Harvard School of Public Health, Cambridge, MA, USA 3 Department

More information

Distinctive aspects of non-parametric fitting

Distinctive aspects of non-parametric fitting 5. Introduction to nonparametric curve fitting: Loess, kernel regression, reproducing kernel methods, neural networks Distinctive aspects of non-parametric fitting Objectives: investigate patterns free

More information

Generalized Linear Model under the Extended Negative Multinomial Model and Cancer Incidence

Generalized Linear Model under the Extended Negative Multinomial Model and Cancer Incidence Generalized Linear Model under the Extended Negative Multinomial Model and Cancer Incidence Sunil Kumar Dhar Center for Applied Mathematics and Statistics, Department of Mathematical Sciences, New Jersey

More information

TESTS FOR EQUIVALENCE BASED ON ODDS RATIO FOR MATCHED-PAIR DESIGN

TESTS FOR EQUIVALENCE BASED ON ODDS RATIO FOR MATCHED-PAIR DESIGN Journal of Biopharmaceutical Statistics, 15: 889 901, 2005 Copyright Taylor & Francis, Inc. ISSN: 1054-3406 print/1520-5711 online DOI: 10.1080/10543400500265561 TESTS FOR EQUIVALENCE BASED ON ODDS RATIO

More information

Lecture 9. QTL Mapping 2: Outbred Populations

Lecture 9. QTL Mapping 2: Outbred Populations Lecture 9 QTL Mapping 2: Outbred Populations Bruce Walsh. Aug 2004. Royal Veterinary and Agricultural University, Denmark The major difference between QTL analysis using inbred-line crosses vs. outbred

More information

Causal Model Selection Hypothesis Tests in Systems Genetics

Causal Model Selection Hypothesis Tests in Systems Genetics 1 Causal Model Selection Hypothesis Tests in Systems Genetics Elias Chaibub Neto and Brian S Yandell SISG 2012 July 13, 2012 2 Correlation and Causation The old view of cause and effect... could only fail;

More information

Genetics Studies of Multivariate Traits

Genetics Studies of Multivariate Traits Genetics Studies of Multivariate Traits Heping Zhang Department of Epidemiology and Public Health Yale University School of Medicine Presented at Southern Regional Council on Statistics Summer Research

More information

Three-Level Modeling for Factorial Experiments With Experimentally Induced Clustering

Three-Level Modeling for Factorial Experiments With Experimentally Induced Clustering Three-Level Modeling for Factorial Experiments With Experimentally Induced Clustering John J. Dziak The Pennsylvania State University Inbal Nahum-Shani The University of Michigan Copyright 016, Penn State.

More information

Prerequisite: STATS 7 or STATS 8 or AP90 or (STATS 120A and STATS 120B and STATS 120C). AP90 with a minimum score of 3

Prerequisite: STATS 7 or STATS 8 or AP90 or (STATS 120A and STATS 120B and STATS 120C). AP90 with a minimum score of 3 University of California, Irvine 2017-2018 1 Statistics (STATS) Courses STATS 5. Seminar in Data Science. 1 Unit. An introduction to the field of Data Science; intended for entering freshman and transfers.

More information

Approximation of Survival Function by Taylor Series for General Partly Interval Censored Data

Approximation of Survival Function by Taylor Series for General Partly Interval Censored Data Malaysian Journal of Mathematical Sciences 11(3): 33 315 (217) MALAYSIAN JOURNAL OF MATHEMATICAL SCIENCES Journal homepage: http://einspem.upm.edu.my/journal Approximation of Survival Function by Taylor

More information

CHAPTER 17 CHI-SQUARE AND OTHER NONPARAMETRIC TESTS FROM: PAGANO, R. R. (2007)

CHAPTER 17 CHI-SQUARE AND OTHER NONPARAMETRIC TESTS FROM: PAGANO, R. R. (2007) FROM: PAGANO, R. R. (007) I. INTRODUCTION: DISTINCTION BETWEEN PARAMETRIC AND NON-PARAMETRIC TESTS Statistical inference tests are often classified as to whether they are parametric or nonparametric Parameter

More information

Calculation of IBD probabilities

Calculation of IBD probabilities Calculation of IBD probabilities David Evans and Stacey Cherny University of Oxford Wellcome Trust Centre for Human Genetics This Session IBD vs IBS Why is IBD important? Calculating IBD probabilities

More information

A Robust Model-free Approach for Rare Variants Association Studies Incorporating Gene-Gene and Gene-Environmental Interactions

A Robust Model-free Approach for Rare Variants Association Studies Incorporating Gene-Gene and Gene-Environmental Interactions A Robust Model-free Approach for Rare Variants Association Studies Incorporating Gene-Gene and Gene-Environmental Interactions Ruixue Fan and Shaw-Hwa Lo * Department of Statistics, Columbia University,

More information

Heritability estimation in modern genetics and connections to some new results for quadratic forms in statistics

Heritability estimation in modern genetics and connections to some new results for quadratic forms in statistics Heritability estimation in modern genetics and connections to some new results for quadratic forms in statistics Lee H. Dicker Rutgers University and Amazon, NYC Based on joint work with Ruijun Ma (Rutgers),

More information

Weierstraß-Institut. für Angewandte Analysis und Stochastik. Leibniz-Institut im Forschungsverbund Berlin e. V. Preprint ISSN

Weierstraß-Institut. für Angewandte Analysis und Stochastik. Leibniz-Institut im Forschungsverbund Berlin e. V. Preprint ISSN Weierstraß-Institut für Angewandte Analysis und Stochastik Leibniz-Institut im Forschungsverbund Berlin e. V. Preprint ISSN 2198-5855 On an extended interpretation of linkage disequilibrium in genetic

More information

Affected Sibling Pairs. Biostatistics 666

Affected Sibling Pairs. Biostatistics 666 Affected Sibling airs Biostatistics 666 Today Discussion of linkage analysis using affected sibling pairs Our exploration will include several components we have seen before: A simple disease model IBD

More information

Supplementary Materials: Meta-analysis of Quantitative Pleiotropic Traits at Gene Level with Multivariate Functional Linear Models

Supplementary Materials: Meta-analysis of Quantitative Pleiotropic Traits at Gene Level with Multivariate Functional Linear Models Supplementary Materials: Meta-analysis of Quantitative Pleiotropic Traits at Gene Level with Multivariate Functional Linear Models Appendix A. Type I Error and Power Simulations and Results A.. Type I

More information

Learning Your Identity and Disease from Research Papers: Information Leaks in Genome-Wide Association Study

Learning Your Identity and Disease from Research Papers: Information Leaks in Genome-Wide Association Study Learning Your Identity and Disease from Research Papers: Information Leaks in Genome-Wide Association Study Rui Wang, Yong Li, XiaoFeng Wang, Haixu Tang and Xiaoyong Zhou Indiana University at Bloomington

More information

A novel fuzzy set based multifactor dimensionality reduction method for detecting gene-gene interaction

A novel fuzzy set based multifactor dimensionality reduction method for detecting gene-gene interaction A novel fuzzy set based multifactor dimensionality reduction method for detecting gene-gene interaction Sangseob Leem, Hye-Young Jung, Sungyoung Lee and Taesung Park Bioinformatics and Biostatistics lab

More information

Statistical Methods for Modeling Heterogeneous Effects in Genetic Association Studies

Statistical Methods for Modeling Heterogeneous Effects in Genetic Association Studies Statistical Methods for Modeling Heterogeneous Effects in Genetic Association Studies by Jingchunzi Shi A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy

More information

Multivariate Survival Analysis

Multivariate Survival Analysis Multivariate Survival Analysis Previously we have assumed that either (X i, δ i ) or (X i, δ i, Z i ), i = 1,..., n, are i.i.d.. This may not always be the case. Multivariate survival data can arise in

More information

Common Variants near MBNL1 and NKX2-5 are Associated with Infantile Hypertrophic Pyloric Stenosis

Common Variants near MBNL1 and NKX2-5 are Associated with Infantile Hypertrophic Pyloric Stenosis Supplementary Information: Common Variants near MBNL1 and NKX2-5 are Associated with Infantile Hypertrophic Pyloric Stenosis Bjarke Feenstra 1*, Frank Geller 1*, Camilla Krogh 1, Mads V. Hollegaard 2,

More information

Two Correlated Proportions Non- Inferiority, Superiority, and Equivalence Tests

Two Correlated Proportions Non- Inferiority, Superiority, and Equivalence Tests Chapter 59 Two Correlated Proportions on- Inferiority, Superiority, and Equivalence Tests Introduction This chapter documents three closely related procedures: non-inferiority tests, superiority (by a

More information

Causal inference in biomedical sciences: causal models involving genotypes. Mendelian randomization genes as Instrumental Variables

Causal inference in biomedical sciences: causal models involving genotypes. Mendelian randomization genes as Instrumental Variables Causal inference in biomedical sciences: causal models involving genotypes Causal models for observational data Instrumental variables estimation and Mendelian randomization Krista Fischer Estonian Genome

More information

Asymptotic distribution of the largest eigenvalue with application to genetic data

Asymptotic distribution of the largest eigenvalue with application to genetic data Asymptotic distribution of the largest eigenvalue with application to genetic data Chong Wu University of Minnesota September 30, 2016 T32 Journal Club Chong Wu 1 / 25 Table of Contents 1 Background Gene-gene

More information

BTRY 7210: Topics in Quantitative Genomics and Genetics

BTRY 7210: Topics in Quantitative Genomics and Genetics BTRY 7210: Topics in Quantitative Genomics and Genetics Jason Mezey Biological Statistics and Computational Biology (BSCB) Department of Genetic Medicine jgm45@cornell.edu February 12, 2015 Lecture 3:

More information

Linkage and Linkage Disequilibrium

Linkage and Linkage Disequilibrium Linkage and Linkage Disequilibrium Summer Institute in Statistical Genetics 2014 Module 10 Topic 3 Linkage in a simple genetic cross Linkage In the early 1900 s Bateson and Punnet conducted genetic studies

More information

Statistical Methods for Alzheimer s Disease Studies

Statistical Methods for Alzheimer s Disease Studies Statistical Methods for Alzheimer s Disease Studies Rebecca A. Betensky, Ph.D. Department of Biostatistics, Harvard T.H. Chan School of Public Health July 19, 2016 1/37 OUTLINE 1 Statistical collaborations

More information