Supplementary Information

Similar documents
FaST Linear Mixed Models for Genome-Wide Association Studies

. Also, in this case, p i = N1 ) T, (2) where. I γ C N(N 2 2 F + N1 2 Q)

SUPPLEMENTARY TEXT: EFFICIENT COMPUTATION WITH A LINEAR MIXED MODEL ON LARGE-SCALE DATA SETS WITH APPLICATIONS TO GENETIC STUDIES

Implicit Causal Models for Genome-wide Association Studies

Improved linear mixed models for genome-wide association studies

FaST linear mixed models for genome-wide association studies

EFFICIENT COMPUTATION WITH A LINEAR MIXED MODEL ON LARGE-SCALE DATA SETS WITH APPLICATIONS TO GENETIC STUDIES

GWAS with mixed models

Efficient Bayesian mixed model analysis increases association power in large cohorts

Methods for Cryptic Structure. Methods for Cryptic Structure

Association Testing with Quantitative Traits: Common and Rare Variants. Summer Institute in Statistical Genetics 2014 Module 10 Lecture 5

Association studies and regression

Combining SEM & GREML in OpenMx. Rob Kirkpatrick 3/11/16

Lecture 5: BLUP (Best Linear Unbiased Predictors) of genetic values. Bruce Walsh lecture notes Tucson Winter Institute 9-11 Jan 2013

Theoretical and computational aspects of association tests: application in case-control genome-wide association studies.

An analytical comparison of the principal component method and the mixed effects model for genetic association studies

Supplementary Materials for Molecular QTL Discovery Incorporating Genomic Annotations using Bayesian False Discovery Rate Control

Proportional Variance Explained by QLT and Statistical Power. Proportional Variance Explained by QTL and Statistical Power

Lecture 2: Genetic Association Testing with Quantitative Traits. Summer Institute in Statistical Genetics 2017

Gene mapping in model organisms

Models with multiple random effects: Repeated Measures and Maternal effects

Quantitative Genomics and Genetics BTRY 4830/6830; PBSB

Lecture 32: Infinite-dimensional/Functionvalued. Functions and Random Regressions. Bruce Walsh lecture notes Synbreed course version 11 July 2013

ARTICLE MASTOR: Mixed-Model Association Mapping of Quantitative Traits in Samples with Related Individuals

Latent Variable Methods for the Analysis of Genomic Data

Multiple random effects. Often there are several vectors of random effects. Covariance structure

Introduction to QTL mapping in model organisms

New Introduction to Multiple Time Series Analysis

Wiley. Methods and Applications of Linear Models. Regression and the Analysis. of Variance. Third Edition. Ishpeming, Michigan RONALD R.

ARTICLE Mixed Model with Correction for Case-Control Ascertainment Increases Association Power

Genotype Imputation. Biostatistics 666

Quantitative Genomics and Genetics BTRY 4830/6830; PBSB

DNA polymorphisms such as SNP and familial effects (additive genetic, common environment) to

Model-Free Knockoffs: High-Dimensional Variable Selection that Controls the False Discovery Rate

Likelihood-Based Methods

Calculation of IBD probabilities

Maximum Likelihood Estimation; Robust Maximum Likelihood; Missing Data with Maximum Likelihood

I Have the Power in QTL linkage: single and multilocus analysis

Generalized, Linear, and Mixed Models

Power and sample size calculations for designing rare variant sequencing association studies.

Case-Control Association Testing. Case-Control Association Testing

HW2 - Due 01/30. Each answer must be mathematically justified. Don t forget your name.

Introduction to Statistical Genetics (BST227) Lecture 6: Population Substructure in Association Studies

Multilevel Models in Matrix Form. Lecture 7 July 27, 2011 Advanced Multivariate Statistical Methods ICPSR Summer Session #2

Introduction to QTL mapping in model organisms

Cover Page. The handle holds various files of this Leiden University dissertation

Introduction to QTL mapping in model organisms

Introduction to QTL mapping in model organisms

Variance Component Models for Quantitative Traits. Biostatistics 666

Hands-on Matrix Algebra Using R

COMBI - Combining high-dimensional classification and multiple hypotheses testing for the analysis of big data in genetics

p(d g A,g B )p(g B ), g B

THE ABILITY TO PREDICT COMPLEX TRAITS from marker data

Statistical issues in QTL mapping in mice

(Genome-wide) association analysis

Feature Selection via Block-Regularized Regression

Family-wise Error Rate Control in QTL Mapping and Gene Ontology Graphs

Package MACAU2. R topics documented: April 8, Type Package. Title MACAU 2.0: Efficient Mixed Model Analysis of Count Data. Version 1.

Correction for Hidden Confounders in the Genetic Analysis of Gene Expression

HERITABILITY ESTIMATION USING A REGULARIZED REGRESSION APPROACH (HERRA)

Multiple regression. CM226: Machine Learning for Bioinformatics. Fall Sriram Sankararaman Acknowledgments: Fei Sha, Ameet Talwalkar

Mapping QTL to a phylogenetic tree

BTRY 4830/6830: Quantitative Genomics and Genetics Fall 2014

Mapping multiple QTL in experimental crosses

Multiple Change-Point Detection and Analysis of Chromosome Copy Number Variations

Bayesian Regression (1/31/13)

arxiv: v1 [q-bio.gn] 19 Dec 2012

1 Mixed effect models and longitudinal data analysis

Introduction to QTL mapping in model organisms

CS 4491/CS 7990 SPECIAL TOPICS IN BIOINFORMATICS

2. Map genetic distance between markers

Statistical Distribution Assumptions of General Linear Models

25 : Graphical induced structured input/output models

Heritability estimation in modern genetics and connections to some new results for quadratic forms in statistics

A General Framework for Variable Selection in Linear Mixed Models with Applications to Genetic Studies with Structured Populations

Linear Regression (1/1/17)

Restricted Maximum Likelihood in Linear Regression and Linear Mixed-Effects Model

Combining multiple observational data sources to estimate causal eects

Quantitative Genomics and Genetics BTRY 4830/6830; PBSB

STA6938-Logistic Regression Model

Multistat2. The course is focused on the parts in bold; a short description will be presented for the other parts

Lecture 9. QTL Mapping 2: Outbred Populations

Normal distribution We have a random sample from N(m, υ). The sample mean is Ȳ and the corrected sum of squares is S yy. After some simplification,

New imputation strategies optimized for crop plants: FILLIN (Fast, Inbred Line Library ImputatioN) FSFHap (Full Sib Family Haplotype)

Structural Equation Modeling and Confirmatory Factor Analysis. Types of Variables

Likelihood Ratio Tests. that Certain Variance Components Are Zero. Ciprian M. Crainiceanu. Department of Statistical Science

Estimation and Testing for Common Cycles

USING LINEAR PREDICTORS TO IMPUTE ALLELE FREQUENCIES FROM SUMMARY OR POOLED GENOTYPE DATA. By Xiaoquan Wen and Matthew Stephens University of Chicago

Modeling Longitudinal Count Data with Excess Zeros and Time-Dependent Covariates: Application to Drug Use

Asymptotic distribution of the largest eigenvalue with application to genetic data

R/qtl workshop. (part 2) Karl Broman. Biostatistics and Medical Informatics University of Wisconsin Madison. kbroman.org

Nature Genetics: doi: /ng Supplementary Figure 1. Number of cases and proxy cases required to detect association at designs.

Mapping multiple QTL in experimental crosses

Multiple QTL mapping

Hybrid CPU/GPU Acceleration of Detection of 2-SNP Epistatic Interactions in GWAS

Stat 579: Generalized Linear Models and Extensions

Computational functional genomics

Mixed-Model Estimation of genetic variances. Bruce Walsh lecture notes Uppsala EQG 2012 course version 28 Jan 2012

PCA and admixture models

Incorporating Functional Annotations for Fine-Mapping Causal Variants in a. Bayesian Framework using Summary Statistics

Transcription:

Supplementary Information 1 Supplementary Figures (a) Statistical power (p = 2.6 10 8 ) (b) Statistical power (p = 4.0 10 6 ) Supplementary Figure 1: Statistical power comparison between GEMMA (red) and EMMAX (blue) using simulation with the HMDP data set, at two different genome-wide significance thresholds. The y axis shows how power varies with SNP effect size (x axis). 1

(a) Full Rank vs Low Rank (HMDP) (b) Full Rank vs Low Rank (WTCCC) Supplementary Figure 2: Comparison of log 10 p values obtained from linear mixed models using low-rank matrices, with those from the usual full-rank matrix. a) shows p values computed from 1,885,197 markers for HDL measurements in the HMDP data set, by either a linear model, linear mixed models using low-rank relatedness matrices constructed from the top 10% (10) or 50% (50) eigenvectors, or linear mixed model using the full-rank relatedness matrix. b) shows p values computed from 442,001 markers for Crohn s disease states in the WTCCC data set, by either a linear model, linear mixed models using low-rank relatedness matrices constructed from the top 10% (469) or 50% (2343) eigenvectors, or linear mixed model using the full-rank relatedness matrix. Black line shows the diagonal line. nev, number of eigenvectors used; LM, linear model; LMM, linear mixed model. 2

2 Supplementary Table Supplementary Table 1: Comparison of genomic control inflation factors obtained with linear mixed models using relatedness matrices constructed from the top 0% (linear model), 10%, 50% and 100% (full-rank matrix) eigenvectors of the usual relatedness matrix, for both HMDP and WTCCC data sets. EVs, eigenvectors. Data Set Genomic Control Inflation Factor Linear Model LMM (10% EVs) LMM(50% EVs) LMM(Full) HMDP, HDL-C 26.392 7.575 3.879 0.955 WTCCC, CD 1.174 1.070 1.030 1.012 3

3 Supplementary Note 3.1 Genome-wide Efficient Mixed Model Association, More Details 3.1.1 Derivation of the target optimization functions If is known, the log-likelihood is maximized at: ( ) ˆα = ((W, x) ˆβ T H 1 (W, x)) 1 (W, x) T H 1 y, n ˆτ = (y W ˆα x ˆβ) T H 1 (y W ˆα x ˆβ) = n y T P x y. The last equation uses the property P x HP x = P x. This can be derived by noticing P x = M x (M x HM x ) M x, where M x = I n (W, x)((w, x) T (W, x)) 1 (W, x) T and denotes generalized inverse. Similarly, the log-restricted likelihood is maximized at ˆτ = n c 1 y T P x y. Therefore, finding MLE and REML estimates is equivalent to optimizing the following functions with respect to : l() = n 2 log( n 2π ) n 2 1 2 log H n 2 log(yt P x y), l r () = n c 1 2 3.1.2 Numeric optimization details log( n c 1 ) n c 1 + 1 2π 2 2 log (W, x)t (W, x) 1 2 log H 1 2 log (W, x)t H 1 (W, x) n c 1 2 log(y T P x y). Following EMMA 1, we consider s ranging from 1 10 5 (corresponding to almost pure environmental effect) to 1 10 5 (corresponding to almost pure genetic effect). We use Brent s method on the first derivative of the target functions, initialized with the two boundary values, to provide an initial value that is close to a root with relative error of 1 10 1. We then follow this with Newton-Raphson s method, taking advantage of the second derivative to achieve a relative error of 1 10 5. This search strategy combines the stability of Brent s method with the efficiency of Newton-Raphson s method. In the implemented software, we also provide the option of dividing the log-scale evaluation interval into equally spaced regions 1. Brent s method followed by Newton- Raphson s method can be carried out for optimization in each region where the first derivatives change sign. This dividing strategy with ten regions yields identical results, takes less than five minutes longer for both data sets, and is expected to find the maxima in the evaluation interval 4

more easily when the target functions are not well behaved. 3.1.3 Matrix calculus properties The derivation of the first and second derivatives for both target functions uses a few matrix calculus properties: log H vec(p x ) vec T (H 1 ) vec(p x GP x ) = vec T (H 1 )vec(g) = trace(h 1 G), = vec(m x(m x HM x ) M x ) = H 1 H 1 vec(g) = vec(h 1 GH 1 ), = (I n P x G + P x G I n ) vec(p x) = P x P x vec(g) = vec(p x GP x ), = 2vec(P x GP x GP x ), where denotes Kronecker product and vec denotes matrix vectorization (by stacking columns). 3.1.4 Simplification of the trace term and the vector-matrix-vector product term For the trace term, we notice: trace(h 1 G) = trace(h 1 (H I n) trace(h 1 GH 1 G) = trace(h 1 (H I n) trace(p x G) = trace(p x (H I n ) trace(p x GP x G) = trace(p x (H I n ) ) = n trace(h 1 ), H 1 (H I n) ) = n + trace(h 1 H 1 ) 2trace(H 1 ) 2, ) = n c 1 trace(p x), (H I n ) P x ) = n c 1 + trace(p xp x ) 2trace(P x ) 2. The last two equations use the property trace(p x H) = trace(m x (M x HM x ) M x H) = n c 1. For the vector-matrix-vector product term, we notice: y T P x GP x y = yt P x y y T P x P x y, y T P x GP x GP x y = yt P x y + y T P x P x P x y 2y T P x P x y 2. 3.1.5 Recursions for the trace term and the vector-matrix-vector product term With blockwise matrix inversion we have: P i = P i 1 P i 1 w i (w T i P i 1 w i ) 1 w T i P i 1. 5

This leads to a recursion for the trace terms: trace(p i ) = trace(p i 1 ) (w T i P i 1 P i 1 w i )(w T i P i 1 w i ) 1, trace(p i P i ) = trace(p i 1 P i 1 ) + (w T i P i 1 P i 1 w i ) 2 (w T i P i 1 w i ) 2 2(w T i P i 1 P i 1 P i 1 w i )(w T i P i 1 w i ) 1, and a recursion for the vector-matrix-vector product terms, for any vectors a, b of the right size: a T P i b = a T P i 1 b (a T P i 1 w i )(b T P i 1 w i )(wi T P i 1 w i ) 1, a T P i P i b = a T P i 1 P i 1 b + (a T P i 1 w i )(b T P i 1 w i )(wi T P i 1 P i 1 w i )(wi T P i 1 w i ) 2 (a T P i 1 w i )(b T P i 1 P i 1 w i )(wi T P i 1 w i ) 1 (b T P i 1 w i )(a T P i 1 P i 1 w i )(wi T P i 1 w i ) 1, a T P i P i P i b = a T P i 1 P i 1 P i 1 b (a T P i 1 w i )(b T P i 1 w i )(wi T P i 1 P i 1 w i ) 2 (wi T P i 1 w i ) 3 (a T P i 1 w i )(b T P i 1 P i 1 P i 1 w i )(wi T P i 1 w i ) 1 (b T P i 1 w i )(a T P i 1 P i 1 P i 1 w i )(wi T P i 1 w i ) 1 (a T P i 1 P i 1 w i )(b T P i 1 P i 1 w i )(wi T P i 1 w i ) 1 + (a T P i 1 w i )(b T P i 1 P i 1 w i )(wi T P i 1 P i 1 w i )(wi T P i 1 w i ) 2 + (b T P i 1 w i )(a T P i 1 P i 1 w i )(wi T P i 1 P i 1 w i )(wi T P i 1 w i ) 2 + (a T P i 1 w i )(b T P i 1 w i )(wi T P i 1 P i 1 P i 1 w i )(wi T P i 1 w i ) 2. Note that each recursion only requires a few scalar multiplications and does not depend on the number of individuals, as each vector-matrix-vector product in the form of a T P i b, a T P i P i b or a T P i P i P i b is a scalar. 3.1.6 Test statistics and p values To test the null hypothesis β = 0, we obtain the likelihood ratio test statistic with MLE estimates and the Wald test statistic with the REML estimate as suggested 2,1 : D lrt = 2 log l 1(ˆ 1 ) l 0 (ˆ 0 ), F W ald = ˆβ 2 V ( ˆβ). 6

where l 1 and l 0 are the likelihood functions for the null and the alternative models, respectively; ˆ0 and ˆ 1 are the MLE estimates for the null and the alternative models, respectively; ˆβ = (x T P c (ˆ r )x) 1 (x T P c (ˆ r )y) is the estimate for β obtained using the REML estimate ˆ r in the alternative model; and V ( ˆβ) = (n c 1) 1 (x T P c (ˆ r )x) 1 (y T P x (ˆ r )y) is the variance for ˆβ. Under the null hypothesis the likelihood ratio test statistic D lrt and the Wald test statistics F W ald come from a χ 2 (1) and a F (1, n c 1) distribution respectively, and p values can be calculated accordingly. 3.1.7 Missing data We note that the tricks used in GEMMA rely on having complete or imputed genotype data at each SNP. That is, unlike EMMA, which, when testing a particular SNP, can simply ignore individuals with missing genotype data at that SNP, GEMMA requires the user to instead impute all missing genotypes before association testing. Arguably this imputation approach is preferable in any case, since it can improve power to detect associations 3. In the current implementation of GEMMA, missing genotypes are required to be imputed first. Otherwise, any SNPs with missingness > 5% will not be analyzed, and other missing genotypes will simply be replaced with the mean genotype of that SNP. 3.2 Genotype and Phenotype Data, Details We analyzed two data sets, one for quantitative traits from mouse and one for binary disease traits from human. The mouse data set contains measurements of HDL levels for the Hybrid Mouse Diversity Panel (HMDP) 4. Both phenotypes and genotypes are obtained from http://mouse.cs.ucla.edu/. A total of 99 mouse strains (29 classical inbred strains and 70 recombinant inbred strains) and 681 animals with overlapping genotype and phenotype recodes were used. A total of fully imputed 3,918,755 SNPs were used to obtain the identity by state (IBS) matrix as estimates of relatedness 5,4, and a total of 1,885,197 polymorphic SNPs were used for analysis. As in 4, we applied a linear mixed model with an intercept term and tested each SNP in turn (as a fixed effect), without controlling for any other covariates. Following a reviewer s suggestion we also repeated this analysis, but controlling for the top SNP (NES13033708) as a fixed effect in addition to an intercept; comparisons between the methods remained qualitatively similar (data not shown). The human data set contains population controls and cases with Crohn s diseases from the WTCCC study 6. Quality controlled genotypes were obtained from WTCCC and missing genotypes were imputed with BIMBAM 3. 4686 individuals (1748 cases and 2938 controls) and a total of 442,001 SNPs were used for analysis. The Balding-Nichols matrix was used as estimates of relatedness 5 and binary variables were treated as quantitative traits as suggested 3,5. As in 5 we 7

applied a linear mixed model with an intercept term and tested each SNP in turn (as a fixed effect), without controlling for any other covariates. 3.3 Supplementary Results 3.3.1 Power simulation We performed a power comparison between GEMMA and EMMAX using simulations similar to ref 7, in the HMDP data set. (We did not perform comparisons on the WTCCC data set because the empirical p value comparisons show that EMMAX produces almost identical results to GEMMA in this case.) For the power simulation we simulated phenotypes by adding effects to the original phenotype observations as in Scheme 1 from ref 7. Specifically, we first identified polymorphic SNPs unassociated with the original phenotype (exact Wald test p value > 0.05). We ordered the 990,841 SNPs satisfying this criteria by their genomic location, and selected from them 10,000 evenly spaced SNPs to act as causal SNPs. For each causal SNP, we specified its effect size so that it explained a particular percentage of the phenotypic variance (proportion of variance explained, or PVE), and the effect was added back to the original phenotype to form the new simulated phenotype. For each pre-specified PVE (ranged from 2% to 20%), we simulated 10000 sets of phenotype, one for each causal SNP, and calculated p values for each SNP-phenotype pair. We calculated statistical power as the proportion of (Wald test) p values exceeding the genome-wide significance level, either at the conventional 0.05 level after Bonferroni correction (p = 2.6 10 8 ), or at a p value known to achieve the same family-wise error rate in the HMDP panel (p = 4.0 10 6 ) 4. Supplementary Figure 1 shows the genome-wide statistical power of GEMMA versus EMMAX in the HMDP data set, at two different genome-wide significance p values, for SNPs with different effect sizes. The results suggest that GEMMA can be several times more powerful than EMMAX for SNPs with various effect sizes in this case. 3.3.2 Effects of modifying relatedness matrix We explored the effects of using a low-rank relatedness matrix by computing p values from linear mixed models with different relatedness matrices. In particular we considered replacing the full relatedness matrix G with a rank r approximation of the form Ĝr = r k=1 δ ku k u T k, where δ k are the eigenvalues of G (ordered to be decreasing) and u k are the corresponding eigenvectors. Standard theory ensures that Ĝr is the best rank r approximation to G in Frobenius norm. We considered the relatedness matrices Ĝ(r) that corresponded to using either the top 0%, 10%, 50% or 100% eigenvectors of G. Note that 0% corresponds to the linear model, and 100% corresponds to the LMM with the usual relatedness matrix G. Supplementary Figure 2 shows comparison of p values, and Supplementary Table 1 shows comparison of genomic control inflation 8

factors, obtained with different relatedness matrices in both HMDP and WTCCC data sets. The results show that, compared with using G, using a lower-rank relatedness matrix can cause much larger changes in p values than approximation methods such as EMMAX (Figure 1). Moreover, the genomic control inflation factors also suggest that, for these data sets, using a lower-rank relatedness matrix compromises the ability of the linear mixed model to control for sample structure. 9

References 1. Kang, H. M., Zaitlen, N. A., Wade, C. M., Kirby, A., Heckerman, D., Daly, M. J., and Eskin, E. Efficient control of population structure in model organism association mapping. Genetics 178, 1709 1723 (2008). 2. Yu, J., Pressoir, G., Briggs, W. H., Bi, I. V., Yamasaki, M., Doebley, J. F., McMullen, M. D., Gaut, B. S., Nielsen, D. M., Holland, J. B., Kresovich, S., and Buckler, E. S. A unified mixedmodel method for association mapping that accounts for multiple levels of relatedness. Nature Genetics 38, 203 208 (2006). 3. Guan, Y. and Stephens, M. Practical issues in imputation-based association mapping. PLoS Genetics 4 (2008). 4. Bennett, B. J., Farber, C. R., Orozco, L., Kang, H. M., Ghazalpour, A., Siemers, N., Neubauer, M., Neuhaus, I., Yordanova, R., Guan, B., Truong, A., Yang, W.-P., He, A., Kayne, P., Gargalovic, P., Kirchgessner, T., Pan, C., Castellani, L. W., Kostem, E., Furlotte, N., Drake, T. A., Eskin, E., and Lusis, A. J. A high-resolution association mapping panel for the dissection of complex traits in mice. Genome Research 20, 281 290 (2010). 5. Kang, H. M., Sul, J. H., Service, S. K., Zaitlen, N. A., Kong, S.-Y., Freimer, N. B., Sabatti, C., and Eskin, E. Variance component model to account for sample structure in genome-wide association studies. Nature Genetics 42, 348 354 (2010). 6. The Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447, 661 678 (2007). 7. Zhang, Z., Ersoz, E., Lai, C.-Q., Todhunter, R. J., Tiwari, H. K., Gore, M. A., Bradbury, P. J., Yu, J., Arnett, D. K., Ordovas, J. M., and Buckler, E. S. Mixed linear model approach adapted for genome-wide association studies. Nature Genetics 42, 355 360 (2010). 10