Nature Methods: doi:10.1038/nmeth.3439
Supplementary Figure 1 Computational run time of alternative implementations of mtset as a function of the number of traits. Shown is the extrapolated CPU time (h) to test associations on chromosome 20, considering a total of 3,975 windows (tests), on a simulated cohort with 1,000 individuals for increasing numbers of traits. Compared are mtset and the approximate mtset-pc model. mtset-naive denotes the runtime for a standard LMM package. Runtime estimates were obtained from a single core of an Intel Xeon CPU.
Supplementary Figure 2 Computational run time of alternative implementations of mtset as a function of the cohort size. (a) Shown is the CPU time (h) to test associations on chromosome 20 (3,975 regions/tests) on a simulated cohort with increasing numbers of individuals and four traits. Compared are mtset and the approximate mtset-pc model. Additionally, we considered a low-rank approximation where the background covariance has rank 30, matching the number of PCs included as fixed effects in the mtset-pc model (see Online Methods). mtset-naive denotes the runtime for a standard LMM package, which scales cubically in the numbers of traits and samples. Runtime estimates were obtained on a single core of an Intel Xeon CPU. (b) Shown is the average number of iterations until the optimizer converges. For larger numbers of samples, the likelihood becomes more peaked, resulting in fewer iterations and thus reduced overall runtime.
Supplementary Figure 3 Characterization of the confounding structure in the four data sets used to assess the statistical calibration of mtset. Shown are the genetic relatedness matrices as well as scatter plots of the first two principal components for each of the four datasets used to assess the statistical calibration of mtset. (a) Empirical genotype data of 1,000 individuals from 14 populations that are part of the 1000 Genomes project (1000G). (b-d) Synthetic datasets based on 1000 Genomes individuals of European ancestry. In brief, each individual is assigned n ancestors, randomly inheriting blocks of SNPs from its ancestors. By placing alternative restrictions on the ancestors (number of ancestors; ancestors drawn from the same or different populations), datasets with different confounding structures can be obtained: (b) simpopstructure (kinship matrix has low-rank structure), (c) simunrelated (kinship matrix is not structured) and (d) simrelated (kinship matrix is highly structured). See Online Methods and Supplementary Note for full details.
Supplementary Figure 4 Statistical calibration of mtset, mtset-pc, stlmm-sv and mtlmm-sv for four data sets with different confounding structures. Shown are QQ-plots for simulated data when only background effects (no causal variants) were simulated, considering alternative degrees of population structure and relatedness (Online Methods; see also Supplementary Fig. 3). Compared are a single-trait single-variant LMM (stlmm-sv), a multi-trait single-variant LMM (mtlmm-sv) as well as mtset and the PC-based approximation without relatedness component (mtset-pc). From left to right: mtset, mtset-pc, stlmm-sv and mtlmm-sv. From top to bottom: 1000 Genomes (real genotypes), simpopstructure, simunrelated and simrelated (see Supplementary Fig. 3). Whereas the models mtset, stlmm-sv and mtlmm-sv yield robust results irrespective of the type of confounding (see also Fig. 1), mtset-pc is not able to correct for complex (cryptic) relatedness between individuals (bottom row, second column).
Supplementary Figure 5 Parametric fit of the null distribution on simulated data using 1000 Genomes genotypes for mtset. The null distribution is fit by a mixture (with weight π) of a χ²_0 and a scaled χ²_d test statistic using five genome-wide permutations. Although we use only the top 10% of null test statistics for fitting the free parameters (π, a, d), we found empirically that our fit works well for the complete range of test statistics. Shown are the results for five different repetitions of four simulated phenotypes when only background effects are present.
Supplementary Figure 6 Power comparison of alternative methods on simulated data using genotype data from 1000 Genomes individuals. Shown is power at 10% family-wise error rate for mtset, mtset-pc, mtlmm-sv, stset and stlmm-sv for varying simulation parameters. Specifically, we altered the proportion of variance explained by the region (h²_r), the number of causal variants in the region (S_r), the percentage of shared causal variants (π_r), the proportion of variance explained by the genetic background (h²_g), the percentage of residual variance explained by hidden confounders (λ), and the percentage of background and residual signal that is shared across traits (α) (see also Supplementary Table 2). See Online Methods for details on the simulation procedure and the evaluation scheme.
Supplementary Figure 7 Power comparison when varying the size of the set component on simulated data using genotype data from 1000 Genomes individuals. (a) Shown is power at 10% family-wise error rate for mtset, stset, mtset-pc, mtlmm-sv and stlmm-sv when varying the region size for set test approaches. While set tests are overall robust, these methods are most powerful when the region size matches the size of the simulated causal region. (b) Average squared correlation coefficient between variants within a window as a function of the window size. (c) Number of unique SNPs within testing regions as a function of the window size. When selecting the size of the testing window, both linkage disequilibrium and the number of SNPs within regions should be considered. Testing regions that are too small lead to high LD among SNPs within windows and a low number of unique SNPs, which limits the advantages of set tests compared to single-variant LMMs. Conversely, regions that are too large result in prohibitively large numbers of SNPs, which presents a computational burden and may lead to reduced power (a).
Supplementary Figure 8 Scalability of mtset as a function of the number of variants in the set component. Shown is the computational time to fit a single window using mtset (a) and mtset-pc (b) for windows with increasing numbers of variants (randomly drawn from chromosome 20, 1000 Genomes dataset). Runtimes are reported for windows of varying size (1 kb-200 kb) using simulated data generated with the default parameter settings (see also Supplementary Table 2).
Supplementary Figure 9 Statistical calibration of all considered methods applied to four blood lipid levels on the NFBC data set. (a-c) QQ-plots of set tests including the relatedness component (a), approximate set tests using PC-based correction (b) and single-variant LMMs (c). All methods show good calibration: both single-trait LMMs and set test methods are calibrated; for example, genomic control is λ(mtlmm-sv) = 0.979, λ(stlmm-sv[CRP]) = 0.995 and λ(stlmm-sv[LDL]) = 0.996 for the single-variant methods.
Supplementary Figure 10 Histogram of P values obtained from single- and multi-trait set tests applied to four blood lipid levels on the NFBC data set. Top row: multi-trait set tests (mtset, mtset-pc) applied to four lipid-related traits. Bottom two rows: single-trait set test (stset) applied to the individual traits (CRP, LDL, HDL, TRIGL). The spike in the histograms is a common feature of set tests and results from the constrained marginal likelihood optimization: the set component is bounded to explain variance greater than or equal to zero, resulting in a box-constrained optimization. The location of the spike is determined by the mixture coefficients of the parametric null distribution fit (see Online Methods and Supplementary Note).
Supplementary Figure 11 Manhattan plots for different methods applied to four blood lipid levels on the NFBC data set. (a,b) Shown are Manhattan plots of the minimal P values across traits, considering either a single-trait single-variant LMM (stlmm-sv) (a) or a single-trait set test (stset) (b). (c-e) Corresponding Manhattan plots for multi-trait approaches jointly fit to all four traits: mtlmm-sv (c), mtset (d) and mtset-pc (e). mtset-pc is the most powerful approach; it recovers all associations found by the union of QTLs retrieved by the previous approaches (stlmm-sv, mtlmm-sv and stset) and yields two additional QTLs: one association on chromosome 1 (shared with mtset) and a second QTL on chromosome 16.
Supplementary Figure 12 Manhattan plots for quantitative traits related to basal hematology in the rat data set. (a, c, e, g, i, k) Manhattan plots for basophils (basos), eosinophils (eos), large unstained cells (lucs), lymphocytes (lymphs), monocytes (monos) and neutrophils (neuts), respectively, when using a single-trait single-variant LMM (stlmm-sv). (b, d, f, h, j, l) Analogous Manhattan plots for the same traits obtained using a single-trait set test (stset). (m, n) Manhattan plots from the multi-trait single-variant LMM (mtlmm-sv) and the multi-trait set test (mtset), respectively. Note that the horizontal lines in the Manhattan plots for stset and mtset are a common feature of set tests and result from the constrained marginal likelihood optimization: the set component is bounded to explain variance greater than or equal to zero, resulting in a box-constrained optimization (see also Supplementary Fig. 10).
Supplementary Figure 13 Manhattan plots for set tests when considering different strategies for confounder correction, applied to six phenotypes related to basal hematology in the rat data set. (a) Manhattan plot obtained when applying mtset without any adjustment for relatedness or population structure (mtset-nobg). (b) Equivalent Manhattan plot when using the top 30 principal components to correct for population structure (mtset-pc). (c) Results obtained from the full mtset model, where relatedness is accounted for using a second random effect term. Because of the closely related individuals in the study population, only the full mtset model is able to comprehensively correct for relatedness (c); see also main Fig. 2.
Supplementary Figure 14 Distribution of the number of variants within testing regions as well as the squared SNP-SNP correlation coefficient within regions, when considering regions of increasing window sizes. Left column: dependency between region sizes and the number of contained variants. Right column: dependency between region sizes and the squared correlation coefficient for SNPs within regions. From top to bottom: rat dataset, NFBC data, 1000 Genomes data (chromosome 20). The computational cost of mtset depends on the number of (unique) SNPs in the testing regions. In the experiments, we considered 100 kb windows for the NFBC data, 1 Mb windows for the rat study and 30 kb windows for the 1000 Genomes data. Alternative results for different region sizes are shown in Supplementary Fig. 7 (simulated data based on 1000 Genomes individuals) and Supplementary Table 4 (NFBC data).
Supplementary Figure 15 Comparison of test P values obtained from mtset-pc and mtset-lowrankbg. Compared are likelihood-ratio test statistics for the mtset-pc model and a model that considers a low-rank approximation to the background covariance (using the same number of principal components; mtset-lowrankbg, Online Methods). For large cohorts, we observe good concordance between both models. This confirms that accounting for PCs as (REML) fixed effects or alternatively including them as random effect covariates yields concordant results.
Supplementary Table 1 Type-1 error estimates on simulated data. [Columns: Method | Dataset | Significance level | True windows | Test windows | Train windows; rows cover mtset and mtset-pc on the 1000G, simpopstructure, simunrelated and simrelated datasets at significance levels 5.00e-02 to 5.00e-05.] Shown are the type-1 error estimates for increasingly stringent significance levels α ∈ {5e-02, 5e-03, 5e-04, 5e-05} on four alternative simulated datasets (see also Supplementary Fig. 3). Train windows denote regions that have been used (based on permutations) to fit the parametric model of the null distribution (Online Methods). True windows denote genomic regions that have not been used to train the null model (independent test validation). Finally, test windows denote regions where the genotype-phenotype relationship has been shuffled; these are equivalent to train windows, but use a different set of permutations (Online Methods). mtset and mtset-pc perform equally well when no structure or only population structure is present, while the calibration of mtset-pc deteriorates when the individuals are related (see Supplementary Methods and Supplementary Fig. 3 for the simulation strategy).
Supplementary Table 2 Parameter ranges for simulated datasets. [Rows: h²_r, S_r, π_r, α, h²_g, λ and window size (in kb); default values highlighted in bold in the original table.] To assess the power of different methods, we considered a range of alternative simulations, varying key parameters that determine the genetic architecture of the traits. We altered the variance explained by the region (h²_r), the number of causal variants from the region (S_r), the percentage of shared causal variants (π_r), the percentage of background and residual signal that is shared across traits (α), the variance explained by the genetic background (h²_g), the percentage of residual variance explained by hidden confounders (λ) and the window size. Each of these parameters was varied while keeping the others at their default values (highlighted in bold). For details of the simulation procedure, see Online Methods.

Supplementary Table 3 Tabular summary of QTLs identified by different set tests and single-variant LMMs on the NFBC dataset. The results table is provided as a separate supplementary information file: nfbc sm 1e5.xlsx

Supplementary Table 4 Tabular summary of QTLs identified by mtset with varying window size on the NFBC dataset. The results table is provided as a separate supplementary information file: nfbc sm windowsize.xlsx

Supplementary Table 5 Tabular summary of QTLs identified by different set tests and single-locus LMMs on the rat dataset. The results table is provided as a separate supplementary information file: rat sm 1e6.xlsx
Supplementary Table 6 Estimates of trait heritability and covariances for four lipid-related traits (CRP, LDL, HDL, TRIGL) from the NFBC dataset. [Blocks: heritability estimates (single-trait and multi-trait), genetic covariance matrix, noise covariance matrix and phenotypic covariance matrix; entries are estimate ± standard error.] Heritability estimates: single-trait heritability estimates are obtained independently for each trait; multi-trait estimates correspond to the marginal estimates obtained from the genetic and noise trait covariance matrices of the null model fit of mtset. As expected, these marginal estimates are consistent. Genetic covariance matrix: trait-trait covariances of the relatedness component from the null model fit of mtset. Noise covariance matrix: trait-trait covariances of the noise component of the null model fit of mtset. Phenotypic covariance matrix: empirical covariance matrix of the raw phenotypes. All estimates are obtained from a maximum likelihood fit of mtset; standard errors are denoted by ±.
Supplementary Table 7 Estimates of trait heritability and covariances for six phenotypes related to basal haematology (basos, eos, lucs, lymphs, monos, neuts) on the rat dataset. [Blocks: heritability estimates (single-trait and multi-trait), genetic covariance matrix, noise covariance matrix and phenotypic covariance matrix; entries are estimate ± standard error.] Heritability estimates: single-trait heritability estimates are obtained independently for each trait; multi-trait estimates correspond to the marginal estimates obtained from the genetic and noise trait covariance matrices of the null model fit of mtset. As expected, these marginal estimates are consistent. Genetic covariance matrix: trait-trait covariances of the relatedness component from the null model fit of mtset. Noise covariance matrix: trait-trait covariances of the noise component of the null model fit of mtset. Phenotypic covariance matrix: empirical covariance matrix of the raw phenotypes. All estimates are obtained from a maximum likelihood fit of mtset; standard errors are denoted by ±.
Supplementary Notes: Efficient multivariate set tests for the genetic analysis of correlated traits

Francesco Paolo Casale, Barbara Rakitsch, Christoph Lippert, Oliver Stegle

1. Multi-trait set tests

We here provide additional implementation details of mtset, covering efficient inference approaches and approximation schemes to scale mtset to very large cohorts. Section 1.1 introduces the multi-trait linear mixed model (LMM) that underlies mtset. Section 1.2 describes a permutation scheme to estimate p-values within mtset. In Section 1.3, we discuss inference challenges in LMMs, the approach taken in mtset and the relationship to prior work. In Section 1.4, we lay out the mathematical details of efficient likelihood and gradient evaluations for parameter inference in mtset. Finally, in Section 1.5 we discuss alternative approximations to scale mtset to extremely large cohorts, mtset-pc and mtset-lowrankbg.

1.1 Model

The matrix-variate phenotype Y is modelled as the sum of the contribution from the variants in the genetic region (set component), a random genetic background effect (relatedness component) and residual observation noise:

    Y = FB + U_r + U_g + Ψ,   (1)

where FB denotes the fixed effects, U_r the set component, U_g the relatedness component and Ψ the noise. Here, Y denotes the N×P phenotype matrix for N individuals and P traits. F is the N×N_FE sample-design matrix of the fixed effects and B is the corresponding N_FE×P weight matrix. The matrix U_r denotes effects from the set component, U_g explains variation from the relatedness component and Ψ denotes residual noise. We model each of these three terms as random effects with the following matrix-variate normal priors:

    U_r ~ MVN(0, C_r, R_r),  U_g ~ MVN(0, C_g, R_g),  Ψ ~ MVN(0, C_n, I_{N×N}).   (2)

The covariance matrix C_r ∈ R^{P×P} explains the trait-to-trait covariance between phenotypes that is induced by the set term.
Conversely, the individual-to-individual covariance R_r ∈ R^{N×N} denotes the genetic relatedness matrix between individuals that captures the local genetic structure of the variants in the set.

† These authors contributed equally.
Analogously, trait covariances induced by the relatedness component are modelled by the P×P trait-to-trait covariance matrix C_g, and R_g denotes the corresponding individual-to-individual relatedness matrix that captures the global genetic relatedness between individuals (e.g. kinship). Finally, the random effect Ψ explains i.i.d. observation noise, where C_n models residual correlations between the traits. The marginal likelihood of the model in (1)-(2) is given by

    p(Y | F, B, C_r, R_r, C_g, R_g, C_n) = N( vec(Y) | vec(FB), C_r ⊗ R_r + C_g ⊗ R_g + C_n ⊗ I_{N×N} ),   (3)

where the three Kronecker terms correspond to the set component, the relatedness component and the noise, respectively; ⊗ denotes the Kronecker product (see Appendix A.1) and we have used the equivalence of a matrix-variate normal distribution and a multivariate normal distribution (Appendix A.2). The operator vec(·) denotes a stacking operation, which transforms an input matrix into a vector by concatenating its columns. For the sake of clarity, we omit the fixed effects from now on in this derivation; the software implementation of mtset provides support for fixed-effect covariates. The LMM in Eqn. (3) is closely related to existing multi-trait association models used in genetics, in particular the MTMM model [1] as well as the multi-trait version of GEMMA [2] and implementations in LIMIX [3]. Importantly, however, mtset requires two variance component terms, whereas GEMMA and MTMM build on a single variance component to account for relatedness (in addition to observation noise), and the genetic variants are tested one by one as fixed-effect covariates. A detailed discussion of how mtset relates to prior work is provided in Section 1.3. Both the set component (R_r) and the relatedness component (R_g) can be estimated from the genotype data alone (see below). In contrast, the elements of the three trait-to-trait covariance matrices need to be estimated from the full model, for which we employ maximum likelihood estimation.
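The marginal covariance in Eqn. (3) can be assembled directly with Kronecker products. The following sketch (toy dimensions and randomly generated covariance factors, not the mtset implementation) evaluates the resulting multivariate normal log-likelihood:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
N, P = 50, 4  # toy numbers of individuals and traits

def random_psd(d):
    """Random positive-definite matrix standing in for a fitted covariance."""
    A = rng.standard_normal((d, d))
    return A @ A.T / d + 1e-3 * np.eye(d)

Cr, Cg, Cn = random_psd(P), random_psd(P), random_psd(P)  # trait-to-trait terms
Rr, Rg = random_psd(N), random_psd(N)                     # sample-to-sample terms

# Marginal covariance of vec(Y): Cr (x) Rr + Cg (x) Rg + Cn (x) I_N
Sigma = np.kron(Cr, Rr) + np.kron(Cg, Rg) + np.kron(Cn, np.eye(N))

Y = rng.standard_normal((N, P))
vecY = Y.reshape(-1, order="F")  # vec(.) stacks the columns of Y

ll = multivariate_normal(mean=np.zeros(N * P), cov=Sigma).logpdf(vecY)
```

Forming Sigma explicitly costs O(N²P²) memory and O(N³P³) time per likelihood evaluation, which is exactly the naive complexity that the factorizations described below avoid.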
In order to retain efficiency, we exploit linear algebra identities and convenient factorizations, thereby minimizing the computational complexity and memory requirements of likelihood and gradient evaluations.

Set and relatedness covariance matrices. In principle, any valid covariance function [4] can be used to define the covariance matrices in mtset. However, the algorithmic tricks for computational efficiency in mtset rely on (i) the assumption that both R_r and R_g are constant (i.e., their eigenvalue decompositions and some other operations can be cached) and (ii) that the set covariance R_r is low-rank. Here, we consider the realized relationship matrix, which is compatible with these assumptions [5]. We define R_g = SS^T and R_r = GG^T, where S ∈ R^{N×S} and G ∈ R^{N×R} denote matrices of all genome-wide variants (relatedness) and of the variants in the set to be tested, respectively. The scalar S denotes the total number of genome-wide variants and R corresponds to the number of variants in the set. Weights for individual variants, for example to prioritize rare variants, could be incorporated straightforwardly; this approach has previously been used to increase power for rare-variant association analysis (see e.g. [6]). If we use C to denote the rank of the trait-to-trait covariance matrix C_r, the overall rank of the region covariance term follows as C·R with R ≤ N and C ≤ P, which does not directly depend on the number of samples and traits. As discussed in the following paragraph, C can be interpreted as the number of independent effects from the region across traits.

Rank of the set trait-to-trait covariance. For the sake of computational efficiency, we consider a low-rank set covariance, setting C = 1. This setting results in a linear scaling of the number of parameters in P (instead of a quadratic scaling).¹

¹ We perform gradient-based optimization of the marginal likelihood (see Section 1.4).
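As a minimal sketch of the realized relationship matrices above (toy genotypes; per-variant standardization is one common convention, not necessarily the exact normalization used by mtset):

```python
import numpy as np

rng = np.random.default_rng(1)
N, S_total, R_set = 100, 1000, 10  # toy numbers of individuals and variants

# Toy genotypes in {0, 1, 2}; column-standardized before forming the covariance.
X = rng.integers(0, 3, size=(N, S_total)).astype(float)
X -= X.mean(axis=0)
sd = X.std(axis=0)
X[:, sd > 0] /= sd[sd > 0]

Rg = X @ X.T / S_total      # relatedness component, R_g = S S^T (scaled)
G = X[:, :R_set]            # hypothetical set of R variants to be tested
Rr = G @ G.T                # set component, R_r = G G^T (rank <= R_set)
```

Because Rr has rank at most R_set, operations involving the set component can exploit low-rank updates rather than full N×N factorizations.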
In order to understand the effect of a low-rank trait-to-trait covariance matrix on the genetic effects the model can capture, it is instructive to derive the mtset LMM from a generative linear-model perspective:

    Y = FB + GV + U_g + Ψ,   (4)

where FB denotes the fixed effects, GV the set component, U_g the relatedness component and Ψ the noise; V ∈ R^{R×P} holds the effect sizes of the R variants in the region on the P traits (V_{r,p} is the effect size of variant r on trait p), U_g ~ MVN(0, C_g, R_g) and Ψ ~ MVN(0, C_n, I_{N×N}). Assuming C = 1 is equivalent to assuming that we can write the effect sizes for trait p as V_{:,p} = e_p v. Note that while there is a unique genetic signal v ∈ R^{R×1}, which is shared across all traits, this model allows for trait-specific rescaling of this signal through the factors e_p. Introducing the scaling vector e = [e_1, ..., e_P]^T ∈ R^{P×1}, we can rewrite the model as

    Y = FB + G v e^T + U_g + Ψ.   (5)

Finally, considering a normal prior over the new weights, v ~ N(0, I), and marginalising them out, we obtain the marginal likelihood in (3) with U_r ~ MVN(0, C_r, R_r), C_r = ee^T and R_r = GG^T. Notice that C_r is a rank-1 matrix. More complex genetic signals (higher ranks of C_r) could be considered by relaxing this assumption, at the cost of increased model complexity and additional computational cost (see below and Table 1).

1.2 Estimation of p-values and significance testing

Building on previous methods for single-trait set tests [7, 8], we consider likelihood-ratio tests to assess the significance of a particular set. When testing for variance components, the distribution of the test statistic under the null is not known when the phenotype vector vec(Y) cannot be divided into a large number of i.i.d. subvectors [9], as is the case here.
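The rank-one construction above can be checked numerically: with column-stacked vec(·), vec(G v e^T) = (e ⊗ G) v, so marginalising v ~ N(0, I) yields the covariance (e e^T) ⊗ (G G^T). A small sketch (toy dimensions):

```python
import numpy as np

rng = np.random.default_rng(2)
N, R, P = 8, 3, 4  # toy dimensions

G = rng.standard_normal((N, R))
e = rng.standard_normal(P)

# vec(G v e^T) = (e (x) G) v, so with v ~ N(0, I_R) the covariance of the
# set term is (e (x) G)(e (x) G)^T = (e e^T) (x) (G G^T).
A = np.kron(e.reshape(P, 1), G)              # (N*P) x R map acting on v
cov_direct = A @ A.T
cov_kron = np.kron(np.outer(e, e), G @ G.T)  # C_r (x) R_r with C_r = e e^T
```

The identity follows from the mixed-product property of the Kronecker product, and the factor np.outer(e, e) makes the rank-1 structure of C_r explicit.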
To estimate p-values, we employ a permutation scheme in which we assume that the distribution of the test statistic under the null is constant across regions and has the parametric form

    p(x | π, a, d) = π χ²_0(x) + (1 − π) a χ²_d(x).   (6)

We first obtain test statistics from the null distribution by using genome-wide permutations, pooling the test statistics over all windows. Subsequently, we use the largest 10% of the test statistics to fit the parameters such that the error between the parametric and theoretical p-values is minimized. In the experiments, we found that a relatively small number of genome-wide permutations (<100) was sufficient to accurately estimate the null distribution. A closely related scheme has previously been proposed for single-trait set tests [7] and compared to other testing procedures, in particular score tests [8], suggesting that likelihood-ratio tests tend to be well powered.

1.3 Overview of inference methods in LMMs

Before providing full details of the efficient inference scheme in mtset, we give an overview of existing methods for inference in LMMs and compare these methods to the approach taken here. The majority of LMM-based approaches for genetic association testing build on closely related formulations of the null model that underlies mtset. Common to these methods is the assumption that the observed phenotype data are modelled by the sum of a variance component explaining variation due to relatedness and residual noise. In standard applications of LMMs for GWAS, genetic variants are then tested one by one as additional fixed effects in the model (as are other covariates). In contrast, set tests such as mtset aggregate across multiple proximal genetic variants using a second variance component
in the model. In the context of single-trait set tests, this has previously been described in [10, 11, 7, 8]; however, none of these inference schemes allows for multi-trait modelling. Parameter inference in either type of LMM (one or two variance components) is typically done using (restricted) maximum likelihood. Because of the large number of alternative models that need to be fitted, the computational tractability of the underlying operations, i.e. evaluation of the likelihood and gradients to determine model parameters, is essential. In general, naive inference in an LMM requires the inversion of the covariance matrix in the model, which for a multi-trait model with N individuals and P traits scales cubically in both dimensions, i.e. O(N³P³).

LMMs with fixed-effect testing. Efficient inference for single-trait LMMs, as implemented in FaST-LMM [12] and GEMMA [13], exploits pre-computing the (constant across SNPs) eigen decomposition of the sample covariance matrix. This step reduces the computational cost from O(N³) per variant to a single O(N³) operation up-front² and a per-test complexity of O(N²). For every test, exact parameter inference can be achieved by means of simple closed-form operations and a one-dimensional Brent search optimization. The computational complexity of these approaches can be further reduced to O(N²) for the up-front computation and a per-test complexity of O(N), conditioned on the relatedness covariance matrix having low-rank structure. In practice, this can be achieved through a feature selection approach, selecting a small proportion of all genome-wide variants to estimate R_g [14]. The extension of these efficient linear algebra methods to joint analysis of multiple traits (mtlmm-sv) has recently been proposed as an extension to GEMMA [2] (termed mvlmm).
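The eigen decomposition trick can be illustrated for a single-trait LMM with covariance σ_g²(K + δI): after rotating the phenotype by the eigenvectors of K, each likelihood evaluation costs O(N). This is a simplified sketch (toy data, a grid search in place of Brent optimization, no fixed effects):

```python
import numpy as np

rng = np.random.default_rng(3)
N = 200

# Toy kinship matrix K and phenotype y.
X = rng.standard_normal((N, 500))
K = X @ X.T / 500
y = rng.standard_normal(N)

# O(N^3) once, shared across all tests: eigen decomposition of K.
S, U = np.linalg.eigh(K)
Uy = U.T @ y  # rotated phenotype

def neg_loglik(delta):
    # In the rotated basis, cov = sigma_g^2 * diag(S + delta), so evaluating
    # the likelihood (with sigma_g^2 profiled out in closed form) costs O(N).
    D = S + delta
    sigma_g2 = np.mean(Uy**2 / D)  # ML estimate of sigma_g^2 given delta
    return 0.5 * (N * np.log(2 * np.pi * sigma_g2) + np.sum(np.log(D)) + N)

# One-dimensional optimization over the variance ratio delta.
deltas = np.logspace(-3, 3, 50)
best_delta = deltas[np.argmin([neg_loglik(d) for d in deltas])]
```

Because the O(N³) decomposition is shared across all tests, only the cheap rotated-space evaluations are repeated per variant.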
Combining Kronecker product algebra with the eigen decomposition trick, the naive cost of O(N³P³) can be reduced to a single O(N³) operation up-front and O(N² + NP^x) per variant, where x depends on the optimization algorithm. As the number of variance components increases quadratically with the number of traits, derivative-free methods, as used in efficient single-trait LMMs, are rendered inefficient; hence mvlmm considers a gradient-based optimization scheme (combined with an EM-like algorithm) to estimate model parameters. In particular, mvlmm combines the Newton-Raphson and expectation-maximization algorithms.

LMMs for set tests. Single-trait set tests based on an LMM with a single variance component were first proposed in [15, 16, 10] and subsequently extended to include a relatedness component [11, 17, 6]. Common to these models is that p-values are estimated using a score test. Alternatively, it has also been proposed to use a likelihood-ratio test to assess statistical significance for the same class of LMMs [7]. A recent comparison between score tests and likelihood-ratio tests [8] shows that likelihood-ratio tests tend to have more power in real settings. However, score tests are computationally cheaper, as the model parameters need to be fit only once, on the null model, whereas likelihood-ratio tests require full parameter inference of the alternative model for each test. FaST-LMM-Set [7] assumes a low-rank relatedness covariance and a low-rank set covariance, which allows aggregating both components into a single (low-rank) variance component, enabling efficient inference. Parameter optimization is again carried out using a one-dimensional Brent search. An extension to full-rank background covariance matrices has been presented in [8]; this model is a special case of mtset (single trait) and will be referred to as stset.
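Likelihood-ratio set tests of this kind rely on the permutation-based parametric null fit of Section 1.2 (Eqn. (6)). A toy sketch of that fit, with synthetic "permutation" statistics and an illustrative least-squares objective on log p-values (the exact objective used by mtset may differ):

```python
import numpy as np
from scipy.stats import chi2
from scipy.optimize import minimize

rng = np.random.default_rng(4)
n = 5000

# Synthetic null LRT statistics: a point mass at zero mixed with a scaled
# chi-square, i.e. the parametric form assumed in Eqn. (6).
true_pi, true_a, true_d = 0.3, 0.8, 1.5
is_zero = rng.uniform(size=n) < true_pi
stats = np.where(is_zero, 0.0,
                 true_a * chi2.rvs(true_d, size=n, random_state=rng))

def pval(x, pi, a, d):
    # Survival function of the mixture; the chi2_0 mass contributes 0 for x > 0.
    return np.where(x <= 0, 1.0, (1 - pi) * chi2.sf(x / a, d))

# Fit (pi, a, d) on the top 10% of null statistics by matching parametric
# p-values to empirical tail probabilities.
top = np.sort(stats)[::-1][: n // 10]
emp_p = np.arange(1, len(top) + 1) / (n + 1.0)

def loss(theta):
    pi, a, d = theta
    return np.sum((np.log(pval(top, pi, a, d)) - np.log(emp_p)) ** 2)

fit = minimize(loss, x0=[0.5, 1.0, 1.0], method="L-BFGS-B",
               bounds=[(1e-3, 1 - 1e-3), (1e-3, 10.0), (1e-3, 10.0)])
pi_hat, a_hat, d_hat = fit.x
```

Once fitted, p-values for observed statistics follow from the same mixture survival function, so only a modest number of genome-wide permutations is needed.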
In the same way that mvLMM extends a standard single-variant LMM to multi-trait analysis, mtSet is the multivariate generalization of stSet. As discussed in the next section, the algorithm underlying mtSet combines eigenvalue decompositions and low-rank updates with Kronecker product algebra to break down the $O(N^3 P^3)$ computational cost to an $O(N^3)$ operation up-front and $O(N^2 + NR^2P^2 + NRP^4)$ per set, where $R$ denotes the number of variants in the set component to be tested. In the same vein, we also consider two alternative approximations of the full mtSet model: mtSet-PC, where the relatedness component is omitted and population structure is modelled as a fixed effect, and mtSet-LowRankBg, where, analogously to FaST-LMM-Set, a low-rank relatedness covariance is assumed. Both proposed approximations scale
linearly in the number of individuals, permitting analysis of extremely large cohorts (up to 500,000 individuals; see also Figure 1 and Supplementary Figure 3). Similarly to mvLMM, parameter inference in mtSet (as well as in mtSet-PC and mtSet-LowRankBg) is done using gradient-based parameter optimization (L-BFGS [18, 19]). In our experience, the success of the optimization method is greatly affected by the employed stopping criterion. For example, when the likelihood surface is flat (N < 5,000, large windows), the default parameter settings of the SciPy [20] library in Python are not sufficiently stringent, resulting in premature stopping. We circumvent this by explicitly choosing stringent stopping criteria, setting factr to $10^3$ (default value: $10^7$).

We note that GCTA [21, 22], a popular approach to fit variance component models, provides support for arbitrary numbers of variance components for single traits and limited support for multi-trait analyses. Specifically, the model allows for joint analysis across pairs of traits, which can be regarded as a special case of GEMMA, however employing gradient-based parameter inference (using the PX-AI algorithm). Table 1 provides a tabular listing of the per-test computational complexity for alternative LMM methods and implementations. Note that the listed complexities do not take into account the up-front $O(N^3)$ operation for the eigendecomposition of the relatedness covariance matrix that is common to all methods (or $O(N^2)$, respectively, if a low-rank relatedness covariance is used).

1.3. Efficient inference for the full mtSet model

Without loss of generality, we consider $G \in \mathbb{R}^{N \times R}$ having column rank $R$ in the following³. To simplify the derivation of efficient inference (see Section 1.3), we also rewrite the trait-to-trait covariance matrix as $C_r = EE^T$, where $E$ is a $P \times C$ matrix.
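A minimal illustration of the stopping-criterion setting described earlier in this section: SciPy's L-BFGS-B routine accepts `factr`, which scales the relative-decrease tolerance (roughly `factr` times machine epsilon), so smaller values are stricter. The objective below is a toy quadratic, not mtSet's likelihood:

```python
import numpy as np
from scipy.optimize import fmin_l_bfgs_b

def neg_loglik(theta):
    # toy convex surrogate with analytic gradient (stand-in for the
    # negative log-likelihood and its gradient)
    resid = theta - 1.0
    return 0.5 * np.sum(resid**2), resid

theta0 = np.zeros(3)
# default factr is 1e7; a stringent value such as 1e3 avoids premature
# stopping on flat likelihood surfaces
theta_opt, f_opt, info = fmin_l_bfgs_b(neg_loglik, theta0, factr=1e3)
```

With the stricter tolerance the optimizer runs until the minimizer is located to high precision rather than terminating on a small relative decrease.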
Inverting the covariance matrix

The full model covariance matrix has the form

$$K = C_r \otimes R_r + C_g \otimes R_g + C_n \otimes I_N \qquad (7)$$
$$\;\;= A + XX^T \qquad (8)$$

Here, we have defined $A = C_g \otimes R_g + C_n \otimes I_N$, which bundles the effects of the relatedness and noise covariance terms. The set term is represented as $XX^T$, where $X = E \otimes G$. Using the same linear algebra tricks as in previous work [23], and using the notation $M = U_M S_M U_M^T$ for the eigenvalue decomposition of a matrix $M$, we can write

$$A^{-1} = \left[\left(U_{C_n} S_{C_n}^{1/2} \otimes I_N\right)\left(\hat{C}_g \otimes R_g + I_{NP}\right)\left(U_{C_n} S_{C_n}^{1/2} \otimes I_N\right)^T\right]^{-1} \qquad (9)$$
$$\;\;= \left[\left(U_{C_n} S_{C_n}^{1/2} \otimes I_N\right)\left(U_{\hat{C}_g} \otimes U_{R_g}\right)\left(S_{\hat{C}_g} \otimes S_{R_g} + I_{NP}\right)\left(U_{\hat{C}_g} \otimes U_{R_g}\right)^T\left(U_{C_n} S_{C_n}^{1/2} \otimes I_N\right)^T\right]^{-1} \qquad (10)$$
$$\;\;= \left(U_{\hat{C}_g}^T S_{C_n}^{-1/2} U_{C_n}^T \otimes U_{R_g}^T\right)^T \left(S_{\hat{C}_g} \otimes S_{R_g} + I_{NP}\right)^{-1} \left(U_{\hat{C}_g}^T S_{C_n}^{-1/2} U_{C_n}^T \otimes U_{R_g}^T\right) \qquad (11)$$

where we have introduced $\hat{C}_g = S_{C_n}^{-1/2} U_{C_n}^T C_g U_{C_n} S_{C_n}^{-1/2}$. All elements in (11) can be calculated in $O(N^3 + P^3)$, where the $O(N^3)$ operation needs to be done only once in the whole analysis [23].

³ If $G$ has more columns than its column rank $R$, we can always find $\tilde{G}$ with $R$ columns such that $R_r = \tilde{G}\tilde{G}^T$ by a singular value decomposition $G = \underbrace{US^{1/2}}_{\tilde{G}} V^T$ (with runtime of $O(NR^2)$).
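The factorisation of $A^{-1}$ in (11) can be verified numerically on a small dense example. This is an illustrative sketch in which names follow the derivation, not the released mtSet code:

```python
import numpy as np

# Numerical check of A^{-1} = L^T D L for A = C_g kron R_g + C_n kron I_N.
rng = np.random.default_rng(1)
N, P = 6, 3

def random_psd(n):
    # well-conditioned symmetric positive definite matrix
    M = rng.standard_normal((n, n))
    return M @ M.T + n * np.eye(n)

C_g, C_n, R_g = random_psd(P), random_psd(P), random_psd(N)
A = np.kron(C_g, R_g) + np.kron(C_n, np.eye(N))

S_n, U_n = np.linalg.eigh(C_n)
T = U_n / np.sqrt(S_n)                    # U_n S_n^{-1/2} (columns scaled)
C_hat = T.T @ C_g @ T                     # \hat{C}_g as defined above
S_c, U_c = np.linalg.eigh(C_hat)
S_r, U_r = np.linalg.eigh(R_g)

L_c = U_c.T @ T.T                         # U_{C_hat}^T S_n^{-1/2} U_n^T
L_r = U_r.T
d = 1.0 / (np.kron(S_c, S_r) + 1.0)       # diagonal of D
L = np.kron(L_c, L_r)
A_inv = L.T @ (d[:, None] * L)            # L^T D L, without forming D
```

Here the expensive objects are only the small eigendecompositions; the $NP \times NP$ Kronecker products are formed solely to make the check explicit.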
For simplicity of notation we introduce

$$L_c = U_{\hat{C}_g}^T S_{C_n}^{-1/2} U_{C_n}^T \qquad (12)$$
$$L_r = U_{R_g}^T \qquad (13)$$
$$L = L_c \otimes L_r \qquad (14)$$
$$D = \left(S_{\hat{C}_g} \otimes S_{R_g} + I\right)^{-1} \qquad (15)$$

In the new notation, (11) becomes

$$A^{-1} = L^T D L \qquad (16)$$

which explicitly shows that $A^{-1}$ is a Kroneckered transformation of a diagonal matrix. We can use the Woodbury matrix identity to efficiently invert $K$, exploiting the low-rank nature of $XX^T$:

$$K^{-1} = \left(A + XX^T\right)^{-1} \qquad (17)$$
$$\;\;= A^{-1} - A^{-1} X \left(I + X^T A^{-1} X\right)^{-1} X^T A^{-1} \qquad (18)$$
$$\;\;= L^T D L - L^T D L X \left(I + X^T A^{-1} X\right)^{-1} X^T L^T D L \qquad (19)$$
$$\;\;= L^T \left[D - D L X \left(I + X^T A^{-1} X\right)^{-1} X^T L^T D\right] L \qquad (20)$$
$$\;\;= L^T \left(D - D W \Lambda^{-1} W^T D\right) L \qquad (21)$$

where we have introduced

$$W_c = L_c E \in \mathbb{R}^{P \times C} \qquad (22)$$
$$W_r = L_r G \in \mathbb{R}^{N \times R} \qquad (23)$$
$$W = W_c \otimes W_r \qquad (24)$$
$$\Lambda = I + X^T A^{-1} X \in \mathbb{R}^{RC \times RC} \qquad (25)$$

Computing the column matrix $W_c$ takes $O(P^2 C)$ time, while computing the row matrix $W_r$ requires $O(N^2 R)$ time. Note that the row matrix does not change while optimizing the parameters of the column covariance matrices and can be computed prior to the analysis. The matrix $\Lambda$ can also be computed efficiently by rewriting it as

$$\Lambda = I + X^T A^{-1} X \qquad (26)$$
$$\;\;= I + W^T D W \qquad (27)$$
$$\;\;= I + \left(W_c \otimes W_r\right)^T D W \qquad (28)$$
$$\;\;= I + \left(W_c^T \otimes W_r^T\right)\left[D W_{:,1} \;\dots\; D W_{:,RC}\right] \qquad (29)$$
$$\;\;= I + \left[\mathrm{vec}\left(W_r^T\, \mathrm{vec}^{-1}\left(D W_{:,1}\right) W_c\right) \;\dots\; \mathrm{vec}\left(W_r^T\, \mathrm{vec}^{-1}\left(D W_{:,RC}\right) W_c\right)\right]. \qquad (30)$$

Indeed, computing $W$ explicitly and multiplying it with $D$ takes $O(CNPR)$ time and space, multiplying the result with $W^T$ from the left takes $O(CR(NPC + RNC))$, while the inversion takes $O(C^3R^3)$ time and $O(C^2R^2)$ memory. In practice, we use the Cholesky factorization to compute the inverse of $\Lambda$, which has the advantage that we can re-use the decomposition for computing the log determinant later on.
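Two of the identities above admit quick generic checks: the Woodbury inverse in (17)-(21) and the vec-Kronecker trick behind (30). The sketch below uses a generic dense SPD matrix as a stand-in for $A$ (in mtSet, $A^{-1}$ is of course applied in the cheap Kroneckered form):

```python
import numpy as np

# Woodbury: (A + X X^T)^{-1} = A^{-1} - A^{-1} X Lam^{-1} X^T A^{-1}
rng = np.random.default_rng(2)
n, r = 30, 4
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)               # SPD stand-in for the mtSet A
X = rng.standard_normal((n, r))           # low-rank factor (X = E kron G in mtSet)

A_inv = np.linalg.inv(A)
Lam = np.eye(r) + X.T @ A_inv @ X         # capacitance matrix Lambda, cf. (25)
K_inv = A_inv - A_inv @ X @ np.linalg.solve(Lam, X.T @ A_inv)

# vec trick: (Wc kron Wr) vec(V) = vec(Wr V Wc^T), column-major vec as in (30)
Wc, Wr = rng.standard_normal((3, 3)), rng.standard_normal((5, 5))
V5 = rng.standard_normal((5, 3))
vec = lambda Z: Z.reshape(-1, order="F")  # stack columns
lhs = np.kron(Wc, Wr) @ vec(V5)
rhs = vec(Wr @ V5 @ Wc.T)
```

The second identity is what lets the Kroneckered matrices be applied column by column without ever being formed.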
Evaluating the model log likelihood

The log likelihood of our model (3) is given by

$$\mathcal{L} = -\frac{NP}{2} \log 2\pi - \frac{1}{2} \log\det K - \frac{1}{2}\, \mathrm{vec}(Y)^T K^{-1}\, \mathrm{vec}(Y) \qquad (31)$$

The log-determinant can be computed by using the matrix determinant lemma

$$\log\det K = \log\det A + \log\det \Lambda. \qquad (32)$$

Provided that we have already computed $L_c$, $L_r$, $D$ and the Cholesky decomposition of $\Lambda$, evaluating the log determinants of $A$ and $\Lambda$ takes $O(NP)$ and $O(CR)$ time, respectively. The squared form can be evaluated as follows:

$$\mathrm{vec}(Y)^T K^{-1}\, \mathrm{vec}(Y) = \mathrm{vec}(Y)^T \left[L^T \left(D - D W \Lambda^{-1} W^T D\right) L\right] \mathrm{vec}(Y)$$
$$= \mathrm{vec}(Y)^T L^T D L\, \mathrm{vec}(Y) - \mathrm{vec}(Y)^T L^T D W \Lambda^{-1} W^T D L\, \mathrm{vec}(Y)$$
$$= \mathrm{vec}(\tilde{Y})^T D\, \mathrm{vec}(\tilde{Y}) - \mathrm{vec}(\tilde{Y})^T D W \Lambda^{-1} W^T D\, \mathrm{vec}(\tilde{Y})$$
$$= \mathrm{vec}(\tilde{Y})^T D\, \mathrm{vec}(\tilde{Y}) - \mathrm{vec}(\hat{Y})^T W \Lambda^{-1} W^T \mathrm{vec}(\hat{Y})$$
$$= \mathrm{vec}(\tilde{Y})^T D\, \mathrm{vec}(\tilde{Y}) - \mathrm{vec}(\bar{Y})^T \Lambda^{-1}\, \mathrm{vec}(\bar{Y}),$$

where we have defined

$$\mathrm{vec}(\tilde{Y}) = L\, \mathrm{vec}(Y) = \left(L_c \otimes L_r\right) \mathrm{vec}(Y) = \mathrm{vec}\left(L_r Y L_c^T\right) \qquad (33)$$
$$\mathrm{vec}(\hat{Y}) = D\, \mathrm{vec}(\tilde{Y}) = \mathrm{diag}(D) \circ \mathrm{vec}(\tilde{Y}) \qquad (34)$$
$$\mathrm{vec}(\bar{Y}) = W^T \mathrm{vec}(\hat{Y}) = \left(W_c \otimes W_r\right)^T \mathrm{vec}(\hat{Y}) = \mathrm{vec}\left(W_r^T \hat{Y} W_c\right). \qquad (35)$$

Rotating and scaling the data, Eqs. (33)-(35), takes $O(N^2P + NP^2 + NP + NPC + RNC)$ time, where again the $O(N^2P)$ operation is done only once prior to the analysis. Computing the squared form $\mathrm{vec}(\bar{Y})^T \Lambda^{-1}\, \mathrm{vec}(\bar{Y})$ takes $O(C^2R^2)$ time after $\Lambda$ has been inverted.

Evaluating the gradient

The derivative of the log likelihood with respect to the column covariance parameter $\theta_i \in \boldsymbol{\theta}$ is given by

$$\mathcal{L}_{\theta_i} = -\frac{1}{2} \mathrm{tr}\left(K^{-1} K_{\theta_i}\right) + \frac{1}{2}\, \mathrm{vec}(Y)^T K^{-1} K_{\theta_i} K^{-1}\, \mathrm{vec}(Y), \qquad (36)$$

where the first term arises from the log determinant and the second term from the squared form. We have used the notation $M_{\theta_i}$ to indicate the derivative of $M$ with respect to $\theta_i$. The first term can be
rewritten as

$$\mathrm{tr}\left(K^{-1} K_{\theta_i}\right) = \mathrm{tr}\left(\left[L^T \left(D - D W \Lambda^{-1} W^T D\right) L\right] K_{\theta_i}\right)$$
$$= \mathrm{tr}\left(\left(D - D W \Lambda^{-1} W^T D\right) L K_{\theta_i} L^T\right)$$
$$= \mathrm{tr}\left(\left(D - D W \Lambda^{-1} W^T D\right) \tilde{K}_{\theta_i}\right)$$
$$= \mathrm{tr}\left(D \tilde{K}_{\theta_i}\right) - \mathrm{tr}\left(D W \Lambda^{-1} W^T D \tilde{K}_{\theta_i}\right)$$
$$= \mathrm{diag}(D)^T \mathrm{diag}\left(\tilde{K}_{\theta_i}\right) - \sum_{jk} \left(\Lambda^{-1}\right)_{jk} \left(\bar{K}_{\theta_i}\right)_{jk},$$

where $\tilde{E} = L_c E$, $\tilde{E}_{\theta_i} = L_c E_{\theta_i}$ and

$$\tilde{K}_{\theta_i} = L K_{\theta_i} L^T \qquad (37)$$
$$\;\;= \begin{cases} L_c \left(EE^T\right)_{\theta_i} L_c^T \otimes W_r W_r^T & \text{if } \theta_i \text{ is an entry of } E \\ L_c \left(C_g\right)_{\theta_i} L_c^T \otimes S_{R_g} & \text{if } \theta_i \text{ is a parameter of } C_g \\ L_c \left(C_n\right)_{\theta_i} L_c^T \otimes I & \text{if } \theta_i \text{ is a parameter of } C_n \end{cases} \qquad (39)$$
$$\;\;= \begin{cases} \left(\tilde{E}_{\theta_i}\tilde{E}^T + \tilde{E}\tilde{E}_{\theta_i}^T\right) \otimes W_r W_r^T & \text{if } \theta_i \text{ is an entry of } E \\ \tilde{C}_{\theta_i} \otimes S_{R_g} & \text{if } \theta_i \text{ is a parameter of } C_g \\ \tilde{C}_{\theta_i} \otimes I & \text{if } \theta_i \text{ is a parameter of } C_n \end{cases} \qquad (40)$$

with $\tilde{C}_{\theta_i} = L_c C_{\theta_i} L_c^T$, and

$$\bar{K}_{\theta_i} = W^T D \tilde{K}_{\theta_i} D W. \qquad (42)$$

First, we compute the column covariance matrix of $\tilde{K}_{\theta_i}$, which can be done in $O(P^2C)$, $O(P^3)$ and $O(P^3)$ time, respectively, for parameters of the set term ($E$), the relatedness term ($C_g$) and the noise term ($C_n$). Calculating $\bar{K}_{\theta_i}$ requires more care: we first compute the dot product between $D$ and $W$, which requires us to explicitly calculate $W$, taking $O(NPRC)$ time and space. The resulting matrix consists of $NP$ rows and $RC$ columns. In the next step, we multiply each column $DW_{:,i}$ by $\tilde{K}_{\theta_i}$, exploiting the same tricks as in (30):

$$\tilde{K}_{\theta_i} D W_{:,i} = \begin{cases} \left(\left(\tilde{E}_{\theta_i}\tilde{E}^T + \tilde{E}\tilde{E}_{\theta_i}^T\right) \otimes W_r W_r^T\right) D W_{:,i} & \text{if } \theta_i \text{ is an entry of } E \\ \left(\tilde{C}_{\theta_i} \otimes S_{R_g}\right) D W_{:,i} & \text{if } \theta_i \text{ is a parameter of } C_g \\ \left(\tilde{C}_{\theta_i} \otimes I\right) D W_{:,i} & \text{if } \theta_i \text{ is a parameter of } C_n \end{cases} \qquad (43)$$
$$= \begin{cases} \mathrm{vec}\left(W_r W_r^T\, \mathrm{vec}^{-1}\left(D W_{:,i}\right) \left(\tilde{E}_{\theta_i}\tilde{E}^T + \tilde{E}\tilde{E}_{\theta_i}^T\right)\right) & \text{if } \theta_i \text{ is an entry of } E \\ \mathrm{vec}\left(S_{R_g}\, \mathrm{vec}^{-1}\left(D W_{:,i}\right) \tilde{C}_{\theta_i}\right) & \text{if } \theta_i \text{ is a parameter of } C_g \\ \mathrm{vec}\left(\mathrm{vec}^{-1}\left(D W_{:,i}\right) \tilde{C}_{\theta_i}\right) & \text{if } \theta_i \text{ is a parameter of } C_n. \end{cases} \qquad (44)$$

This leads to an overall runtime complexity of $O(RC(NPC + NCR))$, $O(RC(NP + NP^2))$ and $O(RCNP^2)$ for the set, relatedness and noise parameters, respectively.
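The trace rewriting above holds for any diagonal $D$, symmetric $\tilde{K}_{\theta_i}$ and symmetric $\Lambda$, and can be checked generically on a small dense example (illustrative sketch):

```python
import numpy as np

# Check: tr((D - D W Lam^{-1} W^T D) Kt)
#        = diag(D).diag(Kt) - sum_{jk} (Lam^{-1})_{jk} (W^T D Kt D W)_{jk}
rng = np.random.default_rng(5)
n, k = 12, 4
D = np.diag(rng.uniform(0.5, 2.0, n))      # diagonal, like the D above
W = rng.standard_normal((n, k))
M = rng.standard_normal((k, k))
Lam = M @ M.T + k * np.eye(k)              # symmetric positive definite
Mk = rng.standard_normal((n, n))
Kt = Mk + Mk.T                             # symmetric stand-in for K~_theta_i

Lam_inv = np.linalg.inv(Lam)
lhs = np.trace((D - D @ W @ Lam_inv @ W.T @ D) @ Kt)
rhs = np.diag(D) @ np.diag(Kt) - np.sum(Lam_inv * (W.T @ D @ Kt @ D @ W))
```

The right-hand side only touches diagonals and a small $k \times k$ elementwise product, which is the reason the trace term is cheap once $\bar{K}_{\theta_i}$ is available.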
We use the same trick to compute the multiplication between $W^T$ and $D\tilde{K}_{\theta_i}DW$ efficiently, leading to a complexity of $O(RC(NPC + NCR))$. Finally, computing the trace term has an additional runtime of $O(NP + C^2R^2)$.

The derivative of the squared form can be rewritten as

$$\mathrm{vec}(Y)^T K^{-1} K_{\theta_i} K^{-1}\, \mathrm{vec}(Y) = \left[\left(D - DW\Lambda^{-1}W^TD\right)\mathrm{vec}(\tilde{Y})\right]^T \tilde{K}_{\theta_i} \left[\left(D - DW\Lambda^{-1}W^TD\right)\mathrm{vec}(\tilde{Y})\right]$$
$$= \left(\mathrm{vec}(\hat{Y}) - DW\Lambda^{-1}\mathrm{vec}(\bar{Y})\right)^T \tilde{K}_{\theta_i} \left(\mathrm{vec}(\hat{Y}) - DW\Lambda^{-1}\mathrm{vec}(\bar{Y})\right).$$

We start by multiplying $\Lambda^{-1}$ with $\mathrm{vec}(\bar{Y})$, which can be done in $O(C^2R^2)$ after having precomputed the inverse. Exploiting that $W$ has Kronecker structure and that $D$ is a diagonal matrix reduces the runtime for multiplying the resulting vector with $DW$ from the left from $O(NPRC)$ to $O(NPC + RNC + NP)$. In the next step, we subtract the resulting vector from $\mathrm{vec}(\hat{Y})$ and multiply it with $\tilde{K}_{\theta_i}$ from the left, with an additional runtime of $O(NPR + NP^2)$, $O(NP + NP^2)$ and $O(NP^2)$ for the set, relatedness and noise terms, respectively. Finally, we have to multiply two vectors of size $NP$, which can be done in $O(NP)$ time. A tabular overview of the individual computations and how often these need to be carried out can be found in Table 2.

Inverse
    $A^{-1}$                                                     $O(N^3 + P^3)$ *
    Cholesky $\mathrm{chol}(I + W^TDW)$:
        $D \cdot W$                                              $O(NPCR)$
        $W^T(DW)$                                                $O(NPRC^2 + NR^2C^2)$
        $\mathrm{chol}(I + W^TDW)$                               $O(C^3R^3)$
Log likelihood
    $\log\det K$:
        $\log\det A$                                             $O(NP)$
        $\log\det \Lambda$                                       $O(CR)$
    $\tilde{y}^TD\tilde{y} - \bar{y}^T\Lambda^{-1}\bar{y}$:
        $\tilde{y} = L\,\mathrm{vec}(Y)$                         $O(N^2P + NP^2)$ **
        $\bar{y} = W^TD\,\mathrm{vec}(\tilde{Y})$                $O(NP + NPC + NRC)$
        $\tilde{y}^TD\tilde{y} - \bar{y}^T\Lambda^{-1}\bar{y}$   $O(NP + C^2R^2 + CR)$
Gradient
    $\tilde{K}_{\theta_i} = LK_{\theta_i}L^T$                    $O(N^2R + P^2C)$ set **; $O(N^3 + P^3)$ relatedness *; $O(N^3 + P^3)$ noise *
    $\bar{K}_{\theta_i} = W^TD\tilde{K}_{\theta_i}DW$:
        $\tilde{K}_{\theta_i}(DW)$                               $O(NRPC^2 + NR^2C^2)$ set; $O(NRPC + NRP^2C)$ relatedness; $O(NRP^2C)$ noise
        $W^T(D\tilde{K}_{\theta_i}DW)$                           $O(NPRC^2 + NR^2C^2)$

* computed only once; ** computed only once per region

Table 2: Tabular summary of the complexity of individual computational steps in the mtSet inference.
1.5. Efficient inference for approximations to the full mtSet model

Like any exact LMM (see Section 1.3), mtSet is bound to the up-front eigenvalue decomposition of the genetic relatedness matrix, which is a cubic operation in the number of samples, limiting the scalability of LMMs to very large cohorts (N > 20,000). In the following, we discuss two alternative approximations that are available in the mtSet software implementation, allowing mtSet to scale to cohorts with up to 500,000 individuals (see also main paper text, Figure 1 and Supplementary Figure 3): mtSet-PC and mtSet-LowRankBg. In mtSet-PC, the random effect accounting for relatedness is dropped, while population structure is accounted for using fixed-effect covariates. Alternatively, mtSet-LowRankBg considers a low-rank approximation to the background covariance. Low-rank approximations to the relatedness matrix have previously been applied to single-trait LMMs, e.g. [24, 14, 7].

Modelling population structure with principal components (mtSet-PC)

In mtSet-PC, population structure is modelled as fixed effects using the first $N_{PC}$ principal components, instead of using a random effect term as in the full mtSet. This approximation results in an LMM with only a single variance component (in addition to the noise component). Denoting by $F \in \mathbb{R}^{N \times N_{PC}}$ the sample design matrix of the fixed effect, the fixed effects on the vectorized phenotypes $\mathrm{vec}(Y)$ can be written as $V = I \otimes F \in \mathbb{R}^{NP \times N_{PC}P}$ with weights $b = \mathrm{vec}(B) \in \mathbb{R}^{N_{PC}P}$. This model assumes a $P$-degrees-of-freedom fit for each PC covariate. The restricted log-likelihood [25, 26] is then given by
$$\mathcal{L}_{\boldsymbol{\theta}} = \mathrm{const.} - \frac{1}{2} \underbrace{\left(\mathrm{vec}(Y) - Vb\right)}_{\mathrm{vec}(Z)}{}^T K^{-1} \left(\mathrm{vec}(Y) - Vb\right) - \frac{1}{2}\log\det K - \frac{1}{2}\log\det \underbrace{V^TK^{-1}V}_{A_{\mathrm{reml}}} \qquad (45)$$

where

$$b = A_{\mathrm{reml}}^{-1} V^T K^{-1}\, \mathrm{vec}(Y). \qquad (46)$$

The covariance matrix can be rewritten as

$$K = EE^T \otimes GG^T + C_n \otimes I_N \qquad (47,48)$$
$$\;\;= \left(U_n S_n^{1/2} \otimes I_N\right)\left(\tilde{E}\tilde{E}^T \otimes GG^T + I_{NP}\right)\left(U_n S_n^{1/2} \otimes I_N\right)^T \qquad (49)$$
$$\;\;= \left(U_n S_n^{1/2} \otimes I_N\right)\left(\left(U_E \otimes U_G\right)\left(S_E \otimes S_G\right)\left(U_E \otimes U_G\right)^T + I_{NP}\right)\left(U_n S_n^{1/2} \otimes I_N\right)^T \qquad (50)$$

where $\tilde{E} = S_n^{-1/2}U_n^TE$ and we used the notation $M = U_M S_M^{1/2} V_M^T$ for the singular value decomposition of $M$⁴. The inverse of $K$ can be written as

$$K^{-1} = \underbrace{\left(S_n^{-1/2} U_n^T \otimes I_N\right)}_{L}{}^T \Big(I_{NP} - \underbrace{\left(U_E \otimes U_G\right)^T}_{W}{}^T \underbrace{\left(S_E^{-1} \otimes S_G^{-1} + I_{RC}\right)^{-1}}_{D} W\Big) L \qquad (51)$$

i.e. $K^{-1} = L^T(I - W^TDW)L$, with $L = L_c \otimes I_N$, $L_c = S_n^{-1/2}U_n^T$, and $W = W_c \otimes W_r$, $W_c = U_E^T$, $W_r = U_G^T$. Calculating the SVDs of $\tilde{E}$ and $G$ takes, respectively, $O(PC^2)$ and $O(NR^2)$ operations; the SVD of $G$ has to be performed only once during optimization.

⁴ $M \in \mathbb{R}^{n_1 \times n_2}$, $U \in \mathbb{R}^{n_1 \times n_1}$, $S \in \mathbb{R}^{n_1 \times n_2}$, $V \in \mathbb{R}^{n_2 \times n_2}$
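A sketch of how the principal-component covariates $F$ used by mtSet-PC can be obtained from genotypes. The number of PCs, the centering and the use of a plain SVD are illustrative choices, not a prescription of the text:

```python
import numpy as np

# Build an N x N_PC fixed-effect design from the top principal components
# of a (centered) genotype matrix; hypothetical preprocessing snippet.
rng = np.random.default_rng(4)
N, M_snps, n_pc = 100, 1000, 10
G_all = rng.binomial(2, 0.3, size=(N, M_snps)).astype(float)  # 0/1/2 genotypes
G_all -= G_all.mean(axis=0)                                   # center each SNP

U, s, Vt = np.linalg.svd(G_all, full_matrices=False)
F = U[:, :n_pc]        # N x N_PC design matrix of PC covariates
```

The full mtSet-PC design on the vectorized phenotypes is then $V = I \otimes F$, i.e. the same PC covariates with trait-specific weights.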
Evaluating the log-likelihood

The log-likelihood of the model is

$$\mathcal{L}_{\boldsymbol{\theta}} = \mathrm{const.} - \frac{1}{2}\underbrace{\mathrm{vec}(Z)^TK^{-1}\mathrm{vec}(Z)}_{\text{squared form term}} - \frac{1}{2}\underbrace{\log\det K}_{\text{logdet term}} - \frac{1}{2}\underbrace{\log\det A_{\mathrm{reml}}}_{\text{reml term}} \qquad (52)$$

The log-determinant term can be computed as follows by applying the matrix determinant lemma:

$$\log\det K = \log\det\left(\left(U_E \otimes U_G\right)\left(S_E \otimes S_G\right)\left(U_E \otimes U_G\right)^T + I_{NP}\right) + N\log\det S_n \qquad (53)$$
$$\;\;= \log\det\left(S_E^{-1} \otimes S_G^{-1} + I\right) + R\log\det S_E + C\log\det S_G + N\log\det S_n. \qquad (54)$$

$A_{\mathrm{reml}}$ and $b$ can be computed respectively as

$$A_{\mathrm{reml}} = V^TK^{-1}V \qquad (55)$$
$$\;\;= (LV)^T(LV) - (WLV)^TD(WLV) \qquad (56)$$
$$\;\;= \left(L_c \otimes F\right)^T\left(L_c \otimes F\right) - \left(W_cL_c \otimes W_rF\right)^TD\left(W_cL_c \otimes W_rF\right) \qquad (57)$$

$$b = A_{\mathrm{reml}}^{-1}V^TK^{-1}\,\mathrm{vec}(Y) \qquad (58)$$
$$\;\;= A_{\mathrm{reml}}^{-1}\left((LV)^TL\,\mathrm{vec}(Y) - (WLV)^TD\,WL\,\mathrm{vec}(Y)\right) \qquad (59)$$
$$\;\;= A_{\mathrm{reml}}^{-1}\left(\mathrm{vec}\left(F^TYL_c^TL_c\right) - (WLV)^TD\,\mathrm{vec}\left(W_rYL_c^TW_c^T\right)\right) \qquad (60)$$

Finally, the quadratic term can be rewritten as

$$\mathrm{vec}(Z)^TK^{-1}\mathrm{vec}(Z) = \left(L\,\mathrm{vec}(Z)\right)^T\left(L\,\mathrm{vec}(Z)\right) - \left(WL\,\mathrm{vec}(Z)\right)^TD\left(WL\,\mathrm{vec}(Z)\right) \qquad (61)$$

where

$$L\,\mathrm{vec}(Z) = \mathrm{vec}\left(YL_c^T - FBL_c^T\right) \qquad (62)$$
$$WL\,\mathrm{vec}(Z) = \mathrm{vec}\left(W_rYL_c^TW_c^T - W_rFBL_c^TW_c^T\right) \qquad (63)$$

The log-likelihood can be evaluated in $O(NN_{PC}^2 + NN_{PC}R + NN_{PC}P + NPR + NN_{PC}P + NP + NP^2)$, where we report only the quantities depending on $N$, which are the bottleneck for huge sample sizes; several of these quantities have to be computed only once during optimization.

Calculating the gradient

The gradient of the likelihood can be written as

$$\mathcal{L}_{\theta_i} = \frac{1}{2}\underbrace{\mathrm{vec}(Z)^TK^{-1}K_{\theta_i}K^{-1}\mathrm{vec}(Z)}_{\text{squared form 1}} + \underbrace{\mathrm{vec}(Z)^TK^{-1}Vb_{\theta_i}}_{\text{squared form 2}} - \frac{1}{2}\underbrace{\mathrm{tr}\left(K^{-1}K_{\theta_i}\right)}_{\text{trace}} - \frac{1}{2}\underbrace{\mathrm{tr}\left(A_{\mathrm{reml}}^{-1}A_{\mathrm{reml},\theta_i}\right)}_{\text{reml}} \qquad (64)$$
Let us start by rewriting $K^{-1}K_{\theta_i}K^{-1}$:

$$K^{-1}K_{\theta_i}K^{-1} = L^T\left(I - W^TDW\right)L\left(C_{\theta_i} \otimes R\right)L^T\left(I - W^TDW\right)L \qquad (65)$$
$$\;\;= L^T\left(I - W^TDW\right)\Big(\underbrace{L_cC_{\theta_i}L_c^T}_{\tilde{C}} \otimes R\Big)\left(I - W^TDW\right)L \qquad (66)$$
$$\;\;= L^T\Big(\tilde{C} \otimes R\Big)L + L^TW^TD\Big(\underbrace{W_c\tilde{C}W_c^T}_{\bar{C}} \otimes \underbrace{W_rRW_r^T}_{\bar{S}_r}\Big)DWL \qquad (67)$$
$$\;\;\;\;- L^T\left(\tilde{C} \otimes R\right)W^TDWL - \left(L^T\left(\tilde{C} \otimes R\right)W^TDWL\right)^T \qquad (68)\text{-}(70)$$

where we used that $K_{\theta_i} = C_{\theta_i} \otimes R$, with $C$ and $R$ being $C_r$ and $R_r$ if $\theta_i$ is a region-term parameter, or $C_n$ and $I_N$ if $\theta_i$ is a noise-term parameter. The gradients of $A_{\mathrm{reml}}$ and $b$ can be calculated as

$$A_{\mathrm{reml},\theta_i} = -V^TK^{-1}K_{\theta_i}K^{-1}V \qquad (71)$$
$$\;\;= -(LV)^T\left(\tilde{C} \otimes R\right)(LV) - (DWLV)^T\left(\bar{C} \otimes \bar{S}_r\right)(DWLV) \qquad (72)$$
$$\;\;\;\;+ \left(W\left(\tilde{C} \otimes R\right)LV\right)^T(DWLV) + \left(\left(W\left(\tilde{C} \otimes R\right)LV\right)^T(DWLV)\right)^T$$
$$\;\;= -\left(L_c^T\tilde{C}L_c \otimes F^TRF\right) - (DWLV)^T\left(\bar{C} \otimes \bar{S}_r\right)(DWLV) \qquad (73)$$
$$\;\;\;\;+ \left(W\left(\tilde{C} \otimes R\right)LV\right)^T(DWLV) + \left(\left(W\left(\tilde{C} \otimes R\right)LV\right)^T(DWLV)\right)^T$$

and

$$b_{\theta_i} = -A_{\mathrm{reml}}^{-1}A_{\mathrm{reml},\theta_i}b - A_{\mathrm{reml}}^{-1}V^TK^{-1}K_{\theta_i}K^{-1}\,\mathrm{vec}(Y) \qquad (74)$$

where

$$V^TK^{-1}K_{\theta_i}K^{-1}\,\mathrm{vec}(Y) = (LV)^T\left(\tilde{C} \otimes R\right)L\,\mathrm{vec}(Y) + (DWLV)^T\left(\bar{C} \otimes \bar{S}_r\right)DWL\,\mathrm{vec}(Y) \qquad (75)$$
$$\;\;\;\;- \left(W\left(\tilde{C} \otimes R\right)LV\right)^TDWL\,\mathrm{vec}(Y) - (DWLV)^TW\left(\tilde{C} \otimes R\right)L\,\mathrm{vec}(Y)$$

Several of the matrix products in (71), (74) and (75) have already been computed for estimating the log-likelihood. The additional terms can be computed efficiently by using convenient factorisations and Kronecker product algebra:

$$W\left(\tilde{C} \otimes R\right)LV = W_c\tilde{C}L_c \otimes W_rRF \qquad (76)$$
$$(LV)^T\left(\tilde{C} \otimes R\right)L\,\mathrm{vec}(Y) = \mathrm{vec}\left(F^TRY\,L_c^T\tilde{C}^TL_c\right) \qquad (77)$$
$$W\left(\tilde{C} \otimes R\right)L\,\mathrm{vec}(Y) = \mathrm{vec}\left(W_rRY\,L_c^T\tilde{C}^TW_c^T\right) \qquad (78)$$

Notice that the computation of $RY$ or $RF$ can also be done in linear time in $N$: in the non-trivial case where $R = GG^T$, we can rewrite $RY = G(G^TY)$, which takes $O(NRP)$.
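Two quick numerical checks on the mtSet-PC algebra above, on a small synthetic example: the determinant reduction in (53)-(54), and the linear-time product $RY = G(G^TY)$ that avoids ever forming the $N \times N$ matrix $R = GG^T$ (illustrative sketch):

```python
import numpy as np

rng = np.random.default_rng(6)
N, P, R, C = 8, 4, 3, 2
U_G = np.linalg.qr(rng.standard_normal((N, R)))[0]   # orthonormal columns
U_E = np.linalg.qr(rng.standard_normal((P, C)))[0]
S_G = rng.uniform(0.5, 2.0, R)                       # squared singular values
S_E = rng.uniform(0.5, 2.0, C)

# determinant reduction, cf. (53)-(54)
U = np.kron(U_E, U_G)
S = np.kron(S_E, S_G)
lhs = np.linalg.slogdet(U @ np.diag(S) @ U.T + np.eye(N * P))[1]
rhs = (np.sum(np.log(1.0 / S + 1.0))
       + R * np.sum(np.log(S_E)) + C * np.sum(np.log(S_G)))

# linear-time R Y: G (G^T Y) instead of (G G^T) Y
G = rng.standard_normal((N, R))
Y = rng.standard_normal((N, P))
RY_fast = G @ (G.T @ Y)        # O(NRP)
RY_slow = (G @ G.T) @ Y        # O(N^2 R + N^2 P), for comparison only
```

Only the small $RC$-dimensional quantities appear in the reduced determinant, which is what makes the mtSet-PC log-likelihood linear in $N$.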
MA 575 Linear Models: Cedric E Ginestet, Boston University Mixed Effects Estimation, Residuals Diagnostics Week 11, Lecture 1 1 Within-group Correlation Let us recall the simple two-level hierarchical
More informationRegression. Goal: Learn a mapping from observations (features) to continuous labels given a training set (supervised learning)
Linear Regression Regression Goal: Learn a mapping from observations (features) to continuous labels given a training set (supervised learning) Example: Height, Gender, Weight Shoe Size Audio features
More informationStat260: Bayesian Modeling and Inference Lecture Date: February 10th, Jeffreys priors. exp 1 ) p 2
Stat260: Bayesian Modeling and Inference Lecture Date: February 10th, 2010 Jeffreys priors Lecturer: Michael I. Jordan Scribe: Timothy Hunter 1 Priors for the multivariate Gaussian Consider a multivariate
More informationIntroduction to the Tensor Train Decomposition and Its Applications in Machine Learning
Introduction to the Tensor Train Decomposition and Its Applications in Machine Learning Anton Rodomanov Higher School of Economics, Russia Bayesian methods research group (http://bayesgroup.ru) 14 March
More informationSparse orthogonal factor analysis
Sparse orthogonal factor analysis Kohei Adachi and Nickolay T. Trendafilov Abstract A sparse orthogonal factor analysis procedure is proposed for estimating the optimal solution with sparse loadings. In
More informationVariational Principal Components
Variational Principal Components Christopher M. Bishop Microsoft Research 7 J. J. Thomson Avenue, Cambridge, CB3 0FB, U.K. cmbishop@microsoft.com http://research.microsoft.com/ cmbishop In Proceedings
More informationGWAS V: Gaussian processes
GWAS V: Gaussian processes Dr. Oliver Stegle Christoh Lippert Prof. Dr. Karsten Borgwardt Max-Planck-Institutes Tübingen, Germany Tübingen Summer 2011 Oliver Stegle GWAS V: Gaussian processes Summer 2011
More informationMIXED MODELS THE GENERAL MIXED MODEL
MIXED MODELS This chapter introduces best linear unbiased prediction (BLUP), a general method for predicting random effects, while Chapter 27 is concerned with the estimation of variances by restricted
More informationBasic Concepts in Matrix Algebra
Basic Concepts in Matrix Algebra An column array of p elements is called a vector of dimension p and is written as x p 1 = x 1 x 2. x p. The transpose of the column vector x p 1 is row vector x = [x 1
More informationL11: Pattern recognition principles
L11: Pattern recognition principles Bayesian decision theory Statistical classifiers Dimensionality reduction Clustering This lecture is partly based on [Huang, Acero and Hon, 2001, ch. 4] Introduction
More informationELEMENTARY LINEAR ALGEBRA
ELEMENTARY LINEAR ALGEBRA K R MATTHEWS DEPARTMENT OF MATHEMATICS UNIVERSITY OF QUEENSLAND First Printing, 99 Chapter LINEAR EQUATIONS Introduction to linear equations A linear equation in n unknowns x,
More information3/10/03 Gregory Carey Cholesky Problems - 1. Cholesky Problems
3/10/03 Gregory Carey Cholesky Problems - 1 Cholesky Problems Gregory Carey Department of Psychology and Institute for Behavioral Genetics University of Colorado Boulder CO 80309-0345 Email: gregory.carey@colorado.edu
More informationBTRY 4830/6830: Quantitative Genomics and Genetics Fall 2014
BTRY 4830/6830: Quantitative Genomics and Genetics Fall 2014 Homework 4 (version 3) - posted October 3 Assigned October 2; Due 11:59PM October 9 Problem 1 (Easy) a. For the genetic regression model: Y
More informationLinear Regression (9/11/13)
STA561: Probabilistic machine learning Linear Regression (9/11/13) Lecturer: Barbara Engelhardt Scribes: Zachary Abzug, Mike Gloudemans, Zhuosheng Gu, Zhao Song 1 Why use linear regression? Figure 1: Scatter
More informationLecture 4 Noisy Channel Coding
Lecture 4 Noisy Channel Coding I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw October 9, 2015 1 / 56 I-Hsiang Wang IT Lecture 4 The Channel Coding Problem
More informationMachine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.
Bayesian learning: Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Let y be the true label and y be the predicted
More informationMultiple-step Time Series Forecasting with Sparse Gaussian Processes
Multiple-step Time Series Forecasting with Sparse Gaussian Processes Perry Groot ab Peter Lucas a Paul van den Bosch b a Radboud University, Model-Based Systems Development, Heyendaalseweg 135, 6525 AJ
More informationECE521 lecture 4: 19 January Optimization, MLE, regularization
ECE521 lecture 4: 19 January 2017 Optimization, MLE, regularization First four lectures Lectures 1 and 2: Intro to ML Probability review Types of loss functions and algorithms Lecture 3: KNN Convexity
More informationClustering VS Classification
MCQ Clustering VS Classification 1. What is the relation between the distance between clusters and the corresponding class discriminability? a. proportional b. inversely-proportional c. no-relation Ans:
More informationMODEL-FREE LINKAGE AND ASSOCIATION MAPPING OF COMPLEX TRAITS USING QUANTITATIVE ENDOPHENOTYPES
MODEL-FREE LINKAGE AND ASSOCIATION MAPPING OF COMPLEX TRAITS USING QUANTITATIVE ENDOPHENOTYPES Saurabh Ghosh Human Genetics Unit Indian Statistical Institute, Kolkata Most common diseases are caused by
More informationFast Algorithms for SDPs derived from the Kalman-Yakubovich-Popov Lemma
Fast Algorithms for SDPs derived from the Kalman-Yakubovich-Popov Lemma Venkataramanan (Ragu) Balakrishnan School of ECE, Purdue University 8 September 2003 European Union RTN Summer School on Multi-Agent
More informationPCA vignette Principal components analysis with snpstats
PCA vignette Principal components analysis with snpstats David Clayton October 30, 2018 Principal components analysis has been widely used in population genetics in order to study population structure
More informationParametric Empirical Bayes Methods for Microarrays
Parametric Empirical Bayes Methods for Microarrays Ming Yuan, Deepayan Sarkar, Michael Newton and Christina Kendziorski April 30, 2018 Contents 1 Introduction 1 2 General Model Structure: Two Conditions
More informationBiostat 2065 Analysis of Incomplete Data
Biostat 2065 Analysis of Incomplete Data Gong Tang Dept of Biostatistics University of Pittsburgh October 20, 2005 1. Large-sample inference based on ML Let θ is the MLE, then the large-sample theory implies
More informationDATA MINING AND MACHINE LEARNING. Lecture 4: Linear models for regression and classification Lecturer: Simone Scardapane
DATA MINING AND MACHINE LEARNING Lecture 4: Linear models for regression and classification Lecturer: Simone Scardapane Academic Year 2016/2017 Table of contents Linear models for regression Regularized
More informationLinear Algebra - Part II
Linear Algebra - Part II Projection, Eigendecomposition, SVD (Adapted from Sargur Srihari s slides) Brief Review from Part 1 Symmetric Matrix: A = A T Orthogonal Matrix: A T A = AA T = I and A 1 = A T
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 11 Project
More informationThe Hilbert Space of Random Variables
The Hilbert Space of Random Variables Electrical Engineering 126 (UC Berkeley) Spring 2018 1 Outline Fix a probability space and consider the set H := {X : X is a real-valued random variable with E[X 2
More informationNotes on Latent Semantic Analysis
Notes on Latent Semantic Analysis Costas Boulis 1 Introduction One of the most fundamental problems of information retrieval (IR) is to find all documents (and nothing but those) that are semantically
More informationIntroduction to Machine Learning
10-701 Introduction to Machine Learning PCA Slides based on 18-661 Fall 2018 PCA Raw data can be Complex, High-dimensional To understand a phenomenon we measure various related quantities If we knew what
More informationCorner. Corners are the intersections of two edges of sufficiently different orientations.
2D Image Features Two dimensional image features are interesting local structures. They include junctions of different types like Y, T, X, and L. Much of the work on 2D features focuses on junction L,
More informationAdvanced Introduction to Machine Learning
10-715 Advanced Introduction to Machine Learning Homework 3 Due Nov 12, 10.30 am Rules 1. Homework is due on the due date at 10.30 am. Please hand over your homework at the beginning of class. Please see
More informationRestricted Maximum Likelihood in Linear Regression and Linear Mixed-Effects Model
Restricted Maximum Likelihood in Linear Regression and Linear Mixed-Effects Model Xiuming Zhang zhangxiuming@u.nus.edu A*STAR-NUS Clinical Imaging Research Center October, 015 Summary This report derives
More informationFlexible phenotype simulation with PhenotypeSimulator Hannah Meyer
Flexible phenotype simulation with PhenotypeSimulator Hannah Meyer 2018-03-01 Contents Introduction 1 Work-flow 2 Examples 2 Example 1: Creating a phenotype composed of population structure and observational
More informationA matrix over a field F is a rectangular array of elements from F. The symbol
Chapter MATRICES Matrix arithmetic A matrix over a field F is a rectangular array of elements from F The symbol M m n (F ) denotes the collection of all m n matrices over F Matrices will usually be denoted
More informationVector Auto-Regressive Models
Vector Auto-Regressive Models Laurent Ferrara 1 1 University of Paris Nanterre M2 Oct. 2018 Overview of the presentation 1. Vector Auto-Regressions Definition Estimation Testing 2. Impulse responses functions
More informationProbabilistic Latent Semantic Analysis
Probabilistic Latent Semantic Analysis Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr
More information