Nature Methods: doi:10.1038/nmeth.3439
Supplementary Figure 1 Computational run time of alternative implementations of mtset as a function of the number of traits. Shown is the extrapolated CPU time (h) to test associations on chromosome 20, considering a total of 3,975 windows (tests), on a simulated cohort with 1,000 individuals for increasing numbers of traits. Compared are mtset and the approximate mtset-pc model. mtset-naive denotes the runtime for a standard LMM package. Runtime estimates were obtained from a single core of an Intel Xeon CPU.
Supplementary Figure 2 Computational run time of alternative implementations of mtset as a function of the cohort size. (a) Shown is the CPU time (h) to test associations on chromosome 20 (3,975 regions/tests) on a simulated cohort with increasing numbers of individuals and four traits. Compared are mtset and the approximate mtset-pc model. Additionally, we considered a low-rank approximation where the background covariance has rank 30, matching the number of PCs included as fixed effects in the mtset-pc model (see Online Methods). mtset-naive denotes the runtime for a standard LMM package, which scales cubically in the numbers of traits and samples. Runtime estimates were obtained on a single core of an Intel Xeon CPU. (b) Shown is the average number of iterations until the optimizer converges. For larger numbers of samples, the likelihood becomes more peaked, resulting in fewer iterations and thus reduced overall runtime.
Supplementary Figure 3 Characterization of the confounding structure in the four data sets used to assess the statistical calibration of mtset. Shown are the genetic relatedness matrices as well as scatter plots of the first two principal components for each of the four datasets used to assess the statistical calibration of mtset. (a) Empirical genotype data of 1,000 individuals from 14 populations that are part of the 1000 Genomes project (1000G). (b-d) Synthetic datasets based on 1000 Genomes individuals of European ancestry. In brief, each individual is assigned n ancestors, randomly inheriting blocks of SNPs from its ancestors. By placing alternative restrictions on the ancestors (number of ancestors; ancestors drawn from the same or different populations), datasets with different confounding structures can be obtained: (b) simpopstructure (kinship matrix has low-rank structure), (c) simunrelated (kinship matrix is not structured) and (d) simrelated (kinship matrix is highly structured). See Online Methods and Supplementary Note for full details.
Supplementary Figure 4 Statistical calibration of mtset, mtset-pc, stlmm-sv and mtlmm-sv for four data sets with different confounding structures. Shown are QQ-plots for simulated data when only background effects (no causal variants) were simulated, considering alternative degrees of population structure and relatedness (Online Methods; see also Supplementary Fig. 3). Compared are a single-trait single-variant LMM (stlmm-sv), a multi-trait single-variant LMM (mtlmm-sv) as well as mtset and the PC-based approximation without relatedness component (mtset-pc). From left to right: mtset, mtset-pc, stlmm-sv and mtlmm-sv. From top to bottom: 1000 Genomes (real genotypes), simpopstructure, simunrelated and simrelated (see Supplementary Fig. 3). Whereas the models mtset, stlmm-sv and mtlmm-sv yield robust results irrespective of the type of confounding (see also Fig. 1), mtset-pc is not able to correct for complex (cryptic) relatedness between individuals (bottom row, second column).
Supplementary Figure 5 Parametric fit of the null distribution on simulated data using 1000 Genomes genotypes for mtset. The null distribution is fit by a mixture (with weight π) of a χ²_0 and a scaled χ²_d test statistic using five genome-wide permutations. Although we use only the top 10% of null test statistics for fitting the free parameters (π, a, d), we found empirically that our fit works well for the complete range of test statistics. Shown are the results for five different repetitions of four simulated phenotypes when only background effects are present.
Supplementary Figure 6 Power comparison of alternative methods on simulated data using genotype data from 1000 Genomes individuals. Shown is power at 10% family-wise error rate for mtset, mtset-pc, mtlmm-sv, stset and stlmm-sv for varying simulation parameters. Specifically, we altered the proportion of variance explained by the region (h²_r), the number of causal variants in the region (S_r), the percentage of shared causal variants (π_r), the proportion of variance explained by the genetic background (h²_g), the percentage of residual variance explained by hidden confounders (λ), and the percentage of background and residual signal that is shared across traits (α) (see also Supplementary Table 2). See Online Methods for details on the simulation procedure and the evaluation scheme.
Supplementary Figure 7 Power comparison when varying the size of the set component on simulated data using genotype data from 1000 Genomes individuals. (a) Shown is power at 10% family-wise error rate for mtset, stset, mtset-pc, mtlmm-sv and stlmm-sv when varying the region size for set test approaches. While set tests are overall robust, these methods are most powerful when the region size matches the size of the simulated causal region. (b) Average squared correlation coefficient between variants within a window as a function of the window size. (c) Number of unique SNPs within testing regions as a function of the window size. When selecting the size of the testing window, both linkage disequilibrium and the number of SNPs within regions should be considered. Testing regions that are too small lead to high LD among SNPs within windows and a low number of unique SNPs, which limits the advantages of set tests compared to single-variant LMMs. Conversely, regions that are too large result in prohibitively large numbers of SNPs, which presents a computational burden and may lead to reduced power (a).
Supplementary Figure 8 Scalability of mtset as a function of the number of variants in the set component. Shown is the computational time to fit a single window using mtset (a) and mtset-pc (b) for windows with increasing numbers of variants (randomly drawn from chromosome 20, 1000 Genomes dataset). Runtimes are reported for windows of varying size (1 kb-200 kb) using simulated data generated with the default parameter settings (see also Supplementary Table 2).
Supplementary Figure 9 Statistical calibration of all considered methods applied to four blood lipid levels on the NFBC data set. (a-c) QQ-plots of set tests including the relatedness component (a), approximate set tests using PC-based correction (b) and single-variant LMMs (c). All methods show good calibration: both single-trait LMMs and set test methods are calibrated; for example, genomic control is λ(mtlmm-sv) = 0.979, λ(stlmm-sv[CRP]) = 0.995 and λ(stlmm-sv[LDL]) = 0.996 for the single-variant methods.
Supplementary Figure 10 Histogram of P values obtained from single- and multi-trait set tests applied to four blood lipid levels on the NFBC data set. Top row: multi-trait set tests (mtset, mtset-pc) applied to four lipid-related traits. Bottom two rows: single-trait set test (stset) applied to the individual traits (CRP, LDL, HDL, TRIGL). The spike in the histograms is a common feature of set tests and results from the constrained marginal likelihood optimization: the set component is bounded to explain variance greater than or equal to zero, resulting in a box-constrained optimization. The location of the spike is determined by the mixture coefficients of the parametric null distribution fit (see Online Methods and Supplementary Note).
Supplementary Figure 11 Manhattan plots for different methods applied to four blood lipid levels on the NFBC data set. (a,b) Shown are Manhattan plots of the minimal P values across traits, considering either a single-trait single-variant LMM (stlmm-sv) (a) or a single-trait set test (stset) (b). (c-e) Corresponding Manhattan plots for multi-trait approaches jointly fit to all four traits: mtlmm-sv (c), mtset (d) and mtset-pc (e). mtset-pc is the most powerful approach; it recovers all associations found by the union of QTLs retrieved by the previous approaches (stlmm-sv, mtlmm-sv and stset) and yields two additional QTLs: one association on chromosome 1 (shared with mtset) and a second QTL on chromosome 16.
Supplementary Figure 12 Manhattan plots for quantitative traits related to basal hematology in the rat data set. (a, c, e, g, i, k) Manhattan plots for basophils (basos), eosinophils (eos), large unstained cells (lucs), lymphocytes (lymphs), monocytes (monos) and neutrophils (neuts), respectively, when using a single-trait single-variant LMM (stlmm-sv). (b, d, f, h, j, l) Analogous Manhattan plots for the same traits obtained using a single-trait set test (stset). (m, n) Manhattan plots from the multi-trait single-variant LMM (mtlmm-sv) and the multi-trait set test (mtset), respectively. Note that the horizontal lines in the Manhattan plots for stset and mtset are a common feature of set tests and result from the constrained marginal likelihood optimization: the set component is bounded to explain variance greater than or equal to zero, resulting in a box-constrained optimization (see also Supplementary Fig. 10).
Supplementary Figure 13 Manhattan plots for set tests when considering different strategies for confounder correction, applied to six phenotypes related to basal hematology in the rat data set. (a) Manhattan plot obtained when applying mtset without any adjustment for relatedness or population structure (mtset-nobg). (b) Equivalent Manhattan plot when using the top 30 principal components to correct for population structure (mtset-pc). (c) Results obtained from the full mtset model, where relatedness is accounted for using a second random effect term. Because of the closely related individuals in the study population, only the full mtset model is able to comprehensively correct for relatedness (c); see also main Fig. 2.
Supplementary Figure 14 Distribution of the number of variants within testing regions as well as the squared SNP-SNP correlation coefficient within regions, when considering regions of increasing window sizes. Left column: dependency between region sizes and the number of contained variants. Right column: dependency between region sizes and the squared correlation coefficient for SNPs within regions. From top to bottom: rat dataset, NFBC data, 1000 Genomes data (chromosome 20). The computational cost of mtset depends on the number of (unique) SNPs in the testing regions. In the experiments, we considered 100 kb windows for the NFBC data, 1 Mb windows for the rat study and 30 kb windows for the 1000 Genomes data. Alternative results for different region sizes are shown in Supplementary Fig. 7 (simulated data based on 1000 Genomes individuals) and Supplementary Table 4 (NFBC data).
Supplementary Figure 15 Comparison of test P values obtained from mtset-pc and mtset-lowrankbg. Compared are likelihood-ratio test statistics for the mtset-pc model and a model that considers a low-rank approximation to the background covariance (using the same number of principal components; mtset-lowrankbg, Online Methods). For large cohorts, we observe good concordance between both models. This confirms that accounting for PCs as (REML) fixed effects or alternatively including them as random effect covariates yields concordant results.
Supplementary Table 1 Type-1 error estimates on simulated data. [Columns: Method | Dataset | Significance level | True windows | Test windows | Train windows; rows cover mtset and mtset-pc on the 1000G, simpopstructure, simunrelated and simrelated datasets at significance levels 5.00e-02 to 5.00e-05.] Shown are the type-1 error estimates for increasingly stringent significance levels α ∈ {5e-02, 5e-03, 5e-04, 5e-05} on four alternative simulated datasets (see also Supplementary Fig. 3). Train windows denote regions that have been used (based on permutations) to fit the parametric model of the null distribution (Online Methods). True windows denote genomic regions that have not been used to train the null model (independent test validation). Finally, test windows denote regions where the genotype-phenotype relationship has been shuffled; these are equivalent to train windows, but use a different set of permutations (Online Methods). mtset and mtset-pc perform equally well when no structure or only population structure is present, while the calibration of mtset-pc deteriorates when the individuals are related (see Supplementary Methods and Supplementary Fig. 3 for the simulation strategy).
Supplementary Table 2 Parameter ranges for simulated datasets. [Rows: h²_r, S_r, π_r, α, h²_g, λ and window size (in kb); default values highlighted in bold in the original table.] To assess the power of different methods, we considered a range of alternative simulations, varying key parameters that determine the genetic architecture of the traits. We altered the variance explained by the region (h²_r), the number of causal variants from the region (S_r), the percentage of shared causal variants (π_r), the percentage of background and residual signal that is shared across traits (α), the variance explained by the genetic background (h²_g), the percentage of residual variance explained by hidden confounders (λ) and the window size. Each of these parameters was varied while keeping the others at their default values (highlighted in bold). For details of the simulation procedure, see Online Methods.

Supplementary Table 3 Tabular summary of QTLs identified by different set tests and single-variant LMMs on the NFBC dataset. The results table is provided as a separate supplementary information file: nfbc sm 1e5.xlsx

Supplementary Table 4 Tabular summary of QTLs identified by mtset with varying window size on the NFBC dataset. The results table is provided as a separate supplementary information file: nfbc sm windowsize.xlsx

Supplementary Table 5 Tabular summary of QTLs identified by different set tests and single-locus LMMs on the rat dataset. The results table is provided as a separate supplementary information file: rat sm 1e6.xlsx
Supplementary Table 6 Estimates of trait heritability and covariances for four lipid-related traits (CRP, LDL, HDL, TRIGL) from the NFBC dataset. [Blocks: heritability estimates (single-trait and multi-trait), genetic covariance matrix, noise covariance matrix and phenotypic covariance matrix; entries are estimate ± standard error.] Heritability estimates: single-trait heritability estimates are obtained independently for each trait; multi-trait estimates correspond to the marginal estimates obtained from the genetic and noise trait covariance matrices of the null model fit of mtset. As expected, these marginal estimates are consistent. Genetic covariance matrix: trait-trait covariances of the relatedness component from the null model fit of mtset. Noise covariance matrix: trait-trait covariances of the noise component of the null model fit of mtset. Phenotypic covariance matrix: empirical covariance matrix of the raw phenotypes. All estimates are obtained from a maximum likelihood fit of mtset; standard errors are denoted by ±.
Supplementary Table 7 Estimates of trait heritability and covariances for six phenotypes related to basal haematology (basos, eos, lucs, lymphs, monos, neuts) on the rat dataset. [Blocks: heritability estimates (single-trait and multi-trait), genetic covariance matrix, noise covariance matrix and phenotypic covariance matrix; entries are estimate ± standard error.] Heritability estimates: single-trait heritability estimates are obtained independently for each trait; multi-trait estimates correspond to the marginal estimates obtained from the genetic and noise trait covariance matrices of the null model fit of mtset. As expected, these marginal estimates are consistent. Genetic covariance matrix: trait-trait covariances of the relatedness component from the null model fit of mtset. Noise covariance matrix: trait-trait covariances of the noise component of the null model fit of mtset. Phenotypic covariance matrix: empirical covariance matrix of the raw phenotypes. All estimates are obtained from a maximum likelihood fit of mtset; standard errors are denoted by ±.
Supplementary Notes: Efficient multivariate set tests for the genetic analysis of correlated traits

Francesco Paolo Casale, Barbara Rakitsch, Christoph Lippert, Oliver Stegle

1. Multi-trait set tests

We here provide additional implementation details of mtset, covering efficient inference approaches and approximation schemes to scale mtset to very large cohorts. Section 1.1 introduces the multi-trait linear mixed model (LMM) that underlies mtset. Section 1.2 describes a permutation scheme to estimate p-values within mtset. In Section 1.3, we discuss inference challenges in LMMs, the approach taken in mtset and the relationship to prior work. In Section 1.4, we lay out the mathematical details of efficient likelihood and gradient evaluations for parameter inference in mtset. Finally, in Section 1.5 we discuss alternative approximations to scale mtset to extremely large cohorts, mtset-pc and mtset-lowrankbg.

1.1 Model

The matrix-variate phenotype Y is modelled as the sum of the contribution from the variants in the genetic region (set component), a random genetic background effect (relatedness component) and residual observation noise:

    Y = FB + U_r + U_g + Ψ,   (1)

where FB denotes the fixed effects, U_r the set component, U_g the relatedness component and Ψ the noise. Here, Y denotes the N×P phenotype matrix for N individuals and P traits. F is the N×N_FE sample-design matrix of the fixed effects and B is the corresponding N_FE×P weight matrix. The matrix U_r denotes effects from the set component, U_g explains variation from the relatedness component and Ψ denotes residual noise. We model each of these three terms as random effects with the following matrix-variate normal priors:

    U_r ~ MVN(0, C_r, R_r),  U_g ~ MVN(0, C_g, R_g),  Ψ ~ MVN(0, C_n, I_{N×N}).   (2)

The covariance matrix C_r ∈ R^{P×P} explains the trait-to-trait covariance between phenotypes that is induced by the set term.
Conversely, the individual-to-individual covariance R_r ∈ R^{N×N} denotes the genetic relatedness matrix between individuals that captures the local genetic structure of the variants in the set.

† These authors contributed equally.
Analogously, trait covariances induced by the relatedness component are modelled by the P×P trait-to-trait covariance matrix C_g, and R_g denotes the corresponding individual-to-individual relatedness matrix that captures the global genetic relatedness between individuals (e.g. kinship). Finally, the random effect Ψ explains i.i.d. observation noise, where C_n models residual correlations between the traits. The marginal likelihood of the model in (1)-(2) is given by

    p(Y | F, B, C_r, R_r, C_g, R_g, C_n) = N( vec(Y) | vec(FB), C_r ⊗ R_r + C_g ⊗ R_g + C_n ⊗ I_{N×N} ),   (3)

where the three Kronecker terms correspond to the set component, the relatedness component and the noise, respectively; ⊗ denotes the Kronecker product (see Appendix A.1) and we have used the equivalence of a matrix-variate normal distribution and a multivariate normal distribution (Appendix A.2). The operator vec(·) denotes a stacking operation, which transforms an input matrix into a vector by concatenating its columns. For the sake of clarity, we omit the fixed effects from now on in this derivation; the software implementation of mtset provides support for fixed-effect covariates. The LMM in Eqn. (3) is closely related to existing multi-trait association models used in genetics, in particular the MTMM model [1] as well as the multi-trait version of GEMMA [2] and implementations in LIMIX [3]. Importantly, however, mtset requires two variance component terms, whereas GEMMA and MTMM build on a single variance component to account for relatedness (in addition to observation noise), and the genetic variants are tested one by one as fixed-effect covariates. A detailed discussion of how mtset relates to prior work is provided in Section 1.3. Both the set component (R_r) and the relatedness component (R_g) can be estimated from the genotype data alone (see below). In contrast, the elements of the three trait-to-trait covariance matrices need to be estimated from the full model, for which we employ maximum likelihood estimation.
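The marginal covariance in Eqn. (3) can be assembled directly with Kronecker products. The following sketch (toy dimensions and randomly generated covariance factors, not the mtset implementation) evaluates the resulting multivariate normal log-likelihood:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
N, P = 50, 4  # toy numbers of individuals and traits

def random_psd(d):
    """Random positive-definite matrix standing in for a fitted covariance."""
    A = rng.standard_normal((d, d))
    return A @ A.T / d + 1e-3 * np.eye(d)

Cr, Cg, Cn = random_psd(P), random_psd(P), random_psd(P)  # trait-to-trait terms
Rr, Rg = random_psd(N), random_psd(N)                     # sample-to-sample terms

# Marginal covariance of vec(Y): Cr (x) Rr + Cg (x) Rg + Cn (x) I_N
Sigma = np.kron(Cr, Rr) + np.kron(Cg, Rg) + np.kron(Cn, np.eye(N))

Y = rng.standard_normal((N, P))
vecY = Y.reshape(-1, order="F")  # vec(.) stacks the columns of Y

ll = multivariate_normal(mean=np.zeros(N * P), cov=Sigma).logpdf(vecY)
```

Forming Sigma explicitly costs O(N²P²) memory and O(N³P³) time per likelihood evaluation, which is exactly the naive complexity that the factorizations described below avoid.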
In order to retain efficiency, we exploit linear algebra identities and convenient factorizations, thereby minimizing the computational complexity and memory requirements of likelihood and gradient evaluations.

Set and relatedness covariance matrices. In principle, any valid covariance function [4] can be used to define the covariance matrices in mtset. However, the algorithmic tricks for computational efficiency in mtset rely on (i) the assumption that both R_r and R_g are constant (i.e., their eigenvalue decompositions and some other operations can be cached) and (ii) that the set covariance R_r is low-rank. Here, we consider the realized relationship matrix, which is compatible with these assumptions [5]. We define R_g = SS^T and R_r = GG^T, where S ∈ R^{N×S} and G ∈ R^{N×R} denote matrices of all genome-wide variants (relatedness) and of the variants in the set to be tested, respectively. The scalar S denotes the total number of genome-wide variants and R corresponds to the number of variants in the set. Weights for individual variants, for example to prioritize rare variants, could be incorporated straightforwardly; this approach has previously been used to increase power for rare-variant association analysis (see e.g. [6]). If we use C to denote the rank of the trait-to-trait covariance matrix C_r, the overall rank of the region covariance term follows as C·R with R ≤ N and C ≤ P, which does not directly depend on the number of samples and traits. As discussed in the following paragraph, C can be interpreted as the number of independent effects from the region across traits.

Rank of the set trait-to-trait covariance. For the sake of computational efficiency, we consider a low-rank set covariance, setting C = 1. This setting results in a linear scaling of the number of parameters in P (instead of a quadratic scaling).¹

¹ We perform gradient-based optimization of the marginal likelihood (see Section 1.4).
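As a minimal sketch of the realized relationship matrices above (toy genotypes; per-variant standardization is one common convention, not necessarily the exact normalization used by mtset):

```python
import numpy as np

rng = np.random.default_rng(1)
N, S_total, R_set = 100, 1000, 10  # toy numbers of individuals and variants

# Toy genotypes in {0, 1, 2}; column-standardized before forming the covariance.
X = rng.integers(0, 3, size=(N, S_total)).astype(float)
X -= X.mean(axis=0)
sd = X.std(axis=0)
X[:, sd > 0] /= sd[sd > 0]

Rg = X @ X.T / S_total      # relatedness component, R_g = S S^T (scaled)
G = X[:, :R_set]            # hypothetical set of R variants to be tested
Rr = G @ G.T                # set component, R_r = G G^T (rank <= R_set)
```

Because Rr has rank at most R_set, operations involving the set component can exploit low-rank updates rather than full N×N factorizations.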
In order to understand the effect of a low-rank trait-to-trait covariance matrix on the genetic effects the model can capture, it is instructive to derive the mtset LMM from a generative linear-model perspective:

    Y = FB + GV + U_g + Ψ,   (4)

where FB denotes the fixed effects, GV the set component, U_g the relatedness component and Ψ the noise; V ∈ R^{R×P} holds the effect sizes of the R variants in the region on the P traits (V_{r,p} is the effect size of variant r on trait p), U_g ~ MVN(0, C_g, R_g) and Ψ ~ MVN(0, C_n, I_{N×N}). Assuming C = 1 is equivalent to assuming that we can write the effect sizes for trait p as V_{:,p} = e_p v. Note that while there is a unique genetic signal v ∈ R^{R×1}, which is shared across all traits, this model allows for trait-specific rescaling of this signal through the factors e_p. Introducing the scaling vector e = [e_1, ..., e_P]^T ∈ R^{P×1}, we can rewrite the model as

    Y = FB + G v e^T + U_g + Ψ.   (5)

Finally, considering a normal prior over the new weights, v ~ N(0, I), and marginalising them out, we obtain the marginal likelihood in (3) with U_r ~ MVN(0, C_r, R_r), C_r = ee^T and R_r = GG^T. Notice that C_r is a rank-1 matrix. More complex genetic signals (higher ranks of C_r) could be considered by relaxing this assumption, at the cost of increased model complexity and additional computational cost (see below and Table 1).

1.2 Estimation of p-values and significance testing

Building on previous methods for single-trait set tests [7, 8], we consider likelihood-ratio tests to assess the significance of a particular set. When testing for variance components, the distribution of the test statistic under the null is not known when the phenotype vector vec(Y) cannot be divided into a large number of i.i.d. subvectors [9], as is the case here.
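The rank-one construction above can be checked numerically: with column-stacked vec(·), vec(G v e^T) = (e ⊗ G) v, so marginalising v ~ N(0, I) yields the covariance (e e^T) ⊗ (G G^T). A small sketch (toy dimensions):

```python
import numpy as np

rng = np.random.default_rng(2)
N, R, P = 8, 3, 4  # toy dimensions

G = rng.standard_normal((N, R))
e = rng.standard_normal(P)

# vec(G v e^T) = (e (x) G) v, so with v ~ N(0, I_R) the covariance of the
# set term is (e (x) G)(e (x) G)^T = (e e^T) (x) (G G^T).
A = np.kron(e.reshape(P, 1), G)              # (N*P) x R map acting on v
cov_direct = A @ A.T
cov_kron = np.kron(np.outer(e, e), G @ G.T)  # C_r (x) R_r with C_r = e e^T
```

The identity follows from the mixed-product property of the Kronecker product, and the factor np.outer(e, e) makes the rank-1 structure of C_r explicit.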
To estimate p-values, we employ a permutation scheme in which we assume that the distribution of the test statistic under the null is constant across regions and has the parametric form

    p(x | π, a, d) = π χ²_0(x) + (1 − π) a χ²_d(x).   (6)

We first obtain test statistics from the null distribution by using genome-wide permutations, pooling the test statistics over all windows. Subsequently, we use the largest 10% of the test statistics to fit the parameters such that the error between the parametric and theoretical p-values is minimized. In the experiments, we found that a relatively small number of genome-wide permutations (<100) was sufficient to accurately estimate the null distribution. A closely related scheme has previously been proposed for single-trait set tests [7] and compared to other testing procedures, in particular score tests [8], suggesting that likelihood-ratio tests tend to be well powered.

1.3 Overview of inference methods in LMMs

Before providing full details of the efficient inference scheme in mtset, we give an overview of existing methods for inference in LMMs and compare these methods to the approach taken here. The majority of LMM-based approaches for genetic association testing build on closely related formulations of the null model that underlies mtset. Common to these methods is the assumption that the observed phenotype data are modelled by the sum of a variance component explaining variation due to relatedness and residual noise. In standard applications of LMMs for GWAS, genetic variants are then tested one by one as additional fixed effects in the model (as are other covariates). In contrast, set tests such as mtset aggregate across multiple proximal genetic variants using a second variance component
in the model. In the context of single-trait set tests, this has previously been described in [10, 11, 7, 8]; however, none of these inference schemes allows for multi-trait modelling. Parameter inference in either type of LMM (one or two variance components) is typically done using (restricted) maximum likelihood. Because of the large number of alternative models that need to be fitted, the computational tractability of the underlying operations, i.e. evaluation of the likelihood and gradients to determine model parameters, is essential. In general, naive inference in an LMM requires the inversion of the covariance matrix in the model, which for a multi-trait model with N individuals and P traits scales cubically in both dimensions, i.e. O(N³P³).

LMMs with fixed-effect testing. Efficient inference for single-trait LMMs, as implemented in FaST-LMM [12] and GEMMA [13], exploits pre-computing the (constant across SNPs) eigen decomposition of the sample covariance matrix. This step reduces the computational cost from O(N³) per variant to a single O(N³) operation up-front² and a per-test complexity of O(N²). For every test, exact parameter inference can be achieved by means of simple closed-form operations and a one-dimensional Brent search optimization. The computational complexity of these approaches can be further reduced to O(N²) for the up-front computation and a per-test complexity of O(N), conditioned on the relatedness covariance matrix having low-rank structure. In practice, this can be achieved through a feature selection approach, selecting a small proportion of all genome-wide variants to estimate R_g [14]. The extension of these efficient linear algebra methods to joint analysis of multiple traits (mtlmm-sv) has recently been proposed as an extension to GEMMA [2] (termed mvlmm).
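The eigen decomposition trick can be illustrated for a single-trait LMM with covariance σ_g²(K + δI): after rotating the phenotype by the eigenvectors of K, each likelihood evaluation costs O(N). This is a simplified sketch (toy data, a grid search in place of Brent optimization, no fixed effects):

```python
import numpy as np

rng = np.random.default_rng(3)
N = 200

# Toy kinship matrix K and phenotype y.
X = rng.standard_normal((N, 500))
K = X @ X.T / 500
y = rng.standard_normal(N)

# O(N^3) once, shared across all tests: eigen decomposition of K.
S, U = np.linalg.eigh(K)
Uy = U.T @ y  # rotated phenotype

def neg_loglik(delta):
    # In the rotated basis, cov = sigma_g^2 * diag(S + delta), so evaluating
    # the likelihood (with sigma_g^2 profiled out in closed form) costs O(N).
    D = S + delta
    sigma_g2 = np.mean(Uy**2 / D)  # ML estimate of sigma_g^2 given delta
    return 0.5 * (N * np.log(2 * np.pi * sigma_g2) + np.sum(np.log(D)) + N)

# One-dimensional optimization over the variance ratio delta.
deltas = np.logspace(-3, 3, 50)
best_delta = deltas[np.argmin([neg_loglik(d) for d in deltas])]
```

Because the O(N³) decomposition is shared across all tests, only the cheap rotated-space evaluations are repeated per variant.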
Combining Kronecker product algebra with the eigen decomposition trick, the naive cost of O(N³P³) can be reduced to a single O(N³) operation up-front and O(N² + NP^x) per variant, where x depends on the optimization algorithm. As the number of variance components increases quadratically with the number of traits, derivative-free methods, as used in efficient single-trait LMMs, are rendered inefficient; hence mvlmm considers a gradient-based optimization scheme (combined with an EM-like algorithm) to estimate model parameters. In particular, mvlmm combines the Newton-Raphson and expectation-maximization algorithms.

LMMs for set tests. Single-trait set tests based on an LMM with a single variance component were first proposed in [15, 16, 10] and subsequently extended to include a relatedness component [11, 17, 6]. Common to these models is that p-values are estimated using a score test. Alternatively, it has also been proposed to use a likelihood-ratio test to assess statistical significance for the same class of LMMs [7]. A recent comparison between score tests and likelihood-ratio tests [8] shows that likelihood-ratio tests tend to have more power in real settings. However, score tests are computationally cheaper, as the model parameters need to be fit only once, on the null model, whereas likelihood-ratio tests require full parameter inference of the alternative model for each test. FaST-LMM-Set [7] assumes a low-rank relatedness covariance and a low-rank set covariance, which allows aggregating both components into a single (low-rank) variance component, enabling efficient inference. Parameter optimization is again carried out using a one-dimensional Brent search. An extension to full-rank background covariance matrices has been presented in [8]; this model is a special case of mtset (single trait) and will be referred to as stset.
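Likelihood-ratio set tests of this kind rely on the permutation-based parametric null fit of Section 1.2 (Eqn. (6)). A toy sketch of that fit, with synthetic "permutation" statistics and an illustrative least-squares objective on log p-values (the exact objective used by mtset may differ):

```python
import numpy as np
from scipy.stats import chi2
from scipy.optimize import minimize

rng = np.random.default_rng(4)
n = 5000

# Synthetic null LRT statistics: a point mass at zero mixed with a scaled
# chi-square, i.e. the parametric form assumed in Eqn. (6).
true_pi, true_a, true_d = 0.3, 0.8, 1.5
is_zero = rng.uniform(size=n) < true_pi
stats = np.where(is_zero, 0.0,
                 true_a * chi2.rvs(true_d, size=n, random_state=rng))

def pval(x, pi, a, d):
    # Survival function of the mixture; the chi2_0 mass contributes 0 for x > 0.
    return np.where(x <= 0, 1.0, (1 - pi) * chi2.sf(x / a, d))

# Fit (pi, a, d) on the top 10% of null statistics by matching parametric
# p-values to empirical tail probabilities.
top = np.sort(stats)[::-1][: n // 10]
emp_p = np.arange(1, len(top) + 1) / (n + 1.0)

def loss(theta):
    pi, a, d = theta
    return np.sum((np.log(pval(top, pi, a, d)) - np.log(emp_p)) ** 2)

fit = minimize(loss, x0=[0.5, 1.0, 1.0], method="L-BFGS-B",
               bounds=[(1e-3, 1 - 1e-3), (1e-3, 10.0), (1e-3, 10.0)])
pi_hat, a_hat, d_hat = fit.x
```

Once fitted, p-values for observed statistics follow from the same mixture survival function, so only a modest number of genome-wide permutations is needed.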
In the same way that mvLMM extends a standard single-variant LMM to multi-trait analysis, mtSet is the multivariate generalization of stSet. As discussed in the next section, the algorithm underlying mtSet combines eigenvalue decompositions and low-rank updates with Kronecker product algebra to break down the $O(N^3 P^3)$ computational cost to an $O(N^3)$ operation up-front and $O(N^2 + NR^2P^2 + NRP^4)$ per set, where $R$ denotes the number of variants in the set component to be tested. In the same vein, we also consider two alternative approximations of the full mtSet model: mtSet-PC, where the relatedness component is omitted and population structure is modelled as a fixed effect, and mtSet-LowRankBg, where, analogously to FaST-LMM-Set, a low-rank relatedness covariance is assumed. Both proposed approximations scale
linearly in the number of individuals, permitting analysis of extremely large cohorts (up to 500,000 individuals; see also Figure 1 and Supplementary Figure 3). Similarly to mvLMM, parameter inference in mtSet (as well as in mtSet-PC and mtSet-LowRankBg) is done using gradient-based parameter optimization (L-BFGS [18, 19]). In our experience, the success of the optimization method is greatly affected by the employed stopping criterion. For example, when the likelihood surface is flat (N < 5,000, large windows), the default parameter settings of the SciPy [20] library in Python are not sufficiently stringent, resulting in premature stopping. We circumvent this by explicitly choosing stringent stopping criteria, setting factr to $10^3$ (default value: $10^7$).

We note that GCTA [21, 22], a popular approach to fit variance component models, provides support for arbitrary numbers of variance components for single traits and limited support for multi-trait analyses. Specifically, the model allows for joint analysis across pairs of traits, which can be regarded as a special case of GEMMA, however employing gradient-based parameter inference (using the PX-AI algorithm). Table 1 provides a tabular listing of the per-test computational complexity for alternative LMM methods and implementations. Note that the listed complexities do not take into account the up-front $O(N^3)$ operation for the eigendecomposition of the relatedness covariance matrix that is common to all methods (or $O(N^2)$, respectively, if a low-rank relatedness covariance is used).

1.3. Efficient inference for the full mtSet model

Without loss of generality, we consider $G \in \mathbb{R}^{N \times R}$ having column rank $R$ in the following³. To simplify the derivation of efficient inference (see Section 1.3), we also rewrite the trait-to-trait covariance matrix as $C_r = EE^T$, where $E$ is a $P \times C$ matrix.
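A minimal illustration of the stopping-criterion setting described earlier in this section: SciPy's L-BFGS-B routine accepts `factr`, which scales the relative-decrease tolerance (roughly `factr` times machine epsilon), so smaller values are stricter. The objective below is a toy quadratic, not mtSet's likelihood:

```python
import numpy as np
from scipy.optimize import fmin_l_bfgs_b

def neg_loglik(theta):
    # toy convex surrogate with analytic gradient (stand-in for the
    # negative log-likelihood and its gradient)
    resid = theta - 1.0
    return 0.5 * np.sum(resid**2), resid

theta0 = np.zeros(3)
# default factr is 1e7; a stringent value such as 1e3 avoids premature
# stopping on flat likelihood surfaces
theta_opt, f_opt, info = fmin_l_bfgs_b(neg_loglik, theta0, factr=1e3)
```

With the stricter tolerance the optimizer runs until the minimizer is located to high precision rather than terminating on a small relative decrease.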
Inverting the covariance matrix

The full model covariance matrix has the form

$$K = C_r \otimes R_r + C_g \otimes R_g + C_n \otimes I_N \qquad (7)$$
$$\;\;= A + XX^T \qquad (8)$$

Here, we have defined $A = C_g \otimes R_g + C_n \otimes I_N$, which bundles the effects of the relatedness and noise covariance terms. The set term is represented as $XX^T$, where $X = E \otimes G$. Using the same linear algebra tricks as in previous work [23], and using the notation $M = U_M S_M U_M^T$ for the eigenvalue decomposition of a matrix $M$, we can write

$$A^{-1} = \left[\left(U_{C_n} S_{C_n}^{1/2} \otimes I_N\right)\left(\hat{C}_g \otimes R_g + I_{NP}\right)\left(U_{C_n} S_{C_n}^{1/2} \otimes I_N\right)^T\right]^{-1} \qquad (9)$$
$$\;\;= \left[\left(U_{C_n} S_{C_n}^{1/2} \otimes I_N\right)\left(U_{\hat{C}_g} \otimes U_{R_g}\right)\left(S_{\hat{C}_g} \otimes S_{R_g} + I_{NP}\right)\left(U_{\hat{C}_g} \otimes U_{R_g}\right)^T\left(U_{C_n} S_{C_n}^{1/2} \otimes I_N\right)^T\right]^{-1} \qquad (10)$$
$$\;\;= \left(U_{\hat{C}_g}^T S_{C_n}^{-1/2} U_{C_n}^T \otimes U_{R_g}^T\right)^T \left(S_{\hat{C}_g} \otimes S_{R_g} + I_{NP}\right)^{-1} \left(U_{\hat{C}_g}^T S_{C_n}^{-1/2} U_{C_n}^T \otimes U_{R_g}^T\right) \qquad (11)$$

where we have introduced $\hat{C}_g = S_{C_n}^{-1/2} U_{C_n}^T C_g U_{C_n} S_{C_n}^{-1/2}$. All elements in (11) can be calculated in $O(N^3 + P^3)$, where the $O(N^3)$ operation needs to be done only once in the whole analysis [23].

³ If $G$ has more columns than its column rank $R$, we can always find $\tilde{G}$ with $R$ columns such that $R_r = \tilde{G}\tilde{G}^T$ by a singular value decomposition $G = \underbrace{US^{1/2}}_{\tilde{G}} V^T$ (with runtime of $O(NR^2)$).
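The factorisation of $A^{-1}$ in (11) can be verified numerically on a small dense example. This is an illustrative sketch in which names follow the derivation, not the released mtSet code:

```python
import numpy as np

# Numerical check of A^{-1} = L^T D L for A = C_g kron R_g + C_n kron I_N.
rng = np.random.default_rng(1)
N, P = 6, 3

def random_psd(n):
    # well-conditioned symmetric positive definite matrix
    M = rng.standard_normal((n, n))
    return M @ M.T + n * np.eye(n)

C_g, C_n, R_g = random_psd(P), random_psd(P), random_psd(N)
A = np.kron(C_g, R_g) + np.kron(C_n, np.eye(N))

S_n, U_n = np.linalg.eigh(C_n)
T = U_n / np.sqrt(S_n)                    # U_n S_n^{-1/2} (columns scaled)
C_hat = T.T @ C_g @ T                     # \hat{C}_g as defined above
S_c, U_c = np.linalg.eigh(C_hat)
S_r, U_r = np.linalg.eigh(R_g)

L_c = U_c.T @ T.T                         # U_{C_hat}^T S_n^{-1/2} U_n^T
L_r = U_r.T
d = 1.0 / (np.kron(S_c, S_r) + 1.0)       # diagonal of D
L = np.kron(L_c, L_r)
A_inv = L.T @ (d[:, None] * L)            # L^T D L, without forming D
```

Here the expensive objects are only the small eigendecompositions; the $NP \times NP$ Kronecker products are formed solely to make the check explicit.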
For simplicity of notation we introduce

$$L_c = U_{\hat{C}_g}^T S_{C_n}^{-1/2} U_{C_n}^T \qquad (12)$$
$$L_r = U_{R_g}^T \qquad (13)$$
$$L = L_c \otimes L_r \qquad (14)$$
$$D = \left(S_{\hat{C}_g} \otimes S_{R_g} + I\right)^{-1} \qquad (15)$$

In the new notation, (11) becomes

$$A^{-1} = L^T D L \qquad (16)$$

which explicitly shows that $A^{-1}$ is a Kroneckered transformation of a diagonal matrix. We can use the Woodbury matrix identity to efficiently invert $K$, exploiting the low-rank nature of $XX^T$:

$$K^{-1} = \left(A + XX^T\right)^{-1} \qquad (17)$$
$$\;\;= A^{-1} - A^{-1} X \left(I + X^T A^{-1} X\right)^{-1} X^T A^{-1} \qquad (18)$$
$$\;\;= L^T D L - L^T D L X \left(I + X^T A^{-1} X\right)^{-1} X^T L^T D L \qquad (19)$$
$$\;\;= L^T \left[D - D L X \left(I + X^T A^{-1} X\right)^{-1} X^T L^T D\right] L \qquad (20)$$
$$\;\;= L^T \left(D - D W \Lambda^{-1} W^T D\right) L \qquad (21)$$

where we have introduced

$$W_c = L_c E \in \mathbb{R}^{P \times C} \qquad (22)$$
$$W_r = L_r G \in \mathbb{R}^{N \times R} \qquad (23)$$
$$W = W_c \otimes W_r \qquad (24)$$
$$\Lambda = I + X^T A^{-1} X \in \mathbb{R}^{RC \times RC} \qquad (25)$$

Computing the column matrix $W_c$ takes $O(P^2 C)$ time, while computing the row matrix $W_r$ requires $O(N^2 R)$ time. Note that the row matrix does not change while optimizing the parameters of the column covariance matrices and can be computed prior to the analysis. The matrix $\Lambda$ can also be computed efficiently by rewriting it as

$$\Lambda = I + X^T A^{-1} X \qquad (26)$$
$$\;\;= I + W^T D W \qquad (27)$$
$$\;\;= I + \left(W_c \otimes W_r\right)^T D W \qquad (28)$$
$$\;\;= I + \left(W_c^T \otimes W_r^T\right)\left[D W_{:,1} \;\dots\; D W_{:,RC}\right] \qquad (29)$$
$$\;\;= I + \left[\mathrm{vec}\left(W_r^T\, \mathrm{vec}^{-1}\left(D W_{:,1}\right) W_c\right) \;\dots\; \mathrm{vec}\left(W_r^T\, \mathrm{vec}^{-1}\left(D W_{:,RC}\right) W_c\right)\right]. \qquad (30)$$

Indeed, computing $W$ explicitly and multiplying it with $D$ takes $O(CNPR)$ time and space, multiplying the result with $W^T$ from the left takes $O(CR(NPC + RNC))$, while the inversion takes $O(C^3R^3)$ time and $O(C^2R^2)$ memory. In practice, we use the Cholesky factorization to compute the inverse of $\Lambda$, which has the advantage that we can re-use the decomposition for computing the log determinant later on.
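Two of the identities above admit quick generic checks: the Woodbury inverse in (17)-(21) and the vec-Kronecker trick behind (30). The sketch below uses a generic dense SPD matrix as a stand-in for $A$ (in mtSet, $A^{-1}$ is of course applied in the cheap Kroneckered form):

```python
import numpy as np

# Woodbury: (A + X X^T)^{-1} = A^{-1} - A^{-1} X Lam^{-1} X^T A^{-1}
rng = np.random.default_rng(2)
n, r = 30, 4
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)               # SPD stand-in for the mtSet A
X = rng.standard_normal((n, r))           # low-rank factor (X = E kron G in mtSet)

A_inv = np.linalg.inv(A)
Lam = np.eye(r) + X.T @ A_inv @ X         # capacitance matrix Lambda, cf. (25)
K_inv = A_inv - A_inv @ X @ np.linalg.solve(Lam, X.T @ A_inv)

# vec trick: (Wc kron Wr) vec(V) = vec(Wr V Wc^T), column-major vec as in (30)
Wc, Wr = rng.standard_normal((3, 3)), rng.standard_normal((5, 5))
V5 = rng.standard_normal((5, 3))
vec = lambda Z: Z.reshape(-1, order="F")  # stack columns
lhs = np.kron(Wc, Wr) @ vec(V5)
rhs = vec(Wr @ V5 @ Wc.T)
```

The second identity is what lets the Kroneckered matrices be applied column by column without ever being formed.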
Evaluating the model log likelihood

The log likelihood of our model (3) is given by

$$\mathcal{L} = -\frac{NP}{2} \log 2\pi - \frac{1}{2} \log\det K - \frac{1}{2}\, \mathrm{vec}(Y)^T K^{-1}\, \mathrm{vec}(Y) \qquad (31)$$

The log-determinant can be computed by using the matrix determinant lemma

$$\log\det K = \log\det A + \log\det \Lambda. \qquad (32)$$

Provided that we have already computed $L_c$, $L_r$, $D$ and the Cholesky decomposition of $\Lambda$, evaluating the log determinants of $A$ and $\Lambda$ takes $O(NP)$ and $O(CR)$ time, respectively. The squared form can be evaluated as follows:

$$\mathrm{vec}(Y)^T K^{-1}\, \mathrm{vec}(Y) = \mathrm{vec}(Y)^T \left[L^T \left(D - D W \Lambda^{-1} W^T D\right) L\right] \mathrm{vec}(Y)$$
$$= \mathrm{vec}(Y)^T L^T D L\, \mathrm{vec}(Y) - \mathrm{vec}(Y)^T L^T D W \Lambda^{-1} W^T D L\, \mathrm{vec}(Y)$$
$$= \mathrm{vec}(\tilde{Y})^T D\, \mathrm{vec}(\tilde{Y}) - \mathrm{vec}(\tilde{Y})^T D W \Lambda^{-1} W^T D\, \mathrm{vec}(\tilde{Y})$$
$$= \mathrm{vec}(\tilde{Y})^T D\, \mathrm{vec}(\tilde{Y}) - \mathrm{vec}(\hat{Y})^T W \Lambda^{-1} W^T \mathrm{vec}(\hat{Y})$$
$$= \mathrm{vec}(\tilde{Y})^T D\, \mathrm{vec}(\tilde{Y}) - \mathrm{vec}(\bar{Y})^T \Lambda^{-1}\, \mathrm{vec}(\bar{Y}),$$

where we have defined

$$\mathrm{vec}(\tilde{Y}) = L\, \mathrm{vec}(Y) = \left(L_c \otimes L_r\right) \mathrm{vec}(Y) = \mathrm{vec}\left(L_r Y L_c^T\right) \qquad (33)$$
$$\mathrm{vec}(\hat{Y}) = D\, \mathrm{vec}(\tilde{Y}) = \mathrm{diag}(D) \circ \mathrm{vec}(\tilde{Y}) \qquad (34)$$
$$\mathrm{vec}(\bar{Y}) = W^T \mathrm{vec}(\hat{Y}) = \left(W_c \otimes W_r\right)^T \mathrm{vec}(\hat{Y}) = \mathrm{vec}\left(W_r^T \hat{Y} W_c\right). \qquad (35)$$

Rotating and scaling the data, Eqs. (33)-(35), takes $O(N^2P + NP^2 + NP + NPC + RNC)$ time, where again the $O(N^2P)$ operation is done only once prior to the analysis. Computing the squared form $\mathrm{vec}(\bar{Y})^T \Lambda^{-1}\, \mathrm{vec}(\bar{Y})$ takes $O(C^2R^2)$ time after $\Lambda$ has been inverted.

Evaluating the gradient

The derivative of the log likelihood with respect to the column covariance parameter $\theta_i \in \boldsymbol{\theta}$ is given by

$$\mathcal{L}_{\theta_i} = -\frac{1}{2} \mathrm{tr}\left(K^{-1} K_{\theta_i}\right) + \frac{1}{2}\, \mathrm{vec}(Y)^T K^{-1} K_{\theta_i} K^{-1}\, \mathrm{vec}(Y), \qquad (36)$$

where the first term arises from the log determinant and the second term from the squared form. We have used the notation $M_{\theta_i}$ to indicate the derivative of $M$ with respect to $\theta_i$. The first term can be
rewritten as

$$\mathrm{tr}\left(K^{-1} K_{\theta_i}\right) = \mathrm{tr}\left(\left[L^T \left(D - D W \Lambda^{-1} W^T D\right) L\right] K_{\theta_i}\right)$$
$$= \mathrm{tr}\left(\left(D - D W \Lambda^{-1} W^T D\right) L K_{\theta_i} L^T\right)$$
$$= \mathrm{tr}\left(\left(D - D W \Lambda^{-1} W^T D\right) \tilde{K}_{\theta_i}\right)$$
$$= \mathrm{tr}\left(D \tilde{K}_{\theta_i}\right) - \mathrm{tr}\left(D W \Lambda^{-1} W^T D \tilde{K}_{\theta_i}\right)$$
$$= \mathrm{diag}(D)^T \mathrm{diag}\left(\tilde{K}_{\theta_i}\right) - \sum_{jk} \left(\Lambda^{-1}\right)_{jk} \left(\bar{K}_{\theta_i}\right)_{jk},$$

where $\tilde{E} = L_c E$, $\tilde{E}_{\theta_i} = L_c E_{\theta_i}$ and

$$\tilde{K}_{\theta_i} = L K_{\theta_i} L^T \qquad (37)$$
$$\;\;= \begin{cases} L_c \left(EE^T\right)_{\theta_i} L_c^T \otimes W_r W_r^T & \text{if } \theta_i \text{ is an entry of } E \\ L_c \left(C_g\right)_{\theta_i} L_c^T \otimes S_{R_g} & \text{if } \theta_i \text{ is a parameter of } C_g \\ L_c \left(C_n\right)_{\theta_i} L_c^T \otimes I & \text{if } \theta_i \text{ is a parameter of } C_n \end{cases} \qquad (39)$$
$$\;\;= \begin{cases} \left(\tilde{E}_{\theta_i}\tilde{E}^T + \tilde{E}\tilde{E}_{\theta_i}^T\right) \otimes W_r W_r^T & \text{if } \theta_i \text{ is an entry of } E \\ \tilde{C}_{\theta_i} \otimes S_{R_g} & \text{if } \theta_i \text{ is a parameter of } C_g \\ \tilde{C}_{\theta_i} \otimes I & \text{if } \theta_i \text{ is a parameter of } C_n \end{cases} \qquad (40)$$

with $\tilde{C}_{\theta_i} = L_c C_{\theta_i} L_c^T$, and

$$\bar{K}_{\theta_i} = W^T D \tilde{K}_{\theta_i} D W. \qquad (42)$$

First, we compute the column covariance matrix of $\tilde{K}_{\theta_i}$, which can be done in $O(P^2C)$, $O(P^3)$ and $O(P^3)$ time, respectively, for parameters of the set term ($E$), the relatedness term ($C_g$) and the noise term ($C_n$). Calculating $\bar{K}_{\theta_i}$ requires more care: we first compute the dot product between $D$ and $W$, which requires us to explicitly calculate $W$, taking $O(NPRC)$ time and space. The resulting matrix consists of $NP$ rows and $RC$ columns. In the next step, we multiply each column $DW_{:,i}$ by $\tilde{K}_{\theta_i}$, exploiting the same tricks as in (30):

$$\tilde{K}_{\theta_i} D W_{:,i} = \begin{cases} \left(\left(\tilde{E}_{\theta_i}\tilde{E}^T + \tilde{E}\tilde{E}_{\theta_i}^T\right) \otimes W_r W_r^T\right) D W_{:,i} & \text{if } \theta_i \text{ is an entry of } E \\ \left(\tilde{C}_{\theta_i} \otimes S_{R_g}\right) D W_{:,i} & \text{if } \theta_i \text{ is a parameter of } C_g \\ \left(\tilde{C}_{\theta_i} \otimes I\right) D W_{:,i} & \text{if } \theta_i \text{ is a parameter of } C_n \end{cases} \qquad (43)$$
$$= \begin{cases} \mathrm{vec}\left(W_r W_r^T\, \mathrm{vec}^{-1}\left(D W_{:,i}\right) \left(\tilde{E}_{\theta_i}\tilde{E}^T + \tilde{E}\tilde{E}_{\theta_i}^T\right)\right) & \text{if } \theta_i \text{ is an entry of } E \\ \mathrm{vec}\left(S_{R_g}\, \mathrm{vec}^{-1}\left(D W_{:,i}\right) \tilde{C}_{\theta_i}\right) & \text{if } \theta_i \text{ is a parameter of } C_g \\ \mathrm{vec}\left(\mathrm{vec}^{-1}\left(D W_{:,i}\right) \tilde{C}_{\theta_i}\right) & \text{if } \theta_i \text{ is a parameter of } C_n. \end{cases} \qquad (44)$$

This leads to an overall runtime complexity of $O(RC(NPC + NCR))$, $O(RC(NP + NP^2))$ and $O(RCNP^2)$ for the set, relatedness and noise parameters, respectively.
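The trace rewriting above holds for any diagonal $D$, symmetric $\tilde{K}_{\theta_i}$ and symmetric $\Lambda$, and can be checked generically on a small dense example (illustrative sketch):

```python
import numpy as np

# Check: tr((D - D W Lam^{-1} W^T D) Kt)
#        = diag(D).diag(Kt) - sum_{jk} (Lam^{-1})_{jk} (W^T D Kt D W)_{jk}
rng = np.random.default_rng(5)
n, k = 12, 4
D = np.diag(rng.uniform(0.5, 2.0, n))      # diagonal, like the D above
W = rng.standard_normal((n, k))
M = rng.standard_normal((k, k))
Lam = M @ M.T + k * np.eye(k)              # symmetric positive definite
Mk = rng.standard_normal((n, n))
Kt = Mk + Mk.T                             # symmetric stand-in for K~_theta_i

Lam_inv = np.linalg.inv(Lam)
lhs = np.trace((D - D @ W @ Lam_inv @ W.T @ D) @ Kt)
rhs = np.diag(D) @ np.diag(Kt) - np.sum(Lam_inv * (W.T @ D @ Kt @ D @ W))
```

The right-hand side only touches diagonals and a small $k \times k$ elementwise product, which is the reason the trace term is cheap once $\bar{K}_{\theta_i}$ is available.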
We use the same trick to compute the multiplication between $W^T$ and $D\tilde{K}_{\theta_i}DW$ efficiently, leading to a complexity of $O(RC(NPC + NCR))$. Finally, computing the trace term has an additional runtime of $O(NP + C^2R^2)$.

The derivative of the squared form can be rewritten as

$$\mathrm{vec}(Y)^T K^{-1} K_{\theta_i} K^{-1}\, \mathrm{vec}(Y) = \left[\left(D - DW\Lambda^{-1}W^TD\right)\mathrm{vec}(\tilde{Y})\right]^T \tilde{K}_{\theta_i} \left[\left(D - DW\Lambda^{-1}W^TD\right)\mathrm{vec}(\tilde{Y})\right]$$
$$= \left(\mathrm{vec}(\hat{Y}) - DW\Lambda^{-1}\mathrm{vec}(\bar{Y})\right)^T \tilde{K}_{\theta_i} \left(\mathrm{vec}(\hat{Y}) - DW\Lambda^{-1}\mathrm{vec}(\bar{Y})\right).$$

We start by multiplying $\Lambda^{-1}$ with $\mathrm{vec}(\bar{Y})$, which can be done in $O(C^2R^2)$ after having precomputed the inverse. Exploiting that $W$ has Kronecker structure and that $D$ is a diagonal matrix reduces the runtime for multiplying the resulting vector with $DW$ from the left from $O(NPRC)$ to $O(NPC + RNC + NP)$. In the next step, we subtract the resulting vector from $\mathrm{vec}(\hat{Y})$ and multiply it with $\tilde{K}_{\theta_i}$ from the left, with an additional runtime of $O(NPR + NP^2)$, $O(NP + NP^2)$ and $O(NP^2)$ for the set, relatedness and noise terms, respectively. Finally, we have to multiply two vectors of size $NP$, which can be done in $O(NP)$ time. A tabular overview of the individual computations and how often these need to be carried out can be found in Table 2.

Inverse
    $A^{-1}$                                                     $O(N^3 + P^3)$ *
    Cholesky $\mathrm{chol}(I + W^TDW)$:
        $D \cdot W$                                              $O(NPCR)$
        $W^T(DW)$                                                $O(NPRC^2 + NR^2C^2)$
        $\mathrm{chol}(I + W^TDW)$                               $O(C^3R^3)$
Log likelihood
    $\log\det K$:
        $\log\det A$                                             $O(NP)$
        $\log\det \Lambda$                                       $O(CR)$
    $\tilde{y}^TD\tilde{y} - \bar{y}^T\Lambda^{-1}\bar{y}$:
        $\tilde{y} = L\,\mathrm{vec}(Y)$                         $O(N^2P + NP^2)$ **
        $\bar{y} = W^TD\,\mathrm{vec}(\tilde{Y})$                $O(NP + NPC + NRC)$
        $\tilde{y}^TD\tilde{y} - \bar{y}^T\Lambda^{-1}\bar{y}$   $O(NP + C^2R^2 + CR)$
Gradient
    $\tilde{K}_{\theta_i} = LK_{\theta_i}L^T$                    $O(N^2R + P^2C)$ set **; $O(N^3 + P^3)$ relatedness *; $O(N^3 + P^3)$ noise *
    $\bar{K}_{\theta_i} = W^TD\tilde{K}_{\theta_i}DW$:
        $\tilde{K}_{\theta_i}(DW)$                               $O(NRPC^2 + NR^2C^2)$ set; $O(NRPC + NRP^2C)$ relatedness; $O(NRP^2C)$ noise
        $W^T(D\tilde{K}_{\theta_i}DW)$                           $O(NPRC^2 + NR^2C^2)$

* computed only once; ** computed only once per region

Table 2: Tabular summary of the complexity of individual computational steps in the mtSet inference.
1.5. Efficient inference for approximations to the full mtSet model

Like any exact LMM (see Section 1.3), mtSet is bound to the up-front eigenvalue decomposition of the genetic relatedness matrix, which is a cubic operation in the number of samples, limiting the scalability of LMMs to very large cohorts (N > 20,000). In the following, we discuss two alternative approximations that are available in the mtSet software implementation, allowing mtSet to scale to cohorts with up to 500,000 individuals (see also main paper text, Figure 1 and Supplementary Figure 3): mtSet-PC and mtSet-LowRankBg. In mtSet-PC, the random effect accounting for relatedness is dropped, while population structure is accounted for using fixed-effect covariates. Alternatively, mtSet-LowRankBg considers a low-rank approximation to the background covariance. Low-rank approximations to the relatedness matrix have previously been applied to single-trait LMMs, e.g. [24, 14, 7].

Modelling population structure with principal components (mtSet-PC)

In mtSet-PC, population structure is modelled as fixed effects using the first $N_{PC}$ principal components, instead of using a random effect term as in the full mtSet. This approximation results in an LMM with only a single variance component (in addition to the noise component). Denoting by $F \in \mathbb{R}^{N \times N_{PC}}$ the sample design matrix of the fixed effect, the fixed effects on the vectorized phenotypes $\mathrm{vec}(Y)$ can be written as $V = I \otimes F \in \mathbb{R}^{NP \times N_{PC}P}$ with weights $b = \mathrm{vec}(B) \in \mathbb{R}^{N_{PC}P}$. This model assumes a $P$-degrees-of-freedom fit for each PC covariate. The restricted log-likelihood [25, 26] is then given by
$$\mathcal{L}_{\boldsymbol{\theta}} = \mathrm{const.} - \frac{1}{2} \underbrace{\left(\mathrm{vec}(Y) - Vb\right)}_{\mathrm{vec}(Z)}{}^T K^{-1} \left(\mathrm{vec}(Y) - Vb\right) - \frac{1}{2}\log\det K - \frac{1}{2}\log\det \underbrace{V^TK^{-1}V}_{A_{\mathrm{reml}}} \qquad (45)$$

where

$$b = A_{\mathrm{reml}}^{-1} V^T K^{-1}\, \mathrm{vec}(Y). \qquad (46)$$

The covariance matrix can be rewritten as

$$K = EE^T \otimes GG^T + C_n \otimes I_N \qquad (47,48)$$
$$\;\;= \left(U_n S_n^{1/2} \otimes I_N\right)\left(\tilde{E}\tilde{E}^T \otimes GG^T + I_{NP}\right)\left(U_n S_n^{1/2} \otimes I_N\right)^T \qquad (49)$$
$$\;\;= \left(U_n S_n^{1/2} \otimes I_N\right)\left(\left(U_E \otimes U_G\right)\left(S_E \otimes S_G\right)\left(U_E \otimes U_G\right)^T + I_{NP}\right)\left(U_n S_n^{1/2} \otimes I_N\right)^T \qquad (50)$$

where $\tilde{E} = S_n^{-1/2}U_n^TE$ and we used the notation $M = U_M S_M^{1/2} V_M^T$ for the singular value decomposition of $M$⁴. The inverse of $K$ can be written as

$$K^{-1} = \underbrace{\left(S_n^{-1/2} U_n^T \otimes I_N\right)}_{L}{}^T \Big(I_{NP} - \underbrace{\left(U_E \otimes U_G\right)^T}_{W}{}^T \underbrace{\left(S_E^{-1} \otimes S_G^{-1} + I_{RC}\right)^{-1}}_{D} W\Big) L \qquad (51)$$

i.e. $K^{-1} = L^T(I - W^TDW)L$, with $L = L_c \otimes I_N$, $L_c = S_n^{-1/2}U_n^T$, and $W = W_c \otimes W_r$, $W_c = U_E^T$, $W_r = U_G^T$. Calculating the SVDs of $\tilde{E}$ and $G$ takes, respectively, $O(PC^2)$ and $O(NR^2)$ operations; the SVD of $G$ has to be performed only once during optimization.

⁴ $M \in \mathbb{R}^{n_1 \times n_2}$, $U \in \mathbb{R}^{n_1 \times n_1}$, $S \in \mathbb{R}^{n_1 \times n_2}$, $V \in \mathbb{R}^{n_2 \times n_2}$
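A sketch of how the principal-component covariates $F$ used by mtSet-PC can be obtained from genotypes. The number of PCs, the centering and the use of a plain SVD are illustrative choices, not a prescription of the text:

```python
import numpy as np

# Build an N x N_PC fixed-effect design from the top principal components
# of a (centered) genotype matrix; hypothetical preprocessing snippet.
rng = np.random.default_rng(4)
N, M_snps, n_pc = 100, 1000, 10
G_all = rng.binomial(2, 0.3, size=(N, M_snps)).astype(float)  # 0/1/2 genotypes
G_all -= G_all.mean(axis=0)                                   # center each SNP

U, s, Vt = np.linalg.svd(G_all, full_matrices=False)
F = U[:, :n_pc]        # N x N_PC design matrix of PC covariates
```

The full mtSet-PC design on the vectorized phenotypes is then $V = I \otimes F$, i.e. the same PC covariates with trait-specific weights.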
Evaluating the log-likelihood

The log-likelihood of the model is

$$\mathcal{L}_{\boldsymbol{\theta}} = \mathrm{const.} - \frac{1}{2}\underbrace{\mathrm{vec}(Z)^TK^{-1}\mathrm{vec}(Z)}_{\text{squared form term}} - \frac{1}{2}\underbrace{\log\det K}_{\text{logdet term}} - \frac{1}{2}\underbrace{\log\det A_{\mathrm{reml}}}_{\text{reml term}} \qquad (52)$$

The log-determinant term can be computed as follows by applying the matrix determinant lemma:

$$\log\det K = \log\det\left(\left(U_E \otimes U_G\right)\left(S_E \otimes S_G\right)\left(U_E \otimes U_G\right)^T + I_{NP}\right) + N\log\det S_n \qquad (53)$$
$$\;\;= \log\det\left(S_E^{-1} \otimes S_G^{-1} + I\right) + R\log\det S_E + C\log\det S_G + N\log\det S_n. \qquad (54)$$

$A_{\mathrm{reml}}$ and $b$ can be computed respectively as

$$A_{\mathrm{reml}} = V^TK^{-1}V \qquad (55)$$
$$\;\;= (LV)^T(LV) - (WLV)^TD(WLV) \qquad (56)$$
$$\;\;= \left(L_c \otimes F\right)^T\left(L_c \otimes F\right) - \left(W_cL_c \otimes W_rF\right)^TD\left(W_cL_c \otimes W_rF\right) \qquad (57)$$

$$b = A_{\mathrm{reml}}^{-1}V^TK^{-1}\,\mathrm{vec}(Y) \qquad (58)$$
$$\;\;= A_{\mathrm{reml}}^{-1}\left((LV)^TL\,\mathrm{vec}(Y) - (WLV)^TD\,WL\,\mathrm{vec}(Y)\right) \qquad (59)$$
$$\;\;= A_{\mathrm{reml}}^{-1}\left(\mathrm{vec}\left(F^TYL_c^TL_c\right) - (WLV)^TD\,\mathrm{vec}\left(W_rYL_c^TW_c^T\right)\right) \qquad (60)$$

Finally, the quadratic term can be rewritten as

$$\mathrm{vec}(Z)^TK^{-1}\mathrm{vec}(Z) = \left(L\,\mathrm{vec}(Z)\right)^T\left(L\,\mathrm{vec}(Z)\right) - \left(WL\,\mathrm{vec}(Z)\right)^TD\left(WL\,\mathrm{vec}(Z)\right) \qquad (61)$$

where

$$L\,\mathrm{vec}(Z) = \mathrm{vec}\left(YL_c^T - FBL_c^T\right) \qquad (62)$$
$$WL\,\mathrm{vec}(Z) = \mathrm{vec}\left(W_rYL_c^TW_c^T - W_rFBL_c^TW_c^T\right) \qquad (63)$$

The log-likelihood can be evaluated in $O(NN_{PC}^2 + NN_{PC}R + NN_{PC}P + NPR + NN_{PC}P + NP + NP^2)$, where we report only the quantities depending on $N$, which are the bottleneck for huge sample sizes; several of these quantities have to be computed only once during optimization.

Calculating the gradient

The gradient of the likelihood can be written as

$$\mathcal{L}_{\theta_i} = \frac{1}{2}\underbrace{\mathrm{vec}(Z)^TK^{-1}K_{\theta_i}K^{-1}\mathrm{vec}(Z)}_{\text{squared form 1}} + \underbrace{\mathrm{vec}(Z)^TK^{-1}Vb_{\theta_i}}_{\text{squared form 2}} - \frac{1}{2}\underbrace{\mathrm{tr}\left(K^{-1}K_{\theta_i}\right)}_{\text{trace}} - \frac{1}{2}\underbrace{\mathrm{tr}\left(A_{\mathrm{reml}}^{-1}A_{\mathrm{reml},\theta_i}\right)}_{\text{reml}} \qquad (64)$$
Let us start by rewriting $K^{-1}K_{\theta_i}K^{-1}$:

$$K^{-1}K_{\theta_i}K^{-1} = L^T\left(I - W^TDW\right)L\left(C_{\theta_i} \otimes R\right)L^T\left(I - W^TDW\right)L \qquad (65)$$
$$\;\;= L^T\left(I - W^TDW\right)\Big(\underbrace{L_cC_{\theta_i}L_c^T}_{\tilde{C}} \otimes R\Big)\left(I - W^TDW\right)L \qquad (66)$$
$$\;\;= L^T\Big(\tilde{C} \otimes R\Big)L + L^TW^TD\Big(\underbrace{W_c\tilde{C}W_c^T}_{\bar{C}} \otimes \underbrace{W_rRW_r^T}_{\bar{S}_r}\Big)DWL \qquad (67)$$
$$\;\;\;\;- L^T\left(\tilde{C} \otimes R\right)W^TDWL - \left(L^T\left(\tilde{C} \otimes R\right)W^TDWL\right)^T \qquad (68)\text{-}(70)$$

where we used that $K_{\theta_i} = C_{\theta_i} \otimes R$, with $C$ and $R$ being $C_r$ and $R_r$ if $\theta_i$ is a region-term parameter, or $C_n$ and $I_N$ if $\theta_i$ is a noise-term parameter. The gradients of $A_{\mathrm{reml}}$ and $b$ can be calculated as

$$A_{\mathrm{reml},\theta_i} = -V^TK^{-1}K_{\theta_i}K^{-1}V \qquad (71)$$
$$\;\;= -(LV)^T\left(\tilde{C} \otimes R\right)(LV) - (DWLV)^T\left(\bar{C} \otimes \bar{S}_r\right)(DWLV) \qquad (72)$$
$$\;\;\;\;+ \left(W\left(\tilde{C} \otimes R\right)LV\right)^T(DWLV) + \left(\left(W\left(\tilde{C} \otimes R\right)LV\right)^T(DWLV)\right)^T$$
$$\;\;= -\left(L_c^T\tilde{C}L_c \otimes F^TRF\right) - (DWLV)^T\left(\bar{C} \otimes \bar{S}_r\right)(DWLV) \qquad (73)$$
$$\;\;\;\;+ \left(W\left(\tilde{C} \otimes R\right)LV\right)^T(DWLV) + \left(\left(W\left(\tilde{C} \otimes R\right)LV\right)^T(DWLV)\right)^T$$

and

$$b_{\theta_i} = -A_{\mathrm{reml}}^{-1}A_{\mathrm{reml},\theta_i}b - A_{\mathrm{reml}}^{-1}V^TK^{-1}K_{\theta_i}K^{-1}\,\mathrm{vec}(Y) \qquad (74)$$

where

$$V^TK^{-1}K_{\theta_i}K^{-1}\,\mathrm{vec}(Y) = (LV)^T\left(\tilde{C} \otimes R\right)L\,\mathrm{vec}(Y) + (DWLV)^T\left(\bar{C} \otimes \bar{S}_r\right)DWL\,\mathrm{vec}(Y) \qquad (75)$$
$$\;\;\;\;- \left(W\left(\tilde{C} \otimes R\right)LV\right)^TDWL\,\mathrm{vec}(Y) - (DWLV)^TW\left(\tilde{C} \otimes R\right)L\,\mathrm{vec}(Y)$$

Several of the matrix products in (71), (74) and (75) have already been computed for estimating the log-likelihood. The additional terms can be computed efficiently by using convenient factorisations and Kronecker product algebra:

$$W\left(\tilde{C} \otimes R\right)LV = W_c\tilde{C}L_c \otimes W_rRF \qquad (76)$$
$$(LV)^T\left(\tilde{C} \otimes R\right)L\,\mathrm{vec}(Y) = \mathrm{vec}\left(F^TRY\,L_c^T\tilde{C}^TL_c\right) \qquad (77)$$
$$W\left(\tilde{C} \otimes R\right)L\,\mathrm{vec}(Y) = \mathrm{vec}\left(W_rRY\,L_c^T\tilde{C}^TW_c^T\right) \qquad (78)$$

Notice that the computation of $RY$ or $RF$ can also be done in linear time in $N$: in the non-trivial case where $R = GG^T$, we can rewrite $RY = G(G^TY)$, which takes $O(NRP)$.
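Two quick numerical checks on the mtSet-PC algebra above, on a small synthetic example: the determinant reduction in (53)-(54), and the linear-time product $RY = G(G^TY)$ that avoids ever forming the $N \times N$ matrix $R = GG^T$ (illustrative sketch):

```python
import numpy as np

rng = np.random.default_rng(6)
N, P, R, C = 8, 4, 3, 2
U_G = np.linalg.qr(rng.standard_normal((N, R)))[0]   # orthonormal columns
U_E = np.linalg.qr(rng.standard_normal((P, C)))[0]
S_G = rng.uniform(0.5, 2.0, R)                       # squared singular values
S_E = rng.uniform(0.5, 2.0, C)

# determinant reduction, cf. (53)-(54)
U = np.kron(U_E, U_G)
S = np.kron(S_E, S_G)
lhs = np.linalg.slogdet(U @ np.diag(S) @ U.T + np.eye(N * P))[1]
rhs = (np.sum(np.log(1.0 / S + 1.0))
       + R * np.sum(np.log(S_E)) + C * np.sum(np.log(S_G)))

# linear-time R Y: G (G^T Y) instead of (G G^T) Y
G = rng.standard_normal((N, R))
Y = rng.standard_normal((N, P))
RY_fast = G @ (G.T @ Y)        # O(NRP)
RY_slow = (G @ G.T) @ Y        # O(N^2 R + N^2 P), for comparison only
```

Only the small $RC$-dimensional quantities appear in the reduced determinant, which is what makes the mtSet-PC log-likelihood linear in $N$.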
MA 575 Linear Models: Cedric E Ginestet, Boston University Mixed Effects Estimation, Residuals Diagnostics Week 11, Lecture 1 1 Within-group Correlation Let us recall the simple two-level hierarchical
More informationRegression. Goal: Learn a mapping from observations (features) to continuous labels given a training set (supervised learning)
Linear Regression Regression Goal: Learn a mapping from observations (features) to continuous labels given a training set (supervised learning) Example: Height, Gender, Weight Shoe Size Audio features
More informationStat260: Bayesian Modeling and Inference Lecture Date: February 10th, Jeffreys priors. exp 1 ) p 2
Stat260: Bayesian Modeling and Inference Lecture Date: February 10th, 2010 Jeffreys priors Lecturer: Michael I. Jordan Scribe: Timothy Hunter 1 Priors for the multivariate Gaussian Consider a multivariate
More informationIntroduction to the Tensor Train Decomposition and Its Applications in Machine Learning
Introduction to the Tensor Train Decomposition and Its Applications in Machine Learning Anton Rodomanov Higher School of Economics, Russia Bayesian methods research group (http://bayesgroup.ru) 14 March
More informationSparse orthogonal factor analysis
Sparse orthogonal factor analysis Kohei Adachi and Nickolay T. Trendafilov Abstract A sparse orthogonal factor analysis procedure is proposed for estimating the optimal solution with sparse loadings. In
More informationVariational Principal Components
Variational Principal Components Christopher M. Bishop Microsoft Research 7 J. J. Thomson Avenue, Cambridge, CB3 0FB, U.K. cmbishop@microsoft.com http://research.microsoft.com/ cmbishop In Proceedings
More informationGWAS V: Gaussian processes
GWAS V: Gaussian processes Dr. Oliver Stegle Christoh Lippert Prof. Dr. Karsten Borgwardt Max-Planck-Institutes Tübingen, Germany Tübingen Summer 2011 Oliver Stegle GWAS V: Gaussian processes Summer 2011
More informationMIXED MODELS THE GENERAL MIXED MODEL
MIXED MODELS This chapter introduces best linear unbiased prediction (BLUP), a general method for predicting random effects, while Chapter 27 is concerned with the estimation of variances by restricted
More informationBasic Concepts in Matrix Algebra
Basic Concepts in Matrix Algebra An column array of p elements is called a vector of dimension p and is written as x p 1 = x 1 x 2. x p. The transpose of the column vector x p 1 is row vector x = [x 1
More informationL11: Pattern recognition principles
L11: Pattern recognition principles Bayesian decision theory Statistical classifiers Dimensionality reduction Clustering This lecture is partly based on [Huang, Acero and Hon, 2001, ch. 4] Introduction
More informationELEMENTARY LINEAR ALGEBRA
ELEMENTARY LINEAR ALGEBRA K R MATTHEWS DEPARTMENT OF MATHEMATICS UNIVERSITY OF QUEENSLAND First Printing, 99 Chapter LINEAR EQUATIONS Introduction to linear equations A linear equation in n unknowns x,
More information3/10/03 Gregory Carey Cholesky Problems - 1. Cholesky Problems
3/10/03 Gregory Carey Cholesky Problems - 1 Cholesky Problems Gregory Carey Department of Psychology and Institute for Behavioral Genetics University of Colorado Boulder CO 80309-0345 Email: gregory.carey@colorado.edu
More informationBTRY 4830/6830: Quantitative Genomics and Genetics Fall 2014
BTRY 4830/6830: Quantitative Genomics and Genetics Fall 2014 Homework 4 (version 3) - posted October 3 Assigned October 2; Due 11:59PM October 9 Problem 1 (Easy) a. For the genetic regression model: Y
More informationLinear Regression (9/11/13)
STA561: Probabilistic machine learning Linear Regression (9/11/13) Lecturer: Barbara Engelhardt Scribes: Zachary Abzug, Mike Gloudemans, Zhuosheng Gu, Zhao Song 1 Why use linear regression? Figure 1: Scatter
More informationLecture 4 Noisy Channel Coding
Lecture 4 Noisy Channel Coding I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw October 9, 2015 1 / 56 I-Hsiang Wang IT Lecture 4 The Channel Coding Problem
More informationMachine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.
Bayesian learning: Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Let y be the true label and y be the predicted
More informationMultiple-step Time Series Forecasting with Sparse Gaussian Processes
Multiple-step Time Series Forecasting with Sparse Gaussian Processes Perry Groot ab Peter Lucas a Paul van den Bosch b a Radboud University, Model-Based Systems Development, Heyendaalseweg 135, 6525 AJ
More informationECE521 lecture 4: 19 January Optimization, MLE, regularization
ECE521 lecture 4: 19 January 2017 Optimization, MLE, regularization First four lectures Lectures 1 and 2: Intro to ML Probability review Types of loss functions and algorithms Lecture 3: KNN Convexity
More informationClustering VS Classification
MCQ Clustering VS Classification 1. What is the relation between the distance between clusters and the corresponding class discriminability? a. proportional b. inversely-proportional c. no-relation Ans:
More informationMODEL-FREE LINKAGE AND ASSOCIATION MAPPING OF COMPLEX TRAITS USING QUANTITATIVE ENDOPHENOTYPES
MODEL-FREE LINKAGE AND ASSOCIATION MAPPING OF COMPLEX TRAITS USING QUANTITATIVE ENDOPHENOTYPES Saurabh Ghosh Human Genetics Unit Indian Statistical Institute, Kolkata Most common diseases are caused by
More informationFast Algorithms for SDPs derived from the Kalman-Yakubovich-Popov Lemma
Fast Algorithms for SDPs derived from the Kalman-Yakubovich-Popov Lemma Venkataramanan (Ragu) Balakrishnan School of ECE, Purdue University 8 September 2003 European Union RTN Summer School on Multi-Agent
More informationPCA vignette Principal components analysis with snpstats
PCA vignette Principal components analysis with snpstats David Clayton October 30, 2018 Principal components analysis has been widely used in population genetics in order to study population structure
More informationParametric Empirical Bayes Methods for Microarrays
Parametric Empirical Bayes Methods for Microarrays Ming Yuan, Deepayan Sarkar, Michael Newton and Christina Kendziorski April 30, 2018 Contents 1 Introduction 1 2 General Model Structure: Two Conditions
More informationBiostat 2065 Analysis of Incomplete Data
Biostat 2065 Analysis of Incomplete Data Gong Tang Dept of Biostatistics University of Pittsburgh October 20, 2005 1. Large-sample inference based on ML Let θ is the MLE, then the large-sample theory implies
More informationDATA MINING AND MACHINE LEARNING. Lecture 4: Linear models for regression and classification Lecturer: Simone Scardapane
DATA MINING AND MACHINE LEARNING Lecture 4: Linear models for regression and classification Lecturer: Simone Scardapane Academic Year 2016/2017 Table of contents Linear models for regression Regularized
More informationLinear Algebra - Part II
Linear Algebra - Part II Projection, Eigendecomposition, SVD (Adapted from Sargur Srihari s slides) Brief Review from Part 1 Symmetric Matrix: A = A T Orthogonal Matrix: A T A = AA T = I and A 1 = A T
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 11 Project
More informationThe Hilbert Space of Random Variables
The Hilbert Space of Random Variables Electrical Engineering 126 (UC Berkeley) Spring 2018 1 Outline Fix a probability space and consider the set H := {X : X is a real-valued random variable with E[X 2
More informationNotes on Latent Semantic Analysis
Notes on Latent Semantic Analysis Costas Boulis 1 Introduction One of the most fundamental problems of information retrieval (IR) is to find all documents (and nothing but those) that are semantically
More informationIntroduction to Machine Learning
10-701 Introduction to Machine Learning PCA Slides based on 18-661 Fall 2018 PCA Raw data can be Complex, High-dimensional To understand a phenomenon we measure various related quantities If we knew what
More informationCorner. Corners are the intersections of two edges of sufficiently different orientations.
2D Image Features Two dimensional image features are interesting local structures. They include junctions of different types like Y, T, X, and L. Much of the work on 2D features focuses on junction L,
More informationAdvanced Introduction to Machine Learning
10-715 Advanced Introduction to Machine Learning Homework 3 Due Nov 12, 10.30 am Rules 1. Homework is due on the due date at 10.30 am. Please hand over your homework at the beginning of class. Please see
More informationRestricted Maximum Likelihood in Linear Regression and Linear Mixed-Effects Model
Restricted Maximum Likelihood in Linear Regression and Linear Mixed-Effects Model Xiuming Zhang zhangxiuming@u.nus.edu A*STAR-NUS Clinical Imaging Research Center October, 015 Summary This report derives
More informationFlexible phenotype simulation with PhenotypeSimulator Hannah Meyer
Flexible phenotype simulation with PhenotypeSimulator Hannah Meyer 2018-03-01 Contents Introduction 1 Work-flow 2 Examples 2 Example 1: Creating a phenotype composed of population structure and observational
More informationA matrix over a field F is a rectangular array of elements from F. The symbol
Chapter MATRICES Matrix arithmetic A matrix over a field F is a rectangular array of elements from F The symbol M m n (F ) denotes the collection of all m n matrices over F Matrices will usually be denoted
More informationVector Auto-Regressive Models
Vector Auto-Regressive Models Laurent Ferrara 1 1 University of Paris Nanterre M2 Oct. 2018 Overview of the presentation 1. Vector Auto-Regressions Definition Estimation Testing 2. Impulse responses functions
More informationProbabilistic Latent Semantic Analysis
Probabilistic Latent Semantic Analysis Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr
More information