Nature Methods: doi:10.1038/nmeth.3439


Supplementary Figure 1 Computational run time of alternative implementations of mtSet as a function of the number of traits. Shown is the extrapolated CPU time (h) to test associations on chromosome 20, considering a total of 3,975 windows (tests), on a simulated cohort with 1,000 individuals for increasing numbers of traits. Compared are mtSet and the approximate mtSet-PC model. mtSet-naive denotes the runtime for a standard LMM package. Runtime estimates were obtained from a single core of an Intel Xeon CPU E GHz processor.

Supplementary Figure 2 Computational run time of alternative implementations of mtSet as a function of the cohort size. (a) Shown is the CPU time (h) to test associations on chromosome 20 (3,975 regions/tests) on a simulated cohort with increasing numbers of individuals and for four traits. Compared are mtSet and the approximate mtSet-PC model. Additionally, we considered a low-rank approximation where the background covariance has rank 30, which matches the number of PCs included as fixed effects in the mtSet-PC model (see Online Methods). mtSet-naive denotes the runtime for a standard LMM package, which scales cubically in the number of traits and samples. Runtime estimates were obtained on a single core of an Intel Xeon CPU E GHz processor. (b) Shown is the average number of iterations until the optimizer converges. For larger numbers of samples, the likelihood becomes more peaked, resulting in a smaller number of iterations and thus reduced overall runtime.

Supplementary Figure 3 Characterization of the confounding structure in the four data sets used to assess statistical calibration of mtSet. Shown are the genetic relatedness matrices as well as scatter plots of the first two principal components for each of the four datasets used to assess the statistical calibration of mtSet. (a) Empirical genotype data of 1,000 individuals from 14 populations that are part of the 1000 Genomes project (1000G). (b–d) Synthetic datasets based on 1000 Genomes individuals of European ancestry. In brief, each individual is assigned to n ancestors, randomly inheriting blocks of SNPs from its ancestors. By placing alternative restrictions on the ancestors (number of ancestors; ancestors drawn from the same or different populations), datasets with different confounding structures can be obtained: (b) simPopStructure (kinship matrix has low-rank structure), (c) simUnrelated (kinship matrix is not structured) and (d) simRelated (kinship matrix is highly structured). See Online Methods and Supplementary Note for full details.

Supplementary Figure 4 Statistical calibration of mtSet, mtSet-PC, stLMM-SV and mtLMM-SV for four data sets with different confounding structures. Shown are QQ plots for simulated data when only background effects (no causal variants) were simulated and when considering alternative degrees of population structure and relatedness (Online Methods; see also Supplementary Fig. 3). Compared are a single-trait single-variant LMM (stLMM-SV), a multi-trait single-variant LMM (mtLMM-SV) as well as mtSet and the PC-based approximation without relatedness component (mtSet-PC). From left to right: mtSet, mtSet-PC, stLMM-SV and mtLMM-SV. From top to bottom: 1000 Genomes (real genotypes), simPopStructure, simUnrelated and simRelated (see Supplementary Fig. 3). Whereas the models mtSet, stLMM-SV and mtLMM-SV yield robust results irrespective of the type of confounding (see also Fig. 1), mtSet-PC is not able to correct for complex (cryptic) relatedness between individuals (bottom row, second column).

Supplementary Figure 5 Parametric fit of the null distribution on simulated data using 1000 Genomes genotypes for mtSet. The null distribution is fit by the mixture $\pi \chi^2_0 + (1-\pi)\,a\,\chi^2_d$ of test statistics, using five genome-wide permutations. Although we use only the top 10% of null test statistics for fitting the free parameters $\pi$, $a$ and $d$, we found empirically that the fit works well over the complete range of test statistics. Shown are the results for five different repetitions of four simulated phenotypes when only background effects are present.
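For illustration, a minimal sketch of such a fit (ours, not the mtSet implementation; function names are hypothetical) matches tail probabilities on the largest null statistics:

```python
# Sketch of the parametric null fit described above (our illustration, not
# the authors' implementation): the free parameters (pi, a, d) of the mixture
# pi*chi2_0 + (1 - pi)*a*chi2_d are fitted on the top 10% of permutation null
# statistics by matching tail probabilities.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import chi2

def fit_null_mixture(lrt_null, top=0.10):
    x = np.sort(lrt_null)
    n = x.size
    emp_sf = 1.0 - (np.arange(1, n + 1) - 0.5) / n      # empirical P(X > x)
    k = int((1.0 - top) * n)                            # keep the upper tail
    xt, st = x[k:], emp_sf[k:]

    def loss(params):
        pi, a, d = params
        if not (0.0 <= pi < 1.0 and a > 0.0 and d > 0.0):
            return np.inf
        sf = (1.0 - pi) * chi2.sf(xt / a, d)            # chi2_0 mass sits at 0
        return np.sum((np.log(sf + 1e-300) - np.log(st)) ** 2)

    return minimize(loss, x0=[0.5, 1.0, 1.0], method="Nelder-Mead").x

def null_mixture_pval(lrt, pi, a, d):
    return np.where(lrt <= 0.0, 1.0, (1.0 - pi) * chi2.sf(lrt / a, d))
```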

Supplementary Figure 6 Power comparison of alternative methods on simulated data using genotype data from 1000 Genomes individuals. Shown is power at 10% family-wise error rate for mtSet, mtSet-PC, mtLMM-SV, stSet and stLMM-SV when varying different simulation parameters. Specifically, we altered the proportion of variance explained by the region ($h^2_r$), the number of causal variants in the region ($S_r$), the percentage of shared causal variants ($\pi_r$), the proportion of variance explained by the genetic background ($h^2_g$), the percentage of residual variance explained by hidden confounders ($\lambda$), and the percentage of background and residual signal that is shared across traits ($\alpha$) (see also Supplementary Table 2). See Online Methods for details on the simulation procedure and the evaluation scheme.

Supplementary Figure 7 Power comparison when varying the size of the set component on simulated data using genotype data from 1000 Genomes individuals. (a) Shown is power at 10% family-wise error rate for mtSet, stSet, mtSet-PC, mtLMM-SV and stLMM-SV when varying the region size for set test approaches. While set tests are overall robust, these methods are most powerful when the region size matches the size of the simulated causal region. (b) Average squared correlation coefficient between variants within a window as a function of the window size. (c) Number of unique SNPs within testing regions as a function of the window size. When selecting the size of the testing window, both linkage disequilibrium and the number of SNPs within regions should be considered. Testing regions that are too small lead to high LD among SNPs within windows and a low number of unique SNPs, which results in limited advantages of set tests compared to single-variant LMMs. Conversely, regions that are too large result in a prohibitively large number of SNPs, which presents a computational burden and may lead to reduced power (a).

Supplementary Figure 8 Scalability of mtSet as a function of the number of variants in the set component. Shown is the computational time to fit a single window using mtSet (a) and mtSet-PC (b) for windows with increasing numbers of variants (randomly drawn from chromosome 20, 1000 Genomes dataset). Runtimes are reported for windows of varying size (1 kb–200 kb) using simulated data generated with the default parameter settings (see also Supplementary Table 2).

Supplementary Figure 9 Statistical calibration of all considered methods applied to four blood lipid levels on the NFBC data set. (a–c) QQ plots of set tests including the relatedness component (a), approximate set tests using PC-based correction (b) and single-variant LMMs (c). Both single-trait LMMs and set test methods are calibrated, with genomic control close to one for all methods (e.g., λ(mtLMM-SV) = 0.979, λ(stLMM-SV[CRP]) = 0.995 and λ(stLMM-SV[LDL]) = 0.996 for the single-variant methods).

Supplementary Figure 10 Histograms of P values obtained from single- and multi-trait set tests applied to four blood lipid levels on the NFBC data set. Top row: multi-trait set tests mtSet (a) and mtSet-PC (b) applied to the four lipid-related traits. Bottom two rows: single-trait set test (stSet) applied to the individual traits CRP (c), LDL (d), HDL (e) and TRIGL (f). The spike in the histograms is a common feature of set tests and results from the constrained marginal likelihood optimization: the set component is bounded to explain variance greater than or equal to zero, resulting in a box-constrained optimization. The location of the spike is determined by the mixture coefficients of the parametric null distribution fit (see Online Methods and Supplementary Note).

Supplementary Figure 11 Manhattan plots for different methods applied to four blood lipid levels on the NFBC data set. (a,b) Shown are Manhattan plots of the minimal P values across traits, considering either a single-trait single-variant LMM (stLMM-SV) (a) or a single-trait set test (stSet) (b). (c–e) Corresponding Manhattan plots for multi-trait approaches jointly fit to all four traits: mtLMM-SV (c), mtSet (d) and mtSet-PC (e). mtSet-PC is the best-powered approach: it recovers all associations found by the union of QTLs retrieved by the previous approaches (stLMM-SV, mtLMM-SV and stSet) and yields two additional QTLs, one association on chromosome 1 (shared with mtSet) and a second QTL on chromosome 16.

Supplementary Figure 12 Manhattan plots for quantitative traits related to basal hematology in the rat data set. (a,c,e,g,i,k) Manhattan plots for basophils (basos), eosinophils (eos), large unstained cells (lucs), lymphocytes (lymphs), monocytes (monos) and neutrophils (neuts), respectively, when using a single-trait single-variant LMM (stLMM-SV). (b,d,f,h,j,l) Analogous Manhattan plots for the same traits obtained using a single-trait set test (stSet). (m,n) Manhattan plots from the multi-trait single-variant LMM (mtLMM-SV) and the multi-trait set test (mtSet), respectively. Note that the horizontal lines in the Manhattan plots for stSet and mtSet are a common feature of set tests and result from the constrained marginal likelihood optimization: the set component is bounded to explain variance greater than or equal to zero, resulting in a box-constrained optimization (see also Supplementary Fig. 10).

Supplementary Figure 13 Manhattan plots for set tests when considering different strategies for confounder correction, applied to six phenotypes related to basal hematology in the rat data set. (a) Manhattan plot obtained when applying mtSet without any adjustment for relatedness or population structure (mtSet-noBg). (b) Equivalent Manhattan plot when using the top 30 principal components to correct for population structure (mtSet-PC). (c) Results obtained from the full mtSet model, where relatedness is accounted for using a second random effect term. Because of the closely related individuals in the study population, only the full mtSet model is able to comprehensively correct for relatedness (c); see also main Fig. 2.

Supplementary Figure 14 Distribution of the number of variants within testing regions, as well as the squared SNP–SNP correlation coefficient within regions, when considering regions of increasing window sizes. Left column: dependency between region sizes and the number of contained variants. Right column: dependency between region sizes and the squared SNP–SNP correlation coefficient for SNPs within regions. From top to bottom: rat dataset, NFBC data, 1000 Genomes data (chromosome 20). The computational cost of mtSet depends on the number of (unique) SNPs in testing regions. In the experiments, we considered 100-kb windows for the NFBC data, 1-Mb windows for the rat study and 30-kb windows for the 1000 Genomes data. Alternative results for different region sizes are shown in Supplementary Fig. 7 (simulated data based on 1000 Genomes individuals) and Supplementary Table 4 (NFBC data).

Supplementary Figure 15 Comparison of test P values obtained from mtSet-PC and mtSet-LowRankBg. Compared are likelihood ratio test statistics for the mtSet-PC model and a model that considers a low-rank approximation to the background covariance (using the same number of principal components; mtSet-LowRankBg, Online Methods). For large cohorts, we observe good concordance between both models. This confirms that accounting for PCs as (REML) fixed effects or alternatively including them as random effect covariates yields concordant results.

[Supplementary Table 1: type-1 error estimates for mtSet and mtSet-PC on the 1000G, simPopStructure, simUnrelated and simRelated datasets, at significance levels α = 5.00e-02, 5.00e-03, 5.00e-04 and 5.00e-05; columns report estimates for true windows, test windows and train windows.]

Supplementary Table 1 Type-1 error estimates on simulated data. Shown are the type-1 error estimates for increasingly stringent significance levels α ∈ {5.00e-02, 5.00e-03, 5.00e-04, 5.00e-05} on four alternative simulated datasets (see also Supplementary Fig. 3). Train windows denote regions that have been used (based on permutations) to fit the parametric model of the null distribution (Online Methods). True windows denote genomic regions that have not been used to train the null model (independent test validation). Finally, test windows denote regions where the genotype–phenotype relationship has been shuffled; these are equivalent to train windows, but use a different set of permutations (Online Methods). mtSet and mtSet-PC perform equally well when there is no structure or only population structure, while the calibration of mtSet-PC deteriorates when individuals are related (see Supplementary Methods and Supplementary Fig. 3 for the simulation strategy).

[Supplementary Table 2: parameter grid for the power simulations, listing the values considered for $h^2_r$, $S_r$, $\pi_r$, $\alpha$, $h^2_g$, $\lambda$ and the window size (in kb), with the default settings highlighted in bold.]

Supplementary Table 2 Parameter ranges for simulated datasets. To assess the power of different methods, we considered a range of alternative simulations, varying key parameters that determine the genetic architecture of the traits. We altered the variance explained by the region ($h^2_r$), the number of causal variants from the region ($S_r$), the percentage of shared causal variants ($\pi_r$), the percentage of background and residual signal that is shared across traits ($\alpha$), the variance explained by the genetic background ($h^2_g$), the percentage of residual variance explained by hidden confounders ($\lambda$) and the window size. Each of these parameters was varied while keeping the others at their default value (highlighted in bold). For details of the simulation procedure, see Online Methods.

Supplementary Table 3 Tabular summary of QTLs identified by different set tests and single-variant LMMs on the NFBC dataset. The results table is provided as a separate supplementary information file: nfbc_sm_1e5.xlsx

Supplementary Table 4 Tabular summary of QTLs identified by mtSet with varying window size on the NFBC dataset. The results table is provided as a separate supplementary information file: nfbc_sm_windowsize.xlsx

Supplementary Table 5 Tabular summary of QTLs identified by different set tests and single-locus LMMs on the rat dataset. The results table is provided as a separate supplementary information file: rat_sm_1e6.xlsx

[Supplementary Table 6: for the four traits CRP, LDL, HDL and TRIGL, the table reports single-trait and multi-trait heritability estimates, the genetic (relatedness) trait–trait covariance matrix, the noise covariance matrix and the empirical phenotypic covariance matrix; each entry is given as estimate ± standard error.]

Supplementary Table 6 Estimates of trait heritability and covariances for four lipid-related traits from the NFBC dataset. Heritability estimates: single-trait heritability estimates are obtained independently for each trait; multi-trait estimates correspond to the marginal estimates obtained from the genetic and noise trait covariance matrices of the null model fit of mtSet. As expected, these marginal estimates are consistent. Genetic covariance matrix: trait–trait covariances of the relatedness component from the null model fit of mtSet. Noise covariance matrix: trait–trait covariances of the noise component of the null model fit of mtSet. Phenotypic covariance matrix: empirical covariance matrix of the raw phenotypes. All estimates are obtained from a maximum likelihood fit of mtSet; standard errors are denoted by ±.

[Supplementary Table 7: for the six traits basos, eos, lucs, lymphs, monos and neuts, the table reports single-trait and multi-trait heritability estimates, the genetic (relatedness) trait–trait covariance matrix, the noise covariance matrix and the empirical phenotypic covariance matrix; each entry is given as estimate ± standard error.]

Supplementary Table 7 Estimates of trait heritability and covariances for six phenotypes related to basal hematology in the rat dataset. Heritability estimates: single-trait heritability estimates are obtained independently for each trait; multi-trait estimates correspond to the marginal estimates obtained from the genetic and noise trait covariance matrices of the null model fit of mtSet. As expected, these marginal estimates are consistent. Genetic covariance matrix: trait–trait covariances of the relatedness component from the null model fit of mtSet. Noise covariance matrix: trait–trait covariances of the noise component of the null model fit of mtSet. Phenotypic covariance matrix: empirical covariance matrix of the raw phenotypes. All estimates are obtained from a maximum likelihood fit of mtSet; standard errors are denoted by ±.

Supplementary Notes: Efficient multivariate set tests for the genetic analysis of correlated traits

Francesco Paolo Casale*, Barbara Rakitsch*, Christoph Lippert, Oliver Stegle

*These authors have contributed equally.

1. Multi-trait set tests

We here provide additional implementation details of mtSet, covering efficient inference approaches and approximation schemes to scale mtSet to very large cohorts. Section 1.1 introduces the multi-trait linear mixed model (LMM) that underlies mtSet. Section 1.2 describes a permutation scheme to estimate p-values within mtSet. In Section 1.3, we discuss inference challenges in LMMs, the approach taken in mtSet and the relationship to prior work. In Section 1.4, we lay out the mathematical details of efficient likelihood and gradient evaluations for parameter inference in mtSet. Finally, in Section 1.5 we discuss alternative approximations to scale mtSet to extremely large cohorts, mtSet-PC and mtSet-LowRankBg.

1.1. Model

The matrix-variate phenotype $Y$ is modelled by the sum of the contribution from the variants in the genetic region (set component), a random genetic background effect (relatedness component) and residual observation noise:

$$Y = \underbrace{FB}_{\text{fixed effects}} + \underbrace{U_r}_{\text{set component}} + \underbrace{U_g}_{\text{relatedness component}} + \underbrace{\Psi}_{\text{noise}}. \quad (1)$$

Here, $Y$ denotes the $N \times P$ phenotype matrix for $N$ individuals and $P$ traits. $F$ is the $N \times N_{FE}$ sample-design matrix of the fixed effects and $B$ is the corresponding $N_{FE} \times P$ weight matrix. The matrix $U_r$ denotes effects from the set component, $U_g$ explains variation from the relatedness component and $\Psi$ denotes residual noise. We model each of the latter three terms as random effects with the following matrix-variate normal priors:

$$U_r \sim \mathcal{MVN}(0, C_r, R_r), \qquad U_g \sim \mathcal{MVN}(0, C_g, R_g), \qquad \Psi \sim \mathcal{MVN}(0, C_n, I_{N \times N}). \quad (2)$$

The covariance matrix $C_r \in \mathbb{R}^{P \times P}$ explains the trait-to-trait covariance between phenotypes that is induced by the set term. Conversely, the individual-to-individual covariance $R_r \in \mathbb{R}^{N \times N}$ denotes the genetic relatedness matrix between individuals that captures the local genetic structure of the variants in the set.

Analogously, the trait covariance that is induced by the relatedness component is modelled by the $P \times P$ trait-to-trait covariance matrix $C_g$, and $R_g$ denotes the corresponding individual-to-individual relatedness matrix that captures the global genetic relatedness between individuals (e.g., kinship). Finally, the random effect $\Psi$ explains i.i.d. observation noise, where $C_n$ models residual correlations between the traits. The marginal likelihood of the model in (1)-(2) is given by

$$p(Y \mid F, B, C_r, R_r, C_g, R_g, C_n) = \mathcal{N}\Big(\mathrm{vec}(Y) \,\Big|\, \mathrm{vec}(FB),\ \underbrace{C_r \otimes R_r}_{\text{set component}} + \underbrace{C_g \otimes R_g}_{\text{relatedness component}} + \underbrace{C_n \otimes I_{N \times N}}_{\text{noise}}\Big), \quad (3)$$

where $\otimes$ denotes the Kronecker product (see Appendix A.1) and we have used the equivalence of a matrix-variate normal distribution and a multivariate normal distribution (Appendix A.2). The operator $\mathrm{vec}(\cdot)$ denotes a stacking operation, which transforms an input matrix into a vector by concatenating its columns. For the sake of clarity, we omit the fixed effects from now on in this derivation; the software implementation of mtSet provides support for fixed effect covariates.

The LMM in Eqn. (3) is closely related to existing multi-trait association models used in genetics, in particular the MTMM model [1] as well as the multi-trait version of GEMMA [2] and implementations in LIMIX [3]. However, importantly, mtSet requires two variance component terms, whereas GEMMA and MTMM build on a single variance component to account for relatedness (in addition to observation noise) and test the genetic variants one by one as fixed effect covariates. A detailed discussion of how mtSet relates to prior work is provided in Section 1.3.

Both the set component ($R_r$) and the relatedness component ($R_g$) can be estimated from the genotype data alone (see below). In contrast, the elements of the three trait-to-trait covariance matrices need to be estimated from the full model, for which we employ maximum likelihood estimation.¹ In order to retain efficiency, we exploit linear algebra identities and convenient factorizations, thereby minimizing the computational complexity and memory requirements of likelihood and gradient evaluations.

Set and relatedness covariance matrices. In principle, any valid covariance function [4] can be used to define the covariance matrices in mtSet. However, the algorithmic tricks for computational efficiency in mtSet rely on (i) the assumption that both $R_r$ and $R_g$ are constant (i.e., their eigenvalue decompositions and some other operations can be cached) and (ii) the assumption that the set covariance $R_r$ is low-rank. Here, we consider the realized relationship matrix, which is compatible with these assumptions [5]. We define $R_g = SS^\top$ and $R_r = GG^\top$, where $S \in \mathbb{R}^{N \times S}$ and $G \in \mathbb{R}^{N \times R}$ denote the matrices of all genome-wide variants (relatedness) and of the variants in the set to be tested, respectively. The scalar $S$ denotes the total number of genome-wide variants and $R$ corresponds to the number of variants in the set. Weights for individual variants, for example to prioritize rare variants, could be considered straightforwardly; an approach that has previously been used to increase power for rare variant association analysis (see, e.g., [6]). If we use $C$ to denote the rank of the trait-to-trait covariance matrix $C_r$, the overall rank of the region covariance term follows as $C \cdot R$ with $R \ll N$ and $C \ll P$, which does not directly depend on the number of samples and traits. As discussed in the following paragraph, $C$ can be interpreted as the number of independent effects from the region across traits.
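As a reference point for the efficient algorithms derived below, the marginal likelihood (3) can be written down naively in a few lines. The following sketch (ours, for exposition only, with fixed effects omitted as in the derivation; names are hypothetical) makes the Kronecker structure explicit:

```python
# Naive reference implementation of the mtSet marginal log-likelihood in
# Eq. (3), for exposition only: it forms the full NP x NP covariance, which
# the efficient scheme of Section 1.4 deliberately avoids.
import numpy as np
from scipy.stats import multivariate_normal

def mtset_marginal_loglik(Y, G, S, Cr, Cg, Cn):
    """Y: N x P phenotypes; G: N x R variants in the tested set; S: N x S
    genome-wide variants; Cr, Cg, Cn: P x P trait-trait covariances."""
    N, P = Y.shape
    Rr = G @ G.T                          # set relatedness, rank <= R
    Rg = S @ S.T                          # genome-wide relatedness (kinship)
    K = (np.kron(Cr, Rr)                  # set component
         + np.kron(Cg, Rg)                # relatedness component
         + np.kron(Cn, np.eye(N)))        # i.i.d. noise
    y = Y.T.reshape(-1)                   # vec(Y): concatenate columns
    return multivariate_normal.logpdf(y, mean=np.zeros(N * P), cov=K)
```

Already for moderate $N$ and $P$, forming and inverting this $NP \times NP$ matrix becomes infeasible, which motivates the factorizations derived in Section 1.4.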
Rank of the set trait-to-trait covariance. For the sake of computational efficiency, we consider a low-rank set covariance, setting $C = 1$. This setting results in a linear scaling of the number of parameters in $P$ (instead of a quadratic scaling).

¹We perform gradient-based optimization of the marginal likelihood (see Section 1.4).

In order to understand the effect of a low-rank trait-trait covariance matrix on the genetic effects the model can capture, it is instructive to derive the mtSet LMM from a generative linear model perspective:

$$Y = \underbrace{FB}_{\text{fixed effects}} + \underbrace{GV}_{\text{set component}} + \underbrace{U_g}_{\text{relatedness component}} + \underbrace{\Psi}_{\text{noise}}, \quad (4)$$

where $V \in \mathbb{R}^{R \times P}$ are the effect sizes of the $R$ variants in the region on the $P$ traits ($V_{r,p}$ is the effect size of variant $r$ on trait $p$), $U_g \sim \mathcal{MVN}(0, C_g, R_g)$ and $\Psi \sim \mathcal{MVN}(0, C_n, I_{N \times N})$. Assuming $C = 1$ is equivalent to assuming that we can write the effect sizes for trait $p$ as $V_{:,p} = e_p v$. Note that while we have a unique genetic signal $v \in \mathbb{R}^{R \times 1}$, which is shared across all traits, this model allows for trait-specific rescaling of this signal through the factors $e_p$. Introducing the scaling vector $e = [e_1, \ldots, e_P]^\top \in \mathbb{R}^{P \times 1}$, we can rewrite the model as

$$Y = \underbrace{FB}_{\text{fixed effects}} + \underbrace{G v e^\top}_{\text{set component}} + \underbrace{U_g}_{\text{relatedness component}} + \underbrace{\Psi}_{\text{noise}}. \quad (5)$$

Finally, considering a normal prior over the new weights, $v \sim \mathcal{N}(0, I)$, and marginalising them out, we obtain the marginal likelihood in (3), with $U_r \sim \mathcal{MVN}(0, C_r, R_r)$, $C_r = ee^\top$ and $R_r = GG^\top$. Notice that $C_r$ is a rank-1 matrix. More complex genetic signals (higher ranks of $C_r$) could also be considered by relaxing this assumption, at the cost of increased model complexity and additional computational cost (see below and Table 1).

1.2. Estimation of p-values and significance testing

Building on previous methods for single-trait set tests [7, 8], we consider likelihood-ratio tests to assess the significance of a particular region set. When testing for variance components, the distribution of the test statistics under the null is in general not known when the phenotype vector $\mathrm{vec}(Y)$ cannot be divided into a large number of i.i.d. subvectors [9], as is the case here. To estimate p-values, we employ a permutation scheme, where we assume that the distribution of the test statistics under the null is constant across regions and has the parametric form

$$p(x \mid \pi, a, d) = \pi \chi^2_0(x) + (1 - \pi)\, a\, \chi^2_d(x). \quad (6)$$

We first obtain test statistics from the null distribution by using genome-wide permutations, pooling the test statistics over all windows. Subsequently, we use the largest 10% of the test statistics to fit the parameters such that the error between the parametric and theoretical p-values is minimized. In the experiments, we found that a relatively small number of genome-wide permutations (<100) was sufficient to accurately estimate the null distribution. A closely related scheme has previously been proposed for single-trait set tests [7] and compared to other testing procedures, in particular score tests [8], suggesting that likelihood ratio tests tend to be well powered.

1.3. Overview of inference methods in LMMs

Before providing full details of the efficient inference scheme in mtSet, we provide an overview of existing methods for inference in LMMs and compare these methods to the approach taken here. The majority of the LMM-based approaches for genetic association testing build on closely related formulations of the null model that underlies mtSet. Common to these methods is the assumption that the observed phenotype data are modelled by the sum of a variance component to explain variation due to relatedness as well as residual noise. In standard applications of LMMs for GWAS, genetic variants are then tested one by one as additional fixed effects in the model (as are other covariates).
In contrast, set tests such as mtSet aggregate across multiple proximal genetic variants using a second variance component in the model.

In the context of single-trait set tests, this has previously been described in [10, 11, 7, 8]; however, none of these inference schemes allows for multi-trait modelling. Parameter inference in either type of LMM (one or two variance components) is typically done using (restricted) maximum likelihood. Because of the large number of alternative models that need to be fitted, the computational tractability of the underlying operations, i.e. evaluation of the likelihood and gradients to determine model parameters, is essential. In general, naive inference in an LMM requires the inversion of the covariance matrix in the model, which for a multi-trait model with $N$ individuals and $P$ traits scales cubically in both dimensions, i.e. $O(N^3 P^3)$.

LMMs with fixed effect testing. Efficient inference for single-trait LMMs, as implemented in FaST-LMM [12] and GEMMA [13], exploits pre-computing the (constant across SNPs) eigendecomposition of the sample covariance matrix. This step reduces the computational cost from $O(N^3)$ per variant to a single $O(N^3)$ operation up-front (corresponding to the cost of an eigenvalue decomposition) and a per-test complexity of $O(N^2)$. For every test, exact parameter inference can be achieved by means of simple closed-form operations and a one-dimensional Brent search optimization. The computational complexity of these approaches can be further reduced to $O(N^2)$ for the up-front computation and a per-test complexity of $O(N)$, conditioned on the relatedness covariance matrix having a low-rank structure. In practice, this can be achieved through a feature selection approach, selecting a small proportion of all genome-wide variants to estimate $R_g$ [14].

The extension of these efficient linear algebra techniques to the joint analysis of multiple traits (mtLMM-SV) has recently been proposed as an extension to GEMMA [2] (termed mvLMM). Combining Kronecker product algebra with the eigendecomposition trick, the naive cost of $O(N^3 P^3)$ can be reduced to a single $O(N^3)$ operation up-front and $O(N^2 + NP^x)$ per variant, where $x$ depends on the optimization algorithm. As the number of variance component parameters increases quadratically with the number of traits, derivative-free methods, as used in efficient single-trait LMMs, are rendered inefficient, and hence mvLMM considers a gradient-based optimization scheme (combined with an EM-like algorithm) to estimate model parameters. In particular, mvLMM combines the Newton-Raphson and expectation-maximization algorithms.

LMMs for set tests. Single-trait set tests based on an LMM with a single variance component were first proposed in [15, 16, 10] and subsequently extended to include a relatedness component [11, 17, 6]. Common to these models is that p-values are estimated using a score test. Alternatively, it has also been proposed to use a likelihood ratio test to assess statistical significance for the same class of LMMs [7]. A recent comparison between score tests and likelihood ratio tests [8] shows that likelihood ratio tests tend to have more power in real settings. However, score tests are computationally cheaper, as the model parameters need only be fit once on the null model, whereas likelihood ratio tests require full parameter inference of the alternative model for each test. FaST-LMM-Set [7] assumes a low-rank relatedness covariance and a low-rank set covariance, which allows aggregating both components into a single (low-rank) variance component, enabling efficient inference.
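To make the eigendecomposition trick above concrete, the following minimal single-trait sketch (our illustration, not the FaST-LMM or GEMMA code; all names are hypothetical) shows why the per-test cost drops to $O(N^2)$:

```python
# Minimal single-trait sketch of the rotation trick used by exact LMMs
# (illustration only, not the FaST-LMM/GEMMA code). One O(N^3) eigen-
# decomposition of Rg is shared by all tests; per variant, only O(N^2)
# rotations plus O(N) likelihood evaluations remain.
import numpy as np
from scipy.optimize import minimize_scalar

def rotate_once(Rg, y):
    S, U = np.linalg.eigh(Rg)        # Rg = U diag(S) U^T, done once
    return S, U, U.T @ y

def neg_loglik(delta, S, y_rot, x_rot):
    """Model y ~ N(x*beta, sg2*(Rg + delta*I)); beta and sg2 profiled out."""
    d = S + delta                    # eigenvalues of Rg + delta*I
    beta = np.sum(x_rot * y_rot / d) / np.sum(x_rot**2 / d)   # GLS fit, O(N)
    r = y_rot - x_rot * beta
    n = y_rot.size
    sg2 = np.sum(r**2 / d) / n       # ML variance estimate
    return 0.5 * (n * np.log(2 * np.pi * sg2) + np.sum(np.log(d)) + n)

def test_variant(x, S, U, y_rot):
    x_rot = U.T @ x                  # O(N^2) per variant
    res = minimize_scalar(lambda ld: neg_loglik(np.exp(ld), S, y_rot, x_rot),
                          bounds=(-10, 10), method="bounded")  # 1D search
    return res.fun                   # plug into a likelihood-ratio test
```

The decomposition is reused across all variants; only the rotation of each variant and the cheap evaluations inside the one-dimensional search remain per test.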
Parameter optimization is again carried out using a one-dimensional Brent search. An extension to full-rank background covariance matrices has been presented in [8]; this model is a special case of mtSet (single trait) and will be referred to as stSet. In the same way that mvLMM extends a standard single-variant LMM to multi-trait analysis, mtSet is the multivariate generalization of stSet. As discussed in the next section, the algorithm underlying mtSet combines eigenvalue decompositions and low-rank updates with Kronecker product algebra to break down the $O(N^3 P^3)$ computational cost to an $O(N^3)$ operation up-front and $O(N^2 + NR^2P^2 + NRP^4)$ per set, where $R$ denotes the number of variants in the set component to be tested. In the same vein, we also consider two alternative approximations of the full mtSet model: mtSet-PC, where the relatedness component is omitted and population structure is modelled as a fixed effect, and mtSet-LowRankBg, where, analogously to FaST-LMM-Set, we assume a low-rank relatedness covariance.

Both proposed approximations scale linearly in the number of individuals, permitting analysis of extremely large cohorts (up to 500,000 individuals; see also main text, Figure 1 and Supplementary Figure 3). Similarly to mvLMM, parameter inference in mtSet (and in mtSet-PC and mtSet-LowRankBg) is done using gradient-based parameter optimization (L-BFGS [18, 19]). In our experience, the success of the optimization method is greatly affected by the employed stopping criterion. For example, when the likelihood surface is flat ($N < 5000$, large windows), the default parameter settings of the SciPy [20] library in Python are insufficiently stringent, resulting in premature stopping. We circumvent this by explicitly choosing stringent stopping criteria, setting factr to $10^3$ (default value: $10^7$).

We note that GCTA [21, 22], a popular approach to fit variance component models, provides support for arbitrary numbers of variance components for single traits and limited support for multi-trait analyses. Specifically, the model allows for joint analysis across pairs of traits, which can be regarded as a special case of GEMMA, however employing gradient-based parameter inference (using the PX-AI algorithm). Table 1 provides a tabular listing of the per-test computational complexity for alternative LMM methods and implementations. Note that the listed complexities do not take into account the up-front $O(N^3)$ operation for the eigendecomposition of the relatedness covariance matrix that is common to all methods (or $O(N^2)$, respectively, if a low-rank relatedness covariance is used).

1.4. Efficient inference for the full mtSet model

Without loss of generality, we consider $G \in \mathbb{R}^{N \times R}$ having column rank $R$ in the following.³ To simplify the derivation of efficient inference (see Section 1.3), we also rewrite the trait-to-trait covariance matrix as $C_r = EE^\top$, where $E$ is a $P \times C$ matrix.

Inverting the covariance matrix. The full model covariance matrix has the form

$$K = C_r \otimes R_r + C_g \otimes R_g + C_n \otimes I_N \quad (7)$$
$$= A + XX^\top. \quad (8)$$

Here, we have defined $A = C_g \otimes R_g + C_n \otimes I_N$, which bundles the effects of the relatedness and noise covariance terms. The set term is represented as $XX^\top$, where $X = E \otimes G$. Using the same linear algebra tricks as in previous work [23], and using the notation $M = U_M S_M U_M^\top$ for the eigenvalue decomposition of a matrix $M$, we can write

$$A^{-1} = \left[ \left( U_{C_n} S_{C_n}^{1/2} \otimes I \right) \left( \tilde{C}_g \otimes R_g + I_{NP} \right) \left( U_{C_n} S_{C_n}^{1/2} \otimes I \right)^\top \right]^{-1} \quad (9)$$
$$= \left[ \left( U_{C_n} S_{C_n}^{1/2} U_{\tilde{C}_g} \otimes U_{R_g} \right) \left( S_{\tilde{C}_g} \otimes S_{R_g} + I_{NP} \right) \left( U_{C_n} S_{C_n}^{1/2} U_{\tilde{C}_g} \otimes U_{R_g} \right)^\top \right]^{-1} \quad (10)$$
$$= \left( U_{\tilde{C}_g}^\top S_{C_n}^{-1/2} U_{C_n}^\top \otimes U_{R_g}^\top \right)^\top \left( S_{\tilde{C}_g} \otimes S_{R_g} + I_{NP} \right)^{-1} \left( U_{\tilde{C}_g}^\top S_{C_n}^{-1/2} U_{C_n}^\top \otimes U_{R_g}^\top \right), \quad (11)$$

where we have introduced $\tilde{C}_g = S_{C_n}^{-1/2} U_{C_n}^\top C_g U_{C_n} S_{C_n}^{-1/2}$. All elements in (11) can be calculated in $O(N^3 + P^3)$, where the $O(N^3)$ operation needs to be done only once in the whole analysis [23].

³If $G$ has column rank $\tilde{R} < R$, we can always find $\tilde{G}$ with column rank $\tilde{R}$ such that $R_r = \tilde{G}\tilde{G}^\top$ by a singular value decomposition $G = \underbrace{U S^{1/2}}_{\tilde{G}} V$ (with runtime $O(NR^2)$).

For simplicity of notation, we introduce

$$L_c = U_{\tilde{C}_g}^\top S_{C_n}^{-1/2} U_{C_n}^\top \quad (12)$$
$$L_r = U_{R_g}^\top \quad (13)$$
$$L = L_c \otimes L_r \quad (14)$$
$$D = \left( S_{\tilde{C}_g} \otimes S_{R_g} + I \right)^{-1}. \quad (15)$$

In the new notation, (11) becomes

$$A^{-1} = L^\top D L, \quad (16)$$

which explicitly shows that $A^{-1}$ is a Kronecker-structured transformation of a diagonal matrix. We can use the Woodbury matrix identity to efficiently invert $K$, exploiting the low-rank nature of $XX^\top$:

$$K^{-1} = \left( A + XX^\top \right)^{-1} \quad (17)$$
$$= A^{-1} - A^{-1} X \left( I + X^\top A^{-1} X \right)^{-1} X^\top A^{-1} \quad (18)$$
$$= L^\top D L - L^\top D L X \left( I + X^\top A^{-1} X \right)^{-1} X^\top L^\top D L \quad (19)$$
$$= L^\top \left[ D - D L X \left( I + X^\top A^{-1} X \right)^{-1} X^\top L^\top D \right] L \quad (20)$$
$$= L^\top \left( D - D W \Lambda^{-1} W^\top D \right) L, \quad (21)$$

where we have introduced

$$W_c = L_c E \in \mathbb{R}^{P \times C} \quad (22)$$
$$W_r = L_r G \in \mathbb{R}^{N \times R} \quad (23)$$
$$W = W_c \otimes W_r \quad (24)$$
$$\Lambda = I + X^\top A^{-1} X \in \mathbb{R}^{RC \times RC}. \quad (25)$$

Computing the column matrix $W_c$ takes $O(P^2 C)$ time, while computing the row matrix $W_r$ requires $O(N^2 R)$ time. Note that the row matrix does not change while optimizing the parameters of the column covariance matrices and can be computed prior to the analysis. The matrix $\Lambda$ can also be computed efficiently by rewriting it as

$$\Lambda = I + X^\top A^{-1} X \quad (26)$$
$$= I + W^\top D W \quad (27)$$
$$= I + (W_c \otimes W_r)^\top D W \quad (28)$$
$$= I + \left( W_c^\top \otimes W_r^\top \right) \left[ D W_{:,1} \;\ldots\; D W_{:,RC} \right] \quad (29)$$
$$= I + \left[ \mathrm{vec}\left( W_r^\top \,\mathrm{vec}^{-1}(D W_{:,1})\, W_c \right) \;\ldots\; \mathrm{vec}\left( W_r^\top \,\mathrm{vec}^{-1}(D W_{:,RC})\, W_c \right) \right]. \quad (30)$$

Indeed, computing $W$ explicitly and multiplying it with $D$ takes $O(CNPR)$ time and space; multiplying the result with $W^\top$ from the left takes $O(CR(NPC + RNC))$, while the inversion takes $O(C^3R^3)$ time and $O(C^2R^2)$ memory. In practice, we use the Cholesky factorization to compute the inverse of $\Lambda$, which has the advantage that we can re-use the decomposition for computing the log determinant later on.
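The identities (16) and (21)-(30) are easy to verify numerically; the following toy-dimension sketch (ours, with names mirroring the text) checks both the factorized inverse of $A$ and the Woodbury form of $K^{-1}$:

```python
# Numerical sanity check, in toy dimensions, of the identities above:
# A^{-1} = L^T D L (Eqs. 12-16) and the Woodbury form of K^{-1} (Eq. 21).
# Illustrative sketch, not the mtSet code.
import numpy as np

rng = np.random.default_rng(0)
N, P, R, C = 6, 3, 4, 1
E = rng.normal(size=(P, C))
G = rng.normal(size=(N, R))
T = rng.normal(size=(P, P)); Cg = T @ T.T
T = rng.normal(size=(P, P)); Cn = T @ T.T + np.eye(P)
T = rng.normal(size=(N, N)); Rg = T @ T.T

# Eqs. (9)-(15): whiten by the noise covariance, then eigendecompose.
Scn, Ucn = np.linalg.eigh(Cn)
Cgt = np.diag(Scn**-0.5) @ Ucn.T @ Cg @ Ucn @ np.diag(Scn**-0.5)
Scg, Ucg = np.linalg.eigh(Cgt)
Srg, Urg = np.linalg.eigh(Rg)
Lc = Ucg.T @ np.diag(Scn**-0.5) @ Ucn.T    # Eq. (12)
Lr = Urg.T                                 # Eq. (13)
L = np.kron(Lc, Lr)                        # Eq. (14)
Dvec = 1.0 / (np.kron(Scg, Srg) + 1.0)     # diagonal of D, Eq. (15)

A = np.kron(Cg, Rg) + np.kron(Cn, np.eye(N))
assert np.allclose(np.linalg.inv(A), L.T @ (Dvec[:, None] * L))   # Eq. (16)

# Eqs. (21)-(27): Woodbury update for the low-rank set component.
X = np.kron(E, G)
K = A + X @ X.T
Wc, Wr = Lc @ E, Lr @ G                    # Eqs. (22)-(23)
W = np.kron(Wc, Wr)                        # Eq. (24)
Lam = np.eye(R * C) + W.T @ (Dvec[:, None] * W)       # Eqs. (25)-(27)
Kinv = L.T @ (np.diag(Dvec) - (Dvec[:, None] * W)
              @ np.linalg.solve(Lam, W.T * Dvec[None, :])) @ L    # Eq. (21)
assert np.allclose(Kinv, np.linalg.inv(K))
```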

Evaluating the model log likelihood. The log likelihood of our model (3) is given by

$$\mathcal{L} = -\frac{NP}{2} \log 2\pi - \frac{1}{2} \log\det K - \frac{1}{2} \mathrm{vec}(Y)^\top K^{-1} \mathrm{vec}(Y). \quad (31)$$

The log-determinant can be computed by using the matrix determinant lemma:

$$\log\det K = \log\det A + \log\det \Lambda. \quad (32)$$

Provided that we have already computed $L_c$, $L_r$, $D$ and the Cholesky decomposition of $\Lambda$, evaluating the log determinants of $A$ and $\Lambda$ takes respectively $O(NP)$ and $O(CR)$. The squared form can be evaluated as follows:

$$\mathrm{vec}(Y)^\top K^{-1} \mathrm{vec}(Y) = \mathrm{vec}(Y)^\top \left[ L^\top \left( D - D W \Lambda^{-1} W^\top D \right) L \right] \mathrm{vec}(Y)$$
$$= \mathrm{vec}(Y)^\top L^\top D L\, \mathrm{vec}(Y) - \mathrm{vec}(Y)^\top L^\top D W \Lambda^{-1} W^\top D L\, \mathrm{vec}(Y)$$
$$= \mathrm{vec}(\tilde{Y})^\top D\, \mathrm{vec}(\tilde{Y}) - \mathrm{vec}(\tilde{Y})^\top D W \Lambda^{-1} W^\top D\, \mathrm{vec}(\tilde{Y})$$
$$= \mathrm{vec}(\tilde{Y})^\top D\, \mathrm{vec}(\tilde{Y}) - \mathrm{vec}(\hat{Y})^\top W \Lambda^{-1} W^\top \mathrm{vec}(\hat{Y})$$
$$= \mathrm{vec}(\tilde{Y})^\top D\, \mathrm{vec}(\tilde{Y}) - \mathrm{vec}(\bar{Y})^\top \Lambda^{-1} \mathrm{vec}(\bar{Y}),$$

where we have defined

$$\mathrm{vec}(\tilde{Y}) = L\, \mathrm{vec}(Y) = (L_c \otimes L_r)\, \mathrm{vec}(Y) = \mathrm{vec}\left( L_r Y L_c^\top \right) \quad (33)$$
$$\mathrm{vec}(\hat{Y}) = D\, \mathrm{vec}(\tilde{Y}) = \mathrm{diag}(D) \odot \mathrm{vec}(\tilde{Y}) \quad (34)$$
$$\mathrm{vec}(\bar{Y}) = W^\top \mathrm{vec}(\hat{Y}) = (W_c \otimes W_r)^\top \mathrm{vec}(\hat{Y}) = \mathrm{vec}\left( W_r^\top \hat{Y} W_c \right). \quad (35)$$

Rotating and scaling the data as in Eqs. (33)-(35) takes $O(N^2P + NP^2 + NP + NPC + RNC)$ time, where again the $O(N^2P)$ operation is done only once prior to the analysis. Computing the squared form $\mathrm{vec}(\bar{Y})^\top \Lambda^{-1} \mathrm{vec}(\bar{Y})$ takes $O(C^2R^2)$ time after having inverted $\Lambda$.

Evaluating the gradient. The derivative of the log likelihood with respect to a covariance parameter $\theta_i \in \boldsymbol{\theta}$ is given by

$$\mathcal{L}_{\theta_i} = -\frac{1}{2} \mathrm{tr}\left( K^{-1} K_{\theta_i} \right) + \frac{1}{2} \mathrm{vec}(Y)^\top K^{-1} K_{\theta_i} K^{-1} \mathrm{vec}(Y), \quad (36)$$

where the first term arises from the log determinant and the second term from the squared form. We have used the notation $M_{\theta_i}$ to indicate the derivative of $M$ with respect to $\theta_i$.
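Building on the toy quantities from the previous sketch (rng, A, K, Lc, Lr, Dvec, Wc, Wr, Lam), the log-likelihood (31) can be evaluated through the rotated data and checked against a dense computation (illustrative only):

```python
# Log-likelihood of Eq. (31) via the rotations of Eqs. (32)-(35); reuses the
# toy variables defined in the previous sketch. Illustrative only.
def loglik_fast(Y):
    N, P = Y.shape
    Yt = Lr @ Y @ Lc.T                    # Eq. (33): vec(Yt) = L vec(Y)
    yt = Yt.T.reshape(-1)                 # vec() stacks columns
    yh = Dvec * yt                        # Eq. (34): D is diagonal
    Yh = yh.reshape(P, N).T
    yb = (Wr.T @ Yh @ Wc).T.reshape(-1)   # Eq. (35)
    # Eq. (32); in the O(NP) scheme one would instead use
    # log det A = N*sum(log(Scn)) - sum(log(Dvec)).
    logdet = np.linalg.slogdet(A)[1] + np.linalg.slogdet(Lam)[1]
    quad = yt @ yh - yb @ np.linalg.solve(Lam, yb)
    return -0.5 * (N * P * np.log(2 * np.pi) + logdet + quad)

Y = rng.normal(size=(6, 3))               # N = 6, P = 3 as above
y = Y.T.reshape(-1)
ref = -0.5 * (Y.size * np.log(2 * np.pi) + np.linalg.slogdet(K)[1]
              + y @ np.linalg.solve(K, y))
assert np.isclose(loglik_fast(Y), ref)
```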

The first term can be rewritten as

$$\mathrm{tr}\left( K^{-1} K_{\theta_i} \right) = \mathrm{tr}\left( \left[ L^\top \left( D - D W \Lambda^{-1} W^\top D \right) L \right] K_{\theta_i} \right)$$
$$= \mathrm{tr}\left( \left( D - D W \Lambda^{-1} W^\top D \right) L K_{\theta_i} L^\top \right)$$
$$= \mathrm{tr}\left( D \tilde{K}_{\theta_i} \right) - \mathrm{tr}\left( D W \Lambda^{-1} W^\top D \tilde{K}_{\theta_i} \right)$$
$$= \mathrm{diag}(D)^\top \mathrm{diag}\left( \tilde{K}_{\theta_i} \right) - \sum_{jk} \left( \Lambda^{-1} \right)_{jk} \left( \bar{K}_{\theta_i} \right)_{jk},$$

where $\tilde{E} = L_c E$, $\tilde{E}_{\theta_i} = L_c E_{\theta_i}$,

$$\tilde{K}_{\theta_i} = L K_{\theta_i} L^\top \quad (37)$$
$$= L_c C_{\theta_i} L_c^\top \otimes L_r R_i L_r^\top \quad \text{(with } R_i \in \{R_r, R_g, I_N\}\text{)} \quad (38)$$
$$= \begin{cases} L_c \left( E E^\top \right)_{\theta_i} L_c^\top \otimes W_r W_r^\top & \text{if } \theta_i \text{ is an entry of } E \\ L_c \left( C_g \right)_{\theta_i} L_c^\top \otimes S_{R_g} & \text{if } \theta_i \text{ is a parameter of } C_g \\ L_c \left( C_n \right)_{\theta_i} L_c^\top \otimes I & \text{if } \theta_i \text{ is a parameter of } C_n \end{cases} \quad (39)$$
$$= \begin{cases} \left( \tilde{E}_{\theta_i} \tilde{E}^\top + \tilde{E} \tilde{E}_{\theta_i}^\top \right) \otimes W_r W_r^\top & \text{if } \theta_i \text{ is an entry of } E \\ \widetilde{C}_{g,\theta_i} \otimes S_{R_g} & \text{if } \theta_i \text{ is a parameter of } C_g \\ \widetilde{C}_{n,\theta_i} \otimes I & \text{if } \theta_i \text{ is a parameter of } C_n \end{cases} \quad (40\text{-}41)$$

with $\widetilde{C}_{g,\theta_i} = L_c (C_g)_{\theta_i} L_c^\top$, $\widetilde{C}_{n,\theta_i} = L_c (C_n)_{\theta_i} L_c^\top$, and

$$\bar{K}_{\theta_i} = W^\top D \tilde{K}_{\theta_i} D W. \quad (42)$$

First, we compute the column (trait) covariance factor of $\tilde{K}_{\theta_i}$, which can be done in $O(P^2C)$, $O(P^3)$ and $O(P^3)$, respectively, for region, relatedness and noise parameters. Calculating $\bar{K}_{\theta_i}$ requires more care: we first compute the product of $D$ and $W$, which requires us to explicitly calculate $W$, taking $O(NPRC)$ time and $O(NPRC)$ space. The resulting matrix consists of $NP$ rows and $RC$ columns. In the next step, we multiply each column $D W_{:,i}$ by $\tilde{K}_{\theta_i}$, exploiting the same tricks as in (30):

$$\tilde{K}_{\theta_i} D W_{:,i} = \begin{cases} \left( \left( \tilde{E}_{\theta_i} \tilde{E}^\top + \tilde{E} \tilde{E}_{\theta_i}^\top \right) \otimes W_r W_r^\top \right) D W_{:,i} & \text{if } \theta_i \text{ is an entry of } E \\ \left( \widetilde{C}_{g,\theta_i} \otimes S_{R_g} \right) D W_{:,i} & \text{if } \theta_i \text{ is a parameter of } C_g \\ \left( \widetilde{C}_{n,\theta_i} \otimes I \right) D W_{:,i} & \text{if } \theta_i \text{ is a parameter of } C_n \end{cases} \quad (43)$$
$$= \begin{cases} \mathrm{vec}\left( W_r W_r^\top\, \mathrm{vec}^{-1}(D W_{:,i})\, \left( \tilde{E}_{\theta_i} \tilde{E}^\top + \tilde{E} \tilde{E}_{\theta_i}^\top \right) \right) & \text{if } \theta_i \text{ is an entry of } E \\ \mathrm{vec}\left( S_{R_g}\, \mathrm{vec}^{-1}(D W_{:,i})\, \widetilde{C}_{g,\theta_i} \right) & \text{if } \theta_i \text{ is a parameter of } C_g \\ \mathrm{vec}\left( \mathrm{vec}^{-1}(D W_{:,i})\, \widetilde{C}_{n,\theta_i} \right) & \text{if } \theta_i \text{ is a parameter of } C_n. \end{cases} \quad (44)$$

This leads to an overall runtime complexity of $O(RC(NPC + NCR))$, $O(RC(NP + NP^2))$ and $O(RCNP^2)$ for region, relatedness and noise parameters, respectively.

We use the same trick to compute the multiplication between $W^\top$ and $D \tilde{K}_{\theta_i} D W$ efficiently, leading to a complexity of $O(RC(NPC + NCR))$. Finally, computing the trace term has an additional runtime of $O(NP + C^2R^2)$.

The derivative of the squared form can be rewritten as

$$\mathrm{vec}(Y)^\top K^{-1} K_{\theta_i} K^{-1} \mathrm{vec}(Y) = \mathrm{vec}(\tilde{Y})^\top \left( D - D W \Lambda^{-1} W^\top D \right) \tilde{K}_{\theta_i} \left( D - D W \Lambda^{-1} W^\top D \right) \mathrm{vec}(\tilde{Y})$$
$$= \left( \mathrm{vec}(\hat{Y}) - D W \Lambda^{-1} \mathrm{vec}(\bar{Y}) \right)^\top \tilde{K}_{\theta_i} \left( \mathrm{vec}(\hat{Y}) - D W \Lambda^{-1} \mathrm{vec}(\bar{Y}) \right).$$

We start by multiplying $\Lambda^{-1}$ with $\mathrm{vec}(\bar{Y})$, which can be done in $O(C^2R^2)$ after having precomputed the inverse. Exploiting that $W$ has Kronecker structure and that $D$ is a diagonal matrix reduces the runtime for multiplying the resulting vector with $DW$ from the left from $O(NP + NPRC)$ to $O(NPC + RNC + NP)$. In the next step, we subtract the resulting vector from $\mathrm{vec}(\hat{Y})$ and multiply it with $\tilde{K}_{\theta_i}$ from the left, with an additional runtime of $O(NPR + NP^2)$, $O(NP + NP^2)$ and $O(NP^2)$ for the region, the relatedness and the noise term, respectively. Finally, we have to multiply two vectors of size $NP$, which can be done in $O(NP)$ time. A tabular overview of the individual computations, and how often they need to be carried out, is given in Table 2.

Table 2: Tabular summary of the complexity of individual computational steps in the mtSet inference ("once" = computed only once; "per region" = computed only once per region).

Inverse
    $A^{-1}$                                          $O(N^3 + P^3)$ (once)
Cholesky of $\Lambda = I + W^\top D W$
    $DW$                                              $O(NPCR)$
    $W^\top (DW)$                                     $O(NPRC^2 + NR^2C^2)$
    $\mathrm{chol}(I + W^\top D W)$                   $O(C^3R^3)$
Log likelihood
    $\log\det A$                                      $O(NP)$
    $\log\det \Lambda$                                $O(CR)$
    $\tilde{y} = L\,\mathrm{vec}(Y)$                  $O(N^2P + NP^2)$ (once)
    $\bar{y} = W^\top D\,\mathrm{vec}(Y)$             $O(NP + NPC + NRC)$
    $\tilde{y}^\top D\tilde{y} - \bar{y}^\top \Lambda^{-1}\bar{y}$   $O(NP + C^2R^2 + CR)$
Gradient
    $\tilde{K}_{\theta_i} = L K_{\theta_i} L^\top$    $O(N^2R + P^2C)$ region (per region); $O(N^3 + P^3)$ relatedness; $O(N^3 + P^3)$ noise
    $\tilde{K}_{\theta_i}(DW)$                        $O(NRPC^2 + NR^2C^2)$ region; $O(NRPC + NRP^2C)$ relatedness; $O(NRP^2C)$ noise
    $W^\top (D\tilde{K}_{\theta_i}DW)$                $O(NPRC^2 + NR^2C^2)$
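When implementing derivatives such as Eqs. (36)-(44), it is useful to validate the analytic gradient against finite differences; a generic helper of the following form (hypothetical, not part of mtSet) suffices:

```python
# Generic finite-difference check for analytic gradients such as Eq. (36);
# `loglik` and `grad` are any matching callables over a parameter vector
# theta (e.g., entries of E and of the Cg/Cn factors). Hypothetical helper.
import numpy as np

def check_gradient(loglik, grad, theta, eps=1e-5, tol=1e-4):
    g = np.asarray(grad(theta))
    for i in range(len(theta)):
        e = np.zeros_like(theta)
        e[i] = eps
        g_num = (loglik(theta + e) - loglik(theta - e)) / (2.0 * eps)
        assert abs(g[i] - g_num) <= tol * max(1.0, abs(g_num)), (i, g[i], g_num)
```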

1.5. Efficient inference for approximations to the full mtSet model

As for any exact LMM (see Section 1.3), mtSet requires an up-front eigenvalue decomposition of the genetic relatedness matrix, which is a cubic operation in the number of samples, limiting the scalability of LMMs to very large cohorts ($N \gtrsim 20{,}000$). In the following, we discuss two alternative approximations that are available in the mtSet software implementation, allowing mtSet to scale to cohorts with up to 500,000 individuals (see also main paper text, Figure 1 and Supplementary Figure 3): mtSet-PC and mtSet-LowRankBg. In mtSet-PC, the random effect accounting for relatedness is dropped, while population structure is accounted for by fixed effect covariates. Alternatively, mtSet-LowRankBg considers a low-rank approximation to the background covariance. Low-rank approximations to the relatedness matrix have previously been applied to single-trait LMMs, e.g. [24, 14, 7].

Modelling population structure with principal components (mtSet-PC). In mtSet-PC, population structure is modelled via fixed effects using the first $N_{PC}$ principal components, instead of using a random effect term as in the full mtSet. This approximation results in an LMM with only a single variance component (in addition to the noise component). Denoting by $F \in \mathbb{R}^{N \times N_{PC}}$ the sample design matrix of the fixed effects, the fixed effects on the vectorized phenotypes $\mathrm{vec}(Y)$ can be written as $Vb$ with $V = I \otimes F \in \mathbb{R}^{NP \times N_{PC}P}$ and weights $b = \mathrm{vec}(B) \in \mathbb{R}^{N_{PC}P}$. This model assumes a $P$-degrees-of-freedom fit for each PC covariate. The restricted log-likelihood [25, 26] is then given by

$$\mathcal{L}_{\boldsymbol{\theta}} = \mathrm{const.} - \frac{1}{2} \mathrm{vec}(Z)^\top K^{-1} \mathrm{vec}(Z) - \frac{1}{2} \log\det K - \frac{1}{2} \log\det A_{\mathrm{reml}}, \quad (45)$$

where we have defined $\mathrm{vec}(Z) = \mathrm{vec}(Y) - Vb$ and $A_{\mathrm{reml}} = V^\top K^{-1} V$, and

$$b = A_{\mathrm{reml}}^{-1} V^\top K^{-1} \mathrm{vec}(Y). \quad (46)$$

The covariance matrix can be rewritten as

$$K = E E^\top \otimes G G^\top + C_n \otimes I_N \quad (47)$$
$$= \left( U_n S_n^{1/2} \otimes I_N \right) \left( \tilde{E} \tilde{E}^\top \otimes G G^\top + I_{NP} \right) \left( U_n S_n^{1/2} \otimes I_N \right)^\top \quad (48)$$
$$= \left( U_n S_n^{1/2} \otimes I_N \right) \left( (U_E \otimes U_G)(S_E \otimes S_G)(U_E \otimes U_G)^\top + I_{NP} \right) \left( U_n S_n^{1/2} \otimes I_N \right)^\top, \quad (49\text{-}50)$$

where $\tilde{E} = S_n^{-1/2} U_n^\top E$ and where we used the notation $M = U_M S_M^{1/2} V_M$ for the singular value decomposition of $M$.⁴ The inverse of $K$ can be written as

$$K^{-1} = L^\top \left( I_{NP} - W^\top D W \right) L, \qquad W = (U_E \otimes U_G)^\top, \quad D = \left( S_E^{-1} \otimes S_G^{-1} + I_{RC} \right)^{-1}, \quad L = S_n^{-1/2} U_n^\top \otimes I_N. \quad (51)$$

Calculating the SVDs of $E$ and $G$ takes respectively $O(PC^2)$ and $O(NR^2)$ operations; the SVD of $G$ has to be performed only once during the optimization.

⁴$M \in \mathbb{R}^{n_1 \times n_2}$, $U \in \mathbb{R}^{n_1 \times n_1}$, $S \in \mathbb{R}^{n_1 \times n_2}$, $V \in \mathbb{R}^{n_2 \times n_2}$.
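As a sanity check of Eq. (51), the following toy-dimension sketch (ours; not the mtSet-PC implementation) verifies the SVD-based inverse numerically:

```python
# Numerical check, in toy dimensions, of the mtSet-PC inverse in Eq. (51):
# one SVD of G (done once per analysis) and one SVD of the noise-whitened E
# per evaluation suffice to invert K. Illustrative sketch only.
import numpy as np

rng = np.random.default_rng(1)
N, P, R, C = 7, 3, 4, 1
E = rng.normal(size=(P, C))
G = rng.normal(size=(N, R))
T = rng.normal(size=(P, P)); Cn = T @ T.T + np.eye(P)

Sn, Un = np.linalg.eigh(Cn)
Et = np.diag(Sn**-0.5) @ Un.T @ E                    # whitened E
Ue, se, _ = np.linalg.svd(Et, full_matrices=False)
Ug, sg, _ = np.linalg.svd(G, full_matrices=False)    # once per analysis

L = np.kron(np.diag(Sn**-0.5) @ Un.T, np.eye(N))
W = np.kron(Ue, Ug).T                                # RC x NP
Dvec = 1.0 / (np.kron(se**-2.0, sg**-2.0) + 1.0)     # D in Eq. (51)

K = np.kron(E @ E.T, G @ G.T) + np.kron(Cn, np.eye(N))
Kinv = L.T @ (np.eye(N * P) - W.T @ (Dvec[:, None] * W)) @ L
assert np.allclose(Kinv, np.linalg.inv(K))
```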

Evaluating the log-likelihood. The log-likelihood of the model is

$$\mathcal{L}_{\boldsymbol{\theta}} = \mathrm{const.} - \frac{1}{2} \underbrace{\mathrm{vec}(Z)^\top K^{-1} \mathrm{vec}(Z)}_{\text{squared form term}} - \frac{1}{2} \underbrace{\log\det K}_{\text{logdet term}} - \frac{1}{2} \underbrace{\log\det A_{\mathrm{reml}}}_{\text{reml term}}. \quad (52)$$

The log-determinant term can be computed by applying the matrix determinant lemma:

$$\log\det K = \log\det\left( (U_E \otimes U_G)(S_E \otimes S_G)(U_E \otimes U_G)^\top + I_{NP} \right) + N \log\det S_n \quad (53)$$
$$= \log\det\left( S_E^{-1} \otimes S_G^{-1} + I \right) + R \log\det S_E + C \log\det S_G + N \log\det S_n. \quad (54)$$

$A_{\mathrm{reml}}$ and $b$ can be computed respectively as

$$A_{\mathrm{reml}} = V^\top K^{-1} V \quad (55)$$
$$= (LV)^\top (LV) - (WLV)^\top D (WLV) \quad (56)$$
$$= (L_c \otimes F)^\top (L_c \otimes F) - (W_c L_c \otimes W_r F)^\top D (W_c L_c \otimes W_r F), \quad (57)$$

where, in analogy to Section 1.4, $L_c = S_n^{-1/2} U_n^\top$, $W_c = U_E^\top$ and $W_r = U_G^\top$, and

$$b = A_{\mathrm{reml}}^{-1} V^\top K^{-1} \mathrm{vec}(Y) \quad (58)$$
$$= A_{\mathrm{reml}}^{-1} \left( (LV)^\top L\, \mathrm{vec}(Y) - (WLV)^\top D W L\, \mathrm{vec}(Y) \right) \quad (59)$$
$$= A_{\mathrm{reml}}^{-1} \left( \mathrm{vec}\left( F^\top Y L_c^\top L_c \right) - (WLV)^\top D\, \mathrm{vec}\left( W_r Y L_c^\top W_c^\top \right) \right). \quad (60)$$

Finally, the quadratic term can be rewritten as

$$\mathrm{vec}(Z)^\top K^{-1} \mathrm{vec}(Z) = (L\, \mathrm{vec}(Z))^\top (L\, \mathrm{vec}(Z)) - (W L\, \mathrm{vec}(Z))^\top D (W L\, \mathrm{vec}(Z)), \quad (61)$$

where

$$L\, \mathrm{vec}(Z) = \mathrm{vec}\left( Y L_c^\top - F B L_c^\top \right) \quad (62)$$
$$W L\, \mathrm{vec}(Z) = \mathrm{vec}\left( W_r Y L_c^\top W_c^\top - W_r F B L_c^\top W_c^\top \right). \quad (63)$$

The log-likelihood can be evaluated in $O(N N_{PC}^2 + N N_{PC} R + N N_{PC} P + N P R + N P + N P^2)$, where we only report quantities depending on $N$, which are the bottleneck for huge sample sizes; quantities involving the SVD of $G$ have to be computed only once during the optimization.

Calculating the gradient. The gradient of the likelihood can be written as

$$\mathcal{L}_{\theta_i} = \underbrace{\frac{1}{2} \mathrm{vec}(Z)^\top K^{-1} K_{\theta_i} K^{-1} \mathrm{vec}(Z)}_{\text{squared form 1}} + \underbrace{\mathrm{vec}(Z)^\top K^{-1} V b_{\theta_i}}_{\text{squared form 2}} - \underbrace{\frac{1}{2} \mathrm{tr}\left( K^{-1} K_{\theta_i} \right)}_{\text{trace}} - \underbrace{\frac{1}{2} \mathrm{tr}\left( A_{\mathrm{reml}}^{-1} A_{\mathrm{reml},\theta_i} \right)}_{\text{reml}}. \quad (64)$$

Let us start by rewriting $K^{-1} K_{\theta_i} K^{-1}$:

$$K^{-1} K_{\theta_i} K^{-1} = L^\top \left( I - W^\top D W \right) L \left( C_{\theta_i} \otimes R \right) L^\top \left( I - W^\top D W \right) L \quad (65)$$
$$= L^\top \left( I - W^\top D W \right) \big( \underbrace{L_c C_{\theta_i} L_c^\top}_{\tilde{C}} \otimes R \big) \left( I - W^\top D W \right) L \quad (66)$$
$$= L^\top (\tilde{C} \otimes R) L + L^\top W^\top D \big( \underbrace{W_c \tilde{C} W_c^\top}_{\bar{C}} \otimes \underbrace{W_r R W_r^\top}_{S_r} \big) D W L \quad (67)$$
$$\qquad - L^\top (\tilde{C} \otimes R) W^\top D W L - \left( L^\top (\tilde{C} \otimes R) W^\top D W L \right)^\top \quad (68)$$
$$= L^\top (\tilde{C} \otimes R) L + L^\top W^\top D (\bar{C} \otimes S_r) D W L \quad (69)$$
$$\qquad - L^\top (\tilde{C} \otimes R) W^\top D W L - \left( L^\top (\tilde{C} \otimes R) W^\top D W L \right)^\top, \quad (70)$$

where we used that $K_{\theta_i} = C_{\theta_i} \otimes R$, with $C$ and $R$ standing for $C_r$ and $R_r$ if $\theta_i$ is a region-term parameter, or for $C_n$ and $I_N$ if $\theta_i$ is a noise-term parameter. The gradients of $A_{\mathrm{reml}}$ and $b$ can be calculated as

$$A_{\mathrm{reml},\theta_i} = -V^\top K^{-1} K_{\theta_i} K^{-1} V \quad (71)$$
$$= -\Big( (LV)^\top (\tilde{C} \otimes R)(LV) + (DWLV)^\top (\bar{C} \otimes S_r)(DWLV) \quad (72)$$
$$\qquad - \left( W (\tilde{C} \otimes R) LV \right)^\top (DWLV) - \left( \left( W (\tilde{C} \otimes R) LV \right)^\top (DWLV) \right)^\top \Big)$$
$$= -\Big( \left( L_c^\top \tilde{C} L_c \right) \otimes \left( F^\top R F \right) + (DWLV)^\top (\bar{C} \otimes S_r)(DWLV) \quad (73)$$
$$\qquad - \left( W (\tilde{C} \otimes R) LV \right)^\top (DWLV) - \left( \left( W (\tilde{C} \otimes R) LV \right)^\top (DWLV) \right)^\top \Big)$$

and

$$b_{\theta_i} = -A_{\mathrm{reml}}^{-1} A_{\mathrm{reml},\theta_i} b - A_{\mathrm{reml}}^{-1} V^\top K^{-1} K_{\theta_i} K^{-1} \mathrm{vec}(Y), \quad (74)$$

where

$$V^\top K^{-1} K_{\theta_i} K^{-1} \mathrm{vec}(Y) = (LV)^\top (\tilde{C} \otimes R) L\, \mathrm{vec}(Y) + (DWLV)^\top (\bar{C} \otimes S_r) D W L\, \mathrm{vec}(Y)$$
$$\qquad - \left( W (\tilde{C} \otimes R) LV \right)^\top D W L\, \mathrm{vec}(Y) - (DWLV)^\top W (\tilde{C} \otimes R) L\, \mathrm{vec}(Y). \quad (75)$$

Several of the matrix products in (71), (74) and (75) have already been computed when evaluating the log-likelihood. The additional terms can be computed efficiently by using convenient factorisations and Kronecker product algebra:

$$W (\tilde{C} \otimes R) L V = W_c \tilde{C} L_c \otimes W_r R F \quad (76)$$
$$(LV)^\top (\tilde{C} \otimes R) L\, \mathrm{vec}(Y) = \mathrm{vec}\left( F^\top R Y L_c^\top \tilde{C}^\top L_c \right) \quad (77)$$
$$W (\tilde{C} \otimes R) L\, \mathrm{vec}(Y) = \mathrm{vec}\left( W_r R Y L_c^\top \tilde{C}^\top W_c^\top \right). \quad (78)$$

Notice that the computation of $RY$ or $RF$ can also be done in linear time in $N$: in the non-trivial case where $R = GG^\top$, we can rewrite $RY = G(G^\top Y)$, which takes $O(NRP)$.
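For concreteness, the associativity trick in the last sentence reads as follows in code (a toy illustration with hypothetical sizes):

```python
# Compute (G G^T) Y in O(NRP) time without ever forming the N x N
# relatedness matrix: multiply right to left. Toy illustration.
import numpy as np

rng = np.random.default_rng(2)
N, R, P = 20000, 50, 4
G = rng.normal(size=(N, R))
Y = rng.normal(size=(N, P))
RY = G @ (G.T @ Y)        # O(NRP) time, O(NR + NP) memory
# (G @ G.T) @ Y would cost O(N^2 R + N^2 P) time and O(N^2) memory.
```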


More information

Sparse Covariance Selection using Semidefinite Programming

Sparse Covariance Selection using Semidefinite Programming Sparse Covariance Selection using Semidefinite Programming A. d Aspremont ORFE, Princeton University Joint work with O. Banerjee, L. El Ghaoui & G. Natsoulis, U.C. Berkeley & Iconix Pharmaceuticals Support

More information

An efficient ADMM algorithm for high dimensional precision matrix estimation via penalized quadratic loss

An efficient ADMM algorithm for high dimensional precision matrix estimation via penalized quadratic loss An efficient ADMM algorithm for high dimensional precision matrix estimation via penalized quadratic loss arxiv:1811.04545v1 [stat.co] 12 Nov 2018 Cheng Wang School of Mathematical Sciences, Shanghai Jiao

More information

Factor Analysis (10/2/13)

Factor Analysis (10/2/13) STA561: Probabilistic machine learning Factor Analysis (10/2/13) Lecturer: Barbara Engelhardt Scribes: Li Zhu, Fan Li, Ni Guan Factor Analysis Factor analysis is related to the mixture models we have studied.

More information

Combining SEM & GREML in OpenMx. Rob Kirkpatrick 3/11/16

Combining SEM & GREML in OpenMx. Rob Kirkpatrick 3/11/16 Combining SEM & GREML in OpenMx Rob Kirkpatrick 3/11/16 1 Overview I. Introduction. II. mxgreml Design. III. mxgreml Implementation. IV. Applications. V. Miscellany. 2 G V A A 1 1 F E 1 VA 1 2 3 Y₁ Y₂

More information

Inverse of a Square Matrix. For an N N square matrix A, the inverse of A, 1

Inverse of a Square Matrix. For an N N square matrix A, the inverse of A, 1 Inverse of a Square Matrix For an N N square matrix A, the inverse of A, 1 A, exists if and only if A is of full rank, i.e., if and only if no column of A is a linear combination 1 of the others. A is

More information

Title. Description. var intro Introduction to vector autoregressive models

Title. Description. var intro Introduction to vector autoregressive models Title var intro Introduction to vector autoregressive models Description Stata has a suite of commands for fitting, forecasting, interpreting, and performing inference on vector autoregressive (VAR) models

More information

Multivariate Distributions

Multivariate Distributions IEOR E4602: Quantitative Risk Management Spring 2016 c 2016 by Martin Haugh Multivariate Distributions We will study multivariate distributions in these notes, focusing 1 in particular on multivariate

More information

Introduction to Machine Learning. PCA and Spectral Clustering. Introduction to Machine Learning, Slides: Eran Halperin

Introduction to Machine Learning. PCA and Spectral Clustering. Introduction to Machine Learning, Slides: Eran Halperin 1 Introduction to Machine Learning PCA and Spectral Clustering Introduction to Machine Learning, 2013-14 Slides: Eran Halperin Singular Value Decomposition (SVD) The singular value decomposition (SVD)

More information

Resampling techniques for statistical modeling

Resampling techniques for statistical modeling Resampling techniques for statistical modeling Gianluca Bontempi Département d Informatique Boulevard de Triomphe - CP 212 http://www.ulb.ac.be/di Resampling techniques p.1/33 Beyond the empirical error

More information

Approximating the Covariance Matrix with Low-rank Perturbations

Approximating the Covariance Matrix with Low-rank Perturbations Approximating the Covariance Matrix with Low-rank Perturbations Malik Magdon-Ismail and Jonathan T. Purnell Department of Computer Science Rensselaer Polytechnic Institute Troy, NY 12180 {magdon,purnej}@cs.rpi.edu

More information

SUPPLEMENTARY INFORMATION

SUPPLEMENTARY INFORMATION doi:10.1038/nature25973 Power Simulations We performed extensive power simulations to demonstrate that the analyses carried out in our study are well powered. Our simulations indicate very high power for

More information

Managing Uncertainty

Managing Uncertainty Managing Uncertainty Bayesian Linear Regression and Kalman Filter December 4, 2017 Objectives The goal of this lab is multiple: 1. First it is a reminder of some central elementary notions of Bayesian

More information

COM336: Neural Computing

COM336: Neural Computing COM336: Neural Computing http://www.dcs.shef.ac.uk/ sjr/com336/ Lecture 2: Density Estimation Steve Renals Department of Computer Science University of Sheffield Sheffield S1 4DP UK email: s.renals@dcs.shef.ac.uk

More information

Fast and Robust Phase Retrieval

Fast and Robust Phase Retrieval Fast and Robust Phase Retrieval Aditya Viswanathan aditya@math.msu.edu CCAM Lunch Seminar Purdue University April 18 2014 0 / 27 Joint work with Yang Wang Mark Iwen Research supported in part by National

More information

Two-View Segmentation of Dynamic Scenes from the Multibody Fundamental Matrix

Two-View Segmentation of Dynamic Scenes from the Multibody Fundamental Matrix Two-View Segmentation of Dynamic Scenes from the Multibody Fundamental Matrix René Vidal Stefano Soatto Shankar Sastry Department of EECS, UC Berkeley Department of Computer Sciences, UCLA 30 Cory Hall,

More information

EIGENVALUES AND EIGENVECTORS 3

EIGENVALUES AND EIGENVECTORS 3 EIGENVALUES AND EIGENVECTORS 3 1. Motivation 1.1. Diagonal matrices. Perhaps the simplest type of linear transformations are those whose matrix is diagonal (in some basis). Consider for example the matrices

More information

STA141C: Big Data & High Performance Statistical Computing

STA141C: Big Data & High Performance Statistical Computing STA141C: Big Data & High Performance Statistical Computing Numerical Linear Algebra Background Cho-Jui Hsieh UC Davis May 15, 2018 Linear Algebra Background Vectors A vector has a direction and a magnitude

More information

Least Squares Optimization

Least Squares Optimization Least Squares Optimization The following is a brief review of least squares optimization and constrained optimization techniques. I assume the reader is familiar with basic linear algebra, including the

More information

Principal component analysis

Principal component analysis Principal component analysis Angela Montanari 1 Introduction Principal component analysis (PCA) is one of the most popular multivariate statistical methods. It was first introduced by Pearson (1901) and

More information

Applications of Randomized Methods for Decomposing and Simulating from Large Covariance Matrices

Applications of Randomized Methods for Decomposing and Simulating from Large Covariance Matrices Applications of Randomized Methods for Decomposing and Simulating from Large Covariance Matrices Vahid Dehdari and Clayton V. Deutsch Geostatistical modeling involves many variables and many locations.

More information

Unsupervised Learning with Permuted Data

Unsupervised Learning with Permuted Data Unsupervised Learning with Permuted Data Sergey Kirshner skirshne@ics.uci.edu Sridevi Parise sparise@ics.uci.edu Padhraic Smyth smyth@ics.uci.edu School of Information and Computer Science, University

More information

component risk analysis

component risk analysis 273: Urban Systems Modeling Lec. 3 component risk analysis instructor: Matteo Pozzi 273: Urban Systems Modeling Lec. 3 component reliability outline risk analysis for components uncertain demand and uncertain

More information

Linear Algebra Review. Vectors

Linear Algebra Review. Vectors Linear Algebra Review 9/4/7 Linear Algebra Review By Tim K. Marks UCSD Borrows heavily from: Jana Kosecka http://cs.gmu.edu/~kosecka/cs682.html Virginia de Sa (UCSD) Cogsci 8F Linear Algebra review Vectors

More information

Proportional Variance Explained by QLT and Statistical Power. Proportional Variance Explained by QTL and Statistical Power

Proportional Variance Explained by QLT and Statistical Power. Proportional Variance Explained by QTL and Statistical Power Proportional Variance Explained by QTL and Statistical Power Partitioning the Genetic Variance We previously focused on obtaining variance components of a quantitative trait to determine the proportion

More information

Matrices and Vectors. Definition of Matrix. An MxN matrix A is a two-dimensional array of numbers A =

Matrices and Vectors. Definition of Matrix. An MxN matrix A is a two-dimensional array of numbers A = 30 MATHEMATICS REVIEW G A.1.1 Matrices and Vectors Definition of Matrix. An MxN matrix A is a two-dimensional array of numbers A = a 11 a 12... a 1N a 21 a 22... a 2N...... a M1 a M2... a MN A matrix can

More information

Lecture 2: Genetic Association Testing with Quantitative Traits. Summer Institute in Statistical Genetics 2017

Lecture 2: Genetic Association Testing with Quantitative Traits. Summer Institute in Statistical Genetics 2017 Lecture 2: Genetic Association Testing with Quantitative Traits Instructors: Timothy Thornton and Michael Wu Summer Institute in Statistical Genetics 2017 1 / 29 Introduction to Quantitative Trait Mapping

More information

Haruhiko Ogasawara. This article gives the first half of an expository supplement to Ogasawara (2015).

Haruhiko Ogasawara. This article gives the first half of an expository supplement to Ogasawara (2015). Economic Review (Otaru University of Commerce, Vol.66, No. & 3, 9-58. December, 5. Expository supplement I to the paper Asymptotic expansions for the estimators of Lagrange multipliers and associated parameters

More information

MACHINE LEARNING ADVANCED MACHINE LEARNING

MACHINE LEARNING ADVANCED MACHINE LEARNING MACHINE LEARNING ADVANCED MACHINE LEARNING Recap of Important Notions on Estimation of Probability Density Functions 22 MACHINE LEARNING Discrete Probabilities Consider two variables and y taking discrete

More information

STA141C: Big Data & High Performance Statistical Computing

STA141C: Big Data & High Performance Statistical Computing STA141C: Big Data & High Performance Statistical Computing Lecture 5: Numerical Linear Algebra Cho-Jui Hsieh UC Davis April 20, 2017 Linear Algebra Background Vectors A vector has a direction and a magnitude

More information

Regression. Goal: Learn a mapping from observations (features) to continuous labels given a training set (supervised learning)

Regression. Goal: Learn a mapping from observations (features) to continuous labels given a training set (supervised learning) Linear Regression Regression Goal: Learn a mapping from observations (features) to continuous labels given a training set (supervised learning) Example: Height, Gender, Weight Shoe Size Audio features

More information

MA 575 Linear Models: Cedric E. Ginestet, Boston University Mixed Effects Estimation, Residuals Diagnostics Week 11, Lecture 1

MA 575 Linear Models: Cedric E. Ginestet, Boston University Mixed Effects Estimation, Residuals Diagnostics Week 11, Lecture 1 MA 575 Linear Models: Cedric E Ginestet, Boston University Mixed Effects Estimation, Residuals Diagnostics Week 11, Lecture 1 1 Within-group Correlation Let us recall the simple two-level hierarchical

More information

Regression. Goal: Learn a mapping from observations (features) to continuous labels given a training set (supervised learning)

Regression. Goal: Learn a mapping from observations (features) to continuous labels given a training set (supervised learning) Linear Regression Regression Goal: Learn a mapping from observations (features) to continuous labels given a training set (supervised learning) Example: Height, Gender, Weight Shoe Size Audio features

More information

Stat260: Bayesian Modeling and Inference Lecture Date: February 10th, Jeffreys priors. exp 1 ) p 2

Stat260: Bayesian Modeling and Inference Lecture Date: February 10th, Jeffreys priors. exp 1 ) p 2 Stat260: Bayesian Modeling and Inference Lecture Date: February 10th, 2010 Jeffreys priors Lecturer: Michael I. Jordan Scribe: Timothy Hunter 1 Priors for the multivariate Gaussian Consider a multivariate

More information

Introduction to the Tensor Train Decomposition and Its Applications in Machine Learning

Introduction to the Tensor Train Decomposition and Its Applications in Machine Learning Introduction to the Tensor Train Decomposition and Its Applications in Machine Learning Anton Rodomanov Higher School of Economics, Russia Bayesian methods research group (http://bayesgroup.ru) 14 March

More information

Sparse orthogonal factor analysis

Sparse orthogonal factor analysis Sparse orthogonal factor analysis Kohei Adachi and Nickolay T. Trendafilov Abstract A sparse orthogonal factor analysis procedure is proposed for estimating the optimal solution with sparse loadings. In

More information

Variational Principal Components

Variational Principal Components Variational Principal Components Christopher M. Bishop Microsoft Research 7 J. J. Thomson Avenue, Cambridge, CB3 0FB, U.K. cmbishop@microsoft.com http://research.microsoft.com/ cmbishop In Proceedings

More information

GWAS V: Gaussian processes

GWAS V: Gaussian processes GWAS V: Gaussian processes Dr. Oliver Stegle Christoh Lippert Prof. Dr. Karsten Borgwardt Max-Planck-Institutes Tübingen, Germany Tübingen Summer 2011 Oliver Stegle GWAS V: Gaussian processes Summer 2011

More information

MIXED MODELS THE GENERAL MIXED MODEL

MIXED MODELS THE GENERAL MIXED MODEL MIXED MODELS This chapter introduces best linear unbiased prediction (BLUP), a general method for predicting random effects, while Chapter 27 is concerned with the estimation of variances by restricted

More information

Basic Concepts in Matrix Algebra

Basic Concepts in Matrix Algebra Basic Concepts in Matrix Algebra An column array of p elements is called a vector of dimension p and is written as x p 1 = x 1 x 2. x p. The transpose of the column vector x p 1 is row vector x = [x 1

More information

L11: Pattern recognition principles

L11: Pattern recognition principles L11: Pattern recognition principles Bayesian decision theory Statistical classifiers Dimensionality reduction Clustering This lecture is partly based on [Huang, Acero and Hon, 2001, ch. 4] Introduction

More information

ELEMENTARY LINEAR ALGEBRA

ELEMENTARY LINEAR ALGEBRA ELEMENTARY LINEAR ALGEBRA K R MATTHEWS DEPARTMENT OF MATHEMATICS UNIVERSITY OF QUEENSLAND First Printing, 99 Chapter LINEAR EQUATIONS Introduction to linear equations A linear equation in n unknowns x,

More information

3/10/03 Gregory Carey Cholesky Problems - 1. Cholesky Problems

3/10/03 Gregory Carey Cholesky Problems - 1. Cholesky Problems 3/10/03 Gregory Carey Cholesky Problems - 1 Cholesky Problems Gregory Carey Department of Psychology and Institute for Behavioral Genetics University of Colorado Boulder CO 80309-0345 Email: gregory.carey@colorado.edu

More information

BTRY 4830/6830: Quantitative Genomics and Genetics Fall 2014

BTRY 4830/6830: Quantitative Genomics and Genetics Fall 2014 BTRY 4830/6830: Quantitative Genomics and Genetics Fall 2014 Homework 4 (version 3) - posted October 3 Assigned October 2; Due 11:59PM October 9 Problem 1 (Easy) a. For the genetic regression model: Y

More information

Linear Regression (9/11/13)

Linear Regression (9/11/13) STA561: Probabilistic machine learning Linear Regression (9/11/13) Lecturer: Barbara Engelhardt Scribes: Zachary Abzug, Mike Gloudemans, Zhuosheng Gu, Zhao Song 1 Why use linear regression? Figure 1: Scatter

More information

Lecture 4 Noisy Channel Coding

Lecture 4 Noisy Channel Coding Lecture 4 Noisy Channel Coding I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw October 9, 2015 1 / 56 I-Hsiang Wang IT Lecture 4 The Channel Coding Problem

More information

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Bayesian learning: Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Let y be the true label and y be the predicted

More information

Multiple-step Time Series Forecasting with Sparse Gaussian Processes

Multiple-step Time Series Forecasting with Sparse Gaussian Processes Multiple-step Time Series Forecasting with Sparse Gaussian Processes Perry Groot ab Peter Lucas a Paul van den Bosch b a Radboud University, Model-Based Systems Development, Heyendaalseweg 135, 6525 AJ

More information

ECE521 lecture 4: 19 January Optimization, MLE, regularization

ECE521 lecture 4: 19 January Optimization, MLE, regularization ECE521 lecture 4: 19 January 2017 Optimization, MLE, regularization First four lectures Lectures 1 and 2: Intro to ML Probability review Types of loss functions and algorithms Lecture 3: KNN Convexity

More information

Clustering VS Classification

Clustering VS Classification MCQ Clustering VS Classification 1. What is the relation between the distance between clusters and the corresponding class discriminability? a. proportional b. inversely-proportional c. no-relation Ans:

More information

MODEL-FREE LINKAGE AND ASSOCIATION MAPPING OF COMPLEX TRAITS USING QUANTITATIVE ENDOPHENOTYPES

MODEL-FREE LINKAGE AND ASSOCIATION MAPPING OF COMPLEX TRAITS USING QUANTITATIVE ENDOPHENOTYPES MODEL-FREE LINKAGE AND ASSOCIATION MAPPING OF COMPLEX TRAITS USING QUANTITATIVE ENDOPHENOTYPES Saurabh Ghosh Human Genetics Unit Indian Statistical Institute, Kolkata Most common diseases are caused by

More information

Fast Algorithms for SDPs derived from the Kalman-Yakubovich-Popov Lemma

Fast Algorithms for SDPs derived from the Kalman-Yakubovich-Popov Lemma Fast Algorithms for SDPs derived from the Kalman-Yakubovich-Popov Lemma Venkataramanan (Ragu) Balakrishnan School of ECE, Purdue University 8 September 2003 European Union RTN Summer School on Multi-Agent

More information

PCA vignette Principal components analysis with snpstats

PCA vignette Principal components analysis with snpstats PCA vignette Principal components analysis with snpstats David Clayton October 30, 2018 Principal components analysis has been widely used in population genetics in order to study population structure

More information

Parametric Empirical Bayes Methods for Microarrays

Parametric Empirical Bayes Methods for Microarrays Parametric Empirical Bayes Methods for Microarrays Ming Yuan, Deepayan Sarkar, Michael Newton and Christina Kendziorski April 30, 2018 Contents 1 Introduction 1 2 General Model Structure: Two Conditions

More information

Biostat 2065 Analysis of Incomplete Data

Biostat 2065 Analysis of Incomplete Data Biostat 2065 Analysis of Incomplete Data Gong Tang Dept of Biostatistics University of Pittsburgh October 20, 2005 1. Large-sample inference based on ML Let θ is the MLE, then the large-sample theory implies

More information

DATA MINING AND MACHINE LEARNING. Lecture 4: Linear models for regression and classification Lecturer: Simone Scardapane

DATA MINING AND MACHINE LEARNING. Lecture 4: Linear models for regression and classification Lecturer: Simone Scardapane DATA MINING AND MACHINE LEARNING Lecture 4: Linear models for regression and classification Lecturer: Simone Scardapane Academic Year 2016/2017 Table of contents Linear models for regression Regularized

More information

Linear Algebra - Part II

Linear Algebra - Part II Linear Algebra - Part II Projection, Eigendecomposition, SVD (Adapted from Sargur Srihari s slides) Brief Review from Part 1 Symmetric Matrix: A = A T Orthogonal Matrix: A T A = AA T = I and A 1 = A T

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 11 Project

More information

The Hilbert Space of Random Variables

The Hilbert Space of Random Variables The Hilbert Space of Random Variables Electrical Engineering 126 (UC Berkeley) Spring 2018 1 Outline Fix a probability space and consider the set H := {X : X is a real-valued random variable with E[X 2

More information

Notes on Latent Semantic Analysis

Notes on Latent Semantic Analysis Notes on Latent Semantic Analysis Costas Boulis 1 Introduction One of the most fundamental problems of information retrieval (IR) is to find all documents (and nothing but those) that are semantically

More information

Introduction to Machine Learning

Introduction to Machine Learning 10-701 Introduction to Machine Learning PCA Slides based on 18-661 Fall 2018 PCA Raw data can be Complex, High-dimensional To understand a phenomenon we measure various related quantities If we knew what

More information

Corner. Corners are the intersections of two edges of sufficiently different orientations.

Corner. Corners are the intersections of two edges of sufficiently different orientations. 2D Image Features Two dimensional image features are interesting local structures. They include junctions of different types like Y, T, X, and L. Much of the work on 2D features focuses on junction L,

More information

Advanced Introduction to Machine Learning

Advanced Introduction to Machine Learning 10-715 Advanced Introduction to Machine Learning Homework 3 Due Nov 12, 10.30 am Rules 1. Homework is due on the due date at 10.30 am. Please hand over your homework at the beginning of class. Please see

More information

Restricted Maximum Likelihood in Linear Regression and Linear Mixed-Effects Model

Restricted Maximum Likelihood in Linear Regression and Linear Mixed-Effects Model Restricted Maximum Likelihood in Linear Regression and Linear Mixed-Effects Model Xiuming Zhang zhangxiuming@u.nus.edu A*STAR-NUS Clinical Imaging Research Center October, 015 Summary This report derives

More information

Flexible phenotype simulation with PhenotypeSimulator Hannah Meyer

Flexible phenotype simulation with PhenotypeSimulator Hannah Meyer Flexible phenotype simulation with PhenotypeSimulator Hannah Meyer 2018-03-01 Contents Introduction 1 Work-flow 2 Examples 2 Example 1: Creating a phenotype composed of population structure and observational

More information

A matrix over a field F is a rectangular array of elements from F. The symbol

A matrix over a field F is a rectangular array of elements from F. The symbol Chapter MATRICES Matrix arithmetic A matrix over a field F is a rectangular array of elements from F The symbol M m n (F ) denotes the collection of all m n matrices over F Matrices will usually be denoted

More information

Vector Auto-Regressive Models

Vector Auto-Regressive Models Vector Auto-Regressive Models Laurent Ferrara 1 1 University of Paris Nanterre M2 Oct. 2018 Overview of the presentation 1. Vector Auto-Regressions Definition Estimation Testing 2. Impulse responses functions

More information

Probabilistic Latent Semantic Analysis

Probabilistic Latent Semantic Analysis Probabilistic Latent Semantic Analysis Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr

More information