SUPPLEMENTARY INFORMATION


doi:10.1038/nature25973

Power Simulations

We performed extensive power simulations to demonstrate that the analyses carried out in our study are well powered. Our simulations indicate very high power for all experiments that test for association between bacterial and ancestral similarity, and between microbiome composition and ancestry proportions (where we define power as the proportion of simulations yielding P < 0.05). Specifically, we considered several simulation scenarios, generated multiple synthetic phenotype vectors according to these scenarios, and tested for association between these vectors and genetics or ancestry (depending on the simulation scenario). We then estimated statistical power as the fraction of tests with P < 0.05. The simulation settings we considered are as follows:

1. Simulating a phenotype whose distribution depends on ancestry proportions. Here, we generated for every individual $i$ a synthetic phenotype $y_i$ according to the formulas:

$$y_i = c \sum_a a_i \beta_a + x_i^T \gamma + \varepsilon_i, \qquad \beta_a \sim N(0, \sigma^2), \qquad \varepsilon_i \sim N(0, 1 - \sigma^2), \qquad \gamma \sim N(0, I).$$

Here, the summation is performed over ancestries, $a_i$ is the ancestry proportion of individual $i$ for ancestry $a$ (the fraction of grandparents originating from this ancestry) after centering to obtain a zero mean, $\beta_a$ is the coefficient of ancestry $a$, $x_i$ is the vector of covariates of individual $i$, $\gamma$ is a vector of covariate effects, $\varepsilon_i$ encodes a residual term, $\sigma^2 \in (0,1)$ is the variance of $\beta_a$, and $c$ is a constant guaranteeing that $y_i$ has unit variance on average after regressing out the covariate effects. Hence, under this formulation $\sigma^2$ controls the fraction of the variance of $y_i$ explained by ancestry (after regressing out the covariate effects).
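The generative model of setting 1 can be sketched in a few lines of NumPy. This is a minimal illustration rather than the code used in the study: the function names, the toy inputs, and in particular the way the constant c is computed (scaling the ancestry term so that its variance averages to σ² across individuals) are assumptions made for the example.

```python
import numpy as np

def ancestry_scale(A_centered):
    """Assumed choice of c: the ancestry term c * sum_a a_i * beta_a then has
    variance sigma2 on average across individuals."""
    return 1.0 / np.sqrt(np.mean(np.sum(A_centered ** 2, axis=1)))

def simulate_ancestry_phenotype(A, X, sigma2, rng):
    """Draw one synthetic phenotype vector y for setting 1.

    A: (n, n_anc) ancestry proportions; X: (n, n_cov) covariates.
    """
    A_centered = A - A.mean(axis=0)                        # zero-mean ancestry proportions
    n, n_anc = A_centered.shape
    beta = rng.normal(0.0, np.sqrt(sigma2), size=n_anc)    # beta_a ~ N(0, sigma2)
    gamma = rng.normal(0.0, 1.0, size=X.shape[1])          # gamma ~ N(0, I)
    eps = rng.normal(0.0, np.sqrt(1.0 - sigma2), size=n)   # eps_i ~ N(0, 1 - sigma2)
    c = ancestry_scale(A_centered)
    return c * (A_centered @ beta) + X @ gamma + eps

# Toy inputs (hypothetical; the study used the cohort's real ancestry data).
rng = np.random.default_rng(0)
A_toy = rng.dirichlet(np.ones(4), size=200)   # 200 individuals, 4 ancestries
X_toy = rng.normal(size=(200, 3))             # 3 covariates
y_toy = simulate_ancestry_phenotype(A_toy, X_toy, sigma2=0.3, rng=rng)
```

With this scaling, the residual variance 1 - σ² and the average ancestry variance σ² sum to one, which matches the stated role of c.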
We evaluated $\sigma^2$ values in the grid [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.999] and repeated each experiment 100 times, where in each experiment we (1) randomly sampled $\beta_a$, $\varepsilon_i$ and $\gamma$ values from their distributions; (2) generated $y_i$ values based on the true ancestry proportions $a_i$ in our data; (3) regressed the covariates $x_i$ out of $y_i$; and (4) tested whether the phenotype vector $y$ is associated with ancestry via a Mantel test with Spearman correlation, testing for association between the Euclidean ancestry distances of individuals $i$ and $j$ (given by $\sum_a [a_i - a_j]^2$) and their phenotypic differences (given by $[y_i - y_j]^2$). Steps (3) and (4) in the above procedure mimic the ancestry-microbiome association test described in the paper (Figure 1e, Extended Data Table 1 middle column), with the difference that here we test for association with a single phenotype $y_i$ instead of a vector of bacteria. The proposed formulation is designed to simulate a genome-wide association study, which enables evaluating the model's power in a well-known setting with a minimal set of assumptions. Although the above model could be extended to simulate a multivariate vector instead of a scalar phenotype $y_i$, this would require introducing additional assumptions into the model, which would detract from its generalizability. The results indicate that our study is very well powered to identify a phenotype whose distribution is given by the formulas above, even for small values of $\sigma^2$ (Supplementary Table 7).

2. Simulating a bacterial community based on genetic principal components. Here, we generated for every individual $i$, and for each of the $K = 184$ bacterial species present in >5% of genotyped individuals, a vector of relative abundances $t_i^1, \ldots, t_i^K$ defined as follows:

$$(t_i^1, \ldots, t_i^K) \sim \mathrm{Dir}(\alpha_i^1, \ldots, \alpha_i^K), \qquad \alpha_i^j = \begin{cases} \left[1 + \exp\left(-\sum_{m=1}^{5} P_i^m \beta_j^m\right)\right]^{-1} & j \le q \\ 0.5 & j > q \end{cases}, \qquad \beta_j^m \sim N(0, b).$$

Here, $\alpha_i^j$ is the Dirichlet concentration parameter of taxon $j$ in individual $i$ (where larger values indicate a larger tendency to carry taxon $j$), $m$ iterates over the top five genetic principal components, $P_i^m$ is the $m$-th genetic principal component of individual $i$, $\beta_j^m$ is the weight of the $m$-th genetic principal component with respect to taxon $j$, $b$ is the variance of $\beta_j^m$, and $q$ is a tunable parameter controlling the number of species affected by genetic principal components. Hence, larger values of $q$ and of $b$ indicate that a larger fraction of the microbiome composition is affected by genetic ancestry, which should be reflected in the principal coordinates (PCos) of a microbial β-diversity matrix.
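The Dirichlet model of setting 2 can be sketched as follows. This is an illustrative reading of the formulas rather than the study's code: the function name, the toy principal-component scores and the small lower bound on the concentration parameters (a purely numerical guard) are assumptions, and the first q taxa are taken to be the ones affected by the genetic principal components.

```python
import numpy as np

def simulate_community(P, K, q, b, rng):
    """P: (n, 5) top genetic PCs. Returns an (n, K) matrix of relative abundances."""
    n, n_pcs = P.shape
    beta = rng.normal(0.0, np.sqrt(b), size=(K, n_pcs))    # beta_j^m ~ N(0, b)
    alpha = np.full((n, K), 0.5)                           # taxa unaffected by the PCs
    # The first q taxa get a logistic function of the individual's PC scores.
    alpha[:, :q] = 1.0 / (1.0 + np.exp(-(P @ beta[:q].T)))
    alpha = np.clip(alpha, 1e-6, None)                     # numerical guard only (assumption)
    # One Dirichlet draw per individual, using that individual's concentrations.
    return np.vstack([rng.dirichlet(alpha[i]) for i in range(n)])

rng = np.random.default_rng(0)
P_toy = rng.normal(size=(50, 5))                           # hypothetical PC scores
T_toy = simulate_community(P_toy, K=184, q=46, b=5.0, rng=rng)   # q = 25% of K
```

Each row of the returned matrix is one simulated community, so the rows are non-negative and sum to one.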
We carried out experiments in which we (1) generated bacterial taxa vectors according to the model above, using $b$ values in the grid [5, 10, 20] and $q$ values corresponding to 0%, 1%, 25%, 50%, 75%, 95% and 100% of $K$; (2) computed the top PCos of a bacterial Bray-Curtis dissimilarity matrix; and (3) tested for a Spearman correlation between the top genetic PCs and the corresponding top microbiome PCos, with 100 experiments carried out for each evaluated value of $q$. We then computed P values for association via the standard asymptotic formulas for either a Spearman or a Pearson correlation of two multivariate random variables, and performed a multiple hypothesis correction for testing five different hypotheses via the Benjamini-Hochberg procedure. We measured power as the fraction of experiments with P < 0.05 (after multiple hypothesis correction). The results indicate excellent power for finding correlations between genetic PCs and bacterial PCos (Supplementary Table 8).
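Steps (2) and (3) of this procedure can be sketched with NumPy and SciPy. The sketch below is an assumption-laden illustration, not the study's pipeline: the helper names are hypothetical, the PCoA is a plain classical-scaling implementation, and the Benjamini-Hochberg adjustment is written out by hand. The text does not specify how the five adjusted P values are summarized into a single detection per experiment, so the sketch stops at the adjusted values.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.stats import spearmanr

def pcoa(D, n_axes):
    """Classical PCoA: eigendecompose the double-centered squared distances."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J
    evals, evecs = np.linalg.eigh(B)                 # ascending eigenvalues
    order = np.argsort(evals)[::-1][:n_axes]         # keep the largest ones
    return evecs[:, order] * np.sqrt(np.clip(evals[order], 0.0, None))

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted P values, returned in the input order."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)
    adj_sorted = np.minimum.accumulate(scaled[::-1])[::-1]   # enforce monotonicity
    adj = np.empty(m)
    adj[order] = np.minimum(adj_sorted, 1.0)
    return adj

def pc_pco_test(P, T, n_axes=5):
    """Spearman test of genetic PC m against microbiome PCo m, BH-adjusted."""
    D = squareform(pdist(T, metric="braycurtis"))
    pcos = pcoa(D, n_axes)
    pvals = [spearmanr(P[:, m], pcos[:, m]).pvalue for m in range(n_axes)]
    return benjamini_hochberg(pvals)

# Toy null data (hypothetical): abundances unrelated to the PCs.
rng = np.random.default_rng(0)
P_null = rng.normal(size=(40, 5))
T_null = rng.dirichlet(np.ones(10), size=40)
adj_p = pc_pco_test(P_null, T_null)
```

Repeating `pc_pco_test` over many simulated communities and counting how often the adjusted P values fall below 0.05 gives a power estimate in the sense used above.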

3. Simulating a bacterial taxon based on ancestry proportions. Here, we generated for every individual $i$ a synthetic species $t_i$ whose relative abundance is given by:

$$t_i \sim \mathrm{Beta}(1 + k a_i,\; 1 + k(1 - a_i)),$$

where $a_i$ is the ancestry proportion of individual $i$ for ancestry $a$, and $k \ge 0$ is a tunable parameter that controls the association strength. After generating $t_i$, we scaled all other bacterial species carried by the same individual so that their total relative abundance (including $t_i$) sums to unity. Hence, when $k = 0$, $t_i$ is distributed uniformly between 0 and 1, whereas larger values of $k$ induce a greater tendency for individuals of ancestry $a$ to have a larger taxon abundance, and for individuals from other ancestries to have a smaller taxon abundance. As before, we repeated the experiment multiple times, where each experiment is associated with different values of $k$ and of an ancestry $a$. Specifically, we investigated values of $k$ in the grid [0, 0.25, 0.5, 1, 2, 5], and repeated each experiment 100 times.

As the above formulation simulates an association between a single taxon and a single ancestry, we used our machine learning model to estimate power. Specifically, in each experiment we trained a Ridge regression model to estimate the ancestry proportions of individuals based on their microbiome, and computed the coefficient of determination ($R^2$) via 10-fold cross-validation. This test mimics the ancestry proportion prediction test described in the results section of the main text. We computed approximate P values by generating a distribution of 1,000 $R^2$ values obtained under $k = 0$ (corresponding to the null hypothesis) for each ancestry $a$, and computing the fraction of null $R^2$ values greater than the one obtained in practice. The results indicate excellent power in the majority of studied settings, especially when $a$ corresponds to Ashkenazi ancestry (Supplementary Table 9).

4. Simulating a phenotype based on genetic kinship.
Here, we generated a vector of synthetic phenotypes $y$ according to the formula:

$$y \sim N\left(X\gamma,\; \sigma^2 G + (1 - \sigma^2) I_n\right), \qquad \gamma \sim N(0, I_c),$$

where $X$ is an $n \times c$ matrix of $c$ covariates for $n$ individuals (using the same covariates defined in the main text), $\gamma$ is a vector of $c$ effect sizes (often called fixed effects), $G$ is an $n \times n$ kinship matrix, $I_n$ is the $n \times n$ identity matrix, $I_c$ is the $c \times c$ identity matrix, and $\sigma^2$ controls the fraction of phenotype variance explained by genetic kinship (after regressing out the covariate effects). We evaluated $\sigma^2$ values in the grid [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.999] and repeated each experiment 100 times, where in each experiment we (1) randomly sampled a vector of effect sizes $\gamma$ and a vector of phenotypes $y$ from their distributions; (2) regressed the covariates $X$ out of $y$; and (3) tested whether the phenotype vector $y$ is associated with the genetic kinship matrix $G$ via a Mantel test with Spearman correlation. Steps (2) and (3) in the above procedure mimic the genetic kinship-microbiome association test described in the paper (Figure 2a, Extended Data Table 1 right column), with the difference that here we test for association with a phenotype vector $y$ instead of a matrix of bacterial abundances. The results indicate that our study has moderate power to identify a phenotype whose distribution is given by the formulas above (Supplementary Table 10). These results are in good agreement with Figure 4a, which shows that genetic kinship estimation has very large confidence intervals for samples with fewer than 2,000 individuals. Nevertheless, our analysis of the well-powered TwinsUK data set of Goodrich et al. provides an estimate that the average fraction of taxa variance explained by genetic kinship is 1.9%, or at most 8.1% under very liberal assumptions, as explained in the manuscript.

Statistical Aspects of the Microbiome-Association Index

The microbiome-association index ($b^2$) is a formal measure of the extent to which microbiome composition can predict a phenotype of interest. Its value ranges between 0 and 1, with 0 indicating no predictive power and 1 indicating a fully deterministic prediction. $b^2$ was defined analogously to genetic heritability (commonly denoted $h^2$). Both $b^2$ and $h^2$ are formally defined as the proportion of phenotypic variance that can be explained by the microbiome composition or by the genetic contents of an individual, respectively (where the term "explained variance" refers to statistical rather than causal explanatory power). We now provide a formal description of some of the underlying assumptions behind the microbiome-association index.
Denoting $Y$ as a random variable encoding a phenotype of interest, and $G$, $B$, $E$ as genetic, microbiome and environmental random variables that can be used to predict $Y$, respectively, we have:

$$\mathrm{var}(Y) = \mathrm{var}(G) + \mathrm{var}(B) + \mathrm{var}(E) + 2\,\mathrm{cov}(G, B) + 2\,\mathrm{cov}(G, E) + 2\,\mathrm{cov}(B, E).$$

In this work, we used the simpler expression:

$$\mathrm{var}(Y) = \mathrm{var}(G) + \mathrm{var}(B) + \mathrm{var}(E),$$

where we estimated $G$ via a polygenic risk score, $B$ via the relative abundance of bacterial genes, and $E$ as a normally distributed random variable. Hence, our derivation implicitly assumes $\mathrm{cov}(G, B) = \mathrm{cov}(G, E) = \mathrm{cov}(B, E) = 0$. We emphasize that $G$, $B$, $E$ do not encode the genotypes, microbiome composition and environment of an individual per se. Rather, they encode variables that are derived from the genotype, the microbiome composition and the environment of an individual, respectively, and can be used to predict $Y$. Consequently, $\mathrm{cov}(B, E) = 0$ does not mean that the gut microbiome composition is uncorrelated with the environment. Rather, this equality means that the variables derived from the microbiome and from the environment for phenotype prediction are uncorrelated.

Our assumptions are similar to those commonly made in statistical genetics when estimating the genetic heritability of a phenotype. Specifically, heritability estimation is typically carried out via the equation $\mathrm{var}(Y) = \mathrm{var}(G) + \mathrm{var}(E)$ instead of the more general equation $\mathrm{var}(Y) = \mathrm{var}(G) + \mathrm{var}(E) + 2\,\mathrm{cov}(G, E)$. If we additionally assume that the contribution of each genetic variant to $G$ is additive, then the heritability estimated from the first formula is typically called narrow-sense heritability, whereas heritability estimated from the second formula is typically called broad-sense heritability. Analogously, in this paper we estimate the narrow-sense microbiome-association index, rather than the broad-sense microbiome-association index.

The decision to estimate the narrow-sense microbiome-association index stems from several reasons. First, there is a large body of literature in statistical genetics demonstrating that it is very difficult to identify interaction terms in high-dimensional models, even if such interactions exist (refs. 1, 2). This is due to technical reasons: the first-order term of the Taylor expansion of the underlying function can accurately approximate the effect of the interaction terms, which are originally encoded in the higher-order terms. As our model is high dimensional (due to the ~1,300,000 bacterial genes used in the linear mixed model approach), the overall microbiome-association index estimate will likely be very similar to that of a model that explicitly encodes interactions, regardless of whether such interactions exist in reality.
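The Taylor-expansion argument can be illustrated with a deliberately small toy example, which is not part of the original analysis and uses hypothetical names throughout. The phenotype below is a pure product of two features, so all of its signal is an interaction. When the features have a nonzero mean μ, expanding the product around the mean gives μ² + μz₁ + μz₂ + z₁z₂, and a purely additive fit captures a fraction 2μ²/(2μ² + 1) of the variance.

```python
import numpy as np

def additive_r2(mu, n=20000, seed=0):
    """R^2 of a purely additive (linear) fit to a pure-interaction phenotype."""
    rng = np.random.default_rng(seed)
    x = mu + rng.normal(size=(n, 2))            # two features: mean mu, unit variance
    y = x[:, 0] * x[:, 1]                       # phenotype is a pure product
    design = np.column_stack([np.ones(n), x])   # intercept + first-order terms only
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return 1.0 - resid.var() / y.var()

r2_centered = additive_r2(mu=0.0)   # theory: 0, nothing is captured additively
r2_shifted = additive_r2(mu=2.0)    # theory: 2*mu^2 / (2*mu^2 + 1) = 8/9
```

The two-feature setting only demonstrates the first-order mechanism; it says nothing about how closely the approximation holds for the ~1,300,000-gene model discussed above.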
Second, even if we ignore the above arguments, there is no established method for encoding GxE, BxE and GxB interactions that is well accepted in the statistical genetics community. Hence, any attempt to encode these quantities is likely to be based on subjective considerations and subject to debate. We thus believe that our simplified model will facilitate reproduction of microbiome-association index estimation across studies with varying study designs.

References

1. Mäki-Tanila, A. & Hill, W. G. Influence of gene interaction on complex trait variation with multilocus models. Genetics 198, 355–67 (2014).
2. Huang, W. & Mackay, T. F. C. The Genetic Architecture of Quantitative Traits Cannot Be Inferred from Variance Component Analysis. PLOS Genet. 12, e1006421 (2016).