Lecture 3: Mixture Models for Microbiome data. Lecture 3: Mixture Models for Microbiome data

Size: px

Start display at page:

Download "Lecture 3: Mixture Models for Microbiome data. Lecture 3: Mixture Models for Microbiome data"

Bennett Howard
5 years ago
Views:

1 Lecture 3: Mixture Models for Microbiome data 1 Lecture 3: Mixture Models for Microbiome data Outline: - Mixture Models (Negative Binomial) - DESeq2 / Don t Rarefy. Ever. 2

2 Hypothesis Tests - reminder A hypothesis is a precise disprovable statement. Null hypothesis - the default position. Nothing special Alternative/Rejection: Evidence disagrees with the Null Null hypothesis cannot be confirmed by the data. Hypothesis Tests - some examples 3 test t-test Mann-Whitney U-test correlation test Chi-Square test Neg-Binom Wald test R function t.test wilcox.test cor.test chisq.test DESeq2::nbinomWaldTest 4

3 Multiple Testing In Big Data, we often want to test many hypotheses in one batch. p-values are distributed uniformly when null hypothesis is true The expected number of rejections by chance is m*α count P values under Null hypothesis with 0 trials p rejected FALSE TRUE Model Uncertainty in NGS Count Data Uncertainty Depends on Library Size Poisson-only Count Simulation True Species (or Gene) Proportion in Simulation Proportion Species or Gene 6

4 Model Uncertainty in NGS Count Data Uncertainty Depends on Library Size Poisson-only Count Simulation One realization of the simulation (blue) Proportion Species or Gene 7 Model Uncertainty in NGS Count Data Uncertainty Depends on Library Size Repeat simulation (resampling) many times and different library sizes Poisson-only Count Simulation Library Size Species or Gene 8

5 Model Uncertainty in NGS Count Data Uncertainty Depends on Library Size Repeat simulation many times and different library sizes This turns out to describe technical sequencing replicates Observed Variance Poisson-only Count Simulation 1e+08 1e+0 9 Mean Count Library Size Model Uncertainty in NGS Count Data 1e+02 Est. Variance NGS Count Data 1e+08 Real Data (Biological Replicates) Variance 1e+0 1e+02 Variance Poisson Mean Count 1e+08

6 Model Uncertainty in NGS Count Data 1e+02 Negative Binomial Variance = Poisson Est. Variance NGS Count Data Overdispersion Over-dispersion Variance Strong Function of Mean 1e+08 Observed 1e+0 Share Information Across Genes to 1e+02 Improve Fit Variance (Performance) Over-dispersion Poisson Mean Count 1e Model Uncertainty in NGS Count Data 1e+0 Microbiome Data Is Inadmissible Rarefying Microbiome DataRarefying Is Inadmissible Negative Binomial is an infinite mixture of Poisson R.V. 1e+02 we have (almost) as many different Intuition: relevant when distributions (poisson means) as observations Borrow from RNA-Seq analysis implementations? (Yes) Negative Binomial McMurdie & Holmes (14). PLoS Computational Biology Effect Size Effect Size Robinson, Oshlack (). A scaling normalization RNA-Seq data. Genome Biology Anders, & Huber (). Differential expression sequence count data. Genome Biology 12 Mean t-distribution A.U.C. 00

If it comes up heads Generate a random number from a Normal with mean 1 and variance 0.2. R: `rnorm` function.

7 Transition: Mixture Models Technical details in: mixture-model-holmes-mathy-details.pdf Finite Mixture Model Example: Finite mixture of two normals Flip a fair coin. If it comes up heads Generate a random number from a Normal with mean 1 and variance 0.2. R: `rnorm` function. If it comes up tails Generate a random number from a Normal with mean 2 and variance 0.2. This is what the resulting histogram would look like if we did this,000 times

Finite Mixture Model Example: Finite mixture of two normals However in many cases the separation is not so clear. Challenge: Here is a histogram generated by two Normals with the same variances.

8 Finite Mixture Model Example: Finite mixture of two normals However in many cases the separation is not so clear. Challenge: Here is a histogram generated by two Normals with the same variances. Can you guess the two parameters for these two Normals? Finite Mixture Model Here we knew the answer (the source every data point) 1 In practice, this information is usually missing, and we call it a latent variable Discovering the hidden class: EM For simple parametric components, can use EM (Expectation- Maximization) algorithm to infer the value of the hidden variable. 16

9 Expectation Maximization (EM) Very popular iterative procedure Lots of implementations. E.g. FlexMix First, initialize θ to some random values. 2.Compute best value for U. 3. Use the just-computed values of U to compute a better estimate for θ. Parameters associated with a particular value of U only use data points whose associated latent variable has that value. 4. Iterate steps 2 and 3 until convergence maximization_algorithm Infinite Mixture Model Sometimes mixtures can be useful without us having to find who came from which distribution. 17 This is especially the case when we have (almost) as many different distributions as observations. In some cases the total distribution can still be studied, even if we don t know the source of each component distribution. e.g. Gamma-Poisson a.k.a. Negative Binomial 18

10 Infinite Mixture Model - N.B. Generative Description: Summarized Mathematically: variance: Poisson Overdispersion Negative Binomial is useful for modeling: Overdispersion (in Ecology) Simplest Mixture Model for Counts Different evolutionary mutation rates Throughout Bioinformatics and Bayesian Statistics Abundance data Summary of Mixture Models Finite Mixture Models Mixture of Normals with different means and variances. Mixtures of multivariate Normals with different means and covariance matrices Decomposing the mixtures using the EM algorithm. Common Infinite Mixture Models Gamma-Poisson for read counts Dirichlet-Multinomial (Birthday problem and the Bayesian setting). 19

Inefficient Normalization by rarefying & applicability of Negative Binomial Mixture Model Modern sequencing creates libraries of unequal sizes Early

normalization method is rarefying species samples species counts 21 Inefficient Normalization by rarefying the original idea Sanders 1968

11 Inefficient Normalization by rarefying & applicability of Negative Binomial Mixture Model Modern sequencing creates libraries of unequal sizes Early analyses focused on library-wise distances: paradigm: rarefy - UniFrac - PCoA - Write Paper This approach has leaked into formal settings, standard normalization method is rarefying species samples species counts 21 Inefficient Normalization by rarefying the original idea Sanders 1968 non-parametric richness estimate coverage Normalize? - No. rarefaction curves Sanders, H. L. (1968). Marine benthic diversity: a comparative study. American Naturalist 22

12 Inefficient Normalization by rarefying 1. Select a minimum library size N L,min Library Sizes (column sums) 2. Discard libraries (samples) that are smaller than N L,min 3. Subsample the remaining libraries without replacement such that they all have size N L,min N A B C D E Hughes & Hellmann (0) Methods in Enzymology Gotelli, & Colwell (01) Ecology Letters 23 Inefficient Normalization by rarefying 1. Select a minimum library size N L,min Library Sizes (column sums) 2. Discard libraries (samples) that are smaller than N L,min 3. Subsample the remaining libraries without replacement such that they all have size N L,min N A B C D E Hughes & Hellmann (0) Methods in Enzymology removed from dataset Gotelli, & Colwell (01) Ecology Letters 24

13 Microbiome Clustering Simulation OTUs samples Ocean Feces 1. Sum rows. A multinomial for each sample class. Microbiome count data from the Global Patterns dataset OTUs Ocean Feces 2. Deterministic mixing. Mix multinomials in precise proportion. Amount added is library size / effect size Ocean Feces OTUs samples Simulated Ocean 2 Simulated Feces 3. Sample from these multinomials. 4. Perform clustering, evaluate accuracy. Repeat for each effect size and media library size. Microbiome Clustering - Simulation Normalization Method: DESeqVS None Proportion Rarefy UQ logfc Accuracy Bray Curtis Euclidean PoissonDist top MSD UniFrac u UniFrac w Effect Size N ~ L = 00 N ~ L = 00 N ~ L =

14 Microbiome Clustering - Simulation Performance Depends on ÑL Distance Method: Bray Curtis PoissonDist top MSD UniFrac u UniFrac w ES = 1.1 ES = 1.2 ES = 1. Accuracy N ~ L = 00 N ~ L = 00 N ~ L = 000 N ~ L = Library Size Minimum Quantile 27 Issues with rarefying clustering Loss of Power: 1. Microbiome samples that cannot be classified because they were discarded (< NL,min). 2. Samples that are poorly distinguishable because of the discarded fraction of the original library. Arbitrary threshold: 1. Choice clearly affects performance 2. Optimum value, * NL, min, can t be known in practice 28

15 Differential Abundance samples species counts species test null Which species have proportions that are different between the sample classes? 29 Differential Abundance What about NB Mixture Model? genes samples RNA-Seq gene counts samples species species counts Mortazavi, et al (08). Mapping & quantifying transcriptomes by RNA-Seq. Nature Methods 30

16 Differential Abundance What about NB Mixture Model? Is Negative Binomial effective for this data? 1. Is there appreciable overdispersion? 2. Is there a useful across-species trend? K ij NB(s j µ i, i) is the dispersion param s a linear scaling factor for s i = s j µ i + i s j µ 2 i ing Poisson when Robinson, Oshlack (). A scaling normalization RNA-Seq data. Genome Biology Anders, & Huber (). Differential expression sequence count data. Genome Biology 31 Differential Abundance Common Scale Rarefied variance Variance 1e+08 1e+0 1e+02 1e+08 1e+0 1e+02 1e+08 1e+0 DietPatterns GlobalPatterns GlobalPatterns 04 HighFat freshwater marine Microbiome Survey Biological Replicates 1e+02 mean Mean 32

17 Differential Abundance Simulation species samples species counts test null Which species have proportions that are different between the sample classes? 33 Differential Abundance Environment Simulation samples OTUs OTUs OTUs samples test Repeat for each environment, number of samples, effect size, and median library size Sum rows for each environment. null OTUs Sample from multinomial. 3. Multiply randomly selected OTUs within test class by effect size. 4. Perform differential abundance tests, evaluate performance.

18 Differential Abundance - Simulation Number Samples per Class: DESeq2 nbinomwaldtest 3 Normalization Method: edger exacttest DESeq nbinomtest AUC ~ NL = 0000 ~ NL = 0000 Specificity two sided Welch t test Proportion ~ NL = 00 Rarefied metagenomeseq fitzig Model/None ~ NL = 0000 Sensitivity Effect Size 3 Differential Abundance - Alt Simulation (Courtesy: Sophie Weiss, UC Boulder) Number Samples per Class: DESeq2 nbinomwaldtest DESeq nbinomtest 0 Normalization Method: edger exacttest Model/None metagenomeseq fitzig Rarefied mt Wilcoxon rank-sum (wrs) Voom Proportion ~ NL = 00 AUC ~ NL = 0000 ~ NL = 0000 Specificity ~ NL = 0000 Sensitivity Effect Size

19 Differential Abundance - Simulation False Positive Rates 37 Issues with rarefying Differential Abundance 1. Rarefied counts worse sensitivity in every analysis method we attempted. 2. Rarefied counts also worse specificity (high FPs) No accounting for overdispersion Added noise from subsampling step 38

20 Transition: Lab 3 Negative Binomial mixture model for differential abundance multiple testing 39

Lecture: Mixture Models for Microbiome data

Lecture: Mixture Models for Microbiome data Lecture 3: Mixture Models for Microbiome data Outline: - - Sequencing thought experiment Mixture Models (tangent) - (esp. Negative Binomial) - Differential abundance