Lecture 2: Diversity, Distances, adonis

- Diversity: alpha, beta (and gamma)
- Beta diversity in practice: ecological distances
- Unsupervised learning: clustering, etc.
- Ordination: e.g. PCA, UniFrac/PCoA, DPCoA
- Testing: permutational multivariate ANOVA

Some slides are from Prof. A. Alekseyenko, NYU, and Prof. S. Holmes, Stanford.

Alpha Diversity

Alpha diversity definition(s)
Alpha diversity describes the diversity of a single community (specimen). In statistical terms, it is a scalar statistic computed for a single observation (column) that represents the diversity of that observation. There are many statistics that can describe diversity, e.g. taxonomic richness, evenness, and dominance.

Rank abundance plots

Species richness
Suppose we observe a community that can contain up to $k$ species, with relative proportions $P = \{p_1, \dots, p_k\}$.
Richness is computed as
$R = 1(p_1) + 1(p_2) + \dots + 1(p_k)$,
where $1(\cdot)$ is an indicator function: $1(p_i) = 1$ if $p_i > 0$, and $0$ otherwise.
- Higher $R$ means greater diversity.
- Richness depends strongly on the depth of sampling and is sensitive to the presence of rare species.

Sanders 1968
- Non-parametric richness estimate
- Coverage
Sanders, H. L. (1968). Marine benthic diversity: a comparative study. American Naturalist.

Rarefaction curves
[Figure: rarefaction curves. y-axis: number of species; x-axis: number of observations (library size, number of reads, sample size).]

Shannon index
Suppose we observe a community that can contain up to $k$ species, with relative proportions $P = \{p_1, \dots, p_k\}$.
The Shannon index is related to the notion of information content from information theory. It roughly represents how uncertain we are about the outcome of a random draw from $P$. When $p_i = p_j$ for all $i$ and $j$, we have no information about which species a random draw will result in. As the inequality becomes more pronounced, we gain more information about the possible outcome of the draw. The Shannon index captures this property of the distribution.
The Shannon index is computed as
$S_k = -p_1 \log_2 p_1 - p_2 \log_2 p_2 - \dots - p_k \log_2 p_k$.
Note that as $p_i \to 0$, $\log_2 p_i \to -\infty$; we therefore define $p_i \log_2 p_i = 0$ when $p_i = 0$.
- Higher $S_k$ means higher diversity.
Shannon entropy: http://en.wikipedia.org/wiki/entropy_(information_theory)
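The richness and Shannon formulas above can be checked numerically. The following is a minimal Python sketch using only the standard library; it is not the lecture's own tooling (the lecture's code snippets are in R). The function names and the example counts are assumptions made for illustration.

```python
import math


def proportions(counts):
    """Convert raw counts for one sample into relative proportions p_1..p_k."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("sample has no observations")
    return [c / total for c in counts]


def richness(p):
    """R = sum of the indicators 1(p_i): the number of species with p_i > 0."""
    return sum(1 for p_i in p if p_i > 0)


def shannon(p):
    """S_k = -sum p_i * log2(p_i), with 0 * log2(0) defined as 0."""
    return 0.0 - sum(p_i * math.log2(p_i) for p_i in p if p_i > 0)


# Hypothetical counts for one sample; the last species is unobserved.
counts = [40, 30, 20, 10, 0]
p = proportions(counts)
print("richness:", richness(p))
print("Shannon index (bits):", round(shannon(p), 3))
```

Skipping the terms with $p_i = 0$ is how the convention $p_i \log_2 p_i = 0$ is implemented here.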

From Shannon to evenness
- The Shannon index for a community of $k$ species has its maximum at $\log_2 k$.
- We can make different communities more comparable if we normalize by this maximum.
- The evenness index is computed as $E_k = S_k / \log_2 k$.
- $E_k = 1$ means total evenness.

Simpson index
Suppose we observe a community that can contain up to $k$ species, with relative proportions $P = \{p_1, \dots, p_k\}$.
The Simpson index is based on the probability of drawing the same species on two consecutive draws with replacement. Suppose on the first draw we picked species $i$; this event has probability $p_i$, hence the probability of drawing that species twice is $p_i \cdot p_i = p_i^2$. Summing over all species gives the probability that both draws yield the same species.
The Simpson index is usually computed as
$D = 1 - (p_1^2 + p_2^2 + \dots + p_k^2)$.
In this form, the index represents the probability that two individuals randomly selected from a sample belong to different species.
- $D = 0$ means no diversity (one species is completely dominant).
- $D$ approaching 1 means complete diversity (for $k$ equally abundant species, $D = 1 - 1/k$).

Numbers-equivalent diversity
Often it is convenient to talk about alpha diversity in terms of equivalent units: how many equally abundant taxa would it take to get the same diversity as we see in a given community?
- For richness, the statistic is unchanged.
- For Shannon, recall that $\log_2 k$ is the maximum, attained when all species are equally abundant. Hence the diversity in equivalent units is $2^{S_k}$.
- For Simpson, the equivalent-units measure of diversity is $1/(1 - D)$, sometimes called the inverse Simpson index.
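A short Python sketch of evenness, the Simpson index, and the two numbers-equivalent measures follows, using the formulas given above. The function names and the two example communities are assumptions for illustration, and $k$ is taken to be the length of the proportion vector.

```python
import math


def shannon(p):
    """S_k = -sum p_i * log2(p_i), with 0 * log2(0) defined as 0."""
    return 0.0 - sum(p_i * math.log2(p_i) for p_i in p if p_i > 0)


def evenness(p):
    """E_k = S_k / log2(k), where k is the number of possible species."""
    k = len(p)
    if k < 2:
        raise ValueError("evenness needs at least two possible species")
    return shannon(p) / math.log2(k)


def simpson(p):
    """D = 1 - sum p_i^2: chance that two draws give different species."""
    return 1.0 - sum(p_i * p_i for p_i in p)


def shannon_equivalent(p):
    """Number of equally abundant taxa with the same Shannon index: 2^S_k."""
    return 2.0 ** shannon(p)


def inverse_simpson(p):
    """Number of equally abundant taxa with the same Simpson index: 1/(1-D)."""
    return 1.0 / (1.0 - simpson(p))


# Two hypothetical communities with the same richness (k = 4).
even = [0.25, 0.25, 0.25, 0.25]
skewed = [0.70, 0.10, 0.10, 0.10]
for name, p in (("even", even), ("skewed", skewed)):
    print(name,
          "E_k =", round(evenness(p), 3),
          "D =", round(simpson(p), 3),
          "2^S_k =", round(shannon_equivalent(p), 3),
          "1/(1-D) =", round(inverse_simpson(p), 3))
```

For the perfectly even community both numbers-equivalent measures return the richness, which is the sense in which they are "equivalent units".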

Beta diversity
- Microbial ecologists typically use "beta diversity" as a broad umbrella term that can refer to any of several indices related to compositional differences (differences in species content between samples).
- For some reason this is contentious, and there appears to be an ongoing (and pointless?) argument over the possible definitions.
- For our purposes, and in microbiome research, when you hear "beta diversity" you can probably think: diversity of species composition.
http://en.wikipedia.org/wiki/beta_diversity

Summary of diversity types
- α diversity: within a community, number of species only.
- β diversity: between communities (differentiation); species identity is taken into account.
- γ diversity: (global) diversity of the site.
Theoretically, one would wish to use measures for which γ = α β. This is only possible if α and β are independent of each other.

Beta diversity in practice: dimensional reduction
1. UniFrac or Bray-Curtis distance between samples
2. MDS ("PCoA")
3. Plot first two axes
4. Admire clusters
5. Write paper
6. Choose new microbiomes
7. Return to step 1, repeat

Why? Let's back up. This is one option in an arsenal of dimensional reduction methods that come from unsupervised learning in exploratory data analysis.

[Figure panels: regress disc on weight; regress weight on disc.]
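Steps 1 to 3 of the workflow above (a distance between samples, MDS/PCoA, the first two axes) can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the lecture's R pipeline: it uses Bray-Curtis rather than UniFrac (which would need a phylogenetic tree), the count table is hypothetical, and the eigenvectors are found with a simple power iteration that assumes the leading eigenvalues are positive and well separated.

```python
import math


def bray_curtis(x, y):
    """Bray-Curtis dissimilarity: sum |x_i - y_i| / sum (x_i + y_i)."""
    den = sum(a + b for a, b in zip(x, y))
    if den == 0:
        raise ValueError("both samples are empty")
    return sum(abs(a - b) for a, b in zip(x, y)) / den


def distance_matrix(samples):
    return [[bray_curtis(a, b) for b in samples] for a in samples]


def double_center(d):
    """Gower centering: B = -1/2 * J * D^2 * J for a symmetric matrix D."""
    n = len(d)
    a = [[-0.5 * d[i][j] ** 2 for j in range(n)] for i in range(n)]
    row = [sum(r) / n for r in a]  # equals the column means by symmetry
    grand = sum(row) / n
    return [[a[i][j] - row[i] - row[j] + grand for j in range(n)]
            for i in range(n)]


def top_eigenpair(b, iters=1000):
    """Power iteration for the eigenpair with the largest |eigenvalue|."""
    n = len(b)
    v = [1.0 / (i + 1) for i in range(n)]  # fixed, generic starting vector
    for _ in range(iters):
        w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm < 1e-12:  # nothing left to extract
            return 0.0, [0.0] * n
        v = [x / norm for x in w]
    bv = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
    return sum(x * y for x, y in zip(v, bv)), v


def pcoa(d, n_axes=2):
    """Classical MDS: coordinates are sqrt(eigenvalue) * eigenvector."""
    b = double_center(d)
    n = len(b)
    coords = [[0.0] * n_axes for _ in range(n)]
    eigenvalues = []
    for k in range(n_axes):
        lam, v = top_eigenpair(b)
        eigenvalues.append(lam)
        if lam > 0:  # axes with non-positive eigenvalues are left at zero
            for i in range(n):
                coords[i][k] = math.sqrt(lam) * v[i]
        # Deflate so the next pass finds the next eigenpair.
        b = [[b[i][j] - lam * v[i] * v[j] for j in range(n)]
             for i in range(n)]
    return coords, eigenvalues


# Hypothetical count table: rows are samples, columns are taxa.
samples = [[10, 0, 5, 1], [8, 1, 6, 0], [0, 12, 1, 7], [1, 9, 0, 8]]
d = distance_matrix(samples)
coords, eigenvalues = pcoa(d)
for i, (axis1, axis2) in enumerate(coords):
    print(f"sample {i}: axis1 = {axis1:.3f}, axis2 = {axis2:.3f}")
```

In this toy table the first two samples and the last two samples have similar composition, so they should fall on opposite sides of the first axis. Bray-Curtis is not a Euclidean distance, so some eigenvalues of the centered matrix can be negative; the sketch simply ignores such axes.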

Dimensional reduction
Minimize the distance to the line in both directions: the purple line is the principal component line.

Dimensional reduction
- Principal components are linear combinations of the old variables.
- PCA finds the projection that maximizes the area of the shadow. An equivalent measure is the sum of squares of the distances between points in the projection. We want to see as much of the variation as possible; that is what PCA does.

The PCA workflow

Ordination using the tree
1. UniFrac-PCoA
2. Double principal coordinates (DPCoA)
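For the two-variable case in the dimensional reduction slides (a principal component line through a scatter of two measurements), PCA can be written in closed form from the 2x2 covariance matrix. The sketch below is in Python with the standard library only; the variable names `weight` and `disc` echo the regression slide, but the data values are hypothetical.

```python
import math


def pca_2d(xs, ys):
    """PCA of two variables via the eigen-decomposition of their covariance.

    Returns (eigenvalues, direction, scores): the two variances along the
    principal axes, the unit vector of the first principal component, and
    the projection of each centered point onto that vector.
    """
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two equally long series with n >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) ** 2 for x in xs) / (n - 1)          # var(x)
    c = sum((y - my) ** 2 for y in ys) / (n - 1)          # var(y)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)  # cov
    root = math.hypot((a - c) / 2, b)
    eigenvalues = ((a + c) / 2 + root, (a + c) / 2 - root)
    # The major axis of the covariance ellipse satisfies tan(2t) = 2b/(a-c).
    theta = 0.5 * math.atan2(2 * b, a - c)
    direction = (math.cos(theta), math.sin(theta))
    # PC1 is a linear combination of the two centered variables.
    scores = [(x - mx) * direction[0] + (y - my) * direction[1]
              for x, y in zip(xs, ys)]
    return eigenvalues, direction, scores


# Hypothetical measurements of two correlated variables.
weight = [2.0, 3.1, 3.9, 5.2, 6.0, 7.1]
disc = [1.1, 1.9, 3.2, 3.8, 5.1, 5.9]
eigenvalues, direction, scores = pca_2d(weight, disc)
explained = eigenvalues[0] / (eigenvalues[0] + eigenvalues[1])
print("PC1 direction:", tuple(round(v, 3) for v in direction))
print("share of variance on PC1:", round(explained, 3))
```

Unlike either regression (disc on weight, or weight on disc), the principal component line treats both variables symmetrically: it minimizes perpendicular distances, which is the same as maximizing the variance of the projected points.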

Ordination best practice
1. Always look at the scree plot
2. Variables, samples
3. Biplot
4. Altogether (if readable)

    library(ade4)  # provides dudi.pca
    pca.turtles <- dudi.pca(turtles[, -1], scannf = FALSE, nf = 2)
    scatter(pca.turtles)

- How many axes are probably useful?
- Are there clusters? How many?
- Are there gradients?
- Are the patterns consistent with covariates (e.g. sample observations)? How might we test this?

Are there clusters? How many?
Gap statistic
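For the first best-practice item, a scree plot is a plot of how much variation each ordination axis carries, and it is the usual starting point for the question of how many axes are useful. A minimal Python sketch follows, assuming the eigenvalues of the ordination are already available; the eigenvalues shown are hypothetical, and clipping negative eigenvalues to zero is a simplifying assumption for distance-based ordinations.

```python
def variance_explained(eigenvalues):
    """Return per-axis and cumulative proportions of total variation."""
    clipped = [max(ev, 0.0) for ev in eigenvalues]  # assumption: drop negatives
    total = sum(clipped)
    if total <= 0:
        raise ValueError("no positive eigenvalues")
    proportions = [ev / total for ev in clipped]
    cumulative = []
    running = 0.0
    for share in proportions:
        running += share
        cumulative.append(running)
    return proportions, cumulative


# Hypothetical eigenvalues from an ordination, largest first.
eigenvalues = [4.0, 2.0, 1.0, 0.5, 0.5]
proportions, cumulative = variance_explained(eigenvalues)
for axis, (share, total) in enumerate(zip(proportions, cumulative), start=1):
    bar = "#" * round(40 * share)  # crude text version of the scree plot
    print(f"axis {axis}: {share:6.1%} (cumulative {total:6.1%}) {bar}")
```

A sharp drop followed by a flat tail suggests that the axes before the drop carry most of the structure.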

Are there gradients?
PCA regression

Are the patterns consistent with covariates? How might we test this?
(Permutational) multivariate ANOVA: vegan::adonis()
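The idea behind a permutational multivariate ANOVA can be sketched for the simplest case of one grouping factor. The Python code below is an illustration of the general technique, not a reimplementation of vegan::adonis: it computes a pseudo-F statistic from sums of squared distances within and among groups, then compares it with the values obtained after shuffling the group labels. The function names, the example data, the number of permutations, and the seed are assumptions.

```python
import random


def pseudo_f(d, groups):
    """Pseudo-F for one factor, computed from a distance matrix d."""
    n = len(groups)
    labels = sorted(set(groups))
    a = len(labels)
    if a < 2 or n <= a:
        raise ValueError("need at least 2 groups and more samples than groups")
    # Total sum of squares: squared distances over all pairs, divided by N.
    ss_total = sum(d[i][j] ** 2
                   for i in range(n) for j in range(i + 1, n)) / n
    # Within-group sum of squares: same quantity inside each group.
    ss_within = 0.0
    for g in labels:
        idx = [i for i in range(n) if groups[i] == g]
        ss_within += sum(d[i][j] ** 2
                         for pos, i in enumerate(idx)
                         for j in idx[pos + 1:]) / len(idx)
    ss_among = ss_total - ss_within
    if ss_within == 0:
        return float("inf")
    return (ss_among / (a - 1)) / (ss_within / (n - a))


def permanova(d, groups, n_perm=999, seed=0):
    """Return (observed pseudo-F, permutation p-value)."""
    rng = random.Random(seed)
    f_obs = pseudo_f(d, groups)
    shuffled = list(groups)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        # Small tolerance so relabelings of the same partition count as ties.
        if pseudo_f(d, shuffled) >= f_obs - 1e-12:
            hits += 1
    return f_obs, (hits + 1) / (n_perm + 1)


# Hypothetical one-dimensional samples, so distances are easy to check.
values = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
groups = ["a", "a", "a", "b", "b", "b"]
d = [[abs(x - y) for y in values] for x in values]
f_obs, p_value = permanova(d, groups)
print("pseudo-F:", round(f_obs, 3), "permutation p-value:", round(p_value, 3))
```

With Euclidean distances on a single variable the pseudo-F coincides with the classical one-way ANOVA F. With only six samples there are few distinct ways to split the labels, so the p-value cannot get very small however well separated the groups are.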