Lecture 2: Diversity, Distances, adonis

- Diversity: alpha, beta (and gamma)
- Beta diversity in practice: ecological distances
- Unsupervised learning: clustering, etc.
- Ordination: e.g. PCA, UniFrac/PCoA, DPCoA
- Testing: permutational multivariate ANOVA

Some slides from Prof. A. Alekseyenko, NYU, and Prof. S. Holmes, Stanford.

Alpha Diversity

Alpha diversity definition(s)
Alpha diversity describes the diversity of a single community (specimen). In statistical terms, it is a scalar statistic computed for a single observation (column) that represents the diversity of that observation. There are many statistics that can describe diversity, e.g. taxonomic richness, evenness, and dominance.

Rank abundance plots (figure)

Species richness
Suppose we observe a community that can contain up to $k$ species, with relative proportions $P = \{p_1, \dots, p_k\}$.
Richness is computed as
$R = \mathbf{1}(p_1) + \mathbf{1}(p_2) + \dots + \mathbf{1}(p_k)$,
where $\mathbf{1}(\cdot)$ is an indicator function, i.e. $\mathbf{1}(p_i) = 1$ if $p_i > 0$, and 0 otherwise.
- Higher $R$ means greater diversity.
- Richness is very dependent on depth of sampling and sensitive to the presence of rare species.

Sanders 1968
- Non-parametric richness estimate
- Coverage
Sanders, H. L. (1968). Marine benthic diversity: a comparative study. American Naturalist.

Rarefaction curves (figure: number of species plotted against # observations / library size / # reads / sample size)

Shannon index
Suppose we observe a community that can contain up to $k$ species, with relative proportions $P = \{p_1, \dots, p_k\}$.
The Shannon index is related to the notion of information content from information theory. It roughly represents the amount of information that is available about the distribution $P$. When $p_i = p_j$ for all $i$ and $j$, we have no information about which species a random draw will produce. As the inequality between the proportions becomes more pronounced, we gain more information about the possible outcome of the draw. The Shannon index captures this property of the distribution.
The Shannon index is computed as
$S_k = -p_1 \log_2 p_1 - p_2 \log_2 p_2 - \dots - p_k \log_2 p_k$
Note that as $p_i \to 0$, $\log_2 p_i \to -\infty$; we therefore define $p_i \log_2 p_i = 0$ when $p_i = 0$.
Higher $S_k$ means higher diversity.
Shannon entropy: http://en.wikipedia.org/wiki/entropy_(information_theory)
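The richness and Shannon definitions above can be sketched in a few lines of base R. This is a minimal illustration, not code from the lecture: the function names `richness` and `shannon` and the count vector are made up for the example.

```r
# Richness: number of species with a non-zero proportion.
richness <- function(p) sum(p > 0)

# Shannon index in bits (log base 2).
# Zero proportions are dropped, which implements the convention 0 * log2(0) = 0.
shannon <- function(p) {
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Hypothetical counts for a community that can hold up to k = 5 species.
counts <- c(50, 30, 15, 5, 0)
p <- counts / sum(counts)

richness(p)            # 4: one of the five possible species is absent
shannon(p)             # below the 4-species maximum because abundances are unequal
shannon(rep(1/4, 4))   # 2 = log2(4), the maximum for four equally abundant species
```

Because richness only counts non-zero entries, a single extra read from a rare species changes it by one, which is the sensitivity to sampling depth noted above.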

From Shannon to Evenness
- The Shannon index for a community of $k$ species has a maximum at $\log_2 k$.
- We can make different communities more comparable if we normalize by the maximum.
- The evenness index is computed as $E_k = S_k / \log_2 k$.
- $E_k = 1$ means total evenness.

Simpson index
Suppose we observe a community that can contain up to $k$ species, with relative proportions $P = \{p_1, \dots, p_k\}$.
The Simpson index is based on the probability of resampling the same species on two consecutive draws with replacement. Suppose on the first draw we picked species $i$; this event has probability $p_i$, hence the probability of drawing that species twice is $p_i \cdot p_i = p_i^2$. Summing over all species gives the probability that both draws are the same species.
The Simpson index is usually computed as
$D = 1 - (p_1^2 + p_2^2 + \dots + p_k^2)$
In this form, the index represents the probability that two individuals randomly selected from a sample will belong to different species.
- $D = 0$ means no diversity (one species is completely dominant).
- $D = 1$ means complete diversity; this value is only approached as the number of equally abundant species grows, since for $k$ species the maximum is $1 - 1/k$.

Numbers equivalent diversity
Often it is convenient to talk about alpha diversity in terms of equivalent units: how many equally abundant taxa would it take to get the same diversity as we see in a given community?
- For richness, the statistic is unchanged.
- For Shannon, remember that $\log_2 k$ is the maximum, attained when all species are equally abundant. Hence the diversity in equivalent units is $2^{S_k}$.
- For Simpson, the equivalent-units measure of diversity is $1/(1 - D)$, sometimes called the Inverse Simpson index.
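A short base R sketch of evenness, the Simpson index, and the two numbers-equivalent conversions described above. The function names are illustrative, and `k` is taken to be the length of the proportion vector (including absent species), which is an assumption about how "up to k species" should be counted.

```r
# Shannon index in bits, with the convention 0 * log2(0) = 0.
shannon <- function(p) {
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Evenness: Shannon index divided by its maximum log2(k).
evenness <- function(p) shannon(p) / log2(length(p))

# Simpson index: probability that two draws (with replacement) differ in species.
simpson <- function(p) 1 - sum(p^2)

# Numbers equivalents: how many equally abundant taxa give the same diversity.
equiv_shannon <- function(p) 2^shannon(p)
equiv_simpson <- function(p) 1 / (1 - simpson(p))   # inverse Simpson

uniform  <- rep(1/4, 4)       # four equally abundant species
dominant <- c(1, 0, 0, 0)     # one species completely dominant

evenness(uniform)        # 1: total evenness
simpson(uniform)         # 0.75 = 1 - 1/4, the maximum for k = 4
equiv_shannon(uniform)   # 4
equiv_simpson(uniform)   # 4
simpson(dominant)        # 0: no diversity
```

For the uniform community both conversions return the number of species, which is the sense in which they put different indices on a common scale.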

Beta Diversity
- Microbial ecologists typically use beta diversity as a broad umbrella term that can refer to any of several indices related to compositional differences (differences in species content between samples).
- For some reason this is contentious, and there appears to be ongoing (and pointless?) argument over the possible definitions.
- For our purposes, and in microbiome research, when you hear "beta diversity" you can probably think: "diversity of species composition".
http://en.wikipedia.org/wiki/beta_diversity

Summary of diversity types
- α: diversity within a community; number of species only.
- β: diversity between communities (differentiation); species identity is taken into account.
- γ: (global) diversity of the site.
Theoretically, one would wish to use measures that result in γ = α × β. This is only possible if α and β are independent of each other.

Beta Diversity in practice
1. UniFrac or Bray-Curtis distance between samples
2. MDS ("PCoA")
3. Plot first two axes
4. Admire clusters
5. Write paper
6. Choose new microbiomes
7. Return to Step 1, repeat
Why? Let's back up. This is one option in an arsenal of dimensional reduction methods that come from unsupervised learning in exploratory data analysis.

Dimensional Reduction
(figure panels: "Regress disc on weight" and "Regress weight on disc")
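Steps 1 to 3 of the beta-diversity workflow listed above can be sketched in base R. This is only an illustration under stated assumptions: the count table is invented, Bray-Curtis is written out by hand as the sum of absolute differences divided by the total abundance of both samples, and `cmdscale()` from the stats package stands in for whatever PCoA routine the lecture used. UniFrac is not shown because it needs a phylogenetic tree.

```r
# Hypothetical count table: rows are samples, columns are taxa.
counts <- rbind(
  s1 = c(10,  0, 5,  1),
  s2 = c( 8,  1, 6,  0),
  s3 = c( 0, 12, 1,  7),
  s4 = c( 1,  9, 0, 10)
)

# Step 1: Bray-Curtis dissimilarity between every pair of samples.
bray_curtis <- function(x) {
  n <- nrow(x)
  d <- matrix(0, n, n, dimnames = list(rownames(x), rownames(x)))
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      d[i, j] <- sum(abs(x[i, ] - x[j, ])) / sum(x[i, ] + x[j, ])
    }
  }
  as.dist(d)
}
d <- bray_curtis(counts)

# Step 2: classical MDS (PCoA) on the distance matrix, keeping two axes.
pcoa <- cmdscale(d, k = 2, eig = TRUE)

# Step 3: plot the first two axes and label the samples.
plot(pcoa$points, xlab = "Axis 1", ylab = "Axis 2", type = "n")
text(pcoa$points, labels = rownames(counts))
```

In this toy table s1 and s2 share dominant taxa, as do s3 and s4, so the within-pair dissimilarities are much smaller than the between-pair ones.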

Dimensional Reduction
- Minimize the distance to the line in both directions.
- The purple line is the principal component line.
- Principal components are linear combinations of the old variables.
- PCA finds the projection that maximizes the area of the shadow. An equivalent measure is the sum of squares of the distances between points in the projection. We want to see as much of the variation as possible, and that is what PCA does.

The PCA workflow (figure)

Ordination Using the Tree
1. UniFrac-PCoA
2. Double Principal Coordinates (DPCoA)
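The contrast between the two regression lines and the principal component line, and the statement that principal components are linear combinations of the old variables, can be checked numerically. The sketch below uses R's built-in `cars` data as a stand-in for the lecture's two-variable example (an assumption) and `prcomp()` on unscaled variables.

```r
# Two variables from a built-in data set (stand-in for the lecture's example).
x <- cars$speed
y <- cars$dist

# Regressing y on x minimizes vertical distances to the line.
slope_y_on_x <- unname(coef(lm(y ~ x))[2])

# Regressing x on y minimizes horizontal distances; invert to express it as dy/dx.
slope_x_on_y <- 1 / unname(coef(lm(x ~ y))[2])

# PCA on the centered, unscaled variables: PC1 minimizes perpendicular distances.
pc <- prcomp(cbind(x, y), center = TRUE, scale. = FALSE)
slope_pc1 <- unname(pc$rotation["y", "PC1"] / pc$rotation["x", "PC1"])

c(y_on_x = slope_y_on_x, pc1 = slope_pc1, x_on_y = slope_x_on_y)

# Principal component scores are linear combinations of the centered variables,
# with the columns of the rotation matrix as coefficients.
centered <- scale(cbind(x, y), center = TRUE, scale = FALSE)
scores <- centered %*% pc$rotation
all.equal(scores, pc$x, check.attributes = FALSE)   # TRUE
```

For positively correlated variables the principal component slope lies between the two regression slopes, which is what "both directions" refers to.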

Ordination Best Practice
1. Always look at the scree plot
2. Variables, Samples
3. Biplot
4. Altogether (if readable)

    library(ade4)
    # PCA on all columns of turtles except the first; keep 2 axes, no interactive prompt
    pca.turtles <- dudi.pca(turtles[, -1], scannf = FALSE, nf = 2)
    scatter(pca.turtles)

Questions to ask of an ordination
- How many axes are probably useful?
- Are there clusters? How many?
- Are there gradients?
- Are the patterns consistent with covariates (e.g. sample observations)? How might we test this?

Are there clusters? How many? → Gap Statistic
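For the first best-practice item and the question of how many axes are useful, a scree plot shows the share of variance carried by each axis. The lecture's example uses `dudi.pca()` from ade4; the sketch below instead uses base R's `prcomp()` on the built-in `USArrests` data, both of which are substitutions made for the sake of a self-contained example.

```r
# PCA on standardized variables of a built-in data set (stand-in for the lecture's data).
pc <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Eigenvalues are the squared standard deviations of the components.
eigenvalues <- pc$sdev^2
var_explained <- eigenvalues / sum(eigenvalues)

# Scree plot: proportion of variance per axis, in decreasing order.
barplot(var_explained,
        names.arg = paste0("PC", seq_along(var_explained)),
        ylab = "Proportion of variance")

# Cumulative proportion helps decide how many axes to keep.
cumsum(var_explained)
```

A sharp drop after the first one or two bars suggests that a two-axis plot captures most of the structure; a slowly decaying scree plot is a warning that it does not.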

Are there gradients? → PCA regression

Are the patterns consistent with covariates? How might we test this? → (Permutational) Multivariate ANOVA: vegan::adonis()
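To make the idea behind permutational multivariate ANOVA concrete, here is a minimal base R sketch of a one-factor test on a distance matrix: compute a pseudo-F statistic from within-group and total sums of squared distances, then compare it with the values obtained after shuffling the group labels. This is not the vegan implementation, and the function names `pseudo_f` and `permanova_sketch` are hypothetical; in practice one would call vegan's own function, which also handles multiple terms and strata. The built-in `iris` data with Euclidean distances are used only as a stand-in.

```r
# Pseudo-F for a single grouping factor, computed from pairwise distances.
pseudo_f <- function(d, groups) {
  d2 <- as.matrix(d)^2
  groups <- factor(groups)
  n <- nrow(d2)
  a <- nlevels(groups)
  # Total sum of squares: sum of squared distances over all pairs, divided by n.
  ss_total <- sum(d2[upper.tri(d2)]) / n
  # Within-group sum of squares: same quantity computed inside each group.
  ss_within <- sum(sapply(levels(groups), function(g) {
    idx <- which(groups == g)
    sub <- d2[idx, idx, drop = FALSE]
    sum(sub[upper.tri(sub)]) / length(idx)
  }))
  ss_among <- ss_total - ss_within
  (ss_among / (a - 1)) / (ss_within / (n - a))
}

# Permutation test: shuffle group labels and recompute the statistic.
permanova_sketch <- function(d, groups, nperm = 999) {
  f_obs <- pseudo_f(d, groups)
  f_perm <- replicate(nperm, pseudo_f(d, sample(groups)))
  p_value <- (1 + sum(f_perm >= f_obs)) / (1 + nperm)
  list(F = f_obs, p = p_value)
}

set.seed(1)
d <- dist(iris[, 1:4])                      # any 'dist' object works, e.g. Bray-Curtis
res <- permanova_sketch(d, iris$Species, nperm = 199)
res
```

The smallest p-value this procedure can return is 1 / (nperm + 1), so the number of permutations sets the resolution of the test.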