Multivariate analysis

1 Multivariate analysis
Prof. dr. Ann Vanreusel
- Multidimensional scaling
- SIMPER analysis
- BEST
- ANOSIM

2 Gradient in species composition, gradient in environment
[Figure: abundance table of species a to e across five sites along a beach zonation, with a plot on a Similarity axis and a 2D MDS plot of the sites; numeric values not recoverable.]

3 Clustering or classification: some disadvantages
- Even when the structure in the data matrix is continuous, the output is discontinuous (clusters).
- Variation in communities is more often continuous than discontinuous.
- Clustering is still useful in ecology, mainly in combination with ordination, to recognize structure (communities) in large data matrices.

4 Non-metric multidimensional scaling (MDS) = ordination
- Points close together = sites similar in (species) composition
- Points far apart = sites dissimilar in (species) composition
MDS
- The original (species) composition data are replaced by a matrix of dissimilarity values between sites; this matrix is used to obtain the ordination diagram.
- The choice of dissimilarity measure specifies what "similar" means.
- A measure is needed that expresses how well or badly the distances in the ordination diagram correspond to the dissimilarity values = the stress function.
- MDS chooses the configuration that minimizes the stress.

5 Metric ordination (CA, PCA)
- The stress function depends on the actual numerical values of the dissimilarities
- Chi-square distance: CA
- Euclidean distance: PCA
Non-metric ordination (MDS)
- The stress function depends only on the rank order of the dissimilarities
Characteristics of non-metric MDS
- Better flexibility
- Complex algorithm
- Simple rationale
- Few if any assumptions
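A minimal Python sketch (not part of the original slides) of why the non-metric approach is flexible: any monotone transformation of the dissimilarities leaves their rank order, and therefore the non-metric solution, unchanged. The dissimilarity values and the helper name `rank_order` are made up for illustration.

```python
import numpy as np

def rank_order(values):
    """Return ranks 1..n (1 = smallest value); assumes no tied values."""
    values = np.asarray(values, dtype=float)
    ranks = np.empty(len(values), dtype=int)
    ranks[np.argsort(values)] = np.arange(1, len(values) + 1)
    return ranks

# Hypothetical dissimilarities between four pairs of sites (made-up values).
dissimilarities = np.array([12.0, 55.0, 80.0, 31.0])

# A monotone transformation changes the numbers but not their order.
transformed = np.sqrt(dissimilarities)

ranks_original = rank_order(dissimilarities)
ranks_transformed = rank_order(transformed)
print(ranks_original, ranks_transformed)
```

A metric method would treat `dissimilarities` and `transformed` as different inputs; a method that only uses ranks cannot tell them apart.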

6 Based on ranks of similarities
Raw data -> similarities -> ranks -> ordination
The highest similarity gets the lowest rank (rank 1).

7 Raw counts -> Bray-Curtis similarity matrix -> ranks -> ordination diagram
[Figure: raw count table (species a to e in five sites), the Bray-Curtis similarity matrix between sites, the corresponding rank matrix, and the resulting 2D MDS plot (Resemblance: S17 Bray Curtis similarity); numeric values not recoverable.]
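The first two steps of this chain can be sketched in Python. This is an illustration under stated assumptions, not the slide's own data: the count table below is invented, the function names are hypothetical, and the standard Bray-Curtis similarity 100 * (1 - sum|y_ij - y_ik| / sum(y_ij + y_ik)) is assumed.

```python
import numpy as np

def bray_curtis_similarity(counts):
    """Bray-Curtis similarity (0-100) between the columns (sites) of a species-by-site table."""
    counts = np.asarray(counts, dtype=float)
    n_sites = counts.shape[1]
    sim = np.full((n_sites, n_sites), 100.0)
    for j in range(n_sites):
        for k in range(j + 1, n_sites):
            diff = np.abs(counts[:, j] - counts[:, k]).sum()
            total = (counts[:, j] + counts[:, k]).sum()  # assumed > 0 (no pair of empty sites)
            sim[j, k] = sim[k, j] = 100.0 * (1.0 - diff / total)
    return sim

def similarity_ranks(sim):
    """Rank the lower-triangle similarities; rank 1 = highest similarity, ties averaged."""
    rows, cols = np.tril_indices_from(sim, k=-1)
    values = sim[rows, cols]
    ranks = np.empty(len(values))
    for i, v in enumerate(values):
        # values strictly larger than v come first; ties share the average position
        ranks[i] = (values > v).sum() + ((values == v).sum() + 1) / 2.0
    return rows, cols, ranks

# Hypothetical counts: rows = 3 species, columns = 3 sites (made-up values).
counts = [[10, 8, 0],
          [5, 5, 1],
          [0, 2, 9]]

sim = bray_curtis_similarity(counts)
rows, cols, ranks = similarity_ranks(sim)
for r, c, rank in zip(rows, cols, ranks):
    print(f"site {c + 1} vs site {r + 1}: similarity {sim[r, c]:.2f}, rank {rank:.0f}")
```

Only the ranks are passed on to the ordination; the similarity values themselves are no longer used.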

8 What are the stages in the construction of an MDS diagram?
An iterative procedure: the positions of the points are successively refined until they satisfy, as closely as possible, the dissimilarity relationships between samples.
I. Specify the number of dimensions (usually 2).
II. Choose a starting configuration of the samples (any configuration will do).
III. Regress the interpoint distances from this plot on the corresponding dissimilarities.

9 Shepard diagram
- Non-parametric regression = non-metric MDS (a parametric regression would give metric MDS)
- = the best-fitting line, which moulds itself to the shape of the scatterplot
- = constrained to increase (a series of steps)

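10 IV. Measure the goodness of fit of the regression by calculating the stress value:
$$\text{Stress} = \frac{\sum_j \sum_k (d_{jk} - \hat{d}_{jk})^2}{\sum_j \sum_k d_{jk}^2}$$
where $d_{jk}$ is the distance between samples j and k in the ordination and $\hat{d}_{jk}$ is the distance predicted from the regression line. Larger scatter = larger stress.
V. Move the points to new positions, in the direction that decreases the stress most rapidly.
VI. Repeat steps III to V until no further improvement in stress can be achieved.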
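Steps III and IV can be sketched in Python for a single iteration. This is a minimal illustration, not the software's implementation: the monotone (step-shaped) regression is done with a simple pool-adjacent-violators routine, the function names are hypothetical, the input values are made up, and the stress is the ratio written above (some software reports the square root of this ratio, Kruskal's stress formula 1).

```python
import numpy as np

def monotone_regression(y):
    """Pool-adjacent-violators: least-squares non-decreasing fit to the sequence y."""
    blocks = []  # each block is [mean, size]
    for value in y:
        blocks.append([float(value), 1])
        # merge neighbouring blocks while their means violate the increasing order
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            mean2, n2 = blocks.pop()
            mean1, n1 = blocks.pop()
            blocks.append([(mean1 * n1 + mean2 * n2) / (n1 + n2), n1 + n2])
    return np.concatenate([np.full(n, mean) for mean, n in blocks])

def stress(dissimilarities, distances):
    """Sum of squared residuals around the step function, divided by the sum of squared distances."""
    dissimilarities = np.asarray(dissimilarities, dtype=float)
    distances = np.asarray(distances, dtype=float)
    order = np.argsort(dissimilarities, kind="mergesort")
    d = distances[order]            # ordination distances, sorted by increasing dissimilarity
    d_hat = monotone_regression(d)  # fitted values read from the step function
    return ((d - d_hat) ** 2).sum() / (d ** 2).sum()

# Hypothetical values for four pairs of samples (made up for illustration).
dissimilarities = [10.0, 20.0, 30.0, 40.0]
perfect = stress(dissimilarities, [1.0, 2.0, 3.0, 4.0])    # distances in the same order
imperfect = stress(dissimilarities, [1.0, 3.0, 2.0, 4.0])  # two pairs swapped
print(perfect, imperfect)
```

When the distances follow the rank order of the dissimilarities exactly, the step function passes through every point and the stress is zero; any inversion adds scatter and raises the stress.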

11 The iterative procedure gradually finds its way down to a minimum of the stress function. Traps:
- Local minimum of the stress function instead of the global minimum. Repeat the MDS starting from different random positions of the samples; if the same solution re-appears, it is probably the best solution.
- Degenerate solutions, e.g. if the data divide into two groups with no species in common. There is then no sensible way to determine how far apart the groups should be placed in the MDS plot (they are infinitely far apart). Run two separate analyses.

12 Adequacy of the MDS ordination
Is the stress value small? Is a 2-dimensional plot a usable summary of the sample relationships?
- Stress < 0.05: excellent
- Stress < 0.1: good
- Stress < 0.2: potentially useful
- Stress > 0.3: points are close to arbitrarily placed in 2-dimensional space
Does the Shepard diagram appear satisfactory? The stress value totals the scatter around the regression line in the Shepard diagram.
Outliers might need a higher-dimensional representation for accurate placement.

13 Strengths
- Simple in concept
- Based on relevant sample information
- Species deletions are unnecessary
- Generally applicable
- Similarities can be given unequal weight
Weaknesses
- Computationally demanding
- Convergence to the global minimum of stress is not guaranteed
- The algorithm places most weight on the large distances

14 [Figures: one ordination based on a road distance matrix, one based on a real distance matrix.]

15 Bubble plots: distribution of species over stations
[Figure: 2D MDS plots of the sites (Resemblance: S17 Bray Curtis similarity) with bubbles showing the abundance of species a, species c and species e; numeric values not recoverable.]

16 ANOSIM (analysis of similarities)
- Tests for statistically significant differences between groups
- Requires an a priori defined structure within the set of samples (e.g. replicates)
- = a simple non-parametric permutation procedure applied to the (rank) similarity matrix
Null hypothesis: no significant differences in community composition between the a priori defined groups.

17 [Figure: abundance table of spec A to spec F in nine samples (st1a, st1b, st1c, st2a, st2b, st2c, st3a, st3b, st3c) and the corresponding 2D MDS plot (Resemblance: S17 Bray Curtis similarity); numeric values not recoverable.]
Are there significant differences in species composition between sites?

18 Cf. ANOVA
Compute a test statistic R reflecting the observed differences between sites, contrasted with the differences among replicates within sites.
The test is based on distances between and within sites or, better, on ranked similarities.
R is based on the difference between
- the average of the rank similarities of all pairs of replicates between sites ($\bar{r}_B$), and
- the average of the rank similarities of all pairs of replicates within sites ($\bar{r}_W$):
$$R = \frac{\bar{r}_B - \bar{r}_W}{\left(n(n-1)/2\right)/2}$$
with n the total number of samples.
R = 1 when all replicates within sites are more similar to each other than to any replicates from different sites.

19 Rationale of the permutation test
All possible allocations of the replicate labels to the samples are examined and the R statistic is calculated for each (all = a large number of times).
If the observed R statistic falls outside the range of R values obtained after permutation, H0 is rejected (H0: no site differences).
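The R statistic and the permutation rationale can be sketched in Python. This is a sketch under stated assumptions rather than the software's procedure: labels are reshuffled at random a fixed number of times instead of enumerating every possible allocation, ranks are assigned so that rank 1 is the most similar pair, and the dissimilarity matrix, group labels and function names are invented for illustration.

```python
import numpy as np

def average_ranks(values):
    """Ranks 1..n with rank 1 = smallest value; tied values share their average rank."""
    values = np.asarray(values, dtype=float)
    ranks = np.empty(len(values))
    ranks[np.argsort(values, kind="mergesort")] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        tied = values == v
        ranks[tied] = ranks[tied].mean()
    return ranks

def anosim_r(dissim, groups):
    """R = (mean between-group rank - mean within-group rank) / (M / 2), with M = n(n-1)/2 pairs."""
    dissim = np.asarray(dissim, dtype=float)
    groups = np.asarray(groups)
    rows, cols = np.triu_indices(len(groups), k=1)
    ranks = average_ranks(dissim[rows, cols])  # rank 1 = least dissimilar (most similar) pair
    within = groups[rows] == groups[cols]
    m = len(ranks)
    return (ranks[~within].mean() - ranks[within].mean()) / (m / 2.0)

def anosim_test(dissim, groups, n_perm=999, seed=0):
    """Return the observed R and the share of label shuffles giving R >= observed R."""
    rng = np.random.default_rng(seed)
    observed = anosim_r(dissim, groups)
    permuted = np.array([anosim_r(dissim, rng.permutation(groups)) for _ in range(n_perm)])
    p_value = (1 + (permuted >= observed).sum()) / (n_perm + 1)
    return observed, p_value

# Hypothetical example: six samples on a line, two well-separated groups of three.
positions = np.array([0.0, 1.0, 2.0, 10.0, 11.0, 12.0])
dissim = np.abs(positions[:, None] - positions[None, :])
groups = np.array([1, 1, 1, 2, 2, 2])

observed_r, p_value = anosim_test(dissim, groups)
print(observed_r, p_value)
```

With only three replicates per group there are very few distinct allocations, so the smallest attainable p-value is limited no matter how clear the separation is; this is why the number of possible permutations is reported alongside the test.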

20 Global test
Sample statistic (Global R): 0,934
Significance level of sample statistic: 0,4%
Number of permutations: 280 (all possible permutations)
Number of permuted statistics greater than or equal to Global R: 1
Pairwise tests
[Table with columns Groups, R statistic, Significance level %, Possible permutations, Actual permutations, Number >= observed; values not recoverable, with a slide annotation "= low".]
[Figure: 2D MDS plot of samples st1a to st3c (Resemblance: S17 Bray Curtis similarity).]
Significant differences in species composition between sites? ANOSIM, H0: no site differences. The significance level is low and R is close or equal to 1, so H0 is rejected: the sites are different.

21 If the observed R statistic falls outside the range of R values obtained after permutation, H0 is rejected (H0: no site differences).
Test: sample statistic (Global R) = 0,934. Such a value is very unlikely under H0: about 4 times in a thousand trials (p = 0,4 %).
[Figure: frequency histogram of the permuted R values; x-axis R, y-axis Frequency.]

22 So far: the global test. Pairwise tests compare specific pairs of sites.
- Repeated significance tests accumulate the risk of drawing an incorrect conclusion (type I error).
- The global test is the most reliable: higher number of replicates, sufficient permutations.
- In pairwise tests, look at R rather than at p.
- R approaching 1: separation (with a low stress value this is also obvious from the MDS).
- R approaching 0: no separation.
ANOSIM is also available for two-way layouts.

23 Correlation with environmental variables: BEST analysis
Selects the environmental variables, or species, "best explaining" the community pattern, by maximising a rank correlation between their respective resemblance matrices. Two algorithms are available:
- BIOENV: all combinations of the trial variables are tried.
- BVSTEP: a stepwise search over the trial variables. Use BVSTEP if there is a large number of trial variables and BIOENV is too slow.

24 BIO-ENV: linking community analysis to environmental variables
To what extent are physico-chemical variables related to (do they "explain") the observed biological pattern?
- By superimposing univariate measures on top of the MDS plot

25 MDS is repeated for specific combinations of environmental variables to find the best-fitting environmental combination.
The match between any two plots is measured by comparing the ranks of the two similarity matrices through a (weighted) rank correlation coefficient (beware of collinearity).
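The matching step can be sketched in Python as an exhaustive search over subsets of environmental variables. This is an illustration only: it uses an unweighted Spearman rank correlation (the slides mention a weighted coefficient), normalised Euclidean distance for the environmental data, and invented variable names and values.

```python
import itertools
import numpy as np

def average_ranks(values):
    """Ranks 1..n with rank 1 = smallest value; tied values share their average rank."""
    values = np.asarray(values, dtype=float)
    ranks = np.empty(len(values))
    ranks[np.argsort(values, kind="mergesort")] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        tied = values == v
        ranks[tied] = ranks[tied].mean()
    return ranks

def spearman(a, b):
    """Spearman rank correlation = Pearson correlation of the ranks."""
    return np.corrcoef(average_ranks(a), average_ranks(b))[0, 1]

def pairwise_euclidean(table):
    """Euclidean distances between rows (samples), returned as an upper-triangle vector."""
    table = np.asarray(table, dtype=float)
    rows, cols = np.triu_indices(table.shape[0], k=1)
    return np.sqrt(((table[rows] - table[cols]) ** 2).sum(axis=1))

def bioenv(bio_dissim, env, names):
    """Score every subset of environmental variables; best (highest rho) first."""
    env = np.asarray(env, dtype=float)
    env = (env - env.mean(axis=0)) / env.std(axis=0)  # normalise each variable
    results = []
    for size in range(1, env.shape[1] + 1):
        for subset in itertools.combinations(range(env.shape[1]), size):
            rho = spearman(bio_dissim, pairwise_euclidean(env[:, list(subset)]))
            results.append((rho, tuple(names[i] for i in subset)))
    return sorted(results, reverse=True)

# Hypothetical data: five samples, two environmental variables (made-up values).
names = ["depth", "noise"]
env = np.array([[1.0, 3.0], [2.0, 1.0], [3.0, 4.0], [4.0, 1.0], [5.0, 5.0]])
# Made-up biotic dissimilarities, constructed to follow the depth gradient.
bio_dissim = pairwise_euclidean(env[:, [0]]) ** 2

results = bioenv(bio_dissim, env, names)
for rho, subset in results:
    print(f"{rho:.3f}  {subset}")
```

The number of subsets doubles with every added variable, which is why a stepwise search becomes attractive when there are many trial variables.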

28 SIMPER (similarity percentages)
An MDS based on a species similarity matrix often has high stress. Therefore, concentrate on the sample similarities and highlight the species responsible for determining the sample groupings in the cluster or ordination analysis.
- Compute the average dissimilarity (δ) between all pairs of intergroup samples (= every sample in group 1 paired with every sample in group 2).
- Break this average down into the separate contributions from each species to δ.
A species is a good discriminating species when
- it contributes much to the dissimilarity between groups 1 and 2 (its contribution to δ is large), and
- it does so consistently in the comparisons of all samples between the 2 groups (the standard deviation of its contribution is small).
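The breakdown of δ into species contributions can be sketched in Python. This is a minimal illustration assuming Bray-Curtis dissimilarity, where the contribution of species i to the dissimilarity between samples j and k is 100 * |y_ij - y_ik| / sum(y_ij + y_ik); the count table, group labels and function name are invented.

```python
import numpy as np

def simper(counts, groups, group_a, group_b):
    """Per-species contribution to the average Bray-Curtis dissimilarity between two groups."""
    counts = np.asarray(counts, dtype=float)  # rows = species, columns = samples
    groups = np.asarray(groups)
    cols_a = np.where(groups == group_a)[0]
    cols_b = np.where(groups == group_b)[0]
    contributions = []                        # one row per between-group pair of samples
    for j in cols_a:
        for k in cols_b:
            total = (counts[:, j] + counts[:, k]).sum()
            contributions.append(100.0 * np.abs(counts[:, j] - counts[:, k]) / total)
    contributions = np.array(contributions)
    av_diss = contributions.mean(axis=0)      # average contribution of each species
    sd = contributions.std(axis=0, ddof=1)    # consistency of that contribution
    # Diss/SD ratio; left as NaN where the standard deviation is zero
    ratio = np.divide(av_diss, sd, out=np.full_like(av_diss, np.nan), where=sd > 0)
    percent = 100.0 * av_diss / av_diss.sum()
    return av_diss, ratio, percent

# Hypothetical counts: 3 species x 4 samples, two groups of two samples (made-up values).
species = ["spec A", "spec B", "spec C"]
counts = [[10, 12, 0, 0],   # found only in group 1
          [5, 5, 5, 5],     # identical everywhere
          [0, 0, 8, 10]]    # found only in group 2
groups = [1, 1, 2, 2]

av_diss, ratio, percent = simper(counts, groups, 1, 2)
for name, d, r, p in zip(species, av_diss, ratio, percent):
    print(f"{name}: Av.Diss {d:.2f}, Diss/SD {r:.2f}, Contrib% {p:.2f}")
```

The sum of `av_diss` over species equals the average dissimilarity between the two groups, and a large Diss/SD ratio marks a species that separates the groups consistently.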

29 Species that are good discriminators between groups are indicated by *

30 E. affinis explains almost 3 % of the intra-group similarity: a typical species (not necessarily a good discriminator).

31 SIMPER output: dissimilarity between groups (samples st1a, st1b, st1c, st2a, st2b, st2c, st3a, st3b, st3c; species spec A to spec F). [Some digits were lost in extraction; values are shown as they survive.]
Groups 1 & 2, average dissimilarity = 19,61
Species | Av.Abund Group 1 | Av.Abund Group 2 | Av.Diss | Diss/SD | Contrib% | Cum.%
spec A | 1,33 | 3,67 | 6,87 | 3,33 | 3,4 | 3,4
spec B | 4, | 2, | ,82 | 1,69 | 29,69 | 64,73
spec D | 6, | , | 3,67 | 1,22 | 18,7 | 83,44
Groups 1 & 3, average dissimilarity = 8,27
Species | Av.Abund Group 1 | Av.Abund Group 3 | Av.Diss | Diss/SD | Contrib% | Cum.%
spec D | 6, | 21, | 24,69 | 6,78 | 3,76 | 3,76
spec F | , | 12,33 | 2,9 | 4,78 | 2,2 | ,78
Spec E | , | 1, | 16,42 | 11,19 | 2,46 | 76,23
Groups 2 & 3, average dissimilarity = 83,2
Species | Av.Abund Group 2 | Av.Abund Group 3 | Av.Diss | Diss/SD | Contrib% | Cum.%
spec D | , | 21, | 26,9 | 6,69 | 32,37 | 32,37
spec F | , | 12,33 | 2,3 | 4,81 | 24,66 | 7,3
Spec E | , | 1, | 16,79 | 11,29 | 2,16 | 77,19

32 SIMPER output: similarity within groups (samples st1a, st1b, st1c, st2a, st2b, st2c, st3a, st3b, st3c; species spec A to spec F). [Some digits were lost in extraction; values are shown as they survive.]
Group 1, average similarity: 84,93
Species | Av.Abund | Av.Sim | Sim/SD | Contrib% | Cum.%
spec D | 6, | 3,34 | 6,6 | 3,72 | 3,72
Spec C | 6,33 | 3,13 | 24,8 | 3,48 | 71,2
spec B | 4, | 18,78 | 9, | 22,11 | 93,32
Group 2, average similarity: 87,6
Species | Av.Abund | Av.Sim | Sim/SD | Contrib% | Cum.%
Spec C | ,67 | 32,6 | 21,8 | 37,21 | 37,21
spec D | , | 26,46 | 14,13 | 3,2 | 67,42
spec A | 3,67 | 2,32 | 9,16 | 23,2 | 9,61
Group 3, average similarity: 9,4
Species | Av.Abund | Av.Sim | Sim/SD | Contrib% | Cum.%
spec D | 21, | 4,47 | 11,24 | , | ,
spec F | 12,33 | 23,1 | 7,24 | 2, | 76,
Spec E | 1, | 21,6 | 13,21 | 23,9 | 1,

DETECTING BIOLOGICAL AND ENVIRONMENTAL CHANGES: DESIGN AND ANALYSIS OF MONITORING AND EXPERIMENTS (University of Bologna, 3-14 March 2008)
