Multivariate analysis


Multivariate analysis
Prof. dr. Ann Vanreusel

- Multidimensional scaling (MDS)
- SIMPER analysis
- BEST
- ANOSIM

[Figure: beach zonation example. Sites 1 to 5 lie along a gradient in environment and show a matching gradient in species composition. Panels: species-by-site abundance table (species a to e), a plot with a similarity axis, and an MDS ordination of the sites with its stress value.]

Clustering or classification: some disadvantages
- Even when there is continuous structure in the data matrix, the output is discontinuous (clusters).
- Variation in communities is usually continuous rather than discontinuous.
- Clustering is nevertheless useful in ecology, mainly in combination with ordination, to recognize structure (communities) in large data matrices.

Non-metric multidimensional scaling (MDS) = ordination
- Points close together = sites similar in (species) composition.
- Points far apart = sites dissimilar in (species) composition.
- The original (species) composition data are replaced by a matrix of dissimilarity values between sites; this matrix is used to obtain the ordination diagram.
- The choice of dissimilarity measure specifies what "similar" means.
- A measure is needed that expresses how well or badly the distances in the ordination diagram correspond to the dissimilarity values: the stress function.
- MDS chooses the configuration that minimizes the stress.

Metric ordination (CA, PCA)
- The stress function depends on the actual numerical values of the dissimilarities: chi-square distance for CA, Euclidean distance for PCA.

Non-metric ordination (MDS)
- The stress function depends only on the rank order of the dissimilarities.
- Characteristics: greater flexibility, complex algorithm, simple rationale, few if any assumptions.

MDS is based on the ranks of the similarities: raw data -> similarities -> ranks -> ordination. The highest similarity receives the lowest rank (rank 1).

[Figure: worked example for the five sites. The raw counts (species a to e) are converted to a Bray-Curtis similarity matrix, the similarities are replaced by their ranks, and the ranks are used to build the 2D ordination diagram (resemblance: S17 Bray-Curtis similarity).]
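The chain raw counts -> Bray-Curtis similarities -> ranks can be sketched in a few lines. This is a minimal illustration, not the PRIMER implementation: the counts are hypothetical (they are not the slide data), the function names are invented for the example, and tied similarities are assumed to share their mean rank.

```python
from itertools import combinations

def bray_curtis_similarity(x, y):
    """Bray-Curtis similarity (0-100) between two samples of abundances."""
    num = sum(abs(a - b) for a, b in zip(x, y))
    den = sum(a + b for a, b in zip(x, y))
    if den == 0:
        raise ValueError("both samples are empty; similarity is undefined")
    return 100.0 * (1.0 - num / den)

def rank_similarities(sims):
    """Rank pairs so the highest similarity gets rank 1; ties share the mean rank."""
    ordered = sorted(sims, key=lambda pair: -sims[pair])
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        while j + 1 < len(ordered) and sims[ordered[j + 1]] == sims[ordered[i]]:
            j += 1
        for pair in ordered[i:j + 1]:
            ranks[pair] = (i + j + 2) / 2.0  # mean of ranks i+1 .. j+1
        i = j + 1
    return ranks

# Hypothetical counts (assumption): one list of species abundances per site.
samples = {
    "site1": [10, 3, 0, 0],
    "site2": [8, 4, 1, 0],
    "site3": [1, 2, 6, 3],
    "site4": [0, 0, 7, 5],
}
sims = {(a, b): bray_curtis_similarity(samples[a], samples[b])
        for a, b in combinations(samples, 2)}
ranks = rank_similarities(sims)
for pair in sorted(ranks, key=ranks.get):
    print(pair, round(sims[pair], 2), ranks[pair])
```

Only the ranks are passed on to the non-metric ordination; the actual similarity values are not used after this step.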

What are the stages in the construction of an MDS diagram?
MDS is an iterative procedure: the positions of the points are successively refined until they satisfy, as closely as possible, the dissimilarity relationships between samples.
I. Specify the number of dimensions (usually 2).
II. Construct a starting configuration of the samples (any configuration will do).
III. Regress the interpoint distances from this plot on the corresponding dissimilarities.

Shepard diagram
- Non-parametric regression = non-metric MDS (an ordinary regression would give metric MDS).
- The fitted line is the best-fitting line that moulds itself to the shape of the scatterplot.
- It is constrained to increase, so it appears as a series of steps.

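IV. Measure the goodness of fit of the regression by calculating the stress value:

$$\text{Stress} = \frac{\sum_j \sum_k (d_{jk} - \hat{d}_{jk})^2}{\sum_j \sum_k d_{jk}^2}$$

where $d_{jk}$ is the distance between samples j and k in the ordination plot and $\hat{d}_{jk}$ is the distance predicted from the regression line. Larger scatter = larger stress.
V. Move the points to new positions, in the direction that decreases the stress most rapidly.
VI. Repeat steps III to V until no further improvement of the stress can be achieved.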
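Steps III and IV can be illustrated with a small sketch. It assumes the monotone (step-shaped) regression is fitted with the pool-adjacent-violators algorithm, which is one standard way to obtain a non-decreasing least-squares fit, and it uses the stress ratio as written above without a square root (Kruskal's stress formula 1 takes the square root of this ratio). The function names and the three-point example are hypothetical.

```python
import math

def monotone_regression(values):
    """Pool-adjacent-violators: least-squares non-decreasing fit to a sequence."""
    blocks = []  # each block holds [sum of values, number of values]
    for v in values:
        blocks.append([v, 1])
        # Merge blocks while the previous block mean exceeds the last block mean.
        while (len(blocks) > 1 and
               blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)
    return fitted

def stress(points, dissimilarities):
    """points: name -> coordinates; dissimilarities: (a, b) -> dissimilarity."""
    # Order the pairs by dissimilarity, as on the x-axis of a Shepard diagram.
    pairs = sorted(dissimilarities, key=lambda p: dissimilarities[p])
    d = [math.dist(points[a], points[b]) for a, b in pairs]
    d_hat = monotone_regression(d)  # distances predicted by the step function
    num = sum((x - y) ** 2 for x, y in zip(d, d_hat))
    den = sum(x ** 2 for x in d)
    return num / den

# Hypothetical example: dissimilarities rank the pairs AB < BC < AC.
diss = {("A", "B"): 10, ("B", "C"): 20, ("A", "C"): 50}
good = {"A": (0, 0), "B": (1, 0), "C": (3, 0)}  # distances respect the ranking
bad = {"A": (0, 0), "B": (3, 0), "C": (1, 0)}   # distances reverse the ranking
print(stress(good, diss), stress(bad, diss))
```

A configuration whose distances follow the rank order of the dissimilarities has zero stress, whatever the actual dissimilarity values are. Tied dissimilarities are not treated specially in this sketch.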

The iterative procedure gradually finds its way down to a minimum of the stress function. Traps:
- Local minimum of the stress function instead of the global minimum. Repeat the MDS starting from different random positions of the samples; if the same solution reappears, it is likely to be the best solution.
- Degenerate solutions, for instance if the data divide into two groups with no species in common. It then makes no sense to determine how far apart the groups should be placed in the MDS plot (they are infinitely far apart). Run two separate analyses.

Adequacy of the MDS ordination
- Is the stress value small? Is a 2-dimensional plot a usable summary of the sample relationships?
  Stress < 0.05: excellent
  Stress < 0.1: good
  Stress < 0.2: potentially useful
  Stress > 0.3: points are close to arbitrarily placed in 2-dimensional space
- Does the Shepard diagram appear satisfactory? The stress value totals the scatter around the regression line in the Shepard diagram.
- Outliers might need a higher-dimensional representation for accurate placement.

Strengths
- Simple in concept
- Based on relevant sample information
- Species deletions are unnecessary
- Generally applicable
- Similarities can be given unequal weight

Weaknesses
- Computationally demanding
- Convergence to the global minimum of stress is not guaranteed
- The algorithm places most weight on the large distances

[Figure: two MDS examples, one based on a road distance matrix and one based on a real distance matrix.]

Bubble plots: distribution of species over stations
[Figure: 2D MDS plots of sites 1 to 5 (resemblance: S17 Bray-Curtis similarity), with bubbles scaled to the abundance of species a, species c and species e.]

ANOSIM (analysis of similarities)
- Tests for statistically significant differences between groups.
- Requires an a priori defined structure within the set of samples (e.g. replicates).
- It is a simple non-parametric permutation procedure applied to the (rank) similarity matrix.
- Null hypothesis: no differences in community composition between the a priori defined groups.

Example data set: six species (spec A to spec F) counted in nine samples, three replicates (a, b, c) at each of three sites (st1, st2, st3).
[Table: species-by-sample abundances.]
[Figure: 2D MDS plot of the nine samples, labelled by site (1, 2, 3); resemblance: S17 Bray-Curtis similarity.]
Are there significant differences in species composition between sites?

Compare with ANOVA: compute a test statistic R that reflects the observed differences between sites, contrasted with the differences among replicates within sites. The test could be based on distances between and within sites, but it is better based on ranked similarities. R is based on the difference between
- the average of the rank similarities of all pairs of replicates between sites ($\bar{r}_B$), and
- the average of the rank similarities of all pairs of replicates within sites ($\bar{r}_W$):

$$R = \frac{\bar{r}_B - \bar{r}_W}{\tfrac{1}{2}\, n(n-1)/2}$$

where n is the total number of samples. R = 1 when all replicates within sites are more similar to each other than to any replicates from different sites.
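The R statistic can be computed directly from a dissimilarity matrix. The sketch below ranks the dissimilarities in ascending order, so that the most similar pair gets rank 1, and lets tied values share their mean rank; the function name, the sample names and the dissimilarity values are hypothetical.

```python
def anosim_r(dissim, groups):
    """dissim: (sample_a, sample_b) -> dissimilarity; groups: sample -> group label."""
    # Rank the pairs: rank 1 = lowest dissimilarity = highest similarity.
    ordered = sorted(dissim, key=lambda p: dissim[p])
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        while j + 1 < len(ordered) and dissim[ordered[j + 1]] == dissim[ordered[i]]:
            j += 1
        for p in ordered[i:j + 1]:
            ranks[p] = (i + j + 2) / 2.0  # tied pairs share the mean rank
        i = j + 1
    between = [ranks[(a, b)] for a, b in dissim if groups[a] != groups[b]]
    within = [ranks[(a, b)] for a, b in dissim if groups[a] == groups[b]]
    m = len(dissim)  # M = n(n-1)/2 pairs of samples
    r_b = sum(between) / len(between)
    r_w = sum(within) / len(within)
    return (r_b - r_w) / (m / 2.0)

# Hypothetical example: two sites with two replicates each.
groups = {"a1": "A", "a2": "A", "b1": "B", "b2": "B"}
separated = {("a1", "a2"): 10, ("b1", "b2"): 15, ("a1", "b1"): 60,
             ("a1", "b2"): 70, ("a2", "b1"): 65, ("a2", "b2"): 80}
mixed = {("a1", "a2"): 60, ("b1", "b2"): 10, ("a1", "b1"): 15,
         ("a1", "b2"): 70, ("a2", "b1"): 65, ("a2", "b2"): 80}
print(anosim_r(separated, groups), anosim_r(mixed, groups))
```

In the first matrix every within-site pair is more similar than every between-site pair, so R reaches its maximum of 1; in the second matrix one between-site pair is more similar than a within-site pair and R drops.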

Rationale of the permutation test: all possible allocations of the replicate labels to the samples are examined (or, when there are too many, a large number of random allocations), and the R statistic is calculated for each. If the observed R statistic falls outside the range of R values obtained after permutation, H0 is rejected (H0: no site differences).
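A minimal sketch of the permutation test, assuming a design small enough to enumerate every relabelling. The data are hypothetical (six samples placed on a line, three per site), and the function names are invented for the example. The code enumerates distinct label vectors rather than distinct partitions of the samples; every partition corresponds to the same number of label vectors, so the proportion of R values at least as large as the observed one is the same.

```python
from itertools import combinations, permutations

def rank_pairs(dissim):
    """Rank pairs by dissimilarity (rank 1 = most similar); ties share the mean rank."""
    ordered = sorted(dissim, key=lambda p: dissim[p])
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        while j + 1 < len(ordered) and dissim[ordered[j + 1]] == dissim[ordered[i]]:
            j += 1
        for p in ordered[i:j + 1]:
            ranks[p] = (i + j + 2) / 2.0
        i = j + 1
    return ranks

def r_statistic(ranks, labels):
    """ANOSIM R for samples indexed 0..n-1 with the given group labels."""
    between = [r for (i, j), r in ranks.items() if labels[i] != labels[j]]
    within = [r for (i, j), r in ranks.items() if labels[i] == labels[j]]
    return (sum(between) / len(between) - sum(within) / len(within)) / (len(ranks) / 2.0)

def anosim_exact(dissim, labels):
    """Observed R and the proportion of relabellings with R >= observed."""
    ranks = rank_pairs(dissim)
    observed = r_statistic(ranks, labels)
    labelings = set(permutations(labels))  # distinct reallocations of the labels
    hits = sum(1 for lab in labelings if r_statistic(ranks, lab) >= observed - 1e-12)
    return observed, hits / len(labelings)

# Hypothetical data: two well separated sites, three replicates each.
positions = [0, 1, 2, 10, 11, 13]
labels = ("A", "A", "A", "B", "B", "B")
dissim = {(i, j): abs(positions[i] - positions[j])
          for i, j in combinations(range(len(positions)), 2)}
observed_r, p_value = anosim_exact(dissim, labels)
print(observed_r, p_value)
```

With three replicates in each of two groups there are only 10 distinct ways to split the samples, so the smallest attainable significance level is 10%, however clear the separation. For larger designs a random subset of relabellings would replace the full enumeration.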

Global test
Sample statistic (Global R): 0,934
Significance level of sample statistic: 0,4%
Number of permutations: 280 (all possible permutations)
Number of permuted statistics greater than or equal to Global R: 1

Pairwise tests
Groups   R statistic   Significance level %   Possible permutations   Actual permutations   Number >= observed
1, 2     0,74          10                     10                      10                    1
1, 3     1             10                     10                      10                    1
2, 3     1             10                     10                      10                    1
(the number of possible permutations per pair is low)

[Figure: 2D MDS plot of the nine samples, labelled by site (1, 2, 3); resemblance: S17 Bray-Curtis similarity.]
Significant differences in species composition between sites? ANOSIM, H0: no site differences. P < 5% (p < 0.05) and R close or equal to 1: H0 is rejected, the sites are different.

If the observed R statistic falls outside the range of R values obtained after permutation, H0 is rejected (H0: no site differences).
[Figure: frequency histogram of the permuted R values, with the observed Global R = 0,934 marked.]
A value as large as the observed Global R is very unlikely under H0: it occurs about 4 times in a thousand trials (p = 0,4%).

So far: the global test. To test for differences between specific pairs of sites:
- Repeated significance tests accumulate the risk of drawing an incorrect conclusion (type I error).
- The global test is the most reliable: higher number of replicates, sufficient permutations.
- For pairwise tests, look at R rather than at p. R approaching 1 = separation (with a low stress value this is also obvious from the MDS); R approaching 0 = no separation.
ANOSIM is also available for two-way layouts.

Correlation with environmental variables: BEST analysis
BEST selects the environmental variables, or species, "best explaining" the community pattern by maximising a rank correlation between their respective resemblance matrices. Two algorithms are available:
- BIOENV: all combinations of the trial variables are tried.
- BVSTEP: a stepwise search over the trial variables is carried out.
Use BVSTEP if there is a large number of trial variables and BIOENV is too slow.

BIO-ENV: linking community analysis to environmental variables
To what extent are the physico-chemical variables related to (do they "explain") the observed biological pattern? A first approach is to superimpose single (univariate) environmental variables on the MDS plot.

The MDS is repeated for specific combinations of environmental variables, and the best-fitting combination is the one whose plot best matches the biotic plot. The match between any two plots is measured by comparing the ranks of the two similarity matrices through a (weighted) rank correlation coefficient. Take care with collinearity among the environmental variables.
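The search over variable combinations can be sketched as follows. This is an illustration under stated assumptions, not the PRIMER routine: it uses the plain (unweighted) Spearman rank correlation, Euclidean distance on environmental variables that are assumed to be already standardized, and hypothetical data in which a variable called `depth` drives the biotic pattern and a variable called `noise` does not.

```python
from itertools import combinations
import math

def mean_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in order[i:j + 1]:
            ranks[k] = (i + j + 2) / 2.0
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation = Pearson correlation of the ranks."""
    rx, ry = mean_ranks(x), mean_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def bioenv(bio_dissim, env):
    """bio_dissim: biotic dissimilarities in the pair order of combinations(range(n), 2).
    env: variable name -> list of n values. Returns (subset, rho), best first."""
    n = len(next(iter(env.values())))
    pairs = list(combinations(range(n), 2))
    results = []
    for size in range(1, len(env) + 1):
        for subset in combinations(sorted(env), size):
            dist = [math.sqrt(sum((env[v][i] - env[v][j]) ** 2 for v in subset))
                    for i, j in pairs]
            results.append((subset, spearman(bio_dissim, dist)))
    results.sort(key=lambda item: -item[1])
    return results

# Hypothetical data: four samples whose biotic dissimilarities follow "depth".
depth = [0, 1, 3, 7]
bio = [abs(depth[i] - depth[j]) for i, j in combinations(range(4), 2)]
env = {"depth": depth, "noise": [5, 1, 4, 2]}
best = bioenv(bio, env)
print(best[0])
```

The number of subsets doubles with every added variable, which is why a stepwise search becomes attractive when there are many trial variables.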

SIMPER (similarity percentages)
- An MDS of the species similarity matrix often has high stress. Therefore concentrate on the sample similarities and highlight the species responsible for determining the sample groupings found in the cluster or ordination analysis.
- Compute the average dissimilarity (δ) between all pairs of inter-group samples, i.e. every sample in group 1 paired with every sample in group 2.
- Break this average down into the separate contributions from each species to δ.
- A species is a good discriminating species when it contributes much to the dissimilarity between groups 1 and 2 (its contribution to δ is large) and when it does so consistently in the inter-group comparisons of all samples in the 2 groups (the standard deviation of its contribution is small).
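The breakdown of the average dissimilarity can be sketched as below, assuming Bray-Curtis dissimilarity, for which the contribution of one species to the dissimilarity between two samples is 100 times its absolute abundance difference divided by the total abundance of both samples. The data and names are hypothetical, and the use of the sample standard deviation is an assumption of this sketch.

```python
from statistics import mean, stdev

def simper(group1, group2, species):
    """group1, group2: lists of samples, each a list of abundances per species.
    Returns the average inter-group dissimilarity and one row per species."""
    per_species = [[] for _ in species]
    for x in group1:
        for y in group2:  # every sample in group 1 paired with every sample in group 2
            total = sum(x) + sum(y)
            for i, (a, b) in enumerate(zip(x, y)):
                per_species[i].append(100.0 * abs(a - b) / total)
    averages = [mean(values) for values in per_species]
    overall = sum(averages)  # average Bray-Curtis dissimilarity between the groups
    rows = []
    for name, values, avg in zip(species, per_species, averages):
        sd = stdev(values) if len(values) > 1 else 0.0
        rows.append({
            "species": name,
            "av_diss": avg,
            "diss_sd": avg / sd if sd > 0 else float("inf"),
            "contrib_pct": 100.0 * avg / overall if overall > 0 else 0.0,
        })
    rows.sort(key=lambda row: -row["av_diss"])
    return overall, rows

# Hypothetical abundances (assumption): two species, two samples per group.
species = ["sp1", "sp2"]
group1 = [[10, 2], [8, 4]]
group2 = [[0, 3], [2, 1]]
overall, rows = simper(group1, group2, species)
for row in rows:
    print(row)
```

A high ratio of average contribution to standard deviation marks a species that separates the groups consistently, which corresponds to the Diss/SD column of the output tables.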

Species that are good discriminators between groups are indicated by *

E. affinis explains almost 3 % of the intra-group similarity: a typical species (not necessarily a good discriminator).

SIMPER output for the example data set (nine samples, spec A to spec F): dissimilarity between groups

Groups 1 & 2, average dissimilarity = 19,61
Species   Av.Abund group 1   Av.Abund group 2   Av.Diss   Diss/SD   Contrib%   Cum.%
spec A    1,33               3,67               6,87      3,33      35,04      35,04
spec B    4,00               2,00               5,82      1,69      29,69      64,73
spec D    6,00               5,00               3,67      1,22      18,70      83,44

Groups 1 & 3, average dissimilarity = 80,27
Species   Av.Abund group 1   Av.Abund group 3   Av.Diss   Diss/SD   Contrib%   Cum.%
spec D    6,00               21,00              24,69     6,78      30,76      30,76
spec F    0,00               12,33              20,09     4,78      25,02      55,78
Spec E    0,00               10,00              16,42     11,19     20,46      76,23

Groups 2 & 3, average dissimilarity = 83,25
Species   Av.Abund group 2   Av.Abund group 3   Av.Diss   Diss/SD   Contrib%   Cum.%
spec D    5,00               21,00              26,95     6,69      32,37      32,37
spec F    0,00               12,33              20,53     4,81      24,66      57,03
Spec E    0,00               10,00              16,79     11,29     20,16      77,19

SIMPER output for the example data set: similarity within groups

Group 1, average similarity: 84,93
Species   Av.Abund   Av.Sim   Sim/SD   Contrib%   Cum.%
spec D    6,00       30,34    6,6      35,72      35,72
Spec C    6,33       30,13    24,8     35,48      71,20
spec B    4,00       18,78    9,       22,11      93,32

Group 2, average similarity: 87,60
Species   Av.Abund   Av.Sim   Sim/SD   Contrib%   Cum.%
Spec C    5,67       32,60    21,8     37,21      37,21
spec D    5,00       26,46    14,13    30,20      67,42
spec A    3,67       20,32    9,16     23,20      90,61

Group 3, average similarity: 90,04
Species   Av.Abund   Av.Sim   Sim/SD   Contrib%   Cum.%
spec D    21,00      45,47    11,24    50,50      50,50
spec F    12,33      23,01    7,24     25,55      76,05
Spec E    10,00      21,56    13,21    23,95      100,00