Analysis of Multivariate Ecological Data


School on Recent Advances in Analysis of Multivariate Ecological Data
October 2016

Prof. Pierre Legendre
Dr. Daniel Borcard
Département de sciences biologiques, Université de Montréal
C.P. 6128, succursale Centre Ville, Montréal QC H3C 3J7, Canada

Day 3: Statistical testing by permutations

See the course material by Pierre Legendre.

Statistical tests for multivariate data

1. Parametric tests

When the conditions of a given test are fulfilled, an auxiliary variable constructed from one or several parameters estimated from the data (for instance an F- or t-statistic) has a known behaviour under the null hypothesis (H0). It is thus possible to ascertain whether the observed value of that statistic is likely or not to occur if H0 is true. If the observed value is as extreme as, or more extreme than, the value of the reference statistic for a pre-established probability level (usually α = 0.05), then H0 is rejected. If not, H0 is not rejected.

[Figure: rejection regions of the reference distribution for a two-tailed test (α/2 in each tail), a left one-tailed test and a right one-tailed test (α in a single tail); the remaining region has probability 1 − α.]

2. Permutation tests

Principle: if no theoretical reference distribution is available, generate a reference distribution under H0 from the data themselves. This is achieved by permuting the data randomly in a scheme that ensures that H0 is true, and recomputing the test statistic. The procedure is repeated a large number of times (e.g. 1000).

The observed test statistic is then compared to the set of test statistics obtained by permutation. If the observed value is as extreme as, or more extreme than, say, the 5% most extreme values obtained under permutation, then it is considered too extreme for H0 to be likely, and H0 is rejected.
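The principle can be illustrated with a minimal sketch in base R, here for Pearson's r. The data are simulated and the function name `perm_test_cor` is hypothetical; neither comes from the course material. For a simple correlation, permuting one of the two variables is a scheme under which H0 is true.

```r
# Minimal sketch of a permutation test for Pearson's r (base R only).
# Simulated data for illustration; not course data.
set.seed(42)
n <- 30
x <- rnorm(n)
y <- 0.5 * x + rnorm(n)

perm_test_cor <- function(x, y, nperm = 999) {
  r_obs <- cor(x, y)
  # Permuting y breaks any x-y link, so H0 is true by construction
  r_perm <- replicate(nperm, cor(x, sample(y)))
  # Two-tailed p-value; the observed value is counted as one outcome under H0
  p <- (sum(abs(r_perm) >= abs(r_obs)) + 1) / (nperm + 1)
  list(r = r_obs, p.value = p, r_perm = r_perm)
}

res <- perm_test_cor(x, y)
res$r
res$p.value
```

With 999 permutations, the smallest p-value that can be obtained is 1/1000, which is why the number of permutations should be chosen with the desired significance level in mind.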


2. Permutation tests: words of caution

The method of permutations does not solve all the problems related to statistical testing.

1. Some problems may require different and more complicated permutation schemes than the simple random scheme applied here. Example: tests of the main factors of an ANOVA, where the permutations for factor A must be limited within the levels of factor B, and vice versa.

2. Permutation tests do solve several distributional problems, but not all. In particular, they do not solve distributional problems linked to the hypothesis being tested. For instance, permutational ANOVA does not require normality, but it still requires homogeneity of variances: two hypotheses are actually tested simultaneously, i.e. equality of the means and equality of the variances.

3. Contrary to popular belief, permutation tests do not solve the problem of independence of observations. This problem still has to be addressed by special solutions, differing from case to case, and often related to the correction of degrees of freedom.

4. Although many statistics can be tested directly by permutations (e.g. Pearson's r), it is advised to use a pivotal statistic whenever possible. A pivotal statistic has a distribution under the null hypothesis which remains the same for any value of the measured effect.

5. It is not the statistic itself which determines whether a test is parametric or not: it is the reference to a theoretical distribution (which requires assumptions about the parameters of the statistical population from which the data have been extracted) or to permutations.
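The restricted scheme mentioned in the first word of caution (permuting within the levels of another factor) can be sketched as follows in base R. The function name `permute_within` and the factor `B` are hypothetical and serve only as an illustration.

```r
# Sketch: permute observation indices only within the levels of a factor,
# e.g. to test factor A with permutations restricted within levels of B.
permute_within <- function(block) {
  idx <- seq_along(block)
  for (lev in unique(block)) {
    pos <- which(block == lev)
    # Shuffle the positions belonging to this level among themselves
    idx[pos] <- pos[sample.int(length(pos))]
  }
  idx
}

B <- factor(rep(c("b1", "b2"), each = 6))   # hypothetical factor B
set.seed(1)
idx <- permute_within(B)
idx
```

Applying `idx` to the rows of the response data leaves each observation inside its original level of B, so only the assignment to the levels of the tested factor is randomized.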

3. Tests of an RDA or CCA

To test one single axis at a time: verify whether an equal or larger eigenvalue can be obtained under the null hypothesis of no relationship between the response matrix and the explanatory matrix.

To test the significance of the analysis globally, the basis is the sum of all canonical eigenvalues. The hypotheses are thus:
- H0: there is no linear relationship between the response matrix and the explanatory matrix;
- H1: there is a linear relationship between the response matrix and the explanatory matrix.

Originally, the test statistic was the eigenvalue, or the sum of canonical eigenvalues, itself. Now one uses a pivotal statistic instead, a "pseudo-F" statistic defined as:

$$F = \frac{(\text{sum of all canonical eigenvalues}) / m}{\mathrm{RSS} / (n - m - 1)}$$

where n is the number of objects, m is the number of explanatory variables and RSS is the residual sum of squares, i.e. the sum of non-canonical eigenvalues (after fitting the explanatory variables).
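As a hedged illustration, the pseudo-F statistic can be computed in base R from a multivariate regression, because the sums of squares of the fitted and residual values are proportional (by the same factor n − 1) to the sums of canonical and non-canonical eigenvalues. The data are simulated, the function name `pseudo_F` is hypothetical, and X is assumed to be of full rank.

```r
# Sketch: pseudo-F of an RDA obtained from multivariate regression (base R).
set.seed(7)
n <- 20; m <- 2; p <- 4
X <- matrix(rnorm(n * m), n, m)     # explanatory variables (simulated)
Y <- matrix(rnorm(n * p), n, p)     # response variables (simulated)
Y[, 1] <- Y[, 1] + X[, 1]           # inject some structure

pseudo_F <- function(Y, X) {
  n <- nrow(Y); m <- ncol(X)
  Yc <- scale(Y, center = TRUE, scale = FALSE)
  fit <- lm.fit(cbind(1, X), Yc)
  ss_fit <- sum(fit$fitted.values^2)   # proportional to sum of canonical eigenvalues
  ss_res <- sum(fit$residuals^2)       # proportional to RSS
  (ss_fit / m) / (ss_res / (n - m - 1))
}

F_obs <- pseudo_F(Y, X)
# Permuting the rows of Y destroys the Y-X relationship (H0 true)
F_perm <- replicate(999, pseudo_F(Y[sample(n), , drop = FALSE], X))
p_value <- (sum(F_perm >= F_obs) + 1) / (999 + 1)
```

This corresponds to the simplest case, permutation of raw data without covariables; it is not meant to reproduce the output of any particular package.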

3. Tests of an RDA or CCA: permutation procedures

The main permutation types are the following.

Without covariables in the analysis:
- permutation of raw data;
- permutation of residuals.

With covariables in the analysis:
- permutation of residuals under a reduced (or null) model;
- permutation of residuals under a full model.

Day 3: Canonical ordination

1. Introduction

Canonical ordination explicitly relates two matrices: one dependent (response) matrix and one explanatory matrix. Both are involved at the stage of the ordination. This approach combines the techniques of ordination and multiple regression.

Response variables    Explanatory variables    Analysis
1 variable            1 variable               Simple regression
1 variable            m variables              Multiple regression
p variables           -                        Simple ordination
p variables           m variables              Canonical ordination

The results of RDA and CCA are presented in the form of biplots or triplots. The explanatory variables can be quantitative or qualitative; multistate qualitative variables are declared as "factor" (vegan) or coded as a series of binary variables (e.g. Canoco).

A qualitative explanatory variable is represented on the biplot or triplot as the centroid of the sites that have the description "1" for that variable ("Centroids for factor constraints" in vegan, "Centroids of environmental variables" in Canoco). A quantitative explanatory variable is represented as a vector (the vector apices are given under the name "Biplot scores for constraining variables" in vegan and "Biplot scores of environmental variables" in Canoco).

Day 3: Redundancy analysis (RDA)

2. Canonical ordination: redundancy analysis (RDA)

Response variables: data table Y (centred variables). Explanatory variables: data table X (centred variables).

1. Regress each variable y on table X and compute the fitted ($\hat{y}$) and residual ($y_{res}$) values.
2. The fitted values from the multiple regressions form the matrix $\hat{Y} = X [X'X]^{-1} X'Y$.
3. Run a PCA on $\hat{Y}$; U is the matrix of (canonical) eigenvectors.
4. $YU$ is the ordination in the space of variables Y; $\hat{Y}U$ is the ordination in the space of variables X.
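The computation steps of RDA described above can be sketched in base R. This is an illustrative sketch on simulated, centred data, assuming X is of full rank; the object names `site_wa` and `site_lc` are hypothetical labels for the two ordinations, and no scaling of the scores is applied.

```r
# Sketch: RDA as multivariate regression followed by PCA of the fitted values.
set.seed(3)
n <- 15
X <- scale(matrix(rnorm(n * 2), n, 2), scale = FALSE)   # centred X (simulated)
Y <- scale(matrix(rnorm(n * 3), n, 3), scale = FALSE)   # centred Y (simulated)

# Step 1: fitted values of the multiple regressions, Yhat = X (X'X)^-1 X'Y
Yhat <- X %*% solve(crossprod(X), crossprod(X, Y))
Yres <- Y - Yhat                                        # residual values

# Step 2: PCA of the fitted values (eigen-decomposition of their covariance)
eig <- eigen(cov(Yhat), symmetric = TRUE)
k <- min(ncol(X), ncol(Y))            # number of canonical axes
U <- eig$vectors[, seq_len(k), drop = FALSE]
lambda <- eig$values[seq_len(k)]      # canonical eigenvalues

# Step 3: the two ordinations of the objects
site_wa <- Y %*% U                    # ordination in the space of variables Y
site_lc <- Yhat %*% U                 # ordination in the space of variables X
```

A PCA of `Yres` would give the non-canonical (residual) axes in the same way.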

2. Redundancy analysis (RDA)

RDA scaling 1 = distance biplot: the eigenvectors are scaled to unit length.

1) Distances among objects in the biplot are approximations of their Euclidean distances in multidimensional space.
2) The angles among response vectors are meaningless.
3) Projecting an object at right angle on a response variable or a quantitative explanatory variable approximates the position of the object along that variable.
4) The angles between response and explanatory variables in the biplot reflect their correlations.
5) The relationship between the centroid of a qualitative explanatory variable and a response variable (species) is found by projecting the centroid at right angle on the variable (as for individual objects).
6) Distances among centroids, and between centroids and individual objects, approximate Euclidean distances.

[Figure: triplot, RDA spe.hel ~ env2, scaling 1, wa scores. RDA on a covariance matrix, Hellinger-transformed species abundances, scaling 1. Numbers = sites; red = species; blue = explanatory variables.]

RDA scaling 2 = correlation biplot: the eigenvectors are scaled to the square root of their eigenvalue.

1) Distances among objects in the biplot are not approximations of their Euclidean distances in multidimensional space.
2) The angles in the biplot between response and explanatory variables, and between response variables themselves or explanatory variables themselves, reflect their correlations.
3) Projecting an object at right angle on a response or an explanatory variable approximates the value of the object along that variable.
4) The angles between descriptors reflect their correlations.
5) The relationship between the centroid of a qualitative explanatory variable and a response variable (species) is found by projecting the centroid at right angle on the variable (as for individual objects).
6) Distances among centroids, and between centroids and individual objects, do not approximate Euclidean distances.

[Figure: RDA on a covariance matrix, Hellinger-transformed species abundances, scaling 2. Blue: species. Green and brown: quantitative explanatory variables. Yellow, red and black: categorical explanatory variables.]

Day 3: Canonical correspondence analysis (CCA)

3. Canonical correspondence analysis (CCA)

CCA is actually a constrained CA, i.e. a constrained PCA on a species data table that has been transformed into a table of Pearson χ² statistics. Objects, response variables and centroids of categories are plotted as points on the biplot or the triplot. Quantitative explanatory variables are plotted as vectors (arrows). For the species and objects, the interpretation is the same as in CA.
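To make the transformation concrete, here is a small base R sketch of the table of contributions to the Pearson χ² statistic that underlies CA and CCA. It assumes the usual definition, (p_ij − p_i+ p_+j) / sqrt(p_i+ p_+j) with p the relative frequencies; the abundance table is made up, and the weighted regression step of CCA is not shown.

```r
# Sketch: chi-square transformed table underlying CA and CCA (base R only).
Yab <- matrix(c(10, 0, 3,
                 2, 5, 1,
                 0, 8, 4,
                 6, 1, 0), nrow = 4, byrow = TRUE)   # made-up abundances

P  <- Yab / sum(Yab)        # relative frequencies
ri <- rowSums(P)            # row (site) weights
cj <- colSums(P)            # column (species) weights
E  <- outer(ri, cj)         # expected values under independence
Qbar <- (P - E) / sqrt(E)   # contributions to the Pearson chi-square statistic

# Total inertia = Pearson chi-square statistic divided by the table total
total_inertia <- sum(Qbar^2)
total_inertia
```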

Interpretation of the explanatory variables

CCA scaling type 1 (focus on sites):
(1) The position of objects on a quantitative explanatory variable can be obtained by projecting the objects at right angle on the variable.
(2) An object found near the point representing the centroid of a qualitative explanatory variable is more likely to possess the state "1" for that variable.

CCA scaling type 2 (focus on species):
(1) The optimum of a species along a quantitative environmental variable can be obtained by projecting the species at right angle on the variable.
(2) A species found near the centroid of a qualitative environmental variable is likely to be found frequently (or in larger abundances) in the sites possessing the state "1" for that variable.

[Figure: CCA, example with scaling 2.]

Day 3: Multivariate ANOVA by RDA

4. Multivariate ANOVA by RDA

In its classical, parametric form, multivariate analysis of variance (MANOVA) has stringent conditions of application and restrictions (e.g. multivariate normality of each group of data, homogeneity of the variance-covariance matrices, number of response variables smaller than the number of objects minus the number of groups). RDA offers an elegant alternative, and adds the versatility of the permutation tests and the triplot representation of results.

4. Orthogonal factors: coding an ANOVA for RDA

To run an equivalent of MANOVA using RDA, and to test the factors and interaction in a way that provides the correct F values, one must code the factors in such a way that:
1. The variables represent exactly the experimental design.
2. The variables are orthogonal to one another.
3. The interaction (when present) can be properly coded as orthogonal to the main factors.
4. The number of variables needed to code each factor (and the interaction) is equal to their respective number of degrees of freedom.

=> Helmert contrasts

Two orthogonal factors, several observations (objects) per cell; n = 12.

                      Factor A, level 1    Factor A, level 2    Factor A, level 3
Factor B, level 1     Objects 1, 2         Objects 5, 6         Objects 9, 10
Factor B, level 2     Objects 3, 4         Objects 7, 8         Objects 11, 12

Factor A: 3 levels, therefore 2 orthogonal variables.
Factor B: 2 levels, therefore 1 variable.

Properties of the coding:
1. All columns must have zero sum.
2. The number of variables needed to code a factor corresponds to the number of degrees of freedom of this factor; this includes the interaction.
3. The correlation among variables is 0 everywhere.
4. Interaction variables are produced by columnwise multiplication of factor variables.

[Table: coding of objects 1 to 12 for factor A, factor B and the interaction (A × B); the cell values were not recovered.]
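The four properties listed above can be checked on the 12-object design with a short base R sketch. The factor names `A` and `B` and their level labels are hypothetical; `model.matrix()` and `contr.helmert` are standard functions of the stats package.

```r
# Sketch: Helmert coding of a balanced 3 x 2 design with 2 objects per cell.
A <- gl(3, 4, labels = c("a1", "a2", "a3"))          # objects 1-4, 5-8, 9-12
B <- gl(2, 2, length = 12, labels = c("b1", "b2"))   # pairs alternate within A

helm <- model.matrix(~ A * B,
                     contrasts.arg = list(A = "contr.helmert",
                                          B = "contr.helmert"))[, -1]

ncol(helm)             # 2 (A) + 1 (B) + 2 (interaction) = 5 variables
colSums(helm)          # every column sums to zero
round(cor(helm), 10)   # identity matrix: zero correlation among variables
```

The zero correlations hold because the design is balanced; with unequal cell sizes the Helmert variables are no longer mutually orthogonal.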

Warning

Testing by permutations does not alleviate the requirement of homogeneity of within-group dispersions in multivariate ANOVA by RDA. This condition can be tested in R with function betadisper() {vegan}.

Example

27 sites of the (Hellinger-transformed) Doubs fish data and a fictitious balanced two-way ANOVA design:
- Factor "altitude" (alt.fac): 3 levels
- Factor "ph" (ph.fac): 3 levels

Example

    # Creation of a factor 'altitude' (3 levels, 9 sites each)
    alt.fac <- gl(3, 9, labels = c("high", "mid", "low"))
    # Creation of a factor mimicking 'ph'
    ph.fac <- as.factor(c(1, 2, 3, 2, 3, 1, 3, 2, 1,
                          2, 1, 3, 3, 2, 1, 1, 2, 3,
                          2, 1, 2, 3, 2, 1, 1, 3, 3))
    # Are the factors balanced?
    table(alt.fac, ph.fac)

[Output: contingency table of alt.fac (high, mid, low) by ph.fac; the cell counts were not recovered.]

Example

    # Creation of Helmert contrasts for the factors and their interaction
    alt.ph.helm <- model.matrix(~ alt.fac * ph.fac,
                                contrasts = list(alt.fac = "contr.helmert",
                                                 ph.fac = "contr.helmert"))[, -1]
    head(alt.ph.helm)

[Output: first rows of the contrast matrix, with columns alt1 alt2 ph1 ph2 alt1:ph1 alt2:ph1 alt1:ph2 alt2:ph (header truncated in the source); the values were not recovered.]

Example

Within-group dispersions: see the script of today's practicals, ICTP-Day3.R. Within-group dispersions are homogeneous.

Example

1. Test of the interaction; unconstrained permutations.

    Permutation test for rda under reduced model
    Permutation: free
    Number of permutations: 999
    Model: rda(x = spe.hel[1:27, ], Y = alt.ph.helm[, 5:8], Z = alt.ph.helm[, 1:4])
             Df Variance F Pr(>F)
    Model    (values not recovered)
    Residual (values not recovered)

Nonsignificant interaction => we can proceed.

2. Test of the main factor "altitude"; permutations constrained within the levels of factor "ph".

    Permutation test for rda under reduced model
    Blocks: strata
    Permutation: free
    Number of permutations: 999
    Model: rda(x = spe.hel[1:27, ], Y = alt.ph.helm[, 1:2], Z = alt.ph.helm[, 3:8])
             Df Variance F Pr(>F)
    Model    (values not recovered) ***
    Residual (values not recovered)

3. Test of the main factor "ph"; permutations constrained within the levels of factor "altitude".

    Permutation test for rda under reduced model
    Blocks: strata
    Permutation: free
    Number of permutations: 999
    Model: rda(x = spe.hel[1:27, ], Y = alt.ph.helm[, 3:4], Z = alt.ph.helm[, c(1:2, 5:8)])
             Df Variance F Pr(>F)
    Model    (values not recovered)
    Residual (values not recovered)

Example

Only factor "altitude" is significant. One could compute an RDA with the Helmert contrasts coding for altitude, and draw a triplot with the sites' weighted sum scores related to the factor levels, and the arrows of the species scores.

[Figure: triplot of the multivariate ANOVA by RDA, factor altitude, scaling 1, wa scores. Axes RDA1 and RDA2; centroids of the levels low, mid and high; site numbers; species shown as arrows with abbreviated names.]

Day 3: Selection of explanatory variables

5. Selection of environmental variables

There are situations where one wants to reduce the number of explanatory variables in a regression or canonical ordination model for various reasons, e.g.:
- not enough "sound ecological thinking" => too many candidate explanatory variables;
- special procedures (e.g. dbmem, see Day 5) producing a large number of explanatory variables.

This can be done with a procedure of selection of explanatory variables.

No single, perfect method exists to reduce the number of variables, besides the examination of all possible subsets of explanatory variables. In multiple regression, the three usual methods are forward, backward and stepwise selection of explanatory variables, the last one being a combination of the first two. In RDA, forward selection is the method most often applied.

5. Forward selection of environmental variables

The principle of forward selection is as follows:
1. Compute, in turn, the independent contribution of each of the m explanatory variables to the explanation of the variation of the response data table. This is done by running m separate canonical analyses.
2. Test the significance of the contribution of the best variable.
3. If it is significant, include it into the model as a first explanatory variable.
4. Compute (one at a time) the partial contributions (conditional effects) of the m − 1 remaining explanatory variables, controlling for the effect of the one already in the model.
5. Test the significance of the best partial contribution among the m − 1 variables.
6. If it is significant, include this variable into the model.
7. Compute the partial contributions of the m − 2 remaining explanatory variables, controlling for the effect of the two already in the model.
8. The procedure goes on until no more significant partial contribution is found.
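The steps above can be sketched in base R as follows. This is only an illustrative sketch under stated assumptions: simulated data, hypothetical function names (`r2adj`, `partial_F`, `perm_p`, `forward_sel`), permutation of residuals under the reduced model, and a second stopping rule based on the adjusted R² of the global model, as discussed in the notes that follow. It is not the implementation used by any package.

```r
# Sketch of forward selection with a double stopping rule (base R only).
set.seed(123)
n <- 40
X <- matrix(rnorm(n * 5), n, 5)                  # 5 candidate variables (simulated)
Y <- 2 * X[, 1:3] + matrix(rnorm(n * 3), n, 3)   # responses driven by variables 1 to 3

# Adjusted R2 of the multivariate regression of Y on X
r2adj <- function(Y, X) {
  n <- nrow(Y); m <- ncol(X)
  Yc <- scale(Y, center = TRUE, scale = FALSE)
  r2 <- 1 - sum(lm.fit(cbind(1, X), Yc)$residuals^2) / sum(Yc^2)
  1 - (1 - r2) * (n - 1) / (n - m - 1)
}

# Partial pseudo-F of Xnew, controlling for the variables already selected
partial_F <- function(Y, Xnew, Xold = NULL) {
  n <- nrow(Y); q <- ncol(Xnew)
  Z <- cbind(rep(1, n), Xold)              # intercept + selected variables
  Yr <- lm.fit(Z, Y)$residuals             # residuals of the reduced model
  ss_res <- sum(lm.fit(cbind(Z, Xnew), Y)$residuals^2)
  ss_new <- sum(Yr^2) - ss_res             # variation explained by Xnew alone
  list(F = (ss_new / q) / (ss_res / (n - ncol(Z) - q)),
       Yfit = Y - Yr, Yres = Yr)
}

# Permutation test: residuals of the reduced model are permuted
perm_p <- function(Y, Xnew, Xold = NULL, nperm = 199) {
  obs <- partial_F(Y, Xnew, Xold)
  Fp <- replicate(nperm, {
    Ystar <- obs$Yfit + obs$Yres[sample(nrow(Y)), , drop = FALSE]
    partial_F(Ystar, Xnew, Xold)$F
  })
  (sum(Fp >= obs$F) + 1) / (nperm + 1)
}

forward_sel <- function(Y, X, alpha = 0.05, nperm = 199) {
  r2_global <- r2adj(Y, X)                 # adjusted R2 of the global model
  selected <- integer(0)
  repeat {
    cand <- setdiff(seq_len(ncol(X)), selected)
    if (length(cand) == 0) break
    r2_c <- sapply(cand, function(j) r2adj(Y, X[, c(selected, j), drop = FALSE]))
    best <- cand[which.max(r2_c)]
    Xold <- if (length(selected)) X[, selected, drop = FALSE] else NULL
    p <- perm_p(Y, X[, best, drop = FALSE], Xold, nperm)
    # Stop if the best candidate is not significant or if the model would
    # exceed the adjusted R2 of the global model
    if (p > alpha || max(r2_c) > r2_global) break
    selected <- c(selected, best)
  }
  selected
}

selected <- forward_sel(Y, X)
selected
```

Following the recommendation given in the notes below, a global test including all explanatory variables should be run first, and the selection carried out only if that test is significant; the sketch omits this preliminary step.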

a) First of all, forward selection is too liberal (i.e., it allows too many explanatory variables to enter a model). Before running a forward selection, always perform a global test (including all explanatory variables). If, and only if, the global test is significant, run the forward selection.

b) Even if the global test is significant, forward selection is too liberal. Simulations have shown that, in addition to the usual alpha level, one must add a second stopping criterion to forward selection: the model under construction must not have an adjusted R² (R²adj) higher than that of the global model (i.e., the model containing all explanatory variables).

Blanchet F. G., P. Legendre, and D. Borcard. Forward selection of explanatory variables. Ecology 89:

c) The tests are run by random permutations.

d) Like all procedures of selection (forward, backward or stepwise), this one does not guarantee that the best model is found. From the second step on, the inclusion of variables is conditioned by the nature of the variables that are already in the model.

e) As in all regression models, the presence of strongly intercorrelated explanatory variables renders the regression or canonical coefficients unstable. Forward selection does not necessarily eliminate this problem since even strongly correlated variables may be admitted into a model.

f) Forward selection can help when several candidate explanatory variables are strongly correlated, but the choice has no a priori ecological validity. In this case it is often advisable to eliminate one of the intercorrelated variables on an ecological basis rather than on a statistical basis.

g) In cases where several correlated explanatory variables are present, without clear a priori reasons to eliminate one or the other, one can examine the variance inflation factors (VIF).

h) The variance inflation factors (VIF) measure how much the variance of the regression or canonical coefficients is inflated by the presence of correlations among explanatory variables. This in fact measures the instability of the regression model.

i) As a rule of thumb, ter Braak recommends that variables that have a VIF larger than 20 be removed from the analysis.

j) Beware: always remove the variables one at a time and recompute the analysis, since the VIF of every variable depends on all the others!
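As an illustration of notes g) to j), the VIF of each explanatory variable can be computed in base R as 1 / (1 − R²), where R² comes from the regression of that variable on all the others. The data are simulated and the function name `vif` is hypothetical; for fitted ordination models, vegan provides vif.cca().

```r
# Sketch: variance inflation factors from first principles (base R only).
set.seed(11)
n <- 50
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)   # nearly collinear with x1 (simulated)
x3 <- rnorm(n)
X <- cbind(x1, x2, x3)

vif <- function(X) {
  v <- sapply(seq_len(ncol(X)), function(j) {
    # R2 of the regression of variable j on all the other variables
    r2 <- summary(lm(X[, j] ~ X[, -j]))$r.squared
    1 / (1 - r2)
  })
  names(v) <- colnames(X)
  v
}

v <- vif(X)
v
# Remove one variable at a time and recompute, as advised in note j)
v_reduced <- vif(X[, c("x1", "x3")])
v_reduced
```

In this made-up example x1 and x2 carry almost the same information, so both receive a large VIF; once x2 is dropped, the VIF of x1 falls back close to 1.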

In R, variable selection for ecological data can be run with the following functions:
- forward.sel() {adespatial}
- ordistep() {vegan}
- ordiR2step() {vegan}

[Table: the slide compared these functions on four criteria (forward selection, backward selection, R²a stopping criterion, handling of factors); the cell entries were not recovered.]
