Dissimilarity and transformations. Pierre Legendre Département de sciences biologiques Université de Montréal

1 Dissimilarity and transformations. Pierre Legendre, Département de sciences biologiques, Université de Montréal. Pierre Legendre 2017

2 Definitions An association coefficient is a function of two data vectors that quantifies the strength of their relationship (or association).

3 Association coefficients can measure the relationship between data rows (objects, Q mode) or between data columns (descriptors or variables, R mode).
[Figure: a data matrix Y (n objects × p descriptors) leads to an association matrix A (n × n) for Q-mode analysis and to an association matrix A (p × p) for R-mode analysis. Legendre & Legendre 2012, Fig. 2.1.]

4 Q-mode coefficients (between objects, e.g. sites) are called similarities (S) and dissimilarities (D) (1). R-mode coefficients (between descriptors or variables) are called dependence coefficients (correlation coefficients, contingency coefficients, ...).
In Q mode, one obtains an S or D matrix by computing similarities (S) or dissimilarities (D) between all pairs of sites. In R mode, one obtains an R matrix by computing correlation coefficients between all pairs of columns.
(1) Dissimilarities are often called distance coefficients. Technically, distances are metric dissimilarities (the metric properties are described below).

5 A similarity coefficient S produces values in the [0,1] range. An S matrix is symmetric. It has values of 1 on the diagonal, which represent the similarity between an object and itself. [Example: S matrix among objects Obj.1, Obj.2, ...]
A dissimilarity coefficient D derived from S also has values in the [0,1] range. A D matrix is symmetric. It has values of 0 on the diagonal, which represent the difference between an object and itself. [Example: D matrix among objects Obj.1, Obj.2, ...]

6 In this presentation, we will focus on Q-mode coefficients. In the R language, all association matrices computed with Q-mode coefficients are presented as D matrices. They contain either dissimilarity indices, or similarity indices transformed into dissimilarities (more details below).
A D function imposes a model onto the data. It filters the information of the data matrix Y, emphasizing a portion of the information and discarding other portions. The (dis)similarity indices are not interchangeable. Users must know what information is emphasized (i.e. retained) and discarded (i.e. filtered out) by each type of S or D function.

7 Two properties of D coefficients
Metric property
The attributes of a metric dissimilarity are the following (1):
1. Minimum 0: if a = b, then D(a, b) = 0;
2. Positiveness: if a ≠ b, then D(a, b) > 0;
3. Symmetry: D(a, b) = D(b, a);
4. Triangle inequality: D(a, b) + D(b, c) ≥ D(a, c). The sum of two sides of a triangle drawn in ordinary Euclidean space is equal to or larger than the third side. [Figure: triangle with vertices a, b and c.]
Note: A metric dissimilarity function is also called a distance.
(1) Attributes also described in the PCoA ordination course.
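The triangle inequality can be checked numerically on any D matrix. The following is a minimal base-R sketch, not a function from any package: the helper name triangle.ok() and the two small matrices are hypothetical and serve only as illustration.

```r
# Hypothetical helper: TRUE if every triple of objects satisfies
# D(i, j) + D(j, k) >= D(i, k), within a numerical tolerance.
triangle.ok <- function(D, tol = 1e-12) {
  D <- as.matrix(D)
  n <- nrow(D)
  for (i in seq_len(n)) for (j in seq_len(n)) for (k in seq_len(n)) {
    if (D[i, j] + D[j, k] < D[i, k] - tol) return(FALSE)
  }
  TRUE
}

# Three points on a line: the Euclidean distance is metric
D.metric <- dist(c(0, 1, 3))

# A symmetric matrix in which the path 1 -> 2 -> 3 (length 2)
# is shorter than the direct value D(1, 3) = 5: the triangle does not close
D.semi <- matrix(c(0, 1, 5,
                   1, 0, 1,
                   5, 1, 0), 3, 3)

triangle.ok(D.metric)
triangle.ok(D.semi)
```

The second matrix keeps attributes 1 to 3 (minimum 0, positiveness, symmetry), so it behaves like a semimetric coefficient.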

8 By reference to the attributes of a metric, three types of coefficients can be defined:
metric: have all 4 attributes;
semimetric: can violate the triangle inequality;
nonmetric: can violate attributes 1 to 3.
An example of a nonmetric coefficient will be presented later. Nonmetric coefficients are not used in ecology.

9 Euclidean property (1)
A dissimilarity coefficient is Euclidean if any resulting D matrix can be fully represented in a Euclidean space without distortion. A non-Euclidean dissimilarity matrix is identified by the criterion that PCoA of that matrix produces some negative eigenvalues. Taking the square root of most non-Euclidean D matrices makes them metric and Euclidean.
(1) Property described in more detail in the principal coordinate analysis (PCoA) course.
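The negative-eigenvalue criterion can be illustrated with a short base-R sketch. It assumes the usual PCoA algebra (Gower-centring of -0.5 D^2 followed by eigen-decomposition); the helper name has.neg.eig() and the 3 × 3 matrix are hypothetical.

```r
# Hypothetical helper: does PCoA of D produce negative eigenvalues?
has.neg.eig <- function(D, tol = 1e-8) {
  D <- as.matrix(D)
  n <- nrow(D)
  A <- -0.5 * D^2                      # transformed squared dissimilarities
  H <- diag(n) - matrix(1 / n, n, n)   # centring matrix
  G <- H %*% A %*% H                   # Gower-centred matrix
  ev <- eigen(G, symmetric = TRUE, only.values = TRUE)$values
  any(ev < -tol * max(abs(ev)))
}

# Illustrative matrix: D(1,2) + D(2,3) = 2 < D(1,3) = 3, so it is not Euclidean
D <- matrix(c(0, 1, 3,
              1, 0, 1,
              3, 1, 0), 3, 3)

has.neg.eig(D)        # the D matrix
has.neg.eig(sqrt(D))  # the square-rooted matrix
```

In this small case, taking the square root gives sides 1, 1 and sqrt(3), which form a valid triangle, so the square-rooted matrix can be drawn in a plane without distortion.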

10 Converting S to D
Coefficients were originally described as either S or D. Among the possible transformations from S to D, two are used in ecological analysis:
D = 1 − S
D = sqrt(1 − S)
In PCoA and in the distance-based approach to RDA (dbRDA), it is useful to make sure that the D matrix is Euclidean:
use D = (1 − S) when (1 − S) is Euclidean;
use D = sqrt(1 − S) when (1 − S) is not Euclidean but sqrt(1 − S) is Euclidean.

11 => Examine the metric and Euclidean properties of similarity and dissimilarity coefficients in Tables 9.2 and 9.3 of the Numerical ecology book. See complementary material, file Legendre_&_Legendre_2012_Tables_ pdf. In most cases, sqrt(D) or sqrt(1 − S) turns a non-Euclidean matrix into a Euclidean one. More about this in the course on principal coordinate analysis (PCoA).

12 Community composition data: the double-zero problem
Whittaker's coenocline: a simulated coenocline along an environmental gradient (abscissa). From Whittaker (1972). (Shown in the section of the CA course on the arch effect.) This figure will help us understand the principle behind double-zero symmetrical and asymmetrical S and D coefficients.

13 Consider the distribution of a single species along that environmental variable. For the presence or absence of that species, are the following pairs of observations an indication of similarity or of difference between the sites?
Green arrows (1, 1): similarity.
Red arrows (1, 0): difference.
Brown arrows (0, 0): maybe.
Blue arrows (0, 0): maybe.
Conclusion: interpretation of double zeros is uncertain.

14 Definitions
In double-zero asymmetrical coefficients, the value D does not change with the addition of double zeros, but it decreases when species with double-X are added to the comparison of two sites, where X is any value of equal abundances other than zero. Examples: Jaccard, Sørensen, Ochiai, Hellinger, chord, Ružička, percentage difference (aka Bray-Curtis).
In double-zero symmetrical coefficients, the value D does not change when double zeros or double-X (where X > 0) are added to the two sites that are compared. Examples: Euclidean, Manhattan distances.

15 Coefficients for binary data
Example: 2 objects (x1 and x2), 7 binary variables [0,1] (Var1 to Var7). The comparison of the two objects is summarized in a 2 × 2 frequency table:
                     Object x2
                     1        0
Object x1    1       a        b        a + b
             0       c        d        c + d
                     a + c    b + d    p = a + b + c + d

16 From the 2 × 2 frequency table (a, b, c, d) of objects x1 and x2:
Double-zero symmetrical coefficient, simple matching:
$S_{SM} = \frac{a + d}{a + b + c + d} = \frac{4}{7} = 0.571$, $D_{SM} = \frac{b + c}{a + b + c + d} = \frac{3}{7} = 0.429$
Double-zero asymmetrical coefficient, Jaccard index:
$S_J = \frac{a}{a + b + c} = \frac{2}{5} = 0.400$, $D_J = \frac{b + c}{a + b + c} = \frac{3}{5} = 0.600$
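The counts a, b, c and d can be obtained directly from two binary vectors. In the base-R sketch below, the vectors x1 and x2 are hypothetical: they were chosen only so that a = 2, b + c = 3 and d = 2, which reproduces the fractions 4/7 and 2/5 of the example. The helper name abcd() is also an assumption.

```r
# Hypothetical presence-absence vectors for 7 variables
x1 <- c(1, 1, 1, 1, 0, 0, 0)
x2 <- c(1, 1, 0, 0, 1, 0, 0)

# Frequencies of the 2 x 2 table comparing two binary vectors
abcd <- function(x1, x2) {
  c(a = sum(x1 == 1 & x2 == 1),   # double presences
    b = sum(x1 == 1 & x2 == 0),   # present in x1 only
    c = sum(x1 == 0 & x2 == 1),   # present in x2 only
    d = sum(x1 == 0 & x2 == 0))   # double zeros
}

f <- abcd(x1, x2)
S.SM <- unname((f["a"] + f["d"]) / sum(f))             # simple matching
S.J  <- unname(f["a"] / (f["a"] + f["b"] + f["c"]))    # Jaccard
D.SM <- 1 - S.SM
D.J  <- 1 - S.J
round(c(S.SM = S.SM, D.SM = D.SM, S.J = S.J, D.J = D.J), 3)
```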

17 Most popular coefficients for binary data: double-zero symmetrical
(Table columns: S; D = 1 − S; D = sqrt(1 − S).)
Simple matching: $S_{SM} = \frac{a + d}{a + b + c + d}$, $D_{SM} = 1 - S_{SM} = \frac{b + c}{a + b + c + d}$

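18 Most popular coefficients for binary data: double-zero asymmetrical
(Table columns: S; D = 1 − S; D = sqrt(1 − S).)
Jaccard: $S_{Jac} = \frac{a}{a + b + c}$, $D_{Jac} = \frac{b + c}{a + b + c}$
Sørensen: $S_{Sor} = \frac{2a}{2a + b + c}$, $D_{Sor} = \frac{b + c}{2a + b + c}$
Ochiai: $S_{Och} = \frac{a}{\sqrt{(a + b)(a + c)}}$, $D_{Och} = 1 - \frac{a}{\sqrt{(a + b)(a + c)}}$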

19 Adding double zeros changes d, the number of double zeros. Hence it changes the simple matching coefficient (symmetrical) but not the Jaccard index (asymmetrical).
Example: the same 2 objects, now with 10 binary variables [0,1] (Var1 to Var10), the three added variables being double zeros.
Simple matching: $S_{SM} = \frac{a + d}{a + b + c + d} = \frac{7}{10} = 0.700$, $D_{SM} = \frac{b + c}{a + b + c + d} = \frac{3}{10} = 0.300$
Jaccard index: $S_J = \frac{a}{a + b + c} = \frac{2}{5} = 0.400$, $D_J = \frac{b + c}{a + b + c} = \frac{3}{5} = 0.600$

20 This example shows that double-zero asymmetrical indices are insensitive to double-zeros, i.e. the absence of species at two sites, because the value d is not included in their formula. However, double-presences (a) do change the denominator, hence they change the index value. These indices are well suited to the analysis of community composition or other types of frequency data, e.g. gene frequencies. Double-zero symmetrical indices are not adapted to the analysis of these types of data.
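The effect of added double zeros and double presences can be sketched in base R as follows. The two binary vectors and the helper names d.sm() and d.jac() are hypothetical; the vectors were chosen to give a = 2, b + c = 3 and d = 2.

```r
# Hypothetical presence-absence vectors (a = 2, b = 2, c = 1, d = 2)
x1 <- c(1, 1, 1, 1, 0, 0, 0)
x2 <- c(1, 1, 0, 0, 1, 0, 0)

# Simple matching dissimilarity: (b + c) / p
d.sm <- function(x1, x2) sum(x1 != x2) / length(x1)

# Jaccard dissimilarity: (b + c) / (a + b + c); double zeros are left out
d.jac <- function(x1, x2) sum(x1 != x2) / sum(x1 == 1 | x2 == 1)

# Add three species absent from both sites (double zeros)
x1.z <- c(x1, 0, 0, 0)
x2.z <- c(x2, 0, 0, 0)

# Add one species present at both sites (double presence)
x1.p <- c(x1, 1)
x2.p <- c(x2, 1)

c(SM = d.sm(x1, x2),  SM.zeros = d.sm(x1.z, x2.z))       # changes
c(Jac = d.jac(x1, x2), Jac.zeros = d.jac(x1.z, x2.z))    # unchanged
c(Jac = d.jac(x1, x2), Jac.pres = d.jac(x1.p, x2.p))     # decreases
```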

21 Ten different forms of binary indices (for presence-absence data) are available in dist.binary() of {ade4}. The three most widely used binary indices, i.e. Jaccard, Sørensen and Ochiai, are also available in dist.ldc() of {adespatial}.
Note: When computing binary indices in these two packages, the similarities (S) are converted to distances through the transformation D = sqrt(1 − S). The transformation is automatically applied because it makes the D matrices Euclidean. Euclidean matrices will not produce negative eigenvalues in PCoA. Refer to the PCoA course.

22 Example of a nonmetric coefficient
$S_{Nonmetric} = \frac{a + d}{b + c}$, where a, b, c and d are the frequencies of the 2 × 2 table comparing objects x1 and x2.
With community composition data, if (a + d) = 0, S = 0. (b + c) can be zero; division by 0 produces S = +Inf. S is then in the range [0, +Inf].
For the upper bound S = +Inf, computing D = (1 − S) produces D = (1 − Inf) = −Inf. For the lower bound S = 0, D = (1 − S) = (1 − 0) = 1. D is then in the range [−Inf, 1].
This coefficient is not used in ecological analysis.

23 Symmetrical indices for physical descriptors
For quantitative physical descriptors (e.g. physical, chemical, morphometric, topographic, etc.): use double-zero symmetrical D coefficients, where neither double zeros nor double-X (X is any other value) change the computed distance.
The Euclidean distance
The Euclidean distance is the ordinary distance of our physical world. It is computed using Pythagoras' formula and it can be applied to data matrices with any number (p) of variables. Formula:
$D_{Euclidean}(x_1, x_2) = \sqrt{\sum_{j=1}^{p} (y_{1j} - y_{2j})^2}$

24 The Euclidean distance computed on raw physical data changes with the choice of the physical units. Examples:
[Table: three lakes described by Hardness (mg/L), Depth (m) and pH (no unit), with the resulting D Euclidean matrix.]
[Table: the same three lakes with Hardness in g/L, Depth in cm and pH (no unit), with the resulting D Euclidean matrix.]
Same data. Which dissimilarity matrix is the correct one? Do these computed distances make sense? What are the physical units of the distances in each matrix?

25 When the physical descriptors have different physical units, one should compute the Euclidean distance on standardized descriptors, which have no physical units.
[Table: the three lakes described by standardized Hardness, Depth and pH, with the resulting D Euclidean matrix.]
All descriptors contribute equally (equal variances of 1) to the computed distance. D Euclidean computed on standardized data has no physical units.
Note: Euclidean distances do not have an upper bound. Their values are in the range [0, +Inf].
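The dependence on physical units, and its removal by standardization, can be sketched in base R. The lake values below are invented for illustration; they are not the values of the original example.

```r
# Hypothetical data: 3 lakes, 3 physical descriptors
lakes <- data.frame(hardness.mgL = c(55, 76, 41),
                    depth.m      = c(12.5, 3.1, 7.8),
                    pH           = c(7.2, 6.8, 8.1),
                    row.names    = c("Lake1", "Lake2", "Lake3"))

# Same lakes, other units: hardness in g/L, depth in cm
lakes2 <- data.frame(hardness.gL = lakes$hardness.mgL / 1000,
                     depth.cm    = lakes$depth.m * 100,
                     pH          = lakes$pH,
                     row.names   = rownames(lakes))

# Euclidean distance on raw data: depends on the units
D.raw  <- dist(lakes)
D.raw2 <- dist(lakes2)

# Euclidean distance on standardized descriptors: unit-free
D.std  <- dist(scale(lakes))
D.std2 <- dist(scale(lakes2))

D.raw; D.raw2
D.std; D.std2
```

scale() centres each column on its mean and divides it by its standard deviation, so each standardized descriptor has a variance of 1.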

26 Do not compute D Euclidean on raw (untransformed) community composition data, standardized or not. There are, however, transformations that are appropriate for community data before computing the Euclidean distance. They are described later in this presentation.

27 The Gower coefficient
In 1971, John Gower proposed a dissimilarity coefficient designed for ecologists and taxonomists. The Gower coefficient was designed to handle descriptors with different physical units and of mixed precision levels. Because of its general nature, this coefficient is more complex to describe than a simple coefficient like the Euclidean distance.

28 The general form of the coefficient is the following:
$D_{Gower}(x_1, x_2) = 1 - \frac{1}{p} \sum_{j=1}^{p} s_j(x_1, x_2)$
where $s_j(x_1, x_2)$ is a partial similarity function computed separately for each descriptor.
For quantitative descriptors, $s_j(x_1, x_2)$ is computed as follows:
$s_j(x_1, x_2) = 1 - \frac{|y_{1j} - y_{2j}|}{R_j}$
For qualitative (factors) or binary descriptors, $s_j(x_1, x_2)$ is 1 if the two objects have the same state; otherwise 0.

29 In the partial similarity for quantitative descriptors, $s_j(x_1, x_2) = 1 - |y_{1j} - y_{2j}| / R_j$, $R_j$ is the range of descriptor j in the data matrix under study. Dividing $|y_{1j} - y_{2j}|$ by the range $R_j$ produces a value without physical dimension. The value of the partial similarity $s_j(x_1, x_2)$ is 1 minus the ranged difference between the values of descriptor j in the two objects under comparison.
Semi-quantitative (ordinal) descriptors (called ordered factors in R) can be handled in various ways. See the three options that are available in function gowdis() of the {FD} package. In the simplest of these methods, the information is handled as if the descriptor was quantitative, using the equation above.

30 Another interesting modification is the incorporation of weights $w_j$ in the formula:
$D_{Gower}(x_1, x_2) = 1 - \frac{\sum_{j=1}^{p} w_{12j} \, s_j(x_1, x_2)}{\sum_{j=1}^{p} w_{12j}}$
The weights $w_j$ can be used to handle missing values: when a missing value is present for a descriptor in one of the two objects under comparison, $w_j = 0$; otherwise, $w_j = 1$. Missing values are coded NA in the example that follows.

31 The weights $w_j$ of the weighted form, $D_{Gower}(x_1, x_2) = 1 - \sum_{j} w_{12j} \, s_j(x_1, x_2) / \sum_{j} w_{12j}$, can also be used to give different importances to the descriptors in the calculation of D Gower, but they are rarely used for that purpose.
The values of D Gower are in the range [0,1].
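The weighted form can be sketched in base R for one pair of objects. The function name gower.pair(), its arguments and the data values are hypothetical; factor states are coded as numbers here only to keep the sketch short. This is not the code of vegdist(), daisy() or gowdis().

```r
# Hypothetical helper: Gower dissimilarity between two objects.
# x1, x2 : descriptor values of the two objects (NA = missing)
# R      : ranges of the quantitative descriptors in the whole data matrix
# quant  : TRUE for quantitative descriptors, FALSE for factors or binary
gower.pair <- function(x1, x2, R, quant) {
  # w = 0 when a value is missing in one of the two objects, else 1
  w <- as.numeric(!(is.na(x1) | is.na(x2)))
  # partial similarities: ranged difference, or simple matching of states
  s <- ifelse(quant, 1 - abs(x1 - x2) / R, as.numeric(x1 == x2))
  s[w == 0] <- 0
  1 - sum(w * s) / sum(w)
}

x1    <- c(2, 5, NA, 1)            # descriptor 3 is missing in object 1
x2    <- c(4, 5, 3, 2)
R     <- c(10, 5, 8, NA)           # no range needed for the factor
quant <- c(TRUE, TRUE, TRUE, FALSE)

# s = (0.8, 1, -, 0), w = (1, 1, 0, 1), so D = 1 - 1.8/3 = 0.4
D.gower <- gower.pair(x1, x2, R, quant)
D.gower
```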

32 Numerical example
[Data table: sites Site1, Site2, [...], Site49 and Site50 described by Var1 to Var7, where Var5 is a factor; Site1 has values 2, 2 and NA for Var1, Var2 and Var3.]
For each variable, sites 49 and 50 have, respectively, the lowest and highest values in the data matrix. Their purpose is to provide values for computation of the ranges of the quantitative variables. Site1 has an absence of information (coded NA) in Var3. Var5 is a factor; hence the values shown are classes (or states) of the factor.

33 Study this example in detail by yourself!
[Calculation table for Site1 and Site2 over Var1 to Var7, with the following rows: $w_{12j}$; $R_j$ = (max − min); $|y_{1j} - y_{2j}|$; $|y_{1j} - y_{2j}| / R_j$; $s_{12j} = 1 - |y_{1j} - y_{2j}| / R_j$; $w_{12j} \, s_{12j}$.]
p = 7; $\sum w_{12j} = 6$
$D_{Gower}(x_1, x_2) = 1 - (\sum w_{12j} \, s_{12j}) / 6 = 0.594$

34 Comparison of the functions available in R to compute the Gower coefficient:
[Table with columns Quant., Semiquant., Factors, NA, Weights w j and rows vegdist() {vegan}, daisy() {cluster}, gowdis() {FD}. Entries in the Semiquant. column: "As quant." for daisy() and "3 methods" for gowdis().]
See the documentation file for details on how ordered factors are handled in the three methods implemented in function gowdis(). Missing values are coded NA in R.

35 Quantitative community data: asymmetrical indices I will now focus on seven double-zero asymmetrical dissimilarity coefficients for quantitative community composition data, and mention some others.

36 Asymmetrical non-euclidean indices The first two indices for quantitative community data are the quantitative forms of the Jaccard and Sørensen indices for binary data, described above. These two indices are based on the same decomposition of the species abundance data. The calculation of dissimilarity incorporates the differences in total abundance of the sites. The data are not scaled by rows, as will be the case in the next series of indices. The differences in productivity of the sites are taken into account in the calculation of the dissimilarities.

37 Example: 2 objects, 4 species, quantitative abundance data
[Figure: bar plot of the abundances of Spec.1 to Spec.4 at Site1 and Site2. For each species, the portion common to the two sites is labelled A1 to A4; the excess at Site1 is labelled B2 and B3; the excess at Site2 is labelled C1.]
A = sum of abundances common to sites 1 and 2 = A1 + A2 + A3 + A4
B = sum of abundances unique to site 1 = 0 + B2 + B3 + 0
C = sum of abundances unique to site 2 = C1 + 0 + 0 + 0
(B + C) represents the unscaled dissimilarity between 2 sites. (A) would represent the unscaled similarity.
Ružička dissimilarity: D Ruz = (B + C)/(A + B + C)
Percentage difference (aka Bray-Curtis): D %diff = (B + C)/(2A + B + C)
Other equivalent equation forms exist for these two dissimilarities.

38 With A, B and C defined as on the previous slide:
Ružička dissimilarity: D Ruz = (B + C)/(A + B + C)
Percentage difference (aka Bray-Curtis): D %diff = (B + C)/(2A + B + C)
These two dissimilarities are double-zero asymmetrical because the sum of double zeros is not included in their formulas, whereas the quantity A (sum of double-X) is included. When comparing two sites, pairs of 0 do not change the values of these indices.
The values of D Ruz and D %diff are in the range [0,1].
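The decomposition into A, B and C can be sketched in base R. The abundances below are hypothetical, and so are the helper names abc(), d.ruz() and d.pcdiff(); these are not the dist.ldc() implementations.

```r
# Hypothetical abundances of 4 species at 2 sites
y1 <- c(2, 5, 4, 3)
y2 <- c(6, 1, 2, 3)

# A = abundances common to both sites, B = unique to site 1,
# C = unique to site 2
abc <- function(y1, y2) {
  A <- sum(pmin(y1, y2))
  c(A = A, B = sum(y1) - A, C = sum(y2) - A)
}

# Ruzicka dissimilarity: (B + C) / (A + B + C)
d.ruz <- function(y1, y2) {
  f <- abc(y1, y2)
  unname((f["B"] + f["C"]) / (f["A"] + f["B"] + f["C"]))
}

# Percentage difference: (B + C) / (2A + B + C)
d.pcdiff <- function(y1, y2) {
  f <- abc(y1, y2)
  unname((f["B"] + f["C"]) / (2 * f["A"] + f["B"] + f["C"]))
}

abc(y1, y2)                    # A = 8, B = 6, C = 4
d.ruz(y1, y2)                  # 10/18
d.pcdiff(y1, y2)               # 10/26

# Adding a species absent from both sites does not change the indices
d.ruz(c(y1, 0), c(y2, 0))
d.pcdiff(c(y1, 0), c(y2, 0))
```

One of the equivalent forms of the percentage difference is sum(abs(y1 - y2)) / sum(y1 + y2), which gives the same value for these data.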

39 D %diff is not metric. [Example: three objects described by five species, Sp.1 to Sp.5, and the resulting D %diff matrix.]
The D %diff matrix is not metric because this triangle does not close: D(1,2) + D(2,3) < D(1,3), the sum on the left being 0.592.
Because it is not metric, the D %diff matrix is not Euclidean. However, sqrt(D) is Euclidean.
The Ružička dissimilarity, D Ruz, is always metric, like the Jaccard index. It is not a Euclidean coefficient; hence a specific D Ruz matrix may or may not be Euclidean, as in the following example.

40 Full example: compute D Ruz and D %diff for the spider data.
# Read the file "Spiders_28x12_spe.txt"
spiders <- read.table(file.choose())
library(adespatial); library(ade4)
#
D.ruz <- dist.ldc(spiders, method="ruzicka")
is.euclid(D.ruz)
# [1] FALSE    The D matrix is not Euclidean
is.euclid(sqrt(D.ruz))
# [1] TRUE     The sqrt(D) matrix is Euclidean
#
D.pcdiff <- dist.ldc(spiders, method="percentdiff")
is.euclid(D.pcdiff)
# [1] FALSE    The D matrix is not Euclidean
is.euclid(sqrt(D.pcdiff))
# [1] TRUE     The sqrt(D) matrix is Euclidean

41 Historical note
The percentage difference, D %diff = (B + C)/(2A + B + C), was described by Odum (1950).
This D index is often called the Bray-Curtis index in computer software. This is a misnomer, repeating a mistake that originated in a published paper. The 1957 paper by Bray and Curtis aimed at describing a new ordination method, known as the Bray-Curtis ordination, not a new D index. Actually, the index used by Bray and Curtis in their 1957 paper, and clearly described on p. 329, is Whittaker's (1952) index of association.

42 Asymmetrical Euclidean indices
The following five distance indices are constructed in the same way: the data are scaled by rows (data transformation); then the Euclidean distance is applied to the scaled data. The scaling by rows removes the differences in productivity of the sites from the data. These differences are not taken into account in the calculation of the dissimilarities.

43 The distance between species profiles
Example: community composition data [matrix Y: Site1, Site2 and Site3 described by Sp1 to Sp5, with row sums $y_{i+}$].
Divide each value by the row sum, transforming the rows into profiles of relative abundances:
$y'_{ij} = \frac{y_{ij}}{y_{i+}}$
This is called the profile transformation.

44 Divide each value by the row sum, transforming the rows into profiles of relative abundances, $y'_{ij} = y_{ij} / y_{i+}$, then compute the Euclidean distance among the scaled rows. Formula:
$D_{profile}(x_1, x_2) = \sqrt{\sum_{j=1}^{p} \left( \frac{y_{1j}}{y_{1+}} - \frac{y_{2j}}{y_{2+}} \right)^2}$
Double zeros do not affect the row sums or the final Euclidean distance. So, this distance is double-zero asymmetrical.
D profile is easy to compute but it does not have good properties for the analysis of beta diversity. See later presentation.
Maximum value of D profile = $\sqrt{2}$ for sites with no species in common.

45 The chord distance
Example: community composition data [matrix Y: Site1, Site2 and Site3 described by Sp1 to Sp5, with row norms Row.norm i].
The data are divided by the norm (vector length) (1) of each row:
$y'_{ij} = \frac{y_{ij}}{\text{Row.norm}_i}$
This is called the chord transformation.
(1) An R function to compute the norm of a vector:
row.norm <- function(vec) sqrt(sum(vec^2))

46 The data are divided by the norm (vector length) of each row, $y'_{ij} = y_{ij} / \text{Row.norm}_i$ => The norms of the transformed row vectors are now 1. Then the Euclidean distance is applied to the scaled data. Formula:
$D_{chord}(x_1, x_2) = \sqrt{\sum_{j=1}^{p} \left( \frac{y_{1j}}{\text{Row.norm}_1} - \frac{y_{2j}}{\text{Row.norm}_2} \right)^2}$
This D is insensitive to double zeros (double-zero asymmetrical). This distance has excellent properties for the analysis of beta diversity, as will be seen in a later presentation.
Maximum value of D chord = $\sqrt{2}$ for sites with no species in common.

47 The chord distance is actually the length of a chord between two points along the circumference of a unit circle. This is a geometric notion. This measure is applied to data that have been normed, so that the norm (or length) of each transformed row vector is 1.
[Figure: unit circle in the plane of Species y1 (abscissa) and Species y2 (ordinate); the chord between points x1 and x2 on the circle is D chord (x1, x2).]
The maximum value of D chord is $\sqrt{2}$, for sites that have no species in common.

48 The Hellinger distance
Example: community composition data [matrix Y: Site1, Site2 and Site3 described by Sp1 to Sp5, with row sums $y_{i+}$].
The data are first divided by the sum of each row, $y_{ij} / y_{i+}$, and square-rooted, producing square-rooted species profiles:
$y'_{ij} = \sqrt{\frac{y_{ij}}{y_{i+}}}$
This is called the Hellinger transformation.

49 Compute square-rooted species profiles, $y'_{ij} = \sqrt{y_{ij} / y_{i+}}$, then compute the Euclidean distance among the scaled rows. Formula:
$D_{Hellinger}(x_1, x_2) = \sqrt{\sum_{j=1}^{p} \left( \sqrt{\frac{y_{1j}}{y_{1+}}} - \sqrt{\frac{y_{2j}}{y_{2+}}} \right)^2}$
This D is insensitive to double zeros (double-zero asymmetrical). This distance has excellent properties for the analysis of beta diversity, as will be seen in a later presentation.
Maximum value of D Hellinger = $\sqrt{2}$ for sites with no species in common.
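The profile, chord and Hellinger transformations can be written in a few lines of base R. This is a sketch only: the function names profile.tr(), chord.tr() and hellinger.tr() are hypothetical, the matrix Y is the 3 × 5 matrix generated in the presentation's own R example, and Y.disjoint is an invented pair of sites with no species in common.

```r
# 3 sites x 5 species (matrix of the presentation's R example)
Y <- matrix(c(45,25,7,10,8,15,15,10,20,0,0,14,10,3,12), 3, 5)

# Row-wise transformations (a matrix divided by a vector of length
# nrow(Y) is divided row by row in R)
profile.tr   <- function(Y) Y / rowSums(Y)           # relative abundances
chord.tr     <- function(Y) Y / sqrt(rowSums(Y^2))   # rows of norm 1
hellinger.tr <- function(Y) sqrt(Y / rowSums(Y))     # sqrt of profiles

# Same-name distances: Euclidean distance on the transformed data
D.profile   <- dist(profile.tr(Y))
D.chord     <- dist(chord.tr(Y))
D.hellinger <- dist(hellinger.tr(Y))

# Hellinger distance = chord distance computed on sqrt(Y)
D.chord.sqrt <- dist(chord.tr(sqrt(Y)))

# Two hypothetical sites with no species in common
Y.disjoint <- rbind(c(3, 0, 2, 0),
                    c(0, 7, 0, 1))
dist(chord.tr(Y.disjoint))       # sqrt(2)
dist(hellinger.tr(Y.disjoint))   # sqrt(2)
```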

50 Relationships
The Hellinger distance is actually the chord distance computed on square-rooted species abundance data. Example with the (3 × 5) matrix:
Y =      Sp1  Sp2  Sp3  Sp4  Sp5   [y i+]
Site1     45   10   15    0   10   [80]
Site2     25    8   10    0    3   [46]
Site3      7   15   20   14   12   [68]

51 Hellinger D = chord D after taking the square root of the abundances.
# Generate the data matrix (3 sites x 5 species)
mat <- matrix(c(45,25,7,10,8,15,15,10,20,0,0,14,10,3,12), 3, 5)
# Compute the Hellinger distance on mat using dist.ldc()
library(adespatial)
( D.hel <- dist.ldc(mat, "hellinger") )
# Compute the chord distance on mat
( D.chord <- dist.ldc(mat, "chord") )
# Compute the chord distance on sqrt(mat)
( D <- dist.ldc(sqrt(mat), "chord") )

52 The log-chord distance
Instead of a square root, one can compute the log of the abundances, $y'_{ij} = \log_e(y_{ij} + 1)$, before computing the chord transformation on the $y'_{ij}$ values:
$y''_{ij} = \frac{y'_{ij}}{\sqrt{\sum_{j=1}^{p} (y'_{ij})^2}}$
The combination of these two transformations is called the log-chord transformation. The Euclidean distance can then be computed on the transformed data to obtain the log-chord distance (Legendre & Borcard, submitted).
This D has all the properties of the chord D. It is thus insensitive to double zeros (double-zero asymmetrical).
Note: The Euclidean distance computed on log(y+1) data is not double-zero asymmetrical. So it is inappropriate for community composition data.

53 Idea linking the chord, Hellinger and log-chord transformations
λ = {1, 0.5, 0} are members of the Box-Cox series of normalizing transformations: $f(y) = (y^{\lambda} - 1) / \lambda$
plain chord transformation: λ = 1 => $y_{ij}^{1}$ (no transformation), then chord transformation.
Hellinger transformation: λ = 0.5 => $y_{ij}^{0.5}$ before the chord transformation.
log-chord transformation: for λ = 0, the limit of f(y) when λ approaches 0 is $\log_e(y)$ (Box & Cox, 1964). We use $\log_e(y_{ij} + 1)$ because there are abundances of 0 in community composition data and log(0) = −Inf. The log transformation is used to normalize strongly asymmetric frequency distributions before applying the chord transformation.
=> All D based on the chord transformation inherit the properties of the chord D. In particular, they are double-zero asymmetrical.
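The three members of the series can be sketched as one base-R function. The function name bc.chord.tr() is hypothetical. Following the list above, the sketch applies $y^{\lambda}$ for λ > 0 and $\log_e(y + 1)$ for λ = 0 before the chord transformation; it is an illustration under these assumptions, not the code of a published function. The matrix Y is the 3 × 5 matrix of the presentation's R example.

```r
# 3 sites x 5 species (matrix of the presentation's R example)
Y <- matrix(c(45,25,7,10,8,15,15,10,20,0,0,14,10,3,12), 3, 5)

# Hypothetical helper: power or log transformation, then chord transformation
bc.chord.tr <- function(Y, lambda) {
  Y.bc <- if (lambda == 0) log1p(Y) else Y^lambda
  Y.bc / sqrt(rowSums(Y.bc^2))      # each row now has norm 1
}

D.chord     <- dist(bc.chord.tr(Y, 1))     # plain chord distance
D.hellinger <- dist(bc.chord.tr(Y, 0.5))   # Hellinger distance
D.logchord  <- dist(bc.chord.tr(Y, 0))     # log-chord distance

# Double-zero asymmetry: adding a species absent from all sites
# leaves the log-chord distances unchanged
Y.zero <- cbind(Y, 0)
D.logchord.zero <- dist(bc.chord.tr(Y.zero, 0))
```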

54 The chi-square distance
The chi-square distance is an important coefficient. It is the distance preserved in correspondence analysis (CA).
The chi-square distance can be computed on data that are nonnegative (i.e. ≥ 0), frequency-like (1), and dimensionally homogeneous.
(1) Examples: community composition or biomass data; monetary units (e.g. $).

55 Example: community composition data [matrix Y: Site1, Site2 and Site3 described by Sp1 to Sp5, with row sums $y_{i+}$, column sums $y_{+j}$ and grand total $y_{++}$].
1. Transform the abundances into relative abundances by row, $y_{ij} / y_{i+}$.
2. Compute a weighted Euclidean distance of the relative abundances, using the inverses of the column sums as weights:
$D_{chi.sq}(x_1, x_2) = \sqrt{y_{++}} \sqrt{\sum_{j=1}^{p} \frac{1}{y_{+j}} \left( \frac{y_{1j}}{y_{1+}} - \frac{y_{2j}}{y_{2+}} \right)^2}$

56 2. Compute a weighted Euclidean distance of the relative abundances, using the inverses of the column sums ($1 / y_{+j}$) as weights.
=> Using these weights actually gives more importance to the rare species, which have small column sums, in the estimation of the dissimilarity, than to the more abundant and ubiquitous species, which have larger column sums.
This is a good idea for ecologists who find the presence of rare species to be more informative than the presence of abundant and ubiquitous species. A rare species found at a few sites may indicate special environmental conditions that are required by that species.
However, if the rare species are less precisely sampled than the more common species, one should avoid the chi-square distance, and therefore also CA.

57 Chi-square distance formula:
$D_{chi.sq}(x_1, x_2) = \sqrt{y_{++}} \sqrt{\sum_{j=1}^{p} \frac{1}{y_{+j}} \left( \frac{y_{1j}}{y_{1+}} - \frac{y_{2j}}{y_{2+}} \right)^2}$
In the comparison of two sites, pairs of 0 do not change the value of D chi.sq. So it is a double-zero asymmetrical D index.
Maximum value of D chi.sq = $\sqrt{2 y_{++}}$ for sites with no species in common.
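The chi-square distance formula can be sketched in base R as a transformation followed by the Euclidean distance. The function name chisq.dist() is hypothetical and the code assumes that no row or column of Y sums to zero. The matrix Y is the 3 × 5 matrix of the presentation's R example; Y.max is an invented two-site case.

```r
# 3 sites x 5 species (matrix of the presentation's R example)
Y <- matrix(c(45,25,7,10,8,15,15,10,20,0,0,14,10,3,12), 3, 5)

# Hypothetical helper: chi-square distance among the rows of Y
chisq.dist <- function(Y) {
  Y <- as.matrix(Y)
  prof <- Y / rowSums(Y)                        # relative abundances by row
  # divide each column by sqrt(column sum), multiply by sqrt(grand total)
  Y.tr <- sqrt(sum(Y)) * sweep(prof, 2, sqrt(colSums(Y)), "/")
  dist(Y.tr)                                    # Euclidean distance
}

D.chisq <- chisq.dist(Y)
D.chisq

# Direct application of the formula to sites 1 and 2, for comparison
p1 <- Y[1, ] / sum(Y[1, ])
p2 <- Y[2, ] / sum(Y[2, ])
D12 <- sqrt(sum(Y)) * sqrt(sum((p1 - p2)^2 / colSums(Y)))

# Two hypothetical sites with one species each and none in common:
# y++ = 2, so the distance is sqrt(2 * 2) = 2
Y.max <- matrix(c(1, 0, 0, 1), 2, 2)
chisq.dist(Y.max)
```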

58 Binary forms of some quantitative coefficients
Quantitative coefficient functions can be computed using presence-absence (1-0) data. In many cases, the result is equivalent to, or a simple transformation of, usual indices for presence-absence data.
D Euclidean = sqrt(p × D simple matching), where p is the number of variables
D Ružička, D Canberra, D Wishart = D Jaccard
D %difference = D Sørensen
Hellinger D, chord D = sqrt(2(1 − S Ochiai))
See Legendre & De Cáceres (2013, Table 1) for other relationships between the quantitative and binary forms.
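Some of these relationships can be verified numerically in base R. The two presence-absence vectors below are hypothetical (a = 2, b = 2, c = 1, d = 2, p = 7); the quantitative coefficients are computed from their formulas, not with package functions.

```r
# Hypothetical presence-absence vectors
x1 <- c(1, 1, 1, 1, 0, 0, 0)
x2 <- c(1, 1, 0, 0, 1, 0, 0)
p  <- length(x1)

a  <- sum(x1 == 1 & x2 == 1)
b  <- sum(x1 == 1 & x2 == 0)
cc <- sum(x1 == 0 & x2 == 1)   # named cc to avoid masking c()

# Binary indices
D.SM  <- (b + cc) / p
D.Jac <- (b + cc) / (a + b + cc)
D.Sor <- (b + cc) / (2 * a + b + cc)
S.Och <- a / sqrt((a + b) * (a + cc))

# Quantitative coefficients computed on the same 1-0 data
D.euc    <- sqrt(sum((x1 - x2)^2))
D.ruz    <- 1 - sum(pmin(x1, x2)) / sum(pmax(x1, x2))
D.pcdiff <- sum(abs(x1 - x2)) / sum(x1 + x2)
D.chord  <- sqrt(sum((x1 / sqrt(sum(x1^2)) - x2 / sqrt(sum(x2^2)))^2))
D.hel    <- sqrt(sum((sqrt(x1 / sum(x1)) - sqrt(x2 / sum(x2)))^2))

c(D.euc, sqrt(p * D.SM))
c(D.ruz, D.Jac)
c(D.pcdiff, D.Sor)
c(D.chord, D.hel, sqrt(2 * (1 - S.Och)))
```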

59 Other useful asymmetrical coefficients
Other useful double-zero asymmetrical coefficients are available in R packages:
In dist.ldc() of {adespatial}: coefficient of divergence, Canberra metric, Whittaker D, Wishart D, Kulczynski D.
Also in dist.ldc(): four abundance-based coefficients of Chao et al. (2006) for quantitative data. These functions correct the index for species that have not been observed due to sampling errors.
Other R packages, not listed here, also contain indices for community data. One should check if these indices are Euclidean in the form D or sqrt(D) before using them for PCoA ordination or beta diversity studies.

60 Computing D through data transformations Compute asymmetrical D indices for community composition data as follows: data transformation followed by calculation of the Euclidean distance, as shown in the section on asymmetrical indices for quantitative community composition data.

61 For community composition data, after transformation (e.g. Hellinger): (a) compute D Euclidean, which gives, for example, the Hellinger distance among sites; (b) or use the transformed data as input into linear methods of data analysis. There is no need to compute the D matrix in that case.

62 3. Pre-transformation of species data: illustration
[Figure: ordinations in reduced space for the species abundance paradox data (Orlóci, 1978). Panels: Euclidean, Chord, Species profiles, Hellinger, Chi-square.]

63 The previous slide shows that the chord, species profile, Hellinger and chi-square transformations, followed by calculation of the Euclidean distance, produce the same-name dissimilarity indices, which are double-zero asymmetrical, metric and Euclidean.

64 The Euclidean distance paradox
[Example data: Site1, Site2 and Site3 described by Sp.1, Sp.2 and Sp.3, with row sums $y_{i+}$ and row norms.]
Compute the Euclidean distance among the data rows. [D Euclidean matrix among Site1, Site2 and Site3.]
According to these D results, the two closest sites are 2 and 3, despite the fact that sites 2 and 3 have no species in common. For ecologists, two sites that have no species in common are very different. Sharing species is more important than differences in abundances.

65 [Same example data and D Euclidean matrix as on the previous slide.]
The two least different sites in the data matrix are (1 and 2), which share 2 species, yet the Euclidean distance gives them a large distance.
The most different pairs are (1, 3) and (2, 3), which have no species in common, yet D Euclidean gives a very small distance to pair (2, 3).
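The paradox can be reproduced with a small base-R sketch. The abundances below are an assumption: they were chosen to match the description (sites 1 and 2 share two species, site 3 shares none with the others) and are not taken from the table of the slide.

```r
# Assumed abundances: 3 sites x 3 species
Y <- rbind(Site1 = c(0, 4, 8),
           Site2 = c(0, 1, 1),
           Site3 = c(1, 0, 0))
colnames(Y) <- c("Sp.1", "Sp.2", "Sp.3")

# Euclidean distance on the raw abundances
D.euc <- as.matrix(dist(Y))

# Chord distance: Euclidean distance after dividing each row by its norm
D.chord <- as.matrix(dist(Y / sqrt(rowSums(Y^2))))

round(D.euc, 3)     # pair (2, 3) is the closest pair
round(D.chord, 3)   # pair (1, 2) is the closest; (1, 3) and (2, 3) reach sqrt(2)
```

With the chord distance, the two pairs of sites that have no species in common receive the maximum value, sqrt(2), and the pair that shares two species receives the smallest value.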

66 3. Pre-transformation of species data: illustration
[Figure: ordinations in reduced space for the species abundance paradox data (Orlóci, 1978), showing the least different and most different pairs in the data matrix. Panels: Euclidean, Chord, Species profiles, Hellinger, Chi-square.]

67 The previous slide shows that the Euclidean distance can give a small D value to a pair of sites that have no species in common, wrongly suggesting that they are highly similar. In contrast, the chord, species profile, Hellinger and chi-square distances produce smaller D values for pairs of sites that contain the same species than for pairs of sites where different species assemblages are found.

68 Note that the transformations and distances are not equivalent and interchangeable. They produce different PCoA ordinations of the sites.

69 Data transformations in R
The profile, chord, log-chord, Hellinger and chi-square transformations can be computed using vegan's decostand() function.
profile transformation: Y.tr = decostand(Y, "total")
chord transformation: Y.tr = decostand(Y, "norm")
log-chord transformation: Y.tr = decostand(log1p(Y), "norm")
Hellinger transformation: Y.tr = decostand(Y, "hellinger")
chi-square transformation: Y.tr = decostand(Y, "chi.sq")
The transformed data can be used as input into linear methods of data analysis: PCA, RDA, k-means partitioning, MANOVA, etc.
After transforming the data, compute the Euclidean distance using dist() of {stats} to obtain the same-name distances.
Direct calculation of the chord, species profile, Hellinger and chi-square distances is available in function dist.ldc() of {adespatial}.

70 (b) The transformed data matrices can be used as input into linear methods of data analysis that preserve the Euclidean distance, such as PCA (tb-PCA), RDA (tb-RDA) and k-means partitioning. In these analyses, the chord, log-chord, species profile, Hellinger and chi-square distances, which are double-zero asymmetrical, will be preserved instead of the symmetrical Euclidean distance.
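The preservation of the same-name distance by a transformation-based PCA can be checked with base R only. The sketch below uses prcomp() of {stats} and writes the Hellinger transformation by hand instead of calling decostand(); Y is the 3 × 5 matrix of the presentation's R example.

```r
# 3 sites x 5 species (matrix of the presentation's R example)
Y <- matrix(c(45,25,7,10,8,15,15,10,20,0,0,14,10,3,12), 3, 5)

# Hellinger transformation written in base R
Y.hel <- sqrt(Y / rowSums(Y))

# tb-PCA: PCA computed on the Hellinger-transformed data
pca <- prcomp(Y.hel)

# Hellinger distances among sites
D.hel <- dist(Y.hel)

# Euclidean distances among the sites in the full PCA ordination space
D.pca <- dist(pca$x)

round(D.hel, 4)
round(D.pca, 4)   # same values: the PCA preserves the Hellinger distance
```

When only the first few principal components are kept, the distances among sites in the reduced space are approximations of the Hellinger distances.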

71 References cited
Box, G. E. P. & D. R. Cox. 1964. An analysis of transformations. J. Roy. Statist. Soc. Ser. B 26.
Borcard, D., F. Gillet & P. Legendre. 2018. Numerical ecology with R, 2nd edition. Use R! series, Springer Science, New York.
Bray, R. J. & J. T. Curtis. 1957. An ordination of the upland forest communities of southern Wisconsin. Ecological Monographs 27.
Chao, A., R. L. Chazdon, R. K. Colwell & T. J. Shen. 2006. Abundance-based similarity indices and their estimation when there are unseen species in samples. Biometrics 62.
Gower, J. C. 1971. A general coefficient of similarity and some of its properties. Biometrics 27.
Legendre, P. & D. Borcard. Box-Cox-chord transformations for community composition data prior to beta diversity analysis. Ecography (submitted).
Legendre, P. & M. De Cáceres. 2013. Beta diversity as the variance of community data: dissimilarity coefficients and partitioning. Ecology Letters 16.
Legendre, P. & L. Legendre. 2012. Numerical ecology, 3rd English edition. Elsevier Science BV, Amsterdam.
Odum, E. P. 1950. Bird populations of the Highlands (North Carolina) Plateau in relation to plant succession and avian invasion. Ecology 31.
Whittaker, R. H. 1952. A study of summer foliage insect communities in the Great Smoky Mountains. Ecological Monographs 22.
Whittaker, R. H. 1972. Evolution and measurement of species diversity. Taxon 21.

72 End of the presentation


More information

Community surveys through space and time: testing the space time interaction

Community surveys through space and time: testing the space time interaction Suivi spatio-temporel des écosystèmes : tester l'interaction espace-temps pour identifier les impacts sur les communautés Community surveys through space and time: testing the space time interaction Pierre

More information

Beta diversity as the variance of community data: dissimilarity coefficients and partitioning

Beta diversity as the variance of community data: dissimilarity coefficients and partitioning Ecology Letters, (2013) 16: 951 963 doi: 10.1111/ele.12141 IDEA AND PERSPECTIVE Beta diversity as the variance of community data: dissimilarity coefficients and partitioning Pierre Legendre 1 * and Miquel

More information

Analysis of Multivariate Ecological Data

Analysis of Multivariate Ecological Data Analysis of Multivariate Ecological Data School on Recent Advances in Analysis of Multivariate Ecological Data 24-28 October 2016 Prof. Pierre Legendre Dr. Daniel Borcard Département de sciences biologiques

More information

Distance Measures. Objectives: Discuss Distance Measures Illustrate Distance Measures

Distance Measures. Objectives: Discuss Distance Measures Illustrate Distance Measures Distance Measures Objectives: Discuss Distance Measures Illustrate Distance Measures Quantifying Data Similarity Multivariate Analyses Re-map the data from Real World Space to Multi-variate Space Distance

More information

Multivariate Analysis of Ecological Data

Multivariate Analysis of Ecological Data Multivariate Analysis of Ecological Data MICHAEL GREENACRE Professor of Statistics at the Pompeu Fabra University in Barcelona, Spain RAUL PRIMICERIO Associate Professor of Ecology, Evolutionary Biology

More information

Ordination & PCA. Ordination. Ordination

Ordination & PCA. Ordination. Ordination Ordination & PCA Introduction to Ordination Purpose & types Shepard diagrams Principal Components Analysis (PCA) Properties Computing eigenvalues Computing principal components Biplots Covariance vs. Correlation

More information

Temporal eigenfunction methods for multiscale analysis of community composition and other multivariate data

Temporal eigenfunction methods for multiscale analysis of community composition and other multivariate data Temporal eigenfunction methods for multiscale analysis of community composition and other multivariate data Pierre Legendre Département de sciences biologiques Université de Montréal Pierre.Legendre@umontreal.ca

More information

Chapter 11 Canonical analysis

Chapter 11 Canonical analysis Chapter 11 Canonical analysis 11.0 Principles of canonical analysis Canonical analysis is the simultaneous analysis of two, or possibly several data tables. Canonical analyses allow ecologists to perform

More information

BIO 682 Multivariate Statistics Spring 2008

BIO 682 Multivariate Statistics Spring 2008 BIO 682 Multivariate Statistics Spring 2008 Steve Shuster http://www4.nau.edu/shustercourses/bio682/index.htm Lecture 11 Properties of Community Data Gauch 1982, Causton 1988, Jongman 1995 a. Qualitative:

More information

Multivariate Statistics 101. Ordination (PCA, NMDS, CA) Cluster Analysis (UPGMA, Ward s) Canonical Correspondence Analysis

Multivariate Statistics 101. Ordination (PCA, NMDS, CA) Cluster Analysis (UPGMA, Ward s) Canonical Correspondence Analysis Multivariate Statistics 101 Ordination (PCA, NMDS, CA) Cluster Analysis (UPGMA, Ward s) Canonical Correspondence Analysis Multivariate Statistics 101 Copy of slides and exercises PAST software download

More information

Lecture 5: Ecological distance metrics; Principal Coordinates Analysis. Univariate testing vs. community analysis

Lecture 5: Ecological distance metrics; Principal Coordinates Analysis. Univariate testing vs. community analysis Lecture 5: Ecological distance metrics; Principal Coordinates Analysis Univariate testing vs. community analysis Univariate testing deals with hypotheses concerning individual taxa Is this taxon differentially

More information

Community surveys through space and time: testing the space-time interaction in the absence of replication

Community surveys through space and time: testing the space-time interaction in the absence of replication Community surveys through space and time: testing the space-time interaction in the absence of replication Pierre Legendre Département de sciences biologiques Université de Montréal http://www.numericalecology.com/

More information

diversity(datamatrix, index= shannon, base=exp(1))

diversity(datamatrix, index= shannon, base=exp(1)) Tutorial 11: Diversity, Indicator Species Analysis, Cluster Analysis Calculating Diversity Indices The vegan package contains the command diversity() for calculating Shannon and Simpson diversity indices.

More information

INTRODUCTION TO MULTIVARIATE ANALYSIS OF ECOLOGICAL DATA

INTRODUCTION TO MULTIVARIATE ANALYSIS OF ECOLOGICAL DATA INTRODUCTION TO MULTIVARIATE ANALYSIS OF ECOLOGICAL DATA David Zelený & Ching-Feng Li INTRODUCTION TO MULTIVARIATE ANALYSIS Ecologial similarity similarity and distance indices Gradient analysis regression,

More information

Species Associations: The Kendall Coefficient of Concordance Revisited

Species Associations: The Kendall Coefficient of Concordance Revisited Species Associations: The Kendall Coefficient of Concordance Revisited Pierre LEGENDRE The search for species associations is one of the classical problems of community ecology. This article proposes to

More information

Lecture 5: Ecological distance metrics; Principal Coordinates Analysis. Univariate testing vs. community analysis

Lecture 5: Ecological distance metrics; Principal Coordinates Analysis. Univariate testing vs. community analysis Lecture 5: Ecological distance metrics; Principal Coordinates Analysis Univariate testing vs. community analysis Univariate testing deals with hypotheses concerning individual taxa Is this taxon differentially

More information

Data Mining: Data. Lecture Notes for Chapter 2. Introduction to Data Mining

Data Mining: Data. Lecture Notes for Chapter 2. Introduction to Data Mining Data Mining: Data Lecture Notes for Chapter 2 Introduction to Data Mining by Tan, Steinbach, Kumar Similarity and Dissimilarity Similarity Numerical measure of how alike two data objects are. Is higher

More information

Analysis of community ecology data in R

Analysis of community ecology data in R Analysis of community ecology data in R Jinliang Liu ( 刘金亮 ) Institute of Ecology, College of Life Science Zhejiang University Email: jinliang.liu@foxmail.com http://jinliang.weebly.com R packages ###

More information

VarCan (version 1): Variation Estimation and Partitioning in Canonical Analysis

VarCan (version 1): Variation Estimation and Partitioning in Canonical Analysis VarCan (version 1): Variation Estimation and Partitioning in Canonical Analysis Pedro R. Peres-Neto March 2005 Department of Biology University of Regina Regina, SK S4S 0A2, Canada E-mail: Pedro.Peres-Neto@uregina.ca

More information

Use R! Series Editors:

Use R! Series Editors: Use R! Series Editors: Robert Gentleman Kurt Hornik Giovanni G. Parmigiani Use R! Albert: Bayesian Computation with R Bivand/Pebesma/Gómez-Rubio: Applied Spatial Data Analysis with R Cook/Swayne: Interactive

More information

Community surveys through space and time: testing the space-time interaction in the absence of replication

Community surveys through space and time: testing the space-time interaction in the absence of replication Community surveys through space and time: testing the space-time interaction in the absence of replication Pierre Legendre, Miquel De Cáceres & Daniel Borcard Département de sciences biologiques, Université

More information

Compositional similarity and β (beta) diversity

Compositional similarity and β (beta) diversity CHAPTER 6 Compositional similarity and β (beta) diversity Lou Jost, Anne Chao, and Robin L. Chazdon 6.1 Introduction Spatial variation in species composition is one of the most fundamental and conspicuous

More information

4/2/2018. Canonical Analyses Analysis aimed at identifying the relationship between two multivariate datasets. Cannonical Correlation.

4/2/2018. Canonical Analyses Analysis aimed at identifying the relationship between two multivariate datasets. Cannonical Correlation. GAL50.44 0 7 becki 2 0 chatamensis 0 darwini 0 ephyppium 0 guntheri 3 0 hoodensis 0 microphyles 0 porteri 2 0 vandenburghi 0 vicina 4 0 Multiple Response Variables? Univariate Statistics Questions Individual

More information

Data Mining 4. Cluster Analysis

Data Mining 4. Cluster Analysis Data Mining 4. Cluster Analysis 4.2 Spring 2010 Instructor: Dr. Masoud Yaghini Outline Data Structures Interval-Valued (Numeric) Variables Binary Variables Categorical Variables Ordinal Variables Variables

More information

Similarity and Dissimilarity

Similarity and Dissimilarity 1//015 Similarity and Dissimilarity COMP 465 Data Mining Similarity of Data Data Preprocessing Slides Adapted From : Jiawei Han, Micheline Kamber & Jian Pei Data Mining: Concepts and Techniques, 3 rd ed.

More information

Characterizing and predicting cyanobacterial blooms in an 8-year

Characterizing and predicting cyanobacterial blooms in an 8-year 1 2 3 4 5 Characterizing and predicting cyanobacterial blooms in an 8-year amplicon sequencing time-course Authors Nicolas Tromas 1*, Nathalie Fortin 2, Larbi Bedrani 1, Yves Terrat 1, Pedro Cardoso 4,

More information

Supplementary Material

Supplementary Material Supplementary Material The impact of logging and forest conversion to oil palm on soil bacterial communities in Borneo Larisa Lee-Cruz 1, David P. Edwards 2,3, Binu Tripathi 1, Jonathan M. Adams 1* 1 Department

More information

Comparison of two samples

Comparison of two samples Comparison of two samples Pierre Legendre, Université de Montréal August 009 - Introduction This lecture will describe how to compare two groups of observations (samples) to determine if they may possibly

More information

CAP. Canonical Analysis of Principal coordinates. A computer program by Marti J. Anderson. Department of Statistics University of Auckland (2002)

CAP. Canonical Analysis of Principal coordinates. A computer program by Marti J. Anderson. Department of Statistics University of Auckland (2002) CAP Canonical Analysis of Principal coordinates A computer program by Marti J. Anderson Department of Statistics University of Auckland (2002) 2 DISCLAIMER This FORTRAN program is provided without any

More information

MULTIV. Multivariate Exploratory Analysis, Randomization Testing and Bootstrap Resampling. User s Guide v. 2.4

MULTIV. Multivariate Exploratory Analysis, Randomization Testing and Bootstrap Resampling. User s Guide v. 2.4 MULTIV Multivariate Exploratory Analysis, Randomization Testing and Bootstrap Resampling User s Guide v. 2.4 Copyright 2006 by Valério DePatta Pillar Universidade Federal do Rio Grande do Sul, Porto Alegre,

More information

Statistics 202: Data Mining. c Jonathan Taylor. Week 2 Based in part on slides from textbook, slides of Susan Holmes. October 3, / 1

Statistics 202: Data Mining. c Jonathan Taylor. Week 2 Based in part on slides from textbook, slides of Susan Holmes. October 3, / 1 Week 2 Based in part on slides from textbook, slides of Susan Holmes October 3, 2012 1 / 1 Part I Other datatypes, preprocessing 2 / 1 Other datatypes Document data You might start with a collection of

More information

Part I. Other datatypes, preprocessing. Other datatypes. Other datatypes. Week 2 Based in part on slides from textbook, slides of Susan Holmes

Part I. Other datatypes, preprocessing. Other datatypes. Other datatypes. Week 2 Based in part on slides from textbook, slides of Susan Holmes Week 2 Based in part on slides from textbook, slides of Susan Holmes Part I Other datatypes, preprocessing October 3, 2012 1 / 1 2 / 1 Other datatypes Other datatypes Document data You might start with

More information

The discussion Analyzing beta diversity contains the following papers:

The discussion Analyzing beta diversity contains the following papers: The discussion Analyzing beta diversity contains the following papers: Legendre, P., D. Borcard, and P. Peres-Neto. 2005. Analyzing beta diversity: partitioning the spatial variation of community composition

More information

4/4/2018. Stepwise model fitting. CCA with first three variables only Call: cca(formula = community ~ env1 + env2 + env3, data = envdata)

4/4/2018. Stepwise model fitting. CCA with first three variables only Call: cca(formula = community ~ env1 + env2 + env3, data = envdata) 0 Correlation matrix for ironmental matrix 1 2 3 4 5 6 7 8 9 10 11 12 0.087451 0.113264 0.225049-0.13835 0.338366-0.01485 0.166309-0.11046 0.088327-0.41099-0.19944 1 1 2 0.087451 1 0.13723-0.27979 0.062584

More information

DETECTING BIOLOGICAL AND ENVIRONMENTAL CHANGES: DESIGN AND ANALYSIS OF MONITORING AND EXPERIMENTS (University of Bologna, 3-14 March 2008)

DETECTING BIOLOGICAL AND ENVIRONMENTAL CHANGES: DESIGN AND ANALYSIS OF MONITORING AND EXPERIMENTS (University of Bologna, 3-14 March 2008) Dipartimento di Biologia Evoluzionistica Sperimentale Centro Interdipartimentale di Ricerca per le Scienze Ambientali in Ravenna INTERNATIONAL WINTER SCHOOL UNIVERSITY OF BOLOGNA DETECTING BIOLOGICAL AND

More information

-Principal components analysis is by far the oldest multivariate technique, dating back to the early 1900's; ecologists have used PCA since the

-Principal components analysis is by far the oldest multivariate technique, dating back to the early 1900's; ecologists have used PCA since the 1 2 3 -Principal components analysis is by far the oldest multivariate technique, dating back to the early 1900's; ecologists have used PCA since the 1950's. -PCA is based on covariance or correlation

More information

Distances and similarities Based in part on slides from textbook, slides of Susan Holmes. October 3, Statistics 202: Data Mining

Distances and similarities Based in part on slides from textbook, slides of Susan Holmes. October 3, Statistics 202: Data Mining Distances and similarities Based in part on slides from textbook, slides of Susan Holmes October 3, 2012 1 / 1 Similarities Start with X which we assume is centered and standardized. The PCA loadings were

More information

CS249: ADVANCED DATA MINING

CS249: ADVANCED DATA MINING CS249: ADVANCED DATA MINING Clustering Evaluation and Practical Issues Instructor: Yizhou Sun yzsun@cs.ucla.edu May 2, 2017 Announcements Homework 2 due later today Due May 3 rd (11:59pm) Course project

More information

Analysis of Multivariate Ecological Data

Analysis of Multivariate Ecological Data Analysis of Multivariate Ecological Data School on Recent Advances in Analysis of Multivariate Ecological Data 24-28 October 2016 Prof. Pierre Legendre Dr. Daniel Borcard Département de sciences biologiques

More information

Similarity Numerical measure of how alike two data objects are Value is higher when objects are more alike Often falls in the range [0,1]

Similarity Numerical measure of how alike two data objects are Value is higher when objects are more alike Often falls in the range [0,1] Similarity and Dissimilarity Similarity Numerical measure of how alike two data objects are Value is higher when objects are more alike Often falls in the range [0,1] Dissimilarity (e.g., distance) Numerical

More information

ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015

ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015 ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015 http://intelligentoptimization.org/lionbook Roberto Battiti

More information

Lecture 2: Diversity, Distances, adonis. Lecture 2: Diversity, Distances, adonis. Alpha- Diversity. Alpha diversity definition(s)

Lecture 2: Diversity, Distances, adonis. Lecture 2: Diversity, Distances, adonis. Alpha- Diversity. Alpha diversity definition(s) Lecture 2: Diversity, Distances, adonis Lecture 2: Diversity, Distances, adonis Diversity - alpha, beta (, gamma) Beta- Diversity in practice: Ecological Distances Unsupervised Learning: Clustering, etc

More information

Spatial eigenfunction modelling: recent developments

Spatial eigenfunction modelling: recent developments Spatial eigenfunction modelling: recent developments Pierre Legendre Département de sciences biologiques Université de Montréal http://www.numericalecology.com/ Pierre Legendre 2018 Outline of the presentation

More information

Introduction to ordination. Gary Bradfield Botany Dept.

Introduction to ordination. Gary Bradfield Botany Dept. Introduction to ordination Gary Bradfield Botany Dept. Ordination there appears to be no word in English which one can use as an antonym to classification ; I would like to propose the term ordination.

More information

Correspondence Analysis & Related Methods

Correspondence Analysis & Related Methods Correspondence Analysis & Related Methods Michael Greenacre SESSION 9: CA applied to rankings, preferences & paired comparisons Correspondence analysis (CA) can also be applied to other types of data:

More information

Ecological Resemblance. Ecological Resemblance. Modes of Analysis. - Outline - Welcome to Paradise

Ecological Resemblance. Ecological Resemblance. Modes of Analysis. - Outline - Welcome to Paradise Ecological Resemblance - Outline - Ecological Resemblance Mode of analysis Analytical saces Association Coefficients Q-mode similarity coefficients Symmetrical binary coefficients Asymmetrical binary coefficients

More information

Species associations

Species associations Species associations Pierre Legendre 1 and F. Guillaume Blanchet 2 1 Département de sciences biologiques, Université de Montréal 2 Department of Renewable Resources, University of Alberta Introduction

More information

A Statistical Distance Approach to Dissimilarities in Ecological Data

A Statistical Distance Approach to Dissimilarities in Ecological Data Clemson University TigerPrints All Dissertations Dissertations 5-2015 A Statistical Distance Approach to Dissimilarities in Ecological Data Dominique Jerrod Morgan Clemson University Follow this and additional

More information

Sampling e ects on beta diversity

Sampling e ects on beta diversity Introduction Methods Results Conclusions Sampling e ects on beta diversity Ben Bolker, Adrian Stier, Craig Osenberg McMaster University, Mathematics & Statistics and Biology UBC, Zoology University of

More information

Multivariate Statistics Summary and Comparison of Techniques. Multivariate Techniques

Multivariate Statistics Summary and Comparison of Techniques. Multivariate Techniques Multivariate Statistics Summary and Comparison of Techniques P The key to multivariate statistics is understanding conceptually the relationship among techniques with regards to: < The kinds of problems

More information

Principal Components Theory Notes

Principal Components Theory Notes Principal Components Theory Notes Charles J. Geyer August 29, 2007 1 Introduction These are class notes for Stat 5601 (nonparametrics) taught at the University of Minnesota, Spring 2006. This not a theory

More information

Figure 43 - The three components of spatial variation

Figure 43 - The three components of spatial variation Université Laval Analyse multivariable - mars-avril 2008 1 6.3 Modeling spatial structures 6.3.1 Introduction: the 3 components of spatial structure For a good understanding of the nature of spatial variation,

More information

Clustering Lecture 1: Basics. Jing Gao SUNY Buffalo

Clustering Lecture 1: Basics. Jing Gao SUNY Buffalo Clustering Lecture 1: Basics Jing Gao SUNY Buffalo 1 Outline Basics Motivation, definition, evaluation Methods Partitional Hierarchical Density-based Mixture model Spectral methods Advanced topics Clustering

More information

EXAM PRACTICE. 12 questions * 4 categories: Statistics Background Multivariate Statistics Interpret True / False

EXAM PRACTICE. 12 questions * 4 categories: Statistics Background Multivariate Statistics Interpret True / False EXAM PRACTICE 12 questions * 4 categories: Statistics Background Multivariate Statistics Interpret True / False Stats 1: What is a Hypothesis? A testable assertion about how the world works Hypothesis

More information

Type of data Interval-scaled variables: Binary variables: Nominal, ordinal, and ratio variables: Variables of mixed types: Attribute Types Nominal: Pr

Type of data Interval-scaled variables: Binary variables: Nominal, ordinal, and ratio variables: Variables of mixed types: Attribute Types Nominal: Pr Foundation of Data Mining i Topic: Data CMSC 49D/69D CSEE Department, e t, UMBC Some of the slides used in this presentation are prepared by Jiawei Han and Micheline Kamber Data Data types Quality of data

More information

Clustering Ambiguity: An Overview

Clustering Ambiguity: An Overview Clustering Ambiguity: An Overview John D. MacCuish Norah E. MacCuish 3 rd Joint Sheffield Conference on Chemoinformatics April 23, 2004 Outline The Problem: Clustering Ambiguity and Chemoinformatics Preliminaries:

More information

Canonical analysis. Pierre Legendre Département de sciences biologiques Université de Montréal

Canonical analysis. Pierre Legendre Département de sciences biologiques Université de Montréal Canonical analysis Pierre Legendre Département de sciences biologiques Université de Montréal http://www.numericalecology.com/ Pierre Legendre 2017 Outline of the presentation 1. Canonical analysis: definition

More information

Inderjit Dhillon The University of Texas at Austin

Inderjit Dhillon The University of Texas at Austin Inderjit Dhillon The University of Texas at Austin ( Universidad Carlos III de Madrid; 15 th June, 2012) (Based on joint work with J. Brickell, S. Sra, J. Tropp) Introduction 2 / 29 Notion of distance

More information

Inconsistencies between theory and methodology: a recurrent problem in ordination studies.

Inconsistencies between theory and methodology: a recurrent problem in ordination studies. This is the pre-peer-reviewed version of the following article: Inconsistencies between theory and methodology: recurrent problem in ordination studies, Austin, M., Journal of Vegetation Science, vol.

More information

Experimental Design and Data Analysis for Biologists

Experimental Design and Data Analysis for Biologists Experimental Design and Data Analysis for Biologists Gerry P. Quinn Monash University Michael J. Keough University of Melbourne CAMBRIDGE UNIVERSITY PRESS Contents Preface page xv I I Introduction 1 1.1

More information

Revision: Chapter 1-6. Applied Multivariate Statistics Spring 2012

Revision: Chapter 1-6. Applied Multivariate Statistics Spring 2012 Revision: Chapter 1-6 Applied Multivariate Statistics Spring 2012 Overview Cov, Cor, Mahalanobis, MV normal distribution Visualization: Stars plot, mosaic plot with shading Outlier: chisq.plot Missing

More information

MSc in Statistics and Operations Research

MSc in Statistics and Operations Research MSc in Statistics and Operations Research Title: Permutation multivariate analysis of variance on real data and simulations to evaluate for robustness against dispersion and unbalancedness. Author: Lucas

More information

An Introduction to Ordination Connie Clark

An Introduction to Ordination Connie Clark An Introduction to Ordination Connie Clark Ordination is a collective term for multivariate techniques that adapt a multidimensional swarm of data points in such a way that when it is projected onto a

More information

University of Florida CISE department Gator Engineering. Clustering Part 1

University of Florida CISE department Gator Engineering. Clustering Part 1 Clustering Part 1 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville What is Cluster Analysis? Finding groups of objects such that the objects

More information

Spatial non-stationarity, anisotropy and scale: The interactive visualisation of spatial turnover

Spatial non-stationarity, anisotropy and scale: The interactive visualisation of spatial turnover 19th International Congress on Modelling and Simulation, Perth, Australia, 12 16 December 2011 http://mssanz.org.au/modsim2011 Spatial non-stationarity, anisotropy and scale: The interactive visualisation

More information

Data Mining: Concepts and Techniques

Data Mining: Concepts and Techniques Data Mining: Concepts and Techniques Chapter 2 1 Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign Simon Fraser University 2011 Han, Kamber, and Pei. All rights reserved.

More information

Math Introduction to Numerical Analysis - Class Notes. Fernando Guevara Vasquez. Version Date: January 17, 2012.

Math Introduction to Numerical Analysis - Class Notes. Fernando Guevara Vasquez. Version Date: January 17, 2012. Math 5620 - Introduction to Numerical Analysis - Class Notes Fernando Guevara Vasquez Version 1990. Date: January 17, 2012. 3 Contents 1. Disclaimer 4 Chapter 1. Iterative methods for solving linear systems

More information

betapart: an R package for the study of beta diversity Andre s Baselga 1 * and C. David L. Orme 2

betapart: an R package for the study of beta diversity Andre s Baselga 1 * and C. David L. Orme 2 Methods in Ecology and Evolution 2012, 3, 808 812 doi: 10.1111/j.2041-210X.2012.00224.x APPLICATION betapart: an R package for the study of beta diversity Andre s Baselga 1 * and C. David L. Orme 2 1 Departamento

More information

Navigating the multiple meanings of b diversity: a roadmap for the practicing ecologist

Navigating the multiple meanings of b diversity: a roadmap for the practicing ecologist Ecology Letters, (2010) doi: 10.1111/j.1461-0248.2010.01552.x IDEA AND PERSPECTIVE Navigating the multiple meanings of b diversity: a roadmap for the practicing ecologist Marti J. Anderson, 1 * Thomas

More information

2/19/2018. Dataset: 85,122 islands 19,392 > 1km 2 17,883 with data

2/19/2018. Dataset: 85,122 islands 19,392 > 1km 2 17,883 with data The group numbers are arbitrary. Remember that you can rotate dendrograms around any node and not change the meaning. So, the order of the clusters is not meaningful. Taking a subset of the data changes

More information

B490 Mining the Big Data

B490 Mining the Big Data B490 Mining the Big Data 1 Finding Similar Items Qin Zhang 1-1 Motivations Finding similar documents/webpages/images (Approximate) mirror sites. Application: Don t want to show both when Google. 2-1 Motivations

More information

Introduction to multivariate analysis Outline

Introduction to multivariate analysis Outline Introduction to multivariate analysis Outline Why do a multivariate analysis Ordination, classification, model fitting Principal component analysis Discriminant analysis, quickly Species presence/absence

More information

Analysis of Multivariate Ecological Data

Analysis of Multivariate Ecological Data Analysis of Multivariate Ecological Data School on Recent Advances in Analysis of Multivariate Ecological Data 24-28 October 2016 Prof. Pierre Legendre Dr. Daniel Borcard Département de sciences biologiques

More information

Data Screening and Adjustments. Data Screening for Errors

Data Screening and Adjustments. Data Screening for Errors Purpose: ata Screening and djustments P etect and correct data errors P etect and treat missing data P etect and handle insufficiently sampled variables (e.g., rare species) P onduct transformations and

More information

Package LDM. R topics documented: March 19, Type Package
