Dissimilarity and transformations. Pierre Legendre Département de sciences biologiques Université de Montréal

http://www.numericalecology.com/ Pierre Legendre 2017

Definitions
An association coefficient is a function of two data vectors that quantifies the strength of their relationship (or association).

Association coefficients can measure the relationship between data rows (objects, Q mode) or between data columns (descriptors or variables, R mode).
[Figure: data matrix Y (n objects x p descriptors). Q-mode analysis produces an association matrix A (n x n) among objects; R-mode analysis produces an association matrix A (p x p) among descriptors. Legendre & Legendre 2012, Fig. 2.1.]

Q-mode coefficients (between objects, e.g. sites) are called similarities (S) and dissimilarities (D). R-mode coefficients (between descriptors or variables) are called dependence coefficients (correlation coefficients, contingency coefficients, etc.).
In Q mode, one obtains an S or D matrix by computing similarities (S) or dissimilarities (D) between all pairs of sites. In R mode, one obtains an R matrix by computing correlation coefficients between all pairs of columns.
Note: Dissimilarities are often called distance coefficients. Technically, distances are metric dissimilarities (the metric properties are described below).

A similarity coefficient S produces values in the [0,1] range. An S matrix is symmetric. A similarity matrix S has values of 1 on the diagonal, which represent the similarity between an object and itself.

S =        Obj.1  Obj.2
    Obj.1   1.0    0.7
    Obj.2   0.7    1.0

A dissimilarity coefficient D derived from S also has values in the [0,1] range. A D matrix is symmetric. A dissimilarity matrix D has values of 0 on the diagonal, which represent the difference between an object and itself.

D =        Obj.1  Obj.2
    Obj.1   0.0    0.3
    Obj.2   0.3    0.0

In this presentation, we will focus on Q-mode coefficients.
In the R language, all association matrices computed with Q-mode coefficients are presented as D matrices. They contain either dissimilarity indices, or similarity indices transformed into dissimilarities (more details below).
A D function imposes a model onto the data. It filters the information of the data matrix Y, emphasizing a portion of the information and discarding other portions.
The (dis)similarity indices are not interchangeable. Users must know what information is emphasized (i.e. retained) and discarded (i.e. filtered out) by each type of S or D function.

Two properties of D coefficients
Metric property
The attributes of a metric dissimilarity are the following:
1. Minimum 0: if a = b, then D(a, b) = 0;
2. Positiveness: if a ≠ b, then D(a, b) > 0;
3. Symmetry: D(a, b) = D(b, a);
4. Triangle inequality: D(a, b) + D(b, c) ≥ D(a, c). The sum of two sides of a triangle (with vertices a, b, c) drawn in ordinary Euclidean space is equal to or larger than the third side.
These attributes are also described in the PCoA ordination course.
Note: A metric dissimilarity function is also called a distance.

By reference to the attributes of a metric, 3 types of coefficients can be defined:
metric: have all 4 attributes;
semimetric: can violate the triangle inequality;
nonmetric: can violate attributes 1–3. An example of a nonmetric coefficient will be presented later. These coefficients are not used in ecology.

Euclidean property
A dissimilarity coefficient is Euclidean if any resulting D matrix can be fully represented in a Euclidean space without distortion.
A non-euclidean dissimilarity matrix is identified by the criterion that PCoA of that matrix produces some negative eigenvalues.
Taking the square root of most non-euclidean D matrices makes them metric and Euclidean.
This property is described in more detail in the principal coordinate analysis (PCoA) course.
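The eigenvalue criterion can be sketched in base R. This is only an illustration under stated assumptions: the helper name `pcoa_eigenvalues` and the 3 x 3 matrix of dissimilarities are hypothetical, and the code applies Gower centring to the squared dissimilarities before extracting eigenvalues, as PCoA does. The function is.euclid() of {ade4} offers a comparable test.

```r
# Eigenvalues of the Gower-centred matrix analysed by PCoA (hypothetical helper)
pcoa_eigenvalues <- function(D) {
  D <- as.matrix(D)
  n <- nrow(D)
  A <- -0.5 * D^2                       # transformed squared dissimilarities
  J <- diag(n) - matrix(1 / n, n, n)    # centring matrix
  eigen(J %*% A %*% J, symmetric = TRUE)$values
}

# Illustrative dissimilarities: 0.059 + 0.533 < 0.600, so the triangle does not close
D <- matrix(c(0,     0.059, 0.600,
              0.059, 0,     0.533,
              0.600, 0.533, 0), nrow = 3, byrow = TRUE)

tol <- 1e-8                             # numerical tolerance for "negative"
ev.D    <- pcoa_eigenvalues(D)
ev.sqrt <- pcoa_eigenvalues(sqrt(D))

any(ev.D < -tol)      # is there a negative eigenvalue for D?
any(ev.sqrt < -tol)   # is there a negative eigenvalue for sqrt(D)?
```

In this small case, D has a negative eigenvalue whereas sqrt(D) does not, which matches the statement that the square root often makes a non-euclidean matrix Euclidean.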

Converting S to D
Coefficients were originally described as either S or D. Among the possible transformations from S to D, two are used in ecological analysis:
D = 1 - S
D = sqrt(1 - S)
In PCoA and in the distance-based approach to RDA (dbrda), it is useful to make sure that the D matrix is Euclidean:
use D = (1 - S) when (1 - S) is Euclidean;
use D = sqrt(1 - S) when (1 - S) is not Euclidean but sqrt(1 - S) is Euclidean.

=> Examine the metric and Euclidean properties of similarity and dissimilarity coefficients in Tables 7.2 and 7.3 of the Numerical ecology book. See complementary material, file Legendre_&_Legendre_2012_Tables_7.2+7.3.pdf.
In most cases, sqrt(D) or sqrt(1 - S) turns a non-euclidean matrix into a Euclidean one. More about this in the course on principal coordinate analysis (PCoA).

Community composition data: the double-zero problem
[Figure: Whittaker's coenocline, a simulated coenocline along an environmental gradient (abscissa). From Whittaker (1972). Also shown in the section of the CA course on the arch effect.]
This figure will help us understand the principle behind double-zero symmetrical and asymmetrical S and D coefficients.

Consider the distribution of a single species along that environmental variable. For the presence or absence of that species, are the following pairs of observations an indication of similarity or of difference?
Green arrows: 1, 1: similarity
Red arrows: 1, 0: difference
Brown arrows: 0, 0: maybe
Blue arrows: 0, 0: maybe
Conclusion: interpretation of double zeros is uncertain.

Definitions
In double-zero asymmetrical coefficients, the value of D does not change with the addition of double zeros, but it decreases when species with double-X are added to the comparison of two sites, where X is any value of equal abundances other than zero. Examples: Jaccard, Sørensen, Ochiai, Hellinger, chord, Ružička, percentage difference (aka Bray-Curtis).
In double-zero symmetrical coefficients, the value of D does not change when double zeros or double-X (where X > 0) are added to the two sites that are compared. Examples: Euclidean, Manhattan distances.

Coefficients for binary data
Example: 2 objects, 7 binary variables [0,1]

            Var1 Var2 Var3 Var4 Var5 Var6 Var7
Object x1     1    1    0    1    1    0    0
Object x2     0    0    1    1    1    0    0

Contingency table of the two objects:

                      Object x2
                      1        0
Object x1    1        a        b        a + b
             0        c        d        c + d
                    a + c    b + d      p = a + b + c + d

For these data: a = 2, b = 2, c = 1, d = 2.

With a = 2, b = 2, c = 1, d = 2 (p = 7):
Double-zero symmetrical coefficient, simple matching:
S SM = (a + d)/(a + b + c + d) = 4/7 = 0.571
D SM = (b + c)/(a + b + c + d) = 3/7 = 0.429
Double-zero asymmetrical coefficient, Jaccard index:
S J = a/(a + b + c) = 2/5 = 0.400
D J = (b + c)/(a + b + c) = 3/5 = 0.600

Most popular coefficients for binary data: double-zero symmetrical

Coefficient       S                                  D = 1 - S                          S       1 - S   sqrt(1 - S)
Simple matching   S SM = (a + d)/(a + b + c + d)     D SM = (b + c)/(a + b + c + d)     0.571   0.429   0.655

Most popular coefficients for binary data: double-zero asymmetrical

Coefficient   S                                      D = 1 - S                                   S       1 - S   sqrt(1 - S)
Jaccard       S Jac = a/(a + b + c)                  D Jac = (b + c)/(a + b + c)                 0.400   0.600   0.775
Sørensen      S Sor = 2a/(2a + b + c)                D Sor = (b + c)/(2a + b + c)                0.571   0.429   0.655
Ochiai        S Och = a/sqrt((a + b)(a + c))         D Och = 1 - a/sqrt((a + b)(a + c))          0.577   0.423   0.650
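The values in the two tables above can be reproduced with a few lines of base R. This is a minimal sketch for the 7-variable example; the variable names are illustrative (cc stands for the frequency c, to avoid masking R's c() function).

```r
x1 <- c(1, 1, 0, 1, 1, 0, 0)
x2 <- c(0, 0, 1, 1, 1, 0, 0)

a  <- sum(x1 == 1 & x2 == 1)   # double presences
b  <- sum(x1 == 1 & x2 == 0)   # present in x1 only
cc <- sum(x1 == 0 & x2 == 1)   # present in x2 only
d  <- sum(x1 == 0 & x2 == 0)   # double zeros

S <- c(simple.matching = (a + d) / (a + b + cc + d),
       jaccard         = a / (a + b + cc),
       sorensen        = 2 * a / (2 * a + b + cc),
       ochiai          = a / sqrt((a + b) * (a + cc)))

# Similarities and the two transformations into dissimilarities
round(cbind(S = S, D = 1 - S, sqrt.D = sqrt(1 - S)), 3)
```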

Adding double zeros changes d. Hence it changes the simple matching coefficient (symmetrical) but not the Jaccard index (asymmetrical).
Example: 2 objects, 10 binary variables [0,1]

            Var1 Var2 Var3 Var4 Var5 Var6 Var7 Var8 Var9 Var10
Object x1     1    1    0    1    1    0    0    0    0    0
Object x2     0    0    1    1    1    0    0    0    0    0

Contingency table: a = 2, b = 2, c = 1, d = 5.
Simple matching: S SM = (a + d)/(a + b + c + d) = 7/10 = 0.700; D SM = (b + c)/(a + b + c + d) = 3/10 = 0.300
Jaccard index: S J = a/(a + b + c) = 2/5 = 0.400; D J = (b + c)/(a + b + c) = 3/5 = 0.600

This example shows that double-zero asymmetrical indices are insensitive to double-zeros, i.e. the absence of species at two sites, because the value d is not included in their formula. However, double-presences (a) do change the denominator, hence they change the index value. These indices are well suited to the analysis of community composition or other types of frequency data, e.g. gene frequencies. Double-zero symmetrical indices are not adapted to the analysis of these types of data.
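The behaviour described above can be checked numerically. The sketch below uses a hypothetical helper, `binary_S`, which returns the simple matching and Jaccard similarities for two binary vectors; it then appends three double zeros, and separately one double presence, to the 7-variable example.

```r
# Simple matching and Jaccard similarities for two binary vectors (hypothetical helper)
binary_S <- function(x1, x2) {
  a  <- sum(x1 == 1 & x2 == 1)
  b  <- sum(x1 == 1 & x2 == 0)
  cc <- sum(x1 == 0 & x2 == 1)
  d  <- sum(x1 == 0 & x2 == 0)
  c(simple.matching = (a + d) / (a + b + cc + d),
    jaccard         = a / (a + b + cc))
}

x1 <- c(1, 1, 0, 1, 1, 0, 0)
x2 <- c(0, 0, 1, 1, 1, 0, 0)

S.7    <- binary_S(x1, x2)                          # original 7 variables
S.10   <- binary_S(c(x1, 0, 0, 0), c(x2, 0, 0, 0))  # three double zeros added
S.pres <- binary_S(c(x1, 1), c(x2, 1))              # one double presence added

rbind(S.7, S.10, S.pres)
```

Only the simple matching value moves when double zeros are added, whereas adding a double presence changes both coefficients.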

Ten different forms of binary indices (for presence-absence data) are available in dist.binary() of {ade4}. The three most widely used binary indices, i.e. Jaccard, Sørensen and Ochiai, are also available in dist.ldc() of {adespatial}.
Note: When computing binary indices in these two packages, the similarities (S) are converted to distances through the transformation D = sqrt(1 - S). The transformation is automatically applied because it makes the D matrices Euclidean. Euclidean matrices will not produce negative eigenvalues in PCoA. Refer to the PCoA course.

Example of a nonmetric coefficient
S Nonmetric = (a + d)/(b + c), where a, b, c and d are the frequencies of the 2 x 2 contingency table comparing Object x1 and Object x2.
With community composition data, if (a + d) = 0, S = 0. (b + c) can be zero; division by 0 produces S = +Inf. S is then in the range [0, +Inf].
For the upper bound S = +Inf, computing D = (1 - S) produces D = (1 - Inf) = -Inf. For the lower bound S = 0, D = (1 - S) = (1 - 0) = 1. D is then in the range [-Inf, 1].
This coefficient is not used in ecological analysis.

Symmetrical indices for physical descriptors
For quantitative physical descriptors (e.g. physical, chemical, morphometric, topographic, etc.): use double-zero symmetrical D coefficients, where neither double zeros nor double-X (X is any other value) change the computed distance.
The Euclidean distance
The Euclidean distance is the ordinary distance of our physical world. It is computed using Pythagoras' formula and it can be applied to data matrices with any number (p) of variables.
Formula:
$$D_{\text{Euclidean}}(x_1, x_2) = \sqrt{\sum_{j=1}^{p} (y_{1j} - y_{2j})^2}$$

The Euclidean distance computed on raw physical data changes with the choice of the physical units. Examples:

          Hardness (mg/l)   Depth (m)   pH (no unit)
Lake 1         300              10          7.5
Lake 2         200              15          8.3
Lake 3         250              25          8.0

D Euclidean =
          Lake 1    Lake 2    Lake 3
Lake 1       0
Lake 2    100.13       0
Lake 3     52.20     50.99       0

          Hardness (g/l)    Depth (cm)  pH (no unit)
Lake 1        0.30             1000         7.5
Lake 2        0.20             1500         8.3
Lake 3        0.25             2500         8.0

D Euclidean =
          Lake 1    Lake 2    Lake 3
Lake 1       0
Lake 2      500        0
Lake 3     1500      1000        0

Same data. Which dissimilarity matrix is the correct one? Do these computed distances make sense? What are the physical units of the distances in each matrix?

When the physical descriptors have different physical units, one should compute the Euclidean distance on standardized descriptors, which have no physical units.

          Hardness (stand.)   Depth (stand.)   pH (stand.)
Lake 1          1                -0.873           -1.072
Lake 2         -1                -0.218            0.907
Lake 3          0                 1.091            0.165

D Euclidean =
          Lake 1    Lake 2    Lake 3
Lake 1       0
Lake 2     2.889       0
Lake 3     2.527     1.807       0

All descriptors contribute equally (equal variances of 1) to the computed distance. D Euclidean computed on standardized data has no physical units.
Note: Euclidean distances do not have an upper bound. Their values are in the range [0, +Inf].
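The three distance matrices of the lake example can be recomputed with dist() and scale() of base R. This is a small sketch; the object names (lakes, lakes.units, lakes.std) are illustrative.

```r
lakes <- data.frame(hardness = c(300, 200, 250),   # mg/l
                    depth    = c(10, 15, 25),      # m
                    pH       = c(7.5, 8.3, 8.0),
                    row.names = c("Lake1", "Lake2", "Lake3"))

# Raw data: the distances are dominated by hardness
D.raw <- dist(lakes)
round(D.raw, 2)

# Same data in other units (g/l and cm): the distances are now dominated by depth
lakes.units <- transform(lakes, hardness = hardness / 1000, depth = depth * 100)
D.units <- dist(lakes.units)
round(D.units, 0)

# Standardized descriptors (mean 0, variance 1): no physical units left
lakes.std <- scale(lakes)
D.std <- dist(lakes.std)
round(D.std, 3)
```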

Do not compute D Euclidean on raw (untransformed) community composition data, standardized or not. There are, however, transformations that are appropriate for community data before computing the Euclidean distance. They are described later in this presentation.

The Gower coefficient
In 1971, John Gower proposed a dissimilarity coefficient designed for ecologists and taxonomists. The Gower coefficient was designed to handle descriptors with different physical units and of mixed precision levels. Because of its general nature, this coefficient is more complex to describe than a simple coefficient like the Euclidean distance. The general form of the coefficient is the following:

$$D_{\text{Gower}}(x_1, x_2) = 1 - \frac{1}{p} \sum_{j=1}^{p} s_j(x_1, x_2)$$
where s_j(x1, x2) is a partial similarity function computed separately for each descriptor.
For quantitative descriptors, s_j(x1, x2) is computed as follows:
$$s_j(x_1, x_2) = 1 - \frac{|y_{1j} - y_{2j}|}{R_j}$$
For qualitative (factors) or binary descriptors, s_j(x1, x2) is 1 if the two objects have the same state; otherwise 0.

In the partial similarity for quantitative descriptors, s_j(x1, x2) = 1 - |y1j - y2j| / R_j, the quantity R_j is the range of descriptor j in the data matrix under study. Dividing |y1j - y2j| by the range R_j produces a value without physical dimension. The value of the partial similarity s_j(x1, x2) is 1 minus the ranged difference between the values of descriptor j in the two objects under comparison.
Semi-quantitative (ordinal) descriptors (called ordered factors in R) can be handled in various ways. See the three options available in function gowdis() of the {FD} package. In the simplest of these methods, the information is handled as if the descriptor was quantitative, using the equation above.

Another interesting modification is the incorporation of weights w_j in the formula.
$$D_{\text{Gower}}(x_1, x_2) = 1 - \frac{\sum_{j=1}^{p} w_{12j}\, s_j(x_1, x_2)}{\sum_{j=1}^{p} w_{12j}}$$
The weights w_j can be used to handle missing values: when a missing value is present for a descriptor in one of the two objects under comparison, w_j = 0; otherwise, w_j = 1. Missing values are coded NA in the example that follows.

The weights w_j can also be used to give different importances to the descriptors in the calculation of D Gower, but they are rarely used for that purpose.
The values of D Gower are in the range [0,1].

Numerical example

          Var1  Var2  Var3  Var4  Var5 (factor)  Var6  Var7
Site1       2     2    NA     3        1           2     6
Site2       1     4     3     1        3           2     4
[...]
Site49      1     1     1     1        1           1     1
Site50      2     5     5     4        5           3     6

For each variable, sites 49 and 50 have, respectively, the lowest and highest values in the data matrix. Their purpose is to provide values for computation of the ranges of the quantitative variables. Site1 has an absence of information (coded NA) in Var3. Var5 is a factor; hence the values shown are classes (or states) of the factor.

Study this example in detail by yourself!

                           Var1    Var2    Var3    Var4    Var5 (factor)  Var6    Var7
Site1                        2       2      NA       3         1            2       6
Site2                        1       4       3       1         3            2       4
[...]
Site49                       1       1       1       1         1            1       1
Site50                       2       5       5       4         5            3       6
w 12j                        1       1       0       1         1            1       1
R j = (max - min)            1       4      --       3        --            2       5
|y 1j - y 2j|                1       2      --       2        --            0       2
|y 1j - y 2j| / R j        1.000   0.500    --     0.667      --          0.000   0.400
s 12j                      0.000   0.500    --     0.333       0          1.000   0.600
w 12j s 12j                0.000   0.500     0     0.333       0          1.000   0.600

p = 7; Σ w 12j = 6
D Gower (x1, x2) = 1 - (0 + 0.5 + 0 + 0.3333 + 0 + 1 + 0.6)/6 = 0.594
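The worked example can be retraced step by step in base R. This sketch handles only the pair (Site1, Site2) and assumes, as in the slide, that sites 49 and 50 hold the minimum and maximum of every variable; the object names are illustrative. For real data sets, functions such as gowdis() of {FD} compute the full matrix.

```r
x1 <- c(2, 2, NA, 3, 1, 2, 6)     # Site1
x2 <- c(1, 4, 3, 1, 3, 2, 4)      # Site2
x.min <- c(1, 1, 1, 1, 1, 1, 1)   # Site49: lowest values
x.max <- c(2, 5, 5, 4, 5, 3, 6)   # Site50: highest values
is.fac <- c(FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE)  # Var5 is a factor

R <- x.max - x.min                          # ranges (not used for the factor)
w <- as.numeric(!is.na(x1) & !is.na(x2))    # weight 0 where a value is missing

# Partial similarities: state agreement for the factor, ranged difference otherwise
s <- ifelse(is.fac, as.numeric(x1 == x2), 1 - abs(x1 - x2) / R)
s[w == 0] <- 0                              # a missing value contributes nothing

D.gower <- 1 - sum(w * s) / sum(w)
round(D.gower, 3)
```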

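Comparison of the functions available in R to compute the Gower coefficient: vegdist() {vegan}, daisy() {cluster} and gowdis() {FD}. The functions differ in their handling of quantitative descriptors, semi-quantitative descriptors, factors, missing values (NA) and weights w_j. For semi-quantitative descriptors, daisy() treats them as quantitative, whereas gowdis() offers 3 methods.
See the documentation file for details on how ordered factors are handled in the three methods implemented in function gowdis(). Missing values are coded NA in R.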

Quantitative community data: asymmetrical indices
I will now focus on seven double-zero asymmetrical dissimilarity coefficients for quantitative community composition data, and mention some others.

Asymmetrical non-euclidean indices
The first two indices for quantitative community data are the quantitative forms of the Jaccard and Sørensen indices for binary data, described above. These two indices are based on the same decomposition of the species abundance data.
The calculation of dissimilarity incorporates the differences in total abundance of the sites. The data are not scaled by rows, as will be the case in the next series of indices. The differences in productivity of the sites are taken into account in the calculation of the dissimilarities.

Example: 2 objects, 4 species, quantitative abundance data
[Figure: bar chart of the abundances of Spec.1 to Spec.4 at Site1 and Site2. For each species, the abundance common to both sites is labelled A (A1, A2, A4), the excess at site 1 is labelled B (B2, B3) and the excess at site 2 is labelled C (C1).]
A = sum of abundances common to sites 1 and 2 = A1 + A2 + 0 + A4
B = sum of abundances unique to site 1 = 0 + B2 + B3 + 0
C = sum of abundances unique to site 2 = C1 + 0 + 0 + 0
(B + C) represents the unscaled dissimilarity between 2 sites. (A) would represent the unscaled similarity.
Ružička dissimilarity: D Ruz = (B + C)/(A + B + C)
Percentage difference (aka Bray-Curtis): D %diff = (B + C)/(2A + B + C)
Other equivalent equation forms exist for these two dissimilarities.

These two dissimilarities, D Ruz = (B + C)/(A + B + C) and D %diff = (B + C)/(2A + B + C), are double-zero asymmetrical because the sum of double zeros is not included in their formulas, whereas the quantity A (sum of double-X) is included. When comparing two sites, pairs of 0 do not change the values of these indices.
The values of D Ruz and D %diff are in the range [0,1].
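The decomposition into A, B and C is easy to compute with pmin(). The abundances below are hypothetical values chosen to mimic the pattern of the figure (species 1 more abundant at site 2, species 2 more abundant at site 1, species 3 present at site 1 only, species 4 equally abundant); they are not the data behind the original figure.

```r
site1 <- c(2, 5, 3, 4)   # hypothetical abundances of Spec.1 to Spec.4
site2 <- c(4, 3, 0, 4)

common <- pmin(site1, site2)    # per-species abundance shared by both sites
A <- sum(common)                # abundances common to sites 1 and 2
B <- sum(site1 - common)        # abundances unique to site 1
C <- sum(site2 - common)        # abundances unique to site 2

D.ruz    <- (B + C) / (A + B + C)       # Ruzicka dissimilarity
D.pcdiff <- (B + C) / (2 * A + B + C)   # percentage difference

# One of the equivalent forms of the percentage difference
D.pcdiff.alt <- sum(abs(site1 - site2)) / sum(site1 + site2)

c(A = A, B = B, C = C, D.ruz = D.ruz, D.pcdiff = D.pcdiff)
```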

The D %diff is not metric. Example:

            Sp.1  Sp.2  Sp.3  Sp.4  Sp.5
Object x1     2     5     2     5     3
Object x2     3     5     2     4     3
Object x3     9     1     1     1     1

The D %diff matrix is not metric because this triangle does not close:
D(1,2) + D(2,3) < D(1,3) since (0.059 + 0.533 = 0.592) < 0.600
Because it is not metric, the D %diff matrix is not Euclidean. However, sqrt(D) is Euclidean.
The Ružička dissimilarity, D Ruz, is always metric, like the Jaccard index. It is not a Euclidean coefficient; hence a specific D Ruz matrix may or may not be Euclidean, as in the spider data example below.
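The failure of the triangle inequality for these three objects can be verified in base R. In this sketch, the helper functions `pcdiff`, `ruzicka` and `closes` are illustrative names; the Ružička dissimilarity is written in the equivalent form sum|u - v| / sum(max(u, v)).

```r
Y <- rbind(x1 = c(2, 5, 2, 5, 3),
           x2 = c(3, 5, 2, 4, 3),
           x3 = c(9, 1, 1, 1, 1))

# (B + C)/(2A + B + C) and (B + C)/(A + B + C) in equivalent forms
pcdiff  <- function(u, v) sum(abs(u - v)) / sum(u + v)
ruzicka <- function(u, v) sum(abs(u - v)) / sum(pmax(u, v))

# Does D(1,2) + D(2,3) >= D(1,3) hold for dissimilarity function f?
closes <- function(f) f(Y[1, ], Y[2, ]) + f(Y[2, ], Y[3, ]) >= f(Y[1, ], Y[3, ])

round(c(D12 = pcdiff(Y[1, ], Y[2, ]),
        D23 = pcdiff(Y[2, ], Y[3, ]),
        D13 = pcdiff(Y[1, ], Y[3, ])), 3)

closes(pcdiff)    # triangle inequality for the percentage difference
closes(ruzicka)   # triangle inequality for the Ruzicka dissimilarity
```

For these data the triangle closes with the Ružička dissimilarity but not with the percentage difference.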

Full example: compute D Ruz and D %diff for the spider data.

    # Read the file "Spiders_28x12_spe.txt"
    spiders <- read.table(file.choose())
    library(adespatial); library(ade4)
    #
    D.ruz <- dist.ldc(spiders, method="ruzicka")
    is.euclid(D.ruz)
    [1] FALSE   # The D matrix is not Euclidean
    is.euclid(sqrt(D.ruz))
    [1] TRUE    # The sqrt(D) matrix is Euclidean
    #
    D.pcdiff <- dist.ldc(spiders, method="percentdiff")
    is.euclid(D.pcdiff)
    [1] FALSE   # The D matrix is not Euclidean
    is.euclid(sqrt(D.pcdiff))
    [1] TRUE    # The sqrt(D) matrix is Euclidean

Historical note
The percentage difference was described by Odum in 1950.
D %diff = (B + C)/(2A + B + C)
This D index is often called the Bray-Curtis index in computer software. This is a misnomer, repeating a mistake in a paper published around 1970. The 1957 paper by Bray & Curtis aimed at describing a new ordination method, known as the Bray-Curtis ordination, not a new D index. Actually, the index used by Bray and Curtis in their 1957 paper, and clearly described on p. 329, is Whittaker's (1952) index of association.

Asymmetrical Euclidean indices
The following five distance indices are constructed in the same way: the data are first scaled by rows (data transformation); then the Euclidean distance is applied to the scaled data.
The scaling by rows removes the differences in productivity of the sites from the data. These differences are not taken into account in the calculation of the dissimilarities.

The distance between species profiles
Example: community composition data

Y =       Sp1  Sp2  Sp3  Sp4  Sp5   [ y i+ ]
Site1      45   10   15    0   10      80
Site2      25    8   10    0    3      46
Site3       7   15   20   14   12      68

Divide each value by the row sum, transforming the rows into profiles of relative abundances:

y ij / y i+ =
          Sp1    Sp2    Sp3    Sp4    Sp5
Site1    0.563  0.125  0.188  0.000  0.125
Site2    0.543  0.174  0.217  0.000  0.065
Site3    0.103  0.221  0.294  0.206  0.176

This is called the profile transformation.

Divide each value by the row sum, transforming the rows into profiles of relative abundances, then compute the Euclidean distance among the scaled rows.
Formula:
$$D_{\text{profile}}(x_1, x_2) = \sqrt{\sum_{j=1}^{p} \left( \frac{y_{1j}}{y_{1+}} - \frac{y_{2j}}{y_{2+}} \right)^2}$$
Double zeros do not affect the row sums or the final Euclidean distance. So, this distance is double-zero asymmetrical.
D profile is easy to compute but it does not have good properties for the analysis of beta diversity. See later presentation.
Maximum value of D profile = sqrt(2) for sites with no species in common.
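The profile transformation and the distance between species profiles can be sketched in base R for the 3 x 5 example matrix. The last lines add a species absent from all sites to illustrate insensitivity to double zeros; the object names are illustrative.

```r
Y <- matrix(c(45, 10, 15,  0, 10,
              25,  8, 10,  0,  3,
               7, 15, 20, 14, 12), nrow = 3, byrow = TRUE,
            dimnames = list(paste0("Site", 1:3), paste0("Sp", 1:5)))

Y.prof <- Y / rowSums(Y)      # profile transformation: each row now sums to 1
D.profile <- dist(Y.prof)     # Euclidean distance among the profiles
D.profile

# Adding a species that is absent everywhere (double zeros) changes nothing
Y0 <- cbind(Y, Sp6 = 0)
D.profile0 <- dist(Y0 / rowSums(Y0))
all.equal(as.vector(D.profile), as.vector(D.profile0))
```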

The chord distance
Example: community composition data

Y =       Sp1  Sp2  Sp3  Sp4  Sp5   [ Row.norm i ]
Site1      45   10   15    0   10      49.497
Site2      25    8   10    0    3      28.249
Site3       7   15   20   14   12      31.843

The data are divided by the norm (vector length) of each row:

y ij / Row.norm i =
          Sp1    Sp2    Sp3    Sp4    Sp5
Site1    0.909  0.202  0.303  0.000  0.202
Site2    0.885  0.283  0.354  0.000  0.106
Site3    0.220  0.471  0.628  0.440  0.377

This is called the chord transformation.
Note: An R function to compute the norm of a vector: row.norm <- function(vec) sqrt(sum(vec^2))

The data are divided by the norm (vector length) of each row, y ij / Row.norm i.
=> The norms of the transformed row vectors are now 1. Then the Euclidean distance is applied to the scaled data.
Formula:
$$D_{\text{chord}}(x_1, x_2) = \sqrt{\sum_{j=1}^{p} \left( \frac{y_{1j}}{\text{Row.norm}_1} - \frac{y_{2j}}{\text{Row.norm}_2} \right)^2}$$
This D is insensitive to double zeros (double-zero asymmetrical). This distance has excellent properties for the analysis of beta diversity, as will be seen in a later presentation.
Maximum value of D chord = sqrt(2) for sites with no species in common.
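A base-R sketch of the chord transformation and chord distance for the same example matrix follows; the object names are illustrative. The distances can be compared with those returned by dist.ldc(mat, "chord") later in this presentation.

```r
Y <- matrix(c(45, 10, 15,  0, 10,
              25,  8, 10,  0,  3,
               7, 15, 20, 14, 12), nrow = 3, byrow = TRUE,
            dimnames = list(paste0("Site", 1:3), paste0("Sp", 1:5)))

row.norms <- sqrt(rowSums(Y^2))   # norm (vector length) of each row
Y.chord <- Y / row.norms          # chord transformation: rows now have norm 1
D.chord <- dist(Y.chord)          # Euclidean distance on the chord-transformed data

round(row.norms, 3)
D.chord
```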

The chord distance is actually the length of a chord between two points along the circumference of a unit circle. This is a geometric notion. This measure is applied to data that have been normed, so that the norm (or length) of each transformed row vector is 1.
[Figure: unit circle in the plane of Species y1 (abscissa) and Species y2 (ordinate), with points x1 and x2 on the circumference joined by the chord D chord (x1, x2).]
The maximum value of D chord is sqrt(2), for sites that have no species in common.

The Hellinger distance
Example: community composition data

Y =       Sp1  Sp2  Sp3  Sp4  Sp5   [ y i+ ]
Site1      45   10   15    0   10      80
Site2      25    8   10    0    3      46
Site3       7   15   20   14   12      68

The data are first divided by the sum of each row:

y ij / y i+ =
          Sp1    Sp2    Sp3    Sp4    Sp5
Site1    0.563  0.125  0.188  0.000  0.125
Site2    0.543  0.174  0.217  0.000  0.065
Site3    0.103  0.221  0.294  0.206  0.176

and square-rooted, producing square-rooted species profiles:

sqrt(y ij / y i+) =
          Sp1    Sp2    Sp3    Sp4    Sp5
Site1    0.750  0.354  0.433  0.000  0.354
Site2    0.737  0.417  0.466  0.000  0.255
Site3    0.321  0.470  0.542  0.454  0.420

This is called the Hellinger transformation.

Compute the square-rooted species profiles, sqrt(y ij / y i+), then compute the Euclidean distance among the scaled rows.
Formula:
$$D_{\text{Hellinger}}(x_1, x_2) = \sqrt{\sum_{j=1}^{p} \left( \sqrt{\frac{y_{1j}}{y_{1+}}} - \sqrt{\frac{y_{2j}}{y_{2+}}} \right)^2}$$
This D is insensitive to double zeros (double-zero asymmetrical). This distance has excellent properties for the analysis of beta diversity, as will be seen in a later presentation.
Maximum value of D Hellinger = sqrt(2) for sites with no species in common.
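The same two steps can be written directly in base R. This is a minimal sketch for the example matrix, with illustrative object names; the resulting distances can be compared with the dist.ldc(mat, "hellinger") output shown later in this presentation.

```r
Y <- matrix(c(45, 10, 15,  0, 10,
              25,  8, 10,  0,  3,
               7, 15, 20, 14, 12), nrow = 3, byrow = TRUE,
            dimnames = list(paste0("Site", 1:3), paste0("Sp", 1:5)))

Y.hel <- sqrt(Y / rowSums(Y))   # Hellinger transformation: square-rooted profiles
D.hel <- dist(Y.hel)            # Euclidean distance on the transformed data

round(Y.hel, 3)
D.hel
```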

Relationships
The Hellinger distance is actually the chord distance computed on square-rooted species abundance data.
Example with the (3 x 5) matrix:

Y =       Sp1  Sp2  Sp3  Sp4  Sp5   [ y i+ ]
Site1      45   10   15    0   10      80
Site2      25    8   10    0    3      46
Site3       7   15   20   14   12      68

    # Generate the data matrix (3 sites x 5 species)
    mat <- matrix(c(45,25,7,10,8,15,15,10,20,0,0,14,10,3,12),3,5)

    # Compute Hellinger distance on mat using dist.ldc()
    library(adespatial)
    ( D.hel <- dist.ldc(mat, "hellinger") )
              1         2
    2 0.1222138
    3 0.6480087 0.6441498

    # Compute the chord distance on mat
    ( D.chord <- dist.ldc(mat, "chord") )
              1         2
    2 0.1376616
    3 0.9364945 0.9052052

    # Compute the chord distance on sqrt(mat)
    ( D <- dist.ldc(sqrt(mat), "chord") )
              1         2
    2 0.1222138
    3 0.6480087 0.6441498

Hellinger D = chord D after taking the square root of the abundances.

The log-chord distance
Instead of a square root, one can compute the log of the abundances before computing the chord transformation on the log-transformed values:
$$y'_{ij} = \log_e(y_{ij} + 1), \qquad y''_{ij} = \frac{y'_{ij}}{\sqrt{\sum_{j=1}^{p} (y'_{ij})^2}}$$
The combination of these two transformations is called the log-chord transformation. The Euclidean distance can then be computed on the transformed data to obtain the log-chord distance (Legendre & Borcard, submitted).
This D has all the properties of the chord D. It is thus insensitive to double zeros (double-zero asymmetrical).
Note: The Euclidean distance computed on log(y+1) data is not double-zero asymmetrical. So it is inappropriate for community composition data.

Idea linking the chord, Hellinger and log-chord transformations
The exponents λ = {1, 0.5, 0} are members of the Box-Cox series of normalizing transformations: f(y) = (y^λ - 1)/λ
plain chord transformation: λ = 1 => y ij ^1 (no transformation), then chord transformation.
Hellinger transformation: λ = 0.5 => y ij ^0.5 before chord transformation.
log-chord transformation: for λ = 0, the limit of f(y) when λ approaches 0 is log e (y) (Box & Cox, 1964). We use log e (y ij + 1) because there are abundances of 0 in community composition data and log(0) = -Inf.
The log transformation is used to normalize strongly asymmetric frequency distributions before applying the chord transformation.
=> All D based on the chord transformation inherit the properties of the chord D. In particular, they are double-zero asymmetrical.
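The family of transformations can be sketched as a single base-R function. The function names `chord` and `box_cox_chord` are hypothetical, and the sketch follows the simplified form used above (y^λ for λ > 0, log(y + 1) for λ = 0, then the chord transformation) rather than the full Box-Cox formula.

```r
chord <- function(Y) Y / sqrt(rowSums(Y^2))   # scale each row to norm 1

# Power or log transformation followed by the chord transformation (hypothetical helper)
box_cox_chord <- function(Y, lambda) {
  Y.bc <- if (lambda == 0) log1p(Y) else Y^lambda
  chord(Y.bc)
}

Y <- matrix(c(45, 10, 15,  0, 10,
              25,  8, 10,  0,  3,
               7, 15, 20, 14, 12), nrow = 3, byrow = TRUE)

D.chord     <- dist(box_cox_chord(Y, 1))     # plain chord distance
D.hellinger <- dist(box_cox_chord(Y, 0.5))   # same as the Hellinger distance
D.log.chord <- dist(box_cox_chord(Y, 0))     # log-chord distance

# The lambda = 0.5 member matches the Hellinger transformation computed directly
all.equal(as.vector(D.hellinger), as.vector(dist(sqrt(Y / rowSums(Y)))))
```

Because every member ends with the chord step, all transformed rows have norm 1 and the resulting distances are bounded by sqrt(2).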

The chi-square distance
The chi-square distance is an important coefficient. It is the distance preserved in correspondence analysis (CA).
The chi-square distance can be computed on data that are nonnegative (i.e. ≥ 0), frequency-like, and dimensionally homogeneous.
Examples of frequency-like data: community composition or biomass data; monetary units (e.g. $).

Example: community composition data

Y =       Sp1  Sp2  Sp3  Sp4  Sp5   [ y i+ ]
Site1      45   10   15    0   10      80
Site2      25    8   10    0    3      46
Site3       7   15   20   14   12      68
y +j =     77   33   45   14   25   y ++ = 194

1. Transform the abundances into relative abundances by row.

y ij / y i+ =
          Sp1    Sp2    Sp3    Sp4    Sp5
Site1    0.563  0.125  0.188  0.000  0.125
Site2    0.543  0.174  0.217  0.000  0.065
Site3    0.103  0.221  0.294  0.206  0.176

2. Compute a weighted Euclidean distance of the relative abundances, using the inverses of the column sums as weights.
$$D_{\text{chi.sq}}(x_1, x_2) = \sqrt{y_{++}} \sqrt{\sum_{j=1}^{p} \frac{1}{y_{+j}} \left( \frac{y_{1j}}{y_{1+}} - \frac{y_{2j}}{y_{2+}} \right)^2}$$

=> Using the inverses of the column sums as weights actually gives more importance to the rare species, which have small column sums, in the estimation of the dissimilarity, than to the more abundant and ubiquitous species, which have larger column sums.
This is a good idea for ecologists who find the presence of rare species to be more informative than the presence of abundant and ubiquitous species. A rare species found at a few sites may indicate special environmental conditions that are required by that species.
However, if the rare species are less precisely sampled than the more common species, one should avoid the chi-square distance, and therefore also CA.

Chi-square distance formula:
$$D_{\text{chi.sq}}(x_1, x_2) = \sqrt{y_{++}} \sqrt{\sum_{j=1}^{p} \frac{1}{y_{+j}} \left( \frac{y_{1j}}{y_{1+}} - \frac{y_{2j}}{y_{2+}} \right)^2}$$
In the comparison of two sites, pairs of 0 do not change the value of D chi.sq. So it is a double-zero asymmetrical D index.
Maximum value of D chi.sq = sqrt(2 y ++) for sites with no species in common.
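The chi-square distance can be sketched in base R in two equivalent ways: by applying the formula to one pair of sites, or by transforming the data and then calling dist(). The function name `chisq_dist` is hypothetical, and the transformation used inside it, sqrt(y++) * y_ij / (y_i+ * sqrt(y+j)), is written so that the Euclidean distance on the transformed data reproduces the formula above.

```r
Y <- matrix(c(45, 10, 15,  0, 10,
              25,  8, 10,  0,  3,
               7, 15, 20, 14, 12), nrow = 3, byrow = TRUE,
            dimnames = list(paste0("Site", 1:3), paste0("Sp", 1:5)))

# Chi-square distance via a data transformation followed by dist() (hypothetical helper)
chisq_dist <- function(Y) {
  prof <- Y / rowSums(Y)                                        # row profiles
  Y.chi <- sqrt(sum(Y)) * sweep(prof, 2, sqrt(colSums(Y)), "/") # weight by 1/sqrt(y+j)
  dist(Y.chi)
}
D.chisq <- chisq_dist(Y)
D.chisq

# Direct application of the formula to sites 1 and 2
prof <- Y / rowSums(Y)
D12 <- sqrt(sum(Y)) * sqrt(sum((prof[1, ] - prof[2, ])^2 / colSums(Y)))
all.equal(D12, as.vector(D.chisq)[1])
```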

Binary forms of some quantitative coefficients
The quantitative coefficient functions can be computed using presence-absence (1-0) data. In many cases, the result is equivalent to, or a simple transformation of, usual indices for presence-absence data.
D Euclidean = sqrt(p · D simple matching), where p is the number of variables
D Ružička, D Canberra, D Wishart = D Jaccard
D %difference = D Sørensen
Hellinger D, chord D = sqrt(2(1 - S Ochiai))
See Legendre & De Cáceres (2013, Table 1) for other relationships between the quantitative and binary forms.
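Several of these equivalences can be checked numerically on the two binary objects used earlier. The sketch below is base R only; the variable names are illustrative (cc stands for the frequency c).

```r
x1 <- c(1, 1, 0, 1, 1, 0, 0)
x2 <- c(0, 0, 1, 1, 1, 0, 0)
p  <- length(x1)

a  <- sum(x1 == 1 & x2 == 1)
b  <- sum(x1 == 1 & x2 == 0)
cc <- sum(x1 == 0 & x2 == 1)

D.sm  <- (b + cc) / p                       # simple matching dissimilarity
D.jac <- (b + cc) / (a + b + cc)            # Jaccard dissimilarity
D.sor <- (b + cc) / (2 * a + b + cc)        # Sorensen dissimilarity
S.och <- a / sqrt((a + b) * (a + cc))       # Ochiai similarity

# Quantitative coefficients computed on the binary data
D.eucl   <- sqrt(sum((x1 - x2)^2))
D.ruz    <- sum(abs(x1 - x2)) / sum(pmax(x1, x2))
D.pcdiff <- sum(abs(x1 - x2)) / sum(x1 + x2)
D.chord  <- sqrt(sum((x1 / sqrt(sum(x1^2)) - x2 / sqrt(sum(x2^2)))^2))
D.hel    <- sqrt(sum((sqrt(x1 / sum(x1)) - sqrt(x2 / sum(x2)))^2))

c(euclidean = isTRUE(all.equal(D.eucl, sqrt(p * D.sm))),
  ruzicka   = isTRUE(all.equal(D.ruz, D.jac)),
  pcdiff    = isTRUE(all.equal(D.pcdiff, D.sor)),
  chord     = isTRUE(all.equal(D.chord, sqrt(2 * (1 - S.och)))),
  hellinger = isTRUE(all.equal(D.hel, D.chord)))
```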

Other useful asymmetrical coefficients
Other useful double-zero asymmetrical coefficients are available in R packages:
In dist.ldc() of {adespatial}: coefficient of divergence, Canberra metric, Whittaker D, Wishart D, Kulczynski D.
Also in dist.ldc(): four abundance-based coefficients of Chao et al. (2006) for quantitative data. These functions correct the index for species that have not been observed due to sampling errors.
Other R packages, not listed here, also contain indices for community data.
One should check if these indices are Euclidean in the form D or sqrt(D) before using them for PCoA ordination or beta diversity studies.

Computing D through data transformations
Compute asymmetrical D indices for community composition data as follows: data transformation followed by calculation of the Euclidean distance, as shown in the section on asymmetrical indices for quantitative community composition data.

For community composition data, after transformation (e.g. Hellinger):
(a) compute D Euclidean, which produces the same-name distance among sites (e.g. the Hellinger distance);
(b) or use the transformed data as input into linear methods of data analysis. There is no need to compute the D matrix in that case.

[Figure: Pre-transformation of species data, illustration. The species abundance paradox (Orlóci, 1978). Panels: Euclidean, chord, species profiles, Hellinger, chi-square.]

The previous slide shows that the chord, species profile, Hellinger and chi-square transformations, followed by calculation of the Euclidean distance, produce the same-name dissimilarity indices, which are double-zero asymmetrical, metric and Euclidean.

The Euclidean distance paradox
Example data

          Sp.1  Sp.2  Sp.3   Row sums y i+   Row norms
Site 1      0     4     8         12           8.944
Site 2      0     1     1          2           1.414
Site 3      1     0     0          1           1.000

Compute the Euclidean distance among the data rows:

          Site1    Site2    Site3
Site1       0      7.6158   9.0000
Site2     7.6158     0      1.7321
Site3     9.0000   1.7321     0

According to these D results, the two closest sites are 2 and 3, with D = 1.7321, despite the fact that sites 2 and 3 have no species in common.
For ecologists, two sites that have no species in common are very different. Sharing species is more important than differences in abundances.

In this example, the two least different sites in the data matrix are (1 and 2), which share 2 species, yet the Euclidean distance gives them a large distance (7.6158).
The most different pairs are (1, 3) and (2, 3), which have no species in common, yet D Euclidean gives a very small distance (1.7321) to pair (2, 3).
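The paradox can be reproduced in base R, and contrasted with the chord distance computed on the same data. This is a minimal sketch with illustrative object names.

```r
Y <- matrix(c(0, 4, 8,
              0, 1, 1,
              1, 0, 0), nrow = 3, byrow = TRUE,
            dimnames = list(paste0("Site", 1:3), paste0("Sp.", 1:3)))

D.eucl  <- dist(Y)                         # Euclidean distance on raw abundances
D.chord <- dist(Y / sqrt(rowSums(Y^2)))    # chord distance

round(D.eucl, 4)
round(D.chord, 4)

# dist objects store the pairs in the order (1,2), (1,3), (2,3)
pairs <- c("1-2", "1-3", "2-3")
pairs[which.min(D.eucl)]    # closest pair according to the Euclidean distance
pairs[which.min(D.chord)]   # closest pair according to the chord distance
```

With the chord distance, the pair of sites sharing species (1 and 2) becomes the closest pair, and the two pairs without shared species reach the maximum value sqrt(2).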

[Figure: Pre-transformation of species data, illustration. The species abundance paradox (Orlóci, 1978). Panels: Euclidean, chord, species profiles, Hellinger, chi-square. The least different and most different pairs in the data matrix are shown.]

The previous slide shows that the Euclidean distance can give a small D value to a pair of sites that have no species in common, which wrongly suggests that they are highly similar.
In contrast, the chord, species profile, Hellinger and chi-square distances produce smaller D values for pairs of sites that contain the same species than for pairs of sites where different species assemblages are found.

Note that the transformations and distances are not equivalent and interchangeable. They produce different PCoA ordinations of the sites.

Data transformations in R
The profile, chord, log-chord, Hellinger and chi-square transformations can be computed using vegan's decostand() function.
profile transformation: Y.tr = decostand(Y, "total")
chord transformation: Y.tr = decostand(Y, "norm")
log-chord transformation: Y.tr = decostand(log1p(Y), "norm")
Hellinger transformation: Y.tr = decostand(Y, "hellinger")
chi-square transformation: Y.tr = decostand(Y, "chi.sq")
The transformed data can be used as input into linear methods of data analysis: PCA, RDA, k-means partitioning, MANOVA, etc.
After transforming the data, compute the Euclidean distance using dist() of {stats} to obtain the same-name distances.
Direct calculation of the chord, species profile, Hellinger and chi-square distances is available in function dist.ldc() of {adespatial}.

(b) The transformed data matrices can be used as input into linear methods of data analysis that preserve the Euclidean distance, such as PCA (tb-PCA), RDA (tb-RDA) and k-means partitioning.
In these analyses, the chord, log-chord, species profile, Hellinger and chi-square distances, which are double-zero asymmetrical, will be preserved instead of the symmetrical Euclidean distance.
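The idea of a transformation-based PCA (tb-PCA) can be illustrated in base R. This is a minimal sketch on the 3 x 5 example matrix, using prcomp() of {stats} on Hellinger-transformed data; it checks that the distances among sites in the full space of principal components equal the Hellinger distances.

```r
Y <- matrix(c(45, 10, 15,  0, 10,
              25,  8, 10,  0,  3,
               7, 15, 20, 14, 12), nrow = 3, byrow = TRUE,
            dimnames = list(paste0("Site", 1:3), paste0("Sp", 1:5)))

Y.hel <- sqrt(Y / rowSums(Y))   # Hellinger transformation
pca   <- prcomp(Y.hel)          # PCA of the transformed data (tb-PCA)

D.hel    <- dist(Y.hel)         # Hellinger distance among sites
D.scores <- dist(pca$x)         # Euclidean distance among the PCA site scores

all.equal(as.vector(D.scores), as.vector(D.hel))
```

The same logic applies to the other transformations: PCA preserves the Euclidean distance among the rows of its input, so the input transformation determines which distance is preserved.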

References cited
Box, G. E. P. & D. R. Cox. 1964. An analysis of transformations. J. Roy. Statist. Soc. Ser. B 26: 211–243.
Borcard, D., F. Gillet & P. Legendre. 2018. Numerical ecology with R, 2nd edition. Use R! series, Springer Science, New York.
Bray, R. J. & J. T. Curtis. 1957. An ordination of the upland forest communities of southern Wisconsin. Ecological Monographs 27: 325–349.
Chao, A., R. L. Chazdon, R. K. Colwell & T. J. Shen. 2006. Abundance-based similarity indices and their estimation when there are unseen species in samples. Biometrics 62: 361–371.
Gower, J. C. 1971. A general coefficient of similarity and some of its properties. Biometrics 27: 857–871.
Legendre, P. & D. Borcard. Box-Cox-chord transformations for community composition data prior to beta diversity analysis. Ecography (submitted).
Legendre, P. & M. De Cáceres. 2013. Beta diversity as the variance of community data: dissimilarity coefficients and partitioning. Ecology Letters 16: 951–963.
Legendre, P. & L. Legendre. 2012. Numerical ecology, 3rd English edition. Elsevier Science BV, Amsterdam. xvi + 990 pp. ISBN-13: 978-0444538680.
Odum, E. P. 1950. Bird populations of the Highlands (North Carolina) Plateau in relation to plant succession and avian invasion. Ecology 31: 587–605.
Whittaker, R. H. 1952. A study of summer foliage insect communities in the Great Smoky Mountains. Ecological Monographs 22: 1–44.
Whittaker, R. H. 1972. Evolution and measurement of species diversity. Taxon 21: 213–251.

End of the presentation