Dissimilarity and transformations. Pierre Legendre Département de sciences biologiques Université de Montréal

http://www.numericalecology.com/ (Pierre Legendre, 2017)

Definitions

An association coefficient is a function of two data vectors that quantifies the strength of their relationship (or association).

Association coefficients can measure the relationship between data rows (objects, Q mode) or between data columns (descriptors or variables, R mode).

[Figure: data matrix Y (n objects × p descriptors), leading to an association matrix A (n × n) for Q-mode analysis or A (p × p) for R-mode analysis. Legendre & Legendre 2012, Fig. 2.1.]

Q-mode coefficients (between objects, e.g. sites) are called similarities (S) and dissimilarities (D) [1]. R-mode coefficients (between descriptors or variables) are called dependence coefficients (correlation coefficients, contingency, ...).

In Q mode, one obtains an S or D matrix by computing similarities (S) or dissimilarities (D) between all pairs of sites. In R mode, one obtains an R matrix by computing correlation coefficients between all pairs of columns.

[1] Dissimilarities are often called distance coefficients. Technically, distances are metric dissimilarities (the metric properties are described below).

A similarity coefficient S produces values in the [0, 1] range. An S matrix is symmetric and has values of 1 on the diagonal, which represent the similarity between an object and itself.

    S =        Obj.1  Obj.2
        Obj.1    1.0    0.7
        Obj.2    0.7    1.0

A dissimilarity coefficient D derived from S also has values in the [0, 1] range. A D matrix is symmetric and has values of 0 on the diagonal, which represent the difference between an object and itself.

    D =        Obj.1  Obj.2
        Obj.1    0.0    0.3
        Obj.2    0.3    0.0

In this presentation, we will focus on Q-mode coefficients.

In the R language, all association matrices computed with Q-mode coefficients are presented as D matrices. They contain either dissimilarity indices, or similarity indices transformed into dissimilarities (more details below).

A D function imposes a model onto the data. It filters the information of the data matrix Y, emphasizing a portion of the information and discarding other portions.

The (dis)similarity indices are not interchangeable. Users must know what information is emphasized (i.e. retained) and discarded (i.e. filtered out) by each type of S or D function.

Two properties of D coefficients

Metric property

The attributes of a metric dissimilarity are the following [1]:

1. Minimum 0: if a = b, then D(a, b) = 0;
2. Positiveness: if a ≠ b, then D(a, b) > 0;
3. Symmetry: D(a, b) = D(b, a);
4. Triangle inequality: D(a, b) + D(b, c) ≥ D(a, c). The sum of two sides of a triangle drawn in ordinary Euclidean space is equal to or larger than the third side.

Note: A metric dissimilarity function is also called a distance.

[1] Attributes also described in the PCoA ordination course.

By reference to the attributes of a metric, three types of coefficients can be defined:

- metric: have all 4 attributes;
- semimetric: can violate the triangle inequality;
- nonmetric: can violate attributes 1 to 3.

An example of a nonmetric coefficient will be presented later. Nonmetric coefficients are not used in ecology.

Euclidean property [1]

A dissimilarity coefficient is Euclidean if any resulting D matrix can be fully represented in a Euclidean space without distortion.

A non-Euclidean dissimilarity matrix is identified by the criterion that PCoA of that matrix produces some negative eigenvalues.

Taking the square root of most non-Euclidean D matrices makes them metric and Euclidean.

[1] Property described in more detail in the principal coordinate analysis (PCoA) course.
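The eigenvalue criterion can be sketched in base R. The helper below applies the Gower centring used in PCoA to a D matrix and checks whether any eigenvalue is negative. The function name `is_euclidean`, the tolerance and the toy 3 × 3 matrix are assumptions made for illustration; they are not part of the course material.

```r
# Hypothetical helper: TRUE if D can be represented in a Euclidean space,
# i.e. if the Gower-centred matrix has no negative eigenvalue.
is_euclidean <- function(D, tol = 1e-8) {
  D <- as.matrix(D)
  n <- nrow(D)
  A <- -0.5 * D^2                      # transformed squared dissimilarities
  J <- diag(n) - matrix(1 / n, n, n)   # centring matrix
  G <- J %*% A %*% J                   # Gower-centred matrix
  ev <- eigen((G + t(G)) / 2, symmetric = TRUE, only.values = TRUE)$values
  min(ev) > -tol * max(abs(ev))        # tolerate rounding error around 0
}

# Toy dissimilarities (assumed values) that violate the triangle
# inequality: D(1,2) + D(2,3) = 2 < D(1,3) = 3
D <- matrix(c(0, 1, 3,
              1, 0, 1,
              3, 1, 0), nrow = 3, byrow = TRUE)

is_euclidean(D)         # FALSE: a negative eigenvalue appears
is_euclidean(sqrt(D))   # TRUE: the square-rooted matrix is Euclidean
```

In practice, is.euclid() of {ade4} performs this kind of check, as used later in this presentation.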

Converting S to D

Coefficients were originally described as either S or D. Among the possible transformations from S to D, two are used in ecological analysis:

    D = 1 - S
    D = sqrt(1 - S)

In PCoA and in the distance-based approach to RDA (dbrda), it is useful to make sure that the D matrix is Euclidean:

- use D = (1 - S) when (1 - S) is Euclidean;
- use D = sqrt(1 - S) when (1 - S) is not Euclidean but sqrt(1 - S) is Euclidean.

=> Examine the metric and Euclidean properties of similarity and dissimilarity coefficients in Tables 7.2 and 7.3 of the Numerical ecology book. See complementary material, file Legendre_&_Legendre_2012_Tables_7.2+7.3.pdf.

In most cases, sqrt(D) or sqrt(1 - S) turns a non-Euclidean matrix into a Euclidean one. More about this in the course on principal coordinate analysis (PCoA).

Community composition data: the double-zero problem

[Figure: Whittaker's coenocline, a simulated coenocline along an environmental gradient (abscissa). From Whittaker (1972). Shown in the section of the CA course on the arch effect.]

This figure will help us understand the principle behind double-zero symmetrical and asymmetrical S and D coefficients.

Consider the distribution of a single species along that environmental variable. For the presence or absence of that species, are the following pairs of observations an indication of similarity or of difference?

    Arrows   Pair of observations   Interpretation
    Green    1, 1                   Similarity
    Red      1, 0                   Difference
    Brown    0, 0                   Maybe
    Blue     0, 0                   Maybe

Conclusion: the interpretation of double zeros is uncertain.

Definitions

In double-zero asymmetrical coefficients, the value of D does not change with the addition of double zeros, but it decreases when species with double-X are added to the comparison of two sites, where X is any value of equal abundances other than zero.
Examples: Jaccard, Sørensen, Ochiai, Hellinger, chord, Ružička, percentage difference (aka Bray-Curtis).

In double-zero symmetrical coefficients, the value of D does not change when double zeros or double-X (where X > 0) are added to the two sites that are compared.
Examples: Euclidean, Manhattan distances.
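These two definitions can be illustrated with a small base-R sketch. It compares one asymmetrical coefficient (the percentage difference, written here as the sum of absolute differences divided by the sum of all abundances) with one symmetrical coefficient (the Euclidean distance). The abundance values and the helper names `pct_diff` and `euclid` are hypothetical, chosen only for illustration.

```r
# Percentage difference: sum of absolute differences / sum of all abundances
pct_diff <- function(x1, x2) sum(abs(x1 - x2)) / sum(x1 + x2)
# Euclidean distance between the same two vectors
euclid <- function(x1, x2) sqrt(sum((x1 - x2)^2))

# Hypothetical abundances of 3 species at two sites
site1 <- c(3, 0, 2)
site2 <- c(1, 4, 2)

# Add one species absent from both sites (double zero) ...
s1_00 <- c(site1, 0); s2_00 <- c(site2, 0)
# ... or one species with the same abundance X = 5 at both sites (double-X)
s1_xx <- c(site1, 5); s2_xx <- c(site2, 5)

pct_diff(site1, site2)   # 0.5
pct_diff(s1_00, s2_00)   # 0.5: unchanged by the double zero
pct_diff(s1_xx, s2_xx)   # 6/22: decreased by the double-X

euclid(site1, site2)     # sqrt(20)
euclid(s1_00, s2_00)     # unchanged by the double zero
euclid(s1_xx, s2_xx)     # unchanged by the double-X
```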

Coefficients for binary data

Example: 2 objects, 7 binary variables [0,1]

                 Var1  Var2  Var3  Var4  Var5  Var6  Var7
    Object x1       1     1     0     1     1     0     0
    Object x2       0     0     1     1     1     0     0

The comparison of the two objects is summarized by four frequencies:

                         Object x2
                         1        0
    Object x1    1       a        b        a + b
                 0       c        d        c + d
                       a + c    b + d      p = a + b + c + d

For this example:

                         Object x2
                         1        0
    Object x1    1       2        2
                 0       1        2

Double-zero symmetrical coefficient: simple matching

$$S_{SM} = \frac{a + d}{a + b + c + d} = \frac{4}{7} = 0.571 \qquad D_{SM} = \frac{b + c}{a + b + c + d} = \frac{3}{7} = 0.429$$

Double-zero asymmetrical coefficient: Jaccard index

$$S_J = \frac{a}{a + b + c} = \frac{2}{5} = 0.400 \qquad D_J = \frac{b + c}{a + b + c} = \frac{3}{5} = 0.600$$

Most popular coefficients for binary data

    Coefficient        S                     D = 1 - S                  S      1 - S   sqrt(1 - S)
    Double-zero symmetrical
    Simple matching    (a+d)/(a+b+c+d)       (b+c)/(a+b+c+d)            0.571  0.429   0.655
    Double-zero asymmetrical
    Jaccard            a/(a+b+c)             (b+c)/(a+b+c)              0.400  0.600   0.775
    Sørensen           2a/(2a+b+c)           (b+c)/(2a+b+c)             0.571  0.429   0.655
    Ochiai             a/sqrt((a+b)(a+c))    1 - a/sqrt((a+b)(a+c))     0.577  0.423   0.650

Adding double zeros changes d. Hence it changes the simple matching coefficient (symmetrical) but not the Jaccard index (asymmetrical).

Example: 2 objects, 10 binary variables [0,1]

                 Var1  Var2  Var3  Var4  Var5  Var6  Var7  Var8  Var9  Var10
    Object x1       1     1     0     1     1     0     0     0     0      0
    Object x2       0     0     1     1     1     0     0     0     0      0

                         Object x2
                         1        0
    Object x1    1       2        2
                 0       1        5

$$S_{SM} = \frac{a + d}{a + b + c + d} = \frac{7}{10} = 0.700 \qquad D_{SM} = \frac{b + c}{a + b + c + d} = \frac{3}{10} = 0.300$$

$$S_J = \frac{a}{a + b + c} = \frac{2}{5} = 0.400 \qquad D_J = \frac{b + c}{a + b + c} = \frac{3}{5} = 0.600$$

This example shows that double-zero asymmetrical indices are insensitive to double-zeros, i.e. the absence of species at two sites, because the value d is not included in their formula. However, double-presences (a) do change the denominator, hence they change the index value. These indices are well suited to the analysis of community composition or other types of frequency data, e.g. gene frequencies. Double-zero symmetrical indices are not adapted to the analysis of these types of data.
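The two binary examples can be reproduced with a few lines of base R. This is a minimal sketch: the helper name `binary_S` is an assumption made for illustration, and the packaged functions mentioned in this presentation remain the reference implementations.

```r
# Similarities between two binary vectors, from the frequencies a, b, c, d
binary_S <- function(x1, x2) {
  a  <- sum(x1 == 1 & x2 == 1)   # double presences
  b  <- sum(x1 == 1 & x2 == 0)   # present in x1 only
  cc <- sum(x1 == 0 & x2 == 1)   # present in x2 only (cc avoids masking c())
  d  <- sum(x1 == 0 & x2 == 0)   # double zeros
  c(simple.matching = (a + d) / (a + b + cc + d),
    jaccard         = a / (a + b + cc),
    sorensen        = 2 * a / (2 * a + b + cc),
    ochiai          = a / sqrt((a + b) * (a + cc)))
}

# Example with 7 binary variables
x1 <- c(1, 1, 0, 1, 1, 0, 0)
x2 <- c(0, 0, 1, 1, 1, 0, 0)
S7 <- binary_S(x1, x2)
round(S7, 3)             # 0.571 0.400 0.571 0.577
round(1 - S7, 3)         # 0.429 0.600 0.429 0.423
round(sqrt(1 - S7), 3)   # 0.655 0.775 0.655 0.650

# Example with 10 variables: three double zeros added
S10 <- binary_S(c(x1, 0, 0, 0), c(x2, 0, 0, 0))
round(S10, 3)            # simple matching becomes 0.700; the others are unchanged
```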

Ten different forms of binary indices (for presence-absence data) are available in dist.binary() of {ade4}. The three most widely used binary indices, i.e. Jaccard, Sørensen and Ochiai, are also available in dist.ldc() of {adespatial}.

Note: When binary indices are computed in these two packages, the similarities (S) are converted to distances through the transformation D = sqrt(1 - S). The transformation is applied automatically because it makes the D matrices Euclidean. Euclidean matrices do not produce negative eigenvalues in PCoA. Refer to the PCoA course.

Example of a nonmetric coefficient

$$S_{\text{Nonmetric}} = \frac{a + d}{b + c}$$

                         Object x2
                         1        0
    Object x1    1       a        b
                 0       c        d

With community composition data, if (a + d) = 0, S = 0. (b + c) can be zero; division by 0 then produces S = +Inf. S is thus in the range [0, +Inf].

For the upper bound S = +Inf, computing D = (1 - S) produces D = (1 - Inf) = -Inf.
For the lower bound S = 0, D = (1 - S) = (1 - 0) = 1.
D is thus in the range [-Inf, 1].

This coefficient is not used in ecological analysis.

Symmetrical indices for physical descriptors

For quantitative physical descriptors (physical, chemical, morphometric, topographic, etc.), use double-zero symmetrical D coefficients, for which neither double zeros nor double-X (where X is any other value) change the computed distance.

The Euclidean distance

The Euclidean distance is the ordinary distance of our physical world. It is computed using Pythagoras' formula and it can be applied to data matrices with any number (p) of variables.

Formula:

$$D_{\text{Euclidean}}(x_1, x_2) = \sqrt{\sum_{j=1}^{p} (y_{1j} - y_{2j})^2}$$

The Euclidean distance computed on raw physical data changes with the choice of the physical units. Examples:

              Hardness (mg/L)   Depth (m)   pH (no unit)
    Lake 1         300             10          7.5
    Lake 2         200             15          8.3
    Lake 3         250             25          8.0

    D_Euclidean =       Lake 1   Lake 2   Lake 3
                Lake 1       0
                Lake 2  100.13        0
                Lake 3   52.20    50.99        0

              Hardness (g/L)    Depth (cm)  pH (no unit)
    Lake 1        0.30            1000         7.5
    Lake 2        0.20            1500         8.3
    Lake 3        0.25            2500         8.0

    D_Euclidean =       Lake 1   Lake 2   Lake 3
                Lake 1       0
                Lake 2     500        0
                Lake 3    1500     1000        0

Same data. Which dissimilarity matrix is the correct one? Do these computed distances make sense? What are the physical units of the distances in each matrix?

When the physical descriptors have different physical units, one should compute the Euclidean distance on standardized descriptors, which have no physical units.

              Hardness (stand.)   Depth (stand.)   pH (stand.)
    Lake 1           1               -0.873          -1.072
    Lake 2          -1               -0.218           0.907
    Lake 3           0                1.091           0.165

    D_Euclidean =       Lake 1   Lake 2   Lake 3
                Lake 1       0
                Lake 2   2.889        0
                Lake 3   2.527    1.807        0

All descriptors contribute equally (equal variances of 1) to the computed distance. D_Euclidean computed on standardized data has no physical units.

Note: Euclidean distances do not have an upper bound. Their values are in the range [0, +Inf].
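The lake example can be reproduced in base R with dist() and scale(). This is a minimal sketch using the three lakes of the example; the object names are arbitrary.

```r
lakes <- data.frame(
  hardness = c(300, 200, 250),   # mg/L
  depth    = c(10, 15, 25),      # m
  pH       = c(7.5, 8.3, 8.0),   # no unit
  row.names = c("Lake1", "Lake2", "Lake3")
)

# Euclidean distance on raw data: dominated by hardness
D.raw <- dist(lakes)
round(D.raw, 2)      # 100.13, 52.20, 50.99

# Same lakes, hardness in g/L and depth in cm: dominated by depth
lakes2 <- transform(lakes, hardness = hardness / 1000, depth = depth * 100)
D.raw2 <- dist(lakes2)
round(D.raw2, 2)     # 500, 1500, 1000

# Standardize each descriptor (mean 0, variance 1), then compute D
lakes.std <- scale(lakes)
D.std <- dist(lakes.std)
round(D.std, 3)      # 2.889, 2.527, 1.807
```

After standardization, the result no longer depends on the units chosen for hardness and depth: dist(scale(lakes2)) gives the same distances.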

Do not compute D Euclidean on raw (untransformed) community composition data, standardized or not. There are, however, transformations that are appropriate for community data before computing the Euclidean distance. They are described later in this presentation.

The Gower coefficient

In 1971, John Gower proposed a dissimilarity coefficient designed for ecologists and taxonomists. The Gower coefficient was designed to handle descriptors with different physical units and of mixed precision levels. Because of its general nature, this coefficient is more complex to describe than a simple coefficient like the Euclidean distance.

The general form of the coefficient is the following:

$$D_{\text{Gower}}(x_1, x_2) = 1 - \frac{1}{p} \sum_{j=1}^{p} s_j(x_1, x_2)$$

where $s_j(x_1, x_2)$ is a partial similarity function computed separately for each descriptor.

For quantitative descriptors, $s_j(x_1, x_2)$ is computed as follows:

$$s_j(x_1, x_2) = 1 - \frac{|y_{1j} - y_{2j}|}{R_j}$$

$R_j$ is the range of descriptor j in the data matrix under study. Dividing $|y_{1j} - y_{2j}|$ by the range $R_j$ produces a value without physical dimension. The value of the partial similarity $s_j(x_1, x_2)$ is 1 minus the ranged difference between the values of descriptor j in the two objects under comparison.

For qualitative (factors) or binary descriptors, $s_j(x_1, x_2)$ is 1 if the two objects have the same state; otherwise 0.

Semi-quantitative (ordinal) descriptors (called ordered factors in R) can be handled in various ways. Three options are available in function gowdis() of the {FD} package. In the simplest of these methods, the information is handled as if the descriptor was quantitative, using the equation above.

Another interesting modification is the incorporation of weights $w_j$ in the formula:

$$D_{\text{Gower}}(x_1, x_2) = 1 - \frac{\sum_{j=1}^{p} w_{12j}\, s_j(x_1, x_2)}{\sum_{j=1}^{p} w_{12j}}$$

The weights $w_j$ can be used to handle missing values: when a missing value is present for a descriptor in one of the two objects under comparison, $w_j = 0$; otherwise, $w_j = 1$. Missing values are coded NA in the example that follows.

The weights $w_j$ can also be used to give different importances to the descriptors in the calculation of $D_{\text{Gower}}$, but they are rarely used for that purpose.

The values of $D_{\text{Gower}}$ are in the range [0, 1].

Numerical example

              Var1  Var2  Var3  Var4  Var5      Var6  Var7
                                      (factor)
    Site1        2     2    NA     3     1         2     6
    Site2        1     4     3     1     3         2     4
    [...]
    Site49       1     1     1     1     1         1     1
    Site50       2     5     5     4     5         3     6

For each variable, sites 49 and 50 have, respectively, the lowest and highest values in the data matrix. Their purpose is to provide values for computation of the ranges of the quantitative variables.

Site1 has an absence of information (coded NA) in Var3.

Var5 is a factor; hence the values shown are classes (or states) of the factor.

Study this example in detail by yourself! Computation of the Gower dissimilarity between Site1 and Site2:

                                 Var1   Var2   Var3   Var4   Var5   Var6   Var7
    w_12j                           1      1      0      1      1      1      1
    R_j = (max - min)               1      4             3             2      5
    |y_1j - y_2j|                   1      2             2             0      2
    |y_1j - y_2j| / R_j         1.000  0.500         0.667         0.000  0.400
    s_12j = 1 - |y_1j - y_2j|/R_j
                                0.000  0.500         0.333      0  1.000  0.600
    w_12j s_12j                 0.000  0.500      0  0.333      0  1.000  0.600

p = 7; Σ w_12j = 6

D_Gower(x1, x2) = 1 - (0 + 0.5 + 0 + 0.3333 + 0 + 1 + 0.6)/6 = 0.594
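The hand calculation for Site1 and Site2 can be checked step by step in base R. This is a sketch of that single pairwise computation, assuming the ranges given by Site49 and Site50; it is not a general Gower function, for which vegdist(), daisy() or gowdis() should be used.

```r
# Quantitative descriptors (Var5, the factor, is handled separately)
y1  <- c(Var1 = 2, Var2 = 2, Var3 = NA, Var4 = 3, Var6 = 2, Var7 = 6)  # Site1
y2  <- c(Var1 = 1, Var2 = 4, Var3 = 3,  Var4 = 1, Var6 = 2, Var7 = 4)  # Site2
rng <- c(Var1 = 1, Var2 = 4, Var3 = 4,  Var4 = 3, Var6 = 2, Var7 = 5)  # max - min

# Partial similarities for the quantitative descriptors
s.quant <- 1 - abs(y1 - y2) / rng

# Factor Var5: states 1 (Site1) and 3 (Site2) differ, so s = 0
s.factor <- c(Var5 = as.numeric(1 == 3))

s <- c(s.quant, s.factor)
w <- as.numeric(!is.na(s))     # w = 0 where a value is missing (Var3)

D.gower <- 1 - sum(w * s, na.rm = TRUE) / sum(w)
round(D.gower, 3)              # 0.594
```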

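Comparison of the functions available in R to compute the Gower coefficient

The functions compared are vegdist() {vegan}, daisy() {cluster} and gowdis() {FD}. The comparison covers the handling of quantitative descriptors, semi-quantitative descriptors, factors, missing values (NA) and weights w_j. For semi-quantitative descriptors, daisy() handles them as quantitative, whereas gowdis() offers three methods.

See the documentation file for details on how ordered factors are handled in the three methods implemented in function gowdis(). Missing values are coded NA in R.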

Quantitative community data: asymmetrical indices

I will now focus on seven double-zero asymmetrical dissimilarity coefficients for quantitative community composition data, and mention some others.

Asymmetrical non-Euclidean indices

The first two indices for quantitative community data are the quantitative forms of the Jaccard and Sørensen indices for binary data, described above.

These two indices are based on the same decomposition of the species abundance data. The calculation of dissimilarity incorporates the differences in total abundance of the sites. The data are not scaled by rows, as will be the case in the next series of indices. The differences in productivity of the sites are taken into account in the calculation of the dissimilarities.

Example: 2 objects, 4 species, quantitative abundance data

[Figure: bar plots of the abundances of Spec.1 to Spec.4 at Site1 and Site2. Each bar is divided into the abundance common to the two sites (A_1, A_2, A_4) and the abundance unique to one site (B_2 and B_3 at Site1, C_1 at Site2).]

    A = sum of abundances common to sites 1 and 2 = A_1 + A_2 + 0 + A_4
    B = sum of abundances unique to site 1 = 0 + B_2 + B_3 + 0
    C = sum of abundances unique to site 2 = C_1 + 0 + 0 + 0

(B + C) represents the unscaled dissimilarity between the two sites. (A) would represent the unscaled similarity.

    Ružička dissimilarity: D_Ruz = (B + C)/(A + B + C)
    Percentage difference (aka Bray-Curtis): D_%diff = (B + C)/(2A + B + C)

Other equivalent equation forms exist for these two dissimilarities.

These two dissimilarities are double-zero asymmetrical because the sum of double zeros is not included in their formulas, whereas the quantity A (sum of double-X) is included. When comparing two sites, pairs of 0 do not change the values of these indices.

The values of D_Ruz and D_%diff are in the range [0, 1].

D_%diff is not metric. Example:

                 Sp.1  Sp.2  Sp.3  Sp.4  Sp.5
    Object x1       2     5     2     5     3
    Object x2       3     5     2     4     3
    Object x3       9     1     1     1     1

The D_%diff matrix is not metric because this triangle does not close: D(1,2) + D(2,3) < D(1,3), since (0.059 + 0.533 = 0.592) < 0.600.

Because it is not metric, the D_%diff matrix is not Euclidean. However, sqrt(D) is Euclidean.

The Ružička dissimilarity, D_Ruz, is always metric, like the Jaccard index. It is not a Euclidean coefficient; hence a specific D_Ruz matrix may or may not be Euclidean, as in the following example.

Full example: compute D_Ruz and D_%diff for the spider data.

    # Read the file "Spiders_28x12_spe.txt"
    spiders <- read.table(file.choose())
    library(adespatial); library(ade4)
    #
    D.ruz <- dist.ldc(spiders, method="ruzicka")
    is.euclid(D.ruz)
    # [1] FALSE   # The D matrix is not Euclidean
    is.euclid(sqrt(D.ruz))
    # [1] TRUE    # The sqrt(D) matrix is Euclidean
    #
    D.pcdiff <- dist.ldc(spiders, method="percentdiff")
    is.euclid(D.pcdiff)
    # [1] FALSE   # The D matrix is not Euclidean
    is.euclid(sqrt(D.pcdiff))
    # [1] TRUE    # The sqrt(D) matrix is Euclidean
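The three-object example used earlier to show that D_%diff is not metric can be verified by hand in base R, without the spider file. In this sketch, (B + C) is computed as the sum of absolute differences, (2A + B + C) as the sum of all abundances, and (A + B + C) as the sum of the pairwise maxima; the helper names are assumptions made for illustration.

```r
Y <- rbind(x1 = c(2, 5, 2, 5, 3),
           x2 = c(3, 5, 2, 4, 3),
           x3 = c(9, 1, 1, 1, 1))

pct_diff <- function(u, v) sum(abs(u - v)) / sum(u + v)        # (B+C)/(2A+B+C)
ruzicka  <- function(u, v) sum(abs(u - v)) / sum(pmax(u, v))   # (B+C)/(A+B+C)

# The three pairwise dissimilarities among the three objects
pairwise <- function(Y, f) {
  c(d12 = f(Y[1, ], Y[2, ]),
    d23 = f(Y[2, ], Y[3, ]),
    d13 = f(Y[1, ], Y[3, ]))
}

d.pc  <- pairwise(Y, pct_diff)
d.ruz <- pairwise(Y, ruzicka)

round(d.pc, 3)                                      # 0.059 0.533 0.600
d.pc[["d12"]] + d.pc[["d23"]] < d.pc[["d13"]]       # TRUE: the triangle does not close

round(d.ruz, 3)                                     # 0.111 0.696 0.750
d.ruz[["d12"]] + d.ruz[["d23"]] >= d.ruz[["d13"]]   # TRUE: the triangle closes
```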

Historical note

The percentage difference was described by Odum in 1950:

    D_%diff = (B + C)/(2A + B + C)

This D index is often called the Bray-Curtis index in computer software. This is a misnomer, repeating a mistake in a paper published around 1970.

The 1957 paper by Bray and Curtis aimed at describing a new ordination method, known as the Bray-Curtis ordination, not a new D index. Actually, the index used by Bray and Curtis in their 1957 paper, and clearly described on p. 329, is Whittaker's (1952) index of association.

Asymmetrical Euclidean indices

The following five distance indices are constructed in the same way: the data are first scaled by rows (data transformation); then the Euclidean distance is applied to the scaled data.

The scaling by rows removes the differences in productivity of the sites from the data. These differences are not taken into account in the calculation of the dissimilarities.

The distance between species profiles

Example: community composition data

    Y =         Sp1  Sp2  Sp3  Sp4  Sp5   [y_i+]
        Site1    45   10   15    0   10       80
        Site2    25    8   10    0    3       46
        Site3     7   15   20   14   12       68

Divide each value by the row sum, transforming the rows into profiles of relative abundances:

    y_ij / y_i+ =
        Site1   0.563  0.125  0.188  0.000  0.125
        Site2   0.543  0.174  0.217  0.000  0.065
        Site3   0.103  0.221  0.294  0.206  0.176

This is called the profile transformation.

Then compute the Euclidean distance among the scaled rows. Formula:

$$D_{\text{profile}}(x_1, x_2) = \sqrt{\sum_{j=1}^{p} \left( \frac{y_{1j}}{y_{1+}} - \frac{y_{2j}}{y_{2+}} \right)^2}$$

Double zeros do not affect the row sums or the final Euclidean distance. So, this distance is double-zero asymmetrical.

D_profile is easy to compute but it does not have good properties for the analysis of beta diversity. See a later presentation.

Maximum value of D_profile = $\sqrt{2}$ for sites with no species in common.

The chord distance

Example: community composition data

    Y =         Sp1  Sp2  Sp3  Sp4  Sp5   [Row.norm_i]
        Site1    45   10   15    0   10       49.497
        Site2    25    8   10    0    3       28.249
        Site3     7   15   20   14   12       31.843

The data are divided by the norm (vector length) [1] of each row:

    y_ij / Row.norm_i =
        Site1   0.909  0.202  0.303  0.000  0.202
        Site2   0.885  0.283  0.354  0.000  0.106
        Site3   0.220  0.471  0.628  0.440  0.377

This is called the chord transformation. The norms of the transformed row vectors are now 1.

[1] An R function to compute the norm of a vector: row.norm <- function(vec) sqrt(sum(vec^2))

Then the Euclidean distance is applied to the scaled data. Formula:

$$D_{\text{chord}}(x_1, x_2) = \sqrt{\sum_{j=1}^{p} \left( \frac{y_{1j}}{\text{Row.norm}_1} - \frac{y_{2j}}{\text{Row.norm}_2} \right)^2}$$

This D is insensitive to double zeros (double-zero asymmetrical). This distance has excellent properties for the analysis of beta diversity, as will be seen in a later presentation.

The chord distance is actually the length of a chord between two points along the circumference of a unit circle. This is a geometric notion. This measure is applied to data that have been normed, so that the norm (or length) of each transformed row vector is 1.

[Figure: two sites x1 and x2 plotted on a circle of radius 1 in the space of Species y1 and Species y2; D_chord(x1, x2) is the length of the chord joining them.]

The maximum value of D_chord is $\sqrt{2}$, for sites that have no species in common.

The Hellinger distance

Example: community composition data

    Y =         Sp1  Sp2  Sp3  Sp4  Sp5   [y_i+]
        Site1    45   10   15    0   10       80
        Site2    25    8   10    0    3       46
        Site3     7   15   20   14   12       68

The data are first divided by the sum of each row:

    y_ij / y_i+ =
        Site1   0.563  0.125  0.188  0.000  0.125
        Site2   0.543  0.174  0.217  0.000  0.065
        Site3   0.103  0.221  0.294  0.206  0.176

and square-rooted, producing square-rooted species profiles:

    sqrt(y_ij / y_i+) =
        Site1   0.750  0.354  0.433  0.000  0.354
        Site2   0.737  0.417  0.466  0.000  0.255
        Site3   0.321  0.470  0.542  0.454  0.420

This is called the Hellinger transformation.

Then compute the Euclidean distance among the scaled rows. Formula:

$$D_{\text{Hellinger}}(x_1, x_2) = \sqrt{\sum_{j=1}^{p} \left( \sqrt{\frac{y_{1j}}{y_{1+}}} - \sqrt{\frac{y_{2j}}{y_{2+}}} \right)^2}$$

This D is insensitive to double zeros (double-zero asymmetrical). This distance has excellent properties for the analysis of beta diversity, as will be seen in a later presentation.

Maximum value of D_Hellinger = $\sqrt{2}$ for sites with no species in common.

Relationships

The Hellinger distance is actually the chord distance computed on square-rooted species abundance data. Example with the (3 × 5) matrix:

    Y =         Sp1  Sp2  Sp3  Sp4  Sp5   [y_i+]
        Site1    45   10   15    0   10       80
        Site2    25    8   10    0    3       46
        Site3     7   15   20   14   12       68

    # Generate the data matrix (3 sites x 5 species)
    mat <- matrix(c(45,25,7,10,8,15,15,10,20,0,0,14,10,3,12), 3, 5)

    # Compute the Hellinger distance on mat using dist.ldc()
    library(adespatial)
    ( D.hel <- dist.ldc(mat, "hellinger") )
    #           1         2
    # 2 0.1222138
    # 3 0.6480087 0.6441498

    # Compute the chord distance on mat
    ( D.chord <- dist.ldc(mat, "chord") )
    #           1         2
    # 2 0.1376616
    # 3 0.9364945 0.9052052

    # Compute the chord distance on sqrt(mat)
    ( D <- dist.ldc(sqrt(mat), "chord") )
    #           1         2
    # 2 0.1222138
    # 3 0.6480087 0.6441498

Hellinger D = chord D after taking the square root of the abundances.
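The same relationship can be checked without {adespatial}, by writing the row transformations explicitly and applying dist() of base R to the transformed data. This is a minimal sketch; the function names `profile_tr`, `chord_tr` and `hellinger_tr` are assumptions made for illustration.

```r
# Data matrix (3 sites x 5 species), filled column by column
mat <- matrix(c(45,25,7, 10,8,15, 15,10,20, 0,0,14, 10,3,12), 3, 5)

# Row-wise transformations (each row is one site)
profile_tr   <- function(Y) Y / rowSums(Y)           # relative abundances
chord_tr     <- function(Y) Y / sqrt(rowSums(Y^2))   # rows scaled to norm 1
hellinger_tr <- function(Y) sqrt(Y / rowSums(Y))     # square-rooted profiles

# Euclidean distance computed on the transformed data
D.profile    <- dist(profile_tr(mat))
D.chord      <- dist(chord_tr(mat))
D.hel        <- dist(hellinger_tr(mat))
D.chord.sqrt <- dist(chord_tr(sqrt(mat)))

round(D.chord, 4)
round(D.hel, 4)

# Hellinger D equals chord D computed on square-rooted abundances
all.equal(as.vector(D.hel), as.vector(D.chord.sqrt))   # TRUE
```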

The log-chord distance

Instead of a square root, one can compute the log of the abundances before computing the chord transformation on the $y'_{ij}$ values:

$$y'_{ij} = \log_e(y_{ij} + 1) \qquad y''_{ij} = \frac{y'_{ij}}{\sqrt{\sum_{j=1}^{p} (y'_{ij})^2}}$$

The combination of these two transformations is called the log-chord transformation. The Euclidean distance can then be computed on the transformed data to obtain the log-chord distance (Legendre & Borcard, submitted).

This D has all the properties of the chord D. It is thus insensitive to double zeros (double-zero asymmetrical).

Note: The Euclidean distance computed on log(y+1) data is not double-zero asymmetrical. So it is inappropriate for community composition data.

Idea linking the chord, Hellinger and log-chord transformations

The exponents λ = {1, 0.5, 0} correspond to members of the Box-Cox series of normalizing transformations:

$$f(y) = \frac{y^{\lambda} - 1}{\lambda}$$

- plain chord transformation: λ = 1 => $y_{ij}^{1}$ (no transformation), then chord transformation.
- Hellinger transformation: λ = 0.5 => $y_{ij}^{0.5}$ before the chord transformation.
- log-chord transformation: for λ = 0, the limit of f(y) when λ approaches 0 is $\log_e(y)$ (Box & Cox, 1964). We use $\log_e(y_{ij} + 1)$ because there are abundances of 0 in community composition data and log(0) = -Inf.

The log transformation is used to normalize strongly asymmetric frequency distributions before applying the chord transformation.

=> All D based on the chord transformation inherit the properties of the chord D. In particular, they are double-zero asymmetrical.
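The log-chord transformation can be sketched in base R as log(y + 1) followed by the chord transformation. The function names below are assumptions made for illustration, and the example reuses the (3 × 5) matrix of the previous sections. The last lines check that a species absent from all sites (double zeros) does not change the distances.

```r
# Chord transformation: divide each row by its norm
chord_tr <- function(Y) Y / sqrt(rowSums(Y^2))
# Log-chord transformation: log(y + 1), then chord transformation
log_chord_tr <- function(Y) chord_tr(log1p(Y))

# Data matrix (3 sites x 5 species), filled column by column
mat <- matrix(c(45,25,7, 10,8,15, 15,10,20, 0,0,14, 10,3,12), 3, 5)

Y.lc <- log_chord_tr(mat)
rowSums(Y.lc^2)                 # the norm of each transformed row is 1

# Log-chord distance = Euclidean distance on the transformed data
D.logchord <- dist(Y.lc)

# Adding a species absent everywhere (double zeros) leaves D unchanged
D.logchord.0 <- dist(log_chord_tr(cbind(mat, 0)))
all.equal(as.vector(D.logchord), as.vector(D.logchord.0))   # TRUE
```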

The chi-square distance

The chi-square distance is an important coefficient. It is the distance preserved in correspondence analysis (CA). The chi-square distance can be computed on data that are nonnegative (i.e. ≥ 0), frequency-like (1), and dimensionally homogeneous.

(1) Examples: community composition or biomass data; monetary units (e.g. $).

Example: community composition data

    Y =      Sp1  Sp2  Sp3  Sp4  Sp5 | y_i+
    Site1     45   10   15    0   10 |   80
    Site2     25    8   10    0    3 |   46
    Site3      7   15   20   14   12 |   68
    y_+j      77   33   45   14   25 | y_++ = 194

1. Transform the abundances into relative abundances by row.

    y_ij / y_i+ =
    Site1  0.563  0.125  0.188  0.000  0.125
    Site2  0.543  0.174  0.217  0.000  0.065
    Site3  0.103  0.221  0.294  0.206  0.176

2. Compute a weighted Euclidean distance of the relative abundances, using the inverses of the column sums as weights.

$$D_{\text{chi.sq}}(x_1, x_2) = \sqrt{y_{++}} \, \sqrt{\sum_{j=1}^{p} \frac{1}{y_{+j}} \left( \frac{y_{1j}}{y_{1+}} - \frac{y_{2j}}{y_{2+}} \right)^{2}}$$

=> Using the inverses of the column sums as weights actually gives more importance to the rare species, which have small column sums, in the estimation of the dissimilarity, than to the more abundant and ubiquitous species, which have larger column sums.

This is a good idea for ecologists who find the presence of rare species to be more informative than the presence of abundant and ubiquitous species. A rare species found at a few sites may indicate special environmental conditions that are required by that species.

However, if the rare species are less precisely sampled than the more common species, one should avoid the chi-square distance, and therefore also CA.

Chi-square distance formula:

$$D_{\text{chi.sq}}(x_1, x_2) = \sqrt{y_{++}} \, \sqrt{\sum_{j=1}^{p} \frac{1}{y_{+j}} \left( \frac{y_{1j}}{y_{1+}} - \frac{y_{2j}}{y_{2+}} \right)^{2}}$$

In the comparison of two sites, pairs of 0 do not change the value of $D_{\text{chi.sq}}$. So it is a double-zero asymmetrical D index.

Maximum value of $D_{\text{chi.sq}}$ = $\sqrt{2 y_{++}}$ for sites with no species in common.
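The two steps of the chi-square distance can be sketched in base R on the example matrix Y given earlier. This is an illustration under the formula above, not the code of any package; the object names are chosen here. It assumes that no species has a column sum of 0, since such a column would cause a division by zero.

```r
# Example community composition data (3 sites x 5 species)
Y <- matrix(c(45, 10, 15,  0, 10,
              25,  8, 10,  0,  3,
               7, 15, 20, 14, 12),
            nrow = 3, byrow = TRUE,
            dimnames = list(paste0("Site", 1:3), paste0("Sp", 1:5)))

row_tot <- rowSums(Y)   # y_i+
col_tot <- colSums(Y)   # y_+j
grand   <- sum(Y)       # y_++

# Step 1: relative abundances by row
profiles <- Y / row_tot

# Step 2: weight each column by 1/sqrt(y_+j) and scale by sqrt(y_++),
# so that the plain Euclidean distance equals the chi-square distance
Y.chi <- sqrt(grand) * sweep(profiles, 2, sqrt(col_tot), "/")
D.chi <- dist(Y.chi)
round(profiles, 3)
D.chi
```

Writing the weights into the data first, then calling dist(), mirrors the "transformation followed by Euclidean distance" approach used for the other coefficients in this presentation.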

Binary forms of some quantitative coefficients

Quantitative coefficient functions can be computed using presence-absence (1-0) data. In many cases, the result is equivalent to, or a simple transformation of, usual indices for presence-absence data.

$D_{\text{Euclidean}} = \sqrt{p \, D_{\text{simple matching}}}$, where p is the number of variables
$D_{\text{Ružička}}$, $D_{\text{Canberra}}$, $D_{\text{Wishart}}$ = $D_{\text{Jaccard}}$
$D_{\text{\%difference}}$ = $D_{\text{Sørensen}}$
Hellinger D, chord D = $\sqrt{2(1 - S_{\text{Ochiai}})}$

See Legendre & De Cáceres (2013, Table 1) for other relationships between the quantitative and binary forms.
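Two of these relationships can be checked numerically in base R. The presence-absence vectors below are hypothetical values made up for the illustration; a, b and c follow the usual 2 x 2 table notation (joint presences, presences at site 1 only, presences at site 2 only).

```r
# Two hypothetical presence-absence site vectors (p = 6 species)
x1 <- c(1, 1, 0, 1, 0, 1)
x2 <- c(1, 0, 0, 1, 1, 0)
p  <- length(x1)

n_a <- sum(x1 == 1 & x2 == 1)   # species present at both sites
n_b <- sum(x1 == 1 & x2 == 0)   # present at site 1 only
n_c <- sum(x1 == 0 & x2 == 1)   # present at site 2 only

# Binary indices
S.ochiai   <- n_a / sqrt((n_a + n_b) * (n_a + n_c))
D.matching <- (n_b + n_c) / p   # 1 - simple matching similarity

# Quantitative coefficients computed on the binary data
X <- rbind(x1, x2)
D.euclid <- as.vector(dist(X))
D.chord  <- as.vector(dist(X / sqrt(rowSums(X^2))))
D.hel    <- as.vector(dist(sqrt(X / rowSums(X))))

all.equal(D.euclid, sqrt(p * D.matching))
all.equal(D.chord,  sqrt(2 * (1 - S.ochiai)))
all.equal(D.hel,    sqrt(2 * (1 - S.ochiai)))
```

For 1-0 data the square root leaves the values unchanged, which is why the Hellinger and chord distances coincide here.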

Other useful asymmetrical coefficients

Other useful double-zero asymmetrical coefficients are available in R packages:

In dist.ldc() of {adespatial}: coefficient of divergence, Canberra metric, Whittaker D, Wishart D, Kulczynski D.

Also in dist.ldc(): four abundance-based coefficients of Chao et al. (2006) for quantitative data. These coefficients correct the index for species that have not been observed due to sampling errors.

Other R packages, not listed here, also contain indices for community data. One should check if these indices are Euclidean in the form D or sqrt(D) before using them for PCoA ordination or beta diversity studies.

Computing D through data transformations Compute asymmetrical D indices for community composition data as follows: data transformation followed by calculation of the Euclidean distance, as shown in the section on asymmetrical indices for quantitative community composition data.

For community composition data, after transformation (e.g. Hellinger):
(a) compute D Euclidean to obtain the same-name distance among sites (e.g. the Hellinger distance);
(b) or use the transformed data as input into linear methods of data analysis. There is no need to compute the D matrix in that case.

[Figure: The species abundance paradox (Orlóci, 1978). Pre-transformation of species data, illustration: ordinations in reduced space. Panels: Euclidean, Chord, Species profiles, Hellinger, Chi-square.]

The previous slide shows that the chord, species profile, Hellinger and chi-square transformations, followed by calculation of the Euclidean distance, produce the same-name dissimilarity indices, which are double-zero asymmetrical, metric and Euclidean.

The Euclidean distance paradox

Example data

             Sp.1  Sp.2  Sp.3 | Row sums y_i+ | Row norms
    Site 1      0     4     8 |            12 |     8.944
    Site 2      0     1     1 |             2 |     1.414
    Site 3      1     0     0 |             1 |     1.000

Compute the Euclidean distance among the data rows:

             Site1   Site2   Site3
    Site1   0       7.6158  9.0000
    Site2   7.6158  0       1.7321
    Site3   9.0000  1.7321  0

According to these D results, the two closest sites are 2 and 3, with D = 1.7321, despite the fact that 2 and 3 have no species in common.

For ecologists, two sites that have no species in common are very different. Sharing species is more important than differences in abundances.

In the same example data, the two least different sites are (1 and 2), which share 2 species, yet the Euclidean distance gives them a large distance (D = 7.6158). The most different pairs are (1, 3) and (2, 3), which have no species in common, yet D Euclidean gives a very small distance to pair (2, 3) (D = 1.7321).

[Figure: The species abundance paradox (Orlóci, 1978), same panels (Euclidean, Chord, Species profiles, Hellinger, Chi-square), highlighting the least different and most different pairs in the data matrix.]

The previous slide shows that the Euclidean distance can give a small D value to a pair of sites that have no species in common, wrongly suggesting that they are highly similar. In contrast, the chord, species profile, Hellinger and chi-square distances produce smaller D values for pairs of sites that contain the same species than for pairs of sites where different species assemblages are found.
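The paradox can be reproduced in a few lines of base R using the example data of the previous slides. The chord distance is used here as one representative of the transformation-based distances; the object names are chosen for the illustration.

```r
# Example data of the Euclidean distance paradox (3 sites x 3 species)
Y <- matrix(c(0, 4, 8,
              0, 1, 1,
              1, 0, 0),
            nrow = 3, byrow = TRUE,
            dimnames = list(paste0("Site", 1:3), paste0("Sp.", 1:3)))

# Euclidean distance on raw abundances: sites 2 and 3 come out closest
D.euc <- dist(Y)

# Chord distance: divide each row by its norm, then Euclidean distance
D.chord <- dist(Y / sqrt(rowSums(Y^2)))

round(as.matrix(D.euc), 4)
round(as.matrix(D.chord), 4)
```

With the chord distance, the pairs without shared species, (1, 3) and (2, 3), both reach the maximum value sqrt(2), and pair (1, 2) becomes the closest one.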

Note that the transformations and distances are not equivalent and interchangeable. They produce different PCoA ordinations of the sites.

Data transformations in R

The profile, chord, log-chord, Hellinger and chi-square transformations can be computed using vegan's decostand() function.

    profile transformation:     Y.tr = decostand(y, "total")
    chord transformation:       Y.tr = decostand(y, "norm")
    log-chord transformation:   Y.tr = decostand(log1p(y), "norm")
    Hellinger transformation:   Y.tr = decostand(y, "hellinger")
    chi-square transformation:  Y.tr = decostand(y, "chi.sq")

The transformed data can be used as input into linear methods of data analysis: PCA, RDA, k-means partitioning, manova, etc.

After transforming the data, compute the Euclidean distance using dist() of {stats} to obtain the same-name distances.

Direct calculation of the chord, species profile, Hellinger and chi-square distances is available in function dist.ldc() of {adespatial}.

(b) The transformed data matrices can be used as input into linear methods of data analysis that preserve the Euclidean distance, such as PCA (tb-PCA), RDA (tb-RDA) and k-means partitioning. In these analyses, the chord, log-chord, species profile, Hellinger and chi-square distances, which are double-zero asymmetrical, will be preserved instead of the symmetrical Euclidean distance.
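A small base-R check of this statement, assuming the 3 x 5 example matrix used earlier and the Hellinger transformation written out by hand: when all principal components are kept, the Euclidean distances among the PCA site scores equal the Hellinger distances among sites.

```r
# Example matrix (3 sites x 5 species)
mat <- matrix(c(45,25,7,10,8,15,15,10,20,0,0,14,10,3,12), 3, 5)

# Hellinger transformation written from its definition
Y.hel <- sqrt(mat / rowSums(mat))

# Transformation-based PCA: prcomp() centres the columns by default
# and does not rescale them
pca <- prcomp(Y.hel)

# Distances among site scores (all axes) vs. Hellinger distances
D.scores <- dist(pca$x)
D.hel    <- dist(Y.hel)
all.equal(as.vector(D.scores), as.vector(D.hel))
```

Centring and rotation do not change the distances among rows, so the PCA of transformed data is an ordination of the sites in Hellinger-distance space. Keeping only the first axes gives an approximation of those distances.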

References cited

Box, G. E. P. & D. R. Cox. 1964. An analysis of transformations. J. Roy. Statist. Soc. Ser. B 26: 211-243.

Borcard, D., F. Gillet & P. Legendre. 2018. Numerical ecology with R, 2nd edition. Use R! series, Springer Science, New York.

Bray, R. J. & J. T. Curtis. 1957. An ordination of the upland forest communities of southern Wisconsin. Ecological Monographs 27: 325-349.

Chao, A., R. L. Chazdon, R. K. Colwell & T. J. Shen. 2006. Abundance-based similarity indices and their estimation when there are unseen species in samples. Biometrics 62: 361-371.

Gower, J. C. 1971. A general coefficient of similarity and some of its properties. Biometrics 27: 857-871.

Legendre, P. & D. Borcard. Box-Cox-chord transformations for community composition data prior to beta diversity analysis. Ecography (submitted).

Legendre, P. & M. De Cáceres. 2013. Beta diversity as the variance of community data: dissimilarity coefficients and partitioning. Ecology Letters 16: 951-963.

Legendre, P. & L. Legendre. 2012. Numerical ecology, 3rd English edition. Elsevier Science BV, Amsterdam. xvi + 990 pp. ISBN-13: 978-0444538680.

Odum, E. P. 1950. Bird populations of the Highlands (North Carolina) Plateau in relation to plant succession and avian invasion. Ecology 31: 587-605.

Whittaker, R. H. 1952. A study of summer foliage insect communities in the Great Smoky Mountains. Ecological Monographs 22: 1-44.

Whittaker, R. H. 1972. Evolution and measurement of species diversity. Taxon 21: 213-251.
