Distance-based multivariate analyses confound location and dispersion effects


Methods in Ecology and Evolution 2012, 3, doi: /j X x

David I. Warton¹*, Stephen T. Wright¹ and Yi Wang¹,²

¹School of Mathematics and Statistics and Evolution & Ecology Research Centre; and ²School of Computer Science and Engineering, The University of New South Wales, NSW 2052, Australia

*Correspondence author. david.warton@unsw.edu.au

Summary

1. A critical property of count data is its mean–variance relationship, yet this is rarely considered in multivariate analysis in ecology.
2. This study considers what is being implicitly assumed about the mean–variance relationship in distance-based analyses (multivariate analyses based on a matrix of pairwise distances) and what the effect is of any misspecification of the mean–variance relationship.
3. It is shown that distance-based analyses make implicit assumptions that are typically out-of-step with what is observed in real data, which has major consequences.
4. Potential consequences of this mean–variance misspecification are: confounding location and dispersion effects in ordinations; misleading results when trying to identify taxa in which an effect is expressed; failure to detect a multivariate effect unless it is expressed in high-variance taxa.
5. Data transformation does not solve the problem.
6. A solution is to use generalised linear models and their recent multivariate generalisations, which is shown here to have desirable properties.

Key-words: Bray–Curtis distance, community composition, generalised linear models, mean–variance relationship, multi-dimensional scaling, multivariate analysis, PERMANOVA, SIMPER, species-by-site data

Introduction

A commonly used strategy for the analysis of multivariate data in ecology is to calculate a matrix of pairwise distances or dissimilarities, then analyse those distances (McCune & Grace 2002; Gotelli & Ellison 2004). In this article, such methods will be referred to as distance-based analyses. This method is especially widely used for analysing multivariate abundance data, where abundances of different taxa in an assemblage are simultaneously recorded, also known as community data (McCune & Grace 2002), community composition data (Clarke 1993) and other terms. This study focusses largely on cases where abundance is measured as a count (whether per unit area, per unit length, per subsample of N individuals), this being the most widely used method of measuring abundance (Warton 2008a). However, the ideas presented here apply equally well to other methods of measuring abundance (presence–absence, semi-quantitative methods, etc.).

Distance-based analyses as a first step combine abundances across all taxa, in one of many different ways (reviewed in Legendre & Legendre 1998). The manner in which this is carried out involves making implicit decisions about how one variable scales against another, that is, how the variability from one taxon should be compared with the variability of another. Often this is mediated by some form of standardisation (row, column or pairwise), which can be understood as linking variability in abundance to mean abundance. Hence, it is natural to consider mean–variance relationships in studying the implicit scaling decisions being made in distance-based analyses and their consequences.

The study of the relationship between mean and variance for abundance data has a long history in the ecology literature (Taylor 1961), e.g. in measuring variability in population density (McArdle et al. 1990; Routledge & Swartz 1991) or studying the role of data transformation in analysing abundance data (Yamamura 1999; McArdle & Anderson 2004). As mean abundance increases, variance also tends to increase, as illustrated in Fig. 1 for taxa from ten different data sets.
Note that these data sets follow similar mean–variance relationships, which in this case appear to be well approximated by $\mathrm{Var}(X) = \mu + \phi\mu^2$, where the variance is a function of the mean ($\mu$) and a dispersion parameter ($\phi$). Several authors have noticed apparent quadratic mean–variance relationships in abundance data (Routledge & Swartz 1991; Yamamura 1999; Warton 2005), which can be derived by assuming counts are Poisson with a varying mean (Routledge & Swartz 1991), although such a pattern is not always evident (Ver Hoef & Boveng 2007).

© 2011 The Authors. Methods in Ecology and Evolution © 2011 British Ecological Society

[Figure 1. Panels: (a) Untransformed counts; (b) log(x+1) counts. Axes: mean [log scale], variance [log scale]. Legend (data sets): BORN, BRBM, CLEV, CRYP, EKMA, FAAB, FISH, MEMO, PFSH, VICF.]

Fig. 1. Typical mean–variance relationships of multivariate abundance data: (a) untransformed; (b) after log(x+1) transformation. Data presented for 20 randomly selected taxa from each of 10 different data sets (see Appendix S1 for details). Note from (a) that there are similar, steeply increasing mean–variance relationships for all data sets, and note from (b) that log(x+1) transformation does not completely remove this trend, especially for rarer taxa.

In this article it will be shown that implicit assumptions about the mean–variance relationship made by different choices of distance metric are out-of-step with what is known about mean–variance relationships of abundance data. The effects of this misspecification of the mean–variance relationship in multivariate analysis will be illustrated via the simple situation of testing for a difference between two groups of observations. Using this simple setting facilitates demonstration of the core ideas in this article, although the ideas are by no means limited to the two-group case and are relevant to any multivariate analysis that involves making inferences about the association between multivariate abundances and other variables measured for each observation (whether experimental treatments, environmental variables or some combination thereof) using other distance-based methods (e.g. Clarke & Ainsworth 1993; McArdle & Anderson 2001; Ferrier et al. 2007).
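The quadratic mean–variance relationship quoted above can be checked with a small simulation. The sketch below is illustrative only and is not taken from the paper: it assumes the "varying mean" is gamma distributed (one common choice, which gives negative binomial counts), and the values of `mu`, `phi` and `n` are arbitrary. If counts are Poisson with a gamma-distributed mean of expectation $\mu$ and variance $\phi\mu^2$, their variance should be close to $\mu + \phi\mu^2$.

```python
import math
import random
import statistics


def poisson_draw(rng, lam):
    """Draw one Poisson(lam) variate using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def simulate_counts(mu, phi, n, seed=1):
    """Simulate n counts that are Poisson with a gamma-distributed mean.

    The gamma mean is mu and its variance is phi * mu**2 (an assumption of
    this sketch), so the counts should have variance near mu + phi * mu**2.
    """
    rng = random.Random(seed)
    shape = 1.0 / phi   # gamma shape parameter
    scale = mu * phi    # gamma scale, chosen so that shape * scale == mu
    return [poisson_draw(rng, rng.gammavariate(shape, scale)) for _ in range(n)]


# Arbitrary illustrative values, not estimates from any data set.
mu, phi = 5.0, 0.5
counts = simulate_counts(mu, phi, n=20000)
sample_mean = statistics.mean(counts)
sample_var = statistics.variance(counts)
expected_var = mu + phi * mu ** 2
print(sample_mean, sample_var, expected_var)
```

The sample variance should land near the quadratic prediction and well above the mean, which is the overdispersion pattern described for Fig. 1.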
Key to this study is the distinction between location and dispersion in multivariate analysis and how dispersion is defined. When testing for a difference between groups of multivariate observations, we distinguish between location effects, where there is an effect on the mean abundance of some species, and dispersion effects, where there is an effect on variability. But a mean–variance relationship as in Fig. 1 implies that a change in mean will always lead to a change in variance, so it is natural to think of dispersion effects specifically as changes beyond those explained by the mean–variance relationship: that is, we define dispersion as a change in the mean–variance relationship.

When comparing two groups of observations, Anderson et al. (2008) described four types of possible between-group difference. These are summarised schematically in Fig. 2 and will be referred to as: no effect, there is no difference between groups (Fig. 2a); location effect, there is a difference in location only (Fig. 2b); dispersion effect, there is a difference in dispersion only (Fig. 2c); and location/dispersion effect, there is a change in both location and dispersion between groups (Fig. 2d).

A range of different methods have been developed for detecting between-group effects, and these distance-based analyses are known to be unable to distinguish between location and dispersion effects (e.g. Clarke 1993; Anderson 2001; Mielke & Berry 2010). Usually location effects are of primary interest; however, sometimes dispersion effects are of interest (Warwick & Clarke 1993; Anderson et al. 2011), and Anderson (2006) developed a method intended to test specifically for such dispersion effects. This study not only helps understand why distance-based analyses confound location and dispersion effects, but also demonstrates some additional undesirable properties.

[Figure 2. Panels: (a) No effect; (b) Location effect; (c) Dispersion effect; (d) Location/dispersion effect. Axes: Axis1, Axis2.]

Fig. 2. A schematic diagram as in Anderson et al. (2008) illustrating on two axes the types of between-group effects that are often of ecological interest: (a) no effect; (b) location effect; (c) dispersion effect; (d) location/dispersion effect.

Specific aims are to identify:

1. The mean–variance relationship under which some common distance metrics are appropriate, and how this compares to what relationships are observed in practice.
2. Some consequences, involving important misinterpretations in the literature, because of use of a distance metric that is not appropriate given the mean–variance relationship of data. Each consequence will be demonstrated both using published data and using simulation under controlled settings.
3. What can be done to address this issue.

Unfortunately, in the way multivariate analysis methods are typically applied in ecology at the moment (McCune & Grace 2002; Anderson et al. 2008), the distance metrics chosen are not appropriate given the mean–variance relationship of data; hence, hundreds of papers every year currently use methods which confound mean and variance effects, risking the undesirable consequences identified in this study.

Distance metrics and their corresponding mean–variance relationships

Although many distance-based analyses are distribution-free (e.g. Clarke 1993; Anderson 2001), this does not mean that such analyses are used in a way that is assumption-free. Using a distance-based analysis (whether an ordination or a significance test) to disentangle location and dispersion effects is a case in point. If dispersion is defined as a change in mean–variance relationship, then distance-based analysis can only disentangle dispersion effects and location effects if the distance metric used in analyses correctly accounts for the mean–variance relationship when combining data across taxa. In this section, we study the scaling decisions made by different distance metrics and relate them to the mean–variance relationship of data. The key ideas are illustrated in Fig. 3.

METHODS, FIGURE 3

Figure 3 demonstrates the implicit mean–variance assumptions in a given distance metric. The first column of Fig. 3 displays the mean–variance plot of two independent groups of data, simulated to have a location effect but no dispersion effect. Data are presented for two different scenarios: once under the mean–variance relationship assumed by the given distance measure (triangles on Fig. 3 left), and once under a more typical mean–variance relationship (circles on Fig. 3 left); details on how this was carried out are in Appendix S2. For each distance metric under consideration, the second column of Fig. 3 displays a two-dimensional multidimensional scaling (MDS) ordination using the distance metric, for data generated from the mean–variance relationship assumed by this distance metric. The third column displays an MDS ordination for data generated from the more typical quadratic mean–variance relationship.

DISTANCES WITH NO STANDARDISATION

First, consider the simplest case, where the distance metric of interest does not involve any form of standardisation. The most common example is the Euclidean distance (Legendre & Legendre 1998) between two observations in multivariate space (with p variables):

$$d_{ij} = \sqrt{\sum_{k=1}^{p} (y_{ik} - y_{jk})^2}$$

The Euclidean distance can be interpreted as the straight line distance from one point to the other. The Euclidean distance is a suitable multivariate measure of distance in situations where variables are expected to have equal variance, as noted previously in the context of redundancy analysis (RDA) (ter Braak & Prentice 1988). A simple way to see this is to note that the summation in the formula for $d_{ij}$ is across taxa without taking into consideration their scale; hence, analyses using this metric are appropriate if all taxa are measured on comparable scales with comparable variability. On a mean–variance plot, the equal variance assumption corresponds to the assumption that the mean–variance relation is a horizontal line (slope zero).
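To make the scaling argument concrete, the following minimal sketch computes the Euclidean distance between two observations and the share of the squared distance that each taxon contributes. The counts and the helper name `squared_difference_shares` are hypothetical and chosen only for illustration; they do not come from the paper's data sets.

```python
import math


def euclidean_distance(y_i, y_j):
    """Euclidean distance between two observations (lists of abundances)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_i, y_j)))


def squared_difference_shares(y_i, y_j):
    """Share of the squared Euclidean distance contributed by each taxon.

    Assumes the two observations differ in at least one taxon.
    """
    squared = [(a - b) ** 2 for a, b in zip(y_i, y_j)]
    total = sum(squared)
    return [s / total for s in squared]


# Hypothetical counts for three taxa at two sites: taxon 0 is rare,
# taxon 2 is abundant and therefore much more variable.
site_1 = [0, 12, 310]
site_2 = [3, 20, 190]

print(euclidean_distance(site_1, site_2))
print(squared_difference_shares(site_1, site_2))
```

In this example almost all of the squared distance comes from the abundant taxon, because no term in the sum is rescaled by that taxon's variability.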
Figure 3a demonstrates what happens when data are simulated to have a location effect both under the equal variance assumption and under a more realistic quadratic mean–variance assumption. When the equal variance assumption is satisfied, an MDS plot exhibits the location effect (Fig. 3a, centre). However, when there is a mean–variance trend, such that the variance changes across groups together with the mean, the subsequent Euclidean distance MDS plot is strongly suggestive of a dispersion effect (Fig. 3a, right) rather than just the location effect that data were actually simulated from. The reason for this is that the implicit assumption of equal variance does not hold, so location and dispersion effects are confounded.

The Manhattan distance (Legendre & Legendre 1998) has a similar form to the Euclidean, except that instead of the summation being over squared differences, it is over absolute differences. Using the same arguments as above, but with the rough approximation that $\frac{1}{N}\sum_{i \neq j} |y_{ik} - y_{jk}|$ is proportional to $\sqrt{\frac{1}{N}\sum_{i \neq j} (y_{ik} - y_{jk})^2}$, the Manhattan distance can be seen to be suitable when all taxa have the same variance (slope zero on a mean–variance plot), but confounds location and dispersion effects otherwise, as in Fig. 3b.

DISTANCES WITH PAIRWISE STANDARDISATION

Many distance metrics involve some form of pairwise standardisation: rather than simply summing some measure of distance across all taxa, the contributions are standardised in some way, using the abundances from the pair of observations being compared. An instructive example to consider first is the Canberra distance (Lance & Williams 1966):

$$d_{ij} = \sum_{k=1}^{p} \frac{|y_{ik} - y_{jk}|}{(y_{ik} + y_{jk})}$$

This distance metric is not widely used in ecology, but it is a useful precursor to the more widely used Bray–Curtis distance.
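The Manhattan and Canberra formulas can be compared directly on a pair of observations. The sketch below is a minimal illustration with hypothetical counts; skipping taxa that are absent from both observations (the 0/0 case) is an assumption of this sketch rather than something the formula specifies.

```python
def manhattan_distance(y_i, y_j):
    """Sum of absolute differences across taxa, with no standardisation."""
    return sum(abs(a - b) for a, b in zip(y_i, y_j))


def canberra_distance(y_i, y_j):
    """Canberra distance: each taxon's absolute difference is divided by
    that taxon's total abundance across the two observations.

    Assumption of this sketch: taxa absent from both observations (0/0)
    are skipped rather than counted.
    """
    total = 0.0
    for a, b in zip(y_i, y_j):
        if a + b > 0:
            total += abs(a - b) / (a + b)
    return total


# Hypothetical counts: taxon 0 is rare, taxon 2 is abundant.
site_1 = [0, 12, 310]
site_2 = [3, 20, 190]

print(manhattan_distance(site_1, site_2))   # dominated by the abundant taxon
print(canberra_distance(site_1, site_2))    # largest term comes from the rare taxon
```

With these counts the abundant taxon supplies most of the Manhattan distance, whereas under the Canberra distance the rare taxon (0 versus 3) supplies the largest single term.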
The Canberra distance has a similar form to the Manhattan distance, except that prior to summation, the contribution from each taxon is standardised by the abundance in that taxon across the two observations being compared. This distance is only appropriate, in the sense that each taxon has equal opportunity to contribute to pairwise distances, in the special case where (on average) changes in the numerator are matched by proportionate changes in the denominator across taxa. That is, average absolute difference must be proportional to average abundance across taxa. Using the standard deviation as an approximation for average absolute difference (as previously), changes across taxa in standard deviation must be approximately matched by proportionate changes in average abundance. On a mean–variance plot, this corresponds to a straight-line relationship with slope two on the logarithmic scale. This is reasonably close to the quadratic mean–variance relationship typically observed in practice, except for rare species, for which the observed variance is typically large compared with what is expected by a line of slope two (Fig. 3c, left). Hence, we might expect this distance to put undue weight on rare species, and an MDS ordination of a typical data set would suggest there is greater dispersion between sites that tend to have lower abundances, even when it is known that there is no dispersion at all. This effect is demonstrated by simulation in Fig. 3c, right.

[Figure 3. Columns: Mean–var plot; MDS for assumed mean–var; MDS for typical mean–var. Rows: (a) to (d). Axes: Mean [log scale], Variance [log scale]; MDS1. Legend: Assumed, Typical.]

Fig. 3. Assumed mean–variance relationships for some common distance metrics: (a) Euclidean, (b) Manhattan, (c) Canberra, (d) Bray–Curtis distance. Left column: The assumed mean–variance relationship compared with what is more typical for count data (as in Fig. 1). Centre/right columns: MDS plots for data simulated to have a location effect under the assumed mean–variance relationship (centre) and under a more typical mean–variance relationship (right). Note that a multidimensional scaling plot suggests a location effect when the assumed mean–variance relationship is correct (centre), but otherwise, it suggests dispersion effects also (right).

The Bray–Curtis distance (Bray & Curtis 1957) has a similar form to the Canberra distance, except for the important difference that the summation across taxa is done inside the fraction rather than outside it:

$$d_{ij} = \frac{\sum_{k=1}^{p} |y_{ik} - y_{jk}|}{\sum_{k=1}^{p} (y_{ik} + y_{jk})}$$

As for the Canberra distance, the standardisation used in the Bray–Curtis distance is only appropriate when between-group differences in average absolute difference are matched by proportionate changes in average abundance. However, this assumption applies to averages across all taxa rather than to each taxon separately, because the summation occurs inside the fraction rather than outside it. Inside the fraction, there is no longer any consideration of the different variability of different taxa, as for the Euclidean and Manhattan distances. So like those distances, the Bray–Curtis will only weight different taxa appropriately if all variables have similar variability.

In summary, when Bray–Curtis analyses are performed with the intention of disentangling location and dispersion effects across groups, the following very restrictive assumptions are implicit:

1. Average absolute difference within groups must be equal across taxa, i.e. within-group standard deviations must be (approximately) equal for all taxa.
2. Any between-group differences in average absolute difference must be matched by proportionate between-group changes in average abundance (i.e. on a mean–variance plot, between-group differences must fall approximately along a line of slope two).
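The effect of summing inside the fraction can be seen in a short sketch. The counts are hypothetical, the helper `numerator_shares` is introduced here only for illustration, and returning zero for a pair of empty observations is an assumption of the sketch.

```python
def bray_curtis_distance(y_i, y_j):
    """Bray-Curtis distance: both sums run over taxa inside the fraction."""
    numerator = sum(abs(a - b) for a, b in zip(y_i, y_j))
    denominator = sum(a + b for a, b in zip(y_i, y_j))
    if denominator == 0:
        return 0.0  # assumption of this sketch: two empty observations are identical
    return numerator / denominator


def numerator_shares(y_i, y_j):
    """Share of the Bray-Curtis numerator contributed by each taxon.

    Assumes the two observations differ in at least one taxon.
    """
    diffs = [abs(a - b) for a, b in zip(y_i, y_j)]
    total = sum(diffs)
    return [d / total for d in diffs]


# Hypothetical counts: taxon 0 is rare, taxon 2 is abundant.
site_1 = [0, 12, 310]
site_2 = [3, 20, 190]

print(bray_curtis_distance(site_1, site_2))
print(numerator_shares(site_1, site_2))
```

Because every taxon shares the same denominator, the relative weight of each taxon is set by its raw absolute difference, just as for the Manhattan distance; in this example the abundant taxon accounts for more than 90% of the numerator.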
Figure 3d (centre) demonstrates this idea, where a location effect is visible in an MDS plot if all taxa have equal mean and variance within group and between-group differences fall along a line of slope two (Fig. 3d, left). Note that this is not what is typically observed in abundance data; hence, in a more typical data set, a location effect would appear on an MDS ordination as if it were a location/dispersion effect (Fig. 3d, right), as an artefact of the use of the Bray–Curtis distance.

DISTANCES WITH COLUMN STANDARDISATION

Some distance metrics involve a column standardisation, that is, differences between observations in each taxon are standardised using the total abundance across all observations before combining across taxa. The most common example is the chi-squared distance, implicitly used in correspondence analysis (Legendre & Legendre 1998). Such distance metrics do not fit naturally into the framework proposed here, because the value used in within-group standardisation is a function of the overall mean across all groups rather than being a function of within-group means only. The form of the chi-squared distance implies the variance should scale linearly with the mean (a slope of one on mean–variance plots), but our simulations do not confirm that the method performs well in this setting. The issue of correspondence analysis and related methods will be revisited later.

Consequences of using inappropriate distances in analyses

In this section, three different types of consequences illustrate the core idea that if the mean–variance relationship that is implicitly assumed is different from the actual one, location and dispersion effects are confounded in subsequent analyses. This may lead one to draw a conclusion that is incorrect, as in (a) and (b) below, or it may obscure patterns in data that would otherwise be apparent, as in (c) below. Figure 4 summarises the key ideas.
This figure uses a combination of analysis of data from the literature (first column) and simulation under controlled conditions (second column) to demonstrate three consequences of using distance-based analyses that do not account for the mean–variance relationship in data. The three rows of Fig. 4, labelled (a–c), correspond to the three consequences labelled (a–c) below.

(A) CONCLUDING THERE IS A DISPERSION EFFECT WHEN THERE IS A LOCATION EFFECT

If using an inappropriate distance metric in ordinations, location effects can be misinterpreted as dispersion effects, or as seen previously in Fig. 3c,d, as location/dispersion effects. MDS is commonly used to visualise multivariate abundance data and hence to diagnose whether a between-group effect appears to involve a change in location, dispersion or both. We illustrate in Fig. 4a that MDS plots do not perform this job reliably, the reason being that the distance metric used in constructing the MDS is rarely appropriate given the actual mean–variance relationship of data.

Methods, Figure 4a

Coral assemblage data were collected in 1981 and 1983 from 10 transects located in the Tikus Islands, Indonesia (Warwick et al. 1990), and compared across sampling times using an MDS ordination calculated on Bray–Curtis distances of untransformed data in Fig. 4a (left). Studying mean and variance trends for this data set (Fig. 5) suggests that there is a very strong location effect, but no dispersion effect. Hence, we used simulation to see whether the ordination results of Fig. 4a (left) can be reproduced in simulations where data are known to have a strong location effect but no dispersion effect. Data were generated using a Poisson lognormal distribution (details in Appendix S2), with parameters chosen to mimic the properties of the coral data (Warwick et al. 1990), but in such a way that while simulated 1981 and 1983 observations differed in mean (location effect), they shared a common mean–variance relationship (no dispersion effect). For each of 200 simulated data sets, MDS ordinations were constructed using untransformed Bray–Curtis distances as for Fig. 4a (left).

[Figure 4. Panel titles: (a) MDS plot of coral data; MDS of data simulated with location effect (legend: 1981 observations, 1983 observations; axis: MDS1). (b) 'Between group contribution' for 20 taxa (axes: Error variance [log scale], LR statistic; point size: SIMPER scores); Simulation with five effect taxa (axes: Error variance [log scale], % of time is in top five; legend: SIMPER for effect taxa, SIMPER for no effect taxa, LR for effect taxa, LR for no effect taxa). (c) Testing for a group effect in 13 data sets (axes: % Error variance due to effect taxa, P PERMANOVA / P sum of LR; regions: sum of LR more powerful, PERMANOVA more powerful); Power simulations with one effect taxon (axes: Error variance of effect taxon [log scale], Power; legend: PERMANOVA, sum of LR, Signif. level (no power)).]

Fig. 4. Three consequences of violations of implicit mean–variance assumptions are (a) location and dispersion effects are confounded in ordinations; (b) SIMPER detects taxa with high variance more so than taxa actually expressing a between-group difference, unlike negative binomial regression ('LR'); (c) PERMANOVA fails to detect a significant between-group difference unless it is present in taxa with high variance, unlike sum-of-LR. The left column illustrates each of these consequences using real data, the right column illustrates each of these consequences via simulation. See the text for further details.

Results, Figure 4a (left)

The MDS plot of Fig. 4a (left) displays the 1981 observations near the centre of the ordination and the 1983 observations spread out around these points, suggesting no change in location but an increase in dispersion (a dispersion effect as in Fig. 2c).
However, studying mean–variance trends directly suggests the opposite: all but five of the 41 species decreased in abundance and 25 of them completely disappeared from transects in 1983 (Fig. 5a), suggesting a substantial location effect. Further, a mean–variance plot suggests little evidence of dispersion (Fig. 5b), with observations appearing to fall close to a common mean–variance curve for both sampling years. Hence, it appears that there was a very strong location effect but no change in dispersion, despite the MDS ordination suggesting the opposite. Anderson et al. (2011) analysed this data set using different distance metrics and found that contradictory results can be obtained, e.g. they found that Euclidean distance MDS plots suggest it is the 1981 data that has high dispersion. They speculated that the differences in results could relate to whether or not a distance metric counts double zeros, but simulations demonstrate the key role of the mean–variance relation.

[Figure 5. Panels: (a) 1981 vs 1983 mean (s.e.) abundance (axes: 1981 mean abundance [log(x+1) scale], 1983 mean abundance [log(x+1) scale]); (b) Mean–variance relationship (axes: Mean abundance [log scale], Variance [log scale]; legend: 1981 observations, 1983 observations).]

Fig. 5. Plots demonstrating that the Tikus Island data consists of a very strong location effect and little or no dispersion effect: (a) 1981 vs 1983 mean abundance (plus standard error); (b) mean–variance relationship. Note from (a) that there is a very strong decrease in mean abundance from 1981 to 1983 and from (b) that there is a similar mean–variance relationship for 1981 and 1983 data, flatter than a slope of two on the log scale.

Results, Figure 4a (right)

Simulations demonstrate that the apparent dispersion effect in Fig. 4a (left) arises as an artefact of failure of implicit mean–variance assumptions. Of the 200 simulated data sets, 33% of MDS plots revealed the pattern in Fig. 4a (right), being suggestive of a dispersion effect, when it is known that there was no such dispersion effect. The other 67% of simulated data sets produced MDS plots suggestive of a location/dispersion effect, with no plots correctly suggesting an effect in location only.

The volatile behaviour of MDS ordinations for this data set can be understood via the mean–variance relationship, which was increasing but not as steeply as a slope of two (Fig. 5b). This means that the decrease in variability in 1983 (numerator of the Bray–Curtis distance) was not large enough to match the substantial reduction in mean abundance (denominator); hence, Bray–Curtis distances within 1983 observations were very large relative to 1981. In contrast, because the variances in 1983 were much smaller, Euclidean distances were smaller in 1983 and the opposite effect can be seen, with the 1983 observations appearing less dispersed (Anderson et al. 2011).
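The comparison against "a slope of two" can be made numerically by regressing log variance on log mean across taxa. The sketch below uses ordinary least squares on hypothetical counts; it is one simple way to summarise a mean–variance plot and is not the procedure used to produce Fig. 5.

```python
import math
import statistics


def mean_variance_slope(counts_by_taxon):
    """Least-squares slope of log(variance) against log(mean) across taxa.

    Taxa with zero mean or zero variance are skipped, because their
    logarithms are undefined.
    """
    xs, ys = [], []
    for counts in counts_by_taxon:
        m = statistics.mean(counts)
        v = statistics.variance(counts)
        if m > 0 and v > 0:
            xs.append(math.log(m))
            ys.append(math.log(v))
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx


# Hypothetical taxa whose standard deviation is proportional to the mean,
# which is the relationship with slope two on the log scale.
proportional = [[1, 2, 3], [2, 4, 6], [10, 20, 30], [100, 200, 300]]
# Hypothetical taxa with equal variance at every mean (slope zero).
equal_variance = [[0, 1, 2], [10, 11, 12], [100, 101, 102]]

print(mean_variance_slope(proportional))
print(mean_variance_slope(equal_variance))
```

A slope between these two extremes, as described for the coral data, means that within-group absolute differences shrink more slowly than abundances do when the mean falls.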
(B) FAILING TO IDENTIFY TAXA WITH LARGE BETWEEN-GROUP EFFECTS

SIMilarity PERcentages (SIMPER, Clarke 1993) is widely used in ecology as a post hoc test of a between-group difference in multivariate abundance to answer the question 'in which taxa is this difference most evident?'. However, a potential consequence of failing to correctly account for the mean–variance relationship is that SIMPER might not detect taxa with evidence of a between-group effect at all.

The SIMPER procedure, based on the Bray–Curtis distance, measures the contribution of the kth taxon to the between-group effect using the between-group pairwise average of $|y_{ik} - y_{jk}|$. However, these values will typically be much larger for taxa with high variance, irrespective of whether there is a between-group effect. This confounds strength of between-group mean effect with variance, such that the SIMPER algorithm will not just identify taxa with a strong between-group effect, but also taxa with large within-group variance, irrespective of presence of a between-group effect, as seen in Fig. 4b.

Methods, Figure 4b

The example data set used to showcase the SIMPER algorithm in the PRIMER manual (Clarke & Gorley 2006) was re-analysed to determine how sensitive SIMPER results are to two confounding factors: strength of between-group effect and size of error variance. The data set (Collins & Williams 1982) consists of two groups of observations on 20 taxa, and the goal of analysis is to determine which taxa most strongly express the between-group effect. Figure 4b (left) plots SIMPER results for each of the 20 taxa on two axes, one representing strength of between-group effect (measured as the likelihood ratio statistic from negative binomial regression) and the other axis representing variability within each taxon (measured as within-group error variance).
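The confounding described above can be reproduced with a simplified SIMPER-style calculation. The sketch below averages each taxon's term in the Bray–Curtis numerator, divided by the pair's total abundance, over all between-group pairs. It is an illustrative simplification rather than the PRIMER implementation, and the data are hypothetical: taxon 0 has a consistent threefold between-group difference with no within-group variation, while taxon 1 has identical group means but high variance.

```python
from itertools import product


def simper_contributions(group_a, group_b):
    """Average between-group contribution of each taxon to Bray-Curtis.

    For every pair (i in group_a, j in group_b), taxon k contributes
    |y_ik - y_jk| / sum_k (y_ik + y_jk); values are averaged over pairs.
    Assumes at least one pair has a non-zero total abundance.
    """
    n_taxa = len(group_a[0])
    totals = [0.0] * n_taxa
    n_pairs = 0
    for y_i, y_j in product(group_a, group_b):
        denominator = sum(a + b for a, b in zip(y_i, y_j))
        if denominator == 0:
            continue  # a pair of empty samples has no defined contribution
        for k in range(n_taxa):
            totals[k] += abs(y_i[k] - y_j[k]) / denominator
        n_pairs += 1
    return [t / n_pairs for t in totals]


# Hypothetical data: rows are observations, columns are taxa.
group_a = [[2, 10], [2, 90], [2, 50]]   # taxon 0 mean 2, taxon 1 mean 50
group_b = [[6, 90], [6, 10], [6, 50]]   # taxon 0 mean 6, taxon 1 mean 50

scores = simper_contributions(group_a, group_b)
print(scores)
```

Here the taxon with no difference in means receives a score several times larger than the taxon that actually differs between groups, purely because of its larger within-group variability.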
The point size is scaled relative to SIMPER results, where a larger dot indicates that SIMPER considers the taxon to make a larger contribution to the average between-group distance. In Fig. 4b (right), simulation was used to study to what extent the problematic results seen in Fig. 4b (left) were attributed to systematic problems with the SIMPER method. One thousand sets of negative binomial data were generated to mimic the properties of the original data (Collins & Williams 1982), except that the means of the two groups were equal for all but five effect taxa, generated to have a 10-fold difference in means (see Appendix S2 for details). For each simulated data set, the five taxa with highest SIMPER score were recorded. Figure 4b shows the proportion of times each taxon was in the top five, as a function of the error variance of the taxon.

Results, Figure 4b (left)

Figure 4b (left) clearly shows that the larger dots tend to occur towards the right of the plot, i.e. the SIMPER algorithm tends to consider as the most important taxa, in terms of describing between-group differences, those that have the largest variances. A more desirable result would be for the taxa contributing to the between-group effect to occur towards the top of the plot: that is, it would have been desirable for SIMPER to instead flag variables that tend to have stronger evidence of between-group effects. But taxa with relatively strong between-group effects were often given little weight (when they had relatively small variance), which is highly misleading.

Results, Figure 4b (right)

The taxa whose SIMPER scores are in the top five should tend to be the five taxa for which there actually was a between-group effect. However, SIMPER instead systematically selected the five taxa with highest variance (Fig. 4b, right), and only two of these actually had a between-group effect. Taxa with known between-group effects but smaller variances tended not to be identified as key contributors to average between-group distance.

(C) LOW POWER FOR DETECTING A BETWEEN-GROUP DIFFERENCE UNLESS IT IS EXPRESSED IN VARIABLE TAXA

A final consequence of incorrectly accounting for the mean–variance relationship is that unless between-group differences are expressed in taxa with relatively high variance, standard distance-based methods of detecting between-group differences will have low power. This is undesirable as we are usually interested in the community response, and so are interested in detecting a difference if expressed in any taxa, not just those with high variance. Figure 4c illustrates this problem.
Methods, Figure 4c

A hypothesis was tested for each of 13 data sets from the literature (see Appendix S1 for details) using PERMANOVA with Bray–Curtis distances (Anderson 2001), a technique that is widely used for simultaneously testing the response of multivariate abundances to one or more factors (Anderson 2001). For each data set, results are contrasted with those when using a test statistic that accounts for the mean–variance relationship in data ('sum-of-LR', Warton 2011). In both cases, P-values were calculated from 1000 (restricted) permutations. Figure 4c (left) plots the ratio of the resultant P-values against the variability of effect taxa (defined here as the proportion of error variance in the most significant 20% of taxa).

Figure 4c (right) presents the results from power simulations, studying how power is affected by variability of effect taxa in a controlled setting. In this simulation, negative binomial data were generated to mimic the properties of the data set used in Fig. 4b (Collins & Williams 1982), but with a simulated change in mean by a factor of 10 in one taxon only, and no effect in any remaining taxa. There were 20 simulations, one using each of the 20 taxa of Fig. 4b as the effect taxon. In each simulation, power of the two tests used in Fig. 4c (left) was estimated from 1000 data sets (at the 0.05 significance level). Figure 4c (right) reports power as a function of the error variance of the effect taxon.

Results, Figure 4c (left)

P-values calculated by PERMANOVA tended to be smaller, hence more significant, for data sets where effect taxa had high variances. PERMANOVA P-values tended to be larger, hence less significant, when the effect taxa tended to have small variance. These observations illustrate that the power of PERMANOVA to detect a between-group effect expressed in less-variable taxa is smaller than that when detecting an effect expressed in more variable taxa, as confirmed by our power simulations.
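For readers unfamiliar with the mechanics of the distance-based test, the sketch below computes a one-factor PERMANOVA pseudo-F statistic from a distance matrix and a permutation P-value. It follows the usual sums-of-squared-distances partition for a single factor and uses unrestricted permutation of group labels, which is a simplification of the restricted permutations mentioned above. The counts, function names and number of permutations are illustrative assumptions.

```python
import random


def bray_curtis(y_i, y_j):
    """Bray-Curtis distance between two non-empty observations."""
    return (sum(abs(a - b) for a, b in zip(y_i, y_j))
            / sum(a + b for a, b in zip(y_i, y_j)))


def pseudo_f(dist, labels):
    """One-factor pseudo-F from a full distance matrix and group labels."""
    n = len(labels)
    groups = sorted(set(labels))
    # Total sum of squares: squared distances over all pairs, divided by n.
    ss_total = sum(dist[i][j] ** 2 for i in range(n) for j in range(i + 1, n)) / n
    # Within-group sum of squares: the same calculation inside each group.
    ss_within = 0.0
    for g in groups:
        members = [i for i in range(n) if labels[i] == g]
        pair_sum = sum(dist[i][j] ** 2
                       for a, i in enumerate(members) for j in members[a + 1:])
        ss_within += pair_sum / len(members)
    ss_among = ss_total - ss_within
    return (ss_among / (len(groups) - 1)) / (ss_within / (n - len(groups)))


def permutation_p_value(dist, labels, n_perm=999, seed=1):
    """P-value from unrestricted permutation of the group labels."""
    rng = random.Random(seed)
    observed = pseudo_f(dist, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if pseudo_f(dist, shuffled) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Hypothetical counts: six observations (rows) on three taxa (columns).
data = [[1, 12, 300], [2, 15, 120], [1, 9, 210],
        [11, 14, 280], [14, 10, 150], [12, 13, 190]]
labels = ["a", "a", "a", "b", "b", "b"]
dist = [[bray_curtis(r, s) if r is not s else 0.0 for s in data] for r in data]

print(pseudo_f(dist, labels), permutation_p_value(dist, labels))
```

With Euclidean distances on a single variable, this statistic reduces to the classical ANOVA F-ratio, which offers a simple check of the calculation.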
Results, Figure 4c (right)

The PERMANOVA power curve is consistent with the behaviour observed in Fig. 4c (left): the power to detect a between-group effect steadily increased as the variance of the effect taxon increased. What is particularly concerning is that when the variance of the effect taxon was less than about 1000, there was in effect no power to detect a between-group difference; power was close to 0.05, the significance level that was used. While only performance of PERMANOVA has been discussed here, similar effects are expected in any multivariate analysis that does not appropriately standardise variables, and indeed, some additional simulations (Appendix S2) demonstrate that the effect in Fig. 4b is also evident when using ANOSIM (Clarke 1993), RDA or canonical correspondence analysis (CCA; ter Braak 1986). The reason for the poor properties of distance-based analyses is that taxa with high variance, relative to the variance implicitly assumed in analysis, tend to dominate, as seen in (a) and (b) as well as in (c). Hence, methods that test for a between-group difference will readily detect it when it is expressed in taxa with relatively high variance. But if the effect is expressed only in less-variable taxa, the analysis method may not detect a between-group difference at all.

Generalised linear models model the mean-variance relationship

Fortunately, there is a solution to the problem of accounting for mean-variance relationships: using generalised linear models (GLM; McCullagh & Nelder 1989) and their multivariate extensions (Warton 2011). GLM and a related method, negative binomial regression, extend linear regression to the situation where data follow a known mean-variance function. This is precisely the situation encountered here. These methods are standard for analysing univariate count data and strongly advocated in a recent issue of this journal (O'Hara & Kotze

2010), hence a natural starting point for the development of methods for analysing multivariate abundance data. Negative binomial regression in particular is often appropriate for count data (O'Hara & Kotze 2010), because the mean-variance function of counts is usually quadratic more so than linear (Warton 2005). GLMs can readily address the problems identified in Fig. 4b,c, and software to address these problems is freely available in the mvabund package in R (Wang et al. 2010).

The problem of identifying taxa that most strongly express a between-group difference (Fig. 4b) is relatively simple to solve, being essentially a univariate problem. For the example data set of Fig. 4b, diagnostic plots suggested that negative binomial regression was appropriate (Fig. 6a). Hence, we propose fitting separate univariate negative binomial regressions to each taxon and using the likelihood ratio statistic as a measure of strength of between-group effect. This method performs well (Fig. 4b), correctly detecting the five effect taxa 90-95% of the time, in stark contrast to SIMPER. Further, the GLM framework offers the potential for substantially greater functionality than SIMPER: the mvabund package already enables application to any fixed effects design (including continuous environmental variables as well as factors), and multiple testing can be applied to assess the statistical significance of the effect observed in any given taxon.

The problem of testing for an effect of a difference in mean abundance between a set of multivariate samples (Fig. 4c) can be addressed using the simple approach of fitting separate models to each taxon, then summing the test statistics, and using resampling to assess the significance of this multivariate test statistic (Warton 2011).
This sum-of-LR statistic is analogous to the sum-of-F approach based on analysis of variance (Edgington & Onghena 2007), which was previously shown to have better power properties than distance-based analyses (Warton & Hudson 2004). The key advantage of the sum-of-LR approach, however, is that we can correctly model the mean-variance relationship, leading to better power properties than sum-of-F statistics when sampling is unbalanced (Warton 2008b). The effectiveness of the approach is demonstrated in Fig. 4c, where, compared with PERMANOVA, power of the sum-of-LR statistic was usually much higher and tended to vary much less as a function of the variance of the effect taxon. See Warton (2011) for further simulations demonstrating power advantages. As previously, the method is implemented in the mvabund package. For multisample designs such as that considered in Fig. 4c, permutation tests can be used to construct exact tests irrespective of the actual distribution of the data, as is the case for PERMANOVA. The method can also be used for the more general problem of testing for an association between any (sub)set of environmental variables and multivariate abundance, although resampling only ensures approximately valid tests in this case.

The mvabund package is easy to use for the purposes described in Fig. 4b,c. The code required to fit the model and assess the significance of multivariate and univariate terms takes just two lines:

    ft = manyglm(abund ~ treatment, family = "negative.binomial")
    results = anova(ft, resamp = "perm.resid", p.uni = "adjusted")

This code assumes that abundance data are stored in the workspace as a matrix abund and the treatment factor is stored as treatment. It will then return the multivariate and univariate results in the object results.

While well-developed solutions are available for the two problems of Fig. 4b,c, GLM solutions to the ordination problem of Fig.
4a are in earlier stages of development. A flexible method of constrained ordination, where the ordination axes are functions of environmental variables, has been proposed (Yee 2006), and software is currently available in the R package VGAM (Yee 2010), which can handle the most common forms of mean-variance relationship that are observed in practice. Methods for unconstrained ordination along the lines of Fig. 4a also exist; in particular, generalised latent variable models have potential (Skrondal & Rabe-Hesketh 2004). However, we are yet to find freely available software, and its application in ecology is yet to be evaluated.

MEAN-VARIANCE ASSUMPTION CHECKING

It is critical when fitting any model to check the assumptions of that model: if a GLM method is used that makes inappropriate mean-variance assumptions, it will suffer the same shortcomings as distance-based approaches. Negative binomial regression makes two key assumptions, both of which can be readily checked by looking for no pattern on a residual vs. fits plot (Fahrmeir & Tutz 2001), as in Fig. 6.

The first assumption is that the mean count for the jth taxon in the ith site, $\mu_{ij}$, is a log-linear function of $K$ predictor variables $x_{ik}$:

$$\log(\mu_{ij}) = \beta_{0j} + \sum_{k=1}^{K} x_{ik}\beta_{kj}$$

This assumption is only meaningful for models with interaction terms or where some of the predictor variables $x_{ik}$ are quantitative; otherwise, as for Fig. 6, all terms in the model are indicator variables, and the assumption is guaranteed to be satisfied.

The second assumption is that the variance of the count for the jth taxon and the ith site, $\mathrm{Var}(Y_{ij})$, is the following quadratic function of its mean:

$$\mathrm{Var}(Y_{ij}) = \mu_{ij} + \phi_j \mu_{ij}^2$$

If this assumption is violated, then a trend will be seen on a residual vs. fits plot, where the spread of the residuals tends to change as fitted values change, as in Fig. 6b. A quasi-poisson model was used for Fig. 6b, which assumed that the variance was proportional to the mean.
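A numerical version of this residual check can be sketched in R. The example below is a hedged illustration on simulated data, not the data set behind Fig. 6: the group means (5 and 100), the overdispersion (`size = 1`) and the sample size are hypothetical. It fits a quasi-poisson model (linear mean-variance assumption) and a negative binomial model (quadratic assumption) to the same counts and compares the spread of Pearson residuals at the high and low fitted values.

```r
library(MASS)  # provides glm.nb()

set.seed(2)
n <- 200
g <- factor(rep(c("low", "high"), each = n), levels = c("low", "high"))

# Simulated counts with a quadratic mean-variance relationship:
# Var(Y) = mu + mu^2 / size.
y <- rnbinom(2 * n, mu = ifelse(g == "low", 5, 100), size = 1)

fit_qp <- glm(y ~ g, family = quasipoisson)  # assumes Var(Y) proportional to mu
fit_nb <- glm.nb(y ~ g)                      # assumes Var(Y) = mu + phi * mu^2

# Ratio of residual spread at the high fitted value to that at the low one.
# A ratio well above 1 corresponds to a fan shape on a residual vs. fits plot.
spread_ratio <- function(fit) {
  r <- residuals(fit, type = "pearson")
  sd(r[g == "high"]) / sd(r[g == "low"])
}

ratio_qp <- spread_ratio(fit_qp)
ratio_nb <- spread_ratio(fit_nb)
print(c(quasi_poisson = ratio_qp, negative_binomial = ratio_nb))
```

When the data really are quadratic in the mean, the quasi-poisson residuals spread out as fitted values increase, while the negative binomial residuals have roughly constant spread.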
The fan shape in Fig. 6b suggests that a different mean-variance relationship should be used, and the weaker pattern in Fig. 6a suggests that negative binomial regression is a more reasonable choice. A strength of the GLM framework is

that many different options for mean-variance relationships are available; you can even construct your own via quasi-likelihood (McCullagh & Nelder 1989). An issue to consider when constructing residual plots is that there is no unique definition of residuals for GLM and negative binomial regression, although for the purposes described here (adjusted) Pearson residuals should suffice (Fahrmeir & Tutz 2001), as given by default in the mvabund package. It is also usual to plot fitted values on the linear predictor scale; hence, $\log(\mu_{ij})$ appears on the X-axis in Fig. 6. It should also be noted that when dealing with overdispersed counts, the residuals will usually be strongly right-skewed (as in Fig. 6); hence, they should not be expected to be symmetric about zero.

Fig. 6. Residual vs. fitted value plots (Pearson residuals against fitted value [log scale]) to check mean-variance assumptions for the data set from Fig. 4b when analysed using (a) negative binomial regression, with a quadratic mean-variance assumption; (b) quasi-poisson regression, with a linear mean-variance assumption. Note the fan shape in (b), suggesting that linearity is not satisfied.

Other solutions?

In this section, we briefly review some other ways forward that one might have considered, apart from using a generalised linear modelling approach. However, it will be seen that there are problems with each of these methods, and where applicable, we recommend the use of GLM approaches instead.

DATA TRANSFORMATION

The analyses presented in Fig. 4 involved untransformed data, but most analysts of multivariate abundances transform their data in some way prior to analysis. Transforming abundance data in ecology has been criticised (McArdle & Anderson 2004; O'Hara & Kotze 2010), and we will explore some of the key problems as they apply here.
The effects of transformation on mean-variance relationships can be approximated via the delta method (Yamamura 1999). A summary of how the effects of transformation can be understood via the delta method is available in supplementary material (Appendix S3). It is important to recognise that while an appropriately chosen transformation may reduce the extent of a mean-variance trend in multivariate abundance data, it is unlikely to completely remove the trend, especially if the data set contains rare species. To see this, note that when a taxon is completely absent, its variance (and its mean) are zero. No matter how data are transformed, the variance will remain at zero. Similarly, a rare taxon with a mean and variance very close to zero will have a transformed variance very close to zero. This point is illustrated in Fig. 1b, where the log(x+1) transformation reduces the trend in typical mean-variance relationships for more abundant species, but a noticeable trend is still apparent for rarer species.

Because transformation will rarely remove a mean-variance trend in multivariate abundance data, the confounding of location and dispersion effects previously seen for distance-based analyses is still apparent for transformed data. Figure 4 has been reproduced for log(x+1)-transformed data in Fig. 7. This was the best-performing of the transformations considered [$x^{1/2}$, $x^{1/4}$ and log(x+1)], but all the consequences of misspecifying the mean-variance relationship demonstrated for untransformed data still apply after transformation. The log(x+1) transformation reduced the extent of each of these consequences, but all three consequences were still concerningly evident (Fig. 7).

STANDARDISING ABUNDANCES IN EACH TAXON

An alternative solution is to use a method of analysis that applies a standardisation to abundances in each taxon during analysis.
An appropriate standardisation would use within-group (error) variance; note that this is different from standardising taxa prior to analysis, which can produce very poor results (Warton & Hudson 2004). Standardising prior to analysis, i.e. standardising using total variance, would standardise as a function not just of within-group variance but also of effect size. This reduces power (Warton & Hudson 2004), because the taxa in which the strongest effects are observed are the ones that tend to be most heavily down-weighted in standardisation.
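The difference between the two standardisations can be seen in a small worked example. The R sketch below uses made-up abundances (the values and the taxon names `effect` and `no_effect` are hypothetical): both taxa have identical within-group spread, but only one has a between-group shift in mean.

```r
group <- factor(rep(c("A", "B"), each = 4))

# Hypothetical abundances with identical within-group spread in both taxa.
abund <- cbind(
  effect    = c(8, 10, 12, 10, 28, 30, 32, 30),  # group means 10 vs 30
  no_effect = c(8, 10, 12, 10,  8, 10, 12, 10)   # group means 10 vs 10
)

# Pooled within-group (error) standard deviation of one taxon.
within_sd <- function(y) sqrt(mean(tapply(y, group, var)))

# Weight given to each taxon under the two standardisations.
w_total  <- 1 / apply(abund, 2, sd)         # total variance, as when standardising before analysis
w_within <- 1 / apply(abund, 2, within_sd)  # within-group variance

# Between-group difference in means, before and after standardising.
group_diff <- apply(abund, 2, function(y) unname(diff(tapply(y, group, mean))))

print(rbind(w_total, w_within,
            diff_total  = group_diff * w_total,
            diff_within = group_diff * w_within))
```

Under total-variance standardisation the taxon carrying the effect receives a much smaller weight than the taxon without one, because the effect itself inflates its total variance. Under within-group standardisation the two taxa are weighted equally and the effect is retained in full.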

[Figure 7 panels: (a) MDS plot of transformed coral data; MDS of data simulated with location effect (1981 and 1983 observations). (b) 'Between-group contribution' for 20 taxa (LR statistic and SIMPER scores against error variance); simulation with five effect taxa (% of time in top five against error variance [log scale]). (c) Testing for a group effect in 13 data sets (ratio of PERMANOVA to sum-of-LR P-values against % error variance due to effect taxa); power simulations with one effect taxon (power against error variance of effect taxon [log scale]).]

Fig. 7. Three consequences of violations of implicit mean-variance assumptions, as in Fig. 4, where distance-based analyses have been recalculated for log(x+1)-transformed abundance. Note that all consequences observed for untransformed data in Fig. 4 are still apparent under transformation, although sometimes to a lesser extent.

The within-group standardisation approach is demonstrated in Appendix S2 via the use of analysis of variance (ANOVA). ANOVA applied separately to each taxon will standardise estimates of between-group effects using the within-group variance, before combining or comparing across taxa. This can be quite effective, although in Appendix S2, ANOVA did not perform as well as negative binomial regression. While ANOVA assumes different variances across different taxa, it does assume equal variance of all observations within each taxon, which can be especially problematic when sampling is unbalanced (Romano 1990; Warton 2008b). An alternative form of within-group standardisation was proposed in Clarke et al.
(2006), where taxa were weighted according to their level of overdispersion relative to the Poisson distribution, and these reweighted variables were used in distance-based analyses. However, a more direct approach would have been to fit a form of overdispersed count model to the data directly, as proposed in the previous section, rather than using theory for overdispersed counts to derive weights for use in a different analysis.

One barrier to within-group standardisation in distance-based analyses is computational efficiency. To obtain a valid test, the standardisation would need to be recalculated for each permutation, which in turn means that the whole matrix of pairwise distances would need to be recalculated also. The increase in computation time could be substantial.

(CANONICAL) CORRESPONDENCE ANALYSIS

When not analysing multivariate data using a distance-based approach, the most common approach in ecology is currently to use CCA (ter Braak 1986) or related methods such as RDA (van den Wollenberg 1977) and its generalisations for partitioning beta-diversity (Legendre et al. 2005). Unfortunately, like distance-based approaches, these methods tend to make implicit mean-variance assumptions that appear to be out-of-step with what tends to be observed in practice, with similar consequences to those observed in Fig. 4.

Redundancy analysis is appropriate for data whose variance is constant for all taxa (ter Braak & Prentice 1988), i.e. when data have a slope of zero on a mean-variance plot. In fact, RDA can be understood as a distance-based approach, where Euclidean distances have been used for analysis (McArdle & Anderson 2001). Irrespective of data transformation, the constant variance assumption is unlikely to be satisfied, especially if rare species are included in analyses (Fig. 1).

ter Braak (1986) proposed CCA as an approximation to Gaussian ordination of Poisson data, i.e. data with variance proportional to the mean (a slope of one on a mean-variance plot). But multivariate abundance data are typically overdispersed relative to the Poisson (with a slope steeper than one on Fig. 1), so CCA and related approaches make mean-variance assumptions that are typically inappropriate, with important consequences.

In Appendix S2, simulation is used to show that RDA, CCA and related approaches confound location and dispersion effects in the same way that distance-based approaches do. Misleading ordinations as in Fig. 4a are demonstrated for principal components analysis and correspondence analysis, which are the unconstrained ordination analogues of RDA and CCA (ter Braak & Prentice 1988). Poor properties as in Fig. 4c are demonstrated via simulation for RDA and CCA.
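The slope referred to above can be estimated directly from a species-by-site matrix by regressing log variance on log mean across taxa. The R sketch below does this for simulated negative binomial counts. The taxon means, the overdispersion (`size = 1`) and the number of sites are hypothetical, chosen only to show the calculation. A slope near zero would suit RDA's constant-variance assumption and a slope near one would suit the Poisson-like assumption behind CCA.

```r
set.seed(3)

# Ten hypothetical taxa with means from 2 to 1024, sampled at 100 sites each.
taxon_means <- 2^(1:10)
abund <- sapply(taxon_means, function(mu) rnbinom(100, mu = mu, size = 1))

# Sample mean and variance of each taxon (columns are taxa).
m <- colMeans(abund)
v <- apply(abund, 2, var)

# Slope of the mean-variance relationship on the log-log scale.
slope <- unname(coef(lm(log(v) ~ log(m)))[2])
print(slope)
```

For overdispersed counts such as these the estimated slope is well above one, so neither the constant-variance nor the Poisson assumption is a good description.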
NEW DISTANCE METRICS

The above-mentioned problems with distance-based analysis could be argued to arise as a result of incorrect choice of distance metric, rather than incorrect choice of analysis technique. Hence, another way forward is to use distance metrics that scale variables appropriately according to their mean-variance assumptions. Unfortunately, no known distance metrics currently achieve this, but it is the subject of ongoing research (M. J. Anderson, pers. comm.).

Discussion

All methods of multivariate analysis make some implicit decisions about how to combine data from different variables and, in particular, how to scale the variation in one taxon compared with others. It has been shown that distance-based approaches, as they are typically used, do not make appropriate scaling decisions, leading at times to serious misinterpretation of results and failure to detect very strong patterns in data. It has also been shown that while transformation and standardisation are not good solutions to the problem, using a generalised linear model (GLM) is: it explicitly models the mean-variance relationship, and free software has been developed specifically for the application of GLMs in multivariate analysis in ecology (Wang et al. 2010; Yee 2010).

The issue of location-dispersion confounding identified in this article does not just affect methods designed for comparing two samples, but also other multivariate methods applicable to more general study designs (e.g. Clarke & Ainsworth 1993; Anderson 2001; McArdle & Anderson 2001; Ferrier et al. 2007): when there are more than two groups to compare, several factors (orthogonal or nested), continuous explanatory variables sampled along environmental gradients or when clustering related observations (Legendre & Legendre 1998).
The reason location-dispersion confounding still arises in these other contexts is that the key property of distance-based analyses leading to confounding (using a distance metric that does not reflect the mean-variance relationship of the data) is a property of the abundance data and of how abundance was measured, rather than being a property of the study design. Hence, location-dispersion confounding is a problem to consider whenever multivariate abundances are analysed, rather than being linked to the comparison of two samples or any other specific study design. Fortunately, while the problems discussed in this article apply to data collected in general designs, so do the proposed solutions (Wang et al. 2010; Yee 2010).

While mean-variance confounding has been demonstrated for count data in this study, the issue arises for other methods of collecting multivariate measures of abundance (or occurrence) such as presence/absence data, % cover or semi-quantitative measures of abundance. All measures of abundance imply a mean-variance relationship, and the form of that relationship should be studied and used to inform the method of analysis. This is especially important when rare species are sampled, because rare species have small means and variances (bottom-left of Fig. 1), implying a stronger mean-variance relationship, which typically cannot be removed by transformation.

Recently, research efforts developing new methodology for the analysis of multivariate abundances have most often used a distance-based framework (De'ath 2002; Anderson 2006; Ferrier et al. 2007), where distances are modelled rather than the abundances originally sampled. This study discusses one type of problem that arises when using such a distance-based framework, a problem that can be readily addressed by using a model-based framework where we directly model the data that have been collected and explicitly model key data properties, such as the mean-variance relationship.
The development of such model-based approaches for the analysis of multivariate abundance data (Yee 2006; Warton 2011) has considerable potential, and we sincerely hope that such approaches to multivariate analysis will receive greater attention in the future.

Acknowledgements

This research was supported under Australian Research Council's Discovery Projects funding scheme (project number DP ). Thanks to those who
