Distance-based multivariate analyses confound location and dispersion effects


Methods in Ecology and Evolution 2012, 3, doi: /j X x

David I. Warton 1*, Stephen T. Wright 1 and Yi Wang 1,2
1 School of Mathematics and Statistics and Evolution & Ecology Research Centre; and 2 School of Computer Science and Engineering, The University of New South Wales, NSW 2052, Australia

Summary

1. A critical property of count data is its mean-variance relationship, yet this is rarely considered in multivariate analysis in ecology.
2. This study considers what is being implicitly assumed about the mean-variance relationship in distance-based analyses (multivariate analyses based on a matrix of pairwise distances) and what the effect is of any misspecification of the mean-variance relationship.
3. It is shown that distance-based analyses make implicit assumptions that are typically out-of-step with what is observed in real data, which has major consequences.
4. Potential consequences of this mean-variance misspecification are: confounding location and dispersion effects in ordinations; misleading results when trying to identify taxa in which an effect is expressed; failure to detect a multivariate effect unless it is expressed in high-variance taxa.
5. Data transformation does not solve the problem.
6. A solution is to use generalised linear models and their recent multivariate generalisations, which are shown here to have desirable properties.

Key-words: Bray-Curtis distance, community composition, generalised linear models, mean-variance relationship, multi-dimensional scaling, multivariate analysis, PERMANOVA, SIMPER, species-by-site data

Introduction

A commonly used strategy for the analysis of multivariate data in ecology is to calculate a matrix of pairwise distances or dissimilarities, then analyse those distances (McCune & Grace 2002; Gotelli & Ellison 2004). In this article, such methods will be referred to as distance-based analyses.
This approach is especially widely used for analysing multivariate abundance data, where abundances of different taxa in an assemblage are simultaneously recorded, also known as community data (McCune & Grace 2002), community composition data (Clarke 1993) and other terms. This study focusses largely on cases where abundance is measured as a count (whether per unit area, per unit length, per subsample of N individuals), this being the most widely used method of measuring abundance (Warton 2008a). However, the ideas presented here apply equally well to other methods of measuring abundance (presence-absence, semi-quantitative methods, etc.).

*Correspondence author. david.warton@unsw.edu.au

Distance-based analyses as a first step combine abundances across all taxa, in one of many different ways (reviewed in Legendre & Legendre 1998). The manner in which this is carried out involves making implicit decisions about how one variable scales against another, that is, how the variability from one taxon should be compared with the variability of another. Often this is mediated by some form of standardisation (row, column or pairwise), which can be understood as linking variability in abundance to mean abundance. Hence, it is natural to consider mean-variance relationships in studying the implicit scaling decisions being made in distance-based analyses and their consequences.

The study of the relationship between mean and variance for abundance data has a long history in the ecology literature (Taylor 1961), e.g. in measuring variability in population density (McArdle et al. 1990; Routledge & Swartz 1991) or studying the role of data transformation in analysing abundance data (Yamamura 1999; McArdle & Anderson 2004). As mean abundance increases, variance also tends to increase, as illustrated in Fig. 1 for taxa from ten different data sets.
Note that these data sets follow similar mean-variance relationships, which in this case appear to be well approximated by $\mathrm{Var}(X) = \mu + \phi\mu^2$, where the variance is a function of the mean ($\mu$) and a dispersion parameter ($\phi$). Several authors have noticed apparent quadratic mean-variance relationships in abundance data (Routledge & Swartz 1991; Yamamura 1999; Warton 2005), which can be derived by assuming counts are Poisson with a varying mean (Routledge & Swartz 1991), although such a pattern is not always evident (Ver Hoef & Boveng 2007).

© 2011 The Authors. Methods in Ecology and Evolution © 2011 British Ecological Society

Fig. 1. Typical mean-variance relationships of multivariate abundance data: (a) untransformed counts; (b) log(x+1)-transformed counts. Both panels plot variance against mean, each on a log scale. Data are presented for 20 randomly selected taxa from each of 10 different data sets (BORN, BRBM, CLEV, CRYP, EKMA, FAAB, FISH, MEMO, PFSH, VICF; see Appendix S1 for details). Note from (a) that there are similar, steeply increasing mean-variance relationships for all data sets, and note from (b) that log(x+1) transformation does not completely remove this trend, especially for rarer taxa.

In this article it will be shown that implicit assumptions about the mean-variance relationship made by different choices of distance metric are out-of-step with what is known about mean-variance relationships of abundance data. The effects of this misspecification of the mean-variance relationship in multivariate analysis will be illustrated via the simple situation of testing for a difference between two groups of observations. Using this simple setting facilitates demonstration of the core ideas in this article, although the ideas are by no means limited to the two-group case. They are relevant to any multivariate analysis that involves making inferences about the association between multivariate abundances and other variables measured for each observation (whether experimental treatments, environmental variables or some combination thereof) using other distance-based methods (e.g. Clarke & Ainsworth 1993; McArdle & Anderson 2001; Ferrier et al. 2007).
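To make the quadratic mean-variance relationship Var(X) = μ + φμ² more concrete, the following is a minimal Python sketch that evaluates the curve and its local slope on a log-log plot. The function names and the value of the dispersion parameter are assumptions chosen for illustration only; they do not come from the article or its appendices.

```python
def quadratic_variance(mu, phi):
    """Variance implied by Var(X) = mu + phi * mu**2."""
    return mu + phi * mu ** 2


def loglog_slope(mu, phi):
    """Local slope of log(variance) against log(mean) on the curve above.

    Differentiating log(mu + phi*mu^2) with respect to log(mu) gives
    (1 + 2*phi*mu) / (1 + phi*mu).
    """
    return (1 + 2 * phi * mu) / (1 + phi * mu)


phi = 0.5  # hypothetical dispersion parameter, for illustration only
for mu in (0.01, 1.0, 100.0, 10000.0):
    print(mu, quadratic_variance(mu, phi), round(loglog_slope(mu, phi), 3))
```

Under this curve the log-log slope is close to one for rare taxa and approaches two for abundant taxa, which is why a straight line of slope two can approximate the relationship well only away from the rare end.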
Key to this study is the distinction between location and dispersion in multivariate analysis and how dispersion is defined. When testing for a difference between groups of multivariate observations, we distinguish between location effects, where there is an effect on the mean abundance of some species, and dispersion effects, where there is an effect on variability. But a mean-variance relationship as in Fig. 1 implies that a change in mean will always lead to a change in variance, so it is natural to think of dispersion effects specifically as changes beyond those explained by the mean-variance relationship. That is, we define dispersion as a change in the mean-variance relationship.

When comparing two groups of observations, Anderson et al. (2008) described four types of possible between-group difference. These are summarised schematically in Fig. 2 and will be referred to as: no effect, there is no difference between groups (Fig. 2a); location effect, there is a difference in location only (Fig. 2b); dispersion effect, there is a difference in dispersion only (Fig. 2c); and location/dispersion effect, there is a change in both location and dispersion between groups (Fig. 2d).

Fig. 2. A schematic diagram as in Anderson et al. (2008) illustrating on two axes (Axis1, Axis2) the types of between-group effects that are often of ecological interest: (a) no effect; (b) location effect; (c) dispersion effect; (d) location/dispersion effect.

A range of different methods have been developed for detecting between-group effects, and these distance-based analyses are known to be unable to distinguish between location and dispersion effects (e.g. Clarke 1993; Anderson 2001; Mielke & Berry 2010). Usually location effects are of primary interest; however, sometimes dispersion effects are of interest (Warwick & Clarke 1993; Anderson et al. 2011), and Anderson (2006) developed a method intended to test specifically for such dispersion effects. This study not only helps explain why distance-based analyses confound location and dispersion effects, but also demonstrates some additional undesirable properties.

Specific aims are to identify:

1. The mean-variance relationship under which some common distance metrics are appropriate, and how this compares to what relationships are observed in practice.
2. Some consequences, involving important misinterpretations in the literature, of using a distance metric that is not appropriate given the mean-variance relationship of data. Each consequence will be demonstrated both using published data and using simulation under controlled settings.
3. What can be done to address this issue.

Unfortunately, the way multivariate analysis methods are typically applied in ecology at the moment (McCune & Grace 2002; Anderson et al. 2008), the distance metrics chosen are not appropriate given the mean-variance relationship of data. Hence, hundreds of papers every year currently use methods which confound mean and variance effects, risking the undesirable consequences identified in this study.

Distance metrics and their corresponding mean-variance relationships

Although many distance-based analyses are distribution-free (e.g. Clarke 1993; Anderson 2001), this does not mean that such analyses are used in a way that is assumption-free. Using a distance-based analysis (whether an ordination or a significance test) to disentangle location and dispersion effects is a case in point. If dispersion is defined as a change in mean-variance relationship, then distance-based analysis can only disentangle dispersion effects and location effects if the distance metric used in analyses correctly accounts for the mean-variance relationship when combining data across taxa. In this section, we study the scaling decisions made by different distance metrics and relate them to the mean-variance relationship of data. The key ideas are illustrated in Fig. 3.

METHODS, FIGURE 3

Figure 3 demonstrates the implicit mean-variance assumptions in a given distance metric. The first column of Fig. 3 displays the mean-variance plot of two independent groups of data, simulated to have a location effect but no dispersion effect. Data are presented for two different scenarios: once under the mean-variance relationship assumed by the given distance measure (triangles on Fig. 3, left), and once under a more typical mean-variance relationship (circles on Fig. 3, left). Details on how this was carried out are in Appendix S2. For each distance metric under consideration, the second column of Fig. 3 displays a two-dimensional multidimensional scaling (MDS) ordination using the distance metric, for data generated from the mean-variance relationship assumed by this distance metric. The third column displays an MDS ordination for data generated from the more typical quadratic mean-variance relationship.

DISTANCES WITH NO STANDARDISATION

First, consider the simplest case, where the distance metric of interest does not involve any form of standardisation. The most common example is the Euclidean distance (Legendre & Legendre 1998) between two observations in multivariate space (with p variables):

$$d_{ij} = \sqrt{\sum_{k=1}^{p} (y_{ik} - y_{jk})^2}$$

The Euclidean distance can be interpreted as the straight line distance from one point to the other. The Euclidean distance is a suitable multivariate measure of distance in situations where variables are expected to have equal variance, as noted previously in the context of redundancy analysis (RDA) (ter Braak & Prentice 1988). A simple way to see this is to note that the summation in the formula for $d_{ij}$ is across taxa without taking into consideration their scale; hence, analyses using this metric are appropriate if all taxa are measured on comparable scales with comparable variability. On a mean-variance plot, the equal variance assumption corresponds to the assumption that the mean-variance relation is a horizontal line (slope zero).
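The point that the Euclidean sum ignores the scale of each taxon can be seen in a small numerical sketch. The Python below computes the Euclidean distance between two hypothetical observations and the share of the squared distance contributed by one taxon. The counts are invented for illustration and are not taken from any data set in the article.

```python
import math


def euclidean(y_i, y_j):
    """Euclidean distance between two abundance vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_i, y_j)))


# Hypothetical counts for two observations: [rare taxon, abundant taxon].
site_a = [2, 250]
site_b = [8, 310]

d = euclidean(site_a, site_b)
# Share of the squared distance that comes from the abundant taxon.
share_abundant = (site_a[1] - site_b[1]) ** 2 / d ** 2
print(round(d, 2), round(share_abundant, 3))
```

In this example the rare taxon changes four-fold while the abundant taxon changes by about a quarter, yet almost all of the squared distance comes from the abundant, more variable taxon.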
Figure 3a demonstrates what happens when data are simulated to have a location effect both under the equal variance assumption and under a more realistic quadratic mean-variance assumption. When the equal variance assumption is satisfied, an MDS plot exhibits the location effect (Fig. 3a, centre). However, when there is a mean-variance trend, such that the variance changes across groups together with the mean, the subsequent Euclidean distance MDS plot is strongly suggestive of a dispersion effect (Fig. 3a, right) rather than just the location effect that data were actually simulated from. The reason for this is that the implicit assumption of equal variance does not hold, so location and dispersion effects are confounded.

The Manhattan distance (Legendre & Legendre 1998) has a similar form to the Euclidean, except that instead of the summation being over squared differences, it is over absolute differences. Using the same arguments as above, but with the rough approximation that $\frac{1}{N}\sum_{i \neq j} |y_{ik} - y_{jk}|$ is proportional to $\sqrt{\frac{1}{N}\sum_{i \neq j} (y_{ik} - y_{jk})^2}$, the Manhattan distance can be seen to be suitable when all taxa have the same variance (slope zero on a mean-variance plot), but confounds location and dispersion effects otherwise, as in Fig. 3b.

DISTANCES WITH PAIRWISE STANDARDISATION

Many distance metrics involve some form of pairwise standardisation: rather than simply summing some measure of distance across all taxa, the contributions are standardised in some way, using the abundances from the pair of observations being compared. An instructive example to consider first is the Canberra distance (Lance & Williams 1966):

$$d_{ij} = \sum_{k=1}^{p} \frac{|y_{ik} - y_{jk}|}{y_{ik} + y_{jk}}$$

This distance metric is not widely used in ecology, but it is a useful precursor to the more widely used Bray-Curtis distance.
The Canberra distance has a similar form to the Manhattan distance, except that prior to summation, the contribution from each taxon is standardised by the abundance in that taxon across the two observations being compared. This distance is only appropriate, in the sense that each taxon has equal opportunity to contribute to pairwise distances, in the special case where (on average) changes in the numerator are matched by proportionate changes in the denominator across taxa. That is, average absolute difference must be proportional to average abundance across taxa. Using the standard deviation as an approximation for average absolute difference (as previously), changes across taxa in standard deviation must be approximately matched by proportionate changes in average abundance. On a mean-variance plot, this corresponds to a straight-line relationship with slope two on the logarithmic scale. This is reasonably close to the quadratic mean-variance relationship typically observed in practice, except for rare species, for which the observed variance is typically large compared with what is expected by a line of slope two (Fig. 3c, left). Hence, we might expect this distance to put undue weight on rare species, and an MDS ordination of a typical data set would suggest there is greater dispersion between sites that tend to have lower abundances, even when it is known that there is no dispersion effect at all. This effect is demonstrated by simulation in Fig. 3c, right.

Fig. 3. Assumed mean-variance relationships for some common distance metrics: (a) Euclidean, (b) Manhattan, (c) Canberra, (d) Bray-Curtis distance. Left column (variance against mean, both on a log scale): the assumed mean-variance relationship compared with what is more typical for count data (as in Fig. 1). Centre/right columns: MDS plots for data simulated to have a location effect under the assumed mean-variance relationship (centre) and under a more typical mean-variance relationship (right). Note that an MDS plot suggests a location effect when the assumed mean-variance relationship is correct (centre), but otherwise, it suggests dispersion effects also (right).

The Bray-Curtis distance (Bray & Curtis 1957) has a similar form to the Canberra distance, except for the important difference that the summation across taxa is done inside the fraction rather than outside it:

$$d_{ij} = \frac{\sum_{k=1}^{p} |y_{ik} - y_{jk}|}{\sum_{k=1}^{p} (y_{ik} + y_{jk})}$$

As for the Canberra distance, the standardisation used in the Bray-Curtis distance is only appropriate when between-group differences in average absolute difference are matched by proportionate changes in average abundance. However, this assumption applies to averages across all taxa rather than to each taxon separately, because the summation occurs inside the fraction rather than outside it. Inside the fraction, there is no longer any consideration of the different variability of different taxa, as for the Euclidean and Manhattan distances. So like those distances, the Bray-Curtis distance will only weight different taxa appropriately if all variables have similar variability.

In summary, when Bray-Curtis analyses are performed with the intention of disentangling location and dispersion effects across groups, the following very restrictive assumptions are implicit:

1. Average absolute difference within groups must be equal across taxa, i.e. within-group standard deviations must be (approximately) equal for all taxa.
2. Any between-group differences in average absolute difference must be matched by proportionate between-group changes in average abundance (i.e. on a mean-variance plot, between-group differences must fall approximately along a line of slope two).
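The difference between standardising each taxon separately (Canberra) and standardising once across all taxa (Bray-Curtis) can be illustrated with a short Python sketch of the three formulas above. The counts are hypothetical, and skipping taxa that are absent from both observations in the Canberra sum is an assumption made here to avoid dividing zero by zero.

```python
def manhattan(y_i, y_j):
    """Sum of absolute differences across taxa."""
    return sum(abs(a - b) for a, b in zip(y_i, y_j))


def canberra(y_i, y_j):
    """Each taxon is standardised by its own total before summing.

    Taxa absent from both observations are skipped (an assumption
    made here to avoid 0/0).
    """
    return sum(abs(a - b) / (a + b) for a, b in zip(y_i, y_j) if a + b > 0)


def bray_curtis(y_i, y_j):
    """Summation across taxa happens inside the fraction."""
    total = sum(a + b for a, b in zip(y_i, y_j))
    return manhattan(y_i, y_j) / total if total > 0 else 0.0


# Hypothetical counts for two observations: [rare taxon, abundant taxon].
site_a = [2, 250]
site_b = [8, 310]

rare_share_canberra = (6 / 10) / canberra(site_a, site_b)
rare_share_bray = 6 / manhattan(site_a, site_b)
print(round(canberra(site_a, site_b), 3), round(rare_share_canberra, 3))
print(round(bray_curtis(site_a, site_b), 3), round(rare_share_bray, 3))
```

With these invented counts the rare taxon supplies most of the Canberra distance but under a tenth of the Bray-Curtis numerator, which matches the contrast drawn in the text: Canberra tends to emphasise rare taxa, whereas Bray-Curtis, like Manhattan, is driven by the taxa with the largest absolute differences.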
Figure 3d (centre) demonstrates this idea, where a location effect is visible in an MDS plot if all taxa have equal mean and variance within group and between-group differences fall along a line of slope two (Fig. 3d, left). Note that this is not what is typically observed in abundance data. Hence, in a more typical data set, a location effect would appear on an MDS ordination as if it were a location/dispersion effect (Fig. 3d, right), as an artefact of the use of the Bray-Curtis distance.

DISTANCES WITH COLUMN STANDARDISATION

Some distance metrics involve a column standardisation, that is, differences between observations in each taxon are standardised using the total abundance across all observations before combining across taxa. The most common example is the chi-squared distance, implicitly used in correspondence analysis (Legendre & Legendre 1998). Such distance metrics do not fit naturally into the framework proposed here, because the value used in within-group standardisation is a function of the overall mean across all groups rather than being a function of within-group means only. The form of the chi-squared distance implies the variance should scale linearly with the mean (a slope of one on mean-variance plots), but our simulations do not confirm that the method performs well in this setting. The issue of correspondence analysis and related methods will be revisited later.

Consequences of using inappropriate distances in analyses

In this section, three different types of consequences illustrate the core idea that if the mean-variance relationship that is implicitly assumed is different from the actual one, location and dispersion effects are confounded in subsequent analyses. This may lead one to draw a conclusion that is incorrect, as in (a) and (b) below, or it may obscure patterns in data that would otherwise be apparent, as in (c) below. Figure 4 summarises the key ideas.
This figure uses a combination of analysis of data from the literature (first column) and simulation under controlled conditions (second column) to demonstrate three consequences of using distance-based analyses that do not account for the mean-variance relationship in data. The three rows of Fig. 4, labelled (a-c), correspond to the three consequences labelled (a-c) below.

Fig. 4. Three consequences of violations of implicit mean-variance assumptions are: (a) location and dispersion effects are confounded in ordinations; (b) SIMPER detects taxa with high variance more so than taxa actually expressing a between-group difference, unlike negative binomial regression ('LR'); (c) PERMANOVA fails to detect a significant between-group difference unless it is present in taxa with high variance, unlike sum-of-LR. The left column illustrates each of these consequences using real data; the right column illustrates each of these consequences via simulation. Panel titles: (a) MDS plot of coral data (left), MDS of data simulated with location effect (right); (b) 'Between group contribution' for 20 taxa (left), Simulation with five effect taxa (right); (c) Testing for a group effect in 13 data sets (left), Power simulations with one effect taxon (right). See the text for further details.

(A) CONCLUDING THERE IS A DISPERSION EFFECT WHEN THERE IS A LOCATION EFFECT

If using an inappropriate distance metric in ordinations, location effects can be misinterpreted as dispersion effects, or as seen previously in Fig. 3c,d, as location/dispersion effects. MDS is commonly used to visualise multivariate abundance data and hence to diagnose whether a between-group effect appears to involve a change in location, dispersion or both. We illustrate in Fig. 4a that MDS plots do not perform this job reliably, the reason being that the distance metric used in constructing the MDS is rarely appropriate given the actual mean-variance relationship of data.

Methods, Figure 4a

Coral assemblage data were collected in 1981 and 1983 from 10 transects located in the Tikus Islands, Indonesia (Warwick et al. 1990), and compared across sampling times using an MDS ordination calculated on Bray-Curtis distances of untransformed data in Fig. 4a (left). Studying mean and variance trends for this data set (Fig. 5) suggests that there is a very strong location effect, but no dispersion effect. Hence, we used simulation to see whether the ordination results of Fig. 4a (left) can be reproduced in simulations where data are known to have a strong location effect but no dispersion effect. Data were generated using a Poisson lognormal distribution (details in Appendix S2), with parameters chosen to mimic the properties of the coral data (Warwick et al. 1990), but in such a way that while simulated 1981 and 1983 observations differed in mean (location effect), they shared a common mean-variance relationship (no dispersion effect). For each of 200 simulated data sets, MDS ordinations were constructed using untransformed Bray-Curtis distances as for Fig. 4a (left).

Results, Figure 4a (left)

The MDS plot of Fig. 4a (left) displays the 1981 observations near the centre of the ordination and the 1983 observations spread out around these points, suggesting no change in location but an increase in dispersion (a dispersion effect as in Fig. 2c).
However, studying mean-variance trends directly suggests the opposite: all but five of the 41 species decreased in abundance and 25 of them completely disappeared from transects in 1983 (Fig. 5a), suggesting a substantial location effect. Further, a mean-variance plot suggests little evidence of dispersion (Fig. 5b), with observations appearing to fall close to a common mean-variance curve for both sampling years. Hence, it appears that there was a very strong location effect but no change in dispersion, despite the MDS ordination suggesting the opposite.

Anderson et al. (2011) analysed this data set using different distance metrics and found that contradictory results can be obtained, e.g. they found that Euclidean distance MDS plots suggest it is the 1981 data that has high dispersion. They speculated that the differences in results could relate to whether or not a distance metric counts double zeros, but simulations demonstrate the key role of the mean-variance relation.

Fig. 5. Plots demonstrating that the Tikus Island data consists of a very strong location effect and little or no dispersion effect: (a) 1981 vs 1983 mean abundance (plus standard error), on a log(x+1) scale; (b) mean-variance relationship for 1981 and 1983 observations, on a log scale. Note from (a) that there is a very strong decrease in mean abundance from 1981 to 1983 and from (b) that there is a similar mean-variance relationship for 1981 and 1983 data, flatter than a slope of two on the log scale.

Results, Figure 4a (right)

Simulations demonstrate that the apparent dispersion effect in Fig. 4a (left) arises as an artefact of failure of implicit mean-variance assumptions. Of the 200 simulated data sets, 33% of MDS plots revealed the pattern in Fig. 4a (right), being suggestive of a dispersion effect when it is known that there was no such dispersion effect. The other 67% of simulated data sets produced MDS plots suggestive of a location/dispersion effect, with no plots correctly suggesting an effect in location only.

The volatile behaviour of MDS ordinations for this data set can be understood via the mean-variance relationship, which was increasing but not as steeply as a slope of two (Fig. 5b). This means that the decrease in variability in 1983 (numerator of the Bray-Curtis distance) was not large enough to match the substantial reduction in mean abundance (denominator). Hence, Bray-Curtis distances within 1983 observations were very large relative to 1981. In contrast, because the variances in 1983 were much smaller, Euclidean distances were smaller in 1983 and the opposite effect can be seen, with the 1983 observations appearing less dispersed (Anderson et al. 2011).
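The kind of direct check on mean-variance trends described above (as in Fig. 5b) can be sketched in a few lines of Python: compute the mean and variance of each taxon within each group, then fit a straight line on the log-log scale. The counts below are hypothetical and are not the Tikus Islands data; the function names are likewise assumptions for illustration.

```python
import math
from statistics import mean, variance


def mean_variance_by_taxon(rows):
    """rows: abundance vectors (one per observation) from ONE group.

    Returns one (mean, sample variance) pair per taxon.
    """
    columns = list(zip(*rows))
    return [(mean(col), variance(col)) for col in columns]


def fitted_loglog_slope(pairs):
    """Least-squares slope of log(variance) on log(mean).

    Taxa with zero mean or zero variance are dropped, since their
    logarithm is undefined.
    """
    pts = [(math.log(m), math.log(v)) for m, v in pairs if m > 0 and v > 0]
    xs = [x for x, _ in pts]
    ys = [y for _, y in pts]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in pts)
    return sxy / sxx


# Hypothetical counts: four observations per year, three taxa.
year_1981 = [[10, 40, 3], [14, 55, 1], [8, 38, 4], [12, 47, 2]]
year_1983 = [[1, 5, 0], [3, 9, 1], [0, 4, 0], [2, 6, 1]]

pooled = mean_variance_by_taxon(year_1981) + mean_variance_by_taxon(year_1983)
print("pooled log-log slope:", round(fitted_loglog_slope(pooled), 2))
```

If points from both groups fall along one common curve, the change between groups is consistent with a location effect alone; a fitted slope flatter than two indicates that Bray-Curtis standardisation will over-correct for the change in mean.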
(B) FAILING TO IDENTIFY TAXA WITH LARGE BETWEEN-GROUP EFFECTS

SIMilarity PERcentages (SIMPER, Clarke 1993) is widely used in ecology as a post hoc test of a between-group difference in multivariate abundance to answer the question 'in which taxa is this difference most evident?'. However, a potential consequence of failing to correctly account for the mean-variance relationship is that SIMPER might not detect taxa with evidence of a between-group effect at all.

The SIMPER procedure, based on the Bray-Curtis distance, measures the contribution of the kth taxon to the between-group effect using the between-group pairwise average of $|y_{ik} - y_{jk}|$. However, these values will typically be much larger for taxa with high variance, irrespective of whether there is a between-group effect. This confounds strength of between-group mean effect with variance, such that the SIMPER algorithm will not just identify taxa with a strong between-group effect, but also taxa with large within-group variance, irrespective of presence of a between-group effect, as seen in Fig. 4b.

Methods, Figure 4b

The example data set used to showcase the SIMPER algorithm in the PRIMER manual (Clarke & Gorley 2006) was re-analysed to determine how sensitive SIMPER results are to two confounding factors: strength of between-group effect and size of error variance. The data set (Collins & Williams 1982) consists of two groups of observations on 20 taxa, and the goal of analysis is to determine which taxa most strongly express the between-group effect. Figure 4b (left) plots SIMPER results for each of the 20 taxa on two axes, one representing strength of between-group effect (measured as the likelihood ratio statistic from negative binomial regression) and the other representing variability within each taxon (measured as within-group error variance).
The point size is scaled relative to SIMPER results, where a larger dot indicates that SIMPER considers the taxon to make a larger contribution to the average between-group distance.

In Fig. 4b (right), simulation was used to study to what extent the problematic results seen in Fig. 4b (left) were attributable to systematic problems with the SIMPER method. One thousand sets of negative binomial data were generated to mimic the properties of the original data (Collins & Williams 1982), except that the means of the two groups were equal for all but five 'effect taxa', generated to have a 10-fold difference in means (see Appendix S2 for details). For each simulated data set, the five taxa with highest SIMPER score were recorded. Figure 4b (right) shows the proportion of times each taxon was in the top five, as a function of the error variance of the taxon.
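The way SIMPER-style scores mix effect size with variance can be illustrated with a simplified Python sketch. It averages, over all between-group pairs, each taxon's share of the Bray-Curtis dissimilarity. This is a sketch of the idea described above, not the PRIMER implementation; the function name, the handling of empty pairs, and the counts are all assumptions for illustration.

```python
def simper_contributions(group_a, group_b):
    """Average per-taxon contribution to between-group Bray-Curtis distance.

    For each between-group pair (i, j), taxon k contributes
    |y_ik - y_jk| / sum_k (y_ik + y_jk); contributions are averaged over pairs.
    """
    n_taxa = len(group_a[0])
    totals = [0.0] * n_taxa
    n_pairs = 0
    for y_i in group_a:
        for y_j in group_b:
            denom = sum(y_i) + sum(y_j)
            if denom == 0:
                continue  # assumption: skip pairs of empty observations
            for k in range(n_taxa):
                totals[k] += abs(y_i[k] - y_j[k]) / denom
            n_pairs += 1
    if n_pairs == 0:
        return totals
    return [t / n_pairs for t in totals]


# Hypothetical counts. Taxon 0: roughly 10-fold between-group effect, low variance.
# Taxon 1: no systematic between-group effect, high variance.
group_a = [[1, 120], [2, 300], [1, 40], [2, 210]]
group_b = [[12, 250], [18, 60], [15, 330], [11, 150]]

scores = simper_contributions(group_a, group_b)
print([round(s, 3) for s in scores])
```

In this invented example the highly variable taxon with no systematic effect receives the larger score, while the taxon with the strong between-group effect but small variance is ranked below it.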

Results, Figure 4b (left)

Figure 4b (left) clearly shows that the larger dots tend to occur towards the right of the plot, i.e. the SIMPER algorithm tends to consider as the most important taxa, in terms of describing between-group differences, those that have the largest variances. A more desirable result would be for the taxa contributing to the between-group effect to occur towards the top of the plot. That is, it would have been desirable for SIMPER to instead flag variables that tend to have stronger evidence of between-group effects. But taxa with relatively strong between-group effects were often given little weight (when they had relatively small variance), which is highly misleading.

Results, Figure 4b (right)

The taxa whose SIMPER scores are in the top five should tend to be the five taxa for which there actually was a between-group effect. However, SIMPER instead systematically selected the five taxa with highest variance (Fig. 4b, right), and only two of these actually had a between-group effect. Taxa with known between-group effects but smaller variances tended not to be identified as key contributors to average between-group distance.

(C) LOW POWER FOR DETECTING A BETWEEN-GROUP DIFFERENCE UNLESS IT IS EXPRESSED IN VARIABLE TAXA

A final consequence of incorrectly accounting for the mean-variance relationship is that unless between-group differences are expressed in taxa with relatively high variance, standard distance-based methods of detecting between-group differences will have low power. This is undesirable as we are usually interested in the community response, and so are interested in detecting a difference if expressed in any taxa, not just those with high variance. Figure 4c illustrates this problem.
Methods, Figure 4c

A hypothesis was tested for each of 13 data sets from the literature (see Appendix S1 for details) using PERMANOVA with Bray-Curtis distances (Anderson 2001), a technique that is widely used for simultaneously testing the response of multivariate abundances to one or more factors. For each data set, results are contrasted with those when using a test statistic that accounts for the mean-variance relationship in data ('sum-of-LR', Warton 2011). In both cases, P-values were calculated from 1000 (restricted) permutations. Figure 4c (left) plots the ratio of the resultant P-values against the variability of effect taxa (defined here as the proportion of error variance in the most significant 20% of taxa).

Figure 4c (right) presents the results from power simulations, studying how power is affected by variability of effect taxa in a controlled setting. In this simulation, negative binomial data were generated to mimic the properties of the data set used in Fig. 4b (Collins & Williams 1982), but with a simulated change in mean by a factor of 10 in one taxon only, and no effect in any remaining taxa. There were 20 simulations, one using each of the 20 taxa of Fig. 4b as the effect taxon. In each simulation, power of the two tests used in Fig. 4c (left) was estimated from 1000 data sets (at the 0.05 significance level). Figure 4c (right) reports power as a function of the error variance of the effect taxon.

Results, Figure 4c (left)

P-values calculated by PERMANOVA tended to be smaller, hence more significant, for data sets where effect taxa had high variances. PERMANOVA P-values tended to be larger, hence less significant, when the effect taxa tended to have small variance. These observations illustrate that the power of PERMANOVA to detect a between-group effect expressed in less-variable taxa is smaller than that when detecting an effect expressed in more variable taxa, as confirmed by our power simulations.
Results, Figure 4c (right)

The PERMANOVA power curve is consistent with the behaviour observed in Fig. 4c (left): the power to detect a between-group effect steadily increased as the variance of the effect taxon increased. What is particularly concerning is that when the variance of the effect taxon was less than about 1000, there was in effect no power to detect a between-group difference: power was close to 0.05, the significance level that was used.

While only the performance of PERMANOVA has been discussed here, similar effects are expected in any multivariate analysis that does not appropriately standardise variables; indeed, some additional simulations (Appendix S2) demonstrate that the effect in Fig. 4b is also evident when using ANOSIM (Clarke 1993), RDA or canonical correspondence analysis (CCA; ter Braak 1986).

The reason for the poor properties of distance-based analyses is that taxa with high variance, relative to the variance implicitly assumed in analysis, tend to dominate, as seen in earlier panels of Fig. 4 as well as in (c). Hence, methods that test for a between-group difference will readily detect it when it is expressed in taxa with relatively high variance. But if the effect is expressed only in less-variable taxa, the analysis method may not detect a between-group difference at all.

Generalised linear models model the mean-variance relationship

Fortunately, there is a solution to the problem of accounting for mean-variance relationships: generalised linear models (GLM; McCullagh & Nelder 1989) and their multivariate extensions (Warton 2011). GLM and a related method, negative binomial regression, extend linear regression to the situation where data follow a known mean-variance function. This is precisely the situation encountered here. These methods are standard for analysing univariate count data and strongly advocated in a recent issue of this journal (O'Hara & Kotze

2010), hence a natural starting point for the development of methods for analysing multivariate abundance data. Negative binomial regression in particular is often appropriate for count data (O'Hara & Kotze 2010), because the mean-variance function of counts is usually closer to quadratic than to linear (Warton 2005). GLMs can readily address the problems identified in Fig. 4b,c, and software to address these problems is freely available in the mvabund package for R (Wang et al. 2010).

The problem of identifying taxa that most strongly express a between-group difference (Fig. 4b) is relatively simple to solve, being essentially a univariate problem. For the example data set of Fig. 4b, diagnostic plots suggested that negative binomial regression was appropriate (Fig. 6a). Hence, we propose fitting separate univariate negative binomial regressions to each taxon and using the likelihood ratio statistic as a measure of strength of between-group effect. This method performs well (Fig. 4b), correctly detecting the five effect taxa 90-95% of the time, in stark contrast to SIMPER. Further, the GLM framework offers the potential for substantially greater functionality than SIMPER: the mvabund package already enables application to any fixed effects design (including continuous environmental variables as well as factors), and multiple testing can be applied to assess the statistical significance of the effect observed in any given taxon.

The problem of testing for a difference in mean abundance between a set of multivariate samples (Fig. 4c) can be addressed using the simple approach of fitting separate models to each taxon, then summing the test statistics, and using resampling to assess the significance of this multivariate test statistic (Warton 2011).
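The three steps just described (one model per taxon, sum the statistics, resample) can be sketched as follows. This is a minimal illustration, not the mvabund implementation: it is written in Python with hypothetical function names and made-up data, and it uses a Poisson likelihood ratio statistic as a stand-in for the negative binomial fit, only because the Poisson statistic has a closed form. For overdispersed data the Poisson assumption is itself a mean-variance misspecification, so a negative binomial fit would be the appropriate choice in practice.

```python
import math
import random

def poisson_lr(counts, labels):
    """Likelihood ratio statistic for a between-group difference in one taxon.

    Assumption: a Poisson model stands in for the negative binomial fit
    discussed in the text, so that the statistic has a closed form:
    2 * sum over groups of (group total) * log(group mean / pooled mean).
    """
    pooled_mean = sum(counts) / len(counts)
    stat = 0.0
    for g in set(labels):
        group = [y for y, lab in zip(counts, labels) if lab == g]
        total = sum(group)
        if total > 0:  # a group with zero total contributes nothing
            stat += 2.0 * total * math.log((total / len(group)) / pooled_mean)
    return stat

def sum_of_lr(abund, labels):
    """Sum the per-taxon statistics; abund is a list of samples (rows)."""
    n_taxa = len(abund[0])
    return sum(poisson_lr([row[j] for row in abund], labels)
               for j in range(n_taxa))

def permutation_p_value(abund, labels, n_perm=999, seed=1):
    """Permute group labels across samples, keeping each sample's row intact."""
    rng = random.Random(seed)
    observed = sum_of_lr(abund, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        # Small tolerance guards against floating-point ties.
        if sum_of_lr(abund, shuffled) >= observed - 1e-9:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)

# Hypothetical data: 8 samples x 3 taxa; only the first (rare) taxon differs.
abund = [[0, 50, 7], [1, 80, 9], [0, 65, 8], [1, 70, 6],
         [5, 60, 8], [6, 75, 7], [4, 55, 9], [7, 85, 6]]
labels = ["control"] * 4 + ["treatment"] * 4

observed, p_value = permutation_p_value(abund, labels)
print(f"sum-of-LR = {observed:.2f}, permutation P = {p_value:.3f}")
```

Permuting whole rows preserves any correlation between taxa within a sample, which is why the per-taxon statistics can simply be summed without modelling that correlation explicitly.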
This sum-of-LR statistic is analogous to the sum-of-F approach based on analysis of variance (Edgington & Onghena 2007), which was previously shown to have better power properties than distance-based analyses (Warton & Hudson 2004). The key advantage of the sum-of-LR approach, however, is that we can correctly model the mean-variance relationship, leading to better power properties than sum-of-F statistics when sampling is unbalanced (Warton 2008b). The effectiveness of the approach is demonstrated in Fig. 4c, where, compared with PERMANOVA, power of the sum-of-LR statistic was usually much higher and tended to vary much less as a function of the variance of the effect taxon. See Warton (2011) for further simulations demonstrating power advantages. As previously, the method is implemented in the mvabund package.

For multisample designs such as that considered in Fig. 4c, permutation tests can be used to construct exact tests irrespective of the actual distribution of the data, as is the case for PERMANOVA. The method can also be used for the more general problem of testing for an association between any (sub)set of environmental variables and multivariate abundance, although resampling only ensures approximately valid tests in this case.

The mvabund package is easy to use for the purposes described in Fig. 4b,c. The code required to fit the model and assess the significance of multivariate and univariate terms takes just two lines:

    ft = manyglm(abund ~ treatment, family = "negative.binomial")
    results = anova(ft, resamp = "perm.resid", p.uni = "adjusted")

This code assumes that abundance data are stored in the workspace as a matrix abund and the treatment factor is stored as treatment. It will then return the multivariate and univariate results in the object results.

While well-developed solutions are available for the two problems of Fig. 4b,c, GLM solutions to the ordination problem of Fig.
4a are in earlier stages of development. A flexible method of constrained ordination, where the ordination axes are functions of environmental variables, has been proposed (Yee 2006), and software is currently available in the R package VGAM (Yee 2010), which can handle the most common forms of mean-variance relationship that are observed in practice. Methods for unconstrained ordination along the lines of Fig. 4a also exist; in particular, generalised latent variable models have potential (Skrondal & Rabe-Hesketh 2004). However, we are yet to find freely available software, and its application in ecology is yet to be evaluated.

MEAN-VARIANCE ASSUMPTION CHECKING

It is critical when fitting any model to check the assumptions of that model: if a GLM method is used that makes inappropriate mean-variance assumptions, it will suffer the same shortcomings as distance-based approaches. Negative binomial regression makes two key assumptions, both of which can be readily checked by looking for no pattern on a residual vs. fits plot (Fahrmeir & Tutz 2001), as in Fig. 6.

The first assumption is that the mean count for the jth taxon in the ith site, $\mu_{ij}$, is a log-linear function of K predictor variables $x_{ik}$:

$$\log(\mu_{ij}) = \beta_{0j} + \sum_{k=1}^{K} x_{ik}\beta_{kj}$$

This assumption is only meaningful for models with interaction terms or where some of the predictor variables $x_{ik}$ are quantitative; otherwise, as for Fig. 6, all terms in the model are indicator variables, and the assumption is guaranteed to be satisfied.

The second assumption is that the variance of the count for the jth taxon and the ith site, $\mathrm{Var}(Y_{ij})$, is the following quadratic function of its mean:

$$\mathrm{Var}(Y_{ij}) = \mu_{ij} + \phi_j \mu_{ij}^2$$

If this assumption is violated, then a trend will be seen on a residual vs. fits plot, where the spread of the residuals tends to change as fitted values change, as in Fig. 6b. A quasi-Poisson model was used for Fig. 6b, which assumed that the variance was proportional to the mean.
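The link between a misspecified variance function and a fan-shaped residual plot follows directly from the two variance functions above. As a rough sketch (in Python, with hypothetical function names and an arbitrary overdispersion value), suppose counts truly have the quadratic variance but Pearson residuals are scaled by an assumed variance function; the standard deviation of those residuals is then the square root of the ratio of true to assumed variance.

```python
import math

def nb_variance(mu, phi):
    """Quadratic (negative binomial) variance function: mu + phi * mu^2."""
    return mu + phi * mu ** 2

def pearson_spread(mu, phi_true, assumed):
    """Standard deviation of Pearson residuals at mean mu when counts truly
    have the negative binomial variance but are scaled by an assumed
    variance function."""
    true_var = nb_variance(mu, phi_true)
    if assumed == "negative binomial":
        assumed_var = nb_variance(mu, phi_true)
    elif assumed == "quasi-poisson":
        # Variance proportional to the mean. The constant is set to 1 here,
        # an arbitrary choice that rescales every residual equally.
        assumed_var = 1.0 * mu
    else:
        raise ValueError(f"unknown variance assumption: {assumed}")
    return math.sqrt(true_var / assumed_var)

phi = 0.5  # hypothetical overdispersion parameter
for mu in (1, 10, 100, 1000):
    nb = pearson_spread(mu, phi, "negative binomial")
    qp = pearson_spread(mu, phi, "quasi-poisson")
    print(f"mean {mu:>4}: NB spread {nb:.2f}, quasi-Poisson spread {qp:.2f}")
```

Under the correct quadratic assumption the spread stays at one for every fitted value, whereas under the linear assumption it grows as the square root of $1 + \phi\mu$, which is the widening pattern a residual vs. fits plot would reveal.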
The fan shape in Fig. 6b suggests that a different mean-variance relationship should be used, and the weaker pattern in Fig. 6a suggests that negative binomial regression is a more reasonable choice. A strength of the GLM framework is

that many different options for mean-variance relationships are available; you can even construct your own via quasi-likelihood (McCullagh & Nelder 1989).

An issue to consider when constructing residual plots is that there is no unique definition of residuals for GLM and negative binomial regression, although for the purposes described here (adjusted) Pearson residuals should suffice (Fahrmeir & Tutz 2001), as given by default in the mvabund package. It is also usual to plot fitted values on the linear predictor scale; hence, $\log(\mu_{ij})$ appears on the x-axis in Fig. 6. It should also be noted that when dealing with overdispersed counts, the residuals will usually be strongly right-skewed (as in Fig. 6); hence, they should not be expected to be symmetric about zero.

Fig. 6. Residual vs. fitted value plots to check mean-variance assumptions for the data set from Fig. 4b when analysed using (a) negative binomial regression, with a quadratic mean-variance assumption; (b) quasi-Poisson regression, with a linear mean-variance assumption. Note the fan shape in (b), suggesting that linearity is not satisfied. [Both panels plot Pearson residuals against fitted values on a log scale.]

Other solutions?

In this section, we briefly review some other ways forward that one might have considered, apart from using a generalised linear modelling approach. However, it will be seen that there are problems with each of these methods, and where applicable, we recommend the use of GLM approaches instead.

DATA TRANSFORMATION

The analyses presented in Fig. 4 involved untransformed data, but most analysts of multivariate abundances transform their data in some way prior to analysis. Transforming abundance data in ecology has been criticised (McArdle & Anderson 2004; O'Hara & Kotze 2010), and we will explore some of the key problems as they apply here.
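One such problem can be previewed with a first-order Taylor (delta method) approximation, $\mathrm{Var}[g(Y)] \approx g'(\mu)^2\,\mathrm{Var}(Y)$. The Python sketch below applies it to $g(y) = \log(y+1)$ under the assumption that counts have the quadratic variance $\mu + \phi\mu^2$; the function name and the value of $\phi$ are hypothetical, and the approximation is crude for very small means, so the numbers are indicative only.

```python
def delta_method_variance_log1p(mu, phi):
    """First-order approximation to Var[log(Y + 1)] for a count Y with
    mean mu and variance mu + phi * mu^2 (negative binomial form).

    With g(y) = log(y + 1), g'(mu) = 1 / (mu + 1), so the approximation
    is g'(mu)^2 * Var(Y) = (mu + phi * mu^2) / (mu + 1)^2.
    """
    return (mu + phi * mu ** 2) / (mu + 1) ** 2

phi = 1.0  # hypothetical overdispersion parameter
for mu in (0.01, 0.1, 1, 10, 100, 1000):
    approx = delta_method_variance_log1p(mu, phi)
    print(f"mean {mu:>7}: approx. variance after log(x+1) = {approx:.4f}")
```

The approximate variance levels off near $\phi$ for abundant taxa but still shrinks towards zero as the mean approaches zero, so under these assumptions a mean-variance trend remains among the rarer taxa after transformation.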
The effects of transformation on mean-variance relationships can be approximated via the delta method (Yamamura 1999). A summary of how the effects of transformation can be understood via the delta method is available in supplementary material (Appendix S3).

It is important to recognise that while an appropriately chosen transformation may reduce the extent of a mean-variance trend in multivariate abundance data, it is unlikely to completely remove the trend, especially if the data set contains rare species. To see this, note that when a taxon is completely absent, its variance (and its mean) are zero. No matter how data are transformed, the variance will remain at zero. Similarly, a rare taxon with a mean and variance very close to zero will have a transformed variance very close to zero. This point is illustrated in Fig. 1b, where the log(x+1) transformation reduces the trend in typical mean-variance relationships for more abundant species, but a noticeable trend is still apparent for rarer species.

Because transformation will rarely remove a mean-variance trend in multivariate abundance data, the confounding of location and dispersion effects previously seen for distance-based analyses is still apparent for transformed data. Figure 4 has been reproduced for log(x+1)-transformed data in Fig. 7. This was the best-performing of the transformations considered [$x^{1/2}$, $x^{1/4}$ and log(x+1)], but all the consequences of misspecifying the mean-variance relationship demonstrated for untransformed data still apply after transformation. The log(x+1) transformation reduced the extent of each of these consequences, but all three consequences were still concerningly evident (Fig. 7).

STANDARDISING ABUNDANCES IN EACH TAXON

An alternative solution is to use a method of analysis that applies a standardisation to abundances in each taxon during analysis.
An appropriate standardisation would use within-group (error) variance; note that this is different from standardising taxa prior to analysis, which can produce very poor results (Warton & Hudson 2004). Standardising prior to analysis, i.e. standardising using total variance, would standardise as a function not just of within-group variance but also of effect size. This reduces power (Warton & Hudson 2004), because the taxa in which the strongest effects are observed are the ones that tend to be most heavily down-weighted in standardisation.
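The down-weighting argument can be made concrete with a toy comparison of the two standardisations. The sketch below is illustrative only: it is written in Python, the function name and data are hypothetical, and population variances are used so that total variance splits exactly into within-group and between-group parts when groups are of equal size.

```python
from statistics import mean, pvariance

def standardised_effects(group_a, group_b):
    """Between-group mean difference for one taxon, scaled two ways.

    Returns (difference / within-group SD, difference / total SD).
    """
    diff = mean(group_b) - mean(group_a)
    # Within-group (error) variance: pooled across the two groups.
    n_a, n_b = len(group_a), len(group_b)
    within = (pvariance(group_a) * n_a + pvariance(group_b) * n_b) / (n_a + n_b)
    # Total variance ignores group labels, so it also absorbs the effect.
    total = pvariance(group_a + group_b)
    return diff / within ** 0.5, diff / total ** 0.5

# Hypothetical taxa with the same within-group variance but different effects.
strong_within, strong_total = standardised_effects([1, 2, 3], [11, 12, 13])
weak_within, weak_total = standardised_effects([1, 2, 3], [2, 3, 4])

print(f"strong effect: within-scaled {strong_within:.2f}, total-scaled {strong_total:.2f}")
print(f"weak effect:   within-scaled {weak_within:.2f}, total-scaled {weak_total:.2f}")
```

In this example, scaling by total variance shrinks the strong effect far more (proportionally) than the weak one, because the effect itself inflates the variance used for standardisation.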

Fig. 7. Three consequences of violations of implicit mean-variance assumptions, as in Fig. 4, where distance-based analyses have been recalculated for log(x+1)-transformed abundance. Note that all consequences observed for untransformed data in Fig. 4 are still apparent under transformation, although sometimes to a lesser extent. [Panels: MDS plot of transformed coral data and MDS of data simulated with a location effect (1981 and 1983 observations); 'between-group contribution' for 20 taxa (LR statistic and SIMPER scores against error variance) and simulation with five effect taxa (% of time in top five against error variance, log scale); (c) testing for a group effect in 13 data sets (ratio of PERMANOVA to sum-of-LR P-values against % error variance due to effect taxa) and power simulations with one effect taxon (power against error variance of effect taxon, log scale).]

The within-group standardisation approach is demonstrated in Appendix S2 via the use of analysis of variance (ANOVA). ANOVA applied separately to each taxon will standardise estimates of between-group effects using the within-group variance, before combining or comparing across taxa. This can be quite effective, although in Appendix S2, ANOVA did not perform as well as negative binomial regression. While ANOVA assumes different variances across different taxa, it does assume equal variance of all observations within each taxon, which can be especially problematic when sampling is unbalanced (Romano 1990; Warton 2008b).

An alternative form of within-group standardisation was proposed in Clarke et al.
(2006), where taxa were weighted according to their level of overdispersion relative to the Poisson distribution, and these reweighted variables were used in distance-based analyses. However, a more direct approach would have been to fit a form of overdispersed count model to the data directly, as proposed in the previous section, rather than using theory for overdispersed counts to derive weights for use in a different analysis.

One barrier to within-group standardisation in distance-based analyses is computational efficiency. To obtain a valid test, the standardisation would need to be recalculated for each permutation, which in turn means that the whole matrix of pairwise distances would need to be recalculated also. The increase in computation time could be substantial.

(CANONICAL) CORRESPONDENCE ANALYSIS

When not analysing multivariate data using a distance-based approach, the most common approach in ecology is currently to use CCA (ter Braak 1986) or related methods such as RDA (van den Wollenberg 1977) and its generalisations for partitioning beta-diversity (Legendre et al. 2005). Unfortunately, like distance-based approaches, these methods tend to make implicit mean-variance assumptions that appear to be out-of-step with what tends to be observed in practice, with similar consequences to those observed in Fig. 4.

Redundancy analysis is appropriate for data whose variance is constant for all taxa (ter Braak & Prentice 1988), i.e. when data have a slope of zero on a mean-variance plot. In fact, RDA can be understood as a distance-based approach, where Euclidean distances have been used for analysis (McArdle & Anderson 2001). Irrespective of data transformation, the constant variance assumption is unlikely to be satisfied, especially if rare species are included in analyses (Fig. 1).

ter Braak (1986) proposed CCA as an approximation to Gaussian ordination of Poisson data, i.e. data with variance proportional to the mean (a slope of one on a mean-variance plot). But multivariate abundance data are typically overdispersed relative to the Poisson (with a slope steeper than one in Fig. 1), so CCA and related approaches make mean-variance assumptions that are typically inappropriate, with important consequences.

In Appendix S2, simulation is used to show that RDA, CCA and related approaches confound location and dispersion effects in the same way that distance-based approaches do. Misleading ordinations as in Fig. 4a are demonstrated for principal components analysis and correspondence analysis, which are the unconstrained ordination analogues of RDA and CCA (ter Braak & Prentice 1988). Poor properties as in Fig. 4c are demonstrated via simulation for RDA and CCA.
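The slopes referred to above can be computed directly from the corresponding variance functions. The following Python sketch is illustrative only: the dictionary labels, the function name and the overdispersion value of 1 are assumptions made for the example, and the slope is measured between two arbitrary mean values on a log-log scale.

```python
import math

# Variance as a function of the mean under each implicit assumption.
VARIANCE_FUNCTIONS = {
    "constant (RDA)": lambda mu: 1.0,
    "proportional to mean (CCA)": lambda mu: mu,
    # Quadratic form with a hypothetical overdispersion parameter of 1.
    "negative binomial": lambda mu: mu + 1.0 * mu ** 2,
}

def loglog_slope(variance, mu_low, mu_high):
    """Slope of log(variance) against log(mean) between two mean values."""
    rise = math.log(variance(mu_high)) - math.log(variance(mu_low))
    run = math.log(mu_high) - math.log(mu_low)
    return rise / run

slopes = {name: loglog_slope(v, 10, 1000)
          for name, v in VARIANCE_FUNCTIONS.items()}
for name, slope in slopes.items():
    print(f"{name}: slope {slope:.2f}")
```

For abundant taxa the quadratic variance function gives a slope approaching two, well above the zero and one implied by the constant and proportional assumptions respectively.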
NEW DISTANCE METRICS

The above-mentioned problems with distance-based analysis could be argued to arise as a result of incorrect choice of distance metric, rather than incorrect choice of analysis technique. Hence, another way forward is to use distance metrics that scale variables appropriately according to their mean-variance assumptions. Unfortunately, no known distance metrics currently achieve this, but it is the subject of ongoing research (M. J. Anderson, pers. comm.).

Discussion

All methods of multivariate analysis make some implicit decisions about how to combine data from different variables and, in particular, how to scale the variation in one taxon compared with others. It has been shown that distance-based approaches, as they are typically used, do not make appropriate scaling decisions, leading at times to serious misinterpretation of results and failure to detect very strong patterns in data. It has also been shown that while transformation and standardisation are not good solutions to the problem, using a generalised linear model (GLM) is: it explicitly models the mean-variance relationship, and free software has been developed specifically for the application of GLMs in multivariate analysis in ecology (Wang et al. 2010; Yee 2010).

The issue of location-dispersion confounding identified in this article does not just affect methods designed for comparing two samples, but also other multivariate methods applicable for more general study designs (e.g. Clarke & Ainsworth 1993; Anderson 2001; McArdle & Anderson 2001; Ferrier et al. 2007): when there are more than two groups to compare, several factors (orthogonal or nested), continuous explanatory variables sampled along environmental gradients or when clustering related observations (Legendre & Legendre 1998).
The reason location-dispersion confounding still arises in these other contexts is that the key property of distance-based analyses leading to confounding (using a distance metric that does not reflect the mean-variance relationship of the data) is a property of the abundance data and of how abundance was measured, rather than being a property of the study design. Hence, location-dispersion confounding is a problem to consider whenever multivariate abundances are analysed, rather than being linked to the comparison of two samples or any other specific study design. Fortunately, while the problems discussed in this article apply to data collected in general designs, so do the proposed solutions (Wang et al. 2010; Yee 2010).

While mean-variance confounding has been demonstrated for count data in this study, the issue arises for other methods of collecting multivariate measures of abundance (or occurrence) such as presence/absence data, % cover or semi-quantitative measures of abundance. All measures of abundance imply a mean-variance relationship, and the form of that relationship should be studied and used to inform the method of analysis. This is especially important when rare species are sampled, because rare species have small means and variances (bottom-left of Fig. 1), implying a stronger mean-variance relationship, which typically cannot be removed by transformation.

Recently, research efforts developing new methodology for the analysis of multivariate abundances have most often used a distance-based framework (De'ath 2002; Anderson 2006; Ferrier et al. 2007), where distances are modelled rather than the abundances originally sampled. This study discusses one type of problem that arises when using such a distance-based framework, a problem that can be readily addressed by using a model-based framework in which we directly model the data that have been collected and explicitly model key data properties, such as the mean-variance relationship.
The development of such model-based approaches for the analysis of multivariate abundance data (Warton 2011; Yee 2006) has considerable potential, and we sincerely hope that such approaches to multivariate analysis will receive greater attention in the future.

Acknowledgements

This research was supported under Australian Research Council's Discovery Projects funding scheme (project number DP ). Thanks to those who


More information

Review of Statistics 101

Review of Statistics 101 Review of Statistics 101 We review some important themes from the course 1. Introduction Statistics- Set of methods for collecting/analyzing data (the art and science of learning from data). Provides methods

More information

Discrimination Among Groups. Discrimination Among Groups

Discrimination Among Groups. Discrimination Among Groups Discrimination Among Groups Id Species Canopy Snag Canopy Cover Density Height 1 A 80 1.2 35 2 A 75 0.5 32 3 A 72 2.8 28..... 31 B 35 3.3 15 32 B 75 4.1 25 60 B 15 5.0 3..... 61 C 5 2.1 5 62 C 8 3.4 2

More information

H-LIKELIHOOD ESTIMATION METHOOD FOR VARYING CLUSTERED BINARY MIXED EFFECTS MODEL

H-LIKELIHOOD ESTIMATION METHOOD FOR VARYING CLUSTERED BINARY MIXED EFFECTS MODEL H-LIKELIHOOD ESTIMATION METHOOD FOR VARYING CLUSTERED BINARY MIXED EFFECTS MODEL Intesar N. El-Saeiti Department of Statistics, Faculty of Science, University of Bengahzi-Libya. entesar.el-saeiti@uob.edu.ly

More information

Sample size determination for logistic regression: A simulation study

Sample size determination for logistic regression: A simulation study Sample size determination for logistic regression: A simulation study Stephen Bush School of Mathematical Sciences, University of Technology Sydney, PO Box 123 Broadway NSW 2007, Australia Abstract This

More information

Practical Statistics for the Analytical Scientist Table of Contents

Practical Statistics for the Analytical Scientist Table of Contents Practical Statistics for the Analytical Scientist Table of Contents Chapter 1 Introduction - Choosing the Correct Statistics 1.1 Introduction 1.2 Choosing the Right Statistical Procedures 1.2.1 Planning

More information

Class Notes: Week 8. Probit versus Logit Link Functions and Count Data

Class Notes: Week 8. Probit versus Logit Link Functions and Count Data Ronald Heck Class Notes: Week 8 1 Class Notes: Week 8 Probit versus Logit Link Functions and Count Data This week we ll take up a couple of issues. The first is working with a probit link function. While

More information

MLMED. User Guide. Nicholas J. Rockwood The Ohio State University Beta Version May, 2017

MLMED. User Guide. Nicholas J. Rockwood The Ohio State University Beta Version May, 2017 MLMED User Guide Nicholas J. Rockwood The Ohio State University rockwood.19@osu.edu Beta Version May, 2017 MLmed is a computational macro for SPSS that simplifies the fitting of multilevel mediation and

More information

ECNS 561 Multiple Regression Analysis

ECNS 561 Multiple Regression Analysis ECNS 561 Multiple Regression Analysis Model with Two Independent Variables Consider the following model Crime i = β 0 + β 1 Educ i + β 2 [what else would we like to control for?] + ε i Here, we are taking

More information

Lecture 2: Linear Models. Bruce Walsh lecture notes Seattle SISG -Mixed Model Course version 23 June 2011

Lecture 2: Linear Models. Bruce Walsh lecture notes Seattle SISG -Mixed Model Course version 23 June 2011 Lecture 2: Linear Models Bruce Walsh lecture notes Seattle SISG -Mixed Model Course version 23 June 2011 1 Quick Review of the Major Points The general linear model can be written as y = X! + e y = vector

More information

A Practitioner s Guide to Generalized Linear Models

A Practitioner s Guide to Generalized Linear Models A Practitioners Guide to Generalized Linear Models Background The classical linear models and most of the minimum bias procedures are special cases of generalized linear models (GLMs). GLMs are more technically

More information

Introduction to ordination. Gary Bradfield Botany Dept.

Introduction to ordination. Gary Bradfield Botany Dept. Introduction to ordination Gary Bradfield Botany Dept. Ordination there appears to be no word in English which one can use as an antonym to classification ; I would like to propose the term ordination.

More information

Lecture 2: Diversity, Distances, adonis. Lecture 2: Diversity, Distances, adonis. Alpha- Diversity. Alpha diversity definition(s)

Lecture 2: Diversity, Distances, adonis. Lecture 2: Diversity, Distances, adonis. Alpha- Diversity. Alpha diversity definition(s) Lecture 2: Diversity, Distances, adonis Lecture 2: Diversity, Distances, adonis Diversity - alpha, beta (, gamma) Beta- Diversity in practice: Ecological Distances Unsupervised Learning: Clustering, etc

More information

Linear Regression Models

Linear Regression Models Linear Regression Models Model Description and Model Parameters Modelling is a central theme in these notes. The idea is to develop and continuously improve a library of predictive models for hazards,

More information

DISPLAYING THE POISSON REGRESSION ANALYSIS

DISPLAYING THE POISSON REGRESSION ANALYSIS Chapter 17 Poisson Regression Chapter Table of Contents DISPLAYING THE POISSON REGRESSION ANALYSIS...264 ModelInformation...269 SummaryofFit...269 AnalysisofDeviance...269 TypeIII(Wald)Tests...269 MODIFYING

More information

Multiple Linear Regression

Multiple Linear Regression Multiple Linear Regression University of California, San Diego Instructor: Ery Arias-Castro http://math.ucsd.edu/~eariasca/teaching.html 1 / 42 Passenger car mileage Consider the carmpg dataset taken from

More information

9 Correlation and Regression

9 Correlation and Regression 9 Correlation and Regression SW, Chapter 12. Suppose we select n = 10 persons from the population of college seniors who plan to take the MCAT exam. Each takes the test, is coached, and then retakes the

More information

Subject CS1 Actuarial Statistics 1 Core Principles

Subject CS1 Actuarial Statistics 1 Core Principles Institute of Actuaries of India Subject CS1 Actuarial Statistics 1 Core Principles For 2019 Examinations Aim The aim of the Actuarial Statistics 1 subject is to provide a grounding in mathematical and

More information

Analysis of Multivariate Ecological Data

Analysis of Multivariate Ecological Data Analysis of Multivariate Ecological Data School on Recent Advances in Analysis of Multivariate Ecological Data 24-28 October 2016 Prof. Pierre Legendre Dr. Daniel Borcard Département de sciences biologiques

More information

Generalized Linear Models: An Introduction

Generalized Linear Models: An Introduction Applied Statistics With R Generalized Linear Models: An Introduction John Fox WU Wien May/June 2006 2006 by John Fox Generalized Linear Models: An Introduction 1 A synthesis due to Nelder and Wedderburn,

More information

Unconstrained Ordination

Unconstrained Ordination Unconstrained Ordination Sites Species A Species B Species C Species D Species E 1 0 (1) 5 (1) 1 (1) 10 (4) 10 (4) 2 2 (3) 8 (3) 4 (3) 12 (6) 20 (6) 3 8 (6) 20 (6) 10 (6) 1 (2) 3 (2) 4 4 (5) 11 (5) 8 (5)

More information

Multivariate Statistical Analysis

Multivariate Statistical Analysis Multivariate Statistical Analysis Fall 2011 C. L. Williams, Ph.D. Lecture 3 for Applied Multivariate Analysis Outline 1 Reprise-Vectors, vector lengths and the angle between them 2 3 Partial correlation

More information

CATS regression a model-based approach to study- ing trait-based community assembly

CATS regression a model-based approach to study- ing trait-based community assembly 1 2 CATS regression a model-based approach to study- ing trait-based community assembly 3 David I. Warton, Bill Shipley and Trevor Hastie 4 5 6 School of Mathematics and Statistics and Evolution & Ecology

More information

Chapter 22: Log-linear regression for Poisson counts

Chapter 22: Log-linear regression for Poisson counts Chapter 22: Log-linear regression for Poisson counts Exposure to ionizing radiation is recognized as a cancer risk. In the United States, EPA sets guidelines specifying upper limits on the amount of exposure

More information

A Handbook of Statistical Analyses Using R. Brian S. Everitt and Torsten Hothorn

A Handbook of Statistical Analyses Using R. Brian S. Everitt and Torsten Hothorn A Handbook of Statistical Analyses Using R Brian S. Everitt and Torsten Hothorn CHAPTER 6 Logistic Regression and Generalised Linear Models: Blood Screening, Women s Role in Society, and Colonic Polyps

More information

MSc in Statistics and Operations Research

MSc in Statistics and Operations Research MSc in Statistics and Operations Research Title: Permutation multivariate analysis of variance on real data and simulations to evaluate for robustness against dispersion and unbalancedness. Author: Lucas

More information

Chapter 1 Statistical Inference

Chapter 1 Statistical Inference Chapter 1 Statistical Inference causal inference To infer causality, you need a randomized experiment (or a huge observational study and lots of outside information). inference to populations Generalizations

More information

Visualizing Tests for Equality of Covariance Matrices Supplemental Appendix

Visualizing Tests for Equality of Covariance Matrices Supplemental Appendix Visualizing Tests for Equality of Covariance Matrices Supplemental Appendix Michael Friendly and Matthew Sigal September 18, 2017 Contents Introduction 1 1 Visualizing mean differences: The HE plot framework

More information

Random-effects ordination: describing and predicting multivariate correlations and co-occurrences

Random-effects ordination: describing and predicting multivariate correlations and co-occurrences Ecological Monographs, 81(4), 2011, pp. 635 663 Ó 2011 by the Ecological Society of America Random-effects ordination: describing and predicting multivariate correlations and co-occurrences STEVEN C. WALKER

More information

4/4/2018. Stepwise model fitting. CCA with first three variables only Call: cca(formula = community ~ env1 + env2 + env3, data = envdata)

4/4/2018. Stepwise model fitting. CCA with first three variables only Call: cca(formula = community ~ env1 + env2 + env3, data = envdata) 0 Correlation matrix for ironmental matrix 1 2 3 4 5 6 7 8 9 10 11 12 0.087451 0.113264 0.225049-0.13835 0.338366-0.01485 0.166309-0.11046 0.088327-0.41099-0.19944 1 1 2 0.087451 1 0.13723-0.27979 0.062584

More information

BOOTSTRAPPING WITH MODELS FOR COUNT DATA

BOOTSTRAPPING WITH MODELS FOR COUNT DATA Journal of Biopharmaceutical Statistics, 21: 1164 1176, 2011 Copyright Taylor & Francis Group, LLC ISSN: 1054-3406 print/1520-5711 online DOI: 10.1080/10543406.2011.607748 BOOTSTRAPPING WITH MODELS FOR

More information

Generalized Linear Models I

Generalized Linear Models I Statistics 203: Introduction to Regression and Analysis of Variance Generalized Linear Models I Jonathan Taylor - p. 1/16 Today s class Poisson regression. Residuals for diagnostics. Exponential families.

More information

Chapter 1. Modeling Basics

Chapter 1. Modeling Basics Chapter 1. Modeling Basics What is a model? Model equation and probability distribution Types of model effects Writing models in matrix form Summary 1 What is a statistical model? A model is a mathematical

More information

Testing for homogeneity of multivariate dispersions using dissimilarity measures

Testing for homogeneity of multivariate dispersions using dissimilarity measures Biometrics, 1?? DOI: 000 Testing for homogeneity of multivariate dispersions using dissimilarity measures Irène Gijbels Department of Mathematics and Leuven Statistics Research Center (LStat), Katholieke

More information

12 Generalized linear models

12 Generalized linear models 12 Generalized linear models In this chapter, we combine regression models with other parametric probability models like the binomial and Poisson distributions. Binary responses In many situations, we

More information

LISA Short Course Series Generalized Linear Models (GLMs) & Categorical Data Analysis (CDA) in R. Liang (Sally) Shan Nov. 4, 2014

LISA Short Course Series Generalized Linear Models (GLMs) & Categorical Data Analysis (CDA) in R. Liang (Sally) Shan Nov. 4, 2014 LISA Short Course Series Generalized Linear Models (GLMs) & Categorical Data Analysis (CDA) in R Liang (Sally) Shan Nov. 4, 2014 L Laboratory for Interdisciplinary Statistical Analysis LISA helps VT researchers

More information

Stat/F&W Ecol/Hort 572 Review Points Ané, Spring 2010

Stat/F&W Ecol/Hort 572 Review Points Ané, Spring 2010 1 Linear models Y = Xβ + ɛ with ɛ N (0, σ 2 e) or Y N (Xβ, σ 2 e) where the model matrix X contains the information on predictors and β includes all coefficients (intercept, slope(s) etc.). 1. Number of

More information

Lecture 3: Linear Models. Bruce Walsh lecture notes Uppsala EQG course version 28 Jan 2012

Lecture 3: Linear Models. Bruce Walsh lecture notes Uppsala EQG course version 28 Jan 2012 Lecture 3: Linear Models Bruce Walsh lecture notes Uppsala EQG course version 28 Jan 2012 1 Quick Review of the Major Points The general linear model can be written as y = X! + e y = vector of observed

More information

1 Motivation for Instrumental Variable (IV) Regression

1 Motivation for Instrumental Variable (IV) Regression ECON 370: IV & 2SLS 1 Instrumental Variables Estimation and Two Stage Least Squares Econometric Methods, ECON 370 Let s get back to the thiking in terms of cross sectional (or pooled cross sectional) data

More information

Introduction to General and Generalized Linear Models

Introduction to General and Generalized Linear Models Introduction to General and Generalized Linear Models Generalized Linear Models - part II Henrik Madsen Poul Thyregod Informatics and Mathematical Modelling Technical University of Denmark DK-2800 Kgs.

More information

Generalized Linear Models

Generalized Linear Models Generalized Linear Models Lecture 3. Hypothesis testing. Goodness of Fit. Model diagnostics GLM (Spring, 2018) Lecture 3 1 / 34 Models Let M(X r ) be a model with design matrix X r (with r columns) r n

More information

Poisson regression: Further topics

Poisson regression: Further topics Poisson regression: Further topics April 21 Overdispersion One of the defining characteristics of Poisson regression is its lack of a scale parameter: E(Y ) = Var(Y ), and no parameter is available to

More information

Probability and Statistics

Probability and Statistics Probability and Statistics Kristel Van Steen, PhD 2 Montefiore Institute - Systems and Modeling GIGA - Bioinformatics ULg kristel.vansteen@ulg.ac.be CHAPTER 4: IT IS ALL ABOUT DATA 4a - 1 CHAPTER 4: IT

More information

Do not copy, post, or distribute

Do not copy, post, or distribute 14 CORRELATION ANALYSIS AND LINEAR REGRESSION Assessing the Covariability of Two Quantitative Properties 14.0 LEARNING OBJECTIVES In this chapter, we discuss two related techniques for assessing a possible

More information

Introduction to multivariate analysis Outline

Introduction to multivariate analysis Outline Introduction to multivariate analysis Outline Why do a multivariate analysis Ordination, classification, model fitting Principal component analysis Discriminant analysis, quickly Species presence/absence

More information

Interpreting Regression Results

Interpreting Regression Results Interpreting Regression Results Carlo Favero Favero () Interpreting Regression Results 1 / 42 Interpreting Regression Results Interpreting regression results is not a simple exercise. We propose to split

More information

Generalized Linear. Mixed Models. Methods and Applications. Modern Concepts, Walter W. Stroup. Texts in Statistical Science.

Generalized Linear. Mixed Models. Methods and Applications. Modern Concepts, Walter W. Stroup. Texts in Statistical Science. Texts in Statistical Science Generalized Linear Mixed Models Modern Concepts, Methods and Applications Walter W. Stroup CRC Press Taylor & Francis Croup Boca Raton London New York CRC Press is an imprint

More information

Review of Multiple Regression

Review of Multiple Regression Ronald H. Heck 1 Let s begin with a little review of multiple regression this week. Linear models [e.g., correlation, t-tests, analysis of variance (ANOVA), multiple regression, path analysis, multivariate

More information

Online Resource 2: Why Tobit regression?

Online Resource 2: Why Tobit regression? Online Resource 2: Why Tobit regression? March 8, 2017 Contents 1 Introduction 2 2 Inspect data graphically 3 3 Why is linear regression not good enough? 3 3.1 Model assumptions are not fulfilled.................................

More information

Dissimilarity and transformations. Pierre Legendre Département de sciences biologiques Université de Montréal

Dissimilarity and transformations. Pierre Legendre Département de sciences biologiques Université de Montréal and transformations Pierre Legendre Département de sciences biologiques Université de Montréal http://www.numericalecology.com/ Pierre Legendre 2017 Definitions An association coefficient is a function

More information

Multivariate Analysis of Ecological Data

Multivariate Analysis of Ecological Data Multivariate Analysis of Ecological Data MICHAEL GREENACRE Professor of Statistics at the Pompeu Fabra University in Barcelona, Spain RAUL PRIMICERIO Associate Professor of Ecology, Evolutionary Biology

More information

The Model Building Process Part I: Checking Model Assumptions Best Practice

The Model Building Process Part I: Checking Model Assumptions Best Practice The Model Building Process Part I: Checking Model Assumptions Best Practice Authored by: Sarah Burke, PhD 31 July 2017 The goal of the STAT T&E COE is to assist in developing rigorous, defensible test

More information

Linear, Generalized Linear, and Mixed-Effects Models in R. Linear and Generalized Linear Models in R Topics

Linear, Generalized Linear, and Mixed-Effects Models in R. Linear and Generalized Linear Models in R Topics Linear, Generalized Linear, and Mixed-Effects Models in R John Fox McMaster University ICPSR 2018 John Fox (McMaster University) Statistical Models in R ICPSR 2018 1 / 19 Linear and Generalized Linear

More information

Introduction to Statistical Data Analysis Lecture 8: Correlation and Simple Regression

Introduction to Statistical Data Analysis Lecture 8: Correlation and Simple Regression Introduction to Statistical Data Analysis Lecture 8: and James V. Lambers Department of Mathematics The University of Southern Mississippi James V. Lambers Statistical Data Analysis 1 / 40 Introduction

More information

Correlation and regression

Correlation and regression 1 Correlation and regression Yongjua Laosiritaworn Introductory on Field Epidemiology 6 July 2015, Thailand Data 2 Illustrative data (Doll, 1955) 3 Scatter plot 4 Doll, 1955 5 6 Correlation coefficient,

More information

36-309/749 Experimental Design for Behavioral and Social Sciences. Dec 1, 2015 Lecture 11: Mixed Models (HLMs)

36-309/749 Experimental Design for Behavioral and Social Sciences. Dec 1, 2015 Lecture 11: Mixed Models (HLMs) 36-309/749 Experimental Design for Behavioral and Social Sciences Dec 1, 2015 Lecture 11: Mixed Models (HLMs) Independent Errors Assumption An error is the deviation of an individual observed outcome (DV)

More information

Ordination & PCA. Ordination. Ordination

Ordination & PCA. Ordination. Ordination Ordination & PCA Introduction to Ordination Purpose & types Shepard diagrams Principal Components Analysis (PCA) Properties Computing eigenvalues Computing principal components Biplots Covariance vs. Correlation

More information

A Re-Introduction to General Linear Models (GLM)

A Re-Introduction to General Linear Models (GLM) A Re-Introduction to General Linear Models (GLM) Today s Class: You do know the GLM Estimation (where the numbers in the output come from): From least squares to restricted maximum likelihood (REML) Reviewing

More information

Varieties of Count Data

Varieties of Count Data CHAPTER 1 Varieties of Count Data SOME POINTS OF DISCUSSION What are counts? What are count data? What is a linear statistical model? What is the relationship between a probability distribution function

More information

Review: what is a linear model. Y = β 0 + β 1 X 1 + β 2 X 2 + A model of the following form:

Review: what is a linear model. Y = β 0 + β 1 X 1 + β 2 X 2 + A model of the following form: Outline for today What is a generalized linear model Linear predictors and link functions Example: fit a constant (the proportion) Analysis of deviance table Example: fit dose-response data using logistic

More information

4. Ordination in reduced space

4. Ordination in reduced space Université Laval Analyse multivariable - mars-avril 2008 1 4.1. Generalities 4. Ordination in reduced space Contrary to most clustering techniques, which aim at revealing discontinuities in the data, ordination

More information

2016 VCE Specialist Mathematics 2 examination report

2016 VCE Specialist Mathematics 2 examination report 016 VCE Specialist Mathematics examination report General comments The 016 Specialist Mathematics examination comprised 0 multiple-choice questions (worth a total of 0 marks) and six extended-answer questions

More information

The Model Building Process Part I: Checking Model Assumptions Best Practice (Version 1.1)

The Model Building Process Part I: Checking Model Assumptions Best Practice (Version 1.1) The Model Building Process Part I: Checking Model Assumptions Best Practice (Version 1.1) Authored by: Sarah Burke, PhD Version 1: 31 July 2017 Version 1.1: 24 October 2017 The goal of the STAT T&E COE

More information

y response variable x 1, x 2,, x k -- a set of explanatory variables

y response variable x 1, x 2,, x k -- a set of explanatory variables 11. Multiple Regression and Correlation y response variable x 1, x 2,, x k -- a set of explanatory variables In this chapter, all variables are assumed to be quantitative. Chapters 12-14 show how to incorporate

More information

Linear Regression. In this lecture we will study a particular type of regression model: the linear regression model

Linear Regression. In this lecture we will study a particular type of regression model: the linear regression model 1 Linear Regression 2 Linear Regression In this lecture we will study a particular type of regression model: the linear regression model We will first consider the case of the model with one predictor

More information