Statistical comparison of univariate tests of homogeneity of variances


Submitted to the Journal of Statistical Computation and Simulation

Pierre Legendre* and Daniel Borcard

Département de sciences biologiques, Université de Montréal, C.P. 6128, succursale Centre-ville, Montréal, Québec H3C 3J7, Canada

Abstract

This paper compares the empirical type I error and power of different tests that have been proposed to assess the homogeneity of within-group variances prior to anova. The tests of homogeneity of variances (THV) compared in this study are: Bartlett's test, the Scheffé-Box log-anova test, Cochran's C test and Box's M test, in their parametric and permutational forms. The main questions addressed in the paper are: (1) under what conditions is heterogeneity of variances really a problem in anova, and (2) under these conditions, is any of the THVs useful for the detection of heterogeneity? A preliminary simulation study confirmed that anova is very sensitive to heterogeneity of the variances, even when the data are normally distributed. Any pattern of heteroscedasticity results in inflated type I error, the worst results occurring when one variance is larger than the others. A second study was conducted to find out which tests of homogeneity of variances should be used under extreme conditions (small sample sizes, non-normal distributions). The best overall methods are Bartlett's or Box's tests; even with normally distributed data, one should avoid Cochran's test, which is only sensitive to a single high variance, as well as the log-anova test, because it has low power with small to moderate sample sizes. With non-normal data, Bartlett's and Box's tests can be used if the samples are fairly large. Species abundance-like data should be log-transformed and subjected to parametric or permutational Bartlett's or Box's tests. An Appendix presents a comparison of the Welch-corrected t-test with the parametric and permutational forms of the t-test. The test with Welch correction is useful when the data are normal, sample sizes are small, and the variances are heterogeneous. Otherwise, use the parametric t-test for normal data, or the permutational t-test for skewed data. For heteroscedastic data that cannot be normalized, a nonparametric test should be used.

Keywords: Anova; permutation test; power; simulation study; test of homogeneity of variances; t-test; type I error

Running head: Tests of homogeneity of variances

*Corresponding author. Tel.: (5) 33-75, Fax (5) , Pierre.Legendre@umontreal.ca
Tel.: (5) 33-75, Fax (5) , BorcardD@magellan.umontreal.ca

1. Introduction

Several tests of homogeneity of variances (THV) were proposed in the literature during the 1930s, 1940s and 1950s (Bartlett 1937a, 1937b; Cochran 1941, 1951; Box 1953). Natural selection has left us with only a few that are presented in current textbooks. Authors usually present one or two, making various claims about their robustness (or lack thereof) to departures from normality. Many authors claim that a test of homogeneity of variances is a prerequisite to analysis of variance. Others, like Zar (1999), contend that the tests presently available perform so poorly that they are not really useful, anova being more robust to departures from homoscedasticity than can be detected using a test of homogeneity of variances, especially under conditions of non-normality. Underwood (1997) reminds us that the analysis of variance presents problems with heterogeneity in balanced samples only when one of the variances is markedly larger than the others; it is not especially sensitive to non-normality of the data, which badly affects most of the classical tests for homogeneity of variances.

In analysis of variance, the Behrens-Fisher problem is that of comparing means of samples drawn from normal populations without assuming equal variances; valid solutions to the Behrens-Fisher problem exist for two groups and one variable (t-test) but not, to our knowledge, for several groups or for the multivariate case. The main question addressed in this paper is thus twofold: (1) under what conditions is heterogeneity of variances really a problem in anova, and (2) under these conditions, is any of the tests of homogeneity of variances useful for the detection of heterogeneity?

With the advent of microcomputers and the availability of ever more powerful machines, permutation tests have gained in popularity during the past 20 years. This type of test is known to relax the conditions of normality often associated with parametric statistical tests and to preserve correct type I error independently of the distributions of the variables under study. Is this also the case with the permutational forms of the tests of homogeneity of variances? A simulation study was undertaken to (3) verify this hypothesis, (4) find out which tests of homogeneity should be used under extreme conditions (small sample sizes, non-normal distributions), and (5) determine which one(s) should be described in textbooks and taught in introductory courses of statistics.

2. Methods

2.1. THV statistics

The tests under study are those found in various textbooks. We will use the following symbols in the descriptions of the statistics: $k$ = number of groups, $n_j$ = number of observations within group $j$, $n$ = total number of observations, $s_p^2$ = pooled within-group variance, SSW = sum of within-group sums-of-squares.

1. Bartlett's test is the one most often presented in textbooks and taught in introductory courses because of its ease of computation. The test statistic B involves a comparison of the separate within-group sums-of-squares to the pooled within-group sum-of-squares:

$$B = (n - k)\ln s_p^2 - \sum_{j=1}^{k} (n_j - 1)\ln s_j^2 \quad \text{where} \quad s_p^2 = \frac{SSW}{n - k}. \tag{1}$$

A correction factor $C_B$ is computed as:

$$C_B = 1 + \frac{1}{3(k - 1)}\left[\sum_{j=1}^{k}\frac{1}{n_j - 1} - \frac{1}{n - k}\right] \tag{2}$$

and applied to B to obtain the corrected $B_C$ statistic:

$$B_C = B / C_B. \tag{3}$$

$B_C$ has an asymptotic $\chi^2_{(k-1)}$ distribution. Bartlett's test is known to be powerful if the sampled populations are normal, but badly affected by non-normality (Box 1953, Zar 1999).

2. The Scheffé-Box log-anova test (Martin and Games 1977) is based on papers by Box (1953) and Scheffé (1959). In this test, one first divides the observations of each group at random among a number of subgroups and computes the log of the variance of each subgroup. The test uses an F-type statistic which compares the among-group mean-squares to the within-group mean-squares of the logarithms of the subgroup variances instead of the raw data:

$$F = \frac{SS_{among}/(k - 1)}{SS_{within}\Big/\sum_{j=1}^{k}(m_j - 1)}; \tag{4}$$

$m_j$ is the number of subgroups within group $j$; $m_j$ is approximately equal to $\sqrt{n_j}$, where $n_j$ is the number of observations within group $j$. Computation of the test statistic is described in detail in Sokal and Rohlf (1995) and in Scherrer (1984), for example. The log-anova F-statistic is tested against a critical value of F with the degrees of freedom of the numerator and denominator of eq. 4. The log-anova test is said to be less sensitive to departures from normality than Bartlett's test (Sokal and Rohlf 1995).

When the number of observations is small, the results of the log-anova test may be quite unstable. The reason is the following: in this test, one first assigns at random the observations of each group to subgroups, as described above. The variance of each subgroup is computed and log-transformed, before being used to compute an F-statistic based on these log(variance) values. The number of subgroups is approximately equal to the square root of the number of observations in a group; this number, which determines the number of log(variance) values representing each group in the analysis, is very small if $n_j$, the number of observations in group $j$, is small. Moreover, if $n_j$ is small, a different random assignment of the objects of a group to the subgroups may result in a very different set of subgroup variances. This is why the results of the log-anova test may vary greatly for different assignments of the objects to the subgroups, especially for small $n$.

3. Cochran's C test statistic (Cochran 1941, 1951) is:

$$C = \frac{s^2_{largest}}{\sum_{j=1}^{k} s_j^2}. \tag{5}$$

Tables of critical values for some combinations of degrees of freedom (ν = 1 to 10, 16, 36 and 144) and number of groups (2 to 10, 15, 20) have been reproduced by different authors (e.g., Winer 1971) from the table originally published by Eisenhart et al. (1947). The degrees of freedom (ν) are: ν = max($n_j$) − 1, where $n_j$ is the number of observations in group $j$. Professor A. J. Underwood (pers. comm.) suggested that Cochran's test may be the best method to detect cases where the variance of one of the groups is much larger than that of the other groups. This is a situation where the analysis of variance of balanced samples is known to present problems (Underwood 1997).

4. Box's M statistic provides a test of multivariate dispersion which can be used with a single variable ($p$ = 1, where $p$ designates the number of variables):

$$\text{Box's } M = (n - k)\ln|\mathbf{S}| - \sum_{j=1}^{k}(n_j - 1)\ln|\mathbf{S}_j| \tag{6}$$

where $\mathbf{S}$ is the pooled within-group dispersion matrix and $\mathbf{S}_j$ is the dispersion matrix for group $j$. For univariate data, the M and B statistics are identical. A correction factor, different from that of Bartlett's test, is used with the M statistic to turn it into a statistic distributed like chi-square:

$$C_M = 1 - \frac{2p^2 + 3p - 1}{6(p + 1)(k - 1)}\left[\sum_{j=1}^{k}\frac{1}{n_j - 1} - \frac{1}{n - k}\right] \tag{7}$$

and

$$M_C = M \, C_M. \tag{8}$$
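The four statistics can be sketched in a few lines of Python, which may help when checking hand computations. This is only an illustrative sketch using the standard library, not the authors' Fortran program: the function names are hypothetical, Box's M is coded for the univariate case ($p$ = 1) only, and the log-anova F is computed as an unweighted one-way anova on the log subgroup variances, which may differ in detail from the procedure in Sokal and Rohlf. Each group is assumed to hold at least four observations so that every subgroup has a variance.

```python
import math
import random
from statistics import mean, variance


def _correction_term(groups):
    """Sum of 1/(n_j - 1) minus 1/(n - k), shared by C_B and C_M."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    return sum(1 / (len(g) - 1) for g in groups) - 1 / (n - k)


def bartlett_b(groups):
    """Uncorrected Bartlett B; for a single variable it equals Box's M."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    s2 = [variance(g) for g in groups]  # within-group variances s_j^2
    ssw = sum((len(g) - 1) * v for g, v in zip(groups, s2))
    s2_pooled = ssw / (n - k)
    return (n - k) * math.log(s2_pooled) - sum(
        (len(g) - 1) * math.log(v) for g, v in zip(groups, s2)
    )


def bartlett_corrected(groups):
    """B_C = B / C_B, referred to chi-square with k - 1 degrees of freedom."""
    c_b = 1 + _correction_term(groups) / (3 * (len(groups) - 1))
    return bartlett_b(groups) / c_b


def box_m_corrected(groups):
    """M_C = M * C_M for univariate data (p = 1 is hard-coded here)."""
    p, k = 1, len(groups)
    factor = (2 * p**2 + 3 * p - 1) / (6 * (p + 1) * (k - 1))
    return bartlett_b(groups) * (1 - factor * _correction_term(groups))


def cochran_c(groups):
    """Largest within-group variance divided by the sum of the variances."""
    s2 = [variance(g) for g in groups]
    return max(s2) / sum(s2)


def log_anova_f(groups, rng):
    """F on the logs of the variances of randomly formed subgroups."""
    logs = []
    for g in groups:
        m = max(2, round(math.sqrt(len(g))))  # m_j, about sqrt(n_j)
        shuffled = list(g)
        rng.shuffle(shuffled)  # random assignment of objects to subgroups
        logs.append([math.log(variance(shuffled[i::m])) for i in range(m)])
    k = len(logs)
    grand = mean([x for sub in logs for x in sub])
    ss_among = sum(len(sub) * (mean(sub) - grand) ** 2 for sub in logs)
    ss_within = sum((x - mean(sub)) ** 2 for sub in logs for x in sub)
    df_within = sum(len(sub) - 1 for sub in logs)
    return (ss_among / (k - 1)) / (ss_within / df_within), (k - 1, df_within)
```

Calling `log_anova_f` repeatedly on the same data with different random generators gives a direct feel for the instability described above: with ten observations per group, each group is represented by only three log(variance) values.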

After applying the correction factor $C_M$, the corrected $M_C$ statistic has an asymptotic $\chi^2$ distribution with $p(p + 1)(k - 1)/2$ degrees of freedom.

Hartley's (1950) test, which uses the ratio of the largest to the smallest variances (and thus resembles Cochran's C test with a less optimal use of the information available), and Kullback's (1959) test of homogeneity of variance-covariance matrices, which is largely similar to Box's M, were not included in the present simulation study.

All these statistics can also be tested by the method of permutations. From the descriptions above, readers will appreciate the fact that these statistics use only the within-group dispersion portions of the data. This indicates that an appropriate permutation procedure would require the groups to be centered on a common mean, before permutations, in order for the test to produce realizations of a null hypothesis related to within-group dispersions only. Without this precaution, imagine what would happen in the case of groups differing in their means: permutation of the data across groups would create pseudo-groups with larger variances than the original groups. After centering the groups on a common mean (e.g. the origin), the null hypothesis under test is ($H_0$) the exchangeability of objects among the groups after the differences in positions of the means have been removed.

2.2. Simulation methods

Two computer programs were written in Fortran 77 to carry out the simulations: one for anova and one for the tests of homogeneity of variances. The programs were designed to read a file of parameters describing the characteristics of a stack of simulation problems, and run them in sequence. Random data with the proper characteristics were generated within the program and transformed as required, and the various forms of tests (parametric and permutational) were computed. Permutations of the simulated data, where needed, were done using a uniform random generation algorithm sensu Furnas (1984).
For each problem, the program wrote one line of results to an output file. This line contained the simulation parameters of the problem, the rate of rejection of the null hypothesis for each test after the stated number of simulations, and the 95% confidence interval of this rejection rate. The output files were assembled into a data base which was used to produce summary tables as well as the figures presented in this paper.

The simulation program for anova computed a parametric and a permutational anova for each simulated data set. In the case of two groups only, a t-test with Welch correction was also computed. The simulation program for the tests of homogeneity of variances computed the four statistics described in Section 2.1 and tested each one parametrically and permutationally.

Data were generated with the following distributions: (1) random normal deviates with specified mean and variance, (2) random power deviates, i.e., $y' = \text{base}^y$ where base was chosen to have the value. and $y$ was a random normal deviate with specified mean and variance, and (3) truncated random power deviates, i.e., power deviates as in (2) where the negative values were truncated to zero and the positive values were rounded to the nearest integer, to simulate species abundance data which are encountered in ecological data analysis, or other types of frequency data as found in other fields of application. With option (2), a log-transformation of random power deviates restores normality. In contrast, with option (3), log transformation of truncated random power deviates does not necessarily produce normally distributed data.

Random lognormal deviates were not used in the simulations because the distributions were too variable among simulated data sets, due to the appearance of extreme outlier values; we would have needed a very large number of simulations to obtain reasonable confidence intervals for the rejection rates of the tests. Random power data with base. were used instead. The problem of outlier values was minimal, so that we obtained reasonable confidence intervals after 5000 simulations.

In the study of type I error, data were generated in such a way that the null hypothesis was true ($H_0$ in anova: the population means are equal; $H_0$ in THV: the population variances are equal). The rate of type I error was computed as the proportion of the simulations where the null hypothesis was rejected at the α = 5% significance level. In the power study, data were simulated in such a way that the null hypothesis was false. Power was computed as the proportion of the simulations where the null hypothesis was rejected at the α = 5% significance level.

2.3. Simulation setup

In each simulation problem, the rate of rejection of the null hypothesis (at α = 5%) was computed after 5000 independent simulations involving the desired number of observations from the selected type of distribution.
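The three data-generation options and the rejection-rate summary can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the authors' Fortran code: the function names are hypothetical, `base` is left as a required argument because its value is not legible here, the truncated generator is one literal reading of the description (rounding to the nearest integer with a floor at zero), and the confidence interval uses the normal approximation to the binomial, which the text does not specify.

```python
import math
import random


def normal_deviates(n, mean, variance, rng):
    """Option (1): normal deviates with the specified mean and variance."""
    sd = math.sqrt(variance)
    return [rng.gauss(mean, sd) for _ in range(n)]


def power_deviates(n, mean, variance, base, rng):
    """Option (2): y' = base ** y, with y a normal deviate."""
    return [base ** y for y in normal_deviates(n, mean, variance, rng)]


def truncated_power_deviates(n, mean, variance, base, rng):
    """Option (3): power deviates rounded to integers, never below zero."""
    return [max(0, round(v)) for v in power_deviates(n, mean, variance, base, rng)]


def rejection_rate(p_values, alpha=0.05):
    """Proportion of simulations rejecting H0, with an approximate 95% CI.

    The normal-approximation interval is an assumption of this sketch.
    """
    n_sim = len(p_values)
    rate = sum(p <= alpha for p in p_values) / n_sim
    half_width = 1.96 * math.sqrt(rate * (1 - rate) / n_sim)
    return rate, (max(0.0, rate - half_width), min(1.0, rate + half_width))
```

Taking logarithms to the same base recovers the underlying normal deviates for option (2), which is the sense in which log-transformation restores normality; after rounding and truncation (option 3) that inverse no longer exists.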
The permutation tests involved random permutations of the data; following Hope (1968), the reference value of the statistic obtained for the unpermuted data was added to the distribution of values obtained after permutation, before calculating the permutational probability associated with the statistic.

Since we wanted to test all the selected methods under the same conditions, we had to restrict the combinations of number of observations and numbers of groups to those for which tables of critical values exist for Cochran's test. This explains why the simulations were run with 10, 17, 37 and 145 observations in each group. Within one run, the number of observations was equal in all groups. We ran our tests with two or three groups, but only the results involving three groups will be presented here, since the two-group simulations gave essentially similar results and three-group situations are more relevant to anova questions. Some anova results for two groups are presented in the Appendix to assess the Welch correction for two groups with unequal variances.
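The permutational form of a THV, with the centering step of Section 2.1 and Hope's inclusion of the reference value, can be sketched as below. This is a minimal Python illustration, not the authors' program: `permutation_thv` is a hypothetical name, Cochran's C is used only as an example statistic, and the number of permutations is left to the caller. The test is one-tailed in the upper tail, on the assumption that larger values of the statistic indicate more heterogeneity.

```python
import random
from statistics import mean, variance


def cochran_c(groups):
    """Example statistic: largest variance over the sum of the variances."""
    s2 = [variance(g) for g in groups]
    return max(s2) / sum(s2)


def permutation_thv(groups, statistic, n_perm, rng):
    """Permutation test of homogeneity of variances.

    Returns the reference value of the statistic and its permutational
    probability.
    """
    # Centre every group on a common mean (the origin), so that differences
    # in means cannot inflate the variances of the permuted pseudo-groups.
    centred = []
    for g in groups:
        m = mean(g)
        centred.append([x - m for x in g])
    observed = statistic(centred)

    pooled = [x for g in centred for x in g]
    sizes = [len(g) for g in groups]
    n_ge = 1  # the reference value is counted in the distribution (Hope 1968)
    for _ in range(n_perm):
        rng.shuffle(pooled)
        pseudo, start = [], 0
        for size in sizes:
            pseudo.append(pooled[start:start + size])
            start += size
        if statistic(pseudo) >= observed:
            n_ge += 1
    return observed, n_ge / (n_perm + 1)
```

Because the reference value is included, the smallest attainable probability is 1/(n_perm + 1), so the number of permutations bounds the resolution of the test.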

3. Results

3.1. Anova: type I error

Since the simulation results of the anova tests were largely independent of sample size (within our limits: 10 to 145 observations per group), the results shown in Figure 1 are those based on $n_j$ = 10. The population means were equal in the anova simulations for type I error.

3.1.1. Homogeneous variances

Type I error of the parametric anova is correct for normally distributed data (Fig. 1a), but it is slightly too low for the power base. distribution (Fig. 1b). The apparent increase in type I error, in Figure 1a, from variance to 6 to, is a random effect which is not found in the other series of simulations that we have done using $n_j$ = 17, 37 and 145 (results not presented in detail here). Thus, application of a parametric anova requires at least that the distribution be symmetric, a property which can be obtained to a convenient degree by log-transforming the data simulated under the base-. distribution (including its truncated version). As long as the variances are homogeneous, permutational anova yields correct type I error irrespective of the distribution (normal or power base., truncated power base., and log-transformed data computed from the latter; some of these results are not illustrated).

3.1.2. Heterogeneous variances

For normal data, type I error is inflated as soon as one of the variances is higher than the others, but less so when one of the variances is smaller than the others (Fig. 1c). Both parametric and permutational tests suffer from this problem. The problem is worse for the power base. distribution (Fig. 1d); using a permutation test does not correct the problem.

3.2. Anova: power

In the anova power simulations, means differed among groups but variances were the same (5 in the case of the normal distribution, in the case of the power base. distribution). Having shown above that heterogeneous variances alter type I error of anova in most cases, the question of power for data with heterogeneous variances is irrelevant. Our simulation results show that, with homogeneous variances, the power of anova is good for both the normal and power base. distributions, as soon as the contrast between the two extreme means is sufficient (in this study: 0 and 5 for normal data;.6 and 5 for power data; see Figs. 1e and 1f). Parametric and permutational anova have approximately the same power.
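The kind of simulation summarized in Sections 3.1 and 3.2 can be sketched as follows for the permutational anova. This is a small Python illustration under stated assumptions, not a reproduction of the study: the function names are hypothetical, only normal data are generated, the population means are all zero so that the null hypothesis of equal means is true, and the standard deviations, sample size and numbers of simulations and permutations are supplied by the caller. Raw (uncentered) data are permuted here, on the assumption that this is the usual scheme for a test of equality of means.

```python
import random
from statistics import fmean


def f_statistic(groups):
    """One-way anova F: among-group over within-group mean squares."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    means = [fmean(g) for g in groups]
    grand = fmean([x for g in groups for x in g])
    ss_among = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_among / (k - 1)) / (ss_within / (n - k))


def permutational_anova_p(groups, n_perm, rng):
    """Permutational probability of F, reference value included."""
    observed = f_statistic(groups)
    pooled = [x for g in groups for x in g]
    sizes = [len(g) for g in groups]
    n_ge = 1
    for _ in range(n_perm):
        rng.shuffle(pooled)
        pseudo, start = [], 0
        for size in sizes:
            pseudo.append(pooled[start:start + size])
            start += size
        if f_statistic(pseudo) >= observed:
            n_ge += 1
    return n_ge / (n_perm + 1)


def type_i_error_rate(sds, n_per_group, n_sim, n_perm, rng, alpha=0.05):
    """Rejection rate when all population means are equal (H0 true)."""
    rejections = 0
    for _ in range(n_sim):
        groups = [[rng.gauss(0.0, sd) for _ in range(n_per_group)] for sd in sds]
        if permutational_anova_p(groups, n_perm, rng) <= alpha:
            rejections += 1
    return rejections / n_sim
```

Comparing, for instance, `type_i_error_rate([1, 1, 1], ...)` with a call in which one standard deviation is much larger than the others, using enough simulations for a narrow confidence interval, is one way to explore the inflation described above.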

These results confirm that anova, in both its parametric and permutational forms, is very sensitive to heterogeneity of the variances, even when the data are normally distributed. Any pattern of heteroscedasticity results in inflated type I error, the worst results occurring when one variance is larger than the others. However, permutational anova retains correct type I error in the presence of skewed distributions as long as the variances are homogeneous. It has the same power as parametric anova.

3.3. Tests of homogeneity of variances (THV): type I error

For the THVs, we shall first present the simulation results for the normal and power base. continuous data, comparing the behavior of the different methods in the presence of symmetrical and skewed distributions. We postpone to Section 3.5 the presentation of the results using the truncated power base. distribution, which were devised to study the behavior of the THVs in the presence of species abundance-like data. The population variances were equal in the THV simulations for type I error.

3.3.1. Normal distribution

In this situation, all the methods tested, in their parametric (Fig. 2a) and permutational (Fig. 2b) forms, have correct type I error.

3.3.2. Power base. distribution

Under this distribution, the log-anova test (parametric and permutational) is the only one to maintain correct type I error (Figs. 2c and 2d). Type I error is inflated in all other tests in both forms, although less so for the permutational forms.

3.4. Tests of homogeneity of variances (THV): power

3.4.1. Normal distribution

Figures 2e and 2f show the power of the THVs with the same combination of variances as those used to assess type I error in the anova simulations (Fig. 1c). The log-anova test is clearly less powerful than the other tests, even in its permutational form. In most cases, Cochran's test is the most powerful when one of the variances is markedly higher than the others, but it loses most of its power when the variances are spread more evenly. The Bartlett, Cochran and Box parametric tests perform slightly better than their permutational counterparts; this is not the case for the log-anova test.

To summarize, a comparison between Figure 1c and Figures 2e and 2f shows that, as long as the data are normally distributed, the Bartlett and Box tests are powerful enough to detect heterogeneous variances when these induce inflated type I error in anova.

3.4.2. Power base. distribution

A cursory glance at Figures 2g and 2h may lead one to believe that all but the log-anova test have high power and are thus appropriate methods. However, the left-hand and right-hand groups of simulations in both graphs in fact represent cases with equal variances, i.e., measures of type I error. They are drawn here to remind us that the log-anova test is the only one that maintains correct type I error, and thus that its power, low as it may be, is that of the only reliable method among those investigated here. Power of the log-anova test reaches less than 5% in the best cases. We conclude that all the THVs under study are unusable for skewed distributions, such as the power base. data.

3.4.3. Effect of sample size

We stated above that, for anova, the number of objects per group did not influence the results significantly, at least within the range used in the simulations reported here ($n_j$ = 10 to 145). Does that hold for the THVs, or do the results change with sample size?

Type I error

Normal data: the simulation results (not shown) do not vary with $n_j$. Type I error remains correct for all methods, using parametric or permutational tests, regardless of $n_j$.

Power base. data (Figs. 3a and 3b): type I error of the log-anova test is correct, as in Figures 2c and 2d, and is unaffected by $n_j$. The results for the other THVs are interesting: their parametric forms show worse results when $n_j$ gets larger, while their permutational forms improve markedly, the rejection rate reaching the α significance level with $n_j$ = 145.

Power

Normal data: as expected, the power of all THVs improves greatly when the number of observations per group increases (Figs. 3c and 3d). At the maximum value subjected to simulations ($n_j$ = 145), the power of all methods, including that of the log-anova test, is 1. For intermediate sample sizes, Bartlett's and Box's tests are clearly the most powerful in the presence of a gradual distribution of the variances.

Power base. data: the log-anova test, which is the only one with an overall correct type I error under this distribution (Figs. 3a and 3b), has mediocre power, culminating at about 8% for $n_j$ = 145 for the variances used in these simulations (Figs. 3e and 3f). The other THVs, whose type I errors are correct only when $n_j$ is large, hardly do better, with powers between 3 and 5%. These results confirm that for skewed distributions the tests of homogeneity of variances are not usable.

3.5. Truncated power base. distribution: data simulating species abundances

This distribution deserves a special section because it simulates species abundance data, which are of great interest to ecologists. Contrary to the continuous power base. data used above, these data have been altered in such a way that log-transformation generally does not restore complete normality, much like with true species abundance data. These data being truncated at zero, the logarithmic transformation cannot restore the complete distribution.

3.5.1. Anova: type I error, homogeneous variances

Our simulations yielded the same results for truncated as for untruncated power base. data; the latter results are shown in Figure 1b. Type I error of the parametric anova was slightly too low (around ), while that of the permutational anova was correct.

3.5.2. Anova: power, homogeneous variances

Again, the results are similar to those obtained for untruncated power base. data, shown in Figure 1f. Power is good when the contrast between the two extreme means is high.

3.5.3. THV: type I error

The performances of the Bartlett, Cochran and Box tests are approximately the same (Figs. 4a and 4b) as for untruncated power base. data (i.e., equally bad; compare with Figs. 2c and 2d). The log-anova test reacts surprisingly badly to this type of data, at least when $n_j$ = 10 and the within-group variances are small. Type I error of the parametric log-anova test is always inflated. Type I error of the permutational test is correct only in the presence of high within-group variances. It is too low in the other cases, and improves when the sample size increases (see below), so that the permutational log-anova test is valid for small or large values of within-group variances because type I error is smaller than or equal to α.

Transforming the truncated power base. data, using $y' = \ln(y + 1)$, greatly improves type I error of all but the log-anova test (Figs. 4c and 4d). Type I errors of Bartlett's, Cochran's and Box's tests are approximately correct in their parametric form, and correct in their permutational form.

3.5.4. THV: power

The type I error results reported in Section 3.5.3 indicate that the only meaningful use of the THVs for truncated power base. data is after log-transformation of the data, using $y' = \ln(y + 1)$. The results are shown in Figures 4e and 4f. The log-anova test is unusable either because of its inflated type I error (parametric form) or because it has nearly no power (permutational form). Among the other tests, Bartlett's and Box's are the most powerful in general; as usual, Cochran's test is slightly better at detecting heterogeneity when one of the variances is markedly higher than the others.

3.5.5. Effect of sample size

Type I error

For truncated power base. data, increasing the sample size tends to restore correct type I error for the log-anova test (Figs. 5a and 5b). This occurs at smaller sample size ($n_j$ = 17) in the permutational than in the parametric test ($n_j$ = 37). For the other tests, the results are the same as with untruncated power base. data (Figs. 3a and 3b): larger sample size means more inflated type I error for the parametric versions, but for permutation tests, type I error becomes correct at $n_j$ = 145.

Log-transforming the data (using $y' = \ln(y + 1)$) improved the results drastically, in particular when using permutations (Figs. 5c and 5d), for all but the log-anova test. The permutational versions of Bartlett's, Cochran's and Box's tests have a valid type I error over the range of sample sizes simulated. Type I error remains too low for the log-anova test with small $n_j$, but the test remains valid.

Power

Despite the good performance of the log-anova test in terms of type I error (Figs. 5a and 5b) for truncated power base. data, power simulation results are not presented because power is too low for this test to be useful with this kind of data (between 56 and 2 for the same range of combinations as, for instance, those presented in Figures 4e and 4f). For the other THVs, a power study was run only at $n_j$ = 145, because these tests had incorrect type I error for truncated power base. data for lower $n_j$. The results (not illustrated) show that Bartlett's and Box's tests have a power above 0., while Cochran's test performs poorly (power around ) when one variance is markedly smaller than the others.

Power simulations were done to study the effect of sample size for log-transformed truncated power base. data. A marked improvement in power occurs at high sample sizes (Figs. 5e and 5f). The parametric log-anova power results are meaningless for $n_j$ = 10 and 17 because type I error is inflated in Figures 5c and 5d for these sample sizes. For larger sample sizes, where type I error is correct, the power of the parametric log-anova test is smaller than that of the other THVs. While the parametric tests seem to have better power overall, remember that they generally have slightly inflated type I error (Figs. 5c and 5d) whereas the permutation tests have correct type I error.

4. Discussion

We can now go back to the questions that motivated this study. First, we can state that heterogeneity of variances is always a problem in anova, and is troublesome even in the most benign cases, i.e., when one of the variances is smaller than the others. The problem is worst when one of the variances is markedly larger than the others. The effect of variance heterogeneity on anova is moderately to extremely inflated type I error.

Our other questions (usability of the THVs, differences between permutation and parametric tests, extreme conditions) need elaborate answers that will be presented below in the form of a table of recommendations (TABLE I). This is required by the many characteristics of the data that influence the simulation results. For instance, our simulations have shown that anyone wanting to apply anova to non-normal data is caught between contradictory requirements. On the one hand, anova is not very sensitive to skewness but needs homogeneous variances; on the other hand, the available THVs often give fanciful results when the data are skewed. Thus, it is highly recommended to normalize the data as well as possible, even though anova itself does not require it. For multi-modal data, transformations should aim at reducing skewness.

To summarize, the best overall methods to test the homogeneity of variances are Bartlett's or Box's tests. Even with normally distributed data, one should avoid Cochran's test, which is only sensitive to a single high variance, as well as the log-anova test, because it has low power with small to moderate sample sizes. With non-normal data, Bartlett's and Box's tests can be used if the samples are fairly large. Species abundance-like data should be log-transformed and subjected to parametric or permutational Bartlett's or Box's tests.

Acknowledgments

We are most thankful to General Cambronne for assistance during the simulation work. This research was supported by NSERC grants OGP and EQP0608 to P. Legendre.

References

Bartlett, M. S. (1937a) Some examples of statistical methods of research in agriculture and applied biology. J. Roy. Statist. Soc. Suppl. :
Bartlett, M. S. (1937b) Properties of sufficiency and statistical tests. Proc. Roy. Statist. Soc. Ser. A 60:
Box, G. E. P. (1953) Non-normality and tests on variances. Biometrika 0:
Cochran, W. G. (1941) The distribution of the largest of a set of estimated variances as a fraction of their total. Annals of Eugenics (London) :
Cochran, W. G. (1951) Testing a linear relation among variances. Biometrics 7:
Edgington, E. S. (1995) Randomization Tests (Third Edition). New York: Marcel Dekker.
Eisenhart, C. (1947) Significance of the largest of a set of sample estimates of variance. In: Selected techniques of statistical analysis for scientific and industrial research and production and management engineering (Eds. Eisenhart, C., Hastay, M. W. and Wallis, W. A.), pp , New York: McGraw-Hill.
Furnas, G. W. (1984) The generation of random, binary unordered trees, J. Classif.,,
Hartley, H. O. (1950) The maximum F-ratio as a short-cut test for heterogeneity of variance. Biometrika 37:
Hope, A. C. A. (1968) A simplified Monte Carlo test procedure, J. Roy. Statist. Soc. B, 50,
Kullback, S. (1959) Information theory and statistics. New York: Wiley.
Martin, C. G. and Games, P. A. (1977) Anova tests for homogeneity of variances: nonnormality and unequal samples. Journal of Educational Statistics, 2,
Scheffé, H. (1959) The analysis of variance. New York: Wiley.
Scherrer, B. (1984) Biostatistique. Boucherville: Gaëtan Morin Ed.
Sokal, R. R. and Rohlf, F. J. (1995) Biometry: The Principles and Practice of Statistics in Biological Research (Third Edition). New York: W. H. Freeman.
Underwood, A. J. (1997) Experiments in ecology: Their logical design and interpretation using analysis of variance. Cambridge: Cambridge University Press.
Welch, B. L. (1936) Specification of rules for rejecting too variable a product, with particular reference to an electric lamp problem. J. Roy. Statist. Soc., Suppl. 3, 2-8.
Welch, B. L. (1938) The significance of the difference between two means when the population variances are unequal. Biometrika, 2,
Winer, B. J. (1971) Statistical principles in experimental design (Second Edition). New York: McGraw-Hill.
Zar, J. H. (1999) Biostatistical analysis (Fourth Edition). Upper Saddle River, N.J.: Prentice Hall.

Appendix: t-test with Welch correction

This appendix presents additional simulation results in which the t-test with Welch (1936, 1938) correction was compared to parametric and permutational t-tests for two types of data distributions and for equal and unequal population variances. The result of a t-test is identical to that of an anova computed for two groups; the t-statistic is the square root of the F-statistic used in anova. The Welch correction, described in most textbooks of statistics (e.g., Scherrer 1984 and Zar 1999), is a widely used solution to the Behrens-Fisher problem of testing for the difference in the means of two populations when the variances are unequal. These simulations, which led us to formulate recommendations with respect to the use of this correction, should prove useful to application domains where unequal variances are commonly encountered.

The Welch correction was designed to provide a valid t-test in the presence of unequal population variances. It consists of using a corrected number of degrees of freedom ν to assess the significance of the t-statistic computed as usual. ν is the next smaller integer of the value obtained from the following equation:

$$\nu = \frac{\left[(s_1^2/n_1) + (s_2^2/n_2)\right]^2}{\dfrac{(s_1^2/n_1)^2}{n_1 - 1} + \dfrac{(s_2^2/n_2)^2}{n_2 - 1}} \tag{9}$$

where $s_1^2$ and $s_2^2$ are the sample variances of groups 1 and 2 respectively, whereas $n_1$ and $n_2$ are the numbers of observations in groups 1 and 2. When the variances are equal, equation 9 reduces to the usual formula ν = ($n_1 + n_2 - 2$) when the two groups have equal numbers of observations, but to a lower value when $n_1 \neq n_2$, making the test with Welch correction too conservative. We will see if the simulations can illustrate this bias, and what its practical consequences are, if any, for the users of the test.

1. Equal sample sizes

Simulations were carried out using two groups of data of equal sizes, with $n_1 = n_2$ = {10, , 50, 100}.
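The behavior of the Welch degrees of freedom given above (ν computed from the two sample variances and sample sizes) can be checked numerically with a short Python sketch. The function names are hypothetical, and the small tolerance added before taking the next smaller integer is an assumption of this sketch, meant only to guard against floating-point round-off when the ratio is mathematically an integer.

```python
import math


def welch_nu(s2_1, n1, s2_2, n2):
    """Unrounded Welch degrees of freedom from sample variances and sizes."""
    a = s2_1 / n1
    b = s2_2 / n2
    return (a + b) ** 2 / (a**2 / (n1 - 1) + b**2 / (n2 - 1))


def welch_df(s2_1, n1, s2_2, n2):
    """Next smaller integer of nu (tolerance guards against round-off)."""
    return math.floor(welch_nu(s2_1, n1, s2_2, n2) + 1e-9)


# Equal variances, equal sizes: nu equals n1 + n2 - 2.
print(welch_df(4.0, 10, 4.0, 10))
# Equal variances, unequal sizes: nu falls below n1 + n2 - 2 = 23.
print(welch_nu(4.0, 5, 4.0, 20))
```

With equal sample variances, the second call returns a value well below $n_1 + n_2 - 2$, which is the conservativeness described in the text for unequal sample sizes.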
The first series of simulations used normal random deviates with mean 0 and standard deviations as specified in the graphs; the values were chosen in such a way that the standard deviations of the two reference populations added to 10. For the power study, the population mean of group 1 was 0 while that of group 2 was different from 0. In each simulation, the following statistics were computed: a standard parametric t-test, a permutational t-test (using random permutations), and a t-test with Welch correction. The t-test with Welch correction is expected to produce correct type I error when the variances are not homogeneous, whereas the permutational t-test is expected to do the same for skewed data.

Figures 6a and 6e show that the t-test with Welch correction has correct type I error for all combinations of population variances and for all sample sizes. The parametric and permutational t-tests are affected by inequality of the variances. The effect is strong when sample size is small (n_j = 10 in Fig. 6a), but disappears gradually as sample size increases. For example, with n_j = 50, the tests have slightly inflated type I error only in the most extreme case of inequality of the population variances (Fig. 6e). No inflation of type I error was found at n_j = 100 (results not illustrated). The power of the three tests is comparable when they are valid, i.e., when type I error is not larger than α (Figs. 6b and 6f).

In the second series of simulations, power base. data were used, as described in Section 2.2 of the main paper. Otherwise, the design of the simulations was the same as for normal data. When the variances are equal or nearly so (σ_1 = σ_2 in Figs. 6c and 6g), all three tests are valid for all sample sizes. The permutational t-test presents the advantage of having correct type I error whereas the other two forms of the test are too conservative; the permutational t-test also has the highest power (Figs. 6d and 6h). When the variances are unequal (σ_1 ≠ σ_2, e.g., in Fig. 6g), type I error of all three tests becomes inflated to various degrees, so that the tests are invalid and should not be used. This was the case with all sample sizes used in the simulations, except with n_j = 10 (Fig. 6c); power of the tests is irrelevant when type I error is larger than α. The parametric and Welch-corrected t-tests are valid when sample sizes are very small since type I error is not larger than α (Fig. 6c where n_j = 10), but power is so low that the tests are unusable (Fig. 6d).

2. Unequal sample sizes

Simulations were also conducted with different sample sizes chosen in such a way that n_1 + n_2 = {20, 50 or 100}. Otherwise, the design of the simulations was the same as for equal sample sizes; the standard deviations were made to vary as in Fig. 6, the values being chosen in such a way that the standard deviations of the two reference populations added to 10. Simulations were carried out to measure type I error (with the population means equal) and power (with the population means unequal).

The first series of simulations used normal random deviates with mean 0 and standard deviations summing to 10 for the two groups, as specified in the graphs. When the population variances are equal, the parametric and permutational t-tests have correct type I error for any combination of sample sizes (Fig. 7a), whereas the test with Welch correction becomes too conservative when sample sizes are strongly unequal.

Power of all tests decreases as the sample sizes become more unequal (Fig. 7b); the t-test with Welch correction has lower power than the other forms in the most extreme cases of inequality of the sample sizes. Of course, power of all tests increases as n_1 + n_2 grows from 20 to 50 to 100 (not illustrated). When the population variances are unequal, all tests have correct type I error for equal group sizes (e.g., Fig. 7c, except for a slight inflation of type I error in the most extreme case of inequality of the population variances, already shown in Fig. 6e), but the tests become increasingly conservative as the group sizes become more unequal. All tests remain valid with sample sizes n_1 + n_2 = 50 or 100, but for strongly unequal group sizes, power becomes too low for the tests to be useful (Fig. 7d). We already know from Fig. 6a that for very small sample sizes, such as n_1 + n_2 = 20, there is a small inflation of type I error of the parametric and permutational t-tests in the case of equal group sizes; this is quickly compensated by the conservativeness of these tests in the case of unequal group sizes.

The second series of simulations, based upon power base. random deviates (skewed distribution), gave the following results. When the population variances are equal, only the permutational t-test has correct type I error for any combination of sample sizes and for all n_1 + n_2 = {20, 50 or 100} subjected to simulations (Fig. 7e); it also has good power (Fig. 7f). The test with Welch correction is too conservative for all combinations of sample sizes, and its power is lower than that of the permutational t-test. The parametric t-test has conservative type I error in most cases, but when the samples become strongly unequal in size, it has inflated type I error. In the area where the parametric t-test is valid (sample sizes equal or moderately unequal), its power is less than that of the permutational t-test. With unequal variances, the behavior of the three tests becomes erratic (e.g., Fig. 7g). When the tests are valid, they have such poor power that they are useless (Fig. 7h).

3. Recommendations

The recommendations that can be derived from our simulation results are complex. They are presented in tabular form (TABLE II). To summarize, the test with Welch correction is useful when the data are normal, sample sizes are small, and the variances are heterogeneous. Otherwise, use the parametric t-test for normal data, or the permutational t-test for skewed data. For heteroscedastic data that cannot be normalized, a nonparametric test should be used.
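The kind of simulation summarized in this appendix can be sketched as follows for the permutational t-test with normal data. This is a minimal illustration under stated assumptions, not the authors' program: the function names, the numbers of simulations and permutations, the seed, the use of the pooled-variance t-statistic, and the convention of counting the observed statistic among the permuted values are all choices made here for the example.

```python
import math
import random


def t_statistic(x, y):
    """Pooled-variance two-sample t-statistic."""
    n1, n2 = len(x), len(y)
    m1, m2 = sum(x) / n1, sum(y) / n2
    ss1 = sum((v - m1) ** 2 for v in x)
    ss2 = sum((v - m2) ** 2 for v in y)
    pooled_var = (ss1 + ss2) / (n1 + n2 - 2)
    return (m1 - m2) / math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))


def permutation_p_value(x, y, n_perm, rng):
    """Two-tailed permutational p-value of the t-statistic.

    Assumption of this sketch: the observed value counts as one of the
    possible outcomes, so the smallest p-value is 1 / (n_perm + 1).
    """
    observed = abs(t_statistic(x, y))
    combined = list(x) + list(y)
    n1 = len(x)
    count = 1
    for _ in range(n_perm):
        rng.shuffle(combined)  # random reallocation of values to the groups
        if abs(t_statistic(combined[:n1], combined[n1:])) >= observed:
            count += 1
    return count / (n_perm + 1)


def rejection_rate(n1, n2, sd1, sd2, mean2=0.0,
                   n_sim=200, n_perm=99, alpha=0.05, seed=1):
    """Proportion of simulated data sets in which H0 is rejected.

    With mean2 = 0 this estimates type I error; with mean2 != 0, power.
    n_sim, n_perm and seed are illustrative values only.
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sim):
        x = [rng.gauss(0.0, sd1) for _ in range(n1)]
        y = [rng.gauss(mean2, sd2) for _ in range(n2)]
        if permutation_p_value(x, y, n_perm, rng) <= alpha:
            rejections += 1
    return rejections / n_sim


# Type I error estimate for two groups of 10 with equal standard deviations.
print(rejection_rate(10, 10, 5.0, 5.0))
```

Changing `sd1` and `sd2` while keeping `mean2=0.0` gives a rough picture of how the rejection rate of the permutational test responds to unequal variances; a serious study would use far more simulations and permutations than this sketch does.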

TABLE I Recommended strategy for THVs and anova. References to the appropriate sections of the Results are given in brackets.

1. Are the data normal?
   Yes -> Run parametric Bartlett or Box THV; avoid Cochran (sensitive only to a single high variance) and log-anova (low power if n_j < 5) [3..] -> go to 2
   No -> go to 4
2. THV result:
   Variances homogeneous -> run parametric or permutational anova [3..].
   Variances heterogeneous -> homogenize variances [3..2] -> go to 3
3. Variance homogenization:
   Successful -> run parametric or permutational anova [3..].
   Unsuccessful -> choose alternative method, e.g. nonparametric anova.
4. Normalizing transformation of the data:
   Successful -> go to 1
   Unsuccessful -> go to 5
5. Distribution of data:
   Real, continuous, positively skewed -> go to 6
   Species abundances: data are null or positive, discrete, positively skewed -> go to 7
   Other distributions: not simulated in this study.
6. Sample size:
   n_j < 5 -> no THV is appropriate. Type I error is correct only for log-anova [3..3.], but this test has very low power [3..3.2]. *
   n_j is large -> Bartlett or Box tests can be used, but power is low [3..3.2]. -> go to 2
7. Sample size:
   n_j < 5: type I error is correct only for log-anova [3.5.5.], but this test is unusable because of its very low power [ ]. * -> go to 8
   n_j is large -> Bartlett or Box tests can be used [ ]. -> go to 2
8. Log-transform the data using y' = ln(y + 1), then:
   Use permutational Bartlett or Box tests (which have correct type I error but slightly lower power) or their parametric forms (slightly higher power but also sometimes slightly inflated type I error) [3.5., ] -> go to 2

* If anova is computed on data with skewed distribution without the results of a prior THV, a significant result may only be found if at least one of the means differs sufficiently from the others (Fig. 1f). Otherwise, use nonparametric anova.

TABLE II Recommendations for t-test of equality of two means.

1. Sample sizes equal?
   Yes -> go to 2
   No -> go to 6
2. Equal sample sizes. Distribution:
   Normal -> go to 3
   Skewed -> go to 5
3. Normal distributions. THV result:
   Variances homogeneous -> use any one of the 3 tests (simplest: parametric t-test).
   Variances unequal -> go to 4
4. Variances unequal. Sample size:
   Small -> use the t-test with Welch correction.
   Large -> use any one of the 3 tests (simplest: parametric t-test).
5. Skewed distributions. THV result:
   Variances homogeneous -> all 3 tests are valid, but the permutational t-test is preferable because it has correct type I error and the highest power.
   Variances unequal -> normalize the data or use a nonparametric test (Wilcoxon-Mann-Whitney test, median test, Kolmogorov-Smirnov two-sample test, etc.).
6. Unequal sample sizes. Distribution:
   Normal -> go to 7
   Skewed -> go to 8
7. Normal distributions. THV result:
   Variances homogeneous -> use the parametric or permutational t-tests (simplest: parametric t-test).
   Variances unequal -> use any one of the 3 tests (simplest: parametric t-test). Power is low when the sample sizes are strongly unequal; avoid the Welch-corrected t-test in the most extreme cases of sample size inequality (lower power).
8. Skewed distributions. THV result:
   Variances homogeneous -> use the permutational t-test.
   Variances unequal -> normalize the data or use a nonparametric test (Wilcoxon-Mann-Whitney test, median test, Kolmogorov-Smirnov two-sample test, etc.).
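The decision path of TABLE II can also be written as a small function, which may help when the recommendations have to be applied repeatedly. The sketch below is a hypothetical encoding: the function name, the boolean arguments and the returned strings are inventions of this example, and where the table allows several tests the function returns the option the table calls simplest or preferable. Deciding whether a sample is "small" is left to the user, as in the table.

```python
def recommend_t_test(equal_sizes, normal, homogeneous, small_samples=False):
    """Return the test suggested by TABLE II for one two-group comparison.

    equal_sizes   -- True if n_1 == n_2 (step 1)
    normal        -- True if the data are normal, False if skewed (steps 2, 6)
    homogeneous   -- True if the THV found homogeneous variances
    small_samples -- only used in step 4 (normal data, equal sizes,
                     unequal variances)
    """
    if not normal:
        # Steps 5 and 8: skewed distributions.
        if homogeneous:
            return "permutational t-test"
        return "normalize the data or use a nonparametric test"
    if homogeneous:
        # Steps 3 and 7: the parametric t-test is the simplest valid choice.
        return "parametric t-test"
    if equal_sizes and small_samples:
        # Step 4: the only case where the Welch correction is recommended.
        return "t-test with Welch correction"
    # Step 4 (large samples) and step 7 (unequal sizes, unequal variances).
    return "parametric t-test"


print(recommend_t_test(equal_sizes=True, normal=True,
                       homogeneous=False, small_samples=True))
```

The function deliberately ignores the table's remarks on low power with strongly unequal sample sizes; those still have to be weighed by the analyst.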

Figure captions

FIGURE 1 Results of the ANOVA simulation study for normal [(a), (c) and (e)] and power base. [(b), (d) and (f)] data. (a) and (b): type I error and 95% confidence intervals (error bars) at α = 0.05, in the presence of various amounts of within-group population variance (abscissa). (c) and (d): type I error and 95% confidence intervals (error bars) at α = 0.05, in the presence of heterogeneous within-group variances. Simulations were run with three groups; the population variances are shown under the abscissa. At both ends of the graphs, results for equal variances are also shown for comparison. (e) and (f): power of ANOVA in the presence of homogeneous variances and various combinations of means (abscissa). (e) Normal distribution, variance = 5; (f) power base. distribution, variance =. Overlapping symbols have been offset horizontally to improve clarity. The 95% confidence error bars, which are closer than the size of the symbols, have been omitted.

FIGURE 2 Results of the THV simulation study for parametric [(a), (c), (e) and (g)] and permutational [(b), (d), (f) and (h)] tests for normal and power base. data and n_j = 10. When they were closer than the size of the symbols, the 95% confidence error bars have been omitted. (a) and (b): type I error and 95% confidence intervals (error bars) of the four THVs, tested at α = 0.05, for normally distributed data. The simulations involved three groups from populations with equal variances (abscissa) and zero means. (c) and (d): type I error of the four THVs for the power base. distribution. The simulations involved three groups with equal population variances (abscissa). (e) and (f): power of the four THVs for normal data. The simulations involved three groups drawn from populations with zero means and variances shown under the abscissa. Overlapping symbols have been offset horizontally to improve clarity. (g) and (h): power of the four THVs for power base. data. The simulations involved three groups with equal means and variances shown under the abscissa. At both ends of the graphs, results for equal variances have been added for comparison. Overlapping symbols have been offset horizontally to improve clarity.

FIGURE 3 Effect of sample size on the four THVs for normal and power base. data. (a), (c) and (e): parametric tests; (b), (d) and (f): permutation tests. When they were closer than the size of the symbols, the 95% confidence error bars have been omitted. (a) and (b): type I error for power base. data. The within-group variances were equal to in these simulations. (c) and (d): power for normal data. Simulations were run with three groups with variances equal to, and. (e) and (f): power for power base. data. Simulations were run with three groups with variances equal to, and. For the parametric tests, in all but the log-anova tests the results are drawn only for n_j =5, because for smaller values of n_j the type I error of the tests is inflated.

FIGURE 4 Results of the THV simulation study for parametric [(a), (c) and (e)] and permutational [(b), (d) and (f)] tests for truncated or truncated and log-transformed power base. data and n_j = 10. The 95% confidence error bars, which are closer than the size of the symbols, have been omitted. (a) and (b): type I error for truncated power base. data. Simulations were run with three groups with equal variances (abscissa) and equal means. (c) and (d): type I error for truncated and log-transformed power base. data. Simulations were run with three groups with equal variances (abscissa) and equal means. (e) and (f): power for truncated and log-transformed power base. data. Simulations were run with three groups with equal means and variances shown under the abscissa. At both ends of the graphs, results for equal variances have been added for comparison. Overlapping symbols have been offset horizontally for better clarity. In the parametric results, the lines connecting the log-anova symbols are dashed as a reminder that this test is unusable because of its incorrect type I error.

FIGURE 5 Effect of sample size on the four THVs on truncated or truncated and log-transformed base. data. (a), (c) and (e): parametric tests; (b), (d) and (f): permutation tests. When they were closer than the size of the symbols, the 95% confidence error bars have been omitted. (a) and (b): type I error for truncated power base. data. (c) and (d): type I error for truncated and log-transformed power base. data. (e) and (f): power for truncated and log-transformed power base. data. Simulations were run with three groups with variances equal to, and.

FIGURE 6 (a) Type I error and 95% confidence intervals (error bars) for t-tests of difference between two group means, at α = 0.05, for two groups of normal data (n_1 = n_2 = 10) with equal means, as a function of the two population standard deviations (σ_k, abscissa). (b) Power simulation results for the same data. (c, d) Same as (a, b) for power base. data. (e, f, g, h) Same as (a, b, c, d) using n_1 = n_2 = 50. When they were closer than the size of the symbols, the 95% confidence error bars have been omitted.

FIGURE 7 (a) Type I error and 95% confidence intervals (error bars) for t-tests of difference between two group means, at α = 0.05, for normal data with equal population standard deviations (σ_j), as a function of the sample sizes (n_j, abscissa). (b) Power simulation results for the same data. (c, d) Same as (a, b) for unequal population standard deviations (σ_j). (e, f, g, h) Same as (a, b, c, d) using power base. data. When they were closer than the size of the symbols, the 95% confidence error bars have been omitted.

[Legendre & Borcard, Fig. 1: panels (a) to (f), normal data and power base. data. Abscissas: within-group variance (a, b); variances of groups 1 to 3 (c, d); means of groups 1 to 3 (e, f). Symbols: parametric ANOVA, permutational ANOVA.]

[Legendre & Borcard, Fig. 2: panels (a) to (h), parametric THV and permutational THV. Abscissas: within-group variance (a to d); variances of groups 1 to 3 (e to h). Symbols: Bartlett, log-anova, Cochran C, Box M.]
