TESTS FOR MEAN EQUALITY THAT DO NOT REQUIRE HOMOGENEITY OF VARIANCES: DO THEY REALLY WORK?

H. J. Keselman, University of Manitoba, Winnipeg, Manitoba, Canada R3T 2N2
Rand R. Wilcox, University of Southern California, Los Angeles, California
Jason Taylor, University of Manitoba, Winnipeg, Manitoba, Canada R3T 2N2
Rhonda K. Kowalchuk, University of Manitoba, Winnipeg, Manitoba, Canada R3T 2N2

Key Words: Tests for Mean Equality; Variance Heterogeneity; Nonnormality; Monte Carlo; Robust Estimators

ABSTRACT

Tests for mean equality proposed by Weerahandi (1995) and Chen and Chen (1998), tests that do not require equality of population variances, were examined when data were not only heterogeneous but also nonnormal in unbalanced completely randomized designs. Furthermore, these tests were compared to a test examined by Lix and Keselman (1998), a test that uses a heteroscedastic statistic (i.e., Welch, 1951) with robust estimators (20% trimmed means and Winsorized variances). Our findings confirmed previously published data that the tests are indeed robust to variance heterogeneity when the data are obtained from normal populations. However, the Weerahandi (1995) and Chen and Chen (1998) tests were not found to be robust when data were obtained from nonnormal populations. Indeed, rates of Type I error were typically in excess of 10% and, at times, exceeded 50%. On the other hand, the statistic presented by Lix and Keselman (1998) was generally robust to variance heterogeneity and nonnormality.

1. INTRODUCTION

The Behrens-Fisher problem (see Fisher, 1935) refers to the problem of testing for mean equality in the presence of variance heterogeneity. This problem was originally discussed within the context of a two-group layout but has also been extended to the many-group layout. For example, Welch (1951), James (1951, 1954), and Brown and Forsythe (1974), among others (see Gamage & Weerahandi, 1998; Lix & Keselman, 1995), have presented approximate test statistics for testing for mean equality when there are more than two groups and when population variances are not presumed to be equal. Popular methods (e.g., Welch) have been found to be generally robust to variance heterogeneity under normality, but not when data are nonnormal (see Lix & Keselman, 1995).

For completeness we note that in addition to the popular approximate methods, other solutions to the problem have been presented. For example, transformations of the data, nonparametric tests, and tests based on robust estimators (e.g., trimmed means and Winsorized variances) have also been proposed. Unfortunately, these procedures have not proven to be uniformly successful in controlling test size ($\alpha$) when data are heterogeneous as well as nonnormal, particularly in unbalanced designs.

Two recent solutions to this problem have been presented by Weerahandi (1995) and Chen and Chen (1998). These authors have derived test statistics which test for mean equality without requiring that population variances be equal. Thus, researchers may be able to use either of these procedures to test for mean equality and be confident that the test size will not be distorted by heterogeneous variances, a condition believed to characterize applied data (see Wilcox, 1997). Unfortunately, the data presented regarding the operating characteristics of these two test statistics are extremely limited.
Gamage and Weerahandi (1998) presented Type I error results indicating that the Weerahandi (1995) procedure is robust to nonnormality and variance heterogeneity in unbalanced designs. However, they only investigated a one-way design containing three treatment groups in which there were a limited number of unequal sample sizes and variances for one type of nonnormal distribution (gamma). Chen and Chen (1998) in their investigation only report power data for their test. Accordingly, the purpose of our investigation was to examine in detail the test statistics presented by these authors. In addition, we compared these procedures to a test examined by Lix and Keselman (1998); namely, a heteroscedastic statistic (i.e., Welch, 1951) that uses trimmed means and Winsorized variances, as suggested by Yuen (1974).

2. DEFINITION OF THE TEST STATISTICS

Suppose $n_j$ independent random observations $X_{1j}, X_{2j}, \ldots, X_{n_j j}$ are sampled from population $j$ ($j = 1, \ldots, J$). We assume that the $X_{ij}$s ($i = 1, \ldots, n_j$; $\sum_j n_j = N$) are obtained from a normal population with mean $\mu_j$ and unknown variance $\sigma_j^2$, with $\sigma_j^2 \ne \sigma_{j'}^2$ ($j \ne j'$). Then, let $\bar{X}_j = \sum_i X_{ij}/n_j$ and $s_j^2 = \sum_i (X_{ij} - \bar{X}_j)^2/n_j$ [Gamage and Weerahandi (1998) defined the sample variance with $n_j - 1$ in the denominator, while Weerahandi (1995) used $n_j$; to replicate the Gamage and Weerahandi findings, however, the denominator needed to be $n_j$]. The usual less-than-full-rank model $X_{ij} = \mu + \alpha_j + \epsilon_{ij}$ can be applied to the problem at hand, where the $\epsilon_{ij}$s are assumed to be independent random variables with $\epsilon_{ij} \sim N(0, \sigma_j^2)$ and $\sum_{j=1}^{J} \alpha_j = 0$. Thus the null hypothesis can be expressed as either $H_0: \alpha_1 = \alpha_2 = \cdots = \alpha_J$ or $H_0: \mu_1 = \mu_2 = \cdots = \mu_J$.

2.1 Generalized F-Test (Weerahandi, 1995).

According to Weerahandi, his generalized F-test is carried out by determining a generalized p-value
which is then compared to the nominal significance level to determine whether the null hypothesis of mean equality can be rejected or not. To determine the generalized p-value one first computes a standardized between-group sum of squares, $\tilde{S}_b$, where

$$\tilde{S}_b = \tilde{S}_b(\sigma_1^2, \ldots, \sigma_J^2) = \sum_{j=1}^{J} \frac{n_j \bar{X}_j^2}{\sigma_j^2} - \frac{\left( \sum_{j=1}^{J} n_j \bar{X}_j / \sigma_j^2 \right)^2}{\sum_{j=1}^{J} n_j / \sigma_j^2}. \qquad (2.1.1)$$

The generalized p-value is calculated to be $p = 1 - q$, where

$$q = E\left( H_{J-1,\,N-J} \left\{ \frac{N-J}{J-1}\, \tilde{s}_b \left[ \frac{n_1 s_1^2}{B_1 B_2 \cdots B_{J-1}}, \frac{n_2 s_2^2}{(1 - B_1) B_2 \cdots B_{J-1}}, \frac{n_3 s_3^2}{(1 - B_2) B_3 \cdots B_{J-1}}, \ldots, \frac{n_J s_J^2}{1 - B_{J-1}} \right] \right\} \right), \qquad (2.1.2)$$

and $H_{J-1,\,N-J}$ is the cdf of the F distribution with $J - 1$ and $N - J$ degrees of freedom and the expectation is taken with respect to independent Beta random variables

$$B_k \sim \mathrm{Beta}\left( \sum_{i=1}^{k} \frac{n_i - 1}{2}, \; \frac{n_{k+1} - 1}{2} \right), \quad k = 1, 2, \ldots, J - 1. \qquad (2.1.3)$$

According to Weerahandi (1995) the p-value can be computed by numerical integration with respect to the Beta random variables or through Monte Carlo methods. He points out that when the number of simulations is large the mean of the probabilities will approximate the expected value well. Interested readers can find a derivation of the method in Weerahandi.

2.2 The Chen and Chen (1998) Method.

The statistic presented by Chen and Chen (1998) is an exact single-stage analysis of variance type procedure (as opposed to two-stage procedures; see Bishop and Dudewicz, 1978), which under the null hypothesis of the distribution of the
test, is completely free of the unknown variances (Chen and Chen, p. 644). Again assuming the previously defined model, this procedure uses the first $n_j - 1$ (where $n_j \ge 3$) observations to define the sample mean ($\bar{X}_j$) and variance ($\tilde{s}_j^2$), i.e.,

$$\bar{X}_j = \sum_{i=1}^{n_j - 1} X_{ij} / (n_j - 1), \quad \text{and} \quad \tilde{s}_j^2 = \sum_{i=1}^{n_j - 1} (X_{ij} - \bar{X}_j)^2 / (n_j - 2).$$

Weights for the observations are defined as

$$U_j = \frac{1}{n_j} + \frac{1}{n_j} \sqrt{ \frac{1}{n_j - 1} \left[ \tilde{s}_{(m)}^2 / \tilde{s}_j^2 - 1 \right] }, \qquad V_j = \frac{1}{n_j} - \frac{1}{n_j} \sqrt{ (n_j - 1) \left[ \tilde{s}_{(m)}^2 / \tilde{s}_j^2 - 1 \right] }, \qquad (2.2.1)$$

where $\tilde{s}_{(m)}^2$ is the maximum of $\tilde{s}_1^2, \ldots, \tilde{s}_J^2$. Finally, a weighted sample mean is calculated as

$$\tilde{X}_j = \sum_{i=1}^{n_j} W_{ij} X_{ij}, \qquad (2.2.2)$$

where $W_{ij} = U_j$ for $1 \le i \le n_j - 1$ and $W_{ij} = V_j$ for $i = n_j$, and where $U_j$ and $V_j$ satisfy the following equations:

$$(n_j - 1) U_j + V_j = 1, \qquad (n_j - 1) U_j^2 + V_j^2 = \tilde{s}_{(m)}^2 / (n_j \tilde{s}_j^2). \qquad (2.2.3)$$

Chen and Chen (1998) indicate that the transformation

$$t_j = \frac{\tilde{X}_j - \mu_j}{\tilde{s}_j \sqrt{\sum_{i=1}^{n_j} W_{ij}^2}}$$

has a conditional normal distribution with mean zero and variance $\sigma_j^2 / \tilde{s}_j^2$. They also show (p. 646) that the $t_j$s, which are conditionally normal given the $\tilde{s}_j^2$s, are unconditionally distributed as independent Student t variables with $n_j - 2$ degrees of freedom. An equivalent version of $t_j$, given Equation (2.2.3), is

$$t_j = \frac{\tilde{X}_j - \mu_j}{\tilde{s}_{(m)} / \sqrt{n_j}}.$$

To test $H_0$, Chen and Chen (1998) suggest the statistic

$$\tilde{F}^1 = \sum_{j=1}^{J} \left[ \frac{\tilde{X}_j - \tilde{X}_{.}}{\tilde{s}_{(m)} / \sqrt{n_j}} \right]^2, \qquad (2.2.4)$$

where $\tilde{X}_{.} = \sum_{j=1}^{J} \tilde{X}_j / J$. According to Chen and Chen one would reject $H_0$ when $\tilde{F}^1 > \tilde{F}^1_{\alpha, J, n}$, the upper $\alpha$ percentage point of the null distribution of $\tilde{F}^1$ (based on a balanced design). A SAS (SAS, Version 6.1) computer program can be obtained from the authors to obtain critical values for both balanced and unbalanced designs.

2.3 Lix and Keselman's (1998) procedure.

Lix and Keselman (1998) and Wilcox, Keselman and Kowalchuk (1998) have shown how to obtain a robust test of location equality in unbalanced one-way layouts when the
underlying data are neither normal in form nor possessing equal variability. The heteroscedastic statistic used by Lix and Keselman (1998) and Wilcox, Keselman, and Kowalchuk (1998) is due to Welch (1951). The statistic can be defined as

$$F = \frac{ \sum_{j=1}^{J} w_j (\bar{X}_j - \tilde{X})^2 / (J - 1) }{ 1 + \dfrac{2(J - 2)}{J^2 - 1} \sum_{j=1}^{J} \dfrac{(1 - w_j / W)^2}{n_j - 1} }, \qquad (2.3.1)$$

where $w_j = n_j / s_j^2$, $\tilde{X} = \sum_{j=1}^{J} w_j \bar{X}_j / W$, $W = \sum_{j=1}^{J} w_j$, $\bar{X}_j = \sum_i X_{ij} / n_j$ and $s_j^2 = \sum_i (X_{ij} - \bar{X}_j)^2 / (n_j - 1)$, where $\bar{X}_j$ is the estimate of $\mu_j$ and $s_j^2$ is the usual unbiased estimate of the variance for population $j$. The test statistic is approximately distributed as an F variate and is referred to the critical value $F[(1 - \alpha); (J - 1), \nu]$, the $(1 - \alpha)$-centile of the F distribution, where error degrees of freedom are obtained from

$$\nu = \frac{J^2 - 1}{3 \sum_{j=1}^{J} \dfrac{(1 - w_j / W)^2}{n_j - 1}}. \qquad (2.3.2)$$

Yuen (1974) initially suggested that trimmed means and variances based on Winsorized sums of squares be used in conjunction with Welch's (1938) two-sample statistic. For heavy-tailed symmetric distributions, Yuen showed that the statistic based on these robust estimators could adequately control the rate of Type I errors and resulted in greater power than a statistic based on the usual mean and variance. While a wide range of robust estimators have been proposed in the literature (see Gross, 1976), the trimmed mean and Winsorized variance are intuitively appealing because of their computational simplicity and good theoretical properties (Wilcox, 1995a). In particular, while the standard error of the usual mean can become seriously inflated when the
underlying distribution has heavy tails (Tukey, 1960), the standard error of the trimmed mean is less affected by departures from normality because extreme observations, that is, observations in the tails of a distribution, are removed. Furthermore, as Gross (1976) notes, "the Winsorized variance is a consistent estimator of the variance of the corresponding trimmed mean" (p. 410). In computing the Winsorized variance, the most extreme observations are replaced with less extreme values in the distribution of scores.

While the trimmed mean has been shown to be highly effective, we caution the reader that this measure should only be adopted if one is interested in testing for treatment effects across groups using a measure of location that more accurately reflects the typical score within a group when working with heavy-tailed distributions. As an illustration of how a trimmed mean may provide a better estimate of the typical score than the usual mean, consider the example given by Wilcox (1995a, p. 57) in which a single score in a chi-square distribution with four df (hence $\mu = 4$) is multiplied by 10 (with probability .1). This contaminated chi-square distribution has a population mean of 7.6, a value closer to the upper tail of the distribution. However, the 20% population trimmed mean is 4.2, a value that is closer to the bulk of scores, hence closer to the typical score in the distribution.

Lix and Keselman (1998) and Wilcox, Keselman, and Kowalchuk (1998) replace the hypothesis of equal means with $H_0: \mu_{t1} = \mu_{t2} = \cdots = \mu_{tJ}$, the hypothesis of equal trimmed means. Let $X_{(1)j} \le X_{(2)j} \le \cdots \le X_{(n_j)j}$ represent the ordered observations associated with the $j$th group. Let $g_j = [\gamma n_j]$, where $\gamma$ represents the proportion of observations that are to be trimmed in each tail of the distribution. For reasons summarized by Wilcox (1995a,b), 20% trimming ($\gamma = .2$) is used here. The effective sample size for the $j$th group becomes $h_j = n_j - 2 g_j$. The $j$th sample trimmed mean is

$$\bar{X}_{tj} = \frac{1}{h_j} \sum_{i = g_j + 1}^{n_j - g_j} X_{(i)j}, \qquad (2.3.3)$$

and the $j$th sample Winsorized mean is

$$\bar{X}_{wj} = \frac{1}{n_j} \sum_{i=1}^{n_j} Y_{ij},$$

where

$$Y_{ij} = \begin{cases} X_{(g_j + 1)j} & \text{if } X_{ij} \le X_{(g_j + 1)j}, \\ X_{ij} & \text{if } X_{(g_j + 1)j} < X_{ij} < X_{(n_j - g_j)j}, \\ X_{(n_j - g_j)j} & \text{if } X_{ij} \ge X_{(n_j - g_j)j}. \end{cases}$$

The sample Winsorized variance is

$$s_{wj}^2 = \frac{1}{n_j - 1} \sum_{i=1}^{n_j} (Y_{ij} - \bar{X}_{wj})^2, \qquad (2.3.4)$$

and

$$\tilde{s}_{wj}^2 = \frac{(n_j - 1) s_{wj}^2}{h_j (h_j - 1)} \qquad (2.3.5)$$

estimates the squared standard error of the sample trimmed mean (see Wilcox, 1996). Thus, with robust estimation, the trimmed group means ($\bar{X}_{tj}$s) replace the least squares group means ($\bar{X}_j$s), the Winsorized group variance estimators ($\tilde{s}_{wj}^2$s) replace the least squares variances ($s_j^2$s), and $\sum_j h_j$ replaces $N$, in the statistics and their df. That is, Equations (2.3.1) and (2.3.2) become

$$F_t = \frac{ \sum_{j=1}^{J} w_{tj} (\bar{X}_{tj} - \tilde{X}_t)^2 / (J - 1) }{ 1 + \dfrac{2(J - 2)}{J^2 - 1} \sum_{j=1}^{J} \dfrac{(1 - w_{tj} / W_t)^2}{h_j - 1} }, \qquad (2.3.6)$$

where $w_{tj} = h_j / \tilde{s}_{wj}^2$, $\tilde{X}_t = \sum_{j=1}^{J} w_{tj} \bar{X}_{tj} / W_t$ and $W_t = \sum_{j=1}^{J} w_{tj}$, where $\nu$ is estimated by
$$\nu_t = \frac{J^2 - 1}{3 \sum_{j=1}^{J} \dfrac{(1 - w_{tj} / W_t)^2}{h_j - 1}}. \qquad (2.3.7)$$

3. METHOD

Four variables were manipulated in the study: (a) number of groups (4 and 6), (b) sample size (two cases), (c) population distribution (five distributions: one normal and four nonnormal distributions), and (d) degree/pattern of variance heterogeneity (moderate and large/all (mostly) unequal and all but one equal). Variances and group sizes were both positively and negatively paired. Table I contains the numerical values of the sample sizes and variances investigated in this study.

Table I
Sample Size and Variance Conditions

CON  Sample Sizes (Two Cases)                       Population Variances
A    10, 15, 20, 25; 15, 20, 25, 30                 1, 4, 9, 16
B    10, 15, 20, 25; 15, 20, 25, 30                 1, 1, 1, 36
C    10, 15, 20, 25; 15, 20, 25, 30                 16, 9, 4, 1
D    10, 15, 20, 25; 15, 20, 25, 30                 36, 1, 1, 1
E    10, 15(2), 20(2), 25; 15, 20(2), 25(2), 30     1(2), 4, 9(2), 16
F    10, 15(2), 20(2), 25; 15, 20(2), 25(2), 30     1(5), 36
G    10, 15(2), 20(2), 25; 15, 20(2), 25(2), 30     16, 9(2), 4, 1(2)
H    10, 15(2), 20(2), 25; 15, 20(2), 25(2), 30     36, 1(5)

As indicated, we investigated one-way designs having four and six groups. For each design size, two sample size cases were investigated. In our unbalanced designs, the smaller of the two cases investigated for each design had an average group size of less than 20, while the larger case in each design had an average group size of at least 20.

With respect to the effects of distributional shape on Type I error, we chose to investigate conditions in which the statistics were likely to be
prone to an excessive number of Type I errors as well as a normally distributed case. Thus, we generated data from four skewed distributions. Specifically, we sampled from a $\chi^2_6$ and a $\chi^2_3$ distribution and we also used the method described in Hoaglin (1985) to generate distributions with more extreme degrees of skewness and kurtosis. These particular types of nonnormal distributions were selected since data obtained in applied settings (e.g., behavioral science data) typically have skewed distributions (Micceri, 1989; Wilcox, 1994a, 1994b, 1995a,b). Furthermore, Sawilowsky and Blair (1992) investigated the effects of eight nonnormal distributions identified by Micceri on the robustness of Student's t test and found that only the distributions with the most extreme degree of skewness investigated (e.g., $\gamma_1 = 1.64$) affected the Type I error control of the independent sample t statistic. Thus, since the statistics we investigated have operating characteristics similar to those reported for the t statistic, we felt that our approach to modeling skewed data would adequately reflect conditions in which those statistics might not perform optimally. For the $\chi^2_3$ distribution, skewness and kurtosis values are $\gamma_1 = 1.63$ and $\gamma_2 = 4.00$, respectively (the corresponding values for the $\chi^2_6$ data are $\gamma_1 = 1.15$ and $\gamma_2 = 2.00$) (see Table II). Accordingly, our simulated $\chi^2_3$ distribution mirrors data found in behavioral science experiments with regard to skewness. The other types of nonnormal distributions were generated from the g- and h-distribution (Hoaglin, 1985). Specifically, we chose to investigate two g- and h-distributions: (a) a $g = 1$ and $h = 0$ distribution and (b) a $g = 1$ and $h = .5$ distribution. To give meaning to these values it should be noted that for the standard normal distribution $g = h = 0$. When $g = 0$ a distribution is symmetric, and the tails of a distribution will become heavier as $h$ increases.
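As a rough illustration of the g- and h-family just described, the sketch below applies Hoaglin's (1985) transformation, $X = [(\exp(gZ) - 1)/g] \exp(hZ^2/2)$, to standard normal draws and centres the result at its population mean. The function names (`g_and_h`, `g_and_h_mean`), the seed, and the choice of Python (the study itself used SAS/IML) are assumptions made for illustration only; the $g = 0$ branch uses the limiting form $Z \exp(hZ^2/2)$.

```python
import math
import random


def g_and_h(z, g, h):
    """Transform a standard normal value z into a g-and-h value."""
    tail = math.exp(h * z * z / 2.0)  # h > 0 makes the tails heavier
    if g == 0:
        return z * tail  # symmetric limiting case as g -> 0
    return (math.exp(g * z) - 1.0) / g * tail  # g > 0 skews to the right


def g_and_h_mean(g, h):
    """Population mean of a g-and-h variable, valid for g > 0 and h < 1."""
    return (math.exp(g * g / (2.0 * (1.0 - h))) - 1.0) / (g * math.sqrt(1.0 - h))


rng = random.Random(12345)  # arbitrary seed, used only for reproducibility
sample = [g_and_h(rng.gauss(0.0, 1.0), g=1.0, h=0.0) for _ in range(10000)]

# Centre the variates at zero so that the null hypothesis holds.
centered = [x - g_and_h_mean(1.0, 0.0) for x in sample]
```

With $g = 1$ and $h = 0$ the transformation reduces to a shifted lognormal variate, whose population mean is $\exp(0.5) - 1$; that is the value subtracted in the last line.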
Values of skewness and kurtosis corresponding to the investigated g and h distributions are (a) $\gamma_1 = 6.2$ and $\gamma_2 = 114$, respectively, and (b) $\gamma_1 = \gamma_2 =$ undefined (see Table II). Finally, it should be noted that though the selected combinations of g and h result in extremely skewed distributions, these values, according to Wilcox (1994a, 1994b, 1995a,b), are representative of measurements obtained in applied settings (e.g., psychometric measures). Moreover, as Wilcox (1995a) notes, if a procedure performs well over a wide range of
simulation conditions, including extreme conditions, this suggests that the positive operating characteristics of the procedure might hold over conditions not considered in the simulation and thus positively reflect on the procedure's versatility.

Table II
Distributions Investigated and Their Properties

Distribution      Skewness    Kurtosis
Chi Square (6)    1.15        2.00
Chi Square (3)    1.63        4.00
g = 1 & h = 0     6.2         114
g = 1 & h = .5    Undefined   Undefined

As indicated, we both positively and negatively paired the group sizes and variances. For positive (negative) pairings, the group having the smallest number of observations was associated with the population having the smallest (largest) variance, while the group having the greatest number of observations was associated with the population having the greatest (smallest) variance. These conditions were chosen since they typically produce distorted Type I error rates.

To generate pseudo-random normal variates, we used the SAS generator RANNOR (SAS Institute, 1989). If $Z_{ij}$ is a standard normal variate, then $X_{ij} = \mu_j + \sigma_j Z_{ij}$ is a normal variate with mean equal to $\mu_j$ and variance equal to $\sigma_j^2$.

To generate pseudo-random variates having a $\chi^2$ distribution with six (three) degrees of freedom, six (three) standard normal variates were squared and summed. The variates were standardized, and then transformed to $\chi^2_6$ or $\chi^2_3$ variates having mean $\mu_j$ and variance $\sigma_j^2$ [see Hastings & Peacock (1975) for further details on the generation of data from this distribution].

To generate data from a g- and h-distribution, standard unit normal variables ($Z_{ij}$) were converted to the random variable
$$X_{ij} = \frac{\exp(g Z_{ij}) - 1}{g} \exp\left( \frac{h Z_{ij}^2}{2} \right),$$

according to the values of g and h selected for investigation. To obtain a distribution with standard deviation $\sigma_j$, each $X_{ij}$ ($j = 1, \ldots, J$) was multiplied by a value of $\sigma_j$ obtainable from Table I. It is important to note that this does not affect the value of the null hypothesis when $g = 0$ (see Wilcox, 1994a, p. 297). However, when $g > 0$, the population mean for a g- and h-distributed variable is

$$\mu_{gh} = \frac{\exp\{ g^2 / [2(1 - h)] \} - 1}{g (1 - h)^{1/2}}$$

(see Hoaglin, 1985, p. 503). Thus, for those conditions where $g > 0$, $\mu_{gh}$ was first subtracted from $X_{ij}$ before multiplying by $\sigma_j$. Lastly, it should be noted that the standard deviation of a g- and h-distribution is not equal to one, and thus the values enumerated in Table I reflect only the amount that each random variable is multiplied by and not the actual values of the standard deviations (see Wilcox, 1994a, p. 298). As Wilcox notes, the values for the variances (standard deviations) in Table I more aptly reflect the ratio of the variances (standard deviations) between the groups.

Our simulation program was written in SAS/IML (SAS, 1989). One thousand replications of each condition were performed using a .05 significance level; Beta values within each simulation were based on 5000 replications (simulations).

4. RESULTS

To evaluate the particular conditions under which a test was insensitive to assumption violations, Bradley's (1978) liberal criterion of robustness was employed. According to this criterion, in order for a test to be
considered robust, its empirical rate of Type I error ($\hat{\alpha}$) must be contained in the interval $0.5\alpha \le \hat{\alpha} \le 1.5\alpha$. Therefore, for the five percent level of significance used in this study, a test was considered robust in a particular condition if its empirical rate of Type I error fell within the interval $.025 \le \hat{\alpha} \le .075$. Correspondingly, a test was considered to be nonrobust if, for a particular condition, its Type I error rate was not contained in this interval. In the tables, boldfaced entries are used to denote these latter values. We chose this criterion since we feel that it provides a reasonable standard by which to judge robustness. That is, in our opinion, applied researchers should be comfortable working with a procedure that controls the rate of Type I error within these bounds, if the procedure limits the rate across a wide range of assumption violation conditions. Nonetheless, there is no one universal standard by which tests are judged to be robust, so different interpretations of the results are possible.

Tables III and IV contain empirical rates of Type I error for a completely randomized design containing four and six groups, respectively. The tabled data indicate that when the observations were obtained from normal distributions, rates of Type I error were controlled, as was reported by Gamage and Weerahandi (1998), Chen and Chen (1998) and Lix and Keselman (1998). However, our results also very clearly indicate that the procedures due to Gamage and Weerahandi (1998) (GW) and Chen and Chen (1998) (CC) cannot limit their rates of Type I error within Bradley's (1978) liberal limit when data are nonnormal. Indeed, even for our mildly skewed chi-square (6) distribution, rates were typically liberal, approaching values of approximately 10%. As the nonnormality of the sampled distribution increased, rates of error became progressively larger, attaining values in excess of 50%.
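Bradley's interval is simple enough to express directly. The following sketch classifies an empirical Type I error rate against the bounds $0.5\alpha$ and $1.5\alpha$; the function names, the small floating-point tolerance, and the convention of rejecting when a p-value is at or below $\alpha$ are assumptions of the sketch, not details taken from the study's SAS/IML program.

```python
def bradley_interval(alpha=0.05):
    """Return the bounds (0.5 * alpha, 1.5 * alpha) of Bradley's liberal criterion."""
    return 0.5 * alpha, 1.5 * alpha


def classify_rate(rate, alpha=0.05, tol=1e-12):
    """Label an empirical Type I error rate as conservative, robust, or liberal."""
    lower, upper = bradley_interval(alpha)
    if rate < lower - tol:  # tolerance guards against floating-point rounding
        return "conservative"
    if rate > upper + tol:
        return "liberal"
    return "robust"


def empirical_rate(p_values, alpha=0.05):
    """Proportion of replications rejected at level alpha (assumes p <= alpha rejects)."""
    return sum(p <= alpha for p in p_values) / len(p_values)
```

For example, at $\alpha = .05$ a rate of .10 would be labelled liberal and a rate of .02 conservative, while any rate from .025 through .075 counts as robust.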

TABLE III
Empirical Rates of Type I Error (J = 4)
Column headings: CON; $\sigma^2$s/ns; and, under each Population Type (Normal, $\chi^2_6$, $\chi^2_3$, g = 1 & h = 0, g = 1 & h = .5), GF, CC, LK. Rows: conditions A, B, C, D (two rows each). [Table body not available.]

TABLE IV
Empirical Rates of Type I Error (J = 6)
Column headings: CON; $\sigma^2$s/ns; and, under each Population Type (Normal, $\chi^2_6$, $\chi^2_3$, g = 1 & h = 0, g = 1 & h = .5), GF, CC, LK. Rows: conditions E, F, G, H (two rows each). [Table body not available.]

The procedure presented by Lix and Keselman (1998) (LK), however, was in most instances able to limit its rate of Type I error within Bradley's (1978) interval. Indeed, out of the 80 investigated conditions, the test was liberal in just 7 cases (there was also one conservative value).

5. DISCUSSION

We were not surprised to find that the Weerahandi (1995) and Chen and Chen (1998) procedures were not robust to heterogeneity of variances when data were also nonnormal in unbalanced designs. To date, most test statistics that are intended to cope with the effects of variance heterogeneity have been found to lack robustness when heterogeneity of variances occurs with data that are also nonnormal, particularly when group sizes are unequal. As we indicated in our introduction, no procedure for testing mean equality has been found to be uniformly robust to assumption violations when they occur simultaneously. However, this unfortunate state of affairs relates only to test statistics that use least squares measures of central tendency and variability. On the other hand, our results, and those presented by others, indicate that researchers can generally, though not uniformly, obtain a robust test of treatment performance equality by substituting robust measures of central tendency and variability into heteroscedastic test statistics (see, e.g., Keselman, Kowalchuk & Lix, 1998; Keselman, Lix & Kowalchuk, 1998; Keselman & Wilcox, 1999; Lix & Keselman, 1998). That is, by substituting 20% trimmed means and Winsorized variances into, say, the Welch (1951) test, one typically can achieve robustness to both nonnormality and variance heterogeneity, even in unbalanced designs.
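The substitution just described can be sketched in a few lines. The code below computes a trimmed mean, a Winsorized variance, and a Welch-type statistic with its approximate error degrees of freedom. It is a minimal sketch under stated assumptions: the function names are hypothetical, Python stands in for the SAS/IML used in the study, and each group's weight is taken as the reciprocal of the estimated squared standard error of its trimmed mean, $(n_j - 1) s_{wj}^2 / [h_j (h_j - 1)]$, which is one common formulation of the Welch-type test with trimmed means. No p-value is computed; the statistic would be referred to an F distribution with $J - 1$ and $\nu_t$ degrees of freedom.

```python
import math


def trimmed_mean(x, gamma=0.2):
    """Mean of the observations left after trimming g = floor(gamma * n) per tail."""
    xs = sorted(x)
    n = len(xs)
    g = int(math.floor(gamma * n))
    kept = xs[g:n - g]
    return sum(kept) / len(kept)


def winsorized_variance(x, gamma=0.2):
    """Sample variance after pulling the g smallest and g largest values inward."""
    xs = sorted(x)
    n = len(xs)
    g = int(math.floor(gamma * n))
    low, high = xs[g], xs[n - g - 1]  # X_(g+1) and X_(n-g)
    y = [min(max(v, low), high) for v in xs]
    y_bar = sum(y) / n
    return sum((v - y_bar) ** 2 for v in y) / (n - 1)


def welch_trimmed(groups, gamma=0.2):
    """Return (F_t, J - 1, nu_t) for a list of samples; gamma = 0 gives the usual Welch test."""
    J = len(groups)
    loc, w, df = [], [], []
    for x in groups:
        n = len(x)
        h = n - 2 * int(math.floor(gamma * n))  # effective sample size
        se2 = (n - 1) * winsorized_variance(x, gamma) / (h * (h - 1))
        loc.append(trimmed_mean(x, gamma))
        w.append(1.0 / se2)  # assumed weight: reciprocal squared standard error
        df.append(h - 1)
    W = sum(w)
    centre = sum(wj * m for wj, m in zip(w, loc)) / W
    numerator = sum(wj * (m - centre) ** 2 for wj, m in zip(w, loc)) / (J - 1)
    lam = sum((1.0 - wj / W) ** 2 / d for wj, d in zip(w, df))
    f_t = numerator / (1.0 + 2.0 * (J - 2) / (J ** 2 - 1) * lam)
    nu_t = (J ** 2 - 1) / (3.0 * lam)
    return f_t, J - 1, nu_t
```

Setting `gamma=0` removes the trimming, so the same function returns the ordinary Welch (1951) statistic based on least squares means and variances, which makes it easy to compare the two versions on the same data.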
The benefits of using robust estimators, that is, 20% trimmed means and Winsorized variances, instead of least squares estimators to combat the effects of nonnormality have been discussed extensively (see, e.g., Keselman, Kowalchuk & Lix, 1998; Keselman, Lix & Kowalchuk, 1998; Keselman & Wilcox, 1999; Lix & Keselman, 1998; Wilcox, 1995a,b, 1997). Finally, we note that we did not compare the power of the procedure presented by Lix and Keselman (1998) to those presented by Weerahandi
(1995) and Chen and Chen (1998) because the latter procedures were not able to control their rates of Type I error. That is, comparisons of power are only meaningful when the procedures being compared are capable of controlling their rates of Type I error. However, we should point out that the power characteristics of statistics based on robust estimators can be predicted from theory and prior work (Lix and Keselman, 1998). That is, as previously indicated, theory tells us that procedures based on sample means result in poor power because the standard error of the mean is inflated when distributions have heavy tails; however, this is less of a problem when working with trimmed means (see Tukey, 1960; Wilcox, 1995b). This phenomenon is illustrated in a number of sources. For example, Wilcox (1994b, 1995b) has presented results indicating that in the two sample and one-way problem, tests (i.e., t and F) based on the usual least squares estimators lose power when data contain outliers and/or are heavy tailed. Specifically, in the two sample problem, Wilcox (1994b) compared the Welch (1938) and Yuen (1974) procedures and found that when data were obtained from contaminated normal distributions (distributions that have thicker tails than the normal), the power of Welch's test was considerably lower than its power when data were normally distributed and was also lower than that of Yuen's test. Indeed, the power of Welch's test to detect nonnull effects went from .931 when distributions were normally distributed to .78 and .16 for the two contaminated normal distributions that were investigated; the corresponding power values for Yuen's test were .890, .784, and .60, respectively. Wilcox (1995b) presented similar results for four independent groups. Readers should also refer to the data presented by Lix and Keselman (1998), which compared the power values of other independent group statistics based on robust estimators.
ACKNOWLEDGEMENTS

This research was supported by a Natural Sciences and Engineering Research Council (Canada) grant.

BIBLIOGRAPHY

Bishop, T. A., and Dudewicz, E. J. (1978). Exact analysis of variance with unequal variances: Test procedures and tables. Technometrics, 20.
Bradley, J. V. (1978). Robustness? British Journal of Mathematical and Statistical Psychology, 31.
Brown, M. B., and Forsythe, A. B. (1974). The small sample behavior of some statistics which test the equality of several means. Technometrics, 16.
Chen, S., and Chen, H. J. (1998). Single-stage analysis of variance under heteroscedasticity. Communications in Statistics-Simulation and Computation, XX.
Fisher, R. A. (1935). The fiducial argument in statistical inference. Annals of Eugenics, 6.
Gamage, J., and Weerahandi, S. (1998). Size performance of some tests in one-way ANOVA. Communications in Statistics-Simulation and Computation, XX.
Gross, A. M. (1976). Confidence interval robustness with long-tailed symmetric distributions. Journal of the American Statistical Association, 71.
Hastings, N. A. J., and Peacock, J. B. (1975). Statistical distributions: A handbook for students and practitioners. New York: Wiley.
Hoaglin, D. C. (1985). Summarizing shape numerically: The g- and h-distributions. In D. Hoaglin, F. Mosteller, & J. Tukey (Eds.), Exploring data tables, trends, and shapes. New York: Wiley.
James, G. S. (1951). The comparison of several groups of observations when the ratios of the population variances are unknown. Biometrika, 38.
James, G. S. (1954). Tests of linear hypotheses in univariate and multivariate analysis when the ratios of the population variances are unknown. Biometrika, 41.
Keselman, H. J., Kowalchuk, R. K., and Lix, L. M. (1998). Robust nonorthogonal analyses revisited: An update based on trimmed means. Psychometrika, 63.
Keselman, H. J., Lix, L. M., and Kowalchuk, R. K. (1998). Multiple comparison procedures for trimmed means. Psychological Methods, 3.
Keselman, H. J., and Wilcox, R. R. (1999). The 'improved' Brown and Forsythe test for mean equality: Some things can't be fixed. Communications in Statistics-Simulation and Computation, 28(3).
Lix, L. M., and Keselman, H. J. (1995). Approximate degrees of freedom tests: A unified perspective on testing for mean equality. Psychological Bulletin, 117.
Lix, L. M., and Keselman, H. J. (1998). To trim or not to trim: Tests of mean equality under heteroscedasticity and nonnormality. Educational and Psychological Measurement, 58 (Errata: 58, 853).
Micceri, T. (1989). The unicorn, the normal curve, and other improbable creatures. Psychological Bulletin, 105.
SAS Institute Inc. (1989). SAS/IML software: Usage and reference, version 6 (1st ed.). Cary, NC: Author.
Sawilowsky, S. S., and Blair, R. C. (1992). A more realistic look at the robustness and Type II error probabilities of the t test to departures from population normality. Psychological Bulletin, 111.
Tukey, J. W. (1960). A survey of sampling from contaminated normal distributions. In I. Olkin et al. (Eds.), Contributions to probability and statistics. Stanford, CA: Stanford University Press.
Weerahandi, S. (1995). ANOVA under unequal error variances. Biometrics, 51.
Welch, B. L. (1938). The significance of the difference between two means when the population variances are unequal. Biometrika, 29.
Welch, B. L. (1951). On the comparison of several mean values: An alternative approach. Biometrika, 38.
Wilcox, R. R. (1994a). A one-way random effects model for trimmed means. Psychometrika, 59.
Wilcox, R. R. (1994b). Some results on the Tukey-McLaughlin and Yuen methods for trimmed means when distributions are skewed. Biometrical Journal, 36.
Wilcox, R. R. (1995a). ANOVA: A paradigm for low power and misleading measures of effect size? Review of Educational Research, 65(1).
Wilcox, R. R. (1995b). ANOVA: The practical importance of heteroscedastic methods, using trimmed means versus means, and designing simulation studies. British Journal of Mathematical and Statistical Psychology, 48.
Wilcox, R. R. (1996a). Statistics for the social sciences. New York: Academic Press.
Wilcox, R. R. (1997). Introduction to robust estimation and hypothesis testing. New York: Academic Press.
Wilcox, R. R., Keselman, H. J., and Kowalchuk, R. K. (1998). Can tests for treatment group equality be improved?: The bootstrap and trimmed means conjecture. British Journal of Mathematical and Statistical Psychology, 51.
Yuen, K. K. (1974). The two-sample trimmed t for unequal population variances. Biometrika, 61.
