AN EMPIRICAL INVESTIGATION OF TUKEY'S HONESTLY SIGNIFICANT DIFFERENCE TEST WITH VARIANCE HETEROGENEITY AND EQUAL SAMPLE SIZES, UTILIZING BOX'S COEFFICIENT OF VARIANCE VARIATION
AN EMPIRICAL INVESTIGATION OF TUKEY'S HONESTLY SIGNIFICANT DIFFERENCE TEST WITH VARIANCE HETEROGENEITY AND EQUAL SAMPLE SIZES, UTILIZING BOX'S COEFFICIENT OF VARIANCE VARIATION

DISSERTATION

Presented to the Graduate Council of the North Texas State University in Partial Fulfillment of the Requirements

For the Degree of

DOCTOR OF PHILOSOPHY

By

Michael W. Strozeski, B.S., M.Ed.

Denton, Texas

May, 1980
© 1980 MICHAEL WAYNE STROZESKI ALL RIGHTS RESERVED
Strozeski, Michael Wayne, An Empirical Investigation of Tukey's Honestly Significant Difference Test with Variance Heterogeneity and Equal Sample Sizes, Utilizing Box's Coefficient of Variance Variation. Doctor of Philosophy (Educational Research), May, 1980, 145 pp., 50 tables, bibliography, 50 titles.

This study sought to determine boundary conditions for robustness of the Tukey HSD statistic when the assumption of homogeneity of variance was violated. Box's coefficient of variance variation, C, was utilized to index the degree of variance heterogeneity. Selected numbers of comparison groups and equal sample sizes were evaluated. Tukey's HSD statistic was declared robust if the actual significance level fell within the 95 per cent confidence limits around the corresponding nominal significance level. A Monte Carlo computer simulation technique was employed to generate data under controlled violation of the homogeneity of variance assumption. For each sample size and number of treatment groups condition, an analysis of variance F-test was computed, and Tukey's multiple comparison technique was calculated. This procedure was repeated 4,000 times; the actual level of significance was determined and compared to the nominal significance level of 0.05. The index of variance variation was systematically adjusted, and this procedure
was repeated until the C value was reached such that any increase in its value would produce an FWI error rate that exceeded the upper limit of the 95 per cent confidence interval about the 0.05 level of significance, thereby establishing a boundary for C.

On the basis of the synthesis and analysis of the generated data, the following conclusions were drawn. First, the Tukey HSD statistic was found to be generally robust when the violations of homogeneity of variances were of small magnitude. In all cases, however, as the value of C was increased from zero, a point was reached at which the Tukey HSD statistic was no longer robust and too many FWI errors were produced. Second, when either the violation of the homogeneity of variance assumption was more pronounced (C values were larger) or the number of treatment groups increased, discrepancies between the actual and nominal significance levels occurred. With larger numbers of treatment groups, Tukey's HSD was less robust. The boundary value for C decreased as the number of treatment groups increased. As C values were increased, FWI errors increased. Third, Tukey's HSD was found to be more robust with larger sample sizes. This trend was generally supported in all of the sample size groups proposed for this study. This conclusion was further supported by the addition of forty-eight and seventy-two sample size groups to the five treatment groups experiment. In both of these additional sample size
cases, the C value was greatly increased by the larger sample sizes. A fourth and final conclusion was reached. When the two additional sample size cases were added to investigate the large sample sizes, the Tukey test was found to be conservative when C was set at zero. The actual significance level fell below the lower limit of the 95 per cent confidence interval around the 0.05 nominal significance level. Apparently, large sample sizes decrease the likelihood of an FWI error but may increase the likelihood of a Type II error.
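The simulation procedure described above can be sketched in modern terms. The helper below is an illustrative reconstruction, not the dissertation's FORTRAN program: it estimates the familywise Type I (FWI) error rate of Tukey's HSD for k equal-mean groups whose population variances may differ, and omits the preliminary ANOVA F-test step for brevity. The function name and the use of SciPy's studentized-range distribution are my own choices.

```python
import numpy as np
from scipy.stats import studentized_range

def fwi_error_rate(k, n, sigmas, alpha=0.05, reps=2000, seed=1):
    """Estimate the familywise Type I error rate of Tukey's HSD when all
    k population means are equal but standard deviations `sigmas` may differ."""
    rng = np.random.default_rng(seed)
    df_error = k * (n - 1)
    # Percentage point of the studentized range for k groups, df_error df.
    q_crit = studentized_range.ppf(1 - alpha, k, df_error)
    false_rejections = 0
    for _ in range(reps):
        # k groups of size n, equal means (0), possibly heterogeneous variances.
        groups = [rng.normal(0.0, s, n) for s in sigmas]
        means = np.array([g.mean() for g in groups])
        ms_error = np.mean([g.var(ddof=1) for g in groups])  # balanced-design MSE
        hsd = q_crit * np.sqrt(ms_error / n)
        # Any pairwise mean difference exceeding HSD is a familywise Type I error.
        if means.max() - means.min() > hsd:
            false_rejections += 1
    return false_rejections / reps
```

With homogeneous variances the estimated rate should hover near the nominal 0.05; raising the spread of `sigmas` (i.e., raising C) is the manipulation the study tracks.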
TABLE OF CONTENTS

LIST OF TABLES

Chapter

I. INTRODUCTION
   Statement of the Problem
   Purpose of the Study
   Hypothesis
   Mathematical Model of Tukey's HSD Statistic
   Definition of Terms
   Delimitations
   Chapter Bibliography

II. SURVEY OF RELATED RESEARCH
   Chapter Bibliography

III. PROCEDURE FOR DATA COLLECTION
   Procedures for Producing Data
   Model Validation
   Statistical Tests of Pseudorandom Numbers
   Experiment Simulation Procedure
   Summary of Procedures
   Chapter Bibliography

IV. ANALYSIS OF DATA AND FINDINGS
   Part 1. k = Three Treatment Groups
   Part 2. k = Four Treatment Groups
   Part 3. k = Five Treatment Groups
   Part 4. k = Six Treatment Groups
   Part 5. k = Seven Treatment Groups
   Part 6. Larger Samples

V. SUMMARY, CONCLUSIONS, IMPLICATIONS, AND RECOMMENDATIONS
   Summary
   Conclusions
   Implications
   Recommendations
   Chapter Bibliography

APPENDIX A
APPENDIX B
APPENDIX C
APPENDIX D
APPENDIX E
APPENDIX F

BIBLIOGRAPHY
LIST OF TABLES

Table

1. Actual Levels of Significance Under Conditions of Non-Violation of the Assumptions Underlying the Use of Tukey's HSD Statistic
2. Number of Treatment Groups and Size of Sample Per Experiment Condition
3. Comparison of Actual Significance Levels for Familywise Type I Error Rates to the Nominal 0.05 Significance Level in Simulated Experiments on the Tukey HSD Test for k=3 Groups with n=3 Observations in Each Group for Varying Degrees of Variance Variation, C
4. Comparison of Actual Significance Levels for Familywise Type I Error Rates to the Nominal 0.05 Significance Level in Simulated Experiments on the Tukey HSD Test for k=3 Groups with n=6 Observations in Each Group for Varying Degrees of Variance Variation, C
5. Comparison of Actual Significance Levels for Familywise Type I Error Rates to the Nominal 0.05 Significance Level in Simulated Experiments on the Tukey HSD Test for k=3 Groups with n=12 Observations in Each Group for Varying Degrees of Variance Variation, C
6. Comparison of Actual Significance Levels for Familywise Type I Error Rates to the Nominal 0.05 Significance Level in Simulated Experiments on the Tukey HSD Test for k=3 Groups with n=24 Observations in Each Group for Varying Degrees of Variance Variation, C
7. Comparison of Actual Significance Levels for Familywise Type I Error Rates to the Nominal 0.05 Significance Level in Simulated Experiments on the Tukey HSD Test for k=4 Groups with n=3 Observations in Each Group for Varying Degrees of Variance Variation, C
8. Comparison of Actual Significance Levels for Familywise Type I Error Rates to the Nominal 0.05 Significance Level in Simulated Experiments on the Tukey HSD Test for k=4 Groups with n=6 Observations in Each Group for Varying Degrees of Variance Variation, C
9. Comparison of Actual Significance Levels for Familywise Type I Error Rates to the Nominal 0.05 Significance Level in Simulated Experiments on the Tukey HSD Test for k=4 Groups with n=12 Observations in Each Group for Varying Degrees of Variance Variation, C
10. Comparison of Actual Significance Levels for Familywise Type I Error Rates to the Nominal 0.05 Significance Level in Simulated Experiments on the Tukey HSD Test for k=4 Groups with n=24 Observations in Each Group for Varying Degrees of Variance Variation, C
11. Comparison of Actual Significance Levels for Familywise Type I Error Rates to the Nominal 0.05 Significance Level in Simulated Experiments on the Tukey HSD Test for k=5 Groups with n=3 Observations in Each Group for Varying Degrees of Variance Variation, C
12. Comparison of Actual Significance Levels for Familywise Type I Error Rates to the Nominal 0.05 Significance Level in Simulated Experiments on the Tukey HSD Test for k=5 Groups with n=6 Observations in Each Group for Varying Degrees of Variance Variation, C
13. Comparison of Actual Significance Levels for Familywise Type I Error Rates to the Nominal 0.05 Significance Level in Simulated Experiments on the Tukey HSD Test for k=5 Groups with n=12 Observations in Each Group for Varying Degrees of Variance Variation, C
14. Comparison of Actual Significance Levels for Familywise Type I Error Rates to the Nominal 0.05 Significance Level in Simulated Experiments on the Tukey HSD Test for k=5 Groups with n=24 Observations in Each Group for Varying Degrees of Variance Variation, C
15. Comparison of Actual Significance Levels for Familywise Type I Error Rates to the Nominal 0.05 Significance Level in Simulated Experiments on the Tukey HSD Test for k=6 Groups with n=3 Observations in Each Group for Varying Degrees of Variance Variation, C
16. Comparison of Actual Significance Levels for Familywise Type I Error Rates to the Nominal 0.05 Significance Level in Simulated Experiments on the Tukey HSD Test for k=6 Groups with n=6 Observations in Each Group for Varying Degrees of Variance Variation, C
17. Comparison of Actual Significance Levels for Familywise Type I Error Rates to the Nominal 0.05 Significance Level in Simulated Experiments on the Tukey HSD Test for k=6 Groups with n=12 Observations in Each Group for Varying Degrees of Variance Variation, C
18. Comparison of Actual Significance Levels for Familywise Type I Error Rates to the Nominal 0.05 Significance Level in Simulated Experiments on the Tukey HSD Test for k=6 Groups with n=24 Observations in Each Group for Varying Degrees of Variance Variation, C
19. Comparison of Actual Significance Levels for Familywise Type I Error Rates to the Nominal 0.05 Significance Level in Simulated Experiments on the Tukey HSD Test for k=7 Groups with n=3 Observations in Each Group for Varying Degrees of Variance Variation, C
20. Comparison of Actual Significance Levels for Familywise Type I Error Rates to the Nominal 0.05 Significance Level in Simulated Experiments on the Tukey HSD Test for k=7 Groups with n=6 Observations in Each Group for Varying Degrees of Variance Variation, C
21. Comparison of Actual Significance Levels for Familywise Type I Error Rates to the Nominal 0.05 Significance Level in Simulated Experiments on the Tukey HSD Test for k=7 Groups with n=12 Observations in Each Group for Varying Degrees of Variance Variation, C
22. Comparison of Actual Significance Levels for Familywise Type I Error Rates to the Nominal 0.05 Significance Level in Simulated Experiments on the Tukey HSD Test for k=7 Groups with n=24 Observations in Each Group for Varying Degrees of Variance Variation, C
23. Comparison of Actual Significance Levels for Familywise Type I Error Rates to the Nominal 0.05 Significance Level in Simulated Experiments on the Tukey HSD Test for k=5 Groups with n=48 Observations in Each Group for Varying Degrees of Variance Variation, C
24. Comparison of Actual Significance Levels for Familywise Type I Error Rates to the Nominal 0.05 Significance Level in Simulated Experiments on the Tukey HSD Test for k=5 Groups with n=72 Observations in Each Group for Varying Degrees of Variance Variation, C
25. Degree of Variance Variation, C, Above which the Actual Significance Level Significantly Differed from the Nominal 0.05 Significance Level
26. Ninety-Five Per Cent Confidence Limits for a Proportion Corresponding to a Nominal Significance Level
27. Obtained Versus Expected Means and Variances for k=3 Treatment Groups with n=3 Observations in Each Group and the Expected Value of α for Each Computer Run of 4,000 Experiments
28. Obtained Versus Expected Means and Variances for k=3 Treatment Groups with n=6 Observations in Each Group and the Expected Value of α for Each Computer Run of 4,000 Experiments
29. Obtained Versus Expected Means and Variances for k=3 Treatment Groups with n=12 Observations in Each Group and the Expected Value of α for Each Computer Run of 4,000 Experiments
30. Obtained Versus Expected Means and Variances for k=3 Treatment Groups with n=24 Observations in Each Group and the Expected Value of α for Each Computer Run of 4,000 Experiments
31. Obtained Versus Expected Means and Variances for k=4 Treatment Groups with n=3 Observations in Each Group and the Expected Value of α for Each Computer Run of 4,000 Experiments
32. Obtained Versus Expected Means and Variances for k=4 Treatment Groups with n=6 Observations in Each Group and the Expected Value of α for Each Computer Run of 4,000 Experiments
33. Obtained Versus Expected Means and Variances for k=4 Treatment Groups with n=12 Observations in Each Group and the Expected Value of α for Each Computer Run of 4,000 Experiments
34. Obtained Versus Expected Means and Variances for k=4 Treatment Groups with n=24 Observations in Each Group and the Expected Value of α for Each Computer Run of 4,000 Experiments
35. Obtained Versus Expected Means and Variances for k=5 Treatment Groups with n=3 Observations in Each Group and the Expected Value of α for Each Computer Run of 4,000 Experiments
36. Obtained Versus Expected Means and Variances for k=5 Treatment Groups with n=6 Observations in Each Group and the Expected Value of α for Each Computer Run of 4,000 Experiments
37. Obtained Versus Expected Means and Variances for k=5 Treatment Groups with n=12 Observations in Each Group and the Expected Value of α for Each Computer Run of 4,000 Experiments
38. Obtained Versus Expected Means and Variances for k=5 Treatment Groups with n=24 Observations in Each Group and the Expected Value of α for Each Computer Run of 4,000 Experiments
39. Obtained Versus Expected Means and Variances for k=6 Treatment Groups with n=3 Observations in Each Group and the Expected Value of α for Each Computer Run of 4,000 Experiments
40. Obtained Versus Expected Means and Variances for k=6 Treatment Groups with n=6 Observations in Each Group and the Expected Value of α for Each Computer Run of 4,000 Experiments
41. Obtained Versus Expected Means and Variances for k=6 Treatment Groups with n=12 Observations in Each Group and the Expected Value of α for Each Computer Run of 4,000 Experiments
42. Obtained Versus Expected Means and Variances for k=6 Treatment Groups with n=24 Observations in Each Group and the Expected Value of α for Each Computer Run of 4,000 Experiments
43. Obtained Versus Expected Means and Variances for k=7 Treatment Groups with n=3 Observations in Each Group and the Expected Value of α for Each Computer Run of 4,000 Experiments
44. Obtained Versus Expected Means and Variances for k=7 Treatment Groups with n=6 Observations in Each Group and the Expected Value of α for Each Computer Run of 4,000 Experiments
45. Obtained Versus Expected Means and Variances for k=7 Treatment Groups with n=12 Observations in Each Group and the Expected Value of α for Each Computer Run of 4,000 Experiments
46. Obtained Versus Expected Means and Variances for k=7 Treatment Groups with n=24 Observations in Each Group and the Expected Value of α for Each Computer Run of 4,000 Experiments
47. Ten Per Cent Intervals for the Normal Distribution with a Mean of Zero and a Standard Deviation of One
48. Expected and Observed Frequencies of One Hundred Numbers in Ten Per Cent Intervals Corresponding to a Normal Distribution
49. A Summary of Variance Heterogeneity as Indexed by C and the Corresponding Ratio of Variances which Resulted in Familywise Type I Error Rates in Excess of the Nominal 0.05 Significance Level
50. Critical F max Values for Corresponding Degrees of Variance Variation
CHAPTER I

INTRODUCTION

A very frequent concern of educational researchers is determining whether or not k group means differ from one another. The analysis of variance (ANOVA) is often used to test whether or not sample means are indicative of experimental treatment effects or of merely chance variation. Experimenters usually follow a significant F-test in analysis of variance with a multiple comparison statistic when k is greater than two, because ANOVA indicates only the presence of overall treatment effects. Multiple comparison statistics enable the researcher to locate the specific mean differences which have caused the ANOVA F-test to be significant. Tukey's multiple comparison test is a frequently cited procedure when the researcher's multiple comparison hypotheses are for pairwise differences (Games, 1971; Keselman and Toothaker, 1974; Marascuilo, 1971). Tukey's multiple comparison test specifies the familywise Type I error rate at α for a family of tests on all possible pairs of means, allowing the error rate per comparison to decrease as k increases. According to Petrinovich and Hardyck (1969), little has been published on the characteristics and properties of Tukey's Honestly Significant
Difference (HSD) procedure. Petrinovich and Hardyck provided more information about Tukey's HSD procedure, but Games (1971) indicated that they provided only limited evidence and that further study is needed. Glass, Peckham, and Sanders (1972) indicated that the role of unequal variances in combination with equal sample sizes appears to have boundary conditions which have not been sufficiently probed. Agreeing with Glass, Peckham, and Sanders, authors Rogan, Keselman, and Breen (1977) state that data from their investigations indicate that the degree of variance heterogeneity may play some part in determining those boundary conditions. This study was designed to provide further evidence about the robustness of Tukey's HSD procedure.

Whenever populations differ with respect to variances and the means are equal, statistical tests designed to detect a mean difference can be influenced by the difference in variances. The statistical test may yield more or fewer significant results by chance than would be expected. Evidence for the results being influenced by the unequal variances is obtained when significant departures from expected results are found based on the familywise Type I (FWI) error rate at α when means are equal. Any study of robustness of a statistical procedure involves creating differences in parameters other than the parameter for which the statistical procedure was designed
to test a difference. The variable manipulated in this study was the population variance. This research was performed to determine the robustness of Tukey's HSD procedure in the presence of variance heterogeneity. Variance heterogeneity was indexed by use of Box's (1954) coefficient of variance variation.

Statement of the Problem

The problem of this study was the effect of violating the assumption of homogeneity of variance with equal sample sizes upon Tukey's Honestly Significant Difference (HSD) multiple comparison procedure, utilizing Box's coefficient to index the degree of variance variation.

Purpose of the Study

The purpose of this study was to empirically evaluate the effects of varying degrees of heterogeneity and equal sample sizes when the degree of variance variation was within a range of 0.00 to k-1, where k equals the number of treatment groups.

Hypothesis

The following hypothesis was formulated to carry out the purpose of this study [C = coefficient of variance variation, which indexes the degree of heterogeneity; n = sample size; k = number of samples]:
Using Tukey's (HSD) procedure, actual significance levels will not differ significantly from nominal significance levels at the 0.05 level of significance when C has a value from 0.00 to k-1 for experimental conditions of n = 3, 6, 12, and 24, and k = 3, 4, 5, 6, and 7.

Mathematical Model of Tukey's HSD Statistic

Tukey's HSD statistic was mathematically defined by Kirk (1968, p. 88) as

    HSD = q(α,v) * sqrt(MS_error / n)     (1)

where

HSD = the value to be exceeded for a comparison involving two means to be declared significant;

q(α,v) = the value determined by entering a table for the percentage points of the studentized range with v degrees of freedom corresponding to the MS error term degrees of freedom, α level of significance, and the number of treatment groups in the experiment or range of levels in the experiment;

MS_error = an estimate taken from the one-way analysis of variance mean square within groups of the experiment;

n = the sample size of each group.
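Equation (1) can be evaluated numerically. The sketch below is mine, not the dissertation's: it assumes a balanced one-way design with v = k(n-1) error degrees of freedom and uses SciPy's studentized-range distribution in place of the printed percentage-point tables the author would have consulted.

```python
import math
from scipy.stats import studentized_range

def tukey_hsd(ms_error, n, k, alpha=0.05):
    """Critical difference from equation (1): HSD = q(alpha, v) * sqrt(MS_error / n),
    with v = k*(n - 1) error degrees of freedom for a balanced one-way ANOVA."""
    v = k * (n - 1)
    # Percentage point of the studentized range for k means and v df.
    q = studentized_range.ppf(1 - alpha, k, v)
    return q * math.sqrt(ms_error / n)
```

For example, `tukey_hsd(ms_error=2.0, n=6, k=3)` gives the difference two group means must exceed to be declared significant at the 0.05 level; holding n and MS_error fixed, the critical difference grows with k, which is how the procedure keeps the familywise error rate at α.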
If the difference between two groups exceeded the HSD value, then the results were declared significant at the given α level.

Definition of Terms

Actual Significance Level. The percentage of computed statistical values which exceed the tabled value of the statistic in an empirical investigation.

Coefficient of Variance Variation, C. The degree of heterogeneity present in an experimental paradigm as indexed by a coefficient of variance variation, C, in the formula

    C = sqrt[ (1/k) * Σ(t=1 to k) (σ_t² − σ̄²)² ] / σ̄²     (2)

where σ_t² is the variance of the t-th population and σ̄² is the mean of the k population variances.

Familywise Error Rate. An error rate that is the ratio of the number of families with at least one statement (comparison) falsely declared significant to the total number of families.

Monte Carlo Simulation. A procedure in which random samples are drawn from populations having specified parameters, and then a given statistic is calculated.

Nominal Significance Level. The percentage of computed statistical values which exceed the tabled value of the statistic for the theoretical distribution.
Pseudorandom Numbers. Pseudorandom numbers are "pseudo" since once the generating sequence is begun, each number is precisely determined by the preceding number. Pseudorandom numbers have the basic properties of randomness, which makes them quite usable in simulation studies (Lehman and Bailey, 1968). Hereafter in this study, pseudorandom numbers are referred to as random numbers.

Robust. When a violation of an assumption underlying a statistical model does not seriously affect the result, then that statistical model is said to be robust.

Significant Difference between Nominal and Actual Significance Levels. An actual significance level which fails to fall within a 95 per cent confidence interval about the nominal significance level is said to be statistically different from the nominal significance level.

Limitations

This study was subject to experimental limitations due to experimental conditions simulated with the following conditions:

1. A selected number (3 to 7) of treatment groups was considered.
2. Selected equal sample sizes were employed, varying from three to twenty-four.
3. Degrees of variance heterogeneity were selected, ranging from 0.00 to a possible maximum of k-1.
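The coefficient C defined above can be computed directly from a set of population variances. The following is a minimal sketch based on the formula as reconstructed in equation (2); the function name is mine, not the dissertation's.

```python
import math

def variance_variation(variances):
    """Box's coefficient of variance variation, per equation (2):
    C = sqrt((1/k) * sum((var_t - mean_var)**2)) / mean_var."""
    k = len(variances)
    mean_var = sum(variances) / k
    spread = sum((v - mean_var) ** 2 for v in variances) / k
    return math.sqrt(spread) / mean_var

# Homogeneous variances give C = 0; heterogeneity raises C.
print(variance_variation([1.0, 1.0, 1.0]))  # 0.0
print(variance_variation([1.0, 1.0, 4.0]))
```

C is the coefficient of variation of the k population variances, so it is zero exactly when the homogeneity assumption holds and grows as the variances spread apart, which is why it serves as the study's index of violation severity.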
CHAPTER BIBLIOGRAPHY

Box, G. E. P. Some theorems on quadratic forms applied in the study of analysis of variance problems. I. Effect of inequality of variance in the one-way classification. Annals of Mathematical Statistics, 1954, 25.

Games, Paul A. Multiple comparisons of means. American Educational Research Journal, 1971, 8(3).

Glass, G. V., Peckham, P. D., and Sanders, J. R. Consequences of failure to meet assumptions underlying the fixed effects analysis of variance and covariance. Review of Educational Research, 1972, 42(3).

Keselman, H. J., and Toothaker, L. E. Comparison of Tukey's T-method and Scheffé's S-method for various numbers of all possible differences of averages contrast under violation of assumptions. Educational and Psychological Measurement, 1974, 34.

Kirk, Roger E. Experimental design: procedures for the behavioral sciences. Belmont, California: Brooks/Cole Publishing Company, 1968.

Lehman, R., and Bailey, D. E. Digital computing: FORTRAN IV and its applications to the behavioral sciences. New York: John Wiley and Sons, 1968.
Marascuilo, L. A. Statistical methods for behavioral science research. New York: McGraw-Hill, 1971.

Petrinovich, L. R., and Hardyck, C. D. Error rates for multiple comparison methods: some evidence concerning the frequency of erroneous conclusions. Psychological Bulletin, 1969, 71.

Rogan, J. C., Keselman, H. J., and Breen, L. J. Assumption violations and rates of Type I error for the Tukey multiple comparison test: a review and empirical investigation via a coefficient of variance variation. Journal of Experimental Education, 1977, 46(1), 20-25.
CHAPTER II

SURVEY OF RELATED RESEARCH

The effects of violating the assumptions underlying the fixed-effects analysis of variance (ANOVA) on Type I error rate have been of great concern to educational researchers and statisticians since before 1930 (Pearson, 1929). For the most part, the major effects of violation of assumptions underlying ANOVA are now quite well known. Concern about whether or not ANOVA assumptions are satisfied is not unfounded. Assumptions of most mathematical models are almost always false to some extent. The important question to be asked is not whether these assumptions have been exactly met, but whether violations of these assumptions have had any serious effects on the probability statements that have been formulated based on the standard assumptions.

Applied statistics in education and the social sciences experienced a largely unnecessary hegira to non-parametric statistics during the 1950s. Increasingly during the 1950s and early 1960s the fixed effects, normal theory ANOVA was replaced by such comparable nonparametric techniques as the Wilcoxon test, Mann-Whitney U-test, Kruskal-Wallis one-way ANOVA, and the Friedman two-way ANOVA for ranks
[Siegel, 1956]. The change to non-parametrics was unnecessary primarily because researchers asked, 'Are normal theory ANOVA assumptions met?' instead of 'How important are the inevitable violations of normal theory ANOVA assumptions?' (Glass, Peckham, and Sanders, 1972, p. 237).

The following assumptions were made for the simple one-way fixed effects model ANOVA in this study:

    1. X_ij = μ + τ_j + e_ij     (3)
    2. e_ij ~ NID(0, σ²)         (4)
    3. Σ_j τ_j = 0               (5)

The first assumption was that of additivity. Any observation was taken to be the simple sum of three components: first, μ, the population mean; second, τ_j, the effect of treatment j on the dependent variable for all of the observations in group j; and third, e_ij, the error of the i,jth observation. The second assumption was that the e_ij's have a normal distribution with a population mean of zero and a variance of σ² and that they were independent. According to Glass, Peckham, and Sanders (1972), the third assumption need be of little concern; it is merely a consequence of choosing to express X_ij in three terms (μ, τ_j, e_ij) instead of two, for example, μ_j = μ + τ_j and e_ij.
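The model in assumptions (3)-(5) can be made concrete by generating data from it. The sketch below is illustrative only (names and parameter values are mine): it draws an n-by-k matrix of observations X_ij = μ + τ_j + e_ij, with normally distributed, independent errors and treatment effects that sum to zero.

```python
import numpy as np

def simulate_oneway(mu, taus, sigma, n, seed=0):
    """Generate an n-by-k matrix under the fixed-effects model
    X_ij = mu + tau_j + e_ij, with e_ij ~ NID(0, sigma^2).
    `taus` must sum to zero, per assumption (5)."""
    assert abs(sum(taus)) < 1e-12  # enforce the side condition on treatment effects
    rng = np.random.default_rng(seed)
    k = len(taus)
    errors = rng.normal(0.0, sigma, size=(n, k))  # the e_ij terms
    return mu + np.asarray(taus) + errors         # broadcast tau_j across rows

data = simulate_oneway(mu=50.0, taus=[-2.0, 0.0, 2.0], sigma=1.0, n=1000)
```

With a large n, each column mean approaches μ + τ_j and the grand mean approaches μ, which is the additivity assumption at work; the heterogeneity studied in this dissertation amounts to replacing the single σ with a different σ_j per group.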
Three different violations of assumptions have been considered in the past: (a) non-normality, (b) different variances for different groups, and (c) non-independence. The thrust of this study was to investigate the (b) violation, i.e., heterogeneity of variances.

Hsu (1938) was one of the first to obtain concise mathematical results in the study of the effects of heterogeneous variances. Hsu determined the actual significance level of a result tested at the 0.05 level for different values of the ratio of σ₁² to σ₂² in a two-tailed t test. Scheffé (1959) and Pratt (1964) addressed the same problem. Box (1954) studied the effect on alpha level of heterogeneous variances in the one-way ANOVA. One example of Box's findings was that if three treatments were compared with n₁ = 9, n₂ = 5, and n₃ = 1, and the population variances were in the ratio of 1:1:3, the probability of a Type I error was actually 0.17 when the experimenter would expect it to be 0.05. Of particular interest for this study was that Box's results agreed quite closely with those of Hsu. When n's were equal, the actual and the nominal significance levels agreed quite closely. Also of special interest for this study was the finding that with seven groups of n=3 (equal n's) and a variance ratio of 1:1:1:1:1:1:7, Box found an actual significance level of 0.12 when the nominal significance level of 0.05 was expected.
One of the most significant and comprehensive studies was made by Dee W. Norton at the State University of Iowa in 1952 (Lindquist, 1956). From Norton's investigations, it appeared that marked heterogeneity of variance has a small but real effect on the form of the F-distribution. Kohr and Games (1974) indicated that the F-test was robust with regard to heterogeneity of variance with equal sample sizes but more susceptible to error when sample sizes are unequal. The F-test and the analysis of variance have been investigated (Atiqullah, 1962; Norton, 1952 [found in Lindquist]; Pearson, 1931; and Scheffé, 1959), with the conclusion that they have a high degree of robustness. The result of robustness for the analysis of variance has precipitated similar questions concerning assumptions underlying multiple comparison procedures.

Hypotheses about mean differences from a set of k means (k > 2) may provide a situation that requires the use of some multiple comparison technique. According to Kirk (1968), the analysis of variance is equivalent to a simultaneous test of the hypothesis that all possible comparisons among means are equal to zero.... If an over-all test of significance using an F-ratio is significant, an experimenter can be certain that some set of orthogonal comparisons contains at least one significant comparison among means.... It remains for an experimenter to carry out follow-up tests [multiple comparisons] to determine what has happened (p. 73).
One solution for determining the location of a significant difference was to use multiple t tests; but according to Games (1971), although this procedure has been found to be powerful, it allowed the familywise (FWI) error rate to increase as the number of t tests increased, sometimes resulting in an unacceptable error rate. In order to locate significant differences in means without producing a high FWI rate, other multiple comparison procedures have been developed. Games (1971) reported twelve different multiple comparison procedures. Games discussed and compared the multiple t test, Scheffé's test, the least significant difference test, Bonferroni's t statistic, Tukey's procedure, and Dunnett's test. Games also reviewed sequential multiple comparison techniques, including the Newman-Keuls test and Duncan's multiple range test.

Evidence that multiple comparison procedures are controversial topics in statistics was presented by Petrinovich and Hardyck (1969) when they stated that

Textbook authors at least in the area of psychological statistics have not been particularly helpful. Authors such as Edwards [1960], Federer [1955], Hays [1963], McNemar [1952], and Winer [1962] either offer no evaluation as to which method is preferable, or preface their remarks with a cautionary statement to the effect that mathematical statisticians are not entirely in agreement concerning the preferred
method. Similarly, disagreement exists as to when these methods may be used. Some discussions state that a significant F ratio over all conditions must be obtained before multiple comparison methods can be used; other discussions make no mention of such a requirement, or deny that it is necessary at all (p. 44).

Hopkins and Chadbourn (1967) [found in Games, 1971, p. 559] suggested that the overall F-test be routinely run first; then, if it is found to be significant, a multiple comparison procedure should follow. According to them, this second stage should be the Bonferroni t procedure, the Newman-Keuls, the Tukey wholly significant difference test (WSD), or the Scheffé, depending on certain factors. According to Games (1971),

There seems to be little point in applying the overall F-test prior to running C contrasts by procedures that set P(EI>0) ≤ alpha (method 3 and the Bonferroni t's). If the C contrasts express the experimental interests directly, they are justified whether the overall F is significant or not and P(EI>0) is still controlled. The Newman-Keuls and WSD also control P(EI>0), so do not need a significant F to justify them (p. 560).

Here, Games used the symbols P(EI>0) to represent the familywise risk of Type I error. The familywise rate was the risk of making one or more Type I errors in the entire set of contrasts that comprise a family.
Tukey's multiple comparison test has been a frequently cited procedure when the researcher's multiple comparison hypotheses are for pairwise differences (Games, 1971; Keselman and Toothaker, 1974; Marascuilo, 1971). Evidence of interest in Tukey's HSD procedure has been presented in papers published in education, psychology, and statistics journals (Howell and Games, 1973a, 1973b; Keselman, Murray, and Rogan, 1976; Keselman, Toothaker, and Shooter, 1975; Petrinovich and Hardyck, 1969; Steel and Torrie, 1966). For the most part, these papers have investigated the effects of the violation of the assumptions under which the Tukey test was derived. The importance of these studies has been related to the validity of the use of the Tukey test in actual educational situations, because these actual educational situations seldom, if ever, meet the assumptions under which the Tukey test was developed.

Just as in the case of the ANOVA F-test, Tukey's HSD test was derived under the assumptions that the observations of each of the populations under study are independently and normally distributed with equal variances. Further, Tukey's HSD method was derived under the restriction that the variances of the sample means be equal; hence, each sample mean must be based on an equal number of observations. When the requirement of equal sample sizes cannot be met, several unequal n forms of the Tukey procedure have been suggested. Winer (1962, p. 101) suggested that the estimated variance, S²/n, should be replaced with the average of the
variances of the means when sample sizes do not differ a great deal. Steel and Torrie (1966, p. 114) suggested the use of the Kramer method with the Tukey test. Kramer's method employs only the sample sizes of the means actually involved in the simple contrast. Miller (1966, p. 48) suggested the use of an average or median value of the group sizes as an approximate value of n. Smith (1971) compared Kramer's method, Winer's harmonic mean, and Miller's unequal n forms of the Tukey test for unequal sample sizes under conditions of homogeneous population variances. Smith recommended the use of the Kramer method. Keselman, Murray, and Rogan (1976) reported that the Tukey test did not have to be restricted to comparisons having equal n's. They recommended Kramer's unequal n procedure. Howell and Games (1973) investigated the robustness of the harmonic mean form of the Tukey test under conditions of unequal sample sizes coupled with various patterns of population variance heterogeneity. They found that when the smallest sample size was selected from the population with the smallest variance, and the largest sample size was selected from the population with the largest variance, the Tukey test was conservative; i.e., the empirical significance level was less than the nominal significance level. When the smallest sample size was sampled from the population with the largest variance, and the largest sample size was sampled from the population with the smallest variance, the Tukey test was found to be liberal; i.e., the empirical significance level was found to
be greater than the nominal significance level. Petrinovich and Hardyck (1969) and Keselman and Toothaker (1974) examined the robustness of the harmonic mean form of the Tukey test and reported results similar to Howell and Games (1973). Also, the Tukey test was found to be robust to conditions of non-normality. Ramseyer and Tcheng (1973) investigated three different multiple comparison procedures that make use of the studentized range statistic, q. The procedures they studied were the Tukey HSD test, the Newman-Keuls test, and the Duncan multiple range test. They studied the effect of assumption violations on the Type I error rate of these three procedures. In Ramseyer and Tcheng's investigation, homogeneity of variance was violated with variance ratios of (a) 1:1:2 [k=3], (b) 1:1:4 [k=3], (c) 1:1:1:2:2 [k=5], and (d) 1:1:1:4:4 [k=5]. Normality was violated with populations that were positively and negatively exponentially skewed and rectangularly distributed. A combination of the violation of the normality assumption and the homogeneous variance assumption was also studied. Ramseyer and Tcheng concluded that q is robust to the violation of homogeneity of variance and normality. They also reported that violation of normality produced Type I error rates lower than nominal levels. Carmer and Swanson (1978) used computer simulation techniques to study the Type I and Type III error rates for ten pairwise multiple comparison procedures, including the Tukey
statistic. Their results indicated that Scheffe's test, Tukey's test, and Newman-Keuls' test were less appropriate than a restricted least-significant-difference (LSD) test, some Bayesian modifications of the LSD, and Duncan's multiple range test. Carmer and Swanson (1978) stated that the inferiority of Scheffe's test, Tukey's test, and Student-Newman-Keuls' test was even more apparent with sets of ten and twenty treatments. This was, according to them, due to the critical values of these procedures being dependent on the number of treatments. Keselman, Toothaker, and Shooter (1975) studied the harmonic mean and the Kramer unequal n forms of the Tukey HSD statistic. In their study, unequal sample sizes and unequal variances were combined in varied patterns that included normal and skewed population shapes and population variances in the ratios of (a) 1:1:4:4, (b) 1:1:1:2, (c) 1:.5:.5:4, and (d) 1:2:3:4. Their findings indicated a close agreement between the two unequal n forms of the Tukey statistic. Both methods were adversely affected when unequal sample sizes were combined with unequal variances in the way reported by Howell and Games (1973), Petrinovich and Hardyck (1969), and Keselman and Toothaker (1974). Keselman and Rogan (1978) investigated five modifications of Tukey's statistic and compared them with Scheffe's test in controlling Type I errors and sensitivity to unequal sample sizes, variance heterogeneity, and sampling from non-normal
populations. They utilized a coefficient of variance variation to index the degree of variance heterogeneity. All of their investigations used k=4 groups, and sample sizes varied from a low of sixteen to a high of eighty-nine. Keselman and Rogan reported that a Games and Howell (1976) modification of Tukey's test controlled the Type I error rate at or below the nominal level for all conditions they investigated. Keselman and Rogan selected values of 0.0, 0.40, 0.80, and 1.00 for values of C (Keselman and Rogan's index of variance variation was C, not C²), since they felt this selection of C values represented those likely to be encountered in actual research. Based on the results of their investigation, Keselman and Rogan recommended the Games and Howell modification of the Tukey multiple comparison test for pairwise comparisons of means. According to Winer (1971, p. 198), there are two popular versions of the Tukey multiple comparison procedure. Winer labeled the more popular of the two procedures as Tukey A. The Tukey A procedure has frequently been labeled Tukey's Honestly Significant Difference test (Winer, 1971; Kirk, 1968; Games, 1971). Tukey A has also been known as the T-Method (Glass and Stanley, 1970; Scheffe, 1959) and the WSD test (Games, 1971). Apparently, Games and Kirk do not agree that the WSD test and the HSD test are one and the same, because Kirk states "The WSD test merits consideration but is more complex than the HSD test" (1968, p. 90). Therefore, Kirk has indicated
that the HSD and the WSD are two different procedures. The form of the statistic utilized in this investigation is that found in Kirk (1968, p. 88):

HSD = q(α; k, v) √(MS_error / n)     (6)

HSD is the value that must be exceeded in order for a comparison involving two means to be declared significant. The value of q(α; k, v) is determined by entering a table of the studentized range distribution with v degrees of freedom, corresponding to the MS_error degrees of freedom, and a level of significance. Another factor that determines q is the number of treatment levels in the experiment. MS_error is an estimate taken from the one-way analysis of variance mean square within groups of the experiment. Group sample size is designated by n, and the number of treatment levels is designated by k. Tukey's Honestly Significant Difference (HSD) test was designed to make all pairwise comparisons among means (Kirk, 1968). According to Winer (1971), in 1953 Tukey extended an approach originally suggested by Fisher to control the familywise Type I error rate. It was this procedure that has been called the HSD test. The basic assumptions of the HSD test are normality, homogeneity of variance, randomization, and equal sample sizes (Kirk, 1968, p. 88). Ryan (1959) introduced two general issues involving multiple comparisons. These were a priori versus a posteriori
comparisons and the concept of error rate. According to Ryan, an a priori test is one in which "the experimenter states in advance all possible conclusions and the rules by which these conclusions will be drawn" (p. 38). A posteriori tests are those which are suggested by the data. These types of tests have been known as data snooping or as post-mortem comparisons. Ryan indicated that there were several types of error rates, but Kirk (1968) has defined six kinds of error rates: (a) error rate per comparison, (b) error rate per hypothesis, (c) error rate per experiment, (d) error rate experimentwise, (e) error rate per family, and (f) error rate familywise. "It should be noted that the various error rates are all identical for an experiment involving a single comparison. The error rates become more divergent as the number of comparisons and hypotheses evaluated in an experiment are increased" (Kirk, 1968, p. 83). The error rate conceptualized for the HSD test was "familywise." In the one-dimensional case, "per family" and "per experiment," and "familywise" and "experimentwise," are equivalent terms (Ryan, 1959). Therefore, in the one-way analysis of variance, Tukey's terms "family" and "familywise" took on the simpler definitions of "experiment" and "experimentwise." According to Kirk (1968), error rate per experiment (i.e., per family in the one-dimensional case) was defined as (p. 84)

(number of comparisons falsely declared significant) / (total number of experiments).
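As a minimal sketch of how the HSD criterion of equation (6) is applied, the fragment below computes HSD from a tabled studentized range value and flags every pairwise difference that exceeds it. The function names and the numerical inputs (including the tabled q of 3.51 for alpha = .05, k = 3, and v = 27) are illustrative assumptions, not values from this study.

```python
import math
from itertools import combinations

def tukey_hsd(q_crit, ms_error, n):
    """Equation (6): HSD = q * sqrt(MS_error / n).  q_crit must be read
    from a studentized range table for the chosen alpha, the number of
    treatment levels k, and the MS_error degrees of freedom v."""
    return q_crit * math.sqrt(ms_error / n)

def pairwise_decisions(means, q_crit, ms_error, n):
    """Declare a pair of group means significantly different when the
    absolute difference exceeds the single HSD criterion."""
    hsd = tukey_hsd(q_crit, ms_error, n)
    return {(i, j): abs(means[i] - means[j]) > hsd
            for i, j in combinations(range(len(means)), 2)}

# Hypothetical numbers: k = 3 groups of n = 10, MS_error = 4.0, and
# q_crit = 3.51 (approximate tabled value for alpha = .05, k = 3, v = 27).
crit = tukey_hsd(3.51, 4.0, 10)                       # about 2.22
flags = pairwise_decisions([5.0, 6.0, 8.1], 3.51, 4.0, 10)
```

Note that one criterion serves every pair, which is what makes the HSD procedure control the familywise rather than the per-comparison error rate.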
Error rate experimentwise (familywise in the one-dimensional case) was defined as (p. 84)

(number of experiments with at least one statement falsely declared significant) / (total number of experiments).

Kirk concluded that... it should be observed that once an experimenter has specified an error rate and has decided on an appropriate conceptual unit for error rate, he can compute the corresponding rate for any other conceptual unit. Basically, the problem facing an experimenter is that of choosing, prior to the conduct of an experiment, a test statistic that provides the kind of protection desired (p. 86). Much research has been conducted on the robustness of the F-test and multiple comparison procedures under violation of the assumption of homogeneity of variance. For the most part, the research supports the theory that when sample sizes are equal, the F-test and multiple comparison procedures are robust. According to Box (1954), "It appears that if the groups are equal, moderate inequality of variance does not seriously affect the test" (p. 98). However, "moderate inequality of variance" was not specifically defined. Box's results under extreme conditions (k=7, equal n's, variance ratio = 1:1:1:1:1:1:7, and nominal alpha = 0.05, empirical alpha = 0.12) indicated that the question has not been
fully investigated. Therefore, the focus of this study was to investigate the robustness of the Tukey HSD procedure under conditions of equal sample size and heterogeneous variances. In 1972, Glass, Peckham, and Sanders stated the following: Whatever the cause, we find it significant to note that subsequent investigators have not extended Box's work in the direction of this curious finding. The conventional conclusion that heterogeneous variances are not important when n's are equal seems to have boundary conditions like all other conclusions in this area, and the boundary conditions may have not been sufficiently probed (p. 45). In 1977, Rogan, Keselman, and Breen reported: Of special interest was the finding that large degrees of variance heterogeneity produced liberal Type I error rates even in the presence of equal sample sizes. Although Box found serious distortions in the Type I error of the ANOVA F-test under similar conditions, this finding is contrary to the conventional conclusion that heterogeneous variances are not important when sample sizes are equal. The authors agree with Glass, Peckham, and Sanders in that this conclusion regarding the role of unequal variances in combination with equal sample sizes appears to have boundary conditions which have not been sufficiently probed. The data
from this investigation suggests that the degree of variance heterogeneity may play a role in determining these boundary conditions (p. 5). Box (1954) developed a method for indexing the degree of heterogeneity by a coefficient of variance variation, symbolized by C, where

C = Σ_{t=1}^{k} v_t (σ_t² − σ̄²)² / [(N − k)(σ̄²)²]     (7)

and σ̄² = Σ v_t σ_t² / (N − k) is the weighted mean of the k variances, v_t = n_t − 1 represents the degrees of freedom associated with each of the k variances, k represents the number of treatment groups, and N represents the total number of observations. Rogan, Keselman, and Breen (1977) demonstrated that very different ratios of unequal variances and unequal sample sizes may be identical with respect to their degree of heterogeneity, or the C value. For example, consider the two sets of sample sizes and variances presented on the following page.
                 Case A                          Case B
n_k              24, 32, 36, 40, 48, 60          6, 12, 14, 16, 24, 48
σ_t²             .05, .20, .35, .50, .60, .80    .64, .64, .64, .64, .64, 2.79
σ² ratios        1:4:7:10:12:16                  1:1:1:1:1:4.35

Both of the above cases involve very different ratios of unequal variances yet are similar with respect to their degree of heterogeneity as indexed by Box's coefficient of variance variation. Though different ratios of variances have been manipulated in other studies, the degree of variance heterogeneity may in some cases not have been varied. Also, a simpler form of the C equation for equal n's was derived by Box (1954):

C = (1/k) Σ_{t=1}^{k} (σ_t² − σ̄²)² / (σ̄²)²

C is the variance of the variances divided by the square of the mean variance. If the variances range from a lower value σ² to an upper value aσ² (where a is a coefficient of σ² and
a>1), then the largest possible value for C is attained when k−1 of the variances are equal to σ² and the remaining variance is equal to aσ². In this case,

C = (k−1)(a−1)² / (a−1+k)²     (8)

Values of C greater than one, or at most two, probably would be extremely rare in reality (Box, 1954). This study was limited to values of C less than k−1. Tamhane (1979) used Box's coefficient of variation as a measure of unbalance in the values of

var(x̄_i) = τ_i² = σ_i²/n_i     (i = 1, ..., k)     (9)

Tamhane indicated that although Keselman and Rogan (1978) had used this index for measuring variance variation, he believed that τ² was a more relevant parameter in his study than was σ². The purpose of this study was to further investigate the question regarding the effects of variance heterogeneity and equal sample sizes by utilizing Box's coefficient of variance variation to index heterogeneity, in order to determine whether boundary conditions existed where the Tukey HSD procedure was no longer robust. Results of this investigation should provide researchers in the behavioral sciences with additional information regarding the proper use of the Tukey HSD statistic.
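Box's equal-n coefficient and its maximum value can be checked numerically. The sketch below is illustrative and not from the original study; equation (8) is implemented in the squared form implied by the equal-n formula. For Box's extreme case of k=7 groups with variance ratio 1:1:1:1:1:1:7 (so a=7), the direct computation and the maximum-value formula agree at C = 216/169, about 1.28.

```python
def box_c_equal_n(variances):
    """Equal-n form of Box's coefficient C: the variance of the k group
    variances divided by the square of their mean."""
    k = len(variances)
    mean_var = sum(variances) / k
    return sum((v - mean_var) ** 2 for v in variances) / k / mean_var ** 2

def box_c_max(k, a):
    """Largest possible C when k-1 variances equal s2 and one equals a*s2,
    reconstructed from equation (8) as (k-1)(a-1)^2 / (a-1+k)^2."""
    return (k - 1) * (a - 1) ** 2 / (a - 1 + k) ** 2

# Box's extreme case cited above: k = 7, variance ratio 1:1:1:1:1:1:7.
c = box_c_equal_n([1, 1, 1, 1, 1, 1, 7])   # 216/169, about 1.28
```

Since this C exceeds one, the pattern sits at the outer edge of the range Box considered realistic, which is consistent with the text's remark that values of C above one or two would be rare in practice.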
CHAPTER BIBLIOGRAPHY
Atiqullah, M. The robustness of the covariance analysis of a one-way classification. Biometrika, 1964, 51.
Box, G. E. P. Some theorems on quadratic forms applied in the study of variance problems. I. Effect of inequality of variances in the one-way classification. Annals of Mathematical Statistics, 1954, 25.
Carmer, S. G., and Swanson, M. R. An evaluation of ten pairwise multiple comparison procedures by Monte Carlo methods. Journal of the American Statistical Association, 1978, 73.
Games, P. A. Multiple comparisons of means. American Educational Research Journal, 1971, 8(3).
Games, P. A. Inverse relation between the risks of Type I and Type II errors and suggestions for the unequal n case in multiple comparisons. Psychological Bulletin, 1971, 75(2).
Games, P. A., and Howell, J. F. Pairwise multiple comparison procedures with unequal n's and/or variances: A Monte Carlo study. Journal of Educational Statistics, 1976, 1.
Glass, G. V., Peckham, P. D., and Sanders, J. R. Consequences of failure to meet assumptions underlying the fixed
effects analysis of variance and covariance. Review of Educational Research, 1972, 42(3).
Glass, G. V., and Stanley, J. C. Statistical methods in education and psychology. Englewood Cliffs, N. J.: Prentice-Hall, 1970.
Howell, J. F., and Games, P. A. The effects of variance heterogeneity on simultaneous multiple comparison procedures with equal sample size. Paper presented at the American Educational Research Association Convention, February, 1973 (ERIC document ED ). (a)
Howell, J. F., and Games, P. A. The robustness of the analysis of variance and the Tukey WSD test under various patterns of heterogeneous variances. Journal of Experimental Education, 1973, 41(4). (b)
Hsu, P. L. Contributions to the theory of Student's t-test as applied to the problem of two samples. Statistical Research Memoirs, II, 1938, 1-24.
Keselman, H. J., Murray, R., and Rogan, J. Effect of very unequal group sizes on Tukey's multiple comparison test. Educational and Psychological Measurement, 1976, 36.
Keselman, H. J., and Rogan, J. C. A comparison of the modified Tukey and Scheffe methods of multiple comparisons for pairwise contrasts. Journal of the American Statistical Association, 1978, 73(361), 47-52.
More informationIntroduction. Chapter 8
Chapter 8 Introduction In general, a researcher wants to compare one treatment against another. The analysis of variance (ANOVA) is a general test for comparing treatment means. When the null hypothesis
More informationChapter 6 Planned Contrasts and Post-hoc Tests for one-way ANOVA
Chapter 6 Planned Contrasts and Post-hoc Tests for one-way NOV Page. The Problem of Multiple Comparisons 6-. Types of Type Error Rates 6-. Planned contrasts vs. Post hoc Contrasts 6-7 4. Planned Contrasts
More informationCOMPARING SEVERAL MEANS: ANOVA
LAST UPDATED: November 15, 2012 COMPARING SEVERAL MEANS: ANOVA Objectives 2 Basic principles of ANOVA Equations underlying one-way ANOVA Doing a one-way ANOVA in R Following up an ANOVA: Planned contrasts/comparisons
More informationINTRODUCTION TO INTERSECTION-UNION TESTS
INTRODUCTION TO INTERSECTION-UNION TESTS Jimmy A. Doi, Cal Poly State University San Luis Obispo Department of Statistics (jdoi@calpoly.edu Key Words: Intersection-Union Tests; Multiple Comparisons; Acceptance
More information3 Joint Distributions 71
2.2.3 The Normal Distribution 54 2.2.4 The Beta Density 58 2.3 Functions of a Random Variable 58 2.4 Concluding Remarks 64 2.5 Problems 64 3 Joint Distributions 71 3.1 Introduction 71 3.2 Discrete Random
More informationContents. Acknowledgments. xix
Table of Preface Acknowledgments page xv xix 1 Introduction 1 The Role of the Computer in Data Analysis 1 Statistics: Descriptive and Inferential 2 Variables and Constants 3 The Measurement of Variables
More informationMultiple Comparison Procedures for Trimmed Means. H.J. Keselman, Lisa M. Lix and Rhonda K. Kowalchuk. University of Manitoba
1 Multiple Comparison Procedures for Trimmed Means by H.J. Keselman, Lisa M. Lix and Rhonda K. Kowalchuk University of Manitoba Abstract Stepwise multiple comparison procedures (MCPs) based on least squares
More information1 One-way Analysis of Variance
1 One-way Analysis of Variance Suppose that a random sample of q individuals receives treatment T i, i = 1,,... p. Let Y ij be the response from the jth individual to be treated with the ith treatment
More informationAnalysis of variance (ANOVA) Comparing the means of more than two groups
Analysis of variance (ANOVA) Comparing the means of more than two groups Example: Cost of mating in male fruit flies Drosophila Treatments: place males with and without unmated (virgin) females Five treatments
More informationUnit 14: Nonparametric Statistical Methods
Unit 14: Nonparametric Statistical Methods Statistics 571: Statistical Methods Ramón V. León 8/8/2003 Unit 14 - Stat 571 - Ramón V. León 1 Introductory Remarks Most methods studied so far have been based
More informationIncreasing Power in Paired-Samples Designs. by Correcting the Student t Statistic for Correlation. Donald W. Zimmerman. Carleton University
Power in Paired-Samples Designs Running head: POWER IN PAIRED-SAMPLES DESIGNS Increasing Power in Paired-Samples Designs by Correcting the Student t Statistic for Correlation Donald W. Zimmerman Carleton
More informationChapter 15: Nonparametric Statistics Section 15.1: An Overview of Nonparametric Statistics
Section 15.1: An Overview of Nonparametric Statistics Understand Difference between Parametric and Nonparametric Statistical Procedures Parametric statistical procedures inferential procedures that rely
More information4/6/16. Non-parametric Test. Overview. Stephen Opiyo. Distinguish Parametric and Nonparametric Test Procedures
Non-parametric Test Stephen Opiyo Overview Distinguish Parametric and Nonparametric Test Procedures Explain commonly used Nonparametric Test Procedures Perform Hypothesis Tests Using Nonparametric Procedures
More informationBasic Business Statistics, 10/e
Chapter 1 1-1 Basic Business Statistics 11 th Edition Chapter 1 Chi-Square Tests and Nonparametric Tests Basic Business Statistics, 11e 009 Prentice-Hall, Inc. Chap 1-1 Learning Objectives In this chapter,
More informationAn Overview of the Performance of Four Alternatives to Hotelling's T Square
fi~hjf~~ G 1992, m-t~, 11o-114 Educational Research Journal 1992, Vol.7, pp. 110-114 An Overview of the Performance of Four Alternatives to Hotelling's T Square LIN Wen-ying The Chinese University of Hong
More informationStatistics for Managers Using Microsoft Excel Chapter 10 ANOVA and Other C-Sample Tests With Numerical Data
Statistics for Managers Using Microsoft Excel Chapter 10 ANOVA and Other C-Sample Tests With Numerical Data 1999 Prentice-Hall, Inc. Chap. 10-1 Chapter Topics The Completely Randomized Model: One-Factor
More informationOne-Way Analysis of Covariance (ANCOVA)
Chapter 225 One-Way Analysis of Covariance (ANCOVA) Introduction This procedure performs analysis of covariance (ANCOVA) with one group variable and one covariate. This procedure uses multiple regression
More information3. Nonparametric methods
3. Nonparametric methods If the probability distributions of the statistical variables are unknown or are not as required (e.g. normality assumption violated), then we may still apply nonparametric tests
More informationNAG Library Chapter Introduction. G08 Nonparametric Statistics
NAG Library Chapter Introduction G08 Nonparametric Statistics Contents 1 Scope of the Chapter.... 2 2 Background to the Problems... 2 2.1 Parametric and Nonparametric Hypothesis Testing... 2 2.2 Types
More informationStatistics and Measurement Concepts with OpenStat
Statistics and Measurement Concepts with OpenStat William Miller Statistics and Measurement Concepts with OpenStat William Miller Urbandale, Iowa USA ISBN 978-1-4614-5742-8 ISBN 978-1-4614-5743-5 (ebook)
More informationAPPLICATION AND POWER OF PARAMETRIC CRITERIA FOR TESTING THE HOMOGENEITY OF VARIANCES. PART IV
DOI 10.1007/s11018-017-1213-4 Measurement Techniques, Vol. 60, No. 5, August, 2017 APPLICATION AND POWER OF PARAMETRIC CRITERIA FOR TESTING THE HOMOGENEITY OF VARIANCES. PART IV B. Yu. Lemeshko and T.
More informationhttp://www.statsoft.it/out.php?loc=http://www.statsoft.com/textbook/ Group comparison test for independent samples The purpose of the Analysis of Variance (ANOVA) is to test for significant differences
More informationIntuitive Biostatistics: Choosing a statistical test
pagina 1 van 5 < BACK Intuitive Biostatistics: Choosing a statistical This is chapter 37 of Intuitive Biostatistics (ISBN 0-19-508607-4) by Harvey Motulsky. Copyright 1995 by Oxfd University Press Inc.
More informationMATH Notebook 3 Spring 2018
MATH448001 Notebook 3 Spring 2018 prepared by Professor Jenny Baglivo c Copyright 2010 2018 by Jenny A. Baglivo. All Rights Reserved. 3 MATH448001 Notebook 3 3 3.1 One Way Layout........................................
More informationINFLUENCE OF USING ALTERNATIVE MEANS ON TYPE-I ERROR RATE IN THE COMPARISON OF INDEPENDENT GROUPS ABSTRACT
Mirtagioğlu et al., The Journal of Animal & Plant Sciences, 4(): 04, Page: J. 344-349 Anim. Plant Sci. 4():04 ISSN: 08-708 INFLUENCE OF USING ALTERNATIVE MEANS ON TYPE-I ERROR RATE IN THE COMPARISON OF
More informationOne-Way ANOVA. Some examples of when ANOVA would be appropriate include:
One-Way ANOVA 1. Purpose Analysis of variance (ANOVA) is used when one wishes to determine whether two or more groups (e.g., classes A, B, and C) differ on some outcome of interest (e.g., an achievement
More informationTwo-Sample Inferential Statistics
The t Test for Two Independent Samples 1 Two-Sample Inferential Statistics In an experiment there are two or more conditions One condition is often called the control condition in which the treatment is
More informationTHE PRINCIPLES AND PRACTICE OF STATISTICS IN BIOLOGICAL RESEARCH. Robert R. SOKAL and F. James ROHLF. State University of New York at Stony Brook
BIOMETRY THE PRINCIPLES AND PRACTICE OF STATISTICS IN BIOLOGICAL RESEARCH THIRD E D I T I O N Robert R. SOKAL and F. James ROHLF State University of New York at Stony Brook W. H. FREEMAN AND COMPANY New
More informationChapter 1 Statistical Inference
Chapter 1 Statistical Inference causal inference To infer causality, you need a randomized experiment (or a huge observational study and lots of outside information). inference to populations Generalizations
More informationHypothesis Testing. Hypothesis: conjecture, proposition or statement based on published literature, data, or a theory that may or may not be true
Hypothesis esting Hypothesis: conjecture, proposition or statement based on published literature, data, or a theory that may or may not be true Statistical Hypothesis: conjecture about a population parameter
More informationTEST POWER IN COMPARISON DIFFERENCE BETWEEN TWO INDEPENDENT PROPORTIONS
TEST POWER IN COMPARISON DIFFERENCE BETWEEN TWO INDEPENDENT PROPORTIONS Mehmet MENDES PhD, Associate Professor, Canakkale Onsekiz Mart University, Agriculture Faculty, Animal Science Department, Biometry
More informationChapter Fifteen. Frequency Distribution, Cross-Tabulation, and Hypothesis Testing
Chapter Fifteen Frequency Distribution, Cross-Tabulation, and Hypothesis Testing Copyright 2010 Pearson Education, Inc. publishing as Prentice Hall 15-1 Internet Usage Data Table 15.1 Respondent Sex Familiarity
More informationPsicológica ISSN: Universitat de València España
Psicológica ISSN: 0211-2159 psicologica@uv.es Universitat de València España Zimmerman, Donald W.; Zumbo, Bruno D. Hazards in Choosing Between Pooled and Separate- Variances t Tests Psicológica, vol. 30,
More informationAnalysis of Variance (ANOVA)
Analysis of Variance (ANOVA) Two types of ANOVA tests: Independent measures and Repeated measures Comparing 2 means: X 1 = 20 t - test X 2 = 30 How can we Compare 3 means?: X 1 = 20 X 2 = 30 X 3 = 35 ANOVA
More informationPreface Introduction to Statistics and Data Analysis Overview: Statistical Inference, Samples, Populations, and Experimental Design The Role of
Preface Introduction to Statistics and Data Analysis Overview: Statistical Inference, Samples, Populations, and Experimental Design The Role of Probability Sampling Procedures Collection of Data Measures
More informationComparison of Two Samples
2 Comparison of Two Samples 2.1 Introduction Problems of comparing two samples arise frequently in medicine, sociology, agriculture, engineering, and marketing. The data may have been generated by observation
More informationNon-parametric (Distribution-free) approaches p188 CN
Week 1: Introduction to some nonparametric and computer intensive (re-sampling) approaches: the sign test, Wilcoxon tests and multi-sample extensions, Spearman s rank correlation; the Bootstrap. (ch14
More informationOutline. Topic 19 - Inference. The Cell Means Model. Estimates. Inference for Means Differences in cell means Contrasts. STAT Fall 2013
Topic 19 - Inference - Fall 2013 Outline Inference for Means Differences in cell means Contrasts Multiplicity Topic 19 2 The Cell Means Model Expressed numerically Y ij = µ i + ε ij where µ i is the theoretical
More informationContents Kruskal-Wallis Test Friedman s Two-way Analysis of Variance by Ranks... 47
Contents 1 Non-parametric Tests 3 1.1 Introduction....................................... 3 1.2 Advantages of Non-parametric Tests......................... 4 1.3 Disadvantages of Non-parametric Tests........................
More informationDESIGN AND ANALYSIS OF EXPERIMENTS Third Edition
DESIGN AND ANALYSIS OF EXPERIMENTS Third Edition Douglas C. Montgomery ARIZONA STATE UNIVERSITY JOHN WILEY & SONS New York Chichester Brisbane Toronto Singapore Contents Chapter 1. Introduction 1-1 What
More informationInferential Statistics
Inferential Statistics Eva Riccomagno, Maria Piera Rogantin DIMA Università di Genova riccomagno@dima.unige.it rogantin@dima.unige.it Part G Distribution free hypothesis tests 1. Classical and distribution-free
More information