U.S. Department of Agriculture, Beltsville, Maryland 20705

AN EVALUATION OF MULTIPLE COMPARISON PROCEDURES

D. R. Waldo (1, 2)

U.S. Department of Agriculture, Beltsville, Maryland

(1) A.R.S., Nutrition Institute, Ruminant Nutrition Laboratory, BARC-East.
(2) The author deeply appreciates the statistical discussions with David B. Duncan, Johns Hopkins University, and Judson U. McGuire, Jr., A.R.S., U.S.D.A. Patricia M. Cochran, A.R.S., U.S.D.A., improved the grammar and the mathematical clarity, precision, and consistency. The author appreciates the difficult typing job performed by Jo Ann Brown and Jenny Hysan.

SUMMARY

Least significant difference, Duncan's multiple range test, Student-Newman-Keuls, Tukey's significant difference, and Scheffe's significant difference procedures are compared. The criteria of comparison are Type I and Type II error probabilities. Type I error probabilities on 2-treatment and n-treatment bases are theoretically related. The theoretical error probabilities are compared to published observed error probabilities. The choice of a test procedure rests on the relative importance of Type I and Type II errors.

(Key Words: Multiple Comparisons, Least Significant Difference, Duncan's Multiple Range Test, Student-Newman-Keuls Test, Tukey's Significant Difference, Scheffe's Significant Difference.)

INTRODUCTION

Multiple comparison procedures have been available for several years. Probably the most widely used is that of Duncan (1955), which has even been included in the Statistical Analysis System (Barr and Goodnight, 1972) program. However, Gill (1973) has attacked Duncan's test on the basis that it does not control experimentwise Type I error probabilities. Some journals have developed or are developing editorial policies that either strongly support or strongly reject Duncan's test. Even with several recent papers on multiple comparison procedures (Warde, 1970; O'Neill and Wetherill, 1971; Carmer and Swanson, 1973; Gill, 1973) very different viewpoints persist. These more
recent and more applied papers have used very little of the more fundamental statistical theory described by Harter (1957) and Duncan (1965). The objective of this paper is to present a precise, point-by-point comparison of fundamental relationships and to support it with published Monte Carlo results. Guidance for practical decisions on the use of these methods will be based on these theoretical relationships.

BASES FOR EVALUATION

Basically, two types of error are possible when a statistical decision is made, as indicated in table 1. Deciding that a treatment difference exists, when in fact it does not, results in a Type I error, which has probability $\alpha$. Deciding that a treatment difference does not exist, when in fact it does, results in a Type II error, which has probability $\beta$.

TABLE 1. TYPES OF ERRORS ARISING FROM DECISIONS ABOUT POPULATIONS

Population null      Decision     Type of     Probability
hypothesis (H0)      on H0        error       of error
True                 Accept       None        ...
True                 Reject       Type I      $\alpha$
False                Accept       Type II     $\beta$
False                Reject       None        ...

For multiple comparisons testing, the Type I error requires further subdivisions. These subdivisions are usually termed comparisonwise, $\alpha_C$, and experimentwise, $\alpha_E$, error (Tukey, 1953) and are defined as:

$$\alpha_C = \frac{\text{Number of erroneous inferences}}{\text{Number of inferences}}$$

$$\alpha_E = \frac{\text{Number of experiments with one or more erroneous inferences}}{\text{Number of experiments}}$$

Although the term experimentwise is correct for a simple one-way analysis or a completely random experiment, it is a misnomer for more complicated experiments because it loses its implied relation to the conventional F test. Consider a 2 × 3 factorial experiment conducted in a completely random design. Assuming five replications, the analysis of variance is given in table 2.

TABLE 2. ANALYSIS OF VARIANCE TABLE

Source          df
Whole plots     30
Mean             1
A                1
B                2
AB               2
Error           24

This one experiment could be tested with three F tests. If only three F tests are used, each at $\alpha = .05$, the error probability in the experiment rises to about .14. This relation is described by $(1 - \alpha_E) = (1 - \alpha_C)^k$ for $k$ independent comparisons (Steel, 1961). Even though the mean squares in the numerator are independent, the tests are not independent because the same mean square for error was used (Kempthorne, 1952, p. 245). The effect of this lack of independence decreases with increasing degrees of freedom for error. The dependence introduced by the repeated use of the same mean square for error was ignored in this paper.

JOURNAL OF ANIMAL SCIENCE, Vol. 42, No. 2, 1976 (p. 539)

For this discussion, the terminology of Duncan (1955) seems more consistent. His generalized "p-mean significance level" or $\alpha_p$ will be used, where $\alpha_2$ is equivalent to the comparisonwise $\alpha$ and $\alpha_n$ is equivalent to the experimentwise $\alpha$ of the one-way classification. For more complex designs, $\alpha_n$ is equivalent to the error chosen for the F test of any one line in any complex analysis.

A third type of error, Type III, is possible in multiple comparisons testing (Harter, undated; Carmer and Swanson, 1973). Type III error was considered outside the scope of this paper.

BASES FOR TESTS

Three of the five tests to be described are based on the studentized range, $q(\alpha_p, p, n_2)$, where $q$ is the studentized range statistic, $\alpha_p$ is the Type I error probability, $p = 2, 3, \ldots, n_1$ is the number of treatment means, and $n_2$ is the error degrees of freedom. An example of the tabulated values given $\alpha = .05$, $n_1 = 5$, $n_2 = 10$ is shown in table 3.

The first of three tests is the least significant difference (LSD) or multiple t test (Student, 1908),

$$\mathrm{LSD} = q(\alpha_2, 2, n_2)\, s_{\bar{x}} \tag{1}$$

where $\alpha_2$ is the 2-treatment error rate and $p = 2$; i.e., it has a single and minimum critical value.
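As a quick numerical check on equation (1): the studentized range for two means equals $\sqrt{2}$ times Student's t, which is why the LSD is also called the multiple t test. The worked line below assumes the two-sided 5% point of t with 10 error df, 2.228, a standard tabled value that is not quoted in the text. The result agrees with the LSD entry of 3.15 in table 4.

```latex
q(.05,\ 2,\ 10) \;=\; \sqrt{2}\; t_{.05}(10) \;=\; 1.4142 \times 2.228 \;\approx\; 3.15,
\qquad
\mathrm{LSD} \;=\; 3.15\, s_{\bar{x}}
```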
The second test is Tukey's (1953) significant difference (TSD),

$$\mathrm{TSD} = q(\alpha_n, n_1, n_2)\, s_{\bar{x}} \tag{2}$$

where $\alpha_n$ is the n-treatment error rate and $p = n_1$; i.e., it has a single and maximum critical value.

The third test is the Student (1927)-Newman (1939)-Keuls (1952) test (SNK),

$$\mathrm{SNK}_p = q(\alpha_p, p, n_2)\, s_{\bar{x}} \tag{3}$$

where $\alpha_p$ is the p-treatment error rate and $p = 2, 3, \ldots, n_1$; i.e., it has $n_1 - 1$ critical values. These critical values are intermediate between the LSD and the TSD.

The fourth test is Duncan's (1955) multiple range test (MRT),

$$\mathrm{MRT}_p = Q(\alpha_p, p, n_2)\, s_{\bar{x}} \tag{4}$$

where $Q$ is the multiple range statistic, $\alpha_p$ is the p-treatment error rate, and $p = 2, 3, \ldots, n_1$; i.e., it has $n_1 - 1$ critical values. Here $\alpha_p = 1 - (1 - \alpha_2)^{p-1}$ by definition (Duncan, 1955). The $Q$ values are tabulated in the same way as $q$ values. An example is given in table 3. Critical values to four significant digits for $\alpha$ = .10, .05, .01, .005 and .001 are available (Harter, 1960). The tabulated values of $Q$ are less than those of $q$ for $p > 2$, so the critical values of MRT, except when $n_1 = 2$, are always greater than the critical value of LSD and less than the critical values of SNK.

TABLE 3. STUDENTIZED RANGES
Columns: p = number of treatment means. Rows: error df, $\alpha$, and range: ordinary, $q$ (a); special, $Q$ (b).
[Tabulated values not recoverable from the extracted text.]
(a) Steel and Torrie, Table A.8. Federer, Table II-1.
(b) Steel and Torrie, Table A.7. Federer, Table II-3. Harter, Table 1 gives critical values with four digits.

The fifth test is Scheffe's (1953) significant difference (SSD),

$$\mathrm{SSD} = \left[(n_1 - 1)\, F(\alpha_n, n_1 - 1, n_2)\right]^{1/2} s_d \tag{5}$$

where $n_1 - 1$ is the numerator df, $F$ is the F statistic, $\alpha_n$ is the n-treatment error rate or conventional F-test error rate, and $n_2$ is the denominator df. To make this test consistent with the other four tests, use the relation $\sqrt{2}\, s_{\bar{x}} = s_d$, giving

$$\mathrm{SSD} = \left[(n_1 - 1)\, F(\alpha_n, n_1 - 1, n_2)\right]^{1/2} \sqrt{2}\, s_{\bar{x}}. \tag{6}$$

Then for the example in table 4, where $n_1 = 5$, $n_2 = 10$ and $\alpha_n = .05$, $F = 3.48$. So with $s_{\bar{x}} = 1$, SSD $= [4(3.48)]^{1/2} \sqrt{2} \approx 5.28$. Except when $n_1 = 2$, where all five tests are equivalent, the SSD is always greater than the TSD.

POWER OR TYPE II ERROR PROBABILITY

The relative sizes of the Type II error probabilities are indicated by the relative critical values tabulated in table 4. All of these critical values must be multiplied by $s_{\bar{x}}$ before their use. Smaller critical values imply narrower confidence intervals, which imply more powerful tests, which in turn imply lower Type II error probabilities on a theoretical basis. The observed results of Monte Carlo experiments (Carmer and Swanson, 1973) support this ranking of the relative power of these tests.

The relative power of these five tests is influenced by both $n_1$ and $n_2$. Increasing $n_1$ increases the difference between the tests. When $n_1 = 2$, results of all tests are identical. When $n_1$ increases to 5, as in table 4, the differences between the tests increase. As $n_2$ increases, the differences in relative power decrease. This decrease can be seen by constructing a table such as table 4 but letting $n_2 = 20$.

TYPE I ERROR

The second major criterion for evaluating these five tests is the Type I error probability. However, as described earlier, it should be recalled that the tests have different Type I error bases.
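The arithmetic behind two of the definitions above can be checked with a short sketch: Duncan's p-mean significance level, $\alpha_p = 1 - (1 - \alpha_2)^{p-1}$, and the SSD of equation (6). The function names are hypothetical and chosen only for this illustration. The F value of 3.48 is the one quoted in the text for $n_1 = 5$, $n_2 = 10$, and it is passed in rather than computed.

```python
import math


def duncan_alpha_p(alpha_2, p):
    """p-mean significance level used by Duncan's MRT: 1 - (1 - alpha_2)^(p - 1)."""
    return 1.0 - (1.0 - alpha_2) ** (p - 1)


def ssd_critical_value(f_value, n1, s_xbar=1.0):
    """Scheffe's significant difference following equation (6).

    f_value is the tabulated F(alpha_n, n1 - 1, n2); it is supplied by the
    caller (assumption: taken from an F table, as in the text).
    """
    return math.sqrt((n1 - 1) * f_value) * math.sqrt(2.0) * s_xbar


# Significance levels for p = 2..5 means with alpha_2 = .05
for p in range(2, 6):
    print(p, round(duncan_alpha_p(0.05, p), 4))

# Worked example from the text: n1 = 5, F = 3.48, s_xbar = 1
print(round(ssd_critical_value(3.48, 5), 2))  # 5.28
```

With $\alpha_2 = .05$ the level for $p = 5$ is about .185, which is the value equation (7) gives for the n-treatment error when $n_1 = 5$.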
In review, these are listed in table 5. The orthogonal contrast or single degree of freedom test is also included in this discussion. To make the different error bases comparable it is necessary to know the translation equations.

TABLE 4. RELATIVE CRITICAL VALUES (a)
Columns: critical values by p = number of treatment means; rank, theoretical and observed (b).

Test      Critical value      Theoretical rank
LSD       3.15                1
MRTp      [not recoverable]
SNKp      [not recoverable]
TSD       [not recoverable]
SSD       [not recoverable]

(a) Require multiplier of $s_{\bar{x}}$. Assumed error df = 10 and Type I error = .05.
(b) Carmer and Swanson, 1973.

TABLE 5. [Table contents not recoverable from the extracted text. According to the discussion, it lists the Type I error base of each test, the translation equations, and the theoretical and observed error values.]

The relation of error bases for independent tests is derived from Duncan (1955) or Steel (1961) as:

$$\alpha_{n_1} = 1 - (1 - \alpha_2)^{n_1 - 1} \tag{7}$$

This equation is independent of, and is the same for, all $n_2$. These translation equations apply to orthogonal contrasts because of their independence and to MRT by its definition.

When all possible tests among n means are made, the relations of the error bases are given by Duncan (1965) as probability statements. For calculating $\alpha_n$ when given $\alpha_2$,

$$\alpha_n = \Pr\left[q_n > t\sqrt{2}\right] \tag{8}$$

where Pr is a probability. Substituting $q_2$ for $t\sqrt{2}$ gives

$$\alpha_n = \Pr\left[q_n > q_2\right]. \tag{9}$$

Using the complementary relation $1 - \alpha_p = P_p$, where $P$ is defined as a protection level, $P_n$ can substitute for $\alpha_n$ and at the same time the inequality reverses to give

$$P_n = \Pr\left[q_n < q_2\right]. \tag{10}$$

For calculating $\alpha_2$ when given $\alpha_n$,

$$P_2 = \Pr\left[t < q_n / \sqrt{2}\right]. \tag{11}$$

Multiplying the inequality by $\sqrt{2}$ and substituting $q_2$ for $t\sqrt{2}$ gives

$$P_2 = \Pr\left[q_2 < q_n\right]. \tag{12}$$

These relations are dependent on $n_2$, just as $q$ was dependent on $n_2$. These translation equations apply to finding $\alpha_n$ from $\alpha_2$ for LSD and finding $\alpha_2$ from $\alpha_n$ for TSD. For practical applications, these changes of error base are easily made by the use of table B-1 of Harter (undated). Example data in the same format are presented in table 6. Similar translations of LSD and TSD error rates may be made by the use of tables of Harter (1957) or Harter et al. (1959) with identical results.

For SSD, which is based on the F test, the proper equation for computing the 2-treatment error is given by Duncan (private communication, 1974):

$$\alpha_2 = 2 \Pr\left[t > \left((n_1 - 1) F\right)^{1/2}\right] \tag{13}$$

In the example in table 5, $n_1 = 5$ and $F = 3.48$, so $\alpha_2 = 2 \Pr[t > (4(3.48))^{1/2}]$ or $2 \Pr[t > 3.73]$. From table 9 of Pearson and Hartley (1966) with $t = 3.7$ and $n_2 = \nu = 10$, $\Pr[t > 3.7] = .002$ and $\alpha_2 = 2 \Pr[t > 3.7]$, or .004. Similar translations may be made from table II, p. 149 of Locks et al. (1963) with KP = 0, i.e., noncentrality = 0.

The SNK test has p critical values from the tables of $q$ with $\alpha_n$ theoretically equal to the chosen $\alpha$, and $\alpha_2 \le \alpha_n = .05$.
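The translations in equations (7) and (13) can be reproduced numerically. The sketch below is only an illustration under stated assumptions: the function names are hypothetical, and the upper tail of Student's t is obtained by Simpson's rule on the t density instead of the Pearson and Hartley table used in the text, so the last digit may differ slightly from the tabled value.

```python
import math


def error_for_k_independent(alpha, k):
    """Error probability over k independent tests, each at level alpha.

    Equation (7) is the case k = n1 - 1.
    """
    return 1.0 - (1.0 - alpha) ** k


def t_pdf(x, df):
    """Density of Student's t with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2.0) / (
        math.sqrt(df * math.pi) * math.gamma(df / 2.0)
    )
    return coef * (1.0 + x * x / df) ** (-(df + 1) / 2.0)


def t_upper_tail(t0, df, steps=2000):
    """Pr[t > t0] for t0 >= 0, by Simpson's rule on [0, t0] (steps must be even)."""
    h = t0 / steps
    total = t_pdf(0.0, df) + t_pdf(t0, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3.0


def ssd_two_treatment_error(f_value, n1, n2):
    """Equation (13): alpha_2 = 2 Pr[t > ((n1 - 1) F)^(1/2)] with n2 error df."""
    return 2.0 * t_upper_tail(math.sqrt((n1 - 1) * f_value), n2)


print(round(error_for_k_independent(0.05, 4), 4))  # equation (7), n1 = 5: 0.1855
print(round(error_for_k_independent(0.05, 3), 4))  # three F tests at .05: 0.1426
print(ssd_two_treatment_error(3.48, 5, 10))        # close to the .004 quoted in the text
```

The first two printed values are exact arithmetic. The third depends on the numerical integration and should be read as an approximation to the tabled result.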
Implicitly $\alpha_2$ is defined as the probability of getting a wrongful rejection of a null hypothesis for the two-mean test. This definition of $\alpha_2$ should not be confused with the maximum for the same probability that is sometimes denoted by $\alpha_2$, for example by Duncan (1955).

These theoretical equations and their theoretical values are presented in table 5. Base $\alpha$ was assumed to be .05. The criticism of MRT by Gill (1973) has resulted from a comparison of $\alpha_n$ for MRT with $\alpha_2$ for orthogonal contrasts. Of the $\alpha_2$-based tests, the orthogonal contrasts and MRT have identical $\alpha_n$. This identity is a result of the definition of the $Q_p$ values. These will be identical for all $n_1$. The proper comparison must be consistently based on either $\alpha_2$ or $\alpha_n$. The LSD test has a higher n-treatment error than either orthogonal contrasts or MRT.

TABLE 6. PROBABILITY INTEGRAL (Pr) OF THE STUDENTIZED RANGE (a)
Columns: p = number of treatment means. Rows: $q$, setting $P_p = .95$ when p = 2 and setting $P_p = .95$ when p = n.
[Tabulated values not recoverable from the extracted text.]
(a) Harter, undated. Table B1, p. 492 with $n_2 = 10$.

When the LSD and MRT are used only after a significant F test, the n-treatment error is limited to that used in the F test. In application, this pretesting with F is a common practice. But when the LSD is used after a significant F test, the $\alpha_3, \ldots, \alpha_{n-1}$ can be as high as that for the LSD without prior F-test significance (Duncan, 1965). Significance statements based on a multiple comparison test should state not only which test was used but also, in the case of LSD and MRT, whether it was preceded by a significant F test.

The tests based on $\alpha_n = .05$ have very low theoretical $\alpha_2$ values even with $n_1 = 5$. With larger $n_1$, the difference between $\alpha_2$ and $\alpha_n$ increases.

The observed values of Carmer and Swanson (1973) are in good agreement with the theoretical values as presented in table 5. When their observed 2-treatment Type I errors are used, the ranking of all tests is exactly the inverse of the ranking on Type II errors.

A quote from Warde (1970) summarizes this paper. "Thus the choice of which procedure to use will depend upon the importance attached by the experimentor to each of the two types of error." Those wishing to minimize Type II errors will tend toward LSD, whereas those wishing to minimize Type I errors will tend toward SSD. Duncan (1965) and Waller and Duncan (1969) have proposed Bayesian approaches that optimize these errors, but these approaches were considered beyond the scope of this paper.

CONCLUSIONS

1. Where orthogonal contrasts, orthogonal polynomials, or factorial analyses are appropriate, do not use multiple comparisons.
2. State which test is being used.
3. State whether LSD and MRT are used only after a significant F test.

LITERATURE CITED

Barr, A. J. and J. H. Goodnight. 1972. Statistical analysis system. Student Supply Stores, North Carolina State Univ., Raleigh.
Carmer, S. G. and M. R. Swanson. 1973. An evaluation of ten pairwise multiple comparison procedures by Monte Carlo methods. J. Amer. Statist. Ass. 68:66.
Duncan, D. B. 1955. Multiple range and multiple F tests. Biometrics 11:1.
Duncan, D. B. 1965. A Bayesian approach to multiple comparisons. Technometrics 7:171.
Federer, W. T. 1955. Experimental Design Theory and Applications. The Macmillan Company, New York.
Gill, J. L. 1973. Current status of multiple comparisons of means in designed experiments. J. Dairy Sci. 56:973.
Harter, H. L. 1957. Error rates and sample sizes for range tests in multiple comparisons. Biometrics 13:511.
Harter, H. L. 1960. Critical values for Duncan's new multiple range test. Biometrics 16:671.
Harter, H. L. Undated (but must be 1969 or 1970). Order Statistics and Their Use in Testing and Estimation. Volume 1. Tests Based on Range and Studentized Range of Samples from a Normal Population. Aerospace Research Laboratories, Office of Aerospace Research, United States Air Force. For sale by the Superintendent of Documents, U.S. Government Printing Office, Washington, D.C.
Harter, H. L., D. S. Clemm and E. H. Guthrie. 1959. The Probability Integrals of the Range and of the Studentized Range: Probability Integral and Percentage Points of the Studentized Range; Critical Values for Duncan's New Multiple Range Test. WADC Technical Rep. Volume II. Wright Air Development Center, Wright-Patterson Air Force Base, Ohio.
Kempthorne, O. 1952. The Design and Analysis of Experiments. John Wiley and Sons, Inc., New York.
Keuls, M. 1952. The use of the studentized range in connection with an analysis of variance. Euphytica 1:112.
Locks, M. O., M. J. Alexander and B. J. Byars. 1963. New Tables of the Noncentral t Distribution. Aeronautical Research Laboratories, Office of Aerospace Research, United States Air Force. For sale by the Office of Technical Services, U.S. Department of Commerce, Washington 25, D.C.
Newman, D. 1939. The distribution of range in samples from a normal population, expressed in terms of an independent estimate of standard deviation. Biometrika 31:20.
O'Neill, R. and G. B. Wetherill. 1971. The present state of multiple comparison methods. J. Royal Stat. Soc. B33:218.
Pearson, E. S. and H. O. Hartley. 1966. Biometrika Tables for Statisticians. Volume I, 3rd ed. Cambridge University Press, Cambridge, England.
Scheffe, H. 1953. A method for judging all contrasts in the analysis of variance. Biometrika 40:87.
Steel, R. G. D. 1961. Query: Error rates in multiple comparisons. Biometrics 17:326.
Steel, R. G. D. and J. H. Torrie. 1960. Principles and Procedures of Statistics. McGraw-Hill Book Company, Inc., New York.
Student. 1908. The probable error of a mean. Biometrika 6:1.
Student. 1927. Errors of routine analysis. Biometrika 19:151.
Tukey, J. W. 1953. The problem of multiple comparisons. Unpublished dittoed notes. Princeton University, Princeton, N.J. 396 pp. (As cited by Federer, 1955 or Steel and Torrie, 1960).
Waller, R. A. and D. B. Duncan. 1969. A Bayes rule for the symmetric multiple comparisons problem. J. Amer. Statist. Ass. 65:1484.
Warde, W. D. 1970. A review of multiple comparison procedures. Agricultural Experiment Station, Statistical Laboratory, Iowa State Univ.
