Robustness of multiple comparisons against variance heterogeneity Dijkstra, J.B.

Robustness of multile comarisons against variance heterogeneity Dijkstra, J.B. Published: 01/01/1983 Document Version Publisher s PDF, also known as Version of Record (includes final age, issue and volume numbers) Please check the document version of this ublication: A submitted manuscrit is the author's version of the article uon submission and before eer-review. There can be imortant differences between the submitted version and the official ublished version of record. Peole interested in the research are advised to contact the author for the final version of the ublication, or visit the DOI to the ublisher's website. The final author version and the galley roof are versions of the ublication after eer review. The final ublished version features the final layout of the aer including the volume, issue and age numbers. Link to ublication Citation for ublished version (APA): Dijkstra, J. B. (1983). Robustness of multile comarisons against variance heterogeneity. (Comuting centre note; Vol. 17). Eindhoven: Technische Hogeschool Eindhoven. General rights Coyright and moral rights for the ublications made accessible in the ublic ortal are retained by the authors and/or other coyright owners and it is a condition of accessing ublications that users recognise and abide by the legal requirements associated with these rights. Users may download and rint one coy of any ublication from the ublic ortal for the urose of rivate study or research. You may not further distribute the material or use it for any rofit-making activity or commercial gain You may freely distribute the URL identifying the ublication in the ublic ortal? Take down olicy If you believe that this document breaches coyright lease contact us roviding details, and we will remove access to the work immediately and investigate your claim. Download date: 20. Nov. 2018

THE-RC 52857 -.-.-...= Bibliotheek T Eindhoven University of Technology Comuting Centre Note 17 Robustness of Multile Comarisons against variance heterogeneity Jan B. Dijkstra BIBLIOTHEEK "- 8 31.0177 T.H.EINOHOVEN Preared for the Conference on Robustness of Statistical Methods and Nonarametric Statistics. May 29 to June 4, ]983 Schwerin, GDR.

THE-RC 52857/1 ROBUSTNESS OF MULTIPLE COMPARISONS AGAINST VARIANCE HETEROGENEITY Jan B. Dijkstra Comuting Centre, Eindhoven University of Technology. ABSTRACT If HO: ~... - ~ 1s rejected for normal oulations with classical one way analysis of variance, it is usually of interest to know where the differences may be. If the oulation variances are equal there are several aroaches one might consider: 1. Least Significant Difference test (Fisher, 1935) 2. Multile Range test for equal samle sizes (Newman, 1939) 3. An adatation for unequal samle sizes (Kramer, 1956) 4. Multile F-test (Duncan, 1951) 5. Multile Comarisons test (Duncan, 1952). For all these methods (including the one way analysis of variance) alternatives exist that are robust against variance heterogeneity. A modification of (3) has some unattractive roerties if the variances and the samle sizes differ greatly. The adatations for unequal variances of (4) and (5) seem better than (1) for cases with many samles. Test (2) is rather robust in itself if the variances are not too much different. Modifications exist that allow slight unequalities in the samle sizes. 1. INTRODUCTION In 1981 Werter and the author ublished a study on tests for the equality of several means when the oulation variances are unequal. The roblem can be stated as follows: H Q : ll.-... ~ Xi" - N(lJ i, 0/) for i 1,...) k -J j = 1, ) n i " The conclusion of this study was that the second order method of James (1951) gives the user better control over the size than some other tests [Welch (1951), Brown and Forsythe (1974)], so it is to be referred since none of the tests in the study was uniformly most owerful.

THE-RC 52857/2 The test statistic t t k _ 2 Lw. (xi-x), i=1 ~ is defined as: Xi 1 ni _ 1 k k where wi -2' xi - L x ij ' x - 2WiX i and w LWi s1 ni j-1 W i-1 i-1 For some chosen size a this test statistic is to be comared with a critical level h 2 (a), given by: 2 2 2. Here X X (a) is the ercentage oint of ax -distributed variate with r k-l degrees of freedom, having a tail robability a. The other basic items in the formula are given by: - a

'rhe-rc 52857/3 This method is an aroximation of order -2 in the vi to an "ideal" method. Brown and Forsythe (1974) considered the first order method of James (order -1 in the vi). Their conclusion was that for unequal variances the- difference between the nominal size and the actual robability of rejecting the null hyothesis when it is true can be quite imressive. Werter and the author found that this difference almost vanishes if one takes into account the second order terms. The test as stated gives only the binary result that H O is acceted or rejected. If one refers the tail robability of the test the equation t h 2 (a) has to be solved. Because h 2 (a) is monotonous in a this can be done in about ten function evaluations with an accetable recision of 0.001 in a. In the formula for h (a) the terms R are indeendent of at so it is only necessary to 2 st recomute the XZ s for every iteration. This version of the test was used on a Burroughs B7700 comuter. The average amount of rocessing time for common cases was about 0.026 sect so the very comlicated formula does not yield an exensive algorithm. If He is acceted this usually means the end of the analysis. Otherwise it may be of interest to know where the differences lie. For this one has to erform a simultaneous test and it would be nice if this could be done in such a way that a means "The acceted robability of declaring any air ij i t ij j different when in fact they are equal". In the following sections some strategies are worked out for this kind of simultaneous statistical inference. 2. LEAST SIGNIFICANT DIFFERENCE TEST The method consists of two stages. First H O : ~ ~ is to be tested with classical one way analysis of variance. If H O is rejected a t-test is to be erformed for every air. This idea originates from Fisher (1935) and it resuoses the variances to be equal. Fisher suggested using the same a for the t-tests as for the overall analysis of variance. Of course this is not safe in the sense mentioned in the introduction. An alternative to be considered is the Bonferroni idea S= ai(;) that is mentioned in Miller (1966). For this the robability that no error is made under H O is limited as follows: P = (1 -

THE-RC 52857/4 DIJK5TRA For unequal variances the one way analysis of variance can be relaced by the James second order test. For comaring the airs there are several ossibilities. The situation is called the Behrens-Fisher (1929) roblem, and one of the best aroximate solutions is Welch's modified t-test (1949). This test has been evaluated by Wang (1971) and he concluded that it gives the user excellent control over the size, whatever the value of the nuisance arameter 2 2 may be. The test statistic is e = a i la j and the critical level for some chosen size a is given by Students t distribution with a arameter 'V that takes the attern of the variances into account: 2 2 si s~ 2 n n i j (_ +...o!-) In most cases 'V ij is not an integer, so it has to be relaced by the nearest one. Ury and Wiggins (1971) suggested using this test with the Bonferroni a. The simultaneous confidence intervals for this aroach are given by: There are some alternatives mentioned in the literature. Hochberg (1976) suggested using: where y a is the solution of from Welch's modified t-test. k L 1=1 k L {(\t I>y} 'Vij j=1+1 a, in which 'V ij comes

THE-Re 52857/5 Tanhame (1977) suggested using Bajernee's (1961) aroximate solution 1 Behrens-Fisher roblem with y = 1 - ( 1-a ) k-1 This y has some history also be mentioned in the following sections. The confidence intervals of the and will become: Tamhane also suggested using Welch's test with this y. In the literature the author has found nine different aroximate solutions of the Behrens-Fisher roblem and five ideas concerning the size of the searate tests. Every combination can be made, so there is quite a lot of methods one can consider for airwise comarisons. But to be really safe, in the sense that the robability of declaring any air different when in fact they are equal should be limited by a, the airwise size S will become very small. For k = 15 and a = 0.05 the Bonferroni aroach will yield 8 = 0.00048, so it becomes almost imossible to reject any airwise comarison. Another disadvantage of this aroach is the fact that the results have to be reresented by a matrix containing symbols for accetance and rejection. Working at a terminal, as is usually done in alied statistics nowadays, one has to swallow an enormous lot of information in one glance if k exceeds the region of very small values. The next sections will suggest aroaches that are better in this resect. 3. MULTIPLE RANGE TESTS In this section a strategy will be ointed out that was originated by Newman (1939), Duncan (1951) and Keuls (1952). At first it will be necessary for the samle sizes to be equal (n i = n for i = 1, "', k). Also variance heterogeneity will not be allowed. Later on these limitations will be droed. Let XCI)' "', x(k) be the samle means, sorted in non-decreasing order. The first hyothesis of interest is H O : ).11 = = lit' where the ).Ii'S are renumbered so that their ordering becomes the same as the samle means which are their estimates.

THE-RC 52857/6 Then H O can be tested with: where q is the studentized range distribution. v "" variance Is estimated by: k(n-l) and the residual 2 s 1 "" - v If H O Is rejected. the next stage is to test Ill- "" ~-1 and Ilz "" "" ~. Proceeding like this until every hyothesis is acceted will yield a result that can be reresented as follows: -1-----1--------+-----+----------+------------ The interretation of this figure is that Il i "" Il j has to be rejected if there is no unbroken line that underscores x(i) and x(j). For instance: 114 "" IlS acceted IJ S "" 1J6 acceted ]J4 "" 116 rejected. Ifa candidate for the slitting used instead of qk Newman and 1.v ct "" 1 - (1- ct) - ct ct rocess contains means then qp.v is to be Keuls suggested ct "" ct and Duncan referred Now the equality of the samle sizes will be droed. but for the moment the variances will still have to be equal. Miller (1966) suggested using the median 11 k 1 of n 1 n k Winer (1962) considered the harmonic mean H (-= - L -). H k i=1 n 1

!HE-RC 52857/7 Kramer (1956) modified the formula of the test to this situation: all ~. k u - U E [x - x - q s {~(- + -)} ], where 'J - N - k and N "" I n i j i j, v n i n j i=l i Only in Kramerts case does the studentized range distribution hold. For Miller and Winer the aroximation will be reasonable if the samle sizes are not too different. Kramerts test contains a tra that can be shown in the following figure: Suose n l and n 4 are much smaller than n Z and n 3 Then ut -... = U4 can be acceted while ~ and U are significantly different. But the strategy will make 3 sure that this difference will never be found. From here on the variances will be allowed to the unequal. For equal samle sizes Ramseyer and Tcheng (1973) found that the studentized range statistic is remarkably robust against variance heterogeneity. So for almost equal samle sizes it seems reasonable to use the Winer or Miller aroach and ignore the differences in the variances. Unfortunately, the robustness of Kramer's test is rather oor [Games and Howell (1976)], so if the samle sizes differ greatly one might be temted to consider: a u i - ).Ij E. [x. - x j + q P ]., "ij where only the variances of the extreme samles are taken into account. This idea was mentioned by Games and Howell (1976) with Welch's "ij' The studentized range distribution does not hold for these searately estimated variances, but the aroximation seems reasonable though a bit conservative.

THE-RC 52857/8 The context in which Games and Howell suggested using this method was one of airwise comarisons with other arameters for q. But it looks like a good start for the construction of a "Generalized Multile Range test". This test, however attractive it may seem, still contains the tra that was already mentioned for Kramer's method. But there is more: 2 2 Suose s2 and s3 are (much) difference between ~ and Jl:3 2 2 smaller than s1 and s4' Then a significant can easily be ignored. The author has not found in the literature other aroaches to variance heterogeneity within the strategy of multile range tests. Some other a 's have been suggested, but since the choice of a has almost nothing to do with robustness against variance heterogeneity, their merits will not be discussed in this aer. The reresentation of the results with underscoring lines seems very attractive since this simle figure contains a lot of information, and also the artificial consistency that comes from the ordered means has some aeal. However the whole idea of a Generalized Multile Range te~t seems wrong. One simly cannot afford to take only the extreme means into account if the samle sizes and the variances differ greatly. 4. MULTIPLE F-TEST This test was roosed by Duncan (1951). In the original version the oulation variances must be equal. The rocedure is the same as for the Multile Range test, only the q-statistic is relaced by an F, so that the first stage becomes classical one way analysis of variance. At first Duncan roosed using a = 1 - (1-a)-1, but later he found a = 1 - (1_a)(-1)/(k-l) more suitable [Duncan (1955)]. The nature of the F-test allows unequal samle sizes. This seems to make this aroach more attractive than the Multile Range test, but there is a roblem:

THE-RC 52857/9 Suose ~1 = lj 4 is rejected. The next two hyotheses to be tested are 1J 1 = = 11 3 and ~ = = \.1 4 ' So 1.1 1 and 1J4 will always be called different. But if n 1 and n 4 are much smaller then n 2 and n 3 it is ossible that a airwise test for III and 1J 4 would not yield any significance. Duncan (1952) saw this roblem and suggested using a t-test for the airs that seemed significant as a result of the Multile F-test. This aroach he called the Multile Comarisons test. Nowadays this term has a more general meaning and, it seems to cover every classifying rocedure one might consider after rejecting I.It = = 1\' Now the equality of the variances will be droed. It is well known that the F-test is not robust against variance heterogeneity [Brown and Forsythe (1974), Ekbohm (1976)]. So it seems reasonable to use the non-iterative version of the second order method of James, thus making a "Multile James test". One could use Duncan's a, but the author refers a = 1 - (1-a)/k [Ryan (1960)] as a consequence of some arguments ointed out by Einot and Gabriel (1975). This a was mentioned in another context, but the arguments are not much shaken by the unequality of the variances. This new test contains the same roblem as the Multile F-test, but that is not all: I.It and \.1 4 will always be called different if \.1 1 = = 114 is rejected. Now 2 2 Z 2 suose that s2 and s3 are much smaller than s1 and s4' Then the difference between I.It and 11 4 may not be significant in a airwise comarison. Here the structural difference between this test and the aroach mentioned in the revious section comes into the icture: If extreme means coincide with big variances and small samles, then the Generalized Multile Range test can ignore imortant differences, while the Multile James test can wrongly declare means to be different. One can of course aly Welch's test for the Behrens-Fisher roblem to the airs that seem significant as a consequence of the Multile James test. This combination should be called the "Generalized Multile Comarisons test". A lot of extra work may be asked for, so it is of interest to know if this extension can have any serious influence on the conclusions.

THE-RC 52857/10 Werter and the author have examined this by adding another member to the family: the "Leaving One Out test". This is a Multile James test in which after rejection of ll. '"... ~ not only 1\... ~-l and JJ 2 '" '" ~ are considered but all the subsets of JJl' "', ~ where one JJ i is left out. The same a is used and the accetance of a hyothesis means that the slitting rocess for this subset stos. The Leaving One Out strategy is not limited to JJ 1 '"... ~ but is alied to every subset that.becomes a candidate. This aroach will avoid the classical tra of the Multile F-test and also the secific roblem that comes from variance heterogeneity. The Multile James test and the Leaving One Out test were alied to 7 case studies, containing 277 airs. Only 2 different airwise conclusions were reached, where the Leaving One Out test did not confirm the significance found by the Multile James test. But since the Multile Comarisons test is considered a useful reresentative. extension of the Multile F-test, this may not be The Leaving One Out test can be very exensive. In the worst case situation k where all the means are isolated the number of tests will be 2 -(k+l) instead of only \k(k-l) for the Multile James test and any member of the Least Significant Difference family. For k '" 15 this means 32752 tests instead of only 105. For values of k that make the Least Significance Difference aroach unattractive, the Multile James test is recommended with Ryan's a A terminal oriented comuter rogram such as BMDP should not only give the final result but also the mean, variance and number of observations for every samle. An interesting airwise significance can be verified by Welch's test for the Behrens-Fisher roblem. This should be considered if the samle variances involved are relatively big or if the samles contain only a few observations. 5. FINAL REMARK This small study on robustness of multile comarisons against variance heterogeneity only just touches some of the major roblems. They are dealt with searately in a simlified examle of four samles. In reality one has to deal with them simultaneously which makes the roblems much more difficult. Also there are some well known disturbing effects that are not mentioned in this aer.

THE-RC 52857/12 DIJK.STRA [8] Welch, B.L. Further note on Mrs. Asin's tables and on certain aroximation to the tabled function Biometrika ~ (1949), 293-296 [9] Wang, Y. Y. Probabilities of tye I errors of the Welch tests for the Behrens-Fisher roblem Journal of the American Statistical Association 66 (1971) [10] Ury, H.K. and A.D. Wiggins Large samle and other multile comarisons among means British Journal of Mathematical and Statistical Psychology 24 (1971), 174-194 [11] Hochberg, Y. A modification of the T-method of multile comarisons for a one-way lax: out with unequal variances Journal of the American Statistical Association 11 (1976), 200-203 [12] Tamhane, A.C. Multile Comarisons in model-1 one way!nova with unequal variances Communications in Statistics A6(1) (1977), 15-32 [13] Banerjee, S.K. On confidence intervals for two-means roblem based on searate estimates of variances and tabulated values of t-variable Sankhya, A23 (1961) [14] Newman, D. The distribution of the range in samles from a normal oulation, exressed in terms of an indeendent estimate of standard deviation Biometrika 11 (1939), 20-30 [15] Keuls, M. The use of the "studentized range" in connection with an analysis of variance Euhytica 1 (1952), 112-122 [16] Winer, B.J. Statistical rinciles in exerimental design New York, McGraw-Hill (1962) [17] Kramer, C.Y. Extension of multile range tests to grou means with unequal numbers of relications Biometrics 12 (1956), 307-310

THE-RC 52857/13 [18] Ramseyer G.C. and T. Tcheng The robustness of the studentized range statistic to violations of the normality and homogeneity of variance assumtions American Educational Research Journal 10 (1973) [19] Games, P.A. and J.F. Howell Pairwise multile comarison rocedures with unequal N's and/or Variances: a Monte Carlo Study Journal of Educational Statistics! (1976), 113-125 (20] Duncan, D.B. A significance test for differences between ranked treatments in an analysis of variances Virginia Journal of Science 2 (1951) [21] Duncan, D.B. Multile range and Multile F-tests Biometrics!l (1955), 1-42 [22] Duncan, D.B. On the roerties of the multile comarisons test Virginia Journal of Science 3 (1952) [23] Ekbohm, G. On testing the equality of several means with small samles The Agricultural College of Sweden, Usala (1976) [24] Ryan, T.A. Significance Tests for multile comarison of roortions, variances and other statistics Psychological Bulletin 57 (1960), 318-328 [25] Einot, I. and K.R. Gabriel A study of the owers of several methods of multile comarisons Journal of the American Statistical Association 70 (1975), 574-583