A nonparametric two-sample wald test of equality of variances

Size: px

Start display at page:

Download "A nonparametric two-sample wald test of equality of variances"

Magdalen Marilynn Jacobs
6 years ago
Views:

Unversty of Wollongong Research Onlne Centre for Statstcal & Survey Methodology Workng Paper Seres Faculty of Engneerng and Informaton Scences 0 A nonparametrc

John, A nonparametrc two-sample wald test of equalty of varances, Centre for Statstcal and Survey Methodology, Unversty of Wollongong, Workng Paper -, 0. http://ro.

1 Unversty of Wollongong Research Onlne Centre for Statstcal & Survey Methodology Workng Paper Seres Faculty of Engneerng and Informaton Scences 0 A nonparametrc two-sample wald test of equalty of varances Davd Allngham Unversty of Newcastle John Rayner Unversty of Wollongong Recommended Ctaton Allngham, Davd and Rayner, John, A nonparametrc two-sample wald test of equalty of varances, Centre for Statstcal and Survey Methodology, Unversty of Wollongong, Workng Paper -, 0. Research Onlne s the open access nsttutonal repostory for the Unversty of Wollongong. For further nformaton contact the UOW Lbrary: research-pubs@uow.edu.au

2 A Nonparametrc Two-Sample Wald Test of Equalty of Varances Davd Allngham* and J. C. W. Rayner Abstract We develop a test for equalty of varances gven two ndependent random samples of observatons. The test can be expected to perform well when both sample szes are at least moderate and the sample varances are asymptotcally equvalent to the maxmum lkelhood estmators of the populaton varances. The test s motvated by and s here assessed for the case when both populatons sampled are assumed to be normal. Popular choces of test would be the two sample F test f normalty can be assumed and Levene s test f ths assumpton s dubous. Another compettor s the Wald test for the dfference n the populaton varances. We gve a nonparametrc analogue of ths test and call t the R test. In an ndcatve emprcal study when both populatons are normal, we fnd that for moderate sample szes the R test s nearly as robust as Levene s test and nearly as powerful as the F test. Key Words: Bartlett s test; Levene s test; Wald tests.. Introducton: Testng Equalty of Varances for Two Independent Samples In the two-sample problem we are gven two ndependent random samples X,..., X n and X,..., X n. The locaton problem attracts most attenton. Assumng that the samples are from normal populatons, the pooled t-test s used to test equalty of means assumng equal varances and Welch s test can be used when equalty of varances s suspect but normalty s not. When normalty s n doubt the Wlcoxon test s often used. The correspondng dsperson problem s of nterest to confrm the valdty of, for example, the pooled t-test, and for ts own sake. For example, testng for reduced varablty s of nterest n confrmng natural selecton (see, for example, [, Secton 5.5]) and f some processes are n control. In exploratory data analyss t s sensble to assess f one populaton s more varable than another. If t s, the cause may be that one populaton s b-modal and the other s not; the consequences of ths n both the scenaro and the model can then be explored n depth. The study here ntroduces a new test of equalty of varances. The asymptotc null dstrbuton of the test statstc s, but, dependng on the populatons sampled from, ths may or may not be satsfactory for small to moderate sample szes. To assess ths, and other aspects of the proposed test, an ndcatve emprcal study s undertaken. We gve comparsons when both populatons sampled are normal, when both samples have the same sample sze, and for 5% level tests. Frst *Author to whom correspondence should be addressed. Centre for Computer-Asssted Research Computaton and ts Applcatons, School of Mathematcal and Physcal Scences, Unversty of Newcastle, NSW 308, Australa e-mal: Davd.Allngham@newcastle.edu.au Centre for Statstcal and Survey Methodology, School of Mathematcs and Appled Statstcs, Unversty of Wollongong, NSW 5, Australa and School of Mathematcal and Physcal Scences, Unversty of Newcastle, NSW 308, Australa e-mal: John.Rayner@newcastle.edu.au

3 we derve a fnte sample correcton to the crtcal values based on the asymptotc null dstrbuton when samplng from normal populatons. Dfferent correctons would be needed when samplng from dfferent dstrbutons. Next we show that n moderate samples the new test s nearly as powerful as the F test when normalty may be assumed, and fnally, that t s nearly as robust as Levene s test when normalty s n doubt. Moore and McCabe [, p.59] clam that the F test and other procedures for nference about varances are so lackng n robustness as to be of lttle use n practce. The new test gves a counterexample to that proposton. We acknowledge that the new test s most effectve for at least moderate sample szes. In the normal case each random sample should have at least 5 observatons, whch s what we would expect of a serous study amng at reasonable power that cannot be hoped for wth samples of sze 0 or so. See Secton. We are aware of more expansve comparatve studes such as [3 and ]. Our goal here s not to emulate these studes but to show that the new test s compettve n terms of test sze, robustness and power. In Secton the new test s ntroduced. Sectons 3, and 5 gve the results of an emprcal nvestgaton when the populatons sampled are assumed to be normal. In Secton 3 we nvestgate test sze, showng that the asymptotc crtcal values should only be used for moderate to large sample szes. For smaller sample szes a fnte sample correcton to the asymptotc 5% crtcal value s gven. Ths results actual test szes between.6% and 5.3%. In Secton we show that when normalty holds the new test s not as powerful as the Levene test for smaller sample szes, but overtakes t for moderate samples of about 5. The new test s always nferor to the optmal F test. However the R test has power that approaches that of the F test, beng at least 95% that of the F test throughout most of the parameter space n samples of at least 80. In Secton 5 we show that f we sample from t dstrbutons wth varyng degrees of freedom, the F test s hghly non-robust for small degrees of freedom, as s well-known for fat-taled dstrbutons. The new test performs far better than the F test, and although t s challenged somewhat for small degrees of freedom, ts performance s only slghtly nferor to the Levene test. That the new test gves a good compromse between power and robustness, and s vald when normalty doesn t hold, are strong reasons for preferrng the new test for sample szes that are at least moderate and normalty s dubous.. Compettor Tests for Equalty of Varance Intally we assume that we have two ndependent random samples X,..., X from normal populatons, N(, ) for = and. We wsh to test H: = n aganst the alternatve K: nusance parameters. If n S = j, wth the populaton means beng unknown ( Xj X. ) /( n ), n whch X. = n j X j / n, =,, then the S are the unbased sample varances, and the so-called F test s based on ther quotent, S / S = F, say. It s well-known, and wll be confrmed yet agan n Secton 5, that the null dstrbuton of F, Fn, n, s senstve to departures from normalty. If F n, n (x) s the cumulatve dstrbuton functon of the Fn, n 3 August 00

4 dstrbuton, and f c p s such that F n, n (c p ) = p, then the F test rejects H at the 00 % level when F < c / and when F > c ( /). Common practce when normalty s n doubt s to use Levene s test or a nonparametrc test such as Mood s test. Levene s test s based on the ANOVA F test appled to the resduals. There are dfferent versons of Levene s test usng dfferent defntons of resdual. The two most common versons use resduals based on the ~ ~ group means, X j X., and the group medans, X j X., n whch X. s the medan of the th sample. The latter s called the Brown-Forsythe test. Agan t s well-known that these tests are robust n that when the populaton varances are equal but the populatons themselves are not normal, they acheve levels close to nomnal. However ths happens at the expense of some power. As the emprcal study n ths paper s ntended to be ndcatve rather than exhaustve, we wll henceforth make comparsons only wth the Levene test, based on a test statstc we denote by L. We now construct a new test that we call the R test. For unvarate parameters, a Wald test statstc of H: = 0 aganst the alternatve K: 0 s based on ˆ, the maxmum lkelhood estmator of, usually va the test statstc ˆ ( ) /estvar( ˆ 0 ), where est var( ˆ ) s a consstent estmate of var( ˆ). Under the null hypothess ths test statstc has an asymptotc dstrbuton. As well as beng equvalent to the lkelhood rato test, the F test s also a Wald test for testng H: = / = aganst K:. Rayner [5] derved the Wald test for testng H: = = 0 aganst K: 0. The test statstc s ( S S ) S S n n = W, say. Beng a Wald test, the asymptotc dstrbuton of W s, whle ts exact dstrbuton s not mmedately obvous. However W s a - functon of F, so the two tests are equvalent. Snce the exact dstrbuton of F s avalable, the F test s the obvous test to use. In W, the varances var( S ) are estmated optmally usng the Rao-Blackwell j theorem. Ths depends very strongly on the assumpton of normalty. If normalty s n doubt then we can estmate var( S S ) usng results gven, for example, n [6]. For a random sample Y,..., Y n wth populaton and sample central moments r and m r = n j r ( Y Y ) / n, r =, 3,..., [6] gves that j E[m r ] = r + O(n ) and var(m ) = ( )/n + O(n ). Applyng [6, 0.5], may be estmated to O(n ) by m, or, equvalently, by n m /(n ) = S, where S s the unbased sample varance. It follows that, to order O(n ), var(m ) may be estmated by (m m )/n. We thus propose a robust alternatve to W, gven by 3 3 August 00

5 ( S m S n S ) m S n = R say, n whch m, are the fourth central sample moments for the th sample, =,. We call the test based on R the R test. As the sample szes ncrease, the dstrbutons of the sample varances approach normalty, the denomnator n R wll approxmate var( S S ), and R wll have asymptotc dstrbuton. Thus, f c s the pont for whch the dstrbuton has weght n the rght hand tal, then the R test rejects H at approxmately the 00 % level when R > c. We emphasse that although the motvaton for the dervaton of R s under the assumpton of samplng from normal populatons, t s a vald test statstc for testng equalty of varances no matter what the populatons sampled. If the sample varances are equal to or asymptotcally equvalent to the maxmum lkelhood estmators of the populaton varances, as s the case when samplng from normal populatons, then the R test s a Wald test for equalty of varances n the sense descrbed above. Snce t doesn t depend on any dstrbutonal assumptons about the data, t can be thought of as a nonparametrc Wald test. As such t can be expected to have good propertes n large samples. We note that all the above test statstcs are nvarant under transformatons a(x j b ), for constants a, b and b and for j =,..., n and =,. The next three sectons report an emprcal study when the dstrbutons sampled are assumed to be normal. As ths s an ndcatve study, we fx the samples szes to be equal, n = n = n, say, and the sgnfcance level to be 5% throughout. 3. Test Sze Under Normalty Under the null hypothess, the dstrbuton of F s known exactly, the dstrbuton of L s known approxmately, and, as mentoned above, the dstrbuton of R s known asymptotcally. In analysng data these dstrbutons are used to determne p-values and crtcal values. We now nvestgate ther use n determnng test sze, the probablty of rejectng the null hypothess when t s true. Two emprcal assessments of test sze wll now be undertaken. Snce the test statstcs are scale nvarant, t s suffcent under the null hypothess to take both populaton varances to be one. In the frst assessment we assume normalty. For varous values of the common sample sze n, we estmate the 5% crtcal ponts for each test by generatng 00,000 pars of random samples of sze n, calculatng the test statstcs, orderng them and hence dentfyng the 0.95th percentle. The estmated crtcal ponts of R approach the 5% crtcal pont 3.8. These estmated crtcal ponts wll subsequently be used n the power study to gve tests wth test sze of exactly 5%. To see the extent of the error caused by usng the asymptotc crtcal pont 3.8, Fgure gves the proporton of rejectons n 00,000 pars of random samples for sample szes up to 00. For n = 0 the proporton of rejectons s nearly 0% and although for n = 0 ths has dropped to nearly 7%, most users would hope for observed test szes closer to 5%. 3 August 00

6 Fgure. Proporton of rejectons of the R test usng the 5% crtcal pont 3.8 for sample szes up to 00 To mprove the applcaton of the test we plotted the estmated 5% crtcal ponts aganst n and used standard curve fttng technques to fnd that the exact crtcal ponts were well approxmated between n = 0 and 00 by c(n, 0.05) = 3.86( / n +.7/n). For larger n t s suffcent to use the asymptotc 5% value 3.86, the error beng at most 0.8%. We checked the exact probabltes of rejecton under the null hypothess when applyng the test wth crtcal value c(n, 0.05), and all were between.6% and 5.3%. For levels other than 5%, and when sample szes are unequal, further emprcal work needs to be done to fnd crtcal values. However, as ths study was ntended to be ndcatve, we leave extensve tabulaton of crtcal values for another tme.. Power Under Normalty For the F, Levene and R tests we estmate the power as the proporton of rejectons from 00,000 pars of random samples of sze n when the frst sample s from a N(0, ) populaton and the second s from a N(0, ) populaton wth >. To compare lke wth lke, we use estmated crtcal values that gve exact 5% level tests. It s apparent that for sample szes less than about 0 the Levene test s more powerful than the R test and that between approxmately 0 and 30 the R test takes over from the Levene test; thereafter the R test s always more powerful than the Levene test. Ths s shown n Fgure. 5 3 August 00

7 Fgure. Power of the 5% level L test (sold lne) and R test (dashed lne) for varous sample szes Fgure 3. Contour plots of the L test (left panel) and R test (rght panel) relatve to the F test showng regons n whch the power ratos are less than 95%, between 95% and 99.99%, and greater than 99.99%. 6 3 August 00

8 Both the Levene and R tests are always less powerful that the F test. Ths s explored n Fgure 3 that compares the Levene test to the F test n the left hand panel and the R test to the F test n the rght hand panel. What s gven s a contour plot of the regons n whch the ratos of the power of the stated test to the F test s less than 95%, between 95% and 99.99%, and greater than 99.99%. Generally, for any gven n and, t s clear that the power of the Levene test s at most that of the R test. For example t appears that for n = n = 80 approxmately the power of the R test s always at least 95% that of the F test, whereas there s a consderable regon where the power of the Levene test s less than 95% that of the F test. Fgure. Test szes for the F (dots), L (dashes) and R (sold lne) tests for t dstrbutons wth varyng degrees of freedom,, from to 0 5. Robustness Even f the R test has good power, the test s of lttle value unless t s robust n the sense that when the dstrbutons from whch we are samplng are not from the nomnated populaton (here, the normal dstrbuton) the p-values are reasonably accurate. It s thus of nterest to estmate the proporton of rejectons when the null hypothess s true and both the populatons from whch we sample are not normal. We could have looked at varable skewness through gamma dstrbutons wth varyng shape parameter, but nstead we consder varable kurtoss va t dstrbutons wth varyng degrees of freedom. If the degrees of freedom, say, are large, the dstrbuton sampled wll be close enough to normal that we could expect the proporton of rejectons to be close to the nomnal. The crtcal values used here are c(n, 0.05) gven n Secton August 00

9 In Fgure we plot the proporton of rejectons for the Levene, F and R tests when samplng from t dstrbutons for =,..., 0. We show curves for each test wth common sample szes n = 0, 5, 50 and 00. The crtcal values used are the c(n, 0.05) from Secton 3. It s apparent that the F test performs ncreasngly poorly as the degrees of freedom reduce and the tals of the dstrbuton become fatter. Interestngly, n ths scenaro, the F test s always lberal (exact test sze more than the sgnfcance level) whle the R test s almost always conservatve (exact test sze less than the sgnfcance level). In general the latter s to be preferred. The Levene test generally has exact level closer to the nomnal level than the R test except for small degrees of freedom. However the level of the R test s almost always reasonable, and whle for very small the level s not as close to the exact level as perhaps we may prefer, the same s the case for the Levene test. 5. Concluson Frst, we reflect on testng for equalty of varances when t s assumed that the populatons sampled are normal. The F test s both the lkelhood rato test and a Wald test, and s the approprate test to apply. When normalty does not hold, the F test s no longer an asymptotcally optmal test and ts well-known non-robustness means that tests such as the Levene are more approprate for small to moderate sample szes. However for sample szes of about 5 or more the R test s more powerful than the Levene and wth the small sample corrected crtcal values t holds ts nomnal sgnfcance level well. For these sample szes t can be preferred to the Levene test. Second, consder testng for equalty of varances when both samples are drawn from the same populaton. If that populaton s nomnated then the R test may be appled after determnng crtcal values or usng p-values calculated by Monte Carlo methods. When the sample varances are asymptotcally equvalent to the maxmum lkelhood estmators of the populaton varances (as, for example, s the case when samplng from normal populatons but not Posson populatons), the R test s a nonparametrc Wald test and hence wll have good power n suffcently large samples. If the populaton s not specfed the R test can be confdently appled when the sample szes are large, usng the asymptotc null dstrbuton to calculate p- values or crtcal values. References [] B.F.J. Manley, The Statstcs of Natural Selecton, Chapman and Hall, London, 987. [] D.S. Moore and G.P. McCabe, Introducton to the Practce of Statstcs (5 th ed.), W.H. Freeman, New York, 006. [3] W.J. Conover, Mark E Johnson and Myrle M. Johnson, A comparatve study of tests of homogenety of varances, wth applcatons to the outer contnental shelf bddng data, Technometrcs, vol. 3, no., pp , 98. [] Denns D. Boos and Cavell Browne, Bootstrap methods for testng homogenety of varances, Technometrcs, vol. 3, no., pp. 69-8, 989. [5] J.C.W. Rayner, The Asymptotcally Optmal Tests, Journal of the Royal Statstcal Socety: Seres D, vol. 6, no. 3, , 997. [6] A. Stuart and J.K. Ord, Kendall's Advanced Theory Of Statstcs, Vol.: Dstrbuton theory (6th ed.), Hodder Arnold, London, August 00

A nonparametric two-sample wald test of equality of variances

University of Wollongong Research Online Faculty of Informatics - Papers (Archive) Faculty of Engineering and Information Sciences 211 A nonparametric two-sample wald test of equality of variances David