A RANK STATISTIC FOR NON-PARAMETRIC K-SAMPLE AND CHANGE POINT PROBLEMS

J. Japan Statist. Soc., Vol. 41, No. 1, 2011, 67–73

Yoichi Nishiyama*

We consider k-sample and change point problems for independent data in a unified way. We propose a test statistic based on the rank statistics. The asymptotic distribution under the null hypothesis is shown to be the supremum of the 2-dimensional standard Brownian pillow. Also, the test is shown to be consistent under the alternative that k distribution functions are linearly independent. It is important from a practical point of view that our test is not only asymptotically distribution free but also distribution free even for fixed finite sample.

Key words and phrases: Empirical process, invariance principle, rank statistic, weak convergence.

1. Introduction

This paper studies the k-sample and change point problems in a unified way. Both problems have long histories. Kiefer (1959) considered k-sample Kolmogorov-Smirnov and Cramér-von Mises tests, while Scholz and Stephens (1987) studied the k-sample Anderson-Darling test. The approaches based on the empirical distribution functions are naive, but the limit distributions are often difficult to compute. On the other hand, other approaches often need some restrictions on alternatives. For example, Jonckheere (1954) considered the k-sample test against ordered alternatives (see also Odeh (1971)), while Mack and Wolfe (1981) treated that against umbrella alternatives. Hettmansperger and Norton (1987) considered the k-sample problem against a patterned alternative. (On the contrary, the alternative in our approach just requires that the k distribution functions are linearly independent.)

Regarding the change point problems, many authors have considered parametric and non-parametric approaches. See, e.g., the book by Csörgő and Horváth (1997). Since we are interested in a non-parametric approach, we only review preceding results in that direction. Pettitt (1979) applied the Wilcoxon-Mann-Whitney statistic to the non-parametric change point problem.
Lombard (1987) proposed a procedure based on quadratic form rank statistics to test one or more change points. (One of the interesting points of our approach is that we do not require prior knowledge about how many change points exist under the alternative.) Recently, based on Lombard's approach, Murakami (2010) introduced a rank statistic for the change point problem of location-scale parameters. Praagman (1988) established the Bahadur efficiency of some rank tests for the non-parametric change point problem. As for estimation problems, Carlstein (1988) proposed an estimator for a change point without assuming any specific structure of the underlying distribution, among others.

Our idea comes from a kind of CUSUM empirical process, although the resulting test is a rank statistic. Our test has the following merits. (1) It is distribution free under the null hypothesis; the distribution of our test depends only on the total sample size. (2) The asymptotic distribution under the null hypothesis is the supremum of the absolute value of the 2-dimensional standard Brownian pillow; of course, it does not depend on k. (3) Our alternative is natural; we only assume that the k distribution functions are linearly independent. (4) Our test is easy to compute.

The organization of the rest of this paper is as follows. In Section 2, we state some asymptotic results under the null and alternative hypotheses. These results are proved in Section 3. In Section 4, we present some simulation studies.

Received January 4, 2011. Revised March 24, 2011. Accepted April 15, 2011.
*The Institute of Statistical Mathematics, 10-3 Midori-cho, Tachikawa, Tokyo 190-8562, Japan.

2. Asymptotic results

Let us describe two problems which we consider in this paper. The first problem is the so-called k-sample problem. Let $X^c_1,\dots,X^c_{n_c}$, $c = 1,\dots,k$, be 1-dimensional independent data such that for every $c = 1,\dots,k$ the data $X^c_1,\dots,X^c_{n_c}$ come from a 1-dimensional continuous distribution $F^c$. We wish to test the hypotheses:

$H_0$: the $F^c$'s are the same for all $c = 1,\dots,k$ (we denote the common distribution by $F$);

$H_1$: the $F^c$'s are distribution functions that are linearly independent: that is, for weight constants $(w_1,\dots,w_k)$ such that $\sum_c w_c = 0$, it holds that $\sum_c w_c F^c(\cdot) \equiv 0$ implies $w_c = 0$ for all $c$.

(The additional constraint $\sum_c w_c = 0$ is not a real restriction for the linear independence because of the fact that $F^c(\infty) = 1$ for all $c$.)

The second problem is the so-called change point problem. Let $X_1,\dots,X_n$ be 1-dimensional independent data. We wish to test the hypotheses:

$H_0$: all $X_i$'s come from a certain continuous distribution $F$;

$H_1$: there exist $0 = u_0 < u_1 < \cdots < u_k = 1$ such that $X_i$, $i = [nu_{c-1}]+1,\dots,[nu_c]$, $c = 1,\dots,k$, come from a distribution $F^c$, where the $F^c$'s are functions that are linearly independent.
We can treat the second problem as a special case of the first, by regarding $n_c = [nu_c] - [nu_{c-1}]$, hence from now on we deal with the first problem. We shall assume that $\gamma_c = \lim n_c/n$, where $n = \sum_c n_c$, as $n_c \to \infty$ for every $c = 1,\dots,k$. (We assume that at least two $\gamma_c$'s are positive.) This is a natural assumption for the second problem where $\gamma_c = u_c - u_{c-1}$, and throughout this paper we consider this asymptotic scheme.

In the first problem, let us set $\{X_1,\dots,X_n\} = \{X^1_1,\dots,X^1_{n_1},\dots,X^k_1,\dots,X^k_{n_k}\}$. Let us denote by $\{X_{(1)},\dots,X_{(n)}\}$ the order statistics and by $\{R_1,\dots,R_n\}$ the rank statistics of the data $\{X_1,\dots,X_n\}$. We propose the test statistic

$$D_n = \frac{1}{\sqrt{n}} \max_{1 \le i,j \le n} \left| \sum_{q=1}^{i} 1\{R_q \le j\} - \frac{ij}{n} \right|. \qquad (2.1)$$

The main result of this paper is the following.

Theorem 1. (i) Under the null hypothesis $H_0$, it holds that, as $n \to \infty$,

$$D_n \xrightarrow{d} \sup_{0 \le s,t \le 1} |W(s,t)|,$$

where $W$ is a centered Gaussian process with the covariance

$$E[W(s_1,t_1)W(s_2,t_2)] = (s_1 \wedge s_2 - s_1 s_2)(t_1 \wedge t_2 - t_1 t_2).$$

Hence the test is asymptotically distribution free under $H_0$.

(ii) Under the alternative $H_1$, it holds that, as $n_c \to \infty$ for every $c = 1,\dots,k$,

$$D_n \ge \sqrt{n} \max_{1 \le c_* \le k} \sup_x |B_{c_*}(x)| - o(\sqrt{n}) - O_P(1),$$

where

$$B_{c_*}(x) = \sum_{c \le c_*} \gamma_c (1 - u_{c_*}) F^c(x) - \sum_{c > c_*} \gamma_c u_{c_*} F^c(x)$$

and where $u_{c_*} = \sum_{c \le c_*} \gamma_c$. (Notice that the sum of the coefficients of the $F^c$'s is zero, so by assumption $\sup_x |B_{c_*}(x)|$ is positive.) Hence the test is consistent under $H_1$.

It is important from a practical point of view that we do not assume any specific structure on the distributions $F^c$. Our approach is a purely non-parametric one. It is also important to notice that the value $k$ is not used in the construction of our test, so we can treat it as an unknown parameter. Compare our way of constructing the test statistic with the ones for other tests in the context of the change point problems, where the value $k$ has often been assumed to be known.

3. Proof of Theorem 1

To begin with, let us notice that

$$\begin{aligned}
\tilde D_n &:= \sup_{u \in [0,1]} \sup_{x \in \mathbb{R}} \left| \frac{1}{\sqrt{n}} \sum_{i=1}^{n} (1\{i \le nu\} - u)\, 1\{X_i \le x\} \right| \\
&= \max_{1 \le i \le n} \sup_{x \in \mathbb{R}} \frac{1}{\sqrt{n}} \left| \sum_{q=1}^{i} 1\{X_q \le x\} - \frac{i}{n} \sum_{q=1}^{n} 1\{X_q \le x\} \right| + O\!\left(\frac{1}{\sqrt{n}}\right) \\
&= \max_{1 \le i,j \le n} \frac{1}{\sqrt{n}} \left| \sum_{q=1}^{i} 1\{X_q \le X_{(j)}\} - \frac{i}{n} \sum_{q=1}^{n} 1\{X_q \le X_{(j)}\} \right| + O\!\left(\frac{1}{\sqrt{n}}\right) \\
&= D_n + O\!\left(\frac{1}{\sqrt{n}}\right).
\end{aligned}$$
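For illustration, the statistic (2.1) can be evaluated directly from the ranks, and its null distribution for a fixed $n$ can be approximated by simulation, since under $H_0$ it depends on the data only through the ranks. The following Python sketch is not part of the original paper: it assumes NumPy, data without ties, and the hypothetical function names `rank_statistic` and `monte_carlo_p_value`. It builds the full $n \times n$ array of partial sums, so it is meant for moderate $n$ only.

```python
import numpy as np


def rank_statistic(x):
    """Evaluate D_n of (2.1) for pooled data x (assumes no ties)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    # R_q in {1, ..., n}: rank of x[q] within the pooled sample.
    ranks = np.argsort(np.argsort(x)) + 1
    levels = np.arange(1, n + 1)
    # indicator[q-1, j-1] = 1{R_q <= j}
    indicator = ranks[:, None] <= levels[None, :]
    # partial[i-1, j-1] = sum_{q <= i} 1{R_q <= j}
    partial = np.cumsum(indicator, axis=0)
    centering = levels[:, None] * levels[None, :] / n  # i * j / n
    return float(np.abs(partial - centering).max() / np.sqrt(n))


def monte_carlo_p_value(x, n_sim=1000, seed=0):
    """Approximate P(D_n >= observed) under H0 by simulation.

    Under H0 the statistic is distribution free, so the null sample
    is drawn from the uniform distribution on [0, 1].
    """
    rng = np.random.default_rng(seed)
    n = len(x)
    observed = rank_statistic(x)
    null = np.array([rank_statistic(rng.random(n)) for _ in range(n_sim)])
    return float(np.mean(null >= observed))


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Change point example with k = 2: mean shift after 3/4 of the data.
    data = np.concatenate([rng.normal(0, 1, 30), rng.normal(1, 1, 10)])
    print(rank_statistic(data), monte_carlo_p_value(data))
```

Because only the ranks enter, applying any strictly increasing transformation to the data leaves the value of `rank_statistic` unchanged; the number of simulations and the seed are arbitrary choices for this sketch.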

Proof of Theorem 1 (i). Since

$$\sup_{u \in [0,1]} \left| \frac{1}{\sqrt{n}} \sum_{i=1}^{n} (1\{i \le nu\} - u) \right| \le \frac{1}{\sqrt{n}},$$

the random variable $\tilde D_n$ is asymptotically equivalent to the supremum of the absolute value of

$$(u, x) \mapsto \frac{1}{\sqrt{n}} \sum_{i=1}^{n} (1\{i \le nu\} - u)(1\{X_i \le x\} - F(x)),$$

which converges weakly in $\ell^\infty([0,1] \times \mathbb{R})$ to the centered Gaussian process $(u, x) \mapsto W(u, F(x))$. (We denote by $\ell^\infty(T)$ the space of bounded functions on $T$, and equip it with the uniform metric. See e.g. van der Vaart and Wellner (1996) for the weak convergence theory in this space.) Hence the result follows from the continuous mapping theorem.

Proof of Theorem 1 (ii). Let $c_*$ be any of the indices $c$. We also denote by $x_*$ the argsup of $x \mapsto |B_{c_*}(x)|$. Notice that

$$\tilde D_n \ge \left| \frac{1}{\sqrt{n}} \sum_{i=1}^{n} (1\{i \le nu_{c_*}\} - u_{c_*})\, F^{\tau(i)}(x_*) \right| - \left| \frac{1}{\sqrt{n}} \sum_{i=1}^{n} (1\{i \le nu_{c_*}\} - u_{c_*})(1\{X_i \le x_*\} - F^{\tau(i)}(x_*)) \right|,$$

where $\tau(i) = c$ for $i = n_1 + \cdots + n_{c-1} + 1, \dots, n_1 + \cdots + n_c$, that is, $\tau(i)$ is the index of the sample to which $X_i$ belongs. By the central limit theorem, the second term on the right hand side converges weakly to a tight limit. The first term can be written as $\sqrt{n}\,|B_{c_*}(x_*) + o(1)|$. The proof is finished.

4. Remark for non-asymptotic case

The way of defining the test statistic (2.1) has a merit not only for the asymptotic study but also for the finite sample argument. Since the test is defined only through the rank statistics, it is distribution free under $H_0$ as long as the underlying distribution $F$ is a 1-dimensional continuous distribution. So, we can compute the p-values for fixed $n$ by computer simulation, by setting the underlying distribution $F$ to be, for example, the uniform distribution. The p-values obtained by Monte Carlo simulation are much better than the ones obtained by using the asymptotic distribution $\sup_{s,t} |W(s,t)|$, especially when $n$ is not very large.

Here we present some numerical results by figures. For illustration, we demonstrate the simulation for the change point problem with $k = 2$. Figure 1 shows plots of the independent data $X_1,\dots,X_n$ from the Gaussian distribution $N(0,1)$ [upper], and $X_1,\dots,X_{[3n/4]}$ from $N(0,1)$ and $X_{[3n/4]+1},\dots,X_n$ from $N(1,1)$ [lower] (with $n = 40$). We perform 1000 simulations for the computation of the test statistic $D_n$. Figure 2 ($n = 40$) and Figure 3 ($n = 100$) present the empirical distribution functions (EDF) under $H_0$ and $H_1$. Finally, we draw the empirical distribution functions of 1000 simulations for $D_n$ under $H_0$ for $n = 10, 40, 100$ (Figure 4). We set $F$ to be the uniform distribution, but the choice of the underlying distribution is not important because we know that the test is distribution free. So we can compute the approximate p-values even for small $n$ based on this kind of computer simulation.

[Figure 1. Plots of the data. Upper panel: H0 (no change point); lower panel: H1 (a change point).]

[Figure 2. EDF of 1000 simulations of $D_{40}$ under $H_0$ (left) and $H_1$ (right).]

[Figure 3. EDF of 1000 simulations of $D_{100}$ under $H_0$ (left) and $H_1$ (right).]

[Figure 4. EDF of 1000 simulations of $D_{10}$, $D_{40}$ and $D_{100}$ under $H_0$.]

Acknowledgements

This work was supported by Grant-in-Aid for Scientific Research (C), 21540157, from Japan Society for the Promotion of Science.

References

Carlstein, E. (1988). Nonparametric change-point estimation, Ann. Statist., 16, 188–197.
Csörgő, M. and Horváth, L. (1997). Limit Theorems in Change-Point Analysis, Wiley, New York.
Hettmansperger, T. P. and Norton, R. M. (1987). Test for patterned alternatives in k-sample problems, J. Amer. Statist. Assoc., 82, 292–299.
Jonckheere, A. R. (1954). A distribution free k-sample test against ordered alternatives, Biometrika, 41, 133–145.

Kiefer, J. (1959). K-sample analogues of the Kolmogorov-Smirnov and Cramér-V. Mises tests, Ann. Math. Statist., 30, 420–447.
Lombard, F. (1987). Rank tests for changepoint problems, Biometrika, 74, 615–624.
Mack, G. A. and Wolfe, D. A. (1981). K-sample rank tests for umbrella alternatives, J. Amer. Statist. Assoc., 76, 175–181.
Murakami, H. (2010). A rank statistic for the change-point problem and its application, J. Jpn. Soc. Comp. Statist., 23, 27–40.
Odeh, R. E. (1971). On Jonckheere's k-sample test against ordered alternatives, Technometrics, 13, 912–918.
Pettitt, A. N. (1979). A non-parametric approach to the change-point problem, Appl. Statist., 28, 126–135.
Praagman, J. (1988). Bahadur efficiency of rank tests for the change-point problem, Ann. Statist., 16, 198–217.
Scholz, F. W. and Stephens, M. A. (1987). K-sample Anderson-Darling tests, J. Amer. Statist. Assoc., 82, 918–924.
van der Vaart, A. W. and Wellner, J. A. (1996). Weak Convergence and Empirical Processes: With Applications to Statistics, Springer-Verlag, New York.