Bounds of the normal approximation to random-sum Wilcoxon statistics

R ESEARCH ARTICLE doi: 0.306/scienceasia53-874.04.40.8 ScienceAsia 40 04: 8 9 Bounds of the normal approximation to random-sum Wilcoxon statistics Mongkhon Tuntapthai Nattakarn Chaidee Department of Mathematics and Computer Science Faculty of Science Chulalongkorn University Bangkok 0330 Thailand Corresponding author e-mail: nattakarn.c@chula.ac.th. Received 4 May 03 Accepted 6 Feb 04 ABSTRACT: Consider sequences X i i= and Y j j= of independent and identically distributed i.i.d. random variables random variables K K ranging over of all positive integers where the X i s Y j s K and K are all independent. We obtain Berry-Esseen bounds for random-sum Wilcoxon s statistics in the form W K K U/V and W K K a/b where W K K = K i= K j= IXi > Yj and U V are random variables and a b are constants. We also show that the rate of convergence is OEK / provided by EK /EK τ for some constant τ > 0 when EK and EK tend to infinity. KEYWORDS: Wilcoxon s rank-sum statistics Mann-Whitney s statistics random-indexed summands Berry- Esseen bounds non-random centring INTRODUCTION Let X X... X m and Y Y... Y n be two sequences of independent and identically distributed i.i.d. random variables. Note that the distributions of X and Y are not necessarily identical. Suppose the X i s and Y j s are independent and continuous random variables. To test the null hypothesis that the distributions of X and Y are equal Wilcoxon introduced rank-sum statistics by ranking the random variables between two sequences and then computing the sum of these rankings. Mann and Whitney proposed the Mann-Whitney s statistic W mn = m i= j= n I X i > Y j. Mann and Whitney showed that their statistics and Wilcoxon s rank-sum statistics are equivalent. They also showed that under the null hypothesis Mann- Whitney s distribution can be approximated by the standard normal distribution denoted by Φ. Consider a Wilcoxon s rank-sum statistic defined by W m = m k= R k where R k is the ranking-number of X k between two samples X... X m and Y... Y n. Pestman 3 obtained normal approximation theorems and Alberink 4 investigated Berry-Esseen bound for Wilcoxon s rank-sum statistics by assuming the null hypothesis. From the result of Alberink 4 and the well-known fact that W m = W mn mm/ we obtain the bound for Mann-Whitney s statistics as the next statement. Theorem Ref. 4 Under the null hypothesis that the distributions of X and Y are equal we have the following result: sup x R P W mn mn/ x Φx mnm n/ 7 3 3m n m n 4. mnm n It is obvious that Mann-Whitney s statistics are example of U-statistics introduced by Hoeffding 5 and Lehmann 6. So that the bounds for Mann-Whitney s statistics were established by applying the results of Grams and Serfling 7 or Chen and Shao 8. Now we need some notation: = P X > Y µ = = E E I X > Y X γ = E E I X > Y X 3 = E E I X > Y Y γ = E E I X > Y Y 3.

ScienceAsia 40 04 83 Consider a two-sample U-statistic given by U mn = m n m i= j= m hx i Y j where hx i Y j = I X i > Y j. Observe that EhX Y = 0 Eh X Y = µ and that l µ for all l =. Also we note that U mn /m /n = W mn mn mn. nm Hence we can use the bounds for U mn of Chen and Shao 8 to derive the bounds for Mann-Whitney s statistics as the next statement. Theorem Ref. 8 sup P W mn mn mn n m x Φx x R µ /m /n /m /n 6.6 γ /m γ /n /m /n3/ and there exists a constant C which does not depend on m and n such that for all x R P W mn mn mn n m x Φx 9 µ /m /n x /m /n 3.5 µ /m /n e x/3 /m /n C γ /m γ /n x 3 /m. /n3/ In particular if = / and = = / then the constant 7 in Theorem can be replaced by 3. Mann-Whitney s and Wilcoxon s statistics are usually used to compare two treatments in many comparative experiments. However these statistics cannot be used when the sample sizes of observation from two treatments are not measured 9 0. For example the number of normal and abnormal images in the medical imaging studies and diagnostic marker studies. In 0 under the situation that the sample sizes are random variables Tang and Balakrishnan 0 proposed a random-sum Wilcoxon s statistic as the following: Let X i i= and Y j j= be two sequences of i.i.d. random variables such that the X i s and Y j s are independent. Assume that the X i s and Y j s are continuous random variables. Let K and K be independent positive integer-valued random variables which are independent of the X i s and Y j s. A random-sum Wilcoxon s statistic is defined by K K W KK = I X i > Y j. i= j= Tang and Balakrishnan 0 also obtained normal approximation theorems for random-sum Wilcoxon s statistics as the next statement. Theorem 3 Ref. 0 For each l = let K l be a positive integer-valued random variable depending on a parameter τ l. Suppose that C the random variable K l has its asymptotic distribution as the normal distribution with mean EK l and variance VarK l as τ l ; C EK /EK η for some finite non-zero constant η as τ l ; C3 VarK l /EK l δ l for some finite constant δ l as τ l ; C4 EK l p /τ l λ l for some finite non-zero constant λ l and for some p as τ l. Then P τ 3/p W KK EK EK x Φx Var τ 3/p W KK tends to zero where Var τ 3/p W KK λ 3 η λ 3 η δ λ 3 η δ λ 3 η as τ and τ. MAIN RESULTS The main results of this article are Berry-Esseen bounds for random-sum Wilcoxon s statistics. In Theorem 4 we consider a random-sum Wilcoxon s statistic with random centring and random norming as in the form W KK U /V where U and V are random variables. In Theorem 5 we consider a random-sum Wilcoxon s statistic with non-random centring and non-random norming as in the form W KK a /b where a and b are constants.

84 ScienceAsia 40 04 Theorem 4 Let K and K be positive integer-valued random variables such that the X i s Y j s K K are.9 γ n/m γ n/m 3/ n independent. We have the following results: m/n3/ and there exists an absolute constant C which does sup x R P W KK K K not depend on m and n such that for all x R K K K x Φx K. VarK EK VarK P W mn mn mn n m x Φx EK EK EK EK n/m 3. EK µ /EK C µ m/n EK /EK x n m/n EK EK /EK C µ n/m m/n.9 γ EK /EK γ EK /EK 3/ x 3 n m/n EK EK /EK 3/ C γ n/m γ n/m 3/ x 3 n and there exists a constant C such that for all x R. m/n3/ It is understood that Wilcoxon s statistics can P W KK K K be approximated by the sum of independent random K K K x Φx K variables introduced by Hajek. Now we denote the C µ VarK EK x EK EK VarK projection of W KK on the X i s Y j s K K by EK EK K EK C µ /EK Ŵ KK := E W KK EK EK X i i= EK /EK K x EK EK /EK E W KK C EK µ /EK EK EK Y j j= EK /EK x 3 E W KK EK EK EK EK K K /EK C γ EK /EK γ EK /EK 3/ K = EK E I X i > Y j X i x 3 EK EK /EK. 3/ i= K EK E I X i > Y j Y j Remark If K and K satisfy the conditions C j= C3 and C4 in Theorem 3 then the rate of convergence is Oτ /p for some p when τ and τ tend to infinity. By definition random-sum Wilcoxon s statistics include Mann-Whitney s statistics in the special case of fixed-indices K = m and K = n. In this case our results are same as Theorem the bounds of Chen and Shao 8. But the constants in a uniform bound of Theorem are smaller. The next corollary is an immediately consequence of Theorem 4. Corollary sup x R P W mn mn mn n m x Φx 3. µ n/m m/n n m/n K K EK EK. By taking the conditional expectation given by K and K see p. of Ref. we observe that VarŴK K = E Var Ŵ KK K K Var E Ŵ KK K K Set = EK EK EK EK EK VarK EK VarK VarK VarK. Var ŴK K := VarŴK K VarK VarK.

ScienceAsia 40 04 85 Similarly we can see that VarW KK = E VarW KK K K Var E W KK K K = E µk K E K K K E K K K VarK K = µek EK EK EK VarK EK EK EK VarK EK EK VarK EK VarK VarK VarK. This implies that VarW KK /Var ŴK K when EK and EK tend to infinity. Hence the quantity VarW KK for normalization in the normal approximation theorem can be replaced by Var ŴK K. In the next statement we show a uniform bound for random-sum Wilcoxon s statistics with nonrandom centring EK EK and non-random norming Var ŴK K. Theorem 5 Let ξ... ξ m be i.i.d. positive integervalued random variables with Eξ = a Varξ = b and E ξ 3 <. Let ζ... ζ n be i.i.d. positive integer-valued random variables satisfying that Eζ = a Varζ = b and E ζ 3 <. Suppose the X i s Y j s ξ k s ζ l s are independent. Put K = m j= ξ j K = n l= ζ l. Then sup x R P W K K EK EK x Φx Var ŴK K. a b τ b n a a 3. τ/a µ a /a τ n a τ/a.9 γ / a τ γ a τ 3/ /a n a τ/a 3/ b 6.E ξ 3 n a τ Eξ 3/ n a τ 3b a a b n where τ = m/n. a b a b b n a τ 6E ξ 3 a b a m a b a a b a τ b n a a n b a 4 b ma a ma a 6.E ζ 3 n b 3 b a τ Remark The random variables K = m j= ξ j and K = n l= ζ l depend on the parameters m and n respectively. It is easy to see that K K satisfy the conditions C C3 and C4 in Theorem 3. Moreover if m/n converges to some constant when m and n tend to infinity then K K satisfy the condition C. Hence the rate of convergence is O n /. APPLICATION IN LROC ANALYSIS A location receiver operating characteristic LROC curve introduced by Starr Metz Lusted and Goodenoughet 3 provides a useful method in radiology based on a location of data. And the area under the LROC curve has been widely used to measure the diagnostic accuracy of imaging study because it represents the probability that positive and negative case are correctly classified. By using an empirical estimation model Tang and Balakrishnan 0 showed that the area under the LROC curve is equivalent to the random-sum Wilcoxon s statistic as the following: Let K be a binomial distributed random variables with parameters m and p. Suppose that the X i s Y j s and K are all independent. An estimator for the area under the LROC curve is given by A LROC = K n I X i Y j. m n i= j= Tang and Balakrishnan 0 also obtained the asymptotic distribution of A LROC as the following statement. Theorem 6 Ref. 0 Let n be any fixed-index and K be a binomial distributed random variable with parameters m and p. Suppose that m/n τ for some constant τ > 0 as m and n. Under

86 ScienceAsia 40 04 the null hypothesis that the distributions of X and Y are equal we have n P ALROC p/ m p 4n 3np mp / x Φx tends to zero as m and n. In the next theorem we investigate Berry-Esseen bounds for A LROC. Theorem 7 Let n be any fixed-index and K be a binomial distributed random variable with parameters m and p. Set τ = m/n. Suppose that the X i s Y j s and K are independent. Under the null hypothesis that the distributions of X and Y are equal we have the following results: sup x R P A LROC K/m mn nk n K / x Φx. n p pτ 5.4 n /pτ pτ pτ 5.5 n /pτ /pτ 3/ pτ 3/ and there exists an absolute constant C which does not depend on m n and p such that for all x R P A LROC K/m mn nk n K / x Φx /pτ C p pτ x n pτ pτ C /pτ pτ x 3 n pτ 3/ /pτ /pτ. pτ 3/ Proof : Under the assumption that the distributions of X and Y are equal Alberink 4 showed that = / = = / and γ = γ = /3. From these facts the bounds in Theorem 4 can be simplified to Theorem 7. PROOF OF MAIN RESULTS In the proof of Theorem 4 we apply Theorem Berry-Esseen theorem for Mann-Whitney s statistics to investigate the bounds for the randomly-indexed summands. In the proof of Theorem 5 we obtain the bounds for random summands with non-random centring and non-random norming by improving the arguments of Chen Goldstein and Shao see p. 7 of Ref. 4 for the case of two random indices. Proof of Theorem 4: Uniform bound. By using the conditional probability given by K K and applying the uniform bound of Theorem we can see that P W KK K K K K K x Φx K P K EK > 0.09EK P K = k 0.9EK k.0ek P W kk k K k K k K x Φx P K EK > 0.09EK P K EK > 0.09EK P K = k P K = k 0.9EK k.09ek 0.9EK k.09ek P W kk k k k k k x Φx k VarK VarK. EK EK P K = k P K = k 0.9EK k.09ek 0.9EK k.09ek k k µ /k k /k k /k 6.6 γ k /k γ k /k 3/ k /k 3/. VarK EK VarK EK EK EK EK 3. EK µ /EK EK /EK EK EK /EK.9 γ EK /EK γ EK /EK 3/ EK EK /EK 3/ where we used the facts that for l = 0.9EK l k l.09ek l

ScienceAsia 40 04 87 and that EK /EK 09 9 k /k in the last inequality. Non-uniform bound. For all k k N put p k = P K = k P K = k where k = k k and set A := k k N N. 0.5EKk.5EK 0.5EK k.5ek From it suffices to investigate the non-uniform bound for x >. Without loss of generality we may assume that x >. By using the non-uniform bound of Theorem and taking the conditional probability given by K and K we have P W KK K K K K K K k / A p k P C µ x W kk k k k k k k k A k A p k x Φx x Φx k /k k /k k k /k C µ k /k k /k x 3 p k k k /k C γ k /k γ k /k 3/ x 3 p k k k /k 3/ k / A p k P Set where k A W kk k k k k k k x Φx EK C µ /EK EK /EK x EK EK /EK C EK µ /EK EK /EK x 3 EK EK /EK C γ EK /EK γ EK /EK 3/ x 3 EK EK /EK. 3/ k k Ŵ k := g X i g Y j i= j= g X i := E W kk k k X i = k E I X > Y X = X i and g Y j := E W kk k k Y j = k E I X > Y Y = Y j. Note that EŴk = 0 and Set VarŴk = k k k k. k := W kk k k Ŵk. Hence we can check that k k k = I X j Y j g X i g Y j k k and that i= j= E k = k k µ. Observe that P W kk k k k k k x Φx k Ŵ P k k k k > x k 3 k P k k k > x k 3 Φx C E Ŵk x E k k k k k C µ x k k. 3 From 3 and the fact that l µ for l = we can see that W kk p k P k k k k k / A k x Φx k C x P K = k µ k k EK >0.5EK C x P K = k µ k k EK >0.5EK Cµ x P K EK > 0.5EK Cµ x P K EK > 0.5EK

88 ScienceAsia 40 04 C µ VarK x EK EK EK EK VarK EK. 4 Hence combining and 4 one can prove the nonuniform bound of Theorem 4. Proof of Theorem 5: Let Z Z and Z 3 be independent standard normal random variables which are independent of the X i s Y j s K K. Set T := W K K EK EK Var ŴK K T := K K EK EK Z Q Var ŴK K T := K EKEKZK VarKZ q Var ŴK K T 3 := Z3EK VarKZ EK VarKZ q Var ŴK K where Q := K K K K q := EK EK EK EK. It is clear that T 3 has the standard normal distribution and easy to see that P W K K EK EK x Φx Var ŴK K P T x P T x P T x P T x P T x P T 3 x = P WKK K K y Φy Q P K EK VarK Z Q q K VarK y Φy P K EK VarK ZK EK VarK EK VarK y 3 Φy 3 =: R R R 3 5 where y y and y 3 are given by y := x Var ŴK K K K EK EK Q y := x Var ŴK K K EK EK Z q K VarK y 3 := x VarŴr Z EK VarK Z q. EK VarK Firstly following the arguments of in the proof of Theorem 4 we can see that R. VarK EK VarK EK EK EK EK 3. EK µ /EK EK /EK EK EK /EK.9 γ EK /EK γ EK /EK 3/ EK EK /EK 3/ =. a b τ b n a a 3. τ/a µ a /a τ n a τ/a.9 n γ / a τ γ a τ 3/ /a a τ/a 3/. 6 Secondly using the conditional probability given by K and Z R P K EK > 0.5EK P K = k where and 0.5EK k.5ek P S y Φy 7 S = K EK VarK Z k K k K q = k. VarK For each i =... m let Z k ɛ i k ɛ i q i = k VarK

ScienceAsia 40 04 89 where ɛ i = K ξ i a. Note that E S E S k K k K q qk VarK E Sk K EKkk EKEK qk VarK E Sk EKEKEKE K EK EK qk VarK E SkK EK k K EK EK qk VarK E Sk EKE K E K EKEK qk VarK k VarK k k EK EK k EK EK VarK k EK EK EK EK VarK k EK EK VarK k E K EK 3 k EK VarK k EK EK VarK k EK E K EK VarK k EK EK VarK k a n a m k EK a a nb k EK a b a k b k a b 6E ξ 3 a b a mn a n k EK a m b a 8 k b a n k a where we used the fact that Rosenthal s inequality E K EK 3 m 8 i= E ξ i 3 m 3/ i= Eξ i in the last inequality. Also we can see that E ξ i a i ξi a k E ξ i a k K ɛ i k VarK k ɛ i k ɛ i ξ i a E ɛ i VarK ξ i a ɛ i ξ i a E k ɛ i VarK ξ i a E ɛ i VarK ξ i a E k VarK ξ i a k ɛ i VarK b b m a m a m k m b k m where we use the fact that Eɛ i 4b a m 9 a m P ɛ i EK > 0.5EK I ɛi EK 0.5EK E 4VarK EK EK = 4b a m a m in the last inequality. By applying the result of p. 58 of Ref. 4 the uniform bound for nonlinear statistics we have that ɛ i sup P S z Φz z R 6.E ξ 3 m Eξ E S 3/ m i= Hence combining 7 0 we obtain E ξ i a i b m. 0 R P K EK 0.5EK P K = k 0.5EK k.5ek P S y Φy VarK 6.E ξ 3 EK m Eξ 3/ P K = k 0.5EK k.5ek k a n a m k EK a a nb a k EK a b k b k a b 6E ξ 3 a b a mn a n

90 ScienceAsia 40 04 k EK a m b a k b a n k a b a m a m b k k 4b a m a m VarK 6.E ξ 3 EK m Eξ 3/ EK a n a m VarK a a nb VarK a b a b EK a b EK and set 6E ξ 3 a b a nm a n a m VarK b a m b EK a b b a nek a EK a m EK = b 6.E ξ 3 n a τ Eξ 3/ n n n n n Finally we denote 4b a m a m a a τ 3b a b b a a b b n a τ 6E ξ 3 a b a m a a b a b a τ b a a n 4 b. b a a ma a ma S = K EK VarK Λ = K EK VarK Z. EK VarK For each j =... n let Λ j = K ζ j a EK VarK Z. EK VarK Note that ES E K EK VarK E S Λ E Z EK VarK and that b a m E ζ j a Λ Λ j E ζ j a VarK E Z EK VarK b b a mn. By using the conditional probability given by Z Z and applying the result of p. 58 of Ref. 4 we have R 3 = P S Λ y 3 Φy 3 n 6.E ζ 3 E S neζ 3/ Λ j= E ζ j a Λ Λ j b n 6.E ζ 3 n Eζ b 3/ a m = 6.E ζ 3 n Eζ b. 3/ a τ We combine 5 6 and to obtain the main result. Acknowledgements: The authors thank Professor Kritsana Neammanee for helpful discussions and the Centre of Excellence in Mathematics CHE for financial support. REFERENCES. Wilcoxon F 945 Individual comparisons by ranking methods. Biometrics 80 3.. Mann HB Whitney DR 947 On a test of whether one of two random variables is stochastically larger than the other. Ann Math Stat 8 50 60. 3. Pestman WR 998 Mathematical Statistics: An Introduction De Gruyter. 4. Alberink IB 000 A Berry-Esseen bound for U- statistics in the non-i.i.d. case. J Theor Probab 3 59 33. 5. Hoeffding W 948 A class of statistics with asymptotically normal distribution. Ann Math Stat 9 93 35.

ScienceAsia 40 04 9 6. Lehmann EL 95 Consistency and unbiasedness of certain nonparametric tests. Ann Math Stat 65 79. 7. Grams WF Serfling RJ 973 Convergence rates for U-statistics and related statistics. Ann Probab 53 60. 8. Chen LHY Shao QM 007 Normal approximation for nonlinear statistics using a concentration inequality approach. Bernoulli 3 58 99. 9. Chakraborty DP 006 A search model and figure of merit for observer data acquired according to the freeresponse paradigm. Phys Med Biol 5 3449 6. 0. Tang LL Balakrishnan N 0 A random-sum Wilcoxon statistic and its applications to analysis of ROC and LROC data. J Stat Plann Infer 4 335 44.. Hajek J 968 Asymptotic normality of simple linear rank statistics under alternatives. Ann Math Stat 39 35 46.. Ross S 995 Stochastic Processes nd edn Wiley New York. 3. Starr SJ Metz CE Lust edn LB Goodenough DJ 975 Visual detection and localization of radiographic images. Radiology 6 533 8. 4. Chen LHY Goldstein L Shao QM 00 Normal Approximation by Stein s Method Springer New York.