T 2 Type Test Statistic and Simultaneous Confidence Intervals for Sub-mean Vectors in k-sample Problem

Size: px

Start display at page:

Download "T 2 Type Test Statistic and Simultaneous Confidence Intervals for Sub-mean Vectors in k-sample Problem"

Godfrey Clark
6 years ago
Views:

1 T Type Test Statistic and Simultaneous Confidence Intervals for Sub-mean Vectors in k-sample Problem Toshiki aito a, Tamae Kawasaki b and Takashi Seo b a Department of Applied Mathematics, Graduate School of Science, Tokyo University of Science, Tokyo, Japan; b Department of Applied Mathematics, Faculty of Science, Tokyo University of Science, Tokyo, Japan Abstract In this paper, we consider tests for sub-mean vectors. In the one-sample and two-sample problems, we give the T type test statistic and the simultaneous confidence intervals by using the approximate upper percentiles of T type test statistic. In the k-sample problem, we give the simultaneous confidence intervals for pairwise multiple comparisons by using Bonferroni s approximation. Finally, we investigate the asymptotic behavior of the approximate upper percentiles of T type statistic by Monte Carlo simulation, and we give an example to illustrate the simultaneous confidence intervals. AMS 000 Mathematics Subject Classification. 6H0, 6H5. Key Words and Phrases: Bonferroni s approximation, Monte Carlo simulation, pairwise multiple comparisons, T max statistic, type I error Introduction Consider the tests of mean when the partial mean vector is given. That is, letting µ = (µ, µ be a mean vector, where µ = (µ, µ,..., µ r and µ = (µ r+, µ r+,..., µ p, the problem is the testing of µ when µ is given. For the problem of sub-mean vectors, Eaton and Kariya (983 derived tests for the independence of two normally distributed sub-mean vectors for the case that an additional random sample is available. Provost (990 obtained explicit expressions for the case that the maximum likelihood estimators (MLEs of all the parameters of the multi-normal random vector are given, and the likelihood ratio statistic for testing the independence between sub-mean vectors has been obtained. For the onesample problem, Rao (949 gave Rao s U-statistic and additional information. The null distribution of Rao s U-statistic has been introduced by Siotani et al. (985. A test for sub-mean vectors with two-step monotone missing data was discussed by Kawasaki and Seo (06. A test for sub-mean vectors in two-sample problem was introduced by Rencher (0. For the k-sample problem, Fujikoshi et al. (00 gave an asymptotic expansion of the distribution of the generalized U-statistic under normality. Gupta et al. (006 gave an asymptotic expansion of the distribution of the generalized U-statistic under

2 a general condition. However, the problem for sub-mean vectors in terms of simultaneous confidence intervals does not appear to have been discussed. In this paper, we give the T type test statistic and derive the simultaneous confidence intervals for sub-mean vectors in the one-sample and two-sample problems. For the k-sample problem, we give the simultaneous confidence intervals for pairwise multiple comparisons for sub-mean vectors. The remainder of this paper is organized. In Section, we propose the T type test statistic for the one-sample case (T test statistic, and its approximate upper percentiles and simultaneous confidence intervals. In Section 3, we propose the T type test statistic for the two-sample case (T test statistic, and its approximate upper percentiles and simultaneous confidence intervals. In Section 4, we present simultaneous confidence intervals for multiple comparisons in the k sample problem. In order to obtain the simultaneous confidence intervals, we derive the approximate upper percentiles of the T max statistic by Bonferroni s approximation. In Section 5, we investigate the asymptotic behavior of the approximate upper percentiles of the T test statistic and the T max statistic by Monte Carlo simulation. In Section 6, we give an example to illustrate simultaneous confidence intervals. One-sample problem. T type test statistic for a sub-mean vector Let x, x,..., x i.i.d. p (µ, Σ. In this section, we consider the following hypothesis H 0 : µ = µ 0 given µ = µ 0 vs. H : µ µ 0 given µ = µ 0, where µ 0 = (µ 0,, µ 0,,..., µ 0,p = (µ 0, µ 0 are given values, and µ = p ( µ µ, µ = r µ µ. µ r, µ = s µ r+ µ r+. µ p, Σ = Σ Σ. Σ Σ We partition x j into a r random vector, a s random vector as x j = (x j, x j, where x j : r, x j : s, p = r + s, j =,,...,. This hypothesis test is then the same as the following hypothesis: H 0 : µ = µ 0 vs. H : µ µ 0,

3 where µ = µ Σ Σ µ, µ 0 = µ 0 Σ Σ µ 0. We then derive the MLEs of µ and Σ. Employing the derivation of Siotani et al. (985, we use the transformed parameters as follows: ( η µ Σ η = = Σ µ 0 Ψ Ψ, Ψ = Σ Σ Σ = η µ 0 Ψ Ψ Σ Σ, Σ where Σ = Σ Σ Σ Σ. We note that (η, Ψ is in one-to-one correspondence with (µ, Σ. Using the transformed parameters (η, Ψ, the likelihood function is given by L(η, Ψ =(π p Ψ Ψ exp (x j Ψ x j η Ψ (x j Ψ x j η exp (x j µ 0 Ψ (x j µ 0. Therefore, the logarithm of L(η, Ψ can be expressed as log L(η, Ψ = p log(π log Ψ log Ψ (x j Ψ x j η Ψ (x j Ψ x j η (x j µ 0 Ψ (x j µ 0. Differentiating log L(η, Ψ with respect to η and Ψ, we have log L(η, Ψ η log L(η, Ψ Ψ log L(η, Ψ Ψ = log L(η, Ψ Ψ = Ψ (x j Ψ x j η, = Ψ + Ψ (x j Ψ x j η (x j Ψ x j η Ψ, Ψ { (xj x (x j x Ψ (x j x (x j x }, = Ψ + Ψ (x j µ 0 (x j µ 0 Ψ. Solving the partial derivative of log L(η, Ψ = 0, the MLEs of η and Ψ are given by η = x Ψ x, Ψ = V, Ψ = V V Ψ, Ψ = V, where x = x j = ( x x, V = (x j x(x j x V V =, V = V V V V V V. 3

4 Therefore, using the relation that (η, Ψ is in one-to-one correspondence with (µ, Σ, the MLEs of µ and Σ are given by ( ( µ x µ = = Σ Σ (x µ 0, µ µ 0 ( Σ Σ Σ = = V + Σ Σ Σ V V Σ Σ Σ V V V The T test statistic for sub-mean vectors is then given by Σ. T = ( µ µ 0 { Ĉov( µ µ 0 } ( µ µ 0, where µ, Ĉov( µ µ 0 are estimators of µ, Cov( µ µ 0. We note that under H 0, T is approximately distributed as a F distribution with r and p degrees of freedom. Using this result, the approximate upper 00α percentile of the T test statistic is given by t app(α = ( r p F r, p(α, where F r, p (α is the upper 00α percentile of the F distribution with r and p degrees of freedom.. Simultaneous confidence intervals for sub-mean vectors We consider the simultaneous confidence intervals for any and all linear compounds of the sub-mean. Using the approximate upper percentiles of T from Section., for any nonnull vector a = (a, a,, a p, the simultaneous approximate confidence intervals for a (µ µ 0 are given by a u where c R app a (µ µ 0 a u + c R app, a R r {0}, u = x µ 0 V V (x µ 0, R app = (t app(αa V a (, c = s. 3 Two-sample problem 3. T type test statistic for sub-mean vectors Let x, x,, x i.i.d. p (µ, Σ, i =,, = ( + (. In this section, we consider the following hypothesis H 0 : µ ( = µ ( given µ ( = µ ( vs. H : µ ( µ ( given µ ( = µ (, 4

5 where We partition x j µ = p µ µ, µ = r µ µ µ r., µ = s µ r+ µ r+. µ p Σ Σ, Σ =. Σ Σ into a r random vector and a s random vector, as x j = (x j, x j, where j =,,...,, and p = r + s. This hypothesis test is then the same as the following hypothesis H 0 : µ ( = µ( vs. H : µ ( µ(, where µ = µ Σ Σ µ, µ = µ ( = µ (. We derive the MLEs of µ and Σ as follows. As for the one-sample case, we use the following transformed parameters (η, Ψ η η = = η ( µ Σ Σ µ, Ψ = µ Ψ Ψ = Ψ Ψ Σ Σ Σ Σ Σ, Σ where Σ = Σ Σ Σ Σ, i =,. We note that (η, Ψ is in one-to-one correspondence with (µ, Σ. Using the transformed parameters (η, Ψ, the likelihood function is given by L(η (, η (, Ψ =(π p Ψ Ψ exp exp i= i= (x j Ψ x j η Ψ (x j Ψ x j η (x j η Ψ (x Therefore, the logarithm of L(η (, η (, Ψ can be expressed as j η. log L(η (, η (, Ψ = p log(π log Ψ log Ψ i= (x j Ψ x j η Ψ (x j Ψ x j η (x j η Ψ (x j η. i= 5

6 Differentiating log L(η (, η (, Ψ with respect to η, Ψ, we have log L(η (, η (, Ψ η log L(η (, η (, Ψ η log L(η (, η (, Ψ Ψ = Ψ + i= log L(η (, η (, Ψ Ψ = log L(η (, η (, Ψ Ψ = Ψ (x j Ψ x j η, = Ψ (x j η, i= Ψ (x j Ψ x j η (x j Ψ x j η Ψ, i= = Ψ + { } Ψ (x j η x j Ψ x x, i= Ψ (x j η (x j η Ψ. Solving the partial derivative of log L(η (, η (, Ψ = 0, the maximum likelihood estimates of η (, η (, and Ψ are given by η = x Ψ x, i =,, η = x, Ψ = V, Ψ = V V Ψ, { Ψ = V + (x (j x (x (j x }, where x = V = V = (x j x j = ( x x, x = x (x j x = V V V = V V i= i= x, ( V V V V, V = V V V V., Therefore, using the relation that (η (, η (, Ψ is in one-to-one correspondence with (µ (, µ (, Σ, the MLEs of µ (, µ (, and Σ are given by ( µ µ x = = Σ Σ (x x, i =,, µ x ( Σ Σ V + Σ Σ Σ V V Σ = = { Σ Σ Σ V V V + i= Σ (x x (x } x, 6

7 The T test statistic for the sub-mean vector is then given by T = ( µ ( µ( { Ĉov µ ( µ( ( } ( µ µ(, where µ and Ĉov µ ( µ( are the estimators of µ and Cov µ ( µ(, respectively. We note that under H 0, T is asymptotically distributed as a χ distribution with r degrees of freedom. However, when the sample is not large, the χ distribution is not a good approximation of the upper percentile of T. Let We can then rewrite T as u = x ( x ( V V (x( x (, c ( 3 = ( ( ( s 3. T = ( c u V u = ( z W z = ( z z z W z z, z where z = c Σ u, W = Σ V Σ. We note that u is distributed on r (µ ( µ(, c Σ when (, (. Therefore, the distribution of z z is a χ distribution with r degrees of freedom. We note that under H 0, ( p s T /( r is approximately distributed as a F distribution with r and p s degrees of freedom. Using this result, the approximate upper 00α percentile of the T statistic is given by t app(α = ( r p s F r, p s (α, where F r, p s (α is the upper 00α percentiles of the F distribution with r and p s degrees of freedom. 3. Simultaneous confidence intervals We consider the simultaneous confidence intervals for any and all linear compounds of the sub-mean. Using the upper percentiles of T from Section 3., for any nonnull vector a = (a, a,, a p, the simultaneous confidence intervals for a (µ ( µ( are given by a c u M a (µ ( µ( a u + c M, a Rr {0}, 7

8 where M = (t (αa V a, and t (α is the upper 00α percentiles of the T test statistic. However, it is not easy to obtain t (α. Therefore, using the approximate upper 00α percentiles of the T test statistic, t app(α, the approximate simultaneous confidence intervals for a (µ ( a c u M app a (µ ( µ( a u + where M app = (t app(αa V a. 4 k-sample problem 4. The T max type test statistic µ( c M app, a R r {0}, can obtained by In this section, we consider the T max type test statistic for testing any two sub-mean vectors and propose the approximate upper 00α percentiles of this statistic with Bonferroni s approximation. Let x, x,, x i.i.d. p (µ, Σ for i =,,, k, = k i=. Gupta et al. (006 has derived the likelihood ratio test (LRT statistic for testing the hypothesis H 0 : µ ( = µ ( = = µ (k given µ ( = µ ( = = µ (k vs. H : at least two µ s are an equal given µ( = µ ( = = µ (k. When H 0 is rejected, our interest is pairwise comparisons of sub-mean vectors. Under the assumption of the common population covariance matrix, for fixed a; b, we can use the T type statistic for the two-sample problem derived in Section3., that is, T ab = ( µ µ(j { Ĉov( µ µ(j } ( µ µ(j, where µ (= µ Σ Σ µ and Ĉov( µ µ(j are estimators of µ respectively. Similarly, as for the two-sample case, the MLEs are given by µ = ( ( µ x = Σ Σ µ x ( Σ Σ Σ = = Σ Σ (x x, V + Σ Σ Σ V V Σ V V { k V + i= Σ (x x (x and Cov( µ µ(j, x }, 8

9 where x = V = V = (x j x j = x, x x = x (x j x = k V V V = V V i= k i= x, ( V V V V, V = V V V V., Similarly, as Section 3., under the hypothesis that the two sub-mean vectors are equal, we have the following approximate upper 00α percentile of the Tab test statistic for fixed i = a, b: t ab app(α = ( kr r k(s + + F r, r k(s++(α, where F r, r k(s++ (α is the upper 00α percentile of the F distribution with r and r k(s++ degrees of freedom. Using the test statistic, the T type test statistic for H 0 : µ (a = µ (b for all a, b, a < b k given µ ( = µ ( = = µ (k vs. H : H 0 given µ ( = µ ( = = µ (k, is given by T max = max a<b k T ab. Generally, it is not easy to obtain the upper percentile of the T max statistic. Therefore, in this section, we adopt Bonferroni s approximation, which is one of the solutions to this problem. We first consider the T ab test statistic. We consider the upper 00α percentile of the T max statistic t max(α when ( = ( = = (k. An approximate upper percentile of T max, t BO(α, is a solution of the equation given by a<b k Pr(T ab > t BO(α = α. Using Bonferroni s approximation, t BO(α is given by t BO(α = t ab(α, 9

10 where α = α/k(k, t ab (α is the upper 00α percentiles of the Tab test statistic. However, it is not easy to obtain t ab (α. Using t ab app(α, the approximate upper 00α percentile of the T max statistic is given by t B app(α = ( kr k(s + r + F r, k(s+ r+(α, where the F r, k(s+ r+ (α is the upper approximate 00α percentile of F distribution with r and r k(s + + degrees of freedom. 4. Simultaneous confidence intervals for multiple comparisons among submean vectors In this section, we consider simultaneous confidence intervals for pairwise multiple comparisons among sub-mean vectors. Using the approximate upper percentile of T max, for any nonnull vector a = (a, a,, a p, the approximate simultaneous confidence intervals for a (µ a cab u ab k L app a (µ (a µ(b a u ab + µ(j, a < b k are given by cab k L app, a R r {0}, a < b k, where L app = ( t B app(αa V a, c ab = ( (a + (b ( k (a (b ( k s. 5 Simulation studies In this section, we perform a Monte Carlo simulation (with 0 6 runs in order to evaluate the asymptotic behavior of the F approximations and the accuracy of the approximate upper 00α percentiles of the T and T max test statistic. Tables a and b present the simulated upper 00α percentile of the T test statistic and the approximate upper 00α percentile of the T test statistic for the two-sample problem; (p, r, s = (4,, 3, (4,,, (4, 3,, (8,, 6, (8, 4, 4, (8, 6, ; α = 0.05, 0.0; and for the following two cases of ( (, ( : ( (, ( = { (l, l, l = 0, 40, 00, 00, 400 (l, l, l = 0, 40, 00, 00. Tables a and b present the type I errors for the upper 00α percentile of the χ distribution with r degrees of freedom and the approximate upper 00α percentile of the T test statistic given by α = Pr(T > χ r(α, α = Pr(T > t app(α, 0

11 respectively. Table gives the approximate upper 00α percentile of the Tmax statistic, t BO(α, and the approximate upper 00α percentile of the Tmax statistic, t B app(α, for the k-sample problem; k = 3, 4, 5, 6, 0; (p, r, s = (4,, ; α = 0.05, 0.0; (k = l, where l = 0, 40, 00, 00, 400. Table presents the type I error for the approximate upper 00α percentile of the Tmax statistic: α 3 = Pr(T max > t B app(α, where α = α/k(k. It may be noted from Tables a and b that the simulated values approach closer to the upper percentile of the χ distribution when both of the sample sizes ( and ( become large. In addition, it can be seen from both tables that the proposed approximation value is good for all cases even when the sample size is small. The results for the type I error of the proposed approximation value are closer than those of the χ value for all cases. In Table, the simulated values also approach the upper 00α percentile of the χ distribution when both of the sample sizes ( and ( become large. The results for the type I error of the proposed approximation value are always lower than α for all cases. That is, it is conservative in terms of the type I error when k is large. Therefore, in terms of the type I error, it can be concluded that the proposed approximation values are more accurate than the simulated values for all cases. In addition, we perform a Monte Carlo simulation (with 0 6 runs in order to evaluate the powers of the T test statistic. Table 3a presents the powers of the T test statistic for the two-sample problem; (p, r, s = (4,, ; α = 0.05, 0.0; δ = µ ( µ ( = 0, 0.,..., ; and for the following cases of ( (, ( : ( = ( = l = 0, 40, 50. Table 3b presents the powers of the T test statistic for the two-sample problem; r =, (p, r, s = (4,,, (5,, 3, (6,, 4, (0,, 8; α = 0.05, 0.0; δ = µ ( µ ( = 0, 0.,..., ; ( (, ( = (50, 50. Tables 3a and 3b present the powers of the T test statistic: β = Pr(T > χ r(α H, β = Pr(T > t app(α H, respectively. It may be noted from Table 3a that the power takes conservative values when both of the sample sizes ( and ( become small. In addition, it may be noted from Table 3b that the power takes values that are a little conservative when s become small. The reasons are that the power may depend on the noncentrality parameters only through θ = δ Σ δ /c, where δ = µ ( µ (.

12 6 umerical example In this section, we discuss an example to illustrate the approximation of t B app(α by comparing each of the simultaneous confidence intervals in Section 4.. In this example, we utilize the data in the iris plant taken from Fisher (936. Data are presented on three species of iris, setosa, versicolor, and virginica, comprising four different measurements: x : petal width, x : petal length, x 3 : sepal width, and x 4 : sepal length. The population mean vectors are µ = (µ, µ = (µ, µ, µ 3, µ 4, where µ : mean of petal width, µ : mean of petal length, µ 3 : mean of sepal width, µ 4 : mean of sepal length, µ = (µ, µ, and µ = (µ 3, µ 4. We assume that these data are distributed normality, and µ = µ ( = µ ( = µ (3. We consider the simultaneous approximate confidence intervals of pairwise comparisons for testing the sub-mean vectors hypothesis: H 0 : µ (a = µ (b for all a, b, a < b 3 given µ ( = µ ( = µ (3 vs. H : H 0 given µ ( = µ ( = µ (3. Table 4 presents the simultaneous confidence intervals obtained by using t max(α, t BO(α, and t B app(α. For example, let a = (0, and α = 0.05; then the simultaneous confidence intervals obtained by using t max(0.05, t BO(0.05, and t B app(0.05 for pairwise comparisons are constructed as a (µ ( µ( obtained by using t max(0.05 : (.74,.066, t BO(0.05 : (.75,.065, t B app(0.05 : (.77,.063, respectively. These simultaneous confidence intervals present the petal length between the score of the setosa population and the score of the versicolor population. The simultaneous confidence intervals obtained by using the approximate upper 00α percentile, t B app(0.05, are a little longer than those obtained by using t BO(0.05 and t max(0.05. The remaining results are the same. Therefore, it can be concluded that the simultaneous confidence intervals obtained by using t B app(α are useful.

13 Table a t (α and t app(α when p = 4 Sample Size α = 0.05 α = 0.0 ( ( t (α t app(α α α t (α t app(α α α (r, s = (, 3, χ (0.05 = 3.84, χ (0.0 = (r, s = (,, χ (0.05 = 5.99, χ (0.0 = (r, s = (3,, χ 3(0.05 = 7.8, χ 3(0.0 =

14 Table b t (α and t app(α when p = 8 Sample Size α = 0.05 α = 0.0 ( ( t (α t app(α α α t (α t app(α α α (r, s = (, 6, χ (0.05 = 5.99, χ (0.0 = (r, s = (4, 4, χ 4(0.05 = 9.49, χ 4(0.0 = (r, s = (6,, χ 6(0.05 =.59, χ 6(0.0 =

15 Table t BO(α and t B app(α when (p, r, s = (4,, Sample Size α = 0.05 α = 0.0 t BO(α t B app(α α 3 t BO(α t B app(α α 3 k = k = k = k = k = ote : χ = 8.9, χ =.4, χ = 9.57, χ =.79, χ = 0.60, χ = 3.8, χ =.4, χ = 4.63, χ = 3.60, χ =

16 Table 3a Power for T test statistic when (p, r, s = (4,, α = 0.05 α = 0.0 δ ( (, ( β β β β 0 (0, (40, (50, ote : δ = µ ( µ (, β = Pr(T > χ r(α H, β = Pr(T > t app(α H 6

17 Table 3b Power for T test statistic when ( (, ( = (50, 50, r = α = 0.05 α = 0.0 δ (p, r, s β β β β 0 (4,, (5,, (6,, (0,, ote : δ = µ ( µ (, β = Pr(T > χ r(α H, β = Pr(T > t app(α H 7

18 Table 4 Simultaneous confidence intervals obtained by using t max(0.05, t BO(0.05, and t B app(0.05 a (, 0 (0, (, a (µ ( µ( µ( µ ( µ ( µ ( (µ ( + µ ( (µ( + µ ( t max(0.05 (.77,.070 (.74,.066 ( 3.448, 3.40 t BO(0.05 (.77,.070 (.75,.065 ( 3.449, 3.39 t B app(0.05 (.8,.067 (.77,.063 ( 3.45, 3.37 a (µ ( µ(3 µ( µ (3 µ ( µ (3 (µ ( + µ ( (µ(3 + µ (3 t max(0.05 (.889,.68 ( 3.53,.945 ( 4.938, t BO(0.05 (.890,.679 ( 3.55,.944 ( 4.939, 4.79 t B app(0.05 (.89,.678 ( 3.56,.94 ( 4.940, 4.77 a (µ ( µ(3 µ( µ (3 µ ( µ (3 (µ ( + µ ( (µ(3 + µ (3 t max(0.05 ( 0.75, ( 0.983, (.594,.386 t BO(0.05 ( 0.76, ( 0.984, (.595,.385 t B app(0.05 ( 0.78, ( 0.986, 0.77 (.0, References [] Eaton, M. L. and Kariya, T. (983, Multivariate tests with incomplete data, The Anal. Stat., Vol., o., [] Fisher, R. A., (936, The use of multiple measurements in taxonomic problems, Ann. Eugen., 7, [3] Fujikoshi, Y., Ulyanov, V. V. and Shimizu, R. (00, Multivariate Statistics: High-Dimensional and Large-Sample Approximations, Hoboken, J: Wiley. [4] Gupta, A. K., Xu, J. and Fujikoshi, Y. (006, An asymptotic expansion of the distribution of Rao s U-statistic under a general condition, J. Multivariate Anal., 97, [5] Kawasaki, T. and Seo, T. (06, A test for subvector of mean vector with two-step monotone missing data, SUT J. Math., 5, -39. [6] Provost, S. B. (990, Estimators for the parameters of a multivariate normal random vector with incomplete data on two subvectors and test of independence, Comput. Statist. Data Anal., 9, [7] Rao, C. R. (949, On some problems arising out of discrimination with multiple characters, Sankhyā, 9, [8] Rencher, A. C. (0, Methods of Multivariate Analysis, Hoboken, J: Wiley. [9] Siotani, M., Hayakawa, T. and Fujikoshi, Y. (985, Modern Multivariate Statistical Analysis: A Graduate Course and Handbook, American Science Press Inc., Ohio. 8

SIMULTANEOUS CONFIDENCE INTERVALS AMONG k MEAN VECTORS IN REPEATED MEASURES WITH MISSING DATA

SIMULTANEOUS CONFIDENCE INTERVALS AMONG k MEAN VECTORS IN REPEATED MEASURES WITH MISSING DATA Kazuyuki Koizumi Department of Mathematics, Graduate School of Science Tokyo University of Science 1-3, Kagurazaka,