High-dimensional two-sample tests under strongly spiked eigenvalue models

Makoto Aoshima and Kazuyoshi Yata
University of Tsukuba

Abstract: We consider a new two-sample test for high-dimensional data under the strongly spiked eigenvalue (SSE) model. We provide a general test statistic as a function of positive-semidefinite matrices. We investigate the test statistic under the SSE model by considering strongly spiked eigenstructures, and we create a new effective test procedure for the SSE model.

Key words and phrases: Asymptotic normality, eigenstructure estimation, large p small n, noise reduction methodology, spiked model.

1. Introduction

A common feature of high-dimensional data is that the data dimension is high while the sample size is relatively small. This is the so-called HDLSS or "large p, small n" situation, where p is the data dimension, n is the sample size and p/n → ∞. Statistical inference on this type of data is becoming increasingly relevant, especially in medical diagnostics, engineering and other big-data areas.

Suppose we have independent samples of p-variate random variables from two populations, π_i, i = 1, 2, where π_i has an unknown mean vector µ_i and an unknown positive-definite covariance matrix Σ_i. We do not assume that the population distributions are Gaussian. The eigen-decomposition of Σ_i (i = 1, 2) is given by

    Σ_i = H_i Λ_i H_i^T = Σ_{j=1}^p λ_{ij} h_{ij} h_{ij}^T,

where Λ_i = diag(λ_{i1}, ..., λ_{ip}) is a diagonal matrix of eigenvalues, λ_{i1} ≥ ⋯ ≥ λ_{ip} > 0, and H_i = [h_{i1}, ..., h_{ip}] is an orthogonal matrix of the corresponding eigenvectors. Note that λ_{i1} is the largest eigenvalue of Σ_i for i = 1, 2. Having recorded i.i.d. samples, x_{ij}, j = 1, ..., n_i, from each π_i, let

    x_{ij} = H_i Λ_i^{1/2} z_{ij} + µ_i,

where z_{ij} = (z_{i1j}, ..., z_{ipj})^T is considered as a sphered data vector having the zero mean vector and the identity covariance matrix. We assume that the fourth moments of each variable in z_{ij} are uniformly bounded. When Σ_1 = Σ_2, we simply omit the population index from Σ_i and from the λ_{ij}'s and h_{ij}'s; for example, we write the covariance matrix as Σ when Σ_1 = Σ_2.
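To make the data model concrete, the following minimal sketch (an illustration added here, not part of the original analysis) draws a sample x_{ij} = H_i Λ_i^{1/2} z_{ij} + µ_i from one population. Gaussian coordinates for z_{ij} are just one admissible choice; the model only requires a sphered vector with uniformly bounded fourth moments.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_population(n, mu, Sigma):
    """Draw n i.i.d. p-variate vectors x_j = H Lambda^{1/2} z_j + mu."""
    p = len(mu)
    lam, H = np.linalg.eigh(Sigma)        # eigenvalues in ascending order
    lam, H = lam[::-1], H[:, ::-1]        # reorder so lam[0] is the largest
    Z = rng.standard_normal((p, n))       # sphered data vectors z_j (Gaussian here)
    scale = np.sqrt(np.clip(lam, 0.0, None))      # guard tiny negative round-off
    return H @ (scale[:, None] * Z) + mu[:, None]  # p x n sample matrix
```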

In this paper, we consider the two-sample test:

    H_0: µ_1 = µ_2  vs.  H_1: µ_1 ≠ µ_2.    (1.1)

Having recorded i.i.d. samples, x_{ij}, j = 1, ..., n_i, from each π_i, we define x̄_{in_i} = Σ_{j=1}^{n_i} x_{ij}/n_i and S_{in_i} = Σ_{j=1}^{n_i} (x_{ij} − x̄_{in_i})(x_{ij} − x̄_{in_i})^T/(n_i − 1) for i = 1, 2. We assume n_i ≥ 4 for i = 1, 2. Hotelling's T^2-statistic is defined by

    T^2 = {n_1 n_2/(n_1 + n_2)} (x̄_{1n_1} − x̄_{2n_2})^T S^{−1} (x̄_{1n_1} − x̄_{2n_2}),

where S = {(n_1 − 1) S_{1n_1} + (n_2 − 1) S_{2n_2}}/(n_1 + n_2 − 2). However, S^{−1} does not exist in the HDLSS context in which p/n_i → ∞, i = 1, 2. In such situations, Dempster (1958, 1960) and Srivastava (2007) considered the test when π_1 and π_2 are Gaussian. When π_1 and π_2 are non-Gaussian, Bai and Saranadasa (1996) and Cai et al. (2014) considered the test under homoscedasticity, Σ_1 = Σ_2. On the other hand, Chen and Qin (2010) and Aoshima and Yata (2011, 2015) considered the distance-based two-sample test under heteroscedasticity, Σ_1 ≠ Σ_2. As discussed in Section 2 of Aoshima and Yata (2015), the distance-based two-sample test is quite flexible for high-dimension, non-Gaussian data.

We note that those two-sample tests were constructed under the following eigenvalue condition:

    λ_{i1}^2/tr(Σ_i^2) → 0 as p → ∞ for i = 1, 2.    (1.2)

However, if (1.2) is not met, one cannot use those two-sample tests. See Aoshima and Yata (2016) for the details. Aoshima and Yata (2016) called (1.2) the non-strongly spiked eigenvalue (NSSE) model. On the other hand, Aoshima and Yata (2016) considered the strongly spiked eigenvalue (SSE) model:

    lim inf_{p→∞} λ_{i1}^2/tr(Σ_i^2) > 0 for i = 1 or 2.    (1.3)

We emphasize that high-dimensional data often fit the SSE model. See Fig. 1 in Yata and Aoshima (2013) and Section 8 in Aoshima and Yata (2016). For the SSE model, Katayama et al. (2013) considered a one-sample test when the population distribution is Gaussian. Ishii et al. (2016) considered the one-sample test for non-Gaussian cases. Ma et al. (2015) considered a two-sample test for the factor model when Σ_1 = Σ_2.
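The eigenvalue ratio in (1.2) and (1.3) is easy to inspect numerically. The sketch below (ours; the equal-correlation matrix is a hypothetical example) contrasts an SSE-type eigenstructure with an NSSE-type one.

```python
def spike_ratio(Sigma):
    """lambda_1^2 / tr(Sigma^2): tends to 0 under the NSSE model (1.2),
    stays bounded away from 0 under the SSE model (1.3)."""
    lam = np.linalg.eigvalsh(Sigma)            # ascending order
    return lam[-1] ** 2 / np.sum(lam ** 2)

p, rho = 500, 0.3
Sigma_eq = (1 - rho) * np.eye(p) + rho * np.ones((p, p))   # equal correlation
print(spike_ratio(Sigma_eq))     # close to 1: an SSE-type spike
print(spike_ratio(np.eye(p)))    # equals 1/p: NSSE-type
```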

In this paper, we propose a new effective test procedure for the SSE model. In Section 2, we provide a general test statistic as a function of positive-semidefinite matrices. We investigate the test statistic under the SSE model by considering strongly spiked eigenstructures. In Section 3, we create a new test procedure by estimating the eigenstructures for the SSE model.

2. Test statistic using eigenstructures

In this paper, we consider the divergence condition p → ∞, n_1 → ∞ and n_2 → ∞, which is equivalent to m → ∞, where m = min{p, n_min} with n_min = min{n_1, n_2}. Let

    Ψ_{i(s)} = Σ_{j=s}^p λ_{ij}^2 for i = 1, 2; s = 1, ..., p.

We consider the following model:

(A-i) For i = 1, 2, there exists a positive fixed integer k_i such that λ_{i1}, ..., λ_{ik_i} are distinct in the sense that lim inf_{p→∞} (λ_{ij}/λ_{ij+1} − 1) > 0 for 1 ≤ j < k_i when 1 < k_i, and λ_{ik_i} and λ_{ik_i+1} satisfy

    lim inf_{p→∞} λ_{ik_i}^2/Ψ_{i(k_i)} > 0 and λ_{ik_i+1}^2/Ψ_{i(k_i+1)} → 0 as p → ∞.

Note that (A-i) implies (1.3); that is, (A-i) is one of the SSE models. (A-i) is also a power spiked model given by Yata and Aoshima (2013).

We consider the following test statistic with positive-semidefinite matrices, A_i, i = 1, 2, of dimension p:

    T(A_1, A_2) = 2 Σ_{i=1}^2 Σ_{j<j'}^{n_i} x_{ij}^T A_i x_{ij'} / {n_i(n_i − 1)} − 2 x̄_{1n_1}^T A_1^{1/2} A_2^{1/2} x̄_{2n_2}.

Let I_p denote the identity matrix of dimension p. Note that T(I_p, I_p) is equivalent to the distance-based two-sample test statistic.
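A direct implementation of T(A_1, A_2) follows (our illustrative sketch, not part of the original paper). It uses the identity 2 Σ_{j<j'} x_j^T A x_{j'} = s^T A s − Σ_j x_j^T A x_j with s = Σ_j x_j to avoid a double loop, and computes the matrix square roots numerically, which is adequate for a sketch.

```python
from scipy.linalg import sqrtm

def T_stat(X1, X2, A1, A2):
    """General statistic T(A1, A2); X_i is a p x n_i sample matrix."""
    def within(X, A):
        n = X.shape[1]
        s = X.sum(axis=1)
        # 2 * sum_{j<j'} x_j^T A x_{j'} = s^T A s - sum_j x_j^T A x_j
        cross = s @ A @ s - np.einsum('ij,ij->', A @ X, X)
        return cross / (n * (n - 1))
    B = np.real(sqrtm(A1)) @ np.real(sqrtm(A2))   # A1^{1/2} A2^{1/2}
    return (within(X1, A1) + within(X2, A2)
            - 2 * X1.mean(axis=1) @ B @ X2.mean(axis=1))
```

For A_1 = A_2 = I_p this reproduces the distance-based statistic T(I_p, I_p).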

Let us write µ_{A12} = A_1^{1/2} µ_1 − A_2^{1/2} µ_2 and Σ_{i,A_i} = A_i^{1/2} Σ_i A_i^{1/2}, i = 1, 2. Let Δ(A_1, A_2) = ||µ_{A12}||^2 and K(A_1, A_2) = K_1(A_1, A_2) + K_2(A_1, A_2), where

    K_1(A_1, A_2) = 2 Σ_{i=1}^2 tr(Σ_{i,A_i}^2)/{n_i(n_i − 1)} + 4 tr(Σ_{1,A_1} Σ_{2,A_2})/(n_1 n_2)

and

    K_2(A_1, A_2) = 4 Σ_{i=1}^2 µ_{A12}^T Σ_{i,A_i} µ_{A12}/n_i.

Note that E{T(A_1, A_2)} = Δ(A_1, A_2) and Var{T(A_1, A_2)} = K(A_1, A_2). Let λ_max(B) denote the largest eigenvalue of any positive-semidefinite matrix, B. We consider the following condition:

    {λ_max(Σ_{i,A_i})}^2 / tr(Σ_{i,A_i}^2) → 0 as p → ∞ for i = 1, 2.    (2.1)

Then, Aoshima and Yata (2016) showed that, as m → ∞,

    {T(A_1, A_2) − Δ(A_1, A_2)} / {K(A_1, A_2)}^{1/2} ⇒ N(0, 1)    (2.2)

under (2.1), lim sup_{m→∞} {Δ(A_1, A_2)}^2/K_1(A_1, A_2) < ∞ and some regularity conditions. Here, "⇒" denotes convergence in distribution and N(0, 1) denotes a random variable distributed as the standard normal distribution.

We consider the A_i's as

    A_{i(k_i)} = I_p − Σ_{j=1}^{k_i} h_{ij} h_{ij}^T = Σ_{j=k_i+1}^p h_{ij} h_{ij}^T for i = 1, 2.

Note that A_{i(k_i)} = A_{i(k_i)}^{1/2}. Let Σ_{i*} = A_{i(k_i)}^{1/2} Σ_i A_{i(k_i)}^{1/2} = Σ_{j=k_i+1}^p λ_{ij} h_{ij} h_{ij}^T for i = 1, 2. Then, it holds that tr(Σ_{i*}^2) = Ψ_{i(k_i+1)} and λ_max(Σ_{i*}) = λ_{ik_i+1} for i = 1, 2, so that (2.1) is met when A_i = A_{i(k_i)}, i = 1, 2, under (A-i). Hence, for A_i = A_{i(k_i)}, i = 1, 2, we can claim (2.2) under (A-i) instead of (2.1). Hereafter, we simply write T_* = T(A_{1(k_1)}, A_{2(k_2)}), µ_{i*} = A_{i(k_i)} µ_i for i = 1, 2, Δ_* = Δ(A_{1(k_1)}, A_{2(k_2)}) = ||µ_{1*} − µ_{2*}||^2, K_* = K(A_{1(k_1)}, A_{2(k_2)}) and

    K_{*1} = K_1(A_{1(k_1)}, A_{2(k_2)}) = 2 Σ_{i=1}^2 tr(Σ_{i*}^2)/{n_i(n_i − 1)} + 4 tr(Σ_{1*} Σ_{2*})/(n_1 n_2).

Note that tr(Σ_{i*}^2) = Ψ_{i(k_i+1)} for i = 1, 2. Let

    x_{ijl} = h_{ij}^T x_{il} = λ_{ij}^{1/2} z_{ijl} + µ_{i(j)} for all i, j, l,

where µ_{i(j)} = h_{ij}^T µ_i. Then, we can write

    T_* = 2 Σ_{i=1}^2 Σ_{l<l'}^{n_i} (x_{il}^T x_{il'} − Σ_{j=1}^{k_i} x_{ijl} x_{ijl'}) / {n_i(n_i − 1)}
          − 2 Σ_{l=1}^{n_1} Σ_{l'=1}^{n_2} (x_{1l} − Σ_{j=1}^{k_1} x_{1jl} h_{1j})^T (x_{2l'} − Σ_{j=1}^{k_2} x_{2jl'} h_{2j}) / (n_1 n_2).
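For the projection choice A_{i(k_i)}, the statistic simplifies because A_{i(k_i)}^{1/2} = A_{i(k_i)}. A small sketch of the oracle computation (ours; it assumes the true leading eigenvectors are known, which the next section replaces with estimates):

```python
def tail_projection(H_lead):
    """A_(k) = I_p - sum_{j=1}^k h_j h_j^T, built from a p x k matrix whose
    columns are the leading eigenvectors. A_(k) is a projection, so
    A_(k)^{1/2} = A_(k) and T_* can reuse T_stat directly."""
    p = H_lead.shape[0]
    return np.eye(p) - H_lead @ H_lead.T

# Oracle evaluation of T_* (true eigenvectors H1[:, :k1], H2[:, :k2] assumed known):
# T_star = T_stat(X1, X2, tail_projection(H1[:, :k1]), tail_projection(H2[:, :k2]))
```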

In order to use T_*, it is necessary to estimate the x_{ijl}'s and the h_{ij}'s.

3. Test procedure using eigenstructures for the SSE model

In this section, we assume (A-i) and the following assumption for π_i, i = 1, 2:

(A-ii) E(z_{isl}^2 z_{itl}^2) = E(z_{isl}^2) E(z_{itl}^2), E(z_{isl} z_{itl} z_{iul}) = 0 and E(z_{isl} z_{itl} z_{iul} z_{ivl}) = 0 for all s ≠ t, u, v, where z_{isl} denotes the s-th element of the sphered data vector defined in Section 1.

When the π_i's are Gaussian, (A-ii) naturally holds. First, we discuss estimation of the eigenvalues and eigenvectors in the SSE model.

3.1. Estimation of eigenvalues and eigenvectors

Throughout this section, we omit the subscript with regard to the population for the sake of simplicity, and we write the sample covariance matrix as S_n. Let λ̂_1 ≥ ⋯ ≥ λ̂_p ≥ 0 be the eigenvalues of S_n. Let us write the eigen-decomposition of S_n as S_n = Σ_{j=1}^p λ̂_j ĥ_j ĥ_j^T, where ĥ_j denotes a unit eigenvector corresponding to λ̂_j. We assume h_j^T ĥ_j ≥ 0 w.p.1 for all j without loss of generality. Let X = [x_1, ..., x_n] and X̄ = [x̄_n, ..., x̄_n]. Then, we define the n × n dual sample covariance matrix by S_D = (n − 1)^{−1} (X − X̄)^T (X − X̄). Note that S_n and S_D share the non-zero eigenvalues. Let us write the eigen-decomposition of S_D as S_D = Σ_{j=1}^{n−1} λ̂_j û_j û_j^T, where û_j = (û_{j1}, ..., û_{jn})^T denotes a unit eigenvector corresponding to λ̂_j. Note that ĥ_j can be calculated by ĥ_j = {(n − 1) λ̂_j}^{−1/2} (X − X̄) û_j. Let δ = Σ_{j=k+1}^p λ_j/(n − 1) and m_0 = min{p, n}. First, we have the following result.

Proposition 1 (Aoshima and Yata, 2016). Assume (A-i) and (A-ii). It holds for j = 1, ..., k that, as m_0 → ∞,

    λ̂_j/λ_j = 1 + δ/λ_j + O_P(n^{−1/2}) and (ĥ_j^T h_j)^2 = (1 + δ/λ_j)^{−1} + O_P(n^{−1/2}).

If δ/λ_j → ∞ as m_0 → ∞, then λ̂_j and ĥ_j are strongly inconsistent in the sense that λ_j/λ̂_j = o_P(1) and (ĥ_j^T h_j)^2 = o_P(1).
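The dual formulation is what makes these quantities computable when p ≫ n: an n × n eigen-decomposition yields the first n − 1 sample eigenvalues and, via the map below, the eigenvectors of S_n. A sketch (ours; indices are 0-based in the code, and ĥ_j requires λ̂_j > 0):

```python
def dual_eigen(X):
    """Eigen-structure of S_n via the n x n dual matrix
    S_D = (X - Xbar)^T (X - Xbar)/(n - 1), which shares the non-zero
    eigenvalues of S_n. Returns descending eigenvalues lam_hat, dual
    eigenvectors u_hat_j (columns of U), and the centered data Xc."""
    n = X.shape[1]
    Xc = X - X.mean(axis=1, keepdims=True)
    lam_hat, U = np.linalg.eigh(Xc.T @ Xc / (n - 1))
    return lam_hat[::-1], U[:, ::-1], Xc

def h_hat(Xc, lam_hat, U, j):
    """h_hat_j = {(n - 1) lam_hat_j}^{-1/2} (X - Xbar) u_hat_j."""
    n = Xc.shape[1]
    return Xc @ U[:, j] / np.sqrt((n - 1) * lam_hat[j])
```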

In order to overcome this curse of dimensionality, Yata and Aoshima (2012) proposed an eigenvalue estimation method called the noise-reduction (NR) methodology, which was brought about by a geometric representation of S_D. If one applies the NR methodology, the λ_j's are estimated by

    λ̃_j = λ̂_j − {tr(S_D) − Σ_{l=1}^j λ̂_l}/(n − 1 − j)  (j = 1, ..., n − 2).    (3.1)

Note that λ̃_j ≥ 0 w.p.1 for j = 1, ..., n − 2, and the second term in (3.1) is an estimator of δ. When applying the NR methodology to the PC direction vectors, one obtains

    h̃_j = {(n − 1) λ̃_j}^{−1/2} (X − X̄) û_j    (3.2)

for j = 1, ..., n − 2. Then, we have the following result.

Proposition 2 (Aoshima and Yata, 2016). Assume (A-i) and (A-ii). It holds for j = 1, ..., k that, as m_0 → ∞,

    λ̃_j/λ_j = 1 + O_P(n^{−1/2}) and (h̃_j^T h_j)^2 = 1 + O_P(n^{−1}).

We note that h̃_j is a consistent estimator of h_j in terms of the inner product even when δ/λ_j → ∞ as m_0 → ∞. On the other hand, we note that h_j^T (x_l − µ) = λ_j^{1/2} z_{jl} for all j, l. For ĥ_j and h̃_j, we have the following result.

Proposition 3 (Aoshima and Yata, 2016). Assume (A-i) and (A-ii). It holds for j = 1, ..., k (l = 1, ..., n) that, as m_0 → ∞,

    λ_j^{−1/2} ĥ_j^T (x_l − µ) = {z_{jl} + (n − 1)^{1/2} û_{jl} λ_j^{−1} δ (1 + o_P(1))}/(1 + λ_j^{−1} δ)^{1/2} + O_P(n^{−1/2});
    λ_j^{−1/2} h̃_j^T (x_l − µ) = z_{jl} + (n − 1)^{1/2} û_{jl} λ_j^{−1} δ {1 + o_P(1)} + O_P(n^{−1/2}).

Let us consider the standard deviation of the above quantities. Note that [Σ_{l=1}^n {(n − 1)^{1/2} û_{jl} δ/λ_j}^2/n]^{1/2} = O(δ/λ_j) and δ = O(p/n) for λ_{k+1} = O(1). Hence, in Proposition 3, the inner products are heavily biased when p is large.
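The NR estimates themselves are straightforward to compute. A direct transcription of (3.1) and (3.2) (our sketch; 0-based indexing, so array position j − 1 holds λ̃_j):

```python
def nr_eigenvalues(lam_hat, n):
    """Noise-reduction estimates (3.1), for j = 1, ..., n-2:
    lam_tilde_j = lam_hat_j - (tr(S_D) - sum_{l<=j} lam_hat_l)/(n - 1 - j).
    The subtracted term estimates delta = sum_{j>k} lambda_j/(n - 1)."""
    tr_SD = lam_hat.sum()
    js = np.arange(1, n - 1)                  # j = 1, ..., n-2
    csum = np.cumsum(lam_hat[:n - 2])         # sum_{l=1}^{j} lam_hat_l
    return lam_hat[:n - 2] - (tr_SD - csum) / (n - 1 - js)

def h_tilde(Xc, lam_tilde, U, j):
    """NR direction (3.2): the dual-to-primal map with lam_tilde_j
    in place of lam_hat_j."""
    n = Xc.shape[1]
    return Xc @ U[:, j] / np.sqrt((n - 1) * lam_tilde[j])
```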

Now, we explain the main reason why the inner products involve such large bias terms. Let P_n = I_n − 1_n 1_n^T/n, where 1_n = (1, ..., 1)^T. Note that 1_n^T û_j = 0 and P_n û_j = û_j when λ̂_j > 0, since 1_n^T S_D 1_n = 0. Also, when λ̂_j > 0, note that

    {(n − 1) λ̃_j}^{1/2} h̃_j = (X − X̄) û_j = (X − M) P_n û_j = (X − M) û_j,

where M = [µ, ..., µ]. Thus it holds that

    {(n − 1) λ̃_j}^{1/2} h̃_j^T (x_l − µ) = û_j^T (X − M)^T (x_l − µ) = û_{jl} ||x_l − µ||^2 + Σ_{s≠l}^n û_{js} (x_s − µ)^T (x_l − µ),

so that û_{jl} ||x_l − µ||^2 is heavily biased since E(||x_l − µ||^2)/{(n − 1)^{1/2} λ_j} ≥ (n − 1)^{1/2} δ/λ_j. Hence, one should not apply the ĥ_j's or the h̃_j's to the estimation of the inner product.

Here, we consider a bias-reduced estimation of the inner product. Let us write ǔ_{jl} = (û_{j1}, ..., û_{jl−1}, −û_{jl}/(n − 1), û_{jl+1}, ..., û_{jn})^T, whose l-th element is −û_{jl}/(n − 1), for all j, l. Note that ǔ_{jl} = û_j − (0, ..., 0, û_{jl} n/(n − 1), 0, ..., 0)^T. Let

    h̃_{jl} = {(n − 1) λ̃_j}^{−1/2} (X − X̄) ǔ_{jl}    (3.3)

for all j, l. When λ̂_j > 0, we note that {(n − 1) λ̃_j}^{1/2} h̃_{jl} = (X − M) P_n ǔ_{jl} = (X − M) û_{j(l)} since 1_n^T û_j = Σ_{l=1}^n û_{jl} = 0, where û_{j(l)} = (û_{j1}, ..., û_{jl−1}, 0, û_{jl+1}, ..., û_{jn})^T + (n − 1)^{−1} û_{jl} 1_{n(l)} for l = 1, ..., n. Here, 1_{n(l)} = (1, ..., 1, 0, 1, ..., 1)^T, whose l-th element is 0. Thus it holds that

    {(n − 1) λ̃_j}^{1/2} h̃_{jl}^T (x_l − µ) = û_{j(l)}^T (X − M)^T (x_l − µ) = Σ_{s≠l}^n {û_{js} + (n − 1)^{−1} û_{jl}} (x_s − µ)^T (x_l − µ),

so that the large bias term, û_{jl} ||x_l − µ||^2, has vanished. Then, we have the following result.

Proposition 4 (Aoshima and Yata, 2016). Assume (A-i) and (A-ii). It holds for j = 1, ..., k (l = 1, ..., n) that, as m_0 → ∞,

    λ_j^{−1/2} h̃_{jl}^T (x_l − µ) = z_{jl} + û_{jl} O_P{(n^{1/2} λ_j)^{−1} λ_1} + O_P(n^{−1/2}).

Note that [Σ_{l=1}^n {û_{jl} λ_1/(n^{1/2} λ_j)}^2/n]^{1/2} = λ_1/(λ_j n). The bias term is small when λ_1/λ_j is not large.
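A sketch of the bias-reduced scores h̃_{jl}^T x_l from (3.3) (ours; the O(n) loop re-forms the modified dual vector ǔ_{jl} for each l, following the definition reconstructed above):

```python
def bias_reduced_scores(Xc, X, lam_tilde, U, j):
    """Scores h_tilde_{jl}^T x_l via (3.3): for each l, the l-th entry of
    u_hat_j is replaced by -u_hat_{jl}/(n - 1) before mapping back to p
    dimensions, which removes the u_hat_{jl} * ||x_l - mu||^2 bias term."""
    n = X.shape[1]
    scale = 1.0 / np.sqrt((n - 1) * lam_tilde[j])
    scores = np.empty(n)
    for l in range(n):
        u_mod = U[:, j].copy()
        u_mod[l] = -U[l, j] / (n - 1)        # the modified vector in (3.3)
        scores[l] = scale * (Xc @ u_mod) @ X[:, l]
    return scores
```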

3.2. Test procedure using eigenstructures

Let x̃_{ijl} = h̃_{ijl}^T x_{il} for all i, j, l, where the h̃_{ijl}'s are defined by (3.3) for each π_i. From Propositions 2 and 4, we consider the following test statistic for (1.1):

    T̃ = 2 Σ_{i=1}^2 Σ_{l<l'}^{n_i} (x_{il}^T x_{il'} − Σ_{j=1}^{k_i} x̃_{ijl} x̃_{ijl'}) / {n_i(n_i − 1)}
        − 2 Σ_{l=1}^{n_1} Σ_{l'=1}^{n_2} (x_{1l} − Σ_{j=1}^{k_1} x̃_{1jl} h̃_{1j})^T (x_{2l'} − Σ_{j=1}^{k_2} x̃_{2jl'} h̃_{2j}) / (n_1 n_2),

where the h̃_{ij}'s are defined by (3.2). Then, we have the following result.

Theorem 1 (Aoshima and Yata, 2016). Assume (A-i) and (A-ii). Assume also lim sup_{m→∞} Δ_*^2/K_{*1} < ∞. Then, it holds that, as m → ∞,

    (T̃ − Δ_*)/K_*^{1/2} ⇒ N(0, 1)

under some regularity conditions.

Let z_c be a constant such that P{N(0, 1) > z_c} = c for c ∈ (0, 1). We note that K_{*1}/K_* = 1 + o(1) as m → ∞ under (A-i) and lim sup_{m→∞} Δ_*^2/K_{*1} < ∞. Then, for a given α ∈ (0, 1/2), we consider testing the hypothesis in (1.1) by

    rejecting H_0 ⟺ T̃/K̂_{*1}^{1/2} > z_α,    (3.4)

where K̂_{*1} is an estimator of K_{*1} defined in Section 5.2 of Aoshima and Yata (2016). Let power(Δ_*) denote the power of the test (3.4). Then, we have the following result.

Theorem 2 (Aoshima and Yata, 2016). Assume (A-i) and (A-ii). Then, as m → ∞, the test (3.4) has

    size = α + o(1) and power(Δ_*) − Φ(Δ_*/K_*^{1/2} − (K_{*1}/K_*)^{1/2} z_α) = o(1)

under some regularity conditions, where Φ(·) denotes the cumulative distribution function of N(0, 1).

In general, the k_i's are unknown in T̃. See Section 6.2 in Aoshima and Yata (2016) for the estimation of the k_i's.
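Once an estimate of K_{*1} is in hand, the decision rule (3.4) is immediate. The sketch below (ours) only evaluates the K_{*1} formula from supplied trace estimates; the actual estimator K̂_{*1} of Section 5.2 of Aoshima and Yata (2016) is not reproduced in this summary.

```python
from scipy.stats import norm

def K1_star(tr1, tr2, tr12, n1, n2):
    """Evaluate K_*1 = 2 sum_i tr(Sigma_{i*}^2)/{n_i(n_i - 1)}
    + 4 tr(Sigma_{1*} Sigma_{2*})/(n1 n2) from (estimated) traces."""
    return (2 * tr1 / (n1 * (n1 - 1)) + 2 * tr2 / (n2 * (n2 - 1))
            + 4 * tr12 / (n1 * n2))

def reject_H0(T_tilde, K1_hat, alpha=0.05):
    """Decision rule (3.4): reject H_0 iff T_tilde / sqrt(K1_hat) > z_alpha."""
    return T_tilde / np.sqrt(K1_hat) > norm.ppf(1 - alpha)
```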

Acknowledgements

The research of the first author was partially supported by Grants-in-Aid for Scientific Research (A) and Challenging Exploratory Research, Japan Society for the Promotion of Science (JSPS), under Contract Number 15H01678. The research of the second author was partially supported by a Grant-in-Aid for Young Scientists (B), JSPS.

References

Aoshima, M. and Yata, K. (2011). Two-stage procedures for high-dimensional data. Sequential Anal. (Editor's special invited paper) 30.

Aoshima, M. and Yata, K. (2015). Asymptotic normality for inference on multisample, high-dimensional mean vectors under mild conditions. Methodol. Comput. Appl. Probab. 17.

Aoshima, M. and Yata, K. (2016). Two-sample tests for high-dimension, strongly spiked eigenvalue models. Statist. Sinica, revised (Preprint No. SS-2016-0063R2).

Bai, Z. and Saranadasa, H. (1996). Effect of high dimension: by an example of a two sample problem. Statist. Sinica 6.

Cai, T. T., Liu, W. and Xia, Y. (2014). Two-sample test of high dimensional means under dependence. J. R. Statist. Soc. Ser. B 76.

Chen, S. X. and Qin, Y.-L. (2010). A two-sample test for high-dimensional data with applications to gene-set testing. Ann. Statist. 38.

Dempster, A. P. (1958). A high dimensional two sample significance test. Ann. Math. Statist. 29.

Dempster, A. P. (1960). A significance test for the separation of two highly multivariate small samples. Biometrics 16.

Ishii, A., Yata, K. and Aoshima, M. (2016). Asymptotic properties of the first principal component and equality tests of covariance matrices in high-dimension, low-sample-size context. J. Statist. Plan. Infer. 170.

Katayama, S., Kano, Y. and Srivastava, M. S. (2013). Asymptotic distributions of some test criteria for the mean vector with fewer observations than the dimension. J. Multivariate Anal. 116.

Ma, Y., Lan, W. and Wang, H. (2015). A high dimensional two-sample test under a low dimensional factor structure. J. Multivariate Anal. 140.

Srivastava, M. S. (2007). Multivariate theory for analyzing high dimensional data. J. Japan Statist. Soc. 37, 53-86.

Yata, K. and Aoshima, M. (2012). Effective PCA for high-dimension, low-sample-size data with noise reduction via geometric representations. J. Multivariate Anal. 105.

Yata, K. and Aoshima, M. (2013). PCA consistency for the power spiked model in high-dimensional settings. J. Multivariate Anal. 122.

Institute of Mathematics, University of Tsukuba, Ibaraki, Japan.
E-mail: aoshima@math.tsukuba.ac.jp

Institute of Mathematics, University of Tsukuba, Ibaraki, Japan.
E-mail: yata@math.tsukuba.ac.jp
