Regularized LRT for Large Scale Covariance Matrices: One Sample Problem

Size: px

Start display at page:

Download "Regularized LRT for Large Scale Covariance Matrices: One Sample Problem"

Randolf Wade
5 years ago
Views:

1 Regularized LRT for Large Scale ovariance Matrices: One Sample Problem Young-Geun hoi, hi Tim Ng and Johan Lim arxiv: v3 [stat.me] 0 Jun 206 August 6, 208 Abstract The main theme of this paper is a modification of the likelihood ratio test LRT for testing high dimensional covariance matrix. Recently, the correct asymptotic distribution of the LRT for a large-dimensional case the case p/n approaches to a constant γ 0, ] is specified by researchers. The correct procedure is named as corrected LRT. Despite of its correction, the corrected LRT is a function of sample eigenvalues that are suffered from redundant variability from high dimensionality and, subsequently, still does not have full power in differentiating hypotheses on the covariance matrix. In this paper, motivated by the successes of a linearly shrunken covariance matrix estimator simply shrinkage estimator in various applications, we propose a regularized LRT that uses, in defining the LRT, the shrinkage estimator instead of the sample covariance matrix. We compute the asymptotic distribution of the regularized LRT, when the true covariance matrix is the identity matrix and a spiked covariance matrix. The obtained asymptotic results have applications in testing various hypotheses on the covariance matrix. Here, we apply them to testing the identity of the true covariance matrix, which is a long standing problem in the literature, and show that the regularized LRT outperforms the corrected LRT, which is its non-regularized counterpart. In addition, we compare the power of the regularized LRT to those of recent non-likelihood based procedures. Keywords: Asymptotic normality; covariance matrix estimator; identity covariance matrix; high dimensional data; linear shrinkage estimator; linear spectral statistics; random matrix theory; regularized likelihood ratio test; spiked covariance matrix. Introduction High dimensional data are now prevalent everywhere that include genomic data in biology, Young-Geun hoi and Johan Lim are at Department of Statistics, Seoul National University, Seoul, Korea. hi Tim Ng is at Department of Statistics, honnam National University, Gwangju, Korea. All correspondences are to johanlim@snu.ac.kr

2 financial times series data in economics, and natural language processing data in machine learning and marketing. The traditional procedures that assume that sample size n is large and dimension p is fixed are not valid anymore for the analysis of high dimensional data. A significant amount of research are made to resolve the difficulty from the dimensionality of the data. This paper considers the inference problem of large scale covariance matrix whose dimension p is large compared to the sample size n. To be specific, we are interested in testing whether the covariance matrix equals to a given matrix; H 0 : Σ = Σ 0, where Σ 0 can be set I p without loss of generality. The likelihood ratio test LRT statistic for testing H 0 : Σ = I p is defined by LRT = tr p S n log S n p = li log l i, where S n is the unbiased and centered sample covariance matrix and l i is the i th largest eigenvalue of the sample covariance matrix. When p is finite, LRT follows the chi-square distribution with degrees of freedom pp + /2 asymptotically. However, this does not hold when p increases. Its correct asymptotic distribution is computed by Bai et al for the case p/n approaches γ 0, and both n and p increase. They further numerically show that their asymptotic normal distribution defines a valid procedure for testing H 0 : Σ = I p. The results of Bai et al are refined by Jiang et al. 202, which include the asymptotic null distribution for the case γ =. Despite of the correction of the null distribution, the sample covariance is known to have redundant variability when p is large, and it still remains a general question that the LRT is asymptotically optimal for testing problem in the n, p large scheme. In this paper, it is shown that the corrected LRT can be further improved by introducing a linear shrinkage component. In detail, we consider a modification of the LRT, denoted by regularized LRT rlrt, defined by rlrt = tr Σ log Σ p = p ψi log ψ i, where Σ is a regularized covariance matrix and ψ i is the i th largest eigenvalue of Σ. Here, we consider the regularization via linear shrinkage: Σ λs n + λi p. 2 2

3 We also occasionally notate rlrtλ to emphasize the use of the value λ. The linearly shrunken sample covariance matrix simply shrinkage estimator is known to reduce expected estimation loss of the sample covariance matrix Ledoit and Wolf, It is also successfully applied to many high-dimensional procedures to resolve the dimensionality problem. For example, Schäfer and Strimmer 2005 reconstruct a gene regulatory network from microarray gene expression data using the inverse of a regularized covariance matrix. hen et al. 20 propose a modified Hotelling s T 2 -statistic for testing high dimensional mean vectors and apply it to finding differentially expressed gene sets. We are motivated by the success of above examples and inspect whether the power can be improved by the reduced variability via linear shrinkage. To the best of our knowledge, our work is the first time to apply the linear shrinkage to the covariance matrix testing problem itself. We derive the asymptotic distribution of the proposed rlrtλ under two scenarios, i when Σ = I p for the null distribution, and additionally ii when Σ = Σ spike for power study. Here Σ spike means a covariance matrix from the spiked population model Johnstone, 200, roughly it is defined as a covariance matrix whose eigenvalues are all s but some finite nonunit spike. The spiked covariance matrix assumed here includes the well known compound symmetry matrix Σ cs ρ = I p + ρj p, where J p is the p p matrix of ones. The main results show that rlrtλ has normal distribution in asymptotic under both i and ii; their asymptotic means are different but the variances are same. The main results are useful in testing various one sample covariance matrices. To be specific, first, in testing H 0 : Σ = I p, i provides the asymptotic null distribution of rlrtλ. Second, combining i and ii provides the asymptotic power for an arbitrary spiked alternative covariance matrix including Σ cs ρ. Finally, the results with λ = provide various asymptotic distributions of the corrected LRT. Among these many applications, in this paper, we particularly focus on the LRT for testing H 0 : Σ = I p, which has long been studied by many researchers Anderson, 2003; Bai et al., 2009; hen et al., 200; Jiang et al., 202; Ledoit and Wolf, The paper is organized as follows. In Section 2, we briefly review results of the random matrix theory that are essential to the asymptotic theory of the proposed rlrt. The results include the limit of empirical spectral distribution ESD of the sample covariance matrix and the central limit theorem LT for linear spectral statistics LSS. In Section 3, we formally define the rlrt, and prove the asymptotic normality of the rlrt when the true 3

4 covariance matrix Σ is I p or Σ spike. In Section 4, the results developed in Section 3 are applied to testing H 0 : Σ = I p. Numerical study is provided to compare the powers of the LRT and other existing methods including the corrected LRT and other non-lrt tests by Ledoit and Wolf 2002 and hen et al In Section 5, we conclude the paper with discussions of several technical details of the rlrt, for example, close spiked eigenvalues. 2 Random matrix theory In this section, some useful properties of linear spectral statistics of the sample covariance matrix are introduced. The true covariance matrix Σ is identity or that from a spiked population model. The following notation is used throughout the paper. Let M be a real-valued symmetric matrix of size p p and α j M be the j th largest eigenvalue of the matrix M with natural labeling α p M α M. The spectral distribution SD for M is defined by F M t := p p δ αj Mt, t R, j= where and δ α t is a point mass function that can be also written, with notational abuse, as δ α t = Iα t. Here, IA denotes the indicator function of a set A. 2. Limiting spectral distribution of sample covariance matrix Let {z ij i,j be an infinite double array of independent and identically distributed realvalued random variables with Ez = 0, Ez 2 = and Ez 4 = 3. Let Z n = {z ij, i =, 2,..., n, j =, 2,..., p be the top-left n p block of the infinite double array. We assume that both n and p diverge and their ratio γ := p/n converges to a positive constant γ. The data matrix and the uncentered sample covariance matrix are X n = Z n Σ /2 p S 0 n = n X n X n respectively, where {Σ p, p =, 2,... is a sequence of p p nonrandom symmetric matrices. Note that the fourth moment condition Ez 4 = 3 is used later on in Proposition. In the random matrix theory literature Bai et al., 2009; Bai and Silverstein, 2004, the limiting distribution of empirical SD F S0 n is determined by both the limits of p/n and F Σp. Specifically, if H p := F Σp converges in distribution to H, a random 4 and

5 distribution function F S0 n converges in distribution to a nonrandom distribution function, say F γ,h with probability one. The definition of F γ,h is given by its Stieltjes transform m γ,h z that is the unique solution of the following system of equations: on z {z : Imz > 0. m γ,h z = γ mγ,h z + γ γz ; 3 z = m γ,h z + γ t dht 4 + tm γ,h z Generally, m γ,h z is known as the Stieltjes transform of the limiting SD of the so-called companion matrix for S 0 n, which is defined by S 0 n := n X nx n. The density of F γ,h can be calculated from m γ,h by the inversion formula, df γ,h dx x = lim z x γπ Im[mγ,H z], x R, z : Imz > 0. 5 In the special case Σ p = I p, we have H p t = δ t and the corresponding spectral distribution F γ,δ follows the Mar cenko-pastur law. To see this, note that the second equation of 4 can be rewritten as γ z = +, 6 m γ,δ z + m γ,δ z when H = δ. By the inversion formula, 5 yields the probability density function of the Mar cenko-pastur distribution indexed by γ when 0 < γ <, df γ,δ dx x = bγ xx aγ, aγ x bγ, 7 2πγx where aγ := γ 2 and bγ := + γ entral limit theorem for linear spectral statistics Many multivariate statistical procedures are based on F Sn, the empirical SD of the centered sample covariance matrix S n. onsider is a family of functionals of eigenvalues that is also called as linear spectral statistics LSS or linear eigenvalue statistics: p p gl j = j= gxdf Sn x where g is a function fulfilling certain complex-analytic conditions. 5

6 If the sample covariance matrix is uncentered, the central limit theorem LT for the corresponding LSS is developed in Bai and Silverstein 2004 and Bai et al The proposition below is adapted from Theorem 2. in Bai et al and used in the asymptotic property of the proposed regularized LRT under the null. Note that the centering term of the LT possesses a finite-dimensional proxy F γ,h p. Proposition Bai et al Let T n g be the functional { T n g = p gx d F S0 n x F γ,h p x. 8 Suppose that the two functions g and g 2, are complex analytic on an open domain containing an closed interval [aγ, bγ] on the real axis. If γ = p/n γ 0, and H p converges in distribution to δ, then the vector T n g, T n g 2 converges in distribution to a bivariate normal distribution with mean µg i = g iaγ + g i bγ 4 bγ g i x dx 9 2π aγ 4γ x γ 2 for i =, 2, and variance vg, g 2 = 2π 2 g z g 2 z 2 mz mz 2 2 dmz dmz 2, 0 where m = m γ,δ is defined in 6. The two contours in 0 are non-overlapping and containing [aγ, bγ], the support of F γ,δ. The proposition requires the data to have known population mean vector known as zero without loss of generality and considers the non-centered sample covariance matrix S 0 n. However, this is seldomly true in practice and it is common to use the unbiased and centered sample covariance matrix X n = B n Z n Σ /2 p and S n = /ñx n X n, where ñ = n and B n = I n /nj n. The extension of Proposition to the centered sample covariance matrix is studied recently in Zheng et al At the paper, the authors prove that, under the fourth moment assumptions the assumption Ez 4 = 3 in Section 2., Proposition is still valid for S n if T n g is redefined by { T n g = p gx d F Sn x F γ,h p x, 6

7 where ñ = n is the adjusted sample size and γ = p / ñ. This is named as substitution principle. The assumption H = δ roughly implies that the spectrum of Σ p is eventually concentrated around one. One simple example is Σ p = I p. Then the SD is H p = F Ip = δ, which trivially converges to δ. In addition, we note that the spiked population model Johnstone, 200 has H = δ as the limiting SD and is applicable to Proposition. In our settings, the spiked population model refers to the data whose covariance matrix Σ p has the following eigenvalue structure: Then the SD H p corresponding to Σ p is a,..., a {{, a 2,..., a 2,..., a {{ k,..., a k,,...,. 2 {{{{ n n k n 2 H p t = p K δ t + p p p K n i δ ai t, 3 where K = n +...+n k is a fixed finite integer, not depending on n so that p K eigenvalues of unity eventually dominate corresponding H p when p is large. Thus, the limiting SD remains unchanged as H = δ. To study the power of the proposed regularized LRT, it is useful to study how to apply the spiked population model to Proposition. Although the limiting SD is simple δ, the spiked population model has several difficulties with the use of Proposition. In the spiked model, its SD, H p in 3, has masses at K + distinct points and m γ,h p z is the solution to a polynomial equation of degree K + 2. A polynomial equation has an analytical solution only when its degree is less than or equal to 4. Therefore, if K 3, we do not have an analytic form of m γ,h p z. To resolve this difficulty, recently, Wang et al. 204 provides an approximation formula of gdf γ,h p in Proposition for the spiked population model. Such an approximation of gm γ,h p at m γ,δ is constructed based on the idea that m γ,h p z and m γ,δ z would be close enough if p is large. Proposition 2 Wang et al Suppose that γ < and H p is given by 3, with a i > γ for all i =, 2,..., K. If a complex-valued function g is analytic on an open domain containing the interval [aγ, bγ] and k points ϕa i := a i + γa i a i, i =, 2,..., K 7

8 on the real axis, then gxdf γ,h p x in 8 can be approximated by gxdf γ,h p x = g 2πip m + γ + m + f K 2πip m + γ + m + K gxdf γ,δ x + p p K K γ m a i n i + a i m + m n i a 2 i m dm 4 + a i m 2 m γ m dm 5 + m 2 n i gϕa i + O n 6 2 where m = m γ,δ is defined in 6 by substituting γ by γ, and is a counterclockwise contour [ ] enclosing the interval, γ + on the real axis. γ The above proposition is a special case of Theorem 2 in Wang et al. 204 when all the a i s are distant spikes, i.e., a i > γ. Note that Theorem 2 of Wang et al. 204 allows close spikes a i that is defined by a i γ. In this paper, we focus on the alternative hypothesis with distant spike in the power study. We finally remark that the substitution principle is directly applicable to Wang et al. s results, that is, one can approximate gxdf γ,h p x in by Proposition 2, with γ = p/n in the formula replaced by γ = p/n. 3 Main results In this section, the asymptotic results of the rlrt are presented. Here, the rlrt is defined via the linear shrinkage estimator instead of the sample covariance matrix : rlrtλ := tr Σ log Σ p, where Σ := λsn + λi p. The shrinkage intensity λ is fixed and chosen from 0,. Define ψx = λx + λ and gx = ψx log{ψx. We consider rlrtλ = p gxdf Sn x, whose sample covariance matrix term is of the centered version. Then Proposition along with the substitution principle yields the following results. Theorem. Let gx = ψx log{ψx and ψx = λx + λ with fixed λ 0,. Suppose that Σ p = I p. If γ = p / ñ γ 0, with ñ = n, then T n g = rlrtλ p gxdf γ,δ x 8

9 converges in distribution to the normal distribution with mean and variance where µg = log + λγ 2 4λ 2 γ 2 { vg = 2 λ M λ + γ λγ + λγ + 2π log + λγ 2λ γ cos θ dθ 7 4π 0 + N log M N, 8 M + N M, N = Mλ, γ, Nλ, γ := 2λ + λγ ± 2λ + λγ 2 + 4λ λ 2 λ The detailed proof of Theorem is given in Appendix A. Note that when λ =, µg = log γ/2 and vg = 2γ 2 log γ that are consistent with the results in Bai et al To see this, observe that the integral in the mean function 2π 0 log + λγ 2λ γ cos θ dθ approaches zero according to the dominate convergence theorem. In addition, it can be shown that M goes to / γ, and N goes to + as λ using the approximation formula x + x x + 2 x /2 x. Next, consider the finite-dimensional proxy gxdf γ,δ x. From the density function of Mar cenko-pastur law 7 and the fact that = xdf γ,δ x = df γ,δ x, 9 gxdf γ,δ x = = b γ a γ b γ a γ λx λ logλx + λ {b γ x{x a γ 2πx γ dx logλx + λ {b γ x{x a γ 2πx γ dx. By substituting x = + γ 2 γ cos θ, we have an alternative representation of the integral gxdf γ,δ x = 2 π log + λ γ 2λ γ cos θ π + γ 2 γ sin 2 θdθ. 20 cos θ 0 It is remarked that the right-hand-side of 20 can be evaluated via the standard numerical integration techniques. 9

10 Theorem 2 below establishes the asymptotic normality of the rlrt under the alternative hypothesis that the true covariance matrix from the spiked population model in Section 2. It follows directly from Proposition 2. Theorem 2. Let gx = ψx log{ψx and ψx = λx + λ with fixed λ 0,. Suppose that Σ p has SD H p t = p K δ p t + K p n iδ ai t as in 3 with a i > γ for all i =, 2,..., K. If γ = p/ñ γ 0, with ñ = n, then rlrtλ p gxdf γ,h p x N µg, vg, where µg, vg are defined in Theorem and p gxdf γ,h p x = K gxdf γ,δ x + K p p λ, γ + O with λ, γ = λ n i a i λ K K [ γ log M + K [ λ + n i λk M N n i log ψ{ϕa i K { a i M + + a i M a i γ M 2 ai n i log + a i M M + N + n 2 ] n i + a i M a i + a i MM + + γ M 2 M + 2 { ] a i γ + γ 2MN + M + N, a i M + N + where ϕa = a + γa a and both M and N in λ, γ are M = Mλ, γ and N = Nλ, γ. The proof of the above theorem is provided in Appendix A. Theorem 2 can be applied to Σ cs, the covariance matrix with compound symmetry, which is defined by + β/p β/p β/p β/p β β/p + β/p. Σ cs =.... = I p p + β. p J p β/p + β/p This matrix has a spiked eigenvalue structure; + β for one eigenvalue and for the other p eigenvalues. The corresponding SD is H p t = p δ p t + δ p +βt. Theorem 2 with K =, k =, n =, and a = + β gives the following corollary. 0

11 orollary. Let gx = ψx log{ψx and ψx = λx + λ with fixed λ 0,. Suppose that Σ p has SD H p t = p δ p t + δ p +βt with β > γ. If γ := p/ñ γ 0, where ñ = n, then rlrtλ p gxdf γ,h p x Nµg, vg, where µg, vg are defined in Theorem and gxdf γ,h p x = gxdf γ,δ x + p p λ, γ + O with λ, γ = λβ log ψ{ϕ + β + γ { λ + log M βm log { λ + λm N β + + βm n 2 + βm βm γ + βm βmm + + γ M 2 M + 2 where ϕa = a + γa a and both M and N in λ, γ are M = Mλ, γ and N = Nλ, γ. In the corollary, the condition β > γ means that the spiked eigenvalue + β is distant. 4 Testing the identity covariance matrix The results in Section 3 can be used for testing various hypotheses on one sample covariance matrix. In this section, we study the finite-sample properties of the proposed rlrt in testing H 0 : Σ = I p. Additionally, we compare the power of the proposed rlrt and the following existing procedures in the literature. Ledoit and Wolf 2002 assume that p / n γ 0, and propose a statistic T LW = { S p tr 2 I p n { p tr S 2 + p n. The asymptotic distribution of T LW is, if both n and p increase with p/n γ 0,, nt LW p converges in distribution to normal distribution with mean and variance 4.,

12 Bai et al and Zheng et al. 205 propose a corrected LRT for the cases where both n and p increases and p / n converges to γ 0,. The corrected LRT statistic is clrt = tr S n log S n p. They show that T lrt = vg {clrt /2 p gxdf γ,δ xdx µg converges in distribution to the standard normal distribution, where µg = log γ 2, vg = 2 log γ 2γ, and gxdf γ,δ xdx = γ γ log γ. Finally, hen et al. 200 proposes to use the statistic T = p V 2.n 2 p V.n +, where V.n = n V 2.n = P 2 n n + P 4 n i j X i X i P 2 n Xi X j i j X i X j 2 2 P 3 n Xi X j Xk X l i,j,k,l P r n = n!/n r! Xi X j Xj X k i,j,k and is the summation over different indices. The asymptotic theory suggests that, under the null, nt converges in distribution to the normal distribution with mean 0 and variance Power comparison with the clrt The asymptotic power curves of the clrt and rlrt for the alternative hypothesis of the compound symmetry can be obtained using orollary. When Σ p Σ cs β/p and γ γ, the probability of rejecting H 0 : Σ p = I p at level η is [ Φ Φ η + ] gxdf γ,δ x λ, γ, vg β > γ 22 2

13 where λ, γ is defined in orollary and Φ denotes the cumulative distribution function of the standard normal distribution. The powers of the clrt and rlrt with λ = 0.4 and λ = 0.7 are plotted in Figure. Each panel of Figure compares the powers of the clrt and rlrt for different sample size n. In each panel, the results of β < γ close spike are also included to study the performances when the the assumption of distant spike in orollary is violated. The results in Figure suggest that Theorem 2 would be applicable when there is a close spike eigenvalue. More detailed discussions are given in Section 5. We find that in all cases the rlrt has higher empirical power than the clrt for the chosen values of λ = 0.4 and 0.7. We also find that the empirical power curve increases to less rapidly if λ or γ is closer to. In addition, although we do not report the details, the empirical curves converge fast and do have minor changes after n = 80 for the selected values of λ and γ. To understand the power gain due to the use of the rlrt better, we plot the empirical density of the rlrt and clrt under the null and four alternative hypotheses A-A4 used in Section 4.2 in Figure 2. The figure shows that i the variances of the rlrt are smaller than the clrt under both null and alternative hypotheses and ii the distances between the null and the alternative distributions are larger in the rlrt than in the clrt. Accordingly, the rlrt has larger power than the clrt and we will see this is true regardless of the choice of n and p in the next section. 4.2 Power comparison with other existing procedures In this section, we numerically compare the empirical sizes and powers of the rlrt statistic to other existing tests, namely the corrected LRT clrt by Bai et al. 2009, the invariant test by Ledoit and Wolf 2002, and the non-parametric test by hen et al In this study, random samples of size n are generated from the p dimensional multivariate normal distribution MVN p 0, Σ. The covariance matrix is set as Σ = I p to obtain the empirical sizes. The sample size n is chosen as 20, 40, 80, and 60, and, for each n, γ = γ = p/n is chosen as 0.2, 0.5, and 0.8. For example, p = 32, 80, and 28 are considered for n = 60 in the simulation. We take 0.05 as as the level of significance. The shrinkage intensity λ of rlrtλ is selected from 0.2, 0.5, and 0.8 to investigate the effect of the magnitude of 3

14 Analytical n = 20 H_0 rejection prob β < γ H_0 rejection freq β β n = 40 n = 80 H_0 rejection freq H_0 rejection freq RLRT0.4 RLRT0.7 LRT β β Figure : Analytic and empirical power curves for the rlrt and clrt. the linear shrinkage. This means that we compare the empirical sizes of clrt, rlrt0.2, rlrt0.5, and rlrt0.8 under varying n and γ. To compare the powers, we consider the following four alternatives: A independent but heteroscedastic variance case Σ = diag2, 2,, 2,,,, where the number of 2 s is min{, 0.2p, where r is the round-down of r; A2 independent with a single diverging spiked eigenvalue as Σ = diag + 0.2p,,,, ; A3 compound symmetry Σ = Σ cs 0.2; and A4 compound symmetry Σ = Σ cs 0. as defined in 2. Here, the compound symmetry matrix Σ cs ρ has a single spiked eigenvalue + ρ p and p non-spiked eigenvalues 4

15 clrt rlrt Density Density Null A A2 A3 A Figure 2: Empirical density functions of the rlrt and the clrt under the null and four alternative hypotheses when n = 40, p = 32, and λ = 0.5: A independent but heteroscedastic variance case Σ = diag2, 2,, 2,,,, where the number of 2 s is min{, 0.2p, where r is the round-down of r; A2 independent with a single diverging spiked eigenvalue as Σ = diag + 0.2p,,,, ; A3 compound symmetry Σ = Σ cs 0.2; and A4 compound symmetry Σ = Σ cs 0. as defined in 2. of. Thus, A2 and A3 have the identical spectra. The sample size n and γ are chosen to be the same as those of the null. The empirical sizes and powers of the listed methods are reported in Table. First, the empirical sizes of all the tests approach to the aimed level 0.05 as n increases. However, the size of Ledoit and Wolf 2002 shows slower convergence and more upward bias than the size of the other tests do in all cases we considered here. For this reason, the power of Ledoit and Wolf 2002 after correcting the size empirically are also reported in Table, where the cut off value is decided based on 00, 000 simulated test statistics under the null for each simulation setting. Second, it can be seen that the emprical powers of the rlrts are higher than those of the clrt in all cases we considered. In addition, it is interesting to note that the power improvement is especially higher in the case γ = 0.8 when p is relatively largfe. Third, comparing to the tests of hen et al. 200 and the biased-corrected version of Ledoit and Wolf 2002, the proposed rlrt0.5 and rlrt0.2 has higher empirical power in most of the cases. Finally, we remark that the computational cost of hen et al. 200 is at least Opn 4 due to the fourth moment calculation so it is not suitable for data with 5

16 large n. In fact, to test the data with n = 500, hen et al. 200 takes tens of hours to finish all computation using codes and Intel ore-i7 PU, whereas the other tests only require seconds. 5 Discussion We conclude the paper with a few additional issues of the proposed rlrt not fully discussed in the mainbody of the paper. First, we consider the case where n < p; equivalently, γ. In this case, the clrt is not well-defined because the logarithm term, log S n = i logl i, contains some zero l i s. On the other hand, the rlrt is still well-defined as the corresponding logartithm term i log ψl i remains positive even if λ <. Since the LT of the linear spectral statistics holds for γ [0, Bai and Silverstein, 2004, it is possible to extend Theorem and 2 in this paper to the case where γ [,. Second, we discuss the case with closely spiked eigenvalues. In the compound symmetry model of 2, the close spiked eigenvalues are those where the spike + β is smaller than + γ. As shown in Figure, it appears that the power curves over the interval β 0, γ could be obtained simply by extending the formula of orollary to 0, γ. We remark that, however, if we follow Wang et al. 204, the term log ψ + β in orollary should be omitted when β 0, γ, leading to incoherence between the analytical and empirical power curves on 0, γ. Finally, the selection of shrinkage intensity λ is still not well understood for hypothesis testing. As a reviewer points out, when λ approaches 0, the rlrt becomes irrelevant with the alternative covariance matrix and its power is expected to be close to the size. Thus, an appropriate selection of λ is important for good performance of the rlrt. The selection of λ for the purpose of improved estimation is well studied in the literature, for example, Ledoit and Wolf 2004, Schäfer and Strimmer 2005 and Warton However, our additional numerical study shows that such a choice of λ for the estimation purpose cannot achieve a power gain in testing problem. The optimal selection of λ for the hypothesis testing needs further research. 6

17 γ n clrt rlrt rlrt rlrt hen LW Null A: Indep with hetero variance A2: Indep with single diversing spike p A3: ompound symmetry with ρ = A4: ompound symmetry with ρ = Table : Summary of sizes and powers over 00,000 replications except for hen et al. 200 and,000 replications for hen et al. 200 due to heavy computation. The empirically corrected powers of Ledoit and Wolf 2002 are reported in the parentheses. For each row, the maximum powers are highlighted in bold and the maximum powers among the four LRT-based tests are underlined. The powers for n = 60 are all equal to and are removed from the table. 7

18 Acknowledgements This paper is supported by National Research Foundation of Korea NRF grant funded by the government MSIP No The authors are grateful to two reviewers and the AE for a careful reading and providing improvements. A Proof of the main theorems A. Proof of Theorem Theorem is a direct consequence of Proposition. Here, we calculate the integrals 9 and 0 for gz = g z = g 2 z = ψz logψz, where ψz = λz + λ. Mean: Using Proposition and the substitution x = + γ 2 γ cos θ, µg i = g iaγ + g i bγ 4 = λγ log + λγ 2 4λ 2 γ 2 π 2π 0 bγ g i x 2π dx aγ 4γ x γ 2 λγ 2λ γ cos θ log + λγ 2λ γ cos θ dθ = log + λγ 2 4λ 2 γ 2 = log + λγ 2 4λ 2 γ 2 + π 2π + 4π 0 2π 0 log + λγ 2λ γ cos θ dθ log + λγ 2λ γ cos θ dθ. Variance: We write m := mz and m 2 = mz 2 for notational simplicity. We have vg = gz m gz 2π 2 2 m 2 m m 2 dm dm 2 2, 23 To evaluate this integral with auchy s formula, we need to identify the points of singularity in gzm. It can be seen that there is singularity when ψzm = 0. Rewrite ψzm as ψzm = λ + γ + λ = λm Mm N m m + m m + 8

19 where M, N = 2λ + λγ ± 2λ + λγ 2 + 4λ λ. 2 λ Then, M, N are points of singularity. Next, choose contours and 2 enclosing and M, but not 0 and N, such that on the contours, the logarithm in gz is single-valued. In addition, and 2 are chosen so that they do not overlap. Applying integration by parts and auchy s formula, we have gzm m m 2 dm 2 { λ λγ = m m Then, Here, { = 2πi m 2 λγ m m m 2 M vg = 2π 2 = i { gz 2 m 2 π gzm2 m + m M m N gz m gz 2 m 2 m m 2 dm dm 2 2 λγ m m m 2 M m m 2 dm dm = m 2 + dm 2 2 { λ m 2 2 = 2πi + N λγ m m 2 + m 2 + m 2 M m 2 N m 2 + dm 2 Applying integration by parts, auchy s formula, and Lemma, we have gzm m m 2 dm 2 Then, { λ = { = 2πi λγ m m m + m M m N λγ m m m 2 M m 2 vg = 2π 2 = i { gz 2 m 2 π gz m gz 2 m 2 m m 2 dm dm 2 2 λγ m m m 2 M m m 2 dm dm

20 Using integration by parts, gzm2 m 2 + dm 2 2 { λ λγ = m 2 2 m m 2 m 2 + m 2 M m 2 N m 2 + dm 2 = 2πi λ N Applying Lemma and auchy s formula, we obtain gzm2 m 2 + dm 2 = 2πi {log λ + N 28 and { gzm2 m 2 M dm 2 = 2πi λ M λm N λ log M ombining the results 26-29, we have the desired result of variance. A.2 Proof of Theorem 2 29 We compute gxdf γ,h p x = {λx λ logλx + λ df γ,h p x, where H p is the SD of spiked population model, which is written as H p t = p K δ t + p p Following the lines of Section 3 in Wang et al. 204, λx λ df γ,h p x = λ p n i δ ai t. n i a i λk p + O n. 2 The difficult part lies on the evaluation of integration of the logarithm-related term. Using the labeling in Proposition 2, we rewrite it as logλx + λdf γ,h p x = and calculate 4, 5 and 6 separately. In the remainder of the proof, we write m and γ instead of m and γ, respectively, for notational convenience. The terms 4 and 5 20

21 involve contour integrals. Recall that the contour on 4 and 5 encloses the closed interval [ γ, + γ ] on the real axis of the complex plane and has poles of {m =, {m = M, where M = Mλ, γ is defined in 9. It is easy to show that < M < γ and N > 0 provided γ 0, and n is large, where N = Nλ, γ is from 9. + γ Following the lines of Section 3.3 of Wang et al. 204, recall that λ + γ + λ = m +m λm Mm N. We have mm+ 4 = 2πip = = log 2πipγ 2πipγ = K 2πipγ log λ m + A + A 2. λ m + log λm N m γ K K + m + λ γm γ + λ +m m + log m M m+ m log m M m+ dm + m 2πipγ K K n i a 2 i m dm + a i m 2 n i a 2 i m 2 γ dm + a i m 2 n i a 2 i m 2 γ dm + a i m 2 log m M K n i a 2 i mγ m + + a i m dm 2 Here, and A 2 = = A = K log m M d log m 2πipγ m + K = log m d log m M 2πipγ m + K log m = M + 2πipγ m + m M dm 2πip 2πip = K pγ log M, log m M K m + A 3 A 4, log m M ni a i m + n i a 2 i m + a i m 2 dm + a i m + a i m 2 dm 2

22 where and A 3 = = A 4 = 2πip 2πip = 2πip = 2πip M + K = p = log m M n i a i m + + a i m dm n i log m M d log + ai m m + n i log + a i m d log m M m + n i log + a i m m + m M dm n i log a i p 2πip 2πip = M + 2πip = p n i log + a i M, log m M n i a i m + + a i m dm 2 n i + a i m d log m M m + n i + a i mm Mm + dm n i + a i M a i. ombiing the results of A + A 2 = A + A 3 A 4, we have 4 = K pγ log M + p ai n i log + a i M p n i + a i M. a i Next, consider 5. Taking f = log ψ, we have log ψ m + γ λ = + m λx + λ x= m + γ +m = λmm + λm Mm N. 22

23 Then 5 = 2πip = 2πip λ λ = 2πip λ λ log ψ m + n i γ + m K n i a i + a i m n i + m m γm dm + m 2 mm + a i m Mm N + a i m + m m γm dm + m 2 n i B B 2 B 3 + B 4, where and a i m + B = m Mm N + a i m dm a i M + = 2πi, + a i MM N a i γm 2 B 2 = m Mm N + a i mm + dm a i γm 2 = 2πi M N + a i MM + + a i γ, M + N + a i B 3 = m Mm N dm = 2πi M N, B 4 = γm 2 m Mm Nm + dm 2 = 2πi γm 2 γ2mn + M + N M NM + 2 M + 2 N + 2. ollecting the four terms, we have : 5 λ = p λ [ { a i M + n i M N + a i M a i γm 2 + a i MM + + γm 2 M + 2 { ] a i γ γ2mn + M + N +. M + N + a i M + N + To obtain 6, note that the integration term logψxdf γ,δ x is equal to {ψx+ 23

24 logψx df γ,δ x since M-P law satisfies xdf γ,δ x = df γ,δ x. This gives 6 = K gxdf γ,δ x + n i log ψ{ϕa i + O p p n 2 where ϕa i = a i + γa i a i. Finally, combining the four results, we finally obtain the centering term : {ψx logψx df γ,h p x = = λx λ df γ,δ x K p gxdf γ,δ x + p λ, γ + O n 2, where n = λ n i a i λ K K [ γ log M + K [ λ + n i λk M N A.3 Technical lemma n i log ψ{ϕa i K { a i M + + a i M a i γm 2 ai n i log + a i M M + N + { ] n i + a i M a i + a i MM + + γm 2 M + 2 ] a i γ γ2mn + M + N +. a i M + N + Lemma. Let z A and z B be any two different fixed complex numbers. Then, a for any contour enclosing z A and z B such that log z z A z z B is single-valued on the contour, we have z z A log z z A z z B dz = 0, b for any contour enclosing z A and z B, we have dz z z A z z B = 0 24

25 Figure 3: Illustration of the path of the contour integral Proof. a Let be a contour enclosing such that both and are clockwise anticlockwise, and on, we have z z A < z B z A see Figure 3. Then, D, the singly connected region between and as indicated in Figure 3 contains no singularity. Therefore, the integral on and are the same. Next, consider the power series expansion log z z A = log + z A z B z z B z z { A za z B = 2 za z B + 3 za z B.... z z A 2 z z A 3 z z A Such power series converges on. The desired result is a consequence of dz z z A i = 0 for all i = 2, 3,.... b Applying auchy s formula, we have { dz z z A z z B = z A z B z z A z z B dz = z A z B = 0. References Anderson, T. W An Introduction to Multivariate Statistical Analysis. Wiley- Interscience, 3rd edition. 25

26 Bai, Z., Jiang, D., Yao, J.-F., and Zheng, S orrections to LRT on large-dimensional covariance matrix by RMT. The Annals of Statistics, 376B: Bai, Z. and Yao, J On sample eigenvalues in a generalized spiked population model. Journal of Multivariate Analysis, 06: Bai, Z. D. and Silverstein, J. W. J LT for linear spectral statistics of largedimensional sample covariance matrices. The Annals of Probability, 32A: ai, T. T. and Ma, Z Optimal hypothesis testing for high dimensional covariance matrices. Bernoulli, 95B: hen, L. S., Paul, D., Prentice, R. L., and Wang, P. 20. A regularized Hotelling s T-test for pathway analysis in proteomic studies. Journal of the American Statistical Association, 06496: hen, S., Zhang, L., and Zhong, P Tests for high-dimensional covariance matrices. Journal of the American Statistical Association, 05490: Jiang, D., Jiang, T., and Yang, F Likelihood ratio tests for covariance matrices of high-dimensional normal distributions. Journal of Statistical Planning and Inference, 428: Johnstone, I. M On the distribution of the largest eigenvalue in principal components analysis. The Annals of Statistics, 292: Ledoit, O. and Wolf, M Some hypothesis tests for the covariance matrix when the dimension is large compared to the sample size. The Annals of Statistics, 304: Ledoit, O. and Wolf, M A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis, 882: Pan, G omparison between two types of large sample covariance matrices. Annales de l Institut Henri Poincaré, Probabilités et Statistiques, 502: Pan, G. M. and Zhou, W entral limit theorem for signal-to-interference ratio of reduced rank linear receiver. The Annals of Applied Probability, 83:

27 Schäfer, J. and Strimmer, K A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Statistical applications in genetics and molecular biology, 4. Wang,., Yang, J., Miao, B., and ao, L Identity tests for high dimensional data using RMT. Journal of Multivariate Analysis, 8: Wang, Q., Silverstein, J. W., and Yao, J.-f A note on the LT of the LSS for sample covariance matrix from a spiked population model. Journal of Multivariate Analysis, 30: Warton, D. I Penalized normal likelihood and ridge regularization of correlation and covariance matrices. Journal of the American Statistical Association, 0348: Won, J.-H., Lim, J., Kim, S.-J., & Rajaratnam, B ondition number regularized covariance estimation. Journal of the Royal Statistical Society. Series B, Statistical Methodology, 753: Zheng, S., Bai, Z. D, and Yao, J-F Substitution principle for LT of linear spectral statistics of high-dimensional sample covariance matrices with applications to hypothesis testing. The Annals of Statistics, 43:

On corrections of classical multivariate tests for high-dimensional data

On corrections of classical multivariate tests for high-dimensional data Jian-feng Yao with Zhidong Bai, Dandan Jiang, Shurong Zheng Overview Introduction High-dimensional data and new challenge in statistics