On corrections of classical multivariate tests for high-dimensional data
Jian-feng Yao, Université de Rennes 1, IRMAR


Overview

- Introduction: a two sample problem
  - High-dimensional data and new challenges in statistics
  - A two sample problem
- Marčenko-Pastur distributions and one-sample problems
  - Sample vs. population covariance matrices
  - Marčenko-Pastur distributions
  - CLTs for linear spectral statistics of $S_n$
- Random Fisher matrices and two-sample problems
  - Random Fisher matrices
- Testing covariance matrices: one-sample case
  - Simulation study I
- Testing covariance matrices: two-samples case
  - Simulation study II
- Conclusions


High-dimensional data and new challenges in statistics

High-dimensional data versus high-dimensional models

- Nonparametric regression: a very high-dimensional (in fact infinite-dimensional) model, but with one-dimensional data:
  \[ y_i = f(x_i) + \varepsilon_i, \quad f: \mathbb{R} \to \mathbb{R}, \quad i = 1, \dots, n. \]
- High-dimensional data: observation vectors $x_i \in \mathbb{R}^p$, with $p$ relatively large with respect to the sample size $n$.

Some typical data dimensions

data                 dimension p   sample size n   y = p/n
portfolio            50            500             0.10
climate survey       320           600             0.21
speech analysis      a x 10^2      b x 10^2        ~1
ORL face database    1440          320             4.5
micro-arrays         2000          200             10

Important: $y = p/n$ is not always close to 0; it is frequently larger than 1.

A two sample problem: high-dimensional effect by an example

The two-sample problem

- Two independent samples: $x_1, \dots, x_{n_1} \sim (\mu_1, \Sigma)$ and $y_1, \dots, y_{n_2} \sim (\mu_2, \Sigma)$.
- We want to test $H_0: \mu_1 = \mu_2$ against $H_1: \mu_1 \neq \mu_2$.

Classical approach: Hotelling's $T^2$ test

\[ T^2 = \frac{n_1 n_2}{n} (\bar{x} - \bar{y})^\top S_n^{-1} (\bar{x} - \bar{y}), \]

where

\[ \bar{x} = \frac{1}{n_1} \sum_{i=1}^{n_1} x_i, \quad \bar{y} = \frac{1}{n_2} \sum_{j=1}^{n_2} y_j, \quad n = n_1 + n_2, \]

\[ S_n = \frac{1}{n-2} \left[ \sum_{i=1}^{n_1} (x_i - \bar{x})(x_i - \bar{x})^\top + \sum_{j=1}^{n_2} (y_j - \bar{y})(y_j - \bar{y})^\top \right]. \]

$S_n$ is a sample covariance matrix.

The two-sample problem: Hotelling's $T^2$ test

Nice properties:
- invariance under linear transformations;
- optimality in the Gaussian case (LRT, BLUE, ...).

Bad news:
- low power even for moderate data dimensions;
- very little is known for the non-Gaussian case;
- fatal deficiency: when $p > n - 2$, $S_n$ is not invertible.

Dempster's non-exact test (NET) [Dempster, 1958, 1960]

- A reasonable test must be based on $\bar{x} - \bar{y}$ even when $p > n - 2$.
- Choose a new basis in $\mathbb{R}^n$ such that
  1. axis 1 is parallel to the grand mean,
  2. axis 2 is parallel to $\bar{x} - \bar{y}$.
- With an orthogonal matrix $H_n$ and the $n \times p$ data matrix $X = (x_1, \dots, x_{n_1}, y_1, \dots, y_{n_2})^\top$, set

\[ Z = H_n X = \begin{pmatrix} h_1^\top \\ \vdots \\ h_n^\top \end{pmatrix} X = \begin{pmatrix} z_1^\top \\ \vdots \\ z_n^\top \end{pmatrix}, \qquad h_1 = \frac{1}{\sqrt{n}} \mathbf{1}_n, \quad h_2 = \begin{pmatrix} \sqrt{\dfrac{n_2}{n n_1}} \, \mathbf{1}_{n_1} \\ -\sqrt{\dfrac{n_1}{n n_2}} \, \mathbf{1}_{n_2} \end{pmatrix}. \]

- Under normality, the $z_i$'s are $n$ independent $N_p(\cdot, \Sigma)$ vectors with

\[ E z_1 = \frac{1}{\sqrt{n}} (n_1 \mu_1 + n_2 \mu_2), \quad E z_2 = \sqrt{\frac{n_1 n_2}{n}} (\mu_1 - \mu_2), \quad E z_i = 0, \ i = 3, \dots, n. \]

Dempster's non-exact test (NET): test statistic

\[ F_D = \frac{(n-2) \, \|z_2\|^2}{\|z_3\|^2 + \cdots + \|z_n\|^2}. \]

Under $H_0$,

\[ \|z_j\|^2 \sim Q = \sum_{k=1}^{r} \alpha_k \chi^2_1(k), \]

where $\alpha_1 \ge \cdots \ge \alpha_r > 0$ are the non-null eigenvalues of $\Sigma$. The distribution of $F_D$ is complicated.

Approximations (hence the name "non-exact" test):
1. $Q \approx m \chi^2_r$;
2. next, estimate $r$ by $\hat{r}$.

Finally, under $H_0$, $F_D \approx F(\hat{r}, (n-2)\hat{r})$.

Dempster's non-exact test (NET): problems

- It is difficult to construct the orthogonal transformation $H_n = \{h_j\}$ for large $n$.
- Even under Gaussianity, the exact power function depends on $H_n$.

Bai and Saranadasa's test (ANT) [Bai & Saranadasa, 1996]

Consider directly the statistic

\[ M_n = \|\bar{x} - \bar{y}\|^2 - \frac{n}{n_1 n_2} \operatorname{tr} S_n. \]

Under very mild conditions (this is where RMT comes in),

\[ \frac{M_n}{\sigma_n} \Rightarrow N(0, 1), \qquad \sigma_n^2 = \operatorname{Var}(M_n) = \frac{2 n^2 (n-1)}{n_1^2 n_2^2 (n-2)} \operatorname{tr} \Sigma^2. \]

A ratio-consistent estimator of $\sigma_n^2$ is

\[ \hat{\sigma}_n^2 = \frac{2 n (n-1)(n-2)}{n_1^2 n_2^2 (n-3)} \left[ \operatorname{tr} S_n^2 - \frac{1}{n-2} (\operatorname{tr} S_n)^2 \right], \qquad \hat{\sigma}_n^2 / \sigma_n^2 \xrightarrow{P} 1. \]

Finally, under $H_0$,

\[ Z_n = \frac{M_n}{\hat{\sigma}_n} \Rightarrow N(0, 1). \]

This is Bai and Saranadasa's asymptotic normal test (ANT).
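The ANT statistic only needs the two sample means and the traces of $S_n$ and $S_n^2$, so it can be sketched in a few lines of plain Python. The function name `bai_saranadasa_z` and the list-of-lists data layout are illustrative choices, not part of the talk. The sketch assumes the pooled covariance $S_n$ with divisor $n - 2$, and a variance estimator with $n_1^2 n_2^2$ in the denominator, which is the scaling that matches $\operatorname{Var}(M_n)$ for $M_n = \|\bar{x} - \bar{y}\|^2 - \frac{n}{n_1 n_2} \operatorname{tr} S_n$.

```python
import math


def bai_saranadasa_z(xs, ys):
    """Standardized ANT statistic Z_n = M_n / sigma_hat_n (sketch).

    xs, ys: lists of observations, each observation a list of p floats.
    Assumes n = n1 + n2 > 3 and a pooled covariance with divisor n - 2.
    """
    n1, n2 = len(xs), len(ys)
    n = n1 + n2
    p = len(xs[0])

    # Sample means of the two groups.
    xbar = [sum(row[k] for row in xs) / n1 for k in range(p)]
    ybar = [sum(row[k] for row in ys) / n2 for k in range(p)]

    # Pooled sample covariance matrix S_n (p x p).
    s = [[0.0] * p for _ in range(p)]
    for sample, mean in ((xs, xbar), (ys, ybar)):
        for row in sample:
            d = [row[k] - mean[k] for k in range(p)]
            for a in range(p):
                for b in range(p):
                    s[a][b] += d[a] * d[b] / (n - 2)

    # tr S_n, and tr S_n^2 = sum of squared entries since S_n is symmetric.
    tr_s = sum(s[a][a] for a in range(p))
    tr_s2 = sum(s[a][b] ** 2 for a in range(p) for b in range(p))

    # M_n = ||xbar - ybar||^2 - n / (n1 n2) * tr S_n
    m_n = sum((xbar[k] - ybar[k]) ** 2 for k in range(p)) - n / (n1 * n2) * tr_s

    # Estimator of Var(M_n) (assumed scaling, see the lead-in).
    sigma2_hat = (2.0 * n * (n - 1) * (n - 2) / (n1 ** 2 * n2 ** 2 * (n - 3))
                  * (tr_s2 - tr_s ** 2 / (n - 2)))
    return m_n / math.sqrt(sigma2_hat)


# Toy data with equal means: p = 2, n1 = n2 = 3.
x_sample = [[1.0, 0.0], [-1.0, 0.0], [0.0, 0.0]]
y_sample = [[0.0, 1.0], [0.0, -1.0], [0.0, 0.0]]
print(bai_saranadasa_z(x_sample, y_sample))
```

Note that no matrix inversion is involved, so the computation remains well defined when $p > n - 2$, which is precisely the regime where Hotelling's $T^2$ breaks down.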

Comparison between $T^2$, NET and ANT: power functions [Bai & Saranadasa, 1996]

Assume $p, n \to \infty$, $p/n \to y \in (0, 1)$ and $n_1/n \to \kappa$. The power functions of Hotelling's $T^2$, Dempster's NET and Bai-Saranadasa's ANT satisfy

\[ \beta_H(\mu) = \Phi\left( -\xi_\alpha + \sqrt{\frac{n(1-y)}{2y}} \, \kappa(1-\kappa) \, \|\Sigma^{-1/2} \mu\|^2 \right) + o(1), \]

\[ \beta_D(\mu) = \Phi\left( -\xi_\alpha + \frac{n}{\sqrt{2 \operatorname{tr} \Sigma^2}} \, \kappa(1-\kappa) \, \|\mu\|^2 \right) + o(1) = \beta_{BS}(\mu), \]

where $\alpha$ is the test size, $\mu = \mu_1 - \mu_2$ and $\xi_\alpha = \Phi^{-1}(1 - \alpha)$.

Important: because of the factor $(1 - y)$, $T^2$ loses power when $y$ increases, i.e. when $p$ increases relative to $n$.

Comparison between $T^2$, NET and ANT: simulation results 1 (Gaussian case)

- Choice of covariance: $\Sigma = (1 - \rho) I_p + \rho J_p$, with $J_p = \mathbf{1}_p \mathbf{1}_p^\top$.
- Noncentrality parameter: $\eta = \dfrac{\|\mu_1 - \mu_2\|^2}{\sqrt{\operatorname{tr} \Sigma^2}}$.
- Sample sizes: $(n_1, n_2) = (25, 20)$, $n = 45$.

A summary of the introduction

- High-dimensional effects need to be taken into account.
- Surprisingly, asymptotic methods based on RMT perform well even for small $p$ (as low as $p = 4$).
- Many classical multivariate analysis methods have to be re-examined with respect to high-dimensional effects.

Sample vs. population covariance matrices

An effect of high dimensions: $S_n \neq \Sigma$

- $x_i$, $1 \le i \le n$, i.i.d. $N_p(0, \Sigma)$;
- sample covariance matrix: $S_n = \dfrac{1}{n} \sum_i x_i x_i^\top$.

[Figure: histogram of the eigenvalues of $S_n$ for $p = 40$, $n = 160$, $y = p/n = 0.25$.]

Sample vs. population covariance matrices (continued)

[Figure: histogram of the eigenvalues of $S_n$ for $p = 40$, $n = 320$, $y = p/n = 0.125$.]

The Marčenko-Pastur distribution

Theorem [Marčenko & Pastur, 1967]. Assume that
- $X$ is a $p \times n$ array of i.i.d. variables with mean 0 and variance 1 (so $\Sigma = I_p$), not necessarily Gaussian, but with finite fourth moment;
- $p, n \to \infty$ with $p/n \to y \in (0, 1]$.

Then the empirical distribution of the eigenvalues of $S_n = \frac{1}{n} X X^\top$ converges to the distribution with density function

\[ f(x) = \frac{1}{2 \pi y x} \sqrt{(x - a)(b - x)}, \quad a \le x \le b, \]

where $a = (1 - \sqrt{y})^2$ and $b = (1 + \sqrt{y})^2$.

The Marčenko-Pastur distribution: densities

\[ f(x) = \frac{1}{2 \pi y x} \sqrt{(x - a)(b - x)}, \quad (1 - \sqrt{y})^2 = a \le x \le b = (1 + \sqrt{y})^2. \]

y = p/n   a      b
1/8       0.42   1.83
1/4       0.25   2.25
1/2       0.09   2.91

[Figure: densities of the Marčenko-Pastur law for $y = 1/8$, $1/4$ and $1/2$.]
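The support edges and the density are simple enough to check numerically. The following is a minimal sketch in plain Python; the function names are illustrative, and the midpoint rule used to check that the density integrates to one is a convenience choice, not something taken from the talk.

```python
import math


def mp_edges(y):
    """Support edges a = (1 - sqrt(y))^2 and b = (1 + sqrt(y))^2, for 0 < y <= 1."""
    root = math.sqrt(y)
    return (1.0 - root) ** 2, (1.0 + root) ** 2


def mp_density(x, y):
    """Marcenko-Pastur density of index y at the point x (zero outside [a, b])."""
    a, b = mp_edges(y)
    if x <= a or x >= b:
        return 0.0
    return math.sqrt((x - a) * (b - x)) / (2.0 * math.pi * y * x)


def mp_total_mass(y, steps=20000):
    """Midpoint-rule approximation of the integral of the density over [a, b]."""
    a, b = mp_edges(y)
    width = (b - a) / steps
    return sum(mp_density(a + (i + 0.5) * width, y) for i in range(steps)) * width


for y in (1 / 8, 1 / 4, 1 / 2):
    a, b = mp_edges(y)
    print(f"y = {y:.3f}: a = {a:.2f}, b = {b:.2f}, total mass = {mp_total_mass(y):.4f}")
```

As $y$ approaches 1 the left edge $a$ approaches 0, so the smallest eigenvalues of $S_n$ come arbitrarily close to zero.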

An explanation of the power deficiency of Hotelling's $T^2$

- When $p$ increases, even in the Gaussian case, $S_n$ is different from its population counterpart $\Sigma$.
- When $y = p/n \to 1$, the left edge $a \to 0$: small eigenvalues of $S_n$ make the $T^2$ statistic unstable, since it involves $S_n^{-1}$:

\[ T^2 = \frac{n_1 n_2}{n} (\bar{x} - \bar{y})^\top S_n^{-1} (\bar{x} - \bar{y}). \]

CLT for linear spectral statistics of $S_n$

- Set $y_n = p/n$ and let $U \subset \mathbb{C}$ be an open set containing the interval $[a, b]$.
- For any function $g$ analytic on $U$, define

\[ G_n(g) = p \, [F_n(g) - \mu_{y_n}(g)], \tag{1} \]

where $F_n(g) = \int g \, dF_n$ with $F_n$ the empirical distribution of the eigenvalues of $S_n$, and $\mu_\alpha$ is the MP distribution of index $\alpha \in (0, 1)$.

A CLT for linear spectral statistics

Theorem [Bai and Silverstein, 2004]. Assume that
- $g_1, \dots, g_k$ are $k$ analytic functions on $U$;
- the matrix entries $x_{ij}$ are i.i.d. real-valued random variables such that $E x_{ij} = 0$, $E x_{ij}^2 = 1$, $E x_{ij}^4 = 3$;
- as $n, p \to \infty$, $y_n = p/n \to y \in (0, 1)$.

Then

\[ (G_n(g_1), \dots, G_n(g_k)) \Rightarrow N_k(m, V), \]

with a given mean vector $m = m(g_1, \dots, g_k)$ and asymptotic covariance matrix $V = V(g_1, \dots, g_k)$.

Random Fisher matrices

- Two independent samples $x_1, \dots, x_{n_1} \sim (0, I_p)$ and $y_1, \dots, y_{n_2} \sim (0, I_p)$, with i.i.d. coordinates of mean 0 and variance 1.
- Associated sample covariance matrices:

\[ S_1 = \frac{1}{n_1} \sum_{i=1}^{n_1} x_i x_i^\top, \qquad S_2 = \frac{1}{n_2} \sum_{j=1}^{n_2} y_j y_j^\top. \]

- Fisher matrix: $V_n = S_1 S_2^{-1}$, where $n_2 > p$.

Random Fisher matrices: limiting spectral distribution

Assume

\[ y_{n_1} = \frac{p}{n_1} \to y_1 \in (0, 1), \qquad y_{n_2} = \frac{p}{n_2} \to y_2 \in (0, 1). \]

Under mild moment conditions, the empirical spectral distribution (ESD) $F_n^{V_n}$ of $V_n$ has a limiting spectral distribution (LSD) $F_{y_1, y_2}$ with density

\[ \ell(x) = \begin{cases} \dfrac{(1 - y_2) \sqrt{(b - x)(x - a)}}{2 \pi x (y_1 + y_2 x)}, & a \le x \le b, \\ 0, & \text{otherwise,} \end{cases} \]

where

\[ a = (1 - y_2)^{-2} \left( 1 - \sqrt{y_1 + y_2 - y_1 y_2} \right)^2, \qquad b = (1 - y_2)^{-2} \left( 1 + \sqrt{y_1 + y_2 - y_1 y_2} \right)^2. \]
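As a quick sanity check on the limiting density of a Fisher matrix, the sketch below evaluates the support edges and verifies numerically that the density integrates to one. It assumes the form $\ell(x) = (1 - y_2)\sqrt{(b - x)(x - a)} / (2 \pi x (y_1 + y_2 x))$ on $[a, b]$ with $a, b = (1 \mp \sqrt{y_1 + y_2 - y_1 y_2})^2 / (1 - y_2)^2$; the function names and the midpoint rule are illustrative choices.

```python
import math


def fisher_edges(y1, y2):
    """Support edges of the Fisher LSD, for 0 < y1 < 1 and 0 < y2 < 1."""
    h = math.sqrt(y1 + y2 - y1 * y2)
    scale = (1.0 - y2) ** 2
    return (1.0 - h) ** 2 / scale, (1.0 + h) ** 2 / scale


def fisher_density(x, y1, y2):
    """Density of the Fisher LSD at the point x (zero outside [a, b])."""
    a, b = fisher_edges(y1, y2)
    if x <= a or x >= b:
        return 0.0
    numerator = (1.0 - y2) * math.sqrt((b - x) * (x - a))
    return numerator / (2.0 * math.pi * x * (y1 + y2 * x))


def fisher_total_mass(y1, y2, steps=20000):
    """Midpoint-rule approximation of the integral of the density over [a, b]."""
    a, b = fisher_edges(y1, y2)
    width = (b - a) / steps
    total = sum(fisher_density(a + (i + 0.5) * width, y1, y2) for i in range(steps))
    return total * width


a, b = fisher_edges(0.05, 0.1)
print(f"support = [{a:.4f}, {b:.4f}], total mass = {fisher_total_mass(0.05, 0.1):.4f}")
```

When $y_2$ is close to 0 the edges reduce to $(1 \mp \sqrt{y_1})^2$, i.e. to the Marčenko-Pastur support, as expected when $S_2$ is close to the identity.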

CLT for LSS of random Fisher matrices

- Let $\tilde{U} \subset \mathbb{C}$ be an open set including the interval

\[ \left[ I_{(0,1)}(y_1) \frac{(1 - \sqrt{y_1})^2}{(1 + \sqrt{y_2})^2}, \ \frac{(1 + \sqrt{y_1})^2}{(1 - \sqrt{y_2})^2} \right]. \]

- For an analytic function $f$ on $\tilde{U}$, define

\[ \tilde{G}_n(f) = p \int_{-\infty}^{+\infty} f(x) \left[ F_n^{V_n} - F_{y_{n_1}, y_{n_2}} \right](dx), \]

where $F_{y_{n_1}, y_{n_2}}$ is the LSD with indexes $y_{n_k}$, $k = 1, 2$.

CLT for LSS of random Fisher matrices

Theorem [Zheng, 2008]. Assume $E x_{11}^4 = E y_{11}^4 < \infty$ and let $\beta = E |x_{11}|^4 - 3$. Then, for any analytic functions $f_1, \dots, f_k$ defined on $\tilde{U}$,

\[ \left( \tilde{G}_n(f_1), \dots, \tilde{G}_n(f_k) \right) \Rightarrow N_k(m, \upsilon). \]

CLT for LSS of random Fisher matrices: limiting mean function $m$ [Zheng, 2008]

\[ m(f_j) = \lim_{r \downarrow 1} \, [(2) + (3) + (4)], \]

with

\[ \frac{1}{4 \pi i} \oint_{|\zeta| = 1} f_j(z(\zeta)) \left[ \frac{1}{\zeta - \frac{1}{r}} + \frac{1}{\zeta + \frac{1}{r}} - \frac{2}{\zeta + \frac{y_2}{hr}} \right] d\zeta \tag{2} \]

\[ + \frac{\beta \, y_1 (1 - y_2)^2}{2 \pi i \, h^2} \oint_{|\zeta| = 1} f_j(z(\zeta)) \frac{1}{\left( \zeta + \frac{y_2}{hr} \right)^3} \, d\zeta \tag{3} \]

\[ + \frac{\beta \, y_2 (1 - y_2)}{2 \pi i \, h} \oint_{|\zeta| = 1} f_j(z(\zeta)) \frac{\zeta + \frac{1}{hr}}{\left( \zeta + \frac{y_2}{hr} \right)^3} \, d\zeta, \tag{4} \]

where

\[ z(\zeta) = (1 - y_2)^{-2} \left[ 1 + h^2 + 2 h \, \Re(\zeta) \right], \qquad h = \sqrt{y_1 + y_2 - y_1 y_2}. \]

CLT for LSS of random Fisher matrices: limiting covariance function $\upsilon$ [Zheng, 2008]

\[ \upsilon(f_j, f_l) = \lim_{1 < r_1 < r_2 \downarrow 1} \, [(5) + (6)], \]

with

\[ -\frac{1}{2 \pi^2} \oint_{|\zeta_2| = 1} \oint_{|\zeta_1| = 1} \frac{f_j(z(r_1 \zeta_1)) \, f_l(z(r_2 \zeta_2)) \, r_1 r_2}{(r_2 \zeta_2 - r_1 \zeta_1)^2} \, d\zeta_1 \, d\zeta_2, \tag{5} \]

\[ -\frac{\beta \, (y_1 + y_2)(1 - y_2)^2}{4 \pi^2 h^2} \oint_{|\zeta_1| = 1} \frac{f_j(z(\zeta_1))}{\left( \zeta_1 + \frac{y_2}{h r_1} \right)^2} \, d\zeta_1 \oint_{|\zeta_2| = 1} \frac{f_l(z(\zeta_2))}{\left( \zeta_2 + \frac{y_2}{h r_2} \right)^2} \, d\zeta_2, \tag{6} \]

for $j, l \in \{1, \dots, k\}$.

One-sample test on covariance matrices

- A sample $(x_1, \dots, x_n) \sim N_p(\mu, \Sigma)$; we want to test $H_0: \Sigma = I_p$.
- LR statistic:

\[ T_n = n \left[ \operatorname{tr} S_n - \log |S_n| - p \right], \]

with the sample covariance matrix

\[ S_n = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^\top. \]

Classical LRT: the data dimension $p$ is fixed and, when $n \to \infty$,

\[ T_n \Rightarrow \chi^2_{p(p+1)/2}. \]

This approximation becomes rapidly deficient when $p$ is not small.

RMT-corrected LRT (one-sample case)

Theorem [Bai, Jiang, Yao and Zheng, 2009]. Assume $p/n \to y \in (0, 1)$ and let $g(x) = x - \log x - 1$. Then, under $H_0$ and when $n \to \infty$,

\[ \frac{T_n}{n} - p \cdot F^{y_n}(g) \Rightarrow N(m(g), \upsilon(g)), \]

where $F^{y_n}$ is the Marčenko-Pastur law of index $y_n$ and

\[ m(g) = -\frac{\log(1 - y)}{2}, \qquad \upsilon(g) = -2 \log(1 - y) - 2y. \]
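To make the correction concrete, here is a minimal sketch that turns a value of the LR statistic $T_n$ into a standardized CLRT statistic. Two ingredients are not spelled out on the slide and are assumptions of the sketch: the closed form $F^{y}(g) = 1 - \frac{y - 1}{y} \log(1 - y)$ for the integral of $g(x) = x - \log x - 1$ against the Marčenko-Pastur law, and a one-sided rejection rule at level $\alpha$ with $y$ replaced by $y_n = p/n$. The function names are illustrative.

```python
import math
from statistics import NormalDist


def mp_integral_of_g(y):
    """Integral of g(x) = x - log(x) - 1 against the MP law of index y, 0 < y < 1."""
    return 1.0 - (y - 1.0) / y * math.log(1.0 - y)


def clrt_one_sample(t_n, n, p, alpha=0.05):
    """Standardized CLRT statistic and one-sided decision (sketch).

    t_n is the classical LR statistic n * [tr S_n - log|S_n| - p];
    the limiting mean and variance are evaluated at y_n = p / n.
    """
    y_n = p / n
    mean = -0.5 * math.log(1.0 - y_n)
    var = -2.0 * math.log(1.0 - y_n) - 2.0 * y_n
    z = (t_n / n - p * mp_integral_of_g(y_n) - mean) / math.sqrt(var)
    reject = z > NormalDist().inv_cdf(1.0 - alpha)
    return z, reject


# Hypothetical value of T_n, for illustration only.
z_value, rejected = clrt_one_sample(t_n=6600.0, n=500, p=100)
print(f"z = {z_value:.3f}, reject H0: {rejected}")
```

Only $T_n$, $n$ and $p$ are needed, so the corrected test costs no more than the classical one: the same statistic is compared with a Gaussian limit instead of a $\chi^2_{p(p+1)/2}$ limit.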

Simulation study I: comparison of LRT and CLRT by simulation

- Nominal test level $\alpha = 0.05$.
- For each $(p, n)$, 10,000 independent replications with real Gaussian variables.
- Powers are estimated under the alternative $H_1: \Sigma = \operatorname{diag}(1, 0.05, 0.05, 0.05, \dots, 0.05)$.

              CLRT                                      LRT
(p, n)        Size     Difference with 5%   Power      Size     Power
(5, 500)      0.0803   0.0303               0.6013     0.0521   0.5233
(10, 500)     0.0690   0.0190               0.9517     0.0555   0.9417
(50, 500)     0.0594   0.0094               1          0.2252   1
(100, 500)    0.0537   0.0037               1          0.9757   1
(300, 500)    0.0515   0.0015               1          1        1

Simulation study I: on a plot

[Figure: type I error of CLRT and LRT versus dimension $p$ (0 to 300), for $n = 500$.]

Two-samples test on covariance matrices

- Two samples $(x_1, \dots, x_{n_1}) \sim N_p(\mu_1, \Sigma_1)$ and $(y_1, \dots, y_{n_2}) \sim N_p(\mu_2, \Sigma_2)$; we want to test $H_0: \Sigma_1 = \Sigma_2$.
- The associated sample covariance matrices are

\[ S_1 = \frac{1}{n_1} \sum_{i=1}^{n_1} (x_i - \bar{x})(x_i - \bar{x})^\top, \qquad S_2 = \frac{1}{n_2} \sum_{i=1}^{n_2} (y_i - \bar{y})(y_i - \bar{y})^\top. \]

- The LR statistic is

\[ L_1 = \frac{\left| S_1 S_2^{-1} \right|^{n_1/2}}{\left| c_1 S_1 S_2^{-1} + c_2 I_p \right|^{n/2}}, \]

where $n = n_1 + n_2$ and $c_k = n_k / n$, $k = 1, 2$.

Two-samples test on covariance matrices: classical LRT

The data dimension $p$ is fixed and, when $n_1, n_2 \to \infty$ and under $H_0$,

\[ T_n = -2 \log L_1 \Rightarrow \chi^2_{p(p+1)/2}. \]

This approximation becomes rapidly deficient when $p$ is not small.

RMT-corrected LRT (two-samples case)

Theorem [Bai, Jiang, Yao and Zheng, 2009]. Assume that the conditions of the CLT for LSS of Fisher matrices hold, and let

\[ f(x) = \log(y_1 + y_2 x) - \frac{y_2}{y_1 + y_2} \log x - \log(y_1 + y_2). \]

Then, under $H_0$ and as $n_1 \wedge n_2 \to \infty$,

\[ -\frac{2 \log L_1}{n} - p \cdot F_{y_{n_1}, y_{n_2}}(f) \Rightarrow N(m(f), \upsilon(f)), \]

with

\[ m(f) = \frac{1}{2} \left[ \log\left( \frac{y_1 + y_2 - y_1 y_2}{y_1 + y_2} \right) - \frac{y_1}{y_1 + y_2} \log(1 - y_2) - \frac{y_2}{y_1 + y_2} \log(1 - y_1) \right], \]

\[ \upsilon(f) = -\frac{2 y_2^2}{(y_1 + y_2)^2} \log(1 - y_1) - \frac{2 y_1^2}{(y_1 + y_2)^2} \log(1 - y_2) - 2 \log \frac{y_1 + y_2}{y_1 + y_2 - y_1 y_2}. \]
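The limiting mean and variance of the two-sample correction are explicit functions of $(y_1, y_2)$, so they are easy to evaluate. The sketch below implements $m(f)$ and $\upsilon(f)$ as reconstructed from the theorem (every term enters with the sign shown there, the logarithms of $1 - y_k$ being negative) and standardizes $T_n = -2 \log L_1$. The centering term $F_{y_{n_1}, y_{n_2}}(f)$ is not given on the slide, so it is passed in through the hypothetical parameter `centering`; the function names are illustrative as well.

```python
import math


def clrt_two_sample_moments(y1, y2):
    """Limiting mean m(f) and variance v(f) for 0 < y1 < 1 and 0 < y2 < 1."""
    s = y1 + y2
    h2 = s - y1 * y2  # this is h^2 = y1 + y2 - y1 * y2
    mean = 0.5 * (math.log(h2 / s)
                  - y1 / s * math.log(1.0 - y2)
                  - y2 / s * math.log(1.0 - y1))
    var = (-2.0 * y2 ** 2 / s ** 2 * math.log(1.0 - y1)
           - 2.0 * y1 ** 2 / s ** 2 * math.log(1.0 - y2)
           - 2.0 * math.log(s / h2))
    return mean, var


def clrt_two_sample_z(t_n, n1, n2, p, centering):
    """Standardized statistic (T_n / n - p * centering - m(f)) / sqrt(v(f)).

    t_n is -2 log L_1; `centering` is the value of the integral of f against
    the Fisher LSD with indexes p / n1 and p / n2, supplied by the caller.
    """
    n = n1 + n2
    mean, var = clrt_two_sample_moments(p / n1, p / n2)
    return (t_n / n - p * centering - mean) / math.sqrt(var)


m_f, v_f = clrt_two_sample_moments(0.05, 0.05)
print(f"m(f) = {m_f:.6f}, v(f) = {v_f:.6f}")
```

For small and equal indexes $y_1 = y_2 = y$, the variance behaves like $y^2/4$, so the fluctuation of the centered statistic is small and even a modest bias in the classical $\chi^2$ calibration is enough to distort the size.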

Simulation study II: comparison of LRT and CLRT by simulation

- Nominal test level $\alpha = 0.05$.
- For each $(p, n_1, n_2)$, 10,000 independent replications with real Gaussian variables.
- Powers are estimated under the alternative $H_1: \Sigma_1 \Sigma_2^{-1} = \operatorname{diag}(3, 1, 1, \dots)$.

Simulation study II: comparison of LRT and CLRT by simulation with $(y_1, y_2) = (0.05, 0.05)$

                      CLRT                                      LRT
(p, n1, n2)           Size     Difference with 5%   Power      Size     Power
(5, 100, 100)         0.0770    0.0270              1          0.0582   1
(10, 200, 200)        0.0680    0.0180              1          0.0684   1
(20, 400, 400)        0.0593    0.0093              1          0.0872   1
(40, 800, 800)        0.0526    0.0026              1          0.1339   1
(80, 1600, 1600)      0.0501    0.0001              1          0.2687   1
(160, 3200, 3200)     0.0491   -0.0009              1          0.6488   1
(320, 6400, 6400)     0.0447   -0.0053              0.9671     1        1

Simulation study II: comparison of LRT and CLRT by simulation with $(y_1, y_2) = (0.05, 0.1)$

                      CLRT                                      LRT
(p, n1, n2)           Size     Difference with 5%   Power      Size     Power
(5, 100, 50)          0.0781   0.0281               0.9925     0.0640   0.9849
(10, 200, 100)        0.0617   0.0117               0.9847     0.0752   0.9904
(20, 400, 200)        0.0573   0.0073               0.9775     0.1104   0.9938
(40, 800, 400)        0.0561   0.0061               0.9765     0.2115   0.9975
(80, 1600, 800)       0.0521   0.0021               0.9702     0.4954   0.9998
(160, 3200, 1600)     0.0520   0.0020               0.9702     0.9433   1
(320, 6400, 3200)     0.0510   0.0010               1          0.9939   1

Simulation study II: comparisons of LRT and CLRT

[Figure: two panels, $(y_1, y_2) = (0.05, 0.05)$ and $(y_1, y_2) = (0.05, 0.1)$, showing the type I error of CLRT and LRT versus dimension $p$ (0 to 300).]

Some conclusions

- High-dimensional effects should be taken into account.
- RMT for sample covariance matrices is a powerful tool to correct classical multivariate procedures.
- Each time some $\Sigma$ is to be estimated, be careful with the natural estimator $\hat{\Sigma} = S_n$.

Yet RMT is not sufficiently developed for statistics:
1. dependent observations (time series);
2. not identically distributed variables.

Some references

- Bai, Z. D. and Saranadasa, H. (1996). Effect of high dimension: comparison of significance tests for a high dimensional two sample problem. Statistica Sinica 6, 311-329.
- Bai, Z. D. and Silverstein, J. W. (2004). CLT for linear spectral statistics of large-dimensional sample covariance matrices. Ann. Probab. 32, 553-605.
- Bai, Z. D., Jiang, D., Yao, J. and Zheng, S. (2009). Corrections to LRT on large dimensional covariance matrix by RMT. Annals of Statistics 37, 3822-3840.
- Bai, Z. D., Jiang, D., Yao, J. and Zheng, S. (2009). Large regression analysis. Submitted.
- Dempster, A. P. (1958). A high dimensional two sample significance test. Ann. Math. Statist. 29, 995-1010.
- Zheng, S. (2008). Central limit theorem for linear spectral statistics of large dimensional F matrix. Preprint, Northeast Normal University.