On corrections of classical multivariate tests for high-dimensional data


1 On corrections of classical multivariate tests for high-dimensional data
Jian-feng Yao, with Zhidong Bai, Dandan Jiang and Shurong Zheng

2 Overview
Introduction: high-dimensional data and new challenges in statistics; a two-sample problem
Sample covariance matrices: sample vs. population covariance matrices; Marčenko-Pastur distributions; Bai and Silverstein's CLT for linear spectral statistics
Random Fisher matrices
Testing covariance matrices I; simulation study I
Testing covariance matrices II; simulation study II
Multivariate regressions
Conclusions


4 High-dimensional data and new challenges in statistics
High-dimensional data vs. high-dimensional models. Nonparametric regression is a very high-dimensional model (indeed an infinite-dimensional one), but with one-dimensional data:
y_i = f(x_i) + ε_i, f : R → R, i = 1, ..., n.
High-dimensional data: observation vectors y_i ∈ R^p, with p relatively high with respect to the sample size n.

5 High-dimensional data and new challenges in statistics
[Table: typical values of the data dimension p, the sample size n and the data ratio n/p for portfolio data, climate surveys, speech analysis, the ORL face database and micro-arrays.]
Important: the data ratio n/p is not always large; it can even be < 1.
Note the use of the inverse data ratio y = p/n.

6 A two sample problem: high-dimensional effects by an example
The two-sample problem: two independent samples x_1, ..., x_{n_1} ~ (μ_1, Σ) and y_1, ..., y_{n_2} ~ (μ_2, Σ); we want to test H_0: μ_1 = μ_2 against H_1: μ_1 ≠ μ_2.
Classical approach: Hotelling's T² test,
T² = (n_1 n_2 / n) (x̄ − ȳ)' S_n⁻¹ (x̄ − ȳ),
where
x̄ = (1/n_1) Σ_{i=1}^{n_1} x_i,  ȳ = (1/n_2) Σ_{j=1}^{n_2} y_j,  n = n_1 + n_2,
S_n = (1/(n − 2)) [ Σ_{i=1}^{n_1} (x_i − x̄)(x_i − x̄)' + Σ_{j=1}^{n_2} (y_j − ȳ)(y_j − ȳ)' ].
S_n is a (pooled) sample covariance matrix.
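In code, the statistic above reads as follows (a minimal numpy sketch; the function name is ours):

```python
import numpy as np

def hotelling_t2(x, y):
    """Two-sample Hotelling T^2 statistic.

    x: (n1, p) array, y: (n2, p) array.
    Returns T^2 = (n1*n2/n) (xbar - ybar)' S_n^{-1} (xbar - ybar),
    with S_n the pooled sample covariance on n - 2 degrees of freedom.
    """
    n1, p = x.shape
    n2, _ = y.shape
    n = n1 + n2
    xbar, ybar = x.mean(axis=0), y.mean(axis=0)
    # Pooled covariance: sums of centered outer products divided by n - 2.
    sn = ((x - xbar).T @ (x - xbar) + (y - ybar).T @ (y - ybar)) / (n - 2)
    d = xbar - ybar
    return (n1 * n2 / n) * d @ np.linalg.solve(sn, d)
```

Under H_0 with Gaussian data and p fixed, ((n − p − 1)/((n − 2)p)) T² follows an F(p, n − p − 1) distribution, which is how the classical test is calibrated.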

7 A two sample problem: the two-sample problem
Hotelling's T² test, nice properties:
- invariance under linear transformations;
- finite-sample optimality in the Gaussian case;
- asymptotic optimality otherwise.
Hotelling's T² test, bad news:
- low power even for moderate data dimensions;
- high numerical instability in computing S_n⁻¹, even for p = 40;
- very little is known in the non-Gaussian case;
- a fatal deficiency: when p > n − 2, S_n is not invertible.
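The fatal deficiency is easy to exhibit numerically: the pooled matrix S_n is built from n − 2 linearly independent centered observations, so its rank is at most n − 2 (a sketch; the dimensions chosen are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
p, n1, n2 = 60, 25, 20                 # p = 60 > n - 2 = 43
x = rng.standard_normal((n1, p))
y = rng.standard_normal((n2, p))
xbar, ybar = x.mean(axis=0), y.mean(axis=0)
sn = ((x - xbar).T @ (x - xbar) + (y - ybar).T @ (y - ybar)) / (n1 + n2 - 2)
# S_n is p x p but has rank <= (n1 - 1) + (n2 - 1) = n - 2 < p,
# so it is singular and T^2 cannot be formed.
rank = np.linalg.matrix_rank(sn)
print(rank, "<", p)
```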

8 A two sample problem: Dempster's non-exact test (NET), Dempster A.P., 58, 60
A reasonable test must be based on x̄ − ȳ, even when p > n − 2. Choose a new orthonormal basis of R^n and project the data so that:
1. axis 1 carries the grand mean (n_1 μ_1 + n_2 μ_2)/n;
2. axis 2 carries (x̄ − ȳ).
Let the n × p data matrix be X = (x_1, ..., x_{n_1}, y_1, ..., y_{n_2})', and let H_n = (h_1, ..., h_n)' be the orthonormal change of basis, with
h_1 = (1/√n) 1_n,  h_2 = ( √(n_2/(n n_1)) 1_{n_1}', −√(n_1/(n n_2)) 1_{n_2}' )'.
Set Z = H_n X = (z_1, ..., z_n)'. Under normality, the z_i are n independent N_p(·, Σ) vectors with
E z_1 = (1/√n)(n_1 μ_1 + n_2 μ_2),  E z_2 = √(n_1 n_2 / n) (μ_1 − μ_2),  E z_i = 0, i = 3, ..., n.

9 A two sample problem: Dempster's non-exact test (NET)
Test statistic:
F_D = (n − 2) ‖z_2‖² / ( ‖z_3‖² + ··· + ‖z_n‖² ).
Under H_0, each ‖z_j‖² (j ≥ 2) is distributed as
Q = Σ_{k=1}^{r} α_k χ²_{1,k},
where α_1 ≥ ··· ≥ α_r > 0 are the non-null eigenvalues of Σ and the χ²_{1,k} are independent chi-squares with one degree of freedom.
The distribution of F_D is complicated. Approximations (hence the name non-exact test), treating Σ as if it were proportional to I_p:
1. approximate Q by m χ²_r;
2. then estimate r by some r̂.
Finally, under H_0, F_D ≈ F(r̂, (n − 2) r̂).

10 A two sample problem: Dempster's non-exact test (NET)
Problems with the NET test:
- it is difficult to construct the orthogonal transformation H_n = {h_j} for large n;
- even under Gaussianity, the exact power function depends on H_n.

11 A two sample problem: Bai and Saranadasa's test (ANT), Bai & Saranadasa, 96
Consider directly the statistic
M_n = ‖x̄ − ȳ‖² − (n/(n_1 n_2)) tr S_n.
Under very mild conditions (here RMT comes in!),
M_n / σ_n ⇒ N(0, 1),  σ_n² = Var(M_n) = ( 2 n² (n − 1) / (n_1² n_2² (n − 2)) ) tr Σ².
A ratio-consistent estimator:
σ̂_n² = ( 2 n (n − 1)(n − 2) / (n_1² n_2² (n − 3)) ) [ tr S_n² − (tr S_n)²/(n − 2) ],  σ̂_n²/σ_n² → 1 in probability.
Finally, under H_0,
Z_n = M_n / σ̂_n ⇒ N(0, 1).
This is Bai and Saranadasa's asymptotic normal test (ANT).
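A sketch of the ANT statistic Z_n (the function name is ours; the variance estimator follows the Gaussian-case formulas as reconstructed above):

```python
import numpy as np

def ant_statistic(x, y):
    """Bai-Saranadasa statistic Z_n, approximately N(0, 1) under H0.

    x: (n1, p), y: (n2, p). Uses the pooled covariance S_n on n - 2
    degrees of freedom, as in the two-sample setting above.
    """
    n1, p = x.shape
    n2, _ = y.shape
    n = n1 + n2
    xbar, ybar = x.mean(axis=0), y.mean(axis=0)
    sn = ((x - xbar).T @ (x - xbar) + (y - ybar).T @ (y - ybar)) / (n - 2)
    mn = np.sum((xbar - ybar) ** 2) - n / (n1 * n2) * np.trace(sn)
    tr_sn = np.trace(sn)
    tr_sn2 = np.sum(sn * sn)          # tr(S_n^2), since S_n is symmetric
    var_hat = (2 * n * (n - 1) * (n - 2) / (n1**2 * n2**2 * (n - 3))
               * (tr_sn2 - tr_sn**2 / (n - 2)))
    return mn / np.sqrt(var_hat)
```

Note that Z_n needs no matrix inversion, so it remains well defined when p > n − 2.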

12 A two sample problem: comparison between T², NET and ANT
Power functions. Assume p → ∞, n → ∞, p/n → y ∈ (0, 1) and n_1/n → κ. Then for Hotelling's T², Dempster's NET and Bai-Saranadasa's ANT:
β_H(μ) = Φ( −ξ_α + √( n(1 − y)/(2y) ) κ(1 − κ) ‖Σ^{−1/2} μ‖² ) + o(1),
β_D(μ) = Φ( −ξ_α + ( n/√(2 tr Σ²) ) κ(1 − κ) ‖μ‖² ) + o(1) = β_BS(μ),
where α is the test size, μ = μ_1 − μ_2 and ξ_α = Φ^{−1}(1 − α).
Important: because of the factor (1 − y), T² loses power when y increases, i.e. when p increases relative to n.

13 A two sample problem: comparison between T², NET and ANT
Simulation results 1, Gaussian case. Choice of covariance: Σ = (1 − ρ) I_p + ρ J_p, with J_p = 1_p 1_p'. Non-centrality parameter: η = ‖μ_1 − μ_2‖² / √(tr Σ²). Sample sizes: (n_1, n_2) = (25, 20), n = 45.
[Figure: empirical powers of the three tests in this Gaussian setting.]

14 A two sample problem: a summary of the introduction
- High-dimensional effects need to be taken into account.
- Surprisingly, asymptotic methods based on RMT perform well even for small p (as low as p = 4).
- Many classical multivariate analysis methods have to be re-examined with respect to high-dimensional effects.


16 Marčenko-Pastur distributions: the Marčenko-Pastur distribution
Theorem (Marčenko & Pastur, 1967). Assume:
- X is a p × n matrix of i.i.d. variables with mean 0 and variance 1 (Σ = I_p), not necessarily Gaussian but with finite 4-th moment;
- p → ∞, n → ∞, p/n → y ∈ (0, 1].
Then the empirical distribution of the eigenvalues of S_n = (1/n) X X' converges to the distribution with density function
f(x) = (1/(2π y x)) √( (x − a)(b − x) ),  a ≤ x ≤ b,
where a = (1 − √y)² and b = (1 + √y)².
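The theorem is easy to check by simulation (a sketch; the dimensions chosen are arbitrary):

```python
import numpy as np

def mp_density(x, y):
    """Marchenko-Pastur density of index y = p/n (unit variance)."""
    a, b = (1 - np.sqrt(y)) ** 2, (1 + np.sqrt(y)) ** 2
    out = np.zeros_like(x)
    inside = (x > a) & (x < b)
    out[inside] = (np.sqrt((x[inside] - a) * (b - x[inside]))
                   / (2 * np.pi * y * x[inside]))
    return out

rng = np.random.default_rng(2)
p, n = 400, 1600                       # y = 1/4, so a = 0.25, b = 2.25
x_mat = rng.standard_normal((p, n))
eigs = np.linalg.eigvalsh(x_mat @ x_mat.T / n)
# The eigenvalue histogram of S_n concentrates on [a, b], and the
# eigenvalue average (= tr S_n / p) is close to the MP mean, 1.
print(eigs.min(), eigs.max(), eigs.mean())
```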

17 Marčenko-Pastur distributions: the Marčenko-Pastur distribution
f(x) = (1/(2π y x)) √( (x − a)(b − x) ),  (1 − √y)² = a ≤ x ≤ b = (1 + √y)².
[Figure: densities of the Marčenko-Pastur law for y = 1/8, 1/4 and 1/2, with the corresponding edges a and b.]

18 Marčenko-Pastur distributions: an explanation of the power deficiency of Hotelling's T²
- When p increases, even in the Gaussian case, S_n is very different from its population counterpart Σ.
- When y = p/n → 1, the left edge a → 0: small eigenvalues of S_n yield an instability of the T² statistic
T² = (n_1 n_2 / n) (x̄ − ȳ)' S_n⁻¹ (x̄ − ȳ).

19 Bai and Silverstein's CLT for linear spectral statistics of S_n
Set the empirical spectral distribution
F_n = (1/p) Σ_{j=1}^{p} δ_{λ_j},
where the λ_j are the p eigenvalues of S_n, and let y_n = p/n. Let U ⊂ C be open with [a, b] ⊂ U. For any g analytic on U, define
G_n(g) = p [ F_n(g) − μ_{y_n}(g) ],
where μ_α denotes the MP distribution of index α ∈ (0, 1).

20 Bai and Silverstein's CLT for linear spectral statistics: a CLT for linear spectral statistics
Theorem (Bai and Silverstein, 04). Assume that:
- g_1, ..., g_k are k analytic functions on U;
- the matrix entries x_ij are i.i.d. real-valued random variables such that E x_ij = 0, E x_ij² = 1 and E x_ij⁴ = 3;
- n, p → ∞ with y_n = p/n → y ∈ (0, 1).
Then
(G_n(g_1), ..., G_n(g_k)) ⇒ N_k(m, V),
with a given mean vector m = m(g_1, ..., g_k) and asymptotic covariance matrix V = V(g_1, ..., g_k).
Other versions exist: Lytova & Pastur 09; Bai & Wang 09.
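The content of the CLT can be previewed numerically: for g(x) = x² the MP integral is μ_y(g) = 1 + y (a standard MP moment), and the centered statistic G_n(g) stays of order 1 as p grows, instead of scaling like √p as a naive CLT would suggest (a sketch; the sizes chosen are arbitrary):

```python
import numpy as np

def lss_centered(p, n, rng):
    """G_n(g) = p * [F_n(g) - mu_{y_n}(g)] for g(x) = x^2, where F_n is
    the ESD of S_n = XX'/n and the second MP moment is 1 + y_n."""
    y_n = p / n
    x = rng.standard_normal((p, n))
    eigs = np.linalg.eigvalsh(x @ x.T / n)
    return p * (np.mean(eigs ** 2) - (1 + y_n))

rng = np.random.default_rng(3)
vals = [lss_centered(p, 4 * p, rng) for p in (50, 100, 200)]
print(vals)   # remains O(1) as p grows
```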


22 Random Fisher matrices
Two independent samples x_1, ..., x_{n_1} ~ (0, I_p) and y_1, ..., y_{n_2} ~ (0, I_p), with i.i.d. coordinates of mean 0 and variance 1. Associated sample covariance matrices:
S_1 = (1/n_1) Σ_{i=1}^{n_1} x_i x_i',  S_2 = (1/n_2) Σ_{j=1}^{n_2} y_j y_j'.
Fisher matrix: V_n = S_1 S_2⁻¹, where n_2 > p (so that S_2 is invertible).

23 Random Fisher matrices
Assume y_{n1} = p/n_1 → y_1 ∈ (0, 1) and y_{n2} = p/n_2 → y_2 ∈ (0, 1). Under mild moment conditions, the ESD F_n^{V_n} of V_n has a LSD F_{y_1,y_2} with density
l(x) = ( (1 − y_2) / (2π x (y_1 + y_2 x)) ) √( (b − x)(x − a) ) for a ≤ x ≤ b, and l(x) = 0 otherwise,
where
a = (1 − √(y_1 + y_2 − y_1 y_2))² / (1 − y_2)²,  b = (1 + √(y_1 + y_2 − y_1 y_2))² / (1 − y_2)².
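A simulated Fisher matrix indeed fills this support (a sketch; the symmetrized form L⁻¹ S_1 L⁻ᵀ, with L the Cholesky factor of S_2, has the same eigenvalues as S_1 S_2⁻¹):

```python
import numpy as np

def fisher_lsd_density(x, y1, y2):
    """Density of the limiting spectral distribution of the Fisher
    matrix S1 S2^{-1}, with indices y1 = p/n1 and y2 = p/n2."""
    h = np.sqrt(y1 + y2 - y1 * y2)
    a = (1 - h) ** 2 / (1 - y2) ** 2
    b = (1 + h) ** 2 / (1 - y2) ** 2
    out = np.zeros_like(x)
    m = (x > a) & (x < b)
    out[m] = ((1 - y2) * np.sqrt((b - x[m]) * (x[m] - a))
              / (2 * np.pi * x[m] * (y1 + y2 * x[m])))
    return out

rng = np.random.default_rng(4)
p, n1, n2 = 200, 800, 1000             # y1 = 0.25, y2 = 0.2
x1 = rng.standard_normal((p, n1))
x2 = rng.standard_normal((p, n2))
s1 = x1 @ x1.T / n1
s2 = x2 @ x2.T / n2
li = np.linalg.inv(np.linalg.cholesky(s2))
eigs = np.linalg.eigvalsh(li @ s1 @ li.T)   # eigenvalues of S1 S2^{-1}
print(eigs.min(), eigs.max())               # close to the edges a, b
```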

24 Random Fisher matrices: CLT for LSS of random Fisher matrices
For y_1 ∈ (0, 1), let Ũ ⊂ C be an open set containing the interval
[ (1 − √y_1)² / (1 + √y_2)² , (1 + √y_1)² / (1 − √y_2)² ].
For an analytic function f on Ũ, define
G̃_n(f) = p ∫ f(x) [ F_n^{V_n} − F_{y_{n1},y_{n2}} ](dx),
where F_{y_{n1},y_{n2}} is the LSD with indices y_{nk}, k = 1, 2.

25 Random Fisher matrices: CLT for LSS of random Fisher matrices
Theorem (Zheng, 08). Assume E x_11⁴ < ∞ and E y_11⁴ < ∞, and let β_x = E|x_11|⁴ − 3 and β_y = E|y_11|⁴ − 3. Then, for any analytic functions f_1, ..., f_k defined on Ũ,
( G̃_n(f_1), ..., G̃_n(f_k) ) ⇒ N_k(m, υ),
with suitable asymptotic mean and covariance functions m and υ.

26 Random Fisher matrices: CLT for LSS of random Fisher matrices
Limiting mean function m (Zheng, 08), with h = √(y_1 + y_2 − y_1 y_2) and z(ζ) = (1/(1 − y_2)²)(1 + h² + 2h Re ζ):
m(f_j) = lim_{r ↓ 1} [ (1) + (2) + (3) ], where
(1) = (1/(4πi)) ∮_{|ζ|=1} f_j(z(ζ)) [ 1/(ζ − r⁻¹) + 1/(ζ + r⁻¹) − 2/(ζ + y_2/(h r)) ] dζ,
(2) = ( β_x y_1 (1 − y_2)² / (2πi h²) ) ∮_{|ζ|=1} f_j(z(ζ)) ( 1/(ζ + y_2/(h r))³ ) dζ,
(3) = ( β_y y_2 (1 − y_2) / (2πi h) ) ∮_{|ζ|=1} f_j(z(ζ)) ( (ζ + 1/(h r)) / (ζ + y_2/(h r))³ ) dζ.

27 Random Fisher matrices: CLT for LSS of random Fisher matrices
Limiting covariance function υ (Zheng, 08):
υ(f_j, f_l) = lim_{1 < r_1 < r_2 → 1} [ (4) + (5) ], j, l ∈ {1, ..., k}, where
(4) = −( r_1 r_2 / (2π²) ) ∮_{|ζ_2|=1} ∮_{|ζ_1|=1} ( f_j(z(r_1 ζ_1)) f_l(z(r_2 ζ_2)) / (r_2 ζ_2 − r_1 ζ_1)² ) dζ_1 dζ_2,
(5) = −( (β_x y_1 + β_y y_2)(1 − y_2)² / (4π² h²) ) [ ∮_{|ζ_1|=1} f_j(z(ζ_1)) / (ζ_1 + y_2/(h r_1))² dζ_1 ] [ ∮_{|ζ_2|=1} f_l(z(ζ_2)) / (ζ_2 + y_2/(h r_2))² dζ_2 ].


29 One-sample test on covariance matrices
A sample x_1, ..., x_n ~ N_p(μ, Σ); we want to test H_0: Σ = I_p. In the high-dimensional case, several previous works exist: Ledoit & Wolf 02; Schott 07; Srivastava. We focus on the LR statistic
T_n = n [ tr S_n − log|S_n| − p ],  S_n = (1/n) Σ_{i=1}^{n} (x_i − x̄)(x_i − x̄)'.
Classical LRT: for fixed data dimension p, when n → ∞,
T_n ⇒ χ²_{p(p+1)/2}.
As we will see, this approximation rapidly becomes deficient when p is not small.

30 RMT-corrected LRT
Theorem (Bai, Jiang, Yao and Zheng, 09). Assume p/n → y ∈ (0, 1) and let g(x) = x − log x − 1. Then, under H_0 and as n → ∞,
T_n/n − p F_{y_n}(g) ⇒ N( m(g), υ(g) ),
where F_{y_n} is the Marčenko-Pastur law of index y_n = p/n, and
m(g) = −log(1 − y)/2,  υ(g) = −2 log(1 − y) − 2y.
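The corrected test is only a few lines of numpy (a sketch; the function name is ours). The MP integral of g has the standard closed form F_y(g) = 1 + ((1 − y)/y) log(1 − y), and since mean-centering costs one degree of freedom we apply the theorem with N = n − 1 observations:

```python
import numpy as np
from math import erfc, log, sqrt

def clrt_one_sample(x):
    """RMT-corrected LR test of H0: Sigma = I_p for an (n, p) sample.

    Returns (z, p-value), where z is approximately N(0, 1) under H0.
    """
    n, p = x.shape
    nn = n - 1                  # degrees of freedom after mean-centering
    y = p / nn
    xc = x - x.mean(axis=0)
    s = xc.T @ xc / nn
    _, logdet = np.linalg.slogdet(s)
    stat = np.trace(s) - logdet - p               # T_N / N
    # Closed-form MP integral of g(t) = t - log t - 1 at index y.
    center = p * (1 + (1 - y) / y * log(1 - y))
    m = -log(1 - y) / 2
    v = -2 * log(1 - y) - 2 * y
    z = (stat - center - m) / sqrt(v)
    return z, erfc(abs(z) / sqrt(2))              # two-sided p-value
```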

31 Simulation study I: comparison of LRT and CLRT by simulation
Nominal test level α = 0.05; for each (p, n), 10,000 independent replications with real Gaussian variables. Powers are estimated under the alternative H_1: Σ = diag(1, 0.05, 0.05, 0.05, ..., 0.05).
[Table: empirical sizes and powers of CLRT and LRT, and the difference of the CLRT size from 5%, for (p, n) = (5, 500), (10, 500), (50, 500), (100, 500) and (300, 500).]

32 Simulation study I: on a plot
[Figure: type I error of CLRT and LRT as a function of the dimension p, for n = 500.]


34 Two-sample test on covariance matrices
Two samples x_1, ..., x_{n_1} ~ N_p(μ_1, Σ_1) and y_1, ..., y_{n_2} ~ N_p(μ_2, Σ_2); we want to test H_0: Σ_1 = Σ_2. The associated sample covariance matrices are
S_1 = (1/n_1) Σ_{i=1}^{n_1} (x_i − x̄)(x_i − x̄)',  S_2 = (1/n_2) Σ_{i=1}^{n_2} (y_i − ȳ)(y_i − ȳ)'.
The LR statistic is
L_1 = |S_1 S_2⁻¹|^{n_1/2} / |c_1 S_1 S_2⁻¹ + c_2 I_p|^{n/2},
where n = n_1 + n_2 and c_k = n_k/n, k = 1, 2.

35 Two-sample test on covariance matrices
Classical LRT: for fixed data dimension p, when n_1, n_2 → ∞ and under H_0,
T_n = −2 log L_1 ⇒ χ²_{p(p+1)/2}.
As we will see, this approximation rapidly becomes deficient when p is not small.

36 RMT-corrected LRT
Theorem (Bai, Jiang, Yao and Zheng, 09). Assume that the conditions of the CLT for LSS of Fisher matrices hold, and let
f(x) = log(y_1 + y_2 x) − ( y_2/(y_1 + y_2) ) log x − log(y_1 + y_2).
Then, under H_0 and as n_1, n_2 → ∞,
−(2/n) log L_1 − p F_{y_{n1},y_{n2}}(f) ⇒ N( m(f), υ(f) ),
with
m(f) = (1/2) [ log( (y_1 + y_2 − y_1 y_2)/(y_1 + y_2) ) − ( y_1/(y_1 + y_2) ) log(1 − y_2) − ( y_2/(y_1 + y_2) ) log(1 − y_1) ],
υ(f) = −( 2 y_2²/(y_1 + y_2)² ) log(1 − y_1) − ( 2 y_1²/(y_1 + y_2)² ) log(1 − y_2) − 2 log( (y_1 + y_2)/(y_1 + y_2 − y_1 y_2) ).
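A sketch of the corrected two-sample statistic, for mean-zero data with covariances estimated without centering; the centering p·F(f) is obtained by numerically integrating f against the Fisher LSD of slide 23, and m(f), υ(f) are taken as stated above (the helper names are ours):

```python
import numpy as np

def fisher_lsd_expectation(f, y1, y2, num=100_000):
    """Integrate f against the Fisher LSD F_{y1,y2}.
    The substitution x(t) = (a+b)/2 + ((b-a)/2) cos t turns the
    square-root edge factor into sin(t)^2, so a plain Riemann sum
    over t in [0, pi] is accurate."""
    h = np.sqrt(y1 + y2 - y1 * y2)
    a = (1 - h) ** 2 / (1 - y2) ** 2
    b = (1 + h) ** 2 / (1 - y2) ** 2
    t = np.linspace(0.0, np.pi, num)
    x = (a + b) / 2 + (b - a) / 2 * np.cos(t)
    w = ((1 - y2) * ((b - a) / 2 * np.sin(t)) ** 2
         / (2 * np.pi * x * (y1 + y2 * x)))
    return float(np.sum(f(x) * w) * (t[1] - t[0]))

def clrt_two_sample(x, y):
    """Corrected two-sample LRT statistic, approximately N(0, 1) under H0."""
    n1, p = x.shape
    n2, _ = y.shape
    yn1, yn2 = p / n1, p / n2
    s1, s2 = x.T @ x / n1, y.T @ y / n2
    li = np.linalg.inv(np.linalg.cholesky(s2))
    lam = np.linalg.eigvalsh(li @ s1 @ li.T)   # eigenvalues of S1 S2^{-1}

    def f(t):
        return (np.log(yn1 + yn2 * t) - yn2 / (yn1 + yn2) * np.log(t)
                - np.log(yn1 + yn2))

    stat = float(np.sum(f(lam)))               # equals -(2/n) log L_1
    center = p * fisher_lsd_expectation(f, yn1, yn2)
    s = yn1 + yn2
    m = 0.5 * (np.log((s - yn1 * yn2) / s)
               - yn1 / s * np.log(1 - yn2) - yn2 / s * np.log(1 - yn1))
    v = (-2 * yn2**2 / s**2 * np.log(1 - yn1)
         - 2 * yn1**2 / s**2 * np.log(1 - yn2)
         - 2 * np.log(s / (s - yn1 * yn2)))
    return (stat - center - m) / np.sqrt(v)
```

Here f(λ) = log(c_2 + c_1 λ) − c_1 log λ with c_1 = n_1/n, so summing f over the eigenvalues of S_1 S_2⁻¹ reproduces −(2/n) log L_1 exactly.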

37 Simulation study II: comparison of LRT and CLRT by simulation
Nominal test level α = 0.05; for each (p, n_1, n_2), 10,000 independent replications with real Gaussian variables. Powers are estimated under the alternative H_1: Σ_1 Σ_2⁻¹ = diag(3, 1, 1, ..., 1).

38 Simulation study II: comparison of LRT and CLRT by simulation, with (y_1, y_2) = (0.05, 0.05)
[Table: empirical sizes and powers of CLRT and LRT, and the difference of the CLRT size from 5%, for (p, n_1, n_2) = (5, 100, 100), (10, 200, 200), (20, 400, 400), (40, 800, 800), (80, 1600, 1600), (160, 3200, 3200) and (320, 6400, 6400).]

39 Simulation study II: comparison of LRT and CLRT by simulation, with (y_1, y_2) = (0.05, 0.1)
[Table: empirical sizes and powers of CLRT and LRT, and the difference of the CLRT size from 5%, for (p, n_1, n_2) = (5, 100, 50), (10, 200, 100), (20, 400, 200), (40, 800, 400), (80, 1600, 800), (160, 3200, 1600) and (320, 6400, 3200).]

40 Simulation study II: comparisons of LRT and CLRT
[Figure: type I error of CLRT and LRT as a function of the dimension, for (y_1, y_2) = (0.05, 0.05) (left panel) and (y_1, y_2) = (0.05, 0.1) (right panel).]


42 A general linear hypothesis in a multivariate regression
A p-dimensional regression model:
x_i = B z_i + ε_i, i = 1, ..., n,
where ε_i ~ N_p(0, Σ), x_i ∈ R^p, z_i ∈ R^q and n ≥ p + q.
A general linear hypothesis: write the block decomposition B = (B_1, B_2) with q_1 and q_2 columns, and test
H_0: B_1 = B_1^*, for a given matrix B_1^*.

43 Wilks' Λ
Let Σ̂_0 and Σ̂_1 be the maximum likelihood estimators of Σ under H_0 and under the alternative, respectively. The LRT statistic equals L_0/L_1 = (Λ_n)^{n/2}, where
Λ_n = |Σ̂_1| / |Σ̂_0|
is the celebrated Wilks' Λ (Wilks 32, 34; Bartlett 34).
Classical (low-dimensional) approximation of the LRT: for fixed p and q, as n → ∞ and under H_0,
U_n = −n log Λ_n ⇒ χ²_{p q_1}.
Less biased is Bartlett's correction:
Ũ_n = −k log Λ_n,  k = n − q − (p − q_1 + 1)/2.
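Wilks' Λ and both classical approximations can be computed directly (a sketch; the function names are ours, and the MLEs are residual covariance matrices divided by n):

```python
import numpy as np

def wilks_lambda(x, z, q1, b1_star):
    """Wilks' Lambda for H0: B1 = b1_star in the model x_i = B z_i + eps_i.

    x: (n, p) responses, z: (n, q) regressors, B = (B1, B2) with q1 and
    q - q1 columns, b1_star: (p, q1) hypothesized value of B1.
    Returns (Lambda_n, U_n, Bartlett-corrected U_n).
    """
    n, p = x.shape
    q = z.shape[1]
    z1, z2 = z[:, :q1], z[:, q1:]

    def mle_cov(resp, design):
        # Residual covariance (divided by n) after a least-squares fit.
        if design.shape[1] == 0:
            resid = resp
        else:
            coef, *_ = np.linalg.lstsq(design, resp, rcond=None)
            resid = resp - design @ coef
        return resid.T @ resid / n

    sigma1 = mle_cov(x, z)                      # unrestricted MLE
    sigma0 = mle_cov(x - z1 @ b1_star.T, z2)    # MLE under H0
    lam = np.exp(np.linalg.slogdet(sigma1)[1] - np.linalg.slogdet(sigma0)[1])
    u_n = -n * np.log(lam)
    k = n - q - (p - q1 + 1) / 2                # Bartlett's multiplier
    return lam, u_n, -k * np.log(lam)
```

Since the restricted residuals are never smaller (in determinant) than the unrestricted ones, Λ_n always lies in (0, 1].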

44 High-dimensional correction of Wilks' Λ
Theorem (Bai, Jiang, Yao and Zheng, 10). Let p, q_1 and n − q tend to infinity with
y_{n1} = p/q_1 → y_1 ∈ (0, 1),  y_{n2} = p/(n − q) → y_2 ∈ (0, 1).
Then, under H_0,
T_n = υ(f)^{−1/2} [ −log Λ_n − p F_{y_{n1},y_{n2}}(f) − m(f) ] ⇒ N(0, 1),
where m(f), υ(f) and F_{y_{n1},y_{n2}}(f) are suitable constants computed from f(x) = log(1 + (y_{n2}/y_{n1}) x).

45 The centering term
F_{y_{n1},y_{n2}}(f) = log c_n + ( (1 − y_{n1})/y_{n1} ) log(c_n − d_n h_n) + ( (y_{n1} + y_{n2})/(y_{n1} y_{n2}) ) log( (c_n h_n − d_n y_{n2}) / h_n ),
where
h_n = √( y_{n1} + y_{n2} − y_{n1} y_{n2} ),
a_n, b_n = (1 ∓ h_n)² / (1 − y_{n2})²,
c_n, d_n = (1/2) [ √( 1 + (y_{n2}/y_{n1}) b_n ) ± √( 1 + (y_{n2}/y_{n1}) a_n ) ],  c_n > d_n.

46 The limiting parameters
m(f) = (1/2) log( (c² − d²) h² / (c h − y_2 d)² ),
υ(f) = 2 log( c² / (c² − d²) ),
where
h = √( y_1 + y_2 − y_1 y_2 ),
a_0, b_0 = (1 ∓ h)² / (1 − y_2)²,
c, d = (1/2) [ √( 1 + (y_2/y_1) b_0 ) ± √( 1 + (y_2/y_1) a_0 ) ],  c > d.

47 A simulation experiment
[Figure: empirical powers of LRT, CLRT, ST1 and ST2 as functions of the non-centrality parameter c_0, for (p, n, q, q_1) = (10, 100, 50, 30) (left panel) and (20, 100, 60, 50) (right panel).]
Gaussian entries; the non-centrality parameter c_0 measures the distance from H_0.


49 Some conclusions
- High-dimensional effects should be taken into account.
- RMT for sample covariance matrices is a powerful tool to correct classical multivariate procedures.
- Each time some Σ is to be estimated, one should be careful with the natural estimator S_n: for high-dimensional data, S_n is far from Σ.
- Yet RMT is not sufficiently developed for statistics:
1. dependent observations (time series);
2. not identically distributed variables.

50 Some references
Bai, Z. D. and Saranadasa, H. (1996). Effect of high dimension: comparison of significance tests for a high dimensional two sample problem. Statistica Sinica 6.
Bai, Z. D. and Silverstein, J. W. (2004). CLT for linear spectral statistics of large-dimensional sample covariance matrices. Ann. Probab. 32.
Bai, Z. D., Jiang, D., Yao, J. and Zheng, S. (2009). Corrections to LRT on large dimensional covariance matrix by RMT. Annals of Statistics 37.
Bai, Z. D., Jiang, D., Yao, J. and Zheng, S. Large regression analysis. Submitted.
Dempster, A. P. (1958). A high dimensional two sample significance test. Ann. Math. Statist. 29.
Zheng, S. (2008). Central limit theorem for linear spectral statistics of large dimensional F-matrix. Preprint, Northeast Normal University.


Jackknife Empirical Likelihood Test for Equality of Two High Dimensional Means Jackknife Empirical Likelihood est for Equality of wo High Dimensional Means Ruodu Wang, Liang Peng and Yongcheng Qi 2 Abstract It has been a long history to test the equality of two multivariate means.

More information

Multivariate Analysis and Likelihood Inference

Multivariate Analysis and Likelihood Inference Multivariate Analysis and Likelihood Inference Outline 1 Joint Distribution of Random Variables 2 Principal Component Analysis (PCA) 3 Multivariate Normal Distribution 4 Likelihood Inference Joint density

More information

STAT 730 Chapter 4: Estimation

STAT 730 Chapter 4: Estimation STAT 730 Chapter 4: Estimation Timothy Hanson Department of Statistics, University of South Carolina Stat 730: Multivariate Analysis 1 / 23 The likelihood We have iid data, at least initially. Each datum

More information

1 Intro to RMT (Gene)

1 Intro to RMT (Gene) M705 Spring 2013 Summary for Week 2 1 Intro to RMT (Gene) (Also see the Anderson - Guionnet - Zeitouni book, pp.6-11(?) ) We start with two independent families of R.V.s, {Z i,j } 1 i

More information

Multivariate Regression

Multivariate Regression Multivariate Regression The so-called supervised learning problem is the following: we want to approximate the random variable Y with an appropriate function of the random variables X 1,..., X p with the

More information

Statistical Inference On the High-dimensional Gaussian Covarianc

Statistical Inference On the High-dimensional Gaussian Covarianc Statistical Inference On the High-dimensional Gaussian Covariance Matrix Department of Mathematical Sciences, Clemson University June 6, 2011 Outline Introduction Problem Setup Statistical Inference High-Dimensional

More information

Non white sample covariance matrices.

Non white sample covariance matrices. Non white sample covariance matrices. S. Péché, Université Grenoble 1, joint work with O. Ledoit, Uni. Zurich 17-21/05/2010, Université Marne la Vallée Workshop Probability and Geometry in High Dimensions

More information

Analysis of variance, multivariate (MANOVA)

Analysis of variance, multivariate (MANOVA) Analysis of variance, multivariate (MANOVA) Abstract: A designed experiment is set up in which the system studied is under the control of an investigator. The individuals, the treatments, the variables

More information

Multivariate Regression (Chapter 10)

Multivariate Regression (Chapter 10) Multivariate Regression (Chapter 10) This week we ll cover multivariate regression and maybe a bit of canonical correlation. Today we ll mostly review univariate multivariate regression. With multivariate

More information

Thomas J. Fisher. Research Statement. Preliminary Results

Thomas J. Fisher. Research Statement. Preliminary Results Thomas J. Fisher Research Statement Preliminary Results Many applications of modern statistics involve a large number of measurements and can be considered in a linear algebra framework. In many of these

More information

Testing linear hypotheses of mean vectors for high-dimension data with unequal covariance matrices

Testing linear hypotheses of mean vectors for high-dimension data with unequal covariance matrices Testing linear hypotheses of mean vectors for high-dimension data with unequal covariance matrices Takahiro Nishiyama a,, Masashi Hyodo b, Takashi Seo a, Tatjana Pavlenko c a Department of Mathematical

More information

Asymptotic Distribution of the Largest Eigenvalue via Geometric Representations of High-Dimension, Low-Sample-Size Data

Asymptotic Distribution of the Largest Eigenvalue via Geometric Representations of High-Dimension, Low-Sample-Size Data Sri Lankan Journal of Applied Statistics (Special Issue) Modern Statistical Methodologies in the Cutting Edge of Science Asymptotic Distribution of the Largest Eigenvalue via Geometric Representations

More information

Random Matrix Theory and its Applications to Econometrics

Random Matrix Theory and its Applications to Econometrics Random Matrix Theory and its Applications to Econometrics Hyungsik Roger Moon University of Southern California Conference to Celebrate Peter Phillips 40 Years at Yale, October 2018 Spectral Analysis of

More information

1 Data Arrays and Decompositions

1 Data Arrays and Decompositions 1 Data Arrays and Decompositions 1.1 Variance Matrices and Eigenstructure Consider a p p positive definite and symmetric matrix V - a model parameter or a sample variance matrix. The eigenstructure is

More information

Comparison Method in Random Matrix Theory

Comparison Method in Random Matrix Theory Comparison Method in Random Matrix Theory Jun Yin UW-Madison Valparaíso, Chile, July - 2015 Joint work with A. Knowles. 1 Some random matrices Wigner Matrix: H is N N square matrix, H : H ij = H ji, EH

More information

Chapter 9. Hotelling s T 2 Test. 9.1 One Sample. The one sample Hotelling s T 2 test is used to test H 0 : µ = µ 0 versus

Chapter 9. Hotelling s T 2 Test. 9.1 One Sample. The one sample Hotelling s T 2 test is used to test H 0 : µ = µ 0 versus Chapter 9 Hotelling s T 2 Test 9.1 One Sample The one sample Hotelling s T 2 test is used to test H 0 : µ = µ 0 versus H A : µ µ 0. The test rejects H 0 if T 2 H = n(x µ 0 ) T S 1 (x µ 0 ) > n p F p,n

More information

Journal of Multivariate Analysis. Sphericity test in a GMANOVA MANOVA model with normal error

Journal of Multivariate Analysis. Sphericity test in a GMANOVA MANOVA model with normal error Journal of Multivariate Analysis 00 (009) 305 3 Contents lists available at ScienceDirect Journal of Multivariate Analysis journal homepage: www.elsevier.com/locate/jmva Sphericity test in a GMANOVA MANOVA

More information

Applications of Random Matrix Theory in Wireless Underwater Communication

Applications of Random Matrix Theory in Wireless Underwater Communication Applications of Random Matrix Theory in Wireless Underwater Communication Why Signal Processing and Wireless Communication Need Random Matrix Theory Atulya Yellepeddi May 13, 2013 18.338- Eigenvalues of

More information

Sliced Inverse Regression

Sliced Inverse Regression Sliced Inverse Regression Ge Zhao gzz13@psu.edu Department of Statistics The Pennsylvania State University Outline Background of Sliced Inverse Regression (SIR) Dimension Reduction Definition of SIR Inversed

More information

Statement: With my signature I confirm that the solutions are the product of my own work. Name: Signature:.

Statement: With my signature I confirm that the solutions are the product of my own work. Name: Signature:. MATHEMATICAL STATISTICS Homework assignment Instructions Please turn in the homework with this cover page. You do not need to edit the solutions. Just make sure the handwriting is legible. You may discuss

More information

Department of Statistics

Department of Statistics Research Report Department of Statistics Research Report Department of Statistics No. 05: Testing in multivariate normal models with block circular covariance structures Yuli Liang Dietrich von Rosen Tatjana

More information

Effects of data dimension on empirical likelihood

Effects of data dimension on empirical likelihood Biometrika Advance Access published August 8, 9 Biometrika (9), pp. C 9 Biometrika Trust Printed in Great Britain doi:.9/biomet/asp7 Effects of data dimension on empirical likelihood BY SONG XI CHEN Department

More information

Testing Block-Diagonal Covariance Structure for High-Dimensional Data

Testing Block-Diagonal Covariance Structure for High-Dimensional Data Testing Block-Diagonal Covariance Structure for High-Dimensional Data MASASHI HYODO 1, NOBUMICHI SHUTOH 2, TAKAHIRO NISHIYAMA 3, AND TATJANA PAVLENKO 4 1 Department of Mathematical Sciences, Graduate School

More information

MULTIVARIATE THEORY FOR ANALYZING HIGH DIMENSIONAL DATA

MULTIVARIATE THEORY FOR ANALYZING HIGH DIMENSIONAL DATA J. Japan Statist. Soc. Vol. 37 No. 1 2007 53 86 MULTIVARIATE THEORY FOR ANALYZING HIGH DIMENSIONAL DATA M. S. Srivastava* In this article, we develop a multivariate theory for analyzing multivariate datasets

More information

TWO-SAMPLE BEHRENS-FISHER PROBLEM FOR HIGH-DIMENSIONAL DATA

TWO-SAMPLE BEHRENS-FISHER PROBLEM FOR HIGH-DIMENSIONAL DATA Statistica Sinica (014): Preprint 1 TWO-SAMPLE BEHRENS-FISHER PROBLEM FOR HIGH-DIMENSIONAL DATA Long Feng 1, Changliang Zou 1, Zhaojun Wang 1, Lixing Zhu 1 Nankai University and Hong Kong Baptist University

More information

Gaussian vectors and central limit theorem

Gaussian vectors and central limit theorem Gaussian vectors and central limit theorem Samy Tindel Purdue University Probability Theory 2 - MA 539 Samy T. Gaussian vectors & CLT Probability Theory 1 / 86 Outline 1 Real Gaussian random variables

More information

EXTENDED GLRT DETECTORS OF CORRELATION AND SPHERICITY: THE UNDERSAMPLED REGIME. Xavier Mestre 1, Pascal Vallet 2

EXTENDED GLRT DETECTORS OF CORRELATION AND SPHERICITY: THE UNDERSAMPLED REGIME. Xavier Mestre 1, Pascal Vallet 2 EXTENDED GLRT DETECTORS OF CORRELATION AND SPHERICITY: THE UNDERSAMPLED REGIME Xavier Mestre, Pascal Vallet 2 Centre Tecnològic de Telecomunicacions de Catalunya, Castelldefels, Barcelona (Spain) 2 Institut

More information

STAT 730 Chapter 5: Hypothesis Testing

STAT 730 Chapter 5: Hypothesis Testing STAT 730 Chapter 5: Hypothesis Testing Timothy Hanson Department of Statistics, University of South Carolina Stat 730: Multivariate Analysis 1 / 28 Likelihood ratio test def n: Data X depend on θ. The

More information

High Dimensional Covariance and Precision Matrix Estimation

High Dimensional Covariance and Precision Matrix Estimation High Dimensional Covariance and Precision Matrix Estimation Wei Wang Washington University in St. Louis Thursday 23 rd February, 2017 Wei Wang (Washington University in St. Louis) High Dimensional Covariance

More information

Cointegrated VAR s. Eduardo Rossi University of Pavia. November Rossi Cointegrated VAR s Financial Econometrics / 56

Cointegrated VAR s. Eduardo Rossi University of Pavia. November Rossi Cointegrated VAR s Financial Econometrics / 56 Cointegrated VAR s Eduardo Rossi University of Pavia November 2013 Rossi Cointegrated VAR s Financial Econometrics - 2013 1 / 56 VAR y t = (y 1t,..., y nt ) is (n 1) vector. y t VAR(p): Φ(L)y t = ɛ t The

More information

Estimation of large dimensional sparse covariance matrices

Estimation of large dimensional sparse covariance matrices Estimation of large dimensional sparse covariance matrices Department of Statistics UC, Berkeley May 5, 2009 Sample covariance matrix and its eigenvalues Data: n p matrix X n (independent identically distributed)

More information

Applications of random matrix theory to principal component analysis(pca)

Applications of random matrix theory to principal component analysis(pca) Applications of random matrix theory to principal component analysis(pca) Jun Yin IAS, UW-Madison IAS, April-2014 Joint work with A. Knowles and H. T Yau. 1 Basic picture: Let H be a Wigner (symmetric)

More information

CENTRAL LIMIT THEOREMS FOR CLASSICAL LIKELIHOOD RATIO TESTS FOR HIGH-DIMENSIONAL NORMAL DISTRIBUTIONS

CENTRAL LIMIT THEOREMS FOR CLASSICAL LIKELIHOOD RATIO TESTS FOR HIGH-DIMENSIONAL NORMAL DISTRIBUTIONS The Annals of Statistics 013, Vol. 41, No. 4, 09 074 DOI: 10.114/13-AOS1134 Institute of Mathematical Statistics, 013 CENTRAL LIMIT THEOREMS FOR CLASSICAL LIKELIHOOD RATIO TESTS FOR HIGH-DIMENSIONAL NORMAL

More information

On singular values distribution of a matrix large auto-covariance in the ultra-dimensional regime. Title

On singular values distribution of a matrix large auto-covariance in the ultra-dimensional regime. Title itle On singular values distribution of a matrix large auto-covariance in the ultra-dimensional regime Authors Wang, Q; Yao, JJ Citation Random Matrices: heory and Applications, 205, v. 4, p. article no.

More information

Uses of Asymptotic Distributions: In order to get distribution theory, we need to norm the random variable; we usually look at n 1=2 ( X n ).

Uses of Asymptotic Distributions: In order to get distribution theory, we need to norm the random variable; we usually look at n 1=2 ( X n ). 1 Economics 620, Lecture 8a: Asymptotics II Uses of Asymptotic Distributions: Suppose X n! 0 in probability. (What can be said about the distribution of X n?) In order to get distribution theory, we need

More information

[y i α βx i ] 2 (2) Q = i=1

[y i α βx i ] 2 (2) Q = i=1 Least squares fits This section has no probability in it. There are no random variables. We are given n points (x i, y i ) and want to find the equation of the line that best fits them. We take the equation

More information

simple if it completely specifies the density of x

simple if it completely specifies the density of x 3. Hypothesis Testing Pure significance tests Data x = (x 1,..., x n ) from f(x, θ) Hypothesis H 0 : restricts f(x, θ) Are the data consistent with H 0? H 0 is called the null hypothesis simple if it completely

More information

An introduction to G-estimation with sample covariance matrices

An introduction to G-estimation with sample covariance matrices An introduction to G-estimation with sample covariance matrices Xavier estre xavier.mestre@cttc.cat Centre Tecnològic de Telecomunicacions de Catalunya (CTTC) "atrices aleatoires: applications aux communications

More information

High Dimensional Empirical Likelihood for Generalized Estimating Equations with Dependent Data

High Dimensional Empirical Likelihood for Generalized Estimating Equations with Dependent Data High Dimensional Empirical Likelihood for Generalized Estimating Equations with Dependent Data Song Xi CHEN Guanghua School of Management and Center for Statistical Science, Peking University Department

More information

A Note on the Central Limit Theorem for the Eigenvalue Counting Function of Wigner and Covariance Matrices

A Note on the Central Limit Theorem for the Eigenvalue Counting Function of Wigner and Covariance Matrices A Note on the Central Limit Theorem for the Eigenvalue Counting Function of Wigner and Covariance Matrices S. Dallaporta University of Toulouse, France Abstract. This note presents some central limit theorems

More information

Gaussian Processes. Le Song. Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012

Gaussian Processes. Le Song. Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Gaussian Processes Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 01 Pictorial view of embedding distribution Transform the entire distribution to expected features Feature space Feature

More information

Statistical signal processing

Statistical signal processing Statistical signal processing Short overview of the fundamentals Outline Random variables Random processes Stationarity Ergodicity Spectral analysis Random variable and processes Intuition: A random variable

More information

Learning gradients: prescriptive models

Learning gradients: prescriptive models Department of Statistical Science Institute for Genome Sciences & Policy Department of Computer Science Duke University May 11, 2007 Relevant papers Learning Coordinate Covariances via Gradients. Sayan

More information

Regularized Estimation of High Dimensional Covariance Matrices. Peter Bickel. January, 2008

Regularized Estimation of High Dimensional Covariance Matrices. Peter Bickel. January, 2008 Regularized Estimation of High Dimensional Covariance Matrices Peter Bickel Cambridge January, 2008 With Thanks to E. Levina (Joint collaboration, slides) I. M. Johnstone (Slides) Choongsoon Bae (Slides)

More information

Lecture 32: Asymptotic confidence sets and likelihoods

Lecture 32: Asymptotic confidence sets and likelihoods Lecture 32: Asymptotic confidence sets and likelihoods Asymptotic criterion In some problems, especially in nonparametric problems, it is difficult to find a reasonable confidence set with a given confidence

More information

Lecture Stat Information Criterion

Lecture Stat Information Criterion Lecture Stat 461-561 Information Criterion Arnaud Doucet February 2008 Arnaud Doucet () February 2008 1 / 34 Review of Maximum Likelihood Approach We have data X i i.i.d. g (x). We model the distribution

More information

Lecture 5: Hypothesis tests for more than one sample

Lecture 5: Hypothesis tests for more than one sample 1/23 Lecture 5: Hypothesis tests for more than one sample Måns Thulin Department of Mathematics, Uppsala University thulin@math.uu.se Multivariate Methods 8/4 2011 2/23 Outline Paired comparisons Repeated

More information

Asymptotic Nonequivalence of Nonparametric Experiments When the Smoothness Index is ½

Asymptotic Nonequivalence of Nonparametric Experiments When the Smoothness Index is ½ University of Pennsylvania ScholarlyCommons Statistics Papers Wharton Faculty Research 1998 Asymptotic Nonequivalence of Nonparametric Experiments When the Smoothness Index is ½ Lawrence D. Brown University

More information

An Introduction to Spectral Learning

An Introduction to Spectral Learning An Introduction to Spectral Learning Hanxiao Liu November 8, 2013 Outline 1 Method of Moments 2 Learning topic models using spectral properties 3 Anchor words Preliminaries X 1,, X n p (x; θ), θ = (θ 1,

More information

Strong Convergence of the Empirical Distribution of Eigenvalues of Large Dimensional Random Matrices

Strong Convergence of the Empirical Distribution of Eigenvalues of Large Dimensional Random Matrices Strong Convergence of the Empirical Distribution of Eigenvalues of Large Dimensional Random Matrices by Jack W. Silverstein* Department of Mathematics Box 8205 North Carolina State University Raleigh,

More information

arxiv: v1 [math.pr] 13 Aug 2007

arxiv: v1 [math.pr] 13 Aug 2007 The Annals of Probability 2007, Vol. 35, No. 4, 532 572 DOI: 0.24/009790600000079 c Institute of Mathematical Statistics, 2007 arxiv:0708.720v [math.pr] 3 Aug 2007 ON ASYMPTOTICS OF EIGENVECTORS OF LARGE

More information

Linear Regression. In this problem sheet, we consider the problem of linear regression with p predictors and one intercept,

Linear Regression. In this problem sheet, we consider the problem of linear regression with p predictors and one intercept, Linear Regression In this problem sheet, we consider the problem of linear regression with p predictors and one intercept, y = Xβ + ɛ, where y t = (y 1,..., y n ) is the column vector of target values,

More information

5.1 Consistency of least squares estimates. We begin with a few consistency results that stand on their own and do not depend on normality.

5.1 Consistency of least squares estimates. We begin with a few consistency results that stand on their own and do not depend on normality. 88 Chapter 5 Distribution Theory In this chapter, we summarize the distributions related to the normal distribution that occur in linear models. Before turning to this general problem that assumes normal

More information

Unit roots in vector time series. Scalar autoregression True model: y t 1 y t1 2 y t2 p y tp t Estimated model: y t c y t1 1 y t1 2 y t2

Unit roots in vector time series. Scalar autoregression True model: y t 1 y t1 2 y t2 p y tp t Estimated model: y t c y t1 1 y t1 2 y t2 Unit roots in vector time series A. Vector autoregressions with unit roots Scalar autoregression True model: y t y t y t p y tp t Estimated model: y t c y t y t y t p y tp t Results: T j j is asymptotically

More information

9.1 Orthogonal factor model.

9.1 Orthogonal factor model. 36 Chapter 9 Factor Analysis Factor analysis may be viewed as a refinement of the principal component analysis The objective is, like the PC analysis, to describe the relevant variables in study in terms

More information

Multivariate Gaussian Distribution. Auxiliary notes for Time Series Analysis SF2943. Spring 2013

Multivariate Gaussian Distribution. Auxiliary notes for Time Series Analysis SF2943. Spring 2013 Multivariate Gaussian Distribution Auxiliary notes for Time Series Analysis SF2943 Spring 203 Timo Koski Department of Mathematics KTH Royal Institute of Technology, Stockholm 2 Chapter Gaussian Vectors.

More information