On corrections of classical multivariate tests for high-dimensional data. Jian-feng Yao, Université de Rennes 1, IRMAR



Overview

1. Introduction: a two-sample problem
   - High-dimensional data and new challenges in statistics
   - A two-sample problem
2. Marčenko-Pastur distributions and one-sample problems
   - Sample vs. population covariance matrices
   - Marčenko-Pastur distributions
   - CLTs for linear spectral statistics of $S_n$
3. Random Fisher matrices and two-sample problems
   - Random Fisher matrices
4. Testing covariance matrices: one-sample case
   - Simulation study I
5. Testing covariance matrices: two-sample case
   - Simulation study II
6. Conclusions


High-dimensional data and new challenges in statistics

High-dimensional data $\neq$ high-dimensional models

- Nonparametric regression: a very high-dimensional model (i.e. an infinite-dimensional model), but with one-dimensional data:
  $y_i = f(x_i) + \varepsilon_i$, $f : \mathbb{R} \to \mathbb{R}$, $i = 1, \dots, n$.
- High-dimensional data: observation vectors $x_i \in \mathbb{R}^p$, with $p$ relatively large with respect to the sample size $n$.

High-dimensional data: some typical data dimensions

data                 dimension p    sample size n    y = p/n
portfolio            50             500              0.10
climate survey       320            600              0.21
speech analysis      a x 10^2       b x 10^2         ~ 1
ORL face database    1440           320              4.5
micro-arrays         2000           200              10

Important: $y = p/n$ is not always close to 0; it is frequently larger than 1.

A two-sample problem: the high-dimensional effect by an example

The two-sample problem:
- two independent samples $x_1, \dots, x_{n_1} \sim (\mu_1, \Sigma)$ and $y_1, \dots, y_{n_2} \sim (\mu_2, \Sigma)$;
- we want to test $H_0 : \mu_1 = \mu_2$ against $H_1 : \mu_1 \neq \mu_2$.

Classical approach: Hotelling's $T^2$ test,
$$T^2 = \frac{n_1 n_2}{n} (\bar{x} - \bar{y})' S_n^{-1} (\bar{x} - \bar{y}),$$
where
$$\bar{x} = \frac{1}{n_1} \sum_{i=1}^{n_1} x_i, \qquad \bar{y} = \frac{1}{n_2} \sum_{j=1}^{n_2} y_j, \qquad n = n_1 + n_2,$$
$$S_n = \frac{1}{n-2} \left[ \sum_{i=1}^{n_1} (x_i - \bar{x})(x_i - \bar{x})' + \sum_{j=1}^{n_2} (y_j - \bar{y})(y_j - \bar{y})' \right].$$
$S_n$ is a sample covariance matrix.
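As a minimal sketch of the classical statistic, the following computes the two-sample Hotelling $T^2$ with the pooled covariance matrix (divisor $n - 2$). The function name `hotelling_t2`, the simulated data and the mixing matrix `a` are illustrative assumptions, not part of the slides. The last lines check numerically that the statistic is unchanged when both samples undergo the same invertible linear transformation.

```python
import numpy as np


def hotelling_t2(x, y):
    """Two-sample Hotelling T^2 with the pooled covariance S_n (divisor n - 2).

    x: (n1, p) array, y: (n2, p) array, one observation per row.
    """
    n1, p = x.shape
    n2 = y.shape[0]
    n = n1 + n2
    if p > n - 2:
        # The pooled matrix has rank at most n - 2, so it cannot be inverted.
        raise ValueError("S_n is singular when p > n - 2")
    xbar, ybar = x.mean(axis=0), y.mean(axis=0)
    xc, yc = x - xbar, y - ybar
    s_n = (xc.T @ xc + yc.T @ yc) / (n - 2)
    d = xbar - ybar
    # Solve S_n u = d instead of forming the inverse explicitly.
    return (n1 * n2 / n) * (d @ np.linalg.solve(s_n, d))


# Illustrative data: p = 4, (n1, n2) = (25, 20), mean shift 0.5 in every coordinate.
rng = np.random.default_rng(0)
x = rng.standard_normal((25, 4))
y = rng.standard_normal((20, 4)) + 0.5
t2 = hotelling_t2(x, y)

# Same statistic after an invertible linear change of coordinates.
a = rng.standard_normal((4, 4)) + 4 * np.eye(4)
t2_transformed = hotelling_t2(x @ a, y @ a)
```

Calling the function with $p > n - 2$ raises an error rather than returning a value, which is the situation the non-exact and asymptotic normal tests are designed to handle.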

Hotelling's $T^2$ test: nice properties
- invariance under linear transformations;
- optimality in the Gaussian case (LRT, BLUE, ...).

Hotelling's $T^2$ test: bad news
- low power even for moderate data dimensions;
- very little is known for the non-Gaussian case;
- fatal deficiency: when $p > n - 2$, $S_n$ is not invertible.

Dempster's non-exact test (NET) (Dempster, 1958, 1960)

A reasonable test must be based on $\bar{x} - \bar{y}$ even when $p > n - 2$. Choose a new basis in $\mathbb{R}^n$ such that
1. axis 1 $\propto$ the grand mean,
2. axis 2 $\propto$ $(\bar{x} - \bar{y})$.

With an orthogonal $n \times n$ matrix $H_n$ and the $n \times p$ data matrix $X = (x_1, \dots, x_{n_1}, y_1, \dots, y_{n_2})'$, set
$$Z = H_n X = \begin{pmatrix} h_1' \\ \vdots \\ h_n' \end{pmatrix} X = \begin{pmatrix} z_1' \\ \vdots \\ z_n' \end{pmatrix}, \qquad h_1 = \frac{1}{\sqrt{n}} \mathbf{1}_n, \qquad h_2 = \begin{pmatrix} \sqrt{\frac{n_2}{n n_1}} \, \mathbf{1}_{n_1} \\ -\sqrt{\frac{n_1}{n n_2}} \, \mathbf{1}_{n_2} \end{pmatrix}.$$

Under normality, the $z_i$'s are $n$ independent $N_p(\cdot, \Sigma)$ vectors with
$$E z_1 = \frac{1}{\sqrt{n}} (n_1 \mu_1 + n_2 \mu_2), \qquad E z_2 = \sqrt{\frac{n_1 n_2}{n}} (\mu_1 - \mu_2), \qquad E z_i = 0, \quad i = 3, \dots, n.$$

Dempster's non-exact test (NET): test statistic

$$F_D = \frac{(n-2) \, \|z_2\|^2}{\|z_3\|^2 + \cdots + \|z_n\|^2}.$$

Under $H_0$, for $j \ge 2$,
$$\|z_j\|^2 \sim Q = \sum_{k=1}^{r} \alpha_k \chi^2_{1(k)},$$
where $\alpha_1 \ge \cdots \ge \alpha_r > 0$ are the non-null eigenvalues of $\Sigma$. The distribution of $F_D$ is complicated.

Approximations (hence the name "non-exact test"):
1. $Q \approx m \chi^2_r$;
2. next, estimate $r$ by $\hat{r}$.

Finally, under $H_0$, $F_D \approx F(\hat{r}, (n-2)\hat{r})$.

Dempster's non-exact test (NET): problems with the NET
- It is difficult to construct the orthogonal transformation $H_n = \{h_j\}$ for large $n$;
- even under Gaussianity, the exact power function depends on $H_n$.

Bai and Saranadasa's test (ANT) (Bai & Saranadasa, 1996)

Consider directly the statistic
$$M_n = \|\bar{x} - \bar{y}\|^2 - \frac{n}{n_1 n_2} \operatorname{tr} S_n.$$

Under $H_0$ and very mild conditions (here RMT comes in!),
$$\frac{M_n}{\sqrt{\sigma_n^2}} \Rightarrow N(0, 1), \qquad \sigma_n^2 = \operatorname{Var}(M_n) = \frac{2 n^2 (n-1)}{n_1^2 n_2^2 (n-2)} \operatorname{tr} \Sigma^2.$$

A ratio-consistent estimator:
$$\hat{\sigma}_n^2 = \frac{2 n (n-1)(n-2)}{n_1^2 n_2^2 (n-3)} \left[ \operatorname{tr} S_n^2 - \frac{1}{n-2} (\operatorname{tr} S_n)^2 \right], \qquad \hat{\sigma}_n^2 / \sigma_n^2 \xrightarrow{P} 1.$$

Finally, under $H_0$,
$$Z_n = \frac{M_n}{\sqrt{\hat{\sigma}_n^2}} \Rightarrow N(0, 1).$$
This is the Bai-Saranadasa asymptotic normal test (ANT).
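The following sketch computes $Z_n$ from two data matrices, assuming the pooled covariance $S_n$ uses the divisor $n - 2$ and that $\hat{\sigma}_n^2$ is the plug-in estimator $\frac{2n(n-1)(n-2)}{n_1^2 n_2^2 (n-3)} [\operatorname{tr} S_n^2 - (\operatorname{tr} S_n)^2/(n-2)]$. The function name `bai_saranadasa_z` and the simulated data are illustrative assumptions. No matrix inverse is needed, so the example deliberately uses $p = 100 > n - 2 = 43$.

```python
import numpy as np


def bai_saranadasa_z(x, y):
    """Asymptotic normal test statistic Z_n for H0: mu1 = mu2.

    x: (n1, p) array, y: (n2, p) array, one observation per row.
    """
    n1, p = x.shape
    n2 = y.shape[0]
    n = n1 + n2
    xbar, ybar = x.mean(axis=0), y.mean(axis=0)
    xc, yc = x - xbar, y - ybar
    s_n = (xc.T @ xc + yc.T @ yc) / (n - 2)  # pooled covariance, p x p
    tr_s = np.trace(s_n)
    tr_s2 = np.sum(s_n * s_n)  # tr(S_n^2) for a symmetric matrix
    d = xbar - ybar
    m_n = d @ d - n / (n1 * n2) * tr_s
    sigma2_hat = (
        2 * n * (n - 1) * (n - 2) / (n1**2 * n2**2 * (n - 3))
        * (tr_s2 - tr_s**2 / (n - 2))
    )
    return m_n / np.sqrt(sigma2_hat)


# Illustrative data with p larger than n - 2, where Hotelling's T^2 is undefined.
rng = np.random.default_rng(0)
p, n1, n2 = 100, 25, 20
z_null = bai_saranadasa_z(rng.standard_normal((n1, p)), rng.standard_normal((n2, p)))
z_shift = bai_saranadasa_z(rng.standard_normal((n1, p)), rng.standard_normal((n2, p)) + 0.5)
```

A one-sided test at level $\alpha$ would reject when the returned value exceeds the standard normal quantile $\Phi^{-1}(1 - \alpha)$.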

Comparison between $T^2$, NET and ANT: power functions (Bai & Saranadasa, 1996)

Assume $p, n \to \infty$, $p/n \to y \in (0, 1)$ and $n_1/n \to \kappa$. The power functions of Hotelling's $T^2$, Dempster's NET and Bai-Saranadasa's ANT are
$$\beta_H(\mu) = \Phi\left( -\xi_\alpha + \sqrt{\frac{n(1-y)}{2y}} \, \kappa(1-\kappa) \, \|\Sigma^{-1/2} \mu\|^2 \right) + o(1),$$
$$\beta_D(\mu) = \Phi\left( -\xi_\alpha + \frac{n}{\sqrt{2 \operatorname{tr} \Sigma^2}} \, \kappa(1-\kappa) \, \|\mu\|^2 \right) + o(1) = \beta_{BS}(\mu),$$
where $\alpha$ is the test size, $\mu = \mu_1 - \mu_2$ and $\xi_\alpha = \Phi^{-1}(1 - \alpha)$.

Important: because of the factor $(1 - y)$, $T^2$ loses power when $y$ increases, i.e. when $p$ increases relative to $n$.
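To make the effect of the factor $(1 - y)$ concrete, here is a small sketch that evaluates the leading terms of the two asymptotic power functions (the $o(1)$ remainders are dropped). It assumes $\Sigma = I_p$, so that $\|\Sigma^{-1/2}\mu\|^2 = \|\mu\|^2$ and $\operatorname{tr}\Sigma^2 = p$. The function names and the numerical values of $n$, $\kappa$, $p$ and $\|\mu\|^2$ are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def power_hotelling(n, y, kappa, maha_sq, alpha=0.05):
    """Leading term of beta_H; maha_sq stands for ||Sigma^{-1/2} mu||^2."""
    xi = STD_NORMAL.inv_cdf(1 - alpha)
    shift = sqrt(n * (1 - y) / (2 * y)) * kappa * (1 - kappa) * maha_sq
    return STD_NORMAL.cdf(-xi + shift)


def power_ant(n, kappa, mu_sq, tr_sigma_sq, alpha=0.05):
    """Leading term of beta_D = beta_BS; mu_sq stands for ||mu||^2."""
    xi = STD_NORMAL.inv_cdf(1 - alpha)
    shift = n * kappa * (1 - kappa) * mu_sq / sqrt(2 * tr_sigma_sq)
    return STD_NORMAL.cdf(-xi + shift)


# Illustrative setting: Sigma = I_p, n = 45, n1 = 25, ||mu||^2 = 1.
n, kappa, mu_sq = 45, 25 / 45, 1.0
rows = []
for p in (9, 18, 36):  # y = 0.2, 0.4, 0.8
    y = p / n
    rows.append((y, power_hotelling(n, y, kappa, mu_sq), power_ant(n, kappa, mu_sq, p)))

for y, beta_h, beta_bs in rows:
    print(f"y = {y:.1f}  Hotelling: {beta_h:.3f}  ANT/NET: {beta_bs:.3f}")
```

With $\Sigma = I_p$ the two shifts differ exactly by the factor $\sqrt{1 - y}$, so in this setting the ANT and NET leading term is never below that of $T^2$, and the gap widens as $y$ approaches 1.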

Comparison between $T^2$, NET and ANT: simulation results 1 (Gaussian case)
- Choice of covariance: $\Sigma = (1 - \rho) I_p + \rho J_p$, $J_p = \mathbf{1}_p \mathbf{1}_p'$;
- noncentrality parameter: $\eta = \dfrac{\|\mu_1 - \mu_2\|^2}{\sqrt{\operatorname{tr} \Sigma^2}}$;
- $(n_1, n_2) = (25, 20)$, $n = 45$.

Comparison between $T^2$, NET and ANT: simulation results 2 (Gamma variables)
- Choice of covariance: $\Sigma = \operatorname{tridiagonal}(\rho; 1 + \rho^2; \rho)$.

A summary of the introduction
- High-dimensional effects need to be taken into account;
- surprisingly, asymptotic methods based on RMT perform well even for small $p$ (as low as $p = 4$);
- many classical multivariate analysis methods have to be re-examined with respect to high-dimensional effects.


Sample vs. population covariance matrices

An effect of high dimensions: $S_n \neq \Sigma$.
- $x_i$, $1 \le i \le n$, i.i.d. $N_p(0, \Sigma)$;
- sample covariance matrix $S_n = \dfrac{1}{n} \sum_{i} x_i x_i'$.

[Figure: histogram of the eigenvalues of $S_n$ (distribution of eigenvalues), $p = 40$, $n = 160$, $y = p/n = 0.25$.]

[Figure: histogram of the eigenvalues of $S_n$ (distribution of eigenvalues), $p = 40$, $n = 320$, $y = p/n = 0.125$.]

The Marčenko-Pastur distribution

Theorem (Marčenko & Pastur, 1967). Assume:
- $X$ is a $p \times n$ array of i.i.d. variables with mean 0 and variance 1 ($\Sigma = I_p$), not necessarily Gaussian, but with finite 4th moment;
- $p, n \to \infty$, $p/n \to y \in (0, 1]$.

Then the (empirical) distribution of the eigenvalues of $S_n = \frac{1}{n} X X^T$ converges to the distribution with density function
$$f(x) = \frac{1}{2 \pi y x} \sqrt{(x - a)(b - x)}, \qquad a \le x \le b,$$
where $a = (1 - \sqrt{y})^2$ and $b = (1 + \sqrt{y})^2$.
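The spreading of the sample eigenvalues can be reproduced with a short simulation. This is a sketch under the assumptions of the theorem with Gaussian entries; the function name `sample_cov_eigenvalues` and the seed are illustrative choices, while $p = 40$ and $n = 160$ match the first histogram of the slides.

```python
import numpy as np


def sample_cov_eigenvalues(p, n, rng):
    """Eigenvalues (ascending) of S_n = X X^T / n for a p x n array of i.i.d. N(0, 1) entries."""
    x = rng.standard_normal((p, n))
    s_n = x @ x.T / n
    return np.linalg.eigvalsh(s_n)


rng = np.random.default_rng(1)
p, n = 40, 160
eig = sample_cov_eigenvalues(p, n, rng)

y = p / n
a, b = (1 - np.sqrt(y)) ** 2, (1 + np.sqrt(y)) ** 2  # Marcenko-Pastur support edges
print(f"population eigenvalues: all equal to 1")
print(f"sample eigenvalues: min = {eig.min():.3f}, max = {eig.max():.3f}")
print(f"Marcenko-Pastur support for y = {y}: [{a:.3f}, {b:.3f}]")
```

Although every population eigenvalue equals 1, the sample eigenvalues spread over an interval close to $[a, b]$, which is the high-dimensional effect the histograms illustrate.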

The Marčenko-Pastur distribution

$$f(x) = \frac{1}{2 \pi y x} \sqrt{(x - a)(b - x)}, \qquad (1 - \sqrt{y})^2 = a \le x \le b = (1 + \sqrt{y})^2.$$

y = p/n    a       b
1/8        0.42    1.83
1/4        0.25    2.25
1/2        0.09    2.91

[Figure: densities of the Marčenko-Pastur law for $y = 1/8$, $1/4$ and $1/2$.]
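The support edges in the table and the normalization of the density can be checked with a few lines of standard-library Python. This is only a sketch: the helper names `mp_edges`, `mp_density` and `integrate`, as well as the midpoint rule used for the integral, are illustrative choices.

```python
from math import pi, sqrt


def mp_edges(y):
    """Support edges a = (1 - sqrt(y))^2 and b = (1 + sqrt(y))^2."""
    return (1 - sqrt(y)) ** 2, (1 + sqrt(y)) ** 2


def mp_density(x, y):
    """Marcenko-Pastur density of index y in (0, 1], zero outside (a, b)."""
    a, b = mp_edges(y)
    if x <= a or x >= b:
        return 0.0
    return sqrt((x - a) * (b - x)) / (2 * pi * y * x)


def integrate(func, lo, hi, steps=20000):
    """Midpoint rule on [lo, hi]."""
    h = (hi - lo) / steps
    return h * sum(func(lo + (k + 0.5) * h) for k in range(steps))


for y in (1 / 8, 1 / 4, 1 / 2):
    a, b = mp_edges(y)
    mass = integrate(lambda x: mp_density(x, y), a, b)
    mean = integrate(lambda x: x * mp_density(x, y), a, b)
    print(f"y = {y:.3f}  a = {a:.2f}  b = {b:.2f}  mass = {mass:.4f}  mean = {mean:.4f}")
```

The numerical mass and mean should both be close to 1 for each index, in line with the fact that the population covariance is the identity.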

An explanation of the power deficiency of Hotelling's $T^2$
- When $p$ increases, even in the Gaussian case, $S_n$ differs from its population counterpart $\Sigma$;
- when $y = p/n \to 1$, the left edge $a \to 0$: small eigenvalues yield an instability of the $T^2$ statistic
  $$T^2 = \frac{n_1 n_2}{n} (\bar{x} - \bar{y})' S_n^{-1} (\bar{x} - \bar{y}).$$

CLT for linear spectral statistics of $S_n$

Set $y_n = p/n$ and let $U \subset \mathbb{C}$ be an open set with $[a, b] \subset U$. For any $g$ analytic on $U$, define
$$G_n(g) = p \left[ F_n(g) - \mu_{y_n}(g) \right], \qquad (1)$$
where $F_n$ is the empirical distribution of the eigenvalues of $S_n$, $F_n(g) = \int g \, dF_n$, and $\mu_\alpha$ is the MP distribution of index $\alpha \in (0, 1)$.

A CLT for linear spectral statistics

Theorem (Bai and Silverstein, 2004). Assume that:
- $g_1, \dots, g_k$ are $k$ analytic functions on $U$;
- the matrix entries $x_{ij}$ are i.i.d. real-valued random variables such that $E x_{ij} = 0$, $E x_{ij}^2 = 1$, $E x_{ij}^4 = 3$;
- as $n, p \to \infty$, $y_n = p/n \to y \in (0, 1)$.

Then
$$(G_n(g_1), \dots, G_n(g_k)) \Rightarrow N_k(m, V),$$
with a given mean vector $m = m(g_1, \dots, g_k)$ and asymptotic covariance matrix $V = V(g_1, \dots, g_k)$.


Random Fisher matrices

- Two independent samples $x_1, \dots, x_{n_1} \sim (0, I_p)$ and $y_1, \dots, y_{n_2} \sim (0, I_p)$, with i.i.d. coordinates of mean 0 and variance 1.
- Associated sample covariance matrices:
  $$S_1 = \frac{1}{n_1} \sum_{i=1}^{n_1} x_i x_i', \qquad S_2 = \frac{1}{n_2} \sum_{j=1}^{n_2} y_j y_j'.$$
- Fisher matrix: $V_n = S_1 S_2^{-1}$, where $n_2 > p$.

Random Fisher matrices: limiting spectral distribution

Assume
$$y_{n_1} = \frac{p}{n_1} \to y_1 \in (0, 1), \qquad y_{n_2} = \frac{p}{n_2} \to y_2 \in (0, 1).$$
Under mild moment conditions, the ESD $F_n^{V_n}$ of $V_n$ has an LSD $F_{y_1, y_2}$ with density
$$\ell(x) = \begin{cases} \dfrac{(1 - y_2) \sqrt{(b - x)(x - a)}}{2 \pi x (y_1 + y_2 x)}, & a \le x \le b, \\ 0, & \text{otherwise}, \end{cases}$$
where
$$a = (1 - y_2)^{-2} \left( 1 - \sqrt{y_1 + y_2 - y_1 y_2} \right)^2, \qquad b = (1 - y_2)^{-2} \left( 1 + \sqrt{y_1 + y_2 - y_1 y_2} \right)^2.$$
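As a sanity check on the density of the Fisher LSD, the sketch below evaluates its support and verifies numerically that it integrates to 1. It also compares the mean with $1/(1 - y_2)$, a value that is not stated on the slides and is used here only as an assumed reference point. The helper names and the indices $(y_1, y_2) = (0.2, 0.1)$ are illustrative.

```python
from math import pi, sqrt


def fisher_edges(y1, y2):
    """Support edges of the Fisher LSD with indices y1, y2 in (0, 1)."""
    h = sqrt(y1 + y2 - y1 * y2)
    return ((1 - h) / (1 - y2)) ** 2, ((1 + h) / (1 - y2)) ** 2


def fisher_density(x, y1, y2):
    """Density of the Fisher LSD, zero outside (a, b)."""
    a, b = fisher_edges(y1, y2)
    if x <= a or x >= b:
        return 0.0
    return (1 - y2) * sqrt((b - x) * (x - a)) / (2 * pi * x * (y1 + y2 * x))


def integrate(func, lo, hi, steps=20000):
    """Midpoint rule on [lo, hi]."""
    h = (hi - lo) / steps
    return h * sum(func(lo + (k + 0.5) * h) for k in range(steps))


y1, y2 = 0.2, 0.1
a, b = fisher_edges(y1, y2)
mass = integrate(lambda x: fisher_density(x, y1, y2), a, b)
mean = integrate(lambda x: x * fisher_density(x, y1, y2), a, b)
print(f"support = [{a:.4f}, {b:.4f}], mass = {mass:.4f}, mean = {mean:.4f}")
```

Letting $y_2 \to 0$ in `fisher_density` recovers the Marčenko-Pastur density of index $y_1$, which is a quick way to check the formula against the one-sample case.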

CLT for LSS of random Fisher matrices

- Let $\tilde{U} \subset \mathbb{C}$ be an open set including the interval
  $$\left[ I_{(0,1)}(y_1) \frac{(1 - \sqrt{y_1})^2}{(1 + \sqrt{y_2})^2}, \; \frac{(1 + \sqrt{y_1})^2}{(1 - \sqrt{y_2})^2} \right];$$
- for an analytic function $f$ on $\tilde{U}$, define
  $$\tilde{G}_n(f) = p \int_{-\infty}^{+\infty} f(x) \left[ F_n^{V_n} - F_{y_{n_1}, y_{n_2}} \right](dx),$$
  where $F_{y_{n_1}, y_{n_2}}$ is the LSD with indexes $y_{n_k}$, $k = 1, 2$.

CLT for LSS of random Fisher matrices

Theorem (Zheng, 2008). Assume $E x_{11}^4 = E y_{11}^4 < \infty$ and let $\beta = E |x_{11}|^4 - 3$. Then for any analytic functions $f_1, \dots, f_k$ defined on $\tilde{U}$,
$$\left[ \tilde{G}_n(f_1), \dots, \tilde{G}_n(f_k) \right] \Rightarrow N_k(m, \upsilon).$$

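CLT for LSS of random Fisher matrices: limiting mean function $m$ (Zheng, 2008)

$$m(f_j) = \lim_{r \to 1^+} \left[ (2) + (3) + (4) \right],$$
$$\frac{1}{4 \pi i} \oint_{|\zeta| = 1} f_j(z(\zeta)) \left[ \frac{1}{\zeta - \frac{1}{r}} + \frac{1}{\zeta + \frac{1}{r}} - \frac{2}{\zeta + \frac{y_2}{h r}} \right] d\zeta \qquad (2)$$
$$+ \frac{\beta \, y_1 (1 - y_2)^2}{2 \pi i \, h^2} \oint_{|\zeta| = 1} f_j(z(\zeta)) \frac{1}{\left( \zeta + \frac{y_2}{h r} \right)^3} \, d\zeta \qquad (3)$$
$$+ \frac{\beta \, y_2 (1 - y_2)}{2 \pi i \, h} \oint_{|\zeta| = 1} f_j(z(\zeta)) \frac{\zeta + \frac{1}{h r}}{\left( \zeta + \frac{y_2}{h r} \right)^3} \, d\zeta, \qquad (4)$$
where
$$z(\zeta) = (1 - y_2)^{-2} \left[ 1 + h^2 + 2 h \, \mathcal{R}(\zeta) \right], \qquad h = \sqrt{y_1 + y_2 - y_1 y_2},$$
and $\mathcal{R}(\zeta)$ denotes the real part of $\zeta$.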

CLT for LSS of random Fisher matrices: limiting covariance function $\upsilon$ (Zheng, 2008)

$$\upsilon(f_j, f_l) = \lim_{1 < r_1 < r_2 \to 1^+} \left[ (5) + (6) \right],$$
$$-\frac{1}{2 \pi^2} \oint_{|\zeta_2| = 1} \oint_{|\zeta_1| = 1} \frac{f_j(z(r_1 \zeta_1)) \, f_l(z(r_2 \zeta_2)) \, r_1 r_2}{(r_2 \zeta_2 - r_1 \zeta_1)^2} \, d\zeta_1 \, d\zeta_2, \qquad (5)$$
$$-\frac{\beta (y_1 + y_2)(1 - y_2)^2}{4 \pi^2 h^2} \oint_{|\zeta_1| = 1} \frac{f_j(z(\zeta_1))}{\left( \zeta_1 + \frac{y_2}{h r_1} \right)^2} \, d\zeta_1 \oint_{|\zeta_2| = 1} \frac{f_l(z(\zeta_2))}{\left( \zeta_2 + \frac{y_2}{h r_2} \right)^2} \, d\zeta_2, \qquad (6)$$
for $j, l \in \{1, \dots, k\}$.


One-sample test on covariance matrices

- A sample $x_1, \dots, x_n \sim N_p(\mu, \Sigma)$;
- we want to test $H_0 : \Sigma = I_p$;
- LR statistic:
  $$T_n = n \left[ \operatorname{tr} S_n - \log |S_n| - p \right],$$
  with the sample covariance matrix
  $$S_n = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})'.$$

Classical LRT: the data dimension $p$ is fixed and, when $n \to \infty$,
$$T_n \Rightarrow \chi^2_{p(p+1)/2}.$$
As the simulations show, this approximation rapidly becomes deficient when $p$ is not small.

RMT-corrected LRT

Theorem (Bai, Jiang, Yao and Zheng, 2009). Assume $p/n \to y \in (0, 1)$ and let $g(x) = x - \log x - 1$. Then, under $H_0$ and when $n \to \infty$,
$$\frac{T_n}{n} - p \, F^{y_n}(g) \Rightarrow N(m(g), \upsilon(g)),$$
where $F^{y_n}$ is the Marčenko-Pastur law of index $y_n$ and
$$m(g) = -\frac{\log(1 - y)}{2}, \qquad \upsilon(g) = -2 \log(1 - y) - 2y.$$
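A minimal sketch of the corrected statistic follows. It standardizes $T_n/n - p F^{y_n}(g)$ with $m(g)$ and $\upsilon(g)$ evaluated at $y_n = p/n$. The centering term uses the closed form $F^{y}(g) = 1 - \frac{y-1}{y} \log(1 - y)$, which does not appear on the slides and is stated here as an assumption. The function name `clrt_one_sample`, the seed and the alternative $\Sigma = \operatorname{diag}(1, 0.05, \dots, 0.05)$ used for illustration are likewise choices made for this example.

```python
import numpy as np


def clrt_one_sample(x):
    """Standardized corrected LRT statistic for H0: Sigma = I_p.

    x: (n, p) array with p < n, one observation per row.
    """
    n, p = x.shape
    y = p / n
    if not 0 < y < 1:
        raise ValueError("the correction requires 0 < p/n < 1")
    xc = x - x.mean(axis=0)
    s_n = xc.T @ xc / n
    _, logdet = np.linalg.slogdet(s_n)
    t_over_n = np.trace(s_n) - logdet - p  # T_n / n
    # Integral of g(x) = x - log x - 1 against the MP law of index y
    # (closed form assumed here, not taken from the slides).
    mp_g = 1 - (y - 1) / y * np.log(1 - y)
    mean = -0.5 * np.log(1 - y)
    var = -2 * np.log(1 - y) - 2 * y
    return (t_over_n - p * mp_g - mean) / np.sqrt(var)


rng = np.random.default_rng(2)
n, p = 500, 50
z_null = clrt_one_sample(rng.standard_normal((n, p)))

# Alternative with Sigma = diag(1, 0.05, ..., 0.05).
scale = np.sqrt(np.r_[1.0, np.full(p - 1, 0.05)])
z_alt = clrt_one_sample(rng.standard_normal((n, p)) * scale)
```

Under $H_0$ the returned value is approximately standard normal, so a test at level $\alpha$ would reject for values above $\Phi^{-1}(1 - \alpha)$.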

Simulation study I: comparison of LRT and CLRT by simulation

- Nominal test level $\alpha = 0.05$;
- for each $(p, n)$, 10,000 independent replications with real Gaussian variables;
- powers are estimated under the alternative $H_1 : \Sigma = \operatorname{diag}(1, 0.05, 0.05, 0.05, \dots, 0.05)$.

(p, n)        CLRT size   Difference with 5%   CLRT power   LRT size   LRT power
(5, 500)      0.0803      0.0303               0.6013       0.0521     0.5233
(10, 500)     0.0690      0.0190               0.9517       0.0555     0.9417
(50, 500)     0.0594      0.0094               1            0.2252     1
(100, 500)    0.0537      0.0037               1            0.9757     1
(300, 500)    0.0515      0.0015               1            1          1

[Figure: Simulation study I, type I error of CLRT and LRT against the dimension (0 to 300), $n = 500$.]


Two-sample test on covariance matrices

- Two samples $x_1, \dots, x_{n_1} \sim N_p(\mu_1, \Sigma_1)$ and $y_1, \dots, y_{n_2} \sim N_p(\mu_2, \Sigma_2)$;
- we want to test $H_0 : \Sigma_1 = \Sigma_2$.

The associated sample covariance matrices are
$$S_1 = \frac{1}{n_1} \sum_{i=1}^{n_1} (x_i - \bar{x})(x_i - \bar{x})', \qquad S_2 = \frac{1}{n_2} \sum_{i=1}^{n_2} (y_i - \bar{y})(y_i - \bar{y})'.$$
The LR statistic is
$$L_1 = \frac{\left| S_1 S_2^{-1} \right|^{n_1/2}}{\left| c_1 S_1 S_2^{-1} + c_2 I_p \right|^{n/2}},$$
where $n = n_1 + n_2$ and $c_k = n_k / n$, $k = 1, 2$.

Two-sample test on covariance matrices: classical LRT

The data dimension $p$ is fixed and, when $n_1, n_2 \to \infty$ and under $H_0$,
$$T_n = -2 \log L_1 \Rightarrow \chi^2_{p(p+1)/2}.$$
As the simulations show, this approximation rapidly becomes deficient when $p$ is not small.

RMT-corrected LRT

Theorem (Bai, Jiang, Yao and Zheng, 2009). Assume that the conditions of the CLT for LSS of Fisher matrices hold and let
$$f(x) = \log(y_1 + y_2 x) - \frac{y_2}{y_1 + y_2} \log x - \log(y_1 + y_2).$$
Then, under $H_0$ and as $n_1 \wedge n_2 \to \infty$,
$$-\frac{2 \log L_1}{n} - p \, F_{y_{n_1}, y_{n_2}}(f) \Rightarrow N(m(f), \upsilon(f)),$$
with
$$m(f) = \frac{1}{2} \left[ \log\left( \frac{y_1 + y_2 - y_1 y_2}{y_1 + y_2} \right) - \frac{y_1}{y_1 + y_2} \log(1 - y_2) - \frac{y_2}{y_1 + y_2} \log(1 - y_1) \right],$$
$$\upsilon(f) = -\frac{2 y_2^2}{(y_1 + y_2)^2} \log(1 - y_1) - \frac{2 y_1^2}{(y_1 + y_2)^2} \log(1 - y_2) - 2 \log \frac{y_1 + y_2}{y_1 + y_2 - y_1 y_2}.$$
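The two-sample correction can be sketched along the same lines. The code below assumes the identity $-\frac{2}{n} \log L_1 = \log|c_1 S_1 + c_2 S_2| - c_1 \log|S_1| - c_2 \log|S_2|$, which follows from the definition of $L_1$ with $c_1 + c_2 = 1$, and it obtains the centering term $F_{y_{n_1}, y_{n_2}}(f)$ by numerical integration against the Fisher LSD density rather than from a closed form. The function names, the substitution $x = \frac{a+b}{2} + \frac{b-a}{2} \sin\theta$ used for the quadrature, the seed and the alternative in the example are all illustrative assumptions.

```python
import numpy as np


def fisher_lsd_integral(f, y1, y2, steps=20000):
    """Integral of f against the Fisher LSD with indices y1, y2 (midpoint rule in theta)."""
    h = np.sqrt(y1 + y2 - y1 * y2)
    a = ((1 - h) / (1 - y2)) ** 2
    b = ((1 + h) / (1 - y2)) ** 2
    mid, half = (a + b) / 2, (b - a) / 2
    theta = (np.arange(steps) + 0.5) * np.pi / steps - np.pi / 2
    x = mid + half * np.sin(theta)
    # sqrt((b - x)(x - a)) = half * cos(theta) and dx = half * cos(theta) dtheta.
    weight = (1 - y2) * (half * np.cos(theta)) ** 2 / (2 * np.pi * x * (y1 + y2 * x))
    return np.sum(f(x) * weight) * np.pi / steps


def clrt_two_sample(x, y):
    """Standardized corrected LRT statistic for H0: Sigma1 = Sigma2.

    x: (n1, p) array, y: (n2, p) array, with p < min(n1, n2).
    """
    n1, p = x.shape
    n2 = y.shape[0]
    n = n1 + n2
    y1, y2 = p / n1, p / n2
    c1, c2 = n1 / n, n2 / n
    xc, yc = x - x.mean(axis=0), y - y.mean(axis=0)
    s1, s2 = xc.T @ xc / n1, yc.T @ yc / n2
    # -(2/n) log L1 written with three log-determinants.
    stat = (np.linalg.slogdet(c1 * s1 + c2 * s2)[1]
            - c1 * np.linalg.slogdet(s1)[1]
            - c2 * np.linalg.slogdet(s2)[1])
    s = y1 + y2

    def f(t):
        return np.log(y1 + y2 * t) - y2 / s * np.log(t) - np.log(s)

    center = p * fisher_lsd_integral(f, y1, y2)
    mean = 0.5 * (np.log((s - y1 * y2) / s)
                  - y1 / s * np.log(1 - y2) - y2 / s * np.log(1 - y1))
    var = (-2 * y2**2 / s**2 * np.log(1 - y1)
           - 2 * y1**2 / s**2 * np.log(1 - y2)
           - 2 * np.log(s / (s - y1 * y2)))
    return (stat - center - mean) / np.sqrt(var)


rng = np.random.default_rng(3)
p, n1, n2 = 20, 400, 400
z_null = clrt_two_sample(rng.standard_normal((n1, p)), rng.standard_normal((n2, p)))

# Illustrative alternative: the first coordinate of the first sample has variance 10.
x_alt = rng.standard_normal((n1, p))
x_alt[:, 0] *= np.sqrt(10.0)
z_alt = clrt_two_sample(x_alt, rng.standard_normal((n2, p)))
```

Integrating numerically keeps the sketch tied to the density of the Fisher LSD; a closed-form expression for the centering term, if available, could replace `fisher_lsd_integral` without changing the rest.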

Simulation study II: comparison of LRT and CLRT by simulation

- Nominal test level $\alpha = 0.05$;
- for each $(p, n_1, n_2)$, 10,000 independent replications with real Gaussian variables;
- powers are estimated under the alternative $H_1 : \Sigma_1 \Sigma_2^{-1} = \operatorname{diag}(3, 1, 1, \dots)$.

Simulation study II: comparison of LRT and CLRT by simulation, with $(y_1, y_2) = (0.05, 0.05)$

(p, n1, n2)          CLRT size   Difference with 5%   CLRT power   LRT size   LRT power
(5, 100, 100)        0.0770      0.0270               1            0.0582     1
(10, 200, 200)       0.0680      0.0180               1            0.0684     1
(20, 400, 400)       0.0593      0.0093               1            0.0872     1
(40, 800, 800)       0.0526      0.0026               1            0.1339     1
(80, 1600, 1600)     0.0501      0.0001               1            0.2687     1
(160, 3200, 3200)    0.0491      -0.0009              1            0.6488     1
(320, 6400, 6400)    0.0447      -0.0053              0.9671       1          1

Simulation study II: comparison of LRT and CLRT by simulation, with $(y_1, y_2) = (0.05, 0.1)$

(p, n1, n2)          CLRT size   Difference with 5%   CLRT power   LRT size   LRT power
(5, 100, 50)         0.0781      0.0281               0.9925       0.0640     0.9849
(10, 200, 100)       0.0617      0.0117               0.9847       0.0752     0.9904
(20, 400, 200)       0.0573      0.0073               0.9775       0.1104     0.9938
(40, 800, 400)       0.0561      0.0061               0.9765       0.2115     0.9975
(80, 1600, 800)      0.0521      0.0021               0.9702       0.4954     0.9998
(160, 3200, 1600)    0.0520      0.0020               0.9702       0.9433     1
(320, 6400, 3200)    0.0510      0.0010               1            0.9939     1

[Figure: Simulation study II, type I error of CLRT and LRT against the dimension (0 to 300); left panel $y_1 = 0.05$, $y_2 = 0.05$; right panel $y_1 = 0.05$, $y_2 = 0.1$.]


Some conclusions

- High-dimensional effects should be taken into account;
- RMT for sample covariance matrices is a powerful tool to correct classical multivariate procedures;
- each time some $\Sigma$ is to be estimated, be careful with the natural estimator $\hat{\Sigma} = S_n$.

Yet RMT is not sufficiently developed for statistics:
1. dependent observations (time series);
2. not identically distributed variables.

Some references

- Bai, Z. D. and Saranadasa, H. (1996). Effect of high dimension: comparison of significance tests for a high dimensional two sample problem. Statistica Sinica 6, 311-329.
- Bai, Z. D. and Silverstein, J. W. (2004). CLT for linear spectral statistics of large-dimensional sample covariance matrices. Ann. Probab. 32, 553-605.
- Bai, Z. D., Jiang, D., Yao, J. and Zheng, S. (2009). Corrections to LRT on large dimensional covariance matrix by RMT. Annals of Statistics 37, 3822-3840.
- Bai, Z. D., Jiang, D., Yao, J. and Zheng, S. (2009). Large Regression Analysis. Submitted.
- Dempster, A. P. (1958). A high dimensional two sample significance test. Ann. Math. Statist. 29, 995-1010.
- Zheng, S. (2008). Central limit theorem for linear spectral statistics of large dimensional F matrix. Preprint, Northeast Normal University.