Principal Component Analysis for a Spiked Covariance Model with Largest Eigenvalues of the Same Asymptotic Order of Magnitude

1 Principal Component Analysis for a Spiked Covariance Model with Largest Eigenvalues of the Same Asymptotic Order of Magnitude
Addy M. Bolívar Cimé, Centro de Investigación en Matemáticas A.C., May 1, 2010
Addy Bolívar (CIMAT), PCA for the Spiked Covariance Model, May 1, 2010

4 Contents
1 Introduction: Spiked Covariance Model; Recent results
2 Random Matrix Theory: Result derived with RMT
3 Asymptotic results: Consistency of eigenvalues; Consistency of eigenvectors

5 An important problem in multivariate statistical analysis is the estimation of the population covariance matrix. Principal component analysis (PCA) often estimates the population eigenvalues and eigenvectors poorly, since the sample covariance matrix is not a good approximation to the population covariance matrix when the data dimension is larger than the sample size.

6 Spiked Covariance Model As pointed out in Johnstone (2001), one often observes one or a small number of large sample eigenvalues well separated from the rest. In this case, of special interest is the so-called spiked covariance model.

7 More specifically, suppose we have a $d \times n$ data matrix $X = [x_1, x_2, \dots, x_n]$ with $d \gg n$, in the sense that $d/n \to \infty$, where $x_j = (x_{1j}, \dots, x_{dj})^\top$, $j = 1, 2, \dots, n$. Assume the columns of $X$ are independent and identically distributed random vectors from a multivariate Gaussian distribution with mean zero and unknown covariance matrix $\Sigma$. The spiked covariance model considers a covariance matrix of the type
$$\Sigma = \operatorname{diag}(\tau_1, \tau_2, \dots, \tau_p, 1, \dots, 1), \tag{1}$$
with $\tau_1 \ge \tau_2 \ge \dots \ge \tau_p > 1$, for some $1 \le p < d$.
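As a minimal sketch of model (1), the following Python snippet draws a $d \times n$ Gaussian data matrix with a spiked diagonal covariance and computes the nonzero sample eigenvalues. The function name `spiked_sample_eigenvalues`, the parameter values, and the use of NumPy are illustrative assumptions and do not come from the slides.

```python
import numpy as np

def spiked_sample_eigenvalues(d, n, taus, rng):
    """Nonzero eigenvalues of S = X X^T / n, where the columns of X are
    i.i.d. N(0, Sigma) with Sigma = diag(taus, 1, ..., 1)."""
    p = len(taus)
    scale = np.ones(d)
    scale[:p] = np.sqrt(taus)            # standard deviations of the spiked coordinates
    X = scale[:, None] * rng.standard_normal((d, n))   # d x n data matrix
    # X X^T / n (d x d) and X^T X / n (n x n) share their nonzero eigenvalues,
    # so the small n x n matrix is enough when d is much larger than n.
    gram = X.T @ X / n
    return np.sort(np.linalg.eigvalsh(gram))[::-1]     # descending order

rng = np.random.default_rng(0)
eigs = spiked_sample_eigenvalues(d=2000, n=20, taus=[500.0, 300.0], rng=rng)
print(eigs[:5])
```

Working with the $n \times n$ matrix $X^\top X / n$ avoids forming the $d \times d$ sample covariance matrix, which matters when $d \gg n$.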

8 Some results
Ahn, Marron, Muller and Chi (2007) showed that for $p = 1$ and $\Sigma = \operatorname{diag}(d^{\alpha}, 1, \dots, 1)$, with $\alpha > 1$, PCA estimates the first population eigenvalue and eigenvector very well when $d$ and $n$ are sufficiently large and $d \gg n$.
For the case $p \ge 2$, Jung and Marron (2009) showed the consistency of the $p$ largest sample eigenvalues under the assumption that each of the $p$ largest population eigenvalues, $\tau_1 \ge \tau_2 \ge \dots \ge \tau_p$, has a different asymptotic order of magnitude when the dimension increases, that is, $\tau_i / d^{\alpha_i} \to c_i$ as $d \to \infty$, where $\alpha_1 > \alpha_2 > \dots > \alpha_p > 1$ and $c_i > 0$, $i = 1, 2, \dots, p$.

9 They showed that if $\hat\tau_1 \ge \hat\tau_2 \ge \dots \ge \hat\tau_p$ are the $p$ largest sample eigenvalues, then
$$\frac{\hat\tau_i}{\tau_i} \xrightarrow{w} \frac{\chi^2_n}{n} \quad \text{as } d \to \infty, \tag{2}$$
for $i = 1, 2, \dots, p$, where $\chi^2_n$ is a chi-square random variable with $n$ degrees of freedom. Therefore, since $\chi^2_n / n \xrightarrow{w} 1$ as $n \to \infty$, the first $p$ sample eigenvalues are consistent. They also showed that in this case the corresponding sample eigenvectors are consistent.
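The limit (2) can be explored numerically in the single-spike case $p = 1$ with $\tau_1 = d^{\alpha}$. The sketch below compares the simulated mean and variance of the eigenvalue ratio with those of $\chi^2_n / n$, namely 1 and $2/n$. The function name, the seed, and the values of $d$, $n$ and $\alpha$ are assumptions chosen only for illustration; a finite $d$ gives an approximation, not the limit itself.

```python
import numpy as np

def first_eigenvalue_ratio(d, n, alpha, rng):
    """(largest sample eigenvalue) / tau_1 for Sigma = diag(d**alpha, 1, ..., 1)."""
    tau1 = float(d) ** alpha
    X = rng.standard_normal((d, n))
    X[0] *= np.sqrt(tau1)                 # the first coordinate carries the spike
    gram = X.T @ X / n                    # n x n dual of the sample covariance
    return np.linalg.eigvalsh(gram)[-1] / tau1   # eigvalsh sorts ascending

rng = np.random.default_rng(1)
n, d, alpha = 10, 2000, 1.5
ratios = np.array([first_eigenvalue_ratio(d, n, alpha, rng) for _ in range(200)])

# chi2_n / n has mean 1 and variance 2 / n.
print("simulated mean:", ratios.mean(), "reference:", 1.0)
print("simulated var :", ratios.var(ddof=1), "reference:", 2.0 / n)
```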

10 Purpose of this work
To study the asymptotic behaviour of the sample eigenvalues and eigenvectors, when $d$ and $n$ are sufficiently large and $d \gg n$, in the case of the spiked covariance model where the $p$ largest population eigenvalues $\tau_1, \dots, \tau_p$, $p > 1$, have the same asymptotic order of magnitude when the dimension increases, i.e. $\alpha_1 = \alpha_2 = \dots = \alpha_p = \alpha > 1$ and $c_1 = c_2 = \dots = c_p = c > 0$.

11 Under this last assumption, we first prove a multivariate extension of (2), namely
$$\left(\frac{\hat\tau_1}{\tau_1}, \frac{\hat\tau_2}{\tau_2}, \dots, \frac{\hat\tau_p}{\tau_p}\right) \xrightarrow{w} (cn)^{-1}(l_1, l_2, \dots, l_p) \quad \text{as } d \to \infty,$$
where $l_1 \ge l_2 \ge \dots \ge l_p$ are the nonzero eigenvalues of a Wishart random matrix with $p$ degrees of freedom and $n \times n$ covariance matrix $cI_n$.

12 Wishart Distribution

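13 Lemma 2.1 If $l_1 \ge l_2 \ge \dots \ge l_p$ are the nonzero eigenvalues of $U \sim W_n(p, cI_n)$ with $n > p$ and $c > 0$, then the joint density of $(l_1, l_2, \dots, l_p)$ is given by
$$f_{l_1, l_2, \dots, l_p}(l_1, l_2, \dots, l_p) = \frac{\pi^{p^2/2}}{(2c)^{pn/2}\, \Gamma_p\!\left(\frac{p}{2}\right) \Gamma_p\!\left(\frac{n}{2}\right)} \sum_{\alpha \in S_p} \operatorname{sign}(\alpha) \exp\!\left(-\frac{1}{2c} \sum_{k=1}^{p} l_k\right) \prod_{k=1}^{p} l_{p+1-k}^{\alpha_k + (n-p-1)/2 - 1}, \tag{3}$$
with $l_1 > l_2 > \dots > l_p$, where the leading factor is the normalizing constant, numbered (4) on the slide.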
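The sum over permutations in (3) is an expanded Vandermonde determinant: once the common factor $\prod_k l_k^{(n-p-1)/2}$ is pulled out, what remains should equal $\prod_{i<j}(l_i - l_j)$, the factor that appears in the usual form of the Wishart eigenvalue density. The sketch below checks this identity numerically for one arbitrary vector; the helper names and the test values are assumptions made for illustration.

```python
import itertools
import math
import numpy as np

def permutation_sign(perm):
    """Sign of a permutation of (1, ..., p), obtained by counting inversions."""
    inversions = sum(
        1
        for i in range(len(perm))
        for j in range(i + 1, len(perm))
        if perm[i] > perm[j]
    )
    return -1 if inversions % 2 else 1

def signed_permutation_sum(l):
    """Sum over alpha in S_p of sign(alpha) * prod_k l_{p+1-k}^(alpha_k - 1)."""
    p = len(l)
    total = 0.0
    for alpha in itertools.permutations(range(1, p + 1)):
        term = 1.0
        for k in range(1, p + 1):
            term *= l[p - k] ** (alpha[k - 1] - 1)   # l_{p+1-k} in 0-based indexing
        total += permutation_sign(alpha) * term
    return total

def vandermonde_product(l):
    """Product over i < j of (l_i - l_j)."""
    p = len(l)
    return math.prod(l[i] - l[j] for i in range(p) for j in range(i + 1, p))

l = np.array([5.0, 3.0, 2.0, 0.5])       # l_1 > l_2 > l_3 > l_4
print(signed_permutation_sum(l), vandermonde_product(l))
```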

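14 Proposition 2.1 If $l_1 \ge l_2 \ge \dots \ge l_p$ are the nonzero eigenvalues of $U \sim W_n(p, cI_n)$ with $n > p$ and $c > 0$, then the characteristic function of $(l_1, l_2, \dots, l_p)$ is given by
$$
\begin{aligned}
\varphi_{l_1, l_2, \dots, l_p}(t_1, t_2, \dots, t_p) = {} & \frac{\pi^{p^2/2}}{(2c)^{pn/2}\, \Gamma_p\!\left(\frac{p}{2}\right) \Gamma_p\!\left(\frac{n}{2}\right)} \left(\frac{2c}{p}\right)^{pn/2} \hat G\!\left(\sum_{j=1}^{p} t_j;\ \frac{pn}{2},\ \frac{2c}{p}\right) \\
& \times \sum_{k_1=0}^{\infty} \cdots \sum_{k_{p-1}=0}^{\infty} C_{n,p}(k_1, \dots, k_{p-1})\, \Gamma\!\left(\frac{pn}{2} + \sum_{j=1}^{p-1} k_j\right) \prod_{r=1}^{p-1} \left(\frac{r}{p}\right)^{k_r} \frac{\hat G\!\left(\sum_{j=1}^{p} t_j;\ k_r,\ \frac{2c}{p}\right)}{\hat G\!\left(\sum_{j=p+1-r}^{p} t_j;\ k_r,\ \frac{2c}{r}\right)},
\end{aligned} \tag{5}
$$
where the leading factor is the normalizing constant (4) of Lemma 2.1, $\hat G(t; a, b)$ is the characteristic function of the gamma distribution $\operatorname{Gamma}(a, b)$, and
$$C_{n,p}(k_1, \dots, k_{p-1}) = \sum_{\alpha \in S_p} \operatorname{sign}(\alpha) \prod_{r=1}^{p-1} \frac{\Gamma\!\left(\sum_{j=1}^{r} \alpha_j + \frac{r(n-p-1)}{2} + \sum_{j=1}^{r-1} k_j\right)}{\Gamma\!\left(\sum_{j=1}^{r} \alpha_j + \frac{r(n-p-1)}{2} + \sum_{j=1}^{r} k_j + 1\right)}. \tag{6}$$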

15 Theorem 3.1 Assume $X$ as above with unknown covariance matrix given by the spiked covariance model (1) with $p < n < d$, where $\tau_1 \ge \tau_2 \ge \dots \ge \tau_p > 1$ have the same asymptotic order of magnitude, that is, $\tau_i = \tau_i(d)$ and $d^{-\alpha} \tau_i(d) \to c$ as $d \to \infty$ for some $c > 0$ and $\alpha > 1$, $i = 1, 2, \dots, p$. Let $\hat\tau_1 \ge \hat\tau_2 \ge \dots \ge \hat\tau_p$ be the $p$ largest eigenvalues of the sample covariance matrix $S = n^{-1} X X^\top$. Then
$$\left(\frac{\hat\tau_1}{\tau_1}, \frac{\hat\tau_2}{\tau_2}, \dots, \frac{\hat\tau_p}{\tau_p}\right) \xrightarrow{w} (cn)^{-1}(l_1, l_2, \dots, l_p) \quad \text{as } d \to \infty,$$
where $l_1 \ge l_2 \ge \dots \ge l_p$ are the nonzero eigenvalues of a Wishart random matrix with distribution $W_n(p, cI_n)$.
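A rough Monte Carlo illustration of Theorem 3.1: the sketch below simulates the eigenvalue ratios for equal spikes $\tau_1 = \dots = \tau_p = c\,d^{\alpha}$ and compares their averages with the averages of $(cn)^{-1}(l_1, \dots, l_p)$ drawn directly from the Wishart model. The function names, seed, parameter values and the choice to compare only means are assumptions for illustration; agreement of means at one finite $d$ is a sanity check and not a verification of weak convergence.

```python
import numpy as np

def spiked_ratios(d, n, p, alpha, c, rng):
    """Ratios (sample eigenvalue / population eigenvalue) for the p spikes,
    taking tau_1 = ... = tau_p = c * d**alpha (an illustrative choice)."""
    tau = c * float(d) ** alpha
    X = rng.standard_normal((d, n))
    X[:p] *= np.sqrt(tau)                              # first p coordinates are spiked
    eigs = np.linalg.eigvalsh(X.T @ X / n)[::-1]       # descending order
    return eigs[:p] / tau

def wishart_limit(n, p, c, rng):
    """(c n)^{-1} times the nonzero eigenvalues of U ~ W_n(p, c I_n).
    U = Z Z^T with Z an n x p matrix of N(0, c) entries; its nonzero
    eigenvalues are the eigenvalues of the p x p matrix Z^T Z."""
    Z = np.sqrt(c) * rng.standard_normal((n, p))
    return np.linalg.eigvalsh(Z.T @ Z)[::-1] / (c * n)

rng = np.random.default_rng(2)
d, n, p, alpha, c, reps = 2000, 10, 2, 1.5, 1.0, 300
lhs = np.mean([spiked_ratios(d, n, p, alpha, c, rng) for _ in range(reps)], axis=0)
rhs = np.mean([wishart_limit(n, p, c, rng) for _ in range(reps)], axis=0)
print("mean ratios, spiked model:", lhs)
print("mean of Wishart limit    :", rhs)
```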

16 Proposition 3.1 Let $c > 0$ and let $l_1 \ge l_2 \ge \dots \ge l_p$ be the nonzero eigenvalues of $U \sim W_n(p, cI_n)$ with $n > p$. Then
$$(cn)^{-1}(l_1, l_2, \dots, l_p) \xrightarrow{w} (1, 1, \dots, 1) \quad \text{as } n \to \infty.$$
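The concentration in Proposition 3.1 can be seen in a small experiment. The sketch below draws the nonzero eigenvalues of $W_n(p, cI_n)$ for increasing $n$ and records how far the scaled eigenvalues are from 1. The function name, seed and parameter values are assumptions for illustration.

```python
import numpy as np

def scaled_wishart_eigenvalues(n, p, c, rng):
    """(c n)^{-1} times the nonzero eigenvalues of U ~ W_n(p, c I_n),
    computed from the p x p matrix Z^T Z with Z an n x p N(0, c) matrix."""
    Z = np.sqrt(c) * rng.standard_normal((n, p))
    return np.linalg.eigvalsh(Z.T @ Z)[::-1] / (c * n)

rng = np.random.default_rng(3)
p, c = 3, 2.0
deviations = {}
for n in (10, 1000, 100000):
    eigs = scaled_wishart_eigenvalues(n, p, c, rng)
    deviations[n] = float(np.max(np.abs(eigs - 1.0)))   # distance from (1, ..., 1)
    print(n, eigs, deviations[n])
```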

17 Theorem 3.2 Under the assumptions of Theorem 3.1, we have that for all $\epsilon > 0$
$$\lim_{n \to \infty} \lim_{d \to \infty} P\left( \left\lVert \left(\frac{\hat\tau_1}{\tau_1}, \frac{\hat\tau_2}{\tau_2}, \dots, \frac{\hat\tau_p}{\tau_p}\right) - (1, 1, \dots, 1) \right\rVert > \epsilon \right) = 0. \tag{7}$$

18 Theorem 3.3 Under the same conditions as Theorem 3.1, we have
$$d^{-\alpha}\, n\, (\hat\tau_1, \hat\tau_2, \dots, \hat\tau_n) \xrightarrow{w} (l_1, l_2, \dots, l_p, 0, \dots, 0) \quad \text{as } d \to \infty,$$
where $l_1 \ge l_2 \ge \dots \ge l_p$ are the nonzero eigenvalues of a Wishart random matrix with distribution $W_n(p, cI_n)$.
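To see the shape of the limit in Theorem 3.3, the sketch below rescales all $n$ nonzero sample eigenvalues by $d^{-\alpha} n$ for a single simulated data set. Under the theorem one would expect the first $p$ entries to be of order one and the remaining $n - p$ entries to be close to zero. The seed and the values of $d$, $n$, $p$, $\alpha$ and $c$ are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
d, n, p, alpha, c = 1000, 8, 2, 2.0, 1.0
tau = c * float(d) ** alpha               # equal spikes tau_i = c * d**alpha

X = rng.standard_normal((d, n))
X[:p] *= np.sqrt(tau)                     # first p coordinates are spiked

# The n nonzero eigenvalues of S = X X^T / n, in descending order.
sample_eigs = np.linalg.eigvalsh(X.T @ X / n)[::-1]
scaled = d ** (-alpha) * n * sample_eigs  # d^{-alpha} * n * (tau_hat_1, ..., tau_hat_n)
print(scaled)
```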

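19 We define
$$\operatorname{Angle}(v_j, \operatorname{span}\{u_i : i \in J\}) = \arccos\left( \frac{v_j^\top \left[\operatorname{Proj}_{\operatorname{span}\{u_i : i \in J\}} v_j\right]}{\lVert v_j \rVert\, \lVert \operatorname{Proj}_{\operatorname{span}\{u_i : i \in J\}} v_j \rVert} \right) = \arccos\left( \frac{v_j^\top \left( \sum_{i \in J} (u_i^\top v_j)\, u_i \right)}{\lVert v_j \rVert\, \left\lVert \sum_{i \in J} (u_i^\top v_j)\, u_i \right\rVert} \right).$$
Following the definition given in Jung and Marron (2009), we say that the sample eigenvector $v_j$, with $j \in J$, is subspace consistent if
$$\operatorname{Angle}(v_j, \operatorname{span}\{u_i : i \in J\}) \xrightarrow{P} 0.$$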
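The angle defined above translates directly into code. The sketch below assumes the vectors $u_i$, $i \in J$, are orthonormal and stored as the columns of a matrix `U`; the function name `angle_to_subspace` is hypothetical. The angle is undefined when the projection is the zero vector, a case the sketch does not handle.

```python
import numpy as np

def angle_to_subspace(v, U):
    """Angle between v and span of the columns of U.
    Assumes the columns of U are orthonormal and that v is not
    orthogonal to the subspace."""
    proj = U @ (U.T @ v)                  # sum over i of (u_i^T v) u_i
    cos = v @ proj / (np.linalg.norm(v) * np.linalg.norm(proj))
    return np.arccos(np.clip(cos, -1.0, 1.0))   # clip guards against rounding

e = np.eye(3)
# (1, 1, 0) against span{e_1}: the angle is pi / 4.
print(angle_to_subspace(np.array([1.0, 1.0, 0.0]), e[:, :1]))
# A vector inside span{e_1, e_2}: the angle is 0 up to rounding.
print(angle_to_subspace(np.array([3.0, -2.0, 0.0]), e[:, :2]))
```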

20 Theorem 3.4 Under the same conditions as Theorem 3.1, let $v_1, v_2, \dots, v_p$ be the sample eigenvectors corresponding to the first $p$ sample eigenvalues $\hat\tau_1 \ge \hat\tau_2 \ge \dots \ge \hat\tau_p$. Then, for every $\epsilon > 0$,
$$\lim_{n \to \infty} \lim_{d \to \infty} P\big(\operatorname{Angle}(v_i, \operatorname{span}\{e_j : j = 1, 2, \dots, p\}) > \epsilon\big) = 0, \tag{8}$$
for $i = 1, 2, \dots, p$.
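Theorem 3.4 can be illustrated on one simulated data set by measuring the angle between each of the first $p$ sample eigenvectors and the coordinate subspace $\operatorname{span}\{e_1, \dots, e_p\}$. The sketch below obtains the sample eigenvectors from a thin SVD of $X$; the seed and the values of $d$, $n$, $p$, $\alpha$ and $c$ are assumptions chosen for illustration, and small angles at one finite $(d, n)$ are only suggestive of the iterated limit in (8).

```python
import numpy as np

rng = np.random.default_rng(5)
d, n, p, alpha, c = 5000, 20, 2, 1.5, 1.0
tau = c * float(d) ** alpha               # equal spikes tau_i = c * d**alpha

X = rng.standard_normal((d, n))
X[:p] *= np.sqrt(tau)                     # first p coordinates are spiked

# The left singular vectors of X are the eigenvectors of S = X X^T / n
# for its nonzero eigenvalues; singular values come in descending order.
V, _, _ = np.linalg.svd(X, full_matrices=False)

angles = []
for i in range(p):
    v = V[:, i]
    # Projecting onto span{e_1, ..., e_p} keeps the first p coordinates,
    # so the cosine of the angle is ||v[:p]|| / ||v||.
    cos = np.linalg.norm(v[:p]) / np.linalg.norm(v)
    angles.append(float(np.arccos(np.clip(cos, -1.0, 1.0))))
print(angles)
```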
