Estimation of large dimensional sparse covariance matrices

1 Estimation of large dimensional sparse covariance matrices. Department of Statistics, UC Berkeley. May 5, 2009.

2 Sample covariance matrix and its eigenvalues. Data: n × p matrix X, holding n independent, identically distributed observations of a random vector (X_i)_{i=1}^n in R^p. Suppose the X_i's have covariance matrix Σ_p. Often interested in estimating Σ_p (e.g. for PCA); its eigenvalues: λ_i. Standard estimator: Σ̂_p = (X − X̄)'(X − X̄)/(n − 1); its eigenvalues: l_i.
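A minimal numpy sketch of the estimator above (names, seed, and dimensions are illustrative, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 200
X = rng.standard_normal((n, p))                    # n i.i.d. observations in R^p

Xbar = X.mean(axis=0)                              # vector of column means
Sigma_hat = (X - Xbar).T @ (X - Xbar) / (n - 1)    # standard estimator of Sigma_p
l = np.linalg.eigvalsh(Sigma_hat)[::-1]            # eigenvalues l_1 >= ... >= l_p
```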

3 Role of eigenvalues: examples. Principal Component Analysis (PCA): large eigenvalues. Optimization problems: example in risk management. Invest a_j in X_{t,j}. Risk measured by var(X_t' a) = a' Σ_p a. (Constraints on a in more realistic versions: e.g. Σ_j a_j = 1.) Role of small (and in fact all) eigenvalues.
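For the risk-management example, the minimizer of a' Σ_p a under Σ_j a_j = 1 has the closed form a = Σ_p^{-1} 1 / (1' Σ_p^{-1} 1). A hedged sketch (assumes Sigma is positive definite; the function name is mine):

```python
import numpy as np

def min_variance_weights(Sigma):
    """argmin of a' Sigma a subject to sum(a) == 1."""
    ones = np.ones(Sigma.shape[0])
    w = np.linalg.solve(Sigma, ones)   # Sigma^{-1} 1, without forming the inverse
    return w / w.sum()                 # rescale so the weights sum to 1
```

Note that the solution involves Σ_p^{-1}, so estimation errors in the small eigenvalues of an estimated Σ_p are magnified: hence the role of the small eigenvalues.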

4 Classical results from multivariate statistics. Classical theory: p fixed, n → ∞. Fact: Σ̂ is an unbiased, √n-consistent estimator of Σ: √n (Σ̂ − Σ) = O_P(1). Also, fluctuation theory (CLT) for the largest eigenvalues (Anderson, '63), under e.g. normality of the data. Much more is known about estimation of covariance matrices: Efron, Haff, Morris, Stein, etc. Will focus here on the large n, large p situation (quite common now), where p/n has a finite non-zero limit.

5 Visual example. [Figure: histogram of the eigenvalues of X'X/n, with X_i ∼ N(0, Id), p = 2000; the "true" (i.e. population) eigenvalues superimposed.]

6 Graphical illustration: n = 500, p = 200. [Figure: Marčenko–Pastur law and histogram of empirical eigenvalues. X: 500 × 200 matrix with i.i.d. N(0,1) entries; histogram of the eigenvalues of X'X/n with the corresponding Marčenko–Pastur density superimposed.] Note: Σ = Id, so all population (or "true") eigenvalues equal 1.

7 Surprising? The previous plots are perhaps surprising: in particular, the CLT implies that σ̂_ij − σ_ij = O_P(n^{−1/2}), i.e. good entrywise estimation, but spectral estimation is poor.

8 Importance of ρ_n = p/n: case of Σ = Id. Suppose the entries of the n × p matrix X are i.i.d. with mean 0, sd 1, and finite 4th moment. Population covariance = Id. Let ρ_n = p/n. Theorem (Marčenko–Pastur, '67): let l_i be the (decreasing) eigenvalues of (1/n) X'X. Assume ρ_n → ρ ∈ (0, 1]. If F_p(x) = (1/p) #{l_i ≤ x}, then F_p ⇒ F_ρ in probability, where F_ρ has a known density. Support: [a, b], with a = (1 − ρ^{1/2})^2, b = (1 + ρ^{1/2})^2. Bias/inconsistency problems for these asymptotics.
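A quick numerical check of the support statement, under the theorem's assumptions (parameters and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 2000, 1000                        # rho = p/n = 0.5
X = rng.standard_normal((n, p))          # i.i.d. entries, mean 0, sd 1
l = np.linalg.eigvalsh(X.T @ X / n)      # eigenvalues of X'X/n

rho = p / n
a, b = (1 - rho**0.5) ** 2, (1 + rho**0.5) ** 2
print(l.min(), a)   # smallest eigenvalue close to a ~ 0.0858
print(l.max(), b)   # largest eigenvalue close to b ~ 2.9142
```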

9 Marčenko–Pastur law illustration. Population covariance: Σ = Id. Here the density of the limiting spectral distribution is computable. [Figure: density of the Marčenko–Pastur law for varying ρ.]

10 Remark about extensions/limitations. RMT also gives results on the limiting spectral distribution of the eigenvalues when Σ ≠ Id: models of the form X_i = Σ^{1/2} Y_i, with the entries of Y_i i.i.d. (say 4 + ɛ moments). Problem: geometric implications for the data; near orthogonality and near constant norm of X_i/√p when λ_1(Σ) stays bounded. Two models can have the same Σ and different limiting spectral distributions.

11 Estimation problems. The previous results highlight the fact that naive estimation of the covariance matrix is inconsistent in high dimension. Questions: can we find procedures that consistently estimate spectral properties of these matrices (i.e. also eigenspaces), and that are more robust to distributional assumptions? Can we exploit good entrywise estimation to improve spectral estimation?

12 Previous work: related ideas. Banding scheme proposed by P. Bickel and L. Levina ('06): consider population covariance matrices with small entries away from the diagonal. Technically: Σ_p satisfies max_i Σ_{j: |i−j|>m} |σ(i, j)| < C m^{−α}. Recommend banding: set σ̂_{i,j} = 0 if |i − j| > k, otherwise keep the estimate from Σ̂_p. Call this estimator B_k(Σ̂_p). Theorem (Bickel and Levina): if k ≍ (n^{−1} log p)^{−1/(2(α+1))}, plus tail decay conditions, then ‖B_k(Σ̂_p) − Σ_p‖_2 → 0.
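A minimal sketch of the banding operator B_k (assuming the variables come with a fixed, meaningful ordering; the helper name is mine):

```python
import numpy as np

def band(S, k):
    """B_k(S): keep entries with |i - j| <= k, zero out the rest."""
    p = S.shape[0]
    i, j = np.indices((p, p))
    return np.where(np.abs(i - j) <= k, S, 0.0)
```

Banding is fast but depends on the ordering of the variables, which motivates the permutation-insensitive thresholding approach on the next slides.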

13 Sparsity: introduction. Oft-made assumption: many σ(i, j) are 0 or small. Appeal of thresholding (i.e. keep large values, set small ones to 0). Aim: find simple methods for improving estimation. Practical requirements: speed, parallelizability, limited assumptions. Theoretical requirements: good convergence properties in spectral norm (‖·‖_2); estimator not sensitive to the ordering of the variables.

14 What is a good estimator? Requirement: ‖Σ̂_p − Σ_p‖_2 → 0, where (for symmetric matrices) ‖M‖_2 = σ_1(M) = λ_1^{1/2}(M'M) = max_i |λ_i(M)|. Convergence in ‖·‖_2 implies: consistency of all eigenvalues (Weyl's inequality); consistency of stable subspaces corresponding to separated eigenvalues (Davis–Kahan sin θ theorem). Our aim: a permutation-equivariant, operator-norm consistent estimator. Our strategy: (hard) thresholding.
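A small numerical illustration of the Weyl bound |λ_i(A + E) − λ_i(A)| ≤ ‖E‖_2 (random symmetric matrices; purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
p = 100
A = rng.standard_normal((p, p)); A = (A + A.T) / 2          # symmetric matrix
E = 0.01 * rng.standard_normal((p, p)); E = (E + E.T) / 2   # small symmetric perturbation

shift = np.abs(np.linalg.eigvalsh(A + E) - np.linalg.eigvalsh(A))
assert shift.max() <= np.linalg.norm(E, 2) + 1e-10          # Weyl's inequality
```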

15 Standard notion of sparsity: ill-suited for spectral problems. Standard notion: count the number of non-zero elements. E_1: 1's on the diagonal, entries 1/√p filling the first row and first column. E_2: 1's on the diagonal, entries 1/√p on the first off-diagonals (tridiagonal). Both have comparably many non-zero entries, yet: eigenvalues of E_1: 1 ± √((p−1)/p), and 1 (multiplicity p − 2); eigenvalues of E_2: 1 + (2/√p) cos(πk/(p+1)), k = 1, …, p.

16 Adjacency matrices. E_1 (entries 1/√p in the first row and column, 1's on the diagonal) has adjacency matrix A_1: 1's wherever E_1 is non-zero, 0's elsewhere.

17 Adjacency matrices. E_2 (tridiagonal, off-diagonal entries 1/√p, 1's on the diagonal) has adjacency matrix A_2: the 0–1 tridiagonal matrix.

18 Adjacency matrices and graphs: comparison via closed walks of length k. Closed walk: a walk that starts and finishes at the same vertex. Length of a walk: the number of vertices it traverses.

19 Adjacency matrices and graphs: connection with the spectrum of covariance matrices. Closed walk of length k, γ: i_1 → i_2 → … → i_k → i_{k+1} = i_1. Weight of γ: w_γ = σ(i_1, i_2) ⋯ σ(i_k, i_1). [Figure: a 6-vertex graph with edges and loops labeled by the entries σ(i, j).] Then trace(Σ^k) = Σ_i λ_i^k(Σ) = Σ_{γ ∈ C_p(k)} w_γ.

20 Notion of sparsity compatible with spectral analysis: a proposal. Given Σ_p, a p × p covariance matrix, compute the adjacency matrix A_p = (1_{σ(i,j) ≠ 0}) and associate a graph G_p to it. Consider C_p(k) = {closed walks of length k on the graph with adjacency matrix A_p} and φ_p(k) = Card{C_p(k)} = trace(A_p^k). Call the sequence Σ_p β-sparse if, for all k ∈ 2N, φ_p(k) ≤ f(k) p^{β(k−1)+1}, where f(k) is independent of p and 0 ≤ β < 1.
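The definition can be probed numerically: φ_p(k) is just the trace of the k-th power of the 0–1 adjacency matrix. A sketch (function name is mine):

```python
import numpy as np

def phi(Sigma, k):
    """phi_p(k) = trace(A_p^k): the number of closed walks of length k."""
    A = (Sigma != 0).astype(float)               # adjacency matrix A_p
    return np.trace(np.linalg.matrix_power(A, k))

# Diagonal example from the next slide: phi(k) = p for all k.
print(phi(np.eye(50), 4))                        # 50.0
```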

21 Examples: computation of sparsity coefficients. Diagonal matrix: A_p = Id_p, so φ(k) = p for all k; sparsity coefficient: 0. Matrices with at most M non-zero elements in each row: φ(k) ≤ p M^{k−1}; sparsity coefficient: 0. Matrices with at most M p^α non-zero elements in each row: φ(k) ≤ M^{k−1} p · p^{α(k−1)}; sparsity coefficient: α.

22 Assumptions underlying the results. In all that follows:
1. the Σ_p(i, i) stay bounded;
2. the X_{i,j} have infinitely many moments;
3. the rows of the (n × p) data matrix X are i.i.d.;
4. p/n → l ∈ (0, ∞).

23 Simple case: gap in the entries of the covariance matrix. Gaussian MLE, centered case: X_{i,j} centered; cov(X_i) = Σ_p. Suppose Σ_p is β-sparse, with β = 1/2 − η and η > 0. Theorem: let S_p = (1/n) Σ_{i=1}^n X_i X_i' = X'X/n, and T_α(S_p)(i, j) = S_p(i, j) 1_{|S_p(i,j)| > C n^{−α}}, the thresholded version of S_p at level C n^{−α}. Then, if α = β + ɛ with ɛ < η/2, ‖T_α(S_p) − Σ_p‖_2 → 0 in probability.
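A minimal sketch of the hard-thresholding estimator T_α(S_p) (the constant C is a tuning choice; the theorem suggests taking α slightly above β):

```python
import numpy as np

def hard_threshold(S, n, alpha, C=1.0):
    """T_alpha(S)(i,j) = S(i,j) * 1{|S(i,j)| > C * n**(-alpha)}."""
    level = C * n ** (-alpha)
    return np.where(np.abs(S) > level, S, 0.0)
```

The estimator is permutation-equivariant: relabeling the variables permutes the rows and columns of S and of T_α(S) in the same way.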

24 Beyond truly sparse matrices: approximation by sparse matrices. How does thresholding perform on matrices that are merely approximated by sparse matrices? Suppose T_{α_1}(Σ_p) = Σ̃_p is β-sparse, and that ‖Σ̃_p − Σ_p‖_2 → 0. Suppose there exist α_0 < α_1 < 1/2 − δ_0 such that the adjacency matrix of the (i, j)'s with C n^{−α_1} < |σ(i, j)| < C n^{−α_0} is γ-sparse, with γ ≤ α_0 − ζ_0, ζ_0 > 0. Proposition: then the conclusions of the theorem above apply: for α ∈ (α_0, α_1), ‖T_α(S_p) − Σ_p‖_2 → 0 in probability.

25 Complements. The theorems apply to the sample covariance matrix, i.e. (X − X̄)'(X − X̄)/(n − 1), not just the Gaussian MLE; also to correlation matrices. A finite number of moments is possible; p = n^r for certain r depending on β. The proof provides finite-(n, p) bounds on the deviation probabilities of ‖T_α(S_p) − Σ_p‖_2. Example: Σ_p^{(1)}(i, j) = ρ^{|i−j|}, and Σ_p = P Σ_p^{(1)} P', with P a random permutation matrix. It can be approximated by Σ̃_p = T_{n^{−1/2+ɛ}}(Σ_p); Σ̃_p is asymptotically 0-sparse, so the proposition above applies when the moment conditions are satisfied.
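The permuted-Toeplitz example is easy to generate, and makes the point that thresholding ignores the variable ordering while banding does not (sizes and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
p, rho = 200, 0.5
i, j = np.indices((p, p))
Sigma1 = rho ** np.abs(i - j)        # Sigma^(1)(i,j) = rho^|i-j|

perm = rng.permutation(p)            # encodes the random permutation matrix P
Sigma = Sigma1[np.ix_(perm, perm)]   # P Sigma^(1) P'
```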

26 Sharpness of the 1/2-sparse assumption. Consider Σ_p with 1's on the diagonal and first row and column (1, α_2, α_3, …, α_p), all other entries 0, with e.g. α_i = 1/√p. Σ_p is 1/2-sparse (eigenvalues of A_p: 1 ± √(p − 1), and 1's). The oracle estimator O(Σ̂_p) of Σ_p is inconsistent in the Gaussian case: ‖O(Σ̂_p) − Σ_p‖_2^2 = Σ_{i=2}^p (α̂_i − α_i)^2 ↛ 0. So the 1/2 − η sparsity assumption is sharp.

27 Practical observations: Asymptopia? Practical implementation. Practicalities: implemented the thresholding technique using FDR at level 1/√p. Theory: great results possible. Practice: good results (n, p from a few 100's to a few 1000's), though maybe not as good as hoped for. Practice is very good when there are few coefficients to estimate, far away from σ/√n (unsurprisingly); making mistakes is OK. Softer techniques (Huber-type) might be good alternatives, especially for non-sparse matrices. Eigenvector practice is somewhat disappointing: in hard situations it performs OK; in less hard situations, results are comparable to or a bit worse than sparse PCA.

28 Conclusions. Highlighted some difficulties in covariance estimation in the large n, large p setting. Proposed a notion of sparsity compatible with spectral analysis.

29 Real world example. Data analysis: portfolio of industry indexes. Data: daily returns of 48 indexes, by industry (Agriculture, Toys, Beer, Soda, etc.), from Kenneth French's website at Dartmouth. p = 48; 2 years of observations, n = 504. Effect of shrinkage on the smallest eigenvalue: empirical l vs. shrunk l = 0.27. Through subsampling, get a (0.205, 0.303) 95% CI.

30 Data analysis: portfolio of industry indexes. [Figure: scree plot for the Industry Index portfolio data, smallest eigenvalue highlighted.]

31 Data analysis: portfolio of industry indexes. [Figure: subsampling distribution of the adjusted estimator of the smallest eigenvalue; industry indexes, n = 504, p = 48.]

32 (Easy) Example 1: Toeplitz(1, 0.3, 0.4), n = p = 500. [Figure: eigenvalues of the thresholded matrix vs. population eigenvalues.]

33 (Easy) Example 1: Toeplitz(1, 0.3, 0.4), n = p = 500. [Figure: oracle eigenvalues, thresholded-matrix eigenvalues, and population eigenvalues.]

34 (Easy) Example 1: Toeplitz(1, 0.3, 0.4), n = p = 500. [Figure: sample covariance eigenvalues, thresholded-matrix eigenvalues, and population eigenvalues.]

35 Example 2: Toeplitz(2, 0.2, 0.3, 0, −0.4), n = p. [Figure: population spectrum vs. spectrum of the thresholded matrix.]

36 Example 2: Toeplitz(2, 0.2, 0.3, 0, −0.4), n = p. [Figure: population spectrum, sample covariance spectrum, and spectrum of the thresholded matrix.]

37 Example 2: Toeplitz(2, 0.2, 0.3, 0, −0.4), independent simulation. [Figure: black line: population eigenvalues; blue: thresholded-matrix eigenvalues; green +: oracle banded matrix; red ·: oracle matrix.]

38 Estimating the theoretical frontier: practical implementation. Data: from the Fama–French website; one year, n = 252; p = 48 industry indexes. Orders of magnitude: α̂ = 0.040, β̂ = …, γ̂ = …, â = …, b̂ = …, ĉ = 0.563; also δ̂ = …, d̂ = 0.0208.

39 Estimating the theoretical frontier: practical implementation, 252 days, 47 assets; raw data. [Figure: correction to the empirical frontier, one year of data; target returns vs. σ, comparing the naive plug-in estimator with the corrected estimator.]

40 Estimating the theoretical frontier: practical implementation, 252 days, 47 assets; simulations. [Figure: correction of the efficient frontier, simulation results; curves: true efficient frontier; mean, median, and 95% CI of the corrected frontier; mean and 95% CI of the uncorrected frontier.]

41 Eigenfaces: an interesting example of PCA. Question: compression of digitized pictures of human faces. Idea: gather a database of faces. Example: the ORL face database, 10 pictures of 40 distinct subjects. Pictures are, say, 92 × 112 = 10304 pixels, 256 levels of gray. Code them as vectors; get a data matrix X. Run Principal Component Analysis on the corresponding matrix. Eigenvectors corresponding to large eigenvalues are "eigenfaces". Need only a few numbers to approximate a new face: the coefficients of its projection on the eigenfaces.
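A hedged sketch of the eigenface computation via the SVD (X is assumed to hold one vectorized face per row; image loading is omitted and the function name is mine):

```python
import numpy as np

def eigenfaces(X, k):
    """Return the mean face, the top-k eigenfaces, and per-image coefficients."""
    mean_face = X.mean(axis=0)
    Xc = X - mean_face                                 # center the images
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    faces = Vt[:k]                                     # top-k right singular vectors
    coeffs = Xc @ faces.T                              # projection coefficients
    return mean_face, faces, coeffs
```

A new face x is then approximated by mean_face + ((x − mean_face) @ faces.T) @ faces, i.e. by just k numbers.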

42 Some eigenfaces for the ORL face database. Data obtained from Wikipedia; copyright AT&T Laboratories Cambridge.
