Regularized Estimation of High Dimensional Covariance Matrices. Peter Bickel. January, 2008


Regularized Estimation of High Dimensional Covariance Matrices
Peter Bickel, Cambridge, January 2008
With thanks to E. Levina (joint collaboration, slides), I. M. Johnstone (slides), and Choongsoon Bae (slides).

Outline
1. Why estimate covariance matrices?
2. Some examples
3. Pathologies of the empirical covariance matrix
4. Sparsity in various guises
5. A brief review of methods
6. Asymptotic theory for banding, thresholding, SPICE (lasso-penalized Gaussian log likelihood)
7. Simulations and data
8. Some future directions

Model
X_1, ..., X_n: p-vectors, i.i.d. F.
Base model: F = N_p(µ, Σ).
More generally: E_F ||X||^2 < ∞, µ(F) = E_F X, Σ(F) = E_F (X − µ)(X − µ)^T.
Asymptotic theory: n → ∞, p → ∞, Σ ∈ T_{p,n}.
Extensions to stationary X_1, ..., X_n: Wei Biao Wu (2007).

Why Estimate the Covariance Matrix?
- Principal component analysis (PCA)
- Linear or quadratic discriminant analysis (LDA/QDA)
- Inferring independence and conditional independence in Gaussian models (graphical models)
- Inference about the mean (e.g. longitudinal mean response curve)
The covariance itself is usually not the end goal:
- PCA: estimation of the eigenstructure
- LDA/QDA, graphical models: estimation of inverses

Covariance Matrices in Climate Research: Example 1
Data courtesy of Serge Guillas. Empirical orthogonal functions (EOF), aka principal components.
X_1, ..., X_n: p-dimensional stationary vector time series.
Monthly January temperatures, 1850-2006, on a 5° × 5° latitude-longitude grid of cells: p = 2592, n = 157.

[Figure: Example 1, EOF pattern #1 (clim.pact).]

Example 1 (continued)
X_k = {X(i, j) : i = latitude, j = longitude}
X_{n×p} = (X_1^T, ..., X_n^T)^T, E(X) = 0
Σ = Var(X) = E(XX^T) = Σ_{j=1}^p λ_j e_j e_j^T
e_1, ..., e_p: principal components; λ_1 > ... > λ_p: eigenvalues.
Goal: estimate and interpret e_j, j = 1, ..., K, such that (Σ_{j=1}^K λ_j) / (Σ_{j=1}^p λ_j) is large.

Covariance Matrices in Climate Research: Example 2 (data assimilation)
X_j = average pressure, temperature, ... in a 50 km × 50 km variable block of atmosphere, J ≈ 10^7; computer model.
X_i = X(t_i), i = 1, ..., T; Y(t_i): data vectors.
Ensemble: X_j^U(t), 1 ≤ j ≤ n. Data assimilated: X_j^F(t), 1 ≤ j ≤ n, uses Σ̂^{-1} as an estimate of the true [Var(X_1^U)]^{-1} for the Kalman gain.

Pathologies of the Empirical Covariance Matrix
Observe X_1, ..., X_n, i.i.d. p-variate random variables.
Σ̂ = (1/n) Σ_{i=1}^n (X_i − X̄)(X_i − X̄)^T
The MLE in the Gaussian case: unbiased (almost), well-behaved (and well studied) for fixed p and n → ∞. But:
- Very noisy if p is large.
- Singular if p > n, so Σ̂^{-1} is not uniquely defined.
- Computational issues with Σ̂^{-1} for large p.
- LDA completely breaks down if p/n → ∞.
- The eigenstructure can be inconsistent if p/n → γ > 0.
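
A minimal numpy sketch (my addition, not from the talk) illustrating the first two pathologies: for p > n the sample covariance is singular, and its eigenvalues spread far from the truth even when Σ = I.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 200                          # more variables than observations
X = rng.standard_normal((n, p))         # true covariance is the identity

S = np.cov(X, rowvar=False, bias=True)  # empirical covariance (divides by n)
eigvals = np.linalg.eigvalsh(S)

print("rank of S:", np.linalg.matrix_rank(S))   # at most n - 1 < p, so S is singular
print("largest eigenvalue:", eigvals.max())     # far above the true value 1
print("smallest eigenvalue:", eigvals.min())    # ~0 although lambda_min(Sigma) = 1
```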

Eigenvalues
Description of the spreading phenomenon. Empirical distribution function of the eigenvalues {l̂_j}_{j=1}^p:
G_p(t) = p^{-1} #{ l̂_j ≤ t } → G(t) = ∫ g(t) dt.
Marčenko-Pastur (1967): for F = N_p(0, I), p/n → γ,
g_MP(t) = √((b_+ − t)(t − b_−)) / (2πγt), b_± = (1 ± √γ)^2.
[Figure: Marčenko-Pastur densities for n = p and n = 4p.]
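
A short simulation sketch (my addition) comparing the sample-covariance eigenvalues to the Marčenko-Pastur density g_MP with γ = p/n.

```python
import numpy as np

def mp_density(t, gamma):
    """Marčenko-Pastur density for N_p(0, I) data with p/n -> gamma."""
    b_minus, b_plus = (1 - np.sqrt(gamma)) ** 2, (1 + np.sqrt(gamma)) ** 2
    inside = (t > b_minus) & (t < b_plus)
    out = np.zeros_like(t, dtype=float)
    out[inside] = np.sqrt((b_plus - t[inside]) * (t[inside] - b_minus)) / (2 * np.pi * gamma * t[inside])
    return out

rng = np.random.default_rng(0)
n, p = 400, 100                                   # gamma = p/n = 0.25
X = rng.standard_normal((n, p))
eigs = np.linalg.eigvalsh(np.cov(X, rowvar=False, bias=True))

hist, edges = np.histogram(eigs, bins=25, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
for c, h in zip(centers, hist):
    print(f"t = {c:5.2f}   empirical {h:5.2f}   MP {mp_density(np.array([c]), p / n)[0]:5.2f}")
```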

Eigenvectors
D. Paul (2006): for N_p(0, Σ) with Σ = diag(λ, 1, ..., 1), let ê be the leading sample eigenvector (corresponding to λ̂) and e the leading population eigenvector (corresponding to λ). If 1 < λ < 1 + √γ, then
||e − ê||^2 → 2 a.s.,
i.e. ê is asymptotically orthogonal to e.
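
A quick simulation sketch (my addition) of Paul's phenomenon: with a spike λ below the 1 + √γ threshold, the leading sample eigenvector carries almost no information about e = (1, 0, ..., 0).

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 200, 400, 1.5                 # gamma = p/n = 2, threshold 1 + sqrt(2) ~ 2.41
sd = np.ones(p)
sd[0] = np.sqrt(lam)                      # Sigma = diag(lam, 1, ..., 1)
X = rng.standard_normal((n, p)) * sd

S = np.cov(X, rowvar=False, bias=True)
vals, vecs = np.linalg.eigh(S)
e_hat = vecs[:, -1]                       # leading sample eigenvector

print("inner product |<e_hat, e>|:", abs(e_hat[0]))    # close to 0, not 1
print("||e - e_hat||^2:", 2 - 2 * abs(e_hat[0]))       # close to 2 (up to the sign of e_hat)
```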

Fisher's LDA
For classification into two populations using X_1, ..., X_n ~ N_p(µ_1, Σ) and X_{n+1}, ..., X_{2n} ~ N_p(µ_2, Σ):
Minimax behaviour over T_{p,n} = { Σ : (µ_1 − µ_2)^T Σ^{-1} (µ_1 − µ_2) ≥ c > 0 } of LDA based on the MLE Σ̂^{-1} (Moore-Penrose inverse if p > n) is equivalent to coin tossing if p/n → ∞ (Bickel-Levina (2004)).
True for any rule! What if T consists of sparse matrices?

Eigenstructure Sparsity
Write Σ = Σ_0 + σ^2 I_p, where Σ_0 = Σ_{j=1}^M λ_j θ_j θ_j^T, λ_1 ≥ ... ≥ λ_M > 0, {θ_j} orthonormal.
Equivalently:
X_i = µ + Σ_{j=1}^M √λ_j v_{ji} θ_j + σ Z_i, i = 1, ..., n, with v_{ji} i.i.d. N(0, 1).
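
A minimal sketch (my addition) generating data from this spiked model with M = 2 sparse factors; the particular choices of λ_j, s, and σ are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, M, s, sigma = 100, 500, 2, 10, 1.0        # illustrative sizes

# M orthonormal, s-sparse factors theta_j (disjoint supports for simplicity)
theta = np.zeros((M, p))
for j in range(M):
    theta[j, j * s:(j + 1) * s] = 1.0 / np.sqrt(s)
lam = np.array([20.0, 10.0])                     # spike eigenvalues lambda_1 >= lambda_2 > 0

v = rng.standard_normal((n, M))                  # factor scores v_ji ~ N(0, 1)
Z = rng.standard_normal((n, p))
X = v @ (np.sqrt(lam)[:, None] * theta) + sigma * Z   # X_i = sum_j sqrt(lam_j) v_ji theta_j + sigma Z_i

Sigma = (theta.T * lam) @ theta + sigma ** 2 * np.eye(p)   # Sigma_0 + sigma^2 I_p
print("largest population eigenvalues:", np.sort(np.linalg.eigvalsh(Sigma))[-3:])
```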

Eigenstructure Sparsity (continued)
p, n large, but:
(i) M small, fixed.
(ii) θ_j sparse: well approximated by θ_{js} with ||θ_{js}||_0 ≤ s, s small.
Here ||v||_0 = # of nonzero coordinates of v.

Label-Dependent Sparsity of Σ
Write Σ = [σ_ij]; σ_ij small (effectively 0) when |i − j| is large.
More generally, given a metric m on the label set J, σ_ij small when m(i, j) is large.
Note: σ_ij = 0 ⟺ X_i ⊥ X_j under Gaussianity.

Label-Dependent Sparsity of Σ^{-1}
If Σ^{-1} = [σ^{ij}], σ^{ij} is small for many (i, j) pairs, e.g. when |i − j| is large.
Note: σ^{ij} = 0 ⟺ X_i ⊥ X_j | {X_k : k ≠ i, j} in the Gaussian case.

Examples of Label-Dependent Sparsity
(i) X stationary: σ(i, j) = σ(|i − j|), with spectral density f satisfying 0 < ε ≤ f ≤ 1/ε < ∞. Ergodic ARMA processes satisfy (i).
(ii) Σ = S + K, X = Y + Z with Y, Z independent, S ↔ Y, K ↔ Z; S = [s(i − j)] as in (i) with Σ_i s^2(i) < ∞; K Hilbert-Schmidt: Σ_{i,j} K^2(i, j) < ∞ (Z mean 0, nonstationary).
S = 0 gives Σ singular, but corresponds to functional data analysis.
A typical goal: estimate Σ^{-1} for prediction.

Permutation-Invariant Sparsity of Σ or Σ^{-1}
a) Each row of Σ sparse or sparsely approximable: e.g. if σ_i = {σ_{ij} : 1 ≤ j ≤ p}, then ||σ_i||_0 ≤ s for all i.
b) Each row of Σ^{-1} sparsely approximable.
a) roughly implies b) if λ_min(Σ) ≥ δ > 0.
A more general notion (El Karoui (2007), Annals of Statistics, to appear): roughly, the number of loops of length k in the undirected graph with i ~ j iff σ_ij ≠ 0 grows more slowly than p^{k/2} as k → ∞. The corresponding definition for Σ^{-1} is not equivalent in general.

Permutation-Invariant Sparsity of Σ^{-1}
Σ^{-1} corresponds to a graphical model: N(i) = { j : σ^{ij} ≠ 0, j ≠ i }.
Sparsity means the neighbourhood sizes are much smaller than p.
Example: gene networks. Goal: determine N(i), i = 1, ..., p.

Construction of Estimates of Σ or Parameters of Σ which work well for T having sparse members
Eigenstructure sparsity
a) Sparse PCA: d'Aspremont et al., SIAM Review (2007); Hastie, Tibshirani, Zhou (2007). PCA with a lasso penalty on the coordinates of the eigenvectors. Focus on computation, no asymptotics.

Construction of Estimates of Σ or Parameters of Σ which work well for T having sparse members
Eigenstructure sparsity (continued)
b) Johnstone and Lu, JASA (2006): sparse PCA for Σ corresponding to signals X = (X(t_1), ..., X(t_p))^T.
- Represent the data in a wavelet basis.
- Use only k << p coefficients to represent the data; PCA is applied to the covariance matrix of these coefficients.
- Asymptotics for large p, and examples.

Construction of Estimates of Σ or Parameters of Σ which work well for T having sparse members
Label-dependent sparsity: banding
Bickel and Levina (2004), Bernoulli; Bickel and Levina (2007), Annals of Statistics (to appear).
a) Banding Σ: for any M = [m_ij], B_k(M) = [m_ij 1(|i − j| ≤ k)].
Σ̂ = (1/n) Σ_{i=1}^n (X_i − X̄)(X_i − X̄)^T.
Estimate: B_k(Σ̂), k = o(p).
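
A small numpy sketch (my addition) of the banding operator B_k applied to the sample covariance matrix.

```python
import numpy as np

def band(M, k):
    """Banding operator B_k: keep entries with |i - j| <= k, zero out the rest."""
    i, j = np.indices(M.shape)
    return np.where(np.abs(i - j) <= k, M, 0.0)

# usage: band the sample covariance of an n x p data matrix X
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 30))
Sigma_hat = np.cov(X, rowvar=False, bias=True)
Sigma_banded = band(Sigma_hat, k=2)
```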

Construction of Estimates of Σ or Parameters of Σ which work well for T having sparse members
Label-dependent sparsity (continued)
b) Banding Σ^{-1}: write Σ^{-1} = A A^T with A lower triangular (Cholesky factor).
Estimate: Σ̂^{-1} = Â_k Â_k^T, where Â_k is obtained by regression (each coordinate regressed on its k closest predecessors).
Pourahmadi and Wu (2003), Biometrika; Huang, Liu, Pourahmadi (2007).
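
A sketch (my addition) of the regression route to a banded Cholesky-type estimate of Σ^{-1}: each coordinate is regressed on its k closest predecessors, and the coefficients and residual variances are assembled as T and D with Σ^{-1} ≈ T^T D^{-1} T (the modified Cholesky form). Details such as centering and plain OLS via lstsq are my choices.

```python
import numpy as np

def banded_cholesky_precision(X, k):
    """Banded regression estimate of the precision matrix (sketch).
    Row j of T holds minus the coefficients from regressing column j of X on its
    k closest predecessors; d holds the residual variances."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    T = np.eye(p)
    d = np.empty(p)
    d[0] = Xc[:, 0].var()
    for j in range(1, p):
        lo = max(0, j - k)
        Z = Xc[:, lo:j]                                  # k nearest predecessors
        beta, *_ = np.linalg.lstsq(Z, Xc[:, j], rcond=None)
        resid = Xc[:, j] - Z @ beta
        T[j, lo:j] = -beta
        d[j] = resid.var()
    return T.T @ np.diag(1.0 / d) @ T                    # estimate of Sigma^{-1}

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 30))
P_hat = banded_cholesky_precision(X, k=2)
```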

Construction of Estimates of Σ or Parameters of Σ which work well for T having sparse members
Permutation-invariant sparsity
(a) Thresholding: for any M = [m_ij], T_t(M) = [m_ij 1(|m_ij| ≥ t)].
Estimate: T_t(Σ̂), t = o(1).
El Karoui (2007); Bickel and Levina (2007).
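
A companion sketch (my addition) of the thresholding operator T_t, with the threshold set at the rate t_n = M √((log p)/n) used in the theory below; the constant M = 1 is illustrative.

```python
import numpy as np

def threshold(M_mat, t):
    """Thresholding operator T_t: zero out entries with |m_ij| < t."""
    return np.where(np.abs(M_mat) >= t, M_mat, 0.0)

rng = np.random.default_rng(0)
n, p = 100, 50
X = rng.standard_normal((n, p))
Sigma_hat = np.cov(X, rowvar=False, bias=True)

t_n = 1.0 * np.sqrt(np.log(p) / n)          # rate-based threshold; constant M = 1 is illustrative
Sigma_thresh = threshold(Sigma_hat, t_n)
```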

Construction of Estimates of Σ or Parameters of Σ which work well for T having sparse members
Permutation-invariant sparsity (continued)
b) SPICE: estimate of Σ^{-1}, assuming Σ^{-1} is sparse.
F(P, M) = Trace(MP) − log det(P) + λ ||P||_1, where ||M||_1 = Σ_{i,j} |m_ij|.
a) P̂^(1) = argmin { F(P, Σ̂) : P ≻ 0 }.
Yuan and Lin (2007), Biometrika: fixed-p asymptotics. Banerjee et al. (2006), ICML: computation.
b) R̂^(2) = argmin { F(P, R̂) : P ≻ 0 }, with R̂ = D̂^{-1} Σ̂ D̂^{-1} the sample correlation matrix (r̂_ii = 1 for all i), D̂ = diag(σ̂_ii^{1/2}); set P̂^(2) = D̂^{-1} R̂^(2) D̂^{-1}.
a), b): Rothman et al. (2007), submitted to JRSS(B): asymptotics.
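
For an off-the-shelf version of this l1-penalized Gaussian likelihood approach one can use scikit-learn's graphical lasso (a graphical-lasso-style estimator applied to the covariance, not the exact correlation-based SPICE variant); a minimal sketch (my addition), with an illustrative penalty value.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
n, p = 200, 30

# data with a sparse tridiagonal precision matrix (AR(1)-like dependence)
Omega = np.eye(p) + np.diag(0.4 * np.ones(p - 1), 1) + np.diag(0.4 * np.ones(p - 1), -1)
Sigma = np.linalg.inv(Omega)
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)

model = GraphicalLasso(alpha=0.1).fit(X)     # alpha plays the role of lambda; value is illustrative
P_hat = model.precision_                     # penalized estimate of Sigma^{-1}
print("estimated nonzeros per row:", (np.abs(P_hat) > 1e-4).sum(axis=1).mean())
```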

Construction of Estimates of Σ or Parameters of Σ which work well for T having sparse members
Permutation-invariant sparsity (continued)
c) Methods for neighbourhood estimation:
- Meinshausen and Bühlmann (2006), Zhao and Yu (2007): regression methods + lasso.
- Wainwright (2006): bounds.
- Kalisch and Bühlmann (2007): using the PC algorithm of Spirtes et al. (2000) to test, roughly hierarchically, whether partial correlation coefficients are 0.

Some Asymptotic Theory: Matrix Norms
For M = [m_ij]_{p×p} and x = (x_1, ..., x_p) with ||x||_r^r = Σ_{j=1}^p |x_j|^r:
Operator norms ||M||_(r,r) = sup { ||Mx||_r : ||x||_r = 1 }, in particular
||M||_(2,2) = λ_max^{1/2}(M M^T)
||M||_(1,1) = max_j Σ_{i=1}^p |m_ij|
||M||_(∞,∞) = max_i Σ_{j=1}^p |m_ij|
Also ||M||_∞ = max_{i,j} |m_ij|, and ||M||_F^2 = Σ_{i,j} m_ij^2 (Frobenius norm).

Properties
For any operator norm: ||AB|| ≤ ||A|| ||B||.
Henceforth ||M|| denotes ||M||_(2,2).
If M (p × p) is symmetric, ||M|| = max { |λ_max(M)|, |λ_min(M)| }.
||M|| ≤ [ ||M||_(1,1) ||M||_(∞,∞) ]^{1/2} and ||M|| ≤ ||M||_F.
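
A small numpy sketch (my addition) computing these norms and checking the two inequalities numerically.

```python
import numpy as np

def matrix_norms(M):
    """Return the norms used in the talk for a square matrix M."""
    op_22 = np.sqrt(np.linalg.eigvalsh(M @ M.T).max())   # ||M||_(2,2)
    op_11 = np.abs(M).sum(axis=0).max()                  # max absolute column sum, ||M||_(1,1)
    op_inf = np.abs(M).sum(axis=1).max()                 # max absolute row sum, ||M||_(inf,inf)
    entry_max = np.abs(M).max()                          # ||M||_inf
    frob = np.sqrt((M ** 2).sum())                       # ||M||_F
    return op_22, op_11, op_inf, entry_max, frob

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
M = (A + A.T) / 2                                        # symmetric example
op_22, op_11, op_inf, entry_max, frob = matrix_norms(M)
assert op_22 <= np.sqrt(op_11 * op_inf) + 1e-12          # ||M|| <= [||M||_(1,1) ||M||_(inf,inf)]^(1/2)
assert op_22 <= frob + 1e-12                             # ||M|| <= ||M||_F
```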

Basic Property of the Operator Norm
Given A_n, B_n symmetric with ||A_n − B_n|| → 0, suppose λ_1(B_n) > λ_2(B_n) > ... > λ_k(B_n) > λ_{k+1}(B_n), with λ_{j+1}(B_n) < λ_j(B_n) − Δ for 1 ≤ j ≤ k; define λ_j(A_n) analogously. The dimension of B_n is arbitrary; k and Δ > 0 are fixed. Then:
a) |λ_j(A_n) − λ_j(B_n)| = O(Δ^{-(j−1)} ||A_n − B_n||)
b) If E_{jA} (respectively E_{jB}) is the projection operator onto the eigenspace corresponding to λ_j, then ||E_{jA} − E_{jB}|| = O(Δ^{-j} ||A_n − B_n||).

B-L (2006) Main Result I
Banded estimator: Σ̂_{k,p}(i, j) = Σ̂_p(i, j) 1(|i − j| ≤ k).
Let U(ε_0, α, C) = { Σ : 0 < ε_0 ≤ λ_min(Σ) ≤ λ_max(Σ) ≤ 1/ε_0, max_i Σ_{j : |j−i| > k} |σ_ij| ≤ C k^{-α} for all k > 0 }.
Theorem 1. If X is Gaussian and k_n ≍ (n^{-1} log p)^{-1/(2(α+1))}, then, uniformly on Σ ∈ U(ε_0, α, C),
||Σ̂_{k_n,p} − Σ|| = O_P( (n^{-1} log p)^{α/(2(α+1))} ) = ||Σ̂_{k_n,p}^{-1} − Σ^{-1}||.
The banded estimator and its inverse are consistent if (log p)/n → 0.
Remark: λ_min(Σ) ≥ ε_0 is not needed if only convergence of Σ̂_{k_n,p} to Σ is required.

An Approximation Theorem (Bickel and Lindner (2007))
If ||B_k − A|| ≤ δ for some B_k banded of width 2k, ||A|| = M < ∞, and ||A^{-1}|| = m^{-1} < ∞ (condition number M/m), then there exist N(m, M, δ) and a banded matrix B_{k_N} such that
||B_{k_N} − A^{-1}|| ≤ c(m, M, δ) δ,
where c(m, M, δ) tends to a limit as δ → 0. B_{k_N} has a simply computable form.

Theorem (Thresholding) (B-L, submitted to Annals of Statistics)
If U_τ(q, c_0(p), M) = { Σ : σ_ii ≤ M, Σ_{j=1}^p |σ_ij|^q ≤ c_0(p) for all i }, 0 ≤ q < 1, and t_n = M √((log p)/n), then
||T_{t_n}(Σ̂) − Σ|| = O_P( c_0(p) ((log p)/n)^{(1−q)/2} ).
Note: for q = 0, c_0(p) is the number of nonzero entries per row.
More subtle consistency results: N. El Karoui (2007).

SPICE (Rothman et al.)
If S = { (i, j) : σ^{ij} ≠ 0 }, |S| ≤ s, ||Σ^{-1}|| ≤ m^{-1}, ||Σ|| ≤ M, and X is Gaussian, then
||P̂^(1) − Σ^{-1}||_F^2 = O_P( ((p + s) log p)/n )
||P̂^(2) − Σ^{-1}||^2 = O_P( (s log p)/n )

Choosing the Banding (or Thresholding or SPICE) Parameter
Ideally we want to minimize the risk R(k) = E ||Σ̂_k − Σ||.
Estimate it via a resampling scheme: split the data into two samples of sizes n_1 and n_2, N times at random. Let Σ̂_1^(ν), Σ̂_2^(ν) be the two sample covariance matrices from the ν-th split. The risk can be estimated by
R̂(k) = (1/N) Σ_{ν=1}^N || B_k(Σ̂_1^(ν)) − Σ̂_2^(ν) ||.
We used n_1 = n/3, N = 50, and the (1,1) matrix norm instead of (2,2).
Oracle properties for the Frobenius norm: Bickel and Levina (2007), submitted to Annals of Statistics.
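
A sketch (my addition) of this resampling rule for choosing k; the split size and number of splits follow the slide (n_1 = n/3, N = 50), and the (1,1) norm is the maximum absolute column sum.

```python
import numpy as np

def band(M, k):
    i, j = np.indices(M.shape)
    return np.where(np.abs(i - j) <= k, M, 0.0)

def choose_band_width(X, ks, N=50):
    """Pick the banding parameter k by the random-split risk estimate R_hat(k)."""
    n = X.shape[0]
    n1 = n // 3
    rng = np.random.default_rng(0)
    risks = np.zeros(len(ks))
    for _ in range(N):
        perm = rng.permutation(n)
        S1 = np.cov(X[perm[:n1]], rowvar=False, bias=True)
        S2 = np.cov(X[perm[n1:]], rowvar=False, bias=True)
        for a, k in enumerate(ks):
            # (1,1) matrix norm: maximum absolute column sum
            risks[a] += np.abs(band(S1, k) - S2).sum(axis=0).max()
    return ks[int(np.argmin(risks / N))]

# usage on illustrative AR(1)-type data
rng = np.random.default_rng(1)
p = 30
Sigma = 0.5 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
X = rng.multivariate_normal(np.zeros(p), Sigma, size=100)
print("chosen k:", choose_band_width(X, ks=range(0, 10)))
```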

Simulation Examples
Covariance of AR(1), Σ ∈ U: σ_ij = ρ^{|i−j|}; n = 100, p = 10, 100, 200, ρ = 0.1, 0.5, 0.9.
Fractional Gaussian noise (FGN): long-range dependence, not in U:
σ_ij = (1/2) [ (|i−j| + 1)^{2H} − 2|i−j|^{2H} + (|i−j| − 1)^{2H} ],
where H ∈ [0.5, 1] is the Hurst parameter; H = 0.5 is white noise, H = 1 is perfect dependence.
n = 100, p = 10, 100, 200, H = 0.5, 0.6, 0.7, 0.8, 0.9.
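
A sketch (my addition) constructing the two covariance matrices used in the simulations.

```python
import numpy as np

def ar1_cov(p, rho):
    """AR(1) covariance: sigma_ij = rho^|i-j|."""
    d = np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    return rho ** d

def fgn_cov(p, H):
    """Fractional Gaussian noise covariance with Hurst parameter H in [0.5, 1]."""
    d = np.abs(np.subtract.outer(np.arange(p), np.arange(p))).astype(float)
    # |d - 1| is used so the formula also covers the diagonal (d = 0) entries
    return 0.5 * ((d + 1) ** (2 * H) - 2 * d ** (2 * H) + np.abs(d - 1) ** (2 * H))

Sigma_ar = ar1_cov(100, rho=0.5)
Sigma_fgn = fgn_cov(100, H=0.7)
print("min FGN eigenvalue:", np.linalg.eigvalsh(Sigma_fgn).min())   # positive semidefiniteness check
```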

[Figure: True and estimated risk versus k for AR(1), panels for p = 10, 100, 200 and ρ = 0.1, 0.5, 0.9.]

(1,1) Loss and Choice of k for AR(1). Entries for k_1 and k̂ are mean (SD).

p    ρ    k_1         k̂           Loss(Σ̂_k̂)  Loss(Σ̂_{k_1})  Loss(Σ̂)
10   0.1  0.5 (0.5)   0.0 (0.2)    0.5          0.4             1.1
10   0.5  3.3 (0.8)   2.0 (0.6)    1.1          1.0             1.3
10   0.9  8.6 (0.7)   8.9 (0.3)    1.5          1.5             1.5
100  0.1  0.2 (0.4)   0.1 (0.3)    0.6          0.6             10.2
100  0.5  2.7 (0.7)   2.3 (0.5)    1.6          1.5             10.6
100  0.9  21.3 (4.5)  15.9 (2.6)   9.2          8.5             13.5
200  0.1  0.2 (0.4)   0.2 (0.4)    0.7          0.6             20.4
200  0.5  2.4 (0.7)   2.7 (0.5)    1.8          1.7             20.8
200  0.9  20.2 (4.5)  16.6 (2.4)   9.9          9.5             24.5

Table: AR(1): oracle and estimated k and the corresponding (1,1)-norm loss values. k_1 = argmin_k ||Σ̂_k − Σ||_(1,1).

Operator Norm Results for Various Estimates and Situations: AR(1)

p    ρ    Sample       Band         Band-P       Thresh       SPICE
10   0.1  0.60 (0.01)  0.35 (0.01)  0.35 (0.01)  0.40 (0.01)  0.35 (0.01)
10   0.5  0.73 (0.02)  0.64 (0.02)  0.74 (0.02)  0.80 (0.02)  0.72 (0.02)
10   0.9  1.02 (0.05)  1.02 (0.05)  1.02 (0.05)  1.02 (0.05)  1.01 (0.05)
100  0.1  2.86 (0.02)  0.45 (0.01)  0.45 (0.01)  0.50 (0.01)  0.45 (0.01)
100  0.5  3.39 (0.03)  0.95 (0.01)  2.04 (0.00)  1.96 (0.01)  1.52 (0.01)
100  0.9  6.03 (0.12)  4.45 (0.09)  6.03 (0.12)  6.18 (0.12)  5.57 (0.12)
200  0.1  4.63 (0.02)  0.49 (0.01)  0.49 (0.01)  0.51 (0.01)  0.50 (0.01)
200  0.5  5.47 (0.03)  1.00 (0.01)  2.06 (0.00)  2.04 (0.00)  1.60 (0.01)
200  0.9  9.77 (0.16)  5.24 (0.09)  9.76 (0.16)  8.80 (0.10)  7.50 (0.10)

Table: Operator norm loss, averages and SEs over 100 replications. AR(1).

Operator Norm Results for Various Estimates and Situations: FGN

p    H    Sample        Band          Band-P        Thresh        SPICE
10   0.5  0.60 (0.01)   0.27 (0.01)   0.27 (0.01)   0.35 (0.01)   0.28 (0.01)
10   0.7  0.67 (0.02)   0.88 (0.02)   1.03 (0.03)   0.88 (0.04)   1.05 (0.05)
10   0.9  0.96 (0.04)   0.96 (0.04)   0.96 (0.04)   0.96 (0.04)   1.13 (0.05)
100  0.5  2.83 (0.01)   0.38 (0.01)   0.38 (0.01)   0.46 (0.01)   0.39 (0.01)
100  0.7  3.43 (0.05)   4.46 (0.02)   5.37 (0.00)   5.36 (0.00)   5.30 (0.02)
100  0.9  8.02 (0.28)   8.03 (0.28)   8.02 (0.28)   8.02 (0.28)   14.18 (0.70)
200  0.5  4.60 (0.02)   0.44 (0.01)   0.44 (0.01)   0.46 (0.01)   0.45 (0.01)
200  0.7  5.81 (0.06)   6.46 (0.01)   7.40 (0.00)   7.40 (0.00)   7.39 (0.00)
200  0.9  14.34 (0.43)  14.65 (0.46)  14.34 (0.43)  14.34 (0.43)  30.16 (0.60)

Table: Operator norm loss, averages and SEs over 100 replications. FGN.

[Figure: Example 1 (continued), EOF pattern #1 (thresholding).]

[Figure: Example 1 (continued), EOF pattern #2 (thresholding).]

Some Important Directions
- Choice of k and other regularization parameters.
- Regularized estimation of canonical correlations.
- Regularization in independent component analysis.
- Estimation of parameters governing independence and conditional independence in graphical models.