Permutation-invariant regularization of large covariance matrices. Liza Levina
Permutation-invariant regularization of large covariance matrices
Liza Levina, Department of Statistics, University of Michigan
Joint work with Adam Rothman, Amy Wagaman, Ji Zhu (University of Michigan) and Peter Bickel (UC Berkeley)
Outline
- Introduction
- Motivation: a result for ordered variables
- Covariance Thresholding
- Generalized Thresholding
- Discovering Variable Orderings
- Regularization of the inverse
Why estimate covariance?
- Principal component analysis (PCA)
- Linear or quadratic discriminant analysis (LDA/QDA)
- Inferring independence and conditional independence in Gaussian graphical models
- Inference about the mean (e.g., a longitudinal mean response curve)
The covariance itself is not always the end goal: PCA requires estimation of the eigenstructure, while LDA/QDA and conditional independence require the inverse (a.k.a. the concentration or precision matrix).
What's wrong with the sample covariance?
Observe $X_1, \ldots, X_n$, i.i.d. $p$-variate random variables, and form
$$\hat\Sigma = \frac{1}{n}\sum_{i=1}^n (X_i - \bar X)(X_i - \bar X)^T$$
This is the MLE, (almost) unbiased, and well-behaved (and well studied) for fixed $p$ as $n$ grows. But it is very noisy when $p$ is large:
- Eigenvalues are overdispersed unless $p/n \to 0$ (Marčenko and Pastur, 1967; Wachter, 1978; Geman, 1980; Bai and Yin, 1993; Johnstone, 2001; Paul, 2004; El Karoui, 2007)
- Eigenvectors are not consistent (Johnstone and Lu, 2004; Paul, 2006)
- LDA breaks down if $p/n \to \infty$ (Bickel and Levina, 2004)
- Singular if $p > n$
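These failure modes are easy to see numerically. A minimal sketch (not from the talk; the dimensions and random seed are arbitrary choices): draw i.i.d. data whose true covariance is the identity and check that the sample covariance is singular with a badly overdispersed spectrum.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 200                       # more variables than observations
X = rng.standard_normal((n, p))      # true covariance is the identity: all eigenvalues 1

# Sample covariance, centered at the sample mean
Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / n

eigvals = np.linalg.eigvalsh(S)
# The rank of S is at most n - 1 < p, so S is singular,
# while the top eigenvalue lands far above the true value 1
print(eigvals.min(), eigvals.max())
```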
Desirable features for a covariance estimator
- Positive definite and well-conditioned
- Inverse easily available (in some applications)
- Sparse covariance or inverse (in some applications)
- Ultimately, consistency and good rates for large $p$
Two general classes of estimators
- Variables have a natural notion of distance (time series, spatial data, longitudinal data, spectroscopy): exploit the weaker dependence between distant variables
- The variable order is meaningless (genetics, finance, social sciences): estimators should be invariant to variable permutations
Estimators for ordered variables
Exploit the weaker dependence between distant variables:
- Banding or tapering the covariance (Bickel and Levina, 2008; Furrer and Bengtsson, 2007): make the entries far away from the diagonal smaller
- The concentration matrix is usually regularized via its Cholesky factor (Wu and Pourahmadi, 2003; Bickel and Levina, 2008; Huang et al., 2006; Levina et al., 2008)
Bickel and Levina (2008): a result for banding/tapering
- Replace $\hat\Sigma$ with $\hat\Sigma * R$, where $*$ denotes the Schur (element-wise) product
- If $R$ is positive definite, so is $\hat\Sigma * R$
- Banding: $R_k(i,j) = \mathbf{1}(|i-j| \le k)$ gives $B_k(\hat\Sigma)$ (not necessarily positive definite)
- Other, smoother tapers guarantee positive definiteness
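The banding operator is a one-liner in practice. A sketch (the function name and the toy AR(1)-type matrix are ours):

```python
import numpy as np

def band(S, k):
    """Banding operator B_k: zero out all entries with |i - j| > k."""
    p = S.shape[0]
    i, j = np.indices((p, p))
    return np.where(np.abs(i - j) <= k, S, 0.0)

# Toy example: band a covariance with sigma_ij = 0.7^|i-j|
i, j = np.indices((5, 5))
Sigma = 0.7 ** np.abs(i - j)
B1 = band(Sigma, 1)   # keeps the diagonal and first off-diagonals only
```

As the slide notes, $B_k(\hat\Sigma)$ need not be positive definite even when $\hat\Sigma$ is, which is why smoother tapers are of interest.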
All results are in the operator norm, a.k.a. the matrix 2-norm: for symmetric $M$, $\|M\| = \max_i |\lambda_i(M)|$. Convergence in operator norm implies convergence of eigenvalues and eigenvectors.
Results are uniform over a class of approximately "bandable" covariance matrices as $p, n \to \infty$:
$$\mathcal{U}(\alpha, M, C) = \Big\{ \Sigma : \lambda_{\max}(\Sigma) \le M,\ \max_j \sum_{i: |i-j| > k} |\sigma_{ij}| \le C k^{-\alpha} \text{ for all } k \ge 0 \Big\}$$
Theorem 1: If $X$ is Gaussian and $k \asymp \left(\frac{\log p}{n}\right)^{-\frac{1}{2(\alpha+1)}}$, then, uniformly on $\Sigma \in \mathcal{U}(\alpha, M, C)$,
$$\|B_k(\hat\Sigma_p) - \Sigma_p\| = O_P\left(\Big(\frac{\log p}{n}\Big)^{\frac{\alpha}{2(\alpha+1)}}\right)$$
- For approximately bandable matrices, the banded estimator is consistent if $\frac{\log p}{n} \to 0$
- The theorem also holds for general tapering
- If we assume $\lambda_{\min}(\Sigma) \ge \varepsilon > 0$, the same rate holds for the inverse
- In practice the tuning parameter $k$ is selected by cross-validation
Covariance thresholding
Bickel and Levina (2007); El Karoui (2007).
- If the variable order is meaningless, estimators should be invariant to variable permutations
- Consider thresholding as a permutation-invariant analogue of banding:
$$T_t(\hat\Sigma) = \big[\hat\sigma_{ij}\,\mathbf{1}(|\hat\sigma_{ij}| \ge t)\big]$$
- This has been done in practice for a long time
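Hard thresholding of the sample covariance is equally simple to write down; a sketch (function name and toy matrix ours):

```python
import numpy as np

def hard_threshold(S, t):
    """T_t: keep entries with |sigma_ij| >= t, set the rest to zero."""
    S = np.asarray(S, dtype=float)
    return S * (np.abs(S) >= t)

Sigma_hat = np.array([[1.0, 0.4, 0.05],
                      [0.4, 1.0, 0.02],
                      [0.05, 0.02, 1.0]])
T = hard_threshold(Sigma_hat, 0.1)   # small entries zeroed, large ones untouched
```

Unlike banding, the result does not depend on the order of the variables: permuting rows and columns commutes with thresholding.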
A class of approximately sparse matrices (a permutation-invariant analogue of "approximately bandable"): for $0 \le q < 1$,
$$\mathcal{U}_\tau\big(q, c_0(p), M\big) = \Big\{\Sigma : \sigma_{ii} \le M,\ \sum_{j=1}^p |\sigma_{ij}|^q \le c_0(p) \text{ for all } i\Big\}$$
- If $q = 0$, $\mathcal{U}_\tau(0, c_0(p), M)$ is a class of sparse matrices
- $\mathcal{U} \subset \mathcal{U}_\tau$ for appropriately chosen constants (approximately bandable matrices are approximately sparse)
Theorem 2: Suppose $X$ is Gaussian. Uniformly on $\mathcal{U}_\tau\big(q, c_0(p), M\big)$, for a sufficiently large constant $M'$, if $t = M'\sqrt{\frac{\log p}{n}}$ and $\frac{\log p}{n} = o(1)$, then
$$\|T_t(\hat\Sigma_p) - \Sigma_p\| = O_P\left(c_0(p)\Big(\frac{\log p}{n}\Big)^{\frac{1-q}{2}}\right)$$
- For approximately sparse matrices, the thresholded estimator is consistent if $\frac{\log p}{n} \to 0$ (and $c_0(p)$ does not grow too fast)
- If we assume $\lambda_{\min}(\Sigma) \ge \varepsilon > 0$, the same rate holds for the inverse
- The threshold is chosen by cross-validation
Climate data example
- January mean temperatures from 1850 to 2006 ($n = 157$)
- The region covered by the $p = 2592$ recording stations extends over a wide range of longitudes and up to 87.5 degrees latitude
- Not all the stations were in place for the entire period
- Look at EOFs (empirical orthogonal functions), the principal components of the spatial covariance matrix
Sample Covariance EOFs
[Maps over longitude and latitude of EOF #1 (14.82% of variance) and EOF #2 (8.68% of variance)]
Thresholded Covariance EOFs
[Maps over longitude and latitude of EOF #1 (8.49% of variance) and EOF #2 (7.82% of variance)]
Generalized thresholding
Rothman, Levina, Zhu (2008)
- Theorem 2 applies to hard thresholding
- In wavelets and regression, smoother penalties often do better
We define generalized thresholding via a function $s_t: \mathbb{R} \to \mathbb{R}$ such that
(i) $|s_t(z)| \le |z|$ (shrinkage)
(ii) $s_t(z) = 0$ for $|z| \le t$ (thresholding)
(iii) $|s_t(z) - z| \le t$ for all $z \in \mathbb{R}$ and $t \ge 0$ (limited shrinkage)
It also makes sense to require $s_t(z) = \mathrm{sign}(z)\, s_t(|z|)$. The covariance estimator is $S_t(\hat\Sigma) = [s_t(\hat\sigma_{ij})]$.
Examples of generalized thresholding
- Hard thresholding: $s_t(z) = z\,\mathbf{1}(|z| > t)$
- Soft thresholding (Donoho and Johnstone, 1994, 1995): $s_t(z) = \mathrm{sign}(z)(|z| - t)_+$, equivalent to minimizing quadratic loss with a lasso penalty:
$$S_t(\hat\Sigma) = \arg\min_S \sum_{i,j} (s_{ij} - \hat\sigma_{ij})^2 + t \sum_{i,j} |s_{ij}|$$
Comparison of generalized thresholding operators
[Plot of $s_t(|z|)$ against $|z|$ for hard, adaptive lasso, soft, and SCAD thresholding]
More examples
- SCAD (Fan, 1997; Fan and Li, 2001): a linear interpolation between soft and hard thresholding; equivalent to quadratic loss with the SCAD penalty
- Adaptive lasso (Zou, 2006), which goes back to Breiman (1995): equivalent to quadratic loss with a weighted lasso penalty,
$$S_t(\hat\Sigma) = \arg\min_S \sum_{i,j} (s_{ij} - \hat\sigma_{ij})^2 + t \sum_{i,j} w_{ij} |s_{ij}|$$
with weights $w_{ij} = |\hat\sigma_{ij}|^{-\eta}$, which penalize small elements more
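The three smoother rules can be written as element-wise functions. A sketch (function names are ours; the SCAD constant $a = 3.7$ is the value recommended by Fan and Li, and the closed-form SCAD rule below is the standard one, not taken from these slides):

```python
import numpy as np

def soft(z, t):
    """Soft thresholding: sign(z) * (|z| - t)_+."""
    z = np.asarray(z, dtype=float)
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def adaptive_lasso(z, t, eta=1.0):
    """Adaptive-lasso thresholding with weights |z|^(-eta):
    small entries are penalized, and hence shrunk, more."""
    z = np.asarray(z, dtype=float)
    absz = np.abs(z)
    w = np.where(absz > 0, absz, 1.0) ** (-eta)   # safe at z = 0
    return np.sign(z) * np.maximum(absz - t * w, 0.0)

def scad(z, t, a=3.7):
    """SCAD thresholding: soft near zero, identity far away,
    a linear interpolation in between."""
    z = np.asarray(z, dtype=float)
    return np.where(np.abs(z) <= 2 * t, soft(z, t),
           np.where(np.abs(z) <= a * t,
                    ((a - 1) * z - np.sign(z) * a * t) / (a - 2),
                    z))
```

Applied element-wise to $\hat\Sigma$, each rule gives one member of the family $S_t(\hat\Sigma) = [s_t(\hat\sigma_{ij})]$.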
Theorem 3:
(a) Under the same conditions, the hard thresholding theorem holds for generalized thresholding.
(b) Sparsistency: with probability tending to 1, $s_t(\hat\sigma_{ij}) = 0$ for all $(i,j)$ such that $\sigma_{ij} = 0$.
(c) If we additionally assume that all non-zero elements satisfy $|\sigma_{ij}| > \tau$, where $\sqrt{n}(\tau - t) \to \infty$, we also have, with probability tending to 1, $s_t(\hat\sigma_{ij}) \ne 0$ for all $(i,j)$ such that $\sigma_{ij} \ne 0$.
Tuning parameters have to be selected by cross-validation.
Some generalized thresholding results
- SCAD and adaptive lasso do somewhat better than hard and soft thresholding as measured by operator loss
- Sparsity: look at the true positive rate (fraction of non-zeros estimated as non-zero) vs. the false positive rate (fraction of zeros estimated as non-zero)
- Example: MA(1) (tri-diagonal matrix, $\sigma_{ii} = 1$, $\sigma_{i,i-1} = 0.3$)
[ROC curves (true positive rate vs. false positive rate) for soft, hard, adaptive lasso, and SCAD thresholding, at p = 30 and p = 200]
Discovering Variable Orderings
Wagaman and Levina (2007)
- Rate comparisons, simulations, and intuition suggest that if a structured order exists, it is better to use it than to ignore it
- Goal: reorder the variables to make the covariance matrix as close to bandable as possible
- Main idea: use correlations as a measure of similarity and embed the variables in one dimension; the coordinates provide an ordering
The Isomap algorithm
- The classical tool is multi-dimensional scaling (MDS)
- An alternative is Isomap (Tenenbaum et al., 2000), developed for manifold embeddings
Given a dissimilarity measure $d(i,j)$ (we use $1 - \hat\rho_{ij}$):
1. For each point, find its $r$ nearest neighbors and construct a neighborhood graph with dissimilarities as edge weights.
2. Estimate the geodesic distance $d_r(i,j)$ between each pair of points $i, j$ by computing the shortest-path graph distance from $i$ to $j$.
3. Apply metric MDS to the estimated geodesic distances.
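A compact numpy-only sketch of these three steps in the 1-d case (our implementation, not the authors' code: Floyd-Warshall for step 2 and classical metric MDS for step 3; it assumes the r-NN graph is connected, which is why the Isoband estimator on the next slide runs it separately per connected component):

```python
import numpy as np

def isomap_1d(D, r):
    """1-d Isomap embedding from a p x p dissimilarity matrix D.

    1. r-nearest-neighbor graph with dissimilarities as edge weights.
    2. Geodesic distances via Floyd-Warshall shortest paths.
    3. Classical (metric) MDS: double-center the squared geodesic
       distances and take the leading eigenvector.
    """
    p = D.shape[0]
    G = np.full((p, p), np.inf)
    np.fill_diagonal(G, 0.0)
    for i in range(p):                       # step 1: keep edges to r nearest neighbors
        nbrs = np.argsort(D[i])[1:r + 1]     # skip index 0, the point itself
        G[i, nbrs] = D[i, nbrs]
        G[nbrs, i] = D[nbrs, i]
    for k in range(p):                       # step 2: Floyd-Warshall
        G = np.minimum(G, G[:, [k]] + G[[k], :])
    J = np.eye(p) - np.ones((p, p)) / p      # step 3: classical MDS
    B = -0.5 * J @ (G ** 2) @ J
    w, V = np.linalg.eigh(B)
    return V[:, -1] * np.sqrt(max(w[-1], 0.0))

# Points on a line: sorting the 1-d coordinates should recover their order
x = np.arange(6, dtype=float)
D = np.abs(x[:, None] - x[None, :]) / 5.0
coords = isomap_1d(D, 2)
order = np.argsort(coords)
```

The embedding is only defined up to sign, so the recovered order may come out reversed; for reordering variables either direction is equally good.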
The Isoband estimator
1. Construct the neighborhood graph using $d(i,j) = 1 - \hat\rho_{ij}$.
2. Apply Isomap to each connected component of the graph.
3. Reorder the variables within each component according to their 1-d embedding coordinates and apply banding.
4. Concatenate the components into a block-diagonal estimator.
A bootstrap approach to improve the nearest-neighbor graph
- There is noise in the $k$-NN links based on sample correlations
- Main idea: create $T$ bootstrap replicates and only include an edge if it appears in at least $cT$ of the replicates, where $0 < c < 1$ is a tuning parameter (we use $c = 0.5$)
- Similar to bagging
- Improves detection of independent blocks and of the ordering, but adds computational cost
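The edge-stabilization step above can be sketched as follows (our function name and defaults; $T$ and $c$ are the slide's tuning parameters, and the dissimilarity is the $1 - \hat\rho_{ij}$ used throughout this section):

```python
import numpy as np

def stable_knn_graph(X, r, T=50, c=0.5, seed=0):
    """Bagged nearest-neighbor graph: an r-NN edge (under d = 1 - rho_hat)
    is kept only if it appears in at least c*T bootstrap replicates.
    Returns a symmetric boolean adjacency matrix."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros((p, p))
    for _ in range(T):
        Xb = X[rng.integers(0, n, n)]               # resample rows with replacement
        D = 1.0 - np.corrcoef(Xb, rowvar=False)     # dissimilarity d(i,j) = 1 - rho_hat
        for i in range(p):
            nbrs = np.argsort(D[i])[1:r + 1]        # r nearest neighbors of variable i
            counts[i, nbrs] += 1
    A = counts >= c * T
    np.fill_diagonal(A, False)
    return A | A.T                                  # symmetrize

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 6))
A = stable_knn_graph(X, r=2)
```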
A simulation example
AR(1): $\sigma_{ij} = \rho^{|i-j|}$, $n = 100$, $\rho = 0.7$; average (SE) operator norm loss over 50 replications:

p    Sample    Band.      Band.Perm.  MDS        Isoband    Bootst.    Thresh.
…    …(.05)    0.89(.04)  0.84(.05)   0.92(.04)  0.87(.04)  0.90(.04)  0.87(.05)
…    …(.06)    1.65(.04)  4.63(.01)   3.41(.05)  1.82(.04)  1.67(.04)  2.85(.03)
…    …(.07)    1.79(.04)  4.67(.00)   3.97(.03)  2.10(.06)  1.85(.05)  3.02(.02)
Assessing sparsity: a block-diagonal example
- Three AR blocks of size 50, 30, 20, with $\rho = 0.7, 0.8, 0.9$
- Heatmap of the percentage of times each element was estimated as zero (black is 0%, white is 100%)
- As a benchmark, apply banding to the known blocks separately, each one in the correct order
[Heatmaps: (a) banding known blocks, (b) thresholding, (c) Isoband, (d) bootstrap]
Estimators of the inverse invariant under variable permutations
- The inverse $\Omega = \Sigma^{-1}$ (a.k.a. the concentration or precision matrix) is needed in graphical models and LDA/QDA
- Graphical models mostly focus on testing for/finding zeros rather than estimating the matrix (Drton and Perlman, 2004; Dobra et al., 2004; Meinshausen and Bühlmann, 2006; Kalisch and Bühlmann, 2007)
- Testing for zeros can be followed by constrained estimation (Chaudhuri, Drton and Richardson, 2007), but the resulting estimator is hard to analyze
A Sparse Permutation-Invariant Estimator
The negative log-likelihood, up to a constant:
$$\ell(X; \Omega) = \mathrm{tr}(\Omega\hat\Sigma) - \log|\Omega|$$
Add a lasso penalty on the off-diagonal elements of $\Omega$ to force regularization and sparsity. SPICE:
$$\hat\Omega_\lambda = \arg\min_{\Omega \succ 0} \Big\{ \mathrm{tr}(\Omega\hat\Sigma) - \log|\Omega| + \lambda \sum_{j \ne j'} |\omega_{jj'}| \Big\}$$
The optimization problem has been studied by several authors; Yuan and Lin (2007) provided a fixed-$p$ analysis.
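As an illustration only (not the solver used in the paper; a plain proximal-gradient scheme with a step size of our choosing), the SPICE objective can be minimized by alternating a gradient step on the smooth part, whose gradient is $\hat\Sigma - \Omega^{-1}$, with soft thresholding of the off-diagonal entries:

```python
import numpy as np

def spice_prox_grad(S, lam, step=0.1, n_iter=500):
    """Minimize tr(Omega S) - log det(Omega) + lam * sum_{j != j'} |omega_jj'|
    by proximal gradient descent. The prox of the off-diagonal l1 penalty is
    soft thresholding that leaves the diagonal untouched. The step size must
    be small enough that the iterates stay positive definite."""
    p = S.shape[0]
    Omega = np.eye(p)
    off = 1.0 - np.eye(p)                # penalize off-diagonal entries only
    for _ in range(n_iter):
        G = S - np.linalg.inv(Omega)     # gradient of the smooth part
        Z = Omega - step * G
        Omega = np.sign(Z) * np.maximum(np.abs(Z) - step * lam * off, 0.0)
        Omega = (Omega + Omega.T) / 2    # keep symmetry against round-off
    return Omega

# With a strong penalty the off-diagonal is killed and Omega_jj -> 1 / S_jj
S = np.array([[1.0, 0.5],
              [0.5, 1.0]])
Omega_hat = spice_prox_grad(S, lam=1.0)
```

The specialized solvers on the later slide (Cholesky-parametrized coordinate descent, glasso) do the same minimization far more efficiently and with positive definiteness guaranteed.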
High-Dimensional Analysis of SPICE
Rothman, Bickel, Levina, Zhu (2007)
To get the best rate, standardize first: run SPICE on the correlation matrix, then multiply the variances back in to obtain $\tilde\Omega_\lambda$.
Theorem 4: Let $\Omega = \Sigma^{-1}$, and take $\lambda \asymp \sqrt{\frac{\log p}{n}}$. Assume
(A1) $0 < \varepsilon_0 \le \lambda_{\min}(\Sigma) \le \lambda_{\max}(\Sigma) \le \varepsilon_0^{-1}$;
(A2) $\mathrm{card}\{(i,j) : \Omega_{ij} \ne 0,\ i \ne j\} \le s$.
Then
$$\|\tilde\Omega_\lambda - \Omega\| = O_P\left(\sqrt{\frac{(1+s)\log p}{n}}\right)$$
SPICE is consistent as long as $\frac{s \log p}{n} \to 0$. Compare to $c_0(p)\sqrt{\frac{\log p}{n}}$ for thresholding.
Solving the optimization problem
- Optimization is non-trivial and computationally expensive
- Yuan and Lin (2007): the Maxdet algorithm (interior point optimization)
- Banerjee et al. (2005): Nesterov's convex optimization
- Our approach (Rothman et al., 2008): parametrize via the Cholesky factor to enforce positive definiteness, use local quadratic approximation and coordinate-wise descent
- Friedman, Hastie, and Tibshirani (2008): apply the coordinate-wise descent lasso algorithm (R package glasso)
Colon tumor classification example
- 40 colon adenocarcinoma and 22 healthy tissue samples (Alon et al., 1999); Affymetrix gene expression data
- Top p genes selected by a two-sample t-test (out of 2000)
- 100 random splits into training (2/3) and test (1/3) sets
- LDA classification
- Consider two ways of selecting the tuning parameter
Colon tumor classification errors
Tuning parameter for SPICE chosen by (A) 5-fold CV on the training data, maximizing the likelihood; (B) 5-fold CV on the training data, minimizing the classification error; (C) minimizing the classification error on the test data.

p    Naive Bayes   A           B           C
…    …(0.8)        13.5(0.6)   14.6(0.6)   9.7(0.5)
…    …(0.8)        12.1(0.7)   14.7(0.7)   9.0(0.6)
…    …(0.8)        18.7(0.8)   16.9(0.9)   9.1(0.5)
…    …(1.0)        18.3(0.7)   18.0(0.7)   10.2(0.5)
Conclusions
- Regularization is necessary
- High-dimensional asymptotics give trade-offs of dimension vs. sample size vs. sparsity, and conditions on the covariance model
- Structure helps regularization
- Efficient optimization and low computational complexity are very important
Thank you
More informationSparse Covariance Matrix Estimation with Eigenvalue Constraints
Sparse Covariance Matrix Estimation with Eigenvalue Constraints Han Liu and Lie Wang 2 and Tuo Zhao 3 Department of Operations Research and Financial Engineering, Princeton University 2 Department of Mathematics,
More informationLecture 14: Variable Selection - Beyond LASSO
Fall, 2017 Extension of LASSO To achieve oracle properties, L q penalty with 0 < q < 1, SCAD penalty (Fan and Li 2001; Zhang et al. 2007). Adaptive LASSO (Zou 2006; Zhang and Lu 2007; Wang et al. 2007)
More informationMATH 829: Introduction to Data Mining and Analysis Graphical Models II - Gaussian Graphical Models
1/13 MATH 829: Introduction to Data Mining and Analysis Graphical Models II - Gaussian Graphical Models Dominique Guillot Departments of Mathematical Sciences University of Delaware May 4, 2016 Recall
More informationApproximation. Inderjit S. Dhillon Dept of Computer Science UT Austin. SAMSI Massive Datasets Opening Workshop Raleigh, North Carolina.
Using Quadratic Approximation Inderjit S. Dhillon Dept of Computer Science UT Austin SAMSI Massive Datasets Opening Workshop Raleigh, North Carolina Sept 12, 2012 Joint work with C. Hsieh, M. Sustik and
More informationGraphical Model Selection
May 6, 2013 Trevor Hastie, Stanford Statistics 1 Graphical Model Selection Trevor Hastie Stanford University joint work with Jerome Friedman, Rob Tibshirani, Rahul Mazumder and Jason Lee May 6, 2013 Trevor
More informationECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference
ECE 18-898G: Special Topics in Signal Processing: Sparsity, Structure, and Inference Sparse Recovery using L1 minimization - algorithms Yuejie Chi Department of Electrical and Computer Engineering Spring
More informationA direct formulation for sparse PCA using semidefinite programming
A direct formulation for sparse PCA using semidefinite programming A. d Aspremont, L. El Ghaoui, M. Jordan, G. Lanckriet ORFE, Princeton University & EECS, U.C. Berkeley A. d Aspremont, INFORMS, Denver,
More informationDISCUSSION OF INFLUENTIAL FEATURE PCA FOR HIGH DIMENSIONAL CLUSTERING. By T. Tony Cai and Linjun Zhang University of Pennsylvania
Submitted to the Annals of Statistics DISCUSSION OF INFLUENTIAL FEATURE PCA FOR HIGH DIMENSIONAL CLUSTERING By T. Tony Cai and Linjun Zhang University of Pennsylvania We would like to congratulate the
More informationComparisons of penalized least squares. methods by simulations
Comparisons of penalized least squares arxiv:1405.1796v1 [stat.co] 8 May 2014 methods by simulations Ke ZHANG, Fan YIN University of Science and Technology of China, Hefei 230026, China Shifeng XIONG Academy
More informationLatent Variable Graphical Model Selection Via Convex Optimization
Latent Variable Graphical Model Selection Via Convex Optimization The MIT Faculty has made this article openly available. Please share how this access benefits you. Your story matters. Citation As Published
More informationLearning Gaussian Graphical Models with Unknown Group Sparsity
Learning Gaussian Graphical Models with Unknown Group Sparsity Kevin Murphy Ben Marlin Depts. of Statistics & Computer Science Univ. British Columbia Canada Connections Graphical models Density estimation
More informationarxiv: v1 [stat.me] 16 Feb 2018
Vol., 2017, Pages 1 26 1 arxiv:1802.06048v1 [stat.me] 16 Feb 2018 High-dimensional covariance matrix estimation using a low-rank and diagonal decomposition Yilei Wu 1, Yingli Qin 1 and Mu Zhu 1 1 The University
More informationRobust and sparse Gaussian graphical modelling under cell-wise contamination
Robust and sparse Gaussian graphical modelling under cell-wise contamination Shota Katayama 1, Hironori Fujisawa 2 and Mathias Drton 3 1 Tokyo Institute of Technology, Japan 2 The Institute of Statistical
More informationMSA220/MVE440 Statistical Learning for Big Data
MSA220/MVE440 Statistical Learning for Big Data Lecture 7/8 - High-dimensional modeling part 1 Rebecka Jörnsten Mathematical Sciences University of Gothenburg and Chalmers University of Technology Classification
More informationIntroduction to graphical models: Lecture III
Introduction to graphical models: Lecture III Martin Wainwright UC Berkeley Departments of Statistics, and EECS Martin Wainwright (UC Berkeley) Some introductory lectures January 2013 1 / 25 Introduction
More informationProbabilistic Graphical Models
School of Computer Science Probabilistic Graphical Models Gaussian graphical models and Ising models: modeling networks Eric Xing Lecture 0, February 5, 06 Reading: See class website Eric Xing @ CMU, 005-06
More informationDirect Learning: Linear Regression. Donglin Zeng, Department of Biostatistics, University of North Carolina
Direct Learning: Linear Regression Parametric learning We consider the core function in the prediction rule to be a parametric function. The most commonly used function is a linear function: squared loss:
More informationCoordinate descent. Geoff Gordon & Ryan Tibshirani Optimization /
Coordinate descent Geoff Gordon & Ryan Tibshirani Optimization 10-725 / 36-725 1 Adding to the toolbox, with stats and ML in mind We ve seen several general and useful minimization tools First-order methods
More informationFinal Overview. Introduction to ML. Marek Petrik 4/25/2017
Final Overview Introduction to ML Marek Petrik 4/25/2017 This Course: Introduction to Machine Learning Build a foundation for practice and research in ML Basic machine learning concepts: max likelihood,
More informationSparse regression. Optimization-Based Data Analysis. Carlos Fernandez-Granda
Sparse regression Optimization-Based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_spring16 Carlos Fernandez-Granda 3/28/2016 Regression Least-squares regression Example: Global warming Logistic
More informationLearning Multiple Tasks with a Sparse Matrix-Normal Penalty
Learning Multiple Tasks with a Sparse Matrix-Normal Penalty Yi Zhang and Jeff Schneider NIPS 2010 Presented by Esther Salazar Duke University March 25, 2011 E. Salazar (Reading group) March 25, 2011 1
More informationUnsupervised Learning Techniques Class 07, 1 March 2006 Andrea Caponnetto
Unsupervised Learning Techniques 9.520 Class 07, 1 March 2006 Andrea Caponnetto About this class Goal To introduce some methods for unsupervised learning: Gaussian Mixtures, K-Means, ISOMAP, HLLE, Laplacian
More informationLECTURE NOTE #11 PROF. ALAN YUILLE
LECTURE NOTE #11 PROF. ALAN YUILLE 1. NonLinear Dimension Reduction Spectral Methods. The basic idea is to assume that the data lies on a manifold/surface in D-dimensional space, see figure (1) Perform
More informationPathwise coordinate optimization
Stanford University 1 Pathwise coordinate optimization Jerome Friedman, Trevor Hastie, Holger Hoefling, Robert Tibshirani Stanford University Acknowledgements: Thanks to Stephen Boyd, Michael Saunders,
More informationGRAPH-BASED REGULARIZATION OF LARGE COVARIANCE MATRICES.
GRAPH-BASED REGULARIZATION OF LARGE COVARIANCE MATRICES. A Thesis Presented in Partial Fulfillment of the Requirements for the Degree Master of Science in the Graduate School of The Ohio State University
More informationIntroduction to machine learning and pattern recognition Lecture 2 Coryn Bailer-Jones
Introduction to machine learning and pattern recognition Lecture 2 Coryn Bailer-Jones http://www.mpia.de/homes/calj/mlpr_mpia2008.html 1 1 Last week... supervised and unsupervised methods need adaptive
More informationORIE 4741: Learning with Big Messy Data. Regularization
ORIE 4741: Learning with Big Messy Data Regularization Professor Udell Operations Research and Information Engineering Cornell October 26, 2017 1 / 24 Regularized empirical risk minimization choose model
More informationLecture 16: Small Sample Size Problems (Covariance Estimation) Many thanks to Carlos Thomaz who authored the original version of these slides
Lecture 16: Small Sample Size Problems (Covariance Estimation) Many thanks to Carlos Thomaz who authored the original version of these slides Intelligent Data Analysis and Probabilistic Inference Lecture
More informationApproximating the Covariance Matrix with Low-rank Perturbations
Approximating the Covariance Matrix with Low-rank Perturbations Malik Magdon-Ismail and Jonathan T. Purnell Department of Computer Science Rensselaer Polytechnic Institute Troy, NY 12180 {magdon,purnej}@cs.rpi.edu
More informationRobust Variable Selection Through MAVE
Robust Variable Selection Through MAVE Weixin Yao and Qin Wang Abstract Dimension reduction and variable selection play important roles in high dimensional data analysis. Wang and Yin (2008) proposed sparse
More information(Part 1) High-dimensional statistics May / 41
Theory for the Lasso Recall the linear model Y i = p j=1 β j X (j) i + ɛ i, i = 1,..., n, or, in matrix notation, Y = Xβ + ɛ, To simplify, we assume that the design X is fixed, and that ɛ is N (0, σ 2
More informationSparse PCA in High Dimensions
Sparse PCA in High Dimensions Jing Lei, Department of Statistics, Carnegie Mellon Workshop on Big Data and Differential Privacy Simons Institute, Dec, 2013 (Based on joint work with V. Q. Vu, J. Cho, and
More informationDimension Reduction and Low-dimensional Embedding
Dimension Reduction and Low-dimensional Embedding Ying Wu Electrical Engineering and Computer Science Northwestern University Evanston, IL 60208 http://www.eecs.northwestern.edu/~yingwu 1/26 Dimension
More informationarxiv: v1 [math.st] 13 Feb 2012
Sparse Matrix Inversion with Scaled Lasso Tingni Sun and Cun-Hui Zhang Rutgers University arxiv:1202.2723v1 [math.st] 13 Feb 2012 Address: Department of Statistics and Biostatistics, Hill Center, Busch
More informationSelection of Smoothing Parameter for One-Step Sparse Estimates with L q Penalty
Journal of Data Science 9(2011), 549-564 Selection of Smoothing Parameter for One-Step Sparse Estimates with L q Penalty Masaru Kanba and Kanta Naito Shimane University Abstract: This paper discusses the
More informationA Blockwise Descent Algorithm for Group-penalized Multiresponse and Multinomial Regression
A Blockwise Descent Algorithm for Group-penalized Multiresponse and Multinomial Regression Noah Simon Jerome Friedman Trevor Hastie November 5, 013 Abstract In this paper we purpose a blockwise descent
More informationRegularization and Variable Selection via the Elastic Net
p. 1/1 Regularization and Variable Selection via the Elastic Net Hui Zou and Trevor Hastie Journal of Royal Statistical Society, B, 2005 Presenter: Minhua Chen, Nov. 07, 2008 p. 2/1 Agenda Introduction
More informationComputational Statistics and Data Analysis. SURE-tuned tapering estimation of large covariance matrices
Computational Statistics and Data Analysis 58(2013) 339 351 Contents lists available at SciVerse ScienceDirect Computational Statistics and Data Analysis journal homepage: www.elsevier.com/locate/csda
More information