Sparse Covariance Selection using Semidefinite Programming

Sparse Covariance Selection using Semidefinite Programming

A. d'Aspremont, ORFE, Princeton University.
Joint work with O. Banerjee, L. El Ghaoui & G. Natsoulis, U.C. Berkeley & Iconix Pharmaceuticals.
Support from EUROCONTROL and Google.

Introduction

We estimate a sample covariance matrix $\Sigma$ from empirical data... Objective: infer dependence relationships between variables. We want this information to be as sparse as possible.

Basic solution: look at the magnitude of the covariance coefficients,
$$|\Sigma_{ij}| > \beta \quad \Longrightarrow \quad \text{variables } i \text{ and } j \text{ are related},$$
and simply threshold smaller coefficients to zero (the result is not always positive semidefinite). We can do better...
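A minimal numpy sketch of this thresholding heuristic (the data, the threshold $\beta$ and all names are illustrative, not taken from the slides); it also checks whether positive semidefiniteness survives the thresholding:

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.standard_normal((30, 5))
Sigma = np.cov(Z, rowvar=False)                 # sample covariance estimate

beta = 0.2
Sigma_thr = np.where(np.abs(Sigma) > beta, Sigma, 0.0)   # keep only |Sigma_ij| > beta

print(np.linalg.eigvalsh(Sigma).min())          # nonnegative up to rounding
print(np.linalg.eigvalsh(Sigma_thr).min())      # no guarantee: thresholding can break psd
```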

Covariance Selection

Following Dempster (1972), look for zeros in the inverse covariance matrix:

Parsimony. Suppose that we are estimating a Gaussian density:
$$f(x, \Sigma) = \left(\frac{1}{2\pi}\right)^{p/2} \left(\frac{1}{\det \Sigma}\right)^{1/2} \exp\left(-\frac{1}{2}\, x^T \Sigma^{-1} x\right),$$
a sparse inverse matrix $\Sigma^{-1}$ corresponds to a sparse representation of the density $f$ as a member of an exponential family of distributions:
$$f(x, \Sigma) = \exp\big(\alpha_0 + t(x) + \alpha_{11} t_{11}(x) + \dots + \alpha_{rs} t_{rs}(x)\big)$$
with here $t_{ij}(x) = x_i x_j$ and $\alpha_{ij} = \Sigma^{-1}_{ij}$. Dempster (1972) calls $\Sigma^{-1}_{ij}$ a concentration coefficient.

There is more...

Covariance Selection

Covariance selection: with $m+1$ observations $x_i \in \mathbf{R}^n$ on $n$ random variables, we estimate a sample covariance matrix $S$ such that
$$S = \frac{1}{m+1} \sum_{i=1}^{m+1} (x_i - \bar{x})(x_i - \bar{x})^T.$$

Choose a symmetric subset $I$ of matrix coefficients and denote by $J$ the remaining coefficients. Choose a covariance matrix estimator $\hat{\Sigma}$ such that:

$\hat{\Sigma}_{ij} = S_{ij}$ for all indices $(i, j)$ in $J$

$\hat{\Sigma}^{-1}_{ij} = 0$ for all indices $(i, j)$ in $I$

We simply select a topology of zeroes in the inverse covariance matrix...

Covariance Selection

Why is this a good choice? Dempster (1972) shows:

Maximum Entropy. Among all Gaussian models $\Sigma$ such that $\Sigma_{ij} = S_{ij}$ on $J$, the choice $\hat{\Sigma}^{-1}_{ij} = 0$ on $I$ has maximum entropy.

Maximum Likelihood. Among all Gaussian models $\Sigma$ such that $\Sigma^{-1}_{ij} = 0$ on $I$, the choice $\hat{\Sigma}_{ij} = S_{ij}$ on $J$ has maximum likelihood.

Existence and Uniqueness. If there is a positive semidefinite matrix $\hat{\Sigma}$ satisfying $\hat{\Sigma}_{ij} = S_{ij}$ on $J$, then there is only one such matrix satisfying $\hat{\Sigma}^{-1}_{ij} = 0$ on $I$.

Covariance Selection

Conditional independence: suppose $X, Y, Z$ are jointly normal with covariance matrix $\Sigma$, with
$$\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}$$
where $\Sigma_{11} \in \mathbf{R}^{2 \times 2}$ and $\Sigma_{22} \in \mathbf{R}$.

Conditioned on $Z$, the pair $(X, Y)$ is still normally distributed, with covariance matrix $C$ satisfying:
$$C = \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21} = \left(\left(\Sigma^{-1}\right)_{11}\right)^{-1}$$

So $X$ and $Y$ are conditionally independent iff $\left(\Sigma^{-1}\right)_{11}$ is diagonal, which is also:
$$\Sigma^{-1}_{xy} = 0$$

Covariance Selection

Suppose we have i.i.d. noise $\epsilon_i \sim \mathcal{N}(0, 1)$ and the following linear model:
$$x = z + \epsilon_1, \qquad y = z + \epsilon_2, \qquad z = \epsilon_3.$$

Graphically, this is a graph with edges Z-X and Z-Y and no direct X-Y edge.

Covariance Selection

The covariance matrix and inverse covariance are given by:
$$\Sigma = \begin{pmatrix} 2 & 1 & 1 \\ 1 & 2 & 1 \\ 1 & 1 & 1 \end{pmatrix} \qquad \Sigma^{-1} = \begin{pmatrix} 1 & 0 & -1 \\ 0 & 1 & -1 \\ -1 & -1 & 3 \end{pmatrix}$$

The inverse covariance matrix has $\Sigma^{-1}_{12} = 0$, clearly showing that the variables $x$ and $y$ are independent conditioned on $z$. Graphically, this is again the graph with edges Z-X and Z-Y (read off $\Sigma^{-1}$), versus the fully connected graph on X, Y, Z (read off $\Sigma$).
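A quick numpy check of this toy example (illustrative only): it recovers the inverse above and the conditional covariance formula from the previous slide.

```python
import numpy as np

Sigma = np.array([[2.0, 1.0, 1.0],
                  [1.0, 2.0, 1.0],
                  [1.0, 1.0, 1.0]])

P = np.linalg.inv(Sigma)
print(np.round(P, 6))      # [[ 1,  0, -1], [ 0,  1, -1], [-1, -1,  3]]

# Conditional covariance of (x, y) given z: Sigma_11 - Sigma_12 Sigma_22^{-1} Sigma_21
C = Sigma[:2, :2] - np.outer(Sigma[:2, 2], Sigma[2, :2]) / Sigma[2, 2]
print(np.round(C, 6))      # diagonal, and equal to inv(P[:2, :2])
```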

Covariance Selection

On a slightly larger scale: dependence graphs on forward rates at maturities ranging from 0.5Y to 10Y, before and after covariance selection.

Applications & Related Work

Gene expression data. The sample data is composed of gene expression vectors and we want to isolate links in the expression of various genes. See Dobra, Hans, Jones, Nevins, Yao & West (2004), Dobra & West (2004) for example.

Speech Recognition. See Bilmes (1999), Bilmes (2000) or Chen & Gopinath (1999).

Finance. Covariance estimation.

Related work by Dahl, Roychowdhury & Vandenberghe (2005): interior point methods for large, sparse MLE. See also d'Aspremont, El Ghaoui, Jordan & Lanckriet (2005) on sparse principal component analysis (PCA).

Outline

Introduction
Robust Maximum Likelihood Estimation
Algorithms
Numerical Results

Maximum Likelihood Estimation

We can estimate $\Sigma$ by solving the following maximum likelihood problem:
$$\max_{X \in \mathbf{S}_n} \; \log \det X - \mathbf{Tr}(SX)$$

This problem is convex and has an explicit answer $\hat{\Sigma} = S^{-1}$ if $S \succ 0$.

Problem here: how do we make $\Sigma^{-1}$ sparse? In other words, how do we efficiently choose $I$ and $J$? Solution: penalize the MLE.

AIC and BIC

Original solution in Akaike (1973), penalize the likelihood function:
$$\max_{X \in \mathbf{S}_n} \; \log \det X - \mathbf{Tr}(SX) - \rho\,\mathbf{Card}(X)$$
where $\mathbf{Card}(X)$ is the number of nonzero elements in $X$.

Set $\rho = 2/(m+1)$ for the Akaike Information Criterion (AIC).

Set $\rho = \frac{\log(m+1)}{m+1}$ for the Bayesian Information Criterion (BIC).

Of course, this is an (NP-hard) combinatorial problem...

Convex Relaxation

We can form a convex relaxation of the AIC- or BIC-penalized MLE
$$\max_{X \in \mathbf{S}_n} \; \log \det X - \mathbf{Tr}(SX) - \rho\,\mathbf{Card}(X)$$
by replacing $\mathbf{Card}(X)$ with $\|X\|_1 = \sum_{ij} |X_{ij}|$ to solve
$$\max_{X \in \mathbf{S}_n} \; \log \det X - \mathbf{Tr}(SX) - \rho \|X\|_1$$

Classic $\ell_1$ heuristic: $\|X\|_1$ is a convex lower bound on $\mathbf{Card}(X)$. See Fazel, Hindi & Boyd (2001) for related applications.
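For small instances, the relaxed problem can be handed directly to an off-the-shelf conic solver. A minimal CVXPY sketch (not the authors' code; it assumes an installed solver that handles log_det, such as SCS, and the function and variable names are illustrative):

```python
import cvxpy as cp
import numpy as np

def penalized_mle(S, rho):
    """l1-penalized maximum likelihood estimate of the inverse covariance."""
    n = S.shape[0]
    X = cp.Variable((n, n), symmetric=True)
    objective = cp.Maximize(cp.log_det(X) - cp.trace(S @ X) - rho * cp.sum(cp.abs(X)))
    cp.Problem(objective, [X >> 0]).solve()
    return X.value

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Z = rng.standard_normal((200, 10))
    S = np.cov(Z, rowvar=False)          # 10 x 10 sample covariance
    print(np.round(penalized_mle(S, rho=0.1), 2))
```

With $\rho = 0$ and $S \succ 0$ this recovers the unpenalized MLE $S^{-1}$ from the previous slide; the point of the later algorithms is to avoid generic solvers on large, dense instances.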

$\ell_1$ relaxation

Assuming $\|x\|_\infty \leq 1$, this relaxation replaces:
$$\mathbf{Card}(x) = \sum_{i=1}^n \mathbf{1}_{\{x_i \neq 0\}}$$
with
$$\|x\|_1 = \sum_{i=1}^n |x_i|$$

Graphically, on $[-1, 1]$ the step function $\mathbf{Card}(x)$ lies above $|x|$, which is its convex lower bound there.

Robustness

This penalized MLE problem can be rewritten:
$$\max_{X \in \mathbf{S}_n} \; \min_{|U_{ij}| \leq \rho} \; \log \det X - \mathbf{Tr}((S + U)X)$$

This can be interpreted as a robust MLE problem with componentwise noise of magnitude $\rho$ on the elements of $S$. The relaxed sparsity requirement is equivalent to a robustification. See d'Aspremont et al. (2005) for similar results on sparse PCA.
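To see the equivalence, evaluate the inner minimization in closed form:
$$\min_{|U_{ij}| \leq \rho} \; -\mathbf{Tr}(UX) \;=\; -\rho \sum_{ij} |X_{ij}| \;=\; -\rho\,\|X\|_1,$$
attained at $U_{ij} = \rho\,\mathrm{sign}(X_{ij})$, so the robust problem reduces to the $\ell_1$-penalized MLE above.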

Outline

Introduction
Robust Maximum Likelihood Estimation
Algorithms
Numerical Results

Algorithms

We need to solve:
$$\max_{X \in \mathbf{S}_n} \; \log \det X - \mathbf{Tr}(SX) - \rho \|X\|_1$$

For medium-size problems, this can be done using interior point methods. In practice, we need to solve very large, dense instances... The $\|X\|_1$ penalty implicitly introduces $O(n^2)$ linear constraints and makes interior point methods too expensive.

Algorithms

Complexity options...

             First-order    Smooth      Newton IP
Memory       O(n)           O(n)        O(n^2)
Complexity   O(1/ε^2)       O(1/ε)      O(log(1/ε))

Algorithms

Here, we can exploit problem structure. Our problem has a particular min-max structure:
$$\max_{X \in \mathbf{S}_n} \; \min_{|U_{ij}| \leq \rho} \; \log \det X - \mathbf{Tr}((S + U)X)$$

This min-max structure means that we use the prox-function algorithms of Nesterov (2005) (see also Nemirovski (2004)) to solve large, dense problem instances. We also detail a greedy block-coordinate descent method with good empirical performance.

Nesterov's method

Assuming that a problem can be written according to a min-max model, the algorithm works as follows...

Regularization. Add a strongly convex penalty inside the min-max representation to produce an $\epsilon$-approximation of $f$ with Lipschitz continuous gradient (generalized Moreau-Yosida regularization step, see Lemaréchal & Sagastizábal (1997) for example).

Optimal first-order minimization. Use the optimal first-order scheme for functions with Lipschitz continuous gradient detailed in Nesterov (1983) to solve the regularized problem.

Caveat: only efficient if the subproblems involved in these steps can be solved explicitly or very efficiently...
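For illustration, here is a generic sketch of the optimal first-order scheme of Nesterov (1983) for a convex function with L-Lipschitz gradient; it stands in for the smoothed objective and is not the covariance-selection code itself (the quadratic test problem and all names are illustrative):

```python
import numpy as np

def nesterov_smooth(grad_f, x0, L, iters=500):
    """Accelerated gradient method for a convex f with L-Lipschitz gradient."""
    x = y = np.asarray(x0, dtype=float)
    t = 1.0
    for _ in range(iters):
        x_new = y - grad_f(y) / L                        # gradient step from extrapolated point
        t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        y = x_new + ((t - 1.0) / t_new) * (x_new - x)    # momentum / extrapolation
        x, t = x_new, t_new
    return x

# Example: minimize the smooth convex function 0.5 * ||A x - b||^2.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 10)); b = rng.standard_normal(20)
L = np.linalg.norm(A, 2) ** 2                            # Lipschitz constant of the gradient
x_star = nesterov_smooth(lambda x: A.T @ (A @ x - b), np.zeros(10), L)
print(np.linalg.norm(A.T @ (A @ x_star - b)))            # gradient norm, should be small
```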

Nesterov's method

Numerical steps: computing the inverse of $X$ and two eigenvalue decompositions. The total complexity estimate of the method is:
$$O\left(\frac{\kappa\,(\log \kappa)\, n^{4.5}}{\alpha\rho\,\epsilon}\right)$$
where $\log \kappa = \log(\beta/\alpha)$ bounds the solution's condition number.

Dual block-coordinate descent

Here we consider the dual of the original problem:
$$\begin{array}{ll} \text{maximize} & \log\det(S + U) \\ \text{subject to} & |U_{ij}| \leq \rho \\ & S + U \succ 0 \end{array}$$

The diagonal entries of an optimal $U$ are $U_{ii} = \rho$. We solve for $U$ column by column, sweeping all the columns.

Dual block-coordinate descent

Let $C = S + U$ be the current iterate; after a permutation we can always assume that we optimize over the last column:
$$\begin{array}{ll} \text{maximize} & \log\det \begin{pmatrix} C_{11} & C_{12} + u \\ C_{21} + u^T & C_{22} \end{pmatrix} \\ \text{subject to} & \|u\|_\infty \leq \rho \end{array}$$
where $C_{12}$ is the last column of $C$ (off-diagonal part).

Each iteration reduces to a simple box-constrained QP:
$$\begin{array}{ll} \text{minimize} & u^T (C_{11})^{-1} u \\ \text{subject to} & \|u\|_\infty \leq \rho \end{array}$$

We stop when $\mathbf{Tr}(SX) + \rho\|X\|_1 - n \leq \epsilon$, where $X = C^{-1}$.
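A minimal Python sketch of this dual block-coordinate descent (not the authors' implementation). It uses the equivalent column parameterization in which the full off-diagonal column of $W = S + U$ is the variable and the box constraint is centered at the corresponding column of $S$, and it solves each box-constrained QP with a generic bound-constrained solver; all names and tolerances are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def covsel_dual_bcd(S, rho, eps=1e-4, max_sweeps=100):
    n = S.shape[0]
    W = S + rho * np.eye(n)              # feasible start; optimal diagonal is S_ii + rho
    X = np.linalg.inv(W)
    for _ in range(max_sweeps):
        for j in range(n):
            idx = np.arange(n) != j
            Q = np.linalg.inv(W[np.ix_(idx, idx)])   # (a low-rank update could avoid this)
            s12 = S[idx, j]
            # box-constrained QP: min_y y^T Q y  s.t.  |y - s12| <= rho  (componentwise)
            res = minimize(lambda y: y @ Q @ y, W[idx, j],
                           jac=lambda y: 2.0 * Q @ y,
                           bounds=list(zip(s12 - rho, s12 + rho)),
                           method="L-BFGS-B")
            W[idx, j] = res.x
            W[j, idx] = res.x
        X = np.linalg.inv(W)             # primal candidate
        gap = np.trace(S @ X) + rho * np.abs(X).sum() - n
        if gap <= eps:
            break
    return X                             # estimate of the (approximately) sparse inverse covariance
```

Calling covsel_dual_bcd(S, rho) with a sample covariance $S$ and $\rho > 0$ returns an inverse-covariance estimate whose small entries can then be read off as the dependence structure.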

Dual block-coordinate descent

Complexity? Luo & Tseng (1992): block-coordinate descent has linear convergence in this case.

Smooth first-order methods can be used to solve the inner QP problem. The hardest numerical step at each iteration is computing an inverse; the matrix to invert is only updated by a low-rank matrix at each iteration, so we can use the Sherman-Morrison-Woodbury formula.
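For reference, a quick numerical check of the Sherman-Morrison-Woodbury identity behind this low-rank update (a generic illustration, not tied to the slides' code):

```python
# (A + U V^T)^{-1} = A^{-1} - A^{-1} U (I + V^T A^{-1} U)^{-1} V^T A^{-1}
import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 2                                   # rank-k update of an n x n matrix
A = np.eye(n) + 0.1 * rng.standard_normal((n, n))
A = A @ A.T                                    # well-conditioned positive definite A
U = rng.standard_normal((n, k))
V = rng.standard_normal((n, k))

A_inv = np.linalg.inv(A)
correction = A_inv @ U @ np.linalg.inv(np.eye(k) + V.T @ A_inv @ U) @ V.T @ A_inv
updated_inv = A_inv - correction               # O(n^2 k) given A_inv, vs O(n^3) from scratch

print(np.allclose(updated_inv, np.linalg.inv(A + U @ V.T)))   # True
```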

Outline

Introduction
Robust Maximum Likelihood Estimation
Algorithms
Numerical Results

Numerical Examples

Generate random examples:

Take a sparse, random p.s.d. matrix $A \in \mathbf{S}_n$.
Add uniform noise with magnitude $\sigma$ to its inverse.
Then solve the penalized MLE problem (or the modified one):
$$\max_{X \in \mathbf{S}_n} \; \log \det X - \mathbf{Tr}(SX) - \rho \|X\|_1$$
and compare the solution with the original matrix $A$.
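A sketch of this synthetic setup (the sparsity level, noise distribution and problem size are illustrative assumptions, not taken from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 20, 0.1

# sparse, random positive definite "true" inverse covariance A (diagonally dominant)
mask = np.triu(rng.random((n, n)) < 0.1, k=1)
A = np.where(mask, rng.uniform(-0.5, 0.5, (n, n)), 0.0)
A = A + A.T
A += np.diag(np.abs(A).sum(axis=1) + 1.0)      # diagonal dominance makes A positive definite

# noisy covariance: uniform noise of magnitude sigma added to A^{-1}
E = sigma * rng.uniform(-1.0, 1.0, (n, n))
S = np.linalg.inv(A) + 0.5 * (E + E.T)         # symmetric noisy "sample" covariance

# S is then fed to the penalized MLE and the solution's sparsity pattern compared with A.
```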

Numerical Examples

A basic example: the figure compares three sparsity patterns, the original inverse covariance matrix $A$, the noisy inverse $\Sigma^{-1}$, and the solution obtained for $\rho = \sigma$.

Covariance Selection

Forward rates covariance matrix for maturities ranging from 0.5 to 10 years: dependence graphs recovered with $\rho = 0$ and $\rho = 0.01$.

ROC curves

Classification error. ROC curves (sensitivity versus specificity) for the solution to the covariance selection problem ("Covsel") compared with simple thresholding of $B^{-1}$, for various levels of noise: $\sigma = 0.3$ (left) and $\sigma = 0.5$ (right). Here $n = 50$.

Computing time. Duality gap versus CPU time (in seconds) on a random problem, solved using Nesterov's method (squares, labeled "Smooth Optim.") and the coordinate descent algorithms (circles and solid line, labeled "Dual Descent" and "Primal Ascent").

Conclusion

A convex relaxation for sparse covariance selection. Robustness interpretation. Two algorithms for dense large-scale instances.

Precision requirements? Thresholding? How do we fix $\rho$?... If you have financial applications in mind...

Network graphs generated using Cytoscape.

References

Akaike, H. (1973), Information theory and an extension of the maximum likelihood principle, in B. N. Petrov & F. Csaki, eds, Second International Symposium on Information Theory, Akademiai Kiado, Budapest, pp. 267-281.

Bilmes, J. A. (1999), Natural statistic models for automatic speech recognition, Ph.D. thesis, UC Berkeley, Dept. of EECS, CS Division.

Bilmes, J. A. (2000), Factored sparse inverse covariance matrices, IEEE International Conference on Acoustics, Speech, and Signal Processing.

Chen, S. S. & Gopinath, R. A. (1999), Model selection in acoustic modeling, EUROSPEECH.

Dahl, J., Roychowdhury, V. & Vandenberghe, L. (2005), Maximum likelihood estimation of Gaussian graphical models: numerical implementation and topology selection, UCLA preprint.

d'Aspremont, A., El Ghaoui, L., Jordan, M. & Lanckriet, G. R. G. (2005), A direct formulation for sparse PCA using semidefinite programming, Advances in Neural Information Processing Systems 17, 41-48.

Dempster, A. (1972), Covariance selection, Biometrics 28, 157-175.

Dobra, A., Hans, C., Jones, B., Nevins, J. R., Yao, G. & West, M. (2004), Sparse graphical models for exploring gene expression data, Journal of Multivariate Analysis 90(1), 196-212.

Dobra, A. & West, M. (2004), Bayesian covariance selection, working paper.

Fazel, M., Hindi, H. & Boyd, S. (2001), A rank minimization heuristic with application to minimum order system approximation, Proceedings American Control Conference 6, 4734-4739.

Lemaréchal, C. & Sagastizábal, C. (1997), Practical aspects of the Moreau-Yosida regularization: theoretical preliminaries, SIAM Journal on Optimization 7(2), 367-385.

Luo, Z. Q. & Tseng, P. (1992), On the convergence of the coordinate descent method for convex differentiable minimization, Journal of Optimization Theory and Applications 72(1), 7-35.

Nemirovski, A. (2004), Prox-method with rate of convergence O(1/T) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems, SIAM Journal on Optimization 15(1), 229-251.

Nesterov, Y. (1983), A method of solving a convex programming problem with convergence rate O(1/k^2), Soviet Mathematics Doklady 27(2), 372-376.

Nesterov, Y. (2005), Smooth minimization of non-smooth functions, Mathematical Programming, Series A 103, 127-152.