A direct formulation for sparse PCA using semidefinite programming

A direct formulation for sparse PCA using semidefinite programming

A. d'Aspremont, L. El Ghaoui, M. Jordan, G. Lanckriet
ORFE, Princeton University & EECS, U.C. Berkeley

INFORMS, Denver, October 2004

Introduction

PCA is a classic tool in multivariate data analysis.
Input: a covariance matrix A.
Output: a sequence of factors ranked by variance.
Each factor is a linear combination of the problem variables.
Typical use: reduce the number of dimensions of a model while maximizing the information (variance) contained in the simplified model.

Introduction

Numerically, PCA is just an eigenvalue decomposition of the covariance matrix:

$A = \sum_{i=1}^n \lambda_i x_i x_i^T$

where the factors $x_i$ are uncorrelated.
The result of the PCA is usually not sparse, i.e. each factor is a linear combination of all the variables in the model.
Can we get sparse factors instead?
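For concreteness, a minimal numpy sketch of this step: standard PCA as an eigendecomposition of a covariance matrix, with the leading factor typically dense. The data and variable names are illustrative, not taken from the talk.

```python
import numpy as np

# A minimal sketch: PCA as an eigendecomposition of a covariance matrix.
rng = np.random.default_rng(0)
Z = rng.standard_normal((200, 10))      # illustrative data: 200 samples, 10 variables
A = np.cov(Z, rowvar=False)             # sample covariance matrix

eigvals, eigvecs = np.linalg.eigh(A)    # A = sum_i lambda_i x_i x_i^T
order = np.argsort(eigvals)[::-1]       # rank factors by explained variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

x1 = eigvecs[:, 0]                      # first principal factor
print("leading variance:", eigvals[0])
print("nonzero loadings:", np.count_nonzero(np.abs(x1) > 1e-8))  # typically all 10
```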

Applications, previous works

Why sparse factors?
Financial time series analysis, dimensionality reduction, hedging, etc. (Rebonato (1998), ...)
Multiscale data processing (Chennubhotla & Jepson (2001), ...)
Gene expression data (survey by Wall, Rechtsteiner & Rocha (2002), ...)
Signal & image processing, vision, OCR, ECG (Johnstone & Lu (2003))

Sparse PCA: Applications

What does sparsity mean here?
Financial time series analysis: sparse factors often mean fewer assets in the portfolio, hence lower fixed transaction costs.
Multiscale data processing: extract sparse structure from motion data, ...
Gene expression data: each variable is a particular gene; sparse factors highlight the action of a few genes, making interpretation easier.
Image processing: sparse factors involve only specific zones or objects in the image.

Previous work: Related literature

Cadima & Jolliffe (1995): the loadings with small absolute value are thresholded to zero.
A non-convex method called SCoTLASS by Jolliffe & Uddin (2003). (Same setup here, with the numerical issues handled by relaxation.)
Zou, Hastie & Tibshirani (2004): a regression-based technique called sparse PCA (SPCA), based on the fact that PCA can be written as a regression-type (non-convex) optimization problem, using a LASSO (Tibshirani (1996)) $\ell_1$ norm penalty.
Performance: these methods are either very suboptimal or nonconvex.
Regression: works for large-scale examples.

A: rank one approximation

Problem definition: here, we focus on the first factor x, computed as the solution of:

$\min_{x \in \mathbb{R}^n} \|A - xx^T\|_F$

where $\|X\|_F$ is the Frobenius norm of X, i.e. $\|X\|_F = \sqrt{\mathbf{Tr}(X^2)}$.
In this case, we get an exact solution $\lambda_{\max}(A)\, x_1 x_1^T$, where $\lambda_{\max}(X)$ is the maximum eigenvalue and $x_1$ is the associated eigenvector.

Variational formulation

We can rewrite the previous problem as:

$\max \ x^T A x \quad \text{subject to} \quad \|x\|_2 = 1. \qquad (1)$

Perron-Frobenius: this problem is easy, its solution is again $\lambda_{\max}(A)$ at $x_1$.
Here however, we want a little bit more... We look for a sparse solution and solve instead:

$\max \ x^T A x \quad \text{subject to} \quad \|x\|_2 = 1, \ \mathbf{Card}(x) \le k, \qquad (2)$

where $\mathbf{Card}(x)$ denotes the cardinality (number of non-zero elements) of x. This is non-convex and numerically hard.
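The cardinality-constrained problem (2) is combinatorial, but for small n it can be solved exactly by enumerating supports, since for a fixed support the optimum is the largest eigenvalue of the corresponding principal submatrix. A hedged brute-force sketch, not part of the talk:

```python
import itertools
import numpy as np

# Exact solution of problem (2) for tiny n: for a fixed support S the optimum is
# the largest eigenvalue of the principal submatrix A[S, S].
def sparse_pca_bruteforce(A, k):
    n = A.shape[0]
    best_val, best_x = -np.inf, None
    for S in itertools.combinations(range(n), k):
        sub = A[np.ix_(S, S)]
        vals, vecs = np.linalg.eigh(sub)
        if vals[-1] > best_val:
            best_val = vals[-1]
            best_x = np.zeros(n)
            best_x[list(S)] = vecs[:, -1]
    return best_val, best_x   # optimal value and a k-sparse unit-norm loading vector
```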

Outline

Introduction
Semidefinite relaxation
Large-scale problems
Numerical results

Semidefinite relaxation

Start from:

$\max \ x^T A x \quad \text{subject to} \quad \|x\|_2 = 1, \ \mathbf{Card}(x) \le k,$

let $X = xx^T$, and write everything in terms of the matrix X:

$\max \ \mathbf{Tr}(AX) \quad \text{subject to} \quad \mathbf{Tr}(X) = 1, \ \mathbf{Card}(X) \le k^2, \ X = xx^T.$

This is strictly equivalent!

Semidefinite relaxation

Why? If $X = xx^T$, then:
in the objective: $x^T A x = \mathbf{Tr}(AX)$
the constraint $\mathbf{Card}(x) \le k$ becomes $\mathbf{Card}(X) \le k^2$
the constraint $\|x\|_2 = 1$ becomes $\mathbf{Tr}(X) = 1$.

We can go a little further and replace $X = xx^T$ by the equivalent $X \succeq 0$, $\mathbf{Rank}(X) = 1$, to get:

$\max \ \mathbf{Tr}(AX) \quad \text{subject to} \quad \mathbf{Tr}(X) = 1, \ \mathbf{Card}(X) \le k^2, \ X \succeq 0, \ \mathbf{Rank}(X) = 1, \qquad (3)$

Again, this is the same problem!

Semidefinite relaxation

Numerically, this is still hard:
The constraint $\mathbf{Card}(X) \le k^2$ is still non-convex.
So is the constraint $\mathbf{Rank}(X) = 1$.

But we have made some progress:
The objective $\mathbf{Tr}(AX)$ is now linear in X.
The (non-convex) constraint $\|x\|_2 = 1$ became a linear constraint $\mathbf{Tr}(X) = 1$.

To solve this problem efficiently, we need to relax the two non-convex constraints above.

Semidefinite relaxation

Easy to do here... If $u \in \mathbb{R}^p$, $\mathbf{Card}(u) = q$ implies $\|u\|_1 \le \sqrt{q}\,\|u\|_2$.
We transform the non-convex problem into a convex relaxation:
Replace $\mathbf{Card}(X) \le k^2$ by the weaker (but convex) constraint $\mathbf{1}^T |X| \mathbf{1} \le k$.
Simply drop the rank constraint.

Our problem now becomes:

$\max \ \mathbf{Tr}(AX) \quad \text{subject to} \quad \mathbf{Tr}(X) = 1, \ \mathbf{1}^T |X| \mathbf{1} \le k, \ X \succeq 0, \qquad (4)$

This is a convex program and can be solved efficiently.

Semidefinite programming

In fact, we get a semidefinite program in the variable $X \in S^n$, which can be solved using SeDuMi by Sturm (1999) or SDPT3 by Toh, Todd & Tutuncu (1996):

$\max \ \mathbf{Tr}(AX) \quad \text{subject to} \quad \mathbf{Tr}(X) = 1, \ \mathbf{1}^T |X| \mathbf{1} \le k, \ X \succeq 0.$

Complexity: polynomial...
Problem here: the program has $O(n^2)$ dense constraints on the matrix X. In practice, it is hard to solve problems with n > 15 without additional work.
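As a modern illustration of problem (4), here is a hedged CVXPY sketch; CVXPY and the SCS solver are stand-ins for the SeDuMi/SDPT3 solvers mentioned above, and the function name is ours:

```python
import cvxpy as cp

# A sketch of the relaxation (4): maximize Tr(AX) s.t. Tr(X) = 1, 1^T|X|1 <= k, X psd.
def dspca_sdp(A, k):
    n = A.shape[0]
    X = cp.Variable((n, n), symmetric=True)
    constraints = [X >> 0,                  # X positive semidefinite
                   cp.trace(X) == 1,        # Tr(X) = 1
                   cp.sum(cp.abs(X)) <= k]  # 1^T |X| 1 <= k
    prob = cp.Problem(cp.Maximize(cp.trace(A @ X)), constraints)
    prob.solve(solver=cp.SCS)               # any SDP-capable solver works here
    return X.value

# The leading eigenvector of the optimal X is then used as the (approximately sparse) loading.
```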

Singular Value Decomposition

The same technique works for the Singular Value Decomposition instead of PCA. The variational formulation of SVD is here:

$\min \ \|A - uv^T\|_F \quad \text{subject to} \quad \mathbf{Card}(u) \le k_1, \ \mathbf{Card}(v) \le k_2,$

in the variables $(u, v) \in \mathbb{R}^m \times \mathbb{R}^n$, where $k_1 \le m$, $k_2 \le n$ are fixed. This can be relaxed as the following semidefinite program:

$\max \ \mathbf{Tr}(A^T X_{12}) \quad \text{subject to} \quad X \succeq 0, \ \mathbf{Tr}(X_{ii}) = 1, \ \mathbf{1}^T |X_{ii}| \mathbf{1} \le k_i, \ i = 1, 2, \ \mathbf{1}^T |X_{12}| \mathbf{1} \le \sqrt{k_1 k_2},$

in the variable $X \in S^{m+n}$ with blocks $X_{ij}$ for $i, j = 1, 2$.

Outline

Introduction
Semidefinite relaxation
Large-scale problems
Numerical results

IP versus first-order methods

Interior point methods for semidefinite/cone programs:
Produce a solution up to machine precision.
Compute a Newton step at each iteration: costly.

In our case:
We are not really interested in getting a solution up to machine precision.
The problems are too big to compute a Newton step...

Solution: use first-order techniques...

First-order methods

Basic model for the problem: a black-box oracle producing
the function value $f(x)$
a subgradient $g(x) \in \partial f(x)$

Here f is convex and non-smooth. Using only this information, we need $O(1/\varepsilon^2)$ steps to find an $\varepsilon$-optimal solution.
However, if the function is convex with a Lipschitz-continuous gradient with constant L, then we need only $O(\sqrt{L/\varepsilon})$ steps to get an $\varepsilon$-optimal solution.
Smoothness brings a massive improvement in the complexity...

Sparse PCA?

In our case, we look at a penalized version of the relaxed sparse PCA problem:

$\max_U \ \mathbf{Tr}(AU) - \mathbf{1}^T |U| \mathbf{1} \ : \ U \succeq 0, \ \mathbf{Tr}(U) = 1. \qquad (5)$

Difference? If we can solve the dual, these two formulations are equivalent. Otherwise: scale A...
Problem here: the function to minimize is not smooth!
Can we hope to do better than the worst-case complexity of $O(1/\varepsilon^2)$?
The answer is yes: exploit this particular problem structure...

Sparse PCA?

We can rewrite our problem as a convex-concave game:

$\max_{\{U \succeq 0, \ \mathbf{Tr}(U) = 1\}} \ \mathbf{Tr}(AU) - \mathbf{1}^T |U| \mathbf{1} \ = \ \min_{X \in Q_1} \ \max_{U \in Q_2} \ \langle X, U \rangle + \mathbf{Tr}(AU)$

where

$Q_1 = \{X \in S^n : |X_{ij}| \le 1, \ 1 \le i, j \le n\}$
$Q_2 = \{U \in S^n : U \succeq 0, \ \mathbf{Tr}(U) = 1\}$

Sparse PCA: complexity

Why a convex-concave game?
A recent result by Nesterov (2003) shows that this specific structure can be exploited to significantly reduce the complexity compared to the black-box case.
All the algorithm steps can be worked out explicitly in this case.
The algorithm in Nesterov (2003) reduces the complexity to $O(1/\varepsilon)$ instead of $O(1/\varepsilon^2)$!

Sparse PCA: large-scale algorithm

We can formulate our problem using the notation in Nesterov (2003) (except for A becoming L here):

$\max_{\{U \succeq 0, \ \mathbf{Tr}(U) = 1\}} \ \mathbf{Tr}(AU) - \mathbf{1}^T |U| \mathbf{1} \ = \ \min_{X \in Q_1} f(X)$

where

$Q_1 = \{X \in S^n : |X_{ij}| \le 1, \ 1 \le i, j \le n\}$
$f(X) = \lambda_{\max}(A + X) = \max_{U \in Q_2} \ \langle BX, U \rangle + \hat{\phi}(U)$
$Q_2 = \{U \in S^n : U \succeq 0, \ \mathbf{Tr}(U) = 1\}$, $\quad B = I_{n^2}$, $\quad \hat{\phi}(U) = \mathbf{Tr}(AU)$

Smooth minimization of non-smooth functions

What makes the algorithm in Nesterov (2003) work:
First, use the convex-concave game structure to regularize the function (inf-convolution with a strictly convex function, à la Moreau-Yosida; see for example Lemaréchal & Sagastizábal (1997)).
Then, use the optimal first-order minimization algorithm in Nesterov (1983) to minimize the smooth approximation.

The method works particularly well if:
All the steps in the regularization can be performed in closed form.
All the auxiliary minimization sub-problems can be solved in closed form.
This is the case here...

Regularization: prox functions

Procedure:
First, we fix a regularization parameter µ.
Then, we define a prox-function for the set $Q_2$:

$d_2(U) = \mathbf{Tr}(U \log U) + \log(n), \quad U \in Q_2$

With this choice of $d_2$:
the center of the set is then $X_0 = n^{-1} I_n$ with $d_2(X_0) = 0$
the convexity parameter of $d_2$ on $Q_2$ is bounded below by $\sigma_2 = 1/2$ (non-trivial, cf. Ben-Tal & Nemirovski (2004))

Regularization: prox functions

The non-smooth objective of the original problem is replaced with

$\min_{X \in Q_1} f_\mu(X),$

where $f_\mu$ is the penalized function involving the prox-function $d_2$:

$f_\mu(X) = \max_{U \in Q_2} \ \langle X, U \rangle + \mathbf{Tr}(AU) - \mu d_2(U)$

Because of our choice of prox-function:
the function $f_\mu(X)$ approximates f with a maximum error of $\varepsilon/2$
the gradient of $f_\mu$ is Lipschitz continuous with constant $L = \frac{1}{\mu \sigma_2}$

Algorithm

Set the regularization parameter µ. For $k \ge 0$ do:

1. Compute $f_\mu(X_k)$ and $\nabla f_\mu(X_k)$.
2. Find $Y_k = T_{Q_1}(X_k) = \arg\min_{Y \in Q_1} \ \langle \nabla f_\mu(X_k), Y - X_k \rangle + \tfrac{1}{2} L \|X_k - Y\|_F^2$.
3. Find $Z_k = \arg\min_{X \in Q_1} \left\{ \tfrac{L}{\sigma_1} d_1(X) + \sum_{i=0}^{k} \tfrac{i+1}{2} \langle \nabla f_\mu(X_i), X - X_i \rangle \right\}$.
4. Set $X_{k+1} = \tfrac{2}{k+3} Z_k + \tfrac{k+1}{k+3} Y_k$.

Algorithm

The most expensive step is the first one, computing the value and gradient of $f_\mu$:

Compute $f_\mu(X)$ as $\max_{U \in Q_2} \ \mathbf{Tr}(ZU) - \mu d_2(U)$, for $Z = A + X$.
The gradient is the maximizer itself: $\nabla f_\mu(X) = \arg\max_{U \in Q_2} \ \mathbf{Tr}(ZU) - \mu d_2(U)$.
The value can be computed in closed form as:

$f_\mu(X) = \mu \log\left( \sum_{i=1}^n \exp\left( \frac{\lambda_i(A + X)}{\mu} \right) \right) - \mu \log n$
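A hedged numerical sketch of this step (helper name ours): both the value and the gradient follow from one eigendecomposition of Z = A + X, with a shift of the eigenvalues for numerical stability; the gradient is the entropic maximizer, a "matrix softmax" of Z/µ.

```python
import numpy as np

# f_mu(X) = mu*log(sum_i exp(lambda_i(A+X)/mu)) - mu*log(n);
# grad f_mu(X) = exp((A+X)/mu) / Tr exp((A+X)/mu)  (the maximizer over Q_2).
def f_mu_and_grad(A, X, mu):
    n = A.shape[0]
    lam, V = np.linalg.eigh(A + X)        # eigendecomposition of Z = A + X
    shifted = (lam - lam.max()) / mu      # shift before exponentiating, for stability
    w = np.exp(shifted)
    f_val = mu * (np.log(w.sum()) + lam.max() / mu - np.log(n))
    U_star = (V * (w / w.sum())) @ V.T    # V diag(weights) V^T: gradient of f_mu
    return f_val, U_star
```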

Algorithm

The second step can also be computed in closed form:

$Y_k = T_{Q_1}(X_k) = \arg\min_{Y \in Q_1} \ \langle \nabla f_\mu(X_k), Y - X_k \rangle + \tfrac{1}{2} L \|X_k - Y\|_F^2$

is equivalent to a simple projection problem, $\arg\min_{\|Y\|_\infty \le 1} \|Y - V\|_F$, whose solution is given by:

$Y_{ij} = \mathrm{sgn}(V_{ij}) \min(|V_{ij}|, 1), \quad 1 \le i, j \le n.$

The third step is similar...
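In code, the projection onto $Q_1$ is just entrywise clipping to $[-1, 1]$; a one-line sketch (illustrative, not from the slides):

```python
import numpy as np

# Projection onto Q_1 = {Y : |Y_ij| <= 1}: Y_ij = sgn(V_ij) * min(|V_ij|, 1).
def project_Q1(V):
    return np.clip(V, -1.0, 1.0)
```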

Convergence

We can stop the algorithm when the gap

$\lambda_{\max}(A + X_k) - \mathbf{Tr}(A U_k) + \mathbf{1}^T |U_k| \mathbf{1} \le \epsilon,$

where $U_k = u^*((A + X_k)/\mu)$ is our current estimate of the dual variable.
The above gap is necessarily non-negative, since $X_k$ and $U_k$ are feasible for the primal and dual problem, respectively.
Check this criterion only periodically, for example every 100 iterations.
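Putting the pieces together, a hedged end-to-end sketch that reuses the illustrative helpers f_mu_and_grad and project_Q1 above. It applies plain projected gradient steps on $f_\mu$ rather than the full accelerated scheme with the auxiliary sequence $Z_k$, so it only shows the mechanics (smoothing, gradient, projection, gap check), not the $O(1/\varepsilon)$ rate.

```python
import numpy as np

# Simplified smoothing scheme for the penalized sparse PCA relaxation (sketch only).
def dspca_first_order(A, eps=1e-2, max_iter=10000, check_every=100):
    n = A.shape[0]
    mu = eps / (2.0 * np.log(n))      # chosen so the smoothing error mu*D_2 is eps/2
    L = 1.0 / (mu * 0.5)              # gradient Lipschitz constant 1/(mu*sigma_2), sigma_2 = 1/2
    X = np.zeros((n, n))
    U = np.eye(n) / n
    for k in range(max_iter):
        f_val, U = f_mu_and_grad(A, X, mu)
        X = project_Q1(X - U / L)     # projected gradient step on Q_1
        if k % check_every == 0:      # periodic duality-gap check, as on the slide
            gap = np.linalg.eigvalsh(A + X)[-1] - np.trace(A @ U) + np.abs(U).sum()
            if gap <= eps:
                break
    return U                          # dual variable; its leading eigenvector gives the loading
```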

Complexity

The maximum number of iterations is given by

$N = 4 \|B\|_{1,2} \sqrt{\frac{D_1 D_2}{\sigma_1 \sigma_2}} \, \frac{1}{\epsilon},$

with $D_1 = n^2/2$, $\sigma_1 = 1$, $D_2 = \log(n)$, $\sigma_2 = 1/2$, $\|B\|_{1,2} = 1$.
Since each iteration costs $O(n^3)$ flops, the worst-case flop count to get an $\epsilon$-optimal solution is given by

$O\left( \frac{n^4 \sqrt{\log n}}{\epsilon} \right)$

Outline

Introduction
Semidefinite relaxation
Large-scale problems
Numerical results

Cardinality versus k: model

Start with a sparse vector $v = (1, 0, 1, 0, 1, 0, 1, 0, 1, 0)$. We then define the matrix A as:

$A = U^T U + 15\, vv^T$

where $U \in S^{10}$ is a random matrix (uniform coefficients in [0, 1]). We solve:

$\max \ \mathbf{Tr}(AX) \quad \text{subject to} \quad \mathbf{Tr}(X) = 1, \ \mathbf{1}^T |X| \mathbf{1} \le k, \ X \succeq 0,$

Try k = 1, ..., 10.
For each k, sample 100 matrices A.
Plot the average solution cardinality (and standard deviation as error bars).
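A hedged sketch of this experiment, reusing the illustrative dspca_sdp helper from the semidefinite programming slide; the number of trials is reduced here to keep runtime small.

```python
import numpy as np

# Average cardinality of the SDP solution's leading eigenvector versus k (cf. Figure 1).
v = np.array([1, 0, 1, 0, 1, 0, 1, 0, 1, 0], dtype=float)
rng = np.random.default_rng(0)
for k in range(1, 11):
    cards = []
    for _ in range(10):                        # the talk averages over 100 samples per k
        U = rng.uniform(0, 1, size=(10, 10))
        A = U.T @ U + 15 * np.outer(v, v)
        X = dspca_sdp(A, k)
        x1 = np.linalg.eigh(X)[1][:, -1]       # leading eigenvector of the SDP solution
        cards.append(np.count_nonzero(np.abs(x1) > 1e-3))
    print(k, np.mean(cards))                   # average cardinality as a function of k
```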

[Figure 1: Cardinality versus k; average solution cardinality (y-axis) plotted against k (x-axis).]

k (or k + 1) is a very good predictor of the cardinality...

Sparsity versus # iterations

Start with a sparse vector $v = (1, 0, 1, 0, 1, 0, 1, 0, 1, 0, \ldots, 0) \in \mathbb{R}^{20}$. We then define the matrix A as:

$A = U^T U + 100\, vv^T$

where $U \in S^{20}$ is a random matrix (uniform coefficients in [0, 1]). We solve

$\max \ \mathbf{Tr}(AU) - \rho\, \mathbf{1}^T |U| \mathbf{1} \quad \text{s.t.} \quad \mathbf{Tr}(U) = 1, \ U \succeq 0$

for ρ = 5.

Sparsity versus # iterations

[Figure: loadings (log scale) versus index i for PCA and DSPCA.]

Number of iterations: 10,000 to 100,000. Computing time: 12 to 110.

References

Ben-Tal, A. & Nemirovski, A. (2004), 'Non-Euclidean restricted memory level method for large-scale convex optimization', MINERVA working paper.
Cadima, J. & Jolliffe, I. T. (1995), 'Loadings and correlations in the interpretation of principal components', Journal of Applied Statistics 22, 203-214.
Chennubhotla, C. & Jepson, A. (2001), 'Sparse PCA: Extracting multi-scale structure from data', in International Conference on Computer Vision, IEEE, Vancouver, Canada, pp. 641-647.
Johnstone, I. & Lu, A. Y. (2003), 'Sparse principal components analysis', in FDA Workshop, Gainesville.
Jolliffe, I. T. & Uddin, M. (2003), 'A modified principal component technique based on the lasso', Journal of Computational and Graphical Statistics 12, 531-547.

Lemaréchal, C. & Sagastizábal, C. (1997), 'Practical aspects of the Moreau-Yosida regularization: theoretical preliminaries', SIAM Journal on Optimization 7(2), 367-385.
Nesterov, Y. (1983), 'A method of solving a convex programming problem with convergence rate O(1/k^2)', Soviet Math. Dokl. 27(2), 372-376.
Nesterov, Y. (2003), 'Smooth minimization of nonsmooth functions', CORE discussion paper 2003/12 (accepted by Mathematical Programming).
Rebonato, R. (1998), Interest-Rate Options Models, Financial Engineering, Wiley.
Sturm, J. F. (1999), 'Using SeDuMi 1.0x, a MATLAB toolbox for optimization over symmetric cones', Optimization Methods and Software 11, 625-653.
Tibshirani, R. (1996), 'Regression shrinkage and selection via the lasso', Journal of the Royal Statistical Society, Series B 58, 267-288.

Toh, K. C., Todd, M. J. & Tutuncu, R. H. (1996), 'SDPT3 - a MATLAB software package for semidefinite programming', Technical report, School of Operations Research and Industrial Engineering, Cornell University.
Wall, M. E., Rechtsteiner, A. & Rocha, L. M. (2002), 'Singular value decomposition and principal component analysis', Technical Report arXiv:physics/0208101, Los Alamos National Laboratory.
Zou, H., Hastie, T. & Tibshirani, R. (2004), 'Sparse principal component analysis', Technical report, Statistics Department, Stanford University.