
STATS 306B: Unsupervised Learning                                                       Spring 2014

Lecture 13: May 12
Lecturer: Lester Mackey                                  Scribe: Jessy Hwang, Minzhe Wang

13.1 Canonical correlation analysis

13.1.1 Recap

CCA is a linear dimensionality reduction procedure for paired (two-view) data. Given two mean-centered datasets $X$ and $Y$, each with $n$ rows, the CCA objective is
\[
\max_{u,v} \; \frac{u^T X^T Y v}{\sqrt{u^T X^T X u}\,\sqrt{v^T Y^T Y v}}.
\]
This is a generalized eigenvalue problem, which is solved by a generalized eigenvector $(u^*, v^*)$ satisfying
\[
\begin{pmatrix} 0 & X^T Y \\ Y^T X & 0 \end{pmatrix}
\begin{pmatrix} u \\ v \end{pmatrix}
= \lambda
\begin{pmatrix} X^T X & 0 \\ 0 & Y^T Y \end{pmatrix}
\begin{pmatrix} u \\ v \end{pmatrix}.
\]
As in PCA, after obtaining the first pair of canonical directions in this way, subsequent pairs $(u_2, v_2), \ldots, (u_k, v_k)$ may be extracted by solving the CCA optimization problem subject to the constraint that the new canonical variables be uncorrelated with the prior ones, i.e.,
\[
\mathrm{Corr}(u_j^T x, u_l^T x) = \mathrm{Corr}(v_j^T y, v_l^T y) = 0 \quad \text{for all } l < j,
\]
where $x$ is a random vector taking on the values $x_1, \ldots, x_n$ with probability $1/n$ each, and $y$ plays the analogous role.

13.1.2 Degeneracy in CCA

There are several situations in which the CCA solution exhibits degeneracy:

- If $x_i = A y_i$ for all $i$, then any $u$ is an optimal CCA direction (with correlation $= 1$); this can be seen by choosing $v = A^T u$. So the CCA direction is meaningless in this case.

- If the coordinates of $x$ and $y$ are uncorrelated, i.e., $\frac{1}{n} X^T Y$ is identically $0$, then any $(u, v)$ is optimal, with correlation $= 0$.

- CCA is sometimes applied without centering $X$ and $Y$; this corresponds to using a cosine similarity objective instead of correlation. Then if $\mathrm{rank}(X) = n$, any $v$ is optimal, with correlation $= 1$, via $u = X^T (X X^T)^{-1} Y v$. We have the same problem if $\mathrm{rank}(Y) = n$.
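To make the generalized eigenproblem above concrete, here is a minimal NumPy/SciPy sketch (not part of the original notes) that recovers the first canonical pair by forming the two block matrices explicitly; the function name, the small `jitter` ridge added to keep the right-hand block positive definite, and the final unit-norm scaling are my own choices.

    import numpy as np
    from scipy.linalg import eigh

    def cca_first_pair(X, Y, jitter=1e-6):
        """Leading canonical directions (u, v) for mean-centered X (n x p) and Y (n x q).

        Solves the generalized eigenproblem from the notes:
        the cross-covariance block matrix on the left, the block-diagonal
        covariance matrix (plus a tiny ridge for numerical stability) on the right.
        """
        p, q = X.shape[1], Y.shape[1]
        Sxy = X.T @ Y
        Sxx = X.T @ X + jitter * np.eye(p)
        Syy = Y.T @ Y + jitter * np.eye(q)
        A = np.block([[np.zeros((p, p)), Sxy],
                      [Sxy.T, np.zeros((q, q))]])
        B = np.block([[Sxx, np.zeros((p, q))],
                      [np.zeros((q, p)), Syy]])
        lam, W = eigh(A, B)              # generalized eigenvalues in ascending order
        w = W[:, -1]                     # leading generalized eigenvector
        u, v = w[:p], w[p:]
        return u / np.linalg.norm(u), v / np.linalg.norm(v)

For large $p$ or $q$ one would avoid forming the blocks and work with a whitening-based SVD instead, but the block form mirrors the equation in the notes most directly.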

Regularization is often introduced into CCA to break such degeneracy ties in favor of higher-variance directions and to control overfitting. The standard regularized CCA objective is given by
\[
\max_{u,v} \; \frac{u^T X^T Y v}{\sqrt{u^T (X^T X + \lambda_1 I) u}\,\sqrt{v^T (Y^T Y + \lambda_2 I) v}}
\]
for some $\lambda_1, \lambda_2 > 0$. This is akin to penalizing or constraining the $\ell_2$ norms of $u$ and $v$, as in ridge regression.

13.1.3 Kernel CCA

As a final note, one may derive a kernelized version of CCA in much the same way that we derived kernel PCA. In this case, we work with two kernel functions, $k_X$ and $k_Y$.

13.2 Sparse unsupervised learning

13.2.1 Motivation

When performing dimensionality reduction or latent feature modeling, we often want to interpret the recovered component loadings $u_j \in \mathbb{R}^p$ in terms of the input coordinates. Here are two examples.

- Recall the PCA decomposition of handwritten 3s discussed in ESL. We were able to visualize our decomposition in terms of the extracted component loadings, the "eigen-3s" (figure omitted). These intuitive visualizations of the loadings allowed us to interpret the data features being captured by the component directions (in this case, long-tailedness and thickness of the 3, respectively).

- When studying gene expression in cancer patients (so that each coordinate of $x_i$ is a gene), we would like to declare that a few genes are responsible for most of the variation observed in the data. We would like the extracted loadings $u_1, \ldots, u_k \in \mathbb{R}^p$ to be sparse so that all variance is attributed to those few nonzero coordinates.

Unfortunately, the loadings obtained from most of the latent feature modeling methods discussed so far are typically dense: most if not all coordinates are nonzero. This is similar to the dense regression coefficients that result from least squares linear regression. This leads us to the question: how can we modify a latent feature modeling procedure like PCA to obtain sparse loadings? In the next subsection, we describe some early approaches to answering this question in the context of PCA.
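The ridge terms change only the right-hand block of the generalized eigenproblem. As a hedged illustration (again not from the notes; the function name and the default $\lambda$ values are placeholders), the earlier sketch becomes:

    import numpy as np
    from scipy.linalg import eigh

    def regularized_cca_first_pair(X, Y, lam1=1.0, lam2=1.0):
        """Leading pair for the ridge-regularized CCA objective:
        max u'X'Yv / ( sqrt(u'(X'X + lam1 I)u) * sqrt(v'(Y'Y + lam2 I)v) )."""
        p, q = X.shape[1], Y.shape[1]
        Sxy = X.T @ Y
        A = np.block([[np.zeros((p, p)), Sxy],
                      [Sxy.T, np.zeros((q, q))]])
        # Only this block changes relative to plain CCA: add the ridge terms.
        B = np.block([[X.T @ X + lam1 * np.eye(p), np.zeros((p, q))],
                      [np.zeros((q, p)), Y.T @ Y + lam2 * np.eye(q)]])
        lam, W = eigh(A, B)
        w = W[:, -1]
        return w[:p], w[p:]

With $\lambda_1, \lambda_2 > 0$ the denominator penalizes large $\|u\|_2$ and $\|v\|_2$, which is how the degeneracy ties of Section 13.1.2 get broken in favor of higher-variance directions.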

13.2.2 Earliest attempts

Rotating the loadings matrix
One way to improve the sparsity of a collection of loadings without sacrificing any of the variance explained is to rotate the $k$ loadings to be sparser and more interpretable, i.e., we transform $U$ into $U_{\mathrm{rot}} = U R$, where $U_{\mathrm{rot}}$ exhibits sparsity (see, e.g., Richman, 1986). Essentially, we are finding a more interpretable basis for the subspace spanned by the original PC loadings. The resulting $U_{\mathrm{rot}}$ captures the same overall data variance as the original $U$. Here is a simple example.

Example 1. Take $p = 3$, and let $(u_1, u_2)$ be our original pair of principal component loadings, with $u_1 \propto (1, 1, 0)$ and $u_2 \propto (1, -1, 0)$; each vector is only 2-sparse. Meanwhile, $(\tilde u_1, \tilde u_2) = ((0, 1, 0), (1, 0, 0))$, a rotation of $(u_1, u_2)$, is sparser than $(u_1, u_2)$ but spans exactly the same subspace of $\mathbb{R}^3$ (namely, the plane spanned by the first two coordinate axes).

This method has a couple of drawbacks. First, there is no guarantee that a sparse rotation exists. Second, the rotated loadings lose the successive variance maximization property; that is, $\tilde u_1$ may not be the direction of maximum variance if $k > 1$.

Thresholding
This approach involves thresholding the small-magnitude entries of the loading vectors (see, e.g., Cadima and Jolliffe, 1995), i.e., setting those small-magnitude entries to 0. Unfortunately, the results lose orthonormality, and they may not be optimal for a target sparsity level.

13.2.3 An optimization approach

Continuing to focus on the PCA setting, we will next examine the possibility of directly constraining or penalizing the cardinality of the loadings in the context of a PCA optimization problem. Define
\[
\mathrm{card}(u) = \sum_{i=1}^{p} \mathbb{I}(u_i \neq 0) = \|u\|_0.
\]
Note that $\|u\|_0$ is often called the $\ell_0$ norm of $u$, although it is not actually a norm.
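As a small illustration of the thresholding heuristic (and of why the result loses orthonormality), here is a minimal NumPy sketch; the function name, the use of hard thresholding to a fixed number of entries per loading, and the renormalization step are my own assumptions rather than a faithful implementation of Cadima and Jolliffe (1995).

    import numpy as np

    def thresholded_pca_loadings(X, k, keep):
        """Top-k PCA loadings of X (n x p), with only the 'keep' largest-magnitude
        entries retained in each loading and each column renormalized to unit length.
        Centering is applied internally."""
        Xc = X - X.mean(axis=0)
        # Right singular vectors of the centered data are the PCA loadings.
        _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
        U = Vt[:k].T                                   # p x k dense loadings
        U_sparse = np.zeros_like(U)
        for j in range(k):
            idx = np.argsort(np.abs(U[:, j]))[-keep:]  # indices of the largest entries
            U_sparse[idx, j] = U[idx, j]
            U_sparse[:, j] /= np.linalg.norm(U_sparse[:, j])
        return U_sparse

    # After thresholding, U_sparse.T @ U_sparse is generally no longer the identity
    # for k > 1, illustrating the loss of orthonormality mentioned above.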

Adding a cardinality constraint to the PCA objective leads to the so-called sparse PCA optimization problem, which is the subject of the next section.

13.3 Sparse PCA optimization problem

13.3.1 Equivalent (single component) objectives

When we first introduced PCA, we presented two different motivations, namely variance maximization and reconstruction error minimization, which led to two equivalent optimization problems. We will now add cardinality constraints to both of these optimization problems.

Variance maximization
\[
\max_{u_1} \; \frac{1}{n} u_1^T X^T X u_1 \quad \text{s.t.} \quad \|u_1\|_2 = 1, \; \|u_1\|_0 \le c_1, \tag{V}
\]
where $c_1$ is the cardinality bound. Note that we could equivalently penalize $\|u_1\|_0$ in the objective instead of constraining.

Reconstruction error minimization
In the original PCA setting, we had the objectives
\[
\min_{u_1 : \|u_1\|_2 = 1} \; \sum_{i=1}^{n} \|x_i - u_1 u_1^T x_i\|_2^2 \tag{R1}
\]
\[
\min_{u_1, v_1 : \|v_1\|_2 = 1} \; \sum_{i=1}^{n} \|x_i - v_1 u_1^T x_i\|_2^2 + \lambda \|u_1\|_2^2 \tag{R2}
\]
For $\lambda > 0$, (R1) and (R2) are equivalent (up to the scale of $u_1$), since $v_1 = u_1 / \|u_1\|_2$ at the solution.

We showed earlier that, in the absence of a cardinality constraint, variance maximization and reconstruction error minimization are equivalent optimization problems. Perhaps surprisingly, they are still equivalent here: adding the cardinality constraint $\|u_1\|_0 \le c_1$ to (R1) or (R2) is equivalent to (V) (up to the scale of $u_1$)!

Exercise. Prove the above statement. (Hint: $v_1 \propto X^T X u_1$ for (R2).)

13.3.2 Solutions for sparse PCA

A potential difficulty in the above optimization problem is selecting the cardinality $c_1$. A more daunting obstacle is that any of these cardinality-constrained optimization problems is NP-hard. We have several options for dealing with this obstacle.

Option 1. We could use standard combinatorial optimization techniques like branch-and-bound to solve the problem exactly (Moghaddam et al., 2006). This is optimal but in practice may take a very, very long time.
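To make the combinatorial character of (V) concrete, the following sketch solves it exactly for small $p$ by exhaustive enumeration: for each support $S$ with $|S| = c_1$, the best unit vector supported on $S$ is the leading eigenvector of the corresponding submatrix of $\frac{1}{n} X^T X$. This brute-force search is only a stand-in for the branch-and-bound method of Moghaddam et al. (2006), not their algorithm, and the names are illustrative.

    import numpy as np
    from itertools import combinations

    def exact_sparse_pc(X, c1):
        """Exhaustively solve (V): max (1/n) u'X'Xu s.t. ||u||_2 = 1, ||u||_0 <= c1.

        For each candidate support of size c1, the optimal value is the leading
        eigenvalue of that submatrix of X'X / n. Exponential in p; demo use only.
        """
        n, p = X.shape
        S = X.T @ X / n
        best_val, best_u = -np.inf, None
        for support in combinations(range(p), c1):
            sub = S[np.ix_(support, support)]
            vals, vecs = np.linalg.eigh(sub)       # eigenvalues in ascending order
            if vals[-1] > best_val:
                best_val = vals[-1]
                best_u = np.zeros(p)
                best_u[list(support)] = vecs[:, -1]
        return best_u, best_val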

Option 2. We could take a greedy approach to variable selection, like the GSPCA (greedy sparse PCA) approach of Moghaddam et al. (2006). They propose a greedy bidirectional search consisting of both a forward pass and a backward pass (a sketch of the forward pass appears at the end of this lecture):

- In the forward pass, we start with no selected variables and successively add the variable that improves the objective the most.

- In the backward pass, we start with all variables and successively remove the variable that reduces the objective the least.

Then, for a given target cardinality $c$, we choose the better of the forward-pass and backward-pass solutions. This is not guaranteed to yield an optimal solution in general, but it works well in practice.

Option 3. Our third option is the most commonly explored. We will attempt to solve a (hopefully more tractable) surrogate optimization problem. The idea is to replace the hard combinatorial problem with a simpler non-combinatorial problem. Many, many authors have followed this path; we will explore some of the more popular solutions, like SCoTLASS of Jolliffe et al. (2003), SPCA of Zou et al. (2006), and direct sparse PCA (DSPCA) of d'Aspremont et al. (2004).

13.3.3 SCoTLASS

The SCoTLASS procedure replaces $\|u_1\|_0$ with the convex surrogate $\|u_1\|_1 = \sum_{j=1}^{p} |u_{1j}|$ in the variance maximization problem (V). The $\ell_1$ norm is selected because it is known to induce sparse solutions (e.g., as it does in Lasso regression). The resulting optimization problem is
\[
\max_{u_1} \; \frac{1}{n} u_1^T X^T X u_1 \quad \text{s.t.} \quad \|u_1\|_2 = 1, \; \|u_1\|_1 \le t_1,
\]
where $t_1$ is a tuning parameter that controls the resulting sparsity level. The good news is that we now have a continuous optimization problem. The bad news is that this problem is still nonconvex, and the proposed algorithm (projected gradient) has been shown to be slow, computationally expensive, and prone to local maxima.

We will continue our discussion of surrogate optimization problems in the next lecture.
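Returning to Option 2, here is a minimal sketch of the forward (growing) pass only; the backward pass and the spectral bounds that make the search of Moghaddam et al. (2006) efficient are omitted, and the function name and the full eigendecomposition recomputed at each step are my own simplifications.

    import numpy as np

    def greedy_forward_sparse_pc(X, c):
        """Forward pass of a greedy search for (V): grow the support one coordinate
        at a time, always adding the coordinate that most increases the leading
        eigenvalue of the corresponding submatrix of X'X / n."""
        n, p = X.shape
        assert c <= p
        S = X.T @ X / n
        support = []
        for _ in range(c):
            best_j, best_val = None, -np.inf
            for j in range(p):
                if j in support:
                    continue
                trial = support + [j]
                val = np.linalg.eigvalsh(S[np.ix_(trial, trial)])[-1]  # leading eigenvalue
                if val > best_val:
                    best_j, best_val = j, val
            support.append(best_j)
        # Leading eigenvector on the selected support gives the sparse loading.
        _, vecs = np.linalg.eigh(S[np.ix_(support, support)])
        u = np.zeros(p)
        u[support] = vecs[:, -1]
        return u, best_val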

Bibliography

Cadima, J. and I. T. Jolliffe (1995). Loadings and correlations in the interpretation of principal components. Journal of Applied Statistics 22(2), 203–214.

d'Aspremont, A., L. El Ghaoui, M. I. Jordan, and G. R. Lanckriet (2004). A direct formulation for sparse PCA using semidefinite programming. In NIPS, Volume 16, pp. 41–48.

Jolliffe, I. T., N. T. Trendafilov, and M. Uddin (2003). A modified principal component technique based on the lasso. Journal of Computational and Graphical Statistics 12(3), 531–547.

Moghaddam, B., Y. Weiss, and S. Avidan (2006). Spectral bounds for sparse PCA: Exact and greedy algorithms. Advances in Neural Information Processing Systems 18, 915.

Richman, M. B. (1986). Rotation of principal components. Journal of Climatology 6(3), 293–335.

Zou, H., T. Hastie, and R. Tibshirani (2006). Sparse principal component analysis. Journal of Computational and Graphical Statistics 15(2), 265–286.