Short Course Robust Optimization and Machine Learning. Lecture 4: Optimization in Unsupervised Learning


Short Course Robust Optimization and Machine Learning. Lecture 4: Optimization in Unsupervised Learning. Laurent El Ghaoui, EECS and IEOR Departments, UC Berkeley. Spring seminar TRANSP-OR, Zinal, Jan. 16-19, 2012.

Outline: Unsupervised learning models. Principal Component Analysis. Sparse PCA. Sparse graphical models.


What is unsupervised learning? In unsupervised learning, we are given a matrix of data points X = [x_1, ..., x_m], with x_i in R^n; we wish to learn some condensed information from it. Examples: find one or several directions of maximal variance; find a low-rank approximation or other structured approximation; find correlations or some other statistical information (e.g., a graphical model); find clusters of data points.

The empirical covariance matrix. Definition: given a p × m data matrix A = [a_1, ..., a_m] (each row representing, say, a log-return time series over m time periods), the empirical covariance matrix is defined as the p × p matrix

S = (1/m) ∑_{i=1}^m (a_i − â)(a_i − â)^T, where â := (1/m) ∑_{i=1}^m a_i.

We can express S as S = (1/m) A_c A_c^T, where A_c is the centered data matrix, with m columns (a_i − â), i = 1, ..., m.
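As a concrete illustration (a sketch, not part of the original slides), the covariance computation can be written in a few lines of MATLAB, assuming a p × m data matrix A is in the workspace:

    % Empirical covariance of a p-by-m data matrix A (columns = observations).
    m    = size(A, 2);
    abar = mean(A, 2);            % sample mean, p-by-1
    Ac   = A - abar;              % centered data matrix (implicit expansion, R2016b+)
    S    = (1/m) * (Ac * Ac');    % empirical covariance, p-by-p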

The empirical covariance matrix: link with directional variance. The (empirical) variance along direction x is

var(x) = (1/m) ∑_{i=1}^m [x^T (a_i − â)]^2 = x^T S x = (1/m) ||A_c^T x||_2^2,

where A_c is the centered data matrix. Hence, the covariance matrix gives information about the variance along any direction.

Eigenvalue decomposition for symmetric matrices. Theorem (EVD of symmetric matrices): we can decompose any symmetric p × p matrix Q as

Q = ∑_{i=1}^p λ_i u_i u_i^T = U Λ U^T,

where Λ = diag(λ_1, ..., λ_p), with λ_1 ≥ ... ≥ λ_p the eigenvalues, and U = [u_1, ..., u_p] is a p × p orthogonal matrix (U^T U = I_p) that contains the eigenvectors of Q. That is: Q u_i = λ_i u_i, i = 1, ..., p.

Singular Value Decomposition (SVD). Theorem (SVD of general matrices): we can decompose any non-zero p × m matrix A as

A = ∑_{i=1}^r σ_i u_i v_i^T = U Σ V^T, Σ = diag(σ_1, ..., σ_r, 0, ..., 0),

where σ_1 ≥ ... ≥ σ_r > 0 are the singular values, and U = [u_1, ..., u_p], V = [v_1, ..., v_m] are square, orthogonal matrices (U^T U = I_p, V^T V = I_m). The first r columns of U, V contain the left and right singular vectors of A, respectively, that is: A v_i = σ_i u_i, A^T u_i = σ_i v_i, i = 1, ..., r.

Links between EVD and SVD. The SVD of a p × m matrix A is related to the EVD of PSD matrices built from A. If A = U Σ V^T is the SVD of A, then the EVD of A A^T is U Λ U^T with Λ = Σ Σ^T, and the EVD of A^T A is V Λ' V^T with Λ' = Σ^T Σ. Hence the left (resp. right) singular vectors of A are the eigenvectors of the PSD matrix A A^T (resp. A^T A), and the nonzero eigenvalues are the squared singular values.
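A quick numerical sanity check of this link (a hypothetical example, not from the slides), using a random matrix:

    % Check: squared singular values of A equal the eigenvalues of A*A'.
    A  = randn(5, 8);                       % example matrix
    sv = svd(A);                            % singular values, in decreasing order
    M  = A * A';  M = (M + M') / 2;         % symmetrize against roundoff
    ev = sort(eig(M), 'descend');           % eigenvalues of A*A'
    disp(max(abs(sv.^2 - ev)))              % should be of the order of 1e-15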

Variational characterizations: largest and smallest eigenvalues and singular values. If Q is square and symmetric:

λ_max(Q) = max { x^T Q x : ||x||_2 = 1 }.

If A is a general rectangular matrix:

σ_max(A) = max { ||A x||_2 : ||x||_2 = 1 }.

Similar formulae hold for the minimum eigenvalues and singular values.
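To see the characterization of λ_max at work (an illustrative sketch with made-up data), random unit vectors never beat the top eigenvector:

    % lambda_max(Q) = max of x'*Q*x over the unit sphere.
    Q = randn(6);  Q = (Q + Q') / 2;               % random symmetric matrix
    lmax = max(eig(Q));
    x = randn(6, 1000);  x = x ./ vecnorm(x);      % random unit directions (R2017b+)
    fprintf('best sampled value %.4f <= lambda_max %.4f\n', max(sum(x .* (Q*x))), lmax);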

Variational characterizations: other eigenvalues and singular values. If Q is square and symmetric, the k-th largest eigenvalue satisfies

λ_k = max { x^T Q x : x ∈ S_k, ||x||_2 = 1 },

where S_k is the subspace spanned by {u_k, ..., u_p}. A similar result holds for singular values.

Low-rank approximation. For a given p × m matrix A and integer k ≤ min(m, p), the rank-k approximation problem is

A^(k) := arg min_X ||X − A||_F : Rank(X) ≤ k,

where ||·||_F is the Frobenius norm (the Euclidean norm of the vector formed with all the entries of the matrix). The solution is A^(k) = ∑_{i=1}^k σ_i u_i v_i^T, where A = U Σ V^T is an SVD of the matrix A.
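A minimal MATLAB sketch of the truncated-SVD solution (assuming a matrix A and a target rank k with k ≤ min(p, m) are in the workspace):

    % Best rank-k approximation of A in Frobenius norm (truncated SVD).
    [U, Sig, V] = svd(A, 'econ');
    Ak = U(:, 1:k) * Sig(1:k, 1:k) * V(:, 1:k)';
    fprintf('rank-%d approximation error: %g\n', k, norm(A - Ak, 'fro'));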

Low-rank approximation. Interpretation: rank-one case. Assume the data matrix A ∈ R^{p×m} represents time-series data (each row is a time series). Assume also that A is rank-one, that is, A = u v^T ∈ R^{p×m}, where u, v are vectors. Then the j-th row of A is the time series a_j(t) = u(j) v(t), 1 ≤ j ≤ p, 1 ≤ t ≤ m. Thus, each time series is a scaled copy of the time series represented by v, with scaling factors given in u. We can think of v as a factor that drives all the time series.

Low-rank approximation. Interpretation: low-rank case. When A has rank k, that is, A = U V^T with U ∈ R^{p×k}, V ∈ R^{m×k}, k << m, p, we can express the j-th row of A as

a_j(t) = ∑_{i=1}^k u_i(j) v_i(t), 1 ≤ j ≤ p, 1 ≤ t ≤ m.

Thus, each time series is a sum of scaled copies of the k time series represented by v_1, ..., v_k, with scaling factors given in u_1, ..., u_k. We can think of the v_i's as the few factors that drive all the time series.


Motivation: votes of US Senators, 2002-2004. The raw plot is impossible to read... Can we project the data on a lower-dimensional subspace? If so, how should we choose the projection?

Principal Component Analysis. Principal Component Analysis (PCA) originated in psychometrics in the 1930s. It is now widely used in exploratory data analysis, simulation, and visualization. Application fields include finance, marketing, economics; biology, medicine; engineering design, signal compression and image processing; search engines, data mining.

Solution principles. PCA finds principal components (PCs), i.e., orthogonal directions of maximal variance. The PCs are computed via the EVD of the covariance matrix. PCA can be interpreted as a factor model of the original data matrix.

PCA problem. Definition: let us normalize the direction so that no particular direction is favored. PCA problem:

max_x var(x) : ||x||_2 = 1.

A non-convex problem! The solution is easy to obtain via the eigenvalue decomposition (EVD) of S, or via the SVD of the centered data matrix A_c.

PCA problem: solution. PCA problem: max_x x^T S x : ||x||_2 = 1. Assume the EVD of S is given: S = ∑_{i=1}^p λ_i u_i u_i^T, with λ_1 ≥ ... ≥ λ_p, and U = [u_1, ..., u_p] orthogonal (U^T U = I). Then

arg max_{x : ||x||_2 = 1} x^T S x = u_1,

where u_1 is any eigenvector of S that corresponds to the largest eigenvalue λ_1 of S.
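Continuing the earlier covariance sketch (the names S and Ac are assumed to be in the workspace), the first principal direction can be read off the EVD:

    % First principal direction: top eigenvector of the empirical covariance S.
    [U, D] = eig(S);
    [lams, idx] = sort(diag(D), 'descend');
    u1 = U(:, idx(1));                 % direction of maximal variance
    scores = u1' * Ac;                 % projection of the centered data on u1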

Example: US Senate voting data. Projection of the US Senate voting data on a random direction (left panel) and on the direction of maximal variance (right panel). The latter reveals party structure (party affiliations added after the fact). Note also the much wider range of values it provides.

Finding orthogonal directions: a deflation method. Once we've found a direction with high variance, can we repeat the process and find other ones? Deflation method: project the data points on the subspace orthogonal to the direction we found, then find a direction of maximal variance for the projected data. The process stops after p steps (p is the dimension of the whole space), but can be stopped earlier (to find only k directions, with k << p).

Finding orthogonal directions: result. It turns out that the direction that solves

max_x var(x) : ||x||_2 = 1, x^T u_1 = 0

is u_2, an eigenvector corresponding to the second-largest eigenvalue. After k steps of the deflation process, the directions returned are u_1, ..., u_k.
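A small deflation sketch under the same assumptions (covariance matrix S in the workspace; k is an example value); for a symmetric S this recovers the top-k eigenvectors:

    % Deflation: find k orthogonal directions of maximal variance.
    k  = 3;                            % number of directions (example value)
    p  = size(S, 1);
    P  = eye(p);                       % projector onto the current complement
    Us = zeros(p, k);
    for j = 1:k
        M = P * S * P;  M = (M + M') / 2;     % variance of the projected data
        [V, D] = eig(M);
        [~, i] = max(diag(D));
        Us(:, j) = V(:, i);
        P = P - Us(:, j) * Us(:, j)';         % project out the direction found
    end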

PCA allows us to build a low-rank approximation to the data matrix:

A ≈ ∑_{i=1}^k σ_i u_i v_i^T.

Each v_i is a particular factor, and the u_i's contain the scalings.

PCA of market data. The plot shows the eigenvalues of the covariance matrix in decreasing order. The first ten components explain 80% of the variance. The largest-magnitude entries of the eigenvector for the first component correspond to the financial sector (FABC, FTU, MER, AIG, MS). Data: daily log-returns of 77 Fortune 500 companies, 1/2/2007-12/31/2008.


Sparse PCA: motivation. One of the issues with PCA is that it does not yield principal directions that are easily interpretable: the principal directions are really combinations of all the relevant features (say, assets), hence we cannot interpret them easily. The thresholding approach (select the features with large components, zero out the others) can lead to much degraded explained variance.

Problem definition. Modify the variance maximization problem:

max_x x^T S x − λ Card(x) : ||x||_2 = 1,

where the penalty parameter λ ≥ 0 is given, and Card(x) is the cardinality (number of non-zero elements) of x. The problem is hard, but can be approximated via convex relaxation.

Safe feature elimination. Express S as S = R^T R, with R = [r_1, ..., r_p] (each r_i corresponds to one feature). Theorem (safe feature elimination [2]): we have

max_{x : ||x||_2 = 1} x^T S x − λ Card(x) = max_{z : ||z||_2 = 1} ∑_{i=1}^p max(0, (r_i^T z)^2 − λ).

Corollary: if λ > ||r_i||_2^2 = S_ii, we can safely remove the i-th feature (row/column of S). The presence of the penalty parameter allows us to prune out dimensions in the problem. In practice, we want λ to be high, so as to allow better interpretability. Hence, the interpretability requirement makes the problem easier in some sense!
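In code, the pruning rule is one line (a sketch; S and a penalty level lambda are assumed to exist in the workspace):

    % Safe feature elimination: drop feature i whenever lambda > S(i,i).
    keep = find(diag(S) >= lambda);          % indices of surviving features
    Sred = S(keep, keep);                    % reduced covariance matrix
    fprintf('kept %d of %d features\n', numel(keep), size(S, 1));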

Relaxation for sparse PCA. Step 1: l1-norm bound. Sparse PCA problem:

φ(λ) := max_x x^T S x − λ Card(x) : ||x||_2 = 1.

First recall the Cauchy-Schwarz inequality: ||x||_1 ≤ √(Card(x)) ||x||_2, hence Card(x) ≥ ||x||_1^2 when ||x||_2 = 1, and we have the upper bound

φ(λ) ≤ φ̄(λ) := max_x x^T S x − λ ||x||_1^2 : ||x||_2 = 1.

Relaxation for sparse PCA. Step 2: lifting and rank relaxation. Next we rewrite the problem in terms of the (PSD, rank-one) matrix X := x x^T:

φ̄(λ) = max_X Tr(SX) − λ ||X||_1 : X ⪰ 0, Tr X = 1, Rank(X) = 1.

Dropping the rank constraint gives the upper bound

φ̄(λ) ≤ ψ(λ) := max_X Tr(SX) − λ ||X||_1 : X ⪰ 0, Tr X = 1.

The upper bound ψ(λ) is a semidefinite program (SDP). In practice, X is found to be (close to) rank-one at optimum.
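In the spirit of the CVX snippet shown later for robust PCA, the SDP upper bound ψ(λ) can be prototyped directly in CVX (a sketch, assuming a p × p covariance S and a scalar lambda are in the workspace):

    % SDP relaxation of sparse PCA (upper bound psi(lambda)) via CVX.
    p = size(S, 1);
    cvx_begin sdp
        variable X(p, p) symmetric
        maximize( trace(S * X) - lambda * norm(X(:), 1) )
        subject to
            X >= 0;                          % X positive semidefinite (sdp mode)
            trace(X) == 1;
    cvx_end
    % If X is (close to) rank one, its top eigenvector is the sparse direction.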

Algorithms. The problem remains challenging due to the huge number of variables; second-order methods quickly become impractical as a result. The safe feature elimination technique often allows a huge reduction in problem size. Dual block-coordinate methods are efficient in this case [7]. This is still an area of active research (like the SVD in the 70s-90s...).

Example: sparse PCA of New York Times headlines. Data: the NYTimes text collection contains 300,000 articles and has a dictionary of 102,660 unique words. The variance of the features (words) decreases very fast. [Figure: sorted variances of the 102,660 words in the NYTimes data.] With a target number of words less than 10, safe feature elimination allows us to reduce the number of features from n ≈ 100,000 to n = 500.

Sparse PCA of New York Times headlines. Words associated with the top 5 sparse principal components in NYTimes:

1st PC (6 words): million, percent, business, company, market, companies
2nd PC (5 words): point, play, team, season, game
3rd PC (5 words): official, government, united states, u s, attack
4th PC (4 words): president, campaign, bush, administration
5th PC (4 words): school, program, children, student

Note: the algorithm found those terms without any information on the subject headings of the corresponding articles (an unsupervised problem).

NYT dataset: comparison with thresholded PCA. Thresholded PCA involves simply thresholding the principal components. First PC from thresholded PCA for various cardinalities k:

k = 2: even, like
k = 3: even, like, states
k = 9: even, like, states, we, now, this, will, united, if
k = 14: even, like, states, we, now, this, will, united, if, would, new, world, so, some

The results contain a lot of non-informative words.

Robust PCA. PCA is based on the assumption that the data matrix can be (approximately) written as a low-rank matrix: A = L R^T, with L ∈ R^{p×k}, R ∈ R^{m×k}, and k << m, p. Robust PCA [1] assumes that A has a low-rank plus sparse structure:

A = N + L R^T,

where the noise matrix N is sparse (has many zero entries). How do we discover N, L, R from A?

Robust PCA model. In robust PCA, we solve the convex problem

min_N ||A − N||_* + λ ||N||_1,

where ||·||_* is the so-called nuclear norm (the sum of the singular values) of its matrix argument. At optimum, A − N usually has low rank. Motivation: the nuclear norm is akin to the l1-norm of the vector of singular values, and l1-norm minimization encourages sparsity of its argument.

CVX syntax. Here is a MATLAB snippet that solves a robust PCA problem via CVX, given that integers n, m, an n × m matrix A, and a non-negative scalar lambda exist in the workspace:

    cvx_begin
        variable X(n,m);
        minimize( norm_nuc(A-X) + lambda*norm(X(:),1) )
    cvx_end

Note the use of norm_nuc, which stands for the nuclear norm.


Sparse graphical models: motivation. We'd like to draw a graph that describes the links between the features (e.g., words). Edges in the graph should exist when some strong, natural metric of similarity exists between features. For better interpretability, a sparse graph is desirable. Various motivations: portfolio optimization (with a sparse risk term), clustering, etc. Here we focus on exploring conditional independence within the features.

Gaussian assumption. Let us assume that the data points are zero-mean and follow a multivariate Gaussian distribution: x ~ N(0, Σ), with Σ a p × p covariance matrix. Assume Σ is positive definite. Gaussian probability density function:

p(x) = (2π)^{−p/2} (det Σ)^{−1/2} exp(−(1/2) x^T Σ^{−1} x).

The matrix X := Σ^{−1} is called the precision matrix.

Conditional independence. The pair of random variables x_i, x_j is conditionally independent if, for x_k fixed (k ≠ i, j), the density can be factored:

p(x) = p_i(x_i) p_j(x_j),

where p_i, p_j may also depend on the other variables. Interpretation: if all the other variables are fixed, then x_i and x_j are independent. Example: gray hair and shoe size are independent, conditioned on age.

Conditional independence and the precision matrix. Theorem (C.I. for Gaussian RVs): the variables x_i, x_j are conditionally independent if and only if the (i, j) element of the precision matrix is zero: (Σ^{−1})_{ij} = 0. Proof sketch: the coefficient of x_i x_j in log p(x) is −(Σ^{−1})_{ij}, and the density factors as above exactly when this coefficient vanishes.

Sparse precision matrix estimation. Let us encourage sparsity of the precision matrix in the maximum-likelihood problem:

max_X log det X − Tr(SX) − λ ||X||_1,

with ||X||_1 := ∑_{i,j} |X_{ij}|, and λ > 0 a parameter. The above provides an invertible estimate, even if S is not positive definite. The problem is convex, and can be solved in a large-scale setting by optimizing over columns/rows alternately.
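As with the other CVX examples, this penalized maximum-likelihood problem can be prototyped directly for small p (a sketch, assuming a p × p covariance S and a scalar lambda > 0 in the workspace); dedicated solvers are needed at scale:

    % Sparse precision matrix estimation (graphical lasso) via CVX.
    p = size(S, 1);
    cvx_begin
        variable X(p, p) symmetric
        maximize( log_det(X) - trace(S * X) - lambda * norm(X(:), 1) )
    cvx_end
    % Zeros in X correspond to conditionally independent pairs of features.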

Dual. Sparse precision matrix estimation: max_X log det X − Tr(SX) − λ ||X||_1. Dual:

min_U − log det(S + U) : max_{i,j} |U_{ij}| ≤ λ.

Block-coordinate descent: minimize over one column/row of U cyclically. Each step is a QP.

Data: interest rates. [Figure: sparse graphical model of interest rates at maturities from 0.5Y to 10Y. Left: using the covariance matrix (λ = 0). Right: using λ = 0.1.] The original precision matrix is dense, but the sparse version reveals the maturity structure.

Data: US Senate voting, 2002-2004. Again, the sparse version reveals information, here political blocks within each party.


References.

[1] Emmanuel J. Candès, Xiaodong Li, Yi Ma, and John Wright. Robust principal component analysis? 2009.
[2] L. El Ghaoui. On the quality of a semidefinite programming bound for sparse principal component analysis. arXiv:math/060144, February 2006.
[3] Olivier Ledoit and Michael Wolf. A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis, 88:365-411, February 2004.
[4] O. Banerjee, L. El Ghaoui, and A. d'Aspremont. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. Journal of Machine Learning Research, 9:485-516, March 2008.
[5] S. Sra, S. J. Wright, and S. Nowozin. Optimization for Machine Learning. MIT Press, 2011.
[6] Y. Zhang, A. d'Aspremont, and L. El Ghaoui. Sparse PCA: convex relaxations, algorithms and applications. In M. Anjos and J. B. Lasserre, editors, Handbook on Semidefinite, Conic and Polynomial Optimization: Theory, Algorithms, Software and Applications. Springer, 2011. To appear.
[7] Y. Zhang and L. El Ghaoui. Large-scale sparse principal component analysis and application to text data. December 2011.