MLCC 2015 Dimensionality Reduction and PCA

Lorenzo Rosasco (UNIGE-MIT-IIT), June 25, 2015

Outline
- PCA & Reconstruction
- PCA and Maximum Variance
- PCA and Associated Eigenproblem
- Beyond the First Principal Component
- PCA and Singular Value Decomposition
- Kernel PCA

Dimensionality Reduction
In many practical applications it is of interest to reduce the dimensionality of the data:
- data visualization
- data exploration: investigating the effective dimensionality of the data

Dimensionality Reduction (cont.)
The problem of dimensionality reduction can be seen as that of defining a map
$M : X = \mathbb{R}^D \to \mathbb{R}^k, \quad k \ll D,$
according to some suitable criterion. In the following, data reconstruction will be our guiding principle.

Principal Component Analysis
PCA is arguably the most popular dimensionality reduction procedure. It is a data-driven procedure that, given an unsupervised sample $S = (x_1, \dots, x_n)$, derives a dimensionality reduction defined by a linear map $M$. PCA can be derived from several perspectives; here we give a geometric derivation.

Dimensionality Reduction by Reconstruction
Recall that, if $w \in \mathbb{R}^D$ with $\|w\| = 1$, then $(w^T x)w$ is the orthogonal projection of $x$ on $w$.
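
As a quick illustration (a minimal NumPy sketch, not part of the original slides):

```python
import numpy as np

x = np.array([3.0, 4.0])
w = np.array([1.0, 0.0])      # unit norm, ||w|| = 1
proj = (w @ x) * w            # orthogonal projection (w^T x) w of x on w
print(proj)                   # [3. 0.]
```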

Dimensionality Reduction by Reconstruction (cont.)
First, consider $k = 1$. The associated reconstruction error is $\|x - (w^T x)w\|^2$ (that is, how much we lose by projecting $x$ along the direction $w$).
Problem: find the direction $p$ allowing the best reconstruction of the training set.

Dimensionality Reduction by Reconstruction (cont.)
Let $S^{D-1} = \{w \in \mathbb{R}^D : \|w\| = 1\}$ be the unit sphere in $D$ dimensions. Consider the empirical reconstruction minimization problem
$\min_{w \in S^{D-1}} \frac{1}{n}\sum_{i=1}^n \|x_i - (w^T x_i)w\|^2.$
The solution $p$ of the above problem is called the first principal component of the data.

An Equivalent Formulation
A direct computation shows that
$\|x_i - (w^T x_i)w\|^2 = \|x_i\|^2 - (w^T x_i)^2.$
Then the problem
$\min_{w \in S^{D-1}} \frac{1}{n}\sum_{i=1}^n \|x_i - (w^T x_i)w\|^2$
is equivalent to
$\max_{w \in S^{D-1}} \frac{1}{n}\sum_{i=1}^n (w^T x_i)^2.$
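
Indeed, expanding the square and using $\|w\| = 1$,
$\|x_i - (w^T x_i)w\|^2 = \|x_i\|^2 - 2(w^T x_i)^2 + (w^T x_i)^2\|w\|^2 = \|x_i\|^2 - (w^T x_i)^2,$
and since $\|x_i\|^2$ does not depend on $w$, minimizing the average reconstruction error over $S^{D-1}$ amounts to maximizing $\frac{1}{n}\sum_{i=1}^n (w^T x_i)^2$.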

Reconstruction and Variance
Assume the data to be centered, $\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i = 0$; then the term $\frac{1}{n}\sum_{i=1}^n (w^T x_i)^2$ can be interpreted as the variance of the data in the direction $w$. The first principal component can thus be seen as the direction along which the data have maximum variance:
$\max_{w \in S^{D-1}} \frac{1}{n}\sum_{i=1}^n (w^T x_i)^2.$

Centering
If the data are not centered, we should consider
$\max_{w \in S^{D-1}} \frac{1}{n}\sum_{i=1}^n (w^T (x_i - \bar{x}))^2,$
which is equivalent to
$\max_{w \in S^{D-1}} \frac{1}{n}\sum_{i=1}^n (w^T x_i^c)^2$
with $x_i^c = x_i - \bar{x}$.
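
In code, centering amounts to subtracting the empirical mean from every point; a minimal NumPy sketch (the data matrix X, one point per row, is an assumption of this illustration):

```python
import numpy as np

X = np.random.randn(100, 5)   # toy data: n = 100 points in D = 5 dimensions
x_bar = X.mean(axis=0)        # empirical mean
Xc = X - x_bar                # centered data, row i holds x_i - x_bar
```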

Centering and Reconstruction
If we consider the effect of centering on reconstruction, it is easy to see that we get
$\min_{w \in S^{D-1},\, b \in \mathbb{R}^D} \frac{1}{n}\sum_{i=1}^n \|x_i - ((w^T (x_i - b))w + b)\|^2,$
where $(w^T (x_i - b))w + b$ is an affine (rather than an orthogonal) projection.

PCA as an Eigenproblem
A further manipulation shows that PCA corresponds to an eigenvalue problem. Using the symmetry of the inner product,
$\frac{1}{n}\sum_{i=1}^n (w^T x_i)^2 = \frac{1}{n}\sum_{i=1}^n w^T x_i\, w^T x_i = \frac{1}{n}\sum_{i=1}^n w^T x_i x_i^T w = w^T \Big(\frac{1}{n}\sum_{i=1}^n x_i x_i^T\Big) w.$
Then we can consider the problem
$\max_{w \in S^{D-1}} w^T C_n w, \qquad C_n = \frac{1}{n}\sum_{i=1}^n x_i x_i^T.$

PCA as an Eigenproblem (cont.)
We make two observations:
- The (covariance) matrix $C_n = \frac{1}{n} X_n^T X_n$ is symmetric and positive semi-definite.
- The objective function of PCA can be written as the so-called Rayleigh quotient
$\frac{w^T C_n w}{w^T w}.$
Note that if $C_n u = \lambda u$, then $\frac{u^T C_n u}{u^T u} = \lambda$. Indeed, it is possible to show that the Rayleigh quotient achieves its maximum at an eigenvector corresponding to the maximum eigenvalue of $C_n$.

PCA as an Eigenproblem (cont.)
Computing the first principal component of the data thus reduces to computing the largest eigenvalue of the covariance matrix and the corresponding eigenvector:
$C_n u = \lambda u, \qquad C_n = \frac{1}{n} X_n^T X_n.$
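
A minimal NumPy sketch of this computation (illustrative, not code from the course; it assumes the rows of X are the centered data points x_i):

```python
import numpy as np

def first_principal_component(X):
    """First PC as the top eigenvector of C_n = (1/n) X^T X (X assumed centered)."""
    n = X.shape[0]
    C = (X.T @ X) / n                      # covariance matrix C_n
    eigvals, eigvecs = np.linalg.eigh(C)   # eigh returns eigenvalues in ascending order
    return eigvecs[:, -1], eigvals[-1]     # eigenvector of the largest eigenvalue

X = np.random.randn(200, 10)
X = X - X.mean(axis=0)
p, lam = first_principal_component(X)
print(np.linalg.norm(p))                   # 1.0: p lies on the sphere S^{D-1}
```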

Beyond the First Principal Component
We now discuss how to consider more than one principal component ($k > 1$),
$M : X = \mathbb{R}^D \to \mathbb{R}^k, \quad k \ll D.$
The idea is simply to iterate the previous reasoning.

Residual Reconstruction
The idea is to consider the one-dimensional projection that can best reconstruct the residuals
$r_i = x_i - (p^T x_i)p.$
An associated minimization problem is given by
$\min_{w \in S^{D-1},\, w \perp p} \frac{1}{n}\sum_{i=1}^n \|r_i - (w^T r_i)w\|^2$
(note the constraint $w \perp p$).

Residual Reconstruction (cont.)
Note that, for all $i = 1, \dots, n$, since $w \perp p$ we have $w^T r_i = w^T x_i$, so that
$\|r_i - (w^T r_i)w\|^2 = \|r_i\|^2 - (w^T r_i)^2 = \|r_i\|^2 - (w^T x_i)^2.$
Then we can consider the following equivalent problem:
$\max_{w \in S^{D-1},\, w \perp p} \frac{1}{n}\sum_{i=1}^n (w^T x_i)^2 = \max_{w \in S^{D-1},\, w \perp p} w^T C_n w.$

PCA as an Eigenproblem
$\max_{w \in S^{D-1},\, w \perp p} \frac{1}{n}\sum_{i=1}^n (w^T x_i)^2 = \max_{w \in S^{D-1},\, w \perp p} w^T C_n w.$
Again, we have to maximize the Rayleigh quotient of the covariance matrix, now with the extra constraint $w \perp p$. Similarly to before, it can be proved that the solution of the above problem is given by the eigenvector of $C_n$ corresponding to the second largest eigenvalue.

PCA as an Eigenproblem (cont.)
$C_n u = \lambda u, \qquad C_n = \frac{1}{n}\sum_{i=1}^n x_i x_i^T.$
The reasoning generalizes to more than two components: the computation of $k$ principal components reduces to finding the $k$ largest eigenvalues of $C_n$ and the corresponding eigenvectors.
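
Extending the earlier sketch to $k$ components (again a hypothetical illustration assuming centered rows):

```python
import numpy as np

def pca(X, k):
    """Top-k principal components of C_n and the induced linear map M(x) = P^T x."""
    n = X.shape[0]
    C = (X.T @ X) / n
    eigvals, eigvecs = np.linalg.eigh(C)
    idx = np.argsort(eigvals)[::-1][:k]    # indices of the k largest eigenvalues
    return eigvecs[:, idx], eigvals[idx]   # P is D x k, columns are the PCs

X = np.random.randn(500, 20)
X = X - X.mean(axis=0)
P, lams = pca(X, k=3)
Z = X @ P                                  # reduced data: row i is M(x_i) in R^k
```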

Remarks
- The computational complexity is roughly $O(kD^2)$ (the complexity of forming $C_n$ is $O(nD^2)$). If we have $n$ points in $D$ dimensions and $n \ll D$, can we compute PCA in less than $O(nD^2)$?
- The dimensionality reduction induced by PCA is a linear projection. Can we generalize PCA to nonlinear dimensionality reduction?

Singular Value Decomposition
Consider the data matrix $X_n$; its singular value decomposition is given by
$X_n = U\Sigma V^T,$
where $U$ is an $n \times k$ orthogonal matrix, $V$ is a $D \times k$ orthogonal matrix, and $\Sigma$ is a diagonal matrix with $\Sigma_{i,i} = \sqrt{\lambda_i}$, $i = 1, \dots, k$, $k \le \min\{n, D\}$. The columns of $U$ and the columns of $V$ are the left and right singular vectors, and the diagonal entries of $\Sigma$ are the singular values.
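
A small NumPy check of how the SVD of $X_n$ relates to the eigenvalues of $C_n$ (illustrative; note that np.linalg.svd normalizes $U$ and $V$ to have orthonormal columns, so its singular values correspond to $\sqrt{n\lambda_j}$ under the normalization used in these slides):

```python
import numpy as np

X = np.random.randn(50, 8)
X = X - X.mean(axis=0)
n = X.shape[0]

U, s, Vt = np.linalg.svd(X, full_matrices=False)   # X = U diag(s) V^T
lams = np.linalg.eigvalsh((X.T @ X) / n)[::-1]     # eigenvalues of C_n, descending

print(np.allclose(s**2 / n, lams))                 # True: lambda_j = s_j^2 / n
```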

Singular Value Decomposition (cont.)
The SVD can be equivalently described by the equations
$C_n p_j = \lambda_j p_j, \qquad \frac{1}{n} K_n u_j = \lambda_j u_j, \qquad X_n p_j = \sqrt{\lambda_j}\, u_j, \qquad \frac{1}{n} X_n^T u_j = \sqrt{\lambda_j}\, p_j,$
for $j = 1, \dots, d$, where $C_n = \frac{1}{n} X_n^T X_n$ and $K_n = X_n X_n^T$.

PCA and Singular Value Decomposition
If $n \ll D$ we can consider the following procedure:
- form the matrix $K_n$, which is $O(Dn^2)$;
- find the first $k$ eigenvectors of $K_n$, which is $O(kn^2)$;
- compute the principal components using
$p_j = \frac{1}{\sqrt{\lambda_j}\, n} X_n^T u_j = \frac{1}{\sqrt{\lambda_j}\, n} \sum_{i=1}^n x_i u_j^i, \qquad j = 1, \dots, k,$
where $u_j = (u_j^1, \dots, u_j^n)$; this is $O(knD)$ for $k$ principal components.
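
A sketch of this procedure in NumPy (illustrative; np.linalg.eigh returns unit-norm eigenvectors, so the principal components come out as $p_j = X_n^T u_j / \sqrt{n\lambda_j}$, which has unit norm):

```python
import numpy as np

def pca_via_gram(X, k):
    """Principal components from the n x n matrix K_n = X X^T (useful when n << D)."""
    n = X.shape[0]
    K = X @ X.T                            # Gram matrix K_n, cost O(D n^2)
    eigvals, U = np.linalg.eigh(K / n)     # (1/n) K_n u_j = lambda_j u_j
    idx = np.argsort(eigvals)[::-1][:k]
    lams, U = eigvals[idx], U[:, idx]
    P = (X.T @ U) / np.sqrt(n * lams)      # p_j = X^T u_j / sqrt(n lambda_j), unit norm
    return P, lams

# with n = 30 points in D = 1000 dimensions only a 30 x 30 eigenproblem is solved
X = np.random.randn(30, 1000)
X = X - X.mean(axis=0)
P, lams = pca_via_gram(X, k=5)
print(P.shape)                             # (1000, 5)
```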

Beyond Linear Dimensionality Reduction?
By considering PCA we are implicitly assuming the data to lie on a linear subspace, and it is easy to think of situations where this assumption may be violated. Can we use kernels to obtain a nonlinear generalization of PCA?

From SVD to KPCA
Using the SVD, the projection of a point $x$ on a principal component $p_j$, for $j = 1, \dots, d$, is
$(M(x))_j = x^T p_j = \frac{1}{\sqrt{\lambda_j}\, n}\, x^T X_n^T u_j = \frac{1}{\sqrt{\lambda_j}\, n} \sum_{i=1}^n x^T x_i\, u_j^i.$
Recall that
$C_n p_j = \lambda_j p_j, \qquad \frac{1}{n} K_n u_j = \lambda_j u_j, \qquad X_n p_j = \sqrt{\lambda_j}\, u_j, \qquad \frac{1}{n} X_n^T u_j = \sqrt{\lambda_j}\, p_j.$

PCA and Feature Maps
$(M(x))_j = \frac{1}{\sqrt{\lambda_j}\, n} \sum_{i=1}^n x^T x_i\, u_j^i.$
What if we consider a nonlinear feature map $\Phi : X \to F$ before performing PCA? Then
$(M(x))_j = \Phi(x)^T p_j = \frac{1}{\sqrt{\lambda_j}\, n} \sum_{i=1}^n \Phi(x)^T \Phi(x_i)\, u_j^i,$
where $\frac{1}{n} K_n u_j = \lambda_j u_j$ and $(K_n)_{i,j} = \Phi(x_i)^T \Phi(x_j)$.

Kernel PCA
$(M(x))_j = \Phi(x)^T p_j = \frac{1}{\sqrt{\lambda_j}\, n} \sum_{i=1}^n \Phi(x)^T \Phi(x_i)\, u_j^i.$
If the feature map is defined by a positive definite kernel $K : X \times X \to \mathbb{R}$, then
$(M(x))_j = \frac{1}{\sqrt{\lambda_j}\, n} \sum_{i=1}^n K(x, x_i)\, u_j^i,$
where $\frac{1}{n} K_n u_j = \lambda_j u_j$ and $(K_n)_{i,j} = K(x_i, x_j)$.
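
A minimal kernel PCA sketch following the formula above. The Gaussian kernel and the helper names are assumptions made for illustration (the slides do not fix a kernel), and the kernel matrix is used uncentered, exactly as in the formulas here; in practice one often also centers the data in feature space.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """K(a, b) = exp(-||a - b||^2 / (2 sigma^2)) for all pairs of rows of A and B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2 * sigma ** 2))

def kernel_pca_fit(X, k, sigma=1.0):
    """Eigenpairs of (1/n) K_n with (K_n)_{ij} = K(x_i, x_j), largest k first."""
    n = X.shape[0]
    K = gaussian_kernel(X, X, sigma)
    eigvals, U = np.linalg.eigh(K / n)
    idx = np.argsort(eigvals)[::-1][:k]
    return eigvals[idx], U[:, idx]

def kernel_pca_project(x, X, lams, U, sigma=1.0):
    """(M(x))_j = sum_i K(x, x_i) u_j^i / sqrt(n lambda_j) (unit-norm u_j from eigh)."""
    n = X.shape[0]
    k_x = gaussian_kernel(x[None, :], X, sigma)[0]   # vector of K(x, x_i), i = 1..n
    return (k_x @ U) / np.sqrt(n * lams)

X = np.random.randn(40, 3)
lams, U = kernel_pca_fit(X, k=2)
z = kernel_pca_project(X[0], X, lams, U)             # 2-dimensional representation of x_1
```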

Wrapping Up
In this class we introduced PCA as a basic tool for dimensionality reduction. We discussed computational aspects and the extension to nonlinear dimensionality reduction (kernel PCA).

Next Class
In the next class, beyond dimensionality reduction, we ask how we can devise interpretable data models, and discuss a class of methods based on the concept of sparsity.