Principal Component Analysis


CSci 5525: Machine Learning, Dec 3, 2008

The Main Idea
- Given a dataset X = {x_1, ..., x_N}
- Find a low-dimensional linear projection
- Two possible formulations:
  - The variance in the low-dimensional space is maximized
  - The average projection cost is minimized
- Both formulations are equivalent

Two viewpoints [figure: data points x_n and their projections onto the direction u_1, shown in the x_1-x_2 plane]

Maximum Variance Formulation
- Consider X = {x_1, ..., x_N}
- With x_n ∈ R^d, the goal is a projection into R^m with m < d
- Consider m = 1; we need a projection vector u_1 ∈ R^d
- Each data point x_n is projected to u_1^T x_n
- Mean of the projected data: u_1^T x̄, where x̄ = (1/N) Σ_{n=1}^N x_n
- Variance of the projected data: (1/N) Σ_{n=1}^N (u_1^T x_n - u_1^T x̄)^2 = u_1^T S u_1, where S = (1/N) Σ_{n=1}^N (x_n - x̄)(x_n - x̄)^T
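
As a quick numerical check of this identity (a minimal NumPy sketch, not part of the original lecture; the data and variable names are illustrative), the variance of the 1-D projections can be compared against u_1^T S u_1 for an arbitrary unit vector:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))               # N = 500 synthetic points in R^d, d = 3
x_bar = X.mean(axis=0)                      # sample mean x_bar
S = (X - x_bar).T @ (X - x_bar) / len(X)    # S = (1/N) sum_n (x_n - x_bar)(x_n - x_bar)^T

u1 = rng.normal(size=3)
u1 /= np.linalg.norm(u1)                    # an arbitrary unit-length direction

proj = X @ u1                               # projections u_1^T x_n
print(np.allclose(proj.var(), u1 @ S @ u1)) # projected variance equals u_1^T S u_1
```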

Maximum Variance Formulation (Contd.)
- Maximize u_1^T S u_1 with respect to u_1
- A constraint is needed to prevent ||u_1|| from growing without bound
- Normalization constraint: ||u_1||^2 = 1 (equivalently u_1^T u_1 = 1)
- The Lagrangian for the problem: u_1^T S u_1 + λ_1 (1 - u_1^T u_1)
- First-order necessary condition: S u_1 = λ_1 u_1
- u_1 must be the eigenvector of S with the largest eigenvalue, since the objective value is u_1^T S u_1 = λ_1
- The eigenvector u_1 is called the first principal component

Maximum Variance Formulation (Contd.)
- Subsequent principal components must be orthogonal to u_1
- Maximize u_2^T S u_2 subject to ||u_2||^2 = 1 and u_2 ⊥ u_1
- The solution turns out to be the eigenvector with the second-largest eigenvalue, and so on
- The top m eigenvectors of S give the best m-dimensional projection
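
The whole procedure can be summarized in a short sketch (illustrative code, not from the lecture): center the data, form S, take the eigenvectors of S with the m largest eigenvalues, and project.

```python
import numpy as np

def pca_project(X, m):
    """Project the rows of X onto the top-m principal components of the sample covariance S."""
    x_bar = X.mean(axis=0)
    Xc = X - x_bar
    S = Xc.T @ Xc / len(X)                  # S = (1/N) sum_n (x_n - x_bar)(x_n - x_bar)^T
    eigvals, eigvecs = np.linalg.eigh(S)    # eigh: ascending eigenvalues, orthonormal eigenvectors
    order = np.argsort(eigvals)[::-1]       # reorder to descending
    U = eigvecs[:, order[:m]]               # columns u_1, ..., u_m
    return Xc @ U, U, eigvals[order]

# usage: project synthetic 5-D data onto 2 dimensions
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
Z, U, eigvals = pca_project(X, m=2)
print(Z.shape, np.allclose(U.T @ U, np.eye(2)))   # (200, 2) True
```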

Minimum Error Formulation
- Consider a complete orthonormal basis {u_i} of R^d
- Each data point can be written as x_n = Σ_{i=1}^d α_{ni} u_i
- By orthonormality, α_{ni} = x_n^T u_i, so that x_n = Σ_{i=1}^d (x_n^T u_i) u_i
- Our goal is to obtain a lower-dimensional subspace of dimension m < d
- A generic representation of the low-dimensional approximation: x̃_n = Σ_{i=1}^m z_{ni} u_i + Σ_{i=m+1}^d b_i u_i
- The coefficients z_{ni} depend on the data point x_n, while the b_i are shared by all points
- We are free to choose the z_{ni}, b_i, and u_i so that x̃_n is close to x_n

Minimum Error Formulation (Contd.)
- The objective is to minimize J = (1/N) Σ_{n=1}^N ||x_n - x̃_n||^2
- Setting the derivative with respect to z_{nj} to zero gives z_{nj} = x_n^T u_j, j = 1, ..., m
- Setting the derivative with respect to b_j to zero gives b_j = x̄^T u_j, j = m+1, ..., d
- Substituting back, x_n - x̃_n = Σ_{i=m+1}^d {(x_n - x̄)^T u_i} u_i
- The error vector lies in the space orthogonal to the principal subspace
- The distortion measure to be minimized becomes J = (1/N) Σ_{n=1}^N Σ_{i=m+1}^d (x_n^T u_i - x̄^T u_i)^2 = Σ_{i=m+1}^d u_i^T S u_i
- Orthonormality constraints on the u_i are needed to prevent the trivial solution u_i = 0

Minimum Error Formulation (Contd.)
- Consider the special case d = 2, m = 1
- The Lagrangian of the objective: L = u_2^T S u_2 + λ_2 (1 - u_2^T u_2)
- The first-order condition is S u_2 = λ_2 u_2
- In general, the condition is S u_i = λ_i u_i
- J is minimized by choosing the eigenvectors corresponding to the smallest d - m eigenvalues; the minimum value is J = Σ_{i=m+1}^d λ_i
- So the principal subspace, spanned by u_i for i = 1, ..., m, corresponds to the eigenvectors with the largest eigenvalues
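
A small numerical check of this result (an illustrative sketch on synthetic data, not from the lecture): reconstructing each point from the top-m eigenvectors and measuring J should give exactly the sum of the d - m smallest eigenvalues of S.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4)) @ rng.normal(size=(4, 4))   # N = 300 synthetic points, d = 4
x_bar = X.mean(axis=0)
Xc = X - x_bar
S = Xc.T @ Xc / len(X)

eigvals, eigvecs = np.linalg.eigh(S)        # ascending eigenvalues
m = 2
U = eigvecs[:, -m:]                         # top-m eigenvectors span the principal subspace

# reconstruction x~_n = x_bar + sum_{i<=m} ((x_n - x_bar)^T u_i) u_i
X_tilde = x_bar + (Xc @ U) @ U.T
J = np.mean(np.sum((X - X_tilde) ** 2, axis=1))
print(np.allclose(J, eigvals[:-m].sum()))   # J equals the sum of the d - m smallest eigenvalues
```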

Kernel PCA
- In standard PCA, the principal components u_i are given by S u_i = λ_i u_i, where S = (1/N) Σ_{n=1}^N x_n x_n^T (assuming the data have zero mean)
- Consider a feature mapping φ(x)
- We want to implicitly perform PCA in the feature space
- For now, assume the features have zero mean: Σ_n φ(x_n) = 0

Kernel PCA (Contd.)
- The sample covariance matrix in the feature space: C = (1/N) Σ_{n=1}^N φ(x_n) φ(x_n)^T
- The eigenvectors are given by C v_i = λ_i v_i
- We want to avoid computing C explicitly
- Note that the eigenvectors satisfy (1/N) Σ_{n=1}^N φ(x_n) {φ(x_n)^T v_i} = λ_i v_i
- Since the inner product φ(x_n)^T v_i is a scalar, each v_i lies in the span of the features: v_i = Σ_{n=1}^N a_{in} φ(x_n)

Kernel PCA (Contd.)
- Substituting back into the eigenvalue equation: (1/N) Σ_{n=1}^N φ(x_n) φ(x_n)^T Σ_{m=1}^N a_{im} φ(x_m) = λ_i Σ_{n=1}^N a_{in} φ(x_n)
- Multiplying both sides by φ(x_l)^T and using K(x_n, x_m) = φ(x_n)^T φ(x_m), we have (1/N) Σ_{n=1}^N K(x_l, x_n) Σ_{m=1}^N a_{im} K(x_n, x_m) = λ_i Σ_{n=1}^N a_{in} K(x_l, x_n)
- In matrix notation: K^2 a_i = λ_i N K a_i
- Except for eigenvectors with zero eigenvalue, we can equivalently solve K a_i = λ_i N a_i

Kernel PCA (Contd.)
- Since the original v_i are normalized, we have 1 = v_i^T v_i = a_i^T K a_i = λ_i N a_i^T a_i
- This gives a normalization condition for the a_i
- Compute the a_i by solving the eigenvalue decomposition of K
- The projection of a point x onto the i-th component is y_i(x) = φ(x)^T v_i = Σ_{n=1}^N a_{in} φ(x)^T φ(x_n) = Σ_{n=1}^N a_{in} K(x, x_n)
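
Putting the last few slides together, here is a minimal sketch of kernel PCA (the RBF kernel and all function names are illustrative choices, not from the lecture; the Gram matrix is not centered here, since centering is handled on a later slide):

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """K(x, x') = exp(-gamma ||x - x'||^2) between the rows of A and B."""
    d2 = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

def kernel_pca_fit(X, m, gamma=1.0):
    """Solve K a_i = lambda_i N a_i and rescale so that a_i^T K a_i = 1."""
    K = rbf_kernel(X, X, gamma)
    eigvals, eigvecs = np.linalg.eigh(K)              # eigenvalues of K equal lambda_i * N
    order = np.argsort(eigvals)[::-1][:m]             # keep the m largest
    A = eigvecs[:, order] / np.sqrt(eigvals[order])   # enforce a_i^T K a_i = 1
    return A, eigvals[order] / len(X)                 # coefficients a_i and eigenvalues lambda_i

def kernel_pca_project(X_train, A, X_new, gamma=1.0):
    """y_i(x) = sum_n a_{in} K(x, x_n) for each new point x."""
    return rbf_kernel(X_new, X_train, gamma) @ A

# usage on synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
A, lam = kernel_pca_fit(X, m=2)
Y = kernel_pca_project(X, A, X[:5])
print(Y.shape)   # (5, 2)
```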

Illustration of Kernel PCA (Feature Space) [figure: the principal direction v_1 in the feature coordinates φ_1, φ_2]

Illustration of Kernel PCA (Data Space) [figure: the same projection viewed in the original coordinates x_1, x_2]

Dimensionality of Projection
- Original x_i ∈ R^d, feature φ(x_i) ∈ R^D
- Possibly D >> d, so the number of principal components can be greater than d
- However, the number of nonzero eigenvalues cannot exceed N: the covariance matrix C has rank at most N, even if D >> d
- Kernel PCA involves the eigenvalue decomposition of an N × N matrix

Kernel PCA: Non-zero Mean
- In general, the features do not have zero mean
- Note that the features cannot be centered explicitly, since φ(x_n) is never computed
- The centered features would be φ̃(x_n) = φ(x_n) - (1/N) Σ_{l=1}^N φ(x_l)
- The corresponding Gram matrix is K̃ = K - 1_N K - K 1_N + 1_N K 1_N, where 1_N denotes the N × N matrix with every entry equal to 1/N
- Use K̃ in place of K in the basic kernel PCA formulation
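
A short sketch of this centering step (the function name is illustrative). For a linear kernel, centering the Gram matrix should agree with building the Gram matrix from explicitly centered data, which gives a convenient sanity check:

```python
import numpy as np

def center_gram(K):
    """K~ = K - 1_N K - K 1_N + 1_N K 1_N, with 1_N the N x N matrix whose entries are all 1/N."""
    N = K.shape[0]
    one_N = np.full((N, N), 1.0 / N)
    return K - one_N @ K - K @ one_N + one_N @ K @ one_N

# sanity check with a linear kernel on data with a non-zero mean
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3)) + 5.0          # deliberately shifted away from zero mean
Xc = X - X.mean(axis=0)
print(np.allclose(center_gram(X @ X.T), Xc @ Xc.T))   # True
```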

Kernel PCA on Artificial Data

Kernel PCA Properties
- Kernel PCA computes the eigenvalue decomposition of an N × N matrix, whereas standard PCA computes it for a d × d matrix
- For large datasets with N >> d, kernel PCA is therefore more expensive
- Standard PCA gives a projection onto a low-dimensional principal subspace: x̂_n = Σ_{i=1}^m (x_n^T u_i) u_i
- Kernel PCA cannot do this: φ(x) forms a d-dimensional manifold in R^D
- The PCA projection φ̂ of φ(x) need not lie on this manifold
- So it may not have a pre-image x̂ in the data space
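
One way to see the relationship between the two decompositions (a small self-contained check, not from the lecture): with a linear kernel on centered data, the projections obtained from the N × N Gram matrix coincide, up to sign, with those obtained from the d × d covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
Xc = X - X.mean(axis=0)
N = len(Xc)

# standard PCA: d x d covariance matrix
S = Xc.T @ Xc / N
_, s_vecs = np.linalg.eigh(S)
z_pca = Xc @ s_vecs[:, -1]                  # projection onto the top principal component

# kernel PCA with a linear kernel on centered data: N x N Gram matrix
K = Xc @ Xc.T
k_vals, k_vecs = np.linalg.eigh(K)
a = k_vecs[:, -1] / np.sqrt(k_vals[-1])     # normalization a^T K a = 1
z_kpca = K @ a                              # y(x_n) = sum_m a_m K(x_n, x_m)

print(np.allclose(np.abs(z_pca), np.abs(z_kpca)))   # equal up to sign
```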