CSci 5525: Machine Learning (Dec 3, 2008)
The Main Idea
- Given a dataset $X = \{x_1, \ldots, x_N\}$
- Find a low-dimensional linear projection
- Two possible formulations:
  - The variance in the low-dimensional space is maximized
  - The average projection cost is minimized
- The two formulations are equivalent
Two Viewpoints
[Figure: data points $x_n$ in the $(x_1, x_2)$ plane, a projection direction $u_1$, and the projected points $\tilde{x}_n$; maximizing the projected variance and minimizing the projection error pick out the same $u_1$.]
Maximum Variance Formulation
- Consider $X = \{x_1, \ldots, x_N\}$
- With $x_i \in \mathbb{R}^d$, the goal is to get a projection in $\mathbb{R}^m$, $m < d$
- Consider $m = 1$: we need a projection vector $u_1 \in \mathbb{R}^d$
- Each data point $x_i$ gets projected to $u_1^T x_i$
- Mean of the projected data is $u_1^T \bar{x}$, where $\bar{x} = \frac{1}{N} \sum_{n=1}^N x_n$
- Variance of the projected data (sketched in code below):
  $$\frac{1}{N} \sum_{n=1}^N \left(u_1^T x_n - u_1^T \bar{x}\right)^2 = u_1^T S u_1, \qquad S = \frac{1}{N} \sum_{n=1}^N (x_n - \bar{x})(x_n - \bar{x})^T$$
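A minimal NumPy sketch of the quantities above; the toy data matrix X and the direction u1 are illustrative placeholders, not from the lecture:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                # N = 100 toy points in d = 5 dimensions

x_bar = X.mean(axis=0)                       # sample mean, shape (d,)
Xc = X - x_bar                               # centered data
S = Xc.T @ Xc / len(X)                       # covariance matrix S, shape (d, d)

u1 = np.ones(5) / np.sqrt(5)                 # some unit-norm direction
proj_var = u1 @ S @ u1                       # u1^T S u1
assert np.isclose(proj_var, np.var(X @ u1))  # equals the variance of the 1-d projections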
Maximum Variance Formulation (Contd.)
- Maximize $u_1^T S u_1$ w.r.t. $u_1$
- Need a constraint to prevent $\|u_1\| \to \infty$
- Normalization constraint: $\|u_1\|^2 = 1$
- The Lagrangian for the problem: $u_1^T S u_1 + \lambda_1 (1 - u_1^T u_1)$
- First-order necessary condition: $S u_1 = \lambda_1 u_1$
- $u_1$ must be the eigenvector of $S$ with the largest eigenvalue, since $u_1^T S u_1 = \lambda_1$ (verified in the sketch below)
- The eigenvector $u_1$ is called a principal component
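Continuing the toy sketch, the maximizer is the top eigenvector of S:

eigvals, eigvecs = np.linalg.eigh(S)   # symmetric S; eigenvalues in ascending order
u1 = eigvecs[:, -1]                    # eigenvector with the largest eigenvalue
lam1 = eigvals[-1]

assert np.allclose(S @ u1, lam1 * u1)  # first-order condition S u1 = lambda1 u1
assert np.isclose(u1 @ S @ u1, lam1)   # attained variance equals lambda1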
Maximum Variance Formulation (Contd.)
- Subsequent principal components must be orthogonal to $u_1$
- Maximize $u_2^T S u_2$ s.t. $\|u_2\|^2 = 1$, $u_2 \perp u_1$
- The solution turns out to be the second eigenvector, and so on
- The top-$m$ eigenvectors give the best $m$-dimensional projection (see the sketch below)
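In the same sketch, the best m-dimensional projection stacks the top-m eigenvectors:

m = 2
U = eigvecs[:, -m:]                     # top-m eigenvectors as columns, shape (d, m)
Z = Xc @ U                              # projected (centered) data, shape (N, m)
assert np.allclose(U.T @ U, np.eye(m))  # the columns of U are orthonormal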
Minimum Error Formulation
- Consider a complete orthonormal basis $\{u_i\}$ in $\mathbb{R}^d$
- Each data point can be written as $x_n = \sum_{i=1}^d \alpha_{ni} u_i$
- Note that $\alpha_{ni} = x_n^T u_i$, so that $x_n = \sum_{i=1}^d (x_n^T u_i) u_i$
- Our goal is to obtain a lower-dimensional subspace, $m < d$
- A generic representation of a low-dimensional approximation:
  $$\tilde{x}_n = \sum_{i=1}^m z_{ni} u_i + \sum_{i=m+1}^d b_i u_i$$
- The coefficients $z_{ni}$ depend on the data point $x_n$; the $b_i$ are shared by all points
- We are free to choose $z_{ni}$, $b_i$, and $u_i$ to get $\tilde{x}_n$ close to $x_n$
Minimum Error Formulation (Contd.)
- The objective is to minimize $J = \frac{1}{N} \sum_{n=1}^N \|x_n - \tilde{x}_n\|^2$
- Setting the derivative w.r.t. $z_{nj}$ to zero gives $z_{nj} = x_n^T u_j$, $j = 1, \ldots, m$
- Setting the derivative w.r.t. $b_j$ to zero gives $b_j = \bar{x}^T u_j$, $j = m+1, \ldots, d$
- Then we have
  $$x_n - \tilde{x}_n = \sum_{i=m+1}^d \left\{(x_n - \bar{x})^T u_i\right\} u_i$$
- The error lies in the space orthogonal to the principal subspace
- The distortion measure to be minimized (numerical check below):
  $$J = \frac{1}{N} \sum_{n=1}^N \sum_{i=m+1}^d (x_n^T u_i - \bar{x}^T u_i)^2 = \sum_{i=m+1}^d u_i^T S u_i$$
- Need orthonormality constraints on the $u_i$ to prevent the trivial $u_i = 0$ solution
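A quick numerical check, continuing the toy sketches above: reconstruct each point from the m retained directions and confirm the distortion identity.

X_tilde = x_bar + Z @ U.T              # x~_n = x_bar + sum_{i<=m} ((x_n - x_bar)^T u_i) u_i
J = np.mean(np.sum((X - X_tilde) ** 2, axis=1))
U_rest = eigvecs[:, :-m]               # directions orthogonal to the principal subspace
assert np.isclose(J, np.trace(U_rest.T @ S @ U_rest))  # J = sum_{i>m} u_i^T S u_i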
Minimum Error Formulation (Contd.)
- Consider the special case $d = 2$, $m = 1$
- The Lagrangian of the objective: $L = u_2^T S u_2 + \lambda_2 (1 - u_2^T u_2)$
- The first-order condition is $S u_2 = \lambda_2 u_2$
- In general, the condition is $S u_i = \lambda_i u_i$
- The minimum is given by the eigenvectors corresponding to the smallest $(d - m)$ eigenvalues, so $J = \sum_{i=m+1}^d \lambda_i$
- Hence the principal subspace $u_i$, $i = 1, \ldots, m$, is spanned by the eigenvectors with the largest eigenvalues
Kernel PCA
- In standard PCA, the principal components $u_i$ are given by $S u_i = \lambda_i u_i$, where (for zero-mean data)
  $$S = \frac{1}{N} \sum_{n=1}^N x_n x_n^T$$
- Consider a feature mapping $\phi(x)$
- We want to implicitly perform PCA in the feature space
- Assume for now that the features have zero mean: $\sum_n \phi(x_n) = 0$
Kernel PCA (Contd.)
- The sample covariance matrix in the feature space:
  $$C = \frac{1}{N} \sum_{n=1}^N \phi(x_n) \phi(x_n)^T$$
- The eigenvectors are given by $C v_i = \lambda_i v_i$
- We want to avoid computing $C$ explicitly
- Note that the eigenvectors satisfy
  $$\frac{1}{N} \sum_{n=1}^N \phi(x_n) \left\{\phi(x_n)^T v_i\right\} = \lambda_i v_i$$
- Since the inner product $\phi(x_n)^T v_i$ is a scalar, we have $v_i = \sum_{n=1}^N a_{in} \phi(x_n)$
Kernel PCA (Contd.)
- Substituting back into the eigenvalue equation:
  $$\frac{1}{N} \sum_{n=1}^N \phi(x_n) \phi(x_n)^T \sum_{m=1}^N a_{im} \phi(x_m) = \lambda_i \sum_{n=1}^N a_{in} \phi(x_n)$$
- Multiplying both sides by $\phi(x_l)^T$ and using $K(x_n, x_m) = \phi(x_n)^T \phi(x_m)$, we have
  $$\frac{1}{N} \sum_{n=1}^N K(x_l, x_n) \sum_{m=1}^N a_{im} K(x_n, x_m) = \lambda_i \sum_{n=1}^N a_{in} K(x_l, x_n)$$
- In matrix notation: $K^2 a_i = \lambda_i N K a_i$
- Except for eigenvectors with zero eigenvalues, we can equivalently solve $K a_i = \lambda_i N a_i$ (the sketches below compute $K$ and solve this eigenproblem)
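The Gram matrix $K(x_n, x_m) = \phi(x_n)^T \phi(x_m)$ can be computed without ever forming $\phi(x)$; a sketch with an RBF kernel (the bandwidth gamma is an illustrative choice, not from the lecture):

def rbf_kernel(X, Y, gamma=1.0):
    """Gram matrix with entries exp(-gamma * ||x_n - y_m||^2)."""
    sq = (np.sum(X**2, axis=1)[:, None] + np.sum(Y**2, axis=1)[None, :]
          - 2.0 * X @ Y.T)
    return np.exp(-gamma * sq)

K = rbf_kernel(X, X)                   # N x N Gram matrix for the toy data above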
Kernel PCA (Contd.)
- Since the original $v_i$ are normalized, we have
  $$1 = v_i^T v_i = a_i^T K a_i = \lambda_i N a_i^T a_i$$
- This gives a normalization condition for the $a_i$
- Compute the $a_i$ by solving the eigenvalue decomposition of $K$
- The projection of a point $x$ is given by (see the sketch below)
  $$y_i(x) = \phi(x)^T v_i = \sum_{n=1}^N a_{in} \phi(x)^T \phi(x_n) = \sum_{n=1}^N a_{in} K(x, x_n)$$
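Putting the pieces together, a minimal kernel PCA sketch under the zero-mean-feature assumption (centering is handled on the next slide); it solves K a_i = lambda_i N a_i, rescales the a_i so that v_i^T v_i = 1, and projects the training points:

mu, A = np.linalg.eigh(K)              # K a_i = mu_i a_i with mu_i = N * lambda_i
mu, A = mu[::-1], A[:, ::-1]           # largest eigenvalues first
m = 2
mu, A = mu[:m], A[:, :m]               # keep m components (assumed mu_i > 0)
# eigh returns unit-norm a_i; 1 = lambda_i N a_i^T a_i requires a_i^T a_i = 1/mu_i,
# so rescale each column by 1/sqrt(mu_i)
A = A / np.sqrt(mu)
Y = K @ A                              # Y[l, i] = y_i(x_l) = sum_n a_in K(x_l, x_n)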
Illustration of Kernel PCA (Feature Space)
[Figure: feature space with axes $\phi_1$, $\phi_2$ and the first principal direction $v_1$.]
Illustration of Kernel PCA (Data Space)
[Figure: original data space with axes $x_1$, $x_2$.]
Dimensionality of Projection
- Original $x_i \in \mathbb{R}^d$; feature $\phi(x_i) \in \mathbb{R}^D$
- Possibly $D \gg d$, so the number of principal components can be greater than $d$
- However, the number of nonzero eigenvalues cannot exceed $N$
- The covariance matrix $C$ has rank at most $N$, even if $D \gg d$
- Kernel PCA involves an eigenvalue decomposition of an $N \times N$ matrix
Kernel PCA: Non-zero Mean
- In general the features need not have zero mean
- Note that the features cannot be explicitly centered
- The centered features would be of the form
  $$\tilde{\phi}(x_n) = \phi(x_n) - \frac{1}{N} \sum_{l=1}^N \phi(x_l)$$
- The corresponding Gram matrix, with $1_N$ the $N \times N$ matrix whose entries are all $1/N$ (sketched below):
  $$\tilde{K} = K - 1_N K - K 1_N + 1_N K 1_N$$
- Use $\tilde{K}$ in the basic kernel PCA formulation
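A sketch of the centering step, with one_N playing the role of $1_N$:

N = len(K)
one_N = np.full((N, N), 1.0 / N)       # the matrix 1_N: every entry equals 1/N
K_tilde = K - one_N @ K - K @ one_N + one_N @ K @ one_N
# Use K_tilde in place of K in the eigenproblem above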
Kernel PCA on Artificial Data
Kernel PCA Properties
- Computes the eigenvalue decomposition of an $N \times N$ matrix
- Standard PCA computes it for a $d \times d$ matrix
- For large datasets ($N \gg d$), kernel PCA is more expensive
- Standard PCA gives a projection onto a low-dimensional principal subspace:
  $$\hat{x}_n = \sum_{i=1}^m (x_n^T u_i) u_i$$
- Kernel PCA cannot do this: $\phi(x)$ forms a $d$-dimensional manifold in $\mathbb{R}^D$
- The PCA projection $\hat{\phi}$ of $\phi(x)$ need not lie on this manifold
- So it may not have a pre-image $\hat{x}$ in the data space