Principal Component Analysis (PCA)
Additional reading can be found in the non-assessed exercises (week 8) on this course unit's teaching page.
Textbooks: Sect. 6.3 in [1] and Ch. 12 in [2]
Outline
- Introduction
- Principle
- Algorithms
- Exemplar Applications
- Relevant Issues
- Conclusions
Introduction Principal component analysis (PCA)
- A method for high-dimensional data analysis via redundancy reduction
- Identifies an optimal low-dimensional linear projection: maximum data variance in the new space
- Useful for data visualization, compression and feature extraction
PCA finds a new coordinate system of maximum data variance; projection onto the principal axes leads to a new low-dimensional representation.
Principle Finding the 1st principal component
Given a data set of N data points in a d-dimensional space, X = {x_1, ..., x_N}, the 1st principal component is the unit vector u_1 that maximizes the variance of the projected data,
(1/N) Σ_{n=1}^{N} (u_1^T x_n - u_1^T x̄)^2 = u_1^T S u_1,
where x̄ is the sample mean and S = (1/N) Σ_{n=1}^{N} (x_n - x̄)(x_n - x̄)^T is the sample covariance matrix.
Principle Finding the 1st principal component (cont.)
Maximizing u_1^T S u_1 subject to u_1^T u_1 = 1 (via a Lagrange multiplier) gives S u_1 = λ_1 u_1. Hence u_1 is the eigenvector of S with the largest eigenvalue λ_1, and the maximized projected variance equals λ_1.
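A minimal numpy sketch of this result (the toy data and variable names are illustrative, not from the slides): the 1st principal component is the eigenvector of the sample covariance matrix with the largest eigenvalue.

```python
import numpy as np

# Toy data set: d = 5 dimensions, N = 100 points (illustrative values only).
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 100))

x_bar = X.mean(axis=1, keepdims=True)            # sample mean
S = (X - x_bar) @ (X - x_bar).T / X.shape[1]     # d x d sample covariance matrix

eigvals, eigvecs = np.linalg.eigh(S)             # eigh: ascending eigenvalues for symmetric S
u1 = eigvecs[:, -1]                              # 1st principal component (largest eigenvalue)
lambda1 = eigvals[-1]                            # variance of the data projected onto u1
```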
Principle General formulation
We want to find M (M < d) principal components u_1, ..., u_M, i.e., eigenvectors of S with S u_i = λ_i u_i. So we need λ_i ≥ λ_j if i < j, and u_i^T u_j = δ_ij (δ_ij = 1 if i = j and 0 otherwise), i.e., the principal components form an orthonormal set.
Principle Data reconstruction after dimension reduction
From a compressed data point in the M-dimensional PCA space (M < d), we can reconstruct an approximation of the data point in the original d-dimensional space.
Principle Perspective of minimizing reconstruction errors
PCA can also be formulated from the perspective of minimizing reconstruction error: among all M-dimensional linear projections, the principal subspace minimizes the mean squared error between the original data points and their reconstructions, (1/N) Σ_{n=1}^{N} ||x_n - x̃_n||^2.
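A short numpy check of this equivalence (toy data, illustrative only): reconstructing from the top M principal components gives a mean squared error equal to the sum of the discarded eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(4, 200))                    # toy data: d = 4, N = 200
x_bar = X.mean(axis=1, keepdims=True)
Xc = X - x_bar
S = Xc @ Xc.T / X.shape[1]                       # sample covariance matrix

eigvals, eigvecs = np.linalg.eigh(S)             # ascending eigenvalues
U_M = eigvecs[:, -2:]                            # keep the top M = 2 principal components

Z = U_M.T @ Xc                                   # encode
X_rec = x_bar + U_M @ Z                          # decode (reconstruction)

mse = np.mean(np.sum((X - X_rec) ** 2, axis=0))  # mean squared reconstruction error
print(np.allclose(mse, eigvals[:-2].sum()))      # True: error = sum of discarded eigenvalues
```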
Principle Dual PCA
Idea: for a d x N (d >> N) matrix X, the dimensionality of the spanned linear space is less than N.
- S = (1/N) X X^T is a d x d matrix; solving its eigenvalue problem is often computationally infeasible.
- S' = (1/N) X^T X is an N x N matrix, so its eigenvalue problem is solvable.
Fortunately, we can prove that S and S' share the same (nonzero) eigenvalues. From an eigenvector of S' we can produce the corresponding eigenvector of S: if v_i is an eigenvector of S' with eigenvalue λ_i, then u_i = (1/√(N λ_i)) X v_i is the corresponding eigenvector of S, which shares the eigenvalue λ_i.
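A small numpy sketch of the dual trick (toy sizes, illustrative names only): the eigenvectors of the N x N matrix S' are mapped back to eigenvectors of the d x d matrix S via u_i = X v_i / √(N λ_i).

```python
import numpy as np

rng = np.random.default_rng(2)
d, N = 1000, 20                          # d >> N: many dimensions, few samples
X = rng.normal(size=(d, N))
X = X - X.mean(axis=1, keepdims=True)    # centralize the data

S_dual = X.T @ X / N                     # N x N matrix: a cheap eigenvalue problem
lam, V = np.linalg.eigh(S_dual)
lam, V = lam[::-1], V[:, ::-1]           # sort eigenvalues in descending order

M = 5                                            # map only the top M dual eigenvectors back
U_M = X @ V[:, :M] / np.sqrt(N * lam[:M])        # columns are unit-norm eigenvectors of S

S = X @ X.T / N                                  # d x d matrix, formed here only for checking
print(np.allclose(S @ U_M[:, 0], lam[0] * U_M[:, 0]))   # True
```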
Principle Singular Value Decomposition (SVD)
A d x N matrix X can be decomposed as X = U Σ V^T, where
- U is a d x d orthogonal matrix; column i is the ith eigenvector of X X^T,
- Σ is a d x N diagonal matrix with σ_ii = √λ_i (λ_i being the ith eigenvalue of X X^T) and σ_ii ≥ σ_jj if i < j,
- V is an N x N orthogonal matrix; column i is the ith eigenvector of X^T X.
Link to PCA: if the data are centralized by subtracting the mean, X X^T / N is the covariance matrix of X, and column i of U corresponds to the ith principal component. The properties of SVD allow us to deal with a high-dimensional data set of few data points (d >> N), since the covariance matrix never needs to be formed directly.
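A brief numpy illustration of this link (toy data; names are illustrative): applying SVD to the centralized data matrix yields the principal components directly, without forming the covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(3)
d, N = 500, 30
X = rng.normal(size=(d, N))
X = X - X.mean(axis=1, keepdims=True)              # centralize first, as noted above

U, s, Vt = np.linalg.svd(X, full_matrices=False)   # X = U diag(s) V^T

# Columns of U are the principal directions; the eigenvalues of the
# covariance matrix X X^T / N are the squared singular values over N.
cov_eigvals = s ** 2 / N
U_M = U[:, :3]                                     # top M = 3 principal components
Z = U_M.T @ X                                      # PCA codes, never forming X X^T
```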
Algorithms Basic Algorithm
- Data centralization: for a given d x N data set X (d < N), subtract the mean vector x̄ from all the instances in X to obtain the centralized data set, denoted X̂.
- Eigenanalysis: calculate S = X̂ X̂^T / N; find all d eigenvalues, rank them so that λ_1 ≥ ... ≥ λ_d, and obtain their corresponding eigenvectors u_1, ..., u_d.
- Finding principal components: select the top M (M < d) eigenvectors of S to form a projection matrix U_M = [u_1, ..., u_M].
- Encoding a data point: z = U_M^T (x - x̄), where z is an M-dimensional vector encoding the data point x.
- Reconstructing a data point (decoding): x̃ = x̄ + U_M z, where x̃ is the d-dimensional reconstruction of x.
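A compact Python sketch of the basic algorithm above (function and variable names are illustrative, not from the source):

```python
import numpy as np

def pca_fit(X, M):
    """Fit PCA to a d x N data matrix X and keep the top M components."""
    x_bar = X.mean(axis=1)                      # mean vector, shape (d,)
    Xc = X - x_bar[:, None]                     # data centralization
    S = Xc @ Xc.T / X.shape[1]                  # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)        # eigenanalysis (ascending eigenvalues)
    order = np.argsort(eigvals)[::-1]           # rank so that lambda_1 >= ... >= lambda_d
    U_M = eigvecs[:, order[:M]]                 # projection matrix U_M = [u_1, ..., u_M]
    return x_bar, U_M

def pca_encode(x, x_bar, U_M):
    """Encode a d-dimensional point x as an M-dimensional vector z."""
    return U_M.T @ (x - x_bar)                  # z = U_M^T (x - x_bar)

def pca_decode(z, x_bar, U_M):
    """Reconstruct (decode) a d-dimensional point from its M-dimensional code z."""
    return x_bar + U_M @ z                      # x_tilde = x_bar + U_M z
```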
Algorithms Dual Algorithm
- Data centralization: for a given d x N data set X (d ≥ N), subtract the mean vector x̄ from all the instances in X to obtain the centralized data set, denoted X̂.
- SVD procedure: calculate Y = X̂^T / √N (so that Y^T Y = S) and apply the SVD to Y, i.e., Y = U Σ V^T; this yields a d x d matrix V.
- Finding principal components: select the first M (M < d) columns of V to form a projection matrix U_M = [v_1, ..., v_M].
- Encoding a data point: z = U_M^T (x - x̄), where z is an M-dimensional vector encoding the data point x.
- Reconstructing a data point (decoding): x̃ = x̄ + U_M z, where x̃ is the d-dimensional reconstruction of x.
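A matching sketch of the dual algorithm via SVD (again with illustrative names), useful when d ≥ N; encoding and decoding reuse the same formulas as the basic algorithm.

```python
import numpy as np

def dual_pca_fit(X, M):
    """Dual PCA for a d x N data matrix X with d >= N, keeping M components."""
    N = X.shape[1]
    x_bar = X.mean(axis=1)
    Xc = X - x_bar[:, None]                            # centralized data
    Y = Xc.T / np.sqrt(N)                              # N x d matrix with Y^T Y = Xc Xc^T / N
    _, _, Vt = np.linalg.svd(Y, full_matrices=False)   # Y = U Sigma V^T
    U_M = Vt[:M].T                                     # first M columns of V as the projection matrix
    return x_bar, U_M
```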
Examples Example 1: Synthetic data
Examples Example 2: Visualization of high-dimensional data
An application of PCA to the visualization of microarray data.
Examples Example 3: Data compression
A hand-written digit '3' data set of 600 images, each 100 x 100 = 10,000 pixels.
[Figure: original images, principal components, and reconstructed images]
Examples Example 4: Feature extraction
Extract salient features (eigenfaces) from facial images to facilitate recognition.
Examples Example 4: Feature extraction (cont.)
Eigenfaces are the eigenvectors of the covariance matrix of the vector space of human face images. A human face may be considered a combination of these standard faces; the principal eigenface looks like a bland, androgynous average human face.
Examples Example 4: Feature extraction (cont.)
When properly weighted, eigenfaces can be summed together to create an approximate face. Remarkably few eigenvector terms are needed to give a fair likeness of most people's faces.
Suppose we are going to use M eigenfaces. Then a facial image is represented by M coordinates in the PCA subspace: z = U_M^T x, where U_M is a d^2 x M matrix consisting of the top M eigenvectors, x is a vector of d^2 elements converted from an image, and z is a vector of M elements serving as the representation (features).
Feature vectors of M elements are used in a face recognition system for both training and testing.
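A minimal sketch of eigenface-style feature extraction, assuming a hypothetical matrix `faces` whose columns are vectorized training face images (the data and sizes below are stand-ins, not the data set from the slides):

```python
import numpy as np

d, N, M = 100 * 100, 600, 50                       # hypothetical sizes: d pixels, N faces, M eigenfaces
faces = np.random.default_rng(4).random((d, N))    # stand-in for real vectorized face images

x_bar = faces.mean(axis=1)
Xc = faces - x_bar[:, None]

# With d >> N, use the dual/SVD route rather than a d x d covariance matrix.
_, _, Vt = np.linalg.svd(Xc.T / np.sqrt(N), full_matrices=False)
U_M = Vt[:M].T                                     # columns are the top M eigenfaces

train_features = U_M.T @ Xc                        # M-dimensional feature vector per training face
# A test face x (a length-d vector) is encoded the same way before classification:
# z = U_M.T @ (x - x_bar)
```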
Relevant Issues How to find an appropriate dimensionality, M, for the PCA space
In practice we use the Proportion of Variance (PoV):
PoV = (Σ_{i=1}^{k} λ_i) / (Σ_{i=1}^{d} λ_i) = (λ_1 + ... + λ_k) / (λ_1 + ... + λ_k + ... + λ_d)
When PoV ≥ 90%, the corresponding k is chosen as M.
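A small helper sketching this rule (the function name and example eigenvalues are illustrative):

```python
import numpy as np

def choose_M(eigvals, threshold=0.90):
    """Return the smallest k whose Proportion of Variance reaches the threshold.

    eigvals: covariance eigenvalues sorted in descending order.
    """
    pov = np.cumsum(eigvals) / np.sum(eigvals)
    return int(np.searchsorted(pov, threshold) + 1)      # convert 0-based index to k

# Example eigenvalues (illustrative): PoV first reaches 90% at k = 2.
print(choose_M(np.array([6.0, 3.5, 0.3, 0.1, 0.1])))     # -> 2
```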
Relevant Issues Limitations of the standard PCA
Are the dimensions of maximum data variance always the relevant dimensions to preserve? If not, other techniques are required, e.g.,
- Relevant component analysis (RCA)
- Linear discriminant analysis (LDA)
Relevant Issues Limitations of the standard PCA (cont.)
Should the goal be finding independent rather than merely pair-wise uncorrelated/orthogonal dimensions? If so, another technique is required: Independent Component Analysis (ICA).
[Figure: PCA vs ICA projections of the same data]
Relevant Issues Limitations of the standard PCA (cont.)
Dimensionality reduction for complex distributions may require nonlinear processing. Nonlinear PCA extensions preserve the proximity between points in the input space, i.e., the local topology of the distribution, making it possible to unfold some manifolds in the input data while keeping the local topology.
[Figures: nonlinear projection of a spiral; nonlinear projection of a horseshoe]
Relevant Issues Miscellaneous PCA extensions (>100)
- Probabilistic PCA
- 2-D PCA
- Sparse PCA / Scaled PCA
- Nonnegative Matrix Factorization
- PCA mixture and local PCA
- Principal Curve and Surface Analysis
- Kernel PCA
Conclusions
PCA is a simple yet popular method for handling high-dimensional data and has inspired many other methods.
It is a linear method for dimensionality reduction: the original data are projected onto a new coordinate system so as to maximize data variance.
PCA can be interpreted from various perspectives, which leads to different formulations.
The standard PCA has a number of limitations; several variants and extensions aim to overcome them.