PCA

Ron Parr
CPS 71
With thanks to Tom Mitchell

Principal Components Analysis

Idea: Given data points in a high-dimensional space, project them into a lower-dimensional space while preserving as much information as possible
- E.g., find the best planar approximation to 3D data
- E.g., find the best planar approximation to 10^4-D data
In particular, choose the projection that minimizes the squared error in reconstructing the original data
Why do we care?

Lower-dimensional representations permit:
- Compression
- Noise filtering
- Preprocessing for classification:
  - Reduces feature space dimension
  - Simpler classifiers
  - Possibly better generalization
  - May facilitate simple (nearest neighbor) methods

Review of a Few Linear Algebra Facts

A set of vectors is orthonormal if:
- All vectors in the set have norm 1
- Any two different vectors have dot product 0
Any vector in a linear space can be expressed as a weighted combination of norm-1 vectors, specifically the vectors that form a basis for the space
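To make these facts concrete, here is a minimal NumPy sketch (the example basis and all variable names are mine, not from the slides) that checks orthonormality and expresses a vector as a weighted combination of basis vectors:

```python
import numpy as np

# Illustrative orthonormal basis for R^2: the standard basis rotated by 45 degrees.
theta = np.pi / 4
U = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])  # columns are the basis vectors

# Orthonormality check: U^T U should be the identity (u_i . u_j = delta_ij).
print(np.allclose(U.T @ U, np.eye(2)))  # True

# Any vector can be written as a weighted combination of the basis vectors;
# the weights are just dot products with each basis vector.
x = np.array([3.0, -1.0])
z = U.T @ x                 # weights z_i = u_i^T x
x_rebuilt = U @ z           # weighted combination sum_i z_i u_i
print(np.allclose(x, x_rebuilt))  # True
```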
PCA: Find Projections to Minimize Reconstruction Error

Assume the data is a set of d N-dimensional vectors, X = ((x_1)^T, (x_2)^T, ..., (x_d)^T)

Can always express the k-th vector as

  x_k = Σ_{i=1}^{rank(X)} z_i^k u_i,   with   u_i^T u_j = δ_ij

(a compact way of indicating orthonormality)

PCA: given M < d, find (u_1, ..., u_M) that minimizes

  E_M = Σ_{k=1}^{d} ||x_k - x̂_k||²

where

  x̂_k = x̄ + Σ_{i=1}^{M} z_i^k u_i,   mean  x̄ = (1/d) Σ_{i=1}^{d} x_i

Review: Eigenvectors

Matrix A has eigenvector u with eigenvalue λ if: Au = λu
For symmetric A, the (normalized) eigenvectors:
- Are orthogonal
- Have real eigenvalues
- Form an orthonormal basis
(See appendix C of text)
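A small NumPy sketch of the eigenvector review (the matrix here is a made-up example): a symmetric matrix has real eigenvalues and orthonormal eigenvectors.

```python
import numpy as np

# A made-up symmetric matrix (any covariance matrix is symmetric like this).
A = np.array([[4.0, 1.0, 0.5],
              [1.0, 3.0, 0.2],
              [0.5, 0.2, 2.0]])

# eigh is specialized for symmetric matrices: it returns real eigenvalues
# and orthonormal eigenvectors (as the columns of U).
eigvals, U = np.linalg.eigh(A)

print(np.all(np.isreal(eigvals)))                       # True: real eigenvalues
print(np.allclose(U.T @ U, np.eye(3)))                  # True: orthonormal basis
print(np.allclose(A @ U[:, 0], eigvals[0] * U[:, 0]))   # True: A u = lambda u
```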
Review: Projection

Orthonormal basis -> trivial projection
Suppose U is our basis (formed by the first k eigenvectors)
Suppose we want to project a new x:

  w = (U^T U)^{-1} U^T x = U^T x

Note: We typically assume x has had the mean subtracted already

PCA

PCA: given M < d, find (u_1, ..., u_M) that minimizes E_M, where

  x̂_k = x̄ + Σ_{i=1}^{M} z_i^k u_i

Note we get zero error if M = d. Therefore, since z_i^j = u_i^T (x_j - x̄),

  E_M = Σ_{j=1}^{d} Σ_{i=M+1}^{d} (z_i^j)²
      = Σ_{i=M+1}^{d} Σ_{j=1}^{d} [u_i^T (x_j - x̄)]²
      = Σ_{i=M+1}^{d} Σ_{j=1}^{d} [u_i^T (x_j - x̄)] [(x_j - x̄)^T u_i]
      = Σ_{i=M+1}^{d} u_i^T Σ u_i

This is how much we left out by dropping vectors u_{M+1}, ..., u_d

Covariance matrix:

  Σ = Σ_{k=1}^{d} (x_k - x̄)(x_k - x̄)^T

Equivalent problem: maximize the variance in the dimensions we keep

This is minimized when u_i is an eigenvector of Σ, i.e., when:

  Σ u_i = λ_i u_i
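A minimal sketch of the projection step (all data and names here are illustrative): with an orthonormal basis U, projecting a mean-subtracted point is just U^T x, and the general least-squares formula reduces to it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup: 200 data points in 5 dimensions, stored as columns.
X = rng.normal(size=(5, 200))
x_bar = X.mean(axis=1)

# U: the first k = 2 eigenvectors of the covariance (scatter) matrix, as columns.
A = X - x_bar[:, None]
eigvals, eigvecs = np.linalg.eigh(A @ A.T)      # eigenvalues in ascending order
U = eigvecs[:, ::-1][:, :2]                     # the two largest-eigenvalue eigenvectors

# Project a new point. Because U has orthonormal columns, U^T U = I,
# so the least-squares projection (U^T U)^{-1} U^T x reduces to U^T x.
x_new = rng.normal(size=5)
w_simple = U.T @ (x_new - x_bar)                # mean subtracted first, as assumed
w_ls = np.linalg.solve(U.T @ U, U.T @ (x_new - x_bar))
print(np.allclose(w_simple, w_ls))              # True
```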
Justifying Use of Eigenvectors

Minimize

  E_M = Σ_{i=M+1}^{d} u_i^T Σ u_i

For each term, we want to minimize u^T Σ u subject to u^T u = 1
Use Lagrange multipliers, i.e., minimize:

  u^T Σ u - λ u^T u

Take the gradient, set it to 0:

  Σu - λu = 0

True when u is an eigenvector of Σ with eigenvalue λ

PCA

(Figure: data in the (x_1, x_2) plane with the principal directions u_1 and u_2 marked.)

  Σ u_i = λ_i u_i   (u_i an eigenvector of Σ, λ_i its eigenvalue)

  E_M = Σ_{i=M+1}^{d} λ_i

PCA algorithm:
1. X <- create the N x d data matrix (one column per data point)
2. A <- subtract the mean x̄ from each column of X
3. Σ <- covariance matrix of A
4. Find the eigenvectors and eigenvalues of Σ
5. PCs <- the M eigenvectors with the largest eigenvalues
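A minimal NumPy sketch of the algorithm above (the function and variable names are mine; Σ is taken as the unnormalized scatter matrix, matching the covariance definition on the earlier slide). It also checks that the reconstruction error E_M equals the sum of the discarded eigenvalues.

```python
import numpy as np

def pca(X, M):
    """PCA following the slide's algorithm.
    X is an N x d data matrix: each of the d columns is one N-dimensional point."""
    x_bar = X.mean(axis=1, keepdims=True)       # mean vector x_bar (N x 1)
    A = X - x_bar                               # step 2: subtract the mean from each column
    Sigma = A @ A.T                             # step 3: covariance (scatter) matrix, N x N
    eigvals, eigvecs = np.linalg.eigh(Sigma)    # step 4: eigenvalues/eigenvectors (ascending)
    order = np.argsort(eigvals)[::-1]           # sort by decreasing eigenvalue
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    U = eigvecs[:, :M]                          # step 5: the M principal components
    return x_bar, U, eigvals

# Illustrative data: d = 500 points in N = 6 dimensions.
rng = np.random.default_rng(1)
X = rng.normal(size=(6, 6)) @ rng.normal(size=(6, 500))

M = 2
x_bar, U, eigvals = pca(X, M)

# Check: the reconstruction error E_M equals the sum of the discarded eigenvalues.
Z = U.T @ (X - x_bar)                           # coefficients z_i^k for each data point
X_hat = x_bar + U @ Z                           # reconstructions
E_M = np.sum((X - X_hat) ** 2)
print(np.isclose(E_M, eigvals[M:].sum()))       # True
```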
PCA Example

(Figure: a 2D data cloud with the mean, first eigenvector, and second eigenvector marked.)

PCA Example

(Figure: the same data reconstructed using only the first eigenvector (M = 1), with the mean, first eigenvector, and second eigenvector marked.)
Applying PCA

Example data set: images of faces
(the famous "eigenface" approach [Turk & Pentland], [Sirovich & Kirby])
- Each datum is a point in image space
- Each point is a vector of luminance values
- Vectors are long, e.g., 256 x 256 = 64K
These form the columns of A, and Σ = A A^T
Problem: A A^T is unreasonably large!

A Clever Workaround

Note that d << N (= 64K)
Use L = A^T A instead of Σ = A A^T
Suppose v is an eigenvector of L with eigenvalue γ; then Av is an eigenvector of Σ:

  L v = γ v
  A^T A v = γ v
  A A^T A v = γ A v
  Σ (A v) = γ (A v)
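A minimal sketch of this workaround (the sizes are toy, not 64K, and the names are mine): eigen-decompose the small d x d matrix L = A^T A, then map each eigenvector through A to get an eigenvector of the big N x N covariance.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy eigenfaces setup: d = 20 mean-subtracted "images", each N = 1000 pixels,
# stored as the columns of A (in the real application N would be ~64K).
N, d = 1000, 20
A = rng.normal(size=(N, d))
A = A - A.mean(axis=1, keepdims=True)

# Small matrix: L = A^T A is only d x d.
L = A.T @ A
gammas, V = np.linalg.eigh(L)            # eigenvalues gamma, eigenvectors v (columns)

# Map a small eigenvector v through A to get an eigenvector of Sigma = A A^T.
Sigma = A @ A.T                          # N x N; formed here only to verify the claim
u = A @ V[:, -1]                         # Av for the largest-eigenvalue v
print(np.allclose(Sigma @ u, gammas[-1] * u))   # True: Sigma (Av) = gamma (Av)
```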
Application to Eigenfaces

d = hundreds to thousands of faces
Keep M ~ d/10 eigenvectors (eigenfaces)
Achieve:
- Low reconstruction error
- Relatively high classification accuracy (across faces)
- Robust measure of "faceness"

Application to Collaborative Filtering

Collaborative filtering: use preferences/ratings from a set of users to predict preferences/ratings for a new user
Examples: Amazon, Netflix, etc.
Collaborative filtering as PCA:
- Customers span columns
- Products span rows
- Principal components are customer types
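A toy sketch of this mapping (the ratings matrix and all names are made up): rows are products, columns are customers, and each principal component is a direction in product space that can be read as a "customer type".

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy ratings matrix R: rows = 50 products, columns = 400 customers.
# Customers are noisy mixtures of two underlying "types" with fixed taste profiles.
types = rng.normal(size=(50, 2))                 # two product-space taste profiles
weights = rng.random(size=(2, 400))              # how much of each type per customer
R = types @ weights + 0.1 * rng.normal(size=(50, 400))

# PCA with customers as columns: center, then eigen-decompose the scatter matrix.
A = R - R.mean(axis=1, keepdims=True)
eigvals, eigvecs = np.linalg.eigh(A @ A.T)
customer_types = eigvecs[:, ::-1][:, :2]         # top-2 components = recovered "customer types"

# The top-2 components span (approximately) the same product-space plane as the true types.
proj = customer_types @ (customer_types.T @ types)
print(np.linalg.norm(types - proj) / np.linalg.norm(types) < 0.1)   # True (small residual)
```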
Relationship to SVD

SVD factors a matrix: [U, S, V] = SVD(X), with X = U S V^T
- U contains the eigenvectors of X X^T
- V contains the eigenvectors of X^T X
- S (singular values) contains the square roots of the eigenvalues
Can use SVD to get the PCA solution by subtracting the mean, then running SVD

Summary of PCA Uses

- Data compression (compress data by representing the entire data set as coefficients for a small number of principal components)
- Noise filtering (assume low-eigenvalue components correspond to noise)
- Feature selection for supervised learning (assumes low-eigenvalue components are noise/irrelevant features)
- Nearest neighbor classification (assumes the subspace of principal components is a more natural space in which to measure distances)
- Direct classification (assume distance to the span of the principal components is an indicator of class membership)
- Visualization (assume the first 2 or 3 principal components show the interesting relationships that exist in the data)
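A minimal sketch (names are mine) checking this correspondence with NumPy: after subtracting the mean, the left singular vectors of the data matrix match the covariance eigenvectors, and the squared singular values match the eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(3)

# Data matrix with d = 300 data points as columns, N = 4 dimensions.
X = rng.normal(size=(4, 4)) @ rng.normal(size=(4, 300))
A = X - X.mean(axis=1, keepdims=True)                # subtract the mean first

# PCA via eigen-decomposition of the covariance (scatter) matrix.
eigvals, eigvecs = np.linalg.eigh(A @ A.T)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # sort descending

# PCA via SVD of the mean-subtracted data matrix.
U, S, Vt = np.linalg.svd(A, full_matrices=False)

# Singular values are the square roots of the covariance eigenvalues.
print(np.allclose(S ** 2, eigvals))                  # True

# Columns of U match the covariance eigenvectors, up to sign.
print(np.allclose(np.abs(np.sum(U * eigvecs, axis=0)), 1.0))  # True
```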
Shortcomings

- Requires carefully controlled data:
  - All data are aligned (e.g., all faces centered in the frame)
  - No missing entries (handled awkwardly in CF)
- Completely knowledge-free method (sometimes this is good)
- Is purely linear, e.g., doesn't know that faces are wrapped around 3D objects (heads)
- Makes no effort to preserve class distinctions

PCA Problem

(Figure: an example data set with two classes whose separation is lost after projection.)
PCA doesn't know about labels!
PCA Conclusions

PCA finds an orthonormal basis for the data
- Sorts dimensions in order of importance
- Discard low-significance dimensions to:
  - Get a compact description
  - Ignore noise
  - Improve classification (hopefully)
Not magic:
- Doesn't know class labels
- Can only capture linear variations
One of many types of dimensionality reduction!