Lecture 24: Principal Component Analysis. Aykut Erdem May 2016 Hacettepe University


1 Lecture 24: Principal Component Analysis. Aykut Erdem, May 2016, Hacettepe University

2 This week: Motivation; PCA algorithms; Applications; PCA shortcomings; Autoencoders; Kernel PCA

3 PCA Applications: Data Visualization; Data Compression; Noise Reduction; Learning; Anomaly detection

4 Data Visualization
Example: Given 53 blood and urine samples (features) from 65 people, how can we visualize the measurements?

5 Data Visualization
Matrix format (65x53): rows are instances, columns are features (H-WBC, H-RBC, H-Hgb, H-Hct, H-MCV, H-MCH, H-MCHC, ...).
Difficult to see the correlations between the features...

6 Data Visualization
Spectral format (65 curves, one for each person); axes: Measurement vs. Value.
Difficult to compare the different patients...

7 Data Visualization
Spectral format (53 pictures, one for each feature); axes: Person vs. H-Bands.
Difficult to see the correlations between the features...

8 Data Visualization
Bi-variate plot: C-Triglycerides vs. C-LDH. Tri-variate plot: C-Triglycerides, C-LDH and M-EPI.
How can we visualize the other variables? Difficult to see in 4 or higher dimensional spaces...

9 Data Visualization
- Is there a representation better than the coordinate axes?
- Is it really necessary to show all the 53 dimensions? ... what if there are strong correlations between the features?
- How could we find the smallest subspace of the 53-D space that keeps the most information about the original data?
A solution: Principal Component Analysis

10 PCA algorithms

11 Principal Component Analysis
PCA: Orthogonal projection of the data onto a lower-dimension linear space that...
- maximizes variance of projected data (purple line)
- minimizes mean squared distance between data point and projections (sum of blue lines)

12 Principal Component Analysis
Idea: Given data points in a d-dimensional space, project them into a lower dimensional space while preserving as much information as possible.
- Find best planar approximation to 3D data
- Find best 1-D approximation to D data
In particular, choose the projection that minimizes the squared error in reconstructing the original data.
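
The link between the two criteria (maximum variance of the projections and minimum reconstruction error) can be made explicit. A short derivation, assuming centered data $x_1, \dots, x_m$ and a single unit-norm direction $w$ (notation chosen here for illustration, not taken from the slide):

$$\|x_i - w w^T x_i\|^2 = \|x_i\|^2 - 2 (w^T x_i)^2 + (w^T x_i)^2 \|w\|^2 = \|x_i\|^2 - (w^T x_i)^2$$

$$\frac{1}{m} \sum_{i=1}^{m} \|x_i - w w^T x_i\|^2 = \frac{1}{m} \sum_{i=1}^{m} \|x_i\|^2 - \frac{1}{m} \sum_{i=1}^{m} (w^T x_i)^2$$

The first term on the right does not depend on $w$, so minimizing the mean squared reconstruction error is the same as maximizing the variance of the projections.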

13 Principal Component Analysis
PCA vectors originate from the center of mass.
Principal component #1 points in the direction of the largest variance.
Each subsequent principal component
- is orthogonal to the previous ones, and
- points in the direction of the largest variance of the residual subspace

14 2D Gaussian dataset

15 1st PCA axis

16 2nd PCA axis

17 PCA algorithm I (sequential)
Given the centered data $\{x_1, \dots, x_m\}$, compute the principal vectors.
1st PCA vector (we maximize the variance of the projection of $x$):
$$w_1 = \arg\max_{\|w\|=1} \frac{1}{m} \sum_{i=1}^{m} (w^T x_i)^2$$
k-th PCA vector, shown here for k = 2 (we maximize the variance of the projection in the residual subspace):
$$w_2 = \arg\max_{\|w\|=1} \frac{1}{m} \sum_{i=1}^{m} \left[ w^T (x_i - w_1 w_1^T x_i) \right]^2$$
Figure: a point $x$, its PCA reconstruction $x' = w_1 (w_1^T x)$ along $w_1$, and the residual $x - x'$.

18 PCA algorithm I (sequential)
Given $w_1, \dots, w_{k-1}$, we calculate the principal vector $w_k$ as before, maximizing the variance of the projection of $x$ in the residual subspace:
$$w_k = \arg\max_{\|w\|=1} \frac{1}{m} \sum_{i=1}^{m} \left[ w^T \left( x_i - \sum_{j=1}^{k-1} w_j w_j^T x_i \right) \right]^2$$
Figure: the PCA reconstruction of $x$ from two vectors, $x' = w_1 (w_1^T x) + w_2 (w_2^T x)$.
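
The sequential procedure can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the slides do not say how each arg max is solved, so power iteration on the covariance of the residuals is used here as one possible choice, and the function name `sequential_pca`, the row-per-sample layout and the synthetic data are all hypothetical.

```python
import numpy as np

def sequential_pca(X, k, n_iter=500, seed=0):
    """Greedy PCA sketch. X is (m, d): one centered sample per row.

    Returns a (d, k) matrix whose columns are w_1, ..., w_k.
    """
    rng = np.random.default_rng(seed)
    m, d = X.shape
    R = X.copy()                      # residuals x_i - sum_j w_j w_j^T x_i
    W = np.zeros((d, k))
    for j in range(k):
        C = R.T @ R / m               # covariance of the current residuals
        w = rng.normal(size=d)
        w /= np.linalg.norm(w)
        for _ in range(n_iter):       # power iteration (assumed solver for the arg max)
            w = C @ w
            w /= np.linalg.norm(w)
        W[:, j] = w
        R = R - np.outer(R @ w, w)    # remove the component along w from every sample
    return W

# Synthetic data with per-axis variances of roughly 9, 1 and 0.04.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3)) * np.array([3.0, 1.0, 0.2])
X -= X.mean(axis=0)
W = sequential_pca(X, k=2)
```

Subtracting the found component from the residuals before the next round is what makes each new vector orthogonal to the previous ones.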

19 PCA algorithm II (sample covariance matrix)
Given data $\{x_1, \dots, x_m\}$, compute the covariance matrix
$$\Sigma = \frac{1}{m} \sum_{i=1}^{m} (x_i - \bar{x})(x_i - \bar{x})^T, \quad \text{where } \bar{x} = \frac{1}{m} \sum_{i=1}^{m} x_i$$
PCA basis vectors = the eigenvectors of $\Sigma$.
Larger eigenvalue ⇒ more important eigenvector.

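20 PCA algorithm II (sample covariance matrix)
PCA algorithm(X, k): top k eigenvalues/eigenvectors
% X = N × m data matrix,
% each data point $x_i$ = column vector, i = 1..m
- $\bar{x} \leftarrow \frac{1}{m} \sum_{i=1}^{m} x_i$
- $X \leftarrow$ subtract mean $\bar{x}$ from each column vector $x_i$ in X
- $\Sigma \leftarrow X X^T$, the covariance matrix of X (up to the constant factor 1/m, which does not change the eigenvectors)
- $\{\lambda_i, u_i\}_{i=1..N}$ = eigenvalues/eigenvectors of $\Sigma$, with $\lambda_1 \ge \dots \ge \lambda_N$
- Return $\{\lambda_i, u_i\}_{i=1..k}$ % top k PCA components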
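
The steps above translate almost line by line into NumPy. A minimal sketch, assuming the column-per-data-point layout of the slide; the function name `pca_eig` and the synthetic data are hypothetical, and the 1/m factor is included so the eigenvalues are variances.

```python
import numpy as np

def pca_eig(X, k):
    """X is an (N, m) data matrix with one data point per column.

    Returns the top-k eigenvalues (descending) and the matching
    eigenvectors as the columns of an (N, k) matrix.
    """
    x_bar = X.mean(axis=1, keepdims=True)    # mean data point
    Xc = X - x_bar                           # subtract the mean from every column
    Sigma = Xc @ Xc.T / X.shape[1]           # sample covariance, with the 1/m factor
    evals, evecs = np.linalg.eigh(Sigma)     # symmetric input, eigenvalues ascending
    order = np.argsort(evals)[::-1][:k]      # indices of the k largest eigenvalues
    return evals[order], evecs[:, order]

# Synthetic 5-dimensional data, 300 points, with decreasing spread per feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 300)) * np.array([[4.0], [2.0], [1.0], [0.5], [0.1]])
lams, U = pca_eig(X, k=2)
```

`np.linalg.eigh` is used rather than `np.linalg.eig` because the covariance matrix is symmetric, which guarantees real eigenvalues and orthonormal eigenvectors.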

21 PCA algorithm III (SVD of the data matrix)
Singular Value Decomposition of the centered data matrix X:
$$X_{\text{features} \times \text{samples}} = U S V^T$$
Figure: block diagram of $X = U S V^T$; the leading blocks of U, S and $V^T$ are labelled "significant" and the remaining blocks "noise".

22 PCA algorithm III
Columns of U:
- the principal vectors, $\{u^{(1)}, \dots, u^{(k)}\}$
- orthogonal and of unit norm, so $U^T U = I$
- can reconstruct the data using linear combinations of $\{u^{(1)}, \dots, u^{(k)}\}$
Matrix S:
- diagonal
- shows the importance of each eigenvector
Columns of $V^T$:
- the coefficients for reconstructing the samples
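
The roles of U, S and $V^T$ can be checked numerically. A minimal sketch, assuming features in rows and samples in columns as on the slide; the function name `pca_svd` and the random data are hypothetical.

```python
import numpy as np

def pca_svd(X, k):
    """X is (N, m): features in rows, samples in columns."""
    Xc = X - X.mean(axis=1, keepdims=True)              # center each feature
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)   # s is sorted in descending order
    return U[:, :k], s[:k], Vt[:k, :], Xc

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 40))
U_k, s_k, Vt_k, Xc = pca_svd(X, k=3)

# Rank-3 reconstruction of the centered data: each sample is a linear
# combination of the columns of U_k, with coefficients from S and V^T.
X_hat = U_k @ np.diag(s_k) @ Vt_k
```

The squared singular values equal the eigenvalues of $X X^T$ for the centered X, which is how this route connects to the covariance-based algorithm.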

23 Applications

24 Face Recognition

25 Face Recognition
- Want to identify specific person, based on facial image
- Robust to glasses, lighting, ...
- Can't just use the given 256 x 256 pixels

26 Applying PCA: Eigenfaces
Method A: Build a PCA subspace for each person and check which subspace can reconstruct the test image the best.
Method B: Build one PCA database for the whole dataset and then classify based on the weights.
Example data set: Images of faces
- Famous Eigenface approach [Turk & Pentland], [Sirovich & Kirby]
- Each face x is 256 × 256 values (luminance at location), i.e. $x \in \mathbb{R}^{256 \times 256}$ (view as a 64K-dim vector)
- Form $X = [x_1, \dots, x_m]$, the centered data matrix (m faces, 256 × 256 real values each)
- Compute $\Sigma = X X^T$
- Problem: $\Sigma$ is 64K × 64K ... HUGE!!!
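
Method A can be illustrated on toy data. The sketch below is not the Eigenfaces implementation from the cited papers: the helper names (`fit_subspace`, `reconstruction_error`, `classify`), the labels `person_a` / `person_b` and the synthetic 50-dimensional "images" are all assumptions made for illustration.

```python
import numpy as np

def fit_subspace(X, k):
    """X is (N, m): one training image per column. Returns (mean, basis)."""
    mean = X.mean(axis=1, keepdims=True)
    U, _, _ = np.linalg.svd(X - mean, full_matrices=False)
    return mean, U[:, :k]

def reconstruction_error(x, mean, U):
    """Squared distance between x and its projection onto the subspace."""
    r = x - mean[:, 0]
    return float(np.sum((r - U @ (U.T @ r)) ** 2))

def classify(x, models):
    """Method A: pick the label whose subspace reconstructs x best."""
    return min(models, key=lambda label: reconstruction_error(x, *models[label]))

# Toy data: each "person" has images lying in their own 2-D subspace.
rng = np.random.default_rng(0)
N = 50
bases = {name: rng.normal(size=(N, 2)) for name in ("person_a", "person_b")}
train = {name: B @ rng.normal(size=(2, 30)) for name, B in bases.items()}
models = {name: fit_subspace(X, k=2) for name, X in train.items()}

test_image = bases["person_b"] @ np.array([1.0, -2.0])
predicted = classify(test_image, models)
```

Method B would instead fit a single subspace to all images and compare the projection weights of the test image against those of the training images.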

27 Computational Complexity
Suppose m instances, each of size N.
- Eigenfaces: m = 500 faces, each of size N = 64K
Given the N × N covariance matrix $\Sigma$, we can compute
- all N eigenvectors/eigenvalues in $O(N^3)$
- the first k eigenvectors/eigenvalues in $O(k N^2)$
But if N = 64K, EXPENSIVE!
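
To put numbers on "EXPENSIVE", the following back-of-the-envelope calculation uses the slide's values of N and m. The 8 bytes per entry is an assumption (double-precision storage), and the m × m matrix is included only for comparison.

```python
N = 256 * 256          # pixels per face, the "64K" on the slide
m = 500                # number of faces
bytes_per_value = 8    # assumption: float64 storage

cov_bytes = N * N * bytes_per_value     # N x N covariance matrix
small_bytes = m * m * bytes_per_value   # an m x m matrix, for comparison

print(N)                      # 65536
print(cov_bytes / 2**30)      # 32.0 (GiB)
print(small_bytes / 1e6)      # 2.0 (MB)
print(N**3 / m**3)            # ratio of the cubic eigendecomposition costs
```

Storing the covariance matrix alone already takes 32 GiB, before any eigendecomposition is attempted.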

28 A Clever Workaround
Note that m << 64K.
Use $L = X^T X$ instead of $\Sigma = X X^T$.
If v is an eigenvector of L, then Xv is an eigenvector of $\Sigma$.
Proof:
$$L v = \lambda v$$
$$X^T X v = \lambda v$$
$$X (X^T X v) = X (\lambda v) = \lambda X v$$
$$(X X^T) X v = \lambda (X v)$$
$$\Sigma (X v) = \lambda (X v)$$
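
The claim is easy to verify numerically. A minimal sketch, with small stand-in sizes (N = 2000, m = 20) chosen so that the large matrix can still be formed for the check; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N, m = 2000, 20                       # stand-ins for 64K pixels and 500 faces
X = rng.normal(size=(N, m))
X -= X.mean(axis=1, keepdims=True)    # centered data matrix, one face per column

L = X.T @ X                           # m x m instead of N x N
lam, V = np.linalg.eigh(L)            # eigenvalues in ascending order
lam, V = lam[::-1], V[:, ::-1]        # largest first

k = 5
U = X @ V[:, :k]                      # X v is an eigenvector of X X^T ...
U /= np.linalg.norm(U, axis=0)        # ... but must be rescaled to unit norm

Sigma = X @ X.T                       # formed here only to check the claim
residual = np.linalg.norm(Sigma @ U - U * lam[:k])
```

Note the rescaling step: Xv is an eigenvector of $X X^T$ with the same eigenvalue, but its norm is generally not 1 even when v has unit norm.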

29 Principal Components (Method B)

30 Principal Components (Method B)
... faster if train with
- only people w/out glasses
- same lighting conditions

31 Shortcomings
Requires carefully controlled data:
- All faces centered in frame
- Same size
- Some sensitivity to angle
Method is completely knowledge free:
- (sometimes this is good!)
- Doesn't know that faces are wrapped around 3D objects (heads)
- Makes no effort to preserve class distinctions

32 Happiness subspace (method A)

33 Disgust subspace (method A)

34 Facial Expression Recognition Movies

35 Facial Expression Recognition Movies

36 Facial Expression Recognition Movies

37 Image Compression

38 Original Image
Divide the original 372x492 image into patches:
- Each patch is an instance
- View each as a 144-D vector

39 L2 error and PCA dim
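
The patch-based compression and its error curve can be sketched as follows. Several things here are assumptions rather than details from the slides: the 12 x 12 patch size is inferred from 144 = 12 x 12, the image is a synthetic random array (so its error curve will not look like the one for a real photograph), and the names `image_to_patches` and `l2_error` are hypothetical.

```python
import numpy as np

def image_to_patches(img, p):
    """Cut an (H, W) image into non-overlapping p x p patches, one per row."""
    H, W = img.shape
    blocks = img.reshape(H // p, p, W // p, p).swapaxes(1, 2)
    return blocks.reshape(-1, p * p)

rng = np.random.default_rng(0)
img = rng.random((240, 240))               # synthetic stand-in for the real image
P = image_to_patches(img, p=12)            # (400, 144): each row is one 144-D instance

mean = P.mean(axis=0)
_, _, Vt = np.linalg.svd(P - mean, full_matrices=False)   # rows of Vt: principal directions

def l2_error(k):
    """Relative L2 error after keeping only the top-k principal directions."""
    B = Vt[:k]                                   # (k, 144) basis
    P_hat = mean + (P - mean) @ B.T @ B          # compress to k numbers, then reconstruct
    return np.linalg.norm(P - P_hat) / np.linalg.norm(P - mean)

errors = {k: l2_error(k) for k in (1, 3, 6, 16, 60, 144)}
```

The error can only decrease as more components are kept, and it vanishes when all 144 are used.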

40 PCA compression: 144D => 60D

41 PCA compression: 144D => 16D

42 16 most important eigenvectors

43 PCA compression: 144D => 6D

44 6 most important eigenvectors

45 PCA compression: 144D => 3D

46 3 most important eigenvectors

47 PCA compression: 144D => 1D

48 60 most important eigenvectors
Looks like the discrete cosine bases of JPG!

49 2D Discrete Cosine Basis

50 Noise Filtering

51 Noise Filtering
Figure: the input x is mapped through the PCA basis U to the reconstruction x'.

52 Noisy image

53 Denoised image using 15 PCA components
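
Denoising by projection can be tried on synthetic data. This is a sketch under stated assumptions: the signal is made to lie in a hidden 3-dimensional subspace of a 50-dimensional space, Gaussian noise is added, and k = 3 components are kept (the slide's example keeps 15 on a real image). All names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, k = 50, 400, 3
basis = np.linalg.qr(rng.normal(size=(d, k)))[0]          # hidden k-dim signal subspace
clean = (basis @ rng.normal(scale=5.0, size=(k, m))).T     # (m, d) clean samples
noisy = clean + rng.normal(scale=0.5, size=clean.shape)    # add isotropic noise

mean = noisy.mean(axis=0)
_, _, Vt = np.linalg.svd(noisy - mean, full_matrices=False)
U = Vt[:k].T                                               # (d, k) top-k PCA basis
denoised = mean + (noisy - mean) @ U @ U.T                 # project, then reconstruct

err_noisy = np.linalg.norm(noisy - clean)
err_denoised = np.linalg.norm(denoised - clean)
```

The projection keeps the directions where the signal dominates and discards the remaining ones, which carry mostly noise.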

54 PCA Shortcomings

55 Problematic Data Set for PCA
PCA doesn't know labels!

56 PCA vs. Fisher Linear Discriminant
- Principal Component Analysis: higher variance, bad for discriminability
- Fisher Linear Discriminant: smaller variance, good discriminability
(slide by Javier Hernandez Rivera)
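
The contrast can be reproduced on a toy two-class data set. In this sketch the data, the variable names and the `separation` score are assumptions for illustration; the Fisher direction is computed with the standard two-class formula, $w \propto S_w^{-1} (\mu_A - \mu_B)$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
# Two classes: wide spread along axis 0, separated only along axis 1.
A = rng.normal(size=(n, 2)) * np.array([5.0, 0.2]) + np.array([0.0, 1.0])
B = rng.normal(size=(n, 2)) * np.array([5.0, 0.2]) + np.array([0.0, -1.0])
X = np.vstack([A, B])

# PCA: top eigenvector of the total covariance (ignores the labels).
evals, evecs = np.linalg.eigh(np.cov(X, rowvar=False))
w_pca = evecs[:, -1]

# Fisher: w proportional to S_w^{-1} (mu_A - mu_B), S_w = within-class scatter.
S_w = np.cov(A, rowvar=False) + np.cov(B, rowvar=False)
w_fld = np.linalg.solve(S_w, A.mean(axis=0) - B.mean(axis=0))
w_fld /= np.linalg.norm(w_fld)

def separation(w):
    """Gap between projected class means, in units of projected spread."""
    a, b = A @ w, B @ w
    return abs(a.mean() - b.mean()) / np.sqrt(a.var() + b.var())
```

On this data PCA picks the high-variance axis, along which the classes overlap completely, while the Fisher direction picks the low-variance axis that separates them.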

57 Problematic Data Set for PCA
PCA cannot capture NON-LINEAR structure!

58 PCA Conclusions
PCA:
- Finds orthonormal basis for data
- Sorts dimensions in order of importance
- Discards low significance dimensions
Uses:
- Get compact description
- Ignore noise
- Improve classification (hopefully)
Not magic:
- Doesn't know class labels
- Can only capture linear variations
One of many tricks to reduce dimensionality!
