Advanced Introduction to Machine Learning CMU-10715

Size: px

Start display at page:

Download "Advanced Introduction to Machine Learning CMU-10715"

Melvyn Peters
5 years ago
Views:

1 Advanced Introduction to Machine Learning CMU Principal Component Analysis Barnabás Póczos

2 Contents Motivation PCA algorithms Applications Some of these slides are taken from Karl Booksh Research group Tom Mitchell Ron Parr 2

3 Motivation 3

4 Data Visualization Data Compression Noise Reduction PCA Applications 4

5 Data Visualization Example: Given 53 blood and urine samples (features) from 65 people. How can we visualize the measurements? 5

6 Matrix format (65x53) Data Visualization Instances H-WBC H-RBC H-Hgb H-Hct H-MCV H-MCH H-MCHC A A A A A A A A A Features Difficult to see the correlations between the features... 6

7 Data Visualization Spectral format (65 curves, one for each person) Value m easurem ent M easu rem en t Difficult to compare the different patients... 7

8 Data Visualization Spectral format (53 pictures, one for each feature) H-Bands P e r s o n Difficult to see the correlations between the features... 8

9 Data Visualization Bi-variate Tri-variate C-LDH C-Triglycerides M-EPI C-LDH C-Triglycerides How can we visualize the other variables??? difficult to see in 4 or higher dimensional spaces... 9

10 Data Visualization Is there a representation better than the coordinate axes? Is it really necessary to show all the 53 dimensions? what if there are strong correlations between the features? How could we find the smallest subspace of the 53-D space that keeps the most information about the original data? A solution: Principal Component Analysis 10

11 PCA Algorithms 11

12 Principal Component Analysis PCA: Orthogonal projection of the data onto a lower-dimension linear space that... maximizes variance of projected data (purple line) minimizes the mean squared distance between data point and projections (sum of blue lines) 12

13 Principal Component Analysis Idea: Given data points in a d-dimensional space, project them into a lower dimensional space while preserving as much information as possible. Find best planar approximation of 3D data Find best 12-D approximation of D data In particular, choose projection that minimizes squared error in reconstructing the original data. 13

14 Principal Component Analysis Properties: PCA Vectors originate from the center of mass. Principal component #1: points in the direction of the largest variance. Each subsequent principal component is orthogonal to the previous ones, and points in the directions of the largest variance of the residual subspace 14

15 2D Gaussian dataset 15

16 1 st PCA axis 16

17 2 nd PCA axis 17

18 Given the centered data {x 1,, x m }, compute the principal vectors: 1 2 = arg max m T w {( w xi) } 1 1 w = 1 m st PCA vector w PCA algorithm I (sequential) 1 i= 1 To find w 1, maximize the variance of projection of x m T T 2 2 = arg max {[ w ( xi w1w1 xi )] } w = 1 m i= 1 To find w 2, we maximize the variance of the projection in the residual subspace w w 2 w x PCA reconstruction x-x x 2 nd PCA vector x =w 1 (w 1T x) w 1 18

19 PCA algorithm I (sequential) Given w 1,, w k-1, we calculate w k principal vector as before: Maximize the variance of projection of x w k = arg max w = 1 1 m m i= 1 {[ w T ( x i k 1 j= 1 w j w T j x i )] 2 } k th PCA vector x PCA reconstruction We maximize the variance of the projection in the residual subspace w 2 (w 2T x) w x w 1 (w 1T x) w 1 w 2 x =w 1 (w 1T x)+w 2 (w 2T x) 19

20 PCA algorithm II (sample covariance matrix) Given data {x 1,, x m }, compute covariance matrix Σ Σ = m 1 ( x x)( x x) i m 1 i= T where x = 1 m m i= 1 x i PCA basis vectors = the eigenvectors of Σ Larger eigenvalue more important eigenvectors 20

21 PCA algorithm(x, k): top k eigenvalues/eigenvectors % X = N m data matrix, % each data point x i = column vector, i=1..m x m 1 x i = m i = 1 PCA algorithm II (sample covariance matrix) X subtract mean x from each column vector x i in X Σ XX T covariance matrix of X { λ i, u i } i=1..n = eigenvectors/eigenvalues of Σ... λ 1 λ 2 λ N Return { λ i, u i } i=1.. k % top k PCA components 21

22 Animation Power iteration 1: v=sigma*v; Power iteration 2: λ= v v=v/sqrt(v'*v); PCA1 T*Sigma*v PCA1 Sigma v 2 =Sigma- λ*v PCA1 *v PCA1T ; PCA1 v=sigma 2 *v; v=v/sqrt(v'*v); v PCA2 22

23 23

24 PCA algorithm III (SVD of the data matrix) Singular Value Decomposition of the centered data matrix X. X features samples = USV T X = U S V T sig. significant significant noise noise noise samples 24

25 PCA algorithm III Columns of U the principal vectors, { u (1),, u (k) } orthogonal and has unit norm so U T U = I Can reconstruct the data using linear combinations of { u (1),, u (k) } Matrix S Diagonal Shows importance of each eigenvector Columns of V T The coefficients for reconstructing the samples 25

26 Applications 26

27 Face Recognition Want to identify specific person, based on facial image Robust to glasses, lighting, Can t just use the given 256 x 256 pixels 27

28 Applying PCA: Eigenfaces Method A: Build a PCA subspace for each person and check which subspace can reconstruct the test image the best Method B: Build one PCA database for the whole dataset and then classify based on the weights. 28

29 Applying PCA: Eigenfaces Example data set: Images of faces Eigenface approach [Turk & Pentland], [Sirovich & Kirby] Each face x is values (luminance at location) x 1,, x m x in R (view as 64K dim vector) Form X = [ x 1,, x m ] centered data mtx Compute Σ= XX T X = 256 x 256 real values Problem: Σ is 64K 64K HUGE!!! m faces 29

30 Computational Complexity Suppose m instances, each of size N Eigenfaces: m=500 faces, each of size N=64K Given N N covariance matrix Σ, can compute all N eigenvectors/eigenvalues in O(N 3 ) first k eigenvectors/eigenvalues in O(k N 2 ) But if N=64K, EXPENSIVE! 30

31 A Clever Workaround Note that m<<64k Use L=X T X instead of Σ=XX T If v is eigenvector of L then Xv is eigenvector of Σ Proof: L v = γ v X T X v = γ v X (X T X v) = X(γ v) = γ Xv (XX T )X v = γ (Xv) Σ (Xv) = γ (Xv) X = x 1,, x m m faces 256 x 256 real values 31

32 Principle Components (Method B) 32

33 Shortcomings Requires carefully controlled data: All faces centered in frame Same size Some sensitivity to angle Method is completely knowledge free (sometimes this is good!) Doesn t know that faces are wrapped around 3D objects (heads) Makes no effort to preserve class distinctions 33

34 Happiness subspace 34

35 Disgust subspace 35

36 Facial Expression Recognition Movies 36

37 Facial Expression Recognition Movies 37

38 Facial Expression Recognition Movies 38

39 Image Compression

40 Original Image Divide the original 372x492 image into patches: Each patch is an instance that contains 12x12 pixels on a grid Consider each as a 144-D vector 40

41 L 2 error and PCA dim 41

42 PCA compression: 144D 60D 42

43 PCA compression: 144D 16D 43

44 16 most important eigenvectors

45 PCA compression: 144D 6D 45

46 6 most important eigenvectors

47 PCA compression: 144D 3D 47

3 most important eigenvectors 2 2 4 4 6 6 8 8 10 10 12 2

48 3 most important eigenvectors

49 PCA compression: 144D 1D 49

50 60 most important eigenvectors Looks like the discrete cosine bases of JPG!... 50

51 2D Discrete Cosine Basis 51

52 Noise Filtering

53 Noise Filtering x x U x 53

54 Noisy image 54

55 Denoised image using 15 PCA components 55

56 PCA Shortcomings

57 Problematic Data Set for PCA PCA doesn t know about class labels! 57

58 PCA vs Fisher Linear Discriminant PCA maximizes variance, independent of class magenta FLD attempts to separate classes green line 58

59 Problematic Data Set for PCA PCA cannot capture NON-LINEAR structure! 59

60 PCA Conclusions PCA finds orthonormal basis for data Sorts dimensions in order of importance Discard low significance dimensions Applications: Get compact description Remove noise Improve classification (hopefully) Not magic: Doesn t know class labels Can only capture linear variations One of many tricks to reduce dimensionality! 60

61 Kernel PCA 61

62 Kernel PCA Performing PCA in the feature space Lemma Proof: 62

63 Kernel PCA Lemma 63

64 Kernel PCA Proof 64

65 Kernel PCA How to use α to calculate the projection of a new sample t? Where was I cheating? The data should be centered in the feature space, too! But this is manageable... 65

66 Input points before kernel PCA 66

67 Output after kernel PCA The three groups are distinguishable using the first component only 67

68 PCA Theory 68

69 Justification of Algorithm II GOAL: 69

70 Justification of Algorithm II x is centered! 70

71 Justification of Algorithm II GOAL: Use Lagrange-multipliers for the constraints. 71

72 Justification of Algorithm II 72

73 Thanks for the Attention! 73

Lecture 24: Principal Component Analysis. Aykut Erdem May 2016 Hacettepe University

Lecture 24: Principal Component Analysis. Aykut Erdem May 2016 Hacettepe University Lecture 4: Principal Component Analysis Aykut Erdem May 016 Hacettepe University This week Motivation PCA algorithms Applications PCA shortcomings Autoencoders Kernel PCA PCA Applications Data Visualization