Principal Component Analysis

CSci 5525: Machine Learning, Dec 3, 2008

The Main Idea

- Given a dataset $X = \{x_1, \ldots, x_N\}$
- Find a low-dimensional linear projection
- Two possible formulations:
  - The variance in the low-dimensional space is maximized
  - The average projection cost is minimized
- Both formulations are equivalent

Two viewpoints

[Figure: data points $x_n$ in the $(x_1, x_2)$ plane and their projections onto the direction $u_1$, illustrating the maximum-variance and minimum-error views]

Maximum Variance Formulation

- Consider $X = \{x_1, \ldots, x_N\}$
- With $x_i \in \mathbb{R}^d$, the goal is to find a projection in $\mathbb{R}^m$, $m < d$
- Consider $m = 1$: we need a projection vector $u_1 \in \mathbb{R}^d$
- Each data point $x_n$ gets projected to $u_1^T x_n$
- Mean of the projected data: $u_1^T \bar{x}$, where $\bar{x} = \frac{1}{N} \sum_{n=1}^{N} x_n$
- Variance of the projected data:
  $$\frac{1}{N} \sum_{n=1}^{N} (u_1^T x_n - u_1^T \bar{x})^2 = u_1^T S u_1, \qquad S = \frac{1}{N} \sum_{n=1}^{N} (x_n - \bar{x})(x_n - \bar{x})^T$$
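
To make the variance identity concrete, here is a minimal NumPy sketch (my own illustration, not from the slides): for a random unit vector $u_1$, the variance of the projections $u_1^T x_n$ matches $u_1^T S u_1$.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 500, 5
X = rng.normal(size=(N, d)) @ rng.normal(size=(d, d))   # correlated toy data

x_bar = X.mean(axis=0)
S = (X - x_bar).T @ (X - x_bar) / N          # sample covariance, 1/N convention

u1 = rng.normal(size=d)
u1 /= np.linalg.norm(u1)                     # a random unit projection vector

proj = X @ u1                                # projections u1^T x_n
var_proj = ((proj - proj.mean()) ** 2).mean()
print(np.isclose(var_proj, u1 @ S @ u1))     # True: variance equals u1^T S u1
```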

Maximum Variance Formulation (Contd.)

- Maximize $u_1^T S u_1$ w.r.t. $u_1$
- Need a constraint to prevent $\|u_1\|$ from growing without bound
- Normalization constraint: $\|u_1\|^2 = 1$
- The Lagrangian for the problem:
  $$u_1^T S u_1 + \lambda_1 (1 - u_1^T u_1)$$
- First-order necessary condition:
  $$S u_1 = \lambda_1 u_1$$
- $u_1$ must be the eigenvector of $S$ with the largest eigenvalue, since the objective value is $u_1^T S u_1 = \lambda_1$
- The eigenvector $u_1$ is called a principal component

Maximum Variance Formulation (Contd.)

- Subsequent principal components must be orthogonal to $u_1$
- Maximize $u_2^T S u_2$ s.t. $\|u_2\|^2 = 1$ and $u_2 \perp u_1$
- The solution turns out to be the second eigenvector, and so on
- The top-$m$ eigenvectors give the best $m$-dimensional projection
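
The whole procedure therefore reduces to an eigendecomposition of the sample covariance matrix. Below is a short NumPy sketch (an illustration, not part of the original slides; the function name `pca` and its return values are my own choices):

```python
import numpy as np

def pca(X, m):
    """Project the rows of X (N x d) onto the top-m principal components."""
    x_bar = X.mean(axis=0)
    Xc = X - x_bar                              # center the data
    S = Xc.T @ Xc / X.shape[0]                  # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)        # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:m]       # indices of the m largest
    U = eigvecs[:, order]                       # d x m matrix of eigenvectors
    Z = Xc @ U                                  # N x m projected coordinates
    return Z, U, eigvals[order], x_bar

# Usage: project 4-dimensional data onto its top 2 principal components
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))
Z, U, top_eigvals, x_bar = pca(X, m=2)
print(Z.shape, U.shape)   # (200, 2) (4, 2)
```

Using np.linalg.eigh exploits the symmetry of $S$; for very high-dimensional data one would typically work with an SVD of the centered data matrix instead.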

Minimum Error Formulation

- Consider a complete orthonormal basis $\{u_i\}$ in $\mathbb{R}^d$
- Each data point can be written as $x_n = \sum_{i=1}^{d} \alpha_{ni} u_i$
- Note that $\alpha_{ni} = x_n^T u_i$, so that
  $$x_n = \sum_{i=1}^{d} (x_n^T u_i) u_i$$
- Our goal is to obtain a lower-dimensional subspace, $m < d$
- A generic representation of a low-dimensional approximation:
  $$\tilde{x}_n = \sum_{i=1}^{m} z_{ni} u_i + \sum_{i=m+1}^{d} b_i u_i$$
- The coefficients $z_{ni}$ depend on the data point $x_n$
- We are free to choose $z_{ni}$, $b_i$, and $u_i$ to get $\tilde{x}_n$ close to $x_n$

Minimum Error Formulation (Contd.)

- The objective is to minimize
  $$J = \frac{1}{N} \sum_{n=1}^{N} \|x_n - \tilde{x}_n\|^2$$
- Taking the derivative w.r.t. $z_{ni}$ we get $z_{nj} = x_n^T u_j$, $j = 1, \ldots, m$
- Taking the derivative w.r.t. $b_j$ we get $b_j = \bar{x}^T u_j$, $j = m+1, \ldots, d$
- Then we have
  $$x_n - \tilde{x}_n = \sum_{i=m+1}^{d} \left\{ (x_n - \bar{x})^T u_i \right\} u_i$$
- The residual lies in the space orthogonal to the principal subspace
- The distortion measure to be minimized:
  $$J = \frac{1}{N} \sum_{n=1}^{N} \sum_{i=m+1}^{d} (x_n^T u_i - \bar{x}^T u_i)^2 = \sum_{i=m+1}^{d} u_i^T S u_i$$
- Need orthonormality constraints on the $u_i$ to prevent the $u_i = 0$ solution
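
As a sanity check on these formulas, the following sketch (my own, assuming NumPy and an arbitrary orthonormal basis) plugs in the optimal $z_{nj}$ and $b_j$ and verifies that the residual $x_n - \tilde{x}_n$ has no component in the principal subspace and that $J$ equals $\sum_{i=m+1}^{d} u_i^T S u_i$:

```python
import numpy as np

rng = np.random.default_rng(2)
N, d, m = 300, 6, 2
X = rng.normal(size=(N, d)) @ rng.normal(size=(d, d))
x_bar = X.mean(axis=0)

# Any orthonormal basis of R^d (here a random one, not yet the eigenvectors of S)
U, _ = np.linalg.qr(rng.normal(size=(d, d)))

Z = X @ U[:, :m]                     # optimal z_nj = x_n^T u_j,   j = 1..m
b = x_bar @ U[:, m:]                 # optimal b_j = xbar^T u_j,   j = m+1..d
X_tilde = Z @ U[:, :m].T + b @ U[:, m:].T

residual = X - X_tilde
# Residual has no component along u_1..u_m (orthogonal to the principal subspace)
print(np.allclose(residual @ U[:, :m], 0.0))     # True

# Distortion J equals the sum over discarded directions of u_i^T S u_i
S = (X - x_bar).T @ (X - x_bar) / N
J = (residual ** 2).sum(axis=1).mean()
print(np.isclose(J, sum(U[:, i] @ S @ U[:, i] for i in range(m, d))))  # True
```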

Minimum Error Formulation (Contd.)

- Consider the special case $d = 2$, $m = 1$
- The Lagrangian of the objective:
  $$L = u_2^T S u_2 + \lambda_2 (1 - u_2^T u_2)$$
- The first-order condition is $S u_2 = \lambda_2 u_2$
- In general, the condition is $S u_i = \lambda_i u_i$
- The minimum is attained by the eigenvectors corresponding to the smallest $(d - m)$ eigenvalues
- So the principal subspace $u_i$, $i = 1, \ldots, m$, is spanned by the eigenvectors with the largest eigenvalues
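
A quick numerical check (again a sketch of my own, not from the slides): with the eigenvector basis, the distortion $J$ equals the sum of the $(d - m)$ smallest eigenvalues of $S$, and it is no larger than the distortion obtained from a random orthonormal basis.

```python
import numpy as np

rng = np.random.default_rng(3)
N, d, m = 300, 6, 2
X = rng.normal(size=(N, d)) @ rng.normal(size=(d, d))
Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / N

def distortion(U_basis):
    """J for the optimal z, b when the first m columns of U_basis span the subspace."""
    return sum(U_basis[:, i] @ S @ U_basis[:, i] for i in range(m, d))

eigvals, eigvecs = np.linalg.eigh(S)             # ascending eigenvalues
U_pca = eigvecs[:, ::-1]                          # largest eigenvalues first
U_rand, _ = np.linalg.qr(rng.normal(size=(d, d))) # a random orthonormal basis

print(np.isclose(distortion(U_pca), eigvals[:d - m].sum()))  # True: sum of smallest eigenvalues
print(distortion(U_pca) <= distortion(U_rand))                # True: the PCA basis is no worse
```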

Kernel PCA

- In PCA, the principal components $u_i$ are given by $S u_i = \lambda_i u_i$, where
  $$S = \frac{1}{N} \sum_{n=1}^{N} x_n x_n^T$$
- Consider a feature mapping $\phi(x)$
- We want to implicitly perform PCA in the feature space
- Assume the features have zero mean: $\sum_n \phi(x_n) = 0$

Kernel PCA (Contd.)

- The sample covariance matrix in the feature space:
  $$C = \frac{1}{N} \sum_{n=1}^{N} \phi(x_n) \phi(x_n)^T$$
- The eigenvectors are given by $C v_i = \lambda_i v_i$
- We want to avoid computing $C$ explicitly
- Note that the eigenvectors satisfy
  $$\frac{1}{N} \sum_{n=1}^{N} \phi(x_n) \left\{ \phi(x_n)^T v_i \right\} = \lambda_i v_i$$
- Since the inner product is a scalar, we have
  $$v_i = \sum_{n=1}^{N} a_{in} \phi(x_n)$$

Kernel PCA (Contd.)

- Substituting back into the eigenvalue equation:
  $$\frac{1}{N} \sum_{n=1}^{N} \phi(x_n) \phi(x_n)^T \sum_{m=1}^{N} a_{im} \phi(x_m) = \lambda_i \sum_{n=1}^{N} a_{in} \phi(x_n)$$
- Multiplying both sides by $\phi(x_l)^T$ and using $K(x_n, x_m) = \phi(x_n)^T \phi(x_m)$, we have
  $$\frac{1}{N} \sum_{n=1}^{N} K(x_l, x_n) \sum_{m=1}^{N} a_{im} K(x_n, x_m) = \lambda_i \sum_{n=1}^{N} a_{in} K(x_l, x_n)$$
- In matrix notation, we have $K^2 a_i = \lambda_i N K a_i$
- Except for eigenvectors with zero eigenvalues, we can instead solve
  $$K a_i = \lambda_i N a_i$$

Kernel PCA (Contd.)

- Since the original $v_i$ are normalized, we have
  $$1 = v_i^T v_i = a_i^T K a_i = \lambda_i N a_i^T a_i$$
- This gives a normalization condition for the $a_i$
- Compute the $a_i$ by solving the eigenvalue decomposition of $K$
- The projection of a point $x$ onto the $i$-th component is
  $$y_i(x) = \phi(x)^T v_i = \sum_{n=1}^{N} a_{in} \phi(x)^T \phi(x_n) = \sum_{n=1}^{N} a_{in} K(x, x_n)$$
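
Putting the steps together, here is a minimal kernel PCA sketch in NumPy for the training points, assuming zero-mean features as on the earlier slide (centering is handled on a later slide); the helper names `rbf_kernel` and `kernel_pca` and the choice of an RBF kernel are my own:

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """K(a, b) = exp(-gamma * ||a - b||^2) for all pairs of rows of A and B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def kernel_pca(X, m, gamma=1.0):
    """Top-m kernel PCA projections of the training points.

    Assumes the features already have zero mean in feature space."""
    N = X.shape[0]
    K = rbf_kernel(X, X, gamma)
    eigvals, eigvecs = np.linalg.eigh(K)           # K a_i = (N * lambda_i) a_i
    order = np.argsort(eigvals)[::-1][:m]
    a = eigvecs[:, order]                           # columns are the a_i (unit norm)
    lam = eigvals[order] / N                        # lambda_i
    a = a / np.sqrt(N * lam)                        # enforce 1 = lambda_i N a_i^T a_i
    Y = K @ a                                       # y_i(x_l) = sum_n a_in K(x_l, x_n)
    return Y, a, lam

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 2))
Y, a, lam = kernel_pca(X, m=2, gamma=0.5)
print(Y.shape)    # (100, 2)
```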

Illustration of Kernel PCA (Feature Space)

[Figure: feature space with axes $\phi_1$ and $\phi_2$ and the principal eigenvector $v_1$]

Illustration of Kernel PCA (Data Space)

[Figure: the corresponding picture in the original data space, with axes $x_1$ and $x_2$]

Dimensionality of Projection

- Original $x_i \in \mathbb{R}^d$, feature $\phi(x_i) \in \mathbb{R}^D$
- Possibly $D \gg d$, so the number of principal components can be greater than $d$
- However, the number of nonzero eigenvalues cannot exceed $N$
- The covariance matrix $C$ has rank at most $N$, even if $D \gg d$
- Kernel PCA involves an eigenvalue decomposition of an $N \times N$ matrix

Kernel PCA: Non-zero Mean

- In general the features need not have zero mean
- Note that the features cannot be explicitly centered
- The centered features would be of the form
  $$\tilde{\phi}(x_n) = \phi(x_n) - \frac{1}{N} \sum_{l=1}^{N} \phi(x_l)$$
- The corresponding Gram matrix is
  $$\tilde{K} = K - \mathbf{1}_N K - K \mathbf{1}_N + \mathbf{1}_N K \mathbf{1}_N,$$
  where $\mathbf{1}_N$ denotes the $N \times N$ matrix with every entry equal to $1/N$
- Use $\tilde{K}$ in the basic kernel PCA formulation
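
A small sketch of this centering step (assuming NumPy and a precomputed kernel matrix; the helper name `center_gram` is mine):

```python
import numpy as np

def center_gram(K):
    """Return K_tilde = K - 1_N K - K 1_N + 1_N K 1_N, with 1_N the N x N matrix of 1/N."""
    N = K.shape[0]
    one_N = np.full((N, N), 1.0 / N)
    return K - one_N @ K - K @ one_N + one_N @ K @ one_N

# Usage: center a toy positive semidefinite Gram matrix before the eigenproblem
F = np.random.default_rng(5).normal(size=(50, 3))
K = F @ F.T
K_tilde = center_gram(K)
print(np.allclose(K_tilde.sum(axis=0), 0.0))   # True: rows and columns sum to zero
```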

61 Kernel PCA on Artificial Data

Kernel PCA Properties

- Computes an eigenvalue decomposition of an $N \times N$ matrix
- Standard PCA computes one for a $d \times d$ matrix
- For large datasets ($N \gg d$), kernel PCA is more expensive
- Standard PCA gives a projection onto a low-dimensional principal subspace:
  $$\hat{x}_n = \sum_{i=1}^{m} (x_n^T u_i) u_i$$
- Kernel PCA cannot do this:
  - $\phi(x)$ forms a $d$-dimensional manifold in $\mathbb{R}^D$
  - The PCA projection $\hat{\phi}$ of $\phi(x)$ need not lie in this manifold
  - So it may not have a pre-image $\hat{x}$ in the data space
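
As a sanity check on the relationship between the two algorithms, the sketch below (my own illustration, not from the slides) runs kernel PCA with a plain linear kernel on centered data and confirms that it reproduces the standard PCA projections up to the sign of each component:

```python
import numpy as np

rng = np.random.default_rng(6)
N, d, m = 80, 5, 2
X = rng.normal(size=(N, d)) @ rng.normal(size=(d, d))
Xc = X - X.mean(axis=0)

# Standard PCA: eigendecomposition of the d x d covariance matrix
S = Xc.T @ Xc / N
s_vals, s_vecs = np.linalg.eigh(S)
U = s_vecs[:, ::-1][:, :m]
Z_pca = Xc @ U                                   # N x m projections

# Kernel PCA with a linear kernel: eigendecomposition of the N x N Gram matrix
K = Xc @ Xc.T
k_vals, k_vecs = np.linalg.eigh(K)
a = k_vecs[:, ::-1][:, :m]
lam = k_vals[::-1][:m] / N                       # K a_i = N lambda_i a_i
a = a / np.sqrt(N * lam)                         # normalization 1 = lambda_i N a_i^T a_i
Z_kpca = K @ a                                   # projections y_i(x_n)

# The two sets of projections agree up to the sign of each component
print(np.allclose(np.abs(Z_pca), np.abs(Z_kpca)))   # True
```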
