Lecture 8. Principal Component Analysis. Luigi Freda. ALCOR Lab DIAG University of Rome La Sapienza. December 13, PDF Free Download

Lecture 8 Principal Component Analysis Luigi Freda ALCOR Lab DIAG University of Rome La Sapienza December 13, 2016 Luigi Freda ( La Sapienza University) Lecture 8 December 13, 2016 1 / 31

Outline 1 Eigen Decomposition Eigenvalues and Eigenvectors Some Properties 2 Singular Value Decomposition Singular Values Connection with Eigenvalues Singular Value Decomposition 3 Principal Component Analysis Discovering Latent Factors Luigi Freda ( La Sapienza University) Lecture 8 December 13, 2016 2 / 31

Eigenvalues given a matrix A C n n a nonzero vector v C n is said to be its (right) eigenvector if Av = λv for some scalar λ C; λ is called an eigenvalue of A the set of all eigenvalues of A is called its spectrum and it s denoted by σ(a) the Matlab command [ V,D]= e i g (A) produces a diagonal matrix D of eigenvalues and a matrix V such that AV = VD with D = diag(λ i ) if A is diagonalizable then V is full-rank (i.e. rank(v ) = n) and the columns of V correspond to n linearly independent eigenvectors, i.e. V = {v 1,..., v n}; in this case one has n A = VDV 1 = λ i v i v T i if Ã = TAT 1, where T is a nonsingular matrix, then Ã and A are called similar matrices; at the previous point, A and D are similar matrices Luigi Freda ( La Sapienza University) Lecture 8 December 13, 2016 4 / 31 i=1

Eigenvalues Real Matrix given a real matrix A R n n all its eigenvalues σ(a) are the roots of the characteristic polynomial equation det(λi A) = 0 given that A is a real matrix, if λ C is an eigenvalue then its conjugate λ C is also an eigenvalue, i.e., σ(a) = σ(a) det(λ I A) = 0 = det(λ I A) = det(λ I A) = det(λ I A) = 0 σ(a) = σ(a T ) since det(λi A) = det((λi A) T ) = det(λi A T ) if Ã = TAT 1, where T is a nonsingular matrix, then σ(ã) = σ(a) since det(λi Ã) = det(λtt 1 TAT 1 ) = det(t(λi A)T 1 ) = = det(t) det(λi A) det(t 1 ( ) = det(λi A) det I = det(t 1 ) det T = 1 ) Luigi Freda ( La Sapienza University) Lecture 8 December 13, 2016 6 / 31

Eigenvalues Real Matrix given a real matrix A R n n eigenvectors of A associated to different eigenvalues are linearly independent Av i = λ i v i, λ 1 λ 2, v 1 v 2 α 1v 1 + α 2v 2 = 0 = A(α 1v 1 + α 2v 2) = 0 = α 1λ 1v 1 + α 2λ 2v 2 = 0 since λ 1(α 1v 1 + α 2v 2) = 0 = α 2(λ 2 λ 1)v 2 = 0 = α 2 = 0 = α 1 = 0 if σ(a) = n, i.e. σ(a) = {λ 1, λ 2,..., λ n} with λ i λ j for i j, then A is diagonalizable Luigi Freda ( La Sapienza University) Lecture 8 December 13, 2016 7 / 31

Eigenvalues Real Matrix given a real matrix A R n n in general σ(a) = {λ 1,..., λ m} with m n and λ i λ j for i j, and one has det(λi A) = (λ λ 1) µa(λ 1) (λ λ 2) µa(λ 2)...(λ λ m) µa(λm) m i=1 µ a(λ i ) = n where µ a(λ i ) is the algebraic multiplicity of λ i in the general case one can always find the Jordan canonical form J λ i 1 1 A = VJV 1 J 2. J =. J.. i = λ.. i... 1 J p λ i Luigi Freda ( La Sapienza University) Lecture 8 December 13, 2016 8 / 31

Eigenvalues Real Matrix given a real matrix A R n n let E(λ i ) ker(a λ i I) = {v R n : (A λ i I)v = 0} be the eigenspace of λ i let µ g (λ i ) dim(e(λ i )) be the geometric multiplicity of λ i (µ g (λ i ) µ a(λ i )) if E(λ i ) = span{v i1,..., v iµi }, with µ i µ g (λ i ), then AV i = V i D i with V i = [v i1,..., v iµi ] R n µ i and D i = λ i I µi if µ g (λ i ) = µ a(λ i ) for each i {1,..., m} then and A is diagonalizable R n = E(λ 1)... E(λ m) n µ g (λ i ) = n i=1 Luigi Freda ( La Sapienza University) Lecture 8 December 13, 2016 9 / 31

Eigenvalues Real Symmetric Matrix given a real symmetric matrix S R nxn (i.e. S = S T ) all eigenvalues of S are real, i.e. σ(s) R eigenvectors v 1, v 2 of S associated to different eigenvalues are linearly independent and orthogonal Sv i = λ i v i, λ 1 λ 2, v 1 v 2, v T 2 (Sv 1 λ 1v 1) = 0 = (Sv 2) T v 1 λ 1v T 2 v 1 = 0 = v T 2 v 1(λ 2 λ 1) = 0 = v T 2 v 1 = 0 S is diagonalizable and one can find an orthonormal basis V such that V T = V 1 S = VDV T = n λ i v i v T i i=1 if S > 0 (S 0) then λ i > 0 (λ i 0) for i {1,..., m} N.B.: the above results can be applied for instance to covariance matrices Σ which are symmetric and positive definite Luigi Freda ( La Sapienza University) Lecture 8 December 13, 2016 10 / 31

Singular Value Decomposition idea: the singular values generalize the notion of eigenvalues to any kind of matrix given a matrix X R N D this can be decomposed as follows X N D = U S = V T N N N D D D r σ i u i v T i i=1 U R N N is orthonormal, U T U = UU T = I N V R D D is orthonormal, V T V = VV T = I D σ 1 σ 2... σ r 0 are the singular values, r = min(n, D) and S R N D σ 1... σ 1 S = if N > D or S =... if N D 0D N 0 N D σ D σ N Luigi Freda ( La Sapienza University) Lecture 8 December 13, 2016 12 / 31

Singular Value Decomposition Connection with Eigenvalues given a matrix X R N D one has X N D V T = U S N N N D D D U R N N is orthonormal, U T U = UU T = I N V R D D is orthonormal, V T V = VV T = I D = r i=1 σ iu i v T i where r = min(n, D) S R N D contains the singular values σ i on the main diagonal and 0s elsewhere we have X T X = (VS T U T )(USV T ) = VS T SV T = VDV T where D S T S is a diagonal matrix containing the squared singular values σ 2 i (and possibly some additional zeros) on the main diagonal and 0s elsewhere since (X T X)V = VD, then V contains the eigenvectors of X T X the eigenvalues of X T X are the squared singular values σ 2 i, contained in D the columns of V are called the right singular vectors of X Luigi Freda ( La Sapienza University) Lecture 8 December 13, 2016 14 / 31

Singular Value Decomposition Connection with Eigenvalues given a matrix X R N D one has X N D V T = U S N N N D D D U R N N is orthonormal, U T U = UU T = I N V R D D is orthonormal, V T V = VV T = I D = r i=1 σ iu i v T i where r = min(n, D) S R N D contains the singular values σ i on the main diagonal and 0s elsewhere we have XX T = (USV T )(VS T U T ) = USS T U T = U ˆDU T where ˆD SS T is a diagonal matrix containing the squared singular values σ 2 i (and possibly some additional zeros) on the main diagonal and 0s elsewhere since (XX T )U = U ˆD, then U contains the eigenvectors of XX T the eigenvalues of XX T are the squared singular values σ 2 i, contained in ˆD the columns of U are called the left singular vectors of X Luigi Freda ( La Sapienza University) Lecture 8 December 13, 2016 15 / 31

Singular Value Decomposition given a matrix X R N D with N > D this can be decomposed as follows X N D = U S = V T N N N D D D σ 1... S = 0 N D D σ i u i v T i the last N D columns of U are irrelevant, since they will be multiplied by zero i=1 σ D Luigi Freda ( La Sapienza University) Lecture 8 December 13, 2016 17 / 31

Singular Value Decomposition given a matrix X R N D with N > D this can be decomposed as X N D = U S V T N N N D D D = D i=1 σ iu i v T i since the last N D columns of U are irrelevant, we can directly neglect them and consider the economy sized SVD X N D = Û Ŝ = V T N D D D D D D σ i u i v T i where Û T Û = I D, V T V = VV T = I D and Ŝ = diag(σ 1,..., σ D ) i=1 Luigi Freda ( La Sapienza University) Lecture 8 December 13, 2016 18 / 31

Singular Value Decomposition given a matrix X R N D with N > D this can be decomposed as X = U S V T = D i=1 σ iu i v T i N D N N N D D D a rank L approximation of X considers only the first L < D columns of U and the first L singular values we can then consider the truncated SVD X L = U L N D N L S L L L V T L D L = L σ i u i v T i where U T L U L = I L, V T L V L = I D and S L = diag(σ 1,..., σ L ) i=1 Luigi Freda ( La Sapienza University) Lecture 8 December 13, 2016 19 / 31

Singular Value Decomposition given a matrix X R N D with N > D it is possible to show that X L = U L S L V T L which solve is the best rank L approximation matrix X L = argmin X ˆX F ˆX with the constraint rank(ˆx) = L where A F denotes the Frobenious norm ( m n A F = i=1 j=1 a 2 ij ) 1/2 = ( trace(a T A) ) 1/2 by replacing ˆX = X L one obtains [ SL X X L F = U(S 0 ] [ )V T SL F = S 0 ] F = r i=l+1 σ i Luigi Freda ( La Sapienza University) Lecture 8 December 13, 2016 20 / 31

Discovering Latent Factors An example dimensionality reduction: it is often useful to reduce the dimensionality by projecting the data to a lower dimensional subspace which captures the essence of the data for example in the above plot: the 2d approximation is quite good, most points lie close to this subspace projecting points onto the red line 1d approx is a rather poor approximation Luigi Freda ( La Sapienza University) Lecture 8 December 13, 2016 22 / 31

Discovering Latent Factors Motivations although data may appear high dimensional, there may only be a small number of degrees of variability, corresponding to latent factors low dimensional representations, when used as input to statistical models, often result in better predictive accuracy (focusing on the essence ) low dimensional representations are useful for enabling fast nearest neighbor searches two dimensional projections are very useful for visualizing high dimensional data Luigi Freda ( La Sapienza University) Lecture 8 December 13, 2016 23 / 31

Principal Component Analysis PCA the most common approach to dimensionality reduction is called Principal Components Analysis or PCA given x i R D, we compute an approximation ˆx i = Wz i with z i R L where L < D (dimensionality reduction) W R D L and W T W = I (orthogonal W) so as to minimize the reconstruction error J(W, Z) = 1 N x i ˆx i 2 = 1 N x i Wz i 2 N N i=1 i=1 since W T W = I one has x 2 = (x T x) 1/2 = (z T W T Wz) 1/2 = (z T z) 1/2 = z 2 in 3D or 2D W can represent part of a rotation matrix this can be thought of as an unsupervised version of (multi-output) linear regression, where we observe the high-dimensional response y = x, but not the low-dimensional cause z Luigi Freda ( La Sapienza University) Lecture 8 December 13, 2016 24 / 31

Principal Component Analysis PCA the objective function (reconstruction error) J(W, Z) = 1 N x i ˆx i 2 = 1 N x i Wz i 2 N N is equivalent to J(W, Z) = X T WZ T 2 F where A F denotes the Frobenious norm i=1 ( m n A F = i=1 j=1 a 2 ij i=1 ) 1/2 = ( trace(a T A) ) 1/2 Luigi Freda ( La Sapienza University) Lecture 8 December 13, 2016 25 / 31

Principal Component Analysis PCA we have to minimize the objective function J(W, Z) = 1 N x i ˆx i 2 = 1 N x i Wz i 2 N N i=1 or equivalently J(W, Z) = X T WZ T 2 F subject to W R D L and W T W = I it is possible to prove that the optimal solution is obtained by setting where X L = U L S L V T L Ŵ = V L i=1 Ẑ = XW (z i = ŴT x i = V T L x i ) is the rank L truncated SVD of X we have X T X = VS 2 V T hence (X T X)V L = V L S 2 L since V = [V L ] V L contains the L eigenvectors (principal directions) with the largest eigenvalues of the empirical covariance matrix Σ = 1 N XT X = 1 N N i=1 x ix T i z i = V T L x i is the orthogonal projection of x i on the eigenvectors in V L Luigi Freda ( La Sapienza University) Lecture 8 December 13, 2016 26 / 31

Principal Component Analysis PCA by replacing Z = XW in the objective function and recalling that W T W = I ( ) J(Z) = X T WZ T 2 F = trace (X T WZ T ) T (X T WZ T ) = = trace(xx T XWZ T ZW T X T + ZZ T ) = trace(x T X) trace(w T X T XW) = = trace(x T X) trace(z T Z) = trace(nσ X ) trace(nσ Z ) where we used the following facts: trace(ab) = trace(ba), trace(a T ) = trace(a) minimizing the reconstruction error is equivalent to maximizing the variance of the projected data z i = Wx i Luigi Freda ( La Sapienza University) Lecture 8 December 13, 2016 27 / 31

Principal Component Analysis PCA the principal directions are the ones along which the data shows maximal variance this means that PCA can be misled by directions in which the variance is high merely because of the measurement scale in the figure below, the vertical axis (weight) uses a large range than the horizontal axis (height), resulting in a line that looks somewhat unnatural it is therefore standard practice to standardize the data first where µ j = 1 N x ij (x ij µ j ) N i=1 x ij and σj 2 = 1 N N 1 i=1 (x ij µ j ) 2 σ j Luigi Freda ( La Sapienza University) Lecture 8 December 13, 2016 28 / 31

Principal Component Analysis PCA left: the mean and the first three PC basis vectors (eigendigits) based on 25 images of the digit 3 (from the MNIST dataset) right: reconstruction of an image based on 2, 10, 100 and all the basis vectors Luigi Freda ( La Sapienza University) Lecture 8 December 13, 2016 29 / 31

Principal Component Analysis PCA low rank approximations to an image top left: the original image is of size 200 320, so has rank 200 subsequent images have ranks 2, 5, and 20. Luigi Freda ( La Sapienza University) Lecture 8 December 13, 2016 30 / 31

Credits Kevin Murphy s book Luigi Freda ( La Sapienza University) Lecture 8 December 13, 2016 31 / 31

Lecture 8. Principal Component Analysis. Luigi Freda. ALCOR Lab DIAG University of Rome La Sapienza. December 13, 2016