FSAN/ELEG815: Statistical Learning


1 FSAN/ELEG815: Statistical Learning Gonzalo R. Arce Department of Electrical and Computer Engineering University of Delaware 3. Eigen Analysis, SVD and PCA

2 Outline of the Course 1. Review of Probability 2. Stationary processes 3. Eigen Analysis, Singular Value Decomposition (SVD) and Principal Component Analysis (PCA) 4. The Learning Problem 5. Training vs Testing 6. Estimation theory: Maximum likelihood and Bayes estimation 7. The Wiener Filter 8. Adaptive Optimization: Steepest descent and the LMS algorithm 9. Least Squares (LS) and Recursive Least Squares (RLS) algorithm 10. Overfitting and Regularization 11. Logistic, Ridge and Lasso regression. 12. Neural Networks 13. Matrix Completion 1/79

3 Eigen Analysis Eigen Analysis Objective: Utilize tools from linear algebra to characterize and analyze matrices, especially the correlation matrix. The correlation matrix plays a large role in statistical characterization and processing. Previous result: R is Hermitian. Further insight into the correlation matrix is achieved through eigen analysis: eigenvalues and eigenvectors, matrix diagonalization. Application: optimum filtering problems. 2/79

4 Eigen Analysis Objective: For a Hermitian matrix R, find a vector q satisfying Rq = λq. Interpretation: Linear transformation by R changes the scale, but not the direction, of q. Fact: An M × M matrix R has M eigenvalues and eigenvectors. To see this, note
Rq_i = λ_i q_i, i = 1, 2, ..., M  ⇒  (R - λI)q = 0
For this to hold for a nonzero q, the rows/columns of (R - λI) must be linearly dependent, i.e.,
det(R - λI) = 0
3/79

5 Eigen Analysis Note: det(R - λI) is an Mth-order polynomial in λ. The roots of the polynomial are the eigenvalues λ_1, λ_2, ..., λ_M. In Rq_i = λ_i q_i, each eigenvector q_i is associated with one eigenvalue λ_i. The eigenvectors are not unique:
Rq_i = λ_i q_i  ⇒  R(aq_i) = λ_i(aq_i)
Consequence: eigenvectors are generally normalized, e.g., ‖q_i‖ = 1 for i = 1, 2, ..., M. 4/79

6 Eigen Analysis Example (General two-dimensional case) Let M = 2 and
R = [R_{1,1} R_{1,2}; R_{2,1} R_{2,2}]
Determine the eigenvalues and eigenvectors. Thus
det(R - λI) = 0
|R_{1,1} - λ, R_{1,2}; R_{2,1}, R_{2,2} - λ| = 0
λ² - λ(R_{1,1} + R_{2,2}) + (R_{1,1}R_{2,2} - R_{1,2}R_{2,1}) = 0
λ_{1,2} = (1/2)[(R_{1,1} + R_{2,2}) ± sqrt(4R_{1,2}R_{2,1} + (R_{1,1} - R_{2,2})²)]
5/79
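
A quick numerical check of the closed-form 2 × 2 eigenvalue formula against a library routine; the matrix below is a hypothetical example chosen only for illustration (a minimal NumPy sketch):

    import numpy as np

    # Hypothetical 2x2 Hermitian (symmetric) matrix, for illustration only.
    R = np.array([[2.0, 0.5],
                  [0.5, 1.0]])

    # Closed-form roots of the characteristic polynomial from the slide.
    tr = R[0, 0] + R[1, 1]
    disc = np.sqrt(4 * R[0, 1] * R[1, 0] + (R[0, 0] - R[1, 1]) ** 2)
    lam_closed = np.sort([(tr - disc) / 2, (tr + disc) / 2])

    # Numerical eigenvalues for comparison.
    lam_np = np.sort(np.linalg.eigvalsh(R))
    print(lam_closed, lam_np)   # the two results should agree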

7 Eigen Analysis Back substitution yields the eigenvectors:
[R_{1,1} - λ, R_{1,2}; R_{2,1}, R_{2,2} - λ] [q_1; q_2] = [0; 0]
In general, this yields a set of linear equations. In the M = 2 case:
(R_{1,1} - λ)q_1 + R_{1,2} q_2 = 0
R_{2,1} q_1 + (R_{2,2} - λ)q_2 = 0
Solving the set of linear equations for a specific eigenvalue λ_i yields the corresponding eigenvector q_i. 6/79

8 Eigen Analysis Example (Two-dimensional white noise) Let R be the correlation matrix of a two-sample vector of zero-mean white noise,
R = [σ², 0; 0, σ²]
Determine the eigenvalues and eigenvectors. Carrying out the analysis yields eigenvalues
λ_{1,2} = (1/2)[(R_{1,1} + R_{2,2}) ± sqrt(4R_{1,2}R_{2,1} + (R_{1,1} - R_{2,2})²)]
        = (1/2)[(σ² + σ²) ± sqrt(0 + (σ² - σ²)²)] = σ²
and eigenvectors
q_1 = [1; 0] and q_2 = [0; 1]
Note: The eigenvectors are unit length (and orthogonal). 7/79

9 Eigen Properties Eigen Properties Property (eigenvalues of R^k) If λ_1, λ_2, ..., λ_M are the eigenvalues of R, then λ_1^k, λ_2^k, ..., λ_M^k are the eigenvalues of R^k. Proof: Note Rq_i = λ_i q_i. Pre-multiplying both sides by R, k - 1 times,
R^k q_i = λ_i R^{k-1} q_i = λ_i^k q_i
Property (linear independence of eigenvectors) The eigenvectors q_1, q_2, ..., q_M of R are linearly independent, i.e.,
Σ_{i=1}^{M} a_i q_i ≠ 0
for all sets of scalars a_1, a_2, ..., a_M that are not all zero. 8/79

10 Eigen Properties Property (Correlation matrix eigenvalues are real & nonnegative) The eigenvalues of R are real and nonnegative. Proof:
Rq_i = λ_i q_i
q_i^H R q_i = λ_i q_i^H q_i   [pre-multiply by q_i^H]
λ_i = (q_i^H R q_i)/(q_i^H q_i) ≥ 0
This follows from the facts that R is positive semi-definite and q_i^H q_i = ‖q_i‖² > 0. Note: In most cases, R is positive definite and λ_i > 0, i = 1, 2, ..., M. 9/79

11 Eigen Properties Property (Unique eigenvalues ⇒ orthogonal eigenvectors) If λ_1, λ_2, ..., λ_M are unique eigenvalues of R, then the corresponding eigenvectors q_1, q_2, ..., q_M are orthogonal. Proof:
Rq_i = λ_i q_i  ⇒  q_j^H R q_i = λ_i q_j^H q_i   (*)
Also, since λ_j is real and R is Hermitian,
Rq_j = λ_j q_j  ⇒  q_j^H R = λ_j q_j^H  ⇒  q_j^H R q_i = λ_j q_j^H q_i
Substituting the LHS from (*),
λ_i q_j^H q_i = λ_j q_j^H q_i
10/79

12 Eigen Properties Thus
λ_i q_j^H q_i = λ_j q_j^H q_i  ⇒  (λ_i - λ_j) q_j^H q_i = 0
Since λ_1, λ_2, ..., λ_M are unique,
q_j^H q_i = 0 for i ≠ j  ⇒  q_1, q_2, ..., q_M are orthogonal. QED 11/79

13 Eigen Properties Diagonalization of R Objective: Find a transformation that transforms the correlation matrix into a diagonal matrix. Let λ_1, λ_2, ..., λ_M be the unique eigenvalues of R and take q_1, q_2, ..., q_M to be the M orthonormal eigenvectors,
q_i^H q_j = 1 if i = j, 0 if i ≠ j
Define Q = [q_1, q_2, ..., q_M] and Ω = diag(λ_1, λ_2, ..., λ_M). Then consider
Q^H R Q = [q_1^H; q_2^H; ...; q_M^H] R [q_1, q_2, ..., q_M]
12/79

14 Eigen Properties
Q^H R Q = [q_1^H; q_2^H; ...; q_M^H] R [q_1, q_2, ..., q_M]
        = [q_1^H; q_2^H; ...; q_M^H] [λ_1 q_1, λ_2 q_2, ..., λ_M q_M]
        = diag(λ_1, λ_2, ..., λ_M)
⇒ Q^H R Q = Ω   (eigenvector diagonalization of R)
13/79

15 Eigen Properties Property (Q is unitary) Q is unitary, i.e., Q^{-1} = Q^H. Proof: Since the eigenvectors q_i are orthonormal,
Q^H Q = [q_1^H; q_2^H; ...; q_M^H] [q_1, q_2, ..., q_M] = I  ⇒  Q^{-1} = Q^H
Property (Eigen decomposition of R) The correlation matrix can be expressed as
R = Σ_{i=1}^{M} λ_i q_i q_i^H
14/79

16 Eigen Properties Proof: The correlation diagonalization result states Q^H R Q = Ω. Isolating R and expanding,
R = QΩQ^H = [q_1, q_2, ..., q_M] Ω [q_1^H; q_2^H; ...; q_M^H]
  = [q_1, q_2, ..., q_M] [λ_1 q_1^H; λ_2 q_2^H; ...; λ_M q_M^H]
  = Σ_{i=1}^{M} λ_i q_i q_i^H
Note: This also gives
R^{-1} = (Q^H)^{-1} Ω^{-1} Q^{-1} = QΩ^{-1}Q^H, where Ω^{-1} = diag(1/λ_1, 1/λ_2, ..., 1/λ_M)
15/79
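
A minimal NumPy sketch (using a small correlation-type matrix built from random data, purely for illustration) that checks the outer-product expansion R = Σ_i λ_i q_i q_i^H and the inverse formula R^{-1} = QΩ^{-1}Q^H:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal((4, 100))
    R = X @ X.T / 100                      # sample correlation-type matrix (symmetric, PSD)

    lam, Q = np.linalg.eigh(R)             # eigenvalues (ascending) and orthonormal eigenvector columns

    # Outer-product expansion: R = sum_i lam_i q_i q_i^H
    R_rebuilt = sum(lam[i] * np.outer(Q[:, i], Q[:, i]) for i in range(len(lam)))
    print(np.allclose(R, R_rebuilt))       # True

    # Inverse via the eigen decomposition: R^{-1} = Q Omega^{-1} Q^H
    R_inv = Q @ np.diag(1.0 / lam) @ Q.T
    print(np.allclose(R_inv, np.linalg.inv(R)))   # True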

17 Eigen Properties Aside (trace & determinant for matrix products) Note trace(A) = Σ_i A_{i,i}. Also, trace(AB) = trace(BA) and, similarly, det(AB) = det(A)det(B). Property (Determinant-Eigenvalue Relation) The determinant of the correlation matrix is related to the eigenvalues as follows:
det(R) = Π_{i=1}^{M} λ_i
Proof: Using R = QΩQ^H and the above,
det(R) = det(QΩQ^H) = det(Q)det(Q^H)det(Ω) = det(Ω) = Π_{i=1}^{M} λ_i
16/79

18 Eigen Properties Property (Trace-Eigenvalue Relation) The trace of the correlation matrix is related to the eigenvalues as follows:
trace(R) = Σ_{i=1}^{M} λ_i
Proof: Note
trace(R) = trace(QΩQ^H) = trace(Q^H QΩ) = trace(Ω) = Σ_{i=1}^{M} λ_i
QED 17/79
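
Both relations are easy to verify numerically; a short sketch with a randomly generated symmetric matrix (chosen only for illustration):

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.standard_normal((5, 200))
    R = X @ X.T / 200                      # illustrative correlation-type matrix

    lam = np.linalg.eigvalsh(R)
    print(np.isclose(np.linalg.det(R), np.prod(lam)))   # det(R) = product of eigenvalues
    print(np.isclose(np.trace(R), np.sum(lam)))         # trace(R) = sum of eigenvalues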

19 Eigen Properties Definition (Normal Matrix) A complex square matrix A is a normal matrix if
A^H A = AA^H
That is, a matrix is normal if it commutes with its conjugate transpose. Note: All Hermitian symmetric matrices are normal. Every matrix that can be diagonalized by a unitary transform is normal. Definition (Condition Number) The condition number reflects how numerically well conditioned a problem is, i.e., a low condition number ⇒ well conditioned; a high condition number ⇒ ill conditioned. 18/79

20 Eigen Properties Definition (Condition Number for Linear Systems) For a linear system Ax = b defined by a normal matrix A, the condition number is
χ(A) = λ_max / λ_min
where λ_max and λ_min are the maximum/minimum eigenvalues of A. Observations: Large eigenvalue spread ⇒ ill conditioned. Small eigenvalue spread ⇒ well conditioned. 19/79
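
As a sanity check, for a symmetric positive definite matrix the eigenvalue-spread definition coincides with the usual 2-norm condition number; a small sketch with a random illustrative matrix:

    import numpy as np

    rng = np.random.default_rng(2)
    X = rng.standard_normal((4, 50))
    R = X @ X.T / 50                       # symmetric (hence normal), positive definite example

    lam = np.linalg.eigvalsh(R)
    chi = lam.max() / lam.min()            # eigenvalue-spread condition number
    print(chi, np.linalg.cond(R))          # matches NumPy's 2-norm condition number here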

21 SVD Matrix-Vector Multiplication Example in 2D: for a 2 × 2 matrix A and a vector x,
y = Ax = [5; 2]
What is the geometrical meaning of the matrix-vector multiplication? 20/79

22 SVD Matrix-Vector Multiplication
y = Ax = [5; 2]
The multiplication rotates the vector by an angle θ and stretches it. 21/79

23 SVD Matrix-Vector Multiplication To rotate x by an angle θ, we pre-multiply by
A = [cos θ, -sin θ; sin θ, cos θ]
To stretch x by a factor α, we pre-multiply by
A = [α, 0; 0, α]
22/79
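
A brief sketch applying the two transformations above to a vector (the angle, stretch factor, and vector are arbitrary illustrative values):

    import numpy as np

    theta, alpha = np.pi / 6, 2.0          # illustrative angle and stretch factor
    x = np.array([3.0, 2.0])

    Rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])   # rotation by theta
    S = alpha * np.eye(2)                                # uniform stretch by alpha

    y = S @ Rot @ x                        # rotate, then stretch
    print(np.linalg.norm(y) / np.linalg.norm(x))         # = alpha: length scaled, direction rotated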

24 SVD Matrix-Vector Multiplication Consider the vectors v 1 and v 2 depicting a circle. What happens to the circle under matrix multiplication? 23/79

25 SVD Matrix-Vector Multiplication What happens to the 2D circle under matrix multiplication? Note: Orthogonality holds since they are all rotated by the same angle. 24/79

26 SVD Matrix-Vector Multiplication What happens to the n-d hyper-sphere under matrix multiplication? 25/79

27 SVD n-dim Hyper-Sphere Mapping to n-dim Hyper-Ellipsoid The mapping can be written as
Av_1 = σ_1 û_1, ..., Av_n = σ_n û_n
Expressed in matrix form as
A [v_1 v_2 ... v_n] = [û_1 û_2 ... û_n] diag(σ_1, ..., σ_n)
with A ∈ C^{m×n}, V = [v_1 v_2 ... v_n] ∈ C^{n×n}, Û = [û_1 û_2 ... û_n] ∈ C^{m×n} and ˆΣ = diag(σ_1, ..., σ_n) ∈ C^{n×n}, i.e.,
AV = ÛˆΣ
26/79

28 SVD n-dim Hyper-Sphere Mapping to n-dim Hyper-Ellipsoid Let v_1, ..., v_n be orthonormal vectors; then V = [v_1 v_2 ... v_n] is a unitary transformation matrix, that is, V^{-1} = V^H. Let û_1, ..., û_n be orthonormal vectors; then Û = [û_1 û_2 ... û_n] is a unitary transformation matrix, that is, Û^{-1} = Û^H. 27/79

29 SVD Reduced Singular Value Decomposition The mapping is thus given by
AV = ÛˆΣ
Multiplying both sides on the right by V^{-1}, we obtain:
AVV^{-1} = ÛˆΣV^{-1}
AVV^H = ÛˆΣV^H
AI = ÛˆΣV^H
A = ÛˆΣV^H
where ˆΣ = diag(σ_1, σ_2, ..., σ_n), such that σ_1 ≥ σ_2 ≥ ... ≥ σ_p ≥ 0. 28/79
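
A minimal NumPy sketch of the reduced SVD for an illustrative random matrix, checking the shapes, the ordering of the singular values, and the reconstruction A = ÛˆΣV^H:

    import numpy as np

    rng = np.random.default_rng(3)
    A = rng.standard_normal((6, 4))        # illustrative tall matrix, m = 6, n = 4

    # Reduced SVD: U_hat is m x n, s holds the n singular values, Vh is n x n.
    U_hat, s, Vh = np.linalg.svd(A, full_matrices=False)

    print(U_hat.shape, s.shape, Vh.shape)               # (6, 4) (4,) (4, 4)
    print(np.all(np.diff(s) <= 0))                      # singular values sorted in descending order
    print(np.allclose(A, U_hat @ np.diag(s) @ Vh))      # A = U_hat Sigma_hat V^H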

30 SVD Singular Value Decomposition Reduced SVD SVD 29/79

31 SVD Theorem 1 Every matrix A ∈ C^{m×n} has a singular value decomposition (SVD). The singular values σ_j are uniquely determined. If A is square and the σ_j are distinct, then the u_j and v_j are also unique up to a complex sign (i.e., unique if the complex sign is ignored). 30/79

32 SVD SVD calculation Start with A^H A:
A^H A = (UΣV^H)^H (UΣV^H) = VΣU^H UΣV^H = VΣ²V^H
A^H A V = VΣ²V^H V
A^H A V = VΣ²
This reduces to an eigenvalue decomposition problem of the form BV = VΛ, with B = A^H A and Λ = Σ², where Λ is a diagonal matrix with the eigenvalues of B and V corresponds to the eigenvectors of B. 31/79

33 SVD SVD calculation How do we calculate U:
AA^H = (UΣV^H)(UΣV^H)^H = UΣV^H VΣU^H = UΣ²U^H
AA^H U = UΣ²U^H U
AA^H U = UΣ²
Again an eigenvalue problem, BU = UΛ, with B = AA^H and Λ = Σ², where Λ is a diagonal matrix with the eigenvalues of B and U corresponds to the eigenvectors of B. 32/79
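
A sketch of this eigenvalue route for a real-valued example (so H reduces to T), compared against a library SVD; the matrix is random and purely illustrative:

    import numpy as np

    rng = np.random.default_rng(4)
    A = rng.standard_normal((5, 3))

    lam_v, V = np.linalg.eigh(A.T @ A)     # A^H A V = V Sigma^2 (eigenvalues ascending)
    lam_u, U = np.linalg.eigh(A @ A.T)     # A A^H U = U Sigma^2 (columns of U: left singular vectors)
    sigma = np.sqrt(lam_v[::-1])           # singular values = sqrt of eigenvalues, descending

    _, s, _ = np.linalg.svd(A)             # reference computation
    print(np.allclose(sigma, s))           # True: same singular values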

34 SVD Movie Rating - A Solution Describe a movie as an array of factors, e.g. comedy, action, ... Describe each viewer using the same factors, e.g. likes comedy, likes action, etc. The rating is based on the match/mismatch between the two. More factors ⇒ better prediction. 33/79

35 SVD Singular Value Decomposition Solution Viewers rated movies on a scale from 1 to 5, with 0 for movies that were not rated by the user. Each column j is a different movie. Each row i is a different viewer. Each element a_{i,j} represents the rating of movie j by viewer i.
A = [a_{1,1} ... a_{1,n}; ...; a_{m,1} ... a_{m,n}]
Goal: Use SVD to predict unobserved data or the rating of a movie that hasn't come out yet. 34/79

36 SVD Singular Value Decomposition Solution We want to classify Movies and Viewers: the movies are grouped into Category 1, Category 2, Category 3, ... Intuitively, if Movie 1 ≈ Movie 2, these movies are similar (same category). Categories are determined by the matrix A and the SVD algorithm. 35/79

37 SVD Singular Value Decomposition Solution Now, consider that each movie belongs to more than one category, e.g. half comedy and half action. This can be written as:
Movie_j = v_1 Cat_1 + v_2 Cat_2 + ... + v_n Cat_n, s.t. ‖v‖_2 = 1
where the set of categories {Cat_i} forms an orthonormal basis. 36/79

38 SVD Singular Value Decomposition Solution In the case of Viewers, we use the same Movie categories (Category 1, Category 2, Category 3, ...). E.g. a viewer that loves comedy is represented with the same unit vector as the comedy-category movies. Each Viewer is represented as:
Viewer_i = u_1 Cat_1 + u_2 Cat_2 + ... + u_n Cat_n, s.t. ‖u‖_2 = 1
37/79

39 SVD Singular Value Decomposition Solution From Theorem 1: There exists a unique decomposition into categories. Every matrix A ∈ C^{m×n} can be factorized as A = ÛΣV^H, where: 38/79

40 SVD Singular Value Decomposition Solution Each row vector u_i in Û represents the taste of Viewer i on the corresponding categories.
Û = [u_{1,1} ... u_{1,n}; ...; u_{m,1} ... u_{m,n}] = [u_1; u_2; ...; u_m]
39/79

41 SVD Singular Value Decomposition Solution Each column v_j in V^H represents the content of Movie j on the corresponding categories.
V^H = [v_{1,1} ... v_{1,n}; ...; v_{n,1} ... v_{n,n}] = [v_1 v_2 ... v_n]
40/79

42 SVD Singular Value Decomposition Solution Each singular value σ_{i,i} in Σ computes how a viewer of category i rates a movie of the same category i.
Σ = diag(σ_{1,1}, σ_{2,2}, ..., σ_{n,n})
41/79

43 SVD Singular Value Decomposition Solution We have more viewers than movies: New categories are created. The new vectors are still unit vectors orthonormal to all the basis vectors but the ratings of these useless categories are zero. 42/79

44 SVD Singular Value Decomposition Solution The representation of each movie and each viewer can be obtained by
Movie_j = v_{1,j} Cat_1 + v_{2,j} Cat_2 + ... + v_{n,j} Cat_n, s.t. ‖v_j‖_2 = 1
        = v_{1,j} √σ_{1,1} + v_{2,j} √σ_{2,2} + ... + v_{n,j} √σ_{n,n} = √Σ v_j
Viewer_i = u_{i,1} Cat_1 + u_{i,2} Cat_2 + ... + u_{i,n} Cat_n + ... + u_{i,m} Cat_m, s.t. ‖u_i‖_2 = 1
         = u_{i,1} √σ_{1,1} + u_{i,2} √σ_{2,2} + ... + u_{i,n} √σ_{n,n} = u_i √Σ
43/79

45 SVD Singular Value Decomposition Solution Given the decomposition of a movie and a viewer, the rating is estimated by:
Viewer_i · Movie_j = u_{i,1} v_{1,j} σ_{1,1} + u_{i,2} v_{2,j} σ_{2,2} + ... + u_{i,n} v_{n,j} σ_{n,n}
                   = (u_i √Σ)(√Σ v_j) = u_i Σ v_j
44/79
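
A small NumPy sketch (with a randomly generated toy ratings matrix standing in for A) showing that the product u_i Σ v_j reproduces the entry a_{i,j} when all singular values are kept:

    import numpy as np

    rng = np.random.default_rng(5)
    A = rng.integers(0, 6, size=(8, 5)).astype(float)   # toy ratings matrix (0 = not rated)

    U_hat, s, Vh = np.linalg.svd(A, full_matrices=False)
    Sigma = np.diag(s)

    i, j = 2, 3                                          # an arbitrary viewer/movie pair
    rating_ij = U_hat[i, :] @ Sigma @ Vh[:, j]           # u_i Sigma v_j
    print(np.isclose(rating_ij, A[i, j]))                # full-rank SVD reproduces the entry exactly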

46 SVD Singular Value Decomposition - Example Considering the ratings from 60 viewers of 16 movies of 4 different genres (action, romance, sci-fi, comedy), we generate A ∈ R^{60×16}. Viewers rated movies on a scale from 1 to 5, with 0 for movies that were not rated by the user. Observe the same 4 categories of viewers. 45/79

47 SVD Singular Value Decomposition - Example 46/79

48 SVD Singular Value Decomposition - Example 47/79

49 SVD Singular Value Decomposition - Example 48/79

50 SVD Singular Value Decomposition - Example To estimate not-rated movies (zero entries in A), we use additional information: A is known to be low-rank or approximately low-rank. Thus, we are going to use the rank-k approximation of the matrix A, that is:
Â = ÛΣ_k V^H
where Σ_k has all but the first k singular values σ_{i,i} set to zero. The ratings different from zero in A are set back to their original values. Note: The ratings matrix A is expected to be low-rank since user preferences can be described by a few categories (k), such as the movie genres. 49/79
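
A minimal sketch of this rank-k completion step; the ratings matrix below is a random stand-in (the actual A of the example is not reproduced here), and k = 4 matches the four genres:

    import numpy as np

    rng = np.random.default_rng(6)
    A = rng.integers(0, 6, size=(60, 16)).astype(float)  # stand-in ratings matrix, 0 = unrated

    k = 4                                                # assumed number of categories (genres)
    U_hat, s, Vh = np.linalg.svd(A, full_matrices=False)
    s_k = np.where(np.arange(len(s)) < k, s, 0.0)        # keep only the k largest singular values
    A_hat = U_hat @ np.diag(s_k) @ Vh                    # rank-k approximation of A

    observed = A != 0
    A_hat[observed] = A[observed]                        # restore the known ratings
    print(np.linalg.matrix_rank(U_hat @ np.diag(s_k) @ Vh))   # k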

51 SVD Singular Value Decomposition - Example 50/79

52 SVD Singular Value Decomposition - Example 51/79

53 PCA Principal Component Analysis (PCA) A simple method for extracting relevant information from confusing data sets. How do we reduce a complex data set to a lower dimension? Consider a mass attached to a spring which oscillates as shown below. What if we did not know that F = ma? 52/79

54 PCA PCA - Motivation: Toy example Since we live in a 3D world, we use three cameras to capture data from the system. There is no information about the real x, y, and z axes; the camera positions are chosen arbitrarily. How do we get from this data set to a simple equation of z? 53/79

55 PCA PCA - Motivation: Toy example Three cameras give redundant information. Only one camera at a specific angle necessary to describe the system behavior. PCA is used to avoid redundancy. 54/79

56 PCA Framework: Change of Basis The goal of PCA is to identify the most meaningful basis to re-express a dataset. Let the basis of representation for our samples be the naive choice, that is, the identity matrix:
B = [b_1; b_2; ...; b_m] = I    (1)
Note all the recorded data can be trivially expressed as a linear combination of the b_i. 55/79

57 PCA Change of Basis PCA: Is there another basis, which is a linear combination of the original basis, that best represents the data set? Let X be the original data set, where each column is a single measurement set. Let Y be a linear transformation of X by P, i.e. Y = PX. Implications: Geometrically, P is a rotation and a stretch which transforms X into Y. The rows of P, {p_1, ..., p_m}, are a set of new basis vectors for expressing the columns of X. What is the best way to re-express X? What is a good choice for P? 56/79

58 PCA Noise Signal and noise variances are depicted as σ²_signal and σ²_noise. The largest direction of variance is not along the natural basis but along the best-fit line. The directions with largest variances contain the dynamics of interest. Intuition: Find the direction indicated by σ_signal. 57/79

59 PCA Redundancy Figures depict possible plots between two arbitrary measurement types r_1 and r_2. Low redundancy ⇒ uncorrelated recordings. High redundancy ⇒ correlated recordings, e.g. the sensors are too close or the measured variables are equivalent. If recordings are highly correlated, it is not necessary to measure both of them. 58/79

60 PCA PCA - Basic concepts Let a = [a_1, a_2, ..., a_n] and b = [b_1, b_2, ..., b_n] be two sets of measurements. Are they related? If the mean of a and b is zero, then:
Variance: How large the change is in each vector.
σ_a² = (1/n) a a^T = (1/n) Σ_i a_i²
σ_b² = (1/n) b b^T = (1/n) Σ_i b_i²
Covariance: Statistical relationship between the data in a and b.
σ_ab² = (1/n) a b^T = (1/n) Σ_i a_i b_i
59/79
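
A short sketch computing these quantities for two synthetic zero-mean measurement vectors (generated only for illustration) and comparing with NumPy's covariance routine:

    import numpy as np

    rng = np.random.default_rng(7)
    n = 1000
    a = rng.standard_normal(n)
    b = 0.8 * a + 0.2 * rng.standard_normal(n)   # correlated with a, for illustration
    a, b = a - a.mean(), b - b.mean()            # zero-mean, as assumed on the slide

    var_a = (a @ a) / n                          # (1/n) a a^T
    cov_ab = (a @ b) / n                         # (1/n) a b^T
    print(var_a, cov_ab)
    print(np.cov(a, b, bias=True))               # NumPy's (1/n-normalized) covariance matrix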

61 PCA Variance and Covariance Let X be defined as X = [x_1^T; x_2^T; ...; x_m^T], where each row x_i^T corresponds to all measurements of a particular type. Then the covariance matrix is defined as:
C_X = (1/n) X X^H
The covariance values reflect the noise and redundancy in the measurements. 60/79

62 PCA Variance and Covariance Recall C_X is the covariance matrix of X defined as C_X = (1/n) X X^H. The covariance matrix in the spring example is C_X ∈ R^{6×6}, with rows and columns indexed by the six recorded coordinates (y_1, z_1, y_2, z_2, y_3, z_3) and entries σ²_{y_1y_1}, σ²_{y_1z_1}, ..., σ²_{z_3z_3}. Diagonal: variance measures. Off-diagonal: covariances between all pairs. C_X is Hermitian and symmetric, i.e. C_X = C_X^T. 61/79

63 PCA Covariance Matrix Interpretation Consider the entries of the 6 × 6 covariance matrix C_X above. Off-diagonal terms: If a covariance is large, the components are statistically dependent. If a covariance is small, the components are statistically independent. Diagonal terms: If a variance is large, the component contains a lot of information about the system. If a variance is small, it does not provide significant information about the system. 62/79

64 PCA PCA Goal: Change basis such that the covariance matrix of the data is diagonal. If the off-diagonal terms are driven to 0, the redundancies are eliminated. Diagonal terms represent the variance of each component. Components with large variance are the most representative. 63/79

65 PCA PCA and Eigenvalue Decomposition How to solve the problem? Data set: X ∈ R^{m×n}, where m is the number of measurement types and n is the number of samples. PCA: Find an orthonormal matrix P in Y = PX such that C_Y = (1/n) Y Y^T is a diagonal matrix. The rows of P are the principal components of X. 64/79

66 PCA PCA and Eigenvalue Decomposition We begin by rewriting C_Y in terms of the unknown variable:
C_Y = (1/n) Y Y^T = (1/n) (PX)(PX)^T = (1/n) P X X^T P^T = P ((1/n) X X^T) P^T = P C_X P^T
65/79

67 PCA PCA and Eigenvalue Decomposition C_X can be diagonalized by an orthogonal matrix of its eigenvectors since it is a symmetric matrix. Let P = Q^T, where Q is the matrix whose columns are the eigenvectors of (1/n) X X^T, so that C_X = QΩQ^T = P^TΩP. Then:
C_Y = P C_X P^T = P (P^TΩP) P^T = (PP^T)Ω(PP^T) = (PP^{-1})Ω(PP^{-1}) = Ω
The transformation Y = PX diagonalizes the system. The covariance of Y is a diagonal matrix with the eigenvalues of (1/n) X X^T. 66/79
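
A minimal NumPy sketch of this eigendecomposition route on synthetic zero-mean data (the redundancy between two measurement rows is introduced only for illustration):

    import numpy as np

    rng = np.random.default_rng(8)
    m, n = 3, 500
    X = rng.standard_normal((m, n))
    X[1] = 0.9 * X[0] + 0.1 * X[1]         # introduce redundancy between two measurement types
    X = X - X.mean(axis=1, keepdims=True)  # zero-mean rows, as assumed on the slides

    C_X = X @ X.T / n                      # covariance matrix of the data
    lam, Q = np.linalg.eigh(C_X)           # eigenvalues (ascending) and eigenvector columns
    P = Q.T                                # rows of P are the principal components

    Y = P @ X
    C_Y = Y @ Y.T / n
    print(np.allclose(C_Y, np.diag(lam)))  # C_Y is diagonal, with the eigenvalues of C_X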

68 PCA PCA and SVD The SVD of X is given by X = UΣV^H. Let P = U^H; then Y = U^H X, and the covariance matrix of Y is given by:
C_Y = (1/n) Y Y^H = (1/n) U^H X X^H U = (1/n) U^H UΣV^H VΣU^H U = (1/n) Σ²
67/79

69 PCA PCA The transformation Y = U^H X diagonalizes the system. The covariance of Y is a diagonal matrix with the squared singular values of X multiplied by a factor of 1/n. It can be concluded that (1/n)Σ² = Ω, i.e., σ_i² = nλ_i. The principal components of the data matrix are given by U^H. 68/79
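
The two routes can be compared numerically; a sketch on synthetic zero-mean data checking that C_Y = (1/n)Σ² and that σ_i² = nλ_i:

    import numpy as np

    rng = np.random.default_rng(9)
    m, n = 3, 400
    X = rng.standard_normal((m, n))
    X = X - X.mean(axis=1, keepdims=True)

    lam = np.linalg.eigvalsh(X @ X.T / n)[::-1]     # eigenvalues of C_X, descending

    U, s, Vh = np.linalg.svd(X, full_matrices=False)
    Y = U.T @ X                                     # principal-component scores
    C_Y = Y @ Y.T / n

    print(np.allclose(np.diag(C_Y), s**2 / n))      # C_Y = (1/n) Sigma^2
    print(np.allclose(s**2, n * lam))               # sigma_i^2 = n * lambda_i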

70 PCA Application: Face Recognition PCA in face recognition ⇒ Eigenfaces. Intuition: Figure out the correlation between the rows/columns of A from the SVD,
A = UΣV^H    (2)
How important each direction is: Σ. Principal directions: U. How each individual component (row/column) projects onto the principal components: V. 69/79

71 PCA Data in Face Recognition The data matrix for the face recognition problem is constructed by vectorizing the face images as shown below, i.e. A = [A_1^T A_2^T ... A_N^T]^T. The matrix will be N × M, where N is the number of images in the database and M is the number of pixels in each image. 70/79

72 PCA Example - Celebrity Images Example: take 5 images of each celebrity: George Clooney, Bruce Willis, Margaret Thatcher and Matt Damon. In the example, N = 20 and M is the number of pixels in each image. 71/79

73 PCA Average Faces How do the average faces of these celebrities look?
ā_i = (1/5) Σ_{j=1}^{5} A_j    (3)
where the sum runs over the 5 images of celebrity i. 72/79

74 PCA Average Faces What defines George Clooney's face? Data matrix A ∈ R^{N×M} with the images of the example. Compute the correlation matrix of the features of the dataset, i.e. the pixels. The correlation matrix is C = A^T A ∈ R^{M×M}, where M is the number of pixels. High correlation values ⇒ everybody has eyes, a nose and a mouth. Correlations between images of the same person will be higher. Average Face 73/79

75 PCA Eigendecomposition Obtain the eigenvalue decomposition of C = A^T A, that is, C = QΩQ^{-1}. The first eigenvectors q_i ∈ R^{M×1} are called the principal components (eigenfaces). One can reconstruct each face as a weighted sum of the eigenvectors. 74/79

76 PCA Representing Faces onto the Basis Each face A_i ∈ R^{1×M} in the data set A = [A_1^T A_2^T ... A_N^T]^T can be represented as a linear combination of the best K eigenvectors:
A_i^T = Σ_{j=1}^{K} w_j q_j, where w_j = q_j^T A_i^T    (4)
75/79
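
A sketch of this projection and reconstruction with random data standing in for the face images (the sizes N, M and K below are assumed values; the true pixel count of the example is not given here):

    import numpy as np

    rng = np.random.default_rng(10)
    N, M, K = 20, 1024, 20                 # assumed sizes: 20 images, 1024 pixels, 20 eigenvectors
    A = rng.standard_normal((N, M))        # stand-in for the vectorized face images

    C = A.T @ A                            # M x M correlation matrix of the pixels
    lam, Q_full = np.linalg.eigh(C)
    Q = Q_full[:, ::-1][:, :K]             # first K eigenvectors (largest eigenvalues), M x K

    face = A[0]                            # one vectorized face
    w = Q.T @ face                         # weights w_j = q_j^T A_i^T
    face_hat = Q @ w                       # reconstruction from the K eigenfaces
    print(w.shape, face_hat.shape)         # (20,) (1024,)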

77 PCA Projection of the Average Faces onto the K = 20 Largest Eigenvectors Q is M × M; here let Q be the matrix formed by the first 20 eigenvectors, i.e. Q ∈ R^{M×20}. Project the average faces ā_i onto the reduced eigenvector space, i.e. projection = ā_i Q. The projections for each face are characteristic of each average face and could be used for classification purposes. 76/79

78 PCA Projection of new images Test set: a new image of Margaret Thatcher, Meryl Streep as Margaret Thatcher in The Iron Lady, and Betty White. Project the test images onto the eigenvector space, p = xQ, where x ∈ R^{1×M} is the new vectorized image and Q is the matrix with the first 20 eigenvectors of the database. Reconstruct the images as ˆx = Qp^T. The error is defined as the difference between the projection of the new image and the projection of the original Margaret Thatcher images A_jQ, j = 1, ..., 5, that is,
E_j = ‖A_jQ - xQ‖ / ‖A_jQ‖
where the A_j are the original images of the database, in this case the 5 images of Margaret Thatcher. 77/79

79 PCA Projection of new images The image depicts, from left to right: the test images; the projection of the test images onto the eigenvector space, p = xQ; the reconstructed images using the first 20 eigenvectors of the database, ˆx = Qp^T; and the error of the projection with respect to each original Margaret Thatcher image. 78/79

80 PCA Projection of new images 79/79
