Eigenfaces and Fisherfaces


1 Eigenfaces and Fisherfaces Dimension Reduction and Component Analysis Jason Corso University of Michigan EECS 598 Fall 2014 Foundations of Computer Vision JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 1 / 101

2 Dimensionality High Dimensions Often Test Our Intuitions Consider a simple arrangement: we have a sphere of radius r = 1 in a space of D dimensions. We want to compute the fraction of the sphere's volume that lies between radius r = 1 - ɛ and r = 1. JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 2 / 101

3 Dimensionality High Dimensions Often Test Our Intuitions Consider a simple arrangement: we have a sphere of radius r = 1 in a space of D dimensions. We want to compute the fraction of the sphere's volume that lies between radius r = 1 - ɛ and r = 1. Noting that the volume of the sphere scales with r^D, we have V_D(r) = K_D r^D (1), where K_D is some constant depending only on D. The volume fraction is then (V_D(1) - V_D(1 - ɛ)) / V_D(1) = 1 - (1 - ɛ)^D (2). (Figure: the volume fraction plotted against ɛ for D = 1, 2, 5, and 20; it rises toward 1 much faster as D grows.) JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 2 / 101
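A quick numerical check of Eq. (2) makes the effect concrete. This is a minimal sketch (not part of the lecture), assuming only NumPy.

```python
import numpy as np

# Fraction of a unit D-ball's volume lying in the thin shell between
# radius 1 - eps and 1: 1 - (1 - eps)^D  (Eq. 2).
def shell_volume_fraction(eps, D):
    return 1.0 - (1.0 - eps) ** D

eps = 0.05  # a 5% shell at the surface
for D in (1, 2, 5, 20, 100):
    print(f"D = {D:3d}: fraction in shell = {shell_volume_fraction(eps, D):.3f}")
# Even a thin shell holds nearly all of the volume once D is large.
```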

4 Dimensionality Let's Build Some More Intuition Example from Bishop PRML Dataset: measurements taken from a pipeline containing a mixture of oil. Three classes are present (different geometrical configurations of the flow): homogeneous, annular, and laminar. Each data point is a 12-dimensional input vector consisting of measurements taken with gamma-ray densitometers, which measure the attenuation of gamma rays passing along narrow beams through the pipe. (Figure: scatter plot of features x_6 and x_7 colored by class.) JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 3 / 101

5 Dimensionality Let's Build Some More Intuition Example from Bishop PRML 100 data points of features x_6 and x_7 are shown on the right. Goal: Classify the new data point marked by the cross. (Figure: scatter plot of x_6 vs. x_7 with the query point shown as a cross.) JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 4 / 101

6 Dimensionality Let's Build Some More Intuition Example from Bishop PRML Observations we can make: The cross is surrounded by many red points and some green points. Blue points are quite far from the cross. Nearest-Neighbor Intuition: The query point should be determined more strongly by nearby points from the training set and less strongly by more distant points. JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 5 / 101

7 Dimensionality Let's Build Some More Intuition Example from Bishop PRML One simple way of doing this is: We can divide the feature space up into regular cells. For each cell, we associate the class that occurs most frequently in that cell (in our training data). Then, for a query point, we determine which cell it falls into and assign the label associated with that cell. (Figure: the feature space partitioned into regular cells.) JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 6 / 101

8 Dimensionality Let's Build Some More Intuition Example from Bishop PRML One simple way of doing this is: We can divide the feature space up into regular cells. For each cell, we associate the class that occurs most frequently in that cell (in our training data). Then, for a query point, we determine which cell it falls into and assign the label associated with that cell. What problems may exist with this approach? JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 6 / 101

9 Dimensionality Let's Build Some More Intuition Example from Bishop PRML The problem we are most interested in now is the one that becomes apparent when we add more variables into the mix, corresponding to problems of higher dimensionality. In this case, the number of cells grows exponentially with the dimensionality of the space. Hence, we would need an exponentially large training data set to ensure all cells are filled. (Figure: the regular grid of cells over the feature space for D = 1, 2, and 3.) JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 7 / 101
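To make the cell explosion and the cell-based classifier concrete, here is a small sketch (not part of the lecture); the grid resolution M and the unit-cube feature range are assumptions for illustration.

```python
import numpy as np
from collections import Counter, defaultdict

M = 5  # assumed number of cells per feature dimension
for D in (1, 2, 3, 7, 12):
    print(f"D = {D:2d}: {M**D:,} cells would need training data")

# Minimal cell-majority classifier on [0, 1]^D: each training point votes for
# its cell; a query point receives the majority label of the cell it falls in.
def cell_index(x, M):
    return tuple(np.minimum((np.asarray(x) * M).astype(int), M - 1))

def fit_cells(X, y, M):
    votes = defaultdict(Counter)
    for x, label in zip(X, y):
        votes[cell_index(x, M)][label] += 1
    return {cell: counts.most_common(1)[0][0] for cell, counts in votes.items()}

def predict_cell(cells, x, M, default=None):
    return cells.get(cell_index(x, M), default)  # empty cells have no label
```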

10 Dimensionality Curse of Dimensionality This severe difficulty when working in high dimensions was coined the curse of dimensionality by Bellman in 1961. The idea is that the volume of a space increases exponentially with the dimensionality of the space. JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 8 / 101

11 Dimensionality Dimensionality and Classification Error? Some parts taken from G. V. Trunk, TPAMI Vol. 1 No. 3 How does the probability of error vary as we add more features, in theory? Consider the following two-class problem: The prior probabilities are known and equal: P(ω_1) = P(ω_2) = 1/2. The class-conditional densities are Gaussian with unit covariance: p(x|ω_1) ~ N(µ_1, I) (3) and p(x|ω_2) ~ N(µ_2, I) (4), where µ_1 = µ, µ_2 = -µ, and µ is an n-vector whose ith component is (1/i)^{1/2}. The corresponding Bayesian decision rule is: decide ω_1 if x^T µ > 0 (5). JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 9 / 101

12 Dimensionality Dimensionality and Classification Error? Some parts taken from G. V. Trunk, TPAMI Vol. 1 No. 3 The probability of error is P(error) = (1/√(2π)) ∫_{r/2}^{∞} exp(-z²/2) dz (6), where r² = ||µ_1 - µ_2||² = 4 Σ_{i=1}^n (1/i) (7). Let's take this integral for granted... (For more detail, you can look at DHS Problem 31 in Chapter 2 and read Section 2.7.) (Figure from DHS: components of the probability of error for equal priors, showing the reducible error between a decision boundary x_B and the optimal boundary x*.) JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 10 / 101

13 Dimensionality Dimensionality and Classification Error? Some parts taken from G. V. Trunk, TPAMI Vol. 1 No. 3 The probability of error is P(error) = (1/√(2π)) ∫_{r/2}^{∞} exp(-z²/2) dz (6), where r² = ||µ_1 - µ_2||² = 4 Σ_{i=1}^n (1/i) (7). Let's take this integral for granted... (For more detail, you can look at DHS Problem 31 in Chapter 2 and read Section 2.7.) What can we say about this result as more features are added? (Figure from DHS: components of the probability of error for equal priors, showing the reducible error between a decision boundary x_B and the optimal boundary x*.) JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 10 / 101

14 Dimensionality Dimensionality and Classification Error? Some parts taken from G. V. Trunk, TPAMI Vol. 1 No. 3 The probability of error approaches 0 as n approaches infinity because Σ (1/i) is a divergent series. More intuitively, each additional feature decreases the probability of error as long as its class means differ. In the general case of differing means but equal variance for each feature, we have r² = Σ_{i=1}^d ((µ_{i1} - µ_{i2}) / σ_i)² (8). Certainly, we prefer features that have big differences in the mean relative to their variance. We need to note that if the probabilistic structure of the problem is completely known, then adding new features may not decrease the Bayes risk, but it cannot increase it. JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 11 / 101
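A quick numerical sketch of Eqs. (6)-(7), assuming SciPy is available (not part of the lecture): with known means, adding features keeps shrinking the error, though slowly.

```python
import numpy as np
from scipy.stats import norm

# P(error) = integral from r/2 to infinity of N(0,1) dz = 1 - Phi(r/2),
# with r^2 = 4 * sum_{i=1}^n 1/i  (Eqs. 6-7, known means).
for n in (1, 10, 100, 1000, 10000):
    r = 2.0 * np.sqrt(np.sum(1.0 / np.arange(1, n + 1)))
    print(f"n = {n:5d}: P(error) = {1.0 - norm.cdf(r / 2.0):.4f}")
# The harmonic sum diverges, so P(error) -> 0 as n -> infinity.
```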

15 Dimensionality Dimensionality and Classification Error? Some parts taken from G. V. Trunk, TPAMI Vol. 1 No. 3 So, adding dimensions is good... in theory. But, in practice, performance seems to not obey this theory. (DHS Figure 3.3: two three-dimensional distributions have nonoverlapping densities, and thus in three dimensions the Bayes error vanishes; when projected to a subspace, here the two-dimensional x_1, x_2 subspace or a one-dimensional x_1 subspace, there can be greater overlap of the projected distributions and hence greater Bayes error. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright 2001 by John Wiley & Sons, Inc.) JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 12 / 101

16 Dimensionality Dimensionality and Classification Error? Some parts taken from G. V. Trunk, TPAMI Vol. 1 No. 3 Consider again the two-class problem, but this time with unknown means µ_1 and µ_2. Instead, we have m labeled samples x_1, ..., x_m. Then, the best estimate of µ is the sample mean (recall the parameter estimation lecture): µ̂ = (1/m) Σ_{i=1}^m x_i (9), where the sign of x_i is flipped when it comes from ω_2. The covariance matrix of this estimate is I/m. JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 13 / 101

17 Dimensionality Dimensionality and Classification Error? Some parts taken from G. V. Trunk, TPAMI Vol. 1 No. 3 The probability of error is given by P(error) = P(x^T µ̂ ≥ 0 | ω_2) = (1/√(2π)) ∫_{γ_n}^{∞} exp(-z²/2) dz (10), because the statistic is asymptotically Gaussian as n approaches infinity, where γ_n = E(z) / [var(z)]^{1/2} (11), E(z) = Σ_{i=1}^n (1/i) (12), and var(z) = (1 + 1/m) Σ_{i=1}^n (1/i) + n/m (13). JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 14 / 101

18 Dimensionality Dimensionality and Classification Error? Some parts taken from G. V. Trunk, TPAMI Vol. 1 No. 3 The probability of error is given by P(error) = P(x^T µ̂ ≥ 0 | ω_2) = (1/√(2π)) ∫_{γ_n}^{∞} exp(-z²/2) dz. The key is that we can show lim_{n→∞} γ_n = 0 (14), and thus the probability of error approaches one-half as the dimensionality of the problem becomes very high. JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 15 / 101
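A numerical sketch of γ_n (not part of the lecture), assuming the reconstructed form of var(z) above and a fixed training-set size m; it shows the error creeping back toward one-half once n outgrows m.

```python
import numpy as np
from scipy.stats import norm

m = 25  # assumed number of labeled training samples
for n in (1, 10, 100, 1000, 10000):
    h = np.sum(1.0 / np.arange(1, n + 1))               # sum_{i=1}^n 1/i
    gamma_n = h / np.sqrt((1.0 + 1.0 / m) * h + n / m)  # Eqs. (11)-(13)
    print(f"n = {n:5d}: gamma_n = {gamma_n:6.3f}, "
          f"P(error) = {1.0 - norm.cdf(gamma_n):.3f}")
# gamma_n first grows, then shrinks toward 0 as n outpaces m, so
# P(error) -> 1/2: Trunk's peaking phenomenon.
```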

19 Dimensionality Dimensionality and Classification Error? Some parts taken from G. V. Trunk, TPAMI Vol. 1 No. 3 PP Trunk performed an experiment to investigate the convergence rate of the probability of error to one-half. He simulated the problem for dimensionality 1 to 1000 and ran 500 repetitions for each dimension. We see an increase in performance initially and then a decrease (as the dimensionality of the problem grows larger than the number of training samples). JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 16 / 101

20 Dimension Reduction Motivation for Dimension Reduction The discussion on the curse of dimensionality should be enough! Even though our problem may have a high dimension, in most real-world problems the data will be confined to a much lower effective dimension. Computational complexity is another important point: generally, the higher the dimension, the longer the training stage will be (and potentially the measurement and classification stages). We seek an understanding of the underlying data in order to: remove or reduce the influence of noisy or irrelevant features that would otherwise interfere with the classifier; and identify a small set of features (perhaps a transformation thereof) during data exploration. JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 17 / 101

21 Dimension Reduction Principal Component Analysis Dimension Reduction Version 0 We have n samples {x 1, x 2,..., x n }. How can we best represent the n samples by a single vector x 0? JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 18 / 101

22 Dimension Reduction Principal Component Analysis Dimension Reduction Version 0 We have n samples {x_1, x_2, ..., x_n}. How can we best represent the n samples by a single vector x_0? First, we need a distance function on the sample space. Let's use the Euclidean distance and the sum-of-squared-distances criterion: J_0(x_0) = Σ_{k=1}^n ||x_0 - x_k||² (15). JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 18 / 101

23 Dimension Reduction Principal Component Analysis Dimension Reduction Version 0 We have n samples {x_1, x_2, ..., x_n}. How can we best represent the n samples by a single vector x_0? First, we need a distance function on the sample space. Let's use the Euclidean distance and the sum-of-squared-distances criterion: J_0(x_0) = Σ_{k=1}^n ||x_0 - x_k||² (15). Then, we seek the value of x_0 that minimizes J_0. JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 18 / 101

24 Dimension Reduction Principal Component Analysis Dimension Reduction Version 0 We can show that the minimizer is indeed the sample mean: m = (1/n) Σ_{k=1}^n x_k (16). We can verify this by adding and subtracting m inside J_0: J_0(x_0) = Σ_{k=1}^n ||(x_0 - m) - (x_k - m)||² (17) = Σ_{k=1}^n ||x_0 - m||² + Σ_{k=1}^n ||x_k - m||² (18), where the cross term vanishes because Σ_{k=1}^n (x_k - m) = 0. Thus, J_0 is minimized when x_0 = m. Note that the second term is independent of x_0. JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 19 / 101

25 Dimension Reduction Principal Component Analysis Dimension Reduction Version 0 So, the sample mean is an initial dimension-reduced representation of the data (a zero-dimensional one). It is simple, but it does not reveal any of the variability in the data. Let's try to obtain a one-dimensional representation: i.e., let's project the data onto a line running through the sample mean. JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 20 / 101

26 Dimension Reduction Principal Component Analysis Dimension Reduction Version 1 Let e be a unit vector in the direction of the line. The standard equation for a line is then x = m + ae (19), where the scalar a ∈ R determines which point along the line we are at and hence corresponds to the (signed) distance of the point x from the mean m. JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 21 / 101

27 Dimension Reduction Principal Component Analysis Dimension Reduction Version 1 Let e be a unit vector in the direction of the line. The standard equation for a line is then x = m + ae (19), where the scalar a ∈ R determines which point along the line we are at and hence corresponds to the (signed) distance of the point x from the mean m. Represent point x_k by m + a_k e. We can find an optimal set of coefficients by again minimizing the squared-error criterion J_1(a_1, ..., a_n, e) = Σ_{k=1}^n ||(m + a_k e) - x_k||² (20) = Σ_{k=1}^n a_k² ||e||² - 2 Σ_{k=1}^n a_k e^T (x_k - m) + Σ_{k=1}^n ||x_k - m||² (21). JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 21 / 101

28 Dimension Reduction Principal Component Analysis Dimension Reduction Version 1 Differentiating with respect to a_k and equating to 0 yields a_k = e^T (x_k - m) (22). This indicates that the best value for a_k is the projection of the point x_k onto the line through m in the direction e. JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 22 / 101

29 Dimension Reduction Principal Component Analysis Dimension Reduction Version 1 Differentiating with respect to a_k and equating to 0 yields a_k = e^T (x_k - m) (22). This indicates that the best value for a_k is the projection of the point x_k onto the line through m in the direction e. How do we find the best direction for that line? JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 22 / 101

30 Dimension Reduction Principal Component Analysis Dimension Reduction Version 1 What if we substitute the expression we computed for the best a_k directly into the J_1 criterion: J_1(e) = Σ_{k=1}^n a_k² - 2 Σ_{k=1}^n a_k² + Σ_{k=1}^n ||x_k - m||² (23) = - Σ_{k=1}^n [e^T (x_k - m)]² + Σ_{k=1}^n ||x_k - m||² (24) = - Σ_{k=1}^n e^T (x_k - m)(x_k - m)^T e + Σ_{k=1}^n ||x_k - m||² (25). JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 23 / 101

31 Dimension Reduction Principal Component Analysis Dimension Reduction Version 1 The Scatter Matrix Define the scatter matrix S as S = Σ_{k=1}^n (x_k - m)(x_k - m)^T (26). JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 24 / 101

32 Dimension Reduction Principal Component Analysis Dimension Reduction Version 1 The Scatter Matrix Define the scatter matrix S as S = Σ_{k=1}^n (x_k - m)(x_k - m)^T (26). This should be familiar: it is a multiple of the sample covariance matrix. JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 24 / 101

33 Dimension Reduction Principal Component Analysis Dimension Reduction Version 1 The Scatter Matrix Define the scatter matrix S as S = Σ_{k=1}^n (x_k - m)(x_k - m)^T (26). This should be familiar: it is a multiple of the sample covariance matrix. Putting it in: J_1(e) = -e^T S e + Σ_{k=1}^n ||x_k - m||² (27). The e that maximizes e^T S e will minimize J_1. JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 24 / 101

34 Dimension Reduction Principal Component Analysis Dimension Reduction Version 1 Solving for e We use Lagrange multipliers to maximize e^T S e subject to the constraint that ||e|| = 1: u = e^T S e - λ(e^T e - 1) (28). JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 25 / 101

35 Dimension Reduction Principal Component Analysis Dimension Reduction Version 1 Solving for e We use Lagrange multipliers to maximize e^T S e subject to the constraint that ||e|| = 1: u = e^T S e - λ(e^T e - 1) (28). Differentiating w.r.t. e and setting equal to 0: ∂u/∂e = 2Se - 2λe = 0 (29), which gives Se = λe (30). Does this form look familiar? JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 25 / 101

36 Dimension Reduction Principal Component Analysis Dimension Reduction Version 1 Eigenvectors of S Se = λe This is an eigenproblem. Hence, it follows that the best one-dimensional estimate (in a least-squares sense) for the data is the eigenvector corresponding to the largest eigenvalue of S. So, we will project the data onto the largest eigenvector of S and translate it to pass through the mean. JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 26 / 101

37 Dimension Reduction Principal Component Analysis Principal Component Analysis We're already done... basically. JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 27 / 101

38 Dimension Reduction Principal Component Analysis Principal Component Analysis We're already done... basically. This idea readily extends to multiple dimensions, say d′ < d dimensions. We replace the earlier equation of the line with x = m + Σ_{i=1}^{d′} a_i e_i (31). And we have a new criterion function J_{d′} = Σ_{k=1}^n ||(m + Σ_{i=1}^{d′} a_{ki} e_i) - x_k||² (32). JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 27 / 101

39 Dimension Reduction Principal Component Analysis Principal Component Analysis J_{d′} is minimized when the vectors e_1, ..., e_{d′} are the d′ eigenvectors of the scatter matrix having the largest eigenvalues. These vectors are orthogonal. They form a natural set of basis vectors for representing any feature x. The coefficients a_i are called the principal components. Visualize the basis vectors as the principal axes of a hyperellipsoid surrounding the data (a cloud of points). Principal component analysis reduces the dimension of the data by restricting attention to those directions of maximum variation, or scatter. JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 28 / 101
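A minimal PCA sketch (not the lecture's code) following Eqs. (26) and (31)-(32): form the scatter matrix, keep its leading d′ eigenvectors, and project the centered data; d′ is an assumed parameter.

```python
import numpy as np

def pca_fit(X, d_prime):
    """X: (n, d) data matrix. Returns the mean m and the top-d' eigenvectors of S."""
    m = X.mean(axis=0)
    Xc = X - m
    S = Xc.T @ Xc                      # scatter matrix, Eq. (26)
    evals, evecs = np.linalg.eigh(S)   # ascending eigenvalues (S is symmetric)
    E = evecs[:, ::-1][:, :d_prime]    # leading d' eigenvectors as columns
    return m, E

def pca_project(X, m, E):
    return (X - m) @ E                 # principal components a_ki

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 12))
m, E = pca_fit(X, d_prime=2)
A = pca_project(X, m, E)               # (100, 2) reduced representation
```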

40 Dimension Reduction Fisher Linear Discriminant Fisher Linear Discriminant Description vs. Discrimination PCA is likely going to be useful for representing data. But, there is no reason to assume that it would be good for discriminating between two classes of data. Think of Q versus O: the discriminating detail is small compared to the overall shape variation. Discriminant Analysis seeks directions that are efficient for discrimination. JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 29 / 101

41 Dimension Reduction Fisher Linear Discriminant Let's Formalize the Situation Suppose we have a set of n d-dimensional samples with n_1 in set D_1, and similarly n_2 in set D_2: D_i = {x_1, ..., x_{n_i}}, i = 1, 2 (33). JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 30 / 101

42 Dimension Reduction Fisher Linear Discriminant Let's Formalize the Situation Suppose we have a set of n d-dimensional samples with n_1 in set D_1, and similarly n_2 in set D_2: D_i = {x_1, ..., x_{n_i}}, i = 1, 2 (33). We can form a linear combination of the components of a sample x: y = w^T x (34), which yields a corresponding set of n samples y_1, ..., y_n split into subsets Y_1 and Y_2. JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 30 / 101

43 Dimension Reduction Fisher Linear Discriminant Geometrically, This is a Projection If we constrain the norm of w to be 1 (i.e., ||w|| = 1), then we can conceptualize that each y_i is the projection of the corresponding x_i onto a line in the direction of w. (DHS Figure 3.5: projection of the same set of samples onto two different lines in the directions marked w; the figure on the right shows greater separation between the red and black projected points. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright 2001 by John Wiley & Sons, Inc.) JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 31 / 101

44 Dimension Reduction Fisher Linear Discriminant Geometrically, This is a Projection If we constrain the norm of w to be 1 (i.e., ||w|| = 1), then we can conceptualize that each y_i is the projection of the corresponding x_i onto a line in the direction of w. Does the magnitude of w have any real significance? (DHS Figure 3.5: projection of the same set of samples onto two different lines in the directions marked w; the figure on the right shows greater separation between the red and black projected points. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright 2001 by John Wiley & Sons, Inc.) JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 31 / 101

45 Dimension Reduction Fisher Linear Discriminant What is a Good Projection? For our two-class setup, it should be clear that we want the projection that will have those samples from class ω 1 falling into one cluster (on the line) and those samples from class ω 2 falling into a separate cluster (on the line). JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 32 / 101

46 Dimension Reduction Fisher Linear Discriminant What is a Good Projection? For our two-class setup, it should be clear that we want the projection that will have those samples from class ω 1 falling into one cluster (on the line) and those samples from class ω 2 falling into a separate cluster (on the line). However, this may not be possible depending on our underlying classes. JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 32 / 101

47 Dimension Reduction Fisher Linear Discriminant What is a Good Projection? For our two-class setup, it should be clear that we want the projection that will have those samples from class ω 1 falling into one cluster (on the line) and those samples from class ω 2 falling into a separate cluster (on the line). However, this may not be possible depending on our underlying classes. So, how do we find the best direction w? JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 32 / 101

48 Dimension Reduction Fisher Linear Discriminant Separation of the Projected Means Let m_i be the d-dimensional sample mean for class i: m_i = (1/n_i) Σ_{x ∈ D_i} x (35). JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 33 / 101

49 Dimension Reduction Fisher Linear Discriminant Separation of the Projected Means Let m_i be the d-dimensional sample mean for class i: m_i = (1/n_i) Σ_{x ∈ D_i} x (35). Then the sample mean for the projected points is m̃_i = (1/n_i) Σ_{y ∈ Y_i} y (36) = (1/n_i) Σ_{x ∈ D_i} w^T x (37) = w^T m_i (38), and thus is simply the projection of m_i. JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 33 / 101

50 Dimension Reduction Fisher Linear Discriminant Distance Between Projected Means The distance between projected means is thus |m̃_1 - m̃_2| = |w^T (m_1 - m_2)| (39). JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 34 / 101

51 Dimension Reduction Fisher Linear Discriminant Distance Between Projected Means The distance between projected means is thus |m̃_1 - m̃_2| = |w^T (m_1 - m_2)| (39). The scale of w: we can make this distance arbitrarily large by scaling w. JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 34 / 101

52 Dimension Reduction Fisher Linear Discriminant Distance Between Projected Means The distance between projected means is thus |m̃_1 - m̃_2| = |w^T (m_1 - m_2)| (39). The scale of w: we can make this distance arbitrarily large by scaling w. Rather, we want to make this distance large relative to some measure of the standard deviation. JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 34 / 101

53 Dimension Reduction Fisher Linear Discriminant The Scatter To capture this variation, we compute the scatter s̃_i² = Σ_{y ∈ Y_i} (y - m̃_i)² (40), which is essentially a scaled sample variance. JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 35 / 101

54 Dimension Reduction Fisher Linear Discriminant The Scatter To capture this variation, we compute the scatter s̃_i² = Σ_{y ∈ Y_i} (y - m̃_i)² (40), which is essentially a scaled sample variance. From this, we can estimate the variance of the pooled data: (1/n)(s̃_1² + s̃_2²) (41). s̃_1² + s̃_2² is called the total within-class scatter of the projected samples. JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 35 / 101

55 Dimension Reduction Fisher Linear Discriminant Fisher Linear Discriminant The Fisher Linear Discriminant will select the w that maximizes J(w) = |m̃_1 - m̃_2|² / (s̃_1² + s̃_2²) (42). It does so independently of the magnitude of w. JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 36 / 101

56 Dimension Reduction Fisher Linear Discriminant Fisher Linear Discriminant The Fisher Linear Discriminant will select the w that maximizes J(w) = |m̃_1 - m̃_2|² / (s̃_1² + s̃_2²) (42). It does so independently of the magnitude of w. This criterion is the squared distance between the projected means scaled by the within-class scatter (the variation of the data). Recall the similar term from earlier in the lecture, which indicated that the amount a feature will reduce the probability of error is proportional to the ratio of the difference of the means to the variance. FLD will choose the maximum... JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 36 / 101

57 Dimension Reduction Fisher Linear Discriminant Fisher Linear Discriminant The Fisher Linear Discriminant will select the w that maximizes J(w) = |m̃_1 - m̃_2|² / (s̃_1² + s̃_2²) (42). It does so independently of the magnitude of w. This criterion is the squared distance between the projected means scaled by the within-class scatter (the variation of the data). Recall the similar term from earlier in the lecture, which indicated that the amount a feature will reduce the probability of error is proportional to the ratio of the difference of the means to the variance. FLD will choose the maximum... We need to rewrite J(·) as a function of w. JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 36 / 101

58 Dimension Reduction Fisher Linear Discriminant Within-Class Scatter Matrix Define scatter matrices S_i: S_i = Σ_{x ∈ D_i} (x - m_i)(x - m_i)^T (43) and S_W = S_1 + S_2 (44). JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 37 / 101

59 Dimension Reduction Fisher Linear Discriminant Within-Class Scatter Matrix Define scatter matrices S_i: S_i = Σ_{x ∈ D_i} (x - m_i)(x - m_i)^T (43) and S_W = S_1 + S_2 (44). S_W is called the within-class scatter matrix. S_W is symmetric and positive semidefinite. In typical cases, when is S_W nonsingular? JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 37 / 101

60 Dimension Reduction Fisher Linear Discriminant Within-Class Scatter Matrix Deriving the sum of scatters: s̃_i² = Σ_{x ∈ D_i} (w^T x - w^T m_i)² (45) = Σ_{x ∈ D_i} w^T (x - m_i)(x - m_i)^T w (46) = w^T S_i w (47). We can therefore write the sum of the scatters as an explicit function of w: s̃_1² + s̃_2² = w^T S_W w (48). JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 38 / 101

61 Dimension Reduction Fisher Linear Discriminant Between-Class Scatter Matrix The separation of the projected means obeys (m̃_1 - m̃_2)² = (w^T m_1 - w^T m_2)² (49) = w^T (m_1 - m_2)(m_1 - m_2)^T w (50) = w^T S_B w (51). Here, S_B is called the between-class scatter matrix: S_B = (m_1 - m_2)(m_1 - m_2)^T (52). S_B is also symmetric and positive semidefinite. When is S_B nonsingular? JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 39 / 101

62 Dimension Reduction Fisher Linear Discriminant Rewriting the FLD J(·) We can rewrite our objective as a function of w: J(w) = (w^T S_B w) / (w^T S_W w) (53). This is the generalized Rayleigh quotient. JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 40 / 101

63 Dimension Reduction Fisher Linear Discriminant Rewriting the FLD J(·) We can rewrite our objective as a function of w: J(w) = (w^T S_B w) / (w^T S_W w) (53). This is the generalized Rayleigh quotient. The vector w that maximizes J(·) must satisfy S_B w = λ S_W w (54), which is a generalized eigenvalue problem. JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 40 / 101

64 Dimension Reduction Fisher Linear Discriminant Rewriting the FLD J(·) We can rewrite our objective as a function of w: J(w) = (w^T S_B w) / (w^T S_W w) (53). This is the generalized Rayleigh quotient. The vector w that maximizes J(·) must satisfy S_B w = λ S_W w (54), which is a generalized eigenvalue problem. For nonsingular S_W (typical), we can write this as a standard eigenvalue problem: S_W^{-1} S_B w = λ w (55). JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 40 / 101

65 Dimension Reduction Fisher Linear Discriminant Simplifying Since the scale of w is not important and S_B w is always in the direction of (m_1 - m_2), we can simplify: w = S_W^{-1} (m_1 - m_2) (56). This w maximizes J(·) and is the Fisher Linear Discriminant. JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 41 / 101

66 Dimension Reduction Fisher Linear Discriminant Simplifying Since the scale of w is not important and S_B w is always in the direction of (m_1 - m_2), we can simplify: w = S_W^{-1} (m_1 - m_2) (56). This w maximizes J(·) and is the Fisher Linear Discriminant. The FLD converts a many-dimensional problem to a one-dimensional one. JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 41 / 101

67 Dimension Reduction Fisher Linear Discriminant Simplifying Since the scale of w is not important and S_B w is always in the direction of (m_1 - m_2), we can simplify: w = S_W^{-1} (m_1 - m_2) (56). This w maximizes J(·) and is the Fisher Linear Discriminant. The FLD converts a many-dimensional problem to a one-dimensional one. One still must find the threshold. This is easy for known densities, but not so easy in general. JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 41 / 101
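A minimal two-class FLD sketch (not the lecture's code) implementing Eqs. (43), (44), and (56); the threshold used here, the midpoint of the projected class means, is one reasonable choice when the densities are unknown.

```python
import numpy as np

def fisher_ld(X1, X2):
    """X1, X2: (n_i, d) sample matrices for the two classes. Returns (w, threshold)."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - m1).T @ (X1 - m1)           # Eq. (43)
    S2 = (X2 - m2).T @ (X2 - m2)
    Sw = S1 + S2                            # Eq. (44)
    w = np.linalg.solve(Sw, m1 - m2)        # Eq. (56): w = Sw^{-1} (m1 - m2)
    threshold = 0.5 * (w @ m1 + w @ m2)     # assumed: midpoint of projected means
    return w, threshold

def classify(x, w, threshold):
    # w points toward class 1's projected mean, so larger projections mean class 1.
    return 1 if w @ x > threshold else 2
```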

68 Comparisons Classic A Classic PCA vs. FLD comparison Source: Belhumeur et al. IEEE TPAMI 19(7) PCA maximizes the total scatter across all classes. PCA projections are thus optimal for reconstruction from a low-dimensional basis, but not necessarily from a discrimination standpoint. FLD maximizes the ratio of the between-class scatter to the within-class scatter. FLD tries to shape the scatter to make it more effective for classification. (Fig. 2 from the paper: a comparison of principal component analysis (PCA) and Fisher's linear discriminant (FLD) for a two-class problem where data for each class lies near a linear subspace.) JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 42 / 101

69 Comparisons Case Study: Faces Case Study: EigenFaces versus FisherFaces Source: Belhumeur et al. IEEE TPAMI 19(7) Analysis of classic pattern recognition techniques (PCA and FLD) to do face recognition. Fixed pose but varying illumination. The variation in the resulting images caused by the varying illumination will nearly always dominate the variation caused by an identity change. (Fig. 1 from the paper: the same person seen under different lighting conditions can appear dramatically different; in the left image the dominant light source is nearly head-on, in the right image it is from above and to the right.) JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 43 / 101

70 Comparisons Case Study: Faces Source: Belhumeur et al. IEEE TPAMI 19(7) (Figure: example images from the Yale database, which contains 160 frontal face images covering 16 individuals taken under 10 different conditions: with or without glasses, three images taken with different point light sources, and five different facial expressions; images were cropped to two different scales, a full-face crop including part of the background and a closely cropped one including only the internal facial features.) JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 44 / 101

71 Comparisons Case Study: Faces Is Linear Okay For Faces? Source: Belhumeur et al. IEEE TPAMI 19(7) Fact: all of the images of a Lambertian surface, taken from a fixed viewpoint, but under varying illumination, lie in a 3D linear subspace of the high-dimensional image space. JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 45 / 101

72 Comparisons Case Study: Faces Is Linear Okay For Faces? Source: Belhumeur et al. IEEE TPAMI 19(7) Fact: all of the images of a Lambertian surface, taken from a fixed viewpoint, but under varying illumination, lie in a 3D linear subspace of the high-dimensional image space. But, in the presence of shadowing, specularities, and facial expressions, the above statement will not hold. This will ultimately result in deviations from the 3D linear subspace and worse classification accuracy. JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 45 / 101

73 Comparisons Case Study: Faces Method 1: Correlation Source: Belhumeur et al. IEEE TPAMI 19(7) Consider a set of N training images, {x 1, x 2,..., x N }. We know that each of the N images belongs to one of c classes and can define a C( ) function to map the image x into a class ω c. JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 46 / 101

74 Comparisons Case Study: Faces Method 1: Correlation Source: Belhumeur et al. IEEE TPAMI 19(7) Consider a set of N training images, {x 1, x 2,..., x N }. We know that each of the N images belongs to one of c classes and can define a C( ) function to map the image x into a class ω c. Pre-Processing each image is normalized to have zero mean and unit variance. Why? JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 46 / 101

75 Comparisons Case Study: Faces Method 1: Correlation Source: Belhumeur et al. IEEE TPAMI 19(7) Consider a set of N training images, {x 1, x 2,..., x N }. We know that each of the N images belongs to one of c classes and can define a C( ) function to map the image x into a class ω c. Pre-Processing each image is normalized to have zero mean and unit variance. Why? Gets rid of the light source intensity and the effects of a camera s automatic gain control. JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 46 / 101

76 Comparisons Case Study: Faces Method 1: Correlation Source: Belhumeur et al. IEEE TPAMI 19(7) Consider a set of N training images, {x_1, x_2, ..., x_N}. We know that each of the N images belongs to one of c classes and can define a C(·) function to map the image x into a class ω_c. Pre-Processing: each image is normalized to have zero mean and unit variance. Why? Gets rid of the light source intensity and the effects of a camera's automatic gain control. For a query image x, we select the class of the training image that is the nearest neighbor in the image space: x* = arg min_{x_i} ||x - x_i||, then decide C(x*) (57), where we have vectorized each image. JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 46 / 101
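A sketch of this correlation baseline (not the paper's code): normalize each vectorized image, then classify by nearest neighbor as in Eq. (57).

```python
import numpy as np

def normalize(v):
    # Zero mean, then unit norm (unit variance up to a constant factor);
    # this removes the light source intensity and camera gain.
    v = v.astype(float) - v.mean()
    return v / (np.linalg.norm(v) + 1e-12)

def correlation_classify(query, train_images, train_labels):
    """train_images: (N, n_pixels) vectorized images. Returns the predicted label."""
    T = np.stack([normalize(t) for t in train_images])
    q = normalize(query)
    # Nearest neighbor in image space; with unit-norm vectors this is
    # equivalent to picking the training image with maximum correlation.
    nearest = np.argmin(np.linalg.norm(T - q, axis=1))
    return train_labels[nearest]
```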

77 Comparisons Case Study: Faces Method 1: Correlation Source: Belhumeur et al. IEEE TPAMI 19(7) What are the advantages and disadvantages of the correlation based method in this context? JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 47 / 101

78 Comparisons Case Study: Faces Method 1: Correlation Source: Belhumeur et al. IEEE TPAMI 19(7) What are the advantages and disadvantages of the correlation based method in this context? Computationally expensive. Require large amount of storage. Noise may play a role. Highly parallelizable. JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 47 / 101

79 Comparisons Case Study: Faces Method 2: EigenFaces Source: Belhumeur et al. IEEE TPAMI 19(7) The original paper is Turk and Pentland, Eigenfaces for Recognition, Journal of Cognitive Neuroscience, 3(1). Quickly recall the main idea of PCA. Define a linear projection of the original n-d image space into an m-d space with m < n, or even m ≪ n, which yields new vectors y: y_k = W^T x_k, k = 1, 2, ..., N (58), where W ∈ R^{n×m}. JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 48 / 101

80 Comparisons Case Study: Faces Method 2: EigenFaces Source: Belhumeur et al. IEEE TPAMI 19(7) The original paper is Turk and Pentland, Eigenfaces for Recognition, Journal of Cognitive Neuroscience, 3(1). Quickly recall the main idea of PCA. Define a linear projection of the original n-d image space into an m-d space with m < n, or even m ≪ n, which yields new vectors y: y_k = W^T x_k, k = 1, 2, ..., N (58), where W ∈ R^{n×m}. Define the total scatter matrix S_T as S_T = Σ_{k=1}^N (x_k - µ)(x_k - µ)^T (59), where µ is the sample mean image. JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 48 / 101

81 Comparisons Case Study: Faces Method 2: EigenFaces Source: Belhumeur et al. IEEE TPAMI 19(7) The original paper is Turk and Pentland, Eigenfaces for Recognition, Journal of Cognitive Neuroscience, 3(1). The scatter of the projected vectors {y_1, y_2, ..., y_N} is W^T S_T W (60). JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 49 / 101

82 Comparisons Case Study: Faces Method 2: EigenFaces Source: Belhumeur et al. IEEE TPAMI 19(7) The original paper is Turk and Pentland, Eigenfaces for Recognition, Journal of Cognitive Neuroscience, 3(1). The scatter of the projected vectors {y_1, y_2, ..., y_N} is W^T S_T W (60). W_opt is chosen to maximize the determinant of the total scatter matrix of the projected vectors: W_opt = arg max_W |W^T S_T W| (61) = [w_1 w_2 ... w_m] (62), where {w_i | i = 1, 2, ..., m} is the set of n-d eigenvectors of S_T corresponding to the largest m eigenvalues. JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 49 / 101
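A minimal Eigenfaces sketch (not the paper's code): the top-m eigenvectors of the total scatter matrix, obtained here from an SVD of the centered data for efficiency, followed by nearest neighbor in the projected space.

```python
import numpy as np

def eigenfaces_fit(X, m):
    """X: (N, n_pixels) vectorized training images. Returns the mean and W (n_pixels, m)."""
    mu = X.mean(axis=0)
    Xc = X - mu
    # Right singular vectors of Xc are the eigenvectors of S_T = Xc^T Xc,
    # ordered by decreasing eigenvalue.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return mu, Vt[:m].T                   # top-m eigenfaces as columns

def eigenfaces_classify(x, mu, W, X_train, y_train):
    Y_train = (X_train - mu) @ W          # project the training set
    y = (x - mu) @ W                      # project the query image
    return y_train[np.argmin(np.linalg.norm(Y_train - y, axis=1))]
```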

83 Comparisons Case Study: Faces An Example: Source: cdecoro/eigenfaces/. (Not sure if this dataset includes lighting variation...) JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 50 / 101

84 Comparisons Case Study: Faces Method 2: EigenFaces Source: Belhumeur et al. IEEE TPAMI 19(7) The original paper is Turk and Pentland, Eigenfaces for Recognition Journal of Cognitive Neuroscience, 3(1) What are the advantages and disadvantages of the eigenfaces method in this context? JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 51 / 101

85 Comparisons Case Study: Faces Method 2: EigenFaces Source: Belhumeur et al. IEEE TPAMI 19(7) The original paper is Turk and Pentland, Eigenfaces for Recognition, Journal of Cognitive Neuroscience, 3(1). What are the advantages and disadvantages of the eigenfaces method in this context? The scatter being maximized is due not only to the between-class scatter that is useful for classification but also to the within-class scatter, which is generally undesirable for classification. If PCA is presented with faces under varying illumination, the projection matrix W_opt will contain principal components that retain the variation in lighting. If this variation is higher than the variation due to class identity, then PCA will suffer greatly for classification. Yields a more compact representation than the correlation-based method. JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 51 / 101

86 Comparisons Case Study: Faces Method 3: Linear Subspaces Source: Belhumeur et al. IEEE TPAMI 19(7) Use the Lambertian model directly. Consider a point p on a Lambertian surface illuminated by a point light source at infinity. Let s ∈ R³ signify the product of the light source intensity with the unit vector for the light source direction. The image intensity of the surface at p when viewed by a camera is E(p) = a(p) n(p)^T s (63), where n(p) is the unit inward normal vector to the surface at point p, and a(p) is the albedo of the surface at p (a scalar). Hence, the image intensity of the point p is linear in s. JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 52 / 101

87 Comparisons Case Study: Faces Method 3: Linear Subspaces Source: Belhumeur et al. IEEE TPAMI 19(7) So, if we assume no shadowing, given three images of a Lambertian surface from the same viewpoint under three known, linearly independent light source directions, the albedo and surface normal can be recovered. Alternatively, one can reconstruct the image of the surface under an arbitrary lighting direction by a linear combination of the three original images. This fact can be used for classification. JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 53 / 101

88 Comparisons Case Study: Faces Method 3: Linear Subspaces Source: Belhumeur et al. IEEE TPAMI 19(7) So, if we assume no shadowing, given three images of a Lambertian surface from the same viewpoint under three known, linearly independent light source directions, the albedo and surface normal can be recovered. Alternatively, one can reconstruct the image of the surface under an arbitrary lighting direction by a linear combination of the three original images. This fact can be used for classification. For each face (class) use three or more images taken under different lighting conditions to construct a 3D basis for the linear subspace. For recognition, compute the distance of a new image to each linear subspace and choose the face corresponding to the shortest distance. JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 53 / 101
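A sketch of the Linear Subspace method (not the paper's code): fit an orthonormal 3D basis per class from that class's differently lit training images, then classify a query by its distance to each class's subspace.

```python
import numpy as np

def fit_class_bases(images_by_class, dim=3):
    """images_by_class: dict label -> (N_i, n_pixels) array with N_i >= dim.
    Returns dict label -> (n_pixels, dim) orthonormal basis."""
    bases = {}
    for label, X in images_by_class.items():
        # Orthonormal basis spanning the class's images (top right singular vectors).
        _, _, Vt = np.linalg.svd(X, full_matrices=False)
        bases[label] = Vt[:dim].T
    return bases

def linear_subspace_classify(x, bases):
    def dist_to_subspace(B):
        proj = B @ (B.T @ x)                # orthogonal projection onto span(B)
        return np.linalg.norm(x - proj)
    return min(bases, key=lambda label: dist_to_subspace(bases[label]))
```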

89 Comparisons Case Study: Faces Method 3: Linear Subspaces Source: Belhumeur et al. IEEE TPAMI 19(7) Pros and Cons? JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 54 / 101

90 Comparisons Case Study: Faces Method 3: Linear Subspaces Source: Belhumeur et al. IEEE TPAMI 19(7) Pros and Cons? If there is no noise or shadowing, this method will achieve error free classification under any lighting conditions (and if the surface is indeed Lambertian). Faces inevitably have self-shadowing. Faces have expressions... Still pretty computationally expensive (linear in number of classes). JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 54 / 101

91 Comparisons Case Study: Faces Method 4: FisherFaces Source: Belhumeur et al. IEEE TPAMI 19(7) Recall the Fisher Linear Discriminant setup. The between-class scatter matrix is S_B = Σ_{i=1}^c N_i (µ_i - µ)(µ_i - µ)^T (64). The within-class scatter matrix is S_W = Σ_{i=1}^c Σ_{x_k ∈ D_i} (x_k - µ_i)(x_k - µ_i)^T (65). The optimal projection W_opt is chosen as the matrix with orthonormal columns which maximizes the ratio of the determinant of the between-class scatter matrix of the projected vectors to the determinant of the within-class scatter matrix of the projected vectors: W_opt = arg max_W |W^T S_B W| / |W^T S_W W| (66). JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 55 / 101

92 Comparisons Case Study: Faces Method 4: FisherFaces Source: Belhumeur et al. IEEE TPAMI 19(7) The eigenvectors {w_i | i = 1, 2, ..., m} corresponding to the m largest eigenvalues of the following generalized eigenvalue problem comprise W_opt: S_B w_i = λ_i S_W w_i, i = 1, 2, ..., m (67). This is a multi-class version of the FLD, which we will discuss in more detail. JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 56 / 101

93 Comparisons Case Study: Faces Method 4: FisherFaces Source: Belhumeur et al. IEEE TPAMI 19(7) The eigenvectors {w_i | i = 1, 2, ..., m} corresponding to the m largest eigenvalues of the following generalized eigenvalue problem comprise W_opt: S_B w_i = λ_i S_W w_i, i = 1, 2, ..., m (67). This is a multi-class version of the FLD, which we will discuss in more detail. In face recognition, things get a little more complicated because the within-class scatter matrix S_W is always singular. JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 56 / 101

94 Comparisons Case Study: Faces Method 4: FisherFaces Source: Belhumeur et al. IEEE TPAMI 19(7) The eigenvectors {w_i | i = 1, 2, ..., m} corresponding to the m largest eigenvalues of the following generalized eigenvalue problem comprise W_opt: S_B w_i = λ_i S_W w_i, i = 1, 2, ..., m (67). This is a multi-class version of the FLD, which we will discuss in more detail. In face recognition, things get a little more complicated because the within-class scatter matrix S_W is always singular. This is because the rank of S_W is at most N - c and the number of images in the learning set is commonly much smaller than the number of pixels in the image. This means we can choose W such that the within-class scatter is exactly zero. JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 56 / 101

95 Comparisons Case Study: Faces Method 4: FisherFaces Source: Belhumeur et al. IEEE TPAMI 19(7) To overcome this, project the image set to a lower dimensional space so that the resulting within-class scatter matrix S W is nonsingular. How? JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 57 / 101

96 Comparisons Case Study: Faces Method 4: FisherFaces Source: Belhumeur et al. IEEE TPAMI 19(7) To overcome this, project the image set to a lower dimensional space so that the resulting within-class scatter matrix S_W is nonsingular. How? Use PCA to first reduce the dimension to N - c and then FLD to reduce it to c - 1. W_opt^T is given by the product W_FLD^T W_PCA^T, where W_PCA = arg max_W |W^T S_T W| (68) and W_FLD = arg max_W |W^T W_PCA^T S_B W_PCA W| / |W^T W_PCA^T S_W W_PCA W| (69). JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 57 / 101
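A minimal Fisherfaces sketch (not the paper's code) following Eqs. (68)-(69): PCA down to N - c dimensions, then multi-class FLD down to c - 1, using scipy.linalg.eigh for the generalized eigenproblem.

```python
import numpy as np
from scipy.linalg import eigh

def fisherfaces_fit(X, y):
    """X: (N, n_pixels) vectorized images, y: (N,) integer labels.
    Returns the combined projection W (n_pixels, c-1) and the global mean."""
    classes = np.unique(y)
    N, c = len(y), len(classes)
    mu = X.mean(axis=0)

    # Step 1: PCA to N - c dimensions so the reduced S_W is nonsingular.
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    W_pca = Vt[:N - c].T
    Z = (X - mu) @ W_pca

    # Step 2: FLD in the reduced space.
    mu_z = Z.mean(axis=0)
    Sb = np.zeros((N - c, N - c))
    Sw = np.zeros((N - c, N - c))
    for ci in classes:
        Zi = Z[y == ci]
        mi = Zi.mean(axis=0)
        Sb += len(Zi) * np.outer(mi - mu_z, mi - mu_z)
        Sw += (Zi - mi).T @ (Zi - mi)
    evals, evecs = eigh(Sb, Sw)              # generalized problem: Sb w = lambda Sw w
    W_fld = evecs[:, ::-1][:, :c - 1]        # top c-1 generalized eigenvectors
    return W_pca @ W_fld, mu
```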

97 Comparisons Case Study: Faces Experiment 1 and 2: Variation in Lighting Source: Belhumeur et al. IEEE TPAMI 19(7) Hypothesis: face recognition algorithms will perform better if they exploit the fact that images of a Lambertian surface lie in a linear subspace. Used Hallinan's Harvard database, which sampled the space of light source directions in 15-degree increments. Used 330 images of five people (66 of each) and extracted five subsets. Classification is nearest neighbor in all cases. JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 58 / 101

98 Comparisons Case Study: Faces Experiment 1 and 2: Variation in Lighting Source: Belhumeur et al. IEEE TPAMI 19(7) (Fig. 3 from the paper: the highlighted lines of longitude and latitude indicate the light source directions for Subsets 1 through 5; each intersection of a longitudinal and latitudinal line on the right side of the illustration has a corresponding image in the database.) JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 59 / 101

99 Comparisons Case Study: Faces Experiment 1: Variation in Lighting Source: Belhumeur et al. IEEE TPAMI 19(7) Subset 1 contains 30 images for which both the longitudinal and latitudinal angles of light source direction are within 15 degrees of the camera axis, including the lighting direction coincident with the camera's optical axis. Subset 5 contains 105 images for which the greater of the longitudinal and latitudinal angles of light source direction are 75 degrees from the camera axis. For all experiments, classification was performed using a nearest neighbor classifier. The Yale database is available for download. (Fig. 4 from the paper: example images from each subset of the Harvard database used to test the four algorithms.) JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 60 / 101

100 Comparisons Case Study: Faces Experiment 1: Variation in Lighting Extrapolation Source: Belhumeur et al. IEEE TPAMI 19(7) Train on Subset 1. Test on Subsets 1, 2, and 3. (Fig. 5 from the paper: Extrapolation. When each of the methods is trained on images with near-frontal illumination (Subset 1), the graph and corresponding table show the relative performance under extreme light source conditions; the table lists error rates for the Eigenface, Eigenface w/o 1st, Correlation, Linear Subspace, and Fisherface methods on Subsets 1 through 3.) JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 61 / 101

101 Comparisons Case Study: Faces Experiment 2: Variation in Lighting Interpolation Source: Belhumeur et al. IEEE TPAMI 19(7) Train on Subsets 1 and 5. Test on Subsets 2, 3, and 4. (Fig. 6 from the paper: Interpolation. When each of the methods is trained on images from both near-frontal and extreme lighting (Subsets 1 and 5), the graph and corresponding table show the relative performance under intermediate lighting conditions; the table lists error rates for the Eigenface, Eigenface w/o 1st, Correlation, Linear Subspace, and Fisherface methods on Subsets 2 through 4.) JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 62 / 101

102 Comparisons Case Study: Faces Experiment 1 and 2: Variation in Lighting Source: Belhumeur et al. IEEE TPAMI 19(7) All of the algorithms perform perfectly when lighting is nearly frontal. However, when lighting is moved off axis, there is significant difference between the methods (spec. the class-specific methods and the Eigenface method). JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 63 / 101

103 Comparisons Case Study: Faces Experiment 1 and 2: Variation in Lighting Source: Belhumeur et al. IEEE TPAMI 19(7) All of the algorithms perform perfectly when lighting is nearly frontal. However, when lighting is moved off axis, there is significant difference between the methods (spec. the class-specific methods and the Eigenface method). Empirically demonstrated that the Eigenface method is equivalent to correlation when the number of Eigenfaces equals the size of the training set (Exp. 1). JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 63 / 101

104 Comparisons Case Study: Faces Experiment 1 and 2: Variation in Lighting Source: Belhumeur et al. IEEE TPAMI 19(7) All of the algorithms perform perfectly when lighting is nearly frontal. However, when lighting is moved off axis, there is significant difference between the methods (spec. the class-specific methods and the Eigenface method). Empirically demonstrated that the Eigenface method is equivalent to correlation when the number of Eigenfaces equals the size of the training set (Exp. 1). In the Eigenface method, removing the first three principal components results in better performance under variable lighting conditions. JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 63 / 101

105 Comparisons Case Study: Faces Experiment 1 and 2: Variation in Lighting Source: Belhumeur et al. IEEE TPAMI 19(7) All of the algorithms perform perfectly when lighting is nearly frontal. However, when lighting is moved off axis, there is significant difference between the methods (spec. the class-specific methods and the Eigenface method). Empirically demonstrated that the Eigenface method is equivalent to correlation when the number of Eigenfaces equals the size of the training set (Exp. 1). In the Eigenface method, removing the first three principal components results in better performance under variable lighting conditions. Linear Subspace has comparable error rates with the FisherFace method, but it requires 3x as much storage and takes three times as long. JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 63 / 101

106 Comparisons Case Study: Faces Experiment 1 and 2: Variation in Lighting Source: Belhumeur et al. IEEE TPAMI 19(7) All of the algorithms perform perfectly when lighting is nearly frontal. However, when lighting is moved off axis, there is significant difference between the methods (spec. the class-specific methods and the Eigenface method). Empirically demonstrated that the Eigenface method is equivalent to correlation when the number of Eigenfaces equals the size of the training set (Exp. 1). In the Eigenface method, removing the first three principal components results in better performance under variable lighting conditions. Linear Subspace has comparable error rates with the FisherFace method, but it requires 3x as much storage and takes three times as long. The Fisherface method had error rates lower than the Eigenface method and required less computation time. JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 63 / 101

107 Comparisons Case Study: Faces Experiment 3: Yale DB Source: Belhumeur et al. IEEE TPAMI 19(7) Uses a second database of 16 subjects with ten images each (five varying in expression). (Fig. 9 from the paper: the graph and corresponding table show the relative performance of the algorithms under leaving-one-out on the Yale database; the table lists error rates for the Eigenface, Eigenface w/o 1st, Correlation, Linear Subspace, and Fisherface methods for both closely cropped and full-face images.) JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 64 / 101

108 Comparisons Case Study: Faces Experiment 3: Comparing Variation with Number of Principal Components Source: Belhumeur et al. IEEE TPAMI 19(7) (Fig. 8 from the paper: as demonstrated on the Yale database, the variation in performance of the Eigenface method depends on the number of principal components retained; dropping the first three appears to improve performance.) JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 65 / 101

109 Comparisons Case Study: Faces JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 66 / 101

110 Comparisons Case Study: Faces Can PCA outperform FLD for recognition? Source: Martínez and Kak. PCA versus LDA. PAMI 23(2), There may be situations in which PCA might outperform FLD. Can you think of such a situation? JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 67 / 101

111 Comparisons Case Study: Faces Can PCA outperform FLD for recognition? Source: Martínez and Kak. PCA versus LDA. PAMI 23(2), There may be situations in which PCA might outperform FLD. Can you think of such a situation? When we have few training data, it may be preferable to describe the total scatter. (Figure 1 from the paper: two different classes embedded in two different "Gaussian-like" distributions, with only a few samples per class supplied to the learning procedure (PCA or LDA); the classification result of PCA (using only the first eigenvector) is more desirable than the result of LDA. D_PCA and D_LDA represent the decision thresholds obtained by nearest-neighbor classification.) JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 67 / 101

112 Comparisons Case Study: Faces Can PCA outperform FLD for recognition? Source: Martínez and Kak. PCA versus LDA. PAMI 23(2). They tested on the AR face database and found affirmative results. (Figure 4: Shown here are performance curves for three different ways of dividing the data into a training set and a testing set.) JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 68 / 101
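The small-training-set effect can be reproduced with a rough synthetic sketch. This is not from the Martínez–Kak paper: the class means, shared covariance, sample counts, and helper names below are invented for illustration. With only a couple of training samples per class, the within-class scatter S_W is estimated so poorly that the resulting LDA direction often generalizes worse than the leading PCA direction, which only needs the total scatter.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two elongated "Gaussian-like" classes with a shared covariance (illustrative values only).
mean0, mean1 = np.array([0.0, 0.0]), np.array([1.0, 0.5])
cov = np.array([[4.0, 1.5], [1.5, 1.0]])

def sample(mean, n):
    return rng.multivariate_normal(mean, cov, size=n)

# Very small training set: 2 samples per class (the regime discussed above).
X_train = np.vstack([sample(mean0, 2), sample(mean1, 2)])
y_train = np.array([0, 0, 1, 1])
X_test = np.vstack([sample(mean0, 500), sample(mean1, 500)])
y_test = np.hstack([np.zeros(500, int), np.ones(500, int)])

def pca_direction(X):
    Xc = X - X.mean(axis=0)
    vals, vecs = np.linalg.eigh(Xc.T @ Xc)   # leading eigenvector of the total scatter
    return vecs[:, -1]

def lda_direction(X, y):
    m0, m1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    Sw = sum((Xi - mi).T @ (Xi - mi)
             for Xi, mi in [(X[y == 0], m0), (X[y == 1], m1)])
    # With so few samples, Sw is poorly estimated (and possibly singular); use a pseudo-inverse.
    return np.linalg.pinv(Sw) @ (m1 - m0)

def accuracy(w, X_tr, y_tr, X_te, y_te):
    # Project to 1-D and classify by the nearest projected class mean.
    z_tr, z_te = X_tr @ w, X_te @ w
    mu = np.array([z_tr[y_tr == 0].mean(), z_tr[y_tr == 1].mean()])
    pred = np.argmin(np.abs(z_te[:, None] - mu[None, :]), axis=1)
    return (pred == y_te).mean()

print("PCA accuracy:", accuracy(pca_direction(X_train), X_train, y_train, X_test, y_test))
print("LDA accuracy:", accuracy(lda_direction(X_train, y_train), X_train, y_train, X_test, y_test))
```

Depending on the random draw, the two accuracies trade places; the point is only that nothing forces LDA to win when its scatter estimates rest on a handful of samples.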

114 Extensions Multiple Discriminant Analysis Multiple Discriminant Analysis We can generalize the Fisher Linear Discriminant to multiple classes. Indeed, we saw it done in the Case Study on Faces. But, let's cover it with some rigor. Let's make sure we're all at the same place: How many discriminant functions will be involved in a c class problem? Key: There will be c - 1 projection functions for a c class problem, and hence the projection will be from a d-dimensional space to a (c - 1)-dimensional space. d must be greater than or equal to c. JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 69 / 101

115 Extensions Multiple Discriminant Analysis MDA The Projection The projection is accomplished by c - 1 discriminant functions: y_i = w_i^T x, i = 1, ..., c - 1 (70) which is summarized in matrix form as y = W^T x (71) where W is the d × (c - 1) projection matrix whose i-th column is w_i. (FIGURE 3.6. Three three-dimensional distributions are projected onto two-dimensional subspaces, described by normal vectors W1 and W2. Informally, multiple discriminant ...) JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 70 / 101
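A minimal shape-level sketch of the projection in Eqs. (70)-(71). The toy data and the random W below are assumptions made for illustration; MDA's actual W comes from the criterion developed on the following slides.

```python
import numpy as np

rng = np.random.default_rng(1)

d, c = 4, 3                    # feature dimension and number of classes
n_per_class = 50

# Toy data: c Gaussian blobs in d dimensions (illustrative values only).
X = np.vstack([rng.normal(loc=3.0 * k, scale=1.0, size=(n_per_class, d))
               for k in range(c)])

# Any d x (c-1) matrix W defines the projection y = W^T x; here W is random
# just to show the shapes involved.
W = rng.normal(size=(d, c - 1))

Y = X @ W                      # each row is y^T = x^T W, i.e., y = W^T x
print(X.shape, "->", Y.shape)  # (150, 4) -> (150, 2)
```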

116 Extensions Multiple Discriminant Analysis MDA Within-Class Scatter Matrix The generalization for the Within-Class Scatter Matrix is straightforward: S_W = \sum_{i=1}^{c} S_i (72) where, as before, S_i = \sum_{x \in D_i} (x - m_i)(x - m_i)^T and m_i = \frac{1}{n_i} \sum_{x \in D_i} x. JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 71 / 101

118 Extensions Multiple Discriminant Analysis MDA Between-Class Scatter Matrix The between-class scatter matrix S_B is not so easy to generalize. Let's define a total mean vector m = \frac{1}{n} \sum_x x = \frac{1}{n} \sum_{i=1}^{c} n_i m_i (73) Recall the total scatter matrix S_T = \sum_x (x - m)(x - m)^T (74) JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 72 / 101

120 Extensions Multiple Discriminant Analysis MDA Between-Class Scatter Matrix Then it follows that S_T = \sum_{i=1}^{c} \sum_{x \in D_i} (x - m_i + m_i - m)(x - m_i + m_i - m)^T (75) = \sum_{i=1}^{c} \sum_{x \in D_i} (x - m_i)(x - m_i)^T + \sum_{i=1}^{c} \sum_{x \in D_i} (m_i - m)(m_i - m)^T (76) = S_W + \sum_{i=1}^{c} n_i (m_i - m)(m_i - m)^T (77) So, we can define this second term as a generalized between-class scatter matrix S_B. The total scatter is the sum of the within-class scatter and the between-class scatter. JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 73 / 101
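A small NumPy sketch of these definitions on made-up labelled data; it builds S_W, S_B, and S_T as above and checks the decomposition in Eq. (77). The data values and sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy labelled data: 3 classes in 4 dimensions (values are arbitrary).
X = np.vstack([rng.normal(loc=k, scale=1.0, size=(30, 4)) for k in range(3)])
y = np.repeat(np.arange(3), 30)

m = X.mean(axis=0)                              # total mean (Eq. 73)

S_W = np.zeros((4, 4))
S_B = np.zeros((4, 4))
for k in np.unique(y):
    Xk = X[y == k]
    mk = Xk.mean(axis=0)                        # class mean m_i
    S_W += (Xk - mk).T @ (Xk - mk)              # within-class scatter (Eq. 72)
    S_B += len(Xk) * np.outer(mk - m, mk - m)   # between-class scatter (Eq. 77)

S_T = (X - m).T @ (X - m)                       # total scatter (Eq. 74)
print(np.allclose(S_T, S_W + S_B))              # True: S_T = S_W + S_B
```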

121 Extensions Multiple Discriminant Analysis MDA Objective Page 1 We again seek a criterion that will maximize the ratio of the between-class scatter of the projected vectors to the within-class scatter of the projected vectors. Recall that we can write down the scatter in terms of these projections: \tilde{m}_i = \frac{1}{n_i} \sum_{y \in Y_i} y and \tilde{m} = \frac{1}{n} \sum_{i=1}^{c} n_i \tilde{m}_i (78) \tilde{S}_W = \sum_{i=1}^{c} \sum_{y \in Y_i} (y - \tilde{m}_i)(y - \tilde{m}_i)^T (79) \tilde{S}_B = \sum_{i=1}^{c} n_i (\tilde{m}_i - \tilde{m})(\tilde{m}_i - \tilde{m})^T (80) JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 74 / 101

123 Extensions Multiple Discriminant Analysis MDA Objective Page 2 Then, we can show \tilde{S}_W = W^T S_W W (81) \tilde{S}_B = W^T S_B W (82) A simple scalar measure of scatter is the determinant of the scatter matrix. The determinant is the product of the eigenvalues and thus is the product of the variation along the principal directions. J(W) = \frac{|\tilde{S}_B|}{|\tilde{S}_W|} = \frac{|W^T S_B W|}{|W^T S_W W|} (83) JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 75 / 101
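A hedged sketch of evaluating the criterion in Eq. (83). The helper name mda_objective and the illustrative scatter matrices are assumptions made for this example, not anything from the slides; any symmetric positive semidefinite matrices of matching size would do.

```python
import numpy as np

rng = np.random.default_rng(3)

def mda_objective(W, S_B, S_W):
    # J(W) = |W^T S_B W| / |W^T S_W W|   (Eq. 83)
    return np.linalg.det(W.T @ S_B @ W) / np.linalg.det(W.T @ S_W @ W)

# Illustrative scatter matrices for d = 4, c = 3.
A = rng.normal(size=(4, 4)); S_W = A @ A.T + 4 * np.eye(4)  # well-conditioned within-class scatter
B = rng.normal(size=(4, 2)); S_B = B @ B.T                  # rank <= c-1, like a between-class scatter

W = rng.normal(size=(4, 2))        # a candidate d x (c-1) projection
print(mda_objective(W, S_B, S_W))  # larger values mean better class separation under this measure
```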

126 Extensions Multiple Discriminant Analysis MDA Solution The columns of an optimal W are the generalized eigenvectors that correspond to the largest eigenvalues in S_B w_i = \lambda_i S_W w_i (84) If S_W is nonsingular, then this can be converted to a conventional eigenvalue problem. Or, we could notice that the rank of S_B is at most c - 1 and do some clever algebra... The solution for W is, however, not unique and would allow arbitrary scaling and rotation, but these would not change the ultimate classification. JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 76 / 101
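A possible end-to-end sketch of this solution on toy data, using SciPy's generalized symmetric eigensolver for Eq. (84). All data, sizes, and variable names are illustrative assumptions; the point is only the mechanics of taking the top c - 1 generalized eigenvectors as the columns of W.

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(4)

# Toy data: c = 3 classes in d = 5 dimensions (illustrative values only).
c, d, n = 3, 5, 40
X = np.vstack([rng.normal(loc=2.0 * k, scale=1.0, size=(n, d)) for k in range(c)])
y = np.repeat(np.arange(c), n)

m = X.mean(axis=0)
S_W = np.zeros((d, d)); S_B = np.zeros((d, d))
for k in range(c):
    Xk = X[y == k]; mk = Xk.mean(axis=0)
    S_W += (Xk - mk).T @ (Xk - mk)
    S_B += len(Xk) * np.outer(mk - m, mk - m)

# Generalized eigenproblem S_B w = lambda S_W w (Eq. 84); eigh returns eigenvalues
# in ascending order, so the last c-1 eigenvectors correspond to the largest ones.
evals, evecs = eigh(S_B, S_W)
W = evecs[:, -(c - 1):]

Y = X @ W                # project to c-1 = 2 dimensions
print(W.shape, Y.shape)  # (5, 2) (120, 2)
```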

129 Extensions Image PCA IMPCA Source: Yang and Yang: IMPCA vs. PCA, Pattern Recognition v35, Observation: To apply PCA and FLD on images, we need to first vectorize them, which can lead to high-dimensional vectors. Solving the associated eigen-problems is a very time-consuming process. So, can we apply PCA on the images directly? Yes! This will be accomplished by what the authors call the image total covariance matrix. JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 77 / 101

130 Extensions Image PCA IMPCA: Problem Set-Up Source: Yang and Yang: IMPCA vs. PCA, Pattern Recognition v35, Define A \in R^{m \times n} as our image. Let w denote an n-dimensional column vector, which will represent the subspace onto which we will project an image: y = A w (85) which yields the m-dimensional vector y. You might think of this w as a feature selector. We, again, want to maximize the total scatter... JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 78 / 101

131 Extensions Image PCA IMPCA: Image Total Scatter Matrix Source: Yang and Yang: IMPCA vs. PCA, Pattern Recognition v35, We have M total data samples. The sample mean image is \bar{A} = \frac{1}{M} \sum_{j=1}^{M} A_j (86) And the projected sample mean is \tilde{m} = \frac{1}{M} \sum_{j=1}^{M} y_j (87) = \frac{1}{M} \sum_{j=1}^{M} A_j w (88) = \bar{A} w (89) JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 79 / 101

132 Extensions Image PCA IMPCA: Image Total Scatter Matrix Source: Yang and Yang: IMPCA vs. PCA, Pattern Recognition v35, The scatter of the projected samples is \tilde{S} = \sum_{j=1}^{M} (y_j - \tilde{m})(y_j - \tilde{m})^T (90) = \sum_{j=1}^{M} [(A_j - \bar{A}) w][(A_j - \bar{A}) w]^T (91) tr(\tilde{S}) = w^T \left[ \sum_{j=1}^{M} (A_j - \bar{A})^T (A_j - \bar{A}) \right] w (92) JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 80 / 101

135 Extensions Image PCA IMPCA: Image Total Scatter Matrix Source: Yang and Yang: IMPCA vs. PCA, Pattern Recognition v35, So, the image total scatter matrix is S_I = \sum_{j=1}^{M} (A_j - \bar{A})^T (A_j - \bar{A}) (93) And, a suitable criterion, in a form we've seen before, is J_I(w) = w^T S_I w (94) We know already that the vectors w that maximize J_I are the orthonormal eigenvectors of S_I corresponding to the largest eigenvalues of S_I. JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 81 / 101
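A sketch of IMPCA along these lines, with random arrays standing in for face images; the variable names and the choice of 8 projection vectors are assumptions made for illustration, not the paper's implementation. The key point is that the eigenproblem is on the n × n matrix S_I rather than on the (m·n) × (m·n) covariance of vectorized images.

```python
import numpy as np

rng = np.random.default_rng(5)

# Illustrative stack of M "images" of size m x n (random data standing in for faces).
M, m, n = 50, 112, 92
A = rng.normal(size=(M, m, n))

A_bar = A.mean(axis=0)              # mean image (Eq. 86)

# Image total scatter matrix S_I (Eq. 93): only n x n.
S_I = np.zeros((n, n))
for Aj in A:
    D = Aj - A_bar
    S_I += D.T @ D

# The orthonormal eigenvectors of S_I with the largest eigenvalues maximize J_I(w) = w^T S_I w.
evals, evecs = np.linalg.eigh(S_I)  # ascending order
k = 8                               # number of projection vectors (an arbitrary choice here)
W = evecs[:, -k:]

# Each image projects to an m x k feature matrix, one column y = A w per projection vector (Eq. 85).
features = np.stack([Aj @ W for Aj in A])
print(features.shape)               # (50, 112, 8)
```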

136 Extensions Image PCA IMPCA: Experimental Results Source: Yang and Yang: IMPCA vs. PCA, Pattern Recognition v35, Dataset is the ORL database: 10 different images taken of each of 40 individuals. Facial expressions and details vary. Even the pose can vary slightly: a small rotation/tilt and up to 10% scale variation. Size is normalized (92 × 112). (Fig. 1. Five images of one person in the ORL face database.) The first five images of each person are used for training and the second five are used for testing. JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 82 / 101

137 Extensions Image PCA IMPCA: Experimental Results Source: Yang and Yang: IMPCA vs. PCA, Pattern Recognition v35, IMPCA varying the number of extracted eigenvectors (NN classifier): (Table 1: Recognition rates (%) based on IMPCA projected features as the number of projection vectors varies from 1 to 10, for the minimum-distance and nearest-neighbor classifiers.) JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 83 / 101

138 Extensions Image PCA IMPCA: Experimental Results Source: Yang and Yang: IMPCA vs. PCA, Pattern Recognition v35, IMPCA vs. Eigenfaces vs. Fisherfaces for recognition. Table 2: Comparison of the maximal recognition rates using the three methods. Recognition rate (Eigenfaces / Fisherfaces / IMPCA): Minimum distance 89.5% (46) / 88.5% (39) / 91.0%; Nearest neighbor 93.5% (37) / 88.5% (39) / 95.5%. Note: the value in parentheses denotes the number of axes at which the maximal recognition rate is achieved. JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 84 / 101

139 Extensions Image PCA IMPCA: Experimental Results Source: Yang and Yang: IMPCA vs. PCA, Pattern Recognition v35, IMPCA vs. Eigenfaces vs. Fisherfaces for speed. (Table 3: The CPU time consumed for feature extraction and classification using the three methods — feature extraction time, classification time, and total time for Eigenfaces (37 axes), Fisherfaces (39 axes), and IMPCA (112 × 8 features).) The time consumed for feature extraction using IMPCA is much less than for the other methods. JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 84 / 101

142 Extensions Image PCA IMPCA: Discussion Source: Yang and Yang: IMPCA vs. PCA, Pattern Recognition v35, There is a clear speed-up because the amount of computation has been greatly reduced. However, this speed-up comes at some cost. What is that cost? Why does IMPCA work in this case? What is IMPCA really doing? JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 85 / 101

144 Non-linear Dimension Reduction Locally Linear Embedding Locally Linear Embedding Source: Saul and Roweis, An Introduction to Locally Linear Embedding, 2001 So far, we have covered a few methods for dimension reduction that all make the underlying assumption that the data in high dimension lives on a planar manifold in the lower dimension. The methods are easy to implement. The methods do not suffer from local minima. But, what if the data is non-linear? (Figure: panels (A) and (B).) JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 86 / 101

146 Non-linear Dimension Reduction Locally Linear Embedding The Non-Linear Problem Source: Saul and Roweis, An Introduction to Locally Linear Embedding, 2001 In non-linear dimension reduction, one must discover the global internal coordinates of the manifold without signals that explicitly indicate how the data should be embedded in the lower dimension (or even how many dimensions should be used). The LLE way of doing this is to make the assumption that, given enough data samples, the local frame of a particular point x i is linear. LLE proceeds to preserve this local structure while simultaneously reducing the global dimension (indeed preserving the local structure gives LLE the necessary constraints to discover the manifold). JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 87 / 101

149 Non-linear Dimension Reduction Locally Linear Embedding LLE: Neighborhoods Source: Saul and Roweis, An Introduction to Locally Linear Embedding, 2001 Suppose we have n real-valued vectors x i in dataset D each of dimension D. We assume these data are sampled from some smooth underlying manifold. Furthermore, we assume that each data point and its neighbors lie on or close to a locally linear patch of the manifold. LLE characterizes the local geometry of these patches by linear coefficients that reconstruct each data point from its neighbors. If the neighbors form the D-dimensional simplex, then these coefficients form the barycentric coordinates of the data point. In the simplest form of LLE, one identifies K such nearest neighbors based on the Euclidean distance. But, one can use all points in a ball of fixed radius, or even more sophisticated metrics. JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 88 / 101
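A simple sketch of the neighborhood-selection step in its basic K-nearest-neighbor form (Euclidean distance, brute force). The helper name knn_indices and the toy curve data are assumptions for illustration; fixed-radius or other metrics would slot in at the same place.

```python
import numpy as np

def knn_indices(X, K):
    """Indices of the K nearest neighbors (Euclidean) of each row of X, excluding itself."""
    # Pairwise squared distances via ||a - b||^2 = ||a||^2 - 2 a.b + ||b||^2.
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] - 2.0 * (X @ X.T) + sq[None, :]
    np.fill_diagonal(d2, np.inf)          # a point is not its own neighbor
    return np.argsort(d2, axis=1)[:, :K]

# Toy data sampled near a 1-D curve embedded in 3-D (illustrative only).
rng = np.random.default_rng(6)
t = rng.uniform(0, 3 * np.pi, size=200)
X = np.column_stack([np.cos(t), np.sin(t), t]) + 0.01 * rng.normal(size=(200, 3))
print(knn_indices(X, K=5).shape)          # (200, 5)
```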

150 Non-linear Dimension Reduction Locally Linear Embedding LLE: Neighborhoods Source: Saul and Roweis, An Introduction to Locally Linear Embedding, 2001 We can write down the reconstruction error with the following cost function: J_{LLE1}(W) = \sum_i \| x_i - \sum_j W_{ij} x_j \|^2 (95) Notice that each row of the weight matrix W will be nonzero for only K columns; i.e., W is a quite sparse matrix. I.e., if we define the set N(x_i) as the K neighbors of x_i, then we enforce W_{ij} = 0 if x_j \notin N(x_i) (96) We will enforce that the rows sum to 1, i.e., \sum_j W_{ij} = 1. (This is for invariance.) JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 89 / 101

153 Non-linear Dimension Reduction Locally Linear Embedding LLE: Neighborhoods Source: Saul and Roweis, An Introduction to Locally Linear Embedding, 2001 Note that the constrained weights that minimize these reconstruction errors obey important symmetries: for any data point, they are invariant to rotations, rescalings, and translations of that data point and its neighbors. This means that these weights characterize intrinsic geometric properties of the neighborhood as opposed to any properties that depend on a particular frame of reference. Suppose the data lie on or near a smooth nonlinear manifold of dimension d ≪ D. Then, to a good approximation, there exists a linear mapping (a translation, rotation, and scaling) that maps the high dimensional coordinates of each neighborhood to global internal coordinates on the manifold. JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 90 / 101

154 Non-linear Dimension Reduction Locally Linear Embedding LLE: Neighborhoods Source: Saul and Roweis, An Introduction to Locally Linear Embedding, 2001 So, we expect the characterization of the local neighborhoods (W) in the original space to be equally valid for the local patches on the manifold. In other words, the same weights W_{ij} that reconstruct point x_i in the original space should also reconstruct it in the embedded manifold coordinate system. First, let's solve for the weights. Then, we'll see how to use this to ultimately compute the global dimension reduction. JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 91 / 101

155 Non-linear Dimension Reduction Locally Linear Embedding Solving for the Weights Source: Saul and Roweis, An Introduction to Locally Linear Embedding, 2001 Solving for the weights W is a constrained least-squares problem. Consider a particular x with K nearest neighbors \eta_j and weights w_j which sum to one. We can write the reconstruction error as \epsilon = \| x - \sum_j w_j \eta_j \|^2 (97) = \| \sum_j w_j (x - \eta_j) \|^2 (98) = \sum_{jk} w_j w_k C_{jk} (99) where C_{jk} is the local covariance C_{jk} = (x - \eta_j)^T (x - \eta_k). JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 92 / 101

157 Non-linear Dimension Reduction Locally Linear Embedding Solving for the Weights Source: Saul and Roweis, An Introduction to Locally Linear Embedding, 2001 We need to add a Lagrange multiplier to enforce the constraint \sum_j w_j = 1, and then the weights can be solved in closed form. The optimal weights, in terms of the local covariance matrix, are w_j = \frac{\sum_k C^{-1}_{jk}}{\sum_{lm} C^{-1}_{lm}}. (100) But, rather than explicitly inverting the covariance matrix, one can solve the linear system \sum_k C_{jk} w_k = 1 (101) and rescale the weights so that they sum to one. JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 93 / 101
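A sketch of this weight-solving step, following the linear-system route above. Here `neighbors` is assumed to be an (n, K) integer array of neighbor indices (for example, from the earlier KNN sketch), and the small regularization of the local covariance is an implementation safeguard for the case K > D, not something stated on the slide.

```python
import numpy as np

def lle_weights(X, neighbors):
    """Reconstruction weights W_ij (Eqs. 95-101): sparse rows, each summing to one."""
    n, K = neighbors.shape
    W = np.zeros((n, n))
    for i in range(n):
        eta = X[neighbors[i]]                      # the K neighbors of x_i
        G = (X[i] - eta) @ (X[i] - eta).T          # local covariance C_jk (Eq. 99)
        G += 1e-3 * np.trace(G) * np.eye(K) / K    # regularize in case C is singular (K > D)
        w = np.linalg.solve(G, np.ones(K))         # solve sum_k C_jk w_k = 1 (Eq. 101)
        W[i, neighbors[i]] = w / w.sum()           # rescale so the row sums to one
    return W

# Usage (with the hypothetical knn_indices helper sketched earlier):
# W = lle_weights(X, knn_indices(X, K=5))
```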

158 Non-linear Dimension Reduction Locally Linear Embedding Stage 2: Neighborhood Preserving Mapping Source: Saul and Roweis, An Introduction to Locally Linear Embedding, 2001 Each high-dimensional input vector x_i is mapped to a low-dimensional vector y_i representing the global internal coordinates on the manifold. LLE does this by choosing the d-dimensional coordinates y_i to minimize the embedding cost function: J_{LLE2}(y) = \sum_i \| y_i - \sum_j W_{ij} y_j \|^2 (102) The basis for the cost function is the same locally linear reconstruction errors, but here the weights W are fixed and the coordinates y are optimized. JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 94 / 101

159 Non-linear Dimension Reduction Locally Linear Embedding Stage 2: Neighborhood Preserving Mapping Source: Saul and Roweis, An Introduction to Locally Linear Embedding, 2001 This defines a quadratic form: J_{LLE2}(y) = \sum_i \| y_i - \sum_j W_{ij} y_j \|^2 (103) = \sum_{ij} M_{ij} (y_i^T y_j) (104) where M_{ij} = \delta_{ij} - W_{ij} - W_{ji} + \sum_k W_{ki} W_{kj} (105) with \delta_{ij} equal to 1 if i = j and 0 otherwise. JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 95 / 101

160 Non-linear Dimension Reduction Locally Linear Embedding Stage 2: Neighborhood Preserving Mapping Source: Saul and Roweis, An Introduction to Locally Linear Embedding, 2001 To remove the degree of freedom for translation, we enforce the coordinates to be centered at the origin: \sum_i y_i = 0 (106) To avoid degenerate solutions, we constrain the embedding vectors to have unit covariance, with their outer products satisfying \frac{1}{n} \sum_i y_i y_i^T = I (107) The optimal embedding is found from the bottom d + 1 eigenvectors of M, i.e., those corresponding to its d + 1 smallest eigenvalues. The bottom eigenvector is the constant unit vector corresponding to the free translation; it is discarded, and the remaining d eigenvectors give the embedding coordinates. JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 96 / 101
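A sketch of this embedding step, assuming a weight matrix W whose rows sum to one (for example, from the previous sketch). It builds M as in Eq. (105) via M = (I - W)^T (I - W) and keeps the d eigenvectors just above the discarded constant one; the sqrt(n) scaling reflects the unit-covariance normalization of Eq. (107). The function name is an assumption for illustration.

```python
import numpy as np

def lle_embed(W, d):
    """Embedding coordinates Y (Eqs. 102-107) from the LLE weight matrix W."""
    n = W.shape[0]
    I = np.eye(n)
    # Entries of M = (I - W)^T (I - W) match Eq. (105):
    # M_ij = delta_ij - W_ij - W_ji + sum_k W_ki W_kj
    M = (I - W).T @ (I - W)
    evals, evecs = np.linalg.eigh(M)        # ascending eigenvalues
    # Discard the bottom (constant) eigenvector and keep the next d as coordinates.
    Y = evecs[:, 1:d + 1] * np.sqrt(n)      # so that (1/n) sum_i y_i y_i^T = I
    return Y

# Usage with the earlier hypothetical helpers:
# Y = lle_embed(lle_weights(X, knn_indices(X, K=5)), d=2)
```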

161 Non-linear Dimension Reduction Locally Linear Embedding Putting It Together Source: Saul and Roweis, An Introduction to Locally Linear Embedding, 2001 LLE ALGORITHM: 1. Compute the neighbors of each data point. 2. Compute the weights that best reconstruct each data point from its neighbors, minimizing the cost in eq. (1) by constrained linear fits. 3. Compute the vectors best reconstructed by the weights, minimizing the quadratic form in eq. (2) by its bottom nonzero eigenvectors. (Figure 2: Summary of the LLE algorithm, mapping high dimensional data points to low dimensional embedding vectors. From Saul and Roweis 2001.) LLE illustrates a general principle of manifold learning, elucidated by Tenenbaum et al. [11], that overlapping local neighborhoods, collectively analyzed, can provide information about global geometry. Once the neighbors are chosen, the optimal weights and coordinates are computed by standard methods in linear algebra. The algorithm involves a single pass through the three steps in Fig. 2 and finds global minima of the reconstruction and embedding costs. JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 97 / 101

162 Non-linear Dimension Reduction Locally Linear Embedding Example 1 Source: Saul and Roweis, An Introduction to Locally Linear Embedding, 2001 K=12. (Figure 1: The problem of nonlinear dimensionality reduction, as illustrated for three-dimensional data (B) sampled from a two-dimensional manifold (A). An unsupervised learning algorithm must discover the global internal coordinates of the manifold without signals that explicitly indicate how the data should be embedded. Panels (A), (B), (C).) JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 98 / 101

163 Non-linear Dimension Reduction Locally Linear Embedding Example 2 Source: Saul and Roweis, An Introduction to Locally Linear Embedding, 2001 K=4. JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 99 / 101


More information

Eigenimages. Digital Image Processing: Bernd Girod, 2013 Stanford University -- Eigenimages 1

Eigenimages. Digital Image Processing: Bernd Girod, 2013 Stanford University -- Eigenimages 1 Eigenimages " Unitary transforms" Karhunen-Loève transform" Eigenimages for recognition" Sirovich and Kirby method" Example: eigenfaces" Eigenfaces vs. Fisherfaces" Digital Image Processing: Bernd Girod,

More information

Lecture: Face Recognition and Feature Reduction

Lecture: Face Recognition and Feature Reduction Lecture: Face Recognition and Feature Reduction Juan Carlos Niebles and Ranjay Krishna Stanford Vision and Learning Lab Lecture 11-1 Recap - Curse of dimensionality Assume 5000 points uniformly distributed

More information

Dimensionality Reduction: PCA. Nicholas Ruozzi University of Texas at Dallas

Dimensionality Reduction: PCA. Nicholas Ruozzi University of Texas at Dallas Dimensionality Reduction: PCA Nicholas Ruozzi University of Texas at Dallas Eigenvalues λ is an eigenvalue of a matrix A R n n if the linear system Ax = λx has at least one non-zero solution If Ax = λx

More information

ECE 592 Topics in Data Science

ECE 592 Topics in Data Science ECE 592 Topics in Data Science Final Fall 2017 December 11, 2017 Please remember to justify your answers carefully, and to staple your test sheet and answers together before submitting. Name: Student ID:

More information

Linear Algebra Methods for Data Mining

Linear Algebra Methods for Data Mining Linear Algebra Methods for Data Mining Saara Hyvönen, Saara.Hyvonen@cs.helsinki.fi Spring 2007 Linear Discriminant Analysis Linear Algebra Methods for Data Mining, Spring 2007, University of Helsinki Principal

More information

STA 414/2104: Lecture 8

STA 414/2104: Lecture 8 STA 414/2104: Lecture 8 6-7 March 2017: Continuous Latent Variable Models, Neural networks With thanks to Russ Salakhutdinov, Jimmy Ba and others Outline Continuous latent variable models Background PCA

More information

Lecture 7: Con3nuous Latent Variable Models

Lecture 7: Con3nuous Latent Variable Models CSC2515 Fall 2015 Introduc3on to Machine Learning Lecture 7: Con3nuous Latent Variable Models All lecture slides will be available as.pdf on the course website: http://www.cs.toronto.edu/~urtasun/courses/csc2515/

More information

CS 4495 Computer Vision Principle Component Analysis

CS 4495 Computer Vision Principle Component Analysis CS 4495 Computer Vision Principle Component Analysis (and it s use in Computer Vision) Aaron Bobick School of Interactive Computing Administrivia PS6 is out. Due *** Sunday, Nov 24th at 11:55pm *** PS7

More information