Dimension Reduction and Component Analysis Lecture 4


Dimension Reduction and Component Analysis, Lecture 4. Jason Corso, SUNY at Buffalo, 28 January 2009.

Plan: In Lecture 3 we learned about estimating parametric models and how these then form classifiers and define decision boundaries (Lecture 2). Now we turn back to the question of dimensionality. Recall the fish example, where we experimented with the length feature first, then the lightness feature, and then decided upon a combination of the width and lightness.

[DHS Figures 1.2-1.4: histograms of the length and lightness features for salmon and sea bass show that no single threshold on either feature alone discriminates the two categories unambiguously, though lightness gives fewer errors than length; the scatter plot of lightness versus width shows a linear decision boundary whose overall classification error is lower than with either single feature, though some errors remain. From Duda, Hart, and Stork, Pattern Classification, 2001.]

We developed some intuition saying "the more features I add, the better my classifier will be..." We will see that in theory this may be so, but in practice it is not: the probability of error will increase after a certain number of features (dimensions) has been reached. We will first explore this point and then discuss a set of methods for dimension reduction.

Dimensionality: High Dimensions Often Test Our Intuitions. Consider a simple arrangement: you have a sphere of radius r = 1 in a space of D dimensions. We want to compute the fraction of the sphere's volume that lies between radius r = 1 - ɛ and r = 1. Noting that the volume of the sphere scales with r^D, we have

V_D(r) = K_D r^D    (1)

where K_D is a constant depending only on D. The fraction is then

[V_D(1) - V_D(1 - ɛ)] / V_D(1) = 1 - (1 - ɛ)^D    (2)

[Plot: volume fraction versus ɛ for D = 1, 2, 5, 20; as D grows, nearly all of the volume lies in a thin shell near the surface.]
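
As a quick numerical check of Eq. (2) (a minimal sketch; the function name is mine), the shell fraction 1 - (1 - ɛ)^D can be tabulated directly:

```python
# Fraction of a unit D-ball's volume lying in the shell between r = 1 - eps and r = 1.
def shell_fraction(eps, D):
    return 1.0 - (1.0 - eps) ** D

# Even a thin shell (eps = 0.1) holds most of the volume once D is large.
for D in (1, 2, 5, 20, 100):
    print(D, shell_fraction(0.1, D))
```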

Dimensionality: Let's Build Some Better Intuition (example from Bishop, PRML). Dataset: measurements taken from a pipeline containing a mixture of oil. Three classes are present (different geometrical configurations): homogeneous, annular, and laminar. Each data point is a 12-dimensional input vector consisting of measurements taken with gamma ray densitometers, which measure the attenuation of gamma rays passing along narrow beams through the pipe. [Scatter plot of features x6 and x7.]

Dimensionality: Let's Build Some Better Intuition (example from Bishop, PRML). 100 data points of features x6 and x7 are shown in the scatter plot. Goal: classify the new data point marked with a cross. Suggestions on how we might approach this classification problem?

Dimensionality: Let's Build Some Better Intuition (example from Bishop, PRML). Observations we can make: the cross is surrounded by many red points and some green points, while the blue points are quite far from the cross. Nearest-neighbor intuition: the query point should be determined more strongly by nearby points from the training set and less strongly by more distant points.

Dimensionality: Let's Build Some Better Intuition (example from Bishop, PRML). One simple way of doing this: divide the feature space into regular cells. For each cell, associate the class that occurs most frequently in that cell (in our training data). Then, for a query point, determine which cell it falls into and assign the label associated with that cell. What problems may exist with this approach?

Dimensionality: Let's Build Some Better Intuition (example from Bishop, PRML). The problem we are most interested in now becomes apparent when we add more variables into the mix, corresponding to problems of higher dimensionality. In this case, the number of cells grows exponentially with the dimensionality of the space, so we would need an exponentially large training data set to ensure all cells are filled. [Illustration: a regular grid over the feature space for D = 1, 2, and 3.]

Dimensionality Curse of Dimensionality This severe difficulty when working in high dimensions was coined the curse of dimensionality by Bellman in 1961. The idea is that the volume of a space increases exponentially with the dimensionality of the space. J. Corso (SUNY at Buffalo) Lecture 4 28 January 2009 9 / 103

Dimensionality and Classification Error? (Some parts taken from G. V. Trunk, IEEE TPAMI, Vol. 1, No. 3, pp. 306-307, 1979.) How does the probability of error vary as we add more features, in theory? Consider the following two-class problem. The prior probabilities are known and equal: P(ω1) = P(ω2) = 1/2. The class-conditional densities are Gaussian with unit covariance:

p(x|ω1) ~ N(µ1, I)    (3)
p(x|ω2) ~ N(µ2, I)    (4)

where µ1 = µ, µ2 = -µ, and µ is an n-vector whose ith component is (1/i)^(1/2). The corresponding Bayes decision rule is: decide ω1 if x^T µ > 0.    (5)

Dimensionality and Classification Error? The probability of error is

P(error) = (1/sqrt(2π)) ∫_{r/2}^{∞} exp[-z²/2] dz    (6)

where

r² = ||µ1 - µ2||² = 4 Σ_{i=1}^{n} (1/i)    (7)

Let's take this integral for granted (for more detail, see DHS Problem 31 in Chapter 2 and Section 2.7). [DHS Figure 2.17: components of the probability of error for equal priors; moving the decision point away from the Bayes-optimal x* adds a reducible error.] What can we say about this result as more features are added?

Dimensionality and Classification Error? The probability of error approaches 0 as n approaches infinity, because Σ(1/i) is a divergent series. More intuitively, each additional feature will decrease the probability of error as long as its class means differ. In the general case of features with differing means and per-feature variances σ_i, we have

r² = Σ_{i=1}^{d} [(µ_{i1} - µ_{i2}) / σ_i]²    (8)

Certainly, we prefer features whose means differ greatly relative to their variance. Note that if the probabilistic structure of the problem is completely known, adding new features cannot increase the Bayes risk (at worst it leaves it unchanged).
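
As a small numeric sketch (names are mine) of how the error in Eq. (6) shrinks as features are added, note that P(error) is the standard-normal tail beyond r/2, which can be evaluated with the complementary error function:

```python
import math

def bayes_error(r):
    # P(error) = integral from r/2 to infinity of N(0,1) = 0.5 * erfc(r / (2*sqrt(2)))
    return 0.5 * math.erfc(r / (2.0 * math.sqrt(2.0)))

for n in (1, 2, 5, 10, 100, 1000):
    r = 2.0 * math.sqrt(sum(1.0 / i for i in range(1, n + 1)))  # r^2 = 4 * sum_{i=1}^{n} 1/i
    print(n, bayes_error(r))
```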

Dimensionality and Classification Error? So, adding dimensions is good... in theory. [DHS Figure 3.3: two three-dimensional distributions have nonoverlapping densities, so in three dimensions the Bayes error vanishes; when projected to a subspace (the two-dimensional x1-x2 subspace or the one-dimensional x1 subspace) the projected distributions overlap more, and the Bayes error is greater.] But, in practice, performance seems not to obey this theory.

Dimensionality and Classification Error? Consider again the two-class problem, but this time with unknown means µ1 and µ2. Instead, we have m labeled samples x1, ..., xm. The best estimate of µ for each class is then the sample mean (recall last lecture):

µ̂ = (1/m) Σ_{i=1}^{m} x_i    (9)

where x_i has been replaced by -x_i if x_i comes from ω2. The covariance matrix of this estimate is I/m.

Dimensionality and Classification Error? The probability of error is now given by

P(error) = P(x^T µ̂ ≥ 0 | ω2) = (1/sqrt(2π)) ∫_{γ_n}^{∞} exp[-z²/2] dz    (10)

because the statistic has a Gaussian form as n approaches infinity, where

γ_n = E(z) / [var(z)]^{1/2}    (11)
E(z) = Σ_{i=1}^{n} (1/i)    (12)
var(z) = (1 + 1/m) Σ_{i=1}^{n} (1/i) + n/m    (13)

The key is that we can show

lim_{n→∞} γ_n = 0    (14)

and thus the probability of error approaches one-half as the dimensionality of the problem becomes very high.

Dimensionality Dimensionality and Classification Error? Some parts taken from G. V. Trunk, TPAMI Vol. 1 No. 3 PP. 306-7 1979 Trunk performed an experiment to investigate the convergence rate of the probability of error to one-half. He simulated the problem for dimensionality 1 to 1000 and ran 500 repetitions for each dimension. We see an increase in performance initially and then a decrease (as the dimensionality of the problem grows larger than the number of training samples). J. Corso (SUNY at Buffalo) Lecture 4 28 January 2009 17 / 103
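
The sketch below mimics the spirit of Trunk's experiment in the estimated-mean regime (a rough illustration under my own choices of m, trial count, and dimension grid, not the paper's exact protocol):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 30  # labeled samples used to estimate the mean (my choice)

def error_rate(n, trials=500):
    mu = 1.0 / np.sqrt(np.arange(1, n + 1))           # true mean, mu_i = (1/i)^(1/2)
    errors = 0
    for _ in range(trials):
        # Estimate the mean from m samples of class omega_1 (unit covariance).
        mu_hat = (mu + rng.standard_normal((m, n))).mean(axis=0)
        # Classify one fresh test point drawn from omega_2 (true mean -mu).
        x = -mu + rng.standard_normal(n)
        errors += (x @ mu_hat > 0)                    # rule: decide omega_1 if x . mu_hat > 0
    return errors / trials

for n in (1, 10, 100, 1000):
    print(n, error_rate(n))   # error rises back toward 0.5 once n far exceeds m
```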

Dimension Reduction Motivation for Dimension Reduction The discussion on the curse of dimensionality should be enough! Even though our problem may have a high dimension, data will often be confined to a much lower effective dimension in most real world problems. Computational complexity is another important point: generally, the higher the dimension, the longer the training stage will be (and potentially the measurement and classification stages). We seek an understanding of the underlying data to Remove or reduce the influence of noisy or irrelevant features that would otherwise interfere with the classifier; Identify a small set of features (perhaps, a transformation thereof) during data exploration. J. Corso (SUNY at Buffalo) Lecture 4 28 January 2009 18 / 103

Dimension Reduction Version 0 (Principal Component Analysis). We have n samples {x1, x2, ..., xn}. How can we best represent the n samples by a single vector x0? First, we need a distance function on the sample space. Let's use the sum-of-squared-distances criterion:

J0(x0) = Σ_{k=1}^{n} ||x0 - xk||²    (15)

Then, we seek the value of x0 that minimizes J0.

Dimension Reduction Version 0 (Principal Component Analysis). We can show that the minimizer is indeed the sample mean:

m = (1/n) Σ_{k=1}^{n} xk    (16)

We can verify this by adding and subtracting m inside J0:

J0(x0) = Σ_{k=1}^{n} ||(x0 - m) - (xk - m)||²    (17)
       = Σ_{k=1}^{n} ||x0 - m||² + Σ_{k=1}^{n} ||xk - m||²    (18)

where the cross term vanishes because Σ_k (xk - m) = 0. Thus, J0 is minimized when x0 = m; note that the second term is independent of x0.

Dimension Reduction Version 0 (Principal Component Analysis). So, the sample mean is an initial dimension-reduced representation of the data (a zero-dimensional one). It is simple, but it does not reveal any of the variability in the data. Let's try to obtain a one-dimensional representation: i.e., let's project the data onto a line running through the sample mean.

Dimension Reduction Version 1 (Principal Component Analysis). Let e be a unit vector in the direction of the line. The standard equation for a line is then

x = m + a e    (19)

where the scalar a ∈ R governs which point along the line we are at and hence corresponds to the (signed) distance of the point x from the mean m. Represent point xk by m + ak e. We can find an optimal set of coefficients by again minimizing the squared-error criterion:

J1(a1, ..., an, e) = Σ_{k=1}^{n} ||(m + ak e) - xk||²    (20)
                   = Σ_{k=1}^{n} ak² ||e||² - 2 Σ_{k=1}^{n} ak e^T (xk - m) + Σ_{k=1}^{n} ||xk - m||²    (21)

Dimension Reduction Version 1 (Principal Component Analysis). Differentiating with respect to ak and equating to 0 yields

ak = e^T (xk - m)    (22)

This indicates that the best value for ak is the projection of the point xk onto the line through m in the direction e. How do we find the best direction for that line?

Dimension Reduction Version 1 (Principal Component Analysis). What if we substitute the expression we computed for the best ak directly into the J1 criterion:

J1(e) = Σ_{k=1}^{n} ak² - 2 Σ_{k=1}^{n} ak² + Σ_{k=1}^{n} ||xk - m||²    (23)
      = -Σ_{k=1}^{n} [e^T (xk - m)]² + Σ_{k=1}^{n} ||xk - m||²    (24)
      = -Σ_{k=1}^{n} e^T (xk - m)(xk - m)^T e + Σ_{k=1}^{n} ||xk - m||²    (25)

Dimension Reduction Version 1: The Scatter Matrix (Principal Component Analysis). Define the scatter matrix S as

S = Σ_{k=1}^{n} (xk - m)(xk - m)^T    (26)

This should be familiar: it is a multiple of the sample covariance matrix. Substituting it in:

J1(e) = -e^T S e + Σ_{k=1}^{n} ||xk - m||²    (27)

The e that maximizes e^T S e will therefore minimize J1.

Dimension Reduction Version 1: Solving for e (Principal Component Analysis). We use a Lagrange multiplier to maximize e^T S e subject to the constraint ||e|| = 1:

u = e^T S e - λ e^T e    (28)

Differentiating with respect to e and setting equal to 0:

∂u/∂e = 2 S e - 2 λ e = 0    (29)
S e = λ e    (30)

Does this form look familiar?

Dimension Reduction Version 1: Eigenvectors of S (Principal Component Analysis). Se = λe is an eigenproblem. Hence, it follows that the best one-dimensional representation of the data (in a least-squares sense) comes from the eigenvector corresponding to the largest eigenvalue of S. So, we project the data onto the largest eigenvector of S and translate the line to pass through the mean.
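
A minimal numpy sketch of this result (function and variable names are mine): form the scatter matrix and take the eigenvector with the largest eigenvalue as the one-dimensional subspace.

```python
import numpy as np

def first_principal_direction(X):
    """X: (n, d) array with samples as rows. Returns the mean m and the top eigenvector e of S."""
    m = X.mean(axis=0)
    Xc = X - m
    S = Xc.T @ Xc                    # scatter matrix, Eq. (26)
    vals, vecs = np.linalg.eigh(S)   # eigh since S is symmetric; eigenvalues in ascending order
    return m, vecs[:, -1]            # eigenvector with the largest eigenvalue

# Usage: project each sample onto the line x = m + a e, with a_k = e^T (x_k - m), Eq. (22).
X = np.random.default_rng(1).normal(size=(100, 5))
m, e = first_principal_direction(X)
a = (X - m) @ e
```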

Principal Component Analysis. We're already done... basically. This idea readily extends to multiple dimensions, say d' < d dimensions. We replace the earlier equation of the line with

x = m + Σ_{i=1}^{d'} a_i e_i    (31)

and we have a new criterion function

J_{d'} = Σ_{k=1}^{n} ||(m + Σ_{i=1}^{d'} a_{ki} e_i) - xk||²    (32)

Principal Component Analysis. J_{d'} is minimized when the vectors e1, ..., e_{d'} are the d' eigenvectors of the scatter matrix having the largest eigenvalues. These vectors are orthogonal and form a natural set of basis vectors for representing any feature vector x. The coefficients a_i are called the principal components. Visualize the basis vectors as the principal axes of a hyperellipsoid surrounding the data (a cloud of points). Principal component analysis reduces the dimension of the data by restricting attention to those directions of maximum variation, or scatter.
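
Extending the previous sketch to d' dimensions (again my own naming), keep the top d' eigenvectors as the basis:

```python
import numpy as np

def pca(X, d_prime):
    """X: (n, d) samples as rows. Returns the mean, a (d, d_prime) basis E, and the components A."""
    m = X.mean(axis=0)
    Xc = X - m
    S = Xc.T @ Xc
    vals, vecs = np.linalg.eigh(S)           # ascending eigenvalues
    E = vecs[:, ::-1][:, :d_prime]           # columns e_1, ..., e_d': top eigenvectors
    A = Xc @ E                               # principal components a_ki of Eq. (32)
    return m, E, A

# Approximate reconstruction from the reduced representation: X_hat = m + A @ E.T
```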

Dimension Reduction Fisher Linear Discriminant Fisher Linear Discriminant Description vs. Discrimination PCA is likely going to be useful for representing data. But, there is no reason to assume that it would be good for discriminating between two classes of data. Q versus O. Discriminant Analysis seeks directions that are efficient for discrimination. J. Corso (SUNY at Buffalo) Lecture 4 28 January 2009 30 / 103

Fisher Linear Discriminant. Intuition: Projection onto a Line. Consider the problem of projecting data from d dimensions onto a line (board...). Finding the orientation of the line for which the projected samples are well separated is the goal of classical discriminant analysis.

Fisher Linear Discriminant: Let's Formalize the Situation. Suppose we have a set of n d-dimensional samples, with n1 of them in subset D1 and n2 in subset D2:

D_i = {x1, ..., x_{n_i}},  i = 1, 2    (33)

We can form a linear combination of the components of a sample x:

y = w^T x    (34)

which yields a corresponding set of n samples y1, ..., yn split into subsets Y1 and Y2.

Fisher Linear Discriminant: Geometrically, This is a Projection. If we constrain the norm of w to be 1 (i.e., ||w|| = 1), then we can view each y_i as the projection of the corresponding x_i onto a line in the direction of w. [DHS Figure 3.5: projection of the same set of samples onto two different lines in the directions marked w; the figure on the right shows greater separation between the red and black projected points.] Does the magnitude of w have any real significance?

Fisher Linear Discriminant: What is a Good Projection? For our two-class setup, it should be clear that we want the projection that places the samples from class ω1 in one cluster (on the line) and the samples from class ω2 in a separate cluster (on the line). However, this may not be possible depending on our underlying classes. So, how do we find the best direction w?

Fisher Linear Discriminant: Separation of the Projected Means. Let m_i be the d-dimensional sample mean for class i:

m_i = (1/n_i) Σ_{x ∈ D_i} x    (35)

Then the sample mean of the projected points is

m̃_i = (1/n_i) Σ_{y ∈ Y_i} y    (36)
    = (1/n_i) Σ_{x ∈ D_i} w^T x    (37)
    = w^T m_i    (38)

and thus is simply the projection of m_i.

Fisher Linear Discriminant: Distance Between Projected Means. The distance between the projected means is

|m̃1 - m̃2| = |w^T (m1 - m2)|    (39)

The scale of w: we can make this distance arbitrarily large simply by scaling w. Rather, we want to make this distance large relative to some measure of the standard deviation... this is a story we have heard before.

Fisher Linear Discriminant: The Scatter. To capture this variation, we compute the scatter

s̃_i² = Σ_{y ∈ Y_i} (y - m̃_i)²    (40)

which is essentially a scaled sample variance. From this, we can estimate the variance of the pooled data:

(1/n) (s̃1² + s̃2²)    (41)

s̃1² + s̃2² is called the total within-class scatter of the projected samples.

Fisher Linear Discriminant. The Fisher linear discriminant selects the w that maximizes

J(w) = |m̃1 - m̃2|² / (s̃1² + s̃2²)    (42)

It does so independently of the magnitude of w. This criterion is the ratio of the (squared) distance between the projected means to the within-class scatter (the variation of the data). Recall the similar term from earlier in the lecture: the amount by which a feature reduces the probability of error is governed by the ratio of the difference of the means to the variance. FLD chooses the w that maximizes this ratio. We now need to rewrite J(·) as a function of w.

Fisher Linear Discriminant: Within-Class Scatter Matrix. Define the scatter matrices S_i:

S_i = Σ_{x ∈ D_i} (x - m_i)(x - m_i)^T    (43)

and

S_W = S1 + S2    (44)

S_W is called the within-class scatter matrix. It is symmetric and positive semidefinite. In typical cases, when is S_W nonsingular?

Fisher Linear Discriminant: Within-Class Scatter Matrix. Deriving the sum of scatters:

s̃_i² = Σ_{x ∈ D_i} (w^T x - w^T m_i)²    (45)
    = Σ_{x ∈ D_i} w^T (x - m_i)(x - m_i)^T w    (46)
    = w^T S_i w    (47)

We can therefore write the sum of the scatters as an explicit function of w:

s̃1² + s̃2² = w^T S_W w    (48)

Fisher Linear Discriminant: Between-Class Scatter Matrix. The separation of the projected means obeys

(m̃1 - m̃2)² = (w^T m1 - w^T m2)²    (49)
            = w^T (m1 - m2)(m1 - m2)^T w    (50)
            = w^T S_B w    (51)

Here, S_B is called the between-class scatter matrix:

S_B = (m1 - m2)(m1 - m2)^T    (52)

S_B is also symmetric and positive semidefinite. When is S_B nonsingular?

Fisher Linear Discriminant: Rewriting the FLD J(·). We can rewrite our objective as a function of w:

J(w) = (w^T S_B w) / (w^T S_W w)    (53)

This is the generalized Rayleigh quotient. The vector w that maximizes J(·) must satisfy

S_B w = λ S_W w    (54)

which is a generalized eigenvalue problem. For nonsingular S_W (the typical case), we can write this as a standard eigenvalue problem:

S_W^{-1} S_B w = λ w    (55)

Fisher Linear Discriminant: Simplifying. Since the magnitude of w is not important and S_B w is always in the direction of (m1 - m2), we can simplify to

w = S_W^{-1} (m1 - m2)    (56)

This w maximizes J(·) and is the Fisher linear discriminant. The FLD converts a many-dimensional problem to a one-dimensional one. One still must find the threshold on the projected values; this is easy for known densities, but not so easy in general.
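
A compact sketch of Eq. (56) for two classes (function and variable names are mine):

```python
import numpy as np

def fisher_direction(X1, X2):
    """X1, X2: (n_i, d) samples from the two classes. Returns w = S_W^{-1} (m1 - m2)."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - m1).T @ (X1 - m1)
    S2 = (X2 - m2).T @ (X2 - m2)
    SW = S1 + S2                          # within-class scatter matrix, Eq. (44)
    return np.linalg.solve(SW, m1 - m2)   # Eq. (56): solve rather than forming the inverse

# Projected samples: y = X @ w; a scalar threshold on y then separates the two classes.
```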

Comparisons: A Classic PCA vs. FLD Comparison. (Source: Belhumeur et al., IEEE TPAMI 19(7):711-720, 1997.) PCA maximizes the total scatter across all classes. PCA projections are thus optimal for reconstruction from a low-dimensional basis, but not necessarily from a discrimination standpoint. FLD, in contrast, maximizes the ratio of the between-class scatter to the within-class scatter of the projected samples; it tries to shape the scatter to make it more effective for classification. [Belhumeur et al., Fig. 2: a comparison of principal component analysis (PCA) and Fisher's linear discriminant (FLD) for a two-class problem where the data for each class lie near a linear subspace.]

Comparisons, Case Study: Eigenfaces versus Fisherfaces. (Source: Belhumeur et al., IEEE TPAMI 19(7):711-720, 1997.) An analysis of classic pattern recognition techniques (PCA and FLD) applied to face recognition with fixed pose but varying illumination. The variation in the resulting images caused by the varying illumination will nearly always dominate the variation caused by an identity change. [Belhumeur et al., Fig. 1: the same person seen under different lighting conditions can appear dramatically different; in the left image the dominant light source is nearly head-on, while in the right image it is from above and to the right.]

Comparisons, Case Study: Eigenfaces versus Fisherfaces. [Slide shows example images from the database described by Belhumeur et al.: 160 frontal face images covering 16 individuals taken under 10 different conditions (with and without glasses, three different point light sources, and five different facial expressions), cropped at two scales: a full-face version including part of the background, and a closely cropped version including brow, eyes, nose, mouth, and chin but excluding the face contour.]

Comparisons, Case Study: Faces. Is Linear Okay For Faces? (Source: Belhumeur et al., IEEE TPAMI 19(7):711-720, 1997.) Fact: all of the images of a Lambertian surface, taken from a fixed viewpoint but under varying illumination, lie in a 3D linear subspace of the high-dimensional image space. But, in the presence of shadowing, specularities, and facial expressions, this statement no longer holds. The result is deviation from the 3D linear subspace and worse classification accuracy.

Comparisons, Case Study: Faces. Method 1: Correlation. (Source: Belhumeur et al., IEEE TPAMI 19(7):711-720, 1997.) Consider a set of N training images {x1, x2, ..., xN}. Each of the N images belongs to one of c classes, and we can define a function C(·) that maps an image x to its class ω_c. Pre-processing: each image is normalized to have zero mean and unit variance. Why? This removes the light source intensity and the effect of the camera's automatic gain control. For a query image x, we select the class of the training image that is its nearest neighbor in image space:

x* = argmin_{x_i} ||x - x_i||, then decide C(x*)    (57)

where each image has been vectorized.
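
A small sketch of this normalized nearest-neighbor rule (helper names and array shapes are my own assumptions):

```python
import numpy as np

def normalize(v):
    # Zero mean, unit variance: removes overall light-source intensity and camera gain effects.
    v = v - v.mean()
    return v / v.std()

def correlation_classify(query, train_images, train_labels):
    """train_images: (N, n_pixels) vectorized images; train_labels: length-N list of class labels."""
    q = normalize(query)
    T = np.stack([normalize(t) for t in train_images])
    dists = np.linalg.norm(T - q, axis=1)       # nearest neighbor in image space, Eq. (57)
    return train_labels[int(np.argmin(dists))]
```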

Comparisons, Case Study: Faces. Method 1: Correlation. What are the advantages and disadvantages of the correlation-based method in this context? It is computationally expensive, requires a large amount of storage, and noise may play a role; on the other hand, it is highly parallelizable.

Comparisons, Case Study: Faces. Eigenfaces. (Source: Belhumeur et al., IEEE TPAMI 19(7):711-720, 1997. The original paper is Turk and Pentland, "Eigenfaces for Recognition," Journal of Cognitive Neuroscience, 3(1), 1991.) Quickly recall the main idea of PCA. Define a linear projection of the original n-dimensional image space into an m-dimensional space, with m < n (typically m << n), which yields new vectors y:

y_k = W^T x_k,  k = 1, 2, ..., N    (58)

where W ∈ R^{n×m}. Define the total scatter matrix S_T as

S_T = Σ_{k=1}^{N} (x_k - µ)(x_k - µ)^T    (59)

where µ is the sample mean image.

Comparisons, Case Study: Faces. Eigenfaces. The scatter of the projected vectors {y1, y2, ..., yN} is

W^T S_T W    (60)

W_opt is chosen to maximize the determinant of the total scatter matrix of the projected vectors:

W_opt = argmax_W |W^T S_T W|    (61)
      = [w1 w2 ... wm]    (62)

where {w_i | i = 1, 2, ..., m} is the set of n-dimensional eigenvectors of S_T corresponding to the m largest eigenvalues.
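
A bare-bones eigenfaces sketch (my own naming; practical implementations avoid forming the huge n×n scatter matrix by working with the SVD of the centered data, as done here):

```python
import numpy as np

def eigenfaces(X, m):
    """X: (N, n_pixels) training images as rows. Returns the mean face mu and W (n_pixels, m)."""
    mu = X.mean(axis=0)
    Xc = X - mu
    # Right singular vectors of the centered data are the eigenvectors of S_T = Xc^T Xc.
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:m].T                     # columns are the top-m eigenfaces
    return mu, W

# Projection of a vectorized image x into face space: y = W.T @ (x - mu), as in Eq. (58).
```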

Comparisons Case Study: Faces An Example: Source: http://www.cs.princeton.edu/ cdecoro/eigenfaces/. (not sure if this dataset include lighting variation...) J. Corso (SUNY at Buffalo) Lecture 4 28 January 2009 52 / 103

Comparisons, Case Study: Faces. Eigenfaces. What are the advantages and disadvantages of the eigenfaces method in this context? The scatter being maximized is due not only to the between-class scatter that is useful for classification but also to the within-class scatter, which is generally undesirable for classification. If PCA is presented with faces under varying illumination, the projection matrix W_opt will contain principal components that retain the variation in lighting; if this variation is larger than the variation due to class identity, PCA will suffer greatly for classification. On the other hand, it yields a much more compact representation than the correlation-based method.

Comparisons, Case Study: Faces. Method 3: Linear Subspaces. (Source: Belhumeur et al., IEEE TPAMI 19(7):711-720, 1997.) Use the Lambertian model directly. Consider a point p on a Lambertian surface illuminated by a point light source at infinity. Let s ∈ R³ signify the product of the light source intensity with the unit vector for the light source direction. The image intensity of the surface at p when viewed by a camera is

E(p) = a(p) n(p)^T s    (63)

where n(p) is the unit inward normal vector to the surface at point p, and a(p) is the albedo of the surface at p (a scalar). Hence, the image intensity of the point p is linear in s.

Comparisons, Case Study: Faces. Method 3: Linear Subspaces. So, if we assume no shadowing, then given three images of a Lambertian surface from the same viewpoint under three known, linearly independent light source directions, the albedo and surface normal can be recovered. Alternatively, one can reconstruct the image of the surface under an arbitrary lighting direction as a linear combination of the three original images. This fact can be used for classification: for each face (class), use three or more images taken under different lighting conditions to construct a 3D basis for its linear subspace. For recognition, compute the distance of a new image to each linear subspace and choose the face corresponding to the shortest distance.
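
A sketch of this distance-to-subspace rule (helper names are mine), using least squares to find each class's best reconstruction of the query image:

```python
import numpy as np

def subspace_distance(x, B):
    """Distance from a vectorized image x to the span of the columns of B (n_pixels, 3)."""
    coeffs, *_ = np.linalg.lstsq(B, x, rcond=None)   # best reconstruction within the subspace
    return np.linalg.norm(x - B @ coeffs)

def linear_subspace_classify(x, class_bases):
    """class_bases: dict mapping class label -> (n_pixels, 3) basis built from >= 3 images."""
    return min(class_bases, key=lambda c: subspace_distance(x, class_bases[c]))
```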

Comparisons, Case Study: Faces. Method 3: Linear Subspaces. Pros and cons? If there is no noise or shadowing (and the surface is indeed Lambertian), this method achieves error-free classification under any lighting conditions. But faces inevitably have self-shadowing, and faces have expressions... It is also still fairly computationally expensive (linear in the number of classes).

Comparisons, Case Study: Faces. Method 4: Fisherfaces. (Source: Belhumeur et al., IEEE TPAMI 19(7):711-720, 1997.) Recall the Fisher linear discriminant setup. The between-class scatter matrix is

S_B = Σ_{i=1}^{c} N_i (µ_i - µ)(µ_i - µ)^T    (64)

and the within-class scatter matrix is

S_W = Σ_{i=1}^{c} Σ_{x_k ∈ D_i} (x_k - µ_i)(x_k - µ_i)^T    (65)

The optimal projection W_opt is chosen as the matrix with orthonormal columns that maximizes the ratio of the determinant of the between-class scatter matrix of the projected vectors to the determinant of the within-class scatter matrix of the projected vectors:

W_opt = argmax_W |W^T S_B W| / |W^T S_W W|    (66)

Comparisons Case Study: Faces Method 4: FisherFaces
Source: Belhumeur et al., IEEE TPAMI 19(7):711-720, 1997.
The eigenvectors $\{w_i \mid i = 1, 2, \dots, m\}$ corresponding to the m largest eigenvalues of the following generalized eigenvalue problem comprise $W_{opt}$:
    $S_B w_i = \lambda_i S_W w_i, \quad i = 1, 2, \dots, m$    (67)
This is a multi-class version of the FLD we discussed.
In face recognition, things get a little more complicated because the within-class scatter matrix $S_W$ is always singular. This is because the rank of $S_W$ is at most $N - c$, and the number of images in the learning set is commonly much smaller than the number of pixels in the image. This means we can choose W such that the within-class scatter of the projected samples is exactly zero.
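When the (reduced-space) $S_W$ is nonsingular, Eq. (67) can be handed to SciPy's generalized symmetric-definite eigensolver. This is only a sketch under that assumption; the scatter matrices below are random stand-ins, and the small ridge added to S_W exists only to keep the toy example well-posed.

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
d, m = 20, 3
A = rng.normal(size=(d, d)); S_B = A @ A.T                      # stand-in for Eq. (64)
B = rng.normal(size=(d, d)); S_W = B @ B.T + 1e-3 * np.eye(d)   # stand-in for Eq. (65), kept nonsingular

# Generalized problem S_B w = lambda S_W w; eigh returns eigenvalues in ascending order.
eigvals, eigvecs = eigh(S_B, S_W)
W_opt = eigvecs[:, np.argsort(eigvals)[::-1][:m]]               # columns = top-m discriminant directions
```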

Comparisons Case Study: Faces Method 4: FisherFaces
Source: Belhumeur et al., IEEE TPAMI 19(7):711-720, 1997.
To overcome this, they project the image set to a lower dimensional space so that the resulting within-class scatter matrix $S_W$ is nonsingular. How? Use PCA to first reduce the dimension to $N - c$, and then FLD to reduce it to $c - 1$.
$W_{opt}^T$ is given by the product $W_{FLD}^T W_{PCA}^T$, where
    $W_{PCA} = \arg\max_W |W^T S_T W|$    (68)
    $W_{FLD} = \arg\max_W \dfrac{|W^T W_{PCA}^T S_B W_{PCA} W|}{|W^T W_{PCA}^T S_W W_{PCA} W|}$    (69)
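A compact sketch of the whole PCA-then-FLD pipeline under the assumptions above (vectorized images as rows, integer labels); it is meant to mirror Eqs. (68)-(69), not to reproduce the authors' code.

```python
import numpy as np
from scipy.linalg import eigh

def fisherfaces(X, y, c):
    """X: (N, d) rows are vectorized images; y: (N,) integer labels in {0..c-1}.
    Returns W of shape (d, c-1); features are X @ W."""
    N, d = X.shape
    Xc = X - X.mean(axis=0)

    # PCA step: keep the top N - c principal directions (columns of W_pca).
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    W_pca = Vt[: N - c].T                               # (d, N - c)

    # Scatter matrices in the PCA-reduced space (the global mean there is zero).
    Z = Xc @ W_pca
    S_B = np.zeros((N - c, N - c)); S_W = np.zeros((N - c, N - c))
    for k in range(c):
        Zk = Z[y == k]
        mk = Zk.mean(axis=0)
        S_B += len(Zk) * np.outer(mk, mk)
        S_W += (Zk - mk).T @ (Zk - mk)

    # FLD step: top c - 1 generalized eigenvectors of S_B w = lambda S_W w.
    evals, evecs = eigh(S_B, S_W)
    W_fld = evecs[:, np.argsort(evals)[::-1][: c - 1]]
    return W_pca @ W_fld
```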

Comparisons Case Study: Faces Experiment 1 and 2: Variation in Lighting
Source: Belhumeur et al., IEEE TPAMI 19(7):711-720, 1997.
Hypothesis: face recognition algorithms will perform better if they exploit the fact that images of a Lambertian surface lie in a linear subspace.
Used Hallinan's Harvard Database, which sampled the space of light source directions in 15 degree increments.
Used 330 images of five people (66 of each) and extracted five subsets.
Classification is nearest neighbor in all cases.

Comparisons Case Study: Faces Experiment 1 and 2: Variation in Lighting
Source: Belhumeur et al., IEEE TPAMI 19(7):711-720, 1997.
Fig. 3 (from the paper): The highlighted lines of longitude and latitude indicate the light source directions for Subsets 1 through 5. Each intersection of a longitudinal and latitudinal line on the right side of the illustration has a corresponding image in the database.

Comparisons Case Study: Faces Experiment 1: Variation in Lighting
Source: Belhumeur et al., IEEE TPAMI 19(7):711-720, 1997.
Subset 1 contains 30 images for which both the longitudinal and latitudinal angles of light source direction are within 15 degrees of the camera axis, including the lighting direction coincident with the camera's optical axis.
Subset 2 contains 45 images for which the greater of the longitudinal and latitudinal angles of light source direction are 30 degrees from the camera axis.
Subset 5 contains 105 images for which the greater of the longitudinal and latitudinal angles of light source direction are 75 degrees from the camera axis.
Sample images from each subset are shown in Fig. 4 of the paper (example images from each subset of the Harvard Database used to test the four algorithms).
For all experiments, classification was performed using a nearest neighbor classifier. The Yale database is available for download from http://cvc.yale.edu.
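All of the compared methods reduce an image to a low-dimensional feature vector and then apply this nearest neighbor rule; a generic sketch (array names are assumptions):

```python
import numpy as np

def nearest_neighbor(train_feats, train_labels, probe_feat):
    """train_feats: (N, k) projected training images; probe_feat: (k,).
    Returns the label of the closest training sample in Euclidean distance."""
    dists = np.linalg.norm(train_feats - probe_feat, axis=1)
    return train_labels[np.argmin(dists)]
```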

Comparisons Case Study: Faces Experiment 1: Variation in Lighting, Extrapolation
Source: Belhumeur et al., IEEE TPAMI 19(7):711-720, 1997.
Train on Subset 1. Test on Subsets 1, 2, and 3.
Extrapolating from Subset 1 (error rate, %):
  Method               Reduced Space   Subset 1   Subset 2   Subset 3
  Eigenface                  4            0.0       31.1       47.7
  Eigenface                 10            0.0        4.4       41.5
  Eigenface w/o 1st 3        4            0.0       13.3       41.5
  Eigenface w/o 1st 3       10            0.0        4.4       27.7
  Correlation               29            0.0        0.0       33.9
  Linear Subspace           15            0.0        4.4        9.2
  Fisherface                 4            0.0        0.0        4.6
Fig. 5 (from the paper): Extrapolation: When each of the methods is trained on images with near frontal illumination (Subset 1), the graph and corresponding table show the relative performance under extreme light source conditions.

Comparisons Case Study: Faces Experiment 2: Variation in Lighting, Interpolation
Source: Belhumeur et al., IEEE TPAMI 19(7):711-720, 1997.
Train on Subsets 1 and 5. Test on Subsets 2, 3, and 4.
Interpolating between Subsets 1 and 5 (error rate, %):
  Method               Reduced Space   Subset 2   Subset 3   Subset 4
  Eigenface                  4           53.3       75.4       52.9
  Eigenface                 10           11.11      33.9       20.0
  Eigenface w/o 1st 3        4           31.11      60.0       29.4
  Eigenface w/o 1st 3       10            6.7       20.0       12.9
  Correlation              129            0.0       21.54       7.1
  Linear Subspace           15            0.0        1.5        0.0
  Fisherface                 4            0.0        0.0        1.2
Fig. 6 (from the paper): Interpolation: When each of the methods is trained on images from both near frontal and extreme lighting (Subsets 1 and 5), the graph and corresponding table show the relative performance under intermediate lighting conditions.

Comparisons Case Study: Faces Experiment 1 and 2: Variation in Lighting
Source: Belhumeur et al., IEEE TPAMI 19(7):711-720, 1997.
All of the algorithms perform perfectly when lighting is nearly frontal. However, when lighting is moved off axis, there is a significant difference between the methods (specifically, between the class-specific methods and the Eigenface method).
Empirically demonstrated that the Eigenface method is equivalent to correlation when the number of Eigenfaces equals the size of the training set (Exp. 1).
In the Eigenface method, removing the first three principal components results in better performance under variable lighting conditions.
Linear Subspace has error rates comparable to the Fisherface method, but it requires 3x as much storage and takes three times as long.
The Fisherface method had error rates lower than the Eigenface method and required less computation time.

Comparisons Case Study: Faces Experiment 3: Yale DB
Source: Belhumeur et al., IEEE TPAMI 19(7):711-720, 1997.
Uses a second database of 16 subjects with ten images each, varying in facial expression and lighting.
Leaving-One-Out of Yale Database (error rate, %):
  Method               Reduced Space   Close Crop   Full Face
  Eigenface                 30             24.4        19.4
  Eigenface w/o 1st 3       30             15.3        10.8
  Correlation              160             23.9        20.0
  Linear Subspace           48             21.6        15.6
  Fisherface                15              7.3         0.6
Fig. 9 (from the paper): The graph and corresponding table show the relative performance of the algorithms when applied to the Yale Database, which contains variation in facial expression and lighting.

Comparisons Case Study: Faces Experiment 3: Comparing Variation with Number of Principal Components
Source: Belhumeur et al., IEEE TPAMI 19(7):711-720, 1997.
Fig. 8 (from the paper): As demonstrated on the Yale Database, the variation in performance of the Eigenface method depends on the number of principal components retained. Dropping the first three appears to improve performance.

Comparisons Case Study: Faces Can PCA outperform FLD for recognition?
Source: Martínez and Kak, "PCA versus LDA," IEEE TPAMI 23(2), 2001.
There may be situations in which PCA might outperform FLD. Can you think of such a situation?
When we have few training data, then it may be preferable to describe total scatter.
Figure 1 (from the paper): There are two different classes embedded in two different "Gaussian-like" distributions. However, only a few samples per class are supplied to the learning procedure (PCA or LDA). The classification result of the PCA projection (using only the first eigenvector) is more desirable than the result of the LDA. $D_{PCA}$ and $D_{LDA}$ represent the decision thresholds obtained by using nearest-neighbor classification.

Comparisons Case Study: Faces Can PCA outperform FLD for recognition?
Source: Martínez and Kak, "PCA versus LDA," IEEE TPAMI 23(2), 2001.
They tested on the AR face database and found affirmative results.
Figure 4 (from the paper): Shown here are performance curves for three different ways of dividing the data into a training set and a testing set.

Extensions Multiple Discriminant Analysis Multiple Discriminant Analysis
We can generalize the Fisher Linear Discriminant to multiple classes. Indeed, we saw it done in the Case Study on Faces. But let's cover it with some rigor.
Let's make sure we're all at the same place: How many discriminant functions will be involved in a c-class problem?
Key: There will be c - 1 projection functions for a c-class problem, and hence the projection will be from a d-dimensional space to a (c - 1)-dimensional space. d must be greater than or equal to c.

Extensions Multiple Discriminant Analysis MDA The Projection
The projection is accomplished by c - 1 discriminant functions:
    $y_i = w_i^T x, \quad i = 1, \dots, c - 1$    (70)
which is summarized in matrix form as
    $y = W^T x$    (71)
where W is the $d \times (c - 1)$ projection matrix.
FIGURE 3.6: Three three-dimensional distributions are projected onto two-dimensional subspaces, described by normal vectors W1 and W2.

Extensions Multiple Discriminant Analysis MDA Within-Class Scatter Matrix
The generalization for the within-class scatter matrix is straightforward:
    $S_W = \sum_{i=1}^{c} S_i$    (72)
where, as before, $S_i = \sum_{x \in D_i} (x - m_i)(x - m_i)^T$ and $m_i = \frac{1}{n_i} \sum_{x \in D_i} x$.
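A direct sketch of Eq. (72) in Python; the array layout (samples as rows, integer labels) is an assumption.

```python
import numpy as np

def within_class_scatter(X, y):
    """X: (N, d) samples as rows; y: (N,) integer class labels. Returns S_W, shape (d, d)."""
    d = X.shape[1]
    S_W = np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        m_c = Xc.mean(axis=0)                 # class mean m_i
        S_W += (Xc - m_c).T @ (Xc - m_c)      # S_i = sum_{x in D_i} (x - m_i)(x - m_i)^T
    return S_W
```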

Extensions Multiple Discriminant Analysis MDA Between-Class Scatter Matrix
The between-class scatter matrix $S_B$ is not so easy to generalize.
Let's define a total mean vector
    $m = \frac{1}{n} \sum_{x} x = \frac{1}{n} \sum_{i=1}^{c} n_i m_i$    (73)
Recall the total scatter matrix
    $S_T = \sum_{x} (x - m)(x - m)^T$    (74)

Extensions Multiple Discriminant Analysis MDA Between-Class Scatter Matrix
Then it follows that
    $S_T = \sum_{i=1}^{c} \sum_{x \in D_i} (x - m_i + m_i - m)(x - m_i + m_i - m)^T$    (75)
    $= \sum_{i=1}^{c} \sum_{x \in D_i} (x - m_i)(x - m_i)^T + \sum_{i=1}^{c} \sum_{x \in D_i} (m_i - m)(m_i - m)^T$    (76)
    $= S_W + \sum_{i=1}^{c} n_i (m_i - m)(m_i - m)^T$    (77)
where the cross terms vanish because $\sum_{x \in D_i} (x - m_i) = 0$.
So, we can define this second term as the generalized between-class scatter matrix, $S_B = \sum_{i=1}^{c} n_i (m_i - m)(m_i - m)^T$. The total scatter is the sum of the within-class scatter and the between-class scatter.
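The decomposition in Eqs. (75)-(77) is easy to check numerically; this sketch uses random stand-in data.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))                  # 60 samples in d = 5 dimensions
y = rng.integers(0, 3, size=60)               # 3 classes

m = X.mean(axis=0)
S_T = (X - m).T @ (X - m)                     # total scatter, Eq. (74)

S_W = np.zeros((5, 5)); S_B = np.zeros((5, 5))
for c in np.unique(y):
    Xc = X[y == c]
    m_c = Xc.mean(axis=0)
    S_W += (Xc - m_c).T @ (Xc - m_c)
    S_B += len(Xc) * np.outer(m_c - m, m_c - m)

print(np.allclose(S_T, S_W + S_B))            # True: total = within + between
```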

Extensions Multiple Discriminant Analysis MDA Objective Page 1
We again seek a criterion that will maximize the ratio of the between-class scatter of the projected vectors to the within-class scatter of the projected vectors.
Recall that we can write down the scatter in terms of these projections:
    $\tilde{m}_i = \frac{1}{n_i} \sum_{y \in Y_i} y$ and $\tilde{m} = \frac{1}{n} \sum_{i=1}^{c} n_i \tilde{m}_i$    (78)
    $\tilde{S}_W = \sum_{i=1}^{c} \sum_{y \in Y_i} (y - \tilde{m}_i)(y - \tilde{m}_i)^T$    (79)
    $\tilde{S}_B = \sum_{i=1}^{c} n_i (\tilde{m}_i - \tilde{m})(\tilde{m}_i - \tilde{m})^T$    (80)

Extensions Multiple Discriminant Analysis MDA Objective Page 2
Then, we can show
    $\tilde{S}_W = W^T S_W W$    (81)
    $\tilde{S}_B = W^T S_B W$    (82)
A simple scalar measure of scatter is the determinant of the scatter matrix. The determinant is the product of the eigenvalues and thus is the product of the variation along the principal directions.
    $J(W) = \dfrac{|\tilde{S}_B|}{|\tilde{S}_W|} = \dfrac{|W^T S_B W|}{|W^T S_W W|}$    (83)
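Given scatter matrices built as above, the criterion (83) is a one-liner to evaluate for any candidate projection; this helper and its names are illustrative only.

```python
import numpy as np

def mda_criterion(W, S_B, S_W):
    """J(W) = |W^T S_B W| / |W^T S_W W| from Eq. (83); W is (d, c-1)."""
    return np.linalg.det(W.T @ S_B @ W) / np.linalg.det(W.T @ S_W @ W)
```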

Extensions Multiple Discriminant Analysis MDA Solution
The columns of an optimal W are the generalized eigenvectors that correspond to the largest eigenvalues in
    $S_B w_i = \lambda_i S_W w_i$    (84)
If $S_W$ is nonsingular, then this can be converted to a conventional eigenvalue problem. Or, we could notice that the rank of $S_B$ is at most c - 1 and do some clever algebra...
The solution for W is, however, not unique and would allow arbitrary scaling and rotation, but these would not change the ultimate classification.
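A sketch of the conversion just mentioned: with $S_W$ nonsingular, Eq. (84) reduces to an ordinary eigenproblem for $S_W^{-1} S_B$. This is one reasonable implementation, not the only one.

```python
import numpy as np

def mda_projection(S_B, S_W, n_components):
    """Columns of the returned W are the top eigenvectors of S_W^{-1} S_B (Eq. 84, S_W nonsingular)."""
    # np.linalg.solve(S_W, S_B) computes S_W^{-1} S_B without forming the inverse explicitly.
    evals, evecs = np.linalg.eig(np.linalg.solve(S_W, S_B))   # the matrix is not symmetric, so use eig
    order = np.argsort(evals.real)[::-1]
    return evecs[:, order[:n_components]].real                # eigenvalues are real up to round-off
```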

Extensions Image PCA IMPCA
Source: Yang and Yang, "IMPCA vs. PCA," Pattern Recognition 35, 2002.
Observation: To apply PCA and FLD to images, we need to first vectorize them, which can lead to very high-dimensional vectors. Solving the associated eigen-problems is a very time-consuming process.
So, can we apply PCA to the images directly?
Yes! (But we will argue that this has some drawbacks, in contrast to the authors' results.) This will be accomplished by what the authors call the image total covariance matrix.

Extensions Image PCA IMPCA: Problem Set-Up
Source: Yang and Yang, "IMPCA vs. PCA," Pattern Recognition 35, 2002.
Define $A \in \mathbb{R}^{m \times n}$ as our image. Let w denote an n-dimensional column vector, which represents the subspace onto which we will project the image:
    $y = A w$    (85)
which yields an m-dimensional vector y. You might think of this w as a feature selector. We, again, want to maximize the total scatter...

Extensions Image PCA IMPCA: Image Total Scatter Matrix
Source: Yang and Yang, "IMPCA vs. PCA," Pattern Recognition 35, 2002.
We have M total data samples. The sample mean image is
    $\bar{M} = \frac{1}{M} \sum_{j=1}^{M} A_j$    (86)
And the projected sample mean is
    $\bar{m} = \frac{1}{M} \sum_{j=1}^{M} y_j$    (87)
    $= \frac{1}{M} \sum_{j=1}^{M} A_j w$    (88)
    $= \bar{M} w$    (89)

Extensions Image PCA IMPCA: Image Total Scatter Matrix
Source: Yang and Yang, "IMPCA vs. PCA," Pattern Recognition 35, 2002.
The scatter of the projected samples is
    $\tilde{S} = \sum_{j=1}^{M} (y_j - \bar{m})(y_j - \bar{m})^T$    (90)
    $= \sum_{j=1}^{M} \big[(A_j - \bar{M}) w\big]\big[(A_j - \bar{M}) w\big]^T$    (91)
Measuring this scatter by its trace gives the scalar
    $\mathrm{tr}(\tilde{S}) = w^T \Big[\sum_{j=1}^{M} (A_j - \bar{M})^T (A_j - \bar{M})\Big] w$    (92)

Extensions Image PCA IMPCA: Image Total Scatter Matrix
Source: Yang and Yang, "IMPCA vs. PCA," Pattern Recognition 35, 2002.
So, the image total scatter matrix is
    $S_I = \sum_{j=1}^{M} (A_j - \bar{M})^T (A_j - \bar{M})$    (93)
And a suitable criterion, in a form we've seen before, is
    $J_I(w) = w^T S_I w$    (94)
We know already that the vectors w that maximize $J_I$ are the orthonormal eigenvectors of $S_I$ corresponding to the largest eigenvalues of $S_I$.
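A minimal numpy sketch of IMPCA as summarized in Eqs. (85) and (93)-(94); the image stack, sizes, and number of components are made-up, and this is not the authors' code.

```python
import numpy as np

def impca(images, n_components):
    """images: (M, m, n) stack of images. Returns W (n, n_components) and features (M, m, n_components)."""
    mean_image = images.mean(axis=0)                       # the mean image M-bar
    centered = images - mean_image                         # A_j - M-bar
    # Image total scatter matrix S_I = sum_j (A_j - M)^T (A_j - M), an n x n matrix (Eq. 93).
    S_I = np.einsum("jmi,jmk->ik", centered, centered)
    evals, evecs = np.linalg.eigh(S_I)                     # S_I is symmetric
    W = evecs[:, np.argsort(evals)[::-1][:n_components]]   # top eigenvectors maximize J_I (Eq. 94)
    return W, images @ W                                   # y_j = A_j W for each image

# Toy usage with random arrays standing in for 112 x 92 face images.
rng = np.random.default_rng(0)
imgs = rng.normal(size=(10, 112, 92))
W, feats = impca(imgs, n_components=5)
print(W.shape, feats.shape)                                # (92, 5) (10, 112, 5)
```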

Extensions Image PCA IMPCA: Experimental Results
Source: Yang and Yang, "IMPCA vs. PCA," Pattern Recognition 35, 2002.
The dataset is the ORL database: 10 different images of each of 40 individuals. Facial expressions and details vary. Even the pose can vary slightly: up to 20 degrees of rotation/tilt and 10% scale change. Image size is normalized to 92 x 112.
Fig. 1 (from the paper): Five images of one person in the ORL face database.
The first five images of each person are used for training and the second five are used for testing.