PCA and LDA

Man-Wai MAK
Dept. of Electronic and Information Engineering, The Hong Kong Polytechnic University
enmwmak@polyu.edu.hk
http://www.eie.polyu.edu.hk/~mwmak

References:
S. J. D. Prince, Computer Vision: Models, Learning, and Inference, Cambridge University Press, 2012.
C. Bishop, Pattern Recognition and Machine Learning, Appendix E, Springer, 2006.

October 26, 2018
Overview

1 Dimension Reduction
  Why Dimension Reduction
  Dimension Reduction: Reduce to 1-Dim
2 Principal Component Analysis
  Derivation of PCA
  PCA on High-Dimensional Data
  Eigenface
3 Linear Discriminant Analysis
  LDA on 2-Class Problems
  LDA on Multi-class Problems
Why Dimension Reduction

Many applications produce high-dimensional vectors:
  In face recognition, if an image has size 360 × 260 pixels, the dimension is 93,600.
  In handwritten digit recognition, if a digit occupies 28 × 28 pixels, the dimension is 784.
  In speaker recognition, the dimension can be as high as 61,440 per utterance.
High-dimensional feature vectors can easily cause the curse-of-dimensionality problem.
Redundancy: Some of the elements in the feature vectors are strongly correlated, so knowing one element largely determines the values of some other elements.
Irrelevancy: Some elements in the feature vectors are irrelevant to the classification task.
Dimension Reduction

Given a feature vector $\mathbf{x} \in \mathbb{R}^D$, dimensionality reduction aims to find a low-dimensional representation $\mathbf{h} \in \mathbb{R}^M$ that can approximately explain $\mathbf{x}$:
$$\mathbf{x} \approx f(\mathbf{h}, \boldsymbol{\theta}) \tag{1}$$
where $f(\cdot,\cdot)$ is a function that takes the hidden variable $\mathbf{h}$ and a set of parameters $\boldsymbol{\theta}$, and $M \ll D$.
Typically, we choose the function family $f(\cdot,\cdot)$ and then learn $\mathbf{h}$ and $\boldsymbol{\theta}$ from training data.
Least squares criterion: Given $N$ training vectors $\mathcal{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$, $\mathbf{x}_i \in \mathbb{R}^D$, we find the parameters $\boldsymbol{\theta}$ and latent variables $\mathbf{h}_i$ that minimize the sum of squared errors:
$$\hat{\boldsymbol{\theta}}, \{\hat{\mathbf{h}}_i\}_{i=1}^N = \operatorname*{argmin}_{\boldsymbol{\theta}, \{\mathbf{h}_i\}_{i=1}^N} \left\{ \sum_{i=1}^N \left[\mathbf{x}_i - f(\mathbf{h}_i, \boldsymbol{\theta})\right]^T \left[\mathbf{x}_i - f(\mathbf{h}_i, \boldsymbol{\theta})\right] \right\} \tag{2}$$
Dimension Reduction: Reduce to 1-Dim

Approximate vector $\mathbf{x}_i$ by a scalar value $h_i$ plus the global mean $\boldsymbol{\mu}$:
$$\mathbf{x}_i \approx \boldsymbol{\phi} h_i + \boldsymbol{\mu}, \quad \text{where } \boldsymbol{\mu} = \frac{1}{N}\sum_{i=1}^N \mathbf{x}_i, \quad \boldsymbol{\phi} \in \mathbb{R}^{D \times 1}$$
Assuming $\boldsymbol{\mu} = \mathbf{0}$ or that the vectors have been mean-subtracted, i.e., $\mathbf{x}_i \leftarrow \mathbf{x}_i - \boldsymbol{\mu}\ \forall i$, we have $\mathbf{x}_i \approx \boldsymbol{\phi} h_i$.
The least squares criterion becomes:
$$\hat{\boldsymbol{\phi}}, \{\hat{h}_i\}_{i=1}^N = \operatorname*{argmin}_{\boldsymbol{\phi}, \{h_i\}_{i=1}^N} E(\boldsymbol{\phi}, \{h_i\}) = \operatorname*{argmin}_{\boldsymbol{\phi}, \{h_i\}_{i=1}^N} \left\{ \sum_{i=1}^N \left[\mathbf{x}_i - \boldsymbol{\phi} h_i\right]^T \left[\mathbf{x}_i - \boldsymbol{\phi} h_i\right] \right\} \tag{3}$$
Dimension Reduction: Reduce to 1-Dim

Eq. 3 has a problem in that it does not have a unique solution: if we multiply $\boldsymbol{\phi}$ by any constant $\alpha$ and divide the $h_i$'s by the same constant, we get the same cost, i.e., $\alpha\boldsymbol{\phi}\frac{h_i}{\alpha} = \boldsymbol{\phi} h_i$.
We make the solution unique by constraining $\|\boldsymbol{\phi}\|^2 = 1$ using a Lagrange multiplier:
$$\begin{aligned} L(\boldsymbol{\phi}, \{h_i\}) &= E(\boldsymbol{\phi}, \{h_i\}) + \lambda(\boldsymbol{\phi}^T\boldsymbol{\phi} - 1) \\ &= \sum_{i=1}^N (\mathbf{x}_i - \boldsymbol{\phi} h_i)^T(\mathbf{x}_i - \boldsymbol{\phi} h_i) + \lambda(\boldsymbol{\phi}^T\boldsymbol{\phi} - 1) \\ &= \sum_{i=1}^N \left( \mathbf{x}_i^T\mathbf{x}_i - 2h_i\boldsymbol{\phi}^T\mathbf{x}_i + h_i^2 \right) + \lambda(\boldsymbol{\phi}^T\boldsymbol{\phi} - 1) \end{aligned}$$
Dimension Reduction: Reduce to 1-Dim

Setting $\frac{\partial L}{\partial \boldsymbol{\phi}} = \mathbf{0}$ and $\frac{\partial L}{\partial h_i} = 0$, we obtain:
$$\sum_i \mathbf{x}_i \hat{h}_i = \lambda\hat{\boldsymbol{\phi}} \quad \text{and} \quad \hat{\boldsymbol{\phi}}^T\mathbf{x}_i = \hat{h}_i = \mathbf{x}_i^T\hat{\boldsymbol{\phi}}$$
Hence,
$$\sum_i \mathbf{x}_i\left(\mathbf{x}_i^T\hat{\boldsymbol{\phi}}\right) = \left(\sum_i \mathbf{x}_i\mathbf{x}_i^T\right)\hat{\boldsymbol{\phi}} = \lambda\hat{\boldsymbol{\phi}} \implies \mathbf{S}\hat{\boldsymbol{\phi}} = \lambda\hat{\boldsymbol{\phi}}$$
where $\mathbf{S}$ is the covariance matrix of the training data.¹
Therefore, $\hat{\boldsymbol{\phi}}$ is the first eigenvector of $\mathbf{S}$.
¹ Note that the $\mathbf{x}_i$'s have been mean-subtracted.
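As a small numerical illustration of this result, the sketch below (NumPy; the synthetic data, sizes, and variable names are assumptions for illustration, not part of the slides) mean-subtracts the data, forms $\mathbf{S} = \sum_i \mathbf{x}_i\mathbf{x}_i^T$, and takes the leading eigenvector of $\mathbf{S}$ as $\hat{\boldsymbol{\phi}}$:

```python
import numpy as np

# Synthetic data: N vectors of dimension D (assumed sizes, for illustration only).
rng = np.random.default_rng(0)
N, D = 500, 5
X = rng.normal(size=(N, D)) @ rng.normal(size=(D, D))  # rows are the x_i's

# Mean-subtract the data (the derivation assumes this has been done).
X = X - X.mean(axis=0)

# S = sum_i x_i x_i^T (the data scatter/covariance matrix, up to a 1/N factor).
S = X.T @ X

# phi is the eigenvector of S with the largest eigenvalue.
eigvals, eigvecs = np.linalg.eigh(S)   # eigh returns eigenvalues in ascending order
phi = eigvecs[:, -1]

# 1-D representation h_i = phi^T x_i and its rank-1 reconstruction phi * h_i.
h = X @ phi
X_recon = np.outer(h, phi)
print("fraction of variance retained:", eigvals[-1] / eigvals.sum())
```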
Dimension Reduction: Reduce to 1-Dim 13 Image preprocessing and feature extraction a) c) b) Figure 13.19 Reduction to a single dimension. a) Original data and direction of maximum variance. b) The data are projected onto to produce a one dimensional representation. c) To reconstruct the data, we re-multiply by. Most of the original variation is retained. PCA extends this model to project high dimensional data onto the K orthogonal dimensions with the most variance, to produce a K dimensional representation. Man-Wai MAK (EIE) PCA and LDA October 26, 2018 8 / 26
Dimension Reduction: 3D to 2D

The aim is to find a small number of axes in which the data have the highest variability. The axes may not be parallel to the original axes.
E.g., projection from 3-D space (axes $x_1$, $x_2$, $x_3$) to 2-D space.
Principal Component Analysis

In PCA, the hidden variables $\mathbf{h}_i$ are multi-dimensional and $\boldsymbol{\phi}$ becomes a rectangular matrix $\boldsymbol{\Phi} = [\boldsymbol{\phi}_1\ \boldsymbol{\phi}_2\ \cdots\ \boldsymbol{\phi}_M]$, where $M \ll D$.
Each component of $\mathbf{h}_i$ weights one column of the matrix $\boldsymbol{\Phi}$ so that the data are approximated as
$$\mathbf{x}_i \approx \boldsymbol{\Phi}\mathbf{h}_i, \quad i = 1, \ldots, N$$
The cost function is²
$$\hat{\boldsymbol{\Phi}}, \{\hat{\mathbf{h}}_i\}_{i=1}^N = \operatorname*{argmin}_{\boldsymbol{\Phi}, \{\mathbf{h}_i\}_{i=1}^N} E\left(\boldsymbol{\Phi}, \{\mathbf{h}_i\}_{i=1}^N\right) = \operatorname*{argmin}_{\boldsymbol{\Phi}, \{\mathbf{h}_i\}_{i=1}^N} \left\{ \sum_{i=1}^N \left[\mathbf{x}_i - \boldsymbol{\Phi}\mathbf{h}_i\right]^T\left[\mathbf{x}_i - \boldsymbol{\Phi}\mathbf{h}_i\right] \right\} \tag{4}$$
² Note that we have defined $\boldsymbol{\theta} \equiv \boldsymbol{\Phi}$ in Eq. 2.
Principal Component Analysis

To solve the non-uniqueness problem in Eq. 4, we enforce $\boldsymbol{\phi}_d^T\boldsymbol{\phi}_d = 1$, $d = 1, \ldots, M$, using a set of Lagrange multipliers $\{\lambda_d\}_{d=1}^M$:
$$\begin{aligned} L(\boldsymbol{\Phi}, \{\mathbf{h}_i\}) &= \sum_{i=1}^N (\mathbf{x}_i - \boldsymbol{\Phi}\mathbf{h}_i)^T(\mathbf{x}_i - \boldsymbol{\Phi}\mathbf{h}_i) + \sum_{d=1}^M \lambda_d(\boldsymbol{\phi}_d^T\boldsymbol{\phi}_d - 1) \\ &= \sum_{i=1}^N (\mathbf{x}_i - \boldsymbol{\Phi}\mathbf{h}_i)^T(\mathbf{x}_i - \boldsymbol{\Phi}\mathbf{h}_i) + \operatorname{tr}\{\boldsymbol{\Phi}\boldsymbol{\Lambda}_M\boldsymbol{\Phi}^T - \boldsymbol{\Lambda}\} \\ &= \sum_{i=1}^N \left(\mathbf{x}_i^T\mathbf{x}_i - 2\mathbf{h}_i^T\boldsymbol{\Phi}^T\mathbf{x}_i + \mathbf{h}_i^T\mathbf{h}_i\right) + \operatorname{tr}\{\boldsymbol{\Phi}\boldsymbol{\Lambda}_M\boldsymbol{\Phi}^T - \boldsymbol{\Lambda}\} \end{aligned} \tag{5}$$
where $\mathbf{h}_i \in \mathbb{R}^M$, $\boldsymbol{\Lambda} = \operatorname{diag}\{\lambda_1, \ldots, \lambda_M, 0, \ldots, 0\} \in \mathbb{R}^{D \times D}$, $\boldsymbol{\Lambda}_M = \operatorname{diag}\{\lambda_1, \ldots, \lambda_M\} \in \mathbb{R}^{M \times M}$, and $\boldsymbol{\Phi} = [\boldsymbol{\phi}_1\ \boldsymbol{\phi}_2\ \cdots\ \boldsymbol{\phi}_M] \in \mathbb{R}^{D \times M}$.
Principal Component Analysis

Setting $\frac{\partial L}{\partial \boldsymbol{\Phi}} = \mathbf{0}$ and $\frac{\partial L}{\partial \mathbf{h}_i} = \mathbf{0}$, we obtain:
$$\sum_i \mathbf{x}_i\hat{\mathbf{h}}_i^T = \hat{\boldsymbol{\Phi}}\boldsymbol{\Lambda}_M \quad \text{and} \quad \hat{\boldsymbol{\Phi}}^T\mathbf{x}_i = \hat{\mathbf{h}}_i \implies \hat{\mathbf{h}}_i^T = \mathbf{x}_i^T\hat{\boldsymbol{\Phi}}$$
where we have used $\frac{\partial}{\partial\mathbf{X}}\operatorname{tr}\{\mathbf{X}\mathbf{B}\mathbf{X}^T\} = \mathbf{X}\mathbf{B}^T + \mathbf{X}\mathbf{B}$ and $\frac{\partial\,\mathbf{a}^T\mathbf{X}^T\mathbf{b}}{\partial\mathbf{X}} = \mathbf{b}\mathbf{a}^T$.
Therefore,
$$\sum_i \mathbf{x}_i\mathbf{x}_i^T\hat{\boldsymbol{\Phi}} = \hat{\boldsymbol{\Phi}}\boldsymbol{\Lambda}_M \implies \mathbf{S}\hat{\boldsymbol{\Phi}} = \hat{\boldsymbol{\Phi}}\boldsymbol{\Lambda}_M \tag{6}$$
So, $\hat{\boldsymbol{\Phi}}$ comprises the $M$ eigenvectors of $\mathbf{S}$ (those with the largest eigenvalues).
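The derivation translates almost directly into code. Below is a minimal NumPy sketch of Eq. 6, assuming the training vectors are stacked as the rows of a matrix X (the function name and the synthetic data are illustrative assumptions only):

```python
import numpy as np

def pca(X, M):
    """Minimal PCA sketch following Eq. 6; rows of X are the training vectors x_i."""
    mu = X.mean(axis=0)
    Xc = X - mu                           # mean-subtract, as the slides assume
    S = Xc.T @ Xc                         # S = sum_i x_i x_i^T
    eigvals, eigvecs = np.linalg.eigh(S)  # S is symmetric, so eigh is appropriate
    order = np.argsort(eigvals)[::-1]     # sort eigenvalues in descending order
    Phi = eigvecs[:, order[:M]]           # D x M matrix of the top-M eigenvectors
    H = Xc @ Phi                          # h_i = Phi^T (x_i - mu), stacked as rows
    return mu, Phi, H

# Usage (synthetic data; sizes are assumptions for illustration only):
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10)) @ rng.normal(size=(10, 10))
mu, Phi, H = pca(X, M=3)
X_recon = mu + H @ Phi.T                  # x_i is approximated by mu + Phi h_i
```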
Interpretation of Λ_M

Denote $\mathbf{X}$ as a $D \times N$ centered data matrix whose $n$-th column is given by $\left(\mathbf{x}_n - \frac{1}{N}\sum_{i=1}^N \mathbf{x}_i\right)$.
The projected data matrix is given by $\mathbf{Y} = \hat{\boldsymbol{\Phi}}^T\mathbf{X}$.
The covariance matrix of the transformed data is
$$\mathbf{Y}\mathbf{Y}^T = \left(\hat{\boldsymbol{\Phi}}^T\mathbf{X}\right)\left(\hat{\boldsymbol{\Phi}}^T\mathbf{X}\right)^T = \hat{\boldsymbol{\Phi}}^T\mathbf{X}\mathbf{X}^T\hat{\boldsymbol{\Phi}} = \hat{\boldsymbol{\Phi}}^T\hat{\boldsymbol{\Phi}}\boldsymbol{\Lambda}_M = \boldsymbol{\Lambda}_M$$
(see the eigen-equation in Eq. 6).
Therefore, the eigenvalues represent the variances of the projected vectors.
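A quick numerical check of this interpretation, continuing from the pca() sketch above: the covariance of the projected data is (approximately) diagonal, with the retained eigenvalues on the diagonal.

```python
# Continuing from the pca() sketch above.
Y = (X - mu) @ Phi         # rows are y_n = Phi^T (x_n - mu)
cov_Y = Y.T @ Y            # M x M matrix; approximates Lambda_M
print(np.round(cov_Y, 3))  # off-diagonal entries are ~0
```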
PCA on High-Dimensional Data

When the dimension $D$ of $\mathbf{x}_i$ is very high, computing $\mathbf{S}$ and its eigenvectors directly is impractical.
However, the rank of $\mathbf{S}$ is limited by the number of training examples: if there are $N$ training examples, there will be at most $N - 1$ eigenvectors with non-zero eigenvalues.
If $N \ll D$, the principal components can be computed more easily.
Let $\mathbf{X}$ be a data matrix comprising the mean-subtracted $\mathbf{x}_i$'s in its columns. Then $\mathbf{S} = \mathbf{X}\mathbf{X}^T$, and the eigen-decomposition of $\mathbf{S}$ is given by
$$\mathbf{S}\boldsymbol{\phi}_i = \mathbf{X}\mathbf{X}^T\boldsymbol{\phi}_i = \lambda_i\boldsymbol{\phi}_i$$
Instead of performing the eigen-decomposition of $\mathbf{X}\mathbf{X}^T$, we perform the eigen-decomposition of $\mathbf{X}^T\mathbf{X}$:
$$\mathbf{X}^T\mathbf{X}\boldsymbol{\psi}_i = \lambda_i\boldsymbol{\psi}_i \tag{7}$$
Principal Component Analysis

Pre-multiplying both sides of Eq. 7 by $\mathbf{X}$, we obtain
$$\mathbf{X}\mathbf{X}^T(\mathbf{X}\boldsymbol{\psi}_i) = \lambda_i(\mathbf{X}\boldsymbol{\psi}_i)$$
This means that if $\boldsymbol{\psi}_i$ is an eigenvector of $\mathbf{X}^T\mathbf{X}$, then $\boldsymbol{\phi}_i = \mathbf{X}\boldsymbol{\psi}_i$ is an eigenvector of $\mathbf{S} = \mathbf{X}\mathbf{X}^T$.
So, all we need is to compute the $N - 1$ eigenvectors of $\mathbf{X}^T\mathbf{X}$, which has size $N \times N$.
Note that the $\boldsymbol{\phi}_i$'s computed in this way are un-normalized, so we need to normalize them:
$$\boldsymbol{\phi}_i = \frac{\mathbf{X}\boldsymbol{\psi}_i}{\|\mathbf{X}\boldsymbol{\psi}_i\|}, \quad i = 1, \ldots, N-1$$
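A NumPy sketch of this $N \times N$ trick is given below; the data matrix, sizes, and function name are assumptions for illustration, not a reference implementation:

```python
import numpy as np

def pca_highdim(X_cols, M):
    """PCA via the N x N matrix X^T X; a sketch, useful when N << D.
    X_cols is D x N with mean-subtracted training vectors in its columns."""
    G = X_cols.T @ X_cols                   # X^T X, size N x N
    eigvals, Psi = np.linalg.eigh(G)        # eigenvectors psi_i of X^T X
    order = np.argsort(eigvals)[::-1][:M]   # keep the top-M eigenvalues
    Phi = X_cols @ Psi[:, order]            # phi_i = X psi_i (eigenvectors of X X^T)
    Phi /= np.linalg.norm(Phi, axis=0)      # normalize each phi_i to unit length
    return Phi

# Usage (assumed sizes: D = 10000, N = 50, for illustration only):
rng = np.random.default_rng(2)
X = rng.normal(size=(10000, 50))
X = X - X.mean(axis=1, keepdims=True)       # subtract the mean vector
Phi = pca_highdim(X, M=10)                  # 10000 x 10 projection matrix
```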
Example Application of PCA: Eigenface

Eigenface is one of the most well-known applications of PCA.
[Figure: a face is approximated as the mean face $\boldsymbol{\mu}$ plus a weighted sum of eigenfaces, $\mathbf{x} \approx \boldsymbol{\mu} + h_1\boldsymbol{\phi}_1 + h_2\boldsymbol{\phi}_2 + \cdots + h_{399}\boldsymbol{\phi}_{399}$; the original faces are compared with faces reconstructed using 399 eigenfaces.]
Example Application of PCA: Eigenface

Faces reconstructed using different numbers of principal components (eigenfaces): original, 1 PC, 20 PCs, 50 PCs, 100 PCs, 200 PCs, and 399 PCs.
See Lab 2 of EIE4105 at http://www.eie.polyu.edu.hk/~mwmak/myteaching.htm for an implementation.
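For reference, a minimal sketch of how such reconstructions could be produced, assuming `faces` is an N × D array of vectorized face images (the array name and function are hypothetical, not taken from the lab):

```python
import numpy as np

def reconstruct(faces, K):
    """Reconstruct faces from their top-K principal components (sketch only)."""
    mu = faces.mean(axis=0)
    Xc = faces - mu
    # N x N trick from the previous slide (N is typically << D for face images).
    eigvals, Psi = np.linalg.eigh(Xc @ Xc.T)
    order = np.argsort(eigvals)[::-1][:K]
    Phi = Xc.T @ Psi[:, order]
    Phi /= np.linalg.norm(Phi, axis=0)       # D x K eigenface basis
    H = Xc @ Phi                             # projection coefficients
    return mu + H @ Phi.T                    # reconstructed faces, one per row

# e.g. compare reconstruct(faces, 1), reconstruct(faces, 20), ..., reconstruct(faces, 399)
```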
Limitations of PCA

PCA will fail if the subspace is non-linear, because PCA can only find a linear subspace.
[Figure: data lying in a linear subspace (PCA is fine) vs. data lying in a nonlinear subspace (PCA fails).]
Solution: use a non-linear embedding such as ISOMAP or a DNN.
Fisher Discriminant Analysis

FDA is a classification method to separate data into two classes. But FDA could also be considered as a supervised dimension-reduction method that reduces the dimension to 1.
[Figure: comparison of projecting the data onto the line joining the two class means versus projecting onto the FDA subspace.]
Fisher Discriminant Analysis

The idea of FDA is to find a 1-D line such that the projected data give a large separation between the means of the two classes while also giving a small variance within each class, thereby minimizing the class overlap.
Assume that the training data are projected onto a 1-D space using $y_n = \mathbf{w}^T\mathbf{x}_n$, $n = 1, \ldots, N$. The Fisher criterion is
$$J(\mathbf{w}) = \frac{\text{Between-class scatter}}{\text{Within-class scatter}} = \frac{\mathbf{w}^T\mathbf{S}_B\mathbf{w}}{\mathbf{w}^T\mathbf{S}_W\mathbf{w}}$$
where
$$\mathbf{S}_B = (\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1)(\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1)^T \quad \text{and} \quad \mathbf{S}_W = \sum_{k=1}^{2}\sum_{n \in \mathcal{C}_k}(\mathbf{x}_n - \boldsymbol{\mu}_k)(\mathbf{x}_n - \boldsymbol{\mu}_k)^T$$
are the between-class and within-class scatter matrices, respectively, and $\boldsymbol{\mu}_1$ and $\boldsymbol{\mu}_2$ are the class means.
Fisher Discriminant Analysis

Note that only the direction of $\mathbf{w}$ matters. Therefore, we can always find a $\mathbf{w}$ that leads to $\mathbf{w}^T\mathbf{S}_W\mathbf{w} = 1$.
The maximization of $J(\mathbf{w})$ can be rewritten as:
$$\max_{\mathbf{w}} \ \mathbf{w}^T\mathbf{S}_B\mathbf{w} \quad \text{subject to} \quad \mathbf{w}^T\mathbf{S}_W\mathbf{w} = 1$$
The Lagrangian function is
$$L(\mathbf{w}, \lambda) = \tfrac{1}{2}\mathbf{w}^T\mathbf{S}_B\mathbf{w} - \lambda(\mathbf{w}^T\mathbf{S}_W\mathbf{w} - 1)$$
Setting $\frac{\partial L}{\partial \mathbf{w}} = \mathbf{0}$, we obtain
$$\mathbf{S}_B\mathbf{w} - \lambda\mathbf{S}_W\mathbf{w} = \mathbf{0} \implies \mathbf{S}_B\mathbf{w} = \lambda\mathbf{S}_W\mathbf{w} \implies (\mathbf{S}_W^{-1}\mathbf{S}_B)\mathbf{w} = \lambda\mathbf{w} \tag{8}$$
So, $\mathbf{w}$ is the first eigenvector of $\mathbf{S}_W^{-1}\mathbf{S}_B$.
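A minimal NumPy sketch of the two-class case is given below. Rather than an explicit eigen-solve, it uses the standard closed form $\mathbf{w} \propto \mathbf{S}_W^{-1}(\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1)$, which follows from Eq. 8 because $\mathbf{S}_B$ has rank 1; the synthetic data and variable names are assumptions for illustration:

```python
import numpy as np

def fisher_direction(X1, X2):
    """Two-class FDA sketch: X1 (N1 x D) and X2 (N2 x D) hold the samples of each class."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    # Within-class scatter S_W = sum over both classes of (x_n - mu_k)(x_n - mu_k)^T.
    Sw = (X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)
    # S_B = (mu2 - mu1)(mu2 - mu1)^T is rank 1, so the leading eigenvector of
    # S_W^{-1} S_B is proportional to S_W^{-1} (mu2 - mu1).
    w = np.linalg.solve(Sw, mu2 - mu1)
    return w / np.linalg.norm(w)

# Usage with synthetic 2-D data (assumed for illustration only):
rng = np.random.default_rng(3)
X1 = rng.normal(loc=[0.0, 0.0], scale=[2.0, 0.5], size=(100, 2))
X2 = rng.normal(loc=[3.0, 1.0], scale=[2.0, 0.5], size=(100, 2))
w = fisher_direction(X1, X2)
y1, y2 = X1 @ w, X2 @ w      # 1-D projections y_n = w^T x_n
```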
LDA on Multi-class Problems

For multiple classes ($K > 2$ and $D > K$), we can use LDA to project $D$-dimensional vectors to $M$-dimensional vectors, where $1 < M < K$.
$\mathbf{w}$ is extended to a matrix $\mathbf{W} = [\mathbf{w}_1 \cdots \mathbf{w}_M]$ and the projected scalar $y_i$ is extended to a vector $\mathbf{y}_i$:
$$\mathbf{y}_n = \mathbf{W}^T(\mathbf{x}_n - \boldsymbol{\mu}), \quad \text{where } y_{nj} = \mathbf{w}_j^T(\mathbf{x}_n - \boldsymbol{\mu}), \ j = 1, \ldots, M$$
where $\boldsymbol{\mu}$ is the global mean of the training vectors.
The between-class and within-class scatter matrices become
$$\mathbf{S}_B = \sum_{k=1}^K N_k(\boldsymbol{\mu}_k - \boldsymbol{\mu})(\boldsymbol{\mu}_k - \boldsymbol{\mu})^T \qquad \mathbf{S}_W = \sum_{k=1}^K\sum_{n \in \mathcal{C}_k}(\mathbf{x}_n - \boldsymbol{\mu}_k)(\mathbf{x}_n - \boldsymbol{\mu}_k)^T$$
where $N_k$ is the number of samples in class $k$, i.e., $N_k = |\mathcal{C}_k|$.
LDA on Multi-class Problems

The LDA criterion function:
$$J(\mathbf{W}) = \frac{\text{Between-class scatter}}{\text{Within-class scatter}} = \operatorname{Tr}\left\{\left(\mathbf{W}^T\mathbf{S}_B\mathbf{W}\right)\left(\mathbf{W}^T\mathbf{S}_W\mathbf{W}\right)^{-1}\right\}$$
Constrained optimization:
$$\max_{\mathbf{W}} \ \operatorname{Tr}\{\mathbf{W}^T\mathbf{S}_B\mathbf{W}\} \quad \text{subject to} \quad \mathbf{W}^T\mathbf{S}_W\mathbf{W} = \mathbf{I}$$
where $\mathbf{I}$ is an $M \times M$ identity matrix.
Note that unlike PCA in Eq. 5, because of the matrix $\mathbf{S}_W$ in the constraint, we need to find one $\mathbf{w}_j$ at a time.
Note also that the constraint $\mathbf{W}^T\mathbf{S}_W\mathbf{W} = \mathbf{I}$ suggests that the $\mathbf{w}_j$'s may not be orthogonal to each other.
LDA on Multi-class Problems

To find $\mathbf{w}_j$, we write the Lagrangian function as:
$$L(\mathbf{w}_j, \lambda_j) = \mathbf{w}_j^T\mathbf{S}_B\mathbf{w}_j - \lambda_j(\mathbf{w}_j^T\mathbf{S}_W\mathbf{w}_j - 1)$$
Using Eq. 8, the optimal solution of $\mathbf{w}_j$ satisfies
$$(\mathbf{S}_W^{-1}\mathbf{S}_B)\mathbf{w}_j = \lambda_j\mathbf{w}_j$$
Therefore, $\mathbf{W}$ comprises the first $M$ eigenvectors of $\mathbf{S}_W^{-1}\mathbf{S}_B$. A more formal proof can be found in [1].
As the maximum rank of $\mathbf{S}_B$ is $K - 1$, $\mathbf{S}_W^{-1}\mathbf{S}_B$ has at most $K - 1$ non-zero eigenvalues. As a result, $M$ can be at most $K - 1$.
After the projection, the vectors $\mathbf{y}_n$ can be used to train a classifier (e.g., an SVM) for classification.
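The multi-class procedure can be sketched in NumPy as follows (the function name, data, and sizes are illustrative assumptions, not a reference implementation):

```python
import numpy as np

def lda(X, labels, M):
    """Multi-class LDA sketch: project the D-dim rows of X to M dims (M <= K - 1)."""
    mu = X.mean(axis=0)                                  # global mean
    D = X.shape[1]
    Sb, Sw = np.zeros((D, D)), np.zeros((D, D))
    for k in np.unique(labels):
        Xk = X[labels == k]
        mu_k = Xk.mean(axis=0)
        Sb += len(Xk) * np.outer(mu_k - mu, mu_k - mu)   # between-class scatter
        Sw += (Xk - mu_k).T @ (Xk - mu_k)                # within-class scatter
    # Eigenvectors of S_W^{-1} S_B; the product is not symmetric, so use eig().
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(eigvals.real)[::-1][:M]
    W = eigvecs[:, order].real                           # D x M projection matrix
    Y = (X - mu) @ W                                     # y_n = W^T (x_n - mu)
    return W, Y

# Usage on synthetic 3-class data (K = 3, so at most M = 2 useful dimensions):
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(loc=c, size=(50, 4)) for c in (0.0, 2.0, 4.0)])
labels = np.repeat([0, 1, 2], 50)
W, Y = lda(X, labels, M=2)
```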
PCA vs. LDA

Project 784-dimensional vectors derived from 28 × 28 handwritten digits to 3-D space.
The LDA subspace is formed by the 3 eigenvectors with the largest eigenvalues, i.e., $\mathbf{W}$ is a 784 × 3 matrix, or $\mathbf{W} \in \mathbb{R}^{784 \times 3}$.
[Figure: 3-D scatter plots of the projected digits, one under LDA and one under PCA.]
References

[1] Fukunaga, K. (1990). Introduction to Statistical Pattern Recognition. San Diego, California, USA: Academic Press.