PCA and LDA

Man-Wai MAK
Dept. of Electronic and Information Engineering, The Hong Kong Polytechnic University
enmwmak@polyu.edu.hk
http://www.eie.polyu.edu.hk/~mwmak

Reference: S.J.D. Prince, Computer Vision: Models, Learning, and Inference, Cambridge University Press, 2012

Man-Wai MAK (EIE) PCA and LDA April 23, 2016 1 / 22
Overview

1 Dimension Reduction
  - Why Dimension Reduction
  - Dimension Reduction: Reduce to 1-Dim
2 Principal Component Analysis
  - Derivation of PCA
  - PCA on High-Dimensional Data
  - Eigenface
3 Linear Discriminant Analysis
  - LDA on 2-Class Problems
  - LDA on Multi-class Problems
Why Dimension Reduction

Many applications produce high-dimensional vectors:
- In face recognition, if an image has size 360 × 260 pixels, the dimension is 93,600.
- In handwritten digit recognition, if a digit occupies 28 × 28 pixels, the dimension is 784.
- In speaker recognition, the dimension can be as high as 61,440 per utterance.

High-dimensional feature vectors can easily cause the curse-of-dimensionality problem.
- Redundancy: Some elements in the feature vectors are strongly correlated, so knowing one element largely determines another.
- Irrelevancy: Some elements in the feature vectors are irrelevant to the classification task.
Dimension Reduction

Dimensionality reduction attempts to find a low-dimensional (or hidden) representation h which can approximately explain the data x, so that

    x \approx f(h, \theta)    (1)

where f(\cdot, \cdot) is a function that takes the hidden variable h and a set of parameters \theta. Typically, we choose the function family f(\cdot, \cdot) and then learn h and \theta from training data.

Least squares criterion: Given N training vectors X = \{x_1, \ldots, x_N\}, x_i \in \mathbb{R}^D, we find the parameters \theta and latent variables h_i that minimize the sum of squared errors:

    \hat{\theta}, \{\hat{h}_i\}_{i=1}^N = \arg\min_{\theta, \{h_i\}_{i=1}^N} \sum_{i=1}^N [x_i - f(h_i, \theta)]^T [x_i - f(h_i, \theta)]    (2)
Dimension Reduction: Reduce to 1-Dim

Approximate vector x_i by a scalar value h_i plus the global mean \mu:

    x_i \approx \phi h_i + \mu, \quad \text{where } \mu = \frac{1}{N}\sum_{i=1}^N x_i, \quad \phi \in \mathbb{R}^{D \times 1}

Assuming \mu = 0, or that the vectors have been mean-subtracted, i.e., x_i \leftarrow x_i - \mu for all i, we have x_i \approx \phi h_i.

The least squares criterion becomes:

    \hat{\phi}, \{\hat{h}_i\}_{i=1}^N = \arg\min_{\phi, \{h_i\}_{i=1}^N} E(\phi, \{h_i\}) = \arg\min_{\phi, \{h_i\}_{i=1}^N} \sum_{i=1}^N [x_i - \phi h_i]^T [x_i - \phi h_i]    (3)
Dimension Reduction: Reduce to 1-Dim

Eq. 3 has a problem in that it does not have a unique solution: if we multiply \phi by any constant \alpha and divide the h_i's by the same constant, we get the same cost, i.e., (\alpha\phi)\frac{h_i}{\alpha} = \phi h_i.

We make the solution unique by constraining the length of \phi to be 1 using a Lagrange multiplier \lambda:

    E(\phi, \{h_i\}) = \sum_{i=1}^N (x_i - \phi h_i)^T (x_i - \phi h_i) + \lambda(\phi^T \phi - 1)
                     = \sum_{i=1}^N \left[ x_i^T x_i - 2 h_i \phi^T x_i + h_i^2 \right] + \lambda(\phi^T \phi - 1)
Dimension Reduction: Reduce to 1-Dim

Setting \frac{\partial E}{\partial \phi} = 0 and \frac{\partial E}{\partial h_i} = 0, we obtain:

    \sum_i x_i \hat{h}_i = \lambda \hat{\phi} \quad \text{and} \quad \hat{h}_i = \hat{\phi}^T x_i = x_i^T \hat{\phi}

Hence,

    \sum_i x_i \left( x_i^T \hat{\phi} \right) = \left( \sum_i x_i x_i^T \right) \hat{\phi} = \lambda \hat{\phi} \implies S\hat{\phi} = \lambda\hat{\phi}

where S is the covariance matrix of the training data.¹ Therefore, \hat{\phi} is the first eigenvector of S.

¹ Note that the x_i's have been mean-subtracted.
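The 1-D reduction above can be sketched in NumPy. This is a minimal illustration, not part of the original slides: the toy data, sizes, and seed are assumptions chosen so that the leading eigenvector of S is easy to verify.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: N = 500 points in D = 3 dims, with most variance along the first axis
X = rng.normal(size=(500, 3)) * np.array([5.0, 1.0, 0.5])

X = X - X.mean(axis=0)                 # mean-subtract, as the derivation assumes
S = X.T @ X                            # scatter matrix sum_i x_i x_i^T
eigvals, eigvecs = np.linalg.eigh(S)   # eigh: S is symmetric
phi = eigvecs[:, np.argmax(eigvals)]   # first eigenvector = direction phi

h = X @ phi                            # 1-D representation h_i = x_i^T phi
X_rec = np.outer(h, phi)               # reconstruction x_i ≈ phi * h_i
```

Because `eigh` returns unit-norm eigenvectors, the constraint \phi^T\phi = 1 is satisfied automatically, and `phi` points (up to sign) along the axis of largest variance.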
Dimension Reduction: Reduce to 1-Dim

[Figure 13.19 of Prince (2012): Reduction to a single dimension. a) Original data and the direction \phi of maximum variance. b) The data are projected onto \phi to produce a one-dimensional representation. c) To reconstruct the data, we re-multiply by \phi; most of the original variation is retained.]

PCA extends this model to project high-dimensional data onto the K orthogonal dimensions with the most variance, to produce a K-dimensional representation.
Principal Component Analysis

In PCA, the hidden variable h is multi-dimensional and \phi becomes a rectangular matrix \Phi = [\phi_1 \; \phi_2 \; \cdots \; \phi_M], where M \leq D. Each component of h weights one column of \Phi so that the data are approximated as

    x_i \approx \Phi h_i, \quad i = 1, \ldots, N

The cost function is

    \hat{\Phi}, \{\hat{h}_i\}_{i=1}^N = \arg\min_{\Phi, \{h_i\}_{i=1}^N} E\left(\Phi, \{h_i\}_{i=1}^N\right) = \arg\min_{\Phi, \{h_i\}_{i=1}^N} \sum_{i=1}^N [x_i - \Phi h_i]^T [x_i - \Phi h_i]    (4)
Principal Component Analysis

To solve the non-uniqueness problem in Eq. 4, we enforce \phi_d^T \phi_d = 1 for all d using a set of Lagrange multipliers \{\lambda_d\}_{d=1}^D:

    E(\Phi, \{h_i\}) = \sum_{i=1}^N (x_i - \Phi h_i)^T (x_i - \Phi h_i) + \sum_{d=1}^D \lambda_d (\phi_d^T \phi_d - 1)
                     = \sum_{i=1}^N (x_i - \Phi h_i)^T (x_i - \Phi h_i) + \mathrm{tr}\{\Phi\Lambda\Phi^T - \Lambda\}
                     = \sum_{i=1}^N \left[ x_i^T x_i - 2 h_i^T \Phi^T x_i + h_i^T h_i \right] + \mathrm{tr}\{\Phi\Lambda\Phi^T - \Lambda\}    (5)

where \Lambda = \mathrm{diag}\{\lambda_1, \ldots, \lambda_D\} and \Phi = [\phi_1 \; \phi_2 \; \cdots \; \phi_D].
Principal Component Analysis

Setting \frac{\partial E}{\partial \Phi} = 0 and \frac{\partial E}{\partial h_i} = 0, we obtain:

    \sum_i x_i \hat{h}_i^T = \hat{\Phi}\Lambda \quad \text{and} \quad \hat{h}_i = \hat{\Phi}^T x_i \implies \hat{h}_i^T = x_i^T \hat{\Phi}

where we have used the matrix-trace derivative \frac{\partial}{\partial X}\mathrm{tr}\{XBX^T\} = XB^T + XB. Therefore,

    \sum_i x_i x_i^T \hat{\Phi} = \hat{\Phi}\Lambda \implies S\hat{\Phi} = \hat{\Phi}\Lambda

So, \hat{\Phi} comprises the first M eigenvectors of S, i.e., those with the largest eigenvalues.
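The multi-dimensional case extends the earlier 1-D sketch by keeping the top-M eigenvectors of S. Again, this is an illustrative NumPy sketch; the data sizes and seed are assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, M = 200, 10, 3
X = rng.normal(size=(N, D)) @ rng.normal(size=(D, D))  # correlated features
X = X - X.mean(axis=0)                                 # mean-subtract

S = X.T @ X
eigvals, eigvecs = np.linalg.eigh(S)          # eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]
Phi = eigvecs[:, order[:M]]                   # D x M matrix of top-M eigenvectors

H = X @ Phi                                   # N x M latent representations h_i
X_rec = H @ Phi.T                             # reconstructions x_i ≈ Phi h_i
```

The columns of `Phi` are orthonormal (as Eq. 5's constraint requires), and the reconstruction error equals the sum of the discarded eigenvalues.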
PCA on High-Dimensional Data

When the dimension D of x_i is very high, computing S and its eigenvectors directly is impractical. However, the rank of S is limited by the number of training examples: if there are N training examples, there will be at most N - 1 eigenvectors with non-zero eigenvalues. If N \ll D, the principal components can be computed more easily.

Let X be a data matrix comprising the mean-subtracted x_i's in its columns. Then S = XX^T, and the eigen-decomposition of S is given by

    S\phi_i = XX^T \phi_i = \lambda_i \phi_i

Instead of performing eigen-decomposition of the D \times D matrix XX^T, we perform eigen-decomposition of the N \times N matrix X^T X:

    X^T X \psi_i = \lambda_i \psi_i    (6)
PCA on High-Dimensional Data

Pre-multiplying both sides of Eq. 6 by X, we obtain

    XX^T (X\psi_i) = \lambda_i (X\psi_i)

This means that if \psi_i is an eigenvector of X^T X, then \phi_i = X\psi_i is an eigenvector of S = XX^T. So, all we need is to compute the N - 1 eigenvectors of X^T X, which has size N \times N.

Note that the \phi_i's computed in this way are un-normalized, so we need to normalize them:

    \phi_i = \frac{X\psi_i}{\|X\psi_i\|}, \quad i = 1, \ldots, N-1
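This trick can be checked numerically: solve the small N × N eigenproblem, map the eigenvectors back with \phi_i = X\psi_i, and verify that they satisfy S\phi = \lambda\phi. A minimal sketch (the sizes and seed are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
N, D = 20, 500                              # far fewer samples than dimensions
A = rng.normal(size=(D, N))
X = A - A.mean(axis=1, keepdims=True)       # columns are mean-subtracted x_i's

# Small N x N eigenproblem instead of the D x D one
G = X.T @ X
eigvals, Psi = np.linalg.eigh(G)
order = np.argsort(eigvals)[::-1][:N - 1]   # keep the N-1 non-trivial eigenvectors
Phi = X @ Psi[:, order]                     # map back: phi_i = X psi_i
Phi = Phi / np.linalg.norm(Phi, axis=0)     # normalize each column

# Check: each column satisfies S phi = lambda phi with S = X X^T
S_Phi = X @ (X.T @ Phi)                     # computed without forming the D x D S
```

Note that `X @ (X.T @ Phi)` applies S without ever materializing the D × D matrix, which is the whole point when D is large.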
Example Application of PCA: Eigenface

One of the most well-known applications of PCA.

[Figure: a face is reconstructed as the mean face \mu plus a weighted sum of eigenfaces, x \approx \mu + h_1\phi_1 + h_2\phi_2 + \cdots + h_{399}\phi_{399}. Original faces and faces reconstructed using 399 eigenfaces are shown.]
Example Application of PCA: Eigenface

Faces reconstructed using different numbers of principal components (eigenfaces):

[Figure: Original, 1 PC, 20 PCs, 50 PCs, 100 PCs, 200 PCs, 399 PCs]

See Lab2 of EIE4105 at http://www.eie.polyu.edu.hk/~mwmak/myteaching.htm for an implementation.
Fisher Discriminant Analysis

FDA is a classification method to separate data into two classes. But FDA could also be considered as a supervised dimension-reduction method that reduces the dimension to 1.

[Figure: left, data projected onto the line joining the two class means; right, data projected onto the FDA subspace, giving much better class separation.]
Fisher Discriminant Analysis

The idea of FDA is to find a 1-D line so that the projected data give a large separation between the means of the two classes while also giving a small variance within each class, thereby minimizing the class overlap.

Assume that training data are projected to one dimension using

    y_n = w^T x_n, \quad n = 1, \ldots, N.

Fisher criterion:

    J(w) = \frac{\text{Between-class scatter}}{\text{Within-class scatter}} = \frac{w^T S_B w}{w^T S_W w}

where

    S_B = (\mu_2 - \mu_1)(\mu_2 - \mu_1)^T \quad \text{and} \quad S_W = \sum_{k=1}^{2} \sum_{n \in C_k} (x_n - \mu_k)(x_n - \mu_k)^T

are the between-class and within-class scatter matrices, respectively, and \mu_1 and \mu_2 are the class means.
Fisher Discriminant Analysis

Note that only the direction of w matters. Therefore, we can always find a w that leads to w^T S_W w = 1. The maximization of J(w) can be rewritten as:

    \max_w w^T S_B w \quad \text{subject to} \quad w^T S_W w = 1

The Lagrangian function is

    L(w, \lambda) = \frac{1}{2} w^T S_B w - \lambda (w^T S_W w - 1)

Setting \frac{\partial L}{\partial w} = 0, we obtain

    S_B w - \lambda S_W w = 0 \implies S_B w = \lambda S_W w \implies (S_W^{-1} S_B) w = \lambda w    (7)

So, w is the first eigenvector of S_W^{-1} S_B.
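The two-class solution above can be sketched directly in NumPy by forming the scatter matrices and taking the leading eigenvector of S_W^{-1} S_B. The two Gaussian classes below are an illustrative assumption, chosen so that the Fisher direction differs from the line joining the means:

```python
import numpy as np

rng = np.random.default_rng(3)
# Two classes in 2-D: separated along axis 0, with large within-class
# variance along axis 1
X1 = rng.normal(loc=[0.0, 0.0], scale=[1.0, 3.0], size=(300, 2))
X2 = rng.normal(loc=[4.0, 0.0], scale=[1.0, 3.0], size=(300, 2))

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
Sw = (X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)
Sb = np.outer(mu2 - mu1, mu2 - mu1)

# First eigenvector of Sw^{-1} Sb (for 2 classes, proportional to Sw^{-1}(mu2 - mu1))
eigvals, eigvecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
w = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
w = w / np.linalg.norm(w)

proj1, proj2 = X1 @ w, X2 @ w   # 1-D projections y_n = w^T x_n
```

Because S_W penalizes the noisy axis, `w` points mostly along axis 0, and the projected class means are well separated relative to the within-class spread.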
LDA on Multi-class Problems

For multiple classes (K > 2 and D > K), we can use LDA to project D-dimensional vectors to M-dimensional vectors, where 1 \leq M \leq K - 1.

w is extended to a matrix W = [w_1 \; \cdots \; w_M], and the projected scalar y_n is extended to a vector y_n:

    y_n = W^T (x_n - \mu), \quad \text{where } y_{nj} = w_j^T (x_n - \mu), \quad j = 1, \ldots, M

where \mu is the global mean of the training vectors. The between-class and within-class scatter matrices become

    S_B = \sum_{k=1}^K N_k (\mu_k - \mu)(\mu_k - \mu)^T
    S_W = \sum_{k=1}^K \sum_{n \in C_k} (x_n - \mu_k)(x_n - \mu_k)^T

where N_k = |C_k| is the number of samples in class k.
LDA on Multi-class Problems

The LDA criterion function:

    J(W) = \frac{\text{Between-class scatter}}{\text{Within-class scatter}} = \mathrm{Tr}\left\{ \left( W^T S_W W \right)^{-1} \left( W^T S_B W \right) \right\}

Constrained optimization:

    \max_W \mathrm{Tr}\{W^T S_B W\} \quad \text{subject to} \quad W^T S_W W = I

where I is an M \times M identity matrix.

Note that, unlike PCA in Eq. 5, because of the matrix S_W in the constraint, we need to find one w_j at a time. Note also that the constraint W^T S_W W = I suggests that the w_j's may not be orthogonal to each other.
LDA on Multi-class Problems

To find w_j, we write the Lagrangian function as:

    L(w_j, \lambda_j) = w_j^T S_B w_j - \lambda_j (w_j^T S_W w_j - 1)

Using Eq. 7, the optimal solution of w_j satisfies

    (S_W^{-1} S_B) w_j = \lambda_j w_j

Therefore, W comprises the first M eigenvectors of S_W^{-1} S_B. A more formal proof can be found in [1].

As the maximum rank of S_B is K - 1, S_W^{-1} S_B has at most K - 1 non-zero eigenvalues. As a result, M can be at most K - 1.

After the projection, the vectors y_n can be used to train a classifier (e.g., SVM) for classification.
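The multi-class procedure, including the rank-(K − 1) property of S_B, can be verified with a short NumPy sketch. The K = 3 Gaussian classes below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)
K, D, M = 3, 5, 2                       # K classes, D dims, M = K - 1 projections
means = rng.normal(scale=4.0, size=(K, D))
Xs = [m + rng.normal(size=(100, D)) for m in means]   # 100 samples per class
X = np.vstack(Xs)
mu = X.mean(axis=0)                     # global mean

Sb = sum(len(Xk) * np.outer(Xk.mean(axis=0) - mu, Xk.mean(axis=0) - mu)
         for Xk in Xs)
Sw = sum((Xk - Xk.mean(axis=0)).T @ (Xk - Xk.mean(axis=0)) for Xk in Xs)

eigvals, eigvecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
order = np.argsort(np.real(eigvals))[::-1]
W = np.real(eigvecs[:, order[:M]])      # first M eigenvectors of Sw^{-1} Sb

Y = (X - mu) @ W                        # projected vectors y_n = W^T (x_n - mu)
```

Since S_B is a sum of K rank-one terms whose vectors \mu_k - \mu sum to zero, only K − 1 = 2 eigenvalues of S_W^{-1} S_B are non-zero, which is why M is capped at K − 1.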
References

[1] Fukunaga, K. (1990). Introduction to Statistical Pattern Recognition (2nd ed.). San Diego, CA, USA: Academic Press.