Principal Component Analysis and Linear Discriminant Analysis
Ying Wu
Electrical Engineering and Computer Science
Northwestern University, Evanston, IL 60208
http://www.eecs.northwestern.edu/~yingwu
1/29
Decomposition and Components
- Decomposition is a great idea.
- Linear decomposition and linear bases, e.g., the Fourier transform.
- The bases construct the feature space:
  - they may or may not be orthogonal;
  - they give the directions along which to find the components;
  - they can be specified in advance or learned from data.
- The features are the image (or projection) of the original signal in the feature space:
  - e.g., the orthogonal projection of the original signal onto the feature space;
  - the projection does not have to be orthogonal.
- This is feature extraction.
Outline
- Principal Component Analysis
- Linear Discriminant Analysis
- Comparison between PCA and LDA
Principal Components and Subspaces
- Subspaces preserve part of the information (and energy, or uncertainty) of the data.
- Principal components:
  - are orthogonal bases that preserve the largest portion of the information in the data;
  - capture the major uncertainties (or variations) of the data.
- Two views:
  - deterministic: minimize the distortion of the projection of the data;
  - statistical: maximize the uncertainty captured in the projected data.
- Are they the same? Under what condition are they the same?
View 1: Minimizing the MSE
- $x \in \mathbb{R}^n$, and assume the data are centered: $E\{x\} = 0$.
- $m$ is the dimension of the subspace, $m < n$.
- Orthonormal bases $W = [w_1, w_2, \ldots, w_m]$ with $W^T W = I$, i.e., a rotation.
- Orthogonal projection of $x$:
$$Px = \sum_{i=1}^{m} (w_i^T x)\, w_i = (WW^T)x$$
- It achieves the minimum mean-square error (prove it!):
$$e_{\min\mathrm{MSE}} = E\{\|x - Px\|^2\} = E\{\|P^{\perp} x\|^2\}$$
- PCA can be posed as finding a subspace that minimizes the MSE:
$$\arg\min_{W} J_{\mathrm{MSE}}(W) = E\{\|x - Px\|^2\}, \quad \text{s.t. } W^T W = I$$
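The projection above is easy to check numerically. A minimal NumPy sketch (variable names are mine, not from the slides), verifying that $P = WW^T$ built from an orthonormal $W$ is a valid orthogonal projector:

```python
import numpy as np

rng = np.random.default_rng(0)

# Centered data in R^n (n = 5), N = 2000 samples stored as columns.
n, N = 5, 2000
X = rng.normal(size=(n, N))
X -= X.mean(axis=1, keepdims=True)

# An orthonormal basis W of an m-dimensional subspace (m = 2),
# obtained here from the QR factorization of a random matrix.
m = 2
W, _ = np.linalg.qr(rng.normal(size=(n, m)))

# Orthogonal projection Px = (W W^T) x applied to every sample.
P = W @ W.T
PX = P @ X

# Sample estimate of the mean-square error E{||x - Px||^2}.
mse = np.mean(np.sum((X - PX) ** 2, axis=0))

# P must be idempotent and symmetric, as any orthogonal projector is.
assert np.allclose(P @ P, P)
assert np.allclose(P, P.T)
```

Minimizing `mse` over all orthonormal `W` is exactly the optimization problem on this slide.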
Let's do it...
It is easy to see:
$$J_{\mathrm{MSE}}(W) = E\{x^T x\} - E\{x^T P x\}$$
So, minimizing $J_{\mathrm{MSE}}(W)$ $\Leftrightarrow$ maximizing $E\{x^T P x\}$.
Then we have the following constrained optimization problem:
$$\max_W E\{x^T W W^T x\} \quad \text{s.t. } W^T W = I$$
The Lagrangian is
$$L(W, \lambda) = E\{x^T W W^T x\} + \lambda^T (I - W^T W)$$
The set of KKT conditions gives:
$$\frac{\partial L(W, \lambda)}{\partial w_i} = 2E\{xx^T\} w_i - 2\lambda_i w_i = 0, \quad \forall i$$
What is it?
Let's denote $S = E\{xx^T\}$ (note: $E\{x\} = 0$, so $S$ is the covariance). The KKT conditions give:
$$S w_i = \lambda_i w_i, \quad \forall i$$
or in a more concise matrix form:
$$SW = W\Lambda, \quad \Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_m)$$
What is this? An eigenvalue problem. Then, the value of the minimum MSE is
$$e_{\min\mathrm{MSE}} = \sum_{i=m+1}^{n} \lambda_i$$
i.e., the sum of the eigenvalues of the orthogonal complement of the PCA subspace.
View 2: Maximizing the Variation
Let's look at it from another perspective:
- We have a linear projection of $x$ onto a 1-d subspace: $y = w^T x$.
- An important note: $E\{y\} = 0$ since $E\{x\} = 0$.
- The first principal component of $x$ is the one for which the variance of the projection $y$ is maximized.
- Of course, we need to constrain $w$ to be a unit vector.
- So we have the following optimization problem:
$$\max_w J(w) = E\{y^2\} = E\{(w^T x)^2\}, \quad \text{s.t. } w^T w = 1$$
$$\max_w J(w) = w^T S w, \quad \text{s.t. } w^T w = 1$$
What is it?
Maximizing the Variation (cont.)
- Sort the eigenvalues of $S$ as $\lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_n$, with eigenvectors $\{e_1, \ldots, e_n\}$.
- It is clear that the first PC is $y_1 = e_1^T x$.
- This can be generalized to $m$ PCs (where $m < n$) with one more constraint:
$$E\{y_m y_k\} = 0, \quad \forall k < m$$
i.e., each PC is uncorrelated with all previously found PCs.
- The solution is $w_k = e_k$. Sound familiar?
The Two Views Converge
- The two views lead to the same result!
- You should prove: uncorrelated components $\Leftrightarrow$ orthonormal projection bases.
- What if we are more greedy and demand independent components? Should we still expect orthonormal bases? In which cases do we still have orthonormal bases?
- We'll see in the next lecture.
The Closed-Form Solution
Learning the principal components from $\{x_1, \ldots, x_N\}$:
1. calculate the mean $m = \frac{1}{N} \sum_{k=1}^{N} x_k$
2. center the data: $A = [x_1 - m, \ldots, x_N - m]$
3. calculate the scatter $S = \sum_{k=1}^{N} (x_k - m)(x_k - m)^T = AA^T$
4. eigenvalue decomposition: $S = U \Sigma U^T$
5. sort $\lambda_i$ and $e_i$
6. form the basis $W = [e_1, e_2, \ldots, e_m]$
Note: the components of $x$ are $y = W^T(x - m)$, where $x \in \mathbb{R}^n$ and $y \in \mathbb{R}^m$.
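The six steps above translate directly into NumPy. A sketch (the function and variable names are mine):

```python
import numpy as np

def pca(X, m):
    """Closed-form PCA. X is n x N (one sample per column), m the subspace dim.
    Returns the mean, the basis W (n x m), and the components Y (m x N)."""
    mean = X.mean(axis=1, keepdims=True)   # step 1: mean
    A = X - mean                           # step 2: centering
    S = A @ A.T                            # step 3: scatter matrix
    eigvals, eigvecs = np.linalg.eigh(S)   # step 4: EVD of symmetric S
    order = np.argsort(eigvals)[::-1]      # step 5: sort eigenvalues descending
    W = eigvecs[:, order[:m]]              # step 6: top-m eigenvectors
    Y = W.T @ A                            # components y = W^T (x - mean)
    return mean, W, Y

# Toy data whose variance is concentrated in the first coordinates.
rng = np.random.default_rng(1)
X = rng.normal(size=(4, 500)) * np.array([[5.0], [2.0], [0.5], [0.1]])
mean, W, Y = pca(X, 2)

# The learned basis is orthonormal, as required by W^T W = I.
assert W.shape == (4, 2)
assert np.allclose(W.T @ W, np.eye(2), atol=1e-8)
```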
A First Issue
- $n$ is the dimension of the input data; $N$ is the size of the training set.
- In practice, $n \gg N$. E.g., in image-based face recognition, if the resolution of a face image is $100 \times 100$, stacking all the pixels gives $n = 10{,}000$.
- Note that $S$ is an $n \times n$ matrix.
- Difficulties:
  - $S$ is ill-conditioned, as in general $\mathrm{rank}(S) \le N \ll n$;
  - eigenvalue decomposition of $S$ is too demanding.
- So, what should we do?
Solution I: A First Trick
$A$ is an $n \times N$ matrix, so $S = AA^T$ is $n \times n$, but $A^T A$ is only $N \times N$.
Trick: do the eigenvalue decomposition on $A^T A$ instead:
$$A^T A e = \lambda e \;\Rightarrow\; AA^T (Ae) = \lambda (Ae)$$
i.e., if $e$ is an eigenvector of $A^T A$, then $Ae$ is an eigenvector of $AA^T$, and the corresponding eigenvalues are the same. Don't forget to normalize $Ae$.
Note: this trick does not fully solve the problem, as we still need an eigenvalue decomposition of an $N \times N$ matrix, which can be fairly large in practice.
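The trick can be checked numerically. A sketch (names are mine) with $n \gg N$, mapping each small eigenvector $e$ of $A^T A$ to $Ae$ and normalizing:

```python
import numpy as np

rng = np.random.default_rng(2)
n, N = 1000, 20                  # n >> N, as with stacked image pixels
A = rng.normal(size=(n, N))
A -= A.mean(axis=1, keepdims=True)

# EVD of the small N x N matrix A^T A instead of the n x n matrix A A^T.
lam, E = np.linalg.eigh(A.T @ A)

# Map each eigenvector e to A e and normalize: these are eigenvectors
# of A A^T with the same eigenvalues.
U = A @ E
U /= np.linalg.norm(U, axis=0, keepdims=True)

# Check one dominant eigenpair: (A A^T) u = lambda u.
i = np.argmax(lam)
assert np.allclose((A @ A.T) @ U[:, i], lam[i] * U[:, i])
```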
Solution II: Using SVD
Instead of doing an EVD, doing a (thin) SVD (singular value decomposition) is easier:
$$A \in \mathbb{R}^{n \times N}, \quad A = U \Sigma V^T$$
where
- $U \in \mathbb{R}^{n \times N}$, with $U^T U = I$;
- $\Sigma \in \mathbb{R}^{N \times N}$ is diagonal;
- $V \in \mathbb{R}^{N \times N}$, with $V^T V = V V^T = I$.
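Since $AA^T = U\Sigma^2 U^T$, the left singular vectors of $A$ are exactly the eigenvectors of $S$, with eigenvalues $\sigma_i^2$. A quick NumPy check (names mine):

```python
import numpy as np

rng = np.random.default_rng(3)
n, N = 1000, 20
A = rng.normal(size=(n, N))
A -= A.mean(axis=1, keepdims=True)

# Thin SVD: A = U Sigma V^T, with U of size n x N (full_matrices=False).
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# The columns of U are eigenvectors of S = A A^T; eigenvalues are s**2,
# so the SVD of A replaces the EVD of the huge n x n matrix S.
S = A @ A.T
assert np.allclose(S @ U[:, 0], (s[0] ** 2) * U[:, 0])
```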
Solution III: Iterative Solution
We can design an iterative procedure for finding $W$, i.e., $W \leftarrow W + \Delta W$.
- Looking at the view of MSE minimization, the per-sample cost is
$$\Big\|x - \sum_{i=1}^{m}(w_i^T x)\, w_i\Big\|^2 = \Big\|x - \sum_{i=1}^{m} y_i w_i\Big\|^2 = \|x - (WW^T)x\|^2$$
- Gradient descent gives the per-sample update
$$\Delta w_i = \gamma\, y_i \Big[x - \sum_{j=1}^{m} y_j w_j\Big]$$
- We can stop updating once the KKT conditions are met.
Its matrix form is the subspace learning algorithm:
$$\Delta W = \gamma\,(xx^T W - W W^T xx^T W)$$
Two issues:
- the orthogonality is not enforced;
- slow convergence.
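The matrix update can be sketched directly. The toy data, step size, and iteration count below are my choices; with a clearly dominant subspace the iterates drift toward an orthonormal basis of it, even though orthogonality is never enforced:

```python
import numpy as np

rng = np.random.default_rng(4)
n, m, gamma = 3, 2, 1e-3

# Data with a clearly dominant 2-d subspace (variances 10, 5, 0.1).
std = np.array([10.0, 5.0, 0.1]) ** 0.5

# Random orthonormal initialization of W.
W, _ = np.linalg.qr(rng.normal(size=(n, m)))

for _ in range(20000):
    x = std * rng.normal(size=n)
    y = W.T @ x
    # Subspace learning rule: dW = gamma (x y^T - W y y^T),
    # the per-sample form of gamma (xx^T W - W W^T xx^T W).
    W += gamma * (np.outer(x, y) - W @ np.outer(y, y))

# W approaches (but never exactly reaches) an orthonormal basis
# of the principal subspace spanned by the first two axes.
assert np.allclose(W.T @ W, np.eye(m), atol=0.1)
assert np.linalg.norm(W @ W.T @ np.array([1.0, 0.0, 0.0])) > 0.9
```

The loose tolerances reflect the slide's two caveats: orthogonality is only approximate, and convergence is slow.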
Solution IV: PAST
To speed up the iteration, we can use recursive least squares (RLS). Consider the cost function
$$J(t) = \sum_{i=1}^{t} \beta^{\,t-i}\, \|x(i) - W(t)\,y(i)\|^2$$
where $\beta$ is the forgetting factor. $W$ can be solved recursively by the following PAST algorithm:
1. $y(t) = W^T(t-1)\,x(t)$
2. $h(t) = P(t-1)\,y(t)$
3. $g(t) = h(t)\,/\,(\beta + y^T(t)\,h(t))$  (the gain vector, written $g$ here to avoid clashing with the subspace dimension $m$)
4. $P(t) = \frac{1}{\beta}\,\mathrm{Tri}[P(t-1) - g(t)\,h^T(t)]$
5. $e(t) = x(t) - W(t-1)\,y(t)$
6. $W(t) = W(t-1) + e(t)\,g^T(t)$
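The six recursions can be sketched as follows. This is my illustrative implementation, not the slides' code: I call the step-3 gain `g`, interpret the Tri[.] operator as keeping $P$ symmetric, and pick the data, $\beta$, and initialization myself:

```python
import numpy as np

rng = np.random.default_rng(5)
n, m, beta = 4, 2, 0.995
std = np.array([8.0, 4.0, 0.5, 0.1]) ** 0.5   # dominant 2-d subspace

W = np.vstack([np.eye(m), np.zeros((n - m, m))])  # simple initialization
P = np.eye(m)

for _ in range(3000):
    x = std * rng.normal(size=n)
    y = W.T @ x                       # step 1
    h = P @ y                         # step 2
    g = h / (beta + y @ h)            # step 3: RLS gain vector
    P = (P - np.outer(g, h)) / beta   # step 4
    P = (P + P.T) / 2                 # Tri[.]: enforce symmetry of P
    e = x - W @ y                     # step 5
    W = W + np.outer(e, g)            # step 6

# W should track the dominant 2-d subspace (first two coordinate axes).
Q, _ = np.linalg.qr(W)
assert np.linalg.norm(Q.T @ np.array([1.0, 0, 0, 0])) > 0.8
```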
Outline
- Principal Component Analysis
- Linear Discriminant Analysis
- Comparison between PCA and LDA
Face Recognition: Does PCA Work Well?
(figure: the same face under different illumination conditions)
- What does PCA capture? Is this what we really want?
From Descriptive to Discriminative
- PCA extracts features (or components) that describe the pattern well.
- Are they necessarily good for distinguishing between classes and separating patterns? Examples?
- We need discriminative features, so supervision (i.e., labeled training data) is needed.
- The issues are:
  - How do we define the discriminant and separability between classes?
  - How many features do we need?
  - How do we maximize the separability?
- Here, we give an example: linear discriminant analysis.
Linear Discriminant Analysis
- Find an optimal linear projection $W$ that captures the major differences between classes and discounts irrelevant factors.
- In the projected discriminative subspace, the data are clustered by class.
Within-class and Between-class Scatters
We have two sets of labeled data: $D_1 = \{x_1, \ldots, x_{n_1}\}$ and $D_2 = \{x_1, \ldots, x_{n_2}\}$. Let's define some terms:
- The centers of the two classes:
$$m_i = \frac{1}{n_i} \sum_{x \in D_i} x$$
- Data scatter, by definition:
$$S = \sum_{x \in D} (x - m)(x - m)^T$$
- Within-class scatter:
$$S_w = S_1 + S_2$$
- Between-class scatter:
$$S_b = (m_1 - m_2)(m_1 - m_2)^T$$
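These definitions can be sketched in a few lines of NumPy (function and variable names are mine):

```python
import numpy as np

def scatters(D1, D2):
    """Within- and between-class scatter for two classes.
    D1, D2 hold one sample per column."""
    m1 = D1.mean(axis=1, keepdims=True)
    m2 = D2.mean(axis=1, keepdims=True)
    S1 = (D1 - m1) @ (D1 - m1).T      # per-class scatters
    S2 = (D2 - m2) @ (D2 - m2).T
    Sw = S1 + S2                      # within-class scatter
    Sb = (m1 - m2) @ (m1 - m2).T      # between-class scatter
    return Sw, Sb, m1, m2

rng = np.random.default_rng(6)
D1 = rng.normal(size=(2, 100)) + np.array([[3.0], [0.0]])
D2 = rng.normal(size=(2, 100)) - np.array([[3.0], [0.0]])
Sw, Sb, m1, m2 = scatters(D1, D2)

assert Sw.shape == (2, 2) and Sb.shape == (2, 2)
# Sb is an outer product of one vector, hence rank one.
assert np.linalg.matrix_rank(Sb) == 1
```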
Fisher Linear Discriminant
Input: two sets of labeled data, $D_1 = \{x_1, \ldots, x_{n_1}\}$ and $D_2 = \{x_1, \ldots, x_{n_2}\}$.
Output: a 1-d linear projection $w$ that maximizes the separability between these two classes.
- Projected data: $Y_1 = w^T D_1$ and $Y_2 = w^T D_2$.
- Projected class centers: $\tilde{m}_i = w^T m_i$.
- Projected within-class scatter (a scalar in this case): $\tilde{S}_w = w^T S_w w$ (prove it!)
- Projected between-class scatter (a scalar in this case): $\tilde{S}_b = w^T S_b w$ (prove it!)
The Fisher linear discriminant:
$$J(w) = \frac{|\tilde{m}_1 - \tilde{m}_2|^2}{\tilde{S}_1 + \tilde{S}_2} = \frac{\tilde{S}_b}{\tilde{S}_w} = \frac{w^T S_b w}{w^T S_w w}$$
Rayleigh Quotient
Theorem:
$$f(\lambda) = \|Ax - \lambda Bx\|_B, \quad \text{where } \|z\|_B = z^T B^{-1} z,$$
is minimized by the Rayleigh quotient
$$\lambda = \frac{x^T A x}{x^T B x}$$
Proof.
$$\frac{\partial f(\lambda)}{\partial \lambda} = -2(Bx)^T B^{-1}(Ax - \lambda Bx) = -2\,x^T(Ax - \lambda Bx) = -2\,(x^T A x - \lambda\, x^T B x)$$
Setting it to zero, we see the result clearly.
Optimizing the Fisher Discriminant
Theorem: $J(w) = \frac{w^T S_b w}{w^T S_w w}$ is maximized when
$$S_b w = \lambda S_w w$$
Proof. Let $w^T S_w w = c \ne 0$. We can construct the Lagrangian
$$L(w, \lambda) = w^T S_b w - \lambda\,(w^T S_w w - c)$$
Then the KKT condition is
$$\frac{\partial L(w, \lambda)}{\partial w} = 2 S_b w - 2\lambda S_w w = 0$$
It is clear that $S_b w = \lambda S_w w$.
An Efficient Solution
A naive solution is $S_w^{-1} S_b w = \lambda w$: do an EVD on $S_w^{-1} S_b$, which needs some computation. Is there a more efficient way?
Facts:
- $S_b w$ is along the direction of $m_1 - m_2$ (why?);
- we don't care about $\lambda$ or the scale factor.
So we can easily figure out the direction of $w$:
$$w = S_w^{-1}(m_1 - m_2)$$
Note: $\mathrm{rank}(S_b) = 1$.
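The closed form above is one linear solve. A sketch (data and names are mine) that also checks $w$ scores at least as well on the Fisher criterion as the naive direction $m_1 - m_2$:

```python
import numpy as np

rng = np.random.default_rng(7)
# Two elongated Gaussian classes; the mean difference is NOT aligned
# with the direction of least within-class variance.
C = np.array([[4.0, 0.0], [0.0, 0.3]])
D1 = C @ rng.normal(size=(2, 200)) + np.array([[2.0], [2.0]])
D2 = C @ rng.normal(size=(2, 200)) + np.array([[-2.0], [-2.0]])

m1 = D1.mean(axis=1)
m2 = D2.mean(axis=1)
Sw = (D1 - m1[:, None]) @ (D1 - m1[:, None]).T \
   + (D2 - m2[:, None]) @ (D2 - m2[:, None]).T

# Fisher direction: w = Sw^{-1} (m1 - m2); its scale is irrelevant.
w = np.linalg.solve(Sw, m1 - m2)

def J(v):
    """Fisher criterion J(v) = (v^T Sb v) / (v^T Sw v)."""
    Sb = np.outer(m1 - m2, m1 - m2)
    return (v @ Sb @ v) / (v @ Sw @ v)

# w maximizes J, so it beats (or ties) the naive mean-difference direction.
assert J(w) >= J(m1 - m2) - 1e-12
```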
Multiple Discriminant Analysis
Now, we have $c$ classes:
- within-class scatter, as before:
$$S_w = \sum_{i=1}^{c} S_i$$
- between-class scatter, a bit different from the 2-class case:
$$S_b = \sum_{i=1}^{c} n_i (m_i - m)(m_i - m)^T$$
- total scatter:
$$S_t = \sum_{x} (x - m)(x - m)^T = S_w + S_b$$
MDA is to find a subspace with bases $W$ that maximizes
$$J(W) = \frac{|W^T S_b W|}{|W^T S_w W|}$$
The Solution to MDA
The solution is obtained by a generalized EVD:
$$S_b w_i = \lambda_i S_w w_i$$
where each $w_i$ is a generalized eigenvector. In practice, what we can do is the following:
- find the eigenvalues as the roots of the characteristic polynomial
$$|S_b - \lambda_i S_w| = 0$$
- for each $\lambda_i$, solve $w_i$ from
$$(S_b - \lambda_i S_w)\, w_i = 0$$
Note: $W$ is not unique (up to rotation and scaling).
Note: $\mathrm{rank}(S_b) \le c - 1$ (why?)
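One way to sketch the generalized EVD (my toy data and names; here it is reduced to an ordinary EVD of $S_w^{-1} S_b$ for simplicity, which assumes $S_w$ is invertible):

```python
import numpy as np

rng = np.random.default_rng(8)
n, c, per = 3, 3, 100          # 3-d data, 3 classes, 100 samples each

# Three Gaussian classes with distinct means (one sample per row).
means = np.array([[0.0, 0.0, 0.0], [4.0, 0.0, 0.0], [0.0, 4.0, 0.0]])
X = [rng.normal(size=(per, n)) + means[i] for i in range(c)]

m = np.vstack(X).mean(axis=0)  # total mean
Sw = np.zeros((n, n))
Sb = np.zeros((n, n))
for Xi in X:
    mi = Xi.mean(axis=0)
    Sw += (Xi - mi).T @ (Xi - mi)           # within-class scatter
    Sb += per * np.outer(mi - m, mi - m)    # between-class scatter

# Generalized eigenproblem Sb w = lambda Sw w, solved via Sw^{-1} Sb.
lam, V = np.linalg.eig(np.linalg.solve(Sw, Sb))
order = np.argsort(lam.real)[::-1]
W = V[:, order[: c - 1]].real               # at most c-1 useful directions

# rank(Sb) <= c - 1, so at most c - 1 eigenvalues are nonzero.
assert np.linalg.matrix_rank(Sb) <= c - 1
assert W.shape == (n, c - 1)
```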
Outline
- Principal Component Analysis
- Linear Discriminant Analysis
- Comparison between PCA and LDA
The Relation between PCA and LDA
(figure: the LDA projection direction vs. the PCA projection direction)