Fisher's Linear Discriminant Analysis

Seungjin Choi
Department of Computer Science and Engineering
Pohang University of Science and Technology
77 Cheongam-ro, Nam-gu, Pohang 37673, Korea
seungjin@postech.ac.kr
FLD or LDA

Introduced by Fisher (1936); one of the most widely used linear discriminant analysis (LDA) methods.

Curse of dimensionality:
- Linear dimensionality reduction: PCA, ICA, FLD, MDS
- Nonlinear dimensionality reduction: Isomap, LLE, Laplacian eigenmap

FLD aims at achieving an optimal linear dimensionality reduction for classification.
An Example: Isotropic Case

[Figure: two isotropic classes with means $\mu_1$ and $\mu_2$ in the $(x_1, x_2)$ plane.]
FLD: A Graphical Illustration

[Figure: two classes with means $\mu_1$ and $\mu_2$ projected onto a discriminant direction in the $(x_1, x_2)$ plane.]
Two Classes

Given a set of data points $\{x \in \mathbb{R}^D\}$, one wishes to find a linear projection of the data onto a 1-dimensional space, $y = w^\top x$.

Sample means for $x$:
$$\mu_i = \frac{1}{N_i} \sum_{x \in \mathcal{C}_i} x.$$

Sample means for the projected points:
$$\widetilde{\mu}_i = \frac{1}{N_i} \sum_{y \in \mathcal{Y}_i} y = \frac{1}{N_i} \sum_{x \in \mathcal{C}_i} w^\top x = w^\top \mu_i.$$

We know that the difference between sample means is not always a good measure of the separation between projected points: $\widetilde{\mu}_1 - \widetilde{\mu}_2 = w^\top (\mu_1 - \mu_2)$. Scaling $w$ makes $|\widetilde{\mu}_1 - \widetilde{\mu}_2|$ arbitrarily large (not desirable!).
FLD: Two Classes

Define the within-class scatter for projected samples by $s_1^2 + s_2^2$, where
$$s_i^2 = \sum_{y \in \mathcal{Y}_i} (y - \widetilde{\mu}_i)^2 = w^\top \underbrace{\left[ \sum_{x \in \mathcal{C}_i} (x - \mu_i)(x - \mu_i)^\top \right]}_{S_i} w.$$

FLD finds
$$w^\star = \arg\max_w \frac{(\widetilde{\mu}_1 - \widetilde{\mu}_2)^2}{s_1^2 + s_2^2} = \arg\max_w \frac{w^\top S_B w}{w^\top S_W w},$$
where $S_W = S_1 + S_2$ (within-class scatter matrix) and $S_B = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^\top$ (between-class scatter matrix).

$$\arg\max_w \frac{w^\top S_B w}{w^\top S_W w} \;\Rightarrow\; S_B w = \lambda S_W w \quad \text{(generalized eigenvalue problem).}$$
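As an illustration (not part of the original slides), the closed-form two-class solution $w \propto S_W^{-1}(\mu_1 - \mu_2)$ can be computed in a few lines of NumPy; the function name and toy data below are illustrative only.

```python
import numpy as np

def fld_two_class(X1, X2):
    """Two-class FLD direction: w proportional to S_W^{-1} (mu1 - mu2).
    X1, X2: arrays of shape (N_i, D), one row per sample."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    # within-class scatter S_W = S_1 + S_2 (unnormalized covariance sums)
    S1 = (X1 - mu1).T @ (X1 - mu1)
    S2 = (X2 - mu2).T @ (X2 - mu2)
    S_W = S1 + S2
    # solve S_W w = (mu1 - mu2); any positive rescaling of w is equivalent
    w = np.linalg.solve(S_W, mu1 - mu2)
    return w / np.linalg.norm(w)

# toy usage
rng = np.random.default_rng(0)
X1 = rng.normal(loc=[0, 0], scale=1.0, size=(100, 2))
X2 = rng.normal(loc=[3, 1], scale=1.0, size=(120, 2))
w = fld_two_class(X1, X2)
print("FLD direction:", w)
```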
Multiple Discriminant Functions

For the case of $K$ classes, FLD involves $K-1$ discriminant functions, i.e., the projection is from $\mathbb{R}^D$ to $\mathbb{R}^{K-1}$.

Given a set of data $\{x \in \mathbb{R}^D\}$, one wishes to find a linear lower-dimensional embedding $W \in \mathbb{R}^{(K-1) \times D}$ such that $\{y = Wx\}$ are classified as well as possible in the lower-dimensional space.

$$\underbrace{\begin{bmatrix} y_1 \\ \vdots \\ y_{K-1} \end{bmatrix}}_{y} = \underbrace{\begin{bmatrix} w_1^\top \\ \vdots \\ w_{K-1}^\top \end{bmatrix}}_{W} \underbrace{\begin{bmatrix} x_1 \\ \vdots \\ x_D \end{bmatrix}}_{x}$$
Scatter Matrices

Within-class scatter matrix:
$$S_W = \sum_{i=1}^{K} \sum_{x \in \mathcal{C}_i} (x - \mu_i)(x - \mu_i)^\top.$$

Between-class scatter matrix:
$$S_B = \sum_{i=1}^{K} \sum_{x \in \mathcal{C}_i} (\mu_i - \mu)(\mu_i - \mu)^\top = \sum_{i=1}^{K} N_i (\mu_i - \mu)(\mu_i - \mu)^\top.$$

Total scatter matrix:
$$S_T = \sum_{x} (x - \mu)(x - \mu)^\top = S_W + S_B.$$

$\operatorname{Rank}(S_B) \le K-1$, $\operatorname{Rank}(S_W) \le N-K$, $\operatorname{Rank}(S_T) \le N-1$.
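A minimal NumPy sketch (not from the slides; function name and toy data are illustrative) that builds $S_W$, $S_B$, and $S_T$ and checks $S_T = S_W + S_B$ together with the rank bound on $S_B$:

```python
import numpy as np

def scatter_matrices(X, labels):
    """Compute within-, between-, and total scatter matrices.
    X: (N, D) data matrix, labels: length-N array of class indices."""
    mu = X.mean(axis=0)
    D = X.shape[1]
    S_W = np.zeros((D, D))
    S_B = np.zeros((D, D))
    for k in np.unique(labels):
        Xk = X[labels == k]
        mu_k = Xk.mean(axis=0)
        S_W += (Xk - mu_k).T @ (Xk - mu_k)
        S_B += len(Xk) * np.outer(mu_k - mu, mu_k - mu)
    S_T = (X - mu).T @ (X - mu)
    return S_W, S_B, S_T

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 1.0, size=(50, 3)) for m in ([0, 0, 0], [2, 1, 0], [0, 3, 1])])
labels = np.repeat([0, 1, 2], 50)
S_W, S_B, S_T = scatter_matrices(X, labels)
print(np.allclose(S_T, S_W + S_B))     # True: S_T = S_W + S_B
print(np.linalg.matrix_rank(S_B))      # at most K - 1 = 2
```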
Total Scatter Matrix

Define $X = [X_1, \ldots, X_K]$, where $X_i$ is a matrix whose columns are the data vectors belonging to $\mathcal{C}_i$. Define
$$H_W = \left[ X_1 - \mu_1 e_1^\top, \ldots, X_K - \mu_K e_K^\top \right], \qquad H_B = \left[ (\mu_1 - \mu) e_1^\top, \ldots, (\mu_K - \mu) e_K^\top \right], \qquad H_T = \left[ x_1 - \mu, \ldots, x_N - \mu \right].$$

One can easily see that $H_T = X - \mu e^\top = H_W + H_B$. We also have
$$S_W = H_W H_W^\top, \qquad S_B = H_B H_B^\top, \qquad S_T = H_T H_T^\top.$$

Since $H_W H_B^\top = 0$, we have $S_T = (H_W + H_B)(H_W + H_B)^\top = S_W + S_B$.

The column vectors of $S_W$ and $S_B$ are linear combinations of centered data samples.
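The factorizations above are easy to verify numerically; the sketch below (illustrative toy data, with columns as data vectors to match the slide) checks $H_T = H_W + H_B$, $H_W H_B^\top = 0$, and $S_T = S_W + S_B$.

```python
import numpy as np

rng = np.random.default_rng(1)
# three small classes in R^3; each X_i has its data vectors as columns
Xs = [rng.normal(m, 1.0, size=(n, 3)).T
      for m, n in zip(([0, 0, 0], [2, 1, 0], [0, 3, 1]), (4, 5, 6))]
X = np.hstack(Xs)
mu = X.mean(axis=1, keepdims=True)
mus = [Xi.mean(axis=1, keepdims=True) for Xi in Xs]

H_W = np.hstack([Xi - mi for Xi, mi in zip(Xs, mus)])
H_B = np.hstack([np.tile(mi - mu, (1, Xi.shape[1])) for Xi, mi in zip(Xs, mus)])
H_T = X - mu

print(np.allclose(H_T, H_W + H_B))                           # H_T = H_W + H_B
print(np.allclose(H_W @ H_B.T, 0))                           # cross term vanishes
print(np.allclose(H_T @ H_T.T, H_W @ H_W.T + H_B @ H_B.T))   # S_T = S_W + S_B
```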
FLD: Multiple Classes

Define the scatter matrices of the projected samples:
$$\widetilde{S}_W = \sum_{i=1}^{K} \sum_{y \in \mathcal{Y}_i} (y - \widetilde{\mu}_i)(y - \widetilde{\mu}_i)^\top, \qquad \widetilde{S}_B = \sum_{i=1}^{K} N_i (\widetilde{\mu}_i - \widetilde{\mu})(\widetilde{\mu}_i - \widetilde{\mu})^\top.$$

One can easily show that $\widetilde{S}_W = W S_W W^\top$ and $\widetilde{S}_B = W S_B W^\top$.
FLD seeks $K-1$ discriminant functions $W$ such that $y = Wx$:
$$W^\star = \arg\max_W J_{\mathrm{FLD}} = \arg\max_W \operatorname{tr}\left\{ \widetilde{S}_W^{-1} \widetilde{S}_B \right\} = \arg\max_W \operatorname{tr}\left\{ \left( W S_W W^\top \right)^{-1} \left( W S_B W^\top \right) \right\},$$
leading to the generalized eigenvalue problem
$$\arg\max_W J_{\mathrm{FLD}} \;\Rightarrow\; S_B w_i = \lambda_i S_W w_i.$$
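Assuming SciPy is available, this generalized eigenvalue problem can be solved with scipy.linalg.eigh; the sketch below (not from the slides; function name and toy data are illustrative) returns the top $K-1$ discriminant directions as rows of $W$, so that $y = Wx$.

```python
import numpy as np
from scipy.linalg import eigh

def fld_multiclass(X, labels, n_components=None):
    """Multi-class FLD: rows of W are generalized eigenvectors of S_B w = lambda S_W w.
    X: (N, D) data matrix, labels: (N,) class indices."""
    classes = np.unique(labels)
    K = len(classes)
    n_components = n_components or (K - 1)
    mu = X.mean(axis=0)
    D = X.shape[1]
    S_W = np.zeros((D, D))
    S_B = np.zeros((D, D))
    for k in classes:
        Xk = X[labels == k]
        mu_k = Xk.mean(axis=0)
        S_W += (Xk - mu_k).T @ (Xk - mu_k)
        S_B += len(Xk) * np.outer(mu_k - mu, mu_k - mu)
    # symmetric-definite generalized eigenproblem S_B w = lambda S_W w
    evals, evecs = eigh(S_B, S_W)                 # eigenvalues in ascending order
    return evecs[:, ::-1][:, :n_components].T     # (K-1, D): rows are w_i^T

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 1.0, size=(60, 4)) for m in ([0]*4, [3, 0, 0, 0], [0, 3, 0, 0])])
labels = np.repeat([0, 1, 2], 60)
W = fld_multiclass(X, labels)
Y = X @ W.T     # projected data, shape (180, 2)
```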
Rayleigh Quotient

Definition. Let $A \in \mathbb{R}^{m \times m}$ be symmetric. The Rayleigh quotient $R(x, A)$ is defined by
$$R(x, A) = \frac{x^\top A x}{x^\top x}.$$

Theorem. Let $A \in \mathbb{R}^{m \times m}$ be symmetric with eigenvalues $\lambda_1 \ge \cdots \ge \lambda_m$. For $x \ne 0 \in \mathbb{R}^m$, we have
$$\lambda_m \le \frac{x^\top A x}{x^\top x} \le \lambda_1,$$
and in particular,
$$\lambda_m = \min_{x \ne 0} \frac{x^\top A x}{x^\top x}, \qquad \lambda_1 = \max_{x \ne 0} \frac{x^\top A x}{x^\top x}.$$
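A quick numerical sanity check of the theorem (illustrative, using a random symmetric matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(5, 5))
A = M + M.T                                   # random symmetric matrix
evals, evecs = np.linalg.eigh(A)              # ascending eigenvalues

rayleigh = lambda x: (x @ A @ x) / (x @ x)
xs = rng.normal(size=(1000, 5))
print(evals[0] <= min(rayleigh(x) for x in xs))       # lower bound lambda_m
print(max(rayleigh(x) for x in xs) <= evals[-1])      # upper bound lambda_1
print(np.isclose(rayleigh(evecs[:, -1]), evals[-1]))  # max attained at the top eigenvector
```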
An Extremal Property of Generalized Eigenvalues

Theorem. Let $A$ and $B$ be $m \times m$ matrices, with $A$ nonnegative definite and $B$ positive definite. For $h = 1, \ldots, m$, define
$$X_h = [x_1, \ldots, x_h], \qquad Y_h = [x_h, \ldots, x_m],$$
where $x_1, \ldots, x_m$ are linearly independent eigenvectors of $B^{-1}A$ corresponding to the eigenvalues
$$\lambda_1\!\left(B^{-1}A\right) \ge \cdots \ge \lambda_m\!\left(B^{-1}A\right).$$
Then
$$\lambda_h\!\left(B^{-1}A\right) = \min_{Y_{h+1}^\top B x = 0} \frac{x^\top A x}{x^\top B x}, \qquad \lambda_h\!\left(B^{-1}A\right) = \max_{X_{h-1}^\top B x = 0} \frac{x^\top A x}{x^\top B x},$$
where $x = 0$ is excluded.
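The constrained characterization can also be checked numerically; the sketch below (illustrative random matrices, $h = 2$ chosen arbitrarily) verifies that $x_h$ satisfies the constraint $X_{h-1}^\top B x = 0$, attains $\lambda_h$, and that no other vector in the constraint set exceeds it.

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
M = rng.normal(size=(5, 5)); A = M @ M.T               # nonnegative definite
P = rng.normal(size=(5, 5)); B = P @ P.T + np.eye(5)   # positive definite

# eigenvectors of B^{-1}A (B-orthonormal), sorted so that lambda_1 >= ... >= lambda_m
lam, V = eigh(A, B)
lam, V = lam[::-1], V[:, ::-1]

h = 2                       # check the h-th eigenvalue (1-indexed as on the slide)
X_prev = V[:, :h-1]         # x_1, ..., x_{h-1}

# sample vectors satisfying X_{h-1}^T B x = 0 and evaluate x^T A x / x^T B x
quotients = []
for _ in range(2000):
    x = rng.normal(size=5)
    for v in X_prev.T:                       # project out the B-orthogonality constraints
        x -= (v @ B @ x) / (v @ B @ v) * v
    quotients.append((x @ A @ x) / (x @ B @ x))

x_h = V[:, h-1]
print(np.allclose(X_prev.T @ B @ x_h, 0))                        # x_h satisfies the constraint
print(np.isclose((x_h @ A @ x_h) / (x_h @ B @ x_h), lam[h-1]))   # and attains lambda_h
print(max(quotients) <= lam[h-1] + 1e-10)                        # no constrained x exceeds it
```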
Relation to Least Squares Regression: Binary Class

Given a training set $\{x_i, y_i\}_{i=1}^{N}$, where $x_i \in \mathbb{R}^D$ and $y_i \in \{1, -1\}$, consider a linear discriminant function $f(x_i) = w^\top x_i + b$.

Partition the data matrix into two groups, containing the examples in class 1 and class 2 respectively, i.e., $X = [X_1, X_2]$, where $X_1 \in \mathbb{R}^{D \times N_1}$ and $X_2 \in \mathbb{R}^{D \times N_2}$.

Define the binary label vector $y \in \mathbb{R}^N$; then LS regression is formulated as
$$\arg\min_{w, b} \left\| y - X^\top w - b \mathbf{1}_N \right\|^2,$$
where $\mathbf{1}_N$ is the $N$-dimensional vector of all ones, which can be re-written as
$$\arg\min_{w, b} \left\| \begin{bmatrix} X_1^\top & \mathbf{1}_{N_1} \\ X_2^\top & \mathbf{1}_{N_2} \end{bmatrix} \begin{bmatrix} w \\ b \end{bmatrix} - \begin{bmatrix} \mathbf{1}_{N_1} \\ -\mathbf{1}_{N_2} \end{bmatrix} \right\|^2.$$
The solution to this LS problem satisfies the normal equation:
$$\begin{bmatrix} X_1 & X_2 \\ \mathbf{1}_{N_1}^\top & \mathbf{1}_{N_2}^\top \end{bmatrix} \begin{bmatrix} X_1^\top & \mathbf{1}_{N_1} \\ X_2^\top & \mathbf{1}_{N_2} \end{bmatrix} \begin{bmatrix} w \\ b \end{bmatrix} = \begin{bmatrix} X_1 & X_2 \\ \mathbf{1}_{N_1}^\top & \mathbf{1}_{N_2}^\top \end{bmatrix} \begin{bmatrix} \mathbf{1}_{N_1} \\ -\mathbf{1}_{N_2} \end{bmatrix},$$
which is written as
$$\begin{bmatrix} X_1 X_1^\top + X_2 X_2^\top & X_1 \mathbf{1}_{N_1} + X_2 \mathbf{1}_{N_2} \\ \mathbf{1}_{N_1}^\top X_1^\top + \mathbf{1}_{N_2}^\top X_2^\top & \mathbf{1}_{N_1}^\top \mathbf{1}_{N_1} + \mathbf{1}_{N_2}^\top \mathbf{1}_{N_2} \end{bmatrix} \begin{bmatrix} w \\ b \end{bmatrix} = \begin{bmatrix} X_1 \mathbf{1}_{N_1} - X_2 \mathbf{1}_{N_2} \\ \mathbf{1}_{N_1}^\top \mathbf{1}_{N_1} - \mathbf{1}_{N_2}^\top \mathbf{1}_{N_2} \end{bmatrix}.$$

Recall
$$S_B = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^\top,$$
$$S_W = \left(X_1 - \mu_1 \mathbf{1}_{N_1}^\top\right)\left(X_1 - \mu_1 \mathbf{1}_{N_1}^\top\right)^\top + \left(X_2 - \mu_2 \mathbf{1}_{N_2}^\top\right)\left(X_2 - \mu_2 \mathbf{1}_{N_2}^\top\right)^\top = X_1 X_1^\top - N_1 \mu_1 \mu_1^\top + X_2 X_2^\top - N_2 \mu_2 \mu_2^\top.$$
With $S_B$ and $S_W$, the normal equation is written as
$$\begin{bmatrix} S_W + N_1 \mu_1 \mu_1^\top + N_2 \mu_2 \mu_2^\top & N_1 \mu_1 + N_2 \mu_2 \\ (N_1 \mu_1 + N_2 \mu_2)^\top & N_1 + N_2 \end{bmatrix} \begin{bmatrix} w \\ b \end{bmatrix} = \begin{bmatrix} N_1 \mu_1 - N_2 \mu_2 \\ N_1 - N_2 \end{bmatrix}.$$

Solve the 2nd equation for $b$ to obtain
$$b = \frac{(N_1 - N_2) - (N_1 \mu_1 + N_2 \mu_2)^\top w}{N_1 + N_2}.$$

Substitute this into the 1st equation to obtain
$$\left[ S_W + \frac{N_1 N_2}{N_1 + N_2} S_B \right] w = \frac{2 N_1 N_2}{N_1 + N_2} (\mu_1 - \mu_2).$$
Note that the vector $S_B w$ is always in the direction of $\mu_1 - \mu_2$ for any $w$, since $S_B w = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^\top w$. Thus we write
$$\frac{N_1 N_2}{N_1 + N_2} S_B w = \left( \frac{2 N_1 N_2}{N_1 + N_2} - \alpha \right) (\mu_1 - \mu_2)$$
for some scalar $\alpha$. Then we have
$$w = \alpha S_W^{-1} (\mu_1 - \mu_2),$$
which is identical to the FLD solution up to a scaling factor.
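This equivalence can be confirmed numerically: fitting least squares with $\pm 1$ targets and comparing against $S_W^{-1}(\mu_1 - \mu_2)$ yields the same direction (the toy data below is illustrative, not from the slides).

```python
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.normal([0, 0, 0], 1.0, size=(80, 3))    # class 1
X2 = rng.normal([2, 1, -1], 1.0, size=(50, 3))   # class 2
X = np.vstack([X1, X2])
y = np.concatenate([np.ones(len(X1)), -np.ones(len(X2))])

# least-squares fit of y ~ w^T x + b
A = np.hstack([X, np.ones((len(X), 1))])
sol, *_ = np.linalg.lstsq(A, y, rcond=None)
w_ls = sol[:-1]

# FLD direction w proportional to S_W^{-1} (mu1 - mu2)
mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
S_W = (X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)
w_fld = np.linalg.solve(S_W, mu1 - mu2)

# the two directions agree up to a positive scaling factor
cos = w_ls @ w_fld / (np.linalg.norm(w_ls) * np.linalg.norm(w_fld))
print(np.isclose(cos, 1.0))
```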
Simultaneous Diagonalization

The goal: given two symmetric matrices, $\Sigma_1$ and $\Sigma_2$, find a linear transformation $W$ such that
$$W^\top \Sigma_1 W = I, \qquad W^\top \Sigma_2 W = \Lambda \ \text{(diagonal)}.$$

Methods: it turns out that simultaneous diagonalization involves the generalized eigen-decomposition.
- Two-stage method: 1. whitening, 2. unitary transformation.
- Single-stage method: generalized eigenvalue decomposition.
Simultaneous Diagonalization: Algorithm Outline

1. First, whiten $\Sigma_1$, i.e.,
$$D^{-\frac{1}{2}} U_1^\top \Sigma_1 U_1 D^{-\frac{1}{2}} = I, \qquad D^{-\frac{1}{2}} U_1^\top \Sigma_2 U_1 D^{-\frac{1}{2}} = K \ \text{(not diagonal)},$$
where $\Sigma_1 = U_1 D U_1^\top$.

2. Second, apply a unitary transformation to diagonalize $K$, i.e.,
$$U_2^\top I U_2 = I, \qquad U_2^\top K U_2 = \Lambda,$$
where $K = U_2 \Lambda U_2^\top$.

Then the transformation $W$ which simultaneously diagonalizes $\Sigma_1$ and $\Sigma_2$ is given by
$$W = U_1 D^{-\frac{1}{2}} U_2,$$
such that $W^\top \Sigma_1 W = I$ and $W^\top \Sigma_2 W = \Lambda$.
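A compact NumPy sketch of the two-stage method (assuming $\Sigma_1$ is positive definite so the whitening step is well defined; the matrices below are randomly generated for illustration):

```python
import numpy as np

def simultaneous_diagonalize(Sigma1, Sigma2):
    """Two-stage simultaneous diagonalization (whitening + rotation).
    Assumes Sigma1 is symmetric positive definite and Sigma2 is symmetric.
    Returns W with W.T @ Sigma1 @ W = I and W.T @ Sigma2 @ W = diag(lam)."""
    # Stage 1: whiten Sigma1 using its eigen-decomposition Sigma1 = U1 D U1^T
    d, U1 = np.linalg.eigh(Sigma1)
    Whiten = U1 @ np.diag(d ** -0.5)
    K = Whiten.T @ Sigma2 @ Whiten        # generally not diagonal yet
    # Stage 2: rotate with the eigenvectors of K (orthogonal, so I stays I)
    lam, U2 = np.linalg.eigh(K)
    W = Whiten @ U2
    return W, lam

# quick check on random SPD / symmetric matrices
rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4)); Sigma1 = A @ A.T + 4 * np.eye(4)
B = rng.normal(size=(4, 4)); Sigma2 = B + B.T
W, lam = simultaneous_diagonalize(Sigma1, Sigma2)
print(np.allclose(W.T @ Sigma1 @ W, np.eye(4)))
print(np.allclose(W.T @ Sigma2 @ W, np.diag(lam)))
```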
Simultaneous Diagonalization: Generalized Eigen-Decomposition

Alternatively, we can diagonalize two symmetric matrices $\Sigma_1$ and $\Sigma_2$ as
$$W^\top \Sigma_1 W = I, \qquad W^\top \Sigma_2 W = \Lambda \ \text{(diagonal)},$$
where $\Lambda$ and $W$ contain the eigenvalues and eigenvectors of $\Sigma_1^{-1} \Sigma_2$, i.e.,
$$\Sigma_1^{-1} \Sigma_2 W = W \Lambda.$$

Prove it!
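With SciPy, the single-stage route is one call: scipy.linalg.eigh(Sigma2, Sigma1) solves $\Sigma_2 w = \lambda \Sigma_1 w$ (equivalent to $\Sigma_1^{-1}\Sigma_2 W = W\Lambda$) and normalizes the eigenvectors so that $W^\top \Sigma_1 W = I$, so the illustrative check below should hold.

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 4)); Sigma1 = A @ A.T + 4 * np.eye(4)   # symmetric positive definite
B = rng.normal(size=(4, 4)); Sigma2 = B + B.T                   # symmetric

# generalized eigenproblem Sigma2 w = lambda Sigma1 w
lam, W = eigh(Sigma2, Sigma1)
print(np.allclose(W.T @ Sigma1 @ W, np.eye(4)))      # identity
print(np.allclose(W.T @ Sigma2 @ W, np.diag(lam)))   # diagonal Lambda
```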
Example: Multi-Modal Data
Alternative Expressions of S_W and S_B

Alternatively, $S_W$ and $S_B$ are expressed as
$$S_W = \frac{1}{2} \sum_{i,j} A^W_{ij} (x_i - x_j)(x_i - x_j)^\top, \qquad S_B = \frac{1}{2} \sum_{i,j} A^B_{ij} (x_i - x_j)(x_i - x_j)^\top,$$
where
$$A^W_{ij} = \begin{cases} \dfrac{1}{N_k} & \text{if } x_i \in \mathcal{C}_k \text{ and } x_j \in \mathcal{C}_k, \\ 0 & \text{if } x_i \text{ and } x_j \text{ are in different classes,} \end{cases}$$
$$A^B_{ij} = \begin{cases} \dfrac{1}{N} - \dfrac{1}{N_k} & \text{if } x_i \in \mathcal{C}_k \text{ and } x_j \in \mathcal{C}_k, \\ \dfrac{1}{N} & \text{if } x_i \text{ and } x_j \text{ are in different classes.} \end{cases}$$
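These pairwise expressions can be verified against the class-based definitions numerically; the sketch below (illustrative toy data, einsum-based accumulation) checks both identities.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 1.0, size=(n, 2))
               for m, n in zip(([0, 0], [3, 1], [1, 4]), (10, 15, 12))])
labels = np.repeat([0, 1, 2], (10, 15, 12))
N = len(X)

# class-based S_W, S_B
mu = X.mean(axis=0)
S_W = sum((X[labels == k] - X[labels == k].mean(0)).T @ (X[labels == k] - X[labels == k].mean(0))
          for k in np.unique(labels))
S_B = sum((labels == k).sum() * np.outer(X[labels == k].mean(0) - mu, X[labels == k].mean(0) - mu)
          for k in np.unique(labels))

# pairwise weights A^W and A^B
Nk = np.array([(labels == c).sum() for c in labels], dtype=float)   # N_k of each sample's class
same = labels[:, None] == labels[None, :]
A_W = np.where(same, 1.0 / Nk[None, :], 0.0)
A_B = np.where(same, 1.0 / N - 1.0 / Nk[None, :], 1.0 / N)

diff = X[:, None, :] - X[None, :, :]                                # pairwise x_i - x_j
pairwise = lambda A: 0.5 * np.einsum('ij,ijd,ije->de', A, diff, diff)
print(np.allclose(pairwise(A_W), S_W), np.allclose(pairwise(A_B), S_B))
```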
$$
\begin{aligned}
S_W &= \sum_{k=1}^{K} \sum_{x \in \mathcal{C}_k} (x - \mu_k)(x - \mu_k)^\top \\
&= \sum_{k=1}^{K} \sum_{x_j \in \mathcal{C}_k} \left( x_j - \frac{1}{N_k} \sum_{x_u \in \mathcal{C}_k} x_u \right) \left( x_j - \frac{1}{N_k} \sum_{x_v \in \mathcal{C}_k} x_v \right)^\top \\
&= \sum_{k=1}^{K} \sum_{x_j \in \mathcal{C}_k} \left( x_j x_j^\top - \frac{1}{N_k} \sum_{x_v \in \mathcal{C}_k} x_j x_v^\top - \frac{1}{N_k} \sum_{x_u \in \mathcal{C}_k} x_u x_j^\top + \frac{1}{N_k^2} \sum_{x_u \in \mathcal{C}_k} \sum_{x_v \in \mathcal{C}_k} x_u x_v^\top \right) \\
&= \sum_{k=1}^{K} \left( \sum_{x_j \in \mathcal{C}_k} x_j x_j^\top - \frac{1}{N_k} \sum_{x_u \in \mathcal{C}_k} \sum_{x_v \in \mathcal{C}_k} x_u x_v^\top \right) \\
&= \sum_{i=1}^{N} \sum_{j=1}^{N} A^W_{ij}\, x_i x_i^\top - \sum_{i=1}^{N} \sum_{j=1}^{N} A^W_{ij}\, x_i x_j^\top \\
&= \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} A^W_{ij} \left( x_i x_i^\top + x_j x_j^\top - x_i x_j^\top - x_j x_i^\top \right) \\
&= \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} A^W_{ij} (x_i - x_j)(x_i - x_j)^\top.
\end{aligned}
$$
$$
\begin{aligned}
S_B &= S_T - S_W = \sum_{i=1}^{N} (x_i - \mu)(x_i - \mu)^\top - S_W \\
&= \sum_{i=1}^{N} x_i x_i^\top - \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{N} x_i x_j^\top - S_W \\
&= \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \frac{1}{N} \left( x_i x_i^\top + x_j x_j^\top - x_i x_j^\top - x_j x_i^\top \right) - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} A^W_{ij} \left( x_i x_i^\top + x_j x_j^\top - x_i x_j^\top - x_j x_i^\top \right) \\
&= \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \left( \frac{1}{N} - A^W_{ij} \right) (x_i - x_j)(x_i - x_j)^\top.
\end{aligned}
$$
Local Within-Class and Between-Class Scatter

Given a weighted adjacency matrix $[A_{ij}]$, introduce the local within-class scatter and local between-class scatter:
$$S_W = \frac{1}{2} \sum_{i,j} A^W_{ij} (x_i - x_j)(x_i - x_j)^\top, \qquad S_B = \frac{1}{2} \sum_{i,j} A^B_{ij} (x_i - x_j)(x_i - x_j)^\top,$$
where
$$A^W_{ij} = \begin{cases} \dfrac{A_{ij}}{N_k} & \text{if } x_i \in \mathcal{C}_k \text{ and } x_j \in \mathcal{C}_k, \\ 0 & \text{if } x_i \text{ and } x_j \text{ are in different classes,} \end{cases}$$
$$A^B_{ij} = \begin{cases} A_{ij}\left( \dfrac{1}{N} - \dfrac{1}{N_k} \right) & \text{if } x_i \in \mathcal{C}_k \text{ and } x_j \in \mathcal{C}_k, \\ \dfrac{1}{N} & \text{if } x_i \text{ and } x_j \text{ are in different classes.} \end{cases}$$
Local Fisher Discriminant Analysis (LFDA)

Proposed by M. Sugiyama (ICML-2006).

LFDA seeks $K-1$ discriminant functions $W$ such that $y = Wx$:
$$\arg\max_W \operatorname{tr}\left\{ \left( W S_W W^\top \right)^{-1} \left( W S_B W^\top \right) \right\},$$
with the local within-class scatter matrix
$$S_W = \frac{1}{2} \sum_{i,j} A^W_{ij} (x_i - x_j)(x_i - x_j)^\top,$$
and the local between-class scatter matrix
$$S_B = \frac{1}{2} \sum_{i,j} A^B_{ij} (x_i - x_j)(x_i - x_j)^\top.$$
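A sketch of an LFDA-style embedding under a simplifying assumption: the affinity $A_{ij}$ below is a plain Gaussian kernel with a global bandwidth, whereas Sugiyama's paper uses a local-scaling affinity; the function name and toy data are illustrative only.

```python
import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import cdist

def lfda(X, labels, n_components, sigma=1.0):
    """LFDA-style embedding via the generalized eigenproblem on local scatter matrices.
    Affinity: A_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)) (global bandwidth, a simplification)."""
    N, D = X.shape
    A = np.exp(-cdist(X, X, 'sqeuclidean') / (2 * sigma ** 2))
    same = labels[:, None] == labels[None, :]
    Nk = np.array([(labels == c).sum() for c in labels], dtype=float)
    A_W = np.where(same, A / Nk[None, :], 0.0)
    A_B = np.where(same, A * (1.0 / N - 1.0 / Nk[None, :]), 1.0 / N)

    def pairwise_scatter(Aij):
        # 1/2 sum_ij A_ij (x_i - x_j)(x_i - x_j)^T, written in matrix form
        return 0.5 * (X.T @ (np.diag(Aij.sum(1)) + np.diag(Aij.sum(0)) - Aij - Aij.T) @ X)

    S_W = pairwise_scatter(A_W)
    S_B = pairwise_scatter(A_B)
    evals, evecs = eigh(S_B, S_W)              # generalized eigenproblem S_B w = lambda S_W w
    return evecs[:, ::-1][:, :n_components]    # columns are the top discriminant directions

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 1.0, size=(40, 4)) for m in ([0]*4, [3, 0, 0, 0], [0, 3, 0, 0])])
labels = np.repeat([0, 1, 2], 40)
W = lfda(X, labels, n_components=2)
Y = X @ W     # embedded data, shape (120, 2)
```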