When Fisher meets Fukunaga-Koontz: A New Look at Linear Discriminants

Sheng Zhang, Terence Sim
School of Computing, National University of Singapore
3 Science Drive 2, Singapore 117543
{zhangshe, tsim}@comp.nus.edu.sg

Abstract

The Fisher Linear Discriminant (FLD) is commonly used in pattern recognition. It finds a linear subspace that maximally separates class patterns according to Fisher's Criterion. Several methods of computing the FLD have been proposed in the literature, most of which require the calculation of the so-called scatter matrices. In this paper, we bring a fresh perspective to FLD via the Fukunaga-Koontz Transform (FKT). We do this by decomposing the whole data space into four subspaces, and show where Fisher's Criterion is maximally satisfied. We prove the relationship between FLD and FKT analytically, and propose a method of computing the most discriminative subspace. This method is based on the QR decomposition, which works even when the scatter matrices are singular, or too large to be formed. Our method is general and may be applied to different pattern recognition problems. We validate our method by experimenting on synthetic and real data.

1. Introduction

In recent years, discriminant subspace analysis has been extensively studied in computer vision and pattern recognition. It has been widely used for feature extraction and dimension reduction in face recognition [1] and text classification [5]. One popular method is the Fisher Linear Discriminant (FLD), also known as Linear Discriminant Analysis (LDA) [2, 3]. It tries to find an optimal subspace such that the separability of two classes is maximized. This is achieved by simultaneously minimizing the within-class distance and maximizing the between-class distance. To be more specific, in terms of the between-class scatter matrix S_b and the within-class scatter matrix S_w, Fisher's Criterion can be written as

    J_F(Φ) = |Φ^T S_b Φ| / |Φ^T S_w Φ|.    (1)

By maximizing the criterion J_F, the Fisher Linear Discriminant finds the subspaces in which the classes are most linearly separable. The solution [3] that maximizes J_F is a set of eigenvectors {φ_i} which must satisfy

    S_b φ = λ S_w φ.    (2)

This is called the generalized eigenvalue problem [2, 3]. The discriminant subspace is spanned by the generalized eigenvectors. The discriminability of each eigenvector is measured by the corresponding generalized eigenvalue, i.e., the most discriminant subspace corresponds to the maximal generalized eigenvalue. The generalized eigenvalue problem can be solved by matrix inversion and eigendecomposition, i.e., by applying the eigendecomposition to S_w^{-1} S_b. Unfortunately, for many applications with high-dimensional data and few training samples, such as face recognition, the scatter matrix S_w is singular because the dimension of the sample data is generally greater than the number of samples. This is known as the undersampled or small sample size problem [3, 2]. To date, a great number of methods have been proposed to circumvent the requirement that S_w be nonsingular, such as Fisherface [1] and LDA/GSVD [5]. In [1], Fisherface first applies PCA [6, 8] to reduce the dimension so that S_w becomes nonsingular, and then applies LDA. The LDA/GSVD algorithm [5] avoids the inversion of S_w by simultaneous diagonalization of the scatter matrices via the Generalized Singular Value Decomposition (GSVD). However, these methods are designed to overcome the singularity problem and do not directly relate to the generalized eigenvalue, the essential measure of discriminability.
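To make Equations (1) and (2) concrete, here is a minimal sketch (not from the paper) of solving the generalized eigenproblem with off-the-shelf tools, assuming S_w is nonsingular; the scatter matrices below are hypothetical stand-ins built from random data.

```python
import numpy as np
from scipy.linalg import eigh

# Hypothetical scatter matrices for illustration: S_b is p.s.d. and low-rank,
# S_w is positive definite -- the classical (non-undersampled) setting of Equation (2).
rng = np.random.default_rng(0)
D = 5
B = rng.normal(size=(D, 2))
W = rng.normal(size=(D, D))
S_b = B @ B.T                      # rank 2, like a 3-class between-class scatter
S_w = W @ W.T + 1e-3 * np.eye(D)   # nonsingular within-class scatter

# Generalized symmetric eigenproblem S_b phi = lambda S_w phi (eigh returns ascending eigenvalues).
lam, Phi = eigh(S_b, S_w)
order = np.argsort(lam)[::-1]      # most discriminative directions first
lam, Phi = lam[order], Phi[:, order]
print(lam)                         # only rank(S_b) eigenvalues are nonzero
# Each column phi of Phi satisfies S_b @ phi ~= lam * S_w @ phi, i.e., Equation (2).
```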
In this paper, we propose to apply the Fukunaga-Koontz Transform (FKT) [3] to the LDA problem. Based on the eigenvalue ratio of the FKT, we decompose the whole data space into four subspaces. Then our theoretical analysis shows the relationship between LDA, FKT and GSVD. Our work has three main contributions:

1. We present a unifying framework to understand different methods, namely, LDA, FKT and GSVD. To be more specific, for the LDA problem, GSVD is equivalent to the FKT; and the generalized eigenvalue of LDA is equal to both the eigenvalue ratio of the FKT and the square of the generalized singular value of GSVD.

2. The proposed theory is useful for pattern recognition. Our theoretical analysis demonstrates how to choose the best subspaces for maximum discriminability. Experiments on synthetic and real data validate our theory.

3. In connecting the FKT with LDA, we show how the FKT, originally meant for 2-class problems, can be generalized to handle multiple classes.

The rest of this paper is organized as follows: Section 2 briefly reviews the mathematical background for LDA, GSVD, and the FKT. In Section 3, we first analyze the discriminant subspace of LDA based on the FKT, then set up the connections between LDA, the FKT and GSVD. We apply our theory to classification problems on synthetic and real data in Section 4, and conclude the paper in Section 5.

2. Mathematical Background

Notations. Let A = {a_1, ..., a_N}, a_i ∈ R^D, denote a data set of given D-dimensional vectors. Each data point belongs to exactly one of C object classes {L_1, ..., L_C}. The number of vectors in class L_i is denoted by N_i, so N = Σ_i N_i. Observe that for high-dimensional data, e.g., face images, generally C ≪ N ≪ D. The between-class scatter matrix S_b, the within-class scatter matrix S_w, and the total scatter matrix S_t are defined as follows:

    S_b = Σ_{i=1}^{C} N_i (m_i − m)(m_i − m)^T = H_b H_b^T    (3)
    S_w = Σ_{i=1}^{C} Σ_{a ∈ L_i} (a − m_i)(a − m_i)^T = H_w H_w^T    (4)
    S_t = Σ_{i=1}^{N} (a_i − m)(a_i − m)^T = H_t H_t^T    (5)
    S_t = S_b + S_w    (6)

Here m_i denotes the class mean and m is the global mean of A. The matrices H_b ∈ R^{D×C}, H_w ∈ R^{D×N}, and H_t ∈ R^{D×N} are the precursor matrices of the between-class, within-class and total scatter matrices, respectively. Let us denote the rank of each scatter matrix by r_w = Rank(S_w), r_b = Rank(S_b), and r_t = Rank(S_t). Note that for high-dimensional data (D ≫ N), r_b < C, r_w < N and r_t < N.

2.1. Linear Discriminant Analysis

Given the data matrix A, which can be divided into C classes, we try to find a linear transformation matrix Φ ∈ R^{D×d}, where d < D. It maps high-dimensional data to a low-dimensional space. From the perspective of pattern classification, LDA aims to find the optimal transformation Φ such that the projected data are well separated. Usually, two types of criteria are used to measure the separability of classes [3]. One type gives an upper bound on the Bayes error, e.g., the Bhattacharyya distance. The other type is based on scatter matrices. As shown in Equation 1, Fisher's criterion belongs to the latter type. The solution of the criterion is given by the generalized eigenvectors and eigenvalues (see Equation 2). As mentioned above, if S_w is nonsingular the problem can be solved by the ordinary eigendecomposition S_w^{-1} S_b φ = λ φ. Otherwise, S_w is singular and we have to circumvent the nonsingularity requirement, for example via LDA/GSVD [5].

2.2. Generalized SVD

The Generalized Singular Value Decomposition (GSVD) was developed by Van Loan et al. [4]. We briefly review the mechanism of GSVD using LDA as an example. Howland et al. [5] extended the applicability of LDA to the case when S_w is singular. This is done by simultaneous diagonalization of the scatter matrices via the GSVD [4]. First, to reduce the computational load, H_b and H_w are used instead of S_b and S_w.
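As a quick illustration of the precursor matrices in Equations (3)-(5), the following sketch (my own, with randomly generated data; not code from the paper) builds H_b, H_w and H_t directly and checks the identity S_t = S_b + S_w of Equation (6).

```python
import numpy as np

def precursor_matrices(X, y):
    """Build H_b (D x C), H_w (D x N), H_t (D x N) following Equations (3)-(5).

    X: (N, D) data matrix with rows as samples, y: (N,) class labels.
    """
    m = X.mean(axis=0)
    Hb_cols, Hw_cols = [], []
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Hb_cols.append(np.sqrt(len(Xc)) * (mc - m))   # sqrt(N_i) (m_i - m)
        Hw_cols.append((Xc - mc).T)                   # columns a - m_i for a in L_i
    H_b = np.column_stack(Hb_cols)                    # D x C
    H_w = np.hstack(Hw_cols)                          # D x N
    H_t = (X - m).T                                   # D x N
    return H_b, H_w, H_t

# Small random example (shapes are illustrative only).
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 5))
y = rng.integers(0, 3, size=30)
H_b, H_w, H_t = precursor_matrices(X, y)
S_b, S_w, S_t = H_b @ H_b.T, H_w @ H_w.T, H_t @ H_t.T
print(np.allclose(S_t, S_b + S_w))                    # Equation (6): True
```

The sqrt(N_i) weighting of the columns of H_b is what makes H_b H_b^T reproduce S_b exactly.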
Then, based on the GSVD, there exist orthogonal matrices Y ∈ R^{C×C} and Z ∈ R^{N×N}, and a nonsingular matrix X ∈ R^{D×D}, such that

    Y^T H_b^T X = [Σ_b, 0],    (7)
    Z^T H_w^T X = [Σ_w, 0],    (8)

where Σ_b = diag(I_b, D_b, O_b) and Σ_w = diag(O_w, D_w, I_w). Since Y, Z are orthogonal matrices and X is nonsingular, from Equations 7, 8 we obtain

    H_b^T = Y [Σ_b, 0] X^{-1},    (9)
    H_w^T = Z [Σ_w, 0] X^{-1}.    (10)

The matrices I_b ∈ R^{(r_t−r_w)×(r_t−r_w)} and I_w ∈ R^{(r_t−r_b)×(r_t−r_b)} are identity matrices, O_b ∈ R^{(C−r_b)×(r_t−r_b)} and O_w ∈ R^{(N−r_w)×(r_t−r_w)} are rectangular zero matrices which may have no rows or no columns, and D_b = diag(α_{r_t−r_w+1}, ..., α_{r_b}) and D_w = diag(β_{r_t−r_w+1}, ..., β_{r_b}) satisfy 1 > α_{r_t−r_w+1} ≥ ... ≥ α_{r_b} > 0, 0 < β_{r_t−r_w+1} ≤ ... ≤ β_{r_b} < 1, and α_i² + β_i² = 1. Thus,

    Σ_b^T Σ_b + Σ_w^T Σ_w = I,    (11)

where I ∈ R^{r_t×r_t} is an identity matrix. The columns of X, the generalized singular vectors of the matrix pair (H_b, H_w), can be used as the discriminant feature subspace based on GSVD.

2.3. Fukunaga-Koontz Transform

The FKT was designed for the 2-class recognition problem. Given data matrices A_1 and A_2 from two classes, the autocorrelation matrices S_1 = A_1 A_1^T and S_2 = A_2 A_2^T are positive semi-definite (p.s.d.) and symmetric. The sum of these two matrices is still p.s.d. and symmetric, and can be factorized in the form

    S = S_1 + S_2 = [U, U⊥] [D 0; 0 0] [U, U⊥]^T.    (12)

Without loss of generality, S may be singular with r = Rank(S) < D; thus D = diag{λ_1, ..., λ_r}, λ_1 ≥ ... ≥ λ_r > 0. U ∈ R^{D×r} is the set of eigenvectors corresponding to the nonzero eigenvalues, and U⊥ ∈ R^{D×(D−r)} is the orthogonal complement of U. Now we can whiten S by the transformation operator P = U D^{-1/2}. The sum of the two matrices S_1, S_2 becomes

    P^T S P = P^T (S_1 + S_2) P = S̃_1 + S̃_2 = I,    (13)

where S̃_1 = P^T S_1 P, S̃_2 = P^T S_2 P, and I ∈ R^{r×r} is an identity matrix. Suppose an eigenvector of S̃_1 is v with eigenvalue λ_1, that is, S̃_1 v = λ_1 v. Since S̃_1 = I − S̃_2, we can rewrite this as

    (I − S̃_2) v = λ_1 v,    (14)
    S̃_2 v = (1 − λ_1) v.    (15)

This means that S̃_2 has the same eigenvector as S̃_1, but the corresponding eigenvalue is λ_2 = 1 − λ_1. Consequently, the dominant eigenvector of S̃_1 is the weakest eigenvector of S̃_2, and vice versa.

3. Theory

In this section, we first apply the FKT to S_b and S_w, which decomposes the whole data space S_t into four subspaces. We then explain the relationship between LDA, the FKT and GSVD based on the generalized eigenvalue. This gives insight into the different discriminant subspace analyses. In this paper we focus on linear discriminant subspace analysis, but our approach can easily be extended to nonlinear discriminant subspace analysis; for example, based on the kernel method, we can apply the FKT to kernelized S_b and kernelized S_w.

Figure 1. The whole data space is decomposed into four subspaces via the FKT. In U⊥, the null space of S_t, there is no discriminant information. In U, the sum λ_b + λ_w is equal to 1. Note that we represent all possible subspaces, but in practice some of these subspaces may not exist.

3.1. LDA/FKT

Generally speaking, the LDA problem involves more than 2 classes. To handle multiple classes, we replace the autocorrelation matrices S_1 and S_2 with the scatter matrices S_b and S_w. Since S_b, S_w and S_t are p.s.d. and symmetric, and S_t = S_b + S_w, we can apply the FKT to S_b, S_w and S_t, which we shall henceforth call LDA/FKT. The whole data space is first decomposed into two subspaces, U and U⊥ (see Fig. 1). On the one hand, U⊥ is the set of eigenvectors corresponding to the zero eigenvalues of S_t. It is the intersection of the null spaces of S_b and S_w, and contains no discriminant information. On the other hand, U is the set of eigenvectors corresponding to the nonzero eigenvalues of S_t; it contains the discriminant information. Based on the FKT, S̃_b = P^T S_b P and S̃_w = P^T S_w P share the same eigenvectors, and the two eigenvalues corresponding to the same eigenvector sum to 1:

    S̃_b = V Λ_b V^T,    (16)
    S̃_w = V Λ_w V^T,    (17)
    I = Λ_b + Λ_w,    (18)

where V ∈ R^{r_t×r_t} is the orthogonal eigenvector matrix and Λ_b, Λ_w ∈ R^{r_t×r_t} are diagonal eigenvalue matrices. According to the eigenvalue ratio λ_b/λ_w, U can be further decomposed into three subspaces. To keep the integrity of the whole data space, we incorporate U⊥ as the fourth subspace (see Fig. 1). A short numerical sketch of this whitening step is given below, followed by the four subspaces.
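The following sketch (illustrative only; random data and shapes are assumptions, not the paper's setup) verifies Equations (16)-(18) numerically: whiten S_t with P = U D^{-1/2} and check that the eigenvalues of P^T S_b P and P^T S_w P sum to 1.

```python
import numpy as np

# Synthetic undersampled data: D = 100 > N = 60 samples, 3 classes (made-up parameters).
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(mu, 1.0, size=(20, 100)) for mu in (0.0, 1.0, 2.0)])
y = np.repeat([0, 1, 2], 20)

# Scatter matrices; S_w is singular here because D = 100 > N - C = 57.
m = X.mean(axis=0)
S_b = sum(len(X[y == c]) * np.outer(X[y == c].mean(0) - m, X[y == c].mean(0) - m) for c in np.unique(y))
S_w = sum((X[y == c] - X[y == c].mean(0)).T @ (X[y == c] - X[y == c].mean(0)) for c in np.unique(y))
S_t = S_b + S_w

# Whitening operator P = U D^{-1/2} built from the nonzero eigenpairs of S_t.
evals, evecs = np.linalg.eigh(S_t)
keep = evals > 1e-10 * evals.max()
U, D = evecs[:, keep], evals[keep]
P = U / np.sqrt(D)                       # same as U @ diag(D^{-1/2})

lam_b, V = np.linalg.eigh(P.T @ S_b @ P) # Equation (16)
lam_w = np.diag(V.T @ (P.T @ S_w @ P) @ V)
print(np.allclose(lam_b + lam_w, 1.0))   # Equation (18): Lambda_b + Lambda_w = I
```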
1. Subspace 1: the span of the eigenvectors {v_i} corresponding to λ_w = 0, λ_b = 1. Since λ_b = 1 and λ_w = 0, in this subspace the eigenvalue ratio λ_b/λ_w is maximized.

2. Subspace 2: the span of the eigenvectors {v_i} corresponding to 0 < λ_w < 1 and 0 < λ_b < 1. Since 0 < λ_b < 1, the eigenvalue ratio is smaller than that of Subspace 1.

3. Subspace 3: the span of the eigenvectors {v_i} corresponding to λ_w = 1 and λ_b = 0. Since λ_b = 0, the eigenvalue ratio is minimal.

4. Subspace 4: U⊥, the span of the eigenvectors corresponding to the zero eigenvalues of S_t.

Note that in practice any of these four subspaces may fail to exist, depending on the ranks of S_b, S_w and S_t. For example, if r_t = r_w + r_b, then Subspace 2 does not exist. As illustrated in Fig. 1, the null space of S_w is the union of Subspaces 1 and 4, and the null space of S_b is the union of Subspaces 3 and 4, if they exist.

3.2. Relationship between LDA, GSVD and the FKT

How do these four subspaces help to maximize the Fisher criterion J_F? We explain this in the following Theorem, which connects the generalized eigenvalue of J_F to the eigenvalues of the FKT. We begin with a Lemma.

Lemma 1. For the LDA problem, GSVD is equivalent to the FKT, with X = [U D^{-1/2} V, U⊥], Λ_b = Σ_b^T Σ_b and Λ_w = Σ_w^T Σ_w, where X, Σ_b and Σ_w are the matrices from GSVD (see Equations 7, 8), and U, D, V, U⊥, Λ_w and Λ_b are the matrices from the FKT (see Equations 16, 17, 18).

Proof. (GSVD ⇒ FKT.) Based on GSVD (here X^{-T} denotes (X^T)^{-1}),

    S_b = H_b H_b^T = X^{-T} [Σ_b^T Σ_b 0; 0 0] X^{-1},    (19)
    S_w = H_w H_w^T = X^{-T} [Σ_w^T Σ_w 0; 0 0] X^{-1}.    (20)

Thus,

    X^T (S_b + S_w) X = [I 0; 0 0].    (21)

Since Σ_b^T Σ_b + Σ_w^T Σ_w = I ∈ R^{r_t×r_t}, if we choose the first r_t columns of X as P, i.e., P = X(:, 1:r_t), then P^T (S_b + S_w) P = I. This is exactly the outcome of the FKT. Meanwhile, we reach the conclusion that Λ_b = Σ_b^T Σ_b and Λ_w = Σ_w^T Σ_w.

(FKT ⇒ GSVD.) Based on the FKT, P = U D^{-1/2}, and

    S̃_b = P^T S_b P = D^{-1/2} U^T H_b H_b^T U D^{-1/2},    (22)
    S̃_b = V Λ_b V^T.    (23)

Hence,

    D^{-1/2} U^T H_b H_b^T U D^{-1/2} = V Λ_b V^T.    (24)

In general, there is no unique decomposition in the above equation, because H_b H_b^T = H_b Y Y^T H_b^T for any orthogonal matrix Y ∈ R^{C×C}. That is,

    D^{-1/2} U^T H_b Y Y^T H_b^T U D^{-1/2} = V Λ_b V^T,    (25)
    Y^T H_b^T U D^{-1/2} = Σ_b V^T,    (26)
    Y^T H_b^T U D^{-1/2} V = Σ_b,    (27)

where Σ_b ∈ R^{C×r_t} and Λ_b = Σ_b^T Σ_b. If we define X = [U D^{-1/2} V, U⊥] ∈ R^{D×D}, then

    Y^T H_b^T X = Y^T H_b^T [U D^{-1/2} V, U⊥]    (28)
               = [Y^T H_b^T U D^{-1/2} V, 0]    (29)
               = [Σ_b, 0].    (30)

Here H_b^T U⊥ = 0 and H_w^T U⊥ = 0, because U⊥ is the intersection of the null spaces of S_b and S_w. Similarly, we obtain Z^T H_w^T X = [Σ_w, 0], where Z ∈ R^{N×N} is an arbitrary orthogonal matrix, Σ_w ∈ R^{N×r_t}, and Λ_w = Σ_w^T Σ_w. Since Λ_b + Λ_w = I with I ∈ R^{r_t×r_t} an identity matrix, it is easy to check that Σ_b^T Σ_b + Σ_w^T Σ_w = I, which satisfies the constraint of GSVD. It remains to prove that X is nonsingular:

    X X^T = [U D^{-1/2} V, U⊥] [U D^{-1/2} V, U⊥]^T
          = U D^{-1} U^T + U⊥ U⊥^T
          = [U, U⊥] [D^{-1} 0; 0 I] [U, U⊥]^T,    (31)

where V ∈ R^{r_t×r_t} and [U, U⊥] are orthogonal matrices (note that U^T U⊥ = 0 and U⊥^T U = 0). Thus X X^T can be eigendecomposed with strictly positive eigenvalues, which means X is nonsingular. ∎

Based on the above Lemma, we can investigate the relationship between the eigenvalue ratio of the FKT and the generalized eigenvalue λ of the Fisher criterion J_F.

Theorem 1. If λ is a solution of Equation 2 (a generalized eigenvalue of S_b and S_w), and λ_b and λ_w are the corresponding eigenvalues after applying the FKT to S_b and S_w, then λ = λ_b/λ_w, where λ_b + λ_w = 1.

Proof. Based on GSVD, it is easy to verify that

    S_b = H_b H_b^T = X^{-T} [Σ_b^T Σ_b 0; 0 0] X^{-1}.    (32)

According to our Lemma, Λ_b = Σ_b^T Σ_b; thus

    S_b = X^{-T} [Λ_b 0; 0 0] X^{-1}.    (33)

Similarly, we can rewrite S_w with Λ_w. Since S_b φ = λ S_w φ,

    X^{-T} [Λ_b 0; 0 0] X^{-1} φ = λ X^{-T} [Λ_w 0; 0 0] X^{-1} φ.    (34)

Let v = X^{-1} φ and multiply both sides by X^T to obtain

    [Λ_b 0; 0 0] v = λ [Λ_w 0; 0 0] v.    (35)

If we add λ [Λ_b 0; 0 0] v to both sides of Equation 35, then

    (1 + λ) [Λ_b 0; 0 0] v = λ [I 0; 0 0] v.    (36)

This means that (1 + λ) λ_b = λ, which can be rewritten as λ_b = λ(1 − λ_b) = λ λ_w because λ_b + λ_w = 1. Hence λ = λ_b/λ_w. ∎

Corollary 1. If λ is the generalized eigenvalue of S_b and S_w, α and β are the solutions of Equations 7, 8, and α/β is the generalized singular value of the matrix pair (H_b, H_w), then λ = α²/β², where α² + β² = 1.

Proof. In Lemma 1 we proved that Λ_b = Σ_b^T Σ_b and Λ_w = Σ_w^T Σ_w, that is, λ_b = α² and λ_w = β². According to Theorem 1, λ = λ_b/λ_w; therefore λ = α²/β². ∎

Note that α/β is the generalized singular value of (H_b, H_w) given by GSVD, while λ is the generalized eigenvalue of (S_b, S_w). The Corollary suggests how to evaluate the discriminant subspaces of LDA/GSVD. In fact, Howland et al. [5] applied the Corollary implicitly; in this paper we explicitly connect the generalized singular value α/β with λ, the measure of discriminability. Based on our analysis, the eigenvalue ratio λ_b/λ_w and the square of the generalized singular value, α²/β², are both equal to the generalized eigenvalue λ, the measure of discriminability. Therefore, according to Fig. 1, Subspace 1, with its infinite eigenvalue ratio λ_b/λ_w, is the most discriminant subspace, followed by Subspace 2 and Subspace 3. Subspace 4, which contains no discriminant information, can be safely discarded.

3.3. Algorithm

Although we proved that the FKT is equivalent to GSVD for the LDA problem, LDA/GSVD is computationally expensive: its running time is O(D(N + C)²), where D is the dimensionality, N is the number of training samples and C is the number of classes. In this paper, we propose an efficient way to compute Subspaces 1, 2 and 3 of LDA/FKT based on the QR decomposition, since Subspace 4 contains no discriminant information. Moreover, we use the smaller matrices H_b and H_t, because the matrices S_b, S_w and S_t may be too large to be formed. Our LDA/FKT algorithm is shown in Fig. 2; its running time is O(DN²).

Figure 2. Algorithm: apply the QR decomposition to compute LDA/FKT.
Input: the data matrix A. Output: a projection matrix Φ_F such that J_F is maximized.
1. Compute H_b and H_t from the data matrix A: H_b = [√N_1 (m_1 − m), ..., √N_C (m_C − m)], H_t = [a_1 − m, ..., a_N − m].
2. Apply the QR decomposition H_t = QR, where Q ∈ R^{D×r_t}, R ∈ R^{r_t×N} and r_t = Rank(H_t).
3. Let S̃_t = R R^T, since S̃_t = Q^T S_t Q = Q^T H_t H_t^T Q = R R^T.
4. Let Z = Q^T H_b.
5. Let S̃_b = Z Z^T, since S̃_b = Q^T S_b Q = Q^T H_b H_b^T Q = Z Z^T.
6. Compute the eigenvectors {v_i} and eigenvalues {λ_i} of S̃_t^{-1} S̃_b.
7. Sort the eigenvectors v_i by λ_i in decreasing order: λ_1 ≥ λ_2 ≥ ... ≥ λ_k > λ_{k+1} = ... = λ_{r_t} = 0.
8. The final projection matrix is Φ_F = QV, where V = [v_1, ..., v_{r_t}]. Note that we can choose the subspaces based on the columns of V.

4. Experiments

4.1. A Toy Problem

First, we experiment on synthetic data: three single-Gaussian classes with the same covariance matrix and different means. In 3D space, the three classes share the same diagonal covariance matrix and are centred at three different means (see Fig. 3(a)).
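The sketch below (my own illustration, not the authors' code) runs the steps of Fig. 2 with NumPy/SciPy on synthetic three-class Gaussian data; the means, covariance and sample counts are made up for illustration and chosen so that S_w is singular in one direction, so Subspace 1 exists.

```python
import numpy as np
from scipy.linalg import qr, eigh

def lda_fkt_qr(X, y, tol=1e-10):
    """QR-based LDA/FKT sketch following the steps of Fig. 2.

    X: (N, D) samples, y: (N,) labels. Returns eigenvalues and the projection Phi_F = Q V.
    """
    classes = np.unique(y)
    m = X.mean(axis=0)
    # Step 1: precursor matrices H_b (D x C) and H_t (D x N).
    H_b = np.column_stack([np.sqrt((y == c).sum()) * (X[y == c].mean(0) - m) for c in classes])
    H_t = (X - m).T
    # Step 2: rank-revealing (pivoted) QR of H_t, truncated to r_t = Rank(H_t).
    Q, R, _ = qr(H_t, mode='economic', pivoting=True)
    r_t = int((np.abs(np.diag(R)) > tol * np.abs(R[0, 0])).sum())
    Q, R = Q[:, :r_t], R[:r_t, :]
    # Steps 3-5: reduced scatter matrices in the r_t-dimensional range of H_t.
    St_r = R @ R.T                  # = Q^T S_t Q
    Z = Q.T @ H_b
    Sb_r = Z @ Z.T                  # = Q^T S_b Q
    # Steps 6-7: eigenpairs of St_r^{-1} Sb_r, sorted in decreasing order.
    lam, V = eigh(Sb_r, St_r)
    order = np.argsort(lam)[::-1]
    lam, V = lam[order], V[:, order]
    # Step 8: map the eigenvectors back to the original D-dimensional space.
    return lam, Q @ V

# Toy run: three 3-D Gaussians with a shared singular covariance (illustrative values only).
rng = np.random.default_rng(3)
means = np.array([[0., 0., 0.], [0., 0., 2.], [0., 1., 0.]])
cov = np.diag([1., 1., 0.])         # no within-class variance along the third axis
X = np.vstack([rng.multivariate_normal(mu, cov, size=100) for mu in means])
y = np.repeat([0, 1, 2], 100)
lam, Phi_F = lda_fkt_qr(X, y)
print(lam)   # roughly [1, something in (0,1), 0]: Subspaces 1, 2 and 3, one eigenvector each
```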

Table 1. Eigenvalues and eigenvalue ratios of the toy problem, comparing LDA/FKT (λ_b/λ_w), LDA/GSVD (α²/β²) and LDA (λ) for each subspace.

As shown in Fig. 4, LDA/FKT decomposes the whole data space into three subspaces: 1, 2 and 3. Here Subspace 4 does not exist because the number of samples is larger than the dimension (3). Note that each subspace consists of only one eigenvector. In terms of the eigenvalue ratio, Subspace 1 is the most discriminative subspace, followed by Subspaces 2 and 3. But if we use Fisherface (PCA followed by LDA), we cannot obtain Subspace 1, because Fisherface is restricted to Subspaces 2 and 3, where S_w is invertible. In doing so, Fisherface discards the most discriminative information (Subspace 1). To compare LDA/FKT with Fisherface, we project the original 3D data onto Subspaces 1 and 2, and we also project onto a 2D space via Fisherface. As illustrated in Fig. 3(b) and 3(c), the 2D projection of the FKT is more separable than the projection of Fisherface. This is consistent with our theory.

Let us recall the Corollary: the eigenvalue ratio of LDA/FKT is equal to the square of the generalized singular value of LDA/GSVD. To check this, we also apply GSVD to H_b and H_w, which is known as LDA/GSVD [5], and obtain the generalized singular values α/β. From Table 1, we see that α²/β² ≈ λ_b/λ_w (the difference depends on the machine precision). Thus we validate our theory experimentally. Moreover, this means we can compute the generalized eigenvalue λ of LDA even though it cannot be obtained directly when S_w is singular.

Figure 3. Original 3 classes (triangle, circle and cross) in 3D space (a), and 2D projections by (b) LDA/FKT and (c) Fisherface. Observe that the 2D projection of the FKT is more separable than that of Fisherface.

4.2. Face Recognition

We now apply LDA/FKT to a real problem: face recognition. We perform experiments on the CMU-PIE [7] face dataset. The CMU-PIE database consists of over 40,000 images of 68 subjects. Each subject was recorded across 13 different poses, under 43 different illuminations (with and without background lighting), and with 4 different expressions. Here we use a subset of PIE for our experiments. We choose 67 subjects (in CMU-PIE there are 68 subjects in total; we use only 67 because one subject has fewer frontal face images), and each subject has 24 frontal face images with background lighting. All of these face images are aligned based on eye coordinates and cropped to a common size. Fig. 5 shows a sample of the PIE face images used in our experiments. It is easy to see that the major challenge on this data set is face recognition under different illuminations.

To evaluate the performance of LDA/FKT, we compare it with PCA and Fisherface. We employ 1-NN classification after projection onto the respective subspaces. For PCA, the projected subspace contains 66 dimensions; for Fisherface, we keep only the leading principal components needed to make S_w nonsingular, followed by LDA. For LDA/FKT, we project the data onto Subspaces 1 and 2. In face recognition we usually face an undersampled problem, which is also the reason for the singularity of S_w. To evaluate performance in this situation, we randomly choose n training samples from each subject, and the remaining images are used for testing.
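A sketch of this kind of evaluation protocol (random per-class splits followed by 1-NN classification in the projected space) is shown below; it reuses the hypothetical lda_fkt_qr helper from the earlier sketch and stands in for whatever pipeline the authors actually used.

```python
import numpy as np

def evaluate_split(X, y, n_train, project, rng):
    """One random split: n_train samples per class for training, the rest for testing,
    followed by 1-NN classification in the projected subspace."""
    train_idx, test_idx = [], []
    for c in np.unique(y):
        idx = rng.permutation(np.where(y == c)[0])
        train_idx.extend(idx[:n_train])
        test_idx.extend(idx[n_train:])
    train_idx, test_idx = np.array(train_idx), np.array(test_idx)

    _, Phi = project(X[train_idx], y[train_idx])     # e.g. lda_fkt_qr from the earlier sketch
    Ztr, Zte = X[train_idx] @ Phi, X[test_idx] @ Phi
    # 1-NN: assign each test point the label of its nearest training point.
    d2 = ((Zte[:, None, :] - Ztr[None, :, :]) ** 2).sum(-1)
    pred = y[train_idx][d2.argmin(axis=1)]
    return (pred == y[test_idx]).mean()

# Repeating the split gives the mean and standard deviation of the accuracy, e.g.:
# accs = [evaluate_split(X, y, n, lda_fkt_qr, np.random.default_rng(s)) for s in range(10)]
# print(np.mean(accs), np.std(accs))
```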
Moreover, for each set of n training samples, we repeat the sampling 10 times to compute the mean and standard deviation of the classification accuracies. As illustrated in Fig. 6, the more training samples, the better the recognition accuracy: for each method, increasing the number of training samples increases the mean recognition rate and decreases the standard deviation. Fig. 6 also shows that no matter how many training samples are used, LDA/FKT consistently outperforms PCA and Fisherface. This is easily explained by the subspaces used in each method. LDA/FKT uses Subspaces 1 and 2, which are the most discriminative. Fisherface ignores Subspace 1 and operates in Subspaces 2 and 3. PCA has no notion of discriminative subspaces at all, since it is designed for pattern representation rather than classification [2, 3]. Note that even with only 2 or 4 training samples per subject, LDA/FKT still obtains the best recognition accuracy (about 99%) with the smallest standard deviation (below 1%). This means LDA/FKT can handle the small sample size problem with high and stable performance, which is not the case for PCA or Fisherface.

Figure 4. Eigenvalue curves for the toy problem by LDA/FKT. Note that each subspace consists of only one eigenvector.
Figure 5. A sample of face images from the PIE dataset. Each subject has 24 frontal face images under different lighting conditions.
Figure 6. The accuracy curves on PIE with varying numbers of training samples per class, showing the recognition rate and standard deviation over 10 runs for PCA, Fisherface and LDA/FKT.

5. Conclusion

In this paper, we showed how the FKT provides valuable insights into LDA by exposing the four subspaces that make up the entire data space. These subspaces differ in discriminability because each has a different value of Fisher's Criterion. We also proved that the common technique of applying PCA followed by LDA (a.k.a. Fisherface) is not optimal, because it discards Subspace 1, which is the most discriminative subspace. Our result also showed how the FKT, originally proposed for 2-class problems, can be generalized to handle multiple classes; this is achieved by replacing the autocorrelation matrices S_1 and S_2 with the scatter matrices S_b and S_w. In the near future, we intend to investigate further the insights that the FKT reveals about LDA. Another interesting direction is to extend our theory to nonlinear discriminant analysis. One way is to use the kernel trick employed in SVMs, e.g., constructing kernelized between-class and within-class scatter matrices. The FKT may yet again reveal new insights into kernelized LDA.

References

[1] P. Belhumeur, J. Hespanha, and D. Kriegman. Eigenfaces vs. Fisherfaces: Recognition Using Class Specific Linear Projection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7), 1997.
[2] R. Duda, P. Hart, and D. Stork. Pattern Classification, 2nd Edition. John Wiley and Sons, 2001.
[3] K. Fukunaga. Introduction to Statistical Pattern Recognition, 2nd Edition. Academic Press, 1990.
[4] G. Golub and C. Van Loan. Matrix Computations, 3rd Edition. Johns Hopkins Univ. Press, 1996.
[5] P. Howland and H. Park. Generalizing discriminant analysis using the generalized singular value decomposition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(8):995-1006, August 2004.
[6] I. Jolliffe. Principal Component Analysis. Springer-Verlag, New York, 1986.
[7] T. Sim, S. Baker, and M. Bsat. The CMU Pose, Illumination, and Expression database. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(12):1615-1618, December 2003.
[8] M. Turk and A. Pentland. Eigenfaces for Recognition. Journal of Cognitive Neuroscience, 3(1), 1991.