Kernel quadratic discriminant analysis for small sample size problem


Pattern Recognition

Kernel quadratic discriminant analysis for small sample size problem

Jie Wang (a), K.N. Plataniotis (a), Juwei Lu (b), A.N. Venetsanopoulos (c)
(a) Department of Electrical and Computer Engineering, University of Toronto, 10 King's College Road, Toronto, Canada M5A 3G4
(b) Vidient Systems, Inc., 4000 Burton Dr., Santa Clara, CA 95054, USA
(c) Ryerson University, 360 Victoria St, Toronto, Canada M5B 2K3

Received 3 January 2007; received in revised form 3 July 2007; accepted 23 October 2007

Abstract

It is generally believed that quadratic discriminant analysis (QDA) can better fit the data in practical pattern recognition applications than the linear discriminant analysis (LDA) method, because QDA relaxes the assumption, made by LDA-based methods, that the covariance matrix of each class is identical. However, QDA still assumes that the class conditional distribution is Gaussian, which is usually not the case in many real-world applications. In this paper, a novel kernel-based QDA method is proposed to further relax the Gaussian assumption by using the kernel machine technique. The proposed method addresses complex pattern recognition problems by combining the QDA solution with the kernel machine technique, and at the same time tackles the so-called small sample size problem through a regularized estimation of the covariance matrix. Extensive experimental results indicate that the proposed method is a more sophisticated solution which outperforms many traditional kernel-based learning algorithms. © Elsevier Ltd. All rights reserved.

Keywords: Linear discriminant analysis; Quadratic discriminant analysis; Small sample size; Kernel machine technique

1. Introduction

Linear discriminant analysis (LDA) is one of the most effective feature extraction methods in statistical pattern recognition [1-5]. LDA-based methods extract the discriminant features by maximizing the so-called Fisher criterion, defined as the ratio of the between-class to the within-class scatter matrix. It has been shown that LDA can be viewed as a special case of the optimal Bayesian classifier when the class conditional distributions of the data are Gaussian with identical covariance structure [6,7]. In such cases, the resulting class boundaries are restricted to linear hyper-planes. In many practical pattern recognition problems, however, the distribution of the data is far more complicated than Gaussian, usually multi-modal and non-convex. In such cases, the performance of LDA-based techniques deteriorates dramatically [6]. Nonlinear techniques therefore appear to be more effective in handling data samples with complex distributions.

Quadratic discriminant analysis (QDA) is one of the most commonly used nonlinear techniques for pattern classification. In the QDA framework, the class conditional distribution is still assumed to be Gaussian, however with an allowance for different covariance matrices. In such cases, a more complex quadratic boundary can be formed. It is therefore reasonable to believe that QDA better fits the real data structure. However, because more free parameters have to be estimated (C covariance matrices, where C denotes the number of classes) than in an LDA-based solution (one covariance matrix), QDA is more susceptible to the so-called small sample size (SSS) problem, in which the number of training samples is smaller than or comparable to the dimensionality of the sample space [8-10,2].

(Corresponding author: J. Wang; e-mail: wjiewjie@gmail.com.)
Under the SSS scenario, reliable estimation of the system parameters, such as the within- and between-class scatter matrices in LDA-based methods, becomes a difficult task. The SSS problem is a realistic problem existing in many real-world applications such as face recognition. Under the SSS scenario, a direct application of QDA is difficult or even mathematically intractable, as the estimates of the covariance matrices of the individual classes become either ill- or poorly posed [6]. In order to apply QDA to pattern recognition problems under the SSS scenario, regularized direct QDA (denoted as RD-QDA) [6] has been proposed to tackle the SSS problem in two steps: dimensionality reduction and regularized covariance matrix estimation.

It has been shown that the discriminatory information resides in the intersection of the null space of the within-class scatter (denoted as A) and the complement of the null space of the between-class scatter (denoted as B), i.e., A ∩ B [3-5]. At the same time, no significant information, in terms of maximization of the Fisher criterion, is lost if the null space of the between-class scatter is discarded. The low-dimensional between-class subspace B is therefore first extracted. The RD-QDA method is then a QDA solution performed in B, where a regularized covariance matrix is defined for each class following the suggestions in Ref. [11]. By adjusting the regularization parameters in the estimation of the covariance matrices, a set of traditional or novel linear discriminant analysis methods can be obtained. Although RD-QDA relaxes the identical covariance assumption, it still assumes that each class follows a Gaussian distribution, which may not be the case in reality.

To further relax the Gaussian assumption, a kernel-based RD-QDA solution, denoted as KRQDA, is proposed in this paper using the kernel machine technique [12-16]. The kernel machine technique is considered the most important tool in the design of nonlinear statistical learning techniques. The premise behind it is to find a nonlinear map from the original sample space to a higher dimensional kernel feature space by using a nonlinear function. In the kernel feature space, the pattern distribution is expected to be simplified, so that better classification performance can be achieved by applying traditional linear or quadratic methodologies [15,14]. The nonlinear mapping can be performed implicitly by replacing dot products of the feature representations in the kernel space with a kernel function defined in the original sample space [15,14]. Implementing the RD-QDA method in the kernel feature space, KRQDA combines the strengths of the quadratic learning technique and the kernel machine technique. The proposed method provides a sophisticated solution to complex pattern recognition problems by producing quadratic boundaries in the kernel space, and at the same time tackles the SSS problem by incorporating the regularization settings of RD-QDA. It is not difficult to see that RD-QDA is a special case of the KRQDA method when a linear kernel is considered. In this sense, KRQDA can be viewed as a more general learning algorithm from which a set of linear or nonlinear discriminant analysis methods can be obtained by adjusting the kernel function as well as the regularization parameters. Extensive experiments have been performed, and the results indicate that the proposed method outperforms many traditional kernel-based algorithms and the RD-QDA solution.

The rest of this paper is organized as follows. Section 2 briefly reviews the RD-QDA method. Following that, the detailed derivation of RD-QDA in the kernel space is described in Section 3. In Section 4, extensive experiments are conducted on the FERET face database and other pattern classification data sets to demonstrate the effectiveness of the proposed method, followed by a conclusion drawn in Section 5.

2. Review of regularized direct quadratic discriminant analysis

In Ref. [6], a regularized direct quadratic discriminant analysis (RD-QDA) was proposed to apply the QDA solution under the SSS scenario.
The RD-QDA method consists of two main steps: dimensionality reduction and the application of regularized QDA in the reduced feature subspace. Let Z = {Z_i}, i = 1, ..., C, denote the training set, containing C classes, with each class Z_i = {z_ij}, j = 1, ..., C_i, consisting of C_i samples z_ij ∈ R^J, where R^J denotes the J-dimensional real space. Therefore, there are N = Σ_{i=1}^{C} C_i examples available in the training set in total. Let S_b and S_w be the between- and within-class scatter matrices. The low-dimensional complement space of the null space of S_b, denoted as B, is first extracted. Let V = [v_1, ..., v_M] be the M eigenvectors of S_b corresponding to its M non-zero eigenvalues λ_1, ..., λ_M, where M ≤ min(C - 1, J). The subspace B is thus spanned by V, which is further scaled as U = V Λ_b^{-1/2} so that U^T S_b U = I, where Λ_b = diag(λ_1, ..., λ_M), diag(·) denotes the diagonalization operator and I is the M × M identity matrix. In order to perform QDA in B, all training samples z_ij are first projected onto the subspace spanned by U, yielding the corresponding feature representations y_ij = U^T z_ij. The regularized sample covariance matrix of the ith class, denoted by Σ̂_i(α, γ), is then defined as follows [11]:

Σ̂_i(α, γ) = (1 - γ) Σ̂_i(α) + (γ/M) tr[Σ̂_i(α)] I,
Σ̂_i(α) = [(1 - α) S_i + α S] / C_i(α),    (1)

where (α, γ) are two regularization parameters, M is the dimensionality of B, C_i(α) = (1 - α) C_i + α N, and S_i is the scatter matrix of the ith class estimated in B, i.e., S_i = Σ_{j=1}^{C_i} (y_ij - ȳ_i)(y_ij - ȳ_i)^T with y_ij = U^T z_ij and ȳ_i = (1/C_i) Σ_{j=1}^{C_i} y_ij, and S = Σ_{i=1}^{C} S_i.

In the classification session, the test input p is projected onto the subspace B, yielding the corresponding representation q = U^T p. The class label of p is then determined by comparing the Mahalanobis distance between the input q and each class center ȳ_i, i = 1, ..., C, i.e., ID(p) = arg min_i d_i(q), where

d_i(q) = (q - ȳ_i)^T Σ̂_i(α, γ)^{-1} (q - ȳ_i) + ln|Σ̂_i(α, γ)| - 2 ln π_i,    (2)

where π_i = C_i/N is the prior probability of class i. It was shown in Ref. [6] that a series of traditional linear discriminant analysis methods can be obtained by adjusting the parameters α and γ. These include the direct nearest center method (D-NC) (α = 1 and γ = 1); the direct weighted nearest center method (D-WNC) (α = 0 and γ = 1); direct quadratic discriminant analysis (D-QDA) (α = 0 and γ = 0); and two direct linear discriminant analysis solutions, by Yu and Yang (YD-LDA) [3] (α = 1 and γ = 0) and by Lu et al. [5] (LD-LDA) (α = 1 and γ = M/(tr[S/N] + M)), respectively.
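For illustration, the two steps above can be sketched in a few lines of NumPy. This is a minimal sketch based only on Eqs. (1) and (2); all function and variable names are ours, the between-class scatter is formed with the usual C_i/N class weighting as an assumption, and no claim is made that this matches the authors' original implementation.

```python
import numpy as np

def rd_qda_fit(X_list, alpha, gamma):
    """Sketch of RD-QDA training (Eqs. (1)-(2)).
    X_list: list of C arrays, one per class, each of shape (C_i, J)."""
    counts = [len(X) for X in X_list]
    N = sum(counts)
    means = [X.mean(axis=0) for X in X_list]
    gmean = np.vstack(X_list).mean(axis=0)

    # Between-class scatter S_b and its non-null eigen-subspace B.
    Sb = sum(Ci * np.outer(m - gmean, m - gmean)
             for Ci, m in zip(counts, means)) / N
    lam, V = np.linalg.eigh(Sb)
    keep = lam > 1e-10                         # the M non-zero eigenvalues
    U = V[:, keep] / np.sqrt(lam[keep])        # U = V Lambda_b^(-1/2), so U^T S_b U = I
    M = U.shape[1]

    # Regularized class covariances of Eq. (1) in the reduced subspace.
    Y_list = [X @ U for X in X_list]
    ybars = [Y.mean(axis=0) for Y in Y_list]
    S_i = [(Y - yb).T @ (Y - yb) for Y, yb in zip(Y_list, ybars)]
    S = sum(S_i)
    Sigmas, priors = [], []
    for Ci, Si in zip(counts, S_i):
        Ci_a = (1 - alpha) * Ci + alpha * N
        Sig_a = ((1 - alpha) * Si + alpha * S) / Ci_a
        Sig = (1 - gamma) * Sig_a + (gamma / M) * np.trace(Sig_a) * np.eye(M)
        Sigmas.append(Sig)
        priors.append(Ci / N)
    return U, ybars, Sigmas, priors

def rd_qda_predict(p, U, ybars, Sigmas, priors):
    """Classify one test vector p with the Mahalanobis rule of Eq. (2)."""
    q = U.T @ p
    scores = []
    for yb, Sig, pi in zip(ybars, Sigmas, priors):
        diff = q - yb
        _, logdet = np.linalg.slogdet(Sig)
        scores.append(diff @ np.linalg.solve(Sig, diff) + logdet - 2 * np.log(pi))
    return int(np.argmin(scores))
```

Calling rd_qda_fit with, for example, (α, γ) = (1, 0) reproduces the YD-LDA special case listed above.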

3. Kernel regularized quadratic discriminant analysis

RD-QDA relaxes the assumption made by LDA-based solutions that the covariance matrix of each class is identical. It still assumes, however, that each class follows a Gaussian distribution, which is obviously not the case in many real-world pattern classification applications. Motivated by the kernel machine technique, a kernel-based RD-QDA solution, denoted as KRQDA, is proposed in this paper. Implementing the RD-QDA method in the kernel feature space, KRQDA further relaxes the Gaussian assumption by introducing the kernel machine technique and at the same time tackles the SSS problem through the regularization settings of RD-QDA.

3.1. Dimensionality reduction in the kernel space

Let Φ = [φ(z_11), ..., φ(z_{C C_C})] be the corresponding feature representations of the training samples in the kernel feature space F, obtained through a nonlinear mapping φ(·). Let K be the N × N Gram (kernel) matrix, i.e., K = (K_lh), l = 1, ..., C, h = 1, ..., C, where K_lh is a C_l × C_h sub-matrix of K composed of the samples from classes Z_l and Z_h, i.e., K_lh = (k_ij), i = 1, ..., C_l, j = 1, ..., C_h, with k_ij = k(z_li, z_hj), and k(·,·) denotes the kernel function defined in R^J. Let S̃_b be the between-class scatter matrix in F, defined as

S̃_b = (1/N) Σ_{i=1}^{C} C_i (φ̄_i - φ̄)(φ̄_i - φ̄)^T,    (3)

where φ̄_i = (1/C_i) Σ_{j=1}^{C_i} φ(z_ij) is the mean of Z_i in F and φ̄ = (1/N) Σ_{i=1}^{C} Σ_{j=1}^{C_i} φ(z_ij) is the mean of all training samples in F. Following the RD-QDA framework, the complement space of the null space of S̃_b, denoted as B̃, is first extracted. B̃ is spanned by the first M (M ≤ C - 1) eigenvectors of S̃_b, i.e., Ṽ = [ṽ_1, ..., ṽ_M], corresponding to the M largest eigenvalues. Ṽ can be obtained by solving the eigenvalue problem of S̃_b, which can be rewritten as

S̃_b = Σ_{i=1}^{C} [√(C_i/N)(φ̄_i - φ̄)][√(C_i/N)(φ̄_i - φ̄)]^T = Σ_{i=1}^{C} φ̂_i φ̂_i^T = Φ_b Φ_b^T,    (4)

where φ̂_i = √(C_i/N)(φ̄_i - φ̄) and Φ_b = [φ̂_1, ..., φ̂_C]. The size of S̃_b equals the dimensionality of the kernel feature space, which may be extremely high or even infinite, so a direct computation of its eigenvectors is impossible. Fortunately, the problem can be solved through an algebraic transform from the eigenvectors of the C × C matrix Φ_b^T Φ_b [17]. Let λ̃_{b,i} and ẽ_i be the ith eigenvalue and eigenvector of Φ_b^T Φ_b, sorted in descending eigenvalue order; since (Φ_b Φ_b^T)(Φ_b ẽ_i) = λ̃_{b,i} (Φ_b ẽ_i), it can be deduced that Φ_b ẽ_i is the ith eigenvector of S̃_b = Φ_b Φ_b^T. As shown in Ref. [17], Φ_b^T Φ_b can be expressed as

Φ_b^T Φ_b = B (A_{N×C}^T K A_{N×C} - (1/N) A_{N×C}^T K 1_{N×C} - (1/N) 1_{N×C}^T K A_{N×C} + (1/N^2) 1_{N×C}^T K 1_{N×C}) B,    (5)

where B = diag[√(C_1/N), ..., √(C_C/N)] is a C × C diagonal matrix, 1_{N×C} is an N × C matrix with all elements equal to 1, A_{N×C} = diag[a_{C_1}, ..., a_{C_C}] is an N × C block diagonal matrix, and a_{C_i} is a C_i × 1 vector with all elements equal to 1/C_i. Let Ẽ_M = [ẽ_1, ..., ẽ_M] consist of the M significant eigenvectors of Φ_b^T Φ_b corresponding to the M largest eigenvalues λ̃_{b,1} > ... > λ̃_{b,M}, and let Ṽ = Φ_b Ẽ_M. It is not difficult to derive that Ṽ^T S̃_b Ṽ = Λ̃_b, where Λ̃_b = diag[λ̃²_{b,1}, ..., λ̃²_{b,M}]. Thus, the transformation matrix Ũ such that Ũ^T S̃_b Ũ = I can be computed as

Ũ = Ṽ Λ̃_b^{-1/2},  Ṽ = Φ_b Ẽ_M.    (6)
After obtaining the low-dimensional feature subspace B̃ spanned by Ũ, all training samples are projected onto B̃ to obtain the corresponding feature representations ỹ_ij:

ỹ_ij = Ũ^T φ(z_ij) = Λ̃_b^{-1/2} Ẽ_M^T Φ_b^T φ(z_ij),    (7)

where Φ_b^T φ(z_ij) can be expressed as [17]

Φ_b^T φ(z_ij) = B (A_{N×C}^T - (1/N) 1_{N×C}^T) ν(φ(z_ij)),    (8)

where ν(φ(z_ij)) = [φ(z_11)^T φ(z_ij), φ(z_12)^T φ(z_ij), ..., φ(z_{C C_C})^T φ(z_ij)]^T can be computed implicitly through the kernel function defined in R^J, i.e., φ(z_mn)^T φ(z_ij) = k(z_mn, z_ij).

3.2. Performing QDA in B̃

In order to perform the QDA solution in B̃, the regularized sample covariance matrix of the ith class, denoted as Σ̃_i(α, γ), is defined as follows [11]:

Σ̃_i(α, γ) = (1 - γ) Σ̃_i(α) + (γ/M) tr[Σ̃_i(α)] I,
Σ̃_i(α) = [(1 - α) S̃_i + α S̃] / C_i(α),
C_i(α) = (1 - α) C_i + α N,
S̃_i = Σ_{j=1}^{C_i} (ỹ_ij - ỹ_i)(ỹ_ij - ỹ_i)^T,  S̃ = Σ_{i=1}^{C} S̃_i,    (9)

where ỹ_i = (1/C_i) Σ_{j=1}^{C_i} ỹ_ij and (α, γ) is a pair of regularization parameters.
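To make Eqs. (5)-(8) concrete, the following is a minimal NumPy sketch of the kernel-space dimensionality reduction; it works entirely with the N × N Gram matrix. The helper names (A_nc, ones_nc, B) and the Gaussian kernel are our own illustrative choices, not a reproduction of the authors' code.

```python
import numpy as np

def gaussian_kernel(X1, X2, sigma2):
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma2)

def kernel_between_class_subspace(X_list, sigma2, M=None):
    """Illustrative implementation of Eqs. (5)-(8): returns a projector that
    maps new samples into the between-class subspace of the kernel space."""
    counts = np.array([len(X) for X in X_list])
    C, N = len(X_list), counts.sum()
    X = np.vstack(X_list)                      # training samples, class by class
    K = gaussian_kernel(X, X, sigma2)          # N x N Gram matrix

    # Helper matrices of Eq. (5): A_nc (block "diagonal" columns of 1/C_i),
    # ones_nc (all ones), B = diag(sqrt(C_i / N)).
    A_nc = np.zeros((N, C))
    start = 0
    for i, Ci in enumerate(counts):
        A_nc[start:start + Ci, i] = 1.0 / Ci
        start += Ci
    ones_nc = np.ones((N, C))
    B = np.diag(np.sqrt(counts / N))

    inner = (A_nc.T @ K @ A_nc
             - A_nc.T @ K @ ones_nc / N
             - ones_nc.T @ K @ A_nc / N
             + ones_nc.T @ K @ ones_nc / N ** 2)
    PhibT_Phib = B @ inner @ B                 # the C x C matrix of Eq. (5)

    lam, E = np.linalg.eigh(PhibT_Phib)        # ascending order
    order = np.argsort(lam)[::-1]
    if M is None:
        M = int((lam > 1e-10).sum())           # at most C - 1 useful directions
    lam_M, E_M = lam[order[:M]], E[:, order[:M]]

    def project(Z):
        """Eqs. (7)-(8): project the rows of Z into the M-dimensional subspace."""
        nu = gaussian_kernel(X, Z, sigma2)                 # N x n_test
        v = B @ (A_nc.T - ones_nc.T / N) @ nu              # C x n_test, Eq. (8)
        # divide by lam (not sqrt(lam)): Lambda_b in Eq. (6) holds the squared eigenvalues
        return ((E_M.T @ v) / lam_M[:, None]).T            # n_test x M

    return project
```

Feeding the returned project function the training samples themselves yields the representations ỹ_ij used in Section 3.2.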

Table 1. Kernel-based algorithms derived from KRQDA

Algorithm   α       γ                       Σ̃_i
KNC         1       1                       (tr[S̃/N]/M) I
KWNC        0       1                       (tr[S̃_i/C_i]/M) I
KQDA        0       0                       S̃_i/C_i
KDDA_1      1       0                       S̃/N
KDDA_2      1       M/(tr[S̃/N] + M)         tr[S̃/N](S̃/N + I)/(tr[S̃/N] + M)

The key component in the calculation of Σ̃_i(α, γ) is the scatter matrix of the ith class, S̃_i, which can be expressed as

S̃_i = Σ_{j=1}^{C_i} (ỹ_ij - ỹ_i)(ỹ_ij - ỹ_i)^T = Σ_{j=1}^{C_i} ỹ_ij ỹ_ij^T - C_i ỹ_i ỹ_i^T = J_1 - J_2,    (10)

where J_1 = Σ_{j=1}^{C_i} ỹ_ij ỹ_ij^T and J_2 = C_i ỹ_i ỹ_i^T. The detailed derivations of J_1 and J_2 can be found in Appendices A and B.

In the classification session, the test image p is first projected onto the lower-dimensional between-class subspace, i.e., q̃ = Ũ^T φ(p). The identity of the test image is then determined by comparing the Mahalanobis distance between the feature representation q̃ and each class center ỹ_i, i.e., ID(p) = arg min_i d_i(q̃), where

d_i(q̃) = (q̃ - ỹ_i)^T Σ̃_i(α, γ)^{-1} (q̃ - ỹ_i) + ln|Σ̃_i(α, γ)| - 2 ln π_i,    (11)

where π_i = C_i/N. It was shown in Ref. [6] that the two parameters α (0 ≤ α ≤ 1) and γ (0 ≤ γ ≤ 1) regularize the estimate of S̃_i toward S̃ (by α) and toward a multiple of the identity matrix (by γ). Therefore, in a way similar to RD-QDA, a set of traditional or novel kernel-based discriminant learning methods can be derived by adjusting the parameters α and γ; these are summarized in Table 1. When (α = 0, γ = 0), no regularization is applied, and the proposed KRQDA is equivalent to a standard quadratic discriminant analysis solution in the kernel feature space; thus a novel kernel-based QDA method is derived, denoted as KQDA. When (α = 1, γ = 1), a strong regularization is applied so that the covariance matrix Σ̃_i becomes a multiple of the identity matrix, i.e., Σ̃_i = tr(S̃/N) I/M, and the proposed method reduces to a nearest center solution performed in the kernel space, denoted as the kernel nearest center classifier (KNC). When (α = 0, γ = 1), Σ̃_i = tr(S̃_i/C_i) I/M; in this case, the proposed method becomes a weighted nearest center classifier, denoted as KWNC, with the weight for each class defined as tr(S̃_i/C_i)/M. If (α = 1, γ = 0), it is not difficult to see that the proposed method is equivalent to the well-known kernel direct discriminant analysis [17] with the traditional Fisher criterion, Ã = arg max_Ã |Ã^T S̃_b Ã| / |Ã^T S̃_w Ã|, denoted here as KDDA_1. If the modified Fisher criterion proposed in Ref. [17] is considered instead, i.e., Ã = arg max_Ã |Ã^T S̃_b Ã| / (|Ã^T S̃_b Ã| + |Ã^T S̃_w Ã|), it can be derived that the corresponding KDDA solution, denoted as KDDA_2, is also a special case of KRQDA, obtained when (α = 1, γ = M/(tr(S̃/N) + M)).

In addition to manipulating the regularization parameters (α, γ), another flexibility of the proposed method lies in the choice of the kernel function k(·,·). Varying the kernel function results in different kernel spaces. It can be observed that when a linear kernel is used, i.e., k(z_i, z_j) = z_i^T z_j, the proposed KRQDA reduces to RD-QDA. Therefore, KRQDA is the more general solution, and RD-QDA is the special case corresponding to the linear kernel.

4. Experiments

In order to demonstrate the effectiveness of the proposed method, a set of experiments is conducted to examine the performance of the proposed KRQDA method with respect to the regularization parameters (α, γ) as well as the kernel functions.

4.1. Experiment I: performance as functions of the regularization parameters

In order to examine how the regularization parameters (α, γ) affect the performance of KRQDA under different SSS scenarios, a subset of the well-known FERET face database is used for experimentation [18,19].
The data set used in this experiment contains 960 images. It is composed of 91 subjects (i.e., 91 classes), with each subject having at least eight images. Following the FERET protocol guidelines, the following preprocessing operations are performed on each face image: (1) images are rotated and scaled so that the centers of the eyes are placed on specific pixels and the image size is normalized to 150 × 130; (2) a standard mask is applied to remove the non-face portions; (3) histogram equalization is performed and image intensity values are normalized to zero mean and unit standard deviation; (4) each image is finally represented, after the application of the mask, as a vector. Fig. 1 depicts the preprocessing procedure.

Fig. 1. Preprocessing procedure.

For each subject, L images are randomly selected to form the training set while the remaining images are treated as test samples. In order to examine the effectiveness of the proposed method in handling the SSS problem, a set of L values is tested, L ∈ {2, 3, 4, 5, 6, 7}. For each L value, the partitioning into training and test sets is repeated 10 times and the reported results are the averages over the 10 repetitions. The testing grid of (α, γ) values for the proposed KRQDA solution is defined as α ∈ [10^-4, 1] and γ ∈ [10^-4, 1], each sampled at equally spaced values, and every (α, γ) combination on this grid is examined. A Gaussian kernel is used as the kernel function for all kernel-based algorithms, and the kernel parameter σ² is set to 10^8 based on a trial-and-error method. The recognition performance is measured by the correct recognition rate (CRR).

Fig. 2. CRR of KRQDA as a function of (α, γ); top: L = 2, 3, 4; bottom: L = 5, 6, 7.

Fig. 2 depicts the CRRs with respect to the different regularization parameters (α, γ). In addition, a quantitative comparison of various kernel-based algorithms is performed. Table 2 summarizes the CRRs obtained by the various kernel-based algorithms. The KRQDA performance reported there represents the best found CRR across all (α, γ) combinations, and the optimal (α, γ) values are also listed. In addition to KRQDA and its derived algorithms (listed in Table 1), kernel principal component analysis (KPCA) [20] and generalized discriminant analysis (GDA) [21] are also implemented for comparison purposes. For KPCA and GDA, it is well known that the performance is a function of the number of extracted features; therefore, the performance reported is the best found CRR corresponding to the optimal feature number (N_opt), obtained by exhaustively searching all possible numbers.

It can be observed from Fig. 2 that when L = 2 the peak value of the CRR appears in the central area of the (α, γ) grid. As the value of L increases, the CRR peak moves toward the (0, 0) corner. A similar observation can be made from Table 2. In a severe SSS scenario, i.e., when L = 2, the best CRR of KRQDA is obtained when α and γ have large values (α = 0.64 and γ = 0.54). Correspondingly, solutions such as KNC (α = 1, γ = 1) and KDDA_2 (α = 1, γ = M/(tr(S̃/N) + M)), which include a strong regularization process, outperform less regularized solutions such as KWNC (α = 0, γ = 1), KDDA_1 (α = 1, γ = 0) and KQDA (α = 0, γ = 0).
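The evaluation protocol just described (random per-subject splits repeated several times, an exhaustive (α, γ) grid, and the best found CRR) can be sketched as follows. This is an illustrative sketch only: the split routine, grid values and the fit callable, which stands in for KRQDA training and returns a classifier, are our own placeholder names.

```python
import numpy as np

rng = np.random.default_rng(0)

def correct_recognition_rate(predict, X_test, y_test):
    return float(np.mean([predict(x) == y for x, y in zip(X_test, y_test)]))

def split_per_class(X, y, L, rng):
    """Draw L training images per class at random; the rest are test samples."""
    tr, te = [], []
    for c in np.unique(y):
        idx = rng.permutation(np.where(y == c)[0])
        tr.extend(idx[:L])
        te.extend(idx[L:])
    return np.array(tr), np.array(te)

def grid_search_crr(fit, X, y, L, alphas, gammas, repeats=10, rng=rng):
    """Best-found CRR over an exhaustive (alpha, gamma) grid, averaged over
    `repeats` random per-class splits (the protocol of Experiment I)."""
    best = (None, -1.0)
    for a in alphas:
        for g in gammas:
            crrs = []
            for _ in range(repeats):
                tr, te = split_per_class(X, y, L, rng)
                predict = fit(X[tr], y[tr], a, g)
                crrs.append(correct_recognition_rate(predict, X[te], y[te]))
            score = float(np.mean(crrs))
            if score > best[1]:
                best = ((a, g), score)
    return best
```

A concrete fit would wrap the training steps of Section 3 and the decision rule of Eq. (11); any classifier with the same interface can be plugged in.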

Table 2. CRRs (%) of the various kernel-based algorithms on the FERET set for L = 2, ..., 7 (KPCA, GDA, KNC, KWNC, KQDA, KDDA_1, KDDA_2 and KRQDA). The optimal (α, γ) values found for KRQDA are (0.64, 0.54), (0.45, 0.25), (0.05, 10^-4), (0.10, 10^-4), (0.10, 10^-4) and (0.10, 10^-4) for L = 2, ..., 7, respectively.

If sufficient samples are available, e.g., L = 7, KQDA gives better performance. This indicates that when the training sample size L is small, the SSS problem is the more critical concern. In such cases, a regularization step that replaces the individual covariance matrices Σ̃_i with a common within-class scatter matrix S̃ estimated from all training samples is an appropriate choice; although it is a biased solution, it significantly reduces the variance in the estimation of Σ̃_i, resulting in improved performance. However, when sufficient samples are available for each class, methods such as KQDA demonstrate their effectiveness in tackling complex distributions by producing quadratic boundaries in the kernel feature space; in such cases, unbiased solutions associated with small (α, γ) values are the appropriate choice. Compared to the other kernel-based algorithms, KRQDA gives the best performance in all cases, owing to the flexible regularization scheme that has been integrated.

4.2. Experiment II: performance as functions of the kernels

In addition to the regularization parameters, another flexibility of the proposed KRQDA method is the manipulation of the kernel. In this experiment, the influence of the selected kernel function is examined. In addition to the FERET face database, two additional data sets are used: the vowel data set and the vehicle data set, collected from the UCI benchmark repository [22]. The vowel data set consists of 11 classes, with each class containing 90 samples; each sample is represented by a 10-dimensional vector. The vehicle data set consists of four classes, with each class containing over 200 samples; each sample is represented by an 18-dimensional vector.

For the face data set, L ∈ {2, 4, 6} images per subject are extracted as training samples while the rest are treated as test samples. Again, the configuration of the training and test sets is repeated 10 times and the reported results are the averages over the 10 trials. For the vowel and vehicle data sets, 10-fold testing is performed: for each class, the samples are separated into 10 non-overlapping subsets, nine of which are used as training samples while the remaining subset is treated as the test set. The experiment is repeated 10 times, corresponding to the 10 different test sets, and the results reported are the averages over the 10 trials.

The testing grid of (α, γ) values for the vowel and vehicle data sets is set to α ∈ [0, 1] and γ ∈ [0, 1] (again at equally spaced values) so that the extreme cases α = 0 and γ = 0 are also tested. For the face data set, however, the test grid remains the same as in the first experiment, i.e., α ∈ [10^-4, 1] and γ ∈ [10^-4, 1]. This is because, for the face data set, (α = 0, γ = 0) triggers a computational difficulty: if (α = 0, γ = 0), it follows from Eq. (9) that Σ̃_i(α, γ) = S̃_i/C_i, which is singular. This is due to the fact that the number of training samples per subject is smaller than the dimensionality of the reduced between-class subspace, i.e., L < C - 1 = 90, where C = 91 is the number of subjects. In the recognition step, the computation of the Mahalanobis distance d_i (Eq. (11)) then becomes inapplicable.
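The kernels compared in this experiment differ only in how the entries of the Gram matrix are computed. A minimal sketch of the three kernel families is given below; the parameter names (a, b, d, σ²) follow the text, but the default values and the exact form of the Gaussian kernel (squared Euclidean distance divided by σ²) are our assumptions.

```python
import numpy as np

def linear_kernel(X1, X2):
    return X1 @ X2.T

def polynomial_kernel(X1, X2, a=1.0, b=1.0, d=3):
    # k(z_i, z_j) = (a * (z_i . z_j) + b) ** d
    return (a * (X1 @ X2.T) + b) ** d

def gaussian_kernel(X1, X2, sigma2=1.0):
    # k(z_i, z_j) = exp(-||z_i - z_j||^2 / sigma2)
    sq = (X1 ** 2).sum(1)[:, None] + (X2 ** 2).sum(1)[None, :] - 2 * X1 @ X2.T
    return np.exp(-np.maximum(sq, 0.0) / sigma2)

# Example: Gram matrices for a toy training set.
X = np.random.default_rng(1).normal(size=(6, 10))
K_lin = linear_kernel(X, X)          # with this choice, KRQDA reduces to RD-QDA
K_pol = polynomial_kernel(X, X, a=1.0, b=1.0, d=3)
K_rbf = gaussian_kernel(X, X, sigma2=50.0)
```

With linear_kernel the Gram matrix is simply the matrix of inner products in the input space, which is the sense in which KRQDA reduces to RD-QDA.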
Table 3. Best found CRR (%) of KRQDA with the Gaussian kernel using various σ² values, on the FERET set (L = 2, 4, 6) and on the vowel and vehicle sets.

Table 3 lists the best found CRRs of KRQDA with respect to different σ² values. It can be seen that different kernel σ² values result in different recognition performance. For the vowel data set, the performance differences due to the different σ² values are small. For the other two data sets, however, the differences are large: the gap between the best and worst CRRs is over 5% and 30% for the FERET and vehicle data sets, respectively. This indicates that choosing an appropriate kernel parameter is paramount for ensuring good performance in kernel-based algorithms.

Table 4 lists the best found CRRs of KRQDA with respect to different kernels, including the Gaussian kernel, the polynomial kernel and the linear kernel. For the Gaussian kernel, σ² is set to 10^8, 50 and 100 for the face, vowel and vehicle sets, respectively.

Table 4. Best found CRR (%) of KRQDA with various kernel functions on the FERET (L = 2, 4, 6), vowel and vehicle sets. The corresponding optimal (α, γ) values are:

Kernel       FERET L=2      FERET L=4      FERET L=6      Vowel    Vehicle
Gaussian     (0.64, 0.54)   (0.05, 10^-4)  (0.10, 10^-4)  (0, 0)   (0.6435, )
Polynomial   (0.40, 0.64)   (0.10, 10^-4)  (0.10, 10^-4)  (0, 0)   (0.4950, 0)
Linear       (0.74, 0.64)   (0.05, 10^-4)  (0.05, 10^-4)  (0, 0)   (0.980, 0)

For the polynomial kernel, defined as k(z_i, z_j) = (a (z_i^T z_j) + b)^d, the parameters are set to a = 10^-9 and d = 3 for the FERET face data set as suggested in Ref. [17], a = 1 and d = 3 for the vowel data set, and a = 1 and d = 0.01 for the vehicle data set. The kernel parameters are selected empirically through a trial-and-error procedure. The linear kernel is defined as k(z_i, z_j) = z_i^T z_j. It is not difficult to see that, if the linear kernel is used, the proposed KRQDA reduces to RD-QDA.

Compared to the RD-QDA solution, no significant improvement is observed for the proposed KRQDA on the FERET face data set: the performance of KRQDA with either the Gaussian or the polynomial kernel is very close to that of RD-QDA. This can be explained as follows. The superiority of the proposed KRQDA method over the traditional RD-QDA method lies in its capability of relaxing the Gaussian assumption by using the kernel machine technique. However, being a more complicated algorithm, it is more prone to overfitting, in particular in the face recognition application where only a limited number of training samples is available. In addition, as with any kernel-based algorithm, its performance depends on the selected kernel function and its associated parameters, and how to systematically choose an appropriate kernel function remains a challenging and largely unsolved problem. The kernel parameters used in this experiment are selected empirically through a trial-and-error procedure. The lack of a good kernel selection algorithm also prevents the KRQDA method from demonstrating its full potential; its superiority is therefore discounted here. For the vowel and vehicle data sets, where the SSS problem is less critical because more training samples are available, better performance is observed when using the KRQDA method with the Gaussian kernel. In order to quantify how KRQDA improves the recognition performance, we define IMP = (CRR_Gaussian - CRR_Linear)/CRR_Linear (%) to measure the relative improvement of the KRQDA method with the Gaussian kernel with respect to that with the linear kernel, i.e., RD-QDA. It can be observed that IMP_Vowel ≈ 23% and IMP_Vehicle ≈ 9%. This indicates that the KRQDA method provides better performance than the traditional RD-QDA method, in particular on the vowel data set.

In terms of computational cost, KRQDA is more time consuming than RD-QDA, as it is the more complicated algorithm. We compare the training and testing time consumed by the KRQDA method and the RD-QDA method on the FERET face data set when L = 2. The simulation is implemented on a PC with a 3.0 GHz Intel Pentium 4 processor and 2.0 GB RAM; all programs are written in Matlab v7.0 and executed under MS Windows XP. It takes 5.23 s to train the KRQDA algorithm and 3.79 s to train the RD-QDA algorithm. In terms of testing time, RD-QDA needs 0.03 s and KRQDA needs a comparable amount of time. It can be observed that KRQDA takes considerably more time than RD-QDA in the training session; the extra time is mostly spent on the calculation of the kernel matrix. Since training is an off-line operation, this higher computational cost is acceptable, while in terms of testing time KRQDA is comparable to RD-QDA.
This indicates that KRQDA is a realistic solution which can be applied in real-world applications.

5. Conclusion

In this paper, a novel kernel regularized quadratic learning algorithm, KRQDA, has been developed. The proposed method combines the strengths of the kernel machine technique and the regularized quadratic learning technique to tackle complex pattern classification problems under the SSS scenario. KRQDA is a comprehensive and general pattern recognition solution from which a set of traditional or novel, linear or nonlinear learning algorithms can be derived by adjusting the regularization parameters as well as the kernel function. Experimentation on different data sets indicates that the proposed method outperforms many traditional discriminant learning algorithms. As with many other kernel-based algorithms, different kernel functions and kernel parameters significantly affect the performance of the KRQDA method, and how to choose an appropriate kernel function remains a difficult problem in kernel machine research. In addition, systematically choosing appropriate regularization parameters α and γ is another challenging task, due to the lack of a priori knowledge regarding the underlying distribution of the patterns. These problems are left for future research.

Acknowledgments

This work is partially supported by a grant provided by the Bell University Laboratory at the University of Toronto.

Portions of the research in this paper use the FERET database of facial images collected under the FERET program. The authors would like to thank the FERET Technical Agent, the U.S. National Institute of Standards and Technology (NIST), for providing the FERET database.

Appendix A. Derivation of J_1

J_1 = Σ_{j=1}^{C_i} ỹ_ij ỹ_ij^T = Ũ^T (Σ_{j=1}^{C_i} φ(z_ij) φ(z_ij)^T) Ũ = (Ẽ_M Λ̃_b^{-1/2})^T Φ_b^T (Σ_{j=1}^{C_i} φ(z_ij) φ(z_ij)^T) Φ_b (Ẽ_M Λ̃_b^{-1/2}),    (12)

where Ẽ_M and Λ̃_b can be obtained from Eq. (5). The matrix Φ_b^T (Σ_j φ(z_ij) φ(z_ij)^T) Φ_b can be expressed element-wise as (φ̂_m^T (Σ_j φ(z_ij) φ(z_ij)^T) φ̂_n), m = 1, ..., C, n = 1, ..., C, where

φ̂_m^T (Σ_j φ(z_ij) φ(z_ij)^T) φ̂_n
= (√(C_m C_n)/N) [φ̄_m^T (Σ_j φ(z_ij) φ(z_ij)^T) φ̄_n - φ̄_m^T (Σ_j φ(z_ij) φ(z_ij)^T) φ̄ - φ̄^T (Σ_j φ(z_ij) φ(z_ij)^T) φ̄_n + φ̄^T (Σ_j φ(z_ij) φ(z_ij)^T) φ̄]    (13)
= (√(C_m C_n)/N) (I_1 - I_2 - I_3 + I_4).    (14)

Define Q = [q_{C_1}^T, ..., q_{C_C}^T]^T as an N × 1 vector in which q_{C_k} is a C_k × 1 vector with all elements equal to 0; Q_m is equal to Q except that the elements of q_{C_m} are set to 1/C_m; Q_{i,j} is equal to Q except that the jth element of q_{C_i} is set to 1; and D_i = diag[0_{C_1}, ..., 0_{C_{i-1}}, 1_{C_i}, 0_{C_{i+1}}, ..., 0_{C_C}] is an N × N diagonal matrix, where 1_{C_i} is a C_i × 1 vector with all elements equal to 1 and 0_{C_k} is a C_k × 1 vector with all elements equal to 0. Note that D_i = Σ_{j=1}^{C_i} Q_{i,j} Q_{i,j}^T. The terms I_1, ..., I_4 can then be expressed through the kernel matrix. For example,

I_1 = φ̄_m^T (Σ_{j=1}^{C_i} φ(z_ij) φ(z_ij)^T) φ̄_n
    = Σ_{j=1}^{C_i} [(1/C_m) Σ_{x=1}^{C_m} φ(z_mx)^T φ(z_ij)] [(1/C_n) Σ_{y=1}^{C_n} φ(z_ij)^T φ(z_ny)]
    = Σ_{j=1}^{C_i} (Q_m^T K Q_{i,j}) (Q_{i,j}^T K Q_n)
    = Q_m^T K D_i K Q_n,

so that, collected over m, n = 1, ..., C,

(I_1) = A_{N×C}^T K D_i K A_{N×C},
(I_2) = (1/N) A_{N×C}^T K D_i K 1_{N×C},
(I_3) = (I_2)^T = (1/N) 1_{N×C}^T K D_i K A_{N×C},
(I_4) = (1/N^2) 1_{N×C}^T K D_i K 1_{N×C}.

Therefore,

Φ_b^T (Σ_{j=1}^{C_i} φ(z_ij) φ(z_ij)^T) Φ_b = B (A_{N×C}^T K D_i K A_{N×C} - (1/N) A_{N×C}^T K D_i K 1_{N×C} - (1/N) 1_{N×C}^T K D_i K A_{N×C} + (1/N^2) 1_{N×C}^T K D_i K 1_{N×C}) B.    (15)

Appendix B. Derivation of J_2

J_2 = C_i ỹ_i ỹ_i^T = C_i Ũ^T φ̄_i φ̄_i^T Ũ = C_i (Ẽ_M Λ̃_b^{-1/2})^T Φ_b^T φ̄_i φ̄_i^T Φ_b (Ẽ_M Λ̃_b^{-1/2}),    (16)

where Φ_b^T φ̄_i φ̄_i^T Φ_b can be expressed element-wise as

Φ_b^T φ̄_i φ̄_i^T Φ_b = (φ̂_m^T φ̄_i φ̄_i^T φ̂_n), m = 1, ..., C, n = 1, ..., C,    (17)

with

φ̂_m^T φ̄_i φ̄_i^T φ̂_n = (√(C_m C_n)/N) [φ̄_m^T φ̄_i φ̄_i^T φ̄_n - φ̄_m^T φ̄_i φ̄_i^T φ̄ - φ̄^T φ̄_i φ̄_i^T φ̄_n + φ̄^T φ̄_i φ̄_i^T φ̄].    (18)

Each item in Eq. (18) can again be computed through the kernel matrix. Let Q_i be defined as Q_m in Appendix A, with all elements of the class-i block equal to 1/C_i and zero elsewhere, and let W_i = Q_i Q_i^T be the N × N block diagonal matrix whose class-i block is a C_i × C_i matrix with all elements equal to 1/C_i² and whose remaining elements are zero. Then, for example,

φ̄_m^T φ̄_i φ̄_i^T φ̄_n = (Q_m^T K Q_i)(Q_i^T K Q_n) = Q_m^T K W_i K Q_n,

and, collected over m, n = 1, ..., C,

(φ̄_m^T φ̄_i φ̄_i^T φ̄_n) = A_{N×C}^T K W_i K A_{N×C},
(φ̄_m^T φ̄_i φ̄_i^T φ̄)   = (1/N) A_{N×C}^T K W_i K 1_{N×C},
(φ̄^T φ̄_i φ̄_i^T φ̄_n)   = (1/N) 1_{N×C}^T K W_i K A_{N×C},
(φ̄^T φ̄_i φ̄_i^T φ̄)     = (1/N^2) 1_{N×C}^T K W_i K 1_{N×C}.

Therefore,

Φ_b^T φ̄_i φ̄_i^T Φ_b = B (A_{N×C}^T K W_i K A_{N×C} - (1/N) A_{N×C}^T K W_i K 1_{N×C} - (1/N) 1_{N×C}^T K W_i K A_{N×C} + (1/N^2) 1_{N×C}^T K W_i K 1_{N×C}) B.    (19)
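Both appendices reduce to products of the Gram matrix K with a handful of fixed selector matrices. The following illustrative sketch (all names are ours) builds those matrices and evaluates the bracketed expression that appears with S = D_i in Eq. (15) and with S = W_i in Eq. (19).

```python
import numpy as np

def class_blocks(counts):
    """Start/end indices of each class block in the stacked training set."""
    edges = np.concatenate(([0], np.cumsum(counts)))
    return [(edges[i], edges[i + 1]) for i in range(len(counts))]

def helper_matrices(counts):
    N, C = int(np.sum(counts)), len(counts)
    A_nc = np.zeros((N, C))                    # column i: 1/C_i on the class-i block
    for i, (s, e) in enumerate(class_blocks(counts)):
        A_nc[s:e, i] = 1.0 / counts[i]
    ones_nc = np.ones((N, C))
    B = np.diag(np.sqrt(np.asarray(counts) / N))
    return A_nc, ones_nc, B

def D_matrix(counts, i):
    """D_i of Appendix A: ones on the diagonal entries of class i, zero elsewhere."""
    N = int(np.sum(counts))
    D = np.zeros((N, N))
    s, e = class_blocks(counts)[i]
    D[s:e, s:e] = np.eye(counts[i])
    return D

def W_matrix(counts, i):
    """W_i of Appendix B: the class-i block filled with 1/C_i^2, zero elsewhere."""
    N = int(np.sum(counts))
    W = np.zeros((N, N))
    s, e = class_blocks(counts)[i]
    W[s:e, s:e] = 1.0 / counts[i] ** 2
    return W

def bracketed_product(K, S, A_nc, ones_nc, N):
    """A^T K S K A - (1/N) A^T K S K 1 - (1/N) 1^T K S K A + (1/N^2) 1^T K S K 1."""
    KSK = K @ S @ K
    return (A_nc.T @ KSK @ A_nc
            - A_nc.T @ KSK @ ones_nc / N
            - ones_nc.T @ KSK @ A_nc / N
            + ones_nc.T @ KSK @ ones_nc / N ** 2)

# Example with two classes of sizes 3 and 2 and a stand-in Gram matrix.
counts = [3, 2]
A_nc, ones_nc, B = helper_matrices(counts)
K = np.eye(5)
inner_A = bracketed_product(K, D_matrix(counts, 0), A_nc, ones_nc, 5)   # Eq. (15) bracket
inner_B = bracketed_product(K, W_matrix(counts, 0), A_nc, ones_nc, 5)   # Eq. (19) bracket
```

Sandwiching either bracket between the diagonal matrix B on both sides, as in Eqs. (15) and (19), gives the C × C matrices needed in Eqs. (12) and (16).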

References

[1] P.N. Belhumeur, J.P. Hespanha, D.J. Kriegman, Eigenfaces vs. Fisherfaces: recognition using class specific linear projection, IEEE Trans. Pattern Anal. Mach. Intell. 19 (7) (1997).
[2] L.-F. Chen, H.-Y.M. Liao, M.-T. Ko, J.-C. Lin, G.-J. Yu, A new LDA-based face recognition system which can solve the small sample size problem, Pattern Recognition 33 (2000).
[3] H. Yu, J. Yang, A direct LDA algorithm for high-dimensional data with application to face recognition, Pattern Recognition 34 (2001).
[4] J. Lu, K.N. Plataniotis, A.N. Venetsanopoulos, Regularization studies of linear discriminant analysis in small sample size scenarios with application to face recognition, Pattern Recognition Lett. 26 (2) (2005).
[5] J. Lu, K.N. Plataniotis, A.N. Venetsanopoulos, Face recognition using LDA-based algorithms, IEEE Trans. Neural Networks 14 (1) (2003).
[6] J. Lu, K.N. Plataniotis, A.N. Venetsanopoulos, Regularized discriminant analysis for the small sample size problem in face recognition, Pattern Recognition Lett. 24 (16) (2003).
[7] R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, second ed., Wiley, New York, NY, 2001.
[8] L. Kanal, B. Chandrasekaran, On dimensionality and sample size in statistical pattern classification, Pattern Recognition 3 (1971).
[9] P. Wald, R. Kronmal, Discriminant functions when covariances are unequal and sample sizes are moderate, Biometrics 33 (1977).
[10] S.J. Raudys, A.K. Jain, Small sample size effects in statistical pattern recognition: recommendations for practitioners, IEEE Trans. Pattern Anal. Mach. Intell. 13 (3) (1991).
[11] J.H. Friedman, Regularized discriminant analysis, J. Am. Statist. Assoc. 84 (1989).
[12] V.N. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, 1995.
[13] B. Schölkopf, C. Burges, A.J. Smola, Advances in Kernel Methods: Support Vector Learning, MIT Press, Cambridge, MA, 1999.
[14] A. Ruiz, P.L. de Teruel, Nonlinear kernel-based statistical pattern analysis, IEEE Trans. Neural Networks 12 (1) (2001).
[15] K.-R. Müller, S. Mika, G. Rätsch, K. Tsuda, B. Schölkopf, An introduction to kernel-based learning algorithms, IEEE Trans. Neural Networks 12 (2) (2001).
[16] J. Shawe-Taylor, N. Cristianini, Kernel Methods for Pattern Analysis, Cambridge University Press, Cambridge, 2004.
[17] J. Lu, K.N. Plataniotis, A.N. Venetsanopoulos, Face recognition using kernel direct discriminant analysis algorithms, IEEE Trans. Neural Networks 14 (1) (2003).
[18] P.J. Phillips, H. Wechsler, J. Huang, P. Rauss, The FERET database and evaluation procedure for face recognition algorithms, Image Vision Comput. 16 (5) (1998).
[19] P.J. Phillips, H. Moon, S.A. Rizvi, P.J. Rauss, The FERET evaluation methodology for face-recognition algorithms, IEEE Trans. Pattern Anal. Mach. Intell. 22 (10) (2000).
[20] B. Schölkopf, A. Smola, K.-R. Müller, Nonlinear component analysis as a kernel eigenvalue problem, Neural Comput. 10 (1998).
[21] G. Baudat, F. Anouar, Generalized discriminant analysis using a kernel approach, Neural Comput. 12 (2000).
[22] C. Blake, E. Keogh, C.J. Merz, UCI repository of machine learning databases, Department of Information and Computer Science, University of California, Irvine, 1998.

About the Author

JIE WANG received both the B.Eng.
and M.Sci. degrees in electronic engineering from Zhejiang University, P.R. China, in 1998 and 2001, respectively. She is currently pursuing the Ph.D. degree in the Edward S. Rogers Sr. Department of Electrical and Computer Engineering, University of Toronto, Canada. Her research interests include face detection and recognition, multi-modal biometrics and kernel methods.

About the Author

K.N. PLATANIOTIS received the B.Eng. degree in computer engineering and informatics from the University of Patras, Greece, in 1988, and the M.S. and Ph.D. degrees in electrical engineering from the Florida Institute of Technology (Florida Tech), Melbourne, in 1992 and 1994, respectively. He is an Associate Professor of Electrical and Computer Engineering at the University of Toronto, Canada. His research interests are in the areas of multimedia systems, biometrics, image and signal processing, communications systems and pattern recognition. Dr. Plataniotis is a registered professional engineer in the province of Ontario.

About the Author

JUWEI LU received the B.Eng. degree in electrical engineering from Nanjing University of Aeronautics and Astronautics, China, in 1994, the M.Eng. degree from the School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore, in 1999, and the Ph.D. degree from the Edward S. Rogers Sr. Department of Electrical and Computer Engineering, University of Toronto, Canada. His research interests include multimedia signal processing, face detection and recognition, kernel methods, support vector machines, neural networks, and boosting technologies.

About the Author

A.N. VENETSANOPOULOS received the Bachelor of Electrical and Mechanical Engineering degree from the National Technical University of Athens (NTU), Greece, in 1965, and the M.S., M.Phil., and Ph.D. degrees in Electrical Engineering from Yale University in 1966, 1968 and 1969, respectively. He is a Professor of Electrical and Computer Engineering at the University of Toronto, Canada, and served as the 12th Dean of the Faculty of Applied Science and Engineering of the University of Toronto. He is now the Vice President, Research and Innovation, of Ryerson University. His research interests include multimedia (image compression, database retrieval), digital signal/image processing (multichannel image processing; nonlinear, adaptive and M-D filtering), digital communications (image transmission, image compression), and neural networks and fuzzy reasoning in signal/image processing.


A Reduced Rank Kalman Filter Ross Bannister, October/November 2009 , Octoer/Novemer 2009 hese notes give the detailed workings of Fisher's reduced rank Kalman filter (RRKF), which was developed for use in a variational environment (Fisher, 1998). hese notes are my interpretation

More information

CHAPTER 5. Linear Operators, Span, Linear Independence, Basis Sets, and Dimension

CHAPTER 5. Linear Operators, Span, Linear Independence, Basis Sets, and Dimension A SERIES OF CLASS NOTES TO INTRODUCE LINEAR AND NONLINEAR PROBLEMS TO ENGINEERS, SCIENTISTS, AND APPLIED MATHEMATICIANS LINEAR CLASS NOTES: A COLLECTION OF HANDOUTS FOR REVIEW AND PREVIEW OF LINEAR THEORY

More information

Weak Keys of the Full MISTY1 Block Cipher for Related-Key Cryptanalysis

Weak Keys of the Full MISTY1 Block Cipher for Related-Key Cryptanalysis Weak eys of the Full MISTY1 Block Cipher for Related-ey Cryptanalysis Jiqiang Lu 1, Wun-She Yap 1,2, and Yongzhuang Wei 3,4 1 Institute for Infocomm Research, Agency for Science, Technology and Research

More information

Dimensionality Reduction Using the Sparse Linear Model: Supplementary Material

Dimensionality Reduction Using the Sparse Linear Model: Supplementary Material Dimensionality Reduction Using the Sparse Linear Model: Supplementary Material Ioannis Gkioulekas arvard SEAS Cambridge, MA 038 igkiou@seas.harvard.edu Todd Zickler arvard SEAS Cambridge, MA 038 zickler@seas.harvard.edu

More information

Automatic Rank Determination in Projective Nonnegative Matrix Factorization

Automatic Rank Determination in Projective Nonnegative Matrix Factorization Automatic Rank Determination in Projective Nonnegative Matrix Factorization Zhirong Yang, Zhanxing Zhu, and Erkki Oja Department of Information and Computer Science Aalto University School of Science and

More information

High Dimensional Discriminant Analysis

High Dimensional Discriminant Analysis High Dimensional Discriminant Analysis Charles Bouveyron 1,2, Stéphane Girard 1, and Cordelia Schmid 2 1 LMC IMAG, BP 53, Université Grenoble 1, 38041 Grenoble cedex 9 France (e-mail: charles.bouveyron@imag.fr,

More information

Course 495: Advanced Statistical Machine Learning/Pattern Recognition

Course 495: Advanced Statistical Machine Learning/Pattern Recognition Course 495: Advanced Statistical Machine Learning/Pattern Recognition Deterministic Component Analysis Goal (Lecture): To present standard and modern Component Analysis (CA) techniques such as Principal

More information

Introduction to Machine Learning Spring 2018 Note 18

Introduction to Machine Learning Spring 2018 Note 18 CS 189 Introduction to Machine Learning Spring 2018 Note 18 1 Gaussian Discriminant Analysis Recall the idea of generative models: we classify an arbitrary datapoint x with the class label that maximizes

More information

Regularized Discriminant Analysis and Reduced-Rank LDA

Regularized Discriminant Analysis and Reduced-Rank LDA Regularized Discriminant Analysis and Reduced-Rank LDA Department of Statistics The Pennsylvania State University Email: jiali@stat.psu.edu Regularized Discriminant Analysis A compromise between LDA and

More information

ON THE COMPARISON OF BOUNDARY AND INTERIOR SUPPORT POINTS OF A RESPONSE SURFACE UNDER OPTIMALITY CRITERIA. Cross River State, Nigeria

ON THE COMPARISON OF BOUNDARY AND INTERIOR SUPPORT POINTS OF A RESPONSE SURFACE UNDER OPTIMALITY CRITERIA. Cross River State, Nigeria ON THE COMPARISON OF BOUNDARY AND INTERIOR SUPPORT POINTS OF A RESPONSE SURFACE UNDER OPTIMALITY CRITERIA Thomas Adidaume Uge and Stephen Seastian Akpan, Department Of Mathematics/Statistics And Computer

More information

Introduction to Support Vector Machines

Introduction to Support Vector Machines Introduction to Support Vector Machines Andreas Maletti Technische Universität Dresden Fakultät Informatik June 15, 2006 1 The Problem 2 The Basics 3 The Proposed Solution Learning by Machines Learning

More information

Formulation with slack variables

Formulation with slack variables Formulation with slack variables Optimal margin classifier with slack variables and kernel functions described by Support Vector Machine (SVM). min (w,ξ) ½ w 2 + γσξ(i) subject to ξ(i) 0 i, d(i) (w T x(i)

More information

Image Analysis & Retrieval. Lec 14. Eigenface and Fisherface

Image Analysis & Retrieval. Lec 14. Eigenface and Fisherface Image Analysis & Retrieval Lec 14 Eigenface and Fisherface Zhu Li Dept of CSEE, UMKC Office: FH560E, Email: lizhu@umkc.edu, Ph: x 2346. http://l.web.umkc.edu/lizhu Z. Li, Image Analysis & Retrv, Spring

More information

Research Article Multiple Data-Dependent Kernel Fisher Discriminant Analysis for Face Recognition

Research Article Multiple Data-Dependent Kernel Fisher Discriminant Analysis for Face Recognition Mathematical Problems in Engineering Volume 204, Article ID 898560, 9 pages http://dxdoiorg/055/204/898560 Research Article Multiple Data-Dependent Kernel Fisher Discriminant Analysis for Face Recognition

More information

Solving the face recognition problem using QR factorization

Solving the face recognition problem using QR factorization Solving the face recognition problem using QR factorization JIANQIANG GAO(a,b) (a) Hohai University College of Computer and Information Engineering Jiangning District, 298, Nanjing P.R. CHINA gaojianqiang82@126.com

More information

Pattern Recognition 2

Pattern Recognition 2 Pattern Recognition 2 KNN,, Dr. Terence Sim School of Computing National University of Singapore Outline 1 2 3 4 5 Outline 1 2 3 4 5 The Bayes Classifier is theoretically optimum. That is, prob. of error

More information

Lecture 16: Small Sample Size Problems (Covariance Estimation) Many thanks to Carlos Thomaz who authored the original version of these slides

Lecture 16: Small Sample Size Problems (Covariance Estimation) Many thanks to Carlos Thomaz who authored the original version of these slides Lecture 16: Small Sample Size Problems (Covariance Estimation) Many thanks to Carlos Thomaz who authored the original version of these slides Intelligent Data Analysis and Probabilistic Inference Lecture

More information

Dominant Feature Vectors Based Audio Similarity Measure

Dominant Feature Vectors Based Audio Similarity Measure Dominant Feature Vectors Based Audio Similarity Measure Jing Gu 1, Lie Lu 2, Rui Cai 3, Hong-Jiang Zhang 2, and Jian Yang 1 1 Dept. of Electronic Engineering, Tsinghua Univ., Beijing, 100084, China 2 Microsoft

More information

Integrating Global and Local Structures: A Least Squares Framework for Dimensionality Reduction

Integrating Global and Local Structures: A Least Squares Framework for Dimensionality Reduction Integrating Global and Local Structures: A Least Squares Framework for Dimensionality Reduction Jianhui Chen, Jieping Ye Computer Science and Engineering Department Arizona State University {jianhui.chen,

More information

Enhanced Fisher Linear Discriminant Models for Face Recognition

Enhanced Fisher Linear Discriminant Models for Face Recognition Appears in the 14th International Conference on Pattern Recognition, ICPR 98, Queensland, Australia, August 17-2, 1998 Enhanced isher Linear Discriminant Models for ace Recognition Chengjun Liu and Harry

More information

Introduction to machine learning and pattern recognition Lecture 2 Coryn Bailer-Jones

Introduction to machine learning and pattern recognition Lecture 2 Coryn Bailer-Jones Introduction to machine learning and pattern recognition Lecture 2 Coryn Bailer-Jones http://www.mpia.de/homes/calj/mlpr_mpia2008.html 1 1 Last week... supervised and unsupervised methods need adaptive

More information

Complete discriminant evaluation and feature extraction in kernel space for face recognition

Complete discriminant evaluation and feature extraction in kernel space for face recognition Machine Vision and Applications (2009) 20:35 46 DOI 10.1007/s00138-007-0103-1 ORIGINAL PAPER Complete discriminant evaluation and feature extraction in kernel space for face recognition Xudong Jiang Bappaditya

More information

Support Vector Machines Explained

Support Vector Machines Explained December 23, 2008 Support Vector Machines Explained Tristan Fletcher www.cs.ucl.ac.uk/staff/t.fletcher/ Introduction This document has been written in an attempt to make the Support Vector Machines (SVM),

More information

Ensemble-based Discriminant Learning with Boosting For Face Recognition

Ensemble-based Discriminant Learning with Boosting For Face Recognition 1 Ensemble-based Discriminant Learning with Boosting For Face Recognition Juwei Lu 1, K.N. Plataniotis 1,, A.N. Venetsanopoulos 1, and Stan Z. Li 2 1 The Edward S. Rogers Sr. Department of Electrical and

More information

Template-based Recognition of Static Sitting Postures

Template-based Recognition of Static Sitting Postures Template-based Recognition of Static Sitting Postures Manli Zhu 1, Aleix M. Martínez 1 and Hong Z. Tan 2 1 Dept. of Electrical Engineering The Ohio State University 2 School of Electrical and Computer

More information

SPECTRAL CLUSTERING AND KERNEL PRINCIPAL COMPONENT ANALYSIS ARE PURSUING GOOD PROJECTIONS

SPECTRAL CLUSTERING AND KERNEL PRINCIPAL COMPONENT ANALYSIS ARE PURSUING GOOD PROJECTIONS SPECTRAL CLUSTERING AND KERNEL PRINCIPAL COMPONENT ANALYSIS ARE PURSUING GOOD PROJECTIONS VIKAS CHANDRAKANT RAYKAR DECEMBER 5, 24 Abstract. We interpret spectral clustering algorithms in the light of unsupervised

More information

EEL 851: Biometrics. An Overview of Statistical Pattern Recognition EEL 851 1

EEL 851: Biometrics. An Overview of Statistical Pattern Recognition EEL 851 1 EEL 851: Biometrics An Overview of Statistical Pattern Recognition EEL 851 1 Outline Introduction Pattern Feature Noise Example Problem Analysis Segmentation Feature Extraction Classification Design Cycle

More information

DIFFUSION NETWORKS, PRODUCT OF EXPERTS, AND FACTOR ANALYSIS. Tim K. Marks & Javier R. Movellan

DIFFUSION NETWORKS, PRODUCT OF EXPERTS, AND FACTOR ANALYSIS. Tim K. Marks & Javier R. Movellan DIFFUSION NETWORKS PRODUCT OF EXPERTS AND FACTOR ANALYSIS Tim K Marks & Javier R Movellan University of California San Diego 9500 Gilman Drive La Jolla CA 92093-0515 ABSTRACT Hinton (in press) recently

More information

Linear Classifiers as Pattern Detectors

Linear Classifiers as Pattern Detectors Intelligent Systems: Reasoning and Recognition James L. Crowley ENSIMAG 2 / MoSIG M1 Second Semester 2014/2015 Lesson 16 8 April 2015 Contents Linear Classifiers as Pattern Detectors Notation...2 Linear

More information

Connection of Local Linear Embedding, ISOMAP, and Kernel Principal Component Analysis

Connection of Local Linear Embedding, ISOMAP, and Kernel Principal Component Analysis Connection of Local Linear Embedding, ISOMAP, and Kernel Principal Component Analysis Alvina Goh Vision Reading Group 13 October 2005 Connection of Local Linear Embedding, ISOMAP, and Kernel Principal

More information

Machine Learning : Support Vector Machines

Machine Learning : Support Vector Machines Machine Learning Support Vector Machines 05/01/2014 Machine Learning : Support Vector Machines Linear Classifiers (recap) A building block for almost all a mapping, a partitioning of the input space into

More information

From Last Meeting. Studied Fisher Linear Discrimination. - Mathematics. - Point Cloud view. - Likelihood view. - Toy examples

From Last Meeting. Studied Fisher Linear Discrimination. - Mathematics. - Point Cloud view. - Likelihood view. - Toy examples From Last Meeting Studied Fisher Linear Discrimination - Mathematics - Point Cloud view - Likelihood view - Toy eamples - Etensions (e.g. Principal Discriminant Analysis) Polynomial Embedding Aizerman,

More information

Quivers. Virginia, Lonardo, Tiago and Eloy UNICAMP. 28 July 2006

Quivers. Virginia, Lonardo, Tiago and Eloy UNICAMP. 28 July 2006 Quivers Virginia, Lonardo, Tiago and Eloy UNICAMP 28 July 2006 1 Introduction This is our project Throughout this paper, k will indicate an algeraically closed field of characteristic 0 2 Quivers 21 asic

More information

Linear Dependency Between and the Input Noise in -Support Vector Regression

Linear Dependency Between and the Input Noise in -Support Vector Regression 544 IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 14, NO. 3, MAY 2003 Linear Dependency Between the Input Noise in -Support Vector Regression James T. Kwok Ivor W. Tsang Abstract In using the -support vector

More information

COS 429: COMPUTER VISON Face Recognition

COS 429: COMPUTER VISON Face Recognition COS 429: COMPUTER VISON Face Recognition Intro to recognition PCA and Eigenfaces LDA and Fisherfaces Face detection: Viola & Jones (Optional) generic object models for faces: the Constellation Model Reading:

More information

OVERLAPPING ANIMAL SOUND CLASSIFICATION USING SPARSE REPRESENTATION

OVERLAPPING ANIMAL SOUND CLASSIFICATION USING SPARSE REPRESENTATION OVERLAPPING ANIMAL SOUND CLASSIFICATION USING SPARSE REPRESENTATION Na Lin, Haixin Sun Xiamen University Key Laboratory of Underwater Acoustic Communication and Marine Information Technology, Ministry

More information

PCA and LDA. Man-Wai MAK

PCA and LDA. Man-Wai MAK PCA and LDA Man-Wai MAK Dept. of Electronic and Information Engineering, The Hong Kong Polytechnic University enmwmak@polyu.edu.hk http://www.eie.polyu.edu.hk/ mwmak References: S.J.D. Prince,Computer

More information

Linear Subspace Models

Linear Subspace Models Linear Subspace Models Goal: Explore linear models of a data set. Motivation: A central question in vision concerns how we represent a collection of data vectors. The data vectors may be rasterized images,

More information

CHAPTER 4 PRINCIPAL COMPONENT ANALYSIS-BASED FUSION

CHAPTER 4 PRINCIPAL COMPONENT ANALYSIS-BASED FUSION 59 CHAPTER 4 PRINCIPAL COMPONENT ANALYSIS-BASED FUSION 4. INTRODUCTION Weighted average-based fusion algorithms are one of the widely used fusion methods for multi-sensor data integration. These methods

More information