Weighted Generalized LDA for Undersampled Problems


JING YANG, School of Computer Science and Technology, Nanjing University of Science and Technology, Xiao Lin Wei Street, Nanjing, CHINA
LIYA FAN, School of Mathematics Sciences, Liaocheng University, Hunan Road, Liaocheng, CHINA, fanliya63@126.com
QUANSEN SUN, School of Computer Science and Technology, Nanjing University of Science and Technology, Xiao Lin Wei Street, Nanjing, CHINA, sunquansen@mail.njust.edu.cn

Abstract: Linear discriminant analysis (LDA) is a classical approach for dimensionality reduction. It aims to maximize the between-class scatter and minimize the within-class scatter, and thus to maximize class discrimination. However, for undersampled problems, where the data dimensionality is larger than the sample size, all scatter matrices are singular and classical LDA encounters computational difficulty. Recently, many LDA extensions have been developed to overcome the singularity, such as pseudo-inverse LDA (PILDA), LDA based on the generalized singular value decomposition (LDA/GSVD), null space LDA (NLDA) and range space LDA (RSLDA). However, they still rely on the Fisher criterion, which is not optimal with respect to classification rate. To remedy this problem, weighted schemes are introduced for several LDA extensions in this paper; we call the resulting methods weighted generalized LDA algorithms. Experiments on the Yale face database and the AT&T face database are performed to evaluate the effectiveness of the proposed algorithms and the effect of the weighting functions.

Key Words: Feature extraction, dimensionality reduction, undersampled problem, weighting function, misclassification rate

1 Introduction

Many machine learning, data mining and bioinformatics problems involve data in a very high-dimensional space. Undersampled problems, where the data dimension is larger than the sample size, frequently occur in many applications, including information retrieval [1-3], face recognition [4-6] and microarray data analysis [7]. Feature extraction is an important part of pattern recognition and machine learning; it can reduce computational cost and improve classification performance. An appropriate representation of the data in terms of its features is an important problem in machine learning and data mining. Not all original features are beneficial for classification or regression tasks: some features are irrelevant or redundant for the distribution of the data set and can degrade classification performance. In order to increase classification performance and to reduce the computational cost of the classifier, a feature selection process should therefore be used in classification or regression problems [8].

Linear discriminant analysis (LDA) [9] is one of the most popular linear projection techniques for feature extraction. However, classical LDA usually encounters two difficulties. One is the singularity problem caused by the undersampled problem. In recent years many LDA extensions have been developed to deal with this problem, such as pseudo-inverse LDA (PILDA) [10,11], LDA/GSVD, null space LDA (NLDA) [12,15], range space LDA [16], orthogonal LDA (OLDA) [11,17], uncorrelated LDA (ULDA) [11,18] and regularized LDA (RLDA) [19]. In PILDA the inverse of the scatter matrix is replaced by the pseudo-inverse, which is equivalent to approximating the solution by a least-squares method; the optimal transformation is computed through the simultaneous diagonalization of the scatter matrices.
LDA/GSVD is a generalization of LDA based on the GSVD; it overcomes the singularity of the scatter matrices by applying the GSVD to solve the generalized eigenvalue problem. The classical LDA solution is a special case of the LDA/GSVD method. In NLDA the between-class distance is maximized in the null space of the within-class scatter matrix. It is a two-stage approach: a transformation using a basis of the null space of the within-class scatter matrix is performed in the first stage, and then the projective directions are searched in the transformed space. In range space LDA the within-class distance is minimized in the range space of the between-class scatter matrix. Similarly, we propose a method based on a transformation by a basis of the range space of the within-class scatter matrix to handle undersampled problems.

Another drawback of LDA-based algorithms is that the Fisher separability criterion is not directly related to the classification rate. A promising solution to this problem is to introduce weighted schemes into the criteria. It can be seen that weighting functions are closely related to classification accuracy: different weighting functions lead to different classification errors, and selecting a suitable weighting function can increase classification accuracy. In this paper we focus on weighted versions of PILDA, LDA/GSVD, NLDA and range space LDA, with five weighting functions for each weighted scheme, in which the K-nearest neighbors (KNN) method [22] is used as the classifier. We apply the Euclidean distance $d_{ij} = \|m_i - m_j\|$ between the means of classes $i$ and $j$ in the weighting functions $w(d_{ij})$. A weighting function is generally a monotonically decreasing function, because classes that are closer to one another are more likely to be confused and should therefore be given a greater weight. For the weighting functions, we first apply two special cases of the weighting function $w(d_{ij}) = (d_{ij})^{-p}$ proposed by Lotlikar et al. in [20], with $p = 1$ and $p = 2$, and then an improved version of the weighting function $w(d_{ij}) = \frac{1}{2d_{ij}^2}\mathrm{erf}\!\left(\frac{d_{ij}}{2\sqrt{2}}\right)$ presented by Loog in [25], where the Mahalanobis distance is replaced by the Euclidean distance. In addition, according to the features of these weighting functions, we present two new weighting functions.

The rest of the paper is organized as follows. In Section 2 we briefly review generalized LDA algorithms. Weighted versions of the generalized LDA algorithms and the weighting functions are introduced in Section 3. Extensive experiments with the proposed algorithms are reported in Section 4; the results demonstrate the effectiveness of the proposed algorithms and the effect of the weighting functions. Conclusions follow in Section 5.

2 Generalized LDA

In this section we review several generalized LDA algorithms which can overcome the limitation of classical LDA. Given a data matrix $A = [a_1, \ldots, a_n] \in R^{m \times n}$, where $a_i \in R^m$, $i = 1, \ldots, n$, assume the original data are already clustered and partitioned into $r$ classes. Let $A = [A_1, \ldots, A_r]$, where $A_i \in R^{m \times n_i}$ is the data matrix belonging to the $i$-th class and $\sum_{i=1}^r n_i = n$. Let $N_i$ be the set of indices of the $i$-th class, i.e. $a_j$ belongs to the $i$-th class for $j \in N_i$. The aim of LDA is to find a linear transformation $G \in R^{m \times l}$ that maps each $a_i$ to $y_i \in R^l$ by $y_i = G^T a_i$ and optimally preserves the cluster structure in the reduced-dimensional space. The between-class, within-class and total scatter matrices are defined as $S_b = \sum_{i=1}^r n_i (c_i - c)(c_i - c)^T$, $S_w = \sum_{i=1}^r \sum_{j \in N_i} (a_j - c_i)(a_j - c_i)^T$ and $S_t = \sum_{j=1}^n (a_j - c)(a_j - c)^T$ (see [1]), where $c_i = (1/n_i)\sum_{j \in N_i} a_j$ and $c = (1/n)\sum_{j=1}^n a_j$ are the class centroids and the global centroid, respectively.
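To make the notation concrete, here is a minimal NumPy sketch of the three scatter matrices defined above; the data layout (samples as columns of A, one integer label per column) and the function name are illustrative assumptions, not part of the paper.

```python
import numpy as np

def scatter_matrices(A, labels):
    """Compute S_w, S_b, S_t for a data matrix A (m x n, one sample per column).

    `labels` holds the class index of each column; this layout is an
    illustrative assumption.
    """
    labels = np.asarray(labels)
    m, n = A.shape
    c = A.mean(axis=1, keepdims=True)                     # global centroid
    S_w = np.zeros((m, m))
    S_b = np.zeros((m, m))
    for k in np.unique(labels):
        A_k = A[:, labels == k]                           # samples of class k
        c_k = A_k.mean(axis=1, keepdims=True)             # class centroid
        D_k = A_k - c_k
        S_w += D_k @ D_k.T                                # within-class scatter
        S_b += A_k.shape[1] * (c_k - c) @ (c_k - c).T     # between-class scatter
    S_t = (A - c) @ (A - c).T                             # total scatter, S_t = S_b + S_w
    return S_w, S_b, S_t
```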
We can easily see that the trace of $S_w$ measures the within-class closeness and the trace of $S_b$ measures the between-class separation. In the lower-dimensional space obtained from the transformation $G$, the three scatter matrices above become $S_w^L = G^T S_w G$, $S_b^L = G^T S_b G$ and $S_t^L = G^T S_t G$. An optimal transformation $G$ would maximize $\mathrm{trace}(S_b^L)$ and minimize $\mathrm{trace}(S_w^L)$ simultaneously. In classical LDA, which requires $S_w$ or $S_b$ to be nonsingular, common optimizations include

$\max_G \{\mathrm{trace}((S_w^L)^{-1} S_b^L)\}$, $\min_G \{\mathrm{trace}((S_b^L)^{-1} S_w^L)\}$. (1)

Problem (1) is equivalent to finding the eigenvectors satisfying the generalized eigenequation $S_b x = \lambda S_w x$ for $\lambda \neq 0$. The solution can be obtained by solving an eigenvalue problem on the matrix $S_w^{-1} S_b$ if $S_w$ is nonsingular, or on $S_b^{-1} S_w$ if $S_b$ is nonsingular. There are at most $r-1$ eigenvectors corresponding to nonzero eigenvalues, since $\mathrm{rank}(S_b) \le r-1$; therefore the reduced dimension obtained by classical LDA is at most $r-1$. A stable way to solve this eigenvalue problem is to apply the singular value decomposition (SVD) to the scatter matrices; the details can be found in [5]. When the data dimensionality is larger than the sample size, all scatter matrices are singular and classical LDA is no longer applicable. In order to solve such small sample size problems, several generalizations of LDA have been proposed.
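The nonsingular case can be written in a few lines; the sketch below (a hypothetical helper, assuming the scatter matrices from the previous snippet) solves $S_b x = \lambda S_w x$ by an ordinary eigen-decomposition of $S_w^{-1} S_b$ and keeps at most $r-1$ leading eigenvectors.

```python
import numpy as np

def classical_lda(S_w, S_b, r):
    """Classical LDA transformation for nonsingular S_w.

    Solves S_b x = lambda S_w x via eig(S_w^{-1} S_b) and keeps the
    eigenvectors of the (at most r-1) largest eigenvalues.
    """
    evals, evecs = np.linalg.eig(np.linalg.solve(S_w, S_b))
    order = np.argsort(-evals.real)[: r - 1]
    return evecs[:, order].real        # columns span the reduced space
```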

2.1 Pseudo-inverse LDA

The pseudo-inverse of a matrix $A$, denoted $A^+$, is the unique matrix satisfying $A^+ A A^+ = A^+$, $A A^+ A = A$, $(AA^+)^T = AA^+$ and $(A^+A)^T = A^+A$. The pseudo-inverse $A^+$ is commonly computed by the SVD [23]: if $A = U\Sigma V^T$ is the SVD of $A$, where $U$ and $V$ are orthogonal and $\Sigma$ is diagonal with positive diagonal entries, then $A^+ = V\Sigma^{-1}U^T$.

In this subsection we consider the criterion $F_1(G) = \mathrm{trace}((S_b^L)^+ S_w^L)$ proposed in [10], which is a natural extension of (1), since the inverse of a matrix may not exist but the pseudo-inverse is always well-defined; moreover, when the matrix is invertible its pseudo-inverse coincides with its inverse. In PILDA the optimal transformation matrix $G$ is obtained by solving the minimization problem

$\min_G F_1(G)$. (2)

The main technique for solving problem (2) is the GSVD. Define the matrices $H_w = [A_1 - c_1(e^1)^T, \ldots, A_r - c_r(e^r)^T]$, $H_b = [\sqrt{n_1}(c_1 - c), \ldots, \sqrt{n_r}(c_r - c)]$ and $H_t = [a_1 - c, \ldots, a_n - c]$, where $e^i = (1, \ldots, 1)^T \in R^{n_i}$. Then the scatter matrices can be expressed as $S_w = H_wH_w^T$, $S_b = H_bH_b^T$ and $S_t = H_tH_t^T$. Let the GSVD be applied to the matrix pair $(H_b^T, H_w^T)$ such that

$U_b^T H_b^T X = [\Gamma_b, 0]$, $U_w^T H_w^T X = [\Gamma_w, 0]$, (3)

where $U_b \in R^{r\times r}$ and $U_w \in R^{n\times n}$ are orthogonal, $X \in R^{m\times m}$ is a nonsingular matrix, $\Gamma_b^T\Gamma_b + \Gamma_w^T\Gamma_w = I_s$ with $s = \mathrm{rank}\begin{pmatrix} H_b^T \\ H_w^T \end{pmatrix}$, and $\Gamma_b^T\Gamma_b$ and $\Gamma_w^T\Gamma_w$ are diagonal matrices with nonincreasing and nondecreasing diagonal components, respectively. The simultaneous diagonalization of $S_b$ and $S_w$ is obtained as

$X^T S_b X = \begin{pmatrix}\Gamma_b^T\Gamma_b & 0\\ 0 & 0_{m-s}\end{pmatrix} = \mathrm{diag}(I_\mu,\, D_\tau,\, 0_{s-\mu-\tau},\, 0_{m-s}) = D_b$,
$X^T S_w X = \begin{pmatrix}\Gamma_w^T\Gamma_w & 0\\ 0 & 0_{m-s}\end{pmatrix} = \mathrm{diag}(0_\mu,\, E_\tau,\, I_{s-\mu-\tau},\, 0_{m-s}) = D_w$, (4)

where $I$ and $0$ denote identity and zero matrices, respectively, $D_\tau = \mathrm{diag}(\alpha_{\mu+1}^2, \ldots, \alpha_{\mu+\tau}^2)$ and $E_\tau = \mathrm{diag}(\beta_{\mu+1}^2, \ldots, \beta_{\mu+\tau}^2)$ with $\alpha_{\mu+1} \ge \cdots \ge \alpha_{\mu+\tau} > 0$, $0 < \beta_{\mu+1} \le \cdots \le \beta_{\mu+\tau}$ and $\alpha_i^2 + \beta_i^2 = 1$ for $i = \mu+1, \ldots, \mu+\tau$. From (4) we can deduce that $X^T S_t X = X^T S_b X + X^T S_w X = \begin{pmatrix} I_s & 0\\ 0 & 0\end{pmatrix}$.

To solve problem (2) we need the following three lemmas, where the proofs of the first two follow directly from the definition of the pseudo-inverse and the third can be found in [10].

Lemma 1. For any matrix $A \in R^{m\times n}$, $\mathrm{trace}(AA^+) = \mathrm{rank}(A)$.

Lemma 2. For any matrix $A \in R^{m\times n}$, $(AA^T)^+ = (A^+)^T A^+$.

Lemma 3. Let $\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_s)$ be any diagonal matrix with $\sigma_1 \ge \cdots \ge \sigma_s > 0$. Then for any matrix $M \in R^{m\times s}$ with $\mathrm{rank}(M) = \delta$ the following inequality holds: $\mathrm{trace}((M\Sigma M^T)^+ MM^T) \ge \sum_{i=1}^{\delta}\frac{1}{\sigma_i}$, where equality holds if and only if $M = U\begin{pmatrix} D & 0\\ 0 & 0\end{pmatrix}$ for some orthogonal matrix $U \in R^{m\times m}$ and matrix $D = \Sigma_1 Q \Sigma_2 \in R^{\delta\times\delta}$, where $Q \in R^{\delta\times\delta}$ is orthogonal and $\Sigma_1, \Sigma_2 \in R^{\delta\times\delta}$ are diagonal matrices with positive diagonal entries.

Theorem 4. Let $X$ be the matrix specified by the GSVD of $(H_b^T, H_w^T)$ in (3) and let $X_\delta$ be the matrix consisting of the first $\delta$ columns of $X$, where $\delta = \mathrm{rank}(S_b)$. Then $G = X_\delta M$ minimizes $F_1$ for any nonsingular $M$.

Proof: By the GSVD of the matrix pair $(H_b^T, H_w^T)$ we have $S_b^L = \tilde G D_b \tilde G^T$ and $S_w^L = \tilde G D_w \tilde G^T$, where $\tilde G = (X^{-1}G)^T$. Let $\tilde G = [G_1, G_2, G_3, G_4]$ be a partition of $\tilde G$ such that $G_1 \in R^{l\times\mu}$, $G_2 \in R^{l\times\tau}$, $G_3 \in R^{l\times(s-\mu-\tau)}$ and $G_4 \in R^{l\times(m-s)}$. Putting $G_{12} = [G_1, G_2]$, we have $S_w^L + S_b^L = G_{12}G_{12}^T + G_3G_3^T$ and $S_b^L = G_{12}\Sigma G_{12}^T$, where $\Sigma = \mathrm{diag}(I_\mu, D_\tau)$. By Lemma 1 we get $\mathrm{trace}((S_b^L)^+ S_b^L) = \mathrm{rank}(S_b^L) \le \mathrm{rank}(S_b) = \delta$. Let $\hat G = G_{12}\Sigma^{1/2}$ and let

$\hat G = U\hat\Sigma V^T$ (5)

be the SVD of $\hat G$. Then $\mathrm{rank}(\hat G) = \mathrm{rank}(G_{12})$ and $\hat G^+ = V\hat\Sigma^+ U^T$. Consequently, by Lemmas 2 and 3 we can deduce that

$\mathrm{trace}((S_b^L)^+(S_w^L + S_b^L)) = \mathrm{trace}((G_{12}\Sigma G_{12}^T)^+(G_{12}G_{12}^T + G_3G_3^T)) \ge \mathrm{trace}((G_{12}\Sigma G_{12}^T)^+ G_{12}G_{12}^T) = \mathrm{trace}((\hat G^+)^T\hat G^+\hat G\Sigma^{-1}\hat G^T) = \mathrm{trace}(V_\delta^T\Sigma^{-1}V_\delta) \ge \mu + \sum_{i=\mu+1}^{\delta}\frac{1}{\alpha_i^2}$, (6)

where $V_\delta$ is the matrix consisting of the first $\delta$ columns of $V$. We can easily show that the last inequality in (6) becomes an equality if $V_\delta = \begin{pmatrix} Q_\delta\\ 0\end{pmatrix}$ for some orthogonal matrix $Q_\delta \in R^{\delta\times\delta}$ and $G_{12} = U\begin{pmatrix} D\\ 0\end{pmatrix} \in R^{l\times(\mu+\tau)}$, where $D = \hat\Sigma Q_\delta\Sigma_\delta^{-1/2}$ and $\Sigma_\delta$ is the $\delta$-th principal submatrix of $\Sigma$. It follows from (5)-(6) that

$F_1(G) = \mathrm{trace}((S_b^L)^+(S_w^L + S_b^L)) - \mathrm{trace}((S_b^L)^+ S_b^L) \ge \sum_{i=\mu+1}^{\delta}\left(\frac{\beta_i}{\alpha_i}\right)^2$,

where equality holds if $\mathrm{rank}(S_b^L) = \delta$, $G_{12} = U\begin{pmatrix}\hat\Sigma Q_\delta\Sigma_\delta^{-1/2}\\ 0\end{pmatrix}$ and $G_3 = 0$. We observe that the minimization of $F_1$ is independent of $G_4$ and simply set it to zero. Hence the minimum of $F_1$ is attained if the partition $\tilde G = [G_1, G_2, G_3, G_4]$ satisfies $G_{12} = U\begin{pmatrix}\hat\Sigma Q_\delta\Sigma_\delta^{-1/2}\\ 0\end{pmatrix}$, $G_3 = 0$ and $G_4 = 0$. Let $U = [U_1, U_2]$ be a partition of $U$ such that $U_1 \in R^{l\times\delta}$ and $U_2 \in R^{l\times(l-\delta)}$. It follows that $G_{12} = U\begin{pmatrix}\hat\Sigma Q_\delta\Sigma_\delta^{-1/2}\\ 0\end{pmatrix} = U_1\hat\Sigma Q_\delta\Sigma_\delta^{-1/2}$. Note that $U$ is an orthogonal matrix, $Q_\delta$ is an arbitrary orthogonal matrix, and $\hat\Sigma$ and $\Sigma_\delta$ are diagonal matrices; therefore $M$ is an arbitrary nonsingular matrix and $G = X_\delta M$ minimizes $F_1(G)$. This method is called PILDA and is summarized as Algorithm 2.1.

Algorithm 2.1. PILDA
1. Compute the SVD of $K = \begin{pmatrix} H_b^T\\ H_w^T\end{pmatrix}$ as $K = P\begin{pmatrix} R & 0\\ 0 & 0\end{pmatrix}U^T$, where $P$ and $U$ are orthogonal and $R$ is diagonal.
2. Let $s = \mathrm{rank}(K)$ and $\delta = \mathrm{rank}(H_b)$. Compute $W$ from the SVD of $P(1:r, 1:s)$, the submatrix consisting of the first $r$ rows and the first $s$ columns of the matrix $P$ from Step 1, as $P(1:r, 1:s) = V\Gamma W^T$.
3. Let $X = U\begin{pmatrix} R^{-1}W & 0\\ 0 & I\end{pmatrix}$.
4. Assign the first $\delta$ columns of the matrix $X$ to $X_\delta$.
5. $G = X_\delta M$, where $M$ is any nonsingular matrix.

2.2 LDA based on GSVD

In this subsection we review another method based on the GSVD, proposed by Howland et al.; this approach is justified to preserve the cluster structure after dimension reduction. When $S_w = H_wH_w^T$ is nonsingular, it is well known that $\mathrm{trace}((S_w^L)^{-1}S_b^L)$ is maximized if $G_h \in R^{m\times l}$ consists of the $l$ eigenvectors of $S_w^{-1}S_b$ corresponding to the $l$ largest eigenvalues [1]. Let $x_i$ denote the $i$-th column of $X$; then

$S_b x_i = \lambda_i S_w x_i$, (7)

which means that $\lambda_i$ and $x_i$ are an eigenvalue-eigenvector pair of $S_w^{-1}S_b$ and $\mathrm{trace}(S_w^{-1}S_b) = \lambda_1 + \cdots + \lambda_m$. Expressing $\lambda_i$ as $\alpha_i^2/\beta_i^2$, the eigenvalue problem (7) becomes

$\beta_i^2 H_bH_b^T x_i = \alpha_i^2 H_wH_w^T x_i$. (8)

Let $X = [X_1, X_2, X_3, X_4] \in R^{m\times m}$ be the partition of $X$ obtained in Section 2.1, with $X_1 \in R^{m\times\mu}$, $X_2 \in R^{m\times\tau}$, $X_3 \in R^{m\times(s-\mu-\tau)}$ and $X_4 \in R^{m\times(m-s)}$. Defining $\alpha_i = 1$, $\beta_i = 0$ for $i = 1, \ldots, \mu$ and $\alpha_i = 0$, $\beta_i = 1$ for $i = \mu+\tau+1, \ldots, s$, we can see that (8) is satisfied for $1 \le i \le s$. For the remaining $m-s$ columns of $X$, both $H_bH_b^T x_i$ and $H_wH_w^T x_i$ are zero, so (8) is satisfied for arbitrary values of $\alpha_i$ and $\beta_i$ when $s+1 \le i \le m$, and then the columns of $X$ are the generalized singular vectors of the matrix pair $(H_b^T, H_w^T)$.

A question that remains is which columns of $X$ to include in $G_h$. If $S_w$ is singular, [14] argues in terms of the simultaneous optimization

$\max_{G_h}\mathrm{trace}(G_h^T S_b G_h)$, $\min_{G_h}\mathrm{trace}(G_h^T S_w G_h)$. (9)

Letting $g_j$ represent a column of $G_h$, we have $\mathrm{trace}(G_h^T S_b G_h) = \sum_j g_j^T S_b g_j$ and $\mathrm{trace}(G_h^T S_w G_h) = \sum_j g_j^T S_w g_j$. If $x_i$ is one of the leftmost $\mu$ columns of $X$, then $x_i \in \mathrm{null}(S_w) \cap \mathrm{null}(S_b)^c$ (the superscript $c$ denotes the complement), which indicates that including $x_i$ in $G_h$ increases $\mathrm{trace}(G_h^T S_b G_h)$ while leaving $\mathrm{trace}(G_h^T S_w G_h)$ unchanged. If $x_i$ is one of the rightmost $m-s$ columns of $X$, then $x_i \in \mathrm{null}(S_w) \cap \mathrm{null}(S_b)$, which implies that adding $x_i$ to $G_h$ has no effect on these traces and contributes neither to the maximization nor to the minimization in (9). In either case, $G_h$ should comprise the leftmost $\mu + \tau = \mathrm{rank}(H_b)$ columns of $X$, as illustrated in [24]. Hence in LDA/GSVD the transformation matrix is $G_h = [X_1, X_2]$. An efficient algorithm for LDA/GSVD was presented in [12] as follows.

Algorithm 2.2. LDA/GSVD
1. Compute the EVD of $S_t$: $S_t = [U_1, U_2]\begin{pmatrix}\Sigma_1 & 0\\ 0 & 0\end{pmatrix}[U_1, U_2]^T$.
2. Compute $V$ from the EVD of $\tilde S_b = \Sigma_1^{-1/2}U_1^T S_b U_1\Sigma_1^{-1/2}$: $\tilde S_b = V\Gamma_b^T\Gamma_b V^T$.
3. Assign the first $r-1$ columns of $U_1\Sigma_1^{-1/2}V$ to $G_h$.
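Algorithm 2.2 translates almost line for line into NumPy. The sketch below is a minimal illustration (the function name, the tolerance used for the numerical rank of $S_t$, and the eigh-based decompositions are assumptions); as with Algorithms 3.1 and 3.2 in the next section, multiplying the returned matrix on the right by a nonsingular $M$ gives the corresponding PILDA transformation.

```python
import numpy as np

def lda_gsvd(S_b, S_t, r, tol=1e-10):
    """Sketch of Algorithm 2.2 (LDA/GSVD) using eigen-decompositions."""
    # Step 1: EVD of the total scatter matrix; keep the nonzero part.
    w, U = np.linalg.eigh(S_t)
    idx = w > tol * w.max()
    U1, Sig1 = U[:, idx], w[idx]
    # Step 2: EVD of the reduced between-class scatter.
    T = U1 / np.sqrt(Sig1)                  # U1 @ diag(Sigma1^{-1/2})
    wb, V = np.linalg.eigh(T.T @ S_b @ T)
    V = V[:, np.argsort(-wb)]               # sort by decreasing eigenvalue
    # Step 3: the first r-1 columns give the transformation G_h.
    return (T @ V)[:, : r - 1]
```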
2.3 Null space LDA

Chen et al. [15] proposed null space LDA (NLDA) for dimensionality reduction of undersampled problems, which is a two-stage LDA method. This approach projects the original space onto the null space of $S_w$ using an orthonormal basis of $\mathrm{null}(S_w)$, and then, in the projected space, computes a transformation that maximizes the between-class scatter. Let the EVD of $S_w \in R^{m\times m}$ be

$S_w = U_w\Sigma_w U_w^T = [U_{w1}, U_{w2}]\begin{pmatrix}\Sigma_{w1} & 0\\ 0 & 0\end{pmatrix}[U_{w1}, U_{w2}]^T$,

where $s_1 = \mathrm{rank}(S_w)$, $U_w$ is orthogonal, $\Sigma_{w1}$ is a diagonal matrix with nonincreasing positive diagonal elements, and $U_{w1}$ contains the first $s_1$ columns of the orthogonal matrix $U_w$. We can easily show that $\mathrm{null}(S_w) = \mathrm{span}(U_{w2})$ and that the transformation by $U_{w2}U_{w2}^T$ projects the original data onto $\mathrm{null}(S_w)$. The between-class scatter matrix in the transformed space is $\tilde S_b = U_{w2}U_{w2}^T S_b U_{w2}U_{w2}^T$. Consider the EVD of $\tilde S_b$:

$\tilde S_b = \tilde U_b\tilde\Sigma_b\tilde U_b^T = [\tilde U_{b1}, \tilde U_{b2}]\begin{pmatrix}\tilde\Sigma_{b1} & 0\\ 0 & 0\end{pmatrix}[\tilde U_{b1}, \tilde U_{b2}]^T$,

where $s_2 = \mathrm{rank}(\tilde S_b)$, $\tilde U_{b1} \in R^{m\times s_2}$ and $\tilde\Sigma_{b1} \in R^{s_2\times s_2}$. In NLDA the optimal transformation matrix is $G_e = U_{w2}U_{w2}^T\tilde U_{b1}$.

2.4 Range space LDA

In this subsection we propose another approach to undersampled problems, which we denote RSLDA for short. This method first transforms the original space using a basis of $\mathrm{range}(S_w)$, and then pursues the maximization of the between-class scatter in the transformed space. Let the EVD of $S_w \in R^{m\times m}$ be

$S_w = U_w\Sigma_w U_w^T = [U_{w1}, U_{w2}]\begin{pmatrix}\Sigma_{w1} & 0\\ 0 & 0\end{pmatrix}[U_{w1}, U_{w2}]^T$,

where $s_1 = \mathrm{rank}(S_w)$, $\Sigma_{w1}$ is a diagonal matrix and $U_{w1} \in R^{m\times s_1}$. We can easily show that $\mathrm{range}(S_w) = \mathrm{span}(U_{w1})$ and that the transformation by $V_y = U_{w1}\Sigma_{w1}^{-1/2}$ projects the original data onto $\mathrm{range}(S_w)$. The within-class scatter matrix in the transformed space is $\tilde S_w = V_y^T S_w V_y = I_{s_1}$. Let the EVD of $\tilde S_b = V_y^T S_b V_y$ be

$\tilde S_b = \tilde U_b\tilde\Sigma_b\tilde U_b^T = [\tilde U_{b1}, \tilde U_{b2}]\begin{pmatrix}\tilde\Sigma_{b1} & 0\\ 0 & 0\end{pmatrix}[\tilde U_{b1}, \tilde U_{b2}]^T$,

where $s_3 = \mathrm{rank}(\tilde S_b)$, $\tilde\Sigma_{b1}$ is a diagonal matrix and $\tilde U_{b1} \in R^{s_1\times s_3}$. In RSLDA the optimal transformation matrix is $G_y = V_y\tilde U_{b1} = U_{w1}\Sigma_{w1}^{-1/2}\tilde U_{b1}$.
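Both constructions reduce to two symmetric eigen-decompositions. A hedged NumPy sketch follows; the eigenvalue tolerances used to split the null and range spaces are illustrative assumptions, not part of the paper.

```python
import numpy as np

def nlda(S_w, S_b, tol=1e-10):
    """Null space LDA: maximize between-class scatter inside null(S_w)."""
    w, U = np.linalg.eigh(S_w)
    U_w2 = U[:, w <= tol * w.max()]          # orthonormal basis of null(S_w)
    P = U_w2 @ U_w2.T                        # projector onto null(S_w)
    wb, Ub = np.linalg.eigh(P @ S_b @ P)
    keep = np.argsort(-wb)[: int(np.sum(wb > tol))]   # nonzero eigenvalues only
    return P @ Ub[:, keep]                   # G_e = U_w2 U_w2^T Ub1

def rslda(S_w, S_b, tol=1e-10):
    """Range space LDA: whiten range(S_w), then maximize between-class scatter."""
    w, U = np.linalg.eigh(S_w)
    idx = w > tol * w.max()
    V_y = U[:, idx] / np.sqrt(w[idx])        # V_y = U_w1 Sigma_w1^{-1/2}
    wb, Ub = np.linalg.eigh(V_y.T @ S_b @ V_y)
    keep = np.argsort(-wb)[: int(np.sum(wb > tol))]
    return V_y @ Ub[:, keep]                 # G_y = V_y Ub1
```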

3 Weighted versions and weighting functions

As in [25], $S_b$ can be rewritten as

$S_b = \sum_{i=1}^{r-1}\sum_{j=i+1}^{r}\frac{n_i n_j}{n}(m_i - m_j)(m_i - m_j)^T$.

From this formulation it can be observed that classical LDA maximizes the Euclidean distance between the class means, and that all class pairs receive the same weight irrespective of their separability in the original space. As a result, the transformation preserves the distances of already well-separated classes and can cause a large overlap of neighboring classes in the transformed space. In fact, classes that are closer together are more likely to be confused and should therefore be weighted more heavily. Two similarly motivated solutions to this problem have been proposed: weighted pairwise Fisher criteria [25] and fractional-step LDA [20]; fractional-step LDA is iterative and very time-consuming. In this section, in order to improve classification accuracy, we study weighted versions of the generalized LDA methods reviewed in Section 2. We define a weighted between-class scatter matrix as

$\hat S_b = \sum_{i=1}^{r-1}\sum_{j=i+1}^{r}\frac{n_i n_j}{n} w(d_{ij})(m_i - m_j)(m_i - m_j)^T$, (10)

where $d_{ij} = \|m_i - m_j\|$ is the Euclidean distance between the means of classes $i$ and $j$ and $w(d_{ij})$ is a weighting function. Apparently the weighted between-class scatter matrix $\hat S_b$ degenerates to the conventional between-class scatter matrix $S_b$ if the weighting function in (10) gives a constant weight. If the between-class scatter matrix $S_b$ is replaced by the weighted between-class scatter matrix $\hat S_b$ in the algorithms obtained in Section 2, we get the weighted generalized LDA algorithms.

3.1 Weighted generalized LDA

With Algorithms 2.1 and 2.2 we can derive the weighted PILDA and weighted LDA/GSVD algorithms.

Algorithm 3.1. Weighted PILDA
1. Compute the EVD of $S_t$: $S_t = [U_{t1}, U_{t2}]\begin{pmatrix}\Sigma_{t1} & 0\\ 0 & 0\end{pmatrix}[U_{t1}, U_{t2}]^T$.
2. Compute the weighted between-class scatter matrix $\hat S_b$.
3. Compute $V$ from the EVD of $\check S_b = \Sigma_{t1}^{-1/2}U_{t1}^T\hat S_b U_{t1}\Sigma_{t1}^{-1/2}$: $\check S_b = V\hat\Gamma_b^T\hat\Gamma_b V^T$.
4. Assign the first $r-1$ columns of $U_{t1}\Sigma_{t1}^{-1/2}V$ to $\hat X_\delta$.
5. $\hat G = \hat X_\delta M$, where $M$ is any nonsingular matrix.

Algorithm 3.2. Weighted LDA/GSVD
1. Compute the EVD of $S_t$: $S_t = [U_{t1}, U_{t2}]\begin{pmatrix}\Sigma_{t1} & 0\\ 0 & 0\end{pmatrix}[U_{t1}, U_{t2}]^T$.
2. Compute the weighted between-class scatter matrix $\hat S_b$.
3. Compute $V$ from the EVD of $\check S_b = \Sigma_{t1}^{-1/2}U_{t1}^T\hat S_b U_{t1}\Sigma_{t1}^{-1/2}$: $\check S_b = V\hat\Gamma_b^T\hat\Gamma_b V^T$.
4. Assign the first $r-1$ columns of $U_{t1}\Sigma_{t1}^{-1/2}V$ to $\hat G_h$.

From Algorithms 3.1 and 3.2 we can see that weighted LDA/GSVD is a special case of weighted PILDA with $M$ being the identity matrix. We will discuss the effect of the choice of $M$ on classification accuracy in the next section. Similarly we can obtain weighted NLDA and weighted RSLDA.

Algorithm 3.3. Weighted NLDA
1. Compute the EVD of $S_w$: $S_w = [U_{w1}, U_{w2}]\begin{pmatrix}\Sigma_{w1} & 0\\ 0 & 0\end{pmatrix}[U_{w1}, U_{w2}]^T$.
2. Compute the weighted between-class scatter matrix $\hat S_b$.
3. Compute the EVD of $\tilde S_b = U_{w2}U_{w2}^T\hat S_b U_{w2}U_{w2}^T$: $\tilde S_b = [\tilde U_{b1}, \tilde U_{b2}]\begin{pmatrix}\tilde\Sigma_{b1} & 0\\ 0 & 0\end{pmatrix}[\tilde U_{b1}, \tilde U_{b2}]^T$.
4. $\hat G_e = U_{w2}U_{w2}^T\tilde U_{b1}$.

Algorithm 3.4. Weighted RSLDA
1. Compute the EVD of $S_w$: $S_w = [U_{w1}, U_{w2}]\begin{pmatrix}\Sigma_{w1} & 0\\ 0 & 0\end{pmatrix}[U_{w1}, U_{w2}]^T$.
2. Compute the weighted between-class scatter matrix $\hat S_b$.
3. Compute the EVD of $\tilde S_b = \Sigma_{w1}^{-1/2}U_{w1}^T\hat S_b U_{w1}\Sigma_{w1}^{-1/2}$: $\tilde S_b = [\tilde U_{b1}, \tilde U_{b2}]\begin{pmatrix}\tilde\Sigma_{b1} & 0\\ 0 & 0\end{pmatrix}[\tilde U_{b1}, \tilde U_{b2}]^T$.
4. $\hat G_y = U_{w1}\Sigma_{w1}^{-1/2}\tilde U_{b1}$.
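Equation (10) is straightforward to implement, and the weighted algorithms then reuse the unweighted machinery with $S_b$ replaced by $\hat S_b$. The sketch below assumes the data layout and the nlda/rslda helpers from the earlier snippets; the names are illustrative, not the paper's.

```python
import numpy as np

def weighted_Sb(A, labels, w):
    """Weighted between-class scatter matrix of Eq. (10).

    A has one sample per column, `labels` one class index per column and
    `w` is a weighting function of the distance between class means.
    """
    labels = np.asarray(labels)
    classes = np.unique(labels)
    means = [A[:, labels == k].mean(axis=1, keepdims=True) for k in classes]
    sizes = [int(np.sum(labels == k)) for k in classes]
    n = A.shape[1]
    Sb_hat = np.zeros((A.shape[0], A.shape[0]))
    for i in range(len(classes)):
        for j in range(i + 1, len(classes)):
            diff = means[i] - means[j]
            d_ij = float(np.linalg.norm(diff))
            Sb_hat += sizes[i] * sizes[j] / n * w(d_ij) * (diff @ diff.T)
    return Sb_hat

# Weighted NLDA / weighted RSLDA (Algorithms 3.3 and 3.4) then amount to
# running the earlier helpers on the weighted scatter matrix, e.g.:
#   G_e_hat = nlda(S_w, weighted_Sb(A, labels, w))
#   G_y_hat = rslda(S_w, weighted_Sb(A, labels, w))
```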

3.2 Weighting functions

It can be seen that weighting functions have a close relationship with classification accuracy: different weighting functions produce different classification errors, and selecting a suitable weighting function can increase classification accuracy. In this paper we consider the four weighted schemes, Algorithms 3.1-3.4, with five weighting functions for each weighted scheme. We apply the Euclidean distance $d_{ij} = \|m_i - m_j\|$ between the means of classes $i$ and $j$ in the weighting functions $w(d_{ij})$. A weighting function is generally a monotonically decreasing function, because classes that are closer to one another are more likely to be confused and should be given a greater weight. We first apply two special cases of the weighting function $w(d_{ij}) = (d_{ij})^{-p}$ proposed by Lotlikar et al. in [20], with $p = 1$ and $p = 2$, and then an improved version of the weighting function $w(d_{ij}) = \frac{1}{2d_{ij}^2}\mathrm{erf}\!\left(\frac{d_{ij}}{2\sqrt{2}}\right)$ presented by Loog in [25], where the Mahalanobis distance is replaced by the Euclidean distance. In addition, according to the features of the weighting functions mentioned above, we present two new weighting functions. The five functions are listed below:

$w_1$: $w(d_{ij}) = (d_{ij})^{-2}$
$w_2$: $w(d_{ij}) = \frac{1}{2d_{ij}^2}\mathrm{erf}\!\left(\frac{d_{ij}}{2\sqrt{2}}\right)$
$w_3$: $w(d_{ij}) = (d_{ij})^{-1}$
$w_4$: $w(d_{ij}) = e^{1/d_{ij}}$
$w_5$: $w(d_{ij}) = 1/e^{d_{ij}}$

4 Experiments and analysis

In this section, in order to demonstrate the effectiveness of the proposed methods and to illustrate the effect of the weighting functions and of the nonsingular matrix $M$ in PILDA and weighted PILDA on classification accuracy, we conduct a series of experiments on five different data sets drawn from the AT&T face database and the Yale face database. There are ten different images of 40 distinct subjects in the AT&T (also called ORL) face image database. For some subjects the images were taken at different times, with varying lighting, facial details and facial expressions. Each image has 256 grey levels per pixel. The Yale face database contains 165 face images of 15 individuals. There are 11 images per subject, taken under the following different facial expressions or configurations: center-light, wearing glasses, happy, left-light, wearing no glasses, normal, right-light, sad, sleepy, surprised and wink. In our experiments the images are cropped and the grey level values of all images are rescaled to [0, 1]. Some images of one person are shown in Figure 1.

Figure 1: Images of one person in Yale.

Data sets 1-3 are from the Yale face database. Data set 1 uses the 201st to the 600th dimension; it contains 15 classes and 165 examples with 400 dimensions. Data set 2 uses the 401st to the 900th dimension; it contains 15 classes and 165 examples with 500 dimensions. Data set 3 uses the 800th to the 1024th dimension; it contains 15 classes and 165 examples with 225 dimensions. Data sets 4-5 are from the AT&T face database. Data set 4 uses the first 150 samples and the first 350 dimensions; it contains 15 classes and 150 examples with 350 dimensions. Data set 5 uses the first 150 samples and the 333rd to the 665th dimension; it contains 15 classes and 150 examples with 333 dimensions. For all five data sets the number of data samples is smaller than the dimension of the data space, so all scatter matrices are singular. In the following experiments the KNN algorithm with K = 7 is used as the classifier for all data sets; a sketch combining the weighted between-class scatter matrix, the weighting functions above and this 7-NN evaluation protocol is given below.
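The five weighting functions and the evaluation protocol (7-NN with 5-fold cross-validation) can be sketched as follows; SciPy and scikit-learn are assumed to be available, and the exact forms of $w_4$ and $w_5$ follow the reconstruction given above.

```python
import numpy as np
from scipy.special import erf
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# The five weighting functions of Subsection 3.2, as functions of d_ij > 0.
WEIGHTS = {
    "w1": lambda d: d ** -2.0,
    "w2": lambda d: erf(d / (2.0 * np.sqrt(2.0))) / (2.0 * d ** 2),
    "w3": lambda d: d ** -1.0,
    "w4": lambda d: np.exp(1.0 / d),
    "w5": lambda d: np.exp(-d),   # underflows to 0 for large d_ij (data sets 4-5)
}

def misclassification_rate(A, labels, G, k=7, folds=5):
    """Mean cross-validated misclassification rate (%) of k-NN on the
    data projected by the transformation G."""
    Y = (G.T @ A).T                                   # one row per projected sample
    knn = KNeighborsClassifier(n_neighbors=k)
    accuracy = cross_val_score(knn, Y, labels, cv=folds).mean()
    return 100.0 * (1.0 - accuracy)
```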
For each method we applied 5-fold cross-validation to compute the misclassification rate, and experiments were repeated 5 times to obtain the mean misclassification rate.

4.1 Effect of the matrix M

In this subsection we study the effect of the matrix $M$ in PILDA on the classification accuracy for the five data sets. We randomly generate five nonsingular matrices $M_1, \ldots, M_5$ for $M$ and compute the misclassification rates using the optimal transformation matrices produced by PILDA and the 7-NN classifier. The experimental results are listed in Table 1, where $w_0$ denotes the constant weighting function $w(d_{ij}) = 1$ (the unweighted case).

Table 1: Misclassification rate (%) on the five data sets, for W-PILDA with the five random matrices $M_1$-$M_5$, W-LDA/GSVD, W-NLDA and W-RSLDA, under the weighting functions $w_0$ (unweighted) and $w_1$-$w_5$, grouped by data sets 1-5 (the numerical entries are not recoverable from the extracted text).

From Table 1 we can see that matrix $M_3$ produces the best classification accuracy for data set 1; none of the accuracies is higher than that of matrix $M_4$ for data set 2; matrix $M_4$ produces the best classification accuracy among the PILDA variants for data set 3, but it is not higher than that of RSLDA; matrix $M_1$ produces the best classification accuracy among the PILDA variants for data set 4, but it is not higher than that of RSLDA; and matrix $M_5$ produces the best classification accuracy among the PILDA variants for data set 5, but it is not higher than those of NLDA and RSLDA.

4.2 Effect of weighting functions

In this subsection we study the effect of the five weighting functions given in Subsection 3.2 on the classification accuracy for the five data sets. The experimental results are listed in Table 1.

Figure 2: The misclassification rates (%) for Data set 1.

Figure 3: The misclassification rates (%) for Data set 2.

From Figure 2 we can see that the weighting function $w_2$ is better than the other weighting functions for data set 1; in particular, we obtain the best classification accuracy with $M_3$ and $w_2$. Figure 3 shows that, for data set 2, the weighting function $w_3$ produces good overall results for PILDA, while the weighting function $w_2$ produces good overall results for LDA/GSVD, NLDA and RSLDA. For data set 3, the weighting function $w_4$ produces good overall results for PILDA and LDA/GSVD, the weighting function $w_2$ produces a good overall result for NLDA, and the weighting functions $w_3$ and $w_5$ produce the best results for RSLDA. For data sets 4 and 5 the weighting function $w_5$ cannot be applied, and the weighting functions $w_1$ and $w_2$ produce the same results. For data set 4 and $M_1$, $M_2$, the best weighting functions for LDA/GSVD, NLDA and RSLDA are $w_1$ and $w_2$; the weighting function $w_3$ produces the best result for $M_3$ and $M_4$; the best weighting function for $M_5$ is $w_4$. For data set 5 and $M_1$, $M_3$, $M_4$, the best weighting functions for LDA/GSVD and RSLDA are $w_1$ and $w_2$; the weighting function $w_3$ produces the best results for $M_2$, $M_5$ and NLDA.

5 Conclusion

In this paper, based on pseudo-inverse LDA, LDA/GSVD, null space LDA and range space LDA, we propose weighted pseudo-inverse LDA, weighted LDA/GSVD, weighted null space LDA and weighted range space LDA. Not only can these methods deal with the singularity problem caused by the undersampled problem, they also make the criteria more directly related to classification error. In order to demonstrate the effectiveness of the proposed methods and to illustrate the effect of the weighting functions and of the nonsingular matrix $M$ in PILDA and weighted PILDA on classification accuracy, we conducted a series of experiments on five different data sets from the AT&T face database and the Yale face database. The results show that different weighting functions and different nonsingular matrices $M$ affect the classification accuracy of the proposed methods.

Acknowledgements: This work is supported by the National Natural Science Foundation of China ( ).

References:

[1] K. Fukunaga, Introduction to Statistical Pattern Recognition, second ed., Academic Press, 1990.
[2] M. W. Berry, S. T. Dumais and G. W. O'Brien, Using linear algebra for intelligent information retrieval, SIAM Review 37, 1995.
[3] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer and R. Harshman, Indexing by latent semantic analysis, Journal of the American Society for Information Science 41, 1990.
[4] P. N. Belhumeur, J. P. Hespanha and D. J. Kriegman, Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection, IEEE Transactions on Pattern Analysis and Machine Intelligence 19 (7), 1997.
[5] D. L. Swets and J. Weng, Using discriminant eigenfeatures for image retrieval, IEEE Transactions on Pattern Analysis and Machine Intelligence 18 (8), 1996.
[6] M. A. Turk and A. P. Pentland, Face recognition using Eigenfaces, in: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1991.
[7] S. Dudoit, J. Fridlyand and T. P. Speed, Comparison of discrimination methods for the classification of tumors using gene expression data, Journal of the American Statistical Association 97 (457), 2002.
[8] B. Cao, D. Shen, J. Sun, Q. Yang and Z. Chen, Feature selection in a kernel space, in: International Conference on Machine Learning (ICML), Oregon, USA, June 20-24.
[9] K. Fukunaga, Introduction to Statistical Pattern Classification, Academic Press, San Diego, California, USA.
[10] J. Ye, R. Janardan, C. H. Park and H. Park, An optimization criterion for generalized discriminant analysis on undersampled problems, IEEE Transactions on Pattern Analysis and Machine Intelligence 26 (8), 2004.
[11] J. Ye, Characterization of a family of algorithms for generalized discriminant analysis on undersampled problems, Journal of Machine Learning Research 6, 2005.
[12] C. H. Park and H. Park, A comparison of generalized linear discriminant analysis algorithms, Pattern Recognition 41, 2008.
[13] P. Howland and H. Park, Generalizing discriminant analysis using the generalized singular value decomposition, IEEE Transactions on Pattern Analysis and Machine Intelligence 26 (8), 2004.
[14] P. Howland, M. Jeon and H. Park, Structure preserving dimension reduction for clustered text data based on the generalized singular value decomposition, SIAM J. Matrix Anal. Appl. 25 (1), 2003.
[15] L. Chen, H. M. Liao, M. Ko, J. Lin and G. Yu, A new LDA-based face recognition system which can solve the small sample size problem, Pattern Recognition 33, 2000.
[16] H. Yu and J. Yang, A direct LDA algorithm for high-dimensional data with application to face recognition, Pattern Recognition 34, 2001.
[17] J. Ye and T. Xiong, Computational and theoretical analysis of null space and orthogonal linear discriminant analysis, Journal of Machine Learning Research 7, 2006.
[18] J. Ye, R. Janardan, Q. Li and H. Park, Feature extraction via generalized uncorrelated linear discriminant analysis, in: Proc. International Conference on Machine Learning, 2004.
[19] J. H. Friedman, Regularized discriminant analysis, Journal of the American Statistical Association 84 (405), 1989.
[20] R. Lotlikar and R. Kothari, Fractional-step dimensionality reduction, IEEE Transactions on Pattern Analysis and Machine Intelligence 22 (6), 2000.
[21] Y. Liang et al., Uncorrelated linear discriminant analysis based on weighted pairwise Fisher criterion, Pattern Recognition 40, 2007.
[22] R. O. Duda, P. E. Hart and D. Stork, Pattern Classification, Wiley.
[23] G. H. Golub and C. F. Van Loan, Matrix Computations, third ed., Johns Hopkins Univ. Press.
[24] P. Howland and H. Park, Equivalence of several two-stage methods for linear discriminant analysis, in: Proceedings of the Fourth SIAM International Conference on Data Mining, April 2004.
[25] M. Loog, R. P. W. Duin and R. Haeb-Umbach, Multiclass linear dimension reduction by weighted pairwise Fisher criteria, IEEE Transactions on Pattern Analysis and Machine Intelligence 23 (7), 2001.
[26] Jian-qiang Gao, Li-ya Fan and Li-zhong Xu, Median null(S_w)-based method for face feature recognition, Applied Mathematics and Computation.
[27] Lishan Qiao, Songcan Chen and Xiaoyang Tan, Sparsity preserving projections with applications to face recognition, Pattern Recognition 43 (1), 2010.


EIGENVALE PROBLEMS AND THE SVD. [5.1 TO 5.3 & 7.4] EIGENVALE PROBLEMS AND THE SVD. [5.1 TO 5.3 & 7.4] Eigenvalue Problems. Introduction Let A an n n real nonsymmetric matrix. The eigenvalue problem: Au = λu λ C : eigenvalue u C n : eigenvector Example:

More information

Principal Component Analysis (PCA)

Principal Component Analysis (PCA) Principal Component Analysis (PCA) Additional reading can be found from non-assessed exercises (week 8) in this course unit teaching page. Textbooks: Sect. 6.3 in [1] and Ch. 12 in [2] Outline Introduction

More information

Foundations of Computer Vision

Foundations of Computer Vision Foundations of Computer Vision Wesley. E. Snyder North Carolina State University Hairong Qi University of Tennessee, Knoxville Last Edited February 8, 2017 1 3.2. A BRIEF REVIEW OF LINEAR ALGEBRA Apply

More information

Dimensionality Reduction: PCA. Nicholas Ruozzi University of Texas at Dallas

Dimensionality Reduction: PCA. Nicholas Ruozzi University of Texas at Dallas Dimensionality Reduction: PCA Nicholas Ruozzi University of Texas at Dallas Eigenvalues λ is an eigenvalue of a matrix A R n n if the linear system Ax = λx has at least one non-zero solution If Ax = λx

More information

Linear Discriminant Analysis Using Rotational Invariant L 1 Norm

Linear Discriminant Analysis Using Rotational Invariant L 1 Norm Linear Discriminant Analysis Using Rotational Invariant L 1 Norm Xi Li 1a, Weiming Hu 2a, Hanzi Wang 3b, Zhongfei Zhang 4c a National Laboratory of Pattern Recognition, CASIA, Beijing, China b University

More information

Properties of Matrices and Operations on Matrices

Properties of Matrices and Operations on Matrices Properties of Matrices and Operations on Matrices A common data structure for statistical analysis is a rectangular array or matris. Rows represent individual observational units, or just observations,

More information

Hierarchical Linear Discriminant Analysis for Beamforming

Hierarchical Linear Discriminant Analysis for Beamforming Abstract Hierarchical Linear Discriminant Analysis for Beamforming Jaegul Choo, Barry L. Drake, and Haesun Park This paper demonstrates the applicability of the recently proposed supervised dimension reduction,

More information

Image Analysis & Retrieval Lec 14 - Eigenface & Fisherface

Image Analysis & Retrieval Lec 14 - Eigenface & Fisherface CS/EE 5590 / ENG 401 Special Topics, Spring 2018 Image Analysis & Retrieval Lec 14 - Eigenface & Fisherface Zhu Li Dept of CSEE, UMKC http://l.web.umkc.edu/lizhu Office Hour: Tue/Thr 2:30-4pm@FH560E, Contact:

More information

Manning & Schuetze, FSNLP, (c)

Manning & Schuetze, FSNLP, (c) page 554 554 15 Topics in Information Retrieval co-occurrence Latent Semantic Indexing Term 1 Term 2 Term 3 Term 4 Query user interface Document 1 user interface HCI interaction Document 2 HCI interaction

More information

Adaptive Kernel Principal Component Analysis With Unsupervised Learning of Kernels

Adaptive Kernel Principal Component Analysis With Unsupervised Learning of Kernels Adaptive Kernel Principal Component Analysis With Unsupervised Learning of Kernels Daoqiang Zhang Zhi-Hua Zhou National Laboratory for Novel Software Technology Nanjing University, Nanjing 2193, China

More information

EIGENVALUES AND SINGULAR VALUE DECOMPOSITION

EIGENVALUES AND SINGULAR VALUE DECOMPOSITION APPENDIX B EIGENVALUES AND SINGULAR VALUE DECOMPOSITION B.1 LINEAR EQUATIONS AND INVERSES Problems of linear estimation can be written in terms of a linear matrix equation whose solution provides the required

More information

LEARNING from multiple feature sets, which is also

LEARNING from multiple feature sets, which is also Multi-view Uncorrelated Linear Discriminant Analysis with Applications to Handwritten Digit Recognition Mo Yang and Shiliang Sun Abstract Learning from multiple feature sets which is also called multi-view

More information

Mathematical foundations - linear algebra

Mathematical foundations - linear algebra Mathematical foundations - linear algebra Andrea Passerini passerini@disi.unitn.it Machine Learning Vector space Definition (over reals) A set X is called a vector space over IR if addition and scalar

More information

Data Mining Techniques

Data Mining Techniques Data Mining Techniques CS 6220 - Section 3 - Fall 2016 Lecture 12 Jan-Willem van de Meent (credit: Yijun Zhao, Percy Liang) DIMENSIONALITY REDUCTION Borrowing from: Percy Liang (Stanford) Linear Dimensionality

More information

THE SINGULAR VALUE DECOMPOSITION MARKUS GRASMAIR

THE SINGULAR VALUE DECOMPOSITION MARKUS GRASMAIR THE SINGULAR VALUE DECOMPOSITION MARKUS GRASMAIR 1. Definition Existence Theorem 1. Assume that A R m n. Then there exist orthogonal matrices U R m m V R n n, values σ 1 σ 2... σ p 0 with p = min{m, n},

More information

Discriminant Kernels based Support Vector Machine

Discriminant Kernels based Support Vector Machine Discriminant Kernels based Support Vector Machine Akinori Hidaka Tokyo Denki University Takio Kurita Hiroshima University Abstract Recently the kernel discriminant analysis (KDA) has been successfully

More information

Linear Subspace Models

Linear Subspace Models Linear Subspace Models Goal: Explore linear models of a data set. Motivation: A central question in vision concerns how we represent a collection of data vectors. The data vectors may be rasterized images,

More information

Multiple Similarities Based Kernel Subspace Learning for Image Classification

Multiple Similarities Based Kernel Subspace Learning for Image Classification Multiple Similarities Based Kernel Subspace Learning for Image Classification Wang Yan, Qingshan Liu, Hanqing Lu, and Songde Ma National Laboratory of Pattern Recognition, Institute of Automation, Chinese

More information

Template-based Recognition of Static Sitting Postures

Template-based Recognition of Static Sitting Postures Template-based Recognition of Static Sitting Postures Manli Zhu 1, Aleix M. Martínez 1 and Hong Z. Tan 2 1 Dept. of Electrical Engineering The Ohio State University 2 School of Electrical and Computer

More information

Background Mathematics (2/2) 1. David Barber

Background Mathematics (2/2) 1. David Barber Background Mathematics (2/2) 1 David Barber University College London Modified by Samson Cheung (sccheung@ieee.org) 1 These slides accompany the book Bayesian Reasoning and Machine Learning. The book and

More information