
Neurocomputing 154 (2015)

Feature extraction using adaptive slow feature discriminant analysis

Xingjian Gu a, Chuancai Liu a,*, Sheng Wang a, Cairong Zhao b

a School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China
b Department of Computer Science and Technology, Tongji University, Shanghai, China
* Corresponding author: Chuancai Liu (chcailiu@163.com).

Article history: Received 9 June 2014; Received in revised form 6 November 2014; Accepted 2 December 2014; Communicated by Y. Yuan; Available online 16 December 2014.

Keywords: Feature extraction; Slow feature discriminant analysis; Time series; Adaptive parameter

Abstract: Slow feature discriminant analysis (SFDA) is an attractive, biologically inspired learning method for extracting discriminant features for classification. However, SFDA relies heavily on the constructed time series, and because the type of data distribution is unknown, it cannot make full use of its discriminant power for classification. To address these problems, we propose a new feature extraction method called adaptive slow feature discriminant analysis (ASFDA). First, we design a new adaptive criterion to generate within-class time series. The constructed time series have two properties: (1) the two points of a time series lie on the same sub-manifold, and (2) that sub-manifold is smooth. Second, ASFDA seeks projections that simultaneously minimize within-class temporal variation and maximize between-class temporal variation based on the maximum margin criterion, and it provides an adaptive parameter to balance the two temporal variations so as to obtain an optimal discriminant subspace. Experimental results on three benchmark face databases demonstrate that the proposed ASFDA is superior to several state-of-the-art methods. (c) 2014 Elsevier B.V. All rights reserved.

1. Introduction

Feature extraction is a fundamental and challenging problem in many research fields such as pattern recognition and machine learning. Principal component analysis (PCA) [1] and linear discriminant analysis (LDA) [2] are the two most well-known methods for linear feature extraction. To the best of our knowledge, linear feature extraction methods are unable to discover essential data structures that are nonlinear. Recent studies [3-6] have shown that large volumes of high-dimensional data possibly reside on a nonlinear manifold. To discover the nonlinear manifold structure of the data, many manifold learning methods have been put forward. Representative manifold learning methods include Isomap [4], LLE [3], LE [5] and LTSA [6]. Isomap preserves the pairwise geodesic distances of observations in the embedding space. LLE focuses on the local neighborhood of each data point and preserves the minimum-error linear reconstruction from neighbors in the embedding space. LE is built on the Laplace-Beltrami operator on the manifold. LTSA first encodes the local geometry of each local tangent space and then aligns all the local tangent spaces to obtain a global embedding. However, these manifold learning methods obtain a low-dimensional embedding without an explicit mapping, so they cannot extract features for samples beyond the training set. To overcome this problem, NPE [7] tries to find a linear subspace that preserves local structure based on the same principle as LLE, and LPP [8] seeks a linear subspace to approximate the nonlinear Laplacian Eigenmap.
LLTSA [9] seeks linear projections that approximate the affine transformation of LTSA. In order to extract discriminant features for classification, several nonlinear manifold learning methods have emerged [10-15]. Yu et al. [10] present discriminant locality preserving projections (DLPP) to improve the classification performance of LPP. To overcome the small sample size (SSS) problem in LPP, Lu et al. [11] propose discriminant locality preserving projections based on the maximum margin criterion (MMC) [16] rather than a ratio criterion. Yan et al. [12] propose marginal Fisher analysis (MFA) and Chen et al. [13] propose local discriminant embedding (LDE). MFA and LDE are very similar in formulation: both combine locality and class label information to represent within-class compactness and between-class separability. For these methods, it is difficult to determine the number of nearest neighbors of each sample and the number of shortest pairs from different classes. To address the problem of local neighborhood size, Zhang et al. [17] propose a method that selects the local size adaptively. In order to utilize nonlocal information, Zhao et al. [18] propose graph embedding discriminant analysis (GEDA), which not only compacts the within-class samples and maximizes the between-class margin, but also maximizes the nonlocal scatter at the same time. Recently, other interesting feature extraction models inspired by biological mechanisms have appeared, such as the sparse learning model [19], saliency-based visual attention [20] and temporal slowness learning [21].

The temporal slowness principle has been successfully applied to model the visual receptive fields of cortical neurons [22]. Based on the slowness principle, Wiskott and Sejnowski [21] propose a nonlinear unsupervised algorithm called slow feature analysis (SFA) to learn invariant and slowly varying features from quickly varying input signals. SFA has successfully extracted a rich set of complex-cell features when trained on quasi-natural image sequences [23], and it has found many applications in computational neuroscience [24,25] and time series analysis [26,27]. Several researchers have introduced the slowness principle into pattern recognition applications [28-32]. Zhang and Tao [28] successfully introduced the SFA framework to the problem of human action recognition. To the best of our knowledge, SFA performs well on data sets with a temporal structure. In real applications, however, there are numerous discrete data sets with no obvious temporal structure; in the discrete scenario, it is necessary to construct time series before applying SFA. The authors of [29] propose supervised slow feature analysis based on a consensus matrix (SSFACM) to construct time series for face recognition. In [30], the authors propose another supervised variant that seeks the shortest path through the samples of each class (SSFASP) to construct time series for dimensionality reduction. Huang et al. [31,32] use a KNN criterion to construct time series and introduce supervised slow feature analysis (SSFA) for nonlinear dimensionality reduction. In order to obtain discriminant slow features, they also propose slow feature discriminant analysis (SFDA) [31], which minimizes within-class temporal variation and maximizes between-class temporal variation simultaneously. From the viewpoint of manifold learning, SFDA aims to find a mapping that minimizes the distance between within-class points in the low-dimensional space while keeping between-class points as far apart as possible.

However, SFDA has two key issues. The first is the notion of which pairs of points can be considered within-class or between-class time series; this is crucial for characterizing sub-manifold and multi-manifold information, respectively. In the literature [3,4,31,32], there are two common strategies for selecting within-class time series: the k-nearest-neighborhood (k-nn) and the ε-neighborhood (ε-n). Both have distinct disadvantages. It is difficult to choose a suitable parameter k (or ε), because the distribution of each class does not always share the same scatter. According to the literature [33], if k (or ε) is set relatively large, the k-nn (or ε-n) criterion tends to include noisy time series; if it is set relatively small, some local information can be lost. Thus, how accurately the time series can be approximated is pivotal in the SFA framework. The second issue is how to balance within-class and between-class temporal variation to obtain an optimal discriminant subspace. According to the literature [34], when there is a conflict between within-class temporal variation and between-class temporal variation, it is difficult to know whether the projection given by the within-class term or that given by the between-class term is better for classification.
To deal with this problem, subclass discriminant analysis [35,36] was proposed, which divides each class into several subclasses; however, it is difficult to determine the number of subclasses. In order to address these issues, we propose in this paper a novel dimensionality reduction method called adaptive slow feature discriminant analysis (ASFDA) to improve classification performance. First, a new adaptive criterion is designed to generate time series before ASFDA is applied. It is well known that a pair of sample points that can be considered a time series lies on a smooth sub-manifold. Inspired by [17], we develop a new adaptive criterion to generate within-class time series. As Fig. 1 illustrates, the constructed time series satisfy two requirements: (1) the points of a time series are nearby in terms of Euclidean distance, and (2) the points of a time series lie in the same principal direction of the local neighborhood. We construct between-class time series using between-class neighboring information to characterize the margin information. Second, to enhance classification, ASFDA seeks projections that minimize the difference, rather than the ratio, between within-class and between-class temporal variation, based on the idea of the maximum margin criterion (MMC) [37]. Furthermore, ASFDA provides an adaptive parameter to balance the within-class and between-class temporal variation so as to maximize the discriminant power. Extensive experiments on three benchmark face databases show the effectiveness of the proposed ASFDA.

Fig. 1. Geometric analysis of two different time series construction criteria. Left: the KNN criterion; points (A, D) can be considered a short time series in terms of Euclidean distance, but they do not lie in the same principal direction, which may deform the manifold structure. Right: the proposed criterion; points (A, B) and (A, C) each form a time series because they are both close in Euclidean distance and lie in the same principal direction.

The rest of the paper is organized as follows. In Section 2, we briefly review MMC and SFDA. In Section 3, we introduce the motivations of adaptive slow feature discriminant analysis (ASFDA) and describe it in detail. In Section 4, experiments with face image databases are carried out to demonstrate the effectiveness of the proposed method. Finally, conclusions are drawn in Section 5.

2. A brief review of MMC and SFDA

Given a sample set $X = \{x_1, x_2, \ldots, x_N\} \in \mathbb{R}^{D \times N}$, each sample belongs to one of $c$ classes $\{X_1, X_2, \ldots, X_c\}$. Let $c$ denote the total number of classes and $N_i$ the number of training samples in the $i$th class. Let $x_i^j$ denote the $j$th sample in the $i$th class, $\bar{x}$ the mean of all training samples, and $\bar{x}_i$ the mean of the $i$th class.

2.1. MMC

MMC is a classical supervised learning algorithm for feature extraction and classification. The between-class and within-class scatter matrices are evaluated as follows:

$S_b = \sum_{i=1}^{c} N_i (\bar{x}_i - \bar{x})(\bar{x}_i - \bar{x})^T$   (1)

$S_w = \sum_{i=1}^{c} \sum_{j=1}^{N_i} (x_i^j - \bar{x}_i)(x_i^j - \bar{x}_i)^T$   (2)

The MMC-based discriminant rule is defined as follows, based on the difference between the between-class and within-class scatter matrices:

$W^* = \arg\max_W \operatorname{tr}(W^T (S_b - S_w) W)$   (3)

The optimization of Eq. (3) can be solved by the eigenvalue problem $(S_b - S_w) w = \lambda w$, and the optimal projections are the eigenvectors $w_1, w_2, \ldots, w_d$ corresponding to the $d$ largest eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_d$.
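For illustration, the short sketch below (our own, not code from the paper) computes the MMC scatter matrices of Eqs. (1)-(2) with NumPy and takes the leading eigenvectors of $S_b - S_w$ as in Eq. (3). Function and variable names are our own choices.

```python
import numpy as np

def mmc(X, y, d):
    """X: (D, N) data matrix, y: (N,) integer labels, d: subspace dimension."""
    mean_all = X.mean(axis=1, keepdims=True)
    S_b = np.zeros((X.shape[0], X.shape[0]))
    S_w = np.zeros_like(S_b)
    for c in np.unique(y):
        Xc = X[:, y == c]
        mean_c = Xc.mean(axis=1, keepdims=True)
        # between-class scatter: class size times outer product of mean offsets
        S_b += Xc.shape[1] * (mean_c - mean_all) @ (mean_c - mean_all).T
        # within-class scatter: sum of outer products of centered samples
        diff = Xc - mean_c
        S_w += diff @ diff.T
    # S_b - S_w is symmetric, so eigh applies; the eigenvectors of the d largest
    # eigenvalues maximize tr(W^T (S_b - S_w) W) under orthonormality.
    vals, vecs = np.linalg.eigh(S_b - S_w)
    return vecs[:, np.argsort(vals)[::-1][:d]]   # D x d projection matrix W
```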

2.2. SFDA

In this section, we review slow feature discriminant analysis (SFDA) [31] as used for discrete data that do not have an obvious temporal structure. It first constructs within-class time series $t_w$ and between-class time series $t_b$ using neighboring information:

$t_w = \{(x_i^p, x_i^q)\}, \quad i = 1, 2, \ldots, c; \; p \neq q; \; p, q = 1, 2, \ldots, N_i$   (4)

where $x_i^p$ and $x_i^q$ belong to the $i$th class, and

$t_b = \{(x_i^p, x_j^q)\}, \quad i \neq j; \; i, j = 1, 2, \ldots, c; \; p = 1, 2, \ldots, N_i; \; q = 1, 2, \ldots, N_j$   (5)

where $x_i^p$ and $x_j^q$ belong to different classes. Based on the sets of time series $t_w$ and $t_b$, the temporal variations $\Delta t_w$ and $\Delta t_b$ are approximated by the time differences $\Delta t_w = \{x_i^p - x_i^q\}$ for $(x_i^p, x_i^q) \in t_w$ and $\Delta t_b = \{x_i^p - x_j^q\}$ for $(x_i^p, x_j^q) \in t_b$. The model of SFDA is

$\arg\min_w \frac{w^T T_w w}{w^T T_b w}$   (6)

where $T_w = \Delta t_w \Delta t_w^T$ and $T_b = \Delta t_b \Delta t_b^T$. The optimization of Eq. (6) can be solved by the generalized eigenvalue problem $T_w w = \lambda T_b w$, and the optimal projections are the eigenvectors $w_1, w_2, \ldots, w_d$ corresponding to the $d$ smallest eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_d$.
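A minimal sketch of this baseline (ours, written under the description above) is given below: it builds $T_w$ and $T_b$ from user-supplied index pairs and solves the generalized eigenproblem of Eq. (6). The small ridge added to $T_b$ is our own choice to keep it numerically invertible; it is not part of the paper.

```python
import numpy as np
from scipy.linalg import eigh

def sfda_ratio(X, within_pairs, between_pairs, d, ridge=1e-8):
    """X: (D, N); *_pairs: lists of (p, q) column indices; d: subspace dim."""
    dw = np.stack([X[:, p] - X[:, q] for p, q in within_pairs], axis=1)
    db = np.stack([X[:, p] - X[:, q] for p, q in between_pairs], axis=1)
    T_w = dw @ dw.T                       # within-class temporal variation
    T_b = db @ db.T                       # between-class temporal variation
    T_b += ridge * np.eye(T_b.shape[0])   # assumption: small ridge keeps T_b definite
    vals, vecs = eigh(T_w, T_b)           # generalized eigenvalues, ascending
    return vecs[:, :d]                    # projections with smallest slowness ratio
```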
3. Adaptive slow feature discriminant analysis

3.1. Motivations of ASFDA

The goal of ASFDA is to extract discriminant slow features for classification by minimizing the temporal variation of within-class time series and maximizing the temporal variation of between-class time series simultaneously. Fig. 2 provides an intuitive illustration of the idea of ASFDA. For discrete data sets that do not have an obvious temporal structure, ASFDA relies heavily on how accurately the time series can be approximated. The selected within-class time series should reflect the local geometric structure of the sub-manifold, and, for the purpose of classification, the selected between-class time series should reflect the local margin information between the manifolds. On the other hand, according to the literature [34], there usually exists a conflict between the projections that minimize within-class temporal variation and the projections that maximize between-class temporal variation. To address this problem, we introduce an adaptive parameter to balance the two temporal variations so as to maximize the discriminant capability. In total, ASFDA consists of two steps: (1) characterizing the within-class and between-class temporal variation, and (2) integrating the two temporal variations using the maximum margin criterion to learn an optimal discriminant subspace for classification.

Fig. 2. An intuitive illustration of the idea of ASFDA. Points with different shapes belong to different classes. (a) The sample points in the original space and the margins between different classes. (b) The sample points in the projected space, in which each class varies slowly and a larger margin is obtained.

3.2. Within-class time series selection

What can be considered a within-class time series in slow feature discriminant analysis is equivalent to how to select a neighbor set that reflects the local manifold structure. Inspired by the literature [17], we develop a new criterion that satisfies the following requirement: the selected time series for each sample point $x_i$ should reflect the nearby relationship to the point $x_i$. The selected time series have two properties: (1) the two points of a time series lie on the same sub-manifold, and (2) the sub-manifold of the time series is smooth.

3.2.1. Notion of time series

Given a data set $X = [x_1, x_2, \ldots, x_N] \in \mathbb{R}^{D \times N}$ sampled from an $r$-dimensional smooth sub-manifold $x = f(\tau)$, where $x \in \mathbb{R}^D$, $\tau \in \mathbb{R}^r$, $f: \Omega \subset \mathbb{R}^r \to \mathbb{R}^D$ and $\Omega$ is an open connected subset. Assuming that two points $x_i$ and $x_j$ can be considered a short time series, $x_j$ can be accurately represented by the Taylor expansion $x_j = x_i + J_{\tau_i}(\tau_j - \tau_i) + \varepsilon(\tau_j - \tau_i)$, where $J_{\tau_i} \in \mathbb{R}^{D \times r}$ is the Jacobian matrix of $f$ at $\tau_i$ and $\varepsilon(\tau_j - \tau_i)$ is the second-order term in $\tau_j - \tau_i$. Since the sub-manifold is smooth, it is guaranteed that $\|\varepsilon(\tau_j - \tau_i)\| < \eta \|\tau_j - \tau_i\|$ for a small constant $\eta \in (0, 1)$. Based on this, a short time series $(x_i, x_j)$ should satisfy the following criterion:

$\|x_j - x_i - J_{\tau_i}(\tau_j - \tau_i)\| < \eta \|\tau_j - \tau_i\|$   (7)

According to the literature [6], $J_{\tau_i}(\tau_j - \tau_i)$ can be estimated by $Q_i \theta_i^j$ and $\tau_j - \tau_i$ can be estimated by $\theta_i^j$, where $\theta_i^j = Q_i^T (x_j - x_i)$ is the local coordinate and $Q_i$ is a set of local orthogonal bases obtained by the singular value decomposition [38]. Thus, Eq. (7) can be written as

$\|x_j - x_i - Q_i \theta_i^j\| < \eta \|\theta_i^j\|$   (8)

The matrix form of Eq. (8) is

$\|X_i - (x_i e^T + Q_i \Theta_i)\|_F < \eta \|\Theta_i\|_F$   (9)

where $e$ is a column vector of all ones, $X_i = \{x_i^j\}_{j=1}^{k_i}$ collects the points that can form a short time series with $x_i$, and $\Theta_i = \{\theta_i^j\}_{j=1}^{k_i}$ is the local coordinate matrix corresponding to $X_i$. According to the properties of the singular value decomposition [38], these quantities can be expressed in terms of the singular values $\sigma_1 \ge \cdots \ge \sigma_r \ge \cdots \ge \sigma_{n_i} > 0$ of $X_i - x_i e^T$: $\|X_i - x_i e^T\|_F^2 = \sum_{l=1}^{n_i} \sigma_l^2$, $\|\Theta_i\|_F^2 = \|Q_i \Theta_i\|_F^2 = \sum_{l=1}^{r} \sigma_l^2$ and $\|X_i - (x_i e^T + Q_i \Theta_i)\|_F^2 = \sum_{l=r+1}^{n_i} \sigma_l^2$. Thus, Eq. (9) can be rewritten as

$\sum_{l=1}^{r} \sigma_l^2 + \sum_{l=r+1}^{n_i} \sigma_l^2 < (1 + \eta) \sum_{l=1}^{r} \sigma_l^2$   (10)

From the above analysis, the criterion for determining the time series set can be summarized as

$\frac{\sum_{l=1}^{r} \sigma_l^2}{\sum_{l=1}^{n_i} \sigma_l^2} > \beta$   (11)

where $r$ is the dimension of the local manifold, $n_i$ is the number of nonzero singular values of $X_i - x_i e^T$ and $\beta = 1/(1 + \eta)$, $\beta \in [0, 1]$. From Eq. (7) we can see that the smaller $\eta$ is, the closer $x_i$ and $x_j$ are; equivalently, the larger $\beta$ is, the closer the points of $X_i$ are to $x_i$. From Eq. (11) we can see that the parameter $\beta$ also measures the PCA energy captured by the first $r$ directions. In summary, if the value of $\beta$ is relatively large, the points in $X_i$ and the point $x_i$ lie along the same principal directions.
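The criterion of Eq. (11) is easy to evaluate numerically. The sketch below (our illustration under the definitions above; names are ours) computes the ratio of the PCA energy captured by the first $r$ principal directions of the centered neighbor set to its total energy and compares it with $\beta$.

```python
import numpy as np

def principal_energy_ratio(Xi, xi, r):
    """Xi: (D, k) neighbor set of xi (D,); returns the ratio in Eq. (11)."""
    sigma = np.linalg.svd(Xi - xi[:, None], compute_uv=False)
    return np.sum(sigma[:r] ** 2) / np.sum(sigma ** 2)

def satisfies_criterion(Xi, xi, r, beta):
    # (xi, x_j) pairs qualify as within-class time series when the neighbors
    # concentrate their energy along the first r principal directions.
    return principal_energy_ratio(Xi, xi, r) > beta
```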

3.2.2. Adaptive criterion for time series selection

Assume that we have obtained a relatively large neighbor set $X_i = [x_i^1, x_i^2, \ldots, x_i^k]$ of the point $x_i$, which can be found by the k-nearest-neighborhood method. Given preset parameters $r$ and $\beta$, if Eq. (11) is not satisfied, we remove a point $x_i^j$ from the set $X_i$. Each removal step should guarantee that the remaining set has the maximum value of the function

$\delta(X_i / x_i^j) = \frac{\sum_{l=1}^{r} \sigma_l^2}{\sum_{l=1}^{n_i} \sigma_l^2}$   (12)

where $X_i / x_i^j$ means that the point $x_i^j$ is removed from the set $X_i$, and $\sigma_l$, $l = 1, 2, \ldots, n_i$, are the nonzero singular values of $X_i / x_i^j - x_i e^T$. The removal step is repeated until Eq. (11) holds. The adaptive time series selection process is summarized in Algorithm 1; a runnable sketch is given at the end of Section 3.4. From Eq. (12) we can see that the remaining points mainly lie along a few principal directions, which favors describing the structure of the sub-manifold. In practice, we usually set the dimension of the local neighborhood to be small and the parameter $\beta$ to be relatively large. This ensures that the selected points lie along the principal directions; points that do not lie along the principal directions are removed.

Algorithm 1. Within-class time series selection.
Input: a data point $x_i$, its k-nearest-neighbor set $X_i = [x_i^1, x_i^2, \ldots, x_i^{k_i}]$ belonging to the same class as $x_i$, and parameters $r$, $\beta$.
Output: $T_i = \{(x_i, x_i^j)\}$, $j = 1, \ldots, k_i$, where $x_i^j \in X_i$ and $k_i$ is the number of points in $X_i$.
1. Calculate the singular values $\sigma_l > 0$ ($l = 1, \ldots, n_i$) of $X_i - x_i e^T$.
2. Calculate $\delta = (\sum_{l=1}^{r} \sigma_l^2) / (\sum_{l=1}^{n_i} \sigma_l^2)$.
3. While $\delta < \beta$ and $k_i > k_{\min}$:
   select the point $x_i^{\tilde{j}}$ with $\tilde{j} = \arg\max_j \delta(X_i / x_i^j)$;
   update $X_i = X_i / x_i^{\tilde{j}}$ and $k_i = k_i - 1$;
   update $\delta = (\sum_{l=1}^{r} \sigma_l^2) / (\sum_{l=1}^{n_i} \sigma_l^2)$.

3.3. Characterization of the within-class slowness scatter

ASFDA aims to extract discriminant slow features by using label information. It is difficult to obtain a set of suitable time series based on the KNN criterion, since the distribution of each class in real-world applications is unknown. A well-constructed set of time series favors describing the manifold structure and thus performs better in recognition. Based on Algorithm 1, an adaptive time series set $t_w = [t_w^1, t_w^2, \ldots, t_w^N]$ can be obtained, where $t_w^i = \{(x_i, x_i^j)\}$, $i = 1, 2, \ldots, N$, $j = 1, 2, \ldots, k_i$, and $x_i$ and $x_i^j$ belong to the same class. Based on the set of time series $t_w$, the within-class temporal variation $\Delta t_w = [\Delta t_w^1, \Delta t_w^2, \ldots, \Delta t_w^N]$ can be approximated as follows:

$\Delta t_w^i = \{x_i - x_i^j\}, \quad (x_i, x_i^j) \in t_w^i$   (13)

where $k_i$ is the number of time series in $t_w^i$. The within-class slowness scatter is then defined as

$J_w = \sum_{i=1}^{N} \operatorname{tr}(W^T \Delta t_w^i (\Delta t_w^i)^T W)$   (14)

$J_w = \operatorname{tr}(W^T T_w W)$   (15)

where $N$ is the number of training samples and $T_w = \sum_{i=1}^{N} \Delta t_w^i (\Delta t_w^i)^T$.

3.4. Characterization of the between-class margin scatter

To achieve good classification performance, the margin of between-class separability should be maximized in the low-dimensional space. Due to the nonlinear structure of the manifold, many pairs of close samples have different labels. For each point, we only consider its k nearest points with different labels to calculate the between-class time series. Given a data point $x_i$, we find its k nearest points $\tilde{X}_i = [\tilde{x}_i^1, \tilde{x}_i^2, \ldots, \tilde{x}_i^k]$ that do not belong to the same class as $x_i$, and calculate the between-class temporal variation $\Delta t_b = [\Delta t_b^1, \Delta t_b^2, \ldots, \Delta t_b^N]$ by approximating the time difference, where $\Delta t_b^i = \{x_i - \tilde{x}_i^1, \ldots, x_i - \tilde{x}_i^k\}$, $i = 1, 2, \ldots, N$. The between-class margin scatter is then defined as

$J_b = \sum_{i=1}^{N} \operatorname{tr}(W^T \Delta t_b^i (\Delta t_b^i)^T W)$   (16)

$J_b = \operatorname{tr}(W^T T_b W)$   (17)

where $N$ is the number of samples and $T_b = \sum_{i=1}^{N} \Delta t_b^i (\Delta t_b^i)^T$. It is easy to see that maximizing $J_b$ maximizes the local margin between different classes in the low-dimensional space.
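The following sketch is our reading of Algorithm 1 (it is not the authors' code; the default values r = 1, beta = 0.65 follow the experimental settings in Section 4, while k_min and all names are our assumptions): starting from the k nearest same-class neighbors of $x_i$, it greedily removes the point whose removal maximizes the energy ratio $\delta$ of Eq. (12) until Eq. (11) holds or only $k_{\min}$ points remain.

```python
import numpy as np

def energy_ratio(Xi, xi, r):
    # ratio of Eq. (11)/(12): energy of the first r singular values over total energy
    sigma = np.linalg.svd(Xi - xi[:, None], compute_uv=False)
    return np.sum(sigma[:r] ** 2) / np.sum(sigma ** 2)

def select_within_class_series(xi, Xi, r=1, beta=0.65, k_min=2):
    """xi: (D,); Xi: (D, k) same-class neighbors; returns the surviving neighbors."""
    keep = list(range(Xi.shape[1]))
    delta = energy_ratio(Xi[:, keep], xi, r)
    while delta < beta and len(keep) > k_min:
        # try removing each remaining point; keep the removal that leaves the
        # neighbor set most concentrated along the first r principal directions
        ratios = [energy_ratio(Xi[:, [c for c in keep if c != j]], xi, r)
                  for j in keep]
        best = int(np.argmax(ratios))
        delta = ratios[best]
        keep.pop(best)
    return Xi[:, keep]   # each (xi, column) pair is kept as a within-class time series
```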
3.5. Objective function

With the above preparation, the proposed algorithm is expected to find the optimal projection that minimizes the within-class temporal variation and simultaneously maximizes the between-class temporal variation. We thus have the following optimization problem:

$\min_{w^T w = 1} w^T T_w w, \qquad \max_{w^T w = 1} w^T T_b w$   (18)

This optimization problem can be reformulated as

$\min_{w^T w = 1} w^T (T_w - \alpha T_b) w$   (19)

where $\alpha \ge 0$ is a suitable parameter that balances the within-class and between-class temporal variation. Eq. (19) can easily be reduced to an eigenvalue problem, and the optimal projections can be selected as the eigenvectors $w_1, w_2, \ldots, w_d$ corresponding to the $d$ smallest eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_d$.
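For a fixed $\alpha$, Eq. (19) with $d$ orthonormal projections is solved by an ordinary symmetric eigendecomposition; a minimal sketch (ours, with assumed variable names) is given below.

```python
import numpy as np

def asfda_projection(T_w, T_b, alpha, d):
    """Eigenvectors of (T_w - alpha*T_b) for its d smallest eigenvalues."""
    M = T_w - alpha * T_b              # symmetric difference matrix of Eq. (19)
    vals, vecs = np.linalg.eigh(M)     # eigenvalues returned in ascending order
    return vecs[:, :d]                 # D x d projection matrix W
```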

3.6. The effective discriminant subspace

It is obvious that the effective projections that can be used in feature extraction depend on both matrices $T_w$ and $T_b$. When the parameter $\alpha$ approaches zero, ASFDA degenerates into a feature extraction method that only includes the within-class information $T_w$. On the other hand, when the parameter approaches a large value such as $+\infty$, ASFDA degenerates into a feature extraction method that only considers the between-class information $T_b$. In real-world applications, it is difficult to characterize the true data distribution. According to the literature [34], when there is a conflict between the within-class and between-class temporal variation, it is difficult to know whether the projection given by the within-class term or that given by the between-class term is better for classification. As far as we know, a larger discriminant power of the low-dimensional representation results in better classification performance. In this section, we present a robust method to obtain an adaptive parameter that maximizes the discriminant power. To formally illustrate the effectiveness of the method, we first define the discriminant power for a given $d$ as

$\frac{\operatorname{tr}(W^T T_b W)}{\operatorname{tr}(W^T T_w W)}$   (20)

where $W \in \mathbb{R}^{D \times d}$ and $W^T W = I_d$. In order to acquire good classification performance, the goal is to maximize the discriminant power of the $d$-dimensional representation, and Eq. (19) can be reformulated into the following problem:

$\max_{\alpha} \frac{\operatorname{tr}(W_0^T T_b W_0)}{\operatorname{tr}(W_0^T T_w W_0)} \quad \text{s.t.} \quad W_0 = \arg\min_{W^T W = I_d} \operatorname{tr}(W^T (T_w - \alpha T_b) W)$   (21)

where $D$ is the dimension of the original data and $d$ is the required low dimension. The optimization process contains two main steps.

Removing the null space of the between-class temporal variation. It is well known that the matrices $T_w$ and $T_b$ are both positive semi-definite and that the null space of $T_b$ has no discriminant ability. We assume that removing the null space of $T_b$ does not sacrifice classification accuracy. The eigendecomposition of $T_b$ is

$T_b = U \Lambda_{T_b} U^T$   (22)

where $U = [u_1, u_2, \ldots, u_m]$, $\Lambda_{T_b} = \operatorname{diag}(\lambda_{T_b}^1, \lambda_{T_b}^2, \ldots, \lambda_{T_b}^m)$ with $\lambda_{T_b}^1 \ge \lambda_{T_b}^2 \ge \cdots \ge \lambda_{T_b}^m > 0$, and $m$ is the number of positive eigenvalues of $T_b$. The solution of Eq. (21) is then a linear combination of the columns of $U$, that is, $W = UV$. Writing the two temporal variations as $T_b^U = U^T T_b U$ and $T_w^U = U^T T_w U$, Eq. (21) reduces to

$\min_{V^T V = I_d} \operatorname{tr}(V^T (T_w^U - \alpha T_b^U) V)$   (23)

and the discriminant power is expressed as

$\frac{\operatorname{tr}(V^T T_b^U V)}{\operatorname{tr}(V^T T_w^U V)}$   (24)

where $T_b^U$ is positive definite and $T_w^U$ is positive semi-definite.

Iterative optimization. At each iterative step we start with $V_n \in \mathbb{R}^{m \times d}$ and compute the trade-off parameter as

$\alpha_n = \frac{\operatorname{tr}(V_n^T T_w^U V_n)}{\operatorname{tr}(V_n^T T_b^U V_n)}$   (25)

and then $V_{n+1} \in \mathbb{R}^{m \times d}$ is calculated as

$V_{n+1} = \arg\min_{V_{n+1}^T V_{n+1} = I_d} \operatorname{tr}(V_{n+1}^T (T_w^U - \alpha_n T_b^U) V_{n+1})$   (26)

Since $T_b^U$ is positive definite, the term $\operatorname{tr}(V_n^T T_b^U V_n)$ always has a positive value. The detailed iterative procedure is listed in Algorithm 2; a runnable sketch is given after the ASFDA procedure below.

Algorithm 2. Iterative procedure to obtain the optimal discriminant subspace.
Input: the within-class temporal variation $T_w$ and the between-class temporal variation $T_b$.
Output: $W$.
1. Remove the null space of $T_b$ as in Eq. (22) and write $T_w^U = U^T T_w U$ and $T_b^U = U^T T_b U$.
2. For each $n \in [1, N_{\max}]$:
   compute the balance parameter $\alpha_n$ from the projection matrix $V_{n-1}$ as $\alpha_n = \operatorname{tr}(V_{n-1}^T T_w^U V_{n-1}) / \operatorname{tr}(V_{n-1}^T T_b^U V_{n-1})$;
   calculate the new projection matrix $V_n = \arg\min_{V_n^T V_n = I_d} \operatorname{tr}(V_n^T (T_w^U - \alpha_n T_b^U) V_n)$;
   if $\|V_n - V_{n-1}\| < \varepsilon$ ($\varepsilon$ a small positive value), set $V = V_n$ and break.
3. $W = UV$.

Theorem 1. The iterative procedure in Algorithm 2 converges, since the parameter $\alpha_n$ is monotonically decreasing and bounded. Denote the function $F(V) = \operatorname{tr}(V^T T_w^U V) / \operatorname{tr}(V^T T_b^U V)$; then $F(V_n) \le F(V_{n-1})$ and $F(V) \ge 0$.

Proof. Set $\alpha_n = \operatorname{tr}(V_{n-1}^T T_w^U V_{n-1}) / \operatorname{tr}(V_{n-1}^T T_b^U V_{n-1})$; then $\operatorname{tr}(V_{n-1}^T (T_w^U - \alpha_n T_b^U) V_{n-1}) = 0$. Since $V_n = \arg\min_{V^T V = I_d} \operatorname{tr}(V^T (T_w^U - \alpha_n T_b^U) V)$, we have $\operatorname{tr}(V_n^T (T_w^U - \alpha_n T_b^U) V_n) \le \operatorname{tr}(V_{n-1}^T (T_w^U - \alpha_n T_b^U) V_{n-1}) = 0$, and therefore $\operatorname{tr}(V_n^T T_w^U V_n) / \operatorname{tr}(V_n^T T_b^U V_n) \le \alpha_n$, that is, $F(V_n) \le F(V_{n-1})$. Moreover, since $T_b^U$ is positive definite and $T_w^U$ is positive semi-definite, $F(V) = \operatorname{tr}(V^T T_w^U V) / \operatorname{tr}(V^T T_b^U V) \ge 0$. Therefore, during the iteration the parameter $\alpha_n$ is monotonically decreasing and bounded.

Corollary 1. When Algorithm 2 converges, the maximum discriminant power is obtained simultaneously.

Proof. The discriminant power is the reciprocal of the parameter $\alpha$, so when $\alpha$ reaches its minimum, the discriminant power reaches its maximum.

The algorithmic procedure of ASFDA is now formally summarized as follows.

Step 1: Construct the within-class temporal variation $T_w$ using Algorithm 1.
Step 2: Construct the between-class temporal variation $T_b$ using neighboring information.
Step 3: Use Algorithm 2 to solve the objective $\max_{W, \alpha} \operatorname{tr}(W_0^T T_b W_0) / \operatorname{tr}(W_0^T T_w W_0)$ subject to $W_0 = \arg\min_{W^T W = I_d} \operatorname{tr}(W^T (T_w - \alpha T_b) W)$.
Step 4: After obtaining the optimal transformation matrix $W$, represent a new sample $x$ by its low-dimensional feature $y = W^T x$.
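The sketch below is our reading of Algorithm 2: it removes the null space of $T_b$ as in Eq. (22) and then alternates between updating the balance parameter $\alpha_n$ (Eq. (25)) and the projection $V_n$ (Eq. (26)) until the projection stabilizes. The initialization (equivalent to $\alpha = 1$), the tolerance, the iteration cap and all names are our own assumptions.

```python
import numpy as np

def asfda_iterative(T_w, T_b, d, n_max=50, eps=1e-6):
    # Step 1: keep only directions with positive T_b eigenvalues (null-space removal).
    vals, U = np.linalg.eigh(T_b)
    U = U[:, vals > 1e-10]
    Tw_u, Tb_u = U.T @ T_w @ U, U.T @ T_b @ U
    # Our initialization: solve Eq. (26) once with alpha = 1.
    _, V = np.linalg.eigh(Tw_u - Tb_u)
    V = V[:, :d]
    # Step 2: alternate alpha_n and V_n updates (Eqs. (25)-(26)).
    for _ in range(n_max):
        alpha = np.trace(V.T @ Tw_u @ V) / np.trace(V.T @ Tb_u @ V)
        _, vecs = np.linalg.eigh(Tw_u - alpha * Tb_u)
        V_new = vecs[:, :d]
        # crude convergence check; it ignores possible eigenvector sign flips,
        # which is acceptable for a sketch (the loop is capped by n_max anyway)
        if np.linalg.norm(V_new - V) < eps:
            V = V_new
            break
        V = V_new
    return U @ V    # final projection W = U V
```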

4. Experimental results and analysis

In this section, we evaluate the performance of our method ASFDA in comparison with other classical dimensionality reduction methods, including LDA [2], DLPP [10], MFA [12], MMC [16], SSFACM [29], SSFASP [30], SFDA [31] and SSFA [32], on several publicly available databases. In order to make the comparison fair, we first apply PCA as a preprocessing step to keep 98% of the energy. For MFA, the k-nearest-neighborhood parameter $k_1$ is set as $k_1 = l - 1$ and $k_2$ is set as $c$, where $l$ denotes the number of training samples per class and $c$ denotes the number of classes. Following [11,39], the heat kernel parameter $t$ in DLPP is set as $t = 2^{m/2.5} \sigma_0$, where $\sigma_0$ is the standard deviation of the squared norms of the training samples and $m \in \{-20, \ldots, 0, \ldots, 20\}$. In ASFDA, we set $r = 1$, $\beta = 0.65$ and the between-class neighborhood selection parameter $k = c$. After all methods have been used to extract low-dimensional features, the nearest neighbor classifier with the Euclidean metric is employed to perform the classification task. The recognition accuracy is the percentage of testing samples that are correctly recognized. All experiments are performed on a PC (CPU: Core 2 Duo 2.2 GHz, RAM: 2 GB) with MATLAB 2010a. A minimal sketch of this evaluation pipeline is given below, after the database descriptions.

4.1. Databases

The ORL face database contains 400 images of 40 distinct individuals, and each subject has 10 different images. These images were taken at different times and show variations in lighting conditions, facial expression (open/closed eyes, smiling/not smiling) and facial details (glasses/no glasses). For computational convenience, we manually cropped the face portion of each image to a fixed resolution. Some example images of one person are shown in Fig. 3.

The Extended YaleB face database contains 16,128 images under 9 poses and 64 illumination conditions. In our experiments, we select a subset containing 2431 images of 38 individuals. Before the experiments, each image in the Extended YaleB face database is cropped and resized to a fixed resolution. Fig. 4 shows some sample images.

The CMU PIE face database contains 41,368 images of 68 individuals. The images of each individual were taken under 13 different poses, 43 different illumination conditions and with 4 different expressions. In our experiments, we select a subset containing 11,554 images of 68 individuals. Before the experiments, all face images in the PIE database are resized to a fixed resolution. Some sample images are shown in Fig. 5.

4.2. Experiment for the tradeoff parameter α

In this subsection, we investigate the performance of ASFDA over the reduced dimensions and the value of the tradeoff parameter α. On the YaleB and PIE face databases, 8 samples of each individual are selected for training and the remaining samples are used for testing. On the ORL face database, 5 samples per class are randomly chosen for training and the rest are used for testing. Each experiment is randomly repeated 20 times to obtain the average recognition accuracy. Figs. 7-9 show the recognition accuracy of ASFDA over the dimensionality of the subspace for different values of the parameter α. Table 1 gives the maximal recognition accuracy of ASFDA with different values of α, and Fig. 6 gives the variation of α during the iteration.

Fig. 3. Sample images in the ORL database.

Fig. 4. Sample images in the YaleB database.

Fig. 5. Sample images in the PIE database.
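For reference, the sketch below illustrates the evaluation protocol described above: PCA keeping 98% of the energy as preprocessing, followed by a 1-nearest-neighbor classifier with the Euclidean metric on the extracted features. It is our own illustration under assumed names and data layout, not the authors' MATLAB code.

```python
import numpy as np

def pca_keep_energy(X, energy=0.98):
    """X: (D, N). Returns the PCA basis that keeps the given energy fraction."""
    Xc = X - X.mean(axis=1, keepdims=True)
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    k = np.searchsorted(np.cumsum(s ** 2) / np.sum(s ** 2), energy) + 1
    return U[:, :k]

def nn_accuracy(train_feats, train_y, test_feats, test_y):
    """1-NN with Euclidean distance; features are (d, N) column matrices."""
    d2 = ((test_feats[:, :, None] - train_feats[:, None, :]) ** 2).sum(axis=0)
    pred = train_y[np.argmin(d2, axis=1)]
    return float(np.mean(pred == test_y))
```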

From Figs. 6-9 and Table 1, we can see that the recognition rates are sensitive to the tradeoff parameter. ASFDA always has a significant advantage in recognition rate, because our method automatically chooses an optimal parameter to balance the within-class and between-class temporal variation and thus attains the maximum discriminant power in the low-dimensional subspace, so it consistently obtains the best classification performance. As can be seen from Fig. 6, the parameter α quickly reaches its minimum value; at that point ASFDA obtains its maximum discriminant power, because the discriminant power is the reciprocal of α. From Fig. 7, when α is set to 0.001, 0.01, 0.1 and 1, the recognition rate curves are very similar and the recognition rates are less sensitive to α. The reason is that there is less conflict between within-class and between-class temporal variation on the ORL database.

4.3. Experiment for face recognition

In this subsection, we compare the performance of different dimensionality reduction methods. In order to evaluate the effectiveness of the time series constructed by our method, we also extend SFDA [31] to a difference form, so there are two variants of SFDA: SFDA-ratio and SFDA-difference. We randomly select l samples of each individual for training (l = 8, 10, 12, 15 on YaleB and PIE; l = 6, 7, 8 on ORL) and use the remaining samples for testing on the three databases. Each experiment is randomly repeated 20 times to obtain the average recognition accuracy. Tables 2-4 give the maximal average recognition accuracy obtained by the different methods, together with the standard deviations and the corresponding dimensionality of the reduced subspace. In addition, the recognition rate curves of the different methods are drawn in Figs. 10-12.

Fig. 6. The variation of the tradeoff parameter α with the number of iterations on the ORL, YaleB and PIE databases.

Fig. 7. Recognition accuracies of ASFDA on the ORL face database using 5 training samples.

Fig. 8. Recognition accuracies of ASFDA on the YaleB face database using 8 training samples.

Fig. 9. Recognition accuracies of ASFDA on the PIE face database using 8 training samples.

Table 1. The maximal recognition accuracy (%) of ASFDA on the ORL, YaleB and PIE databases for α = 0.001, 0.01, 0.1, 1, 10, 100, 1000 and for the adaptively chosen α of our method (α = 0.09 on ORL, α = 0.25 on YaleB, α = 0.44 on PIE); the best accuracy in each row is shown in bold.

From Figs. 10-13 and Tables 2-4, we can see that ASFDA consistently outperforms LDA, DLPP, MFA and MMC in all experiments on the three face databases. The good performance of ASFDA also demonstrates that it is more effective than the other methods in feature extraction: ASFDA not only captures the structural information of both the sub-manifolds and the multi-manifold, but also obtains the maximum discriminant power from the within-class and between-class temporal variations. As shown in Figs. 10-12, the maximum recognition rate of LDA is higher than that of MMC, and SFDA-ratio also outperforms SFDA-difference; the reason is that MMC and SFDA-difference are sensitive to the tradeoff parameter α. SFDA-ratio and ASFDA outperform SSFA, SSFACM and SSFASP, because SSFA, SSFACM and SSFASP ignore the between-class information. Although SFDA-difference can obtain discriminant features, its performance is not as good as expected in our experiments; the reason may be that the difference criterion relies on the tradeoff parameter α when there is a conflict between within-class and between-class temporal variation. We can also observe that ASFDA outperforms both variants of SFDA, SFDA-ratio and SFDA-difference, because the time series constructed by our method are more helpful for revealing the structural information of the sub-manifolds and the multi-manifold than those of SFDA.

4.4. Influence of parameters on ASFDA performance

The proposed method ASFDA has two parameters: r, the dimension of the local neighborhood, and β, which measures the PCA energy of the r directions in the local neighborhood. In this subsection, we study the impact of r and β on the performance of ASFDA on the ORL, YaleB and PIE face databases. We randomly select 8 samples of each individual for training and use the remaining samples for testing, and we repeat the experiment 20 times to obtain the average recognition accuracy.

Table 2. The maximal average recognition accuracy (%), the corresponding standard deviations, and the optimal dimensions of LDA, DLPP, MFA, MMC, SSFA, SSFACM, SSFASP, SFDA-ratio, SFDA-difference and ASFDA across 20 runs on the ORL database with 6, 7 and 8 training samples per class; the best recognition accuracy is shown in bold.

Fig. 10. The recognition rate curves of LDA, DLPP, MFA, MMC, SSFA, SSFACM, SSFASP, SFDA-ratio, SFDA-difference and ASFDA versus dimensions on the ORL face database using 7 training samples.

Table 3. The maximal average recognition accuracy (%), the corresponding standard deviations, and the optimal dimensions of the same methods across 20 runs on the YaleB database with 8, 10, 12 and 15 training samples per class; the best recognition accuracy is shown in bold.

Table 4. The maximal average recognition accuracy (%), the corresponding standard deviations, and the optimal dimensions of the same methods across 20 runs on the PIE database with 8, 10, 12 and 15 training samples per class; the best recognition accuracy is shown in bold.

Fig. 13 shows the maximal average recognition accuracy as a function of the two parameters r and β. From Fig. 13 we can clearly see that the proposed ASFDA is, on the whole, stable with respect to the parameters r and β on the three face databases. More specifically, the recognition accuracy of ASFDA increases over a small range either as β increases or as r decreases. When r is set to 1 or 2 and β is set above 0.6, ASFDA obtains better performance on the three face databases. The reason may be that the dimension of the local manifold is small: the points in X_i and the point x_i in the constructed time series are nearby and lie in the same principal directions, which helps reveal the local structure of the manifold.

4.5. Computational efficiency comparison

In this section, we discuss the computational cost of the proposed ASFDA in comparison with LDA, MMC, SSFA, SSFACM, SSFASP, SFDA-ratio and SFDA-difference. ASFDA has the same complexity as SFDA once the time series and the parameter α are given. However, ASFDA needs extra computation to construct the adaptive time series and to compute the adaptive parameter α that balances the within-class and between-class temporal variation; consequently, ASFDA requires more arithmetic operations than SFDA. We use the ORL, YaleB and PIE face databases to empirically compare the computational efficiency of these methods. In each database, l = 8 samples of each individual are selected to measure the training cost of each method. The experiments are repeated 20 times and the average training time is computed. Table 5 shows the average training costs of the methods on the different databases.

Fig. 11. The recognition rate curves of LDA, DLPP, MFA, MMC, SSFA, SSFACM, SSFASP, SFDA-ratio, SFDA-difference and ASFDA versus dimensions on the YaleB face database using 8 training samples.

Fig. 12. The recognition rate curves of LDA, DLPP, MFA, MMC, SSFA, SSFACM, SSFASP, SFDA-ratio, SFDA-difference and ASFDA versus dimensions on the PIE face database using 8 training samples.

Fig. 13. The maximal average recognition accuracy versus parameters r and β on the ORL, YaleB and PIE face databases.

Table 5. The average training time (seconds) of LDA, MMC, SSFA, SSFACM, SSFASP, SFDA-ratio, SFDA-difference and ASFDA across 20 runs on the three face databases.

5. Conclusion

In this paper, we develop a novel feature extraction method called adaptive slow feature discriminant analysis (ASFDA) for face recognition. ASFDA provides a new criterion to generate within-class time series that describe the sub-manifolds, and it uses neighboring information to generate between-class time series that describe the margins between the manifolds. Moreover, ASFDA provides an automatic parameter to balance the within-class and between-class temporal variation so as to obtain an optimal discriminant subspace for classification. The experimental results demonstrate that the proposed ASFDA is superior to some state-of-the-art methods in face recognition.

10 148 X. Gu et al. / Neurocomputing 154 (2015) Acknowledgement This work is supported by the National Natural Science Fund of China (Grant nos , and ), the Project of Ministry of Industry, Information Technology of PRC (Grant no. E0310/1112/02-1) and Fundamental Research Funds for the Central Universities (Grant no. 2013KJ010). References [1] M. Turk, A. Pentland, Eigenfaces for recognition, J. Cogn. Neurosci. 3 (1) (1991) [2] P.N. Belhumeur, J.P. Hespanha, D. Kriegman, Eigenfaces vs. fisherfaces: recognition using class specific linear projection, IEEE Trans. Pattern Anal. Mach. Intell. 19 (7) (1997) [3] S.T. Roweis, L.K. Saul, Nonlinear dimensionality reduction by locally linear embedding, Science 290 (5500) (2000) [4] J.B. Tenenbaum, V. De Silva, J.C. Langford, A global geometric framework for nonlinear dimensionality reduction, Science 290 (5500) (2000) [5] M. Belkin, P. Niyogi, Laplacian eigenmaps for dimensionality reduction and data representation, Neural Comput. 15 (6) (2003) [6] Z.-y. Zhang, H.-y. Zha, Principal manifolds and nonlinear dimensionality reduction via tangent space alignment, J. Shanghai Univ. (English Edition) 8 (4) (2004) [7] X. He, D. Cai, S. Yan, H.-J. Zhang, Neighborhood preserving embedding, in: Tenth IEEE International Conference on Computer Vision, vol. 2, IEEE, Los Alamitos, CA, USA, 2005, pp [8] X. He, S. Yan, Y. Hu, P. Niyogi, H.-J. Zhang, Face recognition using laplacianfaces, IEEE Trans. Pattern Anal. Mach. Intell. 27 (3) (2005) [9] T. Zhang, J. Yang, D. Zhao, X. Ge, Linear local tangent space alignment and application to face recognition, Neurocomputing 70 (7) (2007) [10] W. Yu, X. Teng, C. Liu, Face recognition using discriminant locality preserving projections, Image Vis. Comput. 24 (3) (2006) [11] G.-F. Lu, Z. Lin, Z. Jin, Face recognition using discriminant locality preserving projections based on maximum margin criterion, Pattern Recognit. 43 (10) (2010) [12] S. Yan, D. Xu, B. Zhang, H.-J. Zhang, Q. Yang, S. Lin, Graph embedding and extensions: a general framework for dimensionality reduction, IEEE Trans. Pattern Anal. Mach. Intell. 29 (1) (2007) [13] H.-T. Chen, H.-W. Chang, T.-L. Liu, Local discriminant embedding and its variants, in: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, IEEE, Los Alamitos, CA, USA, 2005, pp [14] C. Zhao, D. Miao, Z. Lai, C. Gao, C. Liu, J. Yang, Two-dimensional color uncorrelated discriminant analysis for face recognition, Neurocomputing 113 (3) (2013) [15] C. Zhao, Z. Lai, C. Liu, X. Gu, J. Qian, Fuzzy local maximal marginal embedding for feature extraction, Soft Comput. 16 (1) (2012) [16] H. Li, T. Jiang, K. Zhang, Efficient and robust feature extraction by maximum margin criterion, in: Neural Information Processing Systems, [17] Z. Zhang, J. Wang, H. Zha, Adaptive manifold learning, IEEE Trans. Pattern Anal. Mach. Intell. 34 (2) (2012) [18] C. Zhao, Z. Lai, D. Miao, Z. Wei, C. Liu, Graph embedding discriminant analysis for face recognition, Neural Comput. Appl. 22 (5) (2013) [19] J. Wright, A.Y. Yang, A. Ganesh, S.S. Sastry, Y. Ma, Robust face recognition via sparse representation, IEEE Trans. Pattern Anal. Mach. Intell. 31 (2) (2009) [20] L.Itti,C.Koch,E.Niebur,etal.,Amodelofsaliency-basedvisualattentionforrapid scene analysis, IEEE Trans. Pattern Anal. Mach. Intell. 20 (11) (1998) [21] L. Wiskott, T.J. Sejnowski, Slow feature analysis: unsupervised learning of invariances, Neural Comput. 14 (4) (2002) [22] P. 
Berkes, Temporal Slowness as an Unsupervised Learning Principle: Selforganization of Complex-cell Receptive Fields and Application to Pattern Recognition (Ph.D. thesis), Citeseer, [23] S. Dähne, N. Wilbert, L. Wiskott, Slow feature analysis on retinal waves leads to v1 complex cells, PLoS Comput. Biol. 10 (5) (2014) e [24] R. Legenstein, N. Wilbert, L. Wiskott, Reinforcement learning on slow features of high-dimensional input streams, PLoS Comput. Biol. 6 (8) (2010) e [25] M. Franzius, N. Wilbert, L. Wiskott, Invariant object recognition and pose estimation with slow feature analysis, Neural Comput. 23 (9) (2011) [26] T. Blaschke, T. Zito, L. Wiskott, Independent slow feature analysis and nonlinear blind source separation, Neural Comput. 19 (4) (2007) [27] S. Dähne, J. Höhne, M. Schreuder, M. Tangermann, Slow feature analysis-a tool for extraction of discriminating event-related potentials in brain-computer interfaces, in: Artificial Neural Networks and Machine Learning ICANN 2011, Springer, Berlin, Germany, 2011, pp [28] Z. Zhang, D. Tao, Slow feature analysis for human action recognition, IEEE Trans. Pattern Anal. Mach. Intell. 34 (3) (2012) [29] X. Gu, C. Liu, S. Wang, Supervised slow feature analysis for face recognition, in: Biometric Recognition, Springer, Heidelberg, Germany, 2013, pp [30] X. Gu, C. Liu, Z. Yang, Dimensionality reduction based on supervised slow feature analysis for face recognition, Int. J. Signal Process., Image Process. Pattern Recognit. 7 (1) (2014) [31] Y. Huang, J. Zhao, M. Tian, Q. Zou, S. Luo, Slow feature discriminant analysis and its application on handwritten digit recognition, in: International Joint Conference on Neural Networks, IEEE, Piscataway, NJ, USA, 2009, pp [32] Y. Huang, J. Zhao, Y. Liu, S. Luo, Q. Zou, M. Tian, Nonlinear dimensionality reduction using a temporal coherence principle, Inf. Sci. 181 (16) (2011) [33] V. Premachandran, R. Kakarala, Consensus of k-nns for robust neighborhood selection on graph-based manifolds, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, Los Alamitos, CA, USA, 2013, pp [34] A.M. Martinez, M. Zhu, Where are linear feature extraction methods applicable? IEEE Trans. Pattern Anal. Mach. Intell. 27 (12) (2005) [35] M. Zhu, A.M. Martinez, Subclass discriminant analysis, IEEE Trans. Pattern Anal. Mach. Intell. 28 (8) (2006) [36] X. Jing, S. Li, D. Zhang, C. Lan, J. Yang, Optimal subset-division based discrimination and its kernelization for face and palmprint recognition, Pattern Recognit. 45 (10) (2012) [37] H. Li, T. Jiang, K. Zhang, Efficient and robust feature extraction by maximum margin criterion, IEEE Trans. Neural Netw. 17 (1) (2006) [38] G.H. Golub, C. Reinsch, Singular value decomposition and least squares solutions, Numer. Math. 14 (5) (1970) [39] L. Zhang, L. Qiao, S. Chen, Graph-optimized locality preserving projections, Pattern Recognit. 43 (6) (2010) Xingjian Gu is now working for Ph.D. degree at the School of Computer Science and Engineering in Nanjing University of Science and Technology. He received his B. S. degree in the college of math and physics at Nanjing University of Information Science and Technology in His research interests mainly focus on Pattern Recognition and Computer Vision. Chuancai Liu is a Full Professor in the School of Computer Science and Engineering of Nanjing University of Science and Technology, China. He obtained his Ph.D. 
degree from the China Ship Research and Development Academy in His research interests include AI, Pattern Recognition and Computer Vision. He has published about 50 papers in International/ National Journals. Sheng Wang received his B.S. degree in automation from Henan University, China, in He obtained his M.S. degree in Control Theory and Control Engineering from the same University. Currently, he is a Ph.D. student at Nanjing University of science and Technology. His research interests include Image Processing, Pattern Recognition and Machine Learning. Cairong Zhao is currently an assistant professor at Tongji University. He received the Ph.D. degree from Nanjing University of Science and Technology, M.S. degree from Changchun Institute of Optics, Fine Mechanics and Physics, Chinese Academy of Sciences, and B.S. degree from Jilin University, in 2011, 2006 and 2003, respectively. His research interests include Face Recognition, Building Recognition and Vision Attention.


More information

Locally Linear Embedded Eigenspace Analysis

Locally Linear Embedded Eigenspace Analysis Locally Linear Embedded Eigenspace Analysis IFP.TR-LEA.YunFu-Jan.1,2005 Yun Fu and Thomas S. Huang Beckman Institute for Advanced Science and Technology University of Illinois at Urbana-Champaign 405 North

More information

Example: Face Detection

Example: Face Detection Announcements HW1 returned New attendance policy Face Recognition: Dimensionality Reduction On time: 1 point Five minutes or more late: 0.5 points Absent: 0 points Biometrics CSE 190 Lecture 14 CSE190,

More information

Statistical and Computational Analysis of Locality Preserving Projection

Statistical and Computational Analysis of Locality Preserving Projection Statistical and Computational Analysis of Locality Preserving Projection Xiaofei He xiaofei@cs.uchicago.edu Department of Computer Science, University of Chicago, 00 East 58th Street, Chicago, IL 60637

More information

Orthogonal Laplacianfaces for Face Recognition

Orthogonal Laplacianfaces for Face Recognition 3608 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 15, NO. 11, NOVEMBER 2006 [29] G. Deng and J. C. Pinoli, Differentiation-based edge detection using the logarithmic image processing model, J. Math. Imag.

More information

Machine Learning. Data visualization and dimensionality reduction. Eric Xing. Lecture 7, August 13, Eric Xing Eric CMU,

Machine Learning. Data visualization and dimensionality reduction. Eric Xing. Lecture 7, August 13, Eric Xing Eric CMU, Eric Xing Eric Xing @ CMU, 2006-2010 1 Machine Learning Data visualization and dimensionality reduction Eric Xing Lecture 7, August 13, 2010 Eric Xing Eric Xing @ CMU, 2006-2010 2 Text document retrieval/labelling

More information

Informative Laplacian Projection

Informative Laplacian Projection Informative Laplacian Projection Zhirong Yang and Jorma Laaksonen Department of Information and Computer Science Helsinki University of Technology P.O. Box 5400, FI-02015, TKK, Espoo, Finland {zhirong.yang,jorma.laaksonen}@tkk.fi

More information

Supervised locally linear embedding

Supervised locally linear embedding Supervised locally linear embedding Dick de Ridder 1, Olga Kouropteva 2, Oleg Okun 2, Matti Pietikäinen 2 and Robert P.W. Duin 1 1 Pattern Recognition Group, Department of Imaging Science and Technology,

More information

Unsupervised Learning Techniques Class 07, 1 March 2006 Andrea Caponnetto

Unsupervised Learning Techniques Class 07, 1 March 2006 Andrea Caponnetto Unsupervised Learning Techniques 9.520 Class 07, 1 March 2006 Andrea Caponnetto About this class Goal To introduce some methods for unsupervised learning: Gaussian Mixtures, K-Means, ISOMAP, HLLE, Laplacian

More information

Uncorrelated Multilinear Principal Component Analysis through Successive Variance Maximization

Uncorrelated Multilinear Principal Component Analysis through Successive Variance Maximization Uncorrelated Multilinear Principal Component Analysis through Successive Variance Maximization Haiping Lu 1 K. N. Plataniotis 1 A. N. Venetsanopoulos 1,2 1 Department of Electrical & Computer Engineering,

More information

Adaptive Affinity Matrix for Unsupervised Metric Learning

Adaptive Affinity Matrix for Unsupervised Metric Learning Adaptive Affinity Matrix for Unsupervised Metric Learning Yaoyi Li, Junxuan Chen, Yiru Zhao and Hongtao Lu Key Laboratory of Shanghai Education Commission for Intelligent Interaction and Cognitive Engineering,

More information

Machine Learning. CUNY Graduate Center, Spring Lectures 11-12: Unsupervised Learning 1. Professor Liang Huang.

Machine Learning. CUNY Graduate Center, Spring Lectures 11-12: Unsupervised Learning 1. Professor Liang Huang. Machine Learning CUNY Graduate Center, Spring 2013 Lectures 11-12: Unsupervised Learning 1 (Clustering: k-means, EM, mixture models) Professor Liang Huang huang@cs.qc.cuny.edu http://acl.cs.qc.edu/~lhuang/teaching/machine-learning

More information

Keywords Eigenface, face recognition, kernel principal component analysis, machine learning. II. LITERATURE REVIEW & OVERVIEW OF PROPOSED METHODOLOGY

Keywords Eigenface, face recognition, kernel principal component analysis, machine learning. II. LITERATURE REVIEW & OVERVIEW OF PROPOSED METHODOLOGY Volume 6, Issue 3, March 2016 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Eigenface and

More information

Data dependent operators for the spatial-spectral fusion problem

Data dependent operators for the spatial-spectral fusion problem Data dependent operators for the spatial-spectral fusion problem Wien, December 3, 2012 Joint work with: University of Maryland: J. J. Benedetto, J. A. Dobrosotskaya, T. Doster, K. W. Duke, M. Ehler, A.

More information

ECE 661: Homework 10 Fall 2014

ECE 661: Homework 10 Fall 2014 ECE 661: Homework 10 Fall 2014 This homework consists of the following two parts: (1) Face recognition with PCA and LDA for dimensionality reduction and the nearest-neighborhood rule for classification;

More information

Enhanced graph-based dimensionality reduction with repulsion Laplaceans

Enhanced graph-based dimensionality reduction with repulsion Laplaceans Enhanced graph-based dimensionality reduction with repulsion Laplaceans E. Kokiopoulou a, Y. Saad b a EPFL, LTS4 lab, Bat. ELE, Station 11; CH 1015 Lausanne; Switzerland. Email: effrosyni.kokiopoulou@epfl.ch

More information

Classification of handwritten digits using supervised locally linear embedding algorithm and support vector machine

Classification of handwritten digits using supervised locally linear embedding algorithm and support vector machine Classification of handwritten digits using supervised locally linear embedding algorithm and support vector machine Olga Kouropteva, Oleg Okun, Matti Pietikäinen Machine Vision Group, Infotech Oulu and

More information

Unsupervised dimensionality reduction

Unsupervised dimensionality reduction Unsupervised dimensionality reduction Guillaume Obozinski Ecole des Ponts - ParisTech SOCN course 2014 Guillaume Obozinski Unsupervised dimensionality reduction 1/30 Outline 1 PCA 2 Kernel PCA 3 Multidimensional

More information

Spectral Regression for Efficient Regularized Subspace Learning

Spectral Regression for Efficient Regularized Subspace Learning Spectral Regression for Efficient Regularized Subspace Learning Deng Cai UIUC dengcai2@cs.uiuc.edu Xiaofei He Yahoo! hex@yahoo-inc.com Jiawei Han UIUC hanj@cs.uiuc.edu Abstract Subspace learning based

More information

Manifold Regularization

Manifold Regularization 9.520: Statistical Learning Theory and Applications arch 3rd, 200 anifold Regularization Lecturer: Lorenzo Rosasco Scribe: Hooyoung Chung Introduction In this lecture we introduce a class of learning algorithms,

More information

The prediction of membrane protein types with NPE

The prediction of membrane protein types with NPE The prediction of membrane protein types with NPE Lipeng Wang 1a), Zhanting Yuan 1, Xuhui Chen 1, and Zhifang Zhou 2 1 College of Electrical and Information Engineering Lanzhou University of Technology,

More information

Linear Discriminant Analysis Using Rotational Invariant L 1 Norm

Linear Discriminant Analysis Using Rotational Invariant L 1 Norm Linear Discriminant Analysis Using Rotational Invariant L 1 Norm Xi Li 1a, Weiming Hu 2a, Hanzi Wang 3b, Zhongfei Zhang 4c a National Laboratory of Pattern Recognition, CASIA, Beijing, China b University

More information

SINGLE-TASK AND MULTITASK SPARSE GAUSSIAN PROCESSES

SINGLE-TASK AND MULTITASK SPARSE GAUSSIAN PROCESSES SINGLE-TASK AND MULTITASK SPARSE GAUSSIAN PROCESSES JIANG ZHU, SHILIANG SUN Department of Computer Science and Technology, East China Normal University 500 Dongchuan Road, Shanghai 20024, P. R. China E-MAIL:

More information

Linear Subspace Models

Linear Subspace Models Linear Subspace Models Goal: Explore linear models of a data set. Motivation: A central question in vision concerns how we represent a collection of data vectors. The data vectors may be rasterized images,

More information

Principal Component Analysis -- PCA (also called Karhunen-Loeve transformation)

Principal Component Analysis -- PCA (also called Karhunen-Loeve transformation) Principal Component Analysis -- PCA (also called Karhunen-Loeve transformation) PCA transforms the original input space into a lower dimensional space, by constructing dimensions that are linear combinations

More information

LECTURE NOTE #11 PROF. ALAN YUILLE

LECTURE NOTE #11 PROF. ALAN YUILLE LECTURE NOTE #11 PROF. ALAN YUILLE 1. NonLinear Dimension Reduction Spectral Methods. The basic idea is to assume that the data lies on a manifold/surface in D-dimensional space, see figure (1) Perform

More information

Intrinsic Structure Study on Whale Vocalizations

Intrinsic Structure Study on Whale Vocalizations 1 2015 DCLDE Conference Intrinsic Structure Study on Whale Vocalizations Yin Xian 1, Xiaobai Sun 2, Yuan Zhang 3, Wenjing Liao 3 Doug Nowacek 1,4, Loren Nolte 1, Robert Calderbank 1,2,3 1 Department of

More information

Integrating Global and Local Structures: A Least Squares Framework for Dimensionality Reduction

Integrating Global and Local Structures: A Least Squares Framework for Dimensionality Reduction Integrating Global and Local Structures: A Least Squares Framework for Dimensionality Reduction Jianhui Chen, Jieping Ye Computer Science and Engineering Department Arizona State University {jianhui.chen,

More information

Face recognition Computer Vision Spring 2018, Lecture 21

Face recognition Computer Vision Spring 2018, Lecture 21 Face recognition http://www.cs.cmu.edu/~16385/ 16-385 Computer Vision Spring 2018, Lecture 21 Course announcements Homework 6 has been posted and is due on April 27 th. - Any questions about the homework?

More information

Non-negative Matrix Factorization on Kernels

Non-negative Matrix Factorization on Kernels Non-negative Matrix Factorization on Kernels Daoqiang Zhang, 2, Zhi-Hua Zhou 2, and Songcan Chen Department of Computer Science and Engineering Nanjing University of Aeronautics and Astronautics, Nanjing

More information

HYPERGRAPH BASED SEMI-SUPERVISED LEARNING ALGORITHMS APPLIED TO SPEECH RECOGNITION PROBLEM: A NOVEL APPROACH

HYPERGRAPH BASED SEMI-SUPERVISED LEARNING ALGORITHMS APPLIED TO SPEECH RECOGNITION PROBLEM: A NOVEL APPROACH HYPERGRAPH BASED SEMI-SUPERVISED LEARNING ALGORITHMS APPLIED TO SPEECH RECOGNITION PROBLEM: A NOVEL APPROACH Hoang Trang 1, Tran Hoang Loc 1 1 Ho Chi Minh City University of Technology-VNU HCM, Ho Chi

More information

STUDY ON METHODS FOR COMPUTER-AIDED TOOTH SHADE DETERMINATION

STUDY ON METHODS FOR COMPUTER-AIDED TOOTH SHADE DETERMINATION INTERNATIONAL JOURNAL OF INFORMATION AND SYSTEMS SCIENCES Volume 5, Number 3-4, Pages 351 358 c 2009 Institute for Scientific Computing and Information STUDY ON METHODS FOR COMPUTER-AIDED TOOTH SHADE DETERMINATION

More information

Sparse representation classification and positive L1 minimization

Sparse representation classification and positive L1 minimization Sparse representation classification and positive L1 minimization Cencheng Shen Joint Work with Li Chen, Carey E. Priebe Applied Mathematics and Statistics Johns Hopkins University, August 5, 2014 Cencheng

More information

Gaussian Process Latent Random Field

Gaussian Process Latent Random Field Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence (AAAI-10) Gaussian Process Latent Random Field Guoqiang Zhong, Wu-Jun Li, Dit-Yan Yeung, Xinwen Hou, Cheng-Lin Liu National Laboratory

More information

Nonlinear Methods. Data often lies on or near a nonlinear low-dimensional curve aka manifold.

Nonlinear Methods. Data often lies on or near a nonlinear low-dimensional curve aka manifold. Nonlinear Methods Data often lies on or near a nonlinear low-dimensional curve aka manifold. 27 Laplacian Eigenmaps Linear methods Lower-dimensional linear projection that preserves distances between all

More information

COS 429: COMPUTER VISON Face Recognition

COS 429: COMPUTER VISON Face Recognition COS 429: COMPUTER VISON Face Recognition Intro to recognition PCA and Eigenfaces LDA and Fisherfaces Face detection: Viola & Jones (Optional) generic object models for faces: the Constellation Model Reading:

More information

Multisets mixture learning-based ellipse detection

Multisets mixture learning-based ellipse detection Pattern Recognition 39 (6) 731 735 Rapid and brief communication Multisets mixture learning-based ellipse detection Zhi-Yong Liu a,b, Hong Qiao a, Lei Xu b, www.elsevier.com/locate/patcog a Key Lab of

More information

Nonlinear Manifold Learning Summary

Nonlinear Manifold Learning Summary Nonlinear Manifold Learning 6.454 Summary Alexander Ihler ihler@mit.edu October 6, 2003 Abstract Manifold learning is the process of estimating a low-dimensional structure which underlies a collection

More information

Pattern Recognition 2

Pattern Recognition 2 Pattern Recognition 2 KNN,, Dr. Terence Sim School of Computing National University of Singapore Outline 1 2 3 4 5 Outline 1 2 3 4 5 The Bayes Classifier is theoretically optimum. That is, prob. of error

More information

Regularized Locality Preserving Projections with Two-Dimensional Discretized Laplacian Smoothing

Regularized Locality Preserving Projections with Two-Dimensional Discretized Laplacian Smoothing Report No. UIUCDCS-R-2006-2748 UILU-ENG-2006-1788 Regularized Locality Preserving Projections with Two-Dimensional Discretized Laplacian Smoothing by Deng Cai, Xiaofei He, and Jiawei Han July 2006 Regularized

More information

Eigenface-based facial recognition

Eigenface-based facial recognition Eigenface-based facial recognition Dimitri PISSARENKO December 1, 2002 1 General This document is based upon Turk and Pentland (1991b), Turk and Pentland (1991a) and Smith (2002). 2 How does it work? The

More information

Dimension Reduction and Low-dimensional Embedding

Dimension Reduction and Low-dimensional Embedding Dimension Reduction and Low-dimensional Embedding Ying Wu Electrical Engineering and Computer Science Northwestern University Evanston, IL 60208 http://www.eecs.northwestern.edu/~yingwu 1/26 Dimension

More information

Dimensionality Reduction Using the Sparse Linear Model: Supplementary Material

Dimensionality Reduction Using the Sparse Linear Model: Supplementary Material Dimensionality Reduction Using the Sparse Linear Model: Supplementary Material Ioannis Gkioulekas arvard SEAS Cambridge, MA 038 igkiou@seas.harvard.edu Todd Zickler arvard SEAS Cambridge, MA 038 zickler@seas.harvard.edu

More information

Recognition Using Class Specific Linear Projection. Magali Segal Stolrasky Nadav Ben Jakov April, 2015

Recognition Using Class Specific Linear Projection. Magali Segal Stolrasky Nadav Ben Jakov April, 2015 Recognition Using Class Specific Linear Projection Magali Segal Stolrasky Nadav Ben Jakov April, 2015 Articles Eigenfaces vs. Fisherfaces Recognition Using Class Specific Linear Projection, Peter N. Belhumeur,

More information

Principal component analysis using QR decomposition

Principal component analysis using QR decomposition DOI 10.1007/s13042-012-0131-7 ORIGINAL ARTICLE Principal component analysis using QR decomposition Alok Sharma Kuldip K. Paliwal Seiya Imoto Satoru Miyano Received: 31 March 2012 / Accepted: 3 September

More information

CITS 4402 Computer Vision

CITS 4402 Computer Vision CITS 4402 Computer Vision A/Prof Ajmal Mian Adj/A/Prof Mehdi Ravanbakhsh Lecture 06 Object Recognition Objectives To understand the concept of image based object recognition To learn how to match images

More information

Efficient Kernel Discriminant Analysis via QR Decomposition

Efficient Kernel Discriminant Analysis via QR Decomposition Efficient Kernel Discriminant Analysis via QR Decomposition Tao Xiong Department of ECE University of Minnesota txiong@ece.umn.edu Jieping Ye Department of CSE University of Minnesota jieping@cs.umn.edu

More information

Enhanced Fisher Linear Discriminant Models for Face Recognition

Enhanced Fisher Linear Discriminant Models for Face Recognition Appears in the 14th International Conference on Pattern Recognition, ICPR 98, Queensland, Australia, August 17-2, 1998 Enhanced isher Linear Discriminant Models for ace Recognition Chengjun Liu and Harry

More information

Manifold Learning and it s application

Manifold Learning and it s application Manifold Learning and it s application Nandan Dubey SE367 Outline 1 Introduction Manifold Examples image as vector Importance Dimension Reduction Techniques 2 Linear Methods PCA Example MDS Perception

More information

Discriminative Direction for Kernel Classifiers

Discriminative Direction for Kernel Classifiers Discriminative Direction for Kernel Classifiers Polina Golland Artificial Intelligence Lab Massachusetts Institute of Technology Cambridge, MA 02139 polina@ai.mit.edu Abstract In many scientific and engineering

More information

Automatic Subspace Learning via Principal Coefficients Embedding

Automatic Subspace Learning via Principal Coefficients Embedding IEEE TRANSACTIONS ON CYBERNETICS 1 Automatic Subspace Learning via Principal Coefficients Embedding Xi Peng, Jiwen Lu, Senior Member, IEEE, Zhang Yi, Fellow, IEEE and Rui Yan, Member, IEEE, arxiv:1411.4419v5

More information

Laplacian Eigenmaps for Dimensionality Reduction and Data Representation

Laplacian Eigenmaps for Dimensionality Reduction and Data Representation Introduction and Data Representation Mikhail Belkin & Partha Niyogi Department of Electrical Engieering University of Minnesota Mar 21, 2017 1/22 Outline Introduction 1 Introduction 2 3 4 Connections to

More information

Comparative Assessment of Independent Component. Component Analysis (ICA) for Face Recognition.

Comparative Assessment of Independent Component. Component Analysis (ICA) for Face Recognition. Appears in the Second International Conference on Audio- and Video-based Biometric Person Authentication, AVBPA 99, ashington D. C. USA, March 22-2, 1999. Comparative Assessment of Independent Component

More information

A Modified Incremental Principal Component Analysis for On-Line Learning of Feature Space and Classifier

A Modified Incremental Principal Component Analysis for On-Line Learning of Feature Space and Classifier A Modified Incremental Principal Component Analysis for On-Line Learning of Feature Space and Classifier Seiichi Ozawa 1, Shaoning Pang 2, and Nikola Kasabov 2 1 Graduate School of Science and Technology,

More information

ISSN: (Online) Volume 3, Issue 5, May 2015 International Journal of Advance Research in Computer Science and Management Studies

ISSN: (Online) Volume 3, Issue 5, May 2015 International Journal of Advance Research in Computer Science and Management Studies ISSN: 2321-7782 (Online) Volume 3, Issue 5, May 2015 International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online at:

More information

Face Detection and Recognition

Face Detection and Recognition Face Detection and Recognition Face Recognition Problem Reading: Chapter 18.10 and, optionally, Face Recognition using Eigenfaces by M. Turk and A. Pentland Queryimage face query database Face Verification

More information

Online Appearance Model Learning for Video-Based Face Recognition

Online Appearance Model Learning for Video-Based Face Recognition Online Appearance Model Learning for Video-Based Face Recognition Liang Liu 1, Yunhong Wang 2,TieniuTan 1 1 National Laboratory of Pattern Recognition Institute of Automation, Chinese Academy of Sciences,

More information

Manifold Coarse Graining for Online Semi-supervised Learning

Manifold Coarse Graining for Online Semi-supervised Learning for Online Semi-supervised Learning Mehrdad Farajtabar, Amirreza Shaban, Hamid R. Rabiee, Mohammad H. Rohban Digital Media Lab, Department of Computer Engineering, Sharif University of Technology, Tehran,

More information

Manifold Learning: Theory and Applications to HRI

Manifold Learning: Theory and Applications to HRI Manifold Learning: Theory and Applications to HRI Seungjin Choi Department of Computer Science Pohang University of Science and Technology, Korea seungjin@postech.ac.kr August 19, 2008 1 / 46 Greek Philosopher

More information

Image Analysis & Retrieval. Lec 14. Eigenface and Fisherface

Image Analysis & Retrieval. Lec 14. Eigenface and Fisherface Image Analysis & Retrieval Lec 14 Eigenface and Fisherface Zhu Li Dept of CSEE, UMKC Office: FH560E, Email: lizhu@umkc.edu, Ph: x 2346. http://l.web.umkc.edu/lizhu Z. Li, Image Analysis & Retrv, Spring

More information

Apprentissage non supervisée

Apprentissage non supervisée Apprentissage non supervisée Cours 3 Higher dimensions Jairo Cugliari Master ECD 2015-2016 From low to high dimension Density estimation Histograms and KDE Calibration can be done automacally But! Let

More information

Nonlinear Dimensionality Reduction

Nonlinear Dimensionality Reduction Nonlinear Dimensionality Reduction Piyush Rai CS5350/6350: Machine Learning October 25, 2011 Recap: Linear Dimensionality Reduction Linear Dimensionality Reduction: Based on a linear projection of the

More information

PARAMETERIZATION OF NON-LINEAR MANIFOLDS

PARAMETERIZATION OF NON-LINEAR MANIFOLDS PARAMETERIZATION OF NON-LINEAR MANIFOLDS C. W. GEAR DEPARTMENT OF CHEMICAL AND BIOLOGICAL ENGINEERING PRINCETON UNIVERSITY, PRINCETON, NJ E-MAIL:WGEAR@PRINCETON.EDU Abstract. In this report we consider

More information

Reconnaissance d objetsd et vision artificielle

Reconnaissance d objetsd et vision artificielle Reconnaissance d objetsd et vision artificielle http://www.di.ens.fr/willow/teaching/recvis09 Lecture 6 Face recognition Face detection Neural nets Attention! Troisième exercice de programmation du le

More information

Robot Image Credit: Viktoriya Sukhanova 123RF.com. Dimensionality Reduction

Robot Image Credit: Viktoriya Sukhanova 123RF.com. Dimensionality Reduction Robot Image Credit: Viktoriya Sukhanova 13RF.com Dimensionality Reduction Feature Selection vs. Dimensionality Reduction Feature Selection (last time) Select a subset of features. When classifying novel

More information

PCA & ICA. CE-717: Machine Learning Sharif University of Technology Spring Soleymani

PCA & ICA. CE-717: Machine Learning Sharif University of Technology Spring Soleymani PCA & ICA CE-717: Machine Learning Sharif University of Technology Spring 2015 Soleymani Dimensionality Reduction: Feature Selection vs. Feature Extraction Feature selection Select a subset of a given

More information

ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015

ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015 ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015 http://intelligentoptimization.org/lionbook Roberto Battiti

More information

A Modified Incremental Principal Component Analysis for On-line Learning of Feature Space and Classifier

A Modified Incremental Principal Component Analysis for On-line Learning of Feature Space and Classifier A Modified Incremental Principal Component Analysis for On-line Learning of Feature Space and Classifier Seiichi Ozawa, Shaoning Pang, and Nikola Kasabov Graduate School of Science and Technology, Kobe

More information

Introduction to Machine Learning. PCA and Spectral Clustering. Introduction to Machine Learning, Slides: Eran Halperin

Introduction to Machine Learning. PCA and Spectral Clustering. Introduction to Machine Learning, Slides: Eran Halperin 1 Introduction to Machine Learning PCA and Spectral Clustering Introduction to Machine Learning, 2013-14 Slides: Eran Halperin Singular Value Decomposition (SVD) The singular value decomposition (SVD)

More information

Image Analysis & Retrieval Lec 14 - Eigenface & Fisherface

Image Analysis & Retrieval Lec 14 - Eigenface & Fisherface CS/EE 5590 / ENG 401 Special Topics, Spring 2018 Image Analysis & Retrieval Lec 14 - Eigenface & Fisherface Zhu Li Dept of CSEE, UMKC http://l.web.umkc.edu/lizhu Office Hour: Tue/Thr 2:30-4pm@FH560E, Contact:

More information