Using LSTM Network in Face Classification Problems

Débora C. Corrêa, Denis H. P. Salvadeo, Alexandre L. M. Levada, José H. Saito, Nelson D. A. Mascarenhas, Jander Moreira
Departamento de Computação, Universidade Federal de São Carlos, SP-Brasil
Instituto de Física, Universidade de São Paulo, São Carlos, SP-Brasil
debora_correa@dc.ufscar.br, denissalvadeo@dc.ufscar.br, alexandreluis@ursa.ifsc.usp.br, saito@dc.ufscar.br, nelson@dc.ufscar.br, jander@dc.ufscar.br

Abstract

Many researchers have used convolutional neural networks for face classification tasks. Aiming to reduce the number of training samples as well as the training time, we propose to use an LSTM network and compare its performance with a standard MLP network. Experiments with face images from the CBCL database, using PCA for feature extraction, provided good results, indicating that the LSTM network can learn properly even with a reduced training set and that its performance is much better than that of the MLP.

1. Introduction

In recent decades, the human face has been explored in a variety of neural network, computer vision and pattern recognition applications. Many technical challenges exist in face classification problems. Some of the principal difficulties are [1]: large variability, highly complex nonlinear manifolds, high dimensionality and small sample size.

Many approaches use convolutional artificial neural networks for face classification tasks, for example the Neocognitron network [2-3]. Our motivation to use other network models is to reduce the number of training samples and to improve classification performance.

It is well known that artificial neural networks (ANNs), also known as connectionist systems, represent a non-algorithmic form of computation inspired by the structure and processing of the human brain [4]. In this non-algorithmic approach, computation is performed by a set of simple processing units, the neurons, connected in a network and acting in parallel. The neurons are connected by weights, which store the network knowledge. To represent a desired solution of a problem, the ANNs go through a training (learning) phase, which consists of presenting a set of examples (the training dataset) to the network so that it can extract the features necessary to represent the given information [4].

In this work, our objective is first to use the LSTM (Long Short-Term Memory) network for face classification tasks and to check how suitable it is for this kind of application. Secondly, we compare the results obtained by the LSTM with a traditional MLP (Multi-Layered Perceptron) network in order to show that LSTM networks are more capable of learning in the presence of long-term dependencies in the input data. Besides, LSTM networks are faster than MLPs in the learning phase. A first study on the use of these networks for face classification is reported in [5].

Statistical techniques based on principal component analysis (eigenfaces) are effective in reducing the dimensionality of face images. In this work we chose PCA as the tool for feature extraction.

The remainder of the paper is organized as follows: Section 2 describes the LSTM network; Section 3 describes the principal concepts of PCA, the feature extraction technique; Section 4 presents the proposed methodology and the results; and Section 5 presents the final remarks and conclusions.

2. LSTM Neural Network

The LSTM network is an alternative recurrent neural network architecture inspired by human memory systems.
The principal motivation is to solve the vanishing-gradient problem by enforcing constant error flow through constant error carousels (CECs) within special units, thereby permitting non-decaying error flow back in time [6] [7]. The CECs are the units responsible for keeping the error signal. This enables the network to learn important data and store it without degradation over long periods of time. Also, as we verified in the experiments, this feature improves the learning process by decreasing training time and reducing the mean square error (training error). In an LSTM network there are memory blocks instead of hidden neurons (Figure 1(a)). A memory block is formed by one or more memory cells and a pair of adaptive, multiplicative gating units which gate input and output to all cells in the block (Figure 1(b)), controlling what the network is supposed to learn. An illustration of a memory block with one memory cell is shown in Figure 2.
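As a concrete illustration of the gating mechanism described above, the sketch below (Python/NumPy) performs one forward step of a simplified memory block with a single cell: an input gate and an output gate multiplicatively gate the cell, and the CEC state is carried over through a self-connection of fixed weight 1. This is only a minimal sketch of the original LSTM formulation without forget gates; the weight names, shapes and the omission of recurrent and cross-block connections are simplifying assumptions of ours, not the exact architecture used in the experiments of Section 4.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def memory_block_step(x, s_prev, W_c, W_in, W_out):
    """One forward step of a simplified LSTM memory block with a single cell.

    x      : input vector (e.g. the PCA feature vector of one face)
    s_prev : previous CEC state of the cell
    W_c, W_in, W_out : weights of the cell input, input gate and output gate
    """
    g = np.tanh(W_c @ x)      # squashed cell input
    i = sigmoid(W_in @ x)     # input gate: how much is written into the cell
    o = sigmoid(W_out @ x)    # output gate: how much is exposed to the rest of the net
    s = s_prev + i * g        # CEC: self-connection with fixed weight 1
    y = o * np.tanh(s)        # gated output of the memory block
    return y, s

# toy usage: 5 input features, one memory cell, random illustrative weights
rng = np.random.default_rng(0)
x = rng.standard_normal(5)
W_c, W_in, W_out = (rng.standard_normal(5) for _ in range(3))
y, s = memory_block_step(x, s_prev=0.0, W_c=W_c, W_in=W_in, W_out=W_out)
```

Because the state update adds the gated input to the previous state instead of scaling it by a weight smaller than one, the stored value, and the error propagated through it, does not decay over time; this is precisely the property the CEC is designed to provide.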

Figure 1: Left: recurrent neural network with one recurrent hidden layer. Right: LSTM with memory blocks in the hidden layer (only one is shown) [6].

Figure 2: LSTM memory block with one memory cell [6].

LSTM networks have been used in many applications, such as speech recognition, function approximation and music composition, among others. For a detailed explanation of the LSTM network forward and backward passes, see reference [6] and the work of Hochreiter & Schmidhuber [7].

3. Principal Component Analysis

Principal Component Analysis is the technique that implements the Karhunen-Loève Transform, or Hotelling Transform, a classical unsupervised second-order method that uses the eigenvalues and eigenvectors of the covariance matrix to transform the feature space, creating orthogonal, uncorrelated features. It is a second-order method because all the necessary information is available directly from the covariance matrix of the data and no information regarding probability distributions is needed. In the multivariate Gaussian case, the transformed feature space corresponds to the space generated by the principal axes of the hyper-ellipsoid that defines the distribution. Figure 3, obtained from [8], shows a 2-D example. The principal components are now the projections of the data onto the two main axes, φ1 and φ2. Besides, the variances of the components, given by the eigenvalues λ_i, are distinct in most applications, with a considerable number of them so small that they can be excluded. The selected principal components define the vector y. The objective is to find the new basis vectors by optimizing certain mathematical criteria.

Figure 3: Graphical illustration of the Karhunen-Loève Transform for the 2-D Gaussian case.

Mathematically, we can express the rotation of the coordinate system defined by the Karhunen-Loève Transform by an orthonormal matrix Z = [\Phi, S], with dimensions N \times N, where \Phi = [w_1, w_2, \ldots, w_M] represents the axes of the new system and S = [w_{M+1}, w_{M+2}, \ldots, w_N], with dimensions N \times (N - M), denotes the axes of the components eliminated during the dimensionality reduction. The orthonormality conditions imply that w_j^T w_k = 0 for j \neq k, and w_j^T w_k = 1 for j = k. Now, it is possible to write the N-dimensional vector x in the new basis as:

x = \sum_{j=1}^{N} (x^T w_j) w_j = \sum_{j=1}^{N} c_j w_j    (1)

where c_j is the inner product between x and w_j. Then, the new M-dimensional vector y is obtained by the following transformation:

y = \Phi^T x = [w_1, w_2, \ldots, w_M]^T x = [c_1, c_2, \ldots, c_M]^T    (2)

Thus, PCA seeks a linear transformation that

maximizes the variance of the projected data or, in mathematical terms, optimizes the following criterion, where C_X is the covariance matrix of the observations:

J_{PCA}(w) = E\{y^2\} = E\{c^2\}    (3)

However, it is known that c = x^T w, and therefore:

J_{PCA}(w) = E\{w^T x x^T w\} = w^T E\{x x^T\} w = w^T C_X w    (4)

subject to \|w\| = 1, which defines an optimization problem. The solution to this problem can be obtained using Lagrange multipliers. In this case, we have:

J_{PCA}(w, \gamma) = w^T C_X w - \gamma (w^T w - 1)    (5)

Differentiating the above expression with respect to w and setting the result to zero leads to the following result [9]:

C_X w = \lambda w    (6)

Therefore, we have an eigenvector problem, which means that the vectors w of the new basis that maximize the variance of the transformed data are the eigenvectors of the covariance matrix C_X.

Another characteristic of PCA is that it minimizes the mean square error (MSE) during the dimensionality reduction. In this sense, PCA tries to obtain a set of M basis vectors (M < N) that span an M-dimensional subspace in which the mean square error between this new representation and the original one is minimum. The projection of x onto the subspace spanned by the vectors w_j, j = 1, \ldots, M, is given by equation (2), and thus the MSE criterion can be defined as:

J_{MSE}^{PCA}(w_j) = E\left\{ \left\| x - \sum_{j=1}^{M} (x^T w_j) w_j \right\|^2 \right\}    (7)

Considering that the data are centralized (the mean vector is null) and due to the orthonormal basis, equation (7) is further simplified to:

J_{MSE}^{PCA}(w_j) = E\{\|x\|^2\} - \sum_{j=1}^{M} E\{w_j^T x x^T w_j\} = E\{\|x\|^2\} - \sum_{j=1}^{M} w_j^T C_X w_j    (8)

As the first term does not depend on w_j, in order to minimize the MSE we have to maximize w_j^T C_X w_j. As in equation (5), this optimization problem is solved using Lagrange multipliers. Thus, inserting equation (6) into (8) leads to:

J_{MSE}^{PCA}(w_j) = E\{\|x\|^2\} - \sum_{j=1}^{M} \lambda_j    (9)

This result shows that, in order to minimize the MSE, we must choose the eigenvectors associated with the largest eigenvalues of the covariance matrix. Finally, the PCA criteria are very effective in terms of data compression and are often used for data classification.

4. Methodology

We trained LSTM and MLP networks to perform the following face classification tasks: face or non-face classification, face authentication and gender classification. To test and evaluate the performance of the networks on these tasks, we used images from the MIT-CBCL (Center for Biological and Computational Learning) face recognition database #1, available at [10]. The CBCL FACE DATABASE #1 consists of a training set of 2,429 face images and a test set of 472 face images with spatial dimensions of 19 x 19 pixels. Each 19 x 19 image was transformed into a 1-D signal of 361 elements. We call this representation the face descriptor. Figure 4 shows some template faces of the training set. Figure 5 shows the face descriptors corresponding to 4 images of the template set. Each one of them represents a 361-D input vector x.

Figure 4: MIT-CBCL DATABASE #1 example faces.
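To make the feature-extraction step concrete, the following sketch (Python/NumPy, with illustrative function names of our own) finds a PCA basis for a set of 361-D face descriptors by eigendecomposition of the covariance matrix, as prescribed by equation (6), and projects a descriptor onto the M leading eigenvectors as in equation (2). It is a minimal sketch under these assumptions, not the authors' MATLAB code.

```python
import numpy as np

def pca_basis(X, m):
    """Find the m leading principal directions of the descriptors in X.

    X : array of shape (n_samples, 361), one face descriptor per row.
    Returns the mean descriptor and a (361, m) matrix Phi whose columns are
    the eigenvectors of C_X with the largest eigenvalues (equation (6)).
    """
    mean = X.mean(axis=0)
    Xc = X - mean                          # centralize: null mean vector
    C = np.cov(Xc, rowvar=False)           # covariance matrix C_X
    eigvals, eigvecs = np.linalg.eigh(C)   # solves C_X w = lambda w
    order = np.argsort(eigvals)[::-1]      # sort by decreasing variance
    Phi = eigvecs[:, order[:m]]
    return mean, Phi

def project(x, mean, Phi):
    """y = Phi^T (x - mean): the m-D feature vector fed to the networks (eq. (2))."""
    return Phi.T @ (x - mean)

# usage with random stand-in data (the real input would be the 19 x 19 images
# flattened into 361-element face descriptors):
X = np.random.rand(100, 361)
mean, Phi = pca_basis(X, m=5)
y = project(X[0], mean, Phi)   # 5 principal components of the first descriptor
```

Note that with fewer than 361 centred descriptors the 361 x 361 covariance matrix is rank-deficient, so at most n_samples - 1 eigenvalues are non-zero; this does not affect the 5, 10 and 20 components used in the experiments below.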

Figure 5: Examples of face descriptors for template patterns: a) Template 3; b) Template 4.

The experiments were executed using MATLAB. We applied PCA to reduce the dimensionality of the input patterns (361-D), so that we can avoid the problems caused by high-dimensional data. We trained the LSTM and MLP neural networks using 5, 10 and 20 principal components in order to compare the training time, the mean square error obtained by the networks in the training phase and the performance in the application phase. The number of units in each layer of the network depends on the number of principal components obtained when applying PCA. The experiments were executed on a computer with the following specifications: Intel Core Duo processor, 1.66 GHz, 667 MHz FSB, 2 MB L2 cache, 1 GB DDR2 RAM.

4.1. LSTM and MLP architectures

The LSTM network model used in the experiments is illustrated in Figure 6 (only a limited subset of connections is shown). We observed that the network performs better if there are direct connections from the input neurons to the output ones (connections without weights), and if the memory cells are self-connected and their outputs also feed memory cells in the same memory block and memory cells in other memory blocks. We used the weight initialization proposed by Correa, Levada and Saito [11]. In that work, the behavior of the hidden units of an LSTM network applied to function approximation is described in detail. Based on this study, they propose a method to initialize part of the network weights in order to improve and stabilize the training process.

Figure 6: LSTM network architecture [5].

The MLP network model used in the experiments is illustrated in Figure 7. It has one input layer, one hidden layer and one output layer. It was trained with the standard back-propagation algorithm.

Figure 7: MLP network architecture [4].

4.2. Experiments and Results

For face or non-face classification, we selected 100 images (50 representing face templates and 50 representing non-face templates) from the training set. The networks are trained to output 1 if the input is a face and 0 otherwise. We stopped the LSTM network training when the MSE was smaller than 10^-2. Then, we trained the MLP network for the same number of epochs needed by the LSTM network. Clearly, the MLP network obtained a much larger MSE and spent more time in the training phase (see Tables 1 and 2).
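The tables that follow report the number of correct classifications (CC), incorrect classifications (IC) and the correct classification rate (CCR). As a reference for how such figures are obtained from the network outputs, the short sketch below computes them assuming a 0.5 decision threshold on the output unit; the threshold value is our assumption, since the paper only states that the target outputs are 1 and 0.

```python
import numpy as np

def classification_report(targets, outputs, threshold=0.5):
    """Compute CC, IC and CCR for a two-class task with targets coded 1/0.

    The continuous network output is turned into a hard decision by
    thresholding (assumed at 0.5); CCR is returned as a percentage.
    """
    targets = np.asarray(targets)
    predictions = (np.asarray(outputs) >= threshold).astype(int)
    cc = int(np.sum(predictions == targets))   # correct classifications (CC)
    ic = targets.size - cc                     # incorrect classifications (IC)
    return cc, ic, 100.0 * cc / targets.size   # CCR in percent

# illustrative values only (not results from the paper):
cc, ic, ccr = classification_report([1, 0, 1, 0], [0.9, 0.2, 0.4, 0.1])
print(cc, ic, ccr)   # 3 1 75.0
```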

Later, we chose another 200 images (100 from each class, face and non-face) from the test set. As the MLP network could not learn properly, it obtained a much higher incorrect classification rate. The obtained results are shown in Tables 1 and 2, where CC stands for correct classification, IC for incorrect classification and CCR for correct classification rate.

Table 1: Face vs. non-face: LSTM training and classification

          Epochs   MSE     Time (s)   CC    IC    CCR
5 PCA     80       0.006   35         170   30    85%
10 PCA    330      0.007   959        156   44    78%
20 PCA    400      0.007   4768       142   58    71%

Table 2: Face vs. non-face: MLP training and classification

          Epochs   MSE     Time (s)   CC    IC    CCR
5 PCA     80       49.87   848        100   100   50%
10 PCA    330      49.9    787        100   100   50%
20 PCA    400      50.0    6534       100   100   50%

For gender classification we selected 32 images (16 male faces and 16 female faces) for the training phase. The networks are trained to output 1 if the input is a male face and 0 otherwise. Again, we trained both networks with the same number of epochs to compare their performance. In the application phase, 32 different faces of the same individuals (with different positions, expressions or illumination) are presented to the networks to be classified. The results obtained for the LSTM are shown in Table 3. Although the MLP network obtained a correct classification rate of 50%, we noted that in all experiments all faces were classified as belonging to the same class. That is, it classified all faces as male or all faces as female, depending on the situation, as can be observed in Table 4.

Table 3: Gender: LSTM training and classification

          Epochs   MSE     Time (s)   CC   IC   CCR
5 PCA     600      0.006   85         32   0    100%
10 PCA    600      0.005   635        32   0    100%
20 PCA    50       0.007   946        30   2    93.75%

Table 4: Gender: MLP training and classification

          Epochs   MSE     Time (s)   CC   IC   CCR
5 PCA     600      7.76    508        16   16   50%
10 PCA    600      7.74    873        16   16   50%
20 PCA    50       8       78         16   16   50%

For the authentication problem, we selected 50 faces of different individuals to represent the classes to be classified by the LSTM and MLP networks. Later, we chose another 50 faces of the same persons (with different positions, expressions or illumination) from the test set. We verified that the LSTM can learn the classes properly even with one sample of each class and a reduced feature set, as presented in Table 5.

Table 5: Authentication: LSTM and MLP classification

        10 principal comp.       20 principal comp.
        CC    IC    CCR          CC    IC    CCR
LSTM    48    2     96%          44    6     88%
MLP     5     45    10%          2     48    4%

5. Conclusions

In this work, we proposed to use an LSTM network for face classification problems and compared its performance with that of a standard MLP network. The LSTM network presented better performance in terms of training time, mean square error and correct classification rate in all three proposed face classification tasks, showing that it is a powerful tool for pattern recognition applications, even when dealing with a reduced training set.

6. Acknowledgements

We would like to thank FAPESP for the financial support through Alexandre L. M. Levada's student scholarship (process nº 06/07-4), CNPq for the financial support through Denis H. P. Salvadeo's scholarship, and CAPES for Débora C. Corrêa's student scholarship.

7. References

[1] A. K. Jain and S. Z. Li, Handbook of Face Recognition, Springer-Verlag New York, Inc., 2005.

[2] K. Fukushima, "A Neural Network for Visual Pattern Recognition", Computer, vol. 21, no. 3, pp. 65-75, 1988.

[3] C. O. Santana, J. H. Saito, "Reconhecimento Facial utilizando a Rede Neural Neocognitron", In: Proceedings of the III Workshop de Visão Computacional, 2007 (in Portuguese).

[4] S. Haykin, Neural Networks: A Comprehensive Foundation, Prentice Hall, 2nd edition, 1998.

[5] A. L. M. Levada, D. C. Correa, D. H. P. Salvadeo, J. H. Saito, N. D. A. Mascarenhas, "Novel Approaches for Face Recognition: Template-Matching using Dynamic Time Warping and LSTM Neural Network Supervised Classification", In: Proceedings of the 15th International Conference on Systems, Signals and Image Processing (IWSSIP 2008), Bratislava: House STU, 2008.

[6] F. Gers, Long Short-Term Memory in Recurrent Neural Networks, PhD thesis, 2001.

[7] S. Hochreiter, J. Schmidhuber, "Long Short-Term Memory", Neural Computation, 9(8):1735-1780, 1997.

[8] K. Fukunaga, Introduction to Statistical Pattern Recognition, 2nd ed., Academic Press, 1990.

[9] T. Y. Young, T. W. Calvert, Classification, Estimation, and Pattern Recognition, Elsevier, 1974.

[10] CBCL Face Database #1, MIT Center for Biological and Computational Learning.

[11] D. C. Corrêa, A. L. M. Levada, J. H. Saito, "Stabilizing and Improving the Learning Speed of 2-Layered LSTM Network", In: Proceedings of the 2008 IEEE 11th International Conference on Computational Science and Engineering, IEEE Computer Society, pp. 293-300, 2008.