Two-Dimensional Linear Discriminant Analysis

Jieping Ye, Department of CSE, University of Minnesota, jieping@cs.umn.edu
Ravi Janardan, Department of CSE, University of Minnesota, janardan@cs.umn.edu
Qi Li, Department of CIS, University of Delaware, qili@cis.udel.edu

Abstract

Linear Discriminant Analysis (LDA) is a well-known scheme for feature extraction and dimension reduction. It has been used widely in many applications involving high-dimensional data, such as face recognition and image retrieval. An intrinsic limitation of classical LDA is the so-called singularity problem, that is, it fails when all scatter matrices are singular. A well-known approach to deal with the singularity problem is to apply an intermediate dimension reduction stage using Principal Component Analysis (PCA) before LDA. The algorithm, called PCA+LDA, is used widely in face recognition. However, PCA+LDA has high costs in time and space, due to the need for an eigen-decomposition involving the scatter matrices. In this paper, we propose a novel LDA algorithm, namely 2DLDA, which stands for 2-Dimensional Linear Discriminant Analysis. 2DLDA overcomes the singularity problem implicitly, while achieving efficiency. The key difference between 2DLDA and classical LDA lies in the model for data representation. Classical LDA works with vectorized representations of data, while the 2DLDA algorithm works with data in matrix representation. To further reduce the dimension by 2DLDA, the combination of 2DLDA and classical LDA, namely 2DLDA+LDA, is studied, where LDA is preceded by 2DLDA. The proposed algorithms are applied to face recognition and compared with PCA+LDA. Experiments show that 2DLDA and 2DLDA+LDA achieve competitive recognition accuracy, while being much more efficient.

1 Introduction

Linear Discriminant Analysis [2, 4] is a well-known scheme for feature extraction and dimension reduction. It has been used widely in many applications such as face recognition [1], image retrieval [6], microarray data classification [3], etc.
Classical LDA projects the data onto a lower-dimensional vector space such that the ratio of the between-class distance to the within-class distance is maximized, thus achieving maximum discrimination. The optimal projection (transformation) can be readily computed by applying an eigen-decomposition on the scatter matrices. An intrinsic limitation of classical LDA is that its objective function requires the nonsingularity of one of the scatter matrices. For many applications, such as face recognition, all scatter matrices in question can be singular, since the data is from a very high-dimensional space and, in general, the dimension exceeds the
number of data points. This is known as the undersampled or singularity problem [5]. In recent years, many approaches have been brought to bear on such high-dimensional, undersampled problems, including pseudo-inverse LDA, PCA+LDA, and regularized LDA. More details can be found in [5]. Among these LDA extensions, PCA+LDA has received a lot of attention, especially for face recognition [1]. In this two-stage algorithm, an intermediate dimension reduction stage using PCA is applied before LDA. The common aspect of previous LDA extensions is the computation of an eigen-decomposition of certain large matrices, which not only degrades the efficiency but also makes it hard to scale them to large datasets. In this paper, we present a novel approach to alleviate the expensive computation of the eigen-decomposition in previous LDA extensions. The novelty lies in a different data representation model. Under this model, each datum is represented as a matrix, instead of as a vector, and the collection of data is represented as a collection of matrices, instead of as a single large matrix. This model has been used previously in [8, 9, 7] for the generalization of SVD and PCA. Unlike classical LDA, we consider the projection of the data onto a space which is the tensor product of two vector spaces. We formulate our dimension reduction problem as an optimization problem in Section 3. Unlike classical LDA, there is no closed-form solution for the optimization problem; instead, we derive a heuristic, namely 2DLDA. To further reduce the dimension, which is desirable for efficient querying, we consider the combination of 2DLDA and LDA, namely 2DLDA+LDA, where the dimension of the space transformed by 2DLDA is further reduced by LDA. We perform experiments on three well-known face datasets to evaluate the effectiveness of 2DLDA and 2DLDA+LDA and compare with PCA+LDA, which is used widely in face recognition.
Our experiments show that: (1) 2DLDA is applicable to high-dimensional undersampled data such as face images, i.e., it implicitly avoids the singularity problem encountered in classical LDA; and (2) 2DLDA and 2DLDA+LDA have distinctly lower costs in time and space than PCA+LDA, and achieve classification accuracy that is competitive with PCA+LDA.

2 An overview of LDA

In this section, we give a brief overview of classical LDA. Some of the important notations used in the rest of this paper are listed in Table 1. Given a data matrix $A \in \mathbb{R}^{N \times n}$, classical LDA aims to find a transformation $G \in \mathbb{R}^{N \times l}$ that maps each column $a_i$ of $A$, for $1 \le i \le n$, in the $N$-dimensional space to a vector $b_i$ in the $l$-dimensional space. That is,
$$G : a_i \in \mathbb{R}^N \rightarrow b_i = G^T a_i \in \mathbb{R}^l \quad (l < n).$$
Equivalently, classical LDA aims to find a vector space $\mathcal{G}$ spanned by $\{g_i\}_{i=1}^{l}$, where $G = [g_1, \dots, g_l]$, such that each $a_i$ is projected onto $\mathcal{G}$ by $(g_1^T a_i, \dots, g_l^T a_i)^T \in \mathbb{R}^l$. Assume that the original data in $A$ is partitioned into $k$ classes as $A = \{\Pi_1, \dots, \Pi_k\}$, where $\Pi_i$ contains $n_i$ data points from the $i$th class, and $\sum_{i=1}^{k} n_i = n$. Classical LDA aims to find the optimal transformation $G$ such that the class structure of the original high-dimensional space is preserved in the low-dimensional space. In general, if each class is tightly grouped, but well separated from the other classes, the quality of the clustering is considered to be high. In discriminant analysis, two scatter matrices, called the within-class ($S_w$) and between-class ($S_b$) matrices, are defined to quantify the quality of the clustering, as follows [4]:
$$S_w = \sum_{i=1}^{k} \sum_{x \in \Pi_i} (x - m_i)(x - m_i)^T, \qquad S_b = \sum_{i=1}^{k} n_i (m_i - m)(m_i - m)^T,$$
where $m_i = \frac{1}{n_i} \sum_{x \in \Pi_i} x$ is the mean of the $i$th class, and $m = \frac{1}{n} \sum_{i=1}^{k} \sum_{x \in \Pi_i} x$ is the global mean.
Notation: Description
$n$: number of images in the dataset
$k$: number of classes in the dataset
$A_i$: $i$th image in matrix representation
$a_i$: $i$th image in vectorized representation
$r$: number of rows in $A_i$
$c$: number of columns in $A_i$
$N$: dimension of $a_i$ ($N = r \times c$)
$\Pi_j$: $j$th class in the dataset
$L$: transformation matrix (left) by 2DLDA
$R$: transformation matrix (right) by 2DLDA
$I$: number of iterations in 2DLDA
$B_i$: reduced representation of $A_i$ by 2DLDA
$l_1$: number of rows in $B_i$
$l_2$: number of columns in $B_i$

Table 1: Notation

It is easy to verify that $\mathrm{trace}(S_w)$ measures the closeness of the vectors within the classes, while $\mathrm{trace}(S_b)$ measures the separation between classes. In the low-dimensional space resulting from the linear transformation $G$ (or the linear projection onto the vector space $\mathcal{G}$), the within-class and between-class matrices become $S_b^L = G^T S_b G$ and $S_w^L = G^T S_w G$. An optimal transformation $G$ would maximize $\mathrm{trace}(S_b^L)$ and minimize $\mathrm{trace}(S_w^L)$. Common optimizations in classical discriminant analysis include (see [4]):
$$\max_G \mathrm{trace}\big((S_w^L)^{-1} S_b^L\big) \quad \text{and} \quad \min_G \mathrm{trace}\big((S_b^L)^{-1} S_w^L\big). \qquad (1)$$
The optimization problems in Eq. (1) are equivalent to the following generalized eigenvalue problem: $S_b x = \lambda S_w x$, for $\lambda \neq 0$. The solution can be obtained by applying an eigen-decomposition to the matrix $S_w^{-1} S_b$, if $S_w$ is nonsingular, or $S_b^{-1} S_w$, if $S_b$ is nonsingular. There are at most $k - 1$ eigenvectors corresponding to nonzero eigenvalues, since the rank of the matrix $S_b$ is bounded from above by $k - 1$. Therefore, the reduced dimension by classical LDA is at most $k - 1$. A stable way to compute the eigen-decomposition is to apply SVD on the scatter matrices. Details can be found in [6]. Note that a limitation of classical LDA in many applications involving undersampled data, such as text documents and images, is that at least one of the scatter matrices is required to be nonsingular. Several extensions, including pseudo-inverse LDA, regularized LDA, and PCA+LDA, were proposed in the past to deal with the singularity problem. Details can be found in [5].
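For concreteness, the scatter-matrix construction and the generalized eigenvalue problem above can be sketched as follows. This is a minimal NumPy/SciPy illustration, not the authors' code; the function name `lda_fit` is ours, and it assumes $S_w$ is nonsingular (precisely the assumption that fails in the undersampled case).

```python
import numpy as np
from scipy.linalg import eig

def lda_fit(X, y, l):
    """Classical LDA sketch. X is N x n (one column per sample), y holds the
    class labels. Returns G (N x l) from the generalized eigenvalue problem
    Sb x = lambda Sw x, keeping the l largest eigenvalues."""
    N, n = X.shape
    m = X.mean(axis=1, keepdims=True)                 # global mean m
    Sw = np.zeros((N, N))
    Sb = np.zeros((N, N))
    for c in np.unique(y):
        Xc = X[:, y == c]
        mc = Xc.mean(axis=1, keepdims=True)           # class mean m_i
        Sw += (Xc - mc) @ (Xc - mc).T                 # within-class scatter
        Sb += Xc.shape[1] * (mc - m) @ (mc - m).T     # between-class scatter
    w, V = eig(Sb, Sw)                                # Sb x = lambda Sw x
    order = np.argsort(-w.real)                       # largest eigenvalues first
    return V[:, order[:l]].real
```

Since $\mathrm{rank}(S_b) \le k - 1$, at most $k - 1$ of the returned directions correspond to nonzero eigenvalues, matching the dimension bound stated above.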
3 2-Dimensional LDA

The key difference between classical LDA and the 2DLDA algorithm that we propose in this paper is in the representation of data. While classical LDA uses the vectorized representation, 2DLDA works with data in matrix representation. We will see later in this section that the matrix representation in 2DLDA leads to an eigen-decomposition on matrices with much smaller sizes. More specifically, 2DLDA involves the eigen-decomposition of matrices of sizes $r \times r$ and $c \times c$, which are much smaller than the matrices in classical LDA. This dramatically reduces the time and space complexities of 2DLDA relative to LDA.
Unlike classical LDA, 2DLDA considers the following $(l_1 \times l_2)$-dimensional space $\mathcal{L} \otimes \mathcal{R}$, which is the tensor product of the following two spaces: $\mathcal{L}$ spanned by $\{u_i\}_{i=1}^{l_1}$ and $\mathcal{R}$ spanned by $\{v_i\}_{i=1}^{l_2}$. Define two matrices $L = [u_1, \dots, u_{l_1}] \in \mathbb{R}^{r \times l_1}$ and $R = [v_1, \dots, v_{l_2}] \in \mathbb{R}^{c \times l_2}$. Then the projection of $X \in \mathbb{R}^{r \times c}$ onto the space $\mathcal{L} \otimes \mathcal{R}$ is $L^T X R \in \mathbb{R}^{l_1 \times l_2}$.

Let $A_i \in \mathbb{R}^{r \times c}$, for $i = 1, \dots, n$, be the $n$ images in the dataset, clustered into classes $\Pi_1, \dots, \Pi_k$, where $\Pi_i$ has $n_i$ images. Let $M_i = \frac{1}{n_i} \sum_{X \in \Pi_i} X$ be the mean of the $i$th class, $1 \le i \le k$, and $M = \frac{1}{n} \sum_{i=1}^{k} \sum_{X \in \Pi_i} X$ be the global mean. In 2DLDA, we consider images as two-dimensional signals and aim to find two transformation matrices $L \in \mathbb{R}^{r \times l_1}$ and $R \in \mathbb{R}^{c \times l_2}$ that map each $A_i \in \mathbb{R}^{r \times c}$, for $1 \le i \le n$, to a matrix $B_i \in \mathbb{R}^{l_1 \times l_2}$ such that $B_i = L^T A_i R$. Like classical LDA, 2DLDA aims to find the optimal transformations (projections) $L$ and $R$ such that the class structure of the original high-dimensional space is preserved in the low-dimensional space.

A natural similarity metric between matrices is the Frobenius norm [8]. Under this metric, the (squared) within-class and between-class distances $D_w$ and $D_b$ can be computed as follows:
$$D_w = \sum_{i=1}^{k} \sum_{X \in \Pi_i} \|X - M_i\|_F^2, \qquad D_b = \sum_{i=1}^{k} n_i \|M_i - M\|_F^2.$$
Using the property of the trace, that is, $\mathrm{trace}(M M^T) = \|M\|_F^2$ for any matrix $M$, we can rewrite $D_w$ and $D_b$ as follows:
$$D_w = \mathrm{trace}\Big(\sum_{i=1}^{k} \sum_{X \in \Pi_i} (X - M_i)(X - M_i)^T\Big), \qquad D_b = \mathrm{trace}\Big(\sum_{i=1}^{k} n_i (M_i - M)(M_i - M)^T\Big).$$
In the low-dimensional space resulting from the linear transformations $L$ and $R$, the within-class and between-class distances become
$$D_w = \mathrm{trace}\Big(\sum_{i=1}^{k} \sum_{X \in \Pi_i} L^T (X - M_i) R R^T (X - M_i)^T L\Big), \qquad D_b = \mathrm{trace}\Big(\sum_{i=1}^{k} n_i\, L^T (M_i - M) R R^T (M_i - M)^T L\Big).$$
The optimal transformations $L$ and $R$ would maximize $D_b$ and minimize $D_w$. Due to the difficulty of computing the optimal $L$ and $R$ simultaneously, we derive an iterative algorithm in the following. More specifically, for a fixed $R$, we can compute the optimal $L$ by solving an optimization problem similar to the one in Eq. (1).
With the computed $L$, we can then update $R$ by solving another optimization problem, again of the form in Eq. (1). Details are given below. The procedure is repeated a certain number of times, as discussed in Section 4.

Computation of L. For a fixed $R$, $D_w$ and $D_b$ can be rewritten as
$$D_w = \mathrm{trace}\big(L^T S_w^R L\big), \qquad D_b = \mathrm{trace}\big(L^T S_b^R L\big),$$
Algorithm 1: 2DLDA($A_1, \dots, A_n$, $l_1$, $l_2$)
Input: $A_1, \dots, A_n$, $l_1$, $l_2$
Output: $L$, $R$, $B_1, \dots, B_n$
1. Compute the mean $M_i$ of the $i$th class for each $i$ as $M_i = \frac{1}{n_i} \sum_{X \in \Pi_i} X$;
2. Compute the global mean $M = \frac{1}{n} \sum_{i=1}^{k} \sum_{X \in \Pi_i} X$;
3. $R_0 \leftarrow (I_{l_2}, 0)^T$;
4. For $j$ from 1 to $I$
5.   $S_w^R \leftarrow \sum_{i=1}^{k} \sum_{X \in \Pi_i} (X - M_i) R_{j-1} R_{j-1}^T (X - M_i)^T$, $S_b^R \leftarrow \sum_{i=1}^{k} n_i (M_i - M) R_{j-1} R_{j-1}^T (M_i - M)^T$;
6.   Compute the first $l_1$ eigenvectors $\{\phi_\ell^L\}_{\ell=1}^{l_1}$ of $(S_w^R)^{-1} S_b^R$;
7.   $L_j \leftarrow [\phi_1^L, \dots, \phi_{l_1}^L]$;
8.   $S_w^L \leftarrow \sum_{i=1}^{k} \sum_{X \in \Pi_i} (X - M_i)^T L_j L_j^T (X - M_i)$, $S_b^L \leftarrow \sum_{i=1}^{k} n_i (M_i - M)^T L_j L_j^T (M_i - M)$;
9.   Compute the first $l_2$ eigenvectors $\{\phi_\ell^R\}_{\ell=1}^{l_2}$ of $(S_w^L)^{-1} S_b^L$;
10.  $R_j \leftarrow [\phi_1^R, \dots, \phi_{l_2}^R]$;
11. EndFor
12. $L \leftarrow L_I$, $R \leftarrow R_I$;
13. $B_\ell \leftarrow L^T A_\ell R$, for $\ell = 1, \dots, n$;
14. return($L$, $R$, $B_1, \dots, B_n$).

where
$$S_w^R = \sum_{i=1}^{k} \sum_{X \in \Pi_i} (X - M_i) R R^T (X - M_i)^T, \qquad S_b^R = \sum_{i=1}^{k} n_i (M_i - M) R R^T (M_i - M)^T.$$
Similar to the optimization problem in Eq. (1), the optimal $L$ can be computed by solving the following optimization problem: $\max_L \mathrm{trace}\big((L^T S_w^R L)^{-1} (L^T S_b^R L)\big)$. The solution can be obtained by solving the following generalized eigenvalue problem: $S_b^R x = \lambda S_w^R x$. Since $S_w^R$ is in general nonsingular, the optimal $L$ can be obtained by computing an eigen-decomposition on $(S_w^R)^{-1} S_b^R$. Note that the size of the matrices $S_w^R$ and $S_b^R$ is $r \times r$, which is much smaller than the size of the matrices $S_w$ and $S_b$ in classical LDA.

Computation of R. Next, consider the computation of $R$, for a fixed $L$. A key observation is that $D_w$ and $D_b$ can be rewritten as
$$D_w = \mathrm{trace}\big(R^T S_w^L R\big), \qquad D_b = \mathrm{trace}\big(R^T S_b^L R\big),$$
where
$$S_w^L = \sum_{i=1}^{k} \sum_{X \in \Pi_i} (X - M_i)^T L L^T (X - M_i), \qquad S_b^L = \sum_{i=1}^{k} n_i (M_i - M)^T L L^T (M_i - M).$$
This follows from the following property of the trace: $\mathrm{trace}(AB) = \mathrm{trace}(BA)$, for any two matrices $A$ and $B$. Similarly, the optimal $R$ can be computed by solving the following optimization problem: $\max_R \mathrm{trace}\big((R^T S_w^L R)^{-1} (R^T S_b^L R)\big)$. The solution can be obtained by solving the following generalized eigenvalue problem: $S_b^L x = \lambda S_w^L x$. Since $S_w^L$ is in general nonsingular,
the optimal $R$ can be obtained by computing an eigen-decomposition on $(S_w^L)^{-1} S_b^L$. Note that the size of the matrices $S_w^L$ and $S_b^L$ is $c \times c$, much smaller than that of $S_w$ and $S_b$.

The pseudo-code for the 2DLDA algorithm is given in Algorithm 1. It is clear that the most expensive steps in Algorithm 1 are Lines 5, 8 and 13, and the total time complexity is $O(n \max(l_1, l_2)(r + c)^2 I)$, where $I$ is the number of iterations. The algorithm depends on the initial choice $R_0$. Our experiments show that choosing $R_0 = (I_{l_2}, 0)^T$, where $I_{l_2}$ is the identity matrix of size $l_2$, produces excellent results. We use this initial $R_0$ in all the experiments. Since the number of rows ($r$) and the number of columns ($c$) of an image $A_i$ are generally comparable, i.e., $r \approx c \approx \sqrt{N}$, we set $l_1$ and $l_2$ to a common value $d$ in the rest of this paper, for simplicity. However, the algorithm works in the general case. With this simplification, the time complexity of the 2DLDA algorithm becomes $O(ndNI)$. The space complexity of 2DLDA is $O(rc) = O(N)$. The key to the low space complexity of the 2DLDA algorithm is that the matrices $S_w^R$, $S_b^R$, $S_w^L$, and $S_b^L$ can be formed by reading the matrices $A_\ell$ incrementally.

3.1 2DLDA+LDA

As mentioned in the Introduction, PCA is commonly applied as an intermediate dimension-reduction stage before LDA to overcome the singularity problem of classical LDA. In this section, we consider the combination of 2DLDA and LDA, namely 2DLDA+LDA, where the dimension produced by 2DLDA is further reduced by LDA, since a small reduced dimension is desirable for efficient querying. More specifically, in the first stage of 2DLDA+LDA, each image $A_i \in \mathbb{R}^{r \times c}$ is reduced to $B_i \in \mathbb{R}^{d \times d}$ by 2DLDA, with $d < \min(r, c)$. In the second stage, each $B_i$ is first transformed to a vector $b_i \in \mathbb{R}^{d^2}$ (matrix-to-vector alignment), then $b_i$ is further reduced to a vector $b_i^L \in \mathbb{R}^{k-1}$ by LDA, with $k - 1 < d^2$, where $k$ is the number of classes. Here, matrix-to-vector alignment means that the matrix is transformed to a vector by concatenating all its rows together consecutively. The time complexity of the first stage, by 2DLDA, is $O(ndNI)$.
The second stage applies classical LDA to data in the $d^2$-dimensional space, and hence takes $O(n(d^2)^2)$, assuming $n > d^2$. Hence the total time complexity of 2DLDA+LDA is $O\big(nd(NI + d^3)\big)$.

4 Experiments

In this section, we experimentally evaluate the performance of 2DLDA and 2DLDA+LDA on face images and compare with PCA+LDA, which is used widely in face recognition. For PCA+LDA, we use 200 principal components in the PCA stage, as this produces good overall results. All of our experiments are performed on a P4 1.80GHz Linux machine with 1GB memory. For all the experiments, the 1-Nearest-Neighbor (1NN) algorithm is applied for classification, and ten-fold cross validation is used for computing the classification accuracy.

Datasets: We use three face datasets in our study: PIX, ORL, and PIE, which are publicly available. PIX (available at http://peipa.essex.ac.uk/ipa/pix/faces/manchester/testhard/) contains 300 face images of 30 persons. The image size is 512 x 512. We subsample the images down to a size of 100 x 100 = 10000. ORL (available at http://www.uk.research.att.com/facedatabase.html) contains 400 face images of 40 persons. The image size is 92 x 112. PIE is a subset of the CMU PIE face image dataset (available at http://www.ri.cmu.edu/projects/project_418.html). It contains 6615 face images of 63 persons. The image size is 640 x 480. We subsample the images down to a size of 220 x 175 = 38500. Note that PIE is much larger than the other two datasets.
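Before turning to the results, the iterative procedure of Algorithm 1 (Section 3) can be sketched in NumPy/SciPy as follows. This is an illustrative reimplementation under our own naming, not the authors' code; it assumes the within-class matrices $S_w^R$ and $S_w^L$ are nonsingular, as the text argues is generally the case.

```python
import numpy as np
from scipy.linalg import eig

def _top_eigvecs(Sw, Sb, num):
    # First `num` eigenvectors of Sw^{-1} Sb, via Sb x = lambda Sw x.
    w, V = eig(Sb, Sw)
    order = np.argsort(-w.real)
    return V[:, order[:num]].real

def two_dlda(As, y, l1, l2, n_iter=1):
    """Sketch of Algorithm 1 (2DLDA). As: array of n images of shape (r, c);
    y: length-n class labels. Returns L (r x l1), R (c x l2), and the
    reduced representations B_i = L^T A_i R."""
    As = np.asarray(As, dtype=float)
    y = np.asarray(y)
    n, r, c = As.shape
    classes = np.unique(y)
    M = As.mean(axis=0)                                   # global mean M
    Ms = {k: As[y == k].mean(axis=0) for k in classes}    # class means M_i
    R = np.vstack([np.eye(l2), np.zeros((c - l2, l2))])   # R_0 = (I_{l2}, 0)^T
    for _ in range(n_iter):
        SwR = np.zeros((r, r)); SbR = np.zeros((r, r))
        for A, yi in zip(As, y):
            D = A - Ms[yi]
            SwR += D @ R @ R.T @ D.T                      # line 5, within-class
        for k in classes:
            D = Ms[k] - M
            SbR += np.sum(y == k) * (D @ R @ R.T @ D.T)   # line 5, between-class
        L = _top_eigvecs(SwR, SbR, l1)                    # lines 6-7
        SwL = np.zeros((c, c)); SbL = np.zeros((c, c))
        for A, yi in zip(As, y):
            D = A - Ms[yi]
            SwL += D.T @ L @ L.T @ D                      # line 8, within-class
        for k in classes:
            D = Ms[k] - M
            SbL += np.sum(y == k) * (D.T @ L @ L.T @ D)   # line 8, between-class
        R = _top_eigvecs(SwL, SbL, l2)                    # lines 9-10
    Bs = np.stack([L.T @ A @ R for A in As])              # line 13
    return L, R, Bs
```

Note that only $r \times r$ and $c \times c$ matrices are ever decomposed, which is the source of the efficiency claimed above; with `n_iter=1` this matches the $I = 1$ setting used in the experiments.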
Figure 1: Effect of the number of iterations on 2DLDA and 2DLDA+LDA using the three face datasets: PIX, ORL and PIE (from left to right).

The impact of the number, I, of iterations: In this experiment, we study the effect of the number of iterations ($I$ in Algorithm 1) on 2DLDA and 2DLDA+LDA. The results are shown in Figure 1, where the x-axis denotes the number of iterations, and the y-axis denotes the classification accuracy. $d = 10$ is used for both algorithms. It is clear that both accuracy curves are stable with respect to the number of iterations. In general, the accuracy curves of 2DLDA+LDA are slightly more stable than those of 2DLDA. The key consequence is that we need to run the for loop (from Line 4 to Line 11) in Algorithm 1 only once, i.e., $I = 1$, which significantly reduces the total running time of both algorithms.

The impact of the value of the reduced dimension d: In this experiment, we study the effect of the value of $d$ on 2DLDA and 2DLDA+LDA, where the value of $d$ determines the dimensionality of the space transformed by 2DLDA. We did extensive experiments using different values of $d$ on the face image datasets. The results are summarized in Figure 2, where the x-axis denotes the values of $d$ (between 1 and 15) and the y-axis denotes the classification accuracy with 1-Nearest-Neighbor as the classifier. As shown in Figure 2, the accuracy curves on all datasets stabilize around $d = 4$ to 6.

Comparison on classification accuracy and efficiency: In this experiment, we evaluate the effectiveness of the proposed algorithms in terms of classification accuracy and efficiency and compare with PCA+LDA. The results are summarized in Table 2. We can observe that 2DLDA+LDA has similar performance as PCA+LDA in classification, while it outperforms 2DLDA. Hence the LDA stage in 2DLDA+LDA not only reduces the dimension, but also increases the accuracy.
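The 1NN classification step used throughout these experiments can be sketched as follows. This is a minimal illustration, not the authors' code; using the Frobenius norm of the difference as the distance between the reduced matrices $B_i$ is our assumption, consistent with the metric adopted in Section 3.

```python
import numpy as np

def one_nn_predict(B_train, y_train, B_test):
    """1-Nearest-Neighbor: assign each test item the label of its closest
    training item, with the Frobenius norm of the difference as distance."""
    preds = []
    for B in B_test:
        dists = [np.linalg.norm(B - Bt) for Bt in B_train]  # Frobenius norms
        preds.append(y_train[int(np.argmin(dists))])
    return preds
```

For vectorized representations (as produced by PCA+LDA or the second stage of 2DLDA+LDA), the same function applies unchanged, since `np.linalg.norm` on a 1-D difference is the Euclidean distance.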
Another key observation from Table 2 is that 2DLDA is almost one order of magnitude faster than PCA+LDA, while the running time of 2DLDA+LDA is close to that of 2DLDA. Hence 2DLDA+LDA is a more effective dimension reduction algorithm than PCA+LDA, as it is competitive with PCA+LDA in classification and has the same number of reduced dimensions in the transformed space, while it has much lower time and space costs.

5 Conclusions

An efficient algorithm, namely 2DLDA, is presented for dimension reduction. 2DLDA is an extension of LDA. The key difference between 2DLDA and LDA is that 2DLDA works on the matrix representation of images directly, while LDA uses a vector representation. 2DLDA has asymptotically minimum memory requirements, and lower time complexity than LDA, which is desirable for large face datasets, while it implicitly avoids the singularity problem encountered in classical LDA. We also study the combination of 2DLDA and LDA, namely 2DLDA+LDA, where the dimension produced by 2DLDA is further reduced by LDA. Experiments show that 2DLDA and 2DLDA+LDA are competitive with PCA+LDA in terms of classification accuracy, while they have significantly lower time and space costs.
Figure 2: Effect of the value of the reduced dimension d on 2DLDA and 2DLDA+LDA using the three face datasets: PIX, ORL and PIE (from left to right).

Dataset | PCA+LDA | Time (Sec) | 2DLDA | Time (Sec) | 2DLDA+LDA | Time (Sec)
PIX     | 98.00%  | 7.73       | 97.33% | 1.69      | 98.50%    | 1.73
ORL     | 97.75%  | 12.5       | 97.50% | 2.14      | 98.00%    | 2.19
PIE     | --      | --         | 99.32% | 153       | 100%      | 157

Table 2: Comparison on classification accuracy and efficiency. "--" means that PCA+LDA is not applicable for PIE, due to its large size. Note that PCA+LDA involves an eigen-decomposition of the scatter matrices, which requires the whole data matrix to reside in main memory.

Acknowledgment: Research of J. Ye and R. Janardan is sponsored, in part, by the Army High Performance Computing Research Center under the auspices of the Department of the Army, Army Research Laboratory cooperative agreement number DAAD19-01-2-0014, the content of which does not necessarily reflect the position or the policy of the government, and no official endorsement should be inferred.

References
[1] P.N. Belhumeur, J.P. Hespanha, and D.J. Kriegman. Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):711-720, 1997.
[2] R.O. Duda, P.E. Hart, and D. Stork. Pattern Classification. Wiley, 2000.
[3] S. Dudoit, J. Fridlyand, and T.P. Speed. Comparison of discrimination methods for the classification of tumors using gene expression data. Journal of the American Statistical Association, 97(457):77-87, 2002.
[4] K. Fukunaga. Introduction to Statistical Pattern Recognition. Academic Press, San Diego, California, USA, 1990.
[5] W.J. Krzanowski, P. Jonathan, W.V. McCarthy, and M.R. Thomas. Discriminant analysis with singular covariance matrices: methods and applications to spectroscopic data. Applied Statistics, 44:101-115, 1995.
[6] Daniel L. Swets and Juyang Weng.
Using discriminant eigenfeatures for image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(8):831-836, 1996.
[7] J. Yang, D. Zhang, A.F. Frangi, and J.Y. Yang. Two-dimensional PCA: a new approach to appearance-based face representation and recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(1):131-137, 2004.
[8] J. Ye. Generalized low rank approximations of matrices. In ICML Conference Proceedings, pages 887-894, 2004.
[9] J. Ye, R. Janardan, and Q. Li. GPCA: An efficient dimension reduction scheme for image compression and retrieval. In ACM SIGKDD Conference Proceedings, pages 354-363, 2004.