CPM: A Covariance-preserving Projection Method

Jieping Ye (Department of Computer Science and Engineering, Arizona State University), Tao Xiong (Department of Electrical and Computer Engineering, University of Minnesota), Ravi Janardan (Department of Computer Science and Engineering, University of Minnesota)

Abstract

Dimension reduction is critical in many areas of data mining and machine learning. In this paper, a Covariance-preserving Projection Method (CPM for short) is proposed for dimension reduction. CPM maximizes the class discrimination and also approximately preserves the class covariance. The optimization involved in CPM can be formulated as low rank approximations of a collection of matrices, which can be solved iteratively. Our theoretical and empirical analysis reveals the relationship between CPM and Linear Discriminant Analysis (LDA), the Sliced Average Variance Estimator (SAVE), and Heteroscedastic Discriminant Analysis (HDA). This gives us new insights into the nature of these different algorithms. We use both synthetic and real-world datasets to evaluate the effectiveness of the proposed algorithm.

Keywords: Dimension reduction, linear discriminant analysis, heteroscedastic discriminant analysis, covariance.

1 Introduction

Linear Discriminant Analysis (LDA) [4, 7, 8] is a well-known scheme for feature extraction and dimension reduction. LDA projects the data onto a lower-dimensional vector space such that the ratio of the between-class distance to the within-class distance is maximized, thus achieving maximum discrimination. LDA is equivalent to maximum likelihood classification assuming a normal distribution for each class with a common covariance matrix. It has been applied successfully in many areas, such as computer vision and bioinformatics [1, 6, 11, 15]. However, LDA has several limitations: (1) it fails to recover the features for classification when all class centroids coincide; (2) it may not find the best projection when the class covariance matrices vary; and (3) the reduced dimension of LDA is no larger than k - 1 (k denotes the number of classes), which may not be sufficient for complex data.

Generalization of LDA by fitting Gaussian mixtures to each class has been studied by Hastie and Tibshirani [10]. Cook et al. [2, 3] proposed the Sliced Average Variance Estimator (SAVE), which is shown to be capable of dealing with the limitations of LDA. Zhu and Hastie [16] developed a general method for finding important discriminant directions without assuming that the class densities belong to any particular parametric family. Kumar and Andreou [13] proposed HDA, based on a different model in which the classes are still Gaussian yet are allowed to have different covariance matrices, under the condition that both the centroids and the covariance matrices coincide in a subspace of the observation space.

In this paper, we propose a new algorithm for dimension reduction, called CPM (which stands for Covariance-preserving Projection Method). CPM aims to maximize the class discrimination and at the same time approximately preserve the class covariance, by applying a tuning parameter α between 0 and 1. One key feature of CPM is that the critical information on the class covariance is preserved under the projection. With a properly chosen tuning parameter α, which may be dependent on the data distribution, the CPM algorithm is able to deal with difficult situations encountered in LDA. In practice, the best value of α can be estimated by cross-validation.

To illustrate the difference between CPM and LDA, we generated a synthetic dataset with 10 dimensions and 3 classes as in [16].
Fig. 1 (top) shows the first two coordinates. For the first two coordinates, class 1 is simulated from a standard multivariate Gaussian, while classes 2 and 3 are mixtures of two symmetrically shifted standard Gaussians. In the remaining 8 coordinates, zero-mean Gaussian noise is used for all three classes. It is clear from Fig. 1 (middle) that LDA fails to extract important features because the class centroids coincide. CPM separates the three classes completely, as shown in Fig. 1 (bottom), which demonstrates the advantage of incorporating the class covariance information in CPM.
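The Python sketch below generates data in the spirit of this example. The per-class sample size, the mixture shift, and the coordinate along which each class is shifted are illustrative assumptions rather than the paper's exact settings, and the function name is ours; the key property, identical class centroids with different class covariances, holds by construction.

```python
import numpy as np

def make_coinciding_centroids_data(n_per_class=200, shift=3.0, noise_dim=8, seed=0):
    """Three 10-dimensional classes whose centroids all coincide at the origin.

    Class 1: standard 2-D Gaussian in the first two coordinates.
    Classes 2 and 3: balanced mixtures of two standard Gaussians shifted
    symmetrically by +/- `shift` (class 2 along the first coordinate,
    class 3 along the second), so each class mean is still the origin.
    The remaining `noise_dim` coordinates are zero-mean Gaussian noise.
    """
    rng = np.random.default_rng(seed)
    X, y = [], []
    for label in (1, 2, 3):
        first_two = rng.standard_normal((n_per_class, 2))
        if label in (2, 3):
            signs = rng.choice([-1.0, 1.0], size=n_per_class)
            first_two[:, label - 2] += signs * shift   # axis 0 for class 2, axis 1 for class 3
        noise = rng.standard_normal((n_per_class, noise_dim))
        X.append(np.hstack([first_two, noise]))
        y.append(np.full(n_per_class, label))
    return np.vstack(X), np.concatenate(y)

X, y = make_coinciding_centroids_data()
print(X.shape)                                                        # (600, 10)
print(np.round([X[y == c].mean(axis=0)[:2] for c in (1, 2, 3)], 2))   # all centroids near (0, 0)
```

Because every class centroid sits at the origin, $S_b \approx 0$ and LDA has essentially nothing to maximize, while the class covariances still differ, which is exactly the information CPM exploits.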

The optimization (maximization) problem involved in CPM is nonlinear and difficult to solve. We formulate a lower bound for the criterion function used in CPM. The maximization of the derived lower bound can be formulated as low rank approximations of a collection of symmetric and positive semi-definite matrices, a special case of the low rank approximations in [5, 14]. To the best of our knowledge, there is no closed form solution. We derive an iterative algorithm, which updates the projection successively and converges to a local optimum. Unlike LDA, the reduced dimension d of CPM can be larger than k - 1.

We also study the relationship between CPM and SAVE and HDA. SAVE is shown to be closely related to a special case of CPM with α = 0.5. Recall that the parameter α controls the tradeoff between the separation of the class centroids and the preservation of the class covariances. The optimal value of α may depend on the data distribution. CPM is more flexible in dealing with different situations by varying the value of α than SAVE, where α is fixed to be 0.5. Our theoretical analysis also reveals that CPM is an approximation of HDA. Note that HDA involves complex and nonlinear optimizations [13]. It has been observed in [13] that HDA has a high computational cost, especially for large datasets, and that HDA may not complete, since the numerical optimization procedure in HDA may not converge. We show that the optimization problem involved in CPM is simpler and much easier to solve than that of HDA. The theoretical analysis further justifies the use of the covariance information in CPM and gives us new insights into the nature of these different algorithms.

We perform experiments on both synthetic and real-world datasets to evaluate the effectiveness of CPM and compare it with other well-known algorithms. Our experiments show that: (1) CPM is able to recover the features for classification, even when all class centroids coincide or the covariance matrices vary; (2) CPM is competitive with SAVE and LDA in classification; (3) CPM is comparable to HDA in classification; and (4) HDA does not complete in several cases. Thus CPM may be a good alternative to HDA as a generalization of LDA for difficult situations where all class centroids coincide or the covariance matrices vary.

Figure 1: Top: original data (the first two coordinates); Middle: projection by LDA; Bottom: projection by CPM. Note that the scales for the different algorithms vary.

The rest of the paper is organized as follows: Section 2 presents the CPM algorithm; the relationship between CPM and SAVE and HDA is discussed in Section 3; experimental results are presented in Section 4; and the paper is concluded in Section 5.

2 Covariance-preserving Projection Method

Given a data matrix $A \in \mathbb{R}^{n \times N}$, consisting of $N$ data points in $n$-dimensional space, we wish to find a vector space $\mathcal{G}$ spanned by $\{g_i\}_{i=1}^d$, where $g_i \in \mathbb{R}^n$, such that each data point $a_i \in \mathbb{R}^n$ of $A$ is projected onto $\mathcal{G}$ by
$$G^T a_i = (g_1^T a_i, \ldots, g_d^T a_i)^T \in \mathbb{R}^d,$$
with $d < n$. Here $G = [g_1, \ldots, g_d] \in \mathbb{R}^{n \times d}$ denotes the projection.

Assume that the data in $A$ is partitioned into $k$ classes as $A = \{\Pi_1, \ldots, \Pi_k\}$, where $\Pi_i$ contains $N_i$ data points from the $i$-th class and $\sum_{i=1}^k N_i = N$. The covariance matrix of the $i$-th class is defined as
$$W_i = \frac{1}{N_i}\sum_{x \in \Pi_i} (x - c_i)(x - c_i)^T,$$
where $c_i = \frac{1}{N_i}\sum_{x \in \Pi_i} x$ is the centroid of the $i$-th class. In discriminant analysis, three scatter matrices, called the within-class ($S_w$), between-class ($S_b$), and total ($S_t$) scatter matrices, are defined as follows [7]:
$$S_w = \sum_{i=1}^k p_i W_i, \quad S_b = \sum_{i=1}^k p_i (c_i - c)(c_i - c)^T, \quad S_t = \frac{1}{N}\sum_{x} (x - c)(x - c)^T,$$
where $p_i = N_i/N$ and $c = \frac{1}{N}\sum_{x} x$ is the global centroid. It is easy to verify that $S_t = S_w + S_b$. In the low-dimensional space resulting from the linear projection $G$, the scatter matrices become $S_b^L = G^T S_b G$, $S_w^L = G^T S_w G$, and $S_t^L = G^T S_t G$. The optimal projection $G$ in LDA [7] can be computed by solving the following optimization problem:
$$(2.1)\quad G^* = \arg\max_{G:\,G^T S_t G = I_d} \operatorname{trace}(G^T S_b G).$$
Assuming that the total scatter matrix has been normalized, i.e., $S_t = I_n$, the optimal projection for LDA can be obtained by maximizing $\operatorname{trace}(G^T S_b G)$ subject to the orthogonality constraint $G^T G = I_d$. The solution is given by the top eigenvectors of $S_b$. There are at most $k - 1$ eigenvectors corresponding to nonzero eigenvalues, since the rank of the matrix $S_b$ is bounded from above by $k - 1$. Therefore, the reduced dimension of LDA is at most $k - 1$.

2.1 Problem formulation

LDA maximizes the separation between different classes by maximizing $\operatorname{trace}(G^T S_b G)$. It is optimal when all classes have a common covariance matrix. However, it ignores the class covariance information and may not be effective when the class centroids are close to each other (note that $S_b = 0$ when the class centroids coincide) or when the class covariance matrices vary. The CPM algorithm proposed below attempts to overcome these limitations of LDA by considering the information from both the class centroids and the class covariances.

Let us first consider Principal Component Analysis (PCA) [12] on the $i$-th class. Let $G$ be the optimal projection consisting of the first $d$ principal components of $W_i$ (the covariance matrix of the $i$-th class). The information loss, or the approximation error incurred by keeping only the $d$ principal components, is $\|W_i - G(G^T W_i G)G^T\|_F$, where $\|\cdot\|_F$ denotes the Frobenius norm of a matrix [9]. The projection in PCA minimizes this approximation error [12] and can be computed via the Singular Value Decomposition (SVD) [9].

In CPM, we consider all $k$ classes simultaneously and aim to find the projection $G$ such that the weighted approximation error over all classes is minimized. Mathematically, CPM aims to minimize the following weighted approximation error:
$$(2.2)\quad \sum_{i=1}^k p_i \left\| (W_i - S_w) - G\big(G^T (W_i - S_w) G\big)G^T \right\|_F^2,$$
which can be shown to be equal to
$$\sum_{i=1}^k p_i \|W_i - S_w\|_F^2 - \sum_{i=1}^k p_i \|G^T (W_i - S_w) G\|_F^2,$$
assuming $G$ has orthonormal columns. Thus CPM maximizes
$$(2.3)\quad \sum_{i=1}^k p_i \|G^T (W_i - S_w) G\|_F^2,$$
which is the weighted variance of the $k$ covariance matrices after the projection $G$.
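A minimal numerical sketch of the quantities defined above follows; the helper names (`scatter_matrices`, `whiten`) are ours, and the $1/N_i$ normalization of $W_i$ and the weights $p_i = N_i/N$ follow the definitions used here. The last few lines check the identity stated before (2.3) for a random orthonormal $G$.

```python
import numpy as np

def scatter_matrices(X, y):
    """Class priors p_i, class covariances W_i, and S_w, S_b, S_t (rows of X are samples)."""
    N, n = X.shape
    c = X.mean(axis=0)
    p, W = [], []
    Sw, Sb = np.zeros((n, n)), np.zeros((n, n))
    for cls in np.unique(y):
        Xi = X[y == cls]
        pi, ci = len(Xi) / N, Xi.mean(axis=0)
        Wi = (Xi - ci).T @ (Xi - ci) / len(Xi)
        p.append(pi); W.append(Wi)
        Sw += pi * Wi
        Sb += pi * np.outer(ci - c, ci - c)
    St = (X - c).T @ (X - c) / N          # equals Sw + Sb up to rounding
    return np.array(p), W, Sw, Sb, St

def whiten(X, eps=1e-10):
    """Linearly transform X so that its total scatter S_t becomes (approximately) I_n."""
    c = X.mean(axis=0)
    St = (X - c).T @ (X - c) / len(X)
    evals, evecs = np.linalg.eigh(St)
    return (X - c) @ (evecs / np.sqrt(np.maximum(evals, eps)))

# Check: for orthonormal G, the weighted approximation error (2.2) equals
# sum_i p_i ||W_i - S_w||_F^2 - sum_i p_i ||G^T (W_i - S_w) G||_F^2.
rng = np.random.default_rng(0)
Xw = whiten(rng.standard_normal((300, 6)))
y = rng.integers(1, 4, size=300)
p, W, Sw, Sb, St = scatter_matrices(Xw, y)
G, _ = np.linalg.qr(rng.standard_normal((6, 2)))      # random orthonormal G, d = 2
Ms = [Wi - Sw for Wi in W]
lhs = sum(pi * np.linalg.norm(M - G @ (G.T @ M @ G) @ G.T, 'fro') ** 2 for pi, M in zip(p, Ms))
rhs = sum(pi * (np.linalg.norm(M, 'fro') ** 2 - np.linalg.norm(G.T @ M @ G, 'fro') ** 2)
          for pi, M in zip(p, Ms))
print(np.allclose(lhs, rhs))                          # True
```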

Note that $S_w$ is involved in (2.3), since $S_w = \sum_{i=1}^k p_i W_i$ is the weighted sum of $\{W_i\}_{i=1}^k$.

Following LDA, CPM considers the class discrimination (the separation between different centroids) by maximizing $\operatorname{trace}(G^T S_b G)$. Moreover, CPM attempts to also preserve the class covariances. Specifically, CPM considers the information from both the class centroids and the class covariances simultaneously, via a tuning parameter α between 0 and 1. Mathematically, assuming $S_t$ has been normalized, CPM computes the optimal $G$ such that
$$(2.4)\quad G^* = \arg\max_{G:\,G^T G = I_d} h_\alpha(G),$$
where $0 \le \alpha \le 1$ and
$$h_\alpha(G) = (1-\alpha)\operatorname{trace}(G^T S_b G) + \alpha \sum_{i=1}^k p_i \|G^T (W_i - S_w) G\|_F^2.$$
The optimization in (2.4) is nonlinear and difficult to solve. Instead, we maximize a lower bound, $f_\alpha(G)$, of $h_\alpha(G)$, where
$$f_\alpha(G) = (1-\alpha)\|G^T S_b G\|_F^2 + \alpha \sum_{i=1}^k p_i \|G^T (W_i - S_w) G\|_F^2.$$
It is easy to check that
$$\|G^T S_b G\|_F^2 = \operatorname{trace}(G^T S_b G\, G^T S_b G) \le \operatorname{trace}(G^T S_b G),$$
since $G$ has orthonormal columns and $S_t = I_n$. Thus $f_\alpha(G) \le h_\alpha(G)$. Furthermore, we have the following result:

Lemma 2.1. Let $S_b$ and $G$ be defined as above. Then
$$\arg\max_{G:\,G^T G = I_d} \operatorname{trace}(G^T S_b G\, G^T S_b G) = \arg\max_{G:\,G^T G = I_d} \operatorname{trace}(G^T S_b G).$$

Proof. Let $S_b = U \Sigma U^T$ be the SVD of $S_b$, where $U \in \mathbb{R}^{n \times q}$ has orthonormal columns, $\Sigma = \operatorname{diag}(\sigma_1, \ldots, \sigma_q)$, $\sigma_1 \ge \cdots \ge \sigma_q > 0$, and $q = \operatorname{rank}(S_b)$. Since $G^T G = I_d$, it can be shown [9] that $\operatorname{trace}(G^T S_b G) \le \sum_{i=1}^q \sigma_i$, and the equality holds when $G = U$. Next, we show that
$$U = \arg\max_{G:\,G^T G = I_d} \operatorname{trace}(G^T S_b G\, G^T S_b G).$$
The proof follows from
$$\operatorname{trace}(G^T S_b G\, G^T S_b G) \le \operatorname{trace}(G^T S_b^2 G) \le \sum_{i=1}^q \sigma_i^2$$
and
$$\operatorname{trace}(U^T S_b U\, U^T S_b U) = \operatorname{trace}(\Sigma^2) = \sum_{i=1}^q \sigma_i^2.$$

Lemma 2.1 will be used in Proposition 2.1 below to show the relationship between CPM and LDA. The optimization involved in CPM is thus approximated as finding $G$ such that
$$(2.5)\quad G^* = \arg\max_{G:\,G^T G = I_d} f_\alpha(G).$$

2.2 The CPM algorithm

Define $w_0 = 1 - \alpha$, $M_0 = S_b$, and $w_i = \alpha p_i$, $M_i = W_i - S_w$, for all $1 \le i \le k$. Then
$$(2.6)\quad f_\alpha(G) = \sum_{i=0}^k w_i \|G^T M_i G\|_F^2 = \sum_{i=0}^k w_i \operatorname{trace}\big((G^T M_i G)(G^T M_i G)\big).$$
To the best of our knowledge, there is no closed form solution to the maximization problem in (2.5) with $f_\alpha(G)$ as in (2.6) [5, 14]. Instead, we derive an iterative algorithm, which computes a sequence of matrices $G^{(j)}$, for $j \ge 0$. More specifically, $G^{(j+1)}$ is computed so that
$$G^{(j+1)} = \arg\max_{G:\,G^T G = I_d} \sum_{i=0}^k w_i \operatorname{trace}\big(G^T M_i G^{(j)} (G^{(j)})^T M_i G\big).$$
It can be shown [14] that $G^{(j+1)}$ consists of the first $d$ eigenvectors of $\sum_{i=0}^k w_i \big(M_i G^{(j)} (G^{(j)})^T M_i\big)$. For simplicity, we choose $G^{(0)} = G_0 = (I_d, 0)^T$. However, empirical results show that CPM is insensitive to the initial choice, as in [5, 14].
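The following sketch implements this iteration, reusing the hypothetical `scatter_matrices` helper from the previous sketch; it assumes the data has already been normalized so that $S_t = I_n$ (for example with `whiten` above), and the defaults for α, η, and the iteration cap are illustrative choices.

```python
import numpy as np

def cpm(X, y, d, alpha=0.5, eta=1e-6, max_iter=100):
    """Iterative CPM sketch: returns an n x d projection matrix G."""
    p, W, Sw, Sb, _ = scatter_matrices(X, y)            # helper from the earlier sketch
    n = X.shape[1]
    weights = [1.0 - alpha] + [alpha * pi for pi in p]  # w_0, w_1, ..., w_k
    Ms = [Sb] + [Wi - Sw for Wi in W]                   # M_0, M_1, ..., M_k
    G = np.eye(n, d)                                    # G_0 = (I_d, 0)^T
    s_prev = 0.0
    for _ in range(max_iter):
        # G^(j+1): top-d eigenvectors of sum_i w_i * M_i G G^T M_i.
        K = sum(w * (M @ G) @ (M @ G).T for w, M in zip(weights, Ms))
        evals, evecs = np.linalg.eigh(K)
        G_new = evecs[:, np.argsort(evals)[::-1][:d]]
        # s_(j+1) as in step (5) of the algorithm listed below.
        s = sum(w * np.trace(G_new.T @ M @ G @ G.T @ M @ G_new) for w, M in zip(weights, Ms))
        G = G_new
        if s > 0 and (s - s_prev) / s <= eta:           # stopping rule (s_(j+1) - s_j)/s_(j+1) <= eta
            break
        s_prev = s
    return G

# Usage: G = cpm(Xw, y, d=3, alpha=0.5); Z = Xw @ G
```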

The main steps of the CPM algorithm are as follows:
(1) Normalize the data so that $S_t = I_n$;
(2) Initialization: $G^{(0)} \leftarrow G_0$, $j \leftarrow 0$, and $s_0 \leftarrow 0$;
(3) Compute the first $d$ eigenvectors $\{\phi_i\}_{i=1}^d$ of $\sum_{i=0}^k w_i \big(M_i G^{(j)} (G^{(j)})^T M_i\big)$;
(4) $G^{(j+1)} \leftarrow [\phi_1, \ldots, \phi_d]$;
(5) $s_{j+1} \leftarrow \sum_{i=0}^k w_i \operatorname{trace}\big((G^{(j+1)})^T M_i G^{(j)} (G^{(j)})^T M_i G^{(j+1)}\big)$;
(6) Repeat (3)-(5) until $(s_{j+1} - s_j)/s_{j+1} \le \eta$.

The convergence of CPM is determined by the threshold η. In our experiments, we choose $\eta = 10^{-6}$. The convergence of the CPM algorithm is established in the following theorem:

Theorem 2.1. Let $\{s_j\}_{j=0}^\infty$ be defined as above. Then the sequence $\{s_j\}_{j=0}^\infty$ is nondecreasing and bounded from above. Thus the CPM algorithm converges.

Proof. By the definition of $s_j$ and $s_{j+1}$, and since each $M_i$ is symmetric, we have
$$s_{j+1} = \max_{G:\,G^T G = I_d} \sum_{i=0}^k w_i \operatorname{trace}\big(G^T M_i G^{(j)} (G^{(j)})^T M_i G\big) \ge \sum_{i=0}^k w_i \operatorname{trace}\big((G^{(j-1)})^T M_i G^{(j)} (G^{(j)})^T M_i G^{(j-1)}\big) = \sum_{i=0}^k w_i \operatorname{trace}\big((G^{(j)})^T M_i G^{(j-1)} (G^{(j-1)})^T M_i G^{(j)}\big) = s_j.$$
By the properties of the trace,
$$s_{j+1} = \sum_{i=0}^k w_i \operatorname{trace}\big((G^{(j+1)})^T M_i G^{(j)} (G^{(j)})^T M_i G^{(j+1)}\big) \le \sum_{i=0}^k w_i \operatorname{trace}(M_i M_i) = \sum_{i=0}^k w_i \|M_i\|_F^2.$$
The CPM algorithm thus converges.

The relationship between CPM and LDA is described in Proposition 2.1 below.

Proposition 2.1. CPM is equivalent to LDA if either of the following two conditions holds: (1) α = 0; or (2) all classes have the same covariance matrix, i.e., $W_i = S_w$ for all $i$.

Proof. Recall that in CPM, the optimal $G$ is computed by maximizing $f_\alpha(G)$, defined as
$$f_\alpha(G) = (1-\alpha)\|G^T S_b G\|_F^2 + \alpha \sum_{i=1}^k p_i \|G^T (W_i - S_w) G\|_F^2,$$
subject to the constraint that $G^T G = I_d$. When α = 0, or $W_i = S_w$ for all $i$, the second term in $f_\alpha(G)$ vanishes. Thus,
$$\arg\max_{G:\,G^T G = I_d} f_\alpha(G) = \arg\max_{G:\,G^T G = I_d} \|G^T S_b G\|_F^2 = \arg\max_{G:\,G^T G = I_d} \operatorname{trace}(G^T S_b G),$$
where the last equality follows from Lemma 2.1. This completes the proof of the proposition.

3 Relationship between CPM, SAVE, and HDA

In this section, we study the relationship between CPM, SAVE, and HDA. More specifically, SAVE is shown to be closely related to a special case of CPM, and CPM and SAVE are shown to be approximations of HDA. The theoretical analysis provides justification for the CPM algorithm and gives us insights into the nature of these different algorithms.

3.1 CPM versus SAVE

Cook et al. [2, 3] proposed the Sliced Average Variance Estimator (SAVE) to overcome the limitations of LDA. Like CPM, SAVE allows each class to have a different covariance; that is, the covariances $W_i$ and $W_j$ of the $i$-th and $j$-th classes ($i \ne j$) may be different. In their approach, the data is normalized so that the total covariance matrix is the identity (as in CPM), and the discriminant directions are then chosen by maximizing
$$(3.7)\quad \mathrm{SAVE}(\alpha) = \alpha^T \Big(\sum_{i=1}^k p_i (I_n - W_i)^2\Big)\, \alpha$$
over $\alpha \in \mathbb{R}^n$ of unit length, that is, $\|\alpha\| = 1$ (here α denotes a projection direction, not the CPM tuning parameter). The solutions are given by the top eigenvectors of $\sum_{i=1}^k p_i (I_n - W_i)^2$.
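For comparison with the CPM sketch above, the sketch below forms this SAVE matrix and returns its top eigenvectors; as before, it assumes whitened data and reuses the hypothetical `scatter_matrices` helper, with the class weights $p_i$ as reconstructed here.

```python
import numpy as np

def save_directions(X, y, d):
    """Top-d eigenvectors of sum_i p_i (I_n - W_i)^2 on data whitened so that S_t = I_n."""
    p, W, _, _, _ = scatter_matrices(X, y)     # helper from the earlier sketch
    n = X.shape[1]
    M = sum(pi * (np.eye(n) - Wi) @ (np.eye(n) - Wi) for pi, Wi in zip(p, W))
    evals, evecs = np.linalg.eigh(M)
    return evecs[:, np.argsort(evals)[::-1][:d]]

# Usage: Z_save = Xw @ save_directions(Xw, y, d=2)
```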

Since $S_w + S_b = S_t = I_n$ (after normalization), we have
$$\sum_{i=1}^k p_i (I_n - W_i)^2 = \sum_{i=1}^k p_i \big((S_w - W_i) + S_b\big)^2 = \sum_{i=1}^k p_i \big((S_w - W_i)^2 + S_b^2 + (S_w - W_i)S_b + S_b(S_w - W_i)\big),$$
which equals
$$(3.8)\quad \sum_{i=1}^k p_i (S_w - W_i)^2 + S_b^2,$$
where the last equality follows since $\sum_{i=1}^k p_i = 1$ and $\sum_{i=1}^k p_i (S_w - W_i) = 0$.

Next, let us take a closer look at the CPM algorithm presented in Section 2. It was shown empirically in [5] that
$$\arg\max_{G:\,G^T G=I_d} \sum_i w_i \|G^T M_i G\|_F^2 \approx \arg\max_{G:\,G^T G=I_d} \sum_i w_i \operatorname{trace}(G^T M_i^2 G).$$
Thus the solution to CPM can be approximated by maximizing
$$\tilde f_\alpha(G) = (1-\alpha)\operatorname{trace}(G^T S_b G) + \alpha \sum_{i=1}^k p_i \operatorname{trace}\big(G^T (W_i - S_w)^2 G\big),$$
and is given by the first $d$ eigenvectors of
$$(1-\alpha) S_b + \alpha \sum_{i=1}^k p_i (W_i - S_w)^2,$$
which contains the same set of eigenvectors as
$$(3.9)\quad \sum_{i=1}^k p_i (W_i - S_w)^2 + S_b$$
when α = 1/2. Note that the matrices in (3.8) and (3.9) differ only in the second term: $S_b^2$ is used in SAVE, whereas $S_b$ is used in CPM. Thus, SAVE is related to a special case of CPM in which the parameter α is set to 0.5. Our experimental results show that when α = 0.5, CPM often has performance similar to SAVE. Recall that the parameter α controls the tradeoff between the separation of the class centroids and the preservation of the class covariances. CPM is more flexible in dealing with different situations by varying the value of α. Experimental results in Section 4 show that CPM is competitive with SAVE and may significantly outperform it in some cases.
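The closed-form approximation discussed above and the SAVE matrix in (3.8) can be compared directly, as in the sketch below (again reusing the hypothetical `scatter_matrices` helper on whitened data); with α = 0.5 the two matrices differ only in whether $S_b$ or $S_b^2$ enters the second term, up to an overall factor of one half.

```python
import numpy as np

def cpm_approx_matrix(X, y, alpha):
    """(1 - alpha) S_b + alpha sum_i p_i (W_i - S_w)^2: its top-d eigenvectors
    approximate the CPM solution (cf. (3.9))."""
    p, W, Sw, Sb, _ = scatter_matrices(X, y)
    return (1.0 - alpha) * Sb + alpha * sum(pi * (Wi - Sw) @ (Wi - Sw) for pi, Wi in zip(p, W))

def save_matrix(X, y):
    """sum_i p_i (S_w - W_i)^2 + S_b^2, i.e. the SAVE matrix in (3.8)."""
    p, W, Sw, Sb, _ = scatter_matrices(X, y)
    return sum(pi * (Sw - Wi) @ (Sw - Wi) for pi, Wi in zip(p, W)) + Sb @ Sb

# K_cpm, K_save = cpm_approx_matrix(Xw, y, alpha=0.5), save_matrix(Xw, y)
```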

3.2 CPM versus HDA

LDA is optimal when all classes have a common covariance matrix. However, this assumption is quite strict and may not be applicable in some cases, as mentioned in Section 1. Kumar and Andreou [13] proposed Heteroscedastic Discriminant Analysis (HDA), which assumes that each class is Gaussian, but with possibly different covariance matrices, under the assumption that both the centroids and the covariance matrices coincide in a subspace of the observation space. More specifically, HDA assumes that there exist an integer $d$, $d < n$, a full rank matrix $G$ of dimension $n \times d$, and $\tilde G = [G, F] \in \mathbb{R}^{n \times n}$, such that
$$(3.10)\quad \tilde G^T c_i = \begin{pmatrix} G^T c_i \\ F^T c_0 \end{pmatrix}$$
and
$$(3.11)\quad \tilde G^T W_i \tilde G = \begin{pmatrix} G^T W_i G & 0 \\ 0 & F^T W_0 F \end{pmatrix},$$
for some $c_0$ and $W_0$ which are common for all $i$. This implies that HDA assumes that the differences between the classes lie solely in a subspace of $d$ dimensions (the projection by $G$), whereas in the complementary subspace of $n - d$ dimensions (the projection by $F$) the classes are identical, i.e., useless for discrimination. The probability density of $a_i \in \mathbb{R}^n$ under the above model is given as
$$P(a_i) = \frac{|\det(\tilde G)|}{(2\pi)^{n/2}\,\big|\tilde G^T W_{g(i)} \tilde G\big|^{1/2}} \exp\Big(-\tfrac{1}{2}\big(\tilde G^T a_i - \tilde G^T c_{g(i)}\big)^T \big(\tilde G^T W_{g(i)} \tilde G\big)^{-1} \big(\tilde G^T a_i - \tilde G^T c_{g(i)}\big)\Big),$$
where $a_i$ belongs to the class $g(i)$. The log-likelihood of the data under the linear transformation $\tilde G$ and under the constrained Gaussian model above is
$$(3.12)\quad \log L = \sum_i \log P(a_i) = -\frac{1}{2}\sum_i \Big\{ \big(\tilde G^T a_i - \tilde G^T c_{g(i)}\big)^T \big(\tilde G^T W_{g(i)} \tilde G\big)^{-1} \big(\tilde G^T a_i - \tilde G^T c_{g(i)}\big) + \log\big((2\pi)^n \big|\tilde G^T W_{g(i)} \tilde G\big|\big) \Big\} + N \log\big|\det(\tilde G)\big|.$$

The above likelihood function can then be maximized with respect to its parameters. Since there is no closed-form solution for maximizing the likelihood with respect to its parameters, the maximization has to be performed numerically. More details can be found in [13], where quadratic programming is used for the required optimization. As pointed out in [13], even though quadratic-optimization techniques are used, the likelihood surface is not strictly quadratic, and the optimization occasionally fails. We often observed this problem in our experiments.

In the rest of this section, we show that the CPM algorithm proposed in this paper is an approximation of HDA. The first assumption, in (3.10), requires that
$$\tilde G^T (c_i - c) = \begin{pmatrix} G^T (c_i - c) \\ 0 \end{pmatrix},$$
i.e., most of the information on $c_i - c$ is preserved in the subspace spanned by $G$. It follows that $G$ should be computed so that $\sum_{i=1}^k p_i \|G^T(c_i - c)\|^2 = \operatorname{trace}(G^T S_b G)$ is large. The second assumption, in (3.11), requires that
$$\tilde G^T (W_i - S_w) \tilde G = \begin{pmatrix} G^T (W_i - S_w) G & 0 \\ 0 & 0 \end{pmatrix}.$$
Similarly, this implies that $\sum_{i=1}^k p_i \|G^T (W_i - S_w) G\|_F^2$ should be large. These two assumptions can be simultaneously approximated by maximizing $h_\alpha(G)$ in (2.4). Thus, CPM can be considered an approximation of HDA. Our experimental results in the next section show that CPM is comparable to HDA in classification, while CPM is much more efficient and robust than HDA.

3.3 SAVE versus HDA

SAVE can also be considered an approximation of HDA, based on the following result.

Lemma 3.1. Let $S_b$ be defined as above and let $G \in \mathbb{R}^{n \times d}$ with $d < n$. Then
$$\arg\max_{G:\,G^T G = I_d} \operatorname{trace}(G^T S_b G) = \arg\max_{G:\,G^T G = I_d} \operatorname{trace}(G^T S_b^2 G).$$

Proof. Let $S_b = U \Sigma U^T$ be the SVD of $S_b$, where $U \in \mathbb{R}^{n \times q}$ has orthonormal columns, $\Sigma = \operatorname{diag}(\sigma_1, \ldots, \sigma_q)$, $\sigma_1 \ge \cdots \ge \sigma_q > 0$, and $q = \operatorname{rank}(S_b)$. From Lemma 2.1, $\operatorname{trace}(G^T S_b G) \le \sum_{i=1}^q \sigma_i$, and the equality holds when $G = U$. Similarly,
$$\operatorname{trace}(G^T S_b^2 G) = \operatorname{trace}(G^T U \Sigma^2 U^T G) \le \sum_{i=1}^q \sigma_i^2,$$
and the maximum value is achieved when $G = U$. This completes the proof of the lemma.

Lemma 3.1 shows that the two assumptions of HDA are approximated by maximizing the criterion of SAVE in (3.8). While CPM applies the weight α to balance the tradeoff between these two assumptions, SAVE puts an equal weight on both terms.

4 Experiments

In this section, we evaluate CPM on both synthetic and real-world datasets. The 1-Nearest-Neighbor algorithm is applied for classification. For simplicity, a fixed value of α is used for CPM in all experiments; however, the best value of α may be estimated through cross-validation.

The first experiment is on a synthetic dataset, where the class centroids of the three classes do not coincide, but the classes have different covariance matrices. For the first two coordinates, class 1 is simulated from a multivariate Gaussian with mean (0, 0) and a diagonal covariance matrix C. Classes 2 and 3 are mixtures of two shifted Gaussians, and each Gaussian component also has covariance C. From Fig. 2, HDA, SAVE, and CPM separate the three classes better than LDA, which shows the effect of incorporating the class covariance information. LDA does not consider the class covariance information and fails to find the best projection when the class covariances vary.

Our next experiment is on the Pendigits dataset from the UCI machine learning repository (ftp://ftp.ics.uci.edu/pub/machine-learning-databases/). It contains the (x, y) coordinates of hand-written digits. Each digit is represented as a vector in 16-dimensional space. The dataset is divided into a training set (7494 digits) and a test set (3498 digits).
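A sketch of the evaluation protocol used in this section is given below: whiten using the training data only, learn a projection on the training data, project both sets, and classify the test points with the 1-Nearest-Neighbor rule. The function names are ours, and `project_fn` can be any of the earlier sketches (for example `cpm` or `save_directions`).

```python
import numpy as np

def evaluate_projection(X_train, y_train, X_test, y_test, project_fn, d):
    """1-NN test accuracy after projecting whitened data with project_fn(Xw, y, d)."""
    c = X_train.mean(axis=0)
    St = (X_train - c).T @ (X_train - c) / len(X_train)
    evals, evecs = np.linalg.eigh(St)
    T = evecs / np.sqrt(np.maximum(evals, 1e-10))        # whitening fit on the training set only
    Ztr, Zte = (X_train - c) @ T, (X_test - c) @ T
    G = project_fn(Ztr, y_train, d)
    Ztr, Zte = Ztr @ G, Zte @ G
    # 1-NN: label of the closest training point (y_train is assumed to be a NumPy array).
    d2 = ((Zte[:, None, :] - Ztr[None, :, :]) ** 2).sum(axis=2)
    return float(np.mean(y_train[np.argmin(d2, axis=1)] == y_test))

# acc = evaluate_projection(X_train, y_train, X_test, y_test,
#                           lambda Xw, yw, dd: cpm(Xw, yw, dd, alpha=0.5), d=4)
```

The pairwise-distance computation materializes an n_test x n_train array, which is fine at the scale of these datasets but should be chunked for larger ones.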

For the purpose of visualization, as in [16], we select three of the digit classes (the 6's, the 9's, and one other digit). We apply LDA, SAVE, HDA, and CPM to this 3-class subproblem, extract the leading discriminant directions, and project the test set onto them. The results are shown in Fig. 3. It is clear that LDA, HDA, and CPM separate the three classes well. SAVE, on the other hand, picks up directions in which one class has a much larger variance than the other two classes, as mentioned in [16]. This further confirms our claim in Section 3 that CPM is more flexible in dealing with different situations by choosing different values of α than SAVE, where α is fixed to be 0.5. Note that we also ran CPM with α = 0.5; the result is similar to SAVE and is thus omitted.

Next, we evaluate CPM in terms of classification on two datasets, Pendigits and Ionosphere, both from the UCI machine learning repository, using different numbers, d, of reduced dimensions. For Pendigits, we use a subset consisting of four digit classes (including the 5's, 6's, and 9's). Ionosphere is a binary dataset consisting of 351 instances of radar-collected data with 34 dimensions (200 instances for training and 151 instances for test). The results are summarized in Fig. 4, where the x-axis denotes the reduced dimension d for HDA, SAVE, and CPM, and the y-axis denotes the classification accuracy. The result for LDA, with the fixed reduced dimension k - 1, is also presented. We ran HDA on the Ionosphere dataset, but the algorithm did not complete. This may be due to the complex optimization problem involved in HDA, as mentioned in Section 3. We observe from Fig. 4 that the best accuracy for HDA, SAVE, and CPM often occurs when d > k - 1. In practice, the best dimension d can be estimated through cross-validation. Also note that the accuracy curves of HDA and CPM follow a similar trend on the Pendigits dataset (see Section 3 for the theoretical analysis), while they are quite different from that of SAVE. Overall, CPM is very competitive with the other three algorithms.

Finally, we compare CPM with SAVE and LDA in terms of classification on four datasets from the UCI repository. The results are summarized in Table 1. The reduced dimension used in CPM is set to k - 1, the reduced dimension used in LDA. Note that, based on our previous experiments, the performance of CPM may be improved by using a larger number of dimensions. HDA is not included in the comparison, since it is not able to complete for many of the datasets. We observe that CPM outperforms SAVE in all cases, while CPM is very competitive with LDA. We also observe that LDA performs reasonably well on all datasets. This implies that LDA is fairly robust in practice, even though the assumption involved in LDA may be violated.
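As noted above, both α and the reduced dimension d can be chosen by cross-validation; the sketch below selects α by k-fold cross-validation on the training set, reusing the hypothetical `cpm` and `evaluate_projection` helpers, with an illustrative candidate grid and fold count.

```python
import numpy as np

def select_alpha_by_cv(X, y, d, alphas=(0.0, 0.1, 0.3, 0.5, 0.7, 0.9), n_folds=5, seed=0):
    """Return the alpha with the best mean k-fold cross-validated 1-NN accuracy."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_folds)
    best_alpha, best_acc = None, -1.0
    for alpha in alphas:
        accs = []
        for f in range(n_folds):
            te = folds[f]
            tr = np.concatenate([folds[g] for g in range(n_folds) if g != f])
            accs.append(evaluate_projection(X[tr], y[tr], X[te], y[te],
                                            lambda Xw, yw, dd: cpm(Xw, yw, dd, alpha=alpha), d))
        if np.mean(accs) > best_acc:
            best_alpha, best_acc = alpha, float(np.mean(accs))
    return best_alpha, best_acc

# best_alpha, cv_acc = select_alpha_by_cv(X_train, y_train, d=3)
```

The same loop can be wrapped around d instead of (or in addition to) α when the reduced dimension also needs to be selected.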
5 Conclusions

A new algorithm, namely CPM, is proposed for dimension reduction. It aims to maximize the class discrimination and preserve the class covariance simultaneously. The optimization problem involved in CPM is equivalent to low rank approximations of a collection of matrices, which can be solved iteratively. We have applied CPM to various datasets, and our empirical results show that: (1) CPM is capable of recovering the features for classification, even when all class centroids coincide or the covariance matrices vary, overcoming limitations of LDA; and (2) CPM is competitive with LDA, SAVE, and HDA in terms of classification. Our theoretical analysis reveals the close relationship between CPM and LDA, SAVE, and HDA. LDA is shown to be a special case of CPM, which generalizes LDA by utilizing the class covariance information. CPM and SAVE are shown to be approximations of HDA, while they are much more efficient and robust than HDA. SAVE is closely related to a special case of CPM in which the parameter α, which controls the tradeoff between the separation of the class centroids and the preservation of the class covariances, is set to 0.5. CPM is thus more flexible in dealing with different situations. These theoretical results further justify the criterion used in CPM and give us new insights into these algorithms.

Some directions for future work include: (1) extension of CPM to deal with high-dimensional data; and (2) extension of CPM to deal with nonlinear data using kernels.

Acknowledgements

Research of JY was sponsored, in part, by the Evolutionary Functional Genomics Center of the Biodesign Institute at Arizona State University. Support from Fellowships from Guidant Corporation and from the Department of Computer Science & Engineering at the University of Minnesota is gratefully acknowledged.

Figure 2: Projections by LDA, HDA, SAVE, and CPM on the synthetic dataset in which the class centroids of the three classes do not coincide but the classes have different covariance matrices. HDA, SAVE, and CPM consider the class covariance information and separate the three classes better than LDA.

Figure 3: Projections by LDA, HDA, SAVE, and CPM on the Pendigits dataset. LDA, HDA, and CPM separate the three classes better than SAVE.

Figure 4: Comparison of the classification accuracy of LDA, HDA, SAVE, and CPM on the Pendigits and Ionosphere datasets (x-axis: the reduced dimension d; y-axis: classification accuracy).

Table 1: Comparison of classification accuracy and standard deviation (in parentheses) for CPM, SAVE, and LDA on the Pima, Vote, Waveform, and Ionosphere datasets.

References

[1] P. N. Belhumeur, J. P. Hespanha, and D. J. Kriegman. Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):711-720, 1997.
[2] R. D. Cook and S. Weisberg. Discussion of Li (1991). Journal of the American Statistical Association, 86:328-332, 1991.
[3] R. D. Cook and X. Yin. Dimension reduction and visualization in discriminant analysis (with discussion). Australian and New Zealand Journal of Statistics, 43(2):147-199, 2001.
[4] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. Wiley, 2nd edition, 2000.
[5] C. Ding and J. Ye. 2-dimensional singular value decomposition for 2D maps and images. In SIAM Data Mining Conference Proceedings, 2005.
[6] S. Dudoit, J. Fridlyand, and T. P. Speed. Comparison of discrimination methods for the classification of tumors using gene expression data. Journal of the American Statistical Association, 97(457):77-87, 2002.
[7] K. Fukunaga. Introduction to Statistical Pattern Recognition. Academic Press, 2nd edition, 1990.
[8] T. Hastie, R. Tibshirani, and J. H. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2001.
[9] G. H. Golub and C. F. Van Loan. Matrix Computations. Johns Hopkins University Press, 3rd edition, 1996.
[10] T. Hastie and R. Tibshirani. Discriminant analysis by Gaussian mixtures. Journal of the Royal Statistical Society, Series B, 58:155-176, 1996.
[11] P. Howland, M. Jeon, and H. Park. Structure preserving dimension reduction for clustered text data based on the generalized singular value decomposition. SIAM Journal on Matrix Analysis and Applications, 25(1):165-179, 2003.
[12] I. T. Jolliffe. Principal Component Analysis. Springer-Verlag, New York, 1986.
[13] N. Kumar and A. G. Andreou. A generalization of linear discriminant analysis in maximum likelihood framework. In Proceedings of the Joint Meeting of the ASA, 1996.
[14] J. Ye. Generalized low rank approximations of matrices. In ICML Conference Proceedings, 2004.
[15] J. Ye. Characterization of a family of algorithms for generalized discriminant analysis on undersampled problems. Journal of Machine Learning Research, 6:483-502, 2005.
[16] M. Zhu and T. Hastie. Feature extraction for nonparametric discriminant analysis. Journal of Computational and Graphical Statistics, 12(1):101-120, 2003.


More information

Representation theory of SU(2), density operators, purification Michael Walter, University of Amsterdam

Representation theory of SU(2), density operators, purification Michael Walter, University of Amsterdam Symmetry and Quantum Information Feruary 6, 018 Representation theory of S(), density operators, purification Lecture 7 Michael Walter, niversity of Amsterdam Last week, we learned the asic concepts of

More information

Lecture 7: Con3nuous Latent Variable Models

Lecture 7: Con3nuous Latent Variable Models CSC2515 Fall 2015 Introduc3on to Machine Learning Lecture 7: Con3nuous Latent Variable Models All lecture slides will be available as.pdf on the course website: http://www.cs.toronto.edu/~urtasun/courses/csc2515/

More information

Probabilistic Latent Semantic Analysis

Probabilistic Latent Semantic Analysis Probabilistic Latent Semantic Analysis Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr

More information

Template-based Recognition of Static Sitting Postures

Template-based Recognition of Static Sitting Postures Template-based Recognition of Static Sitting Postures Manli Zhu 1, Aleix M. Martínez 1 and Hong Z. Tan 2 1 Dept. of Electrical Engineering The Ohio State University 2 School of Electrical and Computer

More information

Study on Classification Methods Based on Three Different Learning Criteria. Jae Kyu Suhr

Study on Classification Methods Based on Three Different Learning Criteria. Jae Kyu Suhr Study on Classification Methods Based on Three Different Learning Criteria Jae Kyu Suhr Contents Introduction Three learning criteria LSE, TER, AUC Methods based on three learning criteria LSE:, ELM TER:

More information

Kernel Methods. Machine Learning A W VO

Kernel Methods. Machine Learning A W VO Kernel Methods Machine Learning A 708.063 07W VO Outline 1. Dual representation 2. The kernel concept 3. Properties of kernels 4. Examples of kernel machines Kernel PCA Support vector regression (Relevance

More information

COVARIANCE REGULARIZATION FOR SUPERVISED LEARNING IN HIGH DIMENSIONS

COVARIANCE REGULARIZATION FOR SUPERVISED LEARNING IN HIGH DIMENSIONS COVARIANCE REGULARIZATION FOR SUPERVISED LEARNING IN HIGH DIMENSIONS DANIEL L. ELLIOTT CHARLES W. ANDERSON Department of Computer Science Colorado State University Fort Collins, Colorado, USA MICHAEL KIRBY

More information

Unsupervised Learning: K- Means & PCA

Unsupervised Learning: K- Means & PCA Unsupervised Learning: K- Means & PCA Unsupervised Learning Supervised learning used labeled data pairs (x, y) to learn a func>on f : X Y But, what if we don t have labels? No labels = unsupervised learning

More information

Principal Component Analysis

Principal Component Analysis Principal Component Analysis Yingyu Liang yliang@cs.wisc.edu Computer Sciences Department University of Wisconsin, Madison [based on slides from Nina Balcan] slide 1 Goals for the lecture you should understand

More information

A Least Squares Formulation for Canonical Correlation Analysis

A Least Squares Formulation for Canonical Correlation Analysis A Least Squares Formulation for Canonical Correlation Analysis Liang Sun, Shuiwang Ji, and Jieping Ye Department of Computer Science and Engineering Arizona State University Motivation Canonical Correlation

More information

Advanced Introduction to Machine Learning

Advanced Introduction to Machine Learning 10-715 Advanced Introduction to Machine Learning Homework 3 Due Nov 12, 10.30 am Rules 1. Homework is due on the due date at 10.30 am. Please hand over your homework at the beginning of class. Please see

More information