High Dimensional Discriminant Analysis
Charles Bouveyron, LMC-IMAG & INRIA Rhône-Alpes
Joint work with S. Girard and C. Schmid
Introduction
High dimensional data: many scientific domains need to analyze data which are increasingly complex; modern data are made up of many variables: imagery (MRI, vision), biology (DNA micro-arrays), ...
Classification is very difficult in high dimensional spaces: many learning methods suffer from the curse of dimensionality [Bel61], since the number n of observations is generally not sufficient to learn high-dimensional models.
The empty space phenomenon [ST83] allows us to assume that data live in subspaces of lower dimensionality.
Introduction
Classification: supervised classification (discriminant analysis) requires labelled examples of the classes; unsupervised classification (clustering) aims to organize data into homogeneous classes.
Two families of methods: generative methods (QDA, LDA, GMM) and discriminative methods (logistic regression, SVM).
Generative models can be used in both supervised and unsupervised classification.
Outline of the talk
- Discriminant analysis framework
- New model for high-dimensional data
- High dimensional discriminant analysis (HDDA): construction of the decision rule, a posteriori probability and reformulation
- Particular rules
- Estimators and intrinsic dimension estimation
- Numerical results: application to image categorization, application to object recognition
- Extension to unsupervised classification
Part 1: Discriminant analysis framework
Discriminant analysis framework
Discriminant analysis is the supervised part of classification, i.e. it requires a teacher!
Discriminant analysis has two goals:
descriptive aspect: find a data representation which allows one to interpret the groups using the explanatory variables;
decisional aspect: the main goal is to determine the class membership of a new observation x.
Of course, HDDA favours the decisional aspect!
Discrimination problem
The basic problem: assign an observation x = (x_1, ..., x_p) ∈ R^p with unknown class membership to one of k classes C_1, ..., C_k known a priori.
We have a learning dataset A:
A = {(x_1, y_1), ..., (x_n, y_n) : x_j ∈ R^p and y_j ∈ {1, ..., k}},
where the vector x_j contains the p explanatory variables and y_j indicates the index of the class of x_j.
We have to construct a decision rule δ:
δ : R^p → {1, ..., k}, x ↦ y.
Bayes decision rule
The optimal decision rule δ*, called the Bayes decision rule, is:
δ* : x ∈ C_i  if  i = argmax_{i=1,...,k} { p(C_i | x) } = argmin_{i=1,...,k} { -2 log(π_i f_i(x)) },
where π_i is the a priori probability of class C_i and f_i(x) denotes the class conditional density of x.
Generative methods usually assume that the class distributions are Gaussian N(µ_i, Σ_i).
Classical discriminant analysis methods
Quadratic discriminant analysis (QDA):
i = argmin_{i=1,...,k} { (x - µ_i)^t Σ_i^{-1} (x - µ_i) + log(det Σ_i) - 2 log(π_i) }.
Linear discriminant analysis (LDA), with the assumption that ∀i, Σ_i = Σ:
i = argmin_{i=1,...,k} { µ_i^t Σ^{-1} µ_i - 2 µ_i^t Σ^{-1} x - 2 log(π_i) }.
QDA and LDA have disappointing behaviour when the size n of the training dataset is small compared to the number p of variables.
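As a point of reference, here is a minimal numpy sketch of the QDA cost above; the function names, toy parameters and the classify helper are ours, for illustration only.

```python
import numpy as np

def qda_cost(x, mu, Sigma, prior):
    """-2 log(pi_i f_i(x)) up to an additive constant, for a Gaussian f_i."""
    diff = x - mu
    return (diff @ np.linalg.inv(Sigma) @ diff
            + np.log(np.linalg.det(Sigma))
            - 2.0 * np.log(prior))

def qda_classify(x, mus, Sigmas, priors):
    """Assign x to the class with the smallest QDA cost."""
    costs = [qda_cost(x, m, S, p) for m, S, p in zip(mus, Sigmas, priors)]
    return int(np.argmin(costs))
```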
Discriminant analysis regularization
Dimension reduction: PCA, FDA, feature selection; Fisher discriminant analysis (FDA) combines a dimension reduction step (projection on the k - 1 discriminant axes) with one of the previous methods (usually LDA).
Parsimonious models: Regularized discriminant analysis (RDA, [Fri89]) is an intermediate classifier between QDA and LDA; Eigenvalue decomposition discriminant analysis (EDDA, [BC96]) is based on a re-parametrization of the class covariance matrices: Σ_i = λ_i D_i A_i D_i^t.
Dimension reduction for classification
[Figure: projections of the data on the PCA axes and on the discriminant axes.]
Fig. 1 - High-dimensional data whose classes live in different subspaces of lower dimensionality.
Part 2: New model
New model
The empty space phenomenon enables us to assume that high-dimensional data live in subspaces of low dimensionality. The main idea of the new model is: each class is decomposed into two subspaces of low dimensionality, and the classes are assumed to be spherical in these subspaces.
New model
We assume that the class conditional densities are Gaussian N(µ_i, Σ_i) with means µ_i and covariance matrices Σ_i.
Let Q_i be the orthogonal matrix of eigenvectors of the covariance matrix Σ_i, and let B_i be the basis of R^p made of the eigenvectors of Σ_i.
The class conditional covariance matrix Δ_i is defined in the basis B_i by:
Δ_i = Q_i^t Σ_i Q_i.
New model
We assume in addition that Δ_i contains only two different eigenvalues a_i > b_i.
Let E_i be the affine space generated by the eigenvectors associated with the eigenvalue a_i and such that µ_i ∈ E_i. We also define E_i^⊥ such that E_i ⊕ E_i^⊥ = R^p and µ_i ∈ E_i^⊥.
Let P_i and P_i^⊥ be the projection operators on E_i and E_i^⊥.
New model
Thus, we assume that Δ_i has the following form:
Δ_i = diag(a_i, ..., a_i, b_i, ..., b_i),
with the eigenvalue a_i repeated d_i times and the eigenvalue b_i repeated (p - d_i) times.
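A small numpy sketch of this parameterization, assuming a_i, b_i, d_i and the orthogonal matrix Q_i are given; all numerical values below are illustrative.

```python
import numpy as np

def class_covariance(Q, a, b, d):
    """Build Sigma_i = Q_i Delta_i Q_i^t, where Delta_i has eigenvalue a on the
    first d axes (the class-specific subspace E_i) and b on the remaining p - d."""
    p = Q.shape[0]
    delta = np.concatenate([np.full(d, a), np.full(p - d, b)])
    return Q @ np.diag(delta) @ Q.T

# illustrative values: p = 5, d_i = 2, a_i > b_i
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((5, 5)))  # a random orthogonal basis
Sigma_i = class_covariance(Q, a=4.0, b=0.5, d=2)
```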
New model: illustration
Part 3: High Dimensional Discriminant Analysis
High Dimensional Discriminant Analysis
Under the preceding assumptions, the Bayes decision rule yields a new decision rule δ+:
Theorem 1: The new decision rule δ+ consists in classifying x to the class C_i if:
i = argmin_{i=1,...,k} { (1/a_i) ‖µ_i - P_i(x)‖² + (1/b_i) ‖x - P_i(x)‖² + d_i log(a_i) + (p - d_i) log(b_i) - 2 log(π_i) }.
HDDA: illustration
K_i(x) = (1/a_i) ‖µ_i - P_i(x)‖² + (1/b_i) ‖x - P_i(x)‖² + d_i log(a_i) + (p - d_i) log(b_i) - 2 log(π_i).
HDDA: a posteriori probability
In many applications, it is useful to have access to the a posteriori probability p(C_i | x) that x belongs to C_i. The Bayes formula yields:
p(C_i | x) = exp(-K_i(x)/2) / Σ_{j=1}^{k} exp(-K_j(x)/2),
where K_i is the cost function of δ+ conditionally on the class C_i:
K_i(x) = (1/a_i) ‖µ_i - P_i(x)‖² + (1/b_i) ‖x - P_i(x)‖² + d_i log(a_i) + (p - d_i) log(b_i) - 2 log(π_i).
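A hedged numpy sketch of the cost K_i and the resulting posterior probabilities; the projector P_i is built from the first d_i columns of Q_i, and all helper and parameter names are ours.

```python
import numpy as np

def cost_K(x, mu, Q, a, b, d, prior):
    """K_i(x) as above; P_i projects onto the affine subspace mu + span(Q[:, :d])."""
    p = len(x)
    U = Q[:, :d]                      # orthonormal basis of E_i
    Px = mu + U @ U.T @ (x - mu)      # P_i(x)
    return (np.sum((mu - Px) ** 2) / a
            + np.sum((x - Px) ** 2) / b
            + d * np.log(a) + (p - d) * np.log(b)
            - 2.0 * np.log(prior))

def posteriors(x, params):
    """params: list of (mu, Q, a, b, d, prior) tuples, one per class."""
    K = np.array([cost_K(x, *th) for th in params])
    w = np.exp(-0.5 * (K - K.min()))  # subtract min for numerical stability
    return w / w.sum()                # p(C_i | x)
```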
HDDA: reformulation
In order to interpret the decision rule δ+ more easily, we introduce α_i and σ_i:
a_i = σ_i²/α_i and b_i = σ_i²/(1 - α_i), with α_i ∈ ]0, 1[ and σ_i > 0.
Thus, the decision rule δ+ consists in classifying x to the class C_i if:
i = argmin_{i=1,...,k} { (1/σ_i²) (α_i ‖µ_i - P_i(x)‖² + (1 - α_i) ‖x - P_i(x)‖²) + 2p log(σ_i) + d_i log((1 - α_i)/α_i) - p log(1 - α_i) - 2 log(π_i) }.
Notation: HDDA is the model [a_i b_i Q_i d_i], or equivalently [α_i σ_i Q_i d_i].
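For completeness, a short check that this form follows by substituting a_i = σ_i²/α_i and b_i = σ_i²/(1 - α_i) into K_i:
K_i(x) = (α_i/σ_i²) ‖µ_i - P_i(x)‖² + ((1 - α_i)/σ_i²) ‖x - P_i(x)‖² + d_i log(σ_i²/α_i) + (p - d_i) log(σ_i²/(1 - α_i)) - 2 log(π_i),
and since d_i log(σ_i²/α_i) + (p - d_i) log(σ_i²/(1 - α_i)) = 2p log(σ_i) - d_i log(α_i) - (p - d_i) log(1 - α_i) = 2p log(σ_i) + d_i log((1 - α_i)/α_i) - p log(1 - α_i), the reformulated cost above is recovered.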
Part 4: Particular rules
Particular rules
By allowing some but not all of the HDDA parameters to vary between classes, we obtain 24 particular rules:
they correspond to different regularizations,
some of them are easily interpretable geometrically,
9 of them have explicit formulations.
HDDA reduces to a classical discriminant analysis in particular cases:
if ∀i, α_i = 1/2: δ+ is QDA with spherical classes,
if in addition ∀i, σ_i = σ: δ+ is LDA with spherical classes.
Links with classical methods
[Diagram relating the models]
QDA; EDDA: Σ_i = λ_i D_i A_i D_i^t; HDDA: Σ_i = Q_i Δ_i Q_i^t;
Σ_i = λ D A D^t: LDA; A_i = Id or α_i = 1/2: QDA with spherical classes (QDAs);
Σ_i = σ_i² Id, then σ_i = σ: LDA with spherical classes (LDAs); π_i = π: geometric LDA.
Model [ασq i d i ] The decision rule δ + consists in classifying x to the class C i if: i = argmin{α µ i P i (x) 2 + (1 α) x P i (x) 2 }. i=1,...,k High Dimensional Discriminant Analysis - Lear seminar p.26/43
Part 5: Estimation
HDDA estimators
Estimators are computed by maximum likelihood from the learning set A.
Common estimators:
π̂_i = n_i / n, with n_i = #(C_i),
µ̂_i = (1/n_i) Σ_{x_j ∈ C_i} x_j,
Σ̂_i = (1/n_i) Σ_{x_j ∈ C_i} (x_j - µ̂_i)(x_j - µ̂_i)^t.
Estimators of the model [a_i b_i Q_i d_i]
Assuming d_i is known, the ML estimators are:
Q̂_i is made of the eigenvectors associated with the ordered eigenvalues of Σ̂_i,
â_i is the mean of the d_i largest eigenvalues of Σ̂_i: â_i = (1/d_i) Σ_{l=1}^{d_i} λ_{il},
b̂_i is the mean of the (p - d_i) smallest eigenvalues of Σ̂_i: b̂_i = (1/(p - d_i)) Σ_{l=d_i+1}^{p} λ_{il}.
Estimation trick
The decision rule δ+ does not require computing the last (p - d_i) eigenvectors of Σ̂_i. Thus, in order to minimize the number of parameters to estimate, we use the following relation:
Σ_{l=d_i+1}^{p} λ_{il} = tr(Σ̂_i) - Σ_{l=1}^{d_i} λ_{il}.
Number of parameters to estimate with p = 100, d_i = 10 and k = 4: QDA: 20 603; HDDA: 4 323.
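A minimal sketch of the per-class ML estimators, using the trace identity above so that only the d_i leading eigenvalues are needed for b̂_i; variable names are ours and X is assumed to hold one class's samples row-wise.

```python
import numpy as np

def fit_class(X, d):
    """ML estimates (mu, Q, a, b) for one class from its n_i x p sample matrix X."""
    n_i, p = X.shape
    mu = X.mean(axis=0)
    Sigma = (X - mu).T @ (X - mu) / n_i              # empirical covariance
    eigval, eigvec = np.linalg.eigh(Sigma)           # ascending eigenvalues
    eigval, eigvec = eigval[::-1], eigvec[:, ::-1]   # sort in decreasing order
    a = eigval[:d].mean()                            # mean of the d largest eigenvalues
    b = (np.trace(Sigma) - eigval[:d].sum()) / (p - d)  # trace trick for b_i
    return mu, eigvec, a, b
```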
Intrinsic dimension estimation
We base our approach for choosing the values of d_i on the eigenvalues of Σ_i. We use two empirical methods (a short sketch follows the figure below):
common thresholding on the cumulative variance: d_i is the smallest dimension d ∈ {1, ..., p - 1} such that Σ_{j=1}^{d} λ_{ij} / Σ_{j=1}^{p} λ_{ij} ≥ s, for a given threshold s,
scree-test of Cattell: analyse the differences between successive eigenvalues in order to find a break in the scree of eigenvalues.
Intrinsic dimension estimation
[Figure: ordered eigenvalues of Σ_i (top); cumulative sum of eigenvalues for common thresholding (bottom left); differences between eigenvalues for the scree-test of Cattell (bottom right).]
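Two small sketches of these empirical choices of d_i, assuming the eigenvalues are already sorted in decreasing order; the default threshold s and the scree heuristic (keep the eigenvalues before the largest gap, one simple variant of Cattell's test) are illustrative choices of ours.

```python
import numpy as np

def dim_by_threshold(eigvals, s=0.9):
    """Smallest d whose cumulative variance fraction reaches the threshold s."""
    ratios = np.cumsum(eigvals) / np.sum(eigvals)
    return int(np.argmax(ratios >= s)) + 1

def dim_by_scree(eigvals):
    """One simple scree variant: keep the eigenvalues before the largest gap."""
    gaps = -np.diff(eigvals)          # positive gaps between consecutive eigenvalues
    return int(np.argmax(gaps)) + 1
```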
Part 6: Numerical results
Results: artificial data
Method                     Classification rate
HDDA ([a_i b_i Q_i d_i])   0.958
HDDA ([a_i b_i Q_i d])     0.964
LDA                        0.512
FDA                        0.51
SVM                        0.478
3 Gaussian densities in R^15, with d_1 = 3, d_2 = 4 and d_3 = 5. In addition, the proportions are very different: π_1 = 1/2, π_2 = 1/3 and π_3 = 1/6.
Results: image categorization
A recent study [LBGGDH03] proposes an approach based on human perception to categorize natural images. An image is represented by a vector of 49 dimensions; each of these 49 components is the response of the image to a Gabor filter.
Results: image categorization
Data: 328 descriptors in 49 dimensions.
Results:
Method                     Classification rate
HDDA ([a_i b_i Q_i d_i])   0.857
HDDA ([a_i b Q_i d])       0.881
QDA                        0.849
LDA                        0.775
FDA (d = k - 1)            0.79
SVM                        0.839
Classification results for the image categorization experiment (leave-one-out).
Results: object recognition
Our approach uses local descriptors (Harris-Laplace + SIFT). We consider 3 object classes (wheels, seat and handlebars) and 1 background class. The dataset is made of 1000 descriptors in 128 dimensions: learning dataset: 500, test dataset: 500.
Results: object recognition
[Figure: ROC curves (true positives vs. false positives). Left: FDA, LDA, SVM classifiers and HDDA classifiers. Right: HDDA with error probability < 10^-5 and with error probability < 10^-10.]
Classification results for the object recognition experiment.
Results: object recognition Recognition using HDDA Recognition using SVM High Dimensional Discriminant Analysis - Lear seminar p.39/43
Part 7: Unsupervised classification
Extension to unsupervised classification
Unsupervised classification aims to organize data into homogeneous classes. Gaussian mixture models (GMM) are an efficient tool for unsupervised classification:
in Gaussian mixture models, the density of the mixture is:
f(x; θ) = Σ_{i=1}^{k} π_i f_i(x; µ_i, Σ_i), where θ = {π_1, ..., π_k, µ_1, ..., µ_k, Σ_1, ..., Σ_k};
parameter estimation is generally done with the EM algorithm.
Extension to unsupervised classification
Using our model for high-dimensional data, the two main steps of the EM algorithm are:
E step: compute t_ij^(q) = t_i^(q)(x_j) = exp(-K_i^(q)(x_j)/2) / Σ_{l=1}^{k} exp(-K_l^(q)(x_j)/2),
where K_i^(q)(x_j) = ‖µ_i^(q) - P_i^(q)(x_j)‖² / a_i^(q) + ‖x_j - P_i^(q)(x_j)‖² / b_i^(q) + d_i^(q) log(a_i^(q)) + (p - d_i^(q)) log(b_i^(q)) - 2 log(π_i^(q)).
M step: classical estimation of π_i, µ_i and Σ_i; the estimators of a_i, b_i and Q_i are the same as those of HDDA.
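A hedged sketch of the E step: the n x k cost matrix, with entries K_i^(q)(x_j) evaluated at the current parameters (for instance with the K_i sketch given earlier), is assumed precomputed; the function name is ours.

```python
import numpy as np

def e_step(costs):
    """E step: turn the n x k matrix of costs K_i^(q)(x_j) into responsibilities
    t_ij = exp(-K_i/2) / sum_l exp(-K_l/2), computed in a numerically stable way."""
    K = costs - costs.min(axis=1, keepdims=True)  # shifting each row leaves the ratios unchanged
    T = np.exp(-0.5 * K)
    return T / T.sum(axis=1, keepdims=True)       # row j holds p(C_i | x_j) for all classes i
```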
References
[BC96] H. Bensmail and G. Celeux. Regularized Gaussian discriminant analysis through eigenvalue decomposition. Journal of the American Statistical Association, 91:1743-1748, 1996.
[Bel61] R. Bellman. Adaptive Control Processes. Princeton University Press, 1961.
[Fri89] J.H. Friedman. Regularized discriminant analysis. Journal of the American Statistical Association, 84:165-175, 1989.
[LBGGDH03] H. Le Borgne, N. Guyader, A. Guerin-Dugué, and J. Hérault. Classification of images: ICA filters vs human perception. In 7th International Symposium on Signal Processing and its Applications, number 2, pages 251-254, 2003.
[ST83] D. Scott and J. Thompson. Probability density estimation in higher dimensions. In Proceedings of the Fifteenth Symposium on the Interface, North Holland-Elsevier Science Publishers, pages 173-179, 1983.