High Dimensional Discriminant Analysis


High Dimensional Discriminant Analysis. Charles Bouveyron, LMC-IMAG & INRIA Rhône-Alpes. Joint work with S. Girard and C. Schmid. ASMDA, Brest, May 2005.

Introduction. Modern data are high-dimensional: imagery (MRI, computer vision), biology (DNA micro-arrays). Classification is very difficult in high-dimensional spaces: many learning methods suffer from the curse of dimensionality [Bel61], while the empty space phenomenon [ST83] allows us to assume that data live in low-dimensional subspaces.

Outline 1 Framework of discriminant analysis 2 New model for high-dimensional data 3 High Dimensional Discriminant Analysis 4 Estimators and intrinsic dimension estimation 5 Numerical results 6 Conclusion & work in progress

Classification. Classification: supervised classification (discriminant analysis), unsupervised classification (clustering). Two main families of classification methods: generative methods (QDA, LDA) and discriminative methods (logistic regression, SVM). Generative models can be used in both supervised and unsupervised classification.

Discrimination problem. The basic problem: assign an observation x = (x_1, ..., x_p) ∈ R^p with unknown class membership to one of k classes C_1, ..., C_k known a priori. We have a learning dataset A = {(x_1, y_1), ..., (x_n, y_n) : x_j ∈ R^p and y_j ∈ {1, ..., k}}, where the vector x_j contains p explanatory variables and y_j indicates the index of the class of x_j. We have to construct a decision rule δ : R^p → {1, ..., k}, x ↦ y.

Bayes decision rule. The optimal decision rule δ* is:
δ* : x ∈ C_i if i = argmax_{i=1,...,k} P(C_i | x),
equivalently, δ* : x ∈ C_i if i = argmin_{i=1,...,k} {−2 log(π_i f_i(x))},
where π_i is the a priori probability of class C_i and f_i(x) denotes the class conditional density of x. We consider only generative methods, which assume that the class distributions are Gaussian N(µ_i, Σ_i).

Classical methods. Quadratic discriminant analysis (QDA):
i = argmin_{i=1,...,k} { (x − µ_i)^t Σ_i^{-1} (x − µ_i) + log(det Σ_i) − 2 log(π_i) }.
Linear discriminant analysis (LDA), with the assumption that Σ_i = Σ for all i:
i = argmin_{i=1,...,k} { µ_i^t Σ^{-1} µ_i − 2 µ_i^t Σ^{-1} x − 2 log(π_i) }.
QDA and LDA behave disappointingly when n is small compared to p.
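The QDA and LDA rules above can be written down directly. Below is a minimal numpy sketch of both rules, assuming the class means, covariance matrices and priors have already been estimated; the function and variable names are illustrative, not part of the original material.

    import numpy as np

    def qda_predict(x, means, covs, priors):
        """QDA rule: minimize the quadratic cost given above for each class."""
        costs = []
        for mu, Sigma, pi in zip(means, covs, priors):
            diff = x - mu
            cost = (diff @ np.linalg.solve(Sigma, diff)   # (x - mu)^t Sigma^{-1} (x - mu)
                    + np.linalg.slogdet(Sigma)[1]         # log det(Sigma)
                    - 2.0 * np.log(pi))
            costs.append(cost)
        return int(np.argmin(costs))

    def lda_predict(x, means, common_cov, priors):
        """LDA rule: same as QDA but with a common covariance matrix Sigma."""
        Sinv = np.linalg.inv(common_cov)
        costs = [mu @ Sinv @ mu - 2.0 * (mu @ Sinv @ x) - 2.0 * np.log(pi)
                 for mu, pi in zip(means, priors)]
        return int(np.argmin(costs))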

Regularizations Dimension reduction: PCA, feature selection, Fisher Discriminant Analysis (FDA). Parsimonious models: Regularized discriminant analysis [Fri89], Eigenvalue decomposition discriminant analysis [BC96].

Classification of high-dimensional data. [Figure: four panels comparing the correct classification, FDA classification (48.8% correct), SVM classification (46.4% correct) and HDDA classification (95.3% correct).] Three Gaussian densities in R^100 with intrinsic dimensions equal to 2. For visualization, the data are projected on the 2 discriminant axes.

The idea of the new model. The main ideas: data of the same class live in a specific low-dimensional subspace, and data of different classes live in different subspaces. For each class, we split R^p into two subspaces: the subspace where the data live and its orthogonal complement. We use a parsimonious model: each class is modeled as a spherical density in each of the 2 subspaces.

The new model. We assume that the class conditional densities are Gaussian N(µ_i, Σ_i) with means µ_i and covariance matrices Σ_i. Let Q_i be the orthogonal matrix of eigenvectors of the covariance matrix Σ_i, and let B_i be the basis of R^p made of the eigenvectors of Σ_i. The class conditional covariance matrix Δ_i is defined in the basis B_i by: Δ_i = Q_i^t Σ_i Q_i.

The new model. We assume in addition that Δ_i contains only two different eigenvalues a_i > b_i. Let E_i be the affine space generated by the eigenvectors associated with the eigenvalue a_i and such that µ_i ∈ E_i. We also define E_i^⊥ such that E_i ⊕ E_i^⊥ = R^p and µ_i ∈ E_i^⊥. Let P_i and P_i^⊥ be the projection operators on E_i and E_i^⊥.

The new model. Thus, we assume that Δ_i has the following block-diagonal form:
Δ_i = diag(a_i, ..., a_i, b_i, ..., b_i),
where the eigenvalue a_i is repeated d_i times and the eigenvalue b_i is repeated (p − d_i) times.

High Dimensional Discriminant Analysis. Under the preceding assumptions, the Bayes decision rule yields a new decision rule δ+.
Theorem. The new decision rule δ+ consists in classifying x to the class C_i if:
i = argmin_{i=1,...,k} { (1/a_i) ||µ_i − P_i(x)||^2 + (1/b_i) ||x − P_i(x)||^2 + d_i log(a_i) + (p − d_i) log(b_i) − 2 log(π_i) }.
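A small numpy sketch of this rule, assuming the per-class parameters (µ_i, an orthonormal basis Q_i of the d_i leading eigenvectors, a_i, b_i, d_i and π_i) are already available; P_i(x) is the orthogonal projection of x onto the affine subspace E_i, i.e. P_i(x) = µ_i + Q_i Q_i^t (x − µ_i). Names are illustrative.

    import numpy as np

    def hdda_cost(x, mu, Q, a, b, d, pi):
        """K_i(x) from the theorem; Q has shape (p, d_i) with orthonormal
        columns spanning the class-specific subspace E_i (through mu)."""
        p = x.shape[0]
        z = x - mu
        proj = mu + Q @ (Q.T @ z)        # P_i(x): projection of x onto the affine subspace E_i
        return (np.sum((mu - proj) ** 2) / a
                + np.sum((x - proj) ** 2) / b
                + d * np.log(a) + (p - d) * np.log(b)
                - 2.0 * np.log(pi))

    def hdda_predict(x, classes):
        """classes: list of dicts with keys mu, Q, a, b, d, pi (one per class)."""
        costs = [hdda_cost(x, c["mu"], c["Q"], c["a"], c["b"], c["d"], c["pi"])
                 for c in classes]
        return int(np.argmin(costs))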

HDDA: illustration. [Figure: the subspace E_i and its supplementary E_i^⊥.]
K_i(x) = (1/a_i) ||µ_i − P_i(x)||^2 + (1/b_i) ||x − P_i(x)||^2 + d_i log(a_i) + (p − d_i) log(b_i) − 2 log(π_i).

HDDA: particular rules. By allowing some but not all of the HDDA parameters to vary, we obtain 24 particular models: they correspond to different regularizations, some of them are easily geometrically interpretable, and 9 of them have explicit formulations. Notation: a_i = σ_i^2 / α_i with α_i ∈ (0, 1), and b_i = σ_i^2 / (1 − α_i) with σ_i > 0. HDDA reduces to classical discriminant analysis in particular cases: if α_i = 1/2 for all i, δ+ is QDA with spherical classes; if in addition σ_i = σ for all i, δ+ is LDA with spherical classes.

Model [ασQ_i d_i]. Theorem. The decision rule δ+ consists in classifying x to the class C_i if:
i = argmin_{i=1,...,k} { α ||µ_i − P_i(x)||^2 + (1 − α) ||x − P_i(x)||^2 }.

HDDA estimators. Estimators are computed using maximum likelihood estimation from the learning set A. Classical estimators:
π̂_i = n_i / n, with n_i = #(C_i),
µ̂_i = (1/n_i) Σ_{x_j ∈ C_i} x_j,
Σ̂_i = (1/n_i) Σ_{x_j ∈ C_i} (x_j − µ̂_i)^t (x_j − µ̂_i).
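A numpy sketch of these classical estimators, assuming the learning set is stored as an array X of shape (n, p) with labels y in {0, ..., k−1}; the names are illustrative.

    import numpy as np

    def class_estimates(X, y, k):
        """Empirical proportions, means and ML covariance matrices per class."""
        n = X.shape[0]
        priors, means, covs = [], [], []
        for i in range(k):
            Xi = X[y == i]
            ni = Xi.shape[0]
            mu = Xi.mean(axis=0)
            centered = Xi - mu
            priors.append(ni / n)                    # pi_hat_i = n_i / n
            means.append(mu)                         # mu_hat_i
            covs.append(centered.T @ centered / ni)  # Sigma_hat_i (ML: divide by n_i)
        return priors, means, covs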

Estimators of the model [a_i b_i Q_i d_i]. Assuming d_i is known, the ML estimators are:
Q̂_i is made of the eigenvectors associated with the ordered eigenvalues of Σ̂_i,
â_i is the mean of the d_i largest eigenvalues of Σ̂_i: â_i = (1/d_i) Σ_{l=1}^{d_i} λ_{il},
b̂_i is the mean of the (p − d_i) smallest eigenvalues of Σ̂_i: b̂_i = (1/(p − d_i)) Σ_{l=d_i+1}^{p} λ_{il}.
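A sketch of these estimators for a single class, assuming d_i is known and Sigma_hat is the ML covariance estimate computed above.

    import numpy as np

    def subspace_estimates(Sigma_hat, d):
        """Q_hat_i, a_hat_i and b_hat_i from the spectrum of Sigma_hat_i."""
        eigvals, eigvecs = np.linalg.eigh(Sigma_hat)         # ascending eigenvalues
        eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # reorder: largest first
        Q = eigvecs[:, :d]            # d leading eigenvectors span E_i
        a = eigvals[:d].mean()        # mean of the d largest eigenvalues
        b = eigvals[d:].mean()        # mean of the (p - d) smallest eigenvalues
        return Q, a, b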

Estimation trick. In order to minimize the number of parameters to estimate, we use the following relation:
Σ_{l=d_i+1}^{p} λ_{il} = tr(Σ̂_i) − Σ_{l=1}^{d_i} λ_{il}.
Number of parameters to estimate with p = 100, d_i = 10 and k = 4:
Method / Number of parameters:
QDA: 20,603
HDDA (model [a_i b_i Q_i d_i]): 4,323
HDDA (model [a_i b_i Q d]): 1,367
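In practice this relation means that b̂_i can be obtained from the trace of Σ̂_i and the d_i largest eigenvalues alone, without touching the rest of the spectrum; a small sketch under that reading (names are illustrative):

    import numpy as np

    def b_from_trace(Sigma_hat, leading_eigvals):
        """Mean of the (p - d) smallest eigenvalues, using only the trace and the
        d largest eigenvalues: their sum equals tr(Sigma_hat) minus the leading sum."""
        p = Sigma_hat.shape[0]
        d = len(leading_eigvals)
        return (np.trace(Sigma_hat) - np.sum(leading_eigvals)) / (p - d)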

Intrinsic dimension estimation. We base our approach for choosing the values of d_i on the eigenvalues of Σ_i, using the scree test of Cattell [Cat66]. [Figure: the scree test of Cattell.]
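The slide only names the scree test, so the following sketch is one common heuristic reading of it: keep the dimensions that come before the eigenvalue drops become small relative to the largest drop (the threshold value is an assumption, not from the original).

    import numpy as np

    def scree_dimension(eigvals, threshold=0.2):
        """Heuristic scree test: eigvals must be sorted in decreasing order.
        Keep every dimension up to the last eigenvalue drop that is still
        at least `threshold` times the largest drop."""
        drops = -np.diff(eigvals)                           # drops lambda_l - lambda_{l+1}
        large = np.where(drops >= threshold * drops.max())[0]
        return int(large[-1]) + 1                           # dimensions are counted from 1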

Optical character recognition. We consider the USPS dataset: learning set of 2007 examples, test set of 7291 examples. [Figure: examples of the USPS dataset.] Recognition results:
Method / Recognition rate:
HDDA [a_i b Q_i d_i]: 95.86 %
HDDA [a_i b_i Q_i d_i]: 95.52 %
LDA (d = 256): 74.56 %
FDA (d = 9): 90.23 %
SVM (linear): 94.28 %
Human: 97.50 %

Object recognition. Our approach uses local descriptors: detection of interest points with the Harris-Laplace operator, and description of the interest points with the SIFT descriptor. We consider 3 object classes (wheels, seat and handlebars) and 1 background class. The dataset contains 1000 descriptors in 128 dimensions: 500 for learning and 500 for testing.

Numerical results. [Figure: precision-recall curves comparing HDDA, SVM (RBF kernel, γ = 0.6), FDA and PCA+LDA (d = 45).] Classification results for the object recognition experiment.

Recognition results. [Figure: recognition using HDDA vs. recognition using SVM.] Recognition results for the object recognition experiment.

Conclusion. The new model proposed here finds the specific subspace and estimates the intrinsic dimension of each class, uses this information in the Gaussian model of each class, and includes additional assumptions in order to reduce the number of parameters to estimate. The main advantages of our model: good performance without prior dimension reduction of the data, good performance with small learning datasets, it is as fast as classical generative methods, and it can be used in both supervised and unsupervised classification.

Work in progress. Extension to unsupervised classification using the EM algorithm. Application to object recognition in a weakly-supervised framework: unsupervised classification to learn object parts, supervised classification to recognize the object in a new image.

References
[BC96] H. Bensmail and G. Celeux. Regularized Gaussian discriminant analysis through eigenvalue decomposition. Journal of the American Statistical Association, 91:1743-1748, 1996.
[Bel61] R. Bellman. Adaptive Control Processes. Princeton University Press, 1961.
[BGS05] C. Bouveyron, S. Girard, and C. Schmid. Analyse discriminante de haute dimension. Research Report 5470, INRIA, January 2005.
[Cat66] R. B. Cattell. The scree test for the number of factors. Multivariate Behavioral Research, 1(2):140-161, 1966.
[Fri89] J. H. Friedman. Regularized discriminant analysis. Journal of the American Statistical Association, 84:165-175, 1989.
[Low04] D. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91-110, 2004.
[ST83] D. Scott and J. Thompson. Probability density estimation in higher dimensions. In Proceedings of the Fifteenth Symposium on the Interface, pages 173-179. North Holland-Elsevier Science Publishers, 1983.