Semiparametric Discriminant Analysis of Mixture Populations Using Mahalanobis Distance. Probal Chaudhuri and Subhajit Dutta, Indian Statistical Institute, Kolkata. Workshop on Classification and Regression Trees, Institute of Mathematical Sciences, National University of Singapore.
Brief history. Fisher, R. A. (1936), The use of multiple measurements in taxonomic problems, Ann. Eugenics, 7, 179-188. Fisher was interested in the taxonomic classification of different species of Iris. He had measurements on the lengths and the widths of the sepals and the petals of the flowers of three different species, namely, Iris setosa, Iris virginica and Iris versicolor.
Brief history (Contd.). Mahalanobis, P. C. (1936), On the generalized distance in statistics, Proc. Nat. Acad. Sci., India, 12, 49-55. Mahalanobis met Nelson Annandale at the 1920 Nagpur session of the Indian Science Congress. Annandale asked Mahalanobis to analyze anthropometric measurements of Anglo-Indians in Calcutta. This eventually led to the development of Mahalanobis distance.
Mahalanobis distance and Fisher's discriminant function. Mahalanobis distance of an observation $x$ from a population with mean $\mu$ and dispersion $\Sigma$: $\mathrm{MD}(x,\mu,\Sigma) = \{(x-\mu)^{\top}\Sigma^{-1}(x-\mu)\}^{1/2}$. Fisher's linear discriminant function for two populations with means $\mu_1$ and $\mu_2$ and a common dispersion $\Sigma$: $\{x - (\mu_1+\mu_2)/2\}^{\top}\Sigma^{-1}(\mu_1-\mu_2)$.
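Both quantities are straightforward to compute; a minimal numpy sketch (function names are my own):

```python
import numpy as np

def mahalanobis(x, mu, sigma):
    """Mahalanobis distance of x from a population with mean mu and dispersion sigma:
    MD(x, mu, Sigma) = sqrt((x - mu)^T Sigma^{-1} (x - mu))."""
    d = x - mu
    return float(np.sqrt(d @ np.linalg.solve(sigma, d)))

def fisher_ldf(x, mu1, mu2, sigma):
    """Fisher's linear discriminant function for two populations with means
    mu1, mu2 and a common dispersion sigma."""
    return float((x - (mu1 + mu2) / 2) @ np.linalg.solve(sigma, mu1 - mu2))
```

For example, `fisher_ldf` vanishes at the midpoint $(\mu_1+\mu_2)/2$, which is why the rule "classify to class 1 when the discriminant function is positive" splits the space by a hyperplane through that midpoint.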
Mahalanobis distance and Fisher's discriminant function (Contd.). The linear discriminant function has Bayes risk optimality for Gaussian class distributions which differ in their locations but have the same dispersion. This has been discussed in detail in Welch (1939, Biometrika) and Rao (1948, JRSS-B). In fact, the Bayes risk optimality holds for elliptically symmetric and unimodal class distributions which differ only in their locations.
Examples. Example (a): Class 1: mixture of $N_d(0,\Sigma)$ and $N_d(0,10\Sigma)$; Class 2: $N_d(0,5\Sigma)$. Here $\Sigma = 0.5\,\mathbf{1}\mathbf{1}^{\top} + 0.5\,I_d$, and $N_d$ denotes the $d$-variate normal distribution. Example (b): Class 1: mixture of $U_d(0,\Sigma,0,1)$ and $U_d(0,\Sigma,2,3)$; Class 2: mixture of $U_d(0,\Sigma,1,2)$ and $U_d(0,\Sigma,3,4)$. $U_d(\mu,\Sigma,r_1,r_2)$ denotes the uniform distribution over the region $\{x \in \mathbb{R}^d : r_1 < \|\Sigma^{-1/2}(x-\mu)\| < r_2\}$. The classes have the same location $0$ but different scatters and shapes.
Bayes class boundaries. Figure: Bayes class boundaries in $\mathbb{R}^2$. Panels: (a) Example (a); (b) Example (b).
Bayes class boundaries (Contd.). The class distributions involve elliptically symmetric distributions. They have the same location (i.e., $0$), and they differ in their scatters as well as shapes. No linear or quadratic classifier will work here!!
Performance of some standard classifiers. Figure: Misclassification rates of LDA, QDA, two nonparametric classifiers (k-NN and KDE) and the Bayes classifier, plotted against $\log(d)$ for $d$ = 2, 5, 10, 20, 50 and 100. Panels: (a) Example (a); (b) Example (b).
Nonparametric multinomial additive logistic regression model. Suppose that the class densities are elliptically symmetric: $f_i(x) = |\Sigma_i|^{-1/2}\, g_i(\|\Sigma_i^{-1/2}(x-\mu_i)\|) = \psi_i(\mathrm{MD}(x,\mu_i,\Sigma_i))$ for $i = 1, 2$. The class posterior probabilities are $p(1\mid x) = \pi_1 f_1(x)/\{\pi_1 f_1(x)+\pi_2 f_2(x)\}$ and $p(2\mid x) = 1 - p(1\mid x)$. It is easy to see that $\log\{p(1\mid x)/p(2\mid x)\} = \log\{\pi_1 f_1(x)/\pi_2 f_2(x)\} = \log(\pi_1/\pi_2) + \log\psi_1(\mathrm{MD}(x,\mu_1,\Sigma_1)) - \log\psi_2(\mathrm{MD}(x,\mu_2,\Sigma_2))$.
Nonparametric multinomial additive logistic regression model (Contd.). The posteriors turn out to be of the form $p(1\mid x) = p(1\mid z(x)) = \dfrac{\exp\{\log\psi_1(z_1(x)) - \log\psi_2(z_2(x))\}}{1+\exp\{\log\psi_1(z_1(x)) - \log\psi_2(z_2(x))\}}$ and $p(2\mid x) = p(2\mid z(x)) = \dfrac{1}{1+\exp\{\log\psi_1(z_1(x)) - \log\psi_2(z_2(x))\}}$, where $z(x) = (z_1(x), z_2(x)) = (\mathrm{MD}(x,\mu_1,\Sigma_1), \mathrm{MD}(x,\mu_2,\Sigma_2))$. The posterior probabilities satisfy a generalized additive model (Hastie and Tibshirani, 1990).
Nonparametric multinomial additive logistic regression model (Contd.). If we have two normal populations $N_d(\mu_1,\Sigma)$ and $N_d(\mu_2,\Sigma)$, we get a linear logistic regression model for the posterior probabilities. This is related to Fisher's linear discriminant analysis. If we have two normal populations $N_d(\mu_1,\Sigma_1)$ and $N_d(\mu_2,\Sigma_2)$, we get a quadratic logistic regression model for the posterior probabilities. This is related to quadratic discriminant analysis.
Nonparametric multinomial additive logistic regression model (Contd.). For any $1 \le i \le (J-1)$, it is easy to see that $\log\{p(i\mid x)/p(J\mid x)\} = \log(\pi_i/\pi_J) + \log\psi_i(\mathrm{MD}(x,\mu_i,\Sigma_i)) - \log\psi_J(\mathrm{MD}(x,\mu_J,\Sigma_J))$, where $p(i\mid x)$ is the posterior probability of the $i$-th class. For any $1 \le i \le (J-1)$, the posteriors are of the form $p(i\mid x) = p(i\mid z(x)) = \dfrac{\exp(\phi_i(z(x)))}{1+\sum_{k=1}^{J-1} \exp(\phi_k(z(x)))}$ and $p(J\mid x) = p(J\mid z(x)) = \dfrac{1}{1+\sum_{k=1}^{J-1} \exp(\phi_k(z(x)))}$, where $z(x) = (\mathrm{MD}(x,\mu_1,\Sigma_1), \ldots, \mathrm{MD}(x,\mu_J,\Sigma_J))$.
Nonparametric multinomial additive logistic regression model (Contd.). We replace the original feature variables by the Mahalanobis distances from the different classes: $x \mapsto z(x) = (\mathrm{MD}(x,\mu_1,\Sigma_1), \ldots, \mathrm{MD}(x,\mu_J,\Sigma_J))$. One can use the backfitting algorithm (Hastie and Tibshirani, 1990) to estimate the posterior probabilities from the training data.
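The feature replacement step above can be sketched in a few lines of numpy. This is an illustrative sketch (the function name is my own); the class means and dispersions would in practice be estimated from the training data:

```python
import numpy as np

def md_features(X, means, covs):
    """Map each row x of X to z(x) = (MD(x, mu_1, Sigma_1), ..., MD(x, mu_J, Sigma_J)),
    the vector of Mahalanobis distances from the J classes."""
    Z = np.empty((len(X), len(means)))
    for j, (mu, sigma) in enumerate(zip(means, covs)):
        D = X - mu
        # row-wise quadratic form (x - mu)^T Sigma^{-1} (x - mu)
        Z[:, j] = np.sqrt(np.einsum('ij,ij->i', D, np.linalg.solve(sigma, D.T).T))
    return Z
```

The transformed data `Z` (one column per class) would then be fed to a generalized additive model fit by backfitting.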
More general class distributions Non-elliptic class distributions. Multi-modal class distributions. Mixture models for class distributions.
Finite mixture of elliptically symmetric densities. Assume $f_i(x) = \sum_{k=1}^{R_i} \theta_{ik}\, |\Sigma_{ik}|^{-1/2}\, g_{ik}(\|\Sigma_{ik}^{-1/2}(x-\mu_{ik})\|)$, where the $\theta_{ik}$'s are positive and satisfy $\sum_{k=1}^{R_i} \theta_{ik} = 1$ for all $1 \le i \le J$. The posterior probability for the $i$-th class is $p(i\mid x) = \sum_{r=1}^{R_i} p(c_{ir}\mid x)$ for all $1 \le i \le J$, where $c_{ir}$ denotes the $r$-th sub-class in the $i$-th class. The posterior probability $p(c_{ir}\mid x)$ satisfies a multinomial additive logistic regression model because the distribution of the sub-population $c_{ir}$ is elliptically symmetric.
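As one concrete instance of the mixture density above, take each $g_{ik}$ to be Gaussian, so that each sub-class density depends on $x$ only through its Mahalanobis distance. A minimal numpy sketch (function names are my own; the Gaussian choice is just one example of the elliptic density generator $\psi$):

```python
import numpy as np

def gaussian_elliptic_density(x, mu, sigma):
    """N_d density written through the Mahalanobis distance,
    i.e., psi(MD(x, mu, Sigma)) with a Gaussian psi."""
    d = len(mu)
    diff = x - mu
    md2 = diff @ np.linalg.solve(sigma, diff)  # squared Mahalanobis distance
    return np.exp(-0.5 * md2) / np.sqrt((2 * np.pi) ** d * np.linalg.det(sigma))

def mixture_density(x, weights, means, covs):
    """Finite mixture f_i(x) = sum_k theta_ik psi_ik(MD(x, mu_ik, Sigma_ik))."""
    return sum(w * gaussian_elliptic_density(x, m, s)
               for w, m, s in zip(weights, means, covs))
```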
The missing data problem In the training data, we have the class labels, but the sub-class labels are not available. If we had known the sub-class labels, we could once again use the backfitting algorithm to estimate the sub-class posteriors. Sub-class labels can be treated as missing observations. We can use an EM-type algorithm.
SPARC: The algorithm. Initial E-step: sub-class labels are estimated by an appropriate cluster analysis of the training data in each class. Initial and later M-steps: once the sub-class labels are obtained, sub-class posteriors can be estimated by fitting a nonparametric multinomial additive logistic regression model using the backfitting algorithm. Later E-steps: the sub-class labels are estimated by the sub-class posterior probabilities. Iterations are carried out until the posterior estimates stabilize. An observation is classified into the class having the largest posterior probability.
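The EM-type loop can be sketched as follows for a single class. This is a simplified illustration under stated assumptions, not the authors' implementation: Gaussian sub-class fits stand in for the nonparametric backfitting M-step, random initial labels stand in for a proper cluster analysis, and all names are my own.

```python
import numpy as np

def fit_subclass_gaussians(X, resp):
    """M-step surrogate: weighted Gaussian fit per sub-class (a stand-in for
    the backfitting fit of the sub-class posteriors)."""
    params = []
    for r in range(resp.shape[1]):
        w = resp[:, r]
        theta = w.mean()                                   # mixing weight theta_ik
        mu = (w[:, None] * X).sum(0) / w.sum()             # weighted mean
        D = X - mu
        sigma = (w[:, None] * D).T @ D / w.sum() + 1e-6 * np.eye(X.shape[1])
        params.append((theta, mu, sigma))
    return params

def subclass_density(X, mu, sigma):
    """Gaussian density of each row of X, evaluated through MD(x, mu, Sigma)."""
    d = X.shape[1]
    diff = X - mu
    md2 = np.einsum('ij,ij->i', diff, np.linalg.solve(sigma, diff.T).T)
    return np.exp(-0.5 * md2) / np.sqrt((2 * np.pi) ** d * np.linalg.det(sigma))

def sparc_like_fit(X, n_sub=2, n_iter=20, seed=0):
    """EM-type loop for one class: alternate sub-class fits (M-steps) with
    soft reassignment of the unobserved sub-class labels (E-steps)."""
    rng = np.random.default_rng(seed)
    # initial E-step: random labels as a crude placeholder for cluster analysis
    resp = np.eye(n_sub)[rng.integers(n_sub, size=len(X))]
    for _ in range(n_iter):
        params = fit_subclass_gaussians(X, resp)               # M-step
        dens = np.column_stack([t * subclass_density(X, m, s)
                                for t, m, s in params])
        resp = dens / dens.sum(1, keepdims=True)               # later E-step
    return resp, params
```

The returned `resp` matrix plays the role of the estimated sub-class posteriors; summing its columns within a class gives the class posterior used for classification.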
A pool of different classifiers. Traditional classifiers like LDA and QDA. Nonparametric classifiers based on k-NN and KDE. SVM with the linear kernel and with radial basis functions. Classifiers based on adaptive partitioning of the covariate space: CART, RF and PolyMARS.
A pool of different classifiers (Contd.) Hastie and Tibshirani (1996, JRSS-B) proposed an extension of LDA by modelling the density function of each class by a finite mixture of normal densities. They called their method mixture discriminant analysis (MDA). Fraley and Raftery (2002, JASA) extended MDA to MclustDA, where they considered mixtures of other families of parametric densities.
Simulated datasets. Examples (a) and (b). Example (c): the first class is an equal mixture of $N_d(0, 0.25 I_d)$ and $N_d(0, I_d)$, and the second class is an equal mixture of $N_d(-1, 0.25 I_d)$ and $N_d(1, 0.25 I_d)$. Example (d): one class distribution is an equal mixture of $U_d(0,\Sigma,0,1)$ and $U_d(2,\Sigma,2,3)$, and the other one is an equal mixture of $U_d(1,\Sigma,1,2)$ and $U_d(3,\Sigma,3,4)$.
Analysis of simulated data. Figure: Boxplots of misclassification rates of the Bayes classifier, LDA, QDA, SVM (linear and radial kernels), k-NN, KDE, CART, RF, PolyMARS, MDA, MclustDA and SPARC. Panels: (a) Example (a); (b) Example (b). Center of a box is the mean, and its width = 2/3 s.d.
Analysis of simulated data (Contd.). Figure: Boxplots of misclassification rates of the Bayes classifier, LDA, QDA, SVM (linear and radial kernels), k-NN, KDE, CART, RF, PolyMARS, MDA, MclustDA and SPARC. Panels: (a) Example (c); (b) Example (d). Center of a box is the mean, and its width = 2/3 s.d.
Real data Hemophilia data : Available in the R package rrcov. There are two classes of carrier and non-carrier women. The variables are measurements on AHF activity and AHF antigen. Biomedical data : Available in the CMU data archive. There are two classes of carriers and non-carriers of a rare genetic disease. Variables are four measurements on blood samples of individuals.
Analysis of real benchmark data sets. Figure: Boxplots of misclassification rates of LDA, QDA, SVM (linear and radial kernels), k-NN, KDE, CART, RF, PolyMARS, MDA, MclustDA and SPARC. Panels: (a) Hemophilia data; (b) Biomedical data. Center of a box is the mean, and its width = 2/3 s.d.
Real data (Contd.) Diabetes data : Available in the UCI data archive. There are three classes that consist of normal individuals, chemical diabetic and overt diabetic patients. The five variables are measurements related to weights of individuals and their insulin and glucose levels in blood. Vehicle data : Available in the UCI data archive. There are four types of vehicles and eighteen measurements related to the shape of each vehicle.
Analysis of real benchmark data sets (Contd.). Figure: Boxplots of misclassification rates of LDA, QDA, SVM (linear and radial kernels), k-NN, KDE, CART, RF, PolyMARS, MDA, MclustDA and SPARC. Panels: (a) Diabetes data; (b) Vehicle data. Center of a box is the mean, and its width = 2/3 s.d.
Looking back at the Iris data. Figure: Boxplot of misclassification rates for the Iris data (LDA, QDA, SVM with linear and radial kernels, k-NN, KDE, CART, RF, PolyMARS, MDA, MclustDA and SPARC). Center of the box is the mean, and its width = 2/3 s.d.