Semiparametric Discriminant Analysis of Mixture Populations Using Mahalanobis Distance. Probal Chaudhuri and Subhajit Dutta

Semiparametric Discriminant Analysis of Mixture Populations Using Mahalanobis Distance. Probal Chaudhuri and Subhajit Dutta, Indian Statistical Institute, Kolkata. Workshop on Classification and Regression Trees, Institute of Mathematical Sciences, National University of Singapore.

Brief history Fisher, R. A. (1936), The use of multiple measurements in taxonomic problems, Ann. Eugenics, 7, 179-188. Fisher was interested in the taxonomic classification of different species of Iris. He had measurements on the lengths and the widths of the sepals and the petals of the flowers of three different species, namely, Iris setosa, Iris virginica and Iris versicolor.

Brief history (Contd.) Mahalanobis, P. C. (1936), On the generalized distance in statistics, Proc. Nat. Acad. Sci., India, 12, 49-55. Mahalanobis met Nelson Annandale at the 1920 Nagpur session of the Indian Science Congress. Annandale asked Mahalanobis to analyze anthropometric measurements of Anglo-Indians in Calcutta. This eventually led to the development of Mahalanobis distance.

Mahalanobis distance and Fisher's discriminant function Mahalanobis distance of an observation x from a population with mean µ and dispersion Σ: MD(x, µ, Σ) = (x − µ)' Σ^{-1} (x − µ). Fisher's linear discriminant function for two populations with means µ_1 and µ_2 and a common dispersion Σ: {x − (µ_1 + µ_2)/2}' Σ^{-1} (µ_1 − µ_2).
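A minimal numpy sketch of these two quantities (the function names and the classification-by-sign remark are mine, not from the talk; some authors define MD as the square root of this quadratic form):

```python
import numpy as np

def MD(x, mu, Sigma):
    """Mahalanobis distance (x - mu)' Sigma^{-1} (x - mu), as defined above."""
    diff = x - mu
    return float(diff @ np.linalg.solve(Sigma, diff))

def fisher_ldf(x, mu1, mu2, Sigma):
    """Fisher's linear discriminant function {x - (mu1 + mu2)/2}' Sigma^{-1} (mu1 - mu2)."""
    return float((x - (mu1 + mu2) / 2) @ np.linalg.solve(Sigma, mu1 - mu2))

# With equal priors, classify x into population 1 if fisher_ldf(x, ...) > 0,
# and into population 2 otherwise.
```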

Mahalanobis distance and Fisher's discriminant function (Contd.)

Mahalanobis distance and Fisher's discriminant function (Contd.) The linear discriminant function has Bayes risk optimality for Gaussian class distributions which differ in their locations but have the same dispersion. This has been discussed in detail in Welch (1939, Biometrika) and Rao (1948, JRSS-B). In fact, the Bayes risk optimality holds for elliptically symmetric and unimodal class distributions which differ only in their locations.

Examples Example (a): Class 1 is a mixture of N_d(0, Σ) and N_d(0, 10Σ); Class 2 is N_d(0, 5Σ). Here Σ = 0.5 11' + 0.5 I_d (all diagonal entries 1, all off-diagonal entries 0.5), and N_d denotes the d-variate normal distribution. Example (b): Class 1 is a mixture of U_d(0, Σ, 0, 1) and U_d(0, Σ, 2, 3); Class 2 is a mixture of U_d(0, Σ, 1, 2) and U_d(0, Σ, 3, 4). Here U_d(µ, Σ, r_1, r_2) denotes the uniform distribution over the region {x ∈ R^d : r_1 < ||Σ^{-1/2}(x − µ)|| < r_2}. In both examples, the classes have the same location 0 but different scatters and shapes.
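As an illustration of how such data can be generated, here is a small numpy sketch for U_d(µ, Σ, r_1, r_2) and for class 1 of Example (b); the sampler construction, sample sizes and seed are my own assumptions, not from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_Ud(n, mu, Sigma, r1, r2):
    """Draw n points uniformly from {x : r1 < ||Sigma^{-1/2}(x - mu)|| < r2}.

    Sample a direction uniformly on the unit sphere, a radius R with density
    proportional to r^{d-1} on (r1, r2), and map y = R*u through x = mu + A y,
    where A A' = Sigma (any such A gives the correct elliptical region).
    """
    d = len(mu)
    u = rng.standard_normal((n, d))
    u /= np.linalg.norm(u, axis=1, keepdims=True)                     # directions
    R = (r1**d + rng.uniform(size=n) * (r2**d - r1**d)) ** (1.0 / d)  # radii
    A = np.linalg.cholesky(Sigma)                                     # A A' = Sigma
    return mu + (R[:, None] * u) @ A.T

# Class 1 of Example (b): an equal mixture of U_d(0, Sigma, 0, 1) and U_d(0, Sigma, 2, 3).
d = 2
Sigma = 0.5 * np.ones((d, d)) + 0.5 * np.eye(d)
X1 = np.vstack([sample_Ud(100, np.zeros(d), Sigma, 0, 1),
                sample_Ud(100, np.zeros(d), Sigma, 2, 3)])
```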

Bayes class boundaries Figure: Bayes class boundaries in R^2 for (a) Example (a) and (b) Example (b).

Bayes class boundaries (Contd.) The class distributions involve elliptically symmetric distributions. They have the same location (i.e., 0), and they differ in their scatters as well as their shapes. No linear or quadratic classifier will work here!

Performance of some standard classifiers Figure: Misclassification rates of LDA, QDA, two nonparametric classifiers (k-NN and KDE) and the Bayes classifier in (a) Example (a) and (b) Example (b), plotted against log(d) for d = 2, 5, 10, 20, 50 and 100.

Nonparametric multinomial additive logistic regression model Suppose that the class densities are elliptically symmetric: f_i(x) = |Σ_i|^{-1/2} g_i(||Σ_i^{-1/2}(x − µ_i)||) = ψ_i(MD(x, µ_i, Σ_i)) for i = 1, 2. The class posterior probabilities are p(1 | x) = π_1 f_1(x) / {π_1 f_1(x) + π_2 f_2(x)} and p(2 | x) = 1 − p(1 | x). It is easy to see that log{p(1 | x) / p(2 | x)} = log{π_1 f_1(x) / π_2 f_2(x)} = log(π_1/π_2) + log ψ_1(MD(x, µ_1, Σ_1)) − log ψ_2(MD(x, µ_2, Σ_2)).

Nonparametric multinomial additive logistic regression model (Contd.) The posteriors turn out to be of the form p(1 | x) = p(1 | z(x)) = exp{log ψ_1(z_1(x)) − log ψ_2(z_2(x))} / [1 + exp{log ψ_1(z_1(x)) − log ψ_2(z_2(x))}] and p(2 | x) = p(2 | z(x)) = 1 / [1 + exp{log ψ_1(z_1(x)) − log ψ_2(z_2(x))}], where z(x) = (z_1(x), z_2(x)) = (MD(x, µ_1, Σ_1), MD(x, µ_2, Σ_2)). The posterior probabilities satisfy a generalized additive model (Hastie and Tibshirani, 1990).

Nonparametric multinomial additive logistic regression model (Contd.) If we have two normal populations N_d(µ_1, Σ) and N_d(µ_2, Σ), we get a linear logistic regression model for the posterior probabilities. This is related to Fisher's linear discriminant analysis. If we have two normal populations N_d(µ_1, Σ_1) and N_d(µ_2, Σ_2), we get a quadratic logistic regression model for the posterior probabilities. This is related to quadratic discriminant analysis.
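The equal-dispersion case is a routine check, spelled out here for completeness: since log ψ_i(MD(x, µ_i, Σ)) = const − MD(x, µ_i, Σ)/2 for the normal density,

```latex
\log\frac{p(1\mid x)}{p(2\mid x)}
 = \log\frac{\pi_1}{\pi_2}
 + \tfrac12\bigl[(x-\mu_2)'\Sigma^{-1}(x-\mu_2) - (x-\mu_1)'\Sigma^{-1}(x-\mu_1)\bigr]
 = \log\frac{\pi_1}{\pi_2}
 + \Bigl\{x - \tfrac{\mu_1+\mu_2}{2}\Bigr\}'\Sigma^{-1}(\mu_1-\mu_2),
```

which is linear in x and, up to the prior term, is exactly Fisher's linear discriminant function from the earlier slide.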

Nonparametric multinomial additive logistic regression model (Contd.) For any 1 ≤ i ≤ (J − 1), it is easy to see that log{p(i | x) / p(J | x)} = log(π_i/π_J) + log ψ_i(MD(x, µ_i, Σ_i)) − log ψ_J(MD(x, µ_J, Σ_J)), where p(i | x) is the posterior probability of the i-th class. For any 1 ≤ i ≤ (J − 1), the posteriors are of the form p(i | x) = p(i | z(x)) = exp{φ_i(z(x))} / [1 + ∑_{k=1}^{J−1} exp{φ_k(z(x))}] and p(J | x) = p(J | z(x)) = 1 / [1 + ∑_{k=1}^{J−1} exp{φ_k(z(x))}], where z(x) = (MD(x, µ_1, Σ_1), …, MD(x, µ_J, Σ_J)).

Nonparametric multinomial additive logistic regression model (Contd.) We replace the original feature variables by the Mahalanobis distances from the different classes: x ↦ z(x) = (MD(x, µ_1, Σ_1), …, MD(x, µ_J, Σ_J)). One can use the backfitting algorithm (Hastie and Tibshirani, 1990) to estimate the posterior probabilities from the training data.
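A rough Python sketch of this feature replacement follows. A spline-basis expansion with multinomial logistic regression is used as a simple stand-in for the backfitting fit of the additive model; the estimator choices, function names and tuning values are my own assumptions, not the authors' implementation.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import SplineTransformer
from sklearn.linear_model import LogisticRegression

def md_features(X, params):
    """Map each row x of X to z(x) = (MD(x, mu_1, Sigma_1), ..., MD(x, mu_J, Sigma_J))."""
    cols = []
    for mu, Sigma in params:
        D = X - mu
        cols.append(np.sum(D * np.linalg.solve(Sigma, D.T).T, axis=1))
    return np.column_stack(cols)

def fit_additive_logistic(X, y):
    """Estimate class-wise (mu, Sigma), then fit an additive logistic model on z(x)."""
    classes = np.unique(y)
    params = [(X[y == c].mean(axis=0), np.cov(X[y == c], rowvar=False)) for c in classes]
    Z = md_features(X, params)
    model = make_pipeline(SplineTransformer(n_knots=6, degree=3),
                          LogisticRegression(max_iter=1000))
    model.fit(Z, y)
    return params, model

def predict_class(X, params, model):
    return model.predict(md_features(X, params))
```

Because the spline basis is applied coordinate-wise to z(x), the fitted log-odds are additive in the class-wise Mahalanobis distances, mirroring the generalized additive model above.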

More general class distributions Non-elliptic class distributions. Multi-modal class distributions. Mixture models for class distributions.

Finite mixture of elliptically symmetric densities Assume R i f i (x) = θ ik Σ ik 1/2 g ik ( Σ 1/2 ik (x µ ik ) ), k=1 where θ ik s are positive satisfying R i k=1 θ ik = 1 for all 1 i J. The posterior probability for the i-th class is R i p(i x) = p(c ir x) for all 1 i J, r=1 where c ir denotes the r-th sub-class in the i-th class. The posterior probability p(c ir x) satisfies a multinomial additive logistic regression model because the distribution of the sub-population c ir is elliptically symmetric.

The missing data problem In the training data, we have the class labels, but the sub-class labels are not available. If we had known the sub-class labels, we could once again use the backfitting algorithm to estimate the sub-class posteriors. Sub-class labels can be treated as missing observations. We can use an EM-type algorithm.

SPARC: The algorithm Initial E-step: Sub-class labels are estimated by appropriate cluster analysis of the training data in each class. Initial and later M-steps: Once the sub-class labels are obtained, sub-class posteriors can be estimated by fitting a nonparametric multinomial additive logistic regression model using the backfitting algorithm. Later E-steps: The sub-class labels are estimated by sub-class posterior probabilities. Iterations are carried out until posterior estimates stabilize. An observation is classified into the class having the largest posterior probability.
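A simplified Python sketch of this iteration is given below, reusing md_features and fit_additive_logistic from the previous sketch. Hard sub-class assignments, k-means initialization and a fixed number of sub-classes per class are my simplifying assumptions; the actual SPARC algorithm uses the backfitting-based nonparametric fit and posterior-based E-steps described above.

```python
import numpy as np
from sklearn.cluster import KMeans

def sparc_sketch(X, y, n_sub=2, n_iter=10):
    classes = np.unique(y)
    # Initial E-step: cluster each class separately to get provisional sub-class labels.
    sub = np.empty(len(y), dtype=int)
    for i, c in enumerate(classes):
        idx = np.where(y == c)[0]
        km = KMeans(n_clusters=n_sub, n_init=10).fit(X[idx])
        sub[idx] = i * n_sub + km.labels_            # global sub-class index
    for _ in range(n_iter):
        # M-step: fit the additive logistic model with sub-class labels as the response.
        params, model = fit_additive_logistic(X, sub)
        # E-step: re-assign each point to its most probable sub-class, restricted to the
        # sub-classes of its own (known) class.  This assumes every sub-class stays
        # non-empty, so that column k of predict_proba corresponds to sub-class k.
        proba = model.predict_proba(md_features(X, params))
        new_sub = sub.copy()
        for i, c in enumerate(classes):
            idx = np.where(y == c)[0]
            own = np.arange(i * n_sub, (i + 1) * n_sub)
            new_sub[idx] = own[np.argmax(proba[np.ix_(idx, own)], axis=1)]
        if np.array_equal(new_sub, sub):             # labels have stabilized
            break
        sub = new_sub
    return params, model, classes, n_sub

def sparc_classify(X, params, model, classes, n_sub):
    # Class posterior = sum of its sub-class posteriors; pick the class with the largest.
    proba = model.predict_proba(md_features(X, params))
    class_proba = proba.reshape(len(X), len(classes), n_sub).sum(axis=2)
    return classes[np.argmax(class_proba, axis=1)]
```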

A pool of different classifiers Traditional classifiers like LDA and QDA. Nonparametric classifiers based on k-NN and KDE. SVM with the linear kernel and with the radial basis function kernel. Classifiers based on adaptive partitioning of the covariate space: CART, RF and Poly-MARS.

A pool of different classifiers (Contd.) Hastie and Tibshirani (1996, JRSS-B) proposed an extension of LDA by modelling the density function of each class by a finite mixture of normal densities. They called their method mixture discriminant analysis (MDA). Fraley and Raftery (2002, JASA) extended MDA to MclustDA, where they considered mixtures of other families of parametric densities.

Simulated datasets Examples (a) and (b). Example (c): The first class is an equal mixture of N_d(0, 0.25 I_d) and N_d(0, I_d), and the second class is an equal mixture of N_d(−1, 0.25 I_d) and N_d(1, 0.25 I_d). Example (d): One class distribution is an equal mixture of U_d(0, Σ, 0, 1) and U_d(2, Σ, 2, 3), and the other one is an equal mixture of U_d(1, Σ, 1, 2) and U_d(3, Σ, 3, 4).
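For reference, Example (c) can be simulated directly with numpy. In this sketch I read −1 and 1 as the constant d-vectors with all entries −1 and 1; the dimension, sample sizes and seed are placeholders of my own.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 5, 200   # dimension and per-class sample size (placeholders)

# Class 1: equal mixture of N_d(0, 0.25 I_d) and N_d(0, I_d).
sd1 = np.where(rng.uniform(size=n) < 0.5, 0.5, 1.0)        # component sd: 0.5 or 1
X1 = sd1[:, None] * rng.standard_normal((n, d))

# Class 2: equal mixture of N_d(-1, 0.25 I_d) and N_d(1, 0.25 I_d).
centers = np.where(rng.uniform(size=n) < 0.5, -1.0, 1.0)   # component mean: -1 or +1
X2 = centers[:, None] * np.ones(d) + 0.5 * rng.standard_normal((n, d))
```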

Analysis of simulated data Figure: Boxplots of misclassification rates of the Bayes classifier, LDA, QDA, SVM (linear and radial kernels), k-NN, KDE, CART, RF, Poly-MARS, MDA, MclustDA and SPARC in (a) Example (a) and (b) Example (b). The center of a box is the mean, and its width is 2/3 times the standard deviation.

Analysis of simulated data Figure: Boxplots of misclassification rates of the Bayes classifier, LDA, QDA, SVM (linear and radial kernels), k-NN, KDE, CART, RF, Poly-MARS, MDA, MclustDA and SPARC in (a) Example (c) and (b) Example (d). The center of a box is the mean, and its width is 2/3 times the standard deviation.

Real data Hemophilia data: Available in the R package rrcov. There are two classes of carrier and non-carrier women. The variables are measurements on AHF activity and AHF antigen. Biomedical data: Available in the CMU data archive. There are two classes of carriers and non-carriers of a rare genetic disease. Variables are four measurements on blood samples of individuals.

Analysis of real benchmark data sets Figure: Boxplots of misclassification rates of LDA, QDA, SVM (linear and radial kernels), k-NN, KDE, CART, RF, Poly-MARS, MDA, MclustDA and SPARC for (a) the hemophilia data and (b) the biomedical data. The center of a box is the mean, and its width is 2/3 times the standard deviation.

Real data (Contd.) Diabetes data: Available in the UCI data archive. There are three classes that consist of normal individuals, chemical diabetic and overt diabetic patients. The five variables are measurements related to weights of individuals and their insulin and glucose levels in blood. Vehicle data: Available in the UCI data archive. There are four types of vehicles and eighteen measurements related to the shape of each vehicle.

Analysis of real benchmark data sets (Contd.) Figure: Boxplots of misclassification rates of LDA, QDA, SVM (linear and radial kernels), k-NN, KDE, CART, RF, Poly-MARS, MDA, MclustDA and SPARC for (a) the diabetes data and (b) the vehicle data. The center of a box is the mean, and its width is 2/3 times the standard deviation.

Looking back at the Iris data Figure: Boxplot of misclassification rates of LDA, QDA, SVM (linear and radial kernels), k-NN, KDE, CART, RF, Poly-MARS, MDA, MclustDA and SPARC for the Iris data. The center of the box is the mean, and its width is 2/3 times the standard deviation.