Discrimination and Classification

Goals:

Discrimination: finding the features that separate known groups in a multivariate sample.

Classification: developing a rule to allocate a new object into one of a number of known groups.

A classification rule is based on the features that separate the groups, so the goals overlap.

Simplest Case: Two Groups

Both discrimination and classification depend on a multivariate observation X.

Basic setup: consider each group a population, π_1 and π_2. The density of X in π_g is f_g(x), g = 1, 2.

Alternatively: a single population with a group indicator G. The density of X given G = g is f_g(x), and P(G = 1) = p_1, P(G = 2) = p_2 = 1 - p_1. Here p_1 and p_2 are the prior probabilities of the groups.

Example: ownership of a riding mower.

[Scatterplot of Lot Size (14-22) against Income (40-100), with Owner and Non-owner households marked.]

The classification problem is to predict whether a homeowner owns a riding mower, given Income and Lot Size.

The rule must give an answer for any x, so it defines two regions R_1 and R_2. Classification errors:

P(2|1) = P(X \in R_2 | G = 1) = \int_{R_2} f_1(x) dx,
P(1|2) = P(X \in R_1 | G = 2) = \int_{R_1} f_2(x) dx.
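These error probabilities are just integrals of one group's density over the other group's region, so they can be approximated by simulation. A minimal Python sketch, using a made-up univariate density and a made-up region R_2 purely for illustration (nothing here comes from the mower data):

```python
import numpy as np

# Hypothetical setup: f_1 is a standard normal density and R_2 = (1, infinity).
rng = np.random.default_rng(0)
draws_from_group1 = rng.normal(loc=0.0, scale=1.0, size=100_000)

# P(2|1) = P(X in R_2 | G = 1), estimated by the fraction of draws landing in R_2.
p_2_given_1 = np.mean(draws_from_group1 > 1.0)
print(p_2_given_1)   # close to 1 - Phi(1), about 0.159
```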

Unconditionally:

P(correctly classified as π_1) = P(1|1) p_1
P(misclassified as π_1) = P(1|2) p_2
P(correctly classified as π_2) = P(2|2) p_2
P(misclassified as π_2) = P(2|1) p_1

The two types of classification error may have different costs:

                       Classify as π_1    Classify as π_2
True population π_1          0                c(2|1)
True population π_2        c(1|2)               0

Expected Cost of Misclassification:

ECM = c(2|1) P(2|1) p_1 + c(1|2) P(1|2) p_2.

The classification region that minimizes ECM is

R_1: f_1(x) / f_2(x) >= ( c(1|2) / c(2|1) ) ( p_2 / p_1 ).

Note: these regions depend on the ratios of: the densities; the costs of misclassification; the prior probabilities.
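The rule is just a thresholded density ratio. A small Python sketch of that allocation, with hypothetical univariate densities, costs, and priors standing in for f_1, f_2, c(1|2), c(2|1), p_1, p_2:

```python
from scipy.stats import norm

# Hypothetical ingredients, for illustration only.
f1 = norm(loc=0.0, scale=1.0).pdf    # density of pi_1
f2 = norm(loc=2.0, scale=1.0).pdf    # density of pi_2
c12, c21 = 1.0, 2.0                  # c(1|2), c(2|1)
p1, p2 = 0.6, 0.4                    # prior probabilities

def allocate(x):
    """Assign x to pi_1 iff f1(x)/f2(x) >= (c(1|2)/c(2|1)) * (p2/p1)."""
    return 1 if f1(x) / f2(x) >= (c12 / c21) * (p2 / p1) else 2

print(allocate(0.5), allocate(1.8))   # 1, then 2
```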

Special cases:

Total Probability of Misclassification:

TPM = P(misclassify either way) = P(2|1) p_1 + P(1|2) p_2,

which is ECM with c(1|2) = c(2|1) = 1.

Posterior mode: allocate to the population with the higher posterior probability; by Bayes's rule,

P(π_1 | x_0) = p_1 f_1(x_0) / ( p_1 f_1(x_0) + p_2 f_2(x_0) ),
P(π_2 | x_0) = p_2 f_2(x_0) / ( p_1 f_1(x_0) + p_2 f_2(x_0) ).

This gives the same rule as minimizing TPM.
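As a sketch of the posterior-mode rule, the posteriors can be computed directly from the priors and densities; the densities and priors below are hypothetical placeholders:

```python
from scipy.stats import norm

# Hypothetical densities and priors, for illustration only.
f1 = norm(loc=0.0, scale=1.0).pdf
f2 = norm(loc=2.0, scale=1.0).pdf
p1, p2 = 0.6, 0.4

def posteriors(x0):
    """P(pi_1 | x0) and P(pi_2 | x0) by Bayes's rule."""
    num1, num2 = p1 * f1(x0), p2 * f2(x0)
    return num1 / (num1 + num2), num2 / (num1 + num2)

# Allocating to the larger posterior (comparing P(pi_1 | x0) with 1/2)
# reproduces the minimum-TPM rule, since the costs are equal.
print(posteriors(1.2))
```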

Multivariate Normal Populations

If f_1(x) and f_2(x) are multivariate normal with the same Σ and means µ_1 and µ_2, then the minimum-ECM allocation rule is based on a linear function of x. The basic result is

ln[ f_1(x) / f_2(x) ] = (µ_1 - µ_2)' Σ^{-1} [ x - (1/2)(µ_1 + µ_2) ].

The optimal region is

R_1: (µ_1 - µ_2)' Σ^{-1} [ x - (1/2)(µ_1 + µ_2) ] >= ln[ ( c(1|2) / c(2|1) ) ( p_2 / p_1 ) ].
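A Python sketch of this linear rule with the parameters treated as known; the means, covariance, costs, and priors below are invented for illustration:

```python
import numpy as np

# Hypothetical parameters, for illustration only.
mu1, mu2 = np.array([1.0, 2.0]), np.array([3.0, 1.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
c12, c21, p1, p2 = 1.0, 1.0, 0.5, 0.5

a = np.linalg.solve(Sigma, mu1 - mu2)      # Sigma^{-1} (mu1 - mu2)
midpoint = 0.5 * (mu1 + mu2)
cutoff = np.log((c12 / c21) * (p2 / p1))   # 0 here: equal costs and priors

def allocate(x):
    """Assign x to pi_1 iff (mu1 - mu2)' Sigma^{-1} [x - (mu1 + mu2)/2] >= cutoff."""
    return 1 if a @ (x - midpoint) >= cutoff else 2

print(allocate(np.array([1.0, 2.5])), allocate(np.array([3.2, 0.8])))   # 1, then 2
```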

Typically Σ, µ_1, and µ_2 are unknown and are replaced with sample estimates based on training data.

Example: Hemophilia A carriers. Women subjects in two groups: non-carriers and obligatory carriers. Variables:

X_1 = log_10(AHF activity)
X_2 = log_10(AHF-like antigen)

SAS proc discrim program and output.
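The plug-in step is straightforward. A Python sketch (not the SAS proc discrim code the notes refer to); the hemophilia data are not reproduced here, so the arrays below are synthetic stand-ins:

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic stand-ins for the two training groups (n1 x p and n2 x p arrays).
X1 = rng.multivariate_normal([-0.30, -0.05], [[0.02, 0.01], [0.01, 0.02]], size=30)
X2 = rng.multivariate_normal([-0.10,  0.05], [[0.02, 0.01], [0.01, 0.02]], size=45)

xbar1, xbar2 = X1.mean(axis=0), X2.mean(axis=0)
n1, n2 = len(X1), len(X2)
# Pooled estimate of the common covariance Sigma.
S_pooled = ((n1 - 1) * np.cov(X1, rowvar=False) +
            (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)

a_hat = np.linalg.solve(S_pooled, xbar1 - xbar2)
midpoint = 0.5 * (xbar1 + xbar2)

def allocate(x, cutoff=0.0):   # cutoff = 0 corresponds to equal costs and priors
    return 1 if a_hat @ (x - midpoint) >= cutoff else 2
```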

Evaluating Classification Functions

The Optimum Error Rate is the minimum TPM. For multivariate normal populations,

OER = 1 - Φ(Δ/2),

where Δ^2 is the Generalized Squared Distance between the groups:

Δ^2 = (µ_1 - µ_2)' Σ^{-1} (µ_1 - µ_2).

For the hemophilia example, SAS reports Δ^2 = 4.57431, whence OER = 0.1424.
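The reported OER follows from Δ^2 by a one-line normal-probability calculation (this form of the formula corresponds to equal priors); a quick check in Python:

```python
from scipy.stats import norm

Delta2 = 4.57431                    # generalized squared distance reported by SAS
OER = 1 - norm.cdf(Delta2 ** 0.5 / 2)
print(OER)                          # approximately 0.1424
```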

The OER calculation is the estimated error rate for allocation using the true parameters: good in large samples, but dubious in small samples.

Sample-based evaluation:

Apparent Error Rate = fraction of the training data that are misclassified; reported by SAS as "Resubstitution Summary"; APER = 0.1389 for the example; it under-estimates the error rate, because the data used to develop the rule are then used to evaluate it.

Lachenbruch's holdout procedure: omit one observation at a time, recalculate the classification rule, and classify the omitted observation; reported by SAS as "Crossvalidation Summary", 0.1556 for the example.
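Both estimates are easy to compute once the fitting step is wrapped in a function. A Python sketch, assuming two training matrices X1 and X2 like the synthetic ones above; this mirrors what SAS reports but is not the SAS code itself:

```python
import numpy as np

def lda_rule(X1, X2):
    """Fit the sample linear rule (equal costs and priors); return x -> 1 or 2."""
    xbar1, xbar2 = X1.mean(axis=0), X2.mean(axis=0)
    n1, n2 = len(X1), len(X2)
    S = ((n1 - 1) * np.cov(X1, rowvar=False) +
         (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
    a = np.linalg.solve(S, xbar1 - xbar2)
    mid = 0.5 * (xbar1 + xbar2)
    return lambda x: 1 if a @ (x - mid) >= 0 else 2

def apparent_error_rate(X1, X2):
    """Resubstitution: classify the same data used to fit the rule."""
    rule = lda_rule(X1, X2)
    errors = sum(rule(x) != 1 for x in X1) + sum(rule(x) != 2 for x in X2)
    return errors / (len(X1) + len(X2))

def holdout_error_rate(X1, X2):
    """Lachenbruch's holdout: leave each observation out, refit, classify it."""
    errors = 0
    for i in range(len(X1)):
        errors += lda_rule(np.delete(X1, i, axis=0), X2)(X1[i]) != 1
    for i in range(len(X2)):
        errors += lda_rule(X1, np.delete(X2, i, axis=0))(X2[i]) != 2
    return errors / (len(X1) + len(X2))
```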

Quadratic Classification

If f_1(x) and f_2(x) are multivariate normal with different covariances Σ_1 and Σ_2 and means µ_1 and µ_2, then the minimum-ECM allocation rule is based on a quadratic function of x.

Classification regions are now bounded by conic sections (generally elliptical or hyperbolic). No simple explicit representation.
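A Python sketch of the quadratic rule, comparing log densities directly; the parameter values are hypothetical, and with unequal covariances the boundary where the two sides are equal is a conic:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical parameters with unequal covariances, for illustration only.
mu1, Sigma1 = np.array([1.0, 2.0]), np.array([[1.0, 0.2], [0.2, 1.0]])
mu2, Sigma2 = np.array([3.0, 1.0]), np.array([[2.0, -0.4], [-0.4, 0.5]])
c12, c21, p1, p2 = 1.0, 1.0, 0.5, 0.5

def allocate(x):
    """Assign x to pi_1 iff ln f1(x) - ln f2(x) >= ln[(c(1|2)/c(2|1)) (p2/p1)];
    the left-hand side is now quadratic in x."""
    lhs = (multivariate_normal.logpdf(x, mean=mu1, cov=Sigma1) -
           multivariate_normal.logpdf(x, mean=mu2, cov=Sigma2))
    return 1 if lhs >= np.log((c12 / c21) * (p2 / p1)) else 2

print(allocate([1.0, 2.0]), allocate([3.5, 0.8]))   # 1, then 2
```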

In proc discrim, use pool = no on the proc statement. For the AHF example, the resubstitution Total error rate is 0.1389 and the cross-validated Total error rate is 0.1722.

The cross-validated Total error rate is higher than for linear classification, suggesting that the estimation cost of the additional parameters in the two Σ's outweighs their value in improving the fit.

Nonparametric Classification

One alternative to the multivariate normal assumption is classification based on nonparametric density estimates of f_g(x). In SAS, use method = npar on the proc discrim statement. You have the choice of nearest-neighbor and kernel estimation:

for nearest-neighbor estimation, use k = ... to specify the number of neighbors;

for kernel estimation, use kernel = ... to name a kernel, and r = ... to specify the radius (bandwidth).
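As an illustration of the nearest-neighbor idea (not the SAS implementation, which has its own distance and prior-weighting conventions), a minimal Python sketch that classifies by majority vote among the k closest training points:

```python
import numpy as np

def knn_classify(x, X1, X2, k=5):
    """Classify x by majority vote among its k nearest training observations,
    using plain Euclidean distance and implicitly equal priors."""
    X = np.vstack([X1, X2])
    labels = np.array([1] * len(X1) + [2] * len(X2))
    dists = np.linalg.norm(X - np.asarray(x), axis=1)
    nearest = labels[np.argsort(dists)[:k]]
    return 1 if np.sum(nearest == 1) >= np.sum(nearest == 2) else 2
```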