Classification
Volker Blobel, University of Hamburg, March 2005

Given objects (e.g. particle tracks) which have certain features (e.g. momentum p, specific energy loss dE/dx) and which belong to one of certain classes (e.g. particle types). Classification = assignment of objects to classes.

1. Strategies for classification
2. Minimizing the probability of misclassification
3. Risk minimization

Reference: Christopher M. Bishop, Neural Networks for Pattern Recognition, Clarendon Press, Oxford.

Strategies for classification

In statistical classification problems an object belongs to one of several classes C_k. Each object has to be assigned to one of the classes on the basis of certain features of the object. Example from particle physics analysis: identification of the particle type in a high-energy particle reaction; here the classes are the particle types pion, kaon, proton, ...

Classification strategies can be based on minimization of the probability of misclassification, or on risk minimization; the latter is appropriate for cases with rare classes, where rejecting the correct class has to be avoided.

Probabilities and classification

Formally, a-priori probabilities P(C_k) of an object belonging to each of the classes C_k are introduced. The probability P(C_k) corresponds to the fraction of objects in each class in the limit of an infinite number of observations. If a classification is required, but nothing is known about the features of the objects, the best strategy is to assign the object to the class having the highest a-priori probability. This minimizes the probability of misclassification, although some classifications must be wrong. In the particle identification example one would classify each particle as a pion, the most common particle created in a high-energy reaction.

Information in terms of values x of certain features of each object is of statistical nature and has to be used to improve the classification. The distribution of values x follows a probability density function p(x). The probability of x lying in an interval [a, b] is

$$ P(x \in [a,b]) = \int_a^b p(x)\, dx . $$

If there are several feature variables x_1, x_2, ..., they are grouped into a vector x corresponding to a point in a multidimensional feature space. The probability of a multidimensional x lying in a region R of x-space is

$$ P(\mathbf{x} \in R) = \int_R p(\mathbf{x})\, d\mathbf{x} . $$

Class-conditional probability

The class-conditional probability density p(x|C_k) is the density of x given that the object belongs to class C_k. The unconditional density function p(x) is the density function for x, irrespective of the class, and is given by the sum

$$ p(\mathbf{x}) = \sum_k p(\mathbf{x}|C_k)\, P(C_k) . $$

If, for a given object, the value x of the feature vector is known, one has to combine this information with the a-priori probabilities. This is done by Bayes' theorem, which gives the a-posteriori probability for class k:

$$ P(C_k|\mathbf{x}) = \frac{p(\mathbf{x}|C_k)\, P(C_k)}{p(\mathbf{x})} . $$

The denominator p(x), the unconditional density function, ensures the correct normalization of the a-posteriori probabilities P(C_k|x):

$$ \sum_k P(C_k|\mathbf{x}) = 1 . $$
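As an illustration of Bayes' theorem (not part of the original lecture), the following Python sketch computes the a-posteriori probabilities P(C_k|x) for a single measured feature value; the Gaussian class-conditional densities and the priors are invented purely for illustration.

```python
from scipy.stats import norm

# Invented a-priori probabilities P(C_k) and class-conditional densities
# p(x|C_k); the means and widths are made up for this sketch.
priors = {"pion": 0.85, "kaon": 0.10, "proton": 0.05}
cond_pdf = {
    "pion":   norm(loc=1.0, scale=0.3),
    "kaon":   norm(loc=1.6, scale=0.3),
    "proton": norm(loc=2.2, scale=0.3),
}

def posteriors(x):
    """A-posteriori probabilities P(C_k | x) via Bayes' theorem."""
    joint = {k: cond_pdf[k].pdf(x) * priors[k] for k in priors}
    p_x = sum(joint.values())              # unconditional density p(x)
    return {k: joint[k] / p_x for k in joint}

print(posteriors(1.5))   # the posteriors sum to 1 by construction
```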

Notation

Lower-case letters are used for probability densities, and upper-case letters for probabilities.

  k-th class                                C_k
  point in feature space                    x (scalar), x (vector)
  probability density                       p(x)
  class-conditional probability density     p(x|C_k)
  a-priori probability for class k          P(C_k)
  a-posteriori probability for class k      P(C_k|x)
  discriminant function for class C_k       y_k(x)
  loss matrix L with elements               L_kj
  rejection threshold                       P_th

Minimizing the probability of misclassification

After the observation of a feature vector x, the a-posteriori probabilities P(C_k|x) give the probability of the pattern belonging to class C_k. The probability of misclassification is minimized by selecting class C_k if

$$ P(C_k|\mathbf{x}) > P(C_j|\mathbf{x}) \quad \text{for all } j \neq k . $$

For this comparison the unconditional density p(x) may be removed from the Bayes formula, and the criterion can be rewritten in the form

$$ p(\mathbf{x}|C_k)\, P(C_k) > p(\mathbf{x}|C_j)\, P(C_j) \quad \text{for all } j \neq k . $$

This is the fundamental decision formula, which minimizes the probability of misclassification. Each point of feature space is assigned to one of c classes: the feature space is divided into c decision regions R_1, R_2, ..., R_c, and a feature vector x falling in region R_k is assigned to class C_k. The decision region R_k for a particular class C_k may be contiguous or may be divided into disjoint regions. The boundaries between these regions are called decision boundaries (surfaces).
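A minimal sketch of this decision rule, assuming a helper like the posteriors() function of the previous sketch (any function returning P(C_k|x) would do): the pattern is assigned to the class with the largest a-posteriori probability, which is equivalent to comparing p(x|C_k) P(C_k).

```python
def classify(x, posterior_fn):
    """Assign x to the class C_k with the largest P(C_k | x)."""
    post = posterior_fn(x)
    return max(post, key=post.get)

# Example, using the posteriors() function defined above:
# classify(1.5, posteriors)  ->  the class for which p(x|C_k) P(C_k) is largest
```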

Decision probabilities

The probability of a correct decision is given by

$$ P(\text{correct}) = \sum_k P(\mathbf{x} \in R_k,\, C_k)
                     = \sum_k P(\mathbf{x} \in R_k \mid C_k)\, P(C_k)
                     = \sum_k \int_{R_k} p(\mathbf{x}|C_k)\, P(C_k)\, d\mathbf{x} . $$

This probability is maximized by choosing the regions {R_k} such that each x is assigned to the class for which the integrand is a maximum.

Error probabilities in the case of only two classes:

$$ P(\text{error}) = P(\mathbf{x} \in R_2,\, C_1) + P(\mathbf{x} \in R_1,\, C_2)
                   = P(\mathbf{x} \in R_2 \mid C_1)\, P(C_1) + P(\mathbf{x} \in R_1 \mid C_2)\, P(C_2)
                   = \int_{R_2} p(\mathbf{x}|C_1)\, P(C_1)\, d\mathbf{x} + \int_{R_1} p(\mathbf{x}|C_2)\, P(C_2)\, d\mathbf{x} . $$

P(x ∈ R_2, C_1) is the joint probability of x being assigned to class C_2 while the true class is C_1. If p(x|C_1) P(C_1) > p(x|C_2) P(C_2) for a given x, the regions R_1 and R_2 have to be chosen such that x is in region R_1, since this gives the smaller contribution to the error.
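The two-class error probability can be evaluated numerically for an assumed one-dimensional example. The sketch below (not from the original lecture) uses two invented Gaussian class-conditional densities and invented priors, finds the decision boundary where p(x|C_1) P(C_1) = p(x|C_2) P(C_2), and then integrates the two error terms; the integration range [-10, 10] simply truncates negligible tails.

```python
from scipy.stats import norm
from scipy.optimize import brentq
from scipy.integrate import quad

# Invented two-class example: p(x|C1), p(x|C2) and priors P(C1), P(C2).
p1, p2 = norm(0.0, 1.0), norm(2.0, 1.0)
P1, P2 = 0.7, 0.3

# Decision boundary x*: p(x|C1) P(C1) = p(x|C2) P(C2); here R1 = (-inf, x*).
g = lambda x: p1.pdf(x) * P1 - p2.pdf(x) * P2
x_star = brentq(g, -10.0, 10.0)

# P(error) = int_{R2} p(x|C1) P(C1) dx + int_{R1} p(x|C2) P(C2) dx
err1, _ = quad(lambda x: p1.pdf(x) * P1, x_star, 10.0)   # x in R2, true class C1
err2, _ = quad(lambda x: p2.pdf(x) * P2, -10.0, x_star)  # x in R1, true class C2
print(x_star, err1 + err2)
```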

Discriminant functions

So far the decision on class membership has been based solely on the relative sizes of the probabilities. The classification can be reformulated in terms of a set of discriminant functions y_1(x), y_2(x), ... A feature vector x is assigned to class C_k if

$$ y_k(\mathbf{x}) > y_j(\mathbf{x}) \quad \text{for all } j \neq k . $$

The decision rule for minimizing the probability of misclassification is obtained by choosing y_k(x) = P(C_k|x). Since the unconditional density p(x) does not affect the classification decision, an equivalent choice is

$$ y_k(\mathbf{x}) = p(\mathbf{x}|C_k)\, P(C_k) . $$

Furthermore, since only the relative magnitudes of the discriminant functions are important in determining the class, replacing y_k(x) by g(y_k(x)), where g(·) is any monotonic function, will not modify the classification. Taking logarithms:

$$ y_k(\mathbf{x}) = \ln p(\mathbf{x}|C_k) + \ln P(C_k) . $$

Decision boundaries are given by the regions where the discriminant functions are equal: if the regions R_k and R_j are contiguous, the decision boundary between them is given by y_k(x) = y_j(x).

Two-class discriminant functions

In this special case the two discriminant functions y_1(x) and y_2(x) can alternatively be combined into a single discriminant function y(x) = y_1(x) - y_2(x). The pattern x is assigned to class C_1 if y(x) > 0 and to class C_2 if y(x) < 0. Alternative forms of the single discriminant function are

$$ y(\mathbf{x}) = P(C_1|\mathbf{x}) - P(C_2|\mathbf{x})
   \qquad \text{and} \qquad
   y(\mathbf{x}) = \ln\frac{p(\mathbf{x}|C_1)}{p(\mathbf{x}|C_2)} + \ln\frac{P(C_1)}{P(C_2)} . $$
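For the same kind of invented two-Gaussian example as above, the log-ratio form of the single discriminant can be written directly; this is a sketch only, with made-up densities and priors, and y(x) is positive exactly where the minimum-misclassification rule assigns x to C_1.

```python
import math
from scipy.stats import norm

p1, p2 = norm(0.0, 1.0), norm(2.0, 1.0)   # invented p(x|C1), p(x|C2)
P1, P2 = 0.7, 0.3                          # invented priors P(C1), P(C2)

def y(x):
    """Two-class discriminant: assign C1 if y(x) > 0, C2 if y(x) < 0."""
    return (p1.logpdf(x) - p2.logpdf(x)) + math.log(P1 / P2)

print(y(1.0), y(2.0))   # positive on the C1 side, negative on the C2 side
```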

Risk minimization

In many applications the minimization of the probability of misclassification is not the most appropriate criterion, especially if there are rare processes for which rejection of the correct class has serious consequences. This may be taken into account in the following way. A loss matrix L is defined, whose elements L_kj have the meaning of a penalty associated with assigning a pattern to class C_j when it in fact belongs to class C_k. The expected (average) loss for patterns of class C_k is then

$$ R_k = \sum_j L_{kj} \int_{R_j} p(\mathbf{x}|C_k)\, d\mathbf{x} . $$

The overall expected loss, or risk, for patterns from all classes is

$$ R = \sum_k R_k\, P(C_k) = \sum_j \int_{R_j} \Big\{ \sum_k L_{kj}\, p(\mathbf{x}|C_k)\, P(C_k) \Big\}\, d\mathbf{x} . $$

The risk is minimized if the integrand is minimized at each point x, i.e. if the regions R_j are chosen such that x ∈ R_j when

$$ \sum_k L_{kj}\, p(\mathbf{x}|C_k)\, P(C_k) < \sum_k L_{ki}\, p(\mathbf{x}|C_k)\, P(C_k) \quad \text{for all } i \neq j . $$
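A sketch of the risk-minimizing rule under invented assumptions (the classes, priors, densities and the loss values are all illustrative): x is assigned to the class C_j that minimizes the sum over k of L_kj p(x|C_k) P(C_k).

```python
from scipy.stats import norm

classes = ["pion", "kaon", "proton"]                      # illustrative classes
priors  = [0.85, 0.10, 0.05]                              # invented P(C_k)
pdfs    = [norm(1.0, 0.3), norm(1.6, 0.3), norm(2.2, 0.3)]  # invented p(x|C_k)

# Loss matrix L[k][j]: penalty for assigning to C_j when the true class is C_k.
# Here: 0/1 loss, except that misclassifying a true proton costs 5.
L = [[0, 1, 1],
     [1, 0, 1],
     [5, 5, 0]]

def classify_min_risk(x):
    """Assign x to the class C_j minimizing sum_k L_kj p(x|C_k) P(C_k)."""
    n = len(classes)
    joint = [pdfs[k].pdf(x) * priors[k] for k in range(n)]
    risk  = [sum(L[k][j] * joint[k] for k in range(n)) for j in range(n)]
    return classes[risk.index(min(risk))]

print(classify_min_risk(1.9))   # the large proton loss shifts the decision boundary
```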

Loss matrix

This is a generalization of the decision rule for minimizing the probability of misclassification. The general formula reduces to that special case for the assignment

$$ L_{kj} = \begin{cases} 0 & \text{for } k = j \ \text{(correct class)}, \\ 1 & \text{for } k \neq j \ \text{(wrong class)}. \end{cases} $$

In general the elements of the loss matrix have to be estimated based on experience. If C_j is a rare but important class, which should have a small loss, then the elements L_kj for k ≠ j should be chosen small; otherwise the acceptance of class C_j is small because of the small value of the a-priori probability P(C_j) for rare classes. In any case the classification has to be studied carefully in order to check the values of the a-priori probabilities and of the loss matrix L.
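The reduction to the minimum-misclassification rule can be made explicit in one line (not spelled out on the original slide): with the 0/1 loss matrix,

$$ \sum_k L_{kj}\, p(\mathbf{x}|C_k)\, P(C_k) = \sum_{k \neq j} p(\mathbf{x}|C_k)\, P(C_k) = p(\mathbf{x}) - p(\mathbf{x}|C_j)\, P(C_j) , $$

so minimizing the expected loss over j is the same as maximizing p(x|C_j) P(C_j), which is exactly the decision rule that minimizes the probability of misclassification.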

Example: particle identification

[Figure: dE/dx curves for π (red), K± (green), p (blue) and electrons (yellow), with a-priori probabilities 100 : 10 : 5 : 5. Left panel: default loss matrix. Right panel: loss matrix elements chosen in favour of the electron, with element 0.2 instead of 1 for p, and 0.8 for π and K±.]

Rejection thresholds

Most classification errors are expected to occur in those regions of x-space where the largest of the a-posteriori probabilities is relatively low, due to a strong overlap between different classes. It may be better not to make a classification decision in those cases, and to reject the pattern instead. Rejection can be made on the basis of a threshold P_th in the range 0 to 1, according to

$$ \max_k P(C_k|\mathbf{x}) \;
   \begin{cases} \geq P_{\mathrm{th}}, & \text{then classify } \mathbf{x}, \\
                 <    P_{\mathrm{th}}, & \text{then reject } \mathbf{x}. \end{cases} $$
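A minimal sketch of the rejection option, reusing a posteriors()-style function as in the earlier sketch; the default threshold value 0.9 is arbitrary.

```python
def classify_with_rejection(x, posterior_fn, p_th=0.9):
    """Classify x only if the largest a-posteriori probability reaches P_th."""
    post = posterior_fn(x)
    best = max(post, key=post.get)
    return best if post[best] >= p_th else None   # None means "reject the pattern"

# Example, with the posteriors() function defined above:
# classify_with_rejection(1.3, posteriors)  ->  class label or None
```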

Normal distribution and discriminant functions

If the class-conditional density functions are normal distributions, the discriminant function has the form

$$ y_k(\mathbf{x}) = -\tfrac{1}{2} (\mathbf{x}-\boldsymbol{\mu}_k)^T V_k^{-1} (\mathbf{x}-\boldsymbol{\mu}_k) - \tfrac{1}{2} \ln |V_k| + \ln P(C_k) , $$

where constant terms have been dropped. Decision boundaries given by the conditions y_k(x) = y_j(x) are general quadratic functions in the feature space.

The discriminant functions are simplified if the covariance matrices of the various classes are equal (V_k ≡ V). Then the terms with the determinant |V| are identical for all classes and can be dropped. The quadratic contribution x^T V^{-1} x is class-independent too and can be dropped. Because x^T V^{-1} μ_k = μ_k^T V^{-1} x, the discriminant functions can be simplified to the form

$$ y_k(\mathbf{x}) = \mathbf{w}_k^T \mathbf{x} + w_{k0}
   \qquad \text{with} \qquad
   \mathbf{w}_k^T = \boldsymbol{\mu}_k^T V^{-1}, \qquad
   w_{k0} = -\tfrac{1}{2} \boldsymbol{\mu}_k^T V^{-1} \boldsymbol{\mu}_k + \ln P(C_k) . $$

The functions are linear in x; decision boundaries corresponding to y_k(x) = y_j(x) are then hyperplanes.
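A sketch of the linear discriminant for the equal-covariance case, with invented class means, an invented shared covariance matrix V and invented priors: it computes w_k = V^{-1} μ_k and w_k0 = -1/2 μ_k^T V^{-1} μ_k + ln P(C_k) and evaluates y_k(x) = w_k^T x + w_k0.

```python
import numpy as np

# Invented example: two classes in a 2-d feature space with a shared
# covariance matrix V, class means mu_k and a-priori probabilities P(C_k).
mus    = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
V      = np.array([[1.0, 0.3],
                   [0.3, 1.0]])
priors = [0.7, 0.3]

V_inv = np.linalg.inv(V)
w  = [V_inv @ mu for mu in mus]                       # w_k = V^{-1} mu_k
w0 = [-0.5 * mu @ V_inv @ mu + np.log(P)              # w_k0
      for mu, P in zip(mus, priors)]

def y(k, x):
    """Linear discriminant y_k(x) = w_k^T x + w_k0."""
    return w[k] @ np.asarray(x) + w0[k]

x = [1.0, 0.5]
print(0 if y(0, x) > y(1, x) else 1)   # assign x to the class with the larger y_k
```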

Classification
  Strategies for classification
    Probabilities and classification
    Class-conditional probability
    Notation
  Minimizing the probability of misclassification
    Decision probabilities
    Discriminant functions
    Two-class discriminant functions
  Risk minimization
    Loss matrix
    Example: particle identification
    Rejection thresholds
    Normal distribution and discriminant functions