ENG 8801 / Special Topics in Computer Engineering: Pattern Recognition. Memorial University of Newfoundland.


Memorial University of Newfoundland, Pattern Recognition, Lecture 6, May 18, 2006
http://www.engr.mun.ca/~charlesr
Office Hours: Tuesdays & Thursdays 8:30-9:30 PM, EN-3026

Review: Distance-based Classification
Use training samples to find defining features of known classes.
Determine a suitable prototype for each class.
Determine a suitable distance measure (metric).
Use the prototype(s) and metric to classify new unknown patterns to one of the classes.

Review: Minimum Euclidean Distance (MED)
Compare the squared Euclidean distances to the class means:
(x - m_1)^T (x - m_1)  vs  (x - m_2)^T (x - m_2)
and assign x to C_1 if the first is smaller, otherwise to C_2.
[Worked example with specific values of x, m_1, m_2 omitted from the transcription.]

Review: Minimum Intra-Class Distance (MICD), a.k.a. Mahalanobis Distance
Compare the covariance-weighted distances to the class means:
(x - m_1)^T S_1^{-1} (x - m_1)  vs  (x - m_2)^T S_2^{-1} (x - m_2)
and assign x to C_1 if the first is smaller, otherwise to C_2.
[Worked example with specific values of x, m_1, m_2, S_1, S_2 and the contour figure omitted from the transcription.]
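To make the two review rules concrete, here is a minimal NumPy sketch; the means, covariances, and test point are invented values, since the slide's worked example (the specific x, m_1, m_2, S_1, S_2) did not survive transcription:

```python
import numpy as np

# Hypothetical class statistics (the slide's actual numbers were lost in transcription).
m1, m2 = np.array([0.0, 0.0]), np.array([3.0, 3.0])
S1 = np.array([[1.0, 0.0], [0.0, 1.0]])
S2 = np.array([[4.0, 1.0], [1.0, 2.0]])
x = np.array([2.0, 1.0])

def med_label(x, m1, m2):
    """Minimum Euclidean Distance: compare (x - m_k)^T (x - m_k)."""
    d1 = (x - m1) @ (x - m1)
    d2 = (x - m2) @ (x - m2)
    return 1 if d1 < d2 else 2

def micd_label(x, m1, S1, m2, S2):
    """Minimum Intra-Class Distance (Mahalanobis): compare (x - m_k)^T S_k^{-1} (x - m_k)."""
    d1 = (x - m1) @ np.linalg.solve(S1, x - m1)
    d2 = (x - m2) @ np.linalg.solve(S2, x - m2)
    return 1 if d1 < d2 else 2

print("MED label: ", med_label(x, m1, m2))
print("MICD label:", micd_label(x, m1, S1, m2, S2))
```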

Review: Unit Standard Deviation Contour
A way to visualize the distribution: calculate the eigenvectors and eigenvalues of the covariance matrix S. The eigenvectors give the directions of the contour's axes in the (x_1, x_2) plane, and the square roots of the eigenvalues give the axis lengths.

Review: Decision surface for 2 classes in multidimensional feature space
The boundary is where the two class distances are equal:
(x - m_1)^T S_1^{-1} (x - m_1) - (x - m_2)^T S_2^{-1} (x - m_2) = 0,
which expands to the hyperquadratic form
x^T Q x + w^T x + w_0 = 0.
If S_1 = S_2, then Q = 0 and the decision surface becomes a hyperplane.
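A short sketch, under invented parameters, of how the unit standard deviation contour and the hyperquadratic coefficients Q, w, w_0 can be obtained; the specific matrices and means are assumptions for illustration only:

```python
import numpy as np

# Invented covariance matrices and means for illustration.
S1 = np.array([[2.0, 0.8], [0.8, 1.0]])
S2 = np.array([[1.5, -0.3], [-0.3, 0.5]])
m1, m2 = np.array([0.0, 0.0]), np.array([3.0, 1.0])

# Unit standard deviation contour of class 1: eigenvectors give the axis directions,
# sqrt(eigenvalues) give the axis half-lengths of the ellipse centred on the mean.
evals, evecs = np.linalg.eigh(S1)
theta = np.linspace(0, 2 * np.pi, 200)
circle = np.stack([np.cos(theta), np.sin(theta)])                   # unit circle
contour = m1[:, None] + evecs @ (np.sqrt(evals)[:, None] * circle)  # points on the contour
print("contour axis half-lengths:", np.sqrt(evals))

# Hyperquadratic coefficients of the boundary x^T Q x + w^T x + w_0 = 0, obtained by
# expanding (x - m1)^T S1^{-1} (x - m1) - (x - m2)^T S2^{-1} (x - m2).
A1, A2 = np.linalg.inv(S1), np.linalg.inv(S2)
Q = A1 - A2
w = -2 * (A1 @ m1 - A2 @ m2)
w0 = m1 @ A1 @ m1 - m2 @ A2 @ m2
print("Q =\n", Q, "\nw =", w, "\nw0 =", w0)   # Q -> 0 when S1 == S2 (hyperplane case)
```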

Bayesian Classification
Recall the problem with the MICD: [figure of the two class densities p(x) along x omitted] the MICD always favours C_2 over C_1, even close to their common mean, because the MICD scales the distributions so that they are the same.

Ideally, we want to favour the class with the highest probability for the given pattern, i.e. choose the class C_i that maximizes P(C_i | x), where P(C_i | x) is the a posteriori (after measurement) probability of class C_i given x.

Maximum A Posteriori (MAP) Classifier
To get P(C_i | x), we need p(x | C_i) and Bayes' theorem:
P(C_i | x) = p(x | C_i) P(C_i) / p(x),   where   p(x) = \sum_j p(x | C_j) P(C_j).
Here p(x | C_i) is the class-conditional probability density (p.d.f.), which needs to be estimated from the available samples or otherwise assumed, and P(C_i) is the a priori (before measurement) probability of class C_i.

This gives the Bayes (or MAP) classifier:
choose C_i if P(C_i | x) > P(C_j | x), otherwise C_j.
Since p(x) is common to both sides, this amounts to comparing p(x | C_i) P(C_i) with p(x | C_j) P(C_j): assign the pattern x to the class with the maximum weighted p.d.f.
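As a rough illustration of the MAP rule, a minimal sketch with two univariate Gaussian classes; all means, variances, and priors are made-up values:

```python
import numpy as np
from scipy.stats import norm

# Assumed class-conditional Gaussians and priors (illustrative values only).
mu = {1: 0.0, 2: 2.0}
sigma = {1: 1.0, 2: 0.5}
prior = {1: 0.3, 2: 0.7}

def map_label(x):
    """Assign x to the class with the maximum weighted p.d.f. p(x | C_i) P(C_i)."""
    scores = {c: norm.pdf(x, mu[c], sigma[c]) * prior[c] for c in (1, 2)}
    return max(scores, key=scores.get)

for x in (-1.0, 0.8, 1.5, 3.0):
    print(f"x = {x:4.1f} -> C{map_label(x)}")
```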

The decision boundary θ is where
p(x | C_i) P(C_i) = p(x | C_j) P(C_j).
In 1-D (i.e. one feature) there are usually 2 boundaries. In n-D space, the decision surface is usually complicated. The actual surfaces depend on the p.d.f.'s. If we restrict ourselves to Gaussian distributions, we can compare MAP to the other classifiers we've seen.

Consider the univariate Gaussian:
p(x | C_i) = (1 / (sqrt(2π) σ_i)) exp( -(x - μ_i)^2 / (2 σ_i^2) ).
It is convenient to use the log-likelihood form of the MAP rule
p(x | C_i) P(C_i)  vs  p(x | C_j) P(C_j)   (choose C_i if the left side is greater, otherwise C_j),
since the p.d.f.'s involve exponentials. For the normal distribution this becomes:
(x - μ_j)^2 / σ_j^2 - (x - μ_i)^2 / σ_i^2   vs   2 log( P(C_j) σ_i / (P(C_i) σ_j) )
(again choose C_i if the left side is greater, otherwise C_j).

Notes:
Quadratic in x.
Measures distances from the class means in standard-deviation units.
When P(C_j) > P(C_i), the RHS > 0 and so C_j is preferred.
When σ_i > σ_j, the RHS > 0 and so C_j is preferred.
So MAP is biased towards favouring the most likely class and the most compact class.
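A quick numerical check, with made-up parameters, that the log-likelihood form above gives the same decisions as comparing the weighted densities directly:

```python
import numpy as np
from scipy.stats import norm

# Made-up parameters for two univariate Gaussian classes.
mu_i, sigma_i, P_i = 0.0, 1.0, 0.4
mu_j, sigma_j, P_j = 2.0, 0.5, 0.6

def map_direct(x):
    """MAP: choose C_i if p(x|C_i) P(C_i) > p(x|C_j) P(C_j)."""
    return "Ci" if norm.pdf(x, mu_i, sigma_i) * P_i > norm.pdf(x, mu_j, sigma_j) * P_j else "Cj"

def map_loglik(x):
    """Log-likelihood form: choose C_i if
    (x-mu_j)^2/sigma_j^2 - (x-mu_i)^2/sigma_i^2 > 2 log(P_j sigma_i / (P_i sigma_j))."""
    lhs = (x - mu_j) ** 2 / sigma_j ** 2 - (x - mu_i) ** 2 / sigma_i ** 2
    rhs = 2 * np.log(P_j * sigma_i / (P_i * sigma_j))
    return "Ci" if lhs > rhs else "Cj"

for x in np.linspace(-2, 4, 13):
    assert map_direct(x) == map_loglik(x)   # the two forms always agree
print("log-likelihood form matches the direct MAP comparison")
```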

To find the decision boundary, set both sides equal and solve for x:
(1/σ_j^2 - 1/σ_i^2) x^2 - 2 (μ_j/σ_j^2 - μ_i/σ_i^2) x + (μ_j^2/σ_j^2 - μ_i^2/σ_i^2) - 2 log( P(C_j) σ_i / (P(C_i) σ_j) ) = 0.

If σ_i = σ_j = σ, then there is a single boundary:
2 (μ_i - μ_j)/σ^2 x + (μ_j^2 - μ_i^2)/σ^2 - 2 log( P(C_j)/P(C_i) ) = 0.
Solving for x gives the value of θ:
θ = (μ_i + μ_j)/2 - ( σ^2 / (μ_j - μ_i) ) log( P(C_j)/P(C_i) ).

Assuming μ_j > μ_i, then for P(C_j) > P(C_i) the second term is negative, so θ is pulled toward μ_i. [Figure: the case P(C_j) = 2 P(C_i), with the boundary between the C_i and C_j regions shifted toward the C_i mean.] Compared with the equal-prior boundary at the midpoint of the means, the MAP decision boundary is thus biased towards the least likely class.
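A small sketch that solves the quadratic boundary equation numerically for θ_1 and θ_2; the parameter values are invented for illustration:

```python
import numpy as np

# Illustrative parameters (not from the lecture): unequal variances give two boundaries.
mu_i, sigma_i, P_i = 0.0, 1.0, 0.5
mu_j, sigma_j, P_j = 2.0, 0.5, 0.5

# Coefficients of the quadratic boundary equation a x^2 + b x + c = 0 derived above.
a = 1 / sigma_j**2 - 1 / sigma_i**2
b = -2 * (mu_j / sigma_j**2 - mu_i / sigma_i**2)
c = (mu_j**2 / sigma_j**2 - mu_i**2 / sigma_i**2) - 2 * np.log(P_j * sigma_i / (P_i * sigma_j))

thetas = np.roots([a, b, c])
print("decision boundaries theta_1, theta_2:", np.sort(thetas))
```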

In general, σ_i ≠ σ_j, and the MAP decision boundary equation has two solutions θ_1 and θ_2, which are functions of μ_i, μ_j, σ_i, σ_j, P(C_i), and P(C_j). [Figure: the two boundaries separating the C_i and C_j regions omitted.]

Compare the MAP classifier with the MICD classifier (1-D case):
MAP:  choose the class minimizing (x - μ_k)^2 / σ_k^2 + 2 log σ_k - 2 log P(C_k)
MICD: choose the class minimizing (x - μ_k)^2 / σ_k^2
So the MAP classifier is biased towards the most likely class and the most compact class.

Now for the multidimensional normal case (n features):
p(x | C_i) = 1 / ( (2π)^{n/2} |Σ_i|^{1/2} ) exp( -(1/2) (x - μ_i)^T Σ_i^{-1} (x - μ_i) ),   i = 1, 2, ..., N.
Take the log-likelihood form (as in the 1-D case) to get the MAP classifier. The decision surface is a hyperquadratic, but shifted from the MICD surface by:
an a priori term, which favours the class that is more likely, and
a term that favours the class whose covariance matrix has the smallest determinant (equivalent to the smallest volume of the unit standard deviation contour).

Note: if we don't know the a priori probabilities (or choose to ignore them), then we can drop the term 2 log( P(C_j)/P(C_i) ). The result is called the Maximum Likelihood (ML) classifier.
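A sketch of the multivariate MAP discriminant in log form, with invented class parameters; setting the priors equal reduces it to the ML classifier described in the note above:

```python
import numpy as np

# Invented parameters for two 2-D Gaussian classes (for illustration only).
params = {
    1: dict(mu=np.array([0.0, 0.0]), Sigma=np.array([[1.0, 0.2], [0.2, 1.5]]), P=0.7),
    2: dict(mu=np.array([2.0, 1.0]), Sigma=np.array([[0.5, 0.0], [0.0, 0.5]]), P=0.3),
}

def map_discriminant(x, mu, Sigma, P):
    """-2 log[ p(x|C) P(C) ] up to an additive constant: Mahalanobis distance
    + log|Sigma| (compactness term) - 2 log P(C) (a priori term)."""
    d = x - mu
    maha = d @ np.linalg.solve(Sigma, d)
    return maha + np.log(np.linalg.det(Sigma)) - 2 * np.log(P)

def classify(x, use_priors=True):
    scores = {}
    for c, p in params.items():
        P = p["P"] if use_priors else 0.5        # equal priors -> ML classifier
        scores[c] = map_discriminant(x, p["mu"], p["Sigma"], P)
    return min(scores, key=scores.get)           # smallest value wins

x = np.array([1.0, 0.6])
print("MAP:", classify(x), " ML:", classify(x, use_priors=False))
```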

Summary
For normal distributions with maximum likelihood estimates of the mean and covariance matrix, the classifiers are as given above. [Summary table of the classifier discriminants omitted from the transcription.]

Loss Function
Often decisions based on classification have associated losses, and the losses are not necessarily constant for each decision. For example, misclassifying a pattern as class 1 when it is really class 2 might be worse than the reverse case. This can occur when one class is very rare but costly if missed (such as credit card fraud). Under the MAP rule
choose C_1 if p(x | C_1) P(C_1) > p(x | C_2) P(C_2), otherwise C_2,
if P(C_1) >> P(C_2), then a low error rate can be attained simply by always classifying as C_1.

One option is to improve the selection of features such that p(x | C_2) >> p(x | C_1) for some values of x. Although this is ideal, especially if perfect separation can be attained, in practice it might not be feasible.

Another option is to assign a loss to misclassification. Let {α_1, ..., α_a} be the finite set of actions based on classifications. We can think of these actions as assigning a pattern to a class, but that doesn't have to be the case; an action could be something like "sound alarm".

Conditional Risk
The conditional risk of action α_i given pattern x is
R(α_i | x) = \sum_j λ(α_i | C_j) P(C_j | x),
with the shorthand notation λ_ij ≡ λ(α_i | C_j), the cost of action i given class j. To minimize the overall risk, choose the action with the lowest conditional risk for the pattern:
choose α_i over α_j if R(α_i | x) < R(α_j | x).
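A small sketch of the minimum-conditional-risk rule; the loss matrix and the posteriors P(C_j | x) are assumed values for illustration:

```python
import numpy as np

# Assumed loss matrix: lam[i, j] = cost of taking action i+1 when the true class is j+1.
lam = np.array([[0.0, 1.0],
                [5.0, 0.0]])

def min_risk_action(posteriors):
    """Choose the action minimizing R(alpha_i | x) = sum_j lam_ij P(C_j | x)."""
    risks = lam @ posteriors          # vector of conditional risks, one per action
    return int(np.argmin(risks)) + 1, risks

post = np.array([0.3, 0.7])           # assumed P(C_1 | x), P(C_2 | x)
action, risks = min_risk_action(post)
print("conditional risks:", risks, "-> take action", action)
```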

Loss for Two Classes (i)
Assume λ_11 = λ_22 = 0 (no loss for correct choice) and λ_12 = λ_21 = 1 (same loss for incorrect choice). The rule is
choose α_1 if R(α_1 | x) < R(α_2 | x), otherwise α_2.
With this 0-1 loss, R(α_1 | x) = P(C_2 | x) and R(α_2 | x) = P(C_1 | x), so minimizing the risk is the same as choosing the class with the maximum a posteriori probability.

Loss for Two Classes (ii)
Assume λ_11 = λ_22 = 0 (no loss for correct choice) and λ_12 ≠ λ_21 (different losses for incorrect choices). The rule becomes
choose α_1 if λ_11 P(C_1 | x) + λ_12 P(C_2 | x) < λ_21 P(C_1 | x) + λ_22 P(C_2 | x), otherwise α_2.
Using Bayes' theorem, this can be written as
choose α_1 if λ_21 p(x | C_1) P(C_1) > λ_12 p(x | C_2) P(C_2), otherwise α_2.
So if λ_12 > λ_21, the loss for choosing action 1 when the real class is C_2 is greater, and therefore we are more likely to choose action 2.
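To see the effect of unequal losses on a two-class problem, a short sketch that compares the decision boundary under 0-1 loss and under λ_12 ≠ λ_21; the Gaussian parameters and the losses are made-up:

```python
import numpy as np
from scipy.stats import norm

# Made-up two-class problem: univariate Gaussians with equal priors.
mu, sigma, prior = (0.0, 2.0), (1.0, 1.0), (0.5, 0.5)

def min_risk_label(x, lam12, lam21):
    """Choose action 1 (decide C1) if lam21*P(C1|x) > lam12*P(C2|x), else action 2.
    Assumes lam11 = lam22 = 0; posteriors come from Bayes' theorem (common p(x) cancels)."""
    g1 = norm.pdf(x, mu[0], sigma[0]) * prior[0]     # proportional to P(C1|x)
    g2 = norm.pdf(x, mu[1], sigma[1]) * prior[1]     # proportional to P(C2|x)
    return 1 if lam21 * g1 > lam12 * g2 else 2

xs = np.linspace(-2, 4, 61)
b_equal = xs[[min_risk_label(x, 1, 1) for x in xs].index(2)]     # first x decided as C2
b_skewed = xs[[min_risk_label(x, 5, 1) for x in xs].index(2)]    # lam12 = 5 * lam21
print("boundary with equal losses        : x ~", round(b_equal, 2))
print("boundary with lam12 > lam21 (5:1) : x ~", round(b_skewed, 2))
```

With the larger λ_12, the region assigned to action 1 shrinks, matching the statement above that we become more likely to choose action 2.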

Example
Credit card fraud is a big problem. A fraudster will obtain the credit card information of a victim and proceed to use the card to purchase as many items as possible (usually items that are easy to sell). Credit card companies use real-time analysis of transactions to attempt to separate fraudulent activity from legitimate activity. The sooner fraud can be detected, the lower the cost to the credit card company, which has to assume almost all of the losses. The analysis uses features of the transactions, including location, type, speed, and physical velocity. However, the features are not sufficient to perfectly classify a transaction as fraud or legitimate, and there is always the risk of inconveniencing customers.

Example
Assume that the amount of fraudulent activity is about 1% of the total credit card activity:
C_1 = fraud, P(C_1) = 0.01
C_2 = no fraud, P(C_2) = 0.99
If losses are equal for misclassification, then the rule is
choose C_1 if p(x | C_1)(0.01) > p(x | C_2)(0.99), i.e. if p(x | C_1) > 99 p(x | C_2), otherwise C_2.

Example
[Figure 2.1 from Pattern Classification by Duda, Hart, and Stork: the class-conditional densities p(x | ω_1) and p(x | ω_2) plotted over x.]

Example
However, the losses are probably not the same. Classifying a fraudulent transaction as legitimate leads to direct dollar losses as well as intangible losses (e.g. reputation, hassles for consumers). Classifying a legitimate transaction as fraudulent inconveniences consumers, as their purchases are denied; this could lead to loss of future business. Let's assume that the ratio of loss for "not fraud" to "fraud" is 1 to 50.

Example
With the loss terms included, the rule becomes
choose α_1 if p(x | C_1) > 99 (λ_12 / λ_21) p(x | C_2), otherwise α_2,
and with λ_12 / λ_21 = 1/50:
choose α_1 if p(x | C_1) > 99 (1/50) p(x | C_2) ≈ 1.98 p(x | C_2), otherwise α_2.
By including the loss function, the decision boundaries change significantly.

Example
[Figure 2.1 from Pattern Classification by Duda, Hart, and Stork, repeated: the decision boundary shifts once the unequal losses are taken into account.]
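To make the shift concrete, a small sketch that locates the decision threshold with and without the 1:50 loss ratio; the class-conditional densities here are assumed Gaussians chosen only for illustration (the lecture refers to Figure 2.1 of Duda, Hart, and Stork instead):

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

# Assumed class-conditional densities (illustrative only; not the lecture's figure).
p_fraud = lambda x: norm.pdf(x, loc=14.0, scale=1.0)    # p(x | C1), C1 = fraud
p_legit = lambda x: norm.pdf(x, loc=11.0, scale=1.0)    # p(x | C2), C2 = no fraud
P1, P2 = 0.01, 0.99

def boundary(loss_ratio):
    """Solve p(x|C1) = (P2/P1) * loss_ratio * p(x|C2) for x,
    where loss_ratio = lambda_12 / lambda_21."""
    k = (P2 / P1) * loss_ratio
    f = lambda x: p_fraud(x) - k * p_legit(x)
    return brentq(f, 5.0, 20.0)

print("equal losses      : flag fraud for x >", round(boundary(1.0), 3))
print("loss ratio 1 : 50 : flag fraud for x >", round(boundary(1.0 / 50.0), 3))
```

With these assumed densities the threshold moves noticeably toward the legitimate class once the 1:50 loss ratio is applied, i.e. more transactions are flagged as fraud.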