SYDE 372 Introduction to Pattern Recognition. Probability Measures for Classification: Part I. Alexander Wong, Department of Systems Design Engineering, University of Waterloo.

Outline 1 Why use probability measures for classification? 2 Bayesian classifier 3 Maximum a Posteriori (MAP) classifier 4 Maximum Likelihood (ML) classifier

Why use probability measures for classification? Great variability may occur within a class of patterns due to measurement noise (e.g., image noise and warping) and inherent variability (apples can vary in size and shape). We tried to account for these variabilities by treating patterns as random vectors. In the MICD classifier, we account for this variability by incorporating statistical parameters of the class (e.g., mean and variance). This works well for scenarios where class distributions can be well modeled based on Gaussian statistics, but may perform poorly when the class distributions are more complex and non-Gaussian. How do we deal with this?

Why use probability measures for classification? Idea: what if we have more complete information about the probabilistic behaviour of the class? Given known class-conditional probability density functions, we can create powerful similarity measures that tell us the likelihood, or probability, of each class given an observed pattern. Classifiers built on such probabilistic measures are optimal in the minimum-probability-of-error sense.

Bayesian classifier Consider the two-class pattern recognition problem: given an unknown pattern $x$, assign the pattern to either class A or class B. A general rule of statistical decision theory is to minimize the cost associated with making a wrong decision, e.g., the amount of money lost by deciding to buy a stock that gets delisted the next day and was actually a "don't buy".

Bayesian classifier Let $L_{ij}$ be the cost of deciding on class $c_j$ when the true class is $c_i$. The total risk associated with deciding that $x$ belongs to $c_j$ can be defined by the expected cost: $r_j(x) = \sum_{i=1}^{K} L_{ij}\, P(c_i \mid x)$ (1) where $K$ is the number of classes and $P(c_i \mid x)$ is the posterior probability of class $c_i$ given the pattern $x$.
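
As a minimal sketch (not part of the course material), the expected-cost computation in Equation (1) can be written directly as a matrix-vector product; the loss matrix and posterior values below are hypothetical.

```python
import numpy as np

# Hypothetical loss matrix L[i, j]: cost of deciding class j when the true class is i.
L = np.array([[0.0, 1.0, 2.0],
              [1.0, 0.0, 1.0],
              [4.0, 2.0, 0.0]])

# Hypothetical posterior probabilities P(c_i | x) for one observed pattern x (sum to 1).
posteriors = np.array([0.2, 0.5, 0.3])

# Equation (1): r_j(x) = sum_i L_ij * P(c_i | x), one risk value per candidate decision j.
risks = L.T @ posteriors
print(risks)  # r_1(x), r_2(x), r_3(x)
```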

Bayesian classifier For the two-class case: $r_1(x) = L_{11}\, P(c_1 \mid x) + L_{21}\, P(c_2 \mid x)$ (2) and $r_2(x) = L_{12}\, P(c_1 \mid x) + L_{22}\, P(c_2 \mid x)$ (3). Applying Bayes' rule gives us: $r_1(x) = \frac{L_{11}\, P(x \mid c_1)\, P(c_1) + L_{21}\, P(x \mid c_2)\, P(c_2)}{p(x)}$ (4) and $r_2(x) = \frac{L_{12}\, P(x \mid c_1)\, P(c_1) + L_{22}\, P(x \mid c_2)\, P(c_2)}{p(x)}$ (5).
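
To make Equations (4) and (5) concrete, here is a small worked check with assumed numbers (not from the course): zero cost for correct decisions, $L_{12} = L_{21} = 1$, priors $P(c_1) = 0.3$, $P(c_2) = 0.7$, and class-conditional values $P(x \mid c_1) = 0.5$, $P(x \mid c_2) = 0.2$ at the observed $x$:

$$r_1(x) = \frac{0 \cdot 0.5 \cdot 0.3 + 1 \cdot 0.2 \cdot 0.7}{p(x)} = \frac{0.14}{p(x)}, \qquad r_2(x) = \frac{1 \cdot 0.5 \cdot 0.3 + 0 \cdot 0.2 \cdot 0.7}{p(x)} = \frac{0.15}{p(x)}$$

Since $0.14 < 0.15$, $r_1(x) < r_2(x)$ and $x$ is assigned to $c_1$. The common factor $p(x) > 0$ never affects which risk is smaller, so it can be dropped when comparing risks.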

Bayesian classifier The general $K$-class Bayesian classifier is defined as follows, and minimizes total risk: $x \in c_i \;\text{iff}\; r_i(x) < r_j(x) \;\; \forall j \neq i$ (6). For the two-class case this reduces to $(L_{12} - L_{11})\, P(x \mid c_1)\, P(c_1) \;\overset{1}{\underset{2}{\gtrless}}\; (L_{21} - L_{22})\, P(x \mid c_2)\, P(c_2)$ (7), where the upper inequality selects class 1 and the lower selects class 2. How do you choose an appropriate cost?
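
A minimal sketch of the $K$-class decision rule in Equation (6), reusing the same hypothetical loss matrix and posteriors as above; it simply selects the class with the smallest expected cost.

```python
import numpy as np

def bayes_decision(L, posteriors):
    """Equation (6): assign x to the class c_i with the smallest risk r_i(x)."""
    risks = L.T @ posteriors          # r_j(x) for every candidate decision j
    return int(np.argmin(risks))      # index of the minimum-risk class

# Hypothetical values; the rule picks the class that minimizes expected cost.
L = np.array([[0.0, 1.0, 2.0],
              [1.0, 0.0, 1.0],
              [4.0, 2.0, 0.0]])
posteriors = np.array([0.2, 0.5, 0.3])
print(bayes_decision(L, posteriors))  # decided class index
```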

Choosing cost functions The most common cost used in the situation where no other cost criterion is known is the zero-one loss function: $L_{ij} = \begin{cases} 0, & i = j \\ 1, & i \neq j \end{cases}$ (8) Meaning: all errors have equal costs. Given the zero-one loss function, the total risk function becomes: $r_j(x) = \sum_{i=1,\, i \neq j}^{K} P(c_i \mid x) = P(\epsilon \mid x)$ (9) So minimizing total risk in this case is the same as minimizing probability of error!
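
A quick numerical check (with the same hypothetical posteriors) that under the zero-one loss of Equation (8) the minimum-risk decision coincides with the maximum-posterior decision:

```python
import numpy as np

posteriors = np.array([0.2, 0.5, 0.3])   # hypothetical P(c_i | x)
K = len(posteriors)

# Zero-one loss, Equation (8): 0 on the diagonal, 1 everywhere else.
L01 = np.ones((K, K)) - np.eye(K)

risks = L01.T @ posteriors               # Equation (9): r_j(x) = 1 - P(c_j | x)
assert np.allclose(risks, 1.0 - posteriors)
# Minimizing risk therefore picks the same class as maximizing the posterior.
assert np.argmin(risks) == np.argmax(posteriors)
```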

Types of probabilistic classifiers Using the zero-one loss function, we will study two main types of probabilistic classifiers: the Maximum a Posteriori (MAP) probability classifier and the Maximum Likelihood (ML) classifier.

Maximum a Posteriori classifier Given two classes A and B, the MAP classifier can be defined as follows: $P(A \mid x) \;\overset{A}{\underset{B}{\gtrless}}\; P(B \mid x)$ (10) where $P(A \mid x)$ and $P(B \mid x)$ are the posterior class probabilities of A and B, respectively, given observation $x$. Meaning: all patterns with a higher posterior probability for A than for B will be classified as A, and all patterns with a higher posterior probability for B than for A will be classified as B.

Maximum a Posteriori classifier Class probability models are usually given in terms of the class-conditional probabilities $P(x \mid A)$ and $P(x \mid B)$. It is more convenient to express MAP in the form: $\frac{P(x \mid A)}{P(x \mid B)} \;\overset{A}{\underset{B}{\gtrless}}\; \frac{P(B)}{P(A)}$ (11), or equivalently $l(x) \;\overset{A}{\underset{B}{\gtrless}}\; \theta$ (12), where $l(x)$ is the likelihood ratio and $\theta$ is the threshold.
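
A sketch of the likelihood-ratio form of Equations (11) and (12); the Gaussian class-conditional densities and priors below are illustrative assumptions, not part of the lecture.

```python
from scipy.stats import norm

# Assumed class-conditional densities: P(x|A) = N(0, 1), P(x|B) = N(3, 2).
p_x_given_A = lambda x: norm.pdf(x, loc=0.0, scale=1.0)
p_x_given_B = lambda x: norm.pdf(x, loc=3.0, scale=2.0)

P_A, P_B = 0.6, 0.4                      # assumed priors
theta = P_B / P_A                        # threshold in Equation (11)

def map_decide(x):
    l = p_x_given_A(x) / p_x_given_B(x)  # likelihood ratio l(x), Equation (12)
    return "A" if l > theta else "B"

print(map_decide(0.5), map_decide(2.5))
```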

Maximum a Posteriori classifier When dealing with probability density functions with exponential dependence (e.g., Gamma, Gaussian, etc.), it is more convenient to deal with MAP in the log-likelihood form: $\log l(x) \;\overset{A}{\underset{B}{\gtrless}}\; \log \theta$ (13)
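
As an illustration (not from the slides) of why the log form is convenient, consider one-dimensional Gaussian class conditionals with equal variance $\sigma^2$ and means $\mu_A$ and $\mu_B$. The exponentials cancel into a quadratic difference:

$$\log l(x) = \log \frac{p(x \mid A)}{p(x \mid B)} = \frac{(x - \mu_B)^2 - (x - \mu_A)^2}{2\sigma^2} \;\overset{A}{\underset{B}{\gtrless}}\; \log \theta$$

which is linear in $x$, so the MAP decision reduces to comparing $x$ against a single threshold.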

Maximum a Posteriori classifier: Example Suppose we are given a two-class problem, where $p(x \mid A)$ and $p(x \mid B)$ are given by:
$p(x \mid A) = \begin{cases} 0, & x < a_1 \\ \frac{1}{a_2 - a_1}, & a_1 \le x \le a_2 \\ 0, & a_2 < x \end{cases}$ (14)
$p(x \mid B) = \begin{cases} 0, & x < b_1 \\ \frac{1}{b_2 - b_1}, & b_1 \le x \le b_2 \\ 0, & b_2 < x \end{cases}$ (15)
where $b_2 > a_2 > a_1 > b_1$. Assuming $P(A) = P(B) = 1/2$, develop the MAP classification strategy.

Maximum a Posteriori classifier: Example [Figure: plot of the densities $p(x \mid A)$ and $p(x \mid B)$ versus $x$.]

Maximum a Posteriori classifier: Example The MAP classification strategy can be defined as:
$b_1 < x < a_1$: Decide class B
$a_1 \le x \le a_2$: Decide class A (here $p(x \mid A) = \frac{1}{a_2 - a_1} > \frac{1}{b_2 - b_1} = p(x \mid B)$, since the interval $[a_1, a_2]$ is narrower)
$a_2 < x < b_2$: Decide class B
Otherwise: No decision (both densities are zero)
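
A small sketch of this decision rule with hypothetical interval endpoints satisfying $b_2 > a_2 > a_1 > b_1$ (the specific numbers are illustrative only):

```python
def uniform_pdf(x, lo, hi):
    """Density of a uniform distribution on [lo, hi]."""
    return 1.0 / (hi - lo) if lo <= x <= hi else 0.0

# Hypothetical endpoints with b2 > a2 > a1 > b1, and equal priors P(A) = P(B) = 1/2.
a1, a2 = 2.0, 4.0
b1, b2 = 0.0, 10.0

def map_decide(x):
    pA = uniform_pdf(x, a1, a2)   # p(x|A), Equation (14)
    pB = uniform_pdf(x, b1, b2)   # p(x|B), Equation (15)
    if pA > pB:
        return "A"                # inside [a1, a2]: narrower interval, higher density
    if pB > pA:
        return "B"                # inside [b1, b2] but outside [a1, a2]
    return "no decision"          # both densities are zero

print([map_decide(x) for x in (-1.0, 1.0, 3.0, 7.0, 11.0)])
```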

Maximum Likelihood classifier Ideally, we would like to use the MAP classifier, which chooses the most probable class: $\frac{P(x \mid A)}{P(x \mid B)} \;\overset{A}{\underset{B}{\gtrless}}\; \frac{P(B)}{P(A)}$ (17) However, in many cases the priors $P(A)$ and $P(B)$ are unknown, making it impossible to use the posteriors $P(A \mid x)$ and $P(B \mid x)$. A common alternative is, instead of choosing the most probable class, to choose the class that makes the observed pattern $x$ most probable.

Maximum Likelihood classifier Instead of maximizing the posterior, we maximize the likelihood: $P(x \mid A) \;\overset{A}{\underset{B}{\gtrless}}\; P(x \mid B)$ (18) In likelihood-ratio form: $\frac{P(x \mid A)}{P(x \mid B)} \;\overset{A}{\underset{B}{\gtrless}}\; 1$ (19) This can be viewed as a special case of MAP where $P(A) = P(B)$.
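
A minimal ML-classification sketch under Equations (18) and (19), reusing the same illustrative Gaussian class conditionals assumed in the earlier MAP snippet; note that no priors appear anywhere.

```python
from scipy.stats import norm

# Illustrative class-conditional densities (same assumptions as the MAP sketch).
p_x_given_A = lambda x: norm.pdf(x, loc=0.0, scale=1.0)
p_x_given_B = lambda x: norm.pdf(x, loc=3.0, scale=2.0)

def ml_decide(x):
    # Equation (18): pick the class under which the observed x is most probable.
    return "A" if p_x_given_A(x) > p_x_given_B(x) else "B"

print(ml_decide(0.5), ml_decide(2.5))
```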