Bayesian Decision Theory

Bayesian Decision Theory. Dr. Shuang LIANG, School of Software Engineering, TongJi University, Fall 2012.

Today's Topics: Bayesian Decision Theory (Continuous Features); Minimum-Error-Rate Classification; Minimum-Risk Classification; Discriminant Functions; Bayesian classification for normal distributions; Error Probabilities and Integrals.

Minimum-Error-Rate Classification. Goal: minimize the classification error probability. The Bayesian classifier is optimal with respect to this criterion: no other decision rule achieves a lower probability of error.

Make a Decision. When all errors are equally costly, Bayesian decision theory reduces to minimum-error-rate classification. Decision rule: decide $\omega_1$ if $P(\omega_1 \mid x) > P(\omega_2 \mid x)$; otherwise decide $\omega_2$. How do we get the posterior probability? By Bayes' theorem, $P(\omega_j \mid x) = \dfrac{p(x \mid \omega_j) P(\omega_j)}{p(x)}$, where $p(x) = \sum_j p(x \mid \omega_j) P(\omega_j)$. An equivalent rule: decide $\omega_1$ if $p(x \mid \omega_1) P(\omega_1) > p(x \mid \omega_2) P(\omega_2)$, since the evidence $p(x)$ is the same for both classes.

Probability of Error. For a given $x$, $P(\mathrm{error} \mid x) = P(\omega_1 \mid x)$ if we decide $\omega_2$, and $P(\omega_2 \mid x)$ if we decide $\omega_1$. Under the Bayes rule, $P(\mathrm{error} \mid x) = \min\left[P(\omega_1 \mid x),\, P(\omega_2 \mid x)\right]$.

Example: cancer cell recognition. Two classes: normal ($\omega_1$) and abnormal ($\omega_2$), with known priors $P(\omega_1) = 0.9$ and $P(\omega_2) = 0.1$. A cell to be classified has feature $x$, with class-conditional probability densities $p(x \mid \omega_1) = 0.2$ and $p(x \mid \omega_2) = 0.4$. Which class does $x$ belong to?
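The slide's numbers are enough to carry out the computation; the short Python sketch below (variable names are my own, not from the slides) applies Bayes' theorem to them.

```python
# Posterior computation for the cancer-cell example (values from the slide).
priors = {"w1": 0.9, "w2": 0.1}       # P(w1), P(w2)
likelihoods = {"w1": 0.2, "w2": 0.4}  # p(x|w1), p(x|w2)

# Evidence p(x) = sum_j p(x|wj) P(wj)
evidence = sum(likelihoods[w] * priors[w] for w in priors)

# Posteriors P(wj|x) = p(x|wj) P(wj) / p(x)
posteriors = {w: likelihoods[w] * priors[w] / evidence for w in priors}
print(posteriors)  # {'w1': 0.818..., 'w2': 0.181...}

# Minimum-error-rate decision: pick the class with the larger posterior.
decision = max(posteriors, key=posteriors.get)
print(decision)    # 'w1' -- the cell is classified as normal
```

Despite the larger likelihood under $\omega_2$, the strong prior for $\omega_1$ dominates the posterior.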

Extension to Multiple Classes. The principle is the same as in the two-category case. Decision rule: if $P(\omega_i \mid x) = \max_{j=1,\dots,c} P(\omega_j \mid x)$, then $x \in \omega_i$. Equivalently, in terms of the discriminant function $g_i(x)$: if $p(x \mid \omega_i)\,P(\omega_i) = \max_{j=1,\dots,c} p(x \mid \omega_j)\,P(\omega_j)$, then $x \in \omega_i$.

Extension to Multiple Classes (cont.). The decision is made by comparing the discriminant functions of all classes and choosing the largest.
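As a concrete sketch of this comparison (the densities and priors below are hypothetical, chosen only to illustrate the rule; nothing here comes from the slides beyond the rule itself):

```python
import numpy as np

# Hypothetical class-conditional densities p(x|wj): unit-variance Gaussians
# with different means, plus hypothetical priors P(wj).
means = np.array([-2.0, 0.0, 3.0])
priors = np.array([0.5, 0.3, 0.2])

def gaussian_pdf(x, mu, sigma=1.0):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def classify(x):
    # Discriminant g_i(x) = p(x|w_i) P(w_i); decide the class with the maximum.
    g = np.array([gaussian_pdf(x, mu) for mu in means]) * priors
    return int(np.argmax(g))  # index of the winning class

print(classify(-1.5))  # 0
print(classify(2.5))   # 2
```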

Probability of Error Revisited. Assume the feature space is divided into decision regions $R_1, R_2, \dots, R_c$. How do we calculate the probability of error $P(e)$? Either directly, by integrating the densities of the wrong classes over each region, or via the probability of a correct decision: $P(c) = \sum_{i=1}^{c} \int_{R_i} p(x \mid \omega_i) P(\omega_i)\,dx$, so that $P(e) = 1 - P(c)$.
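As a quick numerical sanity check of $P(e) = 1 - P(c)$, the Monte Carlo sketch below uses an invented two-Gaussian problem (not from the slides) whose Bayes error is known in closed form:

```python
import numpy as np

# Hypothetical two-class problem: unit-variance Gaussians at -1 and +1 with
# equal priors. With equal priors and variances the Bayes boundary is x = 0,
# so R1 = (-inf, 0) and R2 = [0, inf).
rng = np.random.default_rng(0)
n = 1_000_000
labels = rng.integers(0, 2, n)                       # true class, P = 0.5 each
x = rng.normal(np.where(labels == 0, -1.0, 1.0), 1.0)

decisions = (x >= 0).astype(int)                     # Bayes rule: decide by sign
print((decisions != labels).mean())  # ~0.1587, i.e. P(e) = 1 - P(c) = Phi(-1)
```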

Today's Topics: Bayesian Decision Theory; Minimum-Error-Rate Classification; Minimum-Risk Classification; Discriminant Functions; Bayesian classification for normal distributions; Error Probabilities and Integrals.

A More General Theory. How can we generalize?
- To more than one feature: replace the scalar $x$ by the feature vector $\mathbf{x}$.
- To more than two states of nature: just a difference in notation.
- To actions other than just decisions: allow the possibility of rejection.
- To different risks in the decision: define how costly each action is by introducing a more general error function, the loss function (i.e., associate costs with actions).

Bayesian Decision Theory (cont.). Let $\{\omega_1, \dots, \omega_c\}$ be the finite set of $c$ states of nature (classes, categories). Let $\{\alpha_1, \dots, \alpha_a\}$ be the finite set of $a$ possible actions. Let $\lambda(\alpha_i \mid \omega_j)$ be the loss incurred for taking action $\alpha_i$ when the state of nature is $\omega_j$. Let $\mathbf{x}$ be the $d$-component vector-valued random variable called the feature vector.

Bayesian Decision Theory (cont.). $p(x \mid \omega_j)$ is the class-conditional probability density function, and $P(\omega_j)$ is the prior probability that nature is in state $\omega_j$. The posterior probability can be computed as $P(\omega_j \mid x) = \dfrac{p(x \mid \omega_j) P(\omega_j)}{p(x)}$, with evidence $p(x) = \sum_{j=1}^{c} p(x \mid \omega_j) P(\omega_j)$.

Conditional Risk. Suppose we observe $x$ and take action $\alpha_i$. If the true state of nature is $\omega_j$, we incur the loss $\lambda(\alpha_i \mid \omega_j)$. The expected loss of taking action $\alpha_i$ is $R(\alpha_i \mid x) = \sum_{j=1}^{c} \lambda(\alpha_i \mid \omega_j) P(\omega_j \mid x)$, which is also called the conditional risk.

Minimum-Risk Classification. The general decision rule $\alpha(x)$ tells us which action to take for observation $x$. We want the decision rule that minimizes the overall risk $R = \int R(\alpha(x) \mid x)\, p(x)\, dx$. The Bayes decision rule minimizes the overall risk by selecting, for every $x$, the action $\alpha_i$ for which $R(\alpha_i \mid x)$ is minimum. The resulting minimum overall risk is called the Bayes risk and is the best performance that can be achieved.
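A minimal sketch of the rule, assuming the posteriors are already available (the loss matrix and posterior values are hypothetical):

```python
import numpy as np

def conditional_risks(posteriors, loss):
    # R(a_i|x) = sum_j loss[i, j] * P(w_j|x)
    return loss @ posteriors

def bayes_action(posteriors, loss):
    # Bayes decision rule: take the action with minimum conditional risk.
    return int(np.argmin(conditional_risks(posteriors, loss)))

# Hypothetical 2-class, 2-action loss matrix (rows: actions, cols: states).
loss = np.array([[0.0, 2.0],
                 [1.0, 0.0]])
posteriors = np.array([0.7, 0.3])
print(conditional_risks(posteriors, loss))  # [0.6, 0.7]
print(bayes_action(posteriors, loss))       # 0 -> decide w1
```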

Two-Category Classification. Define $\alpha_1$: deciding $\omega_1$; $\alpha_2$: deciding $\omega_2$; and $\lambda_{ij} = \lambda(\alpha_i \mid \omega_j)$. The conditional risks can be written as $R(\alpha_1 \mid x) = \lambda_{11} P(\omega_1 \mid x) + \lambda_{12} P(\omega_2 \mid x)$ and $R(\alpha_2 \mid x) = \lambda_{21} P(\omega_1 \mid x) + \lambda_{22} P(\omega_2 \mid x)$.

Two-Category Classification. The minimum-risk decision rule becomes: decide $\omega_1$ if $R(\alpha_1 \mid x) < R(\alpha_2 \mid x)$. This corresponds to deciding $\omega_1$ if $(\lambda_{21} - \lambda_{11})\, p(x \mid \omega_1) P(\omega_1) > (\lambda_{12} - \lambda_{22})\, p(x \mid \omega_2) P(\omega_2)$. An equivalent form (assuming $\lambda_{21} > \lambda_{11}$): decide $\omega_1$ if $\dfrac{p(x \mid \omega_1)}{p(x \mid \omega_2)} > \dfrac{\lambda_{12} - \lambda_{22}}{\lambda_{21} - \lambda_{11}} \cdot \dfrac{P(\omega_2)}{P(\omega_1)}$, i.e., comparing the likelihood ratio to a threshold that is independent of the observation $x$.
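The sketch below works the likelihood-ratio form through with made-up losses and priors (all numbers are hypothetical, chosen only to exercise the formula):

```python
# Likelihood-ratio form of the two-category minimum-risk rule:
# decide w1 iff p(x|w1)/p(x|w2) > [(l12 - l22)/(l21 - l11)] * P(w2)/P(w1).
l11, l12, l21, l22 = 0.0, 3.0, 1.0, 0.0
P1, P2 = 0.6, 0.4

threshold = (l12 - l22) / (l21 - l11) * (P2 / P1)
print(threshold)  # 2.0 -- fixed before any observation is seen

def decide(px_w1, px_w2):
    return "w1" if px_w1 / px_w2 > threshold else "w2"

print(decide(0.5, 0.2))  # w1 (ratio 2.5 > 2.0)
print(decide(0.3, 0.2))  # w2 (ratio 1.5 <= 2.0)
```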

Minimum-Error-Rate Classification. Actions are decisions on classes ($\alpha_i$ is deciding $\omega_i$). If action $\alpha_i$ is taken and the true state of nature is $\omega_j$, then the decision is correct if $i = j$ and in error if $i \neq j$. We want a decision rule that minimizes the probability of error.

Minimum-Error-Rate Classification. Define the zero-one loss function (all errors are equally costly): $\lambda(\alpha_i \mid \omega_j) = 0$ if $i = j$, and $1$ if $i \neq j$. The conditional risk becomes $R(\alpha_i \mid x) = \sum_{j \neq i} P(\omega_j \mid x) = 1 - P(\omega_i \mid x)$.

Minimum-Error-Rate Classification. Minimizing this risk is the same as maximizing $P(\omega_i \mid x)$, which yields the minimum-error decision rule: decide $\omega_i$ if $P(\omega_i \mid x) > P(\omega_j \mid x)$ for all $j \neq i$. The resulting error is called the Bayes error and is the best performance that can be achieved.
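The equivalence $R(\alpha_i \mid x) = 1 - P(\omega_i \mid x)$ can be checked mechanically; the snippet below (hypothetical posteriors) verifies that minimizing the zero-one risk picks the same class as maximizing the posterior:

```python
import numpy as np

# Zero-one loss: l(a_i|w_j) = 0 if i == j else 1.
c = 3
zero_one = np.ones((c, c)) - np.eye(c)

posteriors = np.array([0.2, 0.5, 0.3])  # hypothetical P(w_j|x)
risks = zero_one @ posteriors           # R(a_i|x) = 1 - P(w_i|x)
print(risks)                            # [0.8, 0.5, 0.7]
# Minimizing the risk is the same as maximizing the posterior:
assert np.argmin(risks) == np.argmax(posteriors)
```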

[Figure] The likelihood ratio $p(x \mid \omega_1)/p(x \mid \omega_2)$. The threshold $\theta_a$ is computed using the priors $P(\omega_1) = 2/3$ and $P(\omega_2) = 1/3$ and a zero-one loss function. If we penalize mistakes in classifying $\omega_2$ patterns as $\omega_1$ more than the converse, we should raise the threshold to $\theta_b$.

Example Revisited: cancer cell recognition. Two classes: normal ($\omega_1$) and abnormal ($\omega_2$), with priors $P(\omega_1) = 0.9$ and $P(\omega_2) = 0.1$; class-conditional probability densities $p(x \mid \omega_1) = 0.2$ and $p(x \mid \omega_2) = 0.4$; and losses $\lambda_{11} = 0$, $\lambda_{12} = 6$, $\lambda_{21} = 0$, $\lambda_{22} = 0$. Which class does $x$ belong to now?
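Working this through with the loss values exactly as transcribed on the slide (note that $\lambda_{21} = 0$ makes deciding $\omega_2$ risk-free here, which may be a deliberate simplification):

```python
# Risks for the revisited cancer-cell example, using the slide's numbers.
P1, P2 = 0.9, 0.1        # priors P(w1), P(w2)
px1, px2 = 0.2, 0.4      # class-conditional densities at x
l11, l12, l21, l22 = 0.0, 6.0, 0.0, 0.0  # losses as given on the slide

evidence = px1 * P1 + px2 * P2
post1, post2 = px1 * P1 / evidence, px2 * P2 / evidence  # 0.818..., 0.181...

R1 = l11 * post1 + l12 * post2  # risk of deciding w1 (normal)
R2 = l21 * post1 + l22 * post2  # risk of deciding w2 (abnormal)
print(R1, R2)  # ~1.09 vs 0.0 -> decide w2
```

Although the minimum-error rule classified the cell as normal, the heavy penalty on missing an abnormal cell flips the minimum-risk decision to $\omega_2$.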

Today's Topics: Bayesian Decision Theory; Minimum-Error-Rate Classification; Minimum-Risk Classification; Discriminant Functions; Bayesian classification for normal distributions; Error Probabilities and Integrals.

Discriminant Functions. A useful way of representing classifiers is through discriminant functions $g_i(x)$, $i = 1, \dots, c$: the classifier assigns a feature vector $x$ to class $\omega_i$ if $g_i(x) > g_j(x)$ for all $j \neq i$. For the classifier that minimizes conditional risk, $g_i(x) = -R(\alpha_i \mid x)$; for the classifier that minimizes error, $g_i(x) = P(\omega_i \mid x)$.

Discriminant Functions. These functions divide the feature space into $c$ decision regions $R_1, \dots, R_c$, separated by decision boundaries. Note that the results do not change if we replace every $g_i(x)$ by $f(g_i(x))$, where $f(\cdot)$ is a monotonically increasing function (e.g., the logarithm). This can lead to significant analytical and computational simplifications.

Discriminant Functions. In particular, for minimum-error-rate classification, any of the following choices gives identical classification results, but some can be much simpler to understand or to compute than others: $g_i(x) = P(\omega_i \mid x) = \dfrac{p(x \mid \omega_i) P(\omega_i)}{\sum_{j=1}^{c} p(x \mid \omega_j) P(\omega_j)}$; $g_i(x) = p(x \mid \omega_i) P(\omega_i)$; $g_i(x) = \ln p(x \mid \omega_i) + \ln P(\omega_i)$.
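A tiny check of this equivalence (the likelihoods and priors are hypothetical): applying a monotonically increasing transform such as the logarithm never changes which discriminant is largest.

```python
import numpy as np

likelihoods = np.array([0.2, 0.4, 0.1])  # hypothetical p(x|w_j)
priors = np.array([0.5, 0.3, 0.2])       # hypothetical P(w_j)

g_product = likelihoods * priors              # g_i = p(x|w_i) P(w_i)
g_log = np.log(likelihoods) + np.log(priors)  # g_i = ln p(x|w_i) + ln P(w_i)
# A monotonically increasing f (here, the logarithm) preserves the argmax:
assert np.argmax(g_product) == np.argmax(g_log)
```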

Discriminant Functions. Even though the discriminant functions can be written in a variety of forms, the decision rules are equivalent. The effect of any decision rule is to divide the feature space into $c$ decision regions $R_1, \dots, R_c$. If $g_i(x) > g_j(x)$ for all $j \neq i$, then $x$ is in $R_i$, and the decision rule calls for us to assign $x$ to $\omega_i$. The regions are separated by decision boundaries: surfaces in feature space where ties occur among the largest discriminant functions.

[Figure] In this two-dimensional, two-category classifier the probability densities are Gaussian; the decision boundary consists of two hyperbolas, so the decision region $R_2$ is not simply connected.

Discriminant Functions. Instead of using two discriminant functions $g_1$ and $g_2$ and assigning $x$ to $\omega_1$ if $g_1 > g_2$, it is more common to define a single discriminant function $g(x) \equiv g_1(x) - g_2(x)$ and decide $\omega_1$ if $g(x) > 0$, otherwise $\omega_2$. Other equivalent forms include $g(x) = P(\omega_1 \mid x) - P(\omega_2 \mid x)$ and $g(x) = \ln \dfrac{p(x \mid \omega_1)}{p(x \mid \omega_2)} + \ln \dfrac{P(\omega_1)}{P(\omega_2)}$.
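A minimal sketch of this single-discriminant (dichotomizer) form, assuming hypothetical unit-variance Gaussian densities at $-1$ and $+1$ with equal priors:

```python
import numpy as np

# Decide w1 iff g(x) > 0, with
# g(x) = ln[p(x|w1)/p(x|w2)] + ln[P(w1)/P(w2)].
P1, P2 = 0.5, 0.5

def gauss(x, mu, sigma=1.0):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def g(x):
    return np.log(gauss(x, -1.0) / gauss(x, 1.0)) + np.log(P1 / P2)

print(g(-0.5))  # positive -> decide w1
print(g(0.5))   # negative -> decide w2
```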

Thank you for your attention.