Introduction to Pattern Recognition (8001652). Lectures 4 and 5: Bayesian decision theory


Jussi Tohka (jussi.tohka@tut.fi), Institute of Signal Processing, Tampere University of Technology

Bayesian decision theory is a fundamental statistical approach to the problem of pattern classification. The approach is based on quantifying the tradeoffs between various classification decisions and their costs. The problem must be posed in probabilistic terms, and all associated probabilities must be completely known. The theory is just a formalization of some common-sense procedures; however, it gives a solid foundation on which various pattern classification methods can be built.

The fish example: Terminology

We want to separate two kinds of fish: sea bass and salmon. No other kinds of fish are possible. The true class ω is a random variable (RV). Two classes are possible: ω = ω_1 for sea bass and ω = ω_2 for salmon. If the sea contains more sea bass than salmon, it is natural to assume (even when no data or features are available) that a caught fish is a sea bass. This is modeled with the prior probabilities P(ω_1) and P(ω_2), which are positive and sum to one. If there are more sea bass, then P(ω_1) > P(ω_2).

The fish example continued

If we must decide at this point (for some curious reason) which fish we have, how should we decide? Because there are more sea bass, we would say that the fish is a sea bass. In other words, our decision rule becomes: decide ω_1 if P(ω_1) > P(ω_2), and otherwise decide ω_2. To develop better rules, we must extract some information, or features, from the data. This means, for instance, making lightness measurements of the fish.

The fish example continued

Suppose we have a lightness reading, say x, from the fish. What do we do next? We know every probability relevant to the classification problem. In particular, we know P(ω_1), P(ω_2) and the class conditional probability densities p(x | ω_1) and p(x | ω_2). Based on these we can compute the probability that the class is ω_1 given that the lightness reading is x, and similarly for salmon. We just use the Bayes formula.

The decision rule: decide ω_1 if P(ω_1 | x) > P(ω_2 | x), and otherwise decide ω_2. Remember that we do not know P(ω_j | x) directly. The posteriors must be computed through the Bayes rule:

P(ω_j | x) = p(x | ω_j) P(ω_j) / p(x),

where the evidence p(x) = Σ_{j=1}^{2} p(x | ω_j) P(ω_j).

The fish example continued

The justification for the rule: P(error | x) = P(ω_1 | x) if we decide ω_2, and P(error | x) = P(ω_2 | x) if we decide ω_1. The average error is

P(error) = ∫ P(error, x) dx = ∫ P(error | x) p(x) dx.

Thus, if P(error | x) is minimal for every x, the average error is also minimized. The decision rule "decide ω_1 if P(ω_1 | x) > P(ω_2 | x), and otherwise decide ω_2" guarantees just that.

An equivalent decision rule is obtained by multiplying P(ω_j | x) in the previous rule by p(x). Because p(x) does not depend on the class, this obviously does not affect the decision itself. We obtain the rule: decide ω_1 if p(x | ω_1) P(ω_1) > p(x | ω_2) P(ω_2), and otherwise decide ω_2.
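To make the fish rule concrete, here is a minimal Python sketch of the posterior computation and the decision. The priors, the Gaussian class conditional densities and the lightness value 5.4 are made-up illustration numbers, not values from the lecture.

import math

def normal_pdf(x, mu, sigma):
    """Univariate normal density."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (math.sqrt(2 * math.pi) * sigma)

# Hypothetical priors and class conditional densities for sea bass (w1) and salmon (w2).
P_w1, P_w2 = 0.7, 0.3
p_x_given_w1 = lambda x: normal_pdf(x, mu=4.0, sigma=1.0)   # lightness of sea bass
p_x_given_w2 = lambda x: normal_pdf(x, mu=6.0, sigma=1.0)   # lightness of salmon

def classify(x):
    # Evidence p(x) = sum_j p(x | w_j) P(w_j)
    evidence = p_x_given_w1(x) * P_w1 + p_x_given_w2(x) * P_w2
    post_w1 = p_x_given_w1(x) * P_w1 / evidence
    post_w2 = p_x_given_w2(x) * P_w2 / evidence
    return ("sea bass" if post_w1 > post_w2 else "salmon"), post_w1, post_w2

print(classify(5.4))   # prints the decided class and both posteriors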

BDT - features

The purpose is to decide on an action based on the sensed object with the measured feature vector x. Each object to be classified has a corresponding feature vector, and we identify the feature vector x with the object to be classified. The set of all possible feature vectors is called the feature space, which we denote by F. Feature spaces correspond to sample spaces. Examples of feature spaces are R^d, {0, 1}^d and R × {0, 1}^d. For the moment, we assume that the feature space is R^d.

BDT - classes

We denote by ω the (unknown) class or category of the object x. We use the symbols ω_1, ..., ω_c for the c categories, or classes, to which x can belong. At this stage, c is fixed. Each category ω_i has a prior probability P(ω_i). The prior probability tells us how likely a particular class is before making any observations. In addition, we know the probability density functions (pdfs) of feature vectors drawn from a certain class. These are called class conditional density functions (ccdfs) and are denoted by p(x | ω_i) for i = 1, ..., c. The ccdfs tell us how probable the feature vector x is provided that the class of the object is ω_i.

BDT - actions

We write α_1, ..., α_a for the a possible actions and α(x) for the action taken after observing x. Thought of as a function from the feature space to {α_1, ..., α_a}, α(x) is called a decision rule. In fact, any function from the feature space to {α_1, ..., α_a} is a decision rule. The number of actions a need not be equal to c, the number of classes. But if a = c and the actions α_i read "assign x to the class i", we often forget about the actions and simply talk about assigning x to a certain class.

BDT - loss function

We will develop the optimal decision rule based on statistics, but before that we need to tie actions and classes together. This is done with a loss function. The loss function, denoted by λ, tells how costly each action is, and it is used to convert a probability determination into a decision of an action. λ is a function from action/class pairs to the set of non-negative real numbers. λ(α_i | ω_j) describes the loss incurred for taking action α_i when the true class is ω_j: λ(α | ω_j) is low for good actions given the class ω_j and high for bad actions given the class ω_j.

BDT - Bayes decision rule

If the true class is ω_j, by definition we incur the loss λ(α_i | ω_j) when taking the action α_i after observing x. The expected loss, or the conditional risk, of taking action α_i after observing x is

R(α_i | x) = Σ_{j=1}^{c} λ(α_i | ω_j) P(ω_j | x).

The total expected loss, termed the overall risk, is

R_total(α) = ∫ R(α(x) | x) p(x) dx

for the decision rule α.

Bayes decision rule

We would now like to derive a decision rule α(x) that minimizes the overall risk. This decision rule is: select the action α_i that gives the minimum conditional risk R(α_i | x), i.e.

α(x) = arg min_{α_i} R(α_i | x).

This is called the Bayes decision rule. The classifier built upon this rule is called the Bayes (minimum risk) classifier. The overall risk R_total for the Bayes decision rule is called the Bayes risk. It is the smallest overall risk that is possible.

Bayes decision rule

We now prove that the Bayes decision rule indeed minimizes the overall risk.

THEOREM. Let α : F → {α_1, ..., α_a} be an arbitrary decision rule and α_bayes : F → {α_1, ..., α_a} the Bayes decision rule. Then R_total(α_bayes) ≤ R_total(α).

PROOF. By the definition of the Bayes decision rule, R(α_bayes(x) | x) ≤ R(α(x) | x) for every x. Hence, because p(x) ≥ 0,

R(α_bayes(x) | x) p(x) ≤ R(α(x) | x) p(x),

and therefore

R_total(α_bayes) = ∫ R(α_bayes(x) | x) p(x) dx ≤ ∫ R(α(x) | x) p(x) dx = R_total(α).
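The conditional risk computation is easy to mechanize. Below is a minimal Python sketch of the Bayes decision rule for a generic loss matrix; the loss values and the posterior vector are hypothetical.

import numpy as np

# Hypothetical loss matrix: lam[i, j] = loss for taking action alpha_i when the true class is omega_j.
lam = np.array([[0.0, 2.0, 4.0],
                [1.0, 0.0, 3.0],
                [5.0, 1.0, 0.0]])

def bayes_action(posteriors, loss):
    """Return the index of the action minimizing the conditional risk
    R(alpha_i | x) = sum_j loss[i, j] * P(omega_j | x), together with all risks."""
    risks = loss @ posteriors
    return int(np.argmin(risks)), risks

posteriors = np.array([0.2, 0.5, 0.3])      # P(omega_j | x), assumed already computed
best, risks = bayes_action(posteriors, lam)
print(best, risks)                          # action with the smallest expected loss, and all risks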

Two category classification

The possible classes are ω_1 and ω_2. The action α_1 corresponds to deciding that the true class is ω_1, and α_2 corresponds to deciding that the true class is ω_2. Write λ(α_i | ω_j) = λ_ij. Then

R(α_1 | x) = λ_11 P(ω_1 | x) + λ_12 P(ω_2 | x),
R(α_2 | x) = λ_21 P(ω_1 | x) + λ_22 P(ω_2 | x).

The Bayes decision rule: decide that the true class is ω_1 if R(α_1 | x) < R(α_2 | x), and ω_2 otherwise. Equivalently, we decide (that the true class is) ω_1 if

(λ_21 - λ_11) P(ω_1 | x) > (λ_12 - λ_22) P(ω_2 | x).

Ordinarily λ_21 - λ_11 and λ_12 - λ_22 are positive; that is, the loss is greater when making a mistake. Assume λ_21 > λ_11. Then the Bayes decision rule can be written as: decide ω_1 if

p(x | ω_1) / p(x | ω_2) > [(λ_12 - λ_22) / (λ_21 - λ_11)] · [P(ω_2) / P(ω_1)].

Example: SPAM filtering

We have two actions: α_1 stands for "keep the mail" and α_2 stands for "delete as SPAM". There are two classes: ω_1 (normal mail) and ω_2 (SPAM, i.e. junk mail). Let P(ω_1) = 0.4, P(ω_2) = 0.6 and λ_11 = 0, λ_21 = 3, λ_12 = 1, λ_22 = 0. That is, deleting important mail as SPAM is more costly than keeping a SPAM mail. We get an e-mail message with the feature vector x, for which p(x | ω_1) = 0.35 and p(x | ω_2) = 0.65. How does the Bayes minimum risk classifier act?

P(ω_1 | x) = (0.35 · 0.4) / (0.35 · 0.4 + 0.65 · 0.6) ≈ 0.264, P(ω_2 | x) ≈ 0.736.
R(α_1 | x) = 0 · 0.264 + 1 · 0.736 = 0.736,
R(α_2 | x) = 3 · 0.264 + 0 · 0.736 ≈ 0.792.

Since R(α_1 | x) < R(α_2 | x), don't delete the mail!

Zero-one loss

An important loss function is the zero-one loss:

λ(α_i | ω_j) = 0 if i = j, and 1 if i ≠ j.

In matrix form, the loss matrix has 0 on the diagonal and 1 everywhere else:

0 1 ... 1
1 0 ... 1
. . ... .
1 1 ... 0
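The SPAM numbers can be reproduced with a few lines of Python; this sketch simply re-evaluates the quantities given above.

P = {"w1": 0.4, "w2": 0.6}                 # priors: normal mail, SPAM
p_x = {"w1": 0.35, "w2": 0.65}             # class conditional densities at the observed x
lam = {("a1", "w1"): 0.0, ("a1", "w2"): 1.0,   # a1 = keep the mail
       ("a2", "w1"): 3.0, ("a2", "w2"): 0.0}   # a2 = delete as SPAM

evidence = sum(p_x[w] * P[w] for w in P)
post = {w: p_x[w] * P[w] / evidence for w in P}

risk = {a: sum(lam[(a, w)] * post[w] for w in P) for a in ("a1", "a2")}
print(post)   # {'w1': 0.264..., 'w2': 0.735...}
print(risk)   # {'a1': 0.735..., 'a2': 0.792...}  -> keep the mail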

Minimum error rate classification

Assume that our loss function is the zero-one loss and the actions α_i read "decide that the true class is ω_i". We can then identify the action α_i with the class ω_i. The Bayes decision rule applied to this case leads to the minimum error rate classification rule and the Bayes (minimum error) classifier. The minimum error rate classification rule is: decide ω_i if P(ω_i | x) > P(ω_j | x) for all j ≠ i.

Recap: The Bayes Classifier

Given a feature vector x, compute the conditional risk R(α_i | x) for taking action α_i for all i = 1, ..., a and select the action that gives the smallest conditional risk. Classification with zero-one loss: compute the probability P(ω_i | x) for all categories ω_1, ..., ω_c and select the category that gives the largest probability. Remember to use the Bayes rule in computing the probabilities.

Discriminant functions

Note: in what follows we assume that a = c and use ω_i and α_i interchangeably. There are many ways to represent pattern classifiers. One of the most useful is in terms of a set of discriminant functions g_1(x), ..., g_c(x) for a c category classifier. The classifier assigns a feature vector x to class ω_i if

g_i(x) > g_j(x) for all j ≠ i.

For the Bayes classifier, g_i(x) = -R(α_i | x).

[Figure 2.5 from Duda, Hart, Stork: Pattern Classification, Wiley, 2001.]
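Under the zero-one loss the conditional risk is R(α_i | x) = 1 - P(ω_i | x), so minimizing the risk is the same as maximizing the posterior. The following Python sketch (with a made-up posterior vector) checks this equivalence numerically.

import numpy as np

c = 4
posteriors = np.array([0.1, 0.45, 0.3, 0.15])      # hypothetical P(omega_i | x), summing to 1
zero_one = np.ones((c, c)) - np.eye(c)             # loss matrix: 0 on the diagonal, 1 elsewhere

risks = zero_one @ posteriors                      # R(alpha_i | x) = 1 - P(omega_i | x)
assert np.allclose(risks, 1.0 - posteriors)
assert np.argmin(risks) == np.argmax(posteriors)   # same decision either way
print(int(np.argmin(risks)))                       # -> 1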

Equivalent discriminant functions

The choice of discriminant functions is not unique. Many distinct sets of discriminant functions lead to the same classifier, that is, to the same decision rule. We say that two sets of discriminant functions are equivalent if they lead to the same classifier, or, to put it more formally, two sets of discriminant functions are equivalent if their corresponding decision rules give equal decisions for all x.

The following holds: let f be a monotonically increasing function (i.e. f(x) < f(y) whenever x < y) and let g_i(x), i = 1, ..., c, be the discriminant functions representing a classifier. Then the discriminant functions f(g_i(x)), i = 1, ..., c, represent essentially the same classifier as g_i(x), i = 1, ..., c.

Example: equivalent sets of discriminant functions for the minimum error rate (Bayes) classifier are

g_i(x) = P(ω_i | x) = p(x | ω_i) P(ω_i) / Σ_{j=1}^{c} p(x | ω_j) P(ω_j),
g_i(x) = p(x | ω_i) P(ω_i),
g_i(x) = ln p(x | ω_i) + ln P(ω_i).

Linear discriminant functions

A discriminant function is linear if it can be written in the form

g_i(x) = w_i^T x + w_i0, where w_i = [w_i1, ..., w_id]^T.

The term w_i0 is called the threshold or bias for the ith category. Note that a linear discriminant function is linear with respect to w_i and w_i0, but actually affine with respect to x. Don't let this bother you too much! A classifier is linear if it can be represented using entirely linear discriminant functions. Linear classifiers have some important properties which we will study later during this course.

Discriminant functions: Two categories

In the two-category case we may combine the two discriminant functions into a single discriminant function. The decision rule is: decide ω_1 if g_1(x) > g_2(x), and otherwise decide ω_2. Define g(x) = g_1(x) - g_2(x). We obtain an equivalent decision rule: decide ω_1 if g(x) > 0, and otherwise decide ω_2.
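The equivalence of these discriminant function families can be verified numerically: applying the (monotonically increasing) logarithm leaves the arg max, and hence the decision, unchanged. A small Python sketch with made-up likelihood and prior values:

import numpy as np

likelihoods = np.array([0.05, 0.20, 0.10])   # hypothetical p(x | omega_i) at some fixed x
priors = np.array([0.5, 0.2, 0.3])

g_product = likelihoods * priors                      # g_i(x) = p(x | w_i) P(w_i)
g_posterior = g_product / g_product.sum()             # g_i(x) = P(w_i | x)
g_log = np.log(likelihoods) + np.log(priors)          # g_i(x) = ln p(x | w_i) + ln P(w_i)

# All three sets of discriminant functions give the same decision.
assert np.argmax(g_product) == np.argmax(g_posterior) == np.argmax(g_log)
print(int(np.argmax(g_log)))                          # -> 1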

Decision regions

The effect of any decision rule is to divide the feature space (in this case R^d) into c disjoint decision regions R_1, ..., R_c. Decision rules can be written with the help of decision regions: if x ∈ R_i, decide ω_i. (Therefore decision regions form a representation for a classifier.) Decision regions can be derived from discriminant functions:

R_i = {x : g_i(x) > g_j(x) for all j ≠ i}.

Note that decision regions are properties of the classifier, and they are not affected if the discriminant functions are changed to equivalent ones. Boundaries of decision regions, i.e. places where two or more discriminant functions yield the same value, are called decision boundaries.

[Figure 2.6 from Duda, Hart, Stork: Pattern Classification, Wiley, 2001.]

Decision regions example

Consider a two-category classification problem with P(ω_1) = 0.6, P(ω_2) = 0.4 and

p(x | ω_1) = (1/√(2π)) exp[-0.5 x²],
p(x | ω_2) = (1/√(2π)) exp[-0.5 (x - 1)²].

Find the decision regions and boundaries for the minimum error rate Bayes classifier. The decision region R_1 is the set of points where P(ω_1 | x) > P(ω_2 | x). The decision region R_2 is the set of points where P(ω_2 | x) > P(ω_1 | x). The decision boundary is the set of points where P(ω_1 | x) = P(ω_2 | x).

[Plot: the class conditional densities p(x | ω_1) and p(x | ω_2).]

Decision regions example

Let us begin with the decision boundary:

P(ω_1 | x) = P(ω_2 | x)  ⇔  p(x | ω_1) P(ω_1) = p(x | ω_2) P(ω_2),

where we used the Bayes formula and multiplied by p(x). Taking logarithms,

p(x | ω_1) P(ω_1) = p(x | ω_2) P(ω_2)  ⇔  ln[p(x | ω_1) P(ω_1)] = ln[p(x | ω_2) P(ω_2)],

that is,

-x²/2 + ln 0.6 = -(x - 1)²/2 + ln 0.4
x² - 2 ln 0.6 = x² - 2x + 1 - 2 ln 0.4.

The decision boundary is x* = 0.5 + ln 0.6 - ln 0.4 ≈ 0.91, and the decision regions are R_1 = {x : x < x*}, R_2 = {x : x > x*}.

[Plot: the posteriors P(ω_1 | x) and P(ω_2 | x) with the class 1 and class 2 decision regions.]

The normal density

Univariate case:

p(x) = (1/(√(2π) σ)) exp[-(1/2) ((x - µ)/σ)²],

where the parameter σ > 0. Multivariate case:

p(x) = (1/((2π)^{d/2} √(det Σ))) exp[-(1/2) (x - µ)^T Σ^{-1} (x - µ)],

where x is a d-dimensional column vector and Σ is a positive definite d × d matrix. (For a positive definite matrix Σ, x^T Σ x > 0 for all x ≠ 0.)

The normal density - properties

The normal density has several properties that give it a special position among probability densities. To a large extent this is due to its analytical tractability, as we shall soon see, but there are also other reasons for favoring normal densities. X ~ N(µ, Σ) stands for "X is a RV having the normal density with parameters µ and Σ". The expected value of X is E[X] = µ. The covariance of X is Var[X] = Σ.
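The boundary x* ≈ 0.91 is easy to verify numerically. The following Python sketch evaluates the log ratio of the two unnormalized posteriors and locates its sign change on a grid.

import numpy as np

P1, P2 = 0.6, 0.4

def log_ratio(x):
    """ln[p(x|w1) P(w1)] - ln[p(x|w2) P(w2)] for the two unit-variance Gaussians."""
    return (-0.5 * x**2 + np.log(P1)) - (-0.5 * (x - 1.0)**2 + np.log(P2))

x_star = 0.5 + np.log(P1) - np.log(P2)     # closed form from the slide
print(x_star)                              # 0.905...
print(abs(log_ratio(x_star)))              # ~0: the two discriminants agree at the boundary

xs = np.linspace(-5, 5, 10001)
crossing = xs[np.where(np.diff(np.sign(log_ratio(xs))))[0]]
print(crossing)                            # numerical boundary, close to 0.905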

The normal density - properties

Let X = [X_1, ..., X_d]^T ~ N(µ, Σ). Then X_i ~ N(µ_i, σ_ii). Let A and B be d × d matrices. Then AX ~ N(Aµ, AΣA^T), and AX and BX are independent if and only if AΣB^T is the zero matrix. The sum of two (jointly) normally distributed RVs is also normally distributed. Central limit theorem: the (suitably normalized) sum of n independent, identically distributed (i.i.d.) RVs tends to a normally distributed RV as n approaches infinity.

DFs for the normal density

The minimum error rate classifier can be represented by the discriminant functions (DFs)

g_i(x) = ln p(x | ω_i) + ln P(ω_i), i = 1, ..., c.

Letting p(x | ω_i) = N(µ_i, Σ_i) we obtain

g_i(x) = -(1/2) (x - µ_i)^T Σ_i^{-1} (x - µ_i) - (d/2) ln 2π - (1/2) ln det(Σ_i) + ln P(ω_i).

DFs for the normal density: Σ_i = σ²I

Σ_i = σ²I: the features are independent and each feature has the same variance σ². Geometrically, this corresponds to the situation in which the samples fall into equally sized (hyper)spherical clusters, and the cluster for the ith class is centered around µ_i. We will now derive linear DFs equivalent to the DFs

g_i(x) = -(1/2) (x - µ_i)^T Σ_i^{-1} (x - µ_i) - (d/2) ln 2π - (1/2) ln det(Σ_i) + ln P(ω_i).

1) All constant terms, like -(d/2) ln 2π, can be dropped; dropping them does not affect the classification result. 2) In this particular case the determinants of the covariance matrices also all have the same value (σ^{2d}), so that term can be dropped too. 3) Σ^{-1} = (1/σ²) I, and hence

g_i(x) = -||x - µ_i||² / (2σ²) + ln P(ω_i).

We (sloppily) use the = sign to indicate that discriminant functions are equivalent.
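The general Gaussian discriminant function above translates directly into code. The Python sketch below uses hypothetical class parameters and NumPy for the linear algebra.

import numpy as np

def gaussian_discriminant(x, mu, Sigma, prior):
    """g_i(x) = -0.5 (x-mu)^T Sigma^{-1} (x-mu) - (d/2) ln 2*pi - 0.5 ln det(Sigma) + ln P(w_i)."""
    d = len(mu)
    diff = x - mu
    return (-0.5 * diff @ np.linalg.solve(Sigma, diff)
            - 0.5 * d * np.log(2 * np.pi)
            - 0.5 * np.log(np.linalg.det(Sigma))
            + np.log(prior))

# Hypothetical two-class, two-dimensional problem: (mu_i, Sigma_i, P(w_i)).
params = [
    (np.array([0.0, 0.0]), np.array([[1.0, 0.3], [0.3, 1.0]]), 0.6),
    (np.array([2.0, 1.0]), np.array([[0.5, 0.0], [0.0, 2.0]]), 0.4),
]

x = np.array([1.0, 0.5])
scores = [gaussian_discriminant(x, mu, S, P) for mu, S, P in params]
print(int(np.argmax(scores)), scores)   # index of the winning class and the discriminant values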

DFs for the normal density: Σ_i = σ²I

g_i(x) = -||x - µ_i||² / (2σ²) + ln P(ω_i)
       = -(1/(2σ²)) (x - µ_i)^T (x - µ_i) + ln P(ω_i)
       = -(1/(2σ²)) (x^T x - 2µ_i^T x + µ_i^T µ_i) + ln P(ω_i).

From the last expression we see that the quadratic term x^T x is the same for all categories and can be dropped. Hence we obtain equivalent linear discriminant functions:

g_i^linear(x) = (1/σ²) (µ_i^T x - (1/2) µ_i^T µ_i) + ln P(ω_i).

Minimum distance classifier

If we assume that all P(ω_i) are equal and Σ_i = σ²I, we obtain the minimum distance classifier. Note that equal priors means P(ω_i) = 1/c. The name of the classifier follows from the set of discriminant functions used: g_i(x) = -||x - µ_i||. Hence, a feature vector is assigned to the category with the nearest mean. Note that a minimum distance classifier can also be implemented as a linear classifier.

DFs for the normal density: Σ_i = Σ

Consider a slightly more complicated model: all covariance matrices still have the same value, but there exists some correlation between the features. Also in this case we have a linear classifier:

g_i(x) = w_i^T x + w_i0,

where

w_i = Σ^{-1} µ_i and w_i0 = -(1/2) µ_i^T Σ^{-1} µ_i + ln P(ω_i).

DFs for the normal density: Σ_i arbitrary

It is time to consider the most general Gaussian model, where the features from each category are assumed to be normally distributed, but nothing more is assumed. In this case the discriminant functions

g_i(x) = -(1/2) (x - µ_i)^T Σ_i^{-1} (x - µ_i) - (d/2) ln 2π - (1/2) ln det(Σ_i) + ln P(ω_i)

cannot be simplified much; only the constant term can be dropped. The discriminant functions are now necessarily quadratic, which means that the decision regions may have more complicated shapes than in the linear case.
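For the Σ_i = σ²I case, the linear discriminants and the minimum distance classifier take only a few lines of Python. The class means, priors, variance and test point below are made-up; the example also shows how unequal priors can tip the Bayes decision away from the nearest mean.

import numpy as np

means = np.array([[0.0, 0.0],
                  [3.0, 0.0],
                  [0.0, 3.0]])            # hypothetical class means mu_i
priors = np.array([0.5, 0.3, 0.2])
sigma2 = 1.0                              # common variance sigma^2

def linear_scores(x):
    """g_i(x) = (1/sigma^2) (mu_i^T x - 0.5 mu_i^T mu_i) + ln P(w_i)."""
    return (means @ x - 0.5 * np.sum(means**2, axis=1)) / sigma2 + np.log(priors)

def min_distance(x):
    """Equal priors: assign x to the class with the nearest mean."""
    return int(np.argmin(np.linalg.norm(means - x, axis=1)))

x = np.array([1.6, 0.2])
print(int(np.argmax(linear_scores(x))))   # -> 0: the larger prior P(w_1) tips the Bayes decision
print(min_distance(x))                    # -> 1: class 1 has the nearest mean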

BDT - Discrete features

In many practical applications the feature space is discrete, and the components of the feature vectors are then binary or higher-integer valued. This simply means that the integrals of the continuous case must be replaced with sums, and probability densities must be replaced with probabilities. For example, the minimum error rate classification rule is: decide ω_i if

P(ω_i | x) = P(x | ω_i) P(ω_i) / P(x) > P(ω_j | x) for all j ≠ i.

Independent binary features

Consider the two-category problem where the feature vectors x = [x_1, ..., x_d]^T are binary, i.e. each x_i is either 0 or 1. Assume further that the features are (conditionally) independent, that is,

P(x | ω_j) = Π_{i=1}^{d} P(x_i | ω_j).

Denote p_i = P(x_i = 1 | ω_1) and q_i = P(x_i = 1 | ω_2).

Independent binary features

Then

P(x | ω_1) = Π_{i=1}^{d} p_i^{x_i} (1 - p_i)^{1 - x_i},
P(x | ω_2) = Π_{i=1}^{d} q_i^{x_i} (1 - q_i)^{1 - x_i}.

Use the discriminant functions g_1(x) = ln P(ω_1 | x) and g_2(x) = ln P(ω_2 | x). Recall that g(x) = g_1(x) - g_2(x) and that the classifier studies the sign of g(x). By the Bayes rule this leads to

g(x) = ln [P(x | ω_1) / P(x | ω_2)] + ln [P(ω_1) / P(ω_2)]
     = Σ_{i=1}^{d} [x_i ln (p_i / q_i) + (1 - x_i) ln ((1 - p_i) / (1 - q_i))] + ln [P(ω_1) / P(ω_2)]
     = Σ_{i=1}^{d} x_i ln [p_i (1 - q_i) / (q_i (1 - p_i))] + Σ_{i=1}^{d} ln [(1 - p_i) / (1 - q_i)] + ln [P(ω_1) / P(ω_2)].

Note that we have a linear machine: g(x) = Σ_{i=1}^{d} w_i x_i + w_0. The magnitude of w_i determines the importance of a "yes" (1) answer for x_i. If p_i = q_i, the value of x_i gives no information about the class. The prior probabilities appear only in the bias term w_0: increasing P(ω_1) biases the decision in favor of ω_1.
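The linear machine for independent binary features translates directly into code; the values of p_i, q_i and the priors below are hypothetical.

import numpy as np

p = np.array([0.8, 0.6, 0.3])        # p_i = P(x_i = 1 | w_1), hypothetical
q = np.array([0.4, 0.5, 0.7])        # q_i = P(x_i = 1 | w_2), hypothetical
P1, P2 = 0.5, 0.5

# Weights and bias of the linear machine g(x) = sum_i w_i x_i + w_0.
w = np.log(p * (1 - q) / (q * (1 - p)))
w0 = np.sum(np.log((1 - p) / (1 - q))) + np.log(P1 / P2)

def g(x):
    return w @ x + w0

x = np.array([1, 0, 1])
print(g(x), "-> decide w1" if g(x) > 0 else "-> decide w2")

# Sanity check against the likelihoods themselves:
def loglik(x, r):
    return np.sum(x * np.log(r) + (1 - x) * np.log(1 - r))

assert np.isclose(g(x), loglik(x, p) - loglik(x, q) + np.log(P1 / P2))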

Receiver operating characteristic

For signal detection theory and the receiver operating characteristic (ROC), see www.cs.pitt.edu/~milos/courses/cs2750/lectures/class9.pdf.

BDT - context

So far we have assumed that our interest is in classifying a single object at a time. However, in applications we may need to classify several objects at the same time; image segmentation is an example. If we assume that the class of one object is independent of the classes of the remaining ones, nothing changes. If there is some dependence, the basic principles remain the same, i.e. we assign objects to the most probable categories. But now we also need to take the categories of the other objects into account and place all objects in such categories that the probability of the whole ensemble is maximized. Computational difficulties follow.