Introduction to Pattern Recognition. Lectures 4 and 5: Bayesian decision theory


Jussi Tohka (jussi.tohka@tut.fi), Institute of Signal Processing, Tampere University of Technology

Bayesian decision theory is a fundamental statistical approach to the problem of pattern classification. The approach is based on quantifying the tradeoffs between the various classification decisions and their costs. The problem must be posed in probabilistic terms, and all of the associated probabilities must be completely known. The theory is largely a formalization of common-sense procedures; however, it provides a solid foundation on which various pattern classification methods can build.

The fish example: Terminology

We want to separate two kinds of fish: sea bass and salmon. No other kinds of fish are possible. The true class ω is a random variable (RV). Two classes are possible: ω = ω_1 for sea bass and ω = ω_2 for salmon. If the sea contains more sea bass than salmon, it is natural to assume (even when no data or features are available) that a caught fish is a sea bass. This is modeled with the prior probabilities P(ω_1) and P(ω_2), which are positive and sum to one. If there are more sea bass, then P(ω_1) > P(ω_2).

The fish example continued

If we had to decide at this point (for some curious reason) which fish we have, how would we decide? Because there are more sea bass, we would say that the fish is a sea bass. In other words, our decision rule becomes: decide ω_1 if P(ω_1) > P(ω_2), and otherwise decide ω_2. To develop better rules, we must extract some information, or features, from the data. This means, for instance, making lightness measurements on the fish.

The fish example continued

Suppose we have a lightness reading, say x, from the fish. What to do next? We know every probability relevant to the classification problem. In particular, we know P(ω_1), P(ω_2) and the class-conditional probability densities p(x | ω_1) and p(x | ω_2). Based on these we can compute the probability that the class is ω_1 given that the lightness reading is x, and similarly for salmon, by using Bayes' formula.

The decision rule: decide ω_1 if P(ω_1 | x) > P(ω_2 | x), and otherwise decide ω_2. Remember that we do not know P(ω_j | x) directly. The posteriors must be computed through Bayes' rule:

    P(ω_j | x) = p(x | ω_j) P(ω_j) / p(x),

where the evidence is p(x) = Σ_{j=1}^{2} p(x | ω_j) P(ω_j).

The justification for the rule: P(error | x) = P(ω_1 | x) if we decide ω_2, and P(error | x) = P(ω_2 | x) if we decide ω_1. The average error is

    P(error) = ∫ P(error, x) dx = ∫ P(error | x) p(x) dx.

Thus, if P(error | x) is minimal for every x, the average error is also minimized. The decision rule "decide ω_1 if P(ω_1 | x) > P(ω_2 | x) and otherwise decide ω_2" guarantees just that.

An equivalent decision rule is obtained by multiplying P(ω_j | x) in the previous rule by p(x). Because p(x) is a constant with respect to the class, this obviously does not affect the decision itself. We obtain the rule: decide ω_1 if p(x | ω_1) P(ω_1) > p(x | ω_2) P(ω_2), and otherwise decide ω_2.
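To make the mechanics concrete, here is a minimal Python sketch of this posterior computation and decision rule. The Gaussian lightness models and the prior values are assumptions made purely for illustration; the lecture does not specify them.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical class-conditional lightness models p(x | omega_1), p(x | omega_2)
# and priors, assumed here only for illustration.
priors = np.array([2/3, 1/3])                 # P(omega_1), P(omega_2)
ccdfs = [norm(loc=4.0, scale=1.0),            # p(x | omega_1), sea bass
         norm(loc=6.0, scale=1.0)]            # p(x | omega_2), salmon

def posterior(x):
    """Return [P(omega_1 | x), P(omega_2 | x)] via Bayes' rule."""
    likelihoods = np.array([d.pdf(x) for d in ccdfs])
    joint = likelihoods * priors              # p(x | omega_j) P(omega_j)
    return joint / joint.sum()                # divide by the evidence p(x)

x = 5.2                                       # a lightness reading
post = posterior(x)
decision = 1 if post[0] > post[1] else 2      # decide omega_1 if P(omega_1|x) > P(omega_2|x)
print(post, "-> decide omega_%d" % decision)
```

Changing the priors shifts the decision toward the more common class, exactly as the prior-only rule above suggests.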

BDT - features

The purpose is to decide on an action based on the sensed object with the measured feature vector x. Each object to be classified has a corresponding feature vector, and we identify the feature vector x with the object to be classified. The set of all possible feature vectors is called the feature space, which we denote by F. Feature spaces correspond to sample spaces. Examples of feature spaces are R^d, {0, 1}^d, and R × {0, 1}^d. For the moment, we assume that the feature space is R^d.

BDT - classes

We denote by ω the (unknown) class or category of the object x. We use the symbols ω_1, ..., ω_c for the c categories, or classes, to which x can belong. At this stage, c is fixed. Each category ω_i has a prior probability P(ω_i). The prior probability tells us how likely a particular class is before making any observations. In addition, we know the probability density functions (pdfs) of feature vectors drawn from a certain class. These are called class-conditional density functions (ccdfs) and are denoted by p(x | ω_i) for i = 1, ..., c. The ccdfs tell us how probable the feature vector x is, provided that the class of the object is ω_i.

BDT - actions

We write α_1, ..., α_a for the a possible actions and α(x) for the action taken after observing x. Thought of as a function from the feature space to {α_1, ..., α_a}, α(x) is called a decision rule. In fact, any function from the feature space to {α_1, ..., α_a} is a decision rule. The number of actions a need not be equal to c, the number of classes. But if a = c and the actions α_i read "assign x to the class i", we often forget about the actions and simply talk about assigning x to a certain class.

BDT - loss function

We will develop the optimal decision rule based on statistics, but before that we need to tie actions and classes together. This is done with a loss function. The loss function, denoted by λ, tells how costly each action is, and it is used to convert a probability determination into a decision of an action. λ is a function from action/class pairs to the set of nonnegative real numbers. λ(α_i | ω_j) describes the loss incurred for taking action α_i if the true class is ω_j: λ(α | ω_j) is low for good actions given the class ω_j and high for bad actions given the class ω_j.

BDT - Bayes decision rule

If the true class is ω_j, by definition we incur the loss λ(α_i | ω_j) when taking the action α_i after observing x. The expected loss, or conditional risk, of taking action α_i after observing x is

    R(α_i | x) = Σ_{j=1}^{c} λ(α_i | ω_j) P(ω_j | x).

The total expected loss for the decision rule α, termed the overall risk, is

    R_total(α) = ∫ R(α(x) | x) p(x) dx.

Bayes decision rule

We would like to derive a decision rule α(x) that minimizes the overall risk. This decision rule is: select the action α_i that gives the minimum conditional risk R(α_i | x), i.e.

    α(x) = arg min_{α_i} R(α_i | x).

This is called the Bayes decision rule. The classifier built upon this rule is called the Bayes (minimum risk) classifier. The overall risk R_total for the Bayes decision rule is called the Bayes risk; it is the smallest overall risk that is possible.

Bayes decision rule

We now prove that the Bayes decision rule indeed minimizes the overall risk.

THEOREM. Let α : F → {α_1, ..., α_a} be an arbitrary decision rule and α_bayes : F → {α_1, ..., α_a} the Bayes decision rule. Then R_total(α_bayes) ≤ R_total(α).

PROOF. By the definition of the Bayes decision rule, R(α_bayes(x) | x) ≤ R(α(x) | x) for every x. Hence, because p(x) ≥ 0, R(α_bayes(x) | x) p(x) ≤ R(α(x) | x) p(x). Therefore

    R_total(α_bayes) = ∫ R(α_bayes(x) | x) p(x) dx ≤ ∫ R(α(x) | x) p(x) dx = R_total(α).
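As a quick illustration, the sketch below selects the minimum-risk action for a single feature vector. The loss matrix and the posterior values are made-up illustrative numbers, and the posteriors are assumed to have been computed already via Bayes' rule.

```python
import numpy as np

# A minimal sketch of the Bayes minimum-risk rule.
# loss[i, j] = lambda(alpha_i | omega_j): cost of taking action alpha_i
# when the true class is omega_j. These numbers are illustrative only.
loss = np.array([[0.0, 2.0, 5.0],
                 [4.0, 0.0, 1.0]])          # 2 actions, 3 classes

def bayes_action(posteriors):
    """Return the index of the action minimizing R(alpha_i | x), plus all risks."""
    risks = loss @ posteriors               # R(alpha_i | x) = sum_j loss[i, j] P(omega_j | x)
    return int(np.argmin(risks)), risks

posteriors = np.array([0.5, 0.3, 0.2])      # P(omega_j | x), assumed given
action, risks = bayes_action(posteriors)
print(risks, "-> take action alpha_%d" % (action + 1))
```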

Two category classification

The possible classes are ω_1 and ω_2. The action α_1 corresponds to deciding that the true class is ω_1, and α_2 corresponds to deciding that the true class is ω_2. Write λ(α_i | ω_j) = λ_ij. Then

    R(α_1 | x) = λ_11 P(ω_1 | x) + λ_12 P(ω_2 | x),
    R(α_2 | x) = λ_21 P(ω_1 | x) + λ_22 P(ω_2 | x).

The Bayes decision rule: decide that the true class is ω_1 if R(α_1 | x) < R(α_2 | x), and ω_2 otherwise. Equivalently, we decide (that the true class is) ω_1 if

    (λ_21 - λ_11) P(ω_1 | x) > (λ_12 - λ_22) P(ω_2 | x).

Ordinarily λ_21 - λ_11 and λ_12 - λ_22 are positive; that is, the loss is greater when making a mistake. Assume λ_21 > λ_11. Then the Bayes decision rule can be written as: decide ω_1 if

    p(x | ω_1) / p(x | ω_2) > [(λ_12 - λ_22) / (λ_21 - λ_11)] · [P(ω_2) / P(ω_1)].

Example: SPAM filtering

We have two actions: α_1 stands for "keep the mail" and α_2 stands for "delete as SPAM". There are two classes: ω_1 (normal mail) and ω_2 (SPAM, i.e. junk mail). Let P(ω_1) = 0.4, P(ω_2) = 0.6 and λ_11 = 0, λ_21 = 3, λ_12 = 1, λ_22 = 0. That is, deleting important mail as SPAM is more costly than keeping a SPAM mail. We receive a message with the feature vector x, and p(x | ω_1) = 0.35, p(x | ω_2) = 0.65. How does the Bayes minimum risk classifier act?

    P(ω_1 | x) = (0.35 · 0.4) / (0.35 · 0.4 + 0.65 · 0.6) ≈ 0.264,   P(ω_2 | x) ≈ 0.736.
    R(α_1 | x) = 0 · 0.264 + 1 · 0.736 = 0.736,   R(α_2 | x) = 3 · 0.264 + 0 · 0.736 ≈ 0.792.

Since R(α_1 | x) < R(α_2 | x): don't delete the mail!

Zero-one loss

An important loss function is the zero-one loss,

    λ(α_i | ω_j) = 0 if i = j, and 1 if i ≠ j.

In matrix form, the loss matrix has zeros on the diagonal and ones everywhere else.
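The following sketch reproduces the SPAM example numerically from the stated priors, class-conditional values, and losses.

```python
import numpy as np

# Reproducing the SPAM-filtering example.
priors = np.array([0.4, 0.6])            # P(omega_1) normal mail, P(omega_2) SPAM
likelihoods = np.array([0.35, 0.65])     # p(x | omega_1), p(x | omega_2)
loss = np.array([[0.0, 1.0],             # lambda_11, lambda_12 (alpha_1 = keep)
                 [3.0, 0.0]])            # lambda_21, lambda_22 (alpha_2 = delete)

joint = likelihoods * priors
posteriors = joint / joint.sum()         # approx [0.264, 0.736]
risks = loss @ posteriors                # approx [0.736, 0.792]
print(posteriors.round(3), risks.round(3))
print("keep the mail" if risks[0] < risks[1] else "delete as SPAM")
```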

Minimum error rate classification

Assume that our loss function is the zero-one loss and the actions α_i read "decide that the true class is ω_i". We can then identify the action α_i with the class ω_i. The Bayes decision rule applied to this case leads to the minimum error rate classification rule and the Bayes (minimum error) classifier. The minimum error rate classification rule is: decide ω_i if P(ω_i | x) > P(ω_j | x) for all j ≠ i.

Recap: The Bayes Classifier

Given a feature vector x, compute the conditional risk R(α_i | x) for taking action α_i for all i = 1, ..., a and select the action that gives the smallest conditional risk. Classification with zero-one loss: compute the probability P(ω_i | x) for all categories ω_1, ..., ω_c and select the category that gives the largest probability. Remember to use Bayes' rule in computing the probabilities.

Discriminant functions

Note: in what follows we will assume that a = c and use ω_i and α_i interchangeably. There are many ways to represent pattern classifiers. One of the most useful is in terms of a set of discriminant functions g_1(x), ..., g_c(x) for a c-category classifier. The classifier assigns a feature vector x to class ω_i if

    g_i(x) > g_j(x) for all j ≠ i.

For the Bayes classifier, g_i(x) = -R(α_i | x).

(Figure 2.5 from Duda, Hart, Stork: Pattern Classification, Wiley, 2001.)
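For zero-one loss the minimum-risk rule reduces to choosing the largest posterior, since R(α_i | x) = 1 - P(ω_i | x). The short check below uses illustrative posterior values.

```python
import numpy as np

# With zero-one loss, minimizing the conditional risk equals maximizing the posterior.
c = 3
loss = 1.0 - np.eye(c)                    # zero-one loss matrix: 0 on diagonal, 1 elsewhere
posteriors = np.array([0.2, 0.5, 0.3])    # illustrative P(omega_j | x)

risks = loss @ posteriors                 # equals 1 - posteriors
print(risks, 1.0 - posteriors)
assert np.argmin(risks) == np.argmax(posteriors)
```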

Equivalent discriminant functions

The choice of discriminant functions is not unique. Many distinct sets of discriminant functions lead to the same classifier, that is, to the same decision rule. We say that two sets of discriminant functions are equivalent if they lead to the same classifier, or, to put it more formally, if their corresponding decision rules give equal decisions for all x. The following holds: let f be a monotonically increasing function (f(x) < f(y) whenever x < y) and let g_i(x), i = 1, ..., c be the discriminant functions representing a classifier. Then the discriminant functions f(g_i(x)), i = 1, ..., c represent essentially the same classifier as g_i(x), i = 1, ..., c.

Example: equivalent sets of discriminant functions for the minimum error rate (Bayes) classifier are

    g_i(x) = P(ω_i | x) = p(x | ω_i) P(ω_i) / Σ_{j=1}^{c} p(x | ω_j) P(ω_j),
    g_i(x) = p(x | ω_i) P(ω_i),
    g_i(x) = ln p(x | ω_i) + ln P(ω_i).

Linear discriminant functions

A discriminant function is linear if it can be written in the form g_i(x) = w_i^T x + w_i0, where w_i = [w_i1, ..., w_id]^T. The term w_i0 is called the threshold or bias for the ith category. Note that a linear discriminant function is linear with respect to w_i and w_i0, but actually affine with respect to x. Don't let this bother you too much! A classifier is linear if it can be represented using entirely linear discriminant functions. Linear classifiers have some important properties which we will study later during this course.

Discriminant functions: Two categories

In the two-category case, we may combine the two discriminant functions into a single discriminant function. The decision rule "decide ω_1 if g_1(x) > g_2(x) and otherwise decide ω_2" becomes, with g(x) = g_1(x) - g_2(x), the equivalent decision rule: decide ω_1 if g(x) > 0.
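The equivalence of these discriminant-function sets is easy to check numerically. The sketch below compares the posterior form with the log form on an assumed pair of one-dimensional Gaussian class models (chosen for illustration only).

```python
import numpy as np
from scipy.stats import norm

# Check that two equivalent sets of discriminant functions give identical decisions.
# The class models and priors here are assumptions for illustration.
priors = np.array([0.6, 0.4])
ccdfs = [norm(0.0, 1.0), norm(1.0, 1.0)]    # p(x | omega_1), p(x | omega_2)

def decide_posterior(x):                     # g_i(x) = P(omega_i | x)
    g = np.array([d.pdf(x) for d in ccdfs]) * priors
    return int(np.argmax(g / g.sum()))

def decide_log(x):                           # g_i(x) = ln p(x | omega_i) + ln P(omega_i)
    g = np.array([d.logpdf(x) for d in ccdfs]) + np.log(priors)
    return int(np.argmax(g))

xs = np.linspace(-3.0, 4.0, 29)
assert all(decide_posterior(x) == decide_log(x) for x in xs)
print("both discriminant sets give the same decisions")
```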

Decision regions

The effect of any decision rule is to divide the feature space (in this case R^d) into c disjoint decision regions R_1, ..., R_c. Decision rules can be written with the help of decision regions: if x ∈ R_i, decide ω_i. (Therefore decision regions form a representation for a classifier.) Decision regions can be derived from discriminant functions:

    R_i = {x : g_i(x) > g_j(x) for all j ≠ i}.

Note that decision regions are properties of the classifier, and they are not affected if the discriminant functions are changed to equivalent ones. Boundaries of decision regions, i.e. places where two or more discriminant functions yield the same value, are called decision boundaries.

(Figure 2.6 from Duda, Hart, Stork: Pattern Classification, Wiley.)

Decision regions example

Consider a two-category classification problem with P(ω_1) = 0.6, P(ω_2) = 0.4 and

    p(x | ω_1) = (1/√(2π)) exp[-0.5 x²],   p(x | ω_2) = (1/√(2π)) exp[-0.5 (x - 1)²].

Find the decision regions and boundaries for the minimum error rate Bayes classifier. The decision region R_1 is the set of points where P(ω_1 | x) > P(ω_2 | x). The decision region R_2 is the set of points where P(ω_2 | x) > P(ω_1 | x). The decision boundary is the set of points where P(ω_1 | x) = P(ω_2 | x).

(Figure: the class-conditional densities for class 1 and class 2.)
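Before working the boundary out analytically (next slides), it can be located numerically by finding the root of g(x) = g_1(x) - g_2(x) for this example; the value should match the analytic result below.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm

# Numerical check of the decision boundary x* for the example above.
P1, P2 = 0.6, 0.4
g = lambda x: (norm(0, 1).logpdf(x) + np.log(P1)) - (norm(1, 1).logpdf(x) + np.log(P2))

x_star = brentq(g, -5.0, 5.0)                 # root of g_1(x) - g_2(x) = 0
print(x_star, 0.5 + np.log(P1 / P2))          # both are approx 0.905
```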

Decision regions example

Let us begin with the decision boundary:

    P(ω_1 | x) = P(ω_2 | x)  ⟺  p(x | ω_1) P(ω_1) = p(x | ω_2) P(ω_2),

where we used Bayes' formula and multiplied by p(x). Taking logarithms,

    ln[p(x | ω_1) P(ω_1)] = ln[p(x | ω_2) P(ω_2)]
    ⟺  -x²/2 + ln 0.6 = -(x - 1)²/2 + ln 0.4
    ⟺  -x² + 2 ln 0.6 = -x² + 2x - 1 + 2 ln 0.4.

The decision boundary is therefore x* = 1/2 + ln 0.6 - ln 0.4 ≈ 0.91, and

    R_1 = {x : x < x*},   R_2 = {x : x > x*}.

(Figure: the posteriors P(ω_1 | x) and P(ω_2 | x) and the resulting decision regions for class 1 and class 2.)

The normal density

Univariate case:

    p(x) = (1/(√(2π) σ)) exp[-(1/2) ((x - µ)/σ)²],

where the parameter σ > 0. Multivariate case:

    p(x) = (1/((2π)^{d/2} √(det Σ))) exp[-(1/2) (x - µ)^T Σ^{-1} (x - µ)],

where x is a d-dimensional column vector and Σ is a positive definite matrix. (For a positive definite matrix Σ, x^T Σ x > 0 for all x ≠ 0.)

The normal density - properties

The normal density has several properties which give it a special position among probability densities. To a large extent this is due to analytical tractability, as we shall soon see, but there are also other reasons for favoring normal densities. X ∼ N(µ, Σ) stands for "X is a RV having the normal density with parameters µ, Σ". The expected value of X is E[X] = µ. The (co)variance of X is Var[X] = Σ.

The normal density - properties

Let X = [X_1, ..., X_d]^T ∼ N(µ, Σ). Then X_i ∼ N(µ_i, σ_ii). Let A and B be d × d matrices. Then AX ∼ N(Aµ, AΣA^T), and AX and BX are independent if and only if AΣB^T is the zero matrix. The sum of two (jointly) normally distributed RVs is also normally distributed. Central Limit Theorem: the (suitably normalized) sum of n independent, identically distributed (i.i.d.) RVs tends to a normally distributed RV as n approaches infinity.

DFs for the normal density

The minimum error rate classifier can be represented by the discriminant functions (DFs)

    g_i(x) = ln p(x | ω_i) + ln P(ω_i),   i = 1, ..., c.

Letting p(x | ω_i) = N(µ_i, Σ_i), we obtain

    g_i(x) = -(1/2) (x - µ_i)^T Σ_i^{-1} (x - µ_i) - (d/2) ln 2π - (1/2) ln det(Σ_i) + ln P(ω_i).

DFs for the normal density: Σ_i = σ²I

Σ_i = σ²I: the features are independent and each feature has the same variance σ². Geometrically, this corresponds to the situation in which the samples fall in equally-sized (hyper)spherical clusters, and the cluster for the ith class is centered around µ_i. We will now derive linear DFs equivalent to the DFs above. 1) All constant terms, such as -(d/2) ln 2π, can be dropped; dropping them does not affect the classification result. 2) In this particular case the determinants of the covariance matrices all have the same value (σ^{2d}), so the determinant term can be dropped as well. 3) Σ^{-1} = (1/σ²) I, and hence

    g_i(x) = -||x - µ_i||² / (2σ²) + ln P(ω_i).

We (sloppily) use the = sign to indicate that the discriminant functions are equivalent.

DFs for the normal density: Σ_i = σ²I

    g_i(x) = -||x - µ_i||² / (2σ²) + ln P(ω_i)
           = -(1/(2σ²)) (x - µ_i)^T (x - µ_i) + ln P(ω_i)
           = -(1/(2σ²)) (x^T x - 2 µ_i^T x + µ_i^T µ_i) + ln P(ω_i).

From the last expression we see that the quadratic term x^T x is the same for all categories and can be dropped. Hence, we obtain the equivalent linear discriminant functions

    g_i^linear(x) = (1/σ²) (µ_i^T x - (1/2) µ_i^T µ_i) + ln P(ω_i).

Minimum distance classifier

If we assume that all P(ω_i) are equal and Σ_i = σ²I, we obtain the minimum distance classifier. Note that equal priors means P(ω_i) = 1/c. The name of the classifier follows from the set of discriminant functions used: g_i(x) = -||x - µ_i||. Hence, a feature vector is assigned to the category with the nearest mean. Note that a minimum distance classifier can also be implemented as a linear classifier.

DFs for the normal density: Σ_i = Σ

Consider a slightly more complicated model. All the covariance matrices still have the same value, but there may exist some correlation between the features. Also in this case we have a linear classifier:

    g_i(x) = w_i^T x + w_i0,

where w_i = Σ^{-1} µ_i and w_i0 = -(1/2) µ_i^T Σ^{-1} µ_i + ln P(ω_i).

DFs for the normal density: Σ_i arbitrary

It is time to consider the most general Gaussian model, where the features from each category are assumed to be normally distributed, but nothing more is assumed. In this case the discriminant functions

    g_i(x) = -(1/2) (x - µ_i)^T Σ_i^{-1} (x - µ_i) - (d/2) ln 2π - (1/2) ln det(Σ_i) + ln P(ω_i)

cannot be simplified much; only the constant term can be dropped. The discriminant functions are now necessarily quadratic, which means that the decision regions may have more complicated shapes than in the linear case.
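A small sketch of the shared-covariance (Σ_i = Σ) linear discriminant functions; the means, covariance, and priors are illustrative values, not from the lecture.

```python
import numpy as np

# Linear discriminant functions for Gaussian classes with a shared covariance matrix.
means = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]     # mu_1, mu_2 (illustrative)
Sigma = np.array([[1.0, 0.3],
                  [0.3, 1.0]])
priors = np.array([0.6, 0.4])

Sigma_inv = np.linalg.inv(Sigma)
W = [Sigma_inv @ m for m in means]                       # w_i = Sigma^{-1} mu_i
w0 = [-0.5 * m @ Sigma_inv @ m + np.log(p)               # w_i0 = -1/2 mu_i^T Sigma^{-1} mu_i + ln P(omega_i)
      for m, p in zip(means, priors)]

def classify(x):
    g = [w @ x + b for w, b in zip(W, w0)]               # g_i(x) = w_i^T x + w_i0
    return int(np.argmax(g)) + 1                         # 1-based class index

print(classify(np.array([1.5, 0.5])))
```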

BDT - Discrete features

In many practical applications the feature space is discrete; the components of the feature vectors are then binary or higher-integer valued. This simply means that the integrals of the continuous case must be replaced with sums, and probability densities must be replaced with probabilities. For example, the minimum error rate classification rule becomes: decide ω_i if

    P(ω_i | x) = P(x | ω_i) P(ω_i) / P(x) > P(ω_j | x) for all j ≠ i.

Independent binary features

Consider the two-category problem, where the feature vectors x = [x_1, ..., x_d]^T are binary, i.e. each x_i is either 0 or 1. Assume further that the features are (conditionally) independent, that is,

    P(x | ω_j) = Π_{i=1}^{d} P(x_i | ω_j).

Denote p_i = P(x_i = 1 | ω_1) and q_i = P(x_i = 1 | ω_2). Then

    P(x | ω_1) = Π_{i=1}^{d} p_i^{x_i} (1 - p_i)^{1-x_i},   P(x | ω_2) = Π_{i=1}^{d} q_i^{x_i} (1 - q_i)^{1-x_i}.

Use the discriminant functions g_1(x) = ln P(ω_1 | x) and g_2(x) = ln P(ω_2 | x). Recall that g(x) = g_1(x) - g_2(x), and that the classifier then studies the sign of g(x). By Bayes' rule, this leads to

    g(x) = ln [P(x | ω_1) / P(x | ω_2)] + ln [P(ω_1) / P(ω_2)]
         = Σ_{i=1}^{d} [ x_i ln(p_i / q_i) + (1 - x_i) ln((1 - p_i) / (1 - q_i)) ] + ln [P(ω_1) / P(ω_2)]
         = Σ_{i=1}^{d} x_i ln [p_i (1 - q_i) / (q_i (1 - p_i))] + Σ_{i=1}^{d} ln [(1 - p_i) / (1 - q_i)] + ln [P(ω_1) / P(ω_2)].

Note that we have a linear machine: g(x) = Σ_{i=1}^{d} w_i x_i + w_0. The magnitude of w_i determines the importance of a "yes" (1) answer for x_i. If p_i = q_i, the value of x_i gives no information about the class. The prior probabilities appear only in the bias term w_0: increasing P(ω_1) biases the decision in favor of ω_1.
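A sketch of the resulting linear machine for independent binary features; the probabilities p_i, q_i and the priors are illustrative.

```python
import numpy as np

# Two-category classifier for d independent binary features.
# p[i] = P(x_i = 1 | omega_1), q[i] = P(x_i = 1 | omega_2); illustrative values.
p = np.array([0.8, 0.6, 0.3])
q = np.array([0.4, 0.5, 0.7])
P1, P2 = 0.5, 0.5

w = np.log(p * (1 - q) / (q * (1 - p)))                    # weights w_i
w0 = np.sum(np.log((1 - p) / (1 - q))) + np.log(P1 / P2)   # bias term w_0

def decide(x):
    g = w @ x + w0                                         # g(x) = sum_i w_i x_i + w_0
    return 1 if g > 0 else 2                               # decide omega_1 if g(x) > 0

print(decide(np.array([1, 1, 0])), decide(np.array([0, 0, 1])))
```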

Receiver operating characteristic

For signal detection theory and the receiver operating characteristic (ROC), see milos/courses/cs2750/lectures/class9.pdf.

BDT - context

So far we have assumed that our interest is in classifying a single object at a time. However, in applications we may need to classify several objects at the same time; image segmentation is one example. If we assume that the class of one object is independent of the classes of the remaining objects, nothing changes. If there is some dependence, then the basic principles remain the same, i.e. we assign objects to the most probable categories. But now we also need to take the categories of the other objects into account and place all objects in such categories that the probability of the whole ensemble is maximized. Computational difficulties follow.
