Bayes Rule for Minimizing Risk


Dennis Lee
April 1, 2014

Introduction

In class we discussed Bayes rule for minimizing the probability of error. Our goal is to generalize this rule to minimize risk instead of probability of error. For simplicity we deal with the two-class case. We then provide examples for 1D and 2D features and derive Bayes rule for minimizing risk in these cases.

We briefly motivate this topic with the following scenario. Suppose that a doctor has to classify a tumor as cancerous or benign. Misclassifying a cancerous tumor as benign is costly, since we would like the patient to be treated quickly if cancer is present. Therefore, we assign a high cost to this error. This article shows how to incorporate such costs into Bayes rule.

Bayes rule for minimizing risk

Let $x \in \mathbb{R}^n$ be a feature vector. Let $q_i(x)$ be the posterior probability of class $i$ (denoted $\omega_i$) given $x$, and let $P_i$ be the prior probability of $\omega_i$. Let $p_i(x)$ be the class-conditional density for class $i$. Denote by $c_{ij}$ the cost of deciding $x \in \omega_i$ when $\omega_j$ is the true class. The conditional cost of assigning $x \in \omega_i$ given $x$ is

$$r_i(x) = c_{i1} q_1(x) + c_{i2} q_2(x) \tag{1}$$

where $i = 1, 2$. We assign $x$ so that the cost is minimum:

$$r_1(x) \underset{\omega_2}{\overset{\omega_1}{\lessgtr}} r_2(x) \tag{2}$$

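which says to decide $\omega_1$ if $r_1(x) < r_2(x)$ and $\omega_2$ otherwise. If we make the decision this way, the total cost becomes

$$\begin{aligned}
E[r(x)] &= \int \min[r_1(x),\, r_2(x)]\, p(x)\, dx \\
&= \int \min[c_{11} q_1(x) + c_{12} q_2(x),\; c_{21} q_1(x) + c_{22} q_2(x)]\, p(x)\, dx \\
&= \int \min[c_{11} P_1 p_1(x) + c_{12} P_2 p_2(x),\; c_{21} P_1 p_1(x) + c_{22} P_2 p_2(x)]\, dx \\
&= \int_{R_1} [c_{11} P_1 p_1(x) + c_{12} P_2 p_2(x)]\, dx + \int_{R_2} [c_{21} P_1 p_1(x) + c_{22} P_2 p_2(x)]\, dx \\
&= (c_{21} P_1 + c_{22} P_2) + \int_{R_1} [(c_{11} - c_{21}) P_1 p_1(x) + (c_{12} - c_{22}) P_2 p_2(x)]\, dx
\end{aligned} \tag{3}$$

where the third line uses $q_i(x)\, p(x) = P_i\, p_i(x)$, the regions $R_1$ and $R_2$ partition $\mathbb{R}^n$ and are determined by the decision rule from Eq. (2), and the last line uses $\int_{R_2} p_i(x)\, dx = 1 - \int_{R_1} p_i(x)\, dx$. To minimize Eq. (3), we let $R_1$ contain exactly the points where the integrand is negative:

$$(c_{12} - c_{22}) P_2\, p_2(x) \underset{\omega_2}{\overset{\omega_1}{\lessgtr}} (c_{21} - c_{11}) P_1\, p_1(x) \tag{4}$$

which is equivalent to assigning $x \in \omega_1$ if $x$ makes the integrand negative, and $x \in \omega_2$ otherwise. Assuming $c_{21} > c_{11}$, Bayes test for minimum cost can now be stated as [1]

$$\frac{p_1(x)}{p_2(x)} \underset{\omega_2}{\overset{\omega_1}{\gtrless}} \frac{(c_{12} - c_{22}) P_2}{(c_{21} - c_{11}) P_1}. \tag{5}$$

Example 1: 1D features

Consider two classes and $x \in \mathbb{R}$. Let $p_1(x) = N(\mu_1, \sigma_1^2)$ and $p_2(x) = N(\mu_2, \sigma_2^2)$. For simplicity, let $c_{11} = c_{22} = 0$. From Eq. (4) we assign $x \in \omega_1$ if

$$\begin{aligned}
c_{21} P_1\, p_1(x) &> c_{12} P_2\, p_2(x) \\
\ln(c_{21} P_1) + \ln p_1(x) &> \ln(c_{12} P_2) + \ln p_2(x) \\
\ln(c_{21} P_1) - \ln \sigma_1 - \tfrac{1}{2} \ln(2\pi) - \frac{(x - \mu_1)^2}{2\sigma_1^2} &> \ln(c_{12} P_2) - \ln \sigma_2 - \tfrac{1}{2} \ln(2\pi) - \frac{(x - \mu_2)^2}{2\sigma_2^2}
\end{aligned} \tag{6}$$

so the discriminant function becomes

$$\frac{1}{2}\left(\frac{1}{\sigma_2^2} - \frac{1}{\sigma_1^2}\right) x^2 + \left(\frac{\mu_1}{\sigma_1^2} - \frac{\mu_2}{\sigma_2^2}\right) x + \frac{1}{2}\left(\frac{\mu_2^2}{\sigma_2^2} - \frac{\mu_1^2}{\sigma_1^2}\right) + \ln\left(\frac{\sigma_2}{\sigma_1}\right) + \ln\left(\frac{P_1 c_{21}}{P_2 c_{12}}\right) > 0 \tag{7}$$

which has the form $a x^2 + b x + c > 0$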

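where

$$a = \frac{1}{2}\left(\frac{1}{\sigma_2^2} - \frac{1}{\sigma_1^2}\right), \qquad b = \frac{\mu_1}{\sigma_1^2} - \frac{\mu_2}{\sigma_2^2},$$

and

$$c = \frac{1}{2}\left(\frac{\mu_2^2}{\sigma_2^2} - \frac{\mu_1^2}{\sigma_1^2}\right) + \ln\left(\frac{\sigma_2}{\sigma_1}\right) + \ln\left(\frac{P_1 c_{21}}{P_2 c_{12}}\right).$$

This form is similar to Bayes rule for minimizing error, except for the term $\ln\left(\frac{P_1 c_{21}}{P_2 c_{12}}\right)$, which replaces $\ln\left(\frac{P_1}{P_2}\right)$ and shifts the decision thresholds. An equivalent formulation is to decide $x \in \omega_1$ if

$$\frac{p_1(x)}{p_2(x)} > \frac{P_2 c_{12}}{P_1 c_{21}}$$

or

$$\frac{p_1(x)}{p_2(x)} > \lambda \quad \text{where} \quad \lambda = \frac{P_2 c_{12}}{P_1 c_{21}}.$$

We can interpret the decision rule as a modification of the Neyman-Pearson criterion that takes into account the priors and the costs.

Example 2: 2D features

Let $x \in \mathbb{R}^2$ with normal class-conditional densities

$$p_i(x) = \frac{1}{(2\pi)^{n/2} |\Sigma_i|^{1/2}} \exp\left[-\frac{1}{2} (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i)\right]$$

where $i = 1, 2$ and $n = 2$. For simplicity, let $c_{11} = c_{22} = 0$. Similar to Eq. (6), we decide $x \in \omega_1$ if

$$\begin{aligned}
c_{21} P_1\, p_1(x) &> c_{12} P_2\, p_2(x) \\
\ln(c_{21} P_1) + \ln p_1(x) &> \ln(c_{12} P_2) + \ln p_2(x) \\
\ln(c_{21} P_1) - \tfrac{1}{2} \ln|\Sigma_1| - \tfrac{1}{2} (x - \mu_1)^T \Sigma_1^{-1} (x - \mu_1) &> \ln(c_{12} P_2) - \tfrac{1}{2} \ln|\Sigma_2| - \tfrac{1}{2} (x - \mu_2)^T \Sigma_2^{-1} (x - \mu_2)
\end{aligned} \tag{8}$$

so the discriminant function becomes

$$\frac{1}{2} x^T (\Sigma_2^{-1} - \Sigma_1^{-1}) x + x^T (\Sigma_1^{-1} \mu_1 - \Sigma_2^{-1} \mu_2) + \frac{1}{2} (\mu_2^T \Sigma_2^{-1} \mu_2 - \mu_1^T \Sigma_1^{-1} \mu_1) + \ln\left(\frac{P_1 c_{21}}{P_2 c_{12}}\right) + \frac{1}{2} \ln\left(\frac{|\Sigma_2|}{|\Sigma_1|}\right) > 0 \tag{9}$$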

which has the form

$$x^T A x + b^T x + c > 0$$

where

$$A = \frac{1}{2} (\Sigma_2^{-1} - \Sigma_1^{-1}), \qquad b = \Sigma_1^{-1} \mu_1 - \Sigma_2^{-1} \mu_2,$$

and

$$c = \frac{1}{2} (\mu_2^T \Sigma_2^{-1} \mu_2 - \mu_1^T \Sigma_1^{-1} \mu_1) + \ln\left(\frac{P_1 c_{21}}{P_2 c_{12}}\right) + \frac{1}{2} \ln\left(\frac{|\Sigma_2|}{|\Sigma_1|}\right).$$

As in the 1D case, we decide $x \in \omega_1$ if

$$\frac{p_1(x)}{p_2(x)} > \lambda \quad \text{where} \quad \lambda = \frac{P_2 c_{12}}{P_1 c_{21}}.$$

When $\Sigma_1 = \Sigma_2$, the quadratic term vanishes ($A = 0$) and the Bayes classifier becomes a linear discriminant function. To give a specific illustration, we generate data from 2 classes and take

$$\mu_1 = \begin{bmatrix} 8 & 1 \end{bmatrix}^T \tag{10}$$

$$\mu_2 = \begin{bmatrix} 7 & 7 \end{bmatrix}^T \tag{11}$$

$$\Sigma = \begin{bmatrix} 6 & 1 \\ 1 & 6 \end{bmatrix}. \tag{12}$$

The data are classified using the discriminant function from Eq. (9). To see the effects of $c_{12}$ and $c_{21}$, we vary their values and examine how the separating hyperplane shifts in Fig. 1. We examine the following cases:

When $c_{12} = 1$ and $c_{21} = 1$, the costs of misclassifying classes 1 and 2 are equal, and the rule reduces to Bayes rule for minimum error. The separating hyperplane is positioned to minimize the probability of error, as Fig. 1(a) shows.

When $c_{12} = 5$ and $c_{21} = 1$, the cost of misclassifying class 2 is high. Thus, the separating hyperplane shifts toward class 1, so fewer points from class 2 are misclassified, as Fig. 1(b) shows.

When $c_{12} = 1$ and $c_{21} = 5$, the cost of misclassifying class 1 is high, so the separating hyperplane shifts toward class 2. As a result, fewer points from class 1 are misclassified, as Fig. 1(c) shows.
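The coefficients of Eq. (9) can be checked numerically. The following is a minimal sketch, not the author's code: the function names `discriminant_coeffs` and `decide` are hypothetical, the means and covariance are the values of Eqs. (10)-(12) as printed, and it assumes that both classes share the covariance $\Sigma$, equal priors of 0.5, and $c_{11} = c_{22} = 0$.

```python
import numpy as np

def discriminant_coeffs(mu1, mu2, S1, S2, P1, P2, c12, c21):
    """Return (A, b, c) of the discriminant in Eq. (9), assuming c11 = c22 = 0."""
    S1inv = np.linalg.inv(S1)
    S2inv = np.linalg.inv(S2)
    A = 0.5 * (S2inv - S1inv)
    b = S1inv @ mu1 - S2inv @ mu2
    c = (0.5 * (mu2 @ S2inv @ mu2 - mu1 @ S1inv @ mu1)
         + np.log(P1 * c21 / (P2 * c12))
         + 0.5 * np.log(np.linalg.det(S2) / np.linalg.det(S1)))
    return A, b, c

def decide(x, A, b, c):
    """Return 1 if x^T A x + b^T x + c > 0 (class 1), otherwise 2."""
    return 1 if x @ A @ x + b @ x + c > 0 else 2

# Values from Eqs. (10)-(12); a shared covariance is an assumption of this sketch.
mu1 = np.array([8.0, 1.0])
mu2 = np.array([7.0, 7.0])
Sigma = np.array([[6.0, 1.0], [1.0, 6.0]])

# Equal costs versus a high cost for misclassifying class 2 (c12 = 5).
A_eq, b_eq, c_eq = discriminant_coeffs(mu1, mu2, Sigma, Sigma, 0.5, 0.5, 1.0, 1.0)
A_hi, b_hi, c_hi = discriminant_coeffs(mu1, mu2, Sigma, Sigma, 0.5, 0.5, 5.0, 1.0)

midpoint = 0.5 * (mu1 + mu2)
print("A is zero for shared covariance:", np.allclose(A_eq, 0.0))
print("midpoint with equal costs, g(x) =", midpoint @ b_eq + c_eq)
print("midpoint with c12 = 5 is assigned to class", decide(midpoint, A_hi, b_hi, c_hi))
```

With a shared covariance only the constant $c$ depends on the costs, so changing $c_{12}$ or $c_{21}$ translates the hyperplane without rotating it. Under equal costs and equal priors the midpoint between the means lies on the boundary; raising $c_{12}$ lowers $c$ by $\ln 5$ and moves the midpoint into the class 2 region.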

Figure 1: Data for class 1 (crosses) and class 2 (circles). Panels: (a) $c_{12} = 1$, $c_{21} = 1$; (b) $c_{12} = 5$, $c_{21} = 1$; (c) $c_{12} = 1$, $c_{21} = 5$. In all cases, $\mathrm{Prob}(\omega_1) = \mathrm{Prob}(\omega_2) = 0.5$. Misclassified points are shown in red. Values of $\mu_1$, $\mu_2$, and $\Sigma$ are given in Eqs. (10)-(12). As the figures show, the separating hyperplanes shift depending on the values of $c_{12}$ and $c_{21}$.
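An experiment of the kind shown in Figure 1 can be sketched as follows. This is an illustration under stated assumptions rather than a reproduction of the figure: the sample size `n`, the random seed, and the helper `error_counts` are hypothetical, both classes are assumed to share the covariance $\Sigma$ of Eq. (12), and $c_{11} = c_{22} = 0$.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, fixed for repeatability

# Parameters from Eqs. (10)-(12); shared covariance is an assumption.
mu1 = np.array([8.0, 1.0])
mu2 = np.array([7.0, 7.0])
Sigma = np.array([[6.0, 1.0], [1.0, 6.0]])

n = 500  # hypothetical number of samples per class
X1 = rng.multivariate_normal(mu1, Sigma, size=n)  # samples from class 1
X2 = rng.multivariate_normal(mu2, Sigma, size=n)  # samples from class 2

# Linear discriminant w^T x + c > 0 (Eq. (9) with Sigma_1 = Sigma_2).
Sinv = np.linalg.inv(Sigma)
w = Sinv @ (mu1 - mu2)
c_base = 0.5 * (mu2 @ Sinv @ mu2 - mu1 @ Sinv @ mu1)

def error_counts(c12, c21, P1=0.5, P2=0.5):
    """Return (class 1 samples assigned to class 2, class 2 samples assigned to class 1)."""
    c = c_base + np.log(P1 * c21 / (P2 * c12))
    miss1 = int(np.sum(X1 @ w + c <= 0))
    miss2 = int(np.sum(X2 @ w + c > 0))
    return miss1, miss2

for c12, c21 in [(1, 1), (5, 1), (1, 5)]:
    miss1, miss2 = error_counts(c12, c21)
    print(f"c12={c12}, c21={c21}: class 1 errors={miss1}, class 2 errors={miss2}")
```

Because the costs only change the offset of the hyperplane, raising $c_{12}$ can never increase the number of misclassified class 2 samples on a fixed data set, and raising $c_{21}$ can never increase the number of misclassified class 1 samples. The price is more errors on the other class.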

Summary

We have derived Bayes rule for minimizing risk. The rule can be stated as

$$\frac{p_1(x)}{p_2(x)} \underset{\omega_2}{\overset{\omega_1}{\gtrless}} \frac{(c_{12} - c_{22}) P_2}{(c_{21} - c_{11}) P_1}.$$

To illustrate this rule, we have given two examples dealing with 1D and 2D features. In both cases, the decision boundary shifts depending on the costs. Figure 1 demonstrates this for the 2D case. When the cost of misclassifying class $i$ is high, the separating hyperplane shifts to reduce the number of points misclassified from class $i$. We hope that this material provides the reader with a more general understanding of Bayes rule.
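For the 1D case, the agreement between the minimum-cost comparison of Eqs. (1)-(2) and the quadratic discriminant of Eq. (7) can be checked on a grid. The sketch below uses only the standard library; the parameter values ($\mu_i$, $\sigma_i$, priors, costs) and the function names are hypothetical choices made for illustration, with $c_{11} = c_{22} = 0$ as in Example 1.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

# Hypothetical parameters for illustration only.
mu1, sigma1, P1 = 0.0, 1.0, 0.5
mu2, sigma2, P2 = 3.0, 2.0, 0.5
c12, c21 = 1.0, 5.0  # c11 = c22 = 0

def decide_by_risk(x):
    """Eqs. (1)-(2): choose the class with the smaller conditional cost."""
    # Posteriors up to the common factor 1/p(x), which cancels in the comparison.
    q1 = P1 * normal_pdf(x, mu1, sigma1)
    q2 = P2 * normal_pdf(x, mu2, sigma2)
    r1 = c12 * q2  # cost of deciding class 1
    r2 = c21 * q1  # cost of deciding class 2
    return 1 if r1 < r2 else 2

# Coefficients of the discriminant a x^2 + b x + c > 0 from Eq. (7).
a = 0.5 * (1.0 / sigma2**2 - 1.0 / sigma1**2)
b = mu1 / sigma1**2 - mu2 / sigma2**2
c = (0.5 * (mu2**2 / sigma2**2 - mu1**2 / sigma1**2)
     + math.log(sigma2 / sigma1)
     + math.log(P1 * c21 / (P2 * c12)))

def decide_by_quadratic(x):
    """Decide class 1 when the quadratic discriminant is positive."""
    return 1 if a * x * x + b * x + c > 0 else 2

grid = [i / 10.0 for i in range(-100, 101)]
agree = all(decide_by_risk(x) == decide_by_quadratic(x) for x in grid)
print("rules agree on the grid:", agree)
```

With these values $a < 0$, so class 1 is chosen on the interval between the two roots of the quadratic and class 2 on both sides of it, which is why the 1D boundary is a pair of thresholds rather than a single point.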

References

[1] K. Fukunaga, Introduction to Statistical Pattern Recognition (Academic, New York, 1972).
