Bayes Decision Theory

Seungjin Choi
Department of Computer Science and Engineering
Pohang University of Science and Technology
77 Cheongam-ro, Nam-gu, Pohang 37673, Korea
seungjin@postech.ac.kr
Bayes Decision Theory

Consider a set of feature vectors {x}, each belonging to either class C_1 or C_2 with prior probability P(C_i). A fundamental question arises: what is the best way to assign an appropriate class label to a data point x?

Decide C_1 if P(C_1) > P(C_2)? Not a good idea: this rule ignores the observed data entirely. Bayes decision theory gives an answer to this fundamental question.

Bayes decision rule: decide C_1 if P(C_1 | x) > P(C_2 | x).

Question: does this Bayes decision rule give the minimal probability of misclassification? Yes!
Bayes Rule

P(C_k | x) = P(x | C_k) P(C_k) / Σ_j P(x | C_j) P(C_j),

where P(x | C_k) is the class-conditional density, P(C_k) is the prior, and the denominator is the normalization factor P(x). In practice, we model the class-conditional density P(x | C_k) by a parameterized form.
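Bayes rule can be sketched numerically. The two-class setup below (1-D Gaussian class-conditional densities with the stated means, variances, and priors) is purely an illustrative assumption:

```python
from math import sqrt, pi, exp

def gaussian_pdf(x, mu, sigma):
    """1-D Gaussian density N(x; mu, sigma^2)."""
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2 * pi))

def posterior(x, mus, sigmas, priors):
    """Posterior P(C_k | x) via Bayes rule:
    P(x | C_k) P(C_k) / sum_j P(x | C_j) P(C_j)."""
    joint = [gaussian_pdf(x, m, s) * p for m, s, p in zip(mus, sigmas, priors)]
    z = sum(joint)  # normalization factor P(x)
    return [j / z for j in joint]

# x = 0.8 is closer to the mean of class 1 (mu = 0) than class 2 (mu = 2),
# so the posterior favors class 1 under equal priors and variances.
post = posterior(0.8, mus=[0.0, 2.0], sigmas=[1.0, 1.0], priors=[0.5, 0.5])
```

The posteriors sum to one by construction, since the joint terms are divided by the normalization factor.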
Decision Boundary: Case 1

[Figure: the weighted class-conditional densities P(x | C_1)P(C_1) and P(x | C_2)P(C_2), with decision regions R_1 and R_2.]
Decision Boundary: Case 2

[Figure: the weighted class-conditional densities P(x | C_1)P(C_1) and P(x | C_2)P(C_2), with decision regions R_1 and R_2.]
Decision Boundaries

Decision boundaries are the boundaries between decision regions. The probability of misclassification in the binary classification problem is given by

P(error) = P(x ∈ R_2, C_1) + P(x ∈ R_1, C_2)
         = P(x ∈ R_2 | C_1) P(C_1) + P(x ∈ R_1 | C_2) P(C_2)
         = ∫_{R_2} P(x | C_1) P(C_1) dx + ∫_{R_1} P(x | C_2) P(C_2) dx.

One can observe that if P(x | C_1) P(C_1) > P(x | C_2) P(C_2), we should choose the regions R_1 and R_2 such that x is in R_1, since this gives a smaller contribution to the error.
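The error probability can be checked by simulation. The sketch below assumes two 1-D Gaussians, N(0, 1) and N(2, 1), with equal priors; for this symmetric case the decision boundary is at x = 1 and the analytic error is Φ(-1) ≈ 0.1587:

```python
import random
import math

random.seed(0)

def bayes_decide(x, mu1=0.0, mu2=2.0):
    # Equal priors and equal variances: decide the class with the nearer mean.
    return 1 if abs(x - mu1) < abs(x - mu2) else 2

n, errors = 100_000, 0
for _ in range(n):
    c = 1 if random.random() < 0.5 else 2          # sample the true class
    x = random.gauss(0.0 if c == 1 else 2.0, 1.0)  # sample x from P(x | C_c)
    if bayes_decide(x) != c:
        errors += 1

p_error = errors / n
# Analytic value for this setup: Phi(-1) = 0.5 * erfc(1/sqrt(2))
analytic = 0.5 * math.erfc(1 / math.sqrt(2))
```

The Monte Carlo estimate agrees with the analytic value to within sampling noise, illustrating that the Bayes rule attains the minimal error for these densities.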
Alternatively, we consider

P(correct) = Σ_{k=1}^{K} P(x ∈ R_k, C_k)
           = Σ_{k=1}^{K} P(x ∈ R_k | C_k) P(C_k)
           = Σ_{k=1}^{K} ∫_{R_k} P(x | C_k) P(C_k) dx.

This probability is maximized by choosing {R_k} such that each x is assigned to the class for which the integrand is a maximum.
Discriminant Functions

Decide C_k if f_k(x) > f_j(x) for all j ≠ k. Choose f_k(x) = P(C_k | x) ∝ P(x | C_k) P(C_k), where the common factor P(x) is dropped.

If f_k(x) is a discriminant function, then

- g(f_k(x)), where g is a monotonically increasing function,
- a f_k(x) with a > 0,
- f_k(x) + b,

are also eligible discriminant functions. In particular,

f_k(x) = log [P(x | C_k) P(C_k)] = log P(x | C_k) + log P(C_k)

is also a discriminant function.
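The invariance under monotone transforms can be seen in a few lines. The score values below stand in for hypothetical P(x | C_k) P(C_k) products; taking logs leaves the winning class unchanged:

```python
import math

# Hypothetical unnormalized posterior scores P(x|C_k)P(C_k) for three classes.
scores = [0.02, 0.35, 0.11]
log_scores = [math.log(s) for s in scores]  # log is monotonically increasing

k_raw = max(range(len(scores)), key=lambda k: scores[k])
k_log = max(range(len(log_scores)), key=lambda k: log_scores[k])
# Both selections pick the same class, since log preserves the ordering.
```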
Discriminant Functions for Normal Density

Consider discriminant functions f_i(x) = log P(x | C_i) + log P(C_i). Assume P(x | C_i) ~ N(μ_i, Σ_i). Then the discriminant functions have the form

f_i(x) = log [ (2π)^{-m/2} |Σ_i|^{-1/2} exp{ -(1/2)(x - μ_i)^T Σ_i^{-1} (x - μ_i) } ] + log P(C_i)
       = -(1/2)(x - μ_i)^T Σ_i^{-1} (x - μ_i) - (m/2) log 2π - (1/2) log |Σ_i| + log P(C_i).
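The Gaussian log-discriminant can be written directly from this formula. The function below is a sketch; the particular μ, Σ, and prior used in the example call are assumptions for illustration:

```python
import numpy as np

def gaussian_discriminant(x, mu, Sigma, prior):
    """f_i(x) = -1/2 (x-mu)^T Sigma^{-1} (x-mu)
               - m/2 log(2 pi) - 1/2 log|Sigma| + log P(C_i)."""
    m = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)  # (x-mu)^T Sigma^{-1} (x-mu)
    _, logdet = np.linalg.slogdet(Sigma)        # log|Sigma|, numerically stable
    return -0.5 * quad - 0.5 * m * np.log(2 * np.pi) - 0.5 * logdet + np.log(prior)

# Example with Sigma = I, where the formula reduces to
# -||x - mu||^2 / 2 - (m/2) log(2 pi) + log P(C_i).
val = gaussian_discriminant(np.array([1.0, 2.0]), np.zeros(2), np.eye(2), 0.5)
```

Using `solve` and `slogdet` avoids forming the explicit inverse and determinant, which is the standard numerically safe choice.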
Case 1: Σ_i = σ² I

Features are statistically independent with the same variance σ², so each class forms a hyperspherical cluster. Here |Σ_i| = σ^{2m} and Σ^{-1} = (1/σ²) I, so

f_i(x) = -‖x - μ_i‖² / (2σ²) + log P(C_i).

If the priors P(C_i) are the same for all classes, the discriminant functions reduce to

f_i(x) = -‖x - μ_i‖² / (2σ²),

which is a minimum-distance classifier: x is assigned to the class with the nearest mean.
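A minimum-distance classifier is a few lines of code. The class means below are illustrative assumptions:

```python
import numpy as np

def min_distance_classify(x, means):
    """Assign x to the class whose mean is nearest in Euclidean distance
    (equivalent to the case Sigma_i = sigma^2 I with equal priors)."""
    dists = [np.linalg.norm(x - mu) for mu in means]
    return int(np.argmin(dists))

means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
label = min_distance_classify(np.array([0.5, 0.2]), means)  # nearest mean: class 0
```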
Case 1 Leads to Linear Discriminant Functions

In case 1, the discriminant function f_i(x) can be rewritten in the linear form

f_i(x) = w_i^T x + w_{i0},

where

w_i = (1/σ²) μ_i,  w_{i0} = -(1/(2σ²)) μ_i^T μ_i + log P(C_i).

(The term -x^T x / (2σ²) is dropped, since it is the same for every class.) Decision boundaries are hyperplanes defined by f_i(x) = f_j(x).
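One can verify numerically that the linear form ranks classes exactly as the distance-based discriminant does, since the two differ only by a class-independent term. The σ², means, and priors below are assumptions:

```python
import numpy as np

sigma2 = 1.5
mus = [np.array([0.0, 1.0]), np.array([2.0, -1.0])]
priors = [0.7, 0.3]

def linear_f(x, mu, prior):
    # w_i = mu_i / sigma^2, w_i0 = -mu_i^T mu_i / (2 sigma^2) + log P(C_i)
    w = mu / sigma2
    w0 = -0.5 * (mu @ mu) / sigma2 + np.log(prior)
    return w @ x + w0

def distance_f(x, mu, prior):
    # Full form: -||x - mu_i||^2 / (2 sigma^2) + log P(C_i)
    return -0.5 * np.dot(x - mu, x - mu) / sigma2 + np.log(prior)

x = np.array([1.0, 0.5])
lin = [linear_f(x, m, p) for m, p in zip(mus, priors)]
dist = [distance_f(x, m, p) for m, p in zip(mus, priors)]
# lin and dist differ by the common term -x^T x / (2 sigma^2),
# so pairwise differences, and hence the decision, are identical.
```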
Case 2: Σ_i = Σ

In this case, the discriminant functions are given by

f_i(x) = -(1/2)(x - μ_i)^T Σ^{-1} (x - μ_i) + log P(C_i).

If the priors P(C_i) are the same for all classes, the discriminant function is simply based on the Mahalanobis distance, (x - μ_i)^T Σ^{-1} (x - μ_i). Case 2 also leads to linear discriminant functions of the form

f_i(x) = w_i^T x + w_{i0},

where

w_i = Σ^{-1} μ_i,  w_{i0} = -(1/2) μ_i^T Σ^{-1} μ_i + log P(C_i).
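The same equivalence holds with a shared full covariance: the linear form and the Mahalanobis-based form differ only by the class-independent term -(1/2) x^T Σ^{-1} x. The covariance, means, and priors below are assumptions for illustration:

```python
import numpy as np

Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
mus = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
priors = [0.5, 0.5]
Sinv = np.linalg.inv(Sigma)

def linear_f(x, mu, prior):
    # w_i = Sigma^{-1} mu_i, w_i0 = -1/2 mu_i^T Sigma^{-1} mu_i + log P(C_i)
    w = Sinv @ mu
    w0 = -0.5 * (mu @ Sinv @ mu) + np.log(prior)
    return w @ x + w0

def mahalanobis_f(x, mu, prior):
    # Full form: -1/2 (x - mu_i)^T Sigma^{-1} (x - mu_i) + log P(C_i)
    d = x - mu
    return -0.5 * (d @ Sinv @ d) + np.log(prior)

x = np.array([1.5, 0.2])
lin = [linear_f(x, m, p) for m, p in zip(mus, priors)]
mah = [mahalanobis_f(x, m, p) for m, p in zip(mus, priors)]
```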
Case 3: Arbitrary Σ_i

In this case, the discriminant functions have the quadratic form

f_i(x) = x^T W_i x + w_i^T x + w_{i0},

where

W_i = -(1/2) Σ_i^{-1},  w_i = Σ_i^{-1} μ_i,  w_{i0} = -(1/2) μ_i^T Σ_i^{-1} μ_i - (1/2) log |Σ_i| + log P(C_i).

The decision boundaries are hyperquadrics: they can assume any of the general forms, such as pairs of hyperplanes, hyperspheres, hyperellipsoids, and hyperparaboloids.
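The quadratic coefficients can be computed and checked against the direct Gaussian form, up to the class-independent constant -(m/2) log 2π. The parameters below are illustrative assumptions:

```python
import numpy as np

def qda_terms(mu, Sigma, prior):
    """Coefficients of f_i(x) = x^T W_i x + w_i^T x + w_i0 for arbitrary Sigma_i."""
    Sinv = np.linalg.inv(Sigma)
    W = -0.5 * Sinv
    w = Sinv @ mu
    _, logdet = np.linalg.slogdet(Sigma)
    w0 = -0.5 * (mu @ Sinv @ mu) - 0.5 * logdet + np.log(prior)
    return W, w, w0

def qda_f(x, W, w, w0):
    return x @ W @ x + w @ x + w0

# Assumed example parameters.
Sigma = np.array([[1.5, 0.2], [0.2, 0.8]])
mu = np.array([1.0, -0.5])
prior, x = 0.4, np.array([0.3, 0.7])

W, w, w0 = qda_terms(mu, Sigma, prior)
f_quad = qda_f(x, W, w, w0)

# Direct form (constant -(m/2) log 2pi omitted, as it is class-independent):
d = x - mu
f_direct = (-0.5 * (d @ np.linalg.inv(Sigma) @ d)
            - 0.5 * np.linalg.slogdet(Sigma)[1] + np.log(prior))
```

Expanding the quadratic in (x - μ_i) is exactly what produces the W_i, w_i, w_{i0} terms, so the two evaluations agree.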
Loss Function and Expected Loss

Suppose that we are given a set of training data, {(x_i, y_i)}_{i=1}^{N}. The loss function ℓ(f(x), y) quantifies the loss, or cost, associated with the prediction f(x) when the data point x was actually labeled y. The expected loss is defined as

L(f(x)) = E_{p(y|x)}[ℓ(f(x), y)] = ∫ ℓ(f(x), y) p(y | x) dy.
0-1 Loss

The 0-1 loss function is of the form

ℓ(f(x), y) = 1 - δ_{f(x),y} = { 0 if f(x) = y, 1 otherwise }.

It makes most sense when the hypothesis space is discrete. The expected loss is given by

L(f(x)) = Σ_y ℓ(f(x), y) p(y | x)
        = Σ_y (1 - δ_{f(x),y}) p(y | x)
        = Σ_y p(y | x) - Σ_y δ_{f(x),y} p(y | x)
        = 1 - p(f(x) | x).

The expected loss is minimized when f(x) is chosen to be the maximum of the posterior distribution p(y | x), i.e., the MAP estimate.
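A tiny sketch makes this concrete. The posterior table below is an illustrative assumption; the expected 0-1 loss of each candidate label is 1 - p(label | x), and the MAP label attains the minimum:

```python
# Assumed posterior p(y | x) over a discrete label set.
posterior = {"cat": 0.2, "dog": 0.5, "bird": 0.3}

def expected_01_loss(label, post):
    """Expected 0-1 loss of predicting `label`: 1 - p(label | x)."""
    return 1.0 - post[label]

map_label = max(posterior, key=posterior.get)
losses = {y: expected_01_loss(y, posterior) for y in posterior}
# The MAP label has the highest posterior, hence the lowest expected loss.
```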
Squared Loss

The squared-error loss function is of the form

ℓ(f(x), y) = (y - f(x))².

It is most appropriate when y lives in a continuous space with a well-defined metric. The expected loss is given by

L(f(x)) = ∫ ℓ(f(x), y) p(y | x) dy
        = ∫ (y - f(x))² p(y | x) dy
        = ∫ y² p(y | x) dy + f(x)² ∫ p(y | x) dy - 2 f(x) ∫ y p(y | x) dy
        = f(x)² - 2 f(x) E[y | x] + E[y² | x]
        = (f(x) - E[y | x])² + E[(y - E[y | x])² | x],

which is minimized when f(x) = E[y | x].
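The same fact can be checked empirically: over samples y drawn from an assumed p(y | x), the sample mean (an estimate of E[y | x]) achieves a lower average squared loss than any other constant prediction:

```python
import random

random.seed(1)
# Assumed conditional distribution p(y | x): Gaussian with mean 3, variance 1.
ys = [random.gauss(3.0, 1.0) for _ in range(10_000)]

def avg_sq_loss(f, ys):
    """Empirical expected squared loss of the constant prediction f."""
    return sum((y - f) ** 2 for y in ys) / len(ys)

mean_y = sum(ys) / len(ys)
# Any shift away from the mean adds (f - mean)^2 to the loss,
# mirroring the decomposition L = (f - E[y|x])^2 + Var[y|x].
```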