ECE 6430 Pattern Recognition and Analysis, Fall 2011, Lecture Notes - 2

What does Bayes' theorem give us? Let's revisit the ball-in-the-box example.

Figure 1: Boxes with colored balls

Last class we answered the question: what is the overall probability that the selection procedure will pick a green ball? Now let's look at another problem: suppose we have a green ball; what is the probability that it came from the blue box? Or the red box?
We can solve the problem of reversing the conditional probability by using Bayes' theorem:

p(B = b | F = g) = p(F = g | B = b) p(B = b) / p(F = g)    (1)

Note that we know all the probabilities on the RHS from earlier!

Prior probability: p(B = b) or p(B = r) is the probability available before we observe the identity of the ball.

Posterior probability: p(B | F) is the probability available after we observe the identity of the ball.

How does Bayes' theorem relate to training data? Prior probabilities, p(C_k), can be estimated from the proportions of the training data which fall into each class. What does this mean? How many times was each class chosen? (What is the probability of choosing the blue box?)

Class-conditional probability, p(x_n | C_k): estimated from histograms for each class. Why is this happening? Take the example of handwritten letters 'a' and 'b'. We tried to classify based just on height. In that case, there was a lot of overlap between classes C_1 and C_2. While the above is the most obvious case, such overlap occurs even with more features. Both boxes (classes) have both orange and green balls!

What about the denominator, p(x_n)? Once we have the prior and the class-conditional probabilities, we can calculate it as
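As a concrete sketch of equation (1), the snippet below reverses the conditioning for the box example. The specific priors and ball proportions here are made-up numbers for illustration, not the values used in class.

```python
# Hypothetical setup: the priors and green-ball proportions are assumptions
# chosen for illustration, not the numbers from the lecture's figure.
p_box = {"blue": 0.6, "red": 0.4}                # prior p(B = b): how often each box is picked
p_green_given_box = {"blue": 0.75, "red": 0.25}  # class-conditional p(F = g | B = b)

# Denominator p(F = g): overall probability of drawing a green ball
p_green = sum(p_green_given_box[b] * p_box[b] for b in p_box)

# Bayes' theorem (equation 1): reverse the conditioning to get p(B = b | F = g)
posterior = {b: p_green_given_box[b] * p_box[b] / p_green for b in p_box}

print(p_green)             # 0.75*0.6 + 0.25*0.4 = 0.55
print(posterior["blue"])   # 0.45 / 0.55 ≈ 0.818
```

Note that the two posteriors sum to one automatically, which is exactly the "normalizing" role the denominator plays.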
p(x_n) = p(x_n | C_1) p(C_1) + ... + p(x_n | C_K) p(C_K)    (2)

This is just a normalizing value! Why?

Summarizing, decision making:

posterior ∝ prior × likelihood    (3)

Given a new data value x_new, the probability of misclassification is minimized if we assign the data to the class C_k for which the posterior probability p(C_k | x_new) is largest:

choose C_k if p(C_k | x_new) > p(C_j | x_new) for all j ≠ k    (4)

Rejection threshold in the Bayesian context:

classify x_n as C_k if max_k p(C_k | x_n) ≥ θ; reject x_n if max_k p(C_k | x_n) < θ    (5)

Note that in the textbook, discrete probabilities are denoted by a capital P(), while I have not made that distinction.
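The decision rule of equation (4) together with the rejection option of equation (5) can be sketched in a few lines. The posterior values passed in below are made up for illustration.

```python
# Sketch of the Bayes decision rule with a rejection option (equations 4 and 5).
def classify(posteriors, theta):
    """Return the class with the largest posterior if it clears the
    threshold theta; otherwise return None to signal rejection."""
    best = max(posteriors, key=posteriors.get)
    if posteriors[best] >= theta:
        return best      # assign to the class with the largest posterior
    return None          # max posterior < theta: pattern is too ambiguous

posteriors = {"C1": 0.55, "C2": 0.45}   # hypothetical posteriors at some x_n
print(classify(posteriors, theta=0.5))  # "C1"
print(classify(posteriors, theta=0.9))  # None: rejected
```

Raising θ trades coverage for accuracy: more patterns are rejected, but the ones that are classified carry higher posterior confidence.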
Discriminant functions

Discriminant functions y_1(x), ..., y_M(x) are defined such that an input vector x is assigned to class C_k if y_k(x) > y_j(x) for all j ≠ k.

If we compare this to our earlier rule of minimizing the probability of misclassification, we would have

y_k(x) = p(C_k | x)    (6)

Applying Bayes' theorem, we will have

y_k(x) = p(x | C_k) p(C_k)    (7)

Note that when defining the discriminant function, we can discard the denominator p(x_n).

Figure 2: Joint probabilities compared, p(x, C_1) = p(x | C_1) p(C_1)

In general, the decision boundaries are given by the regions where the discriminant functions are equal: y_k(x) = y_j(x). Since we are looking to compare relative magnitudes, we can replace y(x) with another monotonic function and expect to get the same results, e.g.

z_k(x) = ln p(x | C_k) + ln p(C_k)    (8)
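To make equation (8) concrete, here is a log-discriminant for two classes, assuming one-dimensional Gaussian class-conditionals; the means, variances, and priors are made up for illustration (the notes do not specify a density model).

```python
import math

# Illustrative discriminant z_k(x) = ln p(x | C_k) + ln p(C_k) (equation 8),
# assuming hypothetical 1-D Gaussian class-conditionals with made-up parameters.
classes = {
    "C1": {"mean": 0.0, "var": 1.0, "prior": 0.5},
    "C2": {"mean": 2.0, "var": 1.0, "prior": 0.5},
}

def log_gaussian(x, mean, var):
    # ln of the 1-D Gaussian density N(x; mean, var)
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def z(x, k):
    c = classes[k]
    return log_gaussian(x, c["mean"], c["var"]) + math.log(c["prior"])

def decide(x):
    # assign x to the class whose discriminant is largest
    return max(classes, key=lambda k: z(x, k))

print(decide(-0.5))  # "C1": x is closer to the C1 mean
print(decide(2.5))   # "C2"
# With equal priors and equal variances, z_1(x) = z_2(x) at x = 1,
# the midpoint between the means: that is the decision boundary.
```

Because ln is monotonic, this rule makes exactly the same assignments as comparing p(x | C_k) p(C_k) directly, but the sums are numerically better behaved than the products.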
Curve fitting revisited

Figure 3: Probability in curve fitting!

Remember: curve fitting involves finding one set of values for w, minimizing the error between y(x_n, w) and the desired output or target values, t_n.

Error function: measures the misfit between the function y(x_n, w), for any given value of w, and the training set data points. Sum of the squares of the errors:

E = (1/2) Σ_{n=1}^{N} {y(x_n; w) − t_n}²    (9)

What does p(t | x_0) mean?

What are we doing here? Choosing a specific estimate y(x) of the value of t for each input x. The regression function y(x) minimizes the expected squared loss:

E(L) = ∫∫ {y(x) − t}² p(x, t) dx dt    (10)

Choose y(x) to minimize E(L).
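Minimizing the sum-of-squares error of equation (9) can be shown on a tiny example. Here the model is a straight line y(x; w) = w0 + w1·x, solved with the closed-form least-squares formulas; the four training points are made up so that the fit is exact.

```python
# Minimizing the sum-of-squares error (equation 9) for a straight-line model
# y(x; w) = w0 + w1 * x. The training points are made up for illustration and
# lie exactly on t = 1 + 2x, so the fit should recover w = (1, 2) with E = 0.
xs = [0.0, 1.0, 2.0, 3.0]
ts = [1.0, 3.0, 5.0, 7.0]

n = len(xs)
mean_x = sum(xs) / n
mean_t = sum(ts) / n

# Closed-form least-squares slope and intercept (setting dE/dw = 0)
w1 = (sum((x - mean_x) * (t - mean_t) for x, t in zip(xs, ts))
      / sum((x - mean_x) ** 2 for x in xs))
w0 = mean_t - w1 * mean_x

# Residual sum-of-squares error, equation (9)
E = 0.5 * sum((w0 + w1 * x - t) ** 2 for x, t in zip(xs, ts))
print(w0, w1, E)   # ≈ 1.0 2.0 0.0
```

With noisy targets E would not reach zero; the fitted line would instead pass through the conditional mean of t at each x, which is exactly the point of equations (10) and (11).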
Finding the partial derivative w.r.t. y(x) and equating it to zero, we end up with

y(x) = E(t | x)    (11)

Figure 4: Bayesian curve fit

Generalizing for multiple target variables, we will have

y_k(x) = E(t_k | x)    (12)

Minimizing risk

Sometimes, misclassifying one way might be more detrimental than the other, e.g. identification of a tumor: it would be riskier to classify a real tumor as a non-tumor than the other way around.

Loss matrix element l_kj = penalty associated with assigning a pattern to class C_j when it belongs to C_k.

R_k = Σ_{j=1}^{c} l_kj ∫_{R_j} p(x | C_k) dx    (13)

Minimizing the overall risk leads to the rule: assign x to class C_j if

Σ_k l_kj p(x | C_k) p(C_k) < Σ_k l_ki p(x | C_k) p(C_k) for all i ≠ j    (14)
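Decision rule (14) can be sketched numerically for the tumor example. All probabilities and loss values below are made up for illustration; the asymmetric loss makes a missed tumor ten times worse than a false alarm.

```python
# Sketch of the minimum-risk rule (equation 14) with an asymmetric loss matrix.
# All numbers are hypothetical, chosen to echo the tumor example.
# loss[(k, j)] = l_kj: penalty for assigning to class j when the truth is class k.
loss = {
    ("tumor", "tumor"): 0.0,  ("tumor", "normal"): 10.0,  # missing a tumor is costly
    ("normal", "tumor"): 1.0, ("normal", "normal"): 0.0,
}
prior = {"tumor": 0.1, "normal": 0.9}        # p(C_k)
likelihood = {"tumor": 0.5, "normal": 0.2}   # p(x | C_k) at the observed x

def expected_loss(assign_to):
    # sum over true classes k of l_kj * p(x | C_k) * p(C_k), as in equation (14)
    return sum(loss[(k, assign_to)] * likelihood[k] * prior[k] for k in prior)

decision = min(["tumor", "normal"], key=expected_loss)
print(expected_loss("tumor"), expected_loss("normal"))  # 0.18 vs 0.5
print(decision)                                         # "tumor"
```

With these numbers the plain posterior rule would say "normal" (the posterior for tumor is only 0.05/0.23 ≈ 0.22), but the asymmetric loss shifts the decision boundary and the minimum-risk decision is "tumor".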