Bayesian Learning

Chapter 6: Bayesian Learning
CS 536: Machine Learning
Littman (Wu, TA)
[Read Ch. 6, except 6.3]
[Suggested exercises: 6.1, 6.2, 6.6]

Outline: Bayes theorem; MAP and ML hypotheses; MAP learners; minimum description length principle; Bayes optimal classifier; naive Bayes learner; (if time) example: learning over text data, Bayesian belief networks, Expectation Maximization algorithm.

Roles for Bayesian Methods
Provides practical learning algorithms: naive Bayes learning, Bayesian belief network learning. Combines prior knowledge (prior probabilities) with observed data; requires prior probabilities. Provides a useful conceptual framework: a gold standard for evaluating other learning algorithms, and additional insight into Occam's razor.

Bayes Theorem
P(h|D) = P(D|h) P(h) / P(D)
P(h) = prior probability of hypothesis h
P(D) = prior probability of training data D
P(h|D) = probability of h given D
P(D|h) = probability of D given h
Choosing Hypotheses
The natural choice is the most probable hypothesis given the training data, the maximum a posteriori (MAP) hypothesis h_MAP:
h_MAP = argmax_{h in H} P(h|D)
      = argmax_{h in H} P(D|h) P(h) / P(D)
      = argmax_{h in H} P(D|h) P(h)
If we assume P(h_i) = P(h_j) for all i, j, we can simplify further and choose the maximum likelihood (ML) hypothesis:
h_ML = argmax_{h_i in H} P(D|h_i)

Bayes Theorem Example
Does the patient have cancer or not? A patient takes a lab test and the result comes back positive. The test returns a correct positive result in 98% of the cases in which the disease is actually present, and a correct negative result in 97% of the cases in which the disease is not present. Furthermore, 0.008 of the entire population have this cancer.
P(cancer) =            P(not cancer) =
P(+|cancer) =          P(-|cancer) =
P(+|not cancer) =      P(-|not cancer) =

Basic Formulas for Probabilities
Product rule: probability P(A ∧ B) of a conjunction of two events A and B:
P(A ∧ B) = P(A|B) P(B) = P(B|A) P(A)
Sum rule: probability of a disjunction of two events A and B:
P(A ∨ B) = P(A) + P(B) - P(A ∧ B)
Theorem of total probability: if the events A_1, ..., A_n are mutually exclusive with Σ_{i=1}^{n} P(A_i) = 1, then
P(B) = Σ_{i=1}^{n} P(B|A_i) P(A_i)

Brute Force MAP Learner
1. For each hypothesis h in H, calculate the posterior probability P(h|D) = P(D|h) P(h) / P(D)
2. Output the hypothesis h_MAP with the highest posterior probability: h_MAP = argmax_{h in H} P(h|D)
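A minimal sketch of the brute-force MAP computation for the lab-test exercise above. The slide leaves the probability table blank; the values below are read directly off the problem statement (P(cancer) = 0.008, P(+|cancer) = 0.98, P(+|not cancer) = 0.03), and the variable names are illustrative.

```python
# Brute-force MAP over two hypotheses, given a positive test result.
priors = {"cancer": 0.008, "not cancer": 0.992}
likelihood_pos = {"cancer": 0.98, "not cancer": 0.03}   # P(+ | h)

# Unnormalized posteriors P(+|h) P(h); P(+) is just their sum.
unnorm = {h: likelihood_pos[h] * priors[h] for h in priors}
p_pos = sum(unnorm.values())

for h, u in unnorm.items():
    print(f"P({h} | +) = {u / p_pos:.3f}")    # cancer ~ 0.21, not cancer ~ 0.79

h_map = max(unnorm, key=unnorm.get)           # argmax over hypotheses
print("h_MAP given a positive test:", h_map)  # "not cancer"
```

Note that the normalizing term P(D) never changes which hypothesis wins, which is why it can be dropped from the argmax.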
Evolution of Posterior Probabilities
As data is added, certainty about the hypotheses increases. What is the effect on entropy?

Real-Valued Functions
Consider any real-valued target function f. Training examples are <x_i, d_i>, where d_i is a noisy training value:
d_i = f(x_i) + e_i
e_i is a random variable (noise) drawn independently for each x_i according to some Gaussian distribution with mean 0.
Then the maximum likelihood hypothesis h_ML is the one that minimizes the sum of squared errors:
h_ML = argmin_{h in H} Σ_{i=1}^{m} (d_i - h(x_i))^2

MAP/Least Squares Proof
(assuming a uniform prior over H, so maximizing the posterior reduces to maximizing the likelihood)
h_MAP = argmax_{h in H} P(h|D)
      = argmax_{h in H} P(D|h)
      = argmax_{h in H} Π_{i=1}^{m} (1/sqrt(2πσ^2)) exp(-(1/2)((d_i - h(x_i))/σ)^2)
      = argmax_{h in H} Σ_{i=1}^{m} [ln(1/sqrt(2πσ^2)) - (1/2)((d_i - h(x_i))/σ)^2]
      = argmax_{h in H} Σ_{i=1}^{m} -(1/2)((d_i - h(x_i))/σ)^2
      = argmax_{h in H} Σ_{i=1}^{m} -(d_i - h(x_i))^2
      = argmin_{h in H} Σ_{i=1}^{m} (d_i - h(x_i))^2
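A minimal numerical sketch of this equivalence, assuming an illustrative linear target f(x) = 2x + 1 and noise level σ = 0.1 (both are assumptions, not from the slides): the least-squares fit also maximizes the Gaussian log-likelihood over nearby linear hypotheses.

```python
# Under zero-mean Gaussian noise, the least-squares fit is the ML hypothesis.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)
d = 2 * x + 1 + rng.normal(0, 0.1, size=x.shape)   # d_i = f(x_i) + e_i

# Least-squares fit over the hypothesis class H = {linear functions}.
slope, intercept = np.polyfit(x, d, 1)

def log_likelihood(w1, w0, sigma=0.1):
    """Gaussian log-likelihood of the data under hypothesis h(x) = w1*x + w0."""
    resid = d - (w1 * x + w0)
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - 0.5 * (resid / sigma) ** 2)

# The least-squares hypothesis scores at least as high as a nearby alternative.
print(log_likelihood(slope, intercept) >= log_likelihood(2.1, 0.9))   # True
```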
Predicting Probabilities
Consider predicting survival probability from patient data. Training examples are <x_i, d_i>, where d_i is 1 or 0. We want to train a neural network to output a probability given x_i (not a hard 0 or 1).
In this case, one can show
h_ML = argmax_{h in H} Σ_{i=1}^{m} [d_i ln h(x_i) + (1 - d_i) ln(1 - h(x_i))]
Weight update rule for a sigmoid unit:
w_jk ← w_jk + Δw_jk, where Δw_jk = η Σ_{i=1}^{m} (d_i - h(x_i)) x_ijk

Minimum Description Length Principle
Occam's razor: prefer the shortest hypothesis.
MDL: prefer the hypothesis h that minimizes
h_MDL = argmin_{h in H} (L_C1(h) + L_C2(D|h))
where L_C(x) is the description length of x under encoding C.

MDL Example
Example: H = decision trees, D = training data labels.
L_C1(h) is the number of bits to describe tree h.
L_C2(D|h) is the number of bits to describe D given h. Note that L_C2(D|h) = 0 if the examples are classified perfectly by h; we need only describe the exceptions.
Hence h_MDL trades off tree size against training errors.
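A minimal sketch of this weight update for a single sigmoid unit, assuming batch gradient ascent on the log-likelihood above; the learning rate, epoch count, and toy data are illustrative assumptions rather than values from the slides.

```python
# Train one sigmoid unit so its output approximates P(d = 1 | x).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_sigmoid_unit(X, d, eta=0.1, epochs=1000):
    """X: (m, n) inputs, d: (m,) 0/1 targets. Returns weight vector w of shape (n,)."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        h = sigmoid(X @ w)            # h(x_i) for every training example
        w += eta * X.T @ (d - h)      # Δw_j = η Σ_i (d_i - h(x_i)) x_ij
    return w

# Toy data: first column is a constant bias input.
X = np.array([[1, 0.2], [1, 0.9], [1, 0.4], [1, 0.8]])
d = np.array([0, 1, 0, 1])
w = train_sigmoid_unit(X, d)
print(sigmoid(X @ w))   # outputs are probabilities, not hard 0/1 labels
```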
MDL Justification
h_MAP = argmax_{h in H} P(D|h) P(h)
      = argmax_{h in H} (log2 P(D|h) + log2 P(h))
      = argmin_{h in H} (-log2 P(D|h) - log2 P(h))
From information theory: the optimal (shortest expected coding length) code for an event with probability p uses -log2 p bits.
So, prefer the hypothesis that minimizes length(h) + length(misclassifications).

Classifying New Instances
So far we've sought the most probable hypothesis given the data D (i.e., h_MAP). Given a new instance x, what is its most probable classification? h_MAP(x) is not necessarily the most probable classification!

Classification Example
Consider three possible hypotheses: P(h_1|D) = .4, P(h_2|D) = .3, P(h_3|D) = .3.
Given a new instance x: h_1(x) = +, h_2(x) = -, h_3(x) = -.
What is h_MAP(x)? What is the most probable classification of x?

Bayes Optimal Classifier
Bayes optimal classification:
argmax_{v_j in V} Σ_{h_i in H} P(v_j|h_i) P(h_i|D)
Example:
P(h_1|D) = .4, P(-|h_1) = 0, P(+|h_1) = 1
P(h_2|D) = .3, P(-|h_2) = 1, P(+|h_2) = 0
P(h_3|D) = .3, P(-|h_3) = 1, P(+|h_3) = 0
Therefore
Σ_{h_i in H} P(+|h_i) P(h_i|D) = .4
Σ_{h_i in H} P(-|h_i) P(h_i|D) = .6
so the Bayes optimal classification is -, even though the MAP hypothesis h_1 classifies x as +.
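A minimal sketch of the Bayes optimal vote for the three-hypothesis example above; the dictionary and function names are illustrative.

```python
# Bayes optimal classification: weight each hypothesis's prediction by its posterior.
posterior = {"h1": 0.4, "h2": 0.3, "h3": 0.3}            # P(h_i | D)
p_label_given_h = {                                       # P(v_j | h_i)
    "h1": {"+": 1.0, "-": 0.0},
    "h2": {"+": 0.0, "-": 1.0},
    "h3": {"+": 0.0, "-": 1.0},
}

def bayes_optimal(labels=("+", "-")):
    # Sum P(v_j | h_i) P(h_i | D) over hypotheses, then take the best label.
    score = {v: sum(p_label_given_h[h][v] * posterior[h] for h in posterior)
             for v in labels}
    return max(score, key=score.get), score

label, score = bayes_optimal()
print(score)   # {'+': 0.4, '-': 0.6}
print(label)   # '-' even though the MAP hypothesis h1 predicts '+'
```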
Gibbs Classifier
The Bayes optimal classifier provides the best result, but it can be expensive when there are many hypotheses.
Gibbs algorithm:
1. Choose one hypothesis at random, according to P(h|D)
2. Use this one hypothesis to classify the new instance

Error of Gibbs
(Not so) surprising fact: assume target concepts are drawn at random from H according to the priors on H. Then:
E[error_Gibbs] ≤ 2 E[error_BayesOptimal]
Suppose we have a correct, uniform prior distribution over H. Then pick any hypothesis consistent with the data, with uniform probability; its expected error is no worse than twice that of the Bayes optimal classifier.

Naive Bayes Classifier
Along with decision trees, neural networks, and kNN, this is one of the most practical and most used learning methods.
When to use: a moderate or large training set is available, and the attributes that describe instances are conditionally independent given the classification.
Successful applications: diagnosis, classifying text documents.

Naive Bayes Classifier
Assume a target function f : X → V, where each instance x is described by attributes <a_1, a_2, ..., a_n>. The most probable value of f(x) is:
v_MAP = argmax_{v_j in V} P(v_j | a_1, a_2, ..., a_n)
      = argmax_{v_j in V} P(a_1, a_2, ..., a_n | v_j) P(v_j) / P(a_1, a_2, ..., a_n)
      = argmax_{v_j in V} P(a_1, a_2, ..., a_n | v_j) P(v_j)
Naïve Bayes Assumption
P(a_1, a_2, ..., a_n | v_j) = Π_i P(a_i | v_j), which gives the naive Bayes classifier:
v_NB = argmax_{v_j in V} P(v_j) Π_i P(a_i | v_j)
Note: no search during training!

Naïve Bayes Algorithm
Naive_Bayes_Learn(examples):
  For each target value v_j
    P̂(v_j) ← estimate P(v_j)
    For each attribute value a_i of each attribute a
      P̂(a_i | v_j) ← estimate P(a_i | v_j)
Classify_New_Instance(x):
  v_NB = argmax_{v_j in V} P̂(v_j) Π_i P̂(a_i | v_j)

Naïve Bayes: Example
Consider PlayTennis again, and the new instance
<Outlook = sunny, Temp = cool, Humid = high, Wind = strong>
We want to compute v_NB = argmax_{v_j in V} P(v_j) Π_i P(a_i | v_j):
P(yes) P(sunny|yes) P(cool|yes) P(high|yes) P(strong|yes) = .005
P(no)  P(sunny|no)  P(cool|no)  P(high|no)  P(strong|no)  = .021
So v_NB = no.

Naïve Bayes: Subtleties
1. The conditional independence assumption is often violated:
P(a_1, a_2, ..., a_n | v_j) = Π_i P(a_i | v_j)
...but it works surprisingly well anyway. Note that we don't need the estimated posteriors P̂(v_j|x) to be correct; we need only that
argmax_{v_j in V} P̂(v_j) Π_i P̂(a_i | v_j) = argmax_{v_j in V} P(v_j | a_1, a_2, ..., a_n)
See Domingos & Pazzani [1996] for analysis. Naïve Bayes posteriors are often unrealistically close to 1 or 0.
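A minimal sketch of the classification step for the PlayTennis instance above. The conditional probability estimates are assumed to come from the standard PlayTennis table in the textbook (not reproduced on these slides) and are taken here as given; the variable names are illustrative.

```python
# Naive Bayes classification of <sunny, cool, high, strong>.
p_class = {"yes": 9/14, "no": 5/14}                       # P̂(v_j)
p_attr = {                                                 # P̂(a_i | v_j)
    "yes": {"sunny": 2/9, "cool": 3/9, "high": 3/9, "strong": 3/9},
    "no":  {"sunny": 3/5, "cool": 1/5, "high": 4/5, "strong": 3/5},
}

instance = ["sunny", "cool", "high", "strong"]

scores = {}
for v in p_class:
    score = p_class[v]
    for a in instance:
        score *= p_attr[v][a]          # P̂(v_j) * Π_i P̂(a_i | v_j)
    scores[v] = score

print(scores)                          # {'yes': ~0.005, 'no': ~0.021}
print(max(scores, key=scores.get))     # 'no', matching v_NB on the slide
```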
Naïve Bayes: Subtleties
2. What if none of the training instances with target value v_j have attribute value a_i? Then
P̂(a_i | v_j) = 0, and P̂(v_j) Π_i P̂(a_i | v_j) = 0
The solution is a Bayesian estimate (the m-estimate):
P̂(a_i | v_j) = (n_c + m p) / (n + m)
where n is the number of training examples for which v = v_j, n_c is the number of examples for which v = v_j and a = a_i, p is a prior estimate for P(a_i | v_j), and m is the weight given to the prior (i.e., the number of virtual examples).
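A minimal sketch of the m-estimate; n_c, n, p, and m follow the definitions on the slide, and the counts in the usage example are hypothetical.

```python
# Smoothed estimate of P(a_i | v_j) using m "virtual" examples with prior p.
def m_estimate(n_c, n, p, m):
    """(n_c + m*p) / (n + m): never zero as long as p > 0 and m > 0."""
    return (n_c + m * p) / (n + m)

# Example: suppose no "overcast" outlooks appear among 5 "no" examples.
# With a uniform prior p = 1/3 over three outlook values and m = 3 virtual
# examples, the estimate is small but nonzero, so the naive Bayes product
# above can no longer collapse to 0.
print(m_estimate(n_c=0, n=5, p=1/3, m=3))   # 0.125
```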