Bayesian Learning: Examples, Conditional Probability, Two Roles for Bayesian Methods, Prior Probability and Random Variables, the Chain Rule

Bayesian Learning

Bayesian learning algorithms use probability theory as an approach to concept classification. Bayesian approaches typically do not generate an explicit hypothesis; statistical data is used to represent implicit hypotheses. Bayesian classifiers generally produce probabilities for (possibly multiple) class assignments, rather than a single definite classification. To use Bayesian techniques, we often need to make independence assumptions that often aren't valid (but we make them anyway!).

Two Roles for Bayesian Methods

1. Provides practical learning algorithms:
   - Naive Bayes learning
   - Bayesian belief network learning
   - Combine prior knowledge (prior probabilities) with observed data
   - Requires prior probabilities
2. Provides a useful conceptual framework:
   - Provides a "gold standard" for evaluating other learning algorithms
   - Additional insight into Occam's razor

Prior Probability and Random Variables

P(A) represents the prior or unconditional probability that statement A is true, in the absence of other information. A random variable represents the different outcomes of an event. Each random variable has a domain of possible values x_1, ..., x_n, each of which has a probability. The probabilities for all possible outcomes must sum to 1. For example:

P(Disease=CAVITY) = 0.5
P(Disease=GUM DISEASE) = 0.3
P(Disease=IMPACTED TOOTH) = 0.1
P(Disease=ROOT INFECTION) = 0.1

Examples

My mood can take 2 possible values: happy, sad. The weather can take 3 possible values: sunny, rainy, cloudy. My friends know me pretty well and say that:

P(Mood=happy | Weather=rainy) = 0.25
P(Mood=happy | Weather=sunny) = 0.4
P(Mood=happy | Weather=cloudy) = 0.05

Can you compute: P(Mood=happy)? P(Mood=sad)? P(Weather=rainy)?

Conditional Probability

P(A|B) represents the probability of A given that B is known to be true. We call this a conditional or posterior probability. P(A|B) = 1 is equivalent to B ⇒ A. For example, suppose Rover rarely howls: P(Rover howls) = 0.01. But when there is a full moon, he always howls! Then P(Rover howls | full moon) = 1.0.

The Chain Rule

P(A|B) = P(A ∧ B) / P(B)

We can rewrite this as:

P(A ∧ B) = P(A|B) P(B)

which is called the chain rule because we can chain together probabilities to compute the likelihood of conjunctions. Example:

P(A) = 0.5
P(B|A) = 0.6
P(C|A ∧ B) = 0.8

What is P(A ∧ B ∧ C)?
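The chain-rule question above can be answered by multiplying the given factors. A minimal sketch in Python, using only the numbers from the example:

```python
# Chain rule: P(A ∧ B ∧ C) = P(A) * P(B | A) * P(C | A ∧ B)
p_a = 0.5           # P(A)
p_b_given_a = 0.6   # P(B | A)
p_c_given_ab = 0.8  # P(C | A ∧ B)

p_abc = p_a * p_b_given_a * p_c_given_ab
print(p_abc)  # ≈ 0.24
```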

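The mood/weather exercise above cannot be answered from the conditionals alone: P(Mood=happy) also requires a prior distribution over the weather (and P(Weather=rainy) cannot be recovered from these numbers at all). The sketch below applies the theorem of total probability under an assumed, purely illustrative weather prior; the conditionals are the ones from the slide.

```python
# Theorem of total probability:
#   P(Mood=happy) = sum_w P(Mood=happy | Weather=w) * P(Weather=w)
# The conditionals come from the example; the weather prior below is an
# ASSUMED, illustrative distribution -- the slide deliberately omits it.
p_happy_given = {"rainy": 0.25, "sunny": 0.4, "cloudy": 0.05}
p_weather = {"rainy": 0.3, "sunny": 0.5, "cloudy": 0.2}  # hypothetical prior

p_happy = sum(p_happy_given[w] * p_weather[w] for w in p_weather)
p_sad = 1.0 - p_happy  # Mood only takes the two values happy/sad
print(p_happy, p_sad)  # 0.285 0.715 under the assumed prior
```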
Bayes Theorem Applied to Machine Learning

P(h|D) = P(D|h) P(h) / P(D)

- P(h) = prior probability of hypothesis h
- P(D) = prior probability of training data D
- P(h|D) = probability of h given D
- P(D|h) = probability of D given h

Bayes Rule

Simple form:

P(A|B) = P(A) P(B|A) / P(B)

General form:

P(A|B,X) = P(A|X) P(B|A,X) / P(B|X)

Choosing Hypotheses

Generally we want the most probable hypothesis given the training data. The maximum a posteriori hypothesis h_MAP:

h_MAP = argmax_{h ∈ H} P(h|D) = argmax_{h ∈ H} P(D|h) P(h) / P(D) = argmax_{h ∈ H} P(D|h) P(h)

If we assume P(h_i) = P(h_j) for all i, j, then we can simplify further. The maximum likelihood hypothesis:

h_ML = argmax_{h_i ∈ H} P(D|h_i)

Basic Formulas for Probabilities

- Product (or chain) rule: probability of a conjunction of two events:
  P(A ∧ B) = P(A|B) P(B) = P(B|A) P(A)
- Sum rule: probability of a disjunction of two events:
  P(A ∨ B) = P(A) + P(B) − P(A ∧ B)
- Theorem of total probability: if events A_1, ..., A_n are mutually exclusive with Σ_{i=1}^n P(A_i) = 1, then
  P(B) = Σ_{i=1}^n P(B|A_i) P(A_i)

Increasing Probability with More Data

[Figure: three panels (a), (b), (c) plotting P(h), P(h|D1), and P(h|D1,D2) over the space of hypotheses, showing the posterior concentrating as more data is observed.]

Bayes Rule Example

Let S = patient has a stiff neck, M = disease is meningitis. Assume that:

P(S|M) = 0.5
P(M) = 1/5000
P(S) = 1/20

P(M|S) = ?
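The stiff-neck question follows directly from Bayes rule. A minimal sketch with the numbers given above:

```python
# Bayes rule: P(M | S) = P(S | M) * P(M) / P(S)
p_s_given_m = 0.5   # P(stiff neck | meningitis)
p_m = 1 / 5000      # P(meningitis)
p_s = 1 / 20        # P(stiff neck)

p_m_given_s = p_s_given_m * p_m / p_s
print(p_m_given_s)  # ≈ 0.002: a stiff neck is still very unlikely to be meningitis
```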

Brute Force MAP Hypothesis Learner

1. For each hypothesis h in H, calculate the posterior probability:
   P(h|D) = P(D|h) P(h) / P(D)
2. Output the hypothesis h_MAP with the highest posterior probability P(h|D).

Relation to Concept Learning

Consider our usual concept learning task: instance space X, hypothesis space H, training examples D. Consider the FIND-S learning algorithm (outputs the most specific hypothesis from the version space VS_{H,D}). What would Bayes rule produce as the MAP hypothesis? Does FIND-S output a MAP hypothesis?

Relation to Concept Learning, II

Assume a fixed set of instances x_1, ..., x_n, and assume D is the set of classifications c(x_1), ..., c(x_n). Choose P(D|h):

P(D|h) = 1 if h is consistent with D
P(D|h) = 0 otherwise

Choose P(h) to be the uniform distribution: P(h) = 1/|H| for all h in H. Then:

P(h|D) = 1/|VS_{H,D}| if h is consistent with D
P(h|D) = 0 otherwise

Most Probable Classification of New Instances

So far we've sought the most probable hypothesis given the data D (i.e., h_MAP). Given a new instance x, what is its most probable classification? h_MAP(x) is not necessarily the most probable classification! Consider three possible hypotheses:

P(h_1|D) = 0.4, P(h_2|D) = 0.3, P(h_3|D) = 0.3

Given a new instance x:

h_1(x) = +, h_2(x) = −, h_3(x) = −

What is the most probable classification of x?

Bayes Optimal Classifier

argmax_{c_j ∈ C} Σ_{h_i ∈ H} P(c_j|h_i) P(h_i|D)

Example:

P(h_1|D) = 0.4, P(−|h_1) = 0, P(+|h_1) = 1
P(h_2|D) = 0.3, P(−|h_2) = 1, P(+|h_2) = 0
P(h_3|D) = 0.3, P(−|h_3) = 1, P(+|h_3) = 0

therefore

Σ_i P(h_i|D) P(+|h_i) = 0.4
Σ_i P(h_i|D) P(−|h_i) = 0.6

and

argmax_{c_j ∈ C} Σ_i P(c_j|h_i) P(h_i|D) = −

Naive Bayes Classifier

One of the most practical learning methods, along with decision trees, neural networks, nearest neighbor methods, etc. Requires:

1. A moderate or large training set.
2. Attributes that describe instances should be conditionally independent given the classification.

Successful applications include diagnosis and text classification.
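Returning to the three-hypothesis example above, a minimal sketch of the Bayes optimal classification, showing how it can disagree with the single MAP hypothesis:

```python
# Bayes optimal classification weights each hypothesis's vote by its
# posterior, so the answer can differ from the MAP hypothesis's prediction.
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}    # P(h | D)
predictions = {"h1": "+", "h2": "-", "h3": "-"}   # h(x) for the new instance x

def bayes_optimal(posteriors, predictions, classes=("+", "-")):
    # argmax over classes of sum_h P(c | h) * P(h | D); here P(c | h) is 1
    # when h predicts c and 0 otherwise.
    score = {c: sum(p for h, p in posteriors.items() if predictions[h] == c)
             for c in classes}
    return max(score, key=score.get), score

label, score = bayes_optimal(posteriors, predictions)
print(score)  # {'+': 0.4, '-': 0.6}
print(label)  # '-' even though the MAP hypothesis h1 predicts '+'
```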

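The choices in "Relation to Concept Learning, II" (uniform prior, 0/1 likelihood) make the posterior uniform over the version space. A minimal brute-force MAP sketch over a toy, assumed hypothesis space (the predicates and data below are illustrative, not from the slides):

```python
# Brute-force MAP learning with a uniform prior P(h) = 1/|H| and
# P(D | h) = 1 iff h is consistent with D.  Each hypothesis here is just
# a Python predicate over integer instances (a made-up hypothesis space).
H = {
    "always_pos": lambda x: True,
    "gt2":        lambda x: x > 2,
    "even":       lambda x: x % 2 == 0,
}
D = [(4, True), (3, True), (1, False)]  # (instance, label) pairs, assumed data

def consistent(h, D):
    return all(h(x) == y for x, y in D)

def posterior(h, H, D):
    prior = 1.0 / len(H)
    likelihood = 1.0 if consistent(h, D) else 0.0
    evidence = sum((1.0 / len(H)) * (1.0 if consistent(g, D) else 0.0)
                   for g in H.values())
    return likelihood * prior / evidence

for name, h in H.items():
    print(name, posterior(h, H, D))
# Only "gt2" is consistent with D, so the version space has one member and
# that hypothesis gets posterior 1.0; the others get 0.
```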
Naive Bayes Classifier

Assume the target function f : X → C, where each instance x is described by attributes a_1, a_2, ..., a_n. The most probable value of f(x) is:

c_MAP = argmax_{c_j ∈ C} P(c_j | a_1, a_2, ..., a_n)
      = argmax_{c_j ∈ C} P(a_1, a_2, ..., a_n | c_j) P(c_j) / P(a_1, a_2, ..., a_n)
      = argmax_{c_j ∈ C} P(a_1, a_2, ..., a_n | c_j) P(c_j)

Naive Bayes Assumption

To make the problem tractable, we often need to make the following independence assumption:

P(a_1, a_2, ..., a_n | c_j) = Π_i P(a_i | c_j)

which allows us to define the Naive Bayes classifier:

c_NB = argmax_{c_j ∈ C} P(c_j) Π_i P(a_i | c_j)

Naive Bayes Algorithm

Naive_Bayes_Learn(examples):
  For each possible class c_j:
    P̂(c_j) ← estimate P(c_j)
    For each attribute value a_i of each attribute a:
      P̂(a_i|c_j) ← estimate P(a_i|c_j)

Classify_New_Instance(x):
  c_NB = argmax_{c_j ∈ C} P̂(c_j) Π_{a_i ∈ x} P̂(a_i|c_j)

Sparse Data

What if none of the training instances with class c_j have attribute value a_i? Then P̂(a_i|c_j) = 0, and so P̂(c_j) Π_i P̂(a_i|c_j) = 0!

The typical solution is the m-estimate for P̂(a_i|c_j):

P̂(a_i|c_j) ← (n_c + m p) / (n + m)

where:
- n is the number of training examples for which c = c_j
- n_c is the number of examples for which c = c_j and a = a_i
- p is a prior estimate for P̂(a_i|c_j)
- m is the weight given to the prior (i.e., the number of "virtual" examples)

Naive Bayes Example

Consider a medical diagnosis problem with 3 possible diagnoses (Well, Cold, Allergy) and 3 symptoms (sneeze, cough, fever).

                Well    Cold    Allergy
P(d)            0.9     0.05    0.05
P(sneeze|d)     0.1     0.9     0.9
P(cough|d)      0.1     0.8     0.7
P(fever|d)      0.01    0.7     0.4

P(Well | sneeze ∧ cough ∧ fever) = ?

Example, II

Using the same table:

P(Cold | sneeze ∧ cough ∧ fever) = ?
P(Allergy | sneeze ∧ cough ∧ fever) = ?
P(sneeze ∧ cough ∧ fever) = ?
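A minimal sketch of the Naive Bayes computation on the Well/Cold/Allergy table above, treating all three symptoms as observed true (as the queries are written); the normalization constant is the naive-assumption estimate of P(sneeze ∧ cough ∧ fever). The table values are from the slides; the code structure itself is just one illustrative way to organize them.

```python
# Naive Bayes on the Well/Cold/Allergy example.
priors = {"Well": 0.9, "Cold": 0.05, "Allergy": 0.05}
cond = {  # P(symptom = true | diagnosis), from the table above
    "sneeze": {"Well": 0.1,  "Cold": 0.9, "Allergy": 0.9},
    "cough":  {"Well": 0.1,  "Cold": 0.8, "Allergy": 0.7},
    "fever":  {"Well": 0.01, "Cold": 0.7, "Allergy": 0.4},
}
instance = {"sneeze": True, "cough": True, "fever": True}

def naive_bayes(instance):
    # unnormalized score for each class: P(c) * prod_i P(a_i | c)
    scores = {}
    for c, p_c in priors.items():
        s = p_c
        for attr, value in instance.items():
            p = cond[attr][c]
            s *= p if value else (1.0 - p)
        scores[c] = s
    z = sum(scores.values())  # estimate of P(a_1, ..., a_n)
    return {c: s / z for c, s in scores.items()}, z

posterior, evidence = naive_bayes(instance)
print(posterior)  # Cold ≈ 0.67 is most probable, despite Well's 0.9 prior
print(evidence)   # ≈ 0.038 for P(sneeze ∧ cough ∧ fever)
```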

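The m-estimate itself is a one-line formula; a small sketch with made-up counts shows how it keeps a zero-count probability strictly positive:

```python
# m-estimate of a conditional probability, guarding against zero counts:
#   P_hat(a_i | c_j) = (n_c + m * p) / (n + m)
def m_estimate(n_c, n, p, m):
    """n_c: examples with class c_j AND attribute value a_i
    n:   examples with class c_j
    p:   prior estimate for P(a_i | c_j)
    m:   equivalent sample size (number of 'virtual' examples)"""
    return (n_c + m * p) / (n + m)

# If no training example of some class happened to have the value, the raw
# estimate would be 0/10 = 0; with p = 0.5 and m = 2 virtual examples the
# smoothed estimate stays positive.  (The counts here are invented.)
print(m_estimate(n_c=0, n=10, p=0.5, m=2))  # ≈ 0.083
```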
Comments on Naive Bayes

- Generally works quite well despite the blanket assumption of independence.
- Experiments show it can be quite competitive with other methods (e.g., C4.5) on standard data sets.
- Although it does not estimate probabilities accurately when independence is violated, it still often picks the correct category.
- Does not perform search! Has a fairly strong bias.
- The hypothesis is not guaranteed to fit the training data.
- Handles noise well.

Minimum Description Length Principle

Occam's razor: prefer the shortest hypothesis. MDL: prefer the hypothesis that minimizes:

h_MDL = argmin_{h ∈ H} L_C1(h) + L_C2(D|h)

where L_C(x) is the description length of x under encoding C. For example: H = decision trees, D = labels from the training data.

- L_C1(h) is the number of bits to describe tree h
- L_C2(D|h) is the number of bits to describe D given h

Note: L_C2(D|h) = 0 if the examples are classified perfectly by h; we need only describe the exceptions. Hence h_MDL trades off tree size for training errors.

MDL, Continued

h_MAP = argmax_{h ∈ H} log2 P(D|h) + log2 P(h)
      = argmin_{h ∈ H} −log2 P(D|h) − log2 P(h)    (1)

Interesting fact from information theory: the optimal (shortest expected coding length) code for an event with probability p is −log2 p bits. So interpret (1):

- −log2 P(h) is the length of h under the optimal code
- −log2 P(D|h) is the length of D given h under the optimal code

By Occam's razor, prefer the hypothesis that minimizes length(h) + length(misclassifications).

Bayesian Belief Networks

- The Naive Bayes conditional independence assumption is sometimes still too restrictive.
- But it's impossible to estimate all parameters without some such assumptions.
- Bayesian belief networks describe conditional independence among subsets of variables.
- This allows combining prior knowledge about (in)dependencies with observed training data.

Conditional Independence

Definition: X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y given the value of Z; that is, if

(∀ x_i, y_j, z_k) P(X = x_i | Y = y_j, Z = z_k) = P(X = x_i | Z = z_k)

Abbreviated: P(X|Y,Z) = P(X|Z).

Example: Thunder is conditionally independent of Rain, given Lightning:

P(Thunder | Rain, Lightning) = P(Thunder | Lightning)

Naive Bayes uses this to justify:

P(X,Y|Z) = P(X|Y,Z) P(Y|Z) = P(X|Z) P(Y|Z)

Bayesian Belief Network

[Figure: a directed acyclic graph with nodes Storm, BusTourGroup, Lightning, Campfire, Thunder, and ForestFire, together with the conditional probability table for Campfire (C) given Storm (S) and BusTourGroup (B):]

           S,B    S,¬B   ¬S,B   ¬S,¬B
C          0.4    0.1    0.8    0.2
¬C         0.6    0.9    0.2    0.8

The network represents a set of conditional independence assertions:

- Each node is asserted to be conditionally independent of its nondescendants, given its immediate predecessors.
- Directed acyclic graph.
- Each node has an associated conditional probability table.
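The MDL reading of MAP above can be made concrete by turning probabilities into code lengths with −log2. The prior and likelihood values below are invented numbers for two hypothetical hypotheses, just to show the trade-off:

```python
# Information-theoretic view of MAP: with optimal codes, maximizing
# P(D|h)P(h) is the same as minimizing -log2 P(D|h) - log2 P(h),
# i.e. bits(D given h) + bits(h).
import math

hypotheses = {  # made-up numbers for illustration
    "small_tree": {"prior": 0.10,  "likelihood": 0.02},  # short h, more errors
    "big_tree":   {"prior": 0.001, "likelihood": 0.50},  # long h, fewer errors
}

def description_length(prior, likelihood):
    return -math.log2(likelihood) - math.log2(prior)

for name, h in hypotheses.items():
    print(name, round(description_length(h["prior"], h["likelihood"]), 2))
# The MDL choice is the hypothesis with the smaller total code length,
# which is exactly the MAP hypothesis under these numbers (small_tree).
```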

Interpretation

The network and its associated tables represent the joint probability distribution over all variables and their values. In general,

P(y_1, ..., y_n) = Π_{i=1}^n P(y_i | Parents(Y_i))

where Parents(Y_i) denotes the immediate predecessors of Y_i in the graph. Therefore, the joint distribution is fully defined by the graph plus the conditional probabilities. Example: P(S, L, T, C, F, B) = ?

Learning of Bayesian Networks

There are several variants of this learning task:

- The network structure might be known or unknown.
- The training examples might provide values of all network variables, or just some.

If the structure is known and all variables are observed, then the problem reduces to training as with a Naive Bayes classifier.

Learning Bayes Nets

Suppose the structure is known and the variables are only partially observable. This is similar to training a neural network with hidden units. In fact, we can learn Bayes net conditional probability tables using gradient ascent, converging to the network h that (locally) maximizes P(D|h).

Summary: Bayesian Belief Networks

- Combine prior knowledge with observed data.
- The impact of prior knowledge (when correct!) is to lower the sample complexity.
- Active research areas: extending from boolean to real-valued variables, parameterized distributions instead of tables, extending to first-order instead of propositional systems, more effective inference methods, ...

Statistical Speech Recognition Model

What is the most likely sequence of words that the speech signal represents?

P(words | signal) = P(words) P(signal | words) / P(signal)

- P(words) is the language model. For example, "fat cat" is more likely than "hat cat".
- P(signal | words) is the acoustic model. For example, "cat" is likely to be pronounced as "kæt".
- P(signal) is the likelihood of the speech signal, but this is constant across all interpretations of the signal, so we can ignore it.

The Language Model (P(words))

Ideally, we'd like a complete model of the English language to tell us the exact probability of a sentence. But there is no complete model of English (or any other natural language): you'd need to know the probability of every sentence that could ever be uttered! A grammar should be helpful, but there is no complete grammar for English, and spoken language is notoriously ungrammatical anyway. Statistical speech recognition systems approximate the likelihood of sentences by collecting n-gram statistics of word usage from large spoken language corpora.
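The factorization in the Interpretation slide can be evaluated directly once every node's conditional probability table is known. A minimal sketch for the Storm/Campfire network above: the Campfire table is the one given earlier, while every other probability below is an assumed placeholder, included only to make the factorization runnable.

```python
# Joint probability from the network factorization
#   P(S, B, L, C, T, F) = P(S) P(B) P(L|S) P(C|S,B) P(T|L) P(F|S,L,C)
# Only the Campfire table comes from the slides; all other numbers are
# ASSUMED placeholders.
p_storm = 0.2
p_bus = 0.5
p_lightning_given_storm = {True: 0.7, False: 0.05}
p_campfire_given = {  # P(Campfire=True | Storm, BusTourGroup), from the table
    (True, True): 0.4, (True, False): 0.1,
    (False, True): 0.8, (False, False): 0.2,
}
p_thunder_given_lightning = {True: 0.9, False: 0.01}
p_forestfire_given = {  # P(ForestFire=True | Storm, Lightning, Campfire), assumed
    (True, True, True): 0.5,  (True, True, False): 0.3,
    (True, False, True): 0.2, (True, False, False): 0.05,
    (False, True, True): 0.3, (False, True, False): 0.1,
    (False, False, True): 0.1, (False, False, False): 0.01,
}

def bern(p_true, value):
    # probability of a boolean value given P(value = True)
    return p_true if value else 1.0 - p_true

def joint(S, B, L, C, T, F):
    return (bern(p_storm, S) * bern(p_bus, B)
            * bern(p_lightning_given_storm[S], L)
            * bern(p_campfire_given[(S, B)], C)
            * bern(p_thunder_given_lightning[L], T)
            * bern(p_forestfire_given[(S, L, C)], F))

# e.g. all six variables true
print(joint(True, True, True, True, True, True))  # ≈ 0.0126 under these CPTs
```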

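The speech-model equation above is a noisy-channel ranking: since P(signal) is constant across candidates, word sequences can be compared by P(words) × P(signal | words) alone. A tiny sketch with invented, purely illustrative scores:

```python
# Noisy-channel scoring: P(words | signal) ∝ P(words) * P(signal | words).
# All scores below are invented numbers, just to show the ranking step.
candidates = {
    "fat cat": {"language_model": 1e-5, "acoustic_model": 0.30},
    "hat cat": {"language_model": 1e-7, "acoustic_model": 0.35},
}

def rank(candidates):
    scored = {w: m["language_model"] * m["acoustic_model"]
              for w, m in candidates.items()}
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)

print(rank(candidates))
# "fat cat" wins: its slightly worse acoustic score is outweighed by the
# far more probable word sequence.
```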
N-grams

- A unigram represents a single word. We estimate: P(w) = frequency of w in the corpus / number of words in the corpus.
- A bigram represents a pair of sequential words. We estimate: P(w_2|w_1) = frequency of w_2 following w_1 / frequency of w_1.
- A trigram represents a triple of sequential words. We estimate: P(w_3|w_1,w_2) = frequency of w_3 following w_1, w_2 / frequency of the pair w_1, w_2.

Unigram/Bigram Statistics

Unigram count for each word, and the number of times it appears immediately after each of the previous words OF, IN, IS, ON, TO, FROM, THAT, WITH, LINE, VISION:

Word     Unigram   OF    IN    IS    ON    TO    FROM   THAT   WITH   LINE   VISION
THE      367       179   143   44    44    65    35     30     17     0      0
ON       69        0     0     1     0     0     0      0      0      0      0
OF       281       0     0     2     0     1     0      3      0      4      0
TO       212       0     0     19    0     0     0      0      0      0      1
IS       175       0     0     0     0     0     0      13     0      1      3
A        153       36    36    33    23    21    14     3      15     0      0
THAT     124       0     3     18    0     1     0      0      0      0      0
WE       105       0     0     0     1     0     0      12     0      0      0
LINE     17        1     0     0     0     1     0      0      0      0      0
VISION   13        3     0     0     1     0     1      0      0      0      0

The N-gram Language Model

Ideally, we'd like to compute:

P(w_1 ... w_n) = P(w_1) P(w_2|w_1) P(w_3|w_1 w_2) ... P(w_n|w_1 ... w_{n-1})

But we rarely have the data necessary to compute those statistics reliably. So we estimate by making independence assumptions and using bigrams (or trigrams). The bigram model is:

P(w_1 ... w_n) = P(w_1) P(w_2|w_1) P(w_3|w_2) ... P(w_n|w_{n-1}) = P(w_1) Π_{i=2}^n P(w_i|w_{i-1})

It does amazingly well at predicting the next word!

Summary

- Probabilistic inference lends power and flexibility, and can incorporate prior probabilistic knowledge.
- We can make statistically invalid assumptions and still get reasonably good results (Naive Bayes).
- It provides insights into algorithms that don't even use probabilities (explicitly).
- Bayesian belief networks let us incorporate even more types of prior knowledge than standard Bayes techniques.
- Many of these techniques have proven their success in many application areas.
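As a small coda to the n-gram slides, a minimal bigram model estimated by counting, as in the N-grams slide; the corpus is an invented toy example:

```python
# Tiny bigram language model: P(w2 | w1) = count(w1 w2) / count(w1).
from collections import Counter

corpus = "the cat sat on the mat the cat ate".split()  # invented toy corpus

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_bigram(w1, w2):
    return bigrams[(w1, w2)] / unigrams[w1]

def sentence_prob(words):
    # bigram model: P(w1 ... wn) = P(w1) * prod_i P(w_i | w_{i-1})
    p = unigrams[words[0]] / len(corpus)
    for w1, w2 in zip(words, words[1:]):
        p *= p_bigram(w1, w2)
    return p

print(p_bigram("the", "cat"))              # 2/3: "the" is followed by "cat" twice
print(sentence_prob("the cat sat".split()))  # (3/9) * (2/3) * (1/2) ≈ 0.111
```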