Today

Statistical Learning
- Parameter Estimation: Maximum Likelihood (ML), Maximum A Posteriori (MAP), Bayesian
- Continuous case
Learning Parameters for a Bayesian Network
- Naive Bayes
- Maximum Likelihood estimates
- Priors
Learning Structure of Bayesian Networks

Coin Flip

Three coins with different biases:
P(H|C1) = 0.1    P(H|C2) = 0.5    P(H|C3) = 0.9

Which coin will I use?

P(C1) = 1/3    P(C2) = 1/3    P(C3) = 1/3
Prior: probability of a hypothesis before we make any observations.
Uniform prior: all hypotheses are equally likely before we make any observations.

Experiment 1: Heads

P(C1|H) = ?    P(C2|H) = ?    P(C3|H) = ?

P(Ci|H) = P(H|Ci) P(Ci) / P(H),  where  P(H) = Σi P(H|Ci) P(Ci)

Result: P(C1|H) ≈ 0.067    P(C2|H) ≈ 0.333    P(C3|H) = 0.6
Posterior: probability of a hypothesis given data.
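The posterior update above can be checked with a short sketch; the coin biases and the uniform prior are the ones from the running example:

```python
# Posterior over the three coins after observing a single Heads.
# Biases and uniform prior follow the slides' running example.
p_heads = [0.1, 0.5, 0.9]          # P(H|C1), P(H|C2), P(H|C3)
prior = [1/3, 1/3, 1/3]            # uniform prior P(Ci)

unnorm = [ph * p for ph, p in zip(p_heads, prior)]
p_h = sum(unnorm)                  # P(H) = sum_i P(H|Ci) P(Ci)
posterior = [u / p_h for u in unnorm]

print([round(p, 3) for p in posterior])  # [0.067, 0.333, 0.6]
```

The same pattern (likelihood times prior, normalized) is reused for every update that follows.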
Experiment 2: Tails

P(C1|HT) = ?    P(C2|HT) = ?    P(C3|HT) = ?

P(Ci|HT) = α P(HT|Ci) P(Ci) = α P(H|Ci) P(T|Ci) P(Ci)

Result: P(C1|HT) ≈ 0.21    P(C2|HT) ≈ 0.58    P(C3|HT) ≈ 0.21

Your Estimate?

What is the probability of heads after two experiments?
Most likely coin: C2. Best estimate for P(H): P(H|C2) = 0.5.
Maximum Likelihood Estimate: the hypothesis that best fits the observed data, assuming a uniform prior.

Using Prior Knowledge

Should we always use a uniform prior?
Background knowledge: Heads => you go first in Abalone against the TA.
TAs are nice people => the TA is more likely to use a coin biased in your favor.
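Because the flips are conditionally independent given the coin, the likelihood of the whole sequence factors into per-flip terms, as a quick check confirms:

```python
# Posterior after the sequence H,T with the uniform prior.
# Flips are independent given the coin, so P(HT|Ci) = P(H|Ci) P(T|Ci).
p_heads = [0.1, 0.5, 0.9]
prior = [1/3, 1/3, 1/3]

lik = [ph * (1 - ph) for ph in p_heads]          # [0.09, 0.25, 0.09]
unnorm = [l * p for l, p in zip(lik, prior)]
posterior = [u / sum(unnorm) for u in unnorm]

print([round(p, 2) for p in posterior])  # [0.21, 0.58, 0.21]
```

The normalizing constant α never needs to be computed explicitly; dividing by the sum of the unnormalized terms does the same job.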
Using Prior Knowledge

We can encode that knowledge in the prior: P(C1) = 0.05    P(C2) = 0.25    P(C3) = 0.70

Experiment 1: Heads

P(Ci|H) = α P(H|Ci) P(Ci)

Result: P(C1|H) ≈ 0.007    P(C2|H) ≈ 0.164    P(C3|H) ≈ 0.829
(ML posterior after Experiment 1, for comparison: P(C1|H) ≈ 0.067, P(C2|H) ≈ 0.333, P(C3|H) = 0.6.)

Experiment 2: Tails

P(Ci|HT) = α P(HT|Ci) P(Ci) = α P(H|Ci) P(T|Ci) P(Ci)

Result: P(C1|HT) ≈ 0.035    P(C2|HT) ≈ 0.48    P(C3|HT) ≈ 0.485
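Repeating both updates with the informative prior shows how strongly the prior pulls the posterior toward C3:

```python
# Same updates as before, now with the informative prior P(C) = (0.05, 0.25, 0.70).
p_heads = [0.1, 0.5, 0.9]
prior = [0.05, 0.25, 0.70]

def posterior(lik, prior):
    unnorm = [l * p for l, p in zip(lik, prior)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

after_h = posterior(p_heads, prior)                             # after one Heads
after_ht = posterior([ph * (1 - ph) for ph in p_heads], prior)  # after Heads, Tails

print([round(p, 3) for p in after_h])   # [0.007, 0.164, 0.829]
print([round(p, 3) for p in after_ht])  # [0.035, 0.481, 0.485]
```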
Your Estimate?

What is the probability of heads after two experiments?
Most likely coin: C3. Best estimate for P(H): P(H|C3) = 0.9.
Maximum A Posteriori (MAP) Estimate: the hypothesis that best fits the observed data, assuming a non-uniform prior.

Did We Do The Right Thing?

P(C1|HT) ≈ 0.035    P(C2|HT) ≈ 0.48    P(C3|HT) ≈ 0.485
C2 and C3 are almost equally likely, yet the MAP estimate ignores C2 entirely.

A Better Estimate

Recall: P(H) = Σi P(H|Ci) P(Ci|HT) = 0.68
Bayesian Estimate: minimizes prediction error, given the data and (generally) assuming a non-uniform prior.
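The Bayesian estimate is the posterior-weighted average of the coin biases, and the slide's value of 0.68 checks out:

```python
# Bayesian estimate of P(H) after observing H,T with the informative prior:
# P(H) = sum_i P(H|Ci) P(Ci|HT), the posterior-weighted average of the biases.
p_heads = [0.1, 0.5, 0.9]
prior = [0.05, 0.25, 0.70]

unnorm = [ph * (1 - ph) * p for ph, p in zip(p_heads, prior)]
post = [u / sum(unnorm) for u in unnorm]          # P(Ci|HT)

p_h = sum(ph * w for ph, w in zip(p_heads, post))
print(round(p_h, 2))  # 0.68
```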
Comparison

ML (Maximum Likelihood): P(H) = 0.5; after 10 experiments (HTH^8): P(H) = 0.9
MAP (Maximum A Posteriori): P(H) = 0.9; after 10 experiments (HTH^8): P(H) = 0.9
Bayesian: P(H) = 0.68; after 10 experiments (HTH^8): P(H) ≈ 0.9

ML (Maximum Likelihood): easy to compute.
MAP (Maximum A Posteriori): still easy to compute; incorporates prior knowledge.
Bayesian: minimizes error => great when data is scarce; potentially much harder to compute.

Summary For Now

Prior: probability of a hypothesis before we see any data.
Likelihood: probability of the data given a hypothesis.
Uniform prior: a prior that makes all hypotheses equally likely.
Posterior: probability of a hypothesis after we have seen some data.
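The comparison numbers for the longer run of flips (HTH^8, i.e. 9 heads and 1 tail in total) can be reproduced directly; with that much data all three estimators agree:

```python
# ML, MAP and Bayesian estimates of P(H) after 9 heads and 1 tail.
p_heads = [0.1, 0.5, 0.9]
informative_prior = [0.05, 0.25, 0.70]
n_h, n_t = 9, 1

lik = [ph**n_h * (1 - ph)**n_t for ph in p_heads]
unnorm = [l * p for l, p in zip(lik, informative_prior)]
post = [u / sum(unnorm) for u in unnorm]

ml_estimate = p_heads[max(range(3), key=lambda i: lik[i])]    # most likely coin by likelihood
map_estimate = p_heads[max(range(3), key=lambda i: post[i])]  # most likely coin by posterior
bayes_estimate = sum(ph * w for ph, w in zip(p_heads, post))  # posterior-weighted average

print(ml_estimate, map_estimate, round(bayes_estimate, 2))  # 0.9 0.9 0.9
```

This illustrates the slide's point in reverse: the estimators differ most when data is scarce, and converge as data accumulates.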
Continuous Case

In the previous example we chose from a discrete set of three coins. In general, we have to pick from a continuous distribution of biased coins.

Estimate            Prior     Hypothesis used         Output
ML Estimate         Uniform   The most likely         Point
MAP Estimate        Any       The most likely         Point
Bayesian Estimate   Any       Weighted combination    Average

[Figures: a prior over the coin bias; the posterior after Experiment 1 (Heads) and Experiment 2 (Tails), with a uniform prior and with background knowledge; the ML, MAP and Bayesian estimates marked on the posterior after 2 experiments.]
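The continuous case can be sketched by discretizing the bias θ on a fine grid. The data here (a single Heads) is a hypothetical mini-example chosen to make the point that a point estimate and a posterior average can disagree:

```python
# Posterior over a continuous coin bias theta, discretized on a grid.
# posterior(theta) is proportional to prior(theta) * theta^#H * (1-theta)^#T.
# Uniform prior; a single observed Heads (hypothetical data).
N = 1001
grid = [i / (N - 1) for i in range(N)]
n_h, n_t = 1, 0

weights = [t**n_h * (1 - t)**n_t for t in grid]   # uniform prior: posterior ∝ likelihood
z = sum(weights)

ml_estimate = grid[max(range(N), key=lambda i: weights[i])]   # peak of the posterior
bayes_mean = sum(t * w for t, w in zip(grid, weights)) / z    # average over the posterior

print(ml_estimate)            # 1.0
print(round(bayes_mean, 3))   # 0.667
```

After one Heads the ML estimate jumps to the extreme value 1.0, while the Bayesian average stays at the far more reasonable 2/3, matching the picture on the slides.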
[Figures: posteriors after 2 experiments, with a uniform prior and with background knowledge, with the ML, MAP and Bayesian estimates marked.]

Parameter Estimation and Bayesian Networks

We have: the Bayes net structure, plus a table of observed truth assignments for the variables E, B, R, A, J, M.
We need: the Bayes net parameters.

P(B) = ?
Prior + data => now compute either a MAP or a Bayesian estimate.

P(A|E,B) = ?    P(A|E,¬B) = ?    P(A|¬E,B) = ?    P(A|¬E,¬B) = ?
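Maximum-likelihood parameters for a Bayes-net node are just conditional frequencies in the data. The slide's data table did not survive transcription, so the rows below are hypothetical stand-ins with the same shape:

```python
# ML estimates of Bayes-net CPT entries from complete data: conditional counts.
# Columns: (E, B, A) = (Earthquake, Burglary, Alarm); rows are hypothetical.
data = [
    (True,  False, True),
    (False, False, False),
    (False, True,  True),
    (False, False, False),
    (True,  True,  True),
    (False, True,  False),
]

def p_a_given(e, b):
    rows = [r for r in data if r[0] == e and r[1] == b]
    return sum(r[2] for r in rows) / len(rows)   # fraction of matching rows with A = True

p_b = sum(r[1] for r in data) / len(data)        # P(B): B has no parents
print(p_b)                     # 0.5
print(p_a_given(False, True))  # 0.5
```

With sparse data some parent configurations may match very few rows (or none), which is exactly where the MAP and Bayesian estimates mentioned on the slide earn their keep.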
Parameter Estimation and Bayesian Networks

P(A|E,B) = ?    P(A|E,¬B) = ?    P(A|¬E,B) = ?    P(A|¬E,¬B) = ?
Prior + data => now compute either a MAP or a Bayesian estimate. You know the drill.

Naive Bayes

A Bayes net where all nodes are children of a single root node (e.g., Spam as the root, with word nodes such as FREE and apple).
Why?
- Expressive and accurate? No: it assumes every feature is independent of every other feature given the class, which is rarely true.
- Easy to learn? Yes.
- Useful? Sometimes.
Inference in Naive Bayes

Goal: given the evidence (the words in an email), decide whether the email is spam.

E = {A, ¬B, F, ¬K, ...}

P(S|E) = P(E|S) P(S) / P(E)
       = P(A, ¬B, F, ¬K, ... | S) P(S) / P(A, ¬B, F, ¬K, ...)

Independence to the rescue:

       = P(A|S) P(¬B|S) P(F|S) P(¬K|S) P(...|S) P(S) / [P(A) P(¬B) P(F) P(¬K) P(...)]

Classify as spam if P(S|E) > P(¬S|E). But the denominators are identical, so it suffices to compare

P(S|E)  ∝ P(A|S) P(¬B|S) P(F|S) P(¬K|S) P(...|S) P(S)
P(¬S|E) ∝ P(A|¬S) P(¬B|¬S) P(F|¬S) P(¬K|¬S) P(...|¬S) P(¬S)

Prior + Data => Maximum Likelihood Estimate

Can we calculate the Maximum Likelihood estimate of θ (a coin's bias) easily? We are looking for the maximum of a function:
- find the derivative
- set it to zero

What function are we maximizing? P(data | hypothesis), where hypothesis = h_θ (one hypothesis for each value of θ). For the observed sequence H, T, T, H, ...:

P(data | h_θ) = P(H|h_θ) P(T|h_θ) P(T|h_θ) P(H|h_θ) ... = θ (1-θ) (1-θ) θ ... = θ^#H (1-θ)^#T
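The naive Bayes spam decision described on these slides can be sketched as follows; all CPT values are hypothetical placeholders, since the slide's numbers are not fully legible in this transcript:

```python
# Naive Bayes spam decision. Since the denominator P(E) is shared by both
# classes, compare the unnormalized scores P(class) * prod_i P(e_i | class).
# All probability values here are hypothetical placeholders.
p_spam = 0.6
p_word = {  # word -> (P(word present | S), P(word present | not S))
    "free":  (0.8, 0.05),
    "apple": (0.2, 0.10),
}

def score(evidence, spam):
    s = p_spam if spam else 1 - p_spam
    for word, present in evidence.items():
        p = p_word[word][0 if spam else 1]
        s *= p if present else (1 - p)   # absent words contribute 1 - P(word|class)
    return s

email = {"free": True, "apple": False}
print(score(email, True) > score(email, False))  # True: classified as spam
```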
To find the θ that maximizes θ^#H (1-θ)^#T, we take the derivative of the function and set it to 0. We get:

θ = #H / (#H + #T)

You knew it already, right?
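The closed form θ = #H / (#H + #T) can be sanity-checked numerically against the likelihood θ^#H (1-θ)^#T:

```python
# Numerically confirm that theta^#H * (1-theta)^#T peaks at #H / (#H + #T).
n_h, n_t = 7, 3

def likelihood(t):
    return t**n_h * (1 - t)**n_t

grid = [i / 1000 for i in range(1001)]
t_ml = max(grid, key=likelihood)

print(t_ml)               # 0.7
print(n_h / (n_h + n_t))  # 0.7
```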
Problems With Small Samples

What happens if, in your training data, apples are never mentioned in any spam message? Then P(A|S) = 0.
Why is that bad? Because a single zero wipes out the whole product:

P(S|E) ∝ P(A|S) P(¬B|S) P(F|S) P(¬K|S) P(...|S) P(S) = 0

Smoothing

Smoothing is used when samples are small.
Add-one smoothing is the simplest smoothing method: just add 1 to every count!

Priors

Recall that P(S) = #S / (#S + #¬S).
If we have a slight hunch that P(S) ≈ p:
P(S) = (#S + p) / (#S + #¬S + 1)
If we have a big hunch that P(S) ≈ p:
P(S) = (#S + m·p) / (#S + #¬S + m)
where m can be any number > 0.
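Both smoothing rules above fit one pattern; a short sketch with hypothetical counts:

```python
# Add-one (Laplace) smoothing and the m-estimate from the slides.
def add_one(count, total, num_outcomes=2):
    return (count + 1) / (total + num_outcomes)

def m_estimate(count, total, p, m):
    # Acts as if m extra samples with mean p had been observed.
    return (count + m * p) / (total + m)

# A word that never appeared in spam no longer gets probability 0:
print(add_one(0, 100))               # 1/102, not 0.0
# A hunch that P(S) is about 0.5, held with strength m = 10:
print(m_estimate(60, 100, 0.5, 10))  # 65/110
```

Note that for a binary variable, add-one smoothing is just the m-estimate with p = 1/2 and m = 2.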
Priors

P(S) = (#S + m·p) / (#S + #¬S + m)
Note that setting m = n in the above is like saying "I have seen n samples that make me believe that P(S) = p". Hence, m is referred to as the equivalent sample size.

Where should p come from?
- No prior knowledge => p = 0.5.
- If you build a personalized spam filter, you can use p = P(S) from somebody else's filter!

Inference in Naive Bayes Revisited

Recall that P(S|E) ∝ P(A|S) P(¬B|S) P(F|S) P(¬K|S) P(...|S) P(S).
We are multiplying lots of small numbers together => danger of underflow!
Solution? Use logs:

log P(S|E) = log P(A|S) + log P(¬B|S) + log P(F|S) + log P(¬K|S) + log P(...|S) + log P(S) + constant

Now we add ordinary-sized numbers -- little danger of over- or underflow errors!
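The underflow danger is easy to demonstrate: multiplying a few hundred small probabilities hits zero in double precision, while the log-space sum stays perfectly representable.

```python
import math

# Multiplying many small probabilities underflows; summing logs does not.
probs = [1e-5] * 100

product = 1.0
for p in probs:
    product *= p          # 1e-500 is far below double precision: underflows to 0.0

log_score = sum(math.log(p) for p in probs)   # about -1151.3, no problem at all

print(product)    # 0.0
print(log_score)
```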
Learning The Structure of Bayesian Networks

General idea: look at all possible network structures and pick the one that fits the observed data best.
Impossibly slow: there is an exponential number of networks, and for each one we have to learn the parameters, too!
What do we do if searching the space exhaustively is too expensive? Local search!
- Start with some network structure.
- Try to make a change (add, delete or reverse an edge).
- See if the new network is any better.

What network structure should we start with?
- Random, with a uniform prior?
- A network that reflects our (or experts') knowledge of the field?
Prior network + equivalent sample size, combined with data, yields improved network(s).

We have just described how to get an ML or MAP estimate of the structure of a Bayes net. What would the Bayesian estimate look like?
- Find all possible networks.
- Calculate their posteriors.
- When doing inference, the result is a weighted combination of all the networks!
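The local-search loop above can be sketched as greedy hill-climbing over edge sets. The score below is a toy placeholder (distance to a hidden target network) just to exercise the loop; a real learner would score candidate structures against data, e.g. with penalized log-likelihood, and would also reject edges that create directed cycles:

```python
import random

# Minimal sketch of local search over network structures: propose an edge
# add / delete / reverse, and keep the change only if the score improves.
# `score` is a toy placeholder here; acyclicity checking is omitted.
def hill_climb(nodes, score, steps=500, seed=0):
    rng = random.Random(seed)
    edges = set()                       # start from the empty network
    best = score(edges)
    for _ in range(steps):
        a, b = rng.sample(nodes, 2)
        cand = set(edges)
        if (a, b) in cand:
            cand.remove((a, b))                     # delete edge
        elif (b, a) in cand:
            cand.remove((b, a)); cand.add((a, b))   # reverse edge
        else:
            cand.add((a, b))                        # add edge
        if score(cand) > best:                      # greedy: keep improvements only
            edges, best = cand, score(cand)
    return edges

target = {("E", "A"), ("B", "A")}
toy_score = lambda edges: -len(edges ^ target)      # toy: closer to target is better
found = hill_climb(["E", "B", "A"], toy_score)
```

Greedy acceptance guarantees the score never gets worse, but like all local search it can get stuck in local optima, which is why practical systems restart from several initial structures.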