Statistical Learning
CSE 573 (Daniel S. Weld)

Logistics
  Team meetings
  Midterm: open book, open notes
  Studying: see the AIMA exercises

Course topic map (current topic: Supervised Learning): Logic-Based, Reinforcement Learning, Planning, Probabilistic, Knowledge Representation & Inference, Search, Problem Spaces, Agency

Topics (this lecture)
  Parameter Estimation: Maximum Likelihood (ML), Maximum A Posteriori (MAP), Bayesian, Continuous case
  Learning Parameters for a Bayesian Network: Naive Bayes, Maximum Likelihood estimates, Priors
  Learning Structure of Bayesian Networks

Coin Flip
  Three candidate coins: P(H | C1) = 0.1, P(H | C2) = 0.5, P(H | C3) = 0.9
  Which coin will I use?  P(C1) = 1/3, P(C2) = 1/3, P(C3) = 1/3
  Prior: probability of a hypothesis before we make any observations
  Uniform prior: all hypotheses are equally likely before we make any observations
Experiment 1: Heads
  P(C1 | H) = ?  P(C2 | H) = ?  P(C3 | H) = ?
  By Bayes rule: P(C1 | H) = 0.066, P(C2 | H) = 0.333, P(C3 | H) = 0.600
  Posterior: probability of a hypothesis given data

Terminology
  Prior: probability of a hypothesis before we see any data
  Uniform prior: a prior that makes all hypotheses equally likely
  Posterior: probability of a hypothesis after we saw some data
  Likelihood: probability of data given hypothesis

Experiment 2: Tails
  P(C1 | HT) = ?  P(C2 | HT) = ?  P(C3 | HT) = ?
  P(C1 | HT) = 0.21, P(C2 | HT) = 0.58, P(C3 | HT) = 0.21
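The posterior numbers above are just one Bayes-rule update per flip. A minimal Python sketch of the computation (coin biases and the uniform prior are taken from the slides; the helper name is mine):

```python
# Bayes-rule update over the three candidate coins from the slides.
biases = [0.1, 0.5, 0.9]          # P(H | C1), P(H | C2), P(H | C3)
prior = [1 / 3, 1 / 3, 1 / 3]     # uniform prior over the coins

def update(posterior, outcome):
    """One Bayes-rule step: multiply by the likelihood, then normalize."""
    likelihood = [b if outcome == "H" else 1 - b for b in biases]
    unnorm = [l * p for l, p in zip(likelihood, posterior)]
    z = sum(unnorm)                      # P(data) under the current beliefs
    return [u / z for u in unnorm]

after_h = update(prior, "H")             # Experiment 1: Heads
after_ht = update(after_h, "T")          # Experiment 2: Tails
print([round(p, 3) for p in after_h])    # [0.067, 0.333, 0.6]
print([round(p, 2) for p in after_ht])   # [0.21, 0.58, 0.21]
```

Note that a single heads already shifts most of the mass onto C3, and the following tails pulls it back toward the fair coin C2.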
Your Estimate?
  What is the probability of heads after two experiments?
  Most likely coin: C2, so the best estimate for P(H) is P(H | C2) = 0.5
  Maximum Likelihood Estimate: the best hypothesis that fits the observed data, assuming a uniform prior

Using Prior Knowledge
  We can encode background knowledge in the prior: P(C1) = 0.05, P(C2) = 0.25, P(C3) = 0.70
  (Coin biases as before: P(H | C1) = 0.1, P(H | C2) = 0.5, P(H | C3) = 0.9)

Experiment 1: Heads
  P(C1 | H) = ?  P(C2 | H) = ?  P(C3 | H) = ?
  P(C1 | H) = 0.006, P(C2 | H) = 0.165, P(C3 | H) = 0.829
  Compare with the posteriors under the uniform prior after Experiment 1: 0.066, 0.333, 0.600
Experiment 2: Tails
  P(C1 | HT) = ?  P(C2 | HT) = ?  P(C3 | HT) = ?
  P(C1 | HT) = 0.035, P(C2 | HT) = 0.481, P(C3 | HT) = 0.485

Your Estimate?
  What is the probability of heads after two experiments?
  Most likely coin: C3, so the best estimate for P(H) is P(H | C3) = 0.9
  Maximum A Posteriori (MAP) Estimate: the best hypothesis that fits the observed data, assuming a non-uniform prior

Did We Do The Right Thing?
  P(C1 | HT) = 0.035, P(C2 | HT) = 0.481, P(C3 | HT) = 0.485
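The MAP numbers above come from rerunning the same Bayes-rule update with the non-uniform prior. A sketch (prior and biases from the slides; helper names mine):

```python
biases = [0.1, 0.5, 0.9]          # P(H | C1), P(H | C2), P(H | C3)
prior = [0.05, 0.25, 0.70]        # non-uniform prior over the coins

def update(posterior, outcome):
    """One Bayes-rule step: multiply by the likelihood, then normalize."""
    likelihood = [b if outcome == "H" else 1 - b for b in biases]
    unnorm = [l * p for l, p in zip(likelihood, posterior)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

post = update(update(prior, "H"), "T")       # observe H, then T
print([round(p, 3) for p in post])           # [0.035, 0.481, 0.485]

# MAP estimate: commit to the single most probable coin under the posterior.
map_coin = max(range(3), key=lambda i: post[i])
print(map_coin, biases[map_coin])            # coin C3 (index 2): P(H) = 0.9
```

The heavy prior on C3 keeps it the MAP winner even after the tails, which is exactly the situation the next slide questions.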
Did We Do The Right Thing?
  C2 and C3 are almost equally likely: P(C2 | HT) = 0.481 vs. P(C3 | HT) = 0.485

A Better Estimate
  Recall: P(H | HT) = sum over i of P(H | Ci) P(Ci | HT) = 0.68

Bayesian Estimate
  Bayesian estimate: minimizes prediction error, given the data and (generally) assuming a non-uniform prior
  P(H | HT) = 0.68

Comparison
  After more experiments (HT followed by further heads), all three estimates converge toward P(H) = 0.9:
  ML (Maximum Likelihood): P(H) = 0.5 after 2 experiments
  MAP (Maximum A Posteriori): P(H) = 0.9 after 2 experiments
  Bayesian: P(H) = 0.68 after 2 experiments

Summary For Now
  ML (Maximum Likelihood): easy to compute
  MAP (Maximum A Posteriori): still easy to compute; incorporates prior knowledge
  Bayesian: minimizes prediction error, so it is great when data is scarce; potentially much harder to compute

                              Prior     Hypothesis returned
  Maximum Likelihood          Uniform   The most likely
  Maximum A Posteriori        Any       The most likely
  Bayesian                    Any       Weighted combination
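The Bayesian estimate is the weighted combination from the table above: each coin's prediction is weighted by its posterior probability instead of committing to the single most likely coin. A self-contained sketch continuing the same toy example:

```python
biases = [0.1, 0.5, 0.9]          # P(H | C1), P(H | C2), P(H | C3)
prior = [0.05, 0.25, 0.70]        # non-uniform prior over the coins

def update(posterior, outcome):
    likelihood = [b if outcome == "H" else 1 - b for b in biases]
    unnorm = [l * p for l, p in zip(likelihood, posterior)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

post = update(update(prior, "H"), "T")   # posterior over coins after HT

# Bayesian estimate: weight each coin's prediction by its posterior,
# rather than picking the single MAP coin.
p_heads = sum(b * p for b, p in zip(biases, post))
print(round(p_heads, 2))   # 0.68
```

Because C2 and C3 are nearly tied, averaging their predictions (0.5 and 0.9) gives a noticeably different answer than the MAP choice of 0.9 alone.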
Continuous Case
  In the previous example, we chose from a discrete set of three coins
  In general, we have to pick from a continuous distribution of biased coins
  [figures: prior density over the coin bias, both uniform and encoding background knowledge, and the posterior density after Experiment 1: Heads and Experiment 2: Tails]

After 2 Experiments...
  [figures: posterior densities marking the ML estimate, the MAP estimate, and the Bayesian estimate, with a uniform prior and with background knowledge]
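A common concrete choice for the continuous case is a Beta prior over the coin bias, because it is conjugate to coin flips and keeps all three estimates in closed form. The slides only show density plots, so the Beta parameterization below is my assumption, not something stated on them:

```python
# Continuous coin problem with a conjugate Beta(a, b) prior on the bias.
# (Choosing Beta is my assumption; the slides only show density curves.)
def estimates(a, b, heads, tails):
    a_post, b_post = a + heads, b + tails            # Beta posterior params
    ml = heads / (heads + tails)                     # ignores the prior
    map_est = (a_post - 1) / (a_post + b_post - 2)   # posterior mode
    bayes = a_post / (a_post + b_post)               # posterior mean
    return ml, map_est, bayes

# Uniform prior Beta(1, 1): MAP coincides with ML, as on the slides.
print(estimates(1, 1, heads=1, tails=1))   # (0.5, 0.5, 0.5)

# Background knowledge "coins are usually fair", e.g. Beta(5, 5):
# both MAP and the Bayesian estimate are pulled toward 0.5.
print(estimates(5, 5, heads=9, tails=1))   # (0.9, 0.722..., 0.7)
```

With lots of data the pseudo-counts a and b are swamped and all three estimates agree, mirroring the discrete comparison slide.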
Topics (lecture outline repeated)

Review: Conditional Probability
  P(A | B) is the probability of A given B
  Assumes that B is the only information known
  Defined by: P(A | B) = P(A, B) / P(B)

Conditional Independence
  A and B are not independent, since P(A | B) < P(A)
  [Venn diagram over A, B, and their overlap]
  But A and B are made independent by C: P(A | C) = P(A | B, C)

Bayes Rule
  P(H | E) = P(E | H) P(H) / P(E)
  Simple proof from the definition of conditional probability:
    1. P(H | E) = P(H, E) / P(E)           (def. of conditional probability)
    2. P(E | H) = P(H, E) / P(H)           (def. of conditional probability)
    3. P(H, E) = P(E | H) P(H)             (multiply line 2 by P(H))
    4. P(H | E) = P(E | H) P(H) / P(E)     (substitute line 3 into line 1)   QED
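Both claims, Bayes rule and "C makes A and B independent", can be checked numerically on a small joint distribution. A sketch with made-up numbers of my own, where C is a common cause of A and B:

```python
from itertools import product

# Hypothetical numbers (mine, for illustration): C causes both A and B.
p_c = {True: 0.4, False: 0.6}
p_a_given_c = {True: 0.8, False: 0.2}   # P(A=t | C)
p_b_given_c = {True: 0.7, False: 0.1}   # P(B=t | C)

# Build the full joint P(A, B, C).
joint = {}
for a, b, c in product([True, False], repeat=3):
    pa = p_a_given_c[c] if a else 1 - p_a_given_c[c]
    pb = p_b_given_c[c] if b else 1 - p_b_given_c[c]
    joint[(a, b, c)] = p_c[c] * pa * pb

def prob(pred):
    # Probability of an event, summed straight from the joint.
    return sum(p for k, p in joint.items() if pred(*k))

p_a = prob(lambda a, b, c: a)
p_b = prob(lambda a, b, c: b)
p_ab = prob(lambda a, b, c: a and b)

# A and B are dependent: P(A | B) != P(A).
print(round(p_ab / p_b, 3), round(p_a, 3))   # 0.694 vs 0.44

# ...but conditionally independent given C: P(A | B, C) == P(A | C) = 0.8.
p_abc = prob(lambda a, b, c: a and b and c)
p_bc = prob(lambda a, b, c: b and c)
print(round(p_abc / p_bc, 3))                # 0.8
```

The same `prob` helper also lets you confirm Bayes rule mechanically, since P(B | A) P(A) / P(B) reproduces P(A | B) computed directly from the joint.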
An Example Bayes Net
  Structure: Earthquake and Burglary are parents of Alarm; Earthquake is the parent of Radio; Alarm is the parent of the two NbrCalls nodes
  Pr(B=t) = 0.05, Pr(B=f) = 0.95
  Pr(A | E,B): one CPT entry per parent configuration, e.g. Pr(A | e,b) = 0.9 and Pr(A | not-e,b) = 0.85 (a second candidate parameter set appears in parentheses on the slide)

Given Parents, X is Independent of Non-Descendants

Given Markov Blanket, X is Independent of All Other Nodes
  MB(X) = Par(X) + Children(X) + Par(Children(X))

Parameter Estimation and Bayesian Networks
  We have: the Bayes net structure and observations
  We need: the Bayes net parameters

  E B R A J M
  T F T T F T
  F F F F F T
  F T F T T T
  F F F T T T
  F T F F F F
  ...

  P(B) = ?    Prior + data: now compute either the MAP or the Bayesian estimate
  P(A | E,B) = ?   P(A | E,not-B) = ?   P(A | not-E,B) = ?   P(A | not-E,not-B) = ?
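The maximum-likelihood parameters are just observed frequencies in the data table. A sketch that counts them from the five rows shown above (the table's trailing `...` rows are omitted, so the numbers are only illustrative):

```python
# Rows of the (truncated) data table from the slides: E, B, R, A, J, M.
T, F = True, False
data = [
    (T, F, T, T, F, T),
    (F, F, F, F, F, T),
    (F, T, F, T, T, T),
    (F, F, F, T, T, T),
    (F, T, F, F, F, F),
]
E, B, R, A, J, M = range(6)

# ML estimate of P(B): the fraction of rows with B = true.
p_b = sum(row[B] for row in data) / len(data)
print(p_b)   # 0.4

# ML estimate of P(A | E, B): count within each parent configuration.
def p_a_given(e, b):
    rows = [r for r in data if r[E] == e and r[B] == b]
    return sum(r[A] for r in rows) / len(rows) if rows else None  # no data!

print(p_a_given(F, T))   # 0.5: one alarm in the two (not-e, b) rows
print(p_a_given(T, T))   # None: this parent configuration never occurs
```

The `None` case is the weakness the "prior + data" slides address next: pure ML has nothing to say about parent configurations that never appear in the data.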
Parameter Estimation and Bayesian Networks
  P(A | E,B) = ?   P(A | E,not-B) = ?   P(A | not-E,B) = ?   P(A | not-E,not-B) = ?
  Prior + data: now compute either the MAP or the Bayesian estimate for each CPT entry

Topics (lecture outline repeated)

Recap
  Given a BN structure (with discrete or continuous variables), we can learn the parameters of the conditional probability tables.
  What if we don't know the structure?  (Examples: Earthquake/Burglary/Alarm; a naive Bayes spam model with Spam as the parent of word features such as Nigeria, Sex, Nude, ...)

Learning The Structure of Bayesian Networks
  Search through the space of possible network structures!  (For now, assume we observe all variables.)
  For each structure, learn the parameters
  Pick the structure that fits the observed data best
  Caveat: won't we end up fully connected?  When scoring, add a penalty for model complexity
  Problem!?!?  There are exponentially many networks, and we need to learn parameters for each!
  Exhaustive search is out of the question. So what now?
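"Prior + data" can be made concrete by putting a Beta prior on each CPT entry, which also fixes the zero-count problem for parent configurations absent from the data. A sketch over the same five rows (the Beta(1, 1) pseudo-count choice is mine; the slides just say prior plus data):

```python
T, F = True, False
data = [
    (T, F, T, T, F, T),
    (F, F, F, F, F, T),
    (F, T, F, T, T, T),
    (F, F, F, T, T, T),
    (F, T, F, F, F, F),
]
E, B, A = 0, 1, 3

def p_a_given(e, b, a_prior=1, b_prior=1):
    """Posterior-mean (Bayesian) estimate of P(A=t | E=e, B=b).

    Adds Beta(a_prior, b_prior) pseudo-counts to the observed counts.
    Beta(1, 1) is my choice of uniform prior, not taken from the slides."""
    rows = [r for r in data if r[E] == e and r[B] == b]
    alarms = sum(r[A] for r in rows)
    return (alarms + a_prior) / (len(rows) + a_prior + b_prior)

print(p_a_given(F, T))   # 0.5: one alarm in two rows, plus the prior
print(p_a_given(T, T))   # 0.5: no data, so it falls back to the prior
```

With more data the pseudo-counts wash out and the estimate approaches the pure ML frequency, so the prior matters most exactly when data is scarce.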
Learning The Structure of Bayesian Networks
  Local search!
    Start with some network structure
    Try to make a change (add, delete, or reverse an edge)
    See if the new network is any better
  What should be the initial state?
    A uniform prior over random networks?
    A network which reflects expert knowledge?

Learning BN Structure: The Big Picture
  We described how to do MAP (and ML) learning of a Bayes net (including its structure)
  How would Bayesian learning of BNs differ?
    Find all possible networks
    Calculate their posteriors
    When doing inference, return a weighted combination of the predictions from all networks!
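The local-search loop above can be sketched end to end on a tiny example. Below is a toy Python version over three of the table's variables (E, B, A), with a BIC-style penalty of my own choosing standing in for the earlier slide's "penalty for model complexity"; structures are maps from each node to its parent tuple:

```python
import math
from itertools import product

T, F = True, False
# Columns E, B, A taken from the slide's (truncated) data table.
data = [(T, F, T), (F, F, F), (F, T, T), (F, F, T), (F, T, F)]
VARS = (0, 1, 2)

def loglik(node, parents):
    # ML log-likelihood of one node's column given its parents' columns.
    total = 0.0
    for assign in product((T, F), repeat=len(parents)):
        rows = [r for r in data
                if all(r[p] == v for p, v in zip(parents, assign))]
        if rows:
            n, k = len(rows), sum(r[node] for r in rows)
            total += sum(c * math.log(c / n) for c in (k, n - k) if c)
    return total

def score(dag):
    # Fit minus a BIC-style complexity penalty (exact form is my choice).
    pen = 0.5 * math.log(len(data))
    return sum(loglik(v, dag[v]) - pen * 2 ** len(dag[v]) for v in VARS)

def acyclic(dag):
    seen, stack = set(), set()
    def visit(v):
        if v in stack:
            return False            # back edge: cycle
        if v in seen:
            return True
        seen.add(v)
        stack.add(v)
        ok = all(visit(p) for p in dag[v])
        stack.discard(v)
        return ok
    return all(visit(v) for v in VARS)

def neighbors(dag):
    # The slide's moves: add, delete, or reverse a single edge.
    for u, v in product(VARS, VARS):
        if u == v:
            continue
        if u in dag[v]:
            deleted = {k: tuple(p for p in ps if k != v or p != u)
                       for k, ps in dag.items()}
            yield deleted                          # delete edge u -> v
            rev = dict(deleted)
            rev[u] = deleted[u] + (v,)
            if acyclic(rev):
                yield rev                          # reverse it to v -> u
        else:
            added = dict(dag)
            added[v] = dag[v] + (u,)
            if acyclic(added):
                yield added                        # add edge u -> v

# Greedy hill climbing from the empty network.
dag = {v: () for v in VARS}
improved = True
while improved:
    improved = False
    for cand in neighbors(dag):
        if score(cand) > score(dag):
            dag, improved = cand, True
print(dag)
```

This stops at a local optimum, which is why the choice of initial state on the slide matters; restarts from random or expert-supplied structures are the usual remedy.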