Logistics. Naïve Bayes & Expectation Maximization. 573 Schedule. Coming Soon. Estimation Models. Topics


Walter Booker
5 years ago

Naïve Bayes & Expectation Maximization
CSE 573
Daniel S. Weld

Logistics
- Team meetings
- Midterm: open book, notes
- Studying: see AIMA exercises

573 Schedule
- Selected Topics
- Supervised Learning, Reinforcement Learning, Planning
- Knowledge Representation & Inference: Logic-Based, Probabilistic
- Search
- Problem Spaces
- Agency

Coming Soon (Selected Topics)
- Artificial Life
- Intelligent Internet Systems
- Crossword Puzzles

Topics
- Test & Mini Projects
- Review
- Naive Bayes
  - Maximum Likelihood Estimates
  - Working with Probabilities
- Expectation Maximization
- Challenge

Estimation Models

  Model                            Prior     Hypothesis
  Maximum Likelihood Estimate      Uniform   The most likely
  Maximum A Posteriori Estimate    Any       The most likely
  Bayesian Estimate                Any       Weighted combination
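The three estimation models can be made concrete for a single coin whose probability of heads is unknown. The sketch below is only an illustration: the function name `coin_estimates` and the choice of a Beta(a, b) prior (with a = b = 1 as the uniform prior) are assumptions, not something the slides specify. The "Bayesian estimate" is taken here to be the posterior mean, i.e. the prediction averaged over all hypotheses.

```python
def coin_estimates(heads, tails, a=1.0, b=1.0):
    """Return (ML, MAP, Bayesian) estimates of P(heads).

    Assumes a Beta(a, b) prior; a = b = 1 is the uniform prior.
    """
    n = heads + tails
    # ML: the single hypothesis that best explains the data, ignoring the prior.
    ml = heads / n if n > 0 else None

    # With a Beta prior the posterior is Beta(a + heads, b + tails).
    post_a, post_b = a + heads, b + tails

    # MAP: the mode of the posterior. This closed form is valid when both
    # shape parameters are at least 1 and the posterior is not flat.
    if post_a >= 1 and post_b >= 1 and post_a + post_b > 2:
        map_est = (post_a - 1) / (post_a + post_b - 2)
    else:
        map_est = None

    # Bayesian: the posterior mean, a weighted combination of all hypotheses.
    bayes = post_a / (post_a + post_b)
    return ml, map_est, bayes
```

With a uniform prior the ML and MAP estimates coincide, while the Bayesian estimate is pulled toward 1/2; a stronger prior (larger a and b) pulls both the MAP and the Bayesian estimates.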

Continuous Case
- Relative likelihood plotted against the probability of heads
- Prior: uniform, or with background knowledge
- Experiment outcomes: Heads, Tails

Continuous Case: After Experiments...
- Posterior after the experiments, for the uniform prior and for the prior with background knowledge
- Marked on each posterior: ML estimate, MAP estimate, Bayesian estimate

Naive Bayes
- Example feature: "Is 'apple' in message?"
- A Bayes net where all nodes are children of a single root node
- Why?
- Expressive and accurate?
- Easy to learn?

Naive Bayes
- All nodes are children of a single root node
- Why?
- Expressive and accurate? No
- Easy to learn? Yes
- Useful? Sometimes

Inference in Naive Bayes
- Goal: given evidence E (words in an email), decide if the email is spam
- Root: P(S) = .6
- Children: P(A | S) and P(A | ¬S), P(B | S) and P(B | ¬S), and a third child with P(· | S) = .8 [the remaining table entries are not legible in the source]
- Bayes' rule: P(S | E) = P(E | S) P(S) / P(E)
- Independence to the rescue! Given S, the words are independent, so P(E | S) is the product of the per-word terms
- Spam if P(S | E) > P(¬S | E)
- But... P(A | S) = ? P(A | ¬S) = ?
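The inference step can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function name `naive_bayes_posterior` and the dictionary layout are hypothetical, and any likelihood values passed in are made-up placeholders, since the slide's table entries did not survive extraction (only P(S) = .6 did).

```python
def naive_bayes_posterior(prior_spam, likelihoods, evidence):
    """P(S | E) for boolean evidence under the naive Bayes assumption.

    likelihoods maps word -> (P(word | S), P(word | not S)).
    evidence maps word -> True/False (word present or absent).
    """
    score_spam = prior_spam          # accumulates P(S) * prod P(e_i | S)
    score_ham = 1.0 - prior_spam     # accumulates P(not S) * prod P(e_i | not S)
    for word, present in evidence.items():
        p_s, p_ns = likelihoods[word]
        score_spam *= p_s if present else 1.0 - p_s
        score_ham *= p_ns if present else 1.0 - p_ns
    # P(E) is the sum of the two joint scores, so it never has to be
    # estimated separately.
    return score_spam / (score_spam + score_ham)
```

The message is classified as spam when the returned value exceeds 0.5, which is the same test as P(S | E) > P(¬S | E).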

Parameter Estimation Revisited
- Write P(S) = θ; the conditional probabilities such as P(A | S) and P(A | ¬S) are unknown as well
- Can we calculate the maximum likelihood estimate of θ easily?
- Prior over θ + data = max likelihood estimate of θ
- Looking for the maximum of a function:
  - find the derivative
  - set it to zero

Topics
- Test & Mini Projects
- Review
- Naive Bayes
  - Maximum Likelihood Estimates
  - Working with Probabilities
    - Smoothing
    - Computational Details
    - Continuous Quantities
- Expectation Maximization
- Challenge

Evidence is Easy?

$$P(X_i \mid S) = \frac{\#(X_i \wedge S)}{\#(X_i \wedge S) + \#(\neg X_i \wedge S)}$$

- Are there problems?

Smooth with a Prior

$$P(X_i \mid S) = \frac{\#(X_i \wedge S) + mp}{\#(X_i \wedge S) + \#(\neg X_i \wedge S) + m}$$

- p = prior probability
- m = weight
- Note that choosing a value for m means I've seen that many samples that make me believe P(X_i | S) = p
- Hence, m is referred to as the equivalent sample size

Probabilities: Important Detail!

$$P(\text{spam} \mid X_1, \ldots, X_n) \propto P(\text{spam}) \prod_i P(X_i \mid \text{spam})$$

- Any more potential problems here?
- We are multiplying lots of small numbers: danger of underflow! For example, 0.5^57 ≈ 7E-18
- Solution? Use logs and add: p1 * p2 = e^(log(p1) + log(p2))
- Always keep in log form
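Both fixes, smoothing with a prior and working in log space, are short enough to sketch. The helper names `m_estimate` and `log_joint` are hypothetical, and the counts are assumed to be the number of spam messages with and without the word in question.

```python
import math

def m_estimate(count_with, count_without, p, m):
    """Smoothed estimate of P(X_i | S).

    count_with / count_without: spam messages where X_i is true / false.
    p: prior probability; m: equivalent sample size (m = 0 gives the raw ratio).
    """
    return (count_with + m * p) / (count_with + count_without + m)

def log_joint(log_prior, probs):
    """Add logs instead of multiplying probabilities, to avoid underflow."""
    return log_prior + sum(math.log(p) for p in probs)
```

Because the logarithm is monotonic, comparing the log scores of the spam and non-spam hypotheses gives the same decision as comparing the products, without ever forming a number too small for a float. Smoothing also matters here: an unsmoothed estimate of zero would make `math.log` fail.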

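P(S | X)
- Easy to compute from data if X is discrete
- What if X is real valued?
- Example table (columns: Instance, X, Spam?): three instances with X below a threshold are labeled False, two instances with X above it are labeled True [the numeric values of X are not legible in the source]
- P(S | …) = ¼, ignoring smoothing
- What now?

Anything Else?
- Fit Gaussians to the values of X for each class, then ask for P(S | X = …)

Smooth with Gaussian, then sum: Kernel Density Estimation
- Compare P(S | X = …) with P(¬S | X = …)
- What's with the shape?

[Figures: training values of X with their Spam? labels, the fitted Gaussians, and the kernel density estimates; axis values not recoverable]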
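Kernel density estimation for a real-valued feature can be sketched as follows. The function names and the `bandwidth` parameter (the standard deviation of each Gaussian bump) are illustrative assumptions; the slides do not say how the kernel width was chosen.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and std at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def kde(x, samples, bandwidth):
    """Kernel density estimate: average of one Gaussian bump per sample."""
    return sum(gaussian_pdf(x, s, bandwidth) for s in samples) / len(samples)

def posterior_spam(x, spam_values, ham_values, bandwidth):
    """P(S | X = x), with class frequencies as priors and KDE densities
    standing in for P(X = x | S) and P(X = x | not S)."""
    n_s, n_h = len(spam_values), len(ham_values)
    joint_s = n_s / (n_s + n_h) * kde(x, spam_values, bandwidth)
    joint_h = n_h / (n_s + n_h) * kde(x, ham_values, bandwidth)
    return joint_s / (joint_s + joint_h)
```

Fitting a single Gaussian per class is the special case where all the bumps are replaced by one curve with the class mean and standard deviation. This sketch does not guard against a query point so far from every sample that both densities underflow to zero.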


Analysis
[Figure: plot against attribute value; axis values not recoverable]

Topics
- Test & Mini Projects
- Review
- Naive Bayes
- Expectation Maximization
  - Review: Learning Bayesian Networks
    - Parameter Estimation
    - Structure Learning
    - Hidden Nodes
- Challenge

An Example Bayes Net
- Nodes: Earthquake, Burglary, Radio, Alarm, and two NbrCalls nodes
- Pr(A | E, B): one row each for (e, b), (e, ¬b), (¬e, b), (¬e, ¬b), giving Pr(a) with Pr(¬a) in parentheses [entries only partially legible in the source: .9, (.8), .8, (.99)]
- Pr(B = t) and Pr(B = f) [values only partially legible in the source]

Parameter Estimation and Bayesian Networks
- Data: one row per observation, with columns E, B, R, A, J, M, ...
- We have: Bayes net structure and observations
- We need: Bayes net parameters
- P(A | E, B) = ? P(A | E, ¬B) = ? P(A | ¬E, B) = ? P(A | ¬E, ¬B) = ?
- Prior + data: now compute either the MAP or the Bayesian estimate
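With the structure fixed and every variable observed, each conditional probability table entry is a ratio of counts. The sketch below assumes boolean variables and records stored as dictionaries; `estimate_cpt` and its parameters are hypothetical names, and the optional `p` and `m` arguments reuse the equivalent-sample-size smoothing from the Naive Bayes slides.

```python
from collections import Counter
from itertools import product

def estimate_cpt(records, child, parents, p=0.5, m=0.0):
    """Estimate P(child = True | parents) from fully observed boolean records.

    records: list of dicts mapping variable name -> bool.
    p, m: prior probability and equivalent sample size (m = 0 gives ML).
    Returns a dict keyed by a tuple of parent values. A parent setting that
    never occurs maps to None when m = 0, since the ML estimate is undefined.
    """
    true_counts, totals = Counter(), Counter()
    for r in records:
        key = tuple(r[v] for v in parents)
        totals[key] += 1
        true_counts[key] += int(r[child])
    cpt = {}
    for key in product([True, False], repeat=len(parents)):
        denom = totals[key] + m
        cpt[key] = (true_counts[key] + m * p) / denom if denom > 0 else None
    return cpt
```

Calling `estimate_cpt(records, "A", ["E", "B"])` fills in the four question marks for the Alarm node; repeating the call for every node and its parents gives the full parameter set.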

Recap
- Given a BN structure (with discrete or continuous variables), we can learn the parameters of the conditional probability tables
- Example structures: Spam with Nigeria, Sex, Nude; Earthqk and Burgl with Alarm and the two N nodes
- What if we don't know the structure?

Learning The Structure of Bayesian Networks
- Search through the space of possible network structures! (For now, assume we observe all variables)
- For each structure, learn parameters
- Pick the one that fits the observed data best
- Caveat: won't we end up fully connected? When scoring, add a penalty for model complexity
- Problem? Exponential number of networks! And we need to learn parameters for each!
- Exhaustive search is out of the question! So what now?

Learning The Structure of Bayesian Networks
- Local search!
  - Start with some network structure
  - Try to make a change (add, delete, or reverse an edge)
  - See if the new network is any better
- What should be the initial state?

Initial Network Structure?
- Uniform prior over random networks?
- Network which reflects expert knowledge?
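The local search over structures can be sketched as greedy hill climbing. Everything here is illustrative: the function names are hypothetical, and `score` is a caller-supplied function of the edge set. In a real learner it would fit parameters for the candidate structure and return data fit minus a complexity penalty; the sketch leaves that part out.

```python
from itertools import permutations

def is_acyclic(nodes, edges):
    """Depth-first check that the directed graph has no cycle."""
    children = {n: [b for (a, b) in edges if a == n] for n in nodes}
    state = {n: 0 for n in nodes}  # 0 = unvisited, 1 = on stack, 2 = done

    def visit(n):
        if state[n] == 1:      # reached a node already on the current path
            return False
        if state[n] == 2:
            return True
        state[n] = 1
        ok = all(visit(c) for c in children[n])
        state[n] = 2
        return ok

    return all(visit(n) for n in nodes)

def neighbors(nodes, edges):
    """All structures one change away: add, delete, or reverse an edge."""
    for a, b in permutations(nodes, 2):
        if (a, b) in edges:
            yield edges - {(a, b)}                    # delete
            yield (edges - {(a, b)}) | {(b, a)}       # reverse
        elif (b, a) not in edges:
            yield edges | {(a, b)}                    # add

def hill_climb(nodes, score, start=frozenset()):
    """Greedy local search; stops when no single change improves the score."""
    current = frozenset(start)
    best = score(current)
    while True:
        candidates = [frozenset(e) for e in neighbors(nodes, current)
                      if is_acyclic(nodes, e)]
        top = max(candidates, key=score, default=None)
        if top is None or score(top) <= best:
            return current
        current, best = top, score(top)
```

Like any local search, this can stop at a local maximum, which is why the choice of initial structure matters.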

Learning BN Structure: The Big Picture
- We described how to do MAP (and ML) learning of a Bayes net (including structure)
- How would Bayesian learning (of BNs) differ?
  - Find all possible networks
  - Calculate their posteriors
  - When doing inference, return a weighted combination of predictions from all networks!

Hidden Variables
- But we can't observe the disease
- Can't we learn without it?
- We could, but we'd get a fully-connected network with 708 parameters (vs. 78)
- Much harder to learn!

Chicken & Egg Problem
- If we knew that a training instance (patient) had the disease, it would be easy to learn P(symptom | disease). But we can't observe disease, so we don't.
- If we knew the parameters, e.g. P(symptom | disease), then it'd be easy to estimate whether the patient had the disease. But we don't know these parameters.

Expectation Maximization (high-level version)
- Initialize the parameters randomly
- [E step] Compute the probability of each instance having each possible value of the hidden variable
- [M step] Treating each instance as fractionally having both values, compute the new parameter values
- Iterate until convergence!

Simplest Version
- Mixture of two distributions
- Know: form of distribution, variance, and mixing percentage (% = …)
- Just need the mean of each distribution

Input Looks Like
[Figure: unlabeled data points]

We Want to Predict
[Figure: the data points assigned to the two distributions]

Expectation Maximization
- Initialize randomly: set the two means, θ1 = ? and θ2 = ?

Expectation Maximization
- Initialize randomly
- [E step] Compute the probability of each instance having each possible value of the hidden variable
- [M step] Treating each instance as fractionally having both values, compute the new parameter values
- Iterate

ML Mean of Single Gaussian

$$u_{ML} = \arg\min_u \sum_i (x_i - u)^2$$
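For the simplest version, a mixture of two Gaussians where only the means are unknown, the E and M steps can be sketched as below. The assumptions are spelled out in the code: equal mixing weights, a known shared standard deviation `sigma`, and a fixed iteration count in place of a convergence test. The function name `em_two_means` is hypothetical.

```python
import math

def em_two_means(xs, mu1, mu2, sigma=1.0, iterations=50):
    """EM for a 50/50 mixture of two Gaussians with known, shared sigma.

    mu1 and mu2 are the initial guesses for the two means. Assumes neither
    component ends up with zero total responsibility.
    """
    for _ in range(iterations):
        # E step: probability that each point came from component 1.
        resp = []
        for x in xs:
            # Difference of the two exponents; using it directly avoids
            # dividing two densities that may both have underflowed.
            d = ((x - mu1) ** 2 - (x - mu2) ** 2) / (2.0 * sigma ** 2)
            resp.append(1.0 / (1.0 + math.exp(d)) if d < 700 else 0.0)
        # M step: each mean is the responsibility-weighted average of the
        # data, i.e. the ML mean with fractional instance counts.
        total1 = sum(resp)
        total2 = len(xs) - total1
        mu1 = sum(r * x for r, x in zip(resp, xs)) / total1
        mu2 = sum((1.0 - r) * x for r, x in zip(resp, xs)) / total2
    return mu1, mu2
```

The M step is the weighted form of the single-Gaussian ML mean: with all responsibilities equal to 1 it reduces to the ordinary average, which is the minimizer of the squared-error sum.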

Expectation Maximization
- [M step] Treating each instance as fractionally having both values, compute the new parameter values
- Until convergence

Problems
- Need to assume the form of the distribution
- Local maxima

But
- It really works in practice!
- Can easily extend to multiple parameters, e.g. mean & variance
- Or much more complex models

Crossword Puzzles
