Statistical Learning: Parameter Estimation (ML, MAP, Bayesian) and Naïve Bayes

Some slides from Carlos Guestrin, Luke Zettlemoyer & K. Gajos


Today
Statistical Learning
- Parameter Estimation: Maximum Likelihood (ML), Maximum A Posteriori (MAP), Bayesian; continuous case
- Learning Parameters for a Bayesian Network
- Naive Bayes: Maximum Likelihood estimates, priors
- Learning the Structure of Bayesian Networks

Coin Flip
Three coins C1, C2, C3 with
P(H|C1) = 0.1    P(H|C2) = 0.5    P(H|C3) = 0.9

Which coin will I use?
P(C1) = 1/3    P(C2) = 1/3    P(C3) = 1/3
Prior: probability of a hypothesis before we make any observations.
Uniform prior: all hypotheses are equally likely before we make any observations.

Experiment 1: Heads
P(C1|H) = ?    P(C2|H) = ?    P(C3|H) = ?
P(C1|H) = 0.066    P(C2|H) = 0.333    P(C3|H) = 0.6

P(C1|H) = P(H|C1) P(C1) / P(H),   where   P(H) = Σ_i P(H|Ci) P(Ci)
Posterior: probability of a hypothesis given data.
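A minimal sketch (my own, not from the slides) of this single-flip update, using the coin biases and uniform prior above; it reproduces the posterior on the slide:

```python
p_heads_given_coin = [0.1, 0.5, 0.9]   # P(H|C1), P(H|C2), P(H|C3)
prior = [1 / 3, 1 / 3, 1 / 3]          # uniform prior P(Ci)

# P(H) = sum_i P(H|Ci) P(Ci)
p_heads = sum(ph * pc for ph, pc in zip(p_heads_given_coin, prior))

# Bayes rule: P(Ci|H) = P(H|Ci) P(Ci) / P(H)
posterior = [ph * pc / p_heads for ph, pc in zip(p_heads_given_coin, prior)]
print(posterior)   # [0.066..., 0.333..., 0.6], matching the slide
```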

Experiment 2: Tails
P(C1|HT) = ?    P(C2|HT) = ?    P(C3|HT) = ?
P(Ci|HT) = α P(HT|Ci) P(Ci) = α P(H|Ci) P(T|Ci) P(Ci)
With the uniform prior:
P(C1|HT) = 0.21    P(C2|HT) = 0.58    P(C3|HT) = 0.21

Your Estimate?
What is the probability of heads after two experiments?
Most likely coin: C2.  Best estimate for P(H): P(H|C2) = 0.5.
Maximum Likelihood Estimate: the best hypothesis that fits the observed data, assuming a uniform prior.

Using Prior Knowledge
Should we always use a uniform prior?
Background knowledge: heads means you go first in Abalone against the TA; TAs are nice people, so the TA is more likely to use a coin biased in your favor.
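The "most likely coin" choice is just an argmax over sequence likelihoods. A small sketch of that computation (my own illustration, using the coin biases and the H, T data from the slides):

```python
p_heads_given_coin = {"C1": 0.1, "C2": 0.5, "C3": 0.9}

def sequence_likelihood(bias, flips):
    """P(flips | coin with P(H)=bias); flips is a string like 'HT'."""
    result = 1.0
    for flip in flips:
        result *= bias if flip == "H" else (1.0 - bias)
    return result

likelihoods = {c: sequence_likelihood(b, "HT") for c, b in p_heads_given_coin.items()}
ml_coin = max(likelihoods, key=likelihoods.get)
print(likelihoods)   # C1: 0.09, C2: 0.25, C3: 0.09
print(ml_coin)       # 'C2' -> best estimate P(H) = 0.5
```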

Using Prior Knowledge
We can encode it in the prior:
P(C1) = 0.05    P(C2) = 0.25    P(C3) = 0.70

Experiment 1: Heads
P(C1|H) = ?    P(C2|H) = ?    P(C3|H) = ?
P(Ci|H) = α P(H|Ci) P(Ci)
P(C1|H) = 0.006    P(C2|H) = 0.165    P(C3|H) = 0.829
(ML posterior after Exp 1, i.e. with the uniform prior: P(C1|H) = 0.066, P(C2|H) = 0.333, P(C3|H) = 0.6.)

Experiment 2: Tails
P(C1|HT) = ?    P(C2|HT) = ?    P(C3|HT) = ?
P(Ci|HT) = α P(HT|Ci) P(Ci) = α P(H|Ci) P(T|Ci) P(Ci)
P(C1|HT) = 0.035    P(C2|HT) = 0.48    P(C3|HT) = 0.485
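The same update written as a small reusable function, now with the informed prior from this slide; the printed posteriors match the values above (a sketch, not code from the course):

```python
def coin_posterior(prior, likelihoods):
    """Normalize P(data | Ci) * P(Ci) over the three hypotheses."""
    unnorm = [l * p for l, p in zip(likelihoods, prior)]
    total = sum(unnorm)
    return [u / total for u in unnorm]

p_heads_given_coin = [0.1, 0.5, 0.9]
informed_prior = [0.05, 0.25, 0.70]

# After Experiment 1 (Heads): likelihood of H under each coin is just P(H|Ci)
print(coin_posterior(informed_prior, p_heads_given_coin))
# ≈ [0.0066, 0.164, 0.829]

# After Experiment 2 (Tails): likelihood of the sequence HT is P(H|Ci) * P(T|Ci)
print(coin_posterior(informed_prior, [ph * (1 - ph) for ph in p_heads_given_coin]))
# ≈ [0.035, 0.481, 0.485]
```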

Your Estimate?
What is the probability of heads after two experiments?
Most likely coin: C3.  Best estimate for P(H): P(H|C3) = 0.9.
Maximum A Posteriori (MAP) Estimate: the best hypothesis that fits the observed data, assuming a non-uniform prior.

Did We Do The Right Thing?
P(C1|HT) = 0.035    P(C2|HT) = 0.48    P(C3|HT) = 0.485
C2 and C3 are almost equally likely.

A Better Estimate
Recall: P(H) = Σ_i P(H|Ci) P(Ci|HT) = 0.68

Bayesian Estimate
Bayesian Estimate: minimizes prediction error, given data and (generally) assuming a non-uniform prior.
P(H) = Σ_i P(H|Ci) P(Ci|HT) = 0.68
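A sketch contrasting the three point estimates after H, T, using the posteriors from the slides: ML reads off the uniform-prior winner, MAP the informed-prior winner, and the Bayesian estimate averages P(H|Ci) over coins:

```python
p_heads_given_coin = [0.1, 0.5, 0.9]
posterior_informed = [0.035, 0.48, 0.485]   # P(Ci|HT), prior 0.05/0.25/0.70
posterior_uniform = [0.21, 0.58, 0.21]      # P(Ci|HT), uniform prior

# ML: most likely coin under the uniform prior -> C2
ml_estimate = p_heads_given_coin[posterior_uniform.index(max(posterior_uniform))]

# MAP: most likely coin under the informed prior -> C3
map_estimate = p_heads_given_coin[posterior_informed.index(max(posterior_informed))]

# Bayesian: P(H) = sum_i P(H|Ci) P(Ci|HT)
bayes_estimate = sum(ph * pc for ph, pc in zip(p_heads_given_coin, posterior_informed))

print(ml_estimate, map_estimate, round(bayes_estimate, 2))   # 0.5 0.9 0.68
```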

Comparison
ML (Maximum Likelihood): P(H) = 0.5; after 10 experiments (HTH^8): P(H) = 0.9
MAP (Maximum A Posteriori): P(H) = 0.9; after 10 experiments (HTH^8): P(H) = 0.9
Bayesian: P(H) = 0.68; after 10 experiments (HTH^8): P(H) ≈ 0.9

Comparison
ML (Maximum Likelihood): easy to compute.
MAP (Maximum A Posteriori): still easy to compute; incorporates prior knowledge.
Bayesian: minimizes error, so it is great when data is scarce; potentially much harder to compute.

Summary For Now
Prior: probability of a hypothesis before we see any data.
Likelihood: probability of the data given a hypothesis.
Uniform Prior: a prior that makes all hypotheses equally likely.
Posterior: probability of a hypothesis after we have seen some data.
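The "after 10 experiments" column can be checked directly. A sketch, assuming the HTH^8 sequence from the slide and the same two priors as above; all three estimates land at roughly 0.9:

```python
p_heads_given_coin = [0.1, 0.5, 0.9]
uniform_prior = [1 / 3] * 3
informed_prior = [0.05, 0.25, 0.70]
flips = "HT" + "H" * 8                       # HTH^8: 9 heads, 1 tail
n_heads, n_tails = flips.count("H"), flips.count("T")

def coin_posterior(prior):
    # P(Ci | flips) ∝ P(flips | Ci) P(Ci); the order of the flips does not matter
    unnorm = [ph ** n_heads * (1 - ph) ** n_tails * pc
              for ph, pc in zip(p_heads_given_coin, prior)]
    total = sum(unnorm)
    return [u / total for u in unnorm]

post_uniform = coin_posterior(uniform_prior)
post_informed = coin_posterior(informed_prior)
ml = p_heads_given_coin[post_uniform.index(max(post_uniform))]
map_ = p_heads_given_coin[post_informed.index(max(post_informed))]
bayes = sum(ph * pc for ph, pc in zip(p_heads_given_coin, post_informed))
print(ml, map_, round(bayes, 2))             # 0.9 0.9 0.9
```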

Summary For Now

                                 Prior     Hypothesis              Estimate
Maximum Likelihood Estimate      Uniform   The most likely         Point
Maximum A Posteriori Estimate    Any       The most likely         Point
Bayesian Estimate                Any       Weighted combination    Average

Continuous Case
In the previous example, we chose from a discrete set of three coins.
In general, we have to pick from a continuous distribution of biased coins.

[Figures: a prior over the coin bias (uniform, and one encoding background knowledge) and the posterior after two experiments (Exp 1: Heads, Exp 2: Tails), with the ML, MAP, and Bayesian estimates marked. Only the axis ticks 0.1–0.9 survive in the transcription.]
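The slides treat the continuous case only through plots. One common way to make it concrete (an assumption on my part, not something stated on the slides) is a Beta(a, b) prior over the bias θ, for which all three estimates have closed forms:

```python
# Assuming a Beta(a, b) prior over theta, after #H heads and #T tails:
#   ML estimate       = #H / (#H + #T)
#   MAP estimate      = (#H + a - 1) / (#H + #T + a + b - 2)
#   Bayesian estimate = posterior mean = (#H + a) / (#H + #T + a + b)
def estimates(n_heads, n_tails, a=1.0, b=1.0):
    ml = n_heads / (n_heads + n_tails)
    map_ = (n_heads + a - 1) / (n_heads + n_tails + a + b - 2)
    bayes = (n_heads + a) / (n_heads + n_tails + a + b)
    return ml, map_, bayes

print(estimates(1, 1))             # uniform prior Beta(1,1): all three give 0.5
print(estimates(1, 1, a=5, b=2))   # a prior favoring heads pulls MAP and Bayes up
```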

[Figures: posterior over the coin bias after further experiments, with the ML, MAP, and Bayesian estimates marked, under the uniform prior and under the prior with background knowledge. Only axis ticks survive in the transcription.]

Parameter Estimation and Bayesian Networks
Training data (one row per observation):
E  B  R  A  J  M
T  F  T  T  F  T
F  F  F  F  F  T
F  T  F  T  T  T
F  F  F  T  T  T
F  T  F  F  F  F
We have: the Bayes Net structure and the observations.
We need: the Bayes Net parameters.
P(B) = ?    Prior + data → now compute either a MAP or a Bayesian estimate.

Parameter Estimation and Bayesian Networks
P(A|E,B) = ?    P(A|E,¬B) = ?    P(A|¬E,B) = ?    P(A|¬E,¬B) = ?
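A sketch of "prior + data" for this network, using the five reconstructed rows above. Plain ML estimation is just counting; the `None` case shows why a prior (MAP or Bayesian) helps when a parent configuration never occurs in the data:

```python
rows = [  # columns: E, B, R, A, J, M
    dict(E=True,  B=False, R=True,  A=True,  J=False, M=True),
    dict(E=False, B=False, R=False, A=False, J=False, M=True),
    dict(E=False, B=True,  R=False, A=True,  J=True,  M=True),
    dict(E=False, B=False, R=False, A=True,  J=True,  M=True),
    dict(E=False, B=True,  R=False, A=False, J=False, M=False),
]

# ML estimate of P(B): fraction of rows with B = true
p_b = sum(r["B"] for r in rows) / len(rows)
print(p_b)   # 2/5 = 0.4

# ML estimate of P(A | E, B): only rows matching the parent values count
def p_a_given(e, b):
    matching = [r for r in rows if r["E"] == e and r["B"] == b]
    if not matching:     # no data for this parent setting ...
        return None      # ... which is exactly where a prior is needed
    return sum(r["A"] for r in matching) / len(matching)

print(p_a_given(True, True), p_a_given(False, True))   # None, 0.5
```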

Parameter Estimation and Bayesian Networks
P(A|E,B) = ?    P(A|E,¬B) = ?    P(A|¬E,B) = ?    P(A|¬E,¬B) = ?
(Same training data as above.)
Prior + data → now compute either a MAP or a Bayesian estimate.
You know the drill.

Naive Bayes
A Bayes Net where all nodes are children of a single root node.
(Figure: a class node with word children such as FREE, apple, Bayes.)
Why?
- Expressive and accurate? No: the conditional independence assumptions rarely hold exactly.
- Easy to learn? Yes.
- Useful? Sometimes.
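"Easy to learn? Yes" because every parameter is a count-and-divide. A small sketch with a made-up four-email corpus (the words and labels are purely illustrative, not from the slides):

```python
from collections import Counter

emails = [
    (["free", "apple", "free"], "spam"),
    (["bayes", "lecture"],      "ham"),
    (["free", "bayes"],         "spam"),
    (["apple", "pie"],          "ham"),
]

class_counts = Counter(label for _, label in emails)
word_counts = {label: Counter() for label in class_counts}
for words, label in emails:
    word_counts[label].update(words)

# ML estimates: P(S) and P(word | S), one count-and-divide per parameter
p_spam = class_counts["spam"] / len(emails)
p_free_given_spam = word_counts["spam"]["free"] / sum(word_counts["spam"].values())
print(p_spam, p_free_given_spam)   # 0.5, 3/5 = 0.6
```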

Inference In Naive Bayes
(Figure: class node S = Spam with P(S) = 0.6, and CPTs P(A|S), P(A|¬S), P(B|S), P(B|¬S), P(F|S), P(F|¬S), ... for each word feature; most of the numeric values were lost in the transcription.)

Goal: given evidence (words in an email), decide if the email is spam.
E = {A, ¬B, F, ¬K, ...}
P(S|E) = P(E|S) P(S) / P(E)
       = P(A, ¬B, F, ¬K, ... | S) P(S) / P(A, ¬B, F, ¬K, ...)
Independence to the rescue!
       = P(A|S) P(¬B|S) P(F|S) P(¬K|S) P(...|S) P(S) / [ P(A) P(¬B) P(F) P(¬K) P(...) ]

Classify as spam if P(S|E) > P(¬S|E).
But the denominators are identical, so we only need
P(S|E)  ∝ P(A|S) P(¬B|S) P(F|S) P(¬K|S) P(...|S) P(S)
P(¬S|E) ∝ P(A|¬S) P(¬B|¬S) P(F|¬S) P(¬K|¬S) P(...|¬S) P(¬S)

Can we calculate the Maximum Likelihood estimate of θ easily?
(Figure: a prior over θ, plus data, giving the Maximum Likelihood estimate.)
Looking for the maximum of a function: find the derivative and set it to zero.

What function are we maximizing?
P(data | hypothesis), with one hypothesis h_θ for each value of θ.
For example, for the data H, T, T, H:
P(data | h_θ) = P(H|h_θ) P(T|h_θ) P(T|h_θ) P(H|h_θ) = θ (1−θ) (1−θ) θ = θ^#H (1−θ)^#T
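A sketch of the spam decision rule above. P(S) = 0.6 is taken from the slide; the word CPT values are placeholders, since the slide's numbers did not survive the transcription:

```python
p_spam = 0.6
cpt = {               # P(word present | spam), P(word present | not spam) -- placeholders
    "apple": (0.30, 0.10),
    "bank":  (0.20, 0.15),
    "free":  (0.80, 0.05),
    "kale":  (0.01, 0.10),
}
evidence = {"apple": True, "bank": False, "free": True, "kale": False}

def score(is_spam):
    # proportional to P(class | E); the common denominator cancels in the comparison
    prob = p_spam if is_spam else 1 - p_spam
    for word, present in evidence.items():
        p_word = cpt[word][0 if is_spam else 1]
        prob *= p_word if present else (1 - p_word)
    return prob

print("spam" if score(True) > score(False) else "not spam")
```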

What function are we maximizing?
P(data | hypothesis), with one hypothesis h_θ for each value of θ:
P(data | h_θ) = θ^#H (1−θ)^#T

To find the θ that maximizes θ^#H (1−θ)^#T, we take the derivative of the function and set it to 0.
And we get: θ = #H / (#H + #T).
You knew it already, right?
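For completeness, the differentiation step the slide skips; working with the log of the likelihood is easier and has the same maximizer:

log L(θ) = #H · ln θ + #T · ln(1−θ)
d(log L)/dθ = #H/θ − #T/(1−θ) = 0
⇒ #H · (1−θ) = #T · θ   ⇒   θ = #H / (#H + #T)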

Problems With Small Samples
What happens if, in your training data, apples are not mentioned in any spam message?
P(A|S) = 0
Why is it bad?
P(S|E) ∝ P(A|S) P(¬B|S) P(F|S) P(¬K|S) P(...|S) P(S) = 0

Smoothing
Smoothing is used when samples are small.
Add-one smoothing is the simplest smoothing method: just add 1 to every count!

Priors
Recall that P(S) = #S / (#S + #¬S).
If we have a slight hunch that P(S) ≈ p:
P(S) = (#S + p) / (#S + #¬S + 1)
If we have a big hunch that P(S) ≈ p:
P(S) = (#S + m·p) / (#S + #¬S + m),   where m can be any number > 0.

Priors
P(S) = (#S + m·p) / (#S + #¬S + m)
Note that choosing m in the above is like saying "I have seen m samples that make me believe that P(S) = p."
Hence, m is referred to as the equivalent sample size.
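A sketch of the three estimates on these slides: plain ML, add-one smoothing, and the m-estimate with equivalent sample size m and prior guess p:

```python
def ml(count_s, count_not_s):
    return count_s / (count_s + count_not_s)

def add_one(count_s, count_not_s):
    # add 1 to every count (here: both outcomes of a binary variable)
    return (count_s + 1) / (count_s + count_not_s + 2)

def m_estimate(count_s, count_not_s, p, m):
    return (count_s + m * p) / (count_s + count_not_s + m)

# e.g. a word that never appears in spam: ML gives a hard zero,
# the smoothed estimates do not
print(ml(0, 8), add_one(0, 8), m_estimate(0, 8, p=0.5, m=1))
# 0.0  0.1  ~0.056
```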

Priors
P(S) = (#S + m·p) / (#S + #¬S + m)
Where should p come from?
No prior knowledge => p = 0.5.
If you build a personalized spam filter, you can use p = P(S) from somebody else's filter!

Inference in Naive Bayes Revisited
Recall that P(S|E) ∝ P(A|S) P(¬B|S) P(F|S) P(¬K|S) P(...|S) P(S).
We are multiplying lots of small numbers together. Is there any potential for trouble here?
=> danger of underflow!
Solution? Use logs!

Inference in Naive Bayes Revisited
log P(S|E) ∝ log( P(A|S) P(¬B|S) P(F|S) P(¬K|S) P(...|S) P(S) )
           = log P(A|S) + log P(¬B|S) + log P(F|S) + log P(¬K|S) + log P(...|S) + log P(S)
Now we add regular numbers: little danger of over- or underflow errors!

Learning The Structure of Bayesian Networks
General idea: look at all possible network structures and pick the one that fits the observed data best.
Impossibly slow: an exponential number of networks, and for each we have to learn the parameters, too!
What do we do if searching the space exhaustively is too expensive?

Learning The Structure of Bayesian Networks
Local search!
Start with some network structure.
Try to make a change (add, delete, or reverse an edge).
See if the new network is any better.
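A quick illustration of why logs are needed (my own example, not from the slides): the raw product of many small factors underflows to zero, while the sum of logs stays in a safe range:

```python
import math

probs = [0.01] * 500          # pretend per-word factors P(word evidence | class)
product = 1.0
for p in probs:
    product *= p
log_score = sum(math.log(p) for p in probs)

print(product)     # 0.0 -- underflow: 1e-1000 is not representable as a float
print(log_score)   # ≈ -2302.6, perfectly usable for comparing classes
```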

Learning The Structure of Bayesian Networks
What network structure should we start with?
- Random, with a uniform prior?
- Networks that reflect our (or experts') knowledge of the field?
(Figure: a prior network plus an equivalent sample size, combined with data over variables x1, ..., x9, yields improved network(s).)

Learning The Structure of Bayesian Networks
We have just described how to get an ML or MAP estimate of the structure of a Bayes Net.
What would the Bayes estimate look like?
- Find all possible networks.
- Calculate their posteriors.
- When doing inference, the result is a weighted combination of all networks!
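A rough sketch of the local-search idea in code. The scoring function is left abstract (a placeholder argument, e.g. some penalized likelihood of the data), and acyclicity checks are omitted for brevity, so this is an outline of the search loop rather than a full structure learner:

```python
from itertools import permutations

def neighbors(edges, variables):
    """All structures reachable by adding, deleting, or reversing one edge."""
    result = []
    for a, b in permutations(variables, 2):
        if (a, b) in edges:
            result.append(edges - {(a, b)})                 # delete
            result.append((edges - {(a, b)}) | {(b, a)})    # reverse
        elif (b, a) not in edges:
            result.append(edges | {(a, b)})                 # add
    return result   # note: cycle checks omitted in this sketch

def hill_climb(variables, data, score, start=frozenset()):
    """Greedy local search: keep the single edge change that most improves the score."""
    current, current_score = set(start), score(start, data)
    while True:
        best, best_score = None, current_score
        for candidate in neighbors(current, variables):
            s = score(candidate, data)
            if s > best_score:
                best, best_score = candidate, s
        if best is None:          # no single change improves the score
            return current
        current, current_score = best, best_score
```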