Introduction to Bayesian methods (continued) - Lecture 16

Introduction to Bayesian methods (continued) - Lecture 16 David Sontag New York University Slides adapted from Luke Zettlemoyer, Carlos Guestrin, Dan Klein, and Vibhav Gogate

Outline of lectures Review of probability (After midterm) Maximum likelihood estimation 2 examples of Bayesian classifiers: Naïve Bayes Logistic regression

Bayes Rule Two ways to factor a joint distribution over two variables: P(x, y) = P(x | y) P(y) = P(y | x) P(x). Dividing, we get: P(x | y) = P(y | x) P(x) / P(y). Why is this at all helpful? Lets us build one conditional from its reverse. Often one conditional is tricky but the other one is simple. Foundation of many practical systems (e.g. ASR, MT). In the running for most important ML equation!
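
As a concrete illustration (the numbers below are made up, not from the slides), here is the "build one conditional from its reverse" pattern in a few lines of Python:

```python
# Hypothetical numbers, purely to illustrate P(A | B) = P(B | A) P(A) / P(B).
p_disease = 0.01               # P(A): prior probability of a disease
p_symptom_given_disease = 0.9  # P(B | A): easy to estimate from patient records
p_symptom = 0.08               # P(B): overall rate of the symptom

# The "tricky" conditional, built from its reverse:
p_disease_given_symptom = p_symptom_given_disease * p_disease / p_symptom
print(p_disease_given_symptom)  # 0.1125
```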

Returning to thumbtack example P(Heads) = θ, P(Tails) = 1 - θ. Flips are i.i.d.: D = {x_i | i = 1 ... n}, P(D | θ) = ∏_i P(x_i | θ). Independent events, identically distributed according to a Bernoulli distribution. Sequence D of α_H Heads and α_T Tails: P(D | θ) = θ^{α_H} (1 - θ)^{α_T}. Called the likelihood of the data under the model.
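
A minimal sketch (with hypothetical flips, not from the lecture) of evaluating this likelihood:

```python
flips = ["H", "H", "T", "H", "T"]      # hypothetical thumbtack data D
alpha_H = flips.count("H")             # number of heads
alpha_T = flips.count("T")             # number of tails

def likelihood(theta):
    """P(D | theta) = theta^alpha_H * (1 - theta)^alpha_T for i.i.d. Bernoulli flips."""
    return theta ** alpha_H * (1 - theta) ** alpha_T

print(likelihood(0.5), likelihood(0.6))  # compare two candidate values of theta
```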

Maximum Likelihood Estimation Data: Observed set D of α_H Heads and α_T Tails. Hypothesis: Bernoulli distribution. Learning: finding θ is an optimization problem. What's the objective function? MLE: Choose θ to maximize probability of D.

Your first parameter learning algorithm Set derivative to zero, and solve!
d/dθ ln P(D | θ) = d/dθ [ln θ^{α_H} (1 - θ)^{α_T}]
= d/dθ [α_H ln θ + α_T ln(1 - θ)]
= α_H d/dθ ln θ + α_T d/dθ ln(1 - θ)
= α_H / θ - α_T / (1 - θ) = 0
Solving gives the MLE: θ̂ = α_H / (α_H + α_T).
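
A one-function sketch of the resulting estimator (just the closed form derived above):

```python
def mle_bernoulli(alpha_H, alpha_T):
    """MLE for the Bernoulli parameter: the empirical fraction of heads."""
    return alpha_H / (alpha_H + alpha_T)

print(mle_bernoulli(3, 2))  # 0.6 for a sequence with 3 heads and 2 tails
```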

[Plot: the log-likelihood of the data, L(θ; D) = ln P(D | θ), as a function of θ]

What if I have prior beliefs? Billionaire says: Wait, I know that the thumbtack is close to 50-50. What can you do for me now? You say: I can learn it the Bayesian way. Rather than estimating a single θ, we obtain a distribution over possible values of θ. In the beginning: Pr(θ). Observe flips, e.g.: {tails, tails}. After observations: Pr(θ | D).

Bayesian Learning Use Bayes rule! Posterior = (Data Likelihood × Prior) / Normalization: P(θ | D) = P(D | θ) P(θ) / P(D). Or equivalently: P(θ | D) ∝ P(D | θ) P(θ). For uniform priors, P(θ) ∝ 1, this reduces to maximum likelihood estimation: P(θ | D) ∝ P(D | θ).

Bayesian Learning for Thumbtacks Likelihood: P(D | θ) = θ^{α_H} (1 - θ)^{α_T}. What should the prior be? Represent expert knowledge. Simple posterior form. For binary variables, the commonly used prior is the Beta distribution: P(θ) = Beta(β_H, β_T) ∝ θ^{β_H - 1} (1 - θ)^{β_T - 1}.

To be confident in the MLE we need enough flips: a Hoeffding-style bound gives N ≥ ln(2/δ) / (2ε²); for ε = 0.1 and δ = 0.05, N ≥ ln(2/0.05) / (2 · 0.1²) ≈ 3.8 / 0.02 = 190 flips.
Beta prior distribution P(θ). Since the Beta distribution is conjugate to the Bernoulli distribution, the posterior distribution has a particularly simple form:
P(θ | D) ∝ P(D | θ) P(θ)
∝ θ^{α_H} (1 - θ)^{α_T} · θ^{β_H - 1} (1 - θ)^{β_T - 1}
= θ^{α_H + β_H - 1} (1 - θ)^{α_T + β_T - 1}
⇒ P(θ | D) = Beta(β_H + α_H, β_T + α_T)
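
A short sketch of this conjugate update; using scipy.stats.beta here is my own choice for illustration, and the hyperparameter values are made up:

```python
from scipy.stats import beta

beta_H, beta_T = 5.0, 5.0    # hypothetical Beta prior ("thumbtack is close to 50-50")
alpha_H, alpha_T = 3, 2      # observed heads and tails

# Conjugacy: the posterior is again a Beta distribution with updated counts.
posterior = beta(beta_H + alpha_H, beta_T + alpha_T)

print(posterior.mean())          # posterior mean of theta
print(posterior.interval(0.95))  # central 95% credible interval for theta
```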

Using Bayesian inference for prediction We now have a distribution over parameters. For any specific f, a function of interest, compute the expected value of f: E[f(θ)] = ∫ f(θ) P(θ | D) dθ. The integral is often hard to compute. As more data is observed, the posterior becomes more concentrated. MAP (maximum a posteriori approximation): use the most likely parameter, θ̂_MAP = argmax_θ P(θ | D), to approximate the expectation: E[f(θ)] ≈ f(θ̂_MAP).
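
A sketch contrasting the full Bayesian prediction with the MAP shortcut for the thumbtack, reusing the Beta posterior from the previous slide (the counts are the hypothetical ones used above):

```python
from scipy.stats import beta

a, b = 5.0 + 3, 5.0 + 2  # posterior Beta(beta_H + alpha_H, beta_T + alpha_T)

# Full Bayesian prediction of the next flip: f(theta) = theta, so we need E[theta | D].
# For a Beta posterior this integral has a closed form: the posterior mean a / (a + b).
p_heads_bayes = beta.mean(a, b)

# MAP approximation: plug in the single most likely theta (the posterior mode) instead.
theta_map = (a - 1) / (a + b - 2)

print(p_heads_bayes, theta_map)  # 8/15 ≈ 0.533 vs. 7/13 ≈ 0.538
```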

Outline of lectures Review of probability Maximum likelihood estimation 2 examples of Bayesian classifiers: Naïve Bayes Logistic regression

Bayesian Classification Problem statement: Given features X_1, X_2, ..., X_n, predict a label Y. [Next several slides adapted from: Vibhav Gogate, Jonathan Huang, Luke Zettlemoyer, Carlos Guestrin, and Dan Weld]

Example Application Digit Recognition: X → Classifier → Y, e.g. an image of a 5. X_1, ..., X_n ∈ {0, 1} (black vs. white pixels), Y ∈ {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}.

The Bayes Classifier If we had the joint distribution on X_1, ..., X_n and Y, we could predict using: argmax_y P(Y = y | X_1 = x_1, ..., X_n = x_n) (for example: what is the probability that the image represents a 5 given its pixels?). So... how do we compute that?

The Bayes Classifier Use Bayes Rule! P(Y | X_1, ..., X_n) = P(X_1, ..., X_n | Y) P(Y) / P(X_1, ..., X_n), i.e. Likelihood × Prior / Normalization Constant. Why did this help? Well, we think that we might be able to specify how features are generated by the class label.

The Bayes Classifier Let's expand this for our digit recognition task: P(Y = y | X_1, ..., X_n) ∝ P(X_1, ..., X_n | Y = y) P(Y = y) for y = 0, 1, ..., 9. To classify, we'll simply compute these probabilities, one per class, and predict based on which one is largest.

Model Parameters How many parameters are required to specify the likelihood, P(X_1, ..., X_n | Y)? (Supposing that each image is 30x30 pixels, so n = 900 binary features, the joint needs 2^900 - 1 free parameters for each of the 10 classes.) The problem with explicitly modeling P(X_1, ..., X_n | Y) is that there are usually way too many parameters: We'll run out of space. We'll run out of time. And we'll need tons of training data (which is usually not available).
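
Just to make the explosion concrete, here is the count for the 30x30 binary-pixel setup (a back-of-the-envelope check, not something the slides compute):

```python
# Free parameters needed to write down P(X_1, ..., X_900 | Y) explicitly for 10 classes:
n_pixels = 30 * 30
num_params = 10 * (2 ** n_pixels - 1)  # (2^900 - 1) probabilities per class
print(len(str(num_params)))            # the count itself has about 272 digits
```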

Naïve Bayes Naïve Bayes assumption: Features are independent given class: P(X_1, X_2 | Y) = P(X_1 | Y) P(X_2 | Y). More generally: P(X_1, ..., X_n | Y) = ∏_i P(X_i | Y). How many parameters now? Suppose X is composed of n binary features: just one parameter P(X_i = 1 | Y = y) per feature per class, i.e. n per class (9,000 for the 900-pixel, 10-class digit task) instead of exponentially many.

The Naïve Bayes Classifier Given: Prior P(Y) and n conditionally independent features X_1, ..., X_n given the class Y. For each feature i, we specify P(X_i | Y). [Graphical model: Y is the parent of X_1, X_2, ..., X_n.] Classification decision rule: y* = argmax_y P(Y = y) ∏_i P(X_i = x_i | Y = y). If the conditional independence assumption holds, NB is the optimal classifier! (It typically doesn't.)
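
A minimal sketch of this decision rule for binary features, done in log space so the product of many small probabilities doesn't underflow (the dictionary layout here is my own, hypothetical choice):

```python
import math

def nb_predict(x, prior, cond):
    """
    x:     list of n binary features (0 or 1)
    prior: dict mapping class y -> P(Y = y)
    cond:  dict mapping (i, y) -> P(X_i = 1 | Y = y)
    Returns argmax_y of log P(Y = y) + sum_i log P(X_i = x_i | Y = y).
    """
    best_y, best_score = None, -math.inf
    for y, p_y in prior.items():
        score = math.log(p_y)
        for i, x_i in enumerate(x):
            p_on = cond[(i, y)]
            score += math.log(p_on if x_i == 1 else 1.0 - p_on)
        if score > best_score:
            best_y, best_score = y, score
    return best_y
```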

A Digit Recognizer Input: pixel grids. Output: a digit 0-9. Are the naïve Bayes assumptions realistic here?

What has to be learned?
Prior over classes P(Y): each digit 0-9 gets probability 0.1.
Per-pixel conditionals P(X_i = on | Y), e.g. for one pixel: 1: 0.01, 2: 0.05, 3: 0.05, 4: 0.30, 5: 0.80, 6: 0.90, 7: 0.05, 8: 0.60, 9: 0.50, 0: 0.80;
and for another pixel: 1: 0.05, 2: 0.01, 3: 0.90, 4: 0.80, 5: 0.90, 6: 0.90, 7: 0.25, 8: 0.85, 9: 0.60, 0: 0.80.

MLE for the parameters of NB Given dataset, Count(A = a, B = b) ← number of examples where A = a and B = b. MLE for discrete NB, simply:
Prior: P(Y = y) = Count(Y = y) / Σ_{y'} Count(Y = y')
Observation distribution: P(X_i = x | Y = y) = Count(X_i = x, Y = y) / Σ_{x'} Count(X_i = x', Y = y)
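
A sketch of these counting estimates in plain Python; the (features, label) data layout is an assumption made for illustration:

```python
from collections import Counter, defaultdict

def nb_mle(data):
    """data: list of (x, y) pairs, where x is a tuple of discrete feature values."""
    class_counts = Counter(y for _, y in data)
    feat_counts = defaultdict(int)  # (i, x_i, y) -> Count(X_i = x_i, Y = y)
    for x, y in data:
        for i, x_i in enumerate(x):
            feat_counts[(i, x_i, y)] += 1

    n = len(data)
    prior = {y: c / n for y, c in class_counts.items()}   # P(Y = y)
    cond = {key: c / class_counts[key[2]]                 # P(X_i = x | Y = y)
            for key, c in feat_counts.items()}
    return prior, cond
```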

MLE for the parameters of NB Training amounts to, for each of the classes, averaging all of the examples together:
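
For binary pixels, "averaging the examples of each class" is exactly the MLE of P(X_i = 1 | Y = y); a NumPy sketch under assumed array shapes:

```python
import numpy as np

def class_conditional_means(images, labels, num_classes=10):
    """
    images: (m, 900) array of binary pixel values, labels: (m,) array of digits 0-9.
    Row y of the result is the average image of class y, which equals the MLE
    of P(X_i = 1 | Y = y) for every pixel i.
    """
    images = np.asarray(images, dtype=float)
    labels = np.asarray(labels)
    return np.stack([images[labels == y].mean(axis=0) for y in range(num_classes)])
```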

MAP estimation for NB Given dataset, Count(A = a, B = b) ← number of examples where A = a and B = b. MAP estimation for discrete NB, simply:
Prior: P(Y = y) = Count(Y = y) / Σ_{y'} Count(Y = y')
Observation distribution: P(X_i = x | Y = y) = (Count(X_i = x, Y = y) + a) / (Σ_{x'} Count(X_i = x', Y = y) + |X_i| · a), where |X_i| is the number of values feature X_i can take.
Called smoothing. Corresponds to a Dirichlet prior!
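
A small sketch of the smoothed estimate; a = 1 gives the familiar Laplace smoothing (the example counts below are invented):

```python
def nb_map_cond(count_xy, count_y, num_values, a=1.0):
    """
    Smoothed (MAP) estimate of P(X_i = x | Y = y):
        (Count(X_i = x, Y = y) + a) / (Count(Y = y) + |X_i| * a)
    where num_values = |X_i| is how many values feature X_i can take and
    a is the pseudo-count coming from the Dirichlet prior.
    """
    return (count_xy + a) / (count_y + num_values * a)

# A pixel never observed "on" among 100 training threes no longer gets probability 0:
print(nb_map_cond(count_xy=0, count_y=100, num_values=2))  # 1/102 ≈ 0.0098
```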