MLE/MAP + Naïve Bayes


10-601 Introduction to Machine Learning
Machine Learning Department, School of Computer Science, Carnegie Mellon University

MLE/MAP + Naïve Bayes

MLE/MAP readings: Estimating Probabilities (Mitchell, 2016)
Naïve Bayes readings: Generative and Discriminative Classifiers: Naive Bayes and Logistic Regression (Mitchell, 2016)
Textbook readings: Murphy Ch. 3; Bishop: --; HTF: --; Mitchell 6.1-6.10

Matt Gormley
Lecture 5
February 1, 2016

Reminders: Background Exercises (Homework 1) -- Release: Wed, Jan. 25; Due: Wed, Feb. 1 at 5:30pm (for HW1 only, collaboration questions are not required). Homework 2: Naive Bayes -- Release: Wed, Feb. 1; Due: Mon, Feb. 13 at 5:30pm.

MLE / MAP Outline (last lecture / this lecture)
Generating Data: natural (stochastic) data; synthetic data; why synthetic data?; examples: Multinomial, Bernoulli, Gaussian
Data Likelihood: independent and identically distributed (i.i.d.); example: dice rolls
Learning from Data (Frequentist): principle of Maximum Likelihood Estimation (MLE); optimization for MLE; examples: 1D and 2D optimization; example: MLE of Multinomial; aside: method of Lagrange multipliers
Learning from Data (Bayesian): maximum a posteriori (MAP) estimation; optimization for MAP; example: MAP of Bernoulli-Beta

Learning from Data (Frequentist) -- Whiteboard: Aside: Method of Lagrange Multipliers; Example: MLE of Multinomial

Learning from Data (Bayesian) -- Whiteboard: maximum a posteriori (MAP) estimation; Optimization for MAP; Example: MAP of Bernoulli-Beta

Takeaways: One view of what ML is trying to accomplish is function approximation. The principle of maximum likelihood estimation provides an alternate view of learning. Synthetic data can help debug ML algorithms. Probability distributions can be used to model real data that occurs in the world (don't worry, we'll make our distributions more interesting soon!).

Naïve Bayes Outline
Probabilistic (Generative) View of Classification: decision rule for a probability model
Real-world Dataset: Economist vs. Onion articles; document → bag-of-words → binary feature vector
Naive Bayes: Model: generating synthetic "labeled documents"; definition of the model; the Naive Bayes assumption; counting # of parameters with / without the NB assumption
Naïve Bayes: Learning from Data: data likelihood; MLE for Naive Bayes; MAP for Naive Bayes
Visualizing Gaussian Naive Bayes

Today's Goal: To define a generative model of emails of two different classes (e.g. spam vs. not spam).

Spam vs. News: The Economist, The Onion

Whiteboard -- Real-world Dataset: Economist vs. Onion articles; document → bag-of-words → binary feature vector
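As a concrete illustration of the document → bag-of-words → binary feature vector step, here is a minimal Python sketch; the vocabulary and the example sentence are made up for illustration and are not taken from the actual Economist/Onion data.

# Minimal sketch: document -> binary bag-of-words feature vector.
# The vocabulary below is hypothetical, not the course's actual feature set.
vocab = ["economy", "market", "satire", "onion", "election"]

def to_binary_bow(doc, vocab):
    """Return a 0/1 vector: does each vocabulary word appear in the document?"""
    words = set(doc.lower().split())
    return [1 if w in words else 0 for w in vocab]

print(to_binary_bow("The market rallied as the economy grew", vocab))
# -> [1, 1, 0, 0, 0]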

Whiteboard Naive Bayes: Model Generating synthetic "labeled documents" Definition of model Naive Bayes assumption Counting # of parameters with / without NB assumption 11

Model 1: Bernoulli Naïve Bayes
Flip a weighted coin. If HEADS, flip each of the red coins; if TAILS, flip each of the blue coins. Each red coin corresponds to a feature x_m.

y   x_1  x_2  x_3  ...  x_M
0   1    0    1         1
1   0    1    0         1
1   1    1    1         1
0   0    0    1         1
0   1    0    1         0
1   1    0    1         0

We can generate data in this fashion. Though in practice we never would, since our data is given. Instead, this provides an explanation of how the data was generated (albeit a terrible one).
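Below is a minimal Python sketch of this coin-flipping generative story. The parameter values phi and theta are made up for illustration, and it assumes HEADS corresponds to y = 1; nothing here is an estimate from real data.

# Sketch of the Bernoulli Naive Bayes generative story: flip a weighted coin
# for the label y, then flip one class-specific coin per feature x_m.
import numpy as np

rng = np.random.default_rng(0)

phi = 0.5                          # P(Y = 1), the weighted coin
theta = np.array([                 # theta[y, m] = P(X_m = 1 | Y = y)
    [0.2, 0.5, 0.7, 0.6],          # "blue coins" used when y = 0 (TAILS)
    [0.7, 0.4, 0.6, 0.5],          # "red coins"  used when y = 1 (HEADS)
])

def generate(n):
    """Draw n synthetic (x, y) pairs from the model."""
    y = rng.binomial(1, phi, size=n)     # label coin flips
    x = rng.binomial(1, theta[y])        # feature coin flips, one row per example
    return x, y

X, Y = generate(6)
print(Y)
print(X)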

Whiteboard Naive Bayes: Model Generating synthetic "labeled documents" Definition of model Naive Bayes assumption Counting # of parameters with / without NB assumption 13

What's wrong with the Naïve Bayes assumption? The features might not be independent! Example 1: If a document contains the word "Donald", it's extremely likely to contain the word "Trump". These are not independent! Example 2: If the petal width is very high, the petal length is also likely to be very high.

Naïve Bayes: Learning from Data Whiteboard Data likelihood MLE for Naive Bayes MAP for Naive Bayes 15

VISUALIZING NAÏVE BAYES Slides in this section from William Cohen (10-601B, Spring 2016) 16

Fisher Iris Dataset
Fisher (1936) used 150 measurements of flowers from 3 different species: Iris setosa (0), Iris virginica (1), Iris versicolor (2), collected by Anderson (1936).

Species  Sepal Length  Sepal Width  Petal Length  Petal Width
0        4.3           3.0          1.1           0.1
0        4.9           3.6          1.4           0.1
0        5.3           3.7          1.5           0.2
1        4.9           2.4          3.3           1.0
1        5.7           2.8          4.1           1.3
1        6.3           3.3          4.7           1.6
1        6.7           3.0          5.0           1.7

Full dataset: https://en.wikipedia.org/wiki/iris_flower_data_set

Slide from William Cohen

Slide from William Cohen

Plot the difference of the probabilities Slide from William Cohen

Naïve Bayes has a linear decision boundary Slide from William Cohen (10-601B, Spring 2016)

Figure from William Cohen (10-601B, Spring 2016)

Figure from William Cohen (10-601B, Spring 2016). Why don't we drop the generative model and try to learn this hyperplane directly?

Beyond the Scope of this Lecture: Multinomial Naïve Bayes can be used for integer features. Multi-class Naïve Bayes can be used if your classification problem has > 2 classes.

Summary: 1. Naïve Bayes provides a framework for generative modeling. 2. Choose p(x_m | y) appropriate to the data (e.g. Bernoulli for binary features, Gaussian for continuous features). 3. Train by MLE or MAP. 4. Classify by maximizing the posterior.

EXTRA SLIDES 27

Generic Naïve Bayes Model

Support: Depends on the choice of event model, P(X_k | Y).

Model: Product of the prior and the event model:
P(\mathbf{X}, Y) = P(Y) \prod_{k=1}^{K} P(X_k \mid Y)

Training: Find the class-conditional MLE parameters. For P(Y), we find the MLE using all the data. For each P(X_k | Y) we condition on the data with the corresponding class.

Classification: Find the class that maximizes the posterior:
\hat{y} = \operatorname{argmax}_y \; p(y \mid \mathbf{x})

Generic Naïve Bayes Model

Classification:
\hat{y} = \operatorname{argmax}_y \; p(y \mid \mathbf{x})                                  (posterior)
        = \operatorname{argmax}_y \; \frac{p(\mathbf{x} \mid y)\, p(y)}{p(\mathbf{x})}     (by Bayes' rule)
        = \operatorname{argmax}_y \; p(\mathbf{x} \mid y)\, p(y)
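A minimal sketch of this decision rule for the Bernoulli event model, computed in log space for numerical stability. The parameter values and the function name are placeholders of my own, not the course's reference code.

import numpy as np

def predict(x, phi, theta):
    """Return argmax_y [ log p(y) + sum_k log p(x_k | y) ] for y in {0, 1}."""
    log_prior = np.log(np.array([1.0 - phi, phi]))
    log_like = (x * np.log(theta) + (1 - x) * np.log(1 - theta)).sum(axis=1)
    return int(np.argmax(log_prior + log_like))

# Example with K = 3 binary features and made-up parameters
phi = 0.6
theta = np.array([[0.1, 0.7, 0.2],    # theta[0, k] = P(X_k = 1 | Y = 0)
                  [0.8, 0.3, 0.6]])   # theta[1, k] = P(X_k = 1 | Y = 1)
print(predict(np.array([1, 0, 1]), phi, theta))   # -> 1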

Model 1: Bernoulli Naïve Bayes

Support: Binary vectors of length K: \mathbf{x} \in \{0, 1\}^K

Generative Story:
Y \sim \text{Bernoulli}(\phi)
X_k \sim \text{Bernoulli}(\theta_{k,Y}) \quad \forall k \in \{1, \ldots, K\}

Model:
p_{\phi,\theta}(\mathbf{x}, y) = p_{\phi,\theta}(x_1, \ldots, x_K, y)
  = p_\phi(y) \prod_{k=1}^{K} p_{\theta_k}(x_k \mid y)
  = (\phi)^y (1-\phi)^{(1-y)} \prod_{k=1}^{K} (\theta_{k,y})^{x_k} (1-\theta_{k,y})^{(1-x_k)}

Model 1: Bernoulli Naïve Bayes

Support: Binary vectors of length K: \mathbf{x} \in \{0, 1\}^K

Generative Story:
Y \sim \text{Bernoulli}(\phi)
X_k \sim \text{Bernoulli}(\theta_{k,Y}) \quad \forall k \in \{1, \ldots, K\}

Model (same as Generic Naïve Bayes):
p_{\phi,\theta}(\mathbf{x}, y) = (\phi)^y (1-\phi)^{(1-y)} \prod_{k=1}^{K} (\theta_{k,y})^{x_k} (1-\theta_{k,y})^{(1-x_k)}

Classification: Find the class that maximizes the posterior:
\hat{y} = \operatorname{argmax}_y \; p(y \mid \mathbf{x})

Model 1: Bernoulli Naïve Bayes

Training: Find the class-conditional MLE parameters. For P(Y), we find the MLE using all the data. For each P(X_k | Y) we condition on the data with the corresponding class.

\phi = \frac{\sum_{i=1}^{N} \mathbb{I}(y^{(i)} = 1)}{N}

\theta_{k,0} = \frac{\sum_{i=1}^{N} \mathbb{I}(y^{(i)} = 0 \wedge x_k^{(i)} = 1)}{\sum_{i=1}^{N} \mathbb{I}(y^{(i)} = 0)}

\theta_{k,1} = \frac{\sum_{i=1}^{N} \mathbb{I}(y^{(i)} = 1 \wedge x_k^{(i)} = 1)}{\sum_{i=1}^{N} \mathbb{I}(y^{(i)} = 1)}
\quad \forall k \in \{1, \ldots, K\}

Model 1: Bernoulli Naïve Bayes

Training: Find the class-conditional MLE parameters. For P(Y), we find the MLE using all the data. For each P(X_k | Y) we condition on the data with the corresponding class.

\phi = \frac{\sum_{i=1}^{N} \mathbb{I}(y^{(i)} = 1)}{N}

\theta_{k,0} = \frac{\sum_{i=1}^{N} \mathbb{I}(y^{(i)} = 0 \wedge x_k^{(i)} = 1)}{\sum_{i=1}^{N} \mathbb{I}(y^{(i)} = 0)}

\theta_{k,1} = \frac{\sum_{i=1}^{N} \mathbb{I}(y^{(i)} = 1 \wedge x_k^{(i)} = 1)}{\sum_{i=1}^{N} \mathbb{I}(y^{(i)} = 1)}
\quad \forall k \in \{1, \ldots, K\}

Data:
y   x_1  x_2  x_3  ...  x_K
0   1    0    1         1
1   0    1    0         1
1   1    1    1         1
0   0    0    1         1
0   1    0    1         0
1   1    0    1         0
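These count formulas translate directly into a few lines of numpy. The sketch below is illustrative (my own function and variable names, not the course's reference implementation), run on the six example rows from the table above as transcribed.

import numpy as np

def fit_bernoulli_nb_mle(X, Y):
    """MLE for Bernoulli Naive Bayes.
    X: (N, K) binary feature matrix, Y: (N,) binary labels."""
    phi = Y.mean()                               # fraction of examples with y = 1
    theta = np.zeros((2, X.shape[1]))
    for y in (0, 1):
        theta[y] = X[Y == y].mean(axis=0)        # per-feature frequency within class y
    return phi, theta

# The six rows from the table above (K = 4 features shown)
Y = np.array([0, 1, 1, 0, 0, 1])
X = np.array([[1, 0, 1, 1],
              [0, 1, 0, 1],
              [1, 1, 1, 1],
              [0, 0, 1, 1],
              [1, 0, 1, 0],
              [1, 0, 1, 0]])
phi, theta = fit_bernoulli_nb_mle(X, Y)
print(phi)       # 0.5
print(theta)     # theta[y, k] = MLE of P(X_k = 1 | Y = y)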

Model 2: Multinomial Naïve Bayes

Support: Option 1: Integer vector (word IDs)
\mathbf{x} = [x_1, x_2, \ldots, x_M] where x_m \in \{1, \ldots, K\} is a word id.

Generative Story:
for i \in \{1, \ldots, N\}:
    y^{(i)} \sim \text{Bernoulli}(\phi)
    for j \in \{1, \ldots, M_i\}:
        x_j^{(i)} \sim \text{Multinomial}(\boldsymbol{\theta}_{y^{(i)}}, 1)

Model:
p_{\phi,\theta}(\mathbf{x}, y) = p_\phi(y) \, p_\theta(\mathbf{x} \mid y)
  = (\phi)^y (1-\phi)^{(1-y)} \prod_{j=1}^{M_i} \theta_{y, x_j}
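A minimal sketch of this generative story in Python. The vocabulary size, document length, and all probability values below are invented for illustration.

import numpy as np

rng = np.random.default_rng(0)

phi = 0.5                                    # P(Y = 1)
K = 4                                        # vocabulary size (word ids 0..K-1)
theta = np.array([[0.7, 0.1, 0.1, 0.1],      # word distribution for class y = 0
                  [0.1, 0.1, 0.2, 0.6]])     # word distribution for class y = 1

def generate_document(length):
    """Draw a label, then draw `length` word ids i.i.d. from that class's word distribution."""
    y = int(rng.binomial(1, phi))
    words = rng.choice(K, size=length, p=theta[y])
    return y, words

print(generate_document(6))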

Model 3: Gaussian Naïve Bayes

Support: \mathbf{x} \in \mathbb{R}^K

Model: Product of the prior and the event model:
p(\mathbf{x}, y) = p(x_1, \ldots, x_K, y) = p(y) \prod_{k=1}^{K} p(x_k \mid y)

Gaussian Naive Bayes assumes that p(x_k \mid y) is given by a Normal distribution.
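A minimal sketch under this assumption, fitting a per-class, per-feature mean and variance by MLE and evaluating the log joint probability. Function names are my own, not the course's code.

import numpy as np

def fit_gaussian_nb(X, Y):
    """X: (N, K) real-valued features, Y: (N,) labels in {0, 1}."""
    phi = Y.mean()
    mu = np.stack([X[Y == y].mean(axis=0) for y in (0, 1)])   # class-conditional means
    var = np.stack([X[Y == y].var(axis=0) for y in (0, 1)])   # class-conditional variances
    return phi, mu, var

def log_joint(x, phi, mu, var):
    """Return [log p(x, y=0), log p(x, y=1)] under the Gaussian event model."""
    log_prior = np.log(np.array([1.0 - phi, phi]))
    log_like = -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
    return log_prior + log_like.sum(axis=1)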

Model 4: Multiclass Naïve Bayes

Model: The only change is that we permit y to range over C classes.
p(\mathbf{x}, y) = p(x_1, \ldots, x_K, y) = p(y) \prod_{k=1}^{K} p(x_k \mid y)

Now y \sim \text{Multinomial}(\boldsymbol{\phi}, 1), and we have a separate conditional distribution p(x_k \mid y) for each of the C classes.

Smoothing: 1. Add-1 Smoothing; 2. Add-λ Smoothing; 3. MAP Estimation (Beta Prior)

MLE: What does maximizing likelihood accomplish? There is only a finite amount of probability mass (i.e. the sum-to-one constraint). MLE tries to allocate as much probability mass as possible to the things we have observed, at the expense of the things we have not observed.

MLE

For Naïve Bayes, suppose we never observe the word "serious" in an Onion article. In this case, what is the MLE of p(x_k | y)?

\theta_{k,0} = \frac{\sum_{i=1}^{N} \mathbb{I}(y^{(i)} = 0 \wedge x_k^{(i)} = 1)}{\sum_{i=1}^{N} \mathbb{I}(y^{(i)} = 0)}

Now suppose we observe the word "serious" at test time. What is the posterior probability that the article was an Onion article?

p(y \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid y)\, p(y)}{p(\mathbf{x})}

1. Add-1 Smoothing

The simplest setting for smoothing simply adds a single pseudo-observation to the data. This converts the true observations D into a new dataset D' from which we derive the MLEs.

D = \{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^{N}                                        (1)
D' = D \cup \{(\mathbf{0}, 0), (\mathbf{0}, 1), (\mathbf{1}, 0), (\mathbf{1}, 1)\}    (2)

where \mathbf{0} is the vector of all zeros and \mathbf{1} is the vector of all ones. This has the effect of pretending that we observed each feature x_k with each class y.

1. Add-1 Smoothing

What if we write the MLEs in terms of the original dataset D?

\phi = \frac{\sum_{i=1}^{N} \mathbb{I}(y^{(i)} = 1)}{N}

\theta_{k,0} = \frac{1 + \sum_{i=1}^{N} \mathbb{I}(y^{(i)} = 0 \wedge x_k^{(i)} = 1)}{2 + \sum_{i=1}^{N} \mathbb{I}(y^{(i)} = 0)}

\theta_{k,1} = \frac{1 + \sum_{i=1}^{N} \mathbb{I}(y^{(i)} = 1 \wedge x_k^{(i)} = 1)}{2 + \sum_{i=1}^{N} \mathbb{I}(y^{(i)} = 1)}
\quad \forall k \in \{1, \ldots, K\}
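In code, add-1 smoothing only changes the counts: a pseudocount of 1 in each numerator and 2 in each denominator of theta, with the prior phi left unsmoothed as on the slide. A minimal sketch with my own function name:

import numpy as np

def fit_bernoulli_nb_add1(X, Y):
    """Add-1 smoothed estimates. X: (N, K) binary feature matrix, Y: (N,) binary labels."""
    phi = Y.mean()
    theta = np.zeros((2, X.shape[1]))
    for y in (0, 1):
        theta[y] = (1 + X[Y == y].sum(axis=0)) / (2 + (Y == y).sum())
    return phi, theta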

2. Add-λ Smoothing

For the Categorical Distribution: Suppose we have a dataset obtained by repeatedly rolling a K-sided (weighted) die. Given data D = \{x^{(i)}\}_{i=1}^{N} where x^{(i)} \in \{1, \ldots, K\}, we have the following MLE:

\theta_k = \frac{\sum_{i=1}^{N} \mathbb{I}(x^{(i)} = k)}{N}

With add-λ smoothing, we add pseudo-observations as before to obtain a smoothed estimate:

\theta_k = \frac{\lambda + \sum_{i=1}^{N} \mathbb{I}(x^{(i)} = k)}{\lambda K + N}
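A minimal sketch of the smoothed estimate for the K-sided die. The die outcomes here are 0-indexed for convenience, and the rolls are made-up data.

import numpy as np

def add_lambda_estimate(rolls, K, lam):
    """theta_k = (lambda + count_k) / (lambda * K + N) for each side k."""
    counts = np.bincount(rolls, minlength=K)
    return (lam + counts) / (lam * K + len(rolls))

print(add_lambda_estimate(np.array([0, 0, 1, 2, 2, 2]), K=4, lam=0.5))
# -> [0.3125 0.1875 0.4375 0.0625]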

MLE vs. MAP

Suppose we have data D = \{x^{(i)}\}_{i=1}^{N}.

Maximum Likelihood Estimate (MLE):
\theta^{\text{MLE}} = \operatorname{argmax}_\theta \prod_{i=1}^{N} p(x^{(i)} \mid \theta)

Maximum a posteriori (MAP) estimate:
\theta^{\text{MAP}} = \operatorname{argmax}_\theta \prod_{i=1}^{N} p(x^{(i)} \mid \theta)\, p(\theta)

where p(\theta) is the prior.

3. MAP Estimation (Beta Prior)

Generative Story: The parameters are drawn once for the entire dataset.

for k \in \{1, \ldots, K\}:
    for y \in \{0, 1\}:
        \theta_{k,y} \sim \text{Beta}(\alpha, \beta)
for i \in \{1, \ldots, N\}:
    y^{(i)} \sim \text{Bernoulli}(\phi)
    for k \in \{1, \ldots, K\}:
        x_k^{(i)} \sim \text{Bernoulli}(\theta_{k, y^{(i)}})

Training: Find the class-conditional MAP parameters.

\phi = \frac{\sum_{i=1}^{N} \mathbb{I}(y^{(i)} = 1)}{N}

\theta_{k,0} = \frac{(\alpha - 1) + \sum_{i=1}^{N} \mathbb{I}(y^{(i)} = 0 \wedge x_k^{(i)} = 1)}{(\alpha - 1) + (\beta - 1) + \sum_{i=1}^{N} \mathbb{I}(y^{(i)} = 0)}

\theta_{k,1} = \frac{(\alpha - 1) + \sum_{i=1}^{N} \mathbb{I}(y^{(i)} = 1 \wedge x_k^{(i)} = 1)}{(\alpha - 1) + (\beta - 1) + \sum_{i=1}^{N} \mathbb{I}(y^{(i)} = 1)}
\quad \forall k \in \{1, \ldots, K\}
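A minimal sketch of the MAP update, which just adds (alpha - 1) and (beta - 1) pseudocounts to the MLE counts. The function name is my own; alpha and beta are the Beta hyperparameters, defaulted here to 2 so the result coincides with add-1 smoothing.

import numpy as np

def fit_bernoulli_nb_map(X, Y, alpha=2.0, beta=2.0):
    """MAP estimate of theta_{k,y} under a Beta(alpha, beta) prior.
    X: (N, K) binary feature matrix, Y: (N,) binary labels."""
    phi = Y.mean()                      # prior over y estimated by MLE, as on the slide
    theta = np.zeros((2, X.shape[1]))
    for y in (0, 1):
        n_y = (Y == y).sum()
        theta[y] = ((alpha - 1) + X[Y == y].sum(axis=0)) / ((alpha - 1) + (beta - 1) + n_y)
    return phi, theta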