Naïve Bayes. Introduction to Machine Learning. Matt Gormley, Lecture 3, September 14, 2016. Readings: Mitchell Ch. 6.1-6.10, Murphy Ch. 3.


School of Computer Science. 10-701 Introduction to Machine Learning. Naïve Bayes. Readings: Mitchell Ch. 6.1-6.10, Murphy Ch. 3. Matt Gormley, Lecture 3, September 14, 2016.

Reminders: Homework 1 due 9/26/16; Project Proposal due 10/3/16 (start early!).

Outline
  Background: Maximum likelihood estimation (MLE); Maximum a posteriori (MAP) estimation; Example: Exponential distribution
  Generative Models: Model 0: Not-so-naïve Model
  Naïve Bayes: Naïve Bayes Assumption; Model 1: Bernoulli Naïve Bayes; Model 2: Multinomial Naïve Bayes; Model 3: Gaussian Naïve Bayes; Model 4: Multiclass Naïve Bayes
  Smoothing: Add-1 Smoothing; Add-λ Smoothing; MAP Estimation (Beta Prior)

MLE vs. MAP. Suppose we have data D = \{x^{(i)}\}_{i=1}^N.
Maximum Likelihood Estimate (MLE): \theta_{MLE} = \operatorname{argmax}_\theta \prod_{i=1}^N p(x^{(i)} \mid \theta)

Background: MLE. Example: MLE of the Exponential Distribution. pdf of Exponential(\lambda): f(x) = \lambda e^{-\lambda x}. Suppose X_i \sim Exponential(\lambda) for 1 \le i \le N. Find the MLE for data D = \{x^{(i)}\}_{i=1}^N. 1. First write down the log-likelihood of the sample. 2. Compute the first derivative, set it to zero, solve for \lambda. 3. Compute the second derivative and check that the log-likelihood is concave down at \lambda_{MLE}.

Background: MLE. Example: MLE of the Exponential Distribution. First write down the log-likelihood of the sample:
\ell(\lambda) = \log \prod_{i=1}^N f(x^{(i)})   (1)
= \sum_{i=1}^N \log(\lambda \exp(-\lambda x^{(i)}))   (2)
= \sum_{i=1}^N (\log(\lambda) - \lambda x^{(i)})   (3)
= N \log(\lambda) - \lambda \sum_{i=1}^N x^{(i)}   (4)

Background: MLE. Example: MLE of the Exponential Distribution. Compute the first derivative, set it to zero, solve for \lambda:
\frac{d\ell(\lambda)}{d\lambda} = \frac{d}{d\lambda}\left( N \log(\lambda) - \lambda \sum_{i=1}^N x^{(i)} \right)   (1)
= \frac{N}{\lambda} - \sum_{i=1}^N x^{(i)} = 0   (2)
\Rightarrow \lambda_{MLE} = \frac{N}{\sum_{i=1}^N x^{(i)}}   (3)

Background: MLE. Example: MLE of the Exponential Distribution (recap). pdf of Exponential(\lambda): f(x) = \lambda e^{-\lambda x}. Suppose X_i \sim Exponential(\lambda) for 1 \le i \le N. Find the MLE for data D = \{x^{(i)}\}_{i=1}^N. 1. First write down the log-likelihood of the sample. 2. Compute the first derivative, set it to zero, solve for \lambda. 3. Compute the second derivative and check that the log-likelihood is concave down at \lambda_{MLE}.
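The slides leave the second-derivative check as an exercise; a quick verification (not worked on the slide) confirms the critical point is a maximum:
\frac{d^2 \ell(\lambda)}{d\lambda^2} = -\frac{N}{\lambda^2} < 0 \quad \text{for all } \lambda > 0,
so \ell is concave and \lambda_{MLE} = N / \sum_{i=1}^N x^{(i)} is indeed the maximizer.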

MLE vs. MAP. Suppose we have data D = \{x^{(i)}\}_{i=1}^N.
Maximum Likelihood Estimate (MLE): \theta_{MLE} = \operatorname{argmax}_\theta \prod_{i=1}^N p(x^{(i)} \mid \theta)
Maximum a posteriori (MAP) estimate: \theta_{MAP} = \operatorname{argmax}_\theta \prod_{i=1}^N p(x^{(i)} \mid \theta) \, p(\theta), where p(\theta) is the prior.

Generative Models. Specify a generative story for how the data was created (e.g. roll a weighted die). Given parameters for the model (e.g. weights for each side), we can generate new data (e.g. roll the die again). The typical learning approach is MLE (e.g. find the most likely weights given the data).

Features. Suppose we want to represent a document (M words) as a vector (vocabulary size K). How should we do it?
Option 1: Integer vector (word IDs): x = [x_1, x_2, ..., x_M] where x_m \in \{1,...,K\} is a word id.
Option 2: Binary vector (word indicators): x = [x_1, x_2, ..., x_K] where x_k \in \{0, 1\} is a boolean indicating whether word k appears.
Option 3: Integer vector (word counts): x = [x_1, x_2, ..., x_K] where x_k \in \mathbb{Z}_{\ge 0} is a word count.
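A minimal MATLAB sketch of the three encodings for one short document (the toy vocabulary and variable names are my own, not from the slides):

% Assumed toy vocabulary (K = 5) and a document with M = 4 words.
vocab = {'the', 'onion', 'economist', 'spam', 'report'};
doc   = {'the', 'spam', 'report', 'the'};

% Option 1: integer vector of word IDs (length M)
[~, word_ids] = ismember(doc, vocab);                 % -> [1 4 5 1]

% Option 2: binary vector of word indicators (length K)
x_bin = zeros(1, numel(vocab));
x_bin(word_ids) = 1;                                  % -> [1 0 0 1 1]

% Option 3: integer vector of word counts (length K)
x_cnt = accumarray(word_ids', 1, [numel(vocab) 1])';  % -> [2 0 0 1 1]

Options 2 and 3 discard word order; which one to use depends on the event model chosen below (Bernoulli vs. multinomial).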

Today's Goal: to define a generative model of emails of two different classes (e.g. spam vs. not spam).

Spam / News. [Figure: example documents from The Economist and The Onion.]

Model 0: Not-so-naïve Model? Generative Story: 1. Flip a weighted coin (Y). 2. If heads, roll the red many-sided die to sample a document vector (X) from the Spam distribution. 3. If tails, roll the blue many-sided die to sample a document vector (X) from the Not-Spam distribution.
P(X_1,...,X_K, Y) = P(X_1,...,X_K \mid Y) P(Y)
This model is computationally naïve!

Model 0: Not-so-naïve Model? Generative Story: 1. Flip a weighted coin (Y). 2. If heads, sample a document ID (X) from the Spam distribution. 3. If tails, sample a document ID (X) from the Not-Spam distribution.
P(X, Y) = P(X \mid Y) P(Y)
This model is computationally naïve!

Model 0: Not-so-naïve Model? [Figure: flip a weighted coin; if HEADS, roll the red die; if TAILS, roll the blue die. Each side of the die is labeled with a document vector (e.g. [1,0,1,...,1]); the table shows sampled rows of (y, x_1, x_2, x_3, ..., x_K) binary values.]

Outline
  Background: Maximum likelihood estimation (MLE); Maximum a posteriori (MAP) estimation; Example: Exponential distribution
  Generative Models: Model 0: Not-so-naïve Model
  Naïve Bayes: Naïve Bayes Assumption; Model 1: Bernoulli Naïve Bayes; Model 2: Multinomial Naïve Bayes; Model 3: Gaussian Naïve Bayes; Model 4: Multiclass Naïve Bayes
  Smoothing: Add-1 Smoothing; Add-λ Smoothing; MAP Estimation (Beta Prior)

Naïve Bayes Assumption. Conditional independence of features:
P(X_1,...,X_K, Y) = P(X_1,...,X_K \mid Y) P(Y) = \left[ \prod_{k=1}^K P(X_k \mid Y) \right] P(Y)

Estimating a joint from conditional probabilities. P(A, B \mid C) = P(A \mid C) \, P(B \mid C), i.e. \forall a, b, c: P(A=a, B=b \mid C=c) = P(A=a \mid C=c) \, P(B=b \mid C=c).
P(C): P(C=0) = 0.33, P(C=1) = 0.67.
P(A|C): P(A=0|C=0) = 0.2, P(A=0|C=1) = 0.5, P(A=1|C=0) = 0.8, P(A=1|C=1) = 0.5.
P(B|C): P(B=0|C=0) = 0.1, P(B=0|C=1) = 0.9, P(B=1|C=0) = 0.9, P(B=1|C=1) = 0.1.
Joint table P(A,B,C) to fill in, one row per assignment: (A,B,C) = (0,0,0), (0,0,1), (0,1,0), (0,1,1), (1,0,0), (1,0,1), (1,1,0), (1,1,1).

Estimating a joint from conditional probabilities (now with a fourth variable D).
P(C): P(C=0) = 0.33, P(C=1) = 0.67. P(A|C) and P(B|C) as on the previous slide.
P(D|C): P(D=0|C=0) = 0.1, P(D=0|C=1) = 0.1, P(D=1|C=0) = 0.9, P(D=1|C=1) = 0.1.
Joint table P(A,B,D,C) to fill in, one row per assignment of (A, B, D, C).

Assuming conditional independence, the conditional probabilities encode the same information as the joint table. They are very convenient for estimating P(X_1,...,X_n \mid Y) = P(X_1 \mid Y) \cdots P(X_n \mid Y), and they are almost as good for computing
P(Y \mid X_1,...,X_n) = \frac{P(X_1,...,X_n \mid Y) P(Y)}{P(X_1,...,X_n)}, \quad \text{i.e.} \quad \forall x, y: \; P(Y=y \mid X_1,...,X_n = x) = \frac{P(X_1,...,X_n = x \mid Y=y) P(Y=y)}{P(X_1,...,X_n = x)}.
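To make the factorization concrete, here is one entry of the joint computed from the tables two slides back (a worked example, not on the slides):
P(A=1, B=0, C=1) = P(A=1 \mid C=1) \, P(B=0 \mid C=1) \, P(C=1) = 0.5 \times 0.9 \times 0.67 \approx 0.30.
The remaining seven rows of the joint table are filled in the same way.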

Generic Naïve Bayes Model. Support: depends on the choice of event model, P(X_k | Y). Model: product of prior and the event model:
P(X, Y) = P(Y) \prod_{k=1}^K P(X_k \mid Y)
Training: find the class-conditional MLE parameters. For P(Y), we find the MLE using all the data. For each P(X_k | Y) we condition on the data with the corresponding class. Classification: find the class that maximizes the posterior: \hat{y} = \operatorname{argmax}_y \, p(y \mid x)

Generic Naïve Bayes Model. Classification:
\hat{y} = \operatorname{argmax}_y \, p(y \mid x)   (posterior)
= \operatorname{argmax}_y \, \frac{p(x \mid y) p(y)}{p(x)}   (by Bayes rule)
= \operatorname{argmax}_y \, p(x \mid y) p(y)   (since p(x) does not depend on y)

Model 1: Bernoulli Naïve Bayes. Support: binary vectors of length K, x \in \{0,1\}^K.
Generative Story: Y \sim Bernoulli(\phi); \; X_k \sim Bernoulli(\theta_{k,Y}) \; \forall k \in \{1,...,K\}.
Model:
p_{\phi,\theta}(x, y) = p_{\phi,\theta}(x_1,...,x_K, y) = p_\phi(y) \prod_{k=1}^K p_\theta(x_k \mid y) = (\phi)^y (1-\phi)^{(1-y)} \prod_{k=1}^K (\theta_{k,y})^{x_k} (1-\theta_{k,y})^{(1-x_k)}

Model 1: Bernoulli Naïve Bayes. [Figure: flip the weighted coin; if HEADS, flip each red coin; if TAILS, flip each blue coin. Each red coin corresponds to an x_k; the table shows sampled rows of (y, x_1, x_2, x_3, ..., x_K) binary values.] We can generate data in this fashion. Though in practice we never would, since our data is given. Instead, this provides an explanation of how the data was generated (albeit a terrible one).
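A minimal MATLAB sketch of this generative story (variable names are mine, not from the slides; it uses implicit expansion, so MATLAB R2016b+ is assumed):

% Sample N documents from the Bernoulli Naive Bayes story.
% phi: scalar P(Y = 1); theta1, theta0: 1-by-K per-feature coin biases.
y = rand(N, 1) < phi;                       % flip the weighted coin for each of N documents
Theta = y .* theta1 + (1 - y) .* theta0;    % N-by-K matrix: red-coin or blue-coin biases
X = rand(N, K) < Theta;                     % flip each feature coin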

Model 1: Bernoulli Naïve Bayes. Support: binary vectors of length K, x \in \{0,1\}^K. Generative Story: Y \sim Bernoulli(\phi); \; X_k \sim Bernoulli(\theta_{k,Y}) \; \forall k \in \{1,...,K\}. Model (same as Generic Naïve Bayes):
p_{\phi,\theta}(x, y) = (\phi)^y (1-\phi)^{(1-y)} \prod_{k=1}^K (\theta_{k,y})^{x_k} (1-\theta_{k,y})^{(1-x_k)}
Classification: find the class that maximizes the posterior: \hat{y} = \operatorname{argmax}_y \, p(y \mid x)

Generic Naïve Bayes Model (recall). Classification:
\hat{y} = \operatorname{argmax}_y \, p(y \mid x)   (posterior)
= \operatorname{argmax}_y \, \frac{p(x \mid y) p(y)}{p(x)}   (by Bayes rule)
= \operatorname{argmax}_y \, p(x \mid y) p(y)

Model 1: Bernoulli Naïve Bayes. Training: find the class-conditional MLE parameters. For P(Y), we find the MLE using all the data. For each P(X_k | Y) we condition on the data with the corresponding class.
\phi = \frac{1}{N} \sum_{i=1}^N I(y^{(i)} = 1)
\theta_{k,0} = \frac{\sum_{i=1}^N I(y^{(i)} = 0 \wedge x_k^{(i)} = 1)}{\sum_{i=1}^N I(y^{(i)} = 0)}, \quad \theta_{k,1} = \frac{\sum_{i=1}^N I(y^{(i)} = 1 \wedge x_k^{(i)} = 1)}{\sum_{i=1}^N I(y^{(i)} = 1)} \quad \forall k \in \{1,...,K\}

Model 1: Bernoulli Naïve Bayes. Training: find the class-conditional MLE parameters. For P(Y), we find the MLE using all the data. For each P(X_k | Y) we condition on the data with the corresponding class.
\phi = \frac{1}{N} \sum_{i=1}^N I(y^{(i)} = 1)
\theta_{k,0} = \frac{\sum_{i=1}^N I(y^{(i)} = 0 \wedge x_k^{(i)} = 1)}{\sum_{i=1}^N I(y^{(i)} = 0)}, \quad \theta_{k,1} = \frac{\sum_{i=1}^N I(y^{(i)} = 1 \wedge x_k^{(i)} = 1)}{\sum_{i=1}^N I(y^{(i)} = 1)} \quad \forall k \in \{1,...,K\}
Data: a table of N training examples with binary columns y, x_1, x_2, x_3, ..., x_K.
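A minimal MATLAB sketch of MLE training and classification for the Bernoulli model (variable names are mine, not from the slides; X is an N-by-K binary matrix, y an N-by-1 vector of 0/1 labels, xnew a 1-by-K test vector):

% Training: class-conditional MLE parameters
phi    = mean(y == 1);                                % P(Y = 1)
theta1 = sum(X(y == 1, :), 1) / sum(y == 1);          % theta_{k,1} = P(X_k = 1 | Y = 1)
theta0 = sum(X(y == 0, :), 1) / sum(y == 0);          % theta_{k,0} = P(X_k = 1 | Y = 0)

% Classification: compare log-posteriors (log(0) can occur without smoothing;
% see the smoothing slides below)
logp1 = log(phi)     + sum(xnew .* log(theta1) + (1 - xnew) .* log(1 - theta1));
logp0 = log(1 - phi) + sum(xnew .* log(theta0) + (1 - xnew) .* log(1 - theta0));
yhat  = double(logp1 > logp0);                        % predicted class in {0, 1}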

Outline
  Background: Maximum likelihood estimation (MLE); Maximum a posteriori (MAP) estimation; Example: Exponential distribution
  Generative Models: Model 0: Not-so-naïve Model
  Naïve Bayes: Naïve Bayes Assumption; Model 1: Bernoulli Naïve Bayes; Model 2: Multinomial Naïve Bayes; Model 3: Gaussian Naïve Bayes; Model 4: Multiclass Naïve Bayes
  Smoothing: Add-1 Smoothing; Add-λ Smoothing; MAP Estimation (Beta Prior)

Smoothing. 1. Add-1 Smoothing. 2. Add-λ Smoothing. 3. MAP Estimation (Beta Prior).

MLE. What does maximizing likelihood accomplish? There is only a finite amount of probability mass (i.e. a sum-to-one constraint). MLE tries to allocate as much probability mass as possible to the things we have observed, at the expense of the things we have not observed.

MLE. For Naïve Bayes, suppose we never observe the word "serious" in an Onion article. In this case, what is the MLE of p(x_k | y)?
\theta_{k,0} = \frac{\sum_{i=1}^N I(y^{(i)} = 0 \wedge x_k^{(i)} = 1)}{\sum_{i=1}^N I(y^{(i)} = 0)}
Now suppose we observe the word "serious" at test time. What is the posterior probability that the article was an Onion article?
p(y \mid x) = \frac{p(x \mid y) p(y)}{p(x)}
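The slide poses this as a question; spelling out the answer (not worked on the slide): the numerator count is zero, so \theta_{serious,0} = 0, which forces p(x \mid y=0) = 0 and hence p(y=0 \mid x) = 0, no matter how strongly every other word points to The Onion. A single unseen word vetoes the class, which is exactly what the smoothing methods below are designed to prevent.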

1. Add-1 Smoothing. The simplest setting for smoothing simply adds a single pseudo-observation to the data. This converts the true observations D into a new dataset D', from which we derive the MLEs.
D = \{(x^{(i)}, y^{(i)})\}_{i=1}^N   (1)
D' = D \cup \{(\mathbf{0}, 0), (\mathbf{0}, 1), (\mathbf{1}, 0), (\mathbf{1}, 1)\}   (2)
where \mathbf{0} is the vector of all zeros and \mathbf{1} is the vector of all ones. This has the effect of pretending that we observed each feature x_k with each class y.

1. Add-1 Smoothing. What if we write the MLEs in terms of the original dataset D?
\phi = \frac{1}{N} \sum_{i=1}^N I(y^{(i)} = 1)
\theta_{k,0} = \frac{1 + \sum_{i=1}^N I(y^{(i)} = 0 \wedge x_k^{(i)} = 1)}{2 + \sum_{i=1}^N I(y^{(i)} = 0)}, \quad \theta_{k,1} = \frac{1 + \sum_{i=1}^N I(y^{(i)} = 1 \wedge x_k^{(i)} = 1)}{2 + \sum_{i=1}^N I(y^{(i)} = 1)} \quad \forall k \in \{1,...,K\}

2. Add-λ Smoothing. For the Categorical Distribution. Suppose we have a dataset obtained by repeatedly rolling a K-sided (weighted) die. Given data D = \{x^{(i)}\}_{i=1}^N where x^{(i)} \in \{1,...,K\}, we have the following MLE:
\theta_k = \frac{\sum_{i=1}^N I(x^{(i)} = k)}{N}
With add-λ smoothing, we add pseudo-observations as before to obtain a smoothed estimate:
\theta_k = \frac{\lambda + \sum_{i=1}^N I(x^{(i)} = k)}{K\lambda + N}
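A minimal MATLAB sketch of add-λ smoothing for the K-sided die (variable names and the value of lambda are my own choices; x is an N-by-1 vector of outcomes in {1,...,K}):

counts    = accumarray(x, 1, [K 1]);                          % N_k = # of times side k came up
theta_mle = counts / sum(counts);                             % unsmoothed MLE (zero for unseen sides)
lambda    = 0.1;                                              % pseudo-count per side (an assumption)
theta_sm  = (counts + lambda) / (sum(counts) + K * lambda);   % every side now has nonzero mass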

MLE vs. MAP (recall). Suppose we have data D = \{x^{(i)}\}_{i=1}^N.
Maximum Likelihood Estimate (MLE): \theta_{MLE} = \operatorname{argmax}_\theta \prod_{i=1}^N p(x^{(i)} \mid \theta)
Maximum a posteriori (MAP) estimate: \theta_{MAP} = \operatorname{argmax}_\theta \prod_{i=1}^N p(x^{(i)} \mid \theta) \, p(\theta), where p(\theta) is the prior.

3. MAP Estimation (Beta Prior). Generative Story: the parameters are drawn once for the entire dataset.
for k \in \{1,...,K\}: for y \in \{0,1\}: \theta_{k,y} \sim Beta(\alpha, \beta)
for i \in \{1,...,N\}: y^{(i)} \sim Bernoulli(\phi); for k \in \{1,...,K\}: x_k^{(i)} \sim Bernoulli(\theta_{k,y^{(i)}})
Training: find the class-conditional MAP parameters.
\phi = \frac{1}{N} \sum_{i=1}^N I(y^{(i)} = 1)
\theta_{k,0} = \frac{(\alpha - 1) + \sum_{i=1}^N I(y^{(i)} = 0 \wedge x_k^{(i)} = 1)}{(\alpha - 1) + (\beta - 1) + \sum_{i=1}^N I(y^{(i)} = 0)}, \quad \theta_{k,1} = \frac{(\alpha - 1) + \sum_{i=1}^N I(y^{(i)} = 1 \wedge x_k^{(i)} = 1)}{(\alpha - 1) + (\beta - 1) + \sum_{i=1}^N I(y^{(i)} = 1)} \quad \forall k \in \{1,...,K\}
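A connection worth noting (not stated on the slide): with a Beta(2, 2) prior we have \alpha - 1 = \beta - 1 = 1, so the MAP estimate above reduces exactly to the add-1 smoothing formula from a few slides back,
\theta_{k,0} = \frac{1 + \sum_{i=1}^N I(y^{(i)} = 0 \wedge x_k^{(i)} = 1)}{2 + \sum_{i=1}^N I(y^{(i)} = 0)},
and more generally a Beta(\lambda+1, \lambda+1) prior recovers add-λ smoothing for these binary features.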

Outline
  Background: Maximum likelihood estimation (MLE); Maximum a posteriori (MAP) estimation; Example: Exponential distribution
  Generative Models: Model 0: Not-so-naïve Model
  Naïve Bayes: Naïve Bayes Assumption; Model 1: Bernoulli Naïve Bayes; Model 2: Multinomial Naïve Bayes; Model 3: Gaussian Naïve Bayes; Model 4: Multiclass Naïve Bayes
  Smoothing: Add-1 Smoothing; Add-λ Smoothing; MAP Estimation (Beta Prior)

Model 2: Multinomial Naïve Bayes. Support: Option 1: integer vector (word IDs), x = [x_1, x_2, ..., x_M] where x_m \in \{1,...,K\} is a word id.
Generative Story: for i \in \{1,...,N\}: y^{(i)} \sim Bernoulli(\phi); for j \in \{1,...,M_i\}: x_j^{(i)} \sim Multinomial(\theta_{y^{(i)}}, 1)
Model:
p_{\phi,\theta}(x, y) = p_\phi(y) \prod_{j=1}^{M_i} p_\theta(x_j \mid y) = (\phi)^y (1-\phi)^{(1-y)} \prod_{j=1}^{M_i} \theta_{y, x_j}
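A minimal MATLAB sketch of multinomial Naïve Bayes training (variable names are mine, not from the slides; docs is an N-by-1 cell array where docs{i} is a column vector of word IDs in {1,...,K}, and y is an N-by-1 vector of 0/1 labels):

phi    = mean(y == 1);                                  % P(Y = 1)
words1 = vertcat(docs{y == 1});                         % all tokens from class-1 documents
words0 = vertcat(docs{y == 0});                         % all tokens from class-0 documents
theta1 = accumarray(words1, 1, [K 1]) / numel(words1);  % theta_{1,k} = P(word k | Y = 1)
theta0 = accumarray(words0, 1, [K 1]) / numel(words0);  % theta_{0,k} = P(word k | Y = 0)

In practice these per-word estimates would also be smoothed (add-λ over the K word types) for the same zero-count reasons discussed above.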

Gaussian Discriminative Analysis. Learning f: X -> Y, where X is a vector of real-valued features, X_n = <X_{n,1}, ..., X_{n,m}>, and Y is an indicator vector. What does that imply about the form of P(Y|X)? The joint probability of a datum and its label is:
p(y_n = k, x_n \mid \mu, \Sigma) = p(y_n = k) \, p(x_n \mid y_n = k, \mu, \Sigma) = \pi_k \frac{1}{(2\pi)^{m/2} |\Sigma|^{1/2}} \exp\left\{ -\tfrac{1}{2} (x_n - \mu_k)^T \Sigma^{-1} (x_n - \mu_k) \right\}
Given a datum x_n, we predict its label using the conditional probability of the label given the datum:
p(y_n = k \mid x_n, \mu, \Sigma) = \frac{\pi_k \exp\left\{ -\tfrac{1}{2} (x_n - \mu_k)^T \Sigma^{-1} (x_n - \mu_k) \right\}}{\sum_{k'} \pi_{k'} \exp\left\{ -\tfrac{1}{2} (x_n - \mu_{k'})^T \Sigma^{-1} (x_n - \mu_{k'}) \right\}}
[Graphical model: Y_n -> X_n.] (Slide credit: Eric Xing, CMU, 2006-2011.)

Model 3: Gaussian Naïve Bayes. Support: x \in \mathbb{R}^K. Model: product of prior and the event model:
p(x, y) = p(x_1,...,x_K, y) = p(y) \prod_{k=1}^K p(x_k \mid y)
Gaussian Naïve Bayes assumes that p(x_k | y) is given by a Normal distribution.
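Written out explicitly (standard notation, not spelled out on this slide), the event model is a univariate Gaussian per feature and class:
p(x_k \mid y = c) = \frac{1}{\sqrt{2\pi \sigma_{k,c}^2}} \exp\left( -\frac{(x_k - \mu_{k,c})^2}{2\sigma_{k,c}^2} \right),
with \mu_{k,c} and \sigma_{k,c} estimated from the training examples of class c (e.g. by MLE, the sample mean and standard deviation), as the visualization code later in these slides does for the Iris data.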

Gaussian Naïve Bayes Classifier. When X is a multivariate Gaussian vector, the joint probability of a datum and its label is:
p(y_n = 1, x_n) = p(y_n = 1) \, p(x_n \mid y_n = 1, \mu, \Sigma) = \pi \frac{1}{(2\pi)^{m/2} |\Sigma|^{1/2}} \exp\left\{ -\tfrac{1}{2} (x_n - \mu)^T \Sigma^{-1} (x_n - \mu) \right\}
The naïve Bayes simplification:
p(y_n = 1, x_n) = p(y_n = 1) \prod_j p(x_n^j \mid y_n = 1, \mu_{1j}, \sigma_{1j}) = \pi \prod_j \frac{1}{\sqrt{2\pi}\,\sigma_{1j}} \exp\left\{ -\tfrac{1}{2} \left( \frac{x_n^j - \mu_{1j}}{\sigma_{1j}} \right)^2 \right\}
More generally:
p(y_n, x_n) = p(y_n \mid \pi) \prod_{j=1}^m p(x_n^j \mid y_n, \eta)
where p(. | .) is an arbitrary conditional (discrete or continuous) 1-D density. [Graphical model: Y_n -> X_{n,1}, X_{n,2}, ..., X_{n,m}.] (Slide credit: Eric Xing, CMU, 2006-2011.)

VISUALIZING NAÏVE BAYES. Slides in this section from William Cohen (10-601B, Spring 2016).

Fisher Iris Dataset. Fisher (1936) used 150 measurements of flowers from 3 different species: Iris setosa (0), Iris virginica (1), Iris versicolor (2), collected by Anderson (1936).
Species | Sepal Length | Sepal Width | Petal Length | Petal Width
0 | 4.3 | 3.0 | 1.1 | 0.1
0 | 4.9 | 3.6 | 1.4 | 0.1
0 | 5.3 | 3.7 | 1.5 | 0.2
1 | 4.9 | 2.4 | 3.3 | 1.0
1 | 5.7 | 2.8 | 4.1 | 1.3
1 | 6.3 | 3.3 | 4.7 | 1.6
1 | 6.7 | 3.0 | 5.0 | 1.7
Full dataset: https://en.wikipedia.org/wiki/Iris_flower_data_set

%% Import the IRIS data
load fisheriris;                    % loads meas (150x4) and species (150x1 cell array)
X = meas;
pos = strcmp(species, 'setosa');    % positive class: setosa
Y = 2*pos - 1;                      % labels in {-1, +1}
%% Visualize the data
imagesc([X, Y]); title('Iris data');

%% Visualize by scatter plotting the first two dimensions
figure;
scatter(X(Y<0,1), X(Y<0,2), 'r*'); hold on;
scatter(X(Y>0,1), X(Y>0,2), 'bo');
title('Iris data');

%% Compute the mean and SD of each class
PosMean = mean(X(Y>0,:));  PosSD = std(X(Y>0,:));
NegMean = mean(X(Y<0,:));  NegSD = std(X(Y<0,:));
%% Compute the NB probabilities for each class for each grid element
[G1, G2] = meshgrid(3:0.1:8, 2:0.1:5);
Z1 = gaussmf(G1, [PosSD(1), PosMean(1)]);   % gaussmf(x, [sigma, center]), Fuzzy Logic Toolbox
Z2 = gaussmf(G2, [PosSD(2), PosMean(2)]);
Z  = Z1 .* Z2;                              % product of per-feature Gaussians (NB assumption)
V1 = gaussmf(G1, [NegSD(1), NegMean(1)]);
V2 = gaussmf(G2, [NegSD(2), NegMean(2)]);
V  = V1 .* V2;

% Add them to the scatter plot
figure;
scatter(X(Y<0,1), X(Y<0,2), 'r*'); hold on;
scatter(X(Y>0,1), X(Y>0,2), 'bo');
contour(G1, G2, Z);
contour(G1, G2, V);

%% Now plot the difference of the probabilities
figure;
scatter(X(Y<0,1), X(Y<0,2), 'r*'); hold on;
scatter(X(Y>0,1), X(Y>0,2), 'bo');
contour(G1, G2, Z - V);
mesh(G1, G2, Z - V);
alpha(0.4)

NAÏVE BAYES IS LINEAR. Slides in this section from William Cohen (10-601B, Spring 2016).

Question: what does the boundary between positive and negative look like for Naïve Bayes?

\operatorname{argmax}_y \prod_i P(X_i = x_i \mid Y = y) \, P(Y = y)
= \operatorname{argmax}_y \sum_i \log P(X_i = x_i \mid Y = y) + \log P(Y = y)
= \operatorname{argmax}_{y \in \{+1,-1\}} \sum_i \log P(x_i \mid y) + \log P(y)
= \operatorname{sign}\left( \sum_i \log P(x_i \mid y_{+1}) - \sum_i \log P(x_i \mid y_{-1}) + \log P(y_{+1}) - \log P(y_{-1}) \right)   (two classes only)
= \operatorname{sign}\left( \sum_i \log \frac{P(x_i \mid y_{+1})}{P(x_i \mid y_{-1})} + \log \frac{P(y_{+1})}{P(y_{-1})} \right)   (rearrange terms)

\operatorname{argmax}_y \prod_i P(X_i = x_i \mid Y = y) \, P(Y = y) = \operatorname{sign}\left( \sum_i \log \frac{P(x_i \mid y_{+1})}{P(x_i \mid y_{-1})} + \log \frac{P(y_{+1})}{P(y_{-1})} \right)
If each x_i is 1 or 0, define
u_i = \log \frac{P(x_i = 1 \mid y_{+1})}{P(x_i = 1 \mid y_{-1})} - \log \frac{P(x_i = 0 \mid y_{+1})}{P(x_i = 0 \mid y_{-1})}
Then
= \operatorname{sign}\left( \sum_i x_i \left[ \log \frac{P(x_i = 1 \mid y_{+1})}{P(x_i = 1 \mid y_{-1})} - \log \frac{P(x_i = 0 \mid y_{+1})}{P(x_i = 0 \mid y_{-1})} \right] + \sum_i \log \frac{P(x_i = 0 \mid y_{+1})}{P(x_i = 0 \mid y_{-1})} + \log \frac{P(y_{+1})}{P(y_{-1})} \right)
= \operatorname{sign}\left( \sum_i x_i u_i + u_0 \right) = \operatorname{sign}\left( \sum_i x_i u_i + x_0 u_0 \right) = \operatorname{sign}(\mathbf{x} \cdot \mathbf{u})
where x_0 = 1 for every x (bias term) and
u_0 = \sum_i \log \frac{P(x_i = 0 \mid y_{+1})}{P(x_i = 0 \mid y_{-1})} + \log \frac{P(y_{+1})}{P(y_{-1})}
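A minimal MATLAB sketch of this reduction, reusing the Bernoulli parameters phi, theta1, theta0 from the earlier training sketch (my naming, not the slides'):

% Convert Bernoulli NB parameters into an equivalent linear classifier
% sign(xnew*w' + w0), following the derivation above.
w  = log(theta1 ./ theta0) - log((1 - theta1) ./ (1 - theta0));     % per-feature weights u_i
w0 = sum(log((1 - theta1) ./ (1 - theta0))) + log(phi / (1 - phi)); % bias term u_0
yhat = sign(xnew * w' + w0);                                        % prediction in {+1, -1}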

Why don't we drop the generative model and try to learn this hyperplane directly?

Model 4: Multiclass Naïve Bayes. Model: the only change is that we permit y to range over C classes.
p(x, y) = p(x_1,...,x_K, y) = p(y) \prod_{k=1}^K p(x_k \mid y)
Now, y \sim Multinomial(\phi, 1) and we have a separate conditional distribution p(x_k | y) for each of the C classes.
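A minimal MATLAB sketch of multiclass classification with a Bernoulli event model (variable names are mine; Theta is a C-by-K matrix with Theta(c,k) = P(X_k = 1 | Y = c), prior is a C-by-1 vector with prior(c) = P(Y = c), and xnew is a 1-by-K binary test vector):

scores = log(prior) + log(Theta) * xnew' + log(1 - Theta) * (1 - xnew)';
[~, yhat] = max(scores);          % predicted class in {1,...,C}

The same argmax-over-C structure applies to the multinomial and Gaussian event models; only the per-class log-likelihood term changes.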

Naïve Bayes Assumption (recall). Conditional independence of features:
P(X_1,...,X_K, Y) = P(X_1,...,X_K \mid Y) P(Y) = \left[ \prod_{k=1}^K P(X_k \mid Y) \right] P(Y)

What's wrong with the Naïve Bayes Assumption?
P(X_1,...,X_K, Y) = P(X_1,...,X_K \mid Y) P(Y) = \left[ \prod_{k=1}^K P(X_k \mid Y) \right] P(Y)
The features might not be independent!
Example 1: if a document contains the word "Donald", it's extremely likely to contain the word "Trump". These are not independent!
Example 2: if the petal width is very high, the petal length is also likely to be very high.

Summary. 1. Naïve Bayes provides a framework for generative modeling. 2. Choose an event model appropriate to the data. 3. Train by MLE or MAP. 4. Classify by maximizing the posterior.