Naïve Bayes classification

Probability theory. Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. Examples: a person's height, the outcome of a coin toss. Distinguish between discrete and continuous variables. The distribution of a discrete random variable: the probabilities of each value it can take. Notation: P(X = x_i). These numbers satisfy $\sum_i P(X = x_i) = 1$.

Probability theory. Marginal probability; joint probability; conditional probability.

Probability theory. A joint probability distribution for two variables is a table. If the two variables are binary, how many parameters does it have? Let's consider now the joint probability of d variables, P(X_1, …, X_d). How many parameters does it have if each variable is binary?
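To spell out the counting (consistent with the $(2^d - 1)k$ figure that appears later in these notes): a joint table over d binary variables has $2^d$ entries, and since they must sum to one, $2^d - 1$ of them are free parameters.

```latex
% Parameter count of a full joint distribution over d binary variables.
\[
  \underbrace{2 \times 2 \times \cdots \times 2}_{d\ \text{variables}} = 2^d\ \text{table entries},
  \qquad
  2^d - 1\ \text{free parameters (the entries sum to 1).}
\]
```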

Probability theory. Marginalization; the product rule.

The Rules of Probability. Marginalization; the product rule. Independence: X and Y are independent if P(Y|X) = P(Y). This implies P(X,Y) = P(X) P(Y).
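The formulas for the two rules named here were lost in the transcription; in symbols, they are the standard ones:

```latex
\[
  \text{Marginalization:}\quad P(X) = \sum_Y P(X, Y),
  \qquad
  \text{Product rule:}\quad P(X, Y) = P(Y \mid X)\, P(X).
\]
```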

Using probability in learning. Suppose we have access to P(y|x); then we would classify according to argmax_y P(y|x). This is called the Bayes-optimal classifier. What is the error of such a classifier? (Figure: the posteriors P(y = -1|x) and P(y = +1|x).)

Using probability in learning. Some classifiers model P(Y|X) directly: this is discriminative learning. However, it is usually easier to model P(X|Y), from which we can get P(Y|X) using Bayes rule: $P(Y|X) = \frac{P(X|Y)P(Y)}{P(X)}$. This is called generative learning.

Maximum likelihood. Fit a probabilistic model P(x|θ) to data by estimating θ. Given independent, identically distributed (i.i.d.) data X = (x_1, x_2, …, x_n), the likelihood is $P(X|\theta) = P(x_1|\theta) P(x_2|\theta) \cdots P(x_n|\theta)$ and the log likelihood is $\ln P(X|\theta) = \sum_{i=1}^{n} \ln P(x_i|\theta)$. The maximum likelihood solution is the set of parameters θ that maximizes ln P(X|θ).

Example: coin toss. Estimate the probability p that a coin lands heads using the result of n coin tosses, h of which resulted in heads. The likelihood of the data is $P(X|p) = p^h (1-p)^{n-h}$, and the log likelihood is $\ln P(X|p) = h \ln p + (n-h) \ln(1-p)$. Taking the derivative and setting it to zero: $\frac{\partial \ln P(X|p)}{\partial p} = \frac{h}{p} - \frac{n-h}{1-p} = 0 \;\Rightarrow\; p = \frac{h}{n}$.
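A quick numerical check of this closed-form result (not part of the original slides; the counts are made up for illustration):

```python
import numpy as np

# Illustrative data: n = 100 tosses, h = 62 heads (made-up numbers).
n, h = 100, 62

def log_likelihood(p, h, n):
    """Log likelihood of h heads in n tosses under Bernoulli parameter p."""
    return h * np.log(p) + (n - h) * np.log(1 - p)

# Evaluate the log likelihood on a grid and pick the maximizer.
grid = np.linspace(0.01, 0.99, 9801)
p_hat = grid[np.argmax(log_likelihood(grid, h, n))]

print(p_hat)   # ~0.62, matching the closed-form MLE
print(h / n)   # 0.62
```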

Bayes rule. From the product rule, P(Y, X) = P(Y|X) P(X) and also P(Y, X) = P(X|Y) P(Y). Therefore $P(Y|X) = \frac{P(X|Y)P(Y)}{P(X)}$. This is known as Bayes rule.

Bayes rule. $P(Y|X) = \frac{P(X|Y)P(Y)}{P(X)}$, i.e. posterior ∝ likelihood × prior. The denominator can be computed as $P(X) = \sum_Y P(X|Y)P(Y)$, but it is not important for inferring a label.

Maximum a posteriori and maximum likelihood. The maximum a posteriori (MAP) rule: $y_{MAP} = \arg\max_Y P(Y|X) = \arg\max_Y \frac{P(X|Y)P(Y)}{P(X)} = \arg\max_Y P(X|Y)P(Y)$. If we ignore the prior distribution, or assume it is uniform, we obtain the maximum likelihood rule: $y_{ML} = \arg\max_Y P(X|Y)$. A classifier that has access to P(Y|X) is a Bayes-optimal classifier.

Classification with Bayes rule. We would like to model P(X|Y), where X is a feature vector and Y is its associated label. Task: predict whether or not a picnic spot is enjoyable. Training data: n rows, each with features X = (X_1, X_2, X_3, …, X_d) and a label Y. How many parameters are needed? The prior P(Y) requires k − 1 parameters if there are k classes; the likelihood P(X|Y) requires (2^d − 1)k parameters for binary features.

Naïve Bayes classifier. We would like to model P(X|Y), where X is a feature vector and Y is its associated label. Simplifying assumption, conditional independence: given the class label, the features are independent, i.e. $P(X|Y) = P(x_1|Y) P(x_2|Y) \cdots P(x_d|Y)$. How many parameters now? dk + k − 1.

Naïve Bayes classifier. The Naïve Bayes decision rule: $y_{NB} = \arg\max_Y P(X|Y)P(Y) = \arg\max_Y \prod_{i=1}^{d} P(x_i|Y)\, P(Y)$. If conditional independence holds, NB is an optimal classifier!
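A minimal sketch of this decision rule for binary features, assuming the parameters have already been estimated (the values below are made-up placeholders); the product is computed as a sum of logs to avoid underflow:

```python
import numpy as np

# Placeholder parameters for 3 binary features and 2 classes (made up).
# priors[y] = P(Y = y); cond[y][i] = P(x_i = 1 | Y = y).
priors = {"spam": 0.4, "ham": 0.6}
cond = {"spam": np.array([0.8, 0.6, 0.1]),
        "ham":  np.array([0.2, 0.3, 0.5])}

def nb_predict(x, priors, cond):
    """Return argmax_Y P(Y) * prod_i P(x_i | Y), computed in log space."""
    scores = {}
    for y, p_y in priors.items():
        p_xi = np.where(x == 1, cond[y], 1.0 - cond[y])  # Bernoulli P(x_i | y)
        scores[y] = np.log(p_y) + np.sum(np.log(p_xi))
    return max(scores, key=scores.get)

print(nb_predict(np.array([1, 1, 0]), priors, cond))  # 'spam' for these values
```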

Training a Naïve Bayes classifier. Training data: a feature matrix X (n × d) and labels y_1, …, y_n. Maximum likelihood estimates: the class prior is $\hat{P}(y) = \frac{|\{i : y_i = y\}|}{n}$ and the likelihood is $\hat{P}(x_j \mid y) = \frac{\hat{P}(x_j, y)}{\hat{P}(y)} = \frac{|\{i : X_{ij} = x_j,\ y_i = y\}|/n}{|\{i : y_i = y\}|/n}$, where i indexes training examples and j indexes features.

Example: e-mail classification. Suppose our vocabulary contains three words a, b and c, and we use a multivariate Bernoulli model for our e-mails, with parameters θ⁺ = (0.5, 0.67, 0.33) for spam (+) and θ⁻ = (0.67, 0.33, 0.33) for ham (−). This means, for example, that the presence of b is twice as likely in spam (+) as in ham. The e-mail to be classified contains words a and b but not c, and hence is described by the bit vector x = (1, 1, 0). We obtain likelihoods P(x|θ⁺) = 0.5 × 0.67 × (1 − 0.33) = 0.222 and P(x|θ⁻) = 0.67 × 0.33 × (1 − 0.33) = 0.148. The ML classification of x is thus spam.

Example: e-mail classification, training data.

E-mail  a?  b?  c?  Class
e1      0   1   0   +
e2      0   1   1   +
e3      1   0   0   +
e4      1   1   0   +
e5      1   1   0   -
e6      1   0   1   -
e7      1   0   0   -
e8      0   0   0   -

What are the parameters of the model? Using the maximum likelihood estimates $\hat{P}(y) = \frac{|\{i : y_i = y\}|}{n}$ and $\hat{P}(x_j \mid y) = \frac{|\{i : X_{ij} = x_j,\ y_i = y\}|/n}{|\{i : y_i = y\}|/n}$, we obtain P(+) = 0.5 and P(-) = 0.5; P(a|+) = 0.5, P(a|-) = 0.75; P(b|+) = 0.75, P(b|-) = 0.25; P(c|+) = 0.25, P(c|-) = 0.25.
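To make the counting concrete, here is a small sketch (not from the slides) that recomputes these estimates directly from the eight training e-mails above:

```python
import numpy as np

# Training data from the example: columns are the indicators for words a, b, c.
X = np.array([[0, 1, 0], [0, 1, 1], [1, 0, 0], [1, 1, 0],
              [1, 1, 0], [1, 0, 1], [1, 0, 0], [0, 0, 0]])
y = np.array(["+", "+", "+", "+", "-", "-", "-", "-"])

for label in ("+", "-"):
    mask = (y == label)
    prior = mask.mean()          # P(y) = |{i : y_i = y}| / n
    cond = X[mask].mean(axis=0)  # P(x_j = 1 | y), one entry per word
    print(label, prior, cond)

# Prints roughly:
# + 0.5 [0.5  0.75 0.25]
# - 0.5 [0.75 0.25 0.25]
```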

Comments on Naïve Bayes. Usually the features are not conditionally independent, i.e. $P(X|Y) \neq P(x_1|Y) P(x_2|Y) \cdots P(x_d|Y)$. And yet, it is one of the most widely used classifiers: it is easy to train, and it often performs well even when the assumption is violated. Domingos, P., & Pazzani, M. (1997). Beyond Independence: Conditions for the Optimality of the Simple Bayesian Classifier. Machine Learning, 29, 103-130.

When there are few training examples. What if you never see a training example where x_1 = a when y = spam? Then P(a|spam) = 0, and hence P(x|spam) = P(a|spam) P(b|spam) P(c|spam) = 0 for any e-mail containing a. What to do? Add virtual examples for which x_1 = a when y = spam.
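The usual way to implement this idea is add-one (Laplace) smoothing: pretend each feature value was seen once more per class than it actually was. A minimal sketch (the choice alpha = 1 is an assumption, not something fixed by the slides):

```python
import numpy as np

def smoothed_bernoulli_params(X, y, label, alpha=1.0):
    """Estimate P(x_j = 1 | y = label) with add-alpha (Laplace) smoothing.

    Each count gets `alpha` virtual positive and `alpha` virtual negative
    examples, so no estimated probability is ever exactly 0 or 1.
    """
    Xc = X[y == label]
    n_c = Xc.shape[0]
    return (Xc.sum(axis=0) + alpha) / (n_c + 2.0 * alpha)

# Reusing the e-mail training data from the example above:
X = np.array([[0, 1, 0], [0, 1, 1], [1, 0, 0], [1, 1, 0],
              [1, 1, 0], [1, 0, 1], [1, 0, 0], [0, 0, 0]])
y = np.array(["+", "+", "+", "+", "-", "-", "-", "-"])

print(smoothed_bernoulli_params(X, y, "+"))  # approx. [0.5, 0.667, 0.333]
print(smoothed_bernoulli_params(X, y, "-"))  # approx. [0.667, 0.333, 0.333]
```

Incidentally, with alpha = 1 these smoothed estimates come out to (0.5, 0.67, 0.33) for spam and (0.67, 0.33, 0.33) for ham, i.e. exactly the θ⁺ and θ⁻ used in the earlier likelihood example.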

Naïve Bayes for continuous variables. Need to talk about continuous distributions!

Continuous Probability Distributions. The probability of the random variable assuming a value within some given interval from x_1 to x_2 is defined to be the area under the graph of the probability density function between x_1 and x_2. (Figures: a uniform density and a normal/Gaussian density f(x).)
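In symbols, with f the probability density function:

```latex
\[
  P(x_1 \le X \le x_2) = \int_{x_1}^{x_2} f(x)\, dx .
\]
```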

Expectations. Discrete variables; continuous variables; conditional expectation (discrete); approximate expectation (discrete and continuous).
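The formulas behind these four items were shown as images and were not transcribed; presumably the standard definitions are intended:

```latex
\[
  \mathbb{E}[f] = \sum_x P(x)\, f(x) \quad\text{(discrete)},
  \qquad
  \mathbb{E}[f] = \int p(x)\, f(x)\, dx \quad\text{(continuous)},
\]
\[
  \mathbb{E}_x[f \mid y] = \sum_x P(x \mid y)\, f(x),
  \qquad
  \mathbb{E}[f] \approx \frac{1}{N} \sum_{n=1}^{N} f(x_n) \quad\text{(sample approximation)}.
\]
```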

The Gaussian (normal) distribution
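For reference (the density on this slide was an image and was not transcribed), the univariate Gaussian density is:

```latex
\[
  \mathcal{N}(x \mid \mu, \sigma^2)
  = \frac{1}{\sqrt{2\pi\sigma^2}}
    \exp\!\left( -\frac{(x - \mu)^2}{2\sigma^2} \right).
\]
```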

Properties of the Gaussian distribution. About 68.26% of the probability mass lies within one standard deviation of the mean (between µ − 1σ and µ + 1σ), 95.44% within two (µ ± 2σ), and 99.72% within three (µ ± 3σ).

Standard Normal Distribution A random variable having a normal distribution with a mean of 0 and a standard deviation of 1 is said to have a standard normal probability distribution.

Standard Normal Probability Distribution. Converting to the standard normal distribution: $z = \frac{x - \mu}{\sigma}$. We can think of z as a measure of the number of standard deviations x is from µ.

Gaussian Parameter Estimation. Likelihood function.

Maximum (Log) Likelihood
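The likelihood and its maximizers on these two slides were shown as images and were not transcribed; for i.i.d. data x_1, …, x_N drawn from a Gaussian, the standard results are:

```latex
\[
  p(\mathbf{x} \mid \mu, \sigma^2) = \prod_{n=1}^{N} \mathcal{N}(x_n \mid \mu, \sigma^2),
  \qquad
  \mu_{ML} = \frac{1}{N} \sum_{n=1}^{N} x_n,
  \qquad
  \sigma^2_{ML} = \frac{1}{N} \sum_{n=1}^{N} (x_n - \mu_{ML})^2 .
\]
```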

Example.

Gaussian models. Assume we have data that belongs to three classes, and assume a likelihood that follows a Gaussian distribution.

Gaussian Naïve Bayes. Likelihood function: $P(X_i = x \mid Y = y_k) = \frac{1}{\sqrt{2\pi\sigma_{ik}^{2}}} \exp\!\left(-\frac{(x - \mu_{ik})^2}{2\sigma_{ik}^{2}}\right)$. We need to estimate a mean and a variance for each feature in each class.
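A minimal Gaussian Naïve Bayes sketch along these lines (the toy data are made up; a real implementation would also need to guard against zero variance, here handled with a small constant):

```python
import numpy as np

# Toy continuous training data (made up): two features, two classes.
X = np.array([[1.0, 2.1], [0.9, 1.9], [1.2, 2.0],    # class 0
              [3.0, 0.5], [3.2, 0.7], [2.8, 0.4]])   # class 1
y = np.array([0, 0, 0, 1, 1, 1])

def fit_gaussian_nb(X, y):
    """Estimate a class prior, and a mean/variance per feature per class."""
    params = {}
    for k in np.unique(y):
        Xk = X[y == k]
        params[k] = (len(Xk) / len(X), Xk.mean(axis=0), Xk.var(axis=0) + 1e-9)
    return params

def predict(x, params):
    """argmax_k  log P(Y=k) + sum_i log N(x_i | mu_ik, sigma_ik^2)."""
    def score(k):
        prior, mu, var = params[k]
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
        return np.log(prior) + log_lik
    return max(params, key=score)

params = fit_gaussian_nb(X, y)
print(predict(np.array([1.1, 2.0]), params))   # expected: class 0
```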

Summary. The Naïve Bayes classifier: what the assumption is, why we make it, and how we learn the model. Naïve Bayes for discrete data; Gaussian Naïve Bayes.