Naïve Bayes. Vibhav Gogate The University of Texas at Dallas


Supervised Learning of Classifiers: find f. Given: a training set {(x_i, y_i) : i = 1, …, n}. Find: a good approximation to f : X → Y. Examples: what are X and Y? Spam detection: map an email to {Spam, Ham}. Digit recognition: map pixels to {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}. Stock prediction: map news, historic prices, etc. to the real numbers ℝ. Classification.

Bayesian Categorization/Classification. Let the set of categories be {c_1, c_2, …, c_n} and let E be the description of an instance. Determine the category of E by computing, for each c_i, P(c_i | E) = P(c_i) P(E | c_i) / P(E). P(E) can be ignored (it is a normalization constant), so P(c_i | E) ∝ P(c_i) P(E | c_i).

Text classification. Classify e-mails: Y = {Spam, NotSpam}. Classify news articles: Y = {what is the topic of the article?}. Classify webpages: Y = {student, professor, project, …}. What to use for the features, X?

Features: X is the word sequence in the document; X_i is the i-th word in the article.

Features for Text Classification. X is the sequence of words in the document, so X (and hence P(X | Y)) is huge!!! An article has at least 1000 words, X = {X_1, …, X_1000}, and X_i represents the i-th word in the document, i.e., the domain of X_i is the entire vocabulary, e.g., Webster's Dictionary (or more): 10,000 words, etc. That gives 10,000^1000 = 10^4000 possible documents, while the number of atoms in the universe is about 10^80. We may have a problem.

Bag of Words Model. Typical additional assumption: position in the document doesn't matter, P(X_i = x | Y = y) = P(X_k = x | Y = y) (all positions have the same distribution), i.e., ignore the order of words. Sounds really silly, but often works very well! From now on: X_i is Boolean (word i is in the document), and X = X_1 ∧ … ∧ X_n.

Bag of Words Approach: represent a document by its word counts, e.g., aardvark 0, about 2, all 2, Africa 1, apple 0, anxious 0, …, gas 1, …, oil 1, …, Zaire 0.
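As an illustration (not from the original slides), here is a minimal Python sketch of the two representations discussed above: word counts, and the Boolean "word i appears in the document" features used from here on. The tokenizer and the tiny vocabulary are hypothetical placeholders.

```python
import re
from collections import Counter

def tokenize(text):
    # Hypothetical, minimal tokenizer: lowercase, keep alphabetic tokens only.
    return re.findall(r"[a-z]+", text.lower())

def count_features(doc, vocabulary):
    """Bag-of-words counts: how often each vocabulary word occurs in the document."""
    counts = Counter(tokenize(doc))
    return {w: counts.get(w, 0) for w in vocabulary}

def boolean_features(doc, vocabulary):
    """Boolean bag of words: X_i = 1 iff vocabulary word i appears in the document."""
    present = set(tokenize(doc))
    return {w: int(w in present) for w in vocabulary}

vocab = ["aardvark", "about", "all", "africa", "oil", "zaire"]
doc = "About oil: all about Africa, all the oil."
print(count_features(doc, vocab))    # counts over the vocabulary
print(boolean_features(doc, vocab))  # 0/1 indicators over the vocabulary
```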

Bayesian Categorization. P(y_i | X) ∝ P(y_i) P(X | y_i). Need to know: priors P(y_i) and conditionals P(X | y_i). The P(y_i) are easily estimated from data: if n_i of the examples in D are in class y_i, then P(y_i) = n_i / |D|. Conditionals: X = X_1 ∧ … ∧ X_n, so we must estimate P(X_1, …, X_n | y_i). Too many possible instances to estimate! (exponential in n) Even with the bag-of-words assumption!

Need to Simplify Somehow. Too many probabilities P(x_1, x_2, x_3 | y_i): one for every joint assignment of the features, e.g., P(x_1, x_2, x_3 | spam), P(¬x_1, x_2, x_3 | spam), P(x_1, ¬x_2, x_3 | spam), …, P(¬x_1, ¬x_2, ¬x_3 | spam). Can we assume some are the same? E.g., P(x_1, x_2 | y_i) = P(x_1 | y_i) P(x_2 | y_i).

Conditional Independence. X is conditionally independent of Y given Z if the probability distribution for X is independent of the value of Y, given the value of Z: for all x, y, z, P(X = x | Y = y, Z = z) = P(X = x | Z = z). Equivalent to: P(X, Y | Z) = P(X | Z) P(Y | Z).

Naïve Bayes. The naïve Bayes assumption: features are independent given the class, P(X_1, X_2 | Y) = P(X_1 | Y) P(X_2 | Y). More generally: P(X_1, …, X_n | Y) = Π_i P(X_i | Y). How many parameters now? Suppose X is composed of n binary features: with the assumption we need only O(n) parameters per class, instead of O(2^n) for the full joint.

The Naïve Bayes Classifier. Given: a prior P(Y) and n features X that are conditionally independent given the class Y; for each X_i we have a likelihood P(X_i | Y). (Graphical model: a class node Y with children X_1, X_2, …, X_n.) Decision rule: y* = h_NB(x) = argmax_y P(y) Π_i P(x_i | y).

MLE for the parameters of NB. Given a dataset, count occurrences for all pairs: Count(X_i = x, Y = y). How many pairs? MLE for discrete NB, simply: Prior: P(Y = y) = Count(Y = y) / N. Likelihood: P(X_i = x | Y = y) = Count(X_i = x, Y = y) / Count(Y = y).
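A minimal sketch of these MLE counts for discrete NB with Boolean features (my own illustration, not from the slides; the data layout, a list of (feature dict, label) pairs, is an assumption):

```python
from collections import defaultdict

def mle_nb(data, feature_names, labels):
    """MLE for discrete NB: prior = Count(Y=y)/N,
    likelihood = Count(X_i=1, Y=y) / Count(Y=y)."""
    class_count = defaultdict(int)          # Count(Y = y)
    feat_count = defaultdict(int)           # Count(X_i = 1, Y = y)
    for x, y in data:                       # x: dict of 0/1 features
        class_count[y] += 1
        for f in feature_names:
            if x.get(f, 0):
                feat_count[(f, y)] += 1
    n = len(data)
    prior = {y: class_count[y] / n for y in labels}
    likelihood = {(f, y): feat_count[(f, y)] / class_count[y]
                  for f in feature_names for y in labels if class_count[y] > 0}
    return prior, likelihood
```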

NAÏVE BAYES CALCULATIONS

Subtleties of NB Classifier #1: Violating the NB Assumption. Usually, features are not conditionally independent: P(X_1, …, X_n | Y) ≠ Π_i P(X_i | Y). The actual probabilities P(Y | X) are then often biased towards 0 or 1. Nonetheless, NB is the single most used classifier out there, and it often performs well even when the assumption is violated. [Domingos & Pazzani 96] discuss some conditions for good performance.

Subtleties of NB Classifier #2: Insufficient Training Data. What if you never see a training instance where X_1 = a and Y = b? E.g., you never saw Y = Spam with X_1 = "Enlargement", so the MLE gives P(X_1 = Enlargement | Y = Spam) = 0. Thus, no matter what values X_2, X_3, …, X_n take: P(X_1 = Enlargement, X_2 = a_2, …, X_n = a_n | Y = Spam) = 0. Why?

For Binary Features: we already know the answer! MAP: use the most likely parameter. A Beta prior is equivalent to extra observations for each feature. As N → ∞, the prior is forgotten. But for small sample sizes, the prior is important!
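The slide leaves the estimate itself implicit; for reference, the standard MAP estimate for a Bernoulli parameter θ under a Beta(β₁, β₀) prior (my notation, assuming n₁ ones and n₀ zeros were observed) is:

```latex
\hat{\theta}_{\text{MAP}}
  = \arg\max_{\theta} P(\theta \mid \mathcal{D})
  = \frac{n_1 + \beta_1 - 1}{n_1 + n_0 + \beta_1 + \beta_0 - 2}
```

so the prior acts like β₁ − 1 extra ones and β₀ − 1 extra zeros, and as n₁ + n₀ grows the estimate converges to the MLE n₁ / (n₁ + n₀).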

That's great for the binomial case: it works for Spam / Ham. What about multiple classes? E.g., given a Wikipedia page, predicting its type.

Multinomials: Laplace Smoothing. Laplace's estimate: pretend you saw every outcome k extra times, P_LAP,k(x) = (count(x) + k) / (N + k|X|). Example observations: H H T. What's Laplace with k = 0? (Just the MLE.) k is the strength of the prior. Can derive this as a MAP estimate for a multinomial with Dirichlet priors. Laplace for conditionals: smooth each conditional independently, P_LAP,k(x | y) = (count(x, y) + k) / (count(y) + k|X|).
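Working the H H T example through the formula above (my arithmetic, not transcribed from the slide), with N = 3 and |X| = 2:

```latex
P_{\mathrm{LAP},0}(H) = \tfrac{2}{3}, \qquad
P_{\mathrm{LAP},1}(H) = \tfrac{2+1}{3+2} = \tfrac{3}{5}, \qquad
P_{\mathrm{LAP},100}(H) = \tfrac{2+100}{3+200} = \tfrac{102}{203}.
```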

Probabilities: Important Detail! P(spam | X_1 … X_n) = Π_i P(spam | X_i). Any more potential problems here? We are multiplying lots of small numbers: danger of underflow! 0.5^57 ≈ 7 × 10^-18. Solution? Use logs and add! p_1 · p_2 = e^(log(p_1) + log(p_2)). Always keep everything in log form.
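A short sketch of this log-space trick for the Boolean-feature model (my illustration; the prior and likelihood dictionaries are assumed to look like those produced by the earlier MLE sketch, smoothed so no probability is exactly 0 or 1):

```python
import math

def log_posterior_scores(x, prior, likelihood, feature_names, labels):
    """Return log P(y) + sum_i log P(x_i | y) for each class y.
    Working in log space avoids underflow from multiplying many small numbers."""
    scores = {}
    for y in labels:
        s = math.log(prior[y])
        for f in feature_names:
            p = likelihood[(f, y)] if x.get(f, 0) else 1.0 - likelihood[(f, y)]
            s += math.log(p)            # assumes 0 < p < 1 (e.g., after smoothing)
        scores[y] = s
    return scores

def predict(x, prior, likelihood, feature_names, labels):
    scores = log_posterior_scores(x, prior, likelihood, feature_names, labels)
    return max(scores, key=scores.get)  # argmax over classes
```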

Naïve Bayes Posterior Probabilities. The classification results of naïve Bayes (i.e., the class with maximum posterior probability) are usually fairly accurate (?!?!?). However, due to the inadequacy of the conditional independence assumption, the actual posterior-probability estimates are not accurate: the output probabilities are generally very close to 0 or 1.

NB with Bag of Words for Text Classification. Learning phase: Prior P(Y_m): count how many documents come from each topic m. P(X_i | Y_m): let B be the bag of words created from the union of all docs, let B_m be the bag of words formed from all the docs in topic m, and let #(i, B) be the number of times word i is in bag B; then P(X_i | Y_m) is proportional to #(i, B_m) + 1. Test phase: for each document, use the naïve Bayes decision rule.
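Putting the learning and test phases together, a compact sketch of this recipe (my own code; the class and method names are assumptions, and the +1 is the add-one smoothing from the slide). Usage would look like TextNB().fit(train_docs, train_topics).predict(tokens).

```python
import math
from collections import Counter, defaultdict

class TextNB:
    def fit(self, docs, topics):
        """docs: list of token lists; topics: parallel list of topic labels."""
        self.doc_count = Counter(topics)                 # prior counts per topic
        self.bags = defaultdict(Counter)                 # B_m: word counts per topic
        for tokens, m in zip(docs, topics):
            self.bags[m].update(tokens)
        self.vocab = set(w for bag in self.bags.values() for w in bag)
        return self

    def predict(self, tokens):
        n_docs = sum(self.doc_count.values())
        best, best_score = None, -math.inf
        for m, bag in self.bags.items():
            total = sum(bag.values()) + len(self.vocab)  # denominator with +1 smoothing
            score = math.log(self.doc_count[m] / n_docs) # log prior
            for w in tokens:
                if w in self.vocab:                      # skip unseen words (a choice, not from the slide)
                    score += math.log((bag[w] + 1) / total)
            if score > best_score:
                best, best_score = m, score
        return best
```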

Twenty Newsgroups results

Learning curve for Twenty Newsgroups

Bayesian Learning: What if Features are Continuous? E.g., character recognition: X_i is the i-th pixel. Posterior P(Y | X) ∝ P(X | Y) P(Y), i.e., posterior ∝ data likelihood × prior.

What if Features are Continuous? (continued). E.g., character recognition: X_i is the i-th pixel. Model each feature with a Gaussian class-conditional: P(X_i = x | Y = y_k) = N(μ_ik, σ_ik), where N(μ_ik, σ_ik) = (1 / (σ_ik √(2π))) exp(−(x − μ_ik)² / (2σ_ik²)), and still P(Y | X) ∝ P(X | Y) P(Y).

Gaussian Naïve Bayes. P(X_i = x | Y = y_k) = N(μ_ik, σ_ik). Sometimes assume the variance is independent of Y (i.e., σ_i), or independent of X_i (i.e., σ_k), or both (i.e., σ).

Learning Gaussian Parameters. Maximum likelihood estimates, over training examples j, where δ(x) = 1 if x is true and 0 otherwise. Mean: μ_ik = Σ_j δ(y^j = y_k) x_i^j / Σ_j δ(y^j = y_k). Variance: σ²_ik = Σ_j δ(y^j = y_k) (x_i^j − μ_ik)² / Σ_j δ(y^j = y_k).
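A small sketch of these estimates and the resulting Gaussian naïve Bayes prediction (my illustration; the data is assumed to be a list of (real-valued feature vector, label) pairs):

```python
import math
from collections import defaultdict

def fit_gaussian_nb(data):
    """MLE of per-class, per-feature mean and variance, plus class priors."""
    by_class = defaultdict(list)
    for x, y in data:                        # x: list of real-valued features
        by_class[y].append(x)
    params, prior = {}, {}
    n = len(data)
    for y, xs in by_class.items():
        prior[y] = len(xs) / n
        d = len(xs[0])
        means = [sum(x[i] for x in xs) / len(xs) for i in range(d)]
        variances = [sum((x[i] - means[i]) ** 2 for x in xs) / len(xs)
                     for i in range(d)]
        params[y] = (means, variances)
    return prior, params

def log_gaussian(x, mu, var):
    # log N(x; mu, sigma^2); assumes var > 0
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)

def predict(x, prior, params):
    scores = {y: math.log(prior[y]) + sum(log_gaussian(x[i], m[i], v[i])
                                          for i in range(len(x)))
              for y, (m, v) in params.items()}
    return max(scores, key=scores.get)
```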

What you need to know about Naïve Bayes. The naïve Bayes classifier: what the assumption is, why we use it, how we learn it, and why Bayesian estimation is important. Text classification: the bag-of-words model. Gaussian NB: features are still conditionally independent, and each feature has a Gaussian distribution given the class.