Naïve Bayes for Text Classification

Naïve Bayes for Text Classification. Adapted by Lyle Ungar from slides by Mitch Marcus, which were adapted from slides by Massimo Poesio, which were adapted from slides by Chris Manning.

Example: Is this spam?

From: "" <takworlld@hotmail.com>
Subject: real estate is the only way... gem oalvgkay

Anyone can buy real estate with no money down. Stop paying rent TODAY! There is no need to spend hundreds or even thousands for similar courses. I am 22 years old and I have already purchased 6 properties using the methods outlined in this truly INCREDIBLE ebook. Change your life NOW!
=================================================
Click Below to order: http://www.wholesaledaily.com/sales/nmd.htm
=================================================

How do you know?

Classification
- Given:
  - A vector X describing an instance. (Issue: how do we represent text documents as vectors?)
  - A fixed set of categories: $C = \{c_1, c_2, \ldots, c_k\}$
- Determine:
  - An optimal classifier $c(x): X \to C$

A Graphical View of Text Classification. (Figure: documents grouped by category: Arch., Graphics, NLP, AI, Theory.)

Examples of text categorization
- Spam: spam / not spam
- Topics: finance / sports / asia ...
- Author: Shakespeare / Marlowe / Ben Jonson; The Federalist Papers' author; male / female; native language: English / Chinese, ...
- Opinion: like / hate / neutral
- Emotion: angry / sad / happy / disgusted / ...

Conditional models
- $p(Y = y \mid X = x; w) \propto \exp(-(y - x \cdot w)^2 / 2\sigma^2)$ (linear regression)
- $p(Y = h \mid X = x; w) \propto 1/(1 + \exp(-x \cdot w))$ (logistic regression)
- Or derive from the full model: $P(y \mid x) = p(x, y)/p(x)$, making some assumptions about the distribution of $(x, y)$

Bayesian Methods
- Use Bayes' theorem to build a generative model that approximates how data are produced.
- Use the prior probability of each category.
- Produce a posterior probability distribution over the possible categories given a description of an item.

Bayes' Rule once more: $P(C \mid D) = \frac{P(D \mid C)\, P(C)}{P(D)}$

Maximum a posteriori (MAP):
$c_{MAP} = \arg\max_{c \in C} P(c \mid D) = \arg\max_{c \in C} \frac{P(D \mid c)\, P(c)}{P(D)} = \arg\max_{c \in C} P(D \mid c)\, P(c)$, as $P(D)$ is constant.

Maximum likelihood: if all hypotheses are a priori equally likely, we only need to consider the $P(D \mid c)$ term:
$c_{ML} = \arg\max_{c \in C} P(D \mid c)$ (the Maximum Likelihood Estimate, MLE).

Naive Bayes Classifiers. Task: classify a new instance based on a tuple of attribute values $x = (x_1, x_2, \ldots, x_n)$ into one of the classes $c_j \in C$:
$c_{MAP} = \arg\max_{c_j \in C} P(c_j \mid x_1, x_2, \ldots, x_n) = \arg\max_{c_j \in C} \frac{P(x_1, x_2, \ldots, x_n \mid c_j)\, P(c_j)}{P(x_1, x_2, \ldots, x_n)} = \arg\max_{c_j \in C} P(x_1, x_2, \ldots, x_n \mid c_j)\, P(c_j)$
(Sorry: $n$ here is what we usually call $p$, the number of predictors. For now we are thinking of it as a sequence of $n$ words in a document.)

Naïve Bayes Classifier: Assumption
- $P(c_j)$ can be estimated from the frequency of classes in the training examples.
- $P(x_1, x_2, \ldots, x_n \mid c_j)$ has $O(|X|^n \cdot |C|)$ parameters and could only be estimated if a very, very large number of training examples were available.
- Naïve Bayes assumes conditional independence: the probability of observing the conjunction of attributes equals the product of the individual probabilities $P(x_i \mid c_j)$.

The Naïve Bayes Classifier. (Figure: class node Flu with feature nodes $X_1, \ldots, X_5$: runny nose, sinus, cough, fever, muscle ache.)
- Conditional independence assumption: features are independent of each other given the class:
  $P(X_1, \ldots, X_5 \mid C) = P(X_1 \mid C)\, P(X_2 \mid C) \cdots P(X_5 \mid C)$
- This model is appropriate for binary variables; similar models work more generally (belief networks).
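
As a concrete illustration of the factorization (with made-up numbers, not from the slides): if $P(\text{runny nose} \mid \text{flu}) = 0.8$ and $P(\text{cough} \mid \text{flu}) = 0.7$, the conditional independence assumption gives
$P(\text{runny nose}, \text{cough} \mid \text{flu}) = P(\text{runny nose} \mid \text{flu}) \cdot P(\text{cough} \mid \text{flu}) = 0.8 \times 0.7 = 0.56$.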

Learning the Model. (Figure: class node $C$ with feature nodes $X_1, \ldots, X_6$.)
- First attempt: maximum likelihood estimates, i.e. simply use the frequencies in the data:
  $\hat{P}(c) = \frac{N(C = c)}{N}$
  $\hat{P}(x_i \mid c) = \frac{N(X_i = x_i, C = c)}{N(C = c)}$

Problem with Max Likelihood. (Figure: class Flu with features $X_1, \ldots, X_5$: runny nose, sinus, cough, fever, muscle ache.)
- What if we have seen no training cases where the patient had the flu and muscle aches?
  $\hat{P}(X_5 = t \mid C = \text{flu}) = \frac{N(X_5 = t, C = \text{flu})}{N(C = \text{flu})} = 0$
- Zero probabilities cannot be conditioned away, no matter the other evidence:
  $c^* = \arg\max_c \hat{P}(c) \prod_i \hat{P}(x_i \mid c)$ is driven to 0 for that class by a single zero factor.

Smoothing to Avoid Overfitting
- Laplace (add-one) smoothing:
  $\hat{P}(x_i \mid c) = \frac{N(X_i = x_i, C = c) + 1}{N(C = c) + v}$
  where $N(C = c)$ is the number of docs in class $c$, $N(X_i = x_i, C = c)$ is the number of docs in class $c$ with word position $X_i$ having value word $x_i$, and $v$ is the vocabulary size (the number of values of $X_i$).
- Somewhat more subtle version (the m-estimate):
  $\hat{P}(x_{i,k} \mid c) = \frac{N(X_i = x_{i,k}, C = c) + m\, \hat{p}_{i,k}}{N(C = c) + m}$
  where $\hat{p}_{i,k}$ is the overall fraction in the data where $X_i = x_{i,k}$, marginalized over all classes (how often feature $X_i$ takes on each of its $k$ possible values; if $X_i$ is just true or false, then $k$ is 2), and $m$ controls the extent of smoothing.
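
As a minimal sketch (mine, not the slides'), the two smoothed estimators above can be written directly as functions; the names n_xc, n_c, v, p_prior, and m are shorthand for the quantities defined on this slide.

def laplace_estimate(n_xc: int, n_c: int, v: int) -> float:
    """Add-one (Laplace) smoothing: (N(X_i = x_i, C = c) + 1) / (N(C = c) + v)."""
    return (n_xc + 1) / (n_c + v)

def m_estimate(n_xc: int, n_c: int, p_prior: float, m: float) -> float:
    """m-estimate: (N(X_i = x_i, C = c) + m * p_hat) / (N(C = c) + m)."""
    return (n_xc + m * p_prior) / (n_c + m)

# A word never seen in 5 docs of a class, with a binary (true/false) feature:
print(laplace_estimate(0, 5, 2))           # 0.142857..., instead of the MLE's 0
print(m_estimate(0, 5, p_prior=0.1, m=4))  # 0.0444..., pulled toward the overall rate 0.1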

Using Naive Bayes Classifiers to Classify Text: Bag of Words
- General model: features are positions in the text ($X_1$ is the first word, $X_2$ is the second word, ...); values are words in the vocabulary:
  $c_{NB} = \arg\max_{c \in C} P(c) \prod_i P(x_i \mid c) = \arg\max_{c \in C} P(c)\, P(x_1 = \text{"our"} \mid c) \cdots P(x_n = \text{"text"} \mid c)$
- Too many possibilities, so assume that classification is independent of the positions of the words.
- The result is the bag-of-words model: just use the counts of words, or even a variable for each word: is it in the document or not?
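
A minimal bag-of-words sketch in Python (illustrative; the tokenizer and function names are mine, not the slides'): word positions are dropped and only per-word counts, or mere presence, are kept.

from collections import Counter

def bag_of_words(doc: str) -> Counter:
    # Ignore word positions; keep only how often each word occurs.
    return Counter(doc.lower().split())

def word_presence(doc: str) -> set:
    # Even simpler: one indicator per word (is it in the document or not?).
    return set(doc.lower().split())

print(bag_of_words("our text about text"))   # Counter({'text': 2, 'our': 1, 'about': 1})
print(word_presence("our text about text"))  # e.g. {'our', 'text', 'about'} (set order varies)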

Smoothing to Avoid Overfitting: Bag of Words
- Laplace (add-one) smoothing:
  $\hat{P}(x_i \mid c) = \frac{N(X_i = \text{true}, C = c) + 1}{N(C = c) + v}$
- Somewhat more subtle version:
  $\hat{P}(x_i \mid c) = \frac{N(X_i = \text{true}, C = c) + m\, p_i}{N(C = c) + m}$
- Now $N(C = c)$ is the number of docs in class $c$, $N(X_i = \text{true}, C = c)$ is the number of docs in class $c$ containing word $i$, $v$ is the vocabulary size, $p_i$ is the overall fraction of docs containing word $i$ (the probability that word $i$ is present, ignoring class labels), and $m$ controls the extent of smoothing.

Naïve Bayes: Learning
- From the training corpus, determine the Vocabulary.
- Estimate $P(c_j)$ and $P(w_k \mid c_j)$:
  For each $c_j$ in $C$:
    $docs_j \leftarrow$ documents labeled with class $c_j$
    $P(c_j) \leftarrow |docs_j| \,/\, \text{total number of documents}$
    For each word $w_k$ in Vocabulary:
      $n_k \leftarrow$ number of occurrences of $w_k$ in all of $docs_j$
      $P(w_k \mid c_j) \leftarrow \frac{n_k + 1}{n + |\text{Vocabulary}|}$, where $n$ is the total number of word occurrences in $docs_j$ (simple Laplace smoothing)
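
A Python sketch of this learning procedure, assuming the training corpus is given as a list of (document_text, label) pairs; the function and variable names are mine, not the slides'.

from collections import Counter

def train_naive_bayes(corpus):
    # corpus: list of (document_text, class_label) pairs
    vocabulary = {w for doc, _ in corpus for w in doc.lower().split()}
    prior, cond = {}, {}
    for c in {label for _, label in corpus}:
        docs_c = [doc for doc, label in corpus if label == c]
        prior[c] = len(docs_c) / len(corpus)                   # P(c_j)
        counts = Counter(w for doc in docs_c for w in doc.lower().split())
        n = sum(counts.values())                               # total word occurrences in docs_j
        cond[c] = {w: (counts[w] + 1) / (n + len(vocabulary))  # Laplace smoothing
                   for w in vocabulary}
    return vocabulary, prior, cond

vocab, prior, cond = train_naive_bayes(
    [("the game I saw", "sports"), ("the senate the vote", "politics")])
print(prior["sports"], cond["sports"]["game"])   # 0.5 0.2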

Naïve Bayes: Classifying
- For all words $x_i$ in the current document, return $c_{NB}$, where
  $c_{NB} = \arg\max_{c \in C} P(c) \prod_{i \in \text{document}} P(x_i \mid c)$
- What is the implicit assumption hidden in this?
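
A matching classification sketch (again mine, not the slides'), reusing the vocabulary, prior, and cond structures from the training sketch above: multiply $P(c)$ by $P(x_i \mid c)$ for every in-vocabulary word in the document and take the argmax.

def classify(doc, vocabulary, prior, cond):
    words = [w for w in doc.lower().split() if w in vocabulary]  # out-of-vocabulary words are skipped
    scores = dict(prior)                       # start each class at P(c)
    for c in prior:
        for w in words:
            scores[c] *= cond[c][w]            # P(c) * prod_i P(x_i | c)
    return max(scores, key=scores.get)

print(classify("I saw the game", vocab, prior, cond))   # 'sports' for the toy corpus above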

Naïve Bayes for text
- The correct model would have a probability for each word observed and one for each word not observed.
- Naïve Bayes for text assumes that there is no information in words that are not observed: since most words are very rare, their probability of not being seen is close to 1.

Naive Bayes is not so dumb
- A good baseline for text classification.
- Optimal if the independence assumptions hold.
- Very fast: learns with one pass over the data; testing is linear in the number of attributes and of documents.
- Low storage requirements.

Technical Detail: Underflow
- Multiplying lots of probabilities, which are between 0 and 1 by definition, can result in floating-point underflow.
- Since $\log(xy) = \log(x) + \log(y)$, it is better to perform all computations by summing logs of probabilities rather than multiplying probabilities.
- The class with the highest final un-normalized log-probability score is still the most probable:
  $c_{NB} = \arg\max_{c \in C} \Big[ \log P(c) + \sum_{i \in \text{positions}} \log P(x_i \mid c) \Big]$
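
The same decision rule in log space, as a sketch under the same assumed data structures as the classification sketch above; the smoothed probabilities are never zero, so the logarithm is always defined.

import math

def classify_log(doc, vocabulary, prior, cond):
    words = [w for w in doc.lower().split() if w in vocabulary]
    def log_score(c):
        # log P(c) + sum_i log P(x_i | c): same argmax, no underflow
        return math.log(prior[c]) + sum(math.log(cond[c][w]) for w in words)
    return max(prior, key=log_score)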

More Facts About Bayes Classifiers
- Bayes classifiers can be built with real-valued inputs* (or many other distributions).
- Bayes classifiers don't try to be maximally discriminative; they merely try to honestly model what's going on*.
- Zero probabilities give stupid results.
- Naïve Bayes is wonderfully cheap, and handles 1,000,000 features cheerfully!
*See future lectures and homework.

Naïve Bayes MLE. Assume 5 sports documents; counts are the number of documents on the sports topic containing each word.

  word    topic   count
  a       sports  0
  ball    sports  1
  carrot  sports  0
  game    sports  2
  I       sports  2
  saw     sports  2
  the     sports  3

$P(a \mid \text{sports}) = 0/5$, $P(\text{ball} \mid \text{sports}) = 1/5$

Naïve Bayes prior (noninformative). Assume 5 sports documents; these are pseudo-counts to be added to the observed counts.

  word    topic   count
  a       sports  0.5
  ball    sports  0.5
  carrot  sports  0.5
  game    sports  0.5
  I       sports  0.5
  saw     sports  0.5
  the     sports  0.5

Adding a count of 0.5, i.e. beta(0.5, 0.5), is a Jeffreys prior. A count of 1, i.e. beta(1, 1), is Laplace smoothing. We use 0.5 here; earlier in the notes it was 1; either is fine.

Naïve Bayes posterior (MAP). Assume 5 sports documents.

  word    topic   count
  a       sports  0.5
  ball    sports  1.5
  carrot  sports  0.5
  game    sports  2.5
  I       sports  2.5
  saw     sports  2.5
  the     sports  3.5

$P(\text{word} \mid \text{topic}) = \frac{N(\text{word}, \text{topic}) + 0.5}{N(\text{topic}) + 0.5k}$

The pseudo-count of docs on topic = sports is $5 + 0.5 \times 7 = 8.5$, so the posterior estimates are $P(a \mid \text{sports}) = 0.5/8.5$ and $P(\text{ball} \mid \text{sports}) = 1.5/8.5$.

Naïve Bayes prior (overall word frequencies). Assume 5 sports docs and 6 politics docs, 11 total docs.

  word    sports count   politics count   p(word)
  a       0              2                2/11
  ball    1              0                1/11
  carrot  0              0                0/11
  game    2              1                3/11
  I       2              5                7/11
  saw     2              1                3/11
  the     3              5                8/11

Naïve Bayes posterior (MAP), using the overall word frequencies as the prior:

$P(\text{word} \mid \text{topic}) = \frac{N(\text{word}, \text{topic}) + 4\, P(\text{word})}{N(\text{topic}) + 4}$

$P(a \mid \text{sports}) = \frac{0 + 4 \times (2/11)}{5 + 4} = 0.08$, $\quad P(\text{ball} \mid \text{sports}) = \frac{1 + 4 \times (1/11)}{5 + 4} = 0.15$

Here we arbitrarily pick $m = 4$ as the strength of our prior.
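
As a quick check, the m_estimate function sketched earlier reproduces these numbers (assuming that function and the counts from the table above):

print(m_estimate(0, 5, p_prior=2/11, m=4))  # 0.0808... ~ 0.08 for P(a | sports)
print(m_estimate(1, 5, p_prior=1/11, m=4))  # 0.1515... ~ 0.15 for P(ball | sports)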

What you should know
- Applications of document classification: spam detection, topic prediction, email routing, author ID, sentiment analysis.
- Naïve Bayes:
  - As a MAP estimator (uses a prior for smoothing); contrast with MLE.
  - For document classification: use bag of words; could use a richer feature set.