Naïve Bayes for Text Classification
Adapted by Lyle Ungar from slides by Mitch Marcus, which were adapted from slides by Massimo Poesio, which were adapted from slides by Chris Manning.
Example: Is this spam?
From: "" <takworlld@hotmail.com>
Subject: real estate is the only way... gem oalvgkay
Anyone can buy real estate with no money down. Stop paying rent TODAY! There is no need to spend hundreds or even thousands for similar courses. I am 22 years old and I have already purchased 6 properties using the methods outlined in this truly INCREDIBLE ebook. Change your life NOW!
=================================================
Click Below to order: http://www.wholesaledaily.com/sales/nmd.htm
=================================================
How do you know?
Classification
• Given:
  - A vector X describing an instance
    - Issue: how to represent text documents as vectors?
  - A fixed set of categories: C = {c1, c2, ..., ck}
• Determine:
  - An optimal classifier c(): X → C
A Graphical View of Text Classification
(figure: example documents grouped by category: Arch., Graphics, NLP, AI, Theory)
Examples of text categorization
• Spam
  - spam / not spam
• Topics
  - finance / sports / asia
• Author
  - Shakespeare / Marlowe / Ben Jonson
  - The Federalist papers author
  - Male / female
  - Native language: English / Chinese, ...
• Opinion
  - like / hate / neutral
• Emotion
  - angry / sad / happy / disgusted / ...
Conditional models
• p(y | X = x; w) ∝ exp(−(y − x·w)² / 2σ²)        (linear regression)
• p(y = 1 | X = x; w) = 1 / (1 + exp(−x·w))        (logistic regression)
• Or derive from the full (joint) model
  - P(y | x) = p(x, y) / p(x)
  - Making some assumptions about the distribution of (x, y)
Bayesian Methods
• Use Bayes' theorem to build a generative model that approximates how the data are produced
• Use the prior probability of each category
• Produce a posterior probability distribution over the possible categories given a description of an item
Bayes' Rule once more
P(C | D) = P(D | C) P(C) / P(D)
Maximum a posteriori (MAP)
c_MAP = argmax_{c ∈ C} P(c | D)
      = argmax_{c ∈ C} P(D | c) P(c) / P(D)
      = argmax_{c ∈ C} P(D | c) P(c)        (as P(D) is constant)
Maximum likelihood
If all hypotheses are a priori equally likely, we only need to consider the P(D | c) term:
c_ML = argmax_{c ∈ C} P(D | c)        Maximum Likelihood Estimate (MLE)
Naive Bayes Classifiers
Task: Classify a new instance based on a tuple of attribute values x = (x1, x2, ..., xn) into one of the classes c ∈ C.
c_MAP = argmax_{c ∈ C} P(c | x1, x2, ..., xn)
      = argmax_{c ∈ C} P(x1, x2, ..., xn | c) P(c) / P(x1, x2, ..., xn)
      = argmax_{c ∈ C} P(x1, x2, ..., xn | c) P(c)
Sorry: n here is what we usually call p, the number of predictors. For now we're thinking of it as a sequence of n words in a document.
Naïve Bayes Classifier: Assumption
• P(c)
  - Can be estimated from the frequency of classes in the training examples.
• P(x1, x2, ..., xn | c)
  - O(|X|^n · |C|) parameters
  - Could only be estimated if a very, very large number of training examples was available.
Naïve Bayes assumes Conditional Independence:
• Assume that the probability of observing the conjunction of attributes is equal to the product of the individual probabilities P(xi | c).
The Naïve Bayes Classifier
Flu → X1, X2, X3, X4, X5 (runny nose, sinus, cough, fever, muscle-ache)
• Conditional Independence Assumption: features are independent of each other given the class:
  P(X1, ..., X5 | C) = P(X1 | C) · P(X2 | C) · ... · P(X5 | C)
• This model is appropriate for binary variables
  - Similar models work more generally (Belief Networks)
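A minimal sketch of what the conditional independence assumption buys us: the joint likelihood of all five symptoms given the class is just the product of per-feature terms. The probability values below are made up purely for illustration and are not from the slides.

```python
# Hypothetical per-symptom conditional probabilities P(X_i = true | class);
# the numbers are invented for illustration only.
p_given_flu   = {"runny nose": 0.9, "sinus": 0.8, "cough": 0.7, "fever": 0.6, "muscle-ache": 0.5}
p_given_noflu = {"runny nose": 0.2, "sinus": 0.1, "cough": 0.2, "fever": 0.05, "muscle-ache": 0.1}

def joint_likelihood(observed, p_given_class):
    """P(X1..X5 | class) under the naive assumption: multiply P(X_i = x_i | class)."""
    prob = 1.0
    for feature, p_true in p_given_class.items():
        prob *= p_true if observed[feature] else (1.0 - p_true)
    return prob

observed = {"runny nose": True, "sinus": True, "cough": True, "fever": False, "muscle-ache": True}
print(joint_likelihood(observed, p_given_flu))    # product of the five per-feature terms
print(joint_likelihood(observed, p_given_noflu))
```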
Learning the Model
C → X1, X2, X3, X4, X5, X6
• First attempt: maximum likelihood estimates
  - simply use the frequencies in the data
  P̂(c) = N(C = c) / N
  P̂(xi | c) = N(Xi = xi, C = c) / N(C = c)
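A small sketch of these maximum likelihood estimates on a made-up training set (the data, feature names, and labels are invented for illustration):

```python
from collections import Counter, defaultdict

# A tiny, made-up training set: each row is (class label, dict of binary features).
data = [
    ("flu",    {"cough": True,  "fever": True}),
    ("flu",    {"cough": True,  "fever": False}),
    ("no flu", {"cough": False, "fever": False}),
    ("no flu", {"cough": True,  "fever": False}),
]

class_counts = Counter(label for label, _ in data)   # N(C = c)
feature_counts = defaultdict(Counter)                # N(X_i = true, C = c)
for label, features in data:
    for name, value in features.items():
        if value:
            feature_counts[label][name] += 1

N = len(data)
prior = {c: class_counts[c] / N for c in class_counts}              # P-hat(c)
likelihood = {c: {f: feature_counts[c][f] / class_counts[c]         # P-hat(x_i = true | c)
                  for f in ("cough", "fever")} for c in class_counts}
print(prior)
print(likelihood)
```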
Problem with Max Likelihood
Flu → X1, X2, X3, X4, X5 (runny nose, sinus, cough, fever, muscle-ache)
P(X1, ..., X5 | C) = P(X1 | C) · P(X2 | C) · ... · P(X5 | C)
• What if we have seen no training cases where the patient had muscle aches but no flu?
  P̂(X5 = t | C = ¬flu) = N(X5 = t, C = ¬flu) / N(C = ¬flu) = 0
• Zero probabilities cannot be conditioned away, no matter the other evidence!
  - argmax_c P̂(c) ∏_i P̂(xi | c) is forced to 0 for that class whenever any factor is 0
Smoothing to Avoid Overfitting
P̂(xi | c) = (N(Xi = xi, C = c) + 1) / (N(C = c) + v)        v = # of values of Xi
• Somewhat more subtle version:
  P̂(xi,k | c) = (N(Xi = xi,k, C = c) + m·p_{i,k}) / (N(C = c) + m)        m = extent of smoothing
where
  N(C = c) = # of docs in class c
  N(Xi = xi, C = c) = # of docs in class c with word position Xi having value word xi; here v is the vocabulary size
  If Xi is just true or false, then k is 2.
  p_{i,k} is, marginalized over all classes, how often feature Xi takes on each of its k possible values (the overall fraction of the data where Xi = xi,k)
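A minimal sketch of the two smoothed estimators above, add-one (Laplace) and the m-estimate; the counts plugged in at the bottom are made up for illustration:

```python
def laplace_estimate(count_xc, count_c, num_values):
    """Add-one (Laplace) smoothing: (N(X_i = x_i, C = c) + 1) / (N(C = c) + v)."""
    return (count_xc + 1) / (count_c + num_values)

def m_estimate(count_xc, count_c, prior_p, m):
    """m-estimate smoothing: (N(X_i = x_i, C = c) + m * p) / (N(C = c) + m),
    where p is the class-independent fraction of data with X_i = x_i."""
    return (count_xc + m * prior_p) / (count_c + m)

# Made-up counts: word seen in 1 of 5 docs in the class, 3 of 11 docs overall,
# with a vocabulary of 7 possible values.
print(laplace_estimate(1, 5, 7))        # (1 + 1) / (5 + 7)
print(m_estimate(1, 5, 3 / 11, m=4))    # (1 + 4*(3/11)) / (5 + 4)
```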
Using Naive Bayes Classifiers to Classify Text: Bag of Words
• General model: features are positions in the text (X1 is the first word, X2 is the second word, ...); values are words in the vocabulary
  c_NB = argmax_{c ∈ C} P(c) ∏_i P(xi | c)
       = argmax_{c ∈ C} P(c) · P(x1 = "our" | c) · ... · P(xn = "text" | c)
• Too many possibilities, so assume that classification is independent of the positions of the words
  - Result is the bag of words model
  - Just use the counts of words, or even a variable for each word: is it in the document or not?
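A quick sketch of the two bag-of-words representations just mentioned (word counts versus binary presence); the example sentence is invented:

```python
from collections import Counter

doc = "buy real estate with no money down buy now"
tokens = doc.lower().split()

word_counts = Counter(tokens)                  # counts of each word, positions discarded
word_present = {w: True for w in set(tokens)}  # one binary variable per word: in the doc or not

print(word_counts)    # Counter({'buy': 2, 'real': 1, ...})
print(word_present)
```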
Smoothing to Avoid Overfitting: Bag of Words
P̂(xi | c) = (N(Xi = true, C = c) + 1) / (N(C = c) + v)        v = # of values of Xi
• Somewhat more subtle version:
  P̂(xi | c) = (N(Xi = true, C = c) + m·p_i) / (N(C = c) + m)        m = extent of smoothing
Now
  N(C = c) = # of docs in class c
  N(Xi = true, C = c) = # of docs in class c containing word i
  v = vocabulary size
  p_i = the probability that word i is present, ignoring class labels (the overall fraction of docs containing word i)
Naïve Bayes: Learning
• From the training corpus, determine the Vocabulary
• Estimate P(cj) and P(xk | cj)
  - For each cj in C do:
    - docsj ← documents labeled with class cj
    - P(cj) ← |docsj| / total # of documents
    - For each word xk in Vocabulary:
      - nk ← number of occurrences of xk in all docsj
      - P(xk | cj) ← (nk + 1) / (n + |Vocabulary|), where n is the total number of word occurrences in docsj
Simple Laplace smoothing
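A minimal sketch of this learning loop on a made-up three-document corpus (the documents and labels are invented; the estimator is the Laplace-smoothed one from the slide):

```python
from collections import Counter

corpus = [
    ("sports",   "the game was a great game"),
    ("sports",   "I saw the ball"),
    ("politics", "I saw the vote and the debate"),
]

vocabulary = {w for _, text in corpus for w in text.split()}
classes = {label for label, _ in corpus}

prior, cond_prob = {}, {}
for c in classes:
    docs_c = [text for label, text in corpus if label == c]
    prior[c] = len(docs_c) / len(corpus)                        # P(c_j)
    tokens_c = [w for text in docs_c for w in text.split()]
    counts = Counter(tokens_c)                                  # n_k for each word
    n = len(tokens_c)                                           # total word occurrences in docs_j
    cond_prob[c] = {w: (counts[w] + 1) / (n + len(vocabulary))  # Laplace smoothing
                    for w in vocabulary}

print(prior)
print(cond_prob["sports"]["game"])
```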
Naïve Bayes: Classifying
• For all words xi in the current document
• Return c_NB, where
  c_NB = argmax_{cj ∈ C} P(cj) ∏_{i ∈ document} P(xi | cj)
What is the implicit assumption hidden in this?
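A sketch of the classification step, assuming `prior`, `cond_prob`, and `vocabulary` were estimated as in the learning sketch above; note that it silently skips words outside the vocabulary, which is exactly the implicit assumption the slide asks about:

```python
def classify(text, prior, cond_prob, vocabulary):
    """Return argmax_c P(c) * prod_i P(x_i | c) over the words in the document."""
    best_class, best_score = None, -1.0
    for c in prior:
        score = prior[c]
        for word in text.split():
            if word in vocabulary:          # words never seen in training are ignored
                score *= cond_prob[c][word]
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Example (using the dictionaries from the learning sketch):
# print(classify("I saw the game", prior, cond_prob, vocabulary))
```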
Naïve Bayes for text
• The correct model would have a probability for each word observed and one for each word not observed.
  - Naïve Bayes for text assumes that there is no information in words that are not observed: since most words are very rare, their probability of not being seen is close to 1.
Naive Bayes is not so dumb
• A good baseline for text classification
• Optimal if the independence assumptions hold
• Very fast:
  - Learn with one pass over the data
  - Testing is linear in the number of attributes and of documents
  - Low storage requirements
Technical Detail: Underflow
• Multiplying lots of probabilities, which are between 0 and 1 by definition, can result in floating-point underflow.
• Since log(xy) = log(x) + log(y), it is better to perform all computations by summing logs of probabilities rather than multiplying probabilities.
• The class with the highest final un-normalized log probability score is still the most probable.
  c_NB = argmax_{cj ∈ C} [ log P(cj) + Σ_{i ∈ positions} log P(xi | cj) ]
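A log-space version of the earlier classification sketch, again assuming `prior`, `cond_prob`, and `vocabulary` from the learning sketch; summing logs avoids the underflow that multiplying many small probabilities causes:

```python
import math

def classify_log(text, prior, cond_prob, vocabulary):
    """Return argmax_c [ log P(c) + sum_i log P(x_i | c) ]."""
    best_class, best_score = None, float("-inf")
    for c in prior:
        score = math.log(prior[c])
        for word in text.split():
            if word in vocabulary:
                score += math.log(cond_prob[c][word])   # sum of log-probabilities
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```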
More Facts About Bayes Classifiers
• Bayes Classifiers can be built with real-valued inputs*
  - Or many other distributions
• Bayes Classifiers don't try to be maximally discriminative
  - They merely try to honestly model what's going on*
• Zero probabilities give stupid results
• Naïve Bayes is wonderfully cheap
  - And handles 1,000,000 features cheerfully!
*See future lectures and homework
Naïve Bayes MLE
word     topic    count
a        sports   0
ball     sports   1
carrot   sports   0
game     sports   2
I        sports   2
saw      sports   2
the      sports   3
Assume 5 sports documents. Counts are the number of documents on the sports topic containing each word.
P(a | sports) = 0/5
P(ball | sports) = 1/5
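A one-liner reproducing the MLE probabilities from this table:

```python
# Counts from the slide: number of the 5 sports documents containing each word.
n_sports_docs = 5
counts = {"a": 0, "ball": 1, "carrot": 0, "game": 2, "I": 2, "saw": 2, "the": 3}

mle = {word: n / n_sports_docs for word, n in counts.items()}   # P(word | sports), MLE
print(mle["a"])     # 0.0 -> 0/5
print(mle["ball"])  # 0.2 -> 1/5
```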
Naïve Bayes prior (noninformative)
word     topic    count
a        sports   0.5
ball     sports   0.5
carrot   sports   0.5
game     sports   0.5
I        sports   0.5
saw      sports   0.5
the      sports   0.5
Assume 5 sports documents. These are pseudo-counts to be added to the observed counts.
Adding a count of 0.5 (beta(0.5, 0.5)) is a Jeffreys prior. A count of 1 (beta(1, 1)) is Laplace smoothing.
We used 0.5 here; before in the notes it was 1; either is fine.
Naïve Bayes posterior (MAP)
word     topic    count
a        sports   0.5
ball     sports   1.5
carrot   sports   0.5
game     sports   2.5
I        sports   2.5
saw      sports   2.5
the      sports   3.5
Assume 5 sports documents.
P(word | topic) = (N(word, topic) + 0.5) / (N(topic) + 0.5·k), where k = 7 is the vocabulary size.
Pseudo-count of docs on topic = sports is 5 + 0.5·7 = 8.5
P(a | sports) = 0.5/8.5        (posterior)
P(ball | sports) = 1.5/8.5
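The same MAP estimate with the 0.5 pseudo-count (Jeffreys prior), computed directly from the slide's counts:

```python
n_sports_docs = 5
counts = {"a": 0, "ball": 1, "carrot": 0, "game": 2, "I": 2, "saw": 2, "the": 3}
pseudo = 0.5
denom = n_sports_docs + pseudo * len(counts)     # 5 + 0.5*7 = 8.5

posterior = {w: (n + pseudo) / denom for w, n in counts.items()}
print(posterior["a"])     # 0.5 / 8.5
print(posterior["ball"])  # 1.5 / 8.5
```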
Naïve Bayes prior overall
word     sports count    politics count    p(word)
a        0               2                 2/11
ball     1               0                 1/11
carrot   0               0                 0/11
game     2               1                 3/11
I        2               5                 7/11
saw      2               1                 3/11
the      3               5                 8/11
Assume 5 sports docs and 6 politics docs: 11 total docs.
Naïve Bayes posterior (MAP)
P(word | topic) = (N(word, topic) + m·P(word)) / (N(topic) + m)
Here we arbitrarily pick m = 4 as the strength of our prior.
P(a | sports) = (0 + 4·(2/11)) / (5 + 4) = 0.08
P(ball | sports) = (1 + 4·(1/11)) / (5 + 4) = 0.15
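The same m-estimate, computed from the overall word prior table and the sports counts:

```python
# Class-independent word probabilities p(word) from the 11-document table, m = 4.
sports_counts = {"a": 0, "ball": 1, "carrot": 0, "game": 2, "I": 2, "saw": 2, "the": 3}
overall_p = {"a": 2/11, "ball": 1/11, "carrot": 0/11, "game": 3/11,
             "I": 7/11, "saw": 3/11, "the": 8/11}
n_sports_docs, m = 5, 4

posterior = {w: (sports_counts[w] + m * overall_p[w]) / (n_sports_docs + m)
             for w in sports_counts}
print(round(posterior["a"], 2))     # 0.08
print(round(posterior["ball"], 2))  # 0.15
```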
What you should know
• Applications of document classification
  - Spam detection, topic prediction, email routing, author ID, sentiment analysis
• Naïve Bayes
  - As MAP estimator (uses prior for smoothing)
    - Contrast with MLE
  - For document classification
    - Use bag of words
    - Could use richer feature set