Naïve Bayes for Text Classification


1 Naïve Bayes for Text Classification. Adapted by Lyle Ungar from slides by Mitch Marcus, which were adapted from slides by Massimo Poesio, which were adapted from slides by Chris Manning.

2 Example: Is this spam?
From: ""
Subject: real estate is the only way... gem oalvgkay
Anyone can buy real estate with no money down Stop paying rent TODAY! There is no need to spend hundreds or even thousands for similar courses I am 22 years old and I have already purchased 6 properties using the methods outlined in this truly INCREDIBLE ebook. Change your life NOW!
=================================================
Click Below to order:
=================================================
How do you know?

3 Classification
- Given: a vector x ∈ X describing an instance
  (Issue: how do we represent text documents as vectors?)
  and a fixed set of categories C = {c_1, c_2, ..., c_k}
- Determine: an optimal classifier γ(x): X → C

4 A Graphical View of Text Classification
[figure: documents plotted as points in feature space, grouped into labeled regions: Arch., Graphics, NLP, AI, Theory]

5 Examples of text categorization
- Spam: spam / not spam
- Topics: finance / sports / asia
- Author: Shakespeare / Marlowe / Ben Jonson; The Federalist Papers' author; male/female; native language: English/Chinese, ...
- Opinion: like / hate / neutral
- Emotion: angry / sad / happy / disgusted / ...

6 Conditional models
  p(y | X = x; w) ∝ exp(-(y - x·w)² / 2σ²)       (linear regression)
  p(y = 1 | X = x; w) = 1 / (1 + exp(-x·w))       (logistic regression)
- Or derive the conditional from a full model: p(y | x) = p(x, y) / p(x), making some assumptions about the distribution of (x, y).
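Below is a minimal sketch (not from the slides) of these two conditional models; the function and variable names are ours.

```python
import numpy as np

# Sketch of the two conditional models above; names and the test vectors are illustrative.

def p_y_given_x_linear(y, x, w, sigma=1.0):
    # Gaussian conditional behind linear regression: p(y|x;w) ~ exp(-(y - x.w)^2 / (2 sigma^2))
    return np.exp(-(y - x @ w) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

def p_y1_given_x_logistic(x, w):
    # Bernoulli conditional behind logistic regression: p(y=1|x;w) = 1 / (1 + exp(-x.w))
    return 1.0 / (1.0 + np.exp(-(x @ w)))

x = np.array([1.0, 2.0])
w = np.array([0.5, -0.3])
print(p_y_given_x_linear(1.0, x, w), p_y1_given_x_logistic(x, w))
```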

7 Bayesian Methods
- Use Bayes' theorem to build a generative model that approximates how the data are produced.
- Use the prior probability of each category.
- Produce a posterior probability distribution over the possible categories given a description of an item.

8 Bayes' Rule (once more)
  P(C | D) = P(D | C) P(C) / P(D)

9 Maximum a posteriori (MAP)
  c_MAP ≡ argmax_{c ∈ C} P(c | D)
        = argmax_{c ∈ C} P(D | c) P(c) / P(D)
        = argmax_{c ∈ C} P(D | c) P(c),   since P(D) is constant

10 Maximum likelihood
If all hypotheses are a priori equally likely, we only need to consider the P(D | c) term:
  c_ML ≡ argmax_{c ∈ C} P(D | c)      (Maximum Likelihood Estimate, MLE)

11 Naive Bayes Classifiers
Task: classify a new instance x, described by a tuple of attribute values x = (x_1, ..., x_p), into one of the classes c ∈ C.
  c_MAP = argmax_{c ∈ C} p(c | x_1, ..., x_p)
        = argmax_{c ∈ C} p(x_1, ..., x_p | c) p(c) / p(x_1, ..., x_p)
        = argmax_{c ∈ C} p(x_1, ..., x_p | c) p(c)

12 Naïve Bayes Classifier: Assumption
- P(c): estimate from the frequencies in the training data.
- P(x_1, x_2, ..., x_p | c): O(|X|^p |C|) parameters; these could only be estimated if a very, very large number of training examples was available.
Naïve Bayes assumes Conditional Independence: the probability of observing the conjunction of attributes equals the product of the individual probabilities P(x_i | c).

13 The Naïve Bayes Classifier
[figure: class node Flu with feature nodes X_1..X_5 = runny-nose, sinus, cough, fever, muscle-ache]
- Conditional Independence Assumption: features are independent of each other given the class:
  P(X_1, ..., X_5 | C) = P(X_1 | C) P(X_2 | C) ... P(X_5 | C)
- This model is appropriate for binary variables; similar models work more generally (Belief Networks).

14 Learning the Model
[figure: class node C with feature nodes X_1..X_6]
First attempt: maximum likelihood estimates; simply use the frequencies in the data:
  P̂(c_j) = N(C = c_j) / N
  P̂(x_i | c_j) = N(X_i = x_i, C = c_j) / N(C = c_j)
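A small sketch of these maximum-likelihood counts on a flu-style toy dataset (the feature names and data are made up for illustration, not from the lecture):

```python
from collections import Counter, defaultdict

# Toy (features, class) pairs in the spirit of the flu example; values are illustrative.
data = [
    ({"runny_nose": True,  "cough": True,  "fever": True},  "flu"),
    ({"runny_nose": True,  "cough": False, "fever": False}, "no_flu"),
    ({"runny_nose": False, "cough": True,  "fever": True},  "flu"),
]

class_counts = Counter(c for _, c in data)            # N(C = c)
feat_counts = defaultdict(Counter)                    # N(X_i = x_i, C = c)
for features, c in data:
    for name, value in features.items():
        feat_counts[c][(name, value)] += 1

p_class = {c: n / len(data) for c, n in class_counts.items()}            # P^(c_j)
p_feat = {c: {fv: n / class_counts[c] for fv, n in cnt.items()}          # P^(x_i | c_j)
          for c, cnt in feat_counts.items()}
print(p_class["flu"], p_feat["flu"][("fever", True)])
```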

15 Problem with Max Likelihood
[figure: Flu with feature nodes runny-nose, sinus, cough, fever, muscle-ache]
  P(X_1, ..., X_5 | C) = P(X_1 | C) P(X_2 | C) ... P(X_5 | C)
- What if we have seen no training cases where a patient had the flu and muscle aches?
  P̂(X_5 = t | C = flu) = N(X_5 = t, C = flu) / N(C = flu) = 0
- Zero probabilities cannot be conditioned away, no matter the other evidence: if any factor in argmax_c P̂(c) ∏_i P̂(x_i | c) is zero, the whole product for that class is zero.

16 Smoothing to Avoid Overfitting
  P̂(x_i | c_j) = (N(X_i = x_i, C = c_j) + 1) / (N(C = c_j) + v)
- Somewhat more subtle version:
  P̂(x_{i,k} | c_j) = (N(X_i = x_{i,k}, C = c_j) + m p_{i,k}) / (N(C = c_j) + m)
where
  N(C = c_j) = # of docs in class c_j
  N(X_i = x_{i,k}, C = c_j) = # of docs in class c_j with word position X_i having value x_{i,k}
  v = the vocabulary size (the # of values of X_i); if X_i is just true or false, then k is 2
  p_{i,k} = how often feature X_i takes on each of its k possible values, marginalized over all classes (the overall fraction of the data where X_i = x_{i,k})
  m = the extent of smoothing
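The two smoothed estimators above, written as small helper functions (a sketch; the argument names are ours):

```python
def laplace_estimate(n_xi_c, n_c, v):
    # P^(x_i | c) = (N(X_i = x_i, C = c) + 1) / (N(C = c) + v), v = # of values of X_i
    return (n_xi_c + 1) / (n_c + v)

def m_estimate(n_xi_c, n_c, p_ik, m):
    # P^(x_i | c) = (N(X_i = x_i, C = c) + m * p_ik) / (N(C = c) + m)
    # p_ik = overall fraction of the data where X_i takes this value; m = extent of smoothing
    return (n_xi_c + m * p_ik) / (n_c + m)

# Laplace smoothing is the m-estimate with m = v and a uniform prior p_ik = 1/v:
assert abs(laplace_estimate(3, 10, 2) - m_estimate(3, 10, 1 / 2, 2)) < 1e-12
```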

17 Using Naive Bayes Classifiers to Classify Text: Bag of Words
- General model: features are positions in the text (X_1 is the first word, X_2 is the second word, ...); values are words in the vocabulary:
  c_NB = argmax_{c ∈ C} P(c) ∏_i P(x_i | c)
       = argmax_{c ∈ C} P(c) P(x_1 = "our" | c) ... P(x_n = "text" | c)
- Too many possibilities, so assume that classification is independent of the positions of the words. The result is a bag of words model: just use the counts of words, or even a single variable per word: is it in the document or not?
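A sketch of the bag-of-words representation, keeping only word counts (names are illustrative):

```python
from collections import Counter

def bag_of_words(text):
    # Lowercase, split on whitespace, and keep only word counts: positions are thrown away.
    return Counter(text.lower().split())

print(bag_of_words("Buy real estate with no money down buy NOW"))
# e.g. Counter({'buy': 2, 'real': 1, 'estate': 1, ...})
```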

18 Smoothing to Avoid Overfitting: Bag of Words
  P̂(x_i | c_j) = (N(X_i = true, C = c_j) + 1) / (N(C = c_j) + v)
- Somewhat more subtle version:
  P̂(x_i | c_j) = (N(X_i = true, C = c_j) + m p_i) / (N(C = c_j) + m)
where now
  N(C = c_j) = # of docs in class c_j
  N(X_i = true, C = c_j) = # of docs in class c_j containing word x_i
  v = the vocabulary size (the # of values of X_i)
  p_i = the probability that word i is present, ignoring class labels (the overall fraction of docs containing x_i)
  m = the extent of smoothing

19 Naïve Bayes: Learning
- From the training corpus, determine Vocabulary
- Estimate P(c_j) and P(x_k | c_j):
  For each c_j in C do
    docs_j ← documents labeled with class c_j
    P(c_j) ← |docs_j| / (total # of documents)
    For each word x_k in Vocabulary
      n_k ← number of occurrences of x_k in all docs_j
      P(x_k | c_j) ← (n_k + 1) / (n + |Vocabulary|), where n is the total number of word occurrences in docs_j   (Laplace smoothing)
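A compact sketch of this training procedure with Laplace smoothing, counting word occurrences within each class as in the algorithm above (variable names are ours; the toy documents are made up):

```python
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (list_of_words, class_label). Returns priors, word likelihoods, vocabulary."""
    vocab = {w for words, _ in docs for w in words}
    class_docs = Counter(c for _, c in docs)
    prior = {c: n / len(docs) for c, n in class_docs.items()}            # P(c_j)
    word_counts = defaultdict(Counter)                                    # n_k per class
    total_words = Counter()                                               # n per class
    for words, c in docs:
        word_counts[c].update(words)
        total_words[c] += len(words)
    likelihood = {c: {w: (word_counts[c][w] + 1) / (total_words[c] + len(vocab))   # Laplace
                      for w in vocab}
                  for c in class_docs}
    return prior, likelihood, vocab

docs = [("buy real estate now".split(), "spam"),
        ("meeting agenda for class".split(), "ham")]
prior, likelihood, vocab = train_nb(docs)
print(prior["spam"], likelihood["spam"]["buy"])
```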

20 Naïve Bayes: Cassifying u For a words x i in urrent doument u Return NB, where NB = argmax C i doumant P( P(x i What is the impiit assumption hidden in this?

21 Naïve Bayes for text
- The correct model would have a probability for each word observed and one for each word not observed. Naïve Bayes for text assumes that there is no information in the words that are not observed; since most words are very rare, their probability of not being seen is close to 1.
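A toy illustration of this point (probabilities are made up): the "correct" Bernoulli-style likelihood also charges log(1 - p) for every vocabulary word that is absent, but for rare words those terms are close to zero:

```python
import math

p_present = {"ball": 0.3, "game": 0.4, "carrot": 0.001}   # made-up P(word appears | sports)
doc_words = {"ball", "game"}

full = sum(math.log(p_present[w]) if w in doc_words else math.log(1 - p_present[w])
           for w in p_present)                               # scores present AND absent words
shortcut = sum(math.log(p_present[w]) for w in doc_words)    # ignores absent words
print(round(full, 3), round(shortcut, 3))
# The two differ only by log(1 - p) terms, which are near 0 for rare words.
```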

22 Naive Bayes is not so dumb
- A good baseline for text classification
- Optimal if the independence assumptions hold
- Very fast: learns with one pass over the data; testing is linear in the number of attributes and of documents
- Low storage requirements

23 Technical Detail: Underflow
- Multiplying lots of probabilities, which are between 0 and 1 by definition, can result in floating-point underflow.
- Since log(xy) = log(x) + log(y), it is better to perform all computations by summing logs of probabilities rather than multiplying probabilities.
- The class with the highest final un-normalized log probability score is still the most probable:
  c_NB = argmax_{c ∈ C} [ log P(c) + Σ_{i ∈ positions} log P(x_i | c) ]
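The same scoring done in log space, as the slide recommends (a sketch with made-up probabilities):

```python
import math

def log_score(words, prior_c, likelihood_c):
    # log P(c) + sum_i log P(x_i | c): summing logs avoids floating-point underflow.
    score = math.log(prior_c)
    for w in words:
        if w in likelihood_c:
            score += math.log(likelihood_c[w])
    return score

prior = {"spam": 0.5, "ham": 0.5}
likelihood = {"spam": {"buy": 0.2, "now": 0.2}, "ham": {"buy": 0.01, "now": 0.05}}
scores = {c: log_score("buy now".split(), prior[c], likelihood[c]) for c in prior}
print(max(scores, key=scores.get), scores)   # the argmax is unchanged by taking logs
```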

24 More Facts About Bayes Classifiers
- Bayes classifiers can be built with real-valued inputs* (or many other distributions)
- Bayes classifiers don't try to be maximally discriminative; they merely try to honestly model what's going on*
- Zero probabilities give stupid results
- Naïve Bayes is wonderfully cheap, and handles 1,000,000 features cheerfully!
*See future lectures and homework

25 Naïve Bayes MLE
  word     topic    count
  a        sports   0
  ball     sports   1
  carrot   sports   0
  game     sports   2
  I        sports   2
  saw      sports   2
  the      sports   3
P(a | sports) = 0/5
P(ball | sports) = 1/5
Assume 5 sports documents; counts are the number of documents on the sports topic containing each word.

26 Naïve Bayes prior (noninformative)
  word     topic    count
  a        sports   0.5
  ball     sports   0.5
  carrot   sports   0.5
  game     sports   0.5
  I        sports   0.5
  saw      sports   0.5
  the      sports   0.5
Assume 5 sports documents. These are pseudo-counts to be added to the observed counts. Adding a count of 0.5 (a beta(0.5, 0.5) prior) is a Jeffreys prior; a count of 1 (beta(1, 1)) is Laplace smoothing. We use 0.5 here; earlier in the notes it was 1; either is fine.

27 Naïve Bayes posterior (MAP)
  word     topic    count
  a        sports   0.5
  ball     sports   1.5
  carrot   sports   0.5
  game     sports   2.5
  I        sports   2.5
  saw      sports   2.5
  the      sports   3.5
Assume 5 sports documents.
  P(word | topic) = (N(word, topic) + 0.5) / (N(topic) + 0.5 k),  where k is the number of words
P(a | sports) = 0.5/8.5   (posterior)
P(ball | sports) = 1.5/8.5
The pseudo-count of docs on topic = sports is 5 + 0.5·7 = 8.5.
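A quick check of the slide's numbers, reusing the counts from slide 25 (a sketch):

```python
counts = {"a": 0, "ball": 1, "carrot": 0, "game": 2, "I": 2, "saw": 2, "the": 3}
pseudo = 0.5
denom = 5 + pseudo * len(counts)          # 5 sports docs + 0.5 per word = 8.5
posterior = {w: (n + pseudo) / denom for w, n in counts.items()}
print(round(posterior["a"], 3), round(posterior["ball"], 3))   # 0.5/8.5 and 1.5/8.5
```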

28 But words have different base rates
  word     sports count    politics count    p(word)
  a        0               2                 2/11
  ball     1               0                 1/11
  carrot   0               0                 0/11
  game     2               1                 3/11
  I        2               5                 7/11
  saw      2               1                 3/11
  the      3               5                 8/11
Assume 5 sports docs and 6 politics docs, 11 total docs.

29 Naïve Bayes posterior (MAP)
  P(word | topic) = (N(word, topic) + m P(word)) / (N(topic) + m)
Arbitrarily pick m = 4 as the strength of our prior.
P(a | sports) = (0 + 4·(2/11)) / (5 + 4) = 0.08
P(ball | sports) = (1 + 4·(1/11)) / (5 + 4) = 0.15
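Recomputing the slide's two numbers from the counts and base rates on slide 28 (a sketch):

```python
sports_counts = {"a": 0, "ball": 1}        # N(word, sports) from slide 28
p_word = {"a": 2 / 11, "ball": 1 / 11}     # overall base rates p(word) from slide 28
m, n_sports = 4, 5                         # prior strength and number of sports docs

for w in sports_counts:
    p = (sports_counts[w] + m * p_word[w]) / (n_sports + m)
    print(w, round(p, 2))                  # a: 0.08, ball: 0.15
```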

30 What you should know
- Applications of document classification: spam detection, topic prediction, email routing, author ID, sentiment analysis
- Naïve Bayes
  - As a MAP estimator (uses a prior for smoothing); contrast with the MLE
  - For document classification: use a bag of words; could use a richer feature set
