Naïve Bayes, MIT 15.097 Course Notes, Cynthia Rudin

Thanks to Şeyda Ertekin. Credit: Ng, Mitchell.

The Naïve Bayes algorithm comes from a generative model. There is an important distinction between generative and discriminative models. In all cases, we want to predict the label y, given x, that is, we want P(Y = y | X = x). Throughout the paper, we'll remember that the probability distribution for measure P is over an unknown distribution over X × Y.

Naïve Bayes

Generative Model: Estimate P(X = x | Y = y) and P(Y = y) and use Bayes' rule to get P(Y = y | X = x).

Discriminative Model: Directly estimate P(Y = y | X = x).

Most of the top 10 classification algorithms are discriminative (K-NN, CART, C4.5, SVM, AdaBoost). For Naïve Bayes, we make an assumption that if we know the class label y, then we know the mechanism (the random process) of how x is generated. Naïve Bayes is great for very high dimensional problems because it makes a very strong assumption. Very high dimensional problems suffer from the curse of dimensionality: it is difficult to understand what is going on in a high dimensional space without tons of data.

Example: Constructing a spam filter. Each example is an email, and each dimension j of the vector x represents the presence of a word.

        ( 1 )   a
        ( 0 )   aardvark
        ( 0 )   aardwolf
x  =    ( ... )
        ( 1 )   buy
        ( ... )
        ( 0 )   zyxt

This x represents an email containing the words "a" and "buy", but not "aardvark" or "zyxt". The size of the vocabulary could be 50,000 words, so we are in a 50,000 dimensional space.

Naïve Bayes makes the assumption that the x^(j)'s are conditionally independent given y. Say y = 1 means spam email, word 2,087 is "buy", and word 39,831 is "price". Naïve Bayes assumes that if y = 1 (it is spam), then knowing x^(2,087) = 1 (the email contains "buy") won't affect your belief about x^(39,831) (the email contains "price"). Note: this does not mean x^(2,087) and x^(39,831) are independent, that is,

P(X^(2,087) = x^(2,087)) = P(X^(2,087) = x^(2,087) | X^(39,831) = x^(39,831)).

It only means they are conditionally independent given y.

Using the definition of conditional probability recursively,

P(X^(1) = x^(1), ..., X^(50,000) = x^(50,000) | Y = y)
    = P(X^(1) = x^(1) | Y = y) P(X^(2) = x^(2) | Y = y, X^(1) = x^(1))
      P(X^(3) = x^(3) | Y = y, X^(1) = x^(1), X^(2) = x^(2)) ...
      P(X^(50,000) = x^(50,000) | Y = y, X^(1) = x^(1), ..., X^(49,999) = x^(49,999)).

The independence assumption gives:

P(X^(1) = x^(1), ..., X^(n) = x^(n) | Y = y)
    = P(X^(1) = x^(1) | Y = y) P(X^(2) = x^(2) | Y = y) ... P(X^(n) = x^(n) | Y = y)
    = ∏_j P(X^(j) = x^(j) | Y = y).        (1)
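To make the representation and equation (1) concrete, here is a minimal sketch, not from the notes: the small vocabulary, the tokenizer, and the per-word probabilities are hypothetical stand-ins for the real 50,000-word setup.

    # Sketch: build the binary word-presence vector and evaluate the product in equation (1).
    # The 7-word vocabulary and the probabilities below are made up for illustration.
    vocabulary = ["a", "aardvark", "aardwolf", "buy", "price", "tomato", "zyxt"]

    def email_to_vector(email_text):
        """x^(j) = 1 if vocabulary word j appears in the email, else 0."""
        words_present = set(email_text.lower().split())
        return [1 if word in words_present else 0 for word in vocabulary]

    def prob_x_given_y(x, p_word_given_y):
        """Equation (1): prod_j P(X^(j) = x^(j) | Y = y), where
        P(X^(j) = 0 | Y = y) = 1 - P(X^(j) = 1 | Y = y)."""
        prob = 1.0
        for xj, pj in zip(x, p_word_given_y):
            prob *= pj if xj == 1 else 1.0 - pj
        return prob

    # Hypothetical estimates of P(X^(j) = 1 | Y = spam), one per vocabulary word.
    p_word_given_spam = [0.60, 0.01, 0.01, 0.40, 0.35, 0.02, 0.01]

    x = email_to_vector("buy at a great price")
    print(x)                                     # [1, 0, 0, 1, 1, 0, 0]
    print(prob_x_given_y(x, p_word_given_spam))  # about 0.080 with these made-up numbers

As a practical aside (not discussed in the notes), with 50,000 factors this product would underflow floating point, so an implementation would typically sum logarithms of the conditionals instead.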

Bayes' rule says

P(Y = y | X^(1) = x^(1), ..., X^(n) = x^(n)) = P(Y = y) P(X^(1) = x^(1), ..., X^(n) = x^(n) | Y = y) / P(X^(1) = x^(1), ..., X^(n) = x^(n)),

so plugging in (1), we have

P(Y = y | X^(1) = x^(1), ..., X^(n) = x^(n)) = P(Y = y) ∏_j P(X^(j) = x^(j) | Y = y) / P(X^(1) = x^(1), ..., X^(n) = x^(n)).

For a new test instance, called x_test, we want to choose the most probable value of y, that is,

y_NB ∈ argmax_y P(Y = y | X^(1) = x_test^(1), ..., X^(n) = x_test^(n))
     = argmax_y P(Y = y) ∏_j P(X^(j) = x_test^(j) | Y = y) / P(X^(1) = x_test^(1), ..., X^(n) = x_test^(n))
     = argmax_y P(Y = y) ∏_j P(X^(j) = x_test^(j) | Y = y),

where the denominator can be dropped because it does not depend on y.

So now, we just need P(Y = y) for each possible y, and P(X^(j) = x_test^(j) | Y = y) for each j and y. Of course we can't compute those. Let's use the empirical probability estimates:

P̂(Y = 1) = (1/m) Σ_i 1[y_i = 1] = fraction of data where the label is 1,

P̂(X^(j) = x_test^(j) | Y = 1) = Σ_i 1[x_i^(j) = x_test^(j), y_i = 1] / Σ_i 1[y_i = 1] = Conf(Y = 1 → X^(j) = x_test^(j)).

That is the simplest version of Naïve Bayes:

y_NB ∈ argmax_y P̂(Y = y) ∏_j P̂(X^(j) = x_test^(j) | Y = y).

There could potentially be a problem that most of the conditional probabilities are 0, because the dimensionality of the data is very high compared to the amount of data. This causes a problem because if even one P̂(X^(j) = x_test^(j) | Y = y) is zero, then the whole right side is zero. In other words, if no training examples from class "spam" have the word "tomato", we'd never classify a test example containing the word "tomato" as spam!
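Here is a minimal sketch of this simplest version, assuming binary labels y_i in {0, 1} and binary word-presence features; the tiny training arrays are hypothetical. It estimates P̂(Y = y) and each P̂(X^(j) = 1 | Y = y) by counting, then classifies by the argmax.

    def train_naive_bayes(X, y):
        """Empirical (unsmoothed) estimates.
        Returns (prior, cond) with prior[c] = P̂(Y = c) and cond[c][j] = P̂(X^(j) = 1 | Y = c)."""
        m, n = len(y), len(X[0])
        prior = {c: sum(1 for yi in y if yi == c) / m for c in (0, 1)}
        cond = {}
        for c in (0, 1):
            rows = [xi for xi, yi in zip(X, y) if yi == c]
            cond[c] = [sum(r[j] for r in rows) / len(rows) for j in range(n)]
        return prior, cond

    def predict(x_test, prior, cond):
        """y_NB in argmax_y P̂(Y = y) * prod_j P̂(X^(j) = x_test^(j) | Y = y)."""
        scores = {}
        for c in (0, 1):
            score = prior[c]
            for j, xj in enumerate(x_test):
                score *= cond[c][j] if xj == 1 else 1.0 - cond[c][j]
            scores[c] = score
        return max(scores, key=scores.get)

    # Toy data: 4 emails over a 3-word vocabulary; y = 1 means spam.
    X = [[1, 1, 0], [1, 0, 1], [0, 1, 0], [0, 0, 1]]
    y = [1, 1, 0, 0]
    prior, cond = train_naive_bayes(X, y)
    print(predict([1, 1, 0], prior, cond))  # 1 (spam) on this toy data

Note that in this toy data the first vocabulary word never appears in a non-spam email, so its estimated conditional probability given Y = 0 is zero and the non-spam score of any test email containing it collapses to zero; this is exactly the zero-probability problem described above with the word "tomato".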

To avoid this, we (sort of) set the probabilities to a small positive value when there are no data. In particular, we use a Bayesian shrinkage estimate of P(X^(j) = x_test^(j) | Y = 1) where we add some hallucinated examples. There are K hallucinated examples spread evenly over the possible values of X^(j), where K is the number of distinct values of X^(j). The probabilities are pulled toward 1/K. So, now we replace:

P̂(X^(j) = x_test^(j) | Y = 1) = ( Σ_i 1[x_i^(j) = x_test^(j), y_i = 1] + 1 ) / ( Σ_i 1[y_i = 1] + K )

P̂(Y = 1) = ( Σ_i 1[y_i = 1] + 1 ) / ( m + K )

This is called Laplace smoothing. The smoothing for P(Y = 1) is probably unnecessary and has little to no effect.

Naïve Bayes is not necessarily the best algorithm, but it is a good first thing to try, and it performs surprisingly well given its simplicity! There are extensions to continuous data and other variations too.
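A sketch of how Laplace smoothing changes the counting step in the earlier sketch, under the same hypothetical setup (binary presence features, so K = 2 distinct values per X^(j)):

    def train_naive_bayes_smoothed(X, y, K=2):
        """Laplace-smoothed estimates: add one hallucinated example per distinct feature value.
        K is the number of distinct values of X^(j) (K = 2 for binary presence features)."""
        m, n = len(y), len(X[0])
        prior = {c: (sum(1 for yi in y if yi == c) + 1) / (m + K) for c in (0, 1)}
        cond = {}
        for c in (0, 1):
            rows = [xi for xi, yi in zip(X, y) if yi == c]
            cond[c] = [(sum(r[j] for r in rows) + 1) / (len(rows) + K) for j in range(n)]
        return prior, cond

    # Same toy data as before: the word never seen in non-spam emails now gets
    # probability (0 + 1) / (2 + 2) = 0.25 instead of 0, so no class score is forced to zero.
    X = [[1, 1, 0], [1, 0, 1], [0, 1, 0], [0, 0, 1]]
    y = [1, 1, 0, 0]
    prior, cond = train_naive_bayes_smoothed(X, y)
    print(cond[0][0])  # 0.25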

MIT OpenCourseWare
http://ocw.mit.edu

15.097 Prediction: Machine Learning and Statistics
Spring 2012

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.