Learning from Data 1 : Naive Bayes
David Barber
dbarber@anc.ed.ac.uk
course page : http://anc.ed.ac.uk/~dbarber/lfd1/lfd1.html
© David Barber 2001, 2002
1 Why Naive Bayes?

Naive Bayes is one of the simplest density estimation methods, and from it we can form one of the standard classification methods in machine learning. Its fame is partly due to the following properties:

- Very easy to program and intuitive
- Fast to train and to use as a classifier
- Very easy to deal with missing attributes
- Very popular in certain fields such as computational linguistics/NLP

However, despite the simplicity of Naive Bayes, there are some pitfalls that need to be avoided, as we will describe. The pitfalls usually made are due to a poor understanding of the central assumption behind Naive Bayes, namely conditional independence.

2 Understanding Conditional Independence

Before we explain how to use conditional independence to form a classifier, we concentrate on explaining the basic assumption of conditional independence itself. Consider a general probability distribution of two variables, p(x1, x2). Using Bayes' rule, without loss of generality, we can write

p(x1, x2) = p(x1|x2)p(x2)    (2.1)

Similarly, if we had another, class variable c, we can write, using Bayes' rule:

p(x1, x2|c) = p(x1|x2, c)p(x2|c)    (2.2)

In the above expression, we have not made any assumptions at all. Consider now the term p(x1|x2, c). If knowledge of c is sufficient to determine how x1 will be distributed, we don't need to know the state of x2. That is, we may write p(x1|x2, c) = p(x1|c). For example, we may write the general statement:

p(cloudy, windy|storm) = p(cloudy|windy, storm)p(windy|storm)    (2.3)

where, for example, each of the variables can take the values yes or no. If we now further make the assumption p(cloudy|windy, storm) = p(cloudy|storm), the distribution becomes

p(cloudy, windy|storm) = p(cloudy|storm)p(windy|storm)    (2.4)

We can generalise the situation of two variables to a conditional independence assumption for a set of variables x1, ..., xN, conditional on another variable c:

p(x|c) = ∏_{i=1}^{N} p(x_i|c)    (2.5)

A further example may help to clarify the assumptions behind conditional independence.
EasySell.com considers that its customers conveniently fall into two groups: the young and the old. Based on only this information, they build general customer profiles for product preferences. EasySell.com assumes that, given the knowledge that a customer is either young or old, this is sufficient to determine whether or not a customer will like a product, independent of their likes or dislikes for any other products. Thus, given that a customer is young, she has a 95% chance to like Radio1, a 5% chance to like Radio2, a 2% chance to like Radio3 and a 20% chance to like Radio4. Similarly, they model that an old customer has a 3% chance to like Radio1, an 82% chance to like Radio2, a 34% chance to like Radio3 and a 92% chance to like Radio4. Mathematically, we would write

p(R1, R2, R3, R4|age) = p(R1|age)p(R2|age)p(R3|age)p(R4|age)    (2.6)
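The factorisation in equation (2.6) is easy to evaluate numerically. Below is a minimal Python sketch using the stated probabilities for a young customer; the dictionary names and the particular combination of likes and dislikes are illustrative assumptions, not from the text:

```python
# Per-product like probabilities for a young customer (from the text).
p_like_given_young = {"R1": 0.95, "R2": 0.05, "R3": 0.02, "R4": 0.20}

def joint_given_young(likes):
    """p(R1,...,R4 | young) for a dict mapping product -> like (True/False).

    Under conditional independence given age, the joint preference
    probability is simply the product of the per-product terms.
    """
    prob = 1.0
    for product, like in likes.items():
        p = p_like_given_young[product]
        prob *= p if like else (1.0 - p)
    return prob

# Probability that a young customer likes Radio1 and Radio4 but dislikes the rest:
print(joint_given_young({"R1": True, "R2": False, "R3": False, "R4": True}))
# approximately 0.1769 (= 0.95 * 0.95 * 0.98 * 0.20)
```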
where each of the variables R1, R2, R3, R4 can take the values either like or dislike, and the age variable can take the value either young or old. Thus the information about the age of the customer is so powerful that it determines the individual product preferences without needing to know anything else. Clearly, this is a rather strong assumption, but a popular one, and it sometimes leads to surprisingly good results.

In this chapter, we will take the conditioning variable to represent the class of the datapoint x. Coupled with a suitable choice for the conditional distribution p(x|c), we can then use Bayes' rule to form a classifier. We will consider two cases of different conditional distributions, one appropriate for discrete data and the other for continuous data. Furthermore, we will demonstrate how to learn any free parameters of these models.

3 Are they Scottish?

Consider the following vector of attributes:

(likes shortbread, likes lager, drinks whiskey, eats porridge, watched England play football)^T    (3.1)

A vector x = (1, 0, 1, 1, 0)^T would describe that a person likes shortbread, does not like lager, drinks whiskey, eats porridge, and has not watched England play football. Together with each vector x^µ, there is a class label describing the nationality of the person: Scottish or English. We wish to classify a new vector x* = (1, 0, 1, 1, 0)^T as either Scottish or English. We can use Bayes' rule to calculate the probability that x* is Scottish or English:

p(S|x*) = p(x*|S)p(S)/p(x*)    (3.2)
p(E|x*) = p(x*|E)p(E)/p(x*)    (3.3)

Since we must have p(S|x*) + p(E|x*) = 1, we could also write

p(S|x*) = p(x*|S)p(S) / (p(x*|S)p(S) + p(x*|E)p(E))    (3.4)

It is straightforward to show that the prior class probability p(S) is simply given by the fraction of people in the database who are Scottish, and similarly p(E) is given by the fraction who are English. What about p(x*|S)? This is where our density model for x comes in. In the previous chapter, we looked at using a Gaussian distribution.
Here we will make a different, very strong conditional independence assumption:

p(x|S) = p(x1|S)p(x2|S)...p(x5|S)    (3.5)

What this assumption means is that, knowing whether or not someone is Scottish, we do not need to know anything else to calculate the probability of their likes and dislikes. Matlab code to implement Naive Bayes on a small dataset is written below, where each column of the data matrices represents a vector of attributes of the form of equation (3.1).
% Naive Bayes using Bernoulli Distribution
xE=[0 1 1 1 0 0;    % english
    0 0 1 1 1 0;
    1 1 0 0 0 0;
    1 1 0 0 0 1;
    1 0 1 0 1 0];
xS=[1 1 1 1 1 1 1;  % scottish
    0 1 1 1 1 0 0;
    0 0 1 0 0 1 1;
    1 0 1 1 1 1 0;
    1 1 0 0 1 0 0];
pE = size(xE,2)/(size(xE,2) + size(xS,2)); pS = 1-pE; % ML class priors pE = p(c=E), pS = p(c=S)
mE = mean(xE')'; % ML estimates of p(x=1|c=E)
mS = mean(xS')'; % ML estimates of p(x=1|c=S)
x=[1 0 1 1 0]';  % test point
npE = pE*prod(mE.^x.*(1-mE).^(1-x)); % p(x,c=E)
npS = pS*prod(mS.^x.*(1-mS).^(1-x)); % p(x,c=S)
pxE = npE/(npE+npS) % probability that x is english

3.1 Further Issues

Based on the training data in the code above, we have the following:

p(x1 = 1|E) = 1/2, p(x2 = 1|E) = 1/2, p(x3 = 1|E) = 1/3, p(x4 = 1|E) = 1/2, p(x5 = 1|E) = 1/2,
p(x1 = 1|S) = 1, p(x2 = 1|S) = 4/7, p(x3 = 1|S) = 3/7, p(x4 = 1|S) = 5/7, p(x5 = 1|S) = 3/7

and the prior probabilities are p(S) = 7/13 and p(E) = 6/13. For x* = (1, 0, 1, 1, 0)^T, we get

p(S|x*) = (1 × 3/7 × 3/7 × 5/7 × 4/7 × 7/13) / (1 × 3/7 × 3/7 × 5/7 × 4/7 × 7/13 + 1/2 × 1/2 × 1/3 × 1/2 × 1/2 × 6/13)    (3.6)

which is 0.8076. Since this is greater than 0.5, we would classify this person as being Scottish.

Consider trying to classify the vector x* = (0, 1, 1, 1, 1)^T. In the training data, all Scottish people say they like shortbread. This means that p(x*|S) = 0, and hence that p(S|x*) = 0. This demonstrates a difficulty with sparse data: very extreme class probabilities can result. One way to ameliorate this situation is to smooth the probabilities in some way, for example by adding a certain small number M to the frequency counts of each class:

p(xi = 1|c) = (number of times xi = 1 for class c + M) / (number of times xi = 1 for class c + M + number of times xi = 0 for class c + M)    (3.7)

This ensures that there are no zero probabilities in the model.

3.2 Gaussians

Fitting continuous data is also straightforward using Naive Bayes.
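The hand calculation in equation (3.6) from section 3.1 can be checked numerically. The following short Python sketch (not part of the original Matlab listing) recomputes p(S|x*) from the ML estimates given there:

```python
# ML estimates read off from the training data in section 3.1.
theta_S = [1.0, 4/7, 3/7, 5/7, 3/7]   # p(x_i = 1 | Scottish)
theta_E = [1/2, 1/2, 1/3, 1/2, 1/2]   # p(x_i = 1 | English)
pS, pE = 7/13, 6/13                   # class priors
x = [1, 0, 1, 1, 0]                   # test point x*

def joint(theta, prior, x):
    """p(x, c) under the Bernoulli Naive Bayes model: prior times
    a product of theta_i (if x_i = 1) or 1 - theta_i (if x_i = 0)."""
    prob = prior
    for t, xi in zip(theta, x):
        prob *= t if xi == 1 else (1.0 - t)
    return prob

pS_given_x = joint(theta_S, pS, x) / (joint(theta_S, pS, x) + joint(theta_E, pE, x))
print(round(pS_given_x, 4))  # 0.8076, agreeing with equation (3.6)
```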
For example, if we were to model each attribute's distribution as a Gaussian, p(x_i|c) = N(µ_i, σ_i), this would be exactly equivalent to using the conditional Gaussian density estimator of the previous chapter with a covariance matrix in which all elements are zero except for those on the diagonal.
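As a sketch of what fitting one Gaussian per attribute and class looks like in practice, here is a minimal Python version; the two-class, two-attribute dataset is invented purely for illustration:

```python
import math

def fit_gaussian_nb(X, y):
    """Return per-class ([means], [variances], prior), one mean and
    variance per attribute: the diagonal-covariance model above."""
    params = {}
    for c in set(y):
        Xc = [x for x, label in zip(X, y) if label == c]
        n = len(Xc)
        cols = list(zip(*Xc))                      # one tuple per attribute
        means = [sum(col) / n for col in cols]
        variances = [sum((v - m) ** 2 for v in col) / n
                     for col, m in zip(cols, means)]
        params[c] = (means, variances, n / len(X))
    return params

def log_joint(params, c, x):
    """log p(x, c): log prior plus a sum of per-attribute Gaussian log densities."""
    means, variances, prior = params[c]
    lp = math.log(prior)
    for xi, m, v in zip(x, means, variances):
        lp += -0.5 * math.log(2 * math.pi * v) - (xi - m) ** 2 / (2 * v)
    return lp

X = [[1.0, 2.0], [1.2, 1.8], [3.0, 0.5], [3.2, 0.7]]  # made-up data
y = [0, 0, 1, 1]
params = fit_gaussian_nb(X, y)
pred = max(params, key=lambda c: log_joint(params, c, [1.1, 1.9]))
print(pred)  # class 0: the test point sits at the class-0 attribute means
```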
3.3 Text Classification

Bag of words: Naive Bayes has often been applied to classify documents into classes. We will outline here how this is done; refer to a computational linguistics course for the details of how exactly to do it. Consider a set of documents about politics, and a set about sport. We search through all documents to find the, say, 100 most commonly occurring words. Each document is then represented by a 100-dimensional vector representing the number of times that each of these words occurs in that document, the so-called "bag of words" representation (this is clearly a very crude assumption since it does not take into account the order of the words). We then fit a Naive Bayes model by fitting a distribution of the number of occurrences of each word for all the documents of, first, sport, and then politics. This completes the model.

The reason Naive Bayes may be able to classify documents reasonably well in this way is that the conditional independence assumption is not so silly: if we know people are talking about politics, this is perhaps almost sufficient information to specify what kinds of other words they will be using; we don't need to know anything else. (Of course, if you ultimately want a more powerful text classifier, you need to relax this assumption.)

4 Pitfalls with Naive Bayes

1-of-M encoding: So far we have described how to implement Naive Bayes for the case of binary attributes and also for the case of Gaussian continuous attributes. However, very often the software that people commonly use requires the data to be in the form of binary attributes. It is in the transformation of non-binary data to a binary form that a common mistake occurs.

Consider the following attribute: age. In a survey, a person's age is marked down using the variable a ∈ {1, 2, 3}: a = 1 means the person is between 0 and 10 years old, a = 2 means the person is between 10 and 20 years old, and a = 3 means the person is older than 20.
Perhaps there would be other attributes for the data, so that each data entry is a vector of two variables (a, b)^T. One way to transform the variable a into a binary representation would be to use three binary variables (a1, a2, a3): (1, 0, 0) represents a = 1, (0, 1, 0) represents a = 2, and (0, 0, 1) represents a = 3. This is called 1-of-M coding since only 1 of the M binary variables is active in encoding the M states. The problem here is that this encoding, by construction, makes the variables a1, a2, a3 dependent: for example, if we know that a1 = 1, we know that a2 = 0 and a3 = 0. Regardless of any possible conditioning, these variables will always remain completely dependent, contrary to the assumption of Naive Bayes. This mistake, however, is widespread; please help preserve a little of my sanity by not making the same error. The correct approach is simply to use variables with many states, i.e. the multinomial rather than binomial distribution. This is straightforward and left as an exercise for the interested reader.

5 Estimation using Maximum Likelihood : Bernoulli Process

In this section we formally derive how to learn the parameters in a Naive Bayes model from data. The results are intuitive, and indeed we have already made use of them in the previous sections. However, it is instructive to carry out this procedure, and some light can also be cast on the nature of the decision boundary (at least for the case of binary attributes).

Consider a dataset X = {x^µ, µ = 1, ..., P} of binary attributes. That is, x_i^µ ∈ {0, 1}. Each datapoint x^µ has an associated class label c^µ. Based upon the class label, we can split the inputs into those that belong to each class: X^c = {x | x is in class c}. We will consider here only the case of
two classes (this is called a Bernoulli process; the case of more classes is also straightforward, and called the multinomial process). Let the number of datapoints from class c = 0 be n0 and the number from class c = 1 be n1. For each attribute i and each of the two classes, we then need to estimate the values p(x_i = 1|c) ≡ θ_i^c. (The other probability, p(x_i = 0|c), is simply given by the normalisation requirement: p(x_i = 0|c) = 1 − p(x_i = 1|c) = 1 − θ_i^c.)

Using the standard assumption that the data is generated identically and independently, the likelihood of the model generating the dataset X^c (the data X belonging to class c) is

p(X^c) = ∏_{µ from class c} p(x^µ|c)    (5.1)

Using our conditional independence assumption,

p(x|c) = ∏_i p(x_i|c) = ∏_i (θ_i^c)^{x_i} (1 − θ_i^c)^{1−x_i}    (5.2)

(remember that in each term in the above expression, x_i is either 0 or 1 and hence, for each term in the product, only one of the two factors will contribute: a factor θ_i^c if x_i = 1 and 1 − θ_i^c if x_i = 0). Putting this all together, we can find the log likelihood

L(θ^c) = Σ_{i,µ} x_i^µ log θ_i^c + (1 − x_i^µ) log(1 − θ_i^c)    (5.3)

Optimising with respect to θ_i^c (differentiate with respect to θ_i^c and equate to zero) gives

p(x_i = 1|c) = (number of times x_i = 1 for class c) / ((number of times x_i = 1 for class c) + (number of times x_i = 0 for class c))    (5.4)

A similar Maximum Likelihood argument gives the intuitive result:

p(c) = (number of times class c occurs) / (total number of datapoints)    (5.5)

5.1 Classification Boundary

If we just wish to find the most likely class for a new point x*, we can compare the log probabilities, classifying x* as class 1 if

log p(c = 1|x*) > log p(c = 0|x*)    (5.6)

Using the definition of the classifier, this is equivalent to (since the normalisation constant log p(x*) can be dropped from both sides)

log p(x*|c = 1) + log p(c = 1) > log p(x*|c = 0) + log p(c = 0)    (5.7)

Using the binary encoding x_i ∈ {0, 1}, this means: classify x* as class 1 if

Σ_i { x_i* log θ_i^1 + (1 − x_i*) log(1 − θ_i^1) } + log p(c = 1) > Σ_i { x_i* log θ_i^0 + (1 − x_i*) log(1 − θ_i^0) } + log p(c = 0)    (5.8)

Note that this decision rule can be
expressed in the form: classify x* as class 1 if Σ_i w_i x_i* + a > 0 for some suitable choice of weights w_i and constant a (the reader is invited to find the explicit values of these weights). The interpretation of this is that w specifies a hyperplane in the x space, and x* is classified as a 1 if it lies on one side of this hyperplane. We shall talk about other such linear classifiers in a later chapter.
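Taking up the invitation above: rearranging equation (5.8) gives one consistent choice, w_i = log θ_i^1 − log(1 − θ_i^1) − log θ_i^0 + log(1 − θ_i^0), with a collecting the remaining x-independent terms. The Python sketch below (parameter values are made up for illustration) checks that this linear rule agrees with comparing the log joint probabilities directly:

```python
import math

theta1 = [0.9, 0.6, 0.3]   # p(x_i = 1 | c = 1), hypothetical values
theta0 = [0.2, 0.5, 0.7]   # p(x_i = 1 | c = 0), hypothetical values
p1, p0 = 0.5, 0.5          # class priors

# Weights: difference of log-odds between the two classes.
w = [math.log(t1 / (1 - t1)) - math.log(t0 / (1 - t0))
     for t1, t0 in zip(theta1, theta0)]
# Bias: the x-independent terms from equation (5.8).
a = (sum(math.log(1 - t1) - math.log(1 - t0) for t1, t0 in zip(theta1, theta0))
     + math.log(p1) - math.log(p0))

def log_joint(theta, prior, x):
    """log p(x, c) for the Bernoulli Naive Bayes model."""
    return math.log(prior) + sum(
        math.log(t) if xi else math.log(1 - t) for t, xi in zip(theta, x))

# For every binary input, the linear score equals the log-probability difference.
for x in [[0, 0, 0], [1, 0, 1], [1, 1, 0], [0, 1, 1], [1, 1, 1]]:
    linear = sum(wi * xi for wi, xi in zip(w, x)) + a
    direct = log_joint(theta1, p1, x) - log_joint(theta0, p0, x)
    assert abs(linear - direct) < 1e-12
print("linear rule matches the direct log-probability comparison")
```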
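Returning to the exercise at the end of section 4: the correct multi-state treatment keeps age as a single attribute a ∈ {1, 2, 3} and estimates one probability per state, exactly as equation (5.4) does for the binary case. A minimal Python sketch, with invented survey data for one class:

```python
# Hypothetical observed a-values for the datapoints of one class c.
ages = [1, 1, 2, 3, 3, 3, 2, 1]

def ml_state_probs(values, states=(1, 2, 3)):
    """ML estimate of p(a = s | c) for each state s: simply the
    fraction of datapoints in the class with that state (the
    multinomial analogue of equation (5.4))."""
    n = len(values)
    return {s: values.count(s) / n for s in states}

probs = ml_state_probs(ages)
print(probs)  # {1: 0.375, 2: 0.25, 3: 0.375}
```

Note that the three estimates sum to one by construction, so no artificial dependence between binary indicator variables is introduced.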