9.2 Maximum A Posteriori and Maximum Likelihood

In the above,

p(θ < 0.5|V) = ∫_0^{0.5} p(θ|V) dθ   (9.1.29)
             = 1/B(α + N_H, β + N_T) ∫_0^{0.5} θ^{α + N_H - 1} (1 - θ)^{β + N_T - 1} dθ   (9.1.30)
             = I_{0.5}(α + N_H, β + N_T)   (9.1.31)

where I_x(a, b) is the regularised incomplete Beta function. For the former case of N_H = 2, N_T = 8, under a flat prior,

p(θ < 0.5|V) = I_{0.5}(N_H + 1, N_T + 1) ≈ 0.967   (9.1.32)

Since the two events are exclusive, p(θ > 0.5|V) = 1 - p(θ < 0.5|V) ≈ 0.033. Hence the expected utility of saying heads is more likely is

U(say heads) = U(say heads, heads) p(θ > 0.5|V) + U(say heads, tails) p(θ < 0.5|V)   (9.1.33)

Similarly, the utility of saying tails is more likely is

U(say tails) = U(say tails, tails) p(θ < 0.5|V) + U(say tails, heads) p(θ > 0.5|V)   (9.1.34)

With the gains and losses assumed above, the second of these is the larger, so we are better off taking the decision that the coin is more likely to come up tails. If we modify the above so that we lose 100 million dollars if we guess tails when in fact it was heads, the expected utility of saying tails would be heavily penalised, in which case we would be better off saying heads. In this case, even though we are more confident that the coin is likely to come up tails, the penalty for making a mistake in saying tails is so large that it is in fact better to say heads.

9.2 Maximum A Posteriori and Maximum Likelihood

9.2.1 Summarising the posterior

Definition 86 (Maximum Likelihood and Maximum A Posteriori). Maximum Likelihood sets the parameter θ, given data V, using

θ_ML = argmax_θ p(V|θ)   (9.2.1)

Maximum A Posteriori uses the setting that maximises the posterior distribution of the parameter,

θ_MAP = argmax_θ p(V|θ) p(θ)   (9.2.2)

where p(θ) is the prior distribution.

A crude summary of the posterior is given by a distribution with all its mass in the single most likely state, θ_MAP. In making such an approximation, potentially useful information concerning the reliability of the parameter estimate is lost. In contrast, the full posterior reflects our beliefs about the range of possibilities and their associated credibilities.
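As a concrete illustration, the following minimal sketch (not from the text) computes the ML and MAP settings and the posterior probability p(θ < 0.5|V) for the Beta-Bernoulli coin model above, assuming a Beta(α, β) prior so that the posterior is Beta(α + N_H, β + N_T); the function and variable names are illustrative only.

```python
# Minimal sketch (not from the text): Beta-Bernoulli coin of section 9.1,
# comparing the ML and MAP point estimates with the posterior mass below 0.5.
import numpy as np
from scipy.special import betainc  # regularised incomplete Beta function I_x(a, b)

def coin_summary(n_heads, n_tails, alpha=1.0, beta=1.0):
    """Posterior is Beta(alpha + n_heads, beta + n_tails)."""
    a, b = alpha + n_heads, beta + n_tails
    theta_ml = n_heads / (n_heads + n_tails)                 # argmax_theta p(V|theta)
    theta_map = (a - 1) / (a + b - 2) if a + b > 2 else 0.5  # mode of the Beta posterior
    p_tail_biased = betainc(a, b, 0.5)                       # p(theta < 0.5 | V) = I_0.5(a, b)
    return theta_ml, theta_map, p_tail_biased

for n_h, n_t in [(2, 8), (20, 80)]:
    ml, mp, p_lt_half = coin_summary(n_h, n_t)   # flat prior: alpha = beta = 1
    print(f"N_H={n_h:2d}, N_T={n_t:2d}: theta_ML={ml:.2f} theta_MAP={mp:.2f} "
          f"p(theta<0.5|V)={p_lt_half:.4f}")
```

Under the flat prior the MAP and ML estimates coincide at 0.2, while the posterior mass below 0.5 is what the decision above is based on.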

Figure 9.4: (a) A model for the relationship between lung Cancer, Asbestos exposure and Smoking. (b) Plate notation replicating the observed n datapoints and placing priors over the CPTs, tied across all datapoints.

One can motivate MAP from a decision theoretic perspective. If we assume a utility that is zero for all but the correct θ,

U(θ_true, θ) = I[θ_true = θ]   (9.2.3)

then the expected utility of θ is

U(θ) = Σ_{θ_true} I[θ_true = θ] p(θ_true|V) = p(θ|V)   (9.2.4)

This means that the maximum utility decision is to return the θ with the highest posterior value. When a flat prior p(θ) = const. is used, the MAP parameter assignment is equivalent to the Maximum Likelihood setting

θ_ML = argmax_θ p(V|θ)   (9.2.5)

The term Maximum Likelihood refers to the parameter θ for which the observed data is most likely to be generated by the model. Since the logarithm is a strictly increasing function, for a positive function f(θ)

θ_opt = argmax_θ f(θ)  ⇔  θ_opt = argmax_θ log f(θ)   (9.2.6)

so that the MAP parameters can be found either by optimising the MAP objective or, equivalently, its logarithm,

log p(θ|V) = log p(V|θ) + log p(θ) - log p(V)   (9.2.7)

where the normalisation constant p(V) is not a function of θ. The log likelihood is convenient since, under the i.i.d. assumption, it is a summation of data terms,

log p(θ|V) = Σ_n log p(v^n|θ) + log p(θ) - log p(V)   (9.2.8)

so that quantities such as derivatives of the log-likelihood w.r.t. θ are straightforward to compute.

Example 36. In the coin-tossing experiment of section(9.1.1) the ML setting is θ = 0.2 in both the N_H = 2, N_T = 8 and the N_H = 20, N_T = 80 cases.

9.2.2 Maximum likelihood and the empirical distribution

Given a dataset of discrete variables X = {x^1, ..., x^N} we define the empirical distribution as

q(x) = (1/N) Σ_{n=1}^N I[x = x^n]   (9.2.9)
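As a small illustration of equation (9.2.9), the sketch below builds the empirical distribution by normalised counting; the data values are made up.

```python
# Illustrative sketch (made-up data): the empirical distribution q(x) of (9.2.9)
# is the normalised count of each observed state.
from collections import Counter

X = [0, 1, 1, 0, 1, 1, 1, 0, 1, 1]            # N = 10 discrete observations x^1..x^N
N = len(X)
q = {x: count / N for x, count in Counter(X).items()}
print(q)   # {0: 0.3, 1: 0.7}; for an unconstrained model this is also the ML distribution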

Figure 9.5: A database containing information about Asbestos exposure (1 signifies exposure), being a Smoker (1 signifies the individual is a smoker) and lung Cancer (1 signifies the individual has lung cancer). Each row contains the information for an individual, so that there are 7 individuals in the database.

In the case that x is a vector of variables,

I[x = x^n] = ∏_i I[x_i = x_i^n]   (9.2.10)

The Kullback-Leibler divergence between the empirical distribution q(x) and a distribution p(x) is

KL(q|p) = ⟨log q(x)⟩_{q(x)} - ⟨log p(x)⟩_{q(x)}   (9.2.11)

Our interest is the functional dependence of KL(q|p) on p. Since the entropic term ⟨log q(x)⟩_{q(x)} is independent of p(x), we may consider it constant and focus on the second term alone. Hence

KL(q|p) = -⟨log p(x)⟩_{q(x)} + const. = -(1/N) Σ_{n=1}^N log p(x^n) + const.   (9.2.12)

We recognise Σ_{n=1}^N log p(x^n) as the log likelihood under the model p(x), assuming that the data is i.i.d. This means that setting parameters by maximum likelihood is equivalent to setting parameters by minimising the Kullback-Leibler divergence between the empirical distribution and the parameterised distribution. In the case that p(x) is unconstrained, the optimal choice is to set p(x) = q(x): the maximum likelihood optimal distribution corresponds to the empirical distribution.

9.2.3 Maximum likelihood training of belief networks

Consider the following model of the relationship between exposure to asbestos (a), being a smoker (s) and the incidence of lung cancer (c),

p(a, s, c) = p(c|a, s) p(a) p(s)   (9.2.13)

which is depicted in fig(9.4a). Each variable is binary, dom(a) = {0, 1}, dom(s) = {0, 1}, dom(c) = {0, 1}. We assume that there is no direct relationship between Smoking and exposure to Asbestos. This is the kind of assumption that we may be able to elicit from medical experts. Furthermore, we assume that we have a list of patient records, fig(9.5), where each row represents a patient's data. To learn the table entries p(c|a, s) we count the number of times c is in state 1 for each of the 4 parental states of a and s:

p(c=1|a=0, s=0) = 0,    p(c=1|a=0, s=1) = 0.5
p(c=1|a=1, s=0) = 0.5,  p(c=1|a=1, s=1) = 1   (9.2.14)

Similarly, based on counting, p(a=1) = 4/7 and p(s=1) = 4/7. These three CPTs then complete the full distribution specification. Setting the CPT entries in this way, by counting the relative number of occurrences, corresponds mathematically to maximum likelihood learning under the i.i.d. assumption, as we show below.

Maximum likelihood corresponds to counting

For a BN there is a constraint on the form of p(x), namely

p(x) = ∏_{i=1}^K p(x_i | pa(x_i))   (9.2.15)
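The counting estimates of equation (9.2.14) can be reproduced in a few lines. Since the individual rows of fig(9.5) are not legible in this transcription, the patient records below are an assumption, chosen only so that their counts match the quoted values p(a=1) = p(s=1) = 4/7 and the table (9.2.14).

```python
# Sketch of ML-by-counting for the belief network p(a,s,c) = p(c|a,s)p(a)p(s).
# The rows are assumed, chosen to reproduce the counts quoted in the text.
import numpy as np

# columns: a (asbestos), s (smoker), c (cancer)
data = np.array([[0, 0, 0],
                 [0, 1, 0], [0, 1, 1],
                 [1, 0, 0], [1, 0, 1],
                 [1, 1, 1], [1, 1, 1]])
a, s, c = data[:, 0], data[:, 1], data[:, 2]

p_a1 = a.mean()                      # ML estimate of p(a=1): relative frequency
p_s1 = s.mean()                      # ML estimate of p(s=1)
print(f"p(a=1) = {p_a1:.3f},  p(s=1) = {p_s1:.3f}")

for av in (0, 1):
    for sv in (0, 1):
        mask = (a == av) & (s == sv)
        p_c1 = c[mask].mean()        # fraction of c=1 among rows with this parental state
        print(f"p(c=1 | a={av}, s={sv}) = {p_c1:.2f}")
```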

CHAPTER 10  Naive Bayes

10.1 Naive Bayes and Conditional Independence

Naive Bayes (NB) is a popular classification method and aids our discussion of conditional independence, overfitting and Bayesian methods. In NB, we form a joint model of observations x and the corresponding class label c using a Belief Network of the form

p(x, c) = p(c) ∏_{i=1}^D p(x_i|c)   (10.1.1)

whose Belief Network is depicted in fig(10.1a). Coupled with a suitable choice for each conditional distribution p(x_i|c), we can then use Bayes' rule to form a classifier for a novel attribute vector x*:

p(c|x*) = p(x*|c) p(c) / p(x*) = p(x*|c) p(c) / Σ_c p(x*|c) p(c)   (10.1.2)

In practice it is common to consider only two classes, dom(c) = {0, 1}. The theory we describe below is valid for any number of classes c, though our examples are restricted to the binary class case. Also, the attributes x_i are often taken to be binary, as we shall do initially below as well. The extension to more than two attribute states, or to continuous attributes, is straightforward.

Example 47. EZsurvey.org considers that Radio station listeners conveniently fall into two groups, the young and the old. They assume that, given the knowledge that a customer is either young or old, this is sufficient to determine whether or not a customer will like a particular Radio station, independent of their likes or dislikes for any other stations:

p(R1, R2, R3, R4 | age) = p(R1|age) p(R2|age) p(R3|age) p(R4|age)   (10.1.3)

where each of the variables R1, R2, R3, R4 can take the states either like or dislike, and the age variable can take the value either young or old. Thus the information about the age of the customer determines the individual product preferences without needing to know anything else. To complete the specification, given that a customer is young, she has a 95% chance to like Radio1, a 5% chance to like Radio2, a 2% chance to like Radio3 and a 20% chance to like Radio4. Similarly, an old listener has a 3% chance to like Radio1, an 82% chance to like Radio2, a 34% chance to like Radio3 and a 92% chance to like Radio4. They know that 90% of the listeners are old.

Figure 10.1: Naive Bayes classifier. (a) The central assumption is that, given the class c, the attributes x_i are independent. (b) Assuming the data is i.i.d., Maximum Likelihood learns the optimal parameters of the distribution p(c) and the class-dependent attribute distributions p(x_i|c).

Given this model, and a new customer that likes Radio1 and Radio3, but dislikes Radio2 and Radio4, what is the probability that they are young? This is given by

p(age = young | R1 = like, R2 = dislike, R3 = like, R4 = dislike)
  = p(R1 = like, R2 = dislike, R3 = like, R4 = dislike | age = young) p(age = young) / Σ_{age} p(R1 = like, R2 = dislike, R3 = like, R4 = dislike | age) p(age)   (10.1.4)

Using the Naive Bayes structure, the numerator above is given by

p(R1 = like|age = young) p(R2 = dislike|age = young) p(R3 = like|age = young) p(R4 = dislike|age = young) p(age = young)   (10.1.5)

Plugging in the values we obtain 0.95 × 0.95 × 0.02 × 0.8 × 0.1 ≈ 0.00144. The denominator is given by this value plus the corresponding term evaluated under the assumption that the customer is old, 0.03 × 0.18 × 0.34 × 0.08 × 0.9 ≈ 0.00013, which gives

p(age = young | R1 = like, R2 = dislike, R3 = like, R4 = dislike) ≈ 0.916   (10.1.6)
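The arithmetic of Example 47 is summarised in the following sketch, which uses only the probabilities quoted in the text; the variable names are illustrative.

```python
# Sketch of the Example 47 calculation: posterior over age for a listener who
# likes Radio1 and Radio3 but dislikes Radio2 and Radio4, under the NB model.
p_like = {'young': [0.95, 0.05, 0.02, 0.20],   # p(Ri = like | age) from the text
          'old':   [0.03, 0.82, 0.34, 0.92]}
p_age = {'young': 0.10, 'old': 0.90}

x = [1, 0, 1, 0]                                 # 1 = like, 0 = dislike, for R1..R4

joint = {}
for age in ('young', 'old'):
    lik = 1.0
    for xi, pi in zip(x, p_like[age]):
        lik *= pi if xi == 1 else (1.0 - pi)     # product of class-conditional factors
    joint[age] = lik * p_age[age]                # numerator of Bayes' rule

z = sum(joint.values())                          # the normalising constant p(R1,...,R4)
print({age: v / z for age, v in joint.items()})  # posterior; young comes out around 0.916
```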

10.2 Estimation using Maximum Likelihood

Learning the table entries for NB is a straightforward application of the more general BN learning discussed in section(9.2.3). For a fully observed dataset, Maximum Likelihood learning of the table entries corresponds to counting the number of occurrences in the training data, as we show below.

10.2.1 Binary attributes

Consider a dataset {x^n, n = 1, ..., N} of binary attributes, x_i^n ∈ {0, 1}, i = 1, ..., D. Each datapoint x^n has an associated class label c^n. The number of datapoints from class c = 0 is n_0 and the number from class c = 1 is denoted n_1. For each attribute i of the two classes, we need to estimate the values p(x_i = 1|c) ≡ θ_i^c. The other probability, p(x_i = 0|c), is given by the normalisation requirement, p(x_i = 0|c) = 1 - p(x_i = 1|c) = 1 - θ_i^c.

Based on the NB conditional independence assumption, the probability of observing a vector x can be compactly written

p(x|c) = ∏_{i=1}^D p(x_i|c) = ∏_{i=1}^D (θ_i^c)^{x_i} (1 - θ_i^c)^{1 - x_i}   (10.2.1)

In the above expression, x_i is either 0 or 1, and hence each term i contributes a factor θ_i^c if x_i = 1 or 1 - θ_i^c if x_i = 0. Together with the assumption that the training data is i.i.d. generated, the log likelihood of the attributes and class labels is

L = Σ_n log p(x^n, c^n) = Σ_n log p(c^n) ∏_i p(x_i^n|c^n)   (10.2.2)
  = Σ_{i,n} [ x_i^n log θ_i^{c^n} + (1 - x_i^n) log(1 - θ_i^{c^n}) ] + n_0 log p(c=0) + n_1 log p(c=1)   (10.2.3)

This can be written more explicitly in terms of the parameters as

L = Σ_{i,n} { I[x_i^n = 1, c^n = 0] log θ_i^0 + I[x_i^n = 0, c^n = 0] log(1 - θ_i^0) + I[x_i^n = 1, c^n = 1] log θ_i^1 + I[x_i^n = 0, c^n = 1] log(1 - θ_i^1) } + n_0 log p(c=0) + n_1 log p(c=1)   (10.2.4)

We can find the Maximum Likelihood optimal θ_i^c by differentiating w.r.t. θ_i^c and equating to zero, giving

θ_i^c = p(x_i = 1|c) = Σ_n I[x_i^n = 1, c^n = c] / Σ_n ( I[x_i^n = 0, c^n = c] + I[x_i^n = 1, c^n = c] )   (10.2.5)
      = (number of times x_i = 1 for class c) / (number of datapoints in class c)   (10.2.6)

Similarly, optimising equation (10.2.3) with respect to p(c) gives

p(c) = (number of times class c occurs) / (total number of datapoints)   (10.2.7)

Classification boundary

We classify a novel input x* as class 1 if

p(c = 1|x*) > p(c = 0|x*)   (10.2.8)

Using Bayes' rule and taking the log of the above expression, this is equivalent to

log p(x*|c = 1) + log p(c = 1) - log p(x*) > log p(x*|c = 0) + log p(c = 0) - log p(x*)   (10.2.9)

From the definition of the classifier, this is equivalent to (the normalisation constant log p(x*) can be dropped from both sides)

Σ_i log p(x_i*|c = 1) + log p(c = 1) > Σ_i log p(x_i*|c = 0) + log p(c = 0)   (10.2.10)

Using the binary encoding x_i ∈ {0, 1}, we classify x* as class 1 if

Σ_i [ x_i* log θ_i^1 + (1 - x_i*) log(1 - θ_i^1) ] + log p(c = 1) > Σ_i [ x_i* log θ_i^0 + (1 - x_i*) log(1 - θ_i^0) ] + log p(c = 0)   (10.2.11)

This decision rule can be expressed in the form: classify x* as class 1 if Σ_i w_i x_i* + a > 0 for a suitable choice of weights w_i and constant a, see exercise(133). The interpretation is that w specifies a hyperplane in the attribute space and x* is classified as 1 if it lies on the positive side of the hyperplane.
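The following sketch (with arbitrary example parameters, not taken from the text) rearranges the decision rule (10.2.11) into the linear form Σ_i w_i x_i* + a > 0 and checks that the two forms agree.

```python
# Sketch: the binary Naive Bayes decision rule rearranged as a linear rule
# sum_i w_i x_i + a > 0.  The parameter values here are arbitrary examples.
import numpy as np

theta1 = np.array([0.9, 0.2, 0.6])   # p(x_i=1 | c=1), assumed values
theta0 = np.array([0.3, 0.5, 0.4])   # p(x_i=1 | c=0), assumed values
pc1, pc0 = 0.5, 0.5                  # class priors

# Collect the x_i-dependent terms into weights w, and the rest into the constant a.
w = np.log(theta1 / theta0) - np.log((1 - theta1) / (1 - theta0))
a = np.sum(np.log((1 - theta1) / (1 - theta0))) + np.log(pc1 / pc0)

x = np.array([1, 0, 1])
linear = w @ x + a > 0
direct = (np.sum(x * np.log(theta1) + (1 - x) * np.log(1 - theta1)) + np.log(pc1)
          > np.sum(x * np.log(theta0) + (1 - x) * np.log(1 - theta0)) + np.log(pc0))
print(linear, direct)   # the two rules agree
```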

Figure 10.2: (a) English tastes over the attributes (shortbread, lager, whiskey, porridge, football). Each column represents the tastes of an individual. (b) Scottish tastes.

Example 48 (Are they Scottish?). Consider the following vector of attributes:

(likes shortbread, likes lager, drinks whiskey, eats porridge, watched England play football)

A vector x = (1, 0, 1, 1, 0)^T would describe a person who likes shortbread, does not like lager, drinks whiskey, eats porridge, and has not watched England play football. Together with each vector x, there is a label nat describing the nationality of the person, dom(nat) = {scottish, english}, see fig(10.2). We wish to classify the vector x = (1, 0, 1, 1, 0)^T as either scottish or english. We can use Bayes' rule to calculate the probability that x is Scottish or English:

p(scottish|x) = p(x|scottish) p(scottish) / p(x) = p(x|scottish) p(scottish) / [ p(x|scottish) p(scottish) + p(x|english) p(english) ]

By Maximum Likelihood the prior class probability p(scottish) is given by the fraction of people in the database that are Scottish, and similarly p(english) is given by the fraction of people in the database that are English. This gives p(scottish) = 7/13 and p(english) = 6/13. For p(x|nat), under the Naive Bayes assumption:

p(x|nat) = p(x_1|nat) p(x_2|nat) p(x_3|nat) p(x_4|nat) p(x_5|nat)

so that, knowing whether or not someone is Scottish, we do not need to know anything else to calculate the probability of their likes and dislikes. Based on the table in fig(10.2) and using Maximum Likelihood we have:

p(x_1 = 1|english) = 1/2    p(x_1 = 1|scottish) = 1
p(x_2 = 1|english) = 1/2    p(x_2 = 1|scottish) = 4/7
p(x_3 = 1|english) = 1/3    p(x_3 = 1|scottish) = 3/7
p(x_4 = 1|english) = 1/2    p(x_4 = 1|scottish) = 5/7
p(x_5 = 1|english) = 1/2    p(x_5 = 1|scottish) = 3/7

For x = (1, 0, 1, 1, 0)^T, we get p(scottish|x) ≈ 0.81. Since this is greater than 0.5, we would classify this person as being Scottish.

Small data counts

In example(48), consider trying to classify the vector x = (0, 1, 1, 1, 1)^T. In the training data, all Scottish people say they like shortbread. This means that for this particular x, p(x, scottish) = 0, and therefore that we make the extremely confident classification p(scottish|x) = 0. This demonstrates a difficulty of using Maximum Likelihood with sparse data.
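A sketch of Example 48 using the Maximum Likelihood estimates quoted above; it also reproduces the sparse-data problem just described.

```python
# Sketch of Example 48 using the ML estimates quoted in the text.
import numpy as np

p1 = {'scottish': np.array([1, 4/7, 3/7, 5/7, 3/7]),   # p(x_i=1 | nat)
      'english':  np.array([1/2, 1/2, 1/3, 1/2, 1/2])}
prior = {'scottish': 7/13, 'english': 6/13}

def posterior_scottish(x):
    x = np.asarray(x)
    joint = {nat: np.prod(np.where(x == 1, p1[nat], 1 - p1[nat])) * prior[nat]
             for nat in prior}
    return joint['scottish'] / (joint['scottish'] + joint['english'])

print(posterior_scottish([1, 0, 1, 1, 0]))   # about 0.81, so classify as Scottish
print(posterior_scottish([0, 1, 1, 1, 1]))   # exactly 0: the sparse-data problem
```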

One way to ameliorate this is to smooth the probabilities, for example by adding a small number to the frequency counts of each attribute. This ensures that there are no zero probabilities in the model. An alternative is to use a Bayesian approach that discourages extreme probabilities, as discussed in section(10.3).

Potential pitfalls with encoding

In many off-the-shelf packages implementing Naive Bayes, binary attributes are assumed. In practice, however, the case of non-binary attributes often occurs. Consider the following attribute: age. In a survey, a person's age is marked down using the variable a ∈ {1, 2, 3}: a = 1 means the person is between 0 and 10 years old, a = 2 means the person is between 10 and 20 years old, and a = 3 means the person is older than 20. One way to transform the variable a into a binary representation would be to use three binary variables (a_1, a_2, a_3), with (1, 0, 0), (0, 1, 0), (0, 0, 1) representing a = 1, a = 2, a = 3 respectively. This is called 1-of-M coding since only 1 of the binary variables is active in encoding the M states. By construction, the variables a_1, a_2, a_3 are dependent: for example, if we know that a_1 = 1, we know that a_2 = 0 and a_3 = 0. Regardless of any class conditioning, these variables will always be dependent, contrary to the assumption of Naive Bayes. A correct approach is to use variables with more than two states, as explained in section(10.2.2).

10.2.2 Multi-state variables

For a variable x_i with more than two states, dom(x_i) = {1, ..., S}, the likelihood of observing a state x_i = s is denoted

p(x_i = s|c) = θ_s^i(c)

with Σ_s p(x_i = s|c) = 1. For a set of data vectors x^n, n = 1, ..., N, belonging to class c, under the i.i.d. assumption the likelihood of the NB model generating the data is

∏_{n=1}^N p(x^n|c^n) = ∏_{n=1}^N ∏_{i=1}^D ∏_{s=1}^S ∏_{c=1}^C [ θ_s^i(c) ]^{ I[x_i^n = s] I[c^n = c] }

which gives the class conditional log-likelihood

L = Σ_{n=1}^N Σ_{i=1}^D Σ_{s=1}^S Σ_{c=1}^C I[x_i^n = s] I[c^n = c] log θ_s^i(c)

We can optimise with respect to the parameters θ using a Lagrange multiplier (one for each of the attributes i and classes c) to ensure normalisation:

L(θ, λ) = Σ_{n=1}^N Σ_{i=1}^D Σ_{s=1}^S Σ_{c=1}^C I[x_i^n = s] I[c^n = c] log θ_s^i(c) + Σ_{c=1}^C Σ_{i=1}^D λ_i^c ( 1 - Σ_{s=1}^S θ_s^i(c) )

To find the optimum of this function we may differentiate with respect to θ_s^i(c) and equate to zero. Solving the resulting equation we obtain

θ_s^i(c) = (1/λ_i^c) Σ_{n=1}^N I[x_i^n = s] I[c^n = c]

Hence, by normalisation,

θ_s^i(c) = p(x_i = s|c) = Σ_n I[x_i^n = s] I[c^n = c] / Σ_{s', n'} I[x_i^{n'} = s'] I[c^{n'} = c]

The Maximum Likelihood setting for the parameter p(x_i = s|c) equals the relative number of times that attribute i is in state s for class c.
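A sketch of this multi-state Maximum Likelihood estimate: for each class c and attribute i, θ_s^i(c) is simply the fraction of class-c datapoints with x_i = s. The data below is made up.

```python
# Sketch of the multi-state ML estimate: theta_s(c) is the fraction of class-c
# datapoints in state s.  Data below is made up.
import numpy as np

S, C = 3, 2                                         # attribute states and classes
x = np.array([0, 2, 1, 2, 2, 0, 1, 1])              # one multi-state attribute, x^n in {0,1,2}
c = np.array([0, 0, 0, 1, 1, 1, 1, 1])              # class labels

theta = np.zeros((C, S))
for cls in range(C):
    counts = np.bincount(x[c == cls], minlength=S)  # times each state s occurs in class cls
    theta[cls] = counts / counts.sum()              # normalise over s, as the Lagrange multiplier enforces
print(theta)            # each row sums to 1
```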

Figure 10.3: Bayesian Naive Bayes with a factorised prior on the class conditional attribute probabilities p(x_i = s|c). For simplicity we assume that the class probability θ_c ≡ p(c) is learned with Maximum Likelihood, so that no distribution is placed over this parameter.

Text classification

Consider a set of documents about politics, and another set about sport. Our interest is to make a method that can automatically classify a new document as pertaining to either sport or politics. We search through both sets of documents to find the 100 most commonly occurring words. Each document is then represented by a 100-dimensional vector representing the number of times that each of these words occurs in that document, the so-called bag of words representation (this is a crude representation of the document since it discards word order). A Naive Bayes model specifies a distribution over these numbers of occurrences p(x_i|c), where x_i is the count of the number of times word i appears in documents of type c. One can achieve this using either a multistate representation (as discussed in section(10.2.2)) or a continuous x_i representing the frequency of word i in the document, in which case p(x_i|c) could be conveniently modelled using, for example, a Beta distribution.

Despite the simplicity of Naive Bayes, it can classify documents surprisingly well [125]. Intuitively, a potential justification for the conditional independence assumption is that if we know a document is about politics, this is a good indication of the kinds of other words we will find in the document. Because Naive Bayes is a reasonable classifier in this sense, and has minimal storage and fast training, it has been applied to time- and storage-critical applications, such as automatically classifying webpages into types [289] and spam filtering [9].

10.3 Bayesian Naive Bayes

To predict the class c of an input x*, we use

p(c|x*, D) ∝ p(x*, D, c) ∝ p(x*|D, c) p(c|D)   (10.3.1)

For convenience we will simply set p(c|D) using Maximum Likelihood,

p(c|D) = (1/N) Σ_n I[c^n = c]   (10.3.2)

However, as we have seen, setting the parameters of p(x|D, c) using Maximum Likelihood training can yield over-confident predictions in the case of sparse data. A Bayesian approach that addresses this difficulty is to use priors on the probabilities p(x_i = s|c) ≡ θ_s^i(c) that discourage extreme values. The model is depicted in fig(10.3).

The prior

We will use a prior on the table entries and make the global factorisation assumption (see section(9.3))

p(θ) = ∏_{i,c} p(θ^i(c))   (10.3.3)
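The full Bayesian treatment continues beyond this excerpt. As a minimal sketch of the practical effect of such a prior, a Dirichlet/Beta prior on the table entries adds pseudo-counts to the frequency counts, which removes the zero probabilities of pure Maximum Likelihood; the function below is an illustrative assumption, not the book's derivation.

```python
# Sketch (not the book's derivation): pseudo-count smoothing as the practical
# effect of a Dirichlet/Beta prior on the table entries.
import numpy as np

def smoothed_theta(x_col, c, cls, n_states, alpha=1.0):
    """Estimate of p(x_i = s | c = cls) with pseudo-count alpha added to each state."""
    counts = np.bincount(x_col[c == cls], minlength=n_states).astype(float)
    return (counts + alpha) / (counts.sum() + alpha * n_states)

x_col = np.array([0, 0, 0, 1, 2])      # made-up attribute values
c     = np.array([0, 0, 0, 1, 1])      # made-up class labels
print(smoothed_theta(x_col, c, cls=0, n_states=3))   # no entry is exactly zero
```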
