Bayesian Classifier

1 Bayesian Classifier

f : X → V, V a finite set of values. Instances x ∈ X can be described as a collection of features, x = (x_1, x_2, ..., x_n), x_i ∈ {0,1}. Given an example, assign it the most probable value in V. By Bayes rule:

v_MAP = argmax_{v_j ∈ V} P(v_j | x)
      = argmax_{v_j ∈ V} P(x_1, x_2, ..., x_n | v_j) P(v_j) / P(x_1, x_2, ..., x_n)
      = argmax_{v_j ∈ V} P(x_1, x_2, ..., x_n | v_j) P(v_j)

Notational convention: P(y) means P(Y = y).

Bayesian Learning, CS446 Spring '17

2 Bayesian Classifier

v_MAP = argmax_{v_j ∈ V} P(x_1, x_2, ..., x_n | v_j) P(v_j)

Given training data we can estimate the two terms. Estimating P(v_j) is easy: e.g., under the binomial distribution assumption, count the number of times v_j appears in the training data. However, it is not feasible to estimate P(x_1, x_2, ..., x_n | v_j): we would have to estimate, for each target value, the probability of each instance (most of which will never occur). In order to use Bayesian classifiers in practice, we need to make assumptions that will allow us to estimate these quantities.

3 Naive Bayes

v_MAP = argmax_{v_j ∈ V} P(x_1, x_2, ..., x_n | v_j) P(v_j)

By the chain rule:

P(x_1, x_2, ..., x_n | v_j)
  = P(x_1 | x_2, ..., x_n, v_j) P(x_2, ..., x_n | v_j)
  = P(x_1 | x_2, ..., x_n, v_j) P(x_2 | x_3, ..., x_n, v_j) P(x_3, ..., x_n | v_j)
  = ...
  = P(x_1 | x_2, ..., x_n, v_j) P(x_2 | x_3, ..., x_n, v_j) ... P(x_n | v_j)

Assumption: feature values are independent given the target value. Then:

P(x_1, x_2, ..., x_n | v_j) = ∏_{i=1}^n P(x_i | v_j)

4 Naive Bayes (2)

v_MAP = argmax_{v_j ∈ V} P(x_1, x_2, ..., x_n | v_j) P(v_j)

Assumption: feature values are independent given the target value:

P(x_1 = b_1, x_2 = b_2, ..., x_n = b_n | v = v_j) = ∏_{i=1}^n P(x_i = b_i | v = v_j)

Generative model: first choose a value v_j ∈ V according to P(v_j); then choose x_1, x_2, ..., x_n according to P(x_k | v_j).

5 Naive Bayes (3)

v_MAP = argmax_{v_j ∈ V} P(x_1, x_2, ..., x_n | v_j) P(v_j)

Assumption: feature values are independent given the target value:

P(x_1 = b_1, x_2 = b_2, ..., x_n = b_n | v = v_j) = ∏_{i=1}^n P(x_i = b_i | v = v_j)

Learning method: estimate n|V| + |V| parameters and use them to make a prediction. (How do we estimate them?)

Notice that this is learning without search. Given a collection of training examples, you just compute the best hypothesis (given the assumptions). This is learning without trying to achieve consistency, or even approximate consistency. Why does it work?

6 Conditional Independence

Notice that the feature values are conditionally independent given the target value, and are not required to be independent.

Example: the Boolean features are x and y. We define the label to be l = f(x,y) = x ∧ y over the product distribution p(x=0) = p(x=1) = 1/2 and p(y=0) = p(y=1) = 1/2. The distribution is defined so that x and y are independent: p(x,y) = p(x) p(y). That is:

        X=0          X=1
Y=0   1/4 (l=0)    1/4 (l=0)
Y=1   1/4 (l=0)    1/4 (l=1)

But, given that l=0: p(x=1 | l=0) = p(y=1 | l=0) = 1/3, while p(x=1, y=1 | l=0) = 0, so x and y are not conditionally independent.

7 Conditional Independence

The other direction also does not hold: x and y can be conditionally independent but not independent. Example: we define a distribution such that:

l=0: p(x=1 | l=0) = 1, p(y=1 | l=0) = 0
l=1: p(x=1 | l=1) = 0, p(y=1 | l=1) = 1

and assume that p(l=0) = p(l=1) = 1/2.

        X=0          X=1
Y=0    0  (l=0)    1/2 (l=0)
Y=1   1/2 (l=1)     0  (l=1)

Given the value of l, x and y are independent (check!). What about unconditional independence?

p(x=1) = p(x=1 | l=0) p(l=0) + p(x=1 | l=1) p(l=1) = 0.5 + 0 = 0.5
p(y=1) = p(y=1 | l=0) p(l=0) + p(y=1 | l=1) p(l=1) = 0 + 0.5 = 0.5

But:

p(x=1, y=1) = p(x=1, y=1 | l=0) p(l=0) + p(x=1, y=1 | l=1) p(l=1) = 0 ≠ 0.25

so x and y are not independent.
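Both checks are mechanical enough to verify in code. A minimal sketch: the joint table is the one from this slide, while the helper function p is ours, not from the lecture.

```python
# Joint distribution p(x, y, l) from the slide: x and y are
# conditionally independent given l, but not independent.
joint = {
    (1, 0, 0): 0.5,  # X=1, Y=0, l=0
    (0, 1, 1): 0.5,  # X=0, Y=1, l=1
}

def p(**fixed):
    """Marginal probability of the assignment in `fixed`
    over variables named x, y, l."""
    total = 0.0
    for (x, y, l), pr in joint.items():
        vals = {"x": x, "y": y, "l": l}
        if all(vals[k] == v for k, v in fixed.items()):
            total += pr
    return total

# Unconditional: p(x=1, y=1) != p(x=1) p(y=1)  -> not independent.
print(p(x=1), p(y=1), p(x=1, y=1))  # 0.5 0.5 0.0

# Conditional on l=0: p(x=1, y=0 | l=0) = p(x=1 | l=0) p(y=0 | l=0).
px_given = p(x=1, l=0) / p(l=0)
py_given = p(y=0, l=0) / p(l=0)
pxy_given = p(x=1, y=0, l=0) / p(l=0)
print(pxy_given == px_given * py_given)  # True
```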

8 Naïve Bayes Example

v_NB = argmax_{v_j ∈ V} P(v_j) ∏_i P(x_i | v_j)

Day  Outlook   Temperature  Humidity  Wind    PlayTennis
1    Sunny     Hot          High      Weak    No
2    Sunny     Hot          High      Strong  No
3    Overcast  Hot          High      Weak    Yes
4    Rain      Mild         High      Weak    Yes
5    Rain      Cool         Normal    Weak    Yes
6    Rain      Cool         Normal    Strong  No
7    Overcast  Cool         Normal    Strong  Yes
8    Sunny     Mild         High      Weak    No
9    Sunny     Cool         Normal    Weak    Yes
10   Rain      Mild         Normal    Weak    Yes
11   Sunny     Mild         Normal    Strong  Yes
12   Overcast  Mild         High      Strong  Yes
13   Overcast  Hot          Normal    Weak    Yes
14   Rain      Mild         High      Strong  No

9 Estimating Probabilities

v_NB = argmax_{v ∈ {yes, no}} P(v) ∏_i P(observation_i | v)

How do we estimate P(observation | v)?

10 Example

v_NB = argmax_{v_j ∈ V} P(v_j) ∏_i P(x_i | v_j)

Compute P(PlayTennis = yes) and P(PlayTennis = no).
Compute P(outlook = sunny/overcast/rain | PlayTennis = yes/no) (6 numbers).
Compute P(temp = hot/mild/cool | PlayTennis = yes/no) (6 numbers).
Compute P(humidity = high/normal | PlayTennis = yes/no) (4 numbers).
Compute P(wind = weak/strong | PlayTennis = yes/no) (4 numbers).

11 Example

v_NB = argmax_{v_j ∈ V} P(v_j) ∏_i P(x_i | v_j)

Compute P(PlayTennis = yes) and P(PlayTennis = no).
Compute P(outlook = sunny/overcast/rain | PlayTennis = yes/no) (6 numbers).
Compute P(temp = hot/mild/cool | PlayTennis = yes/no) (6 numbers).
Compute P(humidity = high/normal | PlayTennis = yes/no) (4 numbers).
Compute P(wind = weak/strong | PlayTennis = yes/no) (4 numbers).

Given a new instance: (Outlook = sunny; Temperature = cool; Humidity = high; Wind = strong). Predict: PlayTennis = ?

12 Example

v_NB = argmax_{v_j ∈ V} P(v_j) ∏_i P(x_i | v_j)

Given: (Outlook = sunny; Temperature = cool; Humidity = high; Wind = strong)

P(PlayTennis = yes) = 9/14 = 0.64      P(PlayTennis = no) = 5/14 = 0.36
P(outlook = sunny | yes) = 2/9         P(outlook = sunny | no) = 3/5
P(temp = cool | yes) = 3/9             P(temp = cool | no) = 1/5
P(humidity = high | yes) = 3/9         P(humidity = high | no) = 4/5
P(wind = strong | yes) = 3/9           P(wind = strong | no) = 3/5

P(yes, ...) ≈ 0.0053      P(no, ...) ≈ 0.0206

13 Example

v_NB = argmax_{v_j ∈ V} P(v_j) ∏_i P(x_i | v_j)

Given: (Outlook = sunny; Temperature = cool; Humidity = high; Wind = strong)

P(PlayTennis = yes) = 9/14 = 0.64      P(PlayTennis = no) = 5/14 = 0.36
P(outlook = sunny | yes) = 2/9         P(outlook = sunny | no) = 3/5
P(temp = cool | yes) = 3/9             P(temp = cool | no) = 1/5
P(humidity = high | yes) = 3/9         P(humidity = high | no) = 4/5
P(wind = strong | yes) = 3/9           P(wind = strong | no) = 3/5

P(yes, ...) ≈ 0.0053      P(no, ...) ≈ 0.0206

P(no | instance) = 0.0206 / (0.0053 + 0.0206) = 0.795

What if we were asked about Outlook = Overcast?
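The whole computation can be reproduced directly from the table on slide 8. A small sketch; the helper function names are ours:

```python
# Naive Bayes on the PlayTennis data from slide 8 (no smoothing).
data = [
    ("Sunny","Hot","High","Weak","No"),        ("Sunny","Hot","High","Strong","No"),
    ("Overcast","Hot","High","Weak","Yes"),    ("Rain","Mild","High","Weak","Yes"),
    ("Rain","Cool","Normal","Weak","Yes"),     ("Rain","Cool","Normal","Strong","No"),
    ("Overcast","Cool","Normal","Strong","Yes"),("Sunny","Mild","High","Weak","No"),
    ("Sunny","Cool","Normal","Weak","Yes"),    ("Rain","Mild","Normal","Weak","Yes"),
    ("Sunny","Mild","Normal","Strong","Yes"),  ("Overcast","Mild","High","Strong","Yes"),
    ("Overcast","Hot","Normal","Weak","Yes"),  ("Rain","Mild","High","Strong","No"),
]

def prior(label):
    """P(PlayTennis = label), estimated by counting."""
    return sum(1 for row in data if row[-1] == label) / len(data)

def cond(i, value, label):
    """Maximum-likelihood estimate of P(x_i = value | label)."""
    n = sum(1 for row in data if row[-1] == label)
    n_k = sum(1 for row in data if row[-1] == label and row[i] == value)
    return n_k / n

x = ("Sunny", "Cool", "High", "Strong")
score = {}
for label in ("Yes", "No"):
    s = prior(label)
    for i, value in enumerate(x):
        s *= cond(i, value, label)
    score[label] = s

p_no = score["No"] / (score["Yes"] + score["No"])
print(round(score["Yes"], 4), round(score["No"], 4), round(p_no, 3))
```

Running this reproduces the numbers on the slide, including P(no | instance) = 0.795.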

14 Estimating Probabilities

v_NB = argmax_{v ∈ {like, dislike}} P(v) ∏_k P(word_k | v)

How do we estimate P(word_k | v)? As we suggested before, we made a binomial assumption; then:

P(word_k | v) = #(word_k appears in training documents labeled v) / #(documents labeled v)

Sparsity of data is a problem: if n is small, the estimate is not accurate; if n_k is 0, it will dominate the estimate: we will never predict v if a word that never appeared in training (with v) appears in the test data.

15 Robust Estimation of Probabilities

v_NB = argmax_{v ∈ {like, dislike}} P(v) ∏_k P(word_k | v)

This process is called smoothing. There are many ways to do it, some better justified than others; it is an empirical issue. Here:

P(x_k | v) = (n_k + m·p) / (n + m)

where n_k is the # of occurrences of the word in the presence of v; n is the # of occurrences of the label v; p is a prior estimate of P(x_k | v) (e.g., uniform); m is the equivalent sample size (# of labels). Is this a reasonable definition?

16 Robust Estimation of Probabilities

Smoothing:

P(x_k | v) = (n_k + m·p) / (n + m)

Common values. Laplace rule: for the Boolean case, p = 1/2 and m = 2:

P(x_k | v) = (n_k + 1) / (n + 2)

Learning to classify text: p = 1 / (# of values) (uniform), m = # of values.
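As a sketch, the m-estimate above is a one-liner; the default arguments below encode the Laplace rule:

```python
def smoothed(n_k, n, p=0.5, m=2):
    """m-estimate of probability: (n_k + m*p) / (n + m).
    With p = 1/2, m = 2 this is the Laplace rule (n_k + 1)/(n + 2)."""
    return (n_k + m * p) / (n + m)

# A word never seen with this label no longer zeroes out the product:
print(smoothed(0, 9))   # 1/11, not 0
# Laplace rule for a Boolean feature seen 3 times out of 9:
print(smoothed(3, 9))   # (3 + 1) / (9 + 2) = 4/11
```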

17 Robust Estimation

Assume a binomial r.v.: p(k | n, µ) = C(n, k) µ^k (1 - µ)^{n-k}. We saw that the maximum likelihood estimate is µ_ML = k/n. In order to compute the MAP estimate, we need to assume a prior. It is easiest to assume a prior of the form:

p(µ) = µ^{a-1} (1 - µ)^{b-1}    (a and b are called the hyperparameters)

The prior in this case is the Beta distribution, and it is called a conjugate prior, since it has the same form as the posterior. Indeed, it is easy to compute the posterior:

p(µ | D) ∝ p(D | µ) p(µ) = µ^{a+k-1} (1 - µ)^{b+n-k-1}

Therefore, as we have shown before (differentiate the log posterior):

µ_MAP = (k + a - 1) / (n + a + b - 2)

The posterior mean: E(µ | D) = ∫_0^1 µ p(µ | D) dµ = (a + k) / (a + b + n). Under the uniform prior, the posterior mean after observing (k, n) is (k + 1) / (n + 2).
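The closed forms above are easy to sanity-check numerically: maximize the unnormalized log posterior on a grid and compare. A sketch with toy values for k, n, a, b:

```python
import math

def map_estimate(k, n, a, b):
    """mu_MAP = (k + a - 1) / (n + a + b - 2) for a Beta(a, b) prior."""
    return (k + a - 1) / (n + a + b - 2)

def posterior_mean(k, n, a, b):
    """E[mu | D] = (a + k) / (a + b + n)."""
    return (a + k) / (a + b + n)

def log_posterior(mu, k, n, a, b):
    """Unnormalized log posterior: mu^(a+k-1) (1-mu)^(b+n-k-1)."""
    return (a + k - 1) * math.log(mu) + (b + n - k - 1) * math.log(1 - mu)

k, n, a, b = 3, 10, 2, 2
grid = [i / 100000 for i in range(1, 100000)]
best = max(grid, key=lambda mu: log_posterior(mu, k, n, a, b))

print(map_estimate(k, n, a, b))     # closed form: (3+1)/(10+2) = 1/3
print(best)                         # grid maximum agrees to ~1e-5
print(posterior_mean(3, 10, 1, 1))  # uniform prior: (3+1)/(10+2)
```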

18 Naïve Bayes: Two Classes

v_NB = argmax_{v_j ∈ V} P(v_j) ∏_i P(x_i | v_j)

Notice that the naïve Bayes method gives a method for predicting rather than an explicit classifier. In the case of two classes, v ∈ {0, 1}, we predict that v = 1 iff:

P(v=1) ∏_{i=1}^n P(x_i | v=1) / [ P(v=0) ∏_{i=1}^n P(x_i | v=0) ] > 1

19 Naïve Bayes: Two Classes

v_NB = argmax_{v_j ∈ V} P(v_j) ∏_i P(x_i | v_j)

In the case of two classes, v ∈ {0, 1}, we predict that v = 1 iff:

P(v=1) ∏_{i=1}^n P(x_i | v=1) / [ P(v=0) ∏_{i=1}^n P(x_i | v=0) ] > 1

Denote p_i = P(x_i = 1 | v = 1) and q_i = P(x_i = 1 | v = 0). Then we predict v = 1 iff:

P(v=1) ∏_{i=1}^n p_i^{x_i} (1 - p_i)^{1-x_i} / [ P(v=0) ∏_{i=1}^n q_i^{x_i} (1 - q_i)^{1-x_i} ] > 1

20 Naïve Bayes: Two Classes

In the case of two classes, v ∈ {0, 1}, we predict that v = 1 iff:

P(v=1) ∏_i p_i^{x_i} (1 - p_i)^{1-x_i} / [ P(v=0) ∏_i q_i^{x_i} (1 - q_i)^{1-x_i} ]
  = P(v=1) ∏_i (1 - p_i) ∏_i (p_i / (1 - p_i))^{x_i} / [ P(v=0) ∏_i (1 - q_i) ∏_i (q_i / (1 - q_i))^{x_i} ] > 1

21 Naïve Bayes: Two Classes

In the case of two classes, v ∈ {0, 1}, we predict that v = 1 iff:

P(v=1) ∏_i (1 - p_i) ∏_i (p_i / (1 - p_i))^{x_i} / [ P(v=0) ∏_i (1 - q_i) ∏_i (q_i / (1 - q_i))^{x_i} ] > 1

Taking the logarithm, we predict v = 1 iff:

log P(v=1)/P(v=0) + Σ_i log (1 - p_i)/(1 - q_i) + Σ_i ( log p_i/(1 - p_i) - log q_i/(1 - q_i) ) x_i > 0

22 Naïve Bayes: Two Classes

Taking the logarithm, we predict v = 1 iff:

log P(v=1)/P(v=0) + Σ_i log (1 - p_i)/(1 - q_i) + Σ_i ( log p_i/(1 - p_i) - log q_i/(1 - q_i) ) x_i > 0

We get that naïve Bayes is a linear separator, with weights:

w_i = log p_i/(1 - p_i) - log q_i/(1 - q_i) = log [ p_i (1 - q_i) / (q_i (1 - p_i)) ]

If p_i = q_i then w_i = 0 and the feature is irrelevant.
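The claim is easy to verify: compute the weights from (p_i, q_i) and check that the linear score equals the NB log-odds on any input. A sketch with made-up parameter values (p, q, and the priors are for illustration only):

```python
import math

# Naive Bayes over Boolean features as a linear separator:
#   log-odds(x) = sum_i w_i x_i + theta, with
#   w_i   = log(p_i (1 - q_i) / (q_i (1 - p_i)))
#   theta = log(P(v=1)/P(v=0)) + sum_i log((1 - p_i)/(1 - q_i))
# where p_i = P(x_i=1 | v=1), q_i = P(x_i=1 | v=0).
p = [0.8, 0.5, 0.1]
q = [0.3, 0.5, 0.6]
prior1, prior0 = 0.4, 0.6

w = [math.log(pi * (1 - qi) / (qi * (1 - pi))) for pi, qi in zip(p, q)]
theta = math.log(prior1 / prior0) + sum(
    math.log((1 - pi) / (1 - qi)) for pi, qi in zip(p, q))

def log_odds_direct(x):
    """log P(v=1, x) - log P(v=0, x), straight from the NB model."""
    num = math.log(prior1) + sum(
        math.log(pi if xi else 1 - pi) for pi, xi in zip(p, x))
    den = math.log(prior0) + sum(
        math.log(qi if xi else 1 - qi) for qi, xi in zip(q, x))
    return num - den

x = (1, 0, 1)
linear = sum(wi * xi for wi, xi in zip(w, x)) + theta
print(abs(linear - log_odds_direct(x)) < 1e-9)  # True: the two forms agree
print(w[1])  # p_1 == q_1, so this feature's weight is 0.0
```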

23 Naïve Bayes: Two Classes

In the case of two classes we have that:

log [ P(v=1 | x) / P(v=0 | x) ] = Σ_i w_i x_i - b

but since P(v=1 | x) = 1 - P(v=0 | x), we get:

P(v=1 | x) = 1 / (1 + exp(-(Σ_i w_i x_i - b)))

which is simply the logistic function. The linearity of NB provides a better explanation for why it works.

Derivation: let A = P(v=1 | x), B = P(v=0 | x) = 1 - A, and log(B/A) = -C where C = Σ_i w_i x_i - b. Then exp(-C) = B/A = (1 - A)/A = 1/A - 1, so 1 + exp(-C) = 1/A, and A = 1/(1 + exp(-C)).

24 A few more NB examples

25 Example: Learning to Classify Text

v_NB = argmax_{v_j ∈ V} P(v_j) ∏_i P(x_i | v_j)

Instance space X: text documents. Instances are labeled according to f(x) = like/dislike. Goal: learn this function such that, given a new document, you can use it to decide whether you like it or not. How do we represent the document? How do we estimate the probabilities? How do we classify?

26 Document Representation

Instance space X: text documents. Instances are labeled according to y = f(x) = like/dislike. How do we represent a document?

A document will be represented as a list of its words. The representation question can be viewed as the generation question. We have a dictionary of n words (therefore 2n parameters). We have documents of size N: we could account for word position & count, but having a parameter for each word & position may be too much: # of parameters: 2 × N × n (2 × 100 × 50,000 ≈ 10^7). Simplifying assumption: the probability of observing a word in a document is independent of its location. This still allows us to think about two ways of generating the document.

27 Classification via Bayes Rule (B)

We want to compute argmax_y P(y | D) = argmax_y P(D | y) P(y) / P(D) = argmax_y P(D | y) P(y). Our assumptions will go into estimating P(D | y):

1. Multivariate Bernoulli
I. To generate a document, first decide if it is good (y=1) or bad (y=0).
II. Given that, consider your dictionary of words and choose each word w into your document with probability p(w | y), irrespective of anything else.
III. If the size of the dictionary is |V| = n, we can then write:

P(d | y) = ∏_{i=1}^n P(w_i = 1 | y)^{b_i} P(w_i = 0 | y)^{1-b_i}

where p(w = 1/0 | y) is the probability that w does/does not appear in a y-labeled document, and b_i ∈ {0,1} indicates whether word w_i occurs in document d.

Parameters: (1) priors P(y = 0/1); (2) for each w_i in the dictionary, p(w_i = 0/1 | y = 0/1). In total, 2n + 2 parameters. Estimating P(w_i = 1 | y) and P(y) is done in the ML way as before (counting).

28 A Multinomial Model

We want to compute argmax_y P(y | D) = argmax_y P(D | y) P(y) / P(D) = argmax_y P(D | y) P(y). Our assumptions will go into estimating P(D | y):

2. Multinomial
I. To generate a document, first decide if it is good (y=1) or bad (y=0).
II. Given that, place N words into d, such that w_i is placed with probability P(w_i | y), and Σ_i P(w_i | y) = 1.
III. The probability of a document is then:

P(d | y) = N! / (n_1! ... n_k!) · P(w_1 | y)^{n_1} ... P(w_k | y)^{n_k}

where n_i is the # of times w_i appears in the document.

Parameters: (1) priors P(y = 0/1); (2) dictionary probabilities p(w_i | y = 0/1). Same # of parameters, 2n + 2, where n = |Dictionary|, but the estimation is done a bit differently. (HW.)
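In log space the multinomial likelihood above is a short function (using lgamma(n + 1) = log n!). A sketch with an invented three-word vocabulary and made-up parameters:

```python
import math
from math import lgamma

# Multinomial document likelihood, as in the model above:
#   P(d | y) = N!/(n_1! ... n_k!) * prod_i P(w_i | y)^{n_i}
def log_p_doc(counts, word_probs):
    """log P(d | y) for word counts {word: n_i} under P(w_i | y)."""
    n = sum(counts.values())
    log_coef = lgamma(n + 1) - sum(lgamma(c + 1) for c in counts.values())
    return log_coef + sum(c * math.log(word_probs[w]) for w, c in counts.items())

word_probs = {"good": 0.5, "bad": 0.2, "movie": 0.3}  # toy P(w | y), sums to 1
counts = {"good": 2, "movie": 1}                      # a document of length 3

lp = log_p_doc(counts, word_probs)
print(math.exp(lp))  # 3!/(2! 1!) * 0.5^2 * 0.3 = 0.225
```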

29 Model Representation

The generative model in these two cases is different. (The slide shows two plate diagrams: in each, a prior µ governs the label of a document d; in the Bernoulli model the inner plate ranges over dictionary words w, and in the multinomial model over positions p in the document.)

Bernoulli: a binary variable corresponds to a document d and a dictionary word w, and takes the value 1 iff w appears in d. The document topic/label is governed by a prior µ, and the variable in the intersection of the plates is governed by µ and the Bernoulli parameter for the dictionary word w.

Multinomial: variables do not correspond to dictionary words but to positions (occurrences) in the document d. The internal variable is then W(D,P). These variables are generated from the same multinomial distribution, and depend on the topic/label.

30 General NB Scenario

We assume a mixture probability model, parameterized by µ. Different components {c_1, c_2, ..., c_k} of the model are parameterized by disjoint subsets of µ.

The generative story: a document d is created by (1) selecting a component according to the priors, P(c_j | µ), then (2) having the mixture component generate a document according to its own parameters, with distribution P(d | c_j, µ). So we have:

P(d | µ) = Σ_{j=1}^k P(c_j | µ) P(d | c_j, µ)

In the case of document classification, we assume a one-to-one correspondence between components and labels.

31 Naïve Bayes: Continuous Features

X_i can be continuous. We can still use

P(X_1, ..., X_n | Y) = ∏_i P(X_i | Y)

and

P(Y = y | X) ∝ P(Y = y) ∏_i P(X_i | Y = y)

32 Naïve Bayes: Continuous Features (2)

Naïve Bayes classifier:

Y = argmax_y P(Y = y) ∏_i P(X_i | Y = y)

33 Naïve Bayes: Continuous Features (3)

Assumption: P(X_i | Y) has a Gaussian distribution.

34 The Gaussian Probability Distribution

The Gaussian probability distribution, also called the normal distribution, is a continuous distribution with pdf:

p(x) = 1 / (σ √(2π)) · e^{-(x - µ)² / (2σ²)}

where µ is the mean of the distribution and σ² its variance; x is a continuous variable (-∞ ≤ x ≤ ∞). The probability of x lying in a range [a, b] cannot be evaluated analytically (it has to be looked up in a table).

35 Naïve Bayes: Continuous Features

P(X_i | Y) is Gaussian. Training: estimate the mean and standard deviation:

µ_i = E[X_i | Y = y]
σ_i² = E[(X_i - µ_i)² | Y = y]

Note that the following slides abuse notation significantly. Since P(x) = 0 for continuous distributions, we think of P(X = x | Y = y) not as a classic probability distribution, but just as a function f(x) = N(x; µ, σ²). f(x) behaves as a probability distribution in the sense that ∀x, f(x) ≥ 0 and the values integrate to 1. Also, note that f(x) satisfies Bayes rule; that is:

f_Y(y | X = x) = f_X(x | Y = y) f_Y(y) / f_X(x)

36-37 Naïve Bayes: Continuous Features

P(X_i | Y) is Gaussian. Training: estimate the mean and standard deviation:

µ_i = E[X_i | Y = y]
σ_i² = E[(X_i - µ_i)² | Y = y]

(These two slides worked through a small training table with features X_1, X_2, X_3 and label Y; the numeric entries did not survive transcription.)
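Training and prediction for Gaussian naïve Bayes can be sketched in a few lines; the tiny dataset below is invented for illustration (it is not the slide's table):

```python
import math

# Gaussian naive Bayes: per class y and feature i, estimate the mean
# and standard deviation, then score new points with the normal pdf.
data = [
    ((1.0, 2.1), 0), ((1.2, 1.9), 0), ((0.8, 2.0), 0),
    ((3.0, 0.9), 1), ((3.2, 1.1), 1), ((2.8, 1.0), 1),
]

def fit(data):
    """Return {label: (prior, per-feature means, per-feature stds)}."""
    params = {}
    for y in {label for _, label in data}:
        rows = [x for x, label in data if label == y]
        n = len(rows)
        mus = [sum(col) / n for col in zip(*rows)]
        sigmas = [math.sqrt(sum((v - mu) ** 2 for v in col) / n)
                  for col, mu in zip(zip(*rows), mus)]
        params[y] = (n / len(data), mus, sigmas)
    return params

def normal_pdf(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def predict(params, x):
    def score(y):
        prior, mus, sigmas = params[y]
        s = math.log(prior)
        for xi, mu, sigma in zip(x, mus, sigmas):
            s += math.log(normal_pdf(xi, mu, sigma))
        return s
    return max(params, key=score)

params = fit(data)
print(predict(params, (1.1, 2.0)))  # class 0
print(predict(params, (3.1, 1.0)))  # class 1
```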

38 Recall: Naïve Bayes, Two Classes

In the case of two classes we have that:

log [ P(v=1 | x) / P(v=0 | x) ] = Σ_i w_i x_i - b

but since P(v=1 | x) = 1 - P(v=0 | x), we get:

P(v=1 | x) = 1 / (1 + exp(-(Σ_i w_i x_i - b)))

which is simply the logistic function (also used in the neural network representation). The same formula can be written for continuous features.

39 Logistic Function: Continuous Features

The logistic function for Gaussian features. Note that we are using a ratio of probabilities (densities), since x is a continuous variable.

40 Hidden Markov Model (HMM)

A probabilistic generative model: it models the generation of an observed sequence. At each time step there are two variables: the current state (hidden) and an observation.

s_1 → s_2 → s_3 → s_4 → s_5 → s_6
 |     |     |     |     |     |
o_1   o_2   o_3   o_4   o_5   o_6

Elements:
Initial state probability P(s_1) (|S| parameters)
Transition probability P(s_t | s_{t-1}) (|S|² parameters)
Observation probability P(o_t | s_t) (|S| × |O| parameters)

As before, the graphical model is an encoding of the independence assumptions:

P(s_t | s_{t-1}, s_{t-2}, ..., s_1) = P(s_t | s_{t-1})
P(o_t | s_T, ..., s_t, ..., s_1, o_T, ..., o_t, ..., o_1) = P(o_t | s_t)

Examples: POS tagging, sequential segmentation.

41 HMM for Shallow Parsing

States: {B, I, O}. Observations: actual words and/or part-of-speech tags.

s_1=B    s_2=I      s_3=O       s_4=B    s_5=I    s_6=O
o_1=Mr.  o_2=Brown  o_3=blamed  o_4=Mr.  o_5=Bob  o_6=for

42 HMM for Shallow Parsing

s_1=B    s_2=I      s_3=O       s_4=B    s_5=I    s_6=O
o_1=Mr.  o_2=Brown  o_3=blamed  o_4=Mr.  o_5=Bob  o_6=for

Initial state probability: P(s_1=B), P(s_1=I), P(s_1=O)
Transition probability: P(s_t=B | s_{t-1}=B), P(s_t=I | s_{t-1}=B), P(s_t=O | s_{t-1}=B), P(s_t=B | s_{t-1}=I), ...
Observation probability: P(o_t=Mr. | s_t=B), P(o_t=Brown | s_t=B), ..., P(o_t=Mr. | s_t=I), P(o_t=Brown | s_t=I), ...

Given a sentence, we can ask what the most likely state sequence is.

43 Three Computational Problems

Decoding: finding the most likely path. Have: model, parameters, observations (data). Want: the most likely state sequence:

S*_1 S*_2 ... S*_T = argmax_{S_1 S_2 ... S_T} p(S_1 S_2 ... S_T | O_1 O_2 ... O_T) = argmax_{S_1 S_2 ... S_T} p(S_1 S_2 ... S_T, O_1 O_2 ... O_T)

Evaluation: computing the observation likelihood. Have: model, parameters, observations (data). Want: the likelihood of generating the observed data:

p(O_1 O_2 ... O_T) = Σ_{S_1 S_2 ... S_T} p(O_1 O_2 ... O_T | S_1 S_2 ... S_T) p(S_1 S_2 ... S_T)

In both cases, a simple-minded solution takes |S|^T steps.

Training: estimating the parameters. Supervised: have model and annotated data (data + state sequences). Unsupervised: have model and data. Want: parameters.

44 Finding the Most Likely State Sequence in an HMM (1)

P(s_k, s_{k-1}, ..., s_1, o_k, o_{k-1}, ..., o_1)
= P(o_k | o_{k-1}, ..., o_1, s_k, s_{k-1}, ..., s_1) P(o_{k-1}, ..., o_1, s_k, s_{k-1}, ..., s_1)
= P(o_k | s_k) P(o_{k-1}, ..., o_1, s_k, s_{k-1}, ..., s_1)
= P(o_k | s_k) P(s_k | s_{k-1}, ..., s_1, o_{k-1}, ..., o_1) P(s_{k-1}, ..., s_1, o_{k-1}, ..., o_1)
= P(o_k | s_k) P(s_k | s_{k-1}) P(s_{k-1}, ..., s_1, o_{k-1}, ..., o_1)
= P(o_k | s_k) [ ∏_{t=1}^{k-1} P(s_{t+1} | s_t) P(o_t | s_t) ] P(s_1)

45 Finding the Most Likely State Sequence in an HMM (2)

argmax_{s_k, s_{k-1}, ..., s_1} P(s_k, s_{k-1}, ..., s_1 | o_k, o_{k-1}, ..., o_1)
= argmax_{s_k, ..., s_1} P(s_k, s_{k-1}, ..., s_1, o_k, o_{k-1}, ..., o_1) / P(o_k, o_{k-1}, ..., o_1)
= argmax_{s_k, ..., s_1} P(s_k, s_{k-1}, ..., s_1, o_k, o_{k-1}, ..., o_1)
= argmax_{s_k, ..., s_1} P(o_k | s_k) [ ∏_{t=1}^{k-1} P(s_{t+1} | s_t) P(o_t | s_t) ] P(s_1)

46 Finding the Most Likely State Sequence in an HMM (3)

max_{s_k, ..., s_1} P(o_k | s_k) [ ∏_{t=1}^{k-1} P(s_{t+1} | s_t) P(o_t | s_t) ] P(s_1)
= max_{s_k} P(o_k | s_k) max_{s_{k-1}, ..., s_1} [ ∏_{t=1}^{k-1} P(s_{t+1} | s_t) P(o_t | s_t) ] P(s_1)
= max_{s_k} P(o_k | s_k) max_{s_{k-1}} [ P(s_k | s_{k-1}) P(o_{k-1} | s_{k-1}) ] max_{s_{k-2}, ..., s_1} [ ∏_{t=1}^{k-2} P(s_{t+1} | s_t) P(o_t | s_t) ] P(s_1)
= max_{s_k} P(o_k | s_k) max_{s_{k-1}} [ P(s_k | s_{k-1}) P(o_{k-1} | s_{k-1}) ] max_{s_{k-2}} [ P(s_{k-1} | s_{k-2}) P(o_{k-2} | s_{k-2}) ] ... max_{s_1} [ P(s_2 | s_1) P(o_1 | s_1) ] P(s_1)

Each inner max is a function of the state variable immediately outside it.

47 Finding the Most Likely State Sequence in an HMM (4)

max_{s_k} P(o_k | s_k) max_{s_{k-1}} [ P(s_k | s_{k-1}) P(o_{k-1} | s_{k-1}) ] max_{s_{k-2}} [ P(s_{k-1} | s_{k-2}) P(o_{k-2} | s_{k-2}) ] ... max_{s_2} [ P(s_3 | s_2) P(o_2 | s_2) ] max_{s_1} [ P(s_2 | s_1) P(o_1 | s_1) ] P(s_1)

Viterbi's algorithm: dynamic programming.

48 Learning the Model

Estimate: the initial state probability P(s_1), the transition probability P(s_t | s_{t-1}), and the observation probability P(o_t | s_t).

Unsupervised learning (states are not observed): EM algorithm.
Supervised learning (states are observed; more common): ML estimates of the above terms directly from the data.

Notice that this is completely analogous to the case of naïve Bayes, and essentially all other models.


More information

Part I: Background on the Binomial Distribution

Part I: Background on the Binomial Distribution Part I: Bacgroud o the Bomal Dstrbuto A radom varable s sad to have a Beroull dstrbuto f t taes o the value wth probablt "p" ad the value wth probablt " - p". The umber of "successes" "" depedet Beroull

More information

MA/CSSE 473 Day 27. Dynamic programming

MA/CSSE 473 Day 27. Dynamic programming MA/CSSE 473 Day 7 Dyamc Programmg Bomal Coeffcets Warshall's algorthm (Optmal BSTs) Studet questos? Dyamc programmg Used for problems wth recursve solutos ad overlappg subproblems Typcally, we save (memoze)

More information

UNIVERSITY OF OSLO DEPARTMENT OF ECONOMICS

UNIVERSITY OF OSLO DEPARTMENT OF ECONOMICS UNIVERSITY OF OSLO DEPARTMENT OF ECONOMICS Postpoed exam: ECON430 Statstcs Date of exam: Jauary 0, 0 Tme for exam: 09:00 a.m. :00 oo The problem set covers 5 pages Resources allowed: All wrtte ad prted

More information

TESTS BASED ON MAXIMUM LIKELIHOOD

TESTS BASED ON MAXIMUM LIKELIHOOD ESE 5 Toy E. Smth. The Basc Example. TESTS BASED ON MAXIMUM LIKELIHOOD To llustrate the propertes of maxmum lkelhood estmates ad tests, we cosder the smplest possble case of estmatg the mea of the ormal

More information

Parameter Estimation

Parameter Estimation arameter Estmato robabltes Notatoal Coveto Mass dscrete fucto: catal letters Desty cotuous fucto: small letters Vector vs. scalar Scalar: la Vector: bold D: small Hgher dmeso: catal Notes a cotuous state

More information

Continuous Distributions

Continuous Distributions 7//3 Cotuous Dstrbutos Radom Varables of the Cotuous Type Desty Curve Percet Desty fucto, f (x) A smooth curve that ft the dstrbuto 3 4 5 6 7 8 9 Test scores Desty Curve Percet Probablty Desty Fucto, f

More information

{ }{ ( )} (, ) = ( ) ( ) ( ) Chapter 14 Exercises in Sampling Theory. Exercise 1 (Simple random sampling): Solution:

{ }{ ( )} (, ) = ( ) ( ) ( ) Chapter 14 Exercises in Sampling Theory. Exercise 1 (Simple random sampling): Solution: Chapter 4 Exercses Samplg Theory Exercse (Smple radom samplg: Let there be two correlated radom varables X ad A sample of sze s draw from a populato by smple radom samplg wthout replacemet The observed

More information

Linear Regression Linear Regression with Shrinkage. Some slides are due to Tommi Jaakkola, MIT AI Lab

Linear Regression Linear Regression with Shrinkage. Some slides are due to Tommi Jaakkola, MIT AI Lab Lear Regresso Lear Regresso th Shrkage Some sldes are due to Tomm Jaakkola, MIT AI Lab Itroducto The goal of regresso s to make quattatve real valued predctos o the bass of a vector of features or attrbutes.

More information

THE ROYAL STATISTICAL SOCIETY HIGHER CERTIFICATE

THE ROYAL STATISTICAL SOCIETY HIGHER CERTIFICATE THE ROYAL STATISTICAL SOCIETY 00 EXAMINATIONS SOLUTIONS HIGHER CERTIFICATE PAPER I STATISTICAL THEORY The Socety provdes these solutos to assst caddates preparg for the examatos future years ad for the

More information

For combinatorial problems we might need to generate all permutations, combinations, or subsets of a set.

For combinatorial problems we might need to generate all permutations, combinations, or subsets of a set. Addtoal Decrease ad Coquer Algorthms For combatoral problems we mght eed to geerate all permutatos, combatos, or subsets of a set. Geeratg Permutatos If we have a set f elemets: { a 1, a 2, a 3, a } the

More information

THE ROYAL STATISTICAL SOCIETY 2010 EXAMINATIONS SOLUTIONS GRADUATE DIPLOMA MODULE 2 STATISTICAL INFERENCE

THE ROYAL STATISTICAL SOCIETY 2010 EXAMINATIONS SOLUTIONS GRADUATE DIPLOMA MODULE 2 STATISTICAL INFERENCE THE ROYAL STATISTICAL SOCIETY 00 EXAMINATIONS SOLUTIONS GRADUATE DIPLOMA MODULE STATISTICAL INFERENCE The Socety provdes these solutos to assst caddates preparg for the examatos future years ad for the

More information

Special Instructions / Useful Data

Special Instructions / Useful Data JAM 6 Set of all real umbers P A..d. B, p Posso Specal Istructos / Useful Data x,, :,,, x x Probablty of a evet A Idepedetly ad detcally dstrbuted Bomal dstrbuto wth parameters ad p Posso dstrbuto wth

More information

Bayesian belief networks

Bayesian belief networks Lecture 14 ayesa belef etworks los Hauskrecht mlos@cs.ptt.edu 5329 Seott Square Desty estmato Data: D { D1 D2.. D} D x a vector of attrbute values ttrbutes: modeled by radom varables { 1 2 d} wth: otuous

More information

(b) By independence, the probability that the string 1011 is received correctly is

(b) By independence, the probability that the string 1011 is received correctly is Soluto to Problem 1.31. (a) Let A be the evet that a 0 s trasmtted. Usg the total probablty theorem, the desred probablty s P(A)(1 ɛ ( 0)+ 1 P(A) ) (1 ɛ 1)=p(1 ɛ 0)+(1 p)(1 ɛ 1). (b) By depedece, the probablty

More information

L5 Polynomial / Spline Curves

L5 Polynomial / Spline Curves L5 Polyomal / Sple Curves Cotets Coc sectos Polyomal Curves Hermte Curves Bezer Curves B-Sples No-Uform Ratoal B-Sples (NURBS) Mapulato ad Represetato of Curves Types of Curve Equatos Implct: Descrbe a

More information

Chapter 8: Statistical Analysis of Simulated Data

Chapter 8: Statistical Analysis of Simulated Data Marquette Uversty MSCS600 Chapter 8: Statstcal Aalyss of Smulated Data Dael B. Rowe, Ph.D. Departmet of Mathematcs, Statstcs, ad Computer Scece Copyrght 08 by Marquette Uversty MSCS600 Ageda 8. The Sample

More information

X ε ) = 0, or equivalently, lim

X ε ) = 0, or equivalently, lim Revew for the prevous lecture Cocepts: order statstcs Theorems: Dstrbutos of order statstcs Examples: How to get the dstrbuto of order statstcs Chapter 5 Propertes of a Radom Sample Secto 55 Covergece

More information

A tighter lower bound on the circuit size of the hardest Boolean functions

A tighter lower bound on the circuit size of the hardest Boolean functions Electroc Colloquum o Computatoal Complexty, Report No. 86 2011) A tghter lower boud o the crcut sze of the hardest Boolea fuctos Masak Yamamoto Abstract I [IPL2005], Fradse ad Mlterse mproved bouds o the

More information

Summary of the lecture in Biostatistics

Summary of the lecture in Biostatistics Summary of the lecture Bostatstcs Probablty Desty Fucto For a cotuos radom varable, a probablty desty fucto s a fucto such that: 0 dx a b) b a dx A probablty desty fucto provdes a smple descrpto of the

More information

18.413: Error Correcting Codes Lab March 2, Lecture 8

18.413: Error Correcting Codes Lab March 2, Lecture 8 18.413: Error Correctg Codes Lab March 2, 2004 Lecturer: Dael A. Spelma Lecture 8 8.1 Vector Spaces A set C {0, 1} s a vector space f for x all C ad y C, x + y C, where we take addto to be compoet wse

More information

Multiple Regression. More than 2 variables! Grade on Final. Multiple Regression 11/21/2012. Exam 2 Grades. Exam 2 Re-grades

Multiple Regression. More than 2 variables! Grade on Final. Multiple Regression 11/21/2012. Exam 2 Grades. Exam 2 Re-grades STAT 101 Dr. Kar Lock Morga 11/20/12 Exam 2 Grades Multple Regresso SECTIONS 9.2, 10.1, 10.2 Multple explaatory varables (10.1) Parttog varablty R 2, ANOVA (9.2) Codtos resdual plot (10.2) Trasformatos

More information

CS 3710 Advanced Topics in AI Lecture 17. Density estimation. CS 3710 Probabilistic graphical models. Administration

CS 3710 Advanced Topics in AI Lecture 17. Density estimation. CS 3710 Probabilistic graphical models. Administration CS 37 Avace Topcs AI Lecture 7 esty estmato Mlos Hauskrecht mlos@cs.ptt.eu 539 Seott Square CS 37 robablstc graphcal moels Amstrato Mterm: A take-home exam week ue o Weesay ovember 5 before the class epes

More information

Objectives of Multiple Regression

Objectives of Multiple Regression Obectves of Multple Regresso Establsh the lear equato that best predcts values of a depedet varable Y usg more tha oe eplaator varable from a large set of potetal predctors {,,... k }. Fd that subset of

More information

Dimensionality reduction Feature selection

Dimensionality reduction Feature selection CS 750 Mache Learg Lecture 3 Dmesoalty reducto Feature selecto Mlos Hauskrecht mlos@cs.ptt.edu 539 Seott Square CS 750 Mache Learg Dmesoalty reducto. Motvato. Classfcato problem eample: We have a put data

More information

means the first term, a2 means the term, etc. Infinite Sequences: follow the same pattern forever.

means the first term, a2 means the term, etc. Infinite Sequences: follow the same pattern forever. 9.4 Sequeces ad Seres Pre Calculus 9.4 SEQUENCES AND SERIES Learg Targets:. Wrte the terms of a explctly defed sequece.. Wrte the terms of a recursvely defed sequece. 3. Determe whether a sequece s arthmetc,

More information

THE ROYAL STATISTICAL SOCIETY 2016 EXAMINATIONS SOLUTIONS HIGHER CERTIFICATE MODULE 5

THE ROYAL STATISTICAL SOCIETY 2016 EXAMINATIONS SOLUTIONS HIGHER CERTIFICATE MODULE 5 THE ROYAL STATISTICAL SOCIETY 06 EAMINATIONS SOLUTIONS HIGHER CERTIFICATE MODULE 5 The Socety s provdg these solutos to assst cadtes preparg for the examatos 07. The solutos are teded as learg ads ad should

More information

8.1 Hashing Algorithms

8.1 Hashing Algorithms CS787: Advaced Algorthms Scrbe: Mayak Maheshwar, Chrs Hrchs Lecturer: Shuch Chawla Topc: Hashg ad NP-Completeess Date: September 21 2007 Prevously we looked at applcatos of radomzed algorthms, ad bega

More information

Bayes Estimator for Exponential Distribution with Extension of Jeffery Prior Information

Bayes Estimator for Exponential Distribution with Extension of Jeffery Prior Information Malaysa Joural of Mathematcal Sceces (): 97- (9) Bayes Estmator for Expoetal Dstrbuto wth Exteso of Jeffery Pror Iformato Hadeel Salm Al-Kutub ad Noor Akma Ibrahm Isttute for Mathematcal Research, Uverst

More information

Kernel-based Methods and Support Vector Machines

Kernel-based Methods and Support Vector Machines Kerel-based Methods ad Support Vector Maches Larr Holder CptS 570 Mache Learg School of Electrcal Egeerg ad Computer Scece Washgto State Uverst Refereces Muller et al. A Itroducto to Kerel-Based Learg

More information

å 1 13 Practice Final Examination Solutions - = CS109 Dec 5, 2018

å 1 13 Practice Final Examination Solutions - = CS109 Dec 5, 2018 Chrs Pech Fal Practce CS09 Dec 5, 08 Practce Fal Examato Solutos. Aswer: 4/5 8/7. There are multle ways to obta ths aswer; here are two: The frst commo method s to sum over all ossbltes for the rak of

More information

KLT Tracker. Alignment. 1. Detect Harris corners in the first frame. 2. For each Harris corner compute motion between consecutive frames

KLT Tracker. Alignment. 1. Detect Harris corners in the first frame. 2. For each Harris corner compute motion between consecutive frames KLT Tracker Tracker. Detect Harrs corers the frst frame 2. For each Harrs corer compute moto betwee cosecutve frames (Algmet). 3. Lk moto vectors successve frames to get a track 4. Itroduce ew Harrs pots

More information

Introduction to Matrices and Matrix Approach to Simple Linear Regression

Introduction to Matrices and Matrix Approach to Simple Linear Regression Itroducto to Matrces ad Matrx Approach to Smple Lear Regresso Matrces Defto: A matrx s a rectagular array of umbers or symbolc elemets I may applcatos, the rows of a matrx wll represet dvduals cases (people,

More information

Supervised learning: Linear regression Logistic regression

Supervised learning: Linear regression Logistic regression CS 57 Itroducto to AI Lecture 4 Supervsed learg: Lear regresso Logstc regresso Mlos Hauskrecht mlos@cs.ptt.edu 539 Seott Square CS 57 Itro to AI Data: D { D D.. D D Supervsed learg d a set of eamples s

More information

The Selection Problem - Variable Size Decrease/Conquer (Practice with algorithm analysis)

The Selection Problem - Variable Size Decrease/Conquer (Practice with algorithm analysis) We have covered: Selecto, Iserto, Mergesort, Bubblesort, Heapsort Next: Selecto the Qucksort The Selecto Problem - Varable Sze Decrease/Coquer (Practce wth algorthm aalyss) Cosder the problem of fdg the

More information

CS 2750 Machine Learning. Lecture 8. Linear regression. CS 2750 Machine Learning. Linear regression. is a linear combination of input components x

CS 2750 Machine Learning. Lecture 8. Linear regression. CS 2750 Machine Learning. Linear regression. is a linear combination of input components x CS 75 Mache Learg Lecture 8 Lear regresso Mlos Hauskrecht mlos@cs.ptt.edu 539 Seott Square CS 75 Mache Learg Lear regresso Fucto f : X Y s a lear combato of put compoets f + + + K d d K k - parameters

More information

Chapter 2 Supplemental Text Material

Chapter 2 Supplemental Text Material -. Models for the Data ad the t-test Chapter upplemetal Text Materal The model preseted the text, equato (-3) s more properl called a meas model. ce the mea s a locato parameter, ths tpe of model s also

More information

ENGI 3423 Simple Linear Regression Page 12-01

ENGI 3423 Simple Linear Regression Page 12-01 ENGI 343 mple Lear Regresso Page - mple Lear Regresso ometmes a expermet s set up where the expermeter has cotrol over the values of oe or more varables X ad measures the resultg values of aother varable

More information

18.657: Mathematics of Machine Learning

18.657: Mathematics of Machine Learning 8.657: Mathematcs of Mache Learg Lecturer: Phlppe Rgollet Lecture 3 Scrbe: James Hrst Sep. 6, 205.5 Learg wth a fte dctoary Recall from the ed of last lecture our setup: We are workg wth a fte dctoary

More information

Ordinary Least Squares Regression. Simple Regression. Algebra and Assumptions.

Ordinary Least Squares Regression. Simple Regression. Algebra and Assumptions. Ordary Least Squares egresso. Smple egresso. Algebra ad Assumptos. I ths part of the course we are gog to study a techque for aalysg the lear relatoshp betwee two varables Y ad X. We have pars of observatos

More information

Multivariate Transformation of Variables and Maximum Likelihood Estimation

Multivariate Transformation of Variables and Maximum Likelihood Estimation Marquette Uversty Multvarate Trasformato of Varables ad Maxmum Lkelhood Estmato Dael B. Rowe, Ph.D. Assocate Professor Departmet of Mathematcs, Statstcs, ad Computer Scece Copyrght 03 by Marquette Uversty

More information

The Mathematical Appendix

The Mathematical Appendix The Mathematcal Appedx Defto A: If ( Λ, Ω, where ( λ λ λ whch the probablty dstrbutos,,..., Defto A. uppose that ( Λ,,..., s a expermet type, the σ-algebra o λ λ λ are defed s deoted by ( (,,...,, σ Ω.

More information

Solutions to Odd-Numbered End-of-Chapter Exercises: Chapter 17

Solutions to Odd-Numbered End-of-Chapter Exercises: Chapter 17 Itroucto to Ecoometrcs (3 r Upate Eto) by James H. Stock a Mark W. Watso Solutos to O-Numbere E-of-Chapter Exercses: Chapter 7 (Ths erso August 7, 04) 05 Pearso Eucato, Ic. Stock/Watso - Itroucto to Ecoometrcs

More information

Nonparametric Density Estimation Intro

Nonparametric Density Estimation Intro Noarametrc Desty Estmato Itro Parze Wdows No-Parametrc Methods Nether robablty dstrbuto or dscrmat fucto s kow Haes qute ofte All we have s labeled data a lot s kow easer salmo bass salmo salmo Estmate

More information

SPECIAL CONSIDERATIONS FOR VOLUMETRIC Z-TEST FOR PROPORTIONS

SPECIAL CONSIDERATIONS FOR VOLUMETRIC Z-TEST FOR PROPORTIONS SPECIAL CONSIDERAIONS FOR VOLUMERIC Z-ES FOR PROPORIONS Oe s stctve reacto to the questo of whether two percetages are sgfcatly dfferet from each other s to treat them as f they were proportos whch the

More information

ESTIMATION OF MISCLASSIFICATION ERROR USING BAYESIAN CLASSIFIERS

ESTIMATION OF MISCLASSIFICATION ERROR USING BAYESIAN CLASSIFIERS Producto Systems ad Iformato Egeerg Volume 5 (2009), pp. 4-50. ESTIMATION OF MISCLASSIFICATION ERROR USING BAYESIAN CLASSIFIERS PÉTER BARABÁS Uversty of Msolc, Hugary Departmet of Iformato Techology barabas@t.u-msolc.hu

More information

CS 2750 Machine Learning Lecture 5. Density estimation. Density estimation

CS 2750 Machine Learning Lecture 5. Density estimation. Density estimation CS 750 Mache Learg Lecture 5 esty estmato Mlos Hausrecht mlos@tt.edu 539 Seott Square esty estmato esty estmato: s a usuervsed learg roblem Goal: Lear a model that rereset the relatos amog attrbutes the

More information

CHAPTER 3 POSTERIOR DISTRIBUTIONS

CHAPTER 3 POSTERIOR DISTRIBUTIONS CHAPTER 3 POSTERIOR DISTRIBUTIONS If scece caot measure the degree of probablt volved, so much the worse for scece. The practcal ma wll stck to hs apprecatve methods utl t does, or wll accept the results

More information

Machine Learning. Introduction to Regression. Le Song. CSE6740/CS7641/ISYE6740, Fall 2012

Machine Learning. Introduction to Regression. Le Song. CSE6740/CS7641/ISYE6740, Fall 2012 Mache Learg CSE6740/CS764/ISYE6740, Fall 0 Itroducto to Regresso Le Sog Lecture 4, August 30, 0 Based o sldes from Erc g, CMU Readg: Chap. 3, CB Mache learg for apartmet hutg Suppose ou are to move to

More information

The number of observed cases The number of parameters. ith case of the dichotomous dependent variable. the ith case of the jth parameter

The number of observed cases The number of parameters. ith case of the dichotomous dependent variable. the ith case of the jth parameter LOGISTIC REGRESSION Notato Model Logstc regresso regresses a dchotomous depedet varable o a set of depedet varables. Several methods are mplemeted for selectg the depedet varables. The followg otato s

More information

Interpolated Markov Models for Gene Finding

Interpolated Markov Models for Gene Finding Iterpolated Markov Models for Gee Fdg BMI/CS 776 www.bostat.wsc.edu/bm776/ Sprg 2009 Mark Crave crave@bostat.wsc.edu The Gee Fdg Task Gve: a ucharacterzed DNA sequece Do: locate the gees the sequece, cludg

More information

Lecture 02: Bounding tail distributions of a random variable

Lecture 02: Bounding tail distributions of a random variable CSCI-B609: A Theorst s Toolkt, Fall 206 Aug 25 Lecture 02: Boudg tal dstrbutos of a radom varable Lecturer: Yua Zhou Scrbe: Yua Xe & Yua Zhou Let us cosder the ubased co flps aga. I.e. let the outcome

More information

An Introduction to. Support Vector Machine

An Introduction to. Support Vector Machine A Itroducto to Support Vector Mache Support Vector Mache (SVM) A classfer derved from statstcal learg theory by Vapk, et al. 99 SVM became famous whe, usg mages as put, t gave accuracy comparable to eural-etwork

More information