Graphical Models and Conditional Random Fields


1 Graphical Models and Conditional Random Fields
Presenter: Shih-Hsiang Lin
References: Bishop, C. M., Pattern Recognition and Machine Learning, Springer, 2006; Sutton, C., McCallum, A., An Introduction to Conditional Random Fields for Relational Learning, in Introduction to Statistical Relational Learning, MIT Press, 2007; Rahul Gupta, Conditional Random Fields, Dept. of Computer Science and Engg., IIT Bombay, India.

2 Overview
Introduction to graphical models
Applications of graphical models
More detail on conditional random fields
Conclusions

3 Introduction to Graphical Models

4 Power of Probabilistic Graphical Models
Why do we need graphical models?
Graphs are an intuitive way of representing and visualizing the relationships between many variables, and can be used to design and motivate new models.
A graph allows us to abstract the conditional independence relationships between the variables away from the details of their parametric forms, providing new insights into existing models.
Graphical models allow us to define general message-passing algorithms that implement probabilistic inference efficiently: graph-based algorithms for calculation and computation.
Probability Theory + Graph Theory = Probabilistic Graphical Models

5 Probability Theory
What do we need to know in advance?
Sum rule (law of total probability, or marginal probability): p(x) = \sum_y p(x, y)
Product rule (chain rule): p(x, y) = p(y|x) p(x)
From the above we can derive Bayes' theorem: p(y|x) = p(x|y) p(y) / p(x), where p(x) = \sum_y p(x|y) p(y)
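As a quick numeric sanity check, this sketch (toy numbers, not from the slides) verifies the sum rule, product rule, and Bayes' theorem on a two-by-two joint distribution:

```python
import numpy as np

# A 2x2 toy joint p(x, y): rows index x, columns index y (made-up numbers).
p_xy = np.array([[0.10, 0.30],
                 [0.20, 0.40]])

p_x = p_xy.sum(axis=1)                     # sum rule: p(x) = sum_y p(x, y)
p_y = p_xy.sum(axis=0)
p_y_given_x = p_xy / p_x[:, None]          # product rule: p(x, y) = p(y|x) p(x)

# Bayes' theorem: p(x|y) = p(y|x) p(x) / p(y); compare with the direct conditional.
bayes = p_y_given_x * p_x[:, None] / p_y[None, :]
print(np.allclose(bayes, p_xy / p_y[None, :]))   # True
```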

6 Conditional Independence and Marginal Independence
Conditional independence: x ⊥ y | z, which is equivalent to saying
p(x, y | z) = p(x | z) p(y | z), or p(x | y, z) = p(x | z)
Conditional independence is crucial in practical applications, since we can rarely work with a general joint distribution.
Marginal independence: x ⊥ y | ∅ (the empty set), i.e. p(x, y) = p(x) p(y)
Examples:
Amount of Speeding Fine ⊥ Type of Car | Speed
Lung Cancer ⊥ Yellow Teeth | Smoking
Child's Genes ⊥ Grandparents' Genes | Parents' Genes
Ability of Team A ⊥ Ability of Team B

7 Graphical Models
[Figure: a directed graph and an undirected graph over nodes a, b, c]
A graphical model comprises nodes connected by links.
Nodes (vertices) correspond to random variables; links (edges or arcs) represent the relationships between the variables.
Directed graphs are useful for expressing causal relationships between random variables.
Undirected graphs are better suited to expressing soft constraints between random variables.

8 Directed Graphs
Consider an arbitrary distribution p(a, b, c). By successive application of the product rule, we can write the joint distribution in the form
p(a, b, c) = p(c | a, b) p(a, b), or p(a, b, c) = p(c | a, b) p(b | a) p(a)
Note that this decomposition holds for any choice of joint distribution.
We can then represent the above equation in terms of a simple graphical model:
First, we introduce a node for each of the random variables.
Second, for each conditional distribution we add directed links to the graph.
[Figure: a fully connected directed graph over a, b, c]

9 Directed Graphs (cont.)
Let us consider another case, the joint distribution over x_1, ..., x_7:
p(x_1, ..., x_7) = p(x_7 | x_1, ..., x_6) p(x_6 | x_1, ..., x_5) ... p(x_2 | x_1) p(x_1)
Again, it is a fully connected graph.
What would happen if some links were dropped? (Consider the relationship between nodes.)

10 Directed Graphs (cont.)
For the graph on the slide, the joint distribution of x_1, ..., x_7 is therefore given by
p(x_1, x_2, x_3, x_4, x_5, x_6, x_7) = p(x_1) p(x_2) p(x_3) p(x_4 | x_1, x_2, x_3) p(x_5 | x_1, x_3) p(x_6 | x_4) p(x_7 | x_4, x_5)
The joint distribution is thus defined by the product of a conditional distribution for each node, conditioned on the variables corresponding to the parents of that node in the graph. For a graph with K nodes, the joint distribution is given by
p(x_1, ..., x_K) = \prod_{k=1}^{K} p(x_k | parents(x_k))
where parents(x_k) denotes the set of parents of x_k.
We always restrict the directed graph to have no directed cycles. Such graphs are also called directed acyclic graphs (DAGs) or Bayesian networks.
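A minimal sketch of this factorization for the seven-node example, with binary variables and made-up conditional probability tables, confirming that the product of local conditionals defines a valid joint:

```python
import itertools
import numpy as np

# parents[k] lists the parents of x_k; each CPT is a random toy table with
# cpt[k][parent values..., x_k] = p(x_k | parents(x_k)), all variables binary.
rng = np.random.default_rng(0)
parents = {1: [], 2: [], 3: [], 4: [1, 2, 3], 5: [1, 3], 6: [4], 7: [4, 5]}
cpt = {}
for k, ps in parents.items():
    t = rng.uniform(size=(2,) * len(ps) + (2,))
    cpt[k] = t / t.sum(axis=-1, keepdims=True)     # normalize over x_k

def joint(x):
    """p(x_1..x_7) = prod_k p(x_k | parents(x_k)) for an assignment dict x."""
    p = 1.0
    for k, ps in parents.items():
        p *= cpt[k][tuple(x[j] for j in ps) + (x[k],)]
    return p

total = sum(joint(dict(zip(range(1, 8), bits)))
            for bits in itertools.product([0, 1], repeat=7))
print(total)   # ~1.0: the factorization defines a valid joint distribution
```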

11 Directed Graph: Conditional Independence
Joint distribution over 3 variables specified by the graph a ← c → b:
p(a, b, c) = p(a | c) p(b | c) p(c)
If node c is not observed:
p(a, b) = \sum_c p(a | c) p(b | c) p(c) ≠ p(a) p(b) in general, so a ⊥ b | ∅ does not hold.
If node c is observed:
p(a, b | c) = p(a, b, c) / p(c) = p(a | c) p(b | c), so a ⊥ b | c.
The node c is said to be tail-to-tail w.r.t. this path from node a to node b: the observation blocks the path from a to b and causes a and b to become conditionally independent.

12 Directed Graph: Conditional Independence (cont.)
The second example, the graph a → c → b:
p(a, b, c) = p(a) p(c | a) p(b | c)
If node c is not observed:
p(a, b) = p(a) \sum_c p(c | a) p(b | c) = p(a) p(b | a) ≠ p(a) p(b) in general.
If node c is observed:
p(a, b | c) = p(a, b, c) / p(c) = p(a) p(c | a) p(b | c) / p(c) = p(a | c) p(b | c), so a ⊥ b | c.
The node c is said to be head-to-tail w.r.t. this path from node a to node b: the observation blocks the path from a to b and causes a and b to become conditionally independent.

13 Directed Graph: Conditional Independence (cont.)
The third example, the graph a → c ← b:
p(a, b, c) = p(a) p(b) p(c | a, b)
If node c is not observed:
p(a, b) = \sum_c p(a) p(b) p(c | a, b) = p(a) p(b), so a ⊥ b | ∅.
If node c is observed:
p(a, b | c) = p(a) p(b) p(c | a, b) / p(c) ≠ p(a | c) p(b | c) in general.
The node c is said to be head-to-head w.r.t. this path from node a to node b: the conditioned node c unblocks the path and renders a and b dependent.

14 D-separation
A ⊥ B | C if C d-separates A from B.
We need to consider all possible paths from any node in A to any node in B. Any such path is said to be blocked if it includes a node such that either
(a) the arrows on the path meet either head-to-tail or tail-to-tail at the node, and the node is in the set C, or
(b) the arrows meet head-to-head at the node, and neither the node, nor any of its descendants, is in the set C.
If all paths are blocked, then A is said to be d-separated from B by C.
[Figure: two example graphs over nodes a, b, c, e, f illustrating blocked and unblocked paths]
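A brute-force sketch of this test for small DAGs (the graph, node names, and helpers below are all illustrative, not from the slides): enumerate every simple path in the undirected skeleton and apply rules (a) and (b) to each interior node.

```python
# The DAG is given as parent lists: edges point parent -> child.
parents = {'a': [], 'b': [], 'c': ['a', 'b'], 'd': ['c']}

def children(g, v):
    return [u for u, ps in g.items() if v in ps]

def descendants(g, v):
    out, stack = set(), [v]
    while stack:
        for w in children(g, stack.pop()):
            if w not in out:
                out.add(w); stack.append(w)
    return out

def all_paths(g, src, dst):
    """All simple paths in the undirected skeleton of g."""
    def walk(path):
        if path[-1] == dst:
            yield list(path); return
        for n in set(g[path[-1]]) | set(children(g, path[-1])):
            if n not in path:
                path.append(n); yield from walk(path); path.pop()
    yield from walk([src])

def d_separated(g, a, b, C):
    C = set(C)
    for path in all_paths(g, a, b):
        blocked = False
        for i in range(1, len(path) - 1):
            prev, v, nxt = path[i - 1], path[i], path[i + 1]
            collider = v in children(g, prev) and v in children(g, nxt)
            if collider and v not in C and not (descendants(g, v) & C):
                blocked = True; break   # rule (b): head-to-head node, unobserved
            if not collider and v in C:
                blocked = True; break   # rule (a): head-to-tail/tail-to-tail, observed
        if not blocked:
            return False                # found an unblocked (active) path
    return True

print(d_separated(parents, 'a', 'b', set()))   # True: the collider c blocks the path
print(d_separated(parents, 'a', 'b', {'c'}))   # False: observing c unblocks it
print(d_separated(parents, 'a', 'b', {'d'}))   # False: a descendant of c is observed
```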

15 Markov Blankets
The Markov blanket (or Markov boundary) of a node A is the minimal set of nodes that isolates A from the rest of the graph.
Every set of nodes B in the network is conditionally independent of A when conditioned on the Markov blanket of A:
p(A | MB(A), B) = p(A | MB(A))
For a directed graph, MB(A) = {parents(A) ∪ children(A) ∪ parents-of-children(A)}.

16 Examples of Directed Graphs
Hidden Markov models
Kalman filters
Factor analysis
Probabilistic principal component analysis
Independent component analysis
Mixtures of Gaussians
Transformed component analysis
Probabilistic expert systems
Sigmoid belief networks
Hierarchical mixtures of experts
etc.

17 Example: State Space Models (SSM)
Hidden Markov models, Kalman filters.
[Figure: a chain of hidden states y_{t-1}, y_t, y_{t+1}, each emitting an observation x_{t-1}, x_t, x_{t+1}]
p(x_{1:T}, y_{1:T}) = p(y_1) p(x_1 | y_1) \prod_{t=2}^{T} p(y_t | y_{t-1}) p(x_t | y_t)

18 Example: Factorial SSM
Multiple hidden sequences.
[Figure: several parallel hidden chains jointly emitting one observed sequence]

19 Markov Random Fields
Random field: let F = {F_1, F_2, ..., F_M} be a family of random variables defined on the set S, in which each random variable F_i takes a value f_i in a label set L. The family F is called a random field.
Markov random field: F is said to be a Markov random field on S with respect to a neighborhood system N if and only if the following two conditions are satisfied:
Positivity: P(f) > 0 for all f ∈ F
Markovianity: P(f_i | all other f) = P(f_i | neighbors of f_i)
Example (for the graph over nodes a, b, c, d, e on the slide): P(b | all other nodes) = P(b | c, d)

20 Undirected Graphs
An undirected graphical model is also called a Markov random field, or a Markov network.
It has a set of nodes, each of which corresponds to a variable or group of variables, as well as a set of links, each of which connects a pair of nodes.
In an undirected graphical model, the joint distribution is a product of non-negative functions over the cliques of the graph:
p(x) = (1/Z) \prod_C ψ_C(x_C)
where the ψ_C(x_C) are the clique potentials, and Z is a normalization constant (sometimes called the partition function).
Example (for the graph on the slide): p(x) = (1/Z) ψ_A(a, c) ψ_B(b, c, d) ψ_C(c, d, e)
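For a graph this small, the partition function Z can be computed by brute force. A sketch for the slide's clique structure, with binary variables and arbitrary made-up potential values:

```python
import itertools
import numpy as np

# Cliques {a, c}, {b, c, d}, {c, d, e}; potentials are random non-negative toys.
rng = np.random.default_rng(0)
psi_A = rng.uniform(0.1, 1.0, (2, 2))          # psi_A(a, c)
psi_B = rng.uniform(0.1, 1.0, (2, 2, 2))       # psi_B(b, c, d)
psi_C = rng.uniform(0.1, 1.0, (2, 2, 2))       # psi_C(c, d, e)

def unnormalized(a, b, c, d, e):
    return psi_A[a, c] * psi_B[b, c, d] * psi_C[c, d, e]

# Partition function: sum over all 2^5 joint configurations.
Z = sum(unnormalized(*x) for x in itertools.product([0, 1], repeat=5))
p = lambda a, b, c, d, e: unnormalized(a, b, c, d, e) / Z
print(sum(p(*x) for x in itertools.product([0, 1], repeat=5)))   # ~1.0
```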

21 Clique Potentials
A clique is a fully connected subgraph.
By clique we usually mean a maximal clique (i.e. one not contained within another clique).
A clique potential measures the compatibility between settings of the variables in the clique.
[Figure: a graph over a, b, c, d, e with its maximal cliques highlighted]

22 Undirected Graphs: Conditional Independence
A ⊥ B | C: simple graph separation can tell us about conditional independences. If every path from A to B passes through C, then A ⊥ B | C.
The Markov blanket of a node A is defined as MB(A) = {neighbors(A)}.

23 Examples of Undirected Graphs
Markov random fields
Conditional random fields
Maximum entropy Markov models
Maximum entropy
Boltzmann machines
etc.

24 Example: Markov Random Field
[Figure: hidden label nodes y_i with corresponding observed nodes x_i]
P(x, y) = (1/Z) \prod_{(i,j)} ψ(y_i, y_j) \prod_i ψ(x_i, y_i)

25 Example: Conditional Random Field
[Figure: the same structure, but modeled conditionally on the observations]
P(y | x) = (1/Z(x)) \prod_{(i,j)} ψ(y_i, y_j) \prod_i ψ(y_i, x)

26 Summary of Factorization Properties
Directed graphs:
p(x_1, ..., x_K) = \prod_{k=1}^{K} p(x_k | parents(x_k))
Conditional independence from the d-separation test.
Directed graphs are better at expressing causal generative models.
Undirected graphs:
p(x) = (1/Z) \prod_C ψ_C(x_C)
Conditional independence from graph separation.
Undirected graphs are better at representing soft constraints between variables.

27 Applications of Graphical Models

28 Classification
Classification means predicting a single class variable y given a vector of features x = (x_1, ..., x_K).
Naive Bayes classifier:
Assumes that once the class label is known, all the features are independent.
Based directly on the joint probability distribution; in such generative models, the set of parameters must represent both the input distribution and the conditional well:
p(y, x) = p(y) \prod_{k=1}^{K} p(x_k | y)
Logistic regression (maximum entropy classifier):
Based directly on the conditional probability p(y | x); it needs no model of p(x), so such discriminative models are not as strongly tied to their input distribution:
p(y | x) = (1/Z(x)) \exp( λ_y + \sum_{j=1}^{K} λ_{y,j} x_j )
where Z(x) = \sum_y \exp( λ_y + \sum_{j=1}^{K} λ_{y,j} x_j ), λ_y is a per-class bias, and the λ_{y,j} are feature weights.
It can be shown that a Gaussian naive Bayes (GNB) classifier implies the parametric form of p(y | x) of its discriminative pair, logistic regression.
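A minimal sketch of the maximum-entropy classifier in exactly this parameterization; the class/feature counts and weights are made-up toy values:

```python
import numpy as np

K, D = 3, 4                                # number of classes, number of features (toy)
rng = np.random.default_rng(1)
bias = rng.normal(size=K)                  # lambda_y
W = rng.normal(size=(K, D))                # lambda_{y,j}

def p_y_given_x(x):
    scores = bias + W @ x                  # lambda_y + sum_j lambda_{y,j} x_j
    scores -= scores.max()                 # subtract max for numerical stability
    e = np.exp(scores)
    return e / e.sum()                     # Z(x) normalizes over classes only

print(p_y_given_x(rng.normal(size=D)))     # a distribution over the K classes
```

Note that Z(x) here sums over the K class labels only, not over inputs; that is what makes the model purely conditional.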

29 Classification (cont.)
Consider a GNB based on the following modeling assumptions:
y is Boolean, governed by a Bernoulli distribution with parameter θ = P(y = 1).
Each P(x_i | y = k) is a Gaussian distribution of the form N(x_i; μ_{ik}, σ_i²), with the variance σ_i² shared across classes.
Then
P(y = 1 | x) = P(y=1) P(x | y=1) / ( P(y=1) P(x | y=1) + P(y=0) P(x | y=0) )
= 1 / ( 1 + \exp( \ln( (1-θ)/θ ) + \sum_i \ln( P(x_i | y=0) / P(x_i | y=1) ) ) )
Expanding the Gaussian log-ratio, the x_i² terms cancel because the variances are shared:
\ln( P(x_i | y=0) / P(x_i | y=1) ) = ( (μ_{i0} - μ_{i1}) / σ_i² ) x_i + ( μ_{i1}² - μ_{i0}² ) / ( 2σ_i² )
so the exponent is linear in x, and
P(y = 1 | x) = 1 / ( 1 + \exp( λ_0 + \sum_i λ_i x_i ) )
which is exactly the parametric form of logistic regression.

30 Sequence Models
Classifiers predict only a single class variable, but the true power of graphical models lies in their ability to model many variables that are interdependent, e.g. named-entity recognition (NER) and part-of-speech (POS) tagging.
Hidden Markov models:
Relax the independence assumption by arranging the output variables in a linear chain.
To model the joint distribution p(x, y), an HMM makes two assumptions:
Each state depends only on its immediate predecessor (first-order assumption).
Each observation variable depends only on the current state (output-independence assumption).
p(x, y) = \prod_{t=1}^{T} p(y_t | y_{t-1}) p(x_t | y_t), where y_0 is a designated start state so that p(y_1 | y_0) is the initial distribution.
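A minimal sketch that evaluates this HMM joint under the two assumptions; all probability tables are toy values:

```python
import numpy as np

pi = np.array([0.6, 0.4])                  # p(y_1)
A  = np.array([[0.7, 0.3],                 # A[i, j] = p(y_t = j | y_{t-1} = i)
               [0.2, 0.8]])
B  = np.array([[0.9, 0.1],                 # B[i, o] = p(x_t = o | y_t = i)
               [0.3, 0.7]])

def joint(x, y):
    p = pi[y[0]] * B[y[0], x[0]]           # initial state and its emission
    for t in range(1, len(x)):
        p *= A[y[t-1], y[t]] * B[y[t], x[t]]
    return p

print(joint([0, 1, 1], [0, 0, 1]))
```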

31 Sequence Models (cont.)
Maximum entropy Markov models (MEMMs):
A conditional model that represents the probability of reaching a state given an observation and the previous state:
p(y | x) = \prod_{t=1}^{T} p(y_t | y_{t-1}, x_t)
p(y_t | y_{t-1}, x_t) = (1/Z(y_{t-1}, x_t)) \exp( \sum_k λ_k f_k(y_t, y_{t-1}, x_t) )
Note the per-state normalization: all the probability mass that arrives at a state must be distributed among the possible successor states. This causes the label bias problem!
Potential victims: discriminative models with per-state normalization.

32 Sequence Models (cont.)
Label bias problem. Consider an MEMM trained on the observation sequences "rib" and "rob", with label path 1→2 for "ri" and a disjoint path for "ro":
P(1 and 2 | ro) = P(2 | 1 and ro) P(1 | ro) = P(2 | 1 and o) P(1 | r)
P(1 and 2 | ri) = P(2 | 1 and ri) P(1 | ri) = P(2 | 1 and i) P(1 | r)
Since P(2 | 1 and x) = 1 for all x, P(1 and 2 | ro) = P(1 and 2 | ri).
In the training data, label 2 is the only label value observed after label 1; therefore P(2 | 1) = 1, so P(2 | 1 and x) = 1 for all x.
However, we expect P(1 and 2 | ri) to be greater than P(1 and 2 | ro).

33 Sequence Models (cont.)
[Figure: the generative-discriminative family tree]
Generative directed models: naive Bayes → (SEQUENCE) HMMs → (GENERAL) generative directed models.
Their CONDITIONAL counterparts: logistic regression → (SEQUENCE) linear-chain CRFs → (GENERAL) general CRFs.

34 From HMMs to CRFs
We can rewrite the HMM joint distribution p(x, y) = \prod_t p(y_t | y_{t-1}) p(x_t | y_t) as follows:
p(y, x) = (1/Z) \exp( \sum_t \sum_{i,j ∈ S} λ_{ij} 1{y_t = i} 1{y_{t-1} = j} + \sum_t \sum_{i ∈ S} \sum_{o ∈ O} μ_{oi} 1{y_t = i} 1{x_t = o} )
Because we do not require the parameters to be log probabilities, we are no longer guaranteed that the distribution sums to 1, so we explicitly enforce this by using a normalization constant Z.
We can write the above equation more compactly by introducing the concept of feature functions:
p(y, x) = (1/Z) \exp( \sum_t \sum_{k=1}^{K} λ_k f_k(y_t, y_{t-1}, x_t) )
Feature functions for HMMs:
f_{ij}(y, y', x) = 1{y = i} 1{y' = j} (state transition)
f_{io}(y, y', x) = 1{y = i} 1{x = o} (state observation)
The last step is to write the conditional distribution:
p(y | x) = p(y, x) / \sum_{y'} p(y', x) = \exp( \sum_t \sum_k λ_k f_k(y_t, y_{t-1}, x_t) ) / \sum_{y'} \exp( \sum_t \sum_k λ_k f_k(y'_t, y'_{t-1}, x_t) )
This is a linear-chain CRF.
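A sketch of the HMM-as-feature-functions view; the state/observation sets and the uniform log-probability weights are toy stand-ins, and the initial-state term is skipped for brevity:

```python
import itertools, math

S, O = [0, 1], [0, 1]                       # state and observation alphabets (toy)
features, weights = [], []
for i, j in itertools.product(S, S):        # f_ij = 1{y_t = i} 1{y_{t-1} = j}
    features.append(lambda yt, yp, xt, i=i, j=j: float(yt == i and yp == j))
    weights.append(math.log(0.5))           # lambda_ij = log p(y_t = i | y_{t-1} = j)
for i, o in itertools.product(S, O):        # f_io = 1{y_t = i} 1{x_t = o}
    features.append(lambda yt, yp, xt, i=i, o=o: float(yt == i and xt == o))
    weights.append(math.log(0.5))           # mu_io = log p(x_t = o | y_t = i)

def log_score(y, x):
    """Unnormalized log p(y, x) = sum_t sum_k lambda_k f_k(y_t, y_{t-1}, x_t)."""
    return sum(w * f(y[t], y[t-1], x[t])
               for t in range(1, len(y)) for f, w in zip(features, weights))

print(log_score([0, 1, 1], [0, 0, 1]))
```

With weights set to the HMM's log probabilities, exponentiating this score recovers the HMM joint; freeing the weights from that constraint is what the normalizer Z then compensates for.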

35 More Detail on Conditional Random Fields

36 Conditional Random Fields
CRFs have all the advantages of MEMMs without the label bias problem:
An MEMM uses a per-state exponential model for the conditional probabilities of next states given the current state.
A CRF has a single exponential model for the joint probability of the entire sequence of labels given the observation sequence.
Let G = (V, E) be a graph such that Y = (Y_v)_{v ∈ V}, so that Y is indexed by the vertices of G. Then (X, Y) is a conditional random field in case, when conditioned on X, the random variables Y_v obey the Markov property with respect to the graph:
p(Y_v | X, Y_w, w ≠ v) = p(Y_v | X, Y_w, w ∼ v)
where w ∼ v means that w and v are neighbors in G.

37 Linear-Chain Conditional Random Fields
Definition: let y, x be random vectors, Λ = {λ_k} ∈ R^K be a parameter vector, and {f_k(y, y', x, t)}_{k=1}^{K} be a set of real-valued feature functions. Then a linear-chain conditional random field is a distribution p(y | x) that takes the form
p(y | x) = (1/Z(x)) \exp( \sum_{t=1}^{T} \sum_{k=1}^{K} λ_k f_k(y_t, y_{t-1}, x, t) )
or, more compactly, p(y | x) = (1/Z(x)) \exp( Λ^T F(x, y) ),
where Z(x) is an instance-specific normalization function:
Z(x) = \sum_y \exp( \sum_{t=1}^{T} \sum_{k=1}^{K} λ_k f_k(y_t, y_{t-1}, x, t) )
The sum runs over all possible state sequences, an exponentially large number of terms. Fortunately, forward-backward indeed helps us to calculate this term.
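To make the "exponentially many terms" point concrete, this sketch (toy random transition scores standing in for real features) computes Z(x) both by brute-force enumeration over all Y^T label sequences and by the forward recursion of the next slide, and checks that they agree:

```python
import itertools
import numpy as np

rng = np.random.default_rng(5)
T, Y = 4, 3
# M[t, y_prev, y] stands in for exp(sum_k lambda_k f_k(y, y_prev, x, t));
# m0 holds the scores for the first position.
M = np.exp(rng.normal(size=(T, Y, Y)))
m0 = np.exp(rng.normal(size=Y))

# Brute force: sum the unnormalized score of every label sequence.
Z_brute = sum(m0[ys[0]] * np.prod([M[t, ys[t-1], ys[t]] for t in range(1, T)])
              for ys in itertools.product(range(Y), repeat=T))

# Forward recursion: alpha <- alpha @ M[t]; Z(x) = sum of the final alpha.
alpha = m0.copy()
for t in range(1, T):
    alpha = alpha @ M[t]
print(np.isclose(Z_brute, alpha.sum()))   # True
```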

38 Forward and Backward Algorithms
Suppose that we are interested in tagging a sequence only partially, say up to position i.
Denote the unnormalized probability of a partial labeling ending at position i with label y by α(i, y).
Denote the unnormalized probability of a partial labeling starting at position i+1, assuming label y at position i, by β(i, y).
α and β can be computed via the following recurrences:
α(i, y) = \sum_{y'} α(i-1, y') \exp( Λ^T f(y, y', x, i) )
β(i, y) = \sum_{y'} β(i+1, y') \exp( Λ^T f(y', y, x, i+1) )
We can now write the marginals and partition function in terms of these:
P(y_i = y | x) = α(i, y) β(i, y) / Z(x)
P(y_i = y', y_{i+1} = y | x) = α(i, y') \exp( Λ^T f(y, y', x, i+1) ) β(i+1, y) / Z(x)
Z(x) = \sum_y α(n, y)
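A sketch of both recurrences and the resulting per-position marginals, using the same toy transition-score convention as above:

```python
import numpy as np

# M[t, y_prev, y] stands in for exp(Lambda^T f(y, y_prev, x, t)); toy values.
rng = np.random.default_rng(2)
T, Y = 5, 3                                  # sequence length, number of labels
M = np.exp(rng.normal(size=(T, Y, Y)))
m0 = np.exp(rng.normal(size=Y))              # scores for the first position

alpha = np.zeros((T, Y))
alpha[0] = m0
for t in range(1, T):                        # alpha(t, y) = sum_{y'} alpha(t-1, y') M[t, y', y]
    alpha[t] = alpha[t-1] @ M[t]

beta = np.ones((T, Y))                       # beta at the last position is 1
for t in range(T - 2, -1, -1):               # beta(t, y) = sum_{y'} M[t+1, y, y'] beta(t+1, y')
    beta[t] = M[t+1] @ beta[t+1]

Z = alpha[-1].sum()                          # Z(x) = sum_y alpha(n, y)
marginal = alpha * beta / Z                  # P(y_t = y | x)
print(np.allclose(marginal.sum(axis=1), 1.0))   # marginals at each position sum to 1
```

In practice these recurrences are run in log space (or with per-step rescaling) to avoid numeric underflow on long sequences; the plain products above are kept only for clarity.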

39 Inference in Linear CRFs Using the Viterbi Algorithm
Given the parameter vector Λ, the best labeling for a sequence can be found exactly using the Viterbi algorithm.
For each tuple of the form (i, y), the Viterbi algorithm maintains the unnormalized probability of the best labeling ending at position i with the label y.
The recurrence is
V_i(y) = \max_{y'} V_{i-1}(y') \exp( Λ^T f(y, y', x, i) ) for i > 0, with V_0(y) = [[ y = start ]] (1 if y is the start label, 0 otherwise).
The normalized probability of the best labeling is given by \max_y V_n(y) / Z(x).
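A Viterbi sketch with backpointers, reusing the toy M[t, y_prev, y] convention from the forward-backward sketch:

```python
import numpy as np

rng = np.random.default_rng(3)
T, Y = 5, 3
M = np.exp(rng.normal(size=(T, Y, Y)))       # toy exp(Lambda^T f) transition scores
m0 = np.exp(rng.normal(size=Y))              # toy scores for the first position

V = np.zeros((T, Y))                         # V[t, y]: best unnormalized score ending in y
back = np.zeros((T, Y), dtype=int)           # backpointers for recovering the path
V[0] = m0
for t in range(1, T):
    cand = V[t-1][:, None] * M[t]            # cand[y', y] = V(t-1, y') * score(y' -> y)
    back[t] = cand.argmax(axis=0)
    V[t] = cand.max(axis=0)

y = [int(V[-1].argmax())]                    # backtrack from the best final label
for t in range(T - 1, 0, -1):
    y.append(int(back[t][y[-1]]))
y.reverse()
print(y, V[-1].max())                        # best labeling and its unnormalized score
```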

40 Training (Parameter Estimation)
The various methods used to train CRFs differ mainly in the objective function they try to optimize:
Penalized log-likelihood criteria
Voted perceptron
Pseudo log-likelihood
Margin maximization
Gradient tree boosting
Logarithmic pooling
and so on.

41 Penalized Log-Likelihood Criteria
The conditional log-likelihood of a set of training instances {(x_j, y_j)} using parameters Λ is given by
L(Λ) = \sum_j ( Λ^T F(x_j, y_j) - \log Z(x_j) )
The gradient of the log-likelihood is given by
∇L = \sum_j ( F(x_j, y_j) - E_{P(y' | x_j)}[ F(x_j, y') ] )
where E_{P(y' | x_j)}[ F(x_j, y') ] = \sum_{y'} P(y' | x_j) F(x_j, y').
In order to avoid the overfitting problem, we impose a penalty on the Euclidean norm of Λ:
L(Λ) = \sum_j ( Λ^T F(x_j, y_j) - \log Z(x_j) ) - ||Λ||² / (2σ²)
and the gradient then becomes
∇L = \sum_j ( F(x_j, y_j) - E_{P(y' | x_j)}[ F(x_j, y') ] ) - Λ / σ²

42 Penalized Log-Likelihood Criteria (cont.)
The tricky term in the gradient is the expectation, whose naive computation requires the enumeration of all possible sequences y'.
Let us look at the j-th entry of this vector, viz. E_{P(y' | x)}[ F_j(x, y') ]. Since F_j(x, y') = \sum_i f_j(y'_i, y'_{i-1}, x, i), the expectation decomposes into per-position terms, and we can rewrite it using the pairwise marginals from forward-backward:
E_{P(y' | x)}[ F_j(x, y') ] = \sum_i \sum_{y, y'} P(y_{i-1} = y', y_i = y | x) f_j(y, y', x, i)
= \sum_i \sum_{y, y'} ( α(i-1, y') \exp( Λ^T f(y, y', x, i) ) β(i, y) / Z(x) ) f_j(y, y', x, i)
After the gradient is obtained, various iterative methods can be used to maximize the log-likelihood: improved iterative scaling (IIS), generalized iterative scaling (GIS), or the limited-memory quasi-Newton method (L-BFGS).
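A sketch of this gradient for one training pair, with toy feature values F[t, y_prev, y, :] standing in for f(y, y', x, t) and the first position's features skipped for brevity:

```python
import numpy as np

rng = np.random.default_rng(4)
T, Y, K = 4, 2, 5
F = rng.normal(size=(T, Y, Y, K))            # toy f_k(y, y_prev, x, t) values
lam = rng.normal(size=K)

M = np.exp(F @ lam)                          # M[t, y_prev, y] = exp(Lambda^T f)
alpha = np.zeros((T, Y)); alpha[0] = 1.0
for t in range(1, T):
    alpha[t] = alpha[t-1] @ M[t]
beta = np.ones((T, Y))
for t in range(T - 2, -1, -1):
    beta[t] = M[t+1] @ beta[t+1]
Z = alpha[-1].sum()

# E[F_k] = sum_t sum_{y', y} alpha(t-1, y') exp(Lambda^T f) beta(t, y) f_k / Z
expected = np.zeros(K)
for t in range(1, T):
    pair = alpha[t-1][:, None] * M[t] * beta[t][None, :] / Z
    expected += np.einsum('ab,abk->k', pair, F[t])

y_obs = rng.integers(0, Y, size=T)           # an observed labeling (toy)
observed = sum(F[t, y_obs[t-1], y_obs[t]] for t in range(1, T))
grad = observed - expected                   # add -lam / sigma**2 for the penalized version
print(grad)
```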

43 Voted Perceptron Method
The perceptron uses an approximation of the gradient of the unregularized log-likelihood function
∇L(Λ) ≈ F(x, y) - E_{P(y' | x)}[ F(x, y') ]
It considers one misclassified instance at a time, along with its contribution to the gradient.
The feature expectation is further approximated by a point estimate of the feature vector at the best possible labeling
y* = \argmax_{y'} Λ^T F(x, y')
i.e. a MAP-hypothesis-based classifier, giving ∇L ≈ F(x, y) - F(x, y*).
Using this approximate gradient, the following first-order update rule can be used for maximization:
Λ_{t+1} = Λ_t + F(x, y) - F(x, y*)
This update step is applied once for each misclassified instance in the training set. Alternatively, we can collect all the updates in each pass and take their unweighted average to update the parameters (the "voted"/averaged variant).
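A sketch of one training pass, assuming a Viterbi decoder `viterbi(x, lam)` and a feature extractor `Phi(x, y)` (both hypothetical helpers, e.g. the Viterbi sketch above adapted to return a labeling):

```python
def perceptron_epoch(data, lam, viterbi, Phi, lr=1.0):
    """One structured-perceptron pass; viterbi and Phi are assumed helpers."""
    for x, y in data:
        y_star = viterbi(x, lam)                 # best labeling under current weights
        if list(y_star) != list(y):              # only misclassified instances update
            lam = lam + lr * (Phi(x, y) - Phi(x, y_star))
    return lam
```

The averaged variant additionally keeps a running sum of the weight vector after every update and returns its mean, which is what makes the method robust in practice.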

44 Pseudo Log-Likelihood
In many scenarios, we are willing to assign different error values to different labelings; it then makes sense to maximize the marginal distributions P(y_t | x) instead of P(y | x).
This objective is called the pseudo-likelihood, and for the case of linear CRFs it is given by
L(Λ) = \sum_t \log P(y_t | x, Λ) = \sum_t \log ( \sum_{y' : y'_t = y_t} \exp( Λ^T F(x, y') ) / Z(x) )

45 Other Types of CRFs
Semi-Markov CRFs:
Still in the realm of first-order Markovian dependence, but the difference is that each label depends only on segment features and the label of the previous segment.
Instead of assigning labels to each position, assign labels to segments.
Skip-chain CRFs:
A conditional model that collectively segments a document into mentions and classifies the mentions by entity type.
[Figure: a semi-Markov CRF labeling segments A B C A A B, and a skip-chain CRF with long-range edges between repeated mentions]

46 Other Types of CRFs (cont.)
Factorial CRFs:
Model several synchronized, inter-dependent tasks (e.g. NE tagging and POS tagging) jointly; cascading the tasks instead propagates errors.
Tree CRFs:
The dependencies are organized as a tree structure.

47 Conclusions
Conditional random fields offer a unique combination of properties:
Discriminatively trained models for sequence segmentation and labeling.
Combination of arbitrary and overlapping observation features from both the past and the future.
Efficient training and decoding based on dynamic programming for a simple chain graph.
Parameter estimation guaranteed to find the global optimum.
Possible future work:
Efficient training approaches?
Efficient feature induction?
Constrained inference?
Different topologies?

48 References
Lafferty, J., McCallum, A., Pereira, F., Conditional random fields: Probabilistic models for segmenting and labeling sequence data, in Proc. 18th International Conf. on Machine Learning, Morgan Kaufmann, San Francisco, CA, 2001.
Rahul Gupta, Conditional Random Fields, Dept. of Computer Science and Engg., IIT Bombay, India. Document available from http://
Sutton, C., McCallum, A., An Introduction to Conditional Random Fields for Relational Learning, in Introduction to Statistical Relational Learning, MIT Press, 2007. Document available from http://
Mitchell, T. M., Machine Learning, McGraw Hill, 1997. Document available from http://
Bishop, C. M., Pattern Recognition and Machine Learning, Springer, 2006.
Bishop, C. M., Graphical Models and Variational Methods, video lecture, Machine Learning Summer School 2004. Available from http://videolectures.net/mlss04_bishop_gmvm/
Ghahramani, Z., Graphical models, video lecture, EPSRC Winter School in Mathematics for Data Modelling, 2008. Available from http://videolectures.net/epsrcws08_ghahramani_gm/
Roweis, S., Machine Learning, Probability and Graphical Models, video lecture, Machine Learning Summer School 2006. Available from http://videolectures.net/mlss06tw_roweis_mlgm/
