Statistical inference with probabilistic graphical models


Angélique Drémeau, Christophe Schülke, Yingying Xu, Devavrat Shah

September 18, 2014

arXiv: v1 [cs.LG] 17 Sep 2014

These are notes from the lecture of Devavrat Shah given at the autumn school "Statistical Physics, Optimization, Inference, and Message-Passing Algorithms", which took place in Les Houches, France, from Monday September 30th, 2013, till Friday October 11th, 2013. The school was organized by Florent Krzakala from UPMC & ENS Paris, Federico Ricci-Tersenghi from La Sapienza Roma, Lenka Zdeborová from CEA Saclay & CNRS, and Riccardo Zecchina from Politecnico Torino.

École Normale Supérieure, France
Université Paris Diderot, France
Tokyo Institute of Technology, Japan
Massachusetts Institute of Technology, USA

Contents

1 Introduction to Graphical Models
  1.1 Inference
  1.2 Graphical models
    1.2.1 Directed GMs
    1.2.2 Undirected GMs
    1.2.3 Cliques
  1.3 Factor graphs
    1.3.1 Image processing
    1.3.2 Crowd-sourcing
  1.4 MAP and MARG
2 Inference Algorithms: Elimination, Junction Tree and Belief Propagation
  2.1 The elimination algorithm
  2.2 Junction Tree property and chordal graphs
    2.2.1 Junction Tree (JCT) property
    2.2.2 Chordal graph
    2.2.3 Procedure to find a JCT
    2.2.4 Tree width
  2.3 Belief propagation (BP) algorithms
    2.3.1 Factor graphs
3 Understanding Belief Propagation
  3.1 Existence of a fixed point
  3.2 Nature of the fixed points
    3.2.1 Background on Nonlinear Optimization
    3.2.2 Belief Propagation as a variational problem
  3.3 Can the fixed points be reached?
    3.3.1 The hardcore model
4 Learning Graphical Models
  4.1 Parameter learning
    4.1.1 Single parameter learning
    4.1.2 Directed graphs
    4.1.3 Undirected graphs
  4.2 Graphical model learning
    4.2.1 Directed graphs
    4.2.2 Undirected graphs
  4.3 Latent Graphical Model learning: the Expectation-maximization algorithm
References

1 Introduction to Graphical Models

1.1 Inference

Consider two random variables A and B with a joint probability distribution $P_{A,B}$. From the observation of the realization of one of those variables, say $B = b$, we want to infer the one that we did not observe. To that end, we compute the conditional probability distribution $P_{A|B}$ and use it to obtain an estimate $\hat{a}(b)$ of $a$. To quantify how good this estimate is, we introduce the error probability

$$P_{\rm error} \triangleq P(A \neq \hat{a}(b) \mid B = b) = 1 - P(A = \hat{a}(b) \mid B = b), \qquad (1)$$

and we can see from the second equality that minimizing this error probability is equivalent to the following maximization problem, called the maximum a posteriori (MAP) problem:

$$\hat{a}(b) = \arg\max_a P_{A|B}(a \mid b). \qquad (2)$$

The problem of computing $P_{A|B}(a \mid b)$ for all $a$ given $b$ is called the marginal (MARG) problem. When the number of random variables increases, the MARG problem becomes difficult, because an exponential number of combinations has to be calculated. Fano's inequality provides an information-theoretical way of gaining insight into how much information about $a$ the knowledge of $b$ can give us:

$$P_{\rm error} \geq \frac{H(A|B) - 1}{\log |\mathcal{A}|}, \qquad (3)$$

with

$$H(A|B) = \sum_b P_B(b) \, H(A|B=b), \qquad H(A|B=b) = \sum_a P_{A|B}(a \mid b) \log \frac{1}{P_{A|B}(a \mid b)}.$$

Fano's inequality only formalizes a theoretical bound; it does not tell us how to actually make an estimation. From a practical point of view, graphical models (GM) constitute a powerful tool allowing us to write algorithms that solve inference problems.
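As a concrete illustration of the MARG and MAP problems (1)-(2), here is a minimal Python sketch that computes them by brute force from a small joint probability table; the table values and alphabet sizes are made up for the example.

```python
import numpy as np

# Hypothetical joint distribution P_{A,B}: rows are values of A, columns values of B.
P_AB = np.array([[0.10, 0.20],
                 [0.30, 0.05],
                 [0.15, 0.20]])

b = 1                                  # observed realization of B
P_b = P_AB[:, b].sum()                 # marginal P_B(b)
P_A_given_b = P_AB[:, b] / P_b         # MARG: P_{A|B}(a|b) for every a

a_hat = int(np.argmax(P_A_given_b))    # MAP estimate, Eq. (2)
p_error = 1.0 - P_A_given_b[a_hat]     # error probability, Eq. (1)

print(P_A_given_b, a_hat, p_error)
```

For N variables on an alphabet of size |X| this brute-force approach needs a table of size |X|^N, which is exactly the exponential blow-up that graphical models help to avoid.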

1.2 Graphical models

1.2.1 Directed GMs

Consider $N$ random variables $X_1, \ldots, X_N$ on a discrete alphabet $\mathcal{X}$, and their joint probability distribution $P_{X_1 \cdots X_N}$. We can always factorize this joint distribution in the following way:

$$P_{X_1 \cdots X_N} = P_{X_1} \, P_{X_2|X_1} \cdots P_{X_N|X_1 \cdots X_{N-1}}, \qquad (4)$$

and represent this factorized form by the following directed graphical model:

Figure 1: A directed graphical model representing the factorized form (4).

In this graphical model, each node is associated with a random variable, and each directed edge represents a conditioning. With the factorization chosen above, we obtain a complicated graphical model, in the sense that it has many edges. A much simpler graphical model would be:

Figure 2: A simpler graphical model representing the factorized form (5).

The latter graphical model corresponds to a factorization in which each of the probability distributions in the product is conditioned on only one variable:

$$P_{X_1 \cdots X_N} = P_{X_1} \, P_{X_2|X_1} \cdots P_{X_N|X_{N-1}}. \qquad (5)$$

In the most general case, we can write a distribution represented by a directed graphical model in the factorized form

$$P_{X_1 \cdots X_N} = \prod_i P_{X_i | X_{\Pi_i}}, \qquad (6)$$

where $X_{\Pi_i}$ is the set containing the parents of $X_i$ (the vertices from which an edge points to $i$). The following notations will hold for the rest of this chapter:

- random variables are capitalized: $X_i$,
- realizations of random variables are lower case: $x_i$,
- a set of random variables $\{X_1, \ldots, X_N\}$ is noted $X$,
- a set of realizations of $X$ is noted $x$,
- the subset of random variables with indices in $S$ is noted $X_S$.
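To make the chain factorization (5) concrete, the short sketch below draws samples from it by ancestral sampling: each variable is drawn from its conditional distribution given its single parent. The binary alphabet, the initial distribution and the transition table are made-up toy values, not anything from the notes.

```python
import numpy as np

rng = np.random.default_rng(0)

N = 5                                   # number of variables X_1 ... X_N
P_X1 = np.array([0.6, 0.4])             # P_{X_1} on the alphabet {0, 1}
P_next = np.array([[0.9, 0.1],          # P_{X_{i+1} | X_i = 0}
                   [0.2, 0.8]])         # P_{X_{i+1} | X_i = 1}

def sample_chain():
    """Ancestral sampling of (x_1, ..., x_N) from the factorization (5)."""
    x = [rng.choice(2, p=P_X1)]
    for _ in range(N - 1):
        x.append(rng.choice(2, p=P_next[x[-1]]))
    return x

print([sample_chain() for _ in range(3)])
```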

1.2.2 Undirected GMs

Another type of graphical model is the undirected graphical model. In that case, we define the graphical model not through a factorization, but through independence. Let $G(V, E)$ be an undirected graphical model, where $V = \{1, \ldots, N\}$ is the set of vertices and $E \subseteq V \times V$ is the set of edges. Each vertex $i \in V$ of this GM represents one random variable $X_i$, and each edge $(i, j) \in E$ represents a conditional dependence. As the GM is undirected, we identify $(i, j)$ and $(j, i)$. We define

$$N(i) \triangleq \{j \in V \mid (i, j) \in E\}, \qquad (7)$$

the set containing the neighbours of $i$. An undirected graphical model captures the following dependence structure:

$$P_{X_i | X_{V \setminus \{i\}}} = P_{X_i | X_{N(i)}}, \qquad (8)$$

meaning that only variables connected by edges have a conditional dependence.

Let $A \subset V$, $B \subset V$, $C \subset V$. We write $X_A \perp X_B \mid X_C$ if $A$ and $B$ are disjoint and if all paths leading from an element of $A$ to an element of $B$ pass through an element of $C$, as illustrated in Fig. 3. In other words, if we remove $C$, then $A$ and $B$ are disconnected (Fig. 4).

Figure 3: Schematic view of a graphical model in which $X_A \perp X_B \mid X_C$. All paths leading from A to B go through C.

Figure 4: Simple view showing the independence of A and B conditioned on C.

Undirected GMs are also called Markov random fields (MRF).

1.2.3 Cliques

(Definition) A clique is a subgraph of a graph in which all possible pairs of vertices are linked by an edge. A maximal clique is a clique that is contained in no other clique.

Figure 5: In this graphical model, the maximal cliques are $\{1, 2, 4\}$, $\{2, 3, 4\}$ and $\{4, 5\}$.

Theorem 1 ([4]) Given an MRF $G$ and a probability distribution $P_X(x) > 0$. Then

$$P_X(x) \propto \prod_{C \in \mathcal{C}} \phi_C(x_C), \qquad (9)$$

where $\mathcal{C}$ is the set of cliques of $G$.

Proof 1 ([3], for $\mathcal{X} = \{0, 1\}$). We will show the following, equivalent formulation:

$$P_X(x) \propto e^{\sum_{C \in \mathcal{C}} V_C(x_C)}, \qquad (10)$$

by exhibiting the solution

$$V_C(x_C) = \begin{cases} Q(C) & \text{if } x_i = 1 \ \forall i \in C, \\ 0 & \text{otherwise}, \end{cases} \qquad (11)$$

with

$$Q(C) = \sum_{A \subseteq C} (-1)^{|C \setminus A|} \underbrace{\ln P_X\big(x_i = 1 \ \forall i \in A, \ x_{V \setminus A} = 0\big)}_{G(A)}. \qquad (12)$$

Suppose we have an assignment $x$ and write $N(x) \triangleq \{i \mid x_i = 1\}$ for its support. We want to prove that

$$G(N(x)) \triangleq \ln P_X(x) = \sum_{C \in \mathcal{C}} V_C(x_C) = \sum_{C \subseteq N(x)} Q(C). \qquad (13)$$

This is equivalent to proving the two claims:

- C1: for all $S \subseteq V$, $G(S) = \sum_{A \subseteq S} Q(A)$;
- C2: if $A$ is not a clique, then $Q(A) = 0$.

Let us begin by proving C1:

$$\sum_{A \subseteq S} Q(A) = \sum_{A \subseteq S} \sum_{B \subseteq A} (-1)^{|A \setminus B|} G(B) = \sum_{B \subseteq S} G(B) \left[ \sum_{A:\, B \subseteq A \subseteq S} (-1)^{|A \setminus B|} \right], \qquad (14)$$

where we note that the term in brackets is zero except when $B = S$, because we can rewrite it as

$$\sum_{0 \leq l \leq k} \binom{k}{l} (-1)^l = (-1 + 1)^k = 0. \qquad (15)$$

Therefore $G(S) = \sum_{A \subseteq S} Q(A)$.

For C2, suppose that $A$ is not a clique, which allows us to choose $i, j \in A$ with $(i, j) \notin E$. Then

$$Q(A) = \sum_{B \subseteq A \setminus \{i, j\}} (-1)^{|A \setminus B|} \big[ G(B) - G(B + i) + G(B + i + j) - G(B + j) \big].$$

Let us show that the term in brackets is zero by showing

$$G(B + i + j) - G(B + j) = G(B + i) - G(B),$$

or equivalently

$$\ln \frac{P_X(x_B = 1 \,\forall B,\, x_i = 1,\, x_j = 1,\, x_{V \setminus \{i, j, B\}} = 0)}{P_X(x_B = 1 \,\forall B,\, x_i = 0,\, x_j = 1,\, x_{V \setminus \{i, j, B\}} = 0)} = \ln \frac{P_X(x_B = 1 \,\forall B,\, x_i = 1,\, x_j = 0,\, x_{V \setminus \{i, j, B\}} = 0)}{P_X(x_B = 1 \,\forall B,\, x_i = 0,\, x_j = 0,\, x_{V \setminus \{i, j, B\}} = 0)},$$

where $V \setminus \{i, j, B\}$ stands for the set of all vertices except $i$, $j$ and those in $B$. We see that the only difference between the left-hand side and the right-hand side is the value taken by $x_j$. Using Bayes' rule, we can rewrite both the left-hand side and the right-hand side in the form

$$\ln \frac{P_X(X_i = 1 \mid X_j,\, X_B = 1 \,\forall B,\, X_{V \setminus \{i, j, B\}} = 0)}{P_X(X_i = 0 \mid X_j,\, X_B = 1 \,\forall B,\, X_{V \setminus \{i, j, B\}} = 0)},$$

with $X_j$ fixed to 1 on one side and to 0 on the other. As $(i, j) \notin E$, the conditional probabilities on $X_i$ do not depend on the value taken by $X_j$, and therefore the right-hand side equals the left-hand side, $Q(A) = 0$, and C2 is proved.

1.3 Factor graphs

Thanks to the Hammersley-Clifford theorem, we know that we can write a probability distribution corresponding to an MRF $G$ in the following way:

$$P_X(x) \propto \prod_{C \in \mathcal{C}} \phi_C(x_C), \qquad (16)$$

where $\mathcal{C}$ is the set of maximal cliques of $G$. In a more general definition, we can also write

$$P_X(x) \propto \prod_{F \in \mathcal{F}} \phi_F(x_F), \qquad (17)$$

where the collection of factors $\mathcal{F} \subseteq 2^V$ has nothing to do with any underlying graph. In what follows, we give two examples in which introducing factor graphs is a natural approach to an inference problem.

1.3.1 Image processing

We consider an image with binary pixels ($\mathcal{X} = \{-1, 1\}$), and a probability distribution

$$p(x) \propto e^{\sum_{i \in V} \theta_i x_i + \sum_{(i,j) \in E} \theta_{ij} x_i x_j}. \qquad (18)$$

Figure 6: Graphical model representing a 2D image. The fat circles correspond to the pixels of the image $x_k$, and each one is linked to a noisy measurement $y_k$. Adjacent pixels are linked by edges that allow modelling the assumed smoothness of the image.

For each pixel $x_k$, we record a noisy version $y_k$. We consider natural images, in which big jumps in intensity between two neighbouring pixels are unlikely. This can be modelled with

$$a \sum_i x_i y_i + b \sum_{(i,j) \in E} x_i x_j. \qquad (19)$$

This way, the first term pushes $x_k$ to match the measured value $y_k$, while the second term favours piecewise constant images. We can identify $\theta_i \triangleq a y_i$ and $\theta_{ij} \triangleq b$.
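As a small illustration of the model (18)-(19), the sketch below builds a piecewise-constant toy image, adds pixel-flip noise, and runs a few greedy single-pixel updates (iterated conditional modes) that increase the potential $a \sum_i x_i y_i + b \sum_{(i,j)} x_i x_j$. This greedy scheme is only a stand-in for illustration, not one of the inference algorithms developed in these notes, and the grid size, a, b and noise level are made up.

```python
import numpy as np

rng = np.random.default_rng(1)
H, W, a, b = 8, 8, 1.0, 0.7                  # toy grid size and coupling strengths (assumed)

x_true = np.ones((H, W))                     # a piecewise constant +/-1 image
x_true[2:6, 2:6] = -1.0
y = np.where(rng.random((H, W)) < 0.15, -x_true, x_true)   # noisy observation of each pixel

def local_field(x, i, j):
    """Sum of the neighbouring pixels of (i, j), i.e. the b-term seen by one pixel."""
    s = 0.0
    for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        ni, nj = i + di, j + dj
        if 0 <= ni < H and 0 <= nj < W:
            s += x[ni, nj]
    return s

x = y.copy()
for _ in range(5):                           # a few full sweeps of greedy updates
    for i in range(H):
        for j in range(W):
            # pick the sign maximizing a*x*y + b*x*(sum of neighbours) for this pixel
            x[i, j] = 1.0 if a * y[i, j] + b * local_field(x, i, j) >= 0 else -1.0

print("pixels recovered:", np.mean(x == x_true))
```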

1.3.2 Crowd-sourcing

Crowd-sourcing is used for tasks that are easy for humans but difficult for machines, and that are as hard to verify as to evaluate. Crowd-sourcing then consists in assigning to each of $M$ human workers a subset of $N$ tasks to evaluate, and in collecting their answers $A$. Each worker is characterized by a parameter $p_i \in \{\frac{1}{2}, 1\}$: either he gives random answers ($p_i = \frac{1}{2}$), or he is fully reliable ($p_i = 1$). The goal is to infer both the correct value $t_j$ of each task and the parameter $p_i$ of each worker. The factor graph corresponding to this problem is represented in Fig. 7.

Figure 7: Graphical model illustrating crowd-sourcing. Each worker is assigned a subset of the tasks for evaluation, and for each of those tasks $a$, his answer $A_a$ is collected.

The conditional probability distribution of $t$ and $p$ knowing the answers $A$ reads

$$P_{t,p|A} \propto P_{A|t,p} \, P_{t,p} \propto P_{A|t,p}, \qquad (20)$$

where we assumed a uniform distribution on the joint probability $P_{t,p}$. Then

$$P_{A|t,p} = \prod_e P_{A_e|t_e, p_e}, \qquad (21)$$

with

$$P_{A_e|t_e, p_e} = \left( \left( \frac{p_e}{1 - p_e} \right)^{A_e t_e} (1 - p_e) \, p_e \right)^{\frac{1}{2}}, \qquad (22)$$

where the product runs over the collected answers $e$, and $t_e$ and $p_e$ denote the task value and the worker parameter involved in answer $e$.
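To check the form of Eq. (22), the following snippet evaluates it for answers and task values in {-1, +1}: it returns $p$ when the answer matches the true value and $1 - p$ otherwise. The example reliabilities are arbitrary (and kept strictly below 1 so that the ratio in (22) stays finite).

```python
import numpy as np

def p_answer(A_e, t_e, p_e):
    """Eq. (22): probability of answer A_e in {-1,+1} given task value t_e and parameter p_e."""
    return np.sqrt((p_e / (1.0 - p_e)) ** (A_e * t_e) * (1.0 - p_e) * p_e)

# A correct answer has probability p_e, an incorrect one 1 - p_e.
print(p_answer(+1, +1, 0.8), p_answer(-1, +1, 0.8))   # -> 0.8, 0.2 (up to rounding)

def log_likelihood(answers, t, p):
    """log P_{A|t,p} of Eq. (21); answers is a list of (worker i, task j, value A_ij)."""
    return sum(np.log(p_answer(A, t[j], p[i])) for i, j, A in answers)
```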

1.4 MAP and MARG

MAP. The MAP problem consists in solving

$$\max_{x \in \{0,1\}^N} \ \sum_i \theta_i x_i + \sum_{(i,j) \in E} \theta_{ij} x_i x_j. \qquad (23)$$

When $\theta_{ij} \to -\infty$, neighbouring nodes can no longer be simultaneously in state 1. This is the hard-core model, which is very hard to solve.

MARG. The MARG problem focuses on the evaluation of marginal probabilities, depending on only one random variable, for instance

$$P_{X_1}(0) = \frac{Z(X_1 = 0)}{Z}, \qquad (24)$$

as well as conditional marginal probabilities:

$$P_{X_2|X_1}(X_2 = 0 \mid X_1 = 0) = \frac{Z(X_1 = 0, X_2 = 0)}{Z(X_1 = 0)}, \qquad (25)$$

$$P_{X_N|X_1 \cdots X_{N-1}}(X_N = 0 \mid X_1 = \cdots = X_{N-1} = 0) = \frac{Z(\text{all } 0)}{Z(\text{all but } X_N \text{ are } 0)}, \qquad (26)$$

$$P_{X_1}(0) \cdots P_{X_N|X_1 \cdots X_{N-1}}(0 \mid 0, \ldots, 0) = \frac{1}{Z}. \qquad (27)$$

Both of these problems are computationally hard. Can we design efficient algorithms to solve them?

2 Inference Algorithms: Elimination, Junction Tree and Belief Propagation

In the MAP and MARG problems described previously, the hardness comes from the fact that, with growing instance size, the number of combinations of variables over which to maximize or marginalize quickly becomes intractable. But when dealing with GMs, one can exploit the structure of the GM in order to reduce the number of combinations that have to be taken into account. Intuitively, the smaller the connectivity of the variables in the GM, the smaller this number of combinations becomes. We will formalize this by introducing the elimination algorithm, which gives us a systematic way of making fewer maximizations/marginalizations on a given graph. We will see how substantially the number of operations is reduced on a graph that is not completely connected.

2.1 The elimination algorithm

We consider the GM in Fig. 8, which is not fully connected. The colored subgraphs represent the maximal cliques.

Figure 8: A GM and its maximal cliques.

Using decomposition (16), we can write

$$P_X(x) \propto \phi_{123}(x_1, x_2, x_3) \, \phi_{234}(x_2, x_3, x_4) \, \phi_{245}(x_2, x_4, x_5). \qquad (28)$$

We want to solve the MARG problem on this GM, for example calculating the marginal probability of $x_1$:

$$P_{X_1}(x_1) = \sum_{x_2, x_3, x_4, x_5} P_X(x). \qquad (29)$$

A priori, this requires evaluating $|\mathcal{X}|^4$ terms (each a product of the 3 potentials), for each of the $|\mathcal{X}|$ values of $x_1$: in the end, about $3\,|\mathcal{X}| \times |\mathcal{X}|^4$ operations are needed to calculate this marginal naively. But if we take advantage of the factorized form (28), we can eliminate some of the variables. The elimination process goes along these lines:

$$P_{X_1}(x_1) \propto \sum_{x_2, x_3, x_4, x_5} \phi_{123}(x_1, x_2, x_3) \, \phi_{234}(x_2, x_3, x_4) \, \phi_{245}(x_2, x_4, x_5) \qquad (30)$$
$$\propto \sum_{x_2, x_3, x_4} \phi_{123}(x_1, x_2, x_3) \, \phi_{234}(x_2, x_3, x_4) \sum_{x_5} \phi_{245}(x_2, x_4, x_5) \qquad (31)$$
$$\propto \sum_{x_2, x_3, x_4} \phi_{123}(x_1, x_2, x_3) \, \phi_{234}(x_2, x_3, x_4) \, m_5(x_2, x_4) \qquad (32)$$
$$\propto \sum_{x_2, x_3} \phi_{123}(x_1, x_2, x_3) \sum_{x_4} \phi_{234}(x_2, x_3, x_4) \, m_5(x_2, x_4) \qquad (33)$$
$$\propto \sum_{x_2, x_3} \phi_{123}(x_1, x_2, x_3) \, m_4(x_2, x_3) \qquad (34)$$
$$\propto \sum_{x_2} \left( \sum_{x_3} \phi_{123}(x_1, x_2, x_3) \, m_4(x_2, x_3) \right) \qquad (35)$$
$$\propto \sum_{x_2} m_3(x_1, x_2) \qquad (36)$$
$$\propto m_2(x_1). \qquad (37)$$

With this elimination process, the number of operations necessary to compute the marginal scales as $|\mathcal{X}|^3$ instead of $|\mathcal{X}|^5$, thereby greatly reducing the complexity of the problem by using the structure of the GM. Similarly, we can rewrite the MAP problem as follows:

$$\max_{x_1, x_2, x_3, x_4, x_5} \phi_{123}(x_1, x_2, x_3) \, \phi_{234}(x_2, x_3, x_4) \, \phi_{245}(x_2, x_4, x_5) \qquad (38)$$
$$= \max_{x_1, x_2, x_3, x_4} \phi_{123}(x_1, x_2, x_3) \, \phi_{234}(x_2, x_3, x_4) \max_{x_5} \phi_{245}(x_2, x_4, x_5) \qquad (39)$$
$$= \max_{x_1, x_2, x_3, x_4} \phi_{123}(x_1, x_2, x_3) \, \phi_{234}(x_2, x_3, x_4) \, m'_5(x_2, x_4) \qquad (40)$$
$$= \max_{x_1, x_2, x_3} \phi_{123}(x_1, x_2, x_3) \max_{x_4} \phi_{234}(x_2, x_3, x_4) \, m'_5(x_2, x_4) \qquad (41)$$
$$= \max_{x_1, x_2, x_3} \phi_{123}(x_1, x_2, x_3) \, m'_4(x_2, x_3) \qquad (42)$$
$$= \max_{x_1, x_2} \left( \max_{x_3} \phi_{123}(x_1, x_2, x_3) \, m'_4(x_2, x_3) \right) \qquad (43)$$
$$= \max_{x_1, x_2} m'_3(x_1, x_2) \qquad (44)$$
$$= \max_{x_1} \left( \max_{x_2} m'_3(x_1, x_2) \right), \qquad (45)$$

leading to

$$x_1^* = \arg\max_{x_1} m'_2(x_1). \qquad (46)$$

Just like for the MARG problem, the complexity is reduced from $|\mathcal{X}|^5$ (a priori) to $|\mathcal{X}|^3$.
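The elimination steps (30)-(37) can be written generically as "sum out one variable, store the result as a new factor". The sketch below does this with numpy for the three-clique example (28), using random tables as potentials and checking the result against the naive sum; the elimination order (5, 4, 3, 2) matches the derivation above, and the alphabet size is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 3                                        # alphabet size |X| (arbitrary)
phi123 = rng.random((K, K, K))               # potentials of the maximal cliques in (28)
phi234 = rng.random((K, K, K))
phi245 = rng.random((K, K, K))

# Naive marginal of x1: sum the full joint over x2..x5 (cost ~ |X|^5).
joint = np.einsum('abc,bcd,bde->abcde', phi123, phi234, phi245)
naive = joint.sum(axis=(1, 2, 3, 4))

# Elimination following (30)-(37): each step sums out a single variable (cost ~ |X|^3).
m5 = phi245.sum(axis=2)                      # m5(x2, x4), Eq. (32)
m4 = np.einsum('bcd,bd->bc', phi234, m5)     # m4(x2, x3), Eq. (34)
m3 = np.einsum('abc,bc->ab', phi123, m4)     # m3(x1, x2), Eq. (36)
m2 = m3.sum(axis=1)                          # m2(x1),     Eq. (37)

assert np.allclose(naive, m2)
print(m2 / m2.sum())                         # normalized marginal P_{X1}
```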

We would like to further reduce the complexity of the marginalizations (in $|\mathcal{X}|^3$). One simple idea would be to reduce the GM to a linear graph, as in Fig. 9.

Figure 9: A linear graph. Each marginalization is computed in $|\mathcal{X}|^2$ operations.

Figure 10: Linear GM obtained by grouping variables.

By grouping variables in the GM (Fig. 8), it is in fact possible to obtain a linear graph, as shown in Fig. 10, with the associated potentials $\phi_{123}(Y_{123})$, $\phi_{234}(Y_{234})$ and $\phi_{245}(Y_{245})$, and consistency constraints imposing that the variables shared by $Y_{123}$ and $Y_{234}$, and by $Y_{234}$ and $Y_{245}$, take the same values. For other GMs, the simplest graph achievable by grouping variables might be a tree instead of a simple chain. But not all groupings of variables will lead to a tree graph that correctly represents the problem. In order for the grouping of variables to be correct, we need to build the tree attached to the maximal cliques, and we have to resort to the Junction Tree property.

2.2 Junction Tree property and chordal graphs

The Junction Tree property allows us to find groupings of variables under which the GM becomes a tree (if such groupings exist). On this tree, the elimination algorithm will need a lower number of maximizations/marginalizations than on the initial GM. However, there is a remaining problem: running the algorithm on the junction tree does not give a straightforward solution to the initial problem, as the variables on the junction tree are groupings of variables of the original problem. This means that further maximizations/marginalizations are then required to obtain a solution in terms of the variables of the initial problem.

2.2.1 Junction Tree (JCT) property

(Definition) A graph $G = (V, E)$ is said to possess the JCT property if it has a Junction Tree $T$, which is defined as follows:

- it is a tree graph such that its nodes are maximal cliques of $G$,
- an edge between nodes of $T$ is allowed only if the corresponding cliques share at least one vertex,
- for any vertex $v$ of $G$, let $C_v$ denote the set of all cliques containing $v$; then $C_v$ forms a connected sub-tree of $T$.

Two questions then arise:

- Do all graphs have a JCT?
- If a graph has a JCT, how can we find it?

2.2.2 Chordal graph

(Definition) A graph is chordal if all of its loops have chords. Fig. 11 gives an illustration of the concept.

Figure 11: The graph on the left is not chordal, the one on the right is.

Proposition 1 $G$ has a junction tree $\Leftrightarrow$ $G$ is a chordal graph.

Proof 2 (of the implication "$G$ chordal $\Rightarrow$ $G$ has a junction tree"). Let us take a chordal graph $G = (V, E)$ that is not complete, as represented in Fig. 12.

Figure 12: On a chordal graph that is not complete, two vertices $a$ and $b$ that are not connected, separated by a subgraph $S$ that is fully connected.

We will use the two following lemmas, which can be shown to be true:

1. If $G$ is chordal, has at least three nodes and is not fully connected, then $V = A \cup B \cup S$, where all three sets are disjoint and $S$ is a fully connected subgraph that separates $A$ from $B$.

2. If $G$ is chordal and has at least two nodes, then $G$ has at least two nodes each with all neighbors connected. Furthermore, if $G$ is not fully connected, then there exist two nonadjacent nodes each with all its neighbors connected.

The property "If $G$ is a chordal graph with $N$ vertices, then it has a junction tree" can be shown by induction on $N$. For $N = 2$, the property is trivial. Now, suppose that the property is true for all integers up to $N$, and consider a chordal graph with $N + 1$ nodes. By the second lemma, $G$ has a node $a$ with all its neighbors connected. Removing it creates a graph $G'$ which is chordal, and therefore has a JCT $T'$. Let $C$ be the maximal clique that $a$ participates in. Either $C \setminus a$ is a maximal-clique node in $T'$, and in this case adding $a$ to this clique node results in a junction tree $T$ for $G$. Or $C \setminus a$ is not a maximal-clique node in $T'$; then $C \setminus a$ must be a subset of a maximal-clique node $D$ of $T'$, and we add $C$ as a new maximal-clique node to $T'$, which we connect to $D$, to obtain a junction tree $T$ for $G$.

2.2.3 Procedure to find a JCT

Let $G$ be the initial GM, and $\mathcal{G}(\mathcal{V}, \mathcal{E})$ be the GM in which $\mathcal{V}$ is the set of maximal cliques of $G$ and $(c_1, c_2) \in \mathcal{E}$ if the maximal cliques $c_1$ and $c_2$ share a vertex. Let us take $e = (c_1, c_2)$ with $c_1, c_2 \in \mathcal{V}$ and define the weight of $e$ as $w_e = |c_1 \cap c_2|$. Then, finding a junction tree of $G$ is equivalent to finding the maximum weight spanning tree of $\mathcal{G}$. Denoting by $T$ the set of edges in a tree, we define the weight of the tree as

$$W(T) = \sum_{e \in T} w_e = \sum_{e = (c_1, c_2) \in T} |c_1 \cap c_2| = \sum_{v \in V} \sum_{e \in T} \mathbb{1}_{\{v \in e\}}, \qquad (47)$$

and we claim that $W(T)$ is maximal when $T$ is a JCT.

Procedure to get the maximum weight spanning tree:

- list all edges in decreasing order of weight,
- include $e$ in $E_1$ if you can (that is, if it does not create a loop);

what we are left with at the end of the algorithm is the maximum weight spanning tree.
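The greedy procedure above is Kruskal's algorithm run on the clique graph with weights $w_e = |c_1 \cap c_2|$. Below is a short sketch of it for the cliques of Fig. 8; the union-find helper and the variable names are ours, not from the notes.

```python
# Maximal cliques of the GM of Fig. 8, used as nodes of the clique graph.
cliques = [frozenset({1, 2, 3}), frozenset({2, 3, 4}), frozenset({2, 4, 5})]

# Candidate edges: pairs of cliques sharing at least one vertex, weighted by the overlap size.
edges = [(len(a & b), i, j)
         for i, a in enumerate(cliques) for j, b in enumerate(cliques)
         if i < j and a & b]

parent = list(range(len(cliques)))          # simple union-find to detect loops
def find(i):
    while parent[i] != i:
        i = parent[i]
    return i

tree = []
for w, i, j in sorted(edges, reverse=True): # edges in decreasing order of weight
    ri, rj = find(i), find(j)
    if ri != rj:                            # include the edge "if you can"
        parent[ri] = rj
        tree.append((cliques[i], cliques[j], w))

print(tree)   # the max-weight spanning tree, here a junction tree of Fig. 8
```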

2.2.4 Tree width

(Definition) The width of a tree decomposition is the size of its maximal clique minus one.

Toy examples:

Figure 13: tree width = 2 (left), tree width = N (right).

2.3 Belief propagation (BP) algorithms

Until now, everything we have done was exact: the elimination algorithm is an exact algorithm. But as we are interested in efficient algorithms, as opposed to exact algorithms with complexities too high to actually finish in reasonable time, we will from now on introduce approximations.

Figure 14: Message passing on a graph.

Coming back to the elimination algorithm (30)-(37), we can generalize the notations used as

$$m_i(x_j) \propto \sum_{x_i} \phi_i(x_i) \, \phi_{i,j}(x_i, x_j) \prod_k m_k(x_i). \qquad (48)$$

Considering now the same but oriented GM (arrows on the figure above), we get

$$m_{i \to j}(x_j) \propto \sum_{x_i} \phi_i(x_i) \, \phi_{i,j}(x_i, x_j) \prod_{k \in N(i) \setminus j} m_{k \to i}(x_i), \qquad (49)$$

where $N(i)$ is the neighbourhood of $x_i$.

The MARG problem can then be solved using the sum-product procedure.

Sum-product BP

- At $t = 0$, $\forall (i,j) \in E$, $\forall (x_i, x_j) \in \mathcal{X}^2$:
$$m^0_{i \to j}(x_j) = m^0_{j \to i}(x_i) = 1. \qquad (50)$$
- At $t > 0$:
$$m^{t+1}_{i \to j}(x_j) \propto \sum_{x_i} \phi_i(x_i) \, \phi_{ij}(x_i, x_j) \prod_{k \in N(i) \setminus j} m^t_{k \to i}(x_i), \qquad (51)$$
$$P^{t+1}_{X_i}(x_i) \propto \phi_i(x_i) \prod_{k \in N(i)} m^{t+1}_{k \to i}(x_i). \qquad (52)$$

While, for the MAP problem, the max-sum procedure is considered.

Max-sum BP

- At $t = 0$:
$$m^0_{i \to j}(x_j) = m^0_{j \to i}(x_i) = 1. \qquad (53)$$
- At $t > 0$:
$$m^{t+1}_{i \to j}(x_j) \propto \max_{x_i} \phi_i(x_i) \, \phi_{ij}(x_i, x_j) \prod_{k \in N(i) \setminus j} m^t_{k \to i}(x_i), \qquad (54)$$
$$x^{t+1}_i = \arg\max_{x_i} \phi_i(x_i) \prod_{k \in N(i)} m^{t+1}_{k \to i}(x_i). \qquad (55)$$

Note: here, we use only pairwise potentials. But in the case of larger cliques, we have to consider the JCT and iterate on it. To understand this point, let us apply the sum-product algorithm to factor graphs.

2.3.1 Factor graphs

Figure 15: A simple factor graph.

Considering the general notations in Fig. 15, the sum-product BP algorithm is particularized such that

$$m^{t+1}_{i \to f}(x_i) = \prod_{f' \in N(i) \setminus f} m^t_{f' \to i}(x_i), \qquad (56)$$

$$m^{t+1}_{f \to i}(x_i) = \sum_{x_j,\, j \in N(f) \setminus i} f(x_i, x_{N(f) \setminus i}) \prod_{j \in N(f) \setminus i} m^t_{j \to f}(x_j). \qquad (57)$$
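As a minimal working example of the pairwise sum-product updates (50)-(52), the sketch below runs them on a small tree-structured binary model with randomly chosen potentials, and checks the resulting marginals against brute-force enumeration (BP is exact on trees, as argued next). The graph, potentials and number of iterations are made up for the example.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
K = 2                                              # binary variables
edges = [(0, 1), (1, 2), (1, 3)]                   # a small tree on 4 nodes
phi = [rng.random(K) for _ in range(4)]            # node potentials phi_i
psi = {e: rng.random((K, K)) for e in edges}       # pairwise potentials phi_ij

def pair_pot(i, j, xi, xj):
    return psi[(i, j)][xi, xj] if (i, j) in psi else psi[(j, i)][xj, xi]

neigh = {i: [j for e in edges for j in e if i in e and j != i] for i in range(4)}
msg = {(i, j): np.ones(K) for e in edges for i, j in (e, e[::-1])}   # Eq. (50)

for _ in range(10):                                # Eq. (51): parallel message updates
    new = {}
    for (i, j) in msg:
        m = np.zeros(K)
        for xj in range(K):
            for xi in range(K):
                prod = np.prod([msg[(k, i)][xi] for k in neigh[i] if k != j])
                m[xj] += phi[i][xi] * pair_pot(i, j, xi, xj) * prod
        new[(i, j)] = m / m.sum()
    msg = new

marg = np.array([phi[i] * np.prod([msg[(k, i)] for k in neigh[i]], axis=0) for i in range(4)])
marg /= marg.sum(axis=1, keepdims=True)            # Eq. (52)

# Brute-force check: enumerate all configurations of the pairwise factorization.
brute = np.zeros((4, K))
for x in product(range(K), repeat=4):
    w = np.prod([phi[i][x[i]] for i in range(4)]) * np.prod([pair_pot(i, j, x[i], x[j]) for i, j in edges])
    for i in range(4):
        brute[i, x[i]] += w
brute /= brute.sum(axis=1, keepdims=True)
print(np.allclose(marg, brute))                    # True: BP is exact on this tree
```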

On a tree, the leaves are already sending the right messages at time 1, and after a number of time steps proportional to the tree diameter (the eccentricity of a vertex $v$ in a graph is the maximum distance from $v$ to any other vertex; the diameter of a graph is the maximum eccentricity over all vertices), all messages are correct: the fixed point is reached and the algorithm is exact. Therefore, BP is exact on trees. The JCT property discussed before is therefore useful, and can in certain cases allow us to construct graphs on which we know that BP is exact. However, the problem mentioned before remains: if BP is run on the JCT of a GM, subsequent maximizations/marginalizations will be necessary to recover the solution in terms of the initial problem's variables.

3 Understanding Belief Propagation

We have seen how to use the (exact) elimination algorithm in order to design the BP algorithms max-product and sum-product, which are exact only on trees. The JCT property has taught us how to group variables of an initial loopy GM such that the resulting GM is a tree (when it is possible), on which we can then run BP with a guarantee of an exact result. However, the subsequent operations that are necessary to obtain the solution in terms of the initial problem's variables can be a new source of intractability. Therefore, we would like to know what happens if we use BP on the initial (loopy) graph anyway. The advantage is that BP remains tractable because of the low number of operations per iteration. The danger is that BP is not exact anymore, and therefore we need to ask ourselves the following 3 questions:

1. Does the algorithm have fixed points?
2. What are those fixed points?
3. Are they reached?

The analysis will be made with the sum-product BP algorithm, but could be carried out similarly for the max-product version.

3.1 Existence of a fixed point

The algorithm is of the type

$$m^{t+1} = F(m^t) \quad \text{with} \quad m^t \in [0, 1]^{2|E||\mathcal{X}|}, \qquad (58)$$

and the existence of a fixed point is guaranteed by a fixed-point theorem (Brouwer's theorem applies, since $F$ is a continuous map of this compact convex set into itself).

3.2 Nature of the fixed points

Let us recall that we had factorized $P_X(x)$ in this way:

$$P_X(x) \propto \prod_{i \in V} \phi_i(x_i) \prod_{(i,j) \in E} \psi_{ij}(x_i, x_j) = \frac{1}{Z} e^{Q(x)}. \qquad (59)$$

The fixed points are solutions of the following variational problem, where

$$P_X = \arg\max_{\mu \in \mathcal{M}(\mathcal{X}^N)} \big\{ \mathbb{E}_\mu[Q(X)] + H(\mu) \big\}, \qquad (60)$$

with

$$\mathbb{E}_\mu[Q(X)] + H(\mu) = \sum_x \mu(x) Q(x) - \sum_x \mu(x) \log \mu(x) \triangleq F(\mu). \qquad (61)$$

Let us find a bound for this quantity. From (59), we get $Q(x) = \log P_X(x) + \log Z$. Then

$$F(\mu) = \sum_x \mu(x) \log Z + \sum_x \mu(x) \log \frac{P_X(x)}{\mu(x)} = \log Z + \mathbb{E}_\mu\left[\log \frac{P_X}{\mu}\right] \leq \log Z + \log \mathbb{E}_\mu\left[\frac{P_X}{\mu}\right] \leq \log Z, \qquad (62)$$

using Jensen's inequality, and the equality is reached when the distributions $\mu$ and $P_X$ are equal. The maximization in equation (60) is made over the space of all possible distributions, which is a far too big search space. But if we restrict ourselves to trees, we know that $\mu$ has the form

$$\mu = \prod_i \mu_i \prod_{(i,j)} \frac{\mu_{ij}}{\mu_i \mu_j}. \qquad (63)$$

BP has taught us that:

$$\mu_i \propto \phi_i \prod_{k \in N(i)} m_{k \to i}, \qquad (64)$$

$$\mu_{ij} \propto \phi_i \, \phi_j \, \psi_{ij} \prod_{k \in N(i) \setminus j} m_{k \to i} \prod_{l \in N(j) \setminus i} m_{l \to j}. \qquad (65)$$

If we marginalize $\mu_{ij}$ with respect to $x_j$, we should obtain $\mu_i$: $\sum_{x_j} \mu_{ij}(x_i, x_j) = \mu_i(x_i)$. Writing this out, we obtain

$$\phi_i \prod_{k \in N(i) \setminus j} m_{k \to i} \sum_{x_j} \psi_{ij} \, \phi_j \prod_{l \in N(j) \setminus i} m_{l \to j} = \phi_i \prod_{k \in N(i)} m_{k \to i}, \qquad (66)$$

and this should lead us to what we believe the fixed points of BP are. Let us make a recharacterization in terms of the fixed points. In order to lighten notations, we will write $\phi$ instead of $\log \phi$ and $\psi$ instead of $\log \psi$:

$$F_{\rm Bethe}(\mu) = \mathbb{E}_\mu\Big[\sum_i \phi_i + \sum_{i,j} \psi_{ij}\Big] - \mathbb{E}_\mu[\log \mu]. \qquad (67)$$

We now use the following factorization:

$$\mathbb{E}_\mu[\log \mu] = \sum_i \mathbb{E}_{\mu_i}[\log \mu_i] + \sum_{i,j} \Big( \mathbb{E}_{\mu_{ij}}[\log \mu_{ij}] - \mathbb{E}_{\mu_i}[\log \mu_i] - \mathbb{E}_{\mu_j}[\log \mu_j] \Big), \qquad (68)$$

and obtain a new expression for the Bethe free energy:

$$F_{\rm Bethe} = \sum_i (1 - d_i) \Big( H(\mu_i) + \mathbb{E}_{\mu_i}[\phi_i] \Big) + \sum_{i,j} \Big( H(\mu_{ij}) + \mathbb{E}_{\mu_{ij}}[\psi_{ij} + \phi_i + \phi_j] \Big), \qquad (69)$$

where $d_i$ is the degree of node $i$.

3.2.1 Background on Nonlinear Optimization

The problem

$$\max_q G(q) \quad \text{s.t.} \quad Aq = b \qquad (70)$$

can be expressed in a different form by using Lagrange multipliers $\lambda$ and maximizing

$$L(q, \lambda) = G(q) + \lambda^T (Aq - b). \qquad (71)$$

We have $\max_q L(q, \lambda) = M(\lambda) \geq G(q^*)$, and thus $\inf_\lambda M(\lambda) \geq G(q^*)$. Let us look at all $\lambda$ such that $\nabla_q L(q) = 0$. In a sense, BP is finding stationary points of this Lagrangian.

3.2.2 Belief Propagation as a variational problem

In our case, here are the conditions we will enforce with Lagrange multipliers:

$$\mu_{ij}(x_i, x_j) \geq 0, \qquad (72)$$
$$\sum_{x_i} \mu_i(x_i) = 1 \qquad (\text{multiplier } \lambda_i), \qquad (73)$$
$$\sum_{x_j} \mu_{ij}(x_i, x_j) = \mu_i(x_i) \qquad (\text{multiplier } \lambda_{ij}(x_i)), \qquad (74)$$
$$\sum_{x_i} \mu_{ij}(x_i, x_j) = \mu_j(x_j) \qquad (\text{multiplier } \lambda_{ji}(x_j)). \qquad (75)$$

The complete Lagrangian reads

$$L = F_{\rm Bethe}(\mu) + \sum_i \lambda_i \Big( \sum_{x_i} \mu_i(x_i) - 1 \Big) + \sum_{i,j} \sum_{x_i} \Big( \sum_{x_j} \mu_{ij}(x_i, x_j) - \mu_i(x_i) \Big) \lambda_{ij}(x_i) + \sum_{i,j} \sum_{x_j} \Big( \sum_{x_i} \mu_{ij}(x_i, x_j) - \mu_j(x_j) \Big) \lambda_{ji}(x_j). \qquad (76)$$

We need to find the stationary points of this Lagrangian with respect to all variables, which we obtain by setting the partial derivatives to zero:

$$\frac{\partial L}{\partial \mu_i(x_i)} = 0 = (1 - d_i)(1 + \log \mu_i(x_i)) + (1 - d_i) \phi_i(x_i) + \lambda_i - \sum_{j \in N(i)} \lambda_{ij}(x_i), \qquad (77)$$

which imposes the following equality for the distribution $\mu_i$:

$$\mu_i(x_i) \propto e^{\phi_i(x_i) + \frac{1}{d_i - 1} \sum_{j \in N(i)} \lambda_{ij}(x_i)}. \qquad (78)$$

Let us now use the transformation $\lambda_{ij}(x_i) = \sum_{k \in N(i) \setminus j} \log m_{k \to i}(x_i)$, and we obtain

$$\sum_{j \in N(i)} \lambda_{ij}(x_i) = (d_i - 1) \sum_{j \in N(i)} \log m_{j \to i}(x_i). \qquad (79)$$

In the same way, we can show that

$$\frac{\partial L}{\partial \mu_{ij}(x_i, x_j)} = 0 \quad \Rightarrow \quad \mu_{ij}(x_i, x_j) \propto e^{\phi_i(x_i) + \phi_j(x_j) + \psi_{ij}(x_i, x_j) + \lambda_{ij}(x_i) + \lambda_{ji}(x_j)}.$$

This way, we found the distributions $\mu_i$ and $\mu_{ij}$ that are the fixed points of BP.

3.3 Can the fixed points be reached?

We will now try to analyze whether the algorithm can actually reach those fixed points that we have exhibited in the previous section. Let us look at the simple (but loopy) graph in Fig. 16.

Figure 16: A simple loopy graph.

At time $t = 1$, we have

$$m^1_{2 \to 1}(x_1) \propto \sum_{x_2} \phi_2(x_2) \, \phi_{12}(x_1, x_2) \underbrace{m^0_{3 \to 2}(x_2)}_{=1} \qquad (80)$$

and

$$m^1_{3 \to 1}(x_1) \propto \sum_{x_3} \phi_3(x_3) \, \phi_{13}(x_1, x_3), \qquad (81)$$

which also corresponds to the messages of the modified graph in Fig. 17.

Figure 17: Graph seen by BP at time t = 1.

Figure 18: Graph seen by BP at time t = 2.

At time $t = 2$, the messages will be

$$m^2_{2 \to 1}(x_1) \propto \sum_{x_2} \phi_2(x_2) \, \phi_{12}(x_1, x_2) \, m^1_{3 \to 2}(x_2), \qquad (82)$$

corresponding to the messages on the modified graph in Fig. 18. If we increase $t$, the corresponding non-loopy graph gets longer at each time step. Another way of seeing this is by looking at the recursion equation:

$$F_{i \to j}(m^*) = m^*_{i \to j}, \qquad m^{t+1}_{i \to j} = F_{i \to j}(m^t), \qquad (83)$$

$$m^{t+1}_{i \to j} - m^*_{i \to j} = F_{i \to j}(m^t) - F_{i \to j}(m^*) = \nabla F_{i \to j}(\theta)^T (m^t - m^*) \quad \text{(mean value theorem)},$$

$$\|m^{t+1} - m^*\|_\infty \leq \|\nabla F_{i \to j}(\theta)\|_1 \, \|m^t - m^*\|_\infty. \qquad (84)$$

From this last inequality, it is clear that if we can prove that $\|\nabla F_{i \to j}\|_1$ is bounded by some constant $\rho < 1$, the convergence is proved. Unfortunately, it is not often easy to prove such a thing.

3.3.1 The hardcore model

In the hardcore model, we have

$$\phi_i(x_i) = 1 \quad \text{for all } x_i \in \{0, 1\}, \qquad (85)$$

$$\psi_{ij}(x_i, x_j) = 1 - x_i x_j. \qquad (86)$$

Instead of using BP, let us use the following gradient-descent-like algorithm:

$$y(t + 1) = \left[ y(t) + \alpha(t) \frac{\partial F}{\partial y} \right], \qquad (87)$$

where the operator $[\,\cdot\,]$ is a clipping function that ensures that the result stays in the interval $(0, 1)$. This is a projected version of a gradient algorithm with variable step size $\alpha(t)$. Choosing this step size with the rule

$$\alpha(t) = \frac{1}{t^{1/2} \, 2^d}, \qquad (88)$$

we can show that in a time $T \sim n^2 \, 2^d \, \frac{1}{\epsilon^4}$ we will find $F_{\rm Bethe}$ up to $\epsilon$, and convergence is proved.
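The update (87)-(88) is a projected (clipped) gradient step with a decaying step size. The sketch below illustrates only that mechanism on a stand-in concave toy objective, since implementing the actual Bethe free energy of the hardcore model takes more than a few lines; the objective, dimension and constants are invented for the illustration.

```python
import numpy as np

d = 3                                        # stand-in for the degree appearing in (88)
y_star = np.array([0.2, 0.7, 0.9])           # maximizer of the toy objective (assumed known here)

def grad_F(y):
    """Gradient of the toy concave objective F(y) = -||y - y_star||^2 / 2."""
    return y_star - y

y = np.full(3, 0.5)                          # start in the middle of (0, 1)^3
for t in range(1, 2001):
    alpha = 1.0 / (np.sqrt(t) * 2 ** d)      # step size rule of Eq. (88)
    y = np.clip(y + alpha * grad_F(y), 1e-6, 1 - 1e-6)   # clipped update of Eq. (87)

print(y)                                     # close to y_star after enough iterations
```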

4 Learning Graphical Models

In this final section, we focus on the learning problem. In particular, we consider three different cases:

- Parameter learning: given a graph, the parameters are learned from the observation of the entire set of realizations of all random variables.
- Graphical model learning: both the parameters and the graph are learned from the observations of the entire set of realizations of all random variables.
- Latent graphical model learning: the parameters and the graph are learned from partial observations: some of the random variables are assumed to be hidden.

4.1 Parameter learning

4.1.1 Single parameter learning

We consider the following simple setting, where $x$ is a Bernoulli random variable with parameter $\theta$:

$$P_X(x, \theta) = \begin{cases} \theta & \text{if } x = 1, \\ 1 - \theta & \text{if } x = 0. \end{cases} \qquad (89)$$

Given observations $\{x_1, \ldots, x_S\}$, we are interested in the MAP estimation of the parameter $\theta$:

$$\hat{\theta}_{\rm MAP} = \arg\max_{\theta \in [0,1]} P(\theta \mid x_1, \ldots, x_S) = \arg\max_{\theta \in [0,1]} P(x_1, \ldots, x_S \mid \theta) \, p(\theta), \qquad (90)$$

where maximizing $P(x_1, \ldots, x_S \mid \theta)$ leads to the maximum likelihood (ML) estimator $\hat{\theta}_{\rm ML}$ of $\theta$. Denoting $\mathcal{D} \triangleq \{x_1, \ldots, x_S\}$ the observed set of realizations, we define the empirical likelihood as follows:

$$l(\mathcal{D}; \theta) = \frac{1}{S} \log P(x_1, \ldots, x_S \mid \theta) = \frac{1}{S} \sum_i \log P(x_i \mid \theta) = \hat{P}(1) \log \theta + \hat{P}(0) \log(1 - \theta), \qquad (91)$$

with $\hat{P}(1) = \frac{1}{S} \sum_{i=1}^S \mathbb{1}_{\{x_i = 1\}}$. Differentiating (91) and setting the result to zero, we obtain the maximum likelihood estimator $\hat{\theta}_{\rm ML}$:

$$\frac{\partial}{\partial \theta} l(\mathcal{D}; \theta) = \frac{\hat{P}(1)}{\theta} - \frac{\hat{P}(0)}{1 - \theta} = 0 \quad \Rightarrow \quad \hat{\theta}_{\rm ML} = \hat{P}(1). \qquad (92)$$

What is the number of samples $S$ needed to achieve $\hat{\theta}_{\rm ML}(S) \in (1 \pm \epsilon)\theta$? Considering the binomial variable $B(S, \theta)$ (which is the sum of $S$ independently drawn Bernoulli variables from (89)), we can write

$$P\big( |B(S, \theta) - S\theta| > \epsilon S \theta \big) \leq \exp(-\epsilon^2 S \theta) \leq \delta \quad \Rightarrow \quad S \geq \frac{1}{\theta} \frac{1}{\epsilon^2} \log \frac{1}{\delta}. \qquad (93)$$
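A quick numerical illustration of (92)-(93): estimate $\hat{\theta}_{\rm ML} = \hat{P}(1)$ from $S$ Bernoulli samples and observe how the relative error shrinks as $S$ grows; the value of $\theta$ and the sample sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.3                                        # true Bernoulli parameter (arbitrary)

for S in (100, 10_000, 1_000_000):
    x = rng.random(S) < theta                      # S samples from (89)
    theta_ml = x.mean()                            # ML estimator, Eq. (92): empirical frequency of 1
    print(S, theta_ml, abs(theta_ml - theta) / theta)   # relative error decays roughly like 1/sqrt(S*theta)
```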

4.1.2 Directed graphs

We consider the following setting, in which we have not one but many random variables to learn on a directed graph:

$$P_X(x) = \prod_i P_{X_i|X_{\Pi_i}}(x_i \mid x_{\Pi_i}), \qquad (94)$$

where $\Pi_i$ stands for the parents of node $i$, and $P_{X_i|X_{\Pi_i}}(x_i \mid x_{\Pi_i}) \triangleq \theta_{x_i, x_{\Pi_i}}$. Again, we look at the empirical likelihood

$$l(\mathcal{D}; \theta) = \sum_i \sum_{x_i, x_{\Pi_i}} \hat{P}(x_i, x_{\Pi_i}) \log \theta_{x_i, x_{\Pi_i}} = \sum_i \sum_{x_i, x_{\Pi_i}} \hat{P}(x_i \mid x_{\Pi_i}) \, \hat{P}(x_{\Pi_i}) \left[ \log \frac{\theta_{x_i, x_{\Pi_i}}}{\hat{P}(x_i \mid x_{\Pi_i})} + \log \hat{P}(x_i \mid x_{\Pi_i}) \right], \qquad (95)$$

and set the derivative to zero in order to obtain the ML estimation of $\theta$. The only $\theta$-dependent part is $\mathbb{E}_{\hat{P}}\big[\log \frac{\theta_{x_i, x_{\Pi_i}}}{\hat{P}(x_i | x_{\Pi_i})}\big]$, a negative Kullback-Leibler divergence, which is maximal when

$$\hat{\theta}^{\rm ML}_{x_i, x_{\Pi_i}} = \hat{P}(x_i \mid x_{\Pi_i}). \qquad (96)$$

4.1.3 Undirected graphs

Let us now consider the case of undirected graphs. To reduce the amount of indices, we will write $i$ instead of $x_i$ in the following.

- On a tree: $P_X = \prod_i P_i \prod_{(i,j)} \frac{P_{ij}}{P_i P_j}$; possible estimator: $\prod_i \hat{P}_i \prod_{(i,j)} \frac{\hat{P}_{ij}}{\hat{P}_i \hat{P}_j}$.
- On a chordal graph: $P_X \propto \prod_{C \in \mathcal{C}} \phi_C(x_C)$; possible estimator: $\frac{\prod_C \hat{P}_C(x_C)}{\prod_S \hat{P}_S(x_S)}$, with cliques $C$ and separators $S$.
- On a triangle-free graph: $P_X \propto \prod_i \phi_i \prod_{(i,j)} \psi_{ij}$.

For the last case, let us use the Hammersley-Clifford theorem. Let $\mathcal{X} = \{0, 1\}$. On a triangle-free graph, the maximal clique size is 2, and therefore we can write

$$P_X(x) \propto \exp\left( \sum_i U_i(x_i) + \sum_{i,j} V_{ij}(x_i, x_j) \right). \qquad (97)$$

Using the fact that we have an MRF, we get

$$\frac{P(X_i = 1, X_{\rm rest} = 0)}{P(X_i = 0, X_{\rm rest} = 0)} = \exp(Q(i)). \qquad (98)$$

Also, because on an MRF a variable conditioned on its neighbours is independent of all the others, we can write

$$\frac{P(X_i = 1, X_{\rm rest} = 0)}{P(X_i = 0, X_{\rm rest} = 0)} = \frac{P(X_i = 1, X_{N(i)} = 0)}{P(X_i = 0, X_{N(i)} = 0)}, \qquad (99)$$

and therefore this quantity can be calculated with $2^{|N(i)|+1}$ operations.

4.2 Graphical model learning

What can we learn from a set of realizations of variables when the underlying graph is not known? We now focus on the following maximization:

$$\max_{G, \theta_G} l(\mathcal{D}; G, \theta_G) = \max_G \underbrace{\max_{\theta_G} l(\mathcal{D}; G, \theta_G)}_{\hat{l}(\mathcal{D}; G) \,\triangleq\, l(\mathcal{D}; G, \hat{\theta}_G^{\rm ML})}. \qquad (100)$$

From the previous subsection, we have $\hat{\theta}_G^{\rm ML}$, and therefore we only need to find a way to evaluate the maximization over the possible graphs.

4.2.1 Directed graphs

On a directed graph $G = (i, \Pi_i)$, the empirical likelihood reads

$$\hat{l}(\mathcal{D}; G) = \sum_i \sum_{x_i, x_{\Pi_i}} \hat{P}(x_i, x_{\Pi_i}) \log \hat{P}(x_i \mid x_{\Pi_i}) = \sum_i \sum_{x_i, x_{\Pi_i}} \hat{P}(x_i, x_{\Pi_i}) \log \left[ \frac{\hat{P}(x_i, x_{\Pi_i})}{\hat{P}(x_i) \hat{P}(x_{\Pi_i})} \hat{P}(x_i) \right]$$
$$= \sum_i \sum_{x_i, x_{\Pi_i}} \hat{P}(x_i, x_{\Pi_i}) \log \frac{\hat{P}(x_i, x_{\Pi_i})}{\hat{P}(x_i) \hat{P}(x_{\Pi_i})} + \sum_i \sum_{x_i} \hat{P}(x_i) \log \hat{P}(x_i) = \sum_i I(\hat{X}_i; \hat{X}_{\Pi_i}) - \sum_i H(\hat{X}_i). \qquad (101)$$

Looking for the graph maximizing the empirical likelihood thus consists in maximizing the mutual information: $\max_G \sum_i I(\hat{X}_i; \hat{X}_{\Pi_i})$. In a general setting, this is not easy. Reducing the search space to trees, however, some methods exist, like the Chow-Liu algorithm [1], which relies on the procedure used to get the maximum weight spanning tree (cf. section 2).

4.2.2 Undirected graphs

What can we do in the case of undirected graphs? Let us restrict ourselves to the binary case $x \in \{0, 1\}^N$ and to exponential families:

$$P_X(x) = \exp\left( \sum_i \theta_i x_i + \sum_{i,j} \theta_{ij} x_i x_j - \log Z(\theta) \right). \qquad (102)$$

Again, we denote $\mathcal{D} = \{x_1, \ldots, x_S\}$ the observed dataset, and the log-likelihood can be written as

$$l(\mathcal{D}; \theta) = \underbrace{\sum_i \theta_i \mu_i + \sum_{i,j} \theta_{ij} \mu_{ij}}_{\langle \theta, \mu \rangle} - \log Z(\theta), \qquad (103)$$

where $\mu_i$ and $\mu_{ij}$ are the empirical moments $\frac{1}{S}\sum_s x_i^{(s)}$ and $\frac{1}{S}\sum_s x_i^{(s)} x_j^{(s)}$. As $l(\mathcal{D}; \theta)$ is a concave function of $\theta$, it can be efficiently maximized using a gradient ascent algorithm of the form

$$\theta^{t+1} = \theta^t + \alpha(t) \, \nabla_\theta l(\mathcal{D}; \theta) \big|_{\theta = \theta^t}. \qquad (104)$$

The difficulty in this formula is the evaluation of the gradient:

$$\nabla_{\theta_i} l(\mathcal{D}; \theta) = \mu_i - \mathbb{E}_\theta(X_i), \qquad (105)$$

whose second term is an expectation that has to be calculated, using the sum-product algorithm or with a Markov chain Monte Carlo method for instance.

Another question is whether we will be learning interesting graphs at all. Graph-learning algorithms tend to link variables that are not linked in the real underlying graph. To avoid this, complicated graphs should be penalized by introducing a regularizer. Unfortunately, this is a highly non-trivial problem, and graphical model learning algorithms do not always perform well to this day.
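The sketch below illustrates the gradient ascent (104)-(105) on a tiny fully observed binary model of the form (102), where $N$ is small enough that $\mathbb{E}_\theta[X_i]$ and $\mathbb{E}_\theta[X_i X_j]$ can be computed by exhaustive enumeration instead of sum-product or MCMC; all sizes, ground-truth couplings and step sizes are invented for the example.

```python
import numpy as np
from itertools import product, combinations

rng = np.random.default_rng(0)
N = 4
pairs = list(combinations(range(N), 2))
states = np.array(list(product((0, 1), repeat=N)), dtype=float)   # all 2^N configurations

def moments(theta_i, theta_ij):
    """Exact E_theta[X_i] and E_theta[X_i X_j] by enumerating the 2^N states of (102)."""
    energy = states @ theta_i + sum(theta_ij[k] * states[:, i] * states[:, j]
                                    for k, (i, j) in enumerate(pairs))
    p = np.exp(energy - energy.max())
    p /= p.sum()
    m1 = p @ states
    m2 = np.array([p @ (states[:, i] * states[:, j]) for i, j in pairs])
    return m1, m2

# "Data": empirical moments generated from some ground-truth parameters.
mu1, mu2 = moments(0.5 * rng.normal(size=N), 0.5 * rng.normal(size=len(pairs)))

theta_i, theta_ij = np.zeros(N), np.zeros(len(pairs))
for t in range(3000):                       # gradient ascent (104) with moment-matching gradient (105)
    m1, m2 = moments(theta_i, theta_ij)
    theta_i += 0.2 * (mu1 - m1)
    theta_ij += 0.2 * (mu2 - m2)

print(np.max(np.abs(mu1 - m1)), np.max(np.abs(mu2 - m2)))   # the gradient shrinks toward 0 at the maximum
```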

4.3 Latent Graphical Model learning: the Expectation-Maximization algorithm

In this last case, we distinguish two different kinds of variables: $Y$ stands for the observed variables, $X$ denotes the hidden variables. The parameter $\theta$ is estimated from the observations, namely

$$\hat{\theta}_{\rm ML} = \arg\max_\theta \log P_Y(y; \theta). \qquad (106)$$

The log-likelihood is derived by marginalizing over the hidden variables:

$$l(y; \theta) = \log P_Y(y; \theta) = \log \sum_x P_{X,Y}(x, y; \theta) \qquad (107)$$

$$= \log \sum_x q(x \mid y) \frac{P_{X,Y}(x, y; \theta)}{q(x \mid y)} \qquad (108)$$

$$= \log \mathbb{E}_q\left[ \frac{P}{q} \right] \geq \mathbb{E}_q\left[ \log \frac{P}{q} \right] \triangleq L(q; \theta). \qquad (109)$$

This gives rise to the Expectation-Maximization (EM) algorithm [2].

EM algorithm. Until convergence, iterate between:

- E-step: estimation of the distribution $q$ given $\theta^t$: $q^{t+1} = \arg\max_q L(q; \theta^t)$.
- M-step: estimation of the parameter $\theta$ given $q^{t+1}$: $\theta^{t+1} = \arg\max_\theta L(q^{t+1}; \theta)$.
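As a concrete instance of the E- and M-steps above, here is EM for a toy latent-variable model: each observation is the number of heads in $n$ flips of one of two coins, the coin identity being hidden. The model, its parameters and the data generation are all made up for the illustration; the E-step computes $q(x \mid y)$ (the responsibilities) and the M-step re-estimates $\theta = (\pi, \theta_1, \theta_2)$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, S = 10, 500                                   # flips per trial, number of trials
true_pi, true_th = 0.4, np.array([0.3, 0.8])     # hidden mixing weight and coin biases

z = rng.random(S) < true_pi                      # hidden coin choices
heads = rng.binomial(n, np.where(z, true_th[0], true_th[1]))   # observed head counts

pi, th = 0.5, np.array([0.4, 0.6])               # initial guess for theta = (pi, th1, th2)
for _ in range(200):
    # E-step: posterior responsibility q(x = coin 1 | y) for each trial
    lik1 = pi * th[0] ** heads * (1 - th[0]) ** (n - heads)
    lik2 = (1 - pi) * th[1] ** heads * (1 - th[1]) ** (n - heads)
    r = lik1 / (lik1 + lik2)
    # M-step: re-estimate the parameters from the expected sufficient statistics
    pi = r.mean()
    th = np.array([(r * heads).sum() / (n * r.sum()),
                   ((1 - r) * heads).sum() / (n * (1 - r).sum())])

print(pi, th)   # close to the generating values (up to label switching)
```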

References

[1] C. K. Chow and C. N. Liu. Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, 14(3):462-467, 1968.

[2] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1-38, 1977.

[3] G. R. Grimmett. A theorem about random fields. Bulletin of the London Mathematical Society, 5(1):81-84, 1973.

[4] J. M. Hammersley and P. Clifford. Markov fields on finite graphs and lattices. Unpublished manuscript, 1971. Available online: hammfest/hamm-cliff.pdf.


More information

Notes on Frequency Estimation in Data Streams

Notes on Frequency Estimation in Data Streams Notes on Frequency Estmaton n Data Streams In (one of) the data streamng model(s), the data s a sequence of arrvals a 1, a 2,..., a m of the form a j = (, v) where s the dentty of the tem and belongs to

More information

Econ107 Applied Econometrics Topic 3: Classical Model (Studenmund, Chapter 4)

Econ107 Applied Econometrics Topic 3: Classical Model (Studenmund, Chapter 4) I. Classcal Assumptons Econ7 Appled Econometrcs Topc 3: Classcal Model (Studenmund, Chapter 4) We have defned OLS and studed some algebrac propertes of OLS. In ths topc we wll study statstcal propertes

More information

Note on EM-training of IBM-model 1

Note on EM-training of IBM-model 1 Note on EM-tranng of IBM-model INF58 Language Technologcal Applcatons, Fall The sldes on ths subject (nf58 6.pdf) ncludng the example seem nsuffcent to gve a good grasp of what s gong on. Hence here are

More information

Lecture 3: Probability Distributions

Lecture 3: Probability Distributions Lecture 3: Probablty Dstrbutons Random Varables Let us begn by defnng a sample space as a set of outcomes from an experment. We denote ths by S. A random varable s a functon whch maps outcomes nto the

More information

Probability Theory (revisited)

Probability Theory (revisited) Probablty Theory (revsted) Summary Probablty v.s. plausblty Random varables Smulaton of Random Experments Challenge The alarm of a shop rang. Soon afterwards, a man was seen runnng n the street, persecuted

More information

On the Multicriteria Integer Network Flow Problem

On the Multicriteria Integer Network Flow Problem BULGARIAN ACADEMY OF SCIENCES CYBERNETICS AND INFORMATION TECHNOLOGIES Volume 5, No 2 Sofa 2005 On the Multcrtera Integer Network Flow Problem Vassl Vasslev, Marana Nkolova, Maryana Vassleva Insttute of

More information

The EM Algorithm (Dempster, Laird, Rubin 1977) The missing data or incomplete data setting: ODL(φ;Y ) = [Y;φ] = [Y X,φ][X φ] = X

The EM Algorithm (Dempster, Laird, Rubin 1977) The missing data or incomplete data setting: ODL(φ;Y ) = [Y;φ] = [Y X,φ][X φ] = X The EM Algorthm (Dempster, Lard, Rubn 1977 The mssng data or ncomplete data settng: An Observed Data Lkelhood (ODL that s a mxture or ntegral of Complete Data Lkelhoods (CDL. (1a ODL(;Y = [Y;] = [Y,][

More information

Entropy of Markov Information Sources and Capacity of Discrete Input Constrained Channels (from Immink, Coding Techniques for Digital Recorders)

Entropy of Markov Information Sources and Capacity of Discrete Input Constrained Channels (from Immink, Coding Techniques for Digital Recorders) Entropy of Marov Informaton Sources and Capacty of Dscrete Input Constraned Channels (from Immn, Codng Technques for Dgtal Recorders). Entropy of Marov Chans We have already ntroduced the noton of entropy

More information

Excess Error, Approximation Error, and Estimation Error

Excess Error, Approximation Error, and Estimation Error E0 370 Statstcal Learnng Theory Lecture 10 Sep 15, 011 Excess Error, Approxaton Error, and Estaton Error Lecturer: Shvan Agarwal Scrbe: Shvan Agarwal 1 Introducton So far, we have consdered the fnte saple

More information

Lecture 10: May 6, 2013

Lecture 10: May 6, 2013 TTIC/CMSC 31150 Mathematcal Toolkt Sprng 013 Madhur Tulsan Lecture 10: May 6, 013 Scrbe: Wenje Luo In today s lecture, we manly talked about random walk on graphs and ntroduce the concept of graph expander,

More information

Ensemble Methods: Boosting

Ensemble Methods: Boosting Ensemble Methods: Boostng Ncholas Ruozz Unversty of Texas at Dallas Based on the sldes of Vbhav Gogate and Rob Schapre Last Tme Varance reducton va baggng Generate new tranng data sets by samplng wth replacement

More information

Integrals and Invariants of Euler-Lagrange Equations

Integrals and Invariants of Euler-Lagrange Equations Lecture 16 Integrals and Invarants of Euler-Lagrange Equatons ME 256 at the Indan Insttute of Scence, Bengaluru Varatonal Methods and Structural Optmzaton G. K. Ananthasuresh Professor, Mechancal Engneerng,

More information

Using T.O.M to Estimate Parameter of distributions that have not Single Exponential Family

Using T.O.M to Estimate Parameter of distributions that have not Single Exponential Family IOSR Journal of Mathematcs IOSR-JM) ISSN: 2278-5728. Volume 3, Issue 3 Sep-Oct. 202), PP 44-48 www.osrjournals.org Usng T.O.M to Estmate Parameter of dstrbutons that have not Sngle Exponental Famly Jubran

More information

Lecture 4: Universal Hash Functions/Streaming Cont d

Lecture 4: Universal Hash Functions/Streaming Cont d CSE 5: Desgn and Analyss of Algorthms I Sprng 06 Lecture 4: Unversal Hash Functons/Streamng Cont d Lecturer: Shayan Oves Gharan Aprl 6th Scrbe: Jacob Schreber Dsclamer: These notes have not been subjected

More information

Lecture 3. Ax x i a i. i i

Lecture 3. Ax x i a i. i i 18.409 The Behavor of Algorthms n Practce 2/14/2 Lecturer: Dan Spelman Lecture 3 Scrbe: Arvnd Sankar 1 Largest sngular value In order to bound the condton number, we need an upper bound on the largest

More information

Logistic Regression. CAP 5610: Machine Learning Instructor: Guo-Jun QI

Logistic Regression. CAP 5610: Machine Learning Instructor: Guo-Jun QI Logstc Regresson CAP 561: achne Learnng Instructor: Guo-Jun QI Bayes Classfer: A Generatve model odel the posteror dstrbuton P(Y X) Estmate class-condtonal dstrbuton P(X Y) for each Y Estmate pror dstrbuton

More information

Natural Language Processing and Information Retrieval

Natural Language Processing and Information Retrieval Natural Language Processng and Informaton Retreval Support Vector Machnes Alessandro Moschtt Department of nformaton and communcaton technology Unversty of Trento Emal: moschtt@ds.untn.t Summary Support

More information

Maximum Likelihood Estimation of Binary Dependent Variables Models: Probit and Logit. 1. General Formulation of Binary Dependent Variables Models

Maximum Likelihood Estimation of Binary Dependent Variables Models: Probit and Logit. 1. General Formulation of Binary Dependent Variables Models ECO 452 -- OE 4: Probt and Logt Models ECO 452 -- OE 4 Maxmum Lkelhood Estmaton of Bnary Dependent Varables Models: Probt and Logt hs note demonstrates how to formulate bnary dependent varables models

More information

Lecture 20: Lift and Project, SDP Duality. Today we will study the Lift and Project method. Then we will prove the SDP duality theorem.

Lecture 20: Lift and Project, SDP Duality. Today we will study the Lift and Project method. Then we will prove the SDP duality theorem. prnceton u. sp 02 cos 598B: algorthms and complexty Lecture 20: Lft and Project, SDP Dualty Lecturer: Sanjeev Arora Scrbe:Yury Makarychev Today we wll study the Lft and Project method. Then we wll prove

More information

PHYS 705: Classical Mechanics. Calculus of Variations II

PHYS 705: Classical Mechanics. Calculus of Variations II 1 PHYS 705: Classcal Mechancs Calculus of Varatons II 2 Calculus of Varatons: Generalzaton (no constrant yet) Suppose now that F depends on several dependent varables : We need to fnd such that has a statonary

More information