CSE 254: MAP estimation via agreement on (hyper)trees: Message-passing and linear programming approaches
1 CSE 254: MAP estimation via agreement on (hyper)trees: Message-passing and linear programming approaches. A presentation by Evan Ettinger, November 11, 2005.
2 Outline
- Introduction: Motivation and Background
- Tree-Reparameterizations: Definitions and Upper Bounds
- LP Approaches: IP to LP to Lagrangian Dual
- Message-passing Approaches: Two Algorithms
- Summary
3 Outline (the outline above is repeated as a section-transition slide)
4 Motivation and Background: Motivation
- Want: MAP estimation on pairwise MRFs, i.e., configurations in the set $\arg\max_{x \in \mathcal{X}^n} p(x; \theta)$.
- We limit ourselves to pairwise MRFs and discrete label sets $\mathcal{X}_s$.
5 Motivation and Background: MRFs as Exponential Family
- Recall that any MRF can be written in exponential family form as: $p(x; \theta) \propto \exp\{\langle \theta, \phi(x) \rangle\} = \exp\{\sum_{\alpha \in I} \theta_\alpha \phi_\alpha(x)\}$.
- Here $I$ indexes over the vertices, edges, and labels.
- We use the overcomplete representation of indicator functions: $\{\delta_j(x_s) \mid j \in \mathcal{X}_s\}$ for $s \in V$, and $\{\delta_j(x_s)\delta_k(x_t) \mid (j,k) \in \mathcal{X}_s \times \mathcal{X}_t\}$ for $(s,t) \in E$.
- Consequence: many exponential parameters correspond to a given distribution (i.e., $p(x; \theta) = p(x; \hat\theta)$ for some $\theta \neq \hat\theta$).
6 Motivation and Background: Marginal Distributions
- With this representation, marginal probabilities become: $\mu_{s;j} = \mathbb{E}[\delta_j(x_s)] := \sum_{x \in \mathcal{X}^n} p(x; \theta)\,\delta_j(x_s)$ and $\mu_{st;jk} = \mathbb{E}[\delta_j(x_s)\delta_k(x_t)] := \sum_{x \in \mathcal{X}^n} p(x; \theta)\,\delta_j(x_s)\delta_k(x_t)$.
- We define the marginal polytope as $\mathrm{MARG}(G) := \{\mu \in \mathbb{R}^d \mid \mu_{s;j} = \mathbb{E}_p[\delta_j(x_s)],\ \mu_{st;jk} = \mathbb{E}_p[\delta_j(x_s)\delta_k(x_t)] \text{ for some } p(\cdot)\}$.
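To make the overcomplete mean parameters concrete, here is a tiny enumeration sketch of my own (not from the slides); the 2-node joint distribution p is an arbitrary assumption.

```python
# Mean parameters mu in the overcomplete representation, computed by direct
# enumeration for a 2-node, 2-label model: mu_{s;j} = P(x_s = j) and
# mu_{st;jk} = P(x_s = j, x_t = k).
import numpy as np

p = np.array([[0.1, 0.2],
              [0.3, 0.4]])          # an arbitrary joint over (x_1, x_2)

mu_1 = p.sum(axis=1)                # mu_{1;j} = P(x_1 = j)
mu_2 = p.sum(axis=0)                # mu_{2;k} = P(x_2 = k)
mu_12 = p                           # mu_{12;jk} = P(x_1 = j, x_2 = k)
print(mu_1, mu_2)                   # any such (mu_1, mu_2, mu_12) lies in MARG(G)
```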
7 Motivation and Background: MAP Estimation
- We can collect the exponential parameters of the sufficient statistics $\phi(x) = \{\delta_j(x_s),\ \delta_j(x_s)\delta_k(x_t)\}$ into potential functions: $\theta_s(x_s) := \sum_{j \in \mathcal{X}_s} \theta_{s;j}\,\delta_j(x_s)$ and $\theta_{st}(x_s, x_t) := \sum_{(j,k) \in \mathcal{X}_s \times \mathcal{X}_t} \theta_{st;jk}\,\delta_j(x_s)\delta_k(x_t)$.
- The MAP problem is then equivalent to: $\Phi_\infty(\theta) := \max_{x \in \mathcal{X}^n} \langle \theta, \phi(x) \rangle = \max_{x \in \mathcal{X}^n} \big[\sum_{s \in V} \theta_s(x_s) + \sum_{(s,t) \in E} \theta_{st}(x_s, x_t)\big]$.
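Since $\Phi_\infty(\theta)$ is just a maximization of a sum of node and edge potentials, a brute-force evaluation makes the objective concrete. This sketch is my own illustration; the 3-node chain, label count, and random potentials are assumptions.

```python
# Brute-force evaluation of the MAP objective
# sum_s theta_s(x_s) + sum_(s,t) theta_st(x_s, x_t) on a toy pairwise MRF.
import itertools
import numpy as np

n_labels = 2
nodes = [0, 1, 2]
edges = [(0, 1), (1, 2)]

rng = np.random.default_rng(0)
theta_s = {s: rng.normal(size=n_labels) for s in nodes}               # node potentials
theta_st = {e: rng.normal(size=(n_labels, n_labels)) for e in edges}  # edge potentials

def objective(x):
    """<theta, phi(x)> in the overcomplete representation."""
    val = sum(theta_s[s][x[s]] for s in nodes)
    val += sum(theta_st[(s, t)][x[s], x[t]] for (s, t) in edges)
    return val

# Exhaustive maximization over X^n -- only feasible for tiny problems.
x_map = max(itertools.product(range(n_labels), repeat=len(nodes)), key=objective)
print(x_map, objective(x_map))
```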
8 Outline (repeated; next section: Tree-Reparameterizations, Definitions and Upper Bounds)
9 Definitions and Upper Bounds: Convex Combinations and Upper Bounds
- Consider $\{\theta^i\}$ and weights $\rho^i$ such that $\sum_i \rho^i \theta^i = \bar\theta$ and $\sum_i \rho^i = 1$. Claim: $\Phi_\infty(\bar\theta) \le \sum_i \rho^i \Phi_\infty(\theta^i)$.
- Proof: let $x^*$ be a MAP solution for $\bar\theta$. Then $\Phi_\infty(\bar\theta) = \max_{x \in \mathcal{X}^n} \langle \bar\theta, \phi(x) \rangle = \langle \bar\theta, \phi(x^*) \rangle = \sum_i \rho^i \langle \theta^i, \phi(x^*) \rangle \le \sum_i \rho^i \max_{x \in \mathcal{X}^n} \langle \theta^i, \phi(x) \rangle = \sum_i \rho^i \Phi_\infty(\theta^i)$.
- Goals: 1. Characterize $\{\theta^i\}$ by tree-structured exponential parameters. 2. When is the upper bound tight? 3. When it is not tight, how can we make it as tight as possible?
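The claim is the familiar fact that the max of a convex combination is at most the convex combination of maxes. A tiny numeric check (my own; the random score vectors stand in for the values $\langle \theta^i, \phi(x) \rangle$ over all configurations $x$):

```python
# Max of an average is at most the average of the maxes.
import numpy as np

rng = np.random.default_rng(3)
scores = rng.normal(size=(5, 16))        # 5 components theta^i, 16 configurations x
rho = np.ones(5) / 5.0                   # convex weights
combined = rho @ scores                  # scores of theta_bar = sum_i rho_i theta^i
assert combined.max() <= (rho @ scores.max(axis=1)) + 1e-12
```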
10 Definitions and Upper Bounds: Tree Reparameterizations
- Let $\{T^i\}$ be a set of spanning trees of $G$, and let $\rho$ be a probability distribution over this set.
- If $\bar\theta = \sum_i \rho(T^i)\,\theta(T^i)$, we call this a $\rho$-reparameterization.
- Let $\rho_e$ denote the probability that a given edge $e \in E$ appears in a spanning tree chosen at random under $\rho$.
- We can always find $\sum_i \rho(T^i)\,\theta(T^i) = \bar\theta$ as long as $\rho_e > 0$ for all $e \in E$.
- E.g., in the figure on the original slide (not reproduced here), if $\rho(T^i) = 1/3$ for all $i$, then $\rho_b = 1$, $\rho_e = 2/3$, $\rho_f = 1/3$.
11 Definitions and Upper Bounds: Example: Ising Model
- Consider a 4-node cycle and let $x \in \{0,1\}^4$. Also let $p(x; \bar\theta) \propto \exp\{x_1 x_2 + x_2 x_3 + x_3 x_4 + x_4 x_1\}$.
- So $\bar\theta = [0\ 0\ 0\ 0\ 1\ 1\ 1\ 1]$ (the four node parameters followed by the four edge parameters).
- We will reparameterize $\bar\theta$ by a convex combination of exponential parameters of the four spanning trees of the cycle, each obtained by deleting one edge.
12 Definitions and Upper Bounds: Example (cont.)
- Give each tree $T^i$ equal weight: $\rho(T^i) = 1/4$. Each edge then appears in three of the four trees, so $\rho_e = 3/4$.
- We construct each tree-structured exponential parameter by placing weight $4/3$ on the three edges of $T^i$ and zero on the deleted edge, e.g. $\theta(T^1) = (4/3)\,[0\ 0\ 0\ 0\ 1\ 1\ 1\ 0]$, and similarly for $\theta(T^2), \theta(T^3), \theta(T^4)$.
- This gives us $\bar\theta = \sum_i \rho(T^i)\,\theta(T^i)$, since each edge coordinate sums to $3 \cdot (1/4) \cdot (4/3) = 1$.
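A quick numeric verification of this reparameterization (my own sketch; the parameter ordering, four node entries followed by the edge entries for (1,2), (2,3), (3,4), (4,1), is an assumption):

```python
# Check that the convex combination of the four tree parameters recovers
# theta_bar for the 4-node cycle Ising example.
import numpy as np

theta_bar = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)
rho = 0.25                            # equal weight on each spanning tree
trees = []
for dropped in range(4):              # tree T_i drops edge i of the cycle
    edge_part = np.full(4, 4.0 / 3.0)
    edge_part[dropped] = 0.0          # weight 4/3 on kept edges, 0 on dropped edge
    trees.append(np.concatenate([np.zeros(4), edge_part]))

assert np.allclose(sum(rho * t for t in trees), theta_bar)
```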
13 Definitions and Upper Bounds: Tightness of Upper Bound
- For any tree $\rho$-reparameterization we have: $\Phi_\infty(\bar\theta) \le \sum_i \rho(T^i)\,\Phi_\infty(\theta(T^i)) = \sum_i \rho(T^i) \max_{x \in \mathcal{X}^n} \langle \theta(T^i), \phi(x) \rangle$.
- When does this inequality hold with equality?
- Define the set of optimal configurations for a given $\theta$: $\mathrm{OPT}(\theta) := \{x \in \mathcal{X}^n \mid \langle \theta, \phi(x') \rangle \le \langle \theta, \phi(x) \rangle \text{ for all } x' \in \mathcal{X}^n\}$.
- Lemma: let $\bar\theta = \sum_i \rho(T^i)\,\theta(T^i)$; then $\bigcap_i \mathrm{OPT}(\theta(T^i)) \subseteq \mathrm{OPT}(\bar\theta)$. Moreover, the bound is tight iff the intersection on the LHS is non-empty.
14 Definitions and Upper Bounds: Proof
- Let $x^* \in \mathrm{OPT}(\bar\theta)$ and consider the difference between the RHS and the LHS: $0 \le \sum_i \rho(T^i)\,\Phi_\infty(\theta(T^i)) - \Phi_\infty(\bar\theta) = \sum_i \rho(T^i)\big[\Phi_\infty(\theta(T^i)) - \langle \theta(T^i), \phi(x^*) \rangle\big]$.
- Each bracketed term is non-negative, and equals zero exactly when $x^* \in \mathrm{OPT}(\theta(T^i))$.
- So the bound is tight iff $x^* \in \bigcap_i \mathrm{OPT}(\theta(T^i))$.
15 Definitions and Upper Bounds: Tree Reparameterization Wrap-up
- A set of spanning trees for which $\bigcap_i \mathrm{OPT}(\theta(T^i))$ is non-empty is said to satisfy tree agreement.
- Suppose that we fix the spanning tree distribution $\rho$ and our target parameter $\bar\theta$.
- Goal: optimize the upper bound as a function of $\{\theta(T^i)\}$.
- Two approaches: 1. direct minimization through LP relaxation techniques; 2. message-passing approaches.
16 Outline (repeated; next section: LP Approaches, IP to LP to Lagrangian Dual)
17 IP to LP to Lagrangian Dual: MAP as an Integer Program (IP)
- Fix our set of spanning trees $\{T^i\}$ and our distribution $\rho$ over them.
- Main problem: minimize the RHS of the following inequality w.r.t. $\{\theta(T^i)\}$ to make the bound as tight as possible: $\Phi_\infty(\bar\theta) \le \sum_i \rho(T^i)\,\Phi_\infty(\theta(T^i))$.
- Consider the MAP optimization problem: $\arg\max_{x \in \mathcal{X}^n} \langle \bar\theta, \phi(x) \rangle$.
- We first relax this integer program to an equivalent linear program...
18 IP to LP to Lagrangian Dual: LP Relaxation
- Recall the marginal polytope: $\mathrm{MARG}(G) := \{\mu \in \mathbb{R}^d \mid \mu_{s;j} = \mathbb{E}_p[\delta_j(x_s)],\ \mu_{st;jk} = \mathbb{E}_p[\delta_j(x_s)\delta_k(x_t)] \text{ for some } p(\cdot)\}$.
- Claim: $\max_{x \in \mathcal{X}^n} \langle \theta, \phi(x) \rangle = \max_{\mu \in \mathrm{MARG}(G)} \langle \theta, \mu \rangle$.
- Let $\mathbb{P} = \{p(\cdot) \mid p(x) \ge 0,\ \sum_x p(x) = 1\}$. Since a linear function over the probability simplex attains its maximum at a vertex, $\max_{x \in \mathcal{X}^n} \langle \theta, \phi(x) \rangle = \max_{p \in \mathbb{P}} \sum_{x \in \mathcal{X}^n} p(x) \langle \theta, \phi(x) \rangle$.
- We now transform the RHS of this equation...
19 IP to LP to Lagrangian Dual: LP Relaxation (cont.)
- From the last slide: $\max_{p \in \mathbb{P}} \sum_{x \in \mathcal{X}^n} p(x) \langle \theta, \phi(x) \rangle = \max_{p \in \mathbb{P}} \sum_{x \in \mathcal{X}^n} p(x) \big[\sum_{s \in V} \theta_s(x_s) + \sum_{(s,t) \in E} \theta_{st}(x_s, x_t)\big] = \max_\mu \big[\sum_{s \in V} \sum_{j \in \mathcal{X}_s} \theta_{s;j}\,\mu_{s;j} + \sum_{(s,t) \in E} \sum_{(j,k) \in \mathcal{X}_s \times \mathcal{X}_t} \theta_{st;jk}\,\mu_{st;jk}\big]$.
- As $p$ ranges over $\mathbb{P}$, the marginals $\mu$ range over $\mathrm{MARG}(G)$.
- Conclusion: $\max_{x \in \mathcal{X}^n} \langle \theta, \phi(x) \rangle = \max_{\mu \in \mathrm{MARG}(G)} \langle \theta, \mu \rangle$.
20 IP to LP to Lagrangian Dual: Lagrangian Dual
- We now have an LP formulation of MAP, which is much easier to solve.
- Recall: $\Phi_\infty(\bar\theta) \le \sum_i \rho(T^i)\,\Phi_\infty(\theta(T^i)) = \sum_i \rho(T^i) \max_{x \in \mathcal{X}^n} \langle \theta(T^i), \phi(x) \rangle$.
- We now address the problem: $\min_{\{\theta(T^i)\}} \sum_i \rho(T^i)\,\Phi_\infty(\theta(T^i))$ subject to $\sum_i \rho(T^i)\,\theta(T^i) = \bar\theta$.
- Attacking this problem via its Lagrangian dual yields the relaxed LP $\Phi_\infty(\bar\theta) \le \max_{\tau \in \mathrm{LOCAL}(G)} \langle \bar\theta, \tau \rangle$, where $\mathrm{LOCAL}(G) := \{\tau \in \mathbb{R}^d_+ \mid \sum_{j \in \mathcal{X}_s} \tau_{s;j} = 1\ \forall s \in V;\ \sum_{k \in \mathcal{X}_t} \tau_{st;jk} = \tau_{s;j}\ \forall (s,t) \in E,\ j \in \mathcal{X}_s\}$.
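For a concrete view of the relaxed LP, here is a hedged sketch that solves $\max_{\tau \in \mathrm{LOCAL}(G)} \langle \theta, \tau \rangle$ with SciPy for a 2-node, 2-label model. The variable ordering, the toy potentials, and imposing the marginalization constraint in both edge directions are my assumptions.

```python
# Solve the LOCAL(G) LP for one edge (1,2) with 2 labels per node.
import numpy as np
from scipy.optimize import linprog

k = 2
theta_s = np.array([[0.1, -0.2], [0.3, 0.0]])    # coefficients of tau_{s;j}
theta_st = np.array([[1.0, -1.0], [-1.0, 1.0]])  # coefficients of tau_{12;jk}

# tau = [tau_{1;0}, tau_{1;1}, tau_{2;0}, tau_{2;1},
#        tau_{12;00}, tau_{12;01}, tau_{12;10}, tau_{12;11}]
c = -np.concatenate([theta_s.ravel(), theta_st.ravel()])  # linprog minimizes

A_eq, b_eq = [], []
# Normalization: sum_j tau_{s;j} = 1 for each node s.
for s in range(2):
    row = np.zeros(8); row[2 * s: 2 * s + 2] = 1.0
    A_eq.append(row); b_eq.append(1.0)
# Marginalization: sum_k tau_{12;jk} = tau_{1;j} and sum_j tau_{12;jk} = tau_{2;k}.
for j in range(k):
    row = np.zeros(8); row[4 + 2 * j: 4 + 2 * j + 2] = 1.0; row[j] = -1.0
    A_eq.append(row); b_eq.append(0.0)
for kk in range(k):
    row = np.zeros(8); row[4 + kk] = 1.0; row[6 + kk] = 1.0; row[2 + kk] = -1.0
    A_eq.append(row); b_eq.append(0.0)

res = linprog(c, A_eq=np.array(A_eq), b_eq=np.array(b_eq), bounds=[(0, None)] * 8)
print(res.x, -res.fun)   # optimal tau and the resulting upper bound on Phi_inf(theta)
```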
21 IP to LP to Lagrangian Dual: LOCAL(G)
- We've relaxed our original constraint set from $\mathrm{MARG}(G)$ to $\mathrm{LOCAL}(G)$.
- The relaxation is exact for any problem on a tree-structured graph.
- $\mathrm{LOCAL}(G)$ is characterized by $O(mn + m^2|E|)$ constraints, where $m := \max_s |\mathcal{X}_s|$.
- Two possibilities at the optimum: 1. the optimum is a vertex lying in both $\mathrm{MARG}(G)$ and $\mathrm{LOCAL}(G)$; then tree agreement holds. 2. the optimum is a fractional vertex of $\mathrm{LOCAL}(G)$; then no tree agreement can occur.
22 IP to LP to Lagrangian Dual: LP Summary
- We've now reformulated the MAP estimation problem as an approximate LP.
- We could use classical LP techniques (e.g., the simplex method) to solve this problem.
- However, we would like to exploit the graphical structure of our MRF.
- Want: message-passing algorithms that solve the same LP over $\mathrm{LOCAL}(G)$, with fixed points of the iterative algorithm = optima of the LP relaxation.
23 Outline (repeated; next section: Message-passing Approaches, Two Algorithms)
24 Two Algorithms: Preliminaries
- Define max-marginals (with normalization constants $\kappa_s, \kappa_{st}$): $\nu_{s;j} := \kappa_s \max_{\{x \mid x_s = j\}} p(x; \theta(T))$ and $\nu_{st;jk} := \kappa_{st} \max_{\{x \mid (x_s, x_t) = (j,k)\}} p(x; \theta(T))$, collected as $\nu_s(x_s) := \sum_{j \in \mathcal{X}_s} \nu_{s;j}\,\delta_j(x_s)$ and $\nu_{st}(x_s, x_t) := \sum_{(j,k) \in \mathcal{X}_s \times \mathcal{X}_t} \nu_{st;jk}\,\delta_j(x_s)\delta_k(x_t)$.
- For trees we can decompose the joint distribution in terms of its max-marginals via the junction tree decomposition: $p(x; \theta(T)) \propto \prod_{s \in V} \nu_s(x_s) \prod_{(s,t) \in E(T)} \frac{\nu_{st}(x_s, x_t)}{\nu_s(x_s)\,\nu_t(x_t)}$.
25 Two Algorithms: Trees and Max-marginals
- For every edge $(s,t) \in E(T)$: $\nu_s(x_s) = \kappa \max_{x'_t \in \mathcal{X}_t} \nu_{st}(x_s, x'_t)$.
- $x^*$ is in $\mathrm{OPT}(\theta(T))$ iff: $x^*_s \in \arg\max_{x_s} \nu_s(x_s)$ for all $s$, and $(x^*_s, x^*_t) \in \arg\max_{(x_s, x_t)} \nu_{st}(x_s, x_t)$ for all $(s,t)$.
- All of the above are local conditions.
- So, to find a MAP configuration: 1. root $T$ at node $r$ and choose $x^*_r \in \arg\max_{x_r} \nu_r(x_r)$; 2. moving from parent $s$ to child $t$, choose $x^*_t \in \arg\max_{x_t} \nu_{st}(x^*_s, x_t)$.
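The two-step recovery above is easy to make concrete on a chain. This is my own sketch (the chain structure and random potentials are assumptions): a backward max-product pass computes the tail maximizations, then a root-to-leaf backtrack reads off a MAP configuration.

```python
# Max-product on a chain plus backtracking: exact MAP on a tree-structured MRF.
import numpy as np

rng = np.random.default_rng(1)
n, k = 4, 3                                  # chain of 4 nodes, 3 labels
theta_s = rng.normal(size=(n, k))
theta_st = rng.normal(size=(n - 1, k, k))    # potential for edge (i, i+1)

# Backward pass: m[i][a] = max over the tail x_{i+1..n-1} of its score, given x_i = a.
m = np.zeros((n, k))
for i in range(n - 2, -1, -1):
    m[i] = np.max(theta_st[i] + theta_s[i + 1] + m[i + 1], axis=1)

# Root the chain at node 0, then backtrack parent -> child.
x = np.zeros(n, dtype=int)
x[0] = np.argmax(theta_s[0] + m[0])
for i in range(n - 1):
    x[i + 1] = np.argmax(theta_st[i, x[i]] + theta_s[i + 1] + m[i + 1])

print(x)                                     # a MAP configuration
```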
26 Two Algorithms: Algorithm Idea
- For graphs with cycles, we develop the idea of pseudo-max-marginals $\nu = \{\nu_s, \nu_{st}\}$.
- Denote by $\nu(T)$ the elements of $\nu$ pertaining to tree $T$.
- $\nu(T)$ implicitly specifies a tree exponential parameter $\theta(T)$ via the junction tree decomposition.
- Goal: given $\rho$ and $\{T^i\}$, find a $\nu$ that satisfies the following two properties: 1. $\nu$ implicitly specifies a $\rho$-reparameterization: $\nu(T^i) \to \theta(T^i)$ with $\bar\theta = \sum_i \rho(T^i)\,\theta(T^i)$; 2. $\nu(T^i)$ gives the exact max-marginals under $p(x; \theta(T^i))$ for all $i$ (tree consistency).
- We devise an iterative algorithm that maintains (1) after each iteration and achieves (2) upon convergence.
27 Two Algorithms: Preliminary Idea
- Update the pseudo-max-marginals iteratively.
- Two strategies: 1. tree-based updates: update $\nu(T)$ one tree at a time; 2. edge-based updates: update $\nu_{st}$ along with the associated $\nu_s, \nu_t$ in parallel.
- The algorithm to follow uses (2) and is motivated by the tree-consistency condition: $\nu_s(x_s) = \kappa \max_{x'_t \in \mathcal{X}_t} \nu_{st}(x_s, x'_t)$.
- So $\nu$ is consistent on every tree $T^i$ iff the above holds for every edge $(s,t) \in E$.
28 Two Algorithms: Algorithm 1: Edge-based Updates
- [The update equations appear only as a figure on the original slide and are not reproduced in this transcription; a sketch follows below.]
- Here $\rho_{st}$ is the edge-appearance probability of $(s,t)$ in our $\{T^i\}$.
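Since the algorithm itself is only an image on the original slide, the following is a sketch I back-solved from the reparameterization proof on the next two slides; the 3-cycle test problem, the initialization, the normalization choices, and the update order are all assumptions, not the authors' code.

```python
# Edge-based pseudo-max-marginal updates on a 3-cycle (sketch).
import numpy as np

rng = np.random.default_rng(2)
k = 2
nodes = [0, 1, 2]
edges = [(0, 1), (1, 2), (0, 2)]             # a 3-cycle
rho = {e: 2.0 / 3.0 for e in edges}          # edge appearance prob. under uniform trees
theta_s = {s: rng.normal(size=k) for s in nodes}
theta_st = {e: rng.normal(size=(k, k)) for e in edges}

# Initialization: nu_s ~ exp(theta_s); nu_st ~ exp(theta_st/rho_st + theta_s + theta_t).
nu_s = {s: np.exp(theta_s[s]) for s in nodes}
nu_st = {(s, t): np.exp(theta_st[(s, t)] / rho[(s, t)]
                        + theta_s[s][:, None] + theta_s[t][None, :])
         for (s, t) in edges}

for _ in range(50):
    new_s = {}
    for s in nodes:                          # node update: reweighted max-messages
        msg = np.ones(k)
        for (a, b) in edges:
            if s == a:
                msg *= (nu_st[(a, b)].max(axis=1) / nu_s[s]) ** rho[(a, b)]
            elif s == b:
                msg *= (nu_st[(a, b)].max(axis=0) / nu_s[s]) ** rho[(a, b)]
        new_s[s] = nu_s[s] * msg
        new_s[s] /= new_s[s].sum()           # normalize for numerical stability
    for (s, t) in edges:                     # edge update in terms of new node terms
        row_max = nu_st[(s, t)].max(axis=1)[:, None]
        col_max = nu_st[(s, t)].max(axis=0)[None, :]
        nu_st[(s, t)] = (nu_st[(s, t)] / (row_max * col_max)
                         * new_s[s][:, None] * new_s[t][None, :])
    nu_s = new_s

# At a fixed point, nu_s(x_s) is proportional to max_{x_t'} nu_st(x_s, x_t')
# on every edge: the tree-consistency condition.
```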
29 Two Algorithms: $\rho$-Reparameterization Proof
- Note: from the junction tree decomposition we can define, at iteration $n$: $\theta^n_s(T)(x_s) = \log \nu^n_s(x_s)$ for all $s \in V$, and $\theta^n_{st}(T)(x_s, x_t) = \log \frac{\nu^n_{st}(x_s, x_t)}{\nu^n_s(x_s)\,\nu^n_t(x_t)}$ if $(s,t) \in E(T)$, and $0$ otherwise, for all $(s,t) \in E$.
- Lemma: at each iteration we satisfy the $\rho$-reparameterization requirement (proof by induction).
- Base case ($n = 0$): $\sum_i \rho(T^i)\big[\sum_{s \in V} \theta^0_s(T^i) + \sum_{(s,t) \in E(T^i)} \theta^0_{st}(T^i)\big] = \sum_{s \in V} \bar\theta_s + \sum_{(s,t) \in E} \rho_{st} \cdot \tfrac{1}{\rho_{st}}\,\bar\theta_{st} = \bar\theta$.
30 Two Algorithms: Induction Step
- Induction step: $\sum_i \rho(T^i)\,\theta^{n+1}(T^i)(x) = \sum_{s \in V}\big[\log \nu^n_s(x_s) + \sum_{t \in \Gamma(s)} \rho_{st} \log \frac{\max_{x'_t} \nu^n_{st}(x_s, x'_t)}{\nu^n_s(x_s)}\big] + \sum_{(s,t) \in E} \rho_{st} \log \frac{\nu^n_{st}(x_s, x_t)}{\max_{x'_t} \nu^n_{st}(x_s, x'_t)\,\max_{x'_s} \nu^n_{st}(x'_s, x_t)} = \sum_{s \in V} \log \nu^n_s(x_s) + \sum_{(s,t) \in E} \rho_{st} \log \frac{\nu^n_{st}(x_s, x_t)}{\nu^n_s(x_s)\,\nu^n_t(x_t)}$.
- This last line equals $\sum_i \rho(T^i)\,\theta^n(T^i)(x)$ up to an additive constant independent of $x$.
31 Two Algorithms: Convergence Criteria Proof
- Lemma: a fixed point of the algorithm satisfies the tree consistency condition.
- Substituting $\nu = \nu^{n+1} = \nu^n$ into all the updates of the algorithm yields $\nu_s(x_s) = \kappa \max_{x'_t} \nu_{st}(x_s, x'_t)$ and $\nu_t(x_t) = \kappa \max_{x'_s} \nu_{st}(x'_s, x_t)$ for all $(x_s, x_t)$.
- This implies $\nu$ is consistent on every tree $T^i$.
- Reminder: consistency on every tree does not imply consistency on the entire graph (i.e., a unique MAP).
32 Two Algorithms: Algorithm 2: Message-passing Version
- We can create a message-passing version of the same algorithm by utilizing the following mapping: $\nu_s(x_s) := \kappa \exp(\bar\theta_s(x_s)) \prod_{v \in \Gamma(s)} [M_{vs}(x_s)]^{\rho_{vs}}$ and $\nu_{st}(x_s, x_t) := \kappa \exp\big(\tfrac{1}{\rho_{st}}\bar\theta_{st}(x_s, x_t) + \bar\theta_s(x_s) + \bar\theta_t(x_t)\big) \cdot \frac{\prod_{v \in \Gamma(s) \setminus t} [M_{vs}(x_s)]^{\rho_{vs}}}{[M_{ts}(x_s)]^{(1 - \rho_{ts})}} \cdot \frac{\prod_{v \in \Gamma(t) \setminus s} [M_{vt}(x_t)]^{\rho_{vt}}}{[M_{st}(x_t)]^{(1 - \rho_{st})}}$.
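A hedged sketch of the single message update $M_{ts}$ implied by the mapping above (my reading of the tree-reweighted max-product update; the function name, argument layout, and optional damping are illustrative, not the authors' code):

```python
import numpy as np

def trw_message_update(theta_st, theta_t, rho_st, msgs_into_t, M_st,
                       M_ts_old=None, damp=0.0):
    """New message M_ts(x_s); theta_st is indexed [x_s, x_t].

    msgs_into_t maps each neighbor v != s of t to a pair (M_vt, rho_vt).
    """
    prod = np.ones_like(theta_t)
    for M_vt, rho_vt in msgs_into_t.values():
        prod *= M_vt ** rho_vt                    # reweighted incoming messages
    weight = np.exp(theta_t) * prod / (M_st ** (1.0 - rho_st))
    # Maximize over x_t the reweighted pairwise term times the evidence at t.
    M_ts = np.max(np.exp(theta_st / rho_st) * weight[None, :], axis=1)
    M_ts /= M_ts.sum()                            # normalize to avoid under/overflow
    if damp > 0.0 and M_ts_old is not None:       # optional geometric damping
        M_ts = M_ts ** (1.0 - damp) * M_ts_old ** damp
    return M_ts
```

Note that with $\rho_{st} = 1$ on every edge this reduces to the standard max-product message, which matches observation (1) on the next slide.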
33 Two Algorithms: Analysis of Algorithms
- The equivalence between Algorithms 1 and 2 follows from the equivalence between message-passing and reparameterization (Buhm's talk).
- Key observations about Algorithm 2: 1. if $\rho_{st} = 1$ for every edge, it is equivalent to loopy BP; 2. the key difference from standard message-passing is that $M_{ts}$ depends on $M_{st}$.
- The algorithm is not always guaranteed to converge; in practice a dampening technique helps encourage convergence.
- Connection to the LP relaxation: let $\tau^* = \log M^*$ elementwise, where $M^*$ is a fixed point of Algorithm 2. Then $\tau^*$ is an optimal solution to the LP relaxation given earlier.
34 Outline (repeated; next section: Summary)
35 Wrapping Things Up
- MAP can be attacked via tree reparameterizations and LP relaxations.
- Message-passing can be viewed as an alternative way to solve the proposed LP dual problem.
- The authors claim they've successfully used these algorithms in a few applications. Results?
- When we do not converge, can we perform a further relaxation by clustering random variables, as in Kikuchi approximations?
36 Citations
- V. Kolmogorov and M. Wainwright. On the optimality of tree-reweighted max-product message-passing. Uncertainty in Artificial Intelligence, July 2005.
- M. Wainwright, T. Jaakkola, and A. Willsky. Tree consistency and bounds on the performance of the max-product algorithm and its generalizations. Statistics and Computing, vol. 14.
- M. Wainwright, T. Jaakkola, and A. Willsky. MAP estimation via agreement on (hyper)trees: Message-passing and linear programming. UC Berkeley technical report.
- V. Kolmogorov. Convergent tree-reweighted message passing for energy minimization. Microsoft technical report, June 2005.