Max-Sum Exact Inference


1 Probabilistic Graphical Models - Inference - MAP - Max-Sum Exact Inference

2 Product vs. Summation

Taking logs turns max-product into max-sum while preserving the maximizing assignment:

Product:          a1 b1: 8, a1 b2: 1, a2 b1: 0.5, a2 b2: 2
Summation (log2): a1 b1: 3, a1 b2: 0, a2 b1: -1,  a2 b2: 1
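A quick check of this correspondence (a minimal sketch, not from the slides; the dictionary representation is my own):

```python
# The log transform turns max-product into max-sum and preserves the argmax.
import math

product = {("a1", "b1"): 8, ("a1", "b2"): 1, ("a2", "b1"): 0.5, ("a2", "b2"): 2}
summation = {k: math.log2(v) for k, v in product.items()}  # 3, 0, -1, 1

best_prod = max(product, key=product.get)
best_sum = max(summation, key=summation.get)
assert best_prod == best_sum == ("a1", "b1")
```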

3 Max-Sum Elimination in Chains: A - B - C - D - E

$\max_D \max_C \max_B \max_A \big(\theta(A,B) + \theta(B,C) + \theta(C,D) + \theta(D,E)\big)$
$= \max_D \max_C \max_B \big(\theta(B,C) + \theta(C,D) + \theta(D,E) + \max_A \theta(A,B)\big)$
$= \max_D \max_C \max_B \big(\theta(B,C) + \theta(C,D) + \theta(D,E) + \lambda_{1 \to 2}(B)\big)$

4 Factor Summation

θ1(A,B): a1 b1: 3, a1 b2: 0, a2 b1: -1, a2 b2: 1
θ2(B,C): b1 c1: 4, b1 c2: 1.5, b2 c1: 0.2, b2 c2: 2

θ1 + θ2 (A,B,C):
a1 b1 c1: 3+4 = 7      a1 b1 c2: 3+1.5 = 4.5
a1 b2 c1: 0+0.2 = 0.2  a1 b2 c2: 0+2 = 2
a2 b1 c1: -1+4 = 3     a2 b1 c2: -1+1.5 = 0.5
a2 b2 c1: 1+0.2 = 1.2  a2 b2 c2: 1+2 = 3

5 Factor Maximization

θ(A,B,C):
a1 b1 c1: 7    a1 b1 c2: 4.5
a1 b2 c1: 0.2  a1 b2 c2: 2
a2 b1 c1: 3    a2 b1 c2: 0.5
a2 b2 c1: 1.2  a2 b2 c2: 3

max over B:
a1 c1: 7  a1 c2: 4.5
a2 c1: 3  a2 c2: 3
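The two factor operations above are easy to state in code (a minimal sketch; the dict-of-assignments representation is an assumption, not the course's data structure):

```python
from itertools import product

# Factors as dicts from assignments to log-space scores, matching the tables.
theta1 = {("a1", "b1"): 3, ("a1", "b2"): 0, ("a2", "b1"): -1, ("a2", "b2"): 1}
theta2 = {("b1", "c1"): 4, ("b1", "c2"): 1.5, ("b2", "c1"): 0.2, ("b2", "c2"): 2}

A, B, C = ("a1", "a2"), ("b1", "b2"), ("c1", "c2")

# Factor summation: (theta1 + theta2)(a, b, c) = theta1(a, b) + theta2(b, c)
theta12 = {(a, b, c): theta1[a, b] + theta2[b, c] for a, b, c in product(A, B, C)}
assert theta12["a1", "b1", "c1"] == 7

# Factor maximization: eliminate B by maximizing it out
max_over_B = {(a, c): max(theta12[a, b, c] for b in B) for a in A for c in C}
assert max_over_B["a1", "c2"] == 4.5
```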

6 Max-Sum Elimination in Chains (A eliminated): B - C - D - E

$\max_D \max_C \max_B \big(\theta(B,C) + \theta(C,D) + \theta(D,E) + \lambda_{1 \to 2}(B)\big)$
$= \max_D \max_C \big(\theta(C,D) + \theta(D,E) + \max_B (\theta(B,C) + \lambda_{1 \to 2}(B))\big)$
$= \max_D \max_C \big(\theta(C,D) + \theta(D,E) + \lambda_{2 \to 3}(C)\big)$

7 Max-Sum Elimination in Chains (A, B, C eliminated): D - E

$\max_D \max_C \big(\theta(C,D) + \theta(D,E) + \lambda_{2 \to 3}(C)\big)$
$= \max_D \big(\theta(D,E) + \lambda_{3 \to 4}(D)\big) = \lambda_4(E)$

The MAP value is $\max_e \lambda_4(e)$.
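Putting the steps together (a minimal sketch, not from the slides: the A-B and B-C tables match the running example, while the C-D and D-E tables are invented to complete the chain):

```python
# Max-sum variable elimination along the chain A - B - C - D - E.
vals = (0, 1)  # binary variables for illustration
theta = {
    "AB": {(0, 0): 3, (0, 1): 0, (1, 0): -1, (1, 1): 1},
    "BC": {(0, 0): 4, (0, 1): 1.5, (1, 0): 0.2, (1, 1): 2},
    "CD": {(0, 0): 1, (0, 1): 0, (1, 0): 0, (1, 1): 1},    # invented
    "DE": {(0, 0): 2, (0, 1): 0, (1, 0): 0, (1, 1): 2},    # invented
}

def eliminate(pairwise, incoming):
    # lambda(right) = max_left (pairwise(left, right) + incoming(left))
    return {r: max(pairwise[l, r] + incoming[l] for l in vals) for r in vals}

lam = {x: 0.0 for x in vals}           # nothing eliminated yet
for edge in ("AB", "BC", "CD", "DE"):  # eliminate A, then B, then C, then D
    lam = eliminate(theta[edge], lam)  # lambda_{1->2}(B), ..., lambda_4(E)

print(max(lam.values()))               # MAP value max_e lambda_4(e); 10.0 here
```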

8 Max-Sum in Clique Trees

Chain A - B - C - D - E; clique tree: 1: A,B -- 2: B,C -- 3: C,D -- 4: D,E

Forward messages:
$\lambda_{1 \to 2}(B) = \max_A \theta_1$
$\lambda_{2 \to 3}(C) = \max_B (\theta_2 + \lambda_{1 \to 2})$
$\lambda_{3 \to 4}(D) = \max_C (\theta_3 + \lambda_{2 \to 3})$

Backward messages:
$\lambda_{4 \to 3}(D) = \max_E \theta_4$
$\lambda_{3 \to 2}(C) = \max_D (\theta_3 + \lambda_{4 \to 3})$
$\lambda_{2 \to 1}(B) = \max_C (\theta_2 + \lambda_{3 \to 2})$

9 Convergence of Message Passing

Once $C_i$ receives a final message from all neighbors except $C_j$, the message $\lambda_{i \to j}$ is also final (it will never change). Messages from leaves are immediately final.

(Same clique tree 1: A,B -- 2: B,C -- 3: C,D -- 4: D,E and messages as on the previous slide.)

10 Simple Example: chain A - B - C

θ1(A,B): a1 b1: 3, a1 b2: 0, a2 b1: -1, a2 b2: 1
θ2(B,C): b1 c1: 4, b1 c2: 1.5, b2 c1: 0.2, b2 c2: 2

θ1 + θ2 (A,B,C):
a1 b1 c1: 3+4 = 7      a1 b1 c2: 3+1.5 = 4.5
a1 b2 c1: 0+0.2 = 0.2  a1 b2 c2: 0+2 = 2
a2 b1 c1: -1+4 = 3     a2 b1 c2: -1+1.5 = 0.5
a2 b2 c1: 1+0.2 = 1.2  a2 b2 c2: 1+2 = 3

11 Simple Example

Clique tree: 1: A,B -- 2: B,C

θ1(A,B): a1 b1: 3, a1 b2: 0, a2 b1: -1, a2 b2: 1
θ2(B,C): b1 c1: 4, b1 c2: 1.5, b2 c1: 0.2, b2 c2: 2

Messages over B:
$\lambda_{1 \to 2}(B)$: b1: 3, b2: 1 (max over A of θ1)
$\lambda_{2 \to 1}(B)$: b1: 4, b2: 2 (max over C of θ2)

Beliefs:
β1(A,B): a1 b1: 3+4 = 7, a1 b2: 0+2 = 2, a2 b1: -1+4 = 3, a2 b2: 1+2 = 3
β2(B,C): b1 c1: 4+3 = 7, b1 c2: 1.5+3 = 4.5, b2 c1: 0.2+1 = 1.2, b2 c2: 2+1 = 3

12 Max-Sum BP at Convergence

Beliefs at each clique are max-marginals: $\beta_i(C_i) = \theta_i(C_i) + \sum_k \lambda_{k \to i}$

Calibration: cliques agree on shared variables.

β1(A,B): a1 b1: 3+4 = 7, a1 b2: 0+2 = 2, a2 b1: -1+4 = 3, a2 b2: 1+2 = 3
β2(B,C): b1 c1: 4+3 = 7, b1 c2: 1.5+3 = 4.5, b2 c1: 0.2+1 = 1.2, b2 c2: 2+1 = 3

(Both cliques assign max-marginal 7 to b1 and 3 to b2, so they are calibrated.)
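We can verify calibration numerically (a minimal sketch, not from the slides, reusing the belief tables above):

```python
beta1 = {("a1", "b1"): 7, ("a1", "b2"): 2, ("a2", "b1"): 3, ("a2", "b2"): 3}
beta2 = {("b1", "c1"): 7, ("b1", "c2"): 4.5, ("b2", "c1"): 1.2, ("b2", "c2"): 3}

for b in ("b1", "b2"):
    from_1 = max(v for (a, bb), v in beta1.items() if bb == b)  # max over A
    from_2 = max(v for (bb, c), v in beta2.items() if bb == b)  # max over C
    assert from_1 == from_2  # both cliques give the same max-marginal of B
```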

13 Summary

The same clique tree algorithm used for sum-product can be used for max-sum. As in sum-product, convergence is achieved after a single up-down pass. The result is a max-marginal at each clique C: for each assignment c to C, the score of the best completion of c.

14 Probabilistic Graphical Models - Inference - MAP - Finding a MAP Assignment

15 Decoding a MAP Assignment

Easy if the MAP assignment is unique:
- There is a single maximizing assignment at each clique, whose value is the θ value of the MAP assignment.
- Due to calibration, the choices at all cliques must agree.

θ(A,B,C): a1 b1 c1: 7, a1 b1 c2: 4.5, a1 b2 c1: 0.2, a1 b2 c2: 2, a2 b1 c1: 3, a2 b1 c2: 0.5, a2 b2 c1: 1.2, a2 b2 c2: 3
β1(A,B): a1 b1: 3+4 = 7, a1 b2: 0+2 = 2, a2 b1: -1+4 = 3, a2 b2: 1+2 = 3
β2(B,C): b1 c1: 4+3 = 7, b1 c2: 1.5+3 = 4.5, b2 c1: 0.2+1 = 1.2, b2 c2: 2+1 = 3

(Here the unique maximizer is a1, b1, c1, with value 7.)

16 Decoding a MAP Assignment

If the MAP assignment is not unique, we may have multiple choices at some cliques, and arbitrary tie-breaking may not produce a MAP assignment.

β1(A,B): a1 b1: 2, a1 b2: 1, a2 b1: 1, a2 b2: 2
β2(B,C): b1 c1: 2, b1 c2: 1, b2 c1: 1, b2 c2: 2

(Both a1 b1 and a2 b2 maximize β1, and both b1 c1 and b2 c2 maximize β2; independently picking a1 b1 in clique 1 and b2 c2 in clique 2 is inconsistent on B.)

17 Decoding a MAP Assignment

If the MAP assignment is not unique, we may have multiple choices at some cliques, and arbitrary tie-breaking may not produce a MAP assignment. Two options (see the traceback sketch below):
- Slightly perturb the parameters to make the MAP assignment unique.
- Use a traceback procedure that incrementally builds a MAP assignment, one variable at a time.
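A traceback for a calibrated chain of cliques might look like this (a minimal sketch, assuming beliefs stored as dicts over the two clique variables; the function name and interface are my own):

```python
def traceback(beliefs):
    """beliefs: list of dicts {(x_left, x_right): score} along a chain."""
    # First clique: take any maximizing assignment.
    left, right = max(beliefs[0], key=beliefs[0].get)
    assignment = [left, right]
    # Later cliques: maximize subject to the shared variable already chosen.
    for beta in beliefs[1:]:
        fixed = assignment[-1]
        _, nxt = max((k for k in beta if k[0] == fixed), key=beta.get)
        assignment.append(nxt)
    return assignment

beta1 = {("a1", "b1"): 2, ("a1", "b2"): 1, ("a2", "b1"): 1, ("a2", "b2"): 2}
beta2 = {("b1", "c1"): 2, ("b1", "c2"): 1, ("b2", "c1"): 1, ("b2", "c2"): 2}
print(traceback([beta1, beta2]))  # e.g. ['a1', 'b1', 'c1'], a consistent MAP
```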

18 Probabilistic Graphical Models - Inference - MAP - Tractable MAP Problems

19 Correspondence / Data Association

X_ij = 1 if i is matched to j, 0 otherwise
θ_ij = quality of the match between i and j

Find the highest-scoring matching: maximize $\sum_{ij} \theta_{ij} X_{ij}$ subject to mutual exclusion constraints. Easily solved using matching algorithms (see the sketch below).

Many applications:
- matching sensor readings to objects
- matching features in two related images
- matching mentions in text to entities
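For bipartite matching, off-the-shelf solvers apply directly (a minimal sketch, not from the slides; the score matrix is invented):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

theta = np.array([[0.9, 0.1, 0.3],   # theta[i, j] = match quality of i to j
                  [0.2, 0.8, 0.4],
                  [0.3, 0.5, 0.7]])

rows, cols = linear_sum_assignment(theta, maximize=True)
print([(int(i), int(j)) for i, j in zip(rows, cols)])  # [(0, 0), (1, 1), (2, 2)]
print(theta[rows, cols].sum())                         # ~2.4, the matching score
```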

20 3D Cell Reconstruction

[Figure: correspond tilt images, then compute the 3D reconstruction]

Matching weights: similarity of location and local neighborhood appearance.

Duchi, Tarlow, Elidan, and Koller, NIPS 2006. Amat, Moussavi, Comolli, Elidan, Downing, Horowitz, Journal of Structural Biology, 2006.

21 Mesh Registration

Matching weights: similarity of location and local neighborhood appearance.

[Anguelov, Koller, Srinivasan, Thrun, Pang, Davis, NIPS 2004]

22 Associative Potentials

An arbitrary network over binary variables, using only singleton potentials θ_i and supermodular pairwise potentials θ_ij, admits an exact MAP solution via algorithms for finding minimum cuts in graphs (see the sketch below). Many related variants, such as metric MRFs, admit efficient exact or approximate solutions.
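For a Potts-style special case of associative potentials, the min-cut reduction fits in a few lines (a minimal sketch, not from the slides; the node costs and coupling weights are invented, and the construction assigns label 0 to source-side nodes):

```python
import networkx as nx

u = {1: (0.0, 2.0), 2: (1.5, 0.5), 3: (3.0, 0.0)}  # u[i] = (cost of 0, cost of 1)
w = {(1, 2): 1.0, (2, 3): 1.0}                     # penalty when labels differ

G = nx.DiGraph()
for i, (c0, c1) in u.items():
    G.add_edge("s", i, capacity=c1)  # cutting s->i pays the cost of label 1
    G.add_edge(i, "t", capacity=c0)  # cutting i->t pays the cost of label 0
for (i, j), wij in w.items():
    G.add_edge(i, j, capacity=wij)   # cut when i and j take different labels
    G.add_edge(j, i, capacity=wij)

cut_value, (s_side, t_side) = nx.minimum_cut(G, "s", "t")
labels = {i: (0 if i in s_side else 1) for i in u}
print(labels, cut_value)  # {1: 0, 2: 1, 3: 1} with minimum energy 1.5
```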

23 Example: Depth Reconstruction

[Figure: view 1, view 2, depth reconstruction]

Scharstein & Szeliski, High-accuracy stereo depth maps using structured light. Proc. IEEE CVPR 2003.

24 Cardinality Factors

[Figure: a single factor "score" connected to A, B, C, D]

A factor over arbitrarily many binary variables X_1, ..., X_k whose score depends only on the count of ones: $\text{Score}(X_1, \ldots, X_k) = f\big(\sum_i X_i\big)$

Example applications:
- soft parity constraints
- a prior on the number of pixels assigned to a given category
- a prior on the number of instances assigned to a given cluster
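A cardinality factor touches many variables but is cheap to evaluate, since it only sees the count (a minimal sketch, not from the slides; the parity score is invented):

```python
def cardinality_score(x, f):
    # The factor depends on the assignment only through sum(x).
    return f(sum(x))

parity = lambda n: 0.0 if n % 2 == 0 else -1.0  # soft parity constraint
print(cardinality_score([1, 0, 1, 0], parity))  # 0.0: even count allowed
print(cardinality_score([1, 1, 1, 0], parity))  # -1.0: odd count penalized
```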

25 Sparse Pattern Factors

[Figure: a single factor "score" connected to A, B, C, D]

A factor over variables X_1, ..., X_k whose score is specified for some small number of assignments x_1, ..., x_k and is constant for all other assignments.

Examples: give a higher score to combinations that occur in real data
- in spelling, letter combinations that occur in a dictionary
- 5x5 image patches that appear in natural images
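A sparse pattern factor is just a lookup with a default (a minimal sketch, not from the slides; the pattern scores are invented):

```python
patterns = {("t", "h", "e"): 2.0, ("a", "n", "d"): 1.5}  # dictionary trigrams

def pattern_score(x, default=0.0):
    # Listed assignments get their own score; everything else a constant.
    return patterns.get(tuple(x), default)

print(pattern_score("the"))  # 2.0: a listed pattern
print(pattern_score("xqz"))  # 0.0: the constant default
```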

26 Convexity Factors

Ordered binary variables X_1, ..., X_k with a convexity constraint: the variables set to 1 must form a single contiguous run.

Examples:
- convexity of parts in image segmentation
- contiguity of word labeling in text
- temporal contiguity of subactivities
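Checking the constraint is a one-liner (a minimal sketch, not from the slides, using the contiguous-run reading of convexity stated above):

```python
def convexity_score(x, penalty=-1e9):
    # Score 0 if the 1's form one contiguous run (possibly empty), else penalize.
    ones = [i for i, v in enumerate(x) if v == 1]
    contiguous = not ones or ones[-1] - ones[0] + 1 == len(ones)
    return 0.0 if contiguous else penalty

print(convexity_score([0, 1, 1, 1, 0]))  # 0.0: one contiguous run
print(convexity_score([1, 0, 1, 0, 0]))  # -1e9: violates convexity
```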

27 Summary

Many specialized models admit a tractable MAP solution even though they do not have tractable algorithms for computing marginals. These specialized models are useful both on their own and as components in a larger model with other types of factors.

38 Probabilistic Graphical Models - Inference - MAP - Dual Decomposition

39 Problem Formulation

The score decomposes into singleton factors and large factors:
$\text{MAP}(\theta) = \max_x \Big( \sum_i \theta_i(x_i) + \sum_F \theta_F(x_F) \Big)$

40 Divide and Conquer

41 Divide and Conquer

Each slave is optimized independently:
$L(\lambda) = \sum_F \max_{x_F} \Big( \theta_F(x_F) - \sum_{i \in F} \lambda_{Fi}(x_i) \Big) + \sum_i \max_{x_i} \Big( \theta_i(x_i) + \sum_{F : i \in F} \lambda_{Fi}(x_i) \Big)$

L(λ) is an upper bound on MAP(θ) for any setting of the λ's.
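Why the bound holds (a short derivation, using the slave definitions above): for any single joint assignment $x$, the $\lambda$ terms cancel, so

$\text{MAP}(\theta) = \max_x \Big[ \sum_F \Big( \theta_F(x_F) - \sum_{i \in F} \lambda_{Fi}(x_i) \Big) + \sum_i \Big( \theta_i(x_i) + \sum_{F : i \in F} \lambda_{Fi}(x_i) \Big) \Big] \le L(\lambda)$

because letting each slave pick its own maximizer independently can only increase the value.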

42 Divide and Conquer: Example

A loop X1 - X2 - X3 - X4 with pairwise factors θ_F(X1,X2), θ_G(X2,X3), θ_H(X3,X4), θ_K(X1,X4) and singleton factors θ_1, ..., θ_4 decomposes into slaves:

Factor slaves:
$\theta_F(X_1,X_2) - \lambda_{F1}(X_1) - \lambda_{F2}(X_2)$
$\theta_G(X_2,X_3) - \lambda_{G2}(X_2) - \lambda_{G3}(X_3)$
$\theta_H(X_3,X_4) - \lambda_{H3}(X_3) - \lambda_{H4}(X_4)$
$\theta_K(X_1,X_4) - \lambda_{K1}(X_1) - \lambda_{K4}(X_4)$

Singleton slaves:
$\theta_1(X_1) + \lambda_{F1}(X_1) + \lambda_{K1}(X_1)$
$\theta_2(X_2) + \lambda_{F2}(X_2) + \lambda_{G2}(X_2)$
$\theta_3(X_3) + \lambda_{G3}(X_3) + \lambda_{H3}(X_3)$
$\theta_4(X_4) + \lambda_{K4}(X_4) + \lambda_{H4}(X_4)$

43 Divide and Conquer

Slaves don't have to be factors in the original model; any subset of factors that admits a tractable solution to the local maximization task can serve as a slave. For the loop above, combining two edges per slave:

Factor slaves:
$\theta_F(X_1,X_2) + \theta_G(X_2,X_3) - \lambda_{FG1}(X_1) - \lambda_{FG2}(X_2) - \lambda_{FG3}(X_3)$
$\theta_K(X_1,X_4) + \theta_H(X_3,X_4) - \lambda_{KH1}(X_1) - \lambda_{KH3}(X_3) - \lambda_{KH4}(X_4)$

Singleton slaves:
$\theta_1(X_1) + \lambda_{FG1}(X_1) + \lambda_{KH1}(X_1)$
$\theta_2(X_2) + \lambda_{FG2}(X_2)$
$\theta_3(X_3) + \lambda_{FG3}(X_3) + \lambda_{KH3}(X_3)$
$\theta_4(X_4) + \lambda_{KH4}(X_4)$

44 Divide and Conquer

In pairwise networks, we often divide the factors into a set of disjoint trees, with each edge factor assigned to exactly one tree. Other tractable classes of factor sets include matchings and associative models.

45 Example: 3D Cell Reconstruction

[Figure: correspond tilt images, then compute the 3D reconstruction]

Matching weights: similarity of location and local neighborhood appearance. Pairwise potentials: approximate preservation of relative marker positions across images.

Duchi, Tarlow, Elidan, and Koller, NIPS 2006. Amat, Moussavi, Comolli, Elidan, Downing, Horowitz, Journal of Structural Biology, 2006.

46 Probabilistic Graphical Models - Inference - MAP - Dual Decomposition Algorithm

47 Dual Decomposition Algorithm

Initialize all λ's to 0. Repeat for t = 1, 2, ...:
- Locally optimize all slaves: compute a maximizing assignment $\hat{x}_F$ for each factor slave and $\hat{x}_i$ for each singleton slave.
- For all F and i ∈ F: if $\hat{x}_F[i] \ne \hat{x}_i$, then update $\lambda_{Fi}(\hat{x}_F[i]) \mathrel{+}= \alpha_t$ and $\lambda_{Fi}(\hat{x}_i) \mathrel{-}= \alpha_t$, pushing the two slaves toward agreement.
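A toy end-to-end run (a minimal sketch, not from the slides: the chain X1 - X2 - X3, its tables, and the 200-iteration budget are all invented for illustration):

```python
V = (0, 1)  # binary domains
factor_tables = {
    "F": {(0, 0): 1.0, (0, 1): 3.0, (1, 0): 2.0, (1, 1): 0.0},  # theta_F(X1,X2)
    "G": {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 4.0, (1, 1): 0.5},  # theta_G(X2,X3)
}
scopes = {"F": (1, 2), "G": (2, 3)}
theta_i = {1: {0: 0.0, 1: 0.5}, 2: {0: 0.0, 1: 0.0}, 3: {0: 0.2, 1: 0.0}}

# Dual variables lambda_{F,i}(x_i), initialized to 0.
lam = {(f, i): {x: 0.0 for x in V} for f, sc in scopes.items() for i in sc}

for t in range(1, 201):
    alpha = 1.0 / t  # satisfies: sum alpha_t = inf, sum alpha_t^2 < inf
    xF, xi = {}, {}
    # Locally optimize factor slaves: theta_F(x) - sum_{i in F} lam_{F,i}(x_i)
    for f, (i, j) in scopes.items():
        tbl = factor_tables[f]
        xF[f] = max(tbl, key=lambda a: tbl[a] - lam[f, i][a[0]] - lam[f, j][a[1]])
    # Locally optimize singleton slaves: theta_i(x) + sum_F lam_{F,i}(x)
    for i in theta_i:
        xi[i] = max(V, key=lambda x: theta_i[i][x]
                    + sum(lam[f, i][x] for f, sc in scopes.items() if i in sc))
    # Subgradient step on disagreements, as in the update rule above.
    for f, sc in scopes.items():
        for pos, i in enumerate(sc):
            if xF[f][pos] != xi[i]:
                lam[f, i][xF[f][pos]] += alpha
                lam[f, i][xi[i]] -= alpha

print(xi)  # {1: 0, 2: 1, 3: 0}: all slaves agree, so this is a MAP assignment
```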

48 Dual Decomposition Convergence

Under weak conditions on the step size α_t, the λ's are guaranteed to converge:
$\sum_t \alpha_t = \infty, \qquad \sum_t \alpha_t^2 < \infty$
(for example, α_t = 1/t). Convergence is to a unique global optimum, regardless of initialization.

49 At Convergence

Each slave has a locally optimal solution over its own variables, but these solutions may not agree on the shared variables. If all slaves agree, the shared solution is a guaranteed MAP assignment. Otherwise, we need to solve a decoding problem to construct a joint assignment.

50 Options for Decoding x*

Several heuristics:
- If we use a decomposition into spanning trees, take the MAP solution of any tree.
- Have each slave vote on the X_i's in its scope, and for each X_i pick the value with the most votes.
- Take a weighted average of the sequence of messages sent regarding each X_i.

The score θ is easy to evaluate, so it is best to generate many candidates and pick the one with the highest score.

51 Upper Bound

L(λ) is an upper bound on MAP(θ), so for any assignment x:
$\text{score}(x) \le \text{MAP}(\theta) \le L(\lambda)$
and therefore
$\text{MAP}(\theta) - \text{score}(x) \le L(\lambda) - \text{score}(x)$
which bounds how far any candidate x is from the MAP value.

52 Important Design Choices

- Division of the problem into slaves: larger slaves (with more factors) improve convergence and often the quality of answers.
- Selection among the locally optimal solutions of each slave: try to move toward faster agreement.
- Adjusting the step size α_t.
- Methods for constructing candidate solutions.

53 Summary: Algorithm

Dual decomposition is a general-purpose algorithm for MAP inference. It divides the model into tractable components, solves each one locally, and passes messages to induce them to agree. Any tractable MAP subclass can be used in this setting.

54 Summary: Theory

Formally, dual decomposition is a subgradient optimization algorithm on a dual problem to MAP. This view provides important guarantees:
- an upper bound on the distance to the MAP value
- conditions that guarantee an exact MAP solution
- even some analysis of which decomposition into slaves is better

55 Summary: Practice

Pros:
- very general purpose
- best theoretical guarantees
- can use very fast, specialized MAP subroutines to solve large model components

Cons:
- not the fastest algorithm
- lots of tunable parameters / design choices
