Part 7: Structured Prediction and Energy Minimization (2/2)


1 Part 7: Structured Prediction and Energy Minimization (2/2) Colorado Springs, 25th June 2011

2 G: Worst-case Complexity [Diagram: a hard problem can be attacked by giving up one of Generality, Optimality, Worst-case complexity, Integrality, or Determinism; this part: worst-case complexity]

3 G: Worst-case Complexity Giving up Worst-case Complexity. Worst-case complexity is an asymptotic behaviour; it quantifies only the worst case, and the practical case might be very different. Issue: what is the distribution over inputs? Popular methods with bad or unknown worst-case complexity: the Simplex Method for Linear Programming, hash tables, branch-and-bound search.

4 G: Worst-case Complexity Branch-and-bound Search. Implicit enumeration: globally optimal. Choose 1. a partitioning of the solution space, 2. a branching scheme, 3. upper and lower bounds over partitions. Branch and bound is very flexible, with many tuning possibilities in partitioning, branching schemes and bounding functions; it can be very efficient in practice, but typically has worst-case complexity equal to exhaustive enumeration.

5 G: Worst-case Complexity Branch-and-Bound (cont) [Figure: the solution space Y partitioned into active nodes A_1, ..., A_5 (white) and closed nodes C_1, ..., C_3 (gray)] Work with a partitioning of the solution space Y: active nodes (white), closed nodes (gray).

6 G: Worst-case Complexity Branch-and-Bound (cont) [Figure: partition of Y into active (A) and closed (C) subsets] Initially: everything active. Goal: everything closed.

7 G: Worst-case Complexity Branch-and-Bound (cont) [Figure: current partitioning of Y into active nodes A_1, ..., A_5 and closed nodes C_1, ..., C_3]

8 G: Worst-case Complexity Branch-and-Bound (cont) Take an active element (here A_2).

9 G: Worst-case Complexity Branch-and-Bound (cont) Partition it into two or more subsets of Y.

10 G: Worst-case Complexity Branch-and-Bound (cont) [Figure: A_2 replaced by new nodes C_4, C_5, A_6] Evaluate bounds and set each new node active or closed. Closing is possible if we can prove that no solution in a partition can be better than a known solution of value L: if the bound satisfies \bar{g}(A) \le L, close A. (A generic sketch of the whole procedure follows below.)
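
The loop on slides 5-10 can be written generically. Below is a minimal best-first sketch for maximization (my own illustration, not from the tutorial); the partitioning routine `split`, the bound `upper_bound`, and the helpers `is_singleton` / `solution_value` are placeholders the user must supply.

```python
# Minimal best-first branch-and-bound sketch for maximizing g over Y.
# `split(A)` partitions a set A into subsets, `upper_bound(A)` returns a value
# >= max_{y in A} g(y), `is_singleton(A)` tests whether A contains one solution,
# and `solution_value(A)` evaluates g on that single solution.
import heapq

def branch_and_bound(Y, split, upper_bound, is_singleton, solution_value):
    best_val, best_sol = float("-inf"), None
    active = [(-upper_bound(Y), 0, Y)]   # max-heap via negated bounds
    counter = 1                          # tie-breaker so sets are never compared
    while active:
        neg_bound, _, A = heapq.heappop(active)
        if -neg_bound <= best_val:       # bound(A) <= L: close this partition
            continue
        if is_singleton(A):              # a single candidate solution remains
            val = solution_value(A)
            if val > best_val:
                best_val, best_sol = val, A
            continue
        for B in split(A):               # branch: partition A further
            b = upper_bound(B)
            if b > best_val:             # keep only potentially better parts active
                heapq.heappush(active, (-b, counter, B))
                counter += 1
    return best_sol, best_val
```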

11 G: Worst-case Complexity Example 1: Efficient Subwindow Search (Lampert and Blaschko, 2008) Find the bounding box that maximizes a linear scoring function:
y^* = \arg\max_{y \in Y} \langle w, \phi(y, x) \rangle

12 G: Worst-case Complexity Example 1: Efficient Subwindow Search (Lampert and Blaschko, 2008) Scoring function:
g(x, y) = \beta + \sum_{x_i \text{ within } y} w(x_i)

13 G: Worst-case Complexity Example 1: Efficient Subwindow Search (Lampert and Blaschko, 2008) Subsets B of bounding boxes are specified by interval coordinates, B \in 2^Y.

14 G: Worst-case Complexity Example 1: Efficient Subwindow Search (Lampert and Blaschko, 2008) Upper bound \bar{g}(x, B) \ge g(x, y) for all y \in B:
\bar{g}(x, B) = \beta + \sum_{x_i \text{ within } B_{\max}} \max\{w(x_i), 0\} + \sum_{x_i \text{ within } B_{\min}} \min\{w(x_i), 0\} \ge \max_{y \in B} g(x, y),
where B_{\max} and B_{\min} denote the largest and the smallest box contained in B.
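
The two sums in this bound can be evaluated in constant time per candidate set from integral images of the positive and the negative weights. The sketch below is a simplified reimplementation under my own assumptions (weights on a pixel grid, inclusive box coordinates, B given as coordinate intervals), not the authors' code.

```python
# ESS upper bound: beta + positive weights inside the largest box B_max in B
#                       + negative weights inside the smallest box B_min in B.
import numpy as np

def integral_image(a):
    return np.pad(a, ((1, 0), (1, 0))).cumsum(0).cumsum(1)

def box_sum(ii, top, bottom, left, right):
    # Sum over the inclusive rectangle [top..bottom] x [left..right].
    return ii[bottom + 1, right + 1] - ii[top, right + 1] \
         - ii[bottom + 1, left] + ii[top, left]

def ess_upper_bound(weights, B, beta=0.0):
    # B = ((top_lo, top_hi), (bot_lo, bot_hi), (lef_lo, lef_hi), (rig_lo, rig_hi))
    (top_lo, top_hi), (bot_lo, bot_hi), (lef_lo, lef_hi), (rig_lo, rig_hi) = B
    pos = integral_image(np.maximum(weights, 0.0))
    neg = integral_image(np.minimum(weights, 0.0))
    # B_max: smallest top/left, largest bottom/right.
    bound = beta + box_sum(pos, top_lo, bot_hi, lef_lo, rig_hi)
    # B_min: largest top/left, smallest bottom/right (if that box is non-empty).
    if top_hi <= bot_lo and lef_hi <= rig_lo:
        bound += box_sum(neg, top_hi, bot_lo, lef_hi, rig_lo)
    return bound
```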

15 G: Worst-case Complexity Example 1: Efficient Subwindow Search (Lampert and Blaschko, 2008) [Figure: bounding boxes found using ESS on the PASCAL VOC 2007 detection challenge]

16 G: Worst-case Complexity Example 2: Branch-and-Mincut (Lempitsky et al., 2008) Binary image segmentation with non-local interactions. [Figure: example segmentations y_1, y_2 \in Y]

17 G: Worst-case Complexity Example 2: Branch-and-Mincut (Lempitsky et al., 2008) Binary image segmentation with non-local interaction:
E(z, y) = C(y) + \sum_{p \in V} F_p(y) z_p + \sum_{p \in V} B_p(y) (1 - z_p) + \sum_{\{p,q\} \in E} P_{pq}(y) z_p z_q,
g(x, y) = \max_{z \in \{0,1\}^V} E(z, y).
Here: z \in \{0,1\}^V is a binary pixel mask, F_p(y) and B_p(y) are foreground/background unary energies, P_{pq}(y) is a standard pairwise energy, and all terms have global dependencies on y.

18 G: Worst-case Complexity Example 2: Branch-and-Mincut (Lempitsky et al., 2008) Upper bound for any subset A \subseteq Y:
\max_{y \in A} g(x, y) = \max_{y \in A} \max_{z \in \{0,1\}^V} \Big[ C(y) + \sum_{p \in V} F_p(y) z_p + \sum_{p \in V} B_p(y)(1 - z_p) + \sum_{\{p,q\} \in E} P_{pq}(y) z_p z_q \Big]
\le \max_{z \in \{0,1\}^V} \Big[ \big(\max_{y \in A} C(y)\big) + \sum_{p \in V} \big(\max_{y \in A} F_p(y)\big) z_p + \sum_{p \in V} \big(\max_{y \in A} B_p(y)\big) (1 - z_p) + \sum_{\{p,q\} \in E} \big(\max_{y \in A} P_{pq}(y)\big) z_p z_q \Big].
The right-hand side depends on z only and, in Branch-and-Mincut, is evaluated with a single graph cut.

19 [Diagram: a hard problem can be attacked by giving up one of Generality, Optimality, Worst-case complexity, Integrality, or Determinism]

20 Problem Relaxations Optimization problems (maximizing g : G \to R over Y \subseteq G) can become easier if the feasible set is enlarged, and/or the objective function is replaced with a bound. [Figure: a function g on Y and a bound h on the enlarged set Z \supseteq Y, with h(z) \ge g(y)]

21 Problem Relaxations Optimization problems (maximizing g : G \to R over Y \subseteq G) can become easier if the feasible set is enlarged, and/or the objective function is replaced with a bound. Definition (Relaxation (Geoffrion, 1974)). Given two optimization problems (g, Y, G) and (h, Z, G), the problem (h, Z, G) is said to be a relaxation of (g, Y, G) if 1. Z \supseteq Y, i.e. the feasible set of the relaxation contains the feasible set of the original problem, and 2. \forall y \in Y : h(y) \ge g(y), i.e. over the original feasible set the objective function h achieves no smaller values than the objective function g.

22 Relaxations The relaxed solution z^* provides a bound: h(z^*) \ge g(y^*) (maximization: upper bound), h(z^*) \le g(y^*) (minimization: lower bound). The relaxation is typically tractable. There is evidence that relaxations are better for learning (Kulesza and Pereira, 2007), (Finley and Joachims, 2008), (Martins et al., 2009). Drawback: z^* \in Z \setminus Y is possible. Are there principled methods to construct relaxations? Linear programming relaxations, Lagrangian relaxation, Lagrangian/dual decomposition.

24 Linear Programming Relaxation [Figure: feasible region \{y : Ay \le b\} and its integer points]
\max_y c^\top y \quad \text{s.t. } Ay \le b, \ y \text{ integer}.

26 Linear Programming Relaxation [Figure: the integer program and its LP relaxation, whose feasible set is the full polytope \{y : Ay \le b\}]
\max_y c^\top y \quad \text{s.t. } Ay \le b.
The integrality constraint is dropped.

27 MAP-MRF Linear Programming Relaxation [Figure: factor graph with variables Y_1, Y_2, unary factors \theta_1, \theta_2, and pairwise factor \theta_{1,2}]

28 MAP-MRF Linear Programming Relaxation [Figure: the same factor graph with an example labeling (y_1, y_2) = (2, 3)]
Energy: E(y; x) = \theta_1(y_1; x) + \theta_{1,2}(y_1, y_2; x) + \theta_2(y_2; x)
Probability: p(y|x) = \frac{1}{Z(x)} \exp\{-E(y; x)\}
MAP prediction: \arg\max_{y \in Y} p(y|x) = \arg\min_{y \in Y} E(y; x)

29 MAP-MRF Linear Programming Relaxation Introduce indicator variables
\mu_1 \in \{0,1\}^{Y_1}, \quad \mu_{1,2} \in \{0,1\}^{Y_1 \times Y_2}, \quad \mu_2 \in \{0,1\}^{Y_2}
with marginalization constraints
\mu_1(y_1) = \sum_{y_2 \in Y_2} \mu_{1,2}(y_1, y_2), \quad \sum_{y_1 \in Y_1} \mu_{1,2}(y_1, y_2) = \mu_2(y_2)
and normalization constraints
\sum_{y_1 \in Y_1} \mu_1(y_1) = 1, \quad \sum_{(y_1,y_2) \in Y_1 \times Y_2} \mu_{1,2}(y_1, y_2) = 1, \quad \sum_{y_2 \in Y_2} \mu_2(y_2) = 1.
The energy is now linear: E(y; x) = \langle \theta, \mu \rangle =: g(y, x)
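
A quick numerical check of the last statement on the two-variable model above (toy sizes and random potentials; my own illustration, not from the tutorial):

```python
# Check that the energy is linear in the indicator variables mu (slide 29).
import numpy as np

L1, L2 = 3, 4                                   # |Y_1|, |Y_2| (toy sizes)
rng = np.random.default_rng(0)
theta1, theta2 = rng.random(L1), rng.random(L2)
theta12 = rng.random((L1, L2))
y1, y2 = 2, 3                                   # the labeling from slide 28
mu1, mu2 = np.eye(L1)[y1], np.eye(L2)[y2]       # one-hot indicator vectors
mu12 = np.outer(mu1, mu2)                       # one-hot pairwise indicator
E_direct = theta1[y1] + theta12[y1, y2] + theta2[y2]
E_linear = theta1 @ mu1 + (theta12 * mu12).sum() + theta2 @ mu2   # <theta, mu>
assert np.isclose(E_direct, E_linear)
```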

31 MAP-MRF Linear Programming Relaxation (cont)
\max_\mu \sum_{i \in V} \sum_{y_i \in Y_i} \theta_i(y_i) \mu_i(y_i) + \sum_{\{i,j\} \in E} \sum_{(y_i,y_j) \in Y_i \times Y_j} \theta_{i,j}(y_i, y_j) \mu_{i,j}(y_i, y_j)
s.t. \sum_{y_i \in Y_i} \mu_i(y_i) = 1, \quad \forall i \in V,
\sum_{y_j \in Y_j} \mu_{i,j}(y_i, y_j) = \mu_i(y_i), \quad \forall \{i,j\} \in E, \ y_i \in Y_i,
\mu_i(y_i) \in \{0,1\}, \quad \forall i \in V, \ y_i \in Y_i,
\mu_{i,j}(y_i, y_j) \in \{0,1\}, \quad \forall \{i,j\} \in E, \ (y_i, y_j) \in Y_i \times Y_j.
This is an NP-hard integer linear program.

32 MAP-MRF Linear Programming Relaxation (cont)
\max_\mu \sum_{i \in V} \sum_{y_i \in Y_i} \theta_i(y_i) \mu_i(y_i) + \sum_{\{i,j\} \in E} \sum_{(y_i,y_j) \in Y_i \times Y_j} \theta_{i,j}(y_i, y_j) \mu_{i,j}(y_i, y_j)
s.t. \sum_{y_i \in Y_i} \mu_i(y_i) = 1, \quad \forall i \in V,
\sum_{y_j \in Y_j} \mu_{i,j}(y_i, y_j) = \mu_i(y_i), \quad \forall \{i,j\} \in E, \ y_i \in Y_i,
\mu_i(y_i) \in [0,1], \quad \forall i \in V, \ y_i \in Y_i,
\mu_{i,j}(y_i, y_j) \in [0,1], \quad \forall \{i,j\} \in E, \ (y_i, y_j) \in Y_i \times Y_j.
Relaxing the integrality constraints yields a linear program.
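
To make the relaxation concrete, here is a minimal sketch (my own illustration, not the tutorial's code) that builds and solves this LP for a single edge {1,2} with scipy.optimize.linprog; both marginalization directions of the edge are included.

```python
# Local-polytope LP relaxation for one edge {1,2}.
# LP variable order: mu_1 (L1 entries), mu_2 (L2 entries), mu_12 (L1*L2, row-major).
import numpy as np
from scipy.optimize import linprog

def map_mrf_lp(theta1, theta2, theta12):
    L1, L2 = theta12.shape
    n = L1 + L2 + L1 * L2
    c = -np.concatenate([theta1, theta2, theta12.ravel()])   # linprog minimizes
    A_eq, b_eq = [], []
    for start, length in [(0, L1), (L1, L2)]:                # normalization constraints
        row = np.zeros(n); row[start:start + length] = 1.0
        A_eq.append(row); b_eq.append(1.0)
    for y1 in range(L1):                                     # sum_{y2} mu_12 = mu_1
        row = np.zeros(n)
        row[L1 + L2 + y1 * L2: L1 + L2 + (y1 + 1) * L2] = 1.0
        row[y1] = -1.0
        A_eq.append(row); b_eq.append(0.0)
    for y2 in range(L2):                                     # sum_{y1} mu_12 = mu_2
        row = np.zeros(n)
        row[L1 + L2 + y2::L2] = 1.0
        row[L1 + y2] = -1.0
        A_eq.append(row); b_eq.append(0.0)
    res = linprog(c, A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=(0.0, 1.0), method="highs")
    return -res.fun, res.x[:L1], res.x[L1:L1 + L2]

# A single edge is a tree, so the relaxation is tight and mu comes out integral.
value, mu1, mu2 = map_mrf_lp(np.array([0.0, 1.0]), np.array([0.5, 0.0]),
                             np.array([[2.0, 0.0], [0.0, 2.0]]))
```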

33 MAP-MRF LP Relaxation, related work MAP-MRF LP analysis: (Wainwright et al., IEEE Trans. Inf. Theory, 2005), (Weiss et al., UAI 2007), (Werner, PAMI 2005), (Kolmogorov, PAMI 2006). Improving tightness: (Komodakis and Paragios, ECCV 2008), (Kumar and Torr, ICML 2008), (Sontag and Jaakkola, NIPS 2007), (Sontag et al., UAI 2008), (Werner, CVPR 2008), (Kumar et al., NIPS 2007). Algorithms based on the LP: (Globerson and Jaakkola, NIPS 2007), (Kumar and Torr, ICML 2008), (Sontag et al., UAI 2008).

34 MAP-MRF LP Relaxation, Known results 1. LOCAL is tight iff the factor graph is a forest (acyclic). 2. All labelings are vertices of LOCAL (Wainwright and Jordan, 2008). 3. For cyclic graphs there are additional fractional vertices. 4. If all factors have regular energies, the fractional solutions are never optimal (Wainwright and Jordan, 2008). 5. For models with only binary states: half-integrality, and the integral variables are certain (partially optimal).

35 Lagrangian Relaxation Assumption:
\max_y g(y) \quad \text{s.t. } y \in D, \ y \in C.
Optimizing g(y) over y \in D is easy. Optimizing over y \in D \cap C is hard. High-level idea: incorporate the constraint y \in C into the objective function. For an excellent introduction, see (Guignard, Lagrangean Relaxation, TOP 2003) and (Lemaréchal, Lagrangian Relaxation, 2001).

36 Lagrangian Relaxation (cont) Recipe 1. Express C in terms of equalities and inequalities:
C = \{ y \in G : u_i(y) = 0, \ i = 1, \dots, I, \ v_j(y) \le 0, \ j = 1, \dots, J \},
with u_i : G \to R differentiable, typically affine, and v_j : G \to R differentiable, typically convex.
2. Introduce Lagrange multipliers, yielding
\max_y g(y) \quad \text{s.t. } y \in D, \quad u_i(y) = 0 : \lambda_i, \ i = 1, \dots, I, \quad v_j(y) \le 0 : \mu_j, \ j = 1, \dots, J.

37 Lagrangian Relaxation (cont) 2. Introduce Lagrange multipliers, yielding
\max_y g(y) \quad \text{s.t. } y \in D, \quad u_i(y) = 0 : \lambda_i, \quad v_j(y) \le 0 : \mu_j.
3. Build the partial Lagrangian
\max_y g(y) + \lambda^\top u(y) + \mu^\top v(y) \quad \text{s.t. } y \in D. \qquad (1)

38 Lagrangian Relaxation (cont) 3. Build the partial Lagrangian
\max_y g(y) + \lambda^\top u(y) + \mu^\top v(y) \quad \text{s.t. } y \in D. \qquad (1)
Theorem (Weak Duality of Lagrangean Relaxation). For differentiable functions u_i : G \to R and v_j : G \to R, and for any \lambda \in R^I and any non-negative \mu \in R^J, \mu \ge 0, problem (1) is a relaxation of the original problem: its value is larger than or equal to the optimal value of the original problem.

39 Lagrangian Relaxation (cont) 3. Build the partial Lagrangian
q(\lambda, \mu) := \max_y g(y) + \lambda^\top u(y) + \mu^\top v(y) \quad \text{s.t. } y \in D. \qquad (1)
4. Minimize the upper bound with respect to \lambda, \mu:
\min_{\lambda, \mu} q(\lambda, \mu) \quad \text{s.t. } \mu \ge 0,
which is efficiently solvable if q(\lambda, \mu) can be evaluated.

40 Optimizing Lagrangian Dual Functions 4. Minimize the upper bound with respect to \lambda, \mu:
\min_{\lambda, \mu} q(\lambda, \mu) \quad \text{s.t. } \mu \ge 0. \qquad (2)
Theorem (Lagrangean Dual Function). 1. q is convex in \lambda and \mu, so problem (2) is a convex minimization problem. 2. If q is unbounded below, then the original problem is infeasible. 3. For any \lambda and \mu \ge 0, let
y(\lambda, \mu) = \arg\max_{y \in D} g(y) + \lambda^\top u(y) + \mu^\top v(y).
Then a subgradient of q can be constructed by evaluating the constraint functions at y(\lambda, \mu) as
u(y(\lambda, \mu)) \in \partial_\lambda q(\lambda, \mu), \quad v(y(\lambda, \mu)) \in \partial_\mu q(\lambda, \mu).
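
Point 3 is what makes the dual usable: every evaluation of q also yields a subgradient. A minimal projected-subgradient sketch follows; the inner maximizer over D and the functions g, u, v are user-supplied placeholders (assumptions, not part of the slides).

```python
# Projected subgradient minimization of the Lagrangian dual q(lambda, mu) of (2).
import numpy as np

def lagrangian_dual_descent(argmax_over_D, g, u, v, dim_lam, dim_mu,
                            steps=200, step0=1.0):
    """argmax_over_D(lam, mu): maximizer of g(y) + lam'u(y) + mu'v(y) over D."""
    lam, mu = np.zeros(dim_lam), np.zeros(dim_mu)
    best_q = np.inf
    for t in range(1, steps + 1):
        y = argmax_over_D(lam, mu)                    # inner maximization
        q = g(y) + lam @ u(y) + mu @ v(y)             # dual value, an upper bound
        best_q = min(best_q, q)
        alpha = step0 / np.sqrt(t)                    # diminishing step size
        lam = lam - alpha * u(y)                      # subgradient step in lambda
        mu = np.maximum(mu - alpha * v(y), 0.0)       # step in mu, projected to mu >= 0
    return lam, mu, best_q
```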

41 Example: MAP-MRF Message Passing
\max_\mu \sum_{i \in V} \sum_{y_i \in Y_i} \theta_i(y_i) \mu_i(y_i) + \sum_{\{i,j\} \in E} \sum_{(y_i,y_j) \in Y_i \times Y_j} \theta_{i,j}(y_i, y_j) \mu_{i,j}(y_i, y_j)
s.t. \sum_{y_i \in Y_i} \mu_i(y_i) = 1, \quad \forall i \in V,
\sum_{(y_i,y_j) \in Y_i \times Y_j} \mu_{i,j}(y_i, y_j) = 1, \quad \forall \{i,j\} \in E,
\sum_{y_j \in Y_j} \mu_{i,j}(y_i, y_j) = \mu_i(y_i), \quad \forall \{i,j\} \in E, \ y_i \in Y_i,
\mu_i(y_i) \in \{0,1\}, \quad \mu_{i,j}(y_i, y_j) \in \{0,1\}.

42 Example: MAP-MRF Message Passing The same integer linear program as on the previous slide. If the marginalization constraint \sum_{y_j \in Y_j} \mu_{i,j}(y_i, y_j) = \mu_i(y_i) were not there, the problem would be trivial! (Wainwright and Jordan, 2008, Section 4.1.3)

43 Example: MAP-MRF Message Passing The same program, now with Lagrange multipliers \phi_{i,j}(y_i) attached to the marginalization constraints:
\sum_{y_j \in Y_j} \mu_{i,j}(y_i, y_j) = \mu_i(y_i) : \phi_{i,j}(y_i), \quad \forall \{i,j\} \in E, \ y_i \in Y_i,
while keeping the normalization and integrality constraints. If this constraint were not there, the problem would be trivial! (Wainwright and Jordan, 2008, Section 4.1.3)

44 Example: MAP-MRF Message Passing Dualizing the marginalization constraints gives
q(\phi) := \max_\mu \sum_{i \in V} \sum_{y_i \in Y_i} \theta_i(y_i) \mu_i(y_i) + \sum_{\{i,j\} \in E} \sum_{(y_i,y_j) \in Y_i \times Y_j} \theta_{i,j}(y_i, y_j) \mu_{i,j}(y_i, y_j) + \sum_{\{i,j\} \in E} \sum_{y_i \in Y_i} \phi_{i,j}(y_i) \Big( \sum_{y_j \in Y_j} \mu_{i,j}(y_i, y_j) - \mu_i(y_i) \Big)
s.t. \sum_{y_i \in Y_i} \mu_i(y_i) = 1, \quad \forall i \in V,
\sum_{(y_i,y_j) \in Y_i \times Y_j} \mu_{i,j}(y_i, y_j) = 1, \quad \forall \{i,j\} \in E,
\mu_i(y_i) \in \{0,1\}, \quad \mu_{i,j}(y_i, y_j) \in \{0,1\}.

45 Example: MAP-MRF Message Passing Rearranging terms, the dual function decomposes over nodes and edges:
q(\phi) := \max_\mu \sum_{i \in V} \sum_{y_i \in Y_i} \Big( \theta_i(y_i) - \sum_{j \in V : \{i,j\} \in E} \phi_{i,j}(y_i) \Big) \mu_i(y_i) + \sum_{\{i,j\} \in E} \sum_{(y_i,y_j) \in Y_i \times Y_j} \big( \theta_{i,j}(y_i, y_j) + \phi_{i,j}(y_i) \big) \mu_{i,j}(y_i, y_j)
s.t. \sum_{y_i \in Y_i} \mu_i(y_i) = 1, \quad \forall i \in V,
\sum_{(y_i,y_j) \in Y_i \times Y_j} \mu_{i,j}(y_i, y_j) = 1, \quad \forall \{i,j\} \in E,
\mu_i(y_i) \in \{0,1\}, \quad \mu_{i,j}(y_i, y_j) \in \{0,1\}.
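
For the two-variable model on the following slide this decomposed dual is easy to handle: the node and edge subproblems are maximized independently, and the constraint violation at the maximizers is a subgradient of q. A sketch of subgradient descent on q(φ) for that case (my own illustration; following the transcribed derivation, only the single message φ_{1,2} appears):

```python
# Subgradient minimization of the decomposed dual q(phi) for one edge {1,2}.
import numpy as np

def dual_value_and_subgradient(theta1, theta2, theta12, phi):
    node1 = theta1 - phi                     # coefficients of mu_1(y_1)
    node2 = theta2                           # y_2 carries no message in this derivation
    edge = theta12 + phi[:, None]            # coefficients of mu_12(y_1, y_2)
    y1 = int(np.argmax(node1))
    y2 = int(np.argmax(node2))
    e1, e2 = np.unravel_index(np.argmax(edge), edge.shape)
    q = node1[y1] + node2[y2] + edge[e1, e2]
    # Subgradient wrt phi(y_1): sum_{y_2} mu_12(y_1, y_2) - mu_1(y_1) at the maximizers.
    sg = np.zeros_like(phi)
    sg[e1] += 1.0
    sg[y1] -= 1.0
    return q, sg

def minimize_dual(theta1, theta2, theta12, steps=100, step0=1.0):
    phi, best = np.zeros_like(theta1), np.inf
    for t in range(1, steps + 1):
        q, sg = dual_value_and_subgradient(theta1, theta2, theta12, phi)
        best = min(best, q)
        phi = phi - (step0 / np.sqrt(t)) * sg    # step towards a smaller upper bound
    return best, phi
```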

46 Example: MAP-MRF Message Passing Messages \phi_{1,2} \in R^{Y_1}, \phi_{2,1} \in R^{Y_2}. [Figure: the two-variable factor graph with factors \theta_1, \theta_{1,2}, \theta_2 reparametrized by the messages \phi_{1,2} and \phi_{2,1}] Parent-to-child region-graph messages (Meltzer et al., UAI 2009); max-sum diffusion reparametrization (Werner, PAMI 2007).

47 Example behaviour of objectives [Plot: primal and dual objective values versus iteration; the dual objective upper-bounds the primal objective]

48 Primal Solution Recovery Assume we solved the dual problem for (\lambda^*, \mu^*). 1. Can we obtain a primal solution y^*? 2. Can we say something about the relaxation quality? Theorem (Sufficient Optimality Conditions). If for a given \lambda and \mu \ge 0 we have u(y(\lambda, \mu)) = 0 and v(y(\lambda, \mu)) \le 0 (primal feasibility), and further \lambda^\top u(y(\lambda, \mu)) = 0 and \mu^\top v(y(\lambda, \mu)) = 0 (complementary slackness), then y(\lambda, \mu) is an optimal primal solution to the original problem, and (\lambda, \mu) is an optimal solution to the dual problem.

50 Primal Solution Recovery (cont) But we might never see a solution satisfying the optimality conditions; this happens only if there is no duality gap, i.e. q(\lambda^*, \mu^*) = g(x, y^*). Special case: for integer linear programs we can always reconstruct a primal solution to the relaxed problem
\max_y g(y) \quad \text{s.t. } y \in \mathrm{conv}(D), \ y \in C. \qquad (3)
For example, for subgradient method updates it is known that (Anstreicher and Wolsey, MathProg 2009)
\lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} y(\lambda_t, \mu_t) \to y^* \text{ of } (3).
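
A sketch of the averaging step, assuming the inner maximizers y(λ_t, µ_t) from the subgradient iterations have been collected as equal-length vectors (my own illustration):

```python
# Primal recovery by averaging the inner maximizers of the subgradient iterations;
# in the integer-linear-programming case the average converges to a solution of (3).
import numpy as np

def average_primal(iterates):
    return np.mean(np.stack(iterates), axis=0)    # (1/T) * sum_t y(lambda_t, mu_t)
```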

51 Dual/Lagrangian Decomposition Additive structure:
\max_y \sum_{k=1}^{K} g_k(y) \quad \text{s.t. } y \in Y,
such that maximizing \sum_{k=1}^{K} g_k(y) is hard, but for each k,
\max_y g_k(y) \quad \text{s.t. } y \in Y
is tractable. (Guignard, Lagrangean Decomposition, MathProg 1987): the original paper for dual decomposition. (Conejo et al., Decomposition Techniques in Mathematical Programming, 2006): continuous variable problems.

53 Dual/Lagrangian Decomposition Idea 1. Selectively duplicate variables ("variable splitting"):
\max_{y_1, \dots, y_K, y} \sum_{k=1}^{K} g_k(y_k) \quad \text{s.t. } y \in Y, \quad y_k \in Y, \ k = 1, \dots, K, \quad y = y_k : \lambda_k, \ k = 1, \dots, K. \qquad (4)
2. Add coupling equality constraints between the duplicated variables. 3. Apply Lagrangian relaxation to the coupling constraints. Also known as dual decomposition and variable splitting.

55 Dual/Lagrangian Decomposition (cont) Dualizing the coupling constraints yields
\max_{y_1, \dots, y_K, y} \sum_{k=1}^{K} \big[ g_k(y_k) + \lambda_k^\top (y - y_k) \big] \quad \text{s.t. } y \in Y, \quad y_k \in Y, \ k = 1, \dots, K.

56 Dual/Lagrangian Decomposition (cont)
\max_{y_1, \dots, y_K, y} \sum_{k=1}^{K} \big( g_k(y_k) - \lambda_k^\top y_k \big) + \Big( \sum_{k=1}^{K} \lambda_k \Big)^\top y \quad \text{s.t. } y \in Y, \quad y_k \in Y, \ k = 1, \dots, K.
The problem is decomposed.

57 Dual/Lagrangian Decomposition (cont)
q(\lambda) := \max_{y_1, \dots, y_K, y} \sum_{k=1}^{K} \big( g_k(y_k) - \lambda_k^\top y_k \big) + \Big( \sum_{k=1}^{K} \lambda_k \Big)^\top y \quad \text{s.t. } y \in Y, \quad y_k \in Y, \ k = 1, \dots, K.
The dual optimization problem is
\min_\lambda q(\lambda) \quad \text{s.t. } \sum_{k=1}^{K} \lambda_k = 0,
where \{\lambda : \sum_{k=1}^{K} \lambda_k = 0\} is the domain on which q(\lambda) remains bounded. Projected subgradient method, using
(y - y_k) - \frac{1}{K} \sum_{l=1}^{K} (y - y_l) = \frac{1}{K} \sum_{l=1}^{K} y_l - y_k
as the projected subgradient with respect to \lambda_k.
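
A minimal sketch of this projected subgradient scheme (my own illustration; `solvers[k]` is an assumed user-supplied routine returning the maximizer of g_k(y) - λ_k^T y over Y as a vector). The global variable y cancels in the projected subgradient, so only the subproblem maximizers are needed.

```python
# Dual/Lagrangian decomposition with projected subgradient updates.
import numpy as np

def dual_decomposition(solvers, dim, steps=100, step0=1.0):
    K = len(solvers)
    lam = np.zeros((K, dim))        # multipliers; the update keeps sum_k lam_k = 0
    ys = None
    for t in range(1, steps + 1):
        ys = np.stack([solvers[k](lam[k]) for k in range(K)])  # K independent subproblems
        mean_y = ys.mean(axis=0)
        # Projected subgradient wrt lam_k is (1/K) sum_l y_l - y_k.
        lam -= (step0 / np.sqrt(t)) * (mean_y - ys)
    return lam, ys

# A common heuristic for a primal candidate is to average/round the y_k or to
# pick the best feasible one among them.
```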

58 Primal Interpretation Primal interpretation for the linear case due to (Guignard, 1987). Theorem (Lagrangian Decomposition Primal). Let g(x, y) = c(x)^\top y be linear; then the solution of the dual attains the value of the following relaxed primal optimization problem:
\max_y \sum_{k=1}^{K} g_k(x, y) \quad \text{s.t. } y \in \mathrm{conv}(Y_k), \ k = 1, \dots, K.

59 Primal Interpretation (cont) [Figure: the discrete feasible set D, the convex hulls conv(Y_1) and conv(Y_2), the relaxed solution y^* in their intersection, and the linear objective c^\top y]
\max_y \sum_{k=1}^{K} g_k(x, y) \quad \text{s.t. } y \in \mathrm{conv}(Y_k), \ k = 1, \dots, K.

60 Applications in Computer Vision Very broadly applicable (Komodakis et al., ICCV 2007) (Woodford et al., ICCV 2009) (Strandmark and Kahl, CVPR 2010) (Vicente et al., ICCV 2009) (Torresani et al., ECCV 2008) (Werner, TechReport 2009) Further reading (Sontag et al., Introduction to Dual Decomposition for Inference, OfML Vol, 2011)

61 Determinism [Diagram: a hard problem can be attacked by giving up one of Generality, Optimality, Worst-case complexity, Integrality, or Determinism; this part: determinism]

62 Determinism Giving up Determinism Algorithms that use randomness are non-deterministic: the result does not depend exclusively on the input, and they are allowed to return a wrong result (with low probability) and/or to have a random runtime. Is this a good idea? For some problems randomized algorithms are the only known efficient algorithms; using randomness for hard problems has a long tradition in sampling-based algorithms, physics, computational chemistry, etc.; such algorithms can be simple and effective, but proving theorems about them can be much harder.

63 Determinism Simulated Annealing Basic idea (Kirkpatrick et al., 1983): 1. Define a distribution p(y; T) that concentrates mass on states y with high values g(x, y). 2. Simulate p(y; T). 3. Increase the concentration and repeat. Defining p(y; T): Boltzmann distribution. Simulating p(y; T): MCMC, usually done using a Metropolis chain or a Gibbs sampler.

64 Determinism Simulated Annealing (cont) Definition (Boltzmann Distribution). For a finite set Y, a function g : X \times Y \to R and a temperature parameter T > 0, let
p(y; T) = \frac{1}{Z(T)} \exp\Big( \frac{g(x, y)}{T} \Big), \qquad (5)
with Z(T) = \sum_{y \in Y} \exp\big( g(x, y) / T \big), be the Boltzmann distribution for g over Y at temperature T.
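
A tiny numerical illustration of Eq. (5) (toy values, not from the slides), evaluated at the temperatures shown on the following slides:

```python
# Boltzmann distribution over a small finite state space, computed stably.
import numpy as np

def boltzmann(g_values, T):
    scaled = np.asarray(g_values, dtype=float) / T
    scaled -= scaled.max()                 # stabilize before exponentiating
    p = np.exp(scaled)
    return p / p.sum()                     # p(y; T) proportional to exp(g(y)/T)

g = np.array([10.0, 35.0, 20.0, 5.0, 30.0])       # toy objective values g(x, y)
for T in [100.0, 10.0, 4.0, 1.0, 0.1]:
    print(T, np.round(boltzmann(g, T), 3))        # mass concentrates on argmax as T -> 0
```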

65-71 Determinism Simulated Annealing (cont) [Plots: a toy objective function g over the states y, and the Boltzmann distributions p(y; T) for T = 100.0, 10.0, 4.0, 1.0, 0.1; as the temperature decreases, the probability mass concentrates on the maximizers of g]

72 Determinism Simulated Annealing (cont)
y^* = SimulatedAnnealing(Y, g, T, N)
Input:
  Y : finite feasible set
  g : X \times Y \to R, objective function
  T \in R^K : sequence of K decreasing temperatures
  N \in N^K : sequence of K step lengths
(y, y^*) \leftarrow (y_0, y_0)
for k = 1, \dots, K do
  y \leftarrow simulate the Markov chain p(y; T(k)) starting from y for N(k) steps
end for
y^* \leftarrow y
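
A runnable version of this routine under my own toy assumptions (a 1-D state space with symmetric +/-1 Metropolis proposals; not the tutorial's code):

```python
# Simulated annealing with a Metropolis chain simulating p(y; T(k)).
import numpy as np

def simulated_annealing(g, num_states, temperatures, step_lengths, seed=0):
    rng = np.random.default_rng(seed)
    y = int(rng.integers(num_states))                 # y_0
    y_best = y
    for T, N in zip(temperatures, step_lengths):
        for _ in range(N):
            y_new = (y + int(rng.choice([-1, 1]))) % num_states   # propose a neighbor
            delta = g(y_new) - g(y)
            if rng.random() < np.exp(min(0.0, delta / T)):        # Metropolis acceptance
                y = y_new
            if g(y) > g(y_best):
                y_best = y
    return y_best

g = lambda y: -((y - 70) ** 2) / 50.0                 # toy objective, maximum at y = 70
schedule = [4.0 / np.log(1.0 + k) for k in range(1, 301)]   # schedule from slide 78
print(simulated_annealing(g, 128, schedule, [10] * len(schedule)))
```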

73 Determinism Guarantees Theorem (Guaranteed Optimality (Geman and Geman, 1984)). If there exists a k_0 \ge 2 such that for all k \ge k_0 the temperature T(k) satisfies
T(k) \ge \frac{|Y| \, \big( \max_{y \in Y} g(x, y) - \min_{y \in Y} g(x, y) \big)}{\log k},
then the probability of seeing the maximizer y^* of g tends to one as k \to \infty. This schedule is too slow in practice; faster schedules are used instead, such as T(k) = T_0 \cdot \alpha^k.

75 Determinism Example: Simulated Annealing (Geman and Geman, 1984) Figure: Factor graph model 2D 8-neighbor 128-by-128 grid, 5 possible labels Pairwise Potts potentials Task: restoration Figure: Approximate sample (200 sweeps)

76 Determinism Example: Simulated Annealing (Geman and Geman, 1984) Figure: Approximate sample (200 sweeps) Figure: Corrupted with Gaussian noise 2D 8-neighbor 128-by-128 grid, 5 possible labels Pairwise Potts potentials Task: restoration

77 Determinism Example: Simulated Annealing (Geman and Geman, 1984) Figure: Corrupted input image Figure: Restoration, 25 sweeps Derive unary energies from corrupted input image (optimal) Simulated annealing, 25 Gibbs sampling sweeps

78 Determinism Example: Simulated Annealing (Geman and Geman, 1984) Figure: Corrupted input image Figure: Restoration, 300 sweeps Simulated annealing, 300 Gibbs sampling sweeps Schedule T (k) = 4.0/ log(1 + k)

79 Determinism Example: Simulated Annealing (Geman and Geman, 1984) Figure: Noise-free sample Figure: Restoration, 300 sweeps Simulated annealing, 300 Gibbs sampling sweeps Schedule T (k) = 4.0/ log(1 + k)

80 End The End... The tutorial in written form is now available in the publisher's FnT Computer Graphics and Vision series; the PDF is available on the authors' homepages. Thank you!
