Part 6: Structured Prediction and Energy Minimization (1/2)
1 Part 6: Structured Prediction and Energy Minimization (1/2) Providence, 21st June 2012
2 Prediction Problem
y* = f(x) = argmax_{y ∈ Y} g(x, y)
- g(x, y) = p(y | x): factor graphs / MRFs / CRFs,
- g(x, y) = −E(y; x, w): energy-based factor graphs / MRFs / CRFs,
- g(x, y) = ⟨w, ψ(x, y)⟩: linear model (e.g. multiclass SVM).
Difficulty: Y is finite but very large.
4 Prediction Problem (cont)
Definition (Optimization Problem). Given (g, Y, G, x), with feasible set Y ⊆ G over decision domain G, an input instance x ∈ X, and an objective function g : X × G → R, find the optimal value
α = sup_{y ∈ Y} g(x, y),
and, if the supremum is attained, an optimal solution y* ∈ Y such that g(x, y*) = α.
5 The Feasible Set
Ingredients:
- decision domain G, typically simple (G = R^d, G = 2^V, etc.),
- feasible set Y ⊆ G, defining the problem-specific structure,
- objective function g : X × G → R.
Terminology:
- Y = G: unconstrained optimization problem,
- G finite: discrete optimization problem,
- G = 2^Σ for a ground set Σ: combinatorial optimization problem,
- Y = ∅: infeasible problem.
6 Example: Feasible Sets
Ising model with external field: graph G = (V, E), external field h ∈ R^V, interaction matrix J ∈ R^{V×V}. (The figure shows a three-node chain Y_i - Y_j - Y_k with states (+1), (−1), (−1), unary terms h_i y_i, and pairwise terms J_{ij} y_i y_j.)
Objective, defined on y_i ∈ {−1, +1}:
g(y) = h_i y_i + h_j y_j + h_k y_k + J_{ij} y_i y_j + J_{jk} y_j y_k
7 Example: Feasible Sets (cont)
Ising model with external field:
Y = G = {−1, +1}^V
g(y) = (1/2) Σ_{(i,j) ∈ E} J_{i,j} y_i y_j + Σ_{i ∈ V} h_i y_i
Unconstrained; the objective function contains quadratic terms.
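The quadratic objective above can be evaluated directly; a minimal numpy sketch (the three-node chain, the matrix J, and the field h are illustrative, not from the slides):

```python
import numpy as np

# Ising objective, matching the slide:
# g(y) = 1/2 * sum_{(i,j) in E} J[i,j]*y[i]*y[j] + sum_i h[i]*y[i],
# with spins y_i in {-1, +1}. All numbers below are made up.

def ising_objective(y, h, J, edges):
    """Evaluate the Ising objective for a spin configuration y in {-1,+1}^V."""
    pairwise = 0.5 * sum(J[i, j] * y[i] * y[j] for (i, j) in edges)
    unary = sum(h[i] * y[i] for i in range(len(y)))
    return pairwise + unary

# Tiny three-node chain, as in the slide's figure
edges = [(0, 1), (1, 2)]
J = np.zeros((3, 3))
J[0, 1] = J[1, 0] = 1.0
J[1, 2] = J[2, 1] = -0.5
h = np.array([0.2, -0.1, 0.3])

y = np.array([+1, -1, -1])
print(ising_objective(y, h, J, edges))
```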
8 Example: Feasible Sets (cont)
An equivalent formulation over 0/1 indicator variables:
G = {0, 1}^{(V × {−1,+1}) ∪ (E × {−1,+1} × {−1,+1})},
Y = {y ∈ G : ∀i ∈ V: y_{i,−1} + y_{i,+1} = 1,
  ∀(i,j) ∈ E: y_{i,j,+1,+1} + y_{i,j,+1,−1} = y_{i,+1},
  ∀(i,j) ∈ E: y_{i,j,−1,+1} + y_{i,j,−1,−1} = y_{i,−1}},
g(y) = (1/2) Σ_{(i,j) ∈ E} J_{i,j} (y_{i,j,+1,+1} + y_{i,j,−1,−1}) − (1/2) Σ_{(i,j) ∈ E} J_{i,j} (y_{i,j,+1,−1} + y_{i,j,−1,+1}) + Σ_{i ∈ V} h_i (y_{i,+1} − y_{i,−1})
Constrained, with more variables; the objective function contains linear terms only.
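As a sanity check on the two formulations, a small sketch (all names and numbers assumed) verifying that the linear objective over indicator variables reproduces the quadratic Ising objective for every spin configuration:

```python
import numpy as np
from itertools import product

# Overcomplete 0/1 encoding of the Ising model: node indicators y_{i,s} and
# edge indicators y_{i,j,s,t}. The linear objective should agree with the
# quadratic one for every configuration. All numbers are illustrative.

def quadratic(y, h, J, edges):
    return 0.5 * sum(J[i, j] * y[i] * y[j] for (i, j) in edges) \
         + sum(h[i] * y[i] for i in range(len(y)))

def linear_overcomplete(y, h, J, edges):
    node = lambda i, s: 1.0 if y[i] == s else 0.0       # y_{i,s}
    edge = lambda i, j, s, t: node(i, s) * node(j, t)   # y_{i,j,s,t}
    val = 0.0
    for (i, j) in edges:
        val += 0.5 * J[i, j] * (edge(i, j, +1, +1) + edge(i, j, -1, -1))
        val -= 0.5 * J[i, j] * (edge(i, j, +1, -1) + edge(i, j, -1, +1))
    for i in range(len(y)):
        val += h[i] * (node(i, +1) - node(i, -1))
    return val

edges = [(0, 1), (1, 2)]
J = np.zeros((3, 3)); J[0, 1] = 1.0; J[1, 2] = -0.5
h = np.array([0.2, -0.1, 0.3])
for y in product([-1, +1], repeat=3):
    assert abs(quadratic(y, h, J, edges) - linear_overcomplete(y, h, J, edges)) < 1e-9
```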
9 Evaluating f: what do we want?
f(x) = argmax_{y ∈ Y} g(x, y)
For evaluating f(x) we want an algorithm that
1. is general: applicable to all instances of the problem,
2. is optimal: provides an optimal solution y*,
3. has good worst-case complexity: for all instances, runtime and space are acceptably bounded,
4. is integral: its solutions are restricted to Y,
5. is deterministic: its results and runtime are reproducible and depend on the input data only.
Wanting all of these at once is, in general, impossible.
11 Giving up some properties
Hard problem: generality, optimality, worst-case complexity, integrality, determinism.
Giving up one or more of these properties allows us to design algorithms that satisfy the remaining ones, which might be sufficient for the task at hand.
12 G: Generality Hard problem Generality Optimality Worst-case complexity Integrality Determinism
13 G: Generality Giving up Generality: identify an interesting and tractable subset of instances. (Figure: the tractable subset inside the set of all instances.)
14 G: Generality Example: MAP Inference in Markov Random Fields
Although NP-hard in general, it is tractable...
- with low tree-width (Lauritzen and Spiegelhalter, 1988),
- with binary states and pairwise submodular interactions (Boykov and Jolly, 2001),
- with binary states, pairwise interactions only, and planar graph structure (Globerson and Jaakkola, 2006),
- with submodular pairwise interactions (Schlesinger, 2006),
- with P^n-Potts higher-order factors (Kohli, Kumar, and Torr, 2007),
- with perfect graph structure (Jebara, 2009).
15 G: Generality Binary Graph-Cuts
Energy function with unary and pairwise factors:
E(y; x, w) = Σ_{F ∈ F_1} E_F(y_F; x, w^{t_F}) + Σ_{F ∈ F_2} E_F(y_F; x, w^{t_F})
Restriction 1 (w.l.o.g.): E_F(y_i; x, w^{t_F}) ≥ 0.
Restriction 2 (regular/submodular/attractive):
E_F(y_i, y_j; x, w^{t_F}) = 0, if y_i = y_j,
E_F(y_i, y_j; x, w^{t_F}) = E_F(y_j, y_i; x, w^{t_F}) ≥ 0, otherwise.
17 G: Generality Binary Graph-Cuts (cont)
Construct an auxiliary undirected graph with one node per variable i ∈ V and two extra nodes: source s and sink t.
Edges and graph-cut weights:
- {i, j}: E_F(y_i = 0, y_j = 1; x, w^{t_F}),
- {i, s}: E_F(y_i = 1; x, w^{t_F}),
- {i, t}: E_F(y_i = 0; x, w^{t_F}).
Find a linear s-t-mincut; the solution defines an optimal binary labeling of the original energy minimization problem.
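The construction can be sketched end to end on a tiny problem; here the s-t-mincut is solved with a small Edmonds-Karp max-flow and checked against brute-force energy minimization. The energy tables (E0, E1, W) and all numbers are made up for illustration:

```python
from collections import deque
from itertools import product

# Auxiliary-graph construction from the slide for a 3-variable binary energy
# satisfying the two restrictions. The min-cut value should equal the minimum
# energy (verified by brute force below).

E0 = [0.0, 2.0, 1.0]            # E_i(y_i = 0) -> weight of edge {i, t}
E1 = [1.5, 0.5, 0.8]            # E_i(y_i = 1) -> weight of edge {i, s}
W = {(0, 1): 0.7, (1, 2): 0.4}  # symmetric pairwise cost, paid when y_i != y_j

n, s, t = 3, 3, 4               # nodes 0..2 are variables, 3 = source, 4 = sink
cap = [[0.0] * 5 for _ in range(5)]
for i in range(n):
    cap[s][i] = E1[i]           # cut when i lands on the sink side (y_i = 1)
    cap[i][t] = E0[i]           # cut when i lands on the source side (y_i = 0)
for (i, j), w in W.items():
    cap[i][j] += w              # undirected edge -> capacity in both directions
    cap[j][i] += w

def max_flow(cap, s, t):
    """Edmonds-Karp: BFS augmenting paths on the residual graph."""
    flow, cap = 0.0, [row[:] for row in cap]
    while True:
        parent = {s: s}
        q = deque([s])
        while q and t not in parent:
            u = q.popleft()
            for v in range(len(cap)):
                if v not in parent and cap[u][v] > 1e-12:
                    parent[v] = u
                    q.append(v)
        if t not in parent:
            return flow
        path, v = [], t
        while v != s:
            path.append((parent[v], v)); v = parent[v]
        aug = min(cap[u][v] for u, v in path)
        for u, v in path:
            cap[u][v] -= aug
            cap[v][u] += aug
        flow += aug

def energy(y):
    e = sum(E1[i] if y[i] else E0[i] for i in range(n))
    return e + sum(w for (i, j), w in W.items() if y[i] != y[j])

best = min(product([0, 1], repeat=n), key=energy)
assert abs(max_flow(cap, s, t) - energy(best)) < 1e-9
```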
18 G: Generality Example: Figure-Ground Segmentation Input image
19 G: Generality Example: Figure-Ground Segmentation Color model log-odds
20 G: Generality Example: Figure-Ground Segmentation Independent decisions
21 G: Generality Example: Figure-Ground Segmentation
g(x, y, w) = Σ_{i ∈ V} log p(y_i | x_i) + w Σ_{(i,j) ∈ E} C(x_i, x_j) I(y_i ≠ y_j)
Gradient strength: C(x_i, x_j) = exp(−γ ||x_i − x_j||²), with γ estimated from the mean edge strength (Blake et al., 2004); w ≥ 0 controls smoothing.
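The contrast term can be computed in a few lines; a hedged numpy sketch on a made-up 1D "image", using a common heuristic (setting γ from twice the mean squared difference, one assumed variant of the Blake et al. estimate):

```python
import numpy as np

# Contrast-sensitive pairwise weight from the slide:
# C(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2),
# with gamma set from the mean edge strength. Data are illustrative.

x = np.array([0.1, 0.12, 0.11, 0.8, 0.82])   # pixel intensities along a line
diffs2 = (x[1:] - x[:-1]) ** 2               # squared edge strengths
gamma = 1.0 / (2.0 * diffs2.mean())          # heuristic gamma from mean strength
C = np.exp(-gamma * diffs2)                  # one contrast weight per edge

# Strong intensity edges get small C, so cutting the segmentation there is cheap.
assert C[2] < C[0]   # the 0.11 -> 0.8 jump is the strong edge
```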
22 G: Generality Example: Figure-Ground Segmentation w = 0
23 G: Generality Example: Figure-Ground Segmentation Small w > 0
24 G: Generality Example: Figure-Ground Segmentation Medium w > 0
25 G: Generality Example: Figure-Ground Segmentation Large w > 0
26 G: Generality General Binary Case
Is there a larger class of energies for which binary graph cuts are applicable? (Kolmogorov and Zabih, 2004; Freedman and Drineas, 2005)
Theorem (Regular Binary Energies). Let E(y; x, w) = Σ_{F ∈ F_1} E_F(y_F; x, w^{t_F}) + Σ_{F ∈ F_2} E_F(y_F; x, w^{t_F}) be an energy function of binary variables containing only unary and pairwise factors. The discrete energy minimization problem argmin_y E(y; x, w) is representable as a graph cut problem if and only if all pairwise energy functions E_F, for F = {i, j} ∈ F_2, satisfy
E_{i,j}(0, 0) + E_{i,j}(1, 1) ≤ E_{i,j}(0, 1) + E_{i,j}(1, 0).
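Behind the theorem sits a simple decomposition (due to Kolmogorov and Zabih): any regular pairwise table splits into a constant, two unary terms, and one nonnegative cut weight. A minimal sketch with a made-up table, verified by enumeration:

```python
from itertools import product

# With A = E(0,0), B = E(0,1), C = E(1,0), D = E(1,1) and A + D <= B + C,
# the table decomposes into: constant A, unary (C - A) on y_i = 1, unary
# (D - C) on y_j = 1, and cut weight B + C - A - D paid only at (0, 1).

def decompose(A, B, C, D):
    assert A + D <= B + C, "energy is not regular (submodular)"
    const = A
    u_i = C - A           # added to E_i(y_i = 1)
    u_j = D - C           # added to E_j(y_j = 1)
    cut = B + C - A - D   # nonnegative weight of the edge {i, j}
    return const, u_i, u_j, cut

A, B, C, D = 0.0, 3.0, 2.0, 1.0                 # illustrative regular table
const, u_i, u_j, cut = decompose(A, B, C, D)
table = {(0, 0): A, (0, 1): B, (1, 0): C, (1, 1): D}
for yi, yj in product([0, 1], repeat=2):
    rebuilt = const + u_i * yi + u_j * yj + (cut if (yi, yj) == (0, 1) else 0.0)
    assert abs(rebuilt - table[(yi, yj)]) < 1e-9
```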
28 G: Generality Example: Class-independent Object Hypotheses
(Carreira and Sminchisescu, 2010): PASCAL VOC 2009/2010 segmentation winner; generates class-independent object hypotheses.
Energy (almost) as before:
g(x, y, w) = Σ_{i ∈ V} E_i(y_i) + w Σ_{(i,j) ∈ E} C(x_i, x_j) I(y_i ≠ y_j)
Fixed unaries: E_i(y_i) = ∞ if i ∈ V_fg and y_i = 0; ∞ if i ∈ V_bg and y_i = 1; 0 otherwise.
Test all w ≥ 0 using parametric max-flow (Picard and Queyranne, 1980; Kolmogorov et al., 2007).
31 G: Generality Example: Class-independent Object Hypotheses (cont) Input image
32 G: Generality Example: Class-independent Object Hypotheses (cont) CPMC proposal segmentations (Carreira and Sminchisescu, 2010)
33 Hard problem Generality Optimality Worst-case complexity Integrality Determinism
34 Giving up Optimality
Solving for y* is hard, but is it necessary?
- Pragmatic motivation: in many applications a close-to-optimal solution is good enough.
- Computational motivation: the set of good solutions might be large, and finding just one element can be easy.
For machine learning models:
- Modeling error: we always use the wrong model.
- Estimation error: the preference for y* might be an artifact.
36 Local Search (figure sequence): starting from an initial y^0 ∈ Y, repeatedly move to the best point in the current neighborhood, producing y^1 ∈ N(y^0), y^2 ∈ N(y^1), y^3 ∈ N(y^2), ..., until a point y* is reached that is optimal within its own neighborhood N(y*).
40 Local Search
N_t : Y → 2^Y, a neighborhood system. Optimization with respect to N_t(y) must be tractable:
y^{t+1} = argmax_{y ∈ N_t(y^t)} g(x, y)
41 Example: Iterated Conditional Modes (ICM) (Besag, 1986)
g(x, y) = log p(y | x), y* = argmax_{y ∈ Y} log p(y | x)
Neighborhoods: N_s(y) = {(y_1, ..., y_{s−1}, z_s, y_{s+1}, ..., y_S) : z_s ∈ Y_s}
42 ICM update for variable 1: y^{t+1} = argmax_{y_1 ∈ Y_1} log p(y_1, y^t_2, ..., y^t_{|V|} | x)
43 ICM update for variable 2: y^{t+1} = argmax_{y_2 ∈ Y_2} log p(y^t_1, y_2, y^t_3, ..., y^t_{|V|} | x)
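The coordinate-wise updates above can be sketched directly; a minimal ICM implementation on an assumed chain energy (the Potts smoothing term and all numbers are illustrative):

```python
import numpy as np

# ICM: repeatedly set each variable to its best label given the current
# values of all others, until no single-variable move improves the energy.

K = 3                                  # number of labels
unary = np.array([[0.0, 1.0, 2.0],     # unary[i, k] = E_i(k)
                  [2.0, 0.0, 1.0],
                  [2.0, 1.0, 0.0]])
lam = 0.5                              # Potts smoothing strength

def energy(y):
    e = sum(unary[i, y[i]] for i in range(len(y)))
    e += lam * sum(y[i] != y[i + 1] for i in range(len(y) - 1))
    return e

def icm(y):
    y = list(y)
    improved = True
    while improved:
        improved = False
        for i in range(len(y)):
            best = min(range(K), key=lambda k: energy(y[:i] + [k] + y[i+1:]))
            if best != y[i]:
                y[i] = best
                improved = True
    return y                           # a local optimum w.r.t. single-variable moves

y = icm([2, 2, 2])
print(y, energy(y))
```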
44 Neighborhood Size
ICM neighborhood N_t(y^t): all states reachable from y^t by changing a single variable (Besag, 1986).
Neighborhood size: in general, larger is better (very large-scale neighborhood search; Ahuja, 2000).
Example: neighborhoods along chains.
45 Block ICM update (Kelm et al., 2006; Kittler and Föglein, 1984): y^{t+1} = argmax_{y_{C_1} ∈ Y_{C_1}} log p(y_{C_1}, y^t_{V \ C_1} | x)
46 Block ICM update: y^{t+1} = argmax_{y_{C_2} ∈ Y_{C_2}} log p(y_{C_2}, y^t_{V \ C_2} | x)
47 Example: Multilabel Graph-Cut
Binary graph-cuts are not directly applicable to multilabel energy minimization problems. (Boykov et al., 2001) proposed two local search algorithms for multilabel problems: each solves a sequence of binary directed s-t-mincut problems, iteratively improving the multilabel solution.
48 α-β Swap Neighborhood
Select two different labels α and β; fix all variables i for which y_i ∉ {α, β}; optimize over the remaining i with y_i ∈ {α, β}.
N_{α,β} : Y × N × N → 2^Y,
N_{α,β}(y, α, β) := {z ∈ Y : z_i = y_i if y_i ∉ {α, β}, otherwise z_i ∈ {α, β}}.
49 α-β-swap illustrated (figure sequence): a 5-label problem and a succession of α-β-swap moves.
54 α-β-swap derivation
y^{t+1} = argmin_{y ∈ N_{α,β}(y^t, α, β)} E(y; x)
Constant terms drop out; unary terms combine; pairwise terms reduce to binary pairwise terms.
55 α-β-swap derivation (cont)
y^{t+1} = argmin_{y ∈ N_{α,β}(y^t, α, β)} Σ_{i ∈ V} E_i(y_i; x) + Σ_{(i,j) ∈ E} E_{i,j}(y_i, y_j; x)
56 α-β-swap derivation (cont)
Splitting the sums according to whether y^t_i, y^t_j ∈ {α, β}:
y^{t+1} = argmin_{y ∈ N_{α,β}(y^t, α, β)} [ Σ_{i ∈ V, y^t_i ∉ {α,β}} E_i(y^t_i; x) + Σ_{i ∈ V, y^t_i ∈ {α,β}} E_i(y_i; x)
+ Σ_{(i,j) ∈ E, y^t_i ∉ {α,β}, y^t_j ∉ {α,β}} E_{i,j}(y^t_i, y^t_j; x) + Σ_{(i,j) ∈ E, y^t_i ∈ {α,β}, y^t_j ∉ {α,β}} E_{i,j}(y_i, y^t_j; x)
+ Σ_{(i,j) ∈ E, y^t_i ∉ {α,β}, y^t_j ∈ {α,β}} E_{i,j}(y^t_i, y_j; x) + Σ_{(i,j) ∈ E, y^t_i ∈ {α,β}, y^t_j ∈ {α,β}} E_{i,j}(y_i, y_j; x) ].
Constant terms drop out; unary terms combine; pairwise terms reduce to binary pairwise terms.
58 α-β-swap graph construction
Directed graph G' = (V', E'):
V' = {α, β} ∪ {i ∈ V : y_i ∈ {α, β}},
E' = {(α, i, t^α_i) : i ∈ V, y_i ∈ {α, β}} ∪ {(i, β, t^β_i) : i ∈ V, y_i ∈ {α, β}} ∪ {(i, j, n_{i,j}) : (i, j), (j, i) ∈ E, y_i, y_j ∈ {α, β}}.
Edge weights t^α_i, t^β_i, and n_{i,j}:
n_{i,j} = E_{i,j}(α, β; x),
t^α_i = E_i(α; x) + Σ_{(i,j) ∈ E, y_j ∉ {α,β}} E_{i,j}(α, y_j; x),
t^β_i = E_i(β; x) + Σ_{(i,j) ∈ E, y_j ∉ {α,β}} E_{i,j}(β, y_j; x).
60 α-β-swap move
The side of the cut determines y_i ∈ {α, β}. Iterate over all possible (α, β) combinations.
Semi-metric requirement on the pairwise energies:
E_{i,j}(y_i, y_j; x) = 0 ⟺ y_i = y_j,
E_{i,j}(y_i, y_j; x) = E_{i,j}(y_j, y_i; x) ≥ 0.
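The move structure can be sketched as follows. For illustration the binary subproblem over the active variables is solved by enumeration rather than by an s-t-mincut, which is fine for tiny problems; the outer loop over (α, β) pairs is the same. The chain energy and all numbers are made up:

```python
from itertools import product

# Alpha-beta swap outer loop with a brute-force inner solver.
# Potts pairwise lam * [y_i != y_j] is a semi-metric, as required.

K = 3
unary = [[0.0, 1.0, 2.0], [2.0, 0.0, 1.0], [2.0, 1.0, 0.0], [0.5, 0.5, 2.0]]
edges = [(0, 1), (1, 2), (2, 3)]
lam = 0.6

def energy(y):
    return sum(unary[i][y[i]] for i in range(len(y))) \
         + lam * sum(y[i] != y[j] for (i, j) in edges)

def best_swap(y, a, b):
    """Best move in N_{a,b}(y): variables labeled a or b may switch within {a, b}."""
    active = [i for i in range(len(y)) if y[i] in (a, b)]
    best, best_e = y, energy(y)
    for choice in product((a, b), repeat=len(active)):
        z = list(y)
        for i, c in zip(active, choice):
            z[i] = c
        if energy(z) < best_e:
            best, best_e = z, energy(z)
    return best

def alpha_beta_swap(y):
    improved = True
    while improved:
        improved = False
        for a in range(K):
            for b in range(a + 1, K):     # iterate all (alpha, beta) pairs
                z = best_swap(y, a, b)
                if energy(z) < energy(y):
                    y, improved = z, True
    return y

y = alpha_beta_swap([0, 0, 0, 0])
print(y, energy(y))
```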
63 Example: Stereo Disparity Estimation
Infer depth from two images; a discretized multi-label problem. The α-expansion solution is close to optimal.
65 Model Reduction
An energy minimization problem involves many decisions to be made jointly. Model reduction:
1. Fix a subset of decisions.
2. Optimize the smaller remaining model.
Example: forcing y_i = y_j for pairs (i, j).
66 Example: Superpixels in Labeling Problems Input image: 500-by-375 pixels (187,500 decisions)
67 Example: Superpixels in Labeling Problems Image with 149 superpixels (149 decisions)
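The superpixel reduction can be sketched as tying together the variables inside each group and optimizing one label per group. A minimal numpy sketch with made-up groupings and unaries (pairwise terms between groups are ignored here for brevity):

```python
import numpy as np

# Model reduction by grouping: force y_i = y_j within each superpixel,
# then optimize the small model over group labels. Data are illustrative.

unary = np.array([[0.0, 2.0],   # unary[i, k] = E_i(k): 6 "pixels", 2 labels
                  [0.5, 1.5],
                  [0.2, 1.8],
                  [2.0, 0.0],
                  [1.5, 0.5],
                  [1.8, 0.2]])
group = np.array([0, 0, 0, 1, 1, 1])   # superpixel id for each pixel

# Reduced unaries: a group's cost for label k is the sum over its members.
reduced = np.zeros((2, 2))
for i, g in enumerate(group):
    reduced[g] += unary[i]

labels_per_group = reduced.argmin(axis=1)   # optimize the reduced model
y = labels_per_group[group]                 # expand back to pixel labels
print(y)
```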
Does Better Inference mean Better Learning? Andrew E. Gelfand, Rina Dechter & Alexander Ihler Department of Computer Science University of California, Irvine {agelfand,dechter,ihler}@ics.uci.edu Abstract
More informationInteger and Combinatorial Optimization: Introduction
Integer and Combinatorial Optimization: Introduction John E. Mitchell Department of Mathematical Sciences RPI, Troy, NY 12180 USA November 2018 Mitchell Introduction 1 / 18 Integer and Combinatorial Optimization
More informationACO Comprehensive Exam March 20 and 21, Computability, Complexity and Algorithms
1. Computability, Complexity and Algorithms Part a: You are given a graph G = (V,E) with edge weights w(e) > 0 for e E. You are also given a minimum cost spanning tree (MST) T. For one particular edge
More informationOn Partial Optimality in Multi-label MRFs
Pushmeet Kohli 1 Alexander Shekhovtsov 2 Carsten Rother 1 Vladimir Kolmogorov 3 Philip Torr 4 pkohli@microsoft.com shekhovt@cmp.felk.cvut.cz carrot@microsoft.com vnk@adastral.ucl.ac.uk philiptorr@brookes.ac.uk
More informationRevisiting the Limits of MAP Inference by MWSS on Perfect Graphs
Revisiting the Limits of MAP Inference by MWSS on Perfect Graphs Adrian Weller University of Cambridge CP 2015 Cork, Ireland Slides and full paper at http://mlg.eng.cam.ac.uk/adrian/ 1 / 21 Motivation:
More informationIntroduction to Graphical Models. Srikumar Ramalingam School of Computing University of Utah
Introduction to Graphical Models Srikumar Ramalingam School of Computing University of Utah Reference Christopher M. Bishop, Pattern Recognition and Machine Learning, Jonathan S. Yedidia, William T. Freeman,
More informationLMI Methods in Optimal and Robust Control
LMI Methods in Optimal and Robust Control Matthew M. Peet Arizona State University Lecture 02: Optimization (Convex and Otherwise) What is Optimization? An Optimization Problem has 3 parts. x F f(x) :
More informationAdvanced Structured Prediction
Advanced Structured Prediction Editors: Sebastian Nowozin Microsoft Research Cambridge, CB1 2FB, United Kingdom Peter V. Gehler Max Planck Insitute for Intelligent Systems 72076 Tübingen, Germany Jeremy
More informationAlternative Parameterizations of Markov Networks. Sargur Srihari
Alternative Parameterizations of Markov Networks Sargur srihari@cedar.buffalo.edu 1 Topics Three types of parameterization 1. Gibbs Parameterization 2. Factor Graphs 3. Log-linear Models with Energy functions
More informationThe geometry of Gaussian processes and Bayesian optimization. Contal CMLA, ENS Cachan
The geometry of Gaussian processes and Bayesian optimization. Contal CMLA, ENS Cachan Background: Global Optimization and Gaussian Processes The Geometry of Gaussian Processes and the Chaining Trick Algorithm
More informationLecture 18: Multiclass Support Vector Machines
Fall, 2017 Outlines Overview of Multiclass Learning Traditional Methods for Multiclass Problems One-vs-rest approaches Pairwise approaches Recent development for Multiclass Problems Simultaneous Classification
More informationSubmodularity beyond submodular energies: Coupling edges in graph cuts
Submodularity beyond submodular energies: Coupling edges in graph cuts Stefanie Jegelka and Jeff Bilmes Max Planck Institute for Intelligent Systems Tübingen, Germany University of Washington Seattle,
More informationPartially labeled classification with Markov random walks
Partially labeled classification with Markov random walks Martin Szummer MIT AI Lab & CBCL Cambridge, MA 0239 szummer@ai.mit.edu Tommi Jaakkola MIT AI Lab Cambridge, MA 0239 tommi@ai.mit.edu Abstract To
More informationMinimizing Count-based High Order Terms in Markov Random Fields
EMMCVPR 2011, St. Petersburg Minimizing Count-based High Order Terms in Markov Random Fields Thomas Schoenemann Center for Mathematical Sciences Lund University, Sweden Abstract. We present a technique
More informationDirected and Undirected Graphical Models
Directed and Undirected Davide Bacciu Dipartimento di Informatica Università di Pisa bacciu@di.unipi.it Machine Learning: Neural Networks and Advanced Models (AA2) Last Lecture Refresher Lecture Plan Directed
More informationInference in Graphical Models Variable Elimination and Message Passing Algorithm
Inference in Graphical Models Variable Elimination and Message Passing lgorithm Le Song Machine Learning II: dvanced Topics SE 8803ML, Spring 2012 onditional Independence ssumptions Local Markov ssumption
More informationHigher-Order Energies for Image Segmentation
IEEE TRANSACTIONS ON IMAGE PROCESSING 1 Higher-Order Energies for Image Segmentation Jianbing Shen, Senior Member, IEEE, Jianteng Peng, Xingping Dong, Ling Shao, Senior Member, IEEE, and Fatih Porikli,
More informationLecture 15. Probabilistic Models on Graph
Lecture 15. Probabilistic Models on Graph Prof. Alan Yuille Spring 2014 1 Introduction We discuss how to define probabilistic models that use richly structured probability distributions and describe how
More informationJoint Optimization of Segmentation and Appearance Models
Joint Optimization of Segmentation and Appearance Models David Mandle, Sameep Tandon April 29, 2013 David Mandle, Sameep Tandon (Stanford) April 29, 2013 1 / 19 Overview 1 Recap: Image Segmentation 2 Optimization
More informationUNDERSTANDING BELIEF PROPOGATION AND ITS GENERALIZATIONS
UNDERSTANDING BELIEF PROPOGATION AND ITS GENERALIZATIONS JONATHAN YEDIDIA, WILLIAM FREEMAN, YAIR WEISS 2001 MERL TECH REPORT Kristin Branson and Ian Fasel June 11, 2003 1. Inference Inference problems
More informationTightness of LP Relaxations for Almost Balanced Models
Tightness of LP Relaxations for Almost Balanced Models Adrian Weller University of Cambridge AISTATS May 10, 2016 Joint work with Mark Rowland and David Sontag For more information, see http://mlg.eng.cam.ac.uk/adrian/
More information3 : Representation of Undirected GM
10-708: Probabilistic Graphical Models 10-708, Spring 2016 3 : Representation of Undirected GM Lecturer: Eric P. Xing Scribes: Longqi Cai, Man-Chia Chang 1 MRF vs BN There are two types of graphical models:
More informationRandom Field Models for Applications in Computer Vision
Random Field Models for Applications in Computer Vision Nazre Batool Post-doctorate Fellow, Team AYIN, INRIA Sophia Antipolis Outline Graphical Models Generative vs. Discriminative Classifiers Markov Random
More informationSubmodularization for Binary Pairwise Energies
IEEE conference on Computer Vision and Pattern Recognition (CVPR), Columbus, Ohio, 214 p. 1 Submodularization for Binary Pairwise Energies Lena Gorelick Yuri Boykov Olga Veksler Computer Science Department
More informationGraph Cut based Inference with Co-occurrence Statistics
Graph Cut based Inference with Co-occurrence Statistics Lubor Ladicky 1,3, Chris Russell 1,3, Pushmeet Kohli 2, and Philip H.S. Torr 1 1 Oxford Brookes 2 Microsoft Research Abstract. Markov and Conditional
More informationProbabilistic Graphical Models & Applications
Probabilistic Graphical Models & Applications Learning of Graphical Models Bjoern Andres and Bernt Schiele Max Planck Institute for Informatics The slides of today s lecture are authored by and shown with
More informationDiscriminative Fields for Modeling Spatial Dependencies in Natural Images
Discriminative Fields for Modeling Spatial Dependencies in Natural Images Sanjiv Kumar and Martial Hebert The Robotics Institute Carnegie Mellon University Pittsburgh, PA 15213 {skumar,hebert}@ri.cmu.edu
More informationMultiresolution Graph Cut Methods in Image Processing and Gibbs Estimation. B. A. Zalesky
Multiresolution Graph Cut Methods in Image Processing and Gibbs Estimation B. A. Zalesky 1 2 1. Plan of Talk 1. Introduction 2. Multiresolution Network Flow Minimum Cut Algorithm 3. Integer Minimization
More informationAsaf Bar Zvi Adi Hayat. Semantic Segmentation
Asaf Bar Zvi Adi Hayat Semantic Segmentation Today s Topics Fully Convolutional Networks (FCN) (CVPR 2015) Conditional Random Fields as Recurrent Neural Networks (ICCV 2015) Gaussian Conditional random
More informationA Graph Cut Algorithm for Generalized Image Deconvolution
A Graph Cut Algorithm for Generalized Image Deconvolution Ashish Raj UC San Francisco San Francisco, CA 94143 Ramin Zabih Cornell University Ithaca, NY 14853 Abstract The goal of deconvolution is to recover
More informationMulticlass Classification-1
CS 446 Machine Learning Fall 2016 Oct 27, 2016 Multiclass Classification Professor: Dan Roth Scribe: C. Cheng Overview Binary to multiclass Multiclass SVM Constraint classification 1 Introduction Multiclass
More informationStructured Prediction
Structured Prediction Classification Algorithms Classify objects x X into labels y Y First there was binary: Y = {0, 1} Then multiclass: Y = {1,...,6} The next generation: Structured Labels Structured
More information