Discrete Markov Random Fields: the Inference story. Pradeep Ravikumar


1 Discrete Markov Random Fields: the Inference story. Pradeep Ravikumar

2 Graphical Models, The History. How do we model stochastic processes of the world? I want to model the world, and I like graphs...

3 History, Mid to Late Twentieth Century. Pioneering work of Conspiracy Theorists: The System, it is all connected...

4 History. Late Twentieth Century: people realize that the existing scientific literature offers a marriage between probability theory and graph theory which can be used to model the world.

5 History. Common Misconception: called graphical models after Grafici Modeles, a sculptor protege of Da Vinci. In fact, called Graphical Models because they model stochastic systems using graphs.


7 Graphical Models. [figure: undirected graph over nodes $X_1, X_2, X_3, X_4$]


9 Graphical Models. [figure: the same graph] Separating set: $(X_2, X_3)$ disconnects $X_1$ and $X_4$.

10 Graphical Models. [figure: the same graph] Separating set: $(X_2, X_3)$ disconnects $X_1$ and $X_4$. Global Markov Property: $X_1 \perp X_4 \mid (X_2, X_3)$.

11 Graphical Models. $\mathrm{MP}(G)$: the set of all Markov properties, obtained by ranging over separating sets of $G$. $P$ is represented by $G$ iff $P$ satisfies $\mathrm{MP}(G)$. [figure: graph $G(X)$ surrounded by the family of distributions $P_1(X), \dots, P_4(X)$ it represents]

12 Hammersley and Clifford Theorem. A positive $P$ over $X$ satisfies $\mathrm{MP}(G)$ iff $P$ factorizes according to the cliques $C$ of $G$: $$P(X) = \frac{1}{Z} \prod_{C \in \mathcal{C}} \psi_C(X_C).$$ A specific member of the family is specified by weights over cliques. [figure: the family $P_1(X), \dots, P_4(X)$ around $G(X)$; a specific member such as $P_3$ is picked out by its clique weights $\{\psi^{(3)}_C(X_C)\}$]

13 Exponential Family. $$p(X) = \frac{1}{Z} \prod_{C \in \mathcal{C}} \psi_C(X_C) = \exp\Big( \sum_{C \in \mathcal{C}} \log \psi_C(X_C) - \log Z \Big).$$ Exponential family: $$p(x; \theta) = \exp\Big( \sum_{\alpha} \theta_\alpha \phi_\alpha(x) - \Psi(\theta) \Big),$$ where $\{\phi_\alpha\}$ are features, $\{\theta_\alpha\}$ are parameters, and $\Psi(\theta)$ is the log partition function.
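
To make the exponential-family form concrete, here is a minimal Python sketch, using a hypothetical three-node binary chain of my own (not from the slides), that scores configurations by $\theta^\top \phi(x)$ and computes $\Psi(\theta)$ by brute-force enumeration:

```python
# Minimal sketch: a toy pairwise MRF in exponential-family form.
# The chain x0 - x1 - x2 and all parameter values are made up for illustration.
import itertools
import numpy as np

theta_node = {0: np.array([0.1, -0.1]),
              1: np.array([0.0, 0.2]),
              2: np.array([-0.3, 0.3])}
theta_edge = {(0, 1): np.array([[0.5, -0.5], [-0.5, 0.5]]),
              (1, 2): np.array([[0.5, -0.5], [-0.5, 0.5]])}

def score(x):
    """theta . phi(x): sum of the node and edge parameters touched by x."""
    s = sum(theta_node[i][xi] for i, xi in enumerate(x))
    s += sum(th[x[i], x[j]] for (i, j), th in theta_edge.items())
    return s

# Psi(theta) = log sum_x exp(theta . phi(x)), feasible only for tiny models
scores = [score(x) for x in itertools.product([0, 1], repeat=3)]
Psi = np.log(np.sum(np.exp(scores)))
print("log partition function:", Psi)
```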

14 Inference. Answering queries about the graphical model probability distribution.

15 Inference. For an undirected model $p(x; \theta) = \exp\big( \sum_{\alpha \in I} \theta_\alpha \phi_\alpha(x) - \Psi(\theta) \big)$, the key inference problems are: computing the log partition function (normalization constant) $\Psi(\theta)$; marginals $p(x_A) = \sum_{x_v,\, v \notin A} p(x)$; and most probable configurations $x^* = \arg\max_x p(x \mid x_L)$. These problems are intractable in full generality.
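
All three queries can be answered by exhaustive enumeration on a toy model; the sketch below (my own two-variable example) does exactly that, which also makes plain why enumeration is hopeless once the number of variables grows:

```python
# Brute-force versions of the three inference queries on a made-up
# two-variable binary model: Psi(theta), a node marginal, and the MAP state.
import itertools
import numpy as np

theta = {(0,): np.array([0.3, -0.3]),
         (1,): np.array([0.0, 0.1]),
         (0, 1): np.array([[0.8, -0.8], [-0.8, 0.8]])}

def score(x):
    return theta[(0,)][x[0]] + theta[(1,)][x[1]] + theta[(0, 1)][x[0], x[1]]

X = list(itertools.product([0, 1], repeat=2))
w = np.array([np.exp(score(x)) for x in X])

Psi = np.log(w.sum())                          # log partition function
p = w / w.sum()
p_x0 = [p[[i for i, x in enumerate(X) if x[0] == j]].sum() for j in (0, 1)]
x_map = X[int(np.argmax(p))]                   # most probable configuration
print(Psi, p_x0, x_map)
```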

16 Log Partition Function. $$\log Z = \log \sum_x \prod_{\alpha \in I} \psi_\alpha(x_\alpha).$$

17 Variable Elimination. $$\log Z = \log \sum_x \prod_{\alpha \in I} \psi_\alpha(x_\alpha)$$ $$\sum_x \prod_\alpha \psi_\alpha(x_\alpha) = \sum_{\{x_j : j \neq i\}} \sum_{x_i} \prod_\alpha \psi_\alpha(x_\alpha) = \sum_{\{x_j : j \neq i\}} \prod_{\alpha \in C_{\setminus i}} \psi_\alpha(x_\alpha) \sum_{x_i} \prod_{\alpha \in C_i} \psi_\alpha(x_\alpha) = \sum_{\{x_j : j \neq i\}} \prod_{\alpha \in C_{\setminus i}} \psi_\alpha(x_\alpha)\, g(x_{j \sim i}),$$ where $C_i$ denotes the factors involving $x_i$ and $C_{\setminus i}$ the rest.

18 Variable Elimination. $$Z = \sum_{\{x_j : j \neq i\}} \prod_{\alpha \in C_{\setminus i}} \psi_\alpha(x_\alpha)\, g(x_{j \sim i}).$$ Continue to eliminate the other variables $x_j$. Is this a linear time method then?

19 Variable Elimination. $$Z = \sum_{\{x_j : j \neq i\}} \prod_{\alpha \in C_{\setminus i}} \psi_\alpha(x_\alpha)\, g(x_{j \sim i}).$$ Continue to eliminate the other variables $x_j$. Is this a linear time method then? No: $g(x_{j \sim i})$ depends on the variables $j$ which share a factor with $i$.
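
Here is one elimination step in code, assuming a binary chain $x_1 - x_2 - x_3$ with two made-up pairwise potentials: summing out $x_3$ first produces a message over $x_2$, exactly the $g$ term above, and the result agrees with brute-force summation.

```python
# One variable-elimination step on a toy binary chain x1 - x2 - x3.
import itertools
import numpy as np

psi12 = np.array([[1.0, 0.5], [0.5, 2.0]])   # psi(x1, x2)
psi23 = np.array([[1.5, 1.0], [1.0, 0.5]])   # psi(x2, x3)

# brute force: Z = sum over all (x1, x2, x3)
Z_brute = sum(psi12[x1, x2] * psi23[x2, x3]
              for x1, x2, x3 in itertools.product([0, 1], repeat=3))

# eliminate x3: g(x2) = sum_{x3} psi23(x2, x3), then sum what remains
g = psi23.sum(axis=1)
Z_elim = sum(psi12[x1, x2] * g[x2]
             for x1, x2 in itertools.product([0, 1], repeat=2))

assert np.isclose(Z_brute, Z_elim)
print("Z =", Z_elim)
```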

20 Variable Elimination. $$\log Z = \log \sum_x \prod_{\alpha \in I} \psi_\alpha(x_\alpha).$$ [figure: tree with root $x_7$, edges $\psi_{57}, \psi_{67}$ down to $x_5, x_6$, and $\psi_{15}, \psi_{25}, \psi_{36}, \psi_{46}$ down to leaves $x_1, \dots, x_4$]


22 Variable Elimination. $$\log Z = \log \sum_x \prod_{\alpha \in I} \psi_\alpha(x_\alpha).$$ [figure: the same tree after eliminating the leaves, with messages $m$ passed up to $x_5$ and $x_6$]


24 Variable Elimination. $$\log Z = \log \sum_x \prod_{\alpha \in I} \psi_\alpha(x_\alpha).$$ Exponential in tree-width. [figure: the tree again; eliminating the leaves leaves the terms $\sum_{x_5} \psi_{57} \big(\sum_{x_1} \psi_{15}\big)\big(\sum_{x_2} \psi_{25}\big)$ and $\sum_{x_6} \psi_{67} \big(\sum_{x_3} \psi_{36}\big)\big(\sum_{x_4} \psi_{46}\big)$]
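
To see where the exponential cost comes from, the rough sketch below (my own illustration, not from the slides) tracks the largest factor scope an elimination order creates; a factor over $k$ discrete variables stores a table exponential in $k$, and the best achievable worst-case scope over all orders is governed by the treewidth.

```python
# Track the largest intermediate factor scope under a given elimination order.
def max_factor_size(factors, order):
    """factors: collection of variable sets; returns the largest scope
    touched while eliminating variables in the given order."""
    factors = [set(f) for f in factors]
    worst = max(len(f) for f in factors)
    for v in order:
        touching = [f for f in factors if v in f]
        factors = [f for f in factors if v not in f]
        if touching:
            scope = set().union(*touching)
            worst = max(worst, len(scope))
            merged = scope - {v}
            if merged:
                factors.append(merged)
    return worst

star = [frozenset({0, i}) for i in range(1, 5)]   # hub 0 with leaves 1..4
print(max_factor_size(star, [1, 2, 3, 4, 0]))     # leaves first: scope 2
print(max_factor_size(star, [0, 1, 2, 3, 4]))     # hub first: scope 5
```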

25 Inference. $$p(x; \theta) = \exp\big(\theta^\top \phi(x) - A(\theta)\big),$$ where $A(\theta)$ is the log partition function.

26 Inference. $$A(\theta) = \log \sum_x \exp\big(\theta^\top \phi(x)\big).$$ With a variational parameter $\lambda$, seek bounds $B(\theta, \lambda)$ and $C(\theta, \lambda)$ with $$\sup_\lambda C(\theta, \lambda) \;\le\; A(\theta) \;\le\; \inf_\lambda B(\theta, \lambda).$$ Summing over configurations becomes optimization!

27 Inference. But... (there's always a but!) Is there a principled way to obtain parametrized bounds $B(\theta, \lambda)$ and $C(\theta, \lambda)$?

28 Fenchel Duality. Let $f(x)$ be a concave function and define $$f^*(\lambda) = \min_x \{\lambda^\top x - f(x)\}.$$ At the minimizer $x_\lambda$, the slope of the tangent to $f$ is $\lambda$, and $$\text{(intercept)} = \lambda^\top x_\lambda - f(x_\lambda) = f^*(\lambda),$$ so $f^*(\lambda)$ is the intercept of the line with slope $\lambda$ tangent to $f(x)$.

29 Fenchel Duality. $$f(x) = \min_\lambda \{\lambda^\top x - f^*(\lambda)\}.$$ Thus $g(x, \lambda) = \lambda^\top x - f^*(\lambda)$ is an upper bound on $f(x)$!
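
A quick numeric check of the bound, using my own example $f(x) = -x^2$: here the conjugate has the closed form $f^*(\lambda) = \min_x \{\lambda x - f(x)\} = -\lambda^2/4$, and the tangent-line bound is tight exactly at $\lambda = -2x$.

```python
# Verify that g(x, l) = l*x - f*(l) upper-bounds the concave f(x) = -x^2.
import numpy as np

f = lambda x: -x**2
f_star = lambda l: -l**2 / 4.0          # closed-form conjugate for this f
g = lambda x, l: l * x - f_star(l)      # tangent-line upper bound

x = 1.3
for l in np.linspace(-5, 5, 11):
    assert g(x, l) >= f(x) - 1e-12
print("tight at l = -2x:", np.isclose(g(x, -2 * x), f(x)))
```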

30 Fenchel Duality. Let us apply Fenchel duality to the log partition function! $A(\theta)$ is convex, so $$A^*(\mu) = \sup_\theta \big(\theta^\top \mu - A(\theta)\big), \qquad A(\theta) = \sup_\mu \big(\theta^\top \mu - A^*(\mu)\big).$$

31 Log Partition Function. Define the marginal polytope $$M = \Big\{\mu \in \mathbb{R}^d \;:\; \exists\, p(\cdot) \text{ s.t. } \sum_x \phi(x)\, p(x) = \mu\Big\}.$$ $M$ is the convex hull of $\{\phi(x)\}$.

32 Mean parameter mapping. Consider the mapping $\Lambda: \Theta \to M$, $$\Lambda(\theta) := \mathbb{E}_\theta[\phi(x)] = \sum_x \phi(x)\, p(x; \theta).$$ The mapping associates $\theta$ to mean parameters $\mu := \Lambda(\theta) \in M$. Conversely, for $\mu \in \mathrm{Int}(M)$, $\theta = \Lambda^{-1}(\mu)$ (unique if the exponential family is minimal).

33 Partition function conjugates. $$A^*(\mu) = \sup_\theta \big(\theta^\top \mu - A(\theta)\big), \qquad A(\theta) = \sup_\mu \big(\theta^\top \mu - A^*(\mu)\big).$$ The optimal arguments are given by $$\theta_\mu = \Lambda^{-1}(\mu), \qquad \mu_\theta = \Lambda(\theta).$$

34 Partition function conjugate. Properties of the Fenchel conjugate $A^*(\mu) = \sup_\theta (\theta^\top \mu - A(\theta))$: $A^*(\mu)$ is finite only for $\mu \in M$, and $A^*(\mu)$ is the negative entropy of the graphical model distribution with mean parameters $\mu$, or equivalently with parameters $\Lambda^{-1}(\mu)$!

35 Partition Function. $$A(\theta) = \sup_{\mu \in M} \big\{\theta^\top \mu - A^*(\mu)\big\}.$$ Hardness is due to two bottlenecks: $M$, a polytope with exponentially many vertices and no compact representation; and $A^*(\mu)$, an entropy computation. Approximate either or both!

36 Pairwise Graphical Models. [figure: edge $x_1 - x_2$ with potential $\theta_{12}\, \phi_{12}(x_1, x_2)$] Overcomplete potentials: $$\mathbb{I}_j(x_s) = \begin{cases} 1 & x_s = j \\ 0 & \text{otherwise} \end{cases} \qquad \mathbb{I}_{j,k}(x_s, x_t) = \begin{cases} 1 & x_s = j \text{ and } x_t = k \\ 0 & \text{otherwise.} \end{cases}$$ $$p(x \mid \theta) = \exp\Big( \sum_{s;j} \theta_{s;j}\, \mathbb{I}_j(x_s) + \sum_{s,t;j,k} \theta_{s,t;j,k}\, \mathbb{I}_{j,k}(x_s, x_t) - \Psi(\theta) \Big).$$

37 Overcomplete Representation; Mean Parameters. $$\mu_{s;j} := \mathbb{E}_\theta[\mathbb{I}_j(x_s)] = p(x_s = j; \theta), \qquad \mu_{s,t;j,k} := \mathbb{E}_\theta[\mathbb{I}_{j,k}(x_s, x_t)] = p(x_s = j, x_t = k; \theta).$$ Mean parameters are marginals! Define the following functional forms: $$\mu_s(x_s) = \sum_j \mu_{s;j}\, \mathbb{I}_j(x_s), \qquad \mu_{st}(x_s, x_t) = \sum_{j,k} \mu_{s,t;j,k}\, \mathbb{I}_{j,k}(x_s, x_t).$$
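
The claim that mean parameters are marginals can be checked directly by enumeration. The sketch below (my own single-edge binary model with overcomplete indicator features) computes $\mu = \mathbb{E}_\theta[\phi(x)]$ and confirms that its entries are node and edge marginal probabilities:

```python
# Mean parameters of an overcomplete single-edge model are marginals.
import itertools
import numpy as np

def phi(x1, x2):
    """Overcomplete features: node indicators for x1 and x2, then edge indicators."""
    f = np.zeros(8)
    f[x1] = 1.0                      # I_j(x1)
    f[2 + x2] = 1.0                  # I_j(x2)
    f[4 + 2 * x1 + x2] = 1.0         # I_jk(x1, x2)
    return f

theta = np.array([0.2, -0.2, 0.0, 0.1, 0.7, -0.7, -0.7, 0.7])  # made-up values
configs = list(itertools.product([0, 1], repeat=2))
w = np.array([np.exp(theta @ phi(*x)) for x in configs])
p = w / w.sum()

mu = sum(pi * phi(*x) for pi, x in zip(p, configs))   # E_theta[phi(x)]
assert np.isclose(mu[0], p[0] + p[1])                 # mu_{1;0} = p(x1 = 0)
print("node marginals:", mu[:4], "edge marginals:", mu[4:])
```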

38 Outer Polytope Approximations. $$\mathrm{LOCAL}(G) := \Big\{\mu \ge 0 \;:\; \sum_{x_s} \mu_s(x_s) = 1, \;\; \sum_{x_t} \mu_{st}(x_s, x_t) = \mu_s(x_s)\Big\}.$$

39 Inner Polytope Approximations. For the given graph $G$ and a subgraph $H$, let $$E(H) = \{\theta : \theta_{st} = \theta_{st}\, \mathbb{1}_{(s,t) \in H}\}, \qquad M(G; H) = \{\mu : \mu = \mathbb{E}_\theta[\phi(x)] \text{ for some } \theta \in E(H)\}.$$ $M(G; H) \subseteq M(G)$.

40 Entropy Approximations. Tree-structured distributions: $$p(x; \mu) = \prod_s \mu_s(x_s) \prod_{(s,t) \in E} \frac{\mu_{st}(x_s, x_t)}{\mu_s(x_s)\, \mu_t(x_t)}.$$ Define $$H_s(\mu_s) := -\sum_{x_s} \mu_s(x_s) \log \mu_s(x_s), \qquad I_{st}(\mu_{st}) := \sum_{x_s, x_t} \mu_{st}(x_s, x_t) \log \frac{\mu_{st}(x_s, x_t)}{\mu_s(x_s)\, \mu_t(x_t)}.$$

41 Tree-structured Entropy. $$A^*_{\mathrm{tree}}(\mu) = -\sum_{s \in V} H_s(\mu_s) + \sum_{(s,t) \in E} I_{st}(\mu_{st}).$$ A compact representation; can be used as an approximation.
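
A small numeric check of the decomposition, with made-up numbers: on a two-node tree the entropy $-A^*_{\mathrm{tree}}(\mu) = H_1 + H_2 - I_{12}$ matches the exact entropy of the joint, provided the marginals are consistent.

```python
# Tree entropy decomposition checked on a two-node tree.
import numpy as np

mu12 = np.array([[0.3, 0.1], [0.2, 0.4]])    # joint over (x1, x2)
mu1, mu2 = mu12.sum(axis=1), mu12.sum(axis=0)

H = lambda p: -np.sum(p * np.log(p))
I12 = np.sum(mu12 * np.log(mu12 / np.outer(mu1, mu2)))

print(np.isclose(H(mu1) + H(mu2) - I12, H(mu12)))    # True: exact on a tree
```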

42 Approximate Inference Techniques. Belief Propagation: polytope $\mathrm{LOCAL}(G)$, tree-structured entropy! Structured Mean Field: polytope $M(G; H)$, $H$-structured entropy. Mean Field: $H = H_0$, the completely disconnected graph (all variables independent).

43 Divergence Measure View. Given $p(x; \theta) \propto \exp\big(\theta^\top \phi(x)\big)$, we would like a more manageable surrogate distribution $q \in Q$: $$\min_{q \in Q} D\big(q(x) \,\|\, p(x; \theta)\big).$$

44 Divergence Measure View. $$\min_{q \in Q} D\big(q(x) \,\|\, p(x; \theta)\big)$$ $D(q \,\|\, p) = \mathrm{KL}(q \,\|\, p)$: Structured Mean Field, Belief Propagation. $D(p \,\|\, q) = \mathrm{KL}(p \,\|\, q)$: Expectation Propagation (look out for the talk on Continuous Markov Random Fields!). Typically the KL measure is approximated with energy approximations (Bethe free energy, Kikuchi free energy). (Ravikumar, Lafferty 05; Preconditioner Approximations): optimizing for a minimax criterion reduces the task to a generalized linear systems problem!
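
A tiny numeric sketch of the two divergence directions (my own tables): a fully factorized surrogate $q$ against a small target $p$, with $\mathrm{KL}(q \,\|\, p)$ being the mean-field direction and $\mathrm{KL}(p \,\|\, q)$ the expectation-propagation direction.

```python
# The two KL directions for a factorized surrogate on a 2x2 target.
import numpy as np

p = np.array([[0.3, 0.1], [0.2, 0.4]])        # target p(x1, x2)
q = np.outer([0.5, 0.5], [0.4, 0.6])          # factorized surrogate q

kl_qp = np.sum(q * np.log(q / p))             # mean field / BP direction
kl_pq = np.sum(p * np.log(p / q))             # EP direction
print("KL(q||p) =", kl_qp, " KL(p||q) =", kl_pq)
```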

46 Bounds on event probabilities. Doctor: So what is the lower bound on the diagnosis probability? Graphical Model: I don't know, but here is an approximate value. Doctor: =( Can we get upper and lower bounds on $p_\theta(X \in C)$ (instead of just approximate values)?

47 Bounds on event probabilities Classical Chernoff Bounds give useful estimates for i.i.d. random variables. Can they be extended to graphical models? [Ravikumar, Lafferty 04; Variational Chernoff Bounds] 47

48 Classical Chernoff Bounds. $$p_\theta(X \ge u) \le \frac{\mathbb{E}_\theta(X)}{u} \quad \text{(Markov inequality)}$$ $$p_\theta(X \ge u) = p_\theta\big(e^{\lambda X} \ge e^{\lambda u}\big) \le \mathbb{E}_\theta\big[e^{\lambda(X - u)}\big].$$ From this it follows that $$\log p_\theta(X \ge u) \le \inf_{\lambda \ge 0} \big( -\lambda u + \log \mathbb{E}_\theta[e^{\lambda X}] \big).$$ Bounds on the cumulant function $\mathbb{E}_\theta[e^{\lambda X}]$ yield the standard Chernoff bounds.

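
As a sanity check, the bound can be evaluated numerically. My own example below takes $X \sim \mathrm{Binomial}(20, 0.5)$ and $u = 15$, where the cumulant generating function has a closed form, and compares the optimized Chernoff bound with the exact tail:

```python
# Classical Chernoff bound for a Binomial tail, optimized over lambda >= 0.
import numpy as np
from math import comb

n, p, u = 20, 0.5, 15
cgf = lambda l: n * np.log(1 - p + p * np.exp(l))   # log E[e^{lX}]

ls = np.linspace(0, 5, 2001)
log_bound = np.min(-ls * u + cgf(ls))
exact = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(u, n + 1))
print("Chernoff bound:", np.exp(log_bound), " exact tail:", exact)
```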

51 Generalized Chernoff Bounds. Event: $X \in C$. $$\mathbb{I}_C(X) = \begin{cases} 1 & X \in C \\ 0 & \text{otherwise} \end{cases}$$ If $\mathbb{I}_C(X) \le f_\lambda(X)$, then $p_\theta(X \in C) \le \mathbb{E}_\theta[f_\lambda]$.


53 Graphical Model Chernoff Bounds. With $f_\lambda = \exp(\langle \lambda, x \rangle + u)$, we get $$\log p_\theta(X \in C) \le \inf_\lambda \big( S_C(-\lambda) + \log \mathbb{E}_\theta[e^{\langle \lambda, X \rangle}] \big),$$ where $S_C(\lambda) = \sup_{x \in C} \langle x, \lambda \rangle$ is the support function of the set $C$. For an exponential model with sufficient statistic $\phi(x)$, the above becomes $$\log p_\theta(X \in C) \le \inf_\lambda \big( S_{C,\phi}(-\lambda) + \Phi(\theta + \lambda) - \Phi(\theta) \big),$$ where $\Phi$ is the log-partition function and $S_{C,\phi}(\lambda) = \sup_{x \in C} \langle \phi(x), \lambda \rangle$.


55 MAP Estimation. [figure: a four-node graph over $x_1, \dots, x_4$]

56 MAP Estimation. [figure: one joint configuration, with PROB = 0.01]

57 MAP Estimation. [figure: a different joint configuration, with PROB = 0.2]

58 MAP Estimation. [figure: the same configurations] Most Probable Configuration?

59 Polytope View. $$\max_x \theta^\top \phi(x) = \sup_{\mu \in M} \theta^\top \mu.$$

60 Outer Polytope Relaxations. $$\mathrm{LOCAL}(G) := \Big\{\mu \ge 0 \;:\; \sum_{x_s} \mu_s(x_s) = 1, \;\; \sum_{x_t} \mu_{st}(x_s, x_t) = \mu_s(x_s)\Big\}.$$

61 Outer Polytope Relaxations. $$\sup_{\mu \in M(G)} \theta^\top \mu \;\le\; \sup_{\mu \in \mathrm{LOCAL}(G)} \theta^\top \mu.$$ A Linear Program! (Chekuri, Khanna, Naor, Zosin 05; LP Formulation for Metric Labeling) (Wainwright, Jaakkola, Willsky 05; Tree-reweighted Max-Product, Dual of LP)
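
Here is the relaxation as runnable code, in a sketch of my own encoding for a single binary edge, solved with scipy's linprog. Since a single edge is a tree, the relaxation is tight and the LP value matches the best configuration score:

```python
# LP relaxation over LOCAL(G) for one binary edge, solved with scipy.
import numpy as np
from scipy.optimize import linprog

# variable order: [mu1(0), mu1(1), mu2(0), mu2(1),
#                  mu12(00), mu12(01), mu12(10), mu12(11)]
theta = np.array([0.2, -0.2, 0.1, -0.1, 1.0, -1.0, -1.0, 1.0])  # made-up

A_eq = np.array([
    [1, 1, 0, 0, 0, 0, 0, 0],    # sum_{x1} mu1(x1) = 1
    [0, 0, 1, 1, 0, 0, 0, 0],    # sum_{x2} mu2(x2) = 1
    [-1, 0, 0, 0, 1, 1, 0, 0],   # sum_{x2} mu12(0, x2) = mu1(0)
    [0, -1, 0, 0, 0, 0, 1, 1],   # sum_{x2} mu12(1, x2) = mu1(1)
    [0, 0, -1, 0, 1, 0, 1, 0],   # sum_{x1} mu12(x1, 0) = mu2(0)
    [0, 0, 0, -1, 0, 1, 0, 1],   # sum_{x1} mu12(x1, 1) = mu2(1)
])
b_eq = np.array([1, 1, 0, 0, 0, 0])

res = linprog(-theta, A_eq=A_eq, b_eq=b_eq, bounds=(0, 1))  # maximize theta.mu
print("LP value:", -res.fun, " mu:", np.round(res.x, 3))    # 1.3 at (x1,x2)=(0,0)
```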

62 Inner Polytope Approximations. If $M_I \subseteq M$ is any subset of the marginal polytope that includes all of the vertices, $$\max_x \langle \theta, \phi(x) \rangle = \sup_{\mu \in M_I} \langle \theta, \mu \rangle.$$

63 Inner Polytope Approximations. For the given graph $G$ and a subgraph $H$, let $$E(H) = \{\theta : \theta_{st} = \theta_{st}\, \mathbb{1}_{(s,t) \in H}\}, \qquad M(G; H) = \{\mu : \mu = \mathbb{E}_\theta[\phi(x)] \text{ for some } \theta \in E(H)\}.$$ $M(G; H) \subseteq M(G)$.

64 Inner Polytope Approximations, Mean Field parameters. $$M(G; H_0) = \{\mu \;:\; 0 \le \mu(s; j) \le 1, \;\; \mu(s, j; t, k) = \mu(s; j)\, \mu(t; k)\}.$$ Mean Field Relaxation: $$\sup_{\mu \in M(G; H_0)} \langle \theta, \mu \rangle = \sup_{\mu \in M(G; H_0)} \sum_{s;j} \theta_{s;j}\, \mu(s; j) + \sum_{st;jk} \theta_{s,j;t,k}\, \mu(s, j; t, k) = \sup_{\mu \in M(G; H_0)} \sum_{s;j} \theta_{s;j}\, \mu(s; j) + \sum_{st;jk} \theta_{s,j;t,k}\, \mu(s; j)\, \mu(t; k).$$ A Quadratic Program! (Ravikumar, Lafferty 06; Quadratic Relaxations for Metric Labeling and MAP in MRFs)
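
A sketch of one way the quadratic objective can be attacked, on a made-up three-node instance of my own: with all other $\mu(t;\cdot)$ held fixed, the objective is linear in $\mu(s;\cdot)$, so each coordinate update is attained at a vertex of the simplex.

```python
# Coordinate ascent on the mean-field QP objective for a toy pairwise model.
import numpy as np

theta_node = np.array([[0.2, -0.2], [0.1, -0.1], [-0.5, 0.5]])  # 3 nodes, 2 labels
edges = {(0, 1): np.array([[1.0, -1.0], [-1.0, 1.0]]),
         (1, 2): np.array([[1.0, -1.0], [-1.0, 1.0]])}

mu = np.full((3, 2), 0.5)                    # factorized marginals
for _ in range(10):
    for s in range(3):
        lin = theta_node[s].copy()           # linear coefficients for mu(s; .)
        for (a, b), th in edges.items():
            if a == s:
                lin += th @ mu[b]
            elif b == s:
                lin += th.T @ mu[a]
        mu[s] = np.eye(2)[np.argmax(lin)]    # optimum sits at a vertex
print("mean-field labels:", mu.argmax(axis=1))
```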

65 References.
Martin J. Wainwright and Michael I. Jordan (2003). Graphical models, exponential families, and variational inference.
M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul (1999). An introduction to variational methods for graphical models.
Pradeep Ravikumar and John Lafferty (2005). Preconditioner approximations for probabilistic graphical models.
Pradeep Ravikumar and John Lafferty (2004). Variational Chernoff bounds for graphical models.
C. Chekuri, S. Khanna, J. Naor, and L. Zosin (2005). A linear programming formulation and approximation algorithms for the metric labeling problem.
M. J. Wainwright, T. S. Jaakkola, and A. S. Willsky (2005). MAP estimation via agreement on (hyper)trees: message-passing and linear programming approaches.
Pradeep Ravikumar and John Lafferty (2006). Quadratic programming relaxations for metric labeling and Markov random field MAP estimation.
