Lecture 8: PGM Inference
- Madison Jenkins
15 September 2014, Intro. to Stats. Machine Learning, COMP SCI 4401/7401
Table of Contents

1. Variable elimination: max-product, sum-product
2. LP relaxations, QP relaxations
3. Sampling
Marginal and MAP

[Figure: a tree-structured model over $X_1, X_2, X_3, X_4$.]

Marginal inference: $P(x_i) = \sum_{x_j : j \neq i} P(x_1, x_2, x_3, x_4)$

MAP inference: $(x_1^*, x_2^*, x_3^*, x_4^*) = \operatorname{argmax}_{x_1, x_2, x_3, x_4} P(x_1, x_2, x_3, x_4)$

In general, $x_i^* \neq \operatorname{argmax}_{x_i} P(x_i)$.
Marginals

When do we need marginals?
- To compute the normalisation constant $Z = \sum_{x_i} q(x_i) = \sum_{x_j} q(x_j)$ for any $i, j = 1, \dots, n$, where $q(x_i)$ is an unnormalised marginal.
- The log loss in CRFs is $-\log P(x_1, \dots, x_n) = \log Z - \dots$
- Expectations like $E_{P(x_i)}[\phi(x_i)]$ and $E_{P(x_i, x_j)}[\phi(x_i, x_j)]$, where $\psi(x_i) = \langle \phi(x_i), w \rangle$ and $\psi(x_i, x_j) = \langle \phi(x_i, x_j), w \rangle$.
- The gradient of the CRF risk contains the above expectations.
MAP

When do we need MAP?
- To find the most likely configuration of $(x_i)_{i \in V}$ at test time.
- To find the most violated constraint generated by $(x_i)_{i \in V}$ during training (i.e. learning), e.g. by the cutting-plane method (used in SVM-Struct) or by the Bundle Method for Risk Minimisation (Teo et al., JMLR 2010).
Variable elimination

$\max_{x_1, x_2, x_3, x_4} P(x_1, x_2, x_3, x_4)$
$\quad = \max_{x_1, x_2, x_3, x_4} \psi(x_1, x_2)\,\psi(x_2, x_3)\,\psi(x_2, x_4)\,\psi(x_1)\,\psi(x_2)\,\psi(x_3)\,\psi(x_4)$
$\quad = \max_{x_1, x_2} \psi(x_1)\,\psi(x_2)\,\psi(x_1, x_2) \Big( \max_{x_3} \psi(x_2, x_3)\,\psi(x_3) \Big) \Big( \max_{x_4} \psi(x_2, x_4)\,\psi(x_4) \Big)$
$\quad = \max_{x_1} \psi(x_1) \Big( \max_{x_2} \psi(x_2)\,\psi(x_1, x_2)\, m_{3 \to 2}(x_2)\, m_{4 \to 2}(x_2) \Big)$
$\quad = \max_{x_1} \psi(x_1)\, m_{2 \to 1}(x_1)$

$x_1^* = \operatorname{argmax}_{x_1} \psi(x_1)\, m_{2 \to 1}(x_1)$
Max-product

Variable elimination for MAP (max-product):
$x_i^* = \operatorname{argmax}_{x_i} \psi(x_i) \prod_{j \in \mathrm{Ne}(i)} m_{j \to i}(x_i)$
$m_{j \to i}(x_i) = \max_{x_j} \psi(x_j)\,\psi(x_i, x_j) \prod_{k \in \mathrm{Ne}(j) \setminus \{i\}} m_{k \to j}(x_j)$

$\mathrm{Ne}(i)$: the neighbouring nodes of $i$ (i.e. nodes connected to $i$). $\mathrm{Ne}(j) \setminus \{i\} = \emptyset$ if $j$ has only one edge, e.g. $x_1, x_3, x_4$. For such a node $j$, $m_{j \to i}(x_i) = \max_{x_j} \psi(x_j)\,\psi(x_i, x_j)$. Easier computation!
Max-product

Order matters: the message $m_{2 \to 3}(x_3)$ requires $m_{1 \to 2}(x_2)$ and $m_{4 \to 2}(x_2)$.

[Figure: messages $m_{1 \to 2}, m_{3 \to 2}, m_{4 \to 2}$ passed into $X_2$, then $m_{2 \to 1}, m_{2 \to 3}, m_{2 \to 4}$ passed out.]

Alternatively, pass messages from the leaves to the root, then from the root back to the leaves.
Sum-product

Variable elimination for marginals (sum-product):
$P(x_i) = \frac{1}{Z}\, \psi(x_i) \prod_{j \in \mathrm{Ne}(i)} m_{j \to i}(x_i)$
$m_{j \to i}(x_i) = \sum_{x_j} \psi(x_j)\,\psi(x_i, x_j) \prod_{k \in \mathrm{Ne}(j) \setminus \{i\}} m_{k \to j}(x_j)$
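The sum-product recursion above can be checked on the four-node tree from the worked example. The node and edge potentials below are made up for illustration; the message-passed marginal of $X_2$ is compared against brute-force summation of the joint. A minimal Python sketch:

```python
import itertools

# Hypothetical potentials for the 4-node tree from the slides:
# edges (1,2), (2,3), (2,4); each variable is binary {0, 1}.
psi_node = {1: [1.0, 2.0], 2: [0.5, 1.5], 3: [2.0, 1.0], 4: [1.0, 1.0]}
psi_edge = {(1, 2): [[1.0, 0.5], [0.5, 2.0]],
            (2, 3): [[1.5, 1.0], [1.0, 1.5]],
            (2, 4): [[2.0, 1.0], [1.0, 2.0]]}

def edge_pot(i, j, xi, xj):
    """Look up psi(x_i, x_j) regardless of edge orientation."""
    if (i, j) in psi_edge:
        return psi_edge[(i, j)][xi][xj]
    return psi_edge[(j, i)][xj][xi]

def leaf_message(j, i):
    """m_{j->i}(x_i) = max over nothing here: j is a leaf, so
    Ne(j)\\{i} is empty and the message is sum_{x_j} psi(x_j) psi(x_i, x_j)."""
    return [sum(psi_node[j][xj] * edge_pot(i, j, xi, xj) for xj in (0, 1))
            for xi in (0, 1)]

# Leaf-to-root messages into node 2 (nodes 1, 3, 4 are leaves).
m = {(j, 2): leaf_message(j, 2) for j in (1, 3, 4)}

# Unnormalised marginal of x2, then normalise by Z.
q2 = [psi_node[2][x2] * m[(1, 2)][x2] * m[(3, 2)][x2] * m[(4, 2)][x2]
     for x2 in (0, 1)]
Z = sum(q2)
p2 = [q / Z for q in q2]

# Brute-force check against direct summation of the joint.
brute = [0.0, 0.0]
for x1, x2, x3, x4 in itertools.product((0, 1), repeat=4):
    joint = (psi_node[1][x1] * psi_node[2][x2] * psi_node[3][x3] *
             psi_node[4][x4] * edge_pot(1, 2, x1, x2) *
             edge_pot(2, 3, x2, x3) * edge_pot(2, 4, x2, x4))
    brute[x2] += joint
brute = [b / sum(brute) for b in brute]
print(p2, brute)
```

Because the graph is a tree, the two computations agree exactly (up to floating point); on a loopy graph this equivalence would break down, which is where Loopy BP and the Junction Tree Algorithm come in.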
Extension

- To avoid over/underflow, one often operates in log space.
- Max/sum-product is also known as Belief Propagation (BP).
- In graphs with loops, running BP for several iterations is known as Loopy BP (in general, neither convergence nor optimality is guaranteed).
- Extends to the Junction Tree Algorithm (exact, but expensive) and cluster-based BP.
LP Relaxations

Assume a pairwise MRF with graph $G(V, E)$:
$P(X \mid Y) = \frac{1}{Z} \prod_{(i,j) \in E} \psi_{i,j}(x_i, x_j) \prod_{i \in V} \psi_i(x_i) = \frac{1}{Z} \exp\Big( \sum_{(i,j) \in E} \theta_{i,j}(x_i, x_j) + \sum_{i \in V} \theta_i(x_i) \Big)$

MAP:
$X^* = \operatorname{argmax}_X \prod_{(i,j) \in E} \psi_{i,j}(x_i, x_j) \prod_{i \in V} \psi_i(x_i) = \operatorname{argmax}_X \sum_{(i,j) \in E} \theta_{i,j}(x_i, x_j) + \sum_{i \in V} \theta_i(x_i)$
LP Relaxations

$\operatorname{argmax}_X \sum_{(i,j) \in E} \theta_{i,j}(x_i, x_j) + \sum_{i \in V} \theta_i(x_i)$ equals the following Integer Program:
$\operatorname{argmax}_{\{q\}} \sum_{(i,j) \in E} \sum_{x_i, x_j} q_{i,j}(x_i, x_j)\,\theta_{i,j}(x_i, x_j) + \sum_{i \in V} \sum_{x_i} q_i(x_i)\,\theta_i(x_i)$
s.t. $q_{i,j}(x_i, x_j) \in \{0, 1\}$, $\sum_{x_i, x_j} q_{i,j}(x_i, x_j) = 1$, $\sum_{x_i} q_{i,j}(x_i, x_j) = q_j(x_j)$.

Relax to a Linear Program (same objective):
s.t. $q_{i,j}(x_i, x_j) \in [0, 1]$, $\sum_{x_i, x_j} q_{i,j}(x_i, x_j) = 1$, $\sum_{x_i} q_{i,j}(x_i, x_j) = q_j(x_j)$.
QP Relaxations

Imposing $q_{i,j}(x_i, x_j) = q_i(x_i)\, q_j(x_j)$ yields the QP:
$\operatorname{argmax}_{\{q\}} \sum_{(i,j) \in E} \sum_{x_i, x_j} q_i(x_i)\, q_j(x_j)\,\theta_{i,j}(x_i, x_j) + \sum_{i \in V} \sum_{x_i} q_i(x_i)\,\theta_i(x_i)$
s.t. $q_i(x_i) \in [0, 1]$, $\sum_{x_i} q_i(x_i) = 1$.

Rewrite the objective as $y^\top \Theta\, y + y^\top \theta$, where $y = [q_i(x_i)]_{i \in V,\, x_i \in \mathrm{Val}(X)} \in \mathbb{R}^N$, $\theta = [\theta_i(x_i)]_{i, x_i} \in \mathbb{R}^N$, and $\Theta = [\theta_{i,j}(x_i, x_j)]_{i, x_i, j, x_j} \in \mathbb{R}^{N \times N}$.
QP Relaxations

Problem: $y^\top \Theta\, y + y^\top \theta$ may not be concave, i.e. $\Theta$ may not be negative semidefinite (NSD).

Solution: find a $d \in \mathbb{R}^N$ such that $(\Theta - \operatorname{diag}(d))$ is NSD. Given such a $d$, we have
$y^\top \Theta\, y + y^\top \theta = y^\top (\Theta - \operatorname{diag}(d))\, y + (\theta + d)^\top y - g(d, y)$,
where the gap $g(d, y) = d^\top y - y^\top \operatorname{diag}(d)\, y = \sum_{i=1}^N d_i (y_i - y_i^2)$.
Note that for $y_i \in \{0, 1\}$, $y_i - y_i^2 = 0$, thus $g(d, y) = 0$; in general, for $y_i \in [0, 1]$ and $d \geq 0$, $y_i - y_i^2 \geq 0$, thus $g(d, y) \geq 0$. So we know
$y^\top \Theta\, y + y^\top \theta \leq \underbrace{y^\top (\Theta - \operatorname{diag}(d))\, y + (\theta + d)^\top y}_{\text{concave}}$
and solve $\operatorname{argmax}_y\; y^\top (\Theta - \operatorname{diag}(d))\, y + (\theta + d)^\top y$.
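One simple (if loose) way to find such a $d$, not necessarily the one intended in the lecture, is via the Gershgorin circle theorem: taking $d_i = \max\big(0,\ \Theta_{ii} + \sum_{j \neq i} |\Theta_{ij}|\big)$ pushes every Gershgorin disc of $\Theta - \operatorname{diag}(d)$ into the nonpositive half-line, so the shifted matrix is NSD, while $d \geq 0$ keeps the gap nonnegative. The matrix and vectors below are made up for illustration:

```python
# Hypothetical symmetric Theta that is NOT negative semidefinite.
Theta = [[1.0, 2.0, -1.0],
         [2.0, 0.5, 0.0],
         [-1.0, 0.0, 3.0]]
N = len(Theta)

# Gershgorin shift: with d_i >= Theta_ii + sum_{j != i} |Theta_ij|,
# every Gershgorin disc of (Theta - diag(d)) has centre + radius <= 0,
# so all eigenvalues are <= 0 (NSD). Clamp at 0 so that d >= 0.
d = [max(0.0, Theta[i][i] + sum(abs(Theta[i][j]) for j in range(N) if j != i))
     for i in range(N)]

# Check the disc condition for every row.
shifted_diag = [Theta[i][i] - d[i] for i in range(N)]
radius = [sum(abs(Theta[i][j]) for j in range(N) if j != i) for i in range(N)]
assert all(shifted_diag[i] + radius[i] <= 0 for i in range(N))

# Sanity check of the decomposition: for a binary y the gap vanishes, so
# y^T Theta y + theta^T y == y^T (Theta - diag(d)) y + (theta + d)^T y.
theta = [0.3, -0.2, 0.1]          # made-up linear term
y = [1.0, 0.0, 1.0]               # a binary assignment

def quad(A, v):
    """Quadratic form v^T A v."""
    return sum(v[i] * A[i][j] * v[j] for i in range(N) for j in range(N))

lhs = quad(Theta, y) + sum(theta[i] * y[i] for i in range(N))
A = [[Theta[i][j] - (d[i] if i == j else 0.0) for j in range(N)]
     for i in range(N)]
rhs = quad(A, y) + sum((theta[i] + d[i]) * y[i] for i in range(N))
print(lhs, rhs)
```

The Gershgorin choice is conservative: in practice a smaller $d$ (e.g. from the eigenvalues of $\Theta$) gives a tighter concave upper bound.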
Break

Take a break...
Understanding samples

In fact, there is no way to check whether a sample is from a given distribution or not: two totally different distributions can generate the same sample. For example, Uniform[0, 1] and the Gaussian N(0, 1) can both generate a sample with value 0. Looking at a sample with value 0 alone, how do you know its distribution for sure? What we really check (and know for sure) is the way the samples were generated. When we say a procedure generates a sample from a distribution P, what we really mean is that, if we keep sampling this way (by the procedure), the normalised histogram $H_n$ of $n$ samples converges to the distribution P. That is, $H_n \to P$ as $n \to \infty$. If we don't know how the samples were generated, we never know their distribution for sure; we can only guess (e.g. using statistical tests) based on a number of available samples.
Overview

- Monte Carlo
- Importance sampling
- Acceptance-rejection sampling
Monte Carlo

Monte Carlo methods are a class of computational algorithms that rely on repeated random sampling to compute their results.

repeat
  draw sample(s)
  compute the result according to the samples
until sampled enough (or the result is stable)
Monte Carlo

To estimate $\pi$ (the area of a circle with radius $r$ is $S_c = \pi r^2$).

Idea: draw a circle ($r = 1$) and a rectangle ($2r \times 2r$) enclosing the circle. The area of the rectangle is $S_{rec} = (2r)^2$. If we can estimate the area of the circle, we can estimate $\pi = S_c / r^2$.

Draw a sample point uniformly from the rectangle. The chance of it falling within the circle is $S_c / S_{rec}$. So if we throw enough points, $N_{within} / N_{total} \approx S_c / S_{rec}$, thus $S_c \approx S_{rec}\, N_{within} / N_{total}$. Therefore,
$\pi \approx \frac{S_{rec}\, N_{within} / N_{total}}{r^2} = 4\, N_{within} / N_{total}$.

See a Matlab demo.
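The dart-throwing estimate of $\pi$ can be sketched in a few lines (a Python stand-in for the Matlab demo; the sample size is arbitrary):

```python
import random

random.seed(0)

def estimate_pi(n_total):
    """Throw n_total uniform points into the square [-1, 1]^2 and count
    how many land inside the unit circle: pi ~ 4 * N_within / N_total."""
    n_within = 0
    for _ in range(n_total):
        x, y = random.uniform(-1, 1), random.uniform(-1, 1)
        if x * x + y * y <= 1.0:
            n_within += 1
    return 4.0 * n_within / n_total

print(estimate_pi(100_000))
```

The error of the estimate shrinks like $1/\sqrt{N_{total}}$, which matches the slow convergence visible in the demo plot.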
Monte Carlo

[Figure: estimated $\pi$ converging to the true $\pi$ as the number of samples grows (up to $4 \times 10^4$).]
Monte Carlo

To estimate an expectation: generate samples $x_i \sim q(X)$, $i = 1, \dots, N$. Then
$E_{X \sim q(X)}[f(X)] \approx \hat{E}_{X \sim q(X)}[f(X)] = \frac{1}{N} \sum_{i=1}^N f(x_i)$.
Importance sampling

To compute $E_{X \sim p(X)}[f(X)]$. Assume $p(x)$ (the target distribution) is hard to sample from directly, and $q(x)$ (the proposal distribution) is easy to sample from, with $q(x) > 0$ whenever $p(x) > 0$.
$E_{X \sim p(X)}[f(X)] = \int_x p(x) f(x)\, dx = \int_x q(x) \frac{p(x)}{q(x)} f(x)\, dx = E_{X \sim q(X)}\Big[\frac{p(X)}{q(X)} f(X)\Big]$

$\hat{E}_{X \sim p(X)}[f(X)] = \hat{E}_{X \sim q(X)}\Big[\frac{p(X)}{q(X)} f(X)\Big]$,
where $\hat{E}_{X \sim q(X)}[f(X)] = \frac{1}{N} \sum_{i=1}^N f(x_i)$, $x_i \sim q(X)$, $i = 1, \dots, N$.
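A small illustration of the identity above: treat the standard normal as the "hard" target $p$ (only its density is used) and $\mathrm{Uniform}[-5, 5]$ as the proposal $q$ (density $1/10$, which covers essentially all of $p$'s mass), then estimate $E_p[X^2] = 1$. The choice of target and proposal here is made up purely for demonstration:

```python
import math, random

random.seed(0)

def p_density(x):
    """Target: standard normal density (pretend it is hard to sample from)."""
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

def importance_estimate(f, n):
    """E_{X~p}[f(X)] ~ (1/N) sum_i (p(x_i)/q(x_i)) f(x_i), with x_i ~ q.
    Proposal q = Uniform[-5, 5], whose density is 1/10 on that interval."""
    total = 0.0
    for _ in range(n):
        x = random.uniform(-5.0, 5.0)
        w = p_density(x) / 0.1        # importance weight p(x)/q(x)
        total += w * f(x)
    return total / n

# E[X^2] under N(0, 1) is exactly 1.
print(importance_estimate(lambda x: x * x, 200_000))
```

The variance of this estimator depends on how well $q$ matches $p \cdot f$; a badly matched proposal gives huge weights and a noisy estimate, which is why the support condition $q > 0$ wherever $p > 0$ is only the minimal requirement.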
Acceptance-rejection sampling

Target: to sample $X$ from $p(x)$. Given: $q(x)$, easy to sample from. Find a constant $M$ such that $M q(x) \geq p(x)$ for all $x$.

repeat
  step 1: sample $Y \sim q(y)$
  step 2: sample $U \sim \mathrm{Uniform}[0, 1]$
  if $U \leq \frac{p(Y)}{M q(Y)}$ then accept: $X = Y$; else reject and go to step 1.
until sampled enough
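A sketch of the procedure for a toy target $p(x) = 2x$ on $[0, 1]$ with proposal $q = \mathrm{Uniform}[0, 1]$, where the smallest valid constant is $M = 2$ (these densities are made up so the result is easy to check, $E_p[X] = 2/3$):

```python
import random

random.seed(0)

def p(x):
    """Target density on [0, 1]: p(x) = 2x (pretend it is hard to sample)."""
    return 2.0 * x

M = 2.0  # smallest M with M * q(x) >= p(x) for q = Uniform[0, 1]

def ar_sample():
    """Acceptance-rejection: propose Y ~ q, accept with prob p(Y)/(M q(Y))."""
    while True:
        y = random.uniform(0.0, 1.0)   # step 1: Y ~ q
        u = random.uniform(0.0, 1.0)   # step 2: U ~ Uniform[0, 1]
        if u <= p(y) / (M * 1.0):      # q(y) = 1 on [0, 1]
            return y                   # accept; otherwise reject and retry

samples = [ar_sample() for _ in range(50_000)]
print(sum(samples) / len(samples))     # should be close to E_p[X] = 2/3
```

Since $\Pr(\text{accept}) = 1/M = 1/2$ here, roughly half the proposals are rejected, which previews the point of the next slides: the larger $M$ is, the more wasted proposals.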
Acceptance-rejection sampling

Proof: $\Pr(\text{accept} \mid X = x) = \frac{p(x)}{M q(x)}$ and $\Pr(X = x) = q(x)$.

$\Pr(\text{accept}) = \int_x \Pr(\text{accept} \mid X = x) \Pr(X = x)\, dx = \int_x \frac{p(x)}{M q(x)}\, q(x)\, dx = \frac{1}{M}$ (thus we don't want $M$ big).

$\Pr(X \mid \text{accept}) = \frac{\Pr(\text{accept} \mid X)\, \Pr(X)}{\Pr(\text{accept})} = \frac{p(x)}{M q(x)}\, q(x)\, M = p(x)$.
Understanding AR sampling (1)

I guess the most confusing part is why $M$ comes in, so let's look at the case without $M$ first. Denote the histogram formed by $n$ samples from $q(x)$ as $H_q^n$, the histogram formed by $n$ samples from $p(x)$ as $H_p^n$, and the histogram formed by $n$ accepted samples from the AR sampling procedure as $H^n$. For a sample $x \sim q(x)$, if $p(x) < q(x)$, then accepting every $x$ and sampling this way yields the histogram $H_q^n$. But what you really want is a procedure whose resulting histogram $H$ becomes $H_p^n$. Rejecting some portion of the $x$'s can give $H$ the same shape as $H_p$ at point $x$. In other words, $H_q$ has more counts at point $x$ than $H_p$, so we remove some counts to make $H(x) = H_p(x)$. (Take a moment to think this through.)
Understanding AR sampling (2)

What if, for a sample $x \sim q(x)$, $p(x) > q(x)$? The histogram $H_q^n$ already has fewer counts than $H_p^n$ at $x$. What do we do? Well, we can sample $Mn$ points from $q(x)$ to build $H_q^{Mn}$ first. Now $H_q^{Mn}$ has more counts than $H_p^n$ at $x$ (because we choose an $M$ such that $p(x) < M q(x)$ for all $x$; if not, choose a larger $M$). Visually, $H_q^{Mn}$ encloses $H_p^n$. At point $x$, we only want to keep $H_p^n(x)$ many samples out of the $H_q^{Mn}(x)$ many in total. This is how uniform sampling and $M$ come in: we sample $u \sim \mathrm{Uniform}[0, M q(x)]$ and accept $x$ when $u < p(x)$ (equivalent to sampling $u \sim \mathrm{Uniform}[0, 1]$ and accepting $x$ when $u < p(x) / (M q(x))$). As a result, after $Mn$ samples, we get an $H$ close to $H_p^n$. Moreover, $\lim_{n \to \infty} H^n = \lim_{n \to \infty} H_p^n = p$.
Understanding AR sampling (3)

Here we can choose any $M$ such that $p(x) < M q(x)$ for all $x$. The bigger $M$ is, the more samples ($Mn$ samples) you need to approximate $H_p^n$. That's why, in practice, people want to use the smallest such $M$ to reduce the number of rejected samples.
Sampling in PGM inference

Overview:
Forward sampling

Given an ordering of the random variables $\{X_i\}_{i=1}^n$ in which parents come before children (so parents are known when generating children):

for $i = 1$ to $n$ do
  $u_i \leftarrow$ the sampled values of $\mathrm{Pa}(X_i)$
  sample $x_i$ from $P(X_i \mid u_i)$
end for

[Figure: the student network with nodes Difficulty (D), Intelligence (I), Grade (G), SAT (S), Letter (L), Job (J), Happy (H) and CPDs $P(D)$, $P(I)$, $P(G \mid D, I)$, $P(S \mid I)$, $P(L \mid G)$, $P(J \mid L, S)$, $P(H \mid G, J)$.]
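A sketch of forward (ancestral) sampling on a cut-down version of the student network, with made-up binary CPDs; the sampled marginal of $G$ is checked against the exact sum over the joint:

```python
import random

random.seed(0)

# Hypothetical CPDs for a simplified student network (all variables binary):
# D ~ P(D), I ~ P(I), S depends on I, G depends on D and I.
P_D1 = 0.4
P_I1 = 0.3
P_S1_given_I = {0: 0.2, 1: 0.8}
P_G1_given_DI = {(0, 0): 0.3, (0, 1): 0.9, (1, 0): 0.1, (1, 1): 0.5}

def forward_sample():
    """Ancestral sampling: visit nodes in a topological order, so each
    node's parents are already sampled when the node is reached."""
    d = int(random.random() < P_D1)
    i = int(random.random() < P_I1)
    s = int(random.random() < P_S1_given_I[i])
    g = int(random.random() < P_G1_given_DI[(d, i)])
    return d, i, g, s

n = 200_000
count_g1 = sum(forward_sample()[2] for _ in range(n))

# Exact marginal P(G = 1) = sum_{d, i} P(d) P(i) P(G = 1 | d, i)
exact = sum((P_D1 if d else 1 - P_D1) * (P_I1 if i else 1 - P_I1)
            * P_G1_given_DI[(d, i)]
            for d in (0, 1) for i in (0, 1))
print(count_g1 / n, exact)
```

Each sample costs one pass over the network, and any downstream quantity (expectations, MAP over the drawn configurations, marginals as normalised counts) can be read off the samples, as the next slide summarises.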
Assume $\{x_i\}_{i=1}^M$ are $M$ samples from $P(X)$. We can approximately compute:
- expectation: $E_{X \sim P(X)}[f(X)] \approx \frac{1}{M} \sum_{i=1}^M f(x_i)$
- MAP solution: $\operatorname{argmax}_x P(x) \approx \operatorname{argmax}_{x \in \{x_i\}_{i=1}^M} P(x)$
- marginal: $P(x) \approx N_{X = x} / N_{total}$
- sampling from $P(X \mid e)$ given evidence $e$: sample from $P(X)$ first, and reject $x$ when it does not agree with $e$.
Problems?
Problem: the rejection step in estimating $P(X \mid e)$ wastes too many samples when $P(e)$ is small, and in real applications $P(e)$ is almost always very small.

Question: how do we avoid rejecting samples?
How about setting the observed random variables to their observed values, and then doing forward sampling on the rest?
Let's see if it works. To sample from $P(D, I, G, L \mid S = 0)$ in a simplified PGM:

[Figure: the simplified student network with nodes D, I, G, S, L and CPDs $P(D)$, $P(I)$, $P(G \mid D, I)$, $P(S \mid I)$, $P(L \mid G)$.]

Fix $S = 0$, and then sample $D, I, G, L$. Does this give the same result as forward sampling with rejection?
No, it doesn't! The samples are not from $P(D, I, G, L \mid S = 0)$ at all: the unobserved parent $I$ is still sampled from its prior $P(I)$, ignoring the evidence $S = 0$. Fixing this leads to likelihood weighting.
Likelihood weighting

Input: observed variables $\{Z_i = z_i\}_i$.
Step 1: set the $\{Z_i\}_i$ to their observed values.
Step 2: forward-sample the unobserved variables.
Step 3: weight the sample by $\prod_i P(z_i \mid \mathrm{Pa}(Z_i))$.
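The three steps can be sketched on a tiny made-up example: estimating $P(I = 1 \mid S = 1)$ in a two-node network $I \to S$, where each sample's weight is $P(S = 1 \mid i)$ and the weighted average is normalised by the total weight. The CPD numbers are invented, so the exact answer is available from Bayes' rule for checking:

```python
import random

random.seed(0)

# Made-up CPDs for a two-node network I -> S (both binary).
P_I1 = 0.3
P_S1_given_I = {0: 0.2, 1: 0.8}

def lw_estimate_I1_given_S1(n):
    """Likelihood weighting for P(I = 1 | S = 1): clamp S to 1,
    forward-sample the unobserved variables (here just I), and weight
    each sample by P(evidence | sampled parents) = P(S = 1 | i)."""
    num = den = 0.0
    for _ in range(n):
        i = int(random.random() < P_I1)   # step 2: forward-sample I
        w = P_S1_given_I[i]               # step 3: weight of this sample
        num += w * i
        den += w
    return num / den

# Exact posterior by Bayes' rule, for comparison.
exact = (P_S1_given_I[1] * P_I1) / (P_S1_given_I[1] * P_I1
                                    + P_S1_given_I[0] * (1 - P_I1))
print(lw_estimate_I1_given_S1(200_000), exact)
```

No sample is ever rejected: instead of throwing away draws that disagree with the evidence, each draw is kept and down-weighted by how well it explains the evidence.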
Likelihood weighting inference

To sample from $P(D, I, G, L \mid S = 0)$ in the following PGM:

[Figure: the simplified student network with CPDs $P(D)$, $P(I)$, $P(G \mid D, I)$, $P(S \mid I)$, $P(L \mid G)$.]

Fix $S = 0$, and forward-sample $D, I, G, L$. Then weight the sample by $P(S = 0 \mid I)$. Does this give the same result as forward sampling with rejection?
$E_{X \sim P(D, I, G, L \mid S = 0)}[f(D, I, G, L, 0)] \approx \frac{\sum_{j=1}^N w_j\, f(d_j, i_j, g_j, l_j, 0)}{\sum_{j=1}^N w_j}$, where $w_j = P(S = 0 \mid i_j)$ is the weight of the $j$-th sample.
[Figure: (a) $G_1$ with $p(X)$, the original network; (b) $G_2$ with $q(X)$, the mutilated network in which the observed node $S$ is cut off from its parent.]

Sample $\{x_i\}_{i=1}^N$ from $q(X)$. Then
$\hat{E}_{X \sim p(X)}[f(X)] = \frac{1}{N} \sum_{i=1}^N \frac{p(x_i)}{q(x_i)} f(x_i)$.
MAP Inference Revisit

Primal:
- Variable Elimination
- Max-product, (Loopy) BP
- Junction Tree Algorithm and cluster-based BP
- Linear Programming (LP) relaxations
- Quadratic Programming (QP), Semidefinite Programming (SDP), Second-Order Cone Programming (SOCP)...
- Special potentials (Graph Cut)...

Dual:
- GMPLP, dual decomposition...
Marginal Inference Revisit

Many MAP inference methods can be converted to marginal ones (and vice versa):
- max-product → sum-product
- LP-based dual methods → marginal (via adding an entropy term, e.g. the Kikuchi approximation)
That's all. Thanks!
More information12 : Variational Inference I
10-708: Probabilistic Graphical Models, Spring 2015 12 : Variational Inference I Lecturer: Eric P. Xing Scribes: Fattaneh Jabbari, Eric Lei, Evan Shapiro 1 Introduction Probabilistic inference is one of
More informationCOMP 652: Machine Learning. Lecture 12. COMP Lecture 12 1 / 37
COMP 652: Machine Learning Lecture 12 COMP 652 Lecture 12 1 / 37 Today Perceptrons Definition Perceptron learning rule Convergence (Linear) support vector machines Margin & max margin classifier Formulation
More informationLecture 8: Bayesian Networks
Lecture 8: Bayesian Networks Bayesian Networks Inference in Bayesian Networks COMP-652 and ECSE 608, Lecture 8 - January 31, 2017 1 Bayes nets P(E) E=1 E=0 0.005 0.995 E B P(B) B=1 B=0 0.01 0.99 E=0 E=1
More informationPattern Recognition and Machine Learning
Christopher M. Bishop Pattern Recognition and Machine Learning ÖSpri inger Contents Preface Mathematical notation Contents vii xi xiii 1 Introduction 1 1.1 Example: Polynomial Curve Fitting 4 1.2 Probability
More informationCPSC 540: Machine Learning
CPSC 540: Machine Learning MCMC and Non-Parametric Bayes Mark Schmidt University of British Columbia Winter 2016 Admin I went through project proposals: Some of you got a message on Piazza. No news is
More informationExpectation Maximization
Expectation Maximization Machine Learning CSE546 Carlos Guestrin University of Washington November 13, 2014 1 E.M.: The General Case E.M. widely used beyond mixtures of Gaussians The recipe is the same
More informationProbabilistic Graphical Models (1): representation
1 / 28 Probabilistic Graphical Models (1): representation Qinfeng (Javen) Shi The ustralian entre for Visual Technologies, The University of delaide, ustralia 15 pril 2011 ourse Outline Probabilistic Graphical
More informationFINAL: CS 6375 (Machine Learning) Fall 2014
FINAL: CS 6375 (Machine Learning) Fall 2014 The exam is closed book. You are allowed a one-page cheat sheet. Answer the questions in the spaces provided on the question sheets. If you run out of room for
More informationSequence labeling. Taking collective a set of interrelated instances x 1,, x T and jointly labeling them
HMM, MEMM and CRF 40-957 Special opics in Artificial Intelligence: Probabilistic Graphical Models Sharif University of echnology Soleymani Spring 2014 Sequence labeling aking collective a set of interrelated
More informationFrom Bayesian Networks to Markov Networks. Sargur Srihari
From Bayesian Networks to Markov Networks Sargur srihari@cedar.buffalo.edu 1 Topics Bayesian Networks and Markov Networks From BN to MN: Moralized graphs From MN to BN: Chordal graphs 2 Bayesian Networks
More informationExpectation Propagation Algorithm
Expectation Propagation Algorithm 1 Shuang Wang School of Electrical and Computer Engineering University of Oklahoma, Tulsa, OK, 74135 Email: {shuangwang}@ou.edu This note contains three parts. First,
More informationBut if z is conditioned on, we need to model it:
Partially Unobserved Variables Lecture 8: Unsupervised Learning & EM Algorithm Sam Roweis October 28, 2003 Certain variables q in our models may be unobserved, either at training time or at test time or
More information13 : Variational Inference: Loopy Belief Propagation
10-708: Probabilistic Graphical Models 10-708, Spring 2014 13 : Variational Inference: Loopy Belief Propagation Lecturer: Eric P. Xing Scribes: Rajarshi Das, Zhengzhong Liu, Dishan Gupta 1 Introduction
More informationInference and Representation
Inference and Representation David Sontag New York University Lecture 5, Sept. 30, 2014 David Sontag (NYU) Inference and Representation Lecture 5, Sept. 30, 2014 1 / 16 Today s lecture 1 Running-time of
More informationCutting Plane Training of Structural SVM
Cutting Plane Training of Structural SVM Seth Neel University of Pennsylvania sethneel@wharton.upenn.edu September 28, 2017 Seth Neel (Penn) Short title September 28, 2017 1 / 33 Overview Structural SVMs
More informationExpectation Maximization Algorithm
Expectation Maximization Algorithm Vibhav Gogate The University of Texas at Dallas Slides adapted from Carlos Guestrin, Dan Klein, Luke Zettlemoyer and Dan Weld The Evils of Hard Assignments? Clusters
More informationMinimizing D(Q,P) def = Q(h)
Inference Lecture 20: Variational Metods Kevin Murpy 29 November 2004 Inference means computing P( i v), were are te idden variables v are te visible variables. For discrete (eg binary) idden nodes, exact
More informationApproximate Message Passing
Approximate Message Passing Mohammad Emtiyaz Khan CS, UBC February 8, 2012 Abstract In this note, I summarize Sections 5.1 and 5.2 of Arian Maleki s PhD thesis. 1 Notation We denote scalars by small letters
More information5. Sum-product algorithm
Sum-product algorithm 5-1 5. Sum-product algorithm Elimination algorithm Sum-product algorithm on a line Sum-product algorithm on a tree Sum-product algorithm 5-2 Inference tasks on graphical models consider
More informationScribe to lecture Tuesday March
Scribe to lecture Tuesday March 16 2004 Scribe outlines: Message Confidence intervals Central limit theorem Em-algorithm Bayesian versus classical statistic Note: There is no scribe for the beginning of
More informationApproximate Inference Part 1 of 2
Approximate Inference Part 1 of 2 Tom Minka Microsoft Research, Cambridge, UK Machine Learning Summer School 2009 http://mlg.eng.cam.ac.uk/mlss09/ Bayesian paradigm Consistent use of probability theory
More informationMessage Passing Algorithms and Junction Tree Algorithms
Message Passing lgorithms and Junction Tree lgorithms Le Song Machine Learning II: dvanced Topics S 8803ML, Spring 2012 Inference in raphical Models eneral form of the inference problem P X 1,, X n Ψ(
More information