CS242: Probabilistic Graphical Models Lecture 7B: Markov Chain Monte Carlo & Gibbs Sampling Professor Erik Sudderth Brown University Computer Science October 27, 2016 Some figures and materials courtesy Iain Murray, Markov Chain Monte Carlo, MLSS 2009 http://homepages.inf.ed.ac.uk/imurray2/teaching/09mlss/
CS242: Lecture 7B Outline Ø MCMC & Metropolis-Hastings Algorithms Ø Gibbs Sampling for Graphical Models Ø Moments, mixing, & convergence diagnostics
MCMC & Metropolis-Hastings

Markov Chain Monte Carlo (MCMC)
Goal: Draw samples from some target distribution p(x) = (1/Z) p*(x) (discrete or continuous).
Assumption: Can evaluate p*(x), the target up to an unknown normalizer Z.
Approach: Design a Markov chain (biased random walk) T which has the target as its unique equilibrium distribution:
  ∫ T(x' | x) p(x) dx = p(x')   (continuous)
  Σ_x T(x' | x) p(x) = p(x')   (discrete)

Metropolis-Hastings Method
Input: Target distribution p*(x), and a user-specified transition distribution q(x' | x).
Approach: Transform the transition distribution so that it has the target as its equilibrium distribution.
Detailed Balance
A sufficient condition for a Markov chain to have the desired target distribution:
  T(x' | x) p(x) = p(x') T(x | x')
Left side: probability of reaching state x via some path, then jumping to x'. Right side: probability of reaching state x' via some path, then jumping to x.
Implications of the detailed balance condition:
  If p(x) = p(x'), then T(x' | x) = T(x | x').
  If p(x') > p(x), then T(x' | x) / T(x | x') = p(x') / p(x) > 1.
Detailed Balance
A sufficient condition for a Markov chain to have the desired target distribution:
  T(x' | x) p(x) = p(x') T(x | x')
Detailed balance implies the target is an equilibrium distribution:
  Σ_x T(x' | x) p(x) = p(x') [ Σ_x T(x | x') ] = p(x')
Detailed balance restricts the transition distribution. But it lets us check the equilibrium distribution via isolated pairs of states, without explicitly summing over all states.
The Metropolis-Hastings Algorithm
Transforms proposals into correct MCMC methods.

Metropolis-Hastings (MH) Transition Operator T(x' | x)
Propose: Sample x̃ ~ q(x̃ | x) from a user-specified transition distribution (assumed tractable).
Evaluate: Compute the Metropolis-Hastings acceptance ratio
  r(x̃, x) = [ p*(x̃) q(x | x̃) ] / [ p*(x) q(x̃ | x) ]
(equals one if q satisfies detailed balance for this target).
Definite Accept: If r(x̃, x) ≥ 1, set x' = x̃.
Possible Reject: If r(x̃, x) < 1, sample an auxiliary variable u ~ Ber(r(x̃, x)) and set x' = u x̃ + (1 - u) x. (If we reject, the state is unchanged.)
Overall acceptance probability: Pr[x' = x̃] = min(1, r(x̃, x)).
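The accept/reject recipe above can be sketched as a generic MH transition. This is an illustrative sketch, not the lecture's code: the names (`mh_step`, `log_p_tilde`, `propose`, `log_q`) are my own, and it works in log space to avoid underflow. The demo runs a chain targeting a standard normal with a Gaussian proposal.

```python
import math
import random

def mh_step(x, log_p_tilde, propose, log_q):
    """One Metropolis-Hastings transition.
    log_p_tilde(x): unnormalized log target; propose(x): sample x~ from q(.|x);
    log_q(a, b): evaluate log q(a | b) (up to an additive constant)."""
    x_new = propose(x)
    # log of r(x~, x) = [p*(x~) q(x | x~)] / [p*(x) q(x~ | x)]
    log_r = (log_p_tilde(x_new) + log_q(x, x_new)) - (log_p_tilde(x) + log_q(x_new, x))
    if math.log(random.random()) < min(0.0, log_r):
        return x_new    # accept the proposal
    return x            # reject: the chain stays at the current state

# Demo: target N(0, 1), symmetric Gaussian random-walk proposal.
random.seed(0)
log_p = lambda x: -0.5 * x * x                 # unnormalized log N(0, 1)
propose = lambda x: x + random.gauss(0.0, 1.0)
log_q = lambda a, b: -0.5 * (a - b) ** 2       # symmetric, so it cancels in r

x, samples = 0.0, []
for _ in range(50000):
    x = mh_step(x, log_p, propose, log_q)
    samples.append(x)
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
```

The empirical mean and variance should be close to the target's 0 and 1.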
The MH Independence Sampler
Transforms independent proposals into correct MCMC methods.

MH Independence Sampler Transition Operator T(x' | x)
Propose: Sample x̃ ~ q(x̃ | x) = q(x̃) from a user-specified proposal distribution, independent of the current state.
Evaluate: Compute the Metropolis-Hastings acceptance ratio
  r(x̃, x) = [ p*(x̃) q(x) ] / [ p*(x) q(x̃) ] = w(x̃) / w(x),  where w(x) = p*(x) / q(x).
Definite Accept: If r(x̃, x) ≥ 1, set x' = x̃.
Possible Reject: If r(x̃, x) < 1, sample an auxiliary variable u ~ Ber(r(x̃, x)) and set x' = u x̃ + (1 - u) x. (If we reject, the state is unchanged.)
If q(x) = p(x), every proposal is accepted and we recover the basic Monte Carlo estimator.
Special Case: The Metropolis Algorithm
Transforms symmetric proposals into correct MCMC methods.

Metropolis Transition Operator T(x' | x)
Propose: Sample x̃ ~ q(x̃ | x) from a user-specified symmetric transition distribution, q(x̃ | x) = q(x | x̃).
Evaluate: Compute the Metropolis acceptance ratio
  r(x̃, x) = p*(x̃) / p*(x).
Definite Accept: If r(x̃, x) ≥ 1, set x' = x̃.
Possible Reject: If r(x̃, x) < 1, sample an auxiliary variable u ~ Ber(r(x̃, x)) and set x' = u x̃ + (1 - u) x. (If we reject, the state is unchanged.)
A standard greedy search would always, rather than possibly, reject.
Gaussian Random Walk Metropolis
Target: p(x) = Norm(x | 0, 1). Proposal: q(x' | x) = Norm(x' | x, σ²).
[Figure: 1000-iteration trace plots for three proposal widths]
  σ = 0.1: 99.8% of proposals accepted
  σ = 1: 68.4% of proposals accepted
  σ = 100: 0.5% of proposals accepted
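The acceptance rates quoted above can be approximately reproduced with a small random-walk Metropolis simulation. The helper below and its seed are illustrative choices, not part of the slides.

```python
import math
import random

def acceptance_rate(sigma, steps=20000, seed=1):
    """Fraction of accepted random-walk Metropolis proposals for target N(0, 1)."""
    rng = random.Random(seed)
    x, accepts = 0.0, 0
    for _ in range(steps):
        x_new = x + rng.gauss(0.0, sigma)
        # symmetric proposal, so r = p*(x_new) / p*(x); work in log space
        log_r = 0.5 * (x * x - x_new * x_new)
        if math.log(rng.random()) < min(0.0, log_r):
            x, accepts = x_new, accepts + 1
    return accepts / steps

rates = {s: acceptance_rate(s) for s in (0.1, 1.0, 100.0)}
```

Small steps are almost always accepted but explore slowly; huge steps are almost always rejected.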
Mixing: Metropolis Random Walk
A general, but often ineffective, generic Metropolis proposal:
  q(x' | x) = Norm(x' | x, σ²I)
  σ too large: many rejections.
  σ too small: slow diffusion; roughly (L/σ)² iterations required to traverse a region of width L.
[Figure: target P with overall width L, explored by a narrow Gaussian proposal Q.]
Mixing: Metropolis Random Walk
Discrete target distribution is uniform over all states:
  p(x) = 1/21,  x ∈ {0, 1, 2, ..., 20}
  q(x' | x) = 1/2,  x' ∈ {x - 1, x + 1}
[Figure: for two independent runs, histograms of visited states after 100, 400, and 1200 iterations; the empirical distribution only slowly approaches uniform.]
MH Algorithms Satisfy Detailed Balance
Metropolis-Hastings (MH) transition operator T(x' | x): propose x̃ ~ q(x̃ | x), then
  Pr[x' = x̃] = min(1, r(x̃, x)),  r(x̃, x) = [ p*(x̃) q(x | x̃) ] / [ p*(x) q(x̃ | x) ],
  Pr[x' = x] = 1 - min(1, r(x̃, x)).
Detailed Balance Condition: T(x' | x) p*(x) = p*(x') T(x | x').
For any x' ≠ x:
  p*(x) T(x' | x) = p*(x) q(x' | x) min(1, [p*(x') q(x | x')] / [p*(x) q(x' | x)])
                  = min( p*(x) q(x' | x), p*(x') q(x | x') )
                  = p*(x') q(x | x') min(1, [p*(x) q(x' | x)] / [p*(x') q(x | x')])
                  = p*(x') T(x | x').
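The derivation above can be checked numerically on a tiny discrete chain. The 3-state target and proposal matrix below are arbitrary illustrative choices; the code builds the full MH transition matrix and lets us verify detailed balance and stationarity exactly.

```python
# Arbitrary target over 3 states and an (asymmetric) proposal with rows summing to 1.
p = [0.5, 0.3, 0.2]
q = [[0.2, 0.5, 0.3],
     [0.4, 0.2, 0.4],
     [0.3, 0.6, 0.1]]

n = len(p)
T = [[0.0] * n for _ in range(n)]
for x in range(n):
    for y in range(n):
        if y != x:
            # MH acceptance ratio r(y, x) = p(y) q(x|y) / (p(x) q(y|x))
            r = (p[y] * q[y][x]) / (p[x] * q[x][y])
            T[x][y] = q[x][y] * min(1.0, r)
    T[x][x] = 1.0 - sum(T[x])   # rejected mass stays at the current state
```

Detailed balance p(x) T(y|x) = p(y) T(x|y) holds pair by pair, and therefore p is stationary under T.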
CS242: Lecture 7B Outline Ø MCMC & Metropolis-Hastings Algorithms Ø Gibbs Sampling for Graphical Models Ø Moments, mixing, & convergence diagnostics
Combining MCMC Transition Proposals
A sequence of operators, each leaving P invariant:
  x0 ~ P(x),  x1 ~ T_a(x1 | x0),  x2 ~ T_b(x2 | x1),  x3 ~ T_c(x3 | x2)
  Σ_{x0} T_a(x1 | x0) P(x0) = P(x1)
  Σ_{x1} T_b(x2 | x1) P(x1) = P(x2)
  Σ_{x2} T_c(x3 | x2) P(x2) = P(x3)
The combination T_c T_b T_a leaves P invariant.
If it can reach any x, T_c T_b T_a is a valid MCMC operator.
Individually, T_a, T_b, and T_c need not be ergodic.
MCMC for Factor Graphs
  p(x) = (1/Z) Π_{f∈F} ψ_f(x_f),  Z = Σ_x Π_{f∈F} ψ_f(x_f)
F: set of hyperedges (factors) linking subsets of nodes. Γ(s) = {f ∈ F | s ∈ f}: factors neighboring node s.
Consider a proposal which modifies one variable:
  x̃_s ~ q_s(x̃_s | x),  x̃_t = x_t for all t ≠ s.
Acceptance Ratio:
  r(x̃, x) = [p*(x̃) q(x | x̃)] / [p*(x) q(x̃ | x)] = [q_s(x_s | x̃) / q_s(x̃_s | x)] Π_{f∈Γ(s)} ψ_f(x̃_f) / ψ_f(x_f)
Computation of the acceptance ratio is local: it depends only on the variable being resampled and its neighbors.
The proposal is low-dimensional: it is easier to have a high acceptance rate.
Gibbs Sampling for Factor Graphs
  p(x) = (1/Z) Π_{f∈F} ψ_f(x_f),  Z = Σ_x Π_{f∈F} ψ_f(x_f),  Γ(s) = {f ∈ F | s ∈ f}: factors neighboring node s.
Visiting nodes s in some fixed or random order:
Determine the conditional distribution of node s given values for all other nodes (equivalently, its neighbors):
  q_s(x̃_s | x) = p(x̃_s | {x_t, t ≠ s}) ∝ Π_{f∈Γ(s)} ψ_f(x̃_s, x_{f\s})
Propose a new value for node s, leaving all other nodes unchanged:
  x̃_s ~ q_s(x̃_s | x),  x̃_t = x_t for all t ≠ s.
Always accepting this proposal gives a valid MCMC method: x' = x̃.
Gibbs Sampling is Metropolis-Hastings
  p(x) = (1/Z) Π_{f∈F} ψ_f(x_f),  Z = Σ_x Π_{f∈F} ψ_f(x_f)
Gibbs sampling proposal for node s:
  q_s(x̃_s | x) = p(x̃_s | {x_t, t ≠ s}) ∝ Π_{f∈Γ(s)} ψ_f(x̃_s, x_{f\s})
  x̃_s ~ q_s(x̃_s | x),  x̃_t = x_t for all t ≠ s.
Metropolis-Hastings Acceptance Ratio:
  r(x̃, x) = [q_s(x_s | x̃) / q_s(x̃_s | x)] Π_{f∈Γ(s)} ψ_f(x̃_f) / ψ_f(x_f) = 1
Always accept! (Plug in the conditional above, and use the fact that x̃_t = x_t for t ≠ s.)
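Because every proposal is accepted, a Gibbs sampler needs only the local conditionals. A minimal sketch for a 3-node Ising chain (an assumed toy model with x_s ∈ {-1, +1} and coupling J chosen arbitrarily, not an example from the slides), comparing a Gibbs estimate of E[x0 x1] against exact enumeration:

```python
import itertools
import math
import random

J = 0.8
nbrs = {0: [1], 1: [0, 2], 2: [1]}   # 3-node chain: 0 - 1 - 2

def gibbs_sweep(x, rng):
    """One full sweep of single-site Gibbs updates."""
    for s in range(3):
        field = J * sum(x[t] for t in nbrs[s])
        p_plus = 1.0 / (1.0 + math.exp(-2.0 * field))   # p(x_s = +1 | neighbors)
        x[s] = 1 if rng.random() < p_plus else -1

rng = random.Random(0)
x = [1, 1, 1]
for _ in range(500):                 # burn-in
    gibbs_sweep(x, rng)
total, S = 0.0, 20000
for _ in range(S):
    gibbs_sweep(x, rng)
    total += x[0] * x[1]
est = total / S

# Exact E[x0 x1] by enumerating all 8 joint configurations.
Z, exact = 0.0, 0.0
for cfg in itertools.product((-1, 1), repeat=3):
    w = math.exp(J * (cfg[0] * cfg[1] + cfg[1] * cfg[2]))
    Z += w
    exact += w * cfg[0] * cfg[1]
exact /= Z
```

The Gibbs estimate should agree with the enumerated moment to within Monte Carlo error.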
Gibbs Sampling as Message Passing
Consider a pairwise MRF (a similar result holds for factor graphs):
  p(x) = (1/Z) Π_{(s,t)∈E} ψ_st(x_s, x_t)
The Gibbs conditional for node i is built from messages that slice each pairwise potential at the neighbor's current sampled value:
  q_i(x_i) ∝ Π_{j∈Γ(i)} m_ji(x_i),  m_ij(x_j) ∝ ψ_ij(x̂_i, x_j)
Draw a single sample x̂_i from this conditional marginal; the sampled value is then used to extract a slice of each pairwise potential.
For discrete variables, sample from a categorical distribution. For continuous variables, we must analyze the model to determine the conditional.
Gibbs Sampling Implementation
Gibbs sampling benefits from few free choices and convenient features of conditional distributions:
Conditionals with a few discrete settings can be explicitly normalized:
  P(x_i | x_{j≠i}) = P(x_i, x_{j≠i}) / Σ_{x_i} P(x_i, x_{j≠i})
(this sum is small and easy).
Continuous conditionals are univariate, and thus amenable to standard sampling methods:
Ø Inverse CDF sampling
Ø Rejection sampling
Ø Slice sampling
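The explicit normalization above, combined with inverse-CDF sampling, can be sketched for a discrete conditional. The helper name and the unnormalized weights are illustrative choices, not from the slides.

```python
import random

def sample_discrete_conditional(unnorm, rng):
    """Normalize unnormalized conditional probabilities, then sample by inverse CDF."""
    Z = sum(unnorm)                      # the small, easy sum over x_i
    probs = [w / Z for w in unnorm]
    u, cum = rng.random(), 0.0
    for i, pr in enumerate(probs):
        cum += pr
        if u < cum:
            return i
    return len(probs) - 1                # guard against floating-point round-off

# Demo: unnormalized weights (2, 1, 1) should yield frequencies near (0.5, 0.25, 0.25).
rng = random.Random(0)
counts = [0, 0, 0]
for _ in range(30000):
    counts[sample_discrete_conditional([2.0, 1.0, 1.0], rng)] += 1
freqs = [c / 30000 for c in counts]
```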
CS242: Lecture 7B Outline Ø MCMC & Metropolis-Hastings Algorithms Ø Gibbs Sampling for Graphical Models Ø Moments, mixing, & convergence diagnostics
MCMC Implementation & Application
The samples aren't independent. Should we thin, keeping only every Kth sample?
An arbitrary initialization means the starting iterations are bad. Should we discard a burn-in period?
Maybe we should perform multiple runs?
How do we know if we have run for long enough?
Estimating Moments from Samples
Approximately independent samples can be obtained by thinning. However, all of the samples can be used:
Use the simple Monte Carlo estimator on MCMC samples,
  E_P[f] ≈ (1/S) Σ_{s=1}^{S} f(x^(s)).
It is: consistent, and unbiased if the chain has burned in.
The correct motivation to thin: when computing f(x^(s)) is expensive.
[Figure: distributions of estimates from thinned sampling versus all samples after burn-in.]
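The point above can be illustrated on a toy correlated chain standing in for MCMC output: an AR(1) process with known stationary mean 0 (the coefficient, burn-in length, and thinning factor are illustrative choices). Both the all-samples and the thinned estimators are consistent; the full estimator simply uses more (correlated) samples.

```python
import random

rng = random.Random(0)
a, x = 0.9, 5.0               # AR(1) coefficient; start far from equilibrium
chain = []
for _ in range(20000):
    x = a * x + rng.gauss(0.0, 1.0)
    chain.append(x)

burned = chain[1000:]                    # discard a burn-in period
est_all = sum(burned) / len(burned)      # estimator using every sample
thinned = burned[::10]                   # keep every 10th sample
est_thin = sum(thinned) / len(thinned)   # estimator using thinned samples
```

Both estimates should land near the stationary mean of 0.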
MCMC Mixing Diagnostics
Autocovariance: the empirical covariance of values produced by the MCMC method, versus iteration lag (spacing).
  Small autocovariances are necessary, but not sufficient, to demonstrate mixing to the target distribution.
  Fairly reliable for unimodal posteriors, but very misleading more generally.
Trace Plot: the value of some interesting summary statistic, versus MCMC iteration.
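Empirical autocovariance at a given lag is straightforward to compute. The sketch below uses an AR(1) chain standing in for MCMC output (its lag-k autocorrelation is known to be a^k, so the decay can be checked); the helper name and parameters are illustrative.

```python
import random

def autocov(xs, lag):
    """Empirical autocovariance of a chain at a given iteration lag."""
    n = len(xs)
    m = sum(xs) / n
    return sum((xs[t] - m) * (xs[t + lag] - m) for t in range(n - lag)) / (n - lag)

# AR(1) chain with coefficient a: lag-k autocorrelation is a**k.
rng = random.Random(1)
a, x, chain = 0.8, 0.0, []
for _ in range(50000):
    x = a * x + rng.gauss(0.0, 1.0)
    chain.append(x)

c0 = autocov(chain, 0)
rho = [autocov(chain, k) / c0 for k in (1, 5, 20)]   # normalized autocorrelations
```

Autocorrelation should decay toward zero with increasing lag for a well-mixing chain.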
Mixing for Gibbs Sampling
Each Gibbs sampling proposal changes only one variable at a time.
[Figure: strongly correlated 2D target with long axis L and short axis ℓ; single-variable Gibbs moves take steps of size at most ℓ.]
One or more rare events may need to happen before the Gibbs sampler moves between modes: SLOW mixing.
Example: slow mixing for a pair of binary variables (z1, z2), with mass 0.6 on (z1, z2) = (0, 0) and 0.4 on (1, 1). When the off-diagonal states have near-zero probability, single-variable Gibbs updates almost never leave the current mode.
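This slow-mixing effect can be demonstrated directly with a two-variable Gibbs sampler. The joint below is an illustrative choice (off-diagonal mass 0.001 per state, matching the slide's 0.6 / 0.4 diagonal): over 10,000 sweeps the chain switches modes only a handful of times.

```python
import random

# Joint over (z1, z2): nearly all mass on the diagonal states (0, 0) and (1, 1).
p = {(0, 0): 0.599, (0, 1): 0.001, (1, 0): 0.001, (1, 1): 0.399}

def gibbs_update(z, i, rng):
    """Resample z[i] from its exact conditional given the other variable."""
    other = z[1 - i]
    s0 = (0, other) if i == 0 else (other, 0)
    s1 = (1, other) if i == 0 else (other, 1)
    p1 = p[s1] / (p[s0] + p[s1])
    z[i] = 1 if rng.random() < p1 else 0

rng = random.Random(0)
z, switches, prev = [0, 0], 0, 0
for _ in range(10000):
    gibbs_update(z, 0, rng)
    gibbs_update(z, 1, rng)
    if z[0] != prev:            # count transitions between the two modes
        switches += 1
        prev = z[0]
```

Despite both modes carrying substantial probability, mode switches are rare because each requires passing through a near-zero-probability state.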
A Gibbs Sampler for Spatial MRFs
Ising and Potts Priors on Partitions: the Gibbs sampler is a special case of the pairwise MRF sampler.
Previous applications:
  Interactive foreground segmentation (GrabCut: Rother, Kolmogorov, & Blake 2004)
  Supervised segmentation with object category appearance models (Verbeek & Triggs 2007)
These are strong likelihoods. Is the prior itself effective for images?
Geman & Geman, PAMI 1984
128 x 128 grid, 8-nearest-neighbor edges, K = 5 states, Potts potentials.
[Figure: sampled states after 200 versus 10,000 Gibbs iterations.]
10-State Potts Models: Extended Gibbs Sampling
States sorted by size: largest in blue, smallest in red.
1996 IEEE DSP Workshop
[Figure: number of edges on which states take the same value, versus edge strength; annotations mark the "very noisy" regime, the "giant cluster" regime, and the statistics of natural images.]
Even within the phase transition region, samples lack the size distribution and spatial coherence of real image segments.
Output of a limited MCMC run can be arbitrarily misleading!
MCMC & Computational Resources
Best practical option: run a few (> 1) initializations for as many iterations as possible.