CS242: Probabilistic Graphical Models Lecture 7B: Markov Chain Monte Carlo & Gibbs Sampling


CS242: Probabilistic Graphical Models, Lecture 7B: Markov Chain Monte Carlo & Gibbs Sampling. Professor Erik Sudderth, Brown University Computer Science, October 27, 2016. Some figures and materials courtesy of Iain Murray, "Markov Chain Monte Carlo," MLSS 2009: http://homepages.inf.ed.ac.uk/imurray2/teaching/09mlss/

CS242: Lecture 7B Outline
- MCMC & Metropolis-Hastings algorithms
- Gibbs sampling for graphical models
- Moments, mixing, & convergence diagnostics

MCMC & Metropolis-Hastings
Markov chain Monte Carlo (MCMC)
Goal: Draw samples from some target distribution p(x) = (1/Z) p̃(x), which may be discrete or continuous.
Assumption: We can evaluate p̃(x), i.e., the target up to an unknown normalizer Z.
Approach: Design a Markov chain (a biased random walk) with transition distribution T whose unique equilibrium distribution is the target:
∫ T(x′ | x) p(x) dx = p(x′) (continuous case), Σ_x T(x′ | x) p(x) = p(x′) (discrete case).
Metropolis-Hastings method
Input: the target distribution p̃(x), and a user-specified transition distribution q(x′ | x).
Approach: Transform the transition distribution so that it has the target as its equilibrium distribution.

Detailed Balance
A sufficient condition for a Markov chain to have the desired target distribution:
T(x′ | x) p(x) = p(x′) T(x | x′)
Left side: the probability of reaching state x via some path, and then jumping to x′. Right side: the probability of reaching state x′ via some path, and then jumping to x.
Implications of the detailed balance condition:
If p(x) = p(x′), then T(x′ | x) = T(x | x′).
If p(x′) > p(x), then T(x′ | x) / T(x | x′) = p(x′) / p(x) > 1.

Detailed Balance (continued)
T(x′ | x) p(x) = p(x′) T(x | x′)
Detailed balance implies that the target is an equilibrium distribution:
Σ_x T(x′ | x) p(x) = p(x′) [ Σ_x T(x | x′) ] = p(x′).
Detailed balance restricts the transition distribution, but it lets us check the equilibrium distribution via isolated state pairs, without explicitly summing over all states.

The Metropolis-Hastings Algorithm
Transforms proposals into correct MCMC methods.
Metropolis-Hastings (MH) transition operator T(x′ | x):
Propose: Sample x̄ ~ q(x̄ | x) from the user-specified transition distribution, assumed tractable.
Evaluate: Compute the Metropolis-Hastings acceptance ratio
r(x̄, x) = [ p̃(x̄) q(x | x̄) ] / [ p̃(x) q(x̄ | x) ],
which equals one if q already satisfies detailed balance for this target.
Definite accept: If r(x̄, x) ≥ 1, set x′ = x̄.
Possible reject: If r(x̄, x) < 1, sample an auxiliary variable u ~ Ber(r(x̄, x)) and set x′ = u x̄ + (1 − u) x; if we reject, the state is unchanged.
Overall acceptance probability: Pr[x′ = x̄] = min(1, r(x̄, x)).
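
To make the operator concrete, here is a minimal sketch of a generic MH sampler in Python (not from the lecture; the function names `log_p_tilde`, `propose`, and `log_q` are illustrative). Working with log densities avoids numerical underflow:

```python
import numpy as np

def metropolis_hastings(log_p_tilde, propose, log_q, x0, num_steps, seed=0):
    """Generic Metropolis-Hastings sampler (illustrative sketch).

    log_p_tilde(x): log of the unnormalized target p~(x).
    propose(x, rng): draws a proposal x_bar ~ q(x_bar | x).
    log_q(x_new, x_old): log q(x_new | x_old), the proposal density.
    """
    rng = np.random.default_rng(seed)
    x, samples = x0, []
    for _ in range(num_steps):
        x_bar = propose(x, rng)
        # Log of the MH acceptance ratio r(x_bar, x).
        log_r = (log_p_tilde(x_bar) + log_q(x, x_bar)
                 - log_p_tilde(x) - log_q(x_bar, x))
        # Accept with probability min(1, r); otherwise the state is unchanged.
        if np.log(rng.uniform()) < log_r:
            x = x_bar
        samples.append(x)
    return np.array(samples)
```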

The MH Independence Sampler
Transforms independent proposals into correct MCMC methods.
MH independence sampler transition operator T(x′ | x):
Propose: Sample x̄ ~ q(x̄) from a user-specified proposal distribution q(x̄ | x) = q(x̄), independent of the current state.
Evaluate: Compute the Metropolis-Hastings acceptance ratio
r(x̄, x) = [ p̃(x̄) q(x) ] / [ p̃(x) q(x̄) ] = w(x̄) / w(x), where w(x) = p̃(x) / q(x).
Definite accept: If r(x̄, x) ≥ 1, set x′ = x̄.
Possible reject: If r(x̄, x) < 1, sample u ~ Ber(r(x̄, x)) and set x′ = u x̄ + (1 − u) x; if we reject, the state is unchanged.
If q(x) = p(x), we always accept and recover the basic Monte Carlo estimator.
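
Because the proposal ignores the current state, one MH independence step reduces to comparing importance weights; a minimal sketch (with illustrative names, as above):

```python
import numpy as np

def mh_independence_step(x, log_w, sample_q, rng):
    """One independence-sampler step, where q(x_bar | x) = q(x_bar).

    log_w(x) = log p~(x) - log q(x) is a log importance weight, so the
    acceptance ratio is r(x_bar, x) = w(x_bar) / w(x).
    """
    x_bar = sample_q(rng)
    if np.log(rng.uniform()) < log_w(x_bar) - log_w(x):
        return x_bar   # accept the independent proposal
    return x           # reject: state unchanged
```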

Special Case: The Metropolis Algorithm
Transforms symmetric proposals into correct MCMC methods.
Metropolis transition operator T(x′ | x):
Propose: Sample x̄ ~ q(x̄ | x) from a user-specified symmetric transition distribution, q(x̄ | x) = q(x | x̄).
Evaluate: Compute the Metropolis acceptance ratio r(x̄, x) = p̃(x̄) / p̃(x).
Definite accept: If r(x̄, x) ≥ 1, set x′ = x̄.
Possible reject: If r(x̄, x) < 1, sample u ~ Ber(r(x̄, x)) and set x′ = u x̄ + (1 − u) x; if we reject, the state is unchanged.
A standard greedy search would always, rather than only possibly, reject moves that decrease the target probability.

Gaussian Random Walk Metropolis
Target p(x) = Norm(x | 0, 1); proposal q(x′ | x) = Norm(x′ | x, σ²).
[Figure: 1000-iteration trace plots of the chain state for three step sizes.]
σ = 0.1: 99.8% of proposals accepted, but the chain explores slowly.
σ = 1: 68.4% accepted.
σ = 100: 0.5% accepted; the chain rarely moves.
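
The acceptance rates on this slide can be reproduced with a short script (a sketch of the slide's setup; the exact percentages will vary slightly with the seed and number of steps):

```python
import numpy as np

def rw_metropolis_accept_rate(sigma, num_steps=100_000, seed=0):
    """Acceptance rate for target Norm(0, 1), proposal Norm(x, sigma^2)."""
    rng = np.random.default_rng(seed)
    x, accepts = 0.0, 0
    for _ in range(num_steps):
        x_bar = x + sigma * rng.normal()
        # Symmetric proposal, so the Metropolis ratio is p(x_bar) / p(x);
        # for a standard normal target, log r = (x^2 - x_bar^2) / 2.
        if np.log(rng.uniform()) < 0.5 * (x**2 - x_bar**2):
            x, accepts = x_bar, accepts + 1
    return accepts / num_steps

for sigma in (0.1, 1.0, 100.0):
    print(sigma, rw_metropolis_accept_rate(sigma))
```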

Mixing: Metropolis Random Walk
A general, but often ineffective, generic Metropolis proposal: q(x′ | x) = Norm(x′ | x, σ² I).
[Figure: an elongated target P with overall length scale L.]
If σ is large: many rejections. If σ is small: slow diffusion, with roughly (L/σ)² iterations required to traverse the target.

Mixing: Metropolis Random Walk (discrete example)
The target distribution is uniform over all states, p(x) = 1/21 for x ∈ {0, 1, 2, ..., 20}, and the proposal is q(x′ | x) = 1/2 for x′ ∈ {x − 1, x + 1}.
[Figure: histograms of the visited states after 100, 400, and 1200 iterations; the histograms approach uniform only after on the order of a thousand iterations.]
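
A sketch reproducing this discrete chain: proposals that step outside {0, ..., 20} have p̃ = 0 and are always rejected, while every in-range move is accepted because the target is uniform and the proposal symmetric:

```python
import numpy as np

rng = np.random.default_rng(0)
x, counts = 10, np.zeros(21, dtype=int)
for _ in range(1200):
    x_bar = x + rng.choice([-1, +1])   # propose a unit step left or right
    if 0 <= x_bar <= 20:               # in-range moves: r = 1, always accept
        x = x_bar                      # out-of-range moves: p~ = 0, reject
    counts[x] += 1                     # histogram of visited states
print(counts)
```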

MH Algorithms Satisfy Detailed Balance
Metropolis-Hastings (MH) transition operator T(x′ | x): propose x̄ ~ q(x̄ | x); accept with probability Pr[x′ = x̄] = min(1, r(x̄, x)), where r(x̄, x) = [ p̃(x̄) q(x | x̄) ] / [ p̃(x) q(x̄ | x) ]; otherwise Pr[x′ = x] = 1 − min(1, r(x̄, x)).
Detailed balance condition: T(x′ | x) p(x) = p(x′) T(x | x′).
Proof, for any x′ ≠ x:
p(x) T(x′ | x) = p(x) q(x′ | x) min(1, [ p(x′) q(x | x′) ] / [ p(x) q(x′ | x) ])
= min( p(x) q(x′ | x), p(x′) q(x | x′) )
= p(x′) q(x | x′) min( [ p(x) q(x′ | x) ] / [ p(x′) q(x | x′) ], 1 )
= p(x′) T(x | x′).
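
The proof can also be checked numerically by building the full MH transition matrix for a small discrete target and an arbitrary proposal (a sketch; the 4-state target and random proposal are made up for the test):

```python
import numpy as np

def mh_kernel(p, Q):
    """Dense MH transition matrix T[x_new, x] for target p(x) and
    proposal distributions q(x_new | x) = Q[x, x_new]."""
    n = len(p)
    T = np.zeros((n, n))
    for x in range(n):
        for y in range(n):
            if y != x:
                r = (p[y] * Q[y, x]) / (p[x] * Q[x, y])  # acceptance ratio
                T[y, x] = Q[x, y] * min(1.0, r)
        T[x, x] = 1.0 - T[:, x].sum()                    # rejection mass stays at x
    return T

p = np.array([0.1, 0.4, 0.2, 0.3])                       # made-up 4-state target
rng = np.random.default_rng(1)
Q = rng.uniform(size=(4, 4))
Q /= Q.sum(axis=1, keepdims=True)                        # rows are q(. | x)
T = mh_kernel(p, Q)

flow = T * p                        # flow[x', x] = T(x' | x) p(x)
assert np.allclose(flow, flow.T)    # detailed balance: the flow is symmetric
assert np.allclose(T @ p, p)        # hence p is an equilibrium distribution
```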

CS242: Lecture 7B Outline
- MCMC & Metropolis-Hastings algorithms
- Gibbs sampling for graphical models
- Moments, mixing, & convergence diagnostics

Combining MCMC Transition Proposals
A sequence of operators, each with P invariant:
x⁰ ~ P(x), x¹ ~ T_a(x¹ | x⁰), x² ~ T_b(x² | x¹), x³ ~ T_c(x³ | x²).
If x⁰ ~ P, then the marginal of x¹ is Σ_{x⁰} T_a(x¹ | x⁰) P(x⁰) = P(x¹); similarly, Σ_{x¹} T_b(x² | x¹) P(x¹) = P(x²) and Σ_{x²} T_c(x³ | x²) P(x²) = P(x³).
The combination T_c T_b T_a leaves P invariant. If the combined chain can reach any x, then T_c T_b T_a is a valid MCMC operator; individually, T_a, T_b, and T_c need not be ergodic.
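
Numerically, reusing the hypothetical `mh_kernel` helper from the earlier sketch: two MH kernels built from different proposals for the same target each preserve p, and therefore so does their composition:

```python
Qa = rng.uniform(size=(4, 4)); Qa /= Qa.sum(axis=1, keepdims=True)
Qb = rng.uniform(size=(4, 4)); Qb /= Qb.sum(axis=1, keepdims=True)
Ta, Tb = mh_kernel(p, Qa), mh_kernel(p, Qb)

# T_b T_a leaves p invariant because each factor does. Note the composition
# need not satisfy detailed balance, even though each factor does.
assert np.allclose(Ta @ p, p) and np.allclose(Tb @ p, p)
assert np.allclose(Tb @ (Ta @ p), p)
```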

MCMC for Factor Graphs
p(x) = (1/Z) Π_{f∈F} ψ_f(x_f), Z = Σ_x Π_{f∈F} ψ_f(x_f),
where F is a set of hyperedges linking subsets of nodes, and Γ(s) = {f ∈ F : s ∈ f} is the set of factors neighboring node s.
Consider a proposal which modifies one variable: x̄_s ~ q_s(x̄_s | x), with x̄_t = x_t for all t ≠ s.
Acceptance ratio:
r(x̄, x) = [ p̃(x̄) q(x | x̄) ] / [ p̃(x) q(x̄ | x) ] = [ q_s(x_s | x̄) / q_s(x̄_s | x) ] Π_{f∈Γ(s)} ψ_f(x̄_f) / ψ_f(x_f).
Computation of the acceptance ratio is local: it depends only on the variable being resampled and its neighbors. The proposal is low-dimensional, making it easier to achieve a high acceptance rate.

Gibbs Sampling for Factor Graphs
p(x) = (1/Z) Π_{f∈F} ψ_f(x_f), with Γ(s) = {f ∈ F : s ∈ f}.
Visiting the nodes s in some fixed or random order:
Determine the conditional distribution of node s given values of all other nodes (equivalently, of its neighbors):
q_s(x̄_s | x) = p(x̄_s | {x_t, t ≠ s}) ∝ Π_{f∈Γ(s)} ψ_f(x̄_s, x_{f\s}).
Propose a new value for node s, leaving all other nodes unchanged: x̄_s ~ q_s(x̄_s | x), x̄_t = x_t for all t ≠ s.
Always accepting this proposal (x′ = x̄) gives a valid MCMC method.
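
As a concrete instance, here is a sketch of single-site Gibbs sampling for an Ising model on a 4-neighbor grid, a pairwise special case of this recipe (the coupling strength J and grid size are illustrative choices, not from the lecture):

```python
import numpy as np

def gibbs_ising(J=0.5, shape=(16, 16), num_sweeps=100, seed=0):
    """Single-site Gibbs sampling for p(x) ∝ exp(J * sum_{(s,t) in E} x_s x_t),
    with spins x_s in {-1, +1} on a grid with 4-nearest-neighbor edges E."""
    rng = np.random.default_rng(seed)
    H, W = shape
    x = rng.choice([-1, 1], size=shape)
    for _ in range(num_sweeps):
        for i in range(H):
            for j in range(W):
                # The conditional depends only on the sum of neighboring spins.
                nb = sum(x[a, b]
                         for a, b in ((i-1, j), (i+1, j), (i, j-1), (i, j+1))
                         if 0 <= a < H and 0 <= b < W)
                # Normalized conditional: p(x_ij = +1 | neighbors).
                p_plus = 1.0 / (1.0 + np.exp(-2.0 * J * nb))
                x[i, j] = 1 if rng.uniform() < p_plus else -1
    return x
```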

Gibbs Sampling is Metropolis-Hastings
The Gibbs sampling proposal for node s is q_s(x̄_s | x) = p(x̄_s | {x_t, t ≠ s}) ∝ Π_{f∈Γ(s)} ψ_f(x̄_s, x_{f\s}), with x̄_t = x_t for all t ≠ s.
Metropolis-Hastings acceptance ratio:
r(x̄, x) = [ q_s(x_s | x̄) / q_s(x̄_s | x) ] Π_{f∈Γ(s)} ψ_f(x̄_f) / ψ_f(x_f) = 1.
Always accept! Plugging in the conditional above, and using the fact that x̄_t = x_t for t ≠ s, the proposal terms cancel the potential terms exactly.

Gibbs Sampling as Message Passing
Consider a pairwise MRF (a similar result holds for factor graphs):
p(x) = (1/Z) Π_{(s,t)∈E} ψ_st(x_s, x_t).
The Gibbs conditional at node i is a product of incoming messages, q_i(x̄_i) ∝ Π_{j∈Γ(i)} m_{ji}(x̄_i), where each message m_{ij}(x_j) ∝ ψ_ij(x̂_i, x_j) uses the current sample x̂_i to extract a slice of the pairwise potential. We then draw a single sample from this local marginal.
For discrete variables, this means sampling from a categorical distribution. For continuous variables, we must analyze the model to determine the conditional.

Gibbs Sampling Implementation
Gibbs sampling benefits from few free choices and from convenient features of the conditional distributions.
Conditionals with a few discrete settings can be explicitly normalized:
p(x_i | x_{j≠i}) = p̃(x_i, x_{j≠i}) / Σ_{x_i} p̃(x_i, x_{j≠i}),
and this sum is small and easy (see the sketch below).
Continuous conditionals are univariate, so they are amenable to standard sampling methods:
- Inverse CDF sampling
- Rejection sampling
- Slice sampling
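
For the discrete case, the normalization is one line (a sketch; the weights would come from evaluating the factors in Γ(s) at each candidate value of x_i):

```python
import numpy as np

def sample_discrete_conditional(unnorm_weights, rng):
    """Sample x_i from p(x_i | x_{-i}), where unnorm_weights[k] is the product
    of the local factors evaluated at x_i = k; the small sum over k is the
    only normalizer required."""
    w = np.asarray(unnorm_weights, dtype=float)
    return rng.choice(len(w), p=w / w.sum())
```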

CS242: Lecture 7B Outline
- MCMC & Metropolis-Hastings algorithms
- Gibbs sampling for graphical models
- Moments, mixing, & convergence diagnostics

MCMC Implementation & Application
The samples aren't independent. Should we thin, keeping only every Kth sample?
An arbitrary initialization means the starting iterations are unrepresentative. Should we discard a burn-in period?
Should we perform multiple runs?
How do we know whether we have run the chain for long enough?

Estimating Moments from Samples
Approximately independent samples can be obtained by thinning. However, all of the samples can be used: apply the simple Monte Carlo estimator to the MCMC samples,
E_P[f] ≈ (1/S) Σ_{s=1}^{S} f(x⁽ˢ⁾).
This estimator is consistent, and unbiased if the chain has burned in. The correct motivation to thin is only when computing f(x⁽ˢ⁾) is expensive.
[Figure: estimates from thinned sampling vs. all samples after burn-in.]
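
A sketch of the corresponding estimator, with burn-in and optional thinning exposed as parameters (illustrative names, not from the lecture):

```python
import numpy as np

def mcmc_expectation(samples, f, burn_in=0, thin=1):
    """Monte Carlo estimate E_P[f] ≈ (1/S) Σ_s f(x^(s)) from MCMC output.

    Thinning discards information, so use thin > 1 only when evaluating f
    is the expensive step; burn_in removes the bias of an arbitrary start.
    """
    kept = samples[burn_in::thin]
    return np.mean([f(x) for x in kept])
```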

MCMC Mixing Diagnostics
Autocovariance: the empirical covariance of values produced by the MCMC method, as a function of iteration lag (spacing). Small autocovariances are necessary, but not sufficient, to demonstrate mixing to the target distribution. This diagnostic is fairly reliable for unimodal posteriors, but very misleading more generally.
Trace plot: the value of some interesting summary statistic, plotted versus MCMC iteration.
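
A minimal sketch of the autocovariance diagnostic, computing the empirical autocorrelation of a scalar trace by iteration lag:

```python
import numpy as np

def autocorrelation(chain, max_lag=50):
    """Empirical autocorrelation of a scalar MCMC trace at lags 0..max_lag.
    Values near zero at moderate lags are necessary, but not sufficient,
    evidence of mixing."""
    x = np.asarray(chain, dtype=float)
    x = x - x.mean()
    var = np.dot(x, x) / len(x)
    return np.array([np.dot(x[:len(x) - lag], x[lag:]) / (len(x) * var)
                     for lag in range(max_lag + 1)])
```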

Mixing for Gibbs Sampling
[Figure: a 2D target over (z₁, z₂) whose modes have scale ℓ but are separated by the larger scale L.]
Each Gibbs sampling proposal changes only one variable at a time, so one or more rare events may need to happen before the Gibbs sampler moves between modes: SLOW mixing.
Example: slow mixing for a pair of binary variables whose joint distribution places probability 0.6 on (z₁, z₂) = (0, 0) and 0.4 on (z₁, z₂) = (1, 1), with (almost) no mass on the disagreeing states.
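
This binary example is easy to simulate. The slide's table shows mass 0.6 and 0.4 on the two agreeing states; the tiny ε mass placed on the disagreeing states below is an added assumption (to keep the Gibbs conditionals non-degenerate), and it makes transitions between the modes correspondingly rare:

```python
import numpy as np

eps = 1e-3   # hypothetical small mass on the disagreeing states
p = {(0, 0): 0.6 - eps, (1, 1): 0.4 - eps, (0, 1): eps, (1, 0): eps}

rng = np.random.default_rng(0)
z1, z2, switches = 0, 0, 0
for _ in range(100_000):
    # One Gibbs sweep: resample z1 given z2, then z2 given z1.
    w = np.array([p[(0, z2)], p[(1, z2)]])
    z1_new = rng.choice(2, p=w / w.sum())
    switches += int(z1_new != z1)       # count moves between modes
    z1 = z1_new
    w = np.array([p[(z1, 0)], p[(z1, 1)]])
    z2 = rng.choice(2, p=w / w.sum())
print("mode switches in 100k sweeps:", switches)
```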

A Gibbs Sampler for Spatial MRFs
Ising and Potts priors on partitions: the Gibbs sampler is a special case of the pairwise MRF sampler.
Previous applications: interactive foreground segmentation (GrabCut: Rother, Kolmogorov, & Blake, 2004), and supervised segmentation with object category appearance models (Verbeek & Triggs, 2007).
These applications rely on strong likelihoods. Is the prior itself effective for images?

Geman & Geman, PAMI 1984: Gibbs sampling for Potts potentials on a 128×128 grid with 8-nearest-neighbor edges and K = 5 states.
[Figure: sampled states after 200 vs. 10,000 Gibbs iterations.]

10-State Potts Models: Extended Gibbs Sampling
[Figure: sampled states, with states sorted by size: largest in blue, smallest in red.]

[Figure (1996 IEEE DSP Workshop): the number of edges on which neighboring states take the same value, plotted versus edge strength; annotated regimes range from 'very noisy' samples at weak coupling to a single 'giant cluster' at strong coupling, with 'natural images' lying in between.]
Even within the phase transition region, samples lack the size distribution and spatial coherence of real image segments. The output of a limited MCMC run can be arbitrarily misleading!

MCMC & Computational Resources
Best practical option: a few (more than one) independent initializations, each run for as many iterations as possible.