Graphical Models: Markov Chain Monte Carlo Inference. Siamak Ravanbakhsh. Winter 2018


Learning objectives
Markov chains; the idea behind Markov chain Monte Carlo (MCMC); two important examples: Gibbs sampling and the Metropolis-Hastings algorithm.

Problem with likelihood weighting
Recap: use a topological ordering and sample each variable conditioned on its parents; if a variable is observed, keep the observed value and update the weight.
Issues: observing a child does not affect the parent's assignment, and the method only applies to Bayes nets.

Idea: Gibbs sampling
Iteratively sample each variable conditioned on its Markov blanket, $X_i \sim p(x_i \mid X_{MB(i)})$; if $X_i$ is observed, keep the observed value. This is equivalent to first simplifying the model by removing the observed variables and then sampling from the simplified Gibbs distribution. After many Gibbs sampling iterations, $X \sim P$.

Example: Ising model
Recall the Ising model: $p(x) \propto \exp(\sum_i x_i h_i + \sum_{(i,j) \in E} x_i x_j J_{i,j})$ with $x_i \in \{-1, +1\}$.
Sample each node $i$:
$p(x_i = +1 \mid X_{MB(i)}) = \frac{\exp(h_i + \sum_{j \in MB(i)} J_{i,j} X_j)}{\exp(h_i + \sum_{j \in MB(i)} J_{i,j} X_j) + \exp(-h_i - \sum_{j \in MB(i)} J_{i,j} X_j)} = \sigma(2 h_i + 2 \sum_{j \in MB(i)} J_{i,j} X_j)$
Compare with the mean-field update $\sigma(2 h_i + 2 \sum_{j \in MB(i)} J_{i,j} \mu_j)$.
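A minimal sketch of this Gibbs update in code (assuming an illustrative field vector h, a symmetric coupling matrix J with zero diagonal, and a toy 3-node chain; none of these numbers come from the slides):

```python
import numpy as np

def gibbs_ising(h, J, n_sweeps=500, rng=None):
    """Gibbs sampling for an Ising model p(x) ∝ exp(sum_i h_i x_i + sum_{(i,j)} J_ij x_i x_j)."""
    rng = rng or np.random.default_rng(0)
    n = len(h)
    x = rng.choice([-1, 1], size=n)            # random initial state in {-1, +1}^n
    for _ in range(n_sweeps):
        for i in range(n):                     # one sweep: update every node once
            # p(x_i = +1 | x_MB(i)) = sigma(2 h_i + 2 sum_j J_ij x_j)
            field = 2 * h[i] + 2 * J[i] @ x    # J[i, i] = 0, so x_i itself does not contribute
            p_plus = 1.0 / (1.0 + np.exp(-field))
            x[i] = 1 if rng.random() < p_plus else -1
    return x

# toy chain x1 - x2 - x3 with positive couplings: sampled states tend to be aligned
h = np.zeros(3)
J = np.array([[0.0, 0.5, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 0.5, 0.0]])
print(gibbs_ising(h, J))
```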

Markov Chain
A sequence of random variables with the Markov property: $P(X^{(t)} \mid X^{(1)}, \ldots, X^{(t-1)}) = P(X^{(t)} \mid X^{(t-1)})$. Its graphical model is the chain $X^{(1)} \to X^{(2)} \to \cdots \to X^{(T-1)} \to X^{(T)}$.
Many applications: language modeling, where $X$ is a word or a character; physics, where with the correct choice of $X$ the world is Markov.

Transition model
We assume a homogeneous chain: $P(X^{(t)} \mid X^{(t-1)}) = P(X^{(t+1)} \mid X^{(t)})$ for all $t$, i.e., the conditional probabilities remain the same across time-steps.
Notation: the conditional probability $P(X^{(t)} = x' \mid X^{(t-1)} = x) = T(x, x')$ is called the transition model; think of it as a matrix $T$. The state-transition diagram corresponds to the transition matrix
$T = \begin{pmatrix} .25 & 0 & .75 \\ 0 & .7 & .3 \\ .5 & .5 & 0 \end{pmatrix}$
Evolving the distribution: $P^{(t+1)}(X' = x') = \sum_{x \in Val(X)} P^{(t)}(X = x)\, T(x, x')$.
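As a quick numerical check (reusing the transition matrix above), repeatedly applying $T^\top$ to an initial distribution evolves it toward the stationary distribution; a sketch:

```python
import numpy as np

# transition matrix from the state-transition diagram: T[x, x'] = P(next = x' | current = x)
T = np.array([[.25, 0.0, .75],
              [0.0, .70, .30],
              [.50, .50, 0.0]])

p = np.array([1.0, 0.0, 0.0])   # P^(0): start in the first state with probability 1
for t in range(50):
    p = T.T @ p                 # P^(t+1)(x') = sum_x P^(t)(x) T(x, x')
print(p)                        # converges to approximately [0.2, 0.5, 0.3]
```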

Markov Chain Monte Carlo (MCMC)
Example: state-transition diagram for a grasshopper random walk, with initial distribution $P^{(0)}(X = 0) = 1$. After $t = 50$ steps the distribution $P^{(t)}(x)$ is almost uniform, about $\frac{1}{9}$ on each state, so we can use the chain to sample from the uniform distribution. Why is it uniform? (mixing; image: Murphy's book)
MCMC generalizes this idea beyond the uniform distribution: given a distribution $P$ we want to sample from, pick the transition model such that $P^{(\infty)}(X) = P(X)$.

Stationary distribution
Given a transition model $T(x, x')$, if the chain converges, $P^{(t+1)}(x) = \sum_{x'} P^{(t)}(x') T(x', x) = P^{(t)}(x)$: the global balance equation. This condition defines the stationary distribution $\pi$:
$\pi(X' = x') = \sum_{x \in Val(X)} \pi(X = x)\, T(x, x')$
Example: finding the stationary distribution of the chain above,
$\pi(x^1) = .25\,\pi(x^1) + .5\,\pi(x^3)$
$\pi(x^2) = .7\,\pi(x^2) + .5\,\pi(x^3)$
$\pi(x^3) = .75\,\pi(x^1) + .3\,\pi(x^2)$
$\pi(x^1) + \pi(x^2) + \pi(x^3) = 1$
which gives $\pi(x^1) = .2$, $\pi(x^2) = .5$, $\pi(x^3) = .3$.

Stationary distribution as an eigenvector
Viewing $T(\cdot, \cdot)$ as a matrix and $P^{(t)}(x)$ as a vector, the evolution of the distribution is $P^{(t+1)} = T^\top P^{(t)}$, and over multiple steps $P^{(t+m)} = (T^\top)^m P^{(t)}$. For the stationary distribution, $\pi = T^\top \pi$: $\pi$ is an eigenvector of $T^\top$ with eigenvalue 1 (produce it using the power method). For the example above one can check that $T^\top (.2, .5, .3)^\top = (.2, .5, .3)^\top$.
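A sketch of extracting $\pi$ directly as the eigenvector of $T^\top$ with eigenvalue 1 (the power iteration in the earlier sketch reaches the same vector):

```python
import numpy as np

T = np.array([[.25, 0.0, .75],
              [0.0, .70, .30],
              [.50, .50, 0.0]])

eigvals, eigvecs = np.linalg.eig(T.T)
k = np.argmin(np.abs(eigvals - 1.0))   # pick the eigenvalue closest to 1
pi = np.real(eigvecs[:, k])
pi = pi / pi.sum()                     # normalize to a probability vector
print(pi)                              # approximately [0.2, 0.5, 0.3]
```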

Stationary distribution: existence & uniqueness
Irreducible: we should be able to reach any $x'$ from any $x$; otherwise $\pi$ is not unique (e.g., two states that each keep themselves with probability 1).
Aperiodic: the chain should not have a fixed cyclic behavior; otherwise the chain does not converge, it oscillates (e.g., two states that deterministically swap, with transition matrix $\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$).
Every aperiodic and irreducible chain (with a finite domain) has a unique limiting distribution $\pi$ such that $\pi(X' = x') = \sum_{x \in Val(X)} \pi(X = x)\, T(x, x')$.
Regular chain: a sufficient condition is that there exists a $K$ such that the probability of reaching any destination from any source in $K$ steps is positive (applies to discrete & continuous domains).
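For a finite chain, the regularity condition can be checked numerically by raising $T$ to successive powers and testing whether some power has all entries positive; a small sketch:

```python
import numpy as np

def is_regular(T, max_K=None):
    """True if some power T^K has all entries strictly positive."""
    n = T.shape[0]
    max_K = max_K or n * n        # n^2 steps suffice as a bound for finite chains
    P = np.eye(n)
    for _ in range(max_K):
        P = P @ T
        if np.all(P > 0):
            return True
    return False

T = np.array([[.25, 0.0, .75],
              [0.0, .70, .30],
              [.50, .50, 0.0]])
periodic = np.array([[0.0, 1.0],
                     [1.0, 0.0]])  # oscillates between two states
print(is_regular(T), is_regular(periodic))   # True False
```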

MCMC in graphical models
Distinguish the "graphical models" involved:
1: the Markov chain $X^{(1)} \to X^{(2)} \to \cdots \to X^{(T)}$, evolving $P^{(0)}(X)$ toward $\pi(X)$;
2: its state-transition diagram (not shown), which has exponentially many nodes: #nodes $= |Val(X)|$;
3: the graphical model from which we want to sample, $P(X)$ over $X = [C, D, I, G, S, L, J, H]$.
Objective: design the Markov chain transition so that $\pi(X) = P(X)$.

Multiple transition models
Idea: have multiple transition models (aka kernels) $T_1(x, x'), T_2(x, x'), \ldots, T_n(x, x')$, each making local changes to $x$. For example, with $x = (x_1, x_2)$, $T_1$ only updates $x_1$; using a single kernel we may not be able to visit all the states, while their combination is "ergodic".
If $\pi(X' = x') = \sum_{x \in Val(X)} \pi(X = x)\, T_k(x, x')$ for each $k$, then we can combine the kernels by
mixing them: $T(x, x') = \sum_k p(k)\, T_k(x, x')$
cycling them: $T(x, x') = \int \cdots \int T_1(x, x^{[1]})\, T_2(x^{[1]}, x^{[2]}) \cdots T_n(x^{[n-1]}, x')\, dx^{[1]} \cdots dx^{[n-1]}$ (a sum in discrete domains).

Revisiting Gibbs sampling
One kernel per variable, performing local, conditional updates: $T_i(x, x') = P(x_i' \mid x_{-i})\, \mathbb{I}(x_{-i}' = x_{-i})$, where $P(x_i \mid x_{-i}) = P(x_i \mid x_{MB(i)})$; cycle the local kernels.
$\pi(X) = P(X)$ is the stationary distribution for this Markov chain. If $P(x) > 0$ for all $x$, then this chain is regular, i.e., it converges to its unique stationary distribution.

Block Gibbs sampling
Local moves can get stuck in modes of $P(X)$: updates using $P(x_1 \mid x_2)$ and $P(x_2 \mid x_1)$ will have problems exploring these modes.
Idea: each kernel updates a block of variables. Example, the Restricted Boltzmann Machine (RBM): update $h$ given $v$ using $P(h_0, h_1, h_2 \mid v_0, \ldots, v_3) = P(h_0 \mid v) \cdots P(h_2 \mid v)$, and update $v$ given $h$ using $P(v_0, \ldots, v_3 \mid h_0, h_1, h_2) = P(v_0 \mid h) \cdots P(v_3 \mid h)$.
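A minimal sketch of block Gibbs sampling in a binary RBM, alternating between sampling all hidden units given the visibles and all visibles given the hiddens (the weights, biases, and sizes below are illustrative, not from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def block_gibbs_rbm(W, b, c, n_steps=1000, rng=None):
    """Block Gibbs for a binary RBM with p(v, h) ∝ exp(v^T W h + b^T v + c^T h)."""
    rng = rng or np.random.default_rng(0)
    v = rng.integers(0, 2, size=len(b)).astype(float)
    h = np.zeros(len(c))
    for _ in range(n_steps):
        # hidden units are conditionally independent given v: P(h_j = 1 | v) = sigmoid(c_j + v @ W[:, j])
        h = (rng.random(len(c)) < sigmoid(c + v @ W)).astype(float)
        # visible units are conditionally independent given h: P(v_i = 1 | h) = sigmoid(b_i + W[i, :] @ h)
        v = (rng.random(len(b)) < sigmoid(b + W @ h)).astype(float)
    return v, h

rng = np.random.default_rng(1)
W = rng.normal(scale=0.5, size=(4, 3))   # illustrative random parameters
b, c = np.zeros(4), np.zeros(3)
print(block_gibbs_rbm(W, b, c, n_steps=100, rng=rng))
```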

Detailed balance
A Markov chain is reversible if, for a unique $\pi$, the detailed balance condition holds: $\pi(x)\, T(x, x') = \pi(x')\, T(x', x)$ for all $x, x'$, i.e., the same frequency of transitions in both directions.
Detailed balance is a stronger condition than global balance: integrating both sides over $x$ gives $\int_x \pi(x)\, T(x, x')\, dx = \int_x \pi(x')\, T(x', x)\, dx = \pi(x') \int_x T(x', x)\, dx = \pi(x')$, which is exactly the global balance equation.
If a Markov chain is regular and $\pi$ satisfies detailed balance, then $\pi$ is the unique stationary distribution (analogous to the theorem for global balance, and checking for detailed balance is easier). The converse does not hold: global balance does not imply detailed balance, e.g., the chain with $\pi = [.4, .4, .2]$ in Murphy's book satisfies global but not detailed balance. What happens if $T$ is symmetric?
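Both conditions are easy to verify numerically for a small chain; a sketch using the 3-state example from earlier (which, as it happens, is reversible; the $\pi = [.4, .4, .2]$ chain in Murphy's book is the one that satisfies only global balance):

```python
import numpy as np

T = np.array([[.25, 0.0, .75],
              [0.0, .70, .30],
              [.50, .50, 0.0]])
pi = np.array([0.2, 0.5, 0.3])

# global balance: pi T = pi
print(np.allclose(pi @ T, pi))        # True

# detailed balance: pi(x) T(x, x') == pi(x') T(x', x) for every pair
flow = pi[:, None] * T                # flow[x, x'] = pi(x) T(x, x')
print(np.allclose(flow, flow.T))      # True: this particular chain is reversible
```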

Using a proposal for the chain
Given $P$, design a chain to sample from $P$. Idea: use a proposal transition $T^q(x, x')$ that we can sample from, such that $T^q$ is a regular chain (reaching every state in $K$ steps has a non-zero probability), and accept each proposed move with a probability $A(x, x')$ chosen to achieve detailed balance.

Metropolis algorithm
If the proposal is symmetric, $T^q(x, x') = T^q(x', x)$, then $A(x, x') = \min(1, \frac{p(x')}{p(x)})$: the move is accepted whenever it increases $P$, and may be accepted otherwise. (image: Wikipedia)
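A minimal random-walk Metropolis sketch with a symmetric Gaussian proposal, targeting an illustrative unnormalized 1D density (a mixture of two Gaussians); since the proposal is symmetric, only the ratio $p(x')/p(x)$ is needed and the normalizer cancels:

```python
import numpy as np

def metropolis(log_p, x0, step=1.0, n_samples=10000, rng=None):
    """Random-walk Metropolis with symmetric proposal x' = x + N(0, step^2)."""
    rng = rng or np.random.default_rng(0)
    x = x0
    samples = []
    for _ in range(n_samples):
        x_prop = x + step * rng.normal()
        # symmetric proposal => A(x, x') = min(1, p(x') / p(x)); compare in log space
        if np.log(rng.random()) < log_p(x_prop) - log_p(x):
            x = x_prop                 # accept the move
        samples.append(x)              # on rejection, the chain stays at (and records) x
    return np.array(samples)

# illustrative target: equal mixture of N(-2, 1) and N(2, 1), known only up to a constant
log_p = lambda x: np.logaddexp(-0.5 * (x + 2) ** 2, -0.5 * (x - 2) ** 2)
s = metropolis(log_p, x0=0.0, step=2.0)
print(s.mean(), s.std())               # mean near 0, std near sqrt(5)
```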

Metropolis-Hastings algorithm
If the proposal is NOT symmetric, then $A(x, x') = \min(1, \frac{p(x')\, T^q(x', x)}{p(x)\, T^q(x, x')})$.
Why does it sample from $P$? The resulting transition model is
$T(x, x') = T^q(x, x')\, A(x, x')$ for $x' \neq x$ (a move to a different state must be accepted),
$T(x, x) = T^q(x, x) + \sum_{x' \neq x} (1 - A(x, x'))\, T^q(x, x')$ (a proposal to stay is always accepted, plus the moves to new states that are rejected).
Substitute this into detailed balance (does it hold?):
$\pi(x)\, T^q(x, x')\, A(x, x') = \pi(x')\, T^q(x', x)\, A(x', x)$
with $A(x, x') = \min(1, \frac{\pi(x')\, T^q(x', x)}{\pi(x)\, T^q(x, x')})$ and $A(x', x) = \min(1, \frac{\pi(x)\, T^q(x, x')}{\pi(x')\, T^q(x', x)})$.
Gibbs sampling is a special case, with $A(x, x') = 1$ all the time!
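To see the construction concretely, a sketch that builds the full Metropolis-Hastings transition matrix for a small discrete target from an arbitrary (non-symmetric) proposal matrix and checks detailed balance and stationarity; the proposal entries are an illustrative choice:

```python
import numpy as np

pi = np.array([0.2, 0.5, 0.3])          # target distribution
Tq = np.array([[0.1, 0.6, 0.3],         # arbitrary non-symmetric proposal T^q(x, x')
               [0.3, 0.3, 0.4],
               [0.5, 0.4, 0.1]])

# A(x, x') = min(1, pi(x') Tq(x', x) / (pi(x) Tq(x, x')))
A = np.minimum(1.0, (pi[None, :] * Tq.T) / (pi[:, None] * Tq))
T = Tq * A                              # off-diagonal: propose, then accept
np.fill_diagonal(T, 0.0)
np.fill_diagonal(T, 1.0 - T.sum(axis=1))   # stay with the remaining probability

flow = pi[:, None] * T
print(np.allclose(flow, flow.T))        # detailed balance holds
print(np.allclose(pi @ T, pi))          # hence pi is the stationary distribution
```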

Sampling from the chain
In the limit $T \to \infty$, $P^{(T)} = \pi = P$. How long should we wait for $D(P^{(T)}, \pi) < \epsilon$? The mixing time is $N_\epsilon \leq O(\frac{1}{1 - \lambda_2} \log \frac{\#\text{states}}{\epsilon})$, where $\lambda_2$ is the second-largest eigenvalue of $T$ and the number of states is exponential.
In practice: run the chain for a burn-in period ($T$ steps), then collect samples (a few more steps); multiple restarts can ensure a better coverage.
Example: Potts model with $|Val(X)| = 5$ on a 128x128 grid, $p(x) \propto \exp(\sum_i h(x_i) + \sum_{(i,j) \in E} .66\, \mathbb{I}(x_i = x_j))$; Gibbs sampling, states shown as different colors, after 200 iterations and after 10,000 iterations (image: Murphy's book).
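For a small chain the quantities in this bound can be computed directly; a sketch computing the second-largest eigenvalue magnitude of the 3-state chain used earlier and the corresponding bound:

```python
import numpy as np

T = np.array([[.25, 0.0, .75],
              [0.0, .70, .30],
              [.50, .50, 0.0]])

eigvals = np.linalg.eigvals(T)
mags = np.sort(np.abs(eigvals))[::-1]   # |lambda_1| = 1 >= |lambda_2| >= ...
lam2 = mags[1]
n_states, eps = 3, 1e-3
print(lam2)                                           # second-largest eigenvalue magnitude
print(1.0 / (1.0 - lam2) * np.log(n_states / eps))    # mixing-time bound, up to a constant
```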

Diagnosing convergence
A difficult problem; heuristics for diagnosing non-convergence: run multiple chains and compare sample statistics; check the auto-correlation within each chain.
Example: sampling from a mixture of two 1D Gaussians (3 chains, shown in different colors), using Metropolis-Hastings (MH) with increasing step sizes for the proposal, and Gibbs sampling.
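The within-chain diagnostic, auto-correlation, is simple to compute from the sample path; a sketch (applied here to i.i.d. noise as a baseline; values near 1 at large lags would indicate a slowly mixing chain):

```python
import numpy as np

def autocorrelation(samples, max_lag=50):
    """Sample auto-correlation of a 1D chain at lags 0..max_lag."""
    x = samples - samples.mean()
    var = np.mean(x * x)
    return np.array([np.mean(x[:len(x) - k] * x[k:]) / var
                     for k in range(max_lag + 1)])

rng = np.random.default_rng(0)
print(autocorrelation(rng.normal(size=1000))[:5])   # i.i.d. noise: near zero beyond lag 0
```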

Summary
Markov chain: can model the "evolution" of an initial distribution; converges to a stationary distribution.
Markov chain Monte Carlo: design a Markov chain whose stationary distribution is the distribution we want to sample from; run the chain to produce samples.
Two MCMC methods: Gibbs sampling and Metropolis-Hastings.