Graphical Models: Markov Chain Monte Carlo Inference. Siamak Ravanbakhsh, Winter 2018
Learning objectives: Markov chains; the idea behind Markov Chain Monte Carlo (MCMC); two important examples: Gibbs sampling and the Metropolis-Hastings algorithm.
Problem with likelihood weighting (recap): use a topological ordering; sample each variable conditioned on its parents; if a variable is observed, keep the observed value and update the weight. Issues: observing the child does not affect the parent's assignment, and the method only applies to Bayes-nets.
Idea: Gibbs sampling. Iteratively sample each variable conditioned on its Markov blanket, $X_i \sim p(x_i \mid X_{MB(i)})$; if $X_i$ is observed, keep the observed value. This is equivalent to first simplifying the model by removing the observed variables and sampling from the simplified Gibbs distribution. After many Gibbs sampling iterations, $X \sim P$.
Example: Ising model. Recall the Ising model: $p(x) \propto \exp(\sum_i h_i x_i + \sum_{(i,j) \in E} J_{i,j} x_i x_j)$ with $x_i \in \{-1, +1\}$. Sample each node i:
$p(x_i = +1 \mid X_{MB(i)}) = \frac{\exp(h_i + \sum_{j \in MB(i)} J_{i,j} X_j)}{\exp(h_i + \sum_{j \in MB(i)} J_{i,j} X_j) + \exp(-h_i - \sum_{j \in MB(i)} J_{i,j} X_j)} = \sigma(2h_i + 2\sum_{j \in MB(i)} J_{i,j} X_j)$
Compare with the mean-field update $\sigma(2h_i + 2\sum_{j \in MB(i)} J_{i,j}\mu_j)$: Gibbs plugs in sampled neighbour values $X_j$ where mean-field plugs in their means $\mu_j$.
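The conditional above turns into a sampler in a few lines. A minimal sketch (the model here, a 2-node chain with hand-picked fields and couplings, is an illustrative assumption, not from the slides):

```python
import math
import random

def gibbs_ising(h, J, neighbors, n_sweeps, seed=0):
    """One-variable-at-a-time Gibbs sampler for an Ising model.

    h[i]: field on node i; J[(i, j)]: coupling, keyed with i < j;
    neighbors[i]: nodes adjacent to i.  States are -1/+1.
    """
    rng = random.Random(seed)
    n = len(h)
    x = [rng.choice([-1, +1]) for _ in range(n)]
    for _ in range(n_sweeps):
        for i in range(n):
            # p(x_i = +1 | x_MB(i)) = sigma(2 h_i + 2 sum_j J_ij x_j)
            s = sum(J[(min(i, j), max(i, j))] * x[j] for j in neighbors[i])
            p_plus = 1.0 / (1.0 + math.exp(-2.0 * (h[i] + s)))
            x[i] = +1 if rng.random() < p_plus else -1
    return x
```

With a strong positive field on both nodes, the sampler almost always lands on the all-(+1) configuration.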
Markov Chain: a sequence of random variables with the Markov property, $P(X^{(t)} \mid X^{(1)}, \dots, X^{(t-1)}) = P(X^{(t)} \mid X^{(t-1)})$. Its graphical model is a chain $X^{(1)} \to X^{(2)} \to \dots \to X^{(T-1)} \to X^{(T)}$. Many applications: language modeling, where $X$ is a word or a character; physics, where with the correct choice of $X$ the world is Markov.
Transition model. We assume a homogeneous chain: $P(X^{(t)} \mid X^{(t-1)}) = P(X^{(t+1)} \mid X^{(t)})$ for all $t$, i.e., the conditional probabilities remain the same across time-steps. Notation: the conditional probability $P(X^{(t)} = x' \mid X^{(t-1)} = x) = T(x, x')$ is called the transition model; think of it as a matrix $T$. Example state-transition diagram and its transition matrix:
$T = \begin{pmatrix} .25 & 0 & .75 \\ 0 & .7 & .3 \\ .5 & .5 & 0 \end{pmatrix}$
Evolving the distribution: $P^{(t+1)}(X' = x') = \sum_{x \in Val(X)} P^{(t)}(X = x)\, T(x, x')$.
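The evolution rule is just a vector-matrix product; a minimal sketch using the 3-state transition matrix above (function name is mine):

```python
# Rows are "from" states, columns "to": T[x][xp] = T(x, x').
T = [[0.25, 0.0, 0.75],
     [0.0,  0.7, 0.3],
     [0.5,  0.5, 0.0]]

def step(p, T):
    """One step of P^(t+1)(x') = sum_x P^(t)(x) T(x, x')."""
    n = len(T)
    return [sum(p[x] * T[x][xp] for x in range(n)) for xp in range(n)]

p = [1.0, 0.0, 0.0]      # start deterministically in state x1
for _ in range(100):
    p = step(p, T)
# p is now very close to the stationary distribution (.2, .5, .3)
```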
Markov Chain Monte Carlo (MCMC). Example: the state-transition diagram for a grasshopper's random walk, with initial distribution $P^{(0)}(X = 0) = 1$. After t = 50 steps, the distribution $P^{(t)}(x)$ is almost uniform, $\approx 1/9$ on each of the 9 states, so we can use the chain to sample from the uniform distribution. Why is it uniform? (mixing image: Murphy's book)
MCMC generalizes this idea beyond the uniform distribution: given a target $P$ we want to sample from, pick the transition model such that the limiting distribution satisfies $P^{(\infty)}(X) = P(X)$.
Stationary distribution. Given a transition model $T(x, x')$, if the chain converges then $P^{(t+1)}(x) \approx P^{(t)}(x) = \sum_{x'} P^{(t)}(x')\, T(x', x)$, the global balance equation. This condition defines the stationary distribution $\pi$: $\pi(X = x) = \sum_{x' \in Val(X)} \pi(X = x')\, T(x', x)$.
Example: finding the stationary distribution.
$\pi(x^1) = .25\,\pi(x^1) + .5\,\pi(x^3)$
$\pi(x^2) = .7\,\pi(x^2) + .5\,\pi(x^3)$
$\pi(x^3) = .75\,\pi(x^1) + .3\,\pi(x^2)$
$\pi(x^1) + \pi(x^2) + \pi(x^3) = 1$
Solving gives $\pi(x^1) = .2$, $\pi(x^2) = .5$, $\pi(x^3) = .3$.
Stationary distribution as an eigenvector. Viewing $T(\cdot, \cdot)$ as a matrix and $P^{(t)}(x)$ as a vector, the evolution of the distribution is $P^{(t+1)} = T^\top P^{(t)}$; over multiple steps, $P^{(t+m)} = (T^\top)^m P^{(t)}$. For the stationary distribution, $\pi = T^\top \pi$: $\pi$ is an eigenvector of $T^\top$ with eigenvalue 1 (produce it using the power method). For the example above, $T^\top \pi = \pi$ with $\pi = (.2, .5, .3)$.
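The power method the slide mentions takes only a few lines: repeatedly apply $T^\top$ to any positive start vector and renormalize. A sketch (function name and tolerances are mine):

```python
def power_method_stationary(T, tol=1e-12, max_iter=10_000):
    """Find pi with pi = T^T pi (eigenvector of T^T, eigenvalue 1)
    by repeatedly applying the transpose of the transition matrix."""
    n = len(T)
    pi = [1.0 / n] * n                  # any positive start vector works
    for _ in range(max_iter):
        new = [sum(pi[x] * T[x][xp] for x in range(n)) for xp in range(n)]
        s = sum(new)
        new = [v / s for v in new]      # renormalise to guard against drift
        if max(abs(a - b) for a, b in zip(new, pi)) < tol:
            return new
        pi = new
    return pi

T = [[0.25, 0.0, 0.75],
     [0.0,  0.7, 0.3],
     [0.5,  0.5, 0.0]]
pi = power_method_stationary(T)         # ~ (.2, .5, .3), as solved by hand
```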
Stationary distribution: existence & uniqueness.
Irreducible: we should be able to reach any x' from any x; otherwise $\pi$ is not unique.
Aperiodic: the chain should not have a fixed cyclic behavior; otherwise the chain does not converge (it oscillates).
Every aperiodic and irreducible chain (with a finite domain) has a unique limiting distribution $\pi$ such that $\pi(X = x) = \sum_{x' \in Val(X)} \pi(X = x')\, T(x', x)$.
Regular chain, a sufficient condition: there exists a K such that the probability of reaching any destination from any source in K steps is positive (this applies to discrete & continuous domains).
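For a small discrete chain the regularity condition can be checked mechanically: some power $T^K$ must be entrywise positive. A sketch (function name is mine; the cap on K relies on the known fact that $K \le (n-1)^2 + 1$ suffices for a primitive matrix):

```python
def is_regular(T, max_K=None):
    """Return True if some power T^K has all entries > 0,
    a sufficient condition for a unique limiting distribution."""
    n = len(T)
    if max_K is None:
        max_K = n * n          # >= (n-1)^2 + 1, enough for any regular chain
    P = [row[:] for row in T]
    for _ in range(max_K):
        if all(p > 0 for row in P for p in row):
            return True
        # P <- P @ T
        P = [[sum(P[i][k] * T[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
    return False

T = [[0.25, 0.0, 0.75],        # chain from the earlier slides: regular
     [0.0,  0.7, 0.3],
     [0.5,  0.5, 0.0]]
```

A deterministic 2-cycle, by contrast, is periodic: its powers alternate forever and no power is entrywise positive.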
MCMC in graphical models: distinguishing the "graphical models" involved.
1: the Markov chain $X^{(1)} \to X^{(2)} \to \dots \to X^{(T)}$, whose distribution evolves $P^{(0)}(X) \to \dots \to \pi(X)$.
2: the state-transition diagram (not shown), which has exponentially many nodes: #nodes = $|Val(X)|$.
3: the graphical model from which we want to sample, $P(X)$ with $X = [C, D, I, G, S, L, J, H]$.
Objective: design the Markov chain transition so that $\pi(X) = P(X)$.
Multiple transition models. Idea: have multiple transition models $T^1(x, x'), T^2(x, x'), \dots, T^n(x, x')$, aka kernels, each making local changes to x. For example, with $x = (x_1, x_2)$, kernel $T^1$ only updates $x_1$. Using a single kernel we may not be able to visit all the states, while their combination is "ergodic". If each kernel leaves $\pi$ stationary, $\pi(X = x) = \sum_{x' \in Val(X)} \pi(X = x')\, T^k(x', x)$ for every k, then we can combine the kernels by:
mixing them: $T(x, x') = \sum_k p(k)\, T^k(x, x')$
cycling them: $T(x, x') = \sum_{x^{[1]}, \dots, x^{[n-1]}} T^1(x, x^{[1]})\, T^2(x^{[1]}, x^{[2]}) \cdots T^n(x^{[n-1]}, x')$ (integrals in place of sums for continuous domains).
Revisiting Gibbs sampling: one kernel for each variable i, performing local, conditional updates: $T^i(x, x') = P(x'_i \mid x_{-i})\, \mathbb{I}(x'_{-i} = x_{-i})$, where $P(x_i \mid x_{-i}) = P(x_i \mid x_{MB(i)})$. Cycle the local kernels; $\pi(X) = P(X)$ is the stationary distribution for this Markov chain. If $P(x) > 0$ for all x, then this chain is regular, i.e., it converges to its unique stationary distribution.
Block Gibbs sampling. Local moves can get stuck in modes of $P(X)$: updates using $P(x_1 \mid x_2)$ and $P(x_2 \mid x_1)$ will have problems exploring these modes. Idea: each kernel updates a block of variables. Example: Restricted Boltzmann Machine (RBM). Update h given v: $P(h_0, h_1, h_2 \mid v_0, \dots, v_3) = P(h_0 \mid v) \cdots P(h_2 \mid v)$; update v given h: $P(v_0, \dots, v_3 \mid h_0, h_1, h_2) = P(v_0 \mid h) \cdots P(v_3 \mid h)$.
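One block-Gibbs sweep for a binary RBM can be sketched as follows; because each conditional factorizes, a whole block is sampled in one shot (the weights, sizes, and function name below are illustrative assumptions):

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def block_gibbs_rbm(W, b_v, b_h, n_steps, seed=0):
    """Block Gibbs for a binary RBM: P(h | v) and P(v | h) both factorize,
    so each block is resampled jointly.  W[i][j] couples v_i and h_j."""
    rng = random.Random(seed)
    n_v, n_h = len(b_v), len(b_h)
    v = [rng.randint(0, 1) for _ in range(n_v)]
    for _ in range(n_steps):
        # sample the whole hidden block given v:  P(h_j = 1 | v)
        h = [int(rng.random() < sigmoid(b_h[j] + sum(W[i][j] * v[i] for i in range(n_v))))
             for j in range(n_h)]
        # sample the whole visible block given h:  P(v_i = 1 | h)
        v = [int(rng.random() < sigmoid(b_v[i] + sum(W[i][j] * h[j] for j in range(n_h))))
             for i in range(n_v)]
    return v
```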
Detailed balance. A Markov chain is reversible for a unique $\pi$ if it satisfies the detailed balance condition $\pi(x)\, T(x, x') = \pi(x')\, T(x', x)$ for all $x, x'$: transitions carry the same frequency in both directions. Detailed balance implies global balance: summing both sides over x, the left-hand side gives $\sum_x \pi(x)\, T(x, x')$, and the right-hand side gives $\sum_x \pi(x')\, T(x', x) = \pi(x') \sum_x T(x', x) = \pi(x')$, which is exactly the global balance equation. Detailed balance is a stronger condition: e.g., a chain with $\pi = [.4, .4, .2]$ can satisfy global balance without detailed balance (example: Murphy's book). If the Markov chain is regular and $\pi$ satisfies detailed balance, then $\pi$ is the unique stationary distribution; this is analogous to the theorem for global balance, and checking for detailed balance is easier. What happens if T is symmetric? (Then the uniform distribution satisfies detailed balance, so the chain's stationary distribution is uniform.)
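Detailed balance is easy to check numerically, and incidentally the 3-state chain from the earlier slides happens to be reversible: both directions carry flow .15 between $x^1 \leftrightarrow x^3$ and $x^2 \leftrightarrow x^3$, and no flow between $x^1 \leftrightarrow x^2$. A sketch (function name is mine):

```python
def satisfies_detailed_balance(T, pi, tol=1e-12):
    """Check pi(x) T(x, x') == pi(x') T(x', x) for every pair of states."""
    n = len(T)
    return all(abs(pi[x] * T[x][xp] - pi[xp] * T[xp][x]) <= tol
               for x in range(n) for xp in range(n))

T = [[0.25, 0.0, 0.75],
     [0.0,  0.7, 0.3],
     [0.5,  0.5, 0.0]]
pi = [0.2, 0.5, 0.3]
```

A deterministic 3-cycle shows the gap between the two conditions: the uniform distribution satisfies global balance, but all flow runs one way around the cycle, so detailed balance fails.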
Using a proposal for the chain. Given a target P, design a chain to sample from P. Idea: use a proposal transition $T^q(x, x')$ such that we can sample from $T^q(x, \cdot)$ and $T^q$ is a regular chain (reaching every state in K steps has a non-zero probability); then accept the proposed move with a probability $A(x, x')$ chosen to achieve detailed balance.
Metropolis algorithm. Use a proposal transition $T^q(x, x')$ that we can sample from and that gives a regular chain; accept the proposed move with probability $A(x, x')$ to achieve detailed balance. When the proposal is symmetric, $T^q(x, x') = T^q(x', x)$, use $A(x, x') = \min\left(1, \frac{p(x')}{p(x)}\right)$: the algorithm always accepts a move that increases P, and may accept it otherwise. (image: Wikipedia)
Metropolis-Hastings algorithm. If the proposal is NOT symmetric, then $A(x, x') = \min\left(1, \frac{p(x')\, T^q(x', x)}{p(x)\, T^q(x, x')}\right)$.
Why does it sample from P? The effective transition model is:
$T(x, x') = T^q(x, x')\, A(x, x')$ for $x \neq x'$ (a move to a different state is accepted);
$T(x, x) = T^q(x, x) + \sum_{x' \neq x} (1 - A(x, x'))\, T^q(x, x')$ (a proposal to stay is always accepted; a move to a new state is rejected).
Substitute this into detailed balance (does it hold?):
$\pi(x)\, T^q(x, x')\, A(x, x') = \pi(x')\, T^q(x', x)\, A(x', x)$
i.e., $\pi(x)\, T^q(x, x') \min\left(1, \frac{\pi(x')\, T^q(x', x)}{\pi(x)\, T^q(x, x')}\right) \stackrel{?}{=} \pi(x')\, T^q(x', x) \min\left(1, \frac{\pi(x)\, T^q(x, x')}{\pi(x')\, T^q(x', x)}\right)$
Both sides equal $\min\big(\pi(x)\, T^q(x, x'),\; \pi(x')\, T^q(x', x)\big)$, so detailed balance holds.
Gibbs sampling is a special case, with $A(x, x') = 1$ all the time!
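A minimal Metropolis-Hastings sketch, using a symmetric Gaussian random-walk proposal so the $T^q$ terms cancel and the acceptance rule reduces to the Metropolis form; the target here, an unnormalized Gaussian, is an illustrative assumption:

```python
import math
import random

def metropolis_hastings(log_p_tilde, x0, step, n_samples, seed=0):
    """MH with a symmetric Gaussian random-walk proposal, so
    A = min(1, p~(x') / p~(x)); p~ only needs to be unnormalized."""
    rng = random.Random(seed)
    x = x0
    samples = []
    for _ in range(n_samples):
        xp = x + rng.gauss(0.0, step)                # propose
        log_a = log_p_tilde(xp) - log_p_tilde(x)
        if rng.random() < math.exp(min(0.0, log_a)): # accept w.p. min(1, ratio)
            x = xp
        samples.append(x)
    return samples

# target: unnormalized N(3, 1), i.e. log p~(x) = -(x - 3)^2 / 2
samples = metropolis_hastings(lambda x: -0.5 * (x - 3.0) ** 2,
                              x0=0.0, step=1.0, n_samples=20_000)
burn = samples[5_000:]                               # discard burn-in
mean = sum(burn) / len(burn)                         # should be near 3
```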
Sampling from the chain. In the limit $T \to \infty$, $P^{(T)} = \pi = P$. How long should we wait for $D(P^{(T)}, \pi) < \epsilon$? The mixing time is $N_\epsilon \in O\left(\frac{1}{1 - \lambda_2} \log \frac{\#states}{\epsilon}\right)$, where $\lambda_2$ is the 2nd largest eigenvalue of T and #states is exponential in the number of variables. In practice: run the chain for a burn-in period (T steps), then collect samples (a few more steps); multiple restarts can ensure a better coverage.
Example: Potts model with $|Val(X_i)| = 5$ on a 128x128 grid, $p(x) \propto \exp(\sum_i h(x_i) + \sum_{(i,j) \in E} .66\, \mathbb{I}(x_i = x_j))$; Gibbs sampling, states shown as different colors, after 200 iterations and after 10,000 iterations. (image: Murphy's book)
Diagnosing convergence is a difficult problem; there are heuristics for diagnosing non-convergence: run multiple chains and compare sample statistics across them; measure the auto-correlation within each chain. Example: sampling from a mixture of two 1D Gaussians (3 chains, shown in different colors), comparing Metropolis-Hastings (MH) with increasing step sizes for the proposal against Gibbs sampling.
Summary.
Markov Chain: can model the "evolution" of an initial distribution; converges to a stationary distribution.
Markov Chain Monte Carlo: design a Markov chain whose stationary distribution is what we want to sample from; run the chain to produce samples.
Two MCMC methods: Gibbs sampling and Metropolis-Hastings.