Graphical Models: Markov Chain Monte Carlo Inference. Siamak Ravanbakhsh, Winter 2018
Learning objectives: Markov chains; the idea behind Markov Chain Monte Carlo (MCMC); two important examples: Gibbs sampling and the Metropolis-Hastings algorithm.
Problem with likelihood weighting (recap): use a topological ordering; sample each variable conditioned on its parents; if a variable is observed, keep the observed value and update the weight. Issues: observing the child does not affect the parent's assignment, and the method only applies to Bayes-nets.
Idea: Gibbs sampling. Iteratively sample each variable conditioned on its Markov blanket, $X_i \sim p(x_i \mid X_{MB(i)})$; if $X_i$ is observed, keep the observed value. This is equivalent to first simplifying the model by removing the observed variables and sampling from the simplified Gibbs distribution. After many Gibbs sampling iterations, $X \sim P$.
Example: Ising model. Recall the Ising model: $p(x) \propto \exp(\sum_i h_i x_i + \sum_{(i,j) \in E} J_{i,j} x_i x_j)$ with $x_i \in \{-1, +1\}$. Sample each node i:
$p(x_i = +1 \mid X_{MB(i)}) = \frac{\exp(h_i + \sum_{j \in MB(i)} J_{i,j} X_j)}{\exp(h_i + \sum_{j \in MB(i)} J_{i,j} X_j) + \exp(-h_i - \sum_{j \in MB(i)} J_{i,j} X_j)} = \sigma(2h_i + 2\sum_{j \in MB(i)} J_{i,j} X_j)$
Compare with the mean-field update $\sigma(2h_i + 2\sum_{j \in MB(i)} J_{i,j}\mu_j)$: Gibbs plugs in sampled neighbour values $X_j$ where mean-field plugs in their means $\mu_j$.
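The conditional above turns into a sampler in a few lines. A minimal sketch (the model here, a 2-node chain with hand-picked fields and couplings, is an illustrative assumption, not from the slides):

```python
import math
import random

def gibbs_ising(h, J, neighbors, n_sweeps, seed=0):
    """One-variable-at-a-time Gibbs sampler for an Ising model.

    h[i]: field on node i; J[(i, j)]: coupling, keyed with i < j;
    neighbors[i]: nodes adjacent to i.  States are -1/+1.
    """
    rng = random.Random(seed)
    n = len(h)
    x = [rng.choice([-1, +1]) for _ in range(n)]
    for _ in range(n_sweeps):
        for i in range(n):
            # p(x_i = +1 | x_MB(i)) = sigma(2 h_i + 2 sum_j J_ij x_j)
            s = sum(J[(min(i, j), max(i, j))] * x[j] for j in neighbors[i])
            p_plus = 1.0 / (1.0 + math.exp(-2.0 * (h[i] + s)))
            x[i] = +1 if rng.random() < p_plus else -1
    return x
```

With a strong positive field on both nodes, the sampler almost always lands on the all-(+1) configuration.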
Markov Chain: a sequence of random variables with the Markov property, $P(X^{(t)} \mid X^{(1)}, \dots, X^{(t-1)}) = P(X^{(t)} \mid X^{(t-1)})$. Its graphical model is a chain $X^{(1)} \to X^{(2)} \to \dots \to X^{(T-1)} \to X^{(T)}$. Many applications: language modeling, where $X$ is a word or a character; physics, where with the correct choice of $X$ the world is Markov.
Transition model. We assume a homogeneous chain: $P(X^{(t)} \mid X^{(t-1)}) = P(X^{(t+1)} \mid X^{(t)})$ for all $t$, i.e., the conditional probabilities remain the same across time-steps. Notation: the conditional probability $P(X^{(t)} = x' \mid X^{(t-1)} = x) = T(x, x')$ is called the transition model; think of it as a matrix $T$. Example state-transition diagram and its transition matrix:
$T = \begin{pmatrix} .25 & 0 & .75 \\ 0 & .7 & .3 \\ .5 & .5 & 0 \end{pmatrix}$
Evolving the distribution: $P^{(t+1)}(X' = x') = \sum_{x \in Val(X)} P^{(t)}(X = x)\, T(x, x')$.
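The evolution rule is just a vector-matrix product; a minimal sketch using the 3-state transition matrix above (function name is mine):

```python
# Rows are "from" states, columns "to": T[x][xp] = T(x, x').
T = [[0.25, 0.0, 0.75],
     [0.0,  0.7, 0.3],
     [0.5,  0.5, 0.0]]

def step(p, T):
    """One step of P^(t+1)(x') = sum_x P^(t)(x) T(x, x')."""
    n = len(T)
    return [sum(p[x] * T[x][xp] for x in range(n)) for xp in range(n)]

p = [1.0, 0.0, 0.0]      # start deterministically in state x1
for _ in range(100):
    p = step(p, T)
# p is now very close to the stationary distribution (.2, .5, .3)
```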
Markov Chain Monte Carlo (MCMC). Example: the state-transition diagram for a grasshopper's random walk, with initial distribution $P^{(0)}(X = 0) = 1$. After t = 50 steps, the distribution $P^{(t)}(x)$ is almost uniform, $\approx 1/9$ on each of the 9 states, so we can use the chain to sample from the uniform distribution. Why is it uniform? (mixing image: Murphy's book)
MCMC generalizes this idea beyond the uniform distribution: given a target $P$ we want to sample from, pick the transition model such that the limiting distribution satisfies $P^{(\infty)}(X) = P(X)$.
Stationary distribution. Given a transition model $T(x, x')$, if the chain converges then $P^{(t+1)}(x) \approx P^{(t)}(x) = \sum_{x'} P^{(t)}(x')\, T(x', x)$, the global balance equation. This condition defines the stationary distribution $\pi$: $\pi(X = x) = \sum_{x' \in Val(X)} \pi(X = x')\, T(x', x)$.
Example: finding the stationary distribution.
$\pi(x^1) = .25\,\pi(x^1) + .5\,\pi(x^3)$
$\pi(x^2) = .7\,\pi(x^2) + .5\,\pi(x^3)$
$\pi(x^3) = .75\,\pi(x^1) + .3\,\pi(x^2)$
$\pi(x^1) + \pi(x^2) + \pi(x^3) = 1$
Solving gives $\pi(x^1) = .2$, $\pi(x^2) = .5$, $\pi(x^3) = .3$.
Stationary distribution as an eigenvector. Viewing $T(\cdot, \cdot)$ as a matrix and $P^{(t)}(x)$ as a vector, the evolution of the distribution is $P^{(t+1)} = T^\top P^{(t)}$; over multiple steps, $P^{(t+m)} = (T^\top)^m P^{(t)}$. For the stationary distribution, $\pi = T^\top \pi$: $\pi$ is an eigenvector of $T^\top$ with eigenvalue 1 (produce it using the power method). For the example above, $T^\top \pi = \pi$ with $\pi = (.2, .5, .3)$.
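The power method the slide mentions takes only a few lines: repeatedly apply $T^\top$ to any positive start vector and renormalize. A sketch (function name and tolerances are mine):

```python
def power_method_stationary(T, tol=1e-12, max_iter=10_000):
    """Find pi with pi = T^T pi (eigenvector of T^T, eigenvalue 1)
    by repeatedly applying the transpose of the transition matrix."""
    n = len(T)
    pi = [1.0 / n] * n                  # any positive start vector works
    for _ in range(max_iter):
        new = [sum(pi[x] * T[x][xp] for x in range(n)) for xp in range(n)]
        s = sum(new)
        new = [v / s for v in new]      # renormalise to guard against drift
        if max(abs(a - b) for a, b in zip(new, pi)) < tol:
            return new
        pi = new
    return pi

T = [[0.25, 0.0, 0.75],
     [0.0,  0.7, 0.3],
     [0.5,  0.5, 0.0]]
pi = power_method_stationary(T)         # ~ (.2, .5, .3), as solved by hand
```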
Stationary distribution: existence & uniqueness.
Irreducible: we should be able to reach any x' from any x; otherwise $\pi$ is not unique.
Aperiodic: the chain should not have a fixed cyclic behavior; otherwise the chain does not converge (it oscillates).
Every aperiodic and irreducible chain (with a finite domain) has a unique limiting distribution $\pi$ such that $\pi(X = x) = \sum_{x' \in Val(X)} \pi(X = x')\, T(x', x)$.
Regular chain, a sufficient condition: there exists a K such that the probability of reaching any destination from any source in K steps is positive (this applies to discrete & continuous domains).
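For a small discrete chain the regularity condition can be checked mechanically: some power $T^K$ must be entrywise positive. A sketch (function name is mine; the cap on K relies on the known fact that $K \le (n-1)^2 + 1$ suffices for a primitive matrix):

```python
def is_regular(T, max_K=None):
    """Return True if some power T^K has all entries > 0,
    a sufficient condition for a unique limiting distribution."""
    n = len(T)
    if max_K is None:
        max_K = n * n          # >= (n-1)^2 + 1, enough for any regular chain
    P = [row[:] for row in T]
    for _ in range(max_K):
        if all(p > 0 for row in P for p in row):
            return True
        # P <- P @ T
        P = [[sum(P[i][k] * T[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
    return False

T = [[0.25, 0.0, 0.75],        # chain from the earlier slides: regular
     [0.0,  0.7, 0.3],
     [0.5,  0.5, 0.0]]
```

A deterministic 2-cycle, by contrast, is periodic: its powers alternate forever and no power is entrywise positive.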
MCMC in graphical models: distinguishing the "graphical models" involved.
1: the Markov chain $X^{(1)} \to X^{(2)} \to \dots \to X^{(T)}$, whose distribution evolves $P^{(0)}(X) \to \dots \to \pi(X)$.
2: the state-transition diagram (not shown), which has exponentially many nodes: #nodes = $|Val(X)|$.
3: the graphical model from which we want to sample, $P(X)$ with $X = [C, D, I, G, S, L, J, H]$.
Objective: design the Markov chain transition so that $\pi(X) = P(X)$.
Multiple transition models. Idea: have multiple transition models $T^1(x, x'), T^2(x, x'), \dots, T^n(x, x')$, aka kernels, each making local changes to x. For example, with $x = (x_1, x_2)$, kernel $T^1$ only updates $x_1$. Using a single kernel we may not be able to visit all the states, while their combination is "ergodic". If each kernel leaves $\pi$ stationary, $\pi(X = x) = \sum_{x' \in Val(X)} \pi(X = x')\, T^k(x', x)$ for every k, then we can combine the kernels by:
mixing them: $T(x, x') = \sum_k p(k)\, T^k(x, x')$
cycling them: $T(x, x') = \sum_{x^{[1]}, \dots, x^{[n-1]}} T^1(x, x^{[1]})\, T^2(x^{[1]}, x^{[2]}) \cdots T^n(x^{[n-1]}, x')$ (integrals in place of sums for continuous domains).
Revisiting Gibbs sampling: one kernel for each variable i, performing local, conditional updates: $T^i(x, x') = P(x'_i \mid x_{-i})\, \mathbb{I}(x'_{-i} = x_{-i})$, where $P(x_i \mid x_{-i}) = P(x_i \mid x_{MB(i)})$. Cycle the local kernels; $\pi(X) = P(X)$ is the stationary distribution for this Markov chain. If $P(x) > 0$ for all x, then this chain is regular, i.e., it converges to its unique stationary distribution.
Block Gibbs sampling. Local moves can get stuck in modes of $P(X)$: updates using $P(x_1 \mid x_2)$ and $P(x_2 \mid x_1)$ will have problems exploring these modes. Idea: each kernel updates a block of variables. Example: Restricted Boltzmann Machine (RBM). Update h given v: $P(h_0, h_1, h_2 \mid v_0, \dots, v_3) = P(h_0 \mid v) \cdots P(h_2 \mid v)$; update v given h: $P(v_0, \dots, v_3 \mid h_0, h_1, h_2) = P(v_0 \mid h) \cdots P(v_3 \mid h)$.
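One block-Gibbs sweep for a binary RBM can be sketched as follows; because each conditional factorizes, a whole block is sampled in one shot (the weights, sizes, and function name below are illustrative assumptions):

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def block_gibbs_rbm(W, b_v, b_h, n_steps, seed=0):
    """Block Gibbs for a binary RBM: P(h | v) and P(v | h) both factorize,
    so each block is resampled jointly.  W[i][j] couples v_i and h_j."""
    rng = random.Random(seed)
    n_v, n_h = len(b_v), len(b_h)
    v = [rng.randint(0, 1) for _ in range(n_v)]
    for _ in range(n_steps):
        # sample the whole hidden block given v:  P(h_j = 1 | v)
        h = [int(rng.random() < sigmoid(b_h[j] + sum(W[i][j] * v[i] for i in range(n_v))))
             for j in range(n_h)]
        # sample the whole visible block given h:  P(v_i = 1 | h)
        v = [int(rng.random() < sigmoid(b_v[i] + sum(W[i][j] * h[j] for j in range(n_h))))
             for i in range(n_v)]
    return v
```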
Detailed balance. A Markov chain is reversible for a unique $\pi$ if it satisfies the detailed balance condition $\pi(x)\, T(x, x') = \pi(x')\, T(x', x)$ for all $x, x'$: transitions carry the same frequency in both directions. Detailed balance implies global balance: summing both sides over x, the left-hand side gives $\sum_x \pi(x)\, T(x, x')$, and the right-hand side gives $\sum_x \pi(x')\, T(x', x) = \pi(x') \sum_x T(x', x) = \pi(x')$, which is exactly the global balance equation. Detailed balance is a stronger condition: e.g., a chain with $\pi = [.4, .4, .2]$ can satisfy global balance without detailed balance (example: Murphy's book). If the Markov chain is regular and $\pi$ satisfies detailed balance, then $\pi$ is the unique stationary distribution; this is analogous to the theorem for global balance, and checking for detailed balance is easier. What happens if T is symmetric? (Then the uniform distribution satisfies detailed balance, so the chain's stationary distribution is uniform.)
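Detailed balance is easy to check numerically, and incidentally the 3-state chain from the earlier slides happens to be reversible: both directions carry flow .15 between $x^1 \leftrightarrow x^3$ and $x^2 \leftrightarrow x^3$, and no flow between $x^1 \leftrightarrow x^2$. A sketch (function name is mine):

```python
def satisfies_detailed_balance(T, pi, tol=1e-12):
    """Check pi(x) T(x, x') == pi(x') T(x', x) for every pair of states."""
    n = len(T)
    return all(abs(pi[x] * T[x][xp] - pi[xp] * T[xp][x]) <= tol
               for x in range(n) for xp in range(n))

T = [[0.25, 0.0, 0.75],
     [0.0,  0.7, 0.3],
     [0.5,  0.5, 0.0]]
pi = [0.2, 0.5, 0.3]
```

A deterministic 3-cycle shows the gap between the two conditions: the uniform distribution satisfies global balance, but all flow runs one way around the cycle, so detailed balance fails.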
Using a proposal for the chain. Given a target P, design a chain to sample from P. Idea: use a proposal transition $T^q(x, x')$ such that we can sample from $T^q(x, \cdot)$ and $T^q$ is a regular chain (reaching every state in K steps has a non-zero probability); then accept the proposed move with a probability $A(x, x')$ chosen to achieve detailed balance.
Metropolis algorithm. Use a proposal transition $T^q(x, x')$ that we can sample from and that gives a regular chain; accept the proposed move with probability $A(x, x')$ to achieve detailed balance. When the proposal is symmetric, $T^q(x, x') = T^q(x', x)$, use $A(x, x') = \min\left(1, \frac{p(x')}{p(x)}\right)$: the algorithm always accepts a move that increases P, and may accept it otherwise. (image: Wikipedia)
Metropolis-Hastings algorithm. If the proposal is NOT symmetric, then $A(x, x') = \min\left(1, \frac{p(x')\, T^q(x', x)}{p(x)\, T^q(x, x')}\right)$.
Why does it sample from P? The effective transition model is:
$T(x, x') = T^q(x, x')\, A(x, x')$ for $x \neq x'$ (a move to a different state is accepted);
$T(x, x) = T^q(x, x) + \sum_{x' \neq x} (1 - A(x, x'))\, T^q(x, x')$ (a proposal to stay is always accepted; a move to a new state is rejected).
Substitute this into detailed balance (does it hold?):
$\pi(x)\, T^q(x, x')\, A(x, x') = \pi(x')\, T^q(x', x)\, A(x', x)$
i.e., $\pi(x)\, T^q(x, x') \min\left(1, \frac{\pi(x')\, T^q(x', x)}{\pi(x)\, T^q(x, x')}\right) \stackrel{?}{=} \pi(x')\, T^q(x', x) \min\left(1, \frac{\pi(x)\, T^q(x, x')}{\pi(x')\, T^q(x', x)}\right)$
Both sides equal $\min\big(\pi(x)\, T^q(x, x'),\; \pi(x')\, T^q(x', x)\big)$, so detailed balance holds.
Gibbs sampling is a special case, with $A(x, x') = 1$ all the time!
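A minimal Metropolis-Hastings sketch, using a symmetric Gaussian random-walk proposal so the $T^q$ terms cancel and the acceptance rule reduces to the Metropolis form; the target here, an unnormalized Gaussian, is an illustrative assumption:

```python
import math
import random

def metropolis_hastings(log_p_tilde, x0, step, n_samples, seed=0):
    """MH with a symmetric Gaussian random-walk proposal, so
    A = min(1, p~(x') / p~(x)); p~ only needs to be unnormalized."""
    rng = random.Random(seed)
    x = x0
    samples = []
    for _ in range(n_samples):
        xp = x + rng.gauss(0.0, step)                # propose
        log_a = log_p_tilde(xp) - log_p_tilde(x)
        if rng.random() < math.exp(min(0.0, log_a)): # accept w.p. min(1, ratio)
            x = xp
        samples.append(x)
    return samples

# target: unnormalized N(3, 1), i.e. log p~(x) = -(x - 3)^2 / 2
samples = metropolis_hastings(lambda x: -0.5 * (x - 3.0) ** 2,
                              x0=0.0, step=1.0, n_samples=20_000)
burn = samples[5_000:]                               # discard burn-in
mean = sum(burn) / len(burn)                         # should be near 3
```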
Sampling from the chain. In the limit $T \to \infty$, $P^{(T)} = \pi = P$. How long should we wait for $D(P^{(T)}, \pi) < \epsilon$? The mixing time is $N_\epsilon \in O\left(\frac{1}{1 - \lambda_2} \log \frac{\#states}{\epsilon}\right)$, where $\lambda_2$ is the 2nd largest eigenvalue of T and #states is exponential in the number of variables. In practice: run the chain for a burn-in period (T steps), then collect samples (a few more steps); multiple restarts can ensure a better coverage.
Example: Potts model with $|Val(X_i)| = 5$ on a 128x128 grid, $p(x) \propto \exp(\sum_i h(x_i) + \sum_{(i,j) \in E} .66\, \mathbb{I}(x_i = x_j))$; Gibbs sampling, states shown as different colors, after 200 iterations and after 10,000 iterations. (image: Murphy's book)
Diagnosing convergence is a difficult problem; there are heuristics for diagnosing non-convergence: run multiple chains and compare sample statistics across them; measure the auto-correlation within each chain. Example: sampling from a mixture of two 1D Gaussians (3 chains, shown in different colors), comparing Metropolis-Hastings (MH) with increasing step sizes for the proposal against Gibbs sampling.
Summary.
Markov Chain: can model the "evolution" of an initial distribution; converges to a stationary distribution.
Markov Chain Monte Carlo: design a Markov chain whose stationary distribution is what we want to sample from; run the chain to produce samples.
Two MCMC methods: Gibbs sampling and Metropolis-Hastings.