Markov Chain Monte Carlo Inference. Siamak Ravanbakhsh Winter 2018


 Jasper Dennis
 9 days ago
 Views:
Transcription
1 Graphical Models Markov Chain Monte Carlo Inference Siamak Ravanbakhsh Winter 2018
2 Learning objectives Markov chains the idea behind Markov Chain Monte Carlo (MCMC) two important examples: Gibbs sampling MetropolisHastings algorithm
3 Problem with likelihood weighting Recap use a topological ordering sample conditioned on the parents if observed: keep the observed value update the weight
4 Problem with likelihood weighting Recap use a topological ordering sample conditioned on the parents if observed: Issues keep the observed value update the weight observing the child does not affect the parent's assignment only applies to Bayesnets
5 Idea Gibbs sampling iteratively sample each var. condition on its Markov blanket X p(x X ) i i MB(i) if X i is observed: keep the observed value after many Gibbs sampling iterations X P
6 Idea Gibbs sampling iteratively sample each var. condition on its Markov blanket X p(x X ) i i MB(i) if X i is observed: keep the observed value equivalent to first simplifying the model by removing observed vars sampling from the simplified Gibbs dist. after many Gibbs sampling iterations X P
7 Example: Ising model recall the Ising model: p(x) exp( i xih i + i,j E xixjj i,j) x { 1, +1} i
8 Example: Ising model recall the Ising model: p(x) exp( i xih i + i,j E xixjj i,j) sample each node i: x { 1, +1} i p(x = +1 X ) = i MB(i) exp(h i+ j Mb(i) Ji,jX j) exp(h i+ j Mb(i) Ji,jX j)+exp( hi j Mb(i) Ji,jX j) =
9 Example: Ising model recall the Ising model: p(x) exp( i xih i + i,j E xixjj i,j) sample each node i: x { 1, +1} i p(x = +1 X ) = i MB(i) exp(h i+ j Mb(i) Ji,jX j) exp(h i+ j Mb(i) Ji,jX j)+exp( hi j Mb(i) Ji,jX j) = σ(2h + 2 J X ) i j Mb(i) i,j j
10 Example: Ising model recall the Ising model: p(x) exp( i xih i + i,j E xixjj i,j) sample each node i: x { 1, +1} i p(x = +1 X ) = i MB(i) exp(h i+ j Mb(i) Ji,jX j) exp(h i+ j Mb(i) Ji,jX j)+exp( hi j Mb(i) Ji,jX j) = σ(2h i + 2 j Mb(i) Ji,jX j) compare with meanfield σ(2h i + 2 j Mb(i) Ji,jμ j)
11 Markov Chain a sequence of random variables with Markov property (t) (1) (t 1) (t) (t 1) P (X X,, X ) = P (X X ) its graphical model... X (1) (T 1) X (T ) X (2) X many applications: language modeling: X is a word or a character physics: with correct choice of X, the world is Markov
12 Transition model we assume a homogeneous chain: (t) (t 1) (t+1) (t) P (X X ) = P (X X ) t cond. probabilities remain the same across timesteps notation: conditional probability (t) (t 1) P (X = x X = x ) = T (x, x ) is called the transition model think of this as a matrix T
13 Transition model we assume a homogeneous chain: (t) (t 1) (t+1) (t) P (X X ) = P (X X ) t cond. probabilities remain the same across timesteps notation: conditional probability (t) (t 1) P (X = x X = x ) = T (x, x ) statetransition diagram its transition matrix is called the transition model think of this as a matrix T.25 T =
14 Transition model we assume a homogeneous chain: (t) (t 1) (t+1) (t) P (X X ) = P (X X ) t cond. probabilities remain the same across timesteps notation: conditional probability (t) (t 1) P (X = x X = x ) = T (x, x ) statetransition diagram its transition matrix is called the transition model think of this as a matrix T.25 T = (t+1) (t) evolving the distribution P (X = x) = P (X = x )T (x, x) x V al(x)
15 Markov Chain Monte Carlo (MCMC) Example statetransition diagram for grasshopper random walk initial distribution (0) P (X = 0) = 1
16 Markov Chain Monte Carlo (MCMC) Example statetransition diagram for grasshopper random walk initial distribution (0) P (X = 0) = 1 after t=50 steps, the distribution is almost uniform P (x) t 1 9 x
17 Markov Chain Monte Carlo (MCMC) Example statetransition diagram for grasshopper random walk initial distribution (0) P (X = 0) = 1 after t=50 steps, the distribution is almost uniform P (x) use the chain to sample from the uniform distribution t P (X) t 1 9 x 1 9
18 Markov Chain Monte Carlo (MCMC) Example statetransition diagram for grasshopper random walk initial distribution (0) P (X = 0) = 1 after t=50 steps, the distribution is almost uniform P (x) use the chain to sample from the uniform distribution t P (X) t 1 9 x 1 9 why is it uniform? (mixing image: Murphy's book)
19 Markov Chain Monte Carlo (MCMC) Example statetransition diagram for grasshopper random walk initial distribution (0) P (X = 0) = 1 after t=50 steps, the distribution is almost uniform P (x) use the chain to sample from the uniform distribution t P (X) t 1 9 x 1 9 MCMC generalize this idea beyond uniform dist. P we want to sample from pick the transition model such that P (X) = P (X) why is it uniform? (mixing image: Murphy's book)
20 Stationary distribution given a transition model T (x, x ) if the chain converges: global balance equation (t) (t+1) (t) P (x) P (x) = x P (x )T (x, x)
21 Stationary distribution given a transition model T (x, x ) if the chain converges: global balance equation this condition defines the stationary distribution: (t) (t+1) (t) P (x) P (x) = x P (x )T (x, x) π π(x = x) = π(x = x )T (x, x ) x V al(x)
22 Stationary distribution given a transition model T (x, x ) if the chain converges: global balance equation this condition defines the stationary distribution: (t) (t+1) (t) P (x) P (x) = x P (x )T (x, x) π π(x = x) = π(x = x )T (x, x ) x V al(x) Example finding the stationary dist π(x ) =.25π(x ) +.5π(x ) π(x ) =.7π(x ) +.5π(x ) π(x ) =.75π(x ) +.3π(x ) π(x 1) + π(x 2) + π(x 3) = 1 1 π(x ) =.2 2 π(x ) =.5 3 π(x ) =.3
23 Stationary distribution as an eigenvector Example finding the stationary dist π(x ) =.25π(x ) +.5π(x ) π(x ) =.7π(x ) +.5π(x ) π(x ) =.75π(x ) +.3π(x ) π(x ) + π(x ) + π(x ) = 1 1 π(x ) =.2 2 π(x ) =.5 3 π(x ) =.3
24 Stationary distribution as an eigenvector Example finding the stationary dist π(x ) =.25π(x ) +.5π(x ) π(x ) =.7π(x ) +.5π(x ) π(x ) =.75π(x ) +.3π(x ) π(x ) + π(x ) + π(x ) = 1 1 π(x ) =.2 2 π(x ) =.5 3 π(x ) =.3 viewing T (.,.) as a matrix and P (x) as a vector t evolution of dist P (x) : P multiple steps: P t (t+1) T (t) = (T ) P = T P (t+m) T m (t) T T π = π
25 Stationary distribution as an eigenvector Example finding the stationary dist π(x ) =.25π(x ) +.5π(x ) π(x ) =.7π(x ) +.5π(x ) π(x ) =.75π(x ) +.3π(x ) π(x ) + π(x ) + π(x ) = 1 1 π(x ) =.2 2 π(x ) =.5 3 π(x ) =.3 viewing T (.,.) as a matrix and P (x) as a vector t evolution of dist P (x) : P multiple steps: P for stationary dist: t (t+1) T (t) = (T ) P = T P (t+m) T m (t) π = T T π T T π = π
26 Stationary distribution as an eigenvector Example finding the stationary dist π(x ) =.25π(x ) +.5π(x ) π(x ) =.7π(x ) +.5π(x ) π(x ) =.75π(x ) +.3π(x ) π(x ) + π(x ) + π(x ) = 1 1 π(x ) =.2 2 π(x ) =.5 3 π(x ) =.3 viewing T (.,.) as a matrix and P (x) as a vector t evolution of dist P (x) : P multiple steps: P for stationary dist: π is an eigenvector of with eigenvalue 1 t (t+1) T (t) = (T ) P = T P (t+m) T m (t) π = T T π T T T T (produce it using the power method) π = π
27 Stationary distribution: existance & uniquness irreducible we should be able to reach any x' from any x π otherwise, is not unique
28 Stationary distribution: existance & uniquness irreducible 1 1 we should be able to reach any x' from any x π otherwise, is not unique 0 0 aperiodic the chain should not have a fixed cyclic behavior 0 1 otherwise, the chain does not converge (it oscillates) 1 0
29 Stationary distribution: existance & uniquness irreducible 1 1 we should be able to reach any x' from any x π otherwise, is not unique 0 0 aperiodic the chain should not have a fixed cyclic behavior 0 1 otherwise, the chain does not converge (it oscillates) 1 0 every aperiodic and irreducible chain (with a finite domain) has a unique limiting distribution such that π(x = x) = π(x = x )T (x, x ) x V al(x) π
30 Stationary distribution: existance & uniquness irreducible 1 1 we should be able to reach any x' from any x π otherwise, is not unique 0 0 aperiodic the chain should not have a fixed cyclic behavior 0 1 otherwise, the chain does not converge (it oscillates) 1 0 every aperiodic and irreducible chain (with a finite domain) has a unique limiting distribution such that π(x = x) = π(x = x )T (x, x ) x V al(x) π regular chain a sufficient condition: there exists a K, such that the probability of reaching any destination from any source in K steps is positive (applies to discrete & continuous domains)
31 MCMC in graphical models distinguishing the "graphical models" involved 0 P (X)... π(x) X (1) (T 1) X (T ) X (2) X 1: the Markov chain P (X) X = [C, D, I, G, S, L, J, H] 3: the graphical model, from which we want to sample P (X) 2: statetransition diagram (not shown) that has exponentially many nodes #nodes = V al(x)
32 MCMC in graphical models distinguishing the "graphical models" involved 0 P (X)... π(x) X (1) (T 1) X (T ) X (2) X 1: the Markov chain P (X) X = [C, D, I, G, S, L, J, H] 3: the graphical model, from which we want to sample P (X) 2: statetransition diagram (not shown) that has exponentially many nodes #nodes = V al(x) objective: design the Markov chain transition so that π(x) = P (X)
33 Multiple transition models idea have multiple transition models each making local changes to x aka, kernels 1 2 T (x, x ), T (x, x ),, T (x, x ) n x = (x, x ) 1 2 T 2 T 1 only updates x 1 using a single kernel we may not be able to visit all the states while their combination is "ergodic"... X (1) X (2) X (T 1) X (T )
34 Multiple transition models idea have multiple transition models each making local changes to x aka, kernels 1 2 T (x, x ), T (x, x ),, T (x, x ) n x = (x, x ) 1 2 T 2 if π(x = x) = π(x = x )T (x, x ) k x V al(x) then we can combine the kernels: k T 1 only updates x 1 using a single kernel we may not be able to visit all the states while their combination is "ergodic" mixing them cycling them T (x, x ) = p(k)t (x, x ) k x,x,,x k T (x, x ) = T (x, x )T (x, x ), T (x, x )dx dx dx [1] [2] [n] 1 [1] 2 [1] [2] n [n 1] [1] [2] [n]... X (1) X (2) X (T 1) X (T )
35 Revisiting Gibbs sampling one kernel for each variable perform local, conditional updates T (x, x ) = P (x x )I(x = x ) i... i i X (1) X (2) X (T 1) X (T ) i P (x x ) = P (x x ) cycle the local kernels i i i MB(i) i... π(x) = P (X) is the stationary dist. for this Markov chain
36 Revisiting Gibbs sampling one kernel for each variable perform local, conditional updates T (x, x ) = P (x x )I(x = x ) i... i i X (1) X (2) X (T 1) X (T ) i P (x x ) = P (x x ) cycle the local kernels i i i MB(i) i... π(x) = P (X) is the stationary dist. for this Markov chain if P (x) > 0 x then this chain is regular i.e., converges to its unique stationary dist.
37 Block Gibbs sampling local moves can get stuck in modes of P (X) updates using exploring these modes P (x x ), P (x x ) will have problem x 1 x 2
38 Block Gibbs sampling local moves can get stuck in modes of P (X) updates using exploring these modes P (x x ), P (x x ) will have problem x 1 idea: each kernel updates a block of variables x 2
39 Block Gibbs sampling local moves can get stuck in modes of P (X) updates using exploring these modes P (x x ), P (x x ) will have problem x 1 idea: each kernel updates a block of variables x 2 update h given v P (h, h, h v,, v ) = P (h v) P (h v) Restricted Boltzmann Machine (RBM) update v given h P (v,, v h, h, h ) = P (v h) P (v h)
40 Detailed balance A Markov chain is reversible if for a unique π detailed balance π(x)t (x, x ) = π(x )T (x, x) same frequency in both directions x, x
41 Detailed balance A Markov chain is reversible if for a unique π x lefthand side detailed balance π(x)t (x, x ) = π(x )T (x, x) x π(x)t (x, x )dx = π(x) same frequency in both directions T (x, x )dx = π(x) x, x = x global balance π(x )T (x, x)dx righthand side
42 Detailed balance A Markov chain is reversible if for a unique π x lefthand side detailed balance π(x)t (x, x ) = π(x )T (x, x) x π(x)t (x, x )dx = π(x) same frequency in both directions T (x, x )dx = π(x) x, x = x global balance π(x )T (x, x)dx righthand side detailed balance is a stronger condition π = [.4,.4,.2] global balance detailed balance (example: Murphy's book)
43 Detailed balance A Markov chain is reversible if for a unique π x lefthand side detailed balance π(x)t (x, x ) = π(x )T (x, x) x π(x)t (x, x )dx = π(x) same frequency in both directions T (x, x )dx = π(x) x, x = x global balance π(x )T (x, x)dx righthand side detailed balance is a stronger condition if Markov chain is regular and then π π is the unique stationary distribution satisfies detailed balance, π = [.4,.4,.2] global balance detailed balance (example: Murphy's book)
44 Detailed balance A Markov chain is reversible if for a unique π x lefthand side detailed balance π(x)t (x, x ) = π(x )T (x, x) x π(x)t (x, x )dx = π(x) same frequency in both directions T (x, x )dx = π(x) x, x = x global balance π(x )T (x, x)dx righthand side detailed balance is a stronger condition if Markov chain is regular and then π π is the unique stationary distribution satisfies detailed balance, π = [.4,.4,.2] global balance detailed balance analogous to the theorem for global balance checking for detailed balance is easier (example: Murphy's book)
45 Detailed balance A Markov chain is reversible if for a unique π x lefthand side detailed balance π(x)t (x, x ) = π(x )T (x, x) x π(x)t (x, x )dx = π(x) same frequency in both directions T (x, x )dx = π(x) x, x = x global balance π(x )T (x, x)dx righthand side detailed balance is a stronger condition if Markov chain is regular and then π π is the unique stationary distribution satisfies detailed balance, π = [.4,.4,.2] global balance detailed balance analogous to the theorem for global balance checking for detailed balance is easier what happens if T is symmetric? (example: Murphy's book)
46 Using a proposal for the chain Given P design a chain to sample from P idea
47 Using a proposal for the chain Given P design a chain to sample from P idea use a proposal transition we can sample from q T (x, x ) q T (x, ) q T (x, x ) is a regular chain (reaching every state in K steps has a nonzero probability)
48 Using a proposal for the chain Given P design a chain to sample from P idea use a proposal transition we can sample from q T (x, x ) q T (x, ) q T (x, x ) is a regular chain (reaching every state in K steps has a nonzero probability) accept the proposed move with probability to achieve detailed balance A(x, x )
49 use a proposal transition we can sample from q T (x, x ) Metropolis algorithm q T (x, ) q T (x, x ) is a regular chain (reaching every state in K steps has a nonzero probability) accept the proposed move with probability to achieve detailed balance proposal is symmetric T (x, x ) = T (x, x) p(x ) p(x) A(x, x ) min(1, ) accepts the move if it increases P may accept it otherwise A(x, x ) (image: Wikipedia)
50 MetropolisHastings algorithm q p(x )T (x,x) p(x)t q (x,x ) if the proposal is NOT symmetric, then A(x, x ) min(1, )
51 MetropolisHastings algorithm if the proposal is NOT symmetric, then A(x, x ) min(1, ) why does it sample from P? q p(x )T (x,x) p(x)t q (x,x )
52 MetropolisHastings algorithm if the proposal is NOT symmetric, then A(x, x ) min(1, ) why does it sample from P? q p(x )T (x,x) p(x)t q (x,x ) q T (x, x ) = T (x, x )A(x, x ) x x move to a different state is accepted
53 MetropolisHastings algorithm if the proposal is NOT symmetric, then A(x, x ) min(1, ) why does it sample from P? q p(x )T (x,x) p(x)t q (x,x ) q T (x, x ) = T (x, x )A(x, x ) x x q T (x, x) = T (x, x) + x x (1 A(x, x ))T (x, x ) move to a different state is accepted proposal to stay is always accepted move to a new state is rejected
54 MetropolisHastings algorithm if the proposal is NOT symmetric, then A(x, x ) min(1, ) why does it sample from P? q p(x )T (x,x) p(x)t q (x,x ) q T (x, x ) = T (x, x )A(x, x ) x x q T (x, x) = T (x, x) + x x (1 A(x, x ))T (x, x ) move to a different state is accepted substitute this into detailed balance (does it hold?) q q proposal to stay is always accepted move to a new state is rejected π(x)t (x, x )A(x, x ) = π(x )T (x, x)a(x, x) q π(x )T (x,x) π(x)t q (x,x ) min(1, )? q π(x)t (x,x ) π(x )T q (x,x) min(1, )
55 MetropolisHastings algorithm if the proposal is NOT symmetric, then A(x, x ) min(1, ) why does it sample from P? q p(x )T (x,x) p(x)t q (x,x ) q T (x, x ) = T (x, x )A(x, x ) x x q T (x, x) = T (x, x) + x x (1 A(x, x ))T (x, x ) move to a different state is accepted substitute this into detailed balance (does it hold?) q q proposal to stay is always accepted move to a new state is rejected π(x)t (x, x )A(x, x ) = π(x )T (x, x)a(x, x) q π(x )T (x,x) π(x)t q (x,x ) min(1, )? q π(x)t (x,x ) π(x )T q (x,x) min(1, ) Gibbs sampling is a special case, with A(x, x ) = 1 all the time!
56 Sampling from the chain at the limit T, P = π = P how long should we wait for T D(P, π) < ϵ? 1 1 λ 2 mixing time N ϵ O( log( )) #states (exponential) 2nd largest eigenvalue of T
57 Sampling from the chain at the limit T, P = π = P how long should we wait for run the chain for a burnin period (T steps) collect samples (few more steps) multiple restarts can ensure a better coverage T D(P, π) < ϵ? 1 1 λ 2 mixing time N ϵ O( log( )) #states (exponential) 2nd largest eigenvalue of T
58 Sampling from the chain at the limit T, P = π = P how long should we wait for run the chain for a burnin period (T steps) collect samples (few more steps) multiple restarts can ensure a better coverage T D(P, π) < ϵ? 1 1 λ 2 mixing time N ϵ O( log( )) #states (exponential) 2nd largest eigenvalue of T Example Potts model model V al(x) = 5 128x128 grid p(x) exp( i h(x i) + i,j E.66I(x j = x j)) Gibbs sampling different colors 200 iterations 10,000 iterations image : Murphy's book
59 Diagnosing convergence heuristics for diagnosing nonconvergence difficult problem run multiple chains (compare sample statistics) autocorrelation within each chain
60 Diagnosing convergence heuristics for diagnosing nonconvergence difficult problem run multiple chains (compare sample statistics) autocorrelation within each chain example sampling from a mixture of two 1D Gaussians (3 chains: colors) metropolishastings (MH) with increasing step sizes for the proposal Gibbs sampling
61 Summary Markov Chain: can model the "evolution" of an initial distribution converges to a stationary distribution
62 Summary Markov Chain: can model the "evolution" of an initial distribution converges to a stationary distribution Markov Chain Monte Carlo: design a Markov chain: stationary dist. is what we want to sample run the chain to produce samples
63 Summary Markov Chain: can model the "evolution" of an initial distribution converges to a stationary distribution Markov Chain Monte Carlo: design a Markov chain: stationary dist. is what we want to sample run the chain to produce samples Two MCMC methods: Gibbs sampling MetropolisHastings