1 Graphical Models: Markov Chain Monte Carlo Inference. Siamak Ravanbakhsh, Winter 2018

2 Learning objectives: Markov chains; the idea behind Markov Chain Monte Carlo (MCMC); two important examples: Gibbs sampling and the Metropolis-Hastings algorithm.

3-4 Recap: problem with likelihood weighting. Use a topological ordering; sample each variable conditioned on its parents; if observed, keep the observed value and update the weight. Issues: observing the child does not affect the parent's assignment, and the method only applies to Bayes nets.

5-6 Idea: Gibbs sampling. Iteratively sample each variable conditioned on its Markov blanket, $x_i \sim p(x_i \mid X_{MB(i)})$; if $X_i$ is observed, keep the observed value. This is equivalent to first simplifying the model by removing observed variables and then sampling from the simplified Gibbs distribution. After many Gibbs sampling iterations, $X \sim P$.

7-10 Example: Ising model. Recall the Ising model: $p(x) \propto \exp(\sum_i x_i h_i + \sum_{(i,j) \in E} x_i x_j J_{i,j})$ with $x_i \in \{-1, +1\}$. Sample each node $i$:
$p(x_i = +1 \mid X_{MB(i)}) = \frac{\exp(h_i + \sum_{j \in MB(i)} J_{i,j} X_j)}{\exp(h_i + \sum_{j \in MB(i)} J_{i,j} X_j) + \exp(-h_i - \sum_{j \in MB(i)} J_{i,j} X_j)} = \sigma(2h_i + 2\sum_{j \in MB(i)} J_{i,j} X_j)$.
Compare with the mean-field update $\sigma(2h_i + 2\sum_{j \in MB(i)} J_{i,j}\,\mu_j)$.

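The sigmoid update above is all a Gibbs sweep needs. A minimal sketch for a small grid follows; the grid size, field $h$, coupling $J$, and sweep count are illustrative assumptions.

```python
# A minimal sketch of Gibbs sampling for a small Ising grid; the update
# p(x_i = +1 | MB(i)) = sigma(2*h_i + 2*sum_j J_ij * x_j) is the formula above.
import numpy as np

rng = np.random.default_rng(0)
n, J, h = 10, 0.5, 0.0                      # 10x10 grid, uniform coupling/field
x = rng.choice([-1, 1], size=(n, n))        # random initial spin configuration

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

for sweep in range(100):                    # one sweep = update every site once
    for i in range(n):
        for j in range(n):
            # sum of neighbouring spins (the Markov blanket on a grid)
            s = sum(x[a, b] for a, b in [(i-1, j), (i+1, j), (i, j-1), (i, j+1)]
                    if 0 <= a < n and 0 <= b < n)
            p_plus = sigma(2 * h + 2 * J * s)
            x[i, j] = 1 if rng.random() < p_plus else -1
```
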
11 Markov chain: a sequence of random variables with the Markov property, $P(X^{(t)} \mid X^{(1)}, \ldots, X^{(t-1)}) = P(X^{(t)} \mid X^{(t-1)})$. Its graphical model is the chain $X^{(1)} \to X^{(2)} \to \cdots \to X^{(T-1)} \to X^{(T)}$. Many applications: language modeling, where $X$ is a word or a character; physics, where with the correct choice of $X$ the world is Markov.

12-14 Transition model. We assume a homogeneous chain: $P(X^{(t)} \mid X^{(t-1)}) = P(X^{(t+1)} \mid X^{(t)})$ for all $t$; the conditional probabilities remain the same across time steps. Notation: the conditional probability $P(X^{(t)} = x' \mid X^{(t-1)} = x) = T(x, x')$ is called the transition model; think of this as a matrix $T$. (The slide shows a state-transition diagram and its transition matrix as a figure.) Evolving the distribution: $P(X^{(t+1)} = x') = \sum_{x \in Val(X)} P(X^{(t)} = x)\, T(x, x')$.

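In code, one evolution step is a single matrix-vector product. A minimal sketch, assuming the 3-state transition matrix reconstructed from the balance equations of the stationary-distribution example below:

```python
# A minimal sketch of evolving a distribution one step at a time,
# P^{(t+1)}(x') = sum_x P^{(t)}(x) T(x, x').
import numpy as np

T = np.array([[0.25, 0.0, 0.75],   # T[x, x'] = P(next = x' | current = x)
              [0.0,  0.7, 0.30],
              [0.50, 0.5, 0.00]])
p = np.array([1.0, 0.0, 0.0])      # P(X^(0)): start in state x1

for t in range(5):
    p = T.T @ p                    # evolve: p[x'] = sum_x p[x] * T[x, x']
    print(t + 1, p)
```
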
15-19 Markov Chain Monte Carlo (MCMC). Example: a state-transition diagram for a grasshopper random walk, with initial distribution $P(X^{(0)} = 0) = 1$. After $t = 50$ steps the distribution $P^{(t)}(x)$ is almost uniform over the nine states, $P^{(t)}(X) \approx 1/9$, so we can use the chain to sample from the uniform distribution. Why is it uniform? (mixing; image: Murphy's book). MCMC generalizes this idea beyond the uniform distribution: given the $P$ we want to sample from, pick the transition model such that $P^{(\infty)}(X) = P(X)$.

20-22 Stationary distribution. Given a transition model $T(x, x')$, if the chain converges, $P^{(t+1)}(x) \approx P^{(t)}(x)$, we get the global balance equation $P^{(t)}(x') = \sum_x P^{(t)}(x)\, T(x, x')$. This condition defines the stationary distribution $\pi$: $\pi(X = x') = \sum_{x \in Val(X)} \pi(X = x)\, T(x, x')$.
Example: finding the stationary distribution by solving the balance equations
$\pi(x_1) = .25\,\pi(x_1) + .5\,\pi(x_3)$
$\pi(x_2) = .7\,\pi(x_2) + .5\,\pi(x_3)$
$\pi(x_3) = .75\,\pi(x_1) + .3\,\pi(x_2)$
$\pi(x_1) + \pi(x_2) + \pi(x_3) = 1$
which gives $\pi(x_1) = .2$, $\pi(x_2) = .5$, $\pi(x_3) = .3$.

23-26 Stationary distribution as an eigenvector. Viewing $T(\cdot, \cdot)$ as a matrix and $P^{(t)}(x)$ as a vector, the evolution of the distribution is $P^{(t+1)} = T^{\top} P^{(t)}$; over multiple steps, $P^{(t+m)} = (T^{\top})^m P^{(t)}$. For the stationary distribution, $\pi = T^{\top} \pi$: $\pi$ is an eigenvector of $T^{\top}$ with eigenvalue 1 (produce it using the power method).

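A minimal sketch of the power method on the same 3-state chain; repeated application of $T^{\top}$ converges to the eigenvalue-1 eigenvector, matching the $\pi$ solved for above.

```python
# pi is the eigenvector of T^T with eigenvalue 1, produced here by power
# iteration and cross-checked against numpy's eigendecomposition.
import numpy as np

T = np.array([[0.25, 0.0, 0.75],
              [0.0,  0.7, 0.30],
              [0.50, 0.5, 0.00]])

pi = np.ones(3) / 3                 # any starting distribution works
for _ in range(200):                # power method: pi <- T^T pi
    pi = T.T @ pi                   # stays a distribution (rows of T sum to 1)
print(pi)                           # -> approximately [0.2, 0.5, 0.3]

w, V = np.linalg.eig(T.T)           # cross-check via eigendecomposition
v = np.real(V[:, np.argmax(np.real(w))])  # eigenvector for eigenvalue 1
print(v / v.sum())                  # same stationary distribution
```
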
27-30 Stationary distribution: existence & uniqueness.
Irreducible: we should be able to reach any x' from any x; otherwise $\pi$ is not unique (e.g., two states, each with a self-loop of probability 1).
Aperiodic: the chain should not have a fixed cyclic behavior; otherwise the chain does not converge, it oscillates (e.g., two states that deterministically swap, with transition probabilities 0 and 1).
Every aperiodic and irreducible chain (with a finite domain) has a unique limiting distribution $\pi$ such that $\pi(X = x') = \sum_{x \in Val(X)} \pi(X = x)\, T(x, x')$.
Regular chain, a sufficient condition: there exists a $K$ such that the probability of reaching any destination from any source in $K$ steps is positive (applies to discrete & continuous domains).

31-32 MCMC in graphical models: distinguishing the "graphical models" involved.
1: the Markov chain itself, $X^{(1)} \to X^{(2)} \to \cdots \to X^{(T-1)} \to X^{(T)}$, whose state distribution evolves from $P^{(0)}(X)$ toward $\pi(X)$;
2: its state-transition diagram (not shown), which has exponentially many nodes: $\#nodes = |Val(X)|$;
3: the graphical model from which we want to sample $P(X)$, here with $X = [C, D, I, G, S, L, J, H]$.
Objective: design the Markov chain transition so that $\pi(X) = P(X)$.

33-34 Multiple transition models. Idea: use multiple transition models (aka kernels) $T_1(x, x'), T_2(x, x'), \ldots, T_n(x, x')$, each making local changes to $x$; e.g., for $x = (x_1, x_2)$, $T_1$ only updates $x_1$. Using a single kernel we may not be able to visit all the states, while the combination of kernels is "ergodic". If each kernel leaves $\pi$ stationary, $\pi(X = x') = \sum_{x \in Val(X)} \pi(X = x)\, T_k(x, x')$ for all $k$, then we can combine the kernels by
mixing them: $T(x, x') = \sum_k p(k)\, T_k(x, x')$, or
cycling them: $T(x, x') = \int T_1(x, x^{[1]})\, T_2(x^{[1]}, x^{[2]}) \cdots T_n(x^{[n-1]}, x')\, dx^{[1]} dx^{[2]} \cdots dx^{[n-1]}$,
as contrasted in the sketch below.

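A minimal sketch, assuming a toy 2x2 joint table; $T_1$ and $T_2$ are the Gibbs conditionals for each coordinate, combined here by mixing (cycling would instead apply them in a fixed order):

```python
# Combining kernels by mixing (random scan) vs. cycling (systematic scan)
# for a toy joint p over (x1, x2); each kernel is a local update.
import numpy as np

rng = np.random.default_rng(0)
P = np.array([[0.1, 0.2],        # illustrative unnormalized joint over
              [0.3, 0.4]])       # x1, x2 in {0, 1}

def T1(x):                       # resample x1 | x2
    p = P[:, x[1]] / P[:, x[1]].sum()
    return (rng.choice(2, p=p), x[1])

def T2(x):                       # resample x2 | x1
    p = P[x[0], :] / P[x[0], :].sum()
    return (x[0], rng.choice(2, p=p))

x = (0, 0)
for t in range(10000):
    # mixing: pick a kernel at random; cycling would apply T1 then T2
    x = T1(x) if rng.random() < 0.5 else T2(x)
```
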
35-36 Revisiting Gibbs sampling: one kernel for each variable, performing local, conditional updates: $T_i(x, x') = P(x'_i \mid x_{-i})\, \mathbb{I}(x'_{-i} = x_{-i})$, where $P(x_i \mid x_{-i}) = P(x_i \mid x_{MB(i)})$. Cycle the local kernels; $\pi(X) = P(X)$ is the stationary distribution for this Markov chain. If $P(x) > 0$ for all $x$, then this chain is regular, i.e., it converges to its unique stationary distribution.

37-39 Block Gibbs sampling. Local moves can get stuck in modes of $P(X)$: with two separated modes in the $(x_1, x_2)$ plane, updates using $P(x_1 \mid x_2)$ and $P(x_2 \mid x_1)$ will have problems exploring these modes. Idea: each kernel updates a block of variables. Example, Restricted Boltzmann Machine (RBM): update h given v, $P(h_1, \ldots, h_m \mid v_1, \ldots, v_n) = P(h_1 \mid v) \cdots P(h_m \mid v)$; update v given h, $P(v_1, \ldots, v_n \mid h_1, \ldots, h_m) = P(v_1 \mid h) \cdots P(v_n \mid h)$; see the sketch below.

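A minimal sketch of block Gibbs in a binary RBM; the layer sizes and the random parameters $W$, $b$, $c$ are illustrative assumptions. Given $v$ the hidden units are conditionally independent (and vice versa), so each block is sampled in one shot.

```python
# Block Gibbs sampling in a binary RBM: alternate h | v and v | h.
import numpy as np

rng = np.random.default_rng(0)
n_v, n_h = 6, 4
W = rng.normal(scale=0.1, size=(n_v, n_h))   # visible-hidden couplings
b, c = np.zeros(n_v), np.zeros(n_h)          # visible / hidden biases

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

v = rng.integers(0, 2, size=n_v).astype(float)
for step in range(1000):
    h = (rng.random(n_h) < sigma(c + v @ W)).astype(float)   # sample h | v
    v = (rng.random(n_v) < sigma(b + W @ h)).astype(float)   # sample v | h
```
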
40-45 Detailed balance. A Markov chain is reversible if, for a unique $\pi$, the detailed balance condition holds: $\pi(x)\, T(x, x') = \pi(x')\, T(x', x)$ for all $x, x'$ (the same frequency of transitions in both directions). Detailed balance implies global balance: integrating the left-hand side over $x$, $\int \pi(x)\, T(x, x')\, dx = \int \pi(x')\, T(x', x)\, dx = \pi(x') \int T(x', x)\, dx = \pi(x')$.
Detailed balance is a stronger condition than global balance: the chain with $\pi = [.4, .4, .2]$ in the figure satisfies global balance but not detailed balance (example: Murphy's book).
If the Markov chain is regular and $\pi$ satisfies detailed balance, then $\pi$ is the unique stationary distribution; this is analogous to the theorem for global balance, and checking for detailed balance is easier.
What happens if $T$ is symmetric? (Then the uniform distribution satisfies detailed balance.) A numeric check follows.

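A minimal sketch of the numeric check: a chain is reversible iff the flow matrix $F[x, x'] = \pi(x)\, T(x, x')$ is symmetric. The mostly-cyclic chain below is an illustrative assumption (not Murphy's example); its stationary distribution is uniform, so it satisfies global balance while detailed balance fails.

```python
# Global vs. detailed balance, checked numerically.
import numpy as np

def is_reversible(T, pi):
    F = pi[:, None] * T          # flow matrix F[x, x'] = pi(x) * T(x, x')
    return np.allclose(F, F.T)   # detailed balance <=> F symmetric

# mostly-cyclic 3-state chain: probability flows one way around the cycle
T = np.array([[0.1, 0.9, 0.0],
              [0.0, 0.1, 0.9],
              [0.9, 0.0, 0.1]])
pi = np.ones(3) / 3              # its stationary distribution is uniform

print(np.allclose(T.T @ pi, pi))  # True: global balance holds
print(is_reversible(T, pi))       # False: detailed balance fails
```
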
46-48 Using a proposal for the chain. Given $P$, design a chain to sample from $P$. Idea: use a proposal transition $T^q(x, x')$ that we can sample from, $x' \sim T^q(x, \cdot)$, where $T^q(x, x')$ is a regular chain (reaching every state in $K$ steps has a non-zero probability); then accept the proposed move with probability $A(x, x')$, chosen to achieve detailed balance.

49 Metropolis algorithm: the proposal is symmetric, $T^q(x, x') = T^q(x', x)$, and $A(x, x') = \min(1, \frac{p(x')}{p(x)})$: the algorithm accepts the move if it increases $P$, and may accept it otherwise. (image: Wikipedia)

50-55 Metropolis-Hastings algorithm. If the proposal is NOT symmetric, then $A(x, x') = \min(1, \frac{p(x')\, T^q(x', x)}{p(x)\, T^q(x, x')})$.
Why does it sample from $P$? The effective transition model is:
$T(x, x') = T^q(x, x')\, A(x, x')$ for $x' \neq x$ (a move to a different state is accepted);
$T(x, x) = T^q(x, x) + \sum_{x' \neq x} (1 - A(x, x'))\, T^q(x, x')$ (a proposal to stay is always accepted, plus the cases where a move to a new state is rejected).
Substitute this into detailed balance (does it hold?):
$\pi(x)\, T^q(x, x')\, A(x, x') = \pi(x')\, T^q(x', x)\, A(x', x)$
$\pi(x)\, T^q(x, x') \min(1, \frac{\pi(x')\, T^q(x', x)}{\pi(x)\, T^q(x, x')}) \stackrel{?}{=} \pi(x')\, T^q(x', x) \min(1, \frac{\pi(x)\, T^q(x, x')}{\pi(x')\, T^q(x', x)})$
Both sides equal $\min(\pi(x)\, T^q(x, x'),\, \pi(x')\, T^q(x', x))$, so detailed balance holds.
Gibbs sampling is a special case, with $A(x, x') = 1$ all the time! A sketch of the algorithm follows.

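A minimal sketch of Metropolis-Hastings with a Gaussian random-walk proposal (symmetric, so the $T^q$ ratio cancels and $A = \min(1, p(x')/p(x))$). The target, a mixture of two 1D Gaussians, mirrors the diagnostics example later and is an illustrative assumption; note that only an unnormalized density is needed.

```python
# Random-walk Metropolis-Hastings on an unnormalized 1D target.
import numpy as np

rng = np.random.default_rng(0)

def p_tilde(x):
    # unnormalized target: mixture of N(-2, 1) and N(2, 1)
    return np.exp(-0.5 * (x + 2) ** 2) + np.exp(-0.5 * (x - 2) ** 2)

x, step = 0.0, 1.0
samples = []
for t in range(10000):
    x_new = x + step * rng.normal()            # propose from T^q(x, .)
    A = min(1.0, p_tilde(x_new) / p_tilde(x))  # acceptance probability
    if rng.random() < A:
        x = x_new                              # accept: move to x'
    samples.append(x)                          # rejecting keeps the old state
```
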
56-58 Sampling from the chain. In the limit $T \to \infty$, $P^{(T)} = \pi = P$. How long should we wait for $D(P^{(T)}, \pi) < \epsilon$? The mixing time is $N_\epsilon \in O(\frac{1}{1 - \lambda_2} \log \frac{\#states}{\epsilon})$, where $\lambda_2$ is the second largest eigenvalue of $T$ and the number of states is exponential (see the sketch below). In practice: run the chain for a burn-in period ($T$ steps), then collect samples (a few more steps); multiple restarts can ensure a better coverage.
Example: Potts model with $|Val(X_i)| = 5$ (different colors) on a 128x128 grid, $p(x) \propto \exp(\sum_i h(x_i) + \sum_{(i,j) \in E} .66\, \mathbb{I}(x_i = x_j))$; Gibbs sampling after 200 iterations vs. 10,000 iterations (image: Murphy's book).

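A minimal sketch estimating the mixing-time bound for the 3-state chain used earlier; taking $\lambda_2$ as the second-largest eigenvalue magnitude of $T$ and the value of $\epsilon$ are illustrative assumptions.

```python
# Rough mixing-time estimate N_eps ~ 1/(1 - lambda_2) * log(#states / eps).
import numpy as np

T = np.array([[0.25, 0.0, 0.75],
              [0.0,  0.7, 0.30],
              [0.50, 0.5, 0.00]])

eigvals = np.sort(np.abs(np.linalg.eigvals(T)))[::-1]
lam2 = eigvals[1]                    # second-largest |eigenvalue| (largest is 1)
eps, n_states = 1e-3, T.shape[0]
print(lam2, np.log(n_states / eps) / (1 - lam2))  # rough step-count estimate
```
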
59-60 Diagnosing convergence: a difficult problem; heuristics exist for diagnosing non-convergence: run multiple chains (compare sample statistics); check the auto-correlation within each chain. Example: sampling from a mixture of two 1D Gaussians (3 chains, shown as colors), comparing Metropolis-Hastings (MH) with increasing step sizes for the proposal against Gibbs sampling.

61-63 Summary.
Markov chain: can model the "evolution" of an initial distribution; converges to a stationary distribution.
Markov Chain Monte Carlo: design a Markov chain whose stationary distribution is what we want to sample from; run the chain to produce samples.
Two MCMC methods: Gibbs sampling and Metropolis-Hastings.