Introduction to Markov Chain Monte Carlo & Gibbs Sampling


1 Introduction to Markov Chain Monte Carlo & Gibbs Sampling. Prof. Nicholas Zabaras, Sibley School of Mechanical and Aerospace Engineering, 101 Frank H. T. Rhodes Hall, Ithaca, NY. URL: August 19,

2 Contents. Incremental strategies for sampling, iterative sampling; introduction to MCMC, autoregressive model; the Gibbs sampler, systematic scan, random scan; Gibbs sampler examples; block and Metropolized Gibbs; application to variable/model selection in linear regression.

Following closely: Monte Carlo Statistical Methods, C. P. Robert and G. Casella, Chapter 3 (google books, slides, video). Other references:
1. D. MacKay, Introduction to MC Methods, reprint.
2. R. Neal, Probabilistic Inference Using MCMC Methods.
3. C. Andrieu et al., An Introduction to MCMC for Machine Learning, Machine Learning, 50, 5-43, 2003.
4. S. Brooks, MCMC Methods and Its Applications, Journal of the Royal Statistical Society, Series D (The Statistician), Vol. 47, No. 1 (1998).
5. G. Casella and E. I. George, Explaining the Gibbs Sampler, The American Statistician, Vol. 46, 1992.
6. S. Chib and E. Greenberg, Understanding the MH Algorithm, The American Statistician, Vol. 49, No. 4 (Nov. 1995).

3 Using Incremental Strategies for Sampling. We have seen that both rejection sampling (RS) and importance sampling (IS) are limited to problems of moderate dimension. The problem with these algorithms is that they try to sample all the components of a high-dimensional parameter simultaneously. We next study incremental strategies:
- Iterative methods: Markov chain Monte Carlo.
- Sequential methods: sequential Monte Carlo.

4 Motivating Example. Multiple failures in a nuclear plant. Model: failures of the i-th pump follow a Poisson process with parameter λ_i, 1 ≤ i ≤ 10. For an observation time t_i, the number of failures p_i is a Poisson P(λ_i t_i) random variable. The unknowns consist of θ := (λ_1, λ_2, ..., λ_10, β), where β is a parameter in the hierarchical model introduced next. [Statistical Computing and MC Methods, A. Doucet, Lecture 10.]

5 Motivating Example: Nuclear Pump Data. Hierarchical model, with α = 1.8, γ = 0.01, δ = 1:

λ_i ~ Ga(α, β) i.i.d. for i = 1, ..., 10, and β ~ Ga(γ, δ).

The posterior distribution (see here the Ga distribution) is

π(λ_1, ..., λ_10, β | t, p) ∝ ∏_{i=1}^{10} [ (λ_i t_i)^{p_i} e^{-λ_i t_i} λ_i^{α-1} e^{-β λ_i} ] β^{10α+γ-1} e^{-δβ}.

It is not obvious how the inverse CDF method, the accept/reject method, or importance sampling could be used for this multidimensional distribution.

6 Conditional Distributions. The conditionals can be obtained by direct observation:

λ_i | (β, t_i, p_i) ~ Ga(p_i + α, t_i + β) for 1 ≤ i ≤ 10,
β | (λ_1, ..., λ_10) ~ Ga(γ + 10α, δ + Σ_{i=1}^{10} λ_i).

Instead of directly sampling the vector θ = (λ_1, ..., λ_10, β) at once, one could suggest sampling it iteratively: start with the λ_i's for a given guess of β, followed by an update of β given the new samples (λ_1, ..., λ_10).

7 Iterative Sampling. Given a sample θ^t = (λ_1^t, ..., λ_10^t, β^t) at iteration t, one could proceed as follows at iteration t+1:

Step 1: λ_i^{t+1} | (β^t, t_i, p_i) ~ Ga(p_i + α, t_i + β^t) for i = 1, ..., 10.
Step 2: β^{t+1} | (λ_1^{t+1}, ..., λ_10^{t+1}) ~ Ga(γ + 10α, δ + Σ_{i=1}^{10} λ_i^{t+1}).

Note that instead of directly sampling in a space of dimension 11, one samples 11 times in spaces of dimension 1!
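The two-step update above can be sketched in Python (the lecture's linked implementations are in MatLab/C++). The failure counts p_i and observation times t_i below are hypothetical stand-ins, not the dataset used in the lecture; note that NumPy's gamma sampler is parameterized by scale = 1/rate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical failure counts p_i and observation times t_i (illustrative only)
p = np.array([5, 1, 5, 14, 3, 19, 1, 1, 4, 22], dtype=float)
t = np.array([94.3, 15.7, 62.9, 125.8, 5.2, 31.4, 1.1, 1.1, 2.1, 10.5])
alpha, gamma_, delta = 1.8, 0.01, 1.0

n_iter = 5000
beta = 1.0                      # initial guess for beta
lam_chain = np.empty((n_iter, 10))
beta_chain = np.empty(n_iter)

for it in range(n_iter):
    # Step 1: lambda_i | beta ~ Ga(p_i + alpha, t_i + beta)   (shape, rate)
    lam = rng.gamma(shape=p + alpha, scale=1.0 / (t + beta))
    # Step 2: beta | lambda ~ Ga(gamma + 10*alpha, delta + sum_i lambda_i)
    beta = rng.gamma(shape=gamma_ + 10 * alpha, scale=1.0 / (delta + lam.sum()))
    lam_chain[it] = lam
    beta_chain[it] = beta
```

Each sweep draws 11 one-dimensional Gamma variates instead of one 11-dimensional sample, exactly as the two steps above describe.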

8 Iterative Sampling. With this iterative procedure: Are we sampling from the desired joint distribution of the 11 variables? If yes, how many times should the iteration above be repeated? The validity of the approach described here derives from the fact that the sequence {θ^t} := {λ_1^t, λ_2^t, ..., λ_10^t, β^t} is a Markov chain.

9 Introduction to Markov Chain Monte Carlo. Markov chain: a sequence of random variables {X_n, n ≥ 0} defined on (X, B(X)) which satisfies, for any A ∈ B(X), the probability condition

P(X_n ∈ A | X_0, ..., X_{n-1}) = P(X_n ∈ A | X_{n-1}),

and we write the transition kernel as P(x, A) = P(X_n ∈ A | X_{n-1} = x). As we have seen in an earlier lecture, given a target π, we need to design a transition kernel P such that asymptotically

(1/N) Σ_{n=1}^{N} f(X_n) → ∫_X f(x) π(x) dx  and/or  X_n ~ π.

It is easy to simulate the Markov chain even if π is complex.

10 Autoregression Model. Consider the autoregressive model, for |α| < 1,

X_n = α X_{n-1} + V_n, where V_n ~ N(0, σ²).

The limiting distribution is π(x) = N(x; 0, σ²/(1 - α²)). To sample from π, we just simulate the Markov chain and we know that asymptotically X_n ~ π. Of course this problem only demonstrates the main idea, since here we can sample directly from π!
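A quick sketch of this experiment, using one long chain rather than the 100 parallel chains of the next slide (run length and seed are arbitrary choices); the time average and empirical variance are compared with the limiting N(0, σ²/(1 - α²)):

```python
import numpy as np

rng = np.random.default_rng(1)
a, sigma = 0.4, 5.0
n = 200_000

# Simulate one long AR(1) chain: X_n = a*X_{n-1} + V_n, V_n ~ N(0, sigma^2)
x = np.empty(n)
x[0] = 10.0                          # deliberately far from the stationary mean
for k in range(1, n):
    x[k] = a * x[k - 1] + sigma * rng.standard_normal()

burn = x[1000:]                      # drop a short burn-in
target_var = sigma**2 / (1 - a**2)   # variance of the limiting distribution
```

Despite the chain starting far from equilibrium, the time average of the trajectory settles near 0 and its empirical variance near σ²/(1 - α²), illustrating the ergodic behavior discussed on the following slides.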

11 Autoregression Model. Consider 100 independent Markov chains run in parallel. We assume that the initial distribution of these Markov chains is U[0, 20], so initially the Markov chain samples are not distributed according to π. In the following example, we choose α = 0.4, σ = 5 (see here for a MatLab implementation).

12 Example. A Markov chain with a normal distribution as target distribution. [Figure: sample evolution at the initial distribution and at steps 1, 2, 3, 4, and 100.]

13 Example. Histograms of 100 independent Markov chains with a normal distribution as target distribution. [Figure: histograms at the initial distribution and at steps 1, 2, 3, 4, and 100.]

14 Example. The target normal distribution seems to attract the distribution of the samples and even to be a fixed point of the algorithm. We have produced 100 independent samples from the normal distribution. We will see that it is not necessary to run 100 Markov chains in parallel in order to obtain 100 samples: one can instead consider a unique Markov chain and build the histogram from one trajectory of this single chain.

15 Markov Chain Monte Carlo. The estimate of the target distribution, through the series of histograms, improves with the number of iterations. Assume that we have stored {X_n, n = 1, ..., N} for N large and wish to estimate ∫_X f(x) π(x) dx. We suggest the estimator

(1/N) Σ_{n=1}^{N} f(X_n),

which is the estimator we used before when {X_n, n = 1, ..., N} were independent. Under relatively mild conditions, such an estimator is consistent despite the fact that the samples are not independent. Under additional conditions, a CLT also holds with a rate of convergence 1/√N.

16 Markov Chain Monte Carlo. We are interested in Markov chains with a transition kernel P which has the following three important properties, observed in the autoregressive example:

A. The desired distribution π is an invariant distribution of the Markov chain, i.e. ∫_X π(x) P(x, y) dx = π(y).
B. The successive distributions of the Markov chain converge towards π regardless of the starting point.
C. The estimator (1/N) Σ_{n=1}^{N} f(X_n) converges towards E_π(f(X)), and asymptotically X_n ~ π (stronger requirement).
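For the autoregressive example, property A can be checked by a one-line variance computation: if X_{n-1} ~ π, then

```latex
X_{n-1}\sim\mathcal N\!\left(0,\tfrac{\sigma^2}{1-\alpha^2}\right),\;
V_n\sim\mathcal N(0,\sigma^2)
\;\Longrightarrow\;
\operatorname{Var}(\alpha X_{n-1}+V_n)
=\alpha^2\,\frac{\sigma^2}{1-\alpha^2}+\sigma^2
=\frac{\sigma^2}{1-\alpha^2},
```

and since X_n = αX_{n-1} + V_n is a zero-mean Gaussian, X_n ~ π as well: π is invariant for this kernel.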

17 Markov Chain Monte Carlo. Since there are infinitely many kernels P(x, y) which admit π(x) as their invariant distribution, the main task in MCMC is coming up with good ones. Convergence is ensured under very weak assumptions: irreducibility and aperiodicity. It is usually easy to establish that an MCMC sampler converges towards π(x), but difficult to obtain rates of convergence.

18 The Gibbs Sampler. The Gibbs sampler is a generic method to sample from a high-dimensional distribution. It generates a Markov chain which converges to the target distribution under weak assumptions: irreducibility and aperiodicity.

19 The Two-Component Gibbs Sampler. Consider the target distribution π(θ) with θ = (θ_1, θ_2). The two-component Gibbs sampler proceeds as follows.

Initialization: select deterministically or randomly θ^(0) = (θ_1^(0), θ_2^(0)).
Iteration i, i ≥ 1:
Sample θ_1^(i) ~ π(θ_1 | θ_2^(i-1)).
Sample θ_2^(i) ~ π(θ_2 | θ_1^(i)).

Sampling from the conditionals is often feasible even when sampling from the joint is impossible (e.g. in the nuclear pump data).

20 Invariant Distribution. Clearly {(θ_1^(i), θ_2^(i))} is a Markov chain. Its transition kernel is

P((θ_1, θ_2), (θ_1', θ_2')) = π(θ_1' | θ_2) π(θ_2' | θ_1').

The invariance (global balance) equation is satisfied:

∫∫ π(θ_1, θ_2) P((θ_1, θ_2), (θ_1', θ_2')) dθ_1 dθ_2
= ∫∫ π(θ_1, θ_2) π(θ_1' | θ_2) π(θ_2' | θ_1') dθ_1 dθ_2
= π(θ_2' | θ_1') ∫ π(θ_2) π(θ_1' | θ_2) dθ_2
= π(θ_2' | θ_1') π(θ_1')
= π(θ_1', θ_2').

21 Irreducibility. Invariance alone does not ensure that the Gibbs sampler converges towards the invariant distribution. Additionally, irreducibility is required: the Markov chain can move to any set A such that π(A) > 0 from (almost) any starting point. This ensures that

(1/N) Σ_{n=1}^{N} f(θ_1^n, θ_2^n) → ∫ f(θ_1, θ_2) π(θ_1, θ_2) dθ_1 dθ_2,

but not that asymptotically (θ_1^n, θ_2^n) ~ π.

22 Irreducibility. A distribution that leads to a reducible Gibbs sampler is shown here.

23 Irreducibility. Consider an example with X = {1, 2} and transition probabilities P(1, 2) = P(2, 1) = 1. The invariant distribution is clearly given by π(1) = π(2) = 1/2. However, we know that if the chain starts at X_0 = 1, then X_{2n} = 1 and X_{2n+1} = 2 for any n. We have

(1/N) Σ_{n=1}^{N} f(X_n) → ∫ f(x) π(x) dx,

but clearly X_n is not distributed according to π. You need to make sure that you do not explore the space in a periodic way to ensure that X_n ~ π asymptotically.
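This periodic chain is easy to simulate; a minimal sketch (the chain is deterministic, so no randomness is involved) showing that the time average converges to the right value while X_n itself alternates and never settles into π:

```python
import numpy as np

# Deterministic two-state chain: P(1, 2) = P(2, 1) = 1, started at X_0 = 1.
n = 1000
x = np.empty(n, dtype=int)
x[0] = 1
for k in range(1, n):
    x[k] = 2 if x[k - 1] == 1 else 1

time_avg = x.mean()   # ergodic average of f(x) = x; the target value is 1.5
```

The time average equals (1 + 2)/2 = 1.5 exactly, yet X_n at even steps is always 1 and at odd steps always 2: ergodic averages converge even though the marginal distribution of X_n never does.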

24 Gibbs Sampler. If θ = (θ_1, θ_2, ..., θ_p) where p > 2, the Gibbs sampler still applies.

Initialization: select deterministically or randomly θ^(0) = (θ_1^(0), θ_2^(0), ..., θ_p^(0)).
Iteration i, i ≥ 1: for k = 1 : p, sample θ_k^(i) ~ π(θ_k | θ_{-k}^(i)), where θ_{-k}^(i) = (θ_1^(i), ..., θ_{k-1}^(i), θ_{k+1}^(i-1), ..., θ_p^(i-1)).

25 Systematic-Scan Gibbs Sampler. Systematic scan Gibbs: let θ^(i) = (θ_1^(i), θ_2^(i), ..., θ_p^(i)).

Update θ_1^(i) from π(· | θ_2^(i-1), ..., θ_p^(i-1)).
Update θ_2^(i) from π(· | θ_1^(i), θ_3^(i-1), ..., θ_p^(i-1)).
...
Update θ_p^(i) from π(· | θ_1^(i), θ_2^(i), ..., θ_{p-1}^(i)).

26 Random-Scan Gibbs Sampler. Consider again θ = (θ_1, θ_2, ..., θ_p) where p > 2. We consider the following random-scan Gibbs sampler.

Initialization: select deterministically or randomly θ^(0) = (θ_1^(0), θ_2^(0), ..., θ_p^(0)).
Iteration i, i ≥ 1:
Sample K ~ U{1, ..., p}.
Set θ_{-K}^(i) = θ_{-K}^(i-1).
Sample θ_K^(i) ~ π(θ_K | θ_{-K}^(i)), where θ_{-K}^(i) = (θ_1^(i), ..., θ_{K-1}^(i), θ_{K+1}^(i), ..., θ_p^(i)).

27 Random-Scan Gibbs Sampler. Random scan Gibbs: let θ^(i) = (θ_1^(i), θ_2^(i), ..., θ_p^(i)) at step (iteration) i. Draw j from 1 to p with probability w_j = 1/p. Draw the new coordinate j, θ_j^(i) ~ π(· | θ_{-j}^(i-1)), and leave the remaining components unchanged; that is, let θ_{-j}^(i) = θ_{-j}^(i-1).

28 Gibbs Sampler: Example. Consider the following target distribution:

π(x_1, x_2) = N( (0, 0), [[1, ρ], [ρ, 1]] ) ∝ exp( -(x_1² - 2ρ x_1 x_2 + x_2²) / (2(1 - ρ²)) ).

The marginal distribution is given as π(x_1) ∝ exp(-x_1²/2). A systematic-scan Gibbs sampler is generated with the following conditionals:

x_1^{t+1} | x_2^t ~ N(ρ x_2^t, 1 - ρ²),
x_2^{t+1} | x_1^{t+1} ~ N(ρ x_1^{t+1}, 1 - ρ²).
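A minimal Python sketch of this systematic-scan sampler (the lecture links C++ programs for this example; the run length and seed here are arbitrary choices, and the initial state matches the next slide):

```python
import numpy as np

rng = np.random.default_rng(2)
rho = 0.5
n_iter = 100_000

x1, x2 = -3.0, -3.0              # initial state, as on the next slide
s = np.sqrt(1 - rho**2)          # conditional standard deviation
chain = np.empty((n_iter, 2))

for it in range(n_iter):
    x1 = rho * x2 + s * rng.standard_normal()   # x1 | x2 ~ N(rho*x2, 1-rho^2)
    x2 = rho * x1 + s * rng.standard_normal()   # x2 | x1 ~ N(rho*x1, 1-rho^2)
    chain[it] = x1, x2

corr = np.corrcoef(chain[:, 0], chain[:, 1])[0, 1]   # should approach rho
```

The x_1 samples match the standard Gaussian marginal, and the empirical correlation between the two coordinates recovers ρ.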

29 Gibbs Sampler: Example. Set ρ = 0.5, number of iterations 10000, and initial state (x_1, x_2) = (-3, -3). [Figures: histogram of x_1, the exact pdf of which is the standard Gaussian, and the x_1-x_2 plot.] C++ programs are given here.

30 Gibbs Sampler: Example. Set ρ = 0.999, number of iterations 10000, and initial state (x_1, x_2) = (-3, -3). We can see that the sampling process in this case of highly correlated variables is inaccurate. [Figures: histogram of x_1, the exact pdf of which is the standard Gaussian, and the x_1-x_2 plot.]

31 Convergence of the Gibbs Sampler. Even when irreducibility and aperiodicity are ensured, the Gibbs sampler can still converge very slowly. Consider the target bivariate Gaussian distribution N(0, [[a, b], [b, a]]). A systematic-scan Gibbs sampler is generated as

x_1^{t+1} | x_2^t ~ N( (b/a) x_2^t, a - b²/a ),
x_2^{t+1} | x_1^{t+1} ~ N( (b/a) x_1^{t+1}, a - b²/a ).
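The slow convergence can be quantified by the lag-1 autocorrelation of the x_1-chain, which for this sampler grows as the off-diagonal term b approaches a. A small Python sketch (the values a = 1 and b = 0.3, 0.95 are illustrative assumptions, not the lecture's settings):

```python
import numpy as np

def gibbs_acf1(a, b, n_iter=50_000, seed=3):
    """Run the systematic-scan Gibbs sampler for N(0, [[a, b], [b, a]])
    and return the lag-1 autocorrelation of the x1-chain."""
    rng = np.random.default_rng(seed)
    s = np.sqrt(a - b**2 / a)        # conditional standard deviation
    x1 = x2 = 0.0
    xs = np.empty(n_iter)
    for it in range(n_iter):
        x1 = (b / a) * x2 + s * rng.standard_normal()
        x2 = (b / a) * x1 + s * rng.standard_normal()
        xs[it] = x1
    x = xs - xs.mean()
    return (x[:-1] * x[1:]).sum() / (x * x).sum()

acf_weak = gibbs_acf1(1.0, 0.3)      # weakly correlated components
acf_strong = gibbs_acf1(1.0, 0.95)   # strongly correlated components
```

For this chain, x_1^{t+1} = (b/a)² x_1^t + noise, so the lag-1 autocorrelation is about (b/a)²: roughly 0.09 for b = 0.3 and 0.90 for b = 0.95, confirming that strong correlation between components slows mixing dramatically.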

32 Convergence of the Gibbs Sampler. The Gibbs sampling path and equiprobability curves are plotted below. A C++ implementation can be found here.

33 Gibbs Sampler for Mixture of Gaussians. A MatLab implementation can be found here.

34 Gibbs Sampler: Example. Consider the following target distribution:

π(x_1, x_2) ∝ C(n, x_1) x_2^{x_1 + α - 1} (1 - x_2)^{n - x_1 + β - 1}.

The two conditional distributions for the Gibbs sampler are

x_1 | x_2 ~ Binom(n, x_2),
x_2 | x_1 ~ Be(x_1 + α, n - x_1 + β).

We set n = 20, α = β = 0.5, initial state (0, 0), and 10,000 iterations. See here for a C++ implementation. [Figure: histogram of x_2, the exact pdf of which is a Beta distribution.]
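A sketch of this beta-binomial Gibbs sampler in Python (the lecture links a C++ implementation; the iteration count here is raised to 50,000 to tighten the Monte Carlo estimates):

```python
import numpy as np

rng = np.random.default_rng(4)
n, alpha, beta_ = 20, 0.5, 0.5
n_iter = 50_000

x1, x2 = 0, 0.0                  # initial state (0, 0), as in the slide
x2_chain = np.empty(n_iter)
for it in range(n_iter):
    x1 = rng.binomial(n, x2)                      # x1 | x2 ~ Binom(n, x2)
    x2 = rng.beta(x1 + alpha, n - x1 + beta_)     # x2 | x1 ~ Be(x1+a, n-x1+b)
    x2_chain[it] = x2

x2_mean, x2_var = x2_chain.mean(), x2_chain.var()
```

The marginal of x_2 is Be(α, β) = Be(0.5, 0.5), so the chain's mean should approach 1/2 and its variance αβ/((α+β)²(α+β+1)) = 1/8, matching the histogram on the slide.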

35 Gibbs Sampler: Example. Consider a likelihood defined with the Cauchy distribution C(μ, 1) and two measurements:

π(μ | D) ∝ ∏_{i=1}^{2} f(x_i | μ) π(μ), with f(x_i | μ) = 1 / [π (1 + (x_i - μ)²)].

We take as prior a normal distribution μ ~ N(0, 10). This leads to a posterior of the form

π(μ | D) ∝ e^{-μ²/20} / [ (1 + (x_1 - μ)²)(1 + (x_2 - μ)²) ].

How do we use the Gibbs sampler to sample from this univariate distribution?

36 Gibbs Sampler: Example. We can use the Gibbs sampler by noticing that

1 / (1 + (x_i - μ)²) = ∫_0^∞ e^{-ω_i (1 + (x_i - μ)²)} dω_i,

so that

π(μ | D) ∝ e^{-μ²/20} ∏_{i=1}^{2} ∫_0^∞ e^{-ω_i (1 + (x_i - μ)²)} dω_i.

We can then think of π(μ | D) as the marginal of

π(μ, ω_1, ω_2 | D) ∝ e^{-μ²/20} ∏_{i=1}^{2} e^{-ω_i (1 + (x_i - μ)²)}.

The Gibbs sampler is based on the following two steps:
Generate μ^(t) ~ π(μ | ω^(t-1), D).
Generate ω^(t) ~ π(ω | μ^(t), D).

37 Gibbs Sampler: Example. The step μ^(t) ~ π(μ | ω^(t-1), D) is straightforward since

π(μ | ω, D) = N( Σ_i ω_i x_i / (Σ_i ω_i + 1/20), 1 / (2 Σ_i ω_i + 1/10) ).

The step ω^(t) ~ π(ω | μ^(t), D) is also straightforward:

π(ω_i | μ^(t), D) = Exp( 1 + (x_i - μ^(t))² ).

A MatLab implementation can be found here.
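These two steps can be sketched as follows; the measurement values x_1, x_2 are hypothetical, and note that NumPy's exponential sampler is parameterized by scale = 1/rate:

```python
import numpy as np

rng = np.random.default_rng(5)
x_data = np.array([1.3, -0.5])   # hypothetical measurements x1, x2
n_iter = 20_000

omega = np.ones(2)               # initial latent variables
mu_chain = np.empty(n_iter)
for it in range(n_iter):
    # mu | omega, D ~ N( sum(w_i x_i)/(sum(w_i) + 1/20), 1/(2*sum(w_i) + 1/10) )
    prec = 2.0 * omega.sum() + 1.0 / 10.0
    mean = omega @ x_data / (omega.sum() + 1.0 / 20.0)
    mu = mean + rng.standard_normal() / np.sqrt(prec)
    # omega_i | mu, D ~ Exp(rate = 1 + (x_i - mu)^2); NumPy uses scale = 1/rate
    omega = rng.exponential(scale=1.0 / (1.0 + (x_data - mu) ** 2))
    mu_chain[it] = mu

```

Alternating these two exact conditional draws yields a chain whose μ-marginal targets the Cauchy-likelihood posterior of the previous slide.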

38 Gibbs Sampler: Example. On the left, the last 100 iterations of the chain (μ^(t)); on the right, the histogram of the chain (μ^(t)) and comparison with the target density for 10,000 iterations. A MatLab implementation can be found here.

39 Block and Metropolized Gibbs. Instead of updating single coordinates x_j, one can update blocks x_A. This is more efficient but requires knowing the block conditionals π(x_A | x_{-A}) and being able to sample from them. Combinations of Gibbs and Metropolis-Hastings (an introduction was provided in the introduction to Markov chains lecture) are popular. In Metropolized Gibbs, for example, some coordinates are updated from their conditionals and others using arbitrary proposals, as in Metropolis-Hastings.

40 Gibbs Sampling. Are the component-wise MH algorithms π-invariant, irreducible, or aperiodic? Each transition kernel in Gibbs (which updates a single coordinate) is neither irreducible nor aperiodic. However, their combination (random or systematic scan) might be!

41 Gibbs Sampling. Consider a target π(x_1, x_2) (e.g. a uniform distribution) with disconnected support as in the figure. Conditioning on x_1 < 0, the distribution of x_2 cannot produce a value in [0, 1]. You can make this type of problem work by introducing a proper coordinate transformation:

y_1 = x_1 + x_2, y_2 = x_1 - x_2.

Conditioning now on y_1 produces a uniform distribution on the union of a negative and a positive interval. Therefore, one iteration of the Gibbs sampler is sufficient to jump from one disk to the other.

42 Gibbs Sampler: Recommendations. Have as few blocks as possible. Put the most correlated variables in the same block. If necessary, reparametrize the model to achieve this. Integrate analytically over as many variables as possible. There is no general strategy that works for all problems.

43 Bayesian Variable Selection in Regression. We select the following regression model:

Y = Σ_{k=1}^{p} β_k X_k + σV, where V ~ N(0, 1),

where we assume as priors σ² ~ IG(ν/2, γ/2) and, for α << 1,

β_k ~ (1/2) N(0, α δ² σ²) + (1/2) N(0, δ² σ²).

We introduce a latent variable γ_k ∈ {0, 1} such that Pr(γ_k = 0) = Pr(γ_k = 1) = 1/2 and

β_k | γ_k = 0 ~ N(0, α δ² σ²), β_k | γ_k = 1 ~ N(0, δ² σ²).

[A. Doucet, Statistical Computing and MC Methods, Lecture 10.]

44 A Bad Gibbs Sampler. We have parameters (β_{1:p}, γ_{1:p}, σ²) and observe D = {x_i, y_i}_{i=1}^n. A potential Gibbs sampler consists of sampling iteratively from

p(β_{1:p} | D, γ_{1:p}, σ²) (Gaussian), p(σ² | D, γ_{1:p}, β_{1:p}) (inverse Gamma), and p(γ_{1:p} | D, β_{1:p}, σ²).

In particular, p(γ_{1:p} | D, β_{1:p}, σ²) = ∏_{k=1}^{p} p(γ_k | β_k, σ²) and

p(γ_k = 1 | β_k, σ²) = [ (2π δ² σ²)^{-1/2} exp(-β_k²/(2 δ² σ²)) ] / [ (2π δ² σ²)^{-1/2} exp(-β_k²/(2 δ² σ²)) + (2π α δ² σ²)^{-1/2} exp(-β_k²/(2 α δ² σ²)) ].

The Gibbs sampler becomes reducible as α goes to zero.
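The degeneracy can be seen numerically: at β_k = 0 the probability above reduces to √α/(1 + √α), which vanishes as α → 0, so a coordinate stuck near zero essentially never switches back to γ_k = 1. A small sketch (δ² = σ² = 1 are illustrative values):

```python
import numpy as np

def p_gamma_one(beta_k, alpha, delta2=1.0, sigma2=1.0):
    """P(gamma_k = 1 | beta_k, sigma^2) for the two-component normal prior."""
    slab = np.exp(-beta_k**2 / (2 * delta2 * sigma2)) / np.sqrt(delta2 * sigma2)
    spike = (np.exp(-beta_k**2 / (2 * alpha * delta2 * sigma2))
             / np.sqrt(alpha * delta2 * sigma2))
    return slab / (slab + spike)

# At beta_k = 0 the spike component dominates as alpha -> 0: the sampler can
# no longer flip gamma_k back to 1, and the chain becomes reducible in the limit.
p_moderate = p_gamma_one(0.0, alpha=0.1)    # 1/(1 + sqrt(10)), approx 0.24
p_tiny = p_gamma_one(0.0, alpha=1e-8)       # essentially zero
```

This is the numerical face of the reducibility noted above, and it motivates the reformulation on the next slide.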

45 Bayes Variable Selection. This is the result of bad modeling. As we have already seen, we let α → 0 and write

Y = Σ_{k=1}^{p} γ_k β_k X_k + σV, where V ~ N(0, 1),

where γ_k = 1 if X_k is included and γ_k = 0 otherwise. However, this suggests that β_k is defined even when γ_k = 0. A neater way to write such models is

Y = Σ_{k: γ_k = 1} β_k X_k + σV = β_γ^T X_γ + σV, where V ~ N(0, 1),

where, for a vector γ = (γ_1, ..., γ_p), β_γ = {β_k : γ_k = 1}, X_γ = {X_k : γ_k = 1}, and n_γ = Σ_{k=1}^{p} γ_k.

Prior distributions: π(β_γ | σ², γ) = N(β_γ; 0, δ² σ² I_{n_γ}), σ² ~ IG(ν_0/2, γ_0/2), and π(γ) = ∏_{k=1}^{p} π(γ_k) with π(γ_k = 1) = 1/2.

46 A Better Gibbs Sampler. We are interested in sampling from the trans-dimensional distribution π(β_γ, γ, σ² | D). However, we know that

π(β_γ, γ, σ² | D) = π(γ | D) π(β_γ, σ² | D, γ),

where π(γ | D) ∝ π(D | γ) π(γ) and (see result from an earlier lecture)

π(D | γ) = ∫∫ π(D, β_γ, σ² | γ) dβ_γ dσ² ∝ Γ((ν_0 + n)/2) |Σ_γ|^{1/2} δ^{-n_γ} ( (γ_0 + Σ_{i=1}^{n} y_i² - μ_γ^T Σ_γ^{-1} μ_γ)/2 )^{-(ν_0 + n)/2},

with μ_γ = Σ_γ Σ_{i=1}^{n} y_i x_{γ,i}, Σ_γ^{-1} = δ^{-2} I_{n_γ} + Σ_{i=1}^{n} x_{γ,i} x_{γ,i}^T.

47 A Better Gibbs Sampler. The full conditional distribution π(β_γ, σ² | D, γ) is

π(β_γ, σ² | D, γ) = N(β_γ; μ_γ, σ² Σ_γ) IG(σ²; (ν_0 + n)/2, (γ_0 + Σ_{i=1}^{n} y_i² - μ_γ^T Σ_γ^{-1} μ_γ)/2),

where μ_γ = Σ_γ Σ_{i=1}^{n} y_i x_{γ,i} and Σ_γ^{-1} = δ^{-2} I_{n_γ} + Σ_{i=1}^{n} x_{γ,i} x_{γ,i}^T. The derivation of the above conditional was already given in an earlier lecture.

48 A Better Gibbs Sampler. Popular alternative prior models for γ_i include

γ_i ~ B(λ), where λ ~ U[0, 1],
γ_i ~ B(λ), where λ ~ Be(α, β),

and the g-prior (Zellner)

β_γ | σ², γ ~ N(β_γ; 0, δ² σ² (X_γ^T X_γ)^{-1}),

where here for robustness we additionally use δ² ~ IG(a_0/2, b_0/2). Such variations are very important and can modify dramatically the performance of the Bayesian model.

49 Bayesian Variable Selection Example. π(γ | D) is a discrete probability distribution with 2^p potential values. We assume δ² is known here. We can use the Gibbs sampler to sample from it.

Initialization: select deterministically or randomly γ^(0) = (γ_1^(0), ..., γ_p^(0)).
Iteration i, i ≥ 1: for k = 1 : p, sample γ_k^(i) ~ π(γ_k | D, γ_{-k}^(i)), where γ_{-k}^(i) = (γ_1^(i), ..., γ_{k-1}^(i), γ_{k+1}^(i-1), ..., γ_p^(i-1)).
Optional step: sample (β_γ^(i), σ^{2(i)}) ~ π(β_γ, σ² | D, γ^(i)).
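A sketch of this sampler on synthetic data, computing π(γ_k | D, γ_{-k}) through the marginal likelihood π(D | γ) with (β_γ, σ²) integrated out; the data, hyperparameter values, and run length are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(6)

# --- hypothetical synthetic data: only variable 0 truly matters ---
n, p = 50, 3
X = rng.standard_normal((n, p))
y = 3.0 * X[:, 0] + 0.5 * rng.standard_normal(n)

delta2, nu0, gamma0 = 100.0, 1.0, 1.0    # assumed hyperparameter values
XtX, Xty, yty = X.T @ X, X.T @ y, y @ y

def log_marginal(gam):
    """log pi(D | gamma) up to an additive constant (beta, sigma^2 integrated out)."""
    idx = np.flatnonzero(gam)
    if idx.size == 0:
        return -0.5 * (nu0 + n) * np.log(gamma0 + yty)
    Sinv = XtX[np.ix_(idx, idx)] + np.eye(idx.size) / delta2   # Sigma_gamma^{-1}
    mu = np.linalg.solve(Sinv, Xty[idx])                       # mu_gamma
    _, logdet_Sinv = np.linalg.slogdet(Sinv)
    return (-0.5 * logdet_Sinv - 0.5 * idx.size * np.log(delta2)
            - 0.5 * (nu0 + n) * np.log(gamma0 + yty - Xty[idx] @ mu))

gam = np.zeros(p, dtype=int)
counts = np.zeros(p)
n_iter, burn = 2000, 200
for it in range(n_iter):
    for k in range(p):
        lm = np.empty(2)
        for v in (0, 1):                 # evaluate both states of gamma_k
            gam[k] = v
            lm[v] = log_marginal(gam)
        # uniform prior on gamma: conditional inclusion probability is a softmax
        prob1 = 1.0 / (1.0 + np.exp(np.clip(lm[0] - lm[1], -700.0, 700.0)))
        gam[k] = int(rng.random() < prob1)
    if it >= burn:
        counts += gam
incl_prob = counts / (n_iter - burn)     # posterior inclusion probabilities
```

With a strong signal on X_0, the chain should give variable 0 an inclusion probability near 1 and the spurious variables much smaller ones, illustrating why marginalizing (β_γ, σ²) yields a well-behaved sampler over γ alone.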

50 Bayesian Variable Selection Example. Consider the case where δ² is unknown.

Initialization: select deterministically or randomly (γ^(0), β_γ^(0), σ^{2(0)}, δ^{2(0)}).
Iteration i, i ≥ 1:
For k = 1 : p, sample γ_k^(i) ~ π(γ_k | D, γ_{-k}^(i), δ^{2(i-1)}), where γ_{-k}^(i) = (γ_1^(i), ..., γ_{k-1}^(i), γ_{k+1}^(i-1), ..., γ_p^(i-1)).
Sample (β_γ^(i), σ^{2(i)}) ~ π(β_γ, σ² | D, γ^(i), δ^{2(i-1)}).
Sample δ^{2(i)} ~ π(δ² | β_γ^(i), γ^(i)).

51 Bayesian Variable Selection Example. This very simple sampler is much more efficient than those where γ is sampled conditional upon (β, σ²). However, it mixes very slowly because the components are updated one at a time. Updating correlated components together would significantly increase the convergence speed of the algorithm, at the cost of increased complexity. We will revisit linear regression models in more detail and provide an implementation of the variable selection caterpillar example in the following lecture.

52 Pine Processionary Caterpillars. Caterpillar dataset: a 1973 study to assess the influence of some forest settlement characteristics on the development of caterpillar colonies. The response variable is the log of the average number of nests of caterpillars per tree in an area of 500 m². We have n = 33 data points and 10 explanatory variables. Following closely: Bayesian Core, J.-M. Marin and C. P. Robert, Chapter 3 (available online for Cornell students).

53 Regression. Linear regression is one of the most widespread statistical tools for modeling the (linear) influence of some variables on others. The variable of primary interest, y, is called the response variable, e.g. here the number of pine processionary caterpillar colonies. The variables x = (x_1, ..., x_k) are the explanatory variables; these variables can be continuous or discrete. Our objective is to uncover explanatory and predictive patterns.

54 Caterpillar Regression Problem. The pine processionary caterpillar colony size is influenced by:
x_1, the altitude (in meters);
x_2, the slope (in degrees);
x_3, the number of pines in the square;
x_4, the height (in meters) of the tree sampled at the center of the square;
x_5, the diameter of the tree sampled at the center of the square;
x_6, the index of the settlement density;
x_7, the orientation of the square (from 1 if southbound to 2 otherwise);
x_8, the height (in meters) of the dominant tree;
x_9, the number of vegetation strata;
x_10, the mix settlement index (from 1 if not mixed to 2 if mixed).

55 Caterpillar Regression Problem. [Figure: semilog-y plots of the data (x_i, y), i = 1, ..., 9, one panel per explanatory variable x_1 through x_9.]

56 Bayesian Variable Selection Example. Top five most likely models for the selection models discussed. Results from: Statistical Computing and MC Methods, A. Doucet.


The Particle Filter. PD Dr. Rudolph Triebel Computer Vision Group. Machine Learning for Computer Vision The Particle Filter Non-parametric implementation of Bayes filter Represents the belief (posterior) random state samples. by a set of This representation is approximate. Can represent distributions that

More information

Sampling Methods (11/30/04)

Sampling Methods (11/30/04) CS281A/Stat241A: Statistical Learning Theory Sampling Methods (11/30/04) Lecturer: Michael I. Jordan Scribe: Jaspal S. Sandhu 1 Gibbs Sampling Figure 1: Undirected and directed graphs, respectively, with

More information

Bayesian Inference in GLMs. Frequentists typically base inferences on MLEs, asymptotic confidence

Bayesian Inference in GLMs. Frequentists typically base inferences on MLEs, asymptotic confidence Bayesian Inference in GLMs Frequentists typically base inferences on MLEs, asymptotic confidence limits, and log-likelihood ratio tests Bayesians base inferences on the posterior distribution of the unknowns

More information

Learning the hyper-parameters. Luca Martino

Learning the hyper-parameters. Luca Martino Learning the hyper-parameters Luca Martino 2017 2017 1 / 28 Parameters and hyper-parameters 1. All the described methods depend on some choice of hyper-parameters... 2. For instance, do you recall λ (bandwidth

More information

MCMC Methods: Gibbs and Metropolis

MCMC Methods: Gibbs and Metropolis MCMC Methods: Gibbs and Metropolis Patrick Breheny February 28 Patrick Breheny BST 701: Bayesian Modeling in Biostatistics 1/30 Introduction As we have seen, the ability to sample from the posterior distribution

More information

Hierarchical models. Dr. Jarad Niemi. August 31, Iowa State University. Jarad Niemi (Iowa State) Hierarchical models August 31, / 31

Hierarchical models. Dr. Jarad Niemi. August 31, Iowa State University. Jarad Niemi (Iowa State) Hierarchical models August 31, / 31 Hierarchical models Dr. Jarad Niemi Iowa State University August 31, 2017 Jarad Niemi (Iowa State) Hierarchical models August 31, 2017 1 / 31 Normal hierarchical model Let Y ig N(θ g, σ 2 ) for i = 1,...,

More information

Approximate Bayesian Computation: a simulation based approach to inference

Approximate Bayesian Computation: a simulation based approach to inference Approximate Bayesian Computation: a simulation based approach to inference Richard Wilkinson Simon Tavaré 2 Department of Probability and Statistics University of Sheffield 2 Department of Applied Mathematics

More information

Principles of Bayesian Inference

Principles of Bayesian Inference Principles of Bayesian Inference Sudipto Banerjee University of Minnesota July 20th, 2008 1 Bayesian Principles Classical statistics: model parameters are fixed and unknown. A Bayesian thinks of parameters

More information

Caterpillar Regression Example: Conjugate Priors, Conditional & Marginal Posteriors, Predictive Distribution, Variable Selection

Caterpillar Regression Example: Conjugate Priors, Conditional & Marginal Posteriors, Predictive Distribution, Variable Selection Caterpillar Regression Example: Conjugate Priors, Conditional & Marginal Posteriors, Predictive Distribution, Variable Selection Prof. Nicholas Zabaras University of Notre Dame Notre Dame, IN, USA Email:

More information

Session 3A: Markov chain Monte Carlo (MCMC)

Session 3A: Markov chain Monte Carlo (MCMC) Session 3A: Markov chain Monte Carlo (MCMC) John Geweke Bayesian Econometrics and its Applications August 15, 2012 ohn Geweke Bayesian Econometrics and its Session Applications 3A: Markov () chain Monte

More information

Computer intensive statistical methods

Computer intensive statistical methods Lecture 13 MCMC, Hybrid chains October 13, 2015 Jonas Wallin jonwal@chalmers.se Chalmers, Gothenburg university MH algorithm, Chap:6.3 The metropolis hastings requires three objects, the distribution of

More information

Hastings-within-Gibbs Algorithm: Introduction and Application on Hierarchical Model

Hastings-within-Gibbs Algorithm: Introduction and Application on Hierarchical Model UNIVERSITY OF TEXAS AT SAN ANTONIO Hastings-within-Gibbs Algorithm: Introduction and Application on Hierarchical Model Liang Jing April 2010 1 1 ABSTRACT In this paper, common MCMC algorithms are introduced

More information

BAYESIAN MODEL CRITICISM

BAYESIAN MODEL CRITICISM Monte via Chib s BAYESIAN MODEL CRITICM Hedibert Freitas Lopes The University of Chicago Booth School of Business 5807 South Woodlawn Avenue, Chicago, IL 60637 http://faculty.chicagobooth.edu/hedibert.lopes

More information

April 20th, Advanced Topics in Machine Learning California Institute of Technology. Markov Chain Monte Carlo for Machine Learning

April 20th, Advanced Topics in Machine Learning California Institute of Technology. Markov Chain Monte Carlo for Machine Learning for for Advanced Topics in California Institute of Technology April 20th, 2017 1 / 50 Table of Contents for 1 2 3 4 2 / 50 History of methods for Enrico Fermi used to calculate incredibly accurate predictions

More information

16 : Approximate Inference: Markov Chain Monte Carlo

16 : Approximate Inference: Markov Chain Monte Carlo 10-708: Probabilistic Graphical Models 10-708, Spring 2017 16 : Approximate Inference: Markov Chain Monte Carlo Lecturer: Eric P. Xing Scribes: Yuan Yang, Chao-Ming Yen 1 Introduction As the target distribution

More information

Bayesian Nonparametric Regression for Diabetes Deaths

Bayesian Nonparametric Regression for Diabetes Deaths Bayesian Nonparametric Regression for Diabetes Deaths Brian M. Hartman PhD Student, 2010 Texas A&M University College Station, TX, USA David B. Dahl Assistant Professor Texas A&M University College Station,

More information

10. Exchangeability and hierarchical models Objective. Recommended reading

10. Exchangeability and hierarchical models Objective. Recommended reading 10. Exchangeability and hierarchical models Objective Introduce exchangeability and its relation to Bayesian hierarchical models. Show how to fit such models using fully and empirical Bayesian methods.

More information

Approximate Inference using MCMC

Approximate Inference using MCMC Approximate Inference using MCMC 9.520 Class 22 Ruslan Salakhutdinov BCS and CSAIL, MIT 1 Plan 1. Introduction/Notation. 2. Examples of successful Bayesian models. 3. Basic Sampling Algorithms. 4. Markov

More information

Robert Collins CSE586, PSU Intro to Sampling Methods

Robert Collins CSE586, PSU Intro to Sampling Methods Robert Collins Intro to Sampling Methods CSE586 Computer Vision II Penn State Univ Robert Collins A Brief Overview of Sampling Monte Carlo Integration Sampling and Expected Values Inverse Transform Sampling

More information

Markov Chain Monte Carlo

Markov Chain Monte Carlo 1 Motivation 1.1 Bayesian Learning Markov Chain Monte Carlo Yale Chang In Bayesian learning, given data X, we make assumptions on the generative process of X by introducing hidden variables Z: p(z): prior

More information

What is the most likely year in which the change occurred? Did the rate of disasters increase or decrease after the change-point?

What is the most likely year in which the change occurred? Did the rate of disasters increase or decrease after the change-point? Chapter 11 Markov Chain Monte Carlo Methods 11.1 Introduction In many applications of statistical modeling, the data analyst would like to use a more complex model for a data set, but is forced to resort

More information

An introduction to Sequential Monte Carlo

An introduction to Sequential Monte Carlo An introduction to Sequential Monte Carlo Thang Bui Jes Frellsen Department of Engineering University of Cambridge Research and Communication Club 6 February 2014 1 Sequential Monte Carlo (SMC) methods

More information

Computer Vision Group Prof. Daniel Cremers. 10a. Markov Chain Monte Carlo

Computer Vision Group Prof. Daniel Cremers. 10a. Markov Chain Monte Carlo Group Prof. Daniel Cremers 10a. Markov Chain Monte Carlo Markov Chain Monte Carlo In high-dimensional spaces, rejection sampling and importance sampling are very inefficient An alternative is Markov Chain

More information

CSC 2541: Bayesian Methods for Machine Learning

CSC 2541: Bayesian Methods for Machine Learning CSC 2541: Bayesian Methods for Machine Learning Radford M. Neal, University of Toronto, 2011 Lecture 3 More Markov Chain Monte Carlo Methods The Metropolis algorithm isn t the only way to do MCMC. We ll

More information

Markov Chain Monte Carlo methods

Markov Chain Monte Carlo methods Markov Chain Monte Carlo methods By Oleg Makhnin 1 Introduction a b c M = d e f g h i 0 f(x)dx 1.1 Motivation 1.1.1 Just here Supresses numbering 1.1.2 After this 1.2 Literature 2 Method 2.1 New math As

More information

Markov chain Monte Carlo Lecture 9

Markov chain Monte Carlo Lecture 9 Markov chain Monte Carlo Lecture 9 David Sontag New York University Slides adapted from Eric Xing and Qirong Ho (CMU) Limitations of Monte Carlo Direct (unconditional) sampling Hard to get rare events

More information

Bayesian Prediction of Code Output. ASA Albuquerque Chapter Short Course October 2014

Bayesian Prediction of Code Output. ASA Albuquerque Chapter Short Course October 2014 Bayesian Prediction of Code Output ASA Albuquerque Chapter Short Course October 2014 Abstract This presentation summarizes Bayesian prediction methodology for the Gaussian process (GP) surrogate representation

More information

Lecture 7 and 8: Markov Chain Monte Carlo

Lecture 7 and 8: Markov Chain Monte Carlo Lecture 7 and 8: Markov Chain Monte Carlo 4F13: Machine Learning Zoubin Ghahramani and Carl Edward Rasmussen Department of Engineering University of Cambridge http://mlg.eng.cam.ac.uk/teaching/4f13/ Ghahramani

More information

MARKOV CHAIN MONTE CARLO

MARKOV CHAIN MONTE CARLO MARKOV CHAIN MONTE CARLO RYAN WANG Abstract. This paper gives a brief introduction to Markov Chain Monte Carlo methods, which offer a general framework for calculating difficult integrals. We start with

More information

Lecture 8: The Metropolis-Hastings Algorithm

Lecture 8: The Metropolis-Hastings Algorithm 30.10.2008 What we have seen last time: Gibbs sampler Key idea: Generate a Markov chain by updating the component of (X 1,..., X p ) in turn by drawing from the full conditionals: X (t) j Two drawbacks:

More information

Bayesian Estimation of DSGE Models 1 Chapter 3: A Crash Course in Bayesian Inference

Bayesian Estimation of DSGE Models 1 Chapter 3: A Crash Course in Bayesian Inference 1 The views expressed in this paper are those of the authors and do not necessarily reflect the views of the Federal Reserve Board of Governors or the Federal Reserve System. Bayesian Estimation of DSGE

More information

Monte Carlo methods for sampling-based Stochastic Optimization

Monte Carlo methods for sampling-based Stochastic Optimization Monte Carlo methods for sampling-based Stochastic Optimization Gersende FORT LTCI CNRS & Telecom ParisTech Paris, France Joint works with B. Jourdain, T. Lelièvre, G. Stoltz from ENPC and E. Kuhn from

More information

Data Analysis I. Dr Martin Hendry, Dept of Physics and Astronomy University of Glasgow, UK. 10 lectures, beginning October 2006

Data Analysis I. Dr Martin Hendry, Dept of Physics and Astronomy University of Glasgow, UK. 10 lectures, beginning October 2006 Astronomical p( y x, I) p( x, I) p ( x y, I) = p( y, I) Data Analysis I Dr Martin Hendry, Dept of Physics and Astronomy University of Glasgow, UK 10 lectures, beginning October 2006 4. Monte Carlo Methods

More information

Lecture 6: Markov Chain Monte Carlo

Lecture 6: Markov Chain Monte Carlo Lecture 6: Markov Chain Monte Carlo D. Jason Koskinen koskinen@nbi.ku.dk Photo by Howard Jackman University of Copenhagen Advanced Methods in Applied Statistics Feb - Apr 2016 Niels Bohr Institute 2 Outline

More information

The University of Auckland Applied Mathematics Bayesian Methods for Inverse Problems : why and how Colin Fox Tiangang Cui, Mike O Sullivan (Auckland),

The University of Auckland Applied Mathematics Bayesian Methods for Inverse Problems : why and how Colin Fox Tiangang Cui, Mike O Sullivan (Auckland), The University of Auckland Applied Mathematics Bayesian Methods for Inverse Problems : why and how Colin Fox Tiangang Cui, Mike O Sullivan (Auckland), Geoff Nicholls (Statistics, Oxford) fox@math.auckland.ac.nz

More information

SAMPLING ALGORITHMS. In general. Inference in Bayesian models

SAMPLING ALGORITHMS. In general. Inference in Bayesian models SAMPLING ALGORITHMS SAMPLING ALGORITHMS In general A sampling algorithm is an algorithm that outputs samples x 1, x 2,... from a given distribution P or density p. Sampling algorithms can for example be

More information

Control Variates for Markov Chain Monte Carlo

Control Variates for Markov Chain Monte Carlo Control Variates for Markov Chain Monte Carlo Dellaportas, P., Kontoyiannis, I., and Tsourti, Z. Dept of Statistics, AUEB Dept of Informatics, AUEB 1st Greek Stochastics Meeting Monte Carlo: Probability

More information

Bayesian Inference and MCMC

Bayesian Inference and MCMC Bayesian Inference and MCMC Aryan Arbabi Partly based on MCMC slides from CSC412 Fall 2018 1 / 18 Bayesian Inference - Motivation Consider we have a data set D = {x 1,..., x n }. E.g each x i can be the

More information

Monte Carlo Methods in Bayesian Inference: Theory, Methods and Applications

Monte Carlo Methods in Bayesian Inference: Theory, Methods and Applications University of Arkansas, Fayetteville ScholarWorks@UARK Theses and Dissertations 1-016 Monte Carlo Methods in Bayesian Inference: Theory, Methods and Applications Huarui Zhang University of Arkansas, Fayetteville

More information

On Markov chain Monte Carlo methods for tall data

On Markov chain Monte Carlo methods for tall data On Markov chain Monte Carlo methods for tall data Remi Bardenet, Arnaud Doucet, Chris Holmes Paper review by: David Carlson October 29, 2016 Introduction Many data sets in machine learning and computational

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 7 Approximate

More information

Brief introduction to Markov Chain Monte Carlo

Brief introduction to Markov Chain Monte Carlo Brief introduction to Department of Probability and Mathematical Statistics seminar Stochastic modeling in economics and finance November 7, 2011 Brief introduction to Content 1 and motivation Classical

More information

Surveying the Characteristics of Population Monte Carlo

Surveying the Characteristics of Population Monte Carlo International Research Journal of Applied and Basic Sciences 2013 Available online at www.irjabs.com ISSN 2251-838X / Vol, 7 (9): 522-527 Science Explorer Publications Surveying the Characteristics of

More information

Advanced Statistical Modelling

Advanced Statistical Modelling Markov chain Monte Carlo (MCMC) Methods and Their Applications in Bayesian Statistics School of Technology and Business Studies/Statistics Dalarna University Borlänge, Sweden. Feb. 05, 2014. Outlines 1

More information

CSC 2541: Bayesian Methods for Machine Learning

CSC 2541: Bayesian Methods for Machine Learning CSC 2541: Bayesian Methods for Machine Learning Radford M. Neal, University of Toronto, 2011 Lecture 4 Problem: Density Estimation We have observed data, y 1,..., y n, drawn independently from some unknown

More information

MCMC Sampling for Bayesian Inference using L1-type Priors

MCMC Sampling for Bayesian Inference using L1-type Priors MÜNSTER MCMC Sampling for Bayesian Inference using L1-type Priors (what I do whenever the ill-posedness of EEG/MEG is just not frustrating enough!) AG Imaging Seminar Felix Lucka 26.06.2012 , MÜNSTER Sampling

More information

Probabilistic Graphical Models Lecture 17: Markov chain Monte Carlo

Probabilistic Graphical Models Lecture 17: Markov chain Monte Carlo Probabilistic Graphical Models Lecture 17: Markov chain Monte Carlo Andrew Gordon Wilson www.cs.cmu.edu/~andrewgw Carnegie Mellon University March 18, 2015 1 / 45 Resources and Attribution Image credits,

More information

Computer Practical: Metropolis-Hastings-based MCMC

Computer Practical: Metropolis-Hastings-based MCMC Computer Practical: Metropolis-Hastings-based MCMC Andrea Arnold and Franz Hamilton North Carolina State University July 30, 2016 A. Arnold / F. Hamilton (NCSU) MH-based MCMC July 30, 2016 1 / 19 Markov

More information

Part 6: Multivariate Normal and Linear Models

Part 6: Multivariate Normal and Linear Models Part 6: Multivariate Normal and Linear Models 1 Multiple measurements Up until now all of our statistical models have been univariate models models for a single measurement on each member of a sample of

More information

Review. DS GA 1002 Statistical and Mathematical Models. Carlos Fernandez-Granda

Review. DS GA 1002 Statistical and Mathematical Models.   Carlos Fernandez-Granda Review DS GA 1002 Statistical and Mathematical Models http://www.cims.nyu.edu/~cfgranda/pages/dsga1002_fall16 Carlos Fernandez-Granda Probability and statistics Probability: Framework for dealing with

More information

MCMC algorithms for fitting Bayesian models

MCMC algorithms for fitting Bayesian models MCMC algorithms for fitting Bayesian models p. 1/1 MCMC algorithms for fitting Bayesian models Sudipto Banerjee sudiptob@biostat.umn.edu University of Minnesota MCMC algorithms for fitting Bayesian models

More information

Bagging During Markov Chain Monte Carlo for Smoother Predictions

Bagging During Markov Chain Monte Carlo for Smoother Predictions Bagging During Markov Chain Monte Carlo for Smoother Predictions Herbert K. H. Lee University of California, Santa Cruz Abstract: Making good predictions from noisy data is a challenging problem. Methods

More information

Bayesian data analysis in practice: Three simple examples

Bayesian data analysis in practice: Three simple examples Bayesian data analysis in practice: Three simple examples Martin P. Tingley Introduction These notes cover three examples I presented at Climatea on 5 October 0. Matlab code is available by request to

More information

Results: MCMC Dancers, q=10, n=500

Results: MCMC Dancers, q=10, n=500 Motivation Sampling Methods for Bayesian Inference How to track many INTERACTING targets? A Tutorial Frank Dellaert Results: MCMC Dancers, q=10, n=500 1 Probabilistic Topological Maps Results Real-Time

More information

STA 294: Stochastic Processes & Bayesian Nonparametrics

STA 294: Stochastic Processes & Bayesian Nonparametrics MARKOV CHAINS AND CONVERGENCE CONCEPTS Markov chains are among the simplest stochastic processes, just one step beyond iid sequences of random variables. Traditionally they ve been used in modelling a

More information

Sequential Monte Carlo Methods for Bayesian Computation

Sequential Monte Carlo Methods for Bayesian Computation Sequential Monte Carlo Methods for Bayesian Computation A. Doucet Kyoto Sept. 2012 A. Doucet (MLSS Sept. 2012) Sept. 2012 1 / 136 Motivating Example 1: Generic Bayesian Model Let X be a vector parameter

More information

Spatial Statistics with Image Analysis. Outline. A Statistical Approach. Johan Lindström 1. Lund October 6, 2016

Spatial Statistics with Image Analysis. Outline. A Statistical Approach. Johan Lindström 1. Lund October 6, 2016 Spatial Statistics Spatial Examples More Spatial Statistics with Image Analysis Johan Lindström 1 1 Mathematical Statistics Centre for Mathematical Sciences Lund University Lund October 6, 2016 Johan Lindström

More information

Sequential Monte Carlo Methods

Sequential Monte Carlo Methods University of Pennsylvania Bradley Visitor Lectures October 23, 2017 Introduction Unfortunately, standard MCMC can be inaccurate, especially in medium and large-scale DSGE models: disentangling importance

More information

Calibration of Stochastic Volatility Models using Particle Markov Chain Monte Carlo Methods

Calibration of Stochastic Volatility Models using Particle Markov Chain Monte Carlo Methods Calibration of Stochastic Volatility Models using Particle Markov Chain Monte Carlo Methods Jonas Hallgren 1 1 Department of Mathematics KTH Royal Institute of Technology Stockholm, Sweden BFS 2012 June

More information

Introduction to Bayesian methods in inverse problems

Introduction to Bayesian methods in inverse problems Introduction to Bayesian methods in inverse problems Ville Kolehmainen 1 1 Department of Applied Physics, University of Eastern Finland, Kuopio, Finland March 4 2013 Manchester, UK. Contents Introduction

More information

Pattern Recognition and Machine Learning

Pattern Recognition and Machine Learning Christopher M. Bishop Pattern Recognition and Machine Learning ÖSpri inger Contents Preface Mathematical notation Contents vii xi xiii 1 Introduction 1 1.1 Example: Polynomial Curve Fitting 4 1.2 Probability

More information

Deblurring Jupiter (sampling in GLIP faster than regularized inversion) Colin Fox Richard A. Norton, J.

Deblurring Jupiter (sampling in GLIP faster than regularized inversion) Colin Fox Richard A. Norton, J. Deblurring Jupiter (sampling in GLIP faster than regularized inversion) Colin Fox fox@physics.otago.ac.nz Richard A. Norton, J. Andrés Christen Topics... Backstory (?) Sampling in linear-gaussian hierarchical

More information