Markov Chain Monte Carlo (MCMC): Dependent Sampling

Suppose we wish to sample from a density π, and we can evaluate π as a function but have no means to directly generate a sample. Rejection sampling can be used if a trial density f can be found such that π/f has a reasonable bound. When the dimension is high this is difficult or impossible.

Rejection sampling, the inversion method, and the density transform method all produce independent realizations from π. If these methods are inefficient or difficult to implement, we can drop independence as a goal, aiming instead to generate a dependent sequence X_1, X_2, ... such that the marginal distribution of each X_i is π. Going a step further, we may allow the marginal distribution π_k of X_k to differ from π, as long as it converges to π in some sense: π_k → π. By relaxing the goal of the problem, it becomes possible to overcome a key shortcoming of rejection sampling: it does not adapt as information about π is built up through repeated evaluations of π at the sampled points. A practical framework for constructing dependent sequences satisfying these goals is provided by discrete time Markov chains.

Review of Markov Chains

A discrete time Markov chain is a sequence of random variables X_0, X_1, ... such that

    P(X_{n+1} ∈ A | X_0, ..., X_n) = P(X_{n+1} ∈ A | X_n) ≡ P(X_n, A).

The function P is called the transition kernel. A probability distribution π is called the invariant distribution for P if

    π(A) = ∫ P(x, A) π(x) dx

for all measurable sets A, in which case we may write πP = π. More generally this relationship may hold for a non-finite measure π, in which case π is called an invariant measure for P.
Suppose that P has a density, denoted q(x, y). This means that P(x, A) = ∫_A q(x, y) dy. A sufficient condition for π to be the invariant distribution for P is detailed balance:

    π(x) q(x, y) = π(y) q(y, x).

The proof is a simple application of Fubini's theorem.

The distribution π is the equilibrium distribution for X_0, X_1, ... if, writing P^n(X_0, A) = P(X_n ∈ A | X_0),

    lim_{n→∞} P^n(x, A) = π(A)

for all measurable sets A and for π almost-every x.

Key definitions:

Irreducible: The Markov chain given by transition kernel P with invariant distribution π is irreducible if for each measurable set A with π(A) > 0 and each x, there exists n such that P^n(x, A) > 0.

Recurrent: The Markov chain given by transition kernel P with invariant distribution π is recurrent if for every measurable set B with π(B) > 0,

    P(X_1, X_2, ... ∈ B i.o. | X_0) > 0 for all X_0, and
    P(X_1, X_2, ... ∈ B i.o. | X_0) = 1 for π almost-every X_0.

The Markov chain given by transition kernel P with invariant distribution π is Harris recurrent if

    P(X_1, X_2, ... ∈ B i.o. | X_0) = 1 for all X_0.

Fact: A Markov transition kernel P that is π-irreducible and has an invariant measure is recurrent. A recurrent kernel P with a finite invariant measure is positive recurrent; otherwise it is null recurrent.
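Detailed balance and invariance can be checked directly in a small discrete example. The sketch below is illustrative (the 3-state chain and target distribution are assumptions, not from the notes): it builds a kernel satisfying detailed balance from a symmetric proposal via a Metropolis-style acceptance step, and confirms that π is then invariant, i.e. πP = π.

```python
import numpy as np

# Illustrative 3-state target distribution (an assumption for this sketch).
pi = np.array([0.2, 0.3, 0.5])

prop = np.full((3, 3), 1.0 / 3.0)  # symmetric proposal matrix
P = np.zeros((3, 3))
for i in range(3):
    for j in range(3):
        if i != j:
            # Metropolis-style acceptance forces pi[i] P[i,j] = pi[j] P[j,i].
            P[i, j] = prop[i, j] * min(1.0, pi[j] / pi[i])
    P[i, i] = 1.0 - P[i].sum()  # remaining mass stays at state i

balance = np.allclose(pi[:, None] * P, (pi[:, None] * P).T)  # detailed balance
invariant = np.allclose(pi @ P, pi)                          # pi P = pi
```

Both checks pass, illustrating the general fact proved via Fubini's theorem: detailed balance implies invariance.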
Periodic: A Markov transition kernel P is periodic if there exist n ≥ 2 and a sequence of nonempty, disjoint measurable sets A_1, A_2, ..., A_n such that P(x, A_{j+1}) = 1 for all x ∈ A_j (j < n), and P(x, A_1) = 1 for all x ∈ A_n.

Total variation norm: For any measure λ, the total variation (TV) norm is

    ||λ|| = sup_A λ(A) − inf_A λ(A),

where the supremum and infimum are taken over all λ measurable sets. The distance between two probability measures π_1 and π_2 can be assessed using ||π_1 − π_2||.

Theorem: Let P be a Markov transition kernel with invariant distribution π, and suppose that P is π-irreducible. Then P is positive recurrent, and π is the unique invariant distribution of P. If P is aperiodic, then for π almost-every x,

    P^n(x, A) − π(A) → 0

for all measurable sets A (P^n converges to π in the TV norm). If P is Harris recurrent, then the convergence occurs for all x.

In words, we can start at essentially any X_0, run the chain for a long time, and the final draw has a distribution that is approximately π. The long time is called the burn in, and it is generally impossible to know how long this needs to be in order to get a good approximation.

We don't need independence for the most common application of Monte Carlo studies: estimating E_π f, the mean of a function f with respect to the invariant distribution π. If the chain is uniformly ergodic, meaning

    sup_x ||P^n(x, ·) − π|| ≤ M r^n

for some M ≥ 0 and r < 1, there is a central limit theorem: let f̄_n = Σ_{j=1}^n f(X_j)/n denote the sample mean. Then √n (f̄_n − E_π f) converges weakly to a normal distribution with mean 0.

Methods for generating chains with a given equilibrium distribution

We have a density π that we can evaluate but not sample from. How do we construct a Markov chain P with invariant distribution π?

Gibbs sampling: Suppose the random variable X = (X_1, ..., X_d) can be partitioned into blocks, and we can sample from each of the conditional distributions P(X_i | {X_j, j ≠ i}). These transitions are invariant with respect to P(X_1, ..., X_d).
Consider the case X = (X_1, X_2), and for a set A, let A_{X_1} = {X_2 : (X_1, X_2) ∈ A}. Under the proposal distribution P(X_2 | X_1):
    ∫∫ P((X_1, X_2), A) π(X_1, X_2) dX_1 dX_2 = ∫∫ ( ∫_{A_{X_1}} π(y | X_1) dy ) π(X_2 | X_1) π(X_1) dX_1 dX_2
        = ∫ ∫_{A_{X_1}} π(y | X_1) π(X_1) dy dX_1
        = π(A).

The Gibbs chain will in general only be irreducible and aperiodic if we combine all of the conditional distributions, for example in random order.

The Metropolis (or Metropolis-Hastings) algorithm: Suppose we have a transition density q(x, y), meaning P(X_{n+1} ∈ A | X_n = x) = ∫_A q(x, y) dy. Define the Metropolis-Hastings ratio by

    α(x, y) = min{ π(y) q(y, x) / (π(x) q(x, y)), 1 }   if π(x) q(x, y) > 0
            = 1                                          if π(x) q(x, y) = 0.

The kernel q(x, y) will roughly play the role that the trial density plays in rejection sampling. Suppose we are at X_n. Sample Y from q(X_n, y). With probability α(X_n, Y) set X_{n+1} = Y; otherwise set X_{n+1} = X_n. The transition kernel can be represented as

    p(x, y) = q(x, y) α(x, y) + r(x) δ_x,

where δ_x is the point mass at x, and

    r(x) = 1 − ∫ q(x, y) α(x, y) dy

is the marginal probability of remaining at x. The resulting chain satisfies detailed balance, and hence has invariant distribution π. Only density evaluations are required, and unknown normalizing constants cancel.

General Metropolis-Hastings schemes

Gibbs sampling: If q(x, y) = π(y | x) then α(x, y) ≡ 1, so we accept every move. This shows that Gibbs sampling is a special case of the Metropolis-Hastings algorithm.

Random walks: Suppose q(x, y) is a random walk: q(x, y) = q*(y − x) for some distribution q*. The Metropolis-Hastings ratio is min(π(y) q*(x − y) / (π(x) q*(y − x)), 1) (or 1 if π(x) q*(y − x) = 0). If q* is symmetric, then the Metropolis-Hastings ratio reduces to min(π(y)/π(x), 1) (or 1 if π(x) = 0).
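The random walk scheme above can be sketched in a few lines. The following is a minimal illustration under assumptions not taken from the notes (a standard normal target, Gaussian steps, and the specific step size); note that only evaluations of the unnormalized log density are required.

```python
import numpy as np

rng = np.random.default_rng(0)

def metropolis(log_pi, x0, n_steps, step_sd):
    """Random walk Metropolis with symmetric Gaussian steps.

    Because the proposal is symmetric, the acceptance ratio reduces
    to pi(y)/pi(x); any normalizing constant of pi cancels."""
    x = x0
    draws = np.empty(n_steps)
    for i in range(n_steps):
        y = x + step_sd * rng.standard_normal()
        if np.log(rng.uniform()) < log_pi(y) - log_pi(x):
            x = y  # accept the move; otherwise repeat the current point
        draws[i] = x
    return draws

# Illustrative target (an assumption): standard normal, known only up to
# a constant -- the normalizer never enters the algorithm.
log_pi = lambda x: -0.5 * x**2
draws = metropolis(log_pi, x0=0.0, n_steps=50_000, step_sd=2.5)
```

The sample mean and variance of the draws then approximate the target's moments, which is the estimation use case described above.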
An important practical issue with random walk chains is the scaling of the random walk density q*. If the variance of q* is too large, most steps will move into the tails of the distribution and will be rejected. On the other hand, if the variance is too small, the autocorrelation in the sequence of accepted draws will be high and convergence to the target distribution will be slow. Roberts and Rosenthal (2001) have shown that an acceptance rate of around 0.234 is optimal when the target density is the product of identical marginal density functions.

Independence chains: If q(x, y) = q(y), so that the proposals are an iid sequence, the Metropolis-Hastings ratio is min(π(y) q(x) / (π(x) q(y)), 1) (or 1 if π(x) q(y) = 0). This is similar to rejection sampling; however, the acceptance probabilities are scaled by q(x)/q(y) rather than by 1/c, and points are repeated rather than rejected if the proposal fails. Two interesting special cases are:

1. q(x) ≡ 1: if the proposal distribution is uniform, it is even more similar to rejection sampling, since the Metropolis-Hastings ratio becomes min(π(y)/π(x), 1).

2. If q(y) = π(y), then the Metropolis-Hastings ratio is identically 1, and we are sampling directly from the target density.

Rejection sampling chains: Suppose we want to use rejection sampling from a trial density f to produce realizations from π. We can use the Metropolis-Hastings algorithm to rescue rejection sampling when we cannot find a good bound for π/f. We conjecture, but are not certain, that π/f < c. If we carry out rejection sampling, we are actually drawing from the density proportional to f* = min(π, cf). Suppose we use an independence chain generated by f* to drive the Metropolis-Hastings algorithm; thus q(x, y) ∝ min(π(y), c f(y)). Let S denote the subset of the sample space where the bound π/f < c holds. We have four possibilities:

1. x ∈ S, y ∈ S ⟹ α = 1.
2. x ∈ S, y ∉ S ⟹ α = 1.
3. x ∉ S, y ∈ S ⟹ α < 1.
4. x ∉ S, y ∉ S: in this case nothing can be said about α in general.
Since S^c is the set of points that will be undersampled by the rejection sampling procedure, it is reasonable that moves from S to S^c are always accepted, while moves from S^c to S always have a nonzero probability of being rejected. The chain dwells in S^c to build up mass that is missing from f.

Examples

Normal mean and variance: Suppose we observe Y_1, ..., Y_n iid from the N(µ, σ²) distribution. If we take flat (improper) priors on µ and σ², the posterior distribution is
    P(µ, σ² | {Y}) ∝ (σ²)^{−n/2} exp( −(1/(2σ²)) Σ_i (Y_i − µ)² ).

As a function of µ this is proportional to

    exp( −(1/(2σ²)) (nµ² − 2µ Σ_i Y_i) ) ∝ exp( −(n/(2σ²)) (µ − Ȳ)² ).

Thus

    P(µ | σ², {Y}) = N(Ȳ, σ²/n).

To get the conditional distribution of σ² given {Y} and µ, make the change of variables η = 1/σ², so

    P(η | µ, {Y}) ∝ η^{n/2−2} exp( −η Σ_i (Y_i − µ)² / 2 ).

This is a Γ(n/2 − 1, 2/Σ_i (Y_i − µ)²) distribution. So the Gibbs sampling algorithm alternates between sampling µ from a normal distribution and η from a gamma distribution.

Grouped binary data: Suppose we observe m groups of Bernoulli trials, where the i-th group consists of n_i iid trials. Let n_i be the number of trials in group i, and let p_i be the success probability. Suppose the p_i are viewed as random effects from a symmetric Beta distribution, which has density proportional to p^{α−1}(1 − p)^{α−1}. Let k_i be the number of successes observed in the i-th group, so

    P({k_i} | {p_i}) P({p_i} | α) ∝ Π_i p_i^{k_i+α−1} (1 − p_i)^{n_i−k_i+α−1}.

To be safe, we will take a nearly-flat proper exponential prior on α with mean θ (set to an arbitrarily large value). Gibbs sampling operates using the conditional distributions P(α | {p_i}, {k_i}) and P({p_i} | α, {k_i}). The specific updates are

    α ~ exponential with mean 1/( −Σ_i log p_i − Σ_i log(1 − p_i) + 1/θ )
and

    p_i ~ Beta(k_i + α, n_i − k_i + α).

Grouped continuous data with right-skewed group means: Suppose we observe data Y_ij for m groups with n_i replicates in group i, so i = 1, ..., m and j = 1, ..., n_i. Let µ_i be the population mean for group i, and suppose we take the µ_i to be random effects from an exponential distribution with mean λ. We will also take Y_i1, ..., Y_in_i | µ_i to be independent and Gaussian with mean µ_i and variance σ². If we form the improper posterior based on flat priors for λ and σ², then integrate out the random effects µ_i, we get

    P(λ, σ² | {Y}) ∝ Π_i (σ²)^{−n_i/2} λ^{−1} ∫ exp( −(1/(2σ²)) (Y_i − µ 1_{n_i})'(Y_i − µ 1_{n_i}) ) exp(−µ/λ) dµ
        = (σ²)^{−n/2+m/2} λ^{−m} Π_i exp( −σ^{−2}(S_i − M_i²/n_i)/2 − M_i λ^{−1}/n_i + σ² λ^{−2}/(2 n_i) ),

where Y_i = (Y_i1, ..., Y_in_i)', M_i = Σ_j Y_ij, and S_i = Σ_j Y_ij². This does not have a finite integral, therefore a sampling algorithm applied to this formulation of the problem will not give meaningful results. For simplicity, we first suppose that η = 1/λ and σ² have a joint prior density proportional to

    exp( −(1/2) Σ_i σ² η² / n_i )

on (0, ∞) × (0, ∞). In this case, the posterior on (η, σ²) is proportional to

    (σ²)^{−n/2+m/2} η^{m−2} exp( −σ^{−2} Σ_i (S_i − M_i²/n_i)/2 − η Σ_i M_i/n_i ).

Now if we set θ = 1/σ², we get the posterior expressed in terms of η and θ:

    θ^{(n−m)/2−2} η^{m−2} exp( −θ Σ_i (S_i − M_i²/n_i)/2 − η Σ_i M_i/n_i ).

Letting V_i = S_i/n_i − M_i²/n_i² (approximately the sample variance of group i) and M̄_i = M_i/n_i (the sample mean of group i), we get

    θ^{(n−m)/2−2} η^{m−2} exp( −θ Σ_i n_i V_i / 2 − η Σ_i M̄_i ).
Thus η and θ have independent Gamma posteriors:

    η ~ Γ(m − 1, 1/Σ_i M̄_i)
    θ ~ Γ((n − m)/2 − 1, 2/Σ_i n_i V_i).

If we wanted to use a different prior, say with λ and σ² independent and exponential with means β_λ and β_{σ²}, we could use the Gibbs sampler.

Cauchy regression: Suppose we take the linear regression model Y_i = α + β'X_i + ɛ_i, where Y_i ∈ R and X_i ∈ R^d are observed, and the ɛ_i have central Cauchy distributions, with density π^{−1}/(1 + ɛ_i²). The joint distribution of the observed data is

    π(Y_1, ..., Y_n | X_1, ..., X_n) = Π_i π^{−1} / (1 + (Y_i − α − β'X_i)²).

Suppose we consider the Bayesian model with an improper flat prior on θ = (α, β). We use a Metropolis-Hastings algorithm with a random walk distribution for q: q(θ, θ*) = φ(θ* − θ), where φ is the bivariate normal density with mean 0. Due to the symmetry of the random walk, the Metropolis-Hastings ratio simplifies to α(θ, θ*) = min(π(θ*)/π(θ), 1).

Rejection sampling using the wrong bound: Suppose we want to sample from the density π(x) = I(−1 ≤ x ≤ 1) |x|^{−1/2} / 4. The density is unbounded at x = 0, so we cannot use rejection sampling with a bounded trial density. Suppose we generate draws using the rejection sampling algorithm with a uniform trial density on (−1, 1) and c = 6. These draws will have density

    q(x) ∝ I(−1 ≤ x ≤ 1) min( |x|^{−1/2}/4, 3 ).

Now we use these draws to drive a Metropolis-Hastings algorithm. On S = {x : |x| ≥ 1/144}, the bound π/f < 6 is satisfied (where f is the uniform trial density). The Metropolis-Hastings ratio is given by the following:

1. x ∈ S, y ∈ S ⟹ α = 1.
2. x ∈ S, y ∉ S ⟹ α = 1.
3. x ∉ S, y ∈ S ⟹ α = 12√|x| < 1.
4. x ∉ S, y ∉ S ⟹ α = min(π(y)/π(x), 1) = min(√(|x|/|y|), 1).
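The wrong-bound example above can be run end to end. The sketch below (chain length and seed are arbitrary choices) generates proposals by rejection sampling with the incorrect bound c = 6, so the proposals follow the density proportional to min(π, 3) on (−1, 1), and then corrects them with an independence Metropolis-Hastings chain targeting π.

```python
import numpy as np

rng = np.random.default_rng(1)

# Target: pi(x) = |x|^{-1/2} / 4 on (-1, 1), unbounded at x = 0.
def pi(x):
    return 0.25 / np.sqrt(np.abs(x))

C = 3.0  # cap on pi implied by c = 6 and the uniform trial density f = 1/2

def fstar(x):
    # Unnormalized density of the rejection-sampling output: min(pi, c f).
    return min(pi(x), C)

def draw_proposal():
    # Rejection sampling with the (wrong) bound: accepted points actually
    # follow the density proportional to fstar, not pi.
    while True:
        x = rng.uniform(-1.0, 1.0)
        if rng.uniform() < min(pi(x) / C, 1.0):
            return x

# Independence Metropolis-Hastings chain driven by the proposals above.
n = 20_000
x = draw_proposal()
draws = np.empty(n)
for i in range(n):
    y = draw_proposal()
    alpha = min(pi(y) * fstar(x) / (pi(x) * fstar(y)), 1.0)
    if rng.uniform() < alpha:
        x = y
    draws[i] = x
```

As a check, E√|X| = ∫ |x|^{1/2} π(x) dx = 1/2 under π, and the chain's sample average of √|x| approaches this value even though the rejection step alone undersamples the spike at 0.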
Figure 1: Sample paths for α and β for the Cauchy regression model, using random walk input chains with Gaussian steps having mean zero and standard deviation σ = .01 (top), σ = 3 (middle), and σ = 1 (bottom).
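A sampler of the kind behind Figure 1 can be sketched as follows. The data-generating step (scalar covariate, true α = 1, β = −2, n = 200) and the step size are illustrative assumptions, not values from the notes; the notes' point about scaling can be explored by varying step_sd.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated data for illustration: linear model with Cauchy errors.
n = 200
X = rng.standard_normal(n)
Y = 1.0 - 2.0 * X + rng.standard_cauchy(n)

def log_post(theta):
    # Log posterior under a flat prior: sum of Cauchy log densities
    # of the residuals (constants dropped).
    a, b = theta
    return -np.sum(np.log1p((Y - a - b * X) ** 2))

def rw_metropolis(theta0, n_steps, step_sd):
    """Random walk Metropolis for the Cauchy regression posterior."""
    theta = np.asarray(theta0, dtype=float)
    lp = log_post(theta)
    path = np.empty((n_steps, 2))
    for i in range(n_steps):
        prop = theta + step_sd * rng.standard_normal(2)
        lp_prop = log_post(prop)
        # Symmetric proposal: log acceptance ratio is the log-posterior gap.
        if np.log(rng.uniform()) < lp_prop - lp:
            theta, lp = prop, lp_prop
        path[i] = theta
    return path

path = rw_metropolis([0.0, 0.0], n_steps=20_000, step_sd=0.2)
```

Plotting the two columns of path against the iteration index reproduces trace plots like those in Figure 1; after burn-in the chain fluctuates around the posterior mode.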
Gaussian mixture model: Suppose that Y_1, ..., Y_n ∈ R^d are independent and observed, δ_1, ..., δ_n ∈ {1, 2, ..., K} are iid and unobserved, and P(Y_j | δ_j = k) = φ(Y_j − µ_k), where φ(·) is the standard multivariate normal density. The parameters in the model are µ_1, ..., µ_K and π_1, ..., π_K, where π_k = P(δ_j = k). We can take flat priors on these parameters, and estimate them through their posterior distribution, which can be simulated using Gibbs sampling. Here is a full set of conditional distributions, where µ = [µ_1, ..., µ_K], and similarly for Y and δ.

    P(µ | π, Y, δ) = P(µ | Y, δ) ∝ P(Y | µ, δ) = Π_j φ(Y_j − µ_{δ_j}).

This splits into a product of functions of µ_1, ..., µ_K, implying that the µ_k are independent given π, Y and δ. The factor that involves µ_k can be expressed

    exp( −(1/2) ( −2µ_k' Σ_j Y_j I(δ_j = k) + µ_k'µ_k Σ_j I(δ_j = k) ) ),

which is a multivariate normal density with mean Σ_j Y_j I(δ_j = k) / Σ_j I(δ_j = k) and variance 1/Σ_j I(δ_j = k) (with zero correlations).

    P(δ | π, Y, µ) ∝ P(Y | δ, µ) P(δ | π) = Π_j φ(Y_j − µ_{δ_j}) π_{δ_j}.

This splits into a product of functions of δ_1, ..., δ_n, implying that the δ_j are independent given π, Y and µ. To sample δ_j, first compute q_k = φ(Y_j − µ_k) π_k for k = 1, ..., K, then rescale the q_k so they sum to 1, and generate a multinomial draw according to the probabilities that result.

    P(π | Y, µ, δ) ∝ P(δ | π) ∝ Π_k π_k^{Σ_j I(δ_j = k)}.

Let n_k = Σ_j I(δ_j = k), so the posterior density of π is proportional to Π_k π_k^{n_k}. This is a Dirichlet density, from which samples can be generated by simulating Z_k ~ Γ(n_k + 1, 1) and setting π_k = Z_k / Σ_l Z_l.
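The three conditional updates above can be sketched as a Gibbs sampler. The data simulation (K = 3, d = 2, n = 300, the particular means and weights) is an illustrative assumption, not the configuration used in the notes' figure.

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated data for illustration: 3 components in 2 dimensions,
# identity covariance as the model assumes.
K, d, n = 3, 2, 300
true_mu = np.array([[-3.0, 0.0], [0.0, 3.0], [3.0, 0.0]])
true_pi = np.array([0.2, 0.3, 0.5])
labels = rng.choice(K, size=n, p=true_pi)
Y = true_mu[labels] + rng.standard_normal((n, d))

mu = Y[rng.choice(n, K, replace=False)].copy()  # init means at data points
pi = np.full(K, 1.0 / K)                        # init mixing weights

for it in range(500):
    # delta | pi, Y, mu: independent categorical draws, q_k ∝ phi(Y_j - mu_k) pi_k.
    logq = -0.5 * ((Y[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2) + np.log(pi)
    q = np.exp(logq - logq.max(axis=1, keepdims=True))
    q /= q.sum(axis=1, keepdims=True)
    u = rng.uniform(size=n)
    delta = np.minimum((q.cumsum(axis=1) < u[:, None]).sum(axis=1), K - 1)

    # mu_k | Y, delta: normal with mean = group average, variance 1/n_k.
    nk = np.bincount(delta, minlength=K)
    for k in range(K):
        if nk[k] > 0:
            mu[k] = Y[delta == k].mean(axis=0) + rng.standard_normal(d) / np.sqrt(nk[k])

    # pi | delta: Dirichlet draw via normalized Gamma(n_k + 1, 1) variables.
    Z = rng.gamma(nk + 1.0)
    pi = Z / Z.sum()
```

Tracking pi across iterations produces trace plots of the kind shown in Figure 2 (up to label switching, since the flat prior is symmetric in the component labels).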
Figure 2: Gibbs sample paths for the probability parameters π_1, π_2, π_3 in the Gaussian mixture model with K = 3 components, d = 10 Y dimensions, and n = 200 observations. The horizontal lines are the true values.

Spatial processes: Suppose we have a random 0/1 array Y_ij ∈ {0, 1}, and we specify the conditional probabilities as

    P(Y_ij = 1 | Y_{i+1,j}, Y_{i−1,j}, Y_{i,j+1}, Y_{i,j−1}) = ɛ + ((τ − ɛ)/4) (Y_{i+1,j} + Y_{i−1,j} + Y_{i,j+1} + Y_{i,j−1}),

where 0 ≤ ɛ ≤ τ ≤ 1. It is a fact that this family of conditional distributions is associated with a unique joint distribution on the array. The conditional probability of observing a 1 given the neighbors ranges from ɛ to τ, and increases as more neighbors have 1's. Larger values of τ − ɛ produce greater spatial autocorrelation (the tendency of 1's to cluster spatially in the array). We can simulate from this distribution using the Gibbs sampler.
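A Gibbs sweep over the array resamples each cell from its conditional distribution given the current neighbors. The sketch below makes illustrative assumptions not fixed by the notes: a 30 x 30 grid, ɛ = 0.1, τ = 0.9, and off-grid neighbors treated as 0.

```python
import numpy as np

rng = np.random.default_rng(4)

eps, tau = 0.1, 0.9  # illustrative parameter values
N = 30
Y = rng.integers(0, 2, size=(N, N))  # random initial configuration

def neighbor_sum(Y, i, j):
    # Sum of the four nearest neighbors; off-grid neighbors count as 0
    # (a boundary convention assumed for this sketch).
    s = 0
    if i + 1 < N: s += Y[i + 1, j]
    if i - 1 >= 0: s += Y[i - 1, j]
    if j + 1 < N: s += Y[i, j + 1]
    if j - 1 >= 0: s += Y[i, j - 1]
    return s

# Repeated Gibbs sweeps: each cell is redrawn from its conditional
# probability eps + (tau - eps)/4 * (number of neighbors equal to 1).
for sweep in range(200):
    for i in range(N):
        for j in range(N):
            p1 = eps + (tau - eps) / 4.0 * neighbor_sum(Y, i, j)
            Y[i, j] = 1 if rng.uniform() < p1 else 0
```

With τ − ɛ large, as here, the resulting array shows visible clusters of 1's, matching the autocorrelation behavior described above.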