Markov Chain Monte Carlo methods

By Oleg Makhnin

4.1 Introduction

Markov Chain Monte Carlo (MCMC) methods are computational methods developed for Bayesian inference. Bayesian inference deals with parameter estimation under some prior assumptions. For example, suppose we are estimating some parameter $\theta$. We have some information about $\theta$ expressed as a prior distribution. It's called prior because it describes what we believe about $\theta$ prior to collecting any data.

What happens after you collect your data? The evidence from your data is summarized in the likelihood. This is simply the (joint) density of your observations. However, in likelihood and Bayesian inference it is treated as a function of the (unknown) $\theta$, and the data $y_1, \dots, y_n$ are treated as given.

The goal of Bayesian inference is to compute the posterior distribution. It's called posterior because it is computed after obtaining the data. It is the conditional distribution of the parameter $\theta$ given the data. We will assume that we deal with continuous quantities, so the formulas below are in terms of densities. By the continuous version of Bayes' formula, the posterior is

$$p(\theta \mid y_1, \dots, y_n) = \frac{p(\theta, y_1, \dots, y_n)}{p(y_1, \dots, y_n)} = \frac{p(y_1, \dots, y_n \mid \theta)\, p(\theta)}{p(y_1, \dots, y_n)}$$

or, briefly,

$$\text{posterior} \;\propto\; \text{likelihood} \times \text{prior} \;=\; p(y_1, \dots, y_n \mid \theta)\, p(\theta) \qquad (1)$$

where the sign $\propto$ is frequently used in Bayesian analysis and reads "proportional to", that is, equal to the quantity described times some proportionality constant. Frequently, this constant is found later from the condition that the posterior density integrates to 1.

Thus, the posterior distribution combines prior information with the new information obtained from the data, and makes a balanced guess about the unknown parameters.

Example 1: Normal/Normal prior and likelihood

Under the simple assumption that both prior and likelihood are Normal, suppose first that we have just one observation $y$. Let the prior be

$$p(\theta) \propto \exp\left[-\frac{(\theta - \mu_0)^2}{2\sigma_0^2}\right], \quad \text{that is, } N(\mu_0, \sigma_0^2),$$

and the likelihood

$$p(y \mid \theta) \propto \exp\left[-\frac{(y - \theta)^2}{2\sigma^2}\right], \quad \text{that is, } N(\theta, \sigma^2).$$

We will assume that $\sigma^2$ is known, so the parameter $\theta$ is the unknown mean we'd like to estimate.
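As a numerical check on Eq. (1), the sketch below multiplies the Normal prior and likelihood of Example 1 on a grid of $\theta$ values and then finds the proportionality constant from the condition that the posterior integrates to 1. The specific numbers ($y = 2$, $\sigma^2 = 1$, $\mu_0 = 0$, $\sigma_0^2 = 1$) and the grid limits are illustrative assumptions, and the simple Riemann sum is just one convenient way to integrate.

```python
import numpy as np

# Illustrative values (assumptions chosen for this sketch)
y = 2.0          # the single observation
sigma2 = 1.0     # known variance of the likelihood
mu0 = 0.0        # prior mean
sigma0_2 = 1.0   # prior variance

# Grid over theta, wide enough that the tails are negligible
theta = np.linspace(-10.0, 10.0, 20001)
dtheta = theta[1] - theta[0]

# Prior and likelihood, each known only up to a constant
prior = np.exp(-(theta - mu0) ** 2 / (2.0 * sigma0_2))
likelihood = np.exp(-(y - theta) ** 2 / (2.0 * sigma2))

# posterior is proportional to likelihood * prior; normalize so it integrates to 1
unnormalized = likelihood * prior
posterior = unnormalized / (unnormalized.sum() * dtheta)

# Posterior mean and variance by numerical integration
post_mean = (theta * posterior).sum() * dtheta
post_var = ((theta - post_mean) ** 2 * posterior).sum() * dtheta
print(post_mean, post_var)
```

For these inputs the grid posterior has mean close to 1 and variance close to 0.5, halfway between the prior mean and the observation because the two variances are equal. A grid works in one dimension; it becomes impractical as the number of parameters grows.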

Using Eq. (1), we get, after some algebra,

$$p(\theta \mid y) \propto \exp\left[-\frac{(\theta - \mu_p)^2}{2\sigma_p^2}\right]$$

where

$$\mu_p = \mu_0\, \frac{1/\sigma_0^2}{1/\sigma_0^2 + 1/\sigma^2} + y\, \frac{1/\sigma^2}{1/\sigma_0^2 + 1/\sigma^2} \quad \text{and} \quad \sigma_p^2 = \left(\frac{1}{\sigma_0^2} + \frac{1}{\sigma^2}\right)^{-1}.$$

Thus, the posterior also has a Normal distribution, $N(\mu_p, \sigma_p^2)$. Note also that $\mu_p$ is the weighted average of the prior mean $\mu_0$ and the observation $y$, where the weights are inversely proportional to the variances.

Figure 1: Prior (broken lines), likelihood (solid lines) and posterior (thick lines) for the Normal conjugate prior. Left: $\sigma_0^2 = 0.5$, center: $\sigma_0^2 = 1$, right: $\sigma_0^2 = 2$. Axes: $x$ (horizontal), $p(x)$ (vertical).

In Fig. 1, three cases are shown, all with the same data $y = 2$, $\sigma^2 = 1$, prior mean $\mu_0 = 0$, and differing only in the value of $\sigma_0^2$. When $\sigma_0^2$ is small, the prior has a larger weight, and thus the posterior mean is closer to the prior mean. When $\sigma_0^2$ is large, the situation is the opposite.

This is a case of the so-called conjugate prior on $\theta$, which is chosen in such a way that the posterior has the same functional form as the prior.

This example is easily generalized to several observations $y_1, \dots, y_n$. If they are all independent Normal, with the same mean $\theta$ and standard deviation $\sigma$, then we may treat them as a single observation $\bar{y} = \sum y_i / n$ with standard deviation $\sigma/\sqrt{n}$.

Direct computation of posterior densities is impossible for all but the simplest problems, and this is the difficulty that MCMC methods were developed to handle. As with all Monte Carlo methods (see e.g. Tarantola, 2005, Chapter 2), the goal of MCMC is to obtain a sample from the probability distribution of interest. However, this sample will not consist of independent observations (as is the case with classical Monte Carlo), but will rather form a sequence of realizations of a Markov chain. The trick is to set up a Markov chain whose stationary distribution is exactly the posterior distribution we need to sample from.

A Markov chain is a sequence of random variables in which every observation, given its immediate predecessor, is independent of the rest of the past. This means that the Markov chain $\{X_t\}$ is defined by its transition probability (or transition kernel in the case of a continuous state space), which describes the conditional probability density

$$q(x_{t+1} \mid x_t), \quad t = 0, 1, 2, 3, \dots \qquad (2)$$

In addition, we will assume that a starting probability density $q(x_0)$ is also known. The procedure of generating such a chain starts with generating (sampling) $X_0$ from this density, then iterating (2) to get the subsequent samples.

Under some assumptions on the transition probability, a Markov chain will converge to its stationary distribution, regardless of the starting value $X_0$. Such Markov chains are called ergodic. In particular, an easy condition to validate is the detailed balance condition

$$\pi(y)\, q(x \mid y) = q(y \mid x)\, \pi(x) \quad \text{for all } x, y,$$

which ensures that $\pi$ is the stationary distribution. (Here $x$ and $y$ denote two states of the chain, not the data.)

From now on, denote the data briefly as $y$ and the unknown parameters we'd like to estimate as $\theta$. Then the Markov chain takes values in the $\theta$-space. The MCMC methods provide ways to generate Markov chains that have the desired posterior $p(\theta \mid y)$ as their stationary distribution. The two methods we consider are the Gibbs sampler and the Metropolis-Hastings method.

4.2 Gibbs sampler

One way to form a Markov chain that will converge to the posterior $p(\theta \mid y)$ is to split $\theta$ into blocks of variables, assuming that it's easy to generate samples from each block given the others. This is the case when the blocks have nice conjugate forms for their distributions. Suppose that we split the model into $k$ blocks, $\theta = [\theta_1, \theta_2, \dots, \theta_k]$. In the simplest case, the blocks are just the scalar components of the model $\theta$.
The Gibbs sampler will iteratively obtain samples of $\theta_j$ based on their full conditional posteriors (FCPs), defined as

$$p_j(\theta_j \mid y, \theta_1, \dots, \theta_{j-1}, \theta_{j+1}, \dots, \theta_k).$$

The process is done for all $j = 1, \dots, k$ and then repeated many times until a sample of the desired length from the entire $\theta$ is obtained.
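The loop over full conditionals can be sketched on a toy target that is not part of these notes: a bivariate Normal with zero means, unit variances and correlation `rho`. For this target each full conditional is itself Normal, $\theta_1 \mid \theta_2 \sim N(\rho\,\theta_2,\ 1 - \rho^2)$ and symmetrically for $\theta_2$, so each block is easy to sample. The function name, the value of `rho`, the starting point and the burn-in length are all assumptions made for illustration.

```python
import numpy as np

def gibbs_bivariate_normal(rho, n_samples, start=(0.0, 0.0), seed=0):
    """Gibbs sampler for a standard bivariate Normal with correlation rho.

    Toy target chosen for illustration: both full conditionals are Normal.
    """
    rng = np.random.default_rng(seed)
    cond_sd = np.sqrt(1.0 - rho ** 2)   # sd of each full conditional
    theta1, theta2 = start
    samples = np.empty((n_samples, 2))
    for t in range(n_samples):
        # Block 1: draw theta1 from its FCP given the current theta2
        theta1 = rng.normal(rho * theta2, cond_sd)
        # Block 2: draw theta2 from its FCP given the new theta1
        theta2 = rng.normal(rho * theta1, cond_sd)
        samples[t] = (theta1, theta2)
    return samples

# Start deliberately far from the bulk of the distribution
samples = gibbs_bivariate_normal(rho=0.8, n_samples=20000, start=(5.0, -5.0))
kept = samples[1000:]                     # drop the first 1000 sweeps
print(kept.mean(axis=0))                  # should be near (0, 0)
print(np.corrcoef(kept.T)[0, 1])          # should be near rho
```

Each pass through the loop body is one sweep over all blocks; the chain forgets its starting point after a few sweeps here, but the number needed grows as `rho` approaches 1.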

4.3 Example: Regression with censored data

There are frequently cases in Bayesian inference when the estimates could be easily obtained if only certain hidden variables were known. We will consider one such situation and indicate how to set up the Gibbs sampler.

The censored data situation arises when we do not know the exact value of an observation, but some inequality is available, for example, $y_i^* > c_i$ (the * here indicates a censored observation). This frequently happens in survival studies when an item was removed from the study at the time $c_i$, before we had the chance to observe its failure at $y_i^*$.

A similar situation arises in environmental studies with non-detects. A non-detect means that a certain chemical (likely a pollutant) was not detected in the sample. We cannot claim with certainty that it is absent, only that its concentration lies below an estimated threshold. In this case, $y_i^* < c_i$, where $y_i^*$ is the unknown true concentration and $c_i$ is the threshold.

If we treat the missing data as extra parameters in the model, we can sample from their FCP given all the other model parameters. For example, if we fit a regression model with some predictors $x_i$ and errors $\varepsilon_i$,

$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \quad i = 1, \dots, n,$$

then the unobserved $y_i^*$ can be sampled from the truncated Normal distribution with mean $\beta_0 + \beta_1 x_i$, standard deviation $\sigma$ (equal to the standard deviation of the errors $\varepsilon_i$) and upper threshold $c_i$. Then, in turn, the Gibbs sampler will use the current samples of $y_i^*$, together with the known concentrations $y_j$, to estimate the regression parameters $\beta_0$, $\beta_1$ and $\sigma$. This way, we will get MCMC samples of both the model parameters and the missing data.

Example: from Helsel (2005), Ch. 14. The data given are TCE (trichloroethylene, a chlorinated hydrocarbon commonly used as an industrial solvent) concentrations (µg/l) in ground waters of Long Island, New York, along with several possible explanatory variables (population density, land use and depth to the water surface). The objective is to determine whether the concentrations are related to one or more explanatory variables. There are four detection limits, at 1, 2, 4 and 5 µg/l. Out of 247 observations, 194 are classified as non-detects. We will use $y_i = \ln(\mathrm{TCE})$ and $x_i$ = population density.

The data are shown in Fig. 2. What do you think is the direction of the trendline? Is the slope positive or negative?

The parameters $\beta_0, \beta_1$ can be estimated using linear regression. However, we will be interested not only in the estimates $\hat\beta_0$ and $\hat\beta_1$, but in their entire FCP. Fortunately, it's a bivariate Normal, and we have an easy way to generate samples from it.
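The truncated-Normal draw for a censored observation can be sketched with the inverse-CDF method: draw a uniform number on the part of the probability scale that lies below the threshold and map it back through the Normal quantile function. This is one common way to do it, not necessarily the one used for the results in these notes. The function name `sample_censored` and all numerical values below are hypothetical, and the method assumes the threshold is not so far in the tail that the Normal CDF rounds to 0 or 1.

```python
import random
from statistics import NormalDist

def sample_censored(mean, sd, upper, rng):
    """Draw from Normal(mean, sd) truncated above at `upper` (inverse-CDF method).

    Assumes 0 < CDF(upper) < 1 in floating point.
    """
    dist = NormalDist(mean, sd)
    p_upper = dist.cdf(upper)             # probability mass below the threshold
    u = (1.0 - rng.random()) * p_upper    # uniform on (0, p_upper]
    return dist.inv_cdf(u)                # map back to the y scale

# Hypothetical current values of the regression parameters in one Gibbs sweep
beta0, beta1, sigma = 0.2, 0.05, 1.0
x_i = 6.0        # hypothetical predictor value for one non-detect
c_i = 0.0        # log of a 1 microgram/l detection limit: ln(1) = 0

rng = random.Random(1)
draws = [sample_censored(beta0 + beta1 * x_i, sigma, c_i, rng)
         for _ in range(5000)]
print(max(draws), sum(draws) / len(draws))
```

Within a Gibbs sweep only one such draw per censored observation is needed; the 5000 draws here are only to show that every value falls below the threshold `c_i`.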

Figure 2: Censored observations: blank circles, uncensored observations: dark circles. The points are jittered, i.e. have a small amount of noise added to visualize multiple occurrences of the same point. Axes: population density (horizontal), log(conc) (vertical).

Namely, if $\beta$ are the coefficients from the linear regression $y = G\beta$, where the data have covariance matrix $\sigma^2 I$, then we know that $\beta$ has the distribution (likelihood) $N(\hat\beta, \sigma^2 (G^T G)^{-1})$, where $\hat\beta = (G^T G)^{-1} G^T y$. We can generate a sample from such a distribution by using e.g. the Cholesky decomposition

$$\sigma^2 (G^T G)^{-1} = R^T R$$

and then setting

$$\beta = R^T s + \hat\beta,$$

where $s$ is a standard Normal vector.

For simplicity, we will assume that $\beta = [\beta_0, \beta_1]$ have a flat prior, that is, $p(\beta) \propto 1$, which corresponds to the situation when we have no prior information on the $\beta$'s.

Another technicality concerns sampling $\sigma$. This is usually done using an inverse chi-square prior on $\sigma^2$. Under the assumption of Normal errors $e_i = y_i - \beta_0 - \beta_1 x_i$, this turns out to be a conjugate prior, i.e. the posterior distribution of $\sigma^2$ will also be inverse chi-square. (The equation for the inverse chi-square density is given e.g. by Gelman et al., 2003.) The updating equation is

$$\sigma^2 = \frac{V_0\, df_0 + \sum e_i^2}{\chi^2_{df}}$$

where $\chi^2_{df}$ is a chi-square random variable with $df = df_0 + n - m$, with $m$ equal to the number of parameters in the regression equation (here $m = 2$), $df_0$ are the prior degrees of freedom, and $V_0$ is the prior variance. Decreasing $df_0$ to 0 corresponds to a noninformative prior on $\sigma^2$ (strictly, $p(\sigma^2) \propto 1/\sigma^2$, which is flat in $\log\sigma$ rather than in $\sigma^2$ itself). Of course, a sample value for $\sigma$ can be obtained by taking the square root of $\sigma^2$.

4.4 Metropolis-Hastings algorithm

The Metropolis algorithm and its extension, the Metropolis-Hastings algorithm, were developed for sampling from non-standard densities. The idea is as follows. To sample $x$ from some density $q(x)$, generate a proposal value $x^*$ and, at iteration $t + 1$ of the sampler, accept this value, setting $x_{t+1} = x^*$, with the probability

$$p_{acc} = \min\left\{\frac{q(x^*)}{q(x_t)},\ 1\right\}.$$

Otherwise, we keep the old value $x_t$. In practice, this means generating a Uniform[0, 1] random variable $U$ and setting

$$x_{t+1} = \begin{cases} x^*, & \text{if } U \le p_{acc} \\ x_t, & \text{if } U > p_{acc} \end{cases} \qquad (3)$$

Note that since $p_{acc}$ depends on $q$ only through a ratio of $q$-densities, we only need to know the density $q(x)$ up to a proportionality constant! This makes the Metropolis method particularly attractive for the computation of posterior densities.

This algorithm is reminiscent of the stochastic search algorithm for maximizing the function $q(x)$, where we move to a new point if that increases the value of $q$, and stay at the old point otherwise. The difference in the Metropolis algorithm is that we also occasionally jump to a point with a lower value of $q$. This, among other things, helps us escape local maxima. (The benefits of both algorithms can be combined in the simulated annealing algorithm, which moves more freely in the beginning and becomes more like stochastic search in the end.)

The practical difficulty lies in generating a suitable proposal value $x^*$. One popular method is the random-walk Metropolis algorithm, for which the proposal value equals $x^* = x_t + h_t$, where $x_t$ is the previous value from the Markov chain, and $h_t$ is chosen randomly from a symmetric distribution, for example, Uniform on $[-\delta_h, \delta_h]$ or Normal $N(0, \sigma_h^2)$.

The size of the jump $h_t$ may be chosen adaptively. Smaller jumps will result in higher acceptance rates, but will be slower in exploring the model space. Longer jumps will be accepted less frequently, so the chain will tend to get stuck in the same place. Jumps that are either too small or too large will increase the autocorrelation of your Markov chain, and therefore you will need longer chains to estimate your parameters with the same precision. Some studies have shown (see e.g. Gelman et al.) that the most efficient acceptance frequency lies between 20% and 50%.

The Metropolis method is simple to use; however, it requires certain assumptions on how the proposal value $x^*$ is picked. For example, the random-walk Metropolis version will not work if the jump $h_t$ has an asymmetric distribution. The generalization, the Metropolis-Hastings method, does not require the proposal distribution to be symmetric, but we will not discuss it here.

4.5 Analyzing the MCMC output

First, we need to realize that, using MCMC, we produce just a sample from the posterior density. Thus, the estimates we obtain from it (e.g. mean, median, credible intervals etc.) will be subject to sampling error. For example, if we took $M$ Monte Carlo samples, then the error in estimating the mean is proportional to $1/\sqrt{M}$.

Figure 3: Tri-plot for a highly autocorrelated output (parameter m3). Panels: sampled values against sample index, histogram (frequency) of the values, and autocorrelation function (ACF) against lag.

Also, as with any iterative method, MCMC methods take some time to converge. However, the convergence is not to any particular number, but to the distribution $\pi$, i.e. to an entire range of numbers. To monitor convergence, a simple graphical tool is to produce plots of the MC values and watch them until they seem to settle into a stable range.
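The sketch below generates the kind of output such plots would show. It runs random-walk Metropolis with the update rule (3) and a Uniform$[-\delta_h, \delta_h]$ jump, for three jump sizes, starting far from the bulk of the target so that a burn-in stretch is visible at the start of each chain. The target `log_q` (a Normal with mean 1 and variance 0.5, known only up to a constant), the jump sizes, the starting value and the burn-in length are all assumptions made for illustration. The comparison is done on the log scale, which is equivalent to $U \le p_{acc}$ but avoids numerical underflow.

```python
import numpy as np

def log_q(x):
    """Log of the unnormalized target density (hypothetical: N(1, 0.5))."""
    return -(x - 1.0) ** 2 / (2.0 * 0.5)

def random_walk_metropolis(log_q, x0, delta, n_iter, seed=0):
    """Random-walk Metropolis with Uniform[-delta, delta] jumps."""
    rng = np.random.default_rng(seed)
    chain = np.empty(n_iter)
    x, n_accepted = x0, 0
    for t in range(n_iter):
        proposal = x + rng.uniform(-delta, delta)   # x* = x_t + h_t
        u = 1.0 - rng.random()                      # uniform on (0, 1]
        # Accept if U <= q(x*) / q(x_t), compared on the log scale
        if np.log(u) <= log_q(proposal) - log_q(x):
            x = proposal
            n_accepted += 1
        chain[t] = x                                # keep old x if rejected
    return chain, n_accepted / n_iter

results = {}
for delta in (0.1, 2.0, 50.0):
    chain, rate = random_walk_metropolis(log_q, x0=10.0, delta=delta, n_iter=20000)
    results[delta] = (chain, rate)
    kept = chain[2000:]                             # discard an assumed burn-in
    print(f"delta={delta}: acceptance={rate:.2f}, "
          f"mean={kept.mean():.2f}, var={kept.var():.2f}")
```

Plotting `results[delta][0]` against the iteration index for each jump size gives the time-series panel of a tri-plot. The smallest jump accepts nearly every proposal but moves slowly, while the largest rejects most proposals and stays in place for long stretches.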
A more rigorous method uses multiple Markov chains, started at different values and run in parallel (see Gelman et al.). After the convergence analysis, we determine a burn-in period, during which the initial values that we collected are discarded.

Another complication is specific to MCMC methods. These methods produce positively correlated samples from the desired distribution. This means, roughly, that the next value of the Markov chain is similar to the previous value. The higher the correlation, the less new information each sample value contains!

The amount of correlation is usually monitored using autocorrelation plots. They indicate the amount of correlation between $x_t$ and $x_{t+\ell}$, for various values of the lag $\ell$. We can use these plots to find the value $D$ such that the autocorrelation is negligible for $\ell \ge D$. Then, to obtain the final estimates of our parameters, we will thin out our sample, keeping only every $D$th value.

Figure 4: Tri-plot for a slightly autocorrelated output (parameter m1). Panels: sampled values against sample index, histogram (frequency) of the values, and autocorrelation function (ACF) against lag.

In practice, we recommend using the tri-plot for each scalar parameter that we fit. The tri-plot includes the time-series plot of the MC values, the histogram of the results (to monitor the posterior distribution), and the autocorrelation plot. See Figures 3 and 4 for examples. In Figure 4, the burn-in period is clearly seen in the beginning.

Once we have obtained a clean sample of the model parameters, it is straightforward to obtain the estimates. It is important to realize, however, that all of these estimates are subject to sampling error. The MAP (maximum a posteriori) solutions might be difficult to obtain in the absence of density functions to maximize. Thus, for symmetric distributions, we might settle for the posterior means instead. In the case of asymmetric distributions, we might want to use posterior medians.

The credible intervals can be easily obtained using the sample percentiles (quantiles). For example, to obtain the 95% credible interval, we may use the sample 2.5th percentile as the lower bound and the 97.5th percentile as the upper bound.

Bibliography

Gelman, A., Carlin, J.B., Stern, H.S. and Rubin, D.B. (2003), Bayesian Data Analysis, 2nd ed., Chapman & Hall/CRC.

Gilks, W.R., Richardson, S. and Spiegelhalter, D., eds. (1996), Markov Chain Monte Carlo in Practice: Interdisciplinary Statistics, Chapman & Hall/CRC.

Diaconis, P. (2009), The Markov Chain Monte Carlo revolution, Bulletin of the AMS, 46 (2), 179-205.

Helsel, D.R. (2005), Nondetects and Data Analysis: Statistics for Censored Environmental Data, Wiley.

Tarantola, A. (2005), Inverse Problem Theory and Methods for Model Parameter Estimation, SIAM.