The Pennsylvania State University
The Graduate School

RATIO-OF-UNIFORMS MARKOV CHAIN MONTE CARLO FOR GAUSSIAN PROCESS MODELS

A Thesis in Statistics
by
Chris Groendyke

© 2008 Chris Groendyke

Submitted in Partial Fulfillment of the Requirements for the Degree of Master of Science

May 2008

The thesis of Chris Groendyke was reviewed and approved by the following:

Murali Haran
Assistant Professor of Statistics
Thesis Advisor

Donald Richards
Professor of Statistics
Associate Chair of the Department of Statistics

Runze Li
Associate Professor of Statistics
Graduate Program Chair

Signatures are on file in the Graduate School.

Abstract

We develop various Markov chain Monte Carlo (MCMC) methods based on the ratio-of-uniforms (ROU) transformation and show how they can be used in a Bayesian context to simulate from the posterior distribution of linear Gaussian process models. These models are very popular in many disciplines, but are particularly important for modeling spatial data. We show that these algorithms, in spite of requiring no tuning, perform well in practice. We describe how the algorithms can be used in conjunction with some recently developed methods to estimate standard errors of MCMC-based estimates accurately. The estimated standard errors can, in turn, be used to automatically decide when to stop the MCMC runs, thereby providing, in principle, a completely automated MCMC algorithm. We conclude with a study of the properties of these algorithms, using simulated as well as real data taken from the field of geosciences.

Table of Contents

List of Figures
List of Tables
Acknowledgments

Chapter 1  Introduction
    1.1 The Gaussian Process Model
    1.2 Bayesian Inference
    1.3 The Need for Automation

Chapter 2  Markov Chain Monte Carlo
    2.1 MCMC Theory
        2.1.1 Markov Chains
    2.2 The Metropolis-Hastings Algorithm
        2.2.1 Variable-at-a-time Metropolis-Hastings
    2.3 Monte Carlo Standard Errors and Stopping Rules
    2.4 Effective Sample Size

Chapter 3  Ratio-of-Uniforms Markov Chain Monte Carlo
    3.1 The Ratio-of-Uniforms Transformation
    3.2 Slice Sampling
    3.3 Multivariate Generalizations of the Ratio-of-Uniforms Transformation
    3.4 MCMC Using the Ratio-of-Uniforms Transformation
        3.4.1 Random Walk
        3.4.2 Stepping Out / Doubling
    3.5 Auto-tuning
        3.5.1 Random Walk
        3.5.2 Stepping Out / Doubling
        3.5.3 Starting Values
    3.6 Other Methods Using the ROU Transformation
        3.6.1 Hybrid ROU Approach
        3.6.2 Rejection Sampling in the ROU Region
        3.6.3 Adaptive Rejection Sampling in the ROU Region

Chapter 4  Comparative Study of Algorithms
    4.1 A Simulated Dataset
    4.2 A Geosciences Application

Chapter 5  Conclusions and Future Work
    5.1 Conclusions
    5.2 Future Work
        5.2.1 Further Exploration of ROU-MCMC Algorithms
        5.2.2 Hit and Run
        5.2.3 Theoretical Results
        5.2.4 Spatial Generalized Linear Models

Appendix A  Derivation of Posterior Distributions

Bibliography

List of Figures

1.1  Spatially correlated data with linear regression fit and kriging
2.1  Realization of a random walk Markov chain
2.2  The Metropolis-Hastings algorithm
2.3  Movement of the random walk Markov chain
2.4  Sample produced by the random walk Markov chain
2.5  The Gibbs algorithm for constructing a bivariate Markov chain
2.6  Movement of the random walk Markov chain using Gibbs updates
2.7  Sample produced by the random walk Markov chain using Gibbs updates
2.8  The IMSE algorithm for computing autocorrelation time
3.1  ROU region corresponding to univariate Uniform random variable
3.2  Standard Normal ROU Example
3.3  ROU region corresponding to bivariate Normal random variable
3.4  The random walk algorithm for generating a new point in the ROU space
3.5  The coordinate-at-a-time random walk algorithm for generating a new value for the i-th coordinate in the ROU space
3.6  The stepping out procedure for finding an interval (L, R) around the current point η_0 which contains the desired slice
3.7  The doubling procedure for finding an interval (L, R) around the current point η_0 which contains the desired slice
3.8  The procedure for generating a point in the slice from the proposal interval (L, R)
3.9  The doubling procedure for finding a proposal hyper-rectangle around the current point
3.10 The procedure for generating a point in the slice from a given proposal hyper-rectangle
3.11 The tuning procedure for the univariate random walk algorithm
3.12 The tuning procedure for the multivariate random walk algorithm
3.13 Empirical relationship between steps and shrinks for the stepping out procedure
3.14 The tuning procedure for the univariate stepping out algorithm
3.15 Empirical relationship between steps and shrinks for the doubling procedure
3.16 The tuning procedure for the univariate doubling algorithm
4.1  ACF plots for Univariate ROU Stepping Out Algorithm Run on Simulated Data
4.2  ACF plots for Slice Sampler Algorithm Run on Simulated Data
4.3  ACF plots for Multivariate Metropolis-Hastings Algorithm Run on Simulated Data
4.4  Estimated Posterior Densities for Parameter κ for Simulated Data
4.5  Estimated Posterior Densities for Parameter ψ for Simulated Data
4.6  Estimated Posterior Densities for Parameter φ for Simulated Data
4.7  Estimated Posterior Densities for Parameter β for Simulated Data
4.8  ACF plots for Univariate ROU Random Walk Algorithm Run on Geosciences Data
4.9  ACF plots for Slice Sampler Algorithm Run on Geosciences Data
4.10 ACF plots for Multivariate Metropolis-Hastings Algorithm Run on Geosciences Data
4.11 Estimated Posterior Densities for Parameter κ for Geosciences Data
4.12 Estimated Posterior Densities for Parameter ψ for Geosciences Data
4.13 Estimated Posterior Densities for Parameter φ for Geosciences Data
4.14 Estimated Posterior Densities for Parameter β for Geosciences Data
5.1  The Hit and Run procedure for generating a proposal interval (L, R)

List of Tables

2.1  First six trials of the Metropolis-Hastings random walk
2.2  First six trials of the Metropolis-Hastings random walk using Gibbs updates
4.1  Comparison of Algorithms Run on Simulated Data for Parameter κ
4.2  Comparison of Algorithms Run on Simulated Data for Parameter ψ
4.3  Comparison of Algorithms Run on Simulated Data for Parameter φ
4.4  Comparison of Algorithms Run on Simulated Data for Parameter β
4.5  Comparison of Algorithms Run on Geosciences Data for Parameter κ
4.6  Comparison of Algorithms Run on Geosciences Data for Parameter ψ
4.7  Comparison of Algorithms Run on Geosciences Data for Parameter φ
4.8  Comparison of Algorithms Run on Geosciences Data for Parameter β

Acknowledgments

The author is very grateful to Dr. Murali Haran for his guidance and efforts during the course of this research. In addition, the author thanks Klaus Keller and Josh Dorin for providing the Geosciences data used in this study. The author is also grateful to the following people for their helpful conversations and suggestions regarding this effort: K. Sham Bhat, Matthew Tibbits, Muhammad Atiyat, and Scott Roths.

Chapter 1
Introduction

Linear Gaussian process models are very flexible and widely applicable, and they have therefore been used as models for data in a number of disciplines. One of the areas in which these models are most commonly used is the modeling of spatially dependent data, and it is in this context that we apply the linear Gaussian process model in the current study. In addition to its applicability to many types of data, the linear Gaussian process model enjoys other significant advantages, notably a number of attractive theoretical properties (Cressie, 1993), some of which are described in Section 1.1.

Our main interest lies in inference for the parameters of this model. One approach would be to use frequentist methods; for instance, we might estimate the parameters by Maximum Likelihood Estimation (MLE). Another approach is to use Bayesian inference methods, which have a few notable benefits. First, they allow us to incorporate the uncertainty in our parameter estimates into the predictions we make. They also provide a natural framework for working with hierarchical or multi-level statistical models. Finally, Bayesian inference methods provide us with the ability to utilize prior information or beliefs about model parameters, if such information is available.

In the Bayesian approach, we assign prior distributions to each of the model parameters, and inference for each parameter is then based on its posterior distribution. In the ideal situation, this posterior distribution would be of a known form (or at least an unknown but analytically tractable form); we would then be able to perform inference directly, either using analytical methods or possibly by generating a sample from this posterior distribution. However, when we are not able to work with a tractable posterior distribution (as is the case in this study), we can instead resort to Markov chain Monte Carlo (MCMC). That is, we run a Markov chain that converges to the desired posterior distribution, and base our inference on the sample produced by this Markov chain. Some basic theory relating to Markov chain Monte Carlo methods is covered in Chapter 2.

The use of Markov chain Monte Carlo methods is very common in modern statistics.

However, the algorithms used here differ from typical applications of MCMC in that they couple MCMC theory with an auxiliary variable method known as the ratio-of-uniforms (ROU) transformation. Using MCMC methods in conjunction with the ROU transformation (henceforth ROU-MCMC) has been suggested by Tierney (2005) and Karawatzki et al. (2006). These authors discuss various strategies for ROU-MCMC, but only apply the algorithms to relatively simple examples. Here we consider a number of variants of ROU-MCMC in the context of fitting linear Gaussian process models, which can present computational challenges. The specific algorithms used for this study are discussed in Chapter 3.

1.1 The Gaussian Process Model

As noted above, the linear Gaussian process model has been used to model data from a wide spectrum of disciplines. One of the areas in which this model is commonly used is spatial statistics, in particular the study of geostatistical data, and this is the context in which we use the model in the present study. In geostatistical data, we work with a response variable Z, which is present over some continuous domain D ⊆ R^p (see Cressie (1993) or Schabenberger and Gotway (2005) for a more detailed discussion). We observe this process only at a finite number of points in D; we denote the points at which the process is observed by s_1, s_2, ..., s_n, so that the response variable at location s_i is given by Z(s_i). Let Z = (Z(s_1), ..., Z(s_n))^T. Then, if we assume that Z can be described using the linear Gaussian process model, we have

    Z ~ N(µ, Σ(Θ)),    (1.1)

with the mean vector µ given by µ = Xβ, where X is a matrix of covariates and β is the corresponding vector of regression parameters. Under this assumption, the probability density function (pdf) of the data is

    f_Z(z) = (2π)^(-n/2) |Σ(Θ)|^(-1/2) exp( -(1/2) (z - µ)^T Σ(Θ)^(-1) (z - µ) ).    (1.2)

For this study, we assume an exponential covariance structure, although other choices, such as the Matérn, could also be used. In this specification, Θ = (κ, ψ, φ) and Σ(Θ) = ψI + κH(φ), where {H(φ)}_{i,j} = exp(-‖s_i - s_j‖/φ), I is the identity matrix, and ‖s_i - s_j‖ is the distance between locations i and j. The most common distance metric used in this model is the Euclidean distance, which is the distance measure we use here as well. The basic idea of this model is that observations which are closer together will be more similar to each other (in terms of the values of their response variables) than those separated by a greater distance; the covariance model parameters serve to precisely describe the nature of this relationship.
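To make the covariance specification concrete, the following sketch (not part of the original thesis; function and variable names are illustrative) shows one way to assemble Σ(Θ) = ψI + κH(φ) with the exponential correlation function and to evaluate the log of the density in (1.2).

```python
import numpy as np
from scipy.spatial.distance import cdist

def exp_cov(coords, kappa, psi, phi):
    """Sigma(Theta) = psi * I + kappa * H(phi), with {H(phi)}_ij = exp(-||s_i - s_j|| / phi)."""
    d = cdist(coords, coords)              # pairwise Euclidean distances ||s_i - s_j||
    return psi * np.eye(len(coords)) + kappa * np.exp(-d / phi)

def gp_loglik(z, X, beta, kappa, psi, phi, coords):
    """Log of the Gaussian density in (1.2) for the linear Gaussian process model."""
    mu = X @ beta
    Sigma = exp_cov(coords, kappa, psi, phi)
    L = np.linalg.cholesky(Sigma)          # Cholesky factor; the O(n^3) cost per evaluation
    resid = np.linalg.solve(L, z - mu)     # solves L r = (z - mu), so resid @ resid = quadratic form
    logdet = 2.0 * np.sum(np.log(np.diag(L)))
    n = len(z)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + resid @ resid)
```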

Also note that the covariance parameters of this model have meaningful physical interpretations, so that inference about the model parameters can yield immediate physical conclusions. In geostatistical terms, the parameter κ represents the sill, the limiting value of the semivariance between two points as the distance between them becomes large. φ denotes the range parameter; the range is the minimum distance at which the sill is (effectively) attained. Finally, the parameter ψ is the nugget, and represents the amount of intrinsic variance not due to the distance between points (Schabenberger and Gotway, 2005).

As mentioned above, the linear Gaussian process model also has some beneficial theoretical properties. One of these is that its distribution is completely and uniquely determined by its mean vector and covariance matrix: in order to fully describe the distribution of a random vector following this model, we need only specify these two quantities. Another desirable property of this model is that weak stationarity is both necessary and sufficient for strong stationarity, whereas in general weak stationarity is only necessary, not sufficient (Cressie, 1993). In addition, a substantial body of asymptotic theory is available for Gaussian distributions.

It is also important to account for spatial correlation in data when such correlation exists. Failure to do so can lead to incorrect model assumptions, invalid parameter inference, and poor predicted values. For example, consider the following data set, which consists of 100 one-dimensional points simulated from a linear Gaussian process model. To demonstrate the importance of accounting for spatial dependence in the error structure of the data, we have fit both a standard linear regression model and a linear Gaussian process model to these data. The former model assumes independence between the data points, whereas the latter incorporates spatial dependence. The predicted values are superimposed on the data shown in Figure 1.1: the solid line shows the predicted values from the standard linear regression, and the dashed line gives predicted values obtained by kriging. (Performing prediction on geostatistical data, as we are doing in this example, is known as kriging (Schabenberger and Gotway, 2005).) We can see immediately that accounting for spatial dependence results in predictions that are much closer to the actual data points.

Figure 1.1: Spatially correlated data with linear regression fit and kriging.

1.2 Bayesian Inference

In classical frequentist inference, we treat the parameters of interest as fixed but unknown values and use the data to determine the best estimates of these parameters, using methods such as Maximum Likelihood Estimation (MLE) or the Method of Moments. These methods produce point estimates (perhaps with associated confidence intervals) for the parameters being estimated. Bayesian inference, on the other hand, treats the parameters as random variables rather than fixed, unknown values. To each parameter η, we assign a prior distribution which represents our prior beliefs about that parameter. We then use the data to update our beliefs about the parameter, producing a posterior distribution for η, and our inference regarding each parameter is based on its corresponding posterior distribution. The updating of the distributions of the parameters is performed using Bayes' theorem. We denote the prior distribution for the parameter η by π(η), the likelihood function by f(Z | η), and the posterior distribution of η by π(η | Z). Then by Bayes' rule, we have

    π(η | Z) = f(Z | η) π(η) / ∫ f(Z | η) π(η) dη    (1.3)
             ∝ f(Z | η) π(η).    (1.4)

The denominator of (1.3) is known as the normalizing constant. One beneficial feature of many MCMC methods is that they do not require us to know (or compute) this normalizing constant; in these cases it is sufficient to work with the density kernel given by (1.4).
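As a concrete illustration of working with the kernel in (1.4), the following sketch (not from the thesis) evaluates an unnormalized log-posterior for the covariance parameters of the model in Section 1.1, reusing the gp_loglik function sketched earlier. The prior below is a placeholder assumption chosen only to keep the example short, and β is held fixed here for the same reason; the thesis's actual prior specification and posterior derivations are not reproduced in this sketch.

```python
import numpy as np

def log_prior(kappa, psi, phi):
    """Illustrative placeholder prior: independent unit-rate exponential priors on the
    positive parameters (not the thesis's prior choice)."""
    if min(kappa, psi, phi) <= 0:
        return -np.inf                      # zero prior mass outside the support
    return -(kappa + psi + phi)             # log of exp(-kappa) * exp(-psi) * exp(-phi)

def log_posterior(theta, z, X, beta, coords):
    """Unnormalized log-posterior: log f(Z | eta) + log pi(eta), cf. (1.4)."""
    kappa, psi, phi = theta
    lp = log_prior(kappa, psi, phi)
    if not np.isfinite(lp):
        return -np.inf
    return lp + gp_loglik(z, X, beta, kappa, psi, phi, coords)
```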

1.3 The Need for Automation

One practical problem that arises with MCMC algorithms such as the Metropolis-Hastings algorithm is that they often require a substantial amount of tuning by the user. Tuning refers to the repeated adjustment of various auxiliary parameters, often known as tuning parameters, and this tuning stage can be expensive in terms of the time and effort required of the user. For example, the standard Metropolis-Hastings algorithm requires the user to specify a proposal distribution for each parameter or block of parameters being updated. To increase the efficiency of the algorithm, the user may need to experiment with both the form and the parameters of these proposal distributions. Worse yet, these adjustments are dataset-specific, meaning that they must be repeated for each new dataset on which the algorithm is used. Using the ratio-of-uniforms transformation in conjunction with MCMC algorithms offers the possibility of automating this tuning process, freeing users of the burden of designing and tuning MCMC algorithms for each new data set.

A related problem, which also requires the intervention of the user, is determining how long to run the Markov chain. Even when we know that a Markov chain will eventually converge to the correct target distribution, there remains the question of how many trials it may take before this convergence can be deemed to have occurred. The user is often forced to rely on ad hoc methods to make this judgment. This situation is clearly not ideal; it would be preferable to have clear, theoretically justified rules which tell us how many trials of our algorithms are sufficient. Fortunately, we are able to use recently developed methods on fixed-width MCMC to accurately assess the standard errors of our estimates and to determine stopping rules for our algorithms.

Thus, one advantage of the ROU-MCMC idea is that it offers the potential of producing a completely automated algorithm, that is, an algorithm which requires no user intervention either to tune the algorithm or to decide the length of the chain, but which nonetheless retains desirable theoretical and practical properties. We consider the implementation of this idea in the context of linear Gaussian process models, which are very important and popular and hence would benefit greatly from more efficient and/or automated MCMC algorithms.

In this thesis, we explore several different types of Markov chain Monte Carlo algorithms for sampling from the posterior distribution of a linear Gaussian process model. We implement some algorithms based on the ratio-of-uniforms transformation, as well as some standard algorithms, such as a standard Metropolis-Hastings algorithm, and compare their performances. We also discuss ideas for the automation of some of these algorithms: both on the front end, by having the algorithm tune itself, and on the back end, by using estimates of Monte Carlo standard errors to determine how long to run the Markov chain.

The remainder of the thesis is organized as follows. In Chapter 2, we outline some basic theory of Markov chain Monte Carlo methods, as well as the estimation of Monte Carlo standard errors and how these standard errors can be used to construct stopping rules. In Chapter 3, we introduce the ratio-of-uniforms transformation and discuss how it can be used in conjunction with Markov chain Monte Carlo methods. Chapter 4 compares the performance of the various algorithms on simulated and real data. Finally, Chapter 5 contains the conclusions of this study and ideas for future work.

Chapter 2
Markov Chain Monte Carlo

2.1 MCMC Theory

Before describing the Markov chain Monte Carlo algorithms used in this study, it is first necessary to briefly review some basic theory of Markov chains and how they can be used to construct MCMC algorithms. More detailed discussions of MCMC theory can be found in Tierney (1994) and Robert and Casella (2004), while Geyer (1992) discusses some of the practical aspects of constructing MCMC algorithms.

2.1.1 Markov Chains

A Markov chain is a sequence of random variables {X^(i)}, i ≥ 1, having the property that the distribution of each random variable depends, at most, on the value of the previous random variable. That is, {X^(i)} is a Markov chain if

    P(X^(i+1) ∈ A | X^(1), ..., X^(i)) = P(X^(i+1) ∈ A | X^(i))

for any set A (Casella and Berger, 2002), where X^(j) denotes the j-th step of the Markov chain. This property proves very useful in the construction of MCMC algorithms; in particular, the lack of dependence on earlier random variables allows us to generate the next value in the chain using only its current value, rather than having to consider all previous values of the sequence.

In the construction of Markov chains, we make use of a transition kernel. The transition kernel specifies the likelihood of the sequence moving from the current value of the random variable to each of the possible values that the next random variable in the sequence could take. It takes the form of a conditional density function, specifying the probability density for all values of the next step in the chain, given the current value of the chain. Note that for this study, all of the random variables we consider have continuous distributions; we therefore concern ourselves only with the continuous case, and do not explore the theory of Markov chains on discrete

state spaces. Given a transition kernel, we can construct a Markov chain by choosing an initial starting point for the chain, and then using the transition kernel to govern the probabilities of moving to future states.

Example 2.1. As a simple example of constructing a Markov chain, consider a random walk model (Robert and Casella, 2004). For this model, we have the relationship X^(n+1) = X^(n) + ε^(n), where ε^(n) is a random variable whose distribution is independent of the {X^(i)} values. For this example, we assume that ε^(n) ~ N(0, 1), so that X^(n+1) ~ N(x^(n), 1), where x^(n) is the realized value of X^(n), i.e., the previous value of the Markov chain. Thus, the transition kernel for this model is given in (2.1):

    P(X^(n+1) = x | X^(1) = x^(1), ..., X^(n) = x^(n)) = P(X^(n+1) = x | X^(n) = x^(n)) = (1/√(2π)) exp( -(x - x^(n))²/2 ).    (2.1)

To complete the specification of the chain, we also need to assign a starting value, that is, a value for X^(1); for this example we set x^(1) = 0. Now we can generate each subsequent value of the chain using the transition kernel given in (2.1), conditioning on the current value of the chain. Thus, to generate X^(2) we simulate from X^(2) ~ N(x^(1), 1) = N(0, 1). Once we have a value for X^(2) (call it x^(2)), we continue by simulating X^(3) ~ N(x^(2), 1). We can continue to build a Markov chain of any desired length in this manner. A plot of the first 1,000 values of one possible realization of this Markov chain is shown in Figure 2.1.

Figure 2.1: Realization of a random walk Markov chain.
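The following short sketch (not part of the original thesis; the function name, seed, and chain length are illustrative) simulates the random walk chain of Example 2.1.

```python
import numpy as np

def simulate_random_walk(n_steps=1000, x1=0.0, seed=0):
    """Simulate Example 2.1: X^(n+1) ~ N(x^(n), 1), started at x^(1) = x1."""
    rng = np.random.default_rng(seed)
    x = np.empty(n_steps)
    x[0] = x1
    for i in range(1, n_steps):
        x[i] = rng.normal(loc=x[i - 1], scale=1.0)   # draw from the transition kernel (2.1)
    return x

chain = simulate_random_walk()   # 1,000 values, as plotted in Figure 2.1
```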

Generally, when we construct a Markov chain, we hope that it will eventually converge to a particular target distribution. In some circumstances, we can ensure this through the way the Markov chain is constructed. We now briefly explore some of the conditions needed for this to take place, starting by defining some properties of Markov chains.

The distribution π is the stationary (invariant) distribution of a Markov chain if

    lim_{n→∞} P(X^(n) ∈ A | X^(1) = x^(1)) = π(A)

for almost all sets A and points x^(1). Now denote the 1-step transition kernel by P and the n-step transition kernel by P^n. That is, given that the chain is currently at x, the conditional probability that the next point will fall within the set A is P(x, A); similarly, the conditional probability that the chain will be at a point in the set A after n steps is P^n(x, A). Then we can say that π is the stationary distribution of a Markov chain with transition kernel P if

    lim_{n→∞} P^n(x, A) = π(A)

for almost all x. This terminology means that the chain is stationary in its distribution, i.e., X^(i) ~ π implies that X^(i+j) ~ π for all j.

A Markov chain is said to be irreducible if it has positive probability of moving to any set A for which π(A) > 0. Thus, an irreducible Markov chain is one in which all states communicate with one another. This is clearly an important property in the construction of MCMC algorithms: in order to have any chance of fully exploring the state space, the Markov chain must be able to reach all states, that is, it needs to be irreducible.

Another important property of a Markov chain is its period. A Markov chain is called periodic if there exist states to which the chain can only move at particular, regularly spaced times. For example, if a Markov chain can only take a value in a set A every fourth step, then this chain is periodic with period four. Irreducible Markov chains which are not periodic are known as aperiodic.

A concept which will be important in the discussion of the convergence of Markov chains is recurrence. An irreducible Markov chain {X^(n)} with invariant distribution π is said to be recurrent if, for each set A such that π(A) > 0, we have P(X^(n) ∈ A i.o. | X^(1) = x) = 1 for π-almost all x and P(X^(n) ∈ A i.o. | X^(1) = x) > 0 for all x (Tierney, 1994), where i.o. stands for infinitely often. Intuitively, recurrence means that the expected number of times that the chain returns to any set of positive measure is infinite. A slightly stronger property than recurrence is Harris recurrence: a Markov chain {X^(i)} is called Harris recurrent if P(X^(i) ∈ A i.o. | X^(1) = x) = 1 for all x and every set A of positive measure. A Harris recurrent chain will return to every set of positive measure infinitely often with probability one. If there is an invariant finite measure for an irreducible Markov chain, then the chain is called positive recurrent. Markov chains which are recurrent but not positive recurrent are called null recurrent.

A Markov chain which is positive recurrent and aperiodic is said to be ergodic. Intuitively, an ergodic Markov chain is one whose invariant distribution π is independent of the initial conditions of the chain (Robert and Casella, 2004). Similarly, a Markov chain which is both Harris recurrent and aperiodic is known as a Harris ergodic chain. The conditions which assure the convergence of a Markov chain to the stationary distribution π are given in Theorem 2.1, known as the Ergodic Theorem (a form of the Law of Large Numbers for Markov chains).

Theorem 2.1. If a Markov chain with n-step transition kernel P^n is Harris ergodic and irreducible, then

    lim_{n→∞} ‖P^n - π‖_TV = 0,

where ‖·‖_TV denotes the total variation norm, that is, ‖f_1 - f_2‖_TV = sup_A |f_1(A) - f_2(A)|, where the supremum is taken over all measurable sets A.

Proof. See Athreya et al. (1996).

The Ergodic Theorem, while guaranteeing convergence of the Markov chain, unfortunately does not specify the rate of this convergence. In other words, while it assures us that the given Markov chain will indeed eventually converge to π, it does not give any indication of how long this convergence might take, nor even provide an upper bound on this length of time. Clearly, this is an important point: if our goal in constructing the Markov chain is that it converge to a given stationary distribution π, we would like some indication of when this might occur, so that we have an idea of how long to run the Markov chain. To address this issue, we can consider more stringent forms of ergodicity which bound the rate of convergence of a Markov chain to its stationary distribution π. Uniform ergodicity and geometric ergodicity are two such stronger types of ergodicity. Specifically, a Markov chain with invariant distribution π is geometrically ergodic if there is a function M(·) and a constant r, 0 < r < 1, such that

    ‖P^n(x, ·) - π(·)‖_TV ≤ M(x) r^n for all x (Tierney, 1994).

Furthermore, the chain is uniformly ergodic if there is a constant M and a constant r, 0 < r < 1, such that

    ‖P^n(x, ·) - π(·)‖_TV ≤ M r^n for all x (Tierney, 1994).

Clearly, uniform ergodicity is stronger than geometric ergodicity, and in fact the former implies the latter.

Once we have run the Markov chain and have the corresponding sample {X^(i)}, we can use this sample to estimate expectations of functions of the random variable. In particular, we estimate E_π(g) (the expectation of the function g with respect to the stationary distribution π) by the corresponding sample mean

    ḡ_n = (1/n) Σ_{i=1}^n g(X_i).

While ḡ_n will necessarily be an imperfect estimate of E_π(g), under regularity conditions we can bound this discrepancy via a type of Central Limit Theorem for Markov chains.

Theorem 2.2. Under regularity conditions,

    √n (ḡ_n - E_π(g)) → N(0, σ²_g) in distribution,

where σ²_g = Var_π(g(X_1)) + 2 Σ_{i=2}^∞ Cov_π(g(X_1), g(X_i)), and the variance and covariance calculations are performed with respect to the distribution π.

Proof. See Tierney (1994) and Nummelin (1984).

Two examples of regularity conditions that guarantee this Central Limit Theorem are (Roberts and Rosenthal, 2004): (i) {X_i} is geometrically ergodic and E_π|g|^(2+δ) < ∞ for some δ > 0, or (ii) {X_i} is uniformly ergodic and E_π g² < ∞; we note that these are not the only such conditions. The importance of establishing this Central Limit Theorem is that it allows us to estimate σ²_g, the variability of ḡ_n, so that we can get some idea of the quality of our estimate ḡ_n. Although there are many different methods of estimating σ²_g, here we will only consider the batch means method, which is described in Section 2.3.

2.2 The Metropolis-Hastings Algorithm

Perhaps the most commonly used Markov chain Monte Carlo method is the Metropolis-Hastings algorithm. The basic idea of this algorithm is that instead of constructing the Markov chain directly from the target distribution, the state transitions are guided by a different distribution, known as the proposal distribution. Of course, using transition probabilities from the proposal distribution rather than the target distribution would, by itself, cause the Markov chain to converge to the wrong stationary distribution. The algorithm adjusts for this by sometimes staying at the current state, rather than moving to the state selected by the proposal distribution; this adjustment ensures that the algorithm does indeed converge to the correct target distribution.

Suppose that our target distribution (the distribution we are interested in sampling from) is π, and that the proposal distribution is q(x, y), also written q(y | x). In both notations, x represents the current value of the Markov chain, whereas y is a possible next value of the chain. If the chain is at the point X^(n) = x, we define the acceptance probability as

    α(x, y) = min{ π(y) q(y, x) / [π(x) q(x, y)], 1 },    (2.2)

unless π(x) q(x, y) = 0, in which case we set α(x, y) = 1. We generate a proposal y from q(· | x), accept the proposal with probability α(x, y), and reject it otherwise. If we accept the proposal, then it becomes the next point in the Markov chain; if we reject it, then the current point is used as the next point in the chain. This algorithm, which

was originally introduced by Metropolis et al. (1953) and later generalized by Hastings (1970), is described in Figure 2.2.

    Input:  x^(n) = current value of the Markov chain
            q(x, y) = proposal distribution
        x* ~ q(x^(n), ·)
        a ← α(x^(n), x*)
        V ~ Uniform(0, 1)
        if (V < a) then x^(n+1) ← x*
        else x^(n+1) ← x^(n)
    Output: x^(n+1) = new value of the Markov chain

Figure 2.2: The Metropolis-Hastings algorithm.

It is common to use a symmetric proposal distribution, so that q(x, y) = q(y, x). In this case, (2.2) reduces to

    α(x, y) = min{ π(y) / π(x), 1 },    (2.3)

which simplifies the calculation of the acceptance probability. This is often referred to as a Metropolis update. Also note that both (2.2) and (2.3) depend on the distribution π(·) only through the ratio π(y)/π(x). It is for this reason that we need only specify the kernel of π(·); the normalizing constants cancel in this ratio.
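The following sketch (not from the thesis) is one way to express the update of Figure 2.2 in code, for a symmetric Normal random-walk proposal so that the Metropolis ratio (2.3) applies. It works on the log scale for numerical stability; the function names and proposal scale are illustrative assumptions.

```python
import numpy as np

def metropolis_step(x, log_target, proposal_sd, rng):
    """One Metropolis update (Figure 2.2) with a symmetric Normal random-walk proposal."""
    x_prop = rng.normal(loc=x, scale=proposal_sd)      # draw x* ~ q(x^(n), .)
    log_a = log_target(x_prop) - log_target(x)         # log of pi(y)/pi(x), cf. (2.3)
    if np.log(rng.uniform()) < log_a:                  # accept with probability min(1, ratio)
        return x_prop
    return x

def run_chain(x1, log_target, n_steps, proposal_sd=1.0, seed=0):
    """Run the chain for n_steps trials starting from x1."""
    rng = np.random.default_rng(seed)
    chain = [x1]
    for _ in range(n_steps - 1):
        chain.append(metropolis_step(chain[-1], log_target, proposal_sd, rng))
    return np.array(chain)
```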

Example 2.2. As an example of the Metropolis-Hastings algorithm, consider the problem of generating a random sample uniformly on the unit circle C centered at the origin. In this case, our target distribution is

    π(x_1, x_2) = [1 / area(C)] I((x_1, x_2) ∈ C) = (1/π) I(x_1² + x_2² < 1),    (2.4)

where I(·) denotes the indicator function. For the proposal distribution, we use a two-dimensional Normal distribution whose mean vector is the current point and whose covariance matrix is the identity matrix. Thus, if the Markov chain is currently at X^(n) = (x_1^(n), x_2^(n)), then our proposal distribution is

    q(y_1, y_2 | x_1, x_2) = (1/(2π)) exp( -(1/2) [ (y_1 - x_1)² + (y_2 - x_2)² ] ),

which is a symmetric distribution, enabling us to use the simpler form of the acceptance probability given in (2.3). To initialize the Metropolis-Hastings algorithm, we must choose a starting value for the Markov chain; we will start at the origin, so that X^(1) = (x_1^(1), x_2^(1)) = (0, 0). We then run the algorithm for as many trials as desired. Note that for this example, (2.3) becomes

    α((x_1, x_2), (y_1, y_2)) = min{ π(y_1, y_2) / π(x_1, x_2), 1 }
                              = min{ (1/π) I(y_1² + y_2² < 1) / [(1/π) I(x_1² + x_2² < 1)], 1 }
                              = min{ I(y_1² + y_2² < 1) / I(x_1² + x_2² < 1), 1 }
                              = min{ I(y_1² + y_2² < 1), 1 }
                              = I(y_1² + y_2² < 1),    (2.5)

since I(x_1² + x_2² < 1) equals 1: the current point X^(n) = (x_1^(n), x_2^(n)) lies in the unit circle because it is the current state of the Markov chain. Now notice that (2.5) will be either 0 or 1, depending on whether the proposed point is in the unit circle. If the proposed point is within the unit circle, the acceptance probability (2.5) is 1, so the proposed point is automatically accepted; if it lies outside the unit circle, it is always rejected. Thus, for this random walk algorithm, deciding whether a proposed point should be accepted or rejected reduces to checking whether the proposed point lies within the unit circle, which is a simple calculation.

For demonstration purposes, we run this MCMC algorithm for 100 trials. The results of the first six trials are shown in Table 2.1, and the movement of the Markov chain over these six trials is shown in Figure 2.3, along with the boundary of the region C from which we are trying to sample.

Table 2.1: First six trials of the Metropolis-Hastings random walk.

    Trial   Location of Markov chain   Proposed Point     Point Accepted?
    1       (0.000, 0.000)             (-0.140, 0.827)    YES
    2       (-0.140, 0.827)            (0.706, )          YES
    3       (0.706, )                  (0.557, )          NO
    4       (0.706, )                  (0.608, 0.461)     YES
    5       (0.608, 0.461)             (0.256, 0.167)     YES
    6       (0.256, 0.167)             (-0.826, )         NO

Figure 2.3: Movement of the random walk Markov chain.

A plot of the entire sample of 100 points is shown in Figure 2.4. These points do indeed appear to be distributed uniformly across the unit circle, as we would hope. Note, however, that there are fewer than 100 distinct points on the plot: some points are duplicates, resulting from trials in which the proposed point fell outside the unit circle and was rejected, so that the Markov chain remained at its current location rather than moving to a new point. We should also note that this simple example is presented for demonstration purposes only; if we actually wanted to generate a random sample from the bivariate Uniform distribution on the unit circle, there are many more efficient algorithms than the random walk given here. In fact, for this case, it is unlikely that we would use any type of Markov chain algorithm at all, since it is simple to produce an i.i.d. (independent and

identically distributed) sample from this distribution. Finally, note that in general, most Markov chain Monte Carlo algorithms are run for far more than 100 trials; this few trials will typically not be sufficient to produce a reasonable sample from the target distribution.

Figure 2.4: Sample produced by the random walk Markov chain.
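A compact sketch of Example 2.2 (not from the thesis; the seed and variable names are illustrative). It applies the acceptance rule (2.5) directly: a proposed point is kept only if it falls inside the unit circle.

```python
import numpy as np

rng = np.random.default_rng(1)
chain = [np.zeros(2)]                                # start at the origin, X^(1) = (0, 0)
for _ in range(99):                                  # 100 trials in total, as in Table 2.1
    prop = rng.normal(loc=chain[-1], scale=1.0)      # bivariate Normal proposal, identity covariance
    # Acceptance probability (2.5): accept iff the proposal lies inside the unit circle.
    chain.append(prop if prop[0] ** 2 + prop[1] ** 2 < 1 else chain[-1])
chain = np.array(chain)
```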

2.2.1 Variable-at-a-time Metropolis-Hastings

Variable-at-a-time Metropolis-Hastings algorithms, of which the Gibbs sampler (Gelfand and Smith, 1990) is a special case, can be particularly helpful when we are attempting to construct a multivariate Markov chain. This class of samplers is often beneficial because it allows us to update the variables in the Markov chain individually, rather than having to update all of them at once. Suppose that we are trying to construct a Markov chain which converges to a stationary distribution π(x_1, x_2). We first let

    π_X1(x_1) = ∫ π(x_1, x_2) dx_2   and   π_X2(x_2) = ∫ π(x_1, x_2) dx_1

be the marginal distributions associated with π(x_1, x_2). Then the conditional distributions of the two variables are

    π_{X1|X2}(x_1 | x_2) = π(x_1, x_2) / π_X2(x_2)   and   π_{X2|X1}(x_2 | x_1) = π(x_1, x_2) / π_X1(x_1).

We can then sample x_1 and x_2 individually, each conditional upon the other: we first sample x_1 from π_{X1|X2}(x_1 | x_2) and then sample x_2 from π_{X2|X1}(x_2 | x_1). Sampling from these conditional distributions (rather than the full joint distribution) can lead to gains in efficiency, especially when the conditional distributions have recognizable forms or are much easier to generate samples from. To produce the sampled points from each of these conditional distributions, we can use univariate Metropolis-Hastings methods, rejection samplers, or, if the conditional distributions have recognized forms, direct sampling from one or more of them. We need not use the same updating method for each of the variables; we can choose any univariate updating scheme that is appropriate for the given variable. This procedure is shown in Figure 2.5 for the case of a bivariate Markov chain, and can easily be extended to Markov chains of any finite dimension.

    Input:  (x_1^(n), x_2^(n)) = current value of the Markov chain
            π_{X1|X2}(x_1 | x_2) = conditional distribution of X_1 | X_2
            π_{X2|X1}(x_2 | x_1) = conditional distribution of X_2 | X_1
        x_1^(n+1) ~ π_{X1|X2}(x_1 | X_2 = x_2^(n))
        x_2^(n+1) ~ π_{X2|X1}(x_2 | X_1 = x_1^(n+1))
    Output: (x_1^(n+1), x_2^(n+1)) = new value of the Markov chain

Figure 2.5: The Gibbs algorithm for constructing a bivariate Markov chain.
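A generic sketch of the scheme in Figure 2.5 (not from the thesis; names are illustrative). Each coordinate is updated in turn with its own univariate Metropolis step targeting the corresponding full conditional, which is one of the allowed updating schemes mentioned above.

```python
import numpy as np

def variable_at_a_time_step(x, log_target, proposal_sds, rng):
    """Update each coordinate of x in turn with a univariate Metropolis step.
    log_target is the log-kernel of the joint distribution pi; with the other coordinates
    held fixed, it is proportional to the full conditional of the coordinate being updated,
    so no explicit conditional densities are needed."""
    x = x.copy()
    for i in range(len(x)):
        prop = x.copy()
        prop[i] = rng.normal(loc=x[i], scale=proposal_sds[i])   # univariate Normal proposal
        if np.log(rng.uniform()) < log_target(prop) - log_target(x):
            x = prop                                            # accept the coordinate update
    return x
```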

Example 2.3. As an example, we use the Gibbs algorithm to sample uniformly from the unit circle C centered at the origin. Note that this is the same target distribution as in the previous example: we will construct a Markov chain that converges to the distribution given in (2.4). Instead of updating both coordinates simultaneously as before, however, the Gibbs algorithm updates the coordinates individually. To do this we need the appropriate conditional distribution for each variable, so we first solve for the marginal distributions:

    π_X1(x_1) = ∫ π(x_1, x_2) dx_2
              = ∫ (1/π) I(x_1² + x_2² < 1) dx_2
              = (1/π) ∫ I(x_2² < 1 - x_1²) dx_2
              = (1/π) ∫ from -√(1 - x_1²) to √(1 - x_1²) dx_2
              = (2/π) √(1 - x_1²) I(-1 < x_1 < 1).

Similarly, we find that

    π_X2(x_2) = (2/π) √(1 - x_2²) I(-1 < x_2 < 1).

Then we can solve for the conditional distributions corresponding to each of these variables:

    π_{X1|X2}(x_1 | x_2) = π(x_1, x_2) / π_X2(x_2)
                         = (1/π) I(x_1² + x_2² < 1) / [ (2/π) √(1 - x_2²) I(-1 < x_2 < 1) ]
                         = [1 / (2√(1 - x_2²))] I(-√(1 - x_2²) < x_1 < √(1 - x_2²)).

Likewise,

    π_{X2|X1}(x_2 | x_1) = [1 / (2√(1 - x_1²))] I(-√(1 - x_1²) < x_2 < √(1 - x_1²)).

Inspecting these distributions, we can see that, conditional upon the value of the other coordinate, each coordinate has a uniform distribution, with limits determined by the value of the other coordinate; these limits correspond to the boundary of the unit circle.

In this Gibbs sampler, we update each coordinate via a univariate Metropolis-Hastings step. In order to do this, we must specify a proposal distribution for each coordinate; we use a univariate Normal distribution whose mean is the current value of the corresponding coordinate and whose variance is 1. Thus q_X1(y | x_1) is N(x_1, 1) and q_X2(y | x_2) is N(x_2, 1). Now we can calculate the acceptance probabilities for each of the Metropolis-Hastings updates. In both cases, the proposal distributions are symmetric, so we can use the simplified version of the acceptance probability given in (2.3):

    α_{X1|X2}(x, y) = min{ π_{X1|X2}(y) / π_{X1|X2}(x), 1 }
                    = min{ [1/(2√(1 - x_2²))] I(-√(1 - x_2²) < y < √(1 - x_2²)) / ( [1/(2√(1 - x_2²))] I(-√(1 - x_2²) < x < √(1 - x_2²)) ), 1 }
                    = min{ I(-√(1 - x_2²) < y < √(1 - x_2²)) / I(-√(1 - x_2²) < x < √(1 - x_2²)), 1 }
                    = min{ I(-√(1 - x_2²) < y < √(1 - x_2²)), 1 }
                    = I(-√(1 - x_2²) < y < √(1 - x_2²)).

As was the case in the previous example, this acceptance probability will always be either 0 or 1, depending on whether or not the proposed point lies within the unit circle: if it does, we accept it; if not, we reject it and this coordinate of the Markov chain remains at its current value. Also note that I(-√(1 - x_2²) < x < √(1 - x_2²)) will always equal 1 by virtue of the current point lying within the unit circle. Similarly, the acceptance probability for the other coordinate is

    α_{X2|X1}(x, y) = I(-√(1 - x_1²) < y < √(1 - x_1²)).

To complete the specification of this algorithm, we must assign a starting value to the Markov chain. As before, we start the chain at the origin, so that X^(1) = (x_1^(1), x_2^(1)) = (0, 0), and we run the Markov chain for 100 trials (i.e., 50 updates of each coordinate). The first six trials are shown in Table 2.2, and the movement of the Markov chain over these six trials is shown in Figure 2.6, along with the boundary of the region C from which we are trying to sample.

Table 2.2: First six trials of the Metropolis-Hastings random walk using Gibbs updates.

    Trial   Location of Markov chain   Proposed Point     Point Accepted?
    1       (0.000, 0.000)             (-0.472, 0.000)    YES
    2       (-0.472, 0.000)            (-0.472, 0.402)    YES
    3       (-0.472, 0.402)            (0.364, 0.402)     YES
    4       (0.364, 0.402)             (0.364, )          YES
    5       (0.364, )                  (-0.534, )         YES
    6       (-0.534, )                 (-0.534, )         NO

Figure 2.6: Movement of the random walk Markov chain using Gibbs updates.

A plot of the entire sample is shown in Figure 2.7.

Figure 2.7: Sample produced by the random walk Markov chain using Gibbs updates.
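A sketch of Example 2.3 (not from the thesis; the seed and variable names are illustrative), updating each coordinate with the univariate Metropolis step described above.

```python
import numpy as np

rng = np.random.default_rng(2)
x1, x2 = 0.0, 0.0                 # start at the origin, X^(1) = (0, 0)
chain = [(x1, x2)]
for _ in range(50):               # 50 sweeps = 100 coordinate updates, as in Example 2.3
    # Update x1 | x2: Normal(x1, 1) proposal, accepted iff it stays inside the circle.
    y = rng.normal(loc=x1, scale=1.0)
    if y ** 2 < 1.0 - x2 ** 2:    # acceptance probability alpha_{X1|X2} from the example
        x1 = y
    chain.append((x1, x2))
    # Update x2 | x1 in the same way.
    y = rng.normal(loc=x2, scale=1.0)
    if y ** 2 < 1.0 - x1 ** 2:    # acceptance probability alpha_{X2|X1}
        x2 = y
    chain.append((x1, x2))
```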

2.3 Monte Carlo Standard Errors and Stopping Rules

In Section 2.1 we mentioned some of the properties of Markov chains, including circumstances under which we can state a Central Limit Theorem for Markov chains, which was given in Theorem 2.2. If we can estimate σ²_g, the Central Limit Theorem lets us assess the accuracy of any of the calculations we base on the Markov chain. While there are many potential methods available for calculating σ̂²_g (an estimate of σ²_g), the one we focus on here is the consistent batch means (CBM) method, as described in Jones et al. (2006) and Flegal et al. (2008); our discussion of this method closely follows the description given by Flegal et al. (2008).

Using the batch means method, to compute σ̂²_g based on a Markov chain run for n trials, we first split these n trials into a number of batches. In particular, we let a be the number of batches and b the number of trials in each batch, so that n = ab. We then compute the sample mean of each batch as

    Ȳ_j = (1/b) Σ_{i=(j-1)b+1}^{jb} g(X_i),   j = 1, ..., a.

Then the estimate of σ²_g is given by

    σ̂²_g = [b / (a - 1)] Σ_{j=1}^{a} (Ȳ_j - ḡ_n)².    (2.6)

Note that in general, for arbitrary values of a and b, the estimator defined in (2.6) will not be consistent for σ²_g. However, there are choices of a and b that do assure consistency of this estimator; one such choice is to let b = ⌊√n⌋ and a = ⌊n/b⌋.
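A sketch of the CBM estimator (2.6) (not from the thesis; the function name is illustrative), using the b = ⌊√n⌋ batch-size choice mentioned above. The number of batches a is returned as well, since it is needed later for the t quantile in (2.7).

```python
import numpy as np

def batch_means_variance(g_values):
    """Consistent batch means estimate of sigma^2_g from the chain output g(X_1), ..., g(X_n)."""
    n = len(g_values)
    b = int(np.floor(np.sqrt(n)))        # batch length b = floor(sqrt(n))
    a = n // b                           # number of batches; trailing values beyond a*b are dropped
    batches = np.asarray(g_values[: a * b]).reshape(a, b)
    batch_means = batches.mean(axis=1)   # Y-bar_j for j = 1, ..., a
    g_bar = batches.mean()               # overall mean of the values actually used
    sigma2_hat = b / (a - 1) * np.sum((batch_means - g_bar) ** 2)   # equation (2.6)
    return sigma2_hat, a
```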

Once we have an estimate of σ²_g, we can use it to create a confidence interval for E_π g. Specifically, if σ̂²_g is the estimator defined in (2.6), then

    t_{a-1} √(σ̂²_g / n)    (2.7)

is the half-width of an asymptotically valid confidence interval for E_π g, where t_{a-1} is the desired quantile from a t distribution with a - 1 degrees of freedom (Jones et al., 2006).

We can now use σ̂²_g to help devise stopping rules for an MCMC simulation. That is, we can use (2.7) to create guidelines which tell us how long to run a Markov chain in order to produce estimates of a desired accuracy. We must first specify the desired level of accuracy ε for our estimate; to achieve this level of accuracy, we want the half-width given in (2.7) to be no greater than ε. Thus our stopping rule is as follows: at periodic intervals (i.e., every k trials for some pre-specified value of k) we calculate σ̂²_g using (2.6), and we stop the simulation only if the associated confidence interval is narrow enough. In particular, we stop if

    t_{a_n - 1} √(σ̂²_g / n) + p(n) ≤ ε,    (2.8)

where p(n) = ε I(n < n*), with n* being a number chosen beforehand. (2.8) makes use of the half-width formula given in (2.7), with the addition of the p(n) term; this term ensures that the simulation is not stopped prematurely because σ̂²_g is a poor estimate of σ²_g due to a small sample size. Since σ̂²_g is a consistent estimator of σ²_g, this procedure will stop for a sufficiently large value of n.
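A sketch of the fixed-width stopping check in (2.8) (not from the thesis; the names, confidence level, and minimum run length n* are illustrative), reusing batch_means_variance from the previous sketch.

```python
import numpy as np
from scipy import stats

def half_width(g_values, level=0.95):
    """Half-width (2.7) of the CBM-based confidence interval for E_pi(g)."""
    sigma2_hat, a = batch_means_variance(g_values)
    t_quantile = stats.t.ppf(0.5 + level / 2, df=a - 1)
    return t_quantile * np.sqrt(sigma2_hat / len(g_values))

def should_stop(g_values, eps, n_star):
    """Fixed-width stopping rule (2.8): stop once the padded half-width is at most eps."""
    n = len(g_values)
    p_n = eps if n < n_star else 0.0     # p(n) = eps * I(n < n*), prevents premature stopping
    return half_width(g_values) + p_n <= eps
```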

2.4 Effective Sample Size

Ideally, when we produce a sample from a target distribution, we would like this sample to consist of i.i.d. (independent and identically distributed) draws from the distribution. However, draws produced using Markov chains will typically not be independent of each other, though the level of dependence can vary greatly. We need to take the degree of this dependence into account when judging the quality of the samples produced. One way to do this is by examining the sample autocorrelations at various lags: if the samples were truly i.i.d., then the autocorrelation at each lag would be close to 0, so we prefer samples with autocorrelations that decay to 0 quickly, as this indicates samples which are less dependent. To assess the autocorrelation in a given sample, we must then examine the autocorrelation at each lag, which can be very tedious. (In practice, we typically only look at autocorrelations for the first n lags, for some moderate value of n, since the autocorrelations at very large lags are typically insignificant.)

Another metric that is commonly used to assess the level of dependence in a sample is the effective sample size (ESS). A sample (containing some autocorrelation) which has an ESS of m contains as much information as an i.i.d. sample of m draws. We can calculate the ESS for a sample of size N using the formula ESS = N/κ(η), where κ(η) is the autocorrelation time for the parameter η. A standard formula for the autocorrelation time is

    κ(η) = 1 + 2 Σ_{k=1}^∞ ρ_k(η),

where ρ_k(η) is the autocorrelation at lag k for the parameter η. Kass et al. (1998) recommend modifying this formula slightly by summing only the first j autocorrelation lags (for some finite j past which the autocorrelations have nearly vanished); here we determine j by using the Initial Monotone Sequence Estimator (IMSE) method (Geyer, 1992). This method is described in Figure 2.8. Note that since we do not know the true autocorrelations for each parameter, in practice we use the corresponding estimates from the sample data, ρ̂_k(η).

    Input:  ρ̂_i(η) = estimated lag-i autocorrelation for η
        Γ_i ← ρ̂_i(η) + ρ̂_{i+1}(η)
        k ← 1
        repeat while Γ_{k+1} > 0 and Γ_k > Γ_{k+1}:  k ← k + 1
        κ(η) ← 1 + 2 Σ_{j=1}^{k-1} ρ̂_j(η)
    Output: κ(η) = autocorrelation time for η

Figure 2.8: The IMSE algorithm for computing autocorrelation time.
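A sketch of the ESS computation using the IMSE truncation of Figure 2.8 as reconstructed above (not from the thesis; names and the default maximum lag are illustrative). Sample autocorrelations ρ̂_k are computed directly from the chain output.

```python
import numpy as np

def sample_autocorr(x, max_lag):
    """Sample autocorrelations rho-hat_1, ..., rho-hat_max_lag of the chain output x."""
    x = np.asarray(x, dtype=float)
    xc = x - x.mean()
    denom = np.sum(xc ** 2)
    return np.array([np.sum(xc[k:] * xc[:-k]) / denom for k in range(1, max_lag + 1)])

def ess(x, max_lag=200):
    """Effective sample size N / kappa, with kappa truncated by the IMSE rule of Figure 2.8."""
    rho = sample_autocorr(x, max_lag)
    gamma = rho[:-1] + rho[1:]                # Gamma_i = rho-hat_i + rho-hat_{i+1}
    k = 1
    while k + 1 < len(gamma) and gamma[k] > 0 and gamma[k - 1] > gamma[k]:
        k += 1                                # advance while the Gamma sequence stays positive and decreasing
    kappa = 1.0 + 2.0 * np.sum(rho[: k - 1])  # kappa = 1 + 2 * sum of the first k-1 autocorrelations
    return len(x) / kappa
```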

Chapter 3
Ratio-of-Uniforms Markov Chain Monte Carlo

3.1 The Ratio-of-Uniforms Transformation

The ratio-of-uniforms (ROU) transformation, as described by Kinderman and Monahan (1977), is a method for producing a random draw from a given distribution. Rather than sampling from the desired p-dimensional distribution directly, the ratio-of-uniforms method instead generates a draw from a uniform distribution on a particular region in p + 1 dimensions. A transformation is then required to translate this draw back into the original space; this back-transformed draw is a sample from the desired distribution. This is a type of auxiliary variable method: it introduces an extra variable, in the hope that increasing the dimension of the target distribution (the distribution we are attempting to generate a sample from) will result in a more tractable sampling problem.

The simplest case is that of a univariate target distribution. Consider the problem of sampling from a univariate distribution f. With the ROU method, we would generate a sample from the 2-dimensional region C_f defined by

    C_f = {(u, v) : 0 < v < √(f(u/v))}.

After we have obtained the sample {(u^(1), v^(1)), (u^(2), v^(2)), ..., (u^(n), v^(n))}, we apply a transformation to translate this sample into a sample from the desired (1-dimensional) distribution. In particular, consider the transformation (y, z) = (u/v, v). After applying this transformation to the 2-dimensional sample, the marginal distribution of y = u/v is the desired distribution f; we simply ignore the other variable, z = v. This is summarized in Theorem 3.1.

Theorem 3.1. (Kinderman and Monahan, 1977) Let f be a density function for a univariate random variable. Let (U, V) be random variables with a joint Uniform distribution on the region C_f ⊂ R², with C_f = {(u, v) : 0 < v < √(f(u/v))}. Then the random variable Y = U/V has distribution f.

Proof. (Kinderman and Monahan, 1977) The bivariate distribution of (u, v) is

    g(u, v) = [1 / area(C_f)] I((u, v) ∈ C_f) = [1 / area(C_f)] I(0 < v < √(f(u/v))) = 2 I(0 < v < √(f(u/v))).

(Note that for a univariate density f, the area of C_f is always 1/2, since this is the normalizing constant that makes the joint density integrate to 1, as required.) The transformation (y, z) = (u/v, v) implies that (u, v) = (yz, z), so the Jacobian of this transformation is

    J = det [ ∂u/∂y  ∂u/∂z ;  ∂v/∂y  ∂v/∂z ] = det [ z  y ;  0  1 ] = z,

and |J| = |z| = z, since z = v ≥ 0 by construction. Then the bivariate distribution of (y, z) is

    h_{Y,Z}(y, z) = g_{U,V}(u(y, z), v(y, z)) |J| = 2z I(0 < z < √(f(y))),

so that the marginal distribution of y is

    j(y) = ∫ h(y, z) dz = ∫_0^{√(f(y))} 2z dz = [z²]_0^{√(f(y))} = f(y),

which is the desired target distribution.

One important point concerning this method is that, in order to use the ROU transformation, we do not need to know the normalizing constant. That is, f does not need to be a density; it need only be proportional to a density. This property is very useful in cases where the normalizing constant is unknown or intractable.

Example 3.1. As a simple example of this transformation, consider a target distribution X ~ Uniform(1, 2), i.e., f_X(x) = I(x ∈ (1, 2)). We have

    C_f = {(u, v) : 0 < v < √(f(u/v))}
        = {(u, v) : 0 < v < √(I(u/v ∈ (1, 2)))}
        = {(u, v) : 0 < v < 1, u/v ∈ (1, 2)}
        = {(u, v) : 0 < v < 1, 1 < u/v < 2}
        = {(u, v) : 0 < v < 1, v < u < 2v},

which is the region bounded by the triangle with vertices at the points (0, 0), (1, 1), and (2, 1) in the (U, V) plane (see Figure 3.1).

Figure 3.1: ROU region corresponding to univariate Uniform random variable.
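A sketch of ROU sampling for Example 3.1 (not from the thesis; names, seed, and sample size are illustrative): draw uniformly from the triangular region C_f by rejection within its bounding rectangle [0, 2] × [0, 1], then transform back via y = u/v.

```python
import numpy as np

rng = np.random.default_rng(3)
samples = []
while len(samples) < 10_000:
    u = rng.uniform(0.0, 2.0)          # bounding rectangle of the triangle C_f
    v = rng.uniform(0.0, 1.0)
    if v < u < 2.0 * v:                # accept iff (u, v) lies in C_f = {0 < v < 1, v < u < 2v}
        samples.append(u / v)          # back-transform: Y = U / V ~ Uniform(1, 2)
samples = np.array(samples)
```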

Example 3.2. As an example of generating a sample from a univariate target distribution via the ROU transformation, consider the kernel of a univariate standard Normal distribution (Kinderman and Monahan, 1977). That is, we will generate a sample of the random variable X, where X ~ N(0, 1), i.e., f_X(x) ∝ e^(-x²/2). In this case, the 2-dimensional ROU region corresponding to f_X is given by

    C_f = {(u, v) : 0 < v < e^(-u²/(4v²))}.

Note that C_f is bounded by the rectangle {(u, v) : -1 ≤ u ≤ 1, 0 ≤ v ≤ 1}. This allows us to sample uniformly on C_f by using a rejection sampler on this bounding rectangle. Once we have the sample of points in this region, we simply let X^(i) = U^(i)/V^(i), i = 1, ..., n, where n is the desired sample size, so that {X^(i)} is the sample from f_X. Figure 3.2 shows the ROU region for this distribution, along with the histogram corresponding to the generated sample.

Figure 3.2: Standard Normal ROU Example. (a) ROU region for the standard Normal sample; (b) histogram of the ROU standard Normal sample.

We can also state a more general version of Theorem 3.1. This generalization of the ROU method, due to Wakefield et al. (1991), enables us to use different power transformations to create ROU regions of varying shapes, possibly yielding more efficient sampling schemes.

Theorem 3.2. (Wakefield et al., 1991) Let f be a density function for a univariate random variable. Let (U, V) be random variables with a joint Uniform distribution on the region C_f ⊂ R², with C_f = {(u, v) : 0 < v < [f(u/v^r)]^(1/(r+1))}. Then the random variable Y = U/V^r has distribution f.

Proof. The bivariate distribution of (u, v) is

    g(u, v) = [1 / area(C_f)] I((u, v) ∈ C_f) = [1 / area(C_f)] I(0 < v < [f(u/v^r)]^(1/(r+1))) = (r + 1) I(0 < v < [f(u/v^r)]^(1/(r+1))).

(Note that for a univariate density f, the area of C_f is always 1/(r + 1), since this is the normalizing constant that makes the joint density integrate to 1, as required.) The transformation (y, z) = (u/v^r, v) implies that (u, v) = (yz^r, z), so the Jacobian of this transformation is

    J = det [ ∂u/∂y  ∂u/∂z ;  ∂v/∂y  ∂v/∂z ] = det [ z^r  r y z^(r-1) ;  0  1 ] = z^r,

and |J| = |z^r| = z^r, since z = v ≥ 0 by construction. Then the bivariate distribution of (y, z) is

    h_{Y,Z}(y, z) = g_{U,V}(u(y, z), v(y, z)) |J| = (r + 1) z^r I(0 < z < [f(y)]^(1/(r+1))),

so that the marginal distribution of y is

    j(y) = ∫ h(y, z) dz = ∫_0^{[f(y)]^(1/(r+1))} (r + 1) z^r dz = [z^(r+1)]_0^{[f(y)]^(1/(r+1))} = f(y),

which is the desired target distribution.
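A sketch of the rejection sampler described in Example 3.2 (not from the thesis; names, seed, and sample size are illustrative): points are proposed uniformly on the bounding rectangle [-1, 1] × [0, 1], kept if they fall in C_f, and transformed back via X = U/V.

```python
import numpy as np

rng = np.random.default_rng(4)
samples = []
while len(samples) < 10_000:
    u = rng.uniform(-1.0, 1.0)                      # bounding rectangle for C_f
    v = rng.uniform(0.0, 1.0)
    if 0.0 < v < np.exp(-u ** 2 / (4.0 * v ** 2)):  # accept iff (u, v) lies in C_f
        samples.append(u / v)                       # X = U / V is a N(0, 1) draw
samples = np.array(samples)
```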


More information

Stat 451 Lecture Notes Markov Chain Monte Carlo. Ryan Martin UIC

Stat 451 Lecture Notes Markov Chain Monte Carlo. Ryan Martin UIC Stat 451 Lecture Notes 07 12 Markov Chain Monte Carlo Ryan Martin UIC www.math.uic.edu/~rgmartin 1 Based on Chapters 8 9 in Givens & Hoeting, Chapters 25 27 in Lange 2 Updated: April 4, 2016 1 / 42 Outline

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning MCMC and Non-Parametric Bayes Mark Schmidt University of British Columbia Winter 2016 Admin I went through project proposals: Some of you got a message on Piazza. No news is

More information

Introduction to Machine Learning CMU-10701

Introduction to Machine Learning CMU-10701 Introduction to Machine Learning CMU-10701 Markov Chain Monte Carlo Methods Barnabás Póczos & Aarti Singh Contents Markov Chain Monte Carlo Methods Goal & Motivation Sampling Rejection Importance Markov

More information

MARKOV CHAIN MONTE CARLO

MARKOV CHAIN MONTE CARLO MARKOV CHAIN MONTE CARLO RYAN WANG Abstract. This paper gives a brief introduction to Markov Chain Monte Carlo methods, which offer a general framework for calculating difficult integrals. We start with

More information

Markov Chain Monte Carlo methods

Markov Chain Monte Carlo methods Markov Chain Monte Carlo methods By Oleg Makhnin 1 Introduction a b c M = d e f g h i 0 f(x)dx 1.1 Motivation 1.1.1 Just here Supresses numbering 1.1.2 After this 1.2 Literature 2 Method 2.1 New math As

More information

A Search and Jump Algorithm for Markov Chain Monte Carlo Sampling. Christopher Jennison. Adriana Ibrahim. Seminar at University of Kuwait

A Search and Jump Algorithm for Markov Chain Monte Carlo Sampling. Christopher Jennison. Adriana Ibrahim. Seminar at University of Kuwait A Search and Jump Algorithm for Markov Chain Monte Carlo Sampling Christopher Jennison Department of Mathematical Sciences, University of Bath, UK http://people.bath.ac.uk/mascj Adriana Ibrahim Institute

More information

Markov chain Monte Carlo

Markov chain Monte Carlo 1 / 26 Markov chain Monte Carlo Timothy Hanson 1 and Alejandro Jara 2 1 Division of Biostatistics, University of Minnesota, USA 2 Department of Statistics, Universidad de Concepción, Chile IAP-Workshop

More information

Review. DS GA 1002 Statistical and Mathematical Models. Carlos Fernandez-Granda

Review. DS GA 1002 Statistical and Mathematical Models.   Carlos Fernandez-Granda Review DS GA 1002 Statistical and Mathematical Models http://www.cims.nyu.edu/~cfgranda/pages/dsga1002_fall16 Carlos Fernandez-Granda Probability and statistics Probability: Framework for dealing with

More information

On the Applicability of Regenerative Simulation in Markov Chain Monte Carlo

On the Applicability of Regenerative Simulation in Markov Chain Monte Carlo On the Applicability of Regenerative Simulation in Markov Chain Monte Carlo James P. Hobert 1, Galin L. Jones 2, Brett Presnell 1, and Jeffrey S. Rosenthal 3 1 Department of Statistics University of Florida

More information

An introduction to adaptive MCMC

An introduction to adaptive MCMC An introduction to adaptive MCMC Gareth Roberts MIRAW Day on Monte Carlo methods March 2011 Mainly joint work with Jeff Rosenthal. http://www2.warwick.ac.uk/fac/sci/statistics/crism/ Conferences and workshops

More information

Computer intensive statistical methods

Computer intensive statistical methods Lecture 11 Markov Chain Monte Carlo cont. October 6, 2015 Jonas Wallin jonwal@chalmers.se Chalmers, Gothenburg university The two stage Gibbs sampler If the conditional distributions are easy to sample

More information

April 20th, Advanced Topics in Machine Learning California Institute of Technology. Markov Chain Monte Carlo for Machine Learning

April 20th, Advanced Topics in Machine Learning California Institute of Technology. Markov Chain Monte Carlo for Machine Learning for for Advanced Topics in California Institute of Technology April 20th, 2017 1 / 50 Table of Contents for 1 2 3 4 2 / 50 History of methods for Enrico Fermi used to calculate incredibly accurate predictions

More information

The Recycling Gibbs Sampler for Efficient Learning

The Recycling Gibbs Sampler for Efficient Learning The Recycling Gibbs Sampler for Efficient Learning L. Martino, V. Elvira, G. Camps-Valls Universidade de São Paulo, São Carlos (Brazil). Télécom ParisTech, Université Paris-Saclay. (France), Universidad

More information

Minicourse on: Markov Chain Monte Carlo: Simulation Techniques in Statistics

Minicourse on: Markov Chain Monte Carlo: Simulation Techniques in Statistics Minicourse on: Markov Chain Monte Carlo: Simulation Techniques in Statistics Eric Slud, Statistics Program Lecture 1: Metropolis-Hastings Algorithm, plus background in Simulation and Markov Chains. Lecture

More information

Comparing Non-informative Priors for Estimation and Prediction in Spatial Models

Comparing Non-informative Priors for Estimation and Prediction in Spatial Models Environmentrics 00, 1 12 DOI: 10.1002/env.XXXX Comparing Non-informative Priors for Estimation and Prediction in Spatial Models Regina Wu a and Cari G. Kaufman a Summary: Fitting a Bayesian model to spatial

More information

Monte Carlo Methods. Leon Gu CSD, CMU

Monte Carlo Methods. Leon Gu CSD, CMU Monte Carlo Methods Leon Gu CSD, CMU Approximate Inference EM: y-observed variables; x-hidden variables; θ-parameters; E-step: q(x) = p(x y, θ t 1 ) M-step: θ t = arg max E q(x) [log p(y, x θ)] θ Monte

More information

INTRODUCTION TO BAYESIAN STATISTICS

INTRODUCTION TO BAYESIAN STATISTICS INTRODUCTION TO BAYESIAN STATISTICS Sarat C. Dass Department of Statistics & Probability Department of Computer Science & Engineering Michigan State University TOPICS The Bayesian Framework Different Types

More information

Slice Sampling with Adaptive Multivariate Steps: The Shrinking-Rank Method

Slice Sampling with Adaptive Multivariate Steps: The Shrinking-Rank Method Slice Sampling with Adaptive Multivariate Steps: The Shrinking-Rank Method Madeleine B. Thompson Radford M. Neal Abstract The shrinking rank method is a variation of slice sampling that is efficient at

More information

Probabilistic Graphical Models Lecture 17: Markov chain Monte Carlo

Probabilistic Graphical Models Lecture 17: Markov chain Monte Carlo Probabilistic Graphical Models Lecture 17: Markov chain Monte Carlo Andrew Gordon Wilson www.cs.cmu.edu/~andrewgw Carnegie Mellon University March 18, 2015 1 / 45 Resources and Attribution Image credits,

More information

Extreme Value Analysis and Spatial Extremes

Extreme Value Analysis and Spatial Extremes Extreme Value Analysis and Department of Statistics Purdue University 11/07/2013 Outline Motivation 1 Motivation 2 Extreme Value Theorem and 3 Bayesian Hierarchical Models Copula Models Max-stable Models

More information

On Reparametrization and the Gibbs Sampler

On Reparametrization and the Gibbs Sampler On Reparametrization and the Gibbs Sampler Jorge Carlos Román Department of Mathematics Vanderbilt University James P. Hobert Department of Statistics University of Florida March 2014 Brett Presnell Department

More information

Applicability of subsampling bootstrap methods in Markov chain Monte Carlo

Applicability of subsampling bootstrap methods in Markov chain Monte Carlo Applicability of subsampling bootstrap methods in Markov chain Monte Carlo James M. Flegal Abstract Markov chain Monte Carlo (MCMC) methods allow exploration of intractable probability distributions by

More information

Bayesian Estimation of DSGE Models 1 Chapter 3: A Crash Course in Bayesian Inference

Bayesian Estimation of DSGE Models 1 Chapter 3: A Crash Course in Bayesian Inference 1 The views expressed in this paper are those of the authors and do not necessarily reflect the views of the Federal Reserve Board of Governors or the Federal Reserve System. Bayesian Estimation of DSGE

More information

MCMC Sampling for Bayesian Inference using L1-type Priors

MCMC Sampling for Bayesian Inference using L1-type Priors MÜNSTER MCMC Sampling for Bayesian Inference using L1-type Priors (what I do whenever the ill-posedness of EEG/MEG is just not frustrating enough!) AG Imaging Seminar Felix Lucka 26.06.2012 , MÜNSTER Sampling

More information

18 : Advanced topics in MCMC. 1 Gibbs Sampling (Continued from the last lecture)

18 : Advanced topics in MCMC. 1 Gibbs Sampling (Continued from the last lecture) 10-708: Probabilistic Graphical Models 10-708, Spring 2014 18 : Advanced topics in MCMC Lecturer: Eric P. Xing Scribes: Jessica Chemali, Seungwhan Moon 1 Gibbs Sampling (Continued from the last lecture)

More information

Multivariate Slice Sampling. A Thesis. Submitted to the Faculty. Drexel University. Jingjing Lu. in partial fulfillment of the

Multivariate Slice Sampling. A Thesis. Submitted to the Faculty. Drexel University. Jingjing Lu. in partial fulfillment of the Multivariate Slice Sampling A Thesis Submitted to the Faculty of Drexel University by Jingjing Lu in partial fulfillment of the requirements for the degree of Doctor of Philosophy June 2008 c Copyright

More information

Reminder of some Markov Chain properties:

Reminder of some Markov Chain properties: Reminder of some Markov Chain properties: 1. a transition from one state to another occurs probabilistically 2. only state that matters is where you currently are (i.e. given present, future is independent

More information

Some Results on the Ergodicity of Adaptive MCMC Algorithms

Some Results on the Ergodicity of Adaptive MCMC Algorithms Some Results on the Ergodicity of Adaptive MCMC Algorithms Omar Khalil Supervisor: Jeffrey Rosenthal September 2, 2011 1 Contents 1 Andrieu-Moulines 4 2 Roberts-Rosenthal 7 3 Atchadé and Fort 8 4 Relationship

More information

STA 294: Stochastic Processes & Bayesian Nonparametrics

STA 294: Stochastic Processes & Bayesian Nonparametrics MARKOV CHAINS AND CONVERGENCE CONCEPTS Markov chains are among the simplest stochastic processes, just one step beyond iid sequences of random variables. Traditionally they ve been used in modelling a

More information

Winter 2019 Math 106 Topics in Applied Mathematics. Lecture 9: Markov Chain Monte Carlo

Winter 2019 Math 106 Topics in Applied Mathematics. Lecture 9: Markov Chain Monte Carlo Winter 2019 Math 106 Topics in Applied Mathematics Data-driven Uncertainty Quantification Yoonsang Lee (yoonsang.lee@dartmouth.edu) Lecture 9: Markov Chain Monte Carlo 9.1 Markov Chain A Markov Chain Monte

More information

Bayesian Linear Regression

Bayesian Linear Regression Bayesian Linear Regression Sudipto Banerjee 1 Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota, U.S.A. September 15, 2010 1 Linear regression models: a Bayesian perspective

More information

Models for spatial data (cont d) Types of spatial data. Types of spatial data (cont d) Hierarchical models for spatial data

Models for spatial data (cont d) Types of spatial data. Types of spatial data (cont d) Hierarchical models for spatial data Hierarchical models for spatial data Based on the book by Banerjee, Carlin and Gelfand Hierarchical Modeling and Analysis for Spatial Data, 2004. We focus on Chapters 1, 2 and 5. Geo-referenced data arise

More information

STAT 425: Introduction to Bayesian Analysis

STAT 425: Introduction to Bayesian Analysis STAT 425: Introduction to Bayesian Analysis Marina Vannucci Rice University, USA Fall 2017 Marina Vannucci (Rice University, USA) Bayesian Analysis (Part 2) Fall 2017 1 / 19 Part 2: Markov chain Monte

More information

Adaptive Monte Carlo methods

Adaptive Monte Carlo methods Adaptive Monte Carlo methods Jean-Michel Marin Projet Select, INRIA Futurs, Université Paris-Sud joint with Randal Douc (École Polytechnique), Arnaud Guillin (Université de Marseille) and Christian Robert

More information

Control Variates for Markov Chain Monte Carlo

Control Variates for Markov Chain Monte Carlo Control Variates for Markov Chain Monte Carlo Dellaportas, P., Kontoyiannis, I., and Tsourti, Z. Dept of Statistics, AUEB Dept of Informatics, AUEB 1st Greek Stochastics Meeting Monte Carlo: Probability

More information

10. Exchangeability and hierarchical models Objective. Recommended reading

10. Exchangeability and hierarchical models Objective. Recommended reading 10. Exchangeability and hierarchical models Objective Introduce exchangeability and its relation to Bayesian hierarchical models. Show how to fit such models using fully and empirical Bayesian methods.

More information

Hierarchical Modeling for Univariate Spatial Data

Hierarchical Modeling for Univariate Spatial Data Hierarchical Modeling for Univariate Spatial Data Geography 890, Hierarchical Bayesian Models for Environmental Spatial Data Analysis February 15, 2011 1 Spatial Domain 2 Geography 890 Spatial Domain This

More information

Geometric ergodicity of the Bayesian lasso

Geometric ergodicity of the Bayesian lasso Geometric ergodicity of the Bayesian lasso Kshiti Khare and James P. Hobert Department of Statistics University of Florida June 3 Abstract Consider the standard linear model y = X +, where the components

More information

The Metropolis-Hastings Algorithm. June 8, 2012

The Metropolis-Hastings Algorithm. June 8, 2012 The Metropolis-Hastings Algorithm June 8, 22 The Plan. Understand what a simulated distribution is 2. Understand why the Metropolis-Hastings algorithm works 3. Learn how to apply the Metropolis-Hastings

More information

Introduction to Machine Learning CMU-10701

Introduction to Machine Learning CMU-10701 Introduction to Machine Learning CMU-10701 Markov Chain Monte Carlo Methods Barnabás Póczos Contents Markov Chain Monte Carlo Methods Sampling Rejection Importance Hastings-Metropolis Gibbs Markov Chains

More information

Advances and Applications in Perfect Sampling

Advances and Applications in Perfect Sampling and Applications in Perfect Sampling Ph.D. Dissertation Defense Ulrike Schneider advisor: Jem Corcoran May 8, 2003 Department of Applied Mathematics University of Colorado Outline Introduction (1) MCMC

More information

Metropolis Hastings. Rebecca C. Steorts Bayesian Methods and Modern Statistics: STA 360/601. Module 9

Metropolis Hastings. Rebecca C. Steorts Bayesian Methods and Modern Statistics: STA 360/601. Module 9 Metropolis Hastings Rebecca C. Steorts Bayesian Methods and Modern Statistics: STA 360/601 Module 9 1 The Metropolis-Hastings algorithm is a general term for a family of Markov chain simulation methods

More information

eqr094: Hierarchical MCMC for Bayesian System Reliability

eqr094: Hierarchical MCMC for Bayesian System Reliability eqr094: Hierarchical MCMC for Bayesian System Reliability Alyson G. Wilson Statistical Sciences Group, Los Alamos National Laboratory P.O. Box 1663, MS F600 Los Alamos, NM 87545 USA Phone: 505-667-9167

More information

Introduction to Bayesian methods in inverse problems

Introduction to Bayesian methods in inverse problems Introduction to Bayesian methods in inverse problems Ville Kolehmainen 1 1 Department of Applied Physics, University of Eastern Finland, Kuopio, Finland March 4 2013 Manchester, UK. Contents Introduction

More information

Bayesian GLMs and Metropolis-Hastings Algorithm

Bayesian GLMs and Metropolis-Hastings Algorithm Bayesian GLMs and Metropolis-Hastings Algorithm We have seen that with conjugate or semi-conjugate prior distributions the Gibbs sampler can be used to sample from the posterior distribution. In situations,

More information

CSC 2541: Bayesian Methods for Machine Learning

CSC 2541: Bayesian Methods for Machine Learning CSC 2541: Bayesian Methods for Machine Learning Radford M. Neal, University of Toronto, 2011 Lecture 3 More Markov Chain Monte Carlo Methods The Metropolis algorithm isn t the only way to do MCMC. We ll

More information

Bayesian Linear Models

Bayesian Linear Models Bayesian Linear Models Sudipto Banerjee 1 and Andrew O. Finley 2 1 Department of Forestry & Department of Geography, Michigan State University, Lansing Michigan, U.S.A. 2 Biostatistics, School of Public

More information

Semi-Parametric Importance Sampling for Rare-event probability Estimation

Semi-Parametric Importance Sampling for Rare-event probability Estimation Semi-Parametric Importance Sampling for Rare-event probability Estimation Z. I. Botev and P. L Ecuyer IMACS Seminar 2011 Borovets, Bulgaria Semi-Parametric Importance Sampling for Rare-event probability

More information

Sampling from complex probability distributions

Sampling from complex probability distributions Sampling from complex probability distributions Louis J. M. Aslett (louis.aslett@durham.ac.uk) Department of Mathematical Sciences Durham University UTOPIAE Training School II 4 July 2017 1/37 Motivation

More information

16 : Approximate Inference: Markov Chain Monte Carlo

16 : Approximate Inference: Markov Chain Monte Carlo 10-708: Probabilistic Graphical Models 10-708, Spring 2017 16 : Approximate Inference: Markov Chain Monte Carlo Lecturer: Eric P. Xing Scribes: Yuan Yang, Chao-Ming Yen 1 Introduction As the target distribution

More information

Markov Chain Monte Carlo in Practice

Markov Chain Monte Carlo in Practice Markov Chain Monte Carlo in Practice Edited by W.R. Gilks Medical Research Council Biostatistics Unit Cambridge UK S. Richardson French National Institute for Health and Medical Research Vilejuif France

More information

Computer intensive statistical methods

Computer intensive statistical methods Lecture 13 MCMC, Hybrid chains October 13, 2015 Jonas Wallin jonwal@chalmers.se Chalmers, Gothenburg university MH algorithm, Chap:6.3 The metropolis hastings requires three objects, the distribution of

More information

F denotes cumulative density. denotes probability density function; (.)

F denotes cumulative density. denotes probability density function; (.) BAYESIAN ANALYSIS: FOREWORDS Notation. System means the real thing and a model is an assumed mathematical form for the system.. he probability model class M contains the set of the all admissible models

More information

Bayes: All uncertainty is described using probability.

Bayes: All uncertainty is described using probability. Bayes: All uncertainty is described using probability. Let w be the data and θ be any unknown quantities. Likelihood. The probability model π(w θ) has θ fixed and w varying. The likelihood L(θ; w) is π(w

More information

Sampling Algorithms for Probabilistic Graphical models

Sampling Algorithms for Probabilistic Graphical models Sampling Algorithms for Probabilistic Graphical models Vibhav Gogate University of Washington References: Chapter 12 of Probabilistic Graphical models: Principles and Techniques by Daphne Koller and Nir

More information

Bayesian Linear Models

Bayesian Linear Models Bayesian Linear Models Sudipto Banerjee 1 and Andrew O. Finley 2 1 Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota, U.S.A. 2 Department of Forestry & Department

More information

Modeling and Interpolation of Non-Gaussian Spatial Data: A Comparative Study

Modeling and Interpolation of Non-Gaussian Spatial Data: A Comparative Study Modeling and Interpolation of Non-Gaussian Spatial Data: A Comparative Study Gunter Spöck, Hannes Kazianka, Jürgen Pilz Department of Statistics, University of Klagenfurt, Austria hannes.kazianka@uni-klu.ac.at

More information

Stat 516, Homework 1

Stat 516, Homework 1 Stat 516, Homework 1 Due date: October 7 1. Consider an urn with n distinct balls numbered 1,..., n. We sample balls from the urn with replacement. Let N be the number of draws until we encounter a ball

More information

Lecture 6: Markov Chain Monte Carlo

Lecture 6: Markov Chain Monte Carlo Lecture 6: Markov Chain Monte Carlo D. Jason Koskinen koskinen@nbi.ku.dk Photo by Howard Jackman University of Copenhagen Advanced Methods in Applied Statistics Feb - Apr 2016 Niels Bohr Institute 2 Outline

More information

A Bayesian perspective on GMM and IV

A Bayesian perspective on GMM and IV A Bayesian perspective on GMM and IV Christopher A. Sims Princeton University sims@princeton.edu November 26, 2013 What is a Bayesian perspective? A Bayesian perspective on scientific reporting views all

More information

Metropolis-Hastings Algorithm

Metropolis-Hastings Algorithm Strength of the Gibbs sampler Metropolis-Hastings Algorithm Easy algorithm to think about. Exploits the factorization properties of the joint probability distribution. No difficult choices to be made to

More information

Stat 535 C - Statistical Computing & Monte Carlo Methods. Lecture 15-7th March Arnaud Doucet

Stat 535 C - Statistical Computing & Monte Carlo Methods. Lecture 15-7th March Arnaud Doucet Stat 535 C - Statistical Computing & Monte Carlo Methods Lecture 15-7th March 2006 Arnaud Doucet Email: arnaud@cs.ubc.ca 1 1.1 Outline Mixture and composition of kernels. Hybrid algorithms. Examples Overview

More information

University of Toronto Department of Statistics

University of Toronto Department of Statistics Norm Comparisons for Data Augmentation by James P. Hobert Department of Statistics University of Florida and Jeffrey S. Rosenthal Department of Statistics University of Toronto Technical Report No. 0704

More information

The Mixture Approach for Simulating New Families of Bivariate Distributions with Specified Correlations

The Mixture Approach for Simulating New Families of Bivariate Distributions with Specified Correlations The Mixture Approach for Simulating New Families of Bivariate Distributions with Specified Correlations John R. Michael, Significance, Inc. and William R. Schucany, Southern Methodist University The mixture

More information

Physics 403. Segev BenZvi. Numerical Methods, Maximum Likelihood, and Least Squares. Department of Physics and Astronomy University of Rochester

Physics 403. Segev BenZvi. Numerical Methods, Maximum Likelihood, and Least Squares. Department of Physics and Astronomy University of Rochester Physics 403 Numerical Methods, Maximum Likelihood, and Least Squares Segev BenZvi Department of Physics and Astronomy University of Rochester Table of Contents 1 Review of Last Class Quadratic Approximation

More information

Overall Objective Priors

Overall Objective Priors Overall Objective Priors Jim Berger, Jose Bernardo and Dongchu Sun Duke University, University of Valencia and University of Missouri Recent advances in statistical inference: theory and case studies University

More information

LECTURE 15 Markov chain Monte Carlo

LECTURE 15 Markov chain Monte Carlo LECTURE 15 Markov chain Monte Carlo There are many settings when posterior computation is a challenge in that one does not have a closed form expression for the posterior distribution. Markov chain Monte

More information

Pattern Recognition and Machine Learning. Bishop Chapter 11: Sampling Methods

Pattern Recognition and Machine Learning. Bishop Chapter 11: Sampling Methods Pattern Recognition and Machine Learning Chapter 11: Sampling Methods Elise Arnaud Jakob Verbeek May 22, 2008 Outline of the chapter 11.1 Basic Sampling Algorithms 11.2 Markov Chain Monte Carlo 11.3 Gibbs

More information

Kernel adaptive Sequential Monte Carlo

Kernel adaptive Sequential Monte Carlo Kernel adaptive Sequential Monte Carlo Ingmar Schuster (Paris Dauphine) Heiko Strathmann (University College London) Brooks Paige (Oxford) Dino Sejdinovic (Oxford) December 7, 2015 1 / 36 Section 1 Outline

More information

Bayesian data analysis in practice: Three simple examples

Bayesian data analysis in practice: Three simple examples Bayesian data analysis in practice: Three simple examples Martin P. Tingley Introduction These notes cover three examples I presented at Climatea on 5 October 0. Matlab code is available by request to

More information

MCMC: Markov Chain Monte Carlo

MCMC: Markov Chain Monte Carlo I529: Machine Learning in Bioinformatics (Spring 2013) MCMC: Markov Chain Monte Carlo Yuzhen Ye School of Informatics and Computing Indiana University, Bloomington Spring 2013 Contents Review of Markov

More information

Chapter 7. Markov chain background. 7.1 Finite state space

Chapter 7. Markov chain background. 7.1 Finite state space Chapter 7 Markov chain background A stochastic process is a family of random variables {X t } indexed by a varaible t which we will think of as time. Time can be discrete or continuous. We will only consider

More information

Practical Bayesian Optimization of Machine Learning. Learning Algorithms

Practical Bayesian Optimization of Machine Learning. Learning Algorithms Practical Bayesian Optimization of Machine Learning Algorithms CS 294 University of California, Berkeley Tuesday, April 20, 2016 Motivation Machine Learning Algorithms (MLA s) have hyperparameters that

More information

CTDL-Positive Stable Frailty Model

CTDL-Positive Stable Frailty Model CTDL-Positive Stable Frailty Model M. Blagojevic 1, G. MacKenzie 2 1 Department of Mathematics, Keele University, Staffordshire ST5 5BG,UK and 2 Centre of Biostatistics, University of Limerick, Ireland

More information

1 Using standard errors when comparing estimated values

1 Using standard errors when comparing estimated values MLPR Assignment Part : General comments Below are comments on some recurring issues I came across when marking the second part of the assignment, which I thought it would help to explain in more detail

More information

A short introduction to INLA and R-INLA

A short introduction to INLA and R-INLA A short introduction to INLA and R-INLA Integrated Nested Laplace Approximation Thomas Opitz, BioSP, INRA Avignon Workshop: Theory and practice of INLA and SPDE November 7, 2018 2/21 Plan for this talk

More information

Geometric Ergodicity of a Random-Walk Metorpolis Algorithm via Variable Transformation and Computer Aided Reasoning in Statistics

Geometric Ergodicity of a Random-Walk Metorpolis Algorithm via Variable Transformation and Computer Aided Reasoning in Statistics Geometric Ergodicity of a Random-Walk Metorpolis Algorithm via Variable Transformation and Computer Aided Reasoning in Statistics a dissertation submitted to the faculty of the graduate school of the university

More information

STA205 Probability: Week 8 R. Wolpert

STA205 Probability: Week 8 R. Wolpert INFINITE COIN-TOSS AND THE LAWS OF LARGE NUMBERS The traditional interpretation of the probability of an event E is its asymptotic frequency: the limit as n of the fraction of n repeated, similar, and

More information

Likelihood-free MCMC

Likelihood-free MCMC Bayesian inference for stable distributions with applications in finance Department of Mathematics University of Leicester September 2, 2011 MSc project final presentation Outline 1 2 3 4 Classical Monte

More information

SAMPLING ALGORITHMS. In general. Inference in Bayesian models

SAMPLING ALGORITHMS. In general. Inference in Bayesian models SAMPLING ALGORITHMS SAMPLING ALGORITHMS In general A sampling algorithm is an algorithm that outputs samples x 1, x 2,... from a given distribution P or density p. Sampling algorithms can for example be

More information