Output analysis for Markov chain Monte Carlo simulations


James M. Flegal and Galin L. Jones (October 12, 2009)

1.1 Introduction

In obtaining simulation-based results, it is desirable to use estimation procedures which include a measure of the reliability of the procedure. Our goal is to introduce some of the tools useful for analyzing the output of a Markov chain Monte Carlo (MCMC) simulation which allow us to have confidence in the claims put forward. The following are the main issues we will address: (1) basic graphical assessment of MCMC output; (2) using the output for estimation; (3) assessing the Monte Carlo error of estimation; and (4) terminating the simulation.

Let π be a probability function with support X ⊆ R^d (most often π will be a probability mass function or a probability density function) about which we wish to make an inference. Typically, this inference will be based on some feature θ_π of π.

For example, if g : X → R, a common goal is the calculation of

\[ \theta_\pi = E_\pi g = \int_{\mathsf{X}} g(x)\, \pi(dx). \tag{1.1.1} \]

(The notation in (1.1.1) avoids having separate formulas for the discrete case, where it denotes \(\sum_{x \in \mathsf{X}} g(x)\pi(x)\), and the continuous case, where it denotes \(\int_{\mathsf{X}} g(x)\pi(x)\,dx\).) We might also want quantiles associated with the marginal distributions of π. Indeed, we will typically want several features of π, such as various mean and variance parameters along with quantiles and so on. As a result, θ_π will generally be a p-dimensional vector. Unfortunately, due to the use of complex stochastic models, we very often cannot calculate any of the components of θ_π analytically or even numerically. Thus we are faced with a classical statistical problem: given a probability distribution π, we want to estimate several fixed and unknown features of it.

A brief aside is in order. Users of MCMC are often doing so in order to make an inferential statement about π. That is, the estimation of θ_π is simply an intermediate step towards the ultimate inferential goal. It is an important point that the quality of the estimate of θ_π can have a profound impact on the quality of the resulting inference about π. For now we consider only the problem of estimating an expectation as in (1.1.1) and defer considering what to do when θ_π is not an expectation.

The basic MCMC method entails constructing a Markov chain X = {X_0, X_1, X_2, ...} on X having π as its invariant distribution. (A description of the required regularity conditions is given in Section 1.7.) Then we simulate X for a finite number of steps, say n, and use the observed sample path to estimate E_π g with the sample average

\[ \bar{g}_n := \frac{1}{n} \sum_{i=0}^{n-1} g(X_i). \tag{1.1.2} \]

The use of this estimator is justified through the Markov chain strong law of large numbers (SLLN), a special case of the Birkhoff Ergodic Theorem (Fristedt and Gray, 1997, p. 558): if E_π|g| < ∞, then ḡ_n → E_π g almost surely as n → ∞. From a practical point of view, this means that we can obtain an arbitrarily accurate estimate of E_π g by increasing n, i.e. running the simulation longer.

How long should we run the simulation? There are several methods commonly used to terminate an MCMC simulation. For example, one could stop when the sequence ḡ_n, as a function of n, appears to have stabilized, or n could be chosen based on the variability in ḡ_n observed in some necessarily short preliminary runs.

Convergence diagnostics (Cowles and Carlin, 1996) are also often used for this purpose; for example, see Gelman and Rubin (1992). However, these approaches do not provide a measure of the quality of our estimate ḡ_n. The alternative advocated here (see also Flegal et al., 2008) is that this decision should be based on the size of the Monte Carlo error, ḡ_n − E_π g. Unfortunately, we can never know the Monte Carlo error except in toy problems where we already know E_π g. However, we can obtain its approximate sampling distribution if a Markov chain central limit theorem (CLT) holds:

\[ \sqrt{n}\,(\bar{g}_n - E_\pi g) \xrightarrow{d} N(0, \sigma_g^2) \tag{1.1.3} \]

as n → ∞, where σ²_g ∈ (0, ∞). Due to the correlation present in a Markov chain, σ²_g is generally complicated and, in particular, σ²_g ≠ var_π g. For now, suppose we have an estimator such that σ̂²_n → σ²_g almost surely as n → ∞ (see Section 1.4 for some suitable techniques). This allows one to construct an asymptotically valid confidence interval for E_π g with half-width

\[ t \, \frac{\hat{\sigma}_n}{\sqrt{n}} \]

where t is an appropriate quantile. This interval allows us to describe the confidence in the reported estimate, and one can continue the simulation until the interval estimate is sufficiently narrow. This is a fixed-width approach to terminating the simulation and will be illustrated in Section 1.6. But, more importantly, reporting the Monte Carlo standard error allows an independent reader to assess the Monte Carlo sampling error and independently judge the results being presented.

The rest of this chapter is organized as follows. In Section 1.2 we consider some basic techniques for graphical assessment of MCMC output, then Section 1.3 contains a discussion of various point estimators of θ_π. Next, in Section 1.4 we introduce techniques for constructing interval estimators of θ_π. Section 1.5 considers estimating marginal densities associated with π, and Section 1.6 further considers stopping rules for MCMC simulations. Finally, in Section 1.7 we give conditions for ensuring (1.1.3). The computations presented in our examples were carried out using the R language. See the authors' web pages for the relevant Sweave file from which the reader can reproduce all of our calculations.

1.2 Initial Examination of Output

As a first step it pays to examine the empirical finite-sample properties of the Markov chain being simulated. A few simple graphical methods provide some of the most common ways of initially assessing the output of an MCMC simulation. These include scatterplots, histograms, time series plots, autocorrelation plots and plots of sample means. As an illustration, consider the following toy example, which will be used several times throughout this chapter; a short R sketch of these diagnostics follows the example.

Example 1 (Normal AR(1) Markov chains). Consider the normal AR(1) time series defined by

\[ X_{n+1} = \rho X_n + \epsilon_n \tag{1.2.1} \]

where the ε_n are i.i.d. N(0, 1) and |ρ| < 1. This Markov chain has invariant distribution N(0, 1/(1 − ρ²)). As a simple numerical example, consider simulating the chain (1.2.1) in order to estimate the mean of the invariant distribution, i.e. 0. While this is a toy example, it is quite useful because ρ plays a crucial role in the behavior of this chain.

Figure 1.1 contains plots based on single sample path realizations starting at X_1 = 1 with ρ = 0.5 and ρ = 0.95. In each figure the top plot is a time series plot of the observed sample path. The mean of the target distribution is 0 and the horizontal lines are 2 standard deviations above and below the mean. Comparing the time series plots, it is apparent that while we may be getting a representative sample from the invariant distribution, when ρ = 0.95 the sample is highly correlated. This is also apparent from the autocorrelation (middle) plots in both figures. When ρ = 0.5 the autocorrelation is negligible after about lag 4, but when ρ = 0.95 there is substantial autocorrelation until about lag 30. The impact of this correlation is apparent in the bottom two plots, which show the running estimates of the mean versus iterations in the chain. The true value is displayed as the horizontal line at 0. Clearly, the more correlated sequence requires many more iterations to achieve a reasonable estimate. In fact, it is apparent from these plots that the simulation with ρ = 0.5 has been run too long, while in the ρ = 0.95 case it likely hasn't been run long enough.

In the example, the plots were informative because we were able to draw the horizontal lines depicting the true values. In practically relevant MCMC settings where these would be unavailable, it is hard to know when to trust these plots. Nevertheless, they can still be useful since a Markov chain that is mixing well would tend to have time series and autocorrelation plots that look like Figure 1.1a, while time series and autocorrelation plots like those in Figure 1.1b (or worse) would indicate a potentially problematic simulation.
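The following is a minimal R sketch of the simulation and diagnostics in Example 1. The seed and plot settings are illustrative choices and are not meant to reproduce Figure 1.1 exactly.

```r
# Simulate the AR(1) chain of Example 1 and draw the three diagnostic plots:
# time series, autocorrelation, and running average.
set.seed(1)                 # illustrative seed
n    <- 2000
rho  <- 0.95
x    <- numeric(n)
x[1] <- 1                   # starting value X_1 = 1
for (i in 2:n) x[i] <- rho * x[i - 1] + rnorm(1)

par(mfrow = c(3, 1))
plot.ts(x, main = "Time Series")
abline(h = c(-2, 2) / sqrt(1 - rho^2), lty = 2)   # mean +/- 2 stationary sd's
acf(x, main = "Autocorrelation")
plot(cumsum(x) / seq_len(n), type = "l", ylab = "Running Average")
abline(h = 0)               # true mean of the invariant distribution
```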

[Figure 1.1: Initial output examination for Example 1. Panels: (a) ρ = 0.5, (b) ρ = 0.95. Each panel shows a time series plot, an autocorrelation plot, and a running average plot.]

Also, plots of current parameter estimates (with no reference to a standard error) versus iteration are not as helpful, since they provide little information as to the quality of estimation.

In simulating a d-dimensional Markov chain to simultaneously estimate the p-dimensional vector θ_π of features of π, p can be either greater than or less than d. When either d or p is large, the standard graphical techniques are obviously problematic. That is, even if for each component the time series plot indicates good mixing, we should not necessarily infer that the chain has converged to its joint target distribution. In addition, if either d or p is large, it will be impractical to look at plots of each component.

1.3 Point Estimates of θ_π

In this section, we consider two specific cases of θ_π: estimating an expectation E_π g and estimating a quantile of one of the univariate marginal distributions from π.

1.3.1 Expectations

Suppose θ_π = E_π g and assume E_π|g| < ∞. Recall from Section 1.1 that there is a SLLN, and hence it is natural to use the sample average ḡ_n to estimate θ_π. However, there are two popular alternatives: the first is obtained through the use of burn-in, i.e. discarding an initial portion of the simulation, and the second is obtained through averaging conditional expectations.

Since the simulation is not typically started with a draw from π (otherwise we could just do ordinary Monte Carlo), it follows that marginally each X_i is not distributed according to π and hence E_π g ≠ E[g(X_i)]. Thus ḡ_n is a biased estimator of E_π g. In our setting, we have that X_n converges in distribution to π as n → ∞, so in order to diminish the effect of this initialization bias an alternative estimator is often employed:

\[ \bar{g}_{n,B} = \frac{1}{n} \sum_{i=B}^{n+B-1} g(X_i) \]

where B denotes the burn-in, or amount of simulation discarded. By keeping only the draws obtained after B − 1 we are effectively choosing a new initial distribution that is closer to π. The SLLN still applies to ḡ_{n,B}, since if it holds for any initial distribution it holds for every initial distribution. Of course, one possible (maybe even likely) consequence of using burn-in is that var(ḡ_{n,B}) ≥ var(ḡ_{n+B,0}); that is, the bias decreases but the variance increases for the same total simulation effort. Obviously, this means that using burn-in could result in an estimator having larger mean-squared error than one without burn-in. Moreover, without some theoretical work (Jones and Hobert, 2001; Latuszynski and Niemiro, 2009; Rosenthal, 1995; Rudolf, 2009) it is not clear what value of B should be chosen. Popular approaches to determining B are usually based on so-called convergence diagnostics (Cowles and Carlin, 1996). Unfortunately, there simply is no guarantee that any of these diagnostics will detect a problem with the simulation and, in fact, using them can introduce bias (Cowles et al., 1999).

To motivate the estimator obtained by averaging conditional expectations, suppose the target is a function of two variables, π(x, y), and we are interested in estimating the expectation of a function of only one of the variables, say g(x). Let m_Y(y) and f_{X|Y}(x|y) be the marginal and conditional densities, respectively. Notice

\[ E_\pi g = \int \int g(x) \pi(x, y) \, dy \, dx = \int \left[ \int g(x) f_{X|Y}(x \mid y) \, dx \right] m_Y(y) \, dy \]

so that, by letting h(y) = ∫ g(x) f_{X|Y}(x|y) dx, we can appeal to the SLLN again to see that, as n → ∞,

\[ \bar{h}_n = \frac{1}{n} \sum_{i=0}^{n-1} h(Y_i) = \frac{1}{n} \sum_{i=0}^{n-1} \int g(x) f_{X|Y}(x \mid Y_i) \, dx \to E_\pi g \quad \text{almost surely.} \]

This estimator is conceptually the same as ḡ_n in the sense that both are sample averages over the chain and the Markov chain SLLN applies to both. The estimator h̄_n is often called the Rao-Blackwellised estimator of E_π g (an unfortunate name, since it has nothing to do with the Rao-Blackwell Theorem).

An obvious question is which of ḡ_n, the sample average, or h̄_n, the Rao-Blackwellised estimator, is better? It is obvious that h̄_n will sometimes be impossible to use if f_{X|Y}(x|y) is not available in closed form or if the integral is intractable; hence h̄_n will not be as generally practical as ḡ_n. However, there are settings, such as in data augmentation (Hobert, 2009), where h̄_n is theoretically and empirically superior to ḡ_n; see Liu et al. (1994) and Geyer (1995) for theoretical investigation of these two estimators. We will content ourselves with an empirical comparison in two simple examples.

Example 2. This example is also considered in Hobert (2009). Suppose our goal is to estimate the first two moments of a Student's t distribution with 4 degrees of freedom, having density

\[ m(x) = \frac{3}{8} \left( 1 + \frac{x^2}{4} \right)^{-5/2}. \]

Obviously, there is nothing about this that requires MCMC, since we can easily calculate that E_m X = 0 and E_m X² = 2. Nevertheless, we will use a data augmentation algorithm based on the joint density

\[ \pi(x, y) = \frac{4}{\sqrt{2\pi}} \, y^{3/2} e^{-y(2 + x^2/2)} \]

so that the full conditionals are X | Y = y ~ N(0, y⁻¹) and Y | X = x ~ Gamma(5/2, 2 + x²/2). Consider the Gibbs sampler that updates X then Y, so that a one-step transition looks like (x′, y′) → (x, y′) → (x, y). Suppose we have obtained n observations {(x_i, y_i); i = 0, ..., n − 1} from running the Gibbs sampler. Then the standard sample average estimates of E_m X and E_m X² are

\[ \frac{1}{n} \sum_{i=0}^{n-1} x_i \quad \text{and} \quad \frac{1}{n} \sum_{i=0}^{n-1} x_i^2, \]

respectively.
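A minimal R sketch of this Gibbs sampler and the standard sample-average estimates follows; the seed, run length and starting value are illustrative choices, not prescriptions from the chapter.

```r
# Gibbs sampler for Example 2: update X | Y = y ~ N(0, 1/y), then
# Y | X = x ~ Gamma(5/2, rate = 2 + x^2/2).
set.seed(1)
n <- 2000
x <- y <- numeric(n)
y[1] <- 1                                    # arbitrary starting value
x[1] <- rnorm(1, 0, sqrt(1 / y[1]))
for (i in 2:n) {
  x[i] <- rnorm(1, 0, sqrt(1 / y[i - 1]))                  # draw X given Y
  y[i] <- rgamma(1, shape = 5 / 2, rate = 2 + x[i]^2 / 2)  # draw Y given X
}
mean(x)     # sample-average estimate of E[X]   (truth: 0)
mean(x^2)   # sample-average estimate of E[X^2] (truth: 2)
```

As discussed next, the Rao-Blackwellised alternative simply replaces `mean(x^2)` with `mean(1/y)`, the average of the conditional second moments.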

[Figure 1.2: Estimators of the first two moments for Example 2. Panels: (a) θ_π = E_π X, (b) θ_π = E_π X². The horizontal line denotes the truth, the solid curves are the running sample averages, and the dotted curves are the running RB sample averages.]

Further, the Rao-Blackwellised estimates are easily computed. Since X | Y = y ~ N(0, y⁻¹), the Rao-Blackwellised estimate of E_m X is exactly 0! On the other hand, the RB estimate of E_m X² is

\[ \frac{1}{n} \sum_{i=0}^{n-1} y_i^{-1}. \]

As an illustration of these estimators, we simulated 2000 iterations of the Gibbs sampler and plotted the running values of the estimators in Figure 1.2. It is obvious from these plots that the RB averages are much less variable than the standard sample averages, especially for smaller MCMC sample sizes.

It is not the case that RB estimators are always better than sample means. Whether they are better depends on the expectation being estimated as well as the properties of the MCMC sampler. In fact, Liu et al. (1994) and Geyer (1995) give an example where the RB estimator is provably worse than the sample average, which is illustrated in the following example.

[Figure 1.3: Plots based on a single run of the Gibbs sampler in Example 3 with ρ = 0.6. Panels: (a) θ_π = E_π X, (b) θ_π = E_π(X − 2Y). The horizontal line denotes the truth, the solid curves are the running sample averages, and the dotted curves are the running RB sample averages.]

Example 3. As in Liu et al. (1994) and Geyer (1995), suppose

\[ \begin{pmatrix} X \\ Y \end{pmatrix} \sim N \left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} \right); \]

then X | Y ~ N(ρY, 1 − ρ²) and Y | X ~ N(ρX, 1 − ρ²). Consider a Gibbs sampler that updates X followed by an update of Y, so that a one-step transition looks like (x′, y′) → (x, y′) → (x, y). For estimating θ_π = E_π(X − bY), Geyer (1995) shows that when b = 2 and ρ = 0.6 the sample average has smaller asymptotic variance than the RB estimator. However, for estimating E_π X, the RB estimator will be asymptotically superior. The gain in either case is not always apparent in small Monte Carlo sample sizes when looking at running plots of the estimates; see Figure 1.3.

RB estimators are more general than our presentation suggests. Let h be any function and set

\[ f(x) = E_\pi [\, g(X) \mid h(X) = h(x) \,] \]

so that E_π g = E_π f. Thus, by the Markov chain SLLN, with probability 1 as n → ∞,

\[ \frac{1}{n} \sum_{i=1}^{n} f(X_i) = \frac{1}{n} \sum_{i=1}^{n} E_\pi [\, g(X) \mid h(X) = h(X_i) \,] \to E_\pi g. \]

As long as the conditional distribution of X given h(X) is tractable, RB estimators are available.

1.3.2 Quantiles

It is common to report estimated quantiles in addition to estimated expectations. Actually, what is nearly always reported is not a multivariate quantile but rather a quantile of the univariate marginal distributions associated with π. This is the only setting we consider. Let F be a marginal distribution function associated with π. Then the qth quantile of F is

\[ \phi_q = F^{-1}(q) = \inf \{ x : F(x) \ge q \}, \quad 0 < q < 1. \tag{1.3.1} \]

There are many potential estimates of φ_q (Hyndman and Fan, 1996), but we consider only the inverse of the empirical distribution function from the observed sequence. First define {X_(1), ..., X_(n)} as the order statistics of {X_0, ..., X_{n−1}}; then the estimator of φ_q is given by

\[ \hat{\phi}_{q,n} = X_{(j+1)} \quad \text{where} \quad \frac{j}{n} \le q < \frac{j+1}{n}. \tag{1.3.2} \]

Example 4 (Normal AR(1) Markov chains). Consider again the time series defined at (1.2.1). Our goal in this example is to illustrate estimating the first and third quartiles, denoted Q_1 and Q_3. In this simple example, the true values of Q_1 and Q_3 are ±Φ⁻¹(0.75)/√(1 − ρ²), where Φ is the cumulative distribution function of a standard normal distribution. Using the same realization of the chain as in Example 1, Figures 1.4a and 1.4b show plots of the running quartiles versus iteration when ρ = 0.5 and ρ = 0.95, respectively. It is immediately apparent that obtaining good estimates is more difficult when ρ = 0.95, and hence the simulation should continue. Also, without the horizontal lines, these plots would not be as useful. Recall that a similar conclusion was reached for estimating the mean of the target distribution.
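A short R sketch of the estimator (1.3.2) is given below. The function name is ours, and `x` is assumed to hold an observed sample path such as the AR(1) realization sketched earlier.

```r
# Invert the empirical distribution function: the estimator (1.3.2).
mcmc.quantile <- function(x, q) {
  n <- length(x)
  j <- floor(n * q)     # j satisfies j/n <= q < (j+1)/n
  sort(x)[j + 1]        # the (j+1)st order statistic
}
mcmc.quantile(x, 0.25)  # estimate of Q_1
mcmc.quantile(x, 0.75)  # estimate of Q_3
```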

[Figure 1.4: Plots for the AR(1) model of running Q_1 and Q_3. Panels: (a) ρ = 0.5, (b) ρ = 0.95. The horizontal lines are the true quartiles.]

1.4 Interval Estimates of θ_π

In our examples so far we have known the truth, enabling us to draw the horizontal lines on the plots which allow us to gauge the quality of estimation. Obviously, the true parameter value is unknown in practically relevant settings, and hence the size of the Monte Carlo error is unknown. For this reason, when reporting a point estimate of θ_π one should also include a Monte Carlo standard error (MCSE). In this section we address how to construct asymptotically valid standard errors and interval estimates of θ_π. The method presented will be based on the existence of a Markov chain CLT as at (1.1.3). We defer discussion of the regularity conditions required to ensure a CLT until Section 1.7 and, for now, assume a limiting distribution for the Monte Carlo error holds; see also Geyer (2009) and Hobert (2009).

1.4.1 Expectations

Suppose θ_π = E_π g, which will be estimated with ḡ_n = ḡ_{n,0}, that is, with no burn-in. However, we note that, as with the SLLN, if the CLT holds for any initial distribution then it holds for every initial distribution. Thus the use of burn-in does not affect the existence of a CLT,

but it may affect the quality of the asymptotic approximation. If σ̂²_n is an estimate of σ²_g, then one can form a confidence interval for E_π g with half-width

\[ t \, \frac{\hat{\sigma}_n}{\sqrt{n}} \tag{1.4.1} \]

where t is an appropriate quantile. Thus the difficulty in finding interval estimates is estimating σ²_g, which requires specialized techniques to account for correlation in the underlying chain. We restrict attention to strongly consistent estimators of σ²_g so that the asymptotic validity of (1.4.1) is ensured. The methods satisfying this requirement include various batch means methods, spectral variance methods and regenerative simulation. Other alternatives include the initial sequence methods of Geyer (1992); however, the theoretical properties of Geyer's estimators are not well understood. We will focus on batch means as it is the most generally applicable method; for more on spectral methods see Flegal and Jones (2009a), while Hobert et al. (2002) and Mykland et al. (1995) study regenerative simulation. There are many variants of batch means; here we focus on non-overlapping batch means (BM) and overlapping batch means (OLBM), a simple extension with a smaller asymptotic variance (Flegal and Jones, 2009a; Meketon and Schmeiser, 1984).

Batch Means

For BM the simulation output is broken into a_n equally sized batches of length b_n. If an MCMC sampler is run for a total of n = a_n b_n iterations, then define

\[ \bar{Y}_k := \frac{1}{b_n} \sum_{i=0}^{b_n - 1} g(X_{k b_n + i}) \quad \text{for } k = 0, \ldots, a_n - 1. \]

Further define the batch means estimator of σ²_g as

\[ \hat{\sigma}^2_{BM} = \frac{b_n}{a_n - 1} \sum_{k=0}^{a_n - 1} (\bar{Y}_k - \bar{g}_n)^2. \]

The estimator σ̂²_BM is not generally a consistent estimator of σ²_g (Glynn and Whitt, 1991). Specifically, if the number of batches is fixed, then no matter how large the batch size, the estimator σ̂²_BM does not converge in probability to σ²_g as the Monte Carlo sample size increases. However, roughly speaking, Jones et al. (2006) show that if the Markov chain mixes quickly and both a_n and b_n are allowed to increase as the overall length of the simulation does, then σ̂²_BM is a strongly consistent estimator of σ²_g. It is often convenient to take b_n = ⌊n^ν⌋ for some 0 < ν < 1 and a_n = ⌊n/b_n⌋.

OLBM is a generalization of BM in which we consider the n − b_n + 1 overlapping batches of

length b_n. Indexing the batches by j = 0, ..., n − b_n, define the batch means as

\[ \bar{Y}_j(b_n) := \frac{1}{b_n} \sum_{i=0}^{b_n - 1} g(X_{j+i}), \]

so that σ²_g is estimated with

\[ \hat{\sigma}^2_{OLBM} = \frac{n \, b_n}{(n - b_n)(n - b_n + 1)} \sum_{j=0}^{n - b_n} (\bar{Y}_j(b_n) - \bar{g}_n)^2. \]

Flegal and Jones (2009a) establish strong consistency of the OLBM estimator under similar conditions to those required for BM, e.g. by setting b_n = ⌊n^ν⌋ for some 0 < ν < 3/4. When constructing the interval (1.4.1), t is a quantile from a Student's t distribution with n − b_n degrees of freedom. In both BM and OLBM, the ν values yielding strongly consistent estimators depend on the number of finite moments and the mixing conditions of the Markov chain. These conditions are similar to those required for a Markov chain CLT; see Flegal and Jones (2009a), Jones (2004) and Jones et al. (2006).

Example 5 (Normal AR(1) Markov chains). Recall the AR(1) model defined at (1.2.1). Using the same realization of the chain as in Example 1 (2000 iterations with ρ ∈ {0.5, 0.95} starting from X_1 = 1), we consider estimating the mean of the invariant distribution, i.e. 0. Utilizing OLBM with b_n = ⌊n^{1/2}⌋, we calculated an MCSE and the resulting 95% confidence interval; see Figure 1.5. The top figure is a plot of the running MCSE versus iteration for a single realization, where the dashed line corresponds to ρ = 0.5 and the dotted line corresponds to ρ = 0.95. When ρ is larger it takes longer for the MCSE to stabilize and begin decreasing. After 2000 iterations for ρ = 0.5 we obtained an interval of (−0.119, 0.051), while for ρ = 0.95 the interval is (−1.196, 0.182). Notice that both confidence intervals contain 0 (the true value).

Many of our plots are based on simulating only 2000 iterations. We chose this value strictly for illustration purposes. An obvious question is: has the simulation been run long enough? Another way of asking the same question is: are the interval estimates sufficiently narrow after 2000 iterations? In the ρ = 0.5 case the answer is maybe, while in the ρ = 0.95 case it is clearly no. Consider the interval estimate of the mean with ρ = 0.5, that is, −0.034 ± 0.085 = (−0.119, 0.051). This indicates that we can trust only the leading 0 in the point estimate −0.034, but not the sign of the estimate. If the user is satisfied with this level of precision, then 2000 iterations is sufficient. On the other hand, when ρ = 0.95 our interval estimate is (−1.196, 0.182), which indicates that we cannot trust any of the figures (or the sign) reported in the point estimate.
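A minimal R implementation of the OLBM estimator, under the choice b_n = ⌊√n⌋ used in Example 5, might look as follows; the function name is ours, and the argument is the observed sequence g(X_0), ..., g(X_{n−1}) (for the AR(1) example, g is the identity).

```r
# OLBM estimate of sigma^2_g, plus the MCSE and 95% half-width of (1.4.1).
olbm <- function(g.x) {
  n    <- length(g.x)
  b    <- floor(sqrt(n))                 # batch size b_n
  a    <- n - b + 1                      # number of overlapping batches
  mu   <- mean(g.x)
  cs   <- c(0, cumsum(g.x))              # cumulative sums for fast batch means
  ybar <- (cs[(b + 1):(n + 1)] - cs[1:a]) / b
  sig2 <- n * b * sum((ybar - mu)^2) / ((n - b) * a)
  mcse <- sqrt(sig2 / n)
  list(est = mu, mcse = mcse,
       half.width = qt(0.975, df = n - b) * mcse)
}
olbm(x)   # e.g. the AR(1) chain: estimate of the mean with its MCSE
```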

[Figure 1.5: Plots for the AR(1) model of the running MCSE (top; ρ = 0.5 dashed, ρ = 0.95 dotted) and running averages with confidence intervals calculated via OLBM.]

Recall the RB estimators of Section 1.3.1. It is straightforward to use OLBM to calculate the MCSE for these estimators, since the conditional expectations being averaged define the sequence of batch means Ȳ_j(b_n).

Example 6. Recall Example 3, where we found it difficult to distinguish between using the sample average estimator and the RB estimators. The difference becomes clearer with interval estimators. Figure 1.6 considers estimating E_π X; here the RB estimators have a smaller asymptotic variance than the sample average estimators, as clearly reflected by the narrower interval estimates. On the other hand, Figure 1.7 shows the RB estimators have a larger asymptotic variance than the sample average estimators, also reflected by the wider interval estimates. The practical effect of this is that, depending on which expectation is being estimated, using RB estimators can lead to either more or less simulation effort in order to achieve a sufficiently narrow interval estimate. Notice the additional information available by including the resulting interval estimates.

[Figure 1.6: Both plots are based on a single run of the Gibbs sampler in Example 3 with ρ = 0.6 for estimating E_π X. Panels: (a) sample average estimates, (b) RB estimates. The horizontal line denotes the truth, the solid curves are the running estimates, and the dotted curves are the running pointwise 95% interval estimates.]

[Figure 1.7: Both plots are based on a single run of the Gibbs sampler in Example 3 with ρ = 0.6 for estimating E_π(X − 2Y). Panels: (a) sample average estimates, (b) RB estimates. The horizontal line denotes the truth, the solid curves are the running estimates, and the dotted curves are the running pointwise 95% interval estimates.]

Parallel Chains

To this point, our recipe seems straightforward: given a sampler, pick a starting value and run the simulation for a long time, using the SLLN and the CLT to produce a point estimate and a measure of its uncertainty. A variation that stimulated debate in the early statistics literature on MCMC (see e.g. Gelman and Rubin, 1992; Geyer, 1992), even earlier in the operations research literature (see e.g. Bratley et al., 1987; Kelton and Law, 1984), and continues today (Alexopoulos et al., 2006; Alexopoulos and Goldsman, 2004), instead relies on simulating r independent replications of this recipe. If we suppose that each replication is started from a different starting point but is run for the same length and using the same burn-in, then this procedure would produce r independent estimates of E_π g, namely ḡ_{n,B,1}, ḡ_{n,B,2}, ..., ḡ_{n,B,r}. The resulting estimate of E_π g would then be the grand mean, and our estimate of its performance, i.e. σ²_g, would be the usual sample variance of the ḡ_{n,B,i}.

This approach has some intuitive appeal in that estimation avoids some of the serial correlation inherent in MCMC. Moreover, clearly there is value in trying a variety of initial values for the simulation. It has also been argued that by choosing the r starting points in a widely dispersed manner there is a greater chance of encountering modes that one long run may have missed. Thus, for example, some argue that using independent replications results in superior inferential validity (Gelman and Rubin, 1992, p. 503). However, there is no agreement on this issue; indeed, Bratley et al. (1987, p. 80) are skeptical about "[the] rationale of some proponents" of independent replications.

Notice that the total simulation effort using independent replications is r(n + B). To obtain good estimates of σ²_g will require r to be large, which will require n + B to be small for a given computational effort. If we use the same value of B as we would when using one long run, this means that each ḡ_{n,B,i} will be based on a comparatively small number n of observations. Moreover, since each run will be short, there is a reasonable chance that a given replication will not move far from its starting value. Alexopoulos and Goldsman (2004) have shown that this can result in much poorer estimates (in terms of mean squared error) of E_π g than a single long run. On the other hand, if we can find widely dispersed starting values that are from a distribution very close to π, then independent replications may indeed be superior (Alexopoulos and Goldsman, 2004). This should not be surprising, since draws directly from π are clearly desirable.

1.4.2 Functions of Moments

Suppose now that we are interested in estimating φ(E_π g), where φ is some function. If φ is continuous, then φ(ḡ_n) → φ(E_π g) with probability 1 as n → ∞, which makes estimation of φ(E_π g) straightforward. Again we also want a valid Monte Carlo error, which can be obtained via the delta method (Ferguson, 1996; van der Vaart, 1998). Assuming (1.1.3), the delta method says that if φ is continuously differentiable in a neighborhood of E_π g and

φ′(E_π g) ≠ 0, then as n → ∞,

\[ \sqrt{n}\,(\phi(\bar{g}_n) - \phi(E_\pi g)) \xrightarrow{d} N(0, [\phi'(E_\pi g)]^2 \sigma_g^2). \]

Thus, if our estimator of σ²_g, say σ̂²_n, is strongly consistent, then [φ′(ḡ_n)]² σ̂²_n is strongly consistent for [φ′(E_π g)]² σ²_g.

Example 7. Consider estimating (E_π X)² with (X̄_n)². Let φ(x) = x² and assume φ′(E_π X) ≠ 0 and (1.1.3). Then

\[ \sqrt{n} \left( (\bar{X}_n)^2 - (E_\pi X)^2 \right) \xrightarrow{d} N(0, 4 (E_\pi X)^2 \sigma_x^2), \]

and we can use OLBM to consistently estimate σ²_x with σ̂²_n, which means that 4(X̄_n)² σ̂²_n is a strongly consistent estimator of 4(E_π X)² σ²_x.

From this example we see that the univariate delta method makes it straightforward to handle powers of moments. The multivariate delta method allows us to handle more complicated functions of moments. Let T_n denote a d-dimensional sequence of random vectors and θ a d-dimensional parameter. If, as n → ∞,

\[ \sqrt{n}\,(T_n - \theta) \xrightarrow{d} N(\mu, \Sigma) \]

and φ is continuously differentiable in a neighborhood of θ with φ′(θ) ≠ 0, then as n → ∞,

\[ \sqrt{n}\,(\phi(T_n) - \phi(\theta)) \xrightarrow{d} N(\phi'(\theta)\mu, \, \phi'(\theta) \Sigma \phi'(\theta)^T). \]

Example 8. Consider estimating var_π g = E_π g² − (E_π g)² with, setting h = g²,

\[ \hat{v}_n := \frac{1}{n} \sum_{i=1}^{n} h(X_i) - \left( \frac{1}{n} \sum_{i=1}^{n} g(X_i) \right)^2. \]

Assume

\[ \sqrt{n} \begin{pmatrix} \bar{g}_n - E_\pi g \\ \bar{h}_n - E_\pi g^2 \end{pmatrix} \xrightarrow{d} N \left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \sigma_g^2 & c \\ c & \sigma_h^2 \end{pmatrix} \right) \]

where c = E_π g³ − E_π g E_π g². Let φ(x, y) = y − x² and assume E_π g ≠ 0; then as n → ∞,

\[ \sqrt{n}\,(\hat{v}_n - \mathrm{var}_\pi g) \xrightarrow{d} N(0, \, 4 (E_\pi g)(\sigma_g^2 E_\pi g - E_\pi g^3 + E_\pi g E_\pi g^2) + \sigma_h^2). \]

Since it is easy to use OLBM to construct strongly consistent estimators of σ²_g and σ²_h, the following formula gives a strongly consistent estimator of the variance in the asymptotic normal distribution for v̂_n:

\[ 4 \bar{g}_n \left( \hat{\sigma}^2_{g,n} \bar{g}_n - \bar{j}_n + \bar{g}_n \bar{h}_n \right) + \hat{\sigma}^2_{h,n}, \quad \text{where } j = g^3. \]
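As a small R sketch of the univariate case (Example 7), one can combine the olbm() function sketched above with the delta method; again, `x` is assumed to hold the observed sample path.

```r
# Delta-method MCSE for estimating (E[X])^2 by (xbar_n)^2 (Example 7).
out  <- olbm(x)                      # xbar_n and its OLBM-based MCSE
xbar <- out$est
mcse <- sqrt(4 * xbar^2) * out$mcse  # |phi'(xbar)| * sigma.hat_x / sqrt(n)
c(estimate = xbar^2, mcse = mcse)
```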

1.4.3 Quantiles

Suppose our goal is to estimate φ_q with φ̂_{q,n}, defined at (1.3.1) and (1.3.2), respectively. We now turn our attention to constructing an interval estimate of φ_q. It is tempting to think that bootstrap methods would be appropriate for this problem. Indeed, there has been a substantial amount of research into bootstrap methods for stationary time series which would be appropriate for MCMC settings (see e.g. Bertail and Clémençon, 2006; Bühlmann, 2002; Datta and McCormick, 1993; Politis, 2003). Unfortunately, our experience has been that these methods are extremely computationally intensive as compared to the method presented below. Moreover, the coverage probabilities of the intervals produced with the bootstrap methods are typically much worse; see Flegal (2008).

As above, we assume the existence of an asymptotic distribution for the Monte Carlo error; that is, there is a constant γ²_q ∈ (0, ∞) such that, as n → ∞,

\[ \sqrt{n}\,(\hat{\phi}_{q,n} - \phi_q) \xrightarrow{d} N(0, \gamma_q^2). \tag{1.4.2} \]

Flegal and Jones (2009b) give conditions under which (1.4.2) obtains, and these are also briefly introduced in Section 1.7. Just as when we were estimating an expectation, we find ourselves in the position of estimating a complicated constant γ²_q. We focus on the use of subsampling methods in this context. However, before we begin, note that this use of the term "subsampling" is quite different from the way it is often used in the context of MCMC, in that we are not deleting any observations of the Markov chain.

Subsampling methods

This section provides a brief overview of subsampling methods in the context of MCMC and illustrates their use for estimating the MCSE of φ̂_{q,n}. While this section focuses on quantiles, subsampling methods apply much more generally; the interested reader is encouraged to consult Politis et al. (1999). In the case of sample averages, as at (1.1.2), the subsampling variance estimator is equivalent to the OLBM estimator of σ²_g.

The main idea of subsampling is similar to OLBM in that we take overlapping batches (or subsamples) of size b_n from the first n observations of the chain {X_0, X_1, ..., X_{n−1}}. There are n − b_n + 1 such subsamples, where typically b_n ≪ n. Let {X_i, ..., X_{i+b_n−1}} be the ith subsample with corresponding order statistics {X*_(1), ..., X*_(b_n)}. Then define the quantile based on the ith subsample as

\[ \phi_i^* = X^*_{(j+1)} \quad \text{where} \quad \frac{j}{b_n} \le q < \frac{j+1}{b_n} \]

for i = 0, ..., n − b_n. The subsampling estimator of γ²_q is then

\[ \hat{\gamma}_q^2 = \frac{b_n}{n - b_n + 1} \sum_{i=0}^{n - b_n} (\phi_i^* - \bar{\phi}^*)^2, \tag{1.4.3} \]

where

\[ \bar{\phi}^* = \frac{1}{n - b_n + 1} \sum_{i=0}^{n - b_n} \phi_i^*. \]

Politis et al. (1999) give conditions ensuring this estimator is strongly consistent; however, their conditions could be difficult to check in practice. Implementation of subsampling requires choosing b_n so that, as n → ∞, we have b_n → ∞ and b_n/n → 0. A natural choice is b_n = ⌊√n⌋.

Example 9 (Normal AR(1) Markov chains). Using the AR(1) model defined at (1.2.1), we again consider estimating the first and third quartiles, denoted Q_1 and Q_3, which equal ±Φ⁻¹(0.75)/√(1 − ρ²). Figure 1.8 shows the output from the same realization of the chain used previously in Example 4, but this time the plot includes an interval estimate of the quartiles. Figure 1.8a shows a plot of the running quartiles versus iteration when ρ = 0.5. In addition, the dashed lines show the 95% confidence interval bounds at each iteration. These intervals were produced with subsampling using b_n = ⌊√n⌋. At around 200 iterations, the MCSE (and hence the interval estimates) seem to stabilize and begin to decrease. At 2000 iterations, the estimates for Q_1 and Q_3 are −0.817 ± 0.105 and 0.778 ± 0.099, respectively. Figure 1.8b shows the same plot when ρ = 0.95. At 2000 iterations, the estimates for Q_1 and Q_3 are −2.74 and 2.35, respectively, with considerably wider interval estimates.
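Before assessing those intervals, here is a direct (and deliberately unoptimized) R sketch of the subsampling estimator (1.4.3), reusing the mcmc.quantile() function sketched in Section 1.3.2 and taking b_n = ⌊√n⌋ as in Example 9; the function name is ours.

```r
# Subsampling estimate of gamma^2_q and a 95% interval for the q-th quantile.
sub.quantile.ci <- function(x, q) {
  n   <- length(x)
  b   <- floor(sqrt(n))                  # subsample size b_n
  idx <- 0:(n - b)                       # subsample starting indices
  phi <- sapply(idx, function(i) mcmc.quantile(x[(i + 1):(i + b)], q))
  gam2 <- b * mean((phi - mean(phi))^2)  # the estimator (1.4.3)
  est  <- mcmc.quantile(x, q)
  c(est   = est,
    lower = est - qnorm(0.975) * sqrt(gam2 / n),
    upper = est + qnorm(0.975) * sqrt(gam2 / n))
}
sub.quantile.ci(x, 0.25)   # Q_1 with a 95% interval
sub.quantile.ci(x, 0.75)   # Q_3 with a 95% interval
```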

[Figure 1.8: Plots for the AR(1) model of running Q_1 and Q_3 estimates with 95% pointwise confidence intervals calculated via subsampling. Panels: (a) ρ = 0.5, (b) ρ = 0.95.]

Are the intervals sufficiently narrow after 2000 iterations? In both cases (ρ = 0.5 and ρ = 0.95) the answer is apparently no. Consider the narrowest interval, which is the estimate for Q_3 with ρ = 0.5, that is, 0.778 ± 0.099 = (0.679, 0.877), which indicates that we can at most trust the sign and the leading 0 of the estimate, 0.778. Note that in a real problem we would not have the horizontal line in the plot depicting the truth.

1.5 Estimating Marginals

A common inferential goal is the production of a plot of a marginal density associated with π. In this section we cover two methods for doing this. We begin with a simple graphical method, then introduce a clever method due to Wei and Tanner (1990) that is reminiscent of the Rao-Blackwellisation methods of Section 1.3.

A histogram approximates the true marginal by the Markov chain SLLN and, because it is so easy to implement, provides the most common method for doing so. Another common approach is to report a nonparametric density estimate, or smoothed histogram. It is conceptually straightforward to construct pointwise interval estimates for the smoothed histogram

using subsampling techniques. However, outside of toy examples, the computational cost can be substantial; see Politis et al. (1999).

Example 10. Suppose Y_i | µ, θ ~ N(µ, θ) independently for i = 1, ..., m, where m ≥ 3, with prior ν(µ, θ) ∝ θ^{−1/2}. The target is the posterior density

\[ \pi(\mu, \theta \mid y) \propto \theta^{-(m+1)/2} e^{-\frac{m}{2\theta}\left(s^2 + (\bar{y} - \mu)^2\right)} \]

where s² is the usual biased sample variance. It is easy to see that µ | θ, y ~ N(ȳ, θ/m) and that θ | µ, y ~ IG((m − 1)/2, m[s² + (ȳ − µ)²]/2). We will consider the Gibbs sampler that updates µ then θ, so that a one-step transition is given by (µ′, θ′) → (µ, θ′) → (µ, θ), and use this sampler to estimate the marginal densities of µ and θ.

Now suppose m = 11, ȳ = 1 and s² = 4. We simulated 2000 realizations of the Gibbs sampler starting from (µ_1, θ_1) = (1, 1). The marginal density plots were created using the default settings of the density function in R and are shown in Figure 1.9, while an estimated bivariate density plot (created using the R functions kde2d and persp) is given in Figure 1.10. It is obvious from these figures that the posterior is simple, so it is no surprise that the Gibbs sampler has been shown to converge in just a few iterations (Jones and Hobert, 2001). Obviously, in more realistic examples where the dimension of the posterior is three or more, it will be difficult to use these techniques to visualize the entire posterior.

A clever technique for estimating a marginal is based on the same idea as RB estimators (Wei and Tanner, 1990). To keep the notation simple, suppose the target is a function of only two variables, π(x, y), and let m_X and m_Y be the associated marginals. Then

\[ m_X(x) = \int \pi(x, y) \, dy = \int f_{X|Y}(x \mid y) \, m_Y(y) \, dy = E_{m_Y} \left[ f_{X|Y}(x \mid Y) \right], \]

suggesting that, by the Markov chain SLLN, we can get a functional approximation to m_X since, as n → ∞,

\[ \frac{1}{n} \sum_{i=0}^{n-1} f_{X|Y}(x \mid Y_i) \to m_X(x). \tag{1.5.1} \]

Of course, just as with RB estimators, this will only be useful when the conditionals are tractable. Note also that it is straightforward to use BM or OLBM to get pointwise confidence intervals for the resulting curve; that is, for each x we can calculate an MCSE of the sample average in (1.5.1).

[Figure 1.9: Histograms for estimating the marginal densities of Example 10. Panels: (a) marginal density of µ, (b) marginal density of θ.]

[Figure 1.10: Estimated bivariate posterior density of Example 10 (axes mu and theta).]

Example 11. Recall the setting of Example 10. We will focus on estimation of the marginal posterior density of µ | y, i.e. π(µ | y). Note that

\[ \pi(\mu \mid y) = \int \pi(\mu \mid \theta, y) \, \pi(\theta \mid y) \, d\theta, \]

so that by the Markov chain SLLN we can estimate π(µ | y) with

\[ \frac{1}{n} \sum_{i=1}^{n} \pi(\mu \mid \theta_i, y), \]

which is straightforward to evaluate since µ | θ_i, y ~ N(ȳ, θ_i/m). Specifically, the resulting marginal estimate is a mixture of N(ȳ, θ_i/m) densities, one for each observed θ_i. Using the same realization of the chain from Example 10, we estimated π(µ | y) using this method. Figure 1.11 shows the results alongside our previous estimates. One can also calculate pointwise 95% confidence intervals using OLBM, which here results in a very small Monte Carlo error (the intervals are therefore not included in the plot). Notice the estimate based on (1.5.1) is a bit smoother than either the histogram or the smoothed histogram estimate, but is otherwise quite similar.

[Figure 1.11: Estimates of the marginal density of µ. The three estimates are based on a histogram, a smoothed marginal density (solid line), and the method of Wei and Tanner (1990) (dashed line).]
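A minimal R sketch of Examples 10 and 11 together follows: run the Gibbs sampler and average the normal conditionals as in (1.5.1). The seed and plotting grid are illustrative choices.

```r
# Gibbs sampler for Example 10 (m = 11, ybar = 1, s^2 = 4), followed by the
# Wei and Tanner estimate of pi(mu | y) from Example 11.
set.seed(1)
m <- 11; ybar <- 1; s2 <- 4
n <- 2000
mu <- theta <- numeric(n)
mu[1] <- 1; theta[1] <- 1                      # starting values (1, 1)
for (i in 2:n) {
  mu[i]    <- rnorm(1, ybar, sqrt(theta[i - 1] / m))    # mu | theta, y
  theta[i] <- 1 / rgamma(1, shape = (m - 1) / 2,        # theta | mu, y is
                         rate = m * (s2 + (ybar - mu[i])^2) / 2)  # inverse gamma
}
grid <- seq(-1, 3, length.out = 200)
dens <- sapply(grid, function(u) mean(dnorm(u, ybar, sqrt(theta / m))))
plot(grid, dens, type = "l", xlab = "mu", ylab = "Density")  # estimate (1.5.1)
```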

1.6 Terminating the Simulation

The most common approach to stopping an MCMC experiment is to simulate a fixed, predetermined path length. That is, the simulation is terminated with a fixed-time rule. Indeed, there are settings where, due to the nearly prohibitive difficulty of the simulation, this is the only reasonable way to plan an experiment; for example, see Caffo et al. (2009). However, this is not the case for the typical MCMC experiment.

As in Section 1.4, inclusion of an MCSE with the resulting point estimate allows one to gauge whether the desired precision has been achieved. After an initial run, if the MCSE is not sufficiently small, then one might increase the simulation length and recalculate an MCSE. That is, the simulation is terminated the first time the MCSE is sufficiently small. Equivalently, the simulation is terminated the first time that the half-width of a confidence interval for θ_π is sufficiently small, and so this is a fixed-width rule. Notice that this is a sequential procedure that will result in a random total simulation effort. There is a substantial amount of research on fixed-width procedures in the MCMC context when θ_π is an expectation; see Flegal et al. (2008), Glynn and Whitt (1992) and Jones et al. (2006) and the references therein, but none that we are aware of when θ_π is not an expectation.

Let σ̂²_n be a consistent estimator of σ²_g from (1.1.3). Formally, given a desired half-width ε, the simulation terminates the first time the following inequality is satisfied:

\[ t \, \frac{\hat{\sigma}_n}{\sqrt{n}} + p(n) \le \epsilon \tag{1.6.1} \]

where t is an appropriate Student's t quantile and p(n) is a positive function such that p(n) = o(n^{−1/2}) as n → ∞. Letting n* be the desired minimum simulation effort, a reasonable default is p(n) = ε[I(n ≤ n*) + n^{−1} I(n > n*)]. Glynn and Whitt (1992) established that the interval at (1.6.1) is asymptotically valid in the sense that the desired coverage probability is obtained as ε → 0. The use of these intervals in MCMC settings has been extensively investigated by Flegal et al. (2008), Flegal and Jones (2009a) and Jones et al. (2006), who show that they work well.

Example 12 (Normal AR(1) Markov chains). Consider the normal AR(1) time series defined at (1.2.1). In Example 5 we simulated 2000 iterations from the chain with ρ = 0.95 starting from X_1 = 1 and found that the resulting 95% confidence interval for the mean of the invariant distribution was centered at 0.27 with a half-width well above 0.1. Suppose we wanted to continue our simulation until we were 95% confident our estimate was within 0.1 of the true value, a fixed-width procedure using OLBM with b_n = ⌊√n⌋. Thus (1.6.1) becomes

\[ t_{n - b_n} \frac{\hat{\sigma}_n}{\sqrt{n}} + 0.1 \left[ I(n \le 1000) + n^{-1} I(n > 1000) \right] \le 0.1. \]

It would be computationally expensive to check this criterion after each iteration, so instead we added 1000 iterations before recalculating the half-width each time. In this case, the simulation terminated after 1.42 × 10⁵ iterations, resulting in an interval estimate of 8.59 × 10⁻³ with a half-width just below the requested 0.1. Notice that this simple example required an exceptionally large simulation effort relative to what is often done in practice in much more complicated settings.
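A sketch of this fixed-width procedure in R, reusing the olbm() function sketched in Section 1.4.1, might look as follows; as in the example, the criterion is checked every 1000 iterations with n* = 1000 and ε = 0.1, and the seed is an illustrative choice.

```r
# Fixed-width stopping rule (1.6.1) for the AR(1) chain with rho = 0.95.
set.seed(1)
eps <- 0.1
rho <- 0.95
x   <- 1                                     # X_1 = 1
repeat {
  last <- x[length(x)]
  new  <- numeric(1000)
  for (i in 1:1000) { last <- rho * last + rnorm(1); new[i] <- last }
  x  <- c(x, new)                            # extend the run by 1000 draws
  n  <- length(x)
  hw <- olbm(x)$half.width                   # t-quantile half-width via OLBM
  pn <- eps * if (n <= 1000) 1 else 1 / n    # p(n) from the example
  if (hw + pn <= eps) break
}
c(iterations = n, estimate = mean(x), half.width = hw)
```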

1.7 Markov Chain Central Limit Theorems

Throughout we have assumed the existence of a Markov chain central limit theorem, as at (1.1.3) and (1.4.2). In this section we provide a brief discussion of the conditions required for these claims; much more detail can be found in Chan and Geyer (1994), Jones (2004), Meyn and Tweedie (1993), Roberts and Rosenthal (2004) and Tierney (1994).

Without stating it, we have assumed throughout this chapter that the Markov chain X is Harris ergodic. To fully explain this condition would require a fair amount of Markov chain theory, and hence we will content ourselves with an informal discussion of this assumption. An interested reader should consult Meyn and Tweedie (1993), Nummelin (1984) or Roberts and Rosenthal (2004). Informally, Harris ergodicity ensures that, for every starting value, if A ⊆ X is such that ∫_A π(dx) > 0, then A will be visited infinitely often as n → ∞. Thus, at least asymptotically, X will thoroughly explore the support X of the target π.

This assumption guarantees a nice form of convergence, which requires a bit of notation to describe. Let the conditional distribution of X_n given X_1 = x be denoted P^n(x, ·); that is, P^n(x, A) = Pr(X_n ∈ A | X_1 = x). Then Harris ergodicity implies that for every starting point x ∈ X,

\[ \| P^n(x, \cdot) - \pi(\cdot) \| \to 0 \quad \text{as } n \to \infty \tag{1.7.1} \]

where ‖·‖ is the total variation norm. (This is stronger than convergence in distribution.)

Hence, as n increases, the marginal distribution of X_n will be closer to that of π. Harris ergodicity alone is not sufficient for the Markov chain SLLN or a CLT. However, if X is Harris ergodic and E_π|g| < ∞, then the SLLN holds: ḡ_n → E_π g with probability 1 as n → ∞. For the CLT we require more; specifically, we will need to know the rate of the convergence in (1.7.1) to say something about the existence of a CLT. Let M(x) be a nonnegative function on X and γ(n) a nonnegative function on Z⁺ such that

\[ \| P^n(x, \cdot) - \pi(\cdot) \| \le M(x) \gamma(n). \tag{1.7.2} \]

When γ(n) = tⁿ for some t < 1, X is geometrically ergodic, while uniform ergodicity means X is geometrically ergodic and M is bounded. These are key sufficient conditions for the existence of an asymptotic distribution of the Monte Carlo error. In particular, a CLT as at (1.1.3) holds if X is geometrically ergodic and E_π|g|^{2+δ} < ∞ for some δ > 0, or if X is uniformly ergodic and E_π g² < ∞. Moreover, Flegal and Jones (2009a) and Jones et al. (2006) establish that when X is geometrically ergodic the batch means and overlapping batch means methods produce strongly consistent estimators of σ²_g from (1.1.3).

Geometric ergodicity is also an important sufficient condition for establishing (1.4.2) when estimating a quantile. Suppose F has derivative f, with f(φ_q) positive and finite. Further assume f′(x) exists and |f′(x)| ≤ B < ∞ for x in some neighborhood of φ_q. If X is geometrically ergodic, then (1.4.2) obtains (Flegal and Jones, 2009b).

In general, establishing (1.7.2) directly is apparently daunting. However, if X is finite (no matter how large), then a Harris ergodic Markov chain is uniformly ergodic. When X is a general space, there are constructive methods which can be used to establish geometric or uniform ergodicity; see Hobert (2009) and Jones and Hobert (2001) for accessible introductions. These techniques have been applied to many MCMC samplers. For example, Metropolis-Hastings samplers with state-independent proposals can be uniformly ergodic (Tierney, 1994), but such situations are uncommon in realistic settings. Standard random walk Metropolis-Hastings chains on R^d, d ≥ 1, cannot be uniformly ergodic but may still be geometrically ergodic; see Mengersen and Tweedie (1996). Other research on establishing convergence rates of Markov chains used in MCMC is given by Geyer (1999), Jarner and Hansen (2000), Meyn and Tweedie (1994) and Neath and Jones (2009), who considered Metropolis-Hastings algorithms, and Hobert and Geyer (1998), Hobert et al. (2002), Johnson and Jones (2008), Jones and Hobert (2004), Marchev and Hobert (2004), Roberts and Polson (1994), Roberts and Rosenthal (1999), Rosenthal (1995, 1996), Roy and Hobert (2007), Tan

and Hobert (2009) and Tierney (1994), who examined Gibbs samplers.

Acknowledgments

This work was supported by an NSF (DMS) grant.


Bibliography

Alexopoulos, C., Andradóttir, S., Argon, N. T., and Goldsman, D. (2006). Replicated batch means variance estimators in the presence of an initial transient. ACM Transactions on Modeling and Computer Simulation, 16.

Alexopoulos, C. and Goldsman, D. (2004). To batch or not to batch? ACM Transactions on Modeling and Computer Simulation, 14(1).

Bertail, P. and Clémençon, S. (2006). Regenerative block-bootstrap for Markov chains. Bernoulli, 12.

Bratley, P., Fox, B. L., and Schrage, L. E. (1987). A Guide to Simulation. Springer-Verlag, New York.

Bühlmann, P. (2002). Bootstraps for time series. Statistical Science, 17.

Caffo, B. S., Peng, R., Dominici, F., Louis, T., and Zeger, S. (2009). Parallel Bayesian MCMC imputation for multiple distributed lag models: A case study in environmental epidemiology (to appear). In Handbook of Markov Chain Monte Carlo. CRC, London.

Chan, K. S. and Geyer, C. J. (1994). Comment on "Markov chains for exploring posterior distributions". The Annals of Statistics, 22.

Cowles, M. K. and Carlin, B. P. (1996). Markov chain Monte Carlo convergence diagnostics: A comparative review. Journal of the American Statistical Association, 91.

Cowles, M. K., Roberts, G. O., and Rosenthal, J. S. (1999). Possible biases induced by MCMC convergence diagnostics. Journal of Statistical Computing and Simulation, 64.

Datta, S. and McCormick, W. P. (1993). Regeneration-based bootstrap for Markov chains. The Canadian Journal of Statistics / La Revue Canadienne de Statistique, 21.

Ferguson, T. S. (1996). A Course in Large Sample Theory. Chapman & Hall / CRC, Boca Raton.

Flegal, J. M. (2008). Monte Carlo standard errors for Markov chain Monte Carlo. PhD thesis, University of Minnesota, School of Statistics.

Flegal, J. M., Haran, M., and Jones, G. L. (2008). Markov chain Monte Carlo: Can we trust the third significant figure? Statistical Science, 23.

Flegal, J. M. and Jones, G. L. (2009a). Batch means and spectral variance estimators in Markov chain Monte Carlo. The Annals of Statistics (to appear).

Flegal, J. M. and Jones, G. L. (2009b). Quantile estimation via Markov chain Monte Carlo. Technical report, University of California, Riverside, Department of Statistics.

Fristedt, B. and Gray, L. F. (1997). A Modern Approach to Probability Theory. Birkhäuser Verlag.

Gelman, A. and Rubin, D. B. (1992). Inference from iterative simulation using multiple sequences. Statistical Science, 7.

Geyer, C. J. (1992). Practical Markov chain Monte Carlo (with discussion). Statistical Science, 7.

Geyer, C. J. (1995). Conditioning in Markov chain Monte Carlo. Journal of Computational and Graphical Statistics, 4.

Geyer, C. J. (1999). Likelihood inference for spatial point processes. In Barndorff-Nielsen, O. E., Kendall, W. S., and van Lieshout, M. N. M., editors, Stochastic Geometry: Likelihood and Computation. Chapman & Hall/CRC, Boca Raton.

Geyer, C. J. (2009). Introduction to Markov chain Monte Carlo (to appear). In Handbook of Markov Chain Monte Carlo. CRC, London.

Glynn, P. W. and Whitt, W. (1991). Estimating the asymptotic variance with batch means. Operations Research Letters, 10.

Glynn, P. W. and Whitt, W. (1992). The asymptotic validity of sequential stopping rules for stochastic simulations. The Annals of Applied Probability, 2.

Hobert, J. P. (2009). The data augmentation algorithm: Theory and methodology (to appear). In Handbook of Markov Chain Monte Carlo. CRC, London.


More information

Markov Chain Monte Carlo A Contribution to the Encyclopedia of Environmetrics

Markov Chain Monte Carlo A Contribution to the Encyclopedia of Environmetrics Markov Chain Monte Carlo A Contribution to the Encyclopedia of Environmetrics Galin L. Jones and James P. Hobert Department of Statistics University of Florida May 2000 1 Introduction Realistic statistical

More information

Bayesian Inference in GLMs. Frequentists typically base inferences on MLEs, asymptotic confidence

Bayesian Inference in GLMs. Frequentists typically base inferences on MLEs, asymptotic confidence Bayesian Inference in GLMs Frequentists typically base inferences on MLEs, asymptotic confidence limits, and log-likelihood ratio tests Bayesians base inferences on the posterior distribution of the unknowns

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 7 Approximate

More information

Sampling Algorithms for Probabilistic Graphical models

Sampling Algorithms for Probabilistic Graphical models Sampling Algorithms for Probabilistic Graphical models Vibhav Gogate University of Washington References: Chapter 12 of Probabilistic Graphical models: Principles and Techniques by Daphne Koller and Nir

More information

A Search and Jump Algorithm for Markov Chain Monte Carlo Sampling. Christopher Jennison. Adriana Ibrahim. Seminar at University of Kuwait

A Search and Jump Algorithm for Markov Chain Monte Carlo Sampling. Christopher Jennison. Adriana Ibrahim. Seminar at University of Kuwait A Search and Jump Algorithm for Markov Chain Monte Carlo Sampling Christopher Jennison Department of Mathematical Sciences, University of Bath, UK http://people.bath.ac.uk/mascj Adriana Ibrahim Institute

More information

Package mcmcse. July 4, 2017

Package mcmcse. July 4, 2017 Version 1.3-2 Date 2017-07-03 Title Monte Carlo Standard Errors for MCMC Package mcmcse July 4, 2017 Author James M. Flegal , John Hughes , Dootika Vats ,

More information

Slice Sampling with Adaptive Multivariate Steps: The Shrinking-Rank Method

Slice Sampling with Adaptive Multivariate Steps: The Shrinking-Rank Method Slice Sampling with Adaptive Multivariate Steps: The Shrinking-Rank Method Madeleine B. Thompson Radford M. Neal Abstract The shrinking rank method is a variation of slice sampling that is efficient at

More information

University of California San Diego and Stanford University and

University of California San Diego and Stanford University and First International Workshop on Functional and Operatorial Statistics. Toulouse, June 19-21, 2008 K-sample Subsampling Dimitris N. olitis andjoseph.romano University of California San Diego and Stanford

More information

Geometric Ergodicity of a Random-Walk Metorpolis Algorithm via Variable Transformation and Computer Aided Reasoning in Statistics

Geometric Ergodicity of a Random-Walk Metorpolis Algorithm via Variable Transformation and Computer Aided Reasoning in Statistics Geometric Ergodicity of a Random-Walk Metorpolis Algorithm via Variable Transformation and Computer Aided Reasoning in Statistics a dissertation submitted to the faculty of the graduate school of the university

More information

STA205 Probability: Week 8 R. Wolpert

STA205 Probability: Week 8 R. Wolpert INFINITE COIN-TOSS AND THE LAWS OF LARGE NUMBERS The traditional interpretation of the probability of an event E is its asymptotic frequency: the limit as n of the fraction of n repeated, similar, and

More information

Stat 451 Lecture Notes Markov Chain Monte Carlo. Ryan Martin UIC

Stat 451 Lecture Notes Markov Chain Monte Carlo. Ryan Martin UIC Stat 451 Lecture Notes 07 12 Markov Chain Monte Carlo Ryan Martin UIC www.math.uic.edu/~rgmartin 1 Based on Chapters 8 9 in Givens & Hoeting, Chapters 25 27 in Lange 2 Updated: April 4, 2016 1 / 42 Outline

More information

Bayesian nonparametric estimation of finite population quantities in absence of design information on nonsampled units

Bayesian nonparametric estimation of finite population quantities in absence of design information on nonsampled units Bayesian nonparametric estimation of finite population quantities in absence of design information on nonsampled units Sahar Z Zangeneh Robert W. Keener Roderick J.A. Little Abstract In Probability proportional

More information

Lecture 6: Markov Chain Monte Carlo

Lecture 6: Markov Chain Monte Carlo Lecture 6: Markov Chain Monte Carlo D. Jason Koskinen koskinen@nbi.ku.dk Photo by Howard Jackman University of Copenhagen Advanced Methods in Applied Statistics Feb - Apr 2016 Niels Bohr Institute 2 Outline

More information

On Markov chain Monte Carlo methods for tall data

On Markov chain Monte Carlo methods for tall data On Markov chain Monte Carlo methods for tall data Remi Bardenet, Arnaud Doucet, Chris Holmes Paper review by: David Carlson October 29, 2016 Introduction Many data sets in machine learning and computational

More information

Sampling from complex probability distributions

Sampling from complex probability distributions Sampling from complex probability distributions Louis J. M. Aslett (louis.aslett@durham.ac.uk) Department of Mathematical Sciences Durham University UTOPIAE Training School II 4 July 2017 1/37 Motivation

More information

Analysis of Polya-Gamma Gibbs sampler for Bayesian logistic analysis of variance

Analysis of Polya-Gamma Gibbs sampler for Bayesian logistic analysis of variance Electronic Journal of Statistics Vol. (207) 326 337 ISSN: 935-7524 DOI: 0.24/7-EJS227 Analysis of Polya-Gamma Gibbs sampler for Bayesian logistic analysis of variance Hee Min Choi Department of Statistics

More information

The Jackknife-Like Method for Assessing Uncertainty of Point Estimates for Bayesian Estimation in a Finite Gaussian Mixture Model

The Jackknife-Like Method for Assessing Uncertainty of Point Estimates for Bayesian Estimation in a Finite Gaussian Mixture Model Thai Journal of Mathematics : 45 58 Special Issue: Annual Meeting in Mathematics 207 http://thaijmath.in.cmu.ac.th ISSN 686-0209 The Jackknife-Like Method for Assessing Uncertainty of Point Estimates for

More information

Stat 451 Lecture Notes Monte Carlo Integration

Stat 451 Lecture Notes Monte Carlo Integration Stat 451 Lecture Notes 06 12 Monte Carlo Integration Ryan Martin UIC www.math.uic.edu/~rgmartin 1 Based on Chapter 6 in Givens & Hoeting, Chapter 23 in Lange, and Chapters 3 4 in Robert & Casella 2 Updated:

More information

MCMC algorithms for fitting Bayesian models

MCMC algorithms for fitting Bayesian models MCMC algorithms for fitting Bayesian models p. 1/1 MCMC algorithms for fitting Bayesian models Sudipto Banerjee sudiptob@biostat.umn.edu University of Minnesota MCMC algorithms for fitting Bayesian models

More information

Kernel adaptive Sequential Monte Carlo

Kernel adaptive Sequential Monte Carlo Kernel adaptive Sequential Monte Carlo Ingmar Schuster (Paris Dauphine) Heiko Strathmann (University College London) Brooks Paige (Oxford) Dino Sejdinovic (Oxford) December 7, 2015 1 / 36 Section 1 Outline

More information

1. INTRODUCTION Propp and Wilson (1996,1998) described a protocol called \coupling from the past" (CFTP) for exact sampling from a distribution using

1. INTRODUCTION Propp and Wilson (1996,1998) described a protocol called \coupling from the past (CFTP) for exact sampling from a distribution using Ecient Use of Exact Samples by Duncan J. Murdoch* and Jerey S. Rosenthal** Abstract Propp and Wilson (1996,1998) described a protocol called coupling from the past (CFTP) for exact sampling from the steady-state

More information

Partially Collapsed Gibbs Samplers: Theory and Methods

Partially Collapsed Gibbs Samplers: Theory and Methods David A. VAN DYK and Taeyoung PARK Partially Collapsed Gibbs Samplers: Theory and Methods Ever-increasing computational power, along with ever more sophisticated statistical computing techniques, is making

More information

April 20th, Advanced Topics in Machine Learning California Institute of Technology. Markov Chain Monte Carlo for Machine Learning

April 20th, Advanced Topics in Machine Learning California Institute of Technology. Markov Chain Monte Carlo for Machine Learning for for Advanced Topics in California Institute of Technology April 20th, 2017 1 / 50 Table of Contents for 1 2 3 4 2 / 50 History of methods for Enrico Fermi used to calculate incredibly accurate predictions

More information

Lecture 8: The Metropolis-Hastings Algorithm

Lecture 8: The Metropolis-Hastings Algorithm 30.10.2008 What we have seen last time: Gibbs sampler Key idea: Generate a Markov chain by updating the component of (X 1,..., X p ) in turn by drawing from the full conditionals: X (t) j Two drawbacks:

More information

Principles of Bayesian Inference

Principles of Bayesian Inference Principles of Bayesian Inference Sudipto Banerjee University of Minnesota July 20th, 2008 1 Bayesian Principles Classical statistics: model parameters are fixed and unknown. A Bayesian thinks of parameters

More information

VARIABLE TRANSFORMATION TO OBTAIN GEOMETRIC ERGODICITY IN THE RANDOM-WALK METROPOLIS ALGORITHM

VARIABLE TRANSFORMATION TO OBTAIN GEOMETRIC ERGODICITY IN THE RANDOM-WALK METROPOLIS ALGORITHM Submitted to the Annals of Statistics VARIABLE TRANSFORMATION TO OBTAIN GEOMETRIC ERGODICITY IN THE RANDOM-WALK METROPOLIS ALGORITHM By Leif T. Johnson and Charles J. Geyer Google Inc. and University of

More information

Bayes: All uncertainty is described using probability.

Bayes: All uncertainty is described using probability. Bayes: All uncertainty is described using probability. Let w be the data and θ be any unknown quantities. Likelihood. The probability model π(w θ) has θ fixed and w varying. The likelihood L(θ; w) is π(w

More information

Stat 516, Homework 1

Stat 516, Homework 1 Stat 516, Homework 1 Due date: October 7 1. Consider an urn with n distinct balls numbered 1,..., n. We sample balls from the urn with replacement. Let N be the number of draws until we encounter a ball

More information

16 : Approximate Inference: Markov Chain Monte Carlo

16 : Approximate Inference: Markov Chain Monte Carlo 10-708: Probabilistic Graphical Models 10-708, Spring 2017 16 : Approximate Inference: Markov Chain Monte Carlo Lecturer: Eric P. Xing Scribes: Yuan Yang, Chao-Ming Yen 1 Introduction As the target distribution

More information

16 : Markov Chain Monte Carlo (MCMC)

16 : Markov Chain Monte Carlo (MCMC) 10-708: Probabilistic Graphical Models 10-708, Spring 2014 16 : Markov Chain Monte Carlo MCMC Lecturer: Matthew Gormley Scribes: Yining Wang, Renato Negrinho 1 Sampling from low-dimensional distributions

More information

Bayesian Estimation of DSGE Models 1 Chapter 3: A Crash Course in Bayesian Inference

Bayesian Estimation of DSGE Models 1 Chapter 3: A Crash Course in Bayesian Inference 1 The views expressed in this paper are those of the authors and do not necessarily reflect the views of the Federal Reserve Board of Governors or the Federal Reserve System. Bayesian Estimation of DSGE

More information

Simulation. Where real stuff starts

Simulation. Where real stuff starts 1 Simulation Where real stuff starts ToC 1. What is a simulation? 2. Accuracy of output 3. Random Number Generators 4. How to sample 5. Monte Carlo 6. Bootstrap 2 1. What is a simulation? 3 What is a simulation?

More information

arxiv: v1 [stat.co] 21 Mar 2014

arxiv: v1 [stat.co] 21 Mar 2014 A practical sequential stopping rule for high-dimensional MCMC and its application to spatial-temporal Bayesian models arxiv:1403.5536v1 [stat.co] 21 Mar 2014 Lei Gong Department of Statistics University

More information

I. Bayesian econometrics

I. Bayesian econometrics I. Bayesian econometrics A. Introduction B. Bayesian inference in the univariate regression model C. Statistical decision theory D. Large sample results E. Diffuse priors F. Numerical Bayesian methods

More information

Statistics: Learning models from data

Statistics: Learning models from data DS-GA 1002 Lecture notes 5 October 19, 2015 Statistics: Learning models from data Learning models from data that are assumed to be generated probabilistically from a certain unknown distribution is a crucial

More information

Kernel Sequential Monte Carlo

Kernel Sequential Monte Carlo Kernel Sequential Monte Carlo Ingmar Schuster (Paris Dauphine) Heiko Strathmann (University College London) Brooks Paige (Oxford) Dino Sejdinovic (Oxford) * equal contribution April 25, 2016 1 / 37 Section

More information

Lecture 7 and 8: Markov Chain Monte Carlo

Lecture 7 and 8: Markov Chain Monte Carlo Lecture 7 and 8: Markov Chain Monte Carlo 4F13: Machine Learning Zoubin Ghahramani and Carl Edward Rasmussen Department of Engineering University of Cambridge http://mlg.eng.cam.ac.uk/teaching/4f13/ Ghahramani

More information

The Delta Method and Applications

The Delta Method and Applications Chapter 5 The Delta Method and Applications 5.1 Local linear approximations Suppose that a particular random sequence converges in distribution to a particular constant. The idea of using a first-order

More information

Bridge estimation of the probability density at a point. July 2000, revised September 2003

Bridge estimation of the probability density at a point. July 2000, revised September 2003 Bridge estimation of the probability density at a point Antonietta Mira Department of Economics University of Insubria Via Ravasi 2 21100 Varese, Italy antonietta.mira@uninsubria.it Geoff Nicholls Department

More information

eqr094: Hierarchical MCMC for Bayesian System Reliability

eqr094: Hierarchical MCMC for Bayesian System Reliability eqr094: Hierarchical MCMC for Bayesian System Reliability Alyson G. Wilson Statistical Sciences Group, Los Alamos National Laboratory P.O. Box 1663, MS F600 Los Alamos, NM 87545 USA Phone: 505-667-9167

More information

Monte Carlo Integration using Importance Sampling and Gibbs Sampling

Monte Carlo Integration using Importance Sampling and Gibbs Sampling Monte Carlo Integration using Importance Sampling and Gibbs Sampling Wolfgang Hörmann and Josef Leydold Department of Statistics University of Economics and Business Administration Vienna Austria hormannw@boun.edu.tr

More information

Markov Chain Monte Carlo for Linear Mixed Models

Markov Chain Monte Carlo for Linear Mixed Models Markov Chain Monte Carlo for Linear Mixed Models a dissertation submitted to the faculty of the graduate school of the university of minnesota by Felipe Humberto Acosta Archila in partial fulfillment of

More information

Bayesian data analysis in practice: Three simple examples

Bayesian data analysis in practice: Three simple examples Bayesian data analysis in practice: Three simple examples Martin P. Tingley Introduction These notes cover three examples I presented at Climatea on 5 October 0. Matlab code is available by request to

More information

Non-Parametric Bayesian Inference for Controlled Branching Processes Through MCMC Methods

Non-Parametric Bayesian Inference for Controlled Branching Processes Through MCMC Methods Non-Parametric Bayesian Inference for Controlled Branching Processes Through MCMC Methods M. González, R. Martínez, I. del Puerto, A. Ramos Department of Mathematics. University of Extremadura Spanish

More information

Control Variates for Markov Chain Monte Carlo

Control Variates for Markov Chain Monte Carlo Control Variates for Markov Chain Monte Carlo Dellaportas, P., Kontoyiannis, I., and Tsourti, Z. Dept of Statistics, AUEB Dept of Informatics, AUEB 1st Greek Stochastics Meeting Monte Carlo: Probability

More information

Stat 5102 Notes: Markov Chain Monte Carlo and Bayesian Inference

Stat 5102 Notes: Markov Chain Monte Carlo and Bayesian Inference Stat 5102 Notes: Markov Chain Monte Carlo and Bayesian Inference Charles J. Geyer April 6, 2009 1 The Problem This is an example of an application of Bayes rule that requires some form of computer analysis.

More information

Reminder of some Markov Chain properties:

Reminder of some Markov Chain properties: Reminder of some Markov Chain properties: 1. a transition from one state to another occurs probabilistically 2. only state that matters is where you currently are (i.e. given present, future is independent

More information

Markov Chain Monte Carlo, Numerical Integration

Markov Chain Monte Carlo, Numerical Integration Markov Chain Monte Carlo, Numerical Integration (See Statistics) Trevor Gallen Fall 2015 1 / 1 Agenda Numerical Integration: MCMC methods Estimating Markov Chains Estimating latent variables 2 / 1 Numerical

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning MCMC and Non-Parametric Bayes Mark Schmidt University of British Columbia Winter 2016 Admin I went through project proposals: Some of you got a message on Piazza. No news is

More information

Local consistency of Markov chain Monte Carlo methods

Local consistency of Markov chain Monte Carlo methods Ann Inst Stat Math (2014) 66:63 74 DOI 10.1007/s10463-013-0403-3 Local consistency of Markov chain Monte Carlo methods Kengo Kamatani Received: 12 January 2012 / Revised: 8 March 2013 / Published online:

More information

Brief introduction to Markov Chain Monte Carlo

Brief introduction to Markov Chain Monte Carlo Brief introduction to Department of Probability and Mathematical Statistics seminar Stochastic modeling in economics and finance November 7, 2011 Brief introduction to Content 1 and motivation Classical

More information

Fixed-Width Output Analysis for Markov Chain Monte Carlo

Fixed-Width Output Analysis for Markov Chain Monte Carlo Johns Hopkins University, Dept. of Biostatistics Working Papers 2-9-2005 Fixed-Width Output Analysis for Markov Chain Monte Carlo Galin L. Jones School of Statistics, University of Minnesota, galin@stat.umn.edu

More information

Introduction to Bayesian Statistics and Markov Chain Monte Carlo Estimation. EPSY 905: Multivariate Analysis Spring 2016 Lecture #10: April 6, 2016

Introduction to Bayesian Statistics and Markov Chain Monte Carlo Estimation. EPSY 905: Multivariate Analysis Spring 2016 Lecture #10: April 6, 2016 Introduction to Bayesian Statistics and Markov Chain Monte Carlo Estimation EPSY 905: Multivariate Analysis Spring 2016 Lecture #10: April 6, 2016 EPSY 905: Intro to Bayesian and MCMC Today s Class An

More information

Labor-Supply Shifts and Economic Fluctuations. Technical Appendix

Labor-Supply Shifts and Economic Fluctuations. Technical Appendix Labor-Supply Shifts and Economic Fluctuations Technical Appendix Yongsung Chang Department of Economics University of Pennsylvania Frank Schorfheide Department of Economics University of Pennsylvania January

More information

NEW ESTIMATORS FOR PARALLEL STEADY-STATE SIMULATIONS

NEW ESTIMATORS FOR PARALLEL STEADY-STATE SIMULATIONS roceedings of the 2009 Winter Simulation Conference M. D. Rossetti, R. R. Hill, B. Johansson, A. Dunkin, and R. G. Ingalls, eds. NEW ESTIMATORS FOR ARALLEL STEADY-STATE SIMULATIONS Ming-hua Hsieh Department

More information

David B. Dahl. Department of Statistics, and Department of Biostatistics & Medical Informatics University of Wisconsin Madison

David B. Dahl. Department of Statistics, and Department of Biostatistics & Medical Informatics University of Wisconsin Madison AN IMPROVED MERGE-SPLIT SAMPLER FOR CONJUGATE DIRICHLET PROCESS MIXTURE MODELS David B. Dahl dbdahl@stat.wisc.edu Department of Statistics, and Department of Biostatistics & Medical Informatics University

More information

TAKEHOME FINAL EXAM e iω e 2iω e iω e 2iω

TAKEHOME FINAL EXAM e iω e 2iω e iω e 2iω ECO 513 Spring 2015 TAKEHOME FINAL EXAM (1) Suppose the univariate stochastic process y is ARMA(2,2) of the following form: y t = 1.6974y t 1.9604y t 2 + ε t 1.6628ε t 1 +.9216ε t 2, (1) where ε is i.i.d.

More information

Advanced Statistical Modelling

Advanced Statistical Modelling Markov chain Monte Carlo (MCMC) Methods and Their Applications in Bayesian Statistics School of Technology and Business Studies/Statistics Dalarna University Borlänge, Sweden. Feb. 05, 2014. Outlines 1

More information

Exercises Tutorial at ICASSP 2016 Learning Nonlinear Dynamical Models Using Particle Filters

Exercises Tutorial at ICASSP 2016 Learning Nonlinear Dynamical Models Using Particle Filters Exercises Tutorial at ICASSP 216 Learning Nonlinear Dynamical Models Using Particle Filters Andreas Svensson, Johan Dahlin and Thomas B. Schön March 18, 216 Good luck! 1 [Bootstrap particle filter for

More information

CS242: Probabilistic Graphical Models Lecture 7B: Markov Chain Monte Carlo & Gibbs Sampling

CS242: Probabilistic Graphical Models Lecture 7B: Markov Chain Monte Carlo & Gibbs Sampling CS242: Probabilistic Graphical Models Lecture 7B: Markov Chain Monte Carlo & Gibbs Sampling Professor Erik Sudderth Brown University Computer Science October 27, 2016 Some figures and materials courtesy

More information

An introduction to adaptive MCMC

An introduction to adaptive MCMC An introduction to adaptive MCMC Gareth Roberts MIRAW Day on Monte Carlo methods March 2011 Mainly joint work with Jeff Rosenthal. http://www2.warwick.ac.uk/fac/sci/statistics/crism/ Conferences and workshops

More information

Improving the Convergence Properties of the Data Augmentation Algorithm with an Application to Bayesian Mixture Modelling

Improving the Convergence Properties of the Data Augmentation Algorithm with an Application to Bayesian Mixture Modelling Improving the Convergence Properties of the Data Augmentation Algorithm with an Application to Bayesian Mixture Modelling James P. Hobert Department of Statistics University of Florida Vivekananda Roy

More information

Stat 5102 Notes: Markov Chain Monte Carlo and Bayesian Inference

Stat 5102 Notes: Markov Chain Monte Carlo and Bayesian Inference Stat 5102 Notes: Markov Chain Monte Carlo and Bayesian Inference Charles J. Geyer March 30, 2012 1 The Problem This is an example of an application of Bayes rule that requires some form of computer analysis.

More information

Multivariate Output Analysis for Markov Chain Monte Carlo

Multivariate Output Analysis for Markov Chain Monte Carlo Multivariate Output Analysis for Marov Chain Monte Carlo Dootia Vats School of Statistics University of Minnesota vatsx007@umn.edu James M. Flegal Department of Statistics University of California, Riverside

More information

MCMC 2: Lecture 3 SIR models - more topics. Phil O Neill Theo Kypraios School of Mathematical Sciences University of Nottingham

MCMC 2: Lecture 3 SIR models - more topics. Phil O Neill Theo Kypraios School of Mathematical Sciences University of Nottingham MCMC 2: Lecture 3 SIR models - more topics Phil O Neill Theo Kypraios School of Mathematical Sciences University of Nottingham Contents 1. What can be estimated? 2. Reparameterisation 3. Marginalisation

More information

Bayesian Inference: Concept and Practice

Bayesian Inference: Concept and Practice Inference: Concept and Practice fundamentals Johan A. Elkink School of Politics & International Relations University College Dublin 5 June 2017 1 2 3 Bayes theorem In order to estimate the parameters of

More information

STATISTICS/ECONOMETRICS PREP COURSE PROF. MASSIMO GUIDOLIN

STATISTICS/ECONOMETRICS PREP COURSE PROF. MASSIMO GUIDOLIN Massimo Guidolin Massimo.Guidolin@unibocconi.it Dept. of Finance STATISTICS/ECONOMETRICS PREP COURSE PROF. MASSIMO GUIDOLIN SECOND PART, LECTURE 2: MODES OF CONVERGENCE AND POINT ESTIMATION Lecture 2:

More information

A Bayesian perspective on GMM and IV

A Bayesian perspective on GMM and IV A Bayesian perspective on GMM and IV Christopher A. Sims Princeton University sims@princeton.edu November 26, 2013 What is a Bayesian perspective? A Bayesian perspective on scientific reporting views all

More information

A Geometric Interpretation of the Metropolis Hastings Algorithm

A Geometric Interpretation of the Metropolis Hastings Algorithm Statistical Science 2, Vol. 6, No., 5 9 A Geometric Interpretation of the Metropolis Hastings Algorithm Louis J. Billera and Persi Diaconis Abstract. The Metropolis Hastings algorithm transforms a given

More information

The Bayesian Approach to Multi-equation Econometric Model Estimation

The Bayesian Approach to Multi-equation Econometric Model Estimation Journal of Statistical and Econometric Methods, vol.3, no.1, 2014, 85-96 ISSN: 2241-0384 (print), 2241-0376 (online) Scienpress Ltd, 2014 The Bayesian Approach to Multi-equation Econometric Model Estimation

More information

Statistics - Lecture One. Outline. Charlotte Wickham 1. Basic ideas about estimation

Statistics - Lecture One. Outline. Charlotte Wickham  1. Basic ideas about estimation Statistics - Lecture One Charlotte Wickham wickham@stat.berkeley.edu http://www.stat.berkeley.edu/~wickham/ Outline 1. Basic ideas about estimation 2. Method of Moments 3. Maximum Likelihood 4. Confidence

More information