Applicability of subsampling bootstrap methods in Markov chain Monte Carlo


James M. Flegal, University of California, Riverside, CA 92521, e-mail: jflegal@ucr.edu

Abstract. Markov chain Monte Carlo (MCMC) methods allow exploration of intractable probability distributions by constructing a Markov chain whose stationary distribution equals the desired distribution. The output from the Markov chain is typically used to estimate several features of the stationary distribution, such as mean and variance parameters, along with quantiles and so on. Unfortunately, most reported MCMC estimates do not include a clear notion of the associated uncertainty. For expectations, one can assess the uncertainty by estimating the variance in an asymptotic normal distribution of the Monte Carlo error. For general functionals there is no such clear path. This article studies the applicability of subsampling bootstrap methods for assessing the uncertainty in estimating general functionals from MCMC simulations.

1 Introduction

This article develops methods to evaluate the reliability of estimators constructed from Markov chain Monte Carlo (MCMC) simulations. MCMC uses computer-generated data to estimate some functional θ_π, where π is a probability distribution with support X. It has become a standard technique, especially for Bayesian inference, and the reliability of MCMC estimators has already been studied for cases where we are estimating an expected value [9, 13, 19]. Here, we investigate the applicability of subsampling bootstrap methods (SBM) for output analysis of an MCMC simulation. This work is appropriate for general functionals, including expectations, quantiles and modes.

The basic MCMC method entails constructing a Harris ergodic Markov chain X = {X_0, X_1, X_2, ...} on X having invariant distribution π. The popularity of MCMC methods results from the ease with which an appropriate X can be simulated [4, 20,

23]. Suppose we simulate X for a finite number of steps, say n, and use the observed values to estimate θ_π with θ̂_n. In practice, the simulation is run sufficiently long that we believe an accurate estimate of θ_π has been obtained. Unfortunately, we have no certain way to know when to terminate the simulation. At present, most analysts use convergence diagnostics for this purpose (for a review see [5]); although easily implemented, this approach is mute about the quality of θ̂_n as an estimate of θ_π. Moreover, diagnostics can introduce bias directly into the estimates [6].

The approach advocated here directly analyzes output from an MCMC simulation to establish non-parametric or parametric confidence intervals for θ_π. There is already substantial research for the case where θ_π is an expectation, but very little for general quantities. Calculating and reporting an uncertainty estimate, or confidence interval, allows everyone to judge the reliability of the estimates. The main point is that an uncertainty estimate should be reported along with the point estimate obtained from an MCMC experiment. This may seem obvious to most statisticians, but it is not currently standard practice in MCMC [9, 13, 19].

Outside of toy examples, no matter how long our simulation, there will be an unknown Monte Carlo error, θ̂_n − θ_π. While it is impossible to assess this error directly, we can estimate the error via a sampling distribution. That is, we need an asymptotic distribution for θ̂_n obtained from a Markov chain simulation. Assume θ̂_n, properly normalized, has a limiting distribution J_π; specifically, as n → ∞,

    τ_n (θ̂_n − θ_π) →d J_π,   (1)

where τ_n → ∞. For general dependent sequences, there is a substantial amount of research on obtaining asymptotic distributions for a large variety of θ_π. These results are often applicable since the Markov chains in MCMC are special cases of strong mixing processes.
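As a quick numerical illustration of (1), not part of the original article: for a sample mean under i.i.d. sampling, τ_n = √n, and the spread of the normalized Monte Carlo error stabilizes as n grows, consistent with a normal limit J_π. The sketch below uses Exp(1) draws, where θ_π = 1 and σ = 1.

```python
import numpy as np

# Normalized Monte Carlo error tau_n * (theta_hat_n - theta_pi) for a
# sample mean under i.i.d. Exp(1) sampling. Its standard deviation
# hovers near sigma = 1 for every n, consistent with a root-n limit law.
rng = np.random.default_rng(0)
theta_pi = 1.0                       # true mean of Exp(1)
for n in (100, 10_000, 100_000):
    errs = [np.sqrt(n) * (rng.exponential(size=n).mean() - theta_pi)
            for _ in range(200)]
    print(n, round(np.std(errs), 2))
```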
This article addresses how to estimate the uncertainty of θ̂_n given a limiting distribution as at (1). Bootstrap methods may be appropriate for this task. Indeed, there is already sentiment that bootstrap methods used in stationary time series are appropriate for MCMC [1, 2, 7, 21]. However, my preliminary work [8] suggests that the SBM has superior computational and finite-sample properties.

The basic SBM provides a general approach to constructing asymptotically valid confidence intervals [22]. In short, SBM calculates the desired statistic over subsamples of the chain and then uses these values to approximate the sampling distribution of θ̂_n. From the subsample values, one can construct a non-parametric confidence interval directly, or estimate the unknown asymptotic variance of J_π and construct a parametric confidence interval.

The rest of this article is organized as follows. Section 2 overviews construction of non-parametric and parametric confidence intervals for general quantities θ_π via SBM. Section 3 examines the finite-sample properties in a toy example, and Section 4

illustrates the use of SBM in a realistic example to obtain uncertainty estimates when estimating quantiles.

2 Subsampling bootstrap methods

This section overviews SBM for constructing asymptotically valid confidence intervals for θ_π. Aside from a proposed diagnostic [14] and a brief summary for quantiles [11], there has been little investigation of SBM in MCMC. Nonetheless, SBM is widely applicable with only limited assumptions. The main requirement is that θ̂_n, properly normalized, has a limiting distribution as at (1).

SBM divides the simulation into overlapping subsamples of length b from the first n observations of X. In general, there are n − b + 1 subsamples, and the statistic is calculated over each subsample. Procedurally, we select a batch size b such that b/n → 0, τ_b/τ_n → 0, τ_b → ∞ and b → ∞ as n → ∞. If we let θ̂_i for i = 1, ..., n − b + 1 denote the value of the statistic calculated from the ith subsample, the assumptions on b imply that, as n → ∞,

    τ_b (θ̂_i − θ_π) →d J_π  for i = 1, ..., n − b + 1.

We can then use the values of θ̂_i to approximate J_π and construct asymptotically valid inference procedures. Specifically, define the empirical distribution of the standardized θ̂_i's as

    L_{n,b}(y) = (n − b + 1)^{-1} Σ_{i=1}^{n−b+1} I{ τ_b (θ̂_i − θ̂_n) ≤ y }.

Further, for α ∈ (0, 1) define

    L_{n,b}^{-1}(1 − α) = inf{ y : L_{n,b}(y) ≥ 1 − α }  and  J_π^{-1}(1 − α) = inf{ y : J_π(y) ≥ 1 − α }.

Theorem 1. Let X be a Harris ergodic Markov chain. Assume (1) and that b/n → 0, τ_b/τ_n → 0, τ_b → ∞ and b → ∞ as n → ∞.

1. If y is a continuity point of J_π(·), then L_{n,b}(y) → J_π(y) in probability.
2. If J_π(·) is continuous at J_π^{-1}(1 − α), then as n → ∞,

    Pr{ τ_n (θ̂_n − θ_π) ≤ L_{n,b}^{-1}(1 − α) } → 1 − α.

Proof. Note that Assumption 4.2.1 of [22] holds under (1) and the fact that X possesses a unique invariant distribution. The result is then a direct consequence of Theorem 4.2.1 of [22] and the fact that Harris ergodic Markov chains are strongly mixing [18].
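To make the construction concrete, here is a minimal Python sketch of the SBM pipeline for a general statistic: evaluate it on every overlapping subsample, form a non-parametric interval from the empirical law of Theorem 1, and a normal-theory parametric interval from the subsample variance. The code and its names are my own, it assumes a known rate τ_n = √n, and it uses a normal critical value in place of the t quantile; it is an illustration, not the paper's implementation.

```python
import numpy as np

def sbm_intervals(chain, b, stat=np.mean, alpha=0.05):
    """SBM confidence intervals for theta_pi = stat(pi), assuming
    tau_n = sqrt(n). Returns (non-parametric CI, parametric CI)."""
    chain = np.asarray(chain)
    n = len(chain)
    theta_hat = stat(chain)
    # statistic on each of the n - b + 1 overlapping subsamples of length b
    windows = np.lib.stride_tricks.sliding_window_view(chain, b)
    theta_i = np.array([stat(w) for w in windows])
    tau_n, tau_b = np.sqrt(n), np.sqrt(b)
    # empirical law L_{n,b} of the standardized subsample values
    z = tau_b * (theta_i - theta_hat)
    lo, hi = np.quantile(z, [alpha / 2, 1 - alpha / 2])
    nonparam = (theta_hat - hi / tau_n, theta_hat - lo / tau_n)
    # subsample variance estimator and normal-theory interval
    sigma2 = tau_b**2 * np.mean((theta_i - theta_hat) ** 2)
    half = 1.96 * np.sqrt(sigma2) / tau_n
    param = (theta_hat - half, theta_hat + half)
    return nonparam, param

# Example: intervals for the mean of i.i.d. Exp(1) draws, b = n**0.5
rng = np.random.default_rng(2)
np_ci, p_ci = sbm_intervals(rng.exponential(size=10_000), b=100)
```

Passing a different `stat` (e.g. a quantile) reuses the same machinery, which is the point of the method's generality.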

Theorem 1 provides a consistent estimate of the limiting law J_π for Harris ergodic Markov chains through the empirical distribution of the θ̂_i. Hence a theoretically valid (1 − α)100% non-parametric interval can be expressed as

    [ θ̂_n − τ_n^{-1} L_{n,b}^{-1}(1 − α/2) ,  θ̂_n − τ_n^{-1} L_{n,b}^{-1}(α/2) ].   (2)

Alternatively, one can estimate the asymptotic variance [3, 22] using

    σ̂²_SBM = τ_b² / (n − b + 1) Σ_{i=1}^{n−b+1} (θ̂_i − θ̂_n)².   (3)

If J_π is normal, then a (1 − α)100% level parametric confidence interval can be obtained as

    [ θ̂_n − t_{n−b, α/2} τ_n^{-1} σ̂_SBM ,  θ̂_n + t_{n−b, α/2} τ_n^{-1} σ̂_SBM ].   (4)

SBM is applicable for any θ̂_n such that (1) holds and the rate of convergence τ_n is known, as required in (2)-(4). Implementation requires selection of the subsample size b. We use the naive choice b_n = n^{1/2} in later examples. The following sections consider two common quantities where SBM is appropriate: expectations and quantiles.

2.1 Expectations

Consider estimating an expectation with respect to π, that is,

    θ_π = E_π g = ∫_X g(x) π(dx).

Suppose we use the observed values to estimate E_π g with the sample average

    ḡ_n = n^{-1} Σ_{i=0}^{n−1} g(X_i).

The use of this estimator is justified through the Markov chain strong law of large numbers. Further assume a Markov chain CLT holds [18, 26], that is,

    √n (ḡ_n − E_π g) →d N(0, σ²)   (5)

as n → ∞, where σ² ∈ (0, ∞). Then we can use (2) or (4) to form non-parametric or parametric confidence intervals, respectively.

Alternatively, one can consider the overlapping batch means (OLBM) variance estimator [10]. As the name suggests, OLBM divides the simulation into overlapping batches of length b, resulting in n − b + 1 batches with batch means Ȳ_j(b) = b^{-1} Σ_{i=0}^{b−1} g(X_{j+i}) for j = 0, ..., n − b. Then the OLBM estimator of σ² is

    σ̂²_OLBM = nb / [(n − b)(n − b + 1)] Σ_{j=0}^{n−b} (Ȳ_j(b) − ḡ_n)².   (6)

It is easy to show that (3) is asymptotically equivalent to (6).

2.2 Quantiles

It is routine when summarizing an MCMC experiment to include sample quantiles, especially in Bayesian applications. These are based on quantiles of the univariate marginal distributions associated with π. Let F be the marginal cumulative distribution function, and consider estimating the quantile function of F, i.e. the generalized inverse F^{-1} : (0,1) → R given by

    θ_π = F^{-1}(q) = inf{ y : F(y) ≥ q }.

We say a sequence of quantile functions converges weakly to a limit quantile function, denoted F_n^{-1} → F^{-1}, if and only if F_n^{-1}(t) → F^{-1}(t) at every t where F^{-1} is continuous. Lemma 21.2 of [28] shows F_n^{-1} → F^{-1} if and only if F_n → F. Thus we consider estimating F with the empirical distribution function

    F_n(y) = n^{-1} Σ_{i=1}^{n} I{ Y_i ≤ y },

where Y = {Y_1, ..., Y_n} is the observed univariate sample from F and I is the usual indicator function. The ergodic theorem gives pointwise convergence (F_n(y) → F(y) for every y almost surely as n → ∞), and the Glivenko-Cantelli theorem extends this to uniform convergence (sup_{y ∈ R} |F_n(y) − F(y)| → 0 almost surely as n → ∞). Letting Y_{n(1)}, ..., Y_{n(n)} be the order statistics of the sample, the empirical quantile function is given by

    F_n^{-1}(q) = Y_{n(j)}  for q ∈ ( (j−1)/n , j/n ].

Often the empirical distribution function F_n and the empirical quantile function F_n^{-1} are used directly to estimate F and F^{-1}. Construction of an interval estimate of F^{-1}(q) requires existence of a limiting distribution as at (1). We will assume a CLT exists for the Monte Carlo error [12], that is,

    √n ( F_n^{-1}(q) − F^{-1}(q) ) →d N(0, σ²)   (7)

as n → ∞, where σ² ∈ (0, ∞). Then we can use (2) or (4) to form non-parametric or parametric confidence intervals, respectively, by setting θ̂_i to the estimated quantile from the ith subsample.

3 Toy example

Consider estimating the quantiles of an Exp(1) distribution, i.e. f(x) = e^{-x} I(x > 0), using the methods outlined above. It is easy to show that F^{-1}(q) = −log(1 − q), and simulation methods are not necessary; accordingly, we use the true values to evaluate the resulting coverage probabilities of the parametric and non-parametric intervals.

Monte Carlo sampling. SBM is also valid using i.i.d. draws from π, that is, for Monte Carlo simulations. Here the subsamples need not be overlapping, hence there are N := (n choose b) subsamples. Calculation over all N subsamples will often be computationally extensive. Instead, a suitably large N << (n choose b) random subsamples can be selected, resulting in an estimate based on a large number of subsamples rather than all of them.

Consider sampling from π using i.i.d. draws. For each simulation, with n = 1e4 iterations, CIs were calculated for q ∈ {.025, .1, .5, .9, .975} based on b ∈ {100, 4000}. For both values of b, calculation of σ̂²_SBM was based on N = 1000 random subsamples rather than all (n choose b) subsamples. This procedure was repeated 2000 times to evaluate the resulting confidence intervals; see Table 1 for a summary of the simulation results.

For b = 100, the mean values of σ̂²_SBM/σ² are close to 1 for all values of q, implying there is no systematic bias in the variance estimates. When q ∈ {.1, .5, .9}, the coverage probabilities are close to the nominal value of 0.95. For the more extreme values q ∈ {.025, .975}, the results are worse, which should not be surprising given b = 100. The non-parametric CIs at (2) show a similar trend, though the overall results are considerably worse.

Table 1: Coverage probabilities for the Exp(1) example using the i.i.d. sampler. Reported coverage probabilities have 0.95 nominal level with standard errors sqrt(p̂(1 − p̂)/2000) ≤ 0.0082.

                    q   0.025   0.1     0.5     0.9     0.975
  b = 100   SBM         0.9705  0.9490  0.9480  0.9485  0.9400
            NP SBM      0.8595  0.9260  0.9415  0.9410  0.9210
  b = 4e3   SBM         0.8660  0.8690  0.8700  0.8715  0.8670
            NP SBM      0.8375  0.8575  0.8600  0.8575  0.8395
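The random-subsample variant described above can be sketched as follows. This is a hedged illustration in my own code: N, b and the quantile statistic match the experiment in the text, the function name and normal critical value are my choices, and the rate τ_b = √b is assumed as in (7).

```python
import numpy as np

def sbm_iid_quantile_ci(y, q, b, N=1000, alpha=0.05, seed=None):
    """Parametric SBM interval for the q-th quantile from i.i.d. draws y,
    using N random subsamples of size b drawn without replacement."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    n = len(y)
    theta_hat = np.quantile(y, q)
    # statistic on N random size-b subsamples instead of all (n choose b)
    theta_i = np.array([np.quantile(rng.choice(y, size=b, replace=False), q)
                        for _ in range(N)])
    sigma2 = b * np.mean((theta_i - theta_hat) ** 2)  # eq. (3), tau_b = sqrt(b)
    half = 1.96 * np.sqrt(sigma2 / n)
    return theta_hat - half, theta_hat + half

# Exp(1) example: true median is F^{-1}(0.5) = log 2 ≈ 0.693
rng = np.random.default_rng(7)
lo, hi = sbm_iid_quantile_ci(rng.exponential(size=10_000), q=0.5, b=100, seed=1)
```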

One may consider increasing b to improve the results for q ∈ {.025, .975}. However, if b = 4000 without increasing n, the resulting coverage probabilities are significantly worse for both types of CIs (see Table 1). The simulations also show the mean value of σ̂²_SBM/σ² is less than 1; hence the variance estimates are biased downward. Instead, as b increases, the overall simulation effort should also increase. Rather than increasing b, it may be useful to consider different quantile estimators, including continuous estimators [16] or a finite-sample correction [22]. Given our interest in MCMC, these were not considered here.

MCMC sampling. Consider sampling from π using an independence Metropolis sampler with an Exp(θ) proposal [19, 25, 27]. If θ = 1, the sampler simply provides i.i.d. draws from π. The chain is geometrically ergodic if 0 < θ < 1 and sub-geometric (slower than geometric) if θ > 1.

We calculated intervals for q ∈ {.025, .1, .5, .9, .975}; each chain contained n = 1e4 iterations and the procedure was repeated 2000 times. The simulations began at X_0 = 1, with θ ∈ {1/4, 1/2, 2}, and b = 100. Table 2 summarizes the results. For θ ∈ {1/4, 1/2} and q ∈ {.1, .5, .9, .975}, the coverage probabilities are close to the nominal value of 0.95. Increasing b would likely improve the results, but with a concurrent requirement for larger n. These limited results for parametric confidence intervals are very encouraging.

Table 2: Coverage probabilities for the Exp(1) example using the independence Metropolis sampler. Reported coverage probabilities have 0.95 nominal level with standard errors sqrt(p̂(1 − p̂)/2000) ≤ 0.0109.

                    q   0.025   0.1     0.5     0.9     0.975
  θ = 1/4   SBM         0.9790  0.9530  0.9370  0.9380  0.9360
            NP SBM      0.7305  0.8795  0.9305  0.9425  0.9450
  θ = 1/2   SBM         0.9710  0.9495  0.9445  0.9385  0.9470
            NP SBM      0.8125  0.9215  0.9465  0.9415  0.9520
  θ = 2     SBM         0.9930  0.9905  0.9885  0.6145  0.1720
            NP SBM      0.8630  0.9120  0.8765  0.6295  0.1695
In contrast, non-parametric CIs derived from (2) perform worse, especially for q ∈ {.025, .1}. When θ = 2, the chain is sub-geometric and it is unclear whether a √n-CLT holds as at (7). In fact, the independence sampler fails to have a √n-CLT at (5) for all suitably non-trivial functions g when θ > 2 [24, 27]. However, it is possible via SBM to obtain parametric and non-parametric CIs at (2) or (4) if one assumes a CLT with rate of convergence τ_n = √n. The results from this simulation are also contained in Table 2. We can see the coverage probabilities are close to the 0.95 nominal level for small quantiles, but this is likely because 0.95 is close to 1. For large quantiles, the results are terrible, with coverage as low as 0.17. This example highlights the importance of establishing a Markov chain CLT.
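To make the toy experiment concrete, the independence Metropolis chain with Exp(θ) proposal targeting Exp(1) can be simulated and fed into the subsample-quantile machinery as below. For this target and proposal, the acceptance ratio simplifies to exp((1 − θ)(x − y)). This is my own illustrative sketch assuming τ_n = √n, not the implementation behind Table 2.

```python
import numpy as np

def indep_metropolis_exp(n, theta, x0=1.0, seed=None):
    """Independence Metropolis chain targeting Exp(1) with an Exp(theta)
    proposal; the acceptance ratio reduces to exp((1 - theta) * (x - y))."""
    rng = np.random.default_rng(seed)
    x = np.empty(n)
    x[0] = x0
    for t in range(1, n):
        y = rng.exponential(1.0 / theta)        # Exp(theta) proposal draw
        log_alpha = (1.0 - theta) * (x[t - 1] - y)
        x[t] = y if np.log(rng.uniform()) < log_alpha else x[t - 1]
    return x

# Parametric SBM interval for the median with b = n**0.5
n, b, q = 10_000, 100, 0.5
chain = indep_metropolis_exp(n, theta=0.5, seed=3)   # geometrically ergodic
theta_hat = np.quantile(chain, q)
windows = np.lib.stride_tricks.sliding_window_view(chain, b)
theta_i = np.quantile(windows, q, axis=1)            # subsample medians
sigma2 = b * np.mean((theta_i - theta_hat) ** 2)     # eq. (3), tau_b = sqrt(b)
ci = (theta_hat - 1.96 * np.sqrt(sigma2 / n),
      theta_hat + 1.96 * np.sqrt(sigma2 / n))
```

Setting theta above 1 slows the chain down, and setting it above 2 breaks the √n-CLT, which is exactly the failure mode the θ = 2 rows of Table 2 exhibit.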

4 A realistic example

In this section, we consider the analysis of US government HMO data [15] under a previously proposed model [17]. Let y_i denote the individual monthly premium of the ith HMO plan for i = 1, ..., 341, and consider a Bayesian version of the following frequentist model

    y_i = β_0 + β_1 x_{i1} + β_2 x_{i2} + ε_i,   (8)

where the ε_i are i.i.d. N(0, λ^{-1}), x_{i1} denotes the centered and scaled average expenses per admission in the state in which the ith HMO operates, and x_{i2} is an indicator for New England. (Specifically, if x̃_{i1} are the original values and x̄_1 is the overall average per admission, then x_{i1} = (x̃_{i1} − x̄_1)/1000.) Our analysis is based on the following Bayesian version of (8):

    y | β, λ ~ N_N(Xβ, λ^{-1} I_N)
    β | λ ~ N_3(b, B^{-1})
    λ ~ Gamma(r_1, r_2)

where N = 341, y is the 341 × 1 vector of individual premiums, β = (β_0, β_1, β_2)^T is the vector of regression coefficients, and X is the 341 × 3 design matrix whose ith row is x_i^T = (1, x_{i1}, x_{i2}). (We say W ~ Gamma(a, b) if it has density proportional to w^{a−1} e^{−bw} for w > 0.) This model requires specification of the hyper-parameters (r_1, r_2, b, B), which we assign based on estimates from the usual frequentist model [17]. Specifically, r_1 = 3.122e-06, r_2 = 1.77e-03,

    b = (164.989, 3.910, 32.799)^T  and  B^{-1} = diag(2, 3, 36).

We sample from π(β, λ | y) using a two-component block Gibbs sampler requiring the following full conditionals:

    λ | β ~ Gamma( r_1 + N/2 , r_2 + V(β)/2 )
    β | λ ~ N_3( (λ X^T X + B)^{-1} (λ X^T y + B b) , (λ X^T X + B)^{-1} )

where V(β) = (y − Xβ)^T (y − Xβ) and we have suppressed the dependence on y. We consider the sampler which updates λ followed by β in each iteration, i.e. (β, λ) → (β, λ') → (β', λ'). Our goal is estimating the median and reporting a 90% Bayesian credible region for each of the three marginal distributions. Denote the qth quantile associated with the marginal for β_j as φ_q^{(j)} for j = 0, 1, 2. Then the vector of parameters to be estimated is

    Φ = ( φ_{.05}^{(0)}, φ_{.5}^{(0)}, φ_{.95}^{(0)}, φ_{.05}^{(1)}, φ_{.5}^{(1)}, φ_{.95}^{(1)}, φ_{.05}^{(2)}, φ_{.5}^{(2)}, φ_{.95}^{(2)} ).

Along with estimating Φ, we calculated the associated MCSEs using SBM. Table 3 summarizes estimates for Φ and MCSEs from 40,000 total iterations (b_n = 40,000^{1/2} = 200).

Table 3: HMO parameter estimates with MCSEs.

         q      Estimate   MCSE
  β_0   0.5     164.99     5.86e-3
        0.05    163.40     1.05e-2
        0.95    166.56     9.72e-3
  β_1   0.5     3.92       7.24e-3
        0.05    2.06       1.28e-2
        0.95    5.79       1.19e-2
  β_2   0.5     32.78      2.50e-2
        0.05    25.86      4.61e-2
        0.95    39.69      4.37e-2

Acknowledgements. I am grateful to Galin L. Jones and two anonymous referees for their constructive comments in preparing this article.

References

1. Patrice Bertail and Stéphan Clémençon. Regenerative block-bootstrap for Markov chains. Bernoulli, 12:689-712, 2006.
2. Peter Bühlmann. Bootstraps for time series. Statistical Science, 17:52-72, 2002.
3. Edward Carlstein. The use of subseries values for estimating the variance of a general statistic from a stationary sequence. The Annals of Statistics, 14:1171-1179, 1986.
4. Ming-Hui Chen, Qi-Man Shao, and Joseph George Ibrahim. Monte Carlo Methods in Bayesian Computation. Springer-Verlag Inc, 2000.
5. Mary Kathryn Cowles and Bradley P. Carlin. Markov chain Monte Carlo convergence diagnostics: A comparative review. Journal of the American Statistical Association, 91:883-904, 1996.
6. Mary Kathryn Cowles, Gareth O. Roberts, and Jeffrey S. Rosenthal. Possible biases induced by MCMC convergence diagnostics. Journal of Statistical Computing and Simulation, 64:87-104, 1999.
7. Somnath Datta and William P. McCormick. Regeneration-based bootstrap for Markov chains. The Canadian Journal of Statistics, 21:181-193, 1993.
8. James M. Flegal. Monte Carlo standard errors for Markov chain Monte Carlo. PhD thesis, University of Minnesota, School of Statistics, 2008.
9. James M. Flegal, Murali Haran, and Galin L. Jones. Markov chain Monte Carlo: Can we trust the third significant figure? Statistical Science, 23:250-260, 2008.
10. James M. Flegal and Galin L. Jones. Batch means and spectral variance estimators in Markov chain Monte Carlo. The Annals of Statistics, 38:1034-1070, 2010.

11. James M. Flegal and Galin L. Jones. Implementing Markov chain Monte Carlo: Estimating with confidence. In S.P. Brooks, A.E. Gelman, G.L. Jones, and X.L. Meng, editors, Handbook of Markov Chain Monte Carlo. Chapman & Hall/CRC Press, 2010.
12. James M. Flegal and Galin L. Jones. Quantile estimation via Markov chain Monte Carlo. Work in progress, 2011.
13. Charles J. Geyer. Practical Markov chain Monte Carlo (with discussion). Statistical Science, 7:473-511, 1992.
14. S. G. Giakoumatos, I. D. Vrontos, P. Dellaportas, and D. N. Politis. A Markov chain Monte Carlo convergence diagnostic using subsampling. Journal of Computational and Graphical Statistics, 8:431-451, 1999.
15. James S. Hodges. Some algebra and geometry for hierarchical models, applied to diagnostics (with discussion). Journal of the Royal Statistical Society, Series B: Statistical Methodology, 60:497-521, 1998.
16. Rob J. Hyndman and Yanan Fan. Sample quantiles in statistical packages. The American Statistician, 50:361-365, 1996.
17. Alicia A. Johnson and Galin L. Jones. Gibbs sampling for a Bayesian hierarchical general linear model. Electronic Journal of Statistics, 4:313-333, 2010.
18. Galin L. Jones. On the Markov chain central limit theorem. Probability Surveys, 1:299-320, 2004.
19. Galin L. Jones and James P. Hobert. Honest exploration of intractable probability distributions via Markov chain Monte Carlo. Statistical Science, 16:312-334, 2001.
20. Jun S. Liu. Monte Carlo Strategies in Scientific Computing. Springer, New York, 2001.
21. Dimitris N. Politis. The impact of bootstrap methods on time series analysis. Statistical Science, 18:219-230, 2003.
22. Dimitris N. Politis, Joseph P. Romano, and Michael Wolf. Subsampling. Springer-Verlag Inc, 1999.
23. Christian P. Robert and George Casella. Monte Carlo Statistical Methods. Springer, New York, 1999.
24. Gareth O. Roberts. A note on acceptance rate criteria for CLTs for Metropolis-Hastings algorithms. Journal of Applied Probability, 36:1210-1217, 1999.
25. Gareth O. Roberts and Jeffrey S. Rosenthal. Markov chain Monte Carlo: Some practical implications of theoretical results (with discussion). Canadian Journal of Statistics, 26:5-31, 1998.
26. Gareth O. Roberts and Jeffrey S. Rosenthal. General state space Markov chains and MCMC algorithms. Probability Surveys, 1:20-71, 2004.
27. Gareth O. Roberts and Jeffrey S. Rosenthal. Quantitative non-geometric convergence bounds for independence samplers. Methodology and Computing in Applied Probability, 13:391-403, 2011.
28. A. W. van der Vaart. Asymptotic Statistics. Cambridge University Press, New York, 1998.