Structured Markov Chain Monte Carlo


by Daniel J. Sargent, James S. Hodges, and Bradley P. Carlin

Section of Biostatistics, Mayo Clinic; Division of Biostatistics, School of Public Health, University of Minnesota

July 13, 1999

Abstract

In this paper we introduce a general method for Bayesian computing in richly-parameterized models, Structured Markov Chain Monte Carlo (SMCMC), that is based on a blocked hybrid of the Gibbs sampling and Metropolis-Hastings algorithms. SMCMC speeds algorithm convergence by using the structure that is present in the problem to suggest an appropriate Metropolis-Hastings candidate distribution. While the approach is easiest to describe for hierarchical normal linear models, we show that its extension to both non-normal and nonlinear cases is straightforward. After describing the method in detail we compare its performance (in terms of runtime and autocorrelation in the samples) to other existing methods, including the single-site updating Gibbs sampler available in the popular BUGS software package. Our results suggest significant improvements in convergence for many problems using SMCMC, as well as broad applicability of the method, including previously intractable hierarchical nonlinear model settings.

KEY WORDS: Blocking; Convergence acceleration; Gibbs sampling; Hierarchical model; Metropolis-Hastings algorithm.

1 Introduction

The past decade has seen an explosion in the use of advanced Bayesian methods, largely due to Markov chain Monte Carlo (MCMC) computational methods for estimating posterior distributions. These methods sample from a Markov chain whose stationary distribution is the posterior, producing a correlated sample from this distribution. Compared to quadrature and independent, identically distributed Monte Carlo approaches, MCMC methods are typically easier to implement and more broadly applicable, but they require a convergence "diagnosis," i.e., a decision as to when the samples may be safely viewed as draws from the chain's stationary distribution. While many authors (e.g., Tierney 1994; Roberts and Smith 1993; Roberts and Tweedie 1996) have investigated theoretical convergence properties of MCMC methods, assessing convergence in practice is problematic, because strictly speaking this can only be determined from a sampled chain of infinite length. Most practitioners use a variety of diagnostics to isolate convergence difficulties; see Cowles and Carlin (1996) or Mengersen et al. (1999) for reviews.

The difficulty of assessing convergence has led many authors to refocus on convergence acceleration, on the grounds that a sampler that traverses the parameter space more quickly is typically easier and safer to use. Acceleration methods include reparameterization (see e.g. Hills and Smith 1992; Gelfand, Sahu, and Carlin 1995; Gilks and Roberts 1996), auxiliary variables (Swendsen and Wang 1987; Besag and Green 1993; Damien et al. 1999; Mira and Tierney 1997), and multichain annealing or tempering (Geyer and Thompson 1995; Neal 1996). Gilks and Roberts (1996) give an overview of these and other acceleration methods.

Another approach to accelerating convergence is blocking, or updating multivariate blocks of (typically highly correlated) parameters. This contrasts with the most common approach to MCMC, in which each parameter is updated separately according to its conditional distribution given the data and every other parameter in the model.

The latter approach is used by BUGS (Spiegelhalter et al. 1995a), the most general and easy-to-use MCMC software package to date. BUGS has been used on a wide range of models (Spiegelhalter et al. 1995b), but high posterior correlations are a major hindrance to its univariate updating algorithms. Blocking, which can be done using Gibbs updates or multivariate Metropolis-Hastings (M-H) updates (Hastings 1970; Carlin and Louis 1996), often solves this problem. Recent work by Liu (1994) and Liu et al. (1994) confirms its good performance for a broad class of models, though Liu et al. (1994, Sec. 5) and Roberts and Sahu (1997) give examples where blocking actually slows a sampler's convergence. Unfortunately, success stories in blocking are often application-specific, and general rules have been hard to find.

In this paper, we introduce a simple, general, and flexible blocked MCMC method for a large class of richly-parameterized linear and nonlinear models. The method, which we term Structured MCMC (SMCMC), accelerates convergence for many of these models by blocking groups of similar parameters while taking full advantage of the posterior correlation structure induced by the model and data. For linear models, this structure yields closed-form full conditionals, which may be used as candidate distributions in Hastings independence chain updates. For nonlinear models, Gaussian distributions can be used to approximate the data's contribution to the posterior. When combined with Gaussian prior distributions, this yields effective approximations to the dependence structure in the posterior, which can in turn produce efficient candidate distributions for a Hastings algorithm. In this respect, our algorithm is reminiscent of a "linearization" approach used by Gelfand et al. (1997), though our approach applies much more broadly.

The remainder of the paper is organized as follows. Section 2 reviews and illustrates the basis of SMCMC, the constraint case formulation of Hodges (1998). Section 3 lays out the SMCMC algorithm for richly-parameterized linear models, while Section 4 considers the more difficult nonlinear case.

Section 5 gives three examples spanning a range of modeling and computational complexity, from a hierarchical linear model with normal errors to a hierarchical Cox proportional hazards model. Besides illustrating SMCMC's use, we compare its runtimes and effective sample sizes to those of standard algorithms, including BUGS, when such alternatives are feasible. Section 6 offers concluding remarks, as well as directions for future research.

2 Constraint case framework for richly-parameterized models

"Richly-parameterized models" includes hierarchical and other multilevel models (Lindley and Smith 1972; Bryk and Raudenbush 1992), dynamic models (West and Harrison 1989), variance component models (Searle et al. 1992), some spatial models (Besag et al. 1991), and others. We introduce the constraint case framework using hierarchical linear models. Hierarchical models are usually used for data structures with a natural hierarchical structure, e.g., students within classrooms within a school. In this example, a separate regression could be fit to the students in each classroom; each classroom's vector of regression parameters could then be treated as an outcome in a regression for the whole school. A hierarchical model is thus a hierarchy of simple models conforming to the hierarchy in the data. As no standard terminology for richly-parameterized models has evolved, we use the terminology in Hodges (1998), as outlined in the following example.

The simplest hierarchical model is the balanced one-way random-effects model. Suppose we have JK observations y_ij, i = 1, ..., K, j = 1, ..., J; the model assumes that for each i the y_ij have a common mean θ_i, and that the θ_i are in turn draws from a distribution with mean μ.

If we take ε_ij ~ N(0, σ²) and θ_i − μ ~ N(0, τ²) (where N(a, b) denotes the normal distribution with mean a and variance b), the model can be represented by:

\[ y_{ij} = \theta_i + \epsilon_{ij} \qquad (1) \]
\[ \theta_i = \mu + \delta_i \qquad (2) \]
\[ \mu = M + \xi \qquad (3) \]

where (3) represents a N(M, s²) prior for μ. A Bayesian analysis adds prior distributions for σ² and τ². If we rewrite (2) and (3) as

\[ 0 = -\theta_i + \mu + \delta_i \qquad (4) \]
\[ M = \mu - \xi \qquad (5) \]

the model can be expressed in the form of a linear model by combining (1), (4), and (5):

\[
\begin{bmatrix} y \\ 0 \\ M \end{bmatrix}
=
\begin{bmatrix} I_K \otimes 1_J & 0_{JK \times 1} \\ -I_K & 1_K \\ 0_{1 \times K} & 1 \end{bmatrix}
\begin{bmatrix} \theta \\ \mu \end{bmatrix}
+
\begin{bmatrix} \epsilon \\ \delta \\ -\xi \end{bmatrix}
\qquad (6)
\]

where y = {y_ij}, ε = {ε_ij}, θ = {θ_i}, and I_m, 1_m, and 0_{m×p} are the identity matrix, a column vector of ones, and a matrix of zeros of the specified dimensions, respectively. Hodges (1998) shows that Bayesian inferences based on (6) are identical to those drawn from the standard formulation.

A wide variety of richly-parameterized models can be re-expressed in a general form of which (6) is a special case, namely

\[
\begin{bmatrix} y \\ 0 \\ M \end{bmatrix}
=
\begin{bmatrix} X_1 & 0 \\ H_1 & H_2 \\ G_1 & G_2 \end{bmatrix}
\begin{bmatrix} \Theta_1 \\ \Theta_2 \end{bmatrix}
+
\begin{bmatrix} \epsilon \\ \delta \\ \xi \end{bmatrix}
\qquad (7)
\]

In more compact notation, (7) can be expressed as

\[ Y = X\Theta + E \qquad (8) \]

where X and Y are known, Θ is unknown, and E is an error term having Cov(E) = Γ, where Γ is block diagonal with blocks corresponding to the covariance matrices of ε, δ, and ξ, i.e.

\[
\mathrm{Cov}(E) = \Gamma =
\begin{bmatrix} \mathrm{Cov}(\epsilon) & 0 & 0 \\ 0 & \mathrm{Cov}(\delta) & 0 \\ 0 & 0 & \mathrm{Cov}(\xi) \end{bmatrix}
\qquad (9)
\]

Note that in our simple example (6), Γ is actually a full (not just block) diagonal matrix.

Following Hodges (1998), we use the term "data cases" to refer to rows of X, Y, and E in (8) corresponding to X_1 in (7). The data cases are the terms in the joint posterior into which the outcomes y enter directly. Rows of X, Y, and E corresponding to the H_i are "constraint cases." They place restrictions (stochastic constraints) on possible values of Θ_1. Finally, we call the rows of X, Y, and E corresponding to the G_i "prior cases," this label being reserved for cases with known (specified) error variances. Constraint cases being common to both Bayesian and non-Bayesian analyses based on this formulation, we will henceforth call it "the constraint case formulation" of a richly-parameterized model.
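To make the formulation concrete, the following is a minimal sketch (not code from the paper; the function name and arguments are hypothetical) that assembles Y, X, and Γ of (8)-(9) for the balanced one-way random-effects example (6), given fixed values of σ², τ², M, and s².

```python
import numpy as np

def one_way_constraint_case(y, M, s2, sigma2, tau2):
    """Stack the data, constraint, and prior cases of the balanced one-way
    random-effects model into Y = X*Theta + E with Cov(E) = Gamma.
    y is a K x J array of observations; Theta = (theta_1,...,theta_K, mu)."""
    K, J = y.shape
    # Data cases: y_ij = theta_i + eps_ij
    X_data = np.hstack([np.kron(np.eye(K), np.ones((J, 1))), np.zeros((K * J, 1))])
    # Constraint cases: 0 = -theta_i + mu + delta_i
    X_con = np.hstack([-np.eye(K), np.ones((K, 1))])
    # Prior case: M = mu - xi
    X_pri = np.hstack([np.zeros((1, K)), np.ones((1, 1))])

    X = np.vstack([X_data, X_con, X_pri])
    Y = np.concatenate([y.reshape(-1), np.zeros(K), [M]])
    # Gamma is fully diagonal here: Var(eps) = sigma2, Var(delta) = tau2, Var(xi) = s2
    Gamma = np.diag(np.concatenate([np.full(K * J, sigma2),
                                    np.full(K, tau2), [s2]]))
    return Y, X, Gamma
```

Stacking the cases this way is all that is needed before applying the SMCMC updates described in Section 3.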

Nonlinear models cannot be expressed in the general form (7). However, we can usually represent some levels of the hierarchy as linear models, typically the levels corresponding to equations (4) and (5). We defer further discussion on this point to Section 4.

3 Structured MCMC for richly-parameterized linear models

3.1 Normal errors

For models with normal errors for the data, constraint, and prior cases, Gibbs sampling and (8) provide a multivariate method for generating draws from the marginal posterior of Θ. From (8), Θ has conditional posterior density

\[ \Theta \mid Y, X, \Gamma \sim N\left( (X'\Gamma^{-1}X)^{-1} X'\Gamma^{-1}Y, \; (X'\Gamma^{-1}X)^{-1} \right). \qquad (10) \]

If we use conjugate priors for the variance components (i.e., gamma or Wishart priors for their inverses), then the full conditional for the variance components factors into a product of gamma and/or Wishart distributions. Convergence for such samplers is virtually immediate; see Hodges (1998, Sec. 5) and the associated discussion by Wakefield (1998).

The constraint case formulation works by using, in each MCMC draw, all the information in the model and data about the posterior correlations among the elements of Θ. The MCMC algorithm outlined in the preceding paragraph has three key features:

1. It samples Θ, that is, all of the mean-structure parameters, as a single block.

2. It does so using information about the conditional posterior covariance of Θ, supplied by (a) the mean structure of the richly-parameterized model as expressed in the constraint-case formulation (in (10), this structure is captured by X), and (b) the covariance structure of the richly-parameterized model as expressed in the constraint-case formulation (in (10), this structure is captured by Γ).

3. It does so using the conditional distribution in (10), for suitable definitions of X and Γ.

We define a SMCMC algorithm for a richly-parameterized model to be any algorithm having these three features. For linear models with normal errors, the suitable values of X and Γ are those in (8) and (9). These permit at least three different implementations of a SMCMC algorithm. The conceptually simplest implementation is the blocked Gibbs sampler discussed above. However, although this Gibbs implementation may converge in few iterations, computing may be slow due to the need to invert a large matrix, X'Γ⁻¹X, at each iteration.

A second SMCMC implementation uses (10) as a candidate distribution in a Hastings independence chain algorithm. Here we might update Γ occasionally during the algorithm's pre-convergence "burn-in" period. Such a pilot adaptation scheme (Gilks et al. 1998) is simple but forces us to use a single (possibly imperfect) Γ for the post-convergence samples that are summarized for posterior inference.

A third SMCMC implementation for linear models with normal errors updates Γ continually at the algorithm's regeneration times. Regeneration times are points in the algorithm which divide the Markov chain into sections whose sample paths are independent. This allows adaptation of Γ to occur repeatedly without disturbing the chain's stationary distribution, or the consistency of point estimates made from its sampled values. Gilks, Roberts, and Sahu (1998) provide a straightforward method for identifying regeneration times in Hastings independence chains of the sort used by SMCMC. The pilot and regenerative adaptive schemes typically sacrifice a small amount of efficiency for a substantial saving in computing time compared to the Gibbs implementation of a SMCMC algorithm. This possibility is illustrated in the examples of Section 5.
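The core update shared by these implementations can be sketched as follows. This is a minimal illustration, not code from the paper: `log_post` is a hypothetical function returning the log of the target conditional density of Θ given the current variance components and data, and `rng` is a NumPy random Generator. In the normal-errors case, (10) is the exact full conditional, so the Hastings ratio is identically one and every candidate is accepted; in the non-normal and nonlinear settings of Sections 3.2 and 4 the same draw serves as an independence-chain proposal.

```python
import numpy as np
from scipy import stats

def smcmc_theta_update(theta, Y, X, Gamma, log_post, rng):
    """One Hastings independence-chain update of the mean-structure block,
    using the structured normal distribution (10) as the proposal."""
    Ginv = np.linalg.inv(Gamma)
    prec = X.T @ Ginv @ X                     # X' Gamma^{-1} X
    cov = np.linalg.inv(prec)
    mean = cov @ X.T @ Ginv @ Y
    proposal = stats.multivariate_normal(mean, cov, allow_singular=True)

    cand = proposal.rvs(random_state=rng)
    # Hastings ratio for an independence proposal q:
    # min{1, [pi(cand) q(theta)] / [pi(theta) q(cand)]}; log_post is assumed
    # to give the (unnormalized) log target density.
    log_ratio = (log_post(cand) - log_post(theta)
                 + proposal.logpdf(theta) - proposal.logpdf(cand))
    if np.log(rng.uniform()) < log_ratio:
        return cand, True
    return theta, False
```

In this sketch, the Gibbs implementation simply rebuilds Γ (and hence the proposal) from freshly drawn variance components at every iteration, the pilot-adaptive implementation freezes Γ after burn-in, and the regenerative implementation refreshes Γ only at regeneration times.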

3.2 Non-normal errors

An advantage of MCMC algorithms is that they expand the class of models a statistician can feasibly consider. Carlin and Polson (1991) and Evans and Swartz (1995) use Gibbs sampling with latent variables for non-normal error distributions that can be written as normal scale mixtures, such as the t, double exponential, and exponential power distributions. Using such auxiliary variables, these non-normal models can be written in the normal-errors form (7), making these models accessible to SMCMC algorithms like those described in Section 3.1. As for models with normal errors, at least three implementations of a SMCMC algorithm are possible. When this use of auxiliary variables is not available, a SMCMC algorithm must use a more general M-H implementation, and an appropriate candidate distribution may be obtained by writing the model as in (7) and using (10) as the candidate distribution. Of course, (10) is not the full conditional distribution, but it can still be a good candidate distribution for a Hastings independence chain.

4 Structured MCMC for richly-parameterized nonlinear models

By "nonlinear models," we mean models in which the outcome y is not linearly related to the parameters in the mean structure. The data cases of such models do not fit the form (7), but the constraint and prior cases can often be written as a linear model; we consider such models here. Because of the nonlinearity in the data cases, it is less straightforward to specify the "suitable definitions of X and Γ" needed for a SMCMC algorithm than it was for the linear models of Section 3. This section discusses specification of suitable X and Γ in nonlinear settings.

To implement a SMCMC algorithm in this setting, we must supply a linear structure approximating the data's contribution to the posterior. We do so by constructing artificial outcome data ỹ with the property that E(ỹ | Θ_1) is roughly equal to Θ_1.

It is only necessary to have rough equality because we only use ỹ in a M-H implementation to generate candidate draws for Θ. Specifically, for a given ỹ with covariance matrix V, we can write the approximate linear model

\[
\begin{bmatrix} \tilde{y} \\ 0 \\ M \end{bmatrix}
=
\begin{bmatrix} I & 0 \\ H_1 & H_2 \\ G_1 & G_2 \end{bmatrix}
\begin{bmatrix} \Theta_1 \\ \Theta_2 \end{bmatrix}
+
\begin{bmatrix} \nu \\ \delta \\ \xi \end{bmatrix}
\qquad (11)
\]

where ν ~ N(0, V). Equation (11) supplies the necessary X matrix for a SMCMC algorithm, and the necessary Γ uses Cov(ν) = V and the appropriate covariance matrices for δ and ξ in (9).

The artificial outcome data ỹ can be supplied by two general strategies, producing (respectively) "unshrunk estimates" and "low-shrinkage estimates" of Θ_1. The former strategy makes use of crude parameter estimates from the nonlinear part of the model (without the prior or constraint cases), provided they are available. For example, in the pharmacokinetic example of Section 5.2, we have five to eight observations per subject, so a standard nonlinear regression will give point estimates and associated variance estimates for the two parameters per subject. Such crude estimates are often available when the ratio of data elements to parameters is large and the constraint and prior cases are included solely to induce shrinkage in estimation. We call such estimates "unshrunk estimates," which they literally are, and use them and an estimate of their variance as ỹ and V, respectively.

In certain problems, however, such unshrunk estimates may be unstable, or the model may not be identified without constraint and prior cases. This is typically true when the ratio of data elements to parameters approaches 1. For example, in Section 5.3 we fit a model with only one observation per element of Θ_1, and the entire purpose of fitting the model is to shrink the individual parameters cleverly (or so we hope). In such cases the constraint and prior cases are added to the model not only to encourage shrinkage, but to ensure identifiability.

To obtain ỹ and V in these cases, we recommend running a simple univariate Metropolis algorithm for a small number of iterations, using prior distributions for the variance components that insist on little shrinkage. That is, use just enough prior information on the variances to identify Θ_1, but not enough to induce much shrinkage. The posterior mean of Θ_1 approximated by this algorithm is a type of artificial data ỹ which, when used with the constraint and prior cases in (11), approximates the nonlinear fit well enough for the present purpose. We call the results of this univariate algorithm "low shrinkage" estimates, and use them and an estimate of their variance as ỹ and V, respectively.

Often, setting the low shrinkage priors is straightforward: one can consider ranges of measurable values for the elements of Θ_1 and use priors whose sole purpose is to force those elements to stay in that range. Moreover, definitive convergence is generally not necessary in this preliminary chain; in our experience rough estimates provided a M-H candidate density suitable for a SMCMC algorithm. These preliminary chains tend to converge quickly for two related reasons. First, the elements of Θ_1 have low posterior correlations by construction. Second, the marginal posterior for each element of Θ_1 is determined almost entirely by the data to which it is directly related. Still, a careful choice of prior in the preliminary algorithm may be necessary to produce acceptable M-H candidates for the second-stage SMCMC algorithm.

Finally, in the common case of generalized linear mixed models (GLMMs), a simple transformation can often improve matters further. Suppose our model for individual i is η_i = x_i'β + z_i, where z_i is an individual-specific random effect, and η = g(μ), where g is the link function relating the linear predictor to the expected value of a data point y.

As is detailed by Besag et al. (1995) in the binomial case (logit link) and by Waller et al. (1997) in the Poisson case (log link), reparameterizing from (z_i, β) to (η_i, β) produces exactly a normal/inverse Wishart model conditional on η_i (assuming the usual conjugate prior structures). While equation (11) still requires artificial data ỹ_i, these are naturally obtained as rough estimates of the individual parameters η_i, a second benefit provided by the transformed scale.

5 Numerical Illustrations

5.1 Linear model with normal errors

A recent AIDS clinical trial, Community Programs for Clinical Research on AIDS (CPCRA) trial 002 (Abrams et al. 1994), compared didanosine (ddI) and zalcitabine (ddC) in patients with HIV infection who were intolerant to or had treatment failure on zidovudine (ZDV, also known as AZT). For the present purpose, the response variable Y_i for patient i is the change in CD4 count between baseline and the two-month follow-up, for the K patients who had both measurements. Three binary predictor variables and their interactions are of interest: treatment group (x_{1i}: 1 = ddC, 0 = ddI), reason for eligibility (x_{2i}: 1 = failed ZDV, 0 = intolerant to ZDV), and baseline Karnofsky score (x_{3i}: 1 = score > 85, 0 = score ≤ 85; higher scores are better). Following Sargent and Hodges (1997), consider the saturated model for this 2³ factorial design:

\[ Y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{3i} + \beta_{12} x_{1i} x_{2i} + \beta_{13} x_{1i} x_{3i} + \beta_{23} x_{2i} x_{3i} + \beta_{123} x_{1i} x_{2i} x_{3i} + \epsilon_i \qquad (12) \]

where ε_i ~ iid N(0, σ²), i = 1, ..., K. We use flat priors for the intercept β_0 and main effects (β_1, β_2, β_3), but place hierarchical constraints on the two- and three-way interaction terms, namely β_l ~ N(0, τ_l²) for l = 12, 13, 23, and 123. This linear model is easily written in the form (7).

Adopting a vague G(0.0001, 0.0001) prior for 1/σ² (i.e., having mean 1 and variance 10⁴) and independent G(1, 1) priors for the h_l ≡ 1/τ_l², l = 12, 13, 23, 123, completes the specification. Our model is fully conjugate, so BUGS handles it easily. Besides the univariate Gibbs sampler, BUGS allows blocking of the fixed effects (β_0, β_1, β_2, β_3) into a single vector, for which we specified a multivariate normal prior having near-zero precision. BUGS allows no further blocking. For a comparison based on speed alone, we also ran a univariate Gibbs sampler coded in Fortran.

For comparison, equation (10) yields a SMCMC implementation that alternately samples from the multivariate normal full conditional p(Θ | h, σ², y) and the gamma full conditionals of 1/σ² and the h_l. As previously mentioned, this Gibbs sampler is a SMCMC implementation that updates the candidate Γ at every iteration. We also consider a pilot-adaptive SMCMC implementation, in which we update Γ at iterations 1, ..., 10, then every 10th iteration until iteration 1000, and then use the value of Γ at iteration 1000 in the "production" run, discarding the 1000-iteration burn-in period. In this chain, p(Θ | h̃, σ̃², y) is used as a candidate distribution for Θ in a Hastings independence subchain, where h̃ and σ̃² are the components of Γ at iteration 1000.

The implementations outlined above illustrate the MCMC trade-off between short runtimes and low autocorrelation in the generated samples. An implementation that is fast per iteration may produce highly autocorrelated samples, which are less useful for posterior summarization. To make a fair comparison among the various implementations, we use the notion of effective sample size, or ESS (Kass et al. 1998, p. 99). ESS is defined for each parameter as the number of MCMC samples drawn, N, divided by the parameter's so-called autocorrelation time, κ = 1 + 2 Σ_{k≥1} ρ(k), where ρ(k) is the autocorrelation at lag k. We estimate κ using sample autocorrelations estimated from the MCMC chain, cutting off the summation when these drop below 0.1 in magnitude. The comparisons between chains presented here are not sensitive to the method of calculating the ESS.
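As a concrete illustration (ours, not from the paper), the ESS just described can be computed from a sampled chain along the following lines.

```python
import numpy as np

def effective_sample_size(chain):
    """ESS = N / kappa, with kappa = 1 + 2 * sum_k rho(k); the sum is cut off
    at the first lag whose sample autocorrelation drops below 0.1 in magnitude."""
    x = np.asarray(chain, dtype=float)
    n = len(x)
    x = x - x.mean()
    var = x @ x / n
    kappa = 1.0
    for k in range(1, n):
        rho = (x[:-k] @ x[k:]) / (n * var)
        if abs(rho) < 0.1:
            break
        kappa += 2.0 * rho
    return n / kappa
```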

For each chain, we obtain N = 5000 post-burn-in iterations, compute ESS values, and divide by the chain's runtime in seconds. The resulting "effective samples generated per second" (ES/sec) provides a fair basis for comparing chains.

[Table 1: Effective sample sizes (ESS) and effective samples drawn per second (ES/sec) for the CPCRA 002 example, for five implementations: BUGS (univariate), BUGS (partial blocking), univariate Gibbs (Fortran), SMCMC Gibbs (full blocking), and SMCMC (pilot adaptation).]

Table 1 shows the results for the five implementations. While we caution against putting too much stock in crude runtimes (which are machine-, compiler-, and programmer-dependent), we recorded the runtime in seconds needed by each implementation to obtain 5000 post-burn-in samples on an Ultra Sparc workstation; univariate BUGS took 3.4 seconds, and the fully blocked Gibbs SMCMC implementation was the slowest of the five. Pilot-adaptive SMCMC dominates the others in terms of generation rate (ES/sec), and the advantage is substantial. The BUGS chains are fast but produce highly autocorrelated samples, hurting their ESS. By contrast, SMCMC implemented as a fully blocked Gibbs sampler produces essentially uncorrelated draws, but at the cost of long runtimes because of repeated matrix inversions.

We have tried other SMCMC implementations in this setting, including updating Γ less frequently (say, the first 10 iterations, then only every 100th iteration until iteration 1000) or during a shorter pilot period (say, the first 100 iterations).

Both of these options produce results similar to those in the final column of Table 1, with little effect on speed. An alternative method for this problem would be to estimate the precision parameters via pilot adaptation, and then use plain importance sampling for Θ, using the structured distribution, equation (10), as the importance distribution.

As a cautionary note, we have experimented with more extreme priors for the h_l (e.g., a gamma having mean 1000 and standard deviation 10,000) and found that a SMCMC implementation with infrequent Γ updates during the pilot period performs less well. With this extreme prior, the posteriors for the h_l are very skewed and highly dispersed, so the values of the h_l at each update are not in any way typical of draws from the posterior; hence convergence of these chains suffers. We discuss this issue further in Section 6.

5.2 Nonlinear pharmacokinetic model

Wakefield et al. (1994) presented the data in Figure 1, plasma concentrations Y_ij of the drug Cadralazine at various times t_ij after administration of a single dose of 30 mg, in 10 heart failure patients. Here i = 1, ..., 10 indexes patients, while j = 1, ..., n_i indexes observations within patient, 5 ≤ n_i ≤ 8. Wakefield et al. (1994) suggested a "one-compartment" pharmacokinetic model in which the mean plasma concentration η_i(t_ij) at time t_ij is

\[ \eta_i(t_{ij}) = 30\, \phi_i^{-1} \exp(-\psi_i t_{ij} / \phi_i). \]

Later unpublished work by these authors suggests this model is best fit on the log scale. Defining Z_ij ≡ log Y_ij, the model for Z_ij is then

\[ Z_{ij} = \log 30 - a_i - \exp(b_i - a_i)\, t_{ij} + \epsilon_{ij} \]

where a_i = log φ_i, b_i = log ψ_i, and ε_ij ~ ind N(0, σ_i²).
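In this example the "unshrunk estimates" strategy of Section 4 applies directly: each patient's (a_i, b_i) can be estimated by a separate nonlinear regression on the log scale, and those estimates and their covariances supply ỹ and V for (11). The paper obtains these fits with SAS PROC NLIN; the sketch below (ours, with hypothetical names and arbitrary starting values) does the analogous computation with SciPy.

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.linalg import block_diag

def log_conc(t, a, b):
    """Log-scale one-compartment mean: log(30) - a - exp(b - a) * t."""
    return np.log(30.0) - a - np.exp(b - a) * t

def unshrunk_estimates(times, log_conc_obs, start=(1.0, -1.0)):
    """Fit each patient separately by nonlinear least squares to get unshrunk
    estimates (a_hat_i, b_hat_i) and their estimated covariances; stacked,
    these play the roles of y-tilde and V in (11).  `times` and `log_conc_obs`
    are lists with one array per patient; `start` holds arbitrary,
    illustrative starting values for the optimizer."""
    y_tilde, v_blocks = [], []
    for t, z in zip(times, log_conc_obs):
        est, cov = curve_fit(log_conc, t, z, p0=list(start))
        y_tilde.append(est)
        v_blocks.append(cov)
    # Patients are fit independently, so V is block diagonal in 2x2 blocks.
    return np.concatenate(y_tilde), block_diag(*v_blocks)
```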

[Figure 1: Cadralazine concentration pharmacokinetic data: (a) original scale; (b) log scale.]

Following the analysis by Wakefield et al. (1994), we assume the subject-specific effects θ_i ≡ (a_i, b_i)' are i.i.d. N(μ, Σ), where μ = (μ_a, μ_b)'. These authors recommend, and we use, the usual conjugate priors, namely μ ~ N(μ_0, C), gamma priors for the σ_i^{-2}, and Σ^{-1} ~ Wishart((ρR)^{-1}, ρ). Our SMCMC implementations for these data used prior values recommended by Wakefield et al. (1994), including μ_0 = 0, C^{-1} = Diag(0.01, 0.01), and R = Diag(0.04, 0.04).

Any MCMC algorithm for this model needs a Metropolis-Hastings step for the random effects θ_i because their full conditional distributions are neither conjugate nor necessarily log-concave. BUGS Version 0.6 for UNIX allows such steps, using Metropolis rejection from a grid-based proposal distribution (Ritter and Tanner 1992); see Spiegelhalter et al. (1996). This form of proposal requires us to place bounds on the individual elements of θ_i, but we can create a BUGS specification quite close to the MCMC specification above by using the product formulation of the bivariate normal, namely a_i ~ N(μ_a, σ_a²) I(L_a, U_a) and b_i | a_i ~ N(k_0 + k_1(a_i − c), σ_b²) I(L_b, U_b), where (L_a, U_a) and (L_b, U_b) are broad truncation regions to enable the grid-based Metropolis algorithm, and c is a constant that roughly centers the a_i's (thus reducing correlation between the intercept k_0 and slope k_1).

In this formulation, we approximate the SMCMC specification by taking G(0.0001, 0.0001) priors for the σ_i^{-2}, N(0, 0.0001) priors for μ_a, k_0, and k_1, and G(1, 0.04) priors for σ_a^{-2} and σ_b^{-2}.

We considered four MCMC implementations: BUGS Version 0.6, a standard univariate Metropolis implementation, and two SMCMC implementations. The first SMCMC implementation used unshrunk estimates of the θ_i obtained by fitting separate models to each patient using SAS PROC NLIN. We used these unshrunk estimates and estimates of their variances as ỹ and V in (11), and implemented SMCMC as described in Section 4. The second SMCMC implementation used low-shrinkage estimates from a preliminary univariate Metropolis run. Low-shrinkage estimates are not required here, because we have at least 5 observations per patient; we include them for illustration and comparison. To obtain low-shrinkage estimates, we used a univariate Metropolis chain as above except that we fixed Σ, with its elements determined by a single value v, and left μ to be estimated. The plausible ranges of the a_i and b_i span only a few units, suggesting that v = 1 might be appropriate. We ran a univariate Metropolis algorithm with this v for 5000 iterations to produce low-shrinkage estimates for the (a_i, b_i) pairs, using the posterior means of a_i and b_i for ỹ and the posterior covariance matrix of the (a_i, b_i) pairs for V in equation (11). The two SMCMC implementations updated Γ at iterations 1, ..., 10, then at every 10th iteration until iteration 1000, with the value of Γ at iteration 1000 used to set the candidate distribution for the production run.

Table 2 shows the ESS and effective samples per second for the four chains. Each chain was run for 5000 post-burn-in iterations.

[Table 2: Effective sample sizes (ESS) and effective samples drawn per second (ES/s) for MCMC algorithms for the PK data: BUGS V0.6 (univariate), Metropolis (univariate), SMCMC with unshrunk ỹ (pilot adaptation), and SMCMC with low-shrinkage ỹ (pilot adaptation).]

SMCMC with pilot adaptation does better than all the competitors except for the b parameters. The univariate Metropolis algorithm took 8.5 seconds and SMCMC with pilot adaptation took 1.0 seconds to produce the 5000 post-burn-in samples; the BUGS runtime was longer than SMCMC's. The SAS PROC NLIN run (required for SMCMC with unshrunk ỹ) used a negligible amount of CPU time (approximately 0.3 CPU seconds). These results again highlight the trade-off between the efficiency and speed of each iteration: BUGS produced less correlated samples (larger ESS), but when we account for runtimes, SMCMC's greater speed prevails.

We see at least two reasons why SMCMC did not have a larger advantage compared to BUGS in this problem. First, the 20-dimensional joint posterior of the {(a_i, b_i)} may be non-normal. However, a thorough investigation of this distribution revealed no gross departures from normality. Thus, it appears that SMCMC's modest improvement over BUGS arises because SMCMC exploits posterior correlation, but the constraint cases induce little correlation in the posteriors of the mean-structure parameters.

As Figure 1 shows, each patient (except #8) has data more or less on a straight line on the log scale, so neither the a_i nor the b_i will shrink much. There is high correlation within the (a_i, b_i) pairs, and SMCMC and BUGS provide substantial benefit in ESS over the univariate Metropolis algorithm by using this correlation. But once BUGS' runtime is taken into account, the SMCMC algorithm based on the unshrunk ỹ emerges as the clear winner (ES/s columns in Table 2).

5.3 Cox model with time-varying coefficients

Kalbfleisch and Prentice (1980) gave data from a clinical trial in which 137 men with cancer were randomized between experimental chemotherapy and a standard treatment; five covariates were measured. The endpoint was time to death in days; there were K unique death times. Previous analyses (Lin 1991; Grambsch and Therneau 1994) showed no significant treatment effect, but strong evidence of a non-proportional effect of one covariate, Karnofsky score, which measures a patient's functional status, ranging from 100 (normal) to zero (dead). We revisit these data to demonstrate SMCMC in a relatively difficult problem, one that BUGS cannot handle.

In the Cox proportional hazards model (Cox 1972), covariates have constant multiplicative effects on the hazard function λ(t; x), where the hazard function for an individual's death time T_i, given the individual's covariate vector x_i, is given by

\[ \lambda(t; x_i) = \lim_{\Delta t \to 0} \frac{\Pr[\,T_i \in (t, t + \Delta t) \mid T_i \ge t, x_i\,]}{\Delta t}. \]

Consider instead a model that allows a covariate's coefficient to take a different value at each unique event time, but which smooths these coefficients through a simple random-walk smoother (Sargent 1997). This smoother assumes that the difference between the coefficient values at adjacent event times t_i and t_{i−1}, i.e., β_i − β_{i−1}, is drawn from a distribution with mean zero and variance τ²(t_i − t_{i−1}).

Specifically, for patient j at time t_i, the model is

\[ \lambda(t_i; x_j) = \lambda_0(t_i) \exp(x_j \beta_i), \qquad \beta_i - \beta_{i-1} \sim N\left(0,\; \tau^2 (t_i - t_{i-1})\right), \]

where x_j is the Karnofsky score for patient j and β_i is specific to t_i. This smoother induces very high posterior correlations between adjacent β_i's, causing great difficulty for univariate MCMC implementations.

Sargent (1997) analyzed these data using a Cox model with a time-varying coefficient for Karnofsky score, i.e., with a separate β_i at each of the K unique death times. He used a gamma prior for h = 1/τ² with mean and standard deviation 10⁵, and a univariate Metropolis algorithm. This algorithm converged very slowly: posterior summaries were based on three chains of 50,000 iterations each, which required approximately 3.5 hours of computer time. The long chains were necessitated by extremely slow mixing; for example, the median lag 8 sample autocorrelation over the β_i remained very large. Attempts to use a multivariate Metropolis algorithm, using a candidate distribution with covariance matrix based on an estimate of the within-chain covariance obtained from a pilot adaptive scheme, had no success due to extremely high posterior correlations among the β_i. Even when the chains were started in the needle-thin ellipsoid of high probability, computations with the covariance matrix were unstable because it was nearly singular.

To make a SMCMC algorithm, we need artificial data ỹ_i with expectations that are roughly β_i. Unshrunk estimates are not available here: if we fit separate β_i for each unique event time without smoothing, many β_i have an unbounded posterior mode. Instead, we obtained low-shrinkage estimates by running a small number of iterations of a univariate algorithm for the model that smooths the β_i, but using a prior for τ² that forces little smoothing.

We then used the posterior means of the β_i from this algorithm as our low-shrinkage estimates. This avoids the usual problem of univariate algorithms because the low-smoothing prior forces low posterior correlations among the β_i. In this case, we used a gamma prior for h with mean and standard deviation 1/10. (Means and standard deviations smaller by up to 4 orders of magnitude do not change the performance of our SMCMC implementations.) Define ỹ_i and σ_i² to be the posterior mean and variance for each β_i from this univariate low-shrinkage run. With these low-shrinkage estimates, we used this linear model in a SMCMC algorithm:

\[
\begin{bmatrix} \tilde{y} \\ 0_{(K-1)} \end{bmatrix}
=
\begin{bmatrix} I_{K \times K} \\ D \end{bmatrix}
\beta + E
\qquad (13)
\]

where β = (β_1, ..., β_K)', D is the (K−1) × K matrix whose ith row has a −1 in column i and a 1 in column i+1 (so that Dβ collects the adjacent differences β_{i+1} − β_i), and E is a (2K−1)-vector with mean zero and diagonal covariance matrix, the first K diagonal elements being σ_i² and the last K−1 elements being τ²(t_{i+1} − t_i). In the language of Section 2, equation (13) is a special case of (7) with no prior cases.
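A short sketch (ours, with hypothetical names) of how the pieces of (13) fit together: `y_tilde` and `s2` are the low-shrinkage posterior means and variances just defined, `event_times` are the K unique death times, and `tau2` is the current value of τ².

```python
import numpy as np

def cox_tvc_design(y_tilde, s2, event_times, tau2):
    """Assemble Y, X, Gamma for equation (13): artificial data y-tilde with
    variances s2 over an identity block, plus K-1 random-walk constraint
    cases whose variances are tau2 * (t_{i+1} - t_i)."""
    K = len(y_tilde)
    D = np.zeros((K - 1, K))
    for i in range(K - 1):
        D[i, i], D[i, i + 1] = -1.0, 1.0   # encodes beta_{i+1} - beta_i
    X = np.vstack([np.eye(K), D])
    Y = np.concatenate([np.asarray(y_tilde, dtype=float), np.zeros(K - 1)])
    gaps = np.diff(np.asarray(event_times, dtype=float))
    Gamma = np.diag(np.concatenate([np.asarray(s2, dtype=float), tau2 * gaps]))
    return Y, X, Gamma
```

The resulting Y, X, and Γ can then be plugged into the candidate distribution (10), exactly as in the linear case.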

Figure 2 displays ES/sec for selected parameters for three MCMC implementations using this model. The first algorithm is the simple univariate Metropolis algorithm mentioned above; we note that to obtain convergence with this algorithm a run of 50,000 iterations was necessary, which in the figure we have normalized to 5000 iterations by dividing the ESS by 10. In an attempt to improve on these results, we considered three SMCMC implementations. First, we used an implementation that used the value of Γ at each iteration as the basis for the M-H proposal distribution. A chain of 5000 iterations of this implementation is shown in Figure 2 with the label "SMCMC update every iteration."

[Figure 2: Effective samples drawn per second (ES/sec) for three algorithms for the time-varying coefficient Cox example.]

This chain converged quickly (i.e., it had excellent convergence diagnostics within a few thousand iterations); however, its run time was elevated because the proposal distribution was re-computed at each iteration. In an attempt to speed run time, we tried other implementations to avoid this re-computation. The first attempt was a pilot-adaptive implementation, updating Γ every 10th iteration for the first 1000 iterations, but this chain converged poorly. In this problem, the posterior standard deviation of h = 1/τ² is sufficiently large (on the order of 10⁵) that no single value of Γ gives consistently good M-H proposals. The second SMCMC implementation used the adaptive approach of Gilks et al. (1998), where the proposal distribution (10) is updated at the chain's regeneration times with the current value of Γ. This chain converged moderately quickly (apparent convergence after 5000 iterations), with regenerations occurring frequently enough to allow Γ to be refreshed often. The results from this chain are shown in Figure 2 with the label "SMCMC adaptive."

The run times per 5000 iterations for the three chains shown in Figure 2 were: univariate Metropolis, roughly 7 minutes (about 70 minutes for 50,000 iterations, consistent with the 3.5 hours reported above for three such chains); SMCMC updating every iteration, 45 minutes; and SMCMC adaptive, 5 minutes. Based on the data in Figure 2, both SMCMC implementations provided substantial improvements in ES/sec compared to the univariate algorithm.

6 Conclusions

We have presented a general method of MCMC computing for richly-parameterized models. Based on Hodges (1998), SMCMC uses linear structure implied by the model and data to suggest multivariate candidate distributions for a M-H algorithm. This speeds computing by avoiding problems created by high posterior correlations and by requiring fewer likelihood evaluations. We have demonstrated our approach with both linear and nonlinear examples. While these examples are far too complicated for analytical evaluation of convergence, investigations in simpler settings are available and instructive; see e.g. Liu et al. (1994, Sec. 5) and Roberts and Sahu (1997).

SMCMC improves convergence by blocking parameters with high posterior correlations, specifically by exploiting structure induced in the mean-structure parameters by the constraint cases. In our experience, in linear models this structure is ample enough that SMCMC pays dividends. In nonlinear cases, however, matters are less clear, and we have found examples in which SMCMC is no better than univariate algorithms. The difficulty appears to have two sources. First, SMCMC needs artificial data cases, either directly from individual-level data or from a preliminary M-H algorithm with a low-shrinkage prior. Poor selection of these artificial data cases can hamper a SMCMC algorithm. For example, in the pharmacokinetic problem of Section 5.2, we originally created low-shrinkage estimates by running a univariate M-H algorithm with a Σ fixed to have large variances but zero correlation.

This choice restricted the very posterior correlations (in this case, between a_i and b_i for each i) that SMCMC exploits. The resulting SMCMC implementation was no better than a univariate algorithm, but a better choice of low-shrinkage estimates, shown in Section 5.2, led to a SMCMC implementation with superior ESS compared to the univariate algorithm.

A second example where SMCMC may provide no advantage is when Θ_1 is of high dimension and V is not diagonal. In these cases, equation (11) can require the user to manipulate huge matrices. In one longitudinal binary-outcome problem, Θ_1 had 541 elements and SMCMC was actually slower than the univariate alternative because of the matrix manipulations. In some cases, it may be possible to avoid this problem by using special structure in the design and/or covariance matrices.

If a hierarchical model induces little shrinkage, SMCMC will have little structure to exploit in creating candidate densities. This situation can arise either because the constraint cases cannot induce exploitable structure, or more often because the constraint-case variances happen to be so large that the constraint cases induce little shrinkage. This happened in a frailty model we fit, treating the log frailties as a random effect: convergence of a SMCMC implementation took as many iterations as the univariate alternative, although run time was shorter because the SMCMC implementation made fewer likelihood evaluations. As a general approach, we recommend first attempting to use BUGS or a simple univariate algorithm to simulate draws from the joint posterior. If these approaches suffer from poor convergence or lengthy runtimes, SMCMC may provide substantial improvement.

Two apparently reasonable suggestions do not improve SMCMC algorithms. First, one might consider replacing the mean of (10) with the current location of the chain, i.e., a pure Metropolis instead of an M-H approach. This strategy radically slowed convergence in several examples, including those in Sections 5.1 and 5.3. This might be because convergence of the pure Metropolis form relies solely on the mixing of the Markov chain, while the M-H form is similar to importance sampling when the proposal and target densities are nearly alike.

Secondly, one might suggest using the mean of the covariance matrix obtained from a pilot adaptive scheme as the covariance matrix in (10). Again, this suggestion has proven detrimental to convergence in several of the examples considered here. The failure of both of these suggestions may also arise because in essence SMCMC works by mimicking a Gibbs sampler; substituting the chain's current location for the mean of (10), or using the covariance matrix from a pilot adaptive scheme, weakens this analogy.

Pilot-adaptive SMCMC implementations may be unsatisfactory in cases where the posterior distributions of the variance components are highly dispersed. Thus, such implementations are only recommended in cases where the analyst can feel comfortable that the values for the components of Γ at a given iteration are in some sense typical of the posterior distribution. When this is not the case, an adaptive implementation using regeneration points (Gilks et al. 1998) has proven helpful.

In summary, SMCMC appears very competitive with univariate algorithms when they are available, and can offer a feasible solution in harder problems. We have had success using SMCMC algorithms in some problems that were otherwise infeasible, such as complex hierarchical proportional hazards models. Further investigation of SMCMC and other MCMC blocking methods is warranted, so more experience can be gained in choosing good strategies in particular situations and in designing general-purpose SMCMC algorithms for large classes of problems.

Acknowledgments

The second author was supported in part by the Minnesota Oral Health Clinical Research Center, NIH/NIDR P30-DE093, while the third author was supported in part by National Institute of Allergy and Infectious Diseases (NIAID) Grant R01-AI419. Jon Wakefield graciously supplied us with the data used in Section 5.2.

We are grateful for the assistance of three diligent referees whose comments led to substantial improvements in the manuscript.

References

Abrams, D.I., Goldman, A.I., Launer, C., et al. (1994), "Comparative Trial of Didanosine and Zalcitabine in Patients with Human Immunodeficiency Virus Infection Who Are Intolerant of or Have Failed Zidovudine Therapy," New England Journal of Medicine, 330.

Besag, J. and Green, P.J. (1993), "Spatial Statistics and Bayesian Computation" (with discussion), Journal of the Royal Statistical Society, Series B, 55, 25-37.

Besag, J., Green, P., Higdon, D., and Mengersen, K. (1995), "Bayesian Computation and Stochastic Systems" (with discussion), Statistical Science, 10, 3-66.

Besag, J., York, J.C., and Mollié, A. (1991), "Bayesian Image Restoration, with Two Applications in Spatial Statistics" (with discussion), Annals of the Institute of Statistical Mathematics, 43, 1-59.

Bryk, A.S. and Raudenbush, S.W. (1992), Hierarchical Linear Models: Applications and Data Analysis Methods, Newbury Park, CA: Sage Publications.

Carlin, B.P. and Louis, T.A. (1996), Bayes and Empirical Bayes Methods for Data Analysis, London: Chapman and Hall.

Carlin, B.P. and Polson, N.G. (1991), "Inference for Nonconjugate Bayesian Models Using the Gibbs Sampler," Canadian Journal of Statistics, 19, 399-405.

Cowles, M.K. and Carlin, B.P. (1996), "Markov Chain Monte Carlo Convergence Diagnostics: A Comparative Review," Journal of the American Statistical Association, 91, 883-904.

Cox, D.R. (1972), "Regression Models and Life-Tables" (with discussion), Journal of the Royal Statistical Society, Series B, 34, 187-220.

Damien, P., Wakefield, J., and Walker, S. (1999), "Gibbs Sampling for Bayesian Nonconjugate and Hierarchical Models Using Auxiliary Variables," to appear, Journal of the Royal Statistical Society, Series B.

Evans, M. and Swartz, T. (1995), "Methods for Approximating Integrals in Statistics with Special Emphasis on Bayesian Integration Problems," Statistical Science, 10, 254-272.

Gelfand, A.E., Mallick, B.K., and Polasek, W. (1997), "Broken Biological Size Relationships: A Truncated Semiparametric Regression Approach with Measurement Error," Journal of the American Statistical Association, 92, 836-845.

Gelfand, A.E. and Sahu, S.K. (1994), "On Markov Chain Monte Carlo Acceleration," Journal of Computational and Graphical Statistics, 3, 261-276.

Gelfand, A.E., Sahu, S.K., and Carlin, B.P. (1995), "Efficient Parametrisations for Normal Linear Mixed Models," Biometrika, 82, 479-488.

Gelman, A. and Rubin, D.B. (1992), "Inference from Iterative Simulation Using Multiple Sequences" (with discussion), Statistical Science, 7, 457-511.

Geyer, C.J. and Thompson, E.A. (1995), "Annealing Markov Chain Monte Carlo with Applications to Ancestral Inference," Journal of the American Statistical Association, 90, 909-920.

Gilks, W.R. and Roberts, G.O. (1996), "Strategies for Improving MCMC," in Markov Chain Monte Carlo in Practice, eds. W.R. Gilks, S. Richardson, and D.J. Spiegelhalter, London: Chapman and Hall, pp. 89-114.

Gilks, W.R., Roberts, G.O., and Sahu, S.K. (1998), "Adaptive Markov Chain Monte Carlo through Regeneration," Journal of the American Statistical Association, 93, 1045-1054.

Grambsch, P.M. and Therneau, T.M. (1994), "Proportional Hazards Tests and Diagnostics Based on Weighted Residuals," Biometrika, 81, 515-526.

Hastings, W.K. (1970), "Monte Carlo Sampling Methods Using Markov Chains and Their Applications," Biometrika, 57, 97-109.

Hills, S.E. and Smith, A.F.M. (1992), "Parametrization Issues in Bayesian Inference," in Bayesian Statistics 4, eds. J.M. Bernardo, J.O. Berger, A.P. Dawid, and A.F.M. Smith, Oxford: Oxford University Press.

Hodges, J.S. (1998), "Some Algebra and Geometry for Hierarchical Models, Applied to Diagnostics" (with discussion), Journal of the Royal Statistical Society, Series B, 60, 497-536.

Kalbfleisch, J.D. and Prentice, R.L. (1980), The Statistical Analysis of Failure Time Data, New York: Wiley.

Kass, R.E., Carlin, B.P., Gelman, A., and Neal, R. (1998), "Markov Chain Monte Carlo in Practice: A Roundtable Discussion," The American Statistician, 52, 93-100.

Lin, D.Y. (1991), "Goodness-of-Fit Analysis for the Cox Regression Model Based on a Class of Parameter Estimators," Journal of the American Statistical Association, 86, 725-728.

Lindley, D.V. and Smith, A.F.M. (1972), "Bayes Estimates for the Linear Model" (with discussion), Journal of the Royal Statistical Society, Series B, 34, 1-41.

Liu, J.S. (1994), "The Collapsed Gibbs Sampler in Bayesian Computations with Applications to a Gene Regulation Problem," Journal of the American Statistical Association, 89, 958-966.

Liu, J.S., Wong, W.H., and Kong, A. (1994), "Covariance Structure of the Gibbs Sampler with Applications to the Comparisons of Estimators and Augmentation Schemes," Biometrika, 81, 27-40.

Mengersen, K.L., Robert, C.P., and Guihenneuc-Jouyaux, C. (1999), "MCMC Convergence Diagnostics: A 'RevieWWW'" (with discussion), to appear in Bayesian Statistics 6, eds. J.M. Bernardo, J.O. Berger, A.P. Dawid, and A.F.M. Smith, Oxford: Oxford University Press.

Mira, A. and Tierney, L. (1997), "On the Use of Auxiliary Variables in Markov Chain Monte Carlo Sampling," technical report, School of Statistics, University of Minnesota.

Neal, R.M. (1996), "Sampling from Multimodal Distributions Using Tempered Transitions," Statistics and Computing, 6, 353-366.

Ritter, C. and Tanner, M.A. (1992), "Facilitating the Gibbs Sampler: The Gibbs Stopper and the Griddy-Gibbs Sampler," Journal of the American Statistical Association, 87, 861-868.

Roberts, G.O. and Sahu, S.K. (1997), "Updating Schemes, Correlation Structure, Blocking and Parameterization for the Gibbs Sampler," Journal of the Royal Statistical Society, Series B, 59, 291-317.

Roberts, G.O. and Smith, A.F.M. (1993), "Simple Conditions for the Convergence of the Gibbs Sampler and Metropolis-Hastings Algorithms," Stochastic Processes and Their Applications, 49, 207-216.

Roberts, G.O. and Tweedie, R.L. (1996), "Geometric Convergence and Central Limit Theorems for Multidimensional Hastings and Metropolis Algorithms," Biometrika, 83, 95-110.

Sargent, D.J. (1997), "A Flexible Approach to Time-Varying Coefficients in the Cox Regression Setting," Lifetime Data Analysis, 3, 13-25.

Sargent, D.J. and Hodges, J.S. (1997), "Smoothed ANOVA with Application to Subgroup Analysis," research report, Division of Biostatistics, University of Minnesota.

Searle, S.R., Casella, G., and McCulloch, C.E. (1992), Variance Components, New York: Wiley.

Spiegelhalter, D.J., Thomas, A., Best, N., and Gilks, W.R. (1995a), "BUGS: Bayesian Inference Using Gibbs Sampling, Version 0.50," technical report, Medical Research Council Biostatistics Unit, Institute of Public Health, Cambridge University.

Spiegelhalter, D.J., Thomas, A., Best, N., and Gilks, W.R. (1995b), "BUGS Examples, Version 0.50," technical report, Medical Research Council Biostatistics Unit, Institute of Public Health, Cambridge University.

Spiegelhalter, D.J., Thomas, A., Best, N., and Gilks, W.R. (1996), "BUGS 0.6: Bayesian Inference Using Gibbs Sampling (Addendum to Manual)," technical report, Medical Research Council Biostatistics Unit, Institute of Public Health, Cambridge University.

Swendsen, R.H. and Wang, J.-S. (1987), "Nonuniversal Critical Dynamics in Monte Carlo Simulations," Physical Review Letters, 58, 86-88.

Tierney, L. (1994), "Markov Chains for Exploring Posterior Distributions" (with discussion), Annals of Statistics, 22, 1701-1762.

Wakefield, J.C. (1998), Discussion of "Some Algebra and Geometry for Hierarchical Models, Applied to Diagnostics," by J.S. Hodges, Journal of the Royal Statistical Society, Series B, 60.

Wakefield, J.C., Smith, A.F.M., Racine-Poon, A., and Gelfand, A.E. (1994), "Bayesian Analysis of Linear and Non-linear Population Models by Using the Gibbs Sampler," Applied Statistics, 43, 201-221.

Waller, L.A., Carlin, B.P., Xia, H., and Gelfand, A.E. (1997), "Hierarchical Spatio-temporal Mapping of Disease Rates," Journal of the American Statistical Association, 92, 607-617.

West, M. and Harrison, J. (1989), Bayesian Forecasting and Dynamic Models, New York: Springer-Verlag.


Markov Chain Monte Carlo A Contribution to the Encyclopedia of Environmetrics

Markov Chain Monte Carlo A Contribution to the Encyclopedia of Environmetrics Markov Chain Monte Carlo A Contribution to the Encyclopedia of Environmetrics Galin L. Jones and James P. Hobert Department of Statistics University of Florida May 2000 1 Introduction Realistic statistical

More information

Metropolis-Hastings Algorithm

Metropolis-Hastings Algorithm Strength of the Gibbs sampler Metropolis-Hastings Algorithm Easy algorithm to think about. Exploits the factorization properties of the joint probability distribution. No difficult choices to be made to

More information

Rank Regression with Normal Residuals using the Gibbs Sampler

Rank Regression with Normal Residuals using the Gibbs Sampler Rank Regression with Normal Residuals using the Gibbs Sampler Stephen P Smith email: hucklebird@aol.com, 2018 Abstract Yu (2000) described the use of the Gibbs sampler to estimate regression parameters

More information

A Nonparametric Approach Using Dirichlet Process for Hierarchical Generalized Linear Mixed Models

A Nonparametric Approach Using Dirichlet Process for Hierarchical Generalized Linear Mixed Models Journal of Data Science 8(2010), 43-59 A Nonparametric Approach Using Dirichlet Process for Hierarchical Generalized Linear Mixed Models Jing Wang Louisiana State University Abstract: In this paper, we

More information

Prerequisite: STATS 7 or STATS 8 or AP90 or (STATS 120A and STATS 120B and STATS 120C). AP90 with a minimum score of 3

Prerequisite: STATS 7 or STATS 8 or AP90 or (STATS 120A and STATS 120B and STATS 120C). AP90 with a minimum score of 3 University of California, Irvine 2017-2018 1 Statistics (STATS) Courses STATS 5. Seminar in Data Science. 1 Unit. An introduction to the field of Data Science; intended for entering freshman and transfers.

More information

Markov Chain Monte Carlo

Markov Chain Monte Carlo Markov Chain Monte Carlo Recall: To compute the expectation E ( h(y ) ) we use the approximation E(h(Y )) 1 n n h(y ) t=1 with Y (1),..., Y (n) h(y). Thus our aim is to sample Y (1),..., Y (n) from f(y).

More information

Comparison of Three Calculation Methods for a Bayesian Inference of Two Poisson Parameters

Comparison of Three Calculation Methods for a Bayesian Inference of Two Poisson Parameters Journal of Modern Applied Statistical Methods Volume 13 Issue 1 Article 26 5-1-2014 Comparison of Three Calculation Methods for a Bayesian Inference of Two Poisson Parameters Yohei Kawasaki Tokyo University

More information

POSTERIOR ANALYSIS OF THE MULTIPLICATIVE HETEROSCEDASTICITY MODEL

POSTERIOR ANALYSIS OF THE MULTIPLICATIVE HETEROSCEDASTICITY MODEL COMMUN. STATIST. THEORY METH., 30(5), 855 874 (2001) POSTERIOR ANALYSIS OF THE MULTIPLICATIVE HETEROSCEDASTICITY MODEL Hisashi Tanizaki and Xingyuan Zhang Faculty of Economics, Kobe University, Kobe 657-8501,

More information

Stat 451 Lecture Notes Markov Chain Monte Carlo. Ryan Martin UIC

Stat 451 Lecture Notes Markov Chain Monte Carlo. Ryan Martin UIC Stat 451 Lecture Notes 07 12 Markov Chain Monte Carlo Ryan Martin UIC www.math.uic.edu/~rgmartin 1 Based on Chapters 8 9 in Givens & Hoeting, Chapters 25 27 in Lange 2 Updated: April 4, 2016 1 / 42 Outline

More information

Summary of Talk Background to Multilevel modelling project. What is complex level 1 variation? Tutorial dataset. Method 1 : Inverse Wishart proposals.

Summary of Talk Background to Multilevel modelling project. What is complex level 1 variation? Tutorial dataset. Method 1 : Inverse Wishart proposals. Modelling the Variance : MCMC methods for tting multilevel models with complex level 1 variation and extensions to constrained variance matrices By Dr William Browne Centre for Multilevel Modelling Institute

More information

On a multivariate implementation of the Gibbs sampler

On a multivariate implementation of the Gibbs sampler Note On a multivariate implementation of the Gibbs sampler LA García-Cortés, D Sorensen* National Institute of Animal Science, Research Center Foulum, PB 39, DK-8830 Tjele, Denmark (Received 2 August 1995;

More information

MCMC algorithms for fitting Bayesian models

MCMC algorithms for fitting Bayesian models MCMC algorithms for fitting Bayesian models p. 1/1 MCMC algorithms for fitting Bayesian models Sudipto Banerjee sudiptob@biostat.umn.edu University of Minnesota MCMC algorithms for fitting Bayesian models

More information

Geometric ergodicity of the Bayesian lasso

Geometric ergodicity of the Bayesian lasso Geometric ergodicity of the Bayesian lasso Kshiti Khare and James P. Hobert Department of Statistics University of Florida June 3 Abstract Consider the standard linear model y = X +, where the components

More information

Multilevel Statistical Models: 3 rd edition, 2003 Contents

Multilevel Statistical Models: 3 rd edition, 2003 Contents Multilevel Statistical Models: 3 rd edition, 2003 Contents Preface Acknowledgements Notation Two and three level models. A general classification notation and diagram Glossary Chapter 1 An introduction

More information

Alternative implementations of Monte Carlo EM algorithms for likelihood inferences

Alternative implementations of Monte Carlo EM algorithms for likelihood inferences Genet. Sel. Evol. 33 001) 443 45 443 INRA, EDP Sciences, 001 Alternative implementations of Monte Carlo EM algorithms for likelihood inferences Louis Alberto GARCÍA-CORTÉS a, Daniel SORENSEN b, Note a

More information

A Search and Jump Algorithm for Markov Chain Monte Carlo Sampling. Christopher Jennison. Adriana Ibrahim. Seminar at University of Kuwait

A Search and Jump Algorithm for Markov Chain Monte Carlo Sampling. Christopher Jennison. Adriana Ibrahim. Seminar at University of Kuwait A Search and Jump Algorithm for Markov Chain Monte Carlo Sampling Christopher Jennison Department of Mathematical Sciences, University of Bath, UK http://people.bath.ac.uk/mascj Adriana Ibrahim Institute

More information

University of Toronto Department of Statistics

University of Toronto Department of Statistics Norm Comparisons for Data Augmentation by James P. Hobert Department of Statistics University of Florida and Jeffrey S. Rosenthal Department of Statistics University of Toronto Technical Report No. 0704

More information

Default Priors and Effcient Posterior Computation in Bayesian

Default Priors and Effcient Posterior Computation in Bayesian Default Priors and Effcient Posterior Computation in Bayesian Factor Analysis January 16, 2010 Presented by Eric Wang, Duke University Background and Motivation A Brief Review of Parameter Expansion Literature

More information

Bayesian data analysis in practice: Three simple examples

Bayesian data analysis in practice: Three simple examples Bayesian data analysis in practice: Three simple examples Martin P. Tingley Introduction These notes cover three examples I presented at Climatea on 5 October 0. Matlab code is available by request to

More information

Approaches for Multiple Disease Mapping: MCAR and SANOVA

Approaches for Multiple Disease Mapping: MCAR and SANOVA Approaches for Multiple Disease Mapping: MCAR and SANOVA Dipankar Bandyopadhyay Division of Biostatistics, University of Minnesota SPH April 22, 2015 1 Adapted from Sudipto Banerjee s notes SANOVA vs MCAR

More information

Frailty Modeling for Spatially Correlated Survival Data, with Application to Infant Mortality in Minnesota By: Sudipto Banerjee, Mela. P.

Frailty Modeling for Spatially Correlated Survival Data, with Application to Infant Mortality in Minnesota By: Sudipto Banerjee, Mela. P. Frailty Modeling for Spatially Correlated Survival Data, with Application to Infant Mortality in Minnesota By: Sudipto Banerjee, Melanie M. Wall, Bradley P. Carlin November 24, 2014 Outlines of the talk

More information

STAT 425: Introduction to Bayesian Analysis

STAT 425: Introduction to Bayesian Analysis STAT 425: Introduction to Bayesian Analysis Marina Vannucci Rice University, USA Fall 2017 Marina Vannucci (Rice University, USA) Bayesian Analysis (Part 2) Fall 2017 1 / 19 Part 2: Markov chain Monte

More information

Bayesian time series classification

Bayesian time series classification Bayesian time series classification Peter Sykacek Department of Engineering Science University of Oxford Oxford, OX 3PJ, UK psyk@robots.ox.ac.uk Stephen Roberts Department of Engineering Science University

More information

Bayesian Inference for the Multivariate Normal

Bayesian Inference for the Multivariate Normal Bayesian Inference for the Multivariate Normal Will Penny Wellcome Trust Centre for Neuroimaging, University College, London WC1N 3BG, UK. November 28, 2014 Abstract Bayesian inference for the multivariate

More information

Bayesian Inference on Joint Mixture Models for Survival-Longitudinal Data with Multiple Features. Yangxin Huang

Bayesian Inference on Joint Mixture Models for Survival-Longitudinal Data with Multiple Features. Yangxin Huang Bayesian Inference on Joint Mixture Models for Survival-Longitudinal Data with Multiple Features Yangxin Huang Department of Epidemiology and Biostatistics, COPH, USF, Tampa, FL yhuang@health.usf.edu January

More information

Bayesian nonparametric estimation of finite population quantities in absence of design information on nonsampled units

Bayesian nonparametric estimation of finite population quantities in absence of design information on nonsampled units Bayesian nonparametric estimation of finite population quantities in absence of design information on nonsampled units Sahar Z Zangeneh Robert W. Keener Roderick J.A. Little Abstract In Probability proportional

More information

Bayesian Inference. Chapter 1. Introduction and basic concepts

Bayesian Inference. Chapter 1. Introduction and basic concepts Bayesian Inference Chapter 1. Introduction and basic concepts M. Concepción Ausín Department of Statistics Universidad Carlos III de Madrid Master in Business Administration and Quantitative Methods Master

More information

Multivariate Slice Sampling. A Thesis. Submitted to the Faculty. Drexel University. Jingjing Lu. in partial fulfillment of the

Multivariate Slice Sampling. A Thesis. Submitted to the Faculty. Drexel University. Jingjing Lu. in partial fulfillment of the Multivariate Slice Sampling A Thesis Submitted to the Faculty of Drexel University by Jingjing Lu in partial fulfillment of the requirements for the degree of Doctor of Philosophy June 2008 c Copyright

More information

CTDL-Positive Stable Frailty Model

CTDL-Positive Stable Frailty Model CTDL-Positive Stable Frailty Model M. Blagojevic 1, G. MacKenzie 2 1 Department of Mathematics, Keele University, Staffordshire ST5 5BG,UK and 2 Centre of Biostatistics, University of Limerick, Ireland

More information

STAT 518 Intro Student Presentation

STAT 518 Intro Student Presentation STAT 518 Intro Student Presentation Wen Wei Loh April 11, 2013 Title of paper Radford M. Neal [1999] Bayesian Statistics, 6: 475-501, 1999 What the paper is about Regression and Classification Flexible

More information

Probabilistic Graphical Models Lecture 17: Markov chain Monte Carlo

Probabilistic Graphical Models Lecture 17: Markov chain Monte Carlo Probabilistic Graphical Models Lecture 17: Markov chain Monte Carlo Andrew Gordon Wilson www.cs.cmu.edu/~andrewgw Carnegie Mellon University March 18, 2015 1 / 45 Resources and Attribution Image credits,

More information

Contents. Part I: Fundamentals of Bayesian Inference 1

Contents. Part I: Fundamentals of Bayesian Inference 1 Contents Preface xiii Part I: Fundamentals of Bayesian Inference 1 1 Probability and inference 3 1.1 The three steps of Bayesian data analysis 3 1.2 General notation for statistical inference 4 1.3 Bayesian

More information

Supplement to A Hierarchical Approach for Fitting Curves to Response Time Measurements

Supplement to A Hierarchical Approach for Fitting Curves to Response Time Measurements Supplement to A Hierarchical Approach for Fitting Curves to Response Time Measurements Jeffrey N. Rouder Francis Tuerlinckx Paul L. Speckman Jun Lu & Pablo Gomez May 4 008 1 The Weibull regression model

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 7 Approximate

More information

Modelling geoadditive survival data

Modelling geoadditive survival data Modelling geoadditive survival data Thomas Kneib & Ludwig Fahrmeir Department of Statistics, Ludwig-Maximilians-University Munich 1. Leukemia survival data 2. Structured hazard regression 3. Mixed model

More information

Introduction to Machine Learning CMU-10701

Introduction to Machine Learning CMU-10701 Introduction to Machine Learning CMU-10701 Markov Chain Monte Carlo Methods Barnabás Póczos & Aarti Singh Contents Markov Chain Monte Carlo Methods Goal & Motivation Sampling Rejection Importance Markov

More information

Lecture 5: Spatial probit models. James P. LeSage University of Toledo Department of Economics Toledo, OH

Lecture 5: Spatial probit models. James P. LeSage University of Toledo Department of Economics Toledo, OH Lecture 5: Spatial probit models James P. LeSage University of Toledo Department of Economics Toledo, OH 43606 jlesage@spatial-econometrics.com March 2004 1 A Bayesian spatial probit model with individual

More information

Factorization of Seperable and Patterned Covariance Matrices for Gibbs Sampling

Factorization of Seperable and Patterned Covariance Matrices for Gibbs Sampling Monte Carlo Methods Appl, Vol 6, No 3 (2000), pp 205 210 c VSP 2000 Factorization of Seperable and Patterned Covariance Matrices for Gibbs Sampling Daniel B Rowe H & SS, 228-77 California Institute of

More information

Bayes: All uncertainty is described using probability.

Bayes: All uncertainty is described using probability. Bayes: All uncertainty is described using probability. Let w be the data and θ be any unknown quantities. Likelihood. The probability model π(w θ) has θ fixed and w varying. The likelihood L(θ; w) is π(w

More information

Implementing componentwise Hastings algorithms

Implementing componentwise Hastings algorithms Computational Statistics & Data Analysis 48 (2005) 363 389 www.elsevier.com/locate/csda Implementing componentwise Hastings algorithms Richard A. Levine a;, Zhaoxia Yu b, William G. Hanley c, John J. Nitao

More information

MONTE CARLO METHODS. Hedibert Freitas Lopes

MONTE CARLO METHODS. Hedibert Freitas Lopes MONTE CARLO METHODS Hedibert Freitas Lopes The University of Chicago Booth School of Business 5807 South Woodlawn Avenue, Chicago, IL 60637 http://faculty.chicagobooth.edu/hedibert.lopes hlopes@chicagobooth.edu

More information

Part 8: GLMs and Hierarchical LMs and GLMs

Part 8: GLMs and Hierarchical LMs and GLMs Part 8: GLMs and Hierarchical LMs and GLMs 1 Example: Song sparrow reproductive success Arcese et al., (1992) provide data on a sample from a population of 52 female song sparrows studied over the course

More information

Weighted tests of homogeneity for testing the number of components in a mixture

Weighted tests of homogeneity for testing the number of components in a mixture Computational Statistics & Data Analysis 41 (2003) 367 378 www.elsevier.com/locate/csda Weighted tests of homogeneity for testing the number of components in a mixture Edward Susko Department of Mathematics

More information

Bayesian linear regression

Bayesian linear regression Bayesian linear regression Linear regression is the basis of most statistical modeling. The model is Y i = X T i β + ε i, where Y i is the continuous response X i = (X i1,..., X ip ) T is the corresponding

More information

A Note on Bayesian Inference After Multiple Imputation

A Note on Bayesian Inference After Multiple Imputation A Note on Bayesian Inference After Multiple Imputation Xiang Zhou and Jerome P. Reiter Abstract This article is aimed at practitioners who plan to use Bayesian inference on multiplyimputed datasets in

More information

Stat 5101 Lecture Notes

Stat 5101 Lecture Notes Stat 5101 Lecture Notes Charles J. Geyer Copyright 1998, 1999, 2000, 2001 by Charles J. Geyer May 7, 2001 ii Stat 5101 (Geyer) Course Notes Contents 1 Random Variables and Change of Variables 1 1.1 Random

More information

On Reparametrization and the Gibbs Sampler

On Reparametrization and the Gibbs Sampler On Reparametrization and the Gibbs Sampler Jorge Carlos Román Department of Mathematics Vanderbilt University James P. Hobert Department of Statistics University of Florida March 2014 Brett Presnell Department

More information

NONLINEAR APPLICATIONS OF MARKOV CHAIN MONTE CARLO

NONLINEAR APPLICATIONS OF MARKOV CHAIN MONTE CARLO NONLINEAR APPLICATIONS OF MARKOV CHAIN MONTE CARLO by Gregois Lee, B.Sc.(ANU), B.Sc.Hons(UTas) Submitted in fulfilment of the requirements for the Degree of Doctor of Philosophy Department of Mathematics

More information

Generalized Linear Models for Non-Normal Data

Generalized Linear Models for Non-Normal Data Generalized Linear Models for Non-Normal Data Today s Class: 3 parts of a generalized model Models for binary outcomes Complications for generalized multivariate or multilevel models SPLH 861: Lecture

More information

Doing Bayesian Integrals

Doing Bayesian Integrals ASTR509-13 Doing Bayesian Integrals The Reverend Thomas Bayes (c.1702 1761) Philosopher, theologian, mathematician Presbyterian (non-conformist) minister Tunbridge Wells, UK Elected FRS, perhaps due to

More information

Markov Chain Monte Carlo (MCMC) and Model Evaluation. August 15, 2017

Markov Chain Monte Carlo (MCMC) and Model Evaluation. August 15, 2017 Markov Chain Monte Carlo (MCMC) and Model Evaluation August 15, 2017 Frequentist Linking Frequentist and Bayesian Statistics How can we estimate model parameters and what does it imply? Want to find the

More information

Bayesian Analysis of Vector ARMA Models using Gibbs Sampling. Department of Mathematics and. June 12, 1996

Bayesian Analysis of Vector ARMA Models using Gibbs Sampling. Department of Mathematics and. June 12, 1996 Bayesian Analysis of Vector ARMA Models using Gibbs Sampling Nalini Ravishanker Department of Statistics University of Connecticut Storrs, CT 06269 ravishan@uconnvm.uconn.edu Bonnie K. Ray Department of

More information

A Bayesian Nonparametric Approach to Monotone Missing Data in Longitudinal Studies with Informative Missingness

A Bayesian Nonparametric Approach to Monotone Missing Data in Longitudinal Studies with Informative Missingness A Bayesian Nonparametric Approach to Monotone Missing Data in Longitudinal Studies with Informative Missingness A. Linero and M. Daniels UF, UT-Austin SRC 2014, Galveston, TX 1 Background 2 Working model

More information

Longitudinal + Reliability = Joint Modeling

Longitudinal + Reliability = Joint Modeling Longitudinal + Reliability = Joint Modeling Carles Serrat Institute of Statistics and Mathematics Applied to Building CYTED-HAROSA International Workshop November 21-22, 2013 Barcelona Mainly from Rizopoulos,

More information

Bayesian Networks in Educational Assessment

Bayesian Networks in Educational Assessment Bayesian Networks in Educational Assessment Estimating Parameters with MCMC Bayesian Inference: Expanding Our Context Roy Levy Arizona State University Roy.Levy@asu.edu 2017 Roy Levy MCMC 1 MCMC 2 Posterior

More information

Appendix: Modeling Approach

Appendix: Modeling Approach AFFECTIVE PRIMACY IN INTRAORGANIZATIONAL TASK NETWORKS Appendix: Modeling Approach There is now a significant and developing literature on Bayesian methods in social network analysis. See, for instance,

More information

Markov chain Monte Carlo

Markov chain Monte Carlo Markov chain Monte Carlo Karl Oskar Ekvall Galin L. Jones University of Minnesota March 12, 2019 Abstract Practically relevant statistical models often give rise to probability distributions that are analytically

More information

Gaussian process for nonstationary time series prediction

Gaussian process for nonstationary time series prediction Computational Statistics & Data Analysis 47 (2004) 705 712 www.elsevier.com/locate/csda Gaussian process for nonstationary time series prediction Soane Brahim-Belhouari, Amine Bermak EEE Department, Hong

More information

Univariate Normal Distribution; GLM with the Univariate Normal; Least Squares Estimation

Univariate Normal Distribution; GLM with the Univariate Normal; Least Squares Estimation Univariate Normal Distribution; GLM with the Univariate Normal; Least Squares Estimation PRE 905: Multivariate Analysis Spring 2014 Lecture 4 Today s Class The building blocks: The basics of mathematical

More information

Efficient MCMC Sampling for Hierarchical Bayesian Inverse Problems

Efficient MCMC Sampling for Hierarchical Bayesian Inverse Problems Efficient MCMC Sampling for Hierarchical Bayesian Inverse Problems Andrew Brown 1,2, Arvind Saibaba 3, Sarah Vallélian 2,3 CCNS Transition Workshop SAMSI May 5, 2016 Supported by SAMSI Visiting Research

More information

Statistical Practice

Statistical Practice Statistical Practice A Note on Bayesian Inference After Multiple Imputation Xiang ZHOU and Jerome P. REITER This article is aimed at practitioners who plan to use Bayesian inference on multiply-imputed

More information

MULTILEVEL IMPUTATION 1

MULTILEVEL IMPUTATION 1 MULTILEVEL IMPUTATION 1 Supplement B: MCMC Sampling Steps and Distributions for Two-Level Imputation This document gives technical details of the full conditional distributions used to draw regression

More information

Advanced Statistical Modelling

Advanced Statistical Modelling Markov chain Monte Carlo (MCMC) Methods and Their Applications in Bayesian Statistics School of Technology and Business Studies/Statistics Dalarna University Borlänge, Sweden. Feb. 05, 2014. Outlines 1

More information

Bayesian Nonparametric Regression for Diabetes Deaths

Bayesian Nonparametric Regression for Diabetes Deaths Bayesian Nonparametric Regression for Diabetes Deaths Brian M. Hartman PhD Student, 2010 Texas A&M University College Station, TX, USA David B. Dahl Assistant Professor Texas A&M University College Station,

More information

Computationally Efficient Estimation of Multilevel High-Dimensional Latent Variable Models

Computationally Efficient Estimation of Multilevel High-Dimensional Latent Variable Models Computationally Efficient Estimation of Multilevel High-Dimensional Latent Variable Models Tihomir Asparouhov 1, Bengt Muthen 2 Muthen & Muthen 1 UCLA 2 Abstract Multilevel analysis often leads to modeling

More information

Improved Robust MCMC Algorithm for Hierarchical Models

Improved Robust MCMC Algorithm for Hierarchical Models UNIVERSITY OF TEXAS AT SAN ANTONIO Improved Robust MCMC Algorithm for Hierarchical Models Liang Jing July 2010 1 1 ABSTRACT In this paper, three important techniques are discussed with details: 1) group

More information

Part 7: Hierarchical Modeling

Part 7: Hierarchical Modeling Part 7: Hierarchical Modeling!1 Nested data It is common for data to be nested: i.e., observations on subjects are organized by a hierarchy Such data are often called hierarchical or multilevel For example,

More information

The STS Surgeon Composite Technical Appendix

The STS Surgeon Composite Technical Appendix The STS Surgeon Composite Technical Appendix Overview Surgeon-specific risk-adjusted operative operative mortality and major complication rates were estimated using a bivariate random-effects logistic

More information

A quick introduction to Markov chains and Markov chain Monte Carlo (revised version)

A quick introduction to Markov chains and Markov chain Monte Carlo (revised version) A quick introduction to Markov chains and Markov chain Monte Carlo (revised version) Rasmus Waagepetersen Institute of Mathematical Sciences Aalborg University 1 Introduction These notes are intended to

More information

Separate and Joint Modeling of Longitudinal and Event Time Data Using Standard Computer Packages

Separate and Joint Modeling of Longitudinal and Event Time Data Using Standard Computer Packages Separate and Joint Modeling of Longitudinal and Event Time Data Using Standard Computer Packages Xu GUO and Bradley P. CARLIN Many clinical trials and other medical and reliability studies generate both

More information

Estimating marginal likelihoods from the posterior draws through a geometric identity

Estimating marginal likelihoods from the posterior draws through a geometric identity Estimating marginal likelihoods from the posterior draws through a geometric identity Johannes Reichl Energy Institute at the Johannes Kepler University Linz E-mail for correspondence: reichl@energieinstitut-linz.at

More information

Bayesian Linear Regression

Bayesian Linear Regression Bayesian Linear Regression Sudipto Banerjee 1 Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota, U.S.A. September 15, 2010 1 Linear regression models: a Bayesian perspective

More information

Principles of Bayesian Inference

Principles of Bayesian Inference Principles of Bayesian Inference Sudipto Banerjee and Andrew O. Finley 2 Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota, U.S.A. 2 Department of Forestry & Department

More information