
Implementation and Performance Issues in the Bayesian and Likelihood Fitting of Multilevel Models

William J. Browne (1) and David Draper (2)

(1) Institute of Education, University of London, 20 Bedford Way, London WC1H 0AL, England
(2) Department of Mathematical Sciences, University of Bath, Claverton Down, Bath BA2 7AY, England

Summary

We use simulation studies (a) to compare Bayesian and likelihood fitting methods, in terms of validity of conclusions, in two-level random-slopes regression (RSR) models, and (b) to compare several Bayesian estimation methods based on Markov chain Monte Carlo, in terms of computational efficiency, in random-effects logistic regression (RELR) models. We find (a) that the Bayesian approach with a particular choice of diffuse inverse Wishart prior distribution for the (co)variance parameters performs at least as well, in terms of bias of estimates and actual coverage of nominal 95% intervals, as maximum likelihood methods in RSR models with medium sample sizes (expressed in terms of the number J of level 2 units), but neither approach performs as well as might be hoped with small J; and (b) that an adaptive hybrid Metropolis-Gibbs sampling method we have developed for use in the multilevel modeling package MLwiN outperforms adaptive rejection Gibbs sampling in the RELR models we have considered, sometimes by a wide margin.

Keywords: Adaptive Metropolis Sampling, Diffuse Prior Distributions, Educational Data, Gibbs Sampling, Hierarchical Modeling, IGLS, Markov Chain Monte Carlo (MCMC), MCMC Efficiency, Maximum Likelihood Methods, Random-Effects Logistic Regression, Random-Slopes Regression, RIGLS, Variance Components.

1 Introduction

Multilevel models, for data having a nested or hierarchical structure, have become an important component of the applied statistician's tool-chest in the past 15 years (e.g., Bryk and Raudenbush 1992, Goldstein 1995, Draper 2000). Examples include variance-components (VC), random-slopes regression (RSR), and random-effects logistic regression (RELR) models, all of which we will visit in what follows. In the early days of multilevel modeling the only available fitting methods were based on maximum likelihood: iterative generalized least squares (IGLS) and restricted IGLS (RIGLS), or related methods such as Fisher scoring (Longford 1987), restricted maximum likelihood (REML), and empirical Bayes estimation (Bryk et al. 1988), for models with Gaussian outcomes (Goldstein 1986, 1989); and marginal quasi-likelihood (MQL) and penalized (or predictive) quasi-likelihood (PQL) for data sets with dichotomous outcomes (e.g., Breslow and Clayton 1993). More recently fully Bayesian analyses based on Markov chain Monte Carlo (MCMC) methods have become possible in packages such as BUGS (Spiegelhalter et al. 1997) and MLwiN (Rasbash et al. 1999). Recent alternatives for fitting multilevel models, which we do not pursue here, include integrated-likelihood approaches based on Gaussian quadrature (e.g., Pinheiro and Bates 1995) and Laplace approximations (e.g., Raudenbush et al. 2000).

We (the authors of this article) are the co-developers of the Bayesian MCMC capabilities in MLwiN. Below we examine (a) the relative performance, in the sense of point and interval estimation accuracy, of likelihood and Bayesian fitting methods in RSR models, and (b) some performance comparisons in RELR models, in the sense of required CPU time to achieve a given accuracy of posterior summary, between several MCMC fitting methods, including adaptive rejection sampling (Gilks and Wild 1992) and an approach we have developed specifically for MLwiN based on adaptive hybrid Metropolis-Gibbs sampling. In a companion article to this one (Browne and Draper 1999, hereafter BD99) we compare likelihood and Bayesian fitting methods in VC and RELR models (also see Hoijtink 2000 for an MCMC investigation of a random-intercept model).

2 Random-slopes regression (RSR) models

A multilevel modeling data set which we have found useful in fixing ideas was collected by Mortimore et al. (1988) in a study called the Junior School Project (JSP). This was a longitudinal investigation of roughly 2,000 pupils from 50 primary schools chosen randomly from the 636 Inner London Education Authority (ILEA) schools. Woodhouse et al. (1995) examined a random subsample of N = 887 students at J = 48 of these schools; here we will refer to this subsample as the JSP data. One focus of principal interest was the relationship between mathematics test scores at year 3 (math3) and year 5 (math5).

Table 1: A comparison of IGLS/RIGLS and Bayesian fitting (with the diffuse prior labeled Wishart prior 1 in Section 2.1.2) in model (1) applied to the JSP data. Figures in parentheses in the upper table are SEs (for the ML methods) or posterior SDs (for the Bayesian method).

Point Estimates
Method      β_0         β_1         σ²_u0      σ_u01      σ²_u1      σ²_e
IGLS        (0.366)     (0.043)     (1.30)     (0.119)    (0.017)    (1.34)
RIGLS       (0.370)     (0.043)     (1.33)     (0.122)    (0.017)    (1.34)
Bayesian    (0.371)     (0.058)     (1.39)     (0.153)    (0.029)    (1.35)

95% Interval Estimates
Method            β_0             β_1               σ²_u0           σ_u01               σ²_u1             σ²_e
RIGLS (Gaussian)  (29.8, 31.3)    (0.529, 0.697)    (2.13, 7.35)    (-0.611, -0.133)    (0.004, 0.070)    (24.3, 29.6)
Bayesian          (29.9, 31.3)    (0.505, 0.732)    (2.36, 7.77)    (-0.660, 0.061)     (0.058, 0.170)    (24.3, 29.7)

These two variables have marginal distributions which are not far from Gaussian, and the scatterplot of the entire data set indicates approximate linearity, but it is quite possible that the slopes and intercepts of the best-fitting linear models at the school level (one regression line per school) are different. However, the numbers n_j of pupils per school vary from 5 to 62 in this data set, with fully 1/3 of the schools having 12 pupils or fewer, so it would be unwise to attempt to fit regressions local to each school. A natural approach that strikes a balance between global fitting (which ignores the clustered character of the data) and local regression (which would be unstable) is based on a random-slopes regression model of the form

    y_ij = (β_0 + u_0j) + (β_1 + u_1j)(x_ij - x̄) + e_ij,
    u_j = (u_0j, u_1j)^T ~ IID N_2(0, V_u),   V_u = [ σ²_u0  σ_u01 ; σ_u01  σ²_u1 ],   e_ij ~ IID N(0, σ²_e),        (1)

where i = 1, ..., n_j; j = 1, ..., J; Σ_{j=1}^J n_j = N; y_ij and x_ij are the math5 and math3 scores for pupil i in school j, respectively; and x̄ is the mean of math3 over all N pupils (centering the predictor in this way improves convergence of the iterative estimation methods discussed below). This model regards the schools as having been drawn randomly from a population of schools, each having its own slope and intercept, and the result of fitting (1) will be to shrink the local estimates of these parameters toward the global (population) regression.
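
To fix ideas computationally, the short sketch below (Python/NumPy; not part of the original analysis) simulates one data set from model (1). The predictor scale and the balanced design are hypothetical stand-ins, and the parameter values echo those used in the simulation study of Section 2.2.

```python
import numpy as np

def simulate_rsr(J=48, n_per_school=18, beta=(30.0, 0.5),
                 V_u=((5.0, -0.5), (-0.5, 0.5)), sigma2_e=30.0, seed=1):
    """Draw one data set from the RSR model (1).  The covariance sigma_u01 = -0.5 is one of the
    settings considered in Section 2.2; the math3-like predictor scale is a hypothetical choice."""
    rng = np.random.default_rng(seed)
    school = np.repeat(np.arange(J), n_per_school)                      # level-2 identifier per pupil
    x = rng.normal(25.0, 6.0, size=J * n_per_school)                    # stand-in for the math3 scores
    xc = x - x.mean()                                                   # centred predictor, as in (1)
    u = rng.multivariate_normal([0.0, 0.0], np.asarray(V_u), size=J)    # (u_0j, u_1j) for each school
    e = rng.normal(0.0, np.sqrt(sigma2_e), size=J * n_per_school)       # level-1 residuals
    y = (beta[0] + u[school, 0]) + (beta[1] + u[school, 1]) * xc + e    # math5-like outcome
    return y, x, school

y, x, school = simulate_rsr()
```

A balanced design is used only for brevity; the study designs of Section 2.2 also include unbalanced school sizes.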

Table 1 presents a comparison of maximum likelihood (IGLS/RIGLS) and Bayesian (posterior mean) estimates of model (1) applied to the JSP data (the Bayesian results use a particular diffuse prior to be discussed below in Section 2.1.2). Packages such as MLwiN (which produced these summaries) often report only point estimates and standard errors when maximum likelihood (ML) fitting is employed, thereby tacitly encouraging users to build large-sample Gaussian interval estimates of the form θ̂ ± 1.96 ŜE(θ̂); we quote these intervals in the table. It may be seen that the two fitting methods produce similar results for many of the point estimates (with the notable exception of σ²_u1, where the Bayesian estimate is almost three times the size of the RIGLS value), but the Bayesian intervals are all wider than their likelihood counterparts (up to 70% wider in the case of σ²_u1). An MLwiN user writing a paper based on the JSP data might well wonder which set of results to report (bearing in mind that the ML and diffuse-prior Bayesian approaches are estimating different quantities, namely, in Bayesian language, the modes and means of the marginal posterior distributions, respectively). We offer a suggested answer to this question below.

2.1 Methods for fitting RSR models

2.1.1 IGLS and RIGLS

IGLS is an iterative maximum likelihood method based on generalized least squares (GLS). For the general algorithm see Goldstein (1995); here we present a sketch of this fitting method in the special case of the RSR model (1). This model may be expressed in the usual general linear model form

    Y = Xβ + e*        (2)

by means of the following steps: (a) stack the values y_ij in the order (y_11, ..., y_{n_1 1}, ..., y_{1J}, ..., y_{n_J J}) into the N × 1 vector Y; (b) create a vector x out of the (x_ij - x̄) values analogously, let X be the N × 2 matrix whose first column is a vector of 1s and whose second column is x, and define the 2 × 1 vector β = (β_0, β_1)^T; and (c) stack the values e*_ij = u_0j + u_1j (x_ij - x̄) + e_ij into the N × 1 vector e*. Under the standard assumptions this vector has mean 0 and covariance matrix V whose diagonal elements are

    V(e*_ij) = σ²_u0 + 2 (x_ij - x̄) σ_u01 + (x_ij - x̄)² σ²_u1 + σ²_e        (3)

and whose off-diagonal elements may be computed as in Goldstein (1995). The idea underlying IGLS is that (i) if V were known, β could be estimated by GLS, yielding

    β̂ = (X^T V^{-1} X)^{-1} X^T V^{-1} Y        (4)

with covariance matrix (X^T V^{-1} X)^{-1}.
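
A minimal NumPy sketch of the GLS step (4), assuming the (co)variance parameters, and hence V, are known, might look as follows; it builds V school by school from its block-diagonal structure, which is equivalent to using (3) together with the corresponding within-school off-diagonal terms.

```python
import numpy as np

def gls_beta(y, x, school, V_u, sigma2_e):
    """One GLS step, equation (4), for model (1) with known (co)variance parameters."""
    N = len(y)
    X = np.column_stack([np.ones(N), x - np.mean(x)])
    V = np.zeros((N, N))
    for j in np.unique(school):
        ix = np.where(school == j)[0]
        Zj = X[ix]                                         # random-effects design equals X here
        V[np.ix_(ix, ix)] = Zj @ V_u @ Zj.T + sigma2_e * np.eye(len(ix))
    Vinv = np.linalg.inv(V)
    cov_beta = np.linalg.inv(X.T @ Vinv @ X)               # (X^T V^{-1} X)^{-1}
    beta_hat = cov_beta @ X.T @ Vinv @ y                   # equation (4)
    return beta_hat, cov_beta
```

Inverting the full N × N matrix V is wasteful; a production implementation would exploit the block-diagonal structure, but the dense version keeps the sketch short.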

Conversely, (ii) if β were known, one could form the residuals Ỹ = Y - Xβ, calculate Y* = Ỹ Ỹ^T, stack the columns of Y* into one long column vector Y**, and define a linear model

    Y** = Z* θ + ε,        (5)

where Z* is the design matrix for the random-effects parameters θ = (σ²_u0, σ_u01, σ²_u1)^T in model (1). Another application of GLS then yields

    θ̂ = (Z*^T V*^{-1} Z*)^{-1} Z*^T V*^{-1} Y**,        (6)

where V* = V ⊗ V, ⊗ is the Kronecker product, and the covariance matrix of θ̂ is 2 (Z*^T V*^{-1} Z*)^{-1}. Starting with initial estimates of the fixed effects β from ordinary least squares, IGLS iterates between equations (4) and (6) to convergence, which is judged to occur when two successive sets of estimates differ by no more than a given tolerance (on a component-by-component basis; this is not guaranteed to occur in RSR models, as will be seen in Section 2.2).

As with many ML procedures, IGLS produces biased estimates in small samples, often in particular underestimating random-effects variances because the sampling variation of β̂ is not accounted for in the algorithm above. Defining the residuals instead as Ỹ* = Y - X β̂ and Ŷ* = Ỹ* Ỹ*^T, Goldstein (1989) showed that

    E(Ŷ*) = V - X (X^T V^{-1} X)^{-1} X^T,        (7)

so that the IGLS estimates can be bias-adjusted by adding the second term in (7) to Ŷ* at each iteration. This is restricted IGLS (RIGLS), which coincides with restricted maximum likelihood (REML) in Gaussian models such as (1). Standard errors of IGLS and RIGLS estimates are based on the final values of the covariance matrices mentioned above at convergence.
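
The RIGLS adjustment in (7) is easy to sketch. The helper below (an illustrative fragment with names of our own choosing, not MLwiN code) forms the bias-corrected cross-product matrix that replaces Ŷ* in the GLS step for the random parameters.

```python
import numpy as np

def rigls_crossproduct(y, X, V, beta_hat):
    """Bias-adjusted cross-product matrix based on equation (7), given the current V and beta_hat."""
    resid = y - X @ beta_hat                                   # Y~* = Y - X beta_hat
    Y_star = np.outer(resid, resid)                            # ^Y* = resid resid^T
    Vinv = np.linalg.inv(V)
    correction = X @ np.linalg.inv(X.T @ Vinv @ X) @ X.T       # second term in (7)
    return Y_star + correction                                 # adjusted so its expectation is V
```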

2.1.2 Prior distributions in multilevel modeling

The Bayesian fitting of multilevel models requires, as usual in Bayesian work, a joint prior distribution for the parameters, or a series of marginal priors if a priori independence is assumed. If substantive information about the parameters is available* then it should naturally be used, although there are risks to the validity of one's conclusions if strong prior information that is after the fact seen to have been out of step with reality is employed (see, e.g., BD99 for an example). When developing the Bayesian capabilities in MLwiN we took the view that, in addition to being provided with a facility for specifying informative priors, users should be given the option of selecting among one or more diffuse priors for those occasions when little was known a priori.

* For instance, from expert judgment (see, e.g., Madigan et al. for a method of eliciting a "prior data set" in the context of graphical models) or previous studies judged relevant to the current inquiry.

The reason for the phrase "one or more" in the last sentence is that, while the literature is unanimous in recommending Gaussian priors with huge variances (effectively, improper U(-∞, ∞) priors) as the diffuse choice for fixed effects, and this is what we use in MLwiN and in what follows, there are several possibilities for diffuse priors on (co)variance parameters and matrices. The conjugate choice for the level 1 variance σ²_e in model (1) is the scaled inverse chi-square χ^{-2}(ν, s²) family (e.g., Gelman, Carlin, et al. 1995); this is equivalent to an inverse gamma Γ^{-1}(ν/2, νs²/2) distribution, where ν is the prior effective sample size and s² is a prior estimate of σ²_e. In the results below we use two diffuse members of this family:

• A (proper) locally uniform prior for σ²_e on (0, 1/ε) for small positive ε (Gelman and Rubin 1992, Carlin 1992), which is equivalent to a Pareto(1, ε) prior for the precision σ_e^{-2} (Spiegelhalter et al. 1997); and

• A (proper) Γ^{-1}(ε, ε) prior for σ²_e (Spiegelhalter et al. 1997), for small positive ε.

These priors are specified within the χ^{-2}(ν, s²) family by the choices (ν, s²) = (-2, 0) and (2ε, 1), respectively. We have found that results are generally insensitive to the specific choice of ε in the region of 0.001 (the default setting recommended by the developers of the BUGS package for the Γ^{-1}(ε, ε) prior); we report findings with this value.

All the remaining parameters in model (1) are contained in the covariance matrix V_u, for which the conjugate choice is the inverse Wishart W^{-1}(ν_p, S_p) family, where, in parallel with the χ^{-2}(ν, s²) distribution, ν_p is the prior effective sample size and S_p is a prior estimate for V_u. We examine three diffuse settings of the Wishart parameters below: (ν_p, S_p) = (2, I_2) (labeled Wishart prior 1 in Tables 4 and 5 below), (4, Σ̂_u) (Wishart prior 2 in what follows), and (-3, 0), where I_2 is the 2 × 2 identity matrix and Σ̂_u is the RIGLS estimate of V_u. The third of these settings corresponds to an improper uniform prior on V_u, and it is perhaps worth noting that the second is gently data-determined.
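
To make these settings concrete, the snippet below (illustrative only; it uses SciPy's inverse-gamma and inverse-Wishart parameterizations, which match the Γ^{-1}(ε, ε) and W^{-1}(ν_p, S_p) forms above) draws from the two proper diffuse priors; the matrix standing in for the RIGLS estimate Σ̂_u is hypothetical, and the improper settings (ν, s²) = (-2, 0) and (ν_p, S_p) = (-3, 0) cannot be sampled from.

```python
import numpy as np
from scipy.stats import invgamma, invwishart

eps = 0.001                                                    # diffuse setting discussed above
sigma2_e_draws = invgamma.rvs(a=eps, scale=eps, size=5)        # Gamma^{-1}(eps, eps) prior draws,
                                                               # i.e. chi^{-2}(nu, s^2) with (2*eps, 1)

Vu_wishart1 = invwishart.rvs(df=2, scale=np.eye(2), size=3)    # Wishart prior 1: W^{-1}(2, I_2)
Sigma_hat_u = np.array([[5.0, -0.5], [-0.5, 0.5]])             # hypothetical stand-in for the RIGLS estimate
Vu_wishart2 = invwishart.rvs(df=4, scale=Sigma_hat_u, size=3)  # Wishart prior 2: W^{-1}(4, Sigma_hat_u);
                                                               # for a 2x2 matrix its prior mean is Sigma_hat_u
```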

2.1.3 Gibbs sampling: full conditionals

With the conjugate prior choices in the previous subsection, all of the full conditional distributions in model (1) have familiar closed-form expressions, making Gibbs sampling a natural MCMC choice with this model. It is helpful notationally (a) to re-express the first line of the model as

    y_ij = β_0 x_0ij + β_1 x_1ij + u_0j x_0ij + u_1j x_1ij + e_ij,        (8)

where x_0ij = 1 and x_1ij = (x_ij - x̄) in the previous notation, and (b) to let X_ij stand for the row vector (x_0ij, x_1ij). Gibbs sampling in model (1) or (8) proceeds most smoothly by regarding the level 2 residuals u_j = (u_0j, u_1j) as latent variables to be sampled along with the model parameters. It is then natural to split the unknowns into four groups: the 2 × 1 column vector of fixed effects β, the level 2 residuals u_j, the level 2 variance matrix V_u, and the level 1 variance σ²_e. The full conditionals with this blocking of unknowns and the priors as above are then as follows (cf. Zeger and Karim 1991):

    β | y, σ²_e, u         ~  N( D̂ [ σ_e^{-2} Σ_{j=1}^{J} Σ_{i=1}^{n_j} X_ij^T (y_ij - X_ij u_j) ],  D̂ ),
    u_j | y, V_u, σ²_e, β  ~  N_2( D̂_j [ σ_e^{-2} Σ_{i=1}^{n_j} X_ij^T (y_ij - X_ij β) ],  D̂_j ),
    V_u | u                ~  W^{-1}( J + ν_p,  Σ_{j=1}^{J} u_j u_j^T + S_p ),
    σ²_e | y, β, u         ~  Γ^{-1}( (N + ν_e)/2,  [ν_e s²_e + Σ_{j=1}^{J} Σ_{i=1}^{n_j} e²_ij] / 2 ),        (9)

where D̂ = ( σ_e^{-2} Σ_{ij} X_ij^T X_ij )^{-1}, D̂_j = ( σ_e^{-2} Σ_{i=1}^{n_j} X_ij^T X_ij + V_u^{-1} )^{-1}, and e_ij = y_ij - X_ij (β + u_j).

2.1.4 Gibbs sampling: computational efficiency gains

RSR models are a special case of the general L-level model, in which Gibbs sampling has the same basic four steps as in equation (9) above (also see Section 3.1). By analyzing the allocation of CPU time across these steps in a naive initial coding of the algorithm, we were able to identify two efficiency gains which considerably reduced execution time without creating undue storage burden.

• Quantities such as M_j = Σ_{i=1}^{n_j} X_ij^T X_ij and M = Σ_{j=1}^{J} M_j involve only the fixed matrix X of predictors and can be calculated once and stored for use in each iteration.

• Let X*_li be the vector of predictor variables at level l for observation i in the L-level formulation, where l = 1 refers to the variables associated with the fixed effects (the β_k in (8)), and let e_i be the level 1 residual for observation i. Considerable use is made in the general algorithm of quantities of the form d_li = e_i + c_li, where c_li is the product of a vector β*_l of fixed or random effects and the predictor vector X*_li (note, for example, the appearance of (y_ij - X_ij u_j) and (y_ij - X_ij β) in the first two steps of (9)). If the e_i are stored, then whenever d_li is needed the current value of β*_l times X*_li can be added to the residual for use in sampling the new value of β*_l, and once this is done the new level 1 residual is available by subtraction of the updated product c_li from the new d_li.
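
A self-contained sketch of the resulting sampler for model (1)/(8) is given below (Python/NumPy/SciPy; a simplified re-implementation for illustration, not the MLwiN source). It cycles through the four blocks in (9), precomputes M_j = Σ_i X_ij^T X_ij as suggested above, and uses a Γ^{-1}(ε, ε)-type prior for σ²_e and a W^{-1}(ν_p, S_p) prior for V_u.

```python
import numpy as np
from scipy.stats import invwishart

def gibbs_rsr(y, x, school, n_iter=5000, nu_p=2, S_p=None, nu_e=0.002, s2_e=1.0, seed=0):
    """Illustrative Gibbs sampler for model (1)/(8), following the four blocks in (9).
    Priors: flat on beta, V_u ~ W^{-1}(nu_p, S_p), sigma^2_e ~ scaled-inv-chi^2(nu_e, s2_e)."""
    rng = np.random.default_rng(seed)
    S_p = np.eye(2) if S_p is None else S_p
    y = np.asarray(y, float)
    N = len(y)
    X = np.column_stack([np.ones(N), x - np.mean(x)])            # rows are X_ij = (x_0ij, x_1ij)
    groups = [np.where(school == j)[0] for j in np.unique(school)]
    J = len(groups)
    M_j = [X[g].T @ X[g] for g in groups]                        # precomputed once (Section 2.1.4)
    M = sum(M_j)
    beta, u, V_u, sig2e = np.zeros(2), np.zeros((J, 2)), np.eye(2), 1.0
    out = np.empty((n_iter, 6))
    for t in range(n_iter):
        # 1. Fixed effects beta | y, sigma^2_e, u
        D = np.linalg.inv(M / sig2e)
        rhs = sum(X[g].T @ (y[g] - X[g] @ u[j]) for j, g in enumerate(groups)) / sig2e
        beta = rng.multivariate_normal(D @ rhs, D)
        # 2. Level-2 residuals u_j | y, V_u, sigma^2_e, beta
        Vinv = np.linalg.inv(V_u)
        for j, g in enumerate(groups):
            Dj = np.linalg.inv(M_j[j] / sig2e + Vinv)
            u[j] = rng.multivariate_normal(Dj @ (X[g].T @ (y[g] - X[g] @ beta)) / sig2e, Dj)
        # 3. Level-2 covariance matrix V_u | u ~ W^{-1}(J + nu_p, sum_j u_j u_j^T + S_p)
        V_u = invwishart.rvs(df=J + nu_p, scale=u.T @ u + S_p, random_state=rng)
        # 4. Level-1 variance sigma^2_e | y, beta, u
        sse = sum(np.sum((y[g] - X[g] @ (beta + u[j])) ** 2) for j, g in enumerate(groups))
        sig2e = 1.0 / rng.gamma((N + nu_e) / 2.0, 2.0 / (nu_e * s2_e + sse))
        out[t] = [beta[0], beta[1], V_u[0, 0], V_u[0, 1], V_u[1, 1], sig2e]
    return out   # columns: beta_0, beta_1, sigma^2_u0, sigma_u01, sigma^2_u1, sigma^2_e
```

Applied to data simulated as above (or to the JSP data), the returned draws would then be summarized using the monitoring diagnostics described next.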

2.1.5 Starting values and burn-in/monitoring strategy

Normally in MCMC analyses considerable attention needs to be given to the combined choice of starting values and burn-in strategy, to ensure that the equilibrium distribution has been reached before monitoring begins. This is far less of a problem in MLwiN, where likelihood estimates provide initial values of sufficient quality that burn-ins of 500 iterations (the default) or less are typically more than adequate†.

† Jim Hodges (personal communication) has recently noted that there may be more potential problems with multimodality of posterior distributions in hierarchical models than is commonly believed; see Liu and Hodges (1999) for details. This may be investigated in MLwiN by making parallel runs with widely dispersed starting values, as in Gelman and Rubin (1992).

MLwiN release 1.1 provides a range of MCMC diagnostics to aid in determining the minimum length of monitoring run to achieve the user's accuracy goals in summarizing the marginal and joint posterior distributions of interest. Time series traces, or trajectories, may be displayed for all unknowns in the model, and clicking on any of these trajectories produces a pop-up window with the following diagnostics and summaries: a marginal kernel density trace, the autocorrelation and partial autocorrelation functions, a plot of the Monte Carlo standard error of the posterior mean as a function of the length n_M of the monitoring run, the Raftery-Lewis (1992) and Brooks-Draper (2000) diagnostics, and the posterior mean, standard deviation (SD), median, and (by default) 2.5% and 97.5% quantiles. With its default settings the Raftery-Lewis diagnostic estimates how large n_M needs to be so that the actual posterior probability content of the nominal central 95% interval for the parameter in question is between 94% and 96% with Monte Carlo probability at least 95%. By contrast the Brooks-Draper diagnostic estimates the value of n_M required so that the posterior mean of a parameter θ may be quoted to k significant figures with at least 100(1 - α)% Monte Carlo probability. In the typical case in which the trajectory for θ approximates that of an autoregressive time series of order 1 with estimated first-order autocorrelation ρ̂, and writing the estimated posterior mean in the form θ̂ = a × 10^b for 1 ≤ a < 10, the required n̂_M satisfies

    n̂_M ≈ 4 [ Φ^{-1}(1 - α/2) ]² σ̂² 10^{-2(b-k+1)} (1 + ρ̂) / (1 - ρ̂),        (10)

where σ̂ is the estimated posterior SD of θ. It is presumed that the user has thought carefully about the scale on which results are to be reported; for example, diagnostic (10) applied to the JSP data would produce a much larger value of n̂_M for β_0 if 30 were subtracted from all the observations with k held fixed‡.

‡ For example, if the user wished to report β̂_0 = 30.6 = 3.06 × 10^1, i.e., k = 3, (10) would be applied with b = 1; whereas if 30 were subtracted from all data values and the user still insisted on k = 3, the estimate would now be 6.00 × 10^{-1}, (10) would now be invoked with all the same inputs except b = -1, and the new n̂_M value would be 10,000 times larger than before. In effect, in the presence of Monte Carlo uncertainty, it is just as hard to accurately announce a posterior mean of 0.600 as it is to quote a posterior mean of 30.600 with the same posterior SD.
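
The Brooks-Draper calculation (10) is simple to transcribe; the helper below (an illustrative sketch with argument names of our own choosing) reproduces the footnoted example, in which subtracting 30 from the data inflates the required run length by a factor of 10,000.

```python
import numpy as np
from scipy.stats import norm

def brooks_draper_nhat(post_mean, post_sd, rho1, k=3, alpha=0.05):
    """Monitoring-run length from (10): quote the posterior mean to k significant figures with
    Monte Carlo probability at least 100*(1 - alpha)%, for a nonzero posterior mean whose
    trajectory behaves like an AR(1) series with lag-1 autocorrelation rho1."""
    b = int(np.floor(np.log10(abs(post_mean))))          # post_mean = a * 10^b with 1 <= a < 10
    z = norm.ppf(1.0 - alpha / 2.0)
    return 4.0 * z**2 * post_sd**2 * 10.0**(-2 * (b - k + 1)) * (1.0 + rho1) / (1.0 - rho1)

# Posterior SD and autocorrelation here are arbitrary illustrative values; the ratio is 10,000.
print(brooks_draper_nhat(30.6, 0.37, 0.5), brooks_draper_nhat(0.6, 0.37, 0.5))
```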

Table 2: Summary of initial study designs for the RSR model (1) simulations.

Design    Number of schools (J)    Pupils per school (n_j)    Total number of pupils (N)
3         12                       unbalanced                 216
4         12                       18 for all schools         216
7         48                       unbalanced                 864
8         48                       18 for all schools         864

2.2 Simulation study design

Our interest in conducting the simulation study described here focused upon the effects of three aspects of model (1) on the performance of methods used to fit the model: the total numbers of level 1 and 2 units in the design (pupils and schools in the JSP data, respectively), the degree of imbalance in the numbers of pupils per school, and the strength of correlation between the intercept and slope random effects at level 2, which is governed by the covariance parameter σ_u01. As in our study (BD99) of the variance-components model

    y_ij = β_0 + u_j + e_ij,   i = 1, ..., n_j,   j = 1, ..., J,   Σ_{j=1}^J n_j = N,
    u_j ~ IID N(0, σ²_u),   e_ij ~ IID N(0, σ²_e),        (11)

which is a special case of (1) without the covariate x, we therefore initially examined four different study designs with respect to J and the n_j, crossing this factor with five values of σ_u01, in all cases holding the other parameters in (1) constant at values similar to those in a version of the JSP data with greater between-school variation in the effect of x on y: β_0 = 30.0, β_1 = 0.5, σ²_u0 = 5.0, σ²_u1 = 0.5, and σ²_e = 30.0. Table 2 summarizes the initial designs considered, which are numbered 3, 4, 7, and 8 for consistency with BD99. Design 7 was arrived at by removing one pupil at random from the 23 largest schools in the JSP data, to produce a value of N (864) which was an integer multiple of 18, the average number of pupils per school in all designs. Designs 4 and 8 are balanced, with 12 and 48 schools respectively, and design 3 is imbalanced in a way that mimics the actual JSP distribution of pupils per school. The chosen values of σ_u01 were ±1.4, ±0.5, and 0, corresponding to correlations between the slope and intercept random effects of ρ = ±0.89, ±0.32, and 0, respectively.

Table 3: Summary of convergence results when ML methods are used to fit model (1). Entries m_1/m_2 give numbers of simulated data sets out of 1,000 with indicated results for IGLS (m_1) and RIGLS (m_2). Study design is given in terms of the number of level 2 units (12 or 48) and whether the design is balanced (B) or unbalanced (U). PD = estimate of V_u positive definite. An asterisk (*) marks the eight designs on which detailed results are reported below.

Design       ρ        Failed to converge    Converged but not PD    Converged and PD
3 (12U)     -0.89     356/349               21/77                    623/574
3 (12U)     -0.32      93/124                5/19                    902/857
3 (12U) *    0.0       71/116                2/7                     927/877
3 (12U)      0.32      91/118                3/11                    906/871
3 (12U)      0.89     367/366               12/76                    621/558
4 (12B)     -0.89      83/74                 3/23                    914/903
4 (12B)     -0.32      13/14                 1/1                     986/985
4 (12B) *    0.0        9/9                  0/1                     991/990
4 (12B)      0.32       6/7                  0/2                     994/991
4 (12B)      0.89      85/72                 3/25                    912/903
7 (48U) *   -0.89      13/14                 1/2                     986/984
7 (48U) *   -0.32       2/2                  0/0                     998/998
7 (48U) *    0.0        0/0                  0/0                    1000/1000
7 (48U) *    0.32       0/0                  0/0                    1000/1000
7 (48U) *    0.89      16/15                 0/2                     984/983
8 (48B)     -0.89       6/6                  0/2                     994/992
8 (48B)     -0.32       1/1                  0/0                     999/999
8 (48B) *    0.0        0/0                  0/0                    1000/1000
8 (48B)      0.32       0/0                  0/0                    1000/1000
8 (48B)      0.89       8/8                  0/0                     992/992

In each cell of the 4 × 5 layout crossing design and correlation, we simulated 1,000 data sets according to model (1), holding the predictor x fixed throughout at its values in the 864-pupil version of the JSP data. Table 3 summarizes the sorts of convergence problems to which the ML methods IGLS and RIGLS are susceptible in RSR models. It may be seen that in unbalanced multilevel data sets with relatively few level 2 units and strong correlation between the slope and intercept random effects, convergence of both IGLS and RIGLS can fail to occur up to 37% of the time, and even when convergence occurs the resulting estimate of the covariance matrix V_u in model (1) may fail to be positive definite on a significant number of occasions (up to 8% of the simulated data sets). As is intuitively reasonable, problems of this kind occur more readily with increasing |ρ|, decreasing J, and increasing imbalance. Figure 1 presents a trajectory plot for a data set in which IGLS has failed to converge; this fitting method appears to be cycling between two sets of parameter estimates, even though Bayesian analysis of the same data showed that the posterior distribution with a diffuse prior was unimodal (it is possible that direct application of the EM algorithm instead of IGLS would yield the ML estimates with these data while avoiding problems like those in Figure 1).

Figure 1: Trajectory plot arising from IGLS fitting of model (1) to a simulated data set in which convergence is not achieved.

To avoid ML convergence problems and concentrate on other performance measures, such as bias and coverage of point and interval estimates respectively, we focus in reporting our main results on the designs marked with a (*) in Table 3. This subset of eight designs includes two with a small J value (12), six unbalanced designs, and four with nonzero ρ, so that something may be said about the effects of all three of these factors on the outcomes of interest. The subsets of the 1,000 replications in each design for which both ML methods converged and produced positive definite V_u estimates are used in what follows, making the simulation sample size at least 877 in all designs examined in detail.

To decide how long to monitor the Gibbs-sampling output we estimated time per iteration and calculated Raftery-Lewis diagnostics as a function of the total number of pupils N. This revealed that the smaller designs in Table 2 needed longer monitoring runs to satisfy Raftery-Lewis default accuracy constraints but took less time per iteration, leading to the following monitoring run lengths: 30,000 in studies 3 and 4, and 10,000 in studies 7 and 8. After verifying that MLwiN and BUGS gave identical results (up to Monte Carlo noise) with RSR models specified with the same priors, for computational convenience we used MLwiN for the IGLS/RIGLS and uniform-Wishart-prior Bayesian results and BUGS for the other two sets of Wishart findings.

2.3 RSR validity results

Tables 4 and 5 and Figures 2-5 summarize the performance, in RSR model (1), of the two ML methods (IGLS and RIGLS, which do not use prior distributions) and the Bayesian approach with the three prior distributions described in Section 2.1.2, in terms of bias of point estimates and coverage and length of nominal 95% interval estimates. For the sake of brevity we report full numerical results about interval estimates only for two of the eight (*) designs in Table 3 (chosen to be typical), although bias results from all eight designs are displayed in the figures. Bias is reported in relative terms (as 100 (θ̂ - θ)/θ %) except when the true value of the parameter was 0, in which case it is presented in absolute terms (θ̂ - θ). To simulate the behavior of users of the ML features in packages such as MLwiN, where point estimates and standard errors are typically reported rather than intervals, the ML results in the middle and bottom parts of Tables 4 and 5 are based on intervals of the form θ̂ ± 1.96 ŜE(θ̂) for all parameters of model (1). The performance of intervals for variance parameters based on the gamma distribution (see BD99 for examples and formulae) was only marginally better than that for the large-sample Gaussian intervals examined here; we again omit details for brevity. As noted in Section 2.1.2, we obtained results using both U(0, 1/ε) and Γ^{-1}(ε, ε) priors for σ²_e, which in all cases were so close that there was little value in presenting both. We examined the behavior of posterior means, medians, and modes for a number of the Bayesian estimation methods but confine our reporting here largely to results for posterior means. Bayesian 95% intervals were obtained as the 2.5% and 97.5% points in the empirical distributions of the MCMC draws for each parameter. The following conclusions are evident from the tables and plots given here and from additional detailed results in Browne (1998), which is available on the web:

• All methods produced estimates of β_0, β_1, and σ²_e which were close to unbiased in all simulation design configurations (Figures 2 and 4), and the actual coverage of nominal 95% intervals for these parameters was close to 95% for all methods in the designs with J = 48 (as was also the case with J = 12 for σ²_e).

• Imbalance in the design had a smaller effect on the results than the number of level 2 units (Figures 4 and 5), and with a few exceptions (see Figures 2 and 3) the effect of ρ on performance was also modest.

Table 4: Summary of RSR results, unbalanced design (3) with J = 12 schools and ρ = 0, based on 877 simulated data sets. Monte Carlo standard errors are in parentheses; bias in the top table is relative (in percentage points) except for σ_u01, where absolute bias is reported [in brackets] since the true value is zero.

Relative Bias (%)
Method             β_0        β_1        σ²_u0      σ_u01        σ²_u1      σ²_e
IGLS               (0.15)     (1.5)      (2.2)      [(0.02)]     (1.6)      (0.40)
RIGLS              (0.15)     (1.5)      (2.2)      [(0.02)]     (1.7)      (0.41)
Wishart prior 1    (0.09)     (1.5)      (2.3)      [(0.02)]     (1.8)      (0.36)
Wishart prior 2    (0.09)     (1.5)      (2.1)      [(0.02)]     (1.7)      (0.36)
Uniform prior 3    (0.09)     (1.5)      (5.1)      [(0.05)]     (3.7)      (0.35)

95% Interval Coverage
Method             β_0        β_1        σ²_u0      σ_u01        σ²_u1      σ²_e
IGLS
RIGLS
Wishart prior 1
Wishart prior 2
Uniform prior 3

95% Interval Length
Method             β_0        β_1        σ²_u0      σ_u01        σ²_u1      σ²_e
IGLS               (0.02)     (0.006)    (0.2)      (0.03)       (0.01)     (0.04)
RIGLS              (0.03)     (0.007)    (0.2)      (0.03)       (0.01)     (0.05)
Wishart prior 1    (0.02)     (0.005)    (0.2)      (0.04)       (0.02)     (0.04)
Wishart prior 2    (0.02)     (0.007)    (0.2)      (0.03)       (0.01)     (0.04)
Uniform prior 3    (0.04)     (0.01)     (0.7)      (0.1)        (0.06)     (0.04)

Note: Monte Carlo SEs for coverage rates in the middle table ranged from 0.7% (for estimates near 95%) to 1.4% (for estimates near 80%).

Table 5: Summary of RSR results, unbalanced design (7) with J = 48 schools and ρ = -0.89, based on 984 simulated data sets. Monte Carlo standard errors are in parentheses; bias in the top table is relative (in percentage points).

Relative Bias (%)
Method             β_0        β_1        σ²_u0      σ_u01      σ²_u1      σ²_e
IGLS               (0.04)     (0.69)     (0.92)     (0.79)     (0.72)     (0.16)
RIGLS              (0.04)     (0.69)     (0.94)     (0.79)     (0.73)     (0.16)
Wishart prior 1    (0.04)     (0.69)     (0.94)     (0.82)     (0.74)     (0.16)
Wishart prior 2    (0.04)     (0.69)     (0.90)     (0.82)     (0.75)     (0.16)
Uniform prior 3    (0.04)     (0.69)     (1.1)      (0.93)     (0.84)     (0.16)

95% Interval Coverage
Method             β_0        β_1        σ²_u0      σ_u01      σ²_u1      σ²_e
IGLS
RIGLS
Wishart prior 1
Wishart prior 2
Uniform prior 3

95% Interval Length
Method             β_0        β_1        σ²_u0      σ_u01      σ²_u1      σ²_e
IGLS               (0.005)    (0.001)    (0.04)     (0.009)    (0.003)    (0.01)
RIGLS              (0.005)    (0.001)    (0.04)     (0.009)    (0.003)    (0.001)
Wishart prior 1    (0.005)    (0.001)    (0.04)     (0.01)     (0.003)    (0.01)
Wishart prior 2    (0.005)    (0.001)    (0.04)     (0.009)    (0.003)    (0.01)
Uniform prior 3    (0.005)    (0.002)    (0.05)     (0.01)     (0.004)    (0.01)

Note: Monte Carlo SEs for coverage rates in the middle table ranged from 0.7% (for estimates near 95%) to 1.0% (for estimates near 90%).

Figure 2: Relative bias (in %) as a function of ρ for β_0, β_1, and σ²_e, design 7 (curves shown for IGLS, RIGLS, Wishart prior 1, Wishart prior 2, and the uniform prior).

Figure 3: Relative bias (in %) as a function of ρ for σ²_u0, σ_u01, and σ²_u1, design 7.

Figure 4: Relative bias (in %) as a function of study design (symbols as in Table 3) for β_0, β_1, and σ²_e, with ρ = 0.

Figure 5: Bias as a function of study design (symbols as in Table 3) for σ²_u0, σ_u01, and σ²_u1, with ρ = 0. The top and bottom panels display relative bias (in %); the middle panel gives absolute bias (the true value is 0).

• IGLS underestimated σ²_u0 and σ²_u1 in all settings examined (Figures 3 and 5), by up to 13 ± 2 percentage points in the case of design *4 (the ± figures denote Monte Carlo standard errors here and below). RIGLS often corrected this bias substantially, although the correction resulted in positive biases of up to 10 ± 2 percentage points (for σ²_u0 in design *3; see Table 4).

• While uniform priors on location parameters in many Bayesian settings, including RSR model (1), perform well even in fairly small-sample situations, a uniform prior on the covariance matrix V_u yielded disastrous results, with (a) routinely large biases on the high side when J = 48 and biases of up to 219 percentage points with J = 12 (Figure 5), and (b) intervals up to 4.8 times as wide as those from the other methods, which nevertheless yielded coverages as low as 82 ± 1% (in design *3, Table 4). Using the posterior median or mode instead of the mean improves the performance to some extent (see BD99), but this prior should be regarded as a failure in calibration terms even with a fairly large sample size.

• The two Wishart priors performed reasonably well with respect to bias in the J = 48 designs (median |relative bias| across these six designs and the three parameters in V_u of 3.9 and 2.8 percentage points, and maximum |relative bias| of 9.0 and 9.3 percentage points, for the W^{-1}(2, I_2) and W^{-1}(4, Σ̂_u) priors, respectively; Table 5), but bias was higher in the J = 12 designs: median and maximum |relative bias| of (7.8, 34) and (4.9, 20) percentage points for these two priors, respectively. Overall the gently data-determined W^{-1}(4, Σ̂_u) prior performed a bit better than the other Wishart prior we examined on bias grounds, but identifying a diffuse prior for covariance matrices in multilevel models with excellent bias properties in small samples is a subject of continuing investigation.

• The actual coverage of nominal 95% IGLS intervals for parameters in the covariance matrix V_u was systematically and substantially below 95% in all designs examined, with values for the diagonal elements as low as 91 ± 1% for J = 48 and 81 ± 1% with J = 12. RIGLS achieved some improvement but still produced intervals with undercoverage; for example, the corresponding RIGLS figures were 92 ± 1% and 85 ± 1%.

• The coverage behavior of the W^{-1}(4, Σ̂_u) prior was similar to that of RIGLS with J = 48 but noticeably inferior when J = 12, with coverage as low as 78 ± 1% for some components of V_u. The W^{-1}(2, I_2) prior matched or exceeded RIGLS (when Monte Carlo noise was considered) in coverage performance in every design examined, but still yielded coverage as low as 88 ± 1% for σ²_u0 in design *4. The W^{-1}(2, I_2) intervals for the covariance components achieved their improved performance by being appropriately wider than the RIGLS intervals (an average of 26% wider in the J = 12 designs, for instance); their lack of perfect coverage, when present, was directly traceable to bias.

The other dimension along which ML and Bayesian methods may be compared is computation time, where maximum likelihood has a clear advantage over MCMC-based approaches (e.g., at 333 MHz, RIGLS takes 8 seconds in real time to fit model (1) to the JSP data, versus about 6 minutes for 50,000 MCMC iterations). To summarize our findings, therefore, we are unable to make a strong recommendation at this time on validity grounds for two-level RSR models with a small number J of level 2 units, but with moderate or large J, echoing the conclusion in BD99 for VC and RELR models, we would recommend (for computational speed) the use of RIGLS estimation in the exploratory stages of the analysis, when a number of models are typically examined, followed by Bayesian fitting with a prior similar to W^{-1}(2, I_2) on the variance components to produce publishable point and interval estimates with the final model chosen§.

§ This is a potentially dangerous strategy in small-sample settings on grounds of failure to propagate model uncertainty (e.g., Draper 1995), but the corrections required to adjust for having performed model selection and fitting on the same data set with, e.g., 48 schools and 887 students (as in the JSP data) should be modest.

3 Alternatives to Gibbs sampling in multilevel models

Gibbs sampling is a natural approach to MCMC fitting in Gaussian variance-components (VC) and random-slopes regression (RSR) models such as (11) and (1), respectively, because the full conditional distributions have simple recognizable forms when the residuals at levels higher than 1 in the hierarchy are treated as latent variables and sampled along with the parameters. However, (a) this is no longer true in random-effects logistic regression (RELR) models, and (b) even when it is true it is possible that other MCMC methods will be more efficient than Gibbs sampling. The main natural alternative to Gibbs in multilevel models is a hybrid of Metropolis and Gibbs sampling, in which Metropolis updates are calculated for some of the unknowns and Gibbs sampling is employed for the remainder. Furthermore, there are two additional kinds of flexibility in the Metropolis updates: (i) these may either be performed on the unknowns one at a time or in blocks, and (ii) the Metropolis proposal distributions may either be fixed throughout the run or chosen adaptively at the beginning of the sampling. In the rest of this section we elaborate on these alternatives and present some MCMC efficiency comparisons.

3.1 A hybrid Metropolis-Gibbs sampler with univariate updates

The general three-level Gaussian model may be written

    y_ijk = X_1ijk β_1 + X_2ijk β_2jk + X_3ijk β_3k + e_ijk,
    e_ijk ~ N(0, σ²),   β_2jk ~ N_{p_2}(0, V_2),   β_3k ~ N_{p_3}(0, V_3),        (12)

in which β_1 collects together the fixed effects and β_2 and β_3 are the level 2 and 3 residuals, with covariance matrices V_2 and V_3, respectively. A general L-level model has one set of fixed effects, L sets of residuals (although the residuals at level 1 are available by subtraction and do not need to be sampled), and L sets of (co)variance parameters. As in equation (9) all of these quantities may be naturally split into four groups: the fixed effects, the (L - 1) sets of residuals (excluding those at level 1), the level 1 scalar variance σ², and the (L - 1) higher-level covariance matrices. Assuming uniform priors for the fixed effects and inverse gamma and inverse Wishart prior distributions for σ² and the covariance matrices, respectively, the Gibbs sampling algorithm for the L-level model involves generalizations of the full conditional distributions in (9). An alternative Metropolis-Gibbs hybrid approach which generalizes naturally to RELR models uses (a) univariate-update random-walk Metropolis sampling on the fixed effects and residuals and (b) Gibbs sampling for σ² and the covariance matrices. This method requires specification of the proposal distribution variances, a choice which affects the MCMC efficiency of the combined algorithm.

In more detail, the idea in generating a proposed move β_1k^(t) for element k of the fixed-effects vector β_1 at iteration t is to draw from the N(β_1k^(t-1), κ_k σ̂²_k) distribution, where σ̂²_k is an estimate of the posterior variance of β_1k and κ_k is a well-chosen scale factor. In Gaussian settings simpler than multilevel models, Gelman, Roberts, and Gilks (1995) showed that the optimal value for κ_k is approximately 5.8 for univariate parameters when σ²_k is known, leading to an optimal acceptance rate of about 44%, with lower optimal κ_k values as the dimensionality of the parameter of interest increases. Estimates of σ²_k are readily available in the MLwiN context from maximum likelihood, but this leaves open the question of what to use for κ_k in multilevel analyses.

We performed a small simulation study to address this issue, with results as summarized in Table 6 and Figure 6. The first five models in Table 6 were variations on (1) and (11) applied to the JSP data described at the beginning of Section 2 (VC1 is model (11); to this model VC2 adds math3 as a linear predictor with nonrandom slope, and VC3 includes in addition the student's gender (also treated nonrandomly); RSR2 is model (1), and RSR3 adds gender with random slopes). The last row in Table 6, SCH1, pertains to model (1) applied to a different educational data set from Rasbash et al. (1999) with 4,059 students in 65 schools, in which the outcome is an examination score at age 16 and the predictor is the student's London Reading Test score at age 11.
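
As a concrete (and generic) illustration of such an update, a single-component random-walk Metropolis step might be sketched as follows; log_post stands for whatever log full-conditional or log-posterior evaluation the sampler uses, and is a placeholder rather than an MLwiN routine.

```python
import numpy as np

def rw_metropolis_step(theta, k, log_post, kappa_k, sd_hat_k, rng):
    """Univariate random-walk Metropolis update for component k of theta: propose from
    N(theta_k, kappa_k * sd_hat_k^2), where sd_hat_k^2 estimates the posterior variance
    (e.g., from IGLS/RIGLS) and kappa_k is the scale factor discussed in the text."""
    prop = theta.copy()
    prop[k] = theta[k] + rng.normal(0.0, np.sqrt(kappa_k) * sd_hat_k)
    if np.log(rng.uniform()) < log_post(prop) - log_post(theta):
        return prop, True                 # accept the proposed move
    return theta, False                   # reject and keep the current value
```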

Figure 6: Effect of scale factor and acceptance rate on default Raftery-Lewis n̂_M value for parameters β_0 and β_1 in model (1) applied to the JSP data. Solid curves are robust (lowess) smooths; the horizontal scale in the top plots is logarithmic.

Table 6: Optimal scale factors and ranges of near-optimal acceptance rates for various VC and RSR models (values are approximate averages across three Monte Carlo repetitions).

          Optimal Scale Factor            Near-Optimal Acceptance Rate
Model     β_0      β_1      β_2           β_0      β_1        β_2
VC1
VC2                                                40-60%
VC3                                                40-60%     40-65%
RSR2                                               35-65%
RSR3                                               40-70%     40-65%
SCH1                                               40-70%

Figure 6 is based on (1) applied to the JSP data, and uses the default Raftery-Lewis recommended length of monitoring run n̂_M as the measure of MCMC efficiency. In all cases three runs (each employing a burn-in of 500 from RIGLS starting values and a monitoring run of 50,000) were made with different random seeds at each of a variety of scale factors from 0.05 to 20, and the results averaged (Table 6) or displayed (Figure 6). It is clear that (a) the optimal scale factors vary considerably from one model and parameter to another (while always remaining substantially below 5.8), but (b) the near-optimal acceptance rate is quite flat in the region 45-60% for a wide variety of models and parameters. In view of these findings we decided to provide MLwiN with a hybrid Metropolis-Gibbs option that chooses the scale factors adaptively by monitoring the acceptance rates of all the parameters.

3.2 An adaptive hybrid Metropolis-Gibbs sampler for random-effects logistic regression (RELR) models

We present the idea behind our adaptive hybrid Metropolis-Gibbs sampler in the context of RELR models, where Gibbs sampling by itself is not straightforward because the full conditionals for the fixed effects and residuals do not have simple recognizable forms (see Browne 1998 for details on the adaptive method with Gaussian outcomes, and see, e.g., Müller 1993 for alternative approaches to adaptive MCMC sampling). Consider for illustration a two-level data set with a dichotomous outcome and a model of the form

    (y_ij | p_ij) ~ independent Bernoulli(p_ij)   with   logit(p_ij) = β_0 + Σ_{k=1}^{p} β_k (x_ijk - x̄_k) + u_j,        (13)

where u_j ~ IID N(0, σ²_u). As was the case with model (12), our adaptive hybrid sampler uses Metropolis updates on the fixed effects β = (β_0, ..., β_p) and residuals u_j and Gibbs updates on the variance σ²_u (see BD99 for a detailed description of the updating with a three-level RELR model); the difference is in how the Metropolis proposal distribution (PD) variances are calculated.
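
To make the ingredients of these updates concrete for model (13), the sketch below (illustrative only; the function and argument names are our own, and Xc is assumed to hold a leading column of ones followed by centred predictors) shows the Bernoulli log-likelihood and a random-walk Metropolis update for a single level-2 residual u_j, the kind of step for which no standard closed-form full conditional exists.

```python
import numpy as np

def loglik_relr(beta, u, y, Xc, group):
    """Bernoulli log-likelihood for model (13): logit(p_ij) = Xc beta + u_j."""
    eta = Xc @ beta + u[group]
    return np.sum(y * eta - np.log1p(np.exp(eta)))

def update_uj(j, u, beta, sigma2_u, y, Xc, group, prop_sd, rng):
    """Random-walk Metropolis update for the level-2 residual u_j, with its N(0, sigma2_u) prior."""
    ix = group == j
    def log_cond(uj):
        eta = Xc[ix] @ beta + uj
        return np.sum(y[ix] * eta - np.log1p(np.exp(eta))) - 0.5 * uj ** 2 / sigma2_u
    prop = u[j] + rng.normal(0.0, prop_sd)                 # proposal SD tuned adaptively (below)
    if np.log(rng.uniform()) < log_cond(prop) - log_cond(u[j]):
        u[j] = prop                                        # accept: store the new residual
        return True
    return False                                           # reject: keep the current value
```

The fixed effects β_k would be updated in the same single-component way, and σ²_u by the Gibbs step mentioned above.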

From maximum likelihood starting values we first employ a sampling period of random length (but with an upper bound set by the user) during which the PD variances are adaptively tuned and then eventually fixed for the remainder of the run; this is followed by the usual burn-in period (see Section 2.1.5); and then the main monitoring run, from which posterior summaries are calculated, occurs. The tuning of the PD variances is based on achieving an acceptance rate for each parameter that lies within a user-specified tolerance interval (r - δ, r + δ). The algorithm examines empirical acceptance rates in batches of 100 iterations, comparing them for each parameter with the tolerance interval and modifying the proposal distribution appropriately before going on to the next batch of 100. With r* as the acceptance rate in the most recent batch and σ²_p as the PD variance for a given parameter, the modification performed at the end of each batch is as follows:

    if r* ≥ r:   σ_p → σ_p [ 2 - (1 - r*)/(1 - r) ];   otherwise:   σ_p → σ_p / [ 2 - r*/r ].        (14)

This alters the PD variance by a greater amount the farther the empirical acceptance rate is from the target r. If r* is too low, the proposed moves are too big, so σ²_p is decreased; if r* is too high, the parameter space is being explored with moves that are too small, and σ²_p is increased. If the r* values are within the tolerance interval during three successive batches of 100 iterations, the parameter is marked as satisfying the tolerance conditions, and once all parameters have been marked the overall tolerance condition is satisfied and adapting stops (after a parameter has been marked it is still modified as before until all parameters are marked). To bound the time spent in the adapting procedure an upper limit is set (in MLwiN the default is 5,000 iterations), and when this limit is reached the adapting period ends regardless of whether the tolerance conditions are met (in practice this occurs rarely). Values of (r, δ) = (0.5, 0.1) appear to give near-optimal Metropolis performance for a wide variety of multilevel models and are used as the defaults.
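
Rule (14) is a one-line adjustment per parameter; an illustrative transcription (not the MLwiN source) is:

```python
def adapt_proposal_sd(sigma_p, r_star, r=0.5):
    """Per-batch tuning of a proposal SD following rule (14): inflate the SD when the empirical
    acceptance rate r_star exceeds the target r, shrink it otherwise; no change when r_star == r."""
    if r_star >= r:
        return sigma_p * (2.0 - (1.0 - r_star) / (1.0 - r))
    return sigma_p / (2.0 - r_star / r)
```

In use this would be applied to each parameter's proposal SD after every batch of 100 iterations until the acceptance rates have stayed inside (r - δ, r + δ) for three successive batches.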

Block updating. In both Gaussian and dichotomous-outcome models another approach involves the use of block rather than univariate Metropolis updating. The advantage of block updating (e.g., Gilks et al. 1996) is that it can account for the correlation structure of the unknowns in sets of sensibly chosen blocks, potentially increasing MCMC efficiency. A natural strategy in the L-level generalization of model (12) is to create L sets of blocks, one consisting of all of the fixed effects and the other (L - 1) groups of n_l blocks of size n_rl comprising all of the residuals at levels 2, ..., L, respectively, where n_l is the number of blocks at level l and n_rl is the number of residuals per block at level l. Multivariate normal proposal distributions may then be used for each block, for example with covariance matrices of the form κ Σ̂, in which initially (for instance) Σ̂ is the maximum-likelihood estimate of the block covariance matrix (of dimension p, say). In simple non-hierarchical Gaussian settings Gelman, Roberts, and Gilks (1995) find that κ = 5.8/p is optimal, leading to acceptance rates of the approximate form 0.22 + 0.31/p - 0.09/p². It is also possible to apply a version of the adaptive algorithm described above at the block level, in which Σ̂ is updated during the adaptation period along with κ, bearing in mind that the target value of r should decrease with increasing p. MLwiN release 1.1 offers both fixed and adaptive κ options for block updating.

3.3 RELR computational efficiency results

We have performed a small simulation to compare the adaptive methods of Section 3.2, in terms of MCMC efficiency in RELR models, with Gibbs sampling via adaptive rejection as implemented in the package WinBUGS. Adaptive rejection sampling (ARS; Gilks and Wild 1992) avoids the problem of nonstandard full conditional distributions in models with dichotomous outcomes by using a version of rejection sampling (e.g., Ripley 1987) in which the upper and lower envelopes evolve adaptively depending on the points sampled so far in the run. Results in this section are anecdotal but representative of many similar comparisons we have made.

Tables 7 and 8 compare the MLwiN univariate and multivariate adaptive hybrid Metropolis-Gibbs approaches with ARS in the fitting of model (13) to two data sets. In Table 7 the outcome variable is an indicator in the JSP data (N = 887) of whether or not the student's math5 score was 30 or above (this occurred 67% of the time), and the predictors include a centered version of math3 (x_1) and dummy variables for gender (x_2) and whether the principal wage-earner in the student's family was a manual or nonmanual worker (x_3). The data for Table 8 are taken from the British Election Study (Heath et al. 1996), a longitudinal survey of the determinants of voting behavior. The sample studied here includes N = 800 people, chosen representatively from a total of 110 constituencies (the grouping variable), who were asked (among other things) to report how they voted in the 1983 British general election. The outcome in Table 8 is an indicator of whether the person reported voting Conservative or not (44% of the sample did so), and the predictors were centered versions of variables, each on a 21-point scale, measuring attitudes toward nuclear weapons (x_1), high unemployment as a means to lower inflation (x_2), tax cuts (x_3), and privatization (x_4).

MCMC efficiency is measured with default Raftery-Lewis estimates of required length of monitoring run; results in both tables are based on an average of five chains using different random seeds, each with a burn-in of 500 from good starting values and a monitoring run of 50,000 iterations. Table 7 shows that (a) Gibbs sampling via ARS is the most efficient method per iteration, by a considerable margin, with n̂_M values across the


More information

Part 6: Multivariate Normal and Linear Models

Part 6: Multivariate Normal and Linear Models Part 6: Multivariate Normal and Linear Models 1 Multiple measurements Up until now all of our statistical models have been univariate models models for a single measurement on each member of a sample of

More information

Package lmm. R topics documented: March 19, Version 0.4. Date Title Linear mixed models. Author Joseph L. Schafer

Package lmm. R topics documented: March 19, Version 0.4. Date Title Linear mixed models. Author Joseph L. Schafer Package lmm March 19, 2012 Version 0.4 Date 2012-3-19 Title Linear mixed models Author Joseph L. Schafer Maintainer Jing hua Zhao Depends R (>= 2.0.0) Description Some

More information

Bayes: All uncertainty is described using probability.

Bayes: All uncertainty is described using probability. Bayes: All uncertainty is described using probability. Let w be the data and θ be any unknown quantities. Likelihood. The probability model π(w θ) has θ fixed and w varying. The likelihood L(θ; w) is π(w

More information

ECO 513 Fall 2009 C. Sims HIDDEN MARKOV CHAIN MODELS

ECO 513 Fall 2009 C. Sims HIDDEN MARKOV CHAIN MODELS ECO 513 Fall 2009 C. Sims HIDDEN MARKOV CHAIN MODELS 1. THE CLASS OF MODELS y t {y s, s < t} p(y t θ t, {y s, s < t}) θ t = θ(s t ) P[S t = i S t 1 = j] = h ij. 2. WHAT S HANDY ABOUT IT Evaluating the

More information

STAT 425: Introduction to Bayesian Analysis

STAT 425: Introduction to Bayesian Analysis STAT 425: Introduction to Bayesian Analysis Marina Vannucci Rice University, USA Fall 2017 Marina Vannucci (Rice University, USA) Bayesian Analysis (Part 2) Fall 2017 1 / 19 Part 2: Markov chain Monte

More information

Latent Variable Centering of Predictors and Mediators in Multilevel and Time-Series Models

Latent Variable Centering of Predictors and Mediators in Multilevel and Time-Series Models Latent Variable Centering of Predictors and Mediators in Multilevel and Time-Series Models Tihomir Asparouhov and Bengt Muthén August 5, 2018 Abstract We discuss different methods for centering a predictor

More information

16 : Approximate Inference: Markov Chain Monte Carlo

16 : Approximate Inference: Markov Chain Monte Carlo 10-708: Probabilistic Graphical Models 10-708, Spring 2017 16 : Approximate Inference: Markov Chain Monte Carlo Lecturer: Eric P. Xing Scribes: Yuan Yang, Chao-Ming Yen 1 Introduction As the target distribution

More information

Univariate Normal Distribution; GLM with the Univariate Normal; Least Squares Estimation

Univariate Normal Distribution; GLM with the Univariate Normal; Least Squares Estimation Univariate Normal Distribution; GLM with the Univariate Normal; Least Squares Estimation PRE 905: Multivariate Analysis Spring 2014 Lecture 4 Today s Class The building blocks: The basics of mathematical

More information

Ronald Christensen. University of New Mexico. Albuquerque, New Mexico. Wesley Johnson. University of California, Irvine. Irvine, California

Ronald Christensen. University of New Mexico. Albuquerque, New Mexico. Wesley Johnson. University of California, Irvine. Irvine, California Texts in Statistical Science Bayesian Ideas and Data Analysis An Introduction for Scientists and Statisticians Ronald Christensen University of New Mexico Albuquerque, New Mexico Wesley Johnson University

More information

Bayesian Analysis of Latent Variable Models using Mplus

Bayesian Analysis of Latent Variable Models using Mplus Bayesian Analysis of Latent Variable Models using Mplus Tihomir Asparouhov and Bengt Muthén Version 2 June 29, 2010 1 1 Introduction In this paper we describe some of the modeling possibilities that are

More information

A Search and Jump Algorithm for Markov Chain Monte Carlo Sampling. Christopher Jennison. Adriana Ibrahim. Seminar at University of Kuwait

A Search and Jump Algorithm for Markov Chain Monte Carlo Sampling. Christopher Jennison. Adriana Ibrahim. Seminar at University of Kuwait A Search and Jump Algorithm for Markov Chain Monte Carlo Sampling Christopher Jennison Department of Mathematical Sciences, University of Bath, UK http://people.bath.ac.uk/mascj Adriana Ibrahim Institute

More information

LINEAR MULTILEVEL MODELS. Data are often hierarchical. By this we mean that data contain information

LINEAR MULTILEVEL MODELS. Data are often hierarchical. By this we mean that data contain information LINEAR MULTILEVEL MODELS JAN DE LEEUW ABSTRACT. This is an entry for The Encyclopedia of Statistics in Behavioral Science, to be published by Wiley in 2005. 1. HIERARCHICAL DATA Data are often hierarchical.

More information

An Introduction to Path Analysis

An Introduction to Path Analysis An Introduction to Path Analysis PRE 905: Multivariate Analysis Lecture 10: April 15, 2014 PRE 905: Lecture 10 Path Analysis Today s Lecture Path analysis starting with multivariate regression then arriving

More information

Specifying Latent Curve and Other Growth Models Using Mplus. (Revised )

Specifying Latent Curve and Other Growth Models Using Mplus. (Revised ) Ronald H. Heck 1 University of Hawai i at Mānoa Handout #20 Specifying Latent Curve and Other Growth Models Using Mplus (Revised 12-1-2014) The SEM approach offers a contrasting framework for use in analyzing

More information

Linear Mixed Models. One-way layout REML. Likelihood. Another perspective. Relationship to classical ideas. Drawbacks.

Linear Mixed Models. One-way layout REML. Likelihood. Another perspective. Relationship to classical ideas. Drawbacks. Linear Mixed Models One-way layout Y = Xβ + Zb + ɛ where X and Z are specified design matrices, β is a vector of fixed effect coefficients, b and ɛ are random, mean zero, Gaussian if needed. Usually think

More information

Lecture 5: Spatial probit models. James P. LeSage University of Toledo Department of Economics Toledo, OH

Lecture 5: Spatial probit models. James P. LeSage University of Toledo Department of Economics Toledo, OH Lecture 5: Spatial probit models James P. LeSage University of Toledo Department of Economics Toledo, OH 43606 jlesage@spatial-econometrics.com March 2004 1 A Bayesian spatial probit model with individual

More information

Item Parameter Calibration of LSAT Items Using MCMC Approximation of Bayes Posterior Distributions

Item Parameter Calibration of LSAT Items Using MCMC Approximation of Bayes Posterior Distributions R U T C O R R E S E A R C H R E P O R T Item Parameter Calibration of LSAT Items Using MCMC Approximation of Bayes Posterior Distributions Douglas H. Jones a Mikhail Nediak b RRR 7-2, February, 2! " ##$%#&

More information

Introducing Generalized Linear Models: Logistic Regression

Introducing Generalized Linear Models: Logistic Regression Ron Heck, Summer 2012 Seminars 1 Multilevel Regression Models and Their Applications Seminar Introducing Generalized Linear Models: Logistic Regression The generalized linear model (GLM) represents and

More information

MCMC algorithms for fitting Bayesian models

MCMC algorithms for fitting Bayesian models MCMC algorithms for fitting Bayesian models p. 1/1 MCMC algorithms for fitting Bayesian models Sudipto Banerjee sudiptob@biostat.umn.edu University of Minnesota MCMC algorithms for fitting Bayesian models

More information

Introduction to Matrix Algebra and the Multivariate Normal Distribution

Introduction to Matrix Algebra and the Multivariate Normal Distribution Introduction to Matrix Algebra and the Multivariate Normal Distribution Introduction to Structural Equation Modeling Lecture #2 January 18, 2012 ERSH 8750: Lecture 2 Motivation for Learning the Multivariate

More information

Estimating a Piecewise Growth Model with Longitudinal Data that Contains Individual Mobility across Clusters

Estimating a Piecewise Growth Model with Longitudinal Data that Contains Individual Mobility across Clusters Estimating a Piecewise Growth Model with Longitudinal Data that Contains Individual Mobility across Clusters Audrey J. Leroux Georgia State University Piecewise Growth Model (PGM) PGMs are beneficial for

More information

A Note on Bayesian Inference After Multiple Imputation

A Note on Bayesian Inference After Multiple Imputation A Note on Bayesian Inference After Multiple Imputation Xiang Zhou and Jerome P. Reiter Abstract This article is aimed at practitioners who plan to use Bayesian inference on multiplyimputed datasets in

More information

ML estimation: Random-intercepts logistic model. and z

ML estimation: Random-intercepts logistic model. and z ML estimation: Random-intercepts logistic model log p ij 1 p = x ijβ + υ i with υ i N(0, συ) 2 ij Standardizing the random effect, θ i = υ i /σ υ, yields log p ij 1 p = x ij β + σ υθ i with θ i N(0, 1)

More information

Stat 5101 Lecture Notes

Stat 5101 Lecture Notes Stat 5101 Lecture Notes Charles J. Geyer Copyright 1998, 1999, 2000, 2001 by Charles J. Geyer May 7, 2001 ii Stat 5101 (Geyer) Course Notes Contents 1 Random Variables and Change of Variables 1 1.1 Random

More information

Markov Chain Monte Carlo (MCMC) and Model Evaluation. August 15, 2017

Markov Chain Monte Carlo (MCMC) and Model Evaluation. August 15, 2017 Markov Chain Monte Carlo (MCMC) and Model Evaluation August 15, 2017 Frequentist Linking Frequentist and Bayesian Statistics How can we estimate model parameters and what does it imply? Want to find the

More information

Multilevel Statistical Models: 3 rd edition, 2003 Contents

Multilevel Statistical Models: 3 rd edition, 2003 Contents Multilevel Statistical Models: 3 rd edition, 2003 Contents Preface Acknowledgements Notation Two and three level models. A general classification notation and diagram Glossary Chapter 1 An introduction

More information

MCMC Methods: Gibbs and Metropolis

MCMC Methods: Gibbs and Metropolis MCMC Methods: Gibbs and Metropolis Patrick Breheny February 28 Patrick Breheny BST 701: Bayesian Modeling in Biostatistics 1/30 Introduction As we have seen, the ability to sample from the posterior distribution

More information

STA 294: Stochastic Processes & Bayesian Nonparametrics

STA 294: Stochastic Processes & Bayesian Nonparametrics MARKOV CHAINS AND CONVERGENCE CONCEPTS Markov chains are among the simplest stochastic processes, just one step beyond iid sequences of random variables. Traditionally they ve been used in modelling a

More information

Markov Chain Monte Carlo in Practice

Markov Chain Monte Carlo in Practice Markov Chain Monte Carlo in Practice Edited by W.R. Gilks Medical Research Council Biostatistics Unit Cambridge UK S. Richardson French National Institute for Health and Medical Research Vilejuif France

More information

Bayesian Inference in GLMs. Frequentists typically base inferences on MLEs, asymptotic confidence

Bayesian Inference in GLMs. Frequentists typically base inferences on MLEs, asymptotic confidence Bayesian Inference in GLMs Frequentists typically base inferences on MLEs, asymptotic confidence limits, and log-likelihood ratio tests Bayesians base inferences on the posterior distribution of the unknowns

More information

Bayesian Inference. Chapter 4: Regression and Hierarchical Models

Bayesian Inference. Chapter 4: Regression and Hierarchical Models Bayesian Inference Chapter 4: Regression and Hierarchical Models Conchi Ausín and Mike Wiper Department of Statistics Universidad Carlos III de Madrid Advanced Statistics and Data Mining Summer School

More information

The Mixture Approach for Simulating New Families of Bivariate Distributions with Specified Correlations

The Mixture Approach for Simulating New Families of Bivariate Distributions with Specified Correlations The Mixture Approach for Simulating New Families of Bivariate Distributions with Specified Correlations John R. Michael, Significance, Inc. and William R. Schucany, Southern Methodist University The mixture

More information

CS242: Probabilistic Graphical Models Lecture 7B: Markov Chain Monte Carlo & Gibbs Sampling

CS242: Probabilistic Graphical Models Lecture 7B: Markov Chain Monte Carlo & Gibbs Sampling CS242: Probabilistic Graphical Models Lecture 7B: Markov Chain Monte Carlo & Gibbs Sampling Professor Erik Sudderth Brown University Computer Science October 27, 2016 Some figures and materials courtesy

More information

Contents. Part I: Fundamentals of Bayesian Inference 1

Contents. Part I: Fundamentals of Bayesian Inference 1 Contents Preface xiii Part I: Fundamentals of Bayesian Inference 1 1 Probability and inference 3 1.1 The three steps of Bayesian data analysis 3 1.2 General notation for statistical inference 4 1.3 Bayesian

More information

Ron Heck, Fall Week 8: Introducing Generalized Linear Models: Logistic Regression 1 (Replaces prior revision dated October 20, 2011)

Ron Heck, Fall Week 8: Introducing Generalized Linear Models: Logistic Regression 1 (Replaces prior revision dated October 20, 2011) Ron Heck, Fall 2011 1 EDEP 768E: Seminar in Multilevel Modeling rev. January 3, 2012 (see footnote) Week 8: Introducing Generalized Linear Models: Logistic Regression 1 (Replaces prior revision dated October

More information

Markov Chain Monte Carlo (MCMC)

Markov Chain Monte Carlo (MCMC) Markov Chain Monte Carlo (MCMC Dependent Sampling Suppose we wish to sample from a density π, and we can evaluate π as a function but have no means to directly generate a sample. Rejection sampling can

More information

Model Estimation Example

Model Estimation Example Ronald H. Heck 1 EDEP 606: Multivariate Methods (S2013) April 7, 2013 Model Estimation Example As we have moved through the course this semester, we have encountered the concept of model estimation. Discussions

More information

Gibbs Sampling in Linear Models #2

Gibbs Sampling in Linear Models #2 Gibbs Sampling in Linear Models #2 Econ 690 Purdue University Outline 1 Linear Regression Model with a Changepoint Example with Temperature Data 2 The Seemingly Unrelated Regressions Model 3 Gibbs sampling

More information

Bayesian linear regression

Bayesian linear regression Bayesian linear regression Linear regression is the basis of most statistical modeling. The model is Y i = X T i β + ε i, where Y i is the continuous response X i = (X i1,..., X ip ) T is the corresponding

More information

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis.

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis. 401 Review Major topics of the course 1. Univariate analysis 2. Bivariate analysis 3. Simple linear regression 4. Linear algebra 5. Multiple regression analysis Major analysis methods 1. Graphical analysis

More information

Eco517 Fall 2013 C. Sims MCMC. October 8, 2013

Eco517 Fall 2013 C. Sims MCMC. October 8, 2013 Eco517 Fall 2013 C. Sims MCMC October 8, 2013 c 2013 by Christopher A. Sims. This document may be reproduced for educational and research purposes, so long as the copies contain this notice and are retained

More information

MLMED. User Guide. Nicholas J. Rockwood The Ohio State University Beta Version May, 2017

MLMED. User Guide. Nicholas J. Rockwood The Ohio State University Beta Version May, 2017 MLMED User Guide Nicholas J. Rockwood The Ohio State University rockwood.19@osu.edu Beta Version May, 2017 MLmed is a computational macro for SPSS that simplifies the fitting of multilevel mediation and

More information

Principles of Bayesian Inference

Principles of Bayesian Inference Principles of Bayesian Inference Sudipto Banerjee University of Minnesota July 20th, 2008 1 Bayesian Principles Classical statistics: model parameters are fixed and unknown. A Bayesian thinks of parameters

More information

Statistical Practice

Statistical Practice Statistical Practice A Note on Bayesian Inference After Multiple Imputation Xiang ZHOU and Jerome P. REITER This article is aimed at practitioners who plan to use Bayesian inference on multiply-imputed

More information

Bayesian non-parametric model to longitudinally predict churn

Bayesian non-parametric model to longitudinally predict churn Bayesian non-parametric model to longitudinally predict churn Bruno Scarpa Università di Padova Conference of European Statistics Stakeholders Methodologists, Producers and Users of European Statistics

More information

Penalized Loss functions for Bayesian Model Choice

Penalized Loss functions for Bayesian Model Choice Penalized Loss functions for Bayesian Model Choice Martyn International Agency for Research on Cancer Lyon, France 13 November 2009 The pure approach For a Bayesian purist, all uncertainty is represented

More information

An Introduction to Mplus and Path Analysis

An Introduction to Mplus and Path Analysis An Introduction to Mplus and Path Analysis PSYC 943: Fundamentals of Multivariate Modeling Lecture 10: October 30, 2013 PSYC 943: Lecture 10 Today s Lecture Path analysis starting with multivariate regression

More information

Supplement to A Hierarchical Approach for Fitting Curves to Response Time Measurements

Supplement to A Hierarchical Approach for Fitting Curves to Response Time Measurements Supplement to A Hierarchical Approach for Fitting Curves to Response Time Measurements Jeffrey N. Rouder Francis Tuerlinckx Paul L. Speckman Jun Lu & Pablo Gomez May 4 008 1 The Weibull regression model

More information

Bayesian inference for factor scores

Bayesian inference for factor scores Bayesian inference for factor scores Murray Aitkin and Irit Aitkin School of Mathematics and Statistics University of Newcastle UK October, 3 Abstract Bayesian inference for the parameters of the factor

More information

A Re-Introduction to General Linear Models (GLM)

A Re-Introduction to General Linear Models (GLM) A Re-Introduction to General Linear Models (GLM) Today s Class: You do know the GLM Estimation (where the numbers in the output come from): From least squares to restricted maximum likelihood (REML) Reviewing

More information

CTDL-Positive Stable Frailty Model

CTDL-Positive Stable Frailty Model CTDL-Positive Stable Frailty Model M. Blagojevic 1, G. MacKenzie 2 1 Department of Mathematics, Keele University, Staffordshire ST5 5BG,UK and 2 Centre of Biostatistics, University of Limerick, Ireland

More information

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling 10-708: Probabilistic Graphical Models 10-708, Spring 2014 27 : Distributed Monte Carlo Markov Chain Lecturer: Eric P. Xing Scribes: Pengtao Xie, Khoa Luu In this scribe, we are going to review the Parallel

More information

TAKEHOME FINAL EXAM e iω e 2iω e iω e 2iω

TAKEHOME FINAL EXAM e iω e 2iω e iω e 2iω ECO 513 Spring 2015 TAKEHOME FINAL EXAM (1) Suppose the univariate stochastic process y is ARMA(2,2) of the following form: y t = 1.6974y t 1.9604y t 2 + ε t 1.6628ε t 1 +.9216ε t 2, (1) where ε is i.i.d.

More information

STA414/2104 Statistical Methods for Machine Learning II

STA414/2104 Statistical Methods for Machine Learning II STA414/2104 Statistical Methods for Machine Learning II Murat A. Erdogdu & David Duvenaud Department of Computer Science Department of Statistical Sciences Lecture 3 Slide credits: Russ Salakhutdinov Announcements

More information

Bayesian Inference for the Multivariate Normal

Bayesian Inference for the Multivariate Normal Bayesian Inference for the Multivariate Normal Will Penny Wellcome Trust Centre for Neuroimaging, University College, London WC1N 3BG, UK. November 28, 2014 Abstract Bayesian inference for the multivariate

More information

Lecture 5. G. Cowan Lectures on Statistical Data Analysis Lecture 5 page 1

Lecture 5. G. Cowan Lectures on Statistical Data Analysis Lecture 5 page 1 Lecture 5 1 Probability (90 min.) Definition, Bayes theorem, probability densities and their properties, catalogue of pdfs, Monte Carlo 2 Statistical tests (90 min.) general concepts, test statistics,

More information

Bayesian Inference. Chapter 4: Regression and Hierarchical Models

Bayesian Inference. Chapter 4: Regression and Hierarchical Models Bayesian Inference Chapter 4: Regression and Hierarchical Models Conchi Ausín and Mike Wiper Department of Statistics Universidad Carlos III de Madrid Master in Business Administration and Quantitative

More information

Introduction to Bayesian methods in inverse problems

Introduction to Bayesian methods in inverse problems Introduction to Bayesian methods in inverse problems Ville Kolehmainen 1 1 Department of Applied Physics, University of Eastern Finland, Kuopio, Finland March 4 2013 Manchester, UK. Contents Introduction

More information

Markov Chain Monte Carlo

Markov Chain Monte Carlo Markov Chain Monte Carlo Recall: To compute the expectation E ( h(y ) ) we use the approximation E(h(Y )) 1 n n h(y ) t=1 with Y (1),..., Y (n) h(y). Thus our aim is to sample Y (1),..., Y (n) from f(y).

More information

Ages of stellar populations from color-magnitude diagrams. Paul Baines. September 30, 2008

Ages of stellar populations from color-magnitude diagrams. Paul Baines. September 30, 2008 Ages of stellar populations from color-magnitude diagrams Paul Baines Department of Statistics Harvard University September 30, 2008 Context & Example Welcome! Today we will look at using hierarchical

More information

Performance of Likelihood-Based Estimation Methods for Multilevel Binary Regression Models

Performance of Likelihood-Based Estimation Methods for Multilevel Binary Regression Models Performance of Likelihood-Based Estimation Methods for Multilevel Binary Regression Models Marc Callens and Christophe Croux 1 K.U. Leuven Abstract: By means of a fractional factorial simulation experiment,

More information

The Wishart distribution Scaled Wishart. Wishart Priors. Patrick Breheny. March 28. Patrick Breheny BST 701: Bayesian Modeling in Biostatistics 1/11

The Wishart distribution Scaled Wishart. Wishart Priors. Patrick Breheny. March 28. Patrick Breheny BST 701: Bayesian Modeling in Biostatistics 1/11 Wishart Priors Patrick Breheny March 28 Patrick Breheny BST 701: Bayesian Modeling in Biostatistics 1/11 Introduction When more than two coefficients vary, it becomes difficult to directly model each element

More information

Chapter 1 Statistical Inference

Chapter 1 Statistical Inference Chapter 1 Statistical Inference causal inference To infer causality, you need a randomized experiment (or a huge observational study and lots of outside information). inference to populations Generalizations

More information

STAT 518 Intro Student Presentation

STAT 518 Intro Student Presentation STAT 518 Intro Student Presentation Wen Wei Loh April 11, 2013 Title of paper Radford M. Neal [1999] Bayesian Statistics, 6: 475-501, 1999 What the paper is about Regression and Classification Flexible

More information

Estimation in Generalized Linear Models with Heterogeneous Random Effects. Woncheol Jang Johan Lim. May 19, 2004

Estimation in Generalized Linear Models with Heterogeneous Random Effects. Woncheol Jang Johan Lim. May 19, 2004 Estimation in Generalized Linear Models with Heterogeneous Random Effects Woncheol Jang Johan Lim May 19, 2004 Abstract The penalized quasi-likelihood (PQL) approach is the most common estimation procedure

More information

Multilevel Modeling: When and Why 1. 1 Why multilevel data need multilevel models

Multilevel Modeling: When and Why 1. 1 Why multilevel data need multilevel models Multilevel Modeling: When and Why 1 J. Hox University of Amsterdam & Utrecht University Amsterdam/Utrecht, the Netherlands Abstract: Multilevel models have become popular for the analysis of a variety

More information

Use of Bayesian multivariate prediction models to optimize chromatographic methods

Use of Bayesian multivariate prediction models to optimize chromatographic methods Use of Bayesian multivariate prediction models to optimize chromatographic methods UCB Pharma! Braine lʼalleud (Belgium)! May 2010 Pierre Lebrun, ULg & Arlenda Bruno Boulanger, ULg & Arlenda Philippe Lambert,

More information

Bayesian Inference for Regression Parameters

Bayesian Inference for Regression Parameters Bayesian Inference for Regression Parameters 1 Bayesian inference for simple linear regression parameters follows the usual pattern for all Bayesian analyses: 1. Form a prior distribution over all unknown

More information

Pattern Recognition and Machine Learning

Pattern Recognition and Machine Learning Christopher M. Bishop Pattern Recognition and Machine Learning ÖSpri inger Contents Preface Mathematical notation Contents vii xi xiii 1 Introduction 1 1.1 Example: Polynomial Curve Fitting 4 1.2 Probability

More information