Bayesian Meta-analysis with Hierarchical Modeling

Brian P. Hobbs 1

Division of Biostatistics, School of Public Health, University of Minnesota, Mayo Mail Code 303, Minneapolis, Minnesota 55455-0392, U.S.A.

1 Brian P. Hobbs is Graduate Assistant and Bradley P. Carlin is Professor of Biostatistics and Mayo Professor in Public Health at the Division of Biostatistics, School of Public Health, 420 Delaware St. S.E., University of Minnesota, Minneapolis, MN 55455.

1 Introduction

The Bayesian approach to inference enables relevant existing information to be formally incorporated into a statistical analysis. This is done through the specification of prior distributions, which summarize our preexisting understanding or beliefs regarding any unknown model parameters θ = (θ_1, ..., θ_K). Inference is conducted on the posterior distribution of θ given the observed data y = (y_1, ..., y_N), given by Bayes' Rule as

    p(θ | y) = p(θ, y) / p(y) = p(y | θ) p(θ) / ∫ p(y | θ) p(θ) dθ .

This simple formulation assumes the prior p(θ) is fully specified. However, when we are less certain about p(θ), or when model variability must be allocated to multiple sources (say, centers and patients within centers), a hierarchical model may be more appropriate. This approach places prior distributions on the unknown parameters of previously specified priors in stages. Posterior distributions are again derived by Bayes' theorem, where the denominator integral is now more difficult, but remains feasible using modern Markov chain Monte Carlo (MCMC) methods. The WinBUGS package (http://www.mrc-bsu.cam.ac.uk/bugs/welcome.shtml) and its open-source cousin OpenBUGS (http://mathstat.helsinki.fi/openbugs) are able to handle a wide variety of hierarchical models, permitting posterior inference, prediction, model choice, and model checking all within a user-friendly MCMC framework.

Hierarchical models permit borrowing of strength from the prior distributions and across subgroups. When combined with any information incorporated in the priors, this translates into a larger effective sample size, thus offering potentially important savings (both ethical and financial) in the practice of drug and device clinical trials.

Suppose the prior distribution for θ depends on a vector of second-stage parameters γ. These parameters are called hyperparameters, and we then write p(θ | γ). In a simple two-stage model, γ is assumed to be known, and is often set to produce a noninformative prior (i.e., one that does not favor one value of θ over any other). However, if γ is unknown, a third-stage prior, or hyperprior, p(γ) may be chosen. In clinical trials, p(γ) is often determined at least in part using data from existing historical controls. This additional informative content is part of what gives Bayesian methods their advantage over classical methods, although this advantage is typically small if noninformative priors are used.
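With this notation, the three-stage posterior is simply Bayes' theorem applied to the joint distribution of (θ, γ); written out explicitly,

    p(θ, γ | y) ∝ p(y | θ) p(θ | γ) p(γ),   and
    p(θ | y) = ∫ p(y | θ) p(θ | γ) p(γ) dγ / ∫∫ p(y | θ) p(θ | γ) p(γ) dθ dγ .

The double integral in the denominator is the quantity that rarely has a closed form and motivates the MCMC methods mentioned above.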

2 Application of Bayes Methods to Meta-analysis

Results vary across studies due to random variation or differences in implementation. The studies may be carried out at different times and locations or include different types of subjects. Furthermore, the application of eligibility criteria may vary. These differences may lead to disparate conclusions about the intervention of interest across studies. Consider, for example, two studies which test the ability of a particular cardiac device to improve heart efficiency by increasing the amount of blood pumped out of the left ventricle relative to the amount of blood contained in the ventricle. Suppose that both studies define eligible patients as those who have a left ventricular ejection fraction (LVEF) as low as 25%. One investigator may admit every such eligible candidate patient. A second investigator might raise the LVEF boundary to 40% for a subset of individuals with another condition, restricting eligibility. Consequently, the first study may tend to incorporate a frailer population, so that the first study may suggest that the device is less effective than the second (Berry, 1997).

In such cases, a single comprehensive analysis of all relevant data from several independent studies, or a meta-analysis, is often used to assess the clinical effectiveness of healthcare interventions. Results from meta-analyses provide a thorough assessment of the intervention of interest. The Bayesian hierarchical approach to meta-analysis treats study as one level of experimental unit, and patient within study as a second level (Lindley and Smith, 1972; Berger, 1985). Inter-study differences may be accounted for by measured covariates as in the above illustration; however, unaccounted-for differences will still remain. Since the Bayesian paradigm treats all unknowns as random, a Bayesian meta-analysis can be structured as a random effects model. Specifically, each study in a Bayesian meta-analysis has a distribution of patient responses specific to that particular study. Thus selecting a study corresponds to selecting one of these distributions. Furthermore, one observes only a sample from each study's distribution, revealing only indirect information about the distribution of study-specific effects.

2.1 Meta-analysis for a Single Success Proportion

Berry (1997, Sec. 3.1) describes a simple yet commonly occurring setting where Bayesian meta-analysis pays significant dividends, illustrating with the data in Table 1. These data are from nine antidepressant drug studies (Janicak et al., 1988), where a success is considered a positive response to the treatment regimen.

    Study (i)    x_i    n_i    π̂_i = x_i / n_i
    1             20     20    1.00
    2              4     10    0.40
    3             11     16    0.69
    4             10     19    0.53
    5              5     14    0.36
    6             36     46    0.78
    7              9     10    0.90
    8              7      9    0.78
    9              4      6    0.67
    Total        106    150    0.71

Table 1: Successes x_i and total numbers of patients n_i in 9 antidepressant drug studies.

For our purpose of illustrating Bayesian hierarchical modeling in meta-analysis, suppose a success concerns effectiveness of a medical device, and that within study i the experimental units receiving the intervention are exchangeable (all have the same probability of success π_i). Define the random variable x_i to be the number of successes among the n_i patients in study i, so that x_i ~ Binomial(n_i, π_i) for i = 1, ..., 9. The likelihood function for π = (π_1, ..., π_9) is then

    p(x | π) ∝ ∏_{i=1}^{9} π_i^{x_i} (1 − π_i)^{n_i − x_i} .    (1)

A pooled analysis assumes that all 150 patients are independent and identically distributed (iid), so that all nine π_i are equal to a common π. Given 106 total successes in 150 trials, the likelihood function is then p(x | π) ∝ π^106 (1 − π)^44, which suggests that π is very likely to be between 0.6 and 0.8. However, the observed success proportions (Table 1) in five of the nine studies fall outside this range. This is more than would be expected from sampling variability alone, and suggests the π_i may be unequal.
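As a quick numerical check of the pooled-analysis claims above (not part of the original analysis), the following Python sketch, assuming NumPy and SciPy are available, normalizes the pooled likelihood π^106 (1 − π)^44 into a Beta(107, 45) density and counts how many observed study proportions fall outside the interval (0.6, 0.8):

import numpy as np
from scipy.stats import beta

x = np.array([20, 4, 11, 10, 5, 36, 9, 7, 4])
n = np.array([20, 10, 16, 19, 14, 46, 10, 9, 6])

# The pooled likelihood pi^106 (1 - pi)^44, normalized, is a Beta(107, 45) density.
pooled = beta(x.sum() + 1, (n - x).sum() + 1)
print(pooled.interval(0.95))    # central 95% region, roughly (0.63, 0.77)

# Observed study-specific proportions, and how many fall outside (0.6, 0.8)
p_hat = x / n
print(np.sum((p_hat < 0.6) | (p_hat > 0.8)), "of", len(p_hat), "outside (0.6, 0.8)")

Running this confirms that five of the nine observed proportions lie outside (0.6, 0.8), matching the count quoted above.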

Sadly, separate analyses of the nine studies provide even less satisfying results. The effect of an experimental device is not well addressed by giving nine different likelihood functions, or by giving nine different confidence intervals. Consider the probability of success if the device were used in a tenth study with another patient population: separate analyses provide no way to utilize the results from the nine previous studies.

A Bayesian hierarchical perspective provides a beneficial middle ground. Here we view each study's success probability π_i as having been selected from a population. A computationally convenient assumption is to suppose the π_i are a random sample from a beta distribution, i.e., π_i ~ iid Beta(α, β). Denoting the beta function as B(α, β) = Γ(α)Γ(β)/Γ(α + β), each π_i has density

    p(π_i | α, β) = π_i^{α − 1} (1 − π_i)^{β − 1} / B(α, β),    α, β > 0,

a beta distribution with mean E(π_i | α, β) = α/(α + β) and variance Var(π_i | α, β) = αβ / [(α + β)^2 (α + β + 1)]. Since Var(π_i | α, β) → 0 as α + β → ∞, we can think of α + β as measuring homogeneity among studies: if α + β is large, the distribution of the π_i is highly concentrated near its mean, while smaller α and β permit more variability and hence a noticeable study effect (unequal π_i). Assuming that only two parameters index the entire distribution may seem restrictive, but Figure 1 shows the beta family to be surprisingly flexible, able to capture various shapes (flat, bell-shaped, U-shaped, one-tailed, etc.).

Figure 1: Beta(α, β) densities for α, β = 1, 2, 4, 8.

Since the beta prior is conjugate with the binomial likelihood, the posterior of π_i given x_i (and α, β) emerges in closed form using Bayes' theorem as p(π_i | α, β, x_i) ∝ π_i^{α − 1 + x_i} (1 − π_i)^{β − 1 + n_i − x_i}. That is, given (α, β), the π_i are independent Beta(α + x_i, β + n_i − x_i) random variables, with mean

    E(π_i | α, β, x_i) = (α + x_i) / (α + β + n_i) .    (2)
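To make the conjugate update in (2) concrete, here is a small sketch under an arbitrary, purely illustrative choice of hyperparameters (α = β = 2), applied to the Study 2 data from Table 1:

from scipy.stats import beta

alpha, b, x2, n2 = 2.0, 2.0, 4, 10            # illustrative Beta(2, 2) prior; Study 2: 4 successes in 10 patients
post = beta(alpha + x2, b + (n2 - x2))        # conjugacy: posterior is Beta(alpha + x_i, beta + n_i - x_i) = Beta(6, 8)
print(post.mean())                            # (alpha + x_i) / (alpha + beta + n_i) = 6/14, about 0.43
print(post.interval(0.95))                    # equal-tail 95% credible interval for pi_2 given (alpha, beta)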

In order to proceed further with the Bayesian hierarchical approach, the impact of the hyperprior p(α, β) for the second-stage parameters needs to be assessed. Recall that concentrating the hyperprior's probability on large values of α and β suggests homogeneity among the π_i, while small α + β suggests heterogeneity. If each π_i were observable, the posterior distribution of (α, β) would be

    p(α, β | π) ∝ { ∏_{i=1}^{9} π_i^{α − 1} (1 − π_i)^{β − 1} / B(α, β) } p(α, β) .

In reality, the π_i cannot be observed directly, but indirect information about π_1, ..., π_9 is available through the observations x = (x_1, ..., x_9). Therefore, the posterior distribution of α and β, p(α, β | x), is proportional to

    p(α, β) ∏_{i=1}^{9} ∫_0^1 π_i^{x_i + α − 1} (1 − π_i)^{n_i − x_i + β − 1} / B(α, β) dπ_i
        = { ∏_{i=1}^{9} B(α + x_i, β + n_i − x_i) / B(α, β) } p(α, β) .    (3)

Given the data in Table 1, the posterior expected mean success rate for the next patient treated in study i is

    E(π_i | x) = E_{(α,β)} { E[π_i | α, β, x] } = E_{(α,β)} [ (α + x_i) / (α + β + n_i) | x ],    i = 1, ..., 9.    (4)

Next, predictive distributions are obtained by averaging the likelihood over the full posterior distribution. If a new, tenth study similar to the first nine is implemented, inference for π_10 requires the predictive distribution,

    p(π_10 | x_1, ..., x_9) = ∫∫ p(π_10 | α, β) p(α, β | x_1, ..., x_9) dα dβ .

It follows that the expected probability of a successful treatment for a particular patient enrolled in the new study is the mean of the predictive distribution of π_10,

    E(π_10 | x) = E_{(α,β)} { E[π_10 | α, β, x] } = E_{(α,β)} [ α / (α + β) | x ] .    (5)
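Because (3) involves only beta functions, the posterior of (α, β), and hence the expectations in (4) and (5), can be approximated directly on a grid, in the spirit of the discretization used by Berry (1997). The sketch below assumes NumPy/SciPy and a flat hyperprior on (0, 20] x (0, 20], anticipating the continuous U(0, 20) priors adopted in the next paragraph; the grid size and resolution are arbitrary choices.

import numpy as np
from scipy.special import betaln

x = np.array([20, 4, 11, 10, 5, 36, 9, 7, 4])
n = np.array([20, 10, 16, 19, 14, 46, 10, 9, 6])

# Grid over (0, 20] x (0, 20], i.e., a flat hyperprior truncated at 20.
g = np.linspace(0.05, 20, 400)
A, B = np.meshgrid(g, g, indexing="ij")

# log of (3): sum_i [ log B(alpha + x_i, beta + n_i - x_i) - log B(alpha, beta) ]
logpost = sum(betaln(A + xi, B + (ni - xi)) - betaln(A, B) for xi, ni in zip(x, n))
post = np.exp(logpost - logpost.max())
post /= post.sum()                                    # normalize over the grid

# Results should be broadly comparable to the MCMC summaries reported in Section 2.2.
print("E(alpha | x) =", (A * post).sum(), " E(beta | x) =", (B * post).sum())
print("E(pi_10 | x) =", (A / (A + B) * post).sum())   # eq. (5)
for i in range(9):                                    # eq. (4)
    print("E(pi_%d | x) = %.3f" % (i + 1, ((A + x[i]) / (A + B + n[i]) * post).sum()))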

Hyperprior probability distributions are often hard to conceptualize, so reference priors are commonly used when assigning distributions beyond the second stage. In the current model, the shape of the first-stage prior distribution varies considerably for relatively small changes in (α, β), as seen in Figure 1. Therefore, a prior that assigns some probability to large α + β and to small α + β, while placing a moderate amount of probability on roughly equal α and β, will be quite effective in covering a wide range of shapes for p(π_i | α, β). Bowing to the computational limitations of the time, Berry (1997) adopted independent discrete uniform priors on {1, 2, ..., 10} for α and β, essentially discretizing the (α, β) space onto a square 10 x 10 grid. Here we switch to independent continuous U(0, 20) priors, a joint flat prior over a broad range of sensible values. Thus the posterior probability density function p(α, β | x) is proportional to the likelihood restricted to [0, 20] x [0, 20]. Note that values larger than 20 still carry some likelihood, so the truncation of α and β at 20 is a slight approximation made by this model.

2.2 Sampling-Based Inference using MCMC

In order to analyze the data in Table 1 using our three-stage model, we use Markov chain Monte Carlo (MCMC) computational methods implemented in WinBUGS. These methods operate by sampling from a Markov chain whose stationary distribution is the joint posterior distribution. This permits easy evaluation of posterior distributions for the π_i, which lack closed forms due to the nonconjugate hyperprior for (α, β). Specifically, we may estimate E(π_i | x) in (4) by sampling {(α^(g), β^(g)), g = 1, ..., G} from their joint posterior, and then using the Monte Carlo approximation

    Ê(π_i | x) = (1/G) ∑_{g=1}^{G} (α^(g) + x_i) / (α^(g) + β^(g) + n_i) .

The Gibbs sampler begins the Markov chain with initial values ("inits") (π^(0), α^(0), β^(0)), and then successively samples from the conditional distributions for α, β, and the π_i. We usually discard draws from the first K iterations, the initial transient or "burn-in" period, though choosing reasonable initial values can reduce the need for this. Typically, multiple Markov chains are started from disparate initial values and checked to see if they all appear to have the same equilibrium distribution. Modern software packages make MCMC sampling quick and relatively easy; the popular WinBUGS package also offers several convergence checks and output summary tools.
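The conditional structure just described can also be sketched directly, outside WinBUGS. In the illustrative Python sketch below (not the WinBUGS sampler itself; the random-walk proposal scale is an arbitrary choice, and NumPy/SciPy are assumed), each π_i is drawn exactly from its conjugate Beta full conditional, while (α, β) are updated with a Metropolis step under the U(0, 20) hyperpriors:

import numpy as np
from scipy.stats import beta as beta_dist

rng = np.random.default_rng(0)
x = np.array([20, 4, 11, 10, 5, 36, 9, 7, 4])
n = np.array([20, 10, 16, 19, 14, 46, 10, 9, 6])

def log_cond_ab(a, b, pi):
    # log p(alpha, beta | pi), up to a constant, under independent U(0, 20) hyperpriors
    if not (0.0 < a < 20.0 and 0.0 < b < 20.0):
        return -np.inf
    return np.sum(beta_dist.logpdf(pi, a, b))

G, burn = 30000, 10000
a, b = 2.0, 2.0                                   # initial values ("inits") for the hyperparameters
keep = []
for g in range(G):
    # Gibbs step: pi_i | alpha, beta, x_i ~ Beta(alpha + x_i, beta + n_i - x_i), drawn exactly
    pi = np.clip(rng.beta(a + x, b + n - x), 1e-12, 1 - 1e-12)
    # Metropolis step for (alpha, beta): symmetric random-walk proposal, scale 1.0 chosen arbitrarily
    a_prop, b_prop = a + rng.normal(0, 1.0), b + rng.normal(0, 1.0)
    if np.log(rng.uniform()) < log_cond_ab(a_prop, b_prop, pi) - log_cond_ab(a, b, pi):
        a, b = a_prop, b_prop
    if g >= burn:
        keep.append((a, b))

a_s, b_s = np.array(keep).T
print("posterior means of (alpha, beta):", a_s.mean(), b_s.mean())
print("E(pi_10 | x), cf. (5):", (a_s / (a_s + b_s)).mean())
print("E(pi_2 | x), cf. (4):", ((a_s + x[1]) / (a_s + b_s + n[1])).mean())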

Using the WinBUGS code, data, and inits shown in Appendix A, we ran the Gibbs sampler for 30,000 iterations, discarding the first 10,000 as burn-in. Notice that we added a tenth study of n_10 = 10 patients whose number of successes x_10 is left unobserved. Summary statistics for the posterior distributions of the model parameters are shown in Figure 2. The posterior mean of (α, β) given x is (9.50, 4.30), and Figure 3 shows that α and β are highly correlated a posteriori.

Figure 2: Posterior summary statistics generated in WinBUGS given the data in Table 1.

Figure 3: Bivariate posterior scatterplot of (α, β) given the data in Table 1. Correlation(α, β) = 0.865.

The expected posterior predictive mean success rate for a patient in a new study, i = 10, is the posterior mean of τ = α/(α + β) (equivalently, the posterior mean of π_10), which from Figure 2 is approximately 0.69. The posterior mean success rates (4) for each of the nine studies are also given in Figure 2. These posterior

means represent the probability that the intervention is successful for the next patient in the respective study. The 0.025 and 0.975 posterior quantiles are also given in Figure 2, permitting ready evaluation of equal-tail 95% Bayesian credible intervals for the π_i. Regression (or shrinkage) toward the overall mean is observed in the predictive probabilities for each of the nine studies. More shrinkage occurs for smaller studies, since their likelihoods are less informative; see e.g. the high shrinkage and relatively wide credible interval for study 2, and the weighted-average form of (2) noted at the end of this section.

Figure 4: Posterior density of π_10 (the next study) given the data in Table 1.

Finally, Figure 4 plots the posterior distribution of the expected success proportion in the next study, i = 10, given the results of the previous nine studies, as in (5), as well as the likelihood function that assumes all nine studies share a common success probability π. The posterior distribution of π_10 clearly has more variability, and appears more consistent with the success proportions observed in Table 1.
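The shrinkage pattern can be read directly from (2): for fixed (α, β), the conditional posterior mean of π_i is a weighted average of the observed proportion and the prior mean,

    E(π_i | α, β, x_i) = w_i (x_i / n_i) + (1 − w_i) α/(α + β),    where w_i = n_i / (α + β + n_i),

so the smaller n_i is relative to α + β, the more the estimate is pulled toward the overall mean α/(α + β).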

References

Berger, J.O. (1985). Statistical Decision Theory and Bayesian Analysis, 2nd ed. New York: Springer-Verlag.

Berry, D.A. (1997). Using a Bayesian approach in medical device development. White paper, Center for Devices and Radiological Health, U.S. Food and Drug Administration, Rockville, MD.

Berry, D.A. (2006). Bayesian clinical trials. Nature Reviews Drug Discovery, 5, 27–36.

Janicak, P.G., Lipinski, J., Davis, J.M., Comaty, J.E., Waternaux, C., Cohen, B., Altman, E., and Sharma, R.P. (1988). S-adenosyl-methionine (SAMe) in depression: a literature review and preliminary data report. Alabama Journal of Medical Sciences, 25, 306–312.

Lindley, D.V., and Smith, A.F.M. (1972). Bayes estimates for the linear model (with discussion). J. Roy. Statist. Soc., Ser. B, 34, 1–41.

Spiegelhalter, D.J., Abrams, K.R., and Myles, J.P. (2004). Bayesian Approaches to Clinical Trials and Health-Care Evaluation. Chichester: Wiley.

Spiegelhalter, D.J., Best, N., Carlin, B.P., and van der Linde, A. (2002). Bayesian measures of model complexity and fit (with discussion). J. Roy. Statist. Soc., Ser. B, 64, 583–639.

Spiegelhalter, D.J., Freedman, L.S., and Parmar, M.K.B. (1994). Bayesian approaches to randomized trials (with discussion). J. Roy. Statist. Soc., Ser. A, 157, 357–416.

Appendix A: WinBUGS code for the Meta-analysis example

Below is a minimal sketch of the WinBUGS model, data, and initial values consistent with the analysis of Section 2 (binomial likelihood, Beta(α, β) first-stage prior, independent U(0, 20) hyperpriors, and an unobserved tenth study of n_10 = 10 patients).

model {
  for (i in 1:10) {
    x[i] ~ dbin(pi[i], n[i])       # first stage: binomial likelihood
    pi[i] ~ dbeta(alpha, beta)     # second stage: study-specific success probabilities
  }
  alpha ~ dunif(0, 20)             # third stage: flat hyperpriors
  beta ~ dunif(0, 20)
  tau <- alpha / (alpha + beta)    # expected success rate in a new study
}

# Data (x[10] is NA, so pi[10] is predicted rather than estimated from observed successes)
list(x = c(20, 4, 11, 10, 5, 36, 9, 7, 4, NA),
     n = c(20, 10, 16, 19, 14, 46, 10, 9, 6, 10))

# Inits
list(alpha = 2, beta = 2,
     pi = c(0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5))