The Relationship Between the Power Prior and Hierarchical Models


Bayesian Analysis (2006) 1, Number 3, pp. 551–574

Ming-Hui Chen and Joseph G. Ibrahim

Abstract. The power prior has emerged as a useful informative prior for the incorporation of historical data in a Bayesian analysis. Viewing hierarchical modeling as the gold standard for combining information across studies, we provide a formal justification of the power prior by examining analytical relationships between the power prior and hierarchical modeling in linear models. Asymptotic relationships between the power prior and hierarchical modeling are obtained for non-normal models, including generalized linear models. These analytical relationships unify the theory of the power prior, demonstrate its generality, shed new light on benchmark analyses, and provide insights into the elicitation of the power parameter. Several theorems are presented establishing these formal connections, as well as a formal methodology for eliciting a guide value for the power parameter a₀ via hierarchical models.

Keywords: Generalized linear model, hierarchical model, historical data, power prior, prior elicitation, random effects model

1 Introduction

The power prior discussed in Ibrahim and Chen (2000) has emerged as a useful class of informative priors for a variety of situations in which historical data are available. Several applications to clinical trials and epidemiological studies using the power prior have appeared in the literature. Examples of the use of the power prior and its modifications in clinical trials and carcinogenicity studies include Berry (1991), Eddy et al. (1992), Berry and Hardwick (1993), Lin (1993), Spiegelhalter et al. (1994), Berry and Stangl (1996), Chen et al. (2000), Ibrahim and Chen (1998), Ibrahim et al. (1998), Ibrahim et al. (1999), Chen et al. (1999a), Chen et al. (1999b), Ibrahim and Chen (2000), and Ibrahim et al. (2003).
Examples using the power prior in epidemiological studies include Greenhouse and Wasserman (1995), Brophy and Joseph (1995), Fryback et al. (2001), and Tan et al. (2002). A recent book illustrating the use of the power prior in epidemiological studies is Spiegelhalter et al. (2003).

Department of Statistics, University of Connecticut, Storrs, CT, http://www.stat.uconn.edu/~mhchen
Department of Biostatistics, University of North Carolina, Chapel Hill, NC, http://www.sph.unc.edu/bios/facstaff/
© 2006 International Society for Bayesian Analysis ba0003

The power prior for a general regression model can be constructed as follows. Suppose we have historical data from a similar previous study, denoted by D₀ = (n₀, y₀, X₀), where n₀ is the sample size of the historical data, y₀ is the n₀ × 1 response vector, and

X₀ is the n₀ × p matrix of covariates based on the historical data. Further, let π₀(θ) denote the prior distribution for θ before the historical data D₀ are observed. We shall call π₀(θ) the initial prior distribution for θ. Let the data from the current study be denoted by D = (n, y, X), where n denotes the sample size, y denotes the n × 1 response vector, and X denotes the n × p matrix of covariates. Further, denote the likelihood for the current study by L(θ | D), where θ is a vector of indexing parameters. Thus, L(θ | D) is a general likelihood function for an arbitrary regression model, such as a linear model, generalized linear model, random effects model, nonlinear model, or a survival model with censored data. Given the discounting parameter a₀, we define the power prior distribution of θ for the current study as

π(θ | D₀, a₀) ∝ L(θ | D₀)^{a₀} π₀(θ),  (1)

where a₀ weights the historical data relative to the likelihood of the current study, and thus the parameter a₀ controls the influence of the historical data on L(θ | D). The parameter a₀ can be interpreted as a precision parameter for the historical data. Since D₀ is historical data, it is unnatural in most applications, including clinical trials and carcinogenicity studies, to weight the historical data more than the current data; thus it is scientifically more sound to restrict the range of a₀ to be between 0 and 1, and we take 0 ≤ a₀ ≤ 1. One of the main roles of a₀ is that it controls the heaviness of the tails of the prior for θ: as a₀ becomes smaller, the tails of (1) become heavier. Setting a₀ = 1 corresponds to the update of π₀(θ) using Bayes' theorem; that is, with a₀ = 1, (1) corresponds to the posterior distribution of θ based on the historical data. When a₀ = 0, the prior does not depend on the historical data D₀; in this case, π(θ | D₀, a₀ = 0) ∝ π₀(θ). Thus, a₀ = 0 is equivalent to a prior specification with no incorporation of historical data.
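As a small illustration, for a normal mean with known variance the power-prior posterior is available in closed form. The sample sizes, variances, and the flat initial prior π₀(θ) ∝ 1 below are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: current and historical samples, both N(theta, sigma^2),
# with sigma known and a flat initial prior pi_0(theta) proportional to 1.
sigma = 1.0
y = rng.normal(1.0, sigma, size=50)    # current data D
y0 = rng.normal(0.8, sigma, size=200)  # historical data D0

def power_prior_posterior(y, y0, sigma, a0):
    """Posterior mean and variance of theta when the historical likelihood
    is raised to the power a0 (0 <= a0 <= 1)."""
    n, n0 = len(y), len(y0)
    prec = (n + a0 * n0) / sigma**2                  # posterior precision
    mean = (y.sum() + a0 * y0.sum()) / (n + a0 * n0)
    return mean, 1.0 / prec

# a0 = 0 ignores the historical data; a0 = 1 fully pools the two samples.
for a0 in (0.0, 0.5, 1.0):
    m, v = power_prior_posterior(y, y0, sigma, a0)
    print(f"a0={a0:.1f}: mean={m:.3f}, sd={v**0.5:.3f}")
```

At a₀ = 0 the posterior mean is the current-data average; at a₀ = 1 it is the pooled average, matching the two limiting cases described above.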
Therefore, (1) can be viewed as a generalization of the usual Bayesian update of π₀(θ). The parameter a₀ allows the investigator to control the influence of the historical data on the current study. Such control is important in cases where there is heterogeneity between the previous and current studies, or when the sample sizes of the two studies are quite different. One of the most useful applications of the power prior is in model selection problems, since it inherently automates the informative prior specification for all possible models in the model space (see Chen et al. 1999b, Ibrahim et al. 1999, and Ibrahim and Chen 2000).

Hierarchical modeling is perhaps the most common and best known method for combining several sources of information or incorporating prior information. In this paper, we establish a formal analytic connection between the power prior and hierarchical models, both for the normal linear model and for non-normal models, including the class of generalized linear models. This is accomplished by establishing an analytic relationship between the power parameter a₀ and the variance components of the hierarchical model. Establishing this relationship is critical since it formally justifies the power prior based on hierarchical modeling and guides the user in the choice of a₀, on which sensitivity analyses can be based. Such a benchmark power prior would be the prior whose a₀ value corresponds to equivalence between the hierarchical model and the power prior. Thus, in this situation, the benchmark power

prior analysis would be the one that corresponds to hierarchical modeling. The analytical connection we make is therefore important and useful: it characterizes the relationship between the power parameter a₀ and the variance components of the hierarchical model, thereby motivating and justifying the use of the power prior as an informative prior for incorporating historical data, as well as providing a semi-automatic elicitation scheme for a₀ based on hierarchical modeling. Indeed, one of the most difficult and elusive issues in the use of the power prior is the choice of a₀. Since there will typically be little information in the historical data about a₀, it is not at all clear how to use the historical data in eliciting a₀. The analytic relationship between a₀ and the hyperparameters of the hierarchical model provides the data analyst with elicitation strategies for a₀ based on the historical data.

The rest of this paper is organized as follows. In Section 2 we consider the normal hierarchical linear model and develop formal analytical relationships between the power prior and the hierarchical regression model. In Section 3 we develop formal analytical relationships between the power prior and the hierarchical generalized linear model. In Section 4, we extend our results to multiple historical datasets. In Section 5, we use the analytic relationship between a₀ and the hyperparameters of the hierarchical model to devise a formal elicitation method for a₀ based on the historical data. In Section 6, we present a real dataset to demonstrate the main results.

2 Hierarchical Normal Models

2.1 I.I.D. Case

We first consider the i.i.d. case with a single historical dataset. The model for this case can be written as

y_i = θ + ε_i, i = 1, 2, ..., n, and y_{0i} = θ₀ + ε_{0i}, i = 1, 2, ..., n₀,  (2)

where y = (y₁, y₂, ..., y_n)' denotes the current data with sample size n and y₀ = (y_{01}, y_{02}, ..., y_{0n₀})' denotes the historical data with sample size n₀.
In (2), we further assume that the ε_i are i.i.d. N(0, σ²) and the ε_{0i} are i.i.d. N(0, σ₀²) and independent of the ε_i's, where σ² and σ₀² are fixed parameters. The hierarchical model is completed by independently taking

θ₀ | μ, τ² ~ N(μ, τ²) and θ | μ, τ² ~ N(μ, τ²),  (3)

and then taking

μ ~ N(α, ν²),  (4)

where α, ν², and τ² are all fixed hyperparameters. Within the development of this hierarchical model, our goal is to make inferences about the current study through the marginal posterior distribution [θ | y, y₀] ([a | b] denotes the marginal distribution of a given b throughout). Here, θ is the parameter of interest for the current study, such as a treatment effect, and θ₀ denotes the corresponding parameter based on the historical study. The following theorem gives the form of the marginal posterior distribution [θ | y, y₀] obtained from (2), (3), and (4).

Theorem 2.1. Let ȳ₀ = n₀⁻¹ Σ_{i=1}^{n₀} y_{0i} and ȳ = n⁻¹ Σ_{i=1}^{n} y_i. Then

θ | y, y₀ ~ N(μ_h, σ_h²),  (5)

where, writing s₀² = τ² + σ₀²/n₀ for the marginal variance of ȳ₀ given μ, v² = (1/ν² + 1/s₀²)⁻¹, and m = v²(α/ν² + ȳ₀/s₀²),

μ_h = σ_h² ( nȳ/σ² + m/(τ² + v²) )  (6)

and

σ_h⁻² = n/σ² + 1/(τ² + v²).  (7)

The proof of Theorem 2.1 is given in Appendix B. Now we consider a power prior formulation of the model in (2). To do this, we set θ₀ = θ, and the resulting model becomes

y_i = θ + ε_i, and y_{0i} = θ + ε_{0i}.  (8)

Thus, in the power prior formulation, the ε_i are i.i.d. N(0, σ²), and the ε_{0i} are i.i.d. N(0, σ₀²) and independent of the ε_i's, where σ² and σ₀² are fixed. Under this model, the power prior based on the historical data y₀, using the initial prior π₀(θ) ∝ 1, is given by

π(θ | y₀) ∝ exp{ −(a₀/(2σ₀²)) Σ_{i=1}^{n₀} (y_{0i} − θ)² }.  (9)

Straightforward calculations show that

θ | y, y₀ ~ N(μ_p, σ_p²),  (10)

where

μ_p = ( nȳ/σ² + a₀n₀ȳ₀/σ₀² ) / ( n/σ² + a₀n₀/σ₀² )  (11)

and

σ_p² = ( n/σ² + a₀n₀/σ₀² )⁻¹.  (12)

We now examine the relationship between (5) and (10). To do this, we need to find an explicit relationship between μ_h and μ_p, as well as a relationship between σ_h² and σ_p². We are led to the following theorem, which characterizes this relationship.
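Before stating the matching result formally, it is easy to check numerically that with α = 0 and a flat prior on μ (ν² → ∞), the hierarchical posterior of θ coincides with the power-prior posterior when a₀ = σ₀²/(σ₀² + 2n₀τ²). All numbers below are arbitrary illustrative values:

```python
import numpy as np

# Illustrative sample sizes, sample means, and variance components.
n, n0 = 30, 80
ybar, y0bar = 1.2, 0.9
sigma2, sigma02, tau2 = 1.5, 2.0, 0.4

# Hierarchical model with mu integrated out under a flat prior:
# theta | y0 ~ N(y0bar, 2*tau2 + sigma02/n0).
prior_var = 2 * tau2 + sigma02 / n0
prec_h = n / sigma2 + 1 / prior_var
mu_h = (n * ybar / sigma2 + y0bar / prior_var) / prec_h

# Power-prior posterior with a0 chosen as sigma02 / (sigma02 + 2*n0*tau2).
a0 = sigma02 / (sigma02 + 2 * n0 * tau2)
prec_p = n / sigma2 + a0 * n0 / sigma02
mu_p = (n * ybar / sigma2 + a0 * n0 * y0bar / sigma02) / prec_p

print(a0, abs(mu_h - mu_p), abs(prec_h - prec_p))
```

The two posterior means and precisions agree up to floating-point error, previewing the equivalence characterized next.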

Theorem 2.2. The posteriors in (5) and (10) match, i.e., μ_h = μ_p and σ_h² = σ_p², if and only if α = 0 and ν² → ∞ in (4), and

a₀ = ( 2τ²(n₀/σ₀²) + 1 )⁻¹.  (13)

The proof of Theorem 2.2 is given in Appendix B.

Corollary 2.1. The choice of a₀ given in (13) satisfies 0 < a₀ < 1.

Corollary 2.2. The result in Theorem 2.2 can alternatively be obtained by taking a uniform improper prior for μ at the outset, i.e., π(μ) ∝ 1.

The proofs of Corollaries 2.1 and 2.2 are straightforward. Theorem 2.2 gives us a useful characterization of the explicit relationship between the power prior and the hierarchical model, and we see from this theorem that the two models are equivalent if a₀ is chosen as in (13). We see from (13) that a₀ is a monotonic function of τ², and if τ² → 0, then a₀ → 1. This implies that if θ = θ₀ with probability 1, the historical and current data should be weighted equally. Also, the larger the sample size of the historical data, the less weight is given to the historical data. This is a desirable property since, in general, we would never want the historical data to dominate the posterior distribution of θ simply by increasing n₀.

2.2 Regression Model

We now extend the normal hierarchical model in (2) to the normal hierarchical regression model by setting θ = x_i'β and θ₀ = x_{0i}'β₀. This leads to the model

y_i = x_i'β + ε_i, i = 1, 2, ..., n, and y_{0i} = x_{0i}'β₀ + ε_{0i}, i = 1, 2, ..., n₀,  (14)

where β is a p × 1 vector of regression coefficients for the current data, β₀ is the p × 1 vector of regression coefficients for the historical data, x_i is a p × 1 vector of covariates for the i-th subject in the current dataset, and x_{0i} is a p × 1 vector of covariates for the i-th subject in the historical dataset. Similar to (2), we further assume that the ε_i are i.i.d. N(0, σ²) and the ε_{0i} are i.i.d. N(0, σ₀²) and independent of the ε_i's, where σ² and σ₀² are fixed. In addition, we take

β | μ, Ω ~ N_p(μ, Ω), β₀ | μ, Ω ~ N_p(μ, Ω), and π(μ) ∝ 1,  (15)

where Ω is fixed.
Here, β is the parameter vector of interest for the current study, and β₀ denotes the corresponding parameter based on the historical study. The following theorem gives the form of the marginal posterior distribution [β | y, y₀] obtained from (14) and (15).

Theorem 2.3. The marginal posterior distribution of β is given by

β | y, X, y₀, X₀ ~ N_p(A_h⁻¹B_h, A_h⁻¹),  (16)

where

A_h = X'X/σ² + (2Ω)⁻¹ − (2Ω)⁻¹[ (2Ω)⁻¹ + X₀'X₀/σ₀² ]⁻¹ (2Ω)⁻¹,  (17)

B_h = X'y/σ² + (2Ω)⁻¹[ (2Ω)⁻¹ + X₀'X₀/σ₀² ]⁻¹ X₀'y₀/σ₀²,  (18)

X = (x₁, x₂, ..., x_n)', and X₀ = (x_{01}, x_{02}, ..., x_{0n₀})'.

The proof of Theorem 2.3 is given in Appendix B. Now we consider a power prior formulation of the model in (14). To do this, we set β₀ = β, and the resulting model becomes

y_i = x_i'β + ε_i, and y_{0i} = x_{0i}'β + ε_{0i}.  (19)

Thus, in the power prior formulation, the ε_i are i.i.d. N(0, σ²), and the ε_{0i} are i.i.d. N(0, σ₀²) and independent of the ε_i's, where σ² and σ₀² are fixed. Under this model, the power prior based on the historical data (n₀, y₀, X₀), using the initial prior π₀(β) ∝ 1, is given by

π(β | y₀, X₀, a₀) ∝ exp{ −(a₀/(2σ₀²)) (y₀ − X₀β)'(y₀ − X₀β) },  (20)

and the posterior distribution of β is given by

β | y, X, y₀, X₀, a₀ ~ N_p(A_p⁻¹B_p, A_p⁻¹),  (21)

where

A_p = X'X/σ² + (a₀/σ₀²) X₀'X₀  (22)

and

B_p = X'y/σ² + (a₀/σ₀²) X₀'y₀.  (23)

To match (16) and (21), we need to find an explicit relationship between A_h and A_p, as well as a relationship between B_h and B_p. Clearly, for the hierarchical model and the power prior to have identical distributions, we need A_h = A_p and B_h = B_p. We are led to the following theorem, which characterizes this relationship.

Theorem 2.4. Assume that X₀ is of full rank. Then the posteriors in (16) and (21) match, i.e., A_h = A_p and B_h = B_p, if and only if

a₀( I + (2/σ₀²) Ω X₀'X₀ ) = I.  (24)
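The matching condition can be exercised numerically. The sketch below (illustrative dimensions and values) computes the hierarchical precision directly from the marginal model obtained by integrating out β₀ under a flat prior on μ, namely y₀ | β ~ N(X₀β, σ₀²I + 2X₀ΩX₀'), chooses Ω proportional to (X₀'X₀)⁻¹ so that ΩX₀'X₀ is a multiple of the identity, and confirms agreement with the power-prior precision:

```python
import numpy as np

# Illustrative current and historical design matrices and variance settings.
rng = np.random.default_rng(1)
n, n0, p = 40, 60, 3
X = rng.normal(size=(n, p))
X0 = rng.normal(size=(n0, p))
sigma2, sigma02, c = 1.0, 2.5, 0.7

# Prior covariance proportional to (X0'X0)^{-1}, so Omega X0'X0 = c I.
Omega = c * np.linalg.inv(X0.T @ X0)

# Hierarchical posterior precision of beta (beta_0 and mu integrated out):
# A_h = X'X/sigma^2 + X0' (sigma0^2 I + 2 X0 Omega X0')^{-1} X0.
A_h = X.T @ X / sigma2 + X0.T @ np.linalg.inv(
    sigma02 * np.eye(n0) + 2 * X0 @ Omega @ X0.T) @ X0

# Power-prior precision with the scalar a0 solving a0 (1 + 2c/sigma0^2) = 1.
a0 = 1 / (1 + 2 * c / sigma02)
A_p = X.T @ X / sigma2 + a0 * X0.T @ X0 / sigma02

print(np.max(np.abs(A_h - A_p)))  # agreement up to floating-point error
```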

The proof of Theorem 2.4 is given in Appendix B. We note that when p = 1, (24) reduces to (13): setting Ω = τ² and X₀ = (1, 1, ..., 1)', (24) reduces to a₀(1 + (2/σ₀²)τ²n₀) = 1, and thus the condition for a₀ given by (13) is a special case of (24). We also see that Theorem 2.4 implies that for the two models to match, ΩX₀'X₀ must be proportional to the identity matrix; that is, Ω must be of the form Ω = c(X₀'X₀)⁻¹, where c ≥ 0 is a scalar. This form of Ω is quite attractive, especially in model selection contexts, and has been discussed by several authors as a prior covariance matrix, including Zellner (1986), Ibrahim and Laud (1994), and Laud and Ibrahim (1995). Precisely, the form Ω = c(X₀'X₀)⁻¹ corresponds to Zellner's g-prior, and has several nice properties, as discussed in Zellner (1986), Ibrahim and Laud (1994), and Laud and Ibrahim (1995). Taking Ω = c(X₀'X₀)⁻¹ is actually quite attractive in practice since it has the interpretation that, if there is historical data D₀ = (n₀, y₀, X₀) for which the linear model Y₀ = X₀β + ε, ε ~ N(0, σ₀²I), is fit, then the information about β contained in the data D₀ is the Fisher information, given by σ₀⁻²X₀'X₀. Thus, based on Fisher information arguments, it makes scientific sense to assign a multiple of (X₀'X₀)⁻¹ as the prior covariance matrix for β. Our prior does indeed imply that the variation in β₀ and β is linked via X₀. This is reasonable since β and β₀ have the same prior covariance matrix Ω, and β and β₀ represent the same unobservable phenomenon. Therefore, allowing a multiple of the Fisher information for β₀ to be the prior precision matrix makes scientific and intuitive sense. We are now led to the following theorem.

Theorem 2.5. If the g-prior form is taken for Ω, that is, Ω = c(X₀'X₀)⁻¹, then the power prior and the hierarchical model are identical when a₀ is taken to satisfy a₀(I + (2c/σ₀²)I) = I, or

a₀ = 1 / (1 + 2c/σ₀²).

This theorem yields a very appealing result in that it shows the power prior and the hierarchical model are equivalent in the regression setting when a g-prior is chosen for Ω and a₀ is chosen as in Theorem 2.4. This result thus gives more appeal to the g-prior for use in Bayesian inference.

3 Hierarchical Generalized Linear Models

The analytic relationships given in the previous section can be extended to any non-normal model that has an asymptotically normal likelihood. To be specific, we demonstrate the nature of such approximations for the class of generalized linear models. Consider the hierarchical generalized linear model given by

p(y_i | θ_i, τ) = exp{ a_i(τ)(y_iθ_i − b(θ_i)) + c(y_i, τ) }, i = 1, ..., n,

and

p(y_{0i} | θ_{0i}, τ) = exp{ a_i(τ)(y_{0i}θ_{0i} − b(θ_{0i})) + c(y_{0i}, τ) }, i = 1, ..., n₀,

indexed by the canonical parameters θ_i (θ_{0i}) and the scale parameter τ. The functions b(·) and c(·, ·) determine a particular family in the class, such as the binomial, normal, or Poisson. The functions a_i(τ) are commonly of the form a_i(τ) = τ w_i, where the w_i's are known weights. For ease of exposition, we assume that w_i = 1 and that τ is known throughout. We further assume the θ_i's and θ_{0i}'s satisfy the equations

θ_i = θ(η_i), i = 1, 2, ..., n, θ_{0i} = θ(η_{0i}), i = 1, 2, ..., n₀,

and η = Xβ and η₀ = X₀β₀, where η = (η₁, ..., η_n)' and η₀ = (η_{01}, ..., η_{0n₀})'. The function θ(η_i) is called the θ-link, and for a canonical link, θ_i = η_i. Given (β, β₀), y and y₀ are independent, and thus we can write the joint likelihood of (β, β₀) in vector notation for the generalized linear model (GLM) as

p(y, y₀ | β, β₀) = exp{ τ[y'θ(Xβ) − b(θ(Xβ))] + c(y, τ) } × exp{ τ[y₀'θ(X₀β₀) − b(θ(X₀β₀))] + c(y₀, τ) },  (25)

where θ(Xβ) is a componentwise function of Xβ that depends on the link, and b(·) and c(·, ·) are summed over their components. When a canonical link is used, θ(Xβ) = Xβ. To complete the hierarchical GLM specification, we specify priors for (β, β₀) as in the normal linear regression model. That is, we independently take

β ~ N(μ, Ω) and β₀ ~ N(μ, Ω),  (26)

where Ω is fixed. We further take π(μ) ∝ 1. The marginal posterior distribution of β is required to make inferences about β in this framework. Due to the complexity of the hierarchical GLM specification, it is not possible to obtain a closed-form expression for the marginal posterior distribution of β. To overcome this difficulty, we consider an asymptotic approximation to the posterior similar to that of Chen (1985). In the context considered here, asymptotic means n₀ → ∞ and n → ∞. Toward this goal, let

p(β | y, X) = exp{ τ[y'θ(Xβ) − b(θ(Xβ))] + c(y, τ) }

and

p(β₀ | y₀, X₀) = exp{ τ[y₀'θ(X₀β₀) − b(θ(X₀β₀))] + c(y₀, τ) }.

Following Chen (1985) and ignoring constants that are free of the parameters, we have, approximately,

p(β | y, X) ∝ exp{ −(1/2)(β − β̂)'Σ̂⁻¹(β − β̂) }  (27)

and

p(β₀ | y₀, X₀) ∝ exp{ −(1/2)(β₀ − β̂₀)'Σ̂₀⁻¹(β₀ − β̂₀) },  (28)

where β̂ and β̂₀ are the respective maximum likelihood estimates (MLEs) of β and β₀ based on the data (n, y, X) and (n₀, y₀, X₀) under the GLM,

Σ̂⁻¹ = − ∂² ln p(β | y, X)/∂β∂β' |_{β = β̂} and Σ̂₀⁻¹ = − ∂² ln p(β₀ | y₀, X₀)/∂β₀∂β₀' |_{β₀ = β̂₀}.

It is straightforward to show that under the model in (25),

Σ̂⁻¹ = τ X'Δ̂²V̂X and Σ̂₀⁻¹ = τ X₀'Δ̂₀²V̂₀X₀,  (29)

where Δ̂ and V̂ are n × n diagonal matrices with i-th diagonal elements δ_i = δ_i(x_i'β) = dθ_i/dη_i and v_i = v_i(x_i'β) = d²b(θ_i)/dθ_i², evaluated at β̂, and Δ̂₀ and V̂₀ are n₀ × n₀ diagonal matrices with i-th diagonal elements δ_{0i} = δ_{0i}(x_{0i}'β₀) = dθ_{0i}/dη_{0i} and v_{0i} = v_{0i}(x_{0i}'β₀) = d²b(θ_{0i})/dθ_{0i}², evaluated at β̂₀. The approximations in (27) and (28) are valid for large n and large n₀, respectively. Using (27) and (28) and the proof of Theorem 2.3, the marginal posterior distribution of β is then approximated by

β | y, y₀, X, X₀ ~ approx. N_p(Â_h⁻¹B̂_h, Â_h⁻¹),  (30)

where

Â_h = Σ̂⁻¹ + (2Ω)⁻¹ − (2Ω)⁻¹[ (2Ω)⁻¹ + Σ̂₀⁻¹ ]⁻¹ (2Ω)⁻¹  (31)

and

B̂_h = Σ̂⁻¹β̂ + (2Ω)⁻¹[ (2Ω)⁻¹ + Σ̂₀⁻¹ ]⁻¹ Σ̂₀⁻¹β̂₀.  (32)

As in the normal linear regression model, we consider a power prior formulation of the hierarchical GLM given by (25) by setting β₀ = β. The power prior based on the historical data (n₀, y₀, X₀), using the initial prior π₀(β) ∝ 1, is then given by

π(β | y₀, X₀, a₀) ∝ exp{ a₀τ[y₀'θ(X₀β) − b(θ(X₀β))] },  (33)

and the posterior distribution of β is given by

π(β | y, X, y₀, X₀, a₀) ∝ exp{ τ[y'θ(Xβ) − b(θ(Xβ))] } × exp{ a₀τ[y₀'θ(X₀β) − b(θ(X₀β))] }.  (34)

Using (27) and (28), and after some algebra, the posterior distribution of β given by (34) is approximated by

β | y, y₀, X, X₀, a₀ ~ approx. N_p(Â_p⁻¹B̂_p, Â_p⁻¹),  (35)

where

Â_p = Σ̂⁻¹ + a₀Σ̂₀⁻¹ and B̂_p = Σ̂⁻¹β̂ + a₀Σ̂₀⁻¹β̂₀,  (36)

β̂ and β̂₀ are the MLEs, and Σ̂ and Σ̂₀ are defined by (29). Similar to the normal linear regression model, we now match the approximate posterior distributions of β under the GLM. We are led to the following theorem.

Theorem 3.1. The approximate posteriors in (30) and (35) match, i.e., Â_h = Â_p and B̂_h = B̂_p, if and only if

a₀( I + 2ΩΣ̂₀⁻¹ ) = I.  (37)

The proof of Theorem 3.1 is similar to that of Theorem 2.4, and therefore the details are omitted for brevity. If a generalized g-prior form is taken for Ω, i.e., Ω = cΣ̂₀, then the approximate power prior and the hierarchical model are identical when a₀ = 1/(1 + 2c). Thus, again, we see that for the class of GLMs, the connection between the hierarchical model and the power prior is facilitated by taking a generalized g-prior for Ω.

4 Extensions to Multiple Historical Datasets

In this section, we generalize the results given in Theorems 2.2, 2.4, and 3.1 to multiple historical datasets. First, we generalize the hierarchical model in (2)–(4) and the power prior in (10) to multiple historical datasets as follows. Suppose we have L₀ historical datasets. For the current study, the hierarchical model can be written as

y_i = θ + ε_i,  (38)

where i = 1, 2, ..., n, and the ε_i's are i.i.d. N(0, σ²) random variables, where σ² is fixed. For the k-th historical dataset, the hierarchical model can be written as

y_{0ki} = θ_{0k} + ε_{0ki},  (39)

where i = 1, 2, ..., n_{0k}, the ε_{0ki}'s are i.i.d. N(0, σ_{0k}²) and independent of the ε_i's, and the σ_{0k}²'s are fixed for k = 1, 2, ..., L₀. Further, we assume that θ and the θ_{0k}'s are independently distributed as

θ | μ, τ² ~ N(μ, τ²)  (40)

and

θ_{0k} | μ, τ² ~ N(μ, τ²),  (41)

where k = 1, 2, ..., L₀, and τ² is a fixed hyperparameter. Based on the results established in Theorem 2.2, we take a uniform improper prior for μ, i.e., π(μ) ∝ 1, and let θ₀ = (θ_{01}, ..., θ_{0L₀})', y = (y₁, y₂, ..., y_n)', and y₀ = (y_{01}', y_{02}', ..., y_{0L₀}')', where y_{0k} = (y_{0k1}, y_{0k2}, ..., y_{0kn_{0k}})' for k = 1, 2, ..., L₀.
The following theorem gives the form of the marginal posterior distribution [θ | y, y₀] under the model given by (38), (39), (40), and (41).

Theorem 4.1. Using (38), (39), (40), and (41), the marginal posterior distribution of θ is given by

θ | y, y₀ ~ N(μ_mh, σ_mh²),  (42)

where, writing s_k² = τ² + σ_{0k}²/n_{0k}, V = ( Σ_{k=1}^{L₀} 1/s_k² )⁻¹, and m = V Σ_{k=1}^{L₀} ȳ_{0k}/s_k²,

μ_mh = σ_mh² ( nȳ/σ² + m/(τ² + V) ),  (43)

σ_mh⁻² = n/σ² + 1/(τ² + V),  (44)

ȳ = n⁻¹ Σ_{i=1}^{n} y_i, and ȳ_{0k} = n_{0k}⁻¹ Σ_{i=1}^{n_{0k}} y_{0ki} for k = 1, 2, ..., L₀.

The proof of Theorem 4.1 is given in Appendix B. The power prior formulation of the model with multiple historical datasets can be obtained by letting θ_{0k} = θ in (40) and (41). The model for the current data can be written as

y_i = θ + ε_i,  (45)

and the model for the historical data can be written as

y_{0ki} = θ + ε_{0ki},  (46)

where the ε_{0ki} are i.i.d. N(0, σ_{0k}²) for k = 1, 2, ..., L₀. Now the power prior for θ under multiple historical datasets is given by

π(θ | y_{0k}, k = 1, 2, ..., L₀, a₀) ∝ exp{ − Σ_{k=1}^{L₀} (a_{0k}/(2σ_{0k}²)) Σ_{i=1}^{n_{0k}} (y_{0ki} − θ)² },  (47)

where a₀ = (a_{01}, a_{02}, ..., a_{0L₀})' with 0 < a_{0k} < 1 for k = 1, 2, ..., L₀. Using (45) and (46), along with a uniform improper initial prior for θ, i.e., π₀(θ) ∝ 1, we are led to

θ | y, y₀, a₀ ~ N(μ_mp, σ_mp²),  (48)

where

μ_mp = ( nȳ/σ² + Σ_{k=1}^{L₀} a_{0k}n_{0k}ȳ_{0k}/σ_{0k}² ) / ( n/σ² + Σ_{k=1}^{L₀} a_{0k}n_{0k}/σ_{0k}² )  (49)

and

σ_mp² = ( n/σ² + Σ_{k=1}^{L₀} a_{0k}n_{0k}/σ_{0k}² )⁻¹.  (50)

We are now led to the following theorem characterizing the relationship between the power prior and the hierarchical model with multiple historical datasets.

Theorem 4.2. Suppose we take π₀(θ) ∝ 1 for the power prior and π(μ) ∝ 1 for the hierarchical model. Then μ_mh = μ_mp in (43) and (49), and σ_mh² = σ_mp² in (44) and (50), if and only if

a_{0k} = [ σ_{0k}²/(σ_{0k}² + n_{0k}τ²) ] × [ 1 + τ² Σ_{j=1}^{L₀} n_{0j}/(σ_{0j}² + n_{0j}τ²) ]⁻¹, k = 1, 2, ..., L₀.  (51)

The proof of Theorem 4.2 is given in Appendix B.

Corollary 4.1. The choice of a_{0k} given in (51) satisfies 0 < a_{0k} < 1. Furthermore, if τ² ≥ σ_{0k}²/n_{0k} for k = 1, 2, ..., L₀, then the choice of a_{0k} given in (51) also satisfies Σ_{k=1}^{L₀} a_{0k} < 1.

The proof of Corollary 4.1 is straightforward. We also note that when L₀ = 1, then with some obvious adjustments in notation, μ_mh in (43), σ_mh² in (44), and a_{01} in (51) all reduce to μ_h in (6), σ_h² in (7), and a₀ in (13), respectively.

Next, we extend the normal hierarchical regression model (14) to multiple historical datasets as follows. Suppose we have L₀ historical datasets. In vector notation, we assume

y = Xβ + ε and y_{0k} = X_{0k}β_{0k} + ε_{0k}, k = 1, 2, ..., L₀,  (52)

where X is an n × p matrix, X_{0k} is an n_{0k} × p matrix for k = 1, 2, ..., L₀, β and the β_{0k}, k = 1, 2, ..., L₀, are p × 1 vectors of regression coefficients, ε ~ N_n(0, σ²I) with σ² fixed, ε_{0k} ~ N_{n_{0k}}(0, σ_{0k}²I) with σ_{0k}² fixed for k = 1, 2, ..., L₀, and the ε_{0k} are independent of ε. The following theorem gives the marginal posterior distribution of β under this setting.

Theorem 4.3. Assume that β and the β_{0k}, k = 1, 2, ..., L₀, given μ are i.i.d. N_p(μ, Ω) random vectors, where Ω is fixed, and π(μ) ∝ 1. Using (52), the marginal posterior distribution of β is given by

β ~ N_p(A_mh⁻¹B_mh, A_mh⁻¹),  (53)

where

A_mh = X'X/σ² + Ω⁻¹ − Ω⁻¹[ (L₀ + 1)Ω⁻¹ − Σ_{k=1}^{L₀} Ω⁻¹( Ω⁻¹ + X_{0k}'X_{0k}/σ_{0k}² )⁻¹ Ω⁻¹ ]⁻¹ Ω⁻¹  (54)

and

B_mh = X'y/σ² + Ω⁻¹[ (L₀ + 1)Ω⁻¹ − Σ_{k=1}^{L₀} Ω⁻¹( Ω⁻¹ + X_{0k}'X_{0k}/σ_{0k}² )⁻¹ Ω⁻¹ ]⁻¹ Σ_{k=1}^{L₀} Ω⁻¹( Ω⁻¹ + X_{0k}'X_{0k}/σ_{0k}² )⁻¹ X_{0k}'y_{0k}/σ_{0k}².  (55)
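The multiple-dataset weights of Theorem 4.2 can be checked numerically in the i.i.d. case: with a flat prior on μ, the hierarchical posterior is reproduced exactly by the power prior with these a_{0k}. The sample sizes, means, and variances below are illustrative:

```python
import numpy as np

# Illustrative current-study summary and L0 = 3 historical datasets.
n, ybar, sigma2, tau2 = 25, 1.1, 1.0, 0.3
n0 = np.array([40, 60, 100])
y0bar = np.array([0.8, 1.3, 0.95])
sigma02 = np.array([1.5, 0.9, 2.2])

# Hierarchical marginal posterior of theta (mu integrated out, flat prior):
s2 = tau2 + sigma02 / n0                  # Var(y0k_bar | mu)
V = 1 / np.sum(1 / s2)                    # Var(mu | y0)
m = V * np.sum(y0bar / s2)                # E(mu | y0)
prec_h = n / sigma2 + 1 / (tau2 + V)
mu_h = (n * ybar / sigma2 + m / (tau2 + V)) / prec_h

# Power prior with the weights a_{0k} of Theorem 4.2:
a0k = (sigma02 / (sigma02 + n0 * tau2)) / (
    1 + tau2 * np.sum(n0 / (sigma02 + n0 * tau2)))
prec_p = n / sigma2 + np.sum(a0k * n0 / sigma02)
mu_p = (n * ybar / sigma2 + np.sum(a0k * n0 * y0bar / sigma02)) / prec_p

print(a0k, abs(mu_h - mu_p), abs(prec_h - prec_p))
```

Each a_{0k} lies in (0, 1), and since τ² exceeds every σ_{0k}²/n_{0k} here, the weights also sum to less than one, as the corollary indicates.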

The proof of Theorem 4.3 is similar to that of Theorem 2.3 for the normal hierarchical model with a single historical dataset, and thus the proof is omitted for brevity. By letting β_{01} = β_{02} = ... = β_{0L₀} = β in (52), the power prior for the normal linear model with multiple historical datasets can be obtained as

π(β | y_{0k}, X_{0k}, k = 1, 2, ..., L₀, a₀) ∝ exp{ − Σ_{k=1}^{L₀} (a_{0k}/(2σ_{0k}²)) (y_{0k} − X_{0k}β)'(y_{0k} − X_{0k}β) },  (56)

where 0 < a_{0k} < 1 for k = 1, 2, ..., L₀. Assuming π₀(β) ∝ 1, the posterior distribution of β is given by

β | y, X, y_{0k}, X_{0k}, k = 1, 2, ..., L₀ ~ N_p(A_mp⁻¹B_mp, A_mp⁻¹),  (57)

where

A_mp = X'X/σ² + Σ_{k=1}^{L₀} a_{0k}X_{0k}'X_{0k}/σ_{0k}² and B_mp = X'y/σ² + Σ_{k=1}^{L₀} a_{0k}X_{0k}'y_{0k}/σ_{0k}².

The following theorem characterizes the relationship between the power prior and the normal hierarchical regression model with multiple historical datasets.

Theorem 4.4. The posterior distribution of β in (57) under the power prior (56) matches (53) under the normal hierarchical regression model if and only if

a_{0k}[ (L₀ + 1)I − Σ_{j=1}^{L₀} ( I + ΩX_{0j}'X_{0j}/σ_{0j}² )⁻¹ ] = ( I + ΩX_{0k}'X_{0k}/σ_{0k}² )⁻¹  (58)

for k = 1, 2, ..., L₀.

The proof of Theorem 4.4 follows directly from that of Theorem 2.4. We note that when L₀ = 1, (58) reduces to (24), and if p = 1, Ω = τ², and X_{01} = X_{02} = ... = X_{0L₀} = (1, 1, ..., 1)', (58) reduces to (51). Finally, the result given in Theorem 3.1 for GLMs can be extended to multiple historical datasets; details of this extension can be found in Appendix A.

5 Elicitation of a₀

In this section, we discuss two methods for specifying Ω in a Bayesian analysis: (i) using the analytic relationship between the power prior and hierarchical models and taking a fixed value for Ω, namely Ω = c(X₀'X₀)⁻¹, where c is a fixed hyperparameter, as well as a fixed value for σ₀²; and (ii) taking Ω and σ₀² to be random, in which case a prior is specified for (Ω, σ₀²). We first consider situation (i). The results of the previous section, yielding equivalence between the power prior and hierarchical models, shed light on how to elicit a

guide value for a₀ using the historical data. The basic idea is to use the relationship between the power prior and the hierarchical model to specify a guide value for a₀ from the hyperparameters of the hierarchical model. Specifically, we can use the matching condition of Theorem 2.4 to elicit a₀. One natural way to specify a₀ is to take the trace of both sides of that condition and take a₀ to be the solution to the resulting equation. Upon taking traces of both sides, we are led to

    a₀ tr( I + (2/σ₀²) Ω X₀'X₀ ) = tr(I).    (59)

Since I is a p × p matrix, tr(I) = p, and thus (59) implies that a guide value of a₀ is taken to be

    â₀ = p / ( p + (2/σ₀²) tr(Ω X₀'X₀) ).    (60)

If Ω = c(X₀'X₀)⁻¹, (60) reduces to

    â₀ = 1 / (1 + 2c/σ₀²).    (61)

We see that (61) corresponds to the relationship between the power prior and the hierarchical model in the case where θ is a univariate parameter, i.e., the i.i.d. case. We can clearly see that (60) satisfies 0 < â₀ < 1. The guide value for a₀ given in (60) provides a very useful benchmark value for a₀ from which further sensitivity analyses can be conducted. This guide value has a solid theoretical justification and motivation: it is the value of a₀ that results in equivalence between the power prior and the hierarchical model. In the example of Section 6, we show that the guide value given by (60) works quite well in practice.

We now consider situation (ii). If Ω and σ₀² are random, then (60) can be easily modified. In this case, our guide value for a₀ is the posterior expectation of (60) with respect to the joint posterior distribution of (σ₀², Ω), thus leading to

    â₀ = E[ p / ( p + (2/σ₀²) tr(Ω X₀'X₀) ) ],    (62)

where the expectation is taken with respect to the joint posterior distribution of (σ₀², Ω). Similar formulas for the guide value for a₀ are available for GLMs using (37). For the class of GLMs under situation (i), the guide value for a₀ is given by

    â₀ = p / ( p + 2 tr(Ω Σ̂₀⁻¹) ).    (63)
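The guide value (60) and its special case (61) are one-line computations. The following sketch (the design matrix, σ₀², and c below are illustrative choices, not quantities from the paper) also checks that the two expressions agree when Ω = c(X₀'X₀)⁻¹:

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up historical design: n0 = 50 subjects, p = 3 covariates.
X0 = rng.normal(size=(50, 3))
sig02 = 1.8                       # sigma_0^2, fixed under situation (i)

def a0_guide(Omega, X0, sig02):
    """Guide value (60): p / (p + (2/sigma_0^2) tr(Omega X0'X0))."""
    p = X0.shape[1]
    return p / (p + (2.0 / sig02) * np.trace(Omega @ X0.T @ X0))

# A generic positive definite Omega.
A = rng.normal(size=(3, 3))
Omega = A @ A.T + np.eye(3)
a_hat = a0_guide(Omega, X0, sig02)
assert 0.0 < a_hat < 1.0

# Special case Omega = c (X0'X0)^{-1}: tr(Omega X0'X0) = c p, so (60)
# collapses to (61), 1/(1 + 2c/sigma_0^2).
c = 0.5
Omega_c = c * np.linalg.inv(X0.T @ X0)
assert np.isclose(a0_guide(Omega_c, X0, sig02), 1.0 / (1.0 + 2.0 * c / sig02))
```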
When Ω is random under situation (ii) for GLMs, the guide value is obtained by taking the expectation of (63) with respect to the posterior distribution of Ω, leading to

    â₀ = E[ p / ( p + 2 tr(Ω Σ̂₀⁻¹) ) ].    (64)
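The expectation in (64) has no closed form and is evaluated by averaging over posterior draws of Ω. A sketch, with a made-up information matrix Σ̂₀⁻¹ and stand-in inverse gamma draws playing the role of MCMC output for a diagonal Ω (the shape and scale below are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(2)
p = 4

# Sigma_0^{-1}: a made-up SPD matrix standing in for the historical information.
B = rng.normal(size=(p, p))
Sigma0_inv = B @ B.T + p * np.eye(p)

# Stand-in posterior draws of a diagonal Omega; in practice these would come
# from a Gibbs run under the hierarchical model.  Inverse gamma variates are
# simulated as 1 / Gamma(shape, scale = 1/rate).
M = 5000
omega_draws = 1.0 / rng.gamma(shape=3.0, scale=1.0 / 0.05, size=(M, p))

# Guide value (64): posterior mean of p / (p + 2 tr(Omega Sigma0^{-1})).
# For diagonal Omega, tr(Omega S) = sum_j omega_jj * S_jj.
traces = omega_draws @ np.diag(Sigma0_inv)
a_hat = np.mean(p / (p + 2.0 * traces))
assert 0.0 < a_hat < 1.0
```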

Finally, we briefly discuss how to obtain a guide value for a_{0k} in the presence of multiple historical datasets under situation (ii). For ease of exposition, we consider the normal linear model with multiple historical datasets. Using (58), we take the guide value for a_{0k} as

    â_{0k} = E[ tr( ( I + (1/σ²_{0k}) Ω X_{0k}'X_{0k} )⁻¹ ) / ( (L₀+1)p − Σ_{i=1}^{L₀} tr( ( I + (1/σ²_{0i}) Ω X_{0i}'X_{0i} )⁻¹ ) ) ],

where the expectation is taken with respect to the joint posterior distribution of (Ω, σ²_{01}, ..., σ²_{0L₀}). After some matrix algebra, it can be shown that â_{0k} < 1 for k = 1, 2, ..., L₀. When multiple historical datasets are available, we have more information to estimate Ω, and thus Ω is estimated with more precision in this case. In this sense, inference with multiple historical datasets can be quite advantageous in eliciting the power parameter in the power prior.

6 AIDS Data

We consider the two AIDS studies ACTG019 and ACTG036, where ACTG036 represents the current data and ACTG019 represents the historical data. The purpose of this example is to demonstrate the proposed methodology for obtaining the guide value for a₀ discussed in Section 5 in the context of logistic regression. The ACTG019 study was a double-blind placebo-controlled clinical trial comparing zidovudine (AZT) with placebo in persons with CD4 counts less than 500. The sample size for this study, excluding cases with missing data, was n₀ = 823. The response variable y₀ for these data is binary, with 1 indicating death, development of AIDS, or AIDS related complex (ARC), and 0 indicating otherwise. Several covariates were also measured. The ones we use here are CD4 count (x₀₁, cell count per mm³ of serum), age (x₀₂), and treatment (x₀₃). The covariates CD4 count and age are continuous, whereas the treatment covariate is binary. The ACTG036 study was also a placebo-controlled clinical trial comparing AZT with placebo, in patients with hereditary coagulation disorders. The sample size in this study, excluding cases with missing data, was n = 183.
The response variable y for these data is binary, with 1 indicating death, development of AIDS, or AIDS related complex (ARC), and 0 indicating otherwise. Several covariates were measured for these data. The ones we use here are CD4 count (x₁), age (x₂), and treatment (x₃).

We consider the hierarchical logistic regression model to fit the ACTG019 and ACTG036 data. We take Ω to be a 4 × 4 diagonal matrix, and we further assume that the Ω_jj, j = 0, 1, ..., 3, are i.i.d. inverse gamma random variables with scale parameter 0.005. Using (64), we specify a₀ by

    â₀ = E[ 4 / ( 4 + 2 tr(Ω Σ̂₀⁻¹) ) ],    (65)

where Σ̂₀ is computed as in Section 3, and the expectation is taken with respect to the posterior distribution of Ω under the hierarchical logistic regression model. The Gibbs

sampling algorithm is used to sample β = (β₀, β₁, β₂, β₃)', β₀ = (β₀₀, β₀₁, β₀₂, β₀₃)', µ = (µ₀, µ₁, µ₂, µ₃)', and Ω from their respective posterior distributions under the hierarchical model. Table 1 gives the posterior estimates of β, β₀, µ, and Ω based on 10,000 Gibbs iterations after a burn-in of 1,000 iterations. Using (65), we obtain â₀ = 0.45. Table 2 shows the posterior estimates of β using the power prior in (33) with several values of a₀, including a₀ = â₀ = 0.45. From Tables 1 and 2, we can see that the posterior estimates of β using the power prior with a₀ = 0.45 are fairly close to those obtained under the hierarchical model. From Table 2, we also see that the posterior estimates are slightly different for the different values of a₀. In particular, when more weight is given to the historical data, age and treatment become more significant; that is, their 95% Highest Posterior Density (HPD) intervals do not include 0. Thus, the power prior gives the investigator great flexibility in incorporating the historical data into the analysis of the ACTG036 trial.

    Parameter   Posterior Mean   Posterior Std Dev     Parameter   Posterior Mean   Posterior Std Dev
    β₀          −3.8             0.38                  β₀₀         −3.044           0.77
    β₁          −0.78            0.6                   β₀₁         −0.67            0.9
    β₂           0.6             0.38                  β₀₂          0.33            0.8
    β₃          −0.336           0.84                  β₀₃         −0.387           0.44
    µ₀          −3.083           0.4                   Ω₀₀          0.03            0.7
    µ₁          −0.70            0.70                  Ω₁₁          0.0             0.09
    µ₂           0.93            0.49                  Ω₂₂          0.00            0.34
    µ₃          −0.36            0.79                  Ω₃₃          0.0             0.04

Table 1: Posterior Estimates Under the Hierarchical Logistic Regression Model

    a₀      Parameter   Posterior Mean   Posterior Std Dev   95% HPD Interval
    0       β₀          −4.78            0.849               (−6.46, −3.3)
            β₁          −.636            0.449               (−.539, −0.79)
            β₂           0.              0.34                (−0.334, 0.587)
            β₃          −0.057           0.380               (−0.80, 0.698)
    0.45    β₀          −3.96            0.53                (−3.69, −.708)
            β₁          −0.779           0.75                (−., −0.434)
            β₂           0.59            0.4                 (−0.06, 0.53)
            β₃          −0.344           0.96                (−0.74, 0.043)
    1       β₀          −3.04            0.69                (−3.379, −.7)
            β₁          −0.677           0.3                 (−0.97, −0.437)
            β₂           0.30            0.0                 (0.083, 0.53)
            β₃          −0.377           0.39                (−0.654, −0.09)

Table 2: Posterior Estimates Using the Power Prior for the Logistic Regression Model
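A power prior posterior for a logistic regression such as this one has no closed form, but its mode is straightforward to compute. The sketch below uses simulated data standing in for the two trials (the data generator, sample sizes, and coefficients are all hypothetical, and a flat initial prior π₀(β) is assumed), maximizing the power prior log posterior by Newton-Raphson:

```python
import numpy as np

rng = np.random.default_rng(3)

def make_data(n, beta):
    """Simulate a logistic regression dataset with an intercept column."""
    X = np.c_[np.ones(n), rng.normal(size=(n, len(beta) - 1))]
    prob = 1.0 / (1.0 + np.exp(-X @ beta))
    return X, rng.binomial(1, prob)

beta_true = np.array([-1.0, 0.8, -0.5])
X0, y0 = make_data(800, beta_true)   # historical data (stand-in)
X1, y1 = make_data(180, beta_true)   # current data (stand-in)

def neg_log_post(beta, a0):
    """Minus the power prior log posterior with a flat initial prior."""
    def nll(X, y):
        eta = X @ beta
        return np.sum(np.logaddexp(0.0, eta) - y * eta)
    return nll(X1, y1) + a0 * nll(X0, y0)

def map_estimate(a0, iters=25):
    """Newton-Raphson for the posterior mode; the Hessian is exact here."""
    beta = np.zeros(X1.shape[1])
    for _ in range(iters):
        grad = np.zeros_like(beta)
        H = np.zeros((len(beta), len(beta)))
        for X, y, w in ((X1, y1, 1.0), (X0, y0, a0)):
            mu = 1.0 / (1.0 + np.exp(-(X @ beta)))
            grad += w * X.T @ (mu - y)
            H += w * X.T @ (X * (mu * (1.0 - mu))[:, None])
        beta = beta - np.linalg.solve(H, grad)
    return beta

b_none = map_estimate(0.0)   # historical data ignored
b_full = map_estimate(1.0)   # historical data fully pooled
assert neg_log_post(b_full, 1.0) < neg_log_post(b_none, 1.0)
```

Sweeping a₀ over [0, 1] traces how the mode moves from the current-data-only fit toward the pooled fit, mirroring the sensitivity analysis reported in Table 2.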

Appendix A: Extension to Multiple Historical Datasets for GLMs

In this appendix, we generalize the result given in Theorem 3.1 to multiple historical datasets. First, we consider the hierarchical GLM with multiple historical datasets. In vector notation, the likelihood function of (β, β₀₁, ..., β₀L₀) given (y, y₀₁, ..., y₀L₀) is given by

    p(y, y₀₁, ..., y₀L₀ | β, β₀₁, ..., β₀L₀) = exp{ τ[ y'θ(Xβ) − 1'b(θ(Xβ)) ] + c(y, τ) } × Π_{k=1}^{L₀} exp{ τ[ y_{0k}'θ(X_{0k}β_{0k}) − 1'b(θ(X_{0k}β_{0k})) ] + c(y_{0k}, τ) },

where θ(·) and b(·) are applied componentwise. We independently take β ~ N_p(µ, Ω), β_{0k} ~ N_p(µ, Ω) for k = 1, 2, ..., L₀, and π(µ) ∝ 1. Using the asymptotic normal approximations of Section 3 together with Theorems 2.3 and 4.3, the marginal posterior distribution of β is approximated (i.e., for n and the n_{0k} large) by

    β | y, X, y_{0k}, X_{0k}, k = 1, 2, ..., L₀ ~(approx.) N_p(Â_mh⁻¹ B̂_mh, Â_mh⁻¹),    (A.1)

where

    Â_mh = Σ̂⁻¹ + Ω⁻¹ − Ω⁻¹ [ (L₀+1) Ω⁻¹ − Σ_{k=1}^{L₀} Ω⁻¹ ( Ω⁻¹ + Σ̂_{0k}⁻¹ )⁻¹ Ω⁻¹ ]⁻¹ Ω⁻¹,

    B̂_mh = Σ̂⁻¹ β̂ + Ω⁻¹ [ (L₀+1) Ω⁻¹ − Σ_{k=1}^{L₀} Ω⁻¹ ( Ω⁻¹ + Σ̂_{0k}⁻¹ )⁻¹ Ω⁻¹ ]⁻¹ { Σ_{k=1}^{L₀} Ω⁻¹ ( Ω⁻¹ + Σ̂_{0k}⁻¹ )⁻¹ Σ̂_{0k}⁻¹ β̂_{0k} },

β̂ is the MLE of β based on the current data D = (n, y, X), β̂_{0k} is the MLE of β_{0k} based on the historical data D_{0k} = (n_{0k}, y_{0k}, X_{0k}) under the GLM, and Σ̂ and Σ̂_{0k} are the asymptotic covariance matrices of Section 3 based on D and D_{0k}, for k = 1, 2, ..., L₀.

The power prior based on the L₀ historical datasets {(y_{0k}, X_{0k}), k = 1, 2, ..., L₀} using the initial prior π₀(β) ∝ 1 is given by

    π(β | y_{0k}, X_{0k}, k = 1, 2, ..., L₀, a₀) ∝ Π_{k=1}^{L₀} [ exp{ τ[ y_{0k}'θ(X_{0k}β) − 1'b(θ(X_{0k}β)) ] + c(y_{0k}, τ) } ]^{a_{0k}},    (A.2)

and the posterior distribution of β is given by

    π(β | y, X, y_{0k}, X_{0k}, k = 1, 2, ..., L₀, a₀) ∝ exp{ τ[ y'θ(Xβ) − 1'b(θ(Xβ)) ] + c(y, τ) } Π_{k=1}^{L₀} [ exp{ τ[ y_{0k}'θ(X_{0k}β) − 1'b(θ(X_{0k}β)) ] + c(y_{0k}, τ) } ]^{a_{0k}}.    (A.3)

Using the asymptotic normal approximations of Section 3 and after some algebra, the posterior distribution of β given by (A.3) is approximated (i.e., for n and the n_{0k} large) by

    β | y, X, y_{0k}, X_{0k}, k = 1, 2, ..., L₀, a₀ ~(approx.) N_p(Â_p⁻¹ B̂_p, Â_p⁻¹),    (A.4)

where

    Â_p = Σ̂⁻¹ + Σ_{k=1}^{L₀} a_{0k} Σ̂_{0k}⁻¹  and  B̂_p = Σ̂⁻¹ β̂ + Σ_{k=1}^{L₀} a_{0k} Σ̂_{0k}⁻¹ β̂_{0k}.

We are now led to the following theorem.

Theorem A.1. The posterior distribution of β in (A.4), based on the power prior (A.2), matches (A.1) under the hierarchical GLM if and only if

    a_{0k} [ (L₀+1) I − Σ_{i=1}^{L₀} ( I + Ω Σ̂_{0i}⁻¹ )⁻¹ ] = ( I + Ω Σ̂_{0k}⁻¹ )⁻¹    (A.5)

for k = 1, 2, ..., L₀.

Appendix B: Proofs of Theorems

Proof of Theorem 2.1. Under the hierarchical model of Section 2, we can write the joint posterior of (θ, θ₀, µ) as

    π(θ, θ₀, µ | y, y₀) ∝ exp{ (nȳθ − nθ²/2)/σ² + (n₀ȳ₀θ₀ − n₀θ₀²/2)/σ₀² − [ (θ − µ)² + (θ₀ − µ)² ]/(2τ²) − (µ − α)²/(2ν²) }.

Integrating out µ, whose full conditional is normal with precision Q = 2/τ² + 1/ν² and linear coefficient (θ + θ₀)/τ² + α/ν², gives

    π(θ, θ₀ | y, y₀) ∝ exp{ (nȳθ − nθ²/2)/σ² + (n₀ȳ₀θ₀ − n₀θ₀²/2)/σ₀² − (θ² + θ₀²)/(2τ²) + [ (θ + θ₀)/τ² + α/ν² ]² / (2Q) }.

After integrating out θ₀, whose residual precision is d₀ = n₀/σ₀² + 1/τ² − 1/(τ⁴Q) and whose linear coefficient is n₀ȳ₀/σ₀² + θ/(τ⁴Q) + α/(ν²τ²Q), straightforward calculations lead to

    π(θ | y, y₀) ∝ exp{ −(1/2)[ n/σ² + 1/τ² − 1/(τ⁴Q) − (1/(τ⁴Q))²/d₀ ] θ² + [ nȳ/σ² + α/(ν²τ²Q) + (1/(τ⁴Q))( n₀ȳ₀/σ₀² + α/(ν²τ²Q) )/d₀ ] θ },    (B.1)

where Q = 2/τ² + 1/ν². Thus, the marginal posterior of θ given in Theorem 2.1 follows directly from (B.1).

Proof of Theorem 2.2. Write the coefficient on ȳ₀ in the linear term of (B.1) as c₀ = (n₀/σ₀²)(1/(τ⁴Q))/d₀, and write the historical contribution to the posterior precision as h₀ = 1/τ² − 1/(τ⁴Q) − (1/(τ⁴Q))²/d₀, so that 1/σ²_h = n/σ² + h₀. Since µ_p contains no term in α, it is easy to show that µ_h can equal µ_p for all data only if α = 0. When α = 0, µ_h reduces to

    µ_h = σ²_h [ nȳ/σ² + c₀ ȳ₀ ].    (B.2)

To match (B.2) to µ_p, we have to set

    a₀ = (σ₀²/n₀) c₀,    (B.3)

and to match the posterior variances we must have

    n/σ² + a₀ n₀/σ₀² = 1/σ²_h.    (B.4)

Note that (B.4) directly yields that σ²_h is equal to σ²_p. Now, using (B.3), (B.4) reduces to

    c₀ = h₀.    (B.5)

After some algebra, we obtain

    h₀ − c₀ = [ 1/τ² − 2/(τ⁴Q) ] [ 1 + 1/(τ⁴Q d₀) ] = [ 1/(2ν² + τ²) ] [ 1 + 1/(τ⁴Q d₀) ].

Since the second factor is strictly positive, (B.5) holds if and only if ν² → ∞. When ν² → ∞, (B.3) reduces to

    a₀ = (σ₀²/n₀) / (σ₀²/n₀ + 2τ²),

which is the choice of a₀ given in Theorem 2.2. This completes the proof of Theorem 2.2.

Proof of Theorem 2.3. Using the normal hierarchical regression model with a single historical dataset, we have

    π(β, β₀, µ | y, X, y₀, X₀) ∝ exp{ −(1/2σ²)(y − Xβ)'(y − Xβ) } exp{ −(1/2σ₀²)(y₀ − X₀β₀)'(y₀ − X₀β₀) } × exp{ −(1/2)(β − µ)'Ω⁻¹(β − µ) } exp{ −(1/2)(β₀ − µ)'Ω⁻¹(β₀ − µ) }.

Integrating out β₀ leads to

    π(β, µ | y, X, y₀, X₀) ∝ exp{ −(1/2σ²)(y − Xβ)'(y − Xβ) − (1/2)(β − µ)'Ω⁻¹(β − µ) } × exp{ −(1/2)[ µ'( Ω⁻¹ − Ω⁻¹( Ω⁻¹ + (1/σ₀²)X₀'X₀ )⁻¹Ω⁻¹ )µ − 2µ'Ω⁻¹( Ω⁻¹ + (1/σ₀²)X₀'X₀ )⁻¹ (1/σ₀²)X₀'y₀ ] }.

To obtain the marginal distribution of β, we need to integrate out µ. After some further algebra, we obtain

    π(β | y, X, y₀, X₀) ∝ exp{ −(1/2) β'[ (1/σ²)X'X + Ω⁻¹ − Ω⁻¹( 2Ω⁻¹ − Ω⁻¹( Ω⁻¹ + (1/σ₀²)X₀'X₀ )⁻¹Ω⁻¹ )⁻¹Ω⁻¹ ]β + β'[ (1/σ²)X'y + Ω⁻¹( 2Ω⁻¹ − Ω⁻¹( Ω⁻¹ + (1/σ₀²)X₀'X₀ )⁻¹Ω⁻¹ )⁻¹ Ω⁻¹( Ω⁻¹ + (1/σ₀²)X₀'X₀ )⁻¹ (1/σ₀²)X₀'y₀ ] }.

Thus, the marginal posterior given in Theorem 2.3 follows directly from the above equation.

Proof of Theorem 2.4. A_h matches A_p if and only if

    a₀ (1/σ₀²) X₀'X₀ = Ω⁻¹ − Ω⁻¹( 2Ω⁻¹ − Ω⁻¹( Ω⁻¹ + (1/σ₀²)X₀'X₀ )⁻¹Ω⁻¹ )⁻¹ Ω⁻¹.    (B.6)

After some algebra, (B.6) can be written as

    a₀ (1/σ₀²) ΩX₀'X₀ = I − [ 2I − ( I + (1/σ₀²)ΩX₀'X₀ )⁻¹ ]⁻¹,

which is equivalent to

    a₀ (1/σ₀²) ΩX₀'X₀ [ 2I − ( I + (1/σ₀²)ΩX₀'X₀ )⁻¹ ] = I − ( I + (1/σ₀²)ΩX₀'X₀ )⁻¹.    (B.7)

Multiplying both sides of (B.7) on the right by ( I + (1/σ₀²)ΩX₀'X₀ ) gives

    a₀ (1/σ₀²) ΩX₀'X₀ [ 2( I + (1/σ₀²)ΩX₀'X₀ ) − I ] = (1/σ₀²) ΩX₀'X₀.    (B.8)

Since X₀'X₀ is invertible, (B.8) reduces to the matching condition of Theorem 2.4, namely a₀( I + (2/σ₀²)ΩX₀'X₀ ) = I. We further note that if this condition holds, B_h automatically matches B_p, which completes the proof.

Proof of Theorem 4.1. Let θ₀ = (θ₀₁, θ₀₂, ..., θ₀L₀)'. Under the hierarchical model of Section 4, we can write the joint posterior of (θ, θ₀, µ) as

    π(θ, θ₀, µ | y, y₀) ∝ exp{ (nȳθ − nθ²/2)/σ² + Σ_{k=1}^{L₀} ( n_{0k}ȳ_{0k}θ_{0k} − n_{0k}θ²_{0k}/2 )/σ²_{0k} − [ (θ − µ)² + Σ_{k=1}^{L₀} (θ_{0k} − µ)² ]/(2τ²) }.

After integrating out the θ_{0k}, whose precisions are p_{0k} = n_{0k}/σ²_{0k} + 1/τ², we obtain

    π(θ, µ | y, y₀) ∝ exp{ (nȳθ − nθ²/2)/σ² − (θ − µ)²/(2τ²) − L₀µ²/(2τ²) + Σ_{k=1}^{L₀} ( n_{0k}ȳ_{0k}/σ²_{0k} + µ/τ² )² / (2 p_{0k}) }.    (B.9)

Integrating out µ from (B.9), whose precision is q_m = (L₀+1)/τ² − Σ_{k=1}^{L₀} 1/(τ⁴ p_{0k}) = 1/τ² + W, where w_k = 1/(τ² + σ²_{0k}/n_{0k}) and W = Σ_{k=1}^{L₀} w_k, yields

    π(θ | y, y₀) ∝ exp{ −(1/2)[ n/σ² + W/(1 + τ²W) ] θ² + [ nȳ/σ² + Σ_{k=1}^{L₀} w_k ȳ_{0k}/(1 + τ²W) ] θ },    (B.10)

where we used the identities 1/τ² − 1/(τ⁴ q_m) = W/(1 + τ²W) and 1/(τ² q_m) = 1/(1 + τ²W). Thus, the marginal posterior of θ given in Theorem 4.1, with µ_mh in (43) and σ²_mh in (44), follows directly from (B.10).

Proof of Theorem 4.2.

To show that µ_mh = µ_mp in (43) and (49) and σ²_mh = σ²_mp in (44) and (50), we take a_{0k} as in (51). It suffices to show that

    n/σ² + Σ_{k=1}^{L₀} a_{0k} n_{0k}/σ²_{0k} = 1/σ²_mh,    (B.11)

via the stronger term-by-term identity

    a_{0k} n_{0k}/σ²_{0k} = w_k/(1 + τ²W),  k = 1, 2, ..., L₀.    (B.12)

Note that if (B.12) is true, then (B.11) holds, so σ²_mh = σ²_mp, and by (B.10) the coefficients of ȳ and of each ȳ_{0k} in the two posterior means also agree, so µ_mh = µ_mp. Using (51),

    a_{0k} n_{0k}/σ²_{0k} = 1 / [ (σ²_{0k}/n_{0k} + τ²)(1 + τ²W) ] = w_k/(1 + τ²W),

which is exactly (B.12). Conversely, matching the coefficient of each ȳ_{0k} in the two posterior means for all data forces (B.12), and hence (51). This completes the proof.


Acknowledgments

The authors wish to thank the Editor and two referees for several suggestions which have greatly improved the paper. This research was partially supported by NIH grants #CA 700 and #CA 7405.