Marginal likelihood estimation via power posteriors


N. Friel, University of Glasgow, UK
A. N. Pettitt, Queensland University of Technology, Australia

Address for correspondence: Department of Statistics, University of Glasgow, Glasgow, G12 8QQ, UK. E-mail: nial@stats.gla.ac.uk

Summary. Model choice plays an increasingly important role in Statistics. From a Bayesian perspective a crucial goal is to compute the marginal likelihood of the data for a given model. This, however, is typically a difficult task, since it amounts to integrating over all model parameters. The aim of this paper is to illustrate how this may be achieved using ideas from thermodynamic integration or path sampling. We show how the marginal likelihood can be computed via MCMC methods on modified posterior distributions for each model. This then allows Bayes factors or posterior model probabilities to be calculated. We show that this approach requires very little tuning and is straightforward to implement. The new method is illustrated in a variety of challenging statistical settings.

Keywords: BAYES FACTOR; REGRESSION; LONGITUDINAL DATA; MULTINOMIAL; HIDDEN MARKOV MODEL; MODEL CHOICE

1. Introduction

Suppose data y is assumed to have been generated by one of M models indexed by the set {1, 2, ..., M}. An important goal of Bayesian model selection is to calculate p(k | y), the posterior probability of model k. Here the aim may be to obtain a single most probable model, or indeed a subset of likely models, a posteriori. Alternatively, posterior model probabilities may be synthesised from all competing models to calculate some quantity of interest common to all models, using model averaging (Hoeting et al. 1999). But before we can make inference for p(k | y) we first need to fully specify the posterior distribution of all unknown parameters in the joint model and parameter space. We denote by θ_k the parameters specific to model k, and by θ the collection of all model parameters. Specifying prior distributions for within-model parameters p(θ_k | k) and for model indicators p(k), together with a likelihood L(y | θ_k, k) within model k, allows Bayesian inference to proceed by examining the posterior distribution

p(θ_k, k | y) ∝ L(y | θ_k, k) p(θ_k | k) p(k).   (1)

Across-model strategies proceed by sampling from this joint posterior distribution of model indicators and parameters. The reversible jump MCMC algorithm (Green 1995) is a popular approach for this situation. Other across-model search strategies include (Godsill 2001) and (Carlin and Chib 1995). By contrast, within-model searches examine the posterior distribution within model k separately for each k. Here the within-model posterior appears as

p(θ_k | y, k) ∝ L(y | θ_k, k) p(θ_k | k),

where the constant of proportionality, often termed the marginal likelihood or integrated likelihood for model k, is written as

p(y | k) = ∫_{θ_k} L(y | θ_k, k) p(θ_k | k) dθ_k.   (2)

This is in general a difficult integral to compute, possibly involving high dimensional within-model parameters θ_k. However, if we were able to do so, then we would readily be able to make statements about posterior model probabilities, using Bayes theorem,

p(k | y) = p(y | k) p(k) / Σ_{k=1}^{M} p(y | k) p(k).
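As an aside, these quantities are most safely handled on the log scale. The following minimal sketch (ours, not part of the paper; the function name and the numerical values are purely illustrative) turns log marginal likelihoods into posterior model probabilities and a log Bayes factor using the log-sum-exp trick:

```python
import math

def posterior_model_probs(log_ml, log_prior):
    """Posterior model probabilities p(k|y) from log marginal likelihoods
    log p(y|k) and log prior model probabilities log p(k), via log-sum-exp."""
    log_joint = [lm + lp for lm, lp in zip(log_ml, log_prior)]
    m = max(log_joint)
    log_evidence = m + math.log(sum(math.exp(lj - m) for lj in log_joint))
    return [math.exp(lj - log_evidence) for lj in log_joint]

# Illustrative (hypothetical) log marginal likelihoods for two models, equal prior probabilities.
log_ml = [-105.3, -101.8]
log_prior = [math.log(0.5), math.log(0.5)]
print(posterior_model_probs(log_ml, log_prior))
log_B12 = log_ml[0] - log_ml[1]   # log Bayes factor of model 1 against model 2
```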

The marginal likelihoods can be used to compare two models by computing the Bayes factor,

B_12 = p(y | k = 1) / p(y | k = 2),

without the need to specify prior model probabilities. Note that the Bayes factor B_12 gives the evidence provided by the data in favour of model 1 compared to model 2. It can also be seen that

B_12 = { p(k = 1 | y) / p(k = 2 | y) } × { p(k = 2) / p(k = 1) }.

In other words, the Bayes factor is the ratio of the posterior odds of model 1 to its prior odds.

Note that an improper prior distribution p(θ_k | k) necessarily leads to an improper marginal likelihood, which in turn implies that the Bayes factor is not well defined. To circumvent the difficulty of using improper priors for model comparison, O'Hagan (1995) introduced an approximate method termed the fractional Bayes factor. Here an approximate (proper) marginal likelihood is defined by the ratio

∫_{θ_k} L(y | θ_k, k) p(θ_k | k) dθ_k  /  ∫_{θ_k} {L(y | θ_k, k)}^a p(θ_k | k) dθ_k,

since any impropriety in the prior for θ_k cancels above and below. Other approaches to this problem include the intrinsic Bayes factor (Berger and Pericchi 1996). In this paper we concentrate on the case where prior model distributions are proper.

Various methods have been proposed in the literature to estimate the marginal likelihood (2). For example, Chib (1995) estimates the marginal likelihood p(y | k) using output from a Gibbs sampler for (θ_k | y, k). It relies, however, on a block updating approach for θ_k, which is not always possible. This work was extended in (Chib and Jeliazkov 2001), where output from a Metropolis-Hastings algorithm for the posterior (θ_k | y, k) can be used to estimate the marginal likelihood. Annealed importance sampling (Neal 2001) estimates the marginal likelihood using ideas from importance sampling. Here an independent sample from the posterior is generated by defining a sequence of distributions, indexed by a temperature parameter t, running from the prior through to the posterior. Importantly, the collection of importance weights can be used to estimate the marginal likelihood.

In this paper we propose a new method to compute the marginal likelihood based on samples from a distribution proportional to the likelihood raised to a power t times the prior, which we term the power posterior. This method was inspired by ideas from path sampling or thermodynamic integration (Gelman and Meng 1998). We find that the marginal likelihood p(y | k) can be expressed as an integral, with respect to t from 0 to 1, of the expected deviance for model k, where the expectation is taken with respect to the power posterior at power t. We argue that this method requires very little tuning or book-keeping. It is easy to implement, requiring only minor modification to computer code which samples from the posterior distribution.

In Section 2 we review different approaches to across- and within-model searches. Section 3 introduces the new method for estimating the marginal likelihood, based on sampling from the so-called power posterior distribution. Here we also outline how this method can be implemented in practice, and give some guidance on the sensitivity of the estimate of the marginal likelihood (or Bayes factor) to the diffuseness of the prior on the model parameters. Three illustrations of how the new method performs in practice are given in Section 4. In particular, complex modelling situations are illustrated which preclude other methods of marginal likelihood estimation that rely on block updating of model parameters. We conclude the paper with some final remarks in Section 5.

2. Review of methods to compute Bayes factors

There are numerous methods and techniques available to estimate (2). Generally speaking, two approaches are possible: an across-model search or a within-model search. The former approach, in an MCMC setting, involves generating a single Markov chain which traverses the joint model and parameter space (1). A popular choice is the reversible jump sampler (Green 1995). Godsill (2001) retains aspects of the reversible jump sampler, but considers the case where parameters are shared between different models, as occurs for example in nested models.

Other choices include (Stephens 2000), (Carlin and Chib 1995) and (Dellaportas, Forster and Ntzoufras 2002). A within-model search essentially aims to estimate the marginal likelihood (2) for each model k separately, and then, if desired, uses this information to form Bayes factors; see (Chib 1995) and (Chib and Jeliazkov 2001). Neal (2001) combines aspects of simulated annealing and importance sampling to provide a method of gathering an independent sample from a posterior distribution of interest, but importantly also to estimate the marginal likelihood. Bridge sampling (Meng and Wong 1996) offers the possibility of estimating the Bayes factor by linking the two posterior distributions with a bridge function. Bartolucci and Mira (2003) use this approach, although it is based on an across-model reversible jump sampler.

Within-model approaches are disadvantageous when the cardinality of the model space is large. However, as noted by Green (2003), the ideal situation for a within-model approach is one where the models are all reasonably heterogeneous. In effect, this is the case where it is difficult to choose proposal distributions when jumping between models, and indeed the situation where parameters across models of the same dimension have different interpretations. This short review is not intended to be exhaustive. A more complete picture can be found in the substantial reviews of (Sisson 2005) and (Han and Carlin 2001).

2.1. Reversible jump MCMC

Reversible jump MCMC (Green 1995) offers the potential to carry out inference for all unknown parameters in the joint model and parameter space (1) in a single logical framework. A crucial innovation in the seminal paper by Green (1995) was to illustrate that detailed balance could be achieved for general state spaces. In particular this extends the Metropolis-Hastings algorithm to variable dimension state spaces of the type (θ_k, k). To implement the algorithm, a proposed move from (θ_k, k) to (θ_l, l) proceeds by generating a random vector u from a distribution g and setting (θ_l, l) = f_kl((θ_k, k), u). Similarly, a move from (θ_l, l) to (θ_k, k) requires random numbers u′ following some distribution g′, with (θ_k, k) = f_lk((θ_l, l), u′), for some deterministic function f_lk. However, it is important that the transformation f_kl, from ((θ_k, k), u) to ((θ_l, l), u′), is a bijection with an invertible differential. A necessary condition for this is the so-called dimension matching condition,

dim(θ_k) + dim(u) = dim(θ_l) + dim(u′).

In this case, the probability of accepting such a move is

min{ 1, [ p(θ_l, l | y) p(k | l) g′(u′) ] / [ p(θ_k, k | y) p(l | k) g(u) ] × |J| },

where p(l | k) is the probability of proposing a move to model l when in model k, and J is the Jacobian of the transformation from ((θ_k, k), u) to ((θ_l, l), u′). In practice, this may be simplified slightly by not insisting on stochastic moves in both directions, so that, for example, dim(u′) = 0, whence the term g′(·) disappears from the numerator above. Finally, for the case of nested models, a possible move type is (θ_{k+1}, k + 1) = ((θ_k, k), u), in which case the Jacobian term equals 1.

In some respects RJMCMC is difficult to use in practice. The main drawback would appear to be the problem of model mixing across dimensions. Typically this is a result of the difficulty in choosing a suitable jump proposal distribution: it is unclear how to reasonably centre and scale the distribution to increase the chance of the move being accepted. However, recent work by (Brooks et al. 2003) has tackled this problem to some extent.

2.2. Chib's method

An important method of marginal likelihood estimation within each model is that of Chib (1995). This method follows from noticing that for any parameter configuration θ*, Bayes rule implies that the marginal likelihood of the data y for model k satisfies

p(y) = L(y | θ*) p(θ*) / p(θ* | y).

Here, and for the remainder of this article, for ease of notation we drop the model indicator k except where this would be ambiguous. Each factor on the right-hand side above can be calculated immediately, with the exception of the posterior probability p(θ* | y).

Typically θ* would be chosen as a point of high posterior probability, to increase the numerical accuracy of the estimate. Chib illustrated that this probability can be estimated via Gibbs sampling, provided θ* can be partitioned into n blocks {θ*_i}, say, where the full conditional of each block is amenable to Gibbs sampling. It is clear that

p(θ* | y) = p(θ*_1 | y) ∏_{i=2}^{n} p(θ*_i | θ*_{i−1}, ..., θ*_1, y).

Now each factor p(θ*_j | θ*_{j−1}, ..., θ*_1, y) can be estimated from the Gibbs output by integrating out the parameters θ_{j+1}, ..., θ_n:

p̂(θ*_j | θ*_{j−1}, ..., θ*_1, y) = (1/I) Σ_{i=1}^{I} p(θ*_j | θ_n^{(i)}, ..., θ_{j+1}^{(i)}, θ*_{j−1}, ..., θ*_1, y),   (3)

where the index i indicates iterations of the Markov chain at stationarity. Further, the normalising constant of each block must be known exactly in order for the full conditional probabilities to be evaluated. Chib and Jeliazkov (2001) extended this methodology to the case where (3) can be estimated using Metropolis-Hastings output, employing an identity based solely on the Metropolis-Hastings acceptance probabilities, which does not require the normalising constant of p(θ | y). However, implementing both methods relies on a judicious partitioning of the parameter θ, in addition to a considerable amount of book-keeping. Clearly both methods increase in computational complexity as the dimension of θ_k increases.

2.3. Annealed importance sampling

Estimating the marginal likelihood using ideas from importance sampling is also possible, as illustrated in (Neal 2001). The idea is to define a sequence of distributions, starting from one from which it is possible to gather samples, for example the prior distribution, and ending at the target distribution from which you would like to sample. Neal (2001) defines this sequence as

p_{t_i}(θ_k | y) ∝ {p(θ_k)}^{1 − t_i} {p(θ_k | y)}^{t_i},

where 0 = t_0 < t_1 < ... < t_n = 1. Thus p_{t_0} and p_{t_n} correspond to the prior and posterior distribution respectively. The algorithm begins by sampling a point x_{t_0} from the prior, p_{t_0}. At the i-th step, a point x_{t_{i+1}} is generated from p_{t_{i+1}} via a Markov chain transition kernel applied at x_{t_i}, for example via Gibbs or Metropolis-Hastings updating. The final step n yields a point x_{t_n} from the posterior. Repeating this scheme N times yields an independent sample x^{(1)}, ..., x^{(N)} from the posterior. In effect, the distribution p_{t_i} is an importance distribution for p_{t_{i+1}}. An important by-product of this scheme is the collection of importance weights

w^{(i)} = [ p_{t_n}(x_{t_{n−1}}) / p_{t_{n−1}}(x_{t_{n−1}}) ] × [ p_{t_{n−1}}(x_{t_{n−2}}) / p_{t_{n−2}}(x_{t_{n−2}}) ] × ... × [ p_{t_1}(x_{t_0}) / p_{t_0}(x_{t_0}) ],

with the densities evaluated in unnormalised form, for which

p̂(y) = (1/N) Σ_{i=1}^{N} w^{(i)}.

That is, the marginal likelihood is estimated by the average of the importance weights.

3. Marginal likelihoods and power posteriors

Here we introduce a new approach to estimating the integrated likelihood based on ideas of thermodynamic integration or path sampling (Gelman and Meng 1998). Consider introducing an auxiliary variable (or temperature schedule) T(t), where T : [0, 1] → [0, 1] is defined such that T(0) = 0 and T(1) = 1. For simplicity we assume that T(t) = t. Consider the power posterior defined as

p_t(θ_k | y) ∝ {L(y | θ_k)}^t p(θ_k).   (4)

Now define

z(y | t) = ∫_{θ_k} {L(y | θ_k)}^t p(θ_k) dθ_k.

By construction, z(y | t = 0) is the integral of the prior for θ_k, which equals 1. Further, z(y | t = 1) is the marginal likelihood of the data. Here we assume, of course, that z(y | t) < ∞ for all t ∈ [0, 1]. Now ideas from path sampling (Gelman and Meng 1998) can be used to calculate the integral of interest, z(y | t = 1). The following identity is crucial to the problem at hand:

log p(y) = log{ z(y | t = 1) / z(y | t = 0) } = ∫_0^1 E_{θ_k | y, t} [ log L(y | θ_k) ] dt.   (5)

Thus the log of the marginal likelihood is obtained by integrating the mean deviance over t from 0 to 1, where the expectation is taken with respect to the power posterior (4) at temperature t. The identity (5) can be derived easily as follows:

d/dt log z(y | t)
  = {1 / z(y | t)} d/dt z(y | t)
  = {1 / z(y | t)} ∫_{θ_k} d/dt {L(y | θ_k)}^t p(θ_k) dθ_k
  = {1 / z(y | t)} ∫_{θ_k} {L(y | θ_k)}^t log{L(y | θ_k)} p(θ_k) dθ_k
  = ∫_{θ_k} [ {L(y | θ_k)}^t p(θ_k) / z(y | t) ] log{L(y | θ_k)} dθ_k
  = E_{θ_k | y, t} [ log L(y | θ_k) ].

Equation (5) now follows by integrating with respect to t.

This approach shares some analogies with annealed importance sampling, outlined in Section 2.3; however, here we estimate the marginal likelihood on the log scale, ensuring increased numerical stability. Further, our method estimates log p(y) using expectations, again aiding numerical stability. It is interesting to note that the fraction z(y | t = 1)/z(y | t = a), where 0 < a < 1, is precisely the approximation to the marginal likelihood used to compute the fractional Bayes factor (O'Hagan 1995). In addition, note that the likelihood contribution to the power posterior is generally not a proper likelihood, since it need not hold that ∫ {L(y | θ_k)}^t dy = 1. Finally, in common with simulated annealing and simulated tempering, the effect of the temperature parameter t is to flatten the likelihood contribution in the power posterior, so that it is approximately uniform for values of t close to 0, in which case the power posterior approximates the prior contribution.

Note that path sampling has been employed to compute high dimensional normalising constants, most notably in the estimation of parameters of Markov random fields. In this context the technique has been used to calculate normalising constants as a function of the model parameters, which are then used as a look-up table in the estimation process; see for example (Green and Richardson 2002) and (Dryden, Scarr and Taylor 2003).

3.1. Estimating the marginal likelihood

Note that the identity for the marginal likelihood (5) can be considered as a double integral, integrating over the power parameter t and the model parameters θ_k. The joint distribution of θ_k and t can be written as

p(θ_k, t | y) = p(θ_k | t, y) p(t) = [ {L(y | θ_k)}^t p(θ_k) / z(y | t) ] p(t).

Now the full conditional distribution of θ_k is

p(θ_k | y, t) ∝ {L(y | θ_k)}^t p(θ_k),

while, if we assume that p(t) ∝ z(y | t), then also

p(t | θ_k, y) ∝ {L(y | θ_k)}^t p(θ_k).

Now a sample {(θ_k^{(1)}, t_1), ..., (θ_k^{(n)}, t_n)} gathered from p(θ_k, t | y) can be used to calculate (5) by ordering the t_i's, calculating log L(y | θ_k^{(i)}) at each, and estimating the integral via quadrature. All of this hinges on the assumption that p(t) ∝ z(y | t). It is our experience that z(y | t) varies by orders of magnitude with t. This is not surprising, since by construction z(y | t = 0) = 1, while the marginal likelihood, z(y | t = 1), can differ from this by many orders of magnitude, depending on the problem at hand. Thus, using a single chain, values of t close to zero would tend not to be sampled with high frequency, leading to poor estimation of p(y).

As a more direct approach, we suggest discretising the integral (5) over t ∈ [0, 1], running separate chains for each t and sampling from the power posterior to estimate the mean deviance E_{θ_k | y, t}[log L(y | θ_k)]. Numerical integration over t, using for example a trapezoidal rule, then yields an estimate of the marginal likelihood. For example, choosing a discretisation 0 = t_0 < t_1 < ... < t_{n−1} < t_n = 1 leads to the approximation

log p(y) ≈ Σ_{i=0}^{n−1} (t_{i+1} − t_i) [ E_{θ_k | y, t_{i+1}} log L(y | θ_k) + E_{θ_k | y, t_i} log L(y | θ_k) ] / 2.   (6)

Note that Monte Carlo standard errors for each E_{θ_k | y, t_i} log L(y | θ_k) can be pieced together to give an overall Monte Carlo standard error for log p(y). It should be apparent that if it is possible to sample from the posterior distribution of the model parameters, then it should often be possible to sample from the power posterior, and hence to compute the marginal likelihood. In particular, if the likelihood contribution follows an exponential family model, then raising the likelihood to a power t amounts to multiplying the exponent by t. We therefore expect that in many cases, if the posterior is amenable to Gibbs sampling, then so too is the power posterior. In terms of computation, modifying existing MCMC code which samples from the posterior is trivial. Essentially all that is needed is an extra iteration loop over the temperature parameter t, calculating the expected deviance under the power posterior at each temperature.

3.2. Sensitivity of p(y | k) to the prior

It is well understood that the Bayes factor is sensitive to the choice of prior for the model parameters. Here we outline, for a simple example, how this impacts on the estimate of the marginal likelihood via samples from the power posterior, as outlined above. Consider the simple situation where the data y = {y_i} are independent and normally distributed with mean θ and unit variance. Assuming θ ~ N(m, v) a priori leads to a power posterior θ | y, t ~ N(m_t, v_t), where

m_t = (ntȳ + m/v) / (nt + 1/v)   and   v_t = 1 / (nt + 1/v).

It is straightforward to show that

E_{θ | y, t} log L(y | θ) = −(n/2) log 2π − (1/2) Σ_{i=1}^{n} (y_i − ȳ)² − (n/2) (m − ȳ)² / (vnt + 1)² − n / {2(nt + 1/v)}.   (7)

Recall that the log of the marginal likelihood is obtained by integrating (7) with respect to t over [0, 1]. Consider the situation when t = 0. In this case the final term on the right-hand side of (7) equals −nv/2. Clearly as v → ∞, E_{θ | y, t=0} log L(y | θ) → −∞, and at the same rate. This illustrates the sensitivity of the numerical stability of the marginal likelihood estimate to the prior specification. Figure 1 plots E_{θ | y, t} log L(y | θ) against t for v = 10, 5, 1 (for illustrative purposes the first two terms on the right-hand side take a constant value, with ȳ = 0, m = 0 and n = 10). It is our experience that the behaviour of the mean deviance with t outlined in Figure 1 is typical of more complex settings, even when the prior parameters are quite uninformative.

Bearing this in mind, we see that the choice of spacing for the t_i's in (6) is important. For example, a temperature schedule of the type t_i = a_i^c, where the a_i = i/n, i = 0, ..., n, are equally spaced points in the interval [0, 1] and c > 1 is a constant, ensures that the t_i's are placed with high frequency close to t = 0. Prescribing the collection of t_i's in this way should improve the efficiency of the estimate of p(y).
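The whole procedure can be illustrated on the normal-mean example above, where the power posterior is available in closed form and the exact marginal likelihood provides a check. The following minimal sketch (ours, not the authors' code; numpy is assumed, data and settings are illustrative) draws directly from the power posterior at each temperature of a t_i = (i/n)^c schedule and applies the trapezoidal rule (6). In a realistic problem the direct draw would be replaced by an MCMC sampler at each temperature.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, v = 10, 0.0, 5.0
y = rng.normal(loc=1.0, scale=1.0, size=n)   # simulated data, y_i ~ N(theta, 1)
ybar = y.mean()

def mean_deviance(t, draws=5000):
    """Monte Carlo estimate of E_{theta|y,t} log L(y|theta) under the power posterior."""
    prec = n * t + 1.0 / v
    m_t, v_t = (n * t * ybar + m / v) / prec, 1.0 / prec
    theta = rng.normal(m_t, np.sqrt(v_t), size=draws)
    loglik = -0.5 * n * np.log(2 * np.pi) - 0.5 * ((y[None, :] - theta[:, None]) ** 2).sum(axis=1)
    return loglik.mean()

# Temperature schedule t_i = (i/n_temps)^c, concentrating points near t = 0.
c, n_temps = 5, 20
ts = (np.arange(n_temps + 1) / n_temps) ** c
devs = np.array([mean_deviance(t) for t in ts])
log_py_hat = np.sum(np.diff(ts) * (devs[1:] + devs[:-1]) / 2)   # trapezoidal rule (6)

# Exact log marginal likelihood for this conjugate model, as a check.
r = y - m
log_py = (-0.5 * n * np.log(2 * np.pi) - 0.5 * np.log(1 + n * v)
          - 0.5 * ((r ** 2).sum() - v * r.sum() ** 2 / (1 + n * v)))
print(log_py_hat, log_py)
```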

Fig. 1. Expected deviance (7) under the distribution θ | y, t, plotted against t, for prior variance equal to 10, 5 and 1. As v increases, so too does the rate at which the mean deviance changes with t.

Table 1. Radiata pine dataset. y_i: maximum compression strength parallel to the grain; x_i: density; z_i: resin-adjusted density.

   i    y_i    x_i    z_i      i    y_i    x_i    z_i      i    y_i    x_i    z_i
   1   3040   29.2   25.4     15   2250   27.5   23.8     29    670   22.     2.3
   2   2470   24.7   22.2     16   2650   25.6   25.3     30    330   29.2   28.5
   3    360   32.3   32.2     17   4970   34.5   34.2     31   3450   30.    29.2
   4   3480    3.3    3       18   2620   26.2   25.7     32   3600    3.4    3.4
   5    380    3.5   30.9     19   2900   26.7   26.4     33   2850   26.7   25.9
   6   2330   24.5   23.9     20    670    2.    20.0     34    590   22.     2.4
   7    800    9.9    9.2     21   2540   24.    23.9     35   3770   30.3   29.8
   8     30   27.3   27.2     22   3840   30.7   30.7     36   3850   32.0   30.6
   9   3670   32.3   29       23   3800   32.7   32.6     37   2480   23.2   22.6
  10    230   24.0   23.9     24   4600   32.6   32.5     38   3570   30.3   30.3
  11   4360   33.8   33.2     25    900   22.    20.8     39   2620   29.9   23.8
  12    880    2.5    2.0     26   2530   25.3   23.      40    890   20.8    8.4
  13   3670   32.2   29.0     27   2920   30.8   29.8     41   3030   33.2   29.4
  14    740   22.5   22.0     28   4990   38.9   38.      42   3030   28.2   28.2

4. Examples

4.1. Linear regression: non-nested models

The dataset in Table 1 was taken from Williams (1959). The data comprise the maximum compression strength parallel to the grain y_i, the density x_i, and the resin-adjusted density z_i for 42 specimens of radiata pine. This dataset has been examined in (Han and Carlin 2001), (Carlin and Chib 1995) and (Bartolucci and Scaccia 2004), where several methods were compared for estimating the Bayes factor between two non-nested competing models. The competing models are:

M1: y_i = α + β(x_i − x̄) + ε_i,   ε_i ~ N(0, σ²),
M2: y_i = γ + δ(z_i − z̄) + η_i,   η_i ~ N(0, τ²).

The following prior specification was used (identical to that in the papers cited immediately above): N((3000, 185)^T, diag(10^6, 10^4)) for (α, β)^T and for (γ, δ)^T. An IG(3, (2 × 300²)^{-1}) prior was chosen for σ² and τ², where IG(a, b) denotes an inverse gamma distribution with density f(x) = exp{−1/(bx)} / {Γ(a) b^a x^{a+1}}.
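For this conjugate linear model, sampling from the power posterior (4) requires only that the likelihood terms in the usual Gibbs updates be scaled by t. The following sketch (ours, not the authors' code; variable names, iteration counts and the synthetic stand-in data are assumptions, with the real analysis using the Table 1 values) outlines such a sampler for model M1 together with the trapezoidal rule (6):

```python
import numpy as np
from numpy.linalg import inv

rng = np.random.default_rng(0)
# Synthetic stand-ins; the actual analysis would use y and x from Table 1.
x = rng.normal(28.0, 4.0, size=42)
y = 3000 + 185 * (x - x.mean()) + rng.normal(0, 300, size=42)

X = np.column_stack([np.ones_like(x), x - x.mean()])
n = len(y)
b0 = np.array([3000.0, 185.0])            # prior mean for (alpha, beta)
V0inv = np.diag([1e-6, 1e-4])             # inverse of prior covariance diag(10^6, 10^4)
a0, s0 = 3.0, 2 * 300.0 ** 2              # IG(3, (2*300^2)^{-1}) written in shape/scale form

def mean_deviance(t, iters=3000, burn=500):
    """E_{theta|y,t} log L(y|theta), estimated by Gibbs sampling at temperature t."""
    beta, sig2, devs = b0.copy(), 300.0 ** 2, []
    for it in range(iters):
        # (alpha, beta) | sig2, t: the power t simply scales the likelihood precision.
        prec = t * X.T @ X / sig2 + V0inv
        cov = inv(prec)
        mean = cov @ (t * X.T @ y / sig2 + V0inv @ b0)
        beta = rng.multivariate_normal(mean, cov)
        # sig2 | (alpha, beta), t: inverse gamma with tempered shape and scale.
        resid = y - X @ beta
        shape = a0 + 0.5 * t * n
        scale = s0 + 0.5 * t * resid @ resid
        sig2 = scale / rng.gamma(shape)
        if it >= burn:
            devs.append(-0.5 * n * np.log(2 * np.pi * sig2) - 0.5 * resid @ resid / sig2)
    return np.mean(devs)

ts = np.linspace(0, 1, 21) ** 5           # schedule t_i = a_i^5, concentrated near 0
devs = np.array([mean_deviance(t) for t in ts])
log_ml_M1 = np.sum(np.diff(ts) * (devs[1:] + devs[:-1]) / 2)   # estimate of log p(y | k = 1)
```

Running the analogous sampler for M2 and differencing the two log marginal likelihood estimates gives an estimate of the log Bayes factor.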

Green and O'Hagan (1998) found, for the given prior specification and by numerical integration, that the Bayes factor is B_21 = 4862. The aim in this straightforward situation is to see what statistical efficiency can be achieved by using the power posterior method to compute the Bayes factor, compared with RJMCMC. Here we calculated each marginal likelihood using a temperature schedule of the type t_i = a_i^5, where the a_i correspond to 10 equally spaced points in [0, 1]. In total 30,000 iterations were used to estimate each marginal likelihood. Finally, the algorithm was run for 100 independent chains, each yielding an estimate B̂_21,i, for i = 1, ..., 100.

To implement RJMCMC we specified p(k = 1) = p(k = 2) = 0.5. To allow a fair comparison, the reversible jump sampler was run for 60,000 iterations. Within-model parameters were updated via Gibbs sampling. Across-model moves were proposed by simply setting (α, β, σ) = (γ, δ, τ), so that the Jacobian term takes the value 1. The following quantities appear in Table 2:

Mean = (1/100) Σ_{i=1}^{100} B̂_21,i,
Standard error = sqrt{ (1/100) Σ_{i=1}^{100} (B̂_21,i − Mean)² },
Rel. error = (1/B_21) sqrt{ (1/100) Σ_{i=1}^{100} (B̂_21,i − B_21)² }.

Table 2. Linear regression models. Estimates of the mean, standard error and relative error of B_21 using the method of power posteriors and RJMCMC. The RJMCMC entries correspond to prior model probabilities p(k = 1) = p(k = 2) = 0.5, while RJ corrected corresponds to p(k = 1) = 0.9995 and p(k = 2) = 0.0005.

                    Power posterior   RJMCMC    RJ corrected
  Mean                   4853.5       5280.6       4869.4
  Standard error           94.2       1622.4         97.8
  Relative error            1.8%        34.5%         2.0%

As can be seen from the results in Table 2, RJMCMC fared poorly. This is simply because the reversible jump sampler does not mix well and so does not visit model 1 very often, leading to a poor posterior estimate of p(k = 1 | y). Running the reversible jump sampler with the prior model probability strongly weighted towards model 1, with p(k = 1) = 0.9995 and p(k = 2) = 0.0005, leads to estimates of B_21 with efficiency similar to that of the power posterior method; the power posterior nevertheless has a marginally smaller relative error than the RJ corrected method.

4.2. Categorical longitudinal models

For this example we revisit an analysis of a categorical longitudinal dataset presented in (Pettitt et al. 2005). The example concerns a large social survey of immigrants to Australia. Data are recorded for each subject on three separate occasions. The response variable of interest is employment status, which comprises three categories: employed, unemployed or non-participant. The non-participant category refers to subjects who are, for example, students, retired or pensioners. The response variable is modelled as a multinomial random variable. Here we are concerned with fitting Bayesian hierarchical models to the data including both fixed and random effects on employment status. Specifically, we assume that the response y_isj of individual i at time s belongs to employment category j with probability p_isj. Note that we use s to index time, rather than k as in (Pettitt et al. 2005). Thus y_isj is a binary random variable, and we assume that y_is = {y_is1, y_is2, y_is3} has a multinomial distribution,

y_is ~ Multinomial(p_is, n_is),

where n_is = Σ_j y_isj = 1. The next level of the hierarchy relates the binary probabilities p_isj to fixed and random covariate effects,

p_isj = μ_isj / Σ_j μ_isj,   where   log(μ_isj) = X_is^T β_j + α_ij,

and where X_is is a vector of covariates, the β_j are fixed effects, and α_ij is a random effect reflecting time-constant unobserved heterogeneity. Choosing employed as the reference state and setting β_1 and α_i1 to zero allows the model to be identified. In order to maintain invariance with respect to which state is chosen as the reference state, we take the random effect (α_i2, α_i3) to have a multivariate normal distribution with mean 0 and variance-covariance matrix Σ. We write the posterior distribution as

p(β, α, Σ | y) ∝ L(y | β, α, Σ) p(α | Σ) p(Σ) p(β),

assuming a priori independence between β and Σ. The prior distribution for each β_j was set to a normal distribution with variance 16 (except for the parameters corresponding to the age² term, which were given more precise priors). The random effects (α_i2, α_i3) were assigned a bivariate zero-mean normal distribution, where the elements of the variance-covariance matrix Σ were given priors Σ_11 ~ Uniform(0, 10) and Σ_22 ~ Uniform(0, 10), and the correlation coefficient was assigned a Uniform(−1, 1) prior. Here it is possible to update beliefs about all unknown parameters from their full conditional distributions using Gibbs sampling. Our interest is in calculating marginal likelihoods for

Model k = 1: log(μ_isj) = X_is^T β_j,
Model k = 2: log(μ_isj) = X_is^T β_j + α_ij.

Model 1 is essentially a fixed effects model, where the regression effect β_j for employment state j remains constant for each individual at each of the time points s. A random effects term α_ij is included in model 2, accounting for variability between individuals at a given state j. Other plausible models are also possible, modelling, for example, variability over time between individuals; the reader is referred to (Pettitt et al. 2005) for more details. For this illustration we used a randomly selected sample of size 1000 from the complete-case data (n = 3234) analysed in (Pettitt et al. 2005).

For this example, collecting samples from the power posteriors is possible using the WinBUGS software (Spiegelhalter, Thomas and Best 1998). To implement (6) we chose a temperature schedule t_i = a_i^4, where the a_i are equally spaced points chosen so that the t_i lie in the interval [0.0016, 1], together with the end point t = 0. Within each temperature t_i, 2,000 samples were collected from the stationary distributions p_{t_i}(β | y, k = 1) and p_{t_i}(β, α, Σ | y, k = 2). Table 3 summarises the output.

Table 3. Categorical longitudinal model. Expected deviances (and Monte Carlo standard errors) for the power posterior at temperature t_i, for models k = 1, 2.

Fixed effects model, k = 1
  t_i        1       0.6561   0.4096   0.2401   0.1296   0.0625   0.0256   0.0081   0.0016   0
  Mean dev.  -242    -2425    -2433    -2446    -2475    -2547    -2728    -342     -6670    -66965.28
  MC se      0.96    0.293    0.526    0.803    1.325    4.003    5.564    26.20    95.330   536.93

Random effects model, k = 2
  t_i        1       0.6561   0.4096   0.2401   0.1296   0.0625   0.0256   0.0081   0.0016   0
  Mean dev.  -783    -225     -2428    -2449    -2483    -2562    -283     -3862    -728     -5503.28
  MC se      3.525   4.663    0.690    0.775    1.753    3.72     0.40     27.50    96.360   55.865

Applying the trapezoidal rule (6) yields log p(y | k = 1) = −2,528.72 and log p(y | k = 2) = −2,358.5, with associated Monte Carlo standard errors of 0.64 and 1.34 respectively. This leads to the strong conclusion that the random effects model is much more probable, qualitatively similar to the conclusion presented in (Pettitt et al. 2005).
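Given per-temperature output of the kind summarised in Table 3, the trapezoidal estimate (6) and the overall Monte Carlo standard error mentioned in Section 3.1 can be pieced together as follows. This is a minimal sketch (ours, not from the paper), using placeholder values and assuming the per-temperature estimates are independent because they come from separate chains:

```python
import numpy as np

t = np.array([0.0, 0.0016, 0.0081, 0.0256, 0.0625, 0.1296, 0.2401, 0.4096, 0.6561, 1.0])
# Placeholder mean deviances E_{theta|y,t_i} log L and their Monte Carlo standard errors.
mean_dev = np.array([-9000., -5000., -3200., -2800., -2600., -2520., -2480., -2450., -2430., -2420.])
mc_se = np.array([500., 90., 25., 6., 4., 1.5, 0.8, 0.5, 0.3, 0.2])

# Each mean deviance enters (6) with weight (t_{i+1} - t_{i-1})/2 (half-intervals at the ends).
w = np.empty_like(t)
w[0] = (t[1] - t[0]) / 2
w[-1] = (t[-1] - t[-2]) / 2
w[1:-1] = (t[2:] - t[:-2]) / 2

log_py = np.sum(w * mean_dev)                    # trapezoidal estimate of log p(y)
se_log_py = np.sqrt(np.sum((w * mc_se) ** 2))    # propagated Monte Carlo standard error
```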

4.3. Hidden Markov random field models

Markov random fields (MRFs) are often used to model binary spatially dependent data; the autologistic model (Besag 1974) is a popular choice. Here the joint distribution of x = {x_i : i = 1, 2, ..., n}, taking values in {−1, +1} on a regular lattice, is defined as

p(x | β) ∝ exp{ β_0 Σ_i x_i + β_1 Σ_{i∼j} x_i x_j },   (8)

conditional on parameters β = (β_0, β_1). Positive values of β_0 encourage the x_i to take the value +1, while positive values of β_1 encourage homogeneous regions of +1's or −1's. The notation i ∼ j denotes that x_i and x_j are neighbours. For this example we examine two models, defined via their neighbourhood structure:

Model k = 1: a first order neighbourhood, where each point x_i has as neighbours the four nearest adjacent points.
Model k = 2: a second order neighbourhood structure, where in addition to the first order neighbours, the four nearest diagonal points also belong to the neighbourhood.

Both neighbourhood structures are modified along the edges of the lattice. MRF models are difficult to handle in practice, due to the computational burden of calculating the constant of proportionality in (8), c(β) say. A hidden MRF arises when an MRF x is corrupted by some noise process, giving observed data y. The underlying MRF is essentially hidden, and appears as parameters in the model. Typically it is assumed that, conditional on x, the y_i are independent, which gives the likelihood

L(y | x, θ) = ∏_{i=1}^{n} p(y_i | x_i, θ),

for some parameters θ. Once prior distributions p(β) and p(θ) are specified for β and θ respectively, a complete Bayesian analysis proceeds by making inference on the posterior distribution

p(x, β, θ | y, k) ∝ L(y | x, θ) p(x | β, k) p(β) p(θ).

It is relatively straightforward to sample from the full conditional distribution of each of x and θ. Sampling from the full conditional distribution of β is more problematic, due to the difficulty of calculating the normalising constant of the MRF, c(β). However, provided the number of rows or columns of the lattice is not greater than 20 (for a reasonable size of the other dimension), the forward recursion method presented in (Reeves and Pettitt 2004) can be used to calculate c(β). For a more complete description of the problem of Bayesian estimation of hidden MRFs, the reader is referred to (Friel et al. 2005).

For this example, gene expression levels were measured for 34 genes in a cluster of 38 neighbouring genes on the Streptomyces coelicolor genome at 10 time points. The cluster of 38 neighbouring genes under study is responsible for the production of calcium-dependent antibiotics. We define the observations on a 38 × 10 regular lattice, where the log expression level y_sg corresponds to the g-th gene at time point s. Figure 2 displays the data y, indicating gene locations for which there is no data. Here we assume that the data y mask an MRF process x, where the states (−1, +1) correspond to up-regulation and down-regulation respectively. We assume that the MRF process follows either a first order neighbourhood structure (k = 1) or a second order neighbourhood structure (k = 2). Finally, we assume that the distribution of y given x is modelled as independent Gaussian noise with state-specific mean μ(x_sg) and common variance σ², and we set θ = (μ, σ). Wit and McClure (2004) show that normality of log expression levels is a reasonable assumption for similar experimental set-ups.

It is straightforward to handle the missing data in the full conditional distribution of the latent process x; the likelihood function needs only to be modified slightly to allow for the fact that 4 of the columns of x are not supported by any data. A flat normal prior was chosen for each of the β parameters. The prior distribution for μ was uniform on the set {(μ(−1), μ(+1)) : −2 ≤ μ(−1) ≤ 2, μ(−1) ≤ μ(+1) ≤ 2}.

Fig. 2. Log expression levels of 34 genes on the Streptomyces genome for 10 consecutive time points. The x-axis labels the missing columns.

The values −2 and +2 represent approximate minimum and maximum values found in similar datasets. Corresponding to these values, a gamma prior with mean 2 and variance 4 was specified for σ. Note that Friel and Wit (2005) present a more complete analysis of a similar dataset.

Here we chose a temperature schedule t_i = a_i^4, where the a_i are equally spaced points in the interval [0, 1]. Within each temperature t_i, 5,000 samples were collected from the stationary distribution p_{t_i}(x, β, μ | y, k), for k = 1, 2. Table 4 summarises the output.

Table 4. Hidden Markov random field models. Expected deviances (and Monte Carlo standard errors) for the power posterior at temperature t_i, for models k = 1, 2.

First order model, k = 1
  t_i        1        0.6243   0.3660   0.1975   0.0953   0.0390   0.0123   0.0024   0.0002   0
  Mean dev.  -251.7   -253.0   -252.6   -248.5   -259.4   -282.2   -352.5   -365.3   -68.0    -607.6
  MC se      0.040    0.033    0.06     0.083    0.183    0.377    0.854    1.63     6.754    27.240

Second order model, k = 2
  t_i        1        0.6243   0.3660   0.1975   0.0953   0.0390   0.0123   0.0024   0.0002   0
  Mean dev.  -252.6   -252.6   -253.1   -247.5   -256.7   -279.7   -32.2    -345.5   -475.8   -248.6
  MC se      0.029    0.06     0.024    0.089    0.182    0.450    0.639    1.228    3.646    7.676

Applying the trapezoidal rule (6) yields log p(y | k = 1) = −256.93 and log p(y | k = 2) = −255.6, with associated Monte Carlo standard errors of 0.003 and 0.000 respectively. Thus the second order model is deemed more probable a posteriori.

5. Concluding remarks

We have introduced a new method of estimating the marginal likelihood for complex hierarchical Bayesian models which involves a minimal amount of change to commonly used algorithms for computing the posterior distributions of unknown parameters. We have illustrated the technique on three examples. The first was a simple regression example, where the prior model probabilities needed tuning in order for RJMCMC to estimate the Bayes factor well. The second example involved a random effects model for multinomial data and demonstrated the ease of computing the marginal likelihood with a standard software package such as WinBUGS. Here the results demonstrated an overwhelming difference in marginal likelihoods for the two models considered, a situation similar to the first example, where RJMCMC with default equally weighted prior model probabilities would perform very poorly. The third example involved a complex hidden Markov structure, and the results demonstrated a difference in marginal likelihood between the two models. In terms of approximating the mean deviance, the second example is far more challenging than the third, the former displaying the characteristics resulting from use of a vague prior, with large negative values of the mean deviance for t near 0. However, the Monte Carlo standard errors of the marginal likelihoods are nevertheless reasonably small for this example.

2 Friel, Pettitt nevertheless reasonably small for this example. Computation of the marginal likelihood requires a proper prior. The sensitivity of the value of marginal likelihood to the choice of prior can be readily investigated using our method. Various approaches have been proposed for the case where the prior is improper. As we mentioned above, the fractional Bayes factor is straightforwardly computed as a by-product of the marginal likelihood. For those seeking such approximations our method provides a straightforward solution. Our choice of quadrature rule and use of simulation resources can be improved but we have not followed up that matter here. But nevertheless, our computational approach provides estimates with tolerably small standard errors. In conclusion, we have illustrated a method of computing the marginal likelihood which is straightforward to implement and can be used for complex models. Acknowledgements Both authors were supported by the Australian Research Council. The authors wish to kindly acknowledge Thu Tran for her assistance with computational aspects of this work. Nial Friel wishes to acknowledge the School of Mathematical Sciences, QUT for its hospitality during June 2005. References Bartolucci, F. and A. Mira (2003), Efficient estimate of Bayes factors from reversible jump output. Technical report, Universitá dell Insubria, Dipartimento di Economia Bartolucci, F. and L. Scaccia (2004), A new approach for estimating the Bayes factor. Technical report, Universitá di Perugia Berger, J. O. and L. R. Perricchi (996), The intrinsic Bayes factor for linear models. In J. M. Bernardo, J. O. Berger, A. P. Dawid and A. F. M. Smith (eds.), Bayesian Statistics, vol. 5, pp. 25 44, Oxford, Oxford University press Besag, J. E. (974), Spatial interaction and the statistical analysis of lattice systems (with discussion). Journal of the Royal Statistical Society, Series B 36, 92 236 Brooks, S. P., P. Giudici and G. O. Roberts (2003), Efficient construction of reversible jump Markov chain Monte Carlo proposal distributions (with discussion). Journal of the Royal Statistical Society, Series B 65(), 3 57 Carlin, B. P. and S. Chib (995), Bayesian Model Choice via Markov Chain Monte Carlo. Journal of the Royal Statistical Society, Series B 57, 473 484 Chib, S. (995), Marginal likelihood from the Gibbs output. Journal of the American Statistical Association 90, 33 32 Chib, S. and I. Jeliazkov (200), Marginal likelihood from the Metropolis-Hastings output. Journal of the American Statistical Association 96, 270 28 Dellaportas, P., J. J. Forster and I. Ntzoufras (200), On Bayesian model and variable selection using MCMC. Statistics and Computing 2, 27 36 Dryden, I. L., M. R. Scarr and C. C. Taylor (2003), Bayesian texture segmentation of weed and crop images using reversible jump Markov chain Monte Carlo methods. Applied Statistics 52(), 3 50 Friel, N., A. N. Pettitt, R. Reeves and E. Wit (2005), Bayesian inference in hidden Markov random fields for binary data defined on large lattices. Technical report, University of Glasgow Friel, N. and E. Wit (2005), Markov random field model of gene interactions on the M. Tuberculosis genome. Technical report, University of Glasgow, Department of Statistics Gelman, A. and X.-L. Meng (998), Simulating normalizing contants: from importance sampling to bridge sampling to path sampling. Statistical Science 3, 63 85 Godsill, S. J. (200), On the Relationship Between Markov Chain Monte Carlo Methods for Model Uncertainty. 
Journal of Computational and Graphical Statistics 0, 230 248

Green, P. J. (1995), Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika 82, 711-732.
Green, P. J. (2003), Trans-dimensional Markov chain Monte Carlo. In P. J. Green, N. L. Hjort and S. Richardson (eds.), Highly Structured Stochastic Systems. Oxford: Oxford University Press.
Green, P. J. and A. O'Hagan (1998), Model choice with MCMC on product spaces without using pseudo-priors. Technical Report 98-3, University of Nottingham.
Green, P. J. and S. Richardson (2002), Hidden Markov models and disease mapping. Journal of the American Statistical Association 97, 1055-1070.
Han, C. and B. P. Carlin (2001), Markov chain Monte Carlo methods for computing Bayes factors: a comparative review. Journal of the American Statistical Association 96(455), 1122-1132.
Hoeting, J. A., D. Madigan, A. E. Raftery and C. T. Volinsky (1999), Bayesian model averaging: a tutorial. Statistical Science 14(4), 382-417.
Meng, X.-L. and W. Wong (1996), Simulating ratios of normalizing constants via a simple identity: a theoretical exploration. Statistica Sinica 6, 831-860.
Neal, R. M. (2001), Annealed importance sampling. Statistics and Computing 11, 125-139.
O'Hagan, A. (1995), Fractional Bayes factors for model comparison (with discussion). Journal of the Royal Statistical Society, Series B 57, 99-138.
Pettitt, A. N., T. T. Tran, M. A. Haynes and J. L. Hay (2005), A Bayesian hierarchical model for categorical longitudinal data from a social survey of immigrants. Journal of the Royal Statistical Society, Series A (to appear).
Reeves, R. and A. N. Pettitt (2004), Efficient recursions for general factorisable models. Biometrika 91(3), 751-757.
Sisson, S. A. (2005), Trans-dimensional Markov chains: a decade of progress and future perspectives. Journal of the American Statistical Association (to appear).
Spiegelhalter, D., A. Thomas and N. Best (1998), WinBUGS: Bayesian inference using Gibbs sampling, Manual version 1.2. Imperial College, London, and Medical Research Council Biostatistics Unit, Cambridge.
Stephens, M. (2000), Bayesian analysis of mixture models with an unknown number of components: an alternative to reversible jump methods. Annals of Statistics 28, 40-74.
Williams, E. (1959), Regression Analysis. New York: Wiley.
Wit, E. and J. McClure (2004), Statistics for Microarrays: Design, Analysis and Inference. Chichester: Wiley.