Estimation of Large Families of Bayes Factors from Markov Chain Output


Hani Doss
University of Florida

Abstract

We consider situations in Bayesian analysis where the prior is indexed by a hyperparameter taking on a continuum of values. We distinguish some arbitrary value of the hyperparameter, and consider the problem of estimating the Bayes factor for the model indexed by the hyperparameter vs. the model specified by the distinguished point, as the hyperparameter varies. We assume that we have Markov chain output from the posterior for a finite number of the priors, and develop a method for efficiently computing estimates of the entire family of Bayes factors. As an application of the ideas, we consider some commonly used hierarchical Bayesian models and show that the parametric assumptions in these models can be recast as assumptions regarding the prior. Therefore, our method can be used as a model selection criterion in a Bayesian framework. We illustrate our methodology through a detailed example involving Bayesian model selection.

Key words and phrases: Bayes factors, control variates, ergodicity, importance sampling, Markov chain Monte Carlo

1 Introduction

Suppose we have a data vector $Y$ whose distribution has density $p_\theta$, for some unknown $\theta \in \Theta$. Let $\{\nu_h, h \in \mathcal{H}\}$ be a family of prior densities on $\theta$ that we are contemplating. The selection of a particular prior from the family is important in Bayesian data analysis, and when making this choice one will often want to consider the marginal likelihood of the data under the prior $\nu_h$, given by $m_h(y) = \int l_y(\theta)\nu_h(\theta)\,d\theta$, as $h$ varies over the hyperparameter space $\mathcal{H}$. Here, $l_y(\theta) = p_\theta(y)$ is the likelihood function. Values of $h$ for which $m_h(y)$ is relatively low may be considered poor choices, and consideration of the family $\{m_h(y), h \in \mathcal{H}\}$ may be helpful in narrowing the search of priors to use. It is therefore useful to have a method for computing the family $\{m_h(y), h \in \mathcal{H}\}$. For the purpose of model selection, if $c$ is a fixed constant, the information given by $\{m_h(y), h \in \mathcal{H}\}$ and $\{c\,m_h(y), h \in \mathcal{H}\}$ is the same. From a computational and statistical point of view however, it is usually easier to fix a particular hyperparameter value $h_1$ and focus on $\{m_h(y)/m_{h_1}(y), h \in \mathcal{H}\}$. Given two hyperparameter values $h$ and $h_1$, the quantity $B(h, h_1) = m_h/m_{h_1}$ is called the Bayes factor of the model indexed by $h$ vs. the model indexed by $h_1$ (we write $m_h$ instead of $m_h(y)$ from now on). In this paper we present a method for estimating the family $\{B(h, h_1), h \in \mathcal{H}\}$. We have in mind situations where $B(h, h_1)$ cannot be obtained analytically and, moreover, we need to calculate $B(h, h_1)$ for a large set of $h$'s, so that computational efficiency is essential.

Our approach requires that there are $k$ hyperparameter values $h_1, \ldots, h_k$, and for $l = 1, \ldots, k$, we are able to get a sample $\theta_i^{(l)}$, $i = 1, \ldots, n_l$, from $\nu_{h_l,y}$, the posterior density of $\theta$ given $Y = y$, assuming that the prior is $\nu_{h_l}$. To set the framework, consider the trivial case where $k = 1$, and we have a sample from the posterior $\nu_{h_1,y}$ generated by an ergodic Markov chain. Our objective is to estimate $\{B(h, h_1), h \in \mathcal{H}\}$.
For any $h$ such that $\nu_h(\theta) = 0$ whenever $\nu_{h_1}(\theta) = 0$, writing $\theta_1, \ldots, \theta_n$ for the sampled values, we have

$$\frac{1}{n}\sum_{i=1}^{n} \frac{\nu_h(\theta_i)}{\nu_{h_1}(\theta_i)} \xrightarrow{\text{a.s.}} \int \frac{\nu_h(\theta)}{\nu_{h_1}(\theta)}\,\nu_{h_1,y}(\theta)\,d\theta = \frac{m_h}{m_{h_1}} \int \frac{l_y(\theta)\nu_h(\theta)/m_h}{l_y(\theta)\nu_{h_1}(\theta)/m_{h_1}}\,\nu_{h_1,y}(\theta)\,d\theta = \frac{m_h}{m_{h_1}} \int \frac{\nu_{h,y}(\theta)}{\nu_{h_1,y}(\theta)}\,\nu_{h_1,y}(\theta)\,d\theta = \frac{m_h}{m_{h_1}}. \tag{1.1}$$

Therefore, the left side of (1.1) is a consistent estimate of the Bayes factor $B(h, h_1)$.

To fix ideas, consider as a simple example the following standard three-level hierarchical model:

$$\text{conditional on } \psi_j, \quad Y_j \overset{\text{indep}}{\sim} \phi_{\psi_j,\sigma_j}, \quad j = 1, \ldots, m \tag{1.2a}$$
$$\text{conditional on } \mu, \tau, \quad \psi_j \overset{\text{iid}}{\sim} \phi_{\mu,\tau}, \quad j = 1, \ldots, m \tag{1.2b}$$
$$(\mu, \tau) \sim \lambda_{c_1,c_2,c_3,c_4} \tag{1.2c}$$

where $\phi_{m,s}$ denotes the density of the normal distribution with mean $m$ and standard deviation $s$. In (1.2a), the $\sigma_j$'s are assumed known. In (1.2c), $\lambda_{c_1,c_2,c_3,c_4}$ is the normal / inverse gamma distribution indexed by four hyperparameters (see Section 3). This is a very commonly used model but, as we discuss later, in some situations it is preferable to replace (1.2b) with $\psi_j \overset{\text{iid}}{\sim} t_{v,\mu,\tau}$, where $t_{v,\mu,\tau}$ is the density of the $t$ distribution with $v$ degrees of freedom, location $\mu$ and scale $\tau$. In this case, consider now the estimate in the left side of (1.1). The likelihood of $(\mu, \tau)$ is

$$l_Y(\mu, \tau) = \int \cdots \int \prod_{j=1}^{m} \phi_{\psi_j,\sigma_j}(Y_j) \prod_{j=1}^{m} t_{v,\mu,\tau}(\psi_j)\,d\psi_1 \cdots d\psi_m.$$

This likelihood cannot be computed in closed form, and therefore its cancellation in (1.1) gives a non-trivial simplification: calculation of the estimate requires only the ratio of the densities of the priors and not the posteriors.

Consider (1.2) with $t_{v,\mu,\tau}$ instead of $\phi_{\mu,\tau}$ in the middle stage, and suppose now that we would like to select $v$, with the choice $v = \infty$ signifying the choice of the normal distribution $\phi_{\mu,\tau}$. The distribution of $Y$ is determined by $\psi = (\psi_1, \ldots, \psi_m)$. A completely equivalent way of describing the model is therefore through the two-level hierarchy in which we let $\theta = (\psi, \mu, \tau)$, and stipulate:

$$\text{conditional on } \theta, \quad Y_j \overset{\text{indep}}{\sim} \phi_{\psi_j,\sigma_j}, \quad j = 1, \ldots, m,$$
$$(\psi, \mu, \tau) \sim \nu_h,$$

where $\nu_h(\psi, \mu, \tau) = \bigl(\prod_{j=1}^{m} t_{v,\mu,\tau}(\psi_j)\bigr)\,\lambda_{c_1,c_2,c_3,c_4}(\mu, \tau)$. Here, the hyperparameter is $h = (v, c_1, c_2, c_3, c_4)$, which includes the number of degrees of freedom. Estimation of the family of Bayes factors $\{B(h, h_1), h \in \mathcal{H}\}$ therefore enables a model selection step.

We now discuss briefly the accuracy of the estimate on the left side of (1.1). When $\nu_h$ is nearly singular with respect to $\nu_{h_1}$ over the region where the $\theta_i$'s are likely to be, the estimate will be unstable. (Formally, the estimate will satisfy a central limit theorem if the chain mixes fast enough and the random variable $\nu_h(\theta)/\nu_{h_1}(\theta)$ (where $\theta \sim \nu_{h_1,y}$) has a high enough moment. This is discussed in more detail in Section 2.3.) From a practical point of view, this means that there is effectively a radius around $h_1$ within which one can safely move.
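
The single-chain estimate, the average of the prior ratios $\nu_h(\theta_i)/\nu_{h_1}(\theta_i)$, can be tried out in a setting where the Bayes factor is available in closed form. The sketch below is only an illustration under assumptions that are not part of the paper: a toy model $Y \mid \theta \sim N(\theta, 1)$ with priors $\nu_h = N(0, h^2)$, so that $m_h(y)$ is the $N(0, 1 + h^2)$ density at $y$, and iid posterior draws standing in for Markov chain output. All function names are hypothetical.

```python
import math
import random


def normal_pdf(x, mean, sd):
    """Density of N(mean, sd^2) at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def exact_bayes_factor(y, h, h1):
    """B(h, h1) = m_h(y) / m_{h1}(y); in the toy model m_h(y) is the N(0, 1 + h^2) density."""
    return (normal_pdf(y, 0.0, math.sqrt(1.0 + h * h))
            / normal_pdf(y, 0.0, math.sqrt(1.0 + h1 * h1)))


def posterior_draws(y, h1, n, rng):
    """iid draws from the posterior of theta under the prior N(0, h1^2).

    Assumption: these stand in for the output of an ergodic Markov chain.
    """
    var = h1 * h1 / (1.0 + h1 * h1)
    return [rng.gauss(var * y, math.sqrt(var)) for _ in range(n)]


def estimate_bayes_factor(h, h1, thetas):
    """Average of the prior ratios nu_h(theta_i) / nu_{h1}(theta_i)."""
    ratios = (normal_pdf(t, 0.0, h) / normal_pdf(t, 0.0, h1) for t in thetas)
    return sum(ratios) / len(thetas)


rng = random.Random(0)
y_obs, h1 = 1.5, 1.0
thetas = posterior_draws(y_obs, h1, 20000, rng)  # one sample, reused for every h
for h in (0.5, 0.8, 1.0, 1.2):
    print(h, estimate_bayes_factor(h, h1, thetas), exact_bayes_factor(y_obs, h, h1))
```

Only prior densities enter the estimate; the likelihood is never evaluated. The values of $h$ are kept close to $h_1$ on purpose, in line with the remark that there is effectively a radius around $h_1$ within which the estimate is stable.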
In all but the very simplest models, the dimension of $\mathcal{H}$ is greater than 1, and therefore estimation of the Bayes factor as $h$ ranges over $\mathcal{H}$ raises serious computational difficulties, and it is essential that for each $h$, the estimate of $B(h, h_1)$ is both accurate and can be computed quickly. Our approach is to select $k$ hyperparameter points $h_1, \ldots, h_k$, and get Markov chain samples from $\nu_{h_l,y}$ for each $l = 1, \ldots, k$. The prior $\nu_{h_1}$ in the denominator of the left side of (1.1) is replaced by a mixture $w_1\nu_{h_1} + \cdots + w_k\nu_{h_k}$, with appropriately chosen weights. We show how judiciously chosen control variates can be used in conjunction with multiple Markov chain streams to produce accurate estimates even with small samples, so that the net result is a computationally feasible method for producing reliable estimates of the Bayes factors for a wide range of hyperparameter values.

Our approach is motivated by and uses ideas developed in Kong et al. (2003), which deals with the situation where we have independent samples from $k$ unnormalized densities, and we wish to estimate all possible ratios of the $k$ normalizing constants. Owen and Zhou (2000) and Tan (2004) also discuss the use of control variates to increase the accuracy of Monte Carlo estimates. In Section 4 we return to these three papers and discuss in detail how our approach fits in the context of this work.

The paper is organized as follows. Section 2 contains the main methodological development; there, we present our method for estimating the family of Bayes factors and state supporting theoretical results. Section 3 illustrates the methodology through a detailed example that involves a number of issues, including selection of the parametric family in the model. Section 4 gives a discussion of other possible approaches and related work, and the Appendix gives the proof of the main theoretical result of the paper.

2 Estimation of the Family of Bayes Factors

Suppose that for $l = 1, \ldots, k$, we have Markov chain Monte Carlo (MCMC) samples $\theta_i^{(l)}$, $i = 1, \ldots, n_l$ from the posterior density of $\theta$ given $Y = y$, assuming that the prior is $\nu_{h_l}$, having the form $\nu_{h_l,y}(\theta) = l_y(\theta)\nu_{h_l}(\theta)/m_{h_l}$. We assume that the $k$ sequences are independent of one another. We will not assume we know any of the $m_{h_l}$'s. However, we now explain how knowledge of the Bayes factors $m_{h_l}/m_{h_1}$, for $l = 2, \ldots, k$ would result in two important benefits. If we knew these Bayes factors we could then form the estimate

$$\hat{B}(h, h_1) = \sum_{l=1}^{k} \sum_{i=1}^{n_l} \frac{\nu_h(\theta_i^{(l)})}{\sum_{s=1}^{k} n_s \nu_{h_s}(\theta_i^{(l)})\, m_{h_1}/m_{h_s}}. \tag{2.1}$$

Let $n = \sum_{s=1}^{k} n_s$, and assume that $n_s/n \to a_s$, $s = 1, \ldots, k$. We then have

$$\hat{B}(h, h_1) = \sum_{l=1}^{k} \sum_{i=1}^{n_l} \frac{l_y(\theta_i^{(l)})\nu_h(\theta_i^{(l)})}{\sum_{s=1}^{k} n_s l_y(\theta_i^{(l)})\nu_{h_s}(\theta_i^{(l)})\, m_{h_1}/m_{h_s}} = \frac{m_h}{m_{h_1}} \sum_{l=1}^{k} \frac{n_l}{n} \cdot \frac{1}{n_l} \sum_{i=1}^{n_l} \frac{\nu_{h,y}(\theta_i^{(l)})}{\sum_{s=1}^{k} (n_s/n)\,\nu_{h_s,y}(\theta_i^{(l)})} \xrightarrow{\text{a.s.}} \frac{m_h}{m_{h_1}} \sum_{l=1}^{k} a_l \int \frac{\nu_{h,y}(\theta)}{\sum_{s=1}^{k} a_s \nu_{h_s,y}(\theta)}\,\nu_{h_l,y}(\theta)\,d\theta = \frac{m_h}{m_{h_1}}. \tag{2.2}$$

The almost sure convergence in (2.2) occurs under minimal conditions on the Markov chains $\theta_i^{(l)}$, $i = 1, \ldots, n_l$. Asymptotic normality requires more restrictive conditions, and is discussed in Section 2.3.

To compute $\hat{B}(h, h_1)$, the quantities $\sum_{s=1}^{k} n_s \nu_{h_s}(\theta_i^{(l)})\, m_{h_1}/m_{h_s}$ are calculated once, and stored. Then, for every new value of $h$, the computation of $\hat{B}(h, h_1)$ requires taking $n$ ratios and a sum. Since this is to be done for a large number of $h$'s, it is essential that for each $l$, the sequence $\theta_i^{(l)}$, $i = 1, \ldots, n_l$ be as independent as possible, so that the value of $n$ be made as small as possible.

We now briefly recall the use of control variates in Monte Carlo sampling. Suppose we wish to estimate the expected value of a random variable $Y$, and we can find a random variable $Z$ that is correlated with $Y$, and such that $E(Z)$ is known (without loss of generality, $E(Z) = 0$).
Then for any $\beta$, the estimate $Y - \beta Z$ is an unbiased estimate of $E(Y)$, and the value of $\beta$ minimizing the variance of $Y - \beta Z$ is $\beta = \mathrm{Cov}(Y, Z)/\mathrm{Var}(Z)$. The idea may be used when there are several variables $Z_1, \ldots, Z_r$ that are correlated with $Y$.
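
As a generic illustration of the control variate idea (not tied to Bayes factors), the sketch below estimates $E(e^U)$ for $U$ uniform on $(0, 1)$, using $Z = U - 1/2$ as a control variate with known mean 0. The example and all names are assumptions chosen for illustration. The coefficient is replaced by its sample analogue, so the adjusted estimate is no longer exactly unbiased in finite samples.

```python
import math
import random


def sample_mean(xs):
    return sum(xs) / len(xs)


def sample_var(xs):
    m = sample_mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


def control_variate_adjust(ys, zs):
    """Return beta_hat = Cov(Y, Z) / Var(Z) (sample versions) and the values y_i - beta_hat * z_i.

    Assumes E(Z) = 0 is known, as in the text.
    """
    ybar, zbar = sample_mean(ys), sample_mean(zs)
    cov = sum((y - ybar) * (z - zbar) for y, z in zip(ys, zs)) / (len(ys) - 1)
    beta_hat = cov / sample_var(zs)
    return beta_hat, [y - beta_hat * z for y, z in zip(ys, zs)]


rng = random.Random(1)
us = [rng.random() for _ in range(10000)]
ys = [math.exp(u) for u in us]   # Y = exp(U); E(Y) = e - 1
zs = [u - 0.5 for u in us]       # Z = U - 1/2; E(Z) = 0 is known
beta_hat, adjusted = control_variate_adjust(ys, zs)
plain_estimate = sample_mean(ys)
cv_estimate = sample_mean(adjusted)
print(plain_estimate, cv_estimate, math.e - 1.0)
print(sample_var(ys), sample_var(adjusted))  # per-draw variance before and after adjustment
```

Because $e^U$ is almost linear in $U$ on $(0, 1)$, the adjusted values have a much smaller variance than the raw ones, which is the effect exploited later with the functions $Z_j$.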

In the present context, we may consider the functions

$$Z_j(\theta) = \frac{\nu_{h_j}(\theta)\, m_{h_1}/m_{h_j} - \nu_{h_1}(\theta)}{\sum_{s=1}^{k} n_s \nu_{h_s}(\theta)\, m_{h_1}/m_{h_s}}, \quad j = 2, \ldots, k,$$

whose expectations under $\sum_{s=1}^{k} (n_s/n)\,\nu_{h_s,y}$ are 0. The calculation of these functions requires knowledge of the Bayes factors $m_{h_s}/m_{h_1}$, $s = 2, \ldots, k$.

The method proposed in this paper can now be briefly summarized as follows.

1. For each $l = 1, \ldots, k$, get Markov chain samples $\theta_i^{(l)}$, $i = 1, \ldots, N_l$ from $\nu_{h_l,y}$. Based on these, the Bayes factors $m_{h_s}/m_{h_1}$, $s = 2, \ldots, k$ are estimated. The sample sizes $N_l$ should be very large, so that these estimates are very accurate.
2. For each $l = 1, \ldots, k$, we obtain new samples $\theta_i^{(l)}$, $i = 1, \ldots, n_l$ from $\nu_{h_l,y}$. Using these, together with the Bayes factors computed in Step 1, we form the estimate $\hat{B}_{\text{reg}}(h, h_1)$, which is similar to (2.1), except that we use the functions $Z_j$, $j = 2, \ldots, k$ as control variates.

The samples in the two steps are used for different purposes. Those in Step 1 are used solely to estimate $m_{h_s}/m_{h_1}$, $s = 2, \ldots, k$, and in fact, once these estimates are formed, the samples may be discarded. The samples in Step 2 are used to estimate the family $B(h, h_1)$. On occasion, special analytical structure enables the use of numerical methods to estimate $m_{h_s}/m_{h_1}$, $s = 2, \ldots, k$, as long as $k$ is not too large, so Step 1 is bypassed. A review of the literature for this approach is given in Kass and Raftery (1995).

Ideally, the samples in Step 2 should be independent or nearly so, which may be accomplished by subsampling a very long chain. If we have a Markov transition function that gives rise to a uniformly ergodic chain, it is possible to use this Markov transition function to obtain perfect samples (Hobert and Robert (2004)), although the time it takes to generate a perfect sample of length $n_l$ may be much greater than the time to generate the Markov chain of length $n_l$.

One may ask what is the point of having two steps of sampling, i.e. why not just use the samples from Step 1 for both estimation of $m_{h_s}/m_{h_1}$, $s = 1, \ldots, k$, and for subsequent estimation of the family $B(h, h_1)$.
The reason for having the two stages is that the estimate of $B(h, h_1)$ needs to be computed for a large number of $h$'s, and for every $h$ the amount of computation is linear in $n$, so this precludes a large value of $n$. Therefore, given that a relatively modest sample size must be used, we need to reduce the variance of the estimate as much as possible, and this is the reason for carrying out Step 1. The amount of computation to generate the Step 1 samples is typically one or two orders of magnitude less than the amount of computation needed to calculate the estimates of $B(h, h_1)$ from the Step 2 samples (see the discussion at the end of Section 3).

To summarize, the benefit of the two-step approach is a better tradeoff between statistical efficiency and computational time. To see this, it is helpful to consider a very simple example in which the variances of various estimators can actually be computed. Consider the unnormalized density $q_h(t) = t^h I(t \in (0, 1))$, and let $m_h$ be the normalizing constant. Now suppose we wish to estimate $m_h/m_1$ as $h$ ranges over a grid of 4000 points in the interval (.5, 2.5) and that we are able to generate iid observations from $q_1/m_1$ and $q_3/m_3$. We may use the estimator in Kong et al. (2003) (discussed later in this paper), which estimates both $m_h/m_1$ and $m_3/m_1$ from the same sample. Given one minute of computer time, using the machine whose specifications are described in Section 3, the requirement that we calculate such a large number of ratios of normalizing constants limits the total sample size to $n$ = […]. A formula for the asymptotic variance $\rho^2(h)$ of the Kong et al. (2003) estimate is given in Tan (2004, equation (8)), and in this situation all quantities that are needed in the formula are available explicitly. Now if we take the minute and divide it into two parts, 3 seconds and 57 seconds, then with the 3 seconds we can estimate $m_3/m_1$ with essentially perfect accuracy, and with the remaining 57 seconds, if we use the estimate $\hat{B}(h, 1)$, we can handle a sample size of $57n/60$. A formula for the asymptotic variance $\tau^2(h)$ of this estimator, which uses the value of $m_3/m_1$ calculated in the first stage, is given in Theorem 1 of the present paper, and can also be evaluated explicitly. The ratio $\tau^2(h)/\rho^2(h)$ is bounded above by .2 over the entire grid, and so with the same computer resources, the variance of the two-stage estimator is uniformly at most $.2 \times 60/57 \approx .21$ that of the one-stage estimator. (The gains if we use $\hat{B}_{\text{reg}}$ instead of $\hat{B}$ can be far greater; see Section 3 for an illustration.)

In Section 2.1 we show how the MCMC approach to Step 1 may be implemented. In Section 2.2 we show how estimation in Step 2 may be implemented, and also discuss the benefits of using the control variates. In Section 2.3 we give a result regarding asymptotic normality of the estimates of the Bayes factors.

2.1 Estimation of the Bayes Factors $m_{h_s}/m_{h_1}$

We now assume that for $l = 1, \ldots, k$, we have a sequence $\theta_i^{(l)}$, $i = 1, \ldots, N_l$ from a Markov chain corresponding to the posterior $\nu_{h_l,y}$. Also, these $k$ sequences are independent of one another. Let $N = \sum_{l=1}^{k} N_l$, and $a_l = N_l/N$. We wish to estimate $m_{h_l}/m_{h_1}$, $l = 2, \ldots, k$. Meng and Wong (1996) considered this problem and, to understand their method, it is helpful to consider first the case where $k = 2$ and we wish to estimate $d = m_{h_2}/m_{h_1}$.
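For any function $\alpha$ defined on the common support of $\nu_{h_1,y}$ and $\nu_{h_2,y}$ such that $\int \alpha(\theta)\nu_{h_1}(\theta)l_y(\theta)\nu_{h_2}(\theta)\,d\theta < \infty$, we have

$$\frac{\int \alpha(\theta)\nu_{h_2}(\theta)\nu_{h_1,y}(\theta)\,d\theta}{\int \alpha(\theta)\nu_{h_1}(\theta)\nu_{h_2,y}(\theta)\,d\theta} = \frac{\int \alpha(\theta)\nu_{h_2}(\theta)l_y(\theta)\nu_{h_1}(\theta)\,d\theta \,/\, m_{h_1}}{\int \alpha(\theta)\nu_{h_1}(\theta)l_y(\theta)\nu_{h_2}(\theta)\,d\theta \,/\, m_{h_2}} = \frac{m_{h_2}}{m_{h_1}}. \tag{2.3}$$

Therefore,

$$\hat{d} = \frac{\frac{1}{N_1}\sum_{i=1}^{N_1} \alpha(\theta_i^{(1)})\nu_{h_2}(\theta_i^{(1)})}{\frac{1}{N_2}\sum_{i=1}^{N_2} \alpha(\theta_i^{(2)})\nu_{h_1}(\theta_i^{(2)})}$$

is a consistent estimate of $d$, under the minimal assumption of ergodicity of the two chains. Meng and Wong (1996) show that when $\{\theta_i^{(j)}\}_{i=1}^{N_j}$ are independent draws from $\nu_{h_j,y}$, the optimal $\alpha$ to use is

$$\alpha_{\text{opt}}(\theta) = \frac{1}{a_1\nu_{h_1}(\theta) + a_2\nu_{h_2}(\theta)/d}, \tag{2.4}$$

which involves the quantity we wish to estimate. This suggests the iterative scheme

$$\hat{d}^{(t+1)} = \frac{\dfrac{1}{N_1}\displaystyle\sum_{i=1}^{N_1} \dfrac{\nu_{h_2}(\theta_i^{(1)})}{a_1\nu_{h_1}(\theta_i^{(1)}) + a_2\nu_{h_2}(\theta_i^{(1)})/\hat{d}^{(t)}}}{\dfrac{1}{N_2}\displaystyle\sum_{i=1}^{N_2} \dfrac{\nu_{h_1}(\theta_i^{(2)})}{a_1\nu_{h_1}(\theta_i^{(2)}) + a_2\nu_{h_2}(\theta_i^{(2)})/\hat{d}^{(t)}}}, \tag{2.5}$$

for $t = 1, 2, \ldots$.

For the general case where $k \geq 2$, let $d = (m_{h_2}/m_{h_1}, \ldots, m_{h_k}/m_{h_1})$, but it is more convenient to work with the vector of component-wise reciprocals of $d$, call it $r$. For $i = 2, \ldots, k$, and $j = 1, \ldots, k$, $j \neq i$, let $\alpha_{ij}$ be known functions defined on the common support of $\nu_{h_i}$ and $\nu_{h_j}$ satisfying $\int \alpha_{ij}(\theta)\nu_{h_i}(\theta)l_y(\theta)\nu_{h_j}(\theta)\,d\theta < \infty$. Let

$$b_{ii} = \sum_{j \neq i} E_{\nu_{h_j,y}}\bigl(\alpha_{ij}(\theta)\nu_{h_i}(\theta)\bigr), \quad 2 \leq i \leq k, \qquad b_{ij} = -E_{\nu_{h_i,y}}\bigl(\alpha_{ij}(\theta)\nu_{h_j}(\theta)\bigr), \quad i \neq j, \tag{2.6}$$

and

$$B = \begin{pmatrix} b_{22} & b_{23} & \cdots & b_{2k} \\ b_{32} & b_{33} & \cdots & b_{3k} \\ \vdots & \vdots & \ddots & \vdots \\ b_{k2} & b_{k3} & \cdots & b_{kk} \end{pmatrix}, \qquad b = -\begin{pmatrix} b_{21} \\ b_{31} \\ \vdots \\ b_{k1} \end{pmatrix}.$$

Then assuming that $B$ is nonsingular, we have $r = B^{-1}b$. If $\hat{B}_\alpha$ and $\hat{b}_\alpha$ are the natural estimates of $B$ and $b$ based on the functions $\alpha_{ij}$ and the samples $\{\theta_i^{(j)}\}_{i=1}^{N_j}$, $j = 1, \ldots, k$, then $r$ may be estimated via

$$\hat{r} = \hat{B}_\alpha^{-1}\hat{b}_\alpha. \tag{2.7}$$

Meng and Wong (1996) consider the functions

$$\alpha_{ij} = \frac{a_i a_j}{\sum_{s=1}^{k} a_s r_s \nu_{h_s}}, \tag{2.8}$$

which involve the unknown $r$. The natural extension of (2.5) is $\hat{r}^{(t+1)} = \hat{B}_{\alpha_t}^{-1}\hat{b}_{\alpha_t}$, with the vector of functions $\alpha_t$ given by (2.8), where we use $\hat{r}^{(t)}$ instead of $r$.

2.2 Using Control Variates

The use of control variates has had many successes in Monte Carlo sampling, and a particularly important paper is Owen and Zhou (2000). This paper considers the use of control variates in conjunction with importance sampling, when the importance sampling density is a mixture, and the paper motivates some of the ideas below.

We now assume that we have samples $\theta_i^{(l)}$, $i = 1, \ldots, n_l$, from $\nu_{h_l,y}$, $l = 1, \ldots, k$, with independence across samples, and that we know the constants $d_2, \ldots, d_k$. For unity of notation, we define $d_1 = 1$. As before $n = \sum_{l=1}^{k} n_l$ and $n_l/n = a_l$. The estimate $\hat{B}(h, h_1)$ in (2.1) is an average of $n$ draws from the mixture distribution $p_a = \sum_{s=1}^{k} a_s \nu_{h_s,y}$. However, these are not independent and identically distributed since they form a stratified sample: we have exactly $n_s$ draws from $\nu_{h_s,y}$, $s = 1, \ldots, k$, a fact which causes no problems.

We wish to estimate the integral $I_h = \int l_y(\theta)\nu_h(\theta)/m_{h_1}\,d\theta = B(h, h_1)$. Define the functions

$$H_j(\theta) = l_y(\theta)\nu_{h_j}(\theta)/m_{h_j} - l_y(\theta)\nu_{h_1}(\theta)/m_{h_1}, \quad j = 2, \ldots, k.$$

We have $\int H_j(\theta)\,d\theta = 0$, or equivalently $E_{p_a}\bigl(H_j(\theta)/p_a(\theta)\bigr) = 0$, where the subscript indicates that the expectation is taken with respect to the mixture distribution $p_a$. Therefore, for every $\beta = (\beta_2, \ldots, \beta_k)$ the estimate

$$\hat{I}_{h,\beta} = \frac{1}{n}\sum_{l=1}^{k}\sum_{i=1}^{n_l} \frac{l_y(\theta_i^{(l)})\nu_h(\theta_i^{(l)})/m_{h_1} - \sum_{j=2}^{k} \beta_j \bigl[l_y(\theta_i^{(l)})\bigl(\nu_{h_j}(\theta_i^{(l)})/m_{h_j} - \nu_{h_1}(\theta_i^{(l)})/m_{h_1}\bigr)\bigr]}{\sum_{s=1}^{k} a_s \nu_{h_s,y}(\theta_i^{(l)})}$$

is unbiased. As written, this estimate is not computable, because it involves the normalizing constants $m_{h_j}$, which are unknown, and also the likelihood $l_y(\theta)$, which may not be available. We rewrite it in computable form as

$$\hat{I}_{h,\beta} = \frac{1}{n}\sum_{l=1}^{k}\sum_{i=1}^{n_l} \frac{\nu_h(\theta_i^{(l)}) - \sum_{j=2}^{k} \beta_j \bigl[\nu_{h_j}(\theta_i^{(l)})/d_j - \nu_{h_1}(\theta_i^{(l)})\bigr]}{\sum_{s=1}^{k} a_s \nu_{h_s}(\theta_i^{(l)})/d_s}. \tag{2.9}$$

We would like to use the value of $\beta$, call it $\beta_{\text{opt}}$, that minimizes the variance of $\hat{I}_{h,\beta}$, but this $\beta_{\text{opt}}$ is generally unknown. As in Owen and Zhou (2000), we can do ordinary linear regression of $Y_{i,l}^{(h)}$ on predictors $Z_{i,l}^{(j)}$, where

$$Y_{i,l}^{(h)} = \frac{\nu_h(\theta_i^{(l)})}{\sum_{s=1}^{k} a_s \nu_{h_s}(\theta_i^{(l)})/d_s}, \qquad Z_{i,l}^{(j)} = \frac{\nu_{h_j}(\theta_i^{(l)})/d_j - \nu_{h_1}(\theta_i^{(l)})}{\sum_{s=1}^{k} a_s \nu_{h_s}(\theta_i^{(l)})/d_s}, \quad j = 2, \ldots, k, \tag{2.10}$$

and all required quantities are available. We then use the least squares estimate $\hat{\beta}$, i.e. the estimate of $I_h$ is $\hat{I}_{h,\hat{\beta}}$. It is easy to see that $\hat{I}_{h,\hat{\beta}}$ is simply $\hat{\beta}_0$, the estimate of the intercept term in the bigger regression problem where we include the intercept term, i.e.

$$\hat{I}_{h,\hat{\beta}} = \hat{\beta}_0. \tag{2.11}$$

One can show that if the $k$ sequences are all iid sequences, then $\hat{\beta}$ converges to $\beta_{\text{opt}}$, and $\hat{I}_{h,\hat{\beta}}$ is guaranteed to be at least as efficient as the naive estimator. But when we have Markov chains this is not the case, especially if the chains mix at different rates. In Section 2.3 we consider the estimates $\hat{\beta}$ and $\hat{I}_{h,\hat{\beta}}$ directly.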
In particular, we give a precise definition of the nonrandom value $\beta$ that $\hat{\beta}$ is estimating (it is $\beta_{\lim}^{(h)}$ in equation (A.3)), and show that the effect of using $\hat{\beta}$ instead of $\beta$ is asymptotically negligible.
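
The regression form of the estimate can be sketched for $k = 2$, where there is a single control variate and the least squares intercept has a closed form. The setting is an assumed toy model, not the paper's example: $Y \mid \theta \sim N(\theta, 1)$, priors $\nu_h = N(0, h^2)$, design points $h_1 = 1$ and $h_2 = 3$, iid posterior draws in place of Markov chain output, and $d_2$ computed analytically in place of Stage 1 sampling. All names are hypothetical.

```python
import math
import random


def normal_pdf(x, sd):
    """Density of N(0, sd^2) at x (the toy prior nu_h with h = sd)."""
    return math.exp(-0.5 * (x / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def marginal(y, h):
    """m_h(y) for Y | theta ~ N(theta, 1), theta ~ N(0, h^2)."""
    return normal_pdf(y, math.sqrt(1.0 + h * h))


def posterior_draws(y, h, n, rng):
    """iid posterior draws under the prior N(0, h^2) (stand-in for a Markov chain)."""
    var = h * h / (1.0 + h * h)
    return [rng.gauss(var * y, math.sqrt(var)) for _ in range(n)]


def responses_and_predictors(h, thetas, design, a, d):
    """Responses Y^{(h)} and the single predictor Z^{(2)}, following (2.10) with k = 2."""
    ys, zs = [], []
    for t in thetas:
        nu1, nu2 = normal_pdf(t, design[0]), normal_pdf(t, design[1])
        denom = a[0] * nu1 / d[0] + a[1] * nu2 / d[1]
        ys.append(normal_pdf(t, h) / denom)
        zs.append((nu2 / d[1] - nu1) / denom)
    return ys, zs


def plain_and_regression(ys, zs):
    """Sample mean of Y (plain estimate) and least squares intercept of Y on Z."""
    n = len(ys)
    ybar, zbar = sum(ys) / n, sum(zs) / n
    szz = sum((z - zbar) ** 2 for z in zs)
    szy = sum((z - zbar) * (y - ybar) for y, z in zip(ys, zs))
    return ybar, ybar - (szy / szz) * zbar


rng = random.Random(2)
y_obs, design, n_l = 1.5, (1.0, 3.0), 2000
a = (0.5, 0.5)  # proportions n_l / n
d = (1.0, marginal(y_obs, design[1]) / marginal(y_obs, design[0]))  # d_1 = 1, d_2 known
thetas = [t for h_l in design for t in posterior_draws(y_obs, h_l, n_l, rng)]

h = 2.0
plain, reg = plain_and_regression(*responses_and_predictors(h, thetas, design, a, d))
exact = marginal(y_obs, h) / marginal(y_obs, design[0])
print(plain, reg, exact)
```

For $k > 2$ the same idea needs a multiple regression with an intercept; the closed form used here covers only the one-predictor case.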

It is natural to consider the problem of estimating $\beta_{\text{opt}}$ in the Markov chain setting. Actually, before thinking about minimizing the variance of (2.9) with respect to $\beta$, one should first note the following. The constants $a_s = n_s/n$, $s = 1, \ldots, k$, used in forming the values $Y_{i,l}^{(h)}$ are sensible in the iid setting, but when dealing with Markov chains one would want to replace $n_s$ with an effective sample size, as discussed by Meng and Wong (1996). Therefore, the real problem is two-fold: How do we find optimal (or good) values to use in place of the $a_s$'s in the $Y_{i,l}^{(h)}$'s? Using the $Y_{i,l}^{(h)}$'s based on these values, how do we estimate the value of $\beta$ that minimizes the variance of (2.9)? Both problems appear to be very difficult. Intuitively at least, the method described here should perform well if the mixing rates of the Markov chains are not very different. But in any case, the results in Section 2.3 show that, whether or not $\hat{I}_{h,\hat{\beta}}$ is optimal, it is a consistent and asymptotically normal estimator whose variance can be estimated consistently.

Note that if we do not use control variates, our estimate is just

$$\frac{1}{n}\sum_{l=1}^{k}\sum_{i=1}^{n_l} \frac{\nu_h(\theta_i^{(l)})}{\sum_{s=1}^{k} a_s \nu_{h_s}(\theta_i^{(l)})/d_s},$$

which is exactly (2.1).

Reduction in Variance from Using the Control Variates

Consider the linear combination of the responses $Y_{i,l}^{(h)}$ and predictors $Z_{i,l}^{(j)}$ given by

$$L = \sum_{j=2}^{k} a_j Z^{(j)} + Y^{(h)}.$$

(We are dropping the subscripts $i, l$.) A calculation shows that if $h = h_1$ then $L = 1$, meaning that we have an estimate with zero variance. Similarly, for $t = 2, \ldots, k$, let $L_t$ be the linear combination given by

$$L_t = \sum_{j=2}^{k} a_j Z^{(j)} + (1/d_t)Y^{(h)} - Z^{(t)}.$$

If $h = h_t$, then $L_t = 1$. Thus if $h \in \{h_1, \ldots, h_k\}$, our estimate of the Bayes factor $B(h, h_1)$ has zero variance. This is not surprising since, after all, we are assuming that we know $B(h_j, h_1)$, for $j = 1, \ldots, k$; however, this does indicate that if we use these control variates, our estimate will be very precise as long as $h$ is close to at least one of the $h_j$'s. This advantage does not exist if we use the plain estimate (2.1).
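
The two identities $L = 1$ (for $h = h_1$) and $L_t = 1$ (for $h = h_t$) are purely algebraic, so they can be checked pointwise at arbitrary values of $\theta$. The sketch below does this for $k = 3$ with hypothetical normal priors $\nu_h = N(0, h^2)$; the design points, proportions and the values of $d_2, d_3$ are placeholders chosen for illustration, and only $d_1 = 1$ and $\sum_l a_l = 1$ matter for the identities themselves.

```python
import math

DESIGN = (1.0, 2.0, 4.0)   # hypothetical design points h_1, h_2, h_3 (prior sds)
A = (0.5, 0.3, 0.2)        # hypothetical proportions a_l, summing to 1
D = (1.0, 0.8, 0.6)        # d_1 = 1; d_2 and d_3 are arbitrary placeholders


def nu(theta, h):
    """Hypothetical prior density nu_h: N(0, h^2)."""
    return math.exp(-0.5 * (theta / h) ** 2) / (h * math.sqrt(2.0 * math.pi))


def mixture_denominator(theta):
    """Common denominator sum_s a_s nu_{h_s}(theta) / d_s of (2.10)."""
    return sum(a * nu(theta, h) / d for a, h, d in zip(A, DESIGN, D))


def y_resp(h, theta):
    """Response Y^{(h)} of (2.10) at a single draw theta."""
    return nu(theta, h) / mixture_denominator(theta)


def z_pred(j, theta):
    """Predictor Z^{(j)} of (2.10), j = 2, ..., k (1-based, as in the text)."""
    top = nu(theta, DESIGN[j - 1]) / D[j - 1] - nu(theta, DESIGN[0])
    return top / mixture_denominator(theta)


def weighted_z(theta):
    return sum(A[j - 1] * z_pred(j, theta) for j in range(2, len(DESIGN) + 1))


def l_combo(theta):
    """L evaluated with h = h_1."""
    return weighted_z(theta) + y_resp(DESIGN[0], theta)


def l_t_combo(t, theta):
    """L_t evaluated with h = h_t, t = 2, ..., k."""
    return weighted_z(theta) + y_resp(DESIGN[t - 1], theta) / D[t - 1] - z_pred(t, theta)


for theta in (-3.0, -0.5, 0.0, 1.2, 5.0):
    print(theta, l_combo(theta), [l_t_combo(t, theta) for t in (2, 3)])
```

Since $Y^{(h_t)}$ is an exact linear function of the predictors with intercept $d_t$, a regression with intercept recovers $d_t$ without error; this is $B(h_t, h_1)$ only when $d_t$ is the true Bayes factor.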
The intercept term in the regression of the $Y_{i,l}^{(h)}$'s on the $Z_{i,l}^{(j)}$'s is simply a linear combination of the form

$$\hat{\beta}_0 = \sum_{l=1}^{k}\sum_{i=1}^{n_l} w_{i,l}\,Y_{i,l}^{(h)}. \tag{2.12}$$

The $w_{i,l}$'s need to be computed just once, so for every new value of $h$ the calculation of $\hat{B}_{\text{reg}}(h, h_1)$ requires $n$ operations, which is the same as the number of operations needed to compute $\hat{B}(h, h_1)$ given by (2.1). To summarize, using control variates can greatly improve the accuracy of the estimates, at no (or trivial) increase in computational cost.

2.3 Asymptotic Normality and Estimation of the Variance

Here we state a result that says that under certain regularity conditions $\hat{B}_{\text{reg}}(h, h_1)$ and $\hat{B}(h, h_1)$ are asymptotically normal, and we show how to estimate the variance. As discussed in Section 2.2, we typically prefer that $\theta_i^{(l)}$, $i = 1, \ldots, n_l$, be an iid sample for each $l$. Nevertheless, our results pertain to the more general case where these samples arise from Markov chains. (As before, we assume that $n_l/n \to a_l \in (0, 1)$ and, when dealing with the asymptotics, strictly speaking we need to make a distinction between $n_l/n$ and its limit; however we write $a_l$ for both as this makes the bookkeeping easier, and blurring the distinction never creates a problem.)

Recall that $Y_{i,l}^{(h)}$ and $Z_{i,l}^{(j)}$, $j = 2, \ldots, k$, are defined in (2.10) and, for economy of notation, we define $Z_{i,l}^{(1)}$ to be 1 for all $i, l$. Let $R$ be the $k \times k$ matrix defined by

$$R_{jj'} = E\Bigl(\sum_{l=1}^{k} a_l Z_{1,l}^{(j)} Z_{1,l}^{(j')}\Bigr), \quad j, j' = 1, \ldots, k.$$

We assume that for the Markov chains a strong law of large numbers holds (sufficient conditions are given, for example, in Theorem 2 of Athreya, Doss and Sethuraman (1996)), and we refer to the following conditions.

A1 For each $l = 1, \ldots, k$, the chain $\{\theta_i^{(l)}\}_{i=1}^{\infty}$ is geometrically ergodic.
A2 For each $l = 1, \ldots, k$, there exists $\epsilon > 0$ such that $E\bigl(|Y_{1,l}^{(h)}|^{2+\epsilon}\bigr) < \infty$.
A3 The matrix $R$ is nonsingular.

Theorem 1 Under conditions A1 and A2,

$$n^{1/2}\bigl(\hat{B}(h, h_1) - B(h, h_1)\bigr) \xrightarrow{d} N\bigl(0, \tau^2(h)\bigr),$$

and under conditions A1–A3,

$$n^{1/2}\bigl(\hat{B}_{\text{reg}}(h, h_1) - B(h, h_1)\bigr) \xrightarrow{d} N\bigl(0, \sigma^2(h)\bigr),$$

with $\tau^2(h)$ and $\sigma^2(h)$ given by equations (A.9) and (A.7) below.

The proof is given in the Appendix, which also explains how one can estimate the variances.
Theorem 1 assumes that the vector $d$ is known either because it can be computed analytically or because the sample sizes from Stage 1 sampling are so large that this is effectively true. Buta (2009) has obtained a version of Theorem 1 that takes into account the variability from the first stage. Very briefly, if $N$ is the total sample size from the first stage, and if $N \to \infty$ and $n \to \infty$ in such a way that $n/N \to q \in [0, \infty)$, then

$$n^{1/2}\bigl(\hat{B}(h, h_1) - B(h, h_1)\bigr) \xrightarrow{d} N\bigl(0, q\tau_S^2(h) + \tau^2(h)\bigr),$$

[Figure 1: two surface plots; axes: df, epsilon, Bayes factor.]

Figure 1: Model assessment for the aspirin and colon cancer data. The Bayes factor as a function of $v$, the number of degrees of freedom in (3.3a), and $\epsilon$, the common value of $c_1$ and $c_2$ in the gamma prior in (3.3b), is shown from two different angles. Here the baseline value of the hyperparameter corresponds to $v = 4$ and $\epsilon = .25$.

p. 87)), so that the pathological behavior resulting from $\epsilon \to 0$ should be expected. For some more complicated models, whether the posterior is proper or not is unknown (posterior propriety may even depend on the data values), and in these cases, plots such as those in Figure 1 may be useful because they may lead one to investigate a possible posterior impropriety.

The choice of hyperparameter $h$ does have an influence on our inference. Let $\psi_{\text{new}}$ denote the latent variable for a future study, a quantity of interest in meta-analysis. We considered two specifications of $h$: $(v = \infty, \epsilon = .001)$ and $(v = 4, \epsilon = .625)$. The first choice may be considered a default choice, and the second a choice guided by consideration of the plot of Bayes factors. For the choice $(v = \infty, \epsilon = .001)$, we have $E(\psi_{\text{new}}) = .95$ and $P(\psi_{\text{new}} > 0) = .04$, whereas for $(v = 4, \epsilon = .625)$, we have $E(\psi_{\text{new}}) = .87$ and $P(\psi_{\text{new}} > 0) = .08$. In other words, the $t$ model suggests a stronger aspirin effect, but the inference is more tentative.

Remarks on Computation and Accuracy

We now give an idea of how the computational effort is distributed. The Stage 1 samples (2 chains, each of length $10^6$) took 83 seconds to generate on a 3.8 GHz dual core P4 running Linux. By contrast, the plot in Figure 1, which involves a grid of 4000 points, took one hour to compute, in spite of the fact that it is based on a total sample size of only 200, for what must be considered a rather simple model. Clearly using a very large value of $n$ is not feasible, and this is why we need to run the preliminary chains in order to get a very accurate estimate of $d$.

We now illustrate the extent to which $\hat{B}_{\text{reg}}(h, h_1)$ is more efficient than $\hat{B}(h, h_1)$.
Figure 2 gives a plot of the ratio of the variances of the two estimates as $h$ varies. Both $\hat{B}_{\text{reg}}(h, h_1)$ and $\hat{B}(h, h_1)$ use the design discussed earlier, which involves a total sample size of 200. This figure is obtained by generating 100 Monte Carlo replicates of $\hat{B}_{\text{reg}}(h, h_1)$ and $\hat{B}(h, h_1)$ for each $h$ in a grid somewhat more coarse than the one used in Figure 1. As can be seen from the figure, the ratio is about .0[…] over most of the grid, and is less than .[…] over the entire grid, with the exception of the values of $h$ for which df = .5 (for those values, the Bayes factor itself is very small, and the two estimates each have minuscule variances). We also note that the ratio is exactly 0 at the design points.

[Figure 2: surface plot; axes: df, epsilon, ratio of variances.]

Figure 2: Improvement in accuracy that results when we use control variates. The plot gives $\mathrm{Var}\bigl(\hat{B}_{\text{reg}}(h, h_1)\bigr) / \mathrm{Var}\bigl(\hat{B}(h, h_1)\bigr)$ as $h$ ranges over the same region as in Figure 1.

4 Discussion

When faced with uncertainty regarding the choice of hyperparameters, one approach is to put a prior on the hyperparameters, that is, add one layer to the hierarchical model. This approach, which goes under the general name of Bayesian model averaging, can be very useful. On the other hand, there are several good reasons why one may want to avoid it. First, the choice of prior on the hyperparameters can have a great influence on the analysis. One is tempted to use a flat prior but, as is well known, for certain parameters such a prior can in fact be very informative. In the illustration of Section 3, a flat prior on the degrees of freedom parameter in effect skews the results in favor of the normal distribution. Second, one may wish to do Bayesian model selection, as opposed to Bayesian model averaging, because the subsequent inference is then more parsimonious and interpretable. These points are discussed more fully in George and Foster (2000) and Robert (2001, Chapter 7).

There are a number of papers that deal with estimation of Bayes factors via MCMC. Chen, Shao and Ibrahim (2000, Chapter 5) and Han and Carlin (2001) give an overview of much of this work, and we mention also the more recent paper by Meng and Schilling (2002), which is directly relevant. Most of these papers deal with the case of a single Bayes factor, whereas the present paper is concerned with estimation of large families of Bayes factors.
Nevertheless in principle, any of the methods in this literature can be applied to estimate the vector $d$.

Especially important is Kong et al. (2003), whose work we describe in the notation of the present paper. The situation considered there has $k$ known unnormalized densities $q_{h_1}, \ldots, q_{h_k}$, with unknown normalizing constants $m_{h_1}, \ldots, m_{h_k}$, respectively, and for $l = 1, \ldots, k$, there is an iid sample $\theta_1^{(l)}, \ldots, \theta_{n_l}^{(l)}$ from $q_{h_l}/m_{h_l}$. The problem is the simultaneous estimation of all ratios $m_{h_l}/m_{h_s}$, $l, s = 1, \ldots, k$, or equivalently, all ratios $d_l = m_{h_l}/m_{h_1}$, $l = 1, \ldots, k$. In a certain framework, they show that the maximum likelihood estimate (MLE) of $d$ is obtained by solving the system of $k$ equations

$$\hat{d}_r = \frac{1}{n}\sum_{l=1}^{k}\sum_{i=1}^{n_l} \frac{q_{h_r}(\theta_i^{(l)})}{\sum_{s=1}^{k} a_s q_{h_s}(\theta_i^{(l)})/\hat{d}_s}, \quad r = 1, \ldots, k. \tag{4.1}$$

To put this in our context, let $q_{h_l}(\theta) = l_y(\theta)\nu_{h_l}(\theta)$, $l = 1, \ldots, k$, and suppose we have iid samples from the normalized $q_{h_l}$'s. We may imagine that we have $k + 1$ unnormalized densities $q_{h_1}, \ldots, q_{h_k}, q_h$, with a sample of size 0 from the normalized $q_h$. The estimate of $m_h/m_{h_1}$ then becomes

$$\frac{1}{n}\sum_{l=1}^{k}\sum_{i=1}^{n_l} \frac{\nu_h(\theta_i^{(l)})}{\sum_{s=1}^{k} a_s \nu_{h_s}(\theta_i^{(l)})/\hat{d}_s}.$$

We recognize this as precisely $\hat{B}(h, h_1)$ in (2.1), except that $\hat{d}_1, \ldots, \hat{d}_k$ are formed by solving (4.1), i.e., are estimated from the sequences $\theta_1^{(l)}, \ldots, \theta_{n_l}^{(l)}$, $l = 1, \ldots, k$. Thus, $\hat{B}(h, h_1)$ is the same as the estimate of Kong et al. (2003), except that the vector $d$ is precomputed based on previously run very long chains. Therefore, it is perhaps natural to consider estimating $d$ on the basis of these very long Markov chains using the method of Kong et al. (2003) (as opposed to the method discussed in Section 2.1), and we now discuss this possibility.

In their approach, Kong et al. (2003) assume that the $q_{h_l}$'s are densities with respect to a dominating measure $\mu$, and they obtain the MLE $\hat{\mu}$ of $\mu$ ($\hat{\mu}$ is given up to a multiplicative constant). They can then estimate the ratios $m_{h_l}/m_{h_s}$ since the normalizing constants are known functions of $\mu$. Their approach works if for each $l$, $\theta_1^{(l)}, \ldots, \theta_{n_l}^{(l)}$ is an iid sample.
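
In the iid case the system (4.1) can be solved by simple fixed-point iteration. The sketch below tries this on the toy densities $q_h(t) = t^h$ on $(0, 1)$ from Section 2, with design points $h_1 = 1$ and $h_2 = 3$, for which $m_h = 1/(h + 1)$ and the true value is $d_2 = m_3/m_1 = 0.5$. The fixed-point scheme, the normalisation $\hat{d}_1 = 1$, the sample sizes and all function names are assumptions made for illustration; this is not presented as the algorithm used by Kong et al. (2003).

```python
import random


def q(h, t):
    """Unnormalized density q_h(t) = t^h on (0, 1); m_h = 1 / (h + 1)."""
    return t ** h


def draw(h, n, rng):
    """iid draws from q_h / m_h by inverting the CDF t^(h + 1); 1 - random() avoids t = 0."""
    return [(1.0 - rng.random()) ** (1.0 / (h + 1.0)) for _ in range(n)]


def solve_d(design, samples, max_iter=1000, tol=1e-12):
    """Fixed-point iteration for the system (4.1), normalised so that d_1 = 1."""
    n = sum(len(s) for s in samples)
    a = [len(s) / n for s in samples]
    pooled = [t for s in samples for t in s]
    d = [1.0] * len(design)
    for _ in range(max_iter):
        new = [0.0] * len(design)
        for t in pooled:
            denom = sum(a_s * q(h_s, t) / d_s for a_s, h_s, d_s in zip(a, design, d))
            for r, h_r in enumerate(design):
                new[r] += q(h_r, t) / denom / n
        new = [v / new[0] for v in new]  # (4.1) fixes d only up to scale
        converged = max(abs(u - v) for u, v in zip(new, d)) < tol
        d = new
        if converged:
            break
    return d, a, pooled


def ratio_estimate(h, design, d, a, pooled):
    """Estimate of m_h / m_{h_1} with the solved d plugged in."""
    total = 0.0
    for t in pooled:
        denom = sum(a_s * q(h_s, t) / d_s for a_s, h_s, d_s in zip(a, design, d))
        total += q(h, t) / denom
    return total / len(pooled)


rng = random.Random(3)
design = (1.0, 3.0)
samples = [draw(h_l, 5000, rng) for h_l in design]
d_hat, a, pooled = solve_d(design, samples)
print(d_hat)                                          # true d_2 = m_3 / m_1 = 0.5
print(ratio_estimate(2.0, design, d_hat, a, pooled))  # true m_2 / m_1 = 2/3
```

Here $d$ and the ratio for a new $h$ are estimated from the same draws, which is the one-stage scheme; the two-stage proposal of this paper would instead plug in a $d$ obtained beforehand from much longer runs.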
Although they extend it to the case where these are a Markov chain, in the extension $q_{h_l}$ is replaced by the Markov transition functions $P_{h_l}(\cdot, \theta_i^{(l)})$, $i = 0, \ldots, n_l$, assumed absolutely continuous with respect to a sigma-finite measure $\mu$ (precluding Metropolis-Hastings chains), and if each of these is known only up to a normalizing constant, as is typically the case, then the system (4.1) becomes a system of k equations. This is prohibitively difficult to solve.

Tan (2004) shows how control variates can be incorporated in the likelihood framework of Kong et al. (2003). When there are $r$ functions $H_j$, $j = 1, \ldots, r$, for which we know that $\int H_j\,d\mu = 0$, the parameter space is restricted to the set of all sigma-finite measures satisfying these $r$ constraints. For the case where $\theta_i^{(l)}$, $i = 1, \ldots, n_l$, are iid for each $l = 1, \ldots, k$, he obtains the MLE of $\mu$ in this reduced parameter space, and therefore a corresponding estimate of $m_h/m_{h_1}$, and shows that this approach gives estimates that are asymptotically equivalent to estimates that use control variates via regression. His estimate can still be used when we have Markov chain draws, but is no longer optimal for the same reason that the estimate in the present paper is not optimal (see the discussion in the middle of Section 2.2). The optimal estimator is obtained by using the likelihood that arises from the Markov chain structure, and in the case of general Markov chains its calculation is computationally very demanding. See