Estimation of Large Families of Bayes Factors from Markov Chain Output

Hani Doss
University of Florida

Abstract

We consider situations in Bayesian analysis where the prior is indexed by a hyperparameter taking on a continuum of values. We distinguish some arbitrary value of the hyperparameter, and consider the problem of estimating the Bayes factor for the model indexed by the hyperparameter vs. the model specified by the distinguished point, as the hyperparameter varies. We assume that we have Markov chain output from the posterior for a finite number of the priors, and develop a method for efficiently computing estimates of the entire family of Bayes factors. As an application of the ideas, we consider some commonly used hierarchical Bayesian models and show that the parametric assumptions in these models can be recast as assumptions regarding the prior. Therefore, our method can be used as a model selection criterion in a Bayesian framework. We illustrate our methodology through a detailed example involving Bayesian model selection.

Key words and phrases: Bayes factors, control variates, ergodicity, importance sampling, Markov chain Monte Carlo.

1 Introduction

Suppose we have a data vector $Y$ whose distribution has density $p_\theta$, for some unknown $\theta \in \Theta$. Let $\{\nu_h, h \in H\}$ be a family of prior densities on $\theta$ that we are contemplating. The selection of a particular prior from the family is important in Bayesian data analysis, and when making this choice one will often want to consider the marginal likelihood of the data under the prior $\nu_h$, given by

$$m_h(y) = \int l_y(\theta)\, \nu_h(\theta)\, d\theta,$$

as $h$ varies over the hyperparameter space $H$. Here, $l_y(\theta) = p_\theta(y)$ is the likelihood function. Values of $h$ for which $m_h(y)$ is relatively low may be considered poor choices, and consideration of the family $\{m_h(y), h \in H\}$ may be helpful in narrowing the search of priors to use. It is therefore useful to have a method for computing the family $\{m_h(y), h \in H\}$. For the purpose of model selection, if $c$ is a fixed constant, the information given by $\{m_h(y), h \in H\}$ and $\{c\, m_h(y), h \in H\}$ is the same. From a computational and statistical point of view however, it is usually easier to fix a particular hyperparameter value $h_1$ and focus on $\{m_h(y)/m_{h_1}(y), h \in H\}$. Given two hyperparameter values $h$ and $h_1$, the quantity $B(h, h_1) = m_h/m_{h_1}$ is called the Bayes factor of the model indexed by $h$ vs. the model indexed by $h_1$ (we write $m_h$ instead of $m_h(y)$ from now on).

In this paper we present a method for estimating the family $\{B(h, h_1), h \in H\}$. We have in mind situations where $B(h, h_1)$ cannot be obtained analytically and, moreover, we need to calculate $B(h, h_1)$ for a large set of $h$'s, so that computational efficiency is essential. Our approach requires that there are $k$ hyperparameter values $h_1, \ldots, h_k$, and for $l = 1, \ldots, k$, we are able to get a sample $\theta_i^{(l)}$, $i = 1, \ldots, n_l$, from $\nu_{h_l,y}$, the posterior density of $\theta$ given $Y = y$, assuming that the prior is $\nu_{h_l}$.

To set the framework, consider the trivial case where $k = 1$, and we have a sample $\theta_1, \ldots, \theta_n$ from the posterior $\nu_{h_1,y}$ generated by an ergodic Markov chain. Our objective is to estimate $\{B(h, h_1), h \in H\}$. For any $h$ such that $\nu_h(\theta) = 0$ whenever $\nu_{h_1}(\theta) = 0$, we have

$$
\begin{aligned}
\frac{1}{n} \sum_{i=1}^{n} \frac{\nu_h(\theta_i)}{\nu_{h_1}(\theta_i)}
&\xrightarrow{\text{a.s.}} \int \frac{\nu_h(\theta)}{\nu_{h_1}(\theta)}\, \nu_{h_1,y}(\theta)\, d\theta \\
&= \frac{m_h}{m_{h_1}} \int \frac{l_y(\theta)\nu_h(\theta)/m_h}{l_y(\theta)\nu_{h_1}(\theta)/m_{h_1}}\, \nu_{h_1,y}(\theta)\, d\theta \\
&= \frac{m_h}{m_{h_1}} \int \frac{\nu_{h,y}(\theta)}{\nu_{h_1,y}(\theta)}\, \nu_{h_1,y}(\theta)\, d\theta = \frac{m_h}{m_{h_1}}.
\end{aligned}
\tag{1.1}
$$

Therefore, the left side of (1.1) is a consistent estimate of the Bayes factor $B(h, h_1)$.

To fix ideas, consider as a simple example the following standard three-level hierarchical model:

$$
\begin{aligned}
&\text{conditional on } \psi_j, && Y_j \overset{\text{indep}}{\sim} \phi_{\psi_j, \sigma_j}, && j = 1, \ldots, m, && (1.2\text{a}) \\
&\text{conditional on } \mu, \tau, && \psi_j \overset{\text{iid}}{\sim} \phi_{\mu, \tau}, && j = 1, \ldots, m, && (1.2\text{b}) \\
& && (\mu, \tau) \sim \lambda_{c_1, c_2, c_3, c_4}, && && (1.2\text{c})
\end{aligned}
$$

where $\phi_{m,s}$ denotes the density of the normal distribution with mean $m$ and standard deviation $s$. In (1.2a), the $\sigma_j$'s are assumed known. In (1.2c), $\lambda_{c_1, c_2, c_3, c_4}$ is the normal / inverse gamma distribution indexed by four hyperparameters (see Section 3). This is a very commonly used model but, as we discuss later, in some situations it is preferable to replace (1.2b) with $\psi_j \overset{\text{iid}}{\sim} t_{v,\mu,\tau}$, where $t_{v,\mu,\tau}$ is the density of the $t$ distribution with $v$ degrees of freedom, location $\mu$ and scale $\tau$. In this case, consider now the estimate in the left side of (1.1). The likelihood of $(\mu, \tau)$ is

$$l_Y(\mu, \tau) = \int \cdots \int \prod_{j=1}^{m} \phi_{\psi_j, \sigma_j}(Y_j) \prod_{j=1}^{m} t_{v,\mu,\tau}(\psi_j)\, d\psi_1 \cdots d\psi_m.$$

This likelihood cannot be computed in closed form, and therefore its cancellation in (1.1) gives a non-trivial simplification: calculation of the estimate requires only the ratio of the densities of the priors and not the posteriors.

Consider (1.2) with $t_{v,\mu,\tau}$ instead of $\phi_{\mu,\tau}$ in the middle stage, and suppose now that we would like to select $v$, with the choice $v = \infty$ signifying the choice of the normal distribution $\phi_{\mu,\tau}$. The distribution of $Y$ is determined by $\psi = (\psi_1, \ldots, \psi_m)$. A completely equivalent way of describing the model is therefore through the two-level hierarchy in which we let $\theta = (\psi, \mu, \tau)$, and stipulate:

$$
\begin{aligned}
&\text{conditional on } \theta, && Y_j \overset{\text{indep}}{\sim} \phi_{\psi_j, \sigma_j}, \quad j = 1, \ldots, m, \\
& && (\psi, \mu, \tau) \sim \nu_h,
\end{aligned}
$$

where $\nu_h(\psi, \mu, \tau) = \big( \prod_{j=1}^{m} t_{v,\mu,\tau}(\psi_j) \big)\, \lambda_{c_1, c_2, c_3, c_4}(\mu, \tau)$. Here, the hyperparameter is $h = (v, c_1, c_2, c_3, c_4)$, which includes the number of degrees of freedom. Estimation of the family of Bayes factors $\{B(h, h_1), h \in H\}$ therefore enables a model selection step.

We now discuss briefly the accuracy of the estimate on the left side of (1.1). When $\nu_h$ is nearly singular with respect to $\nu_{h_1}$ over the region where the $\theta_i$'s are likely to be, the estimate will be unstable. (Formally, the estimate will satisfy a central limit theorem if the chain mixes fast enough and the random variable $\nu_h(\theta)/\nu_{h_1}(\theta)$ (where $\theta \sim \nu_{h_1,y}$) has a high enough moment. This is discussed in more detail in Section 2.3.) From a practical point of view, this means that there is effectively a radius around $h_1$ within which one can safely move.
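As a rough illustration of the estimate on the left side of (1.1), the sketch below uses a toy conjugate model that is not taken from the paper: a single observation $y \sim N(\theta, 1)$ with prior $\nu_h = N(0, h^2)$, so that $m_h(y)$ is the $N(0, 1 + h^2)$ density at $y$ and the Bayes factor is available exactly for comparison. The function names are hypothetical, and exact iid posterior draws stand in for Markov chain output.

```python
import numpy as np

def normal_pdf(x, mean, sd):
    """Density of N(mean, sd**2) evaluated at x."""
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

def bayes_factor_estimate(theta, h, h1):
    """Left side of (1.1): average of the prior ratio nu_h / nu_{h1} over posterior draws."""
    return np.mean(normal_pdf(theta, 0.0, h) / normal_pdf(theta, 0.0, h1))

def bayes_factor_exact(y, h, h1):
    """Closed form for the toy model: m_h(y) is the N(0, 1 + h**2) density at y."""
    return (normal_pdf(y, 0.0, np.sqrt(1.0 + h**2))
            / normal_pdf(y, 0.0, np.sqrt(1.0 + h1**2)))

rng = np.random.default_rng(0)
y, h1, n = 1.0, 1.0, 200_000

# Exact posterior under nu_{h1} (by conjugacy); an assumption standing in for MCMC output.
post_var = h1**2 / (1.0 + h1**2)
theta = rng.normal(y * post_var, np.sqrt(post_var), size=n)

# The same posterior sample is reused for every value of h.
for h in (0.5, 1.0, 1.5):
    print(h, bayes_factor_estimate(theta, h, h1), bayes_factor_exact(y, h, h1))
```

Only the prior densities enter the estimate; the likelihood is never evaluated, which is the simplification noted above.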
In all but the very simplest models, the dimension of $H$ is greater than 1, and therefore estimation of the Bayes factor as $h$ ranges over $H$ raises serious computational difficulties, and it is essential that for each $h$, the estimate of $B(h, h_1)$ is both accurate and can be computed quickly. Our approach is to select $k$ hyperparameter points $h_1, \ldots, h_k$, and get Markov chain samples from $\nu_{h_l,y}$ for each $l = 1, \ldots, k$. The prior $\nu_{h_1}$ in the denominator of the left side of (1.1) is replaced by a mixture $w_1 \nu_{h_1} + \cdots + w_k \nu_{h_k}$, with appropriately chosen weights. We show how judiciously chosen control variates can be used in conjunction with multiple Markov chain streams to produce accurate estimates even with small samples, so that the net result is a computationally feasible method for producing reliable estimates of the Bayes factors for a wide range of hyperparameter values.

Our approach is motivated by and uses ideas developed in Kong et al. (2003), which deals with the situation where we have independent samples from $k$ unnormalized densities, and we wish to estimate all possible ratios of the $k$ normalizing constants. Owen and Zhou (2000) and Tan (2004) also discuss the use of control variates to increase the accuracy of Monte Carlo estimates. In Section 4 we return to these three papers and discuss in detail how our approach fits in the context of this work.

The paper is organized as follows. Section 2 contains the main methodological development; there, we present our method for estimating the family of Bayes factors and state supporting theoretical results. Section 3 illustrates the methodology through a detailed example that involves a number of issues, including selection of the parametric family in the model. Section 4 gives a discussion of other possible approaches and related work, and the Appendix gives the proof of the main theoretical result of the paper.

2 Estimation of the Family of Bayes Factors

Suppose that for $l = 1, \ldots, k$, we have Markov chain Monte Carlo (MCMC) samples $\theta_i^{(l)}$, $i = 1, \ldots, n_l$ from the posterior density of $\theta$ given $Y = y$, assuming that the prior is $\nu_{h_l}$, having the form $\nu_{h_l,y}(\theta) = l_y(\theta)\nu_{h_l}(\theta)/m_{h_l}$. We assume that the $k$ sequences are independent of one another. We will not assume we know any of the $m_{h_l}$'s. However, we now explain how knowledge of the Bayes factors $m_{h_l}/m_{h_1}$, for $l = 2, \ldots, k$ would result in two important benefits. If we knew these Bayes factors we could then form the estimate

$$
\hat{B}(h, h_1) = \sum_{l=1}^{k} \sum_{i=1}^{n_l} \frac{\nu_h(\theta_i^{(l)})}{\sum_{s=1}^{k} n_s \nu_{h_s}(\theta_i^{(l)})\, m_{h_1}/m_{h_s}}.
\tag{2.1}
$$

Let $n = \sum_{s=1}^{k} n_s$, and assume that $n_s/n \to a_s$, $s = 1, \ldots, k$. We then have

$$
\begin{aligned}
\hat{B}(h, h_1)
&= \sum_{l=1}^{k} \sum_{i=1}^{n_l} \frac{l_y(\theta_i^{(l)})\nu_h(\theta_i^{(l)})}{\sum_{s=1}^{k} n_s\, l_y(\theta_i^{(l)})\nu_{h_s}(\theta_i^{(l)})\, m_{h_1}/m_{h_s}} \\
&= \frac{m_h}{m_{h_1}} \sum_{l=1}^{k} \frac{n_l}{n} \frac{1}{n_l} \sum_{i=1}^{n_l} \frac{\nu_{h,y}(\theta_i^{(l)})}{\sum_{s=1}^{k} (n_s/n)\, \nu_{h_s,y}(\theta_i^{(l)})} \\
&\xrightarrow{\text{a.s.}} \frac{m_h}{m_{h_1}} \sum_{l=1}^{k} \int \frac{a_l\, \nu_{h,y}(\theta)}{\sum_{s=1}^{k} a_s \nu_{h_s,y}(\theta)}\, \nu_{h_l,y}(\theta)\, d\theta = \frac{m_h}{m_{h_1}}.
\end{aligned}
\tag{2.2}
$$

The almost sure convergence in (2.2) occurs under minimal conditions on the Markov chains $\theta_i^{(l)}$, $i = 1, \ldots, n_l$. Asymptotic normality requires more restrictive conditions, and is discussed in Section 2.3.

To compute $\hat{B}(h, h_1)$, the quantities $\sum_{s=1}^{k} n_s \nu_{h_s}(\theta_i^{(l)})\, m_{h_1}/m_{h_s}$ are calculated once, and stored. Then, for every new value of $h$, the computation of $\hat{B}(h, h_1)$ requires taking $n$ ratios and a sum. Since this is to be done for a large number of $h$'s, it is essential that for each $l$, the sequence $\theta_i^{(l)}$, $i = 1, \ldots, n_l$ be as independent as possible, so that the value of $n$ be made as small as possible.

We now briefly recall the use of control variates in Monte Carlo sampling. Suppose we wish to estimate the expected value of a random variable $Y$, and we can find a random variable $Z$ that is correlated with $Y$, and such that $E(Z)$ is known (without loss of generality, $E(Z) = 0$). Then for any $\beta$, the estimate $Y - \beta Z$ is an unbiased estimate of $E(Y)$, and the value of $\beta$ minimizing the variance of $Y - \beta Z$ is $\beta = \operatorname{Cov}(Y, Z)/\operatorname{Var}(Z)$. The idea may be used when there are several variables $Z_1, \ldots, Z_r$ that are correlated with $Y$.
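The variance reduction from a single control variate can be seen in a small sketch. The target $E(e^U)$ with $U$ uniform on $(0, 1)$ and the control variate $Z = U - 1/2$ are illustrative choices, not taken from the paper, and $\beta$ is replaced by its sample analogue, so the adjusted estimate is only approximately unbiased.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

u = rng.uniform(size=n)
y = np.exp(u)      # target: E(Y) = e - 1
z = u - 0.5        # control variate with known mean E(Z) = 0

# Sample version of beta = Cov(Y, Z) / Var(Z).
beta = np.cov(y, z)[0, 1] / np.var(z, ddof=1)

plain = y.mean()
adjusted = (y - beta * z).mean()

print("plain estimate   ", plain, "per-draw variance", y.var(ddof=1))
print("adjusted estimate", adjusted, "per-draw variance", (y - beta * z).var(ddof=1))
```

Because $e^U$ is nearly linear in $U$ on $(0, 1)$, the adjusted draws $Y - \beta Z$ have a much smaller variance than $Y$ itself.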

In the present context, we may consider the functions

$$
Z_j(\theta) = \frac{\nu_{h_j}(\theta)\, m_{h_1}/m_{h_j} - \nu_{h_1}(\theta)}{\sum_{s=1}^{k} n_s \nu_{h_s}(\theta)\, m_{h_1}/m_{h_s}}, \qquad j = 2, \ldots, k,
$$

whose expectations under $\sum_{s=1}^{k} (n_s/n)\, \nu_{h_s,y}$ are 0. The calculation of these functions requires knowledge of the Bayes factors $m_{h_s}/m_{h_1}$, $s = 2, \ldots, k$.

The method proposed in this paper can now be briefly summarized as follows.

1. For each $l = 1, \ldots, k$, get Markov chain samples $\theta_i^{(l)}$, $i = 1, \ldots, N_l$ from $\nu_{h_l,y}$. Based on these, the Bayes factors $m_{h_s}/m_{h_1}$, $s = 2, \ldots, k$ are estimated. The sample sizes $N_l$ should be very large, so that these estimates are very accurate.

2. For each $l = 1, \ldots, k$, we obtain new samples $\theta_i^{(l)}$, $i = 1, \ldots, n_l$ from $\nu_{h_l,y}$. Using these, together with the Bayes factors computed in Step 1, we form the estimate $\hat{B}_{\text{reg}}(h, h_1)$, which is similar to (2.1), except that we use the functions $Z_j$, $j = 2, \ldots, k$ as control variates.

The samples in the two steps are used for different purposes. Those in Step 1 are used solely to estimate $m_{h_s}/m_{h_1}$, $s = 2, \ldots, k$, and in fact, once these estimates are formed, the samples may be discarded. The samples in Step 2 are used to estimate the family $B(h, h_1)$. On occasion, special analytical structure enables the use of numerical methods to estimate $m_{h_s}/m_{h_1}$, $s = 2, \ldots, k$, as long as $k$ is not too large, so that Step 1 is bypassed. A review of the literature for this approach is given in Kass and Raftery (1995). Ideally, the samples in Step 2 should be independent or nearly so, which may be accomplished by subsampling a very long chain. If we have a Markov transition function that gives rise to a uniformly ergodic chain, it is possible to use this Markov transition function to obtain perfect samples (Hobert and Robert (2004)), although the time it takes to generate a perfect sample of length $n_l$ may be much greater than the time to generate the Markov chain of length $n_l$.

One may ask what is the point of having two steps of sampling, i.e. why not just use the samples from Step 1 for both estimation of $m_{h_s}/m_{h_1}$, $s = 2, \ldots, k$, and for subsequent estimation of the family $B(h, h_1)$. The reason for having the two stages is that the estimate of $B(h, h_1)$ needs to be computed for a large number of $h$'s, and for every $h$ the amount of computation is linear in $n$, so this precludes a large value of $n$. Therefore, given that a relatively modest sample size must be used, we need to reduce the variance of the estimate as much as possible, and this is the reason for carrying out Step 1. The amount of computation to generate the Step 1 samples is typically one or two orders of magnitude less than the amount of computation needed to calculate the estimates of $B(h, h_1)$ from the Step 2 samples (see the discussion at the end of Section 3).

To summarize, the benefit of the two-step approach is a better tradeoff between statistical efficiency and computational time. To see this, it is helpful to consider a very simple example in which the variances of various estimators can actually be computed. Consider the unnormalized density $q_h(t) = t^h I(t \in (0, 1))$, and let $m_h$ be the normalizing constant. Now suppose we wish to estimate $m_h/m_1$ as $h$ ranges over a grid of 4000 points in the interval $(1.5, 2.5)$ and that we are able to generate iid observations from $q_1/m_1$ and $q_3/m_3$. We may use the estimator in Kong et al. (2003) (discussed later in this paper), which estimates both $m_h/m_1$ and $m_3/m_1$ from the same sample. Given one minute of computer time, using the machine whose specifications are described in Section 3, the requirement that we calculate such a large number of ratios of normalizing constants limits the total sample size to $n = 2{,}190$. A formula for the asymptotic variance $\rho^2(h)$ of the Kong et al. (2003) estimate is given in Tan (2004, equation (8)), and in this situation all quantities that are needed in the formula are available explicitly. Now if we take the minute and divide it into two parts, 3 seconds and 57 seconds, then with the 3 seconds we can estimate $m_3/m_1$ with essentially perfect accuracy, and with the remaining 57 seconds, if we use the estimate $\hat{B}(h, 1)$, we can handle a sample size of $57n/60$. A formula for the asymptotic variance $\tau^2(h)$ of this estimator, which uses the value of $m_3/m_1$ calculated in the first stage, is given in Theorem 1 of the present paper, and can also be evaluated explicitly. The ratio $\tau^2(h)/\rho^2(h)$ is bounded above by $.2$ over the entire grid, and so with the same computer resources, the variance of the two-stage estimator is uniformly at most $.2 \times 60/57 \approx .21$ that of the one-stage estimator. (The gains if we use $\hat{B}_{\text{reg}}$ instead of $\hat{B}$ can be far greater; see Section 3 for an illustration.)

In Section 2.1 we show how the MCMC approach to Step 1 may be implemented. In Section 2.2 we show how estimation in Step 2 may be implemented, and also discuss the benefits of using the control variates. In Section 2.3 we give a result regarding asymptotic normality of the estimates of the Bayes factors.

2.1 Estimation of the Bayes Factors $m_{h_s}/m_{h_1}$

We now assume that for $l = 1, \ldots, k$, we have a sequence $\theta_i^{(l)}$, $i = 1, \ldots, N_l$ from a Markov chain corresponding to the posterior $\nu_{h_l,y}$. Also, these $k$ sequences are independent of one another. Let $N = \sum_{l=1}^{k} N_l$, and $a_l = N_l/N$. We wish to estimate $m_{h_l}/m_{h_1}$, $l = 2, \ldots, k$. Meng and Wong (1996) considered this problem and, to understand their method, it is helpful to consider first the case where $k = 2$ and we wish to estimate $d = m_{h_2}/m_{h_1}$.
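For any function $\alpha$ defined on the common support of $\nu_{h_1,y}$ and $\nu_{h_2,y}$ such that $\int \alpha(\theta)\nu_{h_1}(\theta)\, l_y(\theta)\nu_{h_2}(\theta)\, d\theta < \infty$, we have

$$
\frac{\int \alpha(\theta)\nu_{h_2}(\theta)\, \nu_{h_1,y}(\theta)\, d\theta}{\int \alpha(\theta)\nu_{h_1}(\theta)\, \nu_{h_2,y}(\theta)\, d\theta}
= \frac{m_{h_2}}{m_{h_1}} \cdot \frac{\int \alpha(\theta)\nu_{h_2}(\theta)\, l_y(\theta)\nu_{h_1}(\theta)\, d\theta}{\int \alpha(\theta)\nu_{h_1}(\theta)\, l_y(\theta)\nu_{h_2}(\theta)\, d\theta}
= \frac{m_{h_2}}{m_{h_1}}.
\tag{2.3}
$$

Therefore,

$$
\hat{d} = \frac{\dfrac{1}{N_1} \sum_{i=1}^{N_1} \alpha(\theta_i^{(1)})\, \nu_{h_2}(\theta_i^{(1)})}{\dfrac{1}{N_2} \sum_{i=1}^{N_2} \alpha(\theta_i^{(2)})\, \nu_{h_1}(\theta_i^{(2)})}
$$

is a consistent estimate of $d$, under the minimal assumption of ergodicity of the two chains. Meng and Wong (1996) show that when $\{\theta_i^{(j)}\}_{i=1}^{N_j}$ are independent draws from $\nu_{h_j,y}$, the optimal $\alpha$ to use is

$$
\alpha_{\text{opt}}(\theta) = \frac{1}{a_1 \nu_{h_1}(\theta) + a_2 \nu_{h_2}(\theta)/d},
\tag{2.4}
$$

which involves the quantity we wish to estimate. This suggests the iterative scheme

$$
\hat{d}^{(t+1)} =
\frac{\dfrac{1}{N_1} \sum_{i=1}^{N_1} \dfrac{\nu_{h_2}(\theta_i^{(1)})}{a_1 \nu_{h_1}(\theta_i^{(1)}) + a_2 \nu_{h_2}(\theta_i^{(1)})/\hat{d}^{(t)}}}
{\dfrac{1}{N_2} \sum_{i=1}^{N_2} \dfrac{\nu_{h_1}(\theta_i^{(2)})}{a_1 \nu_{h_1}(\theta_i^{(2)}) + a_2 \nu_{h_2}(\theta_i^{(2)})/\hat{d}^{(t)}}},
\tag{2.5}
$$

for $t = 1, 2, \ldots$.

For the general case where $k \geq 2$, let $d = (m_{h_2}/m_{h_1}, \ldots, m_{h_k}/m_{h_1})$, but it is more convenient to work with the vector of component-wise reciprocals of $d$, call it $r$. For $i = 2, \ldots, k$, and $j = 1, \ldots, k$, $j \neq i$, let $\alpha_{ij}$ be known functions defined on the common support of $\nu_{h_i}$ and $\nu_{h_j}$ satisfying $\int \alpha_{ij}(\theta)\nu_{h_i}(\theta)\, l_y(\theta)\nu_{h_j}(\theta)\, d\theta < \infty$. Let

$$
b_{ii} = \sum_{j \neq i} E_{\nu_{h_j,y}}\big( \alpha_{ij}(\theta)\nu_{h_i}(\theta) \big), \quad 2 \leq i \leq k, \qquad
b_{ij} = -E_{\nu_{h_i,y}}\big( \alpha_{ij}(\theta)\nu_{h_j}(\theta) \big), \quad i \neq j,
\tag{2.6}
$$

and

$$
B = \begin{pmatrix}
b_{22} & b_{23} & \cdots & b_{2k} \\
b_{32} & b_{33} & \cdots & b_{3k} \\
\vdots & \vdots & \ddots & \vdots \\
b_{k2} & b_{k3} & \cdots & b_{kk}
\end{pmatrix},
\qquad
b = \begin{pmatrix} -b_{21} \\ -b_{31} \\ \vdots \\ -b_{k1} \end{pmatrix}.
$$

Then assuming that $B$ is nonsingular, we have $r = B^{-1} b$. If $\hat{B}_\alpha$ and $\hat{b}_\alpha$ are the natural estimates of $B$ and $b$ based on the functions $\alpha_{ij}$ and the samples $\{\theta_i^{(j)}\}_{i=1}^{N_j}$, $j = 1, \ldots, k$, then $r$ may be estimated via

$$
\hat{r} = \hat{B}_\alpha^{-1}\, \hat{b}_\alpha.
\tag{2.7}
$$

Meng and Wong (1996) consider the functions

$$
\alpha_{ij} = \frac{a_i a_j}{\sum_{s=1}^{k} a_s r_s \nu_{h_s}},
\tag{2.8}
$$

which involve the unknown $r$. The natural extension of (2.5) is $\hat{r}^{(t+1)} = \hat{B}_{\alpha_t}^{-1}\, \hat{b}_{\alpha_t}$, with the vector of functions $\alpha_t$ given by (2.8), where we use $\hat{r}^{(t)}$ instead of $r$.

2.2 Using Control Variates

The use of control variates has had many successes in Monte Carlo sampling, and a particularly important paper is Owen and Zhou (2000). This paper considers the use of control variates in conjunction with importance sampling, when the importance sampling density is a mixture, and the paper motivates some of the ideas below.

We now assume that we have samples $\theta_i^{(l)}$, $i = 1, \ldots, n_l$, from $\nu_{h_l,y}$, $l = 1, \ldots, k$, with independence across samples, and that we know the constants $d_2, \ldots, d_k$. For unity of notation, we define $d_1 = 1$. As before $n = \sum_{l=1}^{k} n_l$ and $n_l/n = a_l$. The estimate $\hat{B}(h, h_1)$ in (2.1) is an average of $n$ draws from the mixture distribution $p_a = \sum_{s=1}^{k} a_s \nu_{h_s,y}$. However, these are not independent and identically distributed since they form a stratified sample: we have exactly $n_s$ draws from $\nu_{h_s,y}$, $s = 1, \ldots, k$, a fact which causes no problems. We wish to estimate the integral $I_h = \int l_y(\theta)\nu_h(\theta)/m_{h_1}\, d\theta = B(h, h_1)$. Define the functions

$$
H_j(\theta) = l_y(\theta)\nu_{h_j}(\theta)/m_{h_j} - l_y(\theta)\nu_{h_1}(\theta)/m_{h_1}, \qquad j = 2, \ldots, k.
$$

We have $\int H_j(\theta)\, d\theta = 0$, or equivalently $E_{p_a}\big( H_j(\theta)/p_a(\theta) \big) = 0$, where the subscript indicates that the expectation is taken with respect to the mixture distribution $p_a$. Therefore, for every $\beta = (\beta_2, \ldots, \beta_k)$ the estimate

$$
\hat{I}_{h,\beta} = \frac{1}{n} \sum_{l=1}^{k} \sum_{i=1}^{n_l}
\frac{l_y(\theta_i^{(l)})\nu_h(\theta_i^{(l)})/m_{h_1} - \sum_{j=2}^{k} \beta_j \Big[ l_y(\theta_i^{(l)}) \big( \nu_{h_j}(\theta_i^{(l)})/m_{h_j} - \nu_{h_1}(\theta_i^{(l)})/m_{h_1} \big) \Big]}{\sum_{s=1}^{k} a_s \nu_{h_s,y}(\theta_i^{(l)})}
$$

is unbiased. As written, this estimate is not computable, because it involves the normalizing constants $m_{h_j}$, which are unknown, and also the likelihood $l_y(\theta)$, which may not be available. We rewrite it in computable form as

$$
\hat{I}_{h,\beta} = \frac{1}{n} \sum_{l=1}^{k} \sum_{i=1}^{n_l}
\frac{\nu_h(\theta_i^{(l)}) - \sum_{j=2}^{k} \beta_j \big[ \nu_{h_j}(\theta_i^{(l)})/d_j - \nu_{h_1}(\theta_i^{(l)}) \big]}{\sum_{s=1}^{k} a_s \nu_{h_s}(\theta_i^{(l)})/d_s}.
\tag{2.9}
$$

We would like to use the value of $\beta$, call it $\beta_{\text{opt}}$, that minimizes the variance of $\hat{I}_{h,\beta}$, but this $\beta_{\text{opt}}$ is generally unknown. As in Owen and Zhou (2000), we can do ordinary linear regression of $Y_{i,l}^{(h)}$ on predictors $Z_{i,l}^{(j)}$, where

$$
Y_{i,l}^{(h)} = \frac{\nu_h(\theta_i^{(l)})}{\sum_{s=1}^{k} a_s \nu_{h_s}(\theta_i^{(l)})/d_s}, \qquad
Z_{i,l}^{(j)} = \frac{\nu_{h_j}(\theta_i^{(l)})/d_j - \nu_{h_1}(\theta_i^{(l)})}{\sum_{s=1}^{k} a_s \nu_{h_s}(\theta_i^{(l)})/d_s}, \quad j = 2, \ldots, k,
\tag{2.10}
$$

and all required quantities are available. We then use the least squares estimate $\hat{\beta}$, i.e. the estimate of $I_h$ is $\hat{I}_{h,\hat{\beta}}$. It is easy to see that $\hat{I}_{h,\hat{\beta}}$ is simply $\hat{\beta}_0$, the estimate of the intercept term in the bigger regression problem where we include the intercept term, i.e.

$$
\hat{I}_{h,\hat{\beta}} = \hat{\beta}_0.
\tag{2.11}
$$

One can show that if the $k$ sequences are all iid sequences, then $\hat{\beta}$ converges to $\beta_{\text{opt}}$, and $\hat{I}_{h,\hat{\beta}}$ is guaranteed to be at least as efficient as the naive estimator. But when we have Markov chains this is not the case, especially if the chains mix at different rates. In Section 2.3 we consider the estimates $\hat{\beta}$ and $\hat{I}_{h,\hat{\beta}}$ directly. In particular, we give a precise definition of the nonrandom value $\beta$ that $\hat{\beta}$ is estimating (it is $\beta^{(h)}_{\lim}$ in equation (A.3)), and show that the effect of using $\hat{\beta}$ instead of $\beta$ is asymptotically negligible.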
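The two steps can be sketched on the toy family $q_h(t) = t^h$ on $(0, 1)$ from the discussion at the start of Section 2, for which $m_h = 1/(h + 1)$ and hence $m_h/m_{h_1} = (h_1 + 1)/(h + 1)$ exactly. This is a minimal sketch under stated assumptions rather than the paper's implementation: the unnormalized densities $q_h$ play the role of $l_y \nu_h$, iid draws replace Markov chain output, $k = 2$ with $h_1 = 1$ and $h_2 = 3$, and the names `q`, `draw` and `estimates` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)

def q(t, h):
    """Unnormalized density q_h(t) = t**h on (0, 1), with m_h = 1 / (h + 1)."""
    return t ** h

def draw(h, size):
    """iid draws from q_h / m_h (a Beta(h + 1, 1) law) by inverting the CDF."""
    return rng.uniform(size=size) ** (1.0 / (h + 1.0))

h1, h2 = 1.0, 3.0

# Step 1: large samples and the iteration (2.5) for d = m_{h2} / m_{h1}.
N1 = N2 = 200_000
s1, s2 = draw(h1, N1), draw(h2, N2)
A1, A2 = N1 / (N1 + N2), N2 / (N1 + N2)
d_hat = 1.0  # arbitrary starting value
for _ in range(50):
    num = np.mean(q(s1, h2) / (A1 * q(s1, h1) + A2 * q(s1, h2) / d_hat))
    den = np.mean(q(s2, h1) / (A1 * q(s2, h1) + A2 * q(s2, h2) / d_hat))
    d_hat = num / den

# Step 2: fresh, much smaller samples; d_hat is now treated as known.
n1 = n2 = 1_000
theta = np.concatenate([draw(h1, n1), draw(h2, n2)])
a1, a2 = n1 / (n1 + n2), n2 / (n1 + n2)
mix = a1 * q(theta, h1) + a2 * q(theta, h2) / d_hat   # denominator in (2.10)
Z = (q(theta, h2) / d_hat - q(theta, h1)) / mix        # control variate Z^(2)
X = np.column_stack([np.ones_like(Z), Z])              # intercept plus predictor

def estimates(h):
    """Return (plain estimate (2.1), regression estimate (2.11)) of m_h / m_{h1}."""
    Y = q(theta, h) / mix
    coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return Y.mean(), coef[0]

print("d_hat =", d_hat, "(exact value 0.5)")
for h in (1.0, 1.5, 2.0, 2.5, 3.0):
    plain, reg = estimates(h)
    print(h, plain, reg, "exact:", (h1 + 1.0) / (h + 1.0))
```

The denominator `mix` and the design matrix `X` are built once; each new value of $h$ only requires a new response vector. At $h = h_1$ and $h = h_2$ the regression estimate reproduces $1$ and `d_hat` up to rounding error, while the plain estimate still carries Monte Carlo noise.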

It is natural to consider the problem of estimating $\beta_{\text{opt}}$ in the Markov chains setting. Actually, before thinking about minimizing the variance of (2.9) with respect to $\beta$, one should first note the following. The constants $a_s = n_s/n$, $s = 1, \ldots, k$, used in forming the values $Y_{i,l}^{(h)}$ are sensible in the iid setting, but when dealing with Markov chains one would want to replace $n_s$ with an effective sample size, as discussed by Meng and Wong (1996). Therefore, the real problem is two-fold:

How do we find optimal (or good) values to use in place of the $a_s$'s in the $Y_{i,l}^{(h)}$'s?

Using the $Y_{i,l}^{(h)}$'s based on these values, how do we estimate the value of $\beta$ that minimizes the variance of (2.9)?

Both problems appear to be very difficult. Intuitively at least, the method described here should perform well if the mixing rates of the Markov chains are not very different. But in any case, the results in Section 2.3 show that, whether or not $\hat{I}_{h,\hat{\beta}}$ is optimal, it is a consistent and asymptotically normal estimator whose variance can be estimated consistently.

Note that if we do not use control variates, our estimate is just

$$
\frac{1}{n} \sum_{l=1}^{k} \sum_{i=1}^{n_l} \frac{\nu_h(\theta_i^{(l)})}{\sum_{s=1}^{k} a_s \nu_{h_s}(\theta_i^{(l)})/d_s},
$$

which is exactly (2.1).

Reduction in Variance from Using the Control Variates

Consider the linear combination of the responses $Y_{i,l}^{(h)}$ and predictors $Z_{i,l}^{(j)}$ given by

$$
L_1 = \sum_{j=2}^{k} a_j Z^{(j)} + Y^{(h)}.
$$

(We are dropping the subscripts $i, l$.) A calculation shows that if $h = h_1$ then $L_1 = 1$, meaning that we have an estimate with zero variance. Similarly, for $t = 2, \ldots, k$, let $L_t$ be the linear combination given by

$$
L_t = \sum_{j=2}^{k} a_j Z^{(j)} + (1/d_t)\, Y^{(h)} - Z^{(t)}.
$$

If $h = h_t$, then $L_t = 1$. Thus if $h \in \{h_1, \ldots, h_k\}$, our estimate of the Bayes factor $B(h, h_1)$ has zero variance. This is not surprising since, after all, we are assuming that we know $B(h_j, h_1)$, for $j = 1, \ldots, k$; however, this does indicate that if we use these control variates, our estimate will be very precise as long as $h$ is close to at least one of the $h_j$'s. This advantage does not exist if we use the plain estimate (2.1).
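The calculation behind the two identities is short. Written out with the shorthand $D(\theta) = \sum_{s=1}^{k} a_s \nu_{h_s}(\theta)/d_s$ for the common denominator in (2.10) (a symbol introduced here only for convenience), and using $d_1 = 1$ and $\sum_{j=2}^{k} a_j = 1 - a_1$:

```latex
% Case h = h_1, so that Y^{(h_1)} = \nu_{h_1}/D:
\begin{aligned}
\sum_{j=2}^{k} a_j Z^{(j)} + Y^{(h_1)}
  &= \frac{1}{D}\Big[\sum_{j=2}^{k} a_j \nu_{h_j}/d_j
       - (1 - a_1)\,\nu_{h_1} + \nu_{h_1}\Big] \\
  &= \frac{1}{D}\Big[\sum_{j=2}^{k} a_j \nu_{h_j}/d_j + a_1 \nu_{h_1}\Big]
   = \frac{D}{D} = 1 .
\end{aligned}

% Case h = h_t, t >= 2, so that Y^{(h_t)} = \nu_{h_t}/D:
\begin{aligned}
\sum_{j=2}^{k} a_j Z^{(j)} + (1/d_t)\,Y^{(h_t)} - Z^{(t)}
  &= \frac{1}{D}\Big[\sum_{j=2}^{k} a_j \nu_{h_j}/d_j - (1 - a_1)\,\nu_{h_1}
       + \nu_{h_t}/d_t - \nu_{h_t}/d_t + \nu_{h_1}\Big] \\
  &= \frac{1}{D}\Big[\sum_{j=2}^{k} a_j \nu_{h_j}/d_j + a_1 \nu_{h_1}\Big] = 1 .
\end{aligned}
```

In both cases the response is an exact linear function of the predictors, so the fitted intercept has no sampling variability.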
The intercept term in the regression of the $Y_{i,l}^{(h)}$'s on the $Z_{i,l}^{(j)}$'s is simply a linear combination of the form

$$
\hat{\beta}_0 = \sum_{l=1}^{k} \sum_{i=1}^{n_l} w_{i,l}\, Y_{i,l}^{(h)}.
\tag{2.12}
$$
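For ordinary least squares the weights in (2.12) are the first row of $(X^\top X)^{-1} X^\top$, where $X$ is the design matrix with a column of ones followed by the predictors. A small sketch with synthetic stand-ins for the $Z^{(j)}$'s and $Y^{(h)}$'s (the data and variable names are illustrative assumptions, not quantities from the paper):

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 500, 3

# Synthetic stand-ins: columns of Z play the role of Z^(2), ..., Z^(k).
Z = rng.normal(size=(n, k - 1))
X = np.column_stack([np.ones(n), Z])

# First row of the pseudoinverse (X'X)^{-1} X' gives the weights w in (2.12).
# It depends only on the predictors, so it is computed once.
w = np.linalg.pinv(X)[0]

# For each new h only the response vector changes; two synthetic responses here.
for seed in (10, 11):
    Y = np.random.default_rng(seed).normal(size=n) + Z @ np.array([0.5, -1.0])
    intercept_from_weights = w @ Y  # n multiply-adds per response
    intercept_from_lstsq = np.linalg.lstsq(X, Y, rcond=None)[0][0]
    print(intercept_from_weights, intercept_from_lstsq)
```

The two printed values agree up to rounding error, so a full regression fit per value of $h$ can be replaced by a single dot product.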

The $w_{i,l}$'s need to be computed just once, so for every new value of $h$ the calculation of $\hat{B}_{\text{reg}}(h, h_1)$ requires $n$ operations, which is the same as the number of operations needed to compute $\hat{B}(h, h_1)$ given by (2.1). To summarize, using control variates can greatly improve the accuracy of the estimates, at no (or trivial) increase in computational cost.

2.3 Asymptotic Normality and Estimation of the Variance

Here we state a result that says that under certain regularity conditions $\hat{B}_{\text{reg}}(h, h_1)$ and $\hat{B}(h, h_1)$ are asymptotically normal, and we show how to estimate the variance. As discussed in Section 2.2, we typically prefer that $\theta_i^{(l)}$, $i = 1, \ldots, n_l$, be an iid sample for each $l$. Nevertheless, our results pertain to the more general case where these samples arise from Markov chains. (As before, we assume that $n_l/n \to a_l \in (0, 1)$ and, when dealing with the asymptotics, strictly speaking we need to make a distinction between $n_l/n$ and its limit; however we write $a_l$ for both as this makes the bookkeeping easier, and blurring the distinction never creates a problem.)

Recall that $Y_{i,l}^{(h)}$ and $Z_{i,l}^{(j)}$, $j = 2, \ldots, k$, are defined in (2.10) and, for economy of notation, we define $Z_{i,l}^{(1)}$ to be 1 for all $i, l$. Let $R$ be the $k \times k$ matrix defined by

$$
R_{jj'} = E\Big( \sum_{l=1}^{k} a_l\, Z_{1,l}^{(j)} Z_{1,l}^{(j')} \Big), \qquad j, j' = 1, \ldots, k.
$$

We assume that for the Markov chains a strong law of large numbers holds (sufficient conditions are given, for example, in Theorem 2 of Athreya, Doss and Sethuraman (1996)), and we refer to the following conditions.

A1 For each $l = 1, \ldots, k$, the chain $\{\theta_i^{(l)}\}_{i=1}^{\infty}$ is geometrically ergodic.

A2 For each $l = 1, \ldots, k$, there exists $\epsilon > 0$ such that $E\big( |Y_{1,l}^{(h)}|^{2+\epsilon} \big) < \infty$.

A3 The matrix $R$ is nonsingular.

Theorem 1 Under conditions A1 and A2,

$$
n^{1/2} \big( \hat{B}(h, h_1) - B(h, h_1) \big) \xrightarrow{d} N\big( 0, \tau^2(h) \big),
$$

and under conditions A1-A3,

$$
n^{1/2} \big( \hat{B}_{\text{reg}}(h, h_1) - B(h, h_1) \big) \xrightarrow{d} N\big( 0, \sigma^2(h) \big),
$$

with $\tau^2(h)$ and $\sigma^2(h)$ given by equations (A.9) and (A.7) below.

The proof is given in the Appendix, which also explains how one can estimate the variances.
Theorem 1 assumes that the vector $d$ is known, either because it can be computed analytically or because the sample sizes from Stage 1 sampling are so large that this is effectively true. Buta (2009) has obtained a version of Theorem 1 that takes into account the variability from the first stage. Very briefly, if $N$ is the total sample size from the first stage, and if $N \to \infty$ and $n \to \infty$ in such a way that $n/N \to q \in [0, \infty)$, then

$$
n^{1/2} \big( \hat{B}(h, h_1) - B(h, h_1) \big) \xrightarrow{d} N\big( 0,\, q\tau_S^2(h) + \tau^2(h) \big),
$$

where $\tau_S^2(h)$ is a correction term that inflates the variance when the sample sizes in Stage 1 are finite. Also, she has a similar result for the estimate that uses control variates.

The variances of $\hat{B}_{\text{reg}}(h, h_1)$ and $\hat{B}(h, h_1)$ depend on the choice of the points $h_1, \ldots, h_k$, and finding good values of $k$ and $h_1, \ldots, h_k$ is in general a very difficult problem. In our experience, we have found that the following method works reasonably well. Having specified the range $H$, we select trial values $h_1, \ldots, h_k$, and in pilot runs plot the variance function $\tau^2(h)$, or $\sigma^2(h)$; then if we find a region where this is unacceptably large, we cover this region by moving some $h_l$'s closer to the region, or by simply adding new $h_l$'s in that region, which increases $k$.

3 Illustration

There are many classes of models to which the methodology developed in Section 2 applies. These include the usual parametric models, and also Bayesian nonparametric models involving mixtures of Dirichlet processes (Antoniak (1974)), in which one of the hyperparameters is the so-called total mass parameter; very briefly, this hyperparameter controls the extent to which the nonparametric model differs from a purely parametric model. Another application involves some problems in Bayesian variable selection, and this is described in Doss (2007). In this section we give an example involving the hierarchical Bayesian model described in Section 1. While models of much greater complexity can be considered, this relatively simple example has the advantage that the data can be visualized quickly, and the hyperparameters have a straightforward interpretation so that our analysis can be easily understood.

Meta-Analysis of Data on Non-Steroidal Anti-Inflammatory Drugs and Cancer Risk

Over the last decade, a large number of epidemiological studies have reported a link between intake of nonsteroidal anti-inflammatory drugs (NSAIDs) and cancer risk.
The studies, which ivolve differet cacers ad differet NSAIDs, strogly suggest that log-term itake of NSAIDs results i a sigificat reductio i cacer risk for all the major types: colo, breast, lug, ad prostate cacer. I Harris et al. (2005) we carry out a comprehesive review of the published scietific literature o NSAIDs ad cacer. Our review spas 90 papers, which ivestigate several NSAIDs ad te cacers, icludig the four major types. We have extracted data from these papers to make tables such as Table below, which pertais to aspiri ad colo cacer. The table gives, for each of 5 studies, the dose, reported risk ratio (for NSAID use vs. o-nsaid use), ad the log reported risk ratio together with a stadard error. (Harris et al. (2005) does ot give these stadard errors; it gives 95% cofidece itervals for the risk ratios, which ca be used to form 95% cofidece itervals for the log risk ratios, which i tur ca be used to determie the stadard errors.) See Harris et al. (2005) for more iformatio o this table ad refereces for the 5 studies. As ca be see from the table, there is some icosistecy i the studies, with some idicatig a large reductio i cacer risk, while others idicate a smaller reductio, i spite of a large dose. This is ot surprisig, sice there is heterogeeity i the patiet ad cotrol pools (characteristics such as age, ethicity, ad health status vary greatly across the studies). It is 0

therefore of interest to carry out a meta-analysis of these studies. Although there have been a few meta-analyses in the literature, these have been rather informal: all of them have used fixed effects models, and none have taken into account the dose information.

Publication          PPW    RR     LRR    SE(LRR)
Coogan, 00            4    0.50   -0.69   0.72
Friedman, 98          3    0.70   -0.36   0.068
Garcia-Rod., 01       7    0.60   -0.51   0.207
Giovannucci, 94       2    0.68   -0.39   0.54
Giovannucci, 95       2    0.56   -0.58   0.242
LaVecchia, 97         4    0.70   -0.36   0.82
Muscat, 94            3    0.64   -0.45   0.22
Paganini-Hill, 89     7    1.50    0.41   0.95
Peleg, 94             7    0.25   -1.39   0.547
Reeves, 96            2    0.79   -0.24   0.277
Rosenberg, 91         4    0.50   -0.69   0.240
Rosenberg, 98         4    0.70   -0.36   0.28
Schr. & Ev., 94       1    0.74   -0.30   0.202
Suh, 93               7    0.24   -1.43   0.374
Thun, 91              4    0.48   -0.73   0.234

Table 1: Fifteen studies on aspirin and colon cancer. Here, PPW represents the dose (number of 325 mg pills per week), RR is the observed risk ratio for aspirin vs. no aspirin, LRR is its logarithm, and SE(LRR) is an estimate of the standard error of LRR.

Assume temporarily that all studies involved the same dose. In a random-effects meta-analysis, for each study $j$ there is a latent variable, say $\psi_j$, that gives the true log risk ratio that would be obtained if the sample sizes for that study were infinite. One is then led to a model such as (1.2), in which the distribution of the study-specific effect is the normal distribution in (1.2b). Two modelling issues now arise. The first is that whereas the first normality assumption (line (1.2a)) is supported by a theoretical result (the approximate normality of functions of binomial estimates), the second normality assumption (line (1.2b)) is not, but is typically made for the sake of convenience. In fact, data for several of the other cancers include outliers (see Harris et al. (2005)), and therefore one may wish to use a $t$ distribution instead, this decision being made prior to looking at the colon cancer data. An important modelling issue is then to decide on the number of degrees of freedom.
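The step from a reported 95% confidence interval to a standard error on the log scale, mentioned in the description of the table, can be sketched as follows. This is only an illustration: the function name `log_rr_and_se` is hypothetical, the interval is assumed to be symmetric on the log scale (a Wald-type interval), and the numbers in the usage line are made up rather than taken from Harris et al. (2005).

```python
import math


def log_rr_and_se(rr, ci_low, ci_high, z=1.959964):
    """Return (log RR, SE of log RR) from a risk ratio and its 95% CI.

    Assumes the interval is symmetric on the log scale, so that its
    half-width equals z times the standard error.
    """
    if not (0 < ci_low <= rr <= ci_high):
        raise ValueError("expected 0 < ci_low <= rr <= ci_high")
    log_rr = math.log(rr)
    se = (math.log(ci_high) - math.log(ci_low)) / (2.0 * z)
    return log_rr, se


# Hypothetical interval, used only to exercise the function.
lrr, se = log_rr_and_se(0.50, 0.36, 0.70)
print(round(lrr, 2), round(se, 3))
```

If a published interval is not symmetric on the log scale, the two half-widths give slightly different standard errors, and averaging them as above is one of several reasonable conventions.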
The second issue is to determine the parameters of the normal / inverse gamma prior $\lambda_c$ in (1.2c). Here $c = (c_1, c_2, c_3, c_4)$, where $c_1, c_2, c_4 > 0$ and $c_3 \in \mathbb{R}$ and, under this prior, the distribution of $(\mu, \tau)$ is as follows: $\gamma = 1/\tau^2 \sim \mathrm{Gamma}(c_1, c_2)$ and, conditional on $\tau$, $\mu \sim N(c_3, c_4 \tau^2)$. This prior is commonly used because it is conjugate to the family $N(\mu, \tau^2)$. With appropriate hyperparameters, $\lambda_c$ can be made to be a flat ("noninformative") prior, and common recommendations are to take $c_1$ and $c_2$ to be very small (so that the gamma distribution on $\gamma$ is an approximation to $d\gamma/\gamma$, the improper Jeffreys prior), and to take $c_3 = 0$ and $c_4$ to be very large. Indeed, this is the recommendation made in the examples in the Bugs documentation and tutorials. Nevertheless, such a set of hyperparameter values is now sometimes criticized because, for small values of $c_1$ and $c_2$, the gamma distribution gives high probability to large values of $\gamma$ (equivalently, small values of $\tau$), which strongly encourages the $\psi_j$'s to all be equal to $\mu$. In other words, this causes excessive shrinkage. See for example Gelman (2006).

We wish to address both these issues and now also would like to take into account the dose. Let $L_j$ be the log of the observed risk ratio for study $j$. Let $x_j$ be the dose, defined as the number of pills per day (PPW/7), for study $j$. Consider the linear model

$$L_j = \alpha_j + \psi_j x_j + \varepsilon_j, \qquad j = 1, \ldots, m, \tag{3.1}$$

where $\alpha_j$ and $\psi_j$ are parameters specific to study $j$, and $\varepsilon_j$ is normally distributed with mean 0 and standard deviation $\sigma_j$ (given in Column 5 of Table 1). Note that $\alpha_j = 0$, since $x_j = 0$ implies that the treatment and control groups are identical, so that $L_j$ has mean 0. Thus, (3.1) is rewritten as $L_j = \psi_j x_j + \varepsilon_j$, from which we see that $\psi_j$ has the interpretation as the true log risk ratio if the treatment group had taken one pill per day. Thus, if we let $Y_j = L_j / x_j$, we have

$$Y_j = \psi_j + \varepsilon_j', \qquad j = 1, \ldots, m,$$

where $\varepsilon_j'$ is normal with mean 0 and standard deviation $\sigma_j' = \sigma_j / x_j$. We now consider the hierarchical model

$$Y_j \overset{\text{indep}}{\sim} \phi_{\psi_j, \sigma_j'}, \qquad j = 1, \ldots, m, \tag{3.2}$$

with the distribution of $\psi$ determined by the following:

$$\text{conditional on } \mu, \tau, \quad \psi_j \overset{\text{iid}}{\sim} t_{v, \mu, \tau}, \qquad j = 1, \ldots, m, \tag{3.3a}$$

$$(\mu, \tau) \sim \lambda_c. \tag{3.3b}$$

Letting $\theta = (\psi, \mu, \tau)$, the likelihood of $Y = (Y_1, \ldots, Y_m)$ is given by (3.2), and the prior on $\theta$ is given by (3.3), which is indexed by $h = (v, c)$. Loosely speaking, the value of $v$ determines the choice of the model, and the $c_i$'s determine the prior. We may therefore fix some value $h_1$ and consider the family of Bayes factors $B(h, h_1)$ as $h$ varies. We can estimate the family if, for values $h_j$, $j = 1, \ldots, k$, of the hyperparameter $h$, we have samples from the posterior distributions $\nu_{h_j, y}$ of the entire vector $\theta$.

We considered four different values of $c$ in which $c_3 = 0$, $c_4 = 1000$ were fixed (since there does not seem to be any controversy about these two parameters), and we took $c_1 = c_2$ and let the common value, denoted $\epsilon$, start at .005 and increase by factors of 5 up to .625. We took the values of the degrees of freedom parameter to be $v = 1, 4, 12$, for a total of 12 values of the hyperparameter $h$. For each of these 12 values we ran a Markov chain of length about 1 million and used these to calculate the vector of ratios of normalizing constants, via the method of Meng and Wong (1996) reviewed in Section 2.1. We then ran new Markov chains to produce a sample of size 100 from each of the 12 posteriors.
These samples, which were actually subsamples from longer chains (burn-in of 1000, then taking every 50th value), can be considered iid for practical purposes, and were used to calculate the estimate $\hat{B}_{\mathrm{reg}}(h, h_1)$ of Section 2.2. We took $h_1$ to be the specification corresponding to $v = 4$ and $\epsilon = .125$, since preliminary experiments indicated that this value of $h$ gave a relatively high value of $m_h$.

Figure 1 shows $\hat{B}_{\mathrm{reg}}(h, h_1)$ as $v$ and $\epsilon$ vary. The maximum standard error over the range of the graph was less than .01. The two plots in Figure 1 show different views of the same graph. From the left plot we see that a $t$ distribution works better than does a normal, with the optimal number of degrees of freedom being about 3 or 4. The plot also shows clearly that a very small number of degrees of freedom is not appropriate. The right plot shows that as $\epsilon \to 0$, the Bayes factor converges to 0 rapidly (in particular, fixing $v = 4$, the recommendation in the Bugs literature to use $\epsilon = .001$ gives a Bayes factor of about .036, and for $\epsilon = .0001$ it is .0037), giving strong evidence that very small values of $\epsilon$ should not be used. For some models the improper prior $d\gamma/\gamma$ gives rise to a proper posterior, and for others, including model (3.3b), it is possible to prove that the posterior is improper (Berger (1985,
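The design just described (a grid of hyperparameter values, plus the rescaling of each study to a dose of one pill per day) can be laid out in a few lines. The sketch below is illustrative only: the names `design` and `per_pill_scale` are hypothetical, and the three degrees-of-freedom values are taken here to be 1, 4 and 12, which gives the stated total of 12 hyperparameter values.

```python
from itertools import product

# Common value of c1 = c2: start at .005 and multiply by 5 three times.
eps_values = [0.005 * 5 ** i for i in range(4)]  # .005, .025, .125, .625
df_values = [1, 4, 12]                           # assumed values of v
c3, c4 = 0.0, 1000.0                             # held fixed

# One entry per hyperparameter value h = (v, c).
design = [
    {"v": v, "c": (eps, eps, c3, c4)}
    for v, eps in product(df_values, eps_values)
]


def per_pill_scale(lrr, se, ppw):
    """Rescale a log risk ratio and its SE to one pill per day (x = PPW/7)."""
    x = ppw / 7.0
    return lrr / x, se / x


print(len(design))
print(per_pill_scale(-0.69, 0.240, 4))
```

Dividing both the log risk ratio and its standard error by the dose keeps the ratio of estimate to standard error unchanged, so studies with small doses are rescaled without gaining precision.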

p. 87)), so that the pathological behavior resulting from $\epsilon \to 0$ should be expected. For some more complicated models, whether the posterior is proper or not is unknown (posterior propriety may even depend on the data values), and in these cases, plots such as those in Figure 1 may be useful because they may lead one to investigate a possible posterior impropriety.

[Figure 1: two perspective plots of the Bayes factor (vertical axis) against df and epsilon.]

Figure 1: Model assessment for the aspirin and colon cancer data. The Bayes factor as a function of $v$, the number of degrees of freedom in (3.3a), and $\epsilon$, the common value of $c_1$ and $c_2$ in the gamma prior in (3.3b), is shown from two different angles. Here the baseline value of the hyperparameter corresponds to $v = 4$ and $\epsilon = .125$.

The choice of hyperparameter $h$ does have an influence on our inference. Let $\psi_{\mathrm{new}}$ denote the latent variable for a future study, a quantity of interest in meta-analysis. We considered two specifications of $h$: $(v = \infty, \epsilon = .001)$ and $(v = 4, \epsilon = .625)$. The first choice may be considered a default choice, and the second a choice guided by consideration of the plot of Bayes factors. For the choice $(v = \infty, \epsilon = .001)$, we have $E(\psi_{\mathrm{new}}) = -.95$ and $P(\psi_{\mathrm{new}} > 0) = .04$, whereas for $(v = 4, \epsilon = .625)$, we have $E(\psi_{\mathrm{new}}) = -.87$ and $P(\psi_{\mathrm{new}} > 0) = .08$. In other words, the $t$ model suggests a stronger aspirin effect, but the inference is more tentative.

Remarks on Computation and Accuracy

We now give an idea of how the computational effort is distributed. The Stage 1 samples (12 chains, each of length $10^6$) took 83 seconds to generate on a 3.8 GHz dual core P4 running Linux. By contrast, the plot in Figure 1, which involves a grid of 4000 points, took one hour to compute, in spite of the fact that it is based on a total sample size of only 1200, for what must be considered a rather simple model. Clearly using a very large value of $n$ is not feasible, and this is why we need to run the preliminary chains in order to get a very accurate estimate of $d$.

We now illustrate the extent to which $\hat{B}_{\mathrm{reg}}(h, h_1)$ is more efficient than $\hat{B}(h, h_1)$. Figure 2 gives a plot of the ratio of the variances of the two estimates as $h$ varies. Both $\hat{B}_{\mathrm{reg}}(h, h_1)$ and $\hat{B}(h, h_1)$ use the design discussed earlier, which involves a total sample size of 1200. This figure is obtained by generating 100 Monte Carlo replicates of $\hat{B}_{\mathrm{reg}}(h, h_1)$ and $\hat{B}(h, h_1)$ for

each $h$ in a grid somewhat coarser than the one used in Figure 1. As can be seen from the figure, the ratio is about .01 over most of the grid, and is less than .1 over the entire grid, with the exception of the values of $h$ for which df = .5 (for those values, the Bayes factor itself is very small, and the two estimates each have minuscule variances). We also note that the ratio is exactly 0 at the design points.

[Figure 2: perspective plot of the ratio of variances (vertical axis) against df and epsilon.]

Figure 2: Improvement in accuracy that results when we use control variates. The plot gives $\mathrm{Var}(\hat{B}_{\mathrm{reg}}(h, h_1)) / \mathrm{Var}(\hat{B}(h, h_1))$ as $h$ ranges over the same region as in Figure 1.

4 Discussion

When faced with uncertainty regarding the choice of hyperparameters, one approach is to put a prior on the hyperparameters, that is, add one layer to the hierarchical model. This approach, which goes under the general name of Bayesian model averaging, can be very useful. On the other hand, there are several good reasons why one may want to avoid it. First, the choice of prior on the hyperparameters can have a great influence on the analysis. One is tempted to use a flat prior but, as is well known, for certain parameters such a prior can in fact be very informative. In the illustration of Section 3, a flat prior on the degrees of freedom parameter in effect skews the results in favor of the normal distribution. Second, one may wish to do Bayesian model selection, as opposed to Bayesian model averaging, because the subsequent inference is then more parsimonious and interpretable. These points are discussed more fully in George and Foster (2000) and Robert (2001, Chapter 7).

There are a number of papers that deal with estimation of Bayes factors via MCMC. Chen, Shao and Ibrahim (2000, Chapter 5) and Han and Carlin (2001) give an overview of much of this work, and we mention also the more recent paper by Meng and Schilling (2002), which is directly relevant.
Most of these papers deal with the case of a single Bayes factor, whereas the present paper is concerned with estimation of large families of Bayes factors. Nevertheless, in principle, any of the methods in this literature can be applied to estimate the vector $d$.

Especially important is Kong et al. (2003), whose work we describe in the notation of the present paper. The situation considered there has $k$ known unnormalized densities $q_{h_1}, \ldots, q_{h_k}$, with unknown normalizing constants $m_{h_1}, \ldots, m_{h_k}$, respectively, and for $l = 1, \ldots, k$, there is an iid sample $\theta_1^{(l)}, \ldots, \theta_{n_l}^{(l)}$ from $q_{h_l}/m_{h_l}$. The problem is the simultaneous estimation of all ratios $m_{h_l}/m_{h_s}$, $l, s = 1, \ldots, k$, or equivalently, all ratios $d_l = m_{h_l}/m_{h_1}$, $l = 1, \ldots, k$. In a certain framework, they show that the maximum likelihood estimate (MLE) of $d$ is obtained by solving the system of $k$ equations

$$\hat{d}_r = \frac{1}{n} \sum_{l=1}^{k} \sum_{i=1}^{n_l} \frac{q_{h_r}(\theta_i^{(l)})}{\sum_{s=1}^{k} a_s \, q_{h_s}(\theta_i^{(l)}) / \hat{d}_s}, \qquad r = 1, \ldots, k. \tag{4.1}$$

To put this in our context, let $q_{h_l}(\theta) = \ell_y(\theta) \nu_{h_l}(\theta)$, $l = 1, \ldots, k$, and suppose we have iid samples from the normalized $q_{h_l}$'s. We may imagine that we have $k + 1$ unnormalized densities $q_{h_1}, \ldots, q_{h_k}, q_h$, with a sample of size 0 from the normalized $q_h$. The estimate of $m_h/m_{h_1}$ then becomes

$$\frac{1}{n} \sum_{l=1}^{k} \sum_{i=1}^{n_l} \frac{\nu_h(\theta_i^{(l)})}{\sum_{s=1}^{k} a_s \, \nu_{h_s}(\theta_i^{(l)}) / \hat{d}_s}.$$

We recognize this as precisely $\hat{B}(h, h_1)$ in (2.1), except that $\hat{d}_1, \ldots, \hat{d}_k$ are formed by solving (4.1), i.e., are estimated from the sequences $\theta_1^{(l)}, \ldots, \theta_{n_l}^{(l)}$, $l = 1, \ldots, k$. Thus, $\hat{B}(h, h_1)$ is the same as the estimate of Kong et al. (2003), except that the vector $d$ is precomputed based on previously run very long chains. Therefore, it is perhaps natural to consider estimating $d$ on the basis of these very long Markov chains using the method of Kong et al. (2003) (as opposed to the method discussed in Section 2.1), and we now discuss this possibility.

In their approach, Kong et al. (2003) assume that the $q_{h_l}$'s are densities with respect to a dominating measure $\mu$, and they obtain the MLE $\hat{\mu}$ of $\mu$ ($\hat{\mu}$ is given up to a multiplicative constant). They can then estimate the ratios $m_{h_l}/m_{h_s}$ since the normalizing constants are known functions of $\mu$. Their approach works if for each $l$, $\theta_1^{(l)}, \ldots, \theta_{n_l}^{(l)}$ is an iid sample.
Although they extend it to the case where these are a Markov chain, in the extension $q_{h_l}$ is replaced by the Markov transition functions $P_{h_l}(\cdot, \theta_i^{(l)})$, $i = 0, \ldots, n_l$, assumed absolutely continuous with respect to a sigma-finite measure $\mu$ (precluding Metropolis-Hastings chains), and if each of these is known only up to a normalizing constant, as is typically the case, then the system (4.1) becomes a system of $k$ equations. This is prohibitively difficult to solve.

Tan (2004) shows how control variates can be incorporated in the likelihood framework of Kong et al. (2003). When there are $r$ functions $H_j$, $j = 1, \ldots, r$, for which we know that $\int H_j \, d\mu = 0$, the parameter space is restricted to the set of all sigma-finite measures satisfying these $r$ constraints. For the case where $\theta_i^{(l)}$, $i = 1, \ldots, n_l$, are iid for each $l = 1, \ldots, k$, he obtains the MLE of $\mu$ in this reduced parameter space, and therefore a corresponding estimate of $m_h/m_{h_1}$, and shows that this approach gives estimates that are asymptotically equivalent to estimates that use control variates via regression. His estimate can still be used when we have Markov chain draws, but is no longer optimal, for the same reason that the estimate in the present paper is not optimal (see the discussion in the middle of Section 2.2). The optimal estimator is obtained by using the likelihood that arises from the Markov chain structure, and in the case of general Markov chains its calculation is computationally very demanding. See
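A system of the form (4.1) determines $d$ only up to a common scale, and one simple way to solve it numerically is to iterate the right-hand side and renormalize so that the first component equals 1. The sketch below does this on a toy problem in which the true ratios are known. Everything in it is an assumption made for illustration (the function name `solve_ratios`, the Gaussian densities, the sample sizes); it is not the procedure used in the paper, which estimates $d$ from long preliminary chains.

```python
import math
import random


def solve_ratios(samples, log_q, n_iter=200):
    """Fixed-point iteration for a system of the form (4.1), with d[0] = 1.

    samples[l] holds iid draws from the l-th normalised density;
    log_q[s](theta) is the log of the s-th unnormalised density.
    """
    k = len(samples)
    n = sum(len(s) for s in samples)
    a = [len(s) / n for s in samples]
    pooled = [theta for s in samples for theta in s]
    # Evaluate every unnormalised density at every pooled draw once.
    q = [[math.exp(log_q[s](theta)) for s in range(k)] for theta in pooled]
    d = [1.0] * k
    for _ in range(n_iter):
        new_d = [0.0] * k
        for row in q:
            denom = sum(a[s] * row[s] / d[s] for s in range(k))
            for r in range(k):
                new_d[r] += row[r] / denom
        new_d = [v / n for v in new_d]
        d = [v / new_d[0] for v in new_d]  # fix the arbitrary scale
    return d


random.seed(0)
scales = [1.0, 2.0]  # toy choice: q_s(t) = exp(-t^2 / (2 s^2)), m_s = s * sqrt(2 pi)
log_q = [lambda t, s=s: -0.5 * (t / s) ** 2 for s in scales]
samples = [[random.gauss(0.0, s) for _ in range(2000)] for s in scales]
d_hat = solve_ratios(samples, log_q)
print(d_hat)  # the true ratio m_2 / m_1 is scales[1] / scales[0]
```

With only two well-overlapping densities the iteration settles quickly; with many densities or poor overlap, more iterations (or a convergence check on successive iterates) would be needed.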

Tan (2006, 2008) for advances in this direction. Tan (2004) also obtains results on asymptotic normality of his estimators that are valid when we have the iid structure, but it should be possible to obtain versions for Markov chain draws, under regularity conditions such as those of the present paper.

Owen and Zhou (2000) use control variates in conjunction with importance sampling. In the notation above, they assume that the $q_{h_l}$'s are normalized densities, and that for every $l$, they have an iid sample of size $n_l$ from $q_{h_l}$. As before, let $a_l = n_l / \sum_{s=1}^{k} n_s$. Because these are normalized densities, each of the $k$ variables $q_{h_l}(\theta) / \bigl( \sum_{s=1}^{k} a_s q_{h_s}(\theta) \bigr)$ has expectation 1 under the distribution $\sum_{s=1}^{k} a_s q_{h_s}$, and so can be used as control variates. Their method does not work directly in our situation because the $q_{h_l} = \ell_y(\theta) \nu_{h_l}(\theta)$ are unnormalized densities. It is therefore natural to consider estimating the normalizing constants of $q_{h_l}$, $l = 1, \ldots, k$, from the Stage 1 runs. Indeed, there are methods for doing this from Markov chain output (Chib (1995), Chib and Jeliazkov (2001)). However, estimation of ratios of normalizing constants tends to be far more stable than estimation of the normalizing constants themselves. For example, if we wish to estimate $m_h/m_{h_1}$, then a procedure that involves estimating $m_h$ and $m_{h_1}$ separately and then taking the ratio is not guaranteed to provide accurate estimates even when $h = h_1$, whereas in this case the simple estimate (1.1) gives an unbiased estimate with zero variance. Moreover, if we run Markov chains for models indexed by $h_1, \ldots, h_k$, the estimate of a single ratio $m_{h_s}/m_{h_1}$ using the method of Section 2.1 makes use of all the chains, providing greater stability. The control variates that we use are essentially equivalent to those used by Owen and Zhou (2000), but their computation requires only knowledge of the vector $d$.

R functions for producing the estimates $\hat{B}(h, h_1)$ and $\hat{B}_{\mathrm{reg}}(h, h_1)$, and plots such as those in Figure 1 for the hierarchical model (3.2)–(3.3) and relatives, are available from the author upon request.

Acknowledgements

I thank two referees for their careful reading and Eugenia Buta for helpful comments. I am especially grateful to an associate editor for a very insightful and thorough report, and for suggestions that led to several improvements in the paper.

Appendix: Proof of Theorem 1

Under Conditions A1 and A2 we have a central limit theorem for the averages $\frac{1}{n_l} \sum_{i=1}^{n_l} Z_{i,l}^{(j)}$ and $\frac{1}{n_l} \sum_{i=1}^{n_l} Y_{i,l}^{(h)}$ for $l = 1, \ldots, k$ and $j = 2, \ldots, k$ (corollary to Theorem 18.5.3 of Ibragimov and Linnik (1971)); however, there are other sets of conditions that could be used. For example, the condition $\epsilon > 0$ is not needed, i.e., a finite second moment suffices, if the chain is reversible (Roberts and Rosenthal (1997)), for instance if the chain is a Metropolis algorithm or a two-cycle Gibbs sampler, or if it is uniformly ergodic (Cogburn (1972)). These are the most commonly used assumptions, but for a fuller discussion of central limit theorems for Markov chains see Chan and Geyer (1994).
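The remark that a ratio estimate has zero variance at $h = h_1$ can be checked on a toy conjugate model. The sketch assumes that the simple estimate is the average of prior ratios $\nu_h(\theta_i)/\nu_{h_1}(\theta_i)$ over draws from the posterior under $h_1$; the model ($y \mid \theta \sim N(\theta, 1)$, $\theta \sim N(0, h^2)$), the observed value, and all function names are hypothetical and chosen only because the exact Bayes factor is then available in closed form.

```python
import math
import random


def normal_pdf(x, mean, sd):
    """Density of N(mean, sd^2) at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def simple_bf_estimate(draws, h, h1):
    """Average of prior ratios nu_h / nu_h1 over draws from the h1 posterior."""
    total = sum(normal_pdf(t, 0.0, h) / normal_pdf(t, 0.0, h1) for t in draws)
    return total / len(draws)


def exact_bf(y, h, h1):
    """Exact m_h / m_h1: the marginal of y under prior sd h is N(0, 1 + h^2)."""
    return (normal_pdf(y, 0.0, math.sqrt(1 + h * h))
            / normal_pdf(y, 0.0, math.sqrt(1 + h1 * h1)))


random.seed(1)
y, h1 = 0.8, 1.0                       # toy data point and baseline prior sd
post_var = h1 * h1 / (1 + h1 * h1)     # posterior variance under h1
draws = [random.gauss(y * post_var, math.sqrt(post_var)) for _ in range(20000)]

at_baseline = simple_bf_estimate(draws, h1, h1)   # every term equals 1
away = simple_bf_estimate(draws, 1.5, h1)
print(at_baseline, away, exact_bf(y, 1.5, h1))
```

At the baseline every summand is identically 1, so the estimate carries no Monte Carlo error at all, whereas two separately estimated marginal likelihoods would each carry their own error.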

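We first prove the assertion regarding $\hat{B}_{\mathrm{reg}}(h, h_1)$. Let $Z$ be the $n \times k$ matrix whose transpose is

$$Z^{\top} = \begin{pmatrix} 1 & \cdots & 1 & 1 & \cdots & 1 & \cdots & 1 & \cdots & 1 \\ Z_{1,1}^{(2)} & \cdots & Z_{n_1,1}^{(2)} & Z_{1,2}^{(2)} & \cdots & Z_{n_2,2}^{(2)} & \cdots & Z_{1,k}^{(2)} & \cdots & Z_{n_k,k}^{(2)} \\ \vdots & & \vdots & \vdots & & \vdots & & \vdots & & \vdots \\ Z_{1,1}^{(k)} & \cdots & Z_{n_1,1}^{(k)} & Z_{1,2}^{(k)} & \cdots & Z_{n_2,2}^{(k)} & \cdots & Z_{1,k}^{(k)} & \cdots & Z_{n_k,k}^{(k)} \end{pmatrix},$$

and let

$$Y = Y^{(h)} = \bigl( Y_{1,1}^{(h)}, \ldots, Y_{n_1,1}^{(h)}, Y_{1,2}^{(h)}, \ldots, Y_{n_2,2}^{(h)}, \ldots, Y_{1,k}^{(h)}, \ldots, Y_{n_k,k}^{(h)} \bigr)^{\top}.$$

Note: we sometimes suppress the superscript $h$ in order to lighten the notation. The least squares estimate is $(\hat{\beta}_0^{(h)}, \hat{\beta}^{(h)})^{\top} = (Z^{\top} Z)^{-1} Z^{\top} Y$, assuming that $Z^{\top} Z$ is nonsingular. (Here, $\hat{\beta}^{(h)} = (\hat{\beta}_2^{(h)}, \ldots, \hat{\beta}_k^{(h)})$.) Note that

$$\frac{1}{n} \sum_{l=1}^{k} \sum_{i=1}^{n_l} Z_{i,l}^{(j)} Z_{i,l}^{(j')} = \sum_{l=1}^{k} \frac{n_l}{n} \cdot \frac{1}{n_l} \sum_{i=1}^{n_l} Z_{i,l}^{(j)} Z_{i,l}^{(j')} \xrightarrow{\text{a.s.}} R_{j,j'}$$

by the strong law of large numbers (clearly the $Z_{i,l}^{(j)}$ are bounded random variables). Therefore $Z^{\top} Z / n \xrightarrow{\text{a.s.}} R$, so by A3 we have

$$n (Z^{\top} Z)^{-1} \xrightarrow{\text{a.s.}} R^{-1} \tag{A.1}$$

and, in particular, with probability one, $Z^{\top} Z$ is nonsingular for large $n$. We have

$$\frac{Z^{\top} Y}{n} = \begin{pmatrix} \sum_{l=1}^{k} \frac{1}{n} \sum_{i=1}^{n_l} Z_{i,l}^{(1)} Y_{i,l} \\ \vdots \\ \sum_{l=1}^{k} \frac{1}{n} \sum_{i=1}^{n_l} Z_{i,l}^{(k)} Y_{i,l} \end{pmatrix} \xrightarrow{\text{a.s.}} \begin{pmatrix} \sum_{l=1}^{k} a_l E\bigl( Z_{1,l}^{(1)} Y_{1,l} \bigr) \\ \vdots \\ \sum_{l=1}^{k} a_l E\bigl( Z_{1,l}^{(k)} Y_{1,l} \bigr) \end{pmatrix}. \tag{A.2}$$

Let $v = (v_1, \ldots, v_k)^{\top}$ be the vector on the right side of (A.2). From (A.1) and (A.2) we have

$$\bigl( \hat{\beta}_0^{(h)}, \hat{\beta}^{(h)} \bigr)^{\top} \xrightarrow{\text{a.s.}} \bigl( \beta_{0,\lim}^{(h)}, \beta_{\lim}^{(h)} \bigr)^{\top} = R^{-1} v. \tag{A.3}$$

Consider (2.9), using $\beta_{\lim}^{(h)}$ for $\beta$. We have

$$\hat{I}_{h, \beta_{\lim}^{(h)}} = \frac{1}{n} \sum_{l=1}^{k} \sum_{i=1}^{n_l} \Bigl( Y_{i,l} - \sum_{j=2}^{k} \beta_{j,\lim}^{(h)} Z_{i,l}^{(j)} \Bigr) = \sum_{l=1}^{k} a_l \Bigl( \frac{1}{n_l} \sum_{i=1}^{n_l} U_{i,l} \Bigr), \tag{A.4}$$

where $U_{i,l} = Y_{i,l} - \sum_{j=2}^{k} \beta_{j,\lim}^{(h)} Z_{i,l}^{(j)}$. Let $\mu_l(h) = E(U_{1,l})$. By A2, $E(|U_{1,l}|^{2+\epsilon}) < \infty$ and therefore, by A1, we have

$$n_l^{1/2} \Bigl( \frac{1}{n_l} \sum_{i=1}^{n_l} U_{i,l} - \mu_l(h) \Bigr) \xrightarrow{d} N\bigl( 0, \sigma_l^2(h) \bigr),$$

where

$$\sigma_l^2(h) = \mathrm{Var}(U_{1,l}) + 2 \sum_{g=1}^{\infty} \mathrm{Cov}(U_{1,l}, U_{1+g,l}). \tag{A.5}$$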
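The role of the least squares intercept as a control-variate estimate can be seen in a much simpler setting than the one treated in the proof. The sketch below uses a single control variate and iid draws rather than several Markov chains, so it illustrates only the regression idea and not the estimator $\hat{B}_{\mathrm{reg}}$ itself; the target $E(e^U)$ with $U$ uniform on $(0, 1)$, the control variate $U - 1/2$, and the function name are all hypothetical.

```python
import math
import random
import statistics


def regression_cv_estimate(y, z):
    """Intercept of the least squares fit of y on (1, z); z has known mean 0."""
    ybar, zbar = statistics.fmean(y), statistics.fmean(z)
    szz = sum((zi - zbar) ** 2 for zi in z)
    szy = sum((zi - zbar) * (yi - ybar) for zi, yi in zip(z, y))
    beta = szy / szz
    return ybar - beta * zbar


random.seed(2)
plain, adjusted = [], []
for _ in range(200):                       # Monte Carlo replicates
    u = [random.random() for _ in range(500)]
    y = [math.exp(ui) for ui in u]         # target: E[Y] = e - 1
    z = [ui - 0.5 for ui in u]             # control variate with mean 0
    plain.append(statistics.fmean(y))
    adjusted.append(regression_cv_estimate(y, z))

ratio = statistics.variance(adjusted) / statistics.variance(plain)
print(statistics.fmean(adjusted), ratio)
```

The printed ratio plays the same role as the quantity plotted in Figure 2: the smaller it is, the more the regression adjustment helps.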

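Since the Markov chains are independent, this implies that

$$n^{1/2} \Bigl( \hat{I}_{h, \beta_{\lim}^{(h)}} - \sum_{l=1}^{k} a_l \mu_l(h) \Bigr) \xrightarrow{d} N\bigl( 0, \sigma^2(h) \bigr), \tag{A.6}$$

where

$$\sigma^2(h) = \sum_{l=1}^{k} a_l \sigma_l^2(h). \tag{A.7}$$

Note that $(1/n) \sum_{l=1}^{k} \sum_{i=1}^{n_l} Y_{i,l} \xrightarrow{\text{a.s.}} B(h, h_1)$ and $(1/n) \sum_{l=1}^{k} \sum_{i=1}^{n_l} Z_{i,l}^{(j)} \xrightarrow{\text{a.s.}} 0$, $j = 2, \ldots, k$. Therefore, from the first equation in (A.4), $\hat{I}_{h, \beta_{\lim}^{(h)}} \xrightarrow{\text{a.s.}} B(h, h_1)$ which, together with (A.6), proves that $\sum_{l=1}^{k} a_l \mu_l(h) = B(h, h_1)$.

To conclude the proof, we consider the difference between $\hat{I}_{h, \hat{\beta}^{(h)}}$ and $\hat{I}_{h, \beta_{\lim}^{(h)}}$. Let $e(j, l) = E\bigl( Z_{1,l}^{(j)} \bigr)$. We have

$$n^{1/2} \bigl( \hat{I}_{h, \hat{\beta}^{(h)}} - \hat{I}_{h, \beta_{\lim}^{(h)}} \bigr) = n^{1/2} \sum_{j=2}^{k} (\beta_{j,\lim} - \hat{\beta}_j) \Bigl( \frac{1}{n} \sum_{l=1}^{k} \sum_{i=1}^{n_l} Z_{i,l}^{(j)} \Bigr) = \sum_{j=2}^{k} (\beta_{j,\lim} - \hat{\beta}_j) \sum_{l=1}^{k} a_l^{1/2} \Bigl( \frac{\sum_{i=1}^{n_l} \bigl[ Z_{i,l}^{(j)} - e(j, l) \bigr]}{n_l^{1/2}} \Bigr), \tag{A.8}$$

where the second equality in (A.8) follows from the fact that $\sum_{l=1}^{k} a_l e(j, l) = 0$. Now, for each $l = 1, \ldots, k$ and $j = 2, \ldots, k$, by A1, $\sum_{i=1}^{n_l} \bigl[ Z_{i,l}^{(j)} - e(j, l) \bigr] / n_l^{1/2}$ is asymptotically normal, so in particular is bounded in probability. Together with (A.3), this implies that the right side of (A.8) converges in probability to 0. We conclude that

$$n^{1/2} \bigl( \hat{B}_{\mathrm{reg}}(h, h_1) - B(h, h_1) \bigr) \xrightarrow{d} N\bigl( 0, \sigma^2(h) \bigr).$$

The proof for $\hat{B}(h, h_1)$ is simpler. Let $f_l = E(Y_{1,l})$, and note that $\sum_{l=1}^{k} a_l f_l = B(h, h_1)$. We have

$$n^{1/2} \bigl( \hat{B}(h, h_1) - B(h, h_1) \bigr) = n^{1/2} \Bigl( \frac{1}{n} \sum_{l=1}^{k} \sum_{i=1}^{n_l} Y_{i,l} - \sum_{l=1}^{k} a_l f_l \Bigr) = \sum_{l=1}^{k} a_l^{1/2} \, \frac{\sum_{i=1}^{n_l} (Y_{i,l} - f_l)}{n_l^{1/2}} \xrightarrow{d} N\bigl( 0, \tau^2(h) \bigr),$$

in which

$$\tau^2(h) = \sum_{l=1}^{k} a_l \tau_l^2(h), \quad \text{where} \quad \tau_l^2(h) = \mathrm{Var}(Y_{1,l}) + 2 \sum_{g=1}^{\infty} \mathrm{Cov}(Y_{1,l}, Y_{1+g,l}). \tag{A.9}$$

The variance term $\sigma_l^2(h)$ in (A.5) is the asymptotic variance of the standardized version of the average $\frac{1}{n_l} \sum_{i=1}^{n_l} U_{i,l}$. If we knew the $U_{i,l}$'s, we could estimate $\sigma_l^2(h)$ by estimating the initial segment of the series in (A.5) using standard methods from time series (see Geyer (1992)) or via batching. Now the $U_{i,l}$'s involve $\beta_{\lim}^{(h)}$, which is unknown, but our proof indicates that the effect of using $\hat{\beta}^{(h)}$ instead of $\beta_{\lim}^{(h)}$ in the expression for $U_{i,l}$ is asymptotically negligible.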
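Estimation of an asymptotic variance of the form (A.5) "via batching" can be sketched as follows. The example is an assumption-laden illustration: it uses a stationary AR(1) chain, for which the series in (A.5) sums to $1/(1 - \phi)^2$, rather than output from the chains of Section 3, and the function name and batch size are hypothetical choices.

```python
import random
import statistics


def batch_means_variance(x, batch_size):
    """Batch means estimate of the asymptotic variance of sqrt(n) * mean(x)."""
    n_batches = len(x) // batch_size
    means = [
        statistics.fmean(x[b * batch_size:(b + 1) * batch_size])
        for b in range(n_batches)
    ]
    # Each batch mean has variance close to sigma^2 / batch_size.
    return batch_size * statistics.variance(means)


random.seed(3)
phi, x, chain = 0.5, 0.0, []
for _ in range(200_000):
    x = phi * x + random.gauss(0.0, 1.0)   # AR(1) with unit innovation variance
    chain.append(x)

sigma2_hat = batch_means_variance(chain, batch_size=500)
sigma2_true = 1.0 / (1.0 - phi) ** 2       # Var + 2 * sum of autocovariances
print(sigma2_hat, sigma2_true)
```

The batch size has to be large relative to the autocorrelation time of the chain for the batch means to be nearly uncorrelated, while leaving enough batches for the variance of the batch means to be estimated stably.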

References

Antoniak, C. E. (1974). Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. The Annals of Statistics 2 1152–1174.

Athreya, K. B., Doss, H. and Sethuraman, J. (1996). On the convergence of the Markov chain simulation method. The Annals of Statistics 24 69–100.

Berger, J. O. (1985). Statistical Decision Theory and Bayesian Analysis (Second Edition). Springer-Verlag, New York.

Buta, E. (2009). Computational Methods in Bayesian Sensitivity Analysis. Ph.D. thesis, University of Florida.

Chan, K. S. and Geyer, C. J. (1994). Comment on "Markov chains for exploring posterior distributions". The Annals of Statistics 22 1747–1758.

Chen, M.-H., Shao, Q.-M. and Ibrahim, J. G. (2000). Monte Carlo Methods in Bayesian Computation. Springer-Verlag, New York.

Chib, S. (1995). Marginal likelihood from the Gibbs output. Journal of the American Statistical Association 90 1313–1321.

Chib, S. and Jeliazkov, I. (2001). Marginal likelihood from the Metropolis-Hastings output. Journal of the American Statistical Association 96 270–281.

Cogburn, R. (1972). The central limit theorem for Markov processes. In Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, Volume 2. University of California Press, Berkeley.

Doss, H. (2007). Bayesian model selection: Some thoughts on future directions. Statistica Sinica 17 413–421.

Gelman, A. (2006). Prior distributions for variance parameters in hierarchical models. Bayesian Analysis 1 515–534.

George, E. I. and Foster, D. P. (2000). Calibration and empirical Bayes variable selection. Biometrika 87 731–747.

Geyer, C. J. (1992). Practical Markov chain Monte Carlo (Disc: p483–503). Statistical Science 7 473–483.

Han, C. and Carlin, B. P. (2001). Markov chain Monte Carlo methods for computing Bayes factors: A comparative review. Journal of the American Statistical Association 96 1122–1132.

Harris, R., Beebe-Donk, J., Doss, H. and Burr, D. (2005). Aspirin, Ibuprofen and other nonsteroidal anti-inflammatory drugs in cancer prevention: A critical review of non-selective COX-2 blockade. Oncology Reports 13 559–584.

Hobert, J. P. and Robert, C. P. (2004). A mixture representation of π with applications in Markov chain Monte Carlo and perfect sampling. The Annals of Applied Probability 14 1295–1305.

Ibragimov, I. A. and Linnik, Y. V. (1971). Independent and Stationary Sequences of Random Variables. Wolters-Noordhoff, Groningen.

Kass, R. E. and Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association 90 773–795.

Kong, A., McCullagh, P., Meng, X.-L., Nicolae, D. and Tan, Z. (2003). A theory of statistical models for Monte Carlo integration (with discussion). Journal of the Royal Statistical Society, Series B 65 585–618.

Meng, X.-L. and Schilling, S. (2002). Warp bridge sampling. Journal of Computational and Graphical Statistics 11 552–586.

Meng, X.-L. and Wong, W. H. (1996). Simulating ratios of normalizing constants via a simple identity: A theoretical exploration. Statistica Sinica 6 831–860.

Owen, A. and Zhou, Y. (2000). Safe and effective importance sampling. Journal of the American Statistical Association 95 135–143.

Robert, C. P. (2001). The Bayesian Choice: from Decision-Theoretic Foundations to Computational Implementation. Springer-Verlag, New York.

Roberts, G. O. and Rosenthal, J. S. (1997). Geometric ergodicity and hybrid Markov chains. Electronic Communications in Probability 2 13–25.

Tan, Z. (2004). On a likelihood approach for Monte Carlo integration. Journal of the American Statistical Association 99 1027–1036.

Tan, Z. (2006). Monte Carlo integration with acceptance-rejection. Journal of Computational and Graphical Statistics 15 735–752.

Tan, Z. (2008). Monte Carlo integration with Markov chain. Journal of Statistical Planning and Inference 138 1967–1980.