Estimating marginal likelihoods from the posterior draws through a geometric identity

Johannes Reichl
Energy Institute at the Johannes Kepler University Linz
E-mail for correspondence: reichl@energieinstitut-linz.at

Abstract: This article develops a new estimator of the marginal likelihood that requires a sample of the posterior distribution as the only input from the analyst. This sample may come from any sampling scheme, such as Gibbs sampling or Metropolis-Hastings sampling. The presented approach can be implemented generically in almost any application of Bayesian modeling and significantly decreases the computational burden associated with marginal likelihood estimation compared to existing techniques. The functionality of the method is demonstrated in the context of a high-dimensional random intercept probit model. Simulation results show that the simple approach presented here achieves excellent stability in low-dimensional models, and it also clearly outperforms existing methods when the number of coefficients of the model increases.

Keywords: Bayesian statistics; Model evidence; Integrated likelihood; Model selection; Estimation of normalizing constant.

1 Motivation

Bayesian model selection relies on the posterior probabilities of the H candidate models M_1, ..., M_H conditional on the data (see e.g. Kass and Raftery, 1995). In this article we discuss the estimation of the posterior probabilities p(M_h | y) of the h = 1, ..., H candidate models by estimating their marginal likelihoods. Calculating the marginal likelihood is a non-trivial integration problem, and as such it is still associated with significant effort on the part of the analyst and potential imprecision in the case of high-dimensional or multi-level models. Comparative studies of existing estimation techniques for the marginal likelihood only provide clear evidence of precision for candidate models of lesser dimensions, while Bayesian analysis frequently requires more complex models (see e.g.
Frühwirth-Schnatter and Wagner, 2008).

This paper was published as a part of the proceedings of the 30th International Workshop on Statistical Modelling, Johannes Kepler Universität Linz, 6-10 July 2015. The copyright remains with the author(s). Permission to reproduce or extract any parts of this abstract should be requested from the author(s).
This article presents a technique for estimating the marginal likelihood requiring only a sample of the posterior distribution as an input; it is thus implementable as a generic function that can be used in a variety of applications. As a potentially even more important advantage, the approach shows significantly less sensitivity to an increase in the number of model coefficients compared to existing approaches.

2 The approach

We start by defining the marginal likelihood of model M_h as

    p(M_h | y) = \int_{\Theta_h} p(y | \theta_h) p(\theta_h) \, d\theta_h,    (1)

where \theta_h is a K-vector containing the K coefficients of model M_h, p(y | \theta_h) refers to the likelihood of model M_h, and p(\theta_h) is the prior distribution with domain \Theta_h. Suppressing the model index h henceforth, and observing that the marginal likelihood of a model M is the normalizing constant of its posterior distribution p(\theta | y), we can rewrite Bayes' theorem as

    p(M | y) = \frac{p(y | \theta) p(\theta)}{p(\theta | y)}.    (2)

Let A be a bounded subset of the prior domain \Theta; then integrating both sides of (2) over A gives

    \int_A d\theta = p(M | y) \int_A \frac{p(\theta | y)}{p^*(\theta | y)} \, d\theta,    (3)

where p^*(\theta | y) is used as an abbreviation for the non-normalized posterior p(y | \theta) p(\theta) henceforth. A representation of the marginal likelihood is then found by

    p(M | y) = \int_A d\theta \Big/ \int_A \frac{p(\theta | y)}{p^*(\theta | y)} \, d\theta.    (4)

We refer to the right integral in (4) as the non-normalized posterior integral over A and abbreviate it by \kappa_A. Integrating over a K-dimensional bounded set has the geometric interpretation of a generalized volume, or hypervolume, and we will refer to the left integral in (4) as the volume of A. This article exploits (4) and presents a new estimator for the marginal likelihood by separately estimating the volume of A and the corresponding non-normalized posterior integral. We also present a method for choosing A in such a way that the quotient of these estimators yields a stable estimate of the marginal likelihood.
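Identity (4) can be checked numerically in one dimension. In the sketch below (all names are illustrative, not part of the paper's method), a standard Gaussian kernel stands in for the non-normalized posterior, so the true normalizing constant sqrt(2*pi) is known, and both integrals are evaluated over A = [-1, 1] on a grid:

```python
import numpy as np

# Toy check of identity (4): p*(theta | y) = exp(-theta^2 / 2), whose
# true normalizing constant is sqrt(2*pi) ~ 2.5066.
Z_true = np.sqrt(2 * np.pi)

def p_star(theta):                 # non-normalized posterior
    return np.exp(-theta**2 / 2)

def p_post(theta):                 # normalized posterior p(theta | y)
    return p_star(theta) / Z_true

# Bounded region A = [-1, 1] inside the support of the posterior.
grid = np.linspace(-1.0, 1.0, 100_001)
V_A = grid[-1] - grid[0]                         # left integral in (4)

# Right integral in (4): Riemann sum of p(theta|y) / p*(theta|y) over A.
kappa_A = np.mean(p_post(grid) / p_star(grid)) * V_A

print(V_A / kappa_A)  # recovers Z_true ~ 2.5066
```

Because the ratio p(theta|y)/p*(theta|y) is the constant 1/Z, the quotient V_A / kappa_A reproduces the normalizing constant exactly, for any choice of the bounded region A.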
First, we turn to the non-normalized posterior integral. A common technique for numerical integration is importance sampling, and since the posterior distribution is part of the numerator in the non-normalized posterior integral this suggests the posterior as the importance density. Since draws from the posterior distribution are usually available as a natural output of Bayesian analysis, the importance sampling estimator \hat{\kappa}_A of the posterior integral is available almost ad hoc once A has been defined:

    \hat{\kappa}_A = \frac{1}{L} \sum_{l=1}^{L} f(\theta^{(l)}), \quad \text{with } f(\theta^{(l)}) = \begin{cases} 1 / p^*(\theta^{(l)} | y), & \text{if } \theta^{(l)} \in A, \\ 0, & \text{else,} \end{cases}    (5)

where \theta^{(l)} with l = 1, ..., L refers to the posterior draws after the burn-in. Subsequent steps outline a definition of A allowing a stable estimation of the marginal likelihood from identity (4). Firstly, to ensure the posterior distribution is a proper choice for the importance density, as required in our approach, the region of integration A must have full support of the posterior distribution. Secondly, to avoid instability, the region of integration A may only contain points for which the sum in (5) is stable independently of any specific run of the MCMC sampler. To address both of these requirements, and at the same time allow for a simple estimation of the volume and the non-normalized posterior integral, we define A as the intersection of two sets A_1 and A_2. Set A_1 is defined by a threshold value \rho, and only those points \theta lie in A_1 whose non-normalized posterior p^*(\theta | y) exceeds this threshold, such that \theta \in A_1 if p^*(\theta | y) > \rho. Considering the series p^*_{(1,...,L)} = p^*(\theta^{(1)} | y), ..., p^*(\theta^{(L)} | y), a natural way of determining the threshold \rho is to ensure that the lowest values of p^*_{(1,...,L)} do not destabilize the sum in (5) by setting \rho as a quantile of p^*_{(1,...,L)}, and we define \rho as the median of p^*_{(1,...,L)}.
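As a toy illustration of estimator (5) (not the paper's full method: A here is just the threshold set A_1, whereas the paper intersects it with a second set A_2 below), suppose the posterior is standard normal, so the true normalizing constant sqrt(2*pi) is known. In one dimension the volume of A_1 can then be read off the draws directly:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical example: "posterior" draws from N(0, 1), so the
# non-normalized posterior p*(theta | y) = exp(-theta^2 / 2) has the
# known normalizing constant sqrt(2*pi).
L = 200_000
theta = rng.standard_normal(L)
p_vals = np.exp(-theta**2 / 2)      # p*(theta^(l) | y) for each draw

rho = np.median(p_vals)             # threshold: median of p*_(1,...,L)
in_A = p_vals > rho                 # here A is just the set A_1

# Estimator (5): average of 1/p* over the draws inside A, zero outside.
kappa_hat = np.mean(np.where(in_A, 1.0 / p_vals, 0.0))

# In 1-D, A_1 = {theta : |theta| < t} with p*(t) = rho, so its volume
# is 2t and identity (4) yields an estimate of the normalizing constant.
t = np.quantile(np.abs(theta), 0.5)
print(2 * t / kappa_hat)            # close to sqrt(2*pi) ~ 2.507
```

Note how the median threshold keeps 1/p* bounded on A: draws from the posterior tails, where 1/p* explodes, contribute exactly zero to the sum.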
Thereby, an almost perfectly stable estimation of (5) is ensured by excluding from the estimation of \hat{\kappa}_A all values of p^*_{(1,...,L)} stemming from the tails of the posterior distribution. To facilitate easy estimation of the volume of A, a second set A_2 is defined in such a way that the volume of the intersection of A_1 and A_2 can be estimated by means of standard statistical techniques only. As long as A_1 and A_2 have the same dimension K and an intersection A, the volume of this intersection V_A can be written as V_A = \pi V_{A_2}, where \pi refers to the ratio of points lying in A_2 that also lie in A_1, and V_{A_2} is the volume of A_2. Hence, for an easy estimation of V_A we define A_2 in such a way that its volume V_{A_2} can be calculated analytically and that drawing
uniformly from within it is efficient and feasible. Then, \pi can simply be estimated by drawing K-dimensional vectors \theta^{(r)} uniformly from A_2 such that

    \hat{\pi} = \frac{1}{R} \sum_{r=1}^{R} I(\theta^{(r)} \in A_1),

where I(\cdot) refers to the indicator function, and R is the number of random draws from within A_2. Even though other definitions of A_2 are possible, we choose a K-dimensional ellipsoid as A_2, as this choice shows outstanding efficiency of the resulting estimator. The set of points \theta lying in A_2 is thus defined by

    \theta \in A_2 \text{ if } (\theta - \theta^*) C^{-1} (\theta - \theta^*)^T < 1,    (6)

where C is a positive definite matrix of dimension K x K with its eigenvectors defining the principal axes of the ellipsoid. \theta^* is a point with support of the posterior distribution, and we define \theta^* as the posterior mode to ensure substantial overlap between A_2 and A_1. The last step in defining A_2 is thus choosing C. Considering the matrix R = (\theta^{(1)T}, ..., \theta^{(L)T})^T and its covariance matrix D = cov(R), we define C = \alpha D, where \alpha is a scalar with domain R^+ that is employed as a tuning parameter in the presented approach. We recommend setting \alpha in such a way that the resulting intersection of A_1 and A_2 contains about 49% of the L posterior draws \theta^{(l)}; a theoretical underpinning of this recommendation can be provided by the author upon request. Algorithm I for the estimation of the marginal likelihood is thus given by:

1. Run an MCMC sampler to obtain L posterior draws \theta^{(l)} after the burn-in, calculate the series of non-normalized posterior density values p^*_{(1,...,L)}, and set \rho to its median, \theta^* to the posterior mode, and D to the covariance matrix of the posterior draws.
2. Define \alpha in such a way that 0.49 L draws are in A, where \theta^{(l)} \in A if p^*(\theta^{(l)} | y) > \rho and (\theta^{(l)} - \theta^*) (\alpha D)^{-1} (\theta^{(l)} - \theta^*)^T < 1.
3. Draw R points \theta^{(r)} from A_2, count the number r of draws for which p^*(\theta^{(r)} | y) > \rho, and set \hat{\pi} = r/R.
4. Estimate the volume of A as \hat{V}_A = \hat{\pi} V_{A_2}, and obtain the estimator for the non-normalized posterior integral \hat{\kappa}_A from (5).
5.
Calculate the final estimator of the marginal likelihood as \hat{p}_A(M | y) = \hat{V}_A / \hat{\kappa}_A.
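Under the assumption of a toy two-dimensional Gaussian posterior (so the true log normalizing constant is known and the result can be checked), Algorithm I can be sketched roughly as follows. Every function and variable name here is illustrative, and the tuning of alpha is a simple increasing search, not the author's procedure:

```python
import numpy as np
from math import lgamma, pi

rng = np.random.default_rng(0)

# Step 1: posterior draws. Toy stand-in: K = 2 Gaussian posterior, so
# the true log normalizing constant log((2*pi)^(K/2)) is known.
K, L = 2, 100_000
theta = rng.standard_normal((L, K))
log_Z_true = (K / 2) * np.log(2 * np.pi)

def log_p_star(t):                        # log non-normalized posterior
    return -0.5 * np.sum(t**2, axis=-1)

log_p = log_p_star(theta)
log_rho = np.median(log_p)                # threshold rho (on log scale)
theta_star = theta[np.argmax(log_p)]      # highest-density draw as mode proxy
D = np.cov(theta, rowvar=False)

# Step 2: tune alpha so that ~0.49 L draws lie in A = A_1 intersect A_2.
diff = theta - theta_star
maha = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(D), diff)
in_A1 = log_p > log_rho
alpha = np.quantile(maha, 0.49)           # crude search, not the author's rule
while np.mean(in_A1 & (maha < alpha)) < 0.49:
    alpha *= 1.05

# Step 3: R uniform draws from the ellipsoid A_2 and the ratio pi_hat.
R = 100_000
z = rng.standard_normal((R, K))
z /= np.linalg.norm(z, axis=1, keepdims=True)   # uniform directions
radius = rng.uniform(size=(R, 1)) ** (1 / K)    # uniform radii in unit ball
u = theta_star + (radius * z) @ np.linalg.cholesky(alpha * D).T
pi_hat = np.mean(log_p_star(u) > log_rho)

# Step 4: V_A2 analytically (unit-ball volume times sqrt(det(alpha*D)))
# and kappa_hat from (5).
log_V_A2 = (K / 2) * np.log(pi) - lgamma(K / 2 + 1) \
           + 0.5 * np.linalg.slogdet(alpha * D)[1]
in_A = in_A1 & (maha < alpha)
kappa_hat = np.mean(np.where(in_A, np.exp(-log_p), 0.0))

# Step 5: final estimate on the log scale.
log_p_hat = np.log(pi_hat) + log_V_A2 - np.log(kappa_hat)
print(log_p_hat)   # close to log_Z_true ~ 1.8379
```

The uniform ellipsoid draws use the standard construction (uniform direction, radius proportional to U^(1/K), affine map by the Cholesky factor of alpha*D), which keeps step 3 cheap even when K is large.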
TABLE 1. PISA data; logarithm of different marginal likelihood estimators for five different data sets. Importance sampling and bridge sampling using a mixture importance density constructed as e.g. in Frühwirth-Schnatter and Wagner (2008) are referenced by \hat{p}_{IS} and \hat{p}_{BS}; \hat{p}_{CH} refers to Chib's method; and the estimator proposed in this paper is referenced as \hat{p}_A; relevant standard errors in parentheses; results from three independent MCMC runs per data set are reported.

US region   K     log(\hat{p}_{IS})    log(\hat{p}_{BS})   log(\hat{p}_{CH})   log(\hat{p}_A)
Northeast   28    363.203 (0.076)      363.39 (0.02)       363.086 (0.054)     363.36 (0.04)
                  363.028 (0.85)       363.55 (0.02)       363.83 (0.049)      363.67 (0.00)
                  363.204 (0.76)       363.69 (0.02)       363.68 (0.05)       363.45 (0.009)
West        35    503.50 (0.29)        503.59 (0.024)      503.29 (0.090)      503.99 (0.07)
                  503.78 (0.262)       503.80 (0.024)      503.09 (0.097)      503.74 (0.02)
                  503.828 (0.50)       503.25 (0.023)      503.003 (0.00)      503.89 (0.00)
Midwest     43    667.55 (0.234)       667.026 (0.028)     667.7 (0.27)        667.033 (0.03)
                  667.364 (0.243)      667.02 (0.029)      667.04 (0.23)       667.03 (0.02)
                  667.24 (0.47)        666.962 (0.027)     667.27 (0.095)      667.036 (0.06)
South       57    891.596 (0.779)      892.433 (0.054)     892.246 (0.58)      892.427 (0.02)
                  893.633 (0.536)      892.432 (0.058)     892.628 (0.255)     892.432 (0.03)
                  893.34 (0.475)       892.568 (0.053)     892.370 (0.84)      892.42 (0.06)
all         142   2352.958 (1.00)      2350.232 (0.447)    --                  2351.877 (0.07)
                  2354.952 (1.85)      2351.547 (1.264)    --                  2351.920 (0.06)
                  2350.454 (0.539)     2352.49 (1.338)     --                  2351.93 (0.02)

Algorithms for efficiently drawing uniformly from within a hyperellipsoid, and a recursive algorithm returning log(V_{A_2}) with minimal computing time even for very high K, can be requested from the author. Thus, the calculation of \hat{\pi} and V_{A_2}, and consequently \hat{V}_A, is achieved with low computational effort and high precision.
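The author's recursive algorithm for log(V_{A_2}) is available only on request, but the log volume of an ellipsoid also has a standard closed form that remains numerically stable for large K. The sketch below uses that generic formula (it is not the author's algorithm, and the function name is illustrative):

```python
import numpy as np
from math import lgamma, pi, log

def log_volume_ellipsoid(C):
    """Log volume of the ellipsoid {x : x C^{-1} x^T < 1} for positive
    definite C, via the closed form
    log V = (K/2) log(pi) - log Gamma(K/2 + 1) + (1/2) log det(C)."""
    K = C.shape[0]
    _, logdet = np.linalg.slogdet(C)          # stable log-determinant
    return (K / 2) * log(pi) - lgamma(K / 2 + 1) + 0.5 * logdet

# Sanity check: the unit ball in K = 3 dimensions has volume 4/3 * pi.
print(np.exp(log_volume_ellipsoid(np.eye(3))))  # ~4.18879
```

Working with lgamma and slogdet avoids the overflow that the explicit gamma function and determinant would cause for the dimensions considered in the application below.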
3 Application

In this section the proposed estimation method is applied to a random intercept probit model and compared to existing methods: Chib's method (1995), importance sampling, and bridge sampling (Meng and Wong, 1996). This paper provides for the first time a comparison of the discussed estimation techniques for a high-dimensional unit-level model, and discloses the shortcomings of existing approaches. As one instance of a comparative study exploring the existing techniques for a unit-level model, Frühwirth-Schnatter and Wagner (2008) estimate a random intercept logit model with up to K = 25 coefficients, whereas in the application shown in this article we increase the number of model dimensions in five applications up to K = 142 to demonstrate the extraordinary stability of the presented estimator in comparison to the existing approaches. The data concern reading proficiency in US schools and stem from the Program for International Student Assessment (PISA) as provided in Snijders and Bosker (2012). A random intercept probit model is estimated for 5 different partitions of the data with respect to their geographical origin. Table 1 displays the results of the comparative study. While the magnitudes of the estimates cannot be compared between the different values of K, as these relate to different data, the standard errors allow conclusions about the sensitivity of the respective estimators to an increase in the number of coefficients of the underlying model.

References

Chib, S. (1995). Marginal likelihood from the Gibbs output. Journal of the American Statistical Association, 90, 1313-1321.

Frühwirth-Schnatter, S. and Wagner, H. (2008). Marginal likelihoods for non-Gaussian models using auxiliary mixture sampling. Computational Statistics and Data Analysis, 52, 4608-4624.

Kass, R.E. and Raftery, A.E. (1995). Bayes factors. Journal of the American Statistical Association, 90, 773-795.

Meng, X.-L.
and Wong, W.H. (1996). Simulating ratios of normalizing constants via a simple identity: A theoretical exploration. Statistica Sinica, 6, 831-860.

Snijders, T.A. and Bosker, R.J. (2012). Multilevel Analysis: An Introduction to Basic and Advanced Multilevel Modeling, 2nd ed. London: Sage Publishers Ltd.