Estimating marginal likelihoods from the posterior draws through a geometric identity


Johannes Reichl
Energy Institute at the Johannes Kepler University Linz
E-mail for correspondence: reichl@energieinstitut-linz.at

Abstract: This article develops a new estimator of the marginal likelihood that requires a sample from the posterior distribution as the only input from the analyst. This sample may come from any sampling scheme, such as Gibbs sampling or Metropolis-Hastings sampling. The presented approach can be implemented generically in almost any application of Bayesian modeling and significantly decreases the computational burden associated with marginal likelihood estimation compared to existing techniques. The functionality of the method is demonstrated in the context of a high-dimensional random intercept probit model. Simulation results show that the simple approach presented here achieves excellent stability in low-dimensional models, and also clearly outperforms existing methods when the number of coefficients of the model increases.

Keywords: Bayesian statistics; Model evidence; Integrated likelihood; Model selection; Estimation of normalizing constants.

1 Motivation

Bayesian model selection relies on the posterior probabilities of the H candidate models M_1, ..., M_H conditional on the data (see e.g. Kass and Raftery, 1995). In this article we discuss the estimation of the posterior probabilities p(M_h | y) of the h = 1, ..., H candidate models by estimating their marginal likelihoods. Calculating the marginal likelihood is a non-trivial integration problem, and as such it is still associated with significant effort on the part of the analyst and potential imprecision in the case of high-dimensional or multi-level models. Comparative studies of existing estimation techniques for the marginal likelihood only provide clear evidence of precision for candidate models of lesser dimension, while Bayesian analysis frequently requires more complex models (see e.g. Frühwirth-Schnatter and Wagner, 2008).

This paper was published as part of the proceedings of the 30th International Workshop on Statistical Modelling, Johannes Kepler Universität Linz, 6-10 July 2015. The copyright remains with the author(s). Permission to reproduce or extract any parts of this abstract should be requested from the author(s).

This article presents a technique for estimating the marginal likelihood that requires only a sample from the posterior distribution as input and is thus implementable as a generic function, allowing it to be used in a wide variety of applications. As a potentially even more important advantage, the approach shows significantly less sensitivity to an increase in the number of model coefficients than existing approaches.

2 The approach

We start by defining the marginal likelihood of model M_h as

    p(M_h \mid y) = \int_{\Theta_h} p(y \mid \theta_h)\, p(\theta_h)\, d\theta_h,    (1)

where θ_h is a K × 1 vector containing the K coefficients of model M_h, p(y | θ_h) refers to the likelihood of model M_h, and p(θ_h) is the prior distribution with domain Θ_h. Suppressing the model index h henceforth, and considering that the marginal likelihood of a model M is the normalizing constant of its posterior distribution p(θ | y), we can rewrite Bayes' theorem as

    p(M \mid y) = \frac{p(y \mid \theta)\, p(\theta)}{p(\theta \mid y)}.    (2)

Let A be a bounded subset of the prior domain Θ; then integrating both sides of (2) over A gives

    \int_A d\theta = p(M \mid y) \int_A \frac{p(\theta \mid y)}{p^*(\theta \mid y)}\, d\theta,    (3)

where p*(θ | y) is used as an abbreviation for the non-normalized posterior p(y | θ) p(θ) henceforth. A representation of the marginal likelihood is then found as

    p(M \mid y) = \int_A d\theta \Big/ \int_A \frac{p(\theta \mid y)}{p^*(\theta \mid y)}\, d\theta.    (4)

We refer to the right integral in (4) as the non-normalized posterior integral over A and abbreviate it by κ_A. Integrating over a K-dimensional bounded set has the geometric interpretation of a generalized volume, or hypervolume, and we will refer to the left integral in (4) as the volume of A. This article exploits (4) and presents a new estimator of the marginal likelihood by separately estimating the volume of A and the corresponding non-normalized posterior integral. We also present a method for choosing A in such a way that the quotient of these estimators yields a stable estimate of the marginal likelihood.
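As an aside, identity (4) can be checked numerically in a simple conjugate example where the marginal likelihood is known in closed form. The following sketch is not from the paper; it is a minimal illustration under the stated assumptions of a toy model y_i ~ N(θ, 1) with prior θ ~ N(0, 1), evaluating both integrals in (4) over a bounded interval A and comparing their ratio with the exact log marginal likelihood.

    # Numerical check of identity (4) in one dimension (illustration only).
    # Assumed toy model: y_i ~ N(theta, 1) with prior theta ~ N(0, 1), so the
    # posterior and the exact marginal likelihood are available in closed form.
    import numpy as np
    from scipy import stats, integrate

    rng = np.random.default_rng(0)
    y = rng.normal(0.5, 1.0, size=20)
    n, ybar = len(y), y.mean()

    # Closed-form posterior N(m, v) and exact log marginal likelihood.
    v = 1.0 / (1.0 + n)
    m = v * n * ybar
    log_ml_exact = stats.multivariate_normal.logpdf(
        y, mean=np.zeros(n), cov=np.eye(n) + np.ones((n, n)))

    def log_p_star(theta):
        # Non-normalized posterior p*(theta | y) = p(y | theta) p(theta).
        return stats.norm.logpdf(y, theta, 1.0).sum() + stats.norm.logpdf(theta, 0.0, 1.0)

    # Bounded region A = [a, b] with full posterior support.
    a, b = m - 2.0 * np.sqrt(v), m + 2.0 * np.sqrt(v)
    vol_A = b - a                                   # left integral in (4)
    kappa_A, _ = integrate.quad(                    # right integral in (4)
        lambda t: stats.norm.pdf(t, m, np.sqrt(v)) / np.exp(log_p_star(t)), a, b)

    print("exact log p(M | y):     ", log_ml_exact)
    print("log(V_A / kappa_A), (4):", np.log(vol_A / kappa_A))

Because the posterior is exact in this toy case, the ratio reproduces the marginal likelihood for any choice of A; the practical question addressed below is how to estimate both integrals stably from MCMC output in higher dimensions.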

First, we turn to the non-normalized posterior integral. A common technique for numerical integration is importance sampling, and since the posterior distribution is part of the numerator in the non-normalized posterior integral, this suggests the posterior as the importance density. Since draws from the posterior distribution are usually available as a natural output of Bayesian analysis, the importance sampling estimator κ̂_A of the posterior integral is available almost ad hoc once A has been defined:

    \hat{\kappa}_A = \frac{1}{L} \sum_{l=1}^{L} f(\theta^{(l)}), \qquad f(\theta^{(l)}) = \begin{cases} 1/p^*(\theta^{(l)} \mid y) & \text{if } \theta^{(l)} \in A, \\ 0 & \text{else,} \end{cases}    (5)

where θ^(l) with l = 1, ..., L refers to the posterior draws after the burn-in.

The subsequent steps outline a definition of A that allows a stable estimation of the marginal likelihood from identity (4). Firstly, to ensure that the posterior distribution is a proper choice for the importance density, as required in our approach, the region of integration A must have full support under the posterior distribution. Secondly, to avoid instability, the region of integration A may only contain points for which the sum in (5) is stable independently of any specific run of the MCMC sampler. To address both of these requirements, and at the same time allow for a simple estimation of the volume and the non-normalized posterior integral, we define A as the intersection of two sets A_1 and A_2.

Set A_1 is defined by a threshold value ρ: only those points θ lie in A_1 whose non-normalized posterior p*(θ | y) exceeds this threshold, so that θ ∈ A_1 if p*(θ | y) > ρ. Considering the series p*_(1,...,L) = p*(θ^(1) | y), ..., p*(θ^(L) | y), a natural way of determining the threshold ρ is to ensure that the lowest values of p*_(1,...,L) do not destabilize the sum in (5) by setting ρ to a quantile of p*_(1,...,L); we define ρ as the median of p*_(1,...,L). Thereby, an almost perfectly stable estimation of (5) is ensured, because all values of p*_(1,...,L) stemming from the tails of the posterior distribution are excluded from the estimation of κ̂_A.
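As a minimal sketch (ours, not the author's code), estimator (5) and the median threshold ρ can be computed from the stored values of log p*(θ^(l) | y) for the retained draws; accumulating the sum in (5) on the log scale keeps it numerically stable. The function and variable names below are illustrative, and the membership mask passed to log_kappa_hat should describe the full region A once A_2 has been defined below.

    # Sketch of estimator (5); assumes an array of log p*(theta^(l)|y) values
    # for the L retained posterior draws. Names and interface are illustrative.
    import numpy as np
    from scipy.special import logsumexp

    def log_kappa_hat(log_p_star_draws, in_A):
        """Log of (5): (1/L) * sum over draws in A of 1 / p*(theta^(l) | y)."""
        log_p_star_draws = np.asarray(log_p_star_draws)
        L = len(log_p_star_draws)
        # Sum of exp(-log p*) over the draws in A, accumulated in log space.
        return logsumexp(-log_p_star_draws[in_A]) - np.log(L)

    def median_threshold_mask(log_p_star_draws):
        """Threshold rho as the median of p*_(1,...,L); returns the A_1 mask and log(rho)."""
        log_p_star_draws = np.asarray(log_p_star_draws)
        log_rho = np.median(log_p_star_draws)
        return log_p_star_draws > log_rho, log_rho

In the full algorithm the mask is the intersection of this A_1 condition with membership in the ellipsoid A_2 introduced next.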

To facilitate an easy estimation of the volume of A, a second set A_2 is defined in such a way that the volume of the intersection of A_1 and A_2 can be estimated by standard statistical techniques only. As long as A_1 and A_2 have the same dimension K and a non-empty intersection A, the volume of this intersection, V_A, can be written as V_A = π V_{A_2}, where π refers to the proportion of points in A_2 that also lie in A_1, and V_{A_2} is the volume of A_2. Hence, for an easy estimation of V_A we define A_2 in such a way that its volume V_{A_2} can be calculated analytically and that drawing uniformly from within it is efficient and feasible. Then, π can simply be estimated by drawing K-dimensional vectors θ^(r) uniformly from A_2, such that

    \hat{\pi} = \frac{1}{R} \sum_{r=1}^{R} I(\theta^{(r)} \in A_1),

where I(·) refers to the indicator function and R is the number of random draws from within A_2.

Even though other definitions of A_2 are possible, we choose a K-dimensional ellipsoid as A_2, as this choice shows outstanding efficiency of the resulting estimator. The set of points θ lying in A_2 is thus defined by

    \theta \in A_2 \;\text{ if }\; (\theta - \theta^*)\, C\, (\theta - \theta^*)^\top < 1,    (6)

where C is a positive definite matrix of dimension K × K with its eigenvectors defining the principal axes of the ellipsoid. θ* is a point within the support of the posterior distribution, and we define θ* as the posterior mode to ensure substantial overlap between A_2 and A_1. The last step in defining A_2 is thus choosing C. Consider the matrix R = (θ^(1)T, ..., θ^(L)T)^T of stacked posterior draws and its covariance matrix D = cov(R); we define C = (αD)^{-1}, where α is a scalar with domain R_+ that is employed as a tuning parameter in the presented approach. We recommend setting α in such a way that the resulting intersection of A_1 and A_2 contains about 49% of the L posterior draws θ^(l); a theoretical underpinning of this recommendation can be provided by the author upon request.

Algorithm 1 for the estimation of the marginal likelihood is thus given by:

1. Run an MCMC sampler to obtain L posterior draws θ^(l) after the burn-in, calculate the series of non-normalized posterior density values p*_(1,...,L), and set ρ to its median, θ* to the posterior mode, and D to the covariance matrix of the posterior draws.

2. Define α in such a way that 0.49·L draws are in A, where θ^(l) ∈ A if p*(θ^(l) | y) > ρ and (θ^(l) − θ*)(αD)^{-1}(θ^(l) − θ*)^T < 1.

3. Draw R points θ^(r) uniformly from A_2, count the number r of draws for which p*(θ^(r) | y) > ρ, and set π̂ = r/R.

4. Estimate the volume of A as V̂_A = π̂ V_{A_2}, and obtain the estimator of the non-normalized posterior integral, κ̂_A, from (5).

5. Calculate the final estimator of the marginal likelihood as p̂_A(M | y) = V̂_A / κ̂_A.
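A compact sketch of Algorithm 1 is given below. It is our illustration, not the author's reference implementation: the posterior mode is approximated by the draw with the highest p*, α is tuned by simple bisection on the fraction of draws falling in A, uniform draws from the ellipsoid A_2 use the standard direction-radius construction, and log(V_{A_2}) is obtained from the closed-form ellipsoid volume rather than the author's recursive algorithm. The function and argument names are ours.

    # Sketch of Algorithm 1 for an (L x K) array of posterior draws and a
    # callable log_p_star(theta) returning log p*(theta | y). Illustrative only.
    import numpy as np
    from scipy.special import gammaln, logsumexp

    def estimate_log_marglik(draws, log_p_star, R=50_000, target=0.49, seed=1):
        rng = np.random.default_rng(seed)
        draws = np.asarray(draws)
        L, K = draws.shape

        # Step 1: series p*_(1,...,L), threshold rho (median), theta*, covariance D.
        log_p = np.array([log_p_star(th) for th in draws])
        log_rho = np.median(log_p)
        theta_star = draws[np.argmax(log_p)]          # proxy for the posterior mode
        D = np.atleast_2d(np.cov(draws, rowvar=False))
        d = draws - theta_star
        maha2 = np.einsum('ij,jk,ik->i', d, np.linalg.inv(D), d)
        in_A1 = log_p > log_rho

        # Step 2: choose alpha so that about 0.49*L draws lie in A = A_1 ∩ A_2
        # (a draw is in A_2 iff its squared Mahalanobis distance is below alpha).
        lo, hi = 1e-8, 1e8
        for _ in range(200):                          # bisection on alpha
            mid = 0.5 * (lo + hi)
            if np.mean(in_A1 & (maha2 < mid)) < target:
                lo = mid
            else:
                hi = mid
        alpha = 0.5 * (lo + hi)

        # Step 3: draw R points uniformly from the ellipsoid A_2 and estimate pi.
        chol = np.linalg.cholesky(alpha * D)
        z = rng.standard_normal((R, K))
        z /= np.linalg.norm(z, axis=1, keepdims=True)     # uniform directions
        radii = rng.random(R) ** (1.0 / K)                # uniform radii in the unit ball
        theta_r = theta_star + (radii[:, None] * z) @ chol.T
        pi_hat = np.mean(np.array([log_p_star(th) for th in theta_r]) > log_rho)

        # Step 4: V_A = pi * V_{A_2} with the closed-form ellipsoid volume (in logs),
        # and kappa_hat from (5) restricted to the draws lying in A.
        _, logdet_D = np.linalg.slogdet(D)
        log_V_A2 = 0.5 * K * np.log(np.pi * alpha) + 0.5 * logdet_D - gammaln(K / 2 + 1)
        log_V_A = np.log(pi_hat) + log_V_A2
        in_A = in_A1 & (maha2 < alpha)
        log_kappa_A = logsumexp(-log_p[in_A]) - np.log(L)

        # Step 5: log of the marginal likelihood estimate, log(V_A / kappa_A).
        return log_V_A - log_kappa_A

The output of such a sketch can be validated on low-dimensional conjugate models for which the marginal likelihood is available in closed form.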

TABLE 1. PISA data; logarithm of different marginal likelihood estimators for five different data sets. Importance sampling and bridge sampling using a mixture importance density constructed as e.g. in Frühwirth-Schnatter and Wagner (2008) are referenced by p̂_IS and p̂_BS; p̂_CH refers to Chib's method; and the estimator proposed in this paper is referenced as p̂_A; relevant standard errors in parentheses; results from three independent MCMC runs per data set are reported.

US region   K     log p̂_IS            log p̂_BS            log p̂_CH            log p̂_A
Northeast   28    -363.203 (0.076)    -363.39  (0.02)     -363.086 (0.054)    -363.36  (0.04)
                  -363.028 (0.85)     -363.55  (0.02)     -363.83  (0.049)    -363.67  (0.00)
                  -363.204 (0.76)     -363.69  (0.02)     -363.68  (0.05)     -363.45  (0.009)
West        35    -503.50  (0.29)     -503.59  (0.024)    -503.29  (0.090)    -503.99  (0.07)
                  -503.78  (0.262)    -503.80  (0.024)    -503.09  (0.097)    -503.74  (0.02)
                  -503.828 (0.50)     -503.25  (0.023)    -503.003 (0.00)     -503.89  (0.00)
Midwest     43    -667.55  (0.234)    -667.026 (0.028)    -667.7   (0.27)     -667.033 (0.03)
                  -667.364 (0.243)    -667.02  (0.029)    -667.04  (0.23)     -667.03  (0.02)
                  -667.24  (0.47)     -666.962 (0.027)    -667.27  (0.095)    -667.036 (0.06)
South       57    -891.596 (0.779)    -892.433 (0.054)    -892.246 (0.58)     -892.427 (0.02)
                  -893.633 (0.536)    -892.432 (0.058)    -892.628 (0.255)    -892.432 (0.03)
                  -893.34  (0.475)    -892.568 (0.053)    -892.370 (0.84)     -892.42  (0.06)
all         142   -2352.958 (1.00)    -2350.232 (0.447)                       -2351.877 (0.07)
                  -2354.952 (1.85)    -2351.547 (1.264)                       -2351.920 (0.06)
                  -2350.454 (0.539)   -2352.49  (1.338)                       -2351.93  (0.02)

Algorithms for efficiently drawing uniformly from within a hyperellipsoid, and a recursive algorithm returning log(V_{A_2}) with minimal computing time even for very high K, can be requested from the author. Thus, the calculation of π̂ and V_{A_2}, and consequently of V̂_A, is achieved with low computational effort and high precision.
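For completeness, a standard closed form for the volume of the ellipsoid in (6) is also available (a textbook result, not the author's recursive algorithm), and it is conveniently evaluated on the log scale for large K:

    V_{A_2} = \frac{\pi^{K/2}}{\Gamma\!\left(\tfrac{K}{2}+1\right)} \sqrt{\det(\alpha D)},
    \qquad
    \log V_{A_2} = \frac{K}{2}\log(\alpha\pi) + \frac{1}{2}\log\det D - \log\Gamma\!\left(\tfrac{K}{2}+1\right).

This is the form used in the algorithm sketch above, with the log-gamma term evaluated by scipy.special.gammaln.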

3 Application

In this section the proposed estimation method is applied to a random intercept probit model and a comparison to existing methods is presented; these are Chib's method (1995), importance sampling, and bridge sampling (Meng and Wong, 1996). This paper provides for the first time a comparison of the discussed estimation techniques for a high-dimensional unit-level model, and discloses the shortcomings of existing approaches. As one instance of a comparative study exploring the existing techniques for a unit-level model, Frühwirth-Schnatter and Wagner (2008) estimate a random intercept logit model with up to K = 25 coefficients, whereas in the application shown in this article we increase the number of model dimensions in five applications up to K = 142 to demonstrate the extraordinary stability of the presented estimator in comparison to the existing approaches. The data concern reading proficiency in US schools and stem from the Programme for International Student Assessment (PISA) as provided in Snijders and Bosker (2012). The data are fitted with a random intercept probit model, for five different partitions of the data with respect to their geographical origin. Table 1 displays the results of the comparative study. While the magnitudes of the estimates cannot be compared between the different values of K, as these relate to different data sets, the standard errors allow conclusions about the sensitivity of the respective estimators to an increase in the number of coefficients of the underlying model.

References

Chib, S. (1995). Marginal likelihood from the Gibbs output. Journal of the American Statistical Association, 90, 1313-1321.

Frühwirth-Schnatter, S. and Wagner, H. (2008). Marginal likelihoods for non-Gaussian models using auxiliary mixture sampling. Computational Statistics and Data Analysis, 52, 4608-4624.

Kass, R.E. and Raftery, A.E. (1995). Bayes factors. Journal of the American Statistical Association, 90, 773-795.

Meng, X.-L. and Wong, W.H. (1996). Simulating ratios of normalizing constants via a simple identity: A theoretical exploration. Statistica Sinica, 6, 831-860.

Snijders, T.A. and Bosker, R.J. (2012). Multilevel Analysis: An Introduction to Basic and Advanced Multilevel Modeling, 2nd ed. London: Sage Publishers Ltd.