Methods for Computing Marginal Data Densities from the Gibbs Output


Cristina Fuentes-Albero (Rutgers University)
Leonardo Melosi (Federal Reserve Bank of Chicago)

January 2013

Abstract

We introduce two estimators of the marginal data density (MDD) computed from the Gibbs output. Our methods exploit the analytical tractability condition, which requires that some parameter blocks can be analytically integrated out of the conditional posterior densities. This condition is satisfied by several widely used time series models. An empirical application to six-variate VAR models shows that the bias of a fully computational estimator is large enough to distort the implied model rankings. One of the estimators is fast enough to make multiple computations of MDDs in densely parameterized models feasible.

Keywords: Marginal likelihood, Gibbs sampler, time series econometrics, Bayesian econometrics, reciprocal importance sampling.

JEL Classification: C11, C15, C16, C32

Correspondence: Cristina Fuentes-Albero: Department of Economics, 75 Hamilton Street, Rutgers University, New Brunswick, NJ 08901: cfuentes@econ.rutgers.edu. Leonardo Melosi: Federal Reserve Bank of Chicago, 230 S LaSalle St, Chicago, IL 60604: lmelosi@frbchi.org.

We thank Frank Schorfheide, Jesús Fernández-Villaverde, Francesco Ravezzolo, Lucrezia Reichlin, Paolo Surico, Herman van Dijk, Daniel Waggonner, the associate editor, and two anonymous referees for very helpful comments. We thank seminar participants at the 4th International Conference on Computational and Financial Econometrics, the Rimini Bayesian Econometrics Workshop, the 26th Annual Congress of the European Economic Association, the XXXIII SAE-Zaragoza, the 2011 Greater New York Metropolitan Area Econometrics Colloquium, and the 10th Applied Time Series Econometrics Workshop of the St Louis Fed. We also thank Marzie Taheri Sanjani for research assistance.

The views in this paper are solely the responsibility of the authors and should not be interpreted as reflecting the views of the Federal Reserve Bank of Chicago or any other person associated with the Federal Reserve System.

1 Introduction

Modern macroeconometric methods are based on densely parameterized models such as vector autoregressive (VAR) models or dynamic factor models (DFMs). Densely parameterized models deliver a better in-sample fit. It is well known, however, that such models can deliver erratic predictions and poor out-of-sample forecasts because of parameter uncertainty. To address this issue, Sims (1980) suggested using priors to constrain parameter estimates by shrinking them toward a specific point in the parameter space. Provided that the direction of shrinkage is chosen accurately, densely parameterized models have been shown to be extremely successful in forecasting. This explains the popularity of largely parameterized models in the literature (Stock and Watson, 2002; Forni, Hallin, Lippi, and Reichlin, 2003; Koop and Potter, 2004; Korobilis, forthcoming; Banbura, Giannone, and Reichlin, 2010; and Koop).

The direction of shrinkage is often determined by maximizing the marginal likelihood of the data (see Carriero, Kapetanios, and Marcellino, 2010, and Giannone et al., 2010), also called the marginal data density (MDD). The marginal data density is defined as the integral of the likelihood function with respect to the prior density of the parameters. Only in a few cases does the MDD have an analytical representation.
When an analytical solution for this density is not available, we need to rely on computational methods, such as Chib's method (Chib, 1995), importance sampling estimators (Hammersley and Handscomb, 1964; Kloek and Van Dijk, 1978; Geweke, 1989), estimators based on the reciprocal importance sampling principle (Gelfand and Dey, 1994), importance sampling based on mixture approximations (Frühwirth-Schnatter, 1995), the bridge sampling estimator (Meng and Wong, 1996), or the warp bridge sampling estimator (Meng and Schilling). Since all these methods rely on computational methods to integrate the model parameters out of the posterior density, their accuracy deteriorates as the dimensionality of the parameter space grows. Hence, there is a tension between the need for broadly parameterized models in forecasting and the accuracy with which the MDD, which influences the direction of shrinkage, can be estimated.

This paper aims at mitigating this tension by introducing two estimators (henceforth, Method 1 and Method 2) that exploit information about the model's analytical structure. While Method 1 can be considered a refinement of the approach proposed by Chib (1995), Method 2 is based upon the reciprocal importance sampling principle as in Gelfand and Dey (1994). In contrast to fully computational methods, Method 1 and Method 2 rely on the analytical integration of some parameter blocks. [1]

The proposed estimators can be applied to econometric models satisfying two conditions. The first condition (henceforth, the sampling condition) requires that the posterior density can be block-partitioned so as to be approximated via the Gibbs sampler. The second condition (henceforth, the analytical tractability condition) states that there exists an integer $\tau \ge 2$ such that the conditional posterior $p(\theta_1, \ldots, \theta_\tau \mid \theta_{\tau+1}, \ldots, \theta_s, D, Y)$ can be analytically derived, where $Y$ is the sample data, $D$ is a set of unobservable model variables, and $s$ is the total number of parameter blocks $\theta_i$, $i \in \{1, \ldots, s\}$. These two conditions are met by a wide range of models, such as vector autoregressive models (VARs), just-identified structural VAR models (SVARs), reduced rank regression models such as vector equilibrium correction models (VECMs), unrestricted Markov-switching VAR models (MS VARs), dynamic factor models (DFMs), factor augmented VAR models (FAVARs), and time-varying parameter (TVP) VAR models.

By means of a Monte Carlo experiment, we show that exploiting the analytical tractability condition leads to sizeable gains in accuracy and computational burden, which grow quickly with the dimensionality of the parameter space of the model. We consider VAR($p$) models in the form studied by Villani (2009) and Del Negro and Schorfheide (2010), i.e., the so-called mean-adjusted VAR models, from one up to four lags, $p = 1, \ldots, 4$. We fit these four VAR models, under a single-unit-root prior (Sims and Zha, 1998), to data sets with an increasing number of observable variables. Mean-adjusted VAR models are a compelling choice because the true conditional predictive density [2] can be analytically derived in closed form.
We can compare the performance of our estimators with their fully computational counterparts, that is, the estimator proposed by Chib (1995) and that introduced by Gelfand and Dey (1994). Method 1 and Chib's method differ only in the computation of the conditional predictive density when applied to mean-adjusted VAR models. While Method 1 evaluates the exact analytical expression for the conditional predictive density, Chib's method approximates this density computationally via Monte Carlo integration. Therefore, we can quantify the accuracy gains associated with exploiting the analytical tractability condition by comparing the conditional predictive density estimated by Chib's method with its true value. This assessment would not have been possible if we had based our Monte Carlo experiment on models that require data augmentation to approximate the posterior, such as DFMs, or on estimators other than Chib's method, such as the bridge sampling estimator.

The main findings of the experiment are: (i) the fully computational estimators that neglect the analytical tractability condition lead to an estimation bias that severely distorts model rankings; (ii) our two methods deliver very similar results in terms of posterior model rankings, suggesting that their accuracy is of the same order of magnitude in the experiment; (iii) exploiting the analytical tractability condition prevents our estimators from being affected by the curse of dimensionality. Related to this last finding, we argue that Method 2 is suitable for performing model selection and model averaging across a large number of models, as it is the fastest.

The paper is organized as follows. Section 2 introduces the conditions that a model has to satisfy in order to apply our two estimators and describes the two methods proposed in this paper for computing the MDD. Section 3 performs the Monte Carlo application. Section 4 concludes.

[1] Fiorentini, Planas, and Rossi (2011) use Kalman filtering and Gaussian quadrature to integrate scale parameters out of the likelihood function for dynamic mixture models.

[2] If one partitions the parameter space $\Theta$ into $s$ vector blocks, that is, $\Theta = \{\theta_1, \ldots, \theta_s\}$, the conditional predictive density $p(Y \mid \theta_{\tau+1}, \ldots, \theta_s)$ is defined as

$$p(Y \mid \theta_{\tau+1}, \ldots, \theta_s) \equiv \int p(Y \mid \theta_1, \ldots, \theta_s)\, p(\theta_1, \ldots, \theta_\tau \mid \theta_{\tau+1}, \ldots, \theta_s)\, d\theta_1 \cdots d\theta_\tau$$

where $p(Y \mid \theta_1, \ldots, \theta_s)$ is the likelihood function and $p(\theta_1, \ldots, \theta_\tau \mid \theta_{\tau+1}, \ldots, \theta_s)$ is the prior for the first $\tau$ parameter blocks conditional on the remaining blocks. Note that the conditional predictive density is a component of the MDD, $p(Y)$, which can be expressed as

$$p(Y) = \int p(Y \mid \theta_{\tau+1}, \ldots, \theta_s)\, p(\theta_{\tau+1}, \ldots, \theta_s)\, d\theta_{\tau+1} \cdots d\theta_s$$

where $p(\theta_{\tau+1}, \ldots, \theta_s)$ is the prior for the parameter blocks that cannot be analytically integrated out.

2 Methods for Computing the Marginal Data Density

The marginal data density (MDD), also known as the marginal likelihood of the data, is defined as the integral of the likelihood taken with respect to the prior distribution of the parameters.
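As a small numerical illustration of this definition, the sketch below compares a closed-form MDD with a brute-force Monte Carlo average of the likelihood over prior draws. The toy model ($y_i \sim N(\theta, 1)$ with prior $\theta \sim N(0, 1)$), the data values, and all function names are assumptions made purely for illustration; none of them comes from the paper.

```python
import math
import random


def log_likelihood(y, theta):
    """log p(Y | theta) for i.i.d. y_i ~ N(theta, 1) (toy model, assumed)."""
    n = len(y)
    return (-0.5 * n * math.log(2.0 * math.pi)
            - 0.5 * sum((yi - theta) ** 2 for yi in y))


def exact_log_mdd(y):
    """Closed-form log p(Y): with theta ~ N(0, 1), Y ~ N(0, I + 11')."""
    n = len(y)
    ss = sum(yi * yi for yi in y)
    s = sum(y)
    return (-0.5 * n * math.log(2.0 * math.pi)
            - 0.5 * math.log(1.0 + n)
            - 0.5 * (ss - s * s / (1.0 + n)))


def prior_mc_log_mdd(y, m=50000, seed=0):
    """Monte Carlo estimate of log p(Y): average the likelihood over prior draws."""
    rng = random.Random(seed)
    log_liks = [log_likelihood(y, rng.gauss(0.0, 1.0)) for _ in range(m)]
    mx = max(log_liks)  # log-sum-exp trick for numerical stability
    return mx + math.log(sum(math.exp(v - mx) for v in log_liks) / m)


y_toy = [0.3, -0.5, 1.2, 0.8, 0.1]  # made-up data
print(exact_log_mdd(y_toy), prior_mc_log_mdd(y_toy))
```

Averaging over prior draws is workable here only because the toy parameter space is one-dimensional; it is not one of the estimators studied in the paper.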
Let $\Theta$ be the parameter set of an econometric model and $Y$ be the sample data. Then, the marginal data density is defined as

$$p(Y) = \int p(Y \mid \Theta)\, p(\Theta)\, d\Theta \tag{1}$$

where $p(Y \mid \Theta)$ and $p(\Theta)$ denote the likelihood and the prior density, respectively. In Section 2.1, we describe the two methods proposed in this paper in a canonical situation consisting of four vector blocks. In Section 2.2, we present the two estimators applied to the general case of $s$ vector blocks. Finally, Section 2.3 deals with the scope of application of the proposed estimators.

2.1 Four Vector Blocks

Let us consider a model whose set of parameters and latent variables is denoted by $\Theta_D = \{D, \Theta\}$, where $D$ stands for the latent variables and $\Theta = \{\theta_1, \theta_2, \theta_3\}$ for the parameters of the model. We denote the prior for the model's parameters as $p(\Theta)$, which is assumed to have a known analytical representation. Furthermore, the likelihood function, $p(Y \mid \Theta)$, is assumed to be known in closed form or easy to evaluate. We focus on models satisfying the following two conditions:

(i) It is possible to draw from the conditional posterior distributions $p(\theta_1 \mid \theta_2, \theta_3, D, Y)$, $p(\theta_2 \mid \theta_1, \theta_3, D, Y)$, $p(\theta_3 \mid \theta_1, \theta_2, D, Y)$, and from the posterior predictive density, $p(D \mid \theta_1, \theta_2, \theta_3, Y)$.

(ii) The conditional posterior distribution $p(\theta_1, \theta_2 \mid \theta_3, D, Y)$ is analytically tractable.

Condition (i) implies that we can approximate the joint posterior $p(\Theta \mid Y)$ and the predictive density $p(D \mid Y)$ through the Gibbs sampler. We label this condition the sampling condition. Condition (ii) is the analytical tractability condition and is most likely to be satisfied through a wise partitioning of the parameter space and the specification of a conjugate prior.

Method 1 is based on interpreting the MDD as the normalizing constant of the joint posterior distribution

$$p(Y) = \frac{p(Y \mid \Theta)\, p(\Theta)}{p(\theta_1 \mid \theta_2, \theta_3, Y)\, p(\theta_2 \mid \theta_3, Y)\, p(\theta_3 \mid Y)} \tag{2}$$

where the numerator is the product of the likelihood and the prior, with all integrating constants included, and the denominator is the posterior density of $\Theta$. Denote the posterior mode as $\tilde\Theta = [\tilde\theta_1, \tilde\theta_2, \tilde\theta_3]$. Hereafter, let $p(\cdot)$ denote a density for which an analytical expression is available and $\hat p(\cdot)$ denote a density that needs to be approximated using computational methods. Method 1 is obtained by factorizing (2) as follows:

$$\hat p_{M1}(Y) = \frac{\hat p(Y \mid \tilde\theta_3)\, p(\tilde\theta_3)}{\hat p(\tilde\theta_3 \mid Y)} \tag{3}$$

where $p(\tilde\theta_3)$ is the prior for the parameter block $\theta_3$ evaluated at the posterior mode, the conditional posterior $\hat p(\tilde\theta_3 \mid Y)$ is approximated using the Rao-Blackwellization technique proposed by Gelfand, Smith, and Lee (1992), and the conditional predictive density, $\hat p(Y \mid \tilde\theta_3)$, is defined as

$$\hat p(Y \mid \tilde\theta_3) = \frac{p(Y \mid \tilde\Theta)\, p(\tilde\theta_1, \tilde\theta_2 \mid \tilde\theta_3)}{\hat p(\tilde\theta_1, \tilde\theta_2 \mid \tilde\theta_3, Y)} \tag{4}$$

Note that $p(Y \mid \tilde\Theta)$ is the likelihood evaluated at the posterior mode and $p(\tilde\theta_1, \tilde\theta_2 \mid \tilde\theta_3)$ is the prior for the blocks $\theta_1$ and $\theta_2$ conditional on $\theta_3$, evaluated at the posterior mode. The denominator can be evaluated as follows:

$$\hat p(\tilde\theta_1, \tilde\theta_2 \mid \tilde\theta_3, Y) = \frac{1}{m} \sum_{i=1}^{m} p(\tilde\theta_1, \tilde\theta_2 \mid \tilde\theta_3, D^{(i)}, Y) \tag{5}$$

where the conditional posterior $p(\tilde\theta_1, \tilde\theta_2 \mid \tilde\theta_3, D^{(i)}, Y)$ can be calculated exactly because of the analytical tractability condition and $\{D^{(i)}\}_{i=1}^{m}$ is the output of a lower dimensional Gibbs sampler, usually called the reduced Gibbs step. The reduced Gibbs step delivers draws from the density $p(D \mid \tilde\theta_3, Y)$ by iteratively drawing $\{\theta_1^{(i)}, \theta_2^{(i)}, D^{(i)}\}_{i=1}^{m}$ from the conditional posterior distributions $p(\theta_1 \mid \theta_2, \tilde\theta_3, D, Y)$ and $p(\theta_2 \mid \theta_1, \tilde\theta_3, D, Y)$ and from the predictive density $p(D \mid \theta_1, \theta_2, \tilde\theta_3, Y)$.

Method 1 is a refinement of the estimator proposed by Chib (1995); the only difference between the two is the computation of the conditional posterior $\hat p(\tilde\theta_1, \tilde\theta_2 \mid \tilde\theta_3, Y)$ in the denominator of (4). Since Chib's method does not exploit the analytical tractability condition, it estimates this conditional posterior by taking the product of $\hat p(\tilde\theta_1 \mid \tilde\theta_2, \tilde\theta_3, Y)$ and $\hat p(\tilde\theta_2 \mid \tilde\theta_3, Y)$. This implies that two reduced Gibbs steps need to be performed to evaluate the denominator of (4): (i) one to obtain draws from the density $p(D \mid \tilde\theta_2, \tilde\theta_3, Y)$ so as to evaluate $\hat p(\tilde\theta_1 \mid \tilde\theta_2, \tilde\theta_3, Y)$ and (ii) another one to obtain draws from the density $p(D, \theta_1 \mid \tilde\theta_3, Y)$ so as to evaluate $\hat p(\tilde\theta_2 \mid \tilde\theta_3, Y)$. While Chib's estimator performs two reduced Gibbs steps, Method 1 requires only one because it exploits the analytical tractability condition. [3] Therefore, by construction, Method 1 is more accurate and less computationally burdensome than Chib's estimator.

[3] If the analytical tractability condition were satisfied for $p(\theta_1 \mid \theta_2, \theta_3, D, Y)$ instead, then Method 1 and Chib's estimator would coincide.
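The structure of Method 1 (an analytical conditional predictive density, a prior ordinate, and a Rao-Blackwellized posterior ordinate) can be sketched on a toy problem. Everything below is an illustrative assumption rather than the paper's application: the model is $y_i \sim N(a + b, 1)$ with independent $N(0, 1)$ priors, block $a$ plays the role of the analytically integrable blocks, block $b$ plays the role of $\theta_3$, there are no latent data $D$, the evaluation point is the mean of the Gibbs draws rather than the mode, and a short burn-in is used as a convenience.

```python
import math
import random


def log_normal_pdf(x, mean, var):
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def log_cond_predictive(y, b):
    """log p(Y | b): block a ~ N(0, 1) integrated out analytically."""
    n = len(y)
    resid = [yi - b for yi in y]
    ss = sum(r * r for r in resid)
    s = sum(resid)
    return (-0.5 * n * math.log(2.0 * math.pi)
            - 0.5 * math.log(1.0 + n)
            - 0.5 * (ss - s * s / (1.0 + n)))


def exact_two_block_log_mdd(y):
    """Closed-form log p(Y): a + b ~ N(0, 2), so Y ~ N(0, I + 2 * 11')."""
    n = len(y)
    ss = sum(yi * yi for yi in y)
    s = sum(y)
    return (-0.5 * n * math.log(2.0 * math.pi)
            - 0.5 * math.log(1.0 + 2.0 * n)
            - 0.5 * (ss - 2.0 * s * s / (1.0 + 2.0 * n)))


def method1_style_log_mdd(y, m=20000, burn=1000, seed=0):
    """log p(Y) = log p(Y | b*) + log p(b*) - log p_hat(b* | Y)."""
    rng = random.Random(seed)
    n, total = len(y), sum(y)
    cond_var = 1.0 / (n + 1.0)  # variance of a | b, Y and of b | a, Y
    a, b = 0.0, 0.0
    a_draws, b_draws = [], []
    for it in range(m + burn):  # two-block Gibbs sampler
        a = rng.gauss((total - n * b) * cond_var, math.sqrt(cond_var))
        b = rng.gauss((total - n * a) * cond_var, math.sqrt(cond_var))
        if it >= burn:
            a_draws.append(a)
            b_draws.append(b)
    b_star = sum(b_draws) / m  # high-density evaluation point
    # Rao-Blackwell estimate of the posterior ordinate p(b* | Y)
    log_terms = [log_normal_pdf(b_star, (total - n * ai) * cond_var, cond_var)
                 for ai in a_draws]
    mx = max(log_terms)
    log_ordinate = mx + math.log(sum(math.exp(t - mx) for t in log_terms) / m)
    return (log_cond_predictive(y, b_star)
            + log_normal_pdf(b_star, 0.0, 1.0)
            - log_ordinate)


y_obs = [0.3, -0.5, 1.2, 0.8, 0.1]  # made-up data
print(method1_style_log_mdd(y_obs), exact_two_block_log_mdd(y_obs))
```

Because block $a$ is integrated out in closed form, the only simulation error in this sketch comes from the Rao-Blackwellized ordinate of $b$.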

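Method 2 is based on combining the analytical tractability condition with the reciprocal importance sampling (RIS) principle proposed by Gelfand and Dey (1994). The marginal data density is given by

$$p(Y) = \left[ E_{p(D, \theta_3 \mid Y)} \left( \frac{p(\theta_1, \theta_2 \mid \theta_3, D, Y)}{p(Y \mid \theta_1, \theta_2, \theta_3)\, p(\theta_1, \theta_2 \mid \theta_3)\, p(\theta_3)}\, f(\theta_3) \right) \right]^{-1} \tag{6}$$

where $E_{p(D, \theta_3 \mid Y)}$ denotes the expectation taken with respect to the posterior density $p(D, \theta_3 \mid Y)$ and $f(\cdot)$ is a weighting function with the property $\int f(\theta_3)\, d\theta_3 = 1$. Therefore, Method 2 estimates the marginal data density as follows:

$$\hat p_{M2}(Y) = \left[ \frac{1}{m} \sum_{i=1}^{m} \frac{p(\tilde\theta_1, \tilde\theta_2 \mid \theta_3^{(i)}, D^{(i)}, Y)}{p(Y \mid \tilde\theta_1, \tilde\theta_2, \theta_3^{(i)})\, p(\tilde\theta_1, \tilde\theta_2 \mid \theta_3^{(i)})\, p(\theta_3^{(i)})}\, f(\theta_3^{(i)}) \right]^{-1}$$

where $\{\theta_3^{(i)}, D^{(i)}\}_{i=1}^{m}$ are the draws from the Gibbs sampler. The numerator is the conditional posterior, which is known because of the analytical tractability condition. In the denominator, we have the product of the likelihood and the joint prior, with all integrating constants included. As proposed by Geweke (1999), we consider the following weighting function:

$$f(\theta_3^{(i)}) = \frac{1}{p\, (2\pi)^{d/2}\, |V|^{1/2}} \exp\left\{ -\frac{1}{2} (\theta_3^{(i)} - \tilde\theta_3)' V^{-1} (\theta_3^{(i)} - \tilde\theta_3) \right\} \times I\left\{ (\theta_3^{(i)} - \tilde\theta_3)' V^{-1} (\theta_3^{(i)} - \tilde\theta_3) \le F_{\chi^2_d}(\nu) \right\} \tag{7}$$

where $p \in [0, 1]$, $d$ is the dimension of the parameter vector $\mathrm{vec}(\theta_3)$, $I\{\cdot\}$ is an indicator function, and $F_{\chi^2_d}(\nu)$ is the cumulative distribution function of a chi-square distribution with $\nu$ degrees of freedom, where the hyperparameter $\nu$ is chosen so as to minimize the numerical standard error of the estimator. The standard RIS estimator proposed by Gelfand and Dey (1994) is given by

$$\hat p_{GD}(Y) = \left[ \frac{1}{m} \sum_{i=1}^{m} \frac{f(\theta_1^{(i)}, \theta_2^{(i)}, \theta_3^{(i)}, D^{(i)})}{p(Y \mid \theta_1^{(i)}, \theta_2^{(i)}, \theta_3^{(i)}, D^{(i)})\, p(\theta_1^{(i)}, \theta_2^{(i)}, \theta_3^{(i)}, D^{(i)})} \right]^{-1} \tag{8}$$

Note that the standard RIS estimator proposed by Gelfand and Dey (1994) uses all the posterior draws for $(\theta_1, \theta_2, \theta_3, D)$, which makes it a global estimator. By exploiting the analytical tractability condition, Method 2 relies on setting the first two parameter blocks equal to the posterior mode and using only the posterior draws for $\theta_3$ and $D$. Therefore, Method 2 is a hybrid estimator: local for $(\theta_1, \theta_2)$ and global for $(\theta_3, D)$. [4]

2.2 General Case

Let us consider an $s$-block parameter vector, $\Theta \equiv \{\theta_1, \ldots, \theta_s\}$. We assume that the prior distribution, $p(\Theta)$, is known and the likelihood function, $p(Y \mid \Theta)$, is either known in closed form or easy to evaluate. The two necessary conditions to apply our estimators can be written as:

(i) Sampling condition: It is possible to draw from the conditional posterior distributions $p(\theta_i \mid \Theta_{-i}, D, Y)$, where $\Theta_{-i} \equiv \{\theta_1, \ldots, \theta_{i-1}, \theta_{i+1}, \ldots, \theta_s\}$, for any $i \in \{1, \ldots, s\}$, and from the posterior predictive density, $p(D \mid \Theta, Y)$.

(ii) Analytical tractability condition: The conditional posterior distributions $p(\Theta_{\le\tau} \mid \Theta_{>\tau}, D, Y)$, where $\Theta_{\le\tau} \equiv \{\theta_1, \ldots, \theta_\tau\}$ and $\Theta_{>\tau} \equiv \{\theta_{\tau+1}, \ldots, \theta_s\}$, are analytically tractable for some $\tau \in \{2, \ldots, s\}$.

Method 1 is given by

$$\hat p_{M1}(Y) = \frac{\hat p(Y \mid \tilde\Theta_{>\tau})\, p(\tilde\Theta_{>\tau})}{\hat p(\tilde\Theta_{>\tau} \mid Y)} \tag{9}$$

where $p(\tilde\Theta_{>\tau})$ is the prior for the parameter blocks $\theta_{\tau+1}, \ldots, \theta_s$ and the conditional predictive density is computed as follows:

$$\hat p(Y \mid \tilde\Theta_{>\tau}) = \frac{p(Y \mid \tilde\Theta)\, p(\tilde\Theta_{\le\tau} \mid \tilde\Theta_{>\tau})}{\hat p(\tilde\Theta_{\le\tau} \mid \tilde\Theta_{>\tau}, Y)} \tag{10}$$

The analytical tractability condition allows us to compute the denominator of (10) as follows:

$$\hat p(\tilde\Theta_{\le\tau} \mid \tilde\Theta_{>\tau}, Y) = \frac{1}{m} \sum_{i=1}^{m} p(\tilde\Theta_{\le\tau} \mid \tilde\Theta_{>\tau}, D^{(i)}, Y) \tag{11}$$

[4] If there are no latent data, Method 2 becomes a global estimator, since the Method 2 estimator of Section 2.1 becomes

$$\hat p_{M2}(Y) = \left[ \frac{1}{m} \sum_{i=1}^{m} \frac{f(\theta_3^{(i)})}{p(Y \mid \theta_3^{(i)})\, p(\theta_3^{(i)})} \right]^{-1}$$

where $p(Y \mid \theta_3^{(i)})$ is the conditional predictive density, which is insensitive to the point at which $\{\theta_1, \theta_2\}$ are evaluated.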

where $\{D^{(i)}\}_{i=1}^{m}$ is the output of a reduced Gibbs step that iteratively draws $\{\Theta_{\le\tau}^{(i)}, D^{(i)}\}_{i=1}^{m}$ from the known distributions $p(\theta_i \mid \theta_1, \ldots, \theta_{i-1}, \theta_{i+1}, \ldots, \theta_\tau, \tilde\Theta_{>\tau}, D, Y)$, for $1 \le i \le \tau$, and from the predictive density $p(D \mid \Theta_{\le\tau}, \tilde\Theta_{>\tau}, Y)$. The conditional posterior in the denominator of (9) can be estimated as [5]

$$\hat p(\tilde\Theta_{>\tau} \mid Y) = \prod_{i=1}^{s-\tau} \hat p(\tilde\theta_{\tau+i} \mid \tilde\Theta_{>\tau+i}, Y)$$

where the ordinates $\hat p(\tilde\theta_{\tau+i} \mid \tilde\Theta_{>\tau+i}, Y)$, for $1 \le i < s - \tau$, can be approximated by running $s - \tau - 1$ reduced Gibbs steps and the smallest ordinate $\hat p(\tilde\theta_s \mid Y)$ can be approximated via Rao-Blackwellization based on draws from the Gibbs sampler. [6]

Method 2 computes the marginal data density, $p(Y)$, as follows:

$$\hat p_{M2}(Y) = \left[ \frac{1}{m} \sum_{i=1}^{m} \frac{p(\tilde\Theta_{\le\tau} \mid \Theta_{>\tau}^{(i)}, D^{(i)}, Y)}{p(Y \mid \tilde\Theta_{\le\tau}, \Theta_{>\tau}^{(i)})\, p(\tilde\Theta_{\le\tau} \mid \Theta_{>\tau}^{(i)})\, p(\Theta_{>\tau}^{(i)})}\, f(\Theta_{>\tau}^{(i)}) \right]^{-1} \tag{12}$$

where $\{\Theta_{>\tau}^{(i)}, D^{(i)}\}_{i=1}^{m}$ are the draws from the Gibbs sampler. It should be noted that when $\tau = s$ (i.e., all the parameter blocks can be integrated out analytically), we have $\Theta_{>\tau} = \emptyset$, which implies that Method 1 and Method 2 coincide. [7]

To sum up, for $1 < \tau < s$, applying Method 1 requires running $s - \tau$ reduced Gibbs steps as opposed to the $s - 1$ steps performed by Chib's method. [8] Thus, the gains from applying Method 1 relative to Chib's method are expected to become more and more substantial as the number of blocks $\tau$ that can be integrated out increases. Nevertheless, Method 1 overlaps with Chib's method when performing reduced Gibbs steps for $i \in \{\tau + 1, \ldots, s - 1\}$. Note that these simulations are the most computationally cumbersome among all the reduced Gibbs steps performed by Chib's method because they are the ones that integrate out the largest number of parameter blocks.

When the total number of parameter blocks, $s$, is much larger than the number of blocks that can be integrated out, $\tau$, Method 1 may still be computationally cumbersome. In these cases, and when a large number of repeated computations of MDDs is required (e.g., Bayesian averaging over a large number of models), Method 2 provides the fastest approach. It is important to emphasize that Method 2 only requires running the Gibbs sampler posterior simulator; no reduced Gibbs step has to be performed.

[5] Conventionally, $\Theta_{>s} = \emptyset$.

[6] Therefore, Method 1 requires running a total of $s - \tau$ reduced Gibbs steps. It is noteworthy that if $\tau = 1$, then Method 1 requires $s - 1$ reduced Gibbs steps, which is the same number of steps required by Chib's method. Thus, if $\tau = 1$, both estimators coincide.

[7] We thank an anonymous referee for pointing this out.

[8] Note that when there is no data augmentation, Method 1 requires running one reduced Gibbs step less, that is, $s - \tau - 1$. To see why, note that the analytical tractability condition implies that the conditional posterior $p(\Theta_{\le\tau} \mid \Theta_{>\tau}, Y)$ is known when no data augmentation is required. As far as Chib's estimator is concerned, note that the largest ordinate $p(\theta_1 \mid \Theta_{>1}, Y)$ is usually analytically tractable in many applications that do not require data augmentation (e.g., the Monte Carlo experiment in this paper), implying that the actual number of reduced Gibbs steps to be performed is $s - 2$.

2.3 Scope of Application

Unlike the Chib and Gelfand-Dey estimators, our methods are only applicable when both the sampling and the analytical tractability conditions are met. However, both conditions can be shown to be satisfied by a large class of time series econometric models. In particular, we can show that the conditions are met by vector autoregressive (VAR) models, just-identified structural VARs [9], reduced rank regression (RRR) models, unrestricted Markov-switching VAR models, dynamic factor models (DFMs), factor augmented VAR models (FAVARs), and time-varying parameter (TVP) VAR models. We explore in detail the application to VAR models in the next section. In the appendix, we provide a guide on how to partition the parameter space so that the sampling and the analytical tractability conditions are satisfied in the remaining models.

[9] Let $\Omega$ be an orthonormal matrix through which the econometrician specifies the identification restrictions for the VAR. It directly follows that if (i) the identification scheme does not impose restrictions on the reduced-form parameters and (ii) the conditional distribution of the matrix $\Omega$ does not get updated by the data, then our two estimators are applicable. These conditions are satisfied by recursive VARs and some non-recursive VARs identified with short-run or long-run restrictions.

3 A Monte Carlo Experiment

In this section, we assess the gains in accuracy and computational burden of the estimators proposed in the paper by means of a Monte Carlo experiment. In Section 3.1, we describe the modeling framework and the application of the four estimators used in the experiment, that is, Chib's estimator, Method 1, Method 2, and Gelfand and Dey's estimator. We discuss the data set and the priors used in the empirical application in Section 3.2. We quantify the gains in accuracy and computational burden associated with our estimators in Sections 3.3 and 3.5, respectively. In Section 3.4, we provide evidence on the distorting effects that the estimation bias, caused by neglecting the analytical tractability condition, may have on posterior model rankings.
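Two of the four estimators compared in the experiment, Method 2 and Gelfand and Dey's estimator, rest on the reciprocal importance sampling identity. The sketch below illustrates that identity with a truncated-normal weighting function in the spirit of Geweke (1999) on a one-parameter toy model. The model ($y_i \sim N(\theta, 1)$, $\theta \sim N(0, 1)$), the data, the truncation probability `p=0.9`, the function names, and the use of direct posterior draws as a stand-in for Gibbs output are all assumptions made for illustration.

```python
import math
import random
from statistics import NormalDist


def log_kernel(theta, y):
    """log p(Y | theta) + log p(theta) for y_i ~ N(theta, 1), theta ~ N(0, 1)."""
    n = len(y)
    log_lik = (-0.5 * n * math.log(2.0 * math.pi)
               - 0.5 * sum((yi - theta) ** 2 for yi in y))
    log_prior = -0.5 * math.log(2.0 * math.pi) - 0.5 * theta ** 2
    return log_lik + log_prior


def exact_log_mdd_1d(y):
    """Closed-form log p(Y) for the toy model, used only as a check."""
    n = len(y)
    ss = sum(yi * yi for yi in y)
    s = sum(y)
    return (-0.5 * n * math.log(2.0 * math.pi)
            - 0.5 * math.log(1.0 + n)
            - 0.5 * (ss - s * s / (1.0 + n)))


def ris_log_mdd(draws, y, p=0.9):
    """RIS estimate: 1 / p(Y) is the posterior mean of f(theta) / kernel(theta)."""
    m = len(draws)
    center = sum(draws) / m
    var = sum((d - center) ** 2 for d in draws) / (m - 1)
    # In one dimension, the chi-square(1) quantile at probability p is z ** 2.
    z = NormalDist().inv_cdf(0.5 * (1.0 + p))
    log_ratios = []
    for d in draws:
        q = (d - center) ** 2 / var
        if q <= z * z:  # inside the truncation region; f is zero outside
            log_f = -math.log(p) - 0.5 * math.log(2.0 * math.pi * var) - 0.5 * q
            log_ratios.append(log_f - log_kernel(d, y))
    mx = max(log_ratios)
    log_mean = mx + math.log(sum(math.exp(r - mx) for r in log_ratios) / m)
    return -log_mean


y_toy = [0.3, -0.5, 1.2, 0.8, 0.1]  # made-up data
n_obs = len(y_toy)
rng = random.Random(1)
# The posterior is N(sum(y) / (n + 1), 1 / (n + 1)); direct draws replace Gibbs output.
posterior_draws = [rng.gauss(sum(y_toy) / (n_obs + 1.0), math.sqrt(1.0 / (n_obs + 1.0)))
                   for _ in range(20000)]
print(ris_log_mdd(posterior_draws, y_toy), exact_log_mdd_1d(y_toy))
```

Draws that fall outside the truncation region contribute zero to the average but still count in the divisor `m`, which is what makes the weighting function integrate to one.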

3.1 The Model

Following Villani (2009) and Del Negro and Schorfheide (2010), the VAR model in mean-adjusted form can be expressed as

$$Y = D\Gamma + \tilde Y \tag{13}$$

$$\tilde Y = X\Phi + \varepsilon \tag{14}$$

where we denote the sample length as $T$ and define the $T \times n$ matrix of observables $Y = (y_1, \ldots, y_T)'$; the $T \times (l+1)$ matrix $D = [1_T', (1, \ldots, T)', \ldots, (1, \ldots, T^l)']$, with $1_T$ being a $1 \times T$ vector of ones; the $(l+1) \times n$ matrix $\Gamma = (\gamma_0, \ldots, \gamma_l)'$; the $T \times n$ matrix of de-trended and de-meaned observables $\tilde Y = (\tilde y_1, \ldots, \tilde y_T)'$; the $T \times np$ matrix $X = (x_1, \ldots, x_T)'$, where the $np \times 1$ vectors are $x_t = (\tilde y_{t-1}', \ldots, \tilde y_{t-p}')'$; the $np \times n$ parameter matrix $\Phi = [\phi_1, \ldots, \phi_p]'$; and the $T \times n$ matrix of Gaussian residuals $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_T)'$, whose covariance matrix is denoted by $\Sigma$. We consider three parameter blocks: the block for the mean and the deterministic trend, $\Gamma$; the block for the autoregressive parameters $\Phi$ of the VAR in deviations; and the parameters of the covariance matrix $\Sigma$ for the VAR in deviations. The block order is chosen such that $\theta_1 = \Phi$, $\theta_2 = \Sigma$, and $\theta_3 = \Gamma$.

Note that, conditional on the parameter block $\Gamma$, equations (13)-(14) can be interpreted as a multivariate linear Gaussian regression model. Therefore, under prior conjugacy, the posterior distribution $p(\Phi, \Sigma \mid \Gamma, Y)$ is analytically tractable and belongs to the Multivariate-Normal-Inverted-Wishart (MN-IW) family. This suffices to guarantee that the analytical tractability condition is satisfied for $\tau = 2$. Moreover, if the prior for $\Gamma$ is independent and Gaussian, the conditional posterior $p(\Gamma \mid \Phi, \Sigma, Y)$ can be shown to be Gaussian as well (see the online appendix). Therefore, the sampling condition is satisfied.

Since the sampling and analytical tractability conditions are satisfied with $\tau = 2$, Method 1 computes the MDD as follows:

$$\hat p_{M1}(Y) = \frac{p(Y \mid \tilde\Gamma)\, p(\tilde\Gamma)}{\hat p(\tilde\Gamma \mid Y)} \tag{15}$$

where the conditional predictive density, $p(Y \mid \tilde\Gamma)$, has a closed-form expression. For instance, when the prior for the parameters of the VAR in deviations, $p(\Phi, \Sigma \mid \Gamma)$, is a dummy-observation prior, the conditional predictive density can be shown to be given by

$$p(Y \mid \Gamma) = \frac{\pi^{-\frac{(T_0 + T_1 - np)n}{2}}\, |\bar X' \bar X|^{-\frac{n}{2}}\, |\bar S|^{-\frac{T_0 + T_1 - np}{2}}\, \Gamma_n\!\left(\frac{T_0 + T_1 - np}{2}\right)}{\pi^{-\frac{(T_0 - np)n}{2}}\, |X^{*\prime} X^{*}|^{-\frac{n}{2}}\, |S^{*}|^{-\frac{T_0 - np}{2}}\, \Gamma_n\!\left(\frac{T_0 - np}{2}\right)} \tag{16}$$

where $Y^{*}$ and $X^{*}$ are matrices that stack dummy observations for the VAR in deviations; $\tilde Y$ and $\tilde X$ are the data in deviations obtained by de-meaning and de-trending the actual data $Y$ with $\Gamma$; $T_0$ is the number of dummy observations; $T_1$ is the total number of observations, $T_1 = T + T_0$; $n$ is the number of variables; $p$ is the number of lags; $\bar Y = [Y^{*\prime}, \tilde Y']'$; $\bar X = [X^{*\prime}, \tilde X']'$; $\Gamma_n(\cdot)$ is the multivariate gamma function; $\bar S = (\bar Y - \bar X \hat\Phi)'(\bar Y - \bar X \hat\Phi)$ with $\hat\Phi = (\bar X' \bar X)^{-1} \bar X' \bar Y$; and $S^{*} = (Y^{*} - X^{*} \Phi^{*})'(Y^{*} - X^{*} \Phi^{*})$ with $\Phi^{*} = (X^{*\prime} X^{*})^{-1} X^{*\prime} Y^{*}$. Finally, the marginalized posterior $\hat p(\tilde\Gamma \mid Y)$ in the denominator of (15) is computed by implementing a Rao-Blackwell strategy.

A naïve application of Chib's method, ignoring the fact that the conditional predictive density $p(Y \mid \tilde\Gamma)$ has a known analytical expression, computes

$$\hat p_{CHIB}(Y \mid \tilde\Gamma) = \frac{p(Y \mid \tilde\Sigma, \tilde\Phi, \tilde\Gamma)\, p(\tilde\Sigma, \tilde\Phi \mid \tilde\Gamma)}{p(\tilde\Phi \mid \tilde\Sigma, \tilde\Gamma, Y)\, \hat p(\tilde\Sigma \mid \tilde\Gamma, Y)} \tag{17}$$

where $\hat p(\tilde\Sigma \mid \tilde\Gamma, Y)$ is approximated computationally using the output from the reduced Gibbs step as follows:

$$\hat p(\tilde\Sigma \mid \tilde\Gamma, Y) = \frac{1}{m} \sum_{i=1}^{m} p(\tilde\Sigma \mid \Phi^{(i)}, \tilde\Gamma, Y) \tag{18}$$

Method 2 computes

$$\hat p_{M2}(Y) = \left[ \frac{1}{m} \sum_{i=1}^{m} \frac{f(\Gamma^{(i)})}{p(Y \mid \Gamma^{(i)})\, p(\Gamma^{(i)})} \right]^{-1} \tag{19}$$

where the draws $\Gamma^{(i)}$ are the draws from the Gibbs sampler. [10] We analytically evaluate the posterior kernel $p(Y \mid \Gamma^{(i)})\, p(\Gamma^{(i)})$ and the weighting function $f(\Gamma^{(i)})$ for each draw of $\Gamma$.

The application of Gelfand and Dey's method (henceforth, the GD method) to the model (13)-(14) is straightforward and hence omitted. In the Monte Carlo exercise, we use the weighting function $f(\cdot)$ proposed by Geweke (1999) when implementing Method 2 and the GD method. [11] The degrees of freedom of the weighting function are chosen so as to minimize the numerical standard error of the estimator. In what follows, we set $\tilde\Phi$, $\tilde\Sigma$, and $\tilde\Gamma$ equal to the posterior mean. [12]

[10] In order to implement this approach, we need the draws $\{\Gamma^{(i)}\}_{i=1}^{m}$ from the marginalized posterior $p(\Gamma \mid Y)$. These draws are simply the set of draws $\{\Gamma^{(i)}\}_{i=1}^{m}$ that come from the output of the Gibbs sampler.

[11] However, note that while the weighting function for Method 2 is defined over the space of the vector block $\Gamma$, the one for the GD estimator is defined over the entire parameter space.

[12] The results of the experiment are virtually the same if $\tilde\Phi$, $\tilde\Sigma$, and $\tilde\Gamma$ are set equal to the posterior median.

3.2 Data, Prior Specification, and Number of Simulations

We fit four VAR models with different lags to six encompassing data sets. In particular, we fit autoregressive models with lags $p = 1, \ldots, 4$ to data sets containing from one up to six variables, which are, in order: Real Gross Domestic Product (source: Bureau of Economic Analysis, GDPC96); Implicit Price Deflator (source: Bureau of Economic Analysis, GDPDEF); Personal Consumption Expenditures (source: Bureau of Economic Analysis, PCEC); Fixed Private Investment (source: Bureau of Economic Analysis, FPI); Effective Federal Funds Rate (source: Board of Governors of the Federal Reserve System, FEDFUNDS); and Average Weekly Hours Duration in the Non-farm Business (source: U.S. Department of Labor, PRS). The encompassing data sets are such that the one-variate models consider the real GDP data, the two-variate models consider GDP and the price deflator, and so on until the six-variate models, which contain all the data series listed above. The quarterly data set ranges from 1954:1 to 2008:4.

We elicit the prior density for the parameters of the VAR in deviations, $(\Phi, \Sigma)$, by using the single-unit-root prior suggested by Sims and Zha (1998). We follow Del Negro and Schorfheide (2004), Giannone, Lenza, and Primiceri (2012), and Carriero, Kapetanios, and Marcellino (2010) in setting the hyperparameters of the prior so as to maximize the conditional predictive density, $p(Y \mid \tilde\Gamma)$, where $\tilde\Gamma$ stands for the posterior mean. [13] To this end, we perform a stochastic search based on simulated annealing (Judd, 1998) with 1,000 stochastic draws. Furthermore, the prior density depends on the first and second moments of some pre-sample data. We use the moments of a pre-sample ranging from 1947:1 to 1953:4.

We run ten chains of $m$ draws in the Gibbs sampler and in the reduced Gibbs sampler, where $m \in \{1{,}000;\ 10{,}000;\ 100{,}000\}$. We also run one chain with one million draws. We do not use a burn-in period.

[13] Very similar results are found using the procedure proposed by Banbura, Giannone, and Reichlin (2010), which automatically adjusts the prior hyperparameters as the number of observables is increased. For any number of observables from three to six, we set the hyperparameters so that the fit of the VAR(4) in deviations in the pre-sample 1947:1-1953:4 is comparable with that of the trivariate VAR(1) estimated with OLS. The values for the hyperparameters obtained with this procedure are very similar to those computed by maximizing the conditional predictive density, $p(Y \mid \tilde\Gamma)$.

3.3 Gains in Accuracy

Our estimators rely on the insight that exploiting the analytical tractability condition increases the accuracy of MDD estimators. In this empirical application, we assess the inaccuracy associated with neglecting the analytical tractability condition. Consider the VAR model of the form (13)-(14). In this framework, Method 1 differs from Chib's method only in the computation of the conditional predictive density, $p(Y \mid \tilde\Gamma)$. While Method 1 calculates the conditional predictive density $p(Y \mid \tilde\Gamma)$ exactly via its analytical expression, Chib's method approximates it computationally via equation (17), which requires performing the reduced Gibbs step to implement the integration in (18). Thus, the inaccuracy derived from neglecting the analytical tractability condition can be quantified by the gap between the conditional predictive density estimated using Chib's approach, $\hat p_{CHIB}(Y \mid \tilde\Gamma)$, and its true value, $p(Y \mid \tilde\Gamma)$. Note that, as the number of draws in the reduced Gibbs step, $m$, goes to infinity, the size of the gap goes to zero, that is, $\lim_{m \to \infty} \hat p_{CHIB}(Y \mid \tilde\Gamma) = p(Y \mid \tilde\Gamma)$. In this application, we assess the convergence of Chib's method to the true conditional predictive density by computing

$$\left| \log \hat p_{CHIB}(Y \mid \tilde\Gamma) - \log p(Y \mid \tilde\Gamma) \right| \tag{20}$$

We refer to this difference as the estimation bias for the conditional predictive density. We compute the absolute difference in (20) for every chain, VAR model ($p = 1, \ldots, 4$), and data set. The upper graph of Figure 1 reports the across-chain mean of the estimation bias for the conditional predictive density for the 24 models of interest when performing 1,000,000 draws in both the Gibbs sampler and the reduced Gibbs step. Two results are worth emphasizing.
First, for a given number of lags $p$, the estimation bias grows at an increasing rate as the number of observable variables increases. Second, for a given number of observables, the estimation bias grows at an increasing rate as the number of lags $p$ increases. For example, the size of the gap for a six-variate VAR(4) is about 9 times the size of the bias for the corresponding VAR(1) model.

We document in Table 1 the convergence of the estimation bias as the number of draws in the reduced Gibbs step increases for six-variate VAR models. We conclude that, for a given data set and a given model, the bias is quite stable despite the increase in the number of posterior draws in the reduced Gibbs step. This suggests that the integration in (18) converges rather slowly.

The comparison of Method 2 and the GD method is not as straightforward as the one between Method 1 and Chib's estimator. Table 2 reports the across-chain means and standard deviations of the log MDD for each of the estimators and models for the six-variate data set. The GD method is found to be both biased and quite unstable, since its across-chain standard deviations are larger than those of Method 2. In complex models, such as the VAR(4), the standard deviation of the GD method is 17 log-points when 100,000 posterior draws are used, while that of Method 2 is 0.16 log-points. Geweke (2005) points out that the greater the dimension of the parameter space, the greater the variation in the ratio of the weighting function to the probability density that is central to the GD method. [14] Exploiting the analytical tractability condition is crucial to prevent Method 2 from sharing this variability issue with the GD estimator. Note that the dimensions of the vectorized parameter blocks are $n^2$ for $\mathrm{vec}(\Sigma)$, $n^2 p$ for $\mathrm{vec}(\Phi)$, and $2n$ for $\mathrm{vec}(\Gamma)$. Then, using the analytical tractability condition reduces the dimensionality of the relevant parameter space from $n[2 + n(1 + p)]$ to $2n$. Therefore, the density ratio used in Method 2 may become impractical only for sufficiently high dimensional data sets, but it is stable even for rich lag structures.

3.4 Model Selection

In this section, we analyze the effect of inaccurate estimates when performing Bayesian model selection. Under a 0-1 loss function, the optimal decision is to select the model with the largest posterior probability (see Schorfheide). Let us define the model set to be formed by the four VAR models, that is, $\{\mathrm{VAR}(p),\ 1 \le p \le 4\}$ [15], in the six-variate data set.
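The selection rule can be made concrete with a short sketch that turns log MDDs into posterior model probabilities. The log-MDD values below are invented for illustration and are not results from the paper; the function name is hypothetical, and equal prior model probabilities are assumed by default.

```python
import math


def posterior_model_probs(log_mdds, prior_probs=None):
    """Posterior model probabilities from log MDDs (equal priors by default)."""
    k = len(log_mdds)
    if prior_probs is None:
        prior_probs = [1.0 / k] * k
    log_w = [lm + math.log(pp) for lm, pp in zip(log_mdds, prior_probs)]
    mx = max(log_w)  # subtract the maximum before exponentiating to avoid underflow
    w = [math.exp(v - mx) for v in log_w]
    total = sum(w)
    return [wi / total for wi in w]


# Hypothetical log MDDs for VAR(1), ..., VAR(4); made-up numbers.
hypothetical_log_mdds = [-1250.0, -1240.0, -1232.0, -1230.0]
probs = posterior_model_probs(hypothetical_log_mdds)
best = max(range(len(probs)), key=probs.__getitem__)  # index of the selected model
print(probs, "selected: VAR(%d)" % (best + 1))
```

Because probabilities depend on differences in log MDDs, a bias of only a few log-points in one model's estimate is enough to reverse a ranking.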
We assume that the prior model probabilities, $\{\pi_{p,0},\ 1 \le p \le 4\}$, are the same across the four candidate models. For every estimator, we permute the MDDs estimated at each chain across the four VAR models, which delivers 10,000 quadruplets of posterior probabilities. The distributions of the posterior probabilities associated with the VAR(1) and the VAR(2) for Chib's estimator, Method 1, and Method 2 are a mass point at zero, suggesting that these methods strongly disfavor both the VAR(1) and the VAR(2). The GD method rarely selects the VAR(1) or the VAR(2). Therefore, in Figure 2, we only report the distributions of the 10,000 posterior probabilities computed by the four estimators for the VAR(3) and VAR(4) models. While both Method 1 and Method 2 lead to selecting the VAR(4), the distribution related to Chib's method implies a median posterior probability of about 20% for the VAR(4). Conversely, Chib's method strongly favors the VAR(3) model, with a median posterior probability of about 80%. These results show that the estimation bias due to a fully computational approach may significantly distort model rankings. Finally, the distributions related to the GD method are uniform for both models, which makes it impossible to draw inference over models.

Two important remarks about Figure 2 are in order. First, since Method 1 and Chib's estimator differ only in how they calculate the conditional posterior $p(\Sigma \mid \Gamma, Y)$, the observed bias in model ranking must be due to the inaccuracy associated with the integration in (18), which is based on the reduced Gibbs step. Second, although Methods 1 and 2 estimate the MDD through different approaches,[16] these two methods deliver posterior model rankings that are remarkably similar. Hence, the accuracy of the two methods proposed in the paper is of the same order of magnitude.

[14] Geweke (2005) states that the instability of the GD estimator grows with the dimension of the model. However, the across-chain standard deviations reported in Table 2 seem to be unrelated to dimensionality. This is due to the limited number of chains used in the Monte Carlo experiment. But, following Geweke (2005), with a large enough number of chains, the one-to-one relationship between dimensionality and instability of the GD estimator would emerge.

[15] We have extended the exercise to include the VAR(5) and VAR(6). We have decided not to present them in the paper because all estimators deliver very small MDDs for these two models. Hence, all the results discussed in this section are unchanged.

3.5 Computation Time

In this section, we study how the computational burden of our estimators and of their fully computational counterparts evolves as the number of observable variables, n, and the number of lags, p, increase. Recall that the dimensions of the vectorized parameter blocks are $n^2$ for vec(Σ), $n^2 p$ for vec(Φ), and $2n$ for vec(Γ). When exploiting the analytical tractability condition, the MDD estimators are sensitive to the dimension of Γ, that is, to the size of the data set, n, but not to the lag structure.
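The block dimensions recalled above can be tallied directly. The following is a minimal sketch of that arithmetic; the function names are illustrative and not part of the paper. It adds up the three block sizes for a six-variate VAR with one to four lags and compares the total with the $2n$ dimensions that remain once Σ and Φ are integrated out analytically.

```python
def full_dimension(n: int, p: int) -> int:
    """Sum of the block sizes n^2 (Sigma), n^2 * p (Phi) and 2n (Gamma)."""
    return n * n + n * n * p + 2 * n


def reduced_dimension(n: int) -> int:
    """Dimension left after integrating out Sigma and Phi: only Gamma, 2n."""
    return 2 * n


if __name__ == "__main__":
    n = 6  # six-variate data set, as in the empirical application
    for p in range(1, 5):
        # The sum of the blocks equals the closed form n[2 + n(1 + p)].
        assert full_dimension(n, p) == n * (2 + n * (1 + p))
        print(f"VAR({p}): full = {full_dimension(n, p)}, reduced = {reduced_dimension(n)}")
```

The reduced dimension does not depend on p, which is the arithmetic counterpart of the statement that the estimators are sensitive to n but not to the lag structure.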
The computational burden of fully computational methods depends upon both the size of the data set and the lag structure of the model. Therefore, exploiting the analytical tractability condition translates into a considerable reduction in the dimension of the parameter space, which makes the estimators less susceptible to the curse of dimensionality.

[16] Recall that Method 1 exploits the fact that the MDD can be expressed as the normalizing constant of the joint posterior density for model parameters. In contrast, Method 2 relies on the principle of reciprocal importance sampling.
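The reciprocal importance sampling principle mentioned above can be illustrated on a toy problem. The sketch below is not the authors' Method 2 and not their VAR application. It is a hypothetical scalar example (normal data with unit variance and a standard normal prior on the mean) in which the MDD is known in closed form, so the estimate can be checked. The weighting function is a truncated normal fitted to the posterior draws, in the spirit of the weighting functions commonly used with the GD estimator. All names and numerical settings are assumptions made for illustration.

```python
import math

import numpy as np


def log_norm_pdf(x, mean, var):
    """Log density of N(mean, var), elementwise."""
    return -0.5 * (np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)


def toy_ris_check(seed=0, n_obs=50, n_draws=20_000, c=1.96):
    """Return (RIS estimate, exact value) of the log MDD in a toy model.

    Toy model (an assumption of this sketch): y_i ~ N(theta, 1), theta ~ N(0, 1).
    """
    rng = np.random.default_rng(seed)
    y = rng.normal(0.5, 1.0, size=n_obs)

    # Conjugate posterior: theta | y ~ N(post_mean, post_var).
    post_var = 1.0 / (n_obs + 1.0)
    post_mean = y.sum() * post_var

    def log_kernel(theta):
        """log p(y | theta) + log p(theta), vectorized over theta."""
        theta = np.atleast_1d(theta)
        loglik = log_norm_pdf(y[None, :], theta[:, None], 1.0).sum(axis=1)
        return loglik + log_norm_pdf(theta, 0.0, 1.0)

    # Exact log MDD from the identity p(y) = p(y|theta) p(theta) / p(theta|y).
    exact = log_kernel(post_mean)[0] - log_norm_pdf(post_mean, post_mean, post_var)

    # Posterior draws (exact here; from a Gibbs sampler in practice).
    draws = rng.normal(post_mean, math.sqrt(post_var), size=n_draws)

    # Weighting function f: normal fitted to the draws, truncated to
    # |theta - m| <= c * s and renormalized so that it integrates to one.
    m, s2 = draws.mean(), draws.var()
    inside = np.abs(draws - m) <= c * math.sqrt(s2)
    log_f = log_norm_pdf(draws, m, s2) - math.log(math.erf(c / math.sqrt(2.0)))

    # 1 / p(y) is estimated by the posterior mean of f / [p(y|theta) p(theta)].
    log_ratio = np.where(inside, log_f - log_kernel(draws), -np.inf)
    mx = log_ratio.max()
    log_inv_mdd = mx + np.log(np.mean(np.exp(log_ratio - mx)))
    return -log_inv_mdd, exact


if __name__ == "__main__":
    ris, exact = toy_ris_check()
    print(f"RIS log MDD: {ris:.4f}   exact log MDD: {exact:.4f}")
```

In this one-dimensional case the ratio inside the average is nearly constant, so the estimate is stable. The variability discussed in the text arises when the same ratio has to be evaluated over a high-dimensional parameter space.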

Figure 3 shows how the computation time (in seconds) associated with each of the estimators under analysis varies as the number of observable variables, n, and the number of lags, p, increase. Comparing these figures, we conclude that Method 2 and the GD method are computationally more convenient than Method 1 and Chib's method for any model specification and any data set. We observe that for Method 2 (i) the computing time is almost invariant to the number of lags included in the model and (ii) the increases in computing time due to the inclusion of additional observable variables are quite small. Quite remarkably, estimating the MDD associated with a six-variate VAR(4) with Method 2 and 100,000 posterior draws[17] takes less than 1/10 of a second. While the computational burden of the GD method is quite small, it increases exponentially with the dimension of the model, that is, it suffers from the curse of dimensionality.

In the lower graph of Figure 1, we explore the difference in computing time between Chib's method and Method 1. Recall that these two estimators differ only in how they calculate the conditional posterior $p(\Sigma \mid \Gamma, Y)$. Hence, the figure shows how the computing time needed to perform the reduced Gibbs step changes as the number of lags or observables in the VAR model varies. We conclude that Chib's method suffers from the curse of dimensionality because of the reduced Gibbs step. These findings suggest that exploiting the analytical tractability condition reduces the curse of dimensionality that characterizes both Chib's estimator and the GD method.

[17] 100,000 draws ensure very reliable estimates since the size of the across-chain standard deviation is relatively small.

4 Concluding Remarks

The paper develops two new estimators for the marginal likelihood of the data. These estimators rely on the fact that in several widely used time series models it is generally possible to analytically integrate out one or more parameter blocks from the block-conditional posterior densities implied by the models. An application based on a standard macroeconomic data set reveals that our estimators translate into significant gains in accuracy and computational burden when compared to fully computational approaches. We find that the estimation bias associated with fully computational estimators may severely distort model rankings. Furthermore, we show that exploiting the analytical tractability condition mitigates the curse of dimensionality when estimating the marginal data density. In particular, Method 2 is fast enough to be well-suited for applications where the marginal likelihood of VAR models has to be computed several times (e.g., Bayesian selection or averaging across a large set of models).

The paper favors the idea that estimators that are tailored to the specific features of an econometric model are likely to dominate universal estimators, which are applicable to a broader set of models but rely on fully computational methods. Using estimators that exploit the information about the analytical structure of the model is very rewarding, especially for densely parameterized models. Furthermore, as we overview in the appendix, estimators that exploit the analytical structure of models to improve accuracy can be easily obtained for many popular time series models. The assessment of the accuracy gains that can be obtained from applying partly analytical estimators to popular time series models, such as TVP VAR models, FAVAR models, and DFMs, is an important avenue for future research. The results of the paper should encourage the development of new estimators that exploit the analytical structure of more involved models such as, for instance, restricted MS VAR models (e.g., Sims and Zha, 2006, and Sims, Waggoner, and Zha, 2008) or over-identified structural VAR models (e.g., Waggoner and Zha, 2003).
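Both the model selection exercise of Section 3.4 and the selection or averaging applications mentioned above rest on converting log MDDs into posterior model probabilities. The sketch below shows one way to do this under equal prior model probabilities, together with the permutation of chain-specific estimates across models. The function names and the numbers in the usage example are hypothetical. The choice of 10 chains per model is an assumption made only because it yields 10^4 = 10,000 quadruplets for four models.

```python
import itertools

import numpy as np


def posterior_model_probs(log_mdds, prior=None):
    """Posterior model probabilities from log MDDs (equal priors by default)."""
    log_mdds = np.asarray(log_mdds, dtype=float)
    if prior is None:
        prior = np.full(log_mdds.shape, 1.0 / log_mdds.size)
    log_post = log_mdds + np.log(np.asarray(prior, dtype=float))
    log_post = log_post - log_post.max()  # guard against underflow
    weights = np.exp(log_post)
    return weights / weights.sum()


def permuted_probs(log_mdd_by_model):
    """Posterior probabilities for every combination of chain-specific estimates.

    log_mdd_by_model holds one sequence per model, each with that model's
    log MDD estimates across chains. With M models and C chains per model
    the result has C**M rows and M columns.
    """
    combos = itertools.product(*log_mdd_by_model)
    return np.array([posterior_model_probs(combo) for combo in combos])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical log MDD estimates: 4 models, 10 chains each.
    centers = [-1500.0, -1480.0, -1465.0, -1463.0]
    chains = [c + 0.2 * rng.standard_normal(10) for c in centers]
    probs = permuted_probs(chains)
    print(probs.shape)
    print(np.median(probs, axis=0))
```

Subtracting the largest log value before exponentiating matters in practice: log MDDs in the hundreds or thousands of log-points would underflow to zero if exponentiated directly.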

References

Banbura, M., D. Giannone, and L. Reichlin (2010): Large Bayesian Vector Auto Regressions, Journal of Applied Econometrics, 25(1).
Bernanke, B. S., J. Boivin, and P. Eliasz (2005): Measuring the Effects of Monetary Policy: A Factor-Augmented Vector Autoregressive (FAVAR) Approach, Quarterly Journal of Economics, 120(1).
Carriero, A., G. Kapetanios, and M. Marcellino (2010): Forecasting Government Bond Yields with Large Bayesian VARs, CEPR Discussion Paper.
Carter, C., and R. Kohn (1994): On Gibbs Sampling for State Space Models, Biometrika, 81(3).
Chib, S. (1995): Marginal Likelihood from the Gibbs Output, Journal of the American Statistical Association, 90(432).
Del Negro, M., and F. Schorfheide (2004): Priors from General Equilibrium Models for VARs, International Economic Review, 45(2).
Del Negro, M., and F. Schorfheide: Bayesian Macroeconometrics, in The Handbook of Bayesian Econometrics, ed. by H. K. van Dijk, J. F. Geweke, and G. Koop. Oxford University Press.
Fiorentini, G., C. Planas, and A. Rossi (2011): The marginal likelihood of dynamic mixture models: some new results, Mimeo.
Forni, M., M. Hallin, M. Lippi, and L. Reichlin (2003): Do Financial variables help forecasting inflation and real activity in the Euro Area?, Journal of Monetary Economics, 50.
Frühwirth-Schnatter, S. (1995): Bayesian model discrimination and Bayes factors for linear Gaussian state space models, Journal of the Royal Statistical Society B, 57.
Gelfand, A. E., and D. K. Dey (1994): Bayesian Model Choice: Asymptotics and Exact Calculations, Journal of the Royal Statistical Society B, 56.
Gelfand, A. E., A. F. M. Smith, and T.-M. Lee (1992): Bayesian Analysis of Constrained Parameter and Truncated Data Problems Using Gibbs Sampling, Journal of the American Statistical Association, 87(418).
Geweke, J. (1989): Bayesian Inference in Econometric Models using Monte Carlo Integration, Econometrica, 57.
Geweke, J. (1996): Bayesian Reduced Rank Regression in Econometrics, Journal of Econometrics, 75(1).
Geweke, J. (2005): Contemporary Bayesian Econometrics and Statistics. Wiley-Interscience.
Geweke, J. F. (1999): Using Simulation Methods for Bayesian Econometric Models: Inference, Development and Communication, Econometric Reviews, 18.
Giannone, D., M. Lenza, and G. Primiceri (2012): Prior Selection for Vector Autoregressions, NBER Working Paper.
Hammersley, J. M., and D. C. Handscomb (1964): Monte Carlo Methods. Methuen, London.
Judd, K. L. (1998): Numerical Methods in Economics. The MIT Press, Boston.
Kloek, T., and H. K. van Dijk (1978): Bayesian Estimates of Equation System Parameters: an Application of Integration by Monte Carlo, Econometrica, 46(1).
Koop, G. (2011): Forecasting with Medium and Large Bayesian VARs, mimeo, University of Strathclyde.
Koop, G., and S. Potter (2004): Forecasting in Dynamic Factor Models Using Bayesian Model Averaging, Econometric Journal, 7(2).
Korobilis, D. (forthcoming): Forecasting in Vector Autoregressions with many predictors, Advances in Econometrics, Vol. 23: Bayesian Macroeconometrics.
Meng, X.-L., and S. Schilling (2002): Warp Bridge Sampling, Journal of Computational and Graphical Statistics, 11.
Meng, X.-L., and W. H. Wong (1996): Simulating Ratios of Normalizing Constants Via a Simple Identity: A Theoretical Exploration, Statistica Sinica, 6.
Pitt, P. G. M. K., and R. Kohn (2010): Bayesian Inference for Time Series State Space Models, in The Handbook of Bayesian Econometrics, ed. by H. K. van Dijk, J. F. Geweke, and G. Koop. Oxford University Press.
Primiceri, G. (2005): Time Varying Structural Vector Autoregressions and Monetary Policy, Review of Economic Studies, 72(3).
Schorfheide, F. (2000): Loss Function-Based Evaluation of DSGE Models, Journal of Applied Econometrics, 15(6).
Sims, C. A. (1980): Macroeconomics and Reality, Econometrica, 48.
Sims, C. A., D. F. Waggoner, and T. Zha (2008): Methods for Inference in Large Multiple-Equation Markov-Switching Models, mimeo.
Sims, C. A., and T. Zha (1998): Bayesian Methods For Dynamic Multivariate Models, International Economic Review, 39(4).
Sims, C. A., and T. Zha (2006): Were There Regime Switches in US Monetary Policy?, American Economic Review, 96(1).
Stock, J., and M. Watson (2002): Macroeconomic Forecasting Using Diffusion Indexes, Journal of Business and Economic Statistics, 20.
Villani, M. (2009): Steady State Priors for Vector Autoregressions, Journal of Applied Econometrics, 24(4).
Waggoner, D. F., and T. Zha (2003): A Gibbs sampler for structural vector autoregressions, Journal of Economic Dynamics & Control, 28.

A. Figures and Tables

Figure 1: Estimation bias. Upper panel: estimation bias (log-points) against the number of observables, for the VAR(1) to VAR(4). Lower panel: time differences (in seconds) between Chib's method and Method 1 against the number of observables, for the VAR(1) to VAR(4).

Figure 2: Distribution of Posterior Probabilities for VAR(p)

Figure 3: Computing time in seconds. Panels: Chib's method (100,000 draws in the Gibbs sampler and in the reduced Gibbs step); Method 1 (100,000 draws in the Gibbs sampler); Gelfand and Dey estimator (100,000 draws in the Gibbs sampler); Method 2 (100,000 draws in the Gibbs sampler). Each panel plots computing time in seconds against the number of observables, for the VAR(1) to VAR(4).

Table 1: Across-chain averages of the estimation bias for the conditional predictive density: six-variate VAR
Columns: Draws, VAR(1), VAR(2), VAR(3), VAR(4).
Notes: Across-chain means of absolute differences. The across-chain numerical standard errors are in italics. Draws refers to the number of posterior draws and the number of draws in the reduced Gibbs step. For one million draws, we do not report numerical standard errors since we consider one chain.

Table 2: Log-Marginal Data Density: Six-variate case
Columns: Model, Draws, and MDD estimator (Chib, Method 1, Method 2, GD). Rows: VAR(1), VAR(2), VAR(3), VAR(4).
Notes: Draws refers to the number of posterior draws and the number of draws in the reduced Gibbs step. Across-chain standard deviations are reported in italics. We do not report numerical standard errors when considering one million draws because we report the results for one chain.

B. A Guide to Using Method 1 and Method 2

B.1 Reduced Rank Regression Models

A reduced rank regression (RRR) model reads:

$$Y = X\Gamma + Z\Phi + u_t \qquad (21)$$

with $u_t \overset{iid}{\sim} N(0, \Sigma)$. X is an $n \times p$ matrix, Γ is $p \times L$, Z is $n \times k$, and Φ is $k \times L$. The matrix of coefficients Φ is full-rank, but the matrix Γ is assumed to have rank q, where $q < \max\{L, p\}$. Let us reparameterize the low-rank matrix as $\Gamma = \Psi\Omega$ and assume a normalization scheme that restricts Ψ. Under an inverted Wishart distribution for Σ and independent Gaussian shrinkage priors for each of the elements of Ψ and Ω, Geweke (1996) shows that the conditional predictive densities $\Phi \mid \Sigma, \Psi, \Omega, Y$; $\Sigma \mid \Psi, \Omega, Y$; $\Psi \mid \Phi, \Sigma, \Omega, Y$; and $\Omega \mid \Phi, \Sigma, \Psi, Y$ belong to the MN-IW family. Therefore, the sampling condition is satisfied. Conditional on Γ, the RRR model in (21) reduces to a multivariate linear Gaussian regression model. Given an MN-IW prior on $(\Phi, \Sigma) \mid \Gamma$, we conclude that the posterior $(\Phi, \Sigma) \mid \Gamma, Y$ is analytically tractable. Let us partition the parameter space of the RRR model in (21) as follows: $\theta_1 = \Phi$, $\theta_2 = \Sigma$, $\theta_3 = \Psi$, and $\theta_4 = \Omega$. Hence, the analytical tractability condition is satisfied for $\tau = 2$.

B.2 Unrestricted Markov-Switching (MS) VARs

Let us consider the model

$$y_t = x_t \Phi(K_t) + u_t \qquad (22)$$

where $\Phi(K_t) = [\Phi_1(K_t), \ldots, \Phi_p(K_t), \Phi_c(K_t)]$, $y_t$ is an $n \times 1$ vector of observable variables, and $u_t \sim N(0, \Sigma(K_t))$. $K_t$ is a discrete M-state Markov process with time-invariant transition probabilities $\pi_{lm} = P[K_t = l \mid K_{t-1} = m]$, $l, m \in \{1, \ldots, M\}$. For simplicity, let us assume that $M = 2$. Let T be the sample length, $K = (K_1, \ldots, K_T)$ be the history of regimes, $[\Phi_j, \Sigma_j]_{j \in \{1,2\}} = \{\Phi_1, \Sigma_1, \Phi_2, \Sigma_2\}$, and $(\pi_{jj})_{j \in \{1,2\}} = \{\pi_{11}, \pi_{22}\}$. Let us partition the parameter space of the model as follows: $\theta_1 = (\pi_{jj})_{j \in \{1,2\}}$, $\theta_2 = \Phi_1$, $\theta_3 = \Sigma_1$, $\theta_4 = \Phi_2$, $\theta_5 = \Sigma_2$.
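To make the regime process concrete, the following sketch simulates a history of regimes K from a two-state Markov chain governed by the staying probabilities π11 and π22. It only illustrates the process defined above and is not part of the estimation procedure. The function name, the initial state k0, and the numerical values are assumptions made for illustration.

```python
import numpy as np


def simulate_regimes(T, pi11, pi22, rng, k0=1):
    """Simulate K_1, ..., K_T from a two-state Markov chain.

    pi11 and pi22 are the probabilities of staying in regime 1 and regime 2.
    k0 is a hypothetical initial regime used to draw K_1.
    """
    stay = {1: pi11, 2: pi22}
    K = np.empty(T, dtype=int)
    prev = k0
    for t in range(T):
        # Stay in the current regime with probability pi_jj, otherwise switch.
        K[t] = prev if rng.random() < stay[prev] else 3 - prev
        prev = K[t]
    return K


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    K = simulate_regimes(T=200, pi11=0.95, pi22=0.90, rng=rng)
    # Fraction of periods spent in each regime.
    print((K == 1).mean(), (K == 2).mean())
```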
Conditional on the history of regimes, K, (i) the model in (22) reduces to a VAR model with dummy variables that account for known structural breaks and (ii) the transition probabilities, $(\pi_{jj})_{j \in \{1,2\}}$, are independent of the data and of the remaining parameters of the model, $[\Phi_j, \Sigma_j]_{j \in \{1,2\}}$. As a result, if the prior distributions for $\Phi_l$ and $\Sigma_l$, $l \in \{1, 2\}$, are of the MN-IW form and the priors for $\pi_{11}$ and $\pi_{22}$ are independent beta distributions, then the conditional posterior distributions of $(\Phi_l, \Sigma_l) \mid K, Y$, $l \in \{1, 2\}$, and
