Robust Deviance Information Criterion for Latent Variable Models


Yong Li (Renmin University of China), Tao Zeng (Singapore Management University), Jun Yu (Singapore Management University)

February 15, 2014

Abstract

It is shown in this paper that the data augmentation technique undermines the theoretical underpinnings of the deviance information criterion (DIC), a widely used information criterion for Bayesian model comparison, although it facilitates parameter estimation for latent variable models via Markov chain Monte Carlo (MCMC) simulation. Data augmentation makes the likelihood function non-regular and hence invalidates the standard asymptotic arguments. A robust form of DIC, denoted RDIC, is advocated for Bayesian comparison of latent variable models. RDIC is shown to be a good approximation to DIC without data augmentation. While the latter quantity is difficult to compute, the expectation-maximization (EM) algorithm facilitates the computation of RDIC when the MCMC output is available. Moreover, RDIC is robust to nonlinear transformations of latent variables and to distributional representations of model specification. The proposed approach is applied to several popular models in economics and finance. While DIC is very sensitive to nonlinear transformations of latent variables in these models, RDIC is robust to these transformations. As a result, substantial discrepancies are found between DIC and RDIC.

JEL classification: C11, C12, G12

Keywords: AIC; DIC; EM Algorithm; Latent variable models; Markov Chain Monte Carlo

We wish to thank Peter Phillips and David Spiegelhalter for their helpful comments. Yong Li: Hanqing Advanced Institute of Economics and Finance, Renmin University of China, Beijing, P.R. China. Tao Zeng: School of Economics and Sim Kee Boon Institute for Financial Economics, Singapore Management University, 90 Stamford Road, Singapore. Jun Yu: School of Economics and Lee Kong Chian School of Business, Singapore Management University; email for Jun Yu: yujun@smu.edu.sg. Yu thanks the Singapore Ministry of Education for Academic Research Fund under grant number MOE2011-T...

1 Introduction

One of the most important developments in the Bayesian literature in recent years is arguably the deviance information criterion (DIC) of Spiegelhalter et al. (2002).

DIC is a Bayesian version of the well-known Akaike information criterion (AIC) of Akaike (1973). Like AIC, it trades off a measure of model adequacy against a measure of complexity and is concerned with how replicate data predict the observed data. DIC is constructed from the posterior distribution of the log-likelihood, or the deviance, and has several desirable features. First, DIC is simple to calculate when the likelihood function is available in closed form and the posterior distributions of the models are obtained by Markov chain Monte Carlo (MCMC) simulation. Second, it is applicable to a wide range of statistical models. Third, unlike Bayes factors (BFs), it is not subject to the Jeffreys-Lindley paradox.

An important class of models in economics and finance involves latent variables. Latent variables have figured prominently in consumption decisions, investment decisions, labor force participation, the conduct of monetary policy, indices of economic activity, inflation dynamics, and other economic, business and financial activities and decisions. For example, one important class of latent variable models is the class of state space models, in which the state variable is latent. It provides a unified methodology for treating a wide range of problems in time series analysis. Another example can be found in the values of stocks, bonds, options, futures, and derivatives, which are often determined by a small number of factors. Often these factors, such as the level, the slope and the curvature in the term structure of interest rates, are not observed. In macroeconomics, a well-known recent example of latent variable models is the dynamic stochastic general equilibrium (DSGE) model. On the basis of macroeconomic theory, the DSGE model attempts to explain aggregate economic phenomena by taking into account the fact that the economy is affected by some structural innovations. The DSGE model can be solved as a rational expectation system in the percentage deviations of variables from their steady states, which are latent (An and Schorfheide 2007; DeJong and Dave 2007). In microeconometrics, many discrete choice models and panel data models involve unobserved variables in order to capture unobserved heterogeneity across economic entities (Stern).

For latent variable models, Bayesian methods via MCMC simulation have proven to be a powerful alternative to frequentist methods for estimating model parameters. In particular, the data augmentation strategy proposed by Tanner and Wong (1987), which expands the parameter space by treating the latent variables as additional model parameters, has been found very useful for simplifying the MCMC computation of posterior distributions. This simplification is achieved because data augmentation leads to a closed-form expression for the likelihood function.

Comparing alternative latent variable models in the Bayesian paradigm is a daunting and yet important task. The gold standard for Bayesian model comparison is to compute BFs, which basically compare the marginal likelihoods of alternative models (Kass and Raftery 1995). Several interesting developments have been made in recent years for computing the marginal likelihood from the MCMC output; see, for example, Chib (1995) and Chib and Jeliazkov (2001). While these methods are very general and widely applicable, for latent variable models

they are difficult to use because the marginal likelihood may be hard to calculate. In addition, BFs cannot be used under improper priors and are subject to the Jeffreys-Lindley paradox. Given that DIC is simple to calculate from the MCMC output with the data augmentation technique, and that data augmentation is often used for Bayesian parameter estimation, DIC has been widely used for comparing alternative latent variable models; see, for example, Berg et al. (2004) and Huang and Yu (2010).

The first contribution of this paper is that we argue DIC has to be used with care in the context of latent variable models. In particular, we believe DIC, in the way it is commonly implemented in practice, has some conceptual and practical problems. First, DIC requires a concrete focus, which is often not easily identified in practice. If the focus cannot be identified, using DIC violates the likelihood principle; see Gelfand and Trevisani (2002). Second, DIC is not robust to apparently innocuous transformations and distributional representations. This problem is made worse by the data augmentation technique for latent variable models. Data augmentation greatly inflates the number of parameters and hence the effective number of parameters used in DIC is very sensitive to transformations and distributional representations. The details will be explained in Section 3. Finally, DIC requires that the likelihood function have a closed-form expression for it to be computationally operational. For latent variable models, this is achieved by data augmentation and, as a consequence, DIC opens up to possible variations. It is unclear which variation should be used in practice; see Celeux et al. (2006) for further discussion of this problem.

In this paper we argue that although data augmentation leads to a likelihood function in closed form and greatly facilitates parameter estimation, DIC should NOT be calculated based on the new likelihood associated with data augmentation. The reason is that data augmentation makes the likelihood function non-regular and hence invalidates the standard asymptotic arguments. Consequently, it undermines the theoretical underpinnings of DIC. The source of the problem is data augmentation. With data augmentation, a closed-form expression for the likelihood is ensured and it is easy to compute DIC, but the asymptotic justification of DIC is invalid. Without data augmentation, the likelihood function does not have a closed-form expression and hence DIC is much harder to compute for latent variable models, although it is asymptotically justified.

The second contribution of this paper is that we advocate the use of a robust version of DIC, denoted RDIC, for Bayesian comparison of latent variable models. It is shown that RDIC is a good approximation to DIC without data augmentation and hence is theoretically justified. We then show that the expectation-maximization (EM) algorithm facilitates the computation of RDIC for latent variable models when the MCMC output is available. Moreover, RDIC is robust to nonlinear transformations of latent variables and to distributional representations of model specification.

The advantages of the proposed approach are illustrated using two popular models in

economics and finance: a class of dynamic factor models and a class of stochastic volatility models. It is shown that DIC is very sensitive to nonlinear transformations of latent variables in these models, whereas RDIC is robust to these transformations. As a result, substantial discrepancies are found between DIC and RDIC.

The paper is organized as follows. In Section 2, the latent variable models are introduced, and the Bayesian estimation method with data augmentation and the EM algorithm are reviewed. Section 3 reviews DIC, introduces and justifies RDIC for latent variable models, and discusses how to compute RDIC from the MCMC output. Section 4 illustrates the method using models from economics and finance. Section 5 concludes the paper. The Appendix collects the proofs of the theoretical results in the paper.

2 Latent Variable Models, EM Algorithm and MCMC

Let y = (y_1, y_2, ..., y_n) denote the observed variables and z = (z_1, z_2, ..., z_n) the latent variables. The latent variable model is indexed by a set of P parameters, θ = (θ_1, ..., θ_P). Let p(y|θ) be the likelihood function of the observed data (denoted the observed-data likelihood), and p(y, z|θ) be the complete-data likelihood function. The relationship between the two functions is:

p(y|θ) = ∫ p(y, z|θ) dz.   (1)

In many cases, the integral does not have a closed-form solution. Consequently, statistical inferences, such as estimation and model comparison, are difficult to make. In the literature, maximum likelihood (ML) analysis using the EM algorithm and Bayesian analysis using MCMC are two popular approaches for carrying out statistical inference in latent variable models.

2.1 Maximum likelihood via the EM algorithm

The EM algorithm is an iterative numerical method for finding the ML estimate of θ in latent variable models. It has been widely used in applications since Dempster et al. (1977) gave it its name and analyzed its convergence. In this subsection, we briefly review the main idea of the EM algorithm. For more details, one can refer to McLachlan and Krishnan (2008).

Let x = (y, z) be the complete data with density p(x|θ) parameterized by a P-dimensional parameter vector θ ∈ Θ ⊆ R^P. The observed-data log-likelihood L_o(y|θ) = ln p(y|θ) often involves some intractable integral, preventing researchers from directly optimizing L_o(y|θ) with respect to θ. In many cases, however, the complete-data log-likelihood L_c(x|θ) = ln p(x|θ) has a closed-form expression. Instead of maximizing L_o(y|θ) directly, the EM algorithm maximizes Q(θ|θ^(r)), the conditional expectation of the complete-data log-likelihood function L_c(x|θ) given the observed data y and a current fit θ^(r) of the parameter.

Generally, a standard EM algorithm has two steps: the expectation (E) step and the maximization (M) step. The E-step evaluates

Q(θ|θ^(r)) = E_z{ L_c(x|θ) | y, θ^(r) },   (2)

where the expectation is taken with respect to the conditional distribution p(z|y, θ^(r)). The M-step determines a θ^(r+1) that maximizes Q(θ|θ^(r)). Under some mild regularity conditions, the sequence {θ^(r)} obtained from the EM iterations converges to the ML estimate θ̂; see Dempster et al. (1977) and Wu (1983) for details on the convergence properties of {θ^(r)}.

2.2 Bayesian analysis using MCMC

Although the EM algorithm is a reasonable statistical approach for analyzing latent variable models, the numerical optimization in the M-step is often unstable. This numerical problem worsens as the dimension of θ increases. It is well recognized that Bayesian methods using MCMC provide a powerful tool for analyzing latent variable models. However, if the posterior analysis is conducted from the observed-data likelihood, p(y|θ), one would end up with the same problem as in the ML method, since p(y|θ) does not have a closed-form expression. The novelty in the Bayesian methods is to treat the latent variable model as a hierarchical structure of conditional distributions, namely, p(y|z, θ), p(z|θ), and p(θ). In other words, one can use the data augmentation strategy of Tanner and Wong (1987) to expand the parameter space from θ to (θ, z). The advantage of data augmentation is that the Bayesian analysis is now based on the new likelihood function, p(y|θ, z), which often has a closed-form expression. Then the Gibbs sampler and other MCMC samplers can be used to generate random samples from the joint posterior distribution p(θ, z|y). After a sufficiently long burn-in phase, the simulated random samples can be regarded as random observations from the joint distribution, and statistical analysis can be based on these simulated posterior observations. As a by-product of the Bayesian analysis, one also obtains Markov chains for the latent variables z, and hence statistical inference can be made about z. For further details on Bayesian analysis of latent variable models via MCMC, including algorithms, examples and references, see Geweke et al.

From the above discussion, it can be seen that data augmentation is the key technique for Bayesian estimation of latent variable models. Two observations are in order. First, with data augmentation, the parameter space is much bigger. More often than not, the dimension of the space increases as the number of observations increases and is larger than the number of observations. In the latter case, the new likelihood function becomes non-regular. Second, it is difficult to argue that the latent variables can always be treated as model parameters. Model parameters are typically fixed, but the latent variables are often time varying. Consequently, the same treatment of these two types of variables does not seem to be justifiable from the perspective of model selection.
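As a concrete illustration of the E and M steps of Section 2.1, the following sketch runs a few EM iterations for a two-component Gaussian mixture with unit variances, a textbook latent variable model in which z_i is the component indicator. The toy model, the starting values and all function names are illustrative assumptions; this is not the algorithm or code used in the paper.

import numpy as np

def em_gaussian_mixture(y, n_iter=100):
    # Minimal EM for y_i ~ pi N(mu1, 1) + (1 - pi) N(mu2, 1); z_i is the latent component label.
    pi, mu1, mu2 = 0.5, y.min(), y.max()              # crude starting values
    for _ in range(n_iter):
        # E step: w_i = E[z_i | y, current fit] = P(component 1 | y_i)
        d1 = pi * np.exp(-0.5 * (y - mu1) ** 2)
        d2 = (1.0 - pi) * np.exp(-0.5 * (y - mu2) ** 2)
        w = d1 / (d1 + d2)
        # M step: maximize Q(theta | theta_r), which is available in closed form here
        pi = w.mean()
        mu1 = np.sum(w * y) / np.sum(w)
        mu2 = np.sum((1.0 - w) * y) / np.sum(1.0 - w)
    return pi, mu1, mu2

rng = np.random.default_rng(1)
labels = rng.random(500) < 0.3
y = np.where(labels, rng.normal(-2.0, 1.0, 500), rng.normal(2.0, 1.0, 500))
print(em_gaussian_mixture(y))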

3 Bayesian Comparison of Latent Variable Models

3.1 DIC

Spiegelhalter et al. (2002) proposed DIC for Bayesian model comparison. The criterion is based on the deviance D(θ) = −2 ln p(y|θ), and takes the form of

DIC = D̄(θ) + P_D.   (3)

The first term, used as a Bayesian measure of model fit, is defined as the posterior expectation of the deviance, that is,

D̄(θ) = E_{θ|y}[D(θ)] = E_{θ|y}[−2 ln p(y|θ)].

The better the model fits the data, the larger the log-likelihood value and hence the smaller the value of D̄(θ). The second term, used to measure the model complexity and also known as the effective number of parameters, is defined as the difference between the posterior mean of the deviance and the deviance evaluated at the posterior mean of the parameters:

P_D = D̄(θ) − D(θ̄) = 2 ln p(y|θ̄) − 2 ∫ ln p(y|θ) p(θ|y) dθ,   (4)

where θ̄ is the Bayesian estimator, more precisely the posterior mean, of the parameter θ. Here, P_D can be explained as the expected excess of the true over the estimated residual information conditional on the data y. In other words, P_D can be interpreted as the expected reduction in uncertainty due to estimation.

DIC can be rewritten in two equivalent forms:

DIC = D(θ̄) + 2P_D,   (5)

and

DIC = 2D̄(θ) − D(θ̄) = −4E_{θ|y}[ln p(y|θ)] + 2 ln p(y|θ̄).   (6)

DIC defined in Equation (5) bears similarity to the AIC of Akaike (1973) and can be interpreted as a classical plug-in measure of fit plus a measure of complexity. In Equation (3) the Bayesian measure, D̄(θ), is the same as D(θ̄) + P_D, which already includes a penalty term for model complexity and thus could be better thought of as a measure of model adequacy rather than pure goodness of fit.
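When ln p(y|θ) has a closed-form expression, D̄(θ), P_D and DIC in (3)-(4) can be computed directly from the MCMC output. The sketch below does this for an assumed toy model (a normal mean with known unit variance and an essentially flat prior); the helper names and the toy model are illustrative assumptions rather than the paper's code.

import numpy as np
from scipy import stats

def dic_from_draws(y, theta_draws, loglik):
    # D(theta) = -2 log p(y | theta); theta_draws has one posterior draw per row.
    D = np.array([-2.0 * loglik(y, th) for th in theta_draws])
    D_bar = D.mean()                                   # posterior mean deviance
    theta_bar = theta_draws.mean(axis=0)               # posterior mean of theta
    p_D = D_bar + 2.0 * loglik(y, theta_bar)           # D_bar - D(theta_bar)
    return D_bar + p_D, p_D                            # DIC and effective number of parameters

# toy example: y_i ~ N(mu, 1); with a flat prior the posterior of mu is N(ybar, 1/n)
rng = np.random.default_rng(0)
y = rng.normal(0.5, 1.0, size=200)
mu_draws = rng.normal(y.mean(), 1.0 / np.sqrt(len(y)), size=(5000, 1))
loglik = lambda y, th: stats.norm.logpdf(y, loc=th[0], scale=1.0).sum()
print(dic_from_draws(y, mu_draws, loglik))             # p_D should be close to 1 here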

Remark 3.1 The asymptotic justification of DIC requires that the candidate models nest the true model and that the posterior distribution is approximately normal. These two requirements parallel those in AIC, where the candidate models nest the true model and the ML estimator is asymptotically normally distributed. To see the importance of the asymptotic normality, Spiegelhalter et al. (2002) show that, when the prior is noninformative, P_D is approximately the same as the number of parameters, P. In this case DIC can be explained as a Bayesian version of AIC. However, if the asymptotic normality does not hold, P_D cannot be approximated by P and DIC is not the Bayesian version of AIC. Furthermore, the information-theoretic explanation of DIC requires the asymptotic normality of the Bayesian posterior to hold.

Remark 3.2 If p(y|θ) has a closed-form expression, DIC is trivially computable from the MCMC output. This is in sharp contrast to BFs and some other model selection criteria within the classical framework. The computational tractability, together with the versatility of MCMC and the fact that DIC is incorporated into a Bayesian software package, WinBUGS, allows DIC to enjoy a very wide range of applications. (As of July 8, 2012, Spiegelhalter et al. (2002) had been cited 3,396 times according to Google Scholar and 1,984 times according to the Science Citation Index.) However, if p(y|θ) is not available in closed form, such as in random effects models and state space models, computing DIC may become infeasible, or at least very time consuming.

Remark 3.3 When an information criterion is used for model selection, the degrees of freedom are typically used to measure the model complexity. In the Bayesian framework, the prior information almost always imposes additional restrictions on the parameter space and hence the degrees of freedom may be reduced by the prior information. A useful contribution of DIC is to provide a way to measure the model complexity when the prior information is incorporated; see Brooks (2002).

Remark 3.4 Unlike BFs, which address how the observed data are predicted by the priors, DIC addresses how well the posterior might predict future data generated by the same mechanism that gave rise to the observed data (Spiegelhalter et al. 2002). This predictive perspective for selecting a good model is important to many practical business, economic, and financial decisions.

For latent variable models, depending on whether or not the latent variables are treated as parameters, Celeux et al. (2006) gave different ways to define DIC and classified them into three categories. Based on the observed-data likelihood p(y|θ), the first category of DICs can be

defined as

DIC_1 = −4E_{θ|y}[ln p(y|θ)] + 2 ln p(y|θ̄(y)),
DIC_2 = −4E_{θ|y}[ln p(y|θ)] + 2 ln p(y|θ̂(y)),
DIC_3 = −4E_{θ|y}[ln p(y|θ)] + 2 ln E_{θ|y}[p(y|θ)],

where θ̄(y) and θ̂(y) are the posterior mean and the posterior mode, respectively. Based on the complete-data likelihood p(y, z|θ), the second category of DICs can be defined as

DIC_4 = −4E_{θ,z|y}[ln p(y, z|θ)] + 2E_{z|y}[ln p(y, z | E_{θ|y,z}[θ|y, z])],
DIC_5 = −4E_{θ,z|y}[ln p(y, z|θ)] + 2 ln p(y, ẑ(y) | θ̂(y)),
DIC_6 = −4E_{θ,z|y}[ln p(y, z|θ)] + 2E_{z|y,θ̂(y)}[ln p(y, z | θ̂(y))],

where in DIC_5, z is treated as parameters and ẑ(y) and θ̂(y) are the joint Bayesian estimators, such as the joint maximum a posteriori (MAP) estimators of (z, θ); in DIC_6, θ̂(y) is an estimator of θ based on the posterior distribution p(θ|y). Based on the conditional likelihood p(y|z, θ), the third category of DICs can be defined as

DIC_7 = −4E_{θ,z|y}[ln p(y|z, θ)] + 2 ln p(y | ẑ(y), θ̂(y)),
DIC_8 = −4E_{θ,z|y}[ln p(y|z, θ)] + 2E_{z|y}[ln p(y | z, θ̂(y, z))],

where again, in DIC_7, the latent variable z is treated as parameters so that ẑ(y) and θ̂(y) are the joint Bayesian estimators, such as the joint maximum a posteriori (MAP) estimators of the pair (z, θ); in DIC_8, θ̂(y, z) is an estimator of θ based on p(y, z|θ).

Remark 3.5 To compute DIC_1, DIC_2 and DIC_3, it is generally required that the observed-data likelihood p(y|θ) be available in closed form. However, for latent variable models, such as state space models, including linear Gaussian state space models, the observed-data likelihood p(y|θ) is not available in closed form. (For linear Gaussian state space models, the Kalman filter can be used to obtain the likelihood function numerically for ML estimation; numerically more efficient algorithms have been developed in the recent literature, see, for example, Chan and Jeliazkov (2009).) In this case, computing these DICs from the MCMC output is time consuming or even infeasible, since p(y|θ) has to be computed at each draw from the Markov chain. DIC_2 is particularly hard to compute as it needs the estimator θ̂(y). DIC_4 requires the computation of a posterior expectation for each value of z; consequently, the computational cost is too high in many latent variable models. The definition of DIC_5 is inconsistent in the sense that the first component treats z as latent variables while the second component treats z as parameters. In DIC_6, P_D is not guaranteed to be positive.

Moreover, as argued in Celeux et al. (2006), ẑ(y) is often a terrible estimator, and so are DIC_4, DIC_5 and DIC_6. In DIC_7 the latent variable is regarded as parameters in both components, and DIC_7 is easy to compute. As a result, DIC_7 is the default information criterion for comparing latent variable models and is implemented and reported in WinBUGS, following the suggestion of Spiegelhalter et al. (2002). Examples that use DIC_7 in applications include Berg et al. (2004) and Wang et al. Clearly this way of defining DIC is chosen for computational convenience. In DIC_8, the estimator θ̂(y, z) is very difficult to obtain.

Remark 3.6 From a theoretical viewpoint, DIC_7 has several serious problems. First, due to the data augmentation, the number of latent variables often increases with the sample size in latent variable models, causing the problem of non-regular likelihood-based statistical inference; see Gelman (2003). This invalidates the asymptotic justification of DIC because the standard asymptotic theory derived from a regular likelihood is not applicable to a non-regular likelihood. Second, if the latent variables can be treated as parameters, an incoherent inference problem results. That is, when one model can be rewritten as a distributional representation of another model with latent variables and the same prior is used in the two models, different DIC values can be obtained. A simple example is the Student-t distribution, which can be rewritten as a normal-gamma scale mixture. This is the case in Section 8.2 of Spiegelhalter et al. (2002), where Models 4 and 5 are predictively identical but their DIC values are quite different. The same difficulty also shows up in Model 8 of Berg et al. (2004). Third, when the latent variables are discrete, such as component indicators in Markov switching models, the Bayesian estimator is generally not a discrete value, which can cause some logical problems. Fourth, due to the data augmentation, the dimension of the parameter space becomes larger and hence we expect that DIC_7 is very sensitive to transformations of latent variables.

To illustrate the last problem, we consider a simple transformation of latent variables in the well-known Clark model (Clark 1973), which is given by

Model 1: y_t ~ N(µ, exp(h_t)), h_t ~ N(0, σ²), t = 1, 2, ..., n.   (7)

An equivalent representation of the model is

Model 2: y_t ~ N(µ, σ_t²), σ_t² ~ LN(0, σ²), t = 1, 2, ..., n,   (8)

where LN denotes the log-normal distribution. In Model 2 the latent variable is the volatility σ_t², while in Model 1 the latent variable is the logarithmic volatility h_t = log σ_t². Suppose the parameters of interest are µ and σ². With the same focus, the two models are identical and hence are expected to have the same DIC and P_D. To calculate the P_D component in DIC_7, we simulate 1000 observations from the model with µ = 0 and σ² = 0.5. Vague priors are selected for the two parameters, namely µ ~ N(0, 100) and σ² ~ Γ(0.001, 0.001).
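The simulation design just described (n = 1000 observations, µ = 0, σ² = 0.5) can be generated with a few lines of code. The sketch below is an illustrative Python implementation under these assumed settings, not the authors' code; Model 2 uses exactly the same draws, since σ_t² = exp(h_t).

import numpy as np

def simulate_clark_model1(n=1000, mu=0.0, sigma2=0.5, seed=0):
    # Model 1: y_t ~ N(mu, exp(h_t)), h_t ~ N(0, sigma2), t = 1, ..., n
    rng = np.random.default_rng(seed)
    h = rng.normal(0.0, np.sqrt(sigma2), size=n)       # latent log-volatility
    y = rng.normal(mu, np.exp(0.5 * h))                # observed data
    return y, h

y, h = simulate_clark_model1()
sigma2_t = np.exp(h)                                   # the latent variable of Model 2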

We run the Gibbs sampler, discarding the first 40,000 simulated draws from the posterior distributions as burn-in samples. Of the remaining draws, every 10th observation is collected as an effective observation for statistical inference. With the data augmentation, the latent variables, h_t and σ_t², are regarded as parameters, and we find that P_D = ... for Model 1 but P_D = ... for Model 2. The difference is very significant. Given that we have identical models and priors and use the same dataset, the vast difference suggests that DIC_7 and the corresponding P_D are very sensitive to transformations of latent variables.

For latent variable models, DIC_1, DIC_2 and DIC_3 do not suffer from the same theoretical problems as DIC_7. However, computing DIC_1 from the MCMC output is much harder, often infeasible, since p(y|θ) is not available in closed form and computing E_{θ|y}[log p(y|θ)] necessitates numerical calculation of p(y|θ) at each draw from the Markov chain. To summarize the problems with DIC in the context of latent variable models: DIC_7 is trivial to calculate but cannot be theoretically justified, while DIC_1 is theoretically justified but infeasible to compute.

3.2 RDIC

In this section we introduce a robust version of DIC, denoted RDIC, as follows:

RDIC = D(θ̄) + 2 tr{ I(θ̄) V̄_θ } = D(θ̄) + 2 P_D*,   (9)

where

P_D* = tr{ I(θ̄) V̄_θ },   (10)

with tr denoting the trace of a matrix and

I(θ̄) = − ∂² log p(y|θ)/∂θ∂θ' |_{θ=θ̄},   V̄_θ = E[ (θ − θ̄)(θ − θ̄)' | y ].

Interestingly, in Equation (15) on page 590, Spiegelhalter et al. (2002) obtained the expression for P_D* and claimed that P_D* approximates the P_D component in DIC_1. Unfortunately, to the best of our knowledge, P_D* has never been implemented in practice and WinBUGS does not report P_D*. Moreover, the proof of P_D* ≈ P_D was not given in Spiegelhalter et al. (2002), the conditions under which P_D* ≈ P_D holds were not specified, and the order of the approximation remains unknown. To justify the choice of RDIC, we have to establish conditions under which RDIC approximates DIC_1 and P_D* approximates the P_D that corresponds to DIC_1, with a known order of magnitude. We then show how the EM algorithm facilitates the computation of RDIC from the MCMC output for latent variable models.

Let L_n(θ) = log p(θ|y), L_n^(1)(θ) = ∂ log p(θ|y)/∂θ, and L_n^(2)(θ) = ∂² log p(θ|y)/∂θ∂θ'. In this paper, we impose the following regularity conditions.

Assumption 1: There exists a finite sample size n̄ such that, for n > n̄, there is a local maximum of L_n(θ) at θ̂_m, so that L_n^(1)(θ̂_m) = 0 and L_n^(2)(θ̂_m) is a negative definite matrix. Obviously, θ̂_m is the posterior mode and L_n^(2)(θ̂_m)/n = O_p(1).

Assumption 2: Moreover, the largest eigenvalue of −[L_n^(2)(θ̂_m)]^{-1}, denoted σ_n², goes to zero as n → ∞.

Assumption 3: For any ε > 0, there exist an integer n₂ and some δ > 0 such that, for any n > max{n̄, n₂} and θ ∈ H(θ̂_m, δ) = {θ : ‖θ − θ̂_m‖ ≤ δ}, L_n^(2)(θ) exists and satisfies

I_P − A(ε) ≤ L_n^(2)(θ) [L_n^(2)(θ̂_m)]^{-1} ≤ I_P + A(ε),

where I_P is a P × P identity matrix and A(ε) is a P × P positive semi-definite symmetric matrix whose largest eigenvalue goes to zero as ε → 0.

Assumption 4: For any δ > 0, ∫_{Θ − H(θ̂_m, δ)} p(θ|y) dθ → 0 as n → ∞, where Θ is the support of θ.

Assumption 5: Both the first moment and the second moment of p(θ|y) exist.

Assumption 6: For all θ ∈ Θ, the prior of θ is O_p(1).

Assumption 7: The data generating process is stationary and the model is regular, so that the standard maximum likelihood theory can be applied.

Assumption 8: For any given θ_0 ∈ Θ and y from the same data generating process, there exist a positive number c and a function M(y), both of which may depend on θ_0, such that |n^{-1} log p(y|θ)| ≤ M(y) for all θ with θ_0 − c < θ < θ_0 + c, and E_{θ_0}[M(y)] < ∞.

Lemma 3.1 Under Assumptions 1-5, conditional on the observed data y, we have

θ̄ = E(θ|y) = θ̂_m + o_p(n^{-1/2}),
V̄(θ̂_m) = E[ (θ − θ̂_m)(θ − θ̂_m)' | y ] = −[L_n^(2)(θ̂_m)]^{-1} + o_p(n^{-1}).

Remark 3.7 Lemma 3.1 establishes Bayesian large sample theory. Regularity conditions 1-4 have been used in the literature to develop Bayesian large sample theory for stationary and nonstationary dynamic models and for nondynamic models; see, for example, Chen (1985), Kim (1994), Kim (1998) and Geweke (2005). Bayesian large sample theory has also been developed from different sets of regularity conditions in different contexts. For example, Ghosh and Ramamoorthi (2003) developed the asymptotic posterior normality and Lemma 3.1 in the iid case.

Theorem 3.1 Under Assumptions 1-6, it can be shown that

P_D = P_D* + o_p(1),   DIC_1 = RDIC + o_p(1),

where P_D is defined in (4).

Remark 3.8 Theorem 3.1 improves on Equation (15) of Spiegelhalter et al. (2002) in two ways. First, it gives the order of the approximation errors. Second, it specifies the conditions under which P_D* approximates P_D and RDIC approximates DIC_1.

Remark 3.9 As DIC_1 is theoretically justified for latent variable models, Theorem 3.1 justifies RDIC asymptotically, since RDIC and DIC_1 are asymptotically equivalent.

Remark 3.10 RDIC maintains all the good features of DIC_1. For example, informative priors impose restrictions on the parameter space so that the degrees of freedom of the model are reduced. Hence, RDIC can incorporate the prior information when measuring the model complexity. Following Spiegelhalter et al. (2002), we get

I(θ̂_m) = − ∂² log p(θ|y)/∂θ∂θ' |_{θ=θ̂_m} + ∂² log p(θ)/∂θ∂θ' |_{θ=θ̂_m} = −L_n^(2)(θ̂_m) + ∂² log p(θ)/∂θ∂θ' |_{θ=θ̂_m}.

Under Assumptions 1-5, following Lemma 3.1 and the proof of Theorem 3.1, we get

P_D* = tr{ I(θ̂_m) V̄_θ } + o_p(1)
    = tr{ −L_n^(2)(θ̂_m) V̄_θ } + tr{ ∂² log p(θ)/∂θ∂θ' |_{θ=θ̂_m} V̄_θ } + o_p(1)
    = P + tr{ ∂² log p(θ)/∂θ∂θ' |_{θ=θ̂_m} V̄_θ } + o_p(1).   (11)

From (11), it can be seen clearly that the prior information can reduce the model complexity.

Remark 3.11 Like DIC_1, RDIC is justified by the standard Bayesian large sample theory. When the Bayesian large sample theory is not available, RDIC is not justified. Such cases include models in which the number of parameters increases with the sample size, under-identified models, models with an unbounded likelihood, and models with improper posterior distributions. For more details about the standard Bayesian large sample theory, see Gelman (2003) and Geweke (2005). For latent variable models, since the number of latent variables increases with the sample size, the standard Bayesian large sample theory is not applicable if the data augmentation technique is used. As a result, when calculating RDIC, data augmentation should NOT be used.
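As a practical illustration of (10), P_D* can be computed from the MCMC output whenever log p(y|θ) can be evaluated: V̄_θ is estimated by the sample covariance of the posterior draws and I(θ̄) by a numerical Hessian at the posterior mean. The finite-difference scheme and the function names below are illustrative assumptions for a model with a small number of parameters, not the authors' implementation.

import numpy as np

def p_d_star(loglik, theta_draws, eps=1e-5):
    # P_D* = tr{ I(theta_bar) V_bar }, with I(theta_bar) the negative Hessian of
    # log p(y|theta) at the posterior mean and V_bar the posterior covariance matrix.
    theta_bar = theta_draws.mean(axis=0)
    V_bar = np.atleast_2d(np.cov(theta_draws, rowvar=False))
    P = theta_bar.size
    H = np.zeros((P, P))
    for i in range(P):
        for j in range(P):
            # central finite differences for the (i, j) element of the Hessian
            t_pp = theta_bar.copy(); t_pp[i] += eps; t_pp[j] += eps
            t_pm = theta_bar.copy(); t_pm[i] += eps; t_pm[j] -= eps
            t_mp = theta_bar.copy(); t_mp[i] -= eps; t_mp[j] += eps
            t_mm = theta_bar.copy(); t_mm[i] -= eps; t_mm[j] -= eps
            H[i, j] = (loglik(t_pp) - loglik(t_pm) - loglik(t_mp) + loglik(t_mm)) / (4.0 * eps ** 2)
    return float(np.trace((-H) @ V_bar))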

Remark 3.12 Since RDIC is defined from the observed-data likelihood p(y|θ), there is no need to specify a focus, and hence RDIC does not suffer from the incoherent inference problem.

Remark 3.13 For latent variable models, while the number of model parameters P is fixed and usually not large, the number of latent variables increases as the sample size increases. In the definition of RDIC, the latent variables are not regarded as parameters. Consequently, the problem of parameter transformation is less serious. For example, in the Clark model, with the same setting as before, we get P_D* = 1.75 for Model 1 and P_D* = 1.80 for Model 2. There is no significant difference between them. Moreover, these two values are close to 2, which is the actual number of parameters. This is what we expect given that vague priors are used, so that P_D* ≈ P = 2. The difference between P_D* and P arises from simulation error and the prior.

Remark 3.14 An obvious computational advantage of RDIC is that P_D* does not involve inverting a matrix. This advantage is not so important when the latent variable model has only a small number of parameters. However, for high dimensional latent variable models with many parameters, this computational advantage may be important.

We now consider the justification of DIC and RDIC from an information-theoretic perspective. As in AIC, let y_rep = (y_{1,rep}, y_{2,rep}, ..., y_{n,rep}) be independent replicate data generated by the same mechanism that gives rise to the observed data y, i.e., p(y_rep) = p(y). In the literature, the Kullback-Leibler (KL) divergence is used to describe the difference between two distributions:

KL(p(x), q(x)) = ∫ p(x) log [ p(x)/q(x) ] dx.

Hence, from the information-theoretic perspective, when using the fitted model M to predict y_rep, a simple loss function based on the KL divergence can be chosen as

KL( p(y_rep|θ), p(y_rep|θ̄(y)) ) = ∫ log [ p(y_rep|θ) / p(y_rep|θ̄(y)) ] p(y_rep|θ) dy_rep
= ∫ log p(y_rep|θ) p(y_rep|θ) dy_rep − ∫ log p(y_rep|θ̄(y)) p(y_rep|θ) dy_rep.   (12)

Then a posterior loss function is given by

L(y_rep, y) = ∫ KL( p(y_rep|θ), p(y_rep|θ̄(y)) ) p(θ|y) dθ.   (13)

Hence, the criterion for model selection is to choose the model that minimizes

E_y[ L(y_rep, y) ] = ∫ L(y_rep, y) p(y) dy.

Remark 3.15 In Spiegelhalter et al. (2002, page 604), L(y_rep, y) was chosen to be −2 log p(y_rep|θ̄(y)), and they showed that

E_y{ −2 ∫ [ ∫ log p(y_rep|θ̄(y)) p(y_rep|θ) dy_rep ] p(θ|y) dθ } ≈ E_y[DIC_1].

However, their derivation is heuristic in the sense that no rigorous proof is provided. Moreover, we can see from (12) and (13) that ∫ log p(y_rep|θ) p(y_rep|θ) dy_rep is not the same across different models. Hence, it is difficult to justify DIC on the basis of the loss function −2 log p(y_rep|θ̄(y)). A more rigorous justification of DIC is needed.

Consider the predictive distribution p(y_rep|y) = ∫ p(y_rep|θ) p(θ|y) dθ. The KL loss function based on this predictive distribution is

KL( p(y_rep), p(y_rep|y) ) = ∫ log [ p(y_rep) / p(y_rep|y) ] p(y_rep) dy_rep.

Since ∫ log p(y_rep) p(y_rep) dy_rep is the same for all the models, we can choose the loss function as L(y_rep, y) = −2 log p(y_rep|y). We then propose to choose the model that minimizes the following risk function:

E_y E_{y_rep}[ L(y_rep, y) ] = ∫∫ L(y_rep, y) p(y_rep) p(y) dy_rep dy.

The following theorem provides the justification of RDIC and DIC from the information-theoretic viewpoint.

Theorem 3.2 Under Assumptions 1-8, it can be shown that

E_y E_{y_rep}[ L(y_rep, y) ] = E_y[DIC_1] + o(1) = E_y[RDIC] + o(1).

Remark 3.16 According to the proof of Theorem 3.2, P_D = P_D* + o_p(1) = P + o_p(1). Consequently, DIC_1 = RDIC + o_p(1) = −2 log p(y|θ̂) + 2P + o_p(1) = AIC + o_p(1). Namely, both RDIC and DIC_1 can be regarded as Bayesian versions of AIC.

Remark 3.17 Both RDIC and DIC_1 are asymptotically unbiased estimators of the risk function.

Remark 3.18 Like DIC_1, RDIC addresses how well the posterior may predict future data generated by the same mechanism that gives rise to the observed data. This posterior predictive feature could be appealing in many applications.

Remark 3.19 Like AIC, both DIC_1 and RDIC require that the candidate models nest the true model. This is of course a strong assumption. In the iid case, Ando and Tsay (2010) relaxed this assumption and obtained a predictive likelihood information criterion that minimizes the loss function η = E_y E_{y_rep}[ −2 log p(y_rep|y) ]. Their estimator of η combines log p(y_rep|y) evaluated at y_rep = y with a penalty of the form tr{ I^{-1}(θ̂) J(θ̂) }, where I(θ) and J(θ) are the Hessian matrix and the Fisher information matrix. In Ando (2007), another Bayesian predictive information criterion (BPIC) was given, which combines −2 log p(y|θ̂) with a penalty involving tr{ I^{-1}(θ̂) J(θ̂) } and P. Ando (2007) showed that BPIC is an estimator of the loss function E_y E_{y_rep}[ −2 ∫ log p(y_rep|θ) p(θ|y) dθ ]. Like the TIC of Takeuchi (1976), these two information criteria involve the inverse of the Hessian matrix, which is numerically challenging when the dimension of the parameter space is large. This is one of the reasons why TIC has not been widely used in practice. Furthermore, the derivation of these two information criteria requires the data to be iid. For data in economics and finance, this requirement is often too restrictive. In addition, for many latent variable models, the maximum likelihood estimator, the Hessian matrix and the Fisher information matrix are difficult to obtain. How to develop a good information criterion for comparing latent variable models, without assuming that the candidate models nest the true model, will be pursued in future research.

Remark 3.20 For unit root models, Kim (1994) and Kim (1998) showed that the asymptotic normality of the posterior distribution can be established under Assumptions 1-4. Hence, Lemma 3.1 holds true for unit root models. However, to develop Theorem 3.2, the standard maximum likelihood asymptotic theory is required. Hence, Theorem 3.2 may not be applicable to models with a unit root or an explosive root. The topic of comparing non-stationary models will be pursued in future studies. Within the classical framework, Phillips and Ploberger (1996) and Phillips (1996) have proposed model selection criteria for models without latent variables.

Remark 3.21 If the observed-data likelihood function, p(y|θ), does not have a closed-form expression, its second derivative, ∂² log p(y|θ)/∂θ∂θ', and hence RDIC will be difficult to compute. Some general methods such as the Kalman filter and the particle filter can be used for this purpose. In the following section, we show how the EM algorithm may be used to facilitate the computation of the second derivative and of RDIC.

3.3 Computing RDIC by the EM algorithm

The definition of RDIC clearly requires the evaluation of the observed-data likelihood at the posterior mean, p(y|θ̄), as well as the information matrix and the second derivative of the observed-data likelihood function. For most latent variable models, the observed-data likelihood function does not have a closed-form expression. In this section we show how the EM algorithm may be used to evaluate p(y|θ̄), the second derivative of the observed-data likelihood function, and hence RDIC for latent variable models. It is important to point out that, unlike in the EM algorithm, we do not need to numerically optimize any function here. Consequently, our method is not subject to the instability problem found in the M-step.

As argued in Section 2.1, the main idea of the EM algorithm is to replace the observed-data log-likelihood log p(y|θ) with the complete-data log-likelihood log p(y, z|θ). Note that

log p(y, z|θ) = log p(z|y, θ) + log p(y|θ).

For any θ and θ* in Θ, it was shown in Dempster et al. (1977) that

∫ log p(y, z|θ) p(z|y, θ*) dz = ∫ log p(z|y, θ) p(z|y, θ*) dz + log p(y|θ).

Hence, we have the following lemma.

Lemma 3.2 Let H(θ|θ*) = ∫ log p(z|y, θ) p(z|y, θ*) dz, the so-called H function in the EM algorithm. Then

L_o(y|θ) = Q(θ|θ*) − H(θ|θ*),

where the Q function is defined in Equation (2).

Following Lemma 3.2, the Bayesian plug-in model fit, log p(y|θ̄), may be obtained as

log p(y|θ̄) = Q(θ̄|θ̄) − H(θ̄|θ̄).   (14)

It can be seen that even when Q(θ̄|θ̄) is not available in closed form, it is easy to evaluate from the MCMC output because

Q(θ̄|θ̄) = ∫ log p(y, z|θ̄) p(z|y, θ̄) dz ≈ (1/M) Σ_{m=1}^{M} log p(y, z^(m)|θ̄),

where {z^(m), m = 1, 2, ..., M} are random observations drawn from the posterior distribution p(z|y, θ̄). For the second term in (14), if p(z|y, θ̄) is a standard distribution, H(θ̄|θ̄) can be easily evaluated from the MCMC output as

H(θ̄|θ̄) = ∫ log p(z|y, θ̄) p(z|y, θ̄) dz ≈ (1/M) Σ_{m=1}^{M} log p(z^(m)|y, θ̄).
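The two Monte Carlo averages above translate directly into code once draws z^(m) from p(z|y, θ̄) are available. The sketch below assumes user-supplied functions for the complete-data log-likelihood and for the conditional density of z (the standard-distribution case); all names are illustrative assumptions.

import numpy as np

def loglik_at_posterior_mean(z_draws, theta_bar, log_complete, log_cond):
    # log p(y | theta_bar) = Q(theta_bar | theta_bar) - H(theta_bar | theta_bar), Equation (14)
    # log_complete(z, theta): log p(y, z | theta); log_cond(z, theta): log p(z | y, theta)
    Q_bar = np.mean([log_complete(z, theta_bar) for z in z_draws])
    H_bar = np.mean([log_cond(z, theta_bar) for z in z_draws])
    return Q_bar - H_bar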

However, if p(z|y, θ̄) is not a standard distribution, an alternative approach has to be used, depending on the specific model under consideration. We now consider two situations.

First, if the complete data (y_i, z_i) are independent across i ≠ j, and z_i is of low dimension, say 5, then a nonparametric approach may be used to approximate the posterior distribution p(z|y, θ̄). Note that

H(θ̄|θ̄) = ∫ log p(z|y, θ̄) p(z|y, θ̄) dz = Σ_{i=1}^{n} ∫ log p(z_i|y_i, θ̄) p(z_i|y, θ̄) dz_i = Σ_{i=1}^{n} H_i(θ̄|θ̄).

The computation of H_i(θ̄|θ̄) requires an analytic approximation to p(z_i|y_i, θ̄), which can be constructed using a nonparametric method. In particular, MCMC allows one to draw effective samples from p(z_i|y_i, θ̄). Using these random samples, one can then use nonparametric techniques such as kernel-based methods to approximate p(z_i|y_i, θ̄). In a recent study, Ibrahim et al. (2008) suggested using a truncated Hermite expansion to approximate p(z_i|y_i, θ̄). As a simple illustration, we apply this method to the Clark model. When the Gaussian kernel method is used, we get log p(y|θ̄) = ..., RDIC = ... for Model 1 and log p(y|θ̄) = ..., RDIC = 90.4 for Model 2. These two sets of numbers are nearly identical. However, if the latent variables are regarded as parameters, we get DIC_7 = ... for Model 1 and DIC_7 = ... for Model 2. The highly distinctive difference between them suggests that DIC_7 is not a reliable model selection criterion for this model. Note that DIC_1 is not really feasible to compute in this case.

Second, for some latent variable models, the latent variables z follow a multivariate normal distribution and the observed variables y are independent conditional on z. This class of models is referred to as the Gaussian latent variable models in the literature. In economics and finance, many latent variable models belong to this class, including dynamic linear models, dynamic factor models, various forms of stochastic volatility models and credit risk models. In these models, the observed-data likelihood is non-Gaussian but has a Gaussian flavor in the sense that the posterior distribution, p(z|y, θ), may be expressed as

p(z|y, θ) ∝ exp{ −(1/2) z' V(θ) z + Σ_{i=1}^{n} log p(y_i|z_i, θ) }.

Rue et al. (2004) and Rue et al. (2009) showed that this type of posterior distribution can be well approximated by a Gaussian distribution that matches the mode and the curvature at the mode. The resulting approximation is known as the Laplace approximation and can be expressed as

p(z|y, θ) ≈ exp{ −(1/2) z' [ V(θ) + diag(c) ] z },

where c comes from the second-order term in the Taylor expansion of Σ_{i=1}^{n} log p(y_i|z_i) at the mode of p(z|y, θ). The Laplace approximation may be employed to compute H(θ̄|θ̄).
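Returning to the first (nonparametric) situation, the kernel-based approximation of p(z_i|y_i, θ̄) can be implemented with a standard Gaussian kernel density estimator, as in the Clark model illustration above. The following sketch assumes one array of posterior draws per low-dimensional z_i; the function names are illustrative assumptions, and the Hermite-expansion alternative of Ibrahim et al. (2008) is not shown.

import numpy as np
from scipy.stats import gaussian_kde

def H_bar_kernel(z_draws_by_obs):
    # H(theta_bar | theta_bar) = sum_i E[ log p(z_i | y_i, theta_bar) | y ], with each
    # conditional density replaced by a Gaussian kernel density fitted to the draws of z_i.
    H_bar = 0.0
    for draws in z_draws_by_obs:               # draws: shape (M,) or (M, d) for one z_i
        pts = np.atleast_2d(np.asarray(draws, dtype=float).T)   # gaussian_kde expects (d, M)
        kde = gaussian_kde(pts)
        H_bar += float(np.mean(kde.logpdf(pts)))
    return H_bar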

After p(y|θ̄) is obtained, it is easy to obtain D(θ̄). It is important to point out that the numerical evaluation of p(y|θ̄) is needed only once, i.e., at the posterior mean.

To compute P_D*, we have to calculate the second derivative of the observed-data likelihood function in (11). The following two lemmas show how to compute the second derivatives.

Lemma 3.3 Under mild regularity conditions, the observed-data information matrix may be expressed as

I(θ) = − ∂² L_o(y|θ)/∂θ∂θ' = − { ∂² Q(θ|θ*)/∂θ∂θ' + ∂² Q(θ|θ*)/∂θ∂θ*' } |_{θ*=θ}.   (15)

Lemma 3.4 Let S(x|θ) = ∂ L_c(x|θ)/∂θ. Under mild regularity conditions, the observed-data information matrix has an equivalent form:

I(θ) = − ∂² L_o(y|θ)/∂θ∂θ' = E_{z|y,θ}[ − ∂² L_c(x|θ)/∂θ∂θ' ] − Var_{z|y,θ}[ S(x|θ) ]   (16)
     = E_{z|y,θ}[ − ∂² L_c(x|θ)/∂θ∂θ' − S(x|θ) S(x|θ)' ] + E_{z|y,θ}[ S(x|θ) ] E_{z|y,θ}[ S(x|θ) ]',

where all the expectations are taken with respect to the conditional distribution of z given y and θ.

Remark 3.22 Lemma 3.3 and Lemma 3.4 were developed in Oakes (1999) and Louis (1982), respectively, for finding the standard error in the EM algorithm. If the Q function is available, we can use Lemma 3.3 to evaluate the second derivatives. If the Q function does not have an analytic form, we may use Lemma 3.4 to evaluate the second derivatives as follows:

E_{z|y,θ}[ − ∂² L_c(x|θ)/∂θ∂θ' − S(x|θ) S(x|θ)' ] ≈ −(1/M) Σ_{m=1}^{M} [ ∂² L_c(y, z^(m)|θ)/∂θ∂θ' + S(y, z^(m)|θ) S(y, z^(m)|θ)' ],

E_{z|y,θ}[ S(x|θ) ] ≈ (1/M) Σ_{m=1}^{M} S(y, z^(m)|θ),

where {z^(m), m = 1, 2, ..., M} are random observations drawn from the posterior distribution p(z|y, θ).
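The Monte Carlo version of Lemma 3.4 above gives I(θ̄) directly from the posterior draws of z. A minimal sketch follows; score(z, theta) and neg_hess_complete(z, theta) are assumed user-supplied routines returning ∂L_c/∂θ and −∂²L_c/∂θ∂θ' for the model at hand, and all names are illustrative assumptions.

import numpy as np

def observed_information_louis(z_draws, theta_bar, score, neg_hess_complete):
    # I(theta) = E_z[ -d2 L_c/dtheta dtheta' - S S' ] + E_z[S] E_z[S]'  (Equation (16)),
    # with the expectations over p(z | y, theta) replaced by averages over the MCMC draws.
    S = np.array([score(z, theta_bar) for z in z_draws])                    # shape (M, P)
    first = np.mean([neg_hess_complete(z, theta_bar) - np.outer(s, s)
                     for z, s in zip(z_draws, S)], axis=0)
    S_bar = S.mean(axis=0)
    return first + np.outer(S_bar, S_bar)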

4 Examples

We now illustrate the proposed method in two applications. In the first example, while p(y|θ) is not available in closed form, the Kalman filter provides a recursive algorithm to evaluate it. Hence, Q(θ̄|θ̄) and H(θ̄|θ̄) can be calculated in the same manner, facilitating the computation of RDIC, while DIC_1 is much harder to compute. In the second example, p(y|θ) is not available in closed form and the Kalman filter cannot be applied. To compute RDIC, we use the Laplace approximation and the techniques suggested in Section 3.3.

4.1 Comparing high dimensional dynamic factor models

For many countries, there exists a rich array of macroeconomic time series and financial time series. To reduce the dimensionality and to extract the information from the large number of time series, factor analysis has been widely used in the empirical macroeconomic literature and in the empirical finance literature. For example, by extending the static factor models previously developed for cross-sectional data, Geweke (1977) proposed the dynamic factor model for time series data. Many empirical studies, such as Sargent and Sims (1977) and Giannone et al. (2004), have reported evidence that a large fraction of the variance of many macroeconomic series can be explained by a small number of dynamic factors. Stock and Watson (1999) and Stock and Watson (2002) showed that dynamic factors extracted from a large number of predictors can lead to improvements in predicting macroeconomic variables. Not surprisingly, high dimensional dynamic factor models have become a popular tool in a data-rich environment for macroeconomists and policy makers. An excellent review of dynamic factor models is given by Stock and Watson (2010).

Following Bernanke et al. (2005), BBE hereafter, the present paper considers the following fundamental dynamic factor model:

Y_t = F_t L + ε_t,
F_t = F_{t−1} Φ + η_t,

where Y_t is a 1 × N vector of time series variables, F_t a 1 × K vector of unobserved latent factors which contains the information extracted from all the N time series variables, L an N × K factor loading matrix, and Φ the K × K autoregressive parameter matrix of the unobserved latent factors. It is assumed that ε_t ~ N(0, Σ) and η_t ~ N(0, Q). For the purpose of identification, Σ is assumed to be diagonal, and ε_t and η_t are assumed to be independent of each other. Following BBE (2005), we set the first K × K block of the loading matrix L to be the identity matrix.
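Before turning to the data, it may help to see the model as a data-generating process. The sketch below simulates a small panel from the state-space form above, with the first K × K block of the loadings fixed to the identity for identification; the autoregressive matrix, the loading values and the error variances are illustrative assumptions, not estimates from the paper.

import numpy as np

def simulate_dfm(T=500, N=120, K=3, seed=0):
    # Y_t = F_t L + eps_t, F_t = F_{t-1} Phi + eta_t, with Y_t (1 x N) and F_t (1 x K);
    # L is stored here as K x N (the transpose of the N x K loading matrix in the text),
    # with its first K x K block equal to the identity.
    rng = np.random.default_rng(seed)
    Phi = 0.7 * np.eye(K)                                            # factor VAR(1) matrix
    L = np.hstack([np.eye(K), rng.normal(0.0, 0.5, (K, N - K))])     # K x N loadings
    sigma = rng.uniform(0.5, 1.0, N)                                 # sqrt of the diagonal of Sigma
    F = np.zeros((T, K))
    Y = np.zeros((T, N))
    for t in range(T):
        prev = F[t - 1] if t > 0 else np.zeros(K)
        F[t] = prev @ Phi + rng.normal(0.0, 0.3, K)                  # eta_t ~ N(0, Q), Q = 0.09 I_K
        Y[t] = F[t] @ L + rng.normal(0.0, sigma)                     # eps_t ~ N(0, Sigma)
    return Y, F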

In this dynamic factor model, the observed variable Y_t consists of a balanced panel of 120 US monthly macroeconomic time series. These series are initially transformed to induce stationarity. The description of the series and the transformations is provided in BBE (2005). The sample period is from January 1959 to August 2001. Because the data are of high dimension, the analysis of dynamic factor models via frequentist methods is not trivial; see the discussion in Stock and Watson (2011). In the literature, Bayesian inference via MCMC techniques has been popular for analyzing dynamic factor models; see Otrok and Whiteman (1998), Kose et al. (2003), Kose et al. (2008) and BBE (2005). Following BBE (2005), we specify the following prior distributions:

Σ_ii ~ Inverse-Γ(3, 0.001),   L_i ~ N(0, Σ_ii M_0^{-1}),
vec(Φ) | Q ~ N(0, Q ⊗ Ω_0),   Q ~ Inverse-Γ(Q_0, K + 2),

where M_0 is a K × K identity matrix and L_i is the ith (i > K) column of L. The diagonal elements of Q_0 are set to the residual variances of the corresponding one-lag univariate autoregressions, σ̂_i². The diagonal elements of Ω_0 are constructed so that the prior variance of the parameter on the jth variable in the ith equation equals σ̂_i²/σ̂_j².

In this example, we aim to determine the number of factors in the dynamic factor model using model selection criteria. In BBE (2005) model comparison is achieved by graphical methods. Our approach can be regarded as a formal statistical alternative to the graphical methods. It is well documented that the determination of the number of factors in the setting of dynamic factor models is important; see Stock and Watson (2011). As in the previous example, we use DIC_7 and RDIC to compare models with different numbers of factors, namely K = 1, 2 and 3, denoted M_1, M_2 and M_3, respectively. Using the Gibbs sampler, we sample 22,000 random observations from the corresponding posterior distributions. We discard the first 2,000 observations and keep the following 20,000 as the effective sample from the posterior distribution of the parameters.

Following the suggestion of a referee, we also compare the alternative models using the marginal likelihood approach. Unfortunately, the prior distributions of Φ and Q in BBE (2005) depend on the latent variables, which leads to implicit joint prior distributions of L, R, Φ and Q. Consequently, it is difficult to calculate the joint prior density of L, R, Φ and Q. To avoid evaluating the joint prior density, we calculate the marginal likelihood by the harmonic mean method (Newton and Raftery 1994), which only requires the reciprocal of the likelihood at each posterior draw of the parameters.

Based on the 20,000 samples, we compute DIC_7, RDIC, and the marginal likelihood for all three models. The technique in Lemma 3.2 is used to approximate the observed-data likelihood at the posterior mean. Table 1 reports the simple count of the number of parameters including the latent variables, DIC_7 and its P_D component (i.e., when the data augmentation technique is used), the simple count of the number of parameters excluding the latent variables, RDIC with its P_D* and D(θ̄) components (i.e., when the data augmentation technique is not used), and the log marginal likelihood.

Table 1: Model selection results for dynamic factor models

Model                                            M_1    M_2    M_3
Number of parameters (with latent variables)     ...    ...    ...
P_D                                              ...    ...    ...
DIC_7                                            ...    ...    ...
Number of parameters (without latent variables)  ...    ...    ...
P_D*                                             ...    ...    ...
D(θ̄)                                            ...    ...    ...
RDIC                                             ...    ...    ...
Log marginal likelihood                          ...    ...    ...

Several conclusions may be drawn from Table 1. First, DIC_7, RDIC and the marginal likelihood all suggest that M_3 is the best model, followed by Model 2 and then by Model 1. Model 3 has a higher effective number of parameters than the other two models, but the gain in the fit to the data is greater. The conclusion is that at least 3 factors are needed to describe the joint movement of the 120 macroeconomic time series. Second, since some very informative priors have been used, neither P_D nor P_D* is close to the actual number of parameters. While it is cheap to compute RDIC, it is much harder to compute DIC_1. This is because the observed-data likelihood p(y|θ) is not available in closed form and the Kalman filter is used to calculate p(y|θ) numerically, which involves the computation of (1/J) Σ_{j=1}^{J} log p(y|θ^(j)) for J = 20,000. We have to run the Kalman filter 20,000 times, which takes more than 4 hours to compute in Matlab (numerically more efficient algorithms, such as the one proposed by Chan and Jeliazkov (2009), may be used to evaluate log p(y|θ^(j))). In sharp contrast, it took less than 80 seconds to compute RDIC. Obviously, the discrepancy in CPU time increases with J.
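The harmonic mean estimator used above needs only the likelihood value at each posterior draw. A minimal sketch, written in logs for numerical stability, is given below; the function name and the log-sum-exp treatment are illustrative assumptions.

import numpy as np

def log_marglik_harmonic_mean(log_lik_draws):
    # Newton and Raftery (1994): p(y) is estimated by the harmonic mean of p(y | theta_j)
    # over posterior draws, i.e. [ (1/J) sum_j 1 / p(y | theta_j) ]^(-1).
    ll = np.asarray(log_lik_draws, dtype=float)
    m = (-ll).max()
    log_mean_inverse_lik = m + np.log(np.mean(np.exp(-ll - m)))      # log of (1/J) sum_j exp(-ll_j)
    return -log_mean_inverse_lik

# usage sketch: pass log p(y | theta_j) computed (e.g., by the Kalman filter) at each draw
# log_ml = log_marglik_harmonic_mean(log_liks)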

4.2 Comparing stochastic volatility models

Stochastic volatility (SV) models have been found very useful for pricing derivative securities. In discrete-time log-normal SV models, the logarithmic volatility is the state variable, which is often assumed to follow an AR(1) model. The basic log-normal SV model is of the form:

y_t = α + exp(h_t/2) u_t,   u_t ~ N(0, 1),
h_t = µ + φ(h_{t−1} − µ) + v_t,   v_t ~ N(0, τ²),

where t = 1, 2, ..., n, y_t is the continuously compounded return, h_t the unobserved log-volatility, h_0 = µ, and u_t and v_t are independent normal variables for all t. In this paper, we denote this model by M_1. To carry out Bayesian analysis of M_1, following Meyer and Yu (2000), the prior distributions are specified as follows: α ~ N(0, 100), µ ~ N(0, 100), φ ~ Beta(1, 1), 1/τ² ~ Γ(0.001, 0.001).

An alternative specification of M_1 is given by:

y_t = α + σ_t u_t,   u_t ~ N(0, 1),
log σ_t² = µ + φ(log σ_{t−1}² − µ) + v_t,   v_t ~ N(0, τ²),
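For reference, data from the basic log-normal SV model M_1 can be simulated as follows. The parameter values below are illustrative assumptions only; they are not the estimates or the data used in this section.

import numpy as np

def simulate_sv(n=1000, alpha=0.0, mu=-1.0, phi=0.95, tau=0.2, seed=0):
    # y_t = alpha + exp(h_t / 2) u_t,  h_t = mu + phi (h_{t-1} - mu) + v_t,
    # u_t ~ N(0, 1), v_t ~ N(0, tau^2), h_0 = mu
    rng = np.random.default_rng(seed)
    h = np.empty(n)
    prev = mu
    for t in range(n):
        h[t] = mu + phi * (prev - mu) + rng.normal(0.0, tau)
        prev = h[t]
    y = alpha + np.exp(h / 2.0) * rng.normal(0.0, 1.0, n)
    return y, h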


ST440/540: Applied Bayesian Statistics. (9) Model selection and goodness-of-fit checks (9) Model selection and goodness-of-fit checks Objectives In this module we will study methods for model comparisons and checking for model adequacy For model comparisons there are a finite number of candidate

More information

BAYESIAN MODEL CRITICISM

BAYESIAN MODEL CRITICISM Monte via Chib s BAYESIAN MODEL CRITICM Hedibert Freitas Lopes The University of Chicago Booth School of Business 5807 South Woodlawn Avenue, Chicago, IL 60637 http://faculty.chicagobooth.edu/hedibert.lopes

More information

Bayesian Hypothesis Testing in Latent Variable Models

Bayesian Hypothesis Testing in Latent Variable Models Bayesian Hypothesis Testing in Latent Variable Models Yong Li Sun Yat-Sen University Jun Yu Singapore Management University Abstract: Hypothesis testing using Bayes factors (BFs) is known to suffer from

More information

Integrated Non-Factorized Variational Inference

Integrated Non-Factorized Variational Inference Integrated Non-Factorized Variational Inference Shaobo Han, Xuejun Liao and Lawrence Carin Duke University February 27, 2014 S. Han et al. Integrated Non-Factorized Variational Inference February 27, 2014

More information

Cross-sectional space-time modeling using ARNN(p, n) processes

Cross-sectional space-time modeling using ARNN(p, n) processes Cross-sectional space-time modeling using ARNN(p, n) processes W. Polasek K. Kakamu September, 006 Abstract We suggest a new class of cross-sectional space-time models based on local AR models and nearest

More information

High-dimensional Problems in Finance and Economics. Thomas M. Mertens

High-dimensional Problems in Finance and Economics. Thomas M. Mertens High-dimensional Problems in Finance and Economics Thomas M. Mertens NYU Stern Risk Economics Lab April 17, 2012 1 / 78 Motivation Many problems in finance and economics are high dimensional. Dynamic Optimization:

More information

Lecture 13 Fundamentals of Bayesian Inference

Lecture 13 Fundamentals of Bayesian Inference Lecture 13 Fundamentals of Bayesian Inference Dennis Sun Stats 253 August 11, 2014 Outline of Lecture 1 Bayesian Models 2 Modeling Correlations Using Bayes 3 The Universal Algorithm 4 BUGS 5 Wrapping Up

More information

Gibbs Sampling in Latent Variable Models #1

Gibbs Sampling in Latent Variable Models #1 Gibbs Sampling in Latent Variable Models #1 Econ 690 Purdue University Outline 1 Data augmentation 2 Probit Model Probit Application A Panel Probit Panel Probit 3 The Tobit Model Example: Female Labor

More information

Bayes: All uncertainty is described using probability.

Bayes: All uncertainty is described using probability. Bayes: All uncertainty is described using probability. Let w be the data and θ be any unknown quantities. Likelihood. The probability model π(w θ) has θ fixed and w varying. The likelihood L(θ; w) is π(w

More information

Unsupervised Learning

Unsupervised Learning Unsupervised Learning Bayesian Model Comparison Zoubin Ghahramani zoubin@gatsby.ucl.ac.uk Gatsby Computational Neuroscience Unit, and MSc in Intelligent Systems, Dept Computer Science University College

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 7 Approximate

More information

(I AL BL 2 )z t = (I CL)ζ t, where

(I AL BL 2 )z t = (I CL)ζ t, where ECO 513 Fall 2011 MIDTERM EXAM The exam lasts 90 minutes. Answer all three questions. (1 Consider this model: x t = 1.2x t 1.62x t 2 +.2y t 1.23y t 2 + ε t.7ε t 1.9ν t 1 (1 [ εt y t = 1.4y t 1.62y t 2

More information

Generalized Autoregressive Score Models

Generalized Autoregressive Score Models Generalized Autoregressive Score Models by: Drew Creal, Siem Jan Koopman, André Lucas To capture the dynamic behavior of univariate and multivariate time series processes, we can allow parameters to be

More information

Assessing Regime Uncertainty Through Reversible Jump McMC

Assessing Regime Uncertainty Through Reversible Jump McMC Assessing Regime Uncertainty Through Reversible Jump McMC August 14, 2008 1 Introduction Background Research Question 2 The RJMcMC Method McMC RJMcMC Algorithm Dependent Proposals Independent Proposals

More information

Bayesian Model Comparison:

Bayesian Model Comparison: Bayesian Model Comparison: Modeling Petrobrás log-returns Hedibert Freitas Lopes February 2014 Log price: y t = log p t Time span: 12/29/2000-12/31/2013 (n = 3268 days) LOG PRICE 1 2 3 4 0 500 1000 1500

More information

Dynamic System Identification using HDMR-Bayesian Technique

Dynamic System Identification using HDMR-Bayesian Technique Dynamic System Identification using HDMR-Bayesian Technique *Shereena O A 1) and Dr. B N Rao 2) 1), 2) Department of Civil Engineering, IIT Madras, Chennai 600036, Tamil Nadu, India 1) ce14d020@smail.iitm.ac.in

More information

The Metropolis-Hastings Algorithm. June 8, 2012

The Metropolis-Hastings Algorithm. June 8, 2012 The Metropolis-Hastings Algorithm June 8, 22 The Plan. Understand what a simulated distribution is 2. Understand why the Metropolis-Hastings algorithm works 3. Learn how to apply the Metropolis-Hastings

More information

Model comparison. Christopher A. Sims Princeton University October 18, 2016

Model comparison. Christopher A. Sims Princeton University October 18, 2016 ECO 513 Fall 2008 Model comparison Christopher A. Sims Princeton University sims@princeton.edu October 18, 2016 c 2016 by Christopher A. Sims. This document may be reproduced for educational and research

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning Brown University CSCI 1950-F, Spring 2012 Prof. Erik Sudderth Lecture 25: Markov Chain Monte Carlo (MCMC) Course Review and Advanced Topics Many figures courtesy Kevin

More information

COS513 LECTURE 8 STATISTICAL CONCEPTS

COS513 LECTURE 8 STATISTICAL CONCEPTS COS513 LECTURE 8 STATISTICAL CONCEPTS NIKOLAI SLAVOV AND ANKUR PARIKH 1. MAKING MEANINGFUL STATEMENTS FROM JOINT PROBABILITY DISTRIBUTIONS. A graphical model (GM) represents a family of probability distributions

More information

The Bayesian approach to inverse problems

The Bayesian approach to inverse problems The Bayesian approach to inverse problems Youssef Marzouk Department of Aeronautics and Astronautics Center for Computational Engineering Massachusetts Institute of Technology ymarz@mit.edu, http://uqgroup.mit.edu

More information

One-parameter models

One-parameter models One-parameter models Patrick Breheny January 22 Patrick Breheny BST 701: Bayesian Modeling in Biostatistics 1/17 Introduction Binomial data is not the only example in which Bayesian solutions can be worked

More information

Bayesian Estimation of DSGE Models

Bayesian Estimation of DSGE Models Bayesian Estimation of DSGE Models Stéphane Adjemian Université du Maine, GAINS & CEPREMAP stephane.adjemian@univ-lemans.fr http://www.dynare.org/stepan June 28, 2011 June 28, 2011 Université du Maine,

More information

The Mixture Approach for Simulating New Families of Bivariate Distributions with Specified Correlations

The Mixture Approach for Simulating New Families of Bivariate Distributions with Specified Correlations The Mixture Approach for Simulating New Families of Bivariate Distributions with Specified Correlations John R. Michael, Significance, Inc. and William R. Schucany, Southern Methodist University The mixture

More information

Invariant HPD credible sets and MAP estimators

Invariant HPD credible sets and MAP estimators Bayesian Analysis (007), Number 4, pp. 681 69 Invariant HPD credible sets and MAP estimators Pierre Druilhet and Jean-Michel Marin Abstract. MAP estimators and HPD credible sets are often criticized in

More information

Monte Carlo Studies. The response in a Monte Carlo study is a random variable.

Monte Carlo Studies. The response in a Monte Carlo study is a random variable. Monte Carlo Studies The response in a Monte Carlo study is a random variable. The response in a Monte Carlo study has a variance that comes from the variance of the stochastic elements in the data-generating

More information

Spatial Statistics Chapter 4 Basics of Bayesian Inference and Computation

Spatial Statistics Chapter 4 Basics of Bayesian Inference and Computation Spatial Statistics Chapter 4 Basics of Bayesian Inference and Computation So far we have discussed types of spatial data, some basic modeling frameworks and exploratory techniques. We have not discussed

More information

Nowcasting Norwegian GDP

Nowcasting Norwegian GDP Nowcasting Norwegian GDP Knut Are Aastveit and Tørres Trovik May 13, 2007 Introduction Motivation The last decades of advances in information technology has made it possible to access a huge amount of

More information

Econ 423 Lecture Notes: Additional Topics in Time Series 1

Econ 423 Lecture Notes: Additional Topics in Time Series 1 Econ 423 Lecture Notes: Additional Topics in Time Series 1 John C. Chao April 25, 2017 1 These notes are based in large part on Chapter 16 of Stock and Watson (2011). They are for instructional purposes

More information

BEAR 4.2. Introducing Stochastic Volatility, Time Varying Parameters and Time Varying Trends. A. Dieppe R. Legrand B. van Roye ECB.

BEAR 4.2. Introducing Stochastic Volatility, Time Varying Parameters and Time Varying Trends. A. Dieppe R. Legrand B. van Roye ECB. BEAR 4.2 Introducing Stochastic Volatility, Time Varying Parameters and Time Varying Trends A. Dieppe R. Legrand B. van Roye ECB 25 June 2018 The views expressed in this presentation are the authors and

More information

Bayesian Inference for DSGE Models. Lawrence J. Christiano

Bayesian Inference for DSGE Models. Lawrence J. Christiano Bayesian Inference for DSGE Models Lawrence J. Christiano Outline State space-observer form. convenient for model estimation and many other things. Bayesian inference Bayes rule. Monte Carlo integation.

More information

Research Division Federal Reserve Bank of St. Louis Working Paper Series

Research Division Federal Reserve Bank of St. Louis Working Paper Series Research Division Federal Reserve Bank of St Louis Working Paper Series Kalman Filtering with Truncated Normal State Variables for Bayesian Estimation of Macroeconomic Models Michael Dueker Working Paper

More information

A Robust Approach to Estimating Production Functions: Replication of the ACF procedure

A Robust Approach to Estimating Production Functions: Replication of the ACF procedure A Robust Approach to Estimating Production Functions: Replication of the ACF procedure Kyoo il Kim Michigan State University Yao Luo University of Toronto Yingjun Su IESR, Jinan University August 2018

More information

A Note on Lenk s Correction of the Harmonic Mean Estimator

A Note on Lenk s Correction of the Harmonic Mean Estimator Central European Journal of Economic Modelling and Econometrics Note on Lenk s Correction of the Harmonic Mean Estimator nna Pajor, Jacek Osiewalski Submitted: 5.2.203, ccepted: 30.0.204 bstract The paper

More information

Principles of Bayesian Inference

Principles of Bayesian Inference Principles of Bayesian Inference Sudipto Banerjee and Andrew O. Finley 2 Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota, U.S.A. 2 Department of Forestry & Department

More information

Katsuhiro Sugita Faculty of Law and Letters, University of the Ryukyus. Abstract

Katsuhiro Sugita Faculty of Law and Letters, University of the Ryukyus. Abstract Bayesian analysis of a vector autoregressive model with multiple structural breaks Katsuhiro Sugita Faculty of Law and Letters, University of the Ryukyus Abstract This paper develops a Bayesian approach

More information

Bayesian Estimation of DSGE Models 1 Chapter 3: A Crash Course in Bayesian Inference

Bayesian Estimation of DSGE Models 1 Chapter 3: A Crash Course in Bayesian Inference 1 The views expressed in this paper are those of the authors and do not necessarily reflect the views of the Federal Reserve Board of Governors or the Federal Reserve System. Bayesian Estimation of DSGE

More information

Discussion of Predictive Density Combinations with Dynamic Learning for Large Data Sets in Economics and Finance

Discussion of Predictive Density Combinations with Dynamic Learning for Large Data Sets in Economics and Finance Discussion of Predictive Density Combinations with Dynamic Learning for Large Data Sets in Economics and Finance by Casarin, Grassi, Ravazzolo, Herman K. van Dijk Dimitris Korobilis University of Essex,

More information

Bayesian Inference in GLMs. Frequentists typically base inferences on MLEs, asymptotic confidence

Bayesian Inference in GLMs. Frequentists typically base inferences on MLEs, asymptotic confidence Bayesian Inference in GLMs Frequentists typically base inferences on MLEs, asymptotic confidence limits, and log-likelihood ratio tests Bayesians base inferences on the posterior distribution of the unknowns

More information

Bayesian Inference: Probit and Linear Probability Models

Bayesian Inference: Probit and Linear Probability Models Utah State University DigitalCommons@USU All Graduate Plan B and other Reports Graduate Studies 5-1-2014 Bayesian Inference: Probit and Linear Probability Models Nate Rex Reasch Utah State University Follow

More information

Hastings-within-Gibbs Algorithm: Introduction and Application on Hierarchical Model

Hastings-within-Gibbs Algorithm: Introduction and Application on Hierarchical Model UNIVERSITY OF TEXAS AT SAN ANTONIO Hastings-within-Gibbs Algorithm: Introduction and Application on Hierarchical Model Liang Jing April 2010 1 1 ABSTRACT In this paper, common MCMC algorithms are introduced

More information

Parametric Techniques Lecture 3

Parametric Techniques Lecture 3 Parametric Techniques Lecture 3 Jason Corso SUNY at Buffalo 22 January 2009 J. Corso (SUNY at Buffalo) Parametric Techniques Lecture 3 22 January 2009 1 / 39 Introduction In Lecture 2, we learned how to

More information

Ronald Christensen. University of New Mexico. Albuquerque, New Mexico. Wesley Johnson. University of California, Irvine. Irvine, California

Ronald Christensen. University of New Mexico. Albuquerque, New Mexico. Wesley Johnson. University of California, Irvine. Irvine, California Texts in Statistical Science Bayesian Ideas and Data Analysis An Introduction for Scientists and Statisticians Ronald Christensen University of New Mexico Albuquerque, New Mexico Wesley Johnson University

More information

Markov Chain Monte Carlo methods

Markov Chain Monte Carlo methods Markov Chain Monte Carlo methods Tomas McKelvey and Lennart Svensson Signal Processing Group Department of Signals and Systems Chalmers University of Technology, Sweden November 26, 2012 Today s learning

More information

DSGE Methods. Estimation of DSGE models: Maximum Likelihood & Bayesian. Willi Mutschler, M.Sc.

DSGE Methods. Estimation of DSGE models: Maximum Likelihood & Bayesian. Willi Mutschler, M.Sc. DSGE Methods Estimation of DSGE models: Maximum Likelihood & Bayesian Willi Mutschler, M.Sc. Institute of Econometrics and Economic Statistics University of Münster willi.mutschler@uni-muenster.de Summer

More information

Variational Principal Components

Variational Principal Components Variational Principal Components Christopher M. Bishop Microsoft Research 7 J. J. Thomson Avenue, Cambridge, CB3 0FB, U.K. cmbishop@microsoft.com http://research.microsoft.com/ cmbishop In Proceedings

More information

1 Hypothesis Testing and Model Selection

1 Hypothesis Testing and Model Selection A Short Course on Bayesian Inference (based on An Introduction to Bayesian Analysis: Theory and Methods by Ghosh, Delampady and Samanta) Module 6: From Chapter 6 of GDS 1 Hypothesis Testing and Model Selection

More information

Variational Scoring of Graphical Model Structures

Variational Scoring of Graphical Model Structures Variational Scoring of Graphical Model Structures Matthew J. Beal Work with Zoubin Ghahramani & Carl Rasmussen, Toronto. 15th September 2003 Overview Bayesian model selection Approximations using Variational

More information

Point, Interval, and Density Forecast Evaluation of Linear versus Nonlinear DSGE Models

Point, Interval, and Density Forecast Evaluation of Linear versus Nonlinear DSGE Models Point, Interval, and Density Forecast Evaluation of Linear versus Nonlinear DSGE Models Francis X. Diebold Frank Schorfheide Minchul Shin University of Pennsylvania May 4, 2014 1 / 33 Motivation The use

More information

Contents. Part I: Fundamentals of Bayesian Inference 1

Contents. Part I: Fundamentals of Bayesian Inference 1 Contents Preface xiii Part I: Fundamentals of Bayesian Inference 1 1 Probability and inference 3 1.1 The three steps of Bayesian data analysis 3 1.2 General notation for statistical inference 4 1.3 Bayesian

More information

DIC: Deviance Information Criterion

DIC: Deviance Information Criterion (((( Welcome Page Latest News DIC: Deviance Information Criterion Contact us/bugs list WinBUGS New WinBUGS examples FAQs DIC GeoBUGS DIC (Deviance Information Criterion) is a Bayesian method for model

More information

Riemann Manifold Methods in Bayesian Statistics

Riemann Manifold Methods in Bayesian Statistics Ricardo Ehlers ehlers@icmc.usp.br Applied Maths and Stats University of São Paulo, Brazil Working Group in Statistical Learning University College Dublin September 2015 Bayesian inference is based on Bayes

More information

Departamento de Economía Universidad de Chile

Departamento de Economía Universidad de Chile Departamento de Economía Universidad de Chile GRADUATE COURSE SPATIAL ECONOMETRICS November 14, 16, 17, 20 and 21, 2017 Prof. Henk Folmer University of Groningen Objectives The main objective of the course

More information

Covariance function estimation in Gaussian process regression

Covariance function estimation in Gaussian process regression Covariance function estimation in Gaussian process regression François Bachoc Department of Statistics and Operations Research, University of Vienna WU Research Seminar - May 2015 François Bachoc Gaussian

More information

NORGES TEKNISK-NATURVITENSKAPELIGE UNIVERSITET

NORGES TEKNISK-NATURVITENSKAPELIGE UNIVERSITET NORGES TEKNISK-NATURVITENSKAPELIGE UNIVERSITET Investigating posterior contour probabilities using INLA: A case study on recurrence of bladder tumours by Rupali Akerkar PREPRINT STATISTICS NO. 4/2012 NORWEGIAN

More information

Introduction to Systems Analysis and Decision Making Prepared by: Jakub Tomczak

Introduction to Systems Analysis and Decision Making Prepared by: Jakub Tomczak Introduction to Systems Analysis and Decision Making Prepared by: Jakub Tomczak 1 Introduction. Random variables During the course we are interested in reasoning about considered phenomenon. In other words,

More information

Author's personal copy

Author's personal copy Journal of Econometrics 166 (01) 37 46 Contents lists available at SciVerse ScienceDirect Journal of Econometrics journal homepage: wwwelseviercom/locate/jeconom Bayesian hypothesis testing in latent variable

More information

Bayesian Model Diagnostics and Checking

Bayesian Model Diagnostics and Checking Earvin Balderama Quantitative Ecology Lab Department of Forestry and Environmental Resources North Carolina State University April 12, 2013 1 / 34 Introduction MCMCMC 2 / 34 Introduction MCMCMC Steps in

More information

17 : Markov Chain Monte Carlo

17 : Markov Chain Monte Carlo 10-708: Probabilistic Graphical Models, Spring 2015 17 : Markov Chain Monte Carlo Lecturer: Eric P. Xing Scribes: Heran Lin, Bin Deng, Yun Huang 1 Review of Monte Carlo Methods 1.1 Overview Monte Carlo

More information

ECO 513 Fall 2009 C. Sims HIDDEN MARKOV CHAIN MODELS

ECO 513 Fall 2009 C. Sims HIDDEN MARKOV CHAIN MODELS ECO 513 Fall 2009 C. Sims HIDDEN MARKOV CHAIN MODELS 1. THE CLASS OF MODELS y t {y s, s < t} p(y t θ t, {y s, s < t}) θ t = θ(s t ) P[S t = i S t 1 = j] = h ij. 2. WHAT S HANDY ABOUT IT Evaluating the

More information

Nonparametric Bayesian Methods (Gaussian Processes)

Nonparametric Bayesian Methods (Gaussian Processes) [70240413 Statistical Machine Learning, Spring, 2015] Nonparametric Bayesian Methods (Gaussian Processes) Jun Zhu dcszj@mail.tsinghua.edu.cn http://bigml.cs.tsinghua.edu.cn/~jun State Key Lab of Intelligent

More information

Introduction to Bayesian Statistics and Markov Chain Monte Carlo Estimation. EPSY 905: Multivariate Analysis Spring 2016 Lecture #10: April 6, 2016

Introduction to Bayesian Statistics and Markov Chain Monte Carlo Estimation. EPSY 905: Multivariate Analysis Spring 2016 Lecture #10: April 6, 2016 Introduction to Bayesian Statistics and Markov Chain Monte Carlo Estimation EPSY 905: Multivariate Analysis Spring 2016 Lecture #10: April 6, 2016 EPSY 905: Intro to Bayesian and MCMC Today s Class An

More information

A note on Reversible Jump Markov Chain Monte Carlo

A note on Reversible Jump Markov Chain Monte Carlo A note on Reversible Jump Markov Chain Monte Carlo Hedibert Freitas Lopes Graduate School of Business The University of Chicago 5807 South Woodlawn Avenue Chicago, Illinois 60637 February, 1st 2006 1 Introduction

More information

A Bayesian perspective on GMM and IV

A Bayesian perspective on GMM and IV A Bayesian perspective on GMM and IV Christopher A. Sims Princeton University sims@princeton.edu November 26, 2013 What is a Bayesian perspective? A Bayesian perspective on scientific reporting views all

More information

Switching Regime Estimation

Switching Regime Estimation Switching Regime Estimation Series de Tiempo BIrkbeck March 2013 Martin Sola (FE) Markov Switching models 01/13 1 / 52 The economy (the time series) often behaves very different in periods such as booms

More information

Lecture Notes based on Koop (2003) Bayesian Econometrics

Lecture Notes based on Koop (2003) Bayesian Econometrics Lecture Notes based on Koop (2003) Bayesian Econometrics A.Colin Cameron University of California - Davis November 15, 2005 1. CH.1: Introduction The concepts below are the essential concepts used throughout

More information

BAYESIAN IRT MODELS INCORPORATING GENERAL AND SPECIFIC ABILITIES

BAYESIAN IRT MODELS INCORPORATING GENERAL AND SPECIFIC ABILITIES Behaviormetrika Vol.36, No., 2009, 27 48 BAYESIAN IRT MODELS INCORPORATING GENERAL AND SPECIFIC ABILITIES Yanyan Sheng and Christopher K. Wikle IRT-based models with a general ability and several specific

More information

Decision theory. 1 We may also consider randomized decision rules, where δ maps observed data D to a probability distribution over

Decision theory. 1 We may also consider randomized decision rules, where δ maps observed data D to a probability distribution over Point estimation Suppose we are interested in the value of a parameter θ, for example the unknown bias of a coin. We have already seen how one may use the Bayesian method to reason about θ; namely, we

More information

Bayesian Inference and MCMC

Bayesian Inference and MCMC Bayesian Inference and MCMC Aryan Arbabi Partly based on MCMC slides from CSC412 Fall 2018 1 / 18 Bayesian Inference - Motivation Consider we have a data set D = {x 1,..., x n }. E.g each x i can be the

More information

Inference in VARs with Conditional Heteroskedasticity of Unknown Form

Inference in VARs with Conditional Heteroskedasticity of Unknown Form Inference in VARs with Conditional Heteroskedasticity of Unknown Form Ralf Brüggemann a Carsten Jentsch b Carsten Trenkler c University of Konstanz University of Mannheim University of Mannheim IAB Nuremberg

More information

David Giles Bayesian Econometrics

David Giles Bayesian Econometrics David Giles Bayesian Econometrics 5. Bayesian Computation Historically, the computational "cost" of Bayesian methods greatly limited their application. For instance, by Bayes' Theorem: p(θ y) = p(θ)p(y

More information