Reality Checks and Nested Forecast Model Comparisons

Todd E. Clark, Federal Reserve Bank of Kansas City
Michael W. McCracken, Board of Governors of the Federal Reserve System

October 2006 (preliminary and incomplete; please do not cite)

Abstract

This paper examines the asymptotic and finite-sample properties of bootstrap-based tests of equal forecast accuracy among many nested models. Although direct, multi-step forecasts are permitted, attention is restricted to an OLS-estimated linear framework. Our analytics allow both finite-sample and asymptotic nesting. Our preferred parametric bootstrap algorithm is comparable to that in Kilian (1999) but focuses on approximating the distribution of the ratio rather than the difference in MSEs. As in Hansen (2005), we find that judiciously chosen centering and rescaling constants, which vary with the hypothesis of interest, are necessary to achieve appropriately sized tests. Monte Carlo simulations find that our preferred bootstrap has better finite-sample size and power than other methods designed to manage the comparison of non-nested models. We conclude with an empirical application regarding the predictive content of various measures of macroeconomic slack for inflation.

JEL Nos.: C53, C12, C52
Keywords: Reality check, nested models, prediction, long horizon

Clark (corresponding author): Economic Research Dept.; Federal Reserve Bank of Kansas City; 925 Grand; Kansas City, MO 64198. McCracken: Board of Governors of the Federal Reserve System; 20th and Constitution N.W.; Mail Stop #61; Washington, D.C. The views expressed herein are solely those of the authors and do not necessarily reflect the views of the Federal Reserve Bank of Kansas City, Board of Governors, Federal Reserve System, or any of its staff.

1 Introduction

With improvements in computational power, it has become increasingly straightforward to mine economic and financial datasets for variables that might be useful for forecasting. This point is made clearly in a number of papers, including Denton (1985), Lo and MacKinlay (1990), and Hoover and Perez (1999). In light of this data mining, it is clear that the search itself must be taken into account when trying to validate whether any of the findings are statistically significant. Various methods exist to manage this multiple testing problem, including the traditional use of Bonferroni bounds as well as more recent methods involving q-values (Storey 2002). Each of these methods has its own strengths and weaknesses in terms of its applicability to a given situation.

One method of managing this multiple testing problem is the bootstrap. Specifically, in the context of out-of-sample tests of predictive ability, White (2000) develops a bootstrap method of constructing an asymptotically valid test of the null hypothesis that forecasts from a potentially large number of predictive models are as accurate as those from a baseline model. Since then, extensions have been developed that broaden the applicability of the results in White (2000). Hansen (2005) shows that normalizing and re-centering the test statistic in a specific manner can lead to a more accurately sized and powerful test. Corradi and Swanson (2006) provide important modifications that permit situations in which parameter estimation error enters the asymptotic distribution.

One thing these bootstrap "reality checks" have in common is that they assume that the baseline model is non-nested with at least one of the competing models.
More precisely, they require that the $(M \times 1)$ vector of out-of-sample averages of the loss differentials (each element of the vector is the average difference in forecast accuracy between the baseline and a distinct competing model) is asymptotically normal with a positive semi-definite long-run covariance matrix. Along with additional moment and mixing conditions, we can ensure that this condition is satisfied if at least one of the models is non-nested with the baseline.

However, it is frequently the case that the baseline model is a simpler, nested version of the competing models. The purpose of such a comparison is to determine whether the additional predictors associated with the larger, unrestricted models improve forecast accuracy relative to the baseline. Examples include Cheung, Chinn, and Garcia's (2005) examination of various exchange rate models, Goyal and Welch's (2003, 2004) studies of the predictive content of a variety of business fundamentals for stock returns, and Stock and Watson's (1999, 2003) analyses of output and inflation forecasting models. In general, economic theories of efficient markets imply that asset returns form a martingale difference, and hence all predictive models nest the baseline model of zero expected return under this hypothesis.

Consequently, the results in studies such as White (2000) on the asymptotic and finite-sample properties of bootstrap-based joint tests of equal forecast accuracy, which are based on non-nested models, may not apply. Intuitively, with nested models, the null hypothesis that the restrictions imposed in the benchmark model are true implies that the population errors of the competing forecasting models are exactly the same. This in turn implies, for example, that the population difference between the competing models' mean square forecast errors is exactly zero, with zero variance. As a result, the distribution of a t-statistic for equal MSE may be non-standard. Indeed, Clark and McCracken (2001, 2005) and McCracken (2006) show that, for forecasts from nested models, tests for equal forecast accuracy are typically not normally distributed.

Motivated by the frequency with which researchers compare predictions from a baseline nested model with many alternative, nesting models, this paper examines the asymptotic and finite-sample properties of bootstrap-based joint tests of equal out-of-sample forecast accuracy among such forecasts. By focusing on nested models we are able to address various forms of the null hypothesis of joint equal predictive ability. One hypothesis, considered in Clark and McCracken (2001, 2005) and McCracken (2006), is that the models have equal population-level predictive ability.
This situation arises when the coefficients associated with the additional predictors in the nesting models are zero; at the population level, the forecast errors are then identical and the models have equal predictive ability.

However, the main focus of this paper is to distinguish this null hypothesis from another: one that arises when some of the additional predictors have non-zero coefficients associated with them, but the marginal predictive content is small. In this case, addressed in Trenkler and Toutenburg (1992), Clark and McCracken (2006b), and Hjalmarsson (2006), the two models can have equal predictive ability at a fixed forecast origin (say time $T$) due to a bias-variance trade-off between a more accurately estimated, but misspecified, nested model and a correctly specified, but imprecisely estimated, nesting model. Building upon this insight, we derive the asymptotic distributions associated with standard out-of-sample tests of equal predictive ability between recursively estimated models with weak predictors. We then evaluate various bootstrap-based methods for imposing the null of equal predictive ability upon these distributions and conducting asymptotically valid inference. [1]

Our approach to modeling weak predictors is identical to the standard Pitman drift used to analyze the power of in-sample tests against small deviations from the null of equal population predictive ability. It has also been used by Inoue and Kilian (2004) in the context of analyzing the power of out-of-sample tests. In that sense, some (though not all) of our analytical results are quite similar to those in Inoue and Kilian (2004). We differ, though, in our focus. While Inoue and Kilian (2004) are interested in examining the power of out-of-sample tests against the null of equal population predictive ability, we are interested in using out-of-sample tests to test the null hypothesis of equal finite-sample predictive ability. This seemingly minor distinction matters once we realize that the estimation error associated with estimating unknown regression parameters can cause a misspecified, restricted model to be as accurate as, or more accurate than, a correctly specified unrestricted model when the additional predictors are imprecisely estimated (or, in our terminology, are "weak"). We use Pitman drift simply as a tool for constructing an asymptotic approximation to the finite-sample problem associated with estimating a regression coefficient when the marginal signal-to-noise ratio associated with it is small.
We should clarify the extent of our contribution to the literature on bootstrap-based methods for multiple testing. Hansen (2005) notes explicitly that his results and those in White (2000) do not apply when the baseline model nests all of its competitors. More importantly, Inoue and Kilian (2004) explicitly derive the asymptotic distribution of two tests for equal population-level out-of-sample predictive ability when the baseline model is nested by many competing models, one of which we reconsider here. Some of our results therefore bear a strong resemblance to theirs. As noted above, we differ in that our primary interest is the null of equal finite-sample predictive ability. In light of this, we also differ by clarifying what form of the bootstrap is needed in order to accurately test the null of equal population or finite-sample predictive ability.

[1] Another approach, taken by Giacomini and White (2006), is to estimate the models using a fixed, and finite, rolling window of observations. A reality check using this method of estimation is implemented in Kapetanios, Labhard, and Schleicher (2005).

Although our results apply only to a setup that some might see as restrictive (direct, multi-step (DMS) forecasts from nested models), the list of studies analyzing such forecasts suggests our results should be useful to many researchers. Recent applications considering a variety of DMS forecasts from nested linear models include, among others: the studies cited at the beginning of this section; Mark (1995); Kilian (1999); Rapach and Wohar (2006); Billmeier (2004); Butler, Grullon and Weston (2005); Cooper and Gulen (2006); Wang (2004); and Mönch (2005).

The remainder of the paper proceeds as follows. Section 2 introduces the notation and assumptions and presents our theoretical results. Section 3 characterizes the bootstrap-based methods we consider for testing the joint hypothesis of equal forecast accuracy among many nested models, and lays out how, in practice, appropriate asymptotic critical values can be calculated. Section 4 presents Monte Carlo results on the finite-sample performance of the asymptotics and the bootstrap. Section 5 applies our tests to evaluate the predictive content of various measures of economic slack, energy price inflation, and import price inflation for core PCE inflation. Section 6 concludes.

2 Theoretical results

We begin by laying out our testing framework when comparing the forecast accuracy of two nested models in the presence of weak predictive ability. We then generalize to situations with multiple models.
2.1 Environment

The possibility of weak predictors is modeled using a sequence of linear DGPs of the form (Assumption 1) [2]

$$y_{T,t+\tau} = x'_{T,1,t}\beta_T + u_{T,t+\tau} = x'_{T,0,t}\beta_0 + x'_{T,12,t}(T^{-1/2}\beta_{12}) + u_{T,t+\tau}, \qquad (1)$$

with $E x_{T,1,t} u_{T,t+\tau} \equiv E h_{T,1,t+\tau} = 0$ for all $t = 1, \ldots, T, \ldots, T+P-\tau$.

[2] The parameter $\beta_{T,t}$ does not vary with the forecast horizon $\tau$ since, in our analysis, $\tau$ is treated as fixed.
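To make the local-to-zero coefficient in (1) concrete, the following is a minimal simulation sketch, not the authors' code. It assumes a one-step horizon ($\tau = 1$), a constant as the only restricted regressor, a single i.i.d. standard normal weak predictor, and i.i.d. standard normal errors; the function name and arguments are hypothetical.

```python
import numpy as np

def simulate_weak_predictor_dgp(T, P, beta0, beta12, rng):
    """Simulate y_{t+1} = beta0 + x12_t * (beta12 / sqrt(T)) + u_{t+1}.

    Illustrative assumptions (not from the paper): tau = 1, x12 and u are
    i.i.d. standard normal, and the restricted model contains only a constant.
    """
    n = T + P
    x12 = rng.standard_normal(n)          # weak predictor
    u = rng.standard_normal(n)            # forecast error
    y = np.empty(n)
    y[0] = beta0 + u[0]                   # no lagged predictor for the first date
    # The slope shrinks at rate T^{-1/2}, which is the Pitman drift in (1)
    y[1:] = beta0 + x12[:-1] * (beta12 / np.sqrt(T)) + u[1:]
    return y, x12
```

With `T = 100` and `beta12 = 2.0`, the slope actually used in the simulated data is 0.2, so the predictor's signal shrinks as the initial estimation sample grows.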

Note that we allow the dependent variable $y_{T,t+\tau}$, the predictors $x_{T,1,t}$, and the error term $u_{T,t+\tau}$ to depend upon $T$, the initial forecasting origin. This dependence allows the time variation in the parameters to influence their marginal distributions. This is necessary if we want to allow lagged dependent variables to be predictors.

At each origin of forecasting $t = T, \ldots, T+P-\tau$, we observe the sequence $\{y_{T,j}, x'_{T,j}\}_{j=1}^{t}$. Forecasts of the scalar $y_{T,t+\tau}$, $\tau \geq 1$, are generated using a $(k \times 1,\ k = k_0 + k_1)$ vector of covariates $x_{T,t} = (x'_{T,0,t}, x'_{T,12,t})'$ and linear parametric models $x'_{T,i,t}\beta_i$, $i = 0, 1$. The parameters are estimated using OLS (Assumption 2), and hence

$$\hat\beta_{i,t} = \arg\min_{\beta_i}\ t^{-1}\sum_{s=1}^{t-\tau}(y_{T,s+\tau} - x'_{T,i,s}\beta_i)^2, \quad i = 0, 1,$$

for the restricted and unrestricted models, respectively. We denote the loss associated with the $\tau$-step-ahead forecast errors as $\hat u^2_{i,t+\tau} = (y_{T,t+\tau} - x'_{T,i,t}\hat\beta_{i,t})^2$, $i = 0, 1$, for the restricted and unrestricted models, respectively.

The following additional notation will be used. Let $H_{T,i}(t) = (t^{-1}\sum_{s=1}^{t-\tau} x_{T,i,s}u_{T,s+\tau}) = (t^{-1}\sum_{s=1}^{t-\tau} h_{T,i,s+\tau})$, $B_{T,i}(t) = (t^{-1}\sum_{s=1}^{t-\tau} x_{T,i,s}x'_{T,i,s})^{-1}$, and $B_i = \lim_{T\to\infty}(E x_{T,i,s}x'_{T,i,s})^{-1}$ for $i = 0, 1$. For $U_{T,t} = (h'_{T,1,t+\tau}, \mathrm{vec}(x_{T,1,t}x'_{T,1,t})')'$, let $V = \sum_{j=-\tau+1}^{\tau-1}\Omega_{11,j}$, where $\Omega_{11,j}$ is the upper block-diagonal element of $\Omega_j$ defined below. For any $(m \times n)$ matrix $A$, let $|A|$ denote the max norm and $\mathrm{tr}(A)$ denote the trace.

For $H_{T,1}(t)$ defined above, $J$ the selection matrix $(I_{k_0\times k_0}, 0_{k_0\times k_1})'$, and a $(k_1 \times k)$ matrix $\tilde A$ satisfying $\tilde A'\tilde A = B_1^{-1/2}(-JB_0J' + B_1)B_1^{-1/2}$, let $\tilde h_{T,1,t+\tau} = \sigma^{-1}\tilde A B_1^{1/2} h_{T,1,t+\tau}$ and $\tilde H_{T,1}(t) = \sigma^{-1}\tilde A B_1^{1/2} H_{T,1}(t)$. If we define $\gamma_{\tilde h\tilde h,1}(i) = E\tilde h_{T,1,t+\tau}\tilde h'_{T,1,t+\tau-i}$, then $S_{\tilde h\tilde h,1} = \gamma_{\tilde h\tilde h,1}(0) + \sum_{i=1}^{\tau-1}(\gamma_{\tilde h\tilde h,1}(i) + \gamma'_{\tilde h\tilde h,1}(i))$. Let $W(s)$ denote a $k_1 \times 1$ vector standard Brownian motion and define the vector of weak predictor coefficients as $\delta = (0_{1\times k_0}, \beta'_{12})'$.
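The recursive OLS scheme described above can be sketched as follows. This is an illustrative implementation under stated assumptions rather than the authors' code: the function name is hypothetical, arrays are indexed from zero, and row `t` of `X` holds the predictors dated `t` that are used to forecast `y[t + tau]`.

```python
import numpy as np

def recursive_forecast_errors(y, X, T, tau=1):
    """tau-step-ahead forecast errors from recursively estimated OLS.

    y : (n,) dependent variable; X : (n, k) predictors; T : size of the
    initial estimation sample. Hypothetical helper for illustration only.
    """
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    n = y.size
    errors = []
    # Forecast origins t = T, ..., T + P - tau in the paper's 1-based notation
    for t in range(T - 1, n - tau):
        # Use only pairs (x_s, y_{s+tau}) whose target is observed by date t
        Xs = X[: t - tau + 1]
        ys = y[tau : t + 1]
        beta, *_ = np.linalg.lstsq(Xs, ys, rcond=None)
        errors.append(y[t + tau] - X[t] @ beta)
    return np.array(errors)
```

Calling the function once with the restricted regressors and once with the full regressor set yields the two error series $\hat u_{0,t+\tau}$ and $\hat u_{1,t+\tau}$ used in the test statistics.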
To derive our general results, we need three more assumptions (in addition to our Assumptions 1 and 2 of a DGP with weak predictability and OLS-estimated linear forecasting models).

Assumption 3: (a) $T^{-1}\sum_{t=1}^{[rT]} U_{T,t}U'_{T,t-j} \Rightarrow r\Omega_j$, where $\Omega_j = \lim_{T\to\infty} T^{-1}\sum_{t=1}^{T} E(U_{T,t}U'_{T,t-j})$ for all $j \geq 0$; (b) $\Omega_{11,j} = 0$ for all $j \geq \tau$; (c) $\sup_{T\geq 1,\, t\leq T+P} E|U_{T,t}|^{2q} < \infty$ for some $q > 1$; (d) the zero-mean triangular array $U_{T,t} - EU_{T,t} = (h'_{T,1,t+\tau}, \mathrm{vec}(x_{T,1,t}x'_{T,1,t} - Ex_{T,1,t}x'_{T,1,t})')'$ satisfies Theorem 3.2 of De Jong and Davidson (2000).

Assumption 4: (a) Let $K(x)$ be a continuous kernel such that for all real scalars $x$, $|K(x)| \leq 1$, $K(x) = K(-x)$, and $K(0) = 1$; (b) for some bandwidth $L$ and constant $i \in (0, 0.5)$, $L = O(P^i)$; (c) for all $j > \tau - 1$, $Eh_{T,1,t+\tau}h'_{T,1,t+\tau-j} = 0$; (d) the number of covariance terms $\bar j$ used to estimate the long-run covariance $S_{dd}$, defined in Section 2.2, satisfies $\tau - 1 \leq \bar j < \infty$.

Assumption 5: $\lim_{T\to\infty} P/T = \lambda_P \in (0, \infty)$.

Assumption 3 imposes three types of conditions. First, in (a) and (c) we require that the observables, while not necessarily covariance stationary, are asymptotically mean square stationary with finite second moments. We do so in order to allow the observables to have marginal distributions that vary as the weak predictive ability strengthens along with the sample size but are well-behaved enough that, for example, sample averages converge in probability to the appropriate population means. Second, in (b) we impose the restriction that the $\tau$-step-ahead forecast errors are MA($\tau - 1$). We do so in order to emphasize the role that weak predictors have on forecasting without also introducing other forms of model misspecification. Finally, in (d) we impose the high-level assumption that, in particular, $h_{T,1,t+\tau}$ satisfies Theorem 3.2 of De Jong and Davidson (2000). By doing so we not only ensure that certain weighted partial sums converge weakly to standard Brownian motion, but also allow ourselves to take advantage of various results pertaining to convergence in distribution to stochastic integrals.

Assumption 4 is necessitated by the serial correlation in the multi-step ($\tau$-step) forecast errors: errors from even well-specified models exhibit serial correlation, of an MA($\tau - 1$) form. Typically, researchers constructing a t-statistic utilizing the squares of these errors account for serial correlation of at least order $\tau - 1$ in forming the necessary standard error estimates. Meese and Rogoff (1988), Groen (1999), and Kilian and Taylor (2003), among other applications to forecasts from nested models, use kernel-based methods to estimate the relevant long-run covariance. [3] We therefore impose conditions sufficient to cover applied practices. Parts (a) and (b) are not particularly controversial.
Part (c), however, imposes the restriction that the orthogonality conditions used to identify the parameters form a moving average of finite order $\tau - 1$, while part (d) imposes the restriction that this fact is taken into account when constructing the MSE-t statistic discussed later in Section 2.2.

Finally, in Assumption 5 we impose the requirement that $\lim_{T\to\infty} P/T = \lambda_P \in (0, \infty)$, and hence the duration of forecasting is finite but non-trivial.

[3] For similar uses of kernel-based methods in analyses of non-nested forecasts, see, for example, Diebold and Mariano (1995) and West (1996).

2.2 Asymptotics for MSE-F, MSE-t with weak predictors

In the context of non-nested models, Diebold and Mariano (1995) propose a test for equal MSE based upon the sequence of loss differentials $\hat d_{t+\tau} = \hat u^2_{0,t+\tau} - \hat u^2_{1,t+\tau}$. If we define $MSE_i = (P-\tau+1)^{-1}\sum_{t=T}^{T+P-\tau}\hat u^2_{i,t+\tau}$ ($i = 0, 1$), $\bar d = (P-\tau+1)^{-1}\sum_{t=T}^{T+P-\tau}\hat d_{t+\tau} = MSE_0 - MSE_1$, $\hat\gamma_{dd}(j) = (P-\tau+1)^{-1}\sum_{t=T+j}^{T+P-\tau}(\hat d_{t+\tau} - \bar d)(\hat d_{t+\tau-j} - \bar d)$, $\hat\gamma_{dd}(-j) = \hat\gamma_{dd}(j)$, and $\hat S_{dd} = \sum_{j=-\bar j}^{\bar j} K(j/M)\hat\gamma_{dd}(j)$, the statistic takes the form

$$\text{MSE-t} = (P-\tau+1)^{1/2}\,\frac{\bar d}{\sqrt{\hat S_{dd}}}. \qquad (2)$$

Under the null that $x_{12,t}$ has no population-level predictive power for $y_{t+\tau}$, the population difference in MSEs, $Eu^2_{0,t+\tau} - Eu^2_{1,t+\tau}$, will equal 0. When $x_{12,t}$ has predictive power, the population difference in MSEs will be positive. Even so, the finite-sample difference need not be positive. In fact, for a given sample size the difference in finite-sample MSEs, $E\hat u^2_{0,t+\tau} - E\hat u^2_{1,t+\tau}$, may be zero, which motivates a distinct null hypothesis of equal finite-sample predictive ability. Regardless of which null hypothesis we consider (equal population or equal finite-sample predictive ability), the MSE-t test and the other equal MSE tests described below are one-sided to the right.

While West (1996) proves directly that the MSE-t statistic can be asymptotically standard normal when applied to non-nested forecasts, this is not the case when the models are nested. In particular, the results in West (1996) require that under the null, the population-level long-run variance of $\hat d_{t+\tau}$ be positive. This requirement is violated with nested models regardless of the presence of weak predictors. Intuitively, with nested models (and for the moment ignoring the weak predictors), the null hypothesis that the restrictions imposed in the benchmark model are true implies that the population errors of the competing forecasting models are exactly the same. As a result, in population $d_{t+\tau} = 0$ for all $t$, which makes the corresponding variances also equal to 0.
Because the sample analogues (for example, $\bar d$ and its variance) converge to zero at the same rate, the test statistics have non-degenerate null distributions, but they are non-standard.

Motivated by (i) the degeneracy of the long-run variance of $d_{t+\tau}$ and (ii) the functional form of the standard in-sample F-test, McCracken (2006) develops an out-of-sample F-type test of equal MSE, given by

$$\text{MSE-F} = (P-\tau+1)\,\frac{MSE_0 - MSE_1}{MSE_1} = (P-\tau+1)\,\frac{\bar d}{MSE_1}. \qquad (3)$$
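As a rough illustration of (2) and (3), the sketch below computes both statistics from two series of forecast errors. It is not the authors' code. The Bartlett kernel and the default of $\tau - 1$ covariance lags are assumptions made here for concreteness (the text only requires a kernel satisfying Assumption 4), and the function name is hypothetical.

```python
import numpy as np

def mse_f_and_mse_t(u0, u1, tau=1, lags=None):
    """Return (MSE-F, MSE-t) for restricted errors u0 and unrestricted errors u1.

    Assumptions for illustration: Bartlett kernel weights and, by default,
    tau - 1 autocovariance lags in the long-run variance estimate.
    """
    u0 = np.asarray(u0, dtype=float)
    u1 = np.asarray(u1, dtype=float)
    d = u0**2 - u1**2                 # loss differentials d_hat
    n = d.size                        # plays the role of P - tau + 1
    d_bar = d.mean()                  # equals MSE_0 - MSE_1
    mse1 = np.mean(u1**2)
    mse_f = n * d_bar / mse1          # equation (3)
    if lags is None:
        lags = tau - 1
    dc = d - d_bar
    s_dd = dc @ dc / n                # gamma_hat_dd(0)
    for j in range(1, lags + 1):
        gamma_j = dc[j:] @ dc[:-j] / n
        s_dd += 2.0 * (1.0 - j / (lags + 1.0)) * gamma_j
    mse_t = np.sqrt(n) * d_bar / np.sqrt(s_dd)   # equation (2)
    return mse_f, mse_t
```

Both statistics are compared against right-tail critical values, since the tests are one-sided to the right.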

Like the MSE-t test, the limiting distribution of the MSE-F test is non-standard when the forecasts are nested under the null. Clark and McCracken (2005) and McCracken (2006) show that, for $\tau$-step-ahead forecasts, the MSE-F statistic converges in distribution to functions of stochastic integrals of quadratics of Brownian motion, with limiting distributions that depend on the sample split parameter $\pi$, the number of exclusion restrictions $k_1$, and the unknown nuisance parameter $S_{\tilde h\tilde h}$. While this continues to hold in the presence of weak predictors, the asymptotic distributions now depend not only upon the unknown coefficients associated with the weak predictors but also upon other unknown second moments of the data.

If we define
$\Gamma_1 = \int_1^{1+\lambda_P} s^{-1}W'(s)S_{\tilde h\tilde h}\,dW(s)$,
$\Gamma_2 = \int_1^{1+\lambda_P} s^{-2}W'(s)S_{\tilde h\tilde h}W(s)\,ds$,
$\Gamma_3 = \int_1^{1+\lambda_P} \delta'B_1^{-1/2}S^{1/2}_{\tilde h\tilde h}\,dW(s)$,
$\Gamma_4 = \int_1^{1+\lambda_P} \delta'B_1^{-1}(-JB_0J' + B_1)B_1^{-1}\delta\,ds$, and
$\Gamma_5 = \int_1^{1+\lambda_P} \delta'B_1^{-1/2}S_{\tilde h\tilde h}B_1^{-1/2}\delta\,ds$,
the following two theorems provide the asymptotic distributions of these two statistics in the presence of weak predictors.

Theorem 2.1: Maintain Assumptions 1, 2, 3, and 5. $\text{MSE-F} \rightarrow^d \{2\Gamma_1 - \Gamma_2\} + 2\{\Gamma_3\} + \{\Gamma_4\}$.

Theorem 2.2: Maintain Assumptions 1-5. $\text{MSE-t} \rightarrow^d (\{\Gamma_1 - .5\Gamma_2\} + \{\Gamma_3\} + \{.5\Gamma_4\})/(\Gamma_2 + \Gamma_5)^{.5}$.

Theorems 2.1 and 2.2 show that the limiting distributions of the MSE-t and MSE-F tests are neither normal nor chi-square when the forecasts are nested, regardless of the presence of weak predictors. Theorem 2.1 is identical (other than notation) to Proposition 2 in Inoue and Kilian (2004), while Theorem 2.2 is new. Again, the limiting distributions are free of nuisance parameters in only very special cases. In particular, the distributions here are free of nuisance parameters only if there are no weak predictors and if $S_{\tilde h\tilde h} = I$.
If this is the case (as it is if, for example, $\tau = 1$ and the forecast errors are conditionally homoskedastic), both representations simplify to those in McCracken (2006), and hence his critical values can be used for testing for equal population-level predictive ability. In the absence of weak predictors alone, the representation simplifies to that in Clark and McCracken (2005), and hence the asymptotic distributions still depend upon $S_{\tilde h\tilde h}$. In this case, and in the most general case where weak predictors are present, we use bootstrap methods to estimate the asymptotically valid critical values.

2.3 Various null hypotheses with weak predictors

The noncentrality terms, especially those associated with the asymptotic distribution of the MSE-F statistic, give some indication of the power that the test statistics have against deviations from the null hypothesis of equal population predictive ability, $H_0: \beta_{12} = 0$. As noted earlier, it is in that sense that our analytical results are closely related to those in Inoue and Kilian (2004). Closer inspection, however, shows that the results provide opportunities for testing other forms of the null hypothesis of equal predictive ability when weak predictors are present. For example, under the assumptions of Theorem 2.1 it is straightforward to show that the mean of the asymptotic distribution of the MSE-F statistic can be used to approximate the mean difference in the out-of-sample average predictive ability of the two models:

$$E\Big(\sum_{t=T}^{T+P}(\hat u^2_{0,t+\tau} - \hat u^2_{1,t+\tau})\Big) \approx \int_1^{1+\lambda_P}\big[-s^{-1}\mathrm{tr}((-JB_0J' + B_1)V) + \delta'B_1^{-1}(-JB_0J' + B_1)B_1^{-1}\delta\big]\,ds.$$

Intuitively, one might consider using this expression as a means of characterizing when the two models have equal predictive ability in the presence of weak predictors. Perhaps the most obvious test given this representation is one in which the null hypothesis equates the right-hand side of the equation to zero. Other hypotheses can arise, however, if we interpret each element $s$ of the integral as an asymptotic approximation to the index $t$ in the summation on the left-hand side of the equation, so that

$$E(\hat u^2_{0,t+\tau} - \hat u^2_{1,t+\tau}) \approx -s^{-1}\mathrm{tr}((-JB_0J' + B_1)V) + \delta'B_1^{-1}(-JB_0J' + B_1)B_1^{-1}\delta \qquad (4)$$

for each $t \in [T, T+P]$ and corresponding $s \in [1, 1+\lambda_P]$.

1. $H_0$: Average equal predictive ability across all $s$. In perhaps the most intuitive case, the two models have equal average predictive ability across the entire out-of-sample period, and hence

$$\int_1^{1+\lambda_P}\big[-s^{-1}\mathrm{tr}((-JB_0J' + B_1)V) + \delta'B_1^{-1}(-JB_0J' + B_1)B_1^{-1}\delta\big]\,ds = 0. \qquad (5)$$

Integrating and solving for the marginal signal-to-noise ratio implies that

$$\frac{\delta'B_1^{-1}(-JB_0J' + B_1)B_1^{-1}\delta}{\mathrm{tr}((-JB_0J' + B_1)V)} = \frac{\ln(1+\lambda_P)}{\lambda_P}.$$

Naturally, if we rearrange terms we obtain $E(MSE_0/MSE_1) \approx 1$, as expected given that the two models have equal average predictive ability.

2. $H_0$: Equal predictive ability when $s = 1$. In the case where the two models have equal finite-sample predictive ability at the beginning of the out-of-sample period, the null hypothesis implies that $\mathrm{tr}((-JB_0J' + B_1)V) = \delta'B_1^{-1}(-JB_0J' + B_1)B_1^{-1}\delta$. Equivalently, this restriction implies that the marginal signal-to-noise ratio

$$\frac{\delta'B_1^{-1}(-JB_0J' + B_1)B_1^{-1}\delta}{\mathrm{tr}((-JB_0J' + B_1)V)}$$

equals 1. In this case, since we expect the unrestricted model to be more accurate than the restricted one throughout the out-of-sample period (as the parameters become more precisely estimated), the null hypothesis implies that the mean out-of-sample average difference in predictive ability will be positive:

$$E\Big(\sum_{t=T}^{T+P}(\hat u^2_{0,t+\tau} - \hat u^2_{1,t+\tau})\Big) \approx (\lambda_P - \ln(1+\lambda_P))\,\mathrm{tr}((-JB_0J' + B_1)V) > 0.$$

Under conditional homoskedasticity and $\tau = 1$ forecasts, this simplifies to $E(\sum_{t=T}^{T+P}(\hat u^2_{0,t+\tau} - \hat u^2_{1,t+\tau})) \approx (\lambda_P - \ln(1+\lambda_P))\sigma^2 k_1$ and implies that we should expect the ratio of MSEs to be $E(MSE_0/MSE_1) \approx 1 + (\lambda_P - \ln(1+\lambda_P))k_1/P$, which is greater than one.

3. $H_0$: Equal predictive ability when $s = 1+\lambda_P$. In the case where the two models have equal predictive ability at the end of the out-of-sample period, the null hypothesis implies that $\mathrm{tr}((-JB_0J' + B_1)V) = (1+\lambda_P)\,\delta'B_1^{-1}(-JB_0J' + B_1)B_1^{-1}\delta$. Equivalently, this restriction implies that the marginal signal-to-noise ratio

$$\frac{\delta'B_1^{-1}(-JB_0J' + B_1)B_1^{-1}\delta}{\mathrm{tr}((-JB_0J' + B_1)V)}$$

equals $\frac{1}{1+\lambda_P}$. In contrast to the previous case, since we now expect the unrestricted model to be less accurate than the restricted one throughout the out-of-sample period, the null hypothesis implies that the mean out-of-sample average difference in predictive ability will be negative:

$$E\Big(\sum_{t=T}^{T+P}(\hat u^2_{0,t+\tau} - \hat u^2_{1,t+\tau})\Big) \approx \Big(\frac{\lambda_P}{1+\lambda_P} - \ln(1+\lambda_P)\Big)\mathrm{tr}((-JB_0J' + B_1)V) < 0.$$
Under conditional homoskedasticity and $\tau = 1$ forecasts, this simplifies to $E(\sum_{t=T}^{T+P}(\hat u^2_{0,t+\tau} - \hat u^2_{1,t+\tau})) \approx (\frac{\lambda_P}{1+\lambda_P} - \ln(1+\lambda_P))\sigma^2 k_1$ and similarly implies that we should expect the ratio of MSEs to be $E(MSE_0/MSE_1) \approx 1 + (\frac{\lambda_P}{1+\lambda_P} - \ln(1+\lambda_P))k_1/P$, which is less than one.

Each of the various null hypotheses implies a very specific set of restrictions on the regression parameters through the signal-to-noise ratio. As such, these signal-to-noise ratios are crucial for imposing the relevant null on the bootstrap DGP used for constructing asymptotically valid critical values. For brevity, in the Monte Carlo experiments and the empirical results discussed in Sections 4 and 5, we focus exclusively on the first hypothesis, in which we test for equal average out-of-sample predictive ability in the presence of weak predictors.
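The three signal-to-noise restrictions can be checked numerically with a small sketch. It assumes, for illustration only, that the trace term $\mathrm{tr}((-JB_0J' + B_1)V)$ is normalized to a positive constant `trace_term`, so that the integrand in (5) is `trace_term * (snr - 1/s)`; the function names are hypothetical.

```python
import math

def snr_equal_average(lam):
    """Signal-to-noise ratio implied by hypothesis 1 (equal average accuracy)."""
    return math.log1p(lam) / lam

def snr_equal_at_start(lam):
    """Signal-to-noise ratio implied by hypothesis 2 (equal accuracy at s = 1)."""
    return 1.0

def snr_equal_at_end(lam):
    """Signal-to-noise ratio implied by hypothesis 3 (equal accuracy at s = 1 + lam)."""
    return 1.0 / (1.0 + lam)

def mean_mse_gap(snr, lam, trace_term=1.0):
    """Integral over [1, 1 + lam] of trace_term * (snr - 1/s), i.e. the
    approximate mean of the summed loss differentials."""
    return trace_term * (snr * lam - math.log1p(lam))
```

For any `lam > 0`, `mean_mse_gap` is zero under the first ratio, positive under the second, and negative under the third, which matches the signs stated for the three hypotheses.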

2.4 Multiple testing for equal predictive ability

Having built an apparatus for handling pairwise tests for equal predictive ability in the presence of weak predictors, it is relatively straightforward to extend the results to tests of equal predictive ability among many nested models in the presence of weak predictors. Doing so primarily requires a reinterpretation of the assumptions and notation used for the pairwise comparisons. Specifically, we maintain Assumption 1 but with $k_M$ (rather than $k_1$) prospective predictors above and beyond those contained in the restricted model 0. There therefore exist $2^{k_M} - 1$ unique "unrestricted" linear regression models that nest the baseline model. Let $j = 1, \ldots, M$ denote an index of the collection of $M \leq 2^{k_M} - 1$ unique models to be compared to the baseline nested model 0, each of which is estimated by OLS as in Assumption 2. Assumptions 3 and 4 remain largely unchanged but for the notational change from $x_{T,1,t}$ to $x_{T,M,t}$ and $h_{T,1,t+\tau}$ to $h_{T,M,t+\tau}$ in (c). Assumption 5 remains unchanged.

Notationally, for each index define $\beta_j$ as the unique $(k_j \times 1)$ subvector of $\beta_M$ associated with the weak predictors in model $j$, and for each redefine the corresponding vector $\delta_j = (0_{1\times k_0}, \beta'_j)'$ and selection matrix $J_j = (I_{k_0\times k_0}, 0_{k_0\times k_j})'$. Similarly redefine $B_j$, $\tilde h_{T,j,t+\tau}$, $\tilde H_{T,j}(t)$, and the corresponding $\gamma_{\tilde h\tilde h,j}(i)$ and $S_{\tilde h\tilde h,j}$ relative to the predictors used in model $j$. Moreover, for a $(k_M \times 1)$ vector standard Brownian motion $W(s)$, let $W_j(s)$ be the unique $(k_j \times 1)$ subvector associated with model $j$. Finally, let MSE-F$_j$ and MSE-t$_j$ denote the pairwise tests between models 0 and $j$.
If we define
$\Gamma_{1,j} = \int_1^{1+\lambda_P} s^{-1}W'_j(s)S_{\tilde h\tilde h,j}\,dW_j(s)$,
$\Gamma_{2,j} = \int_1^{1+\lambda_P} s^{-2}W'_j(s)S_{\tilde h\tilde h,j}W_j(s)\,ds$,
$\Gamma_{3,j} = \int_1^{1+\lambda_P} \delta'_jB_j^{-1/2}S^{1/2}_{\tilde h\tilde h,j}\,dW_j(s)$,
$\Gamma_{4,j} = \int_1^{1+\lambda_P} \delta'_jB_j^{-1}(-J_jB_0J'_j + B_j)B_j^{-1}\delta_j\,ds$, and
$\Gamma_{5,j} = \int_1^{1+\lambda_P} \delta'_jB_j^{-1/2}S_{\tilde h\tilde h,j}B_j^{-1/2}\delta_j\,ds$,
the following two theorems provide the asymptotic distributions of these two statistics in the presence of weak predictors.

Theorem 2.3: Maintain Assumptions 1, 2, 3, and 5. $\max_{j=1,\ldots,M}(\text{MSE-F}_j) \rightarrow^d \max_{j=1,\ldots,M}(\{2\Gamma_{1,j} - \Gamma_{2,j}\} + 2\{\Gamma_{3,j}\} + \{\Gamma_{4,j}\})$.

Theorem 2.4: Maintain Assumptions 1-5. $\max_{j=1,\ldots,M}(\text{MSE-t}_j) \rightarrow^d \max_{j=1,\ldots,M}(\{\Gamma_{1,j} - .5\Gamma_{2,j}\} + \{\Gamma_{3,j}\} + \{.5\Gamma_{4,j}\})/(\Gamma_{2,j} + \Gamma_{5,j})^{.5}$.

Theorems 2.3 and 2.4 are both straightforward extensions of Theorems 2.1 and 2.2. They both follow first from a rank condition on the long-run variance of $h_{T,M,t+\tau}$, and then from a direct application of the Continuous Mapping Theorem. Again, however, providing asymptotically valid critical values is made difficult both by the lack of a closed form for the density of the asymptotic distribution and by the existence of unknown nuisance parameters. For this reason, in the following section we consider various bootstrap-based methods for conducting inference.

3 Bootstrap approaches

Drawing on the preceding theoretical results, we use a variety of non-parametric and parametric bootstrap procedures in testing for equal forecast accuracy, using MSE ratios, differences in MSE, and t-statistics for MSE differences. The procedures differ in both the resampling of data and in the re-centering of the bootstrap distribution to reflect the null of interest. This section first describes the different approaches to data generation and then discusses centering.

3.1 Bootstrap data generation

The non-parametric approach is patterned on White's (2000) method: we create bootstrap samples of forecast errors by sampling (with replacement) from the time series of sample forecast errors, and construct test statistics for each sample draw. However, as noted above and in White (2000), this procedure is not, in general, asymptotically valid when applied to nested models. We include the method in part for its computational simplicity and in part to examine the potential pitfalls of using the approach. Hong and Lee (2003) have applied White's (2000) reality check to nested models. [4]

Our parametric bootstrap procedures follow those of Kilian (1999) and Clark and McCracken (2005), among others. Vector autoregressive equations for $y_t$ and $x_t$ (the form of the $y_t$ equation differs across approaches, as described below) are estimated by OLS using the full sample of observations, with the residuals stored for sampling. Bootstrapped time series on $y_t$ and $x_t$ are generated by drawing with replacement from the sample residuals and using the autoregressive structures of the estimated VAR to iteratively construct data.
The initial observations (observations preceding the sample of data used to estimate the models, necessitated by the lag structures of the estimated models) are selected by sampling from the actual data. In particular, following Stine (1987), among others, the initial observations are selected by picking one date at random and then taking the necessary number of initial observations in order from that date backward. For each sample of artificial data, we estimate forecasting models and generate forecasts and test statistics.

[4] Hong and Lee (2003) argue that White's (2000) method is valid in their application because the number of forecast observations is small enough relative to the estimation sample to make the t-test for equal MSE normally distributed.

Within the parametric class, we consider three different bootstraps, reflecting different approaches to inference, that is, different null hypotheses or forms of equal forecast accuracy.

In the first approach (denoted restricted VAR in the tables), the VAR used in the bootstrap is restricted to impose the null that the $x$ variables of interest have no predictive content, that is, coefficients of zero. The bootstrap DGP equation for $y$ in the VAR takes the form of the null or benchmark forecasting model. For simplicity, the VAR equations for the $x$ variables are specified as AR models. This basic approach is used in such studies as Mark (1995), Kilian (1999), Clark and McCracken (2005, 2006a), and Clark and West (2006a,b).

In a second approach (denoted unrestricted VAR in the tables), we allow the $y$ equation in the bootstrap VAR to take an unrestricted form. In particular, in this approach, the (OLS-estimated) $y$ equation used in the bootstrap includes all of the $x$ variables considered in the forecasting models. (Note that, in reality check applications, the resulting model is often not one of the forecasting models actually considered.) As with the other parametric bootstraps, VAR equations for the $x$ variables are specified as AR models.

In a third approach (denoted rescaled VAR in the tables), we use our above asymptotic results for weak predictors to re-scale the $\beta_{12}$ coefficients of the alternative forecasting model that, according to our theoretical results, should forecast best.
In particular, we estimate each alternative forecasting model with the full sample of data, estimate each model s signal to noise ratio, and select the model with the highest signal to noise ratio. We then re-scale the β 12 coefficients of this model such that, on average over the forecast sample, this model and the null model would be expected to be equally accurate, according to our asymptotic results. This estimated model for y and its residuals (along with AR equations for the x variables) are used to generate artificial samples of data. 13
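The generation of one artificial sample can be illustrated with a minimal sketch for a single AR equation. It is not the authors' code: the function name `bootstrap_ar_sample`, its arguments, and the choice to resample residuals i.i.d. with replacement are assumptions made for illustration, and a full implementation would treat the y and x equations of the VAR jointly. The sketch does follow the description above in one respect: the initial observations are a block of actual data ending at a randomly chosen date.

```python
import random


def bootstrap_ar_sample(data, coeffs, residuals, rng):
    """Generate one artificial series from an estimated AR(p) equation.

    data      -- actual observations (used only to pick initial values)
    coeffs    -- [constant, phi_1, ..., phi_p], assumed already estimated
    residuals -- estimated residuals, resampled with replacement (assumption)
    rng       -- a random.Random instance
    """
    p = len(coeffs) - 1
    # Pick one date at random, then take the p observations ending at that
    # date, in their original order, as the initial observations.
    end = rng.randrange(p - 1, len(data))
    series = list(data[end - p + 1:end + 1])
    for _ in range(len(data) - p):
        lags = series[-1:-p - 1:-1]  # most recent observation first
        shock = rng.choice(residuals)
        ar_part = sum(phi * lag for phi, lag in zip(coeffs[1:], lags))
        series.append(coeffs[0] + ar_part + shock)
    return series
```

Imposing the restricted VAR null in this setting would amount to generating y from the benchmark AR equation alone, so that the x variables enter with coefficients of zero.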

3.2 Centering of test statistics

Depending on the test statistic and bootstrap approach, the bootstrap distribution may need to be re-centered to reflect the relevant null hypothesis. Under the non-parametric approach, the relevant null hypothesis is that the MSE difference (benchmark MSE less alternative model MSE) is at most 0, and the MSE ratio (benchmark MSE/alternative model MSE) is at most 1. Following White (2000), each bootstrap draw of a given test statistic is re-centered around the corresponding sample test statistic. In the reality check testing applied to multiple models, in each draw we form an appropriately re-centered test statistic for each alternative forecasting model. We then, in each draw, take the maximum of the re-centered bootstrap statistics. Bootstrapped critical values are computed as percentiles of the resulting distributions of (appropriately re-centered) test statistics. We report empirical rejection rates using a nominal size of 10%. Results using a nominal size of 5% are qualitatively similar.

In the case of the restricted VAR bootstrap, the null hypothesis of $\beta_{12} = 0$ is imposed in the data generating process. As a result, no adjustment of the sample test statistics is needed for inference. We simply compare the sample test statistics against the bootstrap draws, without any re-centering. Similarly, in the case of the rescaled VAR bootstrap, the null of equal forecast accuracy (on average) is imposed in the data generating process. Again, no adjustment of the sample test statistics is needed for inference.

Finally, with the unrestricted VAR bootstrap, we follow the same re-centering approach used for the non-parametric bootstrap. The relevant null hypothesis is that the MSE difference (benchmark MSE less alternative model MSE) is at most 0, and the MSE ratio (benchmark MSE/alternative model MSE) is at most 1. The bootstrapped test statistics are re-centered around their corresponding sample statistics.
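The re-centering and maximum steps used with the non-parametric and unrestricted VAR bootstraps can be sketched as follows for MSE differences, whose null value is 0. This is an illustrative sketch rather than the authors' implementation: the function names are hypothetical, `boot_stats` is assumed to hold one list of per-model statistics for each bootstrap draw, and the simple order-statistic rule for the percentile is an assumption.

```python
import math


def recentered_max_draws(sample_stats, boot_stats):
    """Re-center each model's bootstrap statistic around its sample value,
    then take the maximum across models within each bootstrap draw."""
    return [max(b - s for b, s in zip(draw, sample_stats))
            for draw in boot_stats]


def percentile_critical_value(draws, nominal_size=0.10):
    """Empirical (1 - nominal_size) quantile via a simple order-statistic rule."""
    ordered = sorted(draws)
    index = math.ceil((1.0 - nominal_size) * len(ordered)) - 1
    return ordered[max(0, min(index, len(ordered) - 1))]


def reality_check_rejects(sample_stats, boot_stats, nominal_size=0.10):
    """Reject when the largest sample statistic exceeds the bootstrap
    critical value of the re-centered maximum."""
    draws = recentered_max_draws(sample_stats, boot_stats)
    return max(sample_stats) > percentile_critical_value(draws, nominal_size)
```

With the restricted or rescaled VAR bootstraps, the same percentile step would apply to the raw bootstrap statistics, because the null is already imposed in the data generating process and no re-centering is needed.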
For comparison, we include in our analysis an alternative test of equal accuracy, developed in Clark and West (2006a,b), that involves an alternative approach to re-centering the MSE difference between two nested models. Clark and West propose a simple adjustment to the MSE differential for the additional parameter estimation error of the larger model. When applied to rolling sample forecasts under a random walk null model, the adjusted test statistic has a standard normal distribution (asymptotically). With null models that involve parameter estimation (as is the case in this paper), Clark and West (2006b) argue that the limiting null distribution is approximately normal. The test, which we refer to as the C-W t-test, can be simply calculated using

$$\text{C-W } t\text{-test} = \frac{(P - \tau + 1)^{1/2}\, \bar{c}}{\sqrt{\hat{S}_{cc}}}, \qquad (6)$$

where $\hat{u}_{1,t+\tau}$ and $\hat{u}_{2,t+\tau}$ are the $\tau$-step-ahead errors from the null and alternative forecast models, $\hat{c}_{t+\tau} = \hat{u}_{1,t+\tau}(\hat{u}_{1,t+\tau} - \hat{u}_{2,t+\tau})$, $\bar{c} = (P - \tau + 1)^{-1} \sum_{t=R}^{T-\tau} \hat{c}_{t+\tau}$, $\hat{\Gamma}_{cc}(j) = (P - \tau + 1)^{-1} \sum_{t=R+j}^{T-\tau} (\hat{c}_{t+\tau} - \bar{c})(\hat{c}_{t+\tau-j} - \bar{c})$, $\hat{\Gamma}_{cc}(-j) = \hat{\Gamma}_{cc}(j)$, and $\hat{S}_{cc} = \sum_{j=-\bar{j}}^{\bar{j}} K(j/M)\, \hat{\Gamma}_{cc}(j)$.

In the pairwise model comparison case, the Clark and West re-centering of the MSE difference reflected in the above test statistic is intended to allow the test to be compared against the standard normal distribution. In our application, however, we compare the test statistic against the restricted VAR bootstrap distribution (and only the restricted VAR bootstrap, because the C-W null is that $\beta_{12} = 0$). We do so for the purpose of including the test in our multiple forecast models analysis.

4 Monte Carlo Evidence

We use simulations of various bivariate and multivariate DGPs based loosely on common macroeconomic applications (with finance applications still to come) to evaluate the finite-sample properties of the above approaches to testing pairs of models and multiple models for equal forecast accuracy. In these simulations, the benchmark forecasting model is an autoregressive model of the predictand y; the alternative models add lags of various other variables of interest. The general null hypothesis is that none of the forecasts from alternative models are more accurate than the benchmark forecast. This general null, however, can take different specific forms: either the variables in the alternative models have no predictive content, in that their coefficients are 0; or some models have non-zero coefficients, but the coefficients are small enough that none of the alternative models are expected to be more accurate than the benchmark over the forecast sample.
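Returning to the C-W t-test of equation (6), the calculation can be sketched as below. This is a hedged illustration, not the authors' code: the function name `cw_t_stat` is hypothetical, the Bartlett kernel $K(x) = \max(0, 1 - |x|)$ is an assumption (the text does not name the kernel), and the number of forecast errors supplied plays the role of $P - \tau + 1$. With `bandwidth=1` only the lag-0 term enters, which is a natural choice for one-step-ahead forecasts.

```python
import math


def cw_t_stat(u1, u2, bandwidth=1):
    """C-W t-statistic from null-model errors u1 and alternative-model errors u2.

    Long-run variance uses a Bartlett kernel with the given bandwidth
    (an assumption). No guard is included for a zero variance estimate.
    """
    n = len(u1)  # plays the role of P - tau + 1
    c = [a * (a - b) for a, b in zip(u1, u2)]
    c_bar = sum(c) / n
    dev = [x - c_bar for x in c]

    def gamma(j):
        # Autocovariance of c at lag j, with divisor n as in equation (6).
        return sum(dev[t] * dev[t - j] for t in range(j, n)) / n

    s_cc = gamma(0)
    for j in range(1, bandwidth):
        weight = 1.0 - j / bandwidth
        s_cc += 2.0 * weight * gamma(j)  # uses gamma(-j) = gamma(j)
    return math.sqrt(n) * c_bar / math.sqrt(s_cc)
```

In the application described above, the resulting statistic would be compared against restricted VAR bootstrap draws rather than standard normal critical values.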
4.1 Monte Carlo design

For all DGPs, we generate data using independent draws of innovations from the normal distribution and the autoregressive structure of the DGP. The initial observations necessitated by the lag structure of each DGP are generated with draws from the unconditional normal distribution implied by the DGP. While our results can be generalized to any forecast horizon (with models of the direct multi-step form), for simplicity in this preliminary draft we focus on one-step-ahead forecasts. With quarterly data primarily in mind, we also consider a range of sample sizes (R, P), reflecting those commonly available in practice: 80,40; 40,80; 80,80; and 80,120.

In order to establish the basic properties of the various bootstrap approaches in a relatively simple environment, we first consider experiments comparing forecasts from just two models. For simplicity, we refer to these as pairwise experiments. We then consider experiments (henceforth, multiple model experiments) comparing forecasts from multiple models. In all cases, our reported results are based on 1000 Monte Carlo draws and 499 bootstrap replications.

4.1.1 Pairwise forecast model DGPs

The first pairwise model DGP (henceforth, DGP S1, where S denotes a single alternative forecasting model) is based on the empirical relationship between the change in core PCE inflation ($y_t$) and capacity utilization in the manufacturing sector ($x_{1,t}$):

$$
\begin{aligned}
y_t &= .40\, y_{t-1} - .16\, y_{t-2} + b_{11} x_{1,t-1} + u_t \\
x_{1,t} &= 1.43\, x_{1,t-1} - .50\, x_{1,t-2} + v_{1,t} \qquad (7) \\
\operatorname{var}\begin{pmatrix} u_t \\ v_{1,t} \end{pmatrix} &= \begin{pmatrix} .72 & \\ \ldots & \ldots \end{pmatrix}.
\end{aligned}
$$

In the DGP S1 experiments, the alternative (unrestricted) forecasting model takes the form of the DGP equation for y (with constant added); the null or benchmark (restricted) model drops $x_{1,t-1}$:

$$\text{null:} \quad y_t = \beta_0 + \beta_1 y_{t-1} + \beta_2 y_{t-2} + u_t. \qquad (8)$$

$$\text{alternative:} \quad y_t = \beta_0 + \beta_1 y_{t-1} + \beta_2 y_{t-2} + \beta_3 x_{1,t-1} + u_t. \qquad (9)$$

We consider various experiments with different settings of $b_{11}$, the coefficient on $x_{1,t-1}$, which corresponds to the elements of our theoretical construct $\beta_{12}/\sqrt{T}$. In one set of simulations (Table 1), the coefficient is set to 0, such that the null forecasting model is expected to be more accurate than the alternative.
In another (Table 3), the coefficient is set to a value that makes the models equally accurate (in expectation) on average over the forecast sample. With R and P both equal to 80 (this coefficient value changes with R and P), this value is about […]. In another set of experiments (Table 5), the coefficient is set to its empirically estimated value of 0.05, such that the alternative model is expected to be more accurate than the null.

The second pairwise model DGP, DGP S2, is based on the empirical relationship among the change in core PCE inflation ($y_t$), capacity utilization in the manufacturing sector ($x_{1,t}$), growth in unit labor costs ($x_{2,t}$), and import price inflation ($x_{3,t}$):

$$
\begin{aligned}
y_t &= .40\, y_{t-1} - .16\, y_{t-2} + b_{11} x_{1,t-1} + b_{21} x_{2,t-1} + b_{22} x_{2,t-2} + b_{31} x_{3,t-1} + b_{32} x_{3,t-2} + u_t \\
x_{1,t} &= 1.43\, x_{1,t-1} - .50\, x_{1,t-2} + v_{1,t} \\
x_{2,t} &= .45\, x_{1,t-1} - .19\, x_{1,t-2}\; [\ldots]\, x_{2,t-1}\; [\ldots]\, x_{2,t-2} + v_{2,t} \qquad (10) \\
x_{3,t} &= .39\, x_{2,t-1} - .06\, x_{2,t-2}\; [\ldots]\, x_{3,t-1}\; [\ldots]\, x_{3,t-2} + v_{3,t} \\
\operatorname{var}\begin{pmatrix} u_t \\ v_{1,t} \\ v_{2,t} \\ v_{3,t} \end{pmatrix} &= \begin{pmatrix} .72 & & & \\ \ldots & \ldots & & \\ \ldots & \ldots & \ldots & \\ \ldots & \ldots & \ldots & \ldots \end{pmatrix}.
\end{aligned}
$$

In DGP S2 experiments, the null (restricted) and alternative (unrestricted) forecasting models take the following forms, respectively:

$$y_t = \beta_0 + \beta_1 y_{t-1} + \beta_2 y_{t-2} + u_t. \qquad (11)$$

$$y_t = \beta_0 + \beta_1 y_{t-1} + \beta_2 y_{t-2} + \beta_3 x_{1,t-1} + \beta_4 x_{2,t-1} + \beta_5 x_{2,t-2} + \beta_6 x_{3,t-1} + \beta_7 x_{3,t-2} + u_t. \qquad (12)$$

As with DGP S1, we consider experiments with three different settings of the set of $b_{ij}$ coefficients, which correspond to the elements of $\beta_{12}/\sqrt{T}$. In one set of experiments (Table 1), all of the $b_{ij}$ coefficients are set to zero, such that the null forecasting model is expected to be more accurate than the alternative. In another (Table 3), empirically based values of the $b_{ij}$ coefficients are multiplied by a constant less than one, such that, in population, the null and alternative models are expected to be equally accurate, on average, over the forecast sample. With R and P at 80, this multiplying constant is .535. In another set of experiments (Table 5), the coefficients are set at empirically based estimates: $b_{11} = .05$, $b_{21} = .02$, $b_{22} = .02$, $b_{31} = .04$, $b_{32} = .03$. With these values, the alternative model is expected to be more accurate than the null.
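The data generation described in Section 4.1 (normal innovations, an autoregressive structure, and initial observations drawn from the unconditional distribution implied by the DGP) can be illustrated with a minimal sketch for a generic stationary AR(1), $x_t = \gamma x_{t-1} + v_t$. The function name `simulate_ar1` and its parameters are hypothetical and the process is deliberately simpler than DGPs S1 and S2; the sketch only shows how the unconditional draw of the first observation fits together with the innovation variance.

```python
import math
import random


def simulate_ar1(gamma, var_x, n_obs, rng):
    """Simulate x_t = gamma * x_{t-1} + v_t with v_t ~ N(0, (1 - gamma^2) * var_x).

    The first observation is drawn from the unconditional distribution
    N(0, var_x), so the series is stationary from the first observation.
    """
    if not abs(gamma) < 1.0:
        raise ValueError("gamma must satisfy |gamma| < 1 for a stationary AR(1)")
    sd_x = math.sqrt(var_x)
    sd_v = math.sqrt((1.0 - gamma ** 2) * var_x)
    series = [rng.gauss(0.0, sd_x)]
    for _ in range(n_obs - 1):
        series.append(gamma * series[-1] + rng.gauss(0.0, sd_v))
    return series
```

Passing a seeded `random.Random` instance makes each Monte Carlo draw reproducible, which is convenient when results are based on a fixed number of draws.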

4.1.2 Multiple forecast model DGPs

The first multiple model DGP (henceforth, DGP M1, where M denotes multiple alternative forecasting models) incorporates dynamics similar to, but simpler than, those used in the pairwise experiments. This DGP takes the general form

$$
\begin{aligned}
y_t &= .5\, y_{t-1} + \sum_{i=1}^{5} b_i x_{i,t-1} + u_t \\
x_{i,t} &= \gamma_i x_{i,t-1} + v_{i,t}, \quad i = 1, \ldots, 10 \qquad (13) \\
\gamma_i &= .9 - .1(i - 1) \\
E(u_t v_{i,t}) &= 0 \ \forall i, \quad E(v_{i,t} v_{j,t}) = 0 \ \forall i \neq j \\
\operatorname{var}(v_{i,t}) &= (1 - \gamma_i^2)\, \sigma_x^2, \quad \operatorname{var}(x_{i,t}) = \sigma_x^2 \ \forall i, \quad \operatorname{var}(u_t) = 2.
\end{aligned}
$$

In DGP M1 experiments, the null (restricted) forecasting model is

$$\text{null:} \quad y_t = \beta_0 + \beta_1 y_{t-1} + u_t. \qquad (14)$$

We consider a total of 10 alternative models, each including a single $x_{i,t-1}$:

$$\text{alternatives:} \quad y_t = \beta_0 + \beta_1 y_{t-1} + \beta_{2,i}\, x_{i,t-1} + u_{i,t}, \quad i = 1, \ldots, 10. \qquad (15)$$

We consider experiments with four different settings of the set of $b_i$ coefficients, which correspond to the elements of $\beta_{12}/\sqrt{T}$. In one set of experiments (Table 2), all of the $b_i$ ($i = 1, \ldots, 5$) coefficients are set to zero, such that the null forecasting model is expected to be more accurate than the alternative. In another (Table 4), we set the coefficients to $b_1 = b$, $b_i = 0$, $i = 2, \ldots, 5$, where $b$ is set to make the null model and the alternative model including just $x_{1,t-1}$ equally accurate (on average over the forecast sample). The other models are expected to be less accurate. For R = 80 and P = 80, $b \approx .13$. In an additional set of simulations (Table 6), we use $b_1 = .3$, $b_i = 0$, $i = 2, \ldots, 5$. In this case, the alternative model including $x_{1,t-1}$ is expected to be more accurate than the null model; the other alternatives are expected to be less accurate. Finally, we consider (in Table 7) experiments with $b_i = .3$, $i = 1, \ldots, 5$, such that the alternative models with $x_{i,t-1}$, $i = 1, \ldots, 5$, are expected to be more accurate than the null and other alternatives.

The second multiple model DGP takes a comparable form, but considers alternative forecasting models with a varying number of alternative variables. DGP M2 is given by

$$
\begin{aligned}
y_t &= .5\, y_{t-1} + \sum_{i=1}^{3} b_i x_{i,t-1} + u_t \\
x_{i,t} &= \gamma_i x_{i,t-1} + v_{i,t}, \quad i = 1, \ldots, 5 \qquad (16) \\
\gamma_i &= .9 - .1(i - 1) \\
E(u_t v_{i,t}) &= 0 \ \forall i, \quad E(v_{i,t} v_{j,t}) = 0 \ \forall i \neq j \\
\operatorname{var}(v_{i,t}) &= (1 - \gamma_i^2)\, \sigma_x^2, \quad \operatorname{var}(x_{i,t}) = \sigma_x^2 \ \forall i, \quad \operatorname{var}(u_t) = 2.
\end{aligned}
$$

Again, the null (restricted) forecasting model is

$$\text{null:} \quad y_t = \beta_0 + \beta_1 y_{t-1} + u_t. \qquad (17)$$

The alternative models are based on combining five independent x variables in all possible combinations (5 different models with a single x, 10 models with pairs of x's, etc.), for a total of 31 different models, with the number of x variables included ranging from 1 to 5.

We consider experiments with four different settings of the set of $b_i$ coefficients, which correspond to the elements of $\beta_{12}/\sqrt{T}$. In one set (Table 2), all of the $b_i$ ($i = 1, \ldots, 3$) coefficients are set to zero, such that the null forecasting model is expected to be more accurate than the alternative. In another (Table 4), we set the coefficients to $b_1 = b$, $b_i = 0$, $i = 2, 3$, where $b$ is set to make the null model and the alternative model including just $x_{1,t-1}$ equally accurate (on average over the forecast sample). The other models are expected to be less accurate. For R = 80 and P = 80, $b \approx .13$. In an additional set of simulations (Table 6), we use $b_1 = .3$, $b_i = 0$, $i = 2, 3$. In this case, the alternative model including $x_{1,t-1}$ is expected to be more accurate than the null model; the other alternatives are expected to be less accurate. Finally, we consider (in Table 7) experiments with $b_i = .3$, $i = 1, \ldots, 3$, such that the alternative models with any combination of the variables $x_{i,t-1}$, $i = 1, \ldots, 3$ (but no other variables), are expected to be more accurate than the null and other alternatives.

4.2 Results

Our interest lies in identifying those bootstrap approaches that yield reasonably accurate inferences on the forecast performance of models. At the outset, then, it may be useful to broadly summarize the forecast performance of competing models under our various alternatives. Accordingly, Figure 1 shows estimated densities of our MSE ratio statistic (the ratio of the null model's MSE to the alternative model's MSE), based on experiments with DGP S2, using R = P = 80. We provide three densities, for the cases in which the $b_{ij}$ coefficients of the DGP (10) are: (i) set to 0, such that the null model should be more accurate; (ii) set to non-zero values so as to make the null and alternative models (11) and (12) equally accurate over the forecast sample, according to our local-to-zero asymptotic results; and (iii) set at larger values, such that the alternative model is expected to be more accurate. As the figure shows, for the DGP which implies the null model should be best, the MSE ratio distribution mostly lies below 1.0. For the DGP that implies the models can be expected to be equally accurate, the distribution is centered at about 1.0. Finally, for the DGP that implies the alternative model can be expected to be best, the distribution mostly lies above 1.0.

Among our bootstrap procedures, the restricted VAR approach yields, by design, a distribution like that shown for the null-best DGP. The non-parametric, unrestricted VAR, and rescaled VAR bootstraps are intended to estimate a null distribution like that shown for the equally-good-models DGP. In all cases, the null will be rejected when the sample MSE ratio lies in the right tail of the bootstrapped distribution.

What, then, might we expect test rejection rates to look like across experiments and bootstraps? For DGPs in which the null model is best, tests compared against the restricted VAR bootstrap should have rejection rates of about 10%, the nominal size (prior studies such as Kilian (1999) and Clark and McCracken (2005) have shown this bootstrap approach to work reasonably well in this type of (pairwise) experiment).
However, the same tests compared against the other bootstraps should have rejection rates below 10%, because, given the DGP, the models should not be expected to be equally accurate. For DGPs with coefficients scaled such that the null and alternative models can be expected to be equally accurate, we want the tests compared against the non-parametric, unrestricted VAR, and rescaled VAR bootstraps to have size of about 10%. That said, as indicated above, we should not expect the non-parametric approach to perform well when applied to our nested models. We should expect the same tests compared to restricted VAR bootstrap critical values to yield rejection rates greater than 10%, because the restricted VAR distribution lies to the left of the equal accuracy distribution. Finally, with DGPs that imply the alternative model to be more accurate than the null, we should look for rejection rates that exceed 10%. Again, though, rejection rates based on the restricted VAR bootstrap should generally be higher than rejection rates based on the other approaches.

4.2.1 Null Model Most Accurate

Tables 1 and 2 present Monte Carlo results for DGPs in which, in truth, the x variables considered have no predictive content for y, such that the null forecasting model should be expected to be best. These results generally line up with the expectations described above. Comparing the MSE ratio, MSE difference, MSE t-test, and C-W t-test against restricted VAR bootstrap critical values consistently (in both pairwise and multiple forecasting model experiments) yields rejection rates of about the nominal size of 10%. For example, across all the experiments in Tables 1 and 2, restricted VAR bootstrap rejection rates for the MSE ratio statistic range from 8.6% (DGP M1, R, P = 80,80 and 80,120; DGP M2, R, P = 40,80) to 12.1% (DGP S2, R, P = 80,80). Comparing the test statistics to other bootstrap distributions yields rejection rates typically well below 10%, and often close to 0. For example, non-parametric bootstrap rejection rates for the MSE ratio test range from .1% (DGP S2, R, P = 40,80) to 4.2% (DGP M1, R, P = 80,40).

In the pairwise model experiments, using the unrestricted or rescaled VAR bootstrap yields results qualitatively similar to those for the non-parametric approach. However, in results from the multiple model experiments, the unrestricted and rescaled VAR bootstraps differ more from the non-parametric approach. Rejection rates are highest with the rescaled VAR bootstrap (when used with the MSE ratio test, ranging from 3.7% to 7.1% in Table 2), next highest with the non-parametric approach (ranging from .1% to 4.2%), and lowest with the unrestricted VAR method (ranging from 0 to .7%).
Among the alternative tests for equal MSE, results are generally very similar for the MSE ratio and MSE difference tests. For example, with DGP M2 and R, P = 80,80 (Table 2), using a restricted VAR bootstrap yields a rejection rate of 9.9% for the MSE ratio test and 9.6% for the MSE difference test. In the same experiment, using a rescaled VAR bootstrap yields rejection rates of 3.7% and 4.5% for, respectively, the MSE ratio and difference tests. While using the MSE t-test yields qualitatively
