Introduction to Forecasting and Forecast Evaluation


Introduction to Forecasting and Forecast Evaluation

Lecture Notes to Accompany Talk

Norman Rasmus Swanson, Rutgers University

contact: nswanson@econ.rutgers.edu

Prepared for the Bank of Canada, May

Outline

Part I. Prediction Basics - Loss Functions, Optimal Prediction, and Model Selection
Part II. Parameter Estimation Error, Bootstrap Techniques, and Model Selection
Part III. What Should We Be Predicting? - Real-Time Data
Part IV. Methods of Prediction - Some Comments
Part V. Density Based Model Selection

1 Part I - Prediction Basics: Loss Functions, Optimal Prediction, and Model Selection

1.1 Loss Functions

From an economic policy perspective, one of the main uses of econometric and statistical methods is to provide forecasts of macroeconomic and financial variables. For instance: given the rate of inflation over the past twelve months, what will be the rate of inflation next month? What will it be two months from now? Such predictions have important consequences for the formulation of economic policy (e.g. setting the bank lending rate).

Suppose that the Federal Reserve Board forecasts a 1.5% annualized inflation rate for July 2008, while the Department of the Treasury provides a forecast of 1%. How can we decide which of the two is more reliable? One key question thus concerns how we can measure the relative accuracy of different forecasts. Different models yield different forecasts, so we want to choose the model producing the most accurate one.

Many econometric techniques deal with the in-sample evaluation of models, and only recently has attention focused on out-of-sample model evaluation. A key difference between the two approaches is that in-sample methods tend to select models that are too large.

* Overfitting is a problem.
* In-sample inference is another.

Some of the issues arising are:

(i) Choice of the loss function. Suppose $X_{t+1}$ is the rate of inflation at time $t+1$, and $X^{(i)}_{t+1|t}$ is the rate of inflation forecasted at time $t$ using model $i$. The forecast error implied by model $i$ is
$$u_{i,t+1} = X_{t+1} - X^{(i)}_{t+1|t}.$$

We want to choose model $j$ over model $i$ if, on average, model $j$ produces smaller errors. Smaller in which sense?

* Quadratic loss function: choose model $j$ if, on average, $u_{j,t+1}^2 < u_{i,t+1}^2$.
* Mean absolute loss function: choose model $j$ if, on average, $|u_{j,t+1}| < |u_{i,t+1}|$.
* Other sorts of loss functions? Direction of change (contingency tables), profitability, etc.
* Sometimes we are more concerned about positive errors than negative errors (or vice versa), so we may want to use an asymmetric loss function, such as linex (linear exponential) loss.

(ii) Is it possible that for some loss function model $j$ beats model $i$, while for other loss functions model $i$ beats model $j$? In general: yes. If we choose the right model, in the sense that we correctly specify the joint distribution of all of the relevant variables, then no other model can win. On the other hand, if the models we compare are (generally) misspecified, then the ranking of models is loss function specific (i.e. we would like to assume that all models are approximations to the truth - any other assumption seems overly strong).

* This has implications for data transformation, particularly when using one model to predict more than one transformation of a variable.

(iii) What is the effect of parameter estimation error (i.e. of estimating the models used in prediction)? Suppose we forecast inflation at $t+1$ simply using inflation at time $t$, and we use a simple linear model, say:
$$X_t = \beta_0 + \beta_1 X_{t-1} + u_t.$$
The true forecasting error is
$$u_{t+1} = X_{t+1} - \beta_0 - \beta_1 X_t.$$

However, we have to replace the unknown parameters with estimates, say $\hat\beta_0$ and $\hat\beta_1$. Thus the estimated forecast error becomes
$$\hat u_{t+1} = X_{t+1} - \hat\beta_0 - \hat\beta_1 X_t = u_{t+1} - (\hat\beta_0 - \beta_0) - (\hat\beta_1 - \beta_1)X_t.$$

(iv) Choice of forecast horizon. Given information up to time $t$, do we want to forecast inflation at $t+1$, $t+2$, ..., $t+k$? Again, unless we have the right model, model $i$ can beat model $j$ at some forecast horizons, while model $j$ beats model $i$ at others.

1.2 Optimal Prediction

(i) Quadratic loss functions

Consider a time series $y_t$, $t = 1, 2, \ldots, T$. Suppose we want to find the optimal predictor of $y_{t+h}$, $h$ steps ahead, using information available at time $t$. Let $\mathcal{F}_t = \sigma(y_1, \ldots, y_t, X_1, \ldots, X_t)$, where $X_t$ is a (possibly vector valued) series that may help to predict $y_t$. The optimal $h$-step ahead predictor for $y_{t+h}$, given $\mathcal{F}_t$, is the function $\hat y_{t+h|t}$ such that
$$E\big((y_{t+h} - \hat y_{t+h|t})^2\big) < E\big((y_{t+h} - \tilde y_{t+h|t})^2\big)$$
for any $\tilde y_{t+h|t} \neq \hat y_{t+h|t}$. We know that $\hat y_{t+h|t} = E(y_{t+h}|\mathcal{F}_t)$ (i.e. the best predictor is the conditional expectation of $y_{t+h}$ given $\mathcal{F}_t$). In fact, suppose that $\tilde y_{t+h|t}$ is an $\mathcal{F}_t$-measurable function (e.g. any continuous function of $y_1, \ldots, y_t, X_1, \ldots, X_t$ is $\mathcal{F}_t$-measurable). Then
$$E\big((y_{t+h} - \tilde y_{t+h|t})^2\big) = E\Big(\big((y_{t+h} - E(y_{t+h}|\mathcal{F}_t)) - (\tilde y_{t+h|t} - E(y_{t+h}|\mathcal{F}_t))\big)^2\Big)$$
$$= E\big((y_{t+h} - E(y_{t+h}|\mathcal{F}_t))^2\big) + E\big((\tilde y_{t+h|t} - E(y_{t+h}|\mathcal{F}_t))^2\big)$$
$$> E\big((y_{t+h} - E(y_{t+h}|\mathcal{F}_t))^2\big),$$

as $E\big((y_{t+h} - E(y_{t+h}|\mathcal{F}_t))(\tilde y_{t+h|t} - E(y_{t+h}|\mathcal{F}_t))\big) = 0$. Thus, if we want to minimize the squared error, the conditional expectation is the best predictor.

Prediction with linear models

Suppose that $y_{t+1} = \alpha y_t + \epsilon_{t+1}$, where $\epsilon_t$ is a white noise process with zero mean and variance $\sigma^2_\epsilon$ (i.e. consider an AR(1) process). Then the best one-step predictor is $E(y_{t+1}|y_t) = \alpha y_t$, and the best $h$-step predictor is $E(y_{t+h}|y_t) = \alpha^h y_t$. Correspondingly, the one-step ahead prediction error is $u_{t+1} = y_{t+1} - \alpha y_t = \epsilon_{t+1}$, and
$$u_{t+h} = y_{t+h} - \alpha^h y_t = \epsilon_{t+h} + \alpha\epsilon_{t+h-1} + \ldots + \alpha^{h-2}\epsilon_{t+2} + \alpha^{h-1}\epsilon_{t+1}.$$
AR(p) processes can be treated in an analogous way.

Prediction with nonlinear models

Suppose
$$y_{t+1} = a\,g(y_t) + \epsilon_{t+1},$$
where, say,
$$g(y_t) = y_t + 1/(1 + \exp(-y_t))$$
or
$$g(y_t) = y_t^2.$$
Now,
$$E(y_{t+1}|y_t) = a\,g(y_t).$$
Note further that:
$$y_{t+2} = a\,g(y_{t+1}) + \epsilon_{t+2} = a\,g(a\,g(y_t) + \epsilon_{t+1}) + \epsilon_{t+2}.$$
Because of the $\epsilon_{t+1}$ term entering into the nonlinear function $g$, it is not immediate how to get the two-step ahead prediction error.

In this case we can approximate $E(y_{t+2}|y_t)$ with
$$a\,g(a\,g(y_t)) = a\,g(E(y_{t+1}|y_t)).$$
Broadly speaking, in order to get the $h$-step ahead forecast, we begin by taking the one-step ahead forecast (for which we know the closed form expression); then we predict one period ahead again, replacing $y_{t+1}$ (which is not observable) with $E(y_{t+1}|y_t)$, that is, with its predicted value given the information at time $t$. We then proceed through the subsequent steps in the same manner.

So far we have considered cases in which $y_t$ depends only on its own past. Consider now the following model:
$$y_{t+1} = \beta_0 + \beta_1 X_t + \epsilon_{t+1},$$
so that $E(y_{t+1}|X_t) = \beta_0 + \beta_1 X_t$. In order to compute $h$-step ahead forecasts, for $h > 1$, we need to know the data generating process of $X_t$. In this case we approximate $X_{t+1}$ with $E(X_{t+1}|X_t)$. That is, use:
$$E(y_{t+2}|X_t) = \beta_0 + \beta_1 E(X_{t+1}|X_t).$$

(ii) Asymmetric loss functions

We have seen that in the case of quadratic loss the best predictor is the conditional mean. In this case the problem of selecting the optimal forecast is equivalent to the problem of correctly specifying the conditional mean. However, there are several instances in which we are more concerned about positive errors ($y_{t+h} - \hat y_{t+h|t} > 0$) than about negative errors ($y_{t+h} - \hat y_{t+h|t} < 0$). Needless to say, arriving at the airport 5 minutes too late is more costly than arriving 5 minutes too early. In this case, then, we want to penalize positive errors more heavily. Two well known asymmetric loss functions are linex (linear exponential) loss and lin-lin (linear-linear) loss. If we use linex loss, then we want to find the predictor $\hat y_{t+h|t}$ such that
$$E\big(\exp(a(y_{t+h} - \hat y_{t+h|t})) - a(y_{t+h} - \hat y_{t+h|t}) - 1\big) < E\big(\exp(a(y_{t+h} - \tilde y_{t+h|t})) - a(y_{t+h} - \tilde y_{t+h|t}) - 1\big), \quad a \neq 0,$$

for any $\tilde y_{t+h|t} \neq \hat y_{t+h|t}$. Note that for $a > 0$, the loss is approximately linear to the left of the origin and exponential to the right of the origin, and vice versa for $a < 0$. Thus, for $a > 0$, positive errors are considered more costly than negative errors. Christoffersen and Diebold (1997) show that in this case the best predictor is
$$\hat y_{t+h|t} = E(y_{t+h}|\mathcal{F}_t) + \frac{a}{2}\mathrm{Var}(y_{t+h}|\mathcal{F}_t),$$
where $\mathrm{Var}(y_{t+h}|\mathcal{F}_t)$ is the variance of $y_{t+h}$ conditional on the information available at time $t$. This formula is valid when $y_{t+h}|\mathcal{F}_t \sim N(E(y_{t+h}|\mathcal{F}_t), \mathrm{Var}(y_{t+h}|\mathcal{F}_t))$. Note that for $a > 0$ (more weight on positive errors) the optimal predictor is larger than the optimal MSE predictor. In fact, as we are more concerned about positive errors, we purposely prefer an overestimate. Also, note that while $E(y_{t+h}|\mathcal{F}_t)$ is an unbiased predictor of $y_{t+h}$, given the information available at time $t$, for $a > 0$, $E(y_{t+h}|\mathcal{F}_t) + \frac{a}{2}\mathrm{Var}(y_{t+h}|\mathcal{F}_t)$ is an upwardly biased predictor. In this case, knowledge of the optimal predictor requires knowledge of the joint specification of the conditional mean and variance.

Another asymmetric loss is lin-lin loss. If we use a lin-lin loss, then we want to find the predictor $\hat y_{t+h|t}$ such that
$$E\big(a|y_{t+h} - \hat y_{t+h|t}|\,1\{y_{t+h} > \hat y_{t+h|t}\} + b|y_{t+h} - \hat y_{t+h|t}|\,1\{y_{t+h} \leq \hat y_{t+h|t}\}\big)$$
$$< E\big(a|y_{t+h} - \tilde y_{t+h|t}|\,1\{y_{t+h} > \tilde y_{t+h|t}\} + b|y_{t+h} - \tilde y_{t+h|t}|\,1\{y_{t+h} \leq \tilde y_{t+h|t}\}\big)$$
for any $\tilde y_{t+h|t} \neq \hat y_{t+h|t}$, with $a > 0$, $b > 0$. This loss function increases linearly in the error, but for $a > b$ it penalizes positive errors more heavily. If $y_{t+h}|\mathcal{F}_t$ is $N(E(y_{t+h}|\mathcal{F}_t), \mathrm{Var}(y_{t+h}|\mathcal{F}_t))$, then Christoffersen and Diebold show that the optimal predictor under lin-lin loss is given by
$$\hat y_{t+h|t} = E(y_{t+h}|\mathcal{F}_t) + \big(\mathrm{Var}(y_{t+h}|\mathcal{F}_t)\big)^{1/2}\,\Phi^{-1}(a/(a+b)),$$
where $\Phi$ denotes the CDF of a standard normal. When $a > b$, $\Phi^{-1}(a/(a+b)) > 0$, and the optimal predictor is upwardly biased.
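To make the lin-lin result concrete, here is a minimal numerical sketch (not from the notes; all parameter values are hypothetical) verifying that, under conditional normality, the closed-form predictor $\mu + \sigma\Phi^{-1}(a/(a+b))$ minimizes expected lin-lin loss:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical conditional distribution: y_{t+h} | F_t ~ N(mu, sigma^2)
mu, sigma = 1.0, 2.0
a, b = 1.0, 2.0   # lin-lin weights: a on positive errors, b on negative errors

rng = np.random.default_rng(0)
y = rng.normal(mu, sigma, size=200_000)   # draws from the conditional law

def linlin_loss(pred):
    """Expected lin-lin loss of the predictor `pred`, by Monte Carlo."""
    u = y - pred
    return np.mean(a * np.abs(u) * (u > 0) + b * np.abs(u) * (u <= 0))

# Closed form: the optimal predictor is the a/(a+b) conditional quantile
closed_form = mu + sigma * norm.ppf(a / (a + b))

# Numerical minimization over a grid of candidate predictors
grid = np.linspace(mu - 2 * sigma, mu + 2 * sigma, 161)
numerical = grid[np.argmin([linlin_loss(p) for p in grid])]

print(closed_form, numerical)   # both near mu - 0.43*sigma when a=1, b=2
```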

Example - GARCH

Consider the following GARCH(1,1) model:
$$y_t = \sigma_t\epsilon_t, \quad \epsilon_t \sim \text{iid } N(0,1),$$
$$\sigma_t^2 = \omega_0 + \omega_1\sigma_{t-1}^2 + \omega_2 y_{t-1}^2,$$
with $\omega_1 + \omega_2 < 1$, $\omega_0 > 0$, $\omega_1 > 0$, $\omega_2 > 0$. Now, note that
$$\sigma_t^2 = \sigma_0^2\,\omega_1^t + \omega_0\sum_{j=0}^{t-1}\omega_1^j + \omega_2\sum_{j=0}^{t-1}\omega_1^j y_{t-1-j}^2,$$
and so $\sigma_t^2$ is a measurable function of the past squared returns. Thus, the relevant information set is $\mathcal{F}_{t-1} = \sigma(y_1, \ldots, y_{t-1})$. Now, while
$$E(y_t|\mathcal{F}_{t-1}) = E(\sigma_t\epsilon_t|\mathcal{F}_{t-1}) = \sigma_t E(\epsilon_t|\mathcal{F}_{t-1}) = 0,$$
$$E(y_t^2|\mathcal{F}_{t-1}) = \mathrm{Var}(y_t|\mathcal{F}_{t-1}) = E(\sigma_t^2\epsilon_t^2|\mathcal{F}_{t-1}) = \sigma_t^2 E(\epsilon_t^2|\mathcal{F}_{t-1}) = \sigma_t^2.$$
Hence, if the loss function is quadratic, the optimal predictor is $E(y_t|\mathcal{F}_{t-1}) = 0$. If instead the loss function is linex with $a = 1$, the best predictor is $E(y_t|\mathcal{F}_{t-1}) + \frac{1}{2}\mathrm{Var}(y_t|\mathcal{F}_{t-1}) = 0.5\sigma_t^2$. Finally, if the loss is lin-lin with parameters $a = 1$ and $b = 2$, the optimal predictor is $E(y_t|\mathcal{F}_{t-1}) + \sigma_t\Phi^{-1}(1/3) \approx -0.43\sigma_t$.
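A small simulation sketch of this example (the parameter values are illustrative assumptions, not from the notes): generate a GARCH(1,1) path, recover the one-step-ahead conditional variance, and compute the optimal predictors under the three loss functions just discussed:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

# Hypothetical GARCH(1,1) parameters with omega_1 + omega_2 < 1
w0, w1, w2 = 0.1, 0.8, 0.1
T = 500

y, sig2 = np.zeros(T), np.zeros(T)
sig2[0] = w0 / (1 - w1 - w2)          # start from the unconditional variance
y[0] = np.sqrt(sig2[0]) * rng.standard_normal()
for t in range(1, T):
    sig2[t] = w0 + w1 * sig2[t - 1] + w2 * y[t - 1] ** 2
    y[t] = np.sqrt(sig2[t]) * rng.standard_normal()

# Conditional variance of y_{T+1} given F_T (one step ahead)
sig2_next = w0 + w1 * sig2[-1] + w2 * y[-1] ** 2
sig_next = np.sqrt(sig2_next)

pred_quadratic = 0.0                            # conditional mean
pred_linex = 0.5 * sig2_next                    # a = 1: mu + (a/2) Var
pred_linlin = sig_next * norm.ppf(1.0 / 3.0)    # a = 1, b = 2: mu + sigma * Phi^{-1}(1/3)
print(pred_quadratic, pred_linex, pred_linlin)
```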

1.3 Model Selection

Comparing possibly misspecified forecasting models

So far we have considered the issue of optimal prediction for given loss functions. In practice, the true data generating process (DGP) is unknown, and so we form optimal predictions for given model(s), which may be (dynamically) misspecified. For example, suppose that we believe that $y_t$ follows an AR(1) process, so that the optimal $h$-step ahead predictor is $\alpha^h y_t$. However, if the DGP is instead a SETAR (self-exciting threshold autoregressive) process, say
$$y_t = \alpha_1 y_{t-1} + \alpha_2 y_{t-1}1\{y_{t-1} > \tau\} + \epsilon_t,$$
then the optimal forecast under the AR(1) assumption is clearly not optimal at all.

Furthermore, in practice we need to define the relevant information set. Again, suppose that $y_t$ is an AR(2) process, but we are just considering an AR(1) model. Then $\alpha^h y_t$ is indeed the optimal predictor (under quadratic loss) for the information set $\mathcal{F}_t = \sigma(y_t)$; but we are neglecting the information contained in $y_{t-1}$. In this case $E(y_{t+h}|y_t)$ is correctly specified, but
$$E(y_{t+h}|y_t) \neq E(y_{t+h}|y_t, y_{t-1}),$$
so there is dynamic misspecification.

Finally, the $h$-step ahead prediction error is heteroskedastic and autocorrelated, and, in the case of linex loss for example, failing to take this into consideration would lead to a non-optimal forecast.

Thus, in practice we want to be able to compare the relative predictive ability of two or more, possibly misspecified, models. Note that the ranking of the models, in the misspecified case, is loss function specific. On the other hand, if we correctly specify all conditional aspects, then the right model will beat all competitors, regardless of the loss function choice.

Diebold and Mariano (1995) propose a test of the null hypothesis of equal predictive ability against the alternative of unequal predictive ability. For the time being, we neglect the issue of parameter estimation error. Let $u_{0,t+h}$ and $u_{1,t+h}$ be the $h$-step ahead prediction errors made when predicting $y_{t+h}$ using information available up to time $t$.

For example, for $h = 1$,
$$u_{0,t+1} = y_{t+1} - \beta_{01} - \beta_{02}y_t - \beta_{03}x_t,$$
and
$$u_{1,t+1} = y_{t+1} - \beta_{11} - \beta_{12}y_t - \beta_{13}z_t.$$
It is important that the two models we are comparing be nonnested (i.e. neither is a special case of the other). Under the assumption that $u_{0,t}$ and $u_{1,t}$ are strictly stationary, the hypotheses for this test of equal predictive accuracy are:
$$H_0: E(f(u_{0,t}) - f(u_{1,t})) = 0$$
and
$$H_A: E(f(u_{0,t}) - f(u_{1,t})) \neq 0,$$
where $f$ is some continuous, positive valued loss function. The relevant statistic is
$$DM_T = \frac{1}{\hat\sigma_T}\frac{1}{T^{1/2}}\sum_{t=1}^{T-1}\big(f(u_{0,t+1}) - f(u_{1,t+1})\big),$$
where $\hat\sigma_T^2$ is a consistent estimator of
$$\sigma_0^2 = \lim_{T\to\infty}\mathrm{Var}\Big(T^{-1/2}\sum_{t=1}^{T-1}\big(f(u_{0,t+1}) - f(u_{1,t+1})\big)\Big).$$

Note why we require nonnestedness. Suppose that model 1 is nested in model 0, e.g.
$$u_{0,t+1} = y_{t+1} - \beta_{01} - \beta_{02}y_t - \beta_{03}x_t$$
and
$$u_{1,t+1} = y_{t+1} - \beta_{11} - \beta_{12}y_t.$$
If model 1 is indeed correctly (dynamically) specified for the conditional mean, then the null is equivalent to $\beta_{01} = \beta_{11}$, $\beta_{02} = \beta_{12}$, and $\beta_{03} = 0$, and so under the null $u_{0,t} = u_{1,t}$ for all $t$. Moreover, in practice we do not observe $u_{0,t}$ and $u_{1,t}$; we only observe $\hat u_{0,t}$ and $\hat u_{1,t}$ (which depend on estimated parameters). But we still have that $\hat\sigma_T$ and
$$\frac{1}{T^{1/2}}\sum_{t=1}^{T-1}\big(f(\hat u_{0,t+1}) - f(\hat u_{1,t+1})\big)$$
go to zero in probability, and the statistic no longer has a well defined limiting distribution.

As we allow for (dynamic) misspecification under both hypotheses, in general $u_{0,t}$ and $u_{1,t}$ are not martingale difference sequences (i.e. $E(u_{0,t}|\mathcal{F}_{t-1}) \neq 0$), and they are in general autocorrelated. Thus, we need to use a heteroskedasticity and autocorrelation robust covariance (HAC) estimator for the long run variance. We can use a Newey-West (1987) type estimator, for example. Namely, define
$$\hat\sigma_T^2 = \frac{1}{T}\sum_{t=1}^{T-1}(d_{t+1} - \bar d)^2 + \frac{2}{T}\sum_{\tau=1}^{l_T}w_\tau\sum_{t=\tau+1}^{T-1}(d_{t+1} - \bar d)(d_{t+1-\tau} - \bar d),$$
where $d_{t+1} = f(u_{0,t+1}) - f(u_{1,t+1})$, $\bar d = T^{-1}\sum_t d_{t+1}$, and $w_\tau = 1 - \tau/(l_T+1)$.

The following is needed in the sequel.

Assumption A: (i) $(u_{0,t}, u_{1,t})$ is a strictly stationary and strong mixing process with size $2r/(r-1)$, $r > 1$; (ii) $E(f(u_{i,t})^4) < \infty$, $i = 0, 1$.

ASIDE: Broadly speaking, $u_{0,t}$ is a strong mixing process if it is asymptotically independent, i.e. if $u_{0,t}$ is asymptotically independent of its infinite past. More formally, define $\mathcal{F}_{-\infty}^n = \sigma(u_{0,-\infty}, \ldots, u_{0,n})$ to be the information set generated by the history of the series up to time $n$, and analogously let $\mathcal{F}_{n+m}^{\infty} = \sigma(u_{0,n+m}, u_{0,n+m+1}, \ldots, u_{0,\infty})$ be the information set generated by the history of the series from time $n+m$ on. If $u_{0,t}$ is a strong mixing process, then for any $u_{0,t}$ with $t \geq n+m$, $E\big((u_{0,t} - E(u_{0,t}))|\mathcal{F}_{-\infty}^n\big)$ goes to zero as $m \to \infty$. The size has to do with the rate at which this quantity goes to zero as $m \to \infty$.

Proposition 1: Let Assumption A hold. Then, as $T \to \infty$, $l_T \to \infty$, and $l_T/T^{1/4} \to 0$, under $H_0$,
$$DM_T \stackrel{d}{\to} N(0,1),$$
and under $H_A$, for any $\varepsilon > 0$,
$$\Pr\big(|T^{-1/2}DM_T| > \varepsilon\big) \to 1.$$

Thus, we compare $DM_T$ with the critical values of a standard normal random variable. Suppose that we do not reject $H_0$ if $-1.96 \leq DM_T \leq 1.96$, and otherwise reject $H_0$. This gives a test with asymptotic size equal to 0.05 and unit asymptotic power. Note that the same result holds for generic forecast horizons (i.e. for $h > 1$).

Proof - Sketch: Under both hypotheses,
$$\hat\sigma_T^2 \stackrel{pr}{\to} \sigma_0^2 = \lim_{T\to\infty}\mathrm{Var}\Big(T^{-1/2}\sum_{t=1}^{T-1}d_{t+1}\Big).$$
Also, by the central limit theorem for mixing processes,
$$T^{-1/2}\sum_{t=1}^{T-1}\big(d_{t+1} - E(d_t)\big) \stackrel{d}{\to} N(0, \sigma_0^2).$$
Thus, when $E(d_t) = 0$ (i.e. when the null is true),
$$T^{-1/2}\sum_{t=1}^{T-1}d_{t+1} \stackrel{d}{\to} N(0, \sigma_0^2),$$
while when $E(d_t) \neq 0$ (i.e. under the alternative), $T^{-1/2}\sum_{t=1}^{T-1}d_{t+1}$ diverges at rate $T^{1/2}$.

Note that many applied practitioners do not even implement the simple DM test, instead relying on point estimates of mean square errors and related statistics when comparing alternative prediction models.
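As a concrete illustration, here is a minimal sketch (not from the notes) of the $DM_T$ statistic with a Newey-West long-run variance; quadratic loss and the truncation rule are illustrative choices:

```python
import numpy as np

def dm_test(u0, u1, loss=lambda u: u ** 2, lag=None):
    """Diebold-Mariano statistic for equal predictive accuracy of two
    nonnested models, given their forecast error series u0 and u1."""
    d = loss(np.asarray(u0)) - loss(np.asarray(u1))   # loss differential d_t
    T = len(d)
    lag = int(T ** 0.25) if lag is None else lag      # common practical truncation
    dc = d - d.mean()
    # Newey-West HAC estimator of the long run variance of d_t
    var = dc @ dc / T
    for tau in range(1, lag + 1):
        w = 1.0 - tau / (lag + 1.0)                   # Bartlett weights
        var += 2.0 * w * (dc[tau:] @ dc[:-tau]) / T
    return np.sqrt(T) * d.mean() / np.sqrt(var)       # compare with +/-1.96 at 5%

# Illustrative usage with simulated (hypothetical) forecast errors
rng = np.random.default_rng(2)
print(dm_test(rng.standard_normal(200), 1.1 * rng.standard_normal(200)))
```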

2 Part II - Parameter Estimation Error, Bootstrap Techniques, and Model Selection

2.1 Parameter Estimation Error

In practice, we do not observe the true forecasting error. For simplicity, consider
$$u_{0,t+1} = y_{t+1} - \beta_{01} - \beta_{02}y_t - \beta_{03}x_t.$$
However, we do not know the vector $\beta$. Thus, we need to replace the parameters with their estimators and take into account the error due to the fact that the parameters are estimated. There are three main sampling schemes: (i) the fixed scheme, (ii) the recursive scheme, and (iii) the rolling scheme.

When interested in out of sample forecasting (and when we need to estimate parameters), we typically split the sample $T$ into two subsamples: a regression period, with $R$ observations, and a prediction period, with $P$ observations, where $T = R + P$.

Fixed estimation scheme: Use the first $R$ observations to estimate the parameters, call them $\hat\beta_R$, and construct a sequence of $P$ prediction errors, defined as
$$\hat u_{0,t+1} = y_{t+1} - \hat\beta_{01,R} - \hat\beta_{02,R}y_t - \hat\beta_{03,R}x_t,$$
for $t = R, \ldots, R+P-1$.

Recursive estimation scheme: Use the first $R$ observations to compute $\hat\beta_R$, and construct the first prediction error:
$$\hat u_{0,R+1} = y_{R+1} - \hat\beta_{01,R} - \hat\beta_{02,R}y_R - \hat\beta_{03,R}x_R.$$
Then use all observations up to time $R+1$ to construct $\hat\beta_{R+1}$, and get the second prediction error:
$$\hat u_{0,R+2} = y_{R+2} - \hat\beta_{01,R+1} - \hat\beta_{02,R+1}y_{R+1} - \hat\beta_{03,R+1}x_{R+1}.$$
Proceed in the same manner until you have a sequence of $P$ prediction errors, defined as
$$\hat u_{0,t+1} = y_{t+1} - \hat\beta_{01,t} - \hat\beta_{02,t}y_t - \hat\beta_{03,t}x_t,$$
for $t = R, \ldots, R+P-1$,

where $\hat\beta_t$ is the estimator computed using observations up to time $t$.

Rolling estimation scheme: Use the first $R$ observations to compute $\hat\beta_R$, and construct the first prediction error:
$$\hat u_{0,R+1} = y_{R+1} - \hat\beta_{01,R} - \hat\beta_{02,R}y_R - \hat\beta_{03,R}x_R.$$
Then, observations from $t = 2$ up to $t = R+1$ are used to construct $\hat\beta_{2,R+1}$, and a second prediction error is constructed:
$$\hat u_{0,R+2} = y_{R+2} - \hat\beta_{01,2,R+1} - \hat\beta_{02,2,R+1}y_{R+1} - \hat\beta_{03,2,R+1}x_{R+1}.$$
Thereafter, use observations from $t = 3$ to $t = R+2$ to obtain another prediction error. Proceed in the same manner, estimating the parameters using the most recent $R$ observations, until you have a sequence of $P$ prediction errors:
$$\hat u_{0,t+1} = y_{t+1} - \hat\beta_{01,t-R+1,t} - \hat\beta_{02,t-R+1,t}y_t - \hat\beta_{03,t-R+1,t}x_t,$$
for $t = R, \ldots, R+P-1$, where $\hat\beta_{t-R+1,t}$ is the estimator computed using observations from time $t-R+1$ up to time $t$, that is, using the most recent $R$ observations (see West and McCracken (1998) for an overview of the properties of the various sampling schemes).

The most commonly used approach is the recursive scheme. Intuitively it makes sense to use the information contained in new observations as soon as it becomes available. However, one must also be aware of structural breaks due to changing data definitions, changing model specifications, changing tastes and preferences, etc.

(i) Effect of parameter estimation error when performing tests of equal predictive accuracy.

Let
$$u_{0,t+1} = y_{t+1} - w_{0,t}'\beta_0,$$
where $w_{0,t} = (1, y_t, x_t)'$ and $\beta_0 = (\beta_{01}, \beta_{02}, \beta_{03})'$, and
$$u_{1,t+1} = y_{t+1} - w_{1,t}'\beta_1,$$
where $w_{1,t} = (1, y_t, z_t)'$ and $\beta_1 = (\beta_{11}, \beta_{12}, \beta_{13})'$. Further, define
$$\hat u_{0,t+1} = y_{t+1} - w_{0,t}'\hat\beta_{0,t}, \qquad \hat u_{1,t+1} = y_{t+1} - w_{1,t}'\hat\beta_{1,t},$$
where the parameters have been estimated recursively.
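The three schemes differ only in which observations enter each regression. A minimal sketch (hypothetical linear model and simulated data) that generates the $P$ prediction errors under each scheme:

```python
import numpy as np

def forecast_errors(y, x, R, scheme="recursive"):
    """One-step-ahead prediction errors for the model
    y_{t+1} = b1 + b2*y_t + b3*x_t + u_{t+1}, with the parameters
    re-estimated by OLS under a fixed, recursive, or rolling scheme."""
    T = len(y)
    errors = []
    for t in range(R - 1, T - 1):                 # predict y[t+1] with data through t
        lo = 0 if scheme in ("fixed", "recursive") else t - R + 1
        hi = R - 1 if scheme == "fixed" else t
        Z = np.column_stack([np.ones(hi - lo), y[lo:hi], x[lo:hi]])
        b, *_ = np.linalg.lstsq(Z, y[lo + 1:hi + 1], rcond=None)
        errors.append(y[t + 1] - b @ np.array([1.0, y[t], x[t]]))
    return np.array(errors)

# Illustrative data from a hypothetical DGP
rng = np.random.default_rng(3)
T = 300
x = rng.standard_normal(T)
y = np.zeros(T)
for t in range(1, T):
    y[t] = 0.5 * y[t - 1] + 0.3 * x[t - 1] + rng.standard_normal()

for s in ("fixed", "recursive", "rolling"):
    print(s, forecast_errors(y, x, R=150, scheme=s).std())
```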

We observe $\hat u_{0,t+1}$ and $\hat u_{1,t+1}$, and so construct the Diebold-Mariano statistic over the out of sample period as
$$\widehat{DM}_P = \frac{1}{\hat\sigma_P}\frac{1}{P^{1/2}}\sum_{t=R}^{T-1}\big(f(\hat u_{0,t+1}) - f(\hat u_{1,t+1})\big), \qquad (1)$$
with
$$\hat\sigma_P^2 = \frac{1}{P}\sum_{t=R}^{T-1}(\hat d_{t+1} - \bar{\hat d})^2 + \frac{2}{P}\sum_{\tau=1}^{l_P}w_\tau\sum_{t=R+\tau+1}^{T-1}(\hat d_{t+1} - \bar{\hat d})(\hat d_{t+1-\tau} - \bar{\hat d}),$$
where $\hat d_{t+1} = f(\hat u_{0,t+1}) - f(\hat u_{1,t+1})$ and $\bar{\hat d} = P^{-1}\sum_{t=R}^{T-1}\hat d_{t+1}$.

Assume that $f$ is a differentiable function (this rules out lin-lin loss, for example). Via a mean value expansion around $\beta_0$ and $\beta_1$, we have:
$$\frac{1}{P^{1/2}}\sum_{t=R}^{T-1}\big(f(\hat u_{0,t+1}) - f(\hat u_{1,t+1})\big) = \frac{1}{P^{1/2}}\sum_{t=R}^{T-1}\big(f(u_{0,t+1}) - f(u_{1,t+1})\big) \qquad (2)$$
$$+ \frac{1}{P}\sum_{t=R}^{T-1}\nabla_{\beta_0}f(\tilde u_{0,t+1})'\,P^{1/2}(\hat\beta_{0,t} - \beta_0) - \frac{1}{P}\sum_{t=R}^{T-1}\nabla_{\beta_1}f(\tilde u_{1,t+1})'\,P^{1/2}(\hat\beta_{1,t} - \beta_1), \qquad (3)$$
where $\tilde u_{0,t+1} = y_{t+1} - w_{0,t}'\tilde\beta_{0,t}$, with $\tilde\beta_{0,t} \in (\beta_0, \hat\beta_{0,t})$, and $\tilde u_{1,t+1}$ is defined in an analogous manner.

Note that the term on the right hand side of (2) is the same term we had in the absence of parameter estimation error (i.e. as if we knew the parameters). The main issue we shall address is the following: do the last two terms above vanish in probability as the sample gets large? In other words, does the effect of parameter estimation error vanish as the sample size gets large? Under which conditions will it vanish?

We shall show that, in the context of DM type tests:

(a) Regardless of the choice of the loss function $f$, parameter estimation error vanishes if, as $T \to \infty$, $P/R \to 0$ (i.e. if the estimation period grows at a faster rate than the prediction period grows).

Suppose that $T = 10100$, $R = 10000$, and $P = 100$; in this case $P = R^{1/2}$, so $P/R = R^{-1/2} \to 0$ as $R \to \infty$. In general, suppose that $R = T - T^\delta$, $\delta < 1$, and $P = T^\delta$ (so that $T = R + P$). In this case $P/R = T^\delta/(T - T^\delta) \to 0$ as $T \to \infty$. In practice, this occurs when the period used for estimation is much longer than the period used for out of sample forecasting.

(b) If the same loss function is used for estimation and out of sample prediction, then parameter estimation error vanishes regardless of the relative rates at which $P$ and $R$ grow as the sample size gets large (for the case of the DM test). This is, for example, the case in which we use nonlinear (or ordinary) least squares for estimation and employ a quadratic (MSE) loss function. More generally, this occurs when we estimate the parameters via an m-estimator and we use the same loss function for out of sample prediction (as we shall see below). Several authors (see e.g. Granger (1969) and Weiss (1996)) point out that the right way to proceed is to use the same loss for estimation and prediction.

(c) Finally, if $P/R \to \pi > 0$, and we use a different loss function for estimation and prediction, then the contribution of parameter estimation error does not vanish. In particular, it will affect the covariance of the limiting distribution, and we need to take it into account if we want to perform valid inference.

(ii) m-Estimators. Let:
$$\hat\theta_T = \arg\min_{\theta\in\Theta}\frac{1}{T}\sum_{t=1}^T m(y_t, X_t, \theta). \qquad (4)$$
Examples:
* $m(y_t, X_t, \theta) = (y_t - X_t'\theta)^2$: OLS;
* $m(y_t, X_t, \theta) = (y_t - h(X_t, \theta))^2$: nonlinear least squares;
* $m(y_t, X_t, \theta) = -\log f(y_t|X_t; \theta)$: maximum likelihood or quasi-MLE (QMLE).

Now, define:
$$\theta^\dagger = \arg\min_{\theta\in\Theta}E\big(m(y_t, X_t, \theta)\big). \qquad (5)$$
Note that, given the above expression for $\hat\theta_T$,

$$\frac{1}{T}\sum_{t=1}^T\nabla_\theta m(y_t, X_t, \hat\theta_T) = 0,$$
because of the first order conditions. Also,
$$\nabla_\theta E\big(m(y_t, X_t, \theta)\big)\Big|_{\theta=\theta^\dagger} = E\big(\nabla_\theta m(y_t, X_t, \theta^\dagger)\big) = 0.$$
If the uniform law of large numbers holds, that is, if
$$\sup_{\theta\in\Theta}\Big|\frac{1}{T}\sum_{t=1}^T\big(m(y_t, X_t, \theta) - E(m(y_t, X_t, \theta))\big)\Big| \stackrel{pr}{\to} 0,$$
then $\hat\theta_T \stackrel{pr}{\to} \theta^\dagger$ (consistency).

Now, by a mean value expansion around $\theta^\dagger$,
$$\frac{1}{T}\sum_{t=1}^T\nabla_\theta m(y_t, X_t, \hat\theta_T) = \frac{1}{T}\sum_{t=1}^T\nabla_\theta m(y_t, X_t, \theta^\dagger) + \frac{1}{T}\sum_{t=1}^T\nabla^2_\theta m(y_t, X_t, \tilde\theta_T)\,(\hat\theta_T - \theta^\dagger),$$
where $\tilde\theta_T \in (\hat\theta_T, \theta^\dagger)$. Since $\frac{1}{T}\sum_{t=1}^T\nabla_\theta m(y_t, X_t, \hat\theta_T) = 0$, we have:
$$\sqrt{T}\,(\hat\theta_T - \theta^\dagger) = -\Big(\frac{1}{T}\sum_{t=1}^T\nabla^2_\theta m(y_t, X_t, \tilde\theta_T)\Big)^{-1}\frac{1}{\sqrt{T}}\sum_{t=1}^T\nabla_\theta m(y_t, X_t, \theta^\dagger)$$
$$= -\Big(E\big(\nabla^2_\theta m(y_t, X_t, \theta^\dagger)\big)\Big)^{-1}\frac{1}{\sqrt{T}}\sum_{t=1}^T\nabla_\theta m(y_t, X_t, \theta^\dagger)$$
$$\quad - \Bigg(\Big(\frac{1}{T}\sum_{t=1}^T\nabla^2_\theta m(y_t, X_t, \tilde\theta_T)\Big)^{-1} - \Big(E\big(\nabla^2_\theta m(y_t, X_t, \theta^\dagger)\big)\Big)^{-1}\Bigg)\frac{1}{\sqrt{T}}\sum_{t=1}^T\nabla_\theta m(y_t, X_t, \theta^\dagger). \qquad (6)$$
Now, if the uniform law of large numbers holds, that is, if
$$\sup_{\theta\in\Theta}\Big|\frac{1}{T}\sum_{t=1}^T\big(\nabla^2_\theta m(y_t, X_t, \theta) - E(\nabla^2_\theta m(y_t, X_t, \theta))\big)\Big| \stackrel{pr}{\to} 0,$$
and $E\big(\nabla^2_\theta m(y_t, X_t, \theta)\big)$ is a positive definite matrix, then
$$\Big(\frac{1}{T}\sum_{t=1}^T\nabla^2_\theta m(y_t, X_t, \tilde\theta_T)\Big)^{-1} - \Big(E\big(\nabla^2_\theta m(y_t, X_t, \theta^\dagger)\big)\Big)^{-1} \stackrel{pr}{\to} 0.$$

Note that $E\big(\nabla_\theta m(y_t, X_t, \theta^\dagger)\big) = 0$, by the first order conditions. Under regularity conditions (see e.g. West (1996)), the central limit theorem applies, and
$$\frac{1}{\sqrt{T}}\sum_{t=1}^T\nabla_\theta m(y_t, X_t, \theta^\dagger) \stackrel{d}{\to} N(0, V),$$
where
$$V = \lim_{T\to\infty}\mathrm{Var}\Big(\frac{1}{\sqrt{T}}\sum_{t=1}^T\nabla_\theta m(y_t, X_t, \theta^\dagger)\Big).$$
Now, the term on the last line of (6) is the product of something going to zero in probability with something converging in distribution; therefore it goes to zero in probability (product rule). Thus,
$$\sqrt{T}\,(\hat\theta_T - \theta^\dagger) \stackrel{d}{\to} N(0, MVM),$$
with $M = \big(E(\nabla^2_\theta m(y_t, X_t, \theta^\dagger))\big)^{-1}$.

Now, we need estimators of $M$ and $V$; call them $\hat M$ and $\hat V$. By the uniform law of large numbers, a consistent estimator of $M$ is given by
$$\hat M = \Big(\frac{1}{T}\sum_{t=1}^T\nabla^2_\theta m(y_t, X_t, \hat\theta_T)\Big)^{-1}.$$
As for $\hat V$: if $E\big(\nabla_\theta m(y_t, X_t, \theta^\dagger)\nabla_\theta m(y_s, X_s, \theta^\dagger)'\big) = 0$ for all $t \neq s$, then
$$\hat V = \frac{1}{T}\sum_{t=1}^T\nabla_\theta m(y_t, X_t, \hat\theta_T)\nabla_\theta m(y_t, X_t, \hat\theta_T)';$$
if instead $E\big(\nabla_\theta m(y_t, X_t, \theta^\dagger)\nabla_\theta m(y_s, X_s, \theta^\dagger)'\big) \neq 0$ for some $t \neq s$, then we need to use a Newey-West (HAC, heteroskedasticity and autocorrelation robust) estimator. In this case,
$$\hat V = \frac{1}{T}\sum_{t=1}^T\nabla_\theta m(y_t, X_t, \hat\theta_T)\nabla_\theta m(y_t, X_t, \hat\theta_T)' + \frac{2}{T}\sum_{\tau=1}^{l_T}w_\tau\sum_{t=\tau+1}^T\nabla_\theta m(y_t, X_t, \hat\theta_T)\nabla_\theta m(y_{t-\tau}, X_{t-\tau}, \hat\theta_T)'.$$
Under the same type of assumptions as in West (1996),
$$\big(\hat M\hat V\hat M\big)^{-1/2}\sqrt{T}\,(\hat\theta_T - \theta^\dagger) \stackrel{d}{\to} N(0, I).$$
Note that when $V = M^{-1}$ (the analog of the spherical errors condition in the linear model), the covariance $MVM$ simplifies to $M$.

Turning to our example, let $\hat u_{0,t+1} = y_{t+1} - w_{0,t}'\hat\beta_{0,t}$.

Suppose that $\hat\beta_{0,t}$ is an m-estimator, defined as
$$\hat\beta_{0,t} = \arg\min_{\beta_0\in B_0}\frac{1}{t}\sum_{j=2}^t m(y_j - w_{0,j-1}'\beta_0). \qquad (7)$$
Note that if $m$ is a quadratic function, then
$$\hat\beta_{0,t} = \arg\min_{\beta_0\in B_0}\frac{1}{t}\sum_{j=2}^t(y_j - w_{0,j-1}'\beta_0)^2,$$
and so $\hat\beta_{0,t}$ is the OLS estimator. Also define
$$\beta_0^\dagger = \arg\min_{\beta_0\in B_0}E\big(m(y_j - w_{0,j-1}'\beta_0)\big), \qquad (8)$$
so that if $m$ is a quadratic function, then
$$\beta_0^\dagger = \arg\min_{\beta_0\in B_0}E\big((y_j - w_{0,j-1}'\beta_0)^2\big),$$
and if model 0 is indeed correctly specified, then $\beta_0^\dagger$ denotes the parameters of the conditional expectation. $\hat\beta_{1,t}$ and $\beta_1^\dagger$ are defined in the same manner.

We want to test
$$H_0: E(f(u_{0,t}) - f(u_{1,t})) = 0$$
versus
$$H_A: E(f(u_{0,t}) - f(u_{1,t})) \neq 0,$$
where $u_{0,t+1} = y_{t+1} - w_{0,t}'\beta_0^\dagger$ and $u_{1,t+1} = y_{t+1} - w_{1,t}'\beta_1^\dagger$.

Consider a non-standardized version of the DM statistic, also called, for the sake of simplicity, $\widehat{DM}_P$:
$$\widehat{DM}_P = \frac{1}{P^{1/2}}\sum_{t=R}^{T-1}\big(f(\hat u_{0,t+1}) - f(\hat u_{1,t+1})\big)$$
$$= \frac{1}{P^{1/2}}\sum_{t=R}^{T-1}\big(f(u_{0,t+1}) - f(u_{1,t+1})\big) \qquad (9)$$
$$+ \frac{1}{P}\sum_{t=R}^{T-1}\nabla_{\beta_0}f(\bar u_{0,t+1})'\,P^{1/2}(\hat\beta_{0,t} - \beta_0^\dagger) - \frac{1}{P}\sum_{t=R}^{T-1}\nabla_{\beta_1}f(\bar u_{1,t+1})'\,P^{1/2}(\hat\beta_{1,t} - \beta_1^\dagger), \qquad (10)$$

where $\bar u_{0,t+1} = y_{t+1} - w_{0,t}'\bar\beta_{0,t}$, with $\bar\beta_{0,t} \in (\beta_0^\dagger, \hat\beta_{0,t})$, and where $\bar u_{1,t+1}$ is defined in an analogous manner. The term in (9) is the DM statistic for the case in which we know the underlying parameters. For the sake of simplicity, let's concentrate on the first piece in (10). Let $m_t(\beta_0) = m(y_t - w_{0,t-1}'\beta_0)$. By a mean value expansion around $\beta_0^\dagger$,
$$\frac{1}{t}\sum_{j=2}^t\nabla_\beta m_j(\hat\beta_{0,t}) = \frac{1}{t}\sum_{j=2}^t\nabla_\beta m_j(\beta_0^\dagger) + \frac{1}{t}\sum_{j=2}^t\nabla^2_\beta m_j(\bar\beta_{0,t})\,(\hat\beta_{0,t} - \beta_0^\dagger),$$
with $\bar\beta_{0,t} \in (\hat\beta_{0,t}, \beta_0^\dagger)$. Now, the left hand side above is identically zero by the first order conditions (see equation (7)); thus
$$t^{1/2}(\hat\beta_{0,t} - \beta_0^\dagger) = -\Big(\frac{1}{t}\sum_{j=2}^t\nabla^2_\beta m_j(\bar\beta_{0,t})\Big)^{-1}\frac{1}{t^{1/2}}\sum_{j=2}^t\nabla_\beta m_j(\beta_0^\dagger). \qquad (11)$$
Hereafter, let $f_t(\beta_0) = f(y_{t+1} - w_{0,t}'\beta_0)$ and $f_t(\beta_1) = f(y_{t+1} - w_{1,t}'\beta_1)$.

Along the lines of West (1996), we now state the following assumptions.

Assumption A1: $f$ is twice continuously differentiable in $\beta_i$, and $\sup_{\beta_i\in B_i}\|\nabla^2_{\beta_i}f_t(\beta_i)\| < C$, $i = 0, 1$.

Assumption A2: $\Big(\frac{1}{t}\sum_{j=2}^t\nabla^2_\beta m_j(\bar\beta_{i,t})\Big)^{-1}$ converges almost surely, uniformly in $t$, to $B_i$, where $B_i$ is negative definite, $i = 0, 1$.

Assumption A3: (i) $(y_t, X_t, w_{i,t})$, $i = 0, 1$, is a strictly stationary, strong mixing sequence with size $4(4+\psi)/\psi$, $\psi > 0$; (ii) $f$ and $m$ are twice continuously differentiable in $\beta$ over the interior of $B$, and $\nabla_\beta m$, $\nabla^2_\beta m$, $\nabla_\beta f$, $\nabla^2_\beta f$ are $2r$-dominated (more simply, they have $2r$ finite moments, uniformly in $B$) with $r \geq 2(2+\psi)$.

Assumption A4: $\beta_i^\dagger$ is uniquely identified (i.e. $E\big(m(y_j - w_{i,j-1}'\beta_i^\dagger)\big) < E\big(m(y_j - w_{i,j-1}'\beta_i)\big)$ for any $\beta_i \neq \beta_i^\dagger$), $i = 0, 1$.

Assumption A5: $T = R + P$, and as $T \to \infty$, $P, R \to \infty$, with $P/R \to \pi$, $0 \leq \pi \leq \infty$.

Hereafter, the notation $o_P(1)$ denotes a term which approaches zero in probability. Recall that we are considering the term $\frac{1}{P}\sum_{t=R}^{T-1}\nabla_{\beta_0}f(u_{0,t+1})'\,P^{1/2}(\hat\beta_{0,t} - \beta_0^\dagger)$.

Proposition 2: Let Assumptions A1-A5 hold. Then:

(i) If $f = m$ (i.e. if we are using the same loss for estimation and testing), then
$$\frac{1}{P}\sum_{t=R}^{T-1}\nabla_{\beta_0}f(u_{0,t+1})'\,P^{1/2}(\hat\beta_{0,t} - \beta_0^\dagger) = o_P(1).$$

(ii) If $\pi = 0$ (i.e. if, as $T \to \infty$, $P/R \to 0$), then
$$\frac{1}{P}\sum_{t=R}^{T-1}\nabla_{\beta_0}f(u_{0,t+1})'\,P^{1/2}(\hat\beta_{0,t} - \beta_0^\dagger) = o_P(1).$$

(iii) In all other cases (i.e. $\pi > 0$ and $f \neq m$),
$$\frac{1}{P}\sum_{t=R}^{T-1}\nabla_{\beta_0}f(u_{0,t+1})'\,P^{1/2}(\hat\beta_{0,t} - \beta_0^\dagger) \stackrel{d}{\to} N\big(0,\, 2\Pi F_0'B_0 S_{h_0h_0}B_0F_0\big),$$
where $\Pi = 1 - \pi^{-1}\ln(1+\pi)$ for $0 < \pi < \infty$, and $\Pi = 1$ for $\pi = \infty$. Also,
$$F_0 = E\big(\nabla_{\beta_0}f(u_{0,t+1})\big), \quad S_{h_0h_0} = \sum_{j=-\infty}^{\infty}E\big(\nabla_\beta m_1(\beta_0^\dagger)\nabla_\beta m_{1+j}(\beta_0^\dagger)'\big), \quad B_0 = \Big(E\big(\nabla^2_\beta m_t(\beta_0^\dagger)\big)\Big)^{-1}.$$

Proof (available upon request)

Thus, we have seen that there are two important cases in which the effect of parameter estimation error vanishes in probability. Broadly speaking, if we use a different loss function for estimation and testing, then we are (a priori) ruling out the use of an optimal predictor. Also, let's stop standardizing the DM test (by $\sigma$), while still calling it $\widehat{DM}_P$.

Proposition 3: Let Assumptions A1-A5 hold, and let $f \neq m$ (different loss for estimation and testing) and $\pi > 0$. Then, under $H_0$,
$$\widehat{DM}_P = \frac{1}{P^{1/2}}\sum_{t=R}^{T-1}\big(f(\hat u_{0,t+1}) - f(\hat u_{1,t+1})\big) \stackrel{d}{\to} N(0, \Omega),$$
where
$$\Omega = S_{ff} + 2\Pi F_0'B_0S_{h_0h_0}B_0F_0 + 2\Pi F_1'B_1S_{h_1h_1}B_1F_1 - \Pi\big(S_{fh_0}B_0F_0 + F_0'B_0S_{fh_0}'\big)$$
$$\quad - 2\Pi\big(F_1'B_1S_{h_1h_0}B_0F_0 + F_0'B_0S_{h_0h_1}B_1F_1\big) + \Pi\big(S_{fh_1}B_1F_1 + F_1'B_1S_{fh_1}'\big),$$
and where, for $i, l = 0, 1$:
$$F_i = E\big(\nabla_{\beta_i}f(u_{i,t+1})\big), \quad B_i = \Big(E\big(\nabla^2_\beta m_j(\beta_i^\dagger)\big)\Big)^{-1}, \quad S_{h_ih_l} = \sum_{j=-\infty}^{\infty}E\big(\nabla_\beta m_1(\beta_i^\dagger)\nabla_\beta m_{1+j}(\beta_l^\dagger)'\big),$$
$$S_{fh_i} = \sum_{j=-\infty}^{\infty}E\big((f(u_{0,1}) - f(u_{1,1}))\nabla_\beta m_{1+j}(\beta_i^\dagger)'\big), \quad S_{ff} = \sum_{j=-\infty}^{\infty}E\big((f(u_{0,1}) - f(u_{1,1}))(f(u_{0,1+j}) - f(u_{1,1+j}))\big).$$

Under the alternative, for some $\varepsilon > 0$,
$$\Pr\Big(\Big|\frac{1}{P}\sum_{t=R}^{T-1}\big(f(\hat u_{0,t+1}) - f(\hat u_{1,t+1})\big)\Big| > \varepsilon\Big) \to 1,$$
and so $\frac{1}{P^{1/2}}\sum_{t=R}^{T-1}\big(f(\hat u_{0,t+1}) - f(\hat u_{1,t+1})\big)$ diverges at rate $P^{1/2}$.

In order to implement a valid DM test in the case of non-vanishing parameter estimation error, we need to consistently estimate all the pieces of the covariance matrix in Proposition 3. For consistent estimation of $F_i$ and $B_i$ we can simply use sample means evaluated at the estimated parameters. Namely, we can use
$$\hat F_i = \frac{1}{P}\sum_{t=R}^{T-1}\nabla_{\beta_i}f(\hat u_{i,t+1}) \quad \text{and} \quad \hat B_i = \Big(\frac{1}{P}\sum_{t=R}^{T-1}\nabla^2_{\beta_i}m_t(\hat\beta_{i,t})\Big)^{-1}.$$

However, for the long run covariance matrices we need to use a HAC (Newey-West type) estimator. Define, for $i, l = 0, 1$:
$$\hat S_{h_ih_l} = \frac{1}{P}\sum_{\tau=-l_P}^{l_P}w_\tau\sum_{t=R+l_P}^{T-1}\nabla_\beta m_t(\hat\beta_{i,t})\nabla_\beta m_{t+\tau}(\hat\beta_{l,t})',$$
$$\hat S_{fh_i} = \frac{1}{P}\sum_{\tau=-l_P}^{l_P}w_\tau\sum_{t=R+l_P}^{T-1}\Big(f(\hat u_{0,t}) - f(\hat u_{1,t}) - \bar{\hat d}\Big)\nabla_\beta m_{t+\tau}(\hat\beta_{i,t})',$$
$$\hat S_{ff} = \frac{1}{P}\sum_{\tau=-l_P}^{l_P}w_\tau\sum_{t=R+l_P}^{T-1}\Big(f(\hat u_{0,t}) - f(\hat u_{1,t}) - \bar{\hat d}\Big)\Big(f(\hat u_{0,t+\tau}) - f(\hat u_{1,t+\tau}) - \bar{\hat d}\Big),$$
where $\bar{\hat d} = \frac{1}{P}\sum_{t=R}^{T-1}\big(f(\hat u_{0,t}) - f(\hat u_{1,t})\big)$.

Given A1-A4 above, if we let $w_\tau = 1 - |\tau|/(l_P+1)$, then as $P \to \infty$, $l_P \to \infty$, and $l_P/P^{1/4} \to 0$, the estimators $\hat S_{h_ih_l}$, $\hat S_{fh_i}$, and $\hat S_{ff}$ are consistent for $S_{h_ih_l}$, $S_{fh_i}$, and $S_{ff}$.

Note that in practice we do not know $\pi$; a natural estimate is $\hat\pi = P/R$. Also, we do not observe the rates at which $P$ and $R$ grow. Thus, unless $R$ is much larger than $P$, it is worthwhile to use the formula for the covariance which takes parameter estimation error into account whenever we use a different loss for estimation and prediction. Of note is that a recent key paper by Giacomini and White discusses conditional predictive inference, in which case the data are conditioned on and parameter estimation error essentially vanishes.

2.2 Bootstrap Techniques for Critical Value Construction

2.2.1 Introduction to the Bootstrap

Inference on parameters is typically based on asymptotic critical values. But how good is the normal approximation? Can we improve over inference based upon the normal approximation? We shall see that bootstrap critical values can provide refinements over asymptotic critical values under various circumstances.

First, let us outline the logic underlying the bootstrap; then we shall see how the use of the bootstrap can lead to more accurate inference.

Consider a very simple situation. We have a sample of $T$ iid observations, $X_1, \ldots, X_T$, and we want to test the null hypothesis
$$H_0: E(X_1) = \mu \quad \text{versus} \quad H_A: E(X_1) \neq \mu.$$
Note that given the identical distribution assumption, $E(X_1) = E(X_2) = \ldots = E(X_T)$. Consider the t-statistic
$$t_{\mu,T} = \frac{T^{-1/2}\sum_{t=1}^T(X_t - \mu)}{\hat\sigma_X},$$
where
$$\hat\sigma_X^2 = \frac{1}{T}\sum_{t=1}^T\Big(X_t - \frac{1}{T}\sum_{t=1}^T X_t\Big)^2.$$
Provided that $\mathrm{var}(X_1) < \infty$, we know that under $H_0$, $t_{\mu,T} \stackrel{d}{\to} N(0,1)$. Thus, we compare $t_{\mu,T}$ with the 2.5% and 97.5% critical values of a standard normal, and we reject at the 5% significance level if $t_{\mu,T} < -1.96$ or $t_{\mu,T} > 1.96$.

The idea underlying the bootstrap is to pretend that the sample is the population, and to draw from the sample as many bootstrap samples as needed in order to construct many bootstrap statistics. The simplest form of bootstrap is the iid nonparametric bootstrap, which is suitable for iid observations.

Imagine that we put all $T$ observations into an urn, and we then make $T$ draws with replacement (i.e. we make one draw, get one observation, put it back into the urn, draw another one, put it back, and so on). Let $X_1^*, X_2^*, \ldots, X_T^*$ be the resampled observations, and note that $X_1^* = X_t$, $t = 1, \ldots, T$, each with probability $1/T$. In other words, $X_1^*, X_2^*, \ldots, X_T^*$ is equal to $X_{I_1}, X_{I_2}, \ldots, X_{I_T}$, where, for $i = 1, \ldots, T$, $I_i$ is a random variable taking values $1, 2, \ldots, T$ with equal probability $1/T$. $X_1^*, X_2^*, \ldots, X_T^*$ forms a bootstrap sample. Needless to say, we can repeat the same operation and get a second bootstrap sample, and so on.

Note that, given the original sample, the probability law governing the resampled series is nothing other than the probability law of the $I_i$, $i = 1, \ldots, T$. As the $I_i$ are iid discrete uniform random variables on $\{1, \ldots, T\}$, the $X_i^*$ are also iid, conditional on the sample.

Let $E^*$ and $\mathrm{Var}^*$ denote the mean and the variance of the resampled series, conditional on the sample (note that $E^*$ and $\mathrm{Var}^*$ are mean and variance operators in terms of the law governing the bootstrap, i.e. in terms of the $I_i$, $i = 1, \ldots, T$). Given the identical distribution, $E^*(X_1^*) = E^*(X_2^*) = \ldots = E^*(X_T^*)$, and
$$E^*(X_1^*) = X_1\frac{1}{T} + X_2\frac{1}{T} + \ldots + X_T\frac{1}{T} = \frac{1}{T}\sum_{t=1}^T X_t.$$
Also,
$$E^*\Big(\frac{1}{T}\sum_{t=1}^T X_t^*\Big) = \frac{1}{T}\sum_{t=1}^T X_t.$$
Thus, the bootstrap mean is equal to the sample mean. Given that $X_1^*, \ldots, X_T^*$ are iid observations,
$$\mathrm{Var}^*\Big(\frac{1}{T^{1/2}}\sum_{t=1}^T X_t^*\Big) = \mathrm{Var}^*(X_1^*) = E^*(X_1^{*2}) - \big(E^*(X_1^*)\big)^2 = \frac{1}{T}\sum_{t=1}^T X_t^2 - \Big(\frac{1}{T}\sum_{t=1}^T X_t\Big)^2.$$
Thus, the bootstrap variance is equal to the sample variance. Let
$$\hat\sigma_X^{*2} = \frac{1}{T}\sum_{t=1}^T\Big(X_t^* - \frac{1}{T}\sum_{t=1}^T X_t^*\Big)^2.$$
Given that $X_1^*, \ldots, X_T^*$ are iid with mean and variance equal to the sample mean and sample variance,
$$t_{\mu,T}^* = \frac{1}{\hat\sigma_X^*}\frac{1}{T^{1/2}}\sum_{t=1}^T\Big(X_t^* - \frac{1}{T}\sum_{t=1}^T X_t\Big) \stackrel{d^*}{\to} N(0,1),$$

where $\stackrel{d^*}{\to}$ denotes convergence in distribution according to the bootstrap probability measure, conditional on the sample. Note, importantly, that $t_{\mu,T}^* \stackrel{d^*}{\to} N(0,1)$ regardless of whether the null hypothesis is true or not. Thus, under the null, $t_{\mu,T}$ and $t_{\mu,T}^*$ have the same limiting distribution. Under the alternative, $t_{\mu,T}^* \stackrel{d^*}{\to} N(0,1)$ while $t_{\mu,T}$ diverges (to $\pm\infty$).

This suggests proceeding in the following manner. Construct $B$ ($B$ large) bootstrap statistics, say $t_{\mu,T}^{*(1)}, \ldots, t_{\mu,T}^{*(B)}$, and sort them from smallest to largest. Suppose $B = 1000$; then the 25th sorted bootstrap statistic gives the 2.5% significance level critical value, say $z_{T,2.5\%}^*$, and the 975th gives the 97.5% significance level critical value, say $z_{T,97.5\%}^*$. If $B$ is large enough, then rejecting $H_0$ if $t_{\mu,T} < z_{T,2.5\%}^*$ or $t_{\mu,T} > z_{T,97.5\%}^*$, and not rejecting if $z_{T,2.5\%}^* \leq t_{\mu,T} \leq z_{T,97.5\%}^*$, yields a test with asymptotic size equal to 5% and unit asymptotic power.

It is important to note that in this case the bootstrap higher moments are also equal to the corresponding sample moments. In fact, given independence, $E^*(X_1^{*3}) = \frac{1}{T}\sum_{t=1}^T X_t^3$, and so on for the fourth moment, etc.

Question: Is inference based on $z_{T,2.5\%}^*$ and $z_{T,97.5\%}^*$ more accurate than inference based on standard normal approximations (e.g. based on using $\pm 1.96$)?

Answer: Yes. Why? This can be shown using an Edgeworth expansion (see lecture notes).
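A minimal sketch of the procedure just described (simulated sample; $B = 1000$ as in the text); the centering of each bootstrap statistic at the sample mean is what makes the bootstrap distribution mimic the null whether or not $H_0$ is true:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.exponential(scale=1.0, size=100)    # hypothetical iid sample
T, B, mu0 = len(X), 1000, 1.0               # test H0: E(X_1) = 1

def t_stat(sample, center):
    return np.sqrt(len(sample)) * (sample.mean() - center) / sample.std()

t_obs = t_stat(X, mu0)

# Each bootstrap statistic is centered at the *sample* mean
t_boot = np.array([t_stat(X[rng.integers(0, T, size=T)], X.mean())
                   for _ in range(B)])

z_lo, z_hi = np.percentile(t_boot, [2.5, 97.5])           # bootstrap critical values
print(t_obs, (z_lo, z_hi), t_obs < z_lo or t_obs > z_hi)  # True => reject at 5%
```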

2.2.2 Bootstrap with Time Series

The iid nonparametric bootstrap does not work with dependent observations. The reason is that the resampled observations are iid, while the actual observations are not.

In the case of dependent observations, things are more complicated. On the one hand, we want to draw blocks of data long enough to preserve the dependence structure present in the original sample; on the other hand, we want a large enough number of blocks that are independent of each other. The most used resampling method for time series data is the block bootstrap of Künsch (1989), which we consider below.

Let $T = bl$, where $b$ denotes the number of blocks and $l$ denotes the length of each block. We first draw a discrete uniform random variable, $I_1$, that can take values $0, 1, \ldots, T-l$, each with probability $1/(T-l+1)$. The first block is given by $X_{I_1+1}, \ldots, X_{I_1+l}$. We then draw another discrete uniform random variable, say $I_2$, and a second block of length $l$ is formed, $X_{I_2+1}, \ldots, X_{I_2+l}$. Continue in the same manner until the last discrete uniform, $I_b$, is drawn, so that the last block is $X_{I_b+1}, \ldots, X_{I_b+l}$. Call $X_t^*$ the resampled series, and note that $X_1^*, X_2^*, \ldots, X_T^*$ corresponds to $X_{I_1+1}, X_{I_1+2}, \ldots, X_{I_b+l}$. Thus, conditional on the sample, the only random element is the beginning of each block. In particular, $X_1^*, \ldots, X_l^*, X_{l+1}^*, \ldots, X_{2l}^*, \ldots, X_{T-l+1}^*, \ldots, X_T^*$ can, conditional on the sample, be treated as $b$ iid blocks indexed by discrete uniform random variables. The results above hold under some restrictions on the block length.

2.2.3 Finally... Bootstrap DM (and Data Snooping) Tests

Given the above discussion, we can use the block bootstrap in the context of DM tests. Namely, to test
$$H_0: E\big(f(u_{0,t+1}) - f(u_{1,t+1})\big) = 0 \quad \text{versus} \quad H_A: E\big(f(u_{0,t+1}) - f(u_{1,t+1})\big) \neq 0,$$
use
$$\widehat{DM}_P^* = \frac{1}{P^{1/2}}\sum_{t=R}^{T-1}\Big[\big(f(\hat u_{0,t+1}^*) - f(\hat u_{1,t+1}^*)\big) - \frac{1}{T}\sum_{t=2}^{T-1}\big(f(\hat u_{0,t+1}) - f(\hat u_{1,t+1})\big)\Big].$$
Then, the empirical distribution of this statistic can be used to obtain critical values for $\widehat{DM}_P$.

Moreover, consider the White (2000) data snooping test for comparing many models against a benchmark, so that we have

$$H_0: \max_{k=1,\ldots,m}E\big(f(u_{0,t+1}) - f(u_{k,t+1})\big) \leq 0$$
versus
$$H_A: \max_{k=1,\ldots,m}E\big(f(u_{0,t+1}) - f(u_{k,t+1})\big) > 0,$$
and
$$S_P = \max_{k=1,\ldots,m}\widehat{DM}_{P,k} = \max_{k=1,\ldots,m}\frac{1}{P^{1/2}}\sum_{t=R}^{T-1}\big(f(\hat u_{0,t+1}) - f(\hat u_{k,t+1})\big).$$
Here, we need only construct
$$S_P^* = \max_{k=1,\ldots,m}\widehat{DM}_{P,k}^* = \max_{k=1,\ldots,m}\frac{1}{P^{1/2}}\sum_{t=R}^{T-1}\Big[\big(f(\hat u_{0,t+1}^*) - f(\hat u_{k,t+1}^*)\big) - \frac{1}{T}\sum_{t=2}^{T-1}\big(f(\hat u_{0,t+1}) - f(\hat u_{k,t+1})\big)\Big].$$

The only caveat with both of these tests is that a recentering of the parameter estimates used in the construction of either $f(\hat u_{0,t+1}^*) - f(\hat u_{1,t+1}^*)$ or $f(\hat u_{0,t+1}^*) - f(\hat u_{k,t+1}^*)$ generally needs to be made, because the bootstrap estimator of the forecast model parameters is characterized by a location bias. Additionally, the bootstrap component is constructed over the last $P$ observations, while the sample component is constructed over all $T$ observations. These adjustments to the usual approach to bootstrap test statistics discussed above arise because of the use of recursive (or rolling) estimation schemes. If parameter estimation error is assumed to vanish, then the recentering of the parameter estimates does not need to be made, although the $\frac{1}{T}$ term still needs to be subtracted from the bootstrap statistics.

References:

Corradi, Valentina and Norman R. Swanson, 2006, Predictive Density Evaluation, in: Handbook of Economic Forecasting, eds. Clive W.J. Granger, Graham Elliott and Allan Timmermann, Elsevier, Amsterdam.

Corradi, Valentina and Norman R. Swanson, 2007, Nonparametric Bootstrap Procedures for Predictive Inference Based on Recursive Estimation Schemes, International Economic Review, 48.
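A sketch of the block resampling step underlying these bootstrap tests (Künsch's moving block bootstrap; the block length and the series are illustrative choices):

```python
import numpy as np

def block_bootstrap(x, l, rng):
    """One moving-block bootstrap resample of the series x (Kunsch, 1989):
    draw b = T/l blocks of length l, with uniform random start points."""
    T = len(x)
    b = T // l                                    # assumes T = b * l for simplicity
    starts = rng.integers(0, T - l + 1, size=b)   # I_i uniform on {0, ..., T-l}
    return np.concatenate([x[s:s + l] for s in starts])

rng = np.random.default_rng(5)
x = np.zeros(200)
for t in range(1, 200):                           # a hypothetical AR(1) series
    x[t] = 0.7 * x[t - 1] + rng.standard_normal()

# Bootstrap distribution of the sample mean of a dependent series
means = np.array([block_bootstrap(x, l=10, rng=rng).mean() for _ in range(1000)])
print(means.std())   # bootstrap standard error of the mean
```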

3 Part III - What Should We Be Predicting? Real-Time Data

Real-time data are important in policy making contexts, as most macroeconomic data are revised over time.

Table 1: Generic Real-Time Dataset. Rows index the calendar date $j$ the data pertain to; columns index the release date $i$; entry $X_i(j)$ is the value for calendar date $j$ as reported in the vintage released at date $i$. Entries are blank where the data for date $j$ have not yet been released.

Data pertain to | i=1950:05 | i=1950:06 | i=1950:07 | ... | i=t | ... | i=2011:03 | i=2011:04
j=1950:04       | X_i(j)    | X_i(j)    | X_i(j)    | ... | X_i(j) | ... | X_i(j) | X_i(j)
j=1950:05       |           | X_i(j)    | X_i(j)    | ... | X_i(j) | ... | X_i(j) | X_i(j)
j=1950:06       |           |           | X_i(j)    | ... | X_i(j) | ... | X_i(j) | X_i(j)
...             |           |           |           |     |        |     |        |
j=t             |           |           |           |     |        | ... | X_i(j) | X_i(j)
j=2011:02       |           |           |           |     |        |     | X_i(j) | X_i(j)
j=2011:03       |           |           |           |     |        |     |        | X_i(j)

Figure 1: Output Growth Rates, First, and Second Revision Errors, 1965:4-2006:4. (Panels show output growth rates, first revision errors, and second revision errors; the revision errors are on the order of $10^{-3}$.)

Note: First revision errors are defined as follows: ${}^{t+2}u_t^{t+1} = {}^{t+2}X_t - {}^{t+1}X_t$, where ${}^{t+1}X_t$ is the annualized growth rate of output pertaining to calendar date $t$ and available at time $t+1$. Similarly, second revision errors are defined as ${}^{t+3}u_t^{t+2} = {}^{t+3}X_t - {}^{t+2}X_t$. See Sections 2 and 3 for further details.

Many papers are linked to the real-time data research center at the Philadelphia Federal Reserve Bank website.

In many papers, some of the earlier ones being Amato and Swanson (2001), Bernanke and Boivin (2003), and Croushore and Stark (2001, 2003), complete revision histories for the variables examined are considered. One way of thinking about this sort of data is using regressions of the form:
$${}^fX_t = \alpha + {}^{t+1}X_t\,\beta + W_{t+1}'\gamma + \varepsilon_{t+1}, \qquad (12)$$
where $W_{t+1}$ is an $m \times 1$ vector of variables representing the conditioning information set available at time period $t+1$, and $\varepsilon_{t+1}$ is an error term assumed to be uncorrelated with ${}^{t+1}X_t$ and $W_{t+1}$. The null hypothesis of interest in this model is that $\alpha = 0$, $\beta = 1$, and $\gamma = 0$, based on the notion of testing for the rationality of ${}^{t+1}X_t$ as a predictor of ${}^fX_t$ by finding out whether the conditioning information in $W_{t+1}$, available in real time to the data issuing agency, could have been used to construct better conditional predictions of the final data. Notice that this hypothesis, if rejected, is consistent with the errors-in-variables hypothesis.[1]

Following Keane and Runkle (1990), the test of rationality of ${}^{t+1}X_t$ in the context of model (12) can be broken down into two sub-hypotheses, namely (i) unbiasedness and (ii) efficiency. The hypothesis of unbiasedness can be tested by imposing the restriction that $\gamma = 0$ and testing $\alpha = 0$, $\beta = 1$. Efficiency requires that $\alpha = 0$, $\beta = 1$, and $\gamma = 0$.

For a complete discussion of the above ideas, including references, see:

Swanson, Norman R. and Dick van Dijk, 2006, Are Reporting Agencies Getting It Right? Data Rationality and Business Cycle Asymmetry, Journal of Business and Economic Statistics, 24.

Corradi, Valentina, Andres Fernandez and Norman R. Swanson, 2009, Information in the Revision Process of Real-Time Data, Journal of Business and Economic Statistics, 27.

[1] For further discussion of the relationship between errors-in-variables hypotheses and rationality hypotheses, the reader is referred to Croushore and Stark (2003) and Faust, Rogers, and Wright (2004), where the errors-in-variables and rational forecast models are associated with the notions of noise and news, respectively.
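A sketch of how the rationality regression (12) might be implemented (all data are simulated and hypothetical; a pure-noise revision is built in, so the test tends to reject, consistent with the errors-in-variables story):

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(6)
n, m = 200, 2
X_final = rng.standard_normal(n)                  # "final" data, fX_t
X_first = X_final + 0.2 * rng.standard_normal(n)  # first release {t+1}X_t: noise revision
W = rng.standard_normal((n, m))                   # conditioning variables W_{t+1}

Z = np.column_stack([np.ones(n), X_first, W])     # regressors (1, {t+1}X_t, W'_{t+1})
coef, *_ = np.linalg.lstsq(Z, X_final, rcond=None)
e = X_final - Z @ coef
s2 = e @ e / (n - Z.shape[1])
V = s2 * np.linalg.inv(Z.T @ Z)                   # classical OLS covariance (iid errors assumed)

r = coef - np.concatenate([[0.0, 1.0], np.zeros(m)])   # H0: alpha=0, beta=1, gamma=0
wald = r @ np.linalg.solve(V, r)
# With pure-noise revisions, beta is attenuated below 1, so rejection is likely.
print(wald, chi2.sf(wald, df=2 + m))              # small p-value => reject rationality
```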

Suffice it to say that one of the biggest issues is how to compare models. Should the target to be predicted be some sort of final release, or instead an earlier release upon which most economic agents base their decisions? Do revisions converge to zero, in the sense that all preliminary data converge to some true final values? What about definitional (and benchmark) revisions, as opposed to revisions associated with the arrival of new information? Can we predict revisions?

See the special issue of the Journal of Business and Economic Statistics in which the above Corradi et al. paper appears, as well as the key Aruoba et al. and McCracken et al. papers therein.

4 Part IV - Methods of Prediction - Some Comments

Three sorts of prediction methods are of particular interest to policy setters.

(i) Individual prediction models

Thus far, we have discussed numerous individual linear and nonlinear models, including AR models, threshold switching models, and GARCH models, for example. It is certainly feasible to construct prediction models with many equations, in which case correlation across predictions of different variables becomes important (e.g., common shocks driving multiple yields in a yield curve model). In the current context, we are also concerned with correlation across different predictors of the same variable (see the discussion of Aiolfi and Timmermann (2006) below).

(ii) Diffusion indices or other data reduction type models

Consider the standard diffusion index forecasting approach:
$$X_t = \Lambda_{0,t}F_{0,t} + u_t, \qquad (13)$$
where $X_t$ is an $N \times 1$ vector, $\Lambda_{0,t}$ is an $N \times r$ matrix of factor loadings, and $F_{0,t}$ is the unobserved $r \times 1$ factor vector. We want to use the factors to predict $y_t$, $h$ steps ahead. For this purpose, we might consider the following simple index model with no factor dynamics:
$$y_{t+h} = \beta_{0,1,t}F_{0,1,t} + \ldots + \beta_{0,r,t}F_{0,r,t} + \Gamma'W_t + \epsilon_{t+h} = F_{0,t}'\beta_{0,t} + \Gamma'W_t + \epsilon_{t+h}. \qquad (14)$$
Factors can be estimated using principal components, or using versions of the Kalman filter for various variants of this model. There are many data reduction techniques, such as bagging, boosting, the lasso, the garrote, ridge regression, and the elastic net, that offer additional approaches to data shrinkage. Some of these set the coefficients on many variables to zero, and are hence very parsimonious, while others are not. All of these are discussed in the following paper, where they are combined with diffusion index modelling:

Hyun Hak Kim and Norman R. Swanson, 2010, Forecasting Financial and Macroeconomic Variables Using Data Reduction Methods: New Empirical Evidence, Working Paper, Rutgers University.
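A sketch of the two-step diffusion index approach (all dimensions and the DGP are hypothetical, and the $\Gamma'W_t$ term is omitted for brevity): estimate the factors by principal components from a large panel, then run the forecasting regression (14) on the estimated factors:

```python
import numpy as np

rng = np.random.default_rng(7)
T, N, r, h = 200, 50, 2, 1
F = rng.standard_normal((T, r))                      # latent factors F_{0,t}
Lam = rng.standard_normal((N, r))                    # loadings Lambda_0
X = F @ Lam.T + rng.standard_normal((T, N))          # panel: X_t = Lambda F_t + u_t
y = F @ np.array([1.0, -0.5]) + 0.5 * rng.standard_normal(T)  # target driven by factors

# Principal components: factors proportional to left singular vectors of standardized X
Xs = (X - X.mean(0)) / X.std(0)
U, S, Vt = np.linalg.svd(Xs, full_matrices=False)
F_hat = np.sqrt(T) * U[:, :r]                        # estimated factor space

# Forecast regression y_{t+h} = F_t' beta + eps_{t+h}, estimated by OLS
Z = np.column_stack([np.ones(T - h), F_hat[:-h]])
beta, *_ = np.linalg.lstsq(Z, y[h:], rcond=None)
y_hat = np.array([1.0, *F_hat[-1]]) @ beta           # h-step forecast from the last factors
print(y_hat)
```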

(iii) Forecast pooling and forecast combination

A stylized fact in empirical economics is that it is difficult to beat the simple forecast average (i.e., the average of the predictions from multiple models, using equal weights). Bagging is a form of model averaging. Bayesian model averaging is another popular type of averaging. Pooling, or forecast combination, is sometimes done using simple regression, for example by estimating regressions of the form discussed in the example below, with the weights ($w_i$) estimated using least squares on a pre-sample, say.

Timmermann, A. (2006), Forecast Combinations, in Handbook of Economic Forecasting, eds. Clive W.J. Granger, Graham Elliott and Allan Timmermann, Elsevier, Amsterdam, notes that forecast combinations may so frequently be found to yield better forecasts because of model misspecification, instability (nonstationarities), and estimation error when the number of models is large relative to the sample size.

Aiolfi, M. and A. Timmermann (2006), Persistence in Forecasting Performance and Conditional Combination Strategies, Journal of Econometrics, 135, note that correlation (persistence) in the forecasting performance of linear and nonlinear time series models is prevalent in a large cross section of economic variables used for prediction in the G7 countries. They find it useful to: first (i) sort models into clusters using past performance; then (ii) pool forecasts within each cluster; and then (iii) estimate optimal forecast combination weights for the clusters, with shrinkage towards equal weights.

For further references, see Stock, James H. and Watson, Mark W., 2005, An Empirical Comparison of Methods for Forecasting Using Many Predictors, Working Paper, Princeton University.

Example: Combined Bivariate ADL Model

As in Stock and Watson (2005), we implement a combined bivariate autoregressive distributed lag (ADL) model. Forecasts are constructed by combining individual forecasts computed from bivariate ADL models. The $i$-th ADL model includes $p_{i,x}$ lags of $X_{i,t}$ and $p_{i,y}$ lags of $Y_t$, and has the form
$$\hat Y_{t+h}^{ADL} = \hat\alpha + \hat\beta_i(L)X_{i,t} + \hat\phi_i(L)Y_t.$$
The combined forecast is
$$\hat Y_{T+h|T}^{Comb,h} = \sum_{i=1}^n w_i\,\hat Y_{i,T+h|T}^{ADL,h}.$$
Here, we might set $w_i = 1/n$, where $n = 146$. See the Appendix for tables and results using this sort of forecast combination.
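A minimal sketch of the combined bivariate ADL forecast with equal weights (one lag of each variable, simulated data, and $n = 10$ rather than 146, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(8)
T, n, h = 200, 10, 1
X = rng.standard_normal((T, n))                       # candidate predictors X_{i,t}
y = np.zeros(T)
for t in range(1, T):
    y[t] = 0.4 * y[t - 1] + 0.3 * X[t - 1, 0] + rng.standard_normal()

forecasts = np.empty(n)
for i in range(n):
    # i-th ADL(1,1) model: y_{t+h} = alpha + beta*X_{i,t} + phi*y_t + eps_{t+h}
    Z = np.column_stack([np.ones(T - h), X[:-h, i], y[:-h]])
    b, *_ = np.linalg.lstsq(Z, y[h:], rcond=None)
    forecasts[i] = b @ np.array([1.0, X[-1, i], y[-1]])

w = np.full(n, 1.0 / n)                               # equal weights, w_i = 1/n
print(w @ forecasts)                                  # combined forecast
```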

5 Part V - Density Based Model Selection

Thus far we have discussed only point estimate based model selection (e.g. using mean square forecast error). Density based model selection is also important, as we are interested not only in good point predictions, but also in good interval predictions, for example.

5.1 Comparing Models Using Simulated Distributions

Assume that we have a series of real business cycle (RBC) models, and our objective is to compare the joint distribution of historical variables with the joint distribution of the simulated variables from these RBC models. Hereafter, for the sake of simplicity, but without loss of generality, we limit our attention to the joint distribution of (actual and model-based) current and previous period output growth. The extension to an arbitrary (but finite) number of lags of a given variable (or variables) follows immediately, but the computational demand increases substantially as the number of random variables being examined increases. We shall follow:

Corradi, Valentina and Norman R. Swanson, 2007, Evaluation of Dynamic Stochastic General Equilibrium Models Based on Distributional Comparison of Simulated and Historical Data, Journal of Econometrics, 136.

Consider $m$ RBC models, and set model 1 as the benchmark model. In keeping with our focus on current and lagged values of the variable of interest, let
$$Y_t = (\Delta\log X_t,\, \Delta\log X_{t-1}), \qquad Y_{j,n}(\hat\theta_{j,T}) = (\Delta\log X_{j,n}(\hat\theta_{j,T}),\, \Delta\log X_{j,n-1}(\hat\theta_{j,T})).$$
Also, let $F_0(u; \theta_0)$ denote the distribution of $Y_t$ evaluated at $u$, and let $F_j(u; \theta_j^\dagger)$ denote the distribution of $Y_{j,n}(\theta_j^\dagger)$, where $\theta_j^\dagger$ is the probability limit of $\hat\theta_{j,T}$, taken as $T \to \infty$, and where $u \in U \subset \Re^2$, with $U$ possibly unbounded. Accuracy is measured in terms of squared error.

The squared (approximation) error associated with model $j$, $j = 1, \ldots, m$, is measured in terms of the (weighted) average over $U$ of $\big(F_j(u; \theta_j^\dagger) - F_0(u; \theta_0)\big)^2$, where $u \in U$, and $U$ is a possibly unbounded set in $\Re^2$. Thus, the rule is to choose Model 1 over Model 2 if
$$\int_U\big(F_1(u; \theta_1^\dagger) - F_0(u; \theta_0)\big)^2\phi(u)\,du < \int_U\big(F_2(u; \theta_2^\dagger) - F_0(u; \theta_0)\big)^2\phi(u)\,du,$$
where $\int_U\phi(u)\,du = 1$ and $\phi(u) \geq 0$ for all $u \in U \subset \Re^2$. For any evaluation point, this measure defines a norm and is a typical goodness of fit measure.

The hypotheses of interest are:
$$H_0: \max_{j=2,\ldots,m}\int_U\Big[\big(F_0(u; \theta_0) - F_1(u; \theta_1^\dagger)\big)^2 - \big(F_0(u; \theta_0) - F_j(u; \theta_j^\dagger)\big)^2\Big]\phi(u)\,du \leq 0$$
and
$$H_A: \max_{j=2,\ldots,m}\int_U\Big[\big(F_0(u; \theta_0) - F_1(u; \theta_1^\dagger)\big)^2 - \big(F_0(u; \theta_0) - F_j(u; \theta_j^\dagger)\big)^2\Big]\phi(u)\,du > 0.$$
Thus, under $H_0$, no model can provide a better approximation (in a squared error sense) to the distribution of $Y_t$ than the approximation provided by model 1.

If interest focuses on confidence intervals, so that the objective is to approximate $\Pr(\underline{u} \leq Y_t \leq \bar u)$, then the null and alternative hypotheses can be stated as:
$$H_0': \max_{j=2,\ldots,m}\Big[\big((F_1(\bar u; \theta_1^\dagger) - F_1(\underline{u}; \theta_1^\dagger)) - (F_0(\bar u; \theta_0) - F_0(\underline{u}; \theta_0))\big)^2 - \big((F_j(\bar u; \theta_j^\dagger) - F_j(\underline{u}; \theta_j^\dagger)) - (F_0(\bar u; \theta_0) - F_0(\underline{u}; \theta_0))\big)^2\Big] \leq 0$$
versus
$$H_A': \max_{j=2,\ldots,m}\Big[\big((F_1(\bar u; \theta_1^\dagger) - F_1(\underline{u}; \theta_1^\dagger)) - (F_0(\bar u; \theta_0) - F_0(\underline{u}; \theta_0))\big)^2 - \big((F_j(\bar u; \theta_j^\dagger) - F_j(\underline{u}; \theta_j^\dagger)) - (F_0(\bar u; \theta_0) - F_0(\underline{u}; \theta_0))\big)^2\Big] > 0.$$

If interest focuses on testing the null of equal accuracy of two distribution models (analogous to the pairwise conditional mean comparison setup of Diebold and Mariano (1995)), we can simply state the hypotheses as:
$$H_0'': \int_U\Big[\big(F_0(u; \theta_0) - F_1(u; \theta_1^\dagger)\big)^2 - \big(F_0(u; \theta_0) - F_2(u; \theta_2^\dagger)\big)^2\Big]\phi(u)\,du = 0.$$
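A sketch of the squared-distance accuracy measure underlying these hypotheses, in one dimension for simplicity (the notes' setting is bivariate); empirical CDFs stand in for $F_0$ and the models' simulated distributions, and a standard normal density is an illustrative choice of $\phi$:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(9)

def ecdf(sample, u):
    """Empirical CDF of `sample` evaluated at each point of the grid `u`."""
    return (sample[None, :] <= u[:, None]).mean(axis=1)

actual = rng.standard_normal(5000)          # stand-in for historical data Y_t
model1 = 0.9 * rng.standard_normal(5000)    # simulated data from model 1
model2 = rng.standard_normal(5000) + 0.5    # simulated data from model 2

u = np.linspace(-4.0, 4.0, 401)             # evaluation grid U
phi = norm.pdf(u)                           # weight function (integrates to ~1 on U)
du = u[1] - u[0]

def sq_dist(model):
    """Weighted average squared distance between model and actual CDFs."""
    return (((ecdf(model, u) - ecdf(actual, u)) ** 2) * phi).sum() * du

# Choose model 1 over model 2 if its weighted squared distance is smaller
print(sq_dist(model1), sq_dist(model2))
```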


More information

ARIMA Modelling and Forecasting

ARIMA Modelling and Forecasting ARIMA Modelling and Forecasting Economic time series often appear nonstationary, because of trends, seasonal patterns, cycles, etc. However, the differences may appear stationary. Δx t x t x t 1 (first

More information

Introduction to Econometrics

Introduction to Econometrics Introduction to Econometrics T H I R D E D I T I O N Global Edition James H. Stock Harvard University Mark W. Watson Princeton University Boston Columbus Indianapolis New York San Francisco Upper Saddle

More information

Time Series Analysis. James D. Hamilton PRINCETON UNIVERSITY PRESS PRINCETON, NEW JERSEY

Time Series Analysis. James D. Hamilton PRINCETON UNIVERSITY PRESS PRINCETON, NEW JERSEY Time Series Analysis James D. Hamilton PRINCETON UNIVERSITY PRESS PRINCETON, NEW JERSEY & Contents PREFACE xiii 1 1.1. 1.2. Difference Equations First-Order Difference Equations 1 /?th-order Difference

More information

2.5 Forecasting and Impulse Response Functions

2.5 Forecasting and Impulse Response Functions 2.5 Forecasting and Impulse Response Functions Principles of forecasting Forecast based on conditional expectations Suppose we are interested in forecasting the value of y t+1 based on a set of variables

More information

Are Forecast Updates Progressive?

Are Forecast Updates Progressive? CIRJE-F-736 Are Forecast Updates Progressive? Chia-Lin Chang National Chung Hsing University Philip Hans Franses Erasmus University Rotterdam Michael McAleer Erasmus University Rotterdam and Tinbergen

More information

Research Division Federal Reserve Bank of St. Louis Working Paper Series

Research Division Federal Reserve Bank of St. Louis Working Paper Series Research Division Federal Reserve Bank of St. Louis Working Paper Series Tests of Equal Predictive Ability with Real-Time Data Todd E. Clark and Michael W. McCracken Working Paper 2008-029A http://research.stlouisfed.org/wp/2008/2008-029.pdf

More information

Econ 623 Econometrics II Topic 2: Stationary Time Series

Econ 623 Econometrics II Topic 2: Stationary Time Series 1 Introduction Econ 623 Econometrics II Topic 2: Stationary Time Series In the regression model we can model the error term as an autoregression AR(1) process. That is, we can use the past value of the

More information

ECON3327: Financial Econometrics, Spring 2016

ECON3327: Financial Econometrics, Spring 2016 ECON3327: Financial Econometrics, Spring 2016 Wooldridge, Introductory Econometrics (5th ed, 2012) Chapter 11: OLS with time series data Stationary and weakly dependent time series The notion of a stationary

More information

Some Recent Developments in Predictive Accuracy Testing With Nested Models and (Generic) Nonlinear Alternatives

Some Recent Developments in Predictive Accuracy Testing With Nested Models and (Generic) Nonlinear Alternatives Some Recent Developments in redictive Accuracy Testing With Nested Models and (Generic) Nonlinear Alternatives Valentina Corradi 1 and Norman R. Swanson 2 1 University of Exeter 2 Rutgers University August

More information

Central Bank of Chile October 29-31, 2013 Bruce Hansen (University of Wisconsin) Structural Breaks October 29-31, / 91. Bruce E.

Central Bank of Chile October 29-31, 2013 Bruce Hansen (University of Wisconsin) Structural Breaks October 29-31, / 91. Bruce E. Forecasting Lecture 3 Structural Breaks Central Bank of Chile October 29-31, 2013 Bruce Hansen (University of Wisconsin) Structural Breaks October 29-31, 2013 1 / 91 Bruce E. Hansen Organization Detection

More information

State-space Model. Eduardo Rossi University of Pavia. November Rossi State-space Model Fin. Econometrics / 53

State-space Model. Eduardo Rossi University of Pavia. November Rossi State-space Model Fin. Econometrics / 53 State-space Model Eduardo Rossi University of Pavia November 2014 Rossi State-space Model Fin. Econometrics - 2014 1 / 53 Outline 1 Motivation 2 Introduction 3 The Kalman filter 4 Forecast errors 5 State

More information

Non-nested model selection. in unstable environments

Non-nested model selection. in unstable environments Non-nested model selection in unstable environments Raffaella Giacomini UCLA (with Barbara Rossi, Duke) Motivation The problem: select between two competing models, based on how well they fit thedata Both

More information

VAR Models and Applications

VAR Models and Applications VAR Models and Applications Laurent Ferrara 1 1 University of Paris West M2 EIPMC Oct. 2016 Overview of the presentation 1. Vector Auto-Regressions Definition Estimation Testing 2. Impulse responses functions

More information

Understanding Regressions with Observations Collected at High Frequency over Long Span

Understanding Regressions with Observations Collected at High Frequency over Long Span Understanding Regressions with Observations Collected at High Frequency over Long Span Yoosoon Chang Department of Economics, Indiana University Joon Y. Park Department of Economics, Indiana University

More information

Econometría 2: Análisis de series de Tiempo

Econometría 2: Análisis de series de Tiempo Econometría 2: Análisis de series de Tiempo Karoll GOMEZ kgomezp@unal.edu.co http://karollgomez.wordpress.com Segundo semestre 2016 IX. Vector Time Series Models VARMA Models A. 1. Motivation: The vector

More information

Bootstrap Procedures for Recursive Estimation Schemes With Applications to Forecast Model Selection

Bootstrap Procedures for Recursive Estimation Schemes With Applications to Forecast Model Selection Bootstrap rocedures for Recursive Estimation Schemes With Applications to Forecast Model Selection Valentina Corradi and Norman R. Swanson 2 Queen Mary, University of London and 2 Rutgers University June

More information

Forecast comparison of principal component regression and principal covariate regression

Forecast comparison of principal component regression and principal covariate regression Forecast comparison of principal component regression and principal covariate regression Christiaan Heij, Patrick J.F. Groenen, Dick J. van Dijk Econometric Institute, Erasmus University Rotterdam Econometric

More information

Econometrics of Panel Data

Econometrics of Panel Data Econometrics of Panel Data Jakub Mućk Meeting # 6 Jakub Mućk Econometrics of Panel Data Meeting # 6 1 / 36 Outline 1 The First-Difference (FD) estimator 2 Dynamic panel data models 3 The Anderson and Hsiao

More information

Bagging and Forecasting in Nonlinear Dynamic Models

Bagging and Forecasting in Nonlinear Dynamic Models DBJ Discussion Paper Series, No.0905 Bagging and Forecasting in Nonlinear Dynamic Models Mari Sakudo (Research Institute of Capital Formation, Development Bank of Japan, and Department of Economics, Sophia

More information

Forecasting. A lecture on forecasting.

Forecasting. A lecture on forecasting. Forecasting A lecture on forecasting. Forecasting What is forecasting? The estabishment of a probability statement about the future value of an economic variable. Let x t be the variable of interest. Want:

More information

1 Outline. 1. Motivation. 2. SUR model. 3. Simultaneous equations. 4. Estimation

1 Outline. 1. Motivation. 2. SUR model. 3. Simultaneous equations. 4. Estimation 1 Outline. 1. Motivation 2. SUR model 3. Simultaneous equations 4. Estimation 2 Motivation. In this chapter, we will study simultaneous systems of econometric equations. Systems of simultaneous equations

More information

Title. Description. var intro Introduction to vector autoregressive models

Title. Description. var intro Introduction to vector autoregressive models Title var intro Introduction to vector autoregressive models Description Stata has a suite of commands for fitting, forecasting, interpreting, and performing inference on vector autoregressive (VAR) models

More information

11. Bootstrap Methods

11. Bootstrap Methods 11. Bootstrap Methods c A. Colin Cameron & Pravin K. Trivedi 2006 These transparencies were prepared in 20043. They can be used as an adjunct to Chapter 11 of our subsequent book Microeconometrics: Methods

More information

Lecture 6: Dynamic Models

Lecture 6: Dynamic Models Lecture 6: Dynamic Models R.G. Pierse 1 Introduction Up until now we have maintained the assumption that X values are fixed in repeated sampling (A4) In this lecture we look at dynamic models, where the

More information

Vector Auto-Regressive Models

Vector Auto-Regressive Models Vector Auto-Regressive Models Laurent Ferrara 1 1 University of Paris Nanterre M2 Oct. 2018 Overview of the presentation 1. Vector Auto-Regressions Definition Estimation Testing 2. Impulse responses functions

More information

The regression model with one stochastic regressor (part II)

The regression model with one stochastic regressor (part II) The regression model with one stochastic regressor (part II) 3150/4150 Lecture 7 Ragnar Nymoen 6 Feb 2012 We will finish Lecture topic 4: The regression model with stochastic regressor We will first look

More information

Econometrics of financial markets, -solutions to seminar 1. Problem 1

Econometrics of financial markets, -solutions to seminar 1. Problem 1 Econometrics of financial markets, -solutions to seminar 1. Problem 1 a) Estimate with OLS. For any regression y i α + βx i + u i for OLS to be unbiased we need cov (u i,x j )0 i, j. For the autoregressive

More information

ECON 616: Lecture 1: Time Series Basics

ECON 616: Lecture 1: Time Series Basics ECON 616: Lecture 1: Time Series Basics ED HERBST August 30, 2017 References Overview: Chapters 1-3 from Hamilton (1994). Technical Details: Chapters 2-3 from Brockwell and Davis (1987). Intuition: Chapters

More information

Notes on Time Series Modeling

Notes on Time Series Modeling Notes on Time Series Modeling Garey Ramey University of California, San Diego January 17 1 Stationary processes De nition A stochastic process is any set of random variables y t indexed by t T : fy t g

More information

An overview of applied econometrics

An overview of applied econometrics An overview of applied econometrics Jo Thori Lind September 4, 2011 1 Introduction This note is intended as a brief overview of what is necessary to read and understand journal articles with empirical

More information

ACE 564 Spring Lecture 8. Violations of Basic Assumptions I: Multicollinearity and Non-Sample Information. by Professor Scott H.

ACE 564 Spring Lecture 8. Violations of Basic Assumptions I: Multicollinearity and Non-Sample Information. by Professor Scott H. ACE 564 Spring 2006 Lecture 8 Violations of Basic Assumptions I: Multicollinearity and Non-Sample Information by Professor Scott H. Irwin Readings: Griffiths, Hill and Judge. "Collinear Economic Variables,

More information

Linear Models in Econometrics

Linear Models in Econometrics Linear Models in Econometrics Nicky Grant At the most fundamental level econometrics is the development of statistical techniques suited primarily to answering economic questions and testing economic theories.

More information

GARCH Models. Eduardo Rossi University of Pavia. December Rossi GARCH Financial Econometrics / 50

GARCH Models. Eduardo Rossi University of Pavia. December Rossi GARCH Financial Econometrics / 50 GARCH Models Eduardo Rossi University of Pavia December 013 Rossi GARCH Financial Econometrics - 013 1 / 50 Outline 1 Stylized Facts ARCH model: definition 3 GARCH model 4 EGARCH 5 Asymmetric Models 6

More information

Quick Review on Linear Multiple Regression

Quick Review on Linear Multiple Regression Quick Review on Linear Multiple Regression Mei-Yuan Chen Department of Finance National Chung Hsing University March 6, 2007 Introduction for Conditional Mean Modeling Suppose random variables Y, X 1,

More information

ECON 4160, Spring term Lecture 12

ECON 4160, Spring term Lecture 12 ECON 4160, Spring term 2013. Lecture 12 Non-stationarity and co-integration 2/2 Ragnar Nymoen Department of Economics 13 Nov 2013 1 / 53 Introduction I So far we have considered: Stationary VAR, with deterministic

More information

Statistical inference

Statistical inference Statistical inference Contents 1. Main definitions 2. Estimation 3. Testing L. Trapani MSc Induction - Statistical inference 1 1 Introduction: definition and preliminary theory In this chapter, we shall

More information

Testing for Regime Switching in Singaporean Business Cycles

Testing for Regime Switching in Singaporean Business Cycles Testing for Regime Switching in Singaporean Business Cycles Robert Breunig School of Economics Faculty of Economics and Commerce Australian National University and Alison Stegman Research School of Pacific

More information

EC212: Introduction to Econometrics Review Materials (Wooldridge, Appendix)

EC212: Introduction to Econometrics Review Materials (Wooldridge, Appendix) 1 EC212: Introduction to Econometrics Review Materials (Wooldridge, Appendix) Taisuke Otsu London School of Economics Summer 2018 A.1. Summation operator (Wooldridge, App. A.1) 2 3 Summation operator For

More information

Nonstationary Time Series:

Nonstationary Time Series: Nonstationary Time Series: Unit Roots Egon Zakrajšek Division of Monetary Affairs Federal Reserve Board Summer School in Financial Mathematics Faculty of Mathematics & Physics University of Ljubljana September

More information

Department of Economics, Vanderbilt University While it is known that pseudo-out-of-sample methods are not optimal for

Department of Economics, Vanderbilt University While it is known that pseudo-out-of-sample methods are not optimal for Comment Atsushi Inoue Department of Economics, Vanderbilt University (atsushi.inoue@vanderbilt.edu) While it is known that pseudo-out-of-sample methods are not optimal for comparing models, they are nevertheless

More information

Warwick Business School Forecasting System. Summary. Ana Galvao, Anthony Garratt and James Mitchell November, 2014

Warwick Business School Forecasting System. Summary. Ana Galvao, Anthony Garratt and James Mitchell November, 2014 Warwick Business School Forecasting System Summary Ana Galvao, Anthony Garratt and James Mitchell November, 21 The main objective of the Warwick Business School Forecasting System is to provide competitive

More information

Comparing Nested Predictive Regression Models with Persistent Predictors

Comparing Nested Predictive Regression Models with Persistent Predictors Comparing Nested Predictive Regression Models with Persistent Predictors Yan Ge y and ae-hwy Lee z November 29, 24 Abstract his paper is an extension of Clark and McCracken (CM 2, 25, 29) and Clark and

More information

MA/ST 810 Mathematical-Statistical Modeling and Analysis of Complex Systems

MA/ST 810 Mathematical-Statistical Modeling and Analysis of Complex Systems MA/ST 810 Mathematical-Statistical Modeling and Analysis of Complex Systems Principles of Statistical Inference Recap of statistical models Statistical inference (frequentist) Parametric vs. semiparametric

More information

Lecture 2: Univariate Time Series

Lecture 2: Univariate Time Series Lecture 2: Univariate Time Series Analysis: Conditional and Unconditional Densities, Stationarity, ARMA Processes Prof. Massimo Guidolin 20192 Financial Econometrics Spring/Winter 2017 Overview Motivation:

More information

Questions and Answers on Unit Roots, Cointegration, VARs and VECMs

Questions and Answers on Unit Roots, Cointegration, VARs and VECMs Questions and Answers on Unit Roots, Cointegration, VARs and VECMs L. Magee Winter, 2012 1. Let ɛ t, t = 1,..., T be a series of independent draws from a N[0,1] distribution. Let w t, t = 1,..., T, be

More information

G. S. Maddala Kajal Lahiri. WILEY A John Wiley and Sons, Ltd., Publication

G. S. Maddala Kajal Lahiri. WILEY A John Wiley and Sons, Ltd., Publication G. S. Maddala Kajal Lahiri WILEY A John Wiley and Sons, Ltd., Publication TEMT Foreword Preface to the Fourth Edition xvii xix Part I Introduction and the Linear Regression Model 1 CHAPTER 1 What is Econometrics?

More information

1 Introduction to Generalized Least Squares

1 Introduction to Generalized Least Squares ECONOMICS 7344, Spring 2017 Bent E. Sørensen April 12, 2017 1 Introduction to Generalized Least Squares Consider the model Y = Xβ + ɛ, where the N K matrix of regressors X is fixed, independent of the

More information

Forecasting the term structure interest rate of government bond yields

Forecasting the term structure interest rate of government bond yields Forecasting the term structure interest rate of government bond yields Bachelor Thesis Econometrics & Operational Research Joost van Esch (419617) Erasmus School of Economics, Erasmus University Rotterdam

More information

Lectures on Structural Change

Lectures on Structural Change Lectures on Structural Change Eric Zivot Department of Economics, University of Washington April5,2003 1 Overview of Testing for and Estimating Structural Change in Econometric Models 1. Day 1: Tests of

More information

Week 5 Quantitative Analysis of Financial Markets Characterizing Cycles

Week 5 Quantitative Analysis of Financial Markets Characterizing Cycles Week 5 Quantitative Analysis of Financial Markets Characterizing Cycles Christopher Ting http://www.mysmu.edu/faculty/christophert/ Christopher Ting : christopherting@smu.edu.sg : 6828 0364 : LKCSB 5036

More information

Lecture 7 Introduction to Statistical Decision Theory

Lecture 7 Introduction to Statistical Decision Theory Lecture 7 Introduction to Statistical Decision Theory I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 20, 2016 1 / 55 I-Hsiang Wang IT Lecture 7

More information

Combining Macroeconomic Models for Prediction

Combining Macroeconomic Models for Prediction Combining Macroeconomic Models for Prediction John Geweke University of Technology Sydney 15th Australasian Macro Workshop April 8, 2010 Outline 1 Optimal prediction pools 2 Models and data 3 Optimal pools

More information

Are Forecast Updates Progressive?

Are Forecast Updates Progressive? MPRA Munich Personal RePEc Archive Are Forecast Updates Progressive? Chia-Lin Chang and Philip Hans Franses and Michael McAleer National Chung Hsing University, Erasmus University Rotterdam, Erasmus University

More information

Asymptotic inference for a nonstationary double ar(1) model

Asymptotic inference for a nonstationary double ar(1) model Asymptotic inference for a nonstationary double ar() model By SHIQING LING and DONG LI Department of Mathematics, Hong Kong University of Science and Technology, Hong Kong maling@ust.hk malidong@ust.hk

More information

Time Series Analysis. James D. Hamilton PRINCETON UNIVERSITY PRESS PRINCETON, NEW JERSEY

Time Series Analysis. James D. Hamilton PRINCETON UNIVERSITY PRESS PRINCETON, NEW JERSEY Time Series Analysis James D. Hamilton PRINCETON UNIVERSITY PRESS PRINCETON, NEW JERSEY PREFACE xiii 1 Difference Equations 1.1. First-Order Difference Equations 1 1.2. pth-order Difference Equations 7

More information

Reality Checks and Nested Forecast Model Comparisons

Reality Checks and Nested Forecast Model Comparisons Reality Checks and Nested Forecast Model Comparisons Todd E. Clark Federal Reserve Bank of Kansas City Michael W. McCracken Board of Governors of the Federal Reserve System October 2006 (preliminary and

More information

7. Forecasting with ARIMA models

7. Forecasting with ARIMA models 7. Forecasting with ARIMA models 309 Outline: Introduction The prediction equation of an ARIMA model Interpreting the predictions Variance of the predictions Forecast updating Measuring predictability

More information

Research Division Federal Reserve Bank of St. Louis Working Paper Series

Research Division Federal Reserve Bank of St. Louis Working Paper Series Research Division Federal Reserve Bank of St. Louis Working Paper Series Evaluating the Accuracy of Forecasts from Vector Autoregressions Todd E. Clark and Michael W. McCracken Working Paper 2013-010A

More information

Econometrics I. Professor William Greene Stern School of Business Department of Economics 25-1/25. Part 25: Time Series

Econometrics I. Professor William Greene Stern School of Business Department of Economics 25-1/25. Part 25: Time Series Econometrics I Professor William Greene Stern School of Business Department of Economics 25-1/25 Econometrics I Part 25 Time Series 25-2/25 Modeling an Economic Time Series Observed y 0, y 1,, y t, What

More information

Principles of forecasting

Principles of forecasting 2.5 Forecasting Principles of forecasting Forecast based on conditional expectations Suppose we are interested in forecasting the value of y t+1 based on a set of variables X t (m 1 vector). Let y t+1

More information

Augmenting our AR(4) Model of Inflation. The Autoregressive Distributed Lag (ADL) Model

Augmenting our AR(4) Model of Inflation. The Autoregressive Distributed Lag (ADL) Model Augmenting our AR(4) Model of Inflation Adding lagged unemployment to our model of inflationary change, we get: Inf t =1.28 (0.31) Inf t 1 (0.39) Inf t 2 +(0.09) Inf t 3 (0.53) (0.09) (0.09) (0.08) (0.08)

More information

A Forecast Rationality Test that Allows for Loss Function Asymmetries

A Forecast Rationality Test that Allows for Loss Function Asymmetries A Forecast Rationality Test that Allows for Loss Function Asymmetries Andrea A. Naghi University of Warwick This Version: March 2015 Abstract In this paper, we propose a new forecast rationality test that

More information

EVALUATING DIRECT MULTI-STEP FORECASTS

EVALUATING DIRECT MULTI-STEP FORECASTS EVALUATING DIRECT MULTI-STEP FORECASTS Todd Clark and Michael McCracken Revised: April 2005 (First Version December 2001) RWP 01-14 Research Division Federal Reserve Bank of Kansas City Todd E. Clark is

More information

An estimate of the long-run covariance matrix, Ω, is necessary to calculate asymptotic

An estimate of the long-run covariance matrix, Ω, is necessary to calculate asymptotic Chapter 6 ESTIMATION OF THE LONG-RUN COVARIANCE MATRIX An estimate of the long-run covariance matrix, Ω, is necessary to calculate asymptotic standard errors for the OLS and linear IV estimators presented

More information

1 Teaching notes on structural VARs.

1 Teaching notes on structural VARs. Bent E. Sørensen February 22, 2007 1 Teaching notes on structural VARs. 1.1 Vector MA models: 1.1.1 Probability theory The simplest (to analyze, estimation is a different matter) time series models are

More information

Wooldridge, Introductory Econometrics, 4th ed. Appendix C: Fundamentals of mathematical statistics

Wooldridge, Introductory Econometrics, 4th ed. Appendix C: Fundamentals of mathematical statistics Wooldridge, Introductory Econometrics, 4th ed. Appendix C: Fundamentals of mathematical statistics A short review of the principles of mathematical statistics (or, what you should have learned in EC 151).

More information

Ch.10 Autocorrelated Disturbances (June 15, 2016)

Ch.10 Autocorrelated Disturbances (June 15, 2016) Ch10 Autocorrelated Disturbances (June 15, 2016) In a time-series linear regression model setting, Y t = x tβ + u t, t = 1, 2,, T, (10-1) a common problem is autocorrelation, or serial correlation of the

More information

11. Further Issues in Using OLS with TS Data

11. Further Issues in Using OLS with TS Data 11. Further Issues in Using OLS with TS Data With TS, including lags of the dependent variable often allow us to fit much better the variation in y Exact distribution theory is rarely available in TS applications,

More information

Do Markov-Switching Models Capture Nonlinearities in the Data? Tests using Nonparametric Methods

Do Markov-Switching Models Capture Nonlinearities in the Data? Tests using Nonparametric Methods Do Markov-Switching Models Capture Nonlinearities in the Data? Tests using Nonparametric Methods Robert V. Breunig Centre for Economic Policy Research, Research School of Social Sciences and School of

More information

Econometric Forecasting

Econometric Forecasting Robert M. Kunst robert.kunst@univie.ac.at University of Vienna and Institute for Advanced Studies Vienna October 1, 2014 Outline Introduction Model-free extrapolation Univariate time-series models Trend

More information

Econometrics. Week 11. Fall Institute of Economic Studies Faculty of Social Sciences Charles University in Prague

Econometrics. Week 11. Fall Institute of Economic Studies Faculty of Social Sciences Charles University in Prague Econometrics Week 11 Institute of Economic Studies Faculty of Social Sciences Charles University in Prague Fall 2012 1 / 30 Recommended Reading For the today Advanced Time Series Topics Selected topics

More information

ECO Class 6 Nonparametric Econometrics

ECO Class 6 Nonparametric Econometrics ECO 523 - Class 6 Nonparametric Econometrics Carolina Caetano Contents 1 Nonparametric instrumental variable regression 1 2 Nonparametric Estimation of Average Treatment Effects 3 2.1 Asymptotic results................................

More information

Multivariate Time Series: VAR(p) Processes and Models

Multivariate Time Series: VAR(p) Processes and Models Multivariate Time Series: VAR(p) Processes and Models A VAR(p) model, for p > 0 is X t = φ 0 + Φ 1 X t 1 + + Φ p X t p + A t, where X t, φ 0, and X t i are k-vectors, Φ 1,..., Φ p are k k matrices, with

More information