
Lecture Notes: Prediction and Simulation Based Specification Testing and Model Selection

Copyright Valentina Corradi and Norman Rasmus Swanson. Contact: nswanson@econ.rutgers.edu

Course Notes Outline - Professor Norman Rasmus Swanson, Rutgers University (nswanson@econ.rutgers.edu)

Predictive and Simulation Based Specification Testing and Model Selection

Part I. Prediction Basics
(i) Introduction
(ii) Optimal Prediction

Part II. Parameter Estimation Error and Bootstrap Techniques
(i) Parameter Estimation Error
(ii) Bootstrap Techniques for Critical Value Construction

Part III. Linear and Nonlinear Predictive Accuracy Testing with Nested and Nonnested Models
(i) Granger Causality
(ii) Comparing the Predictive Accuracy of Two Nested Models

Part IV. Multiple Models, Simulated Data, and In- and Out-of-Sample Specification and Predictive Accuracy Testing
(i) Predictive Accuracy Tests with Recursive Estimation
(ii) Correct In-Sample Specification Testing Using CK Type Tests
(iii) Comparing Discrete Time Models Using Simulated Distributions

Part V. Density Forecast Evaluation
(i) The Kullback-Leibler Information Criterion Approach
(ii) A Predictive Density Accuracy Test for Comparing Multiple Misspecified Models
(iii) Predictive Density Testing for Continuous Time Models
(iv) Testing Using Continuous Time Finance Models

References: All references are listed at the end of the lecture notes.

1 Part I - Prediction Basics

1.1 Introduction

Prediction is an important area of economics. Examples where prediction is used include policy setting at the government level, production and inventory accumulation decisions at the firm level, and investment and asset allocation decisions at the individual level.

Consider the following example. From an economic policy perspective, one of the main uses of econometric and statistical methods is to provide forecasts of macroeconomic and financial variables. For instance: given the rate of inflation over the past twelve months, what will be the rate of inflation next month? What will be the rate two months from now? These predictions have important consequences for the formulation of economic policy (e.g. setting the bank lending rate, etc.).

Suppose that the Federal Reserve Board forecasts a 1.5% annualized inflation rate for July 2008, while the Department of the Treasury provides a forecast of 1%. How can we decide which of the two is more reliable? One key question thus concerns how we can measure the relative accuracy of different forecasts. Different models yield different forecasts, so we want to choose the model producing the most accurate one.

Many econometric techniques deal with the in-sample evaluation of models, and only recently has attention focused on out-of-sample model evaluation. A key difference between the two approaches is that in-sample methods tend to select models that are too large.

* Overfitting is a problem.
* In-sample inference is another.

Some of the issues arising are:

(i) Choice of the loss function. Suppose $X_{t+1}$ is the rate of inflation at time $t+1$, and $X^i_{t+1/t}$ is the rate of inflation forecasted at time $t$ using model $i$.

The forecast error implied by model $i$ is $u_{i,t+1} = X_{t+1} - X^i_{t+1/t}$. We want to choose model $j$ over model $i$ if, on average, model $j$ produces smaller errors. Smaller in which sense?

* Quadratic loss function: choose model $j$ if, on average, $u^2_{j,t+1} < u^2_{i,t+1}$.
* Mean absolute loss function: choose model $j$ if, on average, $|u_{j,t+1}| < |u_{i,t+1}|$.
* Other sorts of loss functions? Direction of change, profitability, etc.
* Sometimes we are more concerned about positive errors than negative errors, or vice versa, so we may want to use an asymmetric loss function, such as the linex (linear exponential) loss.

(ii) Is it possible that for some loss function model $j$ beats model $i$, while for another loss function model $i$ beats model $j$? In general: yes (a small simulation after point (iii) below illustrates this). If we choose the right model, in the sense that we correctly specify the joint distribution of all of the relevant variables, then no other model can win. On the other hand, if the models we compare are misspecified, then the ranking of models is loss function specific (i.e. we would like to assume that all models are approximations to the truth - any other assumption seems overly strong).

(iii) What is the effect of parameter estimation error (i.e. of estimating the models used in prediction)? Suppose we forecast inflation at $t+1$ simply using inflation at time $t$, and we use a simple linear model. Say:
$$X_{t+1} = \beta_0 + \beta_1 X_t + u_{t+1}.$$
The true forecast error is $u_{t+1} = X_{t+1} - \beta_0 - \beta_1 X_t$. However, we have to replace the unknown parameters with estimates, say $b_0$ and $b_1$. Thus the estimated forecast error becomes
$$\hat u_{t+1} = X_{t+1} - b_0 - b_1 X_t = u_{t+1} - (b_0 - \beta_0) - (b_1 - \beta_1)X_t.$$
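Returning to point (ii): the following minimal simulation sketch (not from the notes; the two error distributions are hypothetical) shows how the ranking of two misspecified models can flip between quadratic and absolute loss.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 2000
# Model i: mostly small errors, occasionally very large ones (a mixture).
big = rng.random(T) < 0.1
u_i = np.where(big, rng.normal(scale=5.0, size=T), rng.normal(scale=0.3, size=T))
# Model j: moderate errors throughout.
u_j = rng.normal(scale=1.0, size=T)

print("quadratic loss:", np.mean(u_i**2).round(3), "vs", np.mean(u_j**2).round(3))
print("absolute  loss:", np.mean(np.abs(u_i)).round(3), "vs", np.mean(np.abs(u_j)).round(3))
# Model j wins under quadratic loss, while model i wins under absolute loss:
# with misspecified models, the ranking is loss function specific.
```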

(iv) Choice of forecast horizon. Given information up to time $t$, do we want to forecast inflation at $t+1$, $t+2$, ..., $t+k$? Again, unless we have the right model, model $i$ can beat model $j$ for given forecast horizons, but model $j$ can beat model $i$ for different forecast horizons.

1.2 Optimal Prediction

In this subsection we discuss optimal prediction in the context of various loss functions.

(i) Quadratic loss functions

Consider a time series $y_t$, $t = 1, 2, \ldots, T$. Suppose we want to find the optimal predictor of $y_t$, $h$ steps ahead, using the information available at time $t$. Let $\mathcal{F}_t = \sigma(y_1, \ldots, y_t, X_1, \ldots, X_t)$, where $X_t$ is a (possibly vector valued) other series that may help to predict $y_t$. The optimal $h$-step ahead predictor for $y_{t+h}$ given $\mathcal{F}_t$ is the function $\hat y_{t+h/t}$ such that
$$E(y_{t+h} - \hat y_{t+h/t})^2 < E(y_{t+h} - \tilde y_{t+h/t})^2,$$
for any $\tilde y_{t+h/t} \neq \hat y_{t+h/t}$. We know that
$$\hat y_{t+h/t} = E(y_{t+h}|\mathcal{F}_t),$$
i.e. the best predictor is the conditional expectation of $y_{t+h}$ given $\mathcal{F}_t$. In fact, suppose that $\tilde y_{t+h/t}$ is an $\mathcal{F}_t$-measurable function (e.g. any continuous function of $y_1, \ldots, y_t, X_1, \ldots, X_t$ is $\mathcal{F}_t$-measurable); then
$$E(y_{t+h} - \tilde y_{t+h/t})^2 = E\left((y_{t+h} - E(y_{t+h}|\mathcal{F}_t)) - (\tilde y_{t+h/t} - E(y_{t+h}|\mathcal{F}_t))\right)^2$$

$$= E(y_{t+h} - E(y_{t+h}|\mathcal{F}_t))^2 + E(\tilde y_{t+h/t} - E(y_{t+h}|\mathcal{F}_t))^2 > E(y_{t+h} - E(y_{t+h}|\mathcal{F}_t))^2,$$
as $E\left((y_{t+h} - E(y_{t+h}|\mathcal{F}_t))(\tilde y_{t+h/t} - E(y_{t+h}|\mathcal{F}_t))\right) = 0$. Thus, if we want to minimize the squared error, the conditional expectation is the best predictor.

Prediction with linear models

Suppose that $y_{t+1} = \alpha y_t + \epsilon_{t+1}$, where $\epsilon_t$ is a white noise process with zero mean and variance $\sigma^2_\epsilon$ (i.e. consider an AR(1) process). Then the best one-step predictor is
$$E(y_{t+1}|y_t) = \alpha y_t,$$
and the best $h$-step predictor is
$$E(y_{t+h}|y_t) = \alpha^h y_t;$$
correspondingly, the one-step ahead prediction error is $u_{t+1} = y_{t+1} - \alpha y_t = \epsilon_{t+1}$, and
$$u_{t+h} = y_{t+h} - \alpha^h y_t = \epsilon_{t+h} + \alpha\epsilon_{t+h-1} + \ldots + \alpha^{h-2}\epsilon_{t+2} + \alpha^{h-1}\epsilon_{t+1}.$$
AR(p) processes can be treated in an analogous way.
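A minimal sketch of the AR(1) case just described, assuming an illustrative value of $\alpha$ and unit innovation variance; the $h$-step predictor is $\alpha^h y_t$, and the $h$-step error variance is $\sigma^2_\epsilon\sum_{j=0}^{h-1}\alpha^{2j}$.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, T, h = 0.7, 200, 4           # illustrative parameter, sample size, horizon
y = np.zeros(T)
for t in range(1, T):
    y[t] = alpha * y[t - 1] + rng.normal()

y_hat = alpha ** h * y[-1]          # best h-step predictor under quadratic loss
err_var = sum(alpha ** (2 * j) for j in range(h))   # h-step error variance (sigma_eps = 1)
print(f"{h}-step forecast: {y_hat:.3f}, error variance: {err_var:.3f}")
```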

Prediction with nonlinear models

Suppose that
$$y_{t+1} = a\,g(y_t) + \epsilon_{t+1},$$
where, say,
$$g(y_t) = y_t\,\frac{1}{1 + \exp(-y_t)} \quad \text{or} \quad g(y_t) = y_t^2.$$
Now, $E(y_{t+1}|y_t) = a\,g(y_t)$. Note further that:
$$y_{t+2} = a\,g(y_{t+1}) + \epsilon_{t+2} = a\,g(a\,g(y_t) + \epsilon_{t+1}) + \epsilon_{t+2}.$$
Because of the $\epsilon_{t+1}$ term entering into the nonlinear function $g$, it is not immediate how to get the two-step ahead prediction error. In this case we can approximate $E(y_{t+2}|y_t)$ with $a\,g(a\,g(y_t)) = a\,g(E(y_{t+1}|y_t))$.

Broadly speaking, in order to get the $h$-step ahead forecast, we begin by taking the one-step ahead forecast, of which we know the closed form expression; we then predict one period ahead again, replacing $y_{t+1}$ (which is not observable) with $E(y_{t+1}|y_t)$, that is, replacing it with its predicted value given the information at time $t$. We then proceed in the next steps in the same manner.

So far we have considered cases in which $y_t$ depends only on its own past. Consider now the following model:
$$y_{t+1} = \beta_0 + \beta_1 X_t + \epsilon_{t+1},$$
so that $E(y_{t+1}|X_t) = \beta_0 + \beta_1 X_t$. In order to compute $h$-step ahead forecasts, for $h > 1$, we need to know the data generating process of $X_t$. In this case we approximate $X_{t+1}$ with $E(X_{t+1}|X_t)$; that is, we use:
$$E(y_{t+2}|X_t) = \beta_0 + \beta_1 E(X_{t+1}|X_t).$$
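A minimal sketch of the iterated (plug-in) approximation just described, using the logistic-weighted $g$ from the text and a hypothetical value of $a$:

```python
import numpy as np

def g(y):
    return y * 1.0 / (1.0 + np.exp(-y))   # the first g(.) from the text

a, y_t = 0.8, 1.5                  # hypothetical parameter and current value
one_step = a * g(y_t)              # E(y_{t+1}|y_t), known in closed form
two_step = a * g(one_step)         # plug-in approximation to E(y_{t+2}|y_t)
print(one_step, two_step)
# a*g(E(y_{t+1}|y_t)) differs from E(a*g(y_{t+1})|y_t) unless g is linear,
# which is why the two-step forecast is only an approximation.
```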

(ii) Asymmetric loss functions

We have seen that in the case of quadratic loss the best predictor is the conditional mean. In this case the problem of selecting the optimal forecast is equivalent to the problem of correctly specifying the conditional mean. However, there are several instances in which we are more concerned about positive errors ($y_{t+h} - \hat y_{t+h/t} > 0$) than about negative errors ($y_{t+h} - \hat y_{t+h/t} < 0$). Needless to say, arriving at the airport 15 minutes too late is more costly than arriving 15 minutes too early. In this case, then, we want to more heavily penalize positive errors.

Two well known asymmetric loss functions are linex loss (linear exponential loss) and lin-lin (linear-linear) loss. If we use linex loss, then we want to find the predictor $\hat y_{t+h/t}$ such that, for $a \neq 0$,
$$E\left(\exp(a(y_{t+h} - \hat y_{t+h/t})) - a(y_{t+h} - \hat y_{t+h/t}) - 1\right) < E\left(\exp(a(y_{t+h} - \tilde y_{t+h/t})) - a(y_{t+h} - \tilde y_{t+h/t}) - 1\right),$$
for any $\tilde y_{t+h/t} \neq \hat y_{t+h/t}$. Note that for $a > 0$ the loss is approximately linear to the left of the origin, while it is exponential to the right of the origin, and vice versa for $a < 0$. Thus, for $a > 0$, positive errors are considered more costly than negative errors. Christoffersen and Diebold (1997) show that in this case the best predictor is
$$\hat y_{t+h/t} = E(y_{t+h}|\mathcal{F}_t) + \frac{a}{2}Var(y_{t+h}|\mathcal{F}_t),$$
where $Var(y_{t+h}|\mathcal{F}_t)$ is the variance of $y_{t+h}$ conditional on the information available at time $t$. This formula is valid when $y_{t+h}|\mathcal{F}_t \sim N(E(y_{t+h}|\mathcal{F}_t), Var(y_{t+h}|\mathcal{F}_t))$.

Note that for $a > 0$ (more weight on positive errors) the optimal predictor is larger than the optimal MSE predictor. In fact, as we are more concerned about positive errors, we purposely prefer an overestimate: while $E(y_{t+h}|\mathcal{F}_t)$ is an unbiased predictor of $y_{t+h}$ given the information available at time $t$, for $a > 0$,
$$E(y_{t+h}|\mathcal{F}_t) + \frac{a}{2}Var(y_{t+h}|\mathcal{F}_t)$$
is an upwardly biased predictor. In this case, knowledge of the optimal predictor requires knowledge of the joint specification of the conditional mean and variance.

Another asymmetric loss is lin-lin loss. If we use a lin-lin loss, then we want to find the predictor $\hat y_{t+h/t}$ such that
$$E\left(a|y_{t+h} - \hat y_{t+h/t}|\,1\{y_{t+h} > \hat y_{t+h/t}\} + b|y_{t+h} - \hat y_{t+h/t}|\,1\{y_{t+h} \leq \hat y_{t+h/t}\}\right)$$
$$< E\left(a|y_{t+h} - \tilde y_{t+h/t}|\,1\{y_{t+h} > \tilde y_{t+h/t}\} + b|y_{t+h} - \tilde y_{t+h/t}|\,1\{y_{t+h} \leq \tilde y_{t+h/t}\}\right)$$

for any $\tilde y_{t+h/t} \neq \hat y_{t+h/t}$ and $a > 0$, $b > 0$. This loss function increases linearly in the error, but for $a > b$ it more heavily penalizes positive errors. If $y_{t+h}|\mathcal{F}_t$ is $N(E(y_{t+h}|\mathcal{F}_t), Var(y_{t+h}|\mathcal{F}_t))$, then Christoffersen and Diebold show that the optimal predictor under lin-lin loss is given by
$$\hat y_{t+h/t} = E(y_{t+h}|\mathcal{F}_t) + Var(y_{t+h}|\mathcal{F}_t)^{1/2}\,\Phi^{-1}(a/(a+b)),$$
where $\Phi$ denotes the CDF of a standard normal. If $a > b$, then $\Phi^{-1}(a/(a+b)) > 0$ and so the optimal predictor is upwardly biased.

Example - GARCH(1,1)

Consider the following GARCH(1,1) model, with $\omega_1 + \omega_2 < 1$, $\omega_0 > 0$, $\omega_1 > 0$, $\omega_2 > 0$:
$$y_t = \sigma_t\epsilon_t, \quad \epsilon_t \; iid\; N(0,1), \qquad \sigma^2_t = \omega_0 + \omega_1\sigma^2_{t-1} + \omega_2 y^2_{t-1}.$$
Now note that
$$\sigma^2_t = \omega_1^t\sigma^2_0 + \omega_0\sum_{j=0}^{t-1}\omega_1^j + \omega_2\sum_{j=0}^{t-1}\omega_1^j y^2_{t-1-j},$$
and so $\sigma^2_t$ is a measurable function of the past squared returns. Thus the relevant information set is $\mathcal{F}_{t-1} = \sigma(y_1, \ldots, y_{t-1})$. Now, while
$$E(y_t|\mathcal{F}_{t-1}) = E(\sigma_t\epsilon_t|\mathcal{F}_{t-1}) = \sigma_t E(\epsilon_t|\mathcal{F}_{t-1}) = 0,$$
$$E(y^2_t|\mathcal{F}_{t-1}) = Var(y_t|\mathcal{F}_{t-1}) = E(\sigma^2_t\epsilon^2_t|\mathcal{F}_{t-1}) = \sigma^2_t E(\epsilon^2_t|\mathcal{F}_{t-1}) = \sigma^2_t.$$
Hence, if the loss function is quadratic, the optimal predictor is $E(y_t|\mathcal{F}_{t-1}) = 0$. If instead the loss function is linex with $a = 1$, the best predictor is $E(y_t|\mathcal{F}_{t-1}) + 0.5\,Var(y_t|\mathcal{F}_{t-1}) = 0.5\sigma^2_t$. Finally, if the loss is lin-lin with parameters $a = 1$ and $b = 2$, the optimal predictor is $\sigma_t\Phi^{-1}(1/3) \approx -0.43\sigma_t$.
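A small sketch computing the three optimal predictors of the GARCH example for a given (hypothetical) value of $\sigma_t$, under quadratic, linex ($a = 1$) and lin-lin ($a = 1$, $b = 2$) loss; the formulas follow Christoffersen and Diebold as stated above.

```python
from scipy.stats import norm

sigma_t = 2.0                           # conditional std. dev., illustrative value
mse_pred = 0.0                          # quadratic loss: the conditional mean
linex_pred = 0.5 * sigma_t**2           # linex with a = 1: mean + (a/2)*variance
linlin_pred = sigma_t * norm.ppf(1/3)   # lin-lin with a = 1, b = 2: the 1/3 quantile
print(mse_pred, linex_pred, linlin_pred)   # 0.0, 2.0, about -0.86
```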

(iii) Comparing possibly misspecified forecasting models

So far we have considered the issue of optimal prediction for given loss functions. In practice, the true data generating process (DGP) is unknown, and so we form optimal predictions for given models, which may be dynamically misspecified.

For example, suppose that we believe that $y_t$ follows an AR(1) process, so that the optimal $h$-step ahead predictor is $\alpha^h y_t$. However, if for example the DGP is a SETAR (self-exciting threshold autoregressive) process, say
$$y_t = \alpha_1 y_{t-1} + \alpha_2 y_{t-1}1\{y_{t-1} > \tau\} + \epsilon_t,$$
then the optimal forecast under the AR(1) assumption is clearly not optimal at all.

Furthermore, in practice we need to define the relevant information set. Again, suppose that $y_t$ is an AR(2) process, but we are just considering an AR(1) model. Then $\alpha^h y_t$ is indeed the optimal predictor under quadratic loss for the information set $\mathcal{F}_t = \sigma(y_t)$; but we are neglecting the information contained in $y_{t-1}$. In this case $E(y_{t+h}|y_t)$ is indeed correctly specified, but
$$E(y_{t+h}|y_t) \neq E(y_{t+h}|y_t, y_{t-1}),$$
so there is dynamic misspecification.

Finally, the $h$-step ahead prediction error is heteroskedastic and autocorrelated and, in the case of linex loss for example, failing to take this into consideration would lead to a non-optimal forecast.

Thus, in practice we want to be able to compare the relative predictive ability of two (or more) possibly misspecified models. Note that the ranking of the models, in the misspecified case, is loss function specific. On the other hand, if we correctly specify all conditional aspects, then the right model will beat all competitors, regardless of the loss function choice.

Diebold and Mariano (1995) propose a test for the null hypothesis of equal predictive ability, against the alternative of unequal predictive ability. For the time being, we neglect the issue of parameter estimation error. Let $u_{0,t+h}$ and $u_{1,t+h}$ be the $h$-step ahead prediction errors, when predicting $y_{t+h}$ using information available up to time $t$. For example, for $h = 1$,
$$u_{0,t+1} = y_{t+1} - \beta_{01} - \beta_{02}y_t - \beta_{03}x_t$$
and
$$u_{1,t+1} = y_{t+1} - \beta_{11} - \beta_{12}y_t - \beta_{13}z_t.$$
It is important that the two models we are comparing be nonnested (i.e. neither is a special case of the other). We shall later see why this is important.

Under the assumption that $u_{0,t}$ and $u_{1,t}$ are strictly stationary, the hypotheses for this test of equal predictive accuracy are:
$$H_0: E(f(u_{0,t}) - f(u_{1,t})) = 0$$
and
$$H_A: E(f(u_{0,t}) - f(u_{1,t})) \neq 0,$$
where $f$ is some continuous, positive valued loss function. The relevant statistic is
$$DM_T = \frac{T^{-1/2}\sum_{t=1}^{T-1}(f(u_{0,t+1}) - f(u_{1,t+1}))}{\hat\sigma},$$
where $\hat\sigma^2$ is a consistent estimator of
$$\lim_{T\to\infty}Var\left(T^{-1/2}\sum_{t=1}^{T-1}(f(u_{0,t+1}) - f(u_{1,t+1}))\right).$$
Note why we require non-nestedness. Suppose that model 1 is nested in model 0, e.g.
$$u_{0,t+1} = y_{t+1} - \beta_{01} - \beta_{02}y_t - \beta_{03}x_t \quad \text{and} \quad u_{1,t+1} = y_{t+1} - \beta_{11} - \beta_{12}y_t.$$

If model 0 is indeed correctly dynamically specified for the conditional mean, then the null is equivalent to $\beta_{01} = \beta_{11}$, $\beta_{02} = \beta_{12}$, and $\beta_{03} = 0$, and so under the null $u_{0,t} = u_{1,t}$ for all $t$. In practice, as we shall see below, we do not observe $u_{0,t}$ and $u_{1,t}$; we only observe $\hat u_{0,t}$ and $\hat u_{1,t}$, which depend on estimated parameters. But we still have that $\hat\sigma$ and $T^{-1/2}\sum_{t=1}^{T-1}(f(u_{0,t+1}) - f(u_{1,t+1}))$ go to zero in probability, and the statistic no longer has a well defined limiting distribution.

As we allow for dynamic misspecification under both hypotheses, in general $u_{0,t}$ and $u_{1,t}$ are not martingale difference sequences (i.e. $E(u_{0,t}|\mathcal{F}_{t-1}) \neq 0$) and are in general autocorrelated. Thus, we need to use a heteroskedasticity and autocorrelation robust (HAC) estimator for the long run variance. We can use a Newey-West (1987) type estimator, for example. Namely, define
$$\hat\sigma^2 = \frac{1}{T}\sum_{t=1}^{T-1}(d_{t+1} - \bar d)^2 + \frac{2}{T}\sum_{\tau=1}^{l}w_\tau\sum_{t=\tau+1}^{T-1}(d_{t+1} - \bar d)(d_{t+1-\tau} - \bar d),$$
where
$$d_{t+1} = f(u_{0,t+1}) - f(u_{1,t+1}), \quad \bar d = \frac{1}{T}\sum_{t=1}^{T-1}d_{t+1}, \quad \text{and} \quad w_\tau = 1 - \frac{\tau}{l+1}.$$

The following is needed in the sequel.

Assumption A: (i) $(u_{0,t}, u_{1,t})$ is a strictly stationary and strong mixing process with size $-2r/(r-1)$, with $r > 1$; (ii) $E(f(u_{i,t})^4) < \infty$, $i = 0, 1$.

ASIDE: Broadly speaking, $u_{0,t}$ is a strong mixing process if it is asymptotically independent, i.e. if $u_{0,t}$ is (asymptotically) independent of its infinite past. More formally, define $\mathcal{F}^n = \sigma(\ldots, u_{0,n-1}, u_{0,n})$ to be the information set generated by the history of the series up to time $n$, and analogously $\mathcal{F}_{n+m} = \sigma(u_{0,n+m}, u_{0,n+m+1}, \ldots)$ is the information set generated by the history of the series from time $n+m$ up to infinity. More precisely, if $u_{0,t}$ is a strong mixing process, then for any $u_{0,t} \in \mathcal{F}_{n+m}$, $E(u_{0,t} - E(u_{0,t}|\mathcal{F}^n))^2$

goes to zero as $m \to \infty$. The size has to do with the rate at which this quantity goes to zero as $m \to \infty$.

Proposition 1: Let Assumption A hold. Then, as $T \to \infty$, $l \to \infty$ and $l/T^{1/4} \to 0$: under $H_0$,
$$DM_T \stackrel{d}{\to} N(0,1),$$
and under $H_A$, for any $\varepsilon > 0$,
$$\Pr\left(T^{-1/2}|DM_T| > \varepsilon\right) \to 1.$$
Thus, we compare $DM_T$ with the critical values of a standard normal random variable. Suppose that we do not reject $H_0$ if $-1.96 \leq DM_T \leq 1.96$, and otherwise we reject $H_0$. This gives a test with asymptotic size equal to 0.05 and unit asymptotic power. Note that the same result holds for a generic forecast horizon, i.e. for $h > 1$.

Proof - Sketch: Under both hypotheses,
$$\hat\sigma^2 \stackrel{pr}{\to} \sigma^2_0 = \lim_{T\to\infty}Var\left(T^{-1/2}\sum_{t=1}^{T-1}d_{t+1}\right).$$
Also, by the central limit theorem for mixing processes,
$$T^{-1/2}\sum_{t=1}^{T-1}(d_{t+1} - E(d_t)) \stackrel{d}{\to} N(0, \sigma^2_0).$$
Thus, when $E(d_t) = 0$, i.e. when the null is true, $T^{-1/2}\sum_{t=1}^{T-1}d_{t+1} \stackrel{d}{\to} N(0, \sigma^2_0)$, while when $E(d_t) \neq 0$, i.e. under the alternative, $T^{-1/2}\sum_{t=1}^{T-1}d_{t+1}$ diverges at rate $T^{1/2}$.

Note that many applied practitioners do not even implement the simple DM test, instead relying on point estimates of mean square errors and related statistics when comparing alternative prediction models.
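For concreteness, here is a minimal sketch of the DM test with a Newey-West variance estimator; parameter estimation error is ignored, as in this subsection, and the simulated error series and the lag truncation rule are illustrative choices.

```python
import numpy as np
from scipy.stats import norm

def dm_test(u0, u1, loss=np.square, l=None):
    d = loss(u0) - loss(u1)                 # loss differential d_t
    T = d.shape[0]
    l = l or int(T ** 0.25)                 # lag truncation, consistent with l/T^{1/4} -> 0
    dbar = d.mean()
    dd = d - dbar
    s2 = dd @ dd / T                        # tau = 0 term
    for tau in range(1, l + 1):
        w = 1.0 - tau / (l + 1.0)           # Bartlett weights
        s2 += 2.0 * w * (dd[tau:] @ dd[:-tau]) / T
    stat = np.sqrt(T) * dbar / np.sqrt(s2)
    return stat, 2 * (1 - norm.cdf(abs(stat)))   # two-sided normal p-value

rng = np.random.default_rng(2)
u0, u1 = rng.normal(size=500), rng.normal(size=500)
print(dm_test(u0, u1))   # equal accuracy by construction: should not reject
```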

2 Part II - Parameter Estimation Error and Bootstrap Techniques

2.1 Parameter Estimation Error

In practice, we do not observe the true forecast error. For simplicity, consider
$$u_{0,t+1} = y_{t+1} - \beta_{01} - \beta_{02}y_t - \beta_{03}x_t.$$
However, we do not know the parameter vector. Thus, we need to replace the parameters with their estimators, and take into account the error due to the fact that the parameters are estimated. There are three main sampling schemes: (i) the fixed scheme, (ii) the recursive scheme, and (iii) the rolling scheme. When interested in out-of-sample forecasting, and when we need to estimate parameters, we typically split the sample into two subsamples: a regression period, with $R$ observations, and a prediction period, with $P$ observations, where $T = R + P$.

Fixed estimation scheme: Use the first $R$ observations to estimate the parameters, call them $\hat\beta_R$, and construct a sequence of $P$ prediction errors, defined as
$$\hat u_{0,t+1} = y_{t+1} - \hat\beta_{01,R} - \hat\beta_{02,R}y_t - \hat\beta_{03,R}x_t,$$
for $t = R, \ldots, R + P - 1$.

Recursive estimation scheme: Use the first $R$ observations to compute $\hat\beta_R$, and construct the first prediction error:
$$\hat u_{0,R+1} = y_{R+1} - \hat\beta_{01,R} - \hat\beta_{02,R}y_R - \hat\beta_{03,R}x_R.$$
Then use all observations up to time $R+1$ to construct $\hat\beta_{R+1}$, and get the second prediction error:
$$\hat u_{0,R+2} = y_{R+2} - \hat\beta_{01,R+1} - \hat\beta_{02,R+1}y_{R+1} - \hat\beta_{03,R+1}x_{R+1}.$$
Proceed in the same manner until you have a sequence of $P$ prediction errors, defined as:
$$\hat u_{0,t+1} = y_{t+1} - \hat\beta_{01,t} - \hat\beta_{02,t}y_t - \hat\beta_{03,t}x_t,$$

for $t = R, \ldots, R + P - 1$, where $\hat\beta_t$ is the estimator computed using observations up to time $t$.

Rolling estimation scheme: Use the first $R$ observations to compute $\hat\beta_R$, and construct the first prediction error:
$$\hat u_{0,R+1} = y_{R+1} - \hat\beta_{01,R} - \hat\beta_{02,R}y_R - \hat\beta_{03,R}x_R.$$
Then observations from $t = 2$ up to $t = R + 1$ are used to construct $\hat\beta_{2,R+1}$, and a second prediction error is constructed:
$$\hat u_{0,R+2} = y_{R+2} - \hat\beta_{01,2,R+1} - \hat\beta_{02,2,R+1}y_{R+1} - \hat\beta_{03,2,R+1}x_{R+1}.$$
Thereafter, use observations from $t = 3$ to $t = R + 2$ and obtain another prediction error. Proceed in the same manner, estimating the parameters using the most recent $R$ observations, until you have a sequence of $P$ prediction errors:
$$\hat u_{0,t+1} = y_{t+1} - \hat\beta_{01,t-R+1,t} - \hat\beta_{02,t-R+1,t}y_t - \hat\beta_{03,t-R+1,t}x_t,$$
for $t = R, \ldots, R + P - 1$, where $\hat\beta_{t-R+1,t}$ is the estimator computed using observations from time $t - R + 1$ up to time $t$, that is, using the most recent $R$ observations (see West and McCracken (1998) for an overview of the properties of the various sampling schemes).

The most commonly used approach is the recursive scheme. Intuitively, it makes sense to use the information contained in new observations as soon as it becomes available. However, one must also be aware of structural breaks due to changing data definitions, changing model specification, changing tastes and preferences, etc. The three schemes are sketched in code below.
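A minimal sketch of the three schemes for one-step forecast errors from a simple AR(1) regression; the DGP, parameter values, and sample split are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
T, R = 300, 200                          # P = T - R out-of-sample observations
y = np.zeros(T)
for t in range(1, T):
    y[t] = 0.3 + 0.6 * y[t - 1] + rng.normal()

X = np.column_stack([np.ones(T - 1), y[:-1]])   # regressors: (1, y_t)
Y = y[1:]                                        # target: y_{t+1}

def ols(a, b):
    return np.linalg.lstsq(a, b, rcond=None)[0]

errors = {"fixed": [], "recursive": [], "rolling": []}
b_fix = ols(X[:R], Y[:R])                        # fixed: estimate once on first R obs.
for t in range(R, T - 1):
    errors["fixed"].append(Y[t] - X[t] @ b_fix)
    errors["recursive"].append(Y[t] - X[t] @ ols(X[:t], Y[:t]))          # all obs. up to t
    errors["rolling"].append(Y[t] - X[t] @ ols(X[t - R:t], Y[t - R:t]))  # most recent R obs.
for k, v in errors.items():
    print(k, np.mean(np.square(v)).round(4))
```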

16 and û,t+ = y t+ w,t β,t, where the parameters have been estimated recursively. We observe û 0,t+ and û,t+, and so construct the Diebold-Mariano statistic using the out of sample period, as: DM P = P /2 σ P t=r fû 0,t+ fû t+, with where σ 2 P = P t=r d t+ d l P w τ τ= t=r+τ+ d t+ = fû 0,t+ fû t+ d t+ d 2 d t+ τ d, and d = d t+. t= Assume that f is a differentiable function this rules out Lin-lin loss, for example, via a mean value expansion around β 0 and β, we have: P /2 t=r fû 0,t+ fû t+ = P /2 t=r fu 0,t+ fu t+ 2 where +P t=r +P t=r β0 fũ 0,t+ P /2 β 0t β 0 β fũ,t+ P /2 β t β, 3 ũ 0,t+ = y t+ w 0,t β 0,t with β 0,t β 0, β t and ũ,t+ is defined in an analogous manner. Note that the term on the right hand side of 2 is the same term we had in the absence of parameter estimation error i.e. as if we knew the parameters. he main issue we shall address is the following: Do the last two terms above vanish in probability as the sample gets large? In other words, does the effect of parameter estimation error vanish as the sample size get large? Under which conditions will it vanish? 6

We shall show that, in the context of DM type tests:

(a) Regardless of the choice of the loss function $f$, parameter estimation error vanishes if, as $T \to \infty$, $P/R \to 0$, i.e. if the estimation period grows at a faster rate than the prediction period. Suppose that $T = 10100$, $R = 10000$ and $P = 100$; in this case $P = R^{1/2}$, so $P/R = R^{-1/2} \to 0$ as $R \to \infty$. In general, suppose that $P = T^\delta$, $\delta < 1$, and $R = T - T^\delta$, so that $T = R + P$. In this case $P/R = T^\delta/(T - T^\delta) \to 0$ as $T \to \infty$. In practice, this occurs when the period used for estimation is much longer than the period used for out-of-sample forecasting.

(b) If the same loss function is used for estimation and out-of-sample prediction, then parameter estimation error vanishes regardless of the relative rates at which $P$ and $R$ grow as the sample size gets large (for the case of the DM test). This is, for example, the case in which we use (nonlinear or ordinary) least squares for estimation and we employ a quadratic (MSE) loss function. More generally, this occurs when we estimate parameters via an m-estimator and we use the same loss function for out-of-sample prediction, as we shall see below. Several authors (see e.g. Granger (1969) and Weiss (1996)) point out that the right way to proceed is to use the same loss for estimation and prediction.

(c) Finally, if $P/R \to \pi > 0$ and we use a different loss function for estimation and prediction, then the contribution of parameter estimation error does not vanish. In particular, it will affect the covariance of the limiting distribution, and we need to take it into account if we want to perform valid inference.

(ii) m-Estimators

Let:
$$\hat\theta_T = \arg\min_{\theta \in \Theta}\frac{1}{T}\sum_{t=1}^{T}m(y_t, X_t, \theta).$$
Examples:
$$m(y_t, X_t, \theta) = (y_t - X_t'\theta)^2 \quad \text{(OLS)}$$
$$m(y_t, X_t, \theta) = (y_t - h(X_t, \theta))^2 \quad \text{(Nonlinear Least Squares)}$$
$$m(y_t, X_t, \theta) = -\log f(y_t|X_t; \theta) \quad \text{(Maximum Likelihood or Quasi-MLE (QMLE))}.$$

Now, define:
$$\theta^\dagger = \arg\min_{\theta \in \Theta}E(m(y_t, X_t, \theta)).$$
Note that, given the above expression for $\hat\theta_T$,
$$\frac{1}{T}\sum_{t=1}^{T}\nabla_\theta m(y_t, X_t, \hat\theta_T) = 0,$$
because of the first order conditions. Also,
$$\nabla_\theta E(m(y_t, X_t, \theta))\Big|_{\theta = \theta^\dagger} = E(\nabla_\theta m(y_t, X_t, \theta^\dagger)) = 0.$$
If the uniform law of large numbers holds, that is, if
$$\sup_{\theta \in \Theta}\left|\frac{1}{T}\sum_{t=1}^{T}m(y_t, X_t, \theta) - E(m(y_t, X_t, \theta))\right| \stackrel{pr}{\to} 0,$$
then $\hat\theta_T \stackrel{pr}{\to} \theta^\dagger$ (consistency). Now, by a mean value expansion around $\theta^\dagger$,
$$\frac{1}{T}\sum_{t=1}^{T}\nabla_\theta m(y_t, X_t, \hat\theta_T) = \frac{1}{T}\sum_{t=1}^{T}\nabla_\theta m(y_t, X_t, \theta^\dagger) + \frac{1}{T}\sum_{t=1}^{T}\nabla^2_\theta m(y_t, X_t, \bar\theta_T)(\hat\theta_T - \theta^\dagger),$$
where $\bar\theta_T \in (\hat\theta_T, \theta^\dagger)$. Now, $\frac{1}{T}\sum_{t=1}^{T}\nabla_\theta m(y_t, X_t, \hat\theta_T) = 0$. Thus:
$$\sqrt{T}(\hat\theta_T - \theta^\dagger) = -\left(\frac{1}{T}\sum_{t=1}^{T}\nabla^2_\theta m(y_t, X_t, \bar\theta_T)\right)^{-1}\frac{1}{\sqrt{T}}\sum_{t=1}^{T}\nabla_\theta m(y_t, X_t, \theta^\dagger)$$
$$= -E\left(\nabla^2_\theta m(y_t, X_t, \theta^\dagger)\right)^{-1}\frac{1}{\sqrt{T}}\sum_{t=1}^{T}\nabla_\theta m(y_t, X_t, \theta^\dagger)$$
$$- \left(\left(\frac{1}{T}\sum_{t=1}^{T}\nabla^2_\theta m(y_t, X_t, \bar\theta_T)\right)^{-1} - E\left(\nabla^2_\theta m(y_t, X_t, \theta^\dagger)\right)^{-1}\right)\frac{1}{\sqrt{T}}\sum_{t=1}^{T}\nabla_\theta m(y_t, X_t, \theta^\dagger).$$
Now, if the uniform law of large numbers holds, that is, if
$$\sup_{\theta \in \Theta}\left\|\frac{1}{T}\sum_{t=1}^{T}\nabla^2_\theta m(y_t, X_t, \theta) - E(\nabla^2_\theta m(y_t, X_t, \theta))\right\| \stackrel{pr}{\to} 0,$$

and $E(\nabla^2_\theta m(y_t, X_t, \theta^\dagger))$ is a positive definite matrix, then
$$\left(\frac{1}{T}\sum_{t=1}^{T}\nabla^2_\theta m(y_t, X_t, \bar\theta_T)\right)^{-1} - E\left(\nabla^2_\theta m(y_t, X_t, \theta^\dagger)\right)^{-1} \stackrel{pr}{\to} 0.$$
Note that $E(\nabla_\theta m(y_t, X_t, \theta^\dagger)) = 0$, by the first order conditions. Under regularity conditions (see e.g. West (1996)), the central limit theorem applies and
$$\frac{1}{\sqrt{T}}\sum_{t=1}^{T}\nabla_\theta m(y_t, X_t, \theta^\dagger) \stackrel{d}{\to} N(0, V), \quad \text{where } V = \lim_{T\to\infty}Var\left(\frac{1}{\sqrt{T}}\sum_{t=1}^{T}\nabla_\theta m(y_t, X_t, \theta^\dagger)\right).$$
Now, the last term in the expansion above is the product of something going to zero in probability with something converging in distribution, and therefore it goes to zero in probability (product rule). Thus,
$$\sqrt{T}(\hat\theta_T - \theta^\dagger) \stackrel{d}{\to} N(0, M^{-1}VM^{-1}), \quad \text{with } M = E(\nabla^2_\theta m(y_t, X_t, \theta^\dagger)).$$
Now, we need estimators of $M$ and $V$; call them $\hat M_T$ and $\hat V_T$. By the uniform law of large numbers, a consistent estimator of $M$ is given by
$$\hat M_T = \frac{1}{T}\sum_{t=1}^{T}\nabla^2_\theta m(y_t, X_t, \hat\theta_T).$$
As for the estimator of $V$: if $E(\nabla_\theta m(y_t, X_t, \theta^\dagger)\nabla_\theta m(y_s, X_s, \theta^\dagger)') = 0$ for all $t \neq s$, then
$$\hat V_T = \frac{1}{T}\sum_{t=1}^{T}\nabla_\theta m(y_t, X_t, \hat\theta_T)\nabla_\theta m(y_t, X_t, \hat\theta_T)';$$
if instead $E(\nabla_\theta m(y_t, X_t, \theta^\dagger)\nabla_\theta m(y_s, X_s, \theta^\dagger)') \neq 0$ for some $t \neq s$, then we need to use a Newey-West HAC (heteroskedasticity and autocorrelation robust) estimator. In this case,
$$\hat V_T = \frac{1}{T}\sum_{t=1}^{T}\nabla_\theta m(y_t, X_t, \hat\theta_T)\nabla_\theta m(y_t, X_t, \hat\theta_T)' + \frac{2}{T}\sum_{\tau=1}^{l}w_\tau\sum_{t=\tau+1}^{T}\nabla_\theta m(y_t, X_t, \hat\theta_T)\nabla_\theta m(y_{t-\tau}, X_{t-\tau}, \hat\theta_T)'.$$
Under the same type of assumptions as in West (1996),
$$\left(\hat M_T^{-1}\hat V_T\hat M_T^{-1}\right)^{-1/2}\sqrt{T}(\hat\theta_T - \theta^\dagger) \stackrel{d}{\to} N(0, I).$$
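A small sketch of the sandwich covariance $\hat M_T^{-1}\hat V_T\hat M_T^{-1}$ for the OLS case $m = (y_t - X_t'\theta)^2$, with a Newey-West choice of $\hat V_T$; the simulated data and the lag truncation rule are illustrative choices.

```python
import numpy as np

def sandwich_cov(x, y, l):
    T, k = x.shape
    theta = np.linalg.lstsq(x, y, rcond=None)[0]
    u = y - x @ theta
    s = -2.0 * x * u[:, None]            # scores: gradient of (y - x'theta)^2
    M = 2.0 * x.T @ x / T                # Hessian: (1/T) sum 2*x*x'
    V = s.T @ s / T                      # tau = 0 term
    for tau in range(1, l + 1):
        w = 1.0 - tau / (l + 1.0)        # Bartlett weights
        G = s[tau:].T @ s[:-tau] / T
        V += w * (G + G.T)
    Minv = np.linalg.inv(M)
    return Minv @ V @ Minv / T           # approximate covariance of theta-hat

rng = np.random.default_rng(4)
T = 400
x = np.column_stack([np.ones(T), rng.normal(size=T)])
y = x @ np.array([1.0, 0.5]) + rng.normal(size=T)
print(np.sqrt(np.diag(sandwich_cov(x, y, l=int(T ** 0.25)))))   # HAC standard errors
```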

Note that $V = M$ is equivalent to the condition of spherical errors in the linear model.

Turning to our example, let
$$\hat u_{0,t+1} = y_{t+1} - w_{0,t}'\hat\beta_{0,t}.$$
Suppose that $\hat\beta_{0,t}$ is an m-estimator, defined as
$$\hat\beta_{0,t} = \arg\min_{\beta_0 \in B_0}\frac{1}{t}\sum_{j=2}^{t}m(y_j - w_{0,j-1}'\beta_0).$$
Note that if $m$ is a quadratic function, then
$$\hat\beta_{0,t} = \arg\min_{\beta_0 \in B_0}\frac{1}{t}\sum_{j=2}^{t}(y_j - w_{0,j-1}'\beta_0)^2,$$
and so $\hat\beta_{0,t}$ is the OLS estimator. Also define
$$\beta_0^\dagger = \arg\min_{\beta_0 \in B_0}E\left(m(y_j - w_{0,j-1}'\beta_0)\right),$$
so that if $m$ is a quadratic function, then
$$\beta_0^\dagger = \arg\min_{\beta_0 \in B_0}E(y_j - w_{0,j-1}'\beta_0)^2,$$
and if model 0 is indeed correctly specified, then $\beta_0^\dagger$ denotes the parameter of the conditional expectation. $\hat\beta_{1,t}$ and $\beta_1^\dagger$ can be defined in the same manner.

We want to test
$$H_0: E(f(u_{0,t}) - f(u_{1,t})) = 0 \quad \text{versus} \quad H_A: E(f(u_{0,t}) - f(u_{1,t})) \neq 0,$$
where
$$u_{0,t+1} = y_{t+1} - w_{0,t}'\beta_0^\dagger \quad \text{and} \quad u_{1,t+1} = y_{t+1} - w_{1,t}'\beta_1^\dagger.$$
From the DM statistic, consider:
$$P^{-1/2}\sum_{t=R}^{T-1}\left(f(\hat u_{0,t+1}) - f(\hat u_{1,t+1})\right) = P^{-1/2}\sum_{t=R}^{T-1}\left(f(u_{0,t+1}) - f(u_{1,t+1})\right)$$

$$+ \frac{1}{P}\sum_{t=R}^{T-1}\nabla_{\beta_0}f(u_{0,t+1})'P^{1/2}(\hat\beta_{0,t} - \beta_0^\dagger) - \frac{1}{P}\sum_{t=R}^{T-1}\nabla_{\beta_1}f(u_{1,t+1})'P^{1/2}(\hat\beta_{1,t} - \beta_1^\dagger) + o_P(1),$$
where the $o_P(1)$ term absorbs the difference between gradients evaluated at mean values $\bar\beta_{0,t} \in (\hat\beta_{0,t}, \beta_0^\dagger)$ (and analogously for model 1) and gradients evaluated at the pseudo-true values. The first term on the right hand side is the DM statistic for the case in which we know the underlying parameters. For the sake of simplicity, let us concentrate on the term involving $\hat\beta_{0,t}$. Let $m_t(\beta_0) = m(y_t - w_{0,t-1}'\beta_0)$. By a mean value expansion around $\beta_0^\dagger$,
$$\frac{1}{t}\sum_{j=2}^{t}\nabla_\beta m_j(\hat\beta_{0,t}) = \frac{1}{t}\sum_{j=2}^{t}\nabla_\beta m_j(\beta_0^\dagger) + \frac{1}{t}\sum_{j=2}^{t}\nabla^2_\beta m_j(\bar\beta_{0,t})(\hat\beta_{0,t} - \beta_0^\dagger),$$
with $\bar\beta_{0,t} \in (\hat\beta_{0,t}, \beta_0^\dagger)$. Now, the left hand side above is identically zero by the first order conditions; thus
$$t^{1/2}(\hat\beta_{0,t} - \beta_0^\dagger) = -\left(\frac{1}{t}\sum_{j=2}^{t}\nabla^2_\beta m_j(\bar\beta_{0,t})\right)^{-1}\frac{1}{t^{1/2}}\sum_{j=2}^{t}\nabla_\beta m_j(\beta_0^\dagger).$$
Hereafter, let $f_t(\beta_0) = f(u_t(\beta_0)) = f(y_{t+1} - w_{0,t}'\beta_0)$, and let $f_t(\beta_1) = f(u_t(\beta_1)) = f(y_{t+1} - w_{1,t}'\beta_1)$. Along the lines of West (1996), we now state the following assumptions.

Assumption A1: $f(u_{i,t})$ is twice continuously differentiable in $\beta_i$, and
$$\sup_{\beta_i \in B_i}\left\|\partial^2 f_t(\beta_i)/\partial\beta_i\partial\beta_i'\right\| < C, \quad i = 0, 1.$$

Assumption A2: $\sup_t\left\|\frac{1}{t}\sum_{j=2}^{t}\nabla^2_\beta m_j(\bar\beta_{i,t}) - B_i^{-1}\right\| \stackrel{a.s.}{\to} 0$, where $B_i$ is negative definite, $i = 0, 1$.

Assumption A3: (i) $(y_t, x_t, w_{i,t})$, $i = 0, 1$, is a strictly stationary strong mixing sequence with size $-4(4+\psi)/\psi$, $\psi > 0$; (ii) $f$ and $m$ are twice continuously differentiable in $\beta$ over the interior of $B$, and $\nabla_\beta m$, $\nabla^2_\beta m$, $\nabla_\beta f$, $\nabla^2_\beta f$ are $2r$-dominated (more simply, have $2r$ finite moments) uniformly in $B$, with $r \geq 2(2+\psi)$.

Assumption A4: $\beta_i^\dagger$ is uniquely identified, i.e. $E(m(y_j - w_{i,j-1}'\beta_i^\dagger)) < E(m(y_j - w_{i,j-1}'\beta_i))$ for any $\beta_i \neq \beta_i^\dagger$, $i = 0, 1$.

Assumption A5: $T = R + P$ and, as $T \to \infty$, $P, R \to \infty$ and $P/R \to \pi$, with $0 \leq \pi \leq \infty$.

Hereafter, the notation $o_P(1)$ denotes a term which approaches zero in probability. Recall that we are considering
$$\frac{1}{P}\sum_{t=R}^{T-1}\nabla_{\beta_0}f(u_{0,t+1})'P^{1/2}(\hat\beta_{0,t} - \beta_0^\dagger).$$

Proposition 2: Let Assumptions A1, A2, A3, A4 and A5 hold. Then:

(i) If $f = m$ (i.e. if we are using the same loss for estimation and testing), then:
$$\frac{1}{P}\sum_{t=R}^{T-1}\nabla_{\beta_0}f(u_{0,t+1})'P^{1/2}(\hat\beta_{0,t} - \beta_0^\dagger) = o_P(1).$$

(ii) If $\pi = 0$ (i.e. if, as $T \to \infty$, $P/R \to 0$), then:
$$\frac{1}{P}\sum_{t=R}^{T-1}\nabla_{\beta_0}f(u_{0,t+1})'P^{1/2}(\hat\beta_{0,t} - \beta_0^\dagger) = o_P(1).$$

(iii) In all other cases (i.e. $\pi > 0$ and $f \neq m$):
$$\frac{1}{P}\sum_{t=R}^{T-1}\nabla_{\beta_0}f(u_{0,t+1})'P^{1/2}(\hat\beta_{0,t} - \beta_0^\dagger) \stackrel{d}{\to} N(0, 2\Pi F_0'B_0 S_{h_0h_0}B_0'F_0),$$
where $\Pi = 1 - \pi^{-1}\ln(1+\pi)$ for $0 < \pi < \infty$, and $\Pi = 1$ for $\pi = \infty$. Also,
$$F_0 = E(\nabla_{\beta_0}f(u_{0,t+1})), \quad S_{h_0h_0} = \sum_{j=-\infty}^{\infty}E\left(\nabla_\beta m_1(\beta_0^\dagger)\nabla_\beta m_{1+j}(\beta_0^\dagger)'\right),$$

and $B_0 = \left(E(\nabla^2_\beta m_t(\beta_0^\dagger))\right)^{-1}$.

Proof - Sketch:

(i) From Lemma A3 in West (1996), for all $a < 0.5$, $\sup_{t \geq R}\left\|t^a(\hat\beta_{0,t} - \beta_0^\dagger)\right\| = o_P(1)$. Now,
$$\frac{1}{P}\sum_{t=R}^{T-1}\nabla_{\beta_0}f(u_{0,t+1})'P^{1/2}(\hat\beta_{0,t} - \beta_0^\dagger)$$
can be written as
$$E(\nabla_{\beta_0}f(u_{0,t+1}))'\frac{1}{P}\sum_{t=R}^{T-1}P^{1/2}(\hat\beta_{0,t} - \beta_0^\dagger) + \frac{1}{P}\sum_{t=R}^{T-1}\left(\nabla_{\beta_0}f(u_{0,t+1}) - E(\nabla_{\beta_0}f(u_{0,t+1}))\right)'P^{1/2}(\hat\beta_{0,t} - \beta_0^\dagger).$$
The second term is $o_P(1)$, as $\sup_{t \geq R}\|\hat\beta_{0,t} - \beta_0^\dagger\| = o_P(1)$ and
$$P^{-1/2}\sum_{t=R}^{T-1}\left(\nabla_{\beta_0}f(u_{0,t+1}) - E(\nabla_{\beta_0}f(u_{0,t+1}))\right) = O_P(1),$$
because of the central limit theorem, given A3 and A4. Now, if $f = m$,
$$E(\nabla_{\beta_0}m(u_{0,t+1})) = E\left(\nabla_\beta m(y_{t+1} - w_{0,t}'\beta_0^\dagger)\right) = 0,$$
because of the first order conditions defining $\beta_0^\dagger$, so the first term vanishes as well.

(ii) Note that
$$\left|\frac{1}{P}\sum_{t=R}^{T-1}\nabla_{\beta_0}f(u_{0,t+1})'P^{1/2}(\hat\beta_{0,t} - \beta_0^\dagger)\right| \leq \left(\frac{1}{P}\sum_{t=R}^{T-1}\left\|\nabla_{\beta_0}f(u_{0,t+1})\right\|\right)\sup_{t \geq R}\left\|P^{1/2}(\hat\beta_{0,t} - \beta_0^\dagger)\right\|,$$
and
$$\frac{1}{P}\sum_{t=R}^{T-1}\left\|\nabla_{\beta_0}f(u_{0,t+1})\right\| \stackrel{pr}{\to} E\left\|\nabla_{\beta_0}f(u_{0,t+1})\right\| < \infty,$$

while, as $P/R \to 0$ and $\sup_{t \geq R}\|t^a(\hat\beta_{0,t} - \beta_0^\dagger)\| = o_P(1)$,
$$\sup_{t \geq R}\left\|P^{1/2}(\hat\beta_{0,t} - \beta_0^\dagger)\right\| = o_P(1).$$

(iii) Given the decomposition in (i) and the expansion of $t^{1/2}(\hat\beta_{0,t} - \beta_0^\dagger)$ above,
$$\frac{1}{P}\sum_{t=R}^{T-1}\nabla_{\beta_0}f(u_{0,t+1})'P^{1/2}(\hat\beta_{0,t} - \beta_0^\dagger) = -E(\nabla_{\beta_0}f(u_{0,t+1}))'B_0\frac{1}{P^{1/2}}\sum_{t=R}^{T-1}\frac{1}{t}\sum_{j=2}^{t}\nabla_\beta m_j(\beta_0^\dagger) + o_P(1),$$
and Lemma A5 in West (1996) shows that
$$\lim_{T\to\infty}Var\left(\frac{1}{P^{1/2}}\sum_{t=R}^{T-1}\frac{1}{t}\sum_{j=2}^{t}\nabla_\beta m_j(\beta_0^\dagger)\right) = 2\Pi S_{h_0h_0}.$$

Thus, we have seen that there are two important cases in which the effect of parameter estimation error vanishes in probability: namely, when the prediction period grows at a slower rate than the estimation period, and when we use the same loss function for estimation and testing. For example, when we use a quadratic loss function and we estimate the parameters by OLS, then parameter estimation error vanishes. As mentioned above, it has been suggested by Granger (1969, 1993), and more recently by Weiss (1996), that the right approach is indeed to use the same loss function for estimation and prediction. Broadly speaking, if we use a different loss function for estimation and testing, then we are a priori ruling out the use of an optimal predictor.

Proposition 3: Let Assumptions A1, A2, A3, A4 and A5 hold, and let $f \neq m$ (different loss for estimation and testing) and $\pi > 0$. Then, under $H_0$,
$$P^{-1/2}\sum_{t=R}^{T-1}\left(f(\hat u_{0,t+1}) - f(\hat u_{1,t+1})\right) \stackrel{d}{\to} N(0, \Omega),$$
with
$$\Omega = S_{ff} + 2\Pi F_0'B_0 S_{h_0h_0}B_0'F_0 + 2\Pi F_1'B_1 S_{h_1h_1}B_1'F_1 - \Pi\left(S_{fh_0}B_0'F_0 + F_0'B_0 S_{fh_0}'\right)$$
$$- 2\Pi\left(F_1'B_1 S_{h_1h_0}B_0'F_0 + F_0'B_0 S_{h_0h_1}B_1'F_1\right) + \Pi\left(S_{fh_1}B_1'F_1 + F_1'B_1 S_{fh_1}'\right),$$

where, for $i, l = 0, 1$:
$$F_i = E(\nabla_{\beta_i}f(u_{i,t+1})), \quad B_i = \left(E(\nabla^2_\beta m_j(\beta_i^\dagger))\right)^{-1}, \quad S_{h_ih_l} = \sum_{j=-\infty}^{\infty}E\left(\nabla_\beta m_1(\beta_i^\dagger)\nabla_\beta m_{1+j}(\beta_l^\dagger)'\right),$$
$$S_{fh_i} = \sum_{j=-\infty}^{\infty}E\left((f(u_{0,1}) - f(u_{1,1}))\nabla_\beta m_{1+j}(\beta_i^\dagger)'\right), \quad S_{ff} = \sum_{j=-\infty}^{\infty}E\left((f(u_{0,1}) - f(u_{1,1}))(f(u_{0,1+j}) - f(u_{1,1+j}))\right).$$
Under the alternative, for some $\varepsilon > 0$,
$$\Pr\left(\left|\frac{1}{P}\sum_{t=R}^{T-1}\left(f(\hat u_{0,t+1}) - f(\hat u_{1,t+1})\right)\right| > \varepsilon\right) \to 1,$$
and so $P^{-1/2}\sum_{t=R}^{T-1}(f(\hat u_{0,t+1}) - f(\hat u_{1,t+1}))$ diverges at rate $P^{1/2}$.

In order to implement a valid DM test in the case of non-vanishing parameter estimation error, we need to consistently estimate all the pieces of the covariance matrix in Proposition 3. For consistent estimation of $F_i$ and $B_i$ we can just use the sample means evaluated at the estimated parameters. Namely, we can use:
$$\hat F_i = \frac{1}{P}\sum_{t=R}^{T-1}\nabla_{\beta_i}f(\hat u_{i,t+1}) \quad \text{and} \quad \hat B_i = \left(\frac{1}{P}\sum_{t=R}^{T-1}\nabla^2_\beta m_t(\hat\beta_{i,t})\right)^{-1}.$$
However, for the long run covariance matrices we need to use a HAC (Newey-West type) covariance estimator. Define, for $i, l = 0, 1$:
$$\hat S_{h_ih_l} = \frac{1}{P}\sum_{\tau=-l_P}^{l_P}w_\tau\sum_{t=R+l_P+1}^{T-1-l_P}\nabla_\beta m_t(\hat\beta_{i,t})\nabla_\beta m_{t+\tau}(\hat\beta_{l,t})',$$
$$\hat S_{fh_i} = \frac{1}{P}\sum_{\tau=-l_P}^{l_P}w_\tau\sum_{t=R+l_P+1}^{T-1-l_P}\left(\hat d_t - \bar d\right)\nabla_\beta m_{t+\tau}(\hat\beta_{i,t})', \quad \text{with } \hat d_t = f(\hat u_{0,t}) - f(\hat u_{1,t}), \; \bar d = \frac{1}{P}\sum_{t=R}^{T-1}\hat d_{t+1},$$

$$\hat S_{ff} = \frac{1}{P}\sum_{\tau=-l_P}^{l_P}w_\tau\sum_{t=R+l_P+1}^{T-1-l_P}\left(\hat d_t - \bar d\right)\left(\hat d_{t+\tau} - \bar d\right),$$
where
$$w_\tau = 1 - \frac{|\tau|}{l_P + 1}.$$
Given A1-A4 above, if, as $P \to \infty$, $l_P \to \infty$ and $l_P/P^{1/4} \to 0$, then $\hat S_{h_ih_l}$, $\hat S_{fh_i}$ and $\hat S_{ff}$ are consistent for $S_{h_ih_l}$, $S_{fh_i}$ and $S_{ff}$. Note that in practice we do not know $\pi$; a natural estimate for $\pi$ is $\hat\pi = P/R$. Also, we do not observe the rate at which $P$ and $R$ grow. Thus, unless $R$ is much larger than $P$, it is worthwhile using the formula for the covariance which takes into account parameter estimation error whenever we use a different loss for estimation and prediction. Of note is that a recent key paper by Giacomini and White discusses conditional predictive inference, in which case the data are conditioned on and parameter estimation error essentially vanishes.

2.2 Bootstrap Techniques for Critical Value Construction

2.2.1 Introduction to the Bootstrap

Inference on parameters is based on asymptotic critical values. But how good is the normal approximation? Can we improve over inference based upon the normal approximation? We shall see that bootstrap critical values can provide refinements over asymptotic critical values under various circumstances. First, let us outline the logic underlying the bootstrap, and then we shall see how the use of the bootstrap can lead to more accurate inference.

Consider a very simple situation. We have a sample of iid observations, $X_1, \ldots, X_T$, and we want to test the null hypothesis
$$H_0: E(X_1) = \mu$$
versus
$$H_A: E(X_1) \neq \mu.$$
Note that, given the identical distribution assumption, $E(X_1) = E(X_2) = \ldots = E(X_T)$.

Consider the t-statistic
$$t_{\mu,T} = \frac{T^{-1/2}\sum_{t=1}^{T}(X_t - \mu)}{\hat\sigma_X}, \quad \text{where} \quad \hat\sigma^2_X = \frac{1}{T}\sum_{t=1}^{T}\left(X_t - \frac{1}{T}\sum_{s=1}^{T}X_s\right)^2.$$
Provided that $Var(X_1) < \infty$, we know that under $H_0$, $t_{\mu,T} \stackrel{d}{\to} N(0,1)$. Thus, we compare $t_{\mu,T}$ with the 2.5% and 97.5% critical values of a standard normal, and we reject at the 5% significance level if $t_{\mu,T} < -1.96$ or $t_{\mu,T} > 1.96$.

The idea underlying the bootstrap is to pretend that the sample is the population, and to draw from the sample as many bootstrap samples as needed in order to construct many bootstrap statistics. The simplest form of bootstrap is the iid nonparametric bootstrap, which is suitable for iid observations. Imagine that we put all the observations into an urn, and we then make draws with replacement (i.e. we make one draw, get one observation, put it back into the urn, get another one, put it back into the urn, and so on). Let $X^*_1, X^*_2, \ldots, X^*_T$ be the resampled observations, and note that $X^*_i = X_t$, $t = 1, \ldots, T$, each with probability $1/T$. In other words, $X^*_1, X^*_2, \ldots, X^*_T$ is equal to $X_{I_1}, X_{I_2}, \ldots, X_{I_T}$, where, for $i = 1, \ldots, T$, $I_i$ is a random variable taking values $1, 2, \ldots, T$ with equal probability $1/T$. $X^*_1, X^*_2, \ldots, X^*_T$ forms a bootstrap sample. Needless to say, we can repeat the same operation and get a second bootstrap sample, and so on.

Note that, given the original sample, the probability law governing the resampled series is nothing other than the probability law of the $I_i$, $i = 1, \ldots, T$. As the $I_i$ are iid discrete uniform random variables on $\{1, \ldots, T\}$, the $X^*_i$ are also iid, conditional on the sample. Let $E^*$ and $Var^*$ denote the mean and the variance of the resampled series, conditional on the sample (note that $E^*$ and $Var^*$ are mean and variance operators in terms of the law governing the bootstrap, i.e. in terms of the $I_i$, $i = 1, \ldots, T$). Now, given the identical distribution,
$$E^*(X^*_1) = E^*(X^*_2) = \ldots = E^*(X^*_T),$$

and
$$E^*(X^*_1) = X_1\frac{1}{T} + X_2\frac{1}{T} + \ldots + X_T\frac{1}{T} = \frac{1}{T}\sum_{t=1}^{T}X_t.$$
Also,
$$E^*\left(\frac{1}{T}\sum_{t=1}^{T}X^*_t\right) = E^*(X^*_1) = \frac{1}{T}\sum_{t=1}^{T}X_t.$$
Thus, the bootstrap mean is equal to the sample mean. Given that $X^*_1, \ldots, X^*_T$ are iid observations (conditional on the sample),
$$Var^*\left(T^{-1/2}\sum_{t=1}^{T}X^*_t\right) = Var^*(X^*_1) = E^*(X^{*2}_1) - \left(E^*(X^*_1)\right)^2 = \frac{1}{T}\sum_{t=1}^{T}X^2_t - \left(\frac{1}{T}\sum_{t=1}^{T}X_t\right)^2.$$
Thus, the bootstrap variance is equal to the sample variance. Let
$$\hat\sigma^{*2}_X = \frac{1}{T}\sum_{t=1}^{T}\left(X^*_t - \frac{1}{T}\sum_{s=1}^{T}X^*_s\right)^2.$$
Given that $X^*_1, \ldots, X^*_T$ are iid with mean and variance equal to the sample mean and sample variance,
$$t^*_{\mu,T} = \frac{T^{-1/2}\sum_{t=1}^{T}\left(X^*_t - \frac{1}{T}\sum_{s=1}^{T}X_s\right)}{\hat\sigma^*_X} \stackrel{d^*}{\to} N(0,1),$$
where $\stackrel{d^*}{\to}$ denotes convergence in distribution according to the bootstrap probability measure, conditional on the sample.
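A minimal sketch of the iid nonparametric bootstrap just described: resample with replacement, recompute the t-statistic (centered at the sample mean) $B$ times, and read off percentile critical values. The data and the value of $B$ are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.exponential(size=200)            # skewed data, illustrative
T, B = X.size, 999
xbar = X.mean()

def tstat(sample, center):
    return np.sqrt(T) * (sample.mean() - center) / sample.std()

t_boot = np.empty(B)
for b in range(B):
    Xs = rng.choice(X, size=T, replace=True)   # draws from the "urn"
    t_boot[b] = tstat(Xs, xbar)                # centered at the sample mean
t_boot.sort()
lo, hi = t_boot[int(0.025 * (B + 1)) - 1], t_boot[int(0.975 * (B + 1)) - 1]
print("bootstrap 2.5% and 97.5% critical values:", lo.round(3), hi.round(3))
# Compare the observed t-statistic for H0: EX = mu with (lo, hi)
# instead of -1.96 and 1.96.
```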

Note, importantly, that $t^*_{\mu,T} \stackrel{d^*}{\to} N(0,1)$ regardless of whether the null hypothesis is true or not. Thus, under the null, $t_{\mu,T}$ and $t^*_{\mu,T}$ have the same limiting distribution, while under the alternative $t^*_{\mu,T} \stackrel{d^*}{\to} N(0,1)$ but $t_{\mu,T}$ diverges to infinity.

This suggests proceeding in the following manner. Construct $B$ ($B$ large) bootstrap statistics, say $t^{*(1)}_{\mu,T}, \ldots, t^{*(B)}_{\mu,T}$, and sort them from smallest to largest. Suppose $B = 1000$; then the 25th bootstrap statistic gives the 2.5% critical value, say $z^*_{T,2.5\%}$, and the 975th bootstrap statistic gives the 97.5% critical value, say $z^*_{T,97.5\%}$. If $B$ is large enough, then rejecting $H_0$ if $t_{\mu,T} < z^*_{T,2.5\%}$ or $t_{\mu,T} > z^*_{T,97.5\%}$, and not rejecting if $z^*_{T,2.5\%} \leq t_{\mu,T} \leq z^*_{T,97.5\%}$, yields a test with asymptotic size equal to 5% and unit asymptotic power.

It is important to note that in this case the bootstrap higher moments are also equal to the sample moments. In fact, given (conditional) independence,
$$E^*\left(T^{-1/2}\sum_{t=1}^{T}\left(X^*_t - \frac{1}{T}\sum_{s=1}^{T}X_s\right)\right)^3 = T^{-1/2}\,\frac{1}{T}\sum_{t=1}^{T}\left(X_t - \frac{1}{T}\sum_{s=1}^{T}X_s\right)^3,$$
and so on for the fourth moments, etc.

Question: Is inference based on $z^*_{T,2.5\%}$ and $z^*_{T,97.5\%}$ more accurate than inference based on the standard normal approximation (e.g. based on using $\pm 1.96$)? Answer: Yes. Why? We show why using the Edgeworth expansion (nothing to do with the Edgeworth box!). Under mild assumptions (satisfied for the sample mean in the iid case, provided there are enough finite moments), we can express the distribution of the t-statistic as the CDF of a standard normal plus other terms capturing deviations from normality. Namely,
$$\Pr(t_{\mu,T} \leq x) = \Phi(x) + T^{-1/2}p_1(x)\phi(x) + T^{-1}p_2(x)\phi(x) + T^{-3/2}p_3(x)\phi(x) + \ldots,$$
where $\Phi(x)$ and $\phi(x)$ are the cumulative distribution function and the density of a standard normal evaluated at $x$, $p_1(x)$ is a polynomial in $x$ depending on the central third moment, $p_2(x)$ is a polynomial in $x$ depending on the central fourth moment, etc.

Therefore, $p_1(x)$ captures the deviation from normality in the form of skewness, and $p_2(x)$ captures the deviation from normality in the sense of excess kurtosis; the successive terms capture more complex deviations and higher order effects. From the expansion above we see that the error of the normal approximation is of order $T^{-1/2}$. Analogously, we can write the Edgeworth expansion for $t^*_{\mu,T}$, i.e.
$$\Pr{}^*(t^*_{\mu,T} \leq x) = \Phi(x) + T^{-1/2}\hat p_1(x)\phi(x) + T^{-1}\hat p_2(x)\phi(x) + T^{-3/2}\hat p_3(x)\phi(x) + \ldots,$$
where $\hat p_1(x)$ is a polynomial in $x$ depending on the sample central third moment, $\hat p_2(x)$ is a polynomial in $x$ depending on the sample central fourth moment, etc. Therefore, as sample moments converge to population moments, and as we know that under mild assumptions the convergence is at rate $T^{-1/2}$, we have that
$$\hat p_1(x) - p_1(x) = O_P(T^{-1/2}), \quad \hat p_2(x) - p_2(x) = O_P(T^{-1/2}), \quad \text{etc.}$$
Recall that $\Pr^*(t^*_{\mu,T} \leq x)$ depends on the sample, and so it is a random variable, while $\Pr(t_{\mu,T} \leq x)$ is a number between 0 and 1 (!) depending on $T$. Thus,
$$\Pr{}^*(t^*_{\mu,T} \leq x) - \Pr(t_{\mu,T} \leq x) = O_P(T^{-1}), \quad \text{while} \quad \Pr(t_{\mu,T} \leq x) - \Phi(x) = O(T^{-1/2}).$$
Hence, if we approximate $\Pr(t_{\mu,T} \leq x)$ with the standard normal CDF we have an error of order $O(T^{-1/2})$, while if we approximate it with the bootstrap distribution $\Pr^*(t^*_{\mu,T} \leq x)$ we have an error of order $O_P(T^{-1})$. Thus, the bootstrap distribution provides a more accurate approximation than the normal CDF.

In practice, we do not compare $\Pr(t_{\mu,T} \leq x)$ with $\Phi(x)$; instead we compare $t_{\mu,T}$ with $z^*_{T,\alpha}$. Let $z_{T,\alpha}$ be defined by $\Pr(t_{\mu,T} \leq z_{T,\alpha}) = \alpha$, and analogously define $z^*_{T,\alpha}$ by $\Pr^*(t^*_{\mu,T} \leq z^*_{T,\alpha}) = \alpha$ (recall that the bootstrap moments are the sample moments).

Whenever we have an Edgeworth expansion, we can always obtain a Cornish(-Fisher) expansion by inversion:
$$z_{T,\alpha} = z_\alpha + T^{-1/2}q_1(\alpha) + T^{-1}q_2(\alpha) + T^{-3/2}q_3(\alpha) + \ldots,$$
where $q_1(\alpha)$, $q_2(\alpha)$ are polynomials in $\alpha$ capturing skewness and kurtosis, and
$$z^*_{T,\alpha} = z_\alpha + T^{-1/2}\hat q_1(\alpha) + T^{-1}\hat q_2(\alpha) + T^{-3/2}\hat q_3(\alpha) + \ldots,$$
where $\hat q_1(\alpha)$, $\hat q_2(\alpha)$ are polynomials in $\alpha$ capturing sample skewness and sample kurtosis. Now,
$$\hat q_1(\alpha) - q_1(\alpha) = O_P(T^{-1/2}) \quad \text{and} \quad \hat q_2(\alpha) - q_2(\alpha) = O_P(T^{-1/2}).$$
Thus,
$$z^*_{T,\alpha} - z_{T,\alpha} = O_P(T^{-1}), \quad \text{while} \quad z_{T,\alpha} - z_\alpha = O(T^{-1/2}).$$
Therefore, we see that inference based on bootstrap critical values is more accurate than that based on asymptotic normal critical values.

2.2.2 Bootstrapping GMM Estimators with Time Series

The iid nonparametric bootstrap does not work with dependent observations. The reason is that the resampled observations are iid, while the actual observations are not. In the case of dependent observations, things are more complicated: on one hand we want to draw blocks of data long enough to preserve the dependence structure present in the original sample, while on the other hand we want to have a large enough number of blocks that are independent of each other.

The most used resampling method for time series data is the block bootstrap of Künsch (1989), which we shall consider below. Let $T = bl$, where $b$ denotes the number of blocks and $l$ denotes the length of each block. We first draw a discrete uniform random variable, $I_1$, that can take values $0, 1, \ldots, T - l$, each with probability $1/(T - l + 1)$; the first block is given by $X_{I_1+1}, \ldots, X_{I_1+l}$. We then draw another discrete uniform random variable, say $I_2$, and a second block of length $l$ is formed, say $X_{I_2+1}, \ldots, X_{I_2+l}$. Continue in the same manner, until you draw the last discrete uniform, say $I_b$, and so the last block is $X_{I_b+1}, \ldots, X_{I_b+l}$.

Let us call $X^*_t$ the resampled series, and note that $X^*_1, X^*_2, \ldots, X^*_T$ corresponds to $X_{I_1+1}, X_{I_1+2}, \ldots, X_{I_b+l}$. Thus, conditional on the sample, the only random element is the beginning of each block. In particular, $(X^*_1, \ldots, X^*_l), (X^*_{l+1}, \ldots, X^*_{2l}), \ldots, (X^*_{T-l+1}, \ldots, X^*_T)$, conditional on the sample, can be treated as $b$ iid blocks, determined by iid discrete uniform random variables.

It can be shown that, conditional on the sample and for all samples except a set of probability measure approaching zero,
$$E^*\left(\frac{1}{T}\sum_{t=1}^{T}X^*_t\right) = \frac{1}{T}\sum_{t=1}^{T}X_t + O_P(l/T)$$
and
$$Var^*\left(T^{-1/2}\sum_{t=1}^{T}X^*_t\right) = \frac{1}{l}\frac{1}{T}\sum_{t=l}^{T-l}\sum_{j=-(l-1)}^{l-1}(l - |j|)(X_t - \bar X)(X_{t+j} - \bar X) + O_P(l^2/T),$$
where $\bar X = \frac{1}{T}\sum_{t=1}^{T}X_t$, $E^*$ and $Var^*$ denote the expectation and the variance operators with respect to $P^*$ (the probability law governing the resampled series, or equivalently the probability law governing the iid uniform random variables, conditional on the sample), and $O_P(l/T)$ ($O_P(l^2/T)$) denotes a term converging in probability $P$ to zero as $l/T \to 0$ ($l^2/T \to 0$).

Proof - Sketch of the two results above:
$$E^*\left(\frac{1}{T}\sum_{t=1}^{T}X^*_t\right) = E^*\left(\frac{1}{bl}\sum_{i=1}^{b}\sum_{j=1}^{l}X_{I_i+j}\right) = E^*\left(\frac{1}{l}\sum_{j=1}^{l}X_{I_1+j}\right),$$
as the $I_i$, $i = 1, \ldots, b$, are independent uniform random variables, and so, conditional on the sample, blocks are independent and identically distributed (note that, conditional on the sample, all randomness is due to $I_1, \ldots, I_b$, which are iid uniform random variables). Thus, the above expression can be rewritten as:
$$\frac{1}{l}(X_1 + X_2 + \ldots + X_l)\Pr{}^*(I_1 = 0) + \frac{1}{l}(X_2 + X_3 + \ldots + X_{l+1})\Pr{}^*(I_1 = 1) + \ldots + \frac{1}{l}(X_{l+1} + \ldots + X_{2l})\Pr{}^*(I_1 = l) + \ldots$$

$$\ldots + \frac{1}{l}(X_{T-l+1} + X_{T-l+2} + \ldots + X_T)\Pr{}^*(I_1 = T - l),$$
where
$$\Pr{}^*(I_1 = 0) = \Pr{}^*(I_1 = 1) = \ldots = \Pr{}^*(I_1 = T - l) = \frac{1}{T - l + 1}.$$
Note that for $l \leq t \leq T - l + 1$ the observation $X_t$ appears in $l$ summands, while $X_1$ and $X_T$ appear in only one, $X_2$ and $X_{T-1}$ in two, ..., and $X_{l-1}$ and $X_{T-l+2}$ in $l - 1$. Thus, summing up the terms above, we have that:
$$E^*\left(\frac{1}{T}\sum_{t=1}^{T}X^*_t\right) = \frac{1}{T - l + 1}\sum_{t=l}^{T-l+1}X_t + O_P(l/T) = \frac{1}{T}\sum_{t=1}^{T}X_t + O_P(l/T).$$
Now, we want to sketch the proof of the variance result. As the $I_i$, $i = 1, 2, \ldots, b$, are iid, and given the result just obtained,
$$Var^*\left(T^{-1/2}\sum_{t=1}^{T}X^*_t\right) = Var^*\left(b^{-1/2}\sum_{i=1}^{b}l^{-1/2}\sum_{j=1}^{l}X_{I_i+j}\right) = Var^*\left(l^{-1/2}\sum_{j=1}^{l}X_{I_1+j}\right)$$
$$= E^*\left(\frac{1}{l}\sum_{k=1}^{l}\sum_{j=1}^{l}(X_{I_1+k} - \bar X_a)(X_{I_1+j} - \bar X_a)\right) + O_P(l^2/T),$$
where $\bar X_a = \frac{1}{T}\sum_{t=1}^{T}X_t$. The first term on the RHS above is in turn equal to
$$\frac{1}{l}\sum_{k=1}^{l}\sum_{j=1}^{l}(X_k - \bar X_a)(X_j - \bar X_a)\Pr{}^*(I_1 = 0) + \frac{1}{l}\sum_{k=1}^{l}\sum_{j=1}^{l}(X_{k+1} - \bar X_a)(X_{j+1} - \bar X_a)\Pr{}^*(I_1 = 1) + \ldots$$
$$+ \frac{1}{l}\sum_{k=1}^{l}\sum_{j=1}^{l}(X_{k+T-l} - \bar X_a)(X_{j+T-l} - \bar X_a)\Pr{}^*(I_1 = T - l)$$

$$= \frac{1}{l}\frac{1}{T}\sum_{t=l}^{T-l}\sum_{j=-(l-1)}^{l-1}(l - |j|)(X_t - \bar X_a)(X_{t+j} - \bar X_a) + O_P(l^2/T).$$
Now, Künsch shows that, conditional on the sample (except a set with probability measure approaching zero), as $l \to \infty$,
$$t^{b*}_{\mu,T} = \frac{T^{-1/2}\sum_{t=1}^{T}\left(X^*_t - E^*(X^*_t)\right)}{\left(Var^*\left(T^{-1/2}\sum_{t=1}^{T}X^*_t\right)\right)^{1/2}} \stackrel{d^*}{\to} N(0,1).$$
Let
$$t^{HAC}_{\mu,T} = \frac{T^{-1/2}\sum_{t=1}^{T}(X_t - \mu)}{\hat\sigma_{HAC}},$$
where $\hat\sigma^2_{HAC}$ is an HAC covariance estimator. Thus, if we use the block bootstrap, we know that $t^{HAC}_{\mu,T}$ and $t^{b*}_{\mu,T}$ have the same limiting distribution, and so bootstrap critical values are asymptotically valid, as explained below.
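A minimal sketch of the overlapping-block resampling scheme described above, applied to an MA(1) series; the block length rule and the DGP are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(6)
T = 512
e = rng.normal(size=T + 1)
X = e[1:] + 0.6 * e[:-1]                 # MA(1): serially dependent series

l = 4                                    # block length, of order T^{1/4}
b = T // l                               # number of blocks, T = b*l
B = 999
boot_means = np.empty(B)
for i in range(B):
    starts = rng.integers(0, T - l + 1, size=b)          # I_1, ..., I_b
    Xs = np.concatenate([X[s:s + l] for s in starts])    # paste the b blocks
    boot_means[i] = Xs.mean()

# Var*(sqrt(T) * bootstrap mean) approximates the long-run variance,
# which for this MA(1) is (1 + 0.6)^2 = 2.56; with a finite block length
# the block bootstrap understates it somewhat.
print(T * boot_means.var())
```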

Recall that in the iid case (iid observations and iid bootstrap),
$$E^*\left(\frac{1}{T}\sum_{t=1}^{T}X^*_t\right) = \frac{1}{T}\sum_{t=1}^{T}X_t \quad \text{and} \quad Var^*\left(T^{-1/2}\sum_{t=1}^{T}X^*_t\right) = \frac{1}{T}\sum_{t=1}^{T}X^2_t - \left(\frac{1}{T}\sum_{t=1}^{T}X_t\right)^2.$$
In the case of the block bootstrap with dependent observations (and the same is true if we use the block bootstrap with iid observations),
$$E^*\left(\frac{1}{T}\sum_{t=1}^{T}X^*_t\right) = \frac{1}{T}\sum_{t=1}^{T}X_t + O_P\left(\frac{l}{T}\right) \quad \text{and} \quad Var^*\left(T^{-1/2}\sum_{t=1}^{T}X^*_t\right) = \hat\sigma^2_{HAC} + O_P\left(\frac{l^2}{T}\right).$$
As a consequence, it is no longer true that
$$\Pr(t^{HAC}_{\mu,T} \leq x) - \Pr{}^*(t^{b*}_{\mu,T} \leq x) = O_P(T^{-1}).$$
Götze and Hipp (1996), for the case of stationary mixing observations, show that if we choose the block length $l$ equal to the lag truncation parameter used in the construction of the HAC variance estimator (i.e. $l = m_T$), then
$$\Pr(t^{HAC}_{\mu,T} \leq x) - \Pr{}^*(t^{b*}_{\mu,T} \leq x) = O_P(lT^{-1}) + O_P(l^{-1}T^{-1/2}).$$
Thus, for $l = T^{1/4}$ (see footnote 2),
$$\Pr(t^{HAC}_{\mu,T} \leq x) - \Pr{}^*(t^{b*}_{\mu,T} \leq x) = O_P(T^{-3/4}).$$

2.2.3 Bootstrap Refinements for GMM Estimators

Now we outline how to bootstrap GMM estimators, and we see how bootstrap critical values can provide an improvement over asymptotic normal critical values (see Andrews (2002) for complete details). Improvement over the standard asymptotic approximation is called higher order refinement. In the sequel, we require that
$$E\left(g_t(\beta^\dagger_{GMM})g_{t-k}(\beta^\dagger_{GMM})'\right) = 0 \quad \text{for all } k > \kappa,$$
where $\kappa$ is finite; that is, the correlation between the moment conditions is zero after the $\kappa$-th lag. Currently, for the case of general nonlinear GMM estimators, there are no results about bootstrap higher order refinements for the general case in which $\kappa = \kappa_T$, with $\kappa_T \to \infty$ as $T \to \infty$ (see footnote 3). For some generality, consider the case in which the variance of the moment conditions depends on the parameters, and therefore we use a two-step GMM approach.

Footnote 2: Note that while we need $m_T/T^{1/4} \to 0$ for the case of possibly heterogeneous observations, in the strictly stationary case we can allow for $m_T = T^{1/4}$.

Footnote 3: Inoue and Shintani (2006) provide GMM refinements in the case of $\kappa = \kappa_T \to \infty$ for linear IV overidentified estimators.

In the first step, we use an arbitrary $p \times p$ weighting matrix, say $\hat\Omega$, and we compute
$$\tilde\beta_{T,GMM} = \arg\min_{\beta \in B}\left(\frac{1}{T}\sum_{t=1}^{T}g_t(\beta)\right)'\hat\Omega\left(\frac{1}{T}\sum_{t=1}^{T}g_t(\beta)\right) = \arg\min_{\beta \in B}G_T(\beta)'\hat\Omega G_T(\beta),$$
where we write $\tilde\beta_{T,GMM}$ for the first-step and $\hat\beta_{T,GMM}$ for the second-step estimator. Given $\tilde\beta_{T,GMM}$, we compute the second-step estimator
$$\hat\beta_{T,GMM} = \arg\min_{\beta \in B}G_T(\beta)'\hat\Omega_T(\tilde\beta_{T,GMM})^{-1}G_T(\beta),$$
where
$$\hat\Omega_T(\tilde\beta_{T,GMM}) = \frac{1}{T}\sum_{t=1}^{T}g_t(\tilde\beta_{T,GMM})g_t(\tilde\beta_{T,GMM})' + \frac{2}{T}\sum_{j=1}^{\kappa}\sum_{t=j+1}^{T}g_t(\tilde\beta_{T,GMM})g_{t-j}(\tilde\beta_{T,GMM})'.$$
The two-step GMM covariance matrix estimator is given by
$$\hat\sigma^2_T = \left(\hat D_T(\hat\beta_{T,GMM})'\hat\Omega_T(\tilde\beta_{T,GMM})^{-1}\hat D_T(\hat\beta_{T,GMM})\right)^{-1}, \quad \text{where} \quad \hat D_T(\hat\beta_{T,GMM}) = \frac{1}{T}\sum_{t=1}^{T}\nabla_\beta g_t(\beta)\Big|_{\beta=\hat\beta_{T,GMM}}.$$
Let $\hat\sigma^2_{ii,T}$ be the $ii$-th element of $\hat\sigma^2_T$. Suppose $g_t(\beta) = g(y_t, X_t, Z_t, \beta)$, and we resample $b$ blocks of length $l$ of $(y_t, X_t, Z_t)$ in order to obtain $(y^*_t, X^*_t, Z^*_t)$. Let
$$g^*_t(\beta) = g(y^*_t, X^*_t, Z^*_t, \beta) - E^*\left(g(y^*_t, X^*_t, Z^*_t, \hat\beta_{T,GMM})\right),$$
where
$$E^*\left(g(y^*_t, X^*_t, Z^*_t, \hat\beta_{T,GMM})\right) = \frac{1}{T - l + 1}\sum_{t=1}^{T}w_t\,g(y_t, X_t, Z_t, \hat\beta_{T,GMM}),$$

with
$$w_t = t/l \;\text{ for } t = 1, \ldots, l-1, \qquad w_t = 1 \;\text{ for } t = l, \ldots, T-l+1, \qquad w_t = (T-t+1)/l \;\text{ for } t = T-l+2, \ldots, T.$$
The weight $w_t$ is smaller than one for the first and last $l$ observations, as they have fewer chances of being drawn. Note that, in general, $g(y^*_t, X^*_t, Z^*_t, \hat\beta_{T,GMM})$ has a non-zero bootstrap mean even if $g(y_t, X_t, Z_t, \hat\beta_{T,GMM})$ has zero sample mean; hence the need for recentering the bootstrap moment conditions. In fact, after recentering, $E^*(g^*_t(\hat\beta_{T,GMM})) = 0$.

Now, define the bootstrap counterpart of $\tilde\beta_{T,GMM}$ as
$$\tilde\beta^*_{T,GMM} = \arg\min_{\beta \in B}\left(\frac{1}{T}\sum_{t=1}^{T}g^*_t(\beta)\right)'\hat\Omega\left(\frac{1}{T}\sum_{t=1}^{T}g^*_t(\beta)\right) = \arg\min_{\beta \in B}G^*_T(\beta)'\hat\Omega G^*_T(\beta),$$
where $g^*_t(\beta)$ is the recentered bootstrap moment condition defined above. Also, define the bootstrap counterpart of $\hat\beta_{T,GMM}$ as
$$\hat\beta^*_{T,GMM} = \arg\min_{\beta \in B}G^*_T(\beta)'\hat\Omega^*_T(\tilde\beta^*_{T,GMM})^{-1}G^*_T(\beta),$$
where
$$\hat\Omega^*_T(\tilde\beta^*_{T,GMM}) = \frac{1}{T}\sum_{t=1}^{T}g^*_t(\tilde\beta^*_{T,GMM})g^*_t(\tilde\beta^*_{T,GMM})' + \frac{2}{T}\sum_{j=1}^{\kappa}\sum_{t=j+1}^{T}g^*_t(\tilde\beta^*_{T,GMM})g^*_{t-j}(\tilde\beta^*_{T,GMM})'$$
is the bootstrap analog of $\hat\Omega_T(\tilde\beta_{T,GMM})$.

The bootstrap covariance matrix is given by
$$\hat\sigma^{*2}_T = \left(\hat D^*_T(\hat\beta^*_{T,GMM})'\hat\Omega^*_T(\tilde\beta^*_{T,GMM})^{-1}\hat D^*_T(\hat\beta^*_{T,GMM})\right)^{-1}, \quad \text{where} \quad \hat D^*_T(\hat\beta^*_{T,GMM}) = \frac{1}{T}\sum_{t=1}^{T}\nabla_\beta g^*_t(\beta)\Big|_{\beta=\hat\beta^*_{T,GMM}}.$$
Now, let $\hat\sigma^{*2}_{ii,T}$ be the $ii$-th element of $\hat\sigma^{*2}_T$. We are interested in testing
$$H_0: \beta_i = \beta^\dagger_{i,GMM} \quad \text{versus} \quad H_A: \beta_i \neq \beta^\dagger_{i,GMM}.$$
Define the t-statistic as
$$t_{\beta_i,T} = \frac{T^{1/2}(\hat\beta_{i,T,GMM} - \beta^\dagger_{i,GMM})}{\hat\sigma_{ii,T}},$$
and its bootstrap analog as
$$t^*_{\beta_i,T} = \frac{T^{1/2}(\hat\beta^*_{i,T,GMM} - \hat\beta_{i,T,GMM})}{\hat\sigma^*_{ii,T}}.$$
Now, $\hat\sigma^{*2}_{ii,T}$ is the bootstrap counterpart of $\hat\sigma^2_{ii,T}$, but it does not coincide with $Var^*(T^{1/2}(\hat\beta^*_{i,T,GMM} - \hat\beta_{i,T,GMM}))$. This is because the dependence in the sample moment conditions and in the bootstrap moment conditions is not the same. This is due to the so-called join point problem: blocks are independent, conditional on the sample, so the last observation of a block and the first observation of the next block are uncorrelated; however, this is not true in the original sample. As there are $b$ join points (as many as the blocks), this has to be taken into account.

Summarizing, the issue is that $\hat\sigma^{*2}_{ii,T}$ properly mimics $\hat\sigma^2_{ii,T}$ (i.e. $E^*(\hat\sigma^{*2}_{ii,T}) \approx \hat\sigma^2_{ii,T}$), but $\hat\sigma^{*2}_{ii,T}$ is NOT $Var^*(T^{1/2}(\hat\beta^*_{i,T,GMM} - \hat\beta_{i,T,GMM}))$. We thus need a correction factor. Define $\tilde\sigma^2_{ii,T}$ as the $ii$-th element of
$$\left(\hat D^{*\prime}_T\hat\Omega^{*-1}_T\hat D^*_T\right)^{-1}\hat D^{*\prime}_T\hat\Omega^{*-1}_T\,\tilde\Omega^*_T\,\hat\Omega^{*-1}_T\hat D^*_T\left(\hat D^{*\prime}_T\hat\Omega^{*-1}_T\hat D^*_T\right)^{-1},$$

where
$$\tilde\Omega^*_T = E^*\left(\frac{1}{T}\sum_{t=1}^{T}\sum_{s=1}^{T}g^*_t(\hat\beta_{T,GMM})g^*_s(\hat\beta_{T,GMM})'\right) = \frac{1}{T - l + 1}\sum_{t=0}^{T-l}\frac{1}{l}\sum_{j=1}^{l}\sum_{i=1}^{l}g_{t+j}(\hat\beta_{T,GMM})g_{t+i}(\hat\beta_{T,GMM})'.$$
Note that $\tilde\sigma^2_{ii,T} = Var^*(T^{1/2}(\hat\beta^*_{i,T,GMM} - \hat\beta_{i,T,GMM}))$, up to terms vanishing in probability. The correction factor is thus given by
$$\tau_{ii,T} = \frac{\hat\sigma^*_{ii,T}}{\tilde\sigma_{ii,T}}.$$
Now, consider the adjusted bootstrap statistic,
$$\tilde t^*_{\beta_i,T} = \frac{T^{1/2}(\hat\beta^*_{i,T,GMM} - \hat\beta_{i,T,GMM})}{\hat\sigma^*_{ii,T}}\,\tau_{ii,T} = \frac{T^{1/2}(\hat\beta^*_{i,T,GMM} - \hat\beta_{i,T,GMM})}{\tilde\sigma_{ii,T}},$$
which is given by the product of the bootstrap analog of the t-statistic and the correction term. We construct $B$ corrected bootstrap statistics, $\tilde t^*_{\beta_i,T}$, and use them to get the corrected bootstrap critical values $z^*_{\alpha/2,T}$ and $z^*_{1-\alpha/2,T}$; for example, if $B = 1000$ and $\alpha = 0.05$, then $z^*_{\alpha/2,T}$ is the 25th smallest corrected bootstrap statistic and $z^*_{1-\alpha/2,T}$ is the 975th smallest.

Now, under $H_0$,
$$\Pr\left(t_{\beta_i,T} < -1.96 \;\text{or}\; t_{\beta_i,T} > 1.96\right) - 0.05 = O(T^{-1/2}).$$
If we instead use the corrected bootstrap critical values $z^*_{\alpha/2,T}$ and $z^*_{1-\alpha/2,T}$, then
$$\Pr\left(t_{\beta_i,T} < z^*_{\alpha/2,T} \;\text{or}\; t_{\beta_i,T} > z^*_{1-\alpha/2,T}\right) - \alpha = O\left(T^{-(1/2+\xi)}\right), \quad \xi > 0.$$
Thus, as $\xi > 0$, inference based on bootstrap critical values is more accurate than inference based on asymptotic standard normal critical values. In the case of iid data, $l = 1$ and $\xi = 1/2$, so that we have the same order of improvement as that for the sample mean (one cannot do better than this). In the dependent case, when $l = T^{1/4}$, $\xi$ can be arbitrarily close to $1/4$.

Remarks:

(i) Thus far, we have considered the case of equally tailed tests, in the sense that we compare $t_{\beta_i,T}$ with the 2.5% and 97.5% critical values. Needless to say, in finite samples $z^*_{\alpha/2,T} \neq -z^*_{1-\alpha/2,T}$, as the bootstrap distribution is not symmetric for finite $T$. However, if we impose symmetry and compare $|t_{\beta_i,T}|$ with $z^*_{1-\alpha/2,T}$, then
$$\Pr\left(|t_{\beta_i,T}| > z^*_{1-\alpha/2,T}\right) - \alpha = O\left(T^{-(1+\xi)}\right),$$
where again $\xi$ cannot be larger than $1/4$. The smaller order of error is due to the fact that, by imposing symmetry, the first term in the Edgeworth and Cornish-Fisher expansions disappears.

(ii) Broadly speaking, in the iid case (iid bootstrap) we have an improvement over standard normal critical values of order $T^{-1/2}$. In the dependent case (block bootstrap), by choosing the block length of order $T^{1/4}$, we have an improvement not larger than, but arbitrarily close to, $T^{-1/4}$. This is due to the fact that in the iid data/iid bootstrap case the bootstrap moments approach the sample moments at rate $T^{-1/2}$, while in the block bootstrap case the bootstrap moments approach the sample moments at rate $T^{-1/4}$; the latter is true even if we have iid data but we resample using blocks.

(iii) Note that if the moment conditions are a martingale difference sequence (the dynamically correctly specified case), then $\kappa = 0$. However, this does not help: we still need to use the block bootstrap, in order to capture dependence in the higher (higher than second) moments.

3 Part III - Linear and Nonlinear Predictive Accuracy Testing With Nested and Nonnested Models

As the title of this section suggests, we are now ready to discuss in detail a number of predictive accuracy tests used for comparing linear and nonlinear models, potentially under misspecification, allowing for parameter estimation error, and for both nested and nonnested alternatives.

3.1 Granger Causality

We say that $X_t$ Granger causes $Y_t$ if the lags of $X_t$ help to predict $Y_t$ (Granger (1969)). More formally, if
$$f(y_t|\mathcal{F}^{-x}_t) = f(y_t|\mathcal{F}_t),$$
then we say that $X_t$ is not Granger causal for $Y_t$, where $\mathcal{F}_t$ denotes some relevant information set, and $\mathcal{F}^{-x}_t$ is the same information set, except without any past values of $X_t$. Typically, when performing a causality test, the null is that of non-causality, versus the alternative of causality. Causality tests are often performed by regressing $Y_t$ on its lags, on the lags of $X_t$, and on lags of other relevant variables, and then testing whether the coefficients on the lags of $X_t$ are all equal to zero or not.

More formally, consider
$$H_0: X_t \text{ does not cause } Y_t \quad \text{versus} \quad H_A: X_t \text{ causes } Y_t.$$
We could then estimate the following model, say:
$$Y_t = c + \sum_{i=1}^{p}\alpha_i Y_{t-i} + \sum_{j=1}^{q}\beta_j X_{t-j} + e_t,$$
and the null and alternative could be restated as
$$H_0: \beta_j = 0 \text{ for all } j = 1, \ldots, q$$
versus
$$H_A: \beta_j \neq 0 \text{ for at least one } j = 1, \ldots, q.$$
Thus, to test the null, one could use the usual F, Wald, Lagrange Multiplier or Likelihood Ratio tests, for example.
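A minimal sketch of such a test, implementing the F statistic directly (simulated data; the lag orders and parameter values are illustrative choices):

```python
import numpy as np
from scipy.stats import f as f_dist

def granger_f(y, x, p=2, q=2):
    T = y.size
    m = max(p, q)
    # Restricted model: constant and p lags of y; unrestricted adds q lags of x.
    Z_r = np.column_stack([np.ones(T - m)] + [y[m - i:T - i] for i in range(1, p + 1)])
    Z_u = np.column_stack([Z_r] + [x[m - j:T - j] for j in range(1, q + 1)])
    yy = y[m:]
    ssr = lambda Z: np.sum((yy - Z @ np.linalg.lstsq(Z, yy, rcond=None)[0]) ** 2)
    ssr_r, ssr_u = ssr(Z_r), ssr(Z_u)
    df2 = (T - m) - Z_u.shape[1]
    F = ((ssr_r - ssr_u) / q) / (ssr_u / df2)
    return F, 1 - f_dist.cdf(F, q, df2)

rng = np.random.default_rng(7)
T = 400
x = rng.normal(size=T)
y = np.zeros(T)
for t in range(1, T):
    y[t] = 0.4 * y[t - 1] + 0.5 * x[t - 1] + rng.normal()   # X does cause Y here
print(granger_f(y, x))   # small p-value: reject the null of non-causality
```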


More information

Ch.10 Autocorrelated Disturbances (June 15, 2016)

Ch.10 Autocorrelated Disturbances (June 15, 2016) Ch10 Autocorrelated Disturbances (June 15, 2016) In a time-series linear regression model setting, Y t = x tβ + u t, t = 1, 2,, T, (10-1) a common problem is autocorrelation, or serial correlation of the

More information

Wooldridge, Introductory Econometrics, 4th ed. Chapter 2: The simple regression model

Wooldridge, Introductory Econometrics, 4th ed. Chapter 2: The simple regression model Wooldridge, Introductory Econometrics, 4th ed. Chapter 2: The simple regression model Most of this course will be concerned with use of a regression model: a structure in which one or more explanatory

More information

2.5 Forecasting and Impulse Response Functions

2.5 Forecasting and Impulse Response Functions 2.5 Forecasting and Impulse Response Functions Principles of forecasting Forecast based on conditional expectations Suppose we are interested in forecasting the value of y t+1 based on a set of variables

More information

Comparing Nested Predictive Regression Models with Persistent Predictors

Comparing Nested Predictive Regression Models with Persistent Predictors Comparing Nested Predictive Regression Models with Persistent Predictors Yan Ge y and ae-hwy Lee z November 29, 24 Abstract his paper is an extension of Clark and McCracken (CM 2, 25, 29) and Clark and

More information

VAR Models and Applications

VAR Models and Applications VAR Models and Applications Laurent Ferrara 1 1 University of Paris West M2 EIPMC Oct. 2016 Overview of the presentation 1. Vector Auto-Regressions Definition Estimation Testing 2. Impulse responses functions

More information

Using all observations when forecasting under structural breaks

Using all observations when forecasting under structural breaks Using all observations when forecasting under structural breaks Stanislav Anatolyev New Economic School Victor Kitov Moscow State University December 2007 Abstract We extend the idea of the trade-off window

More information

Questions and Answers on Unit Roots, Cointegration, VARs and VECMs

Questions and Answers on Unit Roots, Cointegration, VARs and VECMs Questions and Answers on Unit Roots, Cointegration, VARs and VECMs L. Magee Winter, 2012 1. Let ɛ t, t = 1,..., T be a series of independent draws from a N[0,1] distribution. Let w t, t = 1,..., T, be

More information

Inference in VARs with Conditional Heteroskedasticity of Unknown Form

Inference in VARs with Conditional Heteroskedasticity of Unknown Form Inference in VARs with Conditional Heteroskedasticity of Unknown Form Ralf Brüggemann a Carsten Jentsch b Carsten Trenkler c University of Konstanz University of Mannheim University of Mannheim IAB Nuremberg

More information

11. Further Issues in Using OLS with TS Data

11. Further Issues in Using OLS with TS Data 11. Further Issues in Using OLS with TS Data With TS, including lags of the dependent variable often allow us to fit much better the variation in y Exact distribution theory is rarely available in TS applications,

More information

Estimation and Testing of Forecast Rationality under Flexible Loss

Estimation and Testing of Forecast Rationality under Flexible Loss Review of Economic Studies (2005) 72, 1107 1125 0034-6527/05/00431107$02.00 c 2005 The Review of Economic Studies Limited Estimation and Testing of Forecast Rationality under Flexible Loss GRAHAM ELLIOTT

More information

Some Recent Developments in Predictive Accuracy Testing With Nested Models and (Generic) Nonlinear Alternatives

Some Recent Developments in Predictive Accuracy Testing With Nested Models and (Generic) Nonlinear Alternatives Some Recent Developments in redictive Accuracy Testing With Nested Models and (Generic) Nonlinear Alternatives Valentina Corradi 1 and Norman R. Swanson 2 1 University of Exeter 2 Rutgers University August

More information

Forecasting the unemployment rate when the forecast loss function is asymmetric. Jing Tian

Forecasting the unemployment rate when the forecast loss function is asymmetric. Jing Tian Forecasting the unemployment rate when the forecast loss function is asymmetric Jing Tian This version: 27 May 2009 Abstract This paper studies forecasts when the forecast loss function is asymmetric,

More information

The Functional Central Limit Theorem and Testing for Time Varying Parameters

The Functional Central Limit Theorem and Testing for Time Varying Parameters NBER Summer Institute Minicourse What s New in Econometrics: ime Series Lecture : July 4, 008 he Functional Central Limit heorem and esting for ime Varying Parameters Lecture -, July, 008 Outline. FCL.

More information

Bayesian Semiparametric GARCH Models

Bayesian Semiparametric GARCH Models Bayesian Semiparametric GARCH Models Xibin (Bill) Zhang and Maxwell L. King Department of Econometrics and Business Statistics Faculty of Business and Economics xibin.zhang@monash.edu Quantitative Methods

More information

Bayesian Semiparametric GARCH Models

Bayesian Semiparametric GARCH Models Bayesian Semiparametric GARCH Models Xibin (Bill) Zhang and Maxwell L. King Department of Econometrics and Business Statistics Faculty of Business and Economics xibin.zhang@monash.edu Quantitative Methods

More information

Vector Auto-Regressive Models

Vector Auto-Regressive Models Vector Auto-Regressive Models Laurent Ferrara 1 1 University of Paris Nanterre M2 Oct. 2018 Overview of the presentation 1. Vector Auto-Regressions Definition Estimation Testing 2. Impulse responses functions

More information

ECON 4160, Spring term Lecture 12

ECON 4160, Spring term Lecture 12 ECON 4160, Spring term 2013. Lecture 12 Non-stationarity and co-integration 2/2 Ragnar Nymoen Department of Economics 13 Nov 2013 1 / 53 Introduction I So far we have considered: Stationary VAR, with deterministic

More information

Model Specification Testing in Nonparametric and Semiparametric Time Series Econometrics. Jiti Gao

Model Specification Testing in Nonparametric and Semiparametric Time Series Econometrics. Jiti Gao Model Specification Testing in Nonparametric and Semiparametric Time Series Econometrics Jiti Gao Department of Statistics School of Mathematics and Statistics The University of Western Australia Crawley

More information

State-space Model. Eduardo Rossi University of Pavia. November Rossi State-space Model Fin. Econometrics / 53

State-space Model. Eduardo Rossi University of Pavia. November Rossi State-space Model Fin. Econometrics / 53 State-space Model Eduardo Rossi University of Pavia November 2014 Rossi State-space Model Fin. Econometrics - 2014 1 / 53 Outline 1 Motivation 2 Introduction 3 The Kalman filter 4 Forecast errors 5 State

More information

Least Squares Estimation-Finite-Sample Properties

Least Squares Estimation-Finite-Sample Properties Least Squares Estimation-Finite-Sample Properties Ping Yu School of Economics and Finance The University of Hong Kong Ping Yu (HKU) Finite-Sample 1 / 29 Terminology and Assumptions 1 Terminology and Assumptions

More information

Comparing Predictive Accuracy, Twenty Years Later: On The Use and Abuse of Diebold-Mariano Tests

Comparing Predictive Accuracy, Twenty Years Later: On The Use and Abuse of Diebold-Mariano Tests Comparing Predictive Accuracy, Twenty Years Later: On The Use and Abuse of Diebold-Mariano Tests Francis X. Diebold April 28, 2014 1 / 24 Comparing Forecasts 2 / 24 Comparing Model-Free Forecasts Models

More information

Central Bank of Chile October 29-31, 2013 Bruce Hansen (University of Wisconsin) Structural Breaks October 29-31, / 91. Bruce E.

Central Bank of Chile October 29-31, 2013 Bruce Hansen (University of Wisconsin) Structural Breaks October 29-31, / 91. Bruce E. Forecasting Lecture 3 Structural Breaks Central Bank of Chile October 29-31, 2013 Bruce Hansen (University of Wisconsin) Structural Breaks October 29-31, 2013 1 / 91 Bruce E. Hansen Organization Detection

More information

Autoregressive Moving Average (ARMA) Models and their Practical Applications

Autoregressive Moving Average (ARMA) Models and their Practical Applications Autoregressive Moving Average (ARMA) Models and their Practical Applications Massimo Guidolin February 2018 1 Essential Concepts in Time Series Analysis 1.1 Time Series and Their Properties Time series:

More information

GARCH Models Estimation and Inference. Eduardo Rossi University of Pavia

GARCH Models Estimation and Inference. Eduardo Rossi University of Pavia GARCH Models Estimation and Inference Eduardo Rossi University of Pavia Likelihood function The procedure most often used in estimating θ 0 in ARCH models involves the maximization of a likelihood function

More information

Week 5 Quantitative Analysis of Financial Markets Characterizing Cycles

Week 5 Quantitative Analysis of Financial Markets Characterizing Cycles Week 5 Quantitative Analysis of Financial Markets Characterizing Cycles Christopher Ting http://www.mysmu.edu/faculty/christophert/ Christopher Ting : christopherting@smu.edu.sg : 6828 0364 : LKCSB 5036

More information

11. Bootstrap Methods

11. Bootstrap Methods 11. Bootstrap Methods c A. Colin Cameron & Pravin K. Trivedi 2006 These transparencies were prepared in 20043. They can be used as an adjunct to Chapter 11 of our subsequent book Microeconometrics: Methods

More information

Nonparametric Bootstrap Procedures for Predictive Inference Based on Recursive Estimation Schemes

Nonparametric Bootstrap Procedures for Predictive Inference Based on Recursive Estimation Schemes Nonparametric Bootstrap rocedures for redictive Inference Based on Recursive Estimation Schemes Valentina Corradi and Norman R. Swanson 2 Queen Mary, University of London and 2 Rutgers University March

More information

Lecture 3: Autoregressive Moving Average (ARMA) Models and their Practical Applications

Lecture 3: Autoregressive Moving Average (ARMA) Models and their Practical Applications Lecture 3: Autoregressive Moving Average (ARMA) Models and their Practical Applications Prof. Massimo Guidolin 20192 Financial Econometrics Winter/Spring 2018 Overview Moving average processes Autoregressive

More information

FORECAST-BASED MODEL SELECTION

FORECAST-BASED MODEL SELECTION FORECAST-ASED MODEL SELECTION IN THE PRESENCE OF STRUCTURAL REAKS Todd E. Clark Michael W. McCracken AUGUST 2002 RWP 02-05 Research Division Federal Reserve ank of Kansas City Todd E. Clark is an assistant

More information

10. Time series regression and forecasting

10. Time series regression and forecasting 10. Time series regression and forecasting Key feature of this section: Analysis of data on a single entity observed at multiple points in time (time series data) Typical research questions: What is the

More information

Comprehensive Examination Quantitative Methods Spring, 2018

Comprehensive Examination Quantitative Methods Spring, 2018 Comprehensive Examination Quantitative Methods Spring, 2018 Instruction: This exam consists of three parts. You are required to answer all the questions in all the parts. 1 Grading policy: 1. Each part

More information

Topic 4 Unit Roots. Gerald P. Dwyer. February Clemson University

Topic 4 Unit Roots. Gerald P. Dwyer. February Clemson University Topic 4 Unit Roots Gerald P. Dwyer Clemson University February 2016 Outline 1 Unit Roots Introduction Trend and Difference Stationary Autocorrelations of Series That Have Deterministic or Stochastic Trends

More information

Estimation of Dynamic Regression Models

Estimation of Dynamic Regression Models University of Pavia 2007 Estimation of Dynamic Regression Models Eduardo Rossi University of Pavia Factorization of the density DGP: D t (x t χ t 1, d t ; Ψ) x t represent all the variables in the economy.

More information

Bootstrapping the Grainger Causality Test With Integrated Data

Bootstrapping the Grainger Causality Test With Integrated Data Bootstrapping the Grainger Causality Test With Integrated Data Richard Ti n University of Reading July 26, 2006 Abstract A Monte-carlo experiment is conducted to investigate the small sample performance

More information

GARCH Models Estimation and Inference

GARCH Models Estimation and Inference GARCH Models Estimation and Inference Eduardo Rossi University of Pavia December 013 Rossi GARCH Financial Econometrics - 013 1 / 1 Likelihood function The procedure most often used in estimating θ 0 in

More information

A Course in Applied Econometrics Lecture 14: Control Functions and Related Methods. Jeff Wooldridge IRP Lectures, UW Madison, August 2008

A Course in Applied Econometrics Lecture 14: Control Functions and Related Methods. Jeff Wooldridge IRP Lectures, UW Madison, August 2008 A Course in Applied Econometrics Lecture 14: Control Functions and Related Methods Jeff Wooldridge IRP Lectures, UW Madison, August 2008 1. Linear-in-Parameters Models: IV versus Control Functions 2. Correlated

More information

Lecture 2: Univariate Time Series

Lecture 2: Univariate Time Series Lecture 2: Univariate Time Series Analysis: Conditional and Unconditional Densities, Stationarity, ARMA Processes Prof. Massimo Guidolin 20192 Financial Econometrics Spring/Winter 2017 Overview Motivation:

More information

Econometrics of Panel Data

Econometrics of Panel Data Econometrics of Panel Data Jakub Mućk Meeting # 6 Jakub Mućk Econometrics of Panel Data Meeting # 6 1 / 36 Outline 1 The First-Difference (FD) estimator 2 Dynamic panel data models 3 The Anderson and Hsiao

More information

A Primer on Asymptotics

A Primer on Asymptotics A Primer on Asymptotics Eric Zivot Department of Economics University of Washington September 30, 2003 Revised: October 7, 2009 Introduction The two main concepts in asymptotic theory covered in these

More information

Time Series Models and Inference. James L. Powell Department of Economics University of California, Berkeley

Time Series Models and Inference. James L. Powell Department of Economics University of California, Berkeley Time Series Models and Inference James L. Powell Department of Economics University of California, Berkeley Overview In contrast to the classical linear regression model, in which the components of the

More information

Non-Stationary Time Series and Unit Root Testing

Non-Stationary Time Series and Unit Root Testing Econometrics II Non-Stationary Time Series and Unit Root Testing Morten Nyboe Tabor Course Outline: Non-Stationary Time Series and Unit Root Testing 1 Stationarity and Deviation from Stationarity Trend-Stationarity

More information

GARCH Models Estimation and Inference

GARCH Models Estimation and Inference Università di Pavia GARCH Models Estimation and Inference Eduardo Rossi Likelihood function The procedure most often used in estimating θ 0 in ARCH models involves the maximization of a likelihood function

More information

1 Outline. 1. Motivation. 2. SUR model. 3. Simultaneous equations. 4. Estimation

1 Outline. 1. Motivation. 2. SUR model. 3. Simultaneous equations. 4. Estimation 1 Outline. 1. Motivation 2. SUR model 3. Simultaneous equations 4. Estimation 2 Motivation. In this chapter, we will study simultaneous systems of econometric equations. Systems of simultaneous equations

More information

Statistical inference

Statistical inference Statistical inference Contents 1. Main definitions 2. Estimation 3. Testing L. Trapani MSc Induction - Statistical inference 1 1 Introduction: definition and preliminary theory In this chapter, we shall

More information

Generalized Method of Moment

Generalized Method of Moment Generalized Method of Moment CHUNG-MING KUAN Department of Finance & CRETA National Taiwan University June 16, 2010 C.-M. Kuan (Finance & CRETA, NTU Generalized Method of Moment June 16, 2010 1 / 32 Lecture

More information

Analogy Principle. Asymptotic Theory Part II. James J. Heckman University of Chicago. Econ 312 This draft, April 5, 2006

Analogy Principle. Asymptotic Theory Part II. James J. Heckman University of Chicago. Econ 312 This draft, April 5, 2006 Analogy Principle Asymptotic Theory Part II James J. Heckman University of Chicago Econ 312 This draft, April 5, 2006 Consider four methods: 1. Maximum Likelihood Estimation (MLE) 2. (Nonlinear) Least

More information

Econometrics II - EXAM Outline Solutions All questions have 25pts Answer each question in separate sheets

Econometrics II - EXAM Outline Solutions All questions have 25pts Answer each question in separate sheets Econometrics II - EXAM Outline Solutions All questions hae 5pts Answer each question in separate sheets. Consider the two linear simultaneous equations G with two exogeneous ariables K, y γ + y γ + x δ

More information

Title. Description. var intro Introduction to vector autoregressive models

Title. Description. var intro Introduction to vector autoregressive models Title var intro Introduction to vector autoregressive models Description Stata has a suite of commands for fitting, forecasting, interpreting, and performing inference on vector autoregressive (VAR) models

More information

EVALUATING DIRECT MULTI-STEP FORECASTS

EVALUATING DIRECT MULTI-STEP FORECASTS EVALUATING DIRECT MULTI-STEP FORECASTS Todd Clark and Michael McCracken Revised: April 2005 (First Version December 2001) RWP 01-14 Research Division Federal Reserve Bank of Kansas City Todd E. Clark is

More information

Lecture 7 Introduction to Statistical Decision Theory

Lecture 7 Introduction to Statistical Decision Theory Lecture 7 Introduction to Statistical Decision Theory I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 20, 2016 1 / 55 I-Hsiang Wang IT Lecture 7

More information

Research Division Federal Reserve Bank of St. Louis Working Paper Series

Research Division Federal Reserve Bank of St. Louis Working Paper Series Research Division Federal Reserve Bank of St. Louis Working Paper Series Tests of Equal Predictive Ability with Real-Time Data Todd E. Clark and Michael W. McCracken Working Paper 2008-029A http://research.stlouisfed.org/wp/2008/2008-029.pdf

More information

ECON2228 Notes 2. Christopher F Baum. Boston College Economics. cfb (BC Econ) ECON2228 Notes / 47

ECON2228 Notes 2. Christopher F Baum. Boston College Economics. cfb (BC Econ) ECON2228 Notes / 47 ECON2228 Notes 2 Christopher F Baum Boston College Economics 2014 2015 cfb (BC Econ) ECON2228 Notes 2 2014 2015 1 / 47 Chapter 2: The simple regression model Most of this course will be concerned with

More information

Economics Division University of Southampton Southampton SO17 1BJ, UK. Title Overlapping Sub-sampling and invariance to initial conditions

Economics Division University of Southampton Southampton SO17 1BJ, UK. Title Overlapping Sub-sampling and invariance to initial conditions Economics Division University of Southampton Southampton SO17 1BJ, UK Discussion Papers in Economics and Econometrics Title Overlapping Sub-sampling and invariance to initial conditions By Maria Kyriacou

More information

Missing dependent variables in panel data models

Missing dependent variables in panel data models Missing dependent variables in panel data models Jason Abrevaya Abstract This paper considers estimation of a fixed-effects model in which the dependent variable may be missing. For cross-sectional units

More information

GARCH Models. Eduardo Rossi University of Pavia. December Rossi GARCH Financial Econometrics / 50

GARCH Models. Eduardo Rossi University of Pavia. December Rossi GARCH Financial Econometrics / 50 GARCH Models Eduardo Rossi University of Pavia December 013 Rossi GARCH Financial Econometrics - 013 1 / 50 Outline 1 Stylized Facts ARCH model: definition 3 GARCH model 4 EGARCH 5 Asymmetric Models 6

More information

Discrete time processes

Discrete time processes Discrete time processes Predictions are difficult. Especially about the future Mark Twain. Florian Herzog 2013 Modeling observed data When we model observed (realized) data, we encounter usually the following

More information

Lecture 6: Univariate Volatility Modelling: ARCH and GARCH Models

Lecture 6: Univariate Volatility Modelling: ARCH and GARCH Models Lecture 6: Univariate Volatility Modelling: ARCH and GARCH Models Prof. Massimo Guidolin 019 Financial Econometrics Winter/Spring 018 Overview ARCH models and their limitations Generalized ARCH models

More information

Linear models. Linear models are computationally convenient and remain widely used in. applied econometric research

Linear models. Linear models are computationally convenient and remain widely used in. applied econometric research Linear models Linear models are computationally convenient and remain widely used in applied econometric research Our main focus in these lectures will be on single equation linear models of the form y

More information

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A.

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A. 1. Let P be a probability measure on a collection of sets A. (a) For each n N, let H n be a set in A such that H n H n+1. Show that P (H n ) monotonically converges to P ( k=1 H k) as n. (b) For each n

More information

The Bootstrap: Theory and Applications. Biing-Shen Kuo National Chengchi University

The Bootstrap: Theory and Applications. Biing-Shen Kuo National Chengchi University The Bootstrap: Theory and Applications Biing-Shen Kuo National Chengchi University Motivation: Poor Asymptotic Approximation Most of statistical inference relies on asymptotic theory. Motivation: Poor

More information

A time series is called strictly stationary if the joint distribution of every collection (Y t

A time series is called strictly stationary if the joint distribution of every collection (Y t 5 Time series A time series is a set of observations recorded over time. You can think for example at the GDP of a country over the years (or quarters) or the hourly measurements of temperature over a

More information

Notes on Time Series Modeling

Notes on Time Series Modeling Notes on Time Series Modeling Garey Ramey University of California, San Diego January 17 1 Stationary processes De nition A stochastic process is any set of random variables y t indexed by t T : fy t g

More information

Are Forecast Updates Progressive?

Are Forecast Updates Progressive? MPRA Munich Personal RePEc Archive Are Forecast Updates Progressive? Chia-Lin Chang and Philip Hans Franses and Michael McAleer National Chung Hsing University, Erasmus University Rotterdam, Erasmus University

More information

Vector autoregressions, VAR

Vector autoregressions, VAR 1 / 45 Vector autoregressions, VAR Chapter 2 Financial Econometrics Michael Hauser WS17/18 2 / 45 Content Cross-correlations VAR model in standard/reduced form Properties of VAR(1), VAR(p) Structural VAR,

More information

Economics 536 Lecture 7. Introduction to Specification Testing in Dynamic Econometric Models

Economics 536 Lecture 7. Introduction to Specification Testing in Dynamic Econometric Models University of Illinois Fall 2016 Department of Economics Roger Koenker Economics 536 Lecture 7 Introduction to Specification Testing in Dynamic Econometric Models In this lecture I want to briefly describe

More information

7. Forecasting with ARIMA models

7. Forecasting with ARIMA models 7. Forecasting with ARIMA models 309 Outline: Introduction The prediction equation of an ARIMA model Interpreting the predictions Variance of the predictions Forecast updating Measuring predictability

More information

On detection of unit roots generalizing the classic Dickey-Fuller approach

On detection of unit roots generalizing the classic Dickey-Fuller approach On detection of unit roots generalizing the classic Dickey-Fuller approach A. Steland Ruhr-Universität Bochum Fakultät für Mathematik Building NA 3/71 D-4478 Bochum, Germany February 18, 25 1 Abstract

More information

Multivariate Time Series Analysis and Its Applications [Tsay (2005), chapter 8]

Multivariate Time Series Analysis and Its Applications [Tsay (2005), chapter 8] 1 Multivariate Time Series Analysis and Its Applications [Tsay (2005), chapter 8] Insights: Price movements in one market can spread easily and instantly to another market [economic globalization and internet

More information

The loss function and estimating equations

The loss function and estimating equations Chapter 6 he loss function and estimating equations 6 Loss functions Up until now our main focus has been on parameter estimating via the maximum likelihood However, the negative maximum likelihood is

More information

University of Pavia. M Estimators. Eduardo Rossi

University of Pavia. M Estimators. Eduardo Rossi University of Pavia M Estimators Eduardo Rossi Criterion Function A basic unifying notion is that most econometric estimators are defined as the minimizers of certain functions constructed from the sample

More information

Diagnostic Test for GARCH Models Based on Absolute Residual Autocorrelations

Diagnostic Test for GARCH Models Based on Absolute Residual Autocorrelations Diagnostic Test for GARCH Models Based on Absolute Residual Autocorrelations Farhat Iqbal Department of Statistics, University of Balochistan Quetta-Pakistan farhatiqb@gmail.com Abstract In this paper

More information

Reality Checks and Nested Forecast Model Comparisons

Reality Checks and Nested Forecast Model Comparisons Reality Checks and Nested Forecast Model Comparisons Todd E. Clark Federal Reserve Bank of Kansas City Michael W. McCracken Board of Governors of the Federal Reserve System October 2006 (preliminary and

More information