The Comparative Performance of Alternative Out-of-Sample Predictability Tests with Non-linear Models

Yu Liu, University of Texas at El Paso
Ruxandra Prodan, University of Houston
Alex Nikolsko-Rzhevskyy, University of Memphis

June 30th, 2010

Very Preliminary Version

A vast literature focuses on testing the forecasting performance of various models. The most commonly used measure of forecast accuracy is the mean squared prediction error, and the tests for equal predictability of two non-nested linear models introduced by Diebold and Mariano (1995) and West (1996) (the DMW test) are the most often used. More recent studies have developed inference procedures for testing the null of equal predictability when two linear models are nested (Clark and West (2006, 2007)) (the CW test). We study the size and power properties of the DMW and CW tests with nonlinear models (SETAR, Band-TAR and ESTAR) commonly used to forecast series such as real GDP and real exchange rates. To compare the various tests for equal predictability, we perform a Monte Carlo study in which the generated data have different degrees of persistence, various sample sizes, multi-step forecasts and different forecast windows. Our main conclusion is that the Clark and West test is appropriately sized and has good power properties when applied to nonlinear models.

JEL Classification:

Keywords: nonlinear models, forecast evaluation, tests for equal predictability, real exchange rates

1. Introduction

The most commonly used measure for evaluating the out-of-sample performance of two models when forecasting exchange rates is the mean squared prediction error (MSPE). To evaluate out-of-sample performance based on the MSPE comparison, tests for equal predictability of two non-nested models, introduced by Diebold and Mariano (1995) and West (1996), have been widely used (henceforth, DMW tests). While these tests are appropriate for non-nested models, when the models are nested the use of standard normal critical values usually results in severely undersized tests. 1 More recent studies take this into account by using the inference procedure proposed by Clark and West (2006, 2007) for testing the null of equal predictive ability of an econometric model and a martingale difference model (henceforth, the CW test). 2 They argue that this methodology is preferable to the standard DMW procedure when the two models are nested, resulting in properly sized tests. 3 However, according to Rogoff and Stavrakeva (2009), the new asymptotic out-of-sample tests, such as the CW, are easy to use, but bootstrapped DMW out-of-sample tests and the traditional Theil's U test remain more powerful and better sized.

The Clark and West (2007) test for equal predictability has been widely used in the last few years to assess the predictive ability of various models. Studies that use CW tests and provide some evidence of short-run and long-run exchange rate predictability include Clark and West (2006), using the uncovered interest rate parity (UIRP) model; Gourinchas and Rey (2007), using the ratio of net exports to net foreign assets to predict exchange rate movements; Molodtsova and Papell (2008) and Molodtsova, Nikolsko-Rzhevskyy, and Papell (2008, 2009), basing their model on Taylor rule fundamentals; and Engel, Mark and West (2007), considering panel data with monetary and PPP fundamentals.
Rossi and Sekhposyan (2008) show that the predictive ability of a variety of models that aim at predicting future industrial production growth or inflation varies through time. Alquist and Kilian (2008) show that the price of crude oil futures is not the most accurate predictor of the spot price of crude oil in practice, no-change forecasts being more accurate.

The above equal predictability tests, whether designed for non-nested or nested models, test the null that a series follows a martingale difference against the alternative that the series is linearly predictable. To our knowledge, there are no equal predictability tests specifically designed for cases where the alternative is nonlinear. Moreover, there is no evidence that the tests for equal predictability developed for linear models have good size and power properties when the alternative is nonlinear. 4 This is especially important because, in recent years, nonlinear models have become more common for modeling and forecasting economic and financial series. Linear models might be a relatively poor way of capturing certain types of economic behavior, or economic performance at certain times. For instance, output growth can be characterized by the presence of two or more regimes (recessions and expansions), as can financial variables (periods of high and low volatility), nominal exchange rates (times of appreciation and depreciation) and real exchange rates, where transaction costs could give rise to a band of inactivity in which arbitrage is not profitable, so that deviations of the real exchange rate from purchasing power parity are not corrected inside the band. This trend has brought with it an increased interest in forecasting economic variables with nonlinear models. 5

Several papers have used nonlinear models to model and forecast economic and financial series. Tiao and Tsay (1994) and Rothman (1998) argue that improved forecasting performance can be achieved by using nonlinear time series models for US real GNP and the US unemployment rate, respectively, finding support for the claim that a threshold forecast can improve on linear forecasts. Engel (1994) and Nikolsko-Rzhevskyy and Prodan (2010) investigate whether a Markov-switching model is a useful specification for out-of-sample nominal exchange rate predictability, and the latter find evidence of both short-run and long-run predictability for monthly exchange rates. Rapach and Wohar (2005) analyze the out-of-sample forecasting performance of two nonlinear models (Band-TAR and ESTAR) of U.S. dollar real exchange rate behavior for four countries and find that, in a few cases, the nonlinear models outperform simple linear autoregressive models.

1 McCracken (2007) provides simulation results for a class of linear models.
2 For simplicity, in the rest of the paper we loosely refer to the martingale process as a random walk.
3 The test statistic takes into account that, under the null, the sample MSPE of the alternative model is expected to be greater than that of the random walk model, and adjusts for the upward shift in the sample MSPE of the alternative model.
A large literature has previously focused on nonlinear economic forecasting, surveying different ways to compute multi-step forecasts from nonlinear models and comparing the forecasting performance of linear and nonlinear models. Recent papers include Teräsvirta (2006), Teräsvirta, van Dijk and Medeiros (2005), Marcellino (2002), Stock and Watson (1999) and Ramsey (1996). However, none of these studies discusses in detail the optimal forecast evaluation method for nonlinear models.

This paper discusses the current state of the art in linear and nonlinear forecast evaluation, outlines a number of open issues and provides some solutions. We evaluate the properties of several tests of equal predictability in multi-step forecasting with linear and nonlinear models, in order to provide practical guidance to forecasters. Our contribution is threefold. First, we provide guidance on which test (DMW, CW or the more traditional Theil's U) is the most reliable in terms of size and power when the alternative is a nonlinear model (Threshold Autoregressive and Exponential STAR models). Second, we investigate the properties of the tests across two variance estimators (Newey-West and Hodrick), and we argue that the Hodrick estimator has better size properties when using standard critical values but the Newey-West estimator has better power properties when bootstrapping critical values. Third, we extend the linear multiple-step forecasting results of Clark and West (2006), using a more general data generating process and the two variance estimators mentioned above.

We use Monte Carlo simulation methods to compute the empirical size and empirical power/size-adjusted power functions for different values of the ratio between prediction and estimation sample sizes and various degrees of persistence of the data generating process. Our findings are as follows. When using standard normal critical values, for both linear and nonlinear models, at all forecasting horizons, the CW test with Hodrick variance estimation has an appropriate size. The power of the test decreases as the forecasting horizon increases and as the number of predictions (P) becomes smaller. When bootstrapping critical values, all the tests discussed above have an appropriate size. For both the linear and nonlinear models, the tests using the Newey-West variance estimation have the highest power: for the linear models, the CW test has the best power properties, and for the nonlinear models, the DMW test has the best power properties.

4 Clark and West (2006) claim that nonlinear models, specifically the Markov-switching model, can be easily accommodated when using the CW test for equal predictability.
5 For accounts of this topic, see Tsay (2002) and Clements, Franses and Swanson (2004).

2. The Models

Without a doubt, the most popular statistical linear forecasting model is the simple autoregressive model, AR(1), given by:

(1)

We consider two nonlinear models, the Threshold Autoregressive and the Exponential STAR models. Although there are many types of regime-switching models, we consider the threshold autoregressive (TAR) model.
The TAR model allows for a number of different regimes, with a separate autoregressive model in each regime. We focus on the simple two-regime TAR(1) model:

(2)

(3)

The model involves a threshold value, a delay parameter, and a Heaviside indicator function.
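For orientation, a common textbook way of writing an AR(1) model and a two-regime TAR(1) model with a Heaviside indicator is given below. The symbols (alpha, rho, tau for the threshold, d for the delay, I_t for the indicator, epsilon_t for the disturbance) are illustrative assumptions and need not coincide with the authors' notation in Equations (1) to (3).

```latex
\begin{align}
y_t &= \alpha + \rho\, y_{t-1} + \varepsilon_t, \\
y_t &= I_t\,(\alpha_1 + \rho_1 y_{t-1}) + (1 - I_t)\,(\alpha_2 + \rho_2 y_{t-1}) + \varepsilon_t, \\
I_t &= \begin{cases} 1 & \text{if } y_{t-d} > \tau, \\ 0 & \text{if } y_{t-d} \le \tau. \end{cases}
\end{align}
```

Under this form, setting alpha_1 = alpha_2 and rho_1 = rho_2 makes the indicator irrelevant and the TAR(1) model collapses to the AR(1) model.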

The nature of the TAR model is that there are two states of the world, which we call high and low. In the high state, the series (lagged by the delay parameter) exceeds the value of the threshold, so that the indicator equals 1 and the series follows the corresponding autoregressive process. Similarly, in the low state the lagged series does not exceed the threshold, so that the indicator equals 0 and the series follows the other autoregressive process. Although the series is linear in each regime, the possibility of regime switching means that the entire sequence is nonlinear. Note that if the coefficients are identical across the two regimes, the TAR(1) model is equivalent to an AR(1) model. The TAR(1) model permits us to estimate the value of the threshold without imposing an a priori line of demarcation between the regimes. The key feature of these models is that a sufficiently large shock can cause the system to switch between regimes. The dates at which the series crosses the threshold are not specified beforehand by the researcher. 6

In contrast to the discrete regime switching that characterizes the TAR(1) model, the exponential STAR (ESTAR) model proposed by Granger and Teräsvirta (1993) allows for smooth adjustment, so that the speed of adjustment varies with the extent of the deviation from the mean. We use the following ESTAR(1) model:

(4)

where one of the parameters is the long-run equilibrium level of the series. The series behaves as a random walk in the inner regime, near the long-run equilibrium. The speed of mean reversion increases gradually as the series moves further away from the long-run equilibrium. We implement the nonlinear least squares estimation by setting the delay parameter equal to 1. Our experiments using different starting values for the parameters yield similar results, suggesting that a global optimum has been located.

3. Size and power simulations

We use Monte Carlo simulations for the simple AR(1) model and for the nonlinear TAR(1) and ESTAR(1) processes to evaluate the finite-sample size and power of the predictability tests. We first calculate multi-step forecasts for the above models and then evaluate the following predictability tests: the Clark and West, Diebold-Mariano-West and Theil's U tests.
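To make the TAR(1) specification of Section 2 concrete, the following minimal sketch simulates a two-regime TAR(1) series and estimates the threshold by a grid search that minimises the pooled sum of squared residuals, with 15% trimming on each side as in footnote 6. The function names, the N(0, 1) errors, the delay of 1 and the parameter values in the usage note are assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np

def simulate_tar1(n, a_hi, rho_hi, a_lo, rho_lo, tau, burn=100, seed=0):
    """Simulate a two-regime TAR(1) with delay 1 and N(0, 1) errors (assumed)."""
    rng = np.random.default_rng(seed)
    y = np.zeros(n + burn)
    eps = rng.standard_normal(n + burn)
    for t in range(1, n + burn):
        if y[t - 1] > tau:                       # high regime
            y[t] = a_hi + rho_hi * y[t - 1] + eps[t]
        else:                                    # low regime
            y[t] = a_lo + rho_lo * y[t - 1] + eps[t]
    return y[burn:]                              # drop the burn-in observations

def fit_tar1(y, trim=0.15):
    """Grid search over candidate thresholds; returns (threshold, coefficients)."""
    y_lag, y_now = y[:-1], y[1:]
    candidates = np.sort(y_lag)
    lo = int(trim * len(candidates))
    hi = int((1 - trim) * len(candidates))
    best = None
    for tau in candidates[lo:hi]:                # skip the lowest and highest 15%
        high = (y_lag > tau).astype(float)
        # One intercept and one slope per regime; a single pooled regression
        # keeps the error variance common to both regimes.
        X = np.column_stack([high, high * y_lag, 1 - high, (1 - high) * y_lag])
        coef, *_ = np.linalg.lstsq(X, y_now, rcond=None)
        ssr = float(np.sum((y_now - X @ coef) ** 2))
        if best is None or ssr < best[0]:
            best = (ssr, tau, coef)
    return best[1], best[2]
```

For example, `fit_tar1(simulate_tar1(500, 0.0, 0.9, 0.0, 0.3, 0.0))` returns a threshold estimate together with the four regime coefficients (high intercept, high slope, low intercept, low slope).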
For each of these tests we use both the Newey-West and Hodrick variance-covariance matrix adjustments. 7 We consider experiments in which we evaluate empirical size, empirical power and size-adjusted power.

6 A grid search over all potential values of the threshold yields a superconsistent estimate of the unknown threshold parameter. We follow the conventional practice of excluding the highest and lowest 15% of the potential values to ensure an adequate number of observations on each side of the threshold. Note that our TAR model constrains the variance of the error term to be identical across the regimes.
7 A detailed description is presented in Appendix A.

3.1. The linear and nonlinear DGPs

The linear DGP is the simple AR(1) from Equation (1), where = 0 and when evaluating the size and = 0 and 0 when evaluating the power. The null forecast (model 1) is the no-change forecast of 0 for all t (random walk). The alternative forecast (model 2) is obtained from the AR(1) regression, using a rolling window of R = 120 observations. The level of persistence is measured by the autoregressive coefficient = 0.95, and the residuals are drawn randomly from a normal distribution. The number of simulations is 1000.

We generate a TAR(1) process in the form of Equations (2) and (3). We set,, = 0 when evaluating the size and we set, = 0 and, when evaluating the power. As previously, the null forecast (model 1) is a random walk. For the alternative forecast (model 2) we set,, the threshold, the delay parameter and the residuals are We chose these examples because most macroeconomic and financial data are highly persistent. We report results for 10% and 5% nominal sizes and we investigate the relationship between the empirical and the nominal test size.

3.2. Forecasting: Linear and Nonlinear models

Consider the time series of interest, and suppose that we want to forecast subsequent values of the series conditional on current and past observations. Suppose that the data generating process for the series is given by

(5)

where the disturbance is zero-mean white noise and the functional form f( ) is the linear model described above. For any period t, the conditional mean of the series h steps ahead is given by

(6)

where we allow the forecast horizon, h, to run from 1 to 36. For the AR(1) model, it is straightforward to obtain the h-step-ahead forecasts recursively because the functional form f( ) is linear.
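A minimal sketch of the two forecasting schemes of this subsection may help: the linear recursion for the AR(1) model, and the simulation-based scheme for the TAR model described in the following paragraph (residuals drawn with replacement, many simulated paths, averages across paths). The function names, the parameter tuple and the delay of 1 are assumptions made for illustration.

```python
import numpy as np

def ar1_forecasts(y_last, alpha, rho, horizons):
    """Iterated AR(1) forecasts of y_{t+h} given y_t, for each h in `horizons`."""
    out = {}
    f = y_last
    for h in range(1, max(horizons) + 1):
        f = alpha + rho * f                  # plug the previous forecast back in
        if h in horizons:
            out[h] = f
    return out

def tar_forecasts(y_last, params, resid, max_h=36, n_paths=1000, seed=0):
    """Simulation-based TAR(1) forecasts: averages of bootstrapped future paths."""
    a_hi, rho_hi, a_lo, rho_lo, tau = params
    rng = np.random.default_rng(seed)
    # Draw residuals with replacement, one row of max_h shocks per path.
    shocks = rng.choice(resid, size=(n_paths, max_h), replace=True)
    y = np.full(n_paths, float(y_last))
    means = np.empty(max_h)
    for i in range(max_h):
        high = y > tau                       # regime set by the previous value
        y = np.where(high, a_hi + rho_hi * y, a_lo + rho_lo * y) + shocks[:, i]
        means[i] = y.mean()                  # approximates the (i+1)-step forecast
    return means
```

The averaging is needed because, in a nonlinear model, the expected value of a future observation is not obtained by iterating the regime equations on the point forecast: which regime applies depends on the realised shocks along each path.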

On the other hand, forecasting with the TAR model is a nontrivial task. As analyzed in Koop et al. (1996), the iterated projections from a nonlinear model are state-dependent. To construct multi-period forecasts, we use the method described in Enders (2004). Specifically, we select 36 randomly drawn realizations of the residuals of Equation (2), such that the residuals are drawn with replacement using a uniform distribution. We call these the bootstrapped residuals. We then generate the one-step through 36-step-ahead values of the series by substituting these bootstrapped residuals into Equation (2) and setting the indicator appropriately for the high or low state. For this particular history, we repeat the process 1000 times. The Law of Large Numbers guarantees that the sample means of the simulated values converge to the true conditional i-step-ahead forecasts. The essential point is that the sample averages of the simulated values yield the one-step through 36-step-ahead conditional forecasts of the simulated series.

We consider rolling regressions to obtain multi-step-ahead forecasts from every estimated model. We match some of the settings that appear in empirical research, specifically Clark and West (2006). We initially fix the window size (R) at 120 months, and the number of predictions (P) is set to 48, 96 and 144. We therefore generate a total of R+P observations. Using the window of R observations, rolling multi-step-ahead predictions of horizon h are formed for observations R+h through R+P (where h = 1 to 36 months). The total number of predictions is P - h + 1. Multi-step forecasts constructed in this way involve overlapping data, so we control for that using both the Newey-West and Hodrick variance-covariance matrix adjustments. We provide technical details in Appendix A.

3.3. Procedures for inference

The forecasts are used to compute the out-of-sample mean squared prediction errors (MSPEs) of our models.
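As an illustration of the MSPE comparison, the sketch below computes the unadjusted (DMW-type) and adjusted (Clark and West, 2006) t-statistics when the null forecast is zero, using a Bartlett-kernel Newey-West long-run variance. The function names and the lag choice of h - 1 are assumptions made for this sketch; the paper's own variance adjustments (including the Hodrick version, not shown here) are described in its Appendix A.

```python
import numpy as np

def newey_west_lrv(x, lags):
    """Bartlett-kernel (Newey-West) long-run variance of a 1-D series."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)
    lrv = np.sum(x * x) / n
    for j in range(1, lags + 1):
        w = 1.0 - j / (lags + 1.0)           # Bartlett weight
        lrv += 2.0 * w * np.sum(x[j:] * x[:-j]) / n
    return lrv

def mspe_tests(y, yhat, h=1):
    """t-statistics comparing a zero forecast (model 1) with forecasts yhat (model 2)."""
    y = np.asarray(y, dtype=float)
    yhat = np.asarray(yhat, dtype=float)
    e1_sq = y ** 2                           # model 1 forecasts 0, so the error is y
    e2_sq = (y - yhat) ** 2                  # squared error of model 2
    adj = yhat ** 2                          # squared gap between the two forecasts
    f_normal = e1_sq - e2_sq                 # MSPE difference, no adjustment
    f_adjusted = e1_sq - (e2_sq - adj)       # Clark-West adjusted difference
    n = len(y)
    out = {}
    for name, f in (("MSPE-normal", f_normal), ("MSPE-adjusted", f_adjusted)):
        out[name] = np.sqrt(n) * f.mean() / np.sqrt(newey_west_lrv(f, h - 1))
    return out
```

Each statistic is then compared with a one-sided critical value (for example 1.282 for a 10% test under the standard normal, or a bootstrap percentile). With a zero null forecast, the adjusted series simplifies to 2 * y * yhat, which is why the adjusted statistic is centred higher than the unadjusted one under the null.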
We first calculate multi-step forecasts for the above models and then evaluate the following predictability tests: the Clark and West test (the MSPE-adjusted test in the tables), 8 the unadjusted MSPE test with standard normal critical values, commonly referred to as the Diebold-Mariano-West test (the MSPE-normal test in the tables), and the Theil's U test (the TU test in the tables). 9 First, to evaluate the empirical size and the empirical unadjusted power, we compare the MSPE-adjusted and the MSPE-normal statistics with standard normal critical values. Second, following the argument of Rogoff and Stavrakeva (2009) that the new asymptotic out-of-sample tests, such as the CW, are easy to use but bootstrapped DMW out-of-sample tests and the traditional Theil's U test remain more powerful and better sized, we also consider bootstrap versions of the above tests and the Theil's U test (TU test). 10 Bootstrap critical values are computed using a wild bootstrap procedure, imposing the null of no predictability of our series. We employ a simple parametric bootstrap method. Specifically, we take the following steps:

(1) We generate the simple AR(1) process y_t = a_1 y_{t-1} + e_t, with a_1 = 1 for the size experiments and a_1 = 0.95 for the power experiments;
(2) We take the first difference of the series and run a regression of dy_t on a constant and y_t;
(3) We compute the MSPE-normal and MSPE-adjusted t-statistics using rolling forecasts from the null and the alternative models;
(4) We assess the 10% one-sided critical value by taking the 90th percentile of the distribution of the t-statistics.

For each of the above statistics (MSPE-adjusted, MSPE-normal and TU) we control for the serial correlation due to the overlapping data, using both the Newey-West and Hodrick variance-covariance matrix adjustments. We therefore analyze six statistics: MSPE-Adjusted-NW, MSPE-Adjusted-HD, MSPE-Normal-NW, MSPE-Normal-HD, TU-NW and TU-HD. For a detailed description of how to calculate each test statistic and how to determine statistical significance, see Appendix B.

8 For the remainder of our paper, MSPE refers to an out-of-sample statistic.
9 We follow the notation of Clark and West (2006) for comparison purposes.

4. Simulation Results

The results are reported in Tables 1 to 4. We discuss the empirical size for a 10% nominal size and focus mainly on the case where P=144. 11

4.1. The linear model: Size and Power

Size results for the linear model are presented in Table 1.1. For a nominal size of 10%, P=144 and standard normal critical values, the MSPE-Adjusted-NW statistic is oversized at all horizons (ranging from 13.5% at the 3-step horizon to 30.8% at the 36-step horizon) except for the 1-step horizon, where it is slightly undersized (9%). On the other hand, the MSPE-Adjusted-HD statistic is correctly sized at most horizons (slightly oversized at longer horizons, with an average empirical size of 12%).
As expected and documented in the previous literature, when the models are nested, both MSPE-Normal-NW and MSPE-Normal-HD are strongly undersized at all horizons (with values ranging between 0.3% and 6.7%), with the exception that MSPE-Normal-NW becomes oversized at long horizons (14% empirical size at the 36-step horizon). The size varies systematically with P: the oversizing of the tests using the NW covariance estimator seems to increase for smaller values of P, and for P = 48 the MSPE-Adjusted-HD statistic becomes oversized at all horizons (ranging from 11.8% at the 1-step horizon to 31.8% at the 24-step horizon). 12 Also, the size is positively related to the forecasting horizon, the oversizing becoming more severe at longer forecasting horizons. In conclusion, for nested models, for all P values and forecasting horizons, the MSPE-Adjusted-HD test has the best size.

10 We do not present results for the McCracken (2004) predictability test, as he has supplied critical values only for one-step forecasts and not for multiple-step forecasts.
11 Clark and West (2006) argue that, in order to match common settings that appear in empirical research, one should focus on cases where P is large relative to R.

Power results for the linear model are presented in Table 2.1. We study the finite-sample power of the above tests using both standard normal critical values (unadjusted power) and bootstrap critical values (size-adjusted power) at a nominal size of 10%. For the unadjusted power, using standard normal critical values, we do not discuss the predictability tests that have poor size properties. We therefore focus only on the MSPE-Adjusted-HD test. The power varies systematically across both the forecasting horizons and the P values. First, the power increases, at all forecasting horizons, as P increases. 13 For instance, at the 1-step horizon, the power increases from 27.6% for P=48 to 43.0% for P=144. Second, the power generally decreases as the forecasting horizon becomes longer; for P=144, the power decreases from 43.0% (1-step ahead) to 25.9% (36-steps ahead) in a nonlinear manner.

For the adjusted power, using bootstrap critical values, we discuss all the above tests, as by construction they all have correct size. For all P values, the MSPE-Adjusted-NW test has the highest power. For P=144, the power is very similar from the 1-step to the 12-step horizon (an average of 47.0%) and becomes slightly smaller at longer horizons (41.7% at the 36-step horizon). Also, note that although MSPE-Adjusted-NW has systematically higher power than the MSPE-Adjusted-HD test, the largest difference between the two tests is observed at longer horizons (41.7% versus 20.9% at the 36-step horizon).
Taking into account the argument of Rogoff and Stavrakeva (2009) that the bootstrapped DMW out-of-sample tests remain more powerful and better sized than the new CW tests, we check the power of the MSPE-Normal tests. Once again, the test using the NW covariance estimator has better power at all forecasting horizons: at the 36-step horizon, the power of the MSPE-Normal-NW test is 37.4%, versus 22.5% for the MSPE-Normal-HD test. These results remain consistent for all P values. We conclude that the most powerful test is the MSPE-Adjusted-NW, but when using bootstrapped DMW tests, one should choose the MSPE-Normal-NW test.

12 We cannot compute MSPE-Adjusted-HD for the 36-step horizon, as the Hodrick estimator requires P ≥ 2h.
13 The only exception is at the 24-step horizon, where the power is higher for P=48 than for P=96 or P=144. This is probably related to small-sample bias.

4.2. The nonlinear model: Size and Power

Size results for the nonlinear model are presented in Table 3.1. Our findings are broadly similar to those obtained when the alternative is linear. For a nominal size of 10%, P=144 and standard normal critical values, the MSPE-Adjusted-NW statistic is oversized at all horizons (ranging from 12% at the 3-step horizon to 19% at the 36-step horizon) except for the 1-step horizon, where it is slightly undersized (9.3%). When using the NW covariance estimator, the oversizing seems to be smaller when the alternative is nonlinear than when it is linear. The MSPE-Adjusted-HD statistic is correctly sized at most horizons. As before, both MSPE-Normal-NW and MSPE-Normal-HD are severely undersized at all forecasting horizons, with values ranging between 0.0% and 8.9% for the MSPE-Normal-NW test and between 0.0% and 3.8% for the MSPE-Normal-HD test. The size varies systematically with P: the oversizing of the tests using the NW covariance estimator seems to increase for smaller values of P, and for P = 48 the MSPE-Adjusted-HD statistic becomes oversized at all horizons. Also, the size is positively related to the forecasting horizon, the oversizing becoming more severe at longer forecasting horizons. In all cases, as when the alternative is linear, the MSPE-Adjusted-HD test has the best size.

Power results are presented in Table 4.1. We study the finite-sample power of the above tests using both standard normal critical values (unadjusted power) and bootstrap critical values (size-adjusted power) at a nominal size of 10%. For the unadjusted power, using standard normal critical values, we do not discuss the predictability tests that have poor size properties. We therefore focus only on the MSPE-Adjusted-HD test: overall, the power of the test is lower in all cases (P values and forecasting horizons) when the alternative is nonlinear than when it is linear. The power varies systematically across both the P values and the forecasting horizons. First, as in the linear case, the power increases, at all forecasting horizons, as P increases. For instance, at the 1-step horizon, the power increases from 21.1% for P=48 to 27.8% for P=144.
Second, the power generally decreases as the forecasting horizon becomes longer; for P=144, the power decreases from 27.8% (1-step ahead) to 15.8% (36-steps ahead).

For the adjusted power, using bootstrap critical values, we discuss all the above tests, as by construction they all have correct size. Once again, the power is lower in all cases (P values and forecasting horizons) when the alternative is nonlinear than when it is linear. For all P values, at short forecasting horizons (1 and 3 steps) all our tests are similar in terms of power, with both MSPE-Normal tests being slightly more powerful. At longer forecasting horizons (6 to 36 steps) the MSPE-Adjusted-NW test is more powerful than all the other tests in most cases. However, the performance of the two tests using the NW covariance estimator (MSPE-Adjusted-NW and MSPE-Normal-NW) is similar. In conclusion, we argue that, when forecasting with nonlinear models at short horizons (1 to 3 steps), one should use the MSPE-Normal-NW test, whereas the MSPE-Adjusted-NW test should be used when forecasting at long horizons. When using bootstrapped DMW tests, one should use the MSPE-Normal-NW test.

5. Conclusion

A large literature has previously focused on nonlinear economic forecasting. However, none of the previous studies has discussed the optimal forecast evaluation method for nonlinear models. Moreover, there has been no consensus on the appropriate forecast evaluation method and covariance estimator for multi-step forecasting, whether with linear or nonlinear models. We discuss the current state of the art in multi-step linear and nonlinear forecasting, evaluate the properties of several tests of equal predictability and provide practical guidance to forecasters. Specifically, we analyze the DMW, CW and Theil's U tests, using standard normal and bootstrap critical values and two variance estimators (Newey-West and Hodrick). Our alternatives are a linear model and nonlinear models (Threshold Autoregressive and Exponential STAR). We use Monte Carlo simulation methods to compute the empirical size and empirical power/size-adjusted power functions for different values of the ratio between prediction and estimation sample sizes and various degrees of persistence of the data generating process.

A summary is as follows. When using standard normal critical values, for both linear and nonlinear models, at all forecasting horizons, the CW test with Hodrick variance estimation has an appropriate size. The power of the test decreases as the forecasting horizon increases and as the number of predictions (P) becomes smaller. When bootstrapping critical values, all the tests discussed above have an appropriate size. For both the linear and nonlinear models, the tests using the Newey-West variance estimation have the highest power: for the linear models, the CW test has the best power properties, and for the nonlinear models, the DMW test has the best power properties.

References

Alquist, Ron & Kilian, Lutz, 2007. "What Do We Learn from the Price of Crude Oil Futures?," CEPR Discussion Papers 6548, C.E.P.R. Discussion Papers.
Clark, Todd E. & West, Kenneth D., 2006. "Using out-of-sample mean squared prediction errors to test the martingale difference hypothesis," Journal of Econometrics, Elsevier, vol. 135(1-2), pages 155-186.
Diebold, Francis & Mariano, Roberto, 1995. "Comparing Predictive Accuracy," Journal of Business & Economic Statistics, American Statistical Association, vol. 13(3), pages 253-63, July.
Engel, Charles, 1994. "Can the Markov switching model forecast exchange rates?," Journal of International Economics, Elsevier, vol. 36(1-2), pages 151-165, February.
Gourinchas, Pierre-Olivier & Rey, Helene, 2007. "International Financial Adjustment," Journal of Political Economy, forthcoming.
Hamilton, James, 1989. "A New Approach to the Economic Analysis of Nonstationary Time Series and the Business Cycle," Econometrica, Econometric Society, vol. 57(2), pages 357-84, March.
Koop, Gary, Pesaran, M. Hashem & Potter, Simon, 1996. "Impulse Response Analysis in Nonlinear Multivariate Models," Journal of Econometrics, vol. 74, pages 119-147.
McCracken, Michael W., 2007. "Asymptotics for out of sample tests of Granger causality," Journal of Econometrics, Elsevier, vol. 140(2), pages 719-752, October.
Molodtsova, Tanya & Papell, David, 2008. "Out-of-Sample Exchange Rate Predictability with Taylor Rule Fundamentals," Journal of International Economics, forthcoming.
Nikolsko-Rzhevskyy, Alex & Prodan, Ruxandra, 2009. "New Evidence of Exchange Rates Predictability Using the Long Swings Model," working paper.
Rapach, David E. & Wohar, Mark E., 2005. "The Out-of-Sample Forecasting Performance of Nonlinear Models of Real Exchange Rate Behavior," International Journal of Forecasting, vol. 22(2), April-June 2006, pages 341-361.
Rogoff, Kenneth S. & Stavrakeva, Vania, 2009. "The Continuing Puzzle of Short Horizon Exchange Rate Forecasting," NBER Working Papers 14071, National Bureau of Economic Research, Inc.
Rothman, Philip, 1991. "Further evidence on the asymmetric behavior of unemployment rates over the business cycle," Journal of Macroeconomics, Elsevier, vol. 13(2), pages 291-298.
Sekhposyan, Tatevik & Rossi, Barbara, 2009. "Has Economic Models' Forecasting Performance for US Output Growth and Inflation Changed Over Time, and When?," Working Papers 09-06, Duke University, Department of Economics.
Teräsvirta, T., 1994. "Specification, Estimation, and Evaluation of Smooth Transition Autoregressive Models," Journal of the American Statistical Association, vol. 89, pages 208-218.
West, Kenneth D., 1996. "Asymptotic Inference about Predictive Ability," Econometrica, vol. 64, pages 1067-1084.

Table 1. Empirical Size for the linear model

Table 1.1. Empirical Size for the linear model (Nominal Size = 10%)

Horizon h               1      3      6     12     24     36
A: R=120, P=48
MSPE-Adjusted-NW    11.8%  19.9%  26.0%  33.0%  36.9%  41.6%
MSPE-Adjusted-HD    11.8%  12.7%  12.4%  13.9%  31.8%      -
MSPE-Normal-NW       4.2%   9.5%  13.0%  19.0%  26.9%  32.5%
MSPE-Normal-HD       3.8%   4.4%   5.4%   6.8%  23.5%      -
B: R=120, P=96
MSPE-Adjusted-NW     8.6%  14.3%  17.1%  22.9%  29.7%  31.9%
MSPE-Adjusted-HD     8.6%   9.5%   9.2%  10.7%  12.0%  14.4%
MSPE-Normal-NW       1.2%   3.3%   5.5%   9.9%  15.1%  19.8%
MSPE-Normal-HD       1.0%   2.2%   2.2%   2.6%   3.5%   6.8%
C: R=120, P=144
MSPE-Adjusted-NW     9.0%  13.5%  16.4%  20.8%  27.9%  30.8%
MSPE-Adjusted-HD     9.0%  10.1%  10.8%  12.3%  12.9%  12.7%
MSPE-Normal-NW       0.4%   1.8%   3.6%   6.7%  12.1%  14.0%
MSPE-Normal-HD       0.3%   1.0%   1.3%   1.5%   3.4%   3.1%

Table 1.2. Empirical Size for the linear model (Nominal Size = 5%)

Horizon h               1      3      6     12     24     36
A: R=120, P=48
MSPE-Adjusted-NW     6.5%  13.4%  19.4%  25.4%  32.0%  37.4%
MSPE-Adjusted-HD     6.5%   6.5%   6.9%   8.0%  28.2%      -
MSPE-Normal-NW       1.9%   5.8%   9.7%  14.2%  23.5%  29.2%
MSPE-Normal-HD       1.8%   1.8%   2.4%   3.0%  20.8%      -
B: R=120, P=96
MSPE-Adjusted-NW     4.2%   9.3%  12.3%  17.4%  23.4%  26.1%
MSPE-Adjusted-HD     4.2%   5.0%   5.7%   5.1%   5.8%   8.9%
MSPE-Normal-NW       0.6%   2.2%   3.9%   7.2%  12.4%  15.5%
MSPE-Normal-HD       0.5%   1.0%   1.2%   1.3%   1.6%   4.3%
C: R=120, P=144
MSPE-Adjusted-NW     5.2%   8.4%  11.5%  15.4%  22.6%  25.3%
MSPE-Adjusted-HD     5.2%   5.8%   6.4%   5.4%   6.3%   6.8%
MSPE-Normal-NW       0.2%   1.2%   2.2%   4.6%   9.1%  11.3%
MSPE-Normal-HD       0.2%   0.1%   0.2%   0.4%   1.6%   1.9%

Table 2. Empirical Power for the linear model

Table 2.1. Unadjusted and Size-Adjusted Power (Nominal Size = 10%)

                  Unadjusted Power                           |  Size-Adjusted Power
Horizon h               1      3      6     12     24     36 |      1      3      6     12     24     36
A: R=120, P=48
MSPE-Adjusted-NW    27.6%  43.7%  51.6%  58.7%  62.6%  60.9% |  21.8%  24.2%  23.1%  17.8%  18.4%  10.6%
MSPE-Adjusted-HD    27.6%  27.7%  27.2%  25.7%  45.8%      - |  21.8%  23.0%  22.3%  18.6%  16.5%      -
MSPE-Normal-NW       8.7%  19.1%  27.4%  36.4%  45.9%  46.8% |  19.4%  20.1%  21.2%  19.2%  18.8%  11.7%
MSPE-Normal-HD       8.1%   8.0%   9.7%  11.2%  33.2%      - |  19.2%  18.6%  18.4%  16.9%  14.4%      -
B: R=120, P=96
MSPE-Adjusted-NW    33.9%  49.2%  55.5%  65.9%  71.0%  71.4% |  37.0%  38.6%  37.9%  37.6%  31.0%  26.1%
MSPE-Adjusted-HD    33.9%  36.0%  35.3%  33.3%  28.6%  26.7% |  37.0%  37.2%  36.8%  31.3%  23.1%  20.3%
MSPE-Normal-NW       5.0%  12.4%  19.7%  33.9%  45.2%  44.7% |  31.0%  31.3%  32.6%  34.7%  29.4%  28.0%
MSPE-Normal-HD       4.7%   5.4%   7.2%   8.7%   9.4%  11.4% |  30.8%  31.2%  31.1%  29.8%  22.3%  18.4%
C: R=120, P=144
MSPE-Adjusted-NW    43.0%  54.4%  59.7%  70.1%  77.9%  76.2% |  46.2%  46.7%  47.9%  48.0%  42.5%  41.7%
MSPE-Adjusted-HD    43.0%  43.1%  43.2%  43.2%  34.5%  25.9% |  46.2%  42.9%  40.8%  39.1%  27.3%  20.9%
MSPE-Normal-NW       3.6%   9.0%  17.0%  33.6%  45.0%  48.4% |  38.6%  39.4%  41.0%  40.9%  40.3%  37.4%
MSPE-Normal-HD       3.3%   3.9%   7.4%   8.9%   9.4%   8.3% |  38.6%  39.2%  38.0%  38.2%  29.7%  22.5%

Table 2.2. Unadjusted and Size-Adjusted Power (Nominal Size = 5%)

                  Unadjusted Power                           |  Size-Adjusted Power
Horizon h               1      3      6     12     24     36 |      1      3      6     12     24     36
A: R=120, P=48
MSPE-Adjusted-NW    15.6%  31.2%  41.8%  48.3%  54.3%  53.1% |  12.8%  12.9%  13.1%   9.7%   8.9%   3.9%
MSPE-Adjusted-HD    15.6%  15.2%  15.0%  14.5%  40.5%      - |  12.8%  12.5%  11.5%  10.8%   6.8%      -
MSPE-Normal-NW       4.0%  12.3%  20.2%  28.1%  38.1%  39.9% |  11.0%  10.8%  11.7%   8.5%   8.6%   5.4%
MSPE-Normal-HD       3.2%   3.2%   4.1%   6.5%  28.2%      - |  10.7%   9.2%   9.4%   8.9%   7.3%      -
B: R=120, P=96
MSPE-Adjusted-NW    21.8%  36.7%  45.7%  59.0%  61.1%  62.8% |  25.0%  23.5%  21.8%  22.4%  17.9%  12.8%
MSPE-Adjusted-HD    21.8%  21.1%  20.4%  17.8%  14.5%  18.7% |  25.0%  22.5%  16.9%  17.8%  13.2%   7.3%
MSPE-Normal-NW       1.3%   6.8%  13.4%  26.1%  37.1%  36.7% |  18.3%  18.9%  17.4%  17.6%  18.2%  13.3%
MSPE-Normal-HD       1.0%   2.5%   2.5%   3.7%   3.8%   5.9% |  18.1%  17.5%  14.1%  16.2%  13.7%   7.1%
C: R=120, P=144
MSPE-Adjusted-NW    25.7%  42.3%  50.5%  61.6%  70.1%  68.5% |  25.5%  28.5%  30.5%  33.0%  26.2%  24.8%
MSPE-Adjusted-HD    25.7%  28.5%  27.2%  27.0%  18.0%  14.0% |  25.5%  26.6%  23.3%  25.9%  15.6%  10.7%
MSPE-Normal-NW       1.0%   5.2%  11.5%  26.1%  37.5%  39.1% |  19.9%  22.2%  25.9%  28.0%  23.4%  21.8%
MSPE-Normal-HD       1.0%   1.6%   3.0%   4.5%   3.6%   3.5% |  19.7%  21.4%  23.6%  21.9%  14.7%  11.3%

Table 3. Empirical Size for the SETAR model

Table 3-1. SETAR Empirical Size (Nominal Size = 10%)

Horizon h               1      3      6     12     24     36
A: R=120, P=48
MSPE-Adjusted-NW    12.5%  18.6%  22.8%  26.6%  33.0%  34.6%
MSPE-Adjusted-HD    12.5%  12.6%  13.0%  12.4%  30.0%      -
MSPE-Normal-NW       1.6%   6.3%  11.1%  16.1%  25.1%  28.1%
MSPE-Normal-HD       1.4%   2.6%   4.3%   7.0%  21.2%      -
B: R=120, P=96
MSPE-Adjusted-NW    10.9%  14.2%  15.7%  19.4%  22.4%  27.8%
MSPE-Adjusted-HD    10.9%  11.1%  10.6%  10.8%  12.1%  14.9%
MSPE-Normal-NW       0.3%   1.9%   5.0%   8.8%  13.4%  16.6%
MSPE-Normal-HD       0.2%   1.0%   1.4%   2.8%   3.7%   7.6%
C: R=120, P=144
MSPE-Adjusted-NW     9.3%  12.0%  12.7%  13.8%  17.4%  19.0%
MSPE-Adjusted-HD     9.3%  11.0%   9.0%   8.7%   9.4%  12.3%
MSPE-Normal-NW       0.0%   1.5%   2.4%   3.8%   7.2%   8.9%
MSPE-Normal-HD       0.0%   0.6%   0.8%   2.1%   2.8%   3.8%

Table 3-2. SETAR Empirical Size (Nominal Size = 5%)

Horizon h               1      3      6     12     24     36
A: R=120, P=48
MSPE-Adjusted-NW     5.7%  12.7%  16.5%  19.5%  26.7%  28.9%
MSPE-Adjusted-HD     5.7%   6.3%   7.2%   8.3%  26.7%      -
MSPE-Normal-NW       0.7%   4.0%   9.0%  12.1%  19.3%  24.5%
MSPE-Normal-HD       0.6%   1.1%   1.7%   3.6%  18.7%      -
B: R=120, P=96
MSPE-Adjusted-NW     4.9%   8.5%  10.3%  13.6%  17.1%  19.7%
MSPE-Adjusted-HD     4.9%   5.6%   5.3%   6.2%   6.3%  10.2%
MSPE-Normal-NW       0.0%   1.3%   3.3%   6.2%   9.8%  12.1%
MSPE-Normal-HD       0.0%   0.5%   0.6%   1.0%   1.6%   4.0%
C: R=120, P=144
MSPE-Adjusted-NW     4.5%   7.1%   7.5%   8.6%  11.3%  13.1%
MSPE-Adjusted-HD     4.5%   5.3%   4.3%   4.1%   5.3%   7.9%
MSPE-Normal-NW       0.0%   0.6%   1.4%   2.9%   5.1%   7.3%
MSPE-Normal-HD       0.0%   0.2%   0.3%   0.8%   1.1%   1.8%

Table 4. Empirical Power for the SETAR model

Table 4-1. SETAR Unadjusted and Size-Adjusted Power (Nominal Size = 10%)

                  Unadjusted Power                           |  Size-Adjusted Power
Horizon h               1      3      6     12     24     36 |      1      3      6     12     24     36
A: R=120, P=48
MSPE-Adjusted-NW    21.1%  34.8%  42.8%  47.8%  51.1%  50.3% |  18.5%  17.5%  21.0%  17.7%  16.7%  13.1%
MSPE-Adjusted-HD    21.1%  24.4%  21.5%  21.2%  39.6%      - |  18.5%  15.8%  18.4%  15.2%  13.3%      -
MSPE-Normal-NW       3.7%  15.1%  24.9%  35.2%  41.2%  42.2% |  18.8%  19.2%  19.5%  20.3%  14.7%  13.7%
MSPE-Normal-HD       3.1%   7.8%   9.5%  12.2%  31.3%      - |  18.7%  19.1%  19.2%  17.6%  13.4%      -
B: R=120, P=96
MSPE-Adjusted-NW    24.4%  35.5%  37.2%  44.5%  48.2%  51.0% |  26.4%  27.0%  31.5%  31.7%  26.3%  26.3%
MSPE-Adjusted-HD    24.4%  26.2%  25.9%  22.5%  19.2%  19.9% |  26.4%  24.3%  23.5%  22.1%  15.0%  11.3%
MSPE-Normal-NW       1.7%   8.3%  17.2%  26.9%  33.3%  36.0% |  25.5%  27.4%  28.1%  29.7%  25.4%  27.1%
MSPE-Normal-HD       1.4%   4.8%   7.0%   9.9%  11.5%   9.8% |  25.4%  27.9%  27.8%  27.0%  21.4%  16.5%
C: R=120, P=144
MSPE-Adjusted-NW    27.8%  39.1%  40.3%  41.4%  43.6%  43.0% |  29.9%  33.1%  34.2%  37.8%  36.6%  32.3%
MSPE-Adjusted-HD    27.8%  31.3%  26.5%  24.6%  20.8%  15.8% |  29.9%  29.9%  26.1%  25.3%  18.3%  14.2%
MSPE-Normal-NW       0.6%   7.3%  14.7%  22.1%  28.3%  28.0% |  31.3%  34.1%  33.7%  34.7%  32.9%  29.6%
MSPE-Normal-HD       0.5%   4.1%   7.2%   9.1%  10.1%   7.7% |  30.3%  34.2%  33.2%  33.4%  28.8%  22.5%

Table 4-2. SETAR Unadjusted and Size-Adjusted Power (Nominal Size = 5%)

                  Unadjusted Power                           |  Size-Adjusted Power
Horizon h               1      3      6     12     24     36 |      1      3      6     12     24     36
A: R=120, P=48
MSPE-Adjusted-NW    11.4%  24.9%  34.8%  38.9%  43.2%  43.3% |   8.8%  11.0%  11.5%   9.9%   6.5%   6.2%
MSPE-Adjusted-HD    11.4%  13.6%  12.8%  12.6%  34.7%      - |   8.8%   6.8%   8.9%   7.2%   6.3%      -
MSPE-Normal-NW       1.6%   9.2%  18.3%  27.6%  34.7%  36.8% |  11.6%  11.6%  11.5%  10.0%   6.0%   6.7%
MSPE-Normal-HD       1.3%   2.5%   4.8%   5.5%  27.1%      - |  11.6%  10.2%   9.7%   7.6%   7.6%      -
B: R=120, P=96
MSPE-Adjusted-NW    12.2%  24.5%  29.2%  36.2%  39.4%  42.3% |  12.6%  16.9%  21.1%  16.9%  14.0%  16.3%
MSPE-Adjusted-HD    12.2%  14.8%  15.0%  13.1%  12.2%  12.4% |  12.6%  12.3%  13.4%  11.4%   7.2%   5.9%
MSPE-Normal-NW       0.6%   4.5%  12.0%  21.4%  26.8%  28.9% |  16.6%  17.0%  20.7%  18.6%  13.2%  15.0%
MSPE-Normal-HD       0.3%   2.1%   3.5%   4.6%   5.6%   5.7% |  16.5%  14.9%  19.4%  18.4%  11.3%   8.7%
C: R=120, P=144
MSPE-Adjusted-NW    15.1%  28.1%  29.8%  33.7%  37.0%  35.0% |  17.1%  22.9%  23.1%  26.7%  24.2%  19.9%
MSPE-Adjusted-HD    15.1%  18.9%  17.0%  14.2%  12.0%   8.6% |  17.1%  15.2%  17.5%  15.1%   9.0%   6.8%
MSPE-Normal-NW       0.4%   4.2%   9.4%  17.4%  23.6%  23.6% |  21.2%  21.9%  22.2%  25.8%  22.2%  20.1%
MSPE-Normal-HD       0.3%   1.2%   3.3%   5.0%   5.6%   3.8% |  21.1%  21.8%  21.8%  22.3%  14.6%  12.1%

Appendix A: Variance-covariance matrix adjustments: Newey-West and Hodrick

Appendix B: Forecast evaluation tests

The Theil's U-Test (TU)

If we define the sample forecast errors from the models under $H_0$ and $H_1$ as $\hat{e}_{RW,t+1} = y_{t+1}$ and $\hat{e}_{MS,t+1} = y_{t+1} - \hat{y}_{MS,t+1}$, respectively, then the TU test statistic is defined as

$$TU = \frac{MSPE_{model}}{MSPE_{RW}},$$

where

$$MSPE_i = P^{-1} \sum_{t=T-P+1}^{T} \hat{e}_{i,t+1}^{2}$$

is the mean squared prediction error, $P$ is the size of the moving window, "model" stands for the linear or the nonlinear model, and RW stands for the random walk. A $TU < 1$ implies that our model outperforms the random walk model.

Diebold-Mariano/West (DMW)

If we define $\hat{f}_{t+1} = \hat{e}_{RW,t+1}^{2} - \hat{e}_{MS,t+1}^{2}$, its mean value $\bar{f} = MSPE_{RW} - MSPE_{MS}$, and its variance

$$\hat{V} = P^{-1} \sum_{t=T-P+1}^{T} (\hat{f}_{t+1} - \bar{f})^{2},$$

then the statistic is defined as

$$DMW = \frac{\bar{f}}{\sqrt{P^{-1} \hat{V}}}.$$

A $DMW > 0$ implies that our model achieves a lower MSPE than the RW.

Clark and West (CW)

Clark and West (2006) propose an adjusted DMW statistic that corrects for the non-zero expected difference between the two MSPEs. If we define

$$\hat{f}^{ADJ}_{t+1} = \hat{e}_{RW,t+1}^{2} - \hat{e}_{MS,t+1}^{2} + \hat{y}_{MS,t+1}^{2},$$

its mean as

$$\bar{f}^{ADJ} = MSPE_{RW} - MSPE_{MS} + P^{-1} \sum_{t=T-P+1}^{T} \hat{y}_{MS,t+1}^{2},$$

and its variance as

$$\hat{V}^{ADJ} = P^{-1} \sum_{t=T-P+1}^{T} (\hat{f}^{ADJ}_{t+1} - \bar{f}^{ADJ})^{2},$$

then the test statistic is defined as

$$CW = \frac{\bar{f}^{ADJ}}{\sqrt{P^{-1} \hat{V}^{ADJ}}}.$$

A $CW > 0$ implies that our model outperforms the RW.
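
The three statistics of Appendix B can be illustrated with a minimal Python sketch. It assumes a driftless random walk benchmark (so the benchmark forecast of $y_{t+1}$ is zero) and uses the simple sample variance written above, not the Newey-West or Hodrick adjustments of Appendix A that would be needed for multi-step horizons. The function names `mspe` and `forecast_tests` are hypothetical and chosen only for this illustration.

```python
import math


def mspe(errors):
    """Mean squared prediction error over the P forecast errors."""
    return sum(e * e for e in errors) / len(errors)


def forecast_tests(y, y_hat):
    """Return TU, DMW and CW for realized values y and model forecasts y_hat.

    Assumption: the benchmark is a driftless random walk, so its forecast
    is zero and its forecast error equals the realized value itself.
    """
    P = len(y)
    e_rw = list(y)                                # e_RW = y
    e_ms = [a - b for a, b in zip(y, y_hat)]      # e_MS = y - y_hat

    tu = mspe(e_ms) / mspe(e_rw)

    # Loss differential f = e_RW^2 - e_MS^2 and its CW-adjusted version.
    f = [r * r - m * m for r, m in zip(e_rw, e_ms)]
    f_adj = [fi + yh * yh for fi, yh in zip(f, y_hat)]

    def t_stat(series):
        # mean / sqrt(V / P), with V the simple (unadjusted) sample variance
        mean = sum(series) / P
        var = sum((s - mean) ** 2 for s in series) / P
        return mean / math.sqrt(var / P)

    return {"TU": tu, "DMW": t_stat(f), "CW": t_stat(f_adj)}


if __name__ == "__main__":
    realized = [1.0, -1.0, 2.0, -2.0]
    forecasts = [0.5, -0.5, 1.0, -1.0]
    print(forecast_tests(realized, forecasts))
```

Because the adjustment term $\hat{y}_{MS,t+1}^{2}$ is non-negative, the mean of the adjusted series is never below the mean of the unadjusted one. The variance is not guaranteed to stay the same, so the ordering of the two t-statistics can differ from the ordering of the means.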