The Comparative Performance of Alternative Out-of-sample Predictability Tests with Non-linear Models

Yu Liu, University of Texas at El Paso
Alex Nikolsko-Rzhevskyy, University of Lehigh
Ruxandra Prodan, University of Houston

Arguably the most popular way to evaluate forecasting models is to use the mean squared prediction error to construct either the Diebold and Mariano (1995) and West (1996) test, for non-nested models, or the Clark and West (2006) test, for nested models. However, both tests are designed and well studied for linear, rather than nonlinear, models. In this paper, we investigate the size and power properties of both tests with nonlinear models (TAR, Band-TAR, ESTAR and Markov-Switching) by performing Monte Carlo simulations with a range of data generating processes, sample sizes, forecast horizons, and variance-covariance estimators. We find that the Clark and West test with the Hodrick variance-covariance estimator has the best size, while the Newey-West estimator often results in the highest adjusted power. To illustrate our findings, we revisit three influential papers that employ nonlinear models.

JEL Classification: F31, F37, C5

Keywords: Econometric models, Evaluating forecasts, Nonlinear time series, Long memory time series, Forecasting competitions

1 Introduction

The mean squared prediction error (MSPE) is generally considered the most popular measure for comparing the out-of-sample forecasting performance of two models. When the models are non-nested, the test for equal predictability developed by Diebold and Mariano (1995) and West (1996), henceforth the DMW test, is widely used. The DMW test statistic is asymptotically normally distributed, with the null hypothesis stating that the competing linear models achieve the same MSPE and thus forecast equally well. When the models are nested, the DMW test is undersized if standard normal critical values are used. In particular, this is the case when the null is a random walk.¹

More recent studies take these matters into consideration by using the inference procedure proposed by Clark and West (2006, 2007), henceforth the CW test, which is specifically designed for testing the null of equal predictive ability of two nested models. The CW statistic, which is asymptotically normally distributed, explicitly takes into account that in finite samples the nesting model is estimated less precisely than the nested model, and adjusts the statistic for the resulting bias. The CW test has been widely used in the past few years, with recent papers by Gourinchas and Rey (2007), Molodtsova and Papell (2009), Molodtsova, Nikolsko-Rzhevskyy, and Papell (2008, 2011), Engel, Mark and West (2007), Rossi and Sekhposyan (2010), Alquist and Kilian (2008), and Panopoulou and Pantelidis (2014), among others.

In a recent paper, Rogoff and Stavrakeva (2009) examine several popular exchange rate forecasting models described by Gourinchas and Rey (2007), Engel, Mark and West (2007), and Molodtsova and Papell (2009), which outperform the random walk in an out-of-sample exchange rate predictability exercise when the assessment is based on the CW test.
Rogoff and Stavrakeva argue that one source of the overly optimistic results in these papers is the failure to check robustness with respect to alternative out-of-sample test statistics. They conclude that the new asymptotic out-of-sample tests, such as the CW test, are easy to use, but that the bootstrapped out-of-sample DMW test remains more powerful and better sized.

The above equal predictability tests, whether designed for non-nested or nested models, are aimed at testing linear models. With few exceptions, there are no equal predictability tests specifically designed for cases where the competing models are nonlinear; therefore, the DMW and CW tests remain the most popular choices.²

¹ As opposed to drawing population-level inference, one may be interested in comparing forecasts of the nested models in a finite sample. In this case the DMW test is an appropriate choice. While these questions are not investigated explicitly in this paper, the reader can refer to Giacomini and White (2006) and Clark and McCracken (2009), among others, for additional details.

² Some studies have tried to design forecast evaluation methods for nonlinear models. Mizrach (1992) applies a multivariate nearest-neighbor nonparametric model to exchange rates and finds that it marginally improves upon RW forecasts. Van Dijk and Franses (2003) argue that extreme observations are more relevant than others in certain cases and propose a forecast evaluation methodology in which different weights are given to different forecasts (Weighted and Modified Weighted DMW tests).

To the best of our knowledge, there are no studies that investigate the size and power properties of the DMW and CW tests in application to nonlinear models. This issue is especially important because nonlinear models are becoming more common when modeling persistent economic and financial series. Researchers often use nonlinear models when simpler linear alternatives fail to capture certain types of economic behavior. Examples include nominal exchange rates, which alternate between times of appreciation and depreciation, and real exchange rates, where transaction costs could give rise to a band of inactivity within which arbitrage is not profitable, so that deviations of the real exchange rate from purchasing power parity are not corrected inside the band.

This trend has brought with it an increased interest in forecasting nominal and real exchange rates with nonlinear models. Engel (1994) and Nikolsko-Rzhevskyy and Prodan (2012) investigate whether a Markov-Switching random walk with drift model (henceforth MS-RW) is a useful specification for out-of-sample nominal exchange rate forecasting; the latter study finds evidence of both short- and long-run predictability for monthly nominal exchange rates.³ Rapach and Wohar (2006) analyze the out-of-sample forecasting performance of two nonlinear models, a band-threshold autoregressive (Band-TAR) model and an exponential smooth transition autoregressive (ESTAR) model, for four real exchange rates. They find that, in several instances, the nonlinear models outperform the simple linear autoregressive models. More recently, using longer spans of data, Pavlidis, Paya and Peel (2012) model and then forecast two real exchange rates. They find that for the U.S./U.K. real exchange rate, ESTAR outperforms both the random walk and linear alternatives.⁴

Another strand of papers surveys ways to compute multi-step forecasts from nonlinear models and compares them to linear forecasts.
Some of the recent works include Teräsvirta (2006), Teräsvirta, van Dijk and Medeiros (2005), Marcellino (2005), Stock and Watson (1999) and Ramsey (1996). The comparison is typically done either by simply comparing MSPEs from linear and nonlinear models, which does not allow statistical inference, or by applying equal predictability tests with various variance-covariance estimators developed specifically for linear models. Hence, the existing research does not provide justification for using the DMW or CW tests with nonlinear models, nor does it give recommendations for choosing the most appropriate test and variance-covariance estimator for a particular class of models. This is the gap in the literature that we attempt to fill with this study.

³ The nonlinear behavior of the nominal exchange rate has been documented by previous research. Engel and Hamilton (1990) argue that a simple two-state Markov-switching random walk model with drift, which allows the constant term and the variance of innovations to take two distinct values during times of appreciation and depreciation, is a good representation for nominal exchange rates. Other papers, such as Lundbergh and Teräsvirta (2006), develop models that characterize the dynamic behavior of exchange rates as fluctuating within a target zone.

⁴ Note that we have traded generality for clarity. In terms of nonlinear model selection, we focus on autoregressive models instead of econometric models. For a detailed discussion of forecasting nominal and real exchange rates with nonlinear econometric models, see Panopoulou and Pantelidis (2014), Sartore, Trevisan, Trova and Volo (2002), and Jamaleh (2002).

Our contributions are the following. First, we compare the DMW and CW tests side by side and provide guidance for choosing the most appropriate statistic in terms of size and power when evaluating forecasts of persistent series from linear and various nonlinear models against a random walk model; in particular, we focus on the AR, Band-TAR, ESTAR and MS-RW models. Second, for multistep forecasting, we investigate the properties of both tests across two variance-covariance estimators, Newey and West (1987) (henceforth NW) and Hodrick (1992), resulting in four possible test/variance-covariance estimator combinations: DMW-NW, DMW-Hodrick, CW-NW and CW-Hodrick. We use Monte Carlo simulations to compute the empirical size, power, and size-adjusted power for various sample sizes and forecast horizons, and we generate tables of the corresponding critical values.

Our results show that when using standard normal critical values, for both linear and nonlinear models, the CW-Hodrick test has the most appropriate size for all forecasting horizons, while the DMW tests with either NW or Hodrick adjustments are significantly undersized at short horizons and oversized at long horizons, confirming previous findings for linear models. When using bootstrapped critical values, in the majority of cases the CW-NW test has the highest power among the alternatives. While DMW-NW has desirable properties for some nonlinear models, its performance is inferior to that of either CW test for other models. Finally, the adjusted power of all tests decreases when the forecast horizon increases or the number of predictions decreases.

In the empirical section we revisit three influential studies that model economic series using nonlinear models.
These include Obstfeld and Taylor (1997), who apply a Band-TAR(1) model to the real exchange rates of the U.K., Germany, France and Japan; Taylor, Peel and Sarno (2001), who model the same real exchange rates as an ESTAR(1) process; and Engel and Hamilton (1990), who estimate the U.K., German and French nominal exchange rates using the MS-RW model. We test whether any of the proposed nonlinear models significantly outperforms the driftless random walk in an out-of-sample forecasting exercise. We find weak to no predictability of exchange rates using the Band-TAR(1), ESTAR(1), and MS-RW models.

The rest of the paper is organized as follows: Section 2 describes the models and Section 3 discusses model forecasting and the relevant test statistics. Sections 4 and 5 describe the Monte Carlo simulation procedures and results, Section 6 presents our empirical examples, and Section 7 concludes.

2 The Models

In order to study out-of-sample forecasting performance and to perform the size/power simulation exercise, we consider one widely used linear autoregressive model (AR) that serves as a benchmark, and three nonlinear models: the band threshold autoregressive model (Band-TAR), the exponential smooth transition autoregressive model (ESTAR) and the Markov-Switching random walk with drift model (MS-RW). The key feature of these nonlinear models is that they allow for several regimes, and a sufficiently large shock can cause the system to switch between them. The dates at which the series crosses the threshold are not specified beforehand by the researcher, but are estimated endogenously. Each of the four models is tested against the driftless random walk (RW).

AR(1): The most popular linear statistical technique for modeling and forecasting a time series $y_t$ is a simple autoregressive model with one lag, AR(1), defined as:

$$y_t = \alpha_0 + \alpha_1 y_{t-1} + \varepsilon_t \qquad (1)$$

where $\varepsilon_t$ is a white-noise process with zero mean and variance $\sigma^2$. Under the null, $\alpha_0 = 0$ and $\alpha_1 = 1$, resulting in a driftless RW. We use AR(1) as a benchmark model for the following nonlinear models.

Band-TAR(1): The Band-TAR(1) model, first used by Obstfeld and Taylor (1997) and then by Rapach and Wohar (2006) to estimate and forecast real exchange rates, assumes that iceberg transportation costs create a band for the real exchange rate within which the marginal cost of arbitrage exceeds the marginal benefit. The model is characterized by unit root behavior in the inner regime and reversion to the edge of the unit root band in the two outer regimes:

$$y_t = y_{t-1} + \alpha_{out}(y_{t-1} - \tau) + \varepsilon^{out}_t \quad \text{if } y_{t-1} > \tau \qquad (2a)$$
$$y_t = y_{t-1} + \varepsilon^{in}_t \quad \text{if } \tau \ge y_{t-1} \ge -\tau \qquad (2b)$$
$$y_t = y_{t-1} + \alpha_{out}(y_{t-1} + \tau) + \varepsilon^{out}_t \quad \text{if } -\tau > y_{t-1} \qquad (2c)$$

where $\varepsilon^{out}_t$ and $\varepsilon^{in}_t$ are white-noise series with generally non-equal variances $\sigma^2_{out}$ and $\sigma^2_{in}$, $\tau$ and $-\tau$ are the values of the two thresholds, and $\alpha_{out}$ is the speed at which the real exchange rate moves back to the inner regime. The Band-TAR model permits us to estimate the value of the threshold, $\tau$, without imposing a priori lines of demarcation between the regimes. In order to do so, a grid search over all potential values $y_t$ of the thresholds is conducted, which yields a super-consistent estimate of $\tau$.⁵

⁵ In this and the following models we follow the conventional practice of excluding the highest and the lowest 15% of the potential values of $\tau$ to ensure an adequate number of observations on each side of the threshold.

ESTAR(1): In contrast to the discrete regime switching that characterizes the above threshold models, the ESTAR model proposed by Granger and Teräsvirta (1993) allows for smooth transitions, so that the speed of adjustment varies with the extent of the deviation of $y_t$ from the mean. Taylor, Peel and Sarno (2001) and Rapach and Wohar (2006) have employed the following ESTAR(1) model to estimate and forecast real exchange rates:

$$y_t = y_{t-1} - \left[1 - e^{\alpha (y_{t-1} - \tau)^2}\right](y_{t-1} - \tau) + \varepsilon_t \qquad (3)$$

where $\varepsilon_t \sim N(0, \sigma^2)$, $\alpha$ is a negative constant called the smoothness parameter and $\tau$ is the long-run equilibrium level of $y_t$. The speed of mean reversion increases gradually as $y_t$ moves further away from the long-run equilibrium, $\tau$.

MS-RW: Following Engel and Hamilton's (1990) two-state MS-RW with drift model, we decompose a non-stationary time series into a sequence of stochastic, segmented time trends. The regime at any given time is presumed to be the outcome of a Markov chain whose realizations are unobserved. The parameter estimates can then be used to infer which regime the process is in and to provide forecasts of the future values of the series. We define the growth rate of $y_t$ as $dy_t = 100(\ln y_t - \ln y_{t-1})$, so that $dy_t$ is the percentage change of the series from time $t-1$ to time $t$. The model then postulates the existence of an unobserved variable $s_t$ that may be equal to one or two. The variable characterizes the state (or regime) that the process is in at time $t$. When $s_t = 1$, $dy_t$ is distributed as $N(\mu_1, \sigma_1^2)$, whereas when $s_t = 2$, $dy_t$ is distributed as $N(\mu_2, \sigma_2^2)$. Thus, the model postulates that:

$$dy_t = \mu_{s_t} + \varepsilon_{s_t} \qquad (4)$$

where the unobserved state variable $s_t$ is governed by the following transition probabilities:

$$\Pr[s_t = 1 \mid s_{t-1} = 1] = p_{11} \qquad (5a)$$
$$\Pr[s_t = 2 \mid s_{t-1} = 2] = p_{22} \qquad (5b)$$

This is a simple version of the more general MS-RW model. Compared to the more general model, this parsimonious specification introduces less noise into its forecasts by estimating only six parameters for each series. The parameter vector $\theta = \{\mu_1, \mu_2, \sigma_1, \sigma_2, p_{11}, p_{22}\}$ can be estimated by maximum likelihood. The sample likelihood is a function of the observed values of the changes in $y_t$. The states are unobserved and the inferences about their probabilities are based on the observed data. The maximum likelihood estimation is performed using the algorithm described by Hamilton (1989).
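
The three nonlinear processes in Equations 2 to 5 can be illustrated with a short simulation sketch. It is not the authors' code: the function names, the `seed` argument, the starting value `y0` and the initial state `s0` are assumptions introduced here for illustration, and only the Python standard library is used.

```python
import math
import random


def simulate_band_tar(n, alpha_out, tau, sd_out, sd_in, y0=0.0, seed=0):
    """Simulate n observations from the Band-TAR(1) process of Equation 2."""
    rng = random.Random(seed)
    y = [y0]
    for _ in range(n - 1):
        prev = y[-1]
        if prev > tau:        # upper outer regime, Equation (2a)
            y.append(prev + alpha_out * (prev - tau) + rng.gauss(0.0, sd_out))
        elif prev < -tau:     # lower outer regime, Equation (2c)
            y.append(prev + alpha_out * (prev + tau) + rng.gauss(0.0, sd_out))
        else:                 # inner unit-root regime, Equation (2b)
            y.append(prev + rng.gauss(0.0, sd_in))
    return y


def estar_conditional_mean(prev, alpha, tau):
    """One-step conditional mean of the ESTAR(1) process of Equation 3."""
    dev = prev - tau
    return prev - (1.0 - math.exp(alpha * dev * dev)) * dev


def simulate_estar(n, alpha, tau, sd, y0=0.0, seed=0):
    """Simulate n observations from the ESTAR(1) process of Equation 3."""
    rng = random.Random(seed)
    y = [y0]
    for _ in range(n - 1):
        y.append(estar_conditional_mean(y[-1], alpha, tau) + rng.gauss(0.0, sd))
    return y


def simulate_ms_growth(n, mu, sd, p11, p22, s0=1, seed=0):
    """Simulate n growth rates dy_t and their states from Equations 4 and 5.

    mu and sd are pairs (state 1, state 2); s0 is the state before the
    first observation (an assumption of this sketch).
    """
    rng = random.Random(seed)
    state, dy, states = s0, [], []
    for _ in range(n):
        stay = p11 if state == 1 else p22
        if rng.random() >= stay:   # leave the current state
            state = 3 - state
        states.append(state)
        dy.append(rng.gauss(mu[state - 1], sd[state - 1]))
    return dy, states
```

With `alpha = 0` the ESTAR conditional mean collapses to the previous value, which is the driftless random walk used as the null; a strongly negative `alpha` pulls the series almost fully back to `tau` in one step.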

3 Forecasting and Inference

3.1 The multistep forecast construction methodology

Let $y_t$ be the time series of interest and suppose that we want to forecast subsequent values of the series, $y_{t+1}, y_{t+2}, \ldots, y_{t+h}$, conditional on the current and past observations, $y_{t-p}, y_{t-p+1}, \ldots, y_t$, where we allow the forecast horizon, $h$, to run from 1 to 36. For the AR(1) model, it is straightforward to obtain the h-step-ahead forecasts recursively because its functional form is linear. On the other hand, forecasting with the Band-TAR, ESTAR and MS-RW models is a nontrivial task. As discussed in Koop, Pesaran and Potter (1996), for Band-TAR and ESTAR models the iterated projections from a nonlinear model are state-dependent. Therefore, in order to construct multiperiod forecasts, we use the method described in Enders (2004). Specifically, using the Band-TAR(1) model as an example, we select 36 random realizations of the residuals from Equation 2, drawn with replacement. We call these residuals $\varepsilon^*_{t+1}, \varepsilon^*_{t+2}, \ldots, \varepsilon^*_{t+36}$. We then generate $y^*_{t+1}$ through $y^*_{t+36}$ by substituting these bootstrapped residuals back into Equation 2. For this particular history, we repeat the process 1,000 times. By the Law of Large Numbers, the sample mean of the $y^*_{t+h,i}$ across repetitions $i$ should converge to the true conditional h-step-ahead forecast, $y_{t+h}$.

To construct an h-step-ahead forecast from the MS-RW model, we first fit the model inside the moving window using data up to time $t$. Then, after defining the vectors $S_t = (\Pr[s_t = 1]; \Pr[s_t = 2])$ and $M = (\mu_1; \mu_2)$, we calculate the growth rate forecast as

$$\Delta y_{MS,t+h} = 100(\ln y_{MS,t+h} - \ln y_t) = \sum_{i=1}^{h} S_t' \begin{pmatrix} p_{11} & 1 - p_{11} \\ 1 - p_{22} & p_{22} \end{pmatrix}^{i} M.$$

3.2 The forecasts evaluation tests

To test the forecasting performance of the above models against the driftless random walk no-change forecast, $E y_{t+h} = y_t$, we use the DMW and CW tests of equal forecasting accuracy. To account for overlapping data, we employ the variance-covariance estimator of either Newey-West or Hodrick. In particular, if we define an h-step forecast from model $M$ as $y_{M,t+h}$ and a RW forecast as $y_{RW,t+h} = y_t$, then the sample forecast errors under $H_0$ and $H_1$ are $\hat e_{RW,t+h} = y_{t+h} - y_t$ and $\hat e_{M,t+h} = y_{t+h} - y_{M,t+h}$.
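
The two forecast constructions of Section 3.1 can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names, the `step` callback (which maps the previous value and a resampled residual to the next value), the single pooled residual list, and the `seed` argument are assumptions of this sketch.

```python
import random


def bootstrap_forecast(y_t, step, residuals, h_max=36, n_paths=1000, seed=0):
    """Simulation-based multi-step forecasts for horizons 1..h_max.

    Each path starts at y_t and is iterated forward with residuals drawn
    with replacement; the forecast at each horizon is the average over paths.
    """
    rng = random.Random(seed)
    sums = [0.0] * h_max
    for _ in range(n_paths):
        prev = y_t
        for h in range(h_max):
            prev = step(prev, rng.choice(residuals))
            sums[h] += prev
    return [s / n_paths for s in sums]


def ms_growth_forecast(state_probs, mu, p11, p22, h):
    """Cumulative h-step growth forecast: sum over i of S_t' Pi^i M."""
    probs = list(state_probs)
    total = 0.0
    for _ in range(h):
        # Advance the state probabilities by one period.
        probs = [probs[0] * p11 + probs[1] * (1.0 - p22),
                 probs[0] * (1.0 - p11) + probs[1] * p22]
        total += probs[0] * mu[0] + probs[1] * mu[1]
    return total
```

For a Band-TAR or ESTAR model, `step` would wrap the estimated one-step equation; regime-specific residual pools, as in Equation 2, would need a slightly richer callback than the single pool assumed here.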
If $\hat f_t = \hat e^2_{RW,t+h} - \hat e^2_{M,t+h}$, then its mean value and variance are

$$\bar f = MSPE_{RW} - MSPE_{M} \qquad (6)$$

$$V = \frac{1}{P} \sum_{t=T-P+h}^{T} (\hat f_t - \bar f)^2 \qquad (7)$$

The DMW statistic is defined as

$$DMW = \frac{\bar f}{\sqrt{P^{-1} V}} \qquad (8)$$

The CW test adjusts the DMW statistic for the non-zero expected difference between the two MSPEs, which, for the case of linear models, is known to result in an asymptotically normal distribution of the statistic. If we define $\hat f_{ADJ,t} = \hat e^2_{RW,t+h} - (\hat e^2_{M,t+h} - y^2_{M,t+h})$, and its mean and variance as

$$\bar f_{ADJ} = MSPE_{RW} - \left( MSPE_{M} - \frac{1}{P} \sum_{t=T-P+h}^{T} y^2_{M,t+h} \right) \qquad (9)$$

$$V_{ADJ} = \frac{1}{P} \sum_{t=T-P+h}^{T} (\hat f_{ADJ,t} - \bar f_{ADJ})^2 \qquad (10)$$

then

$$CW = \frac{\bar f_{ADJ}}{\sqrt{P^{-1} V_{ADJ}}} \qquad (11)$$

4 Size and Power Simulations

We use Monte Carlo simulations to evaluate the finite-sample size, power and size-adjusted power of the predictability tests discussed above. To do so, we calculate one-step and multi-step forecasts for the four models described in Section 2 and the corresponding forecasts under the non-predictability null, and then compute the DMW and CW statistics. While performing multi-step forecasting, we control for the serial correlation present in the errors due to the overlapping data using either the NW or the Hodrick variance-covariance estimator. We therefore analyze four statistics: DMW-NW, DMW-Hodrick, CW-NW and CW-Hodrick.

In order to evaluate the empirical size and power, we first compare each of the DMW-NW, DMW-Hodrick, CW-NW and CW-Hodrick test statistics against the standard normal critical values. Clark and West (2006) and McCracken (2007) show that the CW test results in an asymptotically normally distributed statistic for a class of linear models, while the DMW test is severely undersized.

To calculate the size-adjusted power, we bootstrap the critical values for the four tests mentioned above. Rogoff and Stavrakeva (2009) argue that the bootstrapped DMW test might have higher power than the CW test. We perform a non-parametric bootstrap that is based on re-sampling the residuals with replacement and then generating the new series. We then re-fit the model and record the statistics of interest. We repeat these steps 1,000 times. The critical values are taken from the series of statistics sorted in ascending order, with the 900th entry being the 10% critical value, the 950th entry being the 5% critical value, and the 990th entry being the 1% critical value.

4.1 The linear and non-linear DGPs

We generate data based on several popular studies that have previously employed nonlinear models.
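
Before turning to the individual DGPs, the statistics in Equations 6 to 11 and the bootstrapped critical values described above can be sketched in code. Several choices here are assumptions rather than the paper's stated implementation: the function names; the Bartlett-kernel form of the Newey-West long-run variance with a user-chosen number of `lags` (with `lags=0` it reduces to the plain variance of Equations 7 and 10); and the CW adjustment written in its general form as the squared difference between the two forecasts, which coincides with the $y^2_{M,t+h}$ term when the null forecast is zero. The Hodrick estimator is not sketched.

```python
import math


def newey_west_variance(x, lags=0):
    """Long-run variance of x with Bartlett weights (assumed NW form)."""
    n = len(x)
    mean = sum(x) / n
    d = [v - mean for v in x]
    var = sum(v * v for v in d) / n
    for j in range(1, lags + 1):
        gamma = sum(d[t] * d[t - j] for t in range(j, n)) / n
        var += 2.0 * (1.0 - j / (lags + 1.0)) * gamma
    return var


def dmw_cw(actual, f_rw, f_model, lags=0):
    """Return (DMW, CW) for forecasts f_rw (null) and f_model (alternative)."""
    f, f_adj = [], []
    for a, r, m in zip(actual, f_rw, f_model):
        diff = (a - r) ** 2 - (a - m) ** 2   # loss differential f_t
        f.append(diff)
        f_adj.append(diff + (m - r) ** 2)    # CW-adjusted differential

    def statistic(series):
        p = len(series)
        mean = sum(series) / p
        # Raises ZeroDivisionError if the series has zero variance.
        return mean / math.sqrt(newey_west_variance(series, lags) / p)

    return statistic(f), statistic(f_adj)


def bootstrap_critical_value(stats, level=0.10):
    """Critical value from bootstrapped statistics sorted in ascending order."""
    ordered = sorted(stats)
    k = int(round((1.0 - level) * len(ordered)))  # e.g. 900th of 1,000 at 10%
    return ordered[k - 1]
```

Because the adjustment term is non-negative, the CW numerator is never smaller than the DMW numerator, which is the mechanism behind the undersizing of DMW for nested models noted above.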
To construct the DGPs, we reconstructed the original data from these studies and used them to fit the models under both the null and the alternative. As a result, the data generating processes (DGPs) employed in this paper provide representative results for the size, power and size-adjusted power of nonlinear alternatives versus the no-predictability null for a range of empirical nominal and real exchange rate applications.

AR(1): The linear DGP is the AR(1) process defined by Equation 1. The forecast under the null is a no-change RW forecast of $y_{RW,t+h} = y_t$ for all $h$, while the forecast under the alternative, $y_{AR,t+h}$, is obtained iteratively from the AR(1) regression. When evaluating the size, $\alpha_0 = 0$, the level of persistence is $\alpha_1 = 1.00$, and $\varepsilon_t \sim N(0, 0.030^2)$, which corresponds to a driftless RW. When evaluating the power, $\alpha_0 = 0$, the persistence is $\alpha_1 = 0.95$, and $\varepsilon_t \sim N(0, 0.030^2)$.⁶

Band-TAR(1): The Band-TAR(1) process is generated in the form of Equation 2. Our data generating process is based on the coefficients estimated by Obstfeld and Taylor (1997) using monthly real exchange rates for the U.K., Germany, France and Japan from 1980:02 to 1994:11. For simplicity, we use the average of the coefficients estimated for all countries. When evaluating the size, the null is a driftless random walk with $\varepsilon_t \sim N(0, 0.035^2)$. When evaluating the power, we set $\alpha_{out} = -0.060$, $\tau = 0.159$ and $\varepsilon^{out}_t \sim N(0, 0.038^2)$ for the two outer regimes, and $\varepsilon^{in}_t \sim N(0, 0.033^2)$ for the inner regime.

ESTAR(1): The ESTAR(1) process is generated according to Equation 3. Our data generating process is based on Taylor, Peel and Sarno (2001), who use monthly real exchange rates for the U.K., Germany, France and Japan from 1973:02 to 1996:12. As with Band-TAR(1), we average the coefficients estimated for the four countries. When evaluating the size, the null is a driftless random walk with $\varepsilon_t \sim N(0, 0.034^2)$. When evaluating the power, we set $\alpha = -0.292$, $\tau = 0.177$ and $\varepsilon_t \sim N(0, 0.034^2)$.

MS-RW: The MS-RW process is generated according to Equations 4 and 5. Our data generating process is based on the coefficients estimated by Engel and Hamilton (1990) using the British pound/U.S. dollar monthly nominal exchange rate from 1973:12 to 1988:03.⁷ When evaluating the size, the null is a driftless RW with $\varepsilon_t \sim N(0, 0.026^2)$. When evaluating the power, we use the following parametrization: for $s_t = 1$, the observed change in the exchange rate $dy_t$ is distributed as $N(-0.020, 0.017^2)$; for $s_t = 2$, $dy_t$ is distributed as $N(0.013, 0.021^2)$. The dynamics of the unobserved state variable $s_t$ are governed by the transition probabilities $p_{11} = 0.828$ and $p_{22} = 0.863$.

⁶ The standard deviation is chosen to be 0.030 in order to be similar in magnitude to the standard deviations for the nonlinear models below.

⁷ In the original paper, the authors use quarterly, rather than monthly, data.

4.2 The forecasting exercise design

To study the forecasting performance of the above models, we closely follow the setup of Clark and West (2006). In particular, we consider rolling, as opposed to recursive, window regressions to obtain multi-step-ahead forecasts of the series of interest $y_t$ from every estimated model. We fix the window size, $R$, at 120 periods, while the number of predictions, $P$, is allowed to take on the values of 48, 96, 144 and 240. We therefore generate a total of $R + P$ observations. Using the window of $R$ data points, multi-step-ahead predictions for horizon $h$ are formed for observations $R + h$ through $R + P$, where $h$ runs from 1 to 36 periods. The total number of predictions is thus $P - h + 1$.

5 Simulation Results

The size, power and size-adjusted power results are reported in Tables 1, 2 and 3, respectively. Although we only discuss the empirical results for the 10% nominal size, other results are qualitatively similar.

5.1 Size simulations

The size results for the linear and nonlinear models are presented in Tables 1.1-1.4. The tables show the proportion of rejections using the 10% one-sided standard normal critical value. For example, in Table 1.1, Panel A, when $h = 1$, the empirical size for the DMW-NW statistic is 4.8%, implying that 48 out of 1,000 simulations generate statistics larger than 1.28.

The size results for the linear AR model are presented in Table 1.1. They show that for all values of $P$ and $h$, the CW-Hodrick test has the best size when using the standard normal critical values. For example, when $P = 144$, the CW-Hodrick test is correctly sized at most horizons, while being only slightly oversized at long horizons, with an average empirical size of 12%. On the other hand, for the same $P$, the CW-NW statistic is oversized at all horizons (ranging from 15.1% at the 3-step to 28.7% at the 36-step horizon) except for the 1-step horizon, where it is slightly undersized (9.6%). As expected and documented in the previous literature, when the models are nested, both the DMW-NW and DMW-Hodrick tests are strongly undersized at most horizons, especially for large $P$.
In general, the empirical size of both DMW tests is inconsistent and varies with the parameters: it decreases with $P$ and increases with $h$, ranging from 0.1% for $P = 240$ and $h = 1$ up to 32% for $P = 48$ and $h = 36$.

Size results for the nonlinear models are presented in Tables 1.2-1.4. Our findings are in accord with the linear case. The CW-Hodrick statistic is correctly sized at most horizons, while being slightly oversized at long horizons. When $P = 144$, the CW-Hodrick statistic has an average empirical size of 13.0% for Band-TAR, 11.1% for ESTAR and 8% for the MS-RW model. The CW-NW test does not perform as well as the CW-Hodrick test: while it is correctly sized for $h = 1$, its performance quickly deteriorates as the forecast horizon $h$ increases. When $P = 144$, the CW-NW statistic has an average empirical size of 19.5% for Band-TAR, 20.6% for ESTAR and 13.1% for the MS-RW model. The performance of the DMW tests is generally inferior to that of both CW tests, and the results are inconsistent across $P$ and $h$. For example, when $P = 48$ and $h = 24$, the empirical size for Band-TAR(1) is 27.8% and 23.3% for the DMW-NW and DMW-Hodrick tests, respectively. When $P = 240$ and $h = 1$, the size is 0% for both tests. This pattern is similar for the other nonlinear models as well.

In summary, CW-Hodrick provides the best empirical size compared to the other three tests when the standard normal critical values are used. Additionally, for all models and test statistics, oversizing worsens as the forecast horizon increases and the number of predictions decreases.

5.2 Power simulations

We study the finite-sample power of the above tests using both standard normal critical values (unadjusted power) and bootstrapped critical values (adjusted power) at the nominal size of 10%. The unadjusted power results are presented in Tables 2.1 through 2.4, while the adjusted power results are presented in Tables 3.1 through 3.4.

5.2.1 The unadjusted power

For the linear model, the unadjusted power results are presented in Table 2.1. The CW-NW test has higher power than the alternatives, and this power increases with the forecast horizon, while the performance of all the tests generally varies with $P$ and $h$.

Band-TAR(1) and ESTAR(1) results are presented in Tables 2.2-2.3. As in the AR(1) case, CW-NW clearly dominates the other tests for most values of $P$ and $h$. The performance of the CW-Hodrick test, on the other hand, is unremarkable. Both versions of the DMW test have lower power, reaching as low as 0.4% for $h = 1$ and $P = 240$ for DMW-Hodrick for the Band-TAR(1) model.⁸

The MS-RW results (Table 2.4) are somewhat different from those for the previous models. CW-Hodrick appears superior to CW-NW for all values of $P$ and $h$.
Both versions of the DMW test have low power, except at very short horizons. The power of all tests seems to deteriorate as the forecast horizon increases.

To sum up, although the CW-NW test appears to have the highest power for all autoregressive models, one needs to keep in mind that the size simulations show that the test is severely oversized when used with the standard normal critical values, especially for high $h$. Hence, in that case, the CW-Hodrick test might be more appropriate.

⁸ The DGPs for the Band-TAR and ESTAR models, based on the coefficients estimated by Obstfeld and Taylor (1997) and Taylor, Peel and Sarno (2001) using monthly real exchange rates, are highly persistent, and it is usually not easy to distinguish them from a random walk process.

5.2.2 The adjusted power

For the adjusted power, we use bootstrapped critical values from 1,000 iterations and fix the size at 10% for all the tests. The results are presented in Tables 3.1 through 3.4.

Two trends are apparent in the adjusted power of the tests: the size-adjusted power gradually increases with the number of predictions $P$, and it is generally lower for high $h$ than for low and midrange $h$. However, there are some differences among the tests depending on the underlying DGP. For the linear model (Table 3.1), the tests using the NW adjustment have higher adjusted power than the tests using the Hodrick adjustment, for both the CW and DMW tests, for all values of $P$ and all forecast horizons $h$, with the exception of very small $h$, where both estimators perform similarly.⁹ In the case of Band-TAR(1) and ESTAR(1) (Tables 3.2-3.3), the performance of the tests using the NW and Hodrick estimators is similar. For the MS-RW model (Table 3.4), the CW tests have higher power than the DMW tests, with the power of CW-Hodrick being slightly higher than that of CW-NW.

Overall, when bootstrapping critical values, we argue that the CW-NW test generally has the highest size-adjusted power for all autoregressive models. For the MS-RW model, CW-Hodrick is preferable to the other options.

5.3 Summary of simulation results

Table 4 provides a summary of the bootstrapped 10% critical values based on the simulation results. We find that for nonlinear models, the DMW tests with either NW or Hodrick adjustments are undersized at short horizons and oversized at long horizons, confirming previous findings for linear models. CW-Hodrick appears to be properly sized most of the time, while CW-NW is often oversized. In terms of size-adjusted power, when one bootstraps critical values, in the majority of cases the CW-NW test has the highest power among the alternatives. For short-horizon forecasting, both CW tests perform similarly well.

6 Empirical Results

In this section we revisit three influential studies that model economic series using nonlinear models. These include Obstfeld and Taylor (1997), who estimate a Band-TAR(1) model for the real exchange rates of the U.K., Germany, France and Japan; Taylor, Peel and Sarno (2001), who model the same real exchange rates as an ESTAR(1) process; and Engel and Hamilton (1990), who estimate the U.K., German and French nominal exchange rates using the MS-RW model.¹⁰ Our goal is to test whether any of the proposed models is able to significantly outperform the driftless random walk in an out-of-sample forecasting exercise.

⁹ For instance, the CW-NW test systematically has higher power than the CW-Hodrick test; the largest difference between the two tests can be observed at longer horizons (38.8% versus 21% for $P = 144$ at the 36-step horizon).

The empirical test results using bootstrapped and standard normal critical values are presented in Tables 5.1-5.2; the bootstrapped 10% critical values are shown in Tables 4.1-4.4.

When modeling real exchange rates for the U.K., Germany, France and Japan using either Band-TAR(1) or ESTAR(1) models, we find weak evidence that these models outperform the RW model for all four countries. In particular, Band-TAR(1) appears to work well for France at short horizons, and for the U.K., Germany, and Japan at medium to long horizons, while ESTAR(1) works well only for the U.K. at short horizons and Japan at long horizons. Neither model, however, is able to outperform the RW for Japan and Germany for $h = 1$. These results are consistent with Rapach and Wohar (2006), who also find limited evidence of predictability. There are minor differences between the test results using bootstrapped and standard normal critical values. Both tests using the NW adjustment are oversized at long horizons, while both DMW tests, regardless of the adjustment, are undersized at short horizons. These observations are consistent with our Monte Carlo experiments.

When modeling the U.K., German, and French nominal exchange rates measured against the U.S. dollar using the MS-RW model, we find almost no evidence that these models outperform the RW model. The only exception is the case of Germany, where the model works well at very short horizons when using the CW-NW or CW-Hodrick tests. This result is consistent with Engel and Hamilton (1990), who found the MSPE of the RW to be smaller than that of the MS-RW model. However, had they used the CW test, their finding could have been reversed. We do not find significant differences between the results with standard normal and bootstrapped critical values.
10 The span of the data used in each study is specified in Tables 5.1-5.2.

7 Conclusions

We discuss the current state of the art in multi-step non-linear forecasting, evaluate the properties of popular tests of equal predictability, and provide practical guidance to forecasters. Specifically, we analyze the DMW and CW tests, using standard normal and bootstrapped critical values and two variance-covariance estimators (Newey-West and Hodrick), when the null hypothesis is a driftless random walk. Our alternatives are the Band-TAR, ESTAR and MS-RW models. We use Monte Carlo simulations to compute the empirical size, power, and size-adjusted power for a number of DGPs, sample sizes, and forecast horizons.

We provide several recommendations for forecasters. First, when using standard normal critical values, the recommended test is CW-Hodrick, which generally has the best size among all the tests considered. The drawback is that it is somewhat oversized for small P and large h. Second, when bootstrapping critical values, we recommend the CW-NW test, which has the highest size-adjusted power for most non-linear models; the exception is MS-RW, for which CW-Hodrick should be used instead. The downside is that bootstrapping can be burdensome, computationally intensive and time consuming, especially for non-linear forecasting more than one step ahead. Since our Monte Carlo simulations and empirical tests focus mainly on nominal and real exchange rates, the above recommendations may be especially valuable for exchange rate forecasters.

References

[1] Clark, T. E. and McCracken, M. W. (2014). Nested Forecast Model Comparisons: a New Approach to Testing Equal Accuracy, Journal of Econometrics, Forthcoming.
[2] Clark, T. E. and West, K. D. (2006). Using Out-of-Sample Mean Squared Prediction Errors to Test the Martingale Difference Hypothesis, Journal of Econometrics, 135(1-2), 155-186.
[3] Clark, T. E. and West, K. D. (2007). Approximately Normal Tests for Equal Predictive Accuracy in Nested Models, Journal of Econometrics, 138, 291-311.
[4] Clements, M. P., Franses, P. H., and Swanson, N. R. (2004). Forecasting Economic and Financial Time-series with Nonlinear Models, International Journal of Forecasting, 20(2), 169-183.
[5] Diebold, F. and Mariano, R. (1995). Comparing Predictive Accuracy, Journal of Business and Economic Statistics, 13(3), 253-263.
[6] Enders, W. (2004). Applied Econometric Time Series. 2nd ed. John Wiley and Sons, Hoboken, NJ.
[7] Engel, C. (1994). Can the Markov Switching Model Forecast Exchange Rates? Journal of International Economics, 36(1-2), 151-165.
[8] Engel, C. and Hamilton, J. D. (1990). Long Swings in the Dollar: Are They in the Data and Do Markets Know It? American Economic Review, 80(4), 689-713.
[9] Engel, C., Mark, N. and West, K. (2007). International Financial Adjustment, Journal of Political Economy, 115(4), 665-793.
[10] Giacomini, R. and White, H. (2006). Tests of Conditional Predictive Ability, Econometrica, 74(6), 1545-1578.
[11] Gourinchas, P.-O. and Rey, H. (2007). International Financial Adjustment, Journal of Political Economy, 115(4), 665-703.
[12] Granger, C. W. J. and Teräsvirta, T. (1993). Modelling Nonlinear Economic Relationships, Oxford: Oxford University Press.
[13] Hamilton, J. D. (1989). A New Approach to the Economic Analysis of Nonstationary Time Series and the Business Cycle, Econometrica, 57(2), 357-384.
[14] Hodrick, R. (1992). Dividend Yields and Expected Stock Returns: Alternative Procedures for Inference and Measurement, Review of Financial Studies, 5(3), 357-386.
[15] Jamaleh, A. (2002). Explaining and forecasting the euro/dollar exchange rate through a non-linear threshold model, The European Journal of Finance, 8(4), 422-448.
[16] Koop, G., Pesaran, M. H. and Potter, S. (1996). Impulse Response Analysis of Nonlinear Multivariate Models, Journal of Econometrics, 74, 119-147.
[17] Lundbergh, S. and Teräsvirta, T. (2006). A Time Series Model for an Exchange Rate in a Target Zone with Applications, Journal of Econometrics, 131(1-2), 579-609.
[18] Marcellino, M. (2005). Instability and Non-Linearity in the EMU, In: Costas Milas, Philip Rothman and Dick van Dijk (eds.), Nonlinear Time Series Analysis of Business Cycles, Elsevier.
[19] McCracken, M. W. (2007). Asymptotics for Out of Sample Tests of Granger Causality, Journal of Econometrics, 140(2), 719-752.
[20] Mizrach, B. (1992). Multivariate Nearest-Neighbor Forecasts of EMS Exchange Rates, Journal of Applied Econometrics, 7(S), 151-163.
[21] Molodtsova, T., Nikolsko-Rzhevskyy, A. and Papell, D. (2008). Taylor Rules and Real-Time Data: A Tale of Two Countries and One Exchange Rate, Journal of Monetary Economics, 55(S1), S63-S79.
[22] Molodtsova, T., Nikolsko-Rzhevskyy, A. and Papell, D. (2011). Taylor Rules and the Euro, Journal of Money, Credit and Banking, 43(2-3), 535-552.
[23] Molodtsova, T. and Papell, D. (2009). Out-of-Sample Exchange Rate Predictability with Taylor Rule Fundamentals, Journal of International Economics, 77(2), 167-180.
[24] Newey, W. K. and West, K. D. (1987). A Simple, Positive Semi-definite, Heteroskedasticity and Autocorrelation Consistent Covariance Matrix, Econometrica, 55(3), 703-708.
[25] Nikolsko-Rzhevskyy, A. and Prodan, R. (2012). New Evidence of Exchange Rates Predictability Using the Long Swings Model, International Journal of Forecasting, 28(2), 353-365.
[26] Obstfeld, M. and Taylor, A. M. (1997). Nonlinear Aspects of Goods-Market Arbitrage and Adjustment: Heckscher's Commodity Points Revisited, Journal of the Japanese and International Economies, 11, 441-479.
[27] Panopoulou, E. and Pantelidis, T. (2014). Regime-switching models for exchange rates, The European Journal of Finance, Forthcoming.
[28] Pavlidis, E. G., Paya, I. and Peel, D. A. (2012). Forecast Evaluation of Nonlinear Models: The Case of Long-Span Real Exchange Rates, Journal of Forecasting, pp. 580-595.
[29] Ramsey, J. B. (1996). If Nonlinear Models Cannot Forecast, What Use Are They? Studies in Nonlinear Dynamics and Forecasting, 1, 65-86.
[30] Rapach, D. E. and Wohar, M. E. (2006). The Out-of-Sample Forecasting Performance of Nonlinear Models of Real Exchange Rate Behavior, International Journal of Forecasting, 22(2), 341-361.
[31] Rogoff, K. S. and Stavrakeva, V. (2009). The Continuing Puzzle of Short Horizon Exchange Rate Forecasting, NBER Working Papers 14071, National Bureau of Economic Research, Inc.
[32] Rossi, B. and Sekhposyan, T. (2010). Have Economic Models' Forecasting Performance for US Output Growth and Inflation Changed Over Time, and When? International Journal of Forecasting, 26(4), 808-835.
[33] Rothman, P. (1998). Forecasting Asymmetric Unemployment Rates, The Review of Economics and Statistics, 80, 164-168.
[34] Sartore, D., Trevisan, L., Trova, M., and Volo, F. (2002). US dollar/euro exchange rate: a monthly econometric model for forecasting, The European Journal of Finance, 8(4), 480-501.
[35] Stock, J. H. and Watson, M. W. (1999). A Comparison of Linear and Nonlinear Univariate Models for Forecasting Macroeconomic Time Series, in R. F. Engle and H. White (eds), Cointegration, Causality and Forecasting. A Festschrift in Honour of Clive W.J. Granger, Oxford University Press, Oxford, 1-44.
[36] Taylor, M. P., Peel, D. A. and Sarno, L. (2001). Nonlinear Mean Reversion in Real Exchange Rates: Toward a Solution to the Purchasing Power Parity Puzzles, International Economic Review, 42, 1015-1042.
[37] Tiao, G. C. and Tsay, R. S. (1994). Some Advances in Non-Linear and Adaptive Modelling in Time-Series, Journal of Forecasting, 13, 109-131.
[38] Tsay, R. S. (2002). Nonlinear Models and Forecasting, In: Clements, M. P., Hendry, D. F. (Eds.), A Companion to Economic Forecasting, Blackwell, Oxford, 453-484.
[39] Teräsvirta, T. (2006). Forecasting Economic Variables with Nonlinear Models, Handbook of Economic Forecasting, Elsevier.
[40] Teräsvirta, T., van Dijk, D., and Medeiros, M. (2005). Linear Models, Smooth Transition Autoregressions, and Neural Networks for Forecasting Macroeconomic Time Series: A Re-examination, International Journal of Forecasting, 21(4), 755-774.
[41] van Dijk, D. and Franses, P. H. (2003). Selecting a Nonlinear Time Series Model using Weighted Tests of Equal Forecast Accuracy, Oxford Bulletin of Economics and Statistics, 65(s1), 727-744.
[42] West, K. D. (1996). Asymptotic Inference about Predictive Ability, Econometrica, 64, 1067-1084.

1 Empirical Size Simulation Results

Table 1.1 AR(1) Empirical Size: Nominal Size=10%

Horizon h          1       3       6      12      24      36
Panel A: R=120, P=48
DMW-NW          4.8%    9.4%   13.5%   18.4%   27.4%   32.0%
DMW-Hodrick     4.7%    5.2%    5.4%    6.3%   21.1%      NA
CW-NW          12.5%   18.8%   25.2%   30.9%   39.5%   42.0%
CW-Hodrick     12.5%   12.3%   12.4%   12.3%   29.4%      NA
Panel B: R=120, P=96
DMW-NW          2.0%    3.8%    6.4%   11.5%   17.0%   19.8%
DMW-Hodrick     1.9%    2.0%    2.6%    4.1%    5.1%    7.3%
CW-NW          10.2%   16.3%   20.6%   26.5%   31.5%   35.1%
CW-Hodrick     10.2%    9.8%   11.0%   12.1%   13.3%   15.3%
Panel C: R=120, P=144
DMW-NW          1.1%    2.2%    3.6%    7.4%   12.9%   14.1%
DMW-Hodrick     1.0%    1.3%    1.3%    2.8%    3.5%    3.7%
CW-NW           9.6%   15.1%   16.8%   21.6%   27.6%   28.7%
CW-Hodrick      9.6%   10.7%   11.0%   12.1%   12.6%   14.4%
Panel D: R=120, P=240
DMW-NW          0.1%    0.6%    1.6%    3.6%    6.9%    8.6%
DMW-Hodrick     0.1%    0.3%    0.8%    1.7%    2.5%    2.9%
CW-NW           7.7%   12.6%   13.4%   17.8%   21.4%   25.2%
CW-Hodrick      7.7%    8.1%    8.8%   11.1%   12.5%   13.2%

Notes:
a) R represents the size of the rolling window, while P represents the number of out-of-sample predictions.
b) DMW refers to the unadjusted t-test for equal MSPE developed by Diebold and Mariano (1995) and West (1996); CW refers to the adjusted t-test for equal MSPE developed by Clark and West (2006).
c) NW refers to the variance-covariance estimator proposed by Newey and West (1987), while Hodrick refers to the variance-covariance estimator proposed by Hodrick (1992).
d) The Hodrick variance-covariance estimator requires P > 2h, so the DMW-Hodrick and CW-Hodrick statistics are not available for P = 48 and h = 36.
e) The number of simulations is 1,000. The table shows the proportion of rejections using the one-sided standard normal critical values. For example, in Panel A, h = 1, for DMW-NW, the number 4.8% implies that 48 out of 1,000 simulations generate statistics larger than 1.28.
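The rejection proportions described in note e) can be sketched as follows. This is a minimal illustration, assuming the simulated test statistics are already stored in an array; the function name `rejection_rate` and the toy array are hypothetical and are not output from the paper's simulations.

```python
import numpy as np

# One-sided 10% critical value of the standard normal distribution.
CRIT_10_ONE_SIDED = 1.28

def rejection_rate(stats, crit=CRIT_10_ONE_SIDED):
    """Share of simulated statistics that exceed the critical value."""
    stats = np.asarray(stats, dtype=float)
    return float(np.mean(stats > crit))

# Toy input mirroring the worked example in note e): 48 of 1,000
# statistics lie above 1.28, so the empirical size is 4.8%.
toy_stats = np.concatenate([np.full(48, 2.0), np.zeros(952)])
print(rejection_rate(toy_stats))  # 0.048
```

Applied to statistics simulated under the null, this proportion is the empirical size; applied to statistics simulated under an alternative DGP, it is the unadjusted power.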

Table 1.2 TAR(2) Empirical Size: Nominal Size=10%

Horizon h          1       3       6      12      24      36
Panel A: R=120, P=48
DMW-NW          1.1%    3.9%    8.7%   17.3%   23.4%   28.2%
DMW-Hodrick     0.8%    1.7%    3.3%    7.2%   22.4%      NA
CW-NW           9.1%   16.7%   22.5%   28.5%   32.1%   35.8%
CW-Hodrick      9.1%   11.2%   11.3%   13.6%   32.3%      NA
Panel B: R=120, P=96
DMW-NW          0.1%    0.9%    4.1%    9.0%   14.3%   14.1%
DMW-Hodrick     0.0%    0.3%    1.2%    3.5%    4.9%    5.7%
CW-NW           8.4%   12.3%   16.2%   21.1%   25.5%   26.2%
CW-Hodrick      8.4%    9.3%   10.1%    9.4%   11.8%   13.3%
Panel C: R=120, P=144
DMW-NW          0.1%    0.5%    1.9%    4.9%    9.3%    8.9%
DMW-Hodrick     0.0%    0.1%    0.9%    1.4%    2.6%    3.9%
CW-NW           8.4%   11.2%   14.2%   17.3%   22.2%   24.5%
CW-Hodrick      8.4%    8.7%    9.5%    9.8%   10.3%   11.5%
Panel D: R=120, P=240
DMW-NW          0.0%    0.1%    0.6%    1.9%    3.6%    3.4%
DMW-Hodrick     0.0%    0.0%    0.4%    1.2%    1.8%    1.0%
CW-NW           7.5%   10.8%   11.6%   13.8%   16.5%   19.9%
CW-Hodrick      7.5%    9.5%    9.7%    9.0%    9.6%    9.3%

Table 1.3 Band-TAR(1) Empirical Size: Nominal Size=10%

Horizon h          1       3       6      12      24      36
Panel A: R=120, P=48
DMW-NW          3.2%    9.1%   14.9%   21.4%   27.8%   34.6%
DMW-Hodrick     2.7%    5.2%    6.5%    8.9%   23.3%      NA
CW-NW          11.3%   17.6%   23.0%   28.6%   32.3%   37.7%
CW-Hodrick     11.3%   11.1%   12.7%   13.3%   28.3%      NA
Panel B: R=120, P=96
DMW-NW          1.5%    4.6%    9.1%   14.5%   20.1%   22.5%
DMW-Hodrick     1.2%    3.0%    3.8%    6.7%   10.0%   11.2%
CW-NW          11.2%   15.1%   18.9%   25.0%   26.8%   28.0%
CW-Hodrick     11.2%   10.5%   11.9%   12.8%   14.4%   15.5%
Panel C: R=120, P=144
DMW-NW          0.8%    3.7%    6.9%   11.6%   18.0%   18.4%
DMW-Hodrick     0.6%    1.8%    2.9%    5.7%    8.0%    8.5%
CW-NW          10.3%   15.0%   18.3%   20.5%   26.3%   26.3%
CW-Hodrick     10.3%   12.0%   12.9%   14.1%   14.0%   14.4%
Panel D: R=120, P=240
DMW-NW          0.1%    1.1%    3.6%    7.5%   10.9%   12.4%
DMW-Hodrick     0.0%    0.6%    1.8%    3.9%    5.0%    4.8%
CW-NW           9.4%   14.5%   15.4%   18.0%   20.4%   21.9%
CW-Hodrick      9.4%   10.6%   11.4%   11.8%   12.4%   11.9%

Table 1.4 ESTAR(1) Empirical Size: Nominal Size=10%

Horizon h          1       3       6      12      24      36
Panel A: R=120, P=48
DMW-NW          6.1%   13.6%   19.1%   26.4%   37.0%   40.1%
DMW-Hodrick     5.3%    8.4%    9.2%    9.7%   29.5%      NA
CW-NW          14.1%   21.5%   26.6%   32.7%   41.4%   44.6%
CW-Hodrick     14.1%   15.2%   14.9%   13.9%   33.7%      NA
Panel B: R=120, P=96
DMW-NW          2.8%    6.0%   10.6%   19.0%   25.3%   28.2%
DMW-Hodrick     2.5%    4.2%    5.6%    8.2%    9.7%   10.7%
CW-NW          11.9%   16.0%   20.7%   26.9%   31.6%   32.8%
CW-Hodrick     11.9%   10.7%   11.9%   14.3%   14.1%   15.4%
Panel C: R=120, P=144
DMW-NW          1.1%    4.0%    7.3%   13.1%   20.2%   22.5%
DMW-Hodrick     0.7%    2.6%    3.3%    5.8%    7.6%    7.9%
CW-NW           9.5%   14.8%   17.8%   22.3%   29.4%   29.8%
CW-Hodrick      9.5%   10.2%   11.3%   11.4%   12.1%   12.3%
Panel D: R=120, P=240
DMW-NW          0.3%    1.9%    3.3%    8.0%   13.5%   16.3%
DMW-Hodrick     0.2%    0.9%    1.7%    3.7%    6.0%    6.6%
CW-NW           9.0%   11.4%   14.1%   17.9%   22.5%   24.8%
CW-Hodrick      9.0%    8.6%    9.3%   10.2%   11.8%   11.9%

Table 1.5 MS-RW Empirical Size: Nominal Size=10%

Horizon h          1       3       6      12      24      36
Panel A: R=120, P=48
DMW-NW          2.5%    2.5%    4.6%    7.9%   13.8%   17.6%
DMW-Hodrick     2.1%    1.6%    2.2%    2.9%   11.6%      NA
CW-NW           8.9%   12.7%   16.0%   22.6%   29.1%   32.6%
CW-Hodrick      8.9%    9.6%    9.5%   10.4%   25.8%      NA
Panel B: R=120, P=96
DMW-NW          1.0%    1.4%    1.7%    3.1%    5.7%    7.1%
DMW-Hodrick     1.0%    1.1%    1.1%    0.9%    1.3%    1.6%
CW-NW           8.5%    9.1%    9.7%   13.1%   17.8%   21.8%
CW-Hodrick      8.5%    7.1%    7.6%    7.0%    7.0%    8.2%
Panel C: R=120, P=144
DMW-NW          0.5%    0.4%    0.6%    0.9%    2.9%    4.4%
DMW-Hodrick     0.5%    0.3%    0.5%    0.3%    1.3%    1.7%
CW-NW           9.0%   10.2%   10.2%   12.7%   17.0%   19.5%
CW-Hodrick      9.0%    7.7%    7.3%    7.8%    8.1%    8.2%
Panel D: R=120, P=240
DMW-NW          0.3%    0.0%    0.2%    0.5%    0.9%    1.3%
DMW-Hodrick     0.1%    0.0%    0.0%    0.2%    0.1%    0.0%
CW-NW          10.3%    9.8%    9.5%    9.8%   12.4%   15.5%
CW-Hodrick     10.3%    8.5%    7.9%    6.7%    6.7%    5.9%

2 Unadjusted Power Simulation Results

Table 2.1 AR(1) Unadjusted Power: Nominal Size=10%

Horizon h          1       3       6      12      24      36
Panel A: R=120, P=48
DMW-NW         14.8%   24.3%   31.4%   38.9%   47.3%   50.5%
DMW-Hodrick    14.1%   14.0%   13.2%   12.6%   34.6%      NA
CW-NW          35.5%   47.4%   55.0%   60.2%   65.3%   65.3%
CW-Hodrick     35.5%   32.2%   29.1%   27.2%   48.8%      NA
Panel B: R=120, P=96
DMW-NW          9.2%   19.0%   25.0%   37.9%   46.0%   47.3%
DMW-Hodrick     8.9%    9.0%   10.7%   11.9%   10.1%   11.6%
CW-NW          42.9%   54.8%   60.7%   68.4%   71.9%   69.6%
CW-Hodrick     42.9%   42.1%   39.9%   36.8%   26.8%   25.2%
Panel C: R=120, P=144
DMW-NW          6.5%   15.1%   22.2%   35.8%   49.0%   48.4%
DMW-Hodrick     5.9%    7.8%    8.6%   10.8%   11.0%    9.3%
CW-NW          52.0%   63.0%   69.4%   74.6%   79.4%   76.3%
CW-Hodrick     52.0%   49.8%   51.3%   45.1%   38.1%   30.7%
Panel D: R=120, P=240
DMW-NW          5.0%   13.1%   21.7%   38.0%   50.3%   52.1%
DMW-Hodrick     4.5%    7.9%   10.1%   13.7%   14.0%   10.5%
CW-NW          67.7%   76.2%   79.7%   84.0%   85.5%   83.7%
CW-Hodrick     67.7%   67.1%   66.9%   64.2%   53.4%   40.2%

Note: See Notes to Table 1.1.

Table 2.2 TAR(2) Unadjusted Power: Nominal Size=10%

Horizon h          1       3       6      12      24      36
Panel A: R=120, P=48
DMW-NW         65.1%   88.3%   93.9%   91.9%   81.6%   67.6%
DMW-Hodrick    48.1%   45.5%   19.8%    7.1%   37.7%      NA
CW-NW          99.6%   99.6%   99.9%   99.6%   97.8%   92.7%
CW-Hodrick     99.6%   97.0%   81.7%   41.4%   59.1%      NA
Panel B: R=120, P=96
DMW-NW         87.6%   96.9%   98.7%   98.2%   96.3%   92.3%
DMW-Hodrick    74.0%   73.3%   47.4%    9.5%    3.2%    6.2%
CW-NW          99.9%  100.0%   99.9%   99.5%   99.4%   97.4%
CW-Hodrick     99.9%  100.0%   99.7%   79.6%   36.8%   30.1%
Panel C: R=120, P=144
DMW-NW         94.3%   99.1%   99.3%   98.9%   97.2%   92.5%
DMW-Hodrick    89.8%   89.2%   75.5%   18.7%    1.9%    2.3%
CW-NW         100.0%   99.9%   99.9%   99.7%   99.2%   97.0%
CW-Hodrick    100.0%  100.0%   99.9%   97.1%   55.6%   31.9%
Panel D: R=120, P=240
DMW-NW         99.2%   99.6%   99.1%   98.4%   96.1%   89.2%
DMW-Hodrick    98.3%   98.3%   95.0%   52.7%    7.1%    1.4%
CW-NW         100.0%  100.0%   99.9%   99.7%   99.2%   96.5%
CW-Hodrick    100.0%  100.0%   99.9%   99.6%   90.5%   58.8%

Table 2.3 Band-TAR(1) Unadjusted Power: Nominal Size=10%

Horizon h          1       3       6      12      24      36
Panel A: R=120, P=48
DMW-NW          4.7%   11.3%   19.7%   25.4%   35.5%   35.8%
DMW-Hodrick     4.0%    5.5%    8.7%   10.4%   31.4%      NA
CW-NW          13.9%   23.2%   27.7%   34.3%   42.0%   41.0%
CW-Hodrick     13.9%   15.7%   16.2%   16.7%   36.5%      NA
Panel B: R=120, P=96
DMW-NW          2.1%    7.0%   14.5%   24.0%   27.6%   28.5%
DMW-Hodrick     1.7%    3.6%    6.7%   10.3%   12.2%   12.4%
CW-NW          16.0%   22.6%   27.6%   33.7%   37.3%   38.2%
CW-Hodrick     16.0%   16.8%   18.3%   20.9%   19.1%   18.9%
Panel C: R=120, P=144
DMW-NW          1.6%    6.4%   11.9%   20.5%   27.6%   27.2%
DMW-Hodrick     1.2%    3.2%    6.2%   10.3%   10.6%   10.4%
CW-NW          16.8%   25.3%   29.3%   34.0%   37.4%   35.7%
CW-Hodrick     16.8%   19.0%   22.7%   21.7%   19.8%   16.9%
Panel D: R=120, P=240
DMW-NW          0.5%    3.4%    9.2%   17.9%   22.0%   22.7%
DMW-Hodrick     0.4%    2.0%    5.3%    9.9%   11.6%    8.7%
CW-NW          19.2%   27.0%   29.6%   33.4%   34.0%   33.5%
CW-Hodrick     19.2%   20.6%   21.1%   24.0%   20.6%   16.8%

Table 2.4 ESTAR(1) Unadjusted Power: Nominal Size=10%

Horizon h          1       3       6      12      24      36
Panel A: R=120, P=48
DMW-NW          7.0%   19.4%   29.1%   37.5%   46.2%   46.4%
DMW-Hodrick     6.0%   12.1%   12.5%   13.1%   33.7%      NA
CW-NW          20.5%   32.1%   40.6%   45.9%   51.9%   53.8%
CW-Hodrick     20.5%   21.0%   20.8%   19.6%   39.9%      NA
Panel B: R=120, P=96
DMW-NW          4.4%   12.6%   21.4%   31.4%   39.4%   40.4%
DMW-Hodrick     4.0%    6.5%    9.8%   12.8%   15.1%   15.3%
CW-NW          21.6%   29.4%   36.3%   44.6%   49.9%   50.5%
CW-Hodrick     21.6%   21.6%   21.4%   22.7%   22.8%   22.0%
Panel C: R=120, P=144
DMW-NW          2.3%    9.1%   16.5%   27.0%   36.8%   39.1%
DMW-Hodrick     1.8%    4.6%    8.1%   11.6%   11.7%   11.7%
CW-NW          20.2%   27.7%   34.6%   40.6%   48.0%   48.5%
CW-Hodrick     20.2%   19.7%   22.5%   24.5%   22.0%   20.7%
Panel D: R=120, P=240
DMW-NW          0.6%    6.2%   10.7%   22.0%   32.0%   37.2%
DMW-Hodrick     0.5%    2.8%    5.7%    9.8%   13.9%   13.1%
CW-NW          23.8%   31.7%   36.6%   42.3%   48.4%   52.3%
CW-Hodrick     23.8%   24.9%   26.1%   26.5%   26.4%   23.7%