Persistence in Forecasting Performance
Marco Aiolfi (Bocconi) and Allan Timmermann (UCSD)

December 27, 2003

Abstract

This paper considers various measures of persistence in the forecasting performance of linear and nonlinear time-series models applied to output growth in the G7 countries. We find strong evidence of persistence among both top and bottom forecasting models, but also find systematic evidence of crossings - where a previously good (bad) forecasting model delivers bad (good) forecasting performance out-of-sample - particularly among linear models. We consider a range of forecast combination methods including spreads that put positive weights on models with good past forecasting performance and negative weights on models with poor past performance. We also propose a new three-stage model combination method that first sorts models into clusters based on their past performance, then pools forecasts within each cluster, followed by estimation of the optimal forecast combination weights for these clusters and shrinkage towards equal weights. This method is shown to work well empirically in out-of-sample forecasting experiments for output growth.

1 Introduction

Forecasts are of considerable importance to decision makers throughout economics and finance and are routinely used by private enterprises, government institutions and professional economists. It is therefore not surprising that considerable effort has gone into designing
and estimating forecasting models ranging from simple, autoregressive specifications to complicated non-linear models or models with time-varying parameters. A reason why such a wide range of forecasting models is often considered is that the true data generating process underlying a particular series of interest is unknown and even the most complicated model is likely to be misspecified and can, at best, provide a reasonable local approximation to the target variable. Conditions under which the true model is selected asymptotically are quite strict, c.f. White (1990), Sin and White (1996), and are unlikely to be relevant in practical forecasting situations in macroeconomics with a large cross-section of forecasting models and a short time-series dimension. Forecasting models are also likely to be subject to structural instability as documented by Stock and Watson (1996) for a large set of macroeconomic variables. Viewed this way, it is highly unlikely that a single model will dominate all others at all points in time and the identity of the best local approximation is also likely to change over time. If the identity of the best local model is time-varying, a question immediately arises, namely whether it is possible to design a forecasting strategy that, at each point in time, selects the best current model. This strategy can be expected to be successful if the best current forecasting model outperforms all other models by a suitably large margin and if the ranking of the forecasting models is persistent. If either of these conditions fails to be met, a strategy of selecting only a single best model is unlikely to work well. Most obviously, if (ex-ante) the identity of the best model varies randomly from period to period, it will not be possible to identify this model by considering past forecasting performance across models.
Similarly, if the best model only outperforms other models by a margin that is small relative to random sampling variation, it becomes difficult to identify this model by means of statistical methods based on past performance. Even if the single best model could be identified in this situation, it is likely that diversification gains from combining across a set of forecasting models with similar performance will dominate the strategy of only using a single model. In general there are hence good theoretical reasons for combining forecasts produced by several models. A popular strategy is to assign equal weights to the individual forecasting models. This becomes an optimal strategy if there is no ex-ante indication of the individual
models' prospective out-of-sample forecasting performance, either because the models are of similar quality or because their (relative) performance is unstable over time. In practice, the sources that give rise to changes in the ranking of different forecasting models - e.g., major oil price shocks, policy changes, institutional shifts or market participants' learning behaviour - are neither likely to be very swift nor to unfold over several decades. In this case, one might expect the relative performance of forecasting models to be subject to change but also to display moderate degrees of persistence. How much persistence is a question of great practical relevance, however. Unfortunately, little is known about this question.[1] The first part of our paper therefore considers this question by studying the extent of persistence in forecasting performance using a range of non-parametric statistics. We find strong evidence of persistence among both top and bottom forecasting models, but also find systematic evidence of crossings - where a previously good (bad) forecasting model delivers bad (good) forecasting performance out-of-sample - particularly among linear models. If the (relative) performance of forecasting models changes over time, it is crucial to decide how long a sorting or ranking window to base the combination weights on. One possibility is to use the models' full track record, in which case an expanding window is used for ranking. Alternatively, if performance persistence is more local in time, the models can be ranked over a much shorter recent time span that considers either their most recent performance or their performance over a rolling window of recent data. In the case where the correlation between the forecasts and the target variable does not change too rapidly over time, it becomes attractive to estimate forecast combination weights by least squares regression or (equivalently) using portfolio variance minimization methods.
This approach introduces new problems, however. First, given the sample sizes typically available in macroeconomic forecasting, the combination weights are often imprecisely estimated, giving rise to poor out-of-sample forecasting performance. In particular, this is a problem when the number of models is large relative to the length of the time-series so that the covariance matrix of the forecast errors either cannot be estimated or is estimated very imprecisely. Furthermore, the assumption of a stable covariance structure is unlikely to be satisfied in practice due to the presence of instabilities in the data generating process. To deal with such problems, we propose a new three-stage non-parametric approach for model combination that first sorts models into clusters based on their past performance, then pools forecasts within each cluster, followed by estimation of the optimal forecast combination weights for these clusters and shrinkage towards equal weights. The premise of this approach is that, when combining forecasts from a large cross-section of models, it is generally difficult to distinguish between the performance of the top models, but one can tell the difference between the best and worst models. This suggests including a subset of good models in the combined forecast. Our method is shown to work well empirically in out-of-sample forecasting experiments for output growth in the G7 countries.

The paper is organized as follows. Section 2 studies the persistence of forecasts produced by a range of linear and nonlinear time-series models. Section 3 introduces the forecast combination problem while Section 4 studies the out-of-sample forecasting performance of a range of standard combination methods proposed in the literature as well as some new ones. Section 5 proposes our new three-stage combination method, while Section 6 concludes.

2 Persistence in Forecasting Performance

2.1 Data Set

Our empirical application uses data on real GDP growth in the G7 countries (Canada, France, Germany, Italy, Japan, UK and the US) and is taken from the larger data set compiled by Stock and Watson (2003). The data contains quarterly, seasonally adjusted time series over the sample period, although some series are available only for a shorter period.

[1] Stock and Watson (1999) consider combination methods based on expanding window and rolling window estimations, two approaches that are usually associated with a stable and an unstable data generating process, respectively.
For all countries the initial estimation period is 1959.I-1969.IV so the first available forecast is for 1970.I (denoted T_0 + 1), while the last forecast is for 1999.IV (denoted T). The number of linear model specifications ranges from 41 for the UK and Canada over 43 for Germany, 45 for France, 51 for Japan, to 66 for the US.

[2] More specifically, the sample size is 107 for the US and Canada, 89 for the UK, 80 for Japan, 103 for Germany and 67 for France.

2.2 Forecasting Models and Methods

The underlying forecasts in our problem are generated by time-series models of the form:

    y_{t+h} = f_i(Z_t; θ_{i,h}) + u_{i,t+h,t}.    (1)

Here i is an index for the forecasting model, θ_{i,h} is a vector of unknown parameters, u_{i,t+h,t} is an error term, and Z_t is a vector of predictor variables that are known at time t. Individual forecasting models generally only use a subset of the elements of Z_t. All forecasts are computed recursively, out-of-sample, i.e., the forecast of y_{t+h} by the i-th model is computed as f_i(Z_t; θ̂_{it,h}). Following the analysis of Stock and Watson (2001), we consider both linear and nonlinear forecasting models. The class of linear models includes simple autoregressions with lag lengths selected recursively using the BIC and considering up to four lags:

    y_{t+h} = c + A(L) y_t + ε_{t+h}.    (2)

We also consider bivariate autoregressive models that include a single additional regressor, x_t, which is an element of Z_t:

    y_{t+h} = c + A(L) y_t + B(L) x_t + ε_{t+h}.    (3)

Lag lengths are again recursively selected using the BIC with between 1 and 4 lags of x_t and between 0 and 4 lags of y_t. The class of non-linear forecasting models includes 18 Artificial Neural Network models with one and two hidden layers and different numbers of lags, p. Single hidden layer feedforward neural network models take the form

    y_{t+h} = β'_0 ζ_t + Σ_{i=1}^{n_1} γ_{1i} g(β'_{1i} ζ_t) + ε_{t+h},    (4)
    ζ_t = (1, y_t, y_{t-1}, ..., y_{t-p+1})',  p = 1, 2, 3,
where g(z) = 1/(1 + e^{-z}) is the logistic function. Neural network models with two hidden layers take the form

    y_{t+h} = β'_0 ζ_t + Σ_{j=1}^{n_2} γ_{2j} g( Σ_{i=1}^{n_1} β_{2ji} g(β'_{1i} ζ_t) ) + ε_{t+h}.    (5)

The choices for single hidden layer ANNs are n_1 = 1, 2, 3 and p = 1, 2, 3 for a total of nine basic models. The choices for two hidden layers are n_1 = 2, n_2 = 1, 2, 3 and p = 1, 2, 3, producing nine basic models and thus spanning many of the basic neural net designs, c.f. Swanson and White (1997).[3] We also consider 15 Logistic Smooth Transition Autoregression (LSTAR) models:

    y_{t+h} = α'_0 ζ_t + d_t β' ζ_t + ε_{t+h},    (6)
    d_t = 1 / (1 + exp(γ_0 + γ'_1 ξ_t)),
    ξ_t = (y_t, y_{t-1}, ..., y_{t-p+1})',  p = 1, 2, 3.

The LSTAR models differ by the variable used to define the threshold and by the lag length p. Finally, we consider time-varying autoregressions (TVARs). These are autoregressive models whose parameters are allowed to evolve according to a multivariate random walk:

    y_{t+h} = θ'_{t,h} ξ_t + ε_{t+h},    (7)
    θ_{t,h} = θ_{t-1,h} + u_{t,h},
    u_{t,h} ~ iid (0, λ² σ²_Q).

Here σ²_Q is the variance of u_{t,h}. We consider different values of λ and up to three lags for a total of 21 TVAR models, all of which are estimated by the Kalman filter.

[3] For all ANN models, coefficients were estimated by recursive non-linear least squares, minimizing the objective function by an ad-hoc algorithm developed by Stock and Watson.
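As a concrete illustration, the recursive BIC-based lag selection used for the linear autoregressions in equation (2) can be sketched as follows. This is a minimal sketch: the function and variable names are ours, and details such as the direct h-step regression layout are assumptions rather than the authors' code.

```python
import numpy as np

def fit_ar_bic(y, h=1, max_lag=4):
    """Select the AR lag length by BIC and return direct h-step forecast
    coefficients, a sketch of equation (2). Names are illustrative."""
    T = len(y)
    best = None
    for p in range(1, max_lag + 1):
        # Regress y_{t+h} on (1, y_t, ..., y_{t-p+1}) over all valid t.
        X = np.column_stack([np.ones(T - p - h + 1)] +
                            [y[p - 1 - j: T - h - j] for j in range(p)])
        target = y[p - 1 + h:]
        beta, *_ = np.linalg.lstsq(X, target, rcond=None)
        resid = target - X @ beta
        n = len(target)
        sigma2 = resid @ resid / n
        # Gaussian BIC with p + 1 estimated mean parameters.
        bic = n * np.log(sigma2) + (p + 1) * np.log(n)
        if best is None or bic < best[0]:
            best = (bic, p, beta)
    return best[1], best[2]
```

In a recursive out-of-sample exercise this selection would be repeated each period on the expanding estimation sample, mimicking the paper's recursive scheme.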
2.3 Sorting Windows

For each country and model type (linear versus non-linear), j, and forecast horizon, h, we produce a ((T − h − T_0 + 1) × M_j) panel of forecasts

    | ŷ^(1)_{T,T-h}        ŷ^(2)_{T,T-h}        ...  ŷ^(M_j)_{T,T-h}       |
    | ŷ^(1)_{T-1,T-h-1}    ŷ^(2)_{T-1,T-h-1}    ...  ŷ^(M_j)_{T-1,T-h-1}   |
    | ...                  ...                  ...  ...                   |
    | ŷ^(1)_{T_0+h,T_0}    ŷ^(2)_{T_0+h,T_0}    ...  ŷ^(M_j)_{T_0+h,T_0}   |

where the superscript, i, tracks the model, i = 1, ..., M_j; M_j is the number of models for country/forecasting method j and h denotes the forecast horizon. We implement an automatic procedure to control for missing values and outliers to produce a balanced panel of forecasts. The h-period performance of the generic forecasting model (i) is denoted L^(i)_{t+h,t} ≡ L(y_{t+h}, ŷ^(i)_{t+h,t}). In line with common practice, we shall assume mean squared error or quadratic loss, i.e.

    L(y_{t+h}, ŷ_{t+h,t}) = e²_{t+h,t},    (8)

where e_{t+h,t} = y_{t+h} − ŷ_{t+h,t}. We track the historical forecasting performance over a window of the last win periods by computing S^(i)_t = (1/win) Σ_{τ=t−win+1}^{t} L^(i)_{τ,τ−h}. To study how persistent forecasting performance is through time, we consider three different tracking or sorting windows used to rank the forecasting models based on their historical performance:

1. Symmetric sorting and assessment window: win = 1: S^(i)_t = L(e^(i)_{t,t−h});
2. Asymmetric (rolling) sorting and assessment window: win = 20: S^(i)_t = (1/win) Σ_{τ=t−win+1}^{t} L^(i)_{τ,τ−h};
3. Expanding sorting window: S^(i)_t = (1/(t − h + 1)) Σ_{τ=h+1}^{t} L^(i)_{τ,τ−h}.

In each case the out-of-sample assessment window is based on forecasts for the period [T_0 + h; T] so that the forecasting performance is assessed using A^(i)_t = (1/(T − h − T_0 + 1)) Σ_{τ=T_0+h}^{T} L^(i)_{τ,τ−h}.
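The three sorting-window statistics S^(i)_t above can be sketched as follows, taking a (time × models) array of squared forecast errors as input; the function name and array layout are our assumptions.

```python
import numpy as np

def sorting_scores(sq_errors, t, window="expanding", win=20):
    """Compute the tracking statistic S_t^(i) for every model from a
    (time x models) array of squared forecast errors e^2_{tau,tau-h}.
    Sketch of the three sorting windows of Section 2.3; names are ours."""
    if window == "symmetric":
        # win = 1: only the most recent squared error.
        return sq_errors[t]
    if window == "rolling":
        # Asymmetric window: average over the last `win` periods.
        return sq_errors[t - win + 1: t + 1].mean(axis=0)
    # Expanding window: average over all losses available up to time t.
    return sq_errors[: t + 1].mean(axis=0)
```

The model ranking at time t is then obtained by sorting models on these scores, smallest MSFE first.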
2.4 Empirical Evidence

For each model, i, we record its rank at time t, R_it. The model with the best (MSFE) forecasting performance gets a rank of 1, the second best a rank of 2 and so on. Anticipating the analysis in Section 4, we sort the models into quartiles and use 4 × 4 contingency tables to cross-tabulate the forecasting models' sorting-period performance against their out-of-sample performance.[4] Tables 1-6 report empirical evidence on persistence in forecasting performance for the three sorting windows and linear (Tables 1-3) and nonlinear (Tables 4-6) models. Transition probability estimates, P̂_ij, in these tables give the probability of moving from quartile i (based on historical performance up to time t) to quartile j (based on future performance, e_{t+h,t}, averaged over the out-of-sample period [T_0 + h; T]). We show only the four corners of the tables (i.e., P_11, P_14, P_41, P_44) since these effectively convey information about persistence or anti-persistence in forecasting performance.

It is easy to benchmark the null of no persistence in these tables. If performance was purely random over time we would expect the transition probabilities to be close to 25%, so the probability of good (or bad) future performance should be unaffected by past performance. Persistent forecasting performance would lead to estimates of P_11 and P_44 above 0.25, while P_14 and P_41 (the probability that a historically good model becomes a bad future model or that a historically bad model produces good future forecasts) should be well below 0.25. Conversely, anti-persistence corresponds to low values of P_11 and P_44 and high values of P_14 and P_41. A chi-squared test statistic can easily be constructed for the estimated transition probabilities when h = 1. However, at longer horizons (h ≥ 2) the data is overlapping so the performance statistics are serially correlated.
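A minimal sketch of how such 4 × 4 contingency tables and transition probability estimates P̂_ij can be computed from sorting-period and out-of-sample ranks; the function name and quartile assignment rule are illustrative, not taken from the paper.

```python
import numpy as np

def transition_matrix(rank_in, rank_out, n_quartiles=4):
    """Cross-tabulate sorting-period quartiles against out-of-sample
    quartiles and estimate transition probabilities P_ij, in the spirit
    of the 4 x 4 contingency tables of Section 2.4."""
    M = len(rank_in)
    # Assign each model to a quartile 1..4 on the basis of its rank.
    q_in = np.ceil(np.asarray(rank_in) * n_quartiles / M).astype(int)
    q_out = np.ceil(np.asarray(rank_out) * n_quartiles / M).astype(int)
    P = np.zeros((n_quartiles, n_quartiles))
    for i, j in zip(q_in, q_out):
        P[i - 1, j - 1] += 1
    # Normalise each row so that the probabilities P[i, :] sum to one.
    row = P.sum(axis=1, keepdims=True)
    return np.divide(P, row, out=np.zeros_like(P), where=row > 0)
```

Under the no-persistence null every entry would be close to 0.25; the corner entries P[0, 0], P[0, 3], P[3, 0] and P[3, 3] correspond to P_11, P_14, P_41 and P_44.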
To assess the statistical significance in this situation, we therefore use a bootstrap to construct confidence intervals for the transition probability estimates, P̂_ij.

[4] We also considered rank-order statistics for forecasting performance and found similar results.

Several interesting results emerge from the tables. First, forecasting performance is generally persistent both for the linear and nonlinear models. Among the linear models, the
stayer probability estimate for the top quartile (P̂_11) is significantly greater than 25% in at least 26 out of 28 cases. Hence, there is robust evidence of persistence among the best linear forecasting models. Using an expanding sorting window, 21 out of 28 of the estimates of P_14 are significantly larger than 0.25, however, suggesting that the historically best performing models have an above-average chance of becoming the future worst models. This probability declines, however, as shorter sorting windows are used - only half of the P_14 estimates are above 0.25 under the asymmetric (rolling) and symmetric sorting windows in Tables 2 and 3. Interestingly, for the US under the expanding window P̂_14 > P̂_11 at all four horizons.

For the historically worst models, the degree of persistence is clearly inversely related to the length of the assessment window. Using an expanding window, 17 of 28 estimates of P_44 are significantly larger than 0.25, growing to 23 for the asymmetric (rolling) window and 28 for the symmetric window. Persistence among the worst models thus appears to be more local in time. There is also a lower chance that a historically bad model delivers good forecasting performance in the future than the probability that a historically good model becomes a poor future performer: for each sorting window the estimates of P_41 are only significantly higher than 0.25 in between six and nine of the 28 cases considered in Tables 1-3. The relative persistence among the top and bottom models also depends on whether an expanding or a short symmetric sorting window is used in the rankings. Under the expanding sorting window, in general P̂_11 > P̂_44 and persistence is strongest among the top models. In contrast, under the much shorter symmetric sorting window, P̂_44 > P̂_11, suggesting that the worst forecasting models have the strongest local persistence.
There is generally a much lower probability of crossings in the performance of the non-linear forecasting models and far fewer of the off-diagonal transition probability estimates are significant in Tables 4-6. At most four of 28 cases in each of these tables produce a significant value of P̂_41 while at most two of 28 cases of the P_14 estimates are significant. In contrast, irrespective of the forecast horizon, h, the estimated transition
probabilities for the top and bottom quartile models continue to exceed 0.25 at significant margins in at least 23 and 25 of 28 cases for the top and bottom quartiles, respectively. Among the non-linear forecasting models, the persistence tends to be highest among the bottom models, i.e. P̂_44 > P̂_11. This pattern is particularly strong for the measures of local persistence, i.e., when a symmetric sorting window is used.

We conclude the following from these findings. First, perhaps unsurprisingly, there is systematic evidence of persistence in forecasting performance both at the top and at the bottom end of the rankings. Second, there is unfortunately also a strong tendency for the previous best linear models to become future underperformers (in the sense that their out-of-sample MSFE performance belongs to the bottom quartile of models), particularly when an expanding sorting window is used. Third, as shorter and more recent sorting windows are used, the probability of crossings - the situation where a previous top quartile model becomes a future bottom quartile performer or vice versa - declines and the persistence in the linear forecasting models' performance grows, particularly at the bottom end. Fourth, there is in general stronger persistence in the forecasting performance of the non-linear models and a lower probability of crossings in these models' forecasting performance than was observed for the linear models.

2.5 Persistence Under a Stable Factor Structure

It is not surprising that we find persistence in the forecasting models' performance. This is in fact what one would expect if, as seems reasonable, the predictors and target variable share a common factor structure. It is more difficult to say how much persistence one would expect under this type of assumption since this will depend on the relative variability of the factors and the distribution of factor loadings across forecasting models.
To quantify how much persistence to expect under a stable factor model, we fitted a factor model of the form:

    X = F L + e,    (9)

where X is (T × k), F is (T × m), L is (m × k), and the number of factors, m, is less than the number of regressors, k. We estimated four factors F̂ and the loadings L̂ by applying principal component methods to the variance-covariance matrix of X. We then fit a VAR(1) model to the factors and simulate these over time:

    F̂_t = A F̂_{t-1} + u_t,    (10)

where u_t ~ N(0, Σ̂), Σ̂ is the estimated variance-covariance matrix of the VAR(1) residuals, and draws are generated using the Cholesky factor of Σ̂. This allows us to simulate factors and regressors by drawing innovations, η:

    x̂^i_t = F̂_t L̂ + η^i_t,  i = 1, ..., MC.    (11)

We set the number of simulations, MC = 1,000, and draw the η-values from an IIN(0,1) distribution. For each of the simulated series we re-estimate the linear and nonlinear forecasting models and use these to generate h-period forecasts. The resulting forecast errors are then used to compute transition probabilities. Tables 1-6 report 95% confidence intervals computed across the factor simulations in square brackets underneath the estimates for the actual data.

Many interesting results emerge from these simulations. First of all, unsurprisingly, the stable factor structure explains a considerable amount of the forecasting persistence found in the actual data. This can be seen by noting that the factor confidence intervals are generally not centered on 0.25 but tend to be above this value for P̂_11 and P̂_44 and below it for P̂_14 and P̂_41. Clearly, however, a stable factor structure cannot explain the observed degree of persistence in the data. Relative to what one would expect under a stable factor model, there is more persistence in the forecasts from the linear models at the longest horizons (h = 4 or h = 8). Furthermore, there is far more local persistence among the worst models' forecasting performance than what one would expect under the factor model. In particular, P̂_44 is outside the 95% confidence interval generated by the stable factor model in 18 out of 28 cases under the asymmetric (rolling) sorting window in Table 2 and in 26 of 28 cases under the symmetric (short) sorting window in Table 3.
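The stable-factor simulation just described - principal-component factors, a VAR(1) fitted to them, and regressors rebuilt with iid N(0,1) innovations - can be sketched as follows. This is a sketch under stated assumptions: the helper name and details such as centering, the use of the top m eigenvectors, and the seeding are ours, not the authors'.

```python
import numpy as np

def simulate_factor_panel(X, m=4, horizon=100, seed=0):
    """Sketch of the Section 2.5 simulation: extract m principal-component
    factors from X (T x k), fit a VAR(1) and simulate new regressors."""
    rng = np.random.default_rng(seed)
    Xc = X - X.mean(axis=0)
    # Principal components of the variance-covariance matrix of X.
    eigval, eigvec = np.linalg.eigh(np.cov(Xc, rowvar=False))
    L = eigvec[:, -m:]                 # loadings (k x m): top m eigenvectors
    F = Xc @ L                         # estimated factors (T x m)
    # VAR(1) on the factors: F_t = A F_{t-1} + u_t, fitted by least squares.
    A, *_ = np.linalg.lstsq(F[:-1], F[1:], rcond=None)
    U = F[1:] - F[:-1] @ A
    C = np.linalg.cholesky(np.cov(U, rowvar=False))
    # Simulate factors forward; rebuild regressors with iid N(0,1) noise.
    f = F[-1].copy()
    sim = np.empty((horizon, X.shape[1]))
    for t in range(horizon):
        f = f @ A + C @ rng.standard_normal(m)
        sim[t] = f @ L.T + rng.standard_normal(X.shape[1])
    return sim
```

Each of the MC simulated panels would then be fed through the same recursive forecasting exercise as the actual data to obtain the benchmark transition probabilities.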
Turning to the non-linear forecasting models, there is considerably more persistence among the top models sorted by means of the expanding window, particularly at long forecast horizons, than could be expected under a stable factor structure. For these nonlinear models
we once again see that the persistence patterns found in the actual data are particularly at odds with the factor simulations in the case of local persistence.

3 The Forecast Combination Problem

Suppose we are interested in forecasting the conditional mean of a univariate process, y, h periods ahead and that an N-vector of forecasts generated at time t, ŷ_{t+h,t} = (ŷ_{1t+h,t}, ŷ_{2t+h,t}, ..., ŷ_{Nt+h,t})', is available. The information set underlying each forecast need not be observable to the decision maker in the forecasting problem since the forecasts are taken as the primitive information. We will instead simply assume that the information set at time t, F_t, comprises the vector of current and past forecasts: F_t = {ŷ_{t+h,t}, ..., ŷ_{h+1,1}}. This assumption is quite general. For example, autoregressive specifications are included by equating individual forecasts with lagged values of y. Similarly, a constant can trivially be included as one of the forecasts so that the combination scheme allows for an intercept term, a strategy recommended by Granger and Ramanathan (1984) and - for a variety of loss functions - by Elliott and Timmermann (2002).

The general forecast combination problem can be posed as that of choosing a mapping g(ŷ_{t+h,t}; θ) from the set of predictions to the real line, Y_{t+h,t} → R, that best approximates the conditional expectation, E[y_{t+h} | ŷ_{t+h,t}]. This general class of combination schemes comprises non-linear and time-varying combination methods, but it is far more common to limit the analysis by assuming that g(.) is a linear function and choosing weights, {ω_{it,h}}_{i=1}^{N}, to produce a combined forecast, ŷ^c_{t+h,t}, of the form:

    ŷ^c_{t+h,t} = Σ_{i=1}^{N} ω_{it,h} ŷ_{it+h,t}.    (12)

The weights ω_{it,h} must obviously be measurable on F_t. This gives rise to the forecast error

    e^c_{t+h,t} = y_{t+h} − ŷ^c_{t+h,t}.    (13)
Following common practice, the forecaster is assumed to be endowed with a loss function that only depends on the forecast error, L(e_{t+h,t}). The period-t combination weights ω_{th} =
(ω_{1t,h}, ..., ω_{Nt,h})' can therefore be chosen to solve the problem

    ω_{th} = argmin_{ω_{th}} E[ L(e^c_{t+h,t}) | F_t ].    (14)

Under MSE loss, the combination weights are easy to characterize in population and only depend on the first two conditional moments of the joint distribution of y and ŷ: the vector (y_{t+h}, ŷ'_{t+h,t})' has mean (μ_{yt}, μ'_t)' and covariance matrix

    | σ²_{yt}   σ'_{21t} |
    | σ_{21t}   Σ_{22t}  |.

To keep the notation simple, we suppress the dependence of the joint distribution and moments on the forecast horizon, h. Assuming that the first forecast is simply a constant, the optimal (population) values of the constant and the vector of combination weights, ω^c_{0t} and ω_t, are

    ω^c_{0t} = μ_{yt} − ω'_t μ_t,
    ω_t = Σ_{22t}^{-1} σ_{21t}.    (15)

In general, the optimal combination weights depend on the full covariance matrix of forecasts. Sample estimates of these weights can be obtained by OLS regression of y_{t+h} on a constant and ŷ_{t+h,t}, c.f. Diebold and Pauly (1987). In cases with large N and small sample size, T, this estimation strategy is either not feasible or unlikely to perform well given the large number of parameters required to be estimated and the resulting large parameter estimation errors. Many empirical studies have found that it is difficult to produce more precise forecasts than those generated by simple equal-weighted combinations, c.f. Clemen (1989).

The forecast combination problem (14) is similar to that of minimizing the variance of a portfolio with the errors from the individual forecasts playing the role of asset returns. Intuition can in fact be gained from this analogy. Consider two forecasts whose forecast errors have variances σ²_1, σ²_2 with correlation ρ_12. Further suppose that both forecasts are unbiased so the combination weights of an unbiased forecast add up to one. The optimal combination weights for this case are

    ω_1 = σ_2 (σ_2 − ρ_12 σ_1) / (σ²_1 + σ²_2 − 2 ρ_12 σ_1 σ_2),
    ω_2 = 1 − ω_1.    (16)
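The sample analogue of the optimal weights in equation (15) - an OLS regression of the target on a constant and the individual forecasts, as in Diebold and Pauly (1987) - can be sketched as follows; the function name is ours.

```python
import numpy as np

def ols_combination_weights(y, yhat):
    """Estimate the optimal linear combination weights of equation (15)
    by OLS of the target on a constant and the N individual forecasts,
    the sample analogue of omega = Sigma_22^{-1} sigma_21 (sketch).

    y    : (T,) array of realized values
    yhat : (T, N) array of individual forecasts
    """
    X = np.column_stack([np.ones(len(y)), yhat])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[0], coef[1:]   # intercept omega_0 and weight vector omega
```

As the text notes, with large N and short samples these estimates are noisy, which is what motivates the pooling and shrinkage methods considered later.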
The optimal weights reflect the relative magnitude of the variance of the individual forecast errors in addition to their correlation. A special case arises when one model - e.g. the i-th model - has a much smaller forecast error variance and its forecast error is uncorrelated with that of other models, so that diversification gains are less likely to be important. In this case only a single model is selected so that

    ω_t = ξ_i,    (17)

where ξ_i is an N-vector with zeros everywhere except for unity in the i-th place. The opposite extreme arises when all forecasting models have forecast errors of (roughly) the same size. In this case, equal weights are likely to work well irrespective of the correlations between the forecast errors:

    ω_t = ι_N / N,    (18)

where ι_N is an N-vector of ones. In the special case where the individual forecast errors are uncorrelated so all off-diagonal elements of Σ_22 are zero, the optimal combination weights take the form of a Discounted MSFE (DMSFE) scheme and are inversely proportional to the past MSFE performance (S_it):

    ω_it = S_it^{-1} / ( Σ_{i=1}^{N} S_it^{-1} ).    (19)

We further consider combination schemes that use ranking information on past performance as a sufficient statistic determining the weight of a given forecasting model. Such weighting schemes are non-parametric in the sense that they belong to the class ω_it = f(R_it). The triangular kernel (TK) weighting scheme sets weights inversely proportional to the models' ranks:

    ω_it = R_it^{-1} / ( Σ_{i=1}^{N} R_it^{-1} ).    (20)
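The performance-based weighting schemes - the two-forecast weights of equation (16), the DMSFE weights of equation (19), the triangular kernel weights of equation (20), and the spread weights introduced below - can be sketched as follows; all function names are ours.

```python
import numpy as np

def two_forecast_weight(s1, s2, rho):
    """Optimal weight on forecast 1 for two unbiased forecasts with error
    standard deviations s1, s2 and correlation rho, equation (16)."""
    w1 = s2 * (s2 - rho * s1) / (s1**2 + s2**2 - 2 * rho * s1 * s2)
    return w1, 1.0 - w1

def dmsfe_weights(S):
    """Weights inversely proportional to past MSFE S_it, equation (19)."""
    inv = 1.0 / np.asarray(S, dtype=float)
    return inv / inv.sum()

def rank_weights(R):
    """Triangular kernel weights inversely proportional to rank R_it,
    equation (20)."""
    inv = 1.0 / np.asarray(R, dtype=float)
    return inv / inv.sum()

def spread_weights(ranks, alpha=0.25, omega_bar=0.5):
    """Spread combination weights of equation (21): the top models by rank
    get 1 + omega_bar, the bottom models get -omega_bar, the rest zero."""
    R = np.asarray(ranks, dtype=float)
    N = len(R)
    w = np.zeros(N)
    w[R <= alpha * N] = 1.0 + omega_bar
    w[R >= (1.0 - alpha) * N] = -omega_bar
    return w
```

Note that with equally skilled, uncorrelated forecasts (s1 = s2, rho = 0) the two-forecast rule reduces to equal weights, consistent with the discussion around equation (18).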
We also introduce a new set of spread combinations with weights

    ω_it = 1 + ω̄    if R_it ≤ αN,
    ω_it = 0        if αN < R_it < (1 − α)N,    (21)
    ω_it = −ω̄      if R_it ≥ (1 − α)N,

where α is the proportion of top models that - based on performance up to time t - get a weight of 1 + ω̄, while α is also the proportion of models that gets a weight of −ω̄. We set ω̄ = 0.5 and α = 0.25 or α = 0.50 and thus study spread combinations that put a weight of 1.5 on the top 50% or top 25% of all models and a weight of -0.5 on the bottom 50% or 25% of the models ranked by their previous forecasting performance.

4 Empirical Results

The evidence in Section 2 suggests that there is strong persistence in the forecasting performance of standard time-series models. The extent to which this evidence can be translated into improved out-of-sample forecasting performance is still an open issue, however. To address this question we study the out-of-sample forecasting performance for the combination strategies described above. Results from these experiments are provided in Table 7 (linear models) and Table 8 (nonlinear models). In each case we normalize the out-of-sample RMSFE-value of the previous best model at 1.00 and report the performance of all models relative to this benchmark. Values below 1 indicate that the previous best model does not generate the lowest RMSFE performance out-of-sample. For both sets of models there are four forecast horizons, three sorting windows and seven countries, yielding a total of 84 panels. Table 7 reports summary measures for the distribution of these (relative) RMSFE values across the 84 panels considered in our analysis. For each forecasting method we show the minimum, maximum, top and bottom 10% and 25%, mean and median value.

Every single forecast combination method produces a mean or median value below 1. This is clear evidence that, in general, a strategy of selecting the top model based on past
forecasting performance and then using this to compute forecasts out-of-sample does not work very well. This holds both for the linear and non-linear forecasting methods. For the linear models the best out-of-sample RMSFE performance is generated by the spread combinations and the trimmed mean based on the top quartile of models, which produce median values for the (relative) RMSFE below 0.85 compared to a median value of 0.91 for the average ("mean") combination. Furthermore, these two combination strategies produce the best forecasting performance in 32 and 33 of the 84 combinations of sorting windows, countries and forecast horizons listed in Table 7. Surprisingly, the mean forecast computed across all models - a combination strategy supported by many of the empirical results in Stock and Watson (1999) - is only best in two out of the 84 cases, less even than the eight cases generated by the previous worst quartile of models.

Comparing the out-of-sample forecasting performance of the models across the expanding, asymmetric and symmetric sorting windows, the best sorting window varies considerably across countries. For example, for the US the lowest out-of-sample MSFE values were generated by a model selected by an expanding window irrespective of horizon, whereas for the UK, a model selected by a symmetric window generated the smallest forecast errors. Overall, across countries and horizons, the top model is selected by the expanding, rolling and symmetric window in 7, 11 and 10 out of the 28 cases, respectively. Overall, the best strategy is thus to average over the models in the top quartile, followed by the two spread combinations and the triangular kernel. These methods all produce better results than the popular and frequently used methods of simply averaging across all models or using a single best model.
Turning to the results for the nonlinear models, Table 8 suggests that a broader set of models should be used for averaging when dealing with non-linear forecasting models. The lowest median RMSFE performance is produced by averaging over the top 75% of models. This is consistent with the evidence in Tables 4-6 of strong persistence in the performance of the worst models, so that it pays to exclude the bottom quartile of forecasts. The finding in these tables of persistence among the top quartile of forecasting models is further consistent with the good performance of the triangular kernel weighting scheme, which weights the
17 forecasts from each model inversely with their historical forecasting performance. Once again, the spread combinations perform quite well, producing the lowest RMSFE values in 22 out of 84 cases. The triangular kernel and DMSFE model also do quite well, both producing top performance in nine cases. 4.1 Statistical Significance of Forecasting Performance So far we have not evaluated the statistical significance of the out-of-sample forecasting performance of the models relative to the benchmark that selects the single best model based on historical forecasting performance. To do so we apply the performance test proposed by Diebold and Mariano (1995). For each combination strategy we regress its out-of-sample squared forecast errors relative to that of the benchmark model on an intercept and compute the t-statistic of the intercept coefficient using Newey-West heteroskedasticity and autocorrelation consistent standard errors. We then bootstrap the residuals from this regression and project this sequence on an intercept whose (pivotal) t-statistic is compared against the value produced by the actual data to evaluate its statistical significance under the null of equal forecasting performance (which is imposed in the bootstrap replications). Overall, we found only few cases where the results are statistically significant even under conventional critical values which there may be good reasons not to rely on. Such a finding is unsurprising in the light of the relative short out-of-sample period and the poor smallsample properties of out-of-sample tests for predictive ability documented in Chao, Corradi and Swanson (2001). For the linear models it is only at the longest horizon (h =8)for Japan and at the two shortest horizons for the spread combinations in the case of Canada that performance is systematically better than the benchmark. 
For the non-linear forecasts there are a few isolated cases - mainly indicating performance statistically worse than the benchmark - particularly at short horizons for Japan.
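The testing procedure described above can be sketched as follows: the Diebold-Mariano statistic is the HAC t-statistic on the intercept in a regression of the squared-error differential on a constant, and the bootstrap recentres that differential to impose the null of equal accuracy. This is a minimal sketch; the i.i.d. residual bootstrap below is a simplification of the resampling scheme (a block scheme would better respect the serial correlation induced at horizons h > 1), and the truncation lag is left as an input.

```python
import numpy as np

def newey_west_lrv(d, lags):
    """Newey-West (Bartlett-kernel) long-run variance of a series."""
    d = np.asarray(d, dtype=float)
    d = d - d.mean()
    T = len(d)
    lrv = np.dot(d, d) / T
    for j in range(1, lags + 1):
        gamma = np.dot(d[j:], d[:-j]) / T
        lrv += 2.0 * (1.0 - j / (lags + 1)) * gamma
    return lrv

def dm_tstat(e_model, e_bench, lags):
    """t-statistic on the intercept in a regression of the squared-error
    differential on a constant; this equals the Diebold-Mariano statistic."""
    d = np.asarray(e_model) ** 2 - np.asarray(e_bench) ** 2
    T = len(d)
    return float(np.mean(d) / np.sqrt(newey_west_lrv(d, lags) / T))

def dm_bootstrap_pvalue(e_model, e_bench, lags, B=999, seed=0):
    """Residual bootstrap under equal forecast accuracy: recentre the loss
    differential, resample it, and compare the bootstrap t-statistics with
    the one from the actual data (two-sided p-value)."""
    rng = np.random.default_rng(seed)
    d = np.asarray(e_model) ** 2 - np.asarray(e_bench) ** 2
    t_obs = dm_tstat(e_model, e_bench, lags)
    d0 = d - d.mean()                      # impose the null in the replications
    count = 0
    for _ in range(B):
        db = rng.choice(d0, size=len(d0), replace=True)
        tb = db.mean() / np.sqrt(newey_west_lrv(db, lags) / len(db))
        if abs(tb) >= abs(t_obs):
            count += 1
    return (count + 1) / (B + 1)
```

A negative t-statistic indicates that the combination strategy beats the benchmark, since the differential is the strategy's squared error minus the benchmark's.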
5 Conditional Pooling and Combination Strategies

Given the large number of forecasting models (N) relative to the number of time-series observations (T), it is generally not feasible to estimate optimal combination weights at the level of the individual forecasts. Furthermore, it is generally found in the empirical literature on forecasting performance that estimated combination weights tend to lead to worse forecasting performance than simple equal-weighted averages. For this reason we propose in this section a range of new (conditional) two-step combination strategies that pool the forecasting models by their historical forecasting performance in the first stage and then combine the mean forecasts computed for the selected clusters of models using a range of weighting schemes. One prominent example belonging to this class is the thick modeling approach proposed by Granger and Yeon (2001), which removes outliers from the set of forecasting models in a step preceding the equal weighting. Granger and Yeon argue that an advantage of thick modeling is that one no longer needs to worry about difficult decisions between close alternatives or about deciding the outcome of a test that is not decisive.

5.1 Combination of Forecasts from Pre-selected Quartiles

The first set of combination methods operates at the level of the quartile-sorted forecasts and uses information on the estimated transition probabilities to select which quartiles to include in the combination. Hence, the forecasting models are initially assigned to quartiles based on their historical forecasting performance up to the point of the prediction, t. For each quartile, a pooled (average) forecast is then computed. If the transition probability estimates (using information up to time t) suggest that a particular quartile produced better-than-average forecasts, then the average forecast from this quartile is included in the combination. The pooling gives us between one and four forecasts.
This number is small enough to let us consider estimating optimal combination weights by least squares methods, following the proposal of Diebold and Pauly (1987). We also consider shrinkage methods that shrink the least-squares estimates of the combination weights towards equal weights. As an extreme case, this includes simply using equal weights. As the number of forecasts (N) increases, Elliott (2002) established conditions under which the expected loss from averaging gets closer to the expected loss from using the optimal weights. In small samples, however, there may be gains from using shrinkage estimators of the type

$\hat{\omega}_t(\hat{\gamma}) = (1 - \hat{\gamma})\,\hat{\omega}_{OLS} + \hat{\gamma}\,\overline{\omega}$   (22)

Most shrinkage estimators shrink the least squares estimates of the weights, $\hat{\omega}_{OLS}$, towards the sample average, $\overline{\omega}$, and specify a function of the data for $\hat{\gamma}$, the parameter governing the amount of shrinkage. To set out the three-stage strategy, let $q^i_{t+h,t}$ be the $p \times 1$ vector containing the forecasts belonging to quartile $i = 1, 2, 3, 4$, where $p = N/4$ or the closest integer. We use the information contained in the quartile-by-quartile transition matrix to select quartiles deemed to produce precise forecasts as follows:

1. If $\hat{P}_{11} > \hat{P}_{14}$: include forecasts from all models in the top quartile.
2. If $\hat{P}_{21} + \hat{P}_{22} > \hat{P}_{23} + \hat{P}_{24}$: include forecasts from all models in the second quartile.
3. If $\hat{P}_{31} + \hat{P}_{32} > \hat{P}_{33} + \hat{P}_{34}$: include forecasts from all models in the third quartile.
4. If $\hat{P}_{41} > \hat{P}_{44}$: include forecasts from all models in the fourth quartile.

Let $I_i$ be an indicator taking the value 1 if the $i$th quartile is included and zero otherwise, while $\iota_p$ is a $p \times 1$ vector of ones. Then we consider three sets of combination weights applied to the forecasts pooled into quartiles, namely equal-weighted (QEW), optimally weighted (QOW) and shrinkage-weighted (QSW) combinations:

1. $QEW = \left( \sum_{i=1}^{4} I_i \right)^{-1} \sum_{i=1}^{4} I_i \, (\iota_p'/p) \, q^i_{t+h,t}$.

2. $QOW = \sum_{i=1}^{4} I_i \, \hat{\omega}_{it} \, (\iota_p'/p) \, q^i_{t+h,t}$, where $\hat{\omega}_{it}$ are estimates of the optimal combination weights for the included quartiles.

3. $QSW = \sum_{i=1}^{4} I_i \, \hat{s}_{it} \, (\iota_p'/p) \, q^i_{t+h,t}$, where $\hat{s}_{it}$ are shrinkage weights applied to the selected quartiles, computed as $\hat{s}_i = \lambda \hat{\omega}_i + (1 - \lambda) \left( \sum_{j=1}^{4} I_j \right)^{-1}$, $\lambda = \max\left\{ 0, \; 1 - \kappa \, \frac{\sum_{i=1}^{4} I_i}{T - \sum_{i=1}^{4} I_i} \right\}$, $\kappa = .25, .5, 1$.
If none of the quartiles passes the test in the first step, we set $QEW = QOW = QSW = \frac{1}{N} \sum_{i=1}^{N} y^i_{t+h,t}$ and average across all forecasting models.
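A minimal sketch of the quartile-selection and combination step, assuming the pooled quartile forecasts and the estimated transition matrix are already in hand. The treatment of the OLS weights (`omega_hat`, taken as given rather than estimated here) and the fallback when no quartile qualifies (the mean of the four quartile means, which equals the grand mean when the quartiles are equal-sized) are illustrative simplifications.

```python
import numpy as np

def select_quartiles(P):
    """Apply the inclusion rules to a 4x4 estimated transition matrix:
    a quartile is included when its models are judged likely to deliver
    better-than-average future performance."""
    I = np.array([
        P[0, 0] > P[0, 3],                      # top:    P11 > P14
        P[1, 0] + P[1, 1] > P[1, 2] + P[1, 3],  # second: P21+P22 > P23+P24
        P[2, 0] + P[2, 1] > P[2, 2] + P[2, 3],  # third:  P31+P32 > P33+P34
        P[3, 0] > P[3, 3],                      # bottom: P41 > P44
    ])
    return I

def quartile_combinations(q_means, I, omega_hat=None, kappa=0.5, T=100):
    """q_means: length-4 array of pooled (average) quartile forecasts.
    Returns (QEW, QSW); falls back to averaging all quartile means when
    no quartile passes the test."""
    if not I.any():
        g = float(q_means.mean())
        return g, g
    n_inc = int(I.sum())
    qew = float(q_means[I].mean())
    # shrink the (given) weights towards equal weights on the included quartiles
    lam = max(0.0, 1.0 - kappa * n_inc / (T - n_inc))
    if omega_hat is None:                       # no OLS weights supplied:
        omega_hat = np.full(4, 1.0 / n_inc)     # shrinkage reduces to equal weights
    s = lam * omega_hat + (1.0 - lam) / n_inc
    qsw = float(np.dot(s[I], q_means[I]))
    return qew, qsw
```

With a transition matrix whose top rows put most mass on the better quartiles, only the first two quartiles are included, and QEW is simply the average of their pooled forecasts.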
5.2 Clustering by k-means algorithm

The approach of sorting models into quartiles can be criticized for using arbitrary cut-off points. Two models with very similar in-sample forecasting performance may get assigned to different quartiles, so that one model gets included in the combination while the other does not. To deal with this problem, we propose to use k-means clustering algorithms to divide the models into a finite number of clusters. To motivate this approach, Figure 1 plots the in-sample MSFE performance (up to period $T_0$) against the out-of-sample forecasting performance across linear forecasting models, while Figure 2 does the same for the non-linear models. In general there is not much support for a simple monotonic or linear relationship between past and future forecasting performance. However, there is indication of performance clusters in some countries, notably France, Germany and Japan. There is also some evidence - e.g., for Italy and Japan - at the extreme end that the models with the very worst in-sample forecasting performance tend to generate the highest out-of-sample RMSFE values. This suggests trimming the worst models prior to forming forecasts. Suppose we identify $M$ clusters and denote by $y^k_{t+h,t}$ the $p_k \times 1$ vector containing the subset of forecasts belonging to cluster $k = 1, \ldots, M$, where cluster $k = 1$ contains the models with the lowest historical MSFE values. We consider the following conditional combination strategies:

1. $CPB = (\iota_{p_1}'/p_1) \, y^1_{t+h,t}$: select the cluster with the lowest in-sample MSFE values and use the simple mean of the forecasts of its elements;

2. $CEW = \frac{1}{M-1} \sum_{k=1}^{M-1} (\iota_{p_k}'/p_k) \, y^k_{t+h,t}$: exclude the worst cluster and apply equal weights to the forecasts from the top $M-1$ clusters;

3. $COW = \sum_{k=1}^{M} \hat{\omega}_k \, (\iota_{p_k}'/p_k) \, y^k_{t+h,t}$, where $\hat{\omega}_k$ are least-squares estimates of the optimal combination weights for the $M$ clusters;

4. $CSW = \sum_{k=1}^{M} \hat{s}_k \, (\iota_{p_k}'/p_k) \, y^k_{t+h,t}$, where $\hat{s}_k$ are the shrinkage weights for the $M$ clusters, computed as $\hat{s}_k = \lambda \hat{\omega}_k + (1 - \lambda) \frac{1}{M}$, $\lambda = \max\left\{ 0, \; 1 - \kappa \, \frac{n}{t - h - T_0 - n} \right\}$, $\kappa = .25, .5, 1$.
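The clustering stage can be sketched with a plain one-dimensional k-means on the historical MSFE values; in one dimension no clustering library is needed. The CPB and CEW rules then operate on the cluster means of the forecasts. Initialization, the iteration cap and tie-breaking below are illustrative choices, not the authors' implementation.

```python
import numpy as np

def kmeans_1d(x, M, iters=100, seed=0):
    """Plain k-means on a 1-D array of historical MSFE values. Returns labels
    relabelled so that cluster 0 has the lowest mean MSFE (the best models)."""
    rng = np.random.default_rng(seed)
    centers = np.sort(rng.choice(x, size=M, replace=False))
    for _ in range(iters):
        labels = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
        new = np.array([x[labels == m].mean() if (labels == m).any() else centers[m]
                        for m in range(M)])
        if np.allclose(new, centers):
            break
        centers = new
    order = np.argsort(centers)                 # best (lowest MSFE) cluster first
    relabel = np.empty(M, dtype=int)
    relabel[order] = np.arange(M)
    return relabel[labels]

def cluster_forecasts(forecasts, past_mse, M=2):
    """CPB: mean forecast of the best cluster. CEW: equal-weighted mean of the
    cluster averages, excluding the worst cluster."""
    labels = kmeans_1d(np.asarray(past_mse, dtype=float), M)
    f = np.asarray(forecasts, dtype=float)
    means = np.array([f[labels == m].mean() for m in range(M)])
    cpb = float(means[0])
    cew = float(means[:-1].mean()) if M > 1 else cpb
    return cpb, cew
```

When the MSFE values split cleanly into a good and a bad group, both CPB and CEW (with M = 2) reduce to the mean forecast of the good group, which is exactly the trimming behaviour the figures motivate.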
These methods require that part of the out-of-sample period be used to establish an initial ranking of the models. For each experiment we therefore use the first 20 out-of-sample observations for the initial sorting.

5.3 Empirical Results

Results from the three-stage combination strategies are presented in Tables 9 (linear models) and 10 (non-linear models). The forecasting results are not directly comparable to those in Tables 7-8, which use the longer out-of-sample period. To allow comparison with the earlier strategies, we report the results based on the previous best model, the top quartile and the simple equal-weighted average (mean) forecast. Several interesting findings emerge from Table 9, which summarizes the distribution of performance measures across countries, horizons and sorting windows. Both among the quartile combinations and among the clustering combinations, there are many strategies that outperform the best methods identified in Table 7 (TMB 25%) as well as the previous best model and the simple mean across models. Among the quartile combinations there are large gains from using shrinkage methods. Among the clustering methods we consider strategies based on M = 2, 3, i.e., using either two or three clusters. The best strategy appears to be a simple average across the top cluster (C3PB or C2PB) rather than combining models from the top cluster with those from the lower-ranked clusters. In general the results tend to be better for the combination methods based on two rather than three clusters. Furthermore, shrinkage towards equal weights across clusters also improves forecasting performance. Combination of forecasts tends to be more effective for the pooled quartile forecasts than for the clusters. This is easy to understand: forecast precision is unlikely to be as clearly divided across quartiles as it is across clusters, and combining forecasts from different quartiles leads to a diversification gain.
In contrast, once models have been clustered according to their past MSFE values, the difference between the top and second cluster's forecasting precision tends to be quite large and dominates the diversification gain from combining. To see how the performance of the various combination schemes varies across sorting windows and forecast horizons, we also considered some more disaggregated results than those
shown in Table 9. Under the expanding and asymmetric sorting windows (pooling results across forecast horizons and countries), shrinkage applied to the quartile portfolios worked best, while the forecast based on the top quartile was best under the symmetric (single-period) sorting window. Looking across forecast horizons, the top quartile gave the best performance for the short horizons (h = 1, 2), while the shrinkage methods applied to the quartiles and the forecasts from the top cluster of models were best for the longer horizons (h = 4, 8). Turning finally to the results for the non-linear models, Figure 2 indicates very different patterns in the relationship between in-sample (past) and out-of-sample (future) forecasting performance. Table 10 shows that once again shrinkage provides a reduction in the forecast errors, as does clustering, at least when the median forecasting performance is considered. The three-stage methods involving clustering, pooling, estimation of optimal combination weights and shrinkage towards equal weights continue to perform better than the mean forecast or the average forecast from the top quartile of models.

6 Conclusion

Forecast combination entails using information from a typically large set of forecasts and emerges as an attractive strategy when individual forecasting models are misspecified in a way that is unknown to the modeller. Misspecification is likely to be related not simply to functional form but also to instability in the joint distribution of forecasts and the target variable. In this situation, the identity of the best forecasting model is likely to change over time, and a key question is for how long the relative performance of forecasting models persists. This paper investigated the nature of persistence in forecasting performance across a large set of linear and nonlinear models used to forecast GDP growth in the G7 countries.
Much of the paper was exploratory since there is not, to our knowledge, any previous research on this question. We found strong evidence of persistence in forecasting performance: models that were in the top and bottom quartiles when ranked by their historical performance have a higher than average chance of remaining in the top and bottom quartiles, respectively, in the out-of-sample period. However, we also found systematic evidence of crossings, where the previous best models become the worst models in the future or vice versa, particularly among the linear forecasting models. Overall, the ranking of the forecasts tends to be more persistent for the non-linear models than for the linear models. We next explored a range of non-parametric combination schemes that use ranking information based on past forecasting performance and do not require estimation of forecast combination weights. We found evidence that a new set of spread combinations performs quite well. These assign a positive weight greater than one to the top models while the worst models get a negative weight, and the weights sum to one. There is also evidence that using the average forecast from the top quartile of models results in out-of-sample forecasting performance better than that produced by the average forecast or the previous best forecast. Finally, we proposed a range of new combination strategies that first pre-select or pool models into either quartiles or clusters on the basis of the distribution of past forecasting performance across models, then pool forecasts within each cluster, and finally estimate optimal combination weights that are shrunk towards equal weights. We find evidence in our data that these conditional combination strategies lead to better forecasting performance than simpler strategies in common use, such as using the previous best model or simply averaging across all forecasting models or a small subset of these.

References

[1] Aiolfi, M. and C. A. Favero, 2003, Model Uncertainty, Thick Modeling and the Predictability of Stock Returns. CEPR Discussion Paper No.
[2] Bates, J.M. and C.W.J. Granger, 1969, The Combination of Forecasts. Operations Research Quarterly 20.
[3] Chao, J., V. Corradi and N.R. Swanson, 2001, An Out of Sample Test for Granger Causality. Macroeconomic Dynamics 5.
[4] Clemen, R.T., 1989, Combining Forecasts: A Review and Annotated Bibliography. International Journal of Forecasting 5.
[5] Corradi, V. and N.R. Swanson, 2002, A Consistent Test for Nonlinear Out of Sample Predictive Accuracy. Journal of Econometrics 110.
[6] Diebold, F.X. and R. Mariano, 1995, Comparing Predictive Accuracy. Journal of Business and Economic Statistics 13.
[7] Diebold, F.X. and P. Pauly, 1987, Structural Change and the Combination of Forecasts. Journal of Forecasting 6.
[8] Elliott, G., 2002, Forecast Combination with Many Forecasts. Mimeo.
[9] Elliott, G. and A. Timmermann, 2002, Optimal Forecast Combinations under General Loss Functions and Forecast Error Distributions. Forthcoming in Journal of Econometrics.
[10] Giacomini, R. and H. White, 2003, Tests of Conditional Predictive Ability. UC San Diego Department of Economics Discussion Paper No. 09.
[11] Granger, C.W.J., 1989, Combining Forecasts - Twenty Years Later. Journal of Forecasting 8.
[12] Granger, C.W.J. and P. Newbold, 1986, Forecasting Economic Time Series, 2nd Edition. Academic Press.
[13] Granger, C.W.J. and R. Ramanathan, 1984, Improved Methods of Combining Forecasts. Journal of Forecasting 3.
[14] Granger, C.W.J. and Y. Yeon, 2001, Thick Models. Mimeo, UC San Diego Department of Economics.
[15] Sin, C.Y. and H. White, 1996, Information Criteria for Selecting Possibly Mis-specified Parametric Models. Journal of Econometrics 71.
[16] Stock, J.H. and M. Watson, 2001, A Comparison of Linear and Nonlinear Univariate Models for Forecasting Macroeconomic Time Series. Pages 1-44 in R.F. Engle and H. White (eds), Festschrift in Honour of Clive Granger.
[17] Stock, J.H. and M. Watson, 2003, Combination Forecasts of Output Growth in a Seven-Country Data Set. Working Paper, Princeton University and Harvard University.
[18] Swanson, N.R. and H. White, 1997, A Model Selection Approach to Real-Time Macroeconomic Forecasting Using Linear Models and Artificial Neural Networks. Review of Economics and Statistics 79.
[19] White, H., 1990, A Consistent Model Selection. In C.W.J. Granger (ed), Modeling Economic Series. Oxford University Press.
Table 1: Expanding Sorting Window, Linear Models. Each cell reports the corner probabilities of the quartile-by-quartile contingency table for the forecasting models' initial and subsequent h-period rankings. Transition probabilities $P_{ij}$ give the probability of moving from quartile $i$ (based on historical performance, $e_{t,t-h}$) to quartile $j$ (based on future performance, $e_{t+h,t}$). Significance at the 95% confidence level is indicated by a *, while square brackets report 95% confidence intervals computed across 1,000 simulations under a stable factor structure. Entries cover real GDP growth for the G7 countries (USA, UK, France, Germany, Japan, Canada and Italy) at horizons h = 1, 2, 4, 8.
[Table body omitted: only the bracketed 95% confidence intervals survive in this version; the point estimates of the corner transition probabilities are not recoverable.]
More informationNon-nested model selection. in unstable environments
Non-nested model selection in unstable environments Raffaella Giacomini UCLA (with Barbara Rossi, Duke) Motivation The problem: select between two competing models, based on how well they fit thedata Both
More informationUniversity of Pretoria Department of Economics Working Paper Series
University of Pretoria Department of Economics Working Paper Series Predicting Stock Returns and Volatility Using Consumption-Aggregate Wealth Ratios: A Nonlinear Approach Stelios Bekiros IPAG Business
More informationGlossary. The ISI glossary of statistical terms provides definitions in a number of different languages:
Glossary The ISI glossary of statistical terms provides definitions in a number of different languages: http://isi.cbs.nl/glossary/index.htm Adjusted r 2 Adjusted R squared measures the proportion of the
More informationAnalysis of Fast Input Selection: Application in Time Series Prediction
Analysis of Fast Input Selection: Application in Time Series Prediction Jarkko Tikka, Amaury Lendasse, and Jaakko Hollmén Helsinki University of Technology, Laboratory of Computer and Information Science,
More informationMonitoring Forecasting Performance
Monitoring Forecasting Performance Identifying when and why return prediction models work Allan Timmermann and Yinchu Zhu University of California, San Diego June 21, 2015 Outline Testing for time-varying
More informationForecast Combination when Outcomes are Di cult to Predict
Forecast Combination when Outcomes are Di cult to Predict Graham Elliott University of California, San Diego 9500 Gilman Drive LA JOLLA, CA, 92093-0508 May 0, 20 Abstract We show that when outcomes are
More informationOut-of-Sample Forecasting of Unemployment Rates with Pooled STVECM Forecasts
Out-of-Sample Forecasting of Unemployment Rates with Pooled STVECM Forecasts Costas Milas Department of Economics, Keele University Rimini Centre for Economic Analysis Philip Rothman Department of Economics,
More informationForecasting Euro Area Real GDP: Optimal Pooling of Information
Euro Area Business Cycle Network (EABCN) Workshop Using Euro Area Data: Issues and Consequences for Economic Analysis Cambridge, 26-28 March 2008 Hosted by University of Cambridge Forecasting Euro Area
More informationAugmenting our AR(4) Model of Inflation. The Autoregressive Distributed Lag (ADL) Model
Augmenting our AR(4) Model of Inflation Adding lagged unemployment to our model of inflationary change, we get: Inf t =1.28 (0.31) Inf t 1 (0.39) Inf t 2 +(0.09) Inf t 3 (0.53) (0.09) (0.09) (0.08) (0.08)
More informationA Simple Explanation of the Forecast Combination Puzzle Å
OXFORD BULLETIN OF ECONOMICS AND STATISTICS, 7, 3 (9) 35-949 doi:./j.468-84.8.54.x A Simple Explanation of the Forecast Combination Puzzle Å Jeremy Smith and Kenneth F. Wallis Department of Economics,
More informationNowcasting gross domestic product in Japan using professional forecasters information
Kanagawa University Economic Society Discussion Paper No. 2017-4 Nowcasting gross domestic product in Japan using professional forecasters information Nobuo Iizuka March 9, 2018 Nowcasting gross domestic
More informationInternal vs. external validity. External validity. This section is based on Stock and Watson s Chapter 9.
Section 7 Model Assessment This section is based on Stock and Watson s Chapter 9. Internal vs. external validity Internal validity refers to whether the analysis is valid for the population and sample
More informationThis note introduces some key concepts in time series econometrics. First, we
INTRODUCTION TO TIME SERIES Econometrics 2 Heino Bohn Nielsen September, 2005 This note introduces some key concepts in time series econometrics. First, we present by means of examples some characteristic
More informationDarmstadt Discussion Papers in Economics
Darmstadt Discussion Papers in Economics The Effect of Linear Time Trends on Cointegration Testing in Single Equations Uwe Hassler Nr. 111 Arbeitspapiere des Instituts für Volkswirtschaftslehre Technische
More informationTopic 4 Unit Roots. Gerald P. Dwyer. February Clemson University
Topic 4 Unit Roots Gerald P. Dwyer Clemson University February 2016 Outline 1 Unit Roots Introduction Trend and Difference Stationary Autocorrelations of Series That Have Deterministic or Stochastic Trends
More informationForecasting. Bernt Arne Ødegaard. 16 August 2018
Forecasting Bernt Arne Ødegaard 6 August 208 Contents Forecasting. Choice of forecasting model - theory................2 Choice of forecasting model - common practice......... 2.3 In sample testing of
More information9) Time series econometrics
30C00200 Econometrics 9) Time series econometrics Timo Kuosmanen Professor Management Science http://nomepre.net/index.php/timokuosmanen 1 Macroeconomic data: GDP Inflation rate Examples of time series
More informationFaMIDAS: A Mixed Frequency Factor Model with MIDAS structure
FaMIDAS: A Mixed Frequency Factor Model with MIDAS structure Frale C., Monteforte L. Computational and Financial Econometrics Limassol, October 2009 Introduction After the recent financial and economic
More informationLECTURE 11. Introduction to Econometrics. Autocorrelation
LECTURE 11 Introduction to Econometrics Autocorrelation November 29, 2016 1 / 24 ON PREVIOUS LECTURES We discussed the specification of a regression equation Specification consists of choosing: 1. correct
More informationTesting for Regime Switching in Singaporean Business Cycles
Testing for Regime Switching in Singaporean Business Cycles Robert Breunig School of Economics Faculty of Economics and Commerce Australian National University and Alison Stegman Research School of Pacific
More informationECON3327: Financial Econometrics, Spring 2016
ECON3327: Financial Econometrics, Spring 2016 Wooldridge, Introductory Econometrics (5th ed, 2012) Chapter 11: OLS with time series data Stationary and weakly dependent time series The notion of a stationary
More informationInference in VARs with Conditional Heteroskedasticity of Unknown Form
Inference in VARs with Conditional Heteroskedasticity of Unknown Form Ralf Brüggemann a Carsten Jentsch b Carsten Trenkler c University of Konstanz University of Mannheim University of Mannheim IAB Nuremberg
More informationForecasting Euro Area Macroeconomic Variables using a Factor Model Approach for Backdating. Ralf Brüggemann and Jing Zeng
University of Konstanz Dep artment of Economics Forecasting Euro Area Macroeconomic Variables using a Factor Model Approach for Backdating Ralf Brüggemann and Jing Zeng Working Paper Series 2012 15 http://www.wiwi.uni
More informationFurther Evidence on the Usefulness of Real-Time Datasets for Economic Forecasting
http://www.aimspress.com/journal/qfe QFE, 1(1): 2-25 DOI: 10.3934/QFE.2017.1.2 Received: 29 January 2017 Accepted: 28 February 2017 Published: 10 April 2017 Research article Further Evidence on the Usefulness
More informationA radial basis function artificial neural network test for ARCH
Economics Letters 69 (000) 5 3 www.elsevier.com/ locate/ econbase A radial basis function artificial neural network test for ARCH * Andrew P. Blake, George Kapetanios National Institute of Economic and
More informationDYNAMIC ECONOMETRIC MODELS Vol. 9 Nicolaus Copernicus University Toruń Mariola Piłatowska Nicolaus Copernicus University in Toruń
DYNAMIC ECONOMETRIC MODELS Vol. 9 Nicolaus Copernicus University Toruń 2009 Mariola Piłatowska Nicolaus Copernicus University in Toruń Combined Forecasts Using the Akaike Weights A b s t r a c t. The focus
More informationWooldridge, Introductory Econometrics, 3d ed. Chapter 9: More on specification and data problems
Wooldridge, Introductory Econometrics, 3d ed. Chapter 9: More on specification and data problems Functional form misspecification We may have a model that is correctly specified, in terms of including
More informationMultivariate Time Series Analysis and Its Applications [Tsay (2005), chapter 8]
1 Multivariate Time Series Analysis and Its Applications [Tsay (2005), chapter 8] Insights: Price movements in one market can spread easily and instantly to another market [economic globalization and internet
More informationMultivariate GARCH models.
Multivariate GARCH models. Financial market volatility moves together over time across assets and markets. Recognizing this commonality through a multivariate modeling framework leads to obvious gains
More informationA nonparametric test for seasonal unit roots
Robert M. Kunst robert.kunst@univie.ac.at University of Vienna and Institute for Advanced Studies Vienna To be presented in Innsbruck November 7, 2007 Abstract We consider a nonparametric test for the
More informationSimple robust averages of forecasts: Some empirical results
Available online at www.sciencedirect.com International Journal of Forecasting 24 (2008) 163 169 www.elsevier.com/locate/ijforecast Simple robust averages of forecasts: Some empirical results Victor Richmond
More informationForecasting Lecture 2: Forecast Combination, Multi-Step Forecasts
Forecasting Lecture 2: Forecast Combination, Multi-Step Forecasts Bruce E. Hansen Central Bank of Chile October 29-31, 2013 Bruce Hansen (University of Wisconsin) Forecast Combination and Multi-Step Forecasts
More informationLecture 2: Univariate Time Series
Lecture 2: Univariate Time Series Analysis: Conditional and Unconditional Densities, Stationarity, ARMA Processes Prof. Massimo Guidolin 20192 Financial Econometrics Spring/Winter 2017 Overview Motivation:
More informationAPPLIED ECONOMETRIC TIME SERIES 4TH EDITION
APPLIED ECONOMETRIC TIME SERIES 4TH EDITION Chapter 2: STATIONARY TIME-SERIES MODELS WALTER ENDERS, UNIVERSITY OF ALABAMA Copyright 2015 John Wiley & Sons, Inc. Section 1 STOCHASTIC DIFFERENCE EQUATION
More informationTesting an Autoregressive Structure in Binary Time Series Models
ömmföäflsäafaäsflassflassflas ffffffffffffffffffffffffffffffffffff Discussion Papers Testing an Autoregressive Structure in Binary Time Series Models Henri Nyberg University of Helsinki and HECER Discussion
More informationEUROPEAN ECONOMY EUROPEAN COMMISSION DIRECTORATE-GENERAL FOR ECONOMIC AND FINANCIAL AFFAIRS
EUROPEAN ECONOMY EUROPEAN COMMISSION DIRECTORATE-GENERAL FOR ECONOMIC AND FINANCIAL AFFAIRS ECONOMIC PAPERS ISSN 1725-3187 http://europa.eu.int/comm/economy_finance N 249 June 2006 The Stacked Leading
More informationSteven Cook University of Wales Swansea. Abstract
On the finite sample power of modified Dickey Fuller tests: The role of the initial condition Steven Cook University of Wales Swansea Abstract The relationship between the initial condition of time series
More informationA Bootstrap Test for Causality with Endogenous Lag Length Choice. - theory and application in finance
CESIS Electronic Working Paper Series Paper No. 223 A Bootstrap Test for Causality with Endogenous Lag Length Choice - theory and application in finance R. Scott Hacker and Abdulnasser Hatemi-J April 200
More informationSelecting a Nonlinear Time Series Model using Weighted Tests of Equal Forecast Accuracy
Selecting a Nonlinear Time Series Model using Weighted Tests of Equal Forecast Accuracy Dick van Dijk Econometric Insitute Erasmus University Rotterdam Philip Hans Franses Econometric Insitute Erasmus
More informationFinQuiz Notes
Reading 9 A time series is any series of data that varies over time e.g. the quarterly sales for a company during the past five years or daily returns of a security. When assumptions of the regression
More informationFinancial Econometrics
Financial Econometrics Multivariate Time Series Analysis: VAR Gerald P. Dwyer Trinity College, Dublin January 2013 GPD (TCD) VAR 01/13 1 / 25 Structural equations Suppose have simultaneous system for supply
More informationGARCH Models. Eduardo Rossi University of Pavia. December Rossi GARCH Financial Econometrics / 50
GARCH Models Eduardo Rossi University of Pavia December 013 Rossi GARCH Financial Econometrics - 013 1 / 50 Outline 1 Stylized Facts ARCH model: definition 3 GARCH model 4 EGARCH 5 Asymmetric Models 6
More informationEconometrics. 9) Heteroscedasticity and autocorrelation
30C00200 Econometrics 9) Heteroscedasticity and autocorrelation Timo Kuosmanen Professor, Ph.D. http://nomepre.net/index.php/timokuosmanen Today s topics Heteroscedasticity Possible causes Testing for
More informationNext, we discuss econometric methods that can be used to estimate panel data models.
1 Motivation Next, we discuss econometric methods that can be used to estimate panel data models. Panel data is a repeated observation of the same cross section Panel data is highly desirable when it is
More informationResearch Brief December 2018
Research Brief https://doi.org/10.21799/frbp.rb.2018.dec Battle of the Forecasts: Mean vs. Median as the Survey of Professional Forecasters Consensus Fatima Mboup Ardy L. Wurtzel Battle of the Forecasts:
More informationA Nonparametric Approach to Identifying a Subset of Forecasters that Outperforms the Simple Average
A Nonparametric Approach to Identifying a Subset of Forecasters that Outperforms the Simple Average Constantin Bürgi The George Washington University cburgi@gwu.edu Tara M. Sinclair The George Washington
More informationLATVIAN GDP: TIME SERIES FORECASTING USING VECTOR AUTO REGRESSION
LATVIAN GDP: TIME SERIES FORECASTING USING VECTOR AUTO REGRESSION BEZRUCKO Aleksandrs, (LV) Abstract: The target goal of this work is to develop a methodology of forecasting Latvian GDP using ARMA (AutoRegressive-Moving-Average)
More informationWhat s New in Econometrics? Lecture 14 Quantile Methods
What s New in Econometrics? Lecture 14 Quantile Methods Jeff Wooldridge NBER Summer Institute, 2007 1. Reminders About Means, Medians, and Quantiles 2. Some Useful Asymptotic Results 3. Quantile Regression
More information