Forecasting the unemployment rate when the forecast loss function is asymmetric. Jing Tian

This version: 27 May 2009

Abstract

This paper studies forecasts when the forecast loss function is asymmetric, using Australian unemployment rates as an example. We focus on simple univariate models, including autoregressive models and self-exciting threshold autoregressive models, and we employ the same asymmetric quadratic loss function that is used for forecast evaluation at the model selection, model estimation and forecast combination stages. In particular, to incorporate the asymmetric loss function into model selection, we suggest the use of cross-validation associated with estimation methods that also incorporate the same loss function. Our forecasting results show considerable improvement when the same loss function is used for estimation, combination and evaluation. However, we do not observe gains from using the asymmetric loss function to select forecasting models compared with conventional model selection methods based on squared-error loss. Forecast combination results show that using combination methods that account for asymmetric loss can attenuate the disadvantages of ignoring asymmetric loss when producing individual forecasts.

Keywords: Asymmetric loss function, linear autoregressive model, self-exciting threshold autoregressive model, model selection, cross-validation, unemployment rate.

Correspondence to: Jing Tian, School of Economics and Finance, Faculty of Business, University of Tasmania, Hobart, TAS, 7001, Australia. Phone: 61-3-62262323. Fax: 61-3-62267587. Email: jing.tian@utas.edu.au. The author is grateful for the valuable suggestions and comments by Professor Heather Anderson and Professor Farshid Vahid from the Australian National University.

1 Introduction

Forecasters usually assume a symmetric quadratic loss function when estimating a forecasting model. Under this assumption the optimal predictor of a variable of interest is simply given by its conditional expectation. The use of a symmetric quadratic loss function simplifies forecasting derivations, but because it implies an equal loss from negative and positive forecasting errors of equal size, it is rarely appropriate in practice. In fact, policymakers or forecasting agents often face different costs from under-prediction and over-prediction. For example, the costs to governments of a lower-than-targeted future unemployment rate are in general lower than the costs of a higher-than-targeted unemployment rate.

Recent studies find empirical evidence of asymmetric loss when examining survey or published forecasts. Corder (2005) finds a systematic bias in US unemployment forecasts, in that federal forecast agencies tend to under-predict future unemployment unless unemployment is very high. Given a wide range of energy price and consumption forecasts published by the United States Energy Information Administration, Auffhammer (2007) applies the Elliott and Timmermann (2005) rationality tests and detects strong evidence of an implicit asymmetric loss function associated with these forecasts. Elliott and Timmermann (2008b) examine survey forecasts of macroeconomic variables such as real output growth and inflation, and find that rejections of rationality of these forecasts may imply the use of asymmetric loss.

Despite these findings of asymmetric loss in many published forecasting series, most empirical papers calculate and evaluate their forecasts based on a squared-error loss function. The main purpose of this paper is to carry out a forecasting exercise that uses an asymmetric loss function. In particular, we forecast Australian unemployment rates assuming a simple asymmetric quadratic loss function.
We use linear autoregressive (AR) models, which are widely used in time series forecasting, as a baseline forecasting model, but we also consider a nonlinear forecasting specification, namely a self-exciting threshold autoregression (SETAR). The use of a SETAR specification, which explicitly models asymmetric cyclical behavior in unemployment rates, is based on both theoretical and empirical claims that the job destruction rate is higher during recessions than during expansions. Studies by Rothman (1998) and Montgomery et al. (1998) have shown improvements in forecasting the US unemployment rate when using SETAR models rather than linear models, but their forecasts are derived and evaluated under symmetric quadratic loss. Typically the forecast loss function is used only in evaluating forecasts, but here we also consider two estimation methods used in Weiss (1996) that use the forecast loss function to

estimate the forecasting models, i.e. iterative weighted least squares and Granger's (1969) simple method, and we apply them to AR and SETAR models. Facing the practical problem of unknown model specification, we go one step further by investigating how to incorporate the asymmetric quadratic loss function when selecting forecasting models, especially those of SETAR form.

Given forecasts generated from two different modeling strategies, i.e. AR and SETAR models, one might choose to combine them rather than search for the best single forecast. The reason suggested by Elliott and Timmermann (2008a) is that forecast combinations diversify against modeling uncertainty and, to some extent, deal with breaks or other nonstationarities that may occur in the future. Therefore, we finally combine the two forecast series, allowing the weighting methods to incorporate the asymmetric quadratic loss. Elliott and Timmermann (2004) provide a framework to calculate the combination weights under general loss functions, and we employ their methods in our paper.

The plan of the paper is as follows. In the next section we briefly introduce the model estimation and selection procedures when asymmetric quadratic loss is assumed. Section 3 provides empirical results of forecasting monthly changes of the Australian unemployment rate using AR models and SETAR models. We also investigate forecast combinations that use the weighting methods for asymmetric quadratic loss in section 3. Section 4 concludes.

2 Asymmetric Quadratic Loss

Suppose that an agent is interested in forecasting a stationary variable $y$ at time $T+1$, and $\hat{y}_{T+1}$ is a forecast given all information up to time $T$. The agent's loss is a function of the forecasting error $e = y_{T+1} - \hat{y}_{T+1}$, denoted by $L(e)$.
In this paper, we consider a simple quadratic loss function that features asymmetry between over-prediction and under-prediction, i.e.

$$L(e) = |\alpha - I_{e<0}| \, e^2,$$

where $0 < \alpha < 1$ and the indicator function $I_{e<0}$ takes on the value of 1 if $e < 0$ (i.e. we over-predict) and a value of 0 otherwise. The parameter $\alpha$ indicates the degree of asymmetry. For instance, if $\alpha > 0.5$ the agent suffers more loss from under-prediction than from over-prediction. The further $\alpha$ is from 0.5, the more asymmetric the agent's preference between under-prediction and over-prediction is. We focus on this type of loss function because it shows asymmetric features and it is easily incorporated into model estimation.
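To make the weighting concrete, the following is a minimal sketch of the loss function, assuming it takes the form $L(e) = |\alpha - I_{e<0}| e^2$, which is the reading consistent with the statement that $\alpha > 0.5$ penalizes under-prediction more heavily. The function name and the argument check are illustrative choices, not part of the paper.

```python
def asymmetric_quadratic_loss(e, alpha):
    """Asymmetric quadratic loss |alpha - 1(e < 0)| * e**2 (assumed form).

    e     : forecast error, actual minus forecast
    alpha : asymmetry parameter, strictly between 0 and 1
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    # Over-prediction (e < 0) is weighted by 1 - alpha,
    # under-prediction (e >= 0) is weighted by alpha.
    weight = (1.0 - alpha) if e < 0 else alpha
    return weight * e * e
```

With `alpha = 0.8`, an under-prediction of one unit costs four times as much as an over-prediction of one unit, while `alpha = 0.5` gives a symmetric loss that is proportional to the squared error.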

2.1 Estimation of Forecasting Models

Since the true data generating process (DGP) of $y$ is often unknown, forecasting agents need to specify some forms for conditional means and variances and then estimate the associated unknown parameters. Suppose that agents model $y_t$ by

$$y_t = f(X_t, \beta_1, \beta_2) + \epsilon_t, \quad t = 1, 2, \ldots, T, \tag{1}$$

where $E_{t-1}[\epsilon_t] = 0$. Agents may consider a variety of candidate specifications for $f$ that can be either linear or nonlinear. We allow $X_t$ to contain a constant as well as lags of the dependent variable, lagged explanatory variables and variables that have other functions¹. In equation (1) we use $\beta_1$ to denote unknown coefficients for $X_t$, and other parameters such as the order of autoregression or parameters that help to determine regimes are denoted by $\beta_2$. Note that the dimensions of the vectors $X_t$, $\beta_1$ and $\beta_2$ may differ across models. We simplify our modeling task to modeling only the conditional mean of $y_{T+1}$, by assuming that the conditional variance of $\epsilon_t$ is constant in each regime².

Given the value of $\beta_2$, Weiss (1996) develops two procedures to estimate the parameters for $X_t$ and then produce forecasts. Both procedures are suitable for forecasting time series with time-invariant conditional variances, or when information on conditional variances is not given. The first procedure estimates $\beta_1$ in equation (1) by minimizing the average in-sample asymmetric loss $L(\cdot)$, i.e.

$$\hat{\beta}_1 = \arg\min_{\beta_1} T^{-1} \sum_{t=1}^{T} L(\epsilon_t(\beta_1) \mid \beta_2), \tag{2}$$

where the values of $\beta_2$ are given. Since the asymmetric quadratic loss function $L(\cdot)$ is continuously differentiable, asymmetric least squares estimates of $\beta_1$ (Newey and Powell, 1987) can be interpreted as iterated weighted least squares (IWLS) estimators. These estimators are obtained by solving the equation

$$\hat{\beta}_1 = \left[ \sum_{t=1}^{T} |\alpha - I_{y_t < X_t'\hat{\beta}_1}| \, X_t X_t' \right]^{-1} \sum_{t=1}^{T} |\alpha - I_{y_t < X_t'\hat{\beta}_1}| \, X_t y_t, \tag{3}$$

where the dimensions of $X_t$ and $\hat{\beta}_1$ are subject to the model specification, i.e. $\beta_2$.
1 For example, regime switching models require transition variables that define regimes.

2 Weiss (1996) considers the entire distribution of $y_{T+1}$ and thus estimates unknown parameters for both the conditional mean and the conditional variance to calculate the optimal predictor. He finds little improvement in the forecasts from modeling conditional variances, and suggests that this is due to misspecification of the models for conditional variances.
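The fixed-point character of the IWLS estimator can be illustrated with a short sketch. It assumes the weights are $|\alpha - I_{y_t < X_t'\beta_1}|$ and restricts $X_t$ to a constant and a single regressor (for example one lag in an AR(1)), so that each weighted least squares step has a closed form. The function names, the OLS starting value and the stopping rule are assumptions made for illustration only.

```python
def weighted_line_fit(x, y, w):
    """Weighted least squares for y = b0 + b1 * x with observation weights w."""
    sw = sum(w)
    sx = sum(wi * xi for wi, xi in zip(w, x))
    sy = sum(wi * yi for wi, yi in zip(w, y))
    sxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    sxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    det = sw * sxx - sx * sx
    b1 = (sw * sxy - sx * sy) / det
    b0 = (sy - b1 * sx) / sw
    return b0, b1


def iwls_line_fit(x, y, alpha, tol=1e-10, max_iter=100):
    """Iterate weighted least squares until the coefficients stop changing.

    Weights are alpha for non-negative residuals (under-prediction) and
    1 - alpha for negative residuals (over-prediction).
    """
    b0, b1 = weighted_line_fit(x, y, [1.0] * len(x))  # start from OLS
    for _ in range(max_iter):
        resid = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]
        w = [(1.0 - alpha) if e < 0 else alpha for e in resid]
        new_b0, new_b1 = weighted_line_fit(x, y, w)
        change = max(abs(new_b0 - b0), abs(new_b1 - b1))
        b0, b1 = new_b0, new_b1
        if change < tol:
            break
    return b0, b1
```

With `alpha = 0.5` all weights are equal and the routine returns the OLS fit. For other values of `alpha` the loop stops once the set of negative residuals, and hence the weights, no longer changes between passes.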

The other estimation procedure that accounts for asymmetric loss is based on Granger's (1969) simple approach³, which simply adds a constant bias term to the prediction generated under the symmetric loss function. The procedure is as follows. We first estimate $\beta_1$ by

$$\tilde{\beta}_1 = \arg\min_{\beta_1} T^{-1} \sum_{t=1}^{T} \epsilon_t^2(\beta_1 \mid \beta_2). \tag{4}$$

For example, if a linear specification is preferred, then given a vector of model specification parameters $\beta_2$, $\tilde{\beta}_1$ are ordinary least squares estimates. In the next step, we add a constant term $\gamma$ to the forecast derived from $\tilde{\beta}_1$ to compute the optimal forecast $y_t^*$⁴, i.e.

$$y_t^* = f(X_t, \tilde{\beta}_1 \mid \beta_2) + \gamma. \tag{5}$$

An estimate of $\gamma$ can be obtained by minimizing the forecasting loss associated with $y_t^*$, i.e.

$$\hat{\gamma} = \arg\min_{\gamma} T^{-1} \sum_{t=1}^{T} |\alpha - I_{y_t - f(X_t, \tilde{\beta}_1 \mid \beta_2) - \gamma < 0}| \, \left( y_t - f(X_t, \tilde{\beta}_1 \mid \beta_2) - \gamma \right)^2. \tag{6}$$

We then express the estimates of Granger's simple method as $\tilde{\beta}_1 + \hat{\gamma}\iota$, in which the vector $\iota = (1, 0, \ldots, 0)'$ has the same dimension as $\tilde{\beta}_1$. The forecast for $y_{T+1}$ given $\beta_2$ is expressed as $\hat{y}_{T+1} = f(X_{T+1}, \tilde{\beta}_1 + \hat{\gamma}\iota \mid \beta_2)$.

3 The name of this method follows Weiss (1996).

2.2 Model Selection

We have presented estimation methods for computing forecasts under asymmetric quadratic loss, assuming that the model specification vector $\beta_2$ is given. However, without knowing the true DGP in practice, forecasters have to choose forecasting models from many candidate models. Take the selection of an AR model as an example. The usual approach to selecting the lag length under mean-squared error loss is to use the Akaike information criterion (AIC) or the Bayesian information criterion (BIC). With each criterion, the penalties associated with adding more parameters are constructed under the assumption of normally distributed errors, so that the differences in AIC or BIC among competing models follow chi-squared distributions. However, if we select models that minimize asymmetric quadratic loss when forecasting errors are not normally distributed, the conventional forms
4 Strictly speaking, this can only be a sub-optimal forecast for $y_T$, since the true DGP is unknown to agents.

of AIC and BIC will be inapplicable⁵. In this paper, we consider an alternative approach, namely cross-validation, to choose forecasting models according to their predictive abilities. Since Allen (1971) first suggested this method for model selection, cross-validation has been widely discussed and applied in forecasting (see Shao (1993), Kohavi (1995) and Pesaran and Timmermann (2007) for examples). In contrast to pairwise predictive ability tests such as those of Diebold and Mariano (1995) and Giacomini and White (2006), cross-validation evaluates all competing models at the same time. Another reason why we prefer cross-validation to modifying AIC or BIC as in Weiss (1996) is that cross-validation can be easily applied for any given loss function.

To apply the cross-validation method, we begin by dividing the full sample $[1 : T]$ into two parts. Observations in $[1 : T_1]$ are used for initial estimation and the rest are used for cross-validation. For $\tau \in [T_1 + 1 : T]$, the best model specification parameters are determined as

$$\beta_2^* = \arg\min_{\beta_2^{(i)}} L\left( y_\tau - \hat{y}^{(i)}_{\tau \mid \tau-1} \right), \tag{7}$$

where the forecast $\hat{y}^{(i)}_{\tau \mid \tau-1}$ is derived from model $i$, which has been estimated by incorporating asymmetric loss. Since we consider two approaches to calculating the asymmetric forecast $\hat{y}^{(i)}_{\tau \mid \tau-1}$, as shown in section 2.1, the selected $\beta_2^{(i)}$ depends on which of the estimates has been used, i.e. the asymmetric least squares estimates $\hat{\beta}_1^{(i)}$ or the Granger's simple method estimator $\tilde{\beta}_1^{(i)} + \hat{\gamma}\iota$.

3 Forecasting Unemployment Rate under Asymmetric Loss

This section discusses the models and estimation methods for forecasting one-step-ahead Australian unemployment rates under an asymmetric quadratic loss function. The data, retrieved from DATASTREAM, are the seasonally adjusted Australian monthly unemployment rate. This series, plotted in figure 1, shows that the unemployment rate climbs rapidly during two recessions in the mid-1980s and early 1990s, but declines gradually.
5 We note that Weiss (1996) suggests an approach to modifying the BIC formula so that the differences in means of in-sample asymmetric quadratic loss between competing models still follow chi-squared distributions asymptotically.

This asymmetric feature of business cycles has also been seen in the U.S. unemployment rate series (see Rothman, 1998). We use the period from 1982:07 to 2003:05 for initial model estimation, and leave 60 observations from 2003:06 to 2008:05 for out-of-sample

forecast evaluation. We conduct an Augmented Dickey-Fuller test for the full sample, and based on the resulting p-value of 0.15 we cannot reject the null hypothesis of a unit root. Therefore, we model monthly changes of the unemployment rate $\Delta u_t$ instead of the level of the unemployment rate $u_t$.

Figure 1: Monthly Australian unemployment rate. [Plot not reproduced; the vertical axis spans 3 to 12 and the horizontal axis runs from 85:01 to 05:01.]

Since linear autoregressive models are often used to forecast time series, we first consider this simple modeling technique in the paper. However, the observation of business cycle asymmetry in unemployment rates suggests that an alternative nonlinear time series model might be more appropriate than linear models for capturing this asymmetry. To justify the use of a nonlinear SETAR model for our forecasting exercise, we first examine the normality and the nonlinearity of monthly changes in the Australian unemployment rate $\Delta u_t$. Table 1 reports the descriptive statistics and p-values of the test results for the initial estimation sample from 1982:08 to 2003:05, and for the full sample from 1982:08 to 2008:05. Small p-values for the skewness test⁶ show strong evidence of skewness in the series for both samples. The Jarque-Bera test cannot reject normality for the sub-sample, which may be due to the absence of excess kurtosis in this sub-sample, but the test rejects normality in the full sample at the 5% level.

6 For the null hypothesis of no skewness, the test statistic is $\left( \dfrac{\text{skewness} - 0}{\sqrt{6/T}} \right)^2 \sim \chi^2_1$, where $T$ is the number of observations in the sample.

To justify our use of a threshold model for unemployment

rates, we apply Tsay's (1989) test for linearity against threshold nonlinearity. The last column of the table presents small p-values, supporting a SETAR specification for modeling monthly changes of Australian unemployment rates⁷.

Table 1: Normality test results and linearity test results for monthly changes of Australian unemployment rates (a)

        Obs    Mean     Std     Skew    Kurtosis   Skew test   JB      TSAY
Sub     250   -0.002    0.423   0.319   3.007      0.039       0.120   0.025
Full    310   -0.008    0.402   0.336   3.164      0.016       0.045   0.036

(a) The first row reports results for the sub-sample from 1982:08 to 2003:05, and the second row reports results for the full sample from 1982:08 to 2008:05. The first five columns are descriptive statistics: the number of observations, sample mean, standard deviation, skewness and kurtosis. The next two columns show the p-values for the skewness test and the Jarque-Bera test for normality. The last column shows the p-values of Tsay's test for linearity against threshold nonlinearity. Note that the test results correspond to a test regression with the lags of the dependent variable chosen by AIC and the threshold variable that produces the smallest p-value.

The forecasts in section 3.1 are generated using expanding window estimation; that is, we include new data in the estimation window and respecify and re-estimate the models when the forecasting origin moves forward. Section 3.1 shows a simple linear autoregressive modeling strategy that accounts for asymmetric loss. In section 3.2 we use threshold autoregressive models to capture the asymmetric movement of the unemployment rate, and we specify the forecasting model and calculate forecasts based on an asymmetric quadratic loss function. In section 3.3 we combine AR and SETAR forecasts using weighting methods that also incorporate asymmetric loss.
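As a rough cross-check on the normality columns of Table 1, the p-values can be recomputed from the reported moments. This sketch assumes that the skewness statistic is $(\text{skewness}/\sqrt{6/T})^2 \sim \chi^2_1$ and that the Jarque-Bera statistic takes its standard textbook form $\frac{T}{6}\left(S^2 + \frac{(K-3)^2}{4}\right) \sim \chi^2_2$; the latter is an assumption about what was computed, since the formula is not stated in the text. The function names are illustrative.

```python
import math


def skewness_test_pvalue(skew, n_obs):
    """P-value of (skew / sqrt(6 / n_obs))**2 under a chi-squared(1) null."""
    stat = skew ** 2 / (6.0 / n_obs)
    # Survival function of chi-squared(1): P(Z**2 > stat) = erfc(sqrt(stat / 2)).
    return math.erfc(math.sqrt(stat / 2.0))


def jarque_bera_pvalue(skew, kurt, n_obs):
    """P-value of the standard Jarque-Bera statistic under chi-squared(2)."""
    stat = n_obs / 6.0 * (skew ** 2 + (kurt - 3.0) ** 2 / 4.0)
    # Survival function of chi-squared(2) is exp(-stat / 2).
    return math.exp(-stat / 2.0)


# Moments as reported in Table 1 (rounded to three decimals in the source).
for label, n_obs, skew, kurt in [("Sub", 250, 0.319, 3.007),
                                 ("Full", 310, 0.336, 3.164)]:
    print(label,
          round(skewness_test_pvalue(skew, n_obs), 3),
          round(jarque_bera_pvalue(skew, kurt, n_obs), 3))
```

Because the inputs are themselves rounded, the recomputed values should be read as agreeing with the table only up to rounding in the third decimal place.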
7 The test results are based on the model with the lags of the dependent variable chosen by AIC (maximum lag length is 12), and they correspond to the threshold variable that generates the smallest p-value.

3.1 Linear Autoregressive Model

A linear autoregressive model of order $p$ for $\Delta u_t$ can be written as

$$\Delta u_t = \phi_0 + \sum_{i=1}^{p} \phi_i \Delta u_{t-i} + \varepsilon_t, \tag{8}$$

where we set the lag length parameter $p \le 12$. We consider the forecasts generated by OLS, together with forecasting models selected by AIC, as benchmark forecasts. To incorporate the given asymmetric quadratic loss function in model selection, we set the most recent 50 observations before the forecasting origin as the test sample for cross-validation, and use the observations in the first part to estimate all potential forecasting models. The iterative weighted least squares (IWLS) method and Granger's simple method (GSM) are applied in order to account for asymmetric quadratic loss. For instance, to determine a forecasting model for $\Delta u_{2003:06}$, we take the 50 observations from 1999:04 to 2003:05 to evaluate 12 candidate linear AR models estimated by IWLS and GSM. The model that produces the smallest average asymmetric loss from the out-of-sample forecasts across the 50 observations is the chosen forecasting model for $\Delta u_{2003:06}$. To draw attention to any benefit from applying cross-validation to select models when there is asymmetric loss, we compare the forecasting results with those based on IWLS and GSM estimated models chosen by AIC.

3.1.1 Results

Table 2 reports the average losses of out-of-sample forecasts based on the IWLS and GSM estimation methods relative to forecasts produced by OLS. A relative loss that is smaller than one indicates an absolute advantage from using the same asymmetric quadratic loss function in both estimation and evaluation. The first column shows the level of asymmetry in the loss function. The next two columns report forecasting results based on the IWLS estimation method: forecasts in the second column are generated from models selected by AIC, and forecasts in the third column are generated from models selected by cross-validation associated with the IWLS method (abbreviated as C-V(IWLS)). In the last two columns we show the relative loss of forecasts when forecasting models are estimated by GSM.
The fourth column corresponds to AIC for selecting forecasting models and the last column corresponds to cross-validation associated with GSM (abbreviated as C-V(GSM)). Comparing results across rows, we observe that as the degree of asymmetry becomes higher, i.e. as $\alpha$ moves further away from 0.5, the advantage of incorporating the loss function used in forecast evaluation into the forecast generating process becomes more obvious. It is interesting to see that for an equal degree of asymmetry, for example $\alpha = 0.2$ and $\alpha = 0.8$, the relative losses when $\alpha < 0.5$ are smaller than those when $\alpha > 0.5$. This implies that when predicting monthly Australian unemployment changes using linear AR models, agents who are more averse to over-prediction than to under-prediction achieve more benefit from incorporating this asymmetry into estimation than agents who have an

opposite asymmetric preference. Comparing the two estimation methods that both take asymmetric forecasting loss into account, we observe that IWLS performs slightly better than GSM for $\alpha < 0.5$ and becomes slightly worse than GSM for $\alpha > 0.5$⁸.

Table 2: Relative average asymmetric quadratic loss for out-of-sample forecasts based on linear AR models (a)

        IWLS                   GSM
α       AIC     C-V(IWLS)      AIC     C-V(GSM)
0.1     0.534   0.534          0.537   0.537
0.2     0.741   0.741          0.757   0.757
0.3     0.874   0.874          0.893   0.893
0.4     0.960   0.960          0.960   0.960
0.6     0.995   0.995          0.983   0.983
0.7     0.943   0.943          0.922   0.922
0.8     0.826   0.826          0.803   0.803
0.9     0.611   0.611          0.610   0.610

(a) The loss function for forecast evaluation and estimation is $L(e_t) = |\alpha - I_{e_t<0}| \, e_t^2$, where $e_t = \Delta u_t - \Delta\hat{u}_t$ and $\alpha$ is the asymmetry parameter. Observations from 2003:06 to 2008:05 are used for out-of-sample forecast evaluation. The reported average out-of-sample forecast losses are relative to those based on OLS estimation with the lag lengths of the forecasting models chosen by AIC.

Table 2 shows that regardless of the degree of asymmetry, different lag selection strategies under the same estimation method produce the same relative forecasting results. We use figure 2 to illustrate the lag lengths chosen by AIC, cross-validation under IWLS and cross-validation under GSM when $\alpha = 0.2$. The horizontal axis shows the 60 forecasting origins from 2003:05 to 2008:04. It is clear that the maximum of 12 lags is consistently chosen by AIC and by both cross-validation procedures for every forecast⁹. In fact, the same lag structure is chosen for all levels of asymmetry. We also find that even if we increase the maximum number of lags that we consider for a linear AR model, all three model selection strategies still choose the longest lag length.
8 Weiss (1996) comments that with the same forecasting model, GSM may be preferred if the variance of the OLS estimates in the first step of GSM is smaller than the variance of the IWLS estimates and a Hausman test cannot reject the null hypothesis that the IWLS and GSM estimates are the same.

9 BIC also consistently chooses the maximum of 12 lags.

While this shows that our asymmetric model selection criterion offers no advantage over AIC in this case, it also raises the possibility that a long but linear lag structure is not

sufficiently flexible to capture the dynamic features of Australian monthly unemployment rate changes.

Figure 2: Lag lengths for linear AR models chosen by different strategies. [Plot not reproduced; series shown are AIC, C-V(IWLS) and C-V(GSM), with lag length on the vertical axis and forecasting origins from 04:01 to 08:01 on the horizontal axis.]

3.2 Self-Exciting Threshold Autoregressive Model

Threshold autoregressive (TAR) models are often found useful for characterizing the asymmetric evolution of macroeconomic variables over business cycles. We consider two regimes, namely expansion and recession, for modeling Australian monthly unemployment rate changes, and we allow a lagged unemployment rate change to be the threshold variable that distinguishes between the two regimes. A self-exciting threshold autoregressive (SETAR) model with two regimes takes the form

$$\Delta u_t = \begin{cases} \phi_{10} + \sum_{i=1}^{p} \phi_{1i} \Delta u_{t-i} + \varepsilon_{1t}, & \Delta u_{t-d} < \theta \\ \phi_{20} + \sum_{i=1}^{p} \phi_{2i} \Delta u_{t-i} + \varepsilon_{2t}, & \Delta u_{t-d} \ge \theta. \end{cases} \tag{9}$$

Note that for simplicity we let the linear AR models in each regime share the same lag structure. The delay parameter $d$, in addition to the threshold parameter $\theta$, is unknown and determined by data-dependent selection procedures. The errors $\varepsilon_{1t}$ and $\varepsilon_{2t}$ are i.i.d.$(0, \sigma_1^2)$ and i.i.d.$(0, \sigma_2^2)$, respectively.
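The regime-switching rule can be sketched as a one-step-ahead point forecast, assuming the two-regime form in which the regime is picked by comparing $\Delta u_{t-d}$ with $\theta$ and both regimes share the lag length $p$. All coefficient values are supplied by the caller; the function and argument names are hypothetical and the sketch covers only the forecast step, not estimation.

```python
def setar_one_step_forecast(history, phi_low, phi_high, delay, threshold):
    """One-step-ahead forecast from a two-regime SETAR model (illustrative).

    history   : past changes in the unemployment rate, most recent last
    phi_low   : [phi_10, phi_11, ..., phi_1p], used when the threshold
                variable is below the threshold
    phi_high  : [phi_20, phi_21, ..., phi_2p], used otherwise
    delay     : delay parameter d (at least 1)
    threshold : threshold parameter theta
    """
    if len(phi_low) != len(phi_high):
        raise ValueError("both regimes must share the same lag length")
    p = len(phi_low) - 1
    if delay < 1 or len(history) < max(p, delay):
        raise ValueError("not enough history for the chosen p and d")
    # For a forecast of period T+1, the threshold variable is the change
    # observed d periods earlier, i.e. history[-delay].
    phi = phi_low if history[-delay] < threshold else phi_high
    lags = [history[-i] for i in range(1, p + 1)]
    return phi[0] + sum(c * x for c, x in zip(phi[1:], lags))
```

For example, with `history = [0.1, -0.2, 0.3]`, `delay = 2` and `threshold = 0.0`, the threshold variable is `-0.2`, so the coefficients in `phi_low` are applied to the most recent lags.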

3.2.1 Model specification and estimation

Under a symmetric quadratic loss function, the specification of a SETAR model is often determined by a procedure in which the lag length $p$ is determined first, followed by the delay parameter $d$ and the threshold value $\theta$. For instance, a linear AR model is selected by AIC first. Given the chosen value of $p$, for each possible choice of the delay parameter $d$, a grid search can be applied to find the threshold parameter that leads to the smallest in-sample sum of squared residuals. The delay parameter $d^*$ (and its associated best threshold parameter $\theta^*$) is then located by minimizing the in-sample sum of squared residuals. Under symmetric quadratic loss, OLS is used to locate $d^*$ and $\theta^*$.

When loss is asymmetric, it is natural to choose the delay parameter and the threshold value by minimizing the in-sample weighted sum of squared residuals, given a lag length. Because the trade-off between the in-sample weighted sum of squared residuals and the number of parameters to be estimated may not be the same as AIC or BIC indicates under symmetric quadratic loss, we apply the cross-validation method to specify the lag length parameter for TAR models. The procedure consists of the following stages.

In the first step, we start by setting $p = 1, 2, \ldots, 12$ and $d = 1, 2, \ldots, 12$. Given each value of $p$, we grid search for the best threshold parameter $\theta^*(p, d)$ for each potential value of $d$ by minimizing the sum of in-sample losses based on estimation that accounts for asymmetric quadratic loss. We trim the range of potential values of the threshold parameter, so that the highest and lowest 15% of the values of $\Delta u_{t-d}$ are excluded. This ensures that there are enough observations in each regime for estimation. Using the IWLS estimation method as an example, we denote the estimated AR coefficients in regime $r = 1, 2$ by $\hat{\phi}_r$ and the corresponding IWLS estimated residuals by $\hat{\varepsilon}_r$; these depend on the parameters $p$, $d$, $\theta(p, d)$ and $\hat{\phi}_r$.
Therefore, we write the chosen threshold $\theta^*(p, d)$ under the IWLS method as

$$\theta^*(p, d) = \arg\min_{\theta} \sum_{r} \sum_{t} |\alpha - I_{\hat{\varepsilon}_{rt}<0}| \, \hat{\varepsilon}_{rt}^2. \tag{10}$$

In the next step, we determine the delay parameter $d^*(p)$ associated with its best $\theta^*(p, d)$ for each potential value of the lag length $p$:

$$d^*(p) = \arg\min_{d} \sum_{r} \sum_{t} |\alpha - I_{\hat{\varepsilon}_{rt}<0}| \, \hat{\varepsilon}_{rt}^2, \tag{11}$$

where $\hat{\varepsilon}_{rt}$ depends on $p$, $d$, $\theta^*(p, d)$ and the estimation method that accounts for asymmetric loss. Finally, we conduct cross-validation to select the lag length $p^*(d^*, \theta^*)$ for our TAR

model, given by

$$p^*(d^*, \theta^*) = \arg\min_{p} \frac{1}{50} \sum_{t=T-50}^{T-1} L\left( \Delta u_{t+1} - \Delta\hat{u}_{t+1}(p \mid d^*, \theta^*) \right), \tag{12}$$

where $L$ denotes the asymmetric quadratic loss function and $T$ denotes the forecasting origin. The best lag length choice $p^*$ corresponds to the best choice of the delay parameter $d^*$ and its associated threshold parameter $\theta^*$.

This model selection procedure reflects the idea of selecting a model that minimizes the asymmetric quadratic loss, both in-sample and out-of-sample. We apply cross-validation at the stage of choosing between models that have different lags of the dependent variable¹⁰. This is because the penalty for adding more lags of the dependent variable indicated by AIC may no longer hold when the loss function is asymmetric. Note that since the procedure involves estimation of the coefficients $\phi$, employing different estimation methods may lead to different results for $p^*$, $d^*$ and $\theta^*$.

Figure 3 shows the choices of $p$ and $d$ made by the conventional method based on symmetric quadratic loss (abbreviated as SYM) as well as by our proposed cross-validation based on iterative weighted least squares (abbreviated as C-V(IWLS)) and Granger's simple method (abbreviated as C-V(GSM)) across all forecasting origins. When the degree of asymmetry $\alpha$ is 0.2, we observe that SYM and C-V(GSM) choose the same lag length over time, i.e. 12 lags, and they consistently choose 3 and 4 as the delay parameter, respectively. In contrast to the stable set of models selected by SYM and C-V(GSM), C-V(IWLS) often chooses a different lag length, and its choice of the delay parameter shifts between 7 and 12, implying that in this scenario C-V(IWLS), compared with the other two methods, prefers a longer period for the regime switch to occur. When $\alpha = 0.8$, meaning that under-prediction is weighted more heavily than over-prediction in the loss, the three methods almost always choose 12 lags of the dependent variable, except that C-V(IWLS) chooses 5 lags for the first 4 forecasts, i.e. $\Delta u_{2003:06}$, $\Delta u_{2003:07}$, $\Delta u_{2003:08}$ and $\Delta u_{2003:09}$.
10 We can potentially use cross-validation to determine all model specification parameters, but this involves a higher computational effort.

Both SYM and C-V(GSM) choose the same delay parameters as they do when $\alpha = 0.2$, whereas C-V(IWLS) chooses 12 at the beginning and then 3 for all subsequent forecasts. Figure 3 shows that forecasting models chosen by procedures that take the asymmetric loss function into account may differ from those chosen by procedures based on symmetric loss. Also, the results from our proposed cross-validation procedure depend on the

estimation method. It seems that the 60 forecasting models chosen by the GSM-based procedure are more similar to each other than those chosen by the IWLS-based procedure. Additionally, if forecasters weigh the losses associated with over-prediction and under-prediction differently, they will use different forecasting models after applying cross-validation for model specification.

Figure 3: Model specification for SETAR models by different strategies. [Plots not reproduced; panels for α = 0.2 and α = 0.8 show the lag length and the delay parameter chosen by SYM, C-V(IWLS) and C-V(GSM) against the forecasting origin T, from 04:01 to 08:01.]

3.2.2 Results

We report one-step-ahead out-of-sample forecasting results for SETAR models in Table 3. The first column shows the asymmetry levels of the loss functions. The next two

columns, headed by IWLS, report relative forecast losses when forecasting models are estimated by IWLS. In particular, the second column, with the sub-heading SYM, shows relative forecast losses when forecasting models are selected by the conventional method that is often used under symmetric quadratic loss, and the third column, headed by C-V(IWLS), reports relative forecast losses when forecasting models are selected by cross-validation associated with the IWLS method. The last two columns, headed by GSM, relate to the use of GSM for estimation. All of the reported average asymmetric losses based on asymmetric estimation methods are relative to the average asymmetric losses based on the symmetric estimation method, with the same degree of asymmetry.

Table 3 shows that most of the relative losses are smaller than 1, which indicates that in general the forecasting performance of TAR models is improved after we adjust the estimation method to match the loss function used in forecast evaluation. In particular, when $\alpha = 0.1$, we observe reductions of roughly 46% to 53% in average loss. Comparing the second column with the fourth column, we see that if we use the same conventional method to select forecasting models but estimate the models using IWLS and GSM, respectively, we produce similar out-of-sample forecasting results. However, columns 3 and 5 show that if we use cross-validation for model selection and IWLS and GSM for model estimation, we may obtain quite different out-of-sample forecasting results. This is mainly because cross-validation involves model estimation, and the use of different estimation methods results in different forecasting models being selected. Comparing the results for the two model selection strategies, we do not observe considerable forecast improvement from applying cross-validation, even though it incorporates asymmetric quadratic loss when selecting forecasting models.
Since cross-validation selects models based on their predictive ability, its performance might be subject to factors such as the stability of the series.

Table 3: Relative average asymmetric quadratic loss for the out-of-sample forecasts based on TAR models (a)

            IWLS                    GSM
α      SYM      C-V(IWLS)     SYM      C-V(GSM)
0.1    0.484    0.543         0.474    0.525
0.2    0.688    0.906         0.689    0.692
0.3    0.826    0.971         0.832    0.828
0.4    0.930    1.054         0.934    0.934
0.6    1.029    0.932         1.026    0.982
0.7    1.001    1.048         0.996    0.965
0.8    0.889    0.928         0.889    0.880
0.9    0.649    0.618         0.662    0.795

(a) The loss function for forecasting evaluation and estimation is $L(e_t) = |\alpha - I_{e_t<0}|\, e_t^2$, where $e_t = u_t - \hat{u}_t$ and α is the asymmetry parameter. Observations from 2003:06 to 2008:05 are used for out-of-sample forecasting evaluation. The reported average out-of-sample forecasting losses are relative to those based on OLS estimation, with lag lengths, delay parameters and threshold parameters of the forecasting models chosen by the conventional selection procedure based on symmetric quadratic loss.

3.3 Forecast Combination under Asymmetric Quadratic Loss

The performance of a combined forecast is determined by the weights on the individual forecasts used in the combination. A commonly used weighting method is to linearly regress the out-of-sample actual time series on a constant and the individual forecasting series, with the constant correcting for possible bias in the individual forecasts. Under a symmetric quadratic loss function, OLS is used to estimate the bias correction term and the combination weights. This method has been extended to general loss functions by Elliott and Timmermann (2004). They show that when the forecast errors are elliptically symmetric, optimal forecast combination weights are the same as OLS estimated weights regardless of the loss function, and only the constant needs to be adjusted to account for the relevant loss. They also suggest weight estimation methods that incorporate specific loss functions, e.g. the iterative weighted least squares estimation method for asymmetric quadratic loss.

In this paper, we employ these two estimation approaches proposed in Elliott and Timmermann (2004) to estimate the parameters $\omega_c$, $\omega_{AR}$ and $\omega_{TAR}$ in equation (13). We combine unemployment rate forecasts generated by AR and SETAR models under asymmetric quadratic loss, and estimate

$$u_{t+1} = \omega_c + \omega_{AR}\,\hat{u}^{AR}_{t+1} + \omega_{TAR}\,\hat{u}^{TAR}_{t+1} + \nu_{t+1}, \quad \text{for } t = T, T+1, \ldots, T+60. \tag{13}$$

Our interest in this combination exercise centers on the following two aspects. Firstly, we would like to see whether forecasting performance can be improved by combining individual AR and SETAR forecasts for different levels of asymmetry in the loss function. Secondly, since we estimate combination weights using estimation methods that incorporate an asymmetric loss function, one might argue that even if individual forecasts are

derived by methods that assume symmetric loss, the bias can be fixed by combining them. Therefore, we examine the usefulness of considering asymmetric loss to obtain individual forecasts before forecast combination.

Table 4: Descriptive statistics and normality test results for forecast errors (a)

Series        Mean      Std Dev   Skew      Kurtosis   Skew test   JB
AR(OLS)       -0.002    0.187     -0.413    2.647      0.191       0.365
TAR(OLS)      -0.020    0.184     -0.486    2.837      0.124       0.297
α = 0.1
AR(IWLS)       0.218    0.182     -0.521    2.972      0.100       0.258
AR(GSM)        0.224    0.186     -0.413    2.640      0.191       0.362
TAR(IWLS)      0.223    0.206      0.035    3.411      0.911       0.805
TAR(GSM)       0.232    0.195     -0.350    2.464      0.268       0.378
α = 0.9
AR(IWLS)      -0.229    0.196     -0.391    2.612      0.216       0.385
AR(GSM)       -0.237    0.187     -0.413    2.666      0.191       0.370
TAR(IWLS)     -0.201    0.191     -0.583    3.022      0.065       0.183
TAR(GSM)      -0.246    0.202     -0.664    2.959      0.036       0.110

(a) This table reports the estimates of the mean, standard deviation, skewness and kurtosis, together with normality test results, for each forecast error series. The first two rows are for forecasts generated under a symmetric quadratic loss. The remaining rows are for forecasts using cross-validation and the estimation methods that consider asymmetric quadratic loss. The last two columns report p-values of the skewness test and the Jarque-Bera test, respectively.

All individual forecasts are one-step-ahead forecasts of monthly changes of Australian unemployment rates that we calculated in the previous sections. In particular, we focus on OLS estimated forecasts, IWLS estimated forecasts based on models selected by cross-validation, as well as forecasts using Granger's simple estimation method and cross-validation for model selection. Since some forecasts are generated from asymmetric SETAR models and have been designed to incorporate asymmetric loss, we expect skewness rather than symmetry in these forecasts. OLS estimated forecasts do not incorporate asymmetric loss and therefore they might be symmetric.
However, the properties of forecast errors depend on whether the true DGP of the data is symmetric and whether we forecast it well. To make sure that it is appropriate to employ Elliott and Timmermann's two-stage procedure (i.e. to use OLS weights $\omega_{AR}$ and $\omega_{TAR}$ and only correct the bias term $\omega_c$ using the asymmetric quadratic loss function; see footnote 11), we conduct skewness tests and Jarque-Bera normality tests on each forecast error series. Table 4 reports their descriptive statistics and the p-values of the normality tests when the degree of asymmetry is high, i.e. α = 0.1 and α = 0.9. The skewness and kurtosis statistics depart only slightly from their values under a normal distribution, and significant skewness is found only in the TAR forecasts estimated by IWLS and GSM when α = 0.9, at the 10% and 5% test levels, respectively.

Table 5 shows the average loss of individual out-of-sample forecasts and combination forecasts over 60 months from 2003:06 to 2008:05. In the top panel, individual forecasts are generated based on OLS estimates, whereas in the middle and the bottom panels individual forecasts are calculated from models selected by cross-validation and estimated using methods that take into account asymmetric quadratic loss. The first column shows degrees of asymmetry in the loss function. The second and the third columns report the average loss of the individual forecasts generated from AR and TAR models. The following columns, headed by OLS-w, IWLS-w and 2stage-w, provide the average loss of combination forecasts with the weights $\omega_c$, $\omega_{AR}$ and $\omega_{TAR}$ in equation (13) estimated by OLS, IWLS and Elliott and Timmermann's two-stage procedure, respectively. In the last column, we combine AR and TAR forecasts by taking a simple average. We use underlines to show the smallest loss(es) for each level of asymmetry.

The losses of individual forecasts reported in the bottom two panels are in general smaller than those in the top panel for each value of α. This shows that it is usually beneficial to incorporate the loss function when we choose and estimate forecasting models, and the benefit is more obvious when the degree of asymmetry is higher. In the top panel, where individual forecasts are generated based on symmetric loss, combination methods generally improve forecasting performance.
When we ignore the loss function and simply estimate combination weights by OLS without correcting the constant term, average losses of combined forecasts are smaller than those of both individual forecasts when α ≤ 0.6, though not when α ≥ 0.7, where the TAR forecast alone has a smaller loss (Table 5, top panel). We observe a large reduction in the forecast loss after using IWLS and Elliott and Timmermann's two-stage procedure to combine the two individual OLS-estimated forecasts, particularly when α is far from 0.5. Taking a simple average tends to improve on the individual forecasts, as might be expected, but it produces worse forecasts than the combination methods that incorporate asymmetric loss. These empirical findings are consistent with the simulation results in Elliott and Timmermann (2004).

In contrast to Elliott and Timmermann (2004), in which the individual forecasts used for combination do not incorporate asymmetric loss, our paper also examines combinations of individual forecasts that are themselves derived using the asymmetric loss function. Results for different combination strategies are shown in the bottom two panels. We find that in general combination forecasts improve forecasting results, except that when there is a high degree of asymmetry in the loss, the OLS-weighted combination method produces worse forecasts than the individual forecasts. When we use the combination methods that incorporate asymmetric loss, we achieve better forecasts than the individual forecasts regardless of the degree of asymmetry. Comparing the fifth and the sixth columns in each panel, we note that the two-stage combination method produces results very similar to those of the IWLS combination method. The reason, as suggested in Elliott and Timmermann (2004), is that when forecast errors have an elliptically symmetric distribution, the two-stage method consistently estimates the optimal weights, as the IWLS method does. In contrast to the superior performance often found in empirical papers that use symmetric quadratic loss, the equally weighted forecasting method may not outperform the other combination methods under asymmetric loss.

Comparing IWLS-weighted and two-stage-weighted forecasts across the three panels, we find that for each value of α the average losses are approximately the same no matter how the individual forecasts have been generated. This implies that the disadvantages of ignoring asymmetric loss when producing individual forecasts can be attenuated by using combination methods that account for asymmetric loss.

Footnote 11: The estimation technique is essentially the same as Granger's simple method.
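The three regression-based weighting schemes compared in this section can be sketched in a few lines of Python. This is only an illustrative sketch, not the code used in the paper: the function names (`asym_quad_loss`, `iwls`, `two_stage`) and the synthetic data are hypothetical, the error is assumed to be actual minus forecast, and the weights are fitted and evaluated on the same sample for simplicity.

```python
import numpy as np


def asym_quad_loss(e, alpha):
    """Asymmetric quadratic loss |alpha - 1(e < 0)| * e^2, elementwise.
    Assumption: e = actual - forecast."""
    e = np.asarray(e, dtype=float)
    return np.abs(alpha - (e < 0).astype(float)) * e**2


def iwls(X, y, alpha, tol=1e-10, max_iter=500):
    """Minimise the average asymmetric quadratic loss of y - X @ beta
    by iterative weighted least squares, starting from OLS."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    for _ in range(max_iter):
        resid = y - X @ beta
        w = np.abs(alpha - (resid < 0).astype(float))  # weights from residual signs
        Xw = X * w[:, None]
        beta_new = np.linalg.solve(X.T @ Xw, Xw.T @ y)
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta


def two_stage(X, y, alpha):
    """Keep the OLS slope weights and re-estimate only the constant
    (first column of X) under the asymmetric loss."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    partial = y - X[:, 1:] @ beta[1:]           # actual minus weighted forecasts
    ones = np.ones((len(y), 1))
    beta[0] = iwls(ones, partial, alpha)[0]     # loss-specific constant
    return beta


# Synthetic stand-ins for the actual series and the AR / TAR forecasts.
rng = np.random.default_rng(0)
n, alpha = 60, 0.1
y = rng.normal(0.0, 0.2, n)
f_ar = 0.5 * y + rng.normal(0.0, 0.15, n)
f_tar = 0.5 * y + rng.normal(0.0, 0.15, n)
X = np.column_stack([np.ones(n), f_ar, f_tar])  # regressors of equation (13)

w_ols = np.linalg.lstsq(X, y, rcond=None)[0]
w_iwls = iwls(X, y, alpha)
w_2s = two_stage(X, y, alpha)

for name, w in [("OLS-w", w_ols), ("IWLS-w", w_iwls), ("2stage-w", w_2s)]:
    print(name, asym_quad_loss(y - X @ w, alpha).mean())
```

In this sketch the two-stage weights differ from the OLS weights only in the constant, which is the sense in which the bias correction is deferred to the combination stage.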

Table 5: Average losses of individual out-of-sample forecasts and combined out-of-sample forecasts for monthly changes of Australian unemployment rates (a)

       Individual forecasts    Combined forecasts
α      AR        TAR           OLS-w     IWLS-w    2stage-w   Equal-w

Individual forecasts are generated by OLS estimates:
0.1    0.0190    0.0206        0.0178    0.0093    0.0093     0.0192
0.2    0.0186    0.0197        0.0173    0.0131    0.0131     0.0184
0.3    0.0181    0.0187        0.0169    0.0151    0.0151     0.0176
0.4    0.0176    0.0177        0.0164    0.0160    0.0160     0.0169
0.5    0.0171    0.0168        0.0159    0.0159    0.0159     0.0161
0.6    0.0166    0.0158        0.0155    0.0151    0.0151     0.0153
0.7    0.0162    0.0148        0.0150    0.0134    0.0134     0.0145
0.8    0.0157    0.0139        0.0145    0.0107    0.0107     0.0137
0.9    0.0152    0.0129        0.0141    0.0067    0.0068     0.0129

Individual forecasts are generated by IWLS estimates:
0.1    0.0102    0.0112        0.0174    0.0090    0.0091     0.0098
0.2    0.0138    0.0178        0.0178    0.0132    0.0133     0.0148
0.3    0.0158    0.0182        0.0173    0.0154    0.0154     0.0160
0.4    0.0169    0.0187        0.0167    0.0162    0.0162     0.0167
0.5    0.0171    0.0172        0.0159    0.0159    0.0159     0.0160
0.6    0.0166    0.0147        0.0144    0.0141    0.0141     0.0146
0.7    0.0152    0.0155        0.0156    0.0139    0.0139     0.0145
0.8    0.0130    0.0129        0.0151    0.0112    0.0112     0.0122
0.9    0.0093    0.0080        0.0147    0.0070    0.0070     0.0083

Individual forecasts are generated by Granger's simple method:
0.1    0.0102    0.0108        0.0182    0.0092    0.0092     0.0102
0.2    0.0140    0.0136        0.0174    0.0131    0.0131     0.0137
0.3    0.0161    0.0155        0.0170    0.0152    0.0152     0.0157
0.4    0.0171    0.0166        0.0169    0.0165    0.0165     0.0167
0.5    0.0171    0.0164        0.0163    0.0163    0.0163     0.0167
0.6    0.0164    0.0155        0.0155    0.0151    0.0151     0.0158
0.7    0.0149    0.0143        0.0152    0.0136    0.0136     0.0145
0.8    0.0126    0.0122        0.0148    0.0109    0.0109     0.0123
0.9    0.0093    0.0102        0.0154    0.0073    0.0073     0.0095

(a) This table reports the average loss of one-step-ahead out-of-sample forecasts from 2003:06 to 2008:05. We generate combination forecasts by estimating equation (13) based on out-of-sample AR and TAR forecasts. The underlines show the smallest loss(es) in each row.

4 Conclusion

In this paper we focus on producing out-of-sample forecasts that are based on a more realistic asymmetric loss function. We incorporate the asymmetric quadratic loss function into every step associated with generating a forecast, including the model selection stage, the model estimation stage and the forecast combination stage. Apart from linear autoregressive models, we also examine how to incorporate the asymmetric loss function to specify and estimate self-exciting threshold autoregressive models, which are often used to characterize the cyclical behavior of many macroeconomic time series.

In contrast to model selection criteria such as AIC and BIC, which are valid only when there is symmetric loss, the cross-validation method evaluates different models according to their out-of-sample forecasting performance, and thus can easily incorporate any type of loss function. This paper proposes the use of cross-validation to specify AR and SETAR models when the loss function is asymmetric quadratic. We adopt two estimation techniques in Weiss (1996), namely iterative weighted least squares, which is based on minimizing the in-sample asymmetric quadratic loss, and Granger's simple method, which uses OLS estimates and only adjusts the intercept for the asymmetric loss function. When combining forecasts generated from AR and SETAR models, we estimate combination weights by regressing the actual series on an intercept and the two forecasts, and the estimation methods, as suggested in Elliott and Timmermann (2004), are iterative weighted least squares and Granger's simple method (two-stage procedure).

In the empirical application, we forecast one-step-ahead monthly changes of the Australian unemployment rate, assuming an asymmetric quadratic loss function for forecasting agents or users. The major findings are as follows. Firstly, forecasting performance is largely improved if model estimation methods incorporate the loss function that is used for forecast evaluation.
Secondly, in the case of forecasting Australian unemployment rates, we do not observe obvious benefits from employing model selection methods that incorporate the loss function. Thirdly, combining forecasts generated from different modeling strategies improves forecasting results, and the improvements are considerable when the weighting methods also incorporate the loss function. For example, both the IWLS-weighting method and the two-stage weighting method consistently outperform the OLS-based weighting technique and the equal weighting technique when the loss function is asymmetric. Lastly, we observe that the use of appropriate combination methods can offset the loss from ignoring the asymmetric loss function when we produce individual forecasts.

We explore the reason for this last finding by using the two-stage weighting method in Elliott and Timmermann (2004) as an example. Combining two least squares forecasts

using a constant term adjusted for minimizing asymmetric loss works for the same reason that Granger's simple method works. Individual least squares forecasts are biased in the sense that they are calculated based on minimizing an incorrect loss function. Granger's simple method uses a constant term estimated based on the true loss function to correct the bias in individual forecasts, whereas the two-stage method corrects these biases using a constant at the combination stage. This finding has an interesting practical implication for forecasters: if they prefer combining forecasts, they do not need to consider their forecast loss function until the last stage of forecast combination.

References

Allen, D. M. (1971): "Mean Square Error of Prediction as a Criterion for Selecting Variables," Technometrics, 13, 469-475.
Auffhammer, M. (2007): "The rationality of EIA forecasts under symmetric and asymmetric loss," Resource and Energy Economics, 29, 102-121.
Corder, J. K. (2005): "Managing Uncertainty: The Bias and Efficiency of Federal Macroeconomic Forecasts," Journal of Public Administration Research and Theory, 15, 55-70.
Diebold, F. X. and R. S. Mariano (1995): "Comparing Predictive Accuracy," Journal of Business and Economic Statistics, 13, 253-263.
Elliott, G. and A. Timmermann (2004): "Optimal forecast combinations under general loss functions and forecast error distributions," Journal of Econometrics, 122, 47-79.
Elliott, G. and A. Timmermann (2008a): "Economic Forecasting," Journal of Economic Literature, 46, 3-56.
Elliott, G., I. Komunjer, and A. Timmermann (2005): "Estimation and Testing of Forecast Rationality under Flexible Loss," Review of Economic Studies, 72, 1107-1125.
Elliott, G., I. Komunjer, and A. Timmermann (2008b): "Biases in Macroeconomic Forecasts: Irrationality or Asymmetric Loss?" Journal of the European Economic Association, 6, 122-157.
Giacomini, R. and H. White (2006): "Tests of Conditional Predictive Ability," Econometrica, 74, 1545-1578.

Granger, C. W. J. (1969): "Prediction with a Generalized Cost of Error Function," Operational Research Society, 20, 199-207.
Kohavi, R. (1995): "A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection," in Proceedings of the 14th International Joint Conference on Artificial Intelligence, ed. by C. S. Mellish, Morgan Kaufmann, 1137-1143.
Montgomery, A., V. Zarnowitz, R. S. Tsay, and G. C. Tiao (1998): "Forecasting the U.S. Unemployment Rate," Journal of the American Statistical Association, 93, 478-493.
NBER (2008): "Business Cycle Expansions and Contractions," http://www.nber.org/cycles.html.
Newey, W. K. and J. L. Powell (1987): "Asymmetric Least Squares Estimation and Testing," Econometrica, 55, 819-847.
Pesaran, M. H. and A. Timmermann (2007): "Selection of Estimation Window in the Presence of Breaks," Journal of Econometrics, 137, 134-161.
Rothman, P. (1998): "Forecasting Asymmetric Unemployment Rates," Review of Economics and Statistics, 80, 164-168.
Shao, J. (1993): "Linear Model Selection by Cross-Validation," Journal of the American Statistical Association, 88, 486-494.
Tsay, R. S. (1989): "Testing and Modeling Threshold Autoregressive Processes," Journal of the American Statistical Association, 84, 231-240.
Weiss, A. A. (1996): "Estimating Time Series Models Using The Relevant Cost Function," Journal of Applied Econometrics, 11, 539-560.

Appendix

In this appendix, we consider an application of forecasting U.S. monthly growth rates of industrial production. We retrieve the monthly U.S. industrial production index (seasonally adjusted) from FRED (Federal Reserve Economic Data), and we transform it into a series of monthly growth rates by taking the first difference of the natural logarithm of the data. The full sample covers the period from 1960:01 to 2009:02, in which 521 observations from 1960:01 to 2003:05 are used for the initial in-sample estimation and 69 observations from 2003:06 to 2009:02 are used for the out-of-sample forecasting exercise. Figure A.1 shows the movement of the monthly growth rate of U.S. industrial production for the full sample. The shaded areas indicate U.S. recessions according to the National Bureau of Economic Research (2008). It is obvious that the series moves cyclically, and most of the troughs match the recession periods.

Figure A.1: Monthly growth rate of U.S. industrial production. [Vertical axis: monthly growth rate (%); horizontal axis: time, from 60:01.]

Table A.1 presents the descriptive statistics as well as the normality test and Tsay's (1989) linearity test results for the data in the initial in-sample period and the full sample period. Near-zero p-values for the Jarque-Bera test show strong evidence of non-normality in the data, and based on Tsay's test we reject linearity at the 10% test level for the data during the sub-sample period. In the following forecasting exercise, we consider both linear autoregressive (AR) models and self-exciting threshold autoregressive (SETAR) models.
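The growth-rate transformation described above (first difference of the natural logarithm) can be written out as a minimal sketch. The function name `log_growth_rates` and the index values are hypothetical, and the scaling by 100 is an assumption based on the percentage axis of Figure A.1.

```python
import numpy as np


def log_growth_rates(index, in_percent=True):
    """First difference of the natural log of an index series.
    Returns one fewer observation than the input."""
    index = np.asarray(index, dtype=float)
    growth = np.diff(np.log(index))
    return 100.0 * growth if in_percent else growth


# Hypothetical index values, not actual industrial production data.
ipi = np.array([100.0, 101.0, 100.5, 102.0])
growth = log_growth_rates(ipi)
print(growth)
```

Note that differencing drops the first observation, so a series of n index values yields n - 1 growth rates.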