LASSO-Type Penalties for Covariate Selection and Forecasting in Time Series


Journal of Forecasting, J. Forecast. 35 (2016). Published online 21 February 2016 in Wiley Online Library (wileyonlinelibrary.com), DOI: /for.2403

LASSO-Type Penalties for Covariate Selection and Forecasting in Time Series

EVANDRO KONZEN 1 AND FLAVIO A. ZIEGELMANN 2
1 Graduate Program in Economics, Federal University of Rio Grande do Sul, Porto Alegre, RS, Brazil; and School of Mathematics and Statistics, Newcastle University, Newcastle upon Tyne, United Kingdom
2 Department of Statistics and Graduate Programs in Economics and Management, Federal University of Rio Grande do Sul, Porto Alegre, RS, Brazil

ABSTRACT

This paper studies some forms of LASSO-type penalties in time series, both to reduce the dimensionality of the parameter space and to improve out-of-sample forecasting performance. In particular, we propose a method that we call WLadaLASSO (weighted lag adaptive LASSO), which not only assigns a different weight to each coefficient but also further penalizes the coefficients of higher-lagged covariates. In our Monte Carlo implementation, the WLadaLASSO is superior to both the LASSO and the adaLASSO in terms of covariate selection, parameter estimation precision and forecasting, especially for a larger number of candidate lags and a stronger linear dependence between predictors. Empirical studies illustrate our approach for US risk premium and US inflation forecasting with good results. Copyright 2016 John Wiley & Sons, Ltd.

KEY WORDS: time series; LASSO; adaLASSO; variable selection; forecasting

INTRODUCTION

High-dimensional models have become increasingly present in the literature. It is known that the inclusion of a large number of economic and financial variables can contribute substantial gains to time series forecasting. As Song and Bickel (2011) point out, a challenging problem is to determine which variables and lags are relevant, especially when serial correlation, a high-dimensional dependence structure among variables and a small sample size (relative to the dimensionality) occur together. As Fan and Lv (2010) state, statistical accuracy, interpretability of the model and computational complexity are three important pillars of any statistical procedure. Typically, the number of observations n is much greater than the number of variables or parameters p. However, when the dimensionality p is large compared to the sample size n, traditional methods face several challenges: among them, how to build interpretable models that remain estimable; how to make the statistical procedures robust and computationally efficient; and how to obtain procedures that are more efficient in terms of statistical inference. Moreover, in a high-dimensional context, when the number of covariates p is large compared to the sample size n, traditional models can suffer from spurious correlation between the covariates, which can be serious even when the covariates are independent and identically distributed, as shown in Fan and Lv (2008) and Fan et al. (2012). One way to address the problems caused by high dimensionality is a sparsity assumption on the p-dimensional parameter vector, forcing many of its components to be exactly zero. Although it generally produces biased estimates, the sparsity assumption helps one to identify the important covariates, thus yielding a more parsimonious model and reducing both its complexity and the computational cost of estimating it.
As Medeiros and Mendes (2012) comment, factor models provide a good alternative when many variables are important in the model, a situation the authors refer to as a dense model structure. However, when the coefficient vector is indeed sparse, methods that assume sparsity gain importance. The least absolute shrinkage and selection operator (LASSO) (Tibshirani, 1996) was proposed in a linear regression context. It imposes a penalty on the L1 norm of the coefficient vector. Owing to the nature of this penalty, the LASSO forces some coefficients to be exactly zero, making it useful for selecting covariates and reducing the dimensionality of the parameter space. This methodology is an example of the regularization techniques characterized by Breiman (1995), which consider an error function of the form

E = (error on data) + \lambda (model complexity),

where the sum of squared residuals can, for instance, play the role of the 'error on data' term.

Correspondence to: Flavio A. Ziegelmann, Department of Statistics and Graduate Programs in Economics and Management, Federal University of Rio Grande do Sul, Porto Alegre, RS, Brazil. E-mail: flavioaz@mat.ufrgs.br

The second term penalizes models whose complexity, and hence whose estimator variance, is high, with \lambda representing how severe the penalty is. When one minimizes this error function rather than just the error on the data, complex models are penalized and estimator variances are reduced. If \lambda is too large, though, only very simple models are obtained and a large bias can be introduced. The LASSO is one of the best-known regularization techniques, succeeding mainly in cases where many of the coefficients to be estimated are null. Thus the LASSO also becomes useful for the selection of covariates.

Zou (2006) investigates the oracle properties, mentioned by Fan and Li (2001), of the original LASSO proposed by Tibshirani (1996). Zou (2006) shows that there are cases in which the LASSO is not consistent in variable selection and hence proposes the adaLASSO (adaptive LASSO), in which the penalty carries a different weight for each coefficient, allowing this version to enjoy the oracle properties. In a time series context, LASSO-type penalties are employed in De Mol et al. (2008), Hua (2011) and Li (2012). Medeiros and Mendes (2012) show that the adaLASSO consistently chooses the relevant variables as the number of observations increases (model selection consistency), even when the errors are non-Gaussian and conditionally heteroskedastic. In addition, Audrino and Camponovo (2013) present theoretical and empirical results on the finite-sample and asymptotic properties of the adaLASSO in time series regression models.

Our present work borrows ideas from Park and Sakaori (2013) and proposes a variation which we call WLadaLASSO (weighted lag adaptive LASSO), a method which assigns a different weight to each coefficient and also further penalizes the coefficients of higher-lagged covariates. A similar idea is employed in the Minnesota prior used in Bayesian VARs. The results show the superiority of the WLadaLASSO over the LASSO and the adaLASSO, especially for a stronger linear dependence between predictors and a greater number of candidate lags. A first application forecasts the risk premium using the same data as Goyal and Welch (2008). Contrary to what that article reports, we find that the predictors used in the literature can help to predict the risk premium when LASSO-type methods are applied. A second application forecasts US inflation employing several explanatory covariates, where the WLadaLASSO produces better forecasts than a benchmark model.

The rest of this paper is organized as follows. The next section introduces the theme of model selection. In the third section, LASSO-type penalties are described in detail. The fourth section carries out a comprehensive Monte Carlo simulation study. The fifth section presents our empirical applications to risk premium and inflation forecasting. Finally, the sixth section concludes with our main remarks.

PRELIMINARIES ON MODEL SELECTION

One of the main goals in linear regression analysis is to estimate the coefficients of the following Gaussian linear model:

y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_k x_{ki} + \varepsilon_i, \quad i = 1, \ldots, n,   (1)

or, in vector form,

y_i = \beta_0 + X_i^T \beta + \varepsilon_i,   (2)

where y_i \in \mathbb{R} is the response variable, X_i = (x_{1i}, \ldots, x_{ki})^T \in \mathbb{R}^k is the set of predictors, \varepsilon_i \sim N(0, \sigma^2), and (\beta_0, \beta_1, \ldots, \beta_k)^T is the set of parameters.
One of the most popular methods for estimating the unknown parameters of equation (1), ordinary least squares (OLS), is based on minimizing the sum of squared residuals (SSR), i.e. solving the following minimization problem:

\hat{\beta} = \arg\min_{\beta_0, \beta_1, \ldots, \beta_k} \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{k} \beta_j x_{ji} \right)^2.   (3)

However, the OLS method suffers from some problems when there are too many predictors in the model. Firstly, its estimates often have large variance, which reduces the accuracy of the forecasts. Secondly, on its own it cannot determine a smaller set of predictors with the strongest effects when one aims to find out which predictors best explain the variability of the response variable. Finally, by construction, the OLS method cannot be implemented when the number of parameters exceeds the number of observations.

A natural procedure for choosing which predictors enter and which do not enter model (2) is to compute all 2^k possible regression models, investigating all possible combinations of predictors. However, this can require large computational resources. Thus several procedures have been proposed, such as best subset selection, forward selection and backward elimination, and forward-stagewise regression, as pointed out by Hastie et al. (2001).

However, these alternatives also have their own limitations when the model has many candidate covariates. Best subset selection is feasible only for k not exceeding roughly 30 or 40, as explained by Hastie et al. (2001). Meanwhile, forward selection and backward elimination may not select the best subset of variables in some situations, as stated by Berk (1978). Forward-stagewise regression, as an algorithm that can require many steps, has a large computational cost and may be impractical for problems of high dimensionality.

Ridge regression, in turn, shrinks the set of coefficients by imposing a penalty on their sum of squares:

\hat{\beta}^{ridge} = \arg\min_{\beta_0, \beta_1, \ldots, \beta_k} \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{k} \beta_j x_{ji} \right)^2 \quad \text{subject to} \quad \sum_{j=1}^{k} \beta_j^2 \le t,   (4)

where the parameter t \ge 0 controls the penalty. An equivalent expression is given by

\hat{\beta}^{ridge} = \arg\min_{\beta_0, \beta_1, \ldots, \beta_k} \left\{ \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{k} \beta_j x_{ji} \right)^2 + \lambda \sum_{j=1}^{k} \beta_j^2 \right\},   (5)

where the parameter \lambda \ge 0 is a function of the parameter t in equation (4) and controls how severe the penalty is. The larger \lambda is, the larger the penalty gets. When \lambda = 0, the vector \hat{\beta}^{ridge} is equal to the coefficient vector obtained by OLS. Ridge regression yields non-zero estimates for all coefficients, so it is not a method of variable selection. Noting that the choice of variables is important for the interpretation of the model, Breiman (1995) proposes the Garrote method, which penalizes the regression coefficients so that some of them are forced to zero. The disadvantage of the Garrote method is that its solution depends on the sign and magnitude of the OLS estimates, which perform poorly in the presence of high correlation between predictors.

LASSO-TYPE PENALTIES

As in ridge regression, described by equation (4), where a penalty is imposed on the sum of squares of the coefficients, there are alternative methods that impose penalties on the sum of the absolute values of the coefficients. Some of these methods are described in this section.

LASSO

The least absolute shrinkage and selection operator (LASSO), proposed by Tibshirani (1996), is another method for shrinking the coefficient set. Just as the Garrote, the LASSO aims to estimate a model that produces forecasts with small variance and to determine the set of predictors that best explain the response variable. Tibshirani (1996) argues that the techniques usually employed to improve OLS estimates, such as subset selection and ridge regression, have disadvantages. Subset selection models are easily interpreted, but the process of choosing variables has great variability since it is a discrete process. Ridge regression, in turn, has less variability and shrinks the regression coefficients, but it retains all predictors in the model.

As in typical regression modeling, represented by equations (1) and (2), it is supposed that the y_i are conditionally independent given the x_{ki}. It is assumed that the x_{ki} are standardized such that \sum_i x_{ki} = 0 and \sum_i x_{ki}^2 / n = 1. LASSO estimates are obtained by minimizing the sum of squared residuals subject to a penalty on the L1 norm of the coefficients:

\hat{\beta}^{LASSO} = \arg\min_{\beta_0, \beta_1, \ldots, \beta_k} \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{k} \beta_j x_{ji} \right)^2 \quad \text{subject to} \quad \sum_{j=1}^{k} |\beta_j| \le t,   (6)

where the tuning parameter t \ge 0 controls the penalty. For all t, the solution for \beta_0 is \hat{\beta}_0 = \bar{y}. Thus one may assume without loss of generality that \bar{y} = 0 and then omit \beta_0. The tuning parameter t \ge 0 in equation (6) controls how much penalty is applied to the set of coefficients.
Let \hat{\beta}_j^0, 1 \le j \le k, represent the OLS coefficients and let t_0 = \sum_j |\hat{\beta}_j^0|. Values t < t_0 will shrink the coefficients toward zero, and some coefficients may be exactly equal to zero. If t \ge t_0, the LASSO estimates coincide with the OLS estimates.
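As an illustration of this shrinkage, here is a minimal R sketch of a LASSO fit using the glmnet package (the package the simulation section below reports using); the toy data and the two penalty values shown are our own illustrative choices, not values from the paper.

    library(glmnet)

    set.seed(1)
    n <- 100; k <- 8
    X <- matrix(rnorm(n * k), n, k)              # candidate predictors
    beta <- c(1.5, -1, 0.5, rep(0, k - 3))       # sparse true coefficient vector
    y <- drop(X %*% beta + rnorm(n))

    # Lagrangian form of the LASSO: alpha = 1 selects the pure L1 penalty
    fit <- glmnet(X, y, alpha = 1)

    # Coefficients along the path: a larger lambda forces more coefficients to zero
    coef(fit, s = 0.50)   # heavier shrinkage, sparser solution
    coef(fit, s = 0.01)   # light shrinkage, close to the OLS solution

The path over \lambda mirrors the role of t in equation (6): tightening the constraint (a larger \lambda) zeroes out more coefficients.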

Using the Lagrangian, equation (6) has the equivalent expression

\hat{\beta}^{LASSO} = \arg\min_{\beta_0, \beta_1, \ldots, \beta_k} \left\{ \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{k} \beta_j x_{ji} \right)^2 + \lambda \sum_{j=1}^{k} |\beta_j| \right\},   (7)

where the parameter \lambda \ge 0 is a function of the parameter t. The larger \lambda, the greater the penalty on the coefficients; when \lambda = 0, the LASSO estimates are equal to the OLS estimates. The value of \lambda is traditionally chosen via cross-validation in a cross-sectional framework. However, this strategy is more complicated in a time series set-up. Therefore, we will employ the Bayesian information criterion (BIC) to choose \lambda. Tibshirani (2013) gives sufficient conditions for the uniqueness of the LASSO solution using the Karush-Kuhn-Tucker conditions.

The LASSO has less variability than the methods of subset selection. Moreover, it shrinks some coefficients and forces others to zero, keeping the good features of both subset selection and ridge regression. Furthermore, the LASSO performs variable selection and coefficient estimation simultaneously.

LASSO's consistency in variable selection

In order to use the LASSO as a selection criterion, its sparse solution should represent the true model well. Hence one of the desirable properties of such a criterion is consistency, i.e. it identifies the true model as n \to \infty. To study the consistency of LASSO selection, Zhao and Yu (2006) consider two aspects: (i) whether there is a deterministic amount of regularization that provides consistency in selection; and (ii) whether for each sample there is a correct amount of regularization that selects the true model. Their results show that a condition they name the irrepresentable condition is almost necessary and sufficient for both kinds of consistency. Their results hold for linear models with either fixed k or k growing with n.

Let us consider the linear regression model

Y_n = X_n \beta_n + \varepsilon_n,

where \varepsilon_n = (\varepsilon_1, \ldots, \varepsilon_n)^T is a vector of i.i.d. random variables with mean 0 and variance \sigma^2, Y_n is an n \times 1 response vector, X_n = (X_1^n, \ldots, X_k^n) is the n \times k matrix of predictors, with X_i^n = (x_{i1}, \ldots, x_{in})^T for i = 1, \ldots, k, and \beta_n is the k \times 1 coefficient vector. Unlike the traditional setting where k is fixed, the data and the model parameters are indexed by n to allow them to vary with n. LASSO estimates \hat{\beta}_n(\lambda) = (\hat{\beta}_1^n, \ldots, \hat{\beta}_k^n)^T are defined by

\hat{\beta}_n(\lambda) = \arg\min_{\beta_n} \| Y_n - X_n \beta_n \|_2^2 + \lambda \| \beta_n \|_1,

where \| \cdot \|_2^2 denotes the squared L2 norm of a vector, i.e. the sum of the squares of the vector components, and \| \cdot \|_1 denotes the L1 norm of a vector, i.e. the sum of the absolute values of the vector components. There is consistency in model selection when

P\left( \{ i : \hat{\beta}_i^n \neq 0 \} = \{ i : \beta_i^n \neq 0 \} \right) \to 1 \quad \text{as } n \to \infty,

which is equivalent to sign consistency (Zhao and Yu, 2006). The following definitions are based on Zhao and Yu (2006).

Definition 1. An estimate \hat{\beta}_n is equal in sign to the true \beta_n (written \hat{\beta}_n =_s \beta_n) if and only if sign(\hat{\beta}_n) = sign(\beta_n), where sign(\cdot) takes the value 1 for positive values, -1 for negative values and 0 for zero.

Definition 2. The LASSO is strongly sign consistent if there exists \lambda_n = f(n), i.e. a function of n which is independent of Y_n and X_n, such that

\lim_{n \to \infty} P\left( \hat{\beta}_n(\lambda_n) =_s \beta_n \right) = 1.

Definition 3. The LASSO is generally sign consistent if

\lim_{n \to \infty} P\left( \exists \lambda \ge 0, \ \hat{\beta}_n(\lambda) =_s \beta_n \right) = 1.

Strong sign consistency means that one can use a preselected \lambda to obtain consistent model selection. General sign consistency means that, for a random realization, there is a correct amount of regularization that selects the true model. Zhao and Yu (2006) show that both types of consistency are almost equivalent up to a certain condition. We now introduce some notation to define this condition.

Without loss of generality, we assume that \beta_n = (\beta_1^n, \ldots, \beta_q^n, \beta_{q+1}^n, \ldots, \beta_k^n)^T, where \beta_j^n \neq 0 for j = 1, \ldots, q and \beta_j^n = 0 for j = q+1, \ldots, k. Write \beta_n(1) = (\beta_1^n, \ldots, \beta_q^n)^T and \beta_n(2) = (\beta_{q+1}^n, \ldots, \beta_k^n)^T. Then one can write X_n(1) and X_n(2) as the first q and the last k - q columns of X_n, respectively, and define C_n = \frac{1}{n} X_n^T X_n. Setting C_{11}^n = \frac{1}{n} X_n(1)^T X_n(1), C_{22}^n = \frac{1}{n} X_n(2)^T X_n(2), C_{12}^n = \frac{1}{n} X_n(1)^T X_n(2) and C_{21}^n = \frac{1}{n} X_n(2)^T X_n(1), C_n can be expressed in block-wise form as

C_n = \begin{pmatrix} C_{11}^n & C_{12}^n \\ C_{21}^n & C_{22}^n \end{pmatrix}.

Assuming C_{11}^n is invertible, the following irrepresentable conditions are defined.

Strong irrepresentable condition. There exists a positive constant vector \eta such that

\left| C_{21}^n (C_{11}^n)^{-1} \, \mathrm{sign}\big(\beta_n(1)\big) \right| \le \mathbf{1} - \eta,

where \mathbf{1} is a (k - q) \times 1 vector of ones and the inequality holds element-wise.

Weak irrepresentable condition.

\left| C_{21}^n (C_{11}^n)^{-1} \, \mathrm{sign}\big(\beta_n(1)\big) \right| < \mathbf{1},

where the inequality holds element-wise.

Zou (2006) similarly concludes that there is a necessary condition for the LASSO's consistency in model selection.

adaLASSO

Noting that there may be situations in which the LASSO is not consistent in variable selection, Zou (2006) proposes the adaptive LASSO (adaLASSO), which assigns different weights to different coefficients:

\hat{\beta}^{adaLASSO} = \arg\min_{\beta_0, \beta_1, \ldots, \beta_k} \left\{ \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{k} \beta_j x_{ji} \right)^2 + \lambda \sum_{j=1}^{k} \omega_j |\beta_j| \right\},   (8)

where \omega_j = |\hat{\beta}_j^{ridge}|^{-\tau}, \tau > 0. The individual weights \omega_j help to select the relevant variables. A relevant variable x_j tends to have a large coefficient \hat{\beta}_j^{ridge}, resulting in a small weight \omega_j assigned to the coefficient of that variable; conversely, if the variable x_j is irrelevant, the ridge coefficient \hat{\beta}_j^{ridge} tends to be small and results in a large \omega_j. Thus the adaLASSO imposes a greater penalty on the coefficients of variables that appear to be irrelevant. The weights \omega_j can also be obtained from OLS estimates; however, this is limited to the case where n > k + 1.

Following Zou's (2006) notation, A = \{ j : \beta_j \neq 0 \} is the true set of non-zero coefficients. In turn, A_n = \{ j : \hat{\beta}_j^{(n)} \neq 0 \} is the set of non-zero coefficients estimated by equation (8), where \lambda_n varies with the sample size n. Zou (2006) shows that with adequate weights \omega_j the adaLASSO has the oracle properties.

Theorem 1 (Zou, 2006). Suppose \lambda_n / \sqrt{n} \to 0 and \lambda_n n^{(\tau - 1)/2} \to \infty. Then the adaLASSO satisfies:

1. Consistency in variable selection: \lim_{n \to \infty} P(A_n = A) = 1.
2. Asymptotic normality: \sqrt{n} \left( \hat{\beta}_A^{(n)} - \beta_A \right) \to_d N\left( 0, \sigma^2 C_{11}^{-1} \right).

Therefore, in addition to correctly selecting the relevant variables as the sample size increases, the adaLASSO produces estimates of the non-zero coefficients that asymptotically follow the same distribution as the OLS estimators computed using only the relevant variables.
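A minimal R sketch of the adaLASSO of equation (8), building the weights \omega_j = |\hat{\beta}_j^{ridge}|^{-\tau} from a first-stage ridge fit and passing them to glmnet through its penalty.factor argument. Selecting the ridge penalty by cross-validation here is our simplification for brevity (the paper selects tuning parameters by BIC), and the data are again a toy example.

    library(glmnet)

    set.seed(2)
    n <- 200; k <- 10
    X <- matrix(rnorm(n * k), n, k)
    beta <- c(2, -1.5, 1, rep(0, k - 3))
    y <- drop(X %*% beta + rnorm(n))

    # Step 1: ridge fit (alpha = 0) to build the adaptive weights.
    # cv.glmnet is used only to keep the sketch short; the paper uses BIC instead.
    ridge_cv   <- cv.glmnet(X, y, alpha = 0)
    beta_ridge <- as.numeric(coef(ridge_cv, s = "lambda.min"))[-1]   # drop the intercept

    tau <- 1
    w   <- abs(beta_ridge)^(-tau)     # omega_j = |beta_ridge_j|^(-tau)

    # Step 2: LASSO with coefficient-specific penalty weights (equation (8))
    adalasso <- glmnet(X, y, alpha = 1, penalty.factor = w)
    coef(adalasso, s = 0.1)           # weakly related covariates are penalized more heavily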

WLadaLASSO

In economic time series, vector autoregression (VAR) models became popular with the seminal work of Sims (1980). However, these models often suffer from over-parametrization, which can create multicollinearity and a loss of degrees of freedom, resulting in inefficient estimates and consequently large out-of-sample forecasting errors. For that reason, further developments have imposed restrictions on the parameters via Bayesian priors, now well known as Minnesota priors. In particular, Litterman (1979) imposes restrictions on the coefficients across different lag lengths, assuming that the coefficients of longer lags are likely to be closer to zero than the coefficients of shorter lags. Other strategies literally exclude lags with statistically insignificant coefficients (Dua and Ray, 1995; Dua et al., 1999).

When the adaLASSO is applied in a time series context, each lagged variable enters as a candidate predictor and its coefficient is penalized only according to the size of its ridge (or OLS) estimate. One can then wonder whether a greater penalty on more distant lagged variables improves time series forecasting, since more recent information is usually more important. Park and Sakaori (2013) propose some alternative types of penalties for different lags. Borrowing from their ideas, in a slightly different version, we propose the adaLASSO with weighted lags, called here WLadaLASSO (weighted lag adaptive LASSO), which is given by

\hat{\beta}^{WLadaLASSO} = \arg\min_{\beta_0, \beta_1, \ldots, \beta_k} \left\{ \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{k} \beta_j x_{ji} \right)^2 + \lambda \sum_{j=1}^{k} \omega_j |\beta_j| \right\},   (9)

where \omega_j = \left| \hat{\beta}_j^{ridge} \, e^{-\alpha l} \right|^{-\tau}, with \tau > 0, \alpha \ge 0 and l representing the lag order of covariate j.

In what follows, besides looking at the WLadaLASSO's forecasting skills, we will also study its model selection and estimation capabilities, comparing the results to other regularization methods.

SIMULATION

In this section we analyze the performance of the WLadaLASSO, comparing it to other regularization methods in a Monte Carlo simulation study. All implementations are done with the free software R. The estimation of equations (7), (8) and (9) uses the function glmnet to optimize the parameters \beta_j, with \lambda chosen through BIC. The parameter \tau is set to 1. In the WLadaLASSO, for each \alpha belonging to the set \{0, 0.5, 1, \ldots, 10\}, the optimal \lambda is the one that produces the fit with the smallest BIC value. The chosen value of \alpha is then the one that produces the smallest BIC value among all obtained BIC values.

Through Monte Carlo simulations with 1000 replications, we simulate 10 independent time series that follow an AR(1) process, x_{i,t} = \rho x_{i,t-1} + u_{i,t}, where u_{i,t} \sim N(0, 1), i = 1, \ldots, 10, and \rho controls the degree of linear dependence between the candidate covariates. The following data-generating process is considered:

y_t = 0.8 y_{t-1} + 0.6 x_{1,t-1} + 0.3 x_{1,t-2} - 0.5 x_{2,t-1} - 0.2 x_{2,t-2} + 0.4 x_{3,t-1} + 0.3 x_{3,t-2} + 0.4 x_{4,t-1} - 0.3 x_{5,t-1} + 0.2 x_{6,t-1} + \varepsilon_t, \quad t = 1, 2, \ldots, T,   (10)

where \varepsilon_t has two specifications: N(0, 1) and t(5). The LASSO, adaLASSO and WLadaLASSO methods are employed to estimate equation (10) with L lags of y_t and L lags of x_{j,t}, j = 1, \ldots, 10, as candidates, resulting in 11L candidate predictors. We analyze situations where L = 5, 10 and 20; a sketch of one replication of this design is given below.
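To make the design concrete, the following R sketch simulates one replication of the DGP in equation (10), builds the 11L matrix of lagged candidates and forms the WLadaLASSO weights \omega_j = |\hat{\beta}_j^{ridge} e^{-\alpha l}|^{-\tau} for a single (\alpha, \tau) pair. The burn-in length, the fixed \alpha and the use of cv.glmnet for the first-stage ridge fit are our illustrative simplifications of the BIC-based search described above.

    library(glmnet)

    set.seed(3)
    T_ <- 500; L <- 5; rho <- 0.6; burn <- 100
    N  <- T_ + burn + L

    # Ten independent AR(1) covariates with persistence rho
    x <- sapply(1:10, function(i) as.numeric(arima.sim(list(ar = rho), n = N)))

    # Data-generating process of equation (10)
    y <- numeric(N); eps <- rnorm(N)
    for (t in 3:N) {
      y[t] <- 0.8 * y[t - 1] +
        0.6 * x[t - 1, 1] + 0.3 * x[t - 2, 1] -
        0.5 * x[t - 1, 2] - 0.2 * x[t - 2, 2] +
        0.4 * x[t - 1, 3] + 0.3 * x[t - 2, 3] +
        0.4 * x[t - 1, 4] - 0.3 * x[t - 1, 5] + 0.2 * x[t - 1, 6] + eps[t]
    }

    # Candidate matrix: L lags of y and of each x_j (11 * L columns), aligned with y_t
    Z    <- cbind(y, x)                        # columns: y, x1, ..., x10
    lagm <- embed(Z, L + 1)                    # each row holds (Z_t, Z_{t-1}, ..., Z_{t-L})
    yt   <- lagm[, 1]                          # response y_t
    Xc   <- lagm[, -(1:ncol(Z))]               # drop the time-t block, keep lags 1..L
    lago <- rep(1:L, each = ncol(Z))           # lag order of each candidate column

    keep <- (nrow(lagm) - T_ + 1):nrow(lagm)   # discard the burn-in period
    yt <- yt[keep]; Xc <- Xc[keep, ]

    # WLadaLASSO weights: omega_j = |beta_ridge_j * exp(-alpha * l_j)|^(-tau)
    beta_ridge <- as.numeric(coef(cv.glmnet(Xc, yt, alpha = 0), s = "lambda.min"))[-1]
    alpha_lag  <- 1; tau <- 1
    w <- abs(beta_ridge * exp(-alpha_lag * lago))^(-tau)

    wlada <- glmnet(Xc, yt, alpha = 1, penalty.factor = w)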
In order to compare the coefficient estimates with their true values, we use the mean squared error (MSE) and the mean absolute error (MAE) of the estimates, which are respectively given by

MSE = \frac{1}{1000\,k} \sum \left( \hat{\beta}_j - \beta_j \right)^2   (11)

and

MAE = \frac{1}{1000\,k} \sum \left| \hat{\beta}_j - \beta_j \right|,   (12)

where the sums run over the k candidate coefficients in each of the 1000 replications. We remove the last 10 observations of each simulated series and then compute one-step-ahead out-of-sample forecasts for the removed observations. For each t, the forecast of y_t is based on the entire information set available at date t - 1; a sketch of this BIC-based selection and forecasting step follows.
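Since \lambda is chosen by BIC rather than by cross-validation, the following sketch scores a glmnet path by BIC, using the number of non-zero coefficients as the degrees of freedom, and then produces a one-step-ahead forecast. The Gaussian BIC formula written in the helper is a standard approximation and our own assumption; the text only states that \lambda is chosen by BIC.

    library(glmnet)

    # BIC along a glmnet path, using the non-zero count as the degrees of freedom
    # (an assumption on our part; the paper does not spell out its BIC formula)
    bic_path <- function(fit, X, y) {
      n    <- length(y)
      pred <- predict(fit, newx = X)          # n x nlambda matrix of fitted values
      rss  <- colSums((y - pred)^2)
      n * log(rss / n) + fit$df * log(n)
    }

    set.seed(4)
    n <- 300; k <- 20
    X <- matrix(rnorm(n * k), n, k)
    beta <- c(1, -0.8, 0.5, rep(0, k - 3))
    y <- drop(X %*% beta + rnorm(n))

    # Estimate on the information available up to date t - 1
    fit        <- glmnet(X[-n, ], y[-n], alpha = 1)
    lambda_bic <- fit$lambda[which.min(bic_path(fit, X[-n, ], y[-n]))]

    # One-step-ahead forecast of the held-out observation
    yhat <- predict(fit, newx = X[n, , drop = FALSE], s = lambda_bic)
    c(forecast = as.numeric(yhat), actual = y[n])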

Table I. Descriptive statistics of model selection (\rho = 0.3): for LASSO, adaLASSO and WLadaLASSO, panels report the fraction of variables correctly identified (FVCI), the fraction of replications in which the true model is included (TMI), the fraction of relevant variables included (FRVI), the fraction of irrelevant variables excluded (FIVE) and the number of included variables (NIV), under N(0, 1) and t(5) errors, with 5, 10 and 20 candidate lags and different sample sizes T.

Table II. Descriptive statistics of model selection (\rho = 0.6). Panels: FVCI, TMI, FRVI, FIVE and NIV for LASSO, adaLASSO and WLadaLASSO; errors N(0, 1) and t(5); 5, 10 and 20 candidate lags; various sample sizes T.

Table III. Descriptive statistics of model selection (\rho = 0.9). Panels: FVCI, TMI, FRVI, FIVE and NIV for LASSO, adaLASSO and WLadaLASSO; errors N(0, 1) and t(5); 5, 10 and 20 candidate lags; various sample sizes T.

Tables I-III, inspired by the working paper of Medeiros and Mendes (2012), show various statistics related to variable selection, divided into panels, for different sample sizes, different numbers of candidate lags and two different distributions of the errors u_{i,t} and \varepsilon_t. Each of these tables reports the results for a different level of linear dependence between covariates (\rho = 0.3, 0.6 and 0.9). The first panel shows the average fraction (across the replications) of variables correctly identified (FVCI): relevant variables identified as relevant and irrelevant ones identified as irrelevant; the second panel shows the fraction of replications in which the true model is included (TMI), i.e. all relevant covariates are included; the third shows the average fraction of relevant variables included (FRVI); the fourth shows the average fraction of irrelevant variables excluded (FIVE); and the last shows the average number of included variables (NIV).

In what follows, we discuss only the results for the cases in which the errors have an N(0, 1) distribution. These results are quite similar to those obtained when the errors follow a t(5) distribution. The main differences in model selection performance among the three types of penalties can be seen in the FRVI panels of Tables I-III. For the largest sample size (T = 2000), all methods have similar performance, with the exception of the case \rho = 0.9, where the WLadaLASSO is better than the LASSO and slightly worse than the adaLASSO. For the smallest sample size (T = 50), we notice a superiority of the WLadaLASSO in FRVI, which becomes more evident for a higher number of candidate lags and a stronger linear dependence between covariates. Finally, the moderate sample size (T = 500) shows intermediate results between the other sample sizes.

Tables IV-VI show the error measures of the set of estimated parameters, calculated via equations (11) and (12). For T = 2000, the WLadaLASSO maintains at least the same quality of parameter estimates obtained by the adaLASSO and always does better than the LASSO. Our proposed penalty also shows slightly better results when T = 500. The drawback of the WLadaLASSO appears for T = 50, when it is outperformed by the LASSO. In our simulations we also observe the MSE and MAE of the difference between the OLS and WLadaLASSO estimates when the former is computed using only the relevant variables. Both error measures tend to zero as the sample size increases, in line with one of the oracle properties (Fan and Li, 2001).

The results of the one-step-ahead forecasts are reported in Tables VII-IX. The adaLASSO and WLadaLASSO produce predictions with similar performance for T = 2000, both outperforming the LASSO. However, the WLadaLASSO particularly stands out when there are 10 or 20 candidate lags and \rho = 0.9. For the smallest sample size, the WLadaLASSO becomes strikingly superior the greater the number of candidate lags and the stronger the linear dependence between covariates. The disadvantage of our proposed penalty occurs in the specific situation with T = 50 and only 5 candidate lags. Figures 1-3 illustrate the distribution of the obtained prediction errors for the cases in which the errors have an N(0, 1) distribution.

EMPIRICAL ANALYSIS

Risk premium forecasting

Forecasting stock returns is of great interest to academics and financial market investors, and many economic variables have been proposed as potential predictors.
Goyal and Welch (2008) show that an extensive list of potential predictors used in the literature is unstable in producing out-of-sample forecasts when compared to the simple model based on the historical average return. In contrast, Campbell and Thompson (2008) show that many regressions can provide better predictions than the historical average when restrictions are imposed on the signs of the coefficients. Although this superiority over the historical average is generally small and statistically insignificant, it can be economically significant. Among the articles that corroborate the results of Campbell and Thompson (2008), we can cite Rapach et al. (2010), Ferreira and Santa-Clara (2011) and Hillebrand et al. (2012). Other recent works analyze whether financial time series may be predicted by a list of covariates; see, for example, Issler et al. (2014), Lee et al. (2015) and Hsiao and Wan (2014).

Goyal and Welch (2008) estimate linear regressions via OLS in which the risk premium at time t is explained by a predictor variable at time t - 1. Furthermore, they estimate linear regressions including all candidate predictors. Here we apply L1-norm penalty methods to the regression coefficients with multiple predictors as candidates, delegating to these methods the choice of predictors for the risk premium, which is the total rate of return on the stock market minus the prevailing short-term interest rate. We use 14 variables from Goyal and Welch (2008): dividend-price ratio (log); dividend yield (log); earnings-price ratio (log); dividend payout ratio (log); stock variance; book-to-market ratio; net equity expansion; Treasury bill rate; long-term yield; long-term return; term spread; default yield spread; default return spread; and inflation. All annual series used as predictors begin in 1927. We work with differenced and centered series, estimating the models without an intercept. For the penalty methods, the number of lags is optimized through BIC, with a maximal lag order of five; a sketch of the data preparation is given below.
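A sketch of how the lagged candidate set for these regressions can be assembled in R: the series are differenced and centered, up to five lags of each predictor enter as candidates, and the penalized fit is run without an intercept. The placeholder data and object names are ours; only the transformations and the maximal lag order come from the text.

    library(glmnet)

    set.seed(5)
    # Placeholder annual data: a risk premium series and 14 candidate predictors
    n_years    <- 80
    rp         <- rnorm(n_years)
    predictors <- matrix(rnorm(n_years * 14), n_years, 14)

    # Difference and centre the series, as described in the text
    d_rp <- diff(rp); d_rp <- d_rp - mean(d_rp)
    d_X  <- apply(predictors, 2, function(z) { dz <- diff(z); dz - mean(dz) })

    # Up to five lags of every differenced predictor enter as candidates
    L    <- 5
    lagX <- embed(d_X, L + 1)[, -(1:ncol(d_X))]   # predictors at t-1, ..., t-5
    y    <- d_rp[(L + 1):length(d_rp)]            # risk premium at time t

    # Penalized fit without an intercept (the series are already centred);
    # the adaptive and lag-weighted penalties of equations (8)-(9) can then
    # be built exactly as in the earlier sketches.
    fit <- glmnet(lagX, y, alpha = 1, intercept = FALSE)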

Table IV. Descriptive statistics of parameter estimates (\rho = 0.3): MSE and MAE panels for LASSO, adaLASSO and WLadaLASSO; errors N(0, 1) and t(5); 5, 10 and 20 candidate lags; various sample sizes T.

Table V. Descriptive statistics of parameter estimates (\rho = 0.6): MSE and MAE panels for LASSO, adaLASSO and WLadaLASSO; errors N(0, 1) and t(5); 5, 10 and 20 candidate lags; various sample sizes T.

Table VI. Descriptive statistics of parameter estimates (\rho = 0.9): MSE and MAE panels for LASSO, adaLASSO and WLadaLASSO; errors N(0, 1) and t(5); 5, 10 and 20 candidate lags; various sample sizes T.

Table VII. Descriptive statistics of forecasts (\rho = 0.3): mean and median of MSEs and of MAEs for LASSO, adaLASSO and WLadaLASSO; errors N(0, 1) and t(5); 5, 10 and 20 candidate lags; various sample sizes T.

Table VIII. Descriptive statistics of forecasts (\rho = 0.6): mean and median of MSEs and of MAEs for LASSO, adaLASSO and WLadaLASSO; errors N(0, 1) and t(5); 5, 10 and 20 candidate lags; various sample sizes T.

Table IX. Descriptive statistics of forecasts (\rho = 0.9): mean and median of MSEs and of MAEs for LASSO, adaLASSO and WLadaLASSO; errors N(0, 1) and t(5); 5, 10 and 20 candidate lags; various sample sizes T.

Figure 1. Distribution of forecast errors (\rho = 0.3), by number of candidate lags (5, 10, 20) and sample size T.

Figure 2. Distribution of forecast errors (\rho = 0.6), by number of candidate lags (5, 10, 20) and sample size T.

Figure 3. Distribution of forecast errors (\rho = 0.9), by number of candidate lags (5, 10, 20) and sample size T, for the LASSO, adaLASSO and WLadaLASSO.

In order to compare the one-step-ahead predictions for the h out-of-sample observations, we use the out-of-sample R^2 statistic (R^2_{OS}), as suggested in Campbell and Thompson (2008), which is given by

R^2_{OS} = 1 - \frac{ \sum_{t=T+1}^{T+h} (r_t - \hat{r}_t)^2 }{ \sum_{t=T+1}^{T+h} (r_t - \bar{r}_t)^2 },

where \hat{r}_t, t = T+1, \ldots, T+h, are the predicted values for the h out-of-sample observations and \bar{r}_t is the historical mean up to date t - 1. p-values of the modified Diebold-Mariano (MDM) test (Harvey et al., 1997) are computed in order to compare the predictions based on the historical mean with those of the other methods. The one-tailed Diebold-Mariano test has the null hypothesis of no difference in the accuracy of the two competing forecasts. When the null hypothesis is rejected, we conclude that the method has better predictive ability than the historical mean.

Table X shows the superiority of the predictions made by the penalization methods over the historical mean in three different periods, whose forecasts begin in 1955, 1970 and 1985. We employ those methods in two ways: using only one lag (l = 1) of all individual predictors (as in Goyal and Welch, 2008) and using the optimal number of lags chosen via BIC. In both cases the penalization methods achieve better predictions than the historical mean, with the WLadaLASSO having the best results.

Table X. Equity premium forecasting results: R^2_{OS} and MDM p-values for forecasts beginning in 1955, 1970 and 1985, for the individual predictors (dividend price ratio, dividend yield, earnings-price ratio, dividend payout ratio, stock variance, book-to-market, net equity expansion, T-bill rate, long-term yield, long-term return, term spread, default yield spread, default return spread and inflation), for the regression with all regressors, and for the penalization methods with l = 1 (LASSO, adaLASSO) and with free l (LASSO, adaLASSO, WLadaLASSO).
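A sketch of the out-of-sample comparison behind Table X: computing R^2_{OS} against an expanding historical mean and a one-sided Diebold-Mariano-type p-value. We use dm.test from the forecast package as a stand-in for the MDM test of Harvey et al. (1997); treating it as equivalent, as well as the placeholder forecasts, are our own assumptions.

    library(forecast)   # provides dm.test

    # Out-of-sample R^2 of Campbell and Thompson (2008)
    r2_os <- function(actual, pred, bench) {
      1 - sum((actual - pred)^2) / sum((actual - bench)^2)
    }

    set.seed(6)
    # Placeholder series: realized returns plus two competing one-step-ahead forecast paths
    r     <- rnorm(60, mean = 0.05, sd = 0.15)
    t_oos <- 31:60                                          # out-of-sample evaluation window
    bench <- sapply(t_oos, function(t) mean(r[1:(t - 1)]))  # expanding historical mean
    model <- bench + rnorm(length(t_oos), sd = 0.05)        # placeholder model forecasts

    r2_os(r[t_oos], model, bench)

    # One-sided comparison of the two forecast error series
    # (see ?dm.test for the orientation of the alternative hypothesis)
    dm.test(r[t_oos] - bench, r[t_oos] - model, alternative = "greater", h = 1, power = 2)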

Inflation forecasting

Chan (2013) carries out a brief literature review of inflation forecasting and points out that both persistence and volatility in the inflation process change considerably over time. He lists studies which find that models with stochastic volatility provide substantially better forecasts than those obtained from constant-error-variance models (e.g. Clark and Doh, 2011; Chan et al., 2012). He then introduces a new class of models with both stochastic volatility and moving average errors and illustrates an empirical application to forecasting US quarterly CPI inflation, in which his approach provides better in-sample fit and out-of-sample forecast performance than the standard variants with only stochastic volatility. The unobserved components model with MA(1) stochastic volatility (UC-MA) has the best forecast performance among the competing models, especially at short forecast horizons.

We use US inflation observed at a quarterly frequency from 1960:Q1 to 2011:Q2, measured as the quarterly log change in the personal consumption expenditures (PCE) deflator. The potential predictors are the same as those used by Groen et al. (2013). They correspond to lagged values of inflation, a host of real activity data, term structure data, nominal data and surveys, resulting in 16 covariates. We work with differenced and centered series, estimating the models without an intercept. For the penalty methods, the number of covariate lags is optimized via BIC, with a maximal lag order of 20. p-values of the MDM test (for the MSE and MAE measures, respectively) are computed in order to compare the one-step-ahead predictions of the competing models with those of the UC-MA model. We analyze two periods of 20 observations each to evaluate the predictions. Table XI shows that both the adaLASSO and the WLadaLASSO provide better forecasting performance than the UC-MA and the LASSO for the two sample periods.

Table XI. Inflation forecasting results: MSE and MAE with MDM p-values for the UC-MA, LASSO, adaLASSO and WLadaLASSO models over the penultimate and last evaluation periods.

CONCLUSION

This study investigates how penalizing the coefficient set can contribute to the performance of time series forecasting. Methods that penalize the coefficients are extremely important in reducing the dimensionality of economic time series problems in which the number of series is high and there are few observations. The LASSO and adaLASSO methods arose in the context of linear regression but are increasingly present in time series analysis. Observing that more recent information tends to contribute more to time series forecasting, and inspired by Park and Sakaori (2013), this article proposes the WLadaLASSO, which penalizes each lagged variable differently. The simulation study shows that, especially for a stronger linear dependence between predictors and a higher number of candidate lags, the WLadaLASSO outperforms the other penalization methods in several respects: covariate selection, parameter estimation precision and forecasting.


More information

An estimate of the long-run covariance matrix, Ω, is necessary to calculate asymptotic

An estimate of the long-run covariance matrix, Ω, is necessary to calculate asymptotic Chapter 6 ESTIMATION OF THE LONG-RUN COVARIANCE MATRIX An estimate of the long-run covariance matrix, Ω, is necessary to calculate asymptotic standard errors for the OLS and linear IV estimators presented

More information

Inflation Revisited: New Evidence from Modified Unit Root Tests

Inflation Revisited: New Evidence from Modified Unit Root Tests 1 Inflation Revisited: New Evidence from Modified Unit Root Tests Walter Enders and Yu Liu * University of Alabama in Tuscaloosa and University of Texas at El Paso Abstract: We propose a simple modification

More information

Predicting bond returns using the output gap in expansions and recessions

Predicting bond returns using the output gap in expansions and recessions Erasmus university Rotterdam Erasmus school of economics Bachelor Thesis Quantitative finance Predicting bond returns using the output gap in expansions and recessions Author: Martijn Eertman Studentnumber:

More information

Warwick Business School Forecasting System. Summary. Ana Galvao, Anthony Garratt and James Mitchell November, 2014

Warwick Business School Forecasting System. Summary. Ana Galvao, Anthony Garratt and James Mitchell November, 2014 Warwick Business School Forecasting System Summary Ana Galvao, Anthony Garratt and James Mitchell November, 21 The main objective of the Warwick Business School Forecasting System is to provide competitive

More information

Introduction to Eco n o m et rics

Introduction to Eco n o m et rics 2008 AGI-Information Management Consultants May be used for personal purporses only or by libraries associated to dandelon.com network. Introduction to Eco n o m et rics Third Edition G.S. Maddala Formerly

More information

Or How to select variables Using Bayesian LASSO

Or How to select variables Using Bayesian LASSO Or How to select variables Using Bayesian LASSO x 1 x 2 x 3 x 4 Or How to select variables Using Bayesian LASSO x 1 x 2 x 3 x 4 Or How to select variables Using Bayesian LASSO On Bayesian Variable Selection

More information

The lasso, persistence, and cross-validation

The lasso, persistence, and cross-validation The lasso, persistence, and cross-validation Daniel J. McDonald Department of Statistics Indiana University http://www.stat.cmu.edu/ danielmc Joint work with: Darren Homrighausen Colorado State University

More information

Introduction to Econometrics

Introduction to Econometrics Introduction to Econometrics T H I R D E D I T I O N Global Edition James H. Stock Harvard University Mark W. Watson Princeton University Boston Columbus Indianapolis New York San Francisco Upper Saddle

More information

Optimizing forecasts for inflation and interest rates by time-series model averaging

Optimizing forecasts for inflation and interest rates by time-series model averaging Optimizing forecasts for inflation and interest rates by time-series model averaging Presented at the ISF 2008, Nice 1 Introduction 2 The rival prediction models 3 Prediction horse race 4 Parametric bootstrap

More information

Chris Fraley and Daniel Percival. August 22, 2008, revised May 14, 2010

Chris Fraley and Daniel Percival. August 22, 2008, revised May 14, 2010 Model-Averaged l 1 Regularization using Markov Chain Monte Carlo Model Composition Technical Report No. 541 Department of Statistics, University of Washington Chris Fraley and Daniel Percival August 22,

More information

The Prediction of Monthly Inflation Rate in Romania 1

The Prediction of Monthly Inflation Rate in Romania 1 Economic Insights Trends and Challenges Vol.III (LXVI) No. 2/2014 75-84 The Prediction of Monthly Inflation Rate in Romania 1 Mihaela Simionescu Institute for Economic Forecasting of the Romanian Academy,

More information

Regression, Ridge Regression, Lasso

Regression, Ridge Regression, Lasso Regression, Ridge Regression, Lasso Fabio G. Cozman - fgcozman@usp.br October 2, 2018 A general definition Regression studies the relationship between a response variable Y and covariates X 1,..., X n.

More information

A radial basis function artificial neural network test for ARCH

A radial basis function artificial neural network test for ARCH Economics Letters 69 (000) 5 3 www.elsevier.com/ locate/ econbase A radial basis function artificial neural network test for ARCH * Andrew P. Blake, George Kapetanios National Institute of Economic and

More information

This is the author s final accepted version.

This is the author s final accepted version. Bagdatoglou, G., Kontonikas, A., and Wohar, M. E. (2015) Forecasting US inflation using dynamic general-to-specific model selection. Bulletin of Economic Research, 68(2), pp. 151-167. (doi:10.1111/boer.12041)

More information

Dynamic Regression Models (Lect 15)

Dynamic Regression Models (Lect 15) Dynamic Regression Models (Lect 15) Ragnar Nymoen University of Oslo 21 March 2013 1 / 17 HGL: Ch 9; BN: Kap 10 The HGL Ch 9 is a long chapter, and the testing for autocorrelation part we have already

More information

Monitoring Forecasting Performance

Monitoring Forecasting Performance Monitoring Forecasting Performance Identifying when and why return prediction models work Allan Timmermann and Yinchu Zhu University of California, San Diego June 21, 2015 Outline Testing for time-varying

More information

The Size and Power of Four Tests for Detecting Autoregressive Conditional Heteroskedasticity in the Presence of Serial Correlation

The Size and Power of Four Tests for Detecting Autoregressive Conditional Heteroskedasticity in the Presence of Serial Correlation The Size and Power of Four s for Detecting Conditional Heteroskedasticity in the Presence of Serial Correlation A. Stan Hurn Department of Economics Unversity of Melbourne Australia and A. David McDonald

More information

Variable Targeting and Reduction in High-Dimensional Vector Autoregressions

Variable Targeting and Reduction in High-Dimensional Vector Autoregressions Variable Targeting and Reduction in High-Dimensional Vector Autoregressions Tucker McElroy (U.S. Census Bureau) Frontiers in Forecasting February 21-23, 2018 1 / 22 Disclaimer This presentation is released

More information

Penalized Estimation of Panel Vector Autoregressive Models: A Lasso Approach

Penalized Estimation of Panel Vector Autoregressive Models: A Lasso Approach Penalized Estimation of Panel Vector Autoregressive Models: A Lasso Approach Annika Schnücker Freie Universität Berlin and DIW Berlin Graduate Center, Mohrenstr. 58, 10117 Berlin, Germany October 11, 2017

More information

Introduction to Statistical modeling: handout for Math 489/583

Introduction to Statistical modeling: handout for Math 489/583 Introduction to Statistical modeling: handout for Math 489/583 Statistical modeling occurs when we are trying to model some data using statistical tools. From the start, we recognize that no model is perfect

More information

Selection of Smoothing Parameter for One-Step Sparse Estimates with L q Penalty

Selection of Smoothing Parameter for One-Step Sparse Estimates with L q Penalty Journal of Data Science 9(2011), 549-564 Selection of Smoothing Parameter for One-Step Sparse Estimates with L q Penalty Masaru Kanba and Kanta Naito Shimane University Abstract: This paper discusses the

More information

Least Absolute Value vs. Least Squares Estimation and Inference Procedures in Regression Models with Asymmetric Error Distributions

Least Absolute Value vs. Least Squares Estimation and Inference Procedures in Regression Models with Asymmetric Error Distributions Journal of Modern Applied Statistical Methods Volume 8 Issue 1 Article 13 5-1-2009 Least Absolute Value vs. Least Squares Estimation and Inference Procedures in Regression Models with Asymmetric Error

More information

FinQuiz Notes

FinQuiz Notes Reading 9 A time series is any series of data that varies over time e.g. the quarterly sales for a company during the past five years or daily returns of a security. When assumptions of the regression

More information

Christopher Dougherty London School of Economics and Political Science

Christopher Dougherty London School of Economics and Political Science Introduction to Econometrics FIFTH EDITION Christopher Dougherty London School of Economics and Political Science OXFORD UNIVERSITY PRESS Contents INTRODU CTION 1 Why study econometrics? 1 Aim of this

More information

Day 4: Shrinkage Estimators

Day 4: Shrinkage Estimators Day 4: Shrinkage Estimators Kenneth Benoit Data Mining and Statistical Learning March 9, 2015 n versus p (aka k) Classical regression framework: n > p. Without this inequality, the OLS coefficients have

More information

Consistent high-dimensional Bayesian variable selection via penalized credible regions

Consistent high-dimensional Bayesian variable selection via penalized credible regions Consistent high-dimensional Bayesian variable selection via penalized credible regions Howard Bondell bondell@stat.ncsu.edu Joint work with Brian Reich Howard Bondell p. 1 Outline High-Dimensional Variable

More information

WEIGHTED QUANTILE REGRESSION THEORY AND ITS APPLICATION. Abstract

WEIGHTED QUANTILE REGRESSION THEORY AND ITS APPLICATION. Abstract Journal of Data Science,17(1). P. 145-160,2019 DOI:10.6339/JDS.201901_17(1).0007 WEIGHTED QUANTILE REGRESSION THEORY AND ITS APPLICATION Wei Xiong *, Maozai Tian 2 1 School of Statistics, University of

More information

Generalized Elastic Net Regression

Generalized Elastic Net Regression Abstract Generalized Elastic Net Regression Geoffroy MOURET Jean-Jules BRAULT Vahid PARTOVINIA This work presents a variation of the elastic net penalization method. We propose applying a combined l 1

More information

Lecture 14: Variable Selection - Beyond LASSO

Lecture 14: Variable Selection - Beyond LASSO Fall, 2017 Extension of LASSO To achieve oracle properties, L q penalty with 0 < q < 1, SCAD penalty (Fan and Li 2001; Zhang et al. 2007). Adaptive LASSO (Zou 2006; Zhang and Lu 2007; Wang et al. 2007)

More information

Single Index Quantile Regression for Heteroscedastic Data

Single Index Quantile Regression for Heteroscedastic Data Single Index Quantile Regression for Heteroscedastic Data E. Christou M. G. Akritas Department of Statistics The Pennsylvania State University SMAC, November 6, 2015 E. Christou, M. G. Akritas (PSU) SIQR

More information

The Generalized Cochrane-Orcutt Transformation Estimation For Spurious and Fractional Spurious Regressions

The Generalized Cochrane-Orcutt Transformation Estimation For Spurious and Fractional Spurious Regressions The Generalized Cochrane-Orcutt Transformation Estimation For Spurious and Fractional Spurious Regressions Shin-Huei Wang and Cheng Hsiao Jan 31, 2010 Abstract This paper proposes a highly consistent estimation,

More information

Input Selection for Long-Term Prediction of Time Series

Input Selection for Long-Term Prediction of Time Series Input Selection for Long-Term Prediction of Time Series Jarkko Tikka, Jaakko Hollmén, and Amaury Lendasse Helsinki University of Technology, Laboratory of Computer and Information Science, P.O. Box 54,

More information

Comparison of Some Improved Estimators for Linear Regression Model under Different Conditions

Comparison of Some Improved Estimators for Linear Regression Model under Different Conditions Florida International University FIU Digital Commons FIU Electronic Theses and Dissertations University Graduate School 3-24-2015 Comparison of Some Improved Estimators for Linear Regression Model under

More information

Hypothesis testing Goodness of fit Multicollinearity Prediction. Applied Statistics. Lecturer: Serena Arima

Hypothesis testing Goodness of fit Multicollinearity Prediction. Applied Statistics. Lecturer: Serena Arima Applied Statistics Lecturer: Serena Arima Hypothesis testing for the linear model Under the Gauss-Markov assumptions and the normality of the error terms, we saw that β N(β, σ 2 (X X ) 1 ) and hence s

More information

Confidence Intervals in Ridge Regression using Jackknife and Bootstrap Methods

Confidence Intervals in Ridge Regression using Jackknife and Bootstrap Methods Chapter 4 Confidence Intervals in Ridge Regression using Jackknife and Bootstrap Methods 4.1 Introduction It is now explicable that ridge regression estimator (here we take ordinary ridge estimator (ORE)

More information

Department of Economics, UCSB UC Santa Barbara

Department of Economics, UCSB UC Santa Barbara Department of Economics, UCSB UC Santa Barbara Title: Past trend versus future expectation: test of exchange rate volatility Author: Sengupta, Jati K., University of California, Santa Barbara Sfeir, Raymond,

More information

Forecasting the unemployment rate when the forecast loss function is asymmetric. Jing Tian

Forecasting the unemployment rate when the forecast loss function is asymmetric. Jing Tian Forecasting the unemployment rate when the forecast loss function is asymmetric Jing Tian This version: 27 May 2009 Abstract This paper studies forecasts when the forecast loss function is asymmetric,

More information

Chapter 3. Linear Models for Regression

Chapter 3. Linear Models for Regression Chapter 3. Linear Models for Regression Wei Pan Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN 55455 Email: weip@biostat.umn.edu PubH 7475/8475 c Wei Pan Linear

More information

A TIME SERIES PARADOX: UNIT ROOT TESTS PERFORM POORLY WHEN DATA ARE COINTEGRATED

A TIME SERIES PARADOX: UNIT ROOT TESTS PERFORM POORLY WHEN DATA ARE COINTEGRATED A TIME SERIES PARADOX: UNIT ROOT TESTS PERFORM POORLY WHEN DATA ARE COINTEGRATED by W. Robert Reed Department of Economics and Finance University of Canterbury, New Zealand Email: bob.reed@canterbury.ac.nz

More information

Functional Coefficient Models for Nonstationary Time Series Data

Functional Coefficient Models for Nonstationary Time Series Data Functional Coefficient Models for Nonstationary Time Series Data Zongwu Cai Department of Mathematics & Statistics and Department of Economics, University of North Carolina at Charlotte, USA Wang Yanan

More information