LASSO-type penalties for covariate selection and forecasting in time series


Evandro Konzen 1   Flavio A. Ziegelmann 2

Abstract

This paper studies some forms of LASSO-type penalties in time series, both to reduce the dimensionality of the parameter space and to improve out-of-sample forecasting performance. In particular, we propose a method which we call WLadaLASSO (Weighted Lag adaptive LASSO), which not only assigns different weights to each coefficient but also further penalizes the coefficients of higher-lagged covariates. In our Monte Carlo implementation, the WLadaLASSO is superior to both the LASSO and the adalasso in terms of covariate selection, parameter estimation precision and forecasting, especially for small sample sizes and highly correlated covariates. An empirical study illustrates our approach for U.S. risk premium forecasting with good results.

Keywords: Time series, LASSO, adalasso, variable selection, forecasting.
JEL Classification: C22; C52; C53.

1 Graduate Program in Economics, Federal University of Rio Grande do Sul, Porto Alegre, RS, Brazil, konzen.evandro@gmail.com
2 Department of Statistics - PPGE and PPGA, Federal University of Rio Grande do Sul, Porto Alegre, RS, Brazil, flavioaz@mat.ufrgs.br. The author wishes to thank CNPq (processes / and /2013-1) and FAPERGS (process 1994/12-6) for financial support.

1 Introduction

High-dimensional models have become increasingly present in the literature. It is known that the inclusion of a large number of economic and financial variables can contribute substantial gains to time series forecasting. As Song and Bickel (2011) point out, a challenging problem is to determine which variables and lags are relevant, especially when serial correlation, a high-dimensional dependence structure among variables and a small sample size (relative to the dimensionality) occur together.

As Fan and Lv (2010) state, statistical accuracy, interpretability of the model and computational complexity are three important pillars of any statistical procedure. Typically, the number of observations n is much greater than the number of variables or parameters p. However, when the dimensionality p is large compared to the sample size n, traditional methods face several challenges: how to build interpretable models that remain estimable; how to make the statistical procedures robust and computationally efficient; and how to obtain procedures that are more efficient in terms of statistical inference. Moreover, in a high-dimensional context, when the number of covariates p is large compared to the sample size n, traditional models can suffer from spurious correlation between the covariates, which can be serious even when the covariates are independent and identically distributed, as shown in Fan and Lv (2008) and Fan et al. (2012).

One way to address the problems caused by high dimensionality is the sparsity assumption on the p-dimensional parameter vector, which forces many of its components to be exactly zero. Although it generally produces biased estimates, the sparsity assumption helps to identify the important covariates, yielding a more parsimonious model and reducing both its complexity and the computational cost of estimating it. As Medeiros and Mendes (2012) comment, factor models provide a good alternative when many variables are important in the model, a situation which the authors call a dense structure. However, when the coefficient vector is indeed sparse, methods that assume sparsity gain importance.

The Least Absolute Shrinkage and Selection Operator (LASSO) (Tibshirani, 1996) was proposed in a linear regression context. It imposes a penalty on the L1 norm of the coefficient vector. Due to the nature of this penalty, the LASSO forces some coefficients to be exactly zero, making it useful for selecting covariates and reducing the dimensionality of the parameter space. This methodology is an example of the regularization techniques characterized by Breiman (1995), which consider an error function of the form

E = (error on data) + λ (model complexity),

where the sum of squared residuals can, for instance, play the role of the error-on-data term. The second term penalizes models whose complexity, and hence whose estimator variance, is high, with λ controlling how severe the penalty is. By minimizing this error function rather than just the error on the data, overly complex models are penalized and the variance of the estimators is reduced. If λ is too large, though, only very simple models are obtained and a large bias can be introduced. The LASSO is one of the most famous regularization techniques, succeeding mainly in cases where many of the coefficients to be estimated are null. Thus, the LASSO also becomes useful for the selection of covariates.
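To illustrate this spurious-correlation phenomenon, the following minimal R sketch (purely illustrative, not part of the study above) draws a response that is independent of all covariates and records the largest absolute sample correlation as the number of covariates grows for a fixed sample size.

    ## Illustrative only: spurious correlation with i.i.d. covariates that are
    ## independent of the response; the maximum sample correlation grows with p.
    set.seed(123)
    n <- 50
    max_abs_cor <- sapply(c(10, 100, 1000), function(p) {
      X <- matrix(rnorm(n * p), n, p)   # i.i.d. N(0,1) covariates
      y <- rnorm(n)                     # response independent of X
      max(abs(cor(X, y)))               # largest spurious sample correlation
    })
    round(max_abs_cor, 2)               # typically increases markedly with p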

Zou (2006) investigates the oracle properties, discussed by Fan and Li (2001), of the original LASSO of Tibshirani (1996), and shows that there are cases in which the LASSO is not consistent in variable selection; he therefore proposes the adalasso (adaptive LASSO), in which the penalty is applied with different weights for each coefficient, which allows this version to enjoy the oracle properties.

In a time series context, LASSO-type penalties are employed in De Mol et al. (2008), Hua (2011) and Li (2012). Medeiros and Mendes (2012) show that the adalasso consistently chooses the relevant variables as the number of observations increases (model selection consistency), even when the errors are non-Gaussian and conditionally heteroskedastic. In addition, Audrino and Camponovo (2013) present theoretical and empirical results on the finite-sample and asymptotic properties of the adalasso in time series regression models.

Our present work borrows ideas from Park and Sakaori (2013) and proposes a variation which we call WLadaLASSO (Weighted Lag adaptive LASSO), a method which assigns different weights to each coefficient and also further penalizes coefficients of higher-lagged covariates. The results show the superiority of the WLadaLASSO when compared to the LASSO and the adalasso, essentially for small sample sizes. An application is carried out to forecast the risk premium using the same data as in Goyal and Welch (2008). Contrary to what that article points out, we find that the predictors used in the literature can help to predict the risk premium when LASSO-type methods are applied.

The rest of this paper is organized as follows. Section 2 introduces the theme of model selection. In Section 3, LASSO-type penalties are described in detail. Section 4 carries out a comprehensive Monte Carlo simulation study. Section 5 presents our empirical application to risk premium forecasting. Finally, Section 6 concludes with our main remarks.

2 Preliminaries on Model Selection

One of the main goals in linear regression analysis is to estimate the coefficients of the following Gaussian linear model,

y_i = β_0 + β_1 x_{1i} + β_2 x_{2i} + ... + β_k x_{ki} + ε_i,   i = 1, ..., n,   (1)

or, in vector form,

y_i = β_0 + X_i^T β + ε_i,   (2)

where y_i ∈ R is the response variable, X_i = (x_{1i}, ..., x_{ki})^T ∈ R^k is the set of predictors, ε_i ~ N(0, σ²), and (β_0, β_1, ..., β_k)^T is the set of parameters.

One of the most popular methods for estimating the unknown parameters of (1), ordinary least squares (OLS), is based on minimizing the sum of squared residuals (SSR), that is, solving the minimization problem

\hat{\beta} = \underset{\beta_0,\beta_1,\ldots,\beta_k}{\operatorname{argmin}} \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{k} \beta_j x_{ji} \Big)^2 .   (3)
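For illustration, the following R sketch (toy data, not used in the paper) estimates (3) with the lm function and shows the breakdown of OLS, discussed next, once the number of parameters exceeds the number of observations.

    ## Illustrative only: OLS estimation of model (1) and its failure when k + 1 > n.
    set.seed(1)
    n <- 30; k <- 5
    X <- matrix(rnorm(n * k), n, k)
    beta <- c(1, -0.5, 0, 0, 0.3)
    y <- 2 + drop(X %*% beta) + rnorm(n)
    coef(lm(y ~ X))          # OLS estimates of beta_0, ..., beta_k

    k_big <- 50              # now more parameters than observations
    X_big <- matrix(rnorm(n * k_big), n, k_big)
    coef(lm(y ~ X_big))      # lm() returns NA for coefficients it cannot identify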

However, the OLS method suffers from some problems when there are too many predictors in the model. Firstly, its estimates often have large variance, which reduces the accuracy of the forecasts. Secondly, on its own it offers no way of determining a smaller set of predictors when one aims to find out which predictors best explain the variability of the response variable. Finally, by construction, the OLS method cannot be implemented when the number of parameters exceeds the number of observations.

A natural procedure for choosing which predictors enter and which do not enter model (2) is to compute all 2^k possible regression models, investigating every combination of predictors. However, this can require large computational resources. Thus, several procedures have been proposed, such as Best-Subset Selection, Forward Selection and Backward Elimination, and Forward-Stagewise Regression, as pointed out by Hastie et al. (2001). These alternatives also have their own limitations when the model has many candidate covariates. Best-Subset Selection is feasible for a k not exceeding roughly 30 or 40, as explained by Hastie et al. (2001). Meanwhile, Forward Selection and Backward Elimination may not select the best subset of variables in some situations, as stated by Berk (1978). Forward-Stagewise Regression, as an algorithm that can require many steps, has a large computational cost and may be impractical for high-dimensional problems.

Ridge Regression, in turn, is used to shrink the set of coefficients by imposing a penalty on their sum of squares:

\hat{\beta}^{ridge} = \underset{\beta_0,\beta_1,\ldots,\beta_k}{\operatorname{argmin}} \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{k} \beta_j x_{ji} \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{k} \beta_j^2 \le t,   (4)

where the parameter t ≥ 0 controls the penalty. An equivalent expression is given by

\hat{\beta}^{ridge} = \underset{\beta_0,\beta_1,\ldots,\beta_k}{\operatorname{argmin}} \Big\{ \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{k} \beta_j x_{ji} \Big)^2 + \lambda \sum_{j=1}^{k} \beta_j^2 \Big\},   (5)

where the parameter λ ≥ 0 is a function of the parameter t in (4) and controls how severe the penalty is. The larger λ is, the larger the penalty gets. When λ = 0, the vector \hat{\beta}^{ridge} equals the coefficient vector obtained by OLS. Ridge regression yields non-zero estimates for all coefficients, and so it is not a method of variable selection.

Noting that the choice of variables is important for the interpretation of the model, Breiman (1995) proposes the Garrote method, which penalizes the regression coefficients so that some of them are forced to zero. The disadvantage of the Garrote is that its solution depends on the sign and magnitude of the OLS estimates, which perform poorly in the presence of high correlation between predictors.

3 LASSO-type penalties

While Ridge regression, described by equation (4), penalizes the sum of squares of the coefficients, there are alternative methods that impose penalties on the sum of the absolute values of the coefficients. Some of these methods are described in this section.
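For concreteness, problem (5) can be solved, for instance, with the glmnet package, where alpha = 0 selects the ridge penalty and λ is chosen by cross-validation; the sketch below uses simulated data and is purely illustrative.

    ## Illustrative only: ridge regression (equation (5)) via glmnet.
    library(glmnet)
    set.seed(1)
    n <- 100; k <- 20
    X <- matrix(rnorm(n * k), n, k)
    y <- X[, 1] - 0.5 * X[, 2] + rnorm(n)

    cv_ridge <- cv.glmnet(X, y, alpha = 0, nfolds = 10)   # alpha = 0: ridge penalty
    beta_ridge <- coef(cv_ridge, s = "lambda.min")        # shrunk, but none exactly zero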

3.1 LASSO

The least absolute shrinkage and selection operator (LASSO), proposed by Tibshirani (1996), is another method of shrinking the coefficient set. Just as the Garrote, the LASSO aims to estimate a model that produces forecasts with small variance and to determine the set of predictors that best explains the response variable. Tibshirani (1996) argues that the techniques usually employed to improve the OLS estimates, such as Subset Selection and Ridge Regression, have disadvantages. Subset Selection models are easily interpreted, but the variable choice has great variability since it is a discrete process. Ridge regression, in turn, has less variability and shrinks the regression coefficients, but still keeps all predictors in the model.

As in the typical regression modelling represented by equations (1) and (2), it is supposed that the y_i's are conditionally independent given the x_{ki}'s. It is assumed that the x_{ki}'s are standardized such that \sum_i x_{ki} = 0 and \sum_i x_{ki}^2 / n = 1. LASSO estimates are obtained by minimizing the sum of squared residuals subject to a penalty on the L1 norm of the coefficients:

\hat{\beta}^{LASSO} = \underset{\beta_0,\beta_1,\ldots,\beta_k}{\operatorname{argmin}} \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{k} \beta_j x_{ji} \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{k} |\beta_j| \le t,   (6)

where the tuning parameter t ≥ 0 controls the penalty. For all t, the solution for β_0 is \hat{\beta}_0 = \bar{y}. So one may assume without loss of generality that \bar{y} = 0 and then omit β_0.

The tuning parameter t ≥ 0 in (6) controls how much penalty is applied to the set of coefficients. Let \{\hat{\beta}_j^0\}_{1 \le j \le k} represent the set of OLS coefficients and t_0 = \sum_{j} |\hat{\beta}_j^0|. Values t < t_0 will cause a shrinkage of the coefficients toward zero, and some coefficients may be exactly equal to zero. If t ≥ t_0, the LASSO estimates will be the same as the OLS estimates. Using the Lagrangian, (6) has an equivalent expression given by

\hat{\beta}^{LASSO} = \underset{\beta_0,\beta_1,\ldots,\beta_k}{\operatorname{argmin}} \Big\{ \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{k} \beta_j x_{ji} \Big)^2 + \lambda \sum_{j=1}^{k} |\beta_j| \Big\},   (7)

where the parameter λ ≥ 0 is a function of the parameter t. The larger λ, the greater the penalty on the coefficients; when λ = 0, the LASSO estimates equal the OLS estimates. The value of λ is chosen via K-fold cross-validation. Tibshirani (2013) gives sufficient conditions for the uniqueness of the LASSO solution using the Karush-Kuhn-Tucker conditions.

The LASSO has less variability than the Subset Selection methods. Moreover, it shrinks some coefficients and forces others to zero, keeping the good features of both Subset Selection and Ridge Regression. Furthermore, the LASSO performs variable selection and coefficient estimation simultaneously.

3.1.1 LASSO's consistency in variable selection

In order to use the LASSO as a selection criterion, its sparse solution should represent the true model well. Hence, one of the desirable properties of such a criterion is consistency, i.e., it identifies the true model when n → ∞.
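Before turning to consistency, the following minimal sketch (illustrative data and names; the actual simulations are described in Section 4) shows how (7) can be estimated with glmnet and 10-fold cross-validation, with alpha = 1 selecting the L1 penalty.

    ## Illustrative only: LASSO (equation (7)) with lambda chosen by 10-fold CV.
    library(glmnet)
    set.seed(1)
    n <- 100; k <- 20
    X <- matrix(rnorm(n * k), n, k)
    y <- X[, 1] - 0.5 * X[, 2] + rnorm(n)

    cv_lasso <- cv.glmnet(X, y, alpha = 1, nfolds = 10)        # alpha = 1: L1 penalty
    beta_lasso <- as.vector(coef(cv_lasso, s = "lambda.min"))  # intercept + k coefficients
    which(beta_lasso[-1] != 0)                                 # indices of selected covariates

Unlike the ridge fit, several coefficients here are exactly zero, which is what makes the LASSO usable as a covariate-selection device.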

To study the consistency of LASSO selection, Zhao and Yu (2006) consider two aspects: i) whether there is a deterministic amount of regularization that provides consistent selection; and ii) whether for each sample there is a correct amount of regularization that selects the true model. Their results show that there is a condition, which they name the Irrepresentable Condition, that is almost necessary and sufficient for both kinds of consistency. Their results hold for linear models with either fixed k or k growing with n.

Let us consider the linear regression model Y_n = X_n β^n + ε_n, where ε_n = (ε_1, ..., ε_n)^T is a vector of i.i.d. random variables with mean 0 and variance σ². Y_n is an n × 1 response vector and X_n = (X_1^n, ..., X_k^n) is the n × k matrix of predictors, where X_i^n = (x_{i1}, ..., x_{in})^T for i = 1, ..., k, and β^n is the k × 1 coefficient vector. Unlike the traditional setting where k is fixed, the data and the model parameters are indexed by n to allow them to vary with n. The LASSO estimates \hat{\beta}^n(\lambda) = \hat{\beta}^n = (\hat{\beta}_1^n, ..., \hat{\beta}_k^n)^T are defined by

\hat{\beta}^n(\lambda) = \underset{\beta}{\operatorname{argmin}} \; \| Y_n - X_n \beta^n \|_2^2 + \lambda \| \beta^n \|_1 ,

where \| \cdot \|_2^2 denotes the squared L2 norm of a vector, i.e., the sum of the squares of the vector components, and \| \cdot \|_1 denotes the L1 norm of a vector, i.e., the sum of the absolute values of the vector components.

There is consistency in model selection when

P( \{ i : \hat{\beta}_i^n \ne 0 \} = \{ i : \beta_i^n \ne 0 \} ) \to 1  as n → ∞,

which is equivalent to sign consistency (Zhao and Yu, 2006). The following definitions are based on Zhao and Yu (2006).

Definition 1: An estimate \hat{\beta}^n is equal in sign to the true β^n (written \hat{\beta}^n =_s β^n) if and only if sign(\hat{\beta}^n) = sign(β^n), where sign(·) takes the value 1 for positive values, -1 for negative values and 0 for zero.

Definition 2: The LASSO is strongly sign consistent if there exists λ_n = f(n), i.e., a function of n which is independent of Y_n and X_n, such that

lim_{n → ∞} P( \hat{\beta}^n(\lambda_n) =_s β^n ) = 1.

Definition 3: The LASSO is general sign consistent if

lim_{n → ∞} P( ∃ λ ≥ 0, \hat{\beta}^n(\lambda) =_s β^n ) = 1.

Strong sign consistency means that one can use a preselected λ to obtain consistent model selection. General sign consistency means that for a random realization there is a correct amount of regularization that selects the true model.
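A simple numerical way to inspect general sign consistency on a given sample is to fit the whole LASSO path and check whether any value of λ reproduces the sign vector of the true coefficients; the following illustrative R sketch does exactly that.

    ## Illustrative only: does some lambda on the LASSO path recover sign(beta)?
    library(glmnet)
    set.seed(2)
    n <- 200; k <- 10
    beta_true <- c(2, -1.5, 1, rep(0, k - 3))
    X <- matrix(rnorm(n * k), n, k)
    y <- drop(X %*% beta_true) + rnorm(n)

    fit <- glmnet(X, y, alpha = 1)                 # entire lambda path
    signs <- sign(as.matrix(fit$beta))             # k x (number of lambdas) sign matrix
    sign_correct <- apply(signs, 2, function(s) all(s == sign(beta_true)))
    any(sign_correct)                              # TRUE if some lambda is sign-correct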

Zhao and Yu (2006) show that both types of consistency are almost equivalent up to a certain condition. We now introduce some notation to define this condition. Without loss of generality, we assume that β^n = (β_1^n, ..., β_q^n, β_{q+1}^n, ..., β_k^n)^T, where β_j^n ≠ 0 for j = 1, ..., q and β_j^n = 0 for j = q + 1, ..., k. Write β_{(1)}^n = (β_1^n, ..., β_q^n)^T and β_{(2)}^n = (β_{q+1}^n, ..., β_k^n)^T. Then one can write X_n(1) and X_n(2) for the first q and the last k - q columns of X_n, respectively, and define C^n = (1/n) X_n^T X_n. Setting C_{11}^n = (1/n) X_n(1)^T X_n(1), C_{22}^n = (1/n) X_n(2)^T X_n(2), C_{12}^n = (1/n) X_n(1)^T X_n(2) and C_{21}^n = (1/n) X_n(2)^T X_n(1), C^n can be expressed in block-wise form as

C^n = \begin{pmatrix} C_{11}^n & C_{12}^n \\ C_{21}^n & C_{22}^n \end{pmatrix}.

Assuming C_{11}^n is invertible, the following Irrepresentable Conditions are defined.

Strong Irrepresentable Condition. There exists a positive constant vector η such that

| C_{21}^n (C_{11}^n)^{-1} \operatorname{sign}(β_{(1)}^n) | \le \mathbf{1} - η,

where \mathbf{1} is a (k - q) × 1 vector of ones and the inequality holds element-wise.

Weak Irrepresentable Condition.

| C_{21}^n (C_{11}^n)^{-1} \operatorname{sign}(β_{(1)}^n) | < \mathbf{1},

where the inequality holds element-wise.

Zou (2006) similarly concludes that there is a necessary condition for the LASSO's consistency in model selection.

3.2 adalasso

Noting that there may be situations in which the LASSO is not consistent in variable selection, Zou (2006) proposes the adaptive LASSO (adalasso), which applies different weights to different coefficients:

\hat{\beta}^{adalasso} = \underset{\beta_0,\beta_1,\ldots,\beta_k}{\operatorname{argmin}} \Big\{ \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{k} \beta_j x_{ji} \Big)^2 + \lambda \sum_{j=1}^{k} \omega_j |\beta_j| \Big\},   (8)

where ω_j = | \hat{\beta}_j^{ridge} |^{-τ}, τ > 0.

The individual weights ω_j help to select the relevant variables. A relevant variable x_j tends to have a large ridge coefficient \hat{\beta}_j^{ridge}, resulting in a small weight ω_j assigned to the coefficient of that variable; conversely, if the variable x_j is irrelevant, its ridge coefficient tends to be small and results in a large ω_j. Thus, the adalasso imposes a greater penalty on the coefficients of the variables that appear to be irrelevant. The weights ω_j can also be obtained from OLS estimates; however, this is limited to the case where n > k + 1.

Following the notation of Zou (2006), A = {j : β_j ≠ 0} is the true set of non-zero coefficients. In turn, A_n = {j : \hat{\beta}_j^{(n)} ≠ 0} is the set of non-zero coefficients estimated by (8), where λ_n varies with the sample size n.
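Before turning to its theoretical properties, note that the two-step structure of (8) translates directly into code: a first-stage ridge fit supplies the weights, which are then passed to a weighted L1 fit through glmnet's penalty.factor argument. The sketch below (illustrative data; τ = 1 as in Section 4) is one possible implementation.

    ## Illustrative only: adaptive LASSO (equation (8)) as a two-step procedure.
    library(glmnet)
    set.seed(4)
    n <- 150; k <- 15
    beta_true <- c(1.5, -1, 0.8, rep(0, k - 3))
    X <- matrix(rnorm(n * k), n, k)
    y <- drop(X %*% beta_true) + rnorm(n)

    ## Step 1: ridge estimates (alpha = 0) supply the weights w_j = |beta_ridge_j|^(-tau)
    tau <- 1
    beta_ridge <- as.vector(coef(cv.glmnet(X, y, alpha = 0, nfolds = 10),
                                 s = "lambda.min"))[-1]
    w <- abs(beta_ridge)^(-tau)

    ## Step 2: weighted L1 fit via glmnet's penalty.factor argument
    cv_ada   <- cv.glmnet(X, y, alpha = 1, nfolds = 10, penalty.factor = w)
    beta_ada <- as.vector(coef(cv_ada, s = "lambda.min"))[-1]
    which(beta_ada != 0)   # selected covariates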

Zou (2006) shows that, with the use of adequate weights ω_j, the adalasso has the oracle properties.

Theorem 1 (Zou, 2006): Suppose λ_n / \sqrt{n} → 0 and λ_n n^{(τ-1)/2} → ∞. Then the adalasso satisfies:

1. Consistency in variable selection: lim_{n → ∞} P(A_n = A) = 1.
2. Asymptotic normality: \sqrt{n} \, ( \hat{\beta}_A^{(n)} - β_A ) \to_d N( 0, σ^2 C_{11}^{-1} ).

Thus, in addition to correctly selecting the relevant variables when the sample size increases, the adalasso estimates of the non-zero coefficients asymptotically follow the same distribution as the OLS estimators computed using only the relevant variables.

3.3 WLadaLASSO

When the adalasso is used in a time series context, each lagged variable enters as a candidate predictor and its coefficient is penalized only according to the size of its ridge (or OLS) estimate. One can then wonder whether a greater penalty on more distant lagged variables improves time series forecasting, since more recent information is usually more important. Park and Sakaori (2013) propose some alternative types of penalties for different lags. Borrowing from their ideas in a slightly different version, we propose the adalasso with weighted lags, called here WLadaLASSO (Weighted Lag adaptive LASSO), which is given by

\hat{\beta}^{WLadaLASSO} = \underset{\beta_0,\beta_1,\ldots,\beta_k}{\operatorname{argmin}} \Big\{ \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{k} \beta_j x_{ji} \Big)^2 + \lambda \sum_{j=1}^{k} \omega_j |\beta_j| \Big\},   (9)

where ω_j = ( | \hat{\beta}_j^{ridge} | e^{-α l} )^{-τ}, with τ > 0, α ≥ 0, and l the lag order of the j-th candidate covariate.

In what follows, besides looking at the WLadaLASSO's forecasting skills, we also study its model selection and estimation capabilities, comparing the results to other regularization methods.

4 Simulation

In this section we analyze the WLadaLASSO's performance, comparing it to other regularization methods in a Monte Carlo simulation study. All implementations are done with the free software R. The estimation of equations (7), (8) and (9) uses the glmnet function to optimize the parameters β_j and λ. We use 10-fold cross-validation, and the parameter τ is set to 1. In the WLadaLASSO, for each α belonging to the grid {0, 0.5, 1, ..., 10}, 10-fold cross-validation is performed to find the optimal λ by minimizing the cross-validation error. The chosen value of α is the one that produces the smallest cross-validation error among all these 10-fold procedures.
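The following R sketch outlines one possible implementation of (9): build the lagged candidate matrix, compute first-stage ridge estimates, and search the α grid {0, 0.5, ..., 10} by 10-fold cross-validation as just described. The data, a single covariate plus the lagged dependent variable, are purely illustrative, and the exponential lag weighting e^{-αl} follows the definition of the weights given above.

    ## Illustrative only: WLadaLASSO (equation (9)) on a lagged design matrix.
    library(glmnet)
    set.seed(3)
    T_obs <- 200; L <- 10
    x <- as.numeric(arima.sim(model = list(ar = 0.6), n = T_obs))
    y <- as.numeric(arima.sim(model = list(ar = 0.8), n = T_obs))

    ## Candidate matrix Z: lags 1..L of y and of x; lag_order[j] is the lag of column j
    Ey <- embed(y, L + 1); Ex <- embed(x, L + 1)
    y_dep <- Ey[, 1]
    Z <- cbind(Ey[, -1], Ex[, -1])
    lag_order <- rep(1:L, times = 2)

    ## First-stage ridge estimates
    tau <- 1
    beta_ridge <- as.vector(coef(cv.glmnet(Z, y_dep, alpha = 0, nfolds = 10),
                                 s = "lambda.min"))[-1]

    ## Grid search over alpha; keep the combination with the smallest CV error
    alphas <- seq(0, 10, by = 0.5)
    cv_errors <- sapply(alphas, function(a) {
      w <- (abs(beta_ridge) * exp(-a * lag_order))^(-tau)   # heavier penalty on longer lags
      min(cv.glmnet(Z, y_dep, alpha = 1, nfolds = 10, penalty.factor = w)$cvm)
    })
    a_star <- alphas[which.min(cv_errors)]
    w_star <- (abs(beta_ridge) * exp(-a_star * lag_order))^(-tau)
    fit_wl <- cv.glmnet(Z, y_dep, alpha = 1, nfolds = 10, penalty.factor = w_star)
    beta_wl <- coef(fit_wl, s = "lambda.min")

Setting α = 0 makes the lag factor equal to one, so the grid search nests the adalasso weights as a special case.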

Through Monte Carlo simulations with 1,000 replications, we simulated 10 independent time series that follow an AR(1), x_{i,t} = φ x_{i,t-1} + u_{i,t}, where u_{i,t} ~ N(0,1), i = 1, ..., 10. The data generating process (10) is a sparse linear equation in which y_t depends on its own first lag, with coefficient 0.8, and on a small set of lagged covariates: the first and second lags of x_1, x_2 and x_3 (the coefficients on x_{2,t-1} and x_{2,t-2} having magnitudes 0.5 and 0.2), the first lag of x_4, the first lag of x_5 (coefficient of magnitude 0.3) and the first lag of x_6, plus an innovation ε_t ~ N(0,1), for t = 1, 2, ..., T. The LASSO, adalasso and WLadaLASSO methods were employed to estimate this model with 10 lags of y_t and 10 lags of x_{j,t}, j = 1, ..., 10, as candidates, resulting in 110 candidate predictors.

In order to compare the coefficient estimates to their true values, we used the Mean Squared Error (MSE) and the Mean Absolute Error (MAE) of the estimates, which are respectively given by

MSE = \frac{1}{k} \sum_{j=1}^{k} ( \hat{\beta}_j - \beta_j )^2   (11)

and

MAE = \frac{1}{k} \sum_{j=1}^{k} | \hat{\beta}_j - \beta_j | .   (12)

We removed the last 10 observations of the simulated series and then performed one-step-ahead out-of-sample forecasts for the removed observations. For each t, the forecast of y_t was based on the entire set of information available at date t - 1.

Table 1, inspired by the working paper of Medeiros and Mendes (2012), shows various statistics related to variable selection, divided into panels, for different sample sizes. The first panel shows the fraction of replications in which the model was correctly selected, that is, all relevant variables were included and all irrelevant variables were excluded from the final model; the second panel shows the fraction of replications where all relevant covariates were included; the third shows the fraction of relevant covariates included; the fourth shows the fraction of irrelevant regressors excluded; and the last shows the average number of covariates included.

The results in Table 1 show that the WLadaLASSO had a slightly better performance on the exclusion of irrelevant covariates than the LASSO and the adalasso. On the identification of relevant covariates for the cases φ = 0.3 and φ = 0.6, the three methods had reasonably similar performance for sample sizes 500 and 2000, but the WLadaLASSO was clearly superior for sample size 50. For φ = 0.9, this superiority is even greater for the smallest sample size.

Table 2 shows the error measures of the set of estimated parameters, calculated by (11) and (12). The WLadaLASSO stood out for all sample sizes, especially for the smallest one. Figure 1 shows the smoothed histogram (using a Gaussian kernel) of the 1,000 estimated values of β_1, the coefficient of y_{t-1}. The plots show that the WLadaLASSO estimates were the best, mainly in the cases of highly correlated covariates and/or small sample sizes.

In our simulations we also observed the MSE and MAE of the difference between the OLS and WLadaLASSO estimates when the former is estimated using only the relevant variables. Both error measures tend to zero as the sample size increases, in line with one of the oracle properties (Fan and Li, 2001).
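The error measures (11) and (12) and the one-step-ahead forecasting step can be coded with short helpers such as the following (beta_hat, beta_true, fit_wl and z_new are illustrative names, not objects from the study):

    ## Illustrative helpers for the estimation-error measures (11) and (12);
    ## beta_hat and beta_true are numeric vectors of the same length k.
    mse_est <- function(beta_hat, beta_true) mean((beta_hat - beta_true)^2)
    mae_est <- function(beta_hat, beta_true) mean(abs(beta_hat - beta_true))

    mse_est(c(0.7, 0.1, 0.0), c(0.8, 0.0, 0.0))   # toy usage

    ## A one-step-ahead forecast from a fitted cv.glmnet object, using only
    ## information available at date t - 1 (z_new is the newest row of candidates):
    ## y_hat <- predict(fit_wl, newx = matrix(z_new, nrow = 1), s = "lambda.min")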

The results of the one-step-ahead forecasts are reported in Table 3. The WLadaLASSO had the best predictive performance in almost all cases, and stood out for the cases with T = 50. It was strikingly superior when there was a higher linear dependence between covariates. Figure 2 illustrates the distribution of the prediction errors.

5 Empirical Analysis

Stock return forecasting is of great interest to academics and financial market investors, and many economic variables have been proposed as potential predictors. Goyal and Welch (2008) show that an extensive list of potential predictors used in the literature produces unstable out-of-sample forecasts when compared to the simple model based on the historical average return. In contrast, Campbell and Thompson (2008) show that many regressions can provide better predictions than the historical average when restrictions are imposed on the signs of the coefficients. Although this superiority over the historical average is generally small and statistically insignificant, it can be economically significant. Among the articles that corroborate the results of Campbell and Thompson (2008), we can cite Rapach et al. (2010), Ferreira and Santa-Clara (2011) and Hillebrand et al. (2012). Other recent works have analyzed whether financial time series can be predicted by a list of covariates; see, for example, Issler et al. (2014), Lee et al. (2014) and Hsiao and Wan (2014).

Goyal and Welch (2008) estimated linear regressions via OLS in which the risk premium at time t is explained by a predictor variable at time t - 1; furthermore, they estimated linear regressions including all candidate predictors. Here we have applied methods with an L1-norm penalty on the regression coefficients, with multiple predictors as candidates, delegating to these methods the choice of predictors for the risk premium, which is the total rate of return on the stock market minus the prevailing short-term interest rate.

We used 14 variables from Goyal and Welch (2008): dividend-price ratio (log); dividend yield (log); earnings-price ratio (log); dividend-payout ratio (log); stock variance; book-to-market ratio; net equity expansion; Treasury bill rate; long-term yield; long-term return; term spread; default yield spread; default return spread; and inflation. All annual series used as predictors begin in 1927. We worked with differenced and centered series, estimating the models without an intercept. For the penalty methods, the number of lags was optimized via a 10-fold cross-validation approach, with a maximal lag order of five.

In order to compare the one-step-ahead predictions of the h out-of-sample observations, we use the R^2_{OS} (out-of-sample R²) statistic, as suggested in Campbell and Thompson (2008), which is given by

R^2_{OS} = 1 - \frac{ \sum_{t=T+1}^{T+h} ( r_t - \hat{r}_t )^2 }{ \sum_{t=T+1}^{T+h} ( r_t - \bar{r}_t )^2 },

where \hat{r}_t, t = T+1, ..., T+h, are the predicted values for the h out-of-sample observations and \bar{r}_t is the historical mean computed with data up to t - 1.
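A minimal sketch of the evaluation step, with illustrative vectors in place of the actual return forecasts: the out-of-sample R² defined above and the Diebold-Mariano comparison, discussed next, via dm.test from the forecast package (which, to our knowledge, incorporates the small-sample adjustment of Harvey et al., 1997).

    ## Illustrative only: out-of-sample R^2 and a Diebold-Mariano comparison.
    library(forecast)

    r2_os <- function(r_actual, r_model, r_histmean) {
      1 - sum((r_actual - r_model)^2) / sum((r_actual - r_histmean)^2)
    }

    set.seed(5)
    r_actual   <- rnorm(30, 0.05, 0.15)                # toy realized returns
    r_histmean <- rep(mean(r_actual[1:10]), 30)        # crude stand-in for the recursive mean
    r_model    <- r_actual + rnorm(30, 0, 0.10)        # stand-in model forecasts
    r2_os(r_actual, r_model, r_histmean)

    ## With alternative = "greater", the alternative hypothesis is that the second
    ## forecast is more accurate than the first (see ?dm.test for the convention).
    dm.test(r_actual - r_histmean, r_actual - r_model,
            alternative = "greater", h = 1, power = 2)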

P-values of the Modified Diebold-Mariano (MDM) test (Harvey et al., 1997) were computed in order to compare the predictions via the historical mean to the other methods. The one-tailed Diebold-Mariano test has the null hypothesis of no difference in the accuracy of two competing forecasts. When the null hypothesis is rejected, we conclude that the method has better predictive ability than the historical mean.

Table 4 shows the superiority of the predictions made by the penalization methods compared to the historical mean in three different periods, which begin in 1955, 1970 and 1985. We employed those methods in two ways: by using only one lag (l = 1) of all individual predictors, as in Goyal and Welch (2008), and by using the optimal number of lags chosen via 10-fold cross-validation. In both cases the penalization methods achieved better predictions than the historical mean, with the WLadaLASSO obtaining the best results.

6 Conclusion

This study aimed to investigate how a penalty on the set of coefficients can contribute to the performance of time series forecasting. Methods that penalize the coefficients are extremely important in reducing the dimensionality of economic applications in which the number of time series is high and there are few observations. The LASSO and adalasso methods arose in the context of linear regression but are increasingly present in time series analysis. Observing that more recent information tends to contribute more to time series forecasting, and inspired by Park and Sakaori (2013), this article proposes the WLadaLASSO, which penalizes each lagged variable differently.

The simulation study showed that the WLadaLASSO outperforms the other penalization methods in many respects, namely covariate selection, parameter estimation and out-of-sample forecasting, especially for small sample sizes and highly correlated covariates. In addition, the application to U.S. financial data shows that the risk premium predictions obtained by the WLadaLASSO were superior to those of the competing methods and were significantly better than those obtained by the historical mean model.

References

Audrino F, Camponovo L. 2013. Oracle properties and finite sample inference of the adaptive lasso for time series regression models. Economics Working Paper Series 1327, University of St. Gallen, School of Economics and Political Science.

Berk KN. 1978. Comparing subset regression procedures. Technometrics 20(1): 1-6.

Breiman L. 1995. Better subset regression using the non-negative garrote. Technometrics 37(4).

Campbell JY, Thompson SB. 2008. Predicting excess stock returns out of sample: Can anything beat the historical average? Review of Financial Studies 21(4).

De Mol C, Giannone D, Reichlin L. 2008. Forecasting using a large number of predictors: Is Bayesian shrinkage a valid alternative to principal components? Journal of Econometrics 146(2).

Fan J, Guo S, Hao N. 2012. Variance estimation using refitted cross-validation in ultrahigh dimensional regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 74(1).

Fan J, Li R. 2001. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association 96(456).

Fan J, Lv J. 2008. Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 70(5).

Fan J, Lv J. 2010. A selective overview of variable selection in high dimensional feature space. Statistica Sinica 20(1).

Ferreira MA, Santa-Clara P. 2011. Forecasting stock market returns: The sum of the parts is more than the whole. Journal of Financial Economics 100(3).

Goyal A, Welch I. 2008. A comprehensive look at the empirical performance of equity premium prediction. Review of Financial Studies 21(4).

Harvey D, Leybourne S, Newbold P. 1997. Testing the equality of prediction mean squared errors. International Journal of Forecasting 13(2).

Hastie T, Tibshirani R, Friedman JH. 2001. The Elements of Statistical Learning. Springer, New York.

Hillebrand E, Lee TH, Medeiros MC. 2012. Let's do it again: Bagging equity premium predictors. CREATES Research Papers, School of Economics and Management, University of Aarhus.

Hsiao C, Wan SK. 2014. Is there an optimal forecast combination? Journal of Econometrics 178.

Hua Y. 2011. Macroeconomic Forecasting using Large Vector Auto Regressive Model. Master's thesis.

Issler JV, Rodrigues C, Burjack R. 2014. Using common features to understand the behavior of metal-commodity prices and forecast them at different horizons. Journal of International Money and Finance 42.

Lee TH, Tu Y, Ullah A. 2014. Forecasting equity premium: Global historical average versus local historical average and constraints. Journal of Business & Economic Statistics.

Li J. 2012. Monetary policy analysis based on lasso-assisted vector autoregression (LAVAR). Working paper.

Medeiros MC, Mendes EF. 2012. Estimating high-dimensional time series models. CREATES Research Papers, School of Economics and Management, University of Aarhus.

Park H, Sakaori F. 2013. Lag weighted lasso for time series model. Computational Statistics 28(2).

Rapach DE, Strauss JK, Zhou G. 2010. Out-of-sample equity premium prediction: Combination forecasts and links to the real economy. Review of Financial Studies 23(2).

Song S, Bickel PJ. 2011. Large vector auto regressions. arXiv preprint.

Tibshirani R. 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological) 58.

Tibshirani RJ. 2013. The lasso problem and uniqueness. Electronic Journal of Statistics 7.

Zhao P, Yu B. 2006. On model selection consistency of lasso. Journal of Machine Learning Research 7.

Zou H. 2006. The adaptive lasso and its oracle properties. Journal of the American Statistical Association 101(476).

Table 1: Descriptive statistics of model selection. For φ ∈ {0.3, 0.6, 0.9} and sample sizes T = 50, 500 and 2000, the panels report, for the LASSO, the adalasso and the WLadaLASSO: the fraction of replications in which all variables were correctly identified (true model included); the fraction of relevant variables included; the fraction of irrelevant variables excluded; and the average number of included variables.

Table 2: Descriptive statistics of parameter estimates. For φ ∈ {0.3, 0.6, 0.9} and each sample size T, the panels report the Mean Squared Error and the Mean Absolute Error of the estimated coefficients for the LASSO, the adalasso and the WLadaLASSO.

Table 3: Descriptive statistics of forecasts. For φ ∈ {0.3, 0.6, 0.9} and each sample size T, the panels report the mean and median of the forecast MSEs and MAEs for the LASSO, the adalasso and the WLadaLASSO.

Figure 1: Observed density function of β̂_1 (estimator for β_1 = 0.8), in panels for φ ∈ {0.3, 0.6, 0.9} and T ∈ {50, 500, 2000}. LASSO (continuous line), adalasso (dashed line) and WLadaLASSO (dotted line).

Figure 2: Distribution of forecast errors for the LASSO, the adalasso and the WLadaLASSO, in panels for φ ∈ {0.3, 0.6, 0.9} and T ∈ {50, 500, 2000}.

Table 4: Equity premium forecasting results. For forecast periods beginning in 1955, 1970 and 1985, the table reports R²_OS and the MDM p-value for each individual predictor (dividend-price ratio, dividend yield, earnings-price ratio, dividend-payout ratio, stock variance, book-to-market, net equity expansion, T-bill rate, long-term yield, long-term return, term spread, default yield spread, default return spread, inflation), for the regression with all regressors, for the penalization methods with one lag (l = 1): LASSO and adalasso, and for the penalization methods with the lag order chosen by cross-validation (free l): LASSO, adalasso and WLadaLASSO.


More information

Day 4: Shrinkage Estimators

Day 4: Shrinkage Estimators Day 4: Shrinkage Estimators Kenneth Benoit Data Mining and Statistical Learning March 9, 2015 n versus p (aka k) Classical regression framework: n > p. Without this inequality, the OLS coefficients have

More information

Or How to select variables Using Bayesian LASSO

Or How to select variables Using Bayesian LASSO Or How to select variables Using Bayesian LASSO x 1 x 2 x 3 x 4 Or How to select variables Using Bayesian LASSO x 1 x 2 x 3 x 4 Or How to select variables Using Bayesian LASSO On Bayesian Variable Selection

More information

Comparing Nested Predictive Regression Models with Persistent Predictors

Comparing Nested Predictive Regression Models with Persistent Predictors Comparing Nested Predictive Regression Models with Persistent Predictors Yan Ge y and ae-hwy Lee z November 29, 24 Abstract his paper is an extension of Clark and McCracken (CM 2, 25, 29) and Clark and

More information

The Role of "Leads" in the Dynamic Title of Cointegrating Regression Models. Author(s) Hayakawa, Kazuhiko; Kurozumi, Eiji

The Role of Leads in the Dynamic Title of Cointegrating Regression Models. Author(s) Hayakawa, Kazuhiko; Kurozumi, Eiji he Role of "Leads" in the Dynamic itle of Cointegrating Regression Models Author(s) Hayakawa, Kazuhiko; Kurozumi, Eiji Citation Issue 2006-12 Date ype echnical Report ext Version publisher URL http://hdl.handle.net/10086/13599

More information

Data Mining Stat 588

Data Mining Stat 588 Data Mining Stat 588 Lecture 02: Linear Methods for Regression Department of Statistics & Biostatistics Rutgers University September 13 2011 Regression Problem Quantitative generic output variable Y. Generic

More information

In Search of Desirable Compounds

In Search of Desirable Compounds In Search of Desirable Compounds Adrijo Chakraborty University of Georgia Email: adrijoc@uga.edu Abhyuday Mandal University of Georgia Email: amandal@stat.uga.edu Kjell Johnson Arbor Analytics, LLC Email:

More information

Econometrics Summary Algebraic and Statistical Preliminaries

Econometrics Summary Algebraic and Statistical Preliminaries Econometrics Summary Algebraic and Statistical Preliminaries Elasticity: The point elasticity of Y with respect to L is given by α = ( Y/ L)/(Y/L). The arc elasticity is given by ( Y/ L)/(Y/L), when L

More information

Chapter 3. Linear Models for Regression

Chapter 3. Linear Models for Regression Chapter 3. Linear Models for Regression Wei Pan Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN 55455 Email: weip@biostat.umn.edu PubH 7475/8475 c Wei Pan Linear

More information

Functional Coefficient Models for Nonstationary Time Series Data

Functional Coefficient Models for Nonstationary Time Series Data Functional Coefficient Models for Nonstationary Time Series Data Zongwu Cai Department of Mathematics & Statistics and Department of Economics, University of North Carolina at Charlotte, USA Wang Yanan

More information

Estimating Global Bank Network Connectedness

Estimating Global Bank Network Connectedness Estimating Global Bank Network Connectedness Mert Demirer (MIT) Francis X. Diebold (Penn) Laura Liu (Penn) Kamil Yılmaz (Koç) September 22, 2016 1 / 27 Financial and Macroeconomic Connectedness Market

More information

Econometrics of Panel Data

Econometrics of Panel Data Econometrics of Panel Data Jakub Mućk Meeting # 6 Jakub Mućk Econometrics of Panel Data Meeting # 6 1 / 36 Outline 1 The First-Difference (FD) estimator 2 Dynamic panel data models 3 The Anderson and Hsiao

More information

Estimating and Accounting for the Output Gap with Large Bayesian Vector Autoregressions

Estimating and Accounting for the Output Gap with Large Bayesian Vector Autoregressions Estimating and Accounting for the Output Gap with Large Bayesian Vector Autoregressions James Morley 1 Benjamin Wong 2 1 University of Sydney 2 Reserve Bank of New Zealand The view do not necessarily represent

More information

Pre-Selection in Cluster Lasso Methods for Correlated Variable Selection in High-Dimensional Linear Models

Pre-Selection in Cluster Lasso Methods for Correlated Variable Selection in High-Dimensional Linear Models Pre-Selection in Cluster Lasso Methods for Correlated Variable Selection in High-Dimensional Linear Models Niharika Gauraha and Swapan Parui Indian Statistical Institute Abstract. We consider variable

More information

Lecture 14: Variable Selection - Beyond LASSO

Lecture 14: Variable Selection - Beyond LASSO Fall, 2017 Extension of LASSO To achieve oracle properties, L q penalty with 0 < q < 1, SCAD penalty (Fan and Li 2001; Zhang et al. 2007). Adaptive LASSO (Zou 2006; Zhang and Lu 2007; Wang et al. 2007)

More information

Regression Shrinkage and Selection via the Lasso

Regression Shrinkage and Selection via the Lasso Regression Shrinkage and Selection via the Lasso ROBERT TIBSHIRANI, 1996 Presenter: Guiyun Feng April 27 () 1 / 20 Motivation Estimation in Linear Models: y = β T x + ɛ. data (x i, y i ), i = 1, 2,...,

More information

Nonlinear Forecasting With Many Predictors Using Kernel Ridge Regression

Nonlinear Forecasting With Many Predictors Using Kernel Ridge Regression Nonlinear Forecasting With Many Predictors Using Kernel Ridge Regression Peter Exterkate a, Patrick J.F. Groenen b Christiaan Heij b Dick van Dijk b a CREATES, Aarhus University, Denmark b Econometric

More information

Testing an Autoregressive Structure in Binary Time Series Models

Testing an Autoregressive Structure in Binary Time Series Models ömmföäflsäafaäsflassflassflas ffffffffffffffffffffffffffffffffffff Discussion Papers Testing an Autoregressive Structure in Binary Time Series Models Henri Nyberg University of Helsinki and HECER Discussion

More information

Stepwise Searching for Feature Variables in High-Dimensional Linear Regression

Stepwise Searching for Feature Variables in High-Dimensional Linear Regression Stepwise Searching for Feature Variables in High-Dimensional Linear Regression Qiwei Yao Department of Statistics, London School of Economics q.yao@lse.ac.uk Joint work with: Hongzhi An, Chinese Academy

More information

Lecture 14: Shrinkage

Lecture 14: Shrinkage Lecture 14: Shrinkage Reading: Section 6.2 STATS 202: Data mining and analysis October 27, 2017 1 / 19 Shrinkage methods The idea is to perform a linear regression, while regularizing or shrinking the

More information

Lecture 6: Univariate Volatility Modelling: ARCH and GARCH Models

Lecture 6: Univariate Volatility Modelling: ARCH and GARCH Models Lecture 6: Univariate Volatility Modelling: ARCH and GARCH Models Prof. Massimo Guidolin 019 Financial Econometrics Winter/Spring 018 Overview ARCH models and their limitations Generalized ARCH models

More information

Monitoring Forecasting Performance

Monitoring Forecasting Performance Monitoring Forecasting Performance Identifying when and why return prediction models work Allan Timmermann and Yinchu Zhu University of California, San Diego June 21, 2015 Outline Testing for time-varying

More information

Vector Autoregressive Model. Vector Autoregressions II. Estimation of Vector Autoregressions II. Estimation of Vector Autoregressions I.

Vector Autoregressive Model. Vector Autoregressions II. Estimation of Vector Autoregressions II. Estimation of Vector Autoregressions I. Vector Autoregressive Model Vector Autoregressions II Empirical Macroeconomics - Lect 2 Dr. Ana Beatriz Galvao Queen Mary University of London January 2012 A VAR(p) model of the m 1 vector of time series

More information

Inflation Revisited: New Evidence from Modified Unit Root Tests

Inflation Revisited: New Evidence from Modified Unit Root Tests 1 Inflation Revisited: New Evidence from Modified Unit Root Tests Walter Enders and Yu Liu * University of Alabama in Tuscaloosa and University of Texas at El Paso Abstract: We propose a simple modification

More information

Bayesian variable selection via. Penalized credible regions. Brian Reich, NCSU. Joint work with. Howard Bondell and Ander Wilson

Bayesian variable selection via. Penalized credible regions. Brian Reich, NCSU. Joint work with. Howard Bondell and Ander Wilson Bayesian variable selection via penalized credible regions Brian Reich, NC State Joint work with Howard Bondell and Ander Wilson Brian Reich, NCSU Penalized credible regions 1 Motivation big p, small n

More information

A Non-Parametric Approach of Heteroskedasticity Robust Estimation of Vector-Autoregressive (VAR) Models

A Non-Parametric Approach of Heteroskedasticity Robust Estimation of Vector-Autoregressive (VAR) Models Journal of Finance and Investment Analysis, vol.1, no.1, 2012, 55-67 ISSN: 2241-0988 (print version), 2241-0996 (online) International Scientific Press, 2012 A Non-Parametric Approach of Heteroskedasticity

More information

Panel Threshold Regression Models with Endogenous Threshold Variables

Panel Threshold Regression Models with Endogenous Threshold Variables Panel Threshold Regression Models with Endogenous Threshold Variables Chien-Ho Wang National Taipei University Eric S. Lin National Tsing Hua University This Version: June 29, 2010 Abstract This paper

More information

Penalization method for sparse time series model

Penalization method for sparse time series model Int. Statistical Inst.: Proc. 58th World Statistical Congress, 2011, Dublin (Session CPS020) p.6280 Penalization method for sparse time series model Heewon, Park Chuo University, Department of Mathemetics

More information

A Blockwise Descent Algorithm for Group-penalized Multiresponse and Multinomial Regression

A Blockwise Descent Algorithm for Group-penalized Multiresponse and Multinomial Regression A Blockwise Descent Algorithm for Group-penalized Multiresponse and Multinomial Regression Noah Simon Jerome Friedman Trevor Hastie November 5, 013 Abstract In this paper we purpose a blockwise descent

More information