Variable Selection in Predictive Regressions

1 Variable Selection in Predictive Regressions
Alessandro Stringhi, Advanced Financial Econometrics III, Winter/Spring 2018

2 Overview
This chapter considers linear models for explaining a scalar variable when a researcher is given T historical observations and N potentially relevant predictors. The problem is to identify the set of relevant predictors. Three types of variable selection methods will be discussed: criterion-based methods, regularization, and dimension reduction procedures. Criterion-based methods are presented for the case N < T; we then turn to the case in which N is large.

3 Overview
Criterion-based methods perform better when N is small and there is a natural hierarchy among the variables. Regularization seems to better suit situations in which all but a few observed predictors have non-zero effect on the regression function. Dimension reduction methods seem more appropriate when the predictors are highly collinear and possibly have a factor structure. Despite this, the variable selection problem is by no means solved. Moreover, whichever of these methods is implemented, there is a trade-off between consistency and accurate prediction.

4 Notation
Matrix: for an arbitrary matrix $A$, let $A_j$ denote the j-th column of $A$ and let $A_{1:r}$ be the sub-matrix formed from the first $r$ columns of $A$.
Norms: for an arbitrary vector $z \in \mathbb{R}^n$, the (squared) $L_2$ norm is $\|z\|_2^2 = \sum_{i=1}^n z_i^2$, the $L_1$ norm is $\|z\|_1 = \sum_{i=1}^n |z_i|$ and the $L_0$ norm is $\|z\|_0 = \sum_{i=1}^n I(z_i \neq 0)$.
Forecasts: $y$ is the variable of interest and $y_{T+h|T}$ is the forecast h periods ahead. Forecast accuracy is measured in terms of mean-squared forecast errors.
Predictors: $X_t$ will be the set of N potentially relevant predictors and $X_A$ will be the set of empirically relevant predictors.
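The three norms can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the function names and the example vector are illustrative and not part of the slides.

```python
import numpy as np

def l2_squared(z):
    """Squared L2 norm: the sum of squared entries."""
    z = np.asarray(z, dtype=float)
    return float(np.sum(z ** 2))

def l1(z):
    """L1 norm: the sum of absolute values."""
    z = np.asarray(z, dtype=float)
    return float(np.sum(np.abs(z)))

def l0(z):
    """L0 'norm': the number of non-zero entries."""
    return int(np.count_nonzero(z))

# Illustrative vector with one zero entry.
z = [3.0, 0.0, -4.0]
print(l2_squared(z), l1(z), l0(z))
```

The L0 count is what hard-thresholding methods implicitly penalize, while the L1 and L2 norms are the penalties behind LASSO and Ridge.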

5 Criterion Based Methods, N < T
Criterion-based methods are employed mostly in the context of autoregressions, where the number of lags $p$ is small compared to the sample size T. Predictors in an autoregression have a natural time ordering, so the model selection problem boils down to determining the lag length $p$, which makes it computationally simple. Two different approaches can be used: information criteria and sequential testing procedures. The first favors parsimonious models, while the second aims to find the last significant lag.

6 Sequential Testing Procedure
Sequential testing procedures can be used to select a model when the number of candidate models is small, as in the case of autoregressions. They come in two forms: General-to-Specific (top-down) and Specific-to-General (bottom-up).
General-to-Specific: the method starts from the largest model, which in the case of an autoregression contains $p_{max}$ lags of the dependent variable. The model is estimated and the significance of the last lag is checked. If it is not significant, the model is re-estimated using $p_{max} - 1$ regressors and the last regressor is checked again. The procedure is repeated until the last lag is significant.
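The General-to-Specific steps can be sketched as follows. This is a minimal illustration, assuming NumPy, a two-sided t-test at the conventional 1.96 critical value, homoskedastic standard errors, and a common estimation sample across lag lengths; the function names and the simulated AR(2) series are hypothetical.

```python
import numpy as np

def lagged_design(y, p, p_max):
    """Intercept plus lags 1..p of y, on the common sample t = p_max..T-1."""
    T = len(y)
    cols = [np.ones(T - p_max)]
    cols += [y[p_max - k:T - k] for k in range(1, p + 1)]
    return np.column_stack(cols), y[p_max:]

def last_lag_tstat(y, p, p_max):
    """OLS t-statistic on the coefficient of lag p."""
    X, target = lagged_design(y, p, p_max)
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ beta
    dof = X.shape[0] - X.shape[1]
    sigma2 = resid @ resid / dof
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return beta[-1] / np.sqrt(cov[-1, -1])

def general_to_specific(y, p_max, crit=1.96):
    """Drop the last lag until it is significant; return the chosen lag length."""
    for p in range(p_max, 0, -1):
        if abs(last_lag_tstat(y, p, p_max)) > crit:
            return p
    return 0

# Simulated AR(2) data (illustrative coefficients, not from the slides).
rng = np.random.default_rng(0)
T = 1000
e = rng.standard_normal(T)
y = np.zeros(T)
for t in range(2, T):
    y[t] = 0.5 * y[t - 1] + 0.3 * y[t - 2] + e[t]

p_hat = general_to_specific(y, p_max=6)
print("selected lag length:", p_hat)
```

Because each test has a fixed size, the procedure can stop at a lag above the true order by chance, which is one reason the choice of $p_{max}$ and of the critical value matters.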

7 Sequential Testing Procedure
Specific-to-General: the estimation starts from the simplest model and adds lags as long as they are significant. It has been shown that the Specific-to-General approach is not valid in a pure AR model and that its finite sample properties are inferior to those of the General-to-Specific approach.
These procedures are called greedy because the locally optimal decisions made at each stage may not be globally optimal. They also perform hard thresholding, meaning that a variable is either in or out of the predictor set. Moreover, hard thresholding is sensitive to small changes in the data, and if the sample size is too small, few variables will be selected.

8 Information Criterion
Information criteria are used to assess the quality of a model relative to other candidates, penalizing models with too many parameters. The criteria considered here are Mallows' (1973) CP, the Akaike information criterion (AIC) and the Bayesian information criterion (BIC). Mallows' criterion determines $X_A$ using the scaled sum of squared errors $SSR_p / \sigma^2$, where $p$ refers to the number of regressors included.

9 Information Criterion - Mallows (1973)
Assuming that the errors are homoskedastic, Mallows shows that a useful estimate of $E(SSR_p / \sigma^2)$ is
$$CP_p = \frac{SSR_p}{\hat{\sigma}^2} - T + 2p,$$
where $\hat{\sigma}^2$ is an estimate of the error variance. The CP criterion defines $X_A$ as the subset of explanatory variables that corresponds to the lowest point in the plot of CP against $p$. Mallows does not recommend following this criterion blindly, because it is not reliable when a large number of subsets are close competitors of the minimizer of CP.
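As an illustration, the sketch below computes CP in its usual form, $CP_p = SSR_p / \hat{\sigma}^2 - T + 2p$, for nested models that use the first $p$ columns of a predictor matrix. It assumes NumPy and estimates $\hat{\sigma}^2$ from the largest model, which is a common choice but an assumption here; the simulated data and function names are hypothetical.

```python
import numpy as np

def ssr(X, y):
    """Sum of squared OLS residuals from regressing y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return float(r @ r)

def mallows_cp(X, y):
    """CP for the nested models using the first p columns of X, p = 1..N."""
    T, N = X.shape
    sigma2_hat = ssr(X, y) / (T - N)  # error variance from the largest model
    return np.array([ssr(X[:, :p], y) / sigma2_hat - T + 2 * p
                     for p in range(1, N + 1)])

# Simulated data: only the first two predictors matter (illustrative).
rng = np.random.default_rng(1)
T, N = 200, 6
X = rng.standard_normal((T, N))
y = 1.0 * X[:, 0] + 2.0 * X[:, 1] + rng.standard_normal(T)

cp = mallows_cp(X, y)
p_star = int(np.argmin(cp)) + 1
print("CP by model size:", np.round(cp, 2))
print("size minimizing CP:", p_star)
```

With this choice of $\hat{\sigma}^2$, CP for the largest model equals N by construction, so the plot of CP against $p$ is informative only through the smaller models.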

10 Information Criterion - Akaike AIC
The Akaike information criterion (AIC) estimates the relative quality of a model, so it can be regarded as a tool for variable selection. AIC is founded on information theory and deals with the trade-off between the goodness of fit of the model and its simplicity (overfitting can cause trouble when forecasting). The preferred model is the one with the minimum AIC,
$$AIC = 2p - 2 \ln L,$$
where $L$ is the maximized value of the likelihood function.

11 Information Criterion - BIC
Like AIC, BIC imposes a penalty on models with too many parameters, but the penalty is larger in BIC. BIC was developed by G. Schwarz (1978), who gave a Bayesian argument for adopting it. BIC is defined as
$$BIC = \ln(T)\, p - 2 \ln L,$$
where T is the sample size. Again, the model with the lowest BIC is selected.
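The two criteria differ only in the penalty term, which a small computation makes concrete. The sketch below assumes NumPy and a Gaussian linear regression, for which the maximized log-likelihood depends on the data only through the sum of squared residuals; the SSR values and model sizes are made-up numbers for illustration.

```python
import numpy as np

def gaussian_loglik(ssr, T):
    """Maximized Gaussian log-likelihood of a linear regression with T observations."""
    sigma2 = ssr / T  # maximum likelihood estimate of the error variance
    return -0.5 * T * (np.log(2 * np.pi) + np.log(sigma2) + 1.0)

def aic(loglik, p):
    return 2 * p - 2 * loglik

def bic(loglik, p, T):
    return np.log(T) * p - 2 * loglik

# Hypothetical comparison: a small model (p = 2) against a larger one (p = 4)
# that fits slightly better in sample.
T = 100
ll_small = gaussian_loglik(ssr=120.0, T=T)
ll_big = gaussian_loglik(ssr=115.0, T=T)

aic_small, aic_big = aic(ll_small, 2), aic(ll_big, 4)
bic_small, bic_big = bic(ll_small, 2, T), bic(ll_big, 4, T)
print("AIC prefers:", "big" if aic_big < aic_small else "small")
print("BIC prefers:", "big" if bic_big < bic_small else "small")
```

The BIC penalty $\ln(T)\,p$ exceeds the AIC penalty $2p$ whenever $\ln(T) > 2$, that is for T of 8 or more, which is why BIC tends to select the more parsimonious model.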

12 Criterion Based Methods - Drawbacks
When there is a large number of predictors with no natural ordering, $2^N$ estimations are needed, which is impractical even with powerful computers. Criterion-based methods work well in the estimation sample but do not necessarily perform well in the prediction sample. This can lead to overfitting, with the consequence of a poor out-of-sample fit. Information criteria and sequential testing perform $L_0$ regularization (hard thresholding).

13 Regularization Methods
The parameter estimates obtained with criterion-based methods are sensitive to small changes in the data when the eigenvalues of $X'X$ are close to zero, which leads to the "bouncing beta" problem. A way to alleviate this problem is to down-weight the less important predictors, a method known as shrinkage. For variable selection, a general shrinkage framework is Bridge regression.
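For reference, Bridge regression is commonly written as a penalized least squares problem. In the sketch below, $\eta$ is the penalty exponent used in these slides, while the symbol $\gamma$ for the penalty weight is an assumption of this note rather than notation taken from the slides.

```latex
\hat{\beta}_{B} = \arg\min_{\beta} \; \|y - X\beta\|_2^2
  + \gamma \sum_{j=1}^{N} |\beta_j|^{\eta},
  \qquad \eta > 0, \quad \gamma \ge 0
```

Setting $\eta = 2$ gives an $L_2$ penalty and $\eta = 1$ gives an $L_1$ penalty.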

14 Regularization Methods
The Ridge estimator (Tikhonov regularization) is a special case of the Bridge estimator with $\eta = 2$; it thus imposes an $L_2$ penalty. While the Ridge estimator alleviates the problem of highly collinear regressors, most coefficient estimates remain non-zero. This is because the $L_2$ penalty shrinks coefficients toward zero but never sets them exactly to zero.
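The shrinkage behaviour of Ridge can be seen in its closed form, $(X'X + \gamma I)^{-1} X'y$. The sketch below assumes NumPy; the penalty weight `gamma` and the simulated pair of nearly collinear predictors are illustrative choices, not values from the slides.

```python
import numpy as np

def ridge(X, y, gamma):
    """Closed-form Ridge estimate (X'X + gamma * I)^{-1} X'y; gamma = 0 gives OLS."""
    N = X.shape[1]
    return np.linalg.solve(X.T @ X + gamma * np.eye(N), X.T @ y)

# Two nearly collinear predictors (illustrative simulation).
rng = np.random.default_rng(2)
T = 100
x1 = rng.standard_normal(T)
x2 = x1 + 0.01 * rng.standard_normal(T)
X = np.column_stack([x1, x2])
y = x1 + x2 + rng.standard_normal(T)

b_ols = ridge(X, y, 0.0)
b_ridge = ridge(X, y, 1.0)
print("OLS:  ", np.round(b_ols, 3))
print("Ridge:", np.round(b_ridge, 3))
```

The Ridge coefficients have a smaller norm than the OLS ones and are typically much closer to each other, yet both remain non-zero, so no variable is excluded.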

15 Regularization Methods - LASSO
A method that has received a great deal of attention is the Least Absolute Shrinkage and Selection Operator (LASSO). LASSO can be viewed as a Bridge estimator with $\eta = 1$ and is given by
$$\hat{\beta}_{LASSO} = \arg\min_{\beta} \|y - X\beta\|_2^2 + \gamma \|\beta\|_1,$$
where $\gamma \geq 0$ controls the amount of shrinkage. The main difference between the LASSO and Ridge estimators is the use of the $L_1$ penalty instead of the $L_2$ penalty. The $L_1$ penalty can set coefficients exactly to zero, excluding the corresponding variables. LASSO thus performs shrinkage and variable selection simultaneously.
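How the $L_1$ penalty produces exact zeros can be seen in a coordinate descent solver, where each coefficient update is a soft-thresholding step. This is a minimal sketch, assuming NumPy and the objective $\|y - X\beta\|_2^2 + \gamma \|\beta\|_1$; the solver, its fixed iteration count, and the simulated data are illustrative and not the method used in the slides.

```python
import numpy as np

def soft_threshold(a, k):
    """Shrink a toward zero by k, returning exactly zero when |a| <= k."""
    return np.sign(a) * max(abs(a) - k, 0.0)

def lasso_cd(X, y, gamma, n_iter=200):
    """Minimize ||y - X b||_2^2 + gamma * ||b||_1 by cyclic coordinate descent."""
    T, N = X.shape
    b = np.zeros(N)
    col_ss = np.sum(X ** 2, axis=0)  # x_j' x_j for each predictor
    for _ in range(n_iter):
        for j in range(N):
            # Residual with the contribution of predictor j added back.
            r_j = y - X @ b + X[:, j] * b[j]
            b[j] = soft_threshold(X[:, j] @ r_j, gamma / 2.0) / col_ss[j]
    return b

# Only the first predictor is relevant (illustrative simulation).
rng = np.random.default_rng(3)
T, N = 100, 5
X = rng.standard_normal((T, N))
y = 3.0 * X[:, 0] + rng.standard_normal(T)

b_lasso = lasso_cd(X, y, gamma=40.0)
print("LASSO coefficients:", np.round(b_lasso, 3))
```

A coefficient is set exactly to zero whenever the correlation of its predictor with the partial residual does not exceed $\gamma / 2$ in absolute value, which is the hard "in or out" decision combined with shrinkage of the retained coefficients.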

16 Regularization Methods - LASSO
Another difference is that the Ridge coefficients of correlated predictors are shrunk toward each other, while LASSO tends to pick one and ignore the rest of the correlated predictors.

17 Dimension Reduction Methods
While regularization picks out the empirically relevant variables, a different approach is to use all the available data in a clever way. We will focus on methods that consider all predictors simultaneously.
Principal components: this technique combines the potentially relevant predictors $X_t$ into new predictors that are linear combinations of the old ones. The principal components are ranked: the j-th principal component $X_{PC,j}$ is the linear combination that captures the j-th largest variation in $X$.

18 Dimension Reduction Methods - Principal Components
Principal component regression replaces the $T \times N$ predictor matrix $X$ with a $T \times r_X$ sub-matrix of principal components. $X_{PC,1:r_X}$ denotes the first $r_X$ columns of $X_{PC}$, which correspond to the $r_X$ largest eigenvalues of $X'X$. The estimator using the first $r_X$ principal components as regressors is
$$\hat{\beta}_{PC} = (X_{PC,1:r_X}' X_{PC,1:r_X})^{-1} X_{PC,1:r_X}' y.$$

19 Dimension Reduction Methods - Principal Components
Compared to the least squares estimator, it involves only $r_X < N$ components. In other words, $\hat{\beta}_{PC}$ puts unit weight on the first $r_X$ components and ignores the remaining ones. Principal components analysis is often seen as a numerical tool that reduces the dimension of the data but has a weak statistical foundation, because no probability model is specified. It is an unsupervised dimension reduction technique.
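Principal component regression can be sketched in a few lines. The example below assumes NumPy, demeaned predictor columns, and components computed from the singular value decomposition of $X$; the function name `pcr` and the simulated data are hypothetical.

```python
import numpy as np

def pcr(X, y, r):
    """Regress y on the first r principal components of X (columns assumed demeaned)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    X_pc = X @ Vt.T  # all principal components, ordered by explained variation
    Z = X_pc[:, :r]  # keep the first r, ignore the rest
    beta_pc, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return Z, beta_pc

# Correlated predictors built from a random mixing matrix (illustrative).
rng = np.random.default_rng(4)
T, N = 120, 8
X = rng.standard_normal((T, N)) @ rng.standard_normal((N, N))
X = X - X.mean(axis=0)
y = X[:, 0] + rng.standard_normal(T)

Z, beta_pc = pcr(X, y, r=3)
print("components used:", Z.shape[1], "coefficients:", np.round(beta_pc, 3))
```

The components are chosen using $X$ alone, so the target $y$ plays no role in deciding which directions are kept; this is the sense in which the method is unsupervised.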

20 Dimension Reduction Methods - Factors
In contrast with principal components, factor analysis assumes that the data have a specific structure. Suppose that $y_t$ can be well approximated by an infeasible regression on $F_t$ and $W_t$, where $F_t$ is an $r_y \times 1$ vector of unobserved common factors with coefficient $\beta_F$, a polynomial in the lag operator of order $p_F$, and $W_t$ is the set of must-have regressors. A factor augmented regression is obtained when the estimate $\hat{F}_t$ is used in place of $F_t$.

21 Dimension Reduction Methods - Factors
An h-period-ahead forecast $y_{T+h|T}$ is then obtained from the estimated regression. The key to factor augmented regressions is that the latent factors can be precisely estimated from a large number of observed predictors $x_{it}$ that can be represented by the factor model
$$x_{it} = \lambda_i' F_t + e_{it},$$
where $F_t$ is an $r_x \times 1$ vector of latent factors, $\lambda_i$ are the factor loadings and $e_{it}$ is an idiosyncratic error. Notice that the factors relevant for forecasting need not be the same as the set of factors in $X$.
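A two-step factor augmented forecast can be sketched as follows. This is a simplified illustration, assuming NumPy, factors estimated by principal components of the demeaned predictors, no lags of the factors, and an intercept as the only must-have regressor; the function names and the simulated one-factor data are hypothetical.

```python
import numpy as np

def estimate_factors(X, r):
    """Estimate r factors as the first r principal components of demeaned X."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:r].T  # T x r matrix of estimated factors

def factor_forecast(y, X, r, h):
    """Regress y_{t+h} on an intercept and the estimated factors at t, then forecast."""
    F = estimate_factors(X, r)
    Z = np.column_stack([np.ones(len(y) - h), F[:-h]])
    coef, *_ = np.linalg.lstsq(Z, y[h:], rcond=None)
    return float(np.concatenate([[1.0], F[-1]]) @ coef)

# One persistent latent factor drives both the predictors and the target (illustrative).
rng = np.random.default_rng(5)
T, N, h = 300, 200, 1
f = np.zeros(T)
for t in range(1, T):
    f[t] = 0.8 * f[t - 1] + rng.standard_normal()
lam = rng.standard_normal(N)
X = np.outer(f, lam) + rng.standard_normal((T, N))
y = np.zeros(T)
y[1:] = 2.0 * f[:-1] + 0.1 * rng.standard_normal(T - 1)

forecast = factor_forecast(y, X, r=1, h=h)
print("forecast:", round(forecast, 3), "signal 2 * f_T:", round(2.0 * f[-1], 3))
```

The estimated factor is identified only up to sign and scale, but the forecast is unaffected because the regression coefficient adjusts accordingly.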

22 Dimension Reduction Methods - Factors
When $y_t$ also belongs to $X_t$, $r_y$ can be set to $r_x$, making $\hat{F}_t$ the $r_x$ static principal components of $X$. The regression can then be rewritten in terms of these principal components. The relation between factors and principal components is easy to see when $p_F = 0$ and $W_t$ is empty.
A criticism of factor augmented regressions is that the factors are estimated without taking into account that the objective is to forecast $y$. Factors that have good explanatory power for $X$ may not be good predictors of $y$, even if $y_t \in X_t$.

23 Dimension Reduction Methods - Factors
More precisely, a factor augmented regression first estimates $F$ by maximizing the fit of the factor model for $X$. Estimates of $\alpha$ and $\beta$ are then obtained by maximizing the fit of the regression for $y$, taking the estimated factors as given. The factors are therefore constructed irrespective of $y$. Two models that address this problem are Reduced Rank regression and Partial Least Squares regression.
