Prediction & Feature Selection in GLM

Tarigan Statistical Consulting & Coaching, statistical-coaching.ch
Doctoral Program in Computer Science of the Universities of Fribourg, Geneva, Lausanne, Neuchâtel, Bern and the EPFL
Hands-on Data Analysis with R
University of Neuchatel, 10 May 2016
Bernadetta Tarigan, Dr. sc. ETHZ

Bug data

Predicting number of bugs

Goal: to predict
1. How to predict the number of bugs?
2. Are change-metrics better than source-metrics?
3. Should we combine them?
4. Are the predictors actually independent?
5. What is the best prediction model to use?
6. What is the minimal set of metrics (from the combined set) needed to make the best prediction?
7. Do the answers to the above questions depend on the project?
8. Finally, can we actually predict the number of bugs at all?

Is it good?

Columns: LS, BestSubset, Ridge, ElasticNet, Lasso. Rows with fewer than five values have empty cells whose column positions are not recoverable; their values are listed in the original left-to-right order.

(Intercept)                   0.102  0.128  0.255  -0.085  -0.086
numberofversionsuntil.        0.067  0.066  0.000  0.027  0.031
numberoffixesuntil.           -0.063  0.000
numberofrefactoringsuntil.    0.038  0.000  0.108  0.056
numberofauthorsuntil.         -0.137  -0.151  0.000
linesaddeduntil.              0.000  0.000
maxlinesaddeduntil.           0.000  0.000
avglinesaddeduntil.           415.129  0.000
linesremoveduntil.            -0.002  -0.002  0.000
maxlinesremoveduntil.         -0.001  0.000
avglinesremoveduntil.         -415.111  0.012  0.000
maxcodechurnuntil.            -0.001  0.000
avgcodechurnuntil.            -415.121  0.000
agewithrespectto.             -0.001  0.000  0.000  0.000
weightedagewithrespectto.     0.002  0.000
TestError                     0.124  0.116  0.090  0.093  0.095
Improvement                   7%  38%  33%  30%

Or is this better?

Columns: Poisson, BestSubset, Ridge, ElasticNet, Lasso. Rows with fewer than five values have empty cells whose column positions are not recoverable; their values are listed in the original left-to-right order.

(Intercept)                   -2.220  -2.464  -2.043  -1.753  -1.719
numberofversionsuntil.        0.024  0.022  0.008  0.010  0.010
numberoffixesuntil.           -0.387  -0.313  0.025
numberofrefactoringsuntil.    -0.128  0.076
numberofauthorsuntil.         0.264  0.321  0.024
linesaddeduntil.              0.001  0.000  0.000  0.000
maxlinesaddeduntil.           0.001  0.001  0.000
avglinesaddeduntil.           -416.599  0.021  0.004
linesremoveduntil.            -0.001  0.000
maxlinesremoveduntil.         0.000  0.000
avglinesremoveduntil.         416.605  -0.001
maxcodechurnuntil.            -0.002  0.001  0.000
avgcodechurnuntil.            416.619  0.007
agewithrespectto.             -0.007  -0.006  0.000
weightedagewithrespectto.     0.004  0.000
TestError                     0.129  0.116  0.078  0.067  0.067
Improvement                   11%  65%  94%  94%

Review: Multiple Least Squares Regression

$Y = f(X) + \varepsilon$; $\varepsilon$ random with $E(\varepsilon) = 0$, $\mathrm{Var}(\varepsilon) = \sigma_\varepsilon^2$

Linear model: $f(X) = \beta^T X = \beta_0 + \sum_{j=1}^{p} \beta_j X_j$

Problem: to estimate the unknown $\beta$ from the data

Least squares estimates: $\hat\beta^{ls} = \arg\min_{\beta} \sum_{i=1}^{n} (y_i - \beta^T x_i)^2$

Matrix notation: $\hat\beta^{ls} = (X^T X)^{-1} X^T y$
- $X$ is the $n \times p$ matrix with each row an input vector
- $y$ is the $n$-vector of the outputs
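As a minimal sketch of the matrix formula, the following plain-Python snippet solves the normal equations $(X^T X)\beta = X^T y$ on a toy data set. The helper names `solve` and `least_squares` and the data are illustrative assumptions, not part of the original slides.

```python
def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix [A | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def least_squares(X, y):
    """Return beta solving the normal equations (X^T X) beta = X^T y."""
    p = len(X[0])
    XtX = [[sum(row[a] * row[c] for row in X) for c in range(p)] for a in range(p)]
    Xty = [sum(row[a] * yi for row, yi in zip(X, y)) for a in range(p)]
    return solve(XtX, Xty)

# Toy data generated exactly by y = 1 + 2*x; the first column of X is the intercept.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
beta_ls = least_squares(X, y)  # intercept close to 1, slope close to 2
```

Solving the normal equations directly is fine for a small illustration; numerical libraries usually prefer a QR decomposition for stability.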

Prediction is different from explanation

Assume $Y = f(X) + \varepsilon$, $E(\varepsilon) = 0$, $\mathrm{Var}(\varepsilon) = \sigma_\varepsilon^2$

Suppose we have any estimator $\hat f(X)$. Will $\hat f(X)$ fit future observations well?

Prediction is different from explanation:
- The quality of a model is no longer measured by the $R^2$ goodness-of-fit
- It is replaced by its generalization performance on future observations
- This generalization performance, at a new input point $X = x_0$, is measured by the expected prediction error (EPE):
  $\mathrm{EPE}(x_0) = E\big[(Y - \hat f(x_0))^2 \mid X = x_0\big]$
- EPE is also called out-of-sample error or test error

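Partition of the data into training and test sets

$Y = f(X) + \varepsilon$; $\varepsilon$ random with $E(\varepsilon) = 0$, $\mathrm{Var}(\varepsilon) = \sigma_\varepsilon^2$

Data = Training + Test

- Obtain a model $\hat f^{ls}$ from the training set: $\hat\beta^{ls} = \arg\min_{\beta \in \mathbb{R}^p} \|y_{training} - X_{training}\beta\|_2^2$
- Estimate its EPE on the test set: $\mathrm{EPE}(\hat f^{ls}) = \frac{1}{n_{test}} \|y_{test} - X_{test}\hat\beta^{ls}\|_2^2$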
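The partition can be sketched in a few lines of plain Python. The simulated model, the 150/50 split, and the helper names are assumptions chosen only for illustration: the model is fitted on the training part alone, and the test part is used once to estimate the EPE.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Simulated data from an assumed toy model y = 1 + 2*x + noise.
n = 200
x = [random.uniform(0.0, 1.0) for _ in range(n)]
y = [1.0 + 2.0 * xi + random.gauss(0.0, 0.5) for xi in x]

# Partition the data: 150 training observations, 50 test observations.
x_train, y_train = x[:150], y[:150]
x_test, y_test = x[150:], y[150:]

def fit_simple_ls(xs, ys):
    """Least squares intercept and slope for a single predictor."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys))
    sxx = sum((xi - x_bar) ** 2 for xi in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def mean_squared_error(xs, ys, intercept, slope):
    """Average squared prediction error of the fitted line on (xs, ys)."""
    return sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(xs, ys)) / len(xs)

b0, b1 = fit_simple_ls(x_train, y_train)                  # fit on training data only
train_error = mean_squared_error(x_train, y_train, b0, b1)
test_error = mean_squared_error(x_test, y_test, b0, b1)   # estimate of the EPE
```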

Improving least squares fit with feature selection

Why we are not satisfied with the least squares estimates:
- Prediction accuracy
  o LS estimates often have low bias but high variance
  o Prediction accuracy can sometimes be improved by shrinking some coefficients or setting them to zero
  o By doing so we sacrifice a little bit of bias to reduce the variance of the predicted value, which may improve the overall prediction accuracy
- Interpretative ability
  o With a large number of predictors, we often would like to determine a smaller subset that exhibits the strongest effects
  o In order to get the big picture, we are willing to sacrifice some of the small details
- Also: the least squares estimates are not uniquely defined when p > n

To improve: automatically perform feature selection
- Subset selection methods
- Coefficient shrinkage methods (modern techniques)

However

- Performing feature selection means reducing the complexity of the model class
- The bias-variance decomposition of EPE depends on the model complexity: model selection and the bias-variance tradeoff
- We first look at the bias-variance decomposition of EPE
- Then at its relationship to model complexity

Bias-Variance Decomposition of EPE

$\mathrm{EPE}(x_0) = E\big[(Y - \hat f(x_0))^2 \mid X = x_0\big]$
$= \sigma_\varepsilon^2 + \big[f(x_0) - E\hat f(x_0)\big]^2 + \mathrm{Var}(\hat f(x_0))$
$= \text{Irreducible Error} + \text{Bias}^2 + \text{Variance}$

- Irreducible error: the variance of the target around its true mean $f(x_0)$. It cannot be avoided no matter how well we predict $f(x_0)$, unless $\sigma_\varepsilon^2 = 0$
- Bias: the difference between the average prediction of our model and the true unknown value we are trying to predict
- Variance: the variability of our model prediction for a given data point

Note that $\mathrm{EPE}(x_0) = \sigma_\varepsilon^2 + \mathrm{MSE}(\hat f(x_0))$, where MSE = mean squared error
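The decomposition can be checked by simulation. The sketch below assumes a quadratic true function, a fixed design, and a straight-line model (so the model is deliberately biased); all of these choices are illustrative. It repeats the model building process on many simulated training sets and compares the mean squared error of the prediction at $x_0$ with the sum of squared bias and variance.

```python
import random

random.seed(1)

def f(x):
    """Assumed true regression function (quadratic)."""
    return x * x

sigma = 0.3                            # noise standard deviation
x0 = 0.9                               # point at which we predict
design = [i / 19 for i in range(20)]   # fixed training inputs in [0, 1]

def fit_line(xs, ys):
    """Least squares intercept and slope for a single predictor."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys))
    sxx = sum((xi - x_bar) ** 2 for xi in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Repeat the whole model building process on many simulated training sets.
preds = []
for _ in range(2000):
    ys = [f(x) + random.gauss(0.0, sigma) for x in design]
    b0, b1 = fit_line(design, ys)
    preds.append(b0 + b1 * x0)

mean_pred = sum(preds) / len(preds)
bias_sq = (f(x0) - mean_pred) ** 2                                # squared bias
variance = sum((p - mean_pred) ** 2 for p in preds) / len(preds)  # variance
mse = sum((p - f(x0)) ** 2 for p in preds) / len(preds)           # MSE of the prediction
epe = sigma ** 2 + bias_sq + variance                             # irreducible + bias^2 + variance
```

In this simulation `mse` equals `bias_sq + variance` up to floating-point rounding, which is the identity behind the note on MSE.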

Graphical illustration of Bias and Variance

[Figure: the unknown target $f(x_0)$ and the model predictions $\hat f(x_0)$. Source: http://scott.fortmann-roe.com/docs/biasvariance.html]

- Error due to bias: the difference between the average prediction of our model and the true unknown value we are trying to predict
- Error due to variance: the variability of a model prediction for a given data point. Imagine you can repeat the entire model building process multiple times (you have multiple samples). The variance is how much the predictions for a given point vary between different realizations of the model

Bias, Variance and Model Complexity

- As the model $\hat f$ becomes more complex (more terms included), the bias will most likely decrease (local curvature can be picked up)
- However, the variance will increase as more terms are included
- We would like to choose our model complexity to trade bias off against variance in such a way as to minimize the test error

[Figure source: http://scott.fortmann-roe.com/docs/biasvariance.html]

By the way

Let $\hat f_p(x) = (\hat\beta^{ls})^T x$ be the least squares fit (the parameter vector $\beta$ has $p$ components)

$\mathrm{EPE}(x_0) = \sigma_\varepsilon^2 + \big[f(x_0) - E\hat f_p(x_0)\big]^2 + \mathrm{Var}(\hat f_p(x_0))$

$\frac{1}{n}\sum_{i=1}^{n} \mathrm{EPE}(x_i) = \sigma_\varepsilon^2 + \frac{1}{n}\sum_{i=1}^{n} \big[f(x_i) - E\hat f_p(x_i)\big]^2 + \frac{p}{n}\sigma_\varepsilon^2$

- The model complexity of the least squares fit is directly related to the number of parameters $p$
- Thus, the smaller $p$, the smaller the variance, but the bias might increase

Back to feature selection

Recall: why we are not satisfied with the least squares estimates:
- Prediction accuracy
  o often low bias but high variance
  o variance gets smaller when coefficients shrink toward zero
  o bias increases a bit but overall accuracy might improve
- Interpretative ability
  o with a large number of predictors, we often would like to determine a smaller subset that exhibits the strongest effects
  o in order to get the big picture, we are willing to sacrifice some of the small details
- Also: the least squares estimates are not uniquely defined when p > n

To improve: automatically perform feature selection

Two classes of methods
1. Coefficient shrinkage methods (modern techniques)
2. Subset selection methods

Shrinkage / penalized / regularization

$\hat\beta = \arg\min_{\beta \in \mathbb{R}^p} \|y - X\beta\|_2^2$ subject to $R(\beta) < t$

or equivalently

$\hat\beta = \arg\min_{\beta \in \mathbb{R}^p} \|y - X\beta\|_2^2 + \lambda R(\beta)$

- $R(\beta)$ is called the regularizer (or penalty) on the complexity of the model
- $\lambda$ is called the tuning parameter; it controls the amount of regularization
- The larger $\lambda$, the greater the amount of regularization/penalty
- $\lambda$ shrinks the coefficient estimates toward 0
- Note that regularization can be applied beyond regression, e.g., classification, clustering, principal component analysis, etc.

Ridge

$\hat\beta = \arg\min_{\beta \in \mathbb{R}^p} \|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2$

- shrinks the coefficients toward zero but not exactly to zero
- so it doesn't do variable selection
- but it can still outperform the least squares estimates when the goal is prediction
- and it encourages a grouping effect
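For intuition, consider the simplest case of one predictor and no intercept (an assumption made only to keep the sketch short). Setting the derivative of $\sum_i (y_i - x_i\beta)^2 + \lambda\beta^2$ to zero gives $\hat\beta = \sum_i x_i y_i / (\sum_i x_i^2 + \lambda)$, which the hypothetical helper `ridge_1d` computes on toy data.

```python
def ridge_1d(x, y, lam):
    """Minimize sum (y_i - x_i*b)^2 + lam * b^2 for one predictor, no intercept."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + lam)

# Toy data (sum of x*y is 20, sum of x^2 is 10).
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [-4.5, -1.5, 0.5, 2.0, 3.75]

lambdas = [0.0, 1.0, 10.0, 100.0]
path = [ridge_1d(x, y, lam) for lam in lambdas]
# lam = 0 gives the least squares estimate 2.0; larger lam shrinks the
# estimate toward zero, but it never becomes exactly zero.
```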

Lasso

$\hat\beta = \arg\min_{\beta \in \mathbb{R}^p} \|y - X\beta\|_2^2 + \lambda \|\beta\|_1$

- shrinks some of the coefficients exactly to zero
- so it does variable selection (sparse model); when p > n the lasso selects at most n variables
- but it typically fails to do group selection: it tends to select one variable from a group and ignore the others
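The exact zeros come from soft-thresholding. In the one-predictor, no-intercept case (an assumption made to keep the sketch short), minimizing $\sum_i (y_i - x_i\beta)^2 + \lambda|\beta|$ gives $\hat\beta = S(\sum_i x_i y_i, \lambda/2) / \sum_i x_i^2$, where $S(z, t)$ moves $z$ toward zero by $t$ and returns 0 when $|z| \le t$. The helper names below are hypothetical.

```python
def soft_threshold(z, t):
    """Move z toward zero by t; return exactly 0.0 when |z| <= t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def lasso_1d(x, y, lam):
    """Minimize sum (y_i - x_i*b)^2 + lam * |b| for one predictor, no intercept."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return soft_threshold(sxy, lam / 2.0) / sxx

# Toy data (sum of x*y is 20, sum of x^2 is 10).
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [-4.5, -1.5, 0.5, 2.0, 3.75]

lambdas = [0.0, 10.0, 60.0]
path = [lasso_1d(x, y, lam) for lam in lambdas]  # 2.0, 1.5, 0.0
```

Unlike the ridge estimate, the coefficient is set exactly to zero once $\lambda/2$ reaches $|\sum_i x_i y_i|$.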

Elastic net

$\hat\beta = \arg\min_{\beta \in \mathbb{R}^p} \|y - X\beta\|_2^2 + \lambda_2 \|\beta\|_2^2 + \lambda_1 \|\beta\|_1$

- simply combines the advantages of the ridge and lasso methods

Why do shrinkage methods work well?

[Figure from The Elements of Statistical Learning, Hastie, Tibshirani, Friedman, 2nd ed., 2009]

Some more pictures

[Figure from The Elements of Statistical Learning, Hastie, Tibshirani, Friedman, 2nd ed., 2009]

All great, but how to choose $\lambda_{opt}$?

- Recall: we would like to choose our model complexity to trade bias off against variance in such a way as to minimize the test error
- But we have access only to the training error
- Unfortunately, the training error is not a good estimate of the test error
- Solution: cross validation

[Figure from The Elements of Statistical Learning, Hastie, Tibshirani, Friedman, 2nd ed., 2009]

Cross Validation

[Diagram: Data. Outer loop: split into Training and Test. Inner loop: part of the Training set is held out as Validation. Outputs: $\hat\beta$, $\lambda_{opt}$]
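A minimal sketch of this scheme, assuming a one-predictor ridge model without intercept, 5 folds, and a small hypothetical grid of $\lambda$ values: the test set is set aside first, $\lambda_{opt}$ is chosen by K-fold cross-validation inside the training set, and the test set is used only once at the end.

```python
import random

random.seed(2)

# Assumed toy model: y = 0.5*x + noise, one predictor, no intercept.
n = 80
x = [random.gauss(0.0, 1.0) for _ in range(n)]
y = [0.5 * xi + random.gauss(0.0, 1.0) for xi in x]

# Outer split: the test set is never used to choose lambda.
x_train, y_train = x[:60], y[:60]
x_test, y_test = x[60:], y[60:]

def ridge_1d(xs, ys, lam):
    """Ridge estimate for one predictor without intercept."""
    sxy = sum(xi * yi for xi, yi in zip(xs, ys))
    sxx = sum(xi * xi for xi in xs)
    return sxy / (sxx + lam)

def cv_error(xs, ys, lam, k=5):
    """K-fold cross-validation estimate of the prediction error for one lambda."""
    n_obs = len(xs)
    total = 0.0
    for fold in range(k):
        val_idx = set(range(fold, n_obs, k))                  # validation part
        fit_idx = [i for i in range(n_obs) if i not in val_idx]
        beta = ridge_1d([xs[i] for i in fit_idx], [ys[i] for i in fit_idx], lam)
        total += sum((ys[i] - beta * xs[i]) ** 2 for i in val_idx)
    return total / n_obs

# Inner loop: pick the lambda with the smallest cross-validation error.
lambdas = [0.0, 0.1, 1.0, 10.0, 100.0]
cv_errors = [cv_error(x_train, y_train, lam) for lam in lambdas]
lambda_opt = lambdas[cv_errors.index(min(cv_errors))]

# Refit on the full training set with lambda_opt, then assess once on the test set.
beta_opt = ridge_1d(x_train, y_train, lambda_opt)
test_error = sum((yi - beta_opt * xi) ** 2 for xi, yi in zip(x_test, y_test)) / len(x_test)
```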

Shrinkage methods in R: glmnet package

- glmnet is a package that fits a generalized linear model via penalized maximum likelihood
- The algorithm is extremely fast, and can exploit sparsity in the input matrix
- It fits linear, logistic, multinomial, Poisson, and Cox regression models
- A variety of predictions can be made from the fitted models
- It can also fit multi-response linear regression
- glmnet solves (up to a scaling of the squared-error term)
  $\hat\beta = \arg\min_{\beta \in \mathbb{R}^p} \|y - X\beta\|_2^2 + \lambda \big[(1 - \alpha)\|\beta\|_2^2 / 2 + \alpha \|\beta\|_1\big]$
- The elastic net penalty is controlled by $\alpha$
- $\alpha$ bridges the gap between lasso ($\alpha = 1$, the default) and ridge ($\alpha = 0$)
- cv.glmnet is the main function to do cross validation
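To see how $\alpha$ interpolates between the two penalties, the following sketch evaluates the penalty term from the formula above. The function `elastic_net_penalty` is a hypothetical helper written for this illustration and is not part of the glmnet package.

```python
def elastic_net_penalty(beta, lam, alpha):
    """Return lam * ((1 - alpha) * ||beta||_2^2 / 2 + alpha * ||beta||_1)."""
    l1 = sum(abs(b) for b in beta)
    l2_sq = sum(b * b for b in beta)
    return lam * ((1.0 - alpha) * l2_sq / 2.0 + alpha * l1)

beta = [1.0, -2.0, 0.0]  # ||beta||_1 = 3, ||beta||_2^2 = 5
lasso_pen = elastic_net_penalty(beta, lam=0.5, alpha=1.0)  # pure lasso: 0.5 * 3 = 1.5
ridge_pen = elastic_net_penalty(beta, lam=0.5, alpha=0.0)  # pure ridge: 0.5 * 5 / 2 = 1.25
mixed_pen = elastic_net_penalty(beta, lam=0.5, alpha=0.5)  # halfway: 0.5 * (1.25 + 1.5) = 1.375
```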

Subset selection

- In this approach we retain only a subset of the variables, and eliminate the rest from the model
- Least squares regression is used to estimate the coefficients of the inputs that are retained
- There are a number of different strategies for choosing the subset
  o Best subset regression
  o Forward stepwise selection
  o Backward stepwise selection
  o Hybrid stepwise selection

Best Subset

- Best subset regression finds, for each $k \in \{1, 2, \ldots, p\}$, the subset of size $k$ that gives the smallest residual sum of squares (RSS)
- An efficient algorithm, the leaps and bounds procedure (Furnival and Wilson, 1974), makes this feasible for $p$ as large as 30 or 40
- This procedure is available in R with the package bestglm
- The question of how to choose $k$ involves the tradeoff between bias and variance, and there are a number of criteria that one may use
- Typically we choose the model that minimizes an estimate of the expected prediction error (EPE)
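The search itself can be sketched by brute force. The code below enumerates every subset of each size, which is only practical for small $p$; it is not the leaps and bounds procedure. The toy data and the helper names `solve` and `rss` are assumptions for illustration.

```python
from itertools import combinations, product

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def rss(X, y, subset):
    """RSS of the least squares fit on an intercept plus the columns in `subset`."""
    Z = [[1.0] + [row[j] for j in subset] for row in X]
    q = len(Z[0])
    ZtZ = [[sum(r[a] * r[c] for r in Z) for c in range(q)] for a in range(q)]
    Zty = [sum(r[a] * yi for r, yi in zip(Z, y)) for a in range(q)]
    beta = solve(ZtZ, Zty)
    fitted = [sum(bj * zj for bj, zj in zip(beta, r)) for r in Z]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))

# Toy data: 8 observations, 3 predictors; y depends on predictors 0 and 2 only.
X = [list(map(float, levels)) for levels in product([-1, 1], repeat=3)]
noise = [0.1, -0.1, 0.05, -0.05, -0.1, 0.1, -0.05, 0.05]
y = [3.0 * row[0] + 1.0 * row[2] + e for row, e in zip(X, noise)]

# For each size k, keep the subset with the smallest RSS.
p = len(X[0])
best = {k: min(combinations(range(p), k), key=lambda s: rss(X, y, s))
        for k in range(1, p + 1)}
```

Choosing among the winners for different $k$ still requires an estimate of the prediction error, since RSS always decreases as $k$ grows.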

Stepwise selection

Rather than search through all possible subsets (which becomes infeasible for $p$ much larger than 40), we can seek a good path through them

Forward stepwise selection
- Starts with the intercept (null model) and sequentially adds to the model, one at a time, the predictor that most improves the fit
- Suppose the current model has $k$ inputs with estimates $\hat\beta$, and we add a predictor, resulting in estimates $\tilde\beta$
- The improvement of fit is often based on the statistic
  $F = \dfrac{\mathrm{RSS}(\hat\beta) - \mathrm{RSS}(\tilde\beta)}{\mathrm{RSS}(\tilde\beta) / (n - k - 2)}$
- Strategy: sequentially add the predictor producing the largest value of $F$, stopping when no predictor produces an F-ratio greater than the 90th or 95th percentile of the $F_{1,\, n-k-2}$ distribution
- Can be used even when p > n
- The only viable subset method when p is very large
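One forward step can be sketched as follows. The RSS values, sample size, and predictor names are hypothetical numbers chosen only to make the arithmetic easy to follow.

```python
def f_statistic(rss_current, rss_new, n, k):
    """F = (RSS(current) - RSS(new)) / (RSS(new) / (n - k - 2)).

    rss_current: RSS of the current model with k inputs
    rss_new:     RSS after adding one more predictor
    """
    return (rss_current - rss_new) / (rss_new / (n - k - 2))

# Hypothetical situation: n = 50 observations, current model with k = 3 inputs
# and RSS 120; three candidate predictors with the RSS obtained after adding each.
n, k, rss_current = 50, 3, 120.0
candidate_rss = {"x4": 100.0, "x5": 110.0, "x6": 118.0}

f_values = {name: f_statistic(rss_current, r, n, k) for name, r in candidate_rss.items()}
best_candidate = max(f_values, key=f_values.get)  # predictor with the largest F
# For x4: (120 - 100) / (100 / 45), which is about 9
```

The chosen predictor would be added only if its F value exceeds the chosen percentile of the $F_{1,\, n-k-2}$ distribution.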

Stepwise selection (Cont.)

Backward stepwise selection
- Starts with the full model containing all $p$ predictors and sequentially deletes the predictor whose removal worsens the fit the least (the smallest value of $F$)
- Can be used only when p < n

Hybrid stepwise selection
- Considers both forward and backward moves at each stage and makes the best move

Best subset or stepwise selection?

- Stepwise selection: the F-ratio stopping rule provides only local control of the model search and does not attempt to find the best model along the sequence of models that it examines
- Best subset selection (all-subset selection): we can choose the model from the sequence that minimizes an estimate of the expected prediction error
- When the goal is prediction, best subset selection is the appropriate choice

or Shrinkage?

- By retaining only a subset of the predictors and eliminating the rest from the model, subset selection produces a model that is interpretable and has possibly lower prediction error than the full model
- It is a discrete process: variables are either retained or eliminated
- Therefore it often exhibits high variance, and so often does not reduce the prediction error of the full model
- Shrinkage methods are more continuous, and don't suffer as much from high variability