Prediction & Feature Selection in GLM


Tarigan Statistical Consulting & Coaching, statistical-coaching.ch
Doctoral Program in Computer Science of the Universities of Fribourg, Geneva, Lausanne, Neuchâtel, Bern and the EPFL
Hands-on Data Analysis with R, University of Neuchâtel, 10 May 2016

Prediction & Feature Selection in GLM
Bernadetta Tarigan, Dr. sc. ETHZ

Bug data

[Slide shows the bug dataset used as the running example.]

Predicting the number of bugs

Goal: prediction.
1. How do we predict the number of bugs?
2. Are change metrics better than source metrics?
3. Should we combine them?
4. Are the predictors actually independent?
5. What is the best prediction model to use?
6. What are the minimal metrics (from the combined set) needed to make the best prediction?
7. Are the answers to the above questions dependent on the project?
8. Finally, can we actually predict the number of bugs at all?

Is it good?

[Table of coefficient estimates for the change metrics (numberofversionsuntil. through weightedagewithrespectto.) under least squares (LS), best subset, ridge, elastic net and lasso fits; only the rows that can be reliably recovered from the transcript are shown.]

Term                    LS       BestSubset  Ridge    ElasticNet  Lasso
(Intercept)             0.102    0.128       0.255    -0.085      -0.086
numberofversionsuntil.  0.067    0.066       0.000     0.027       0.031
TestError               0.124    0.116       0.090     0.093       0.095
Improvement                      7%          38%       33%         30%

Or is this better?

[Table of Poisson-regression coefficient estimates for the same change metrics under the Poisson GLM, best subset, ridge, elastic net and lasso fits; only the rows that can be reliably recovered from the transcript are shown.]

Term                    Poisson  BestSubset  Ridge    ElasticNet  Lasso
(Intercept)             -2.220   -2.464      -2.043   -1.753      -1.719
numberofversionsuntil.   0.024    0.022       0.008    0.010       0.010
TestError                0.129    0.116       0.078    0.067       0.067
Improvement                       11%         65%      94%         94%

Review: multiple least squares regression

$Y = f(X) + \varepsilon$, with $\varepsilon$ random, $E(\varepsilon) = 0$, $\mathrm{Var}(\varepsilon) = \sigma_\varepsilon^2$.

Linear model: $f(X) = \beta_0 + \sum_{j=1}^{p} \beta_j X_j$.

Problem: estimate the unknown $\beta$ from the data.

Least squares estimate: $\hat{\beta}^{ls} = \arg\min_\beta \sum_{i=1}^{n} (y_i - \beta^T X_i)^2$.

Matrix notation: $\hat{\beta}^{ls} = (X^T X)^{-1} X^T y$, where $X$ is the $n \times p$ matrix with each row an input vector and $y$ is the $n$-vector of outputs.
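
A minimal R sketch with simulated data (all variable names are illustrative): the least squares estimate can be obtained with lm() or directly from the normal equations.

```r
# Least squares by lm() and by the normal equations (X'X)^{-1} X'y.
set.seed(1)
n <- 100; p <- 3
X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, c("x1", "x2", "x3")))
y <- drop(1 + X %*% c(2, -1, 0) + rnorm(n))   # true betas: 2, -1, 0

fit <- lm(y ~ X)                              # least squares via lm()
coef(fit)

Xd <- cbind(Intercept = 1, X)                 # design matrix with intercept
solve(t(Xd) %*% Xd, t(Xd) %*% y)              # same estimates, matrix form
```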

Prediction is different from explanation

Assume $Y = f(X) + \varepsilon$, $E(\varepsilon) = 0$, $\mathrm{Var}(\varepsilon) = \sigma_\varepsilon^2$, and suppose we have some estimator $\hat{f}(X)$. Will $\hat{f}(X)$ fit future observations well?

Prediction is different from explanation: the quality of a model is no longer measured by the $R^2$ goodness of fit; it is replaced by the model's generalization performance on future observations. This generalization performance, at a new input point $X = x_0$, is the expected prediction error (EPE)

$\mathrm{EPE}(x_0) = E\big[(Y - \hat{f}(x_0))^2 \mid X = x_0\big].$

The EPE is also called the out-of-sample error or test error.

Partition of the data into training and test sets

$Y = f(X) + \varepsilon$; $\varepsilon$ random with $E(\varepsilon) = 0$, $\mathrm{Var}(\varepsilon) = \sigma_\varepsilon^2$.

Split the data into a training set and a test set.

Obtain a model $\hat{f}^{ls}$ on the training set: $\hat{\beta}^{ls} = \arg\min_{\beta \in \mathbb{R}^p} \|y_{\mathrm{training}} - X_{\mathrm{training}}\beta\|_2^2$.

Estimate its prediction error on the test set: $\widehat{\mathrm{EPE}}(\hat{f}^{ls}) = \frac{1}{n_{\mathrm{test}}} \|y_{\mathrm{test}} - X_{\mathrm{test}}\hat{\beta}^{ls}\|_2^2$.
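
A minimal R sketch of this train/test estimate of the EPE, again on simulated data (the split size and variable names are illustrative):

```r
# Hold out a test set and estimate the prediction error of the least squares fit.
set.seed(1)
n <- 200; p <- 5
dat <- data.frame(matrix(rnorm(n * p), n, p))
names(dat) <- paste0("x", 1:p)
dat$y <- 1 + 2 * dat$x1 - dat$x2 + rnorm(n)    # only x1 and x2 matter

test_idx <- sample(n, 50)                      # hold out 50 observations
fit_ls   <- lm(y ~ ., data = dat[-test_idx, ]) # fit on the training part only
pred     <- predict(fit_ls, newdata = dat[test_idx, ])
mean((dat$y[test_idx] - pred)^2)               # estimated EPE (test error)
```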

Improving the least squares fit with feature selection

Why we are not satisfied with the least squares estimates:

Prediction accuracy
- LS estimates often have low bias but high variance.
- Prediction accuracy can sometimes be improved by shrinking some coefficients or setting them to zero.
- By doing so we sacrifice a little bias to reduce the variance of the predicted values, and hence may improve the overall prediction accuracy.

Interpretability
- With a large number of predictors, we would often like to determine a smaller subset that exhibits the strongest effects.
- To get the big picture, we are willing to sacrifice some of the small details.

Also: the least squares estimate is not defined when p > n.

To improve: automatically perform feature selection
- Subset selection methods
- Coefficient shrinkage methods (modern techniques)

However...

Performing feature selection means reducing the complexity of the model class. The bias-variance decomposition of the EPE depends on this model complexity, so model selection is a bias-variance tradeoff. We first look at the bias-variance decomposition of the EPE, and then at its relationship to model complexity.

Bias-variance decomposition of the EPE

$\mathrm{EPE}(x_0) = E\big[(Y - \hat{f}(x_0))^2 \mid X = x_0\big] = \sigma_\varepsilon^2 + \big[f(x_0) - E(\hat{f}(x_0))\big]^2 + \mathrm{Var}(\hat{f}(x_0))$
= Irreducible error + Bias² + Variance

- Irreducible error: the variance of the target around its true mean $f(x_0)$; it cannot be avoided no matter how well we estimate $f(x_0)$, unless $\sigma_\varepsilon^2 = 0$.
- Bias²: the squared difference between the average prediction of our model and the true unknown value we are trying to predict.
- Variance: the variability of our model's prediction at a given data point.

Note that $\mathrm{EPE}(x_0) = \sigma_\varepsilon^2 + \mathrm{MSE}(\hat{f}(x_0))$, where MSE = mean squared error.
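
The decomposition can be made concrete with a small simulation, sketched below under assumed choices of true function, noise level and model (all illustrative): we repeatedly draw training sets, refit a deliberately too-simple model, and look at its predictions at a single point x0.

```r
# Monte Carlo illustration of the bias-variance decomposition at one point x0.
set.seed(1)
f     <- function(x) sin(2 * pi * x)   # assumed true regression function
x0    <- 0.3                           # point at which we evaluate the EPE
sigma <- 0.3                           # noise sd, so irreducible error = 0.09

preds <- replicate(2000, {
  x <- runif(50)                       # a fresh training sample each time
  y <- f(x) + rnorm(50, sd = sigma)
  fit <- lm(y ~ x)                     # a deliberately too-simple (biased) model
  predict(fit, data.frame(x = x0))
})

bias2    <- (mean(preds) - f(x0))^2    # squared bias at x0
variance <- var(preds)                 # variance of the prediction at x0
c(irreducible = sigma^2, bias2 = bias2, variance = variance,
  EPE = sigma^2 + bias2 + variance)
```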

Graphical illustration of bias and variance

[Figure: the unknown target $f(x_0)$ versus the model prediction $\hat{f}(x_0)$; source: http://scott.fortmann-roe.com/docs/biasvariance.html]

Error due to bias: the difference between the average prediction of our model and the true unknown value we are trying to predict.

Error due to variance: the variability of a model's prediction at a given data point. Imagine you could repeat the entire model-building process multiple times (you have multiple samples); the variance measures how much the predictions at a given point vary between different realizations of the model.

Bias, variance and model complexity

As the model $\hat{f}$ becomes more complex (more terms are included), the bias will most likely decrease, because local curvature can be picked up. However, the variance increases as more terms are included. We would of course like to choose the model complexity that trades bias off against variance in such a way as to minimize the test error. (Figure: http://scott.fortmann-roe.com/docs/biasvariance.html)

By the way...

Let $\hat{f}_p(x) = x^T \hat{\beta}^{ls}$ be the least squares fit (the parameter vector $\beta$ has $p$ components). Then

$\mathrm{EPE}(x_0) = \sigma_\varepsilon^2 + \big[f(x_0) - E(\hat{f}_p(x_0))\big]^2 + \mathrm{Var}(\hat{f}_p(x_0))$,

and averaging over the sample values $x_i$,

$\frac{1}{n}\sum_{i=1}^{n} \mathrm{EPE}(x_i) = \sigma_\varepsilon^2 + \frac{1}{n}\sum_{i=1}^{n}\big[f(x_i) - E(\hat{f}_p(x_i))\big]^2 + \frac{p}{n}\,\sigma_\varepsilon^2.$

The model complexity of the least squares fit is therefore directly related to the number of parameters $p$: the smaller $p$, the smaller the variance term $\frac{p}{n}\sigma_\varepsilon^2$, but the bias might increase.

Back to feature selection

Recall why we are not satisfied with the least squares estimates:

Prediction accuracy
- often low bias but high variance
- the variance gets smaller when coefficients are shrunk toward zero
- the bias increases a bit, but the overall accuracy might improve

Interpretability
- with a large number of predictors, we would often like to determine a smaller subset that exhibits the strongest effects
- to get the big picture, we are willing to sacrifice some of the small details

Also: the least squares estimate is not defined when p > n.

To improve: automatically perform feature selection. Two classes of methods:
1. Coefficient shrinkage methods (modern techniques)
2. Subset selection methods

Shrinkage / penalized / regularized estimation

$\hat{\beta} = \arg\min_{\beta \in \mathbb{R}^p} \|y - X\beta\|_2^2 \quad \text{subject to } R(\beta) \le t$,

or equivalently

$\hat{\beta} = \arg\min_{\beta \in \mathbb{R}^p} \|y - X\beta\|_2^2 + \lambda R(\beta).$

$R(\beta)$ is called the regularizer (or penalty) on the complexity of the model. The term $\lambda$ is called the tuning parameter and controls the amount of regularization: the larger $\lambda$, the greater the amount of regularization (penalty), and the more the coefficient estimates are shrunk toward 0. Note that regularization can be applied beyond regression, e.g. in classification, clustering, principal component analysis, etc.

Ridge

$\hat{\beta} = \arg\min_{\beta \in \mathbb{R}^p} \|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2$

Ridge shrinks the coefficients toward zero but not exactly to zero, so it does not do variable selection; still, it typically outperforms the least squares estimate when the goal is prediction, and it encourages a grouping effect among correlated predictors.
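
Ridge also has a closed form, $\hat{\beta}^{ridge} = (X^T X + \lambda I)^{-1} X^T y$. The following R sketch (simulated, standardized data; the $\lambda$ values are arbitrary) shows the coefficients shrinking toward, but never exactly to, zero as $\lambda$ grows:

```r
# Ridge regression via its closed form (X'X + lambda I)^{-1} X'y.
set.seed(1)
n <- 100; p <- 5
X <- scale(matrix(rnorm(n * p), n, p))        # standardized predictors
y <- drop(X %*% c(3, -2, 0, 0, 0) + rnorm(n))
y <- y - mean(y)                              # centered response, no intercept

ridge <- function(lambda)
  solve(t(X) %*% X + lambda * diag(p), t(X) %*% y)

lams <- c(0, 10, 100)                         # lambda = 0 is least squares
out  <- sapply(lams, ridge)
dimnames(out) <- list(paste0("x", 1:p), paste0("lambda=", lams))
round(out, 3)                                 # shrunk toward (not to) zero
```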

Lasso

$\hat{\beta} = \arg\min_{\beta \in \mathbb{R}^p} \|y - X\beta\|_2^2 + \lambda \|\beta\|_1$

The lasso shrinks some of the coefficients to exactly zero, so it does variable selection (a sparse model). When p > n, the lasso selects at most n variables, and it typically fails to do group selection: it tends to select one variable from a group of correlated variables and ignore the others.

Elastic net

$\hat{\beta} = \arg\min_{\beta \in \mathbb{R}^p} \|y - X\beta\|_2^2 + \lambda_2 \|\beta\|_2^2 + \lambda_1 \|\beta\|_1$

The elastic net simply combines the advantages of the ridge and lasso methods.
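
A hedged glmnet sketch contrasting the three penalties on the same simulated data (the variable names and the reported λ value are illustrative): ridge keeps all coefficients nonzero, while the lasso and elastic net set some exactly to zero.

```r
# Ridge, elastic net and lasso on the same simulated data with glmnet.
library(glmnet)                            # install.packages("glmnet") if needed
set.seed(1)
n <- 100; p <- 20
X <- matrix(rnorm(n * p), n, p)
y <- drop(X[, 1:3] %*% c(3, -2, 1) + rnorm(n))  # only the first 3 predictors matter

fits <- list(ridge      = glmnet(X, y, alpha = 0),
             elasticnet = glmnet(X, y, alpha = 0.5),
             lasso      = glmnet(X, y, alpha = 1))

# Number of nonzero coefficients (excluding the intercept) at lambda = 0.5:
sapply(fits, function(f) sum(as.vector(coef(f, s = 0.5))[-1] != 0))
```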

Why do shrinkage models work well?

[Figures from The Elements of Statistical Learning, Hastie, Tibshirani and Friedman, 2nd edition, 2009.]

Some more pictures

[Figures from The Elements of Statistical Learning, Hastie, Tibshirani and Friedman, 2nd edition, 2009.]

All great, but how do we choose $\lambda_{opt}$?

Recall: we would of course like to choose our model complexity to trade bias off against variance in such a way as to minimize the test error. But we only have access to the training error, and unfortunately the training error is not a good estimate of the test error. The answer: cross validation. (Figure from The Elements of Statistical Learning, Hastie, Tibshirani and Friedman, 2nd edition, 2009.)

Cross validation

[Diagram: in an outer loop the data are split into a training set and a test set; in an inner loop the training set is further split off into a validation part, used to choose $\lambda_{opt}$ and hence $\hat{\beta}_{\lambda_{opt}}$; the test set is used only to assess the final model.]

Shrinkage methods in R: the glmnet package

glmnet is a package that fits generalized linear models via penalized maximum likelihood. The algorithm is extremely fast and can exploit sparsity in the input matrix. It fits linear, logistic, multinomial, Poisson and Cox regression models, and can also fit multi-response linear regression. A variety of predictions can be made from the fitted models.

glmnet solves

$\hat{\beta} = \arg\min_{\beta \in \mathbb{R}^p} \|y - X\beta\|_2^2 + \lambda \big[(1-\alpha)\|\beta\|_2^2 / 2 + \alpha \|\beta\|_1\big],$

where the elastic net penalty is controlled by $\alpha$: $\alpha$ bridges the gap between the lasso ($\alpha = 1$, the default) and ridge ($\alpha = 0$). cv.glmnet is the main function for cross validation.
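
A minimal sketch of the workflow the slide describes, using cv.glmnet to choose λ for a Poisson lasso on simulated count data (the data and the variable names are purely illustrative):

```r
# Choosing lambda by cross validation for a Poisson lasso with glmnet.
library(glmnet)
set.seed(1)
n <- 200; p <- 10
X  <- matrix(rnorm(n * p), n, p)
mu <- exp(-1 + 0.8 * X[, 1] - 0.5 * X[, 2])    # only x1 and x2 affect the rate
y  <- rpois(n, mu)                             # simulated bug-like counts

cvfit <- cv.glmnet(X, y, family = "poisson", alpha = 1, nfolds = 10)
cvfit$lambda.min                               # lambda with smallest CV error
coef(cvfit, s = "lambda.min")                  # sparse coefficient vector
predict(cvfit, newx = X[1:5, ], s = "lambda.min",
        type = "response")                     # predicted counts for 5 rows
```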

Subset selection

In this approach we retain only a subset of the variables and eliminate the rest from the model. Least squares regression is then used to estimate the coefficients of the inputs that are retained. There are a number of different strategies for choosing the subset:
- Best subset regression
- Forward stepwise selection
- Backward stepwise selection
- Hybrid stepwise selection

Best subset

Best subset regression finds, for each $k \in \{1, 2, \ldots, p\}$, the subset of size k that gives the smallest residual sum of squares (RSS). An efficient algorithm, the leaps and bounds procedure (Furnival and Wilson, 1974), makes this feasible for p as large as 30 or 40; it is available in R through the bestglm package. The question of how to choose k involves the tradeoff between bias and variance, and there are a number of criteria one may use. Typically we choose the model that minimizes an estimate of the expected prediction error (EPE).
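
A hedged sketch of best subset selection in R: this one uses the leaps package (bestglm offers a similar interface for GLMs) on simulated data, and picks k with the BIC as a simple stand-in for an EPE estimate; all names and data are illustrative.

```r
# Best subset selection via leaps-and-bounds; choose the size k by BIC.
library(leaps)                                 # install.packages("leaps") if needed
set.seed(1)
n <- 100; p <- 8
dat <- data.frame(matrix(rnorm(n * p), n, p))
names(dat) <- paste0("x", 1:p)
dat$y <- 2 * dat$x1 - dat$x2 + rnorm(n)        # only x1 and x2 matter

best <- regsubsets(y ~ ., data = dat, nvmax = p)  # best model of each size k
s    <- summary(best)
k    <- which.min(s$bic)                       # chosen subset size
coef(best, k)                                  # coefficients of the chosen model
```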

Stepwise selection

Rather than searching through all possible subsets (which becomes infeasible for p much larger than 40), we can seek a good path through them.

Forward stepwise selection starts with the intercept (the null model) and sequentially adds to the model, one at a time, the predictor that most improves the fit. Suppose the current model has k inputs, with estimates $\hat{\beta}$, and we add a predictor, resulting in estimates $\tilde{\beta}$. The improvement in fit is often based on the statistic

$F = \dfrac{\mathrm{RSS}(\hat{\beta}) - \mathrm{RSS}(\tilde{\beta})}{\mathrm{RSS}(\tilde{\beta})/(n - k - 2)}.$

Strategy: sequentially add the predictor producing the largest value of F, stopping when no predictor produces an F-ratio greater than the 90th or 95th percentile of the $F_{1, n-k-2}$ distribution. Forward stepwise selection can be used even when p > n, and it is the only viable subset method when p is very large.
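
One forward step can be carried out by hand in R with add1(), whose F test is essentially the statistic above; step() then automates the whole search, although it ranks moves by AIC rather than the F-percentile rule. A sketch on simulated data (names illustrative):

```r
# One forward step: compare adding each remaining predictor by its F statistic.
set.seed(1)
n <- 100
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 + rnorm(n)             # only x1 truly matters

null_fit <- lm(y ~ 1, data = dat)              # start from the intercept-only model
add1(null_fit, scope = ~ x1 + x2 + x3, test = "F")

# Automated forward search (AIC-based stopping, not the F-percentile rule):
step(null_fit, scope = ~ x1 + x2 + x3, direction = "forward", trace = FALSE)
```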

Stepwise selection (cont.)

Backward stepwise selection starts with the full model containing all p predictors and sequentially deletes the predictor whose removal degrades the fit the least. It can be used only when p < n.

Hybrid stepwise selection considers both forward and backward moves at each stage and makes the best move.

Best subset or stepwise selection?

Stepwise selection: the F-ratio stopping rule provides only local control of the model search and does not attempt to find the best model along the sequence of models that it examines. Best subset selection (all-subsets selection): we can choose the model from the sequence that minimizes an estimate of the expected prediction error. When the goal is prediction, best subset is the proper choice.

...or shrinkage?

By retaining only a subset of the predictors and eliminating the rest from the model, subset selection produces a model that is interpretable and possibly has a lower prediction error than the full model. But it is a discrete process: variables are either retained or eliminated. Therefore it often exhibits high variance, and so it may not reduce the prediction error of the full model. Shrinkage methods are more continuous and do not suffer as much from high variability.