STAT 462-Computational Data Analysis

Chapter 5, Part 2
Nasser Sadeghkhani, a.sadeghkhani@queensu.ca
October

Outline
Shrinkage Methods
1. Ridge Regression
2. Lasso
Dimension Reduction Methods
Reference: Sections 6.2 and 6.3 of ISL (An Introduction to Statistical Learning)

In subset selection we used ordinary least squares (OLS) to fit a linear model that contains a subset of the predictors. Alternatively, one can fit a model containing all p predictors while using a technique that constrains or regularizes the coefficient estimates, shrinking them towards zero. Such techniques, which amount to an optimization problem subject to a constraint (penalty) on the parameters, are called shrinkage methods. We will see that shrinking the coefficient estimates can significantly reduce their variance and hence improve the fitted model. Two techniques for shrinking the regression coefficients towards zero are ridge regression and the lasso.

Ridge Regression
Note that these two methods do not use the OLS method to fit a model directly. As a recap from previous chapters, the OLS method estimates the coefficients $\beta_0, \beta_1, \dots, \beta_p$ by minimizing the (training) error
$$\mathrm{SSE} = \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2$$
(called RSS in the ISL notation). In contrast, ridge regression finds the parameters that minimize
$$\mathrm{SSE} + \text{Penalty} = \mathrm{SSE} + \lambda \sum_{j=1}^{p} \beta_j^2 .$$
Here $\lambda \ge 0$ is a tuning parameter that has to be determined separately, using the GCV (Generalized Cross Validation) method (or, in some references, CV).
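The penalized criterion can be minimized numerically to see the shrinkage directly. The following is a minimal sketch, not the lecture's own code: the helper name `ridge_objective`, the use of the built-in `mtcars` data, and the value λ = 10 are all assumptions made for illustration.

```r
# Hypothetical helper: penalized sum of squares for ridge regression.
# par = c(beta0, beta_1, ..., beta_p); the intercept beta0 is not penalized.
ridge_objective <- function(par, X, y, lambda) {
  beta0 <- par[1]
  beta  <- par[-1]
  sse   <- sum((y - beta0 - X %*% beta)^2)
  sse + lambda * sum(beta^2)
}

# Illustrative data (an assumption): three standardized predictors from mtcars.
X <- scale(as.matrix(mtcars[, c("wt", "hp", "disp")]))
y <- mtcars$mpg

# Numerical minimization with BFGS for lambda = 0 (OLS) and lambda = 10.
start <- rep(0, ncol(X) + 1)
fit_ols   <- optim(start, ridge_objective, X = X, y = y, lambda = 0,  method = "BFGS")
fit_ridge <- optim(start, ridge_objective, X = X, y = y, lambda = 10, method = "BFGS")

beta_ols   <- fit_ols$par[-1]
beta_ridge <- fit_ridge$par[-1]

# The ridge solution has a smaller squared norm but a larger SSE.
sum(beta_ols^2)
sum(beta_ridge^2)
```

Setting `lambda = 0` recovers the OLS criterion, so the same function serves for both fits; in practice the closed-form solution is used instead of a general-purpose optimizer.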

Some notes:
1. In ridge regression we pay a penalty (the shrinkage penalty) for coefficients that are nonzero.
2. Equivalently, the method tries to minimize the SSE while, on the other hand, encouraging the parameters to shrink towards zero.
3. The tuning parameter λ controls the relative impact of these two terms on the regression coefficient estimates.

The OLS coefficient estimates are scale equivariant. That is, multiplying $x_j$ by a constant $c$ simply scales the corresponding least squares coefficient estimate by a factor of $1/c$. In other words, regardless of how $x_j$ is scaled, $x_j \hat\beta_j$ remains the same. Ridge regression, however, is sensitive to the scaling of the predictors. (Why?) Therefore, it is best to apply ridge regression after standardizing the predictors as $\tilde{x}_{ij} = x_{ij}/s_j$, where $s_j$ is the standard deviation of $x_j$.
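A small numerical check makes the contrast concrete. This is a sketch under stated assumptions: the helper `ridge_slopes`, the `mtcars` predictors, the rescaling constant c = 1000 and the value λ = 5 are illustrative choices, not part of the lecture.

```r
# Illustrative data (an assumption): two predictors from mtcars.
y <- mtcars$mpg
X <- as.matrix(mtcars[, c("wt", "hp")])

# Hypothetical helper: ridge slopes with an unpenalized intercept,
# obtained by centering X and y before solving the normal equations.
ridge_slopes <- function(X, y, lambda) {
  Xc <- scale(X, center = TRUE, scale = FALSE)
  yc <- y - mean(y)
  drop(solve(crossprod(Xc) + lambda * diag(ncol(Xc)), crossprod(Xc, yc)))
}

# Rescale the first predictor by c = 1000 (for example, a change of units).
c_scale <- 1000
X_scaled <- X
X_scaled[, 1] <- c_scale * X[, 1]

# OLS (lambda = 0): the coefficient is divided by c, so x_j * beta_j is unchanged.
b_ols        <- ridge_slopes(X, y, lambda = 0)
b_ols_scaled <- ridge_slopes(X_scaled, y, lambda = 0)

# Ridge (lambda = 5): the rescaled coefficient is not simply 1/c times the original.
b_ridge        <- ridge_slopes(X, y, lambda = 5)
b_ridge_scaled <- ridge_slopes(X_scaled, y, lambda = 5)

# Differences after undoing the rescaling: about zero for OLS, clearly nonzero for ridge.
diff_ols   <- unname(b_ols_scaled[1] * c_scale - b_ols[1])
diff_ridge <- unname(b_ridge_scaled[1] * c_scale - b_ridge[1])
c(ols = diff_ols, ridge = diff_ridge)
```

The penalty $\lambda \beta_j^2$ is measured on the coefficient's own scale, so a predictor expressed in small units (large coefficient) is penalized more heavily than the same predictor in large units.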

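Computation
After standardizing y and X (so X does not have the column of 1s), the criterion to minimize is
$$\mathrm{SSE} + \lambda \beta^T \beta = (y - X\beta)^T (y - X\beta) + \lambda \beta^T \beta,$$
with solution
$$\hat\beta^R = (X^T X + \lambda I)^{-1} X^T y = X^T (X X^T + \lambda I)^{-1} y.$$
If X has orthonormal columns ($X^T X = I$), $\hat\beta^R = \hat\beta^{OLS}/(1 + \lambda)$.
As $\lambda \to 0$, $\hat\beta^R \to \hat\beta^{OLS}$.
As $\lambda \to \infty$, $\hat\beta^R \to 0$.
As λ increases, ridge regression leads to decreased variance but increased bias.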
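The two closed forms and the orthonormal special case can be checked numerically. The sketch below uses simulated data; the sample size, the true coefficients and λ = 2 are arbitrary assumptions chosen only for illustration.

```r
set.seed(1)                               # simulated data (an assumption)
n <- 50; p <- 4; lambda <- 2
X <- scale(matrix(rnorm(n * p), n, p))    # standardized predictors, no 1 column
y <- drop(X %*% c(2, -1, 0.5, 0) + rnorm(n))
y <- y - mean(y)                          # centered response

# First form: (X'X + lambda I_p)^{-1} X'y, a p x p system.
beta_primal <- drop(solve(crossprod(X) + lambda * diag(p), crossprod(X, y)))

# Second form: X'(XX' + lambda I_n)^{-1} y, an n x n system.
beta_dual <- drop(crossprod(X, solve(tcrossprod(X) + lambda * diag(n), y)))

all.equal(beta_primal, beta_dual)

# Orthonormal columns (Q'Q = I): ridge equals OLS divided by (1 + lambda).
Q <- qr.Q(qr(X))
beta_ols_Q   <- drop(crossprod(Q, y))
beta_ridge_Q <- drop(solve(crossprod(Q) + lambda * diag(p), crossprod(Q, y)))

all.equal(beta_ridge_Q, beta_ols_Q / (1 + lambda))
```

The first form solves a p × p system and the second an n × n system, so the second is the cheaper one when p is much larger than n.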

Example: Credit data
Recap: #regressors p =

Selecting the tuning parameter λ
Generalized cross validation (GCV) is defined as
$$\mathrm{GCV} = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{y_i - \hat{y}_i}{1 - \mathrm{tr}(H)/n} \right)^2,$$
where $H = X (X^T X + \lambda I)^{-1} X^T$.
Example
> library(MASS)
> prostate = scale(prostate)
> prostate = as.data.frame(prostate)
> fit.ridge <- lm.ridge(lpsa ~ lcavol + lweight + age + lbph + svi + lcp + gleason + pgg45, data = prostate, lambda = seq(0, 20, 0.1))
> plot(fit.ridge)
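The GCV formula can also be evaluated by hand over a grid of λ values. The following is a minimal sketch assuming standardized predictors and a centered response; the helper `gcv_ridge` and the use of `mtcars` (rather than the prostate data) are assumptions. Because `lm.ridge` applies its own scaling conventions, the values here need not coincide with what `select()` reports.

```r
# Hypothetical helper: GCV score for ridge regression at a single lambda.
# Assumes X is standardized and y is centered, so no intercept column is needed.
gcv_ridge <- function(X, y, lambda) {
  n <- nrow(X)
  H <- X %*% solve(crossprod(X) + lambda * diag(ncol(X)), t(X))  # hat matrix
  y_hat <- drop(H %*% y)
  mean(((y - y_hat) / (1 - sum(diag(H)) / n))^2)
}

# Illustrative data (an assumption): mtcars instead of the prostate data.
X <- scale(as.matrix(mtcars[, c("wt", "hp", "disp", "qsec", "drat")]))
y <- mtcars$mpg - mean(mtcars$mpg)

lambdas <- seq(0, 20, by = 0.1)
gcv <- sapply(lambdas, function(l) gcv_ridge(X, y, l))

best_lambda <- lambdas[which.min(gcv)]   # lambda with the smallest GCV score
best_lambda
```

The trace of H plays the role of the effective number of parameters: it equals p at λ = 0 and decreases towards 0 as λ grows.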


Example
> select(fit.ridge)
modified HKB estimator is
modified L-W estimator is
smallest value of GCV at 6.5
> round(fit.ridge$coef[, which(fit.ridge$lambda == 6.5)], 2)
lcavol lweight age lbph svi lcp gleason pgg45

Why does Ridge improve over OLS?

Lasso
Drawback of ridge regression: unlike best subset selection or stepwise selection, ridge regression does not select a model. That is, we start with p predictors and we end up with all p predictors.

The lasso is quite similar to ridge regression, but it also performs variable selection, since it shrinks some of the coefficient estimates exactly to zero. In other words, the lasso combines model selection and shrinkage. We say that the lasso yields sparse models, that is, models that involve only a subset of the variables.

In the Credit example: it can be seen clearly that some coefficient estimates are exactly equal to zero for some values of λ.

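But why is the lasso a selective method? That is, why, unlike ridge regression, does it produce coefficient estimates that are exactly equal to zero? The answer follows from the equivalent constrained formulations of the two problems. The lasso solves
$$\min_{\beta} \ \mathrm{SSE} \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le s,$$
while ridge regression solves
$$\min_{\beta} \ \mathrm{SSE} \quad \text{subject to} \quad \sum_{j=1}^{p} \beta_j^2 \le s,$$
where s > 0 can be imagined (naively) as our budget.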

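And of course the diamond (the constraint region of the lasso) fortunately has corners! The SSE contours can first touch the diamond at a corner, where some coefficients are exactly zero, whereas the circular ridge region has no corners.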
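The effect of the corners is easiest to see in the special case of an orthonormal design ($X^T X = I$), where both estimators have closed forms in terms of the OLS estimate: ridge rescales every coefficient by $1/(1+\lambda)$, while the lasso (with penalty $\lambda \sum_j |\beta_j|$) soft-thresholds at $\lambda/2$. The sketch below assumes this special case; the helper names and the coefficient values are made up for illustration.

```r
# Hypothetical helpers for the orthonormal-design special case (X'X = I),
# where b denotes the OLS estimate.
ridge_shrink <- function(b, lambda) b / (1 + lambda)
lasso_shrink <- function(b, lambda) sign(b) * pmax(abs(b) - lambda / 2, 0)

b_ols  <- c(3, -1.5, 0.4, -0.2)   # made-up OLS coefficients
lambda <- 1

b_ridge <- ridge_shrink(b_ols, lambda)   # every coefficient shrinks, none is exactly zero
b_lasso <- lasso_shrink(b_ols, lambda)   # coefficients with |b| <= lambda / 2 become exactly zero

b_ridge
b_lasso
```

Ridge shrinks proportionally, so a nonzero coefficient stays nonzero; the lasso subtracts a fixed amount, so small coefficients are set exactly to zero.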


Lasso vs. Ridge Regression
Scenario 1: The true model is dense, that is, all the coefficients are nonzero.
Left: Plots of squared bias (black), variance (green), and test MSE (purple) for the lasso on Scenario 1.
Right: Comparison of squared bias, variance and test MSE between lasso (solid) and ridge (dashed). The crosses in both plots indicate the lasso model for which the MSE is smallest.

Lasso vs. Ridge Regression (Continued)
Scenario 2: The true model is not dense, that is, some of the coefficients are zero: only two predictors are related to the response.
Left: Plots of squared bias (black), variance (green), and test MSE (purple) for the lasso on Scenario 2.
Right: Comparison of squared bias, variance and test MSE between lasso (solid) and ridge (dashed). The crosses in both plots indicate the lasso model for which the MSE is smallest.

We can conclude from the last two slides that neither ridge regression nor the lasso universally dominates the other. We expect the lasso to perform better when the response is a function of only a relatively small number of predictors (a sparse model), and ridge regression to perform better otherwise.
Selecting the tuning parameter λ: cross validation is used for this purpose. We choose a grid of λ values and compute the cross validation error for each of them. Then we select the λ for which the cross validation error is smallest. Finally, the model is re-fit using all of the available observations and the selected value of the tuning parameter λ.
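The grid search described above can be sketched in base R for ridge regression. This is an illustration under assumptions, not the lecture's code: the helper `fit_ridge`, the `mtcars` data, the grid, the seed and K = 5 folds are all arbitrary choices. For simplicity the predictors are standardized once on the full data rather than within each fold.

```r
set.seed(462)   # fold assignment is random; the seed is an arbitrary choice

# Illustrative data (an assumption): mtcars instead of the lecture's data sets.
X <- scale(as.matrix(mtcars[, c("wt", "hp", "disp", "qsec", "drat")]))
y <- mtcars$mpg

# Hypothetical helper: ridge fit with an unpenalized intercept.
fit_ridge <- function(X, y, lambda) {
  x_bar <- colMeans(X)
  Xc <- sweep(X, 2, x_bar)
  beta <- drop(solve(crossprod(Xc) + lambda * diag(ncol(X)),
                     crossprod(Xc, y - mean(y))))
  list(beta0 = mean(y) - sum(x_bar * beta), beta = beta)
}

lambdas <- seq(0, 20, by = 0.5)                  # grid of candidate values
K <- 5
fold <- sample(rep(1:K, length.out = nrow(X)))   # random fold labels

# Cross-validated mean squared error for each lambda on the grid.
cv_mse <- sapply(lambdas, function(l) {
  mean(sapply(1:K, function(k) {
    train <- fold != k
    fit <- fit_ridge(X[train, , drop = FALSE], y[train], l)
    pred <- fit$beta0 + drop(X[!train, , drop = FALSE] %*% fit$beta)
    mean((y[!train] - pred)^2)
  }))
})

best_lambda <- lambdas[which.min(cv_mse)]
final_fit <- fit_ridge(X, y, best_lambda)   # re-fit on all observations
```

The same loop applies to the lasso once `fit_ridge` is replaced by a lasso solver; only the fitting step changes, not the cross validation logic.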


Dimension Reduction Methods
(For STAT 862, optional for STAT)

