Part V. Model Selection and Regularization. As of Nov 21, 2018


1 Part V Model Selection and Regularization As of Nov 21, 2018 Some of the figures in this presentation are taken from An Introduction to Statistical Learning, with applications in R (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani

2 1 Model Selection and Regularization Variable selection Ridge regression Lasso Dimension reduction Principal components regression Partial least squares

3 Model selection in regression Model selection in the linear regression y = β_0 + β_1 x_1 + ⋯ + β_p x_p + ɛ (1) typically refers to the selection of the most appropriate subset of the p explanatory variables that best predicts and captures the variability in y. Regularization refers to fitting a model with all p variables while shrinking the estimated coefficients towards zero relative to the least squares estimates, in order to decrease (reducible) variance. Depending on the type of shrinkage, some of the coefficients may be estimated to be exactly zero, thereby also performing variable selection. We discussed variable selection briefly earlier; it also serves as dimension reduction of the regression problem. In this section we demonstrate variable selection and briefly discuss other approaches to dimension reduction.

4 Variable selection 1 Model Selection and Regularization Variable selection Ridge regression Lasso Dimension reduction Principal components regression Partial least squares

5 Variable selection Example 1 Consider the Hitters data set in the ISLR library, which contains data on baseball players. We predict a player's Salary on the basis of the available background data. > library(ISLR) > str(Hitters) # structure data.frame : 322 obs. of 20 variables: $ AtBat : int $ Hits : int $ HmRun : int $ Runs : int $ RBI : int $ Walks : int $ Years : int $ CAtBat : int $ CHits : int $ CHmRun : int $ CRuns : int $ CRBI : int $ CWalks : int $ League : Factor w/ 2 levels "A","N": $ Division : Factor w/ 2 levels "E","W": $ PutOuts : int $ Assists : int $ Errors : int $ Salary : num NA $ NewLeague: Factor w/ 2 levels "A","N":

6 Variable selection Best subset Consider first selecting the best subset with respect to a given criterion. Earlier we utilized the car package; here we utilize the leaps package. The function regsubsets(), which is part of the package, can be used to identify the best subset in terms of RSS.ᵃ ᵃ Criterion functions can be derived from the RSS; for example, AIC is AIC_k = log(RSS_k) + 2k/n, (2) where RSS_k is the residual sum of squares for a regression with k explanatory variables.
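As a small sketch of how the criterion in (2) could be computed from the RSS values returned by regsubsets() (assuming the Hitters data with missing Salary values removed, as in the examples below):
library(leaps)
library(ISLR)
Hitters <- na.omit(Hitters)                        # drop rows with missing values
fit <- regsubsets(Salary ~ ., data = Hitters, nvmax = 19)
rss <- summary(fit)$rss                            # RSS_k for k = 1, ..., 19
n <- nrow(Hitters)
aic <- log(rss) + 2 * (1:19) / n                   # criterion (2)
which.min(aic)                                     # subset size minimizing (2)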

7 Variable selection Best subset Regression with all explanatory variables. > summary(lm(Salary ~ ., data = Hitters)) # full regression with all variables Call: lm(formula = Salary ~ ., data = Hitters) Residuals: Min 1Q Median 3Q Max Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) AtBat ** Hits ** HmRun Runs RBI Walks *** Years CAtBat CHits CHmRun CRuns CRBI CWalks * LeagueN DivisionW ** PutOuts *** Assists Errors NewLeagueN Signif. codes: 0 *** ** 0.01 * Residual standard error: on 243 degrees of freedom (59 observations deleted due to missingness) Multiple R-squared: ,Adjusted R-squared: F-statistic: on 19 and 243 DF, p-value: < 2.2e-16

8 Variable selection Best subset The significant t-values suggest that only AtBat, Hits, Walks, CWalks, Division, and PutOuts (and possibly CRuns and Assists, which are significant at the 10% level) have explanatory power. We will see next how this subset compares to the best subsets in terms of different criterion functions.

9 Variable selection Best subset > fit.full <- regsubsets(salary ~., data = Hitters, nvmax = 19) # all subsets > (sm.full <- summary(fit.full)) # with smallest RSS(k), k = 1,..., p, indicate variables included Subset selection object Call: regsubsets.formula(salary ~., data = Hitters, nvmax = 19) 19 Variables (and intercept)... 1 subsets of each size up to 19 Selection Algorithm: exhaustive AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI 1 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " "*" 2 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*" 3 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*" 4 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*" 5 ( 1 ) "*" "*" " " " " " " " " " " " " " " " " " " "*" 6 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " " " "*" 7 ( 1 ) " " "*" " " " " " " "*" " " "*" "*" "*" " " " " 8 ( 1 ) "*" "*" " " " " " " "*" " " " " " " "*" "*" " " 9 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*" "*" 10 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*" "*" 11 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*" "*" 12 ( 1 ) "*" "*" " " "*" " " "*" " " "*" " " " " "*" "*" 13 ( 1 ) "*" "*" " " "*" " " "*" " " "*" " " " " "*" "*" 14 ( 1 ) "*" "*" "*" "*" " " "*" " " "*" " " " " "*" "*" 15 ( 1 ) "*" "*" "*" "*" " " "*" " " "*" "*" " " "*" "*" 16 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*" " " "*" "*" 17 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*" " " "*" "*" 18 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" " " "*" "*" 19 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN 1 ( 1 ) " " " " " " " " " " " " " " 2 ( 1 ) " " " " " " " " " " " " " " 3 ( 1 ) " " " " " " "*" " " " " " " 4 ( 1 ) " " " " "*" "*" " " " " " " 5 ( 1 ) " " " " "*" "*" " " " " " " 6 ( 1 ) " " " " "*" "*" " " " " " " 7 ( 1 ) " " " " "*" "*" " " " " " " 8 ( 1 ) "*" " " "*" "*" " " " " " " 9 ( 1 ) "*" " " "*" "*" " " " " " " 10 ( 1 ) "*" " " "*" "*" "*" " " " " 11 ( 1 ) "*" "*" "*" "*" "*" " " " " 12 ( 1 ) "*" "*" "*" "*" "*" " " " " 13 ( 1 ) "*" "*" "*" "*" "*" "*" " " 14 ( 1 ) "*" "*" "*" "*" "*" "*" " " 15 ( 1 ) "*" "*" "*" "*" "*" "*" " " 16 ( 1 ) "*" "*" "*" "*" "*" "*" " " 17 ( 1 ) "*" "*" "*" "*" "*" "*" "*" 18 ( 1 ) "*" "*" "*" "*" "*" "*" "*" 19 ( 1 ) "*" "*" "*" "*" "*" "*" "*"

10 Variable selection Best subset An asterisk indicates the inclusion of a variable in a model. > names(sm.full) # objects in sm.full [1] "which" "rsq" "rss" "adjr2" "cp" "bic" "outmat" "obj" Printing, for example, the R-squares shows the highest values for the best combinations of k explanatory variables, k = 1,..., p (p = 19). > round(sm.full$rsq, digits = 3) # show R-squares (in 3 decimals) [1] [13] Thus, for example, the highest R² with one explanatory variable is 32.1%, attained when CRBI is in the regression.

11 Variable selection Best subset Plotting R², adjusted R², C_p (here equivalent to AIC), and BIC for all subset sizes can be used to decide the final model. > par(mfrow = c(2, 2)) > plot(x = 1:19, y = sm.full$rsq, type = "l", col = "steelblue", main = "R-squares", + xlab = "N of Variables", ylab = "R-squared") > plot(x = 1:19, y = sm.full$adjr2, type = "l", col = "steelblue", main = "Adjusted R-squares", + xlab = "N of Variables", ylab = "Adjusted R-squared") > (k.best <- which.max(sm.full$adjr2)) # model with best adj R-square [1] 11 > points(k.best, sm.full$adjr2[k.best], col = "red", cex = 2, pch = 20) # show the maximum > plot(x = 1:19, y = sm.full$cp, type = "l", col = "steelblue", main = "Cp", + xlab = "N of Variables", ylab = "Cp") > (k.best <- which.min(sm.full$cp)) # model with the smallest Cp [1] 10 > points(k.best, sm.full$cp[k.best], col = "red", cex = 2, pch = 20) # show the minimum > plot(x = 1:19, y = sm.full$bic, type = "l", col = "steelblue", main = "BIC", + xlab = "N of Variables", ylab = "BIC") > (k.best <- which.min(sm.full$bic)) # model with the smallest BIC [1] 6 > points(k.best, sm.full$bic[k.best], col = "red", cex = 2, pch = 20)

12 Variable selection Best subset [Figure: R-squared, adjusted R-squared, Cp, and BIC plotted against the number of variables, with the best model size marked in each panel.]

13 Variable selection Best subset For example, the six variables selected by BIC and the corresponding fitted model are: > names(coef(fit.full, 6)) [1] "(Intercept)" "AtBat" "Hits" "Walks" "CRBI" [6] "DivisionW" "PutOuts" > summary(lm(Salary ~ AtBat + Hits + Walks + CRBI + Division + PutOuts, data = Hitters)) Call: lm(formula = Salary ~ AtBat + Hits + Walks + CRBI + Division + PutOuts, data = Hitters) Residuals: Min 1Q Median 3Q Max Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) AtBat *** Hits e-06 *** Walks ** CRBI < 2e-16 *** DivisionW ** PutOuts *** --- Signif. codes: 0 *** ** 0.01 * Residual standard error: on 256 degrees of freedom (59 observations deleted due to missingness) Multiple R-squared: ,Adjusted R-squared: F-statistic: on 6 and 256 DF, p-value: < 2.2e-16 There are differences compared with the variables significant in the full model (e.g., the initially non-significant CRBI is included, while the initially significant CWalks is not).

14 Variable selection Validation set approach In particular, if our objective is to use the model for prediction, we can use a validation set to identify the best predictors. First split the sample into a training set and a test (validation) set, estimate the best explanatory-variable subsets of sizes 1, 2, ..., p using the training set, and find the subset size with the minimum test set MSE. > sum(is.na(Hitters$Salary)) # number of missing Salary values [1] 59 > Hitters <- na.omit(Hitters) # drop all rows with missing values > sum(is.na(Hitters)) [1] 0 > set.seed(2) # initialize random seed for exact replication ## a random vector of TRUEs and FALSEs with length equaling the rows in Hitters ## and about one half of the values are TRUE > train <- sample(x = c(TRUE, FALSE), size = nrow(Hitters), replace = TRUE) # > mean(train) # proportion of TRUEs [1] > test <- !train # complement set to identify the test set > mean(test) # fraction of observations in the test set [1]

15 Variable selection Validation set approach fit.best <- regsubsets(salary ~., Hitters[train, ], nvmax = 19) # best fitting models > test.mat <- model.matrix(salary ~., data = Hitters[test, ]) # generate model matrix > head(test.mat) # a few first lines of test.mat (Intercept) AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits -Alvin Davis Andre Dawson Alfredo Griffin Al Newman Andres Thomas Alan Trammell CHmRun CRuns CRBI CWalks LeagueN DivisionW PutOuts Assists -Alvin Davis Andre Dawson Alfredo Griffin Al Newman Andres Thomas Alan Trammell Errors NewLeagueN -Alvin Davis Andre Dawson 3 1 -Alfredo Griffin Al Newman 7 0 -Andres Thomas Alan Trammell 22 0 The function model.matrix() generates constant vector for the intercept and transforms factor variables to 0/1 dummy vectors by indicating also which class is labeled by 1.

16 Variable selection Validation set approach > test.mse <- double(19) # vector of length 19 for validation set MSEs > for (i in 1:length(test.mse)) { + betai <- coef(object = fit.best, id = i) # extract coefficients of the model with i x-vars + pred.salary <- test.mat[, names(betai)] %*% betai # pred y = X beta + test.mse[i] <- mean((Hitters$Salary[test] - pred.salary)^2) + } # end for > test.mse # print results [1] [8] [15] > which.min(test.mse) # find the minimum [1] 10 > coef(fit.best, id = 10) # slope coefficients of the best fitting model (Intercept) AtBat Hits Walks CAtBat CRuns CRBI CWalks DivisionW PutOuts Assists > coef(fit.full, id = 10) # slope coefficients of the best set of 10 variables from the full data set (Intercept) AtBat Hits Walks CAtBat CRuns CRBI CWalks DivisionW PutOuts Assists Thus, the best model is the one with 10 predictors. This set of predictors is also the best set of 10 variables from the full data, and is the set that would be selected by the Cp criterion. The results, however, can differ for different training and test sets (in fact, if we initialized the random generator with set.seed(1), as in the book, a slightly different set of predictors would have been selected).

17 Variable selection Validation set approach Remark 1 The practice is that the size k of the best set of predictors (here k = 10) is selected on the basis of the validation approach, while the final best k predictors are selected from the full sample and the corresponding regression is estimated (again from the full sample). Thus, the predictors in the final model may differ from the best predictors of the validation step; only their number is the same (above, the two sets of best predictors happened to coincide).
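A minimal sketch of this practice, assuming the objects Hitters and test.mse from the preceding slides are available:
k.best <- which.min(test.mse)                                   # subset size chosen on the validation set
fit.full <- regsubsets(Salary ~ ., data = Hitters, nvmax = 19)  # re-run best subset selection on the full sample
coef(fit.full, id = k.best)                                     # final model: the best k.best predictors, estimated from the full sample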

18 Variable selection Cross-validation approach In the same manner as in the validation set approach, we can identify the size of the set of best predictors on the basis of cross-validation. We demonstrate here k-fold CV with k = 10. First create a vector that indicates which of the 10 folds each observation belongs to. > set.seed(1) # for exact replication > k <- 10 # n of folds > folds <- sample(x = 1:k, size = nrow(Hitters), replace = TRUE) # randomly formed k folds > head(folds) [1] Thus, here the first observation falls into fold 3, the second into fold 4, etc. The following function produces predictions. > predict.regsubsets <- function(obj, # object produced by regsubsets() + newdata, # out of sample data + id, # id of the model + ... # potential additional arguments if needed + ) { # Source: James et al. (2013) ISL + if (class(obj) != "regsubsets") stop("obj must be produced by the regsubsets() function!") + fmla <- as.formula(obj$call[[2]]) # extract formula from the obj object + beta <- coef(object = obj, id = id) # coefficients corresponding to model id + xmat <- model.matrix(fmla, newdata) # data matrix for prediction computations + return(xmat[, names(beta)] %*% beta) # return predictions + } # pred.regsubsets

19 Variable selection Cross-validation approach Next we loop through the k folds and the best predictor sets to compute the MSEs. > cv.mse <- matrix(nrow = k, ncol = 19, dimnames = list(1:k, 1:19)) # matrix to store MSE-values > for (i in 1:k) { # over validation sets + best.fits <- regsubsets(Salary ~ ., data = Hitters[folds != i, ], nvmax = 19) + y <- Hitters$Salary[folds == i] # new y values + for (j in 1:19) { # MSEs over n of predictors + ypred <- predict.regsubsets(best.fits, newdata = Hitters[folds == i, ], id = j) # predictions + cv.mse[i, j] <- mean((y - ypred)^2) # store MSEs into cv.mse matrix + } # for j + } # for i MSEs for a j-predictor model, j = 1, ..., 19: > (mean.cv.mse <- apply(cv.mse, 2, mean)) # mean mse values, parentheses print the results > plot(mean.cv.mse, type = "b", col = "red", xlab = "N of Predictors", ylab = "MSE") Models with 10 and 11 predictors are close to each other; however, the 11-predictor model is slightly better, as also shown by the following figure, so a model with 11 predictors would be our choice. > coef(regsubsets(Salary ~ ., data = Hitters, nvmax = 19), id = 11) # the best model (Intercept) AtBat Hits Walks CAtBat CRuns CRBI CWalks LeagueN DivisionW PutOuts Assists > plot(mean.cv.mse, type = "b", col = "red", xlab = "N of Predictors", ylab = "MSE", main = "Best 10-fold Cross-Validation MSEs\nfor Different Number of Predictors")

20 Variable selection Cross-validation approach [Figure: best 10-fold cross-validation MSEs plotted against the number of predictors.]

21 1 Model Selection and Regularization Variable selection Ridge regression Lasso Dimension reduction Principal components regression Partial least squares

22 Shrinkage methods constrain or regularize the coefficient estimates, or equivalently shrink the coefficient estimates towards zero. It turns out that shrinking the estimated coefficients towards zero can significantly reduce their variance. The two best-known techniques are ridge regression and the lasso (least absolute shrinkage and selection operator, introduced into statistics by Tibshirani 2). 2 Tibshirani, Robert (1996). Regression shrinkage and selection via the lasso, Journal of the Royal Statistical Society, Series B, 58.

23 1 Model Selection and Regularization Variable selection Ridge regression Lasso Dimension reduction Principal components regression Partial least squares

24 Ridge regression OLS estimates the regression coefficients by minimizing RSS = Σ_{i=1}^n ( y_i − β_0 − Σ_{j=1}^p β_j x_{ij} )². (3) Ridge regression estimates β̂^R_λ are obtained by minimizing Σ_{i=1}^n ( y_i − β_0 − Σ_{j=1}^p β_j x_{ij} )² + λ Σ_{j=1}^p β_j² = RSS + λ Σ_{j=1}^p β_j², (4) where λ ≥ 0 is a tuning parameter, to be determined separately.
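Ridge regression itself is not demonstrated in R on these slides; the following is only a sketch of how the ridge criterion (4) could be minimized with the glmnet package (used for the lasso in Example 2 below) by setting alpha = 0. The x matrix and y vector are constructed as in Example 2.
library(glmnet)
library(ISLR)
Hitters <- na.omit(Hitters)
x <- model.matrix(Salary ~ ., Hitters)[, -1]         # predictor matrix without the intercept column
y <- Hitters$Salary
grid <- 10^seq(10, -2, length = 100)                 # a wide grid of tuning parameter values
ridge.mod <- glmnet(x, y, alpha = 0, lambda = grid)  # alpha = 0 gives the ridge penalty in (4)
coef(ridge.mod)[, 50]                                # coefficient estimates at the 50th lambda value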

25 Ridge regression The term λ Σ_{j=1}^p β_j² is called a shrinkage penalty; it becomes smaller the closer the β_j's are to zero. Thus, like OLS, ridge regression seeks an optimal fit by finding coefficients that minimize the RSS, but at the same time it penalizes large coefficients, so that at the optimum the coefficients are shrunken towards zero. The tuning parameter λ ≥ 0 serves to control the relative impact of these two terms; λ = 0 leads to OLS, while λ → ∞ drives the coefficient estimates towards zero. Selecting a good value for λ is obviously critical and depends on the application. Note that shrinkage is not applied to β_0.

26 Ridge regression Ridge regression's advantage over OLS is rooted in the bias-variance trade-off (in the MSE). As λ increases, the flexibility of the ridge regression fit decreases, leading to decreased variance but increased bias. By a suitable selection of λ the variance can decrease more than the bias increases, which leads to a gain over OLS. Ridge regression works best in cases where the OLS estimates have high variance (e.g., in the case of high multicollinearity, or when the number of explanatory variables is large relative to n). Remark 2 Because the slope coefficients β_j depend on the scale of the explanatory variables x_j, the common practice is to standardize the explanatory variables by scaling them by their (sample) standard deviations, i.e., to use the transformations x̃_j = x_j / s_j, where s_j is the (sample) standard deviation of variable x_j.
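A small sketch of the standardization in Remark 2, assuming the x matrix of Example 2 below (glmnet(), used there, performs this scaling internally by default through its standardize argument):
x.std <- scale(x, center = FALSE, scale = apply(x, 2, sd))  # x_j / s_j for each column
round(apply(x.std, 2, sd), 3)                               # every scaled column now has standard deviation 1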

27 1 Model Selection and Regularization Variable selection Ridge regression Lasso Dimension reduction Principal components regression Partial least squares

28 Lasso The lasso is a relatively recent alternative to ridge regression. The lasso estimates β̂^L_λ of regression (1) are obtained by minimizing Σ_{i=1}^n ( y_i − β_0 − Σ_{j=1}^p β_j x_{ij} )² + λ Σ_{j=1}^p |β_j| = RSS + λ Σ_{j=1}^p |β_j|. (5) The difference from ridge estimation is that in the penalty β_j² is replaced by |β_j| (i.e., the lasso uses the ℓ_1-norm and ridge the ℓ_2-norm). As with ridge, the lasso also shrinks the coefficients towards zero. However, unlike the ℓ_2 penalty, the ℓ_1 penalty has the effect of forcing some coefficients to be exactly equal to zero when the tuning parameter λ is large enough. Again, as in ridge regression, selecting λ is critical.

29 Lasso The property that the ℓ_1 penalty can force some coefficients to be exactly zero gives the lasso the variable selection property ("selection operator", i.e., the "so") indicated in its name. Ridge regression lacks this property, as it does not force any of the initially non-zero coefficients to zero.

30 Lasso Remark 3 A combination of ridge regression and the lasso, called elastic-net regularization, estimates the regression coefficients β_0, β_1, β_2, ..., β_p by minimizing Σ_{i=1}^n ( y_i − β_0 − x_i'β )² + λ P_α(β), (6) where x_i = (x_{i1}, ..., x_{ip})', β = (β_1, ..., β_p)', and P_α(β) = Σ_{j=1}^p ( ½(1 − α)β_j² + α|β_j| ). (7) Thus, α = 0 implies ridge regression and α = 1 the lasso.
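Elastic-net estimation is available in the same glmnet package used for the lasso in Example 2 below; a minimal sketch, assuming the x and y objects of Example 2, where alpha between 0 and 1 mixes the two penalties as in (7):
library(glmnet)
enet.mod <- glmnet(x, y, alpha = 0.5)    # equal weight on the ridge and lasso penalty terms
plot(enet.mod, xvar = "lambda")          # coefficient paths as a function of log(lambda)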

31 Another formulation for ridge regression and lasso The lasso and ridge regression solve the problems min_β { Σ_{i=1}^n (y_i − x_i'β)² } s.t. Σ_{j=1}^p |β_j| ≤ s (8) and min_β { Σ_{i=1}^n (y_i − x_i'β)² } s.t. Σ_{j=1}^p β_j² ≤ s, (9) where x_i = (1, x_{i1}, ..., x_{ip})', β = (β_0, β_1, ..., β_p)', and x_i'β = β_0 + Σ_{j=1}^p β_j x_{ij}. That is, for every λ there is some s such that equations (8) and (9) give the same lasso and ridge coefficients, respectively, and vice versa.

32 Another formulation for ridge regression and lasso For example, the lasso restriction in (8) can be thought of as a budget constraint that defines how large Σ_{j=1}^p |β_j| can be. Formulate the constraints of the lasso and ridge in equations (8) and (9) instead as Σ_{j=1}^p I(β_j ≠ 0) ≤ s, (10) where I(β_j ≠ 0) is an indicator function equaling 1 if β_j ≠ 0 and zero otherwise. Then the RSS is minimized under the constraint that no more than s coefficients can be nonzero, i.e., the problem becomes a best subset selection problem.

33 Variable selection property of lasso The figure below illustrates in the case of two variables the situation in which lasso tends to have a variable selection property, while ridge regression does not (Source: James et al. 2013, Fig 6.7).

34 Variable selection property of lasso Thus, if the OLS estimate (β̂_1, β̂_2) lies outside the regions |β_1| + |β_2| ≤ s and β_1² + β_2² ≤ s, the lasso can reach the boundary at a corner, at which one of the coefficients equals zero. The ellipses depict values at which the RSS remains constant. If the OLS solution lies inside the diamond (lasso) or the circle (ridge), the budget constraint is satisfied by the OLS solution and the lasso and ridge regression give the same values as OLS.

35 Selecting the tuning parameter λ The tuning parameter is chosen as follows: choose a grid of λ values; compute the cross-validation error for each value of λ; select the value of λ for which the cross-validation error is the smallest. Finally, the model is re-fit using all of the available observations and the selected value of the tuning parameter. Remark 4 Because the OLS estimate β̂ corresponds to the lasso (and ridge) with λ = 0, and OLS minimizes RSS = Σ_{i=1}^n (y_i − x_i'β̂)² in the training sample, we have Σ_{i=1}^n (y_i − x_i'β̂^L_λ)² ≥ RSS in the training sample for any λ > 0. Therefore, finding the optimal λ must be based on some sort of out-of-sample computation, like cross-validation.
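A sketch of these steps with the cv.glmnet() function used later in Example 2 (the x and y objects are constructed there; the grid of λ values here is only an illustrative choice):
library(glmnet)
grid <- 10^seq(2, -3, length = 100)                      # grid of candidate lambda values
cv.fit <- cv.glmnet(x, y, alpha = 1, lambda = grid)      # 10-fold CV error for each lambda
(best.lam <- cv.fit$lambda.min)                          # lambda with the smallest CV error
final.fit <- glmnet(x, y, alpha = 1, lambda = best.lam)  # re-fit using all observations
coef(final.fit)                                          # final coefficient estimates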

36 Example 2 We utilize again the Hitters data set to demonstrate the lasso. R has the glmnet package for performing lasso, ridge regression, and more generally elastic-net regularization estimation. The main function of the package is glmnet(), which does not allow missing values, and R factor variables must first be transformed to 0/1 dummy variables. > library(ISLR) > library(glmnet) > head(Hitters) AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun -Andy Allanson Alan Ashby Alvin Davis Andre Dawson Andres Galarraga Alfredo Griffin CRuns CRBI CWalks League Division PutOuts Assists Errors -Andy Allanson A E Alan Ashby N W Alvin Davis A W Andre Dawson N E Andres Galarraga N E Alfredo Griffin A W Salary NewLeague -Andy Allanson NA A -Alan Ashby N -Alvin Davis A -Andre Dawson N -Andres Galarraga 91.5 N -Alfredo Griffin A > Hitters <- na.omit(Hitters) # remove lines with missing values

37 Function glmnet() requires its input as an x matrix (predictors) and a y vector (dependent variable); the formula syntax y ~ x does not work. The model.matrix() function generates the required x matrix and also transforms factor variables to dummy variables. glmnet() performs the lasso when alpha = 1 (which is also the default) and automatically selects a range of λ values. Below we use this automatically generated grid for λ. > x <- model.matrix(Salary ~ ., Hitters)[, -1] # x-variables, drop the constant term vector > y <- Hitters$Salary > head(x) (results omitted) > lasso.mod <- glmnet(x = x, y = y, alpha = 1) ## alpha = 1 performs lasso ## for a range of lambda values

38 > str(lasso.mod) # structure of the object produced by glmnet() List of 12 $ a0 : Named num [1:80] attr(*, "names")= chr [1:80] "s0" "s1" "s2" "s3"... $ beta :Formal class dgcmatrix [package "Matrix"] with 6 slots....@ i : int [1:882] @ p : int [1:81] @ Dim : int [1:2] @ Dimnames:List of $ : chr [1:19] "AtBat" "Hits" "HmRun" "Runs" $ : chr [1:80] "s0" "s1" "s2" "s3" @ x : num [1:882] @ factors : list() $ df : int [1:80] $ dim : int [1:2] $ lambda : num [1:80] $ dev.ratio: num [1:80] $ nulldev : num $ npasses : int 2851 $ jerr : int 0 $ offset : logi FALSE $ call : language glmnet(x = x, y = y, alpha = 1) $ nobs : int attr(*, "class")= chr [1:2] "elnet" "glmnet"

39 Below are shown the number of values in the automatically generated λ-grid and some of the values. > length(lasso.mod$lambda) # number of lambdas [1] 80 > c(min = min(lasso.mod$lambda), max = max(lasso.mod$lambda)) # range of lambdas min max > head(lasso.mod$lambda) # a few first lambdas [1] > tail(lasso.mod$lambda) # a few last lambdas [1] Thus, the regression coefficients are computed for 80 values of λ. These results are stored in the beta matrix of the lasso.mod object and can be extracted directly or using the coef() function. > dim(coef(lasso.mod)) # dimension of the coefficient matrix (beta) [1] > lasso.mod$lambda[50] # the 50th value of lambda [1] > coef(lasso.mod)[, 50] # the corresponding beta estimates (Intercept) AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN > sum(abs(coef(lasso.mod)[-1, 50])) # the corresponding L1-norm [1]

40 The function coef() can be used to produce lasso estimates for any value of λ (the same can be done with the predict() function). For further information, see help(coef.glmnet) and help(predict.glmnet). > drop(coef(lasso.mod, s = 5)) # lasso estimates for any value of lambda (argument s) (Intercept) AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN

41 > par(mfrow = c(1, 2)) > plot(lasso.mod, xvar = "lambda", xlab = expression(lambda)) > plot(lasso.mod, xvar = "norm", xlab = expression(log(sum(abs(beta[j]))))) [Figure: lasso coefficient paths.] The coefficients are plotted against log(λ) in the left panel and against the logarithm of the ℓ_1-norm of the coefficients in the right panel. The numbers at the top of the figure indicate the number of non-zero coefficients. The figure shows that, depending on the choice of λ, some of the coefficients will be exactly equal to zero.

42 Next we use cross-validation to identify the best λ in terms of the MSE. The glmnet package has the function cv.glmnet() for this purpose. > set.seed(1) # for replication purposes > train <- sample(1:nrow(x), nrow(x)/2) # random selection of about ## one half of the observations for a train set > test <- -train # the test set consists of the observations not in the train set > y.test <- y[test] # test y-values > cv.out <- cv.glmnet(x[train, ], y[train], alpha = 1) # by default performs 10-fold CV > plot(cv.out) [Figure: cross-validation mean squared error against log(lambda).]

43 > (best.lam <- cv.out$lambda.min) # lambda for which CV MSE is at minimum [1] > lasso.train <- glmnet(x[train, ], y[train], alpha = 1) # lasso fit on the training set > lasso.pred <- predict(lasso.train, s = best.lam, newx = x[test, ]) # predicted values ## with coefficients corresponding to best.lam > mean((lasso.pred - y.test)^2) # lasso test MSE [1] ## for comparison, compute the test MSE for the OLS model estimated from the training set > ols.train <- lm(Salary ~ ., data = Hitters, subset = train) # OLS train data estimates > ols.pred <- predict(ols.train, newdata = Hitters[test, ]) # OLS prediction > mean((ols.pred - y.test)^2) # OLS test MSE [1] Here the lasso outperforms OLS in terms of test MSE. Below is a scatter plot of the test set realized and predicted salaries.

44 > plot(x = ols.pred, y = y.test, col = "red", xlim = c(0, 2500), ylim = c(0, 2500), + xlab = "Predicted", ylab = "Realized") > abline(lm(y.test ~ ols.pred), col = "red") > points(x = lasso.pred, y = y.test, col = "steelblue") > abline(lm(y.test ~ lasso.pred), col = "steelblue") > abline(a = 0, b = 1, lty = "dashed", col = "gray") > legend("topleft", legend = c("OLS", "Lasso"), col = c("red", "steelblue"), + pch = c(1, 1), bty = "n") [Figure: realized versus predicted test-set salaries for OLS and the lasso.]

45 Dimension reduction 1 Model Selection and Regularization Variable selection Ridge regression Lasso Dimension reduction Principal components regression Partial least squares

46 Dimension reduction Variable selection and shrinkage methods aim to control variance while relying on the original predictors x_1, ..., x_p. Dimension reduction aims to find a smaller number of new variables z_1, ..., z_M, M < p, that are linear combinations of the original predictors, i.e., z_m = Σ_{j=1}^p φ_{jm} x_j (11) for some constants φ_{1m}, ..., φ_{pm}, m = 1, ..., M. Then y is regressed on these new variables, y_i = θ_0 + Σ_{m=1}^M θ_m z_{im} + ɛ_i, i = 1, ..., n. (12) A proper selection of the φ-coefficients in (11) can lead the dimension-reduced regression (12), with M + 1 coefficients, to outperform the original regression of y on the x-variables with p + 1 coefficients.
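As an illustrative sketch of (11)-(12), using the x and y objects of Example 2 and principal components as one possible choice of the φ coefficients:
M <- 5                                 # number of new variables
pc <- prcomp(x, scale. = TRUE)         # linear combinations of the standardized predictors
z <- pc$x[, 1:M]                       # scores z_1, ..., z_M as in (11)
summary(lm(y ~ z))                     # regression (12) with M + 1 coefficients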

47 Dimension reduction 1 Model Selection and Regularization Variable selection Ridge regression Lasso Dimension reduction Principal components regression Partial least squares

48 Dimension reduction Principal components Principal component analysis (PCA) is a popular approach for reducing the dimension of the original p variables to a low number M ≪ p of linear combinations of the form (11), such that these linear combinations capture the variability of the original x-variables. Given an n × p data matrix X, PCA derives linear combinations of the variables such that: the first principal component defines the direction of the data along which the observations vary the most; the second principal component defines a direction that is orthogonal to the first one and along which the data vary the most among directions orthogonal to the first; the third principal component is defined such that it is orthogonal to the first and second and captures most of the remaining variability. Generally, the mth principal component is defined such that it is orthogonal to the earlier m − 1 components and captures most of the remaining variability.

49 Dimension reduction Principal components Mathematically this reduces to finding the eigenvalues of the covariance matrix cov(X) = Σ. The eigenvector of the largest eigenvalue is the coefficient vector of the first component, the eigenvector of the second largest eigenvalue is the coefficient vector of the second component, and so forth. The eigenvectors are normed to unity, i.e., if φ_m = (φ_{1m}, ..., φ_{pm})' is the mth eigenvector, then φ_m'φ_m = 1. Because the eigenvalue problem is scale dependent, the solution is most often extracted from the correlation matrix (i.e., the covariance matrix of the standardized variables).
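A sketch of this computation in R (x again as in Example 2): the loading vectors are the unit-norm eigenvectors of the correlation matrix, and prcomp() returns the same directions.
ev <- eigen(cor(x))                          # eigen decomposition of the correlation matrix
phi1 <- ev$vectors[, 1]                      # loadings of the first principal component
sum(phi1^2)                                  # normed to unity
ev$values[1] / sum(ev$values)                # share of the total variance captured by the first PC
head(prcomp(x, scale. = TRUE)$rotation[, 1]) # the same loadings (possibly with opposite sign)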

50 Dimension reduction Principal components The figure below illustrates the PC solution in the case of two variables, population size (pop) and ad spending (ad) (Source: James et al. 2013, Fig 6.14). [Figure: ad spending against population with the principal component directions.] The green line is the first PC and the dashed line is the second.

51 Dimension reduction Principal components The computed scores z_{im} = φ_{1m}(x_{i1} − x̄_1) + ⋯ + φ_{pm}(x_{ip} − x̄_p), (13) i = 1, ..., n, are called the principal component scores. Another interpretation of PCA is that the first PC vector defines the line that is as close as possible to the data. This is illustrated by the figure below (Source: James et al. 2013, Fig 6.15). [Figure: ad spending against population with the first and second principal component directions.]

52 Dimension reduction Principal components Plotting the components against the original variables illustrates graphically how well a component represents the variability of that variable. For the advertising data the first PC represents both variables well, while the second PC is not much related to either of the variables (Source: James et al. 2013, Fig 6.16 & 6.17). [Figure: population and ad spending plotted against the first principal component.]

53 Dimension reduction Principal components [Figure: population and ad spending plotted against the second principal component.] Because the first PC captures virtually all of the variability in the two variables of the advertising data, the component reflects jointly the population size and the advertising spending of the cities. The components are linear combinations of the demeaned original variables (eq. (13)). Therefore, negative values reflect below-average population size and below-average advertising budget, values close to zero reflect average size and budget, and positive values reflect above-average population size and budget.

54 Dimension reduction Principal component regression Principal component regression (PCR) involves first constructing the M PCs and then regressing y on the components. The key is that M ≪ p. If M = p, then PCR amounts to the same fit as the least squares fit on the original variables. M, the number of components for PCR, is typically selected by cross-validation. Also, when using PCR, it is generally recommended to standardize the x-variables before computing the PCs.

55 Dimension reduction Principal component regression Example 3 Consider again the Hitters data set. > library(pls) # the pls library contains pcr regression and more > library(ISLR) > Hitters <- na.omit(Hitters) # remove missing values > pcr.fit <- pcr(Salary ~ ., data = Hitters, scale = TRUE, # standardize x-variables + validation = "CV") # use CV to select n of components, default 10-fold > summary(pcr.fit) # summary of the results Data: X dimension: Y dimension: Fit method: svdpc Number of components considered: 19 VALIDATION: RMSEP Cross-validated using 10 random segments. (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps CV adjcv comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps CV adjcv comps 15 comps 16 comps 17 comps 18 comps 19 comps CV adjcv TRAINING: % variance explained 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps X Salary comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps X Salary comps 17 comps 18 comps 19 comps X Salary

56 Dimension reduction Principal component regression The RMSEP is the square root of the CV MSE, and the smallest value is reached with six PCs, as is also shown by the plot of the RMSEPs. Six components explain % of the total variation among the predictors. > validationplot(pcr.fit, val.type = "RMSEP") # plot RMSEPs [Figure: cross-validated RMSEP for Salary against the number of components.]

57 Dimension reduction Principal component regression Next we demonstrate in terms of test MSE how the PCR regression performs by using a test set. First define the best number of components from the training set, estimate the corresponding regression, and compute the test MSE. > set.seed(1) # for exact replication > train <- sample(nrow(Hitters), size = nrow(Hitters) / 2) # training set > head(train) # examples of observations in the training set [1] > test <- -train # the test set consists of the observations not in the train set > pcr.train <- pcr(Salary ~ ., data = Hitters, subset = train, scale = TRUE, + validation = "CV") # training set > validationplot(pcr.train, val.type = "RMSEP") > summary(pcr.train) Data: X dimension: Y dimension: Fit method: svdpc Number of components considered: 19 VALIDATION: RMSEP Cross-validated using 10 random segments. (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps CV adjcv comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps CV adjcv comps 15 comps 16 comps 17 comps 18 comps 19 comps CV adjcv TRAINING: % variance explained 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps X Salary comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps X Salary comps 17 comps 18 comps 19 comps X Salary

58 Dimension reduction Principal component regression On the basis of the training-data CV RMSEP, M = 7 principal components yields the best results. Next compute the test MSE. > pcr.pred <- predict(pcr.train, Hitters[test, ], ncomp = 7) > head(pcr.pred) # a few first predictions [1] > mean((pcr.pred - Hitters$Salary[test])^2) # test set MSE [1] This test set MSE is slightly smaller than that of the lasso (114,470.6). PCR is useful for prediction purposes; if interpretation of the model is needed, PCR results may be difficult to interpret. Finally, estimating the M = 7 component model from the full data set yields the following results. > summary(pcr.fit <- pcr(Salary ~ ., data = Hitters, scale = TRUE, ncomp = 7)) # estimate and summarize Data: X dimension: Y dimension: Fit method: svdpc Number of components considered: 7 TRAINING: % variance explained 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps X Salary

59 Dimension reduction 1 Model Selection and Regularization Variable selection Ridge regression Lasso Dimension reduction Principal components regression Partial least squares

60 Dimension reduction Partial least squares PCR focuses on reducing the dimension of the predictors without using the help of the dependent variable y. In this sense PCR can be considered unsupervised learning. As a result, nothing guarantees that the new directions really help in predicting y. Partial least squares (PLS) can be considered a supervised alternative to PCR. Similar to PCR, PLS seeks to identify new features z_1, ..., z_M (M < p) that are linear combinations of the original variables, but unlike the PCs, they are also related to the response y. In short, PLS attempts to find directions that help explain both the response and the predictors.

61 Dimension reduction Partial least squares In PLS the predictors are first standardized. The first PLS direction z_1 is defined by regressing y on each predictor x_j separately, one at a time, and using the resulting coefficients as the weights φ_{j1} to define z_1 = φ_{11} x_1 + ⋯ + φ_{p1} x_p. The second PLS direction is defined by first regressing each x_j on z_1 and taking the residuals (i.e., the variation of x_j not explained by z_1). Using these orthogonalized data, z_2 is formed in the same fashion as z_1 was formed from the original data. This iterative procedure is repeated M times to produce the PLS components z_1, ..., z_M. M is chosen by cross-validation.
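A minimal sketch of the first PLS direction, using the x and y objects of Example 2: the weight of each predictor is the slope from the simple regression of y on that (standardized) predictor.
x.std <- scale(x)                                           # standardized predictors
phi1 <- apply(x.std, 2, function(xj) coef(lm(y ~ xj))[2])   # simple-regression slopes as weights
z1 <- drop(x.std %*% phi1)                                  # first PLS direction z_1
cor(z1, y)                                                  # z_1 is constructed to be related to y
library(pls)
cor(z1, plsr(y ~ x.std)$scores[, 1])                        # agrees with the first plsr() component up to scale and sign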

62 Dimension reduction Partial least squares Example 4 Continuing the previous examples, we use plsr() function of the pls library. > set.seed(1) # initialize again the seed > pls.train <- plsr(salary ~., data = Hitters, subset = train, scale = TRUE, + validation = "CV") > summary(pls.train) Data: X dimension: Y dimension: Fit method: kernelpls Number of components considered: 19 VALIDATION: RMSEP Cross-validated using 10 random segments. (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps CV adjcv comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps CV adjcv comps 15 comps 16 comps 17 comps 18 comps 19 comps CV adjcv TRAINING: % variance explained 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps X Salary comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps X Salary comps 17 comps 18 comps 19 comps X Salary > validationplot(pls.train, val.type = "RMSEP")

63 Dimension reduction Partial least squares M = 2 produces the smallest CV RMSEP. [Figure: cross-validated RMSEP for Salary against the number of components.] > pls.pred <- predict(pls.train, newdata = Hitters[test, ], ncomp = 2) > mean((Hitters$Salary[test] - pls.pred)^2) [1] The test MSE is comparable to but slightly higher than that of PCR.


More information

Chris Fraley and Daniel Percival. August 22, 2008, revised May 14, 2010

Chris Fraley and Daniel Percival. August 22, 2008, revised May 14, 2010 Model-Averaged l 1 Regularization using Markov Chain Monte Carlo Model Composition Technical Report No. 541 Department of Statistics, University of Washington Chris Fraley and Daniel Percival August 22,

More information

Regularization and Variable Selection via the Elastic Net

Regularization and Variable Selection via the Elastic Net p. 1/1 Regularization and Variable Selection via the Elastic Net Hui Zou and Trevor Hastie Journal of Royal Statistical Society, B, 2005 Presenter: Minhua Chen, Nov. 07, 2008 p. 2/1 Agenda Introduction

More information

PENALIZED PRINCIPAL COMPONENT REGRESSION. Ayanna Byrd. (Under the direction of Cheolwoo Park) Abstract

PENALIZED PRINCIPAL COMPONENT REGRESSION. Ayanna Byrd. (Under the direction of Cheolwoo Park) Abstract PENALIZED PRINCIPAL COMPONENT REGRESSION by Ayanna Byrd (Under the direction of Cheolwoo Park) Abstract When using linear regression problems, an unbiased estimate is produced by the Ordinary Least Squares.

More information

Introduction to the genlasso package

Introduction to the genlasso package Introduction to the genlasso package Taylor B. Arnold, Ryan Tibshirani Abstract We present a short tutorial and introduction to using the R package genlasso, which is used for computing the solution path

More information

PRINCIPAL COMPONENTS ANALYSIS

PRINCIPAL COMPONENTS ANALYSIS 121 CHAPTER 11 PRINCIPAL COMPONENTS ANALYSIS We now have the tools necessary to discuss one of the most important concepts in mathematical statistics: Principal Components Analysis (PCA). PCA involves

More information

Consider fitting a model using ordinary least squares (OLS) regression:

Consider fitting a model using ordinary least squares (OLS) regression: Example 1: Mating Success of African Elephants In this study, 41 male African elephants were followed over a period of 8 years. The age of the elephant at the beginning of the study and the number of successful

More information

COS513: FOUNDATIONS OF PROBABILISTIC MODELS LECTURE 10

COS513: FOUNDATIONS OF PROBABILISTIC MODELS LECTURE 10 COS53: FOUNDATIONS OF PROBABILISTIC MODELS LECTURE 0 MELISSA CARROLL, LINJIE LUO. BIAS-VARIANCE TRADE-OFF (CONTINUED FROM LAST LECTURE) If V = (X n, Y n )} are observed data, the linear regression problem

More information

Introduction and Background to Multilevel Analysis

Introduction and Background to Multilevel Analysis Introduction and Background to Multilevel Analysis Dr. J. Kyle Roberts Southern Methodist University Simmons School of Education and Human Development Department of Teaching and Learning Background and

More information

Linear Regression Linear Regression with Shrinkage

Linear Regression Linear Regression with Shrinkage Linear Regression Linear Regression ith Shrinkage Introduction Regression means predicting a continuous (usually scalar) output y from a vector of continuous inputs (features) x. Example: Predicting vehicle

More information

Regression I: Mean Squared Error and Measuring Quality of Fit

Regression I: Mean Squared Error and Measuring Quality of Fit Regression I: Mean Squared Error and Measuring Quality of Fit -Applied Multivariate Analysis- Lecturer: Darren Homrighausen, PhD 1 The Setup Suppose there is a scientific problem we are interested in solving

More information

Introduction to Statistical modeling: handout for Math 489/583

Introduction to Statistical modeling: handout for Math 489/583 Introduction to Statistical modeling: handout for Math 489/583 Statistical modeling occurs when we are trying to model some data using statistical tools. From the start, we recognize that no model is perfect

More information

Stat 502X Exam 1 Spring 2014

Stat 502X Exam 1 Spring 2014 Stat 502X Exam 1 Spring 2014 I have neither given nor received unauthorized assistance on this exam. Name Signed Date Name Printed This is a long exam consisting of 11 parts. I'll score it at 10 points

More information

Regression Shrinkage and Selection via the Lasso

Regression Shrinkage and Selection via the Lasso Regression Shrinkage and Selection via the Lasso ROBERT TIBSHIRANI, 1996 Presenter: Guiyun Feng April 27 () 1 / 20 Motivation Estimation in Linear Models: y = β T x + ɛ. data (x i, y i ), i = 1, 2,...,

More information

Solutions to obligatorisk oppgave 2, STK2100

Solutions to obligatorisk oppgave 2, STK2100 Solutions to obligatorisk oppgave 2, STK2100 Vinnie Ko May 14, 2018 Disclaimer: This document is made solely for my own personal use and can contain many errors. Oppgave 1 We load packages and read data

More information

Prediction problems 3: Validation and Model Checking

Prediction problems 3: Validation and Model Checking Prediction problems 3: Validation and Model Checking Data Science 101 Team May 17, 2018 Outline Validation Why is it important How should we do it? Model checking Checking whether your model is a good

More information

Applied Regression Analysis

Applied Regression Analysis Applied Regression Analysis Chapter 3 Multiple Linear Regression Hongcheng Li April, 6, 2013 Recall simple linear regression 1 Recall simple linear regression 2 Parameter Estimation 3 Interpretations of

More information

R in Linguistic Analysis. Wassink 2012 University of Washington Week 6

R in Linguistic Analysis. Wassink 2012 University of Washington Week 6 R in Linguistic Analysis Wassink 2012 University of Washington Week 6 Overview R for phoneticians and lab phonologists Johnson 3 Reading Qs Equivalence of means (t-tests) Multiple Regression Principal

More information

EXTENDING PARTIAL LEAST SQUARES REGRESSION

EXTENDING PARTIAL LEAST SQUARES REGRESSION EXTENDING PARTIAL LEAST SQUARES REGRESSION ATHANASSIOS KONDYLIS UNIVERSITY OF NEUCHÂTEL 1 Outline Multivariate Calibration in Chemometrics PLS regression (PLSR) and the PLS1 algorithm PLS1 from a statistical

More information

Continuous soil attribute modeling and mapping: Multiple linear regression

Continuous soil attribute modeling and mapping: Multiple linear regression Continuous soil attribute modeling and mapping: Multiple linear regression Soil Security Laboratory 2017 1 Multiple linear regression Multiple linear regression (MLR) is where we regress a target variable

More information

Biostatistics Advanced Methods in Biostatistics IV

Biostatistics Advanced Methods in Biostatistics IV Biostatistics 140.754 Advanced Methods in Biostatistics IV Jeffrey Leek Assistant Professor Department of Biostatistics jleek@jhsph.edu Lecture 12 1 / 36 Tip + Paper Tip: As a statistician the results

More information

Density Temp vs Ratio. temp

Density Temp vs Ratio. temp Temp Ratio Density 0.00 0.02 0.04 0.06 0.08 0.10 0.12 Density 0.0 0.2 0.4 0.6 0.8 1.0 1. (a) 170 175 180 185 temp 1.0 1.5 2.0 2.5 3.0 ratio The histogram shows that the temperature measures have two peaks,

More information

MSA220/MVE440 Statistical Learning for Big Data

MSA220/MVE440 Statistical Learning for Big Data MSA220/MVE440 Statistical Learning for Big Data Lecture 7/8 - High-dimensional modeling part 1 Rebecka Jörnsten Mathematical Sciences University of Gothenburg and Chalmers University of Technology Classification

More information

A Blockwise Descent Algorithm for Group-penalized Multiresponse and Multinomial Regression

A Blockwise Descent Algorithm for Group-penalized Multiresponse and Multinomial Regression A Blockwise Descent Algorithm for Group-penalized Multiresponse and Multinomial Regression Noah Simon Jerome Friedman Trevor Hastie November 5, 013 Abstract In this paper we purpose a blockwise descent

More information

Iterative Selection Using Orthogonal Regression Techniques

Iterative Selection Using Orthogonal Regression Techniques Iterative Selection Using Orthogonal Regression Techniques Bradley Turnbull 1, Subhashis Ghosal 1 and Hao Helen Zhang 2 1 Department of Statistics, North Carolina State University, Raleigh, NC, USA 2 Department

More information

CMSC858P Supervised Learning Methods

CMSC858P Supervised Learning Methods CMSC858P Supervised Learning Methods Hector Corrada Bravo March, 2010 Introduction Today we discuss the classification setting in detail. Our setting is that we observe for each subject i a set of p predictors

More information

MS&E 226. In-Class Midterm Examination Solutions Small Data October 20, 2015

MS&E 226. In-Class Midterm Examination Solutions Small Data October 20, 2015 MS&E 226 In-Class Midterm Examination Solutions Small Data October 20, 2015 PROBLEM 1. Alice uses ordinary least squares to fit a linear regression model on a dataset containing outcome data Y and covariates

More information

A simulation study of model fitting to high dimensional data using penalized logistic regression

A simulation study of model fitting to high dimensional data using penalized logistic regression A simulation study of model fitting to high dimensional data using penalized logistic regression Ellinor Krona Kandidatuppsats i matematisk statistik Bachelor Thesis in Mathematical Statistics Kandidatuppsats

More information

Stat 401B Final Exam Fall 2015

Stat 401B Final Exam Fall 2015 Stat 401B Final Exam Fall 015 I have neither given nor received unauthorized assistance on this exam. Name Signed Date Name Printed ATTENTION! Incorrect numerical answers unaccompanied by supporting reasoning

More information

Exam Applied Statistical Regression. Good Luck!

Exam Applied Statistical Regression. Good Luck! Dr. M. Dettling Summer 2011 Exam Applied Statistical Regression Approved: Tables: Note: Any written material, calculator (without communication facility). Attached. All tests have to be done at the 5%-level.

More information

A Handbook of Statistical Analyses Using R. Brian S. Everitt and Torsten Hothorn

A Handbook of Statistical Analyses Using R. Brian S. Everitt and Torsten Hothorn A Handbook of Statistical Analyses Using R Brian S. Everitt and Torsten Hothorn CHAPTER 6 Logistic Regression and Generalised Linear Models: Blood Screening, Women s Role in Society, and Colonic Polyps

More information

Bayesian variable selection via. Penalized credible regions. Brian Reich, NCSU. Joint work with. Howard Bondell and Ander Wilson

Bayesian variable selection via. Penalized credible regions. Brian Reich, NCSU. Joint work with. Howard Bondell and Ander Wilson Bayesian variable selection via penalized credible regions Brian Reich, NC State Joint work with Howard Bondell and Ander Wilson Brian Reich, NCSU Penalized credible regions 1 Motivation big p, small n

More information

MATH 644: Regression Analysis Methods

MATH 644: Regression Analysis Methods MATH 644: Regression Analysis Methods FINAL EXAM Fall, 2012 INSTRUCTIONS TO STUDENTS: 1. This test contains SIX questions. It comprises ELEVEN printed pages. 2. Answer ALL questions for a total of 100

More information