Part V. Model Selection and Regularization. As of Nov 21, 2018
- Della Rogers
1 Part V: Model Selection and Regularization. As of Nov 21, 2018. Some of the figures in this presentation are taken from An Introduction to Statistical Learning, with Applications in R (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani.
2 1 Model Selection and Regularization
- Variable selection
- Ridge regression
- Lasso
- Dimension reduction
  - Principal components regression
  - Partial least squares
3 Model selection in the linear regression
$$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p + \epsilon \qquad (1)$$
typically refers to selecting, from the p explanatory variables, the subset that best predicts and captures the variability in y. Regularization refers to fitting a model with all p variables while shrinking the estimated coefficients, relative to the least squares estimates, towards zero in order to decrease (reducible) variance. Depending on the type of shrinkage, some of the coefficients may be estimated to be exactly zero, so that the method also performs variable selection. We briefly discussed variable selection earlier; it serves at the same time as dimension reduction of the regression problem. In this section we demonstrate variable selection and briefly discuss other approaches to dimension reduction.
5 Variable selection Example 1
Consider the Hitters data set in the ISLR library, which contains data on baseball players. We predict a player's Salary on the basis of available background data.
> library(ISLR)
> str(Hitters) # structure
'data.frame': 322 obs. of 20 variables:
 $ AtBat    : int
 $ Hits     : int
 $ HmRun    : int
 $ Runs     : int
 $ RBI      : int
 $ Walks    : int
 $ Years    : int
 $ CAtBat   : int
 $ CHits    : int
 $ CHmRun   : int
 $ CRuns    : int
 $ CRBI     : int
 $ CWalks   : int
 $ League   : Factor w/ 2 levels "A","N":
 $ Division : Factor w/ 2 levels "E","W":
 $ PutOuts  : int
 $ Assists  : int
 $ Errors   : int
 $ Salary   : num NA
 $ NewLeague: Factor w/ 2 levels "A","N":
6 Variable selection Best subset
Consider first selecting the best subset with respect to a given criterion. Earlier we used the car package. Here we use the leaps package. Its function regsubsets() can be used to identify the best subset in terms of RSS (see note a).
a Criterion functions can be derived from RSS. For example, AIC is
$$\mathrm{AIC}_k = \log(\mathrm{RSS}_k) + 2k/n \qquad (2)$$
where $\mathrm{RSS}_k$ is the residual sum of squares for a regression with k explanatory variables.
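As a rough illustration of how criterion (2) is computed from RSS values, the following base R sketch uses simulated data (an assumption, not the Hitters data) and a sequence of nested models rather than a true best-subset search; the names dat, rss, and aic are illustrative only.

```r
# Simulated data (assumption): only x1 and x2 truly affect y.
set.seed(1)
n <- 100
x <- matrix(rnorm(n * 4), nrow = n, dimnames = list(NULL, paste0("x", 1:4)))
y <- 1 + 2 * x[, 1] - 1.5 * x[, 2] + rnorm(n)
dat <- data.frame(y = y, x)

# RSS_k for the nested models y ~ x1, y ~ x1 + x2, ..., y ~ x1 + ... + x4
rss <- sapply(1:4, function(k) {
  fit <- lm(reformulate(paste0("x", 1:k), response = "y"), data = dat)
  sum(resid(fit)^2)
})

# Criterion (2): AIC_k = log(RSS_k) + 2k/n
aic <- log(rss) + 2 * (1:4) / n
k.best <- which.min(aic)   # model size with the smallest AIC
round(cbind(k = 1:4, rss = rss, aic = aic), 3)
```

RSS can only decrease as variables are added, so the penalty term 2k/n is what allows a smaller model to win.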
7 Variable selection Best subset
Regression with all explanatory variables.
> summary(lm(Salary ~ ., data = Hitters)) # full regression with all variables
Call: lm(formula = Salary ~ ., data = Hitters)
Residuals: Min 1Q Median 3Q Max
Coefficients: Estimate Std. Error t value Pr(>|t|)
(Intercept)
AtBat       **
Hits        **
HmRun
Runs
RBI
Walks       ***
Years
CAtBat
CHits
CHmRun
CRuns
CRBI
CWalks      *
LeagueN
DivisionW   **
PutOuts     ***
Assists
Errors
NewLeagueN
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: on 243 degrees of freedom (59 observations deleted due to missingness)
Multiple R-squared: , Adjusted R-squared:
F-statistic: on 19 and 243 DF, p-value: < 2.2e-16
8 Variable selection Best subset
Significant t-values suggest that only AtBat, Hits, Walks, CWalks, Division, and PutOuts (and possibly CRuns and Assists, which are significant at the 10% level) have explanatory power. We will see next how this subset compares to the best ones in terms of different criterion functions.
9 Variable selection Best subset
> fit.full <- regsubsets(Salary ~ ., data = Hitters, nvmax = 19) # all subsets
> (sm.full <- summary(fit.full)) # models with smallest RSS(k), k = 1,..., p; asterisks indicate variables included
Subset selection object
Call: regsubsets.formula(Salary ~ ., data = Hitters, nvmax = 19)
19 Variables (and intercept)
...
1 subsets of each size up to 19
Selection Algorithm: exhaustive
          AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI
1  ( 1 )  " "   " "  " "   " "  " " " "   " "   " "    " "   " "    " "   "*"
2  ( 1 )  " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "   "*"
3  ( 1 )  " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "   "*"
4  ( 1 )  " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "   "*"
5  ( 1 )  "*"   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "   "*"
6  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   " "    " "   " "    " "   "*"
7  ( 1 )  " "   "*"  " "   " "  " " "*"   " "   "*"    "*"   "*"    " "   " "
8  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   " "    " "   "*"    "*"   " "
9  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   "*"    " "   " "    "*"   "*"
10 ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   "*"    " "   " "    "*"   "*"
11 ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   "*"    " "   " "    "*"   "*"
12 ( 1 )  "*"   "*"  " "   "*"  " " "*"   " "   "*"    " "   " "    "*"   "*"
13 ( 1 )  "*"   "*"  " "   "*"  " " "*"   " "   "*"    " "   " "    "*"   "*"
14 ( 1 )  "*"   "*"  "*"   "*"  " " "*"   " "   "*"    " "   " "    "*"   "*"
15 ( 1 )  "*"   "*"  "*"   "*"  " " "*"   " "   "*"    "*"   " "    "*"   "*"
16 ( 1 )  "*"   "*"  "*"   "*"  "*" "*"   " "   "*"    "*"   " "    "*"   "*"
17 ( 1 )  "*"   "*"  "*"   "*"  "*" "*"   " "   "*"    "*"   " "    "*"   "*"
18 ( 1 )  "*"   "*"  "*"   "*"  "*" "*"   "*"   "*"    "*"   " "    "*"   "*"
19 ( 1 )  "*"   "*"  "*"   "*"  "*" "*"   "*"   "*"    "*"   "*"    "*"   "*"
          CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
1  ( 1 )  " "    " "     " "       " "     " "     " "    " "
2  ( 1 )  " "    " "     " "       " "     " "     " "    " "
3  ( 1 )  " "    " "     " "       "*"     " "     " "    " "
4  ( 1 )  " "    " "     "*"       "*"     " "     " "    " "
5  ( 1 )  " "    " "     "*"       "*"     " "     " "    " "
6  ( 1 )  " "    " "     "*"       "*"     " "     " "    " "
7  ( 1 )  " "    " "     "*"       "*"     " "     " "    " "
8  ( 1 )  "*"    " "     "*"       "*"     " "     " "    " "
9  ( 1 )  "*"    " "     "*"       "*"     " "     " "    " "
10 ( 1 )  "*"    " "     "*"       "*"     "*"     " "    " "
11 ( 1 )  "*"    "*"     "*"       "*"     "*"     " "    " "
12 ( 1 )  "*"    "*"     "*"       "*"     "*"     " "    " "
13 ( 1 )  "*"    "*"     "*"       "*"     "*"     "*"    " "
14 ( 1 )  "*"    "*"     "*"       "*"     "*"     "*"    " "
15 ( 1 )  "*"    "*"     "*"       "*"     "*"     "*"    " "
16 ( 1 )  "*"    "*"     "*"       "*"     "*"     "*"    " "
17 ( 1 )  "*"    "*"     "*"       "*"     "*"     "*"    "*"
18 ( 1 )  "*"    "*"     "*"       "*"     "*"     "*"    "*"
19 ( 1 )  "*"    "*"     "*"       "*"     "*"     "*"    "*"
10 Variable selection Best subset
An asterisk indicates inclusion of a variable in a model.
> names(sm.full) # objects in sm.full
[1] "which" "rsq" "rss" "adjr2" "cp" "bic" "outmat" "obj"
Printing, for example, the R-squares shows the highest values for the best combinations of k explanatory variables, k = 1,..., p (p = 19).
> round(sm.full$rsq, digits = 3) # show R-squares (to 3 decimals)
[1]
[13]
Thus, for example, the highest $R^2$ with one explanatory variable is 32.1%, attained when CRBI is in the regression.
11 Variable selection Best subset
Plotting $R^2$, adjusted $R^2$, $C_p$ (here equivalent to AIC), and BIC for all subset sizes can be used to decide the final model.
> par(mfrow = c(2, 2))
> plot(x = 1:19, y = sm.full$rsq, type = "l", col = "steel blue", main = "R-squares",
+      xlab = "N of Variables", ylab = "R-squared")
> plot(x = 1:19, y = sm.full$adjr2, type = "l", col = "steel blue", main = "Adjusted R-squares",
+      xlab = "N of Variables", ylab = "Adjusted R-squared")
> (k.best <- which.max(sm.full$adjr2)) # model with best adj R-square
[1] 11
> points(k.best, sm.full$adjr2[k.best], col = "red", cex = 2, pch = 20) # show the maximum
> plot(x = 1:19, y = sm.full$cp, type = "l", col = "steel blue", main = "Cp",
+      xlab = "N of Variables", ylab = "Cp")
> (k.best <- which.min(sm.full$cp)) # model with the smallest Cp
[1] 10
> points(k.best, sm.full$cp[k.best], col = "red", cex = 2, pch = 20) # show the minimum
> plot(x = 1:19, y = sm.full$bic, type = "l", col = "steel blue", main = "BIC",
+      xlab = "N of Variables", ylab = "BIC")
> (k.best <- which.min(sm.full$bic)) # model with the smallest BIC
[1] 6
> points(k.best, sm.full$bic[k.best], col = "red", cex = 2, pch = 20) # show the minimum
12 Variable selection Best subset
[Figure: four panels, "R-squares", "Adjusted R-squares", "Cp", and "BIC", each plotted against N of Variables.]
13 Variable selection Best subset
For example, the six variables selected by BIC and the corresponding fitted model are:
> names(coef(fit.full, 6))
[1] "(Intercept)" "AtBat" "Hits" "Walks" "CRBI"
[6] "DivisionW" "PutOuts"
> summary(lm(Salary ~ AtBat + Hits + Walks + CRBI + Division + PutOuts, data = Hitters))
Call: lm(formula = Salary ~ AtBat + Hits + Walks + CRBI + Division + PutOuts, data = Hitters)
Residuals: Min 1Q Median 3Q Max
Coefficients: Estimate Std. Error t value Pr(>|t|)
(Intercept)
AtBat       ***
Hits        e-06 ***
Walks       **
CRBI        < 2e-16 ***
DivisionW   **
PutOuts     ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: on 256 degrees of freedom (59 observations deleted due to missingness)
Multiple R-squared: , Adjusted R-squared:
F-statistic: on 6 and 256 DF, p-value: < 2.2e-16
There are differences from the variables that were significant in the full model (e.g., the initially non-significant CRBI is included, while the initially significant CWalks is not).
14 Variable selection Validation set approach
In particular, if our objective is to use the model for prediction, we can use a validation set to identify the best predictors. First split the sample into a training set and a test (validation) set, estimate the best explanatory variable subsets of sizes 1, 2,..., p using the training set, and find the subset size with the minimum test set MSE.
> sum(is.na(Hitters$Salary)) # number of missing Salary values
[1] 59
> Hitters <- na.omit(Hitters) # drop all rows with missing values
> sum(is.na(Hitters))
[1] 0
> set.seed(2) # initialize random seed for exact replication
## a random vector of TRUEs and FALSEs with length equal to the number of rows in Hitters,
## about one half of the values being TRUE
> train <- sample(x = c(TRUE, FALSE), size = nrow(Hitters), replace = TRUE)
> mean(train) # proportion of TRUEs
[1]
> test <- !train # complement set to identify the test set
> mean(test) # fraction of observations in the test set
[1]
15 Variable selection Validation set approach
> fit.best <- regsubsets(Salary ~ ., Hitters[train, ], nvmax = 19) # best fitting models
> test.mat <- model.matrix(Salary ~ ., data = Hitters[test, ]) # generate model matrix
> head(test.mat) # a few first lines of test.mat
[Output: columns (Intercept), AtBat, Hits, HmRun, Runs, RBI, Walks, Years, CAtBat, CHits, CHmRun, CRuns, CRBI, CWalks, LeagueN, DivisionW, PutOuts, Assists, Errors, NewLeagueN; rows -Alvin Davis, -Andre Dawson, -Alfredo Griffin, -Al Newman, -Andres Thomas, -Alan Trammell.]
The function model.matrix() generates a constant vector for the intercept and transforms factor variables to 0/1 dummy vectors; the column name indicates which class is coded as 1 (e.g., LeagueN equals 1 for league "N").
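A minimal sketch of what model.matrix() does with a factor, using a hypothetical toy data frame (toy and its values are made up for illustration and are not part of the Hitters data):

```r
# Hypothetical toy data (assumption): one numeric predictor and one factor.
toy <- data.frame(
  Salary = c(100, 250, 400, 175),
  Hits   = c(60, 120, 150, 90),
  League = factor(c("A", "N", "N", "A"))
)

X <- model.matrix(Salary ~ ., data = toy)
colnames(X)      # "(Intercept)" "Hits" "LeagueN"
X[, "LeagueN"]   # 1 where League == "N", 0 where League == "A"
```

The first factor level ("A") becomes the reference class, so only the dummy for "N" appears as a column.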
16 Variable selection Validation set approach
> test.mse <- double(19) # vector of length 19 for validation set MSEs
> for (i in 1:length(test.mse)) {
+   betai <- coef(object = fit.best, id = i) # extract coefficients of the model with i x-vars
+   pred.salary <- test.mat[, names(betai)] %*% betai # pred y = X beta
+   test.mse[i] <- mean((Hitters$Salary[test] - pred.salary)^2)
+ } # end for
> test.mse # print results
[1]
[8]
[15]
> which.min(test.mse) # find the minimum
[1] 10
> coef(fit.best, id = 10) # coefficients of the best fitting model
(Intercept) AtBat Hits Walks CAtBat CRuns CRBI CWalks DivisionW PutOuts Assists
> coef(fit.full, id = 10) # coefficients of the best set of 10 variables from the full data set
(Intercept) AtBat Hits Walks CAtBat CRuns CRBI CWalks DivisionW PutOuts Assists
Thus, the best model is the one with 10 predictors. This set of predictors is also the best model with 10 predictors from the full data, and it is the model selected by the Cp criterion. The results, however, can differ for different training and test sets (in fact, if we initialized the random generator with set.seed(1), as in the book, a slightly different set of predictors would have been selected).
17 Variable selection Validation set approach
Remark 1 In practice, the size k of the best set of predictors (here k = 10) is selected on the basis of the validation approach, while the final best k predictors are selected from the full sample and the corresponding regression is estimated (again from the full sample). Thus, the predictors in the final model may differ from the best predictors in the validation step; only their number is the same (above, the two sets of best predictors happened to coincide).
18 Variable selection Cross-validation approach
In the same manner as in the validation set approach, we can identify the size of the set of best predictors on the basis of cross-validation. We demonstrate here k-fold CV with k = 10. First create a vector that indicates to which of the 10 groups each observation belongs.
> set.seed(1) # for exact replication
> k <- 10 # n of folds
> folds <- sample(x = 1:k, size = nrow(Hitters), replace = TRUE) # randomly formed k folds
> head(folds)
[1]
Thus, here the first observation falls into fold 3, the second into 4, etc. The following function produces predictions.
> predict.regsubsets <- function(obj,     # object produced by regsubsets()
+                                newdata, # out of sample data
+                                id,      # id of the model
+                                ...      # potential additional arguments if needed
+ ) { # Source: James et al. (2013) ISL
+   if (!inherits(obj, "regsubsets")) stop("obj must be produced by regsubsets() function!")
+   fmla <- as.formula(obj$call[[2]])     # extract formula from the obj object
+   beta <- coef(object = obj, id = id)   # coefficients corresponding to model id
+   xmat <- model.matrix(fmla, newdata)   # data matrix for prediction computations
+   return(xmat[, names(beta)] %*% beta)  # return predictions
+ } # predict.regsubsets
19 Variable selection Cross-validation approach
Next we loop over the k folds and the best predictor sets of each size to compute MSEs.
> cv.mse <- matrix(nrow = k, ncol = 19, dimnames = list(1:k, 1:19)) # matrix to store MSE-values
> for (i in 1:k) { # over validation sets
+   best.fits <- regsubsets(Salary ~ ., data = Hitters[folds != i, ], nvmax = 19)
+   y <- Hitters$Salary[folds == i] # new y values
+   for (j in 1:19) { # MSEs over n of predictors
+     ypred <- predict.regsubsets(best.fits, newdata = Hitters[folds == i, ], id = j) # predictions
+     cv.mse[i, j] <- mean((y - ypred)^2) # store MSEs into cv.mse matrix
+   } # for j
+ } # for i
Mean MSEs for a j-predictor model, j = 1,..., 19:
> (mean.cv.mse <- apply(cv.mse, 2, mean)) # mean MSE values; the parentheses print the result
> plot(mean.cv.mse, type = "b", col = "red", xlab = "N of Predictors", ylab = "MSE")
Models with 10 and 11 predictors are close to each other. However, the 11-predictor model is slightly better, as the following figure also shows, so a model with 11 predictors would be our choice.
> coef(regsubsets(Salary ~ ., data = Hitters, nvmax = 19), id = 11) # the best model
(Intercept) AtBat Hits Walks CAtBat CRuns CRBI CWalks LeagueN DivisionW PutOuts Assists
> plot(mean.cv.mse, type = "b", col = "red", xlab = "N of Predictors", ylab = "MSE",
+      main = "Best 10-fold Cross-Validation MSEs\nfor Different Number of Predictors")
20 Variable selection Cross-validation approach
[Figure: "Best 10-fold Cross-Validation MSEs for Different Number of Predictors"; MSE plotted against N of Predictors.]
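The fold vector above is drawn with replacement, so fold sizes vary at random. A common alternative, sketched here as an assumption rather than as the approach used in these slides, shuffles a repeated 1:k pattern so that the folds are nearly equal in size. The value n = 263 is the number of Hitters rows left after na.omit() (322 observations minus 59 with missing Salary).

```r
set.seed(1)
n <- 263   # rows in Hitters after removing the 59 rows with missing Salary
k <- 10

folds.random   <- sample(x = 1:k, size = n, replace = TRUE)  # as in the slides
folds.balanced <- sample(rep(1:k, length.out = n))           # shuffled, near-equal sizes

table(folds.random)     # fold sizes vary
table(folds.balanced)   # fold sizes differ by at most one
```

Either vector can be used in the loop over `folds != i` and `folds == i`; balanced folds simply make the per-fold MSEs rest on similar numbers of observations.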
22 Shrinkage methods constrain or regularize the coefficient estimates, or equivalently shrink the coefficient estimates towards zero. It turns out that shrinking estimated coefficients towards zero can significantly reduce their variance. The two best-known techniques are ridge regression and the lasso (least absolute shrinkage and selection operator, introduced in statistics by Tibshirani (see note 2)).
2 Tibshirani, Robert (1996). Regression shrinkage and selection via the lasso, Journal of the Royal Statistical Society, Series B, 58,
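24 Ridge regression
OLS estimates the regression coefficients by minimizing
$$\mathrm{RSS} = \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Bigr)^2. \qquad (3)$$
Ridge regression estimates $\hat\beta^R_\lambda$ are obtained by minimizing
$$\sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Bigr)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 = \mathrm{RSS} + \lambda \sum_{j=1}^{p} \beta_j^2, \qquad (4)$$
where $\lambda \ge 0$ is a tuning parameter, to be determined separately.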
25 Ridge regression
The term $\lambda \sum_{j=1}^{p} \beta_j^2$ is called a shrinkage penalty; it becomes smaller the closer the $\beta_j$'s are to zero. Thus, like OLS, ridge regression seeks a good fit by finding coefficients that make the RSS small, but at the same time it penalizes large coefficients, so that at the optimum the coefficients are shrunk towards zero. The tuning parameter $\lambda \ge 0$ controls the relative impact of these two terms: $\lambda = 0$ leads to OLS, while $\lambda \to \infty$ drives the coefficient estimates towards zero. Selecting a good value for $\lambda$ is obviously critical and depends on the application. Note that shrinking is not applied to $\beta_0$.
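The effect of the penalty can be sketched in base R using the closed-form minimizer of (4). This is only an illustration on simulated data (an assumption); the helper name ridge() is hypothetical. The intercept is left unpenalized by centering the data first. Packages such as glmnet scale the objective differently, so their lambda values are not directly comparable with the lambda used here.

```r
set.seed(1)
n <- 50; p <- 5
X <- matrix(rnorm(n * p), n, p)                      # simulated predictors (assumption)
y <- drop(1 + X %*% c(3, -2, 0, 0, 1) + rnorm(n))

# Hypothetical helper: minimizes RSS + lambda * sum(beta_j^2), beta_0 unpenalized.
ridge <- function(X, y, lambda) {
  Xc <- scale(X, center = TRUE, scale = FALSE)       # centering removes the intercept
  yc <- y - mean(y)
  beta <- solve(crossprod(Xc) + lambda * diag(ncol(X)), crossprod(Xc, yc))
  c(mean(y) - sum(colMeans(X) * beta), beta)         # (beta_0, beta_1, ..., beta_p)
}

b.ols   <- ridge(X, y, lambda = 0)    # lambda = 0 reproduces OLS
b.ridge <- ridge(X, y, lambda = 10)   # slopes shrunk towards zero

round(cbind(OLS = b.ols, ridge = b.ridge), 3)
```

With lambda = 0 the result agrees with coef(lm(y ~ X)), and the sum of squared slopes decreases as lambda grows.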
26 Ridge regression
Ridge regression's advantage over OLS is rooted in the bias-variance trade-off (in MSE). As $\lambda$ increases, the flexibility of the ridge regression fit decreases, leading to decreased variance but increased bias. With a suitable choice of $\lambda$, the variance can decrease by more than the squared bias increases, which leads to a gain over OLS. Ridge regression works best in cases where the OLS estimates have high variance (e.g., under high multicollinearity, or when the number of explanatory variables is large relative to n).
Remark 2 Because the slope coefficients $\beta_j$ depend on the scale of the explanatory variable $x_j$, the common practice is to standardize the explanatory variables by scaling them by their (sample) standard deviations, i.e., to use the transformations $\tilde{x}_j = x_j / s_j$, where $s_j$ is the (sample) standard deviation of the variable $x_j$.
28 Lasso
The lasso is a relatively recent alternative to ridge regression. The lasso estimates, $\hat\beta^L_\lambda$, of regression (1) are obtained by minimizing
$$\sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Bigr)^2 + \lambda \sum_{j=1}^{p} |\beta_j| = \mathrm{RSS} + \lambda \sum_{j=1}^{p} |\beta_j|. \qquad (5)$$
The difference from ridge estimation is that in the penalty $\beta_j^2$ is replaced by $|\beta_j|$ (i.e., the lasso uses the $\ell_1$-norm and ridge the $\ell_2$-norm). Like ridge, the lasso shrinks coefficients towards zero. However, unlike the $\ell_2$ penalty, the $\ell_1$ penalty has the effect of forcing some coefficients to be exactly equal to zero when the tuning parameter $\lambda$ is large enough. Again, as in ridge regression, selecting $\lambda$ is critical.
29 Lasso
The property that the $\ell_1$ penalty can force some coefficients to be exactly zero gives the lasso its variable selection property (the "selection operator", i.e., the "so", in its name). Ridge regression lacks this property, as it does not force any of the initially non-zero coefficients to zero.
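The contrast can be made concrete in a special case that is an assumption of this sketch and not part of the slides: no intercept and orthonormal predictors (X'X = I). In that case the minimizers of (5) and (4) have simple closed forms in terms of the OLS estimates b: the lasso soft-thresholds each b_j by lambda/2, while ridge divides each b_j by 1 + lambda. The function names below are illustrative.

```r
# Closed forms under the orthonormal-design assumption (X'X = I, no intercept).
soft.threshold <- function(b, lambda) sign(b) * pmax(abs(b) - lambda / 2, 0)  # lasso
ridge.shrink   <- function(b, lambda) b / (1 + lambda)                        # ridge

b.ols  <- c(3, -1.5, 0.4, -0.2)   # hypothetical OLS estimates
lambda <- 1

b.lasso <- soft.threshold(b.ols, lambda)   # 2.5 -1.0 0.0 0.0
b.ridge <- ridge.shrink(b.ols, lambda)     # 1.5 -0.75 0.2 -0.1
rbind(OLS = b.ols, lasso = b.lasso, ridge = b.ridge)
```

The two small coefficients are set exactly to zero by the lasso, whereas ridge only scales them down.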
30 Lasso
Remark 3 A combination of ridge regression and the lasso, called elastic-net regularization, estimates the regression coefficients $\beta_0, \beta_1, \beta_2, \ldots, \beta_p$ by minimizing
$$\sum_{i=1}^{n} (y_i - \beta_0 - x_i'\beta)^2 + \lambda P_\alpha(\beta), \qquad (6)$$
where $x_i = (x_{i1}, \ldots, x_{ip})'$, $\beta = (\beta_1, \ldots, \beta_p)'$, and
$$P_\alpha(\beta) = \sum_{j=1}^{p} \Bigl( \tfrac{1}{2}(1 - \alpha)\beta_j^2 + \alpha |\beta_j| \Bigr). \qquad (7)$$
Thus, $\alpha = 0$ implies ridge regression and $\alpha = 1$ the lasso.
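A small sketch of the penalty (7) evaluated at its two endpoints and at a mixture; enet.penalty() and the coefficient vector are hypothetical names and values chosen for illustration.

```r
# Elastic-net penalty P_alpha(beta) from equation (7).
enet.penalty <- function(beta, alpha) {
  sum(0.5 * (1 - alpha) * beta^2 + alpha * abs(beta))
}

beta <- c(2, -1, 0)                          # hypothetical slope coefficients
p.lasso <- enet.penalty(beta, alpha = 1)     # 3: the l1 norm
p.ridge <- enet.penalty(beta, alpha = 0)     # 2.5: half the squared l2 norm
p.mix   <- enet.penalty(beta, alpha = 0.5)   # 2.75: equal-weight mixture
c(lasso = p.lasso, ridge = p.ridge, mix = p.mix)
```

Note the factor 1/2 in (7): at alpha = 0 the penalty is half of the ridge penalty in (4), which only rescales lambda.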
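31 Another formulation for ridge regression and lasso
Lasso and ridge regression solve, respectively, the problems
$$\min_{\beta} \Bigl\{ \sum_{i=1}^{n} (y_i - x_i'\beta)^2 \Bigr\} \quad \text{s.t.} \quad \sum_{j=1}^{p} |\beta_j| \le s \qquad (8)$$
$$\min_{\beta} \Bigl\{ \sum_{i=1}^{n} (y_i - x_i'\beta)^2 \Bigr\} \quad \text{s.t.} \quad \sum_{j=1}^{p} \beta_j^2 \le s \qquad (9)$$
where $x_i = (1, x_{i1}, \ldots, x_{ip})'$, $\beta = (\beta_0, \beta_1, \ldots, \beta_p)'$, and $x_i'\beta = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}$. That is, for every $\lambda$ there is some $s$ such that equations (8) and (9) give the same lasso and ridge coefficients as (5) and (4), and vice versa.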
32 Another formulation for ridge regression and lasso
For example, the lasso restriction in (8) can be thought of as a budget constraint that defines how large $\sum_{j=1}^{p} |\beta_j|$ can be. Suppose the constraints of lasso and ridge in equations (8) and (9) are replaced by
$$\sum_{j=1}^{p} I(\beta_j \ne 0) \le s, \qquad (10)$$
where $I(\beta_j \ne 0)$ is an indicator function equaling 1 if $\beta_j \ne 0$ and zero otherwise. Then RSS is minimized under the constraint that no more than $s$ coefficients can be nonzero, i.e., the problem becomes a best subset selection problem.
33 Variable selection property of lasso The figure below illustrates in the case of two variables the situation in which lasso tends to have a variable selection property, while ridge regression does not (Source: James et al. 2013, Fig 6.7).
34 Variable selection property of lasso
Thus, if the OLS estimate $(\hat\beta_1, \hat\beta_2)$ lies outside the regions $|\beta_1| + |\beta_2| \le s$ and $\beta_1^2 + \beta_2^2 \le s$, the lasso solution can fall on a corner of the diamond, where one coefficient equals zero. The ellipses depict contours along which RSS is constant. If the OLS estimate lies inside the diamond or the circle, the budget constraint is satisfied by the OLS solution, and lasso or ridge regression, respectively, gives the same estimates as OLS.
35 Selecting the tuning parameter λ
The tuning parameter is chosen as follows:
- choose a grid of $\lambda$ values,
- compute the cross-validation error for each value of $\lambda$,
- select the value of $\lambda$ for which the cross-validation error is the smallest.
Finally, the model is re-fit using all of the available observations and the selected value of the tuning parameter.
Remark 4 The OLS estimate $\hat\beta$ corresponds to lasso (and ridge) with $\lambda = 0$, and OLS minimizes $\mathrm{RSS} = \sum_{i=1}^{n} (y_i - x_i'\hat\beta)^2$ in the training sample. Hence, for any $\lambda > 0$, $\sum_{i=1}^{n} (y_i - x_i'\hat\beta^L_\lambda)^2 \ge \mathrm{RSS}$ in the training sample. Therefore, finding the optimal $\lambda$ must be based on some sort of out-of-sample computation, such as cross-validation.
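Remark 4 can be checked numerically. The sketch below uses ridge rather than lasso, because ridge has a closed form that keeps the example in base R; the data are simulated, a single train/test split stands in for cross-validation, and the names ridge.fit() and mse() are hypothetical. All of these are assumptions of the sketch.

```r
set.seed(2)
n <- 60; p <- 8
X <- matrix(rnorm(n * p), n, p)                          # simulated predictors
y <- drop(X %*% c(2, -1, rep(0, p - 2)) + rnorm(n, sd = 2))
train <- 1:40; test <- 41:60

# Hypothetical helper: closed-form ridge fit with unpenalized intercept.
ridge.fit <- function(X, y, lambda) {
  mx <- colMeans(X)
  Xc <- sweep(X, 2, mx)                                  # center the columns
  beta <- drop(solve(crossprod(Xc) + lambda * diag(ncol(X)),
                     crossprod(Xc, y - mean(y))))
  list(b0 = mean(y) - sum(mx * beta), beta = beta)
}
mse <- function(fit, X, y) mean((y - fit$b0 - drop(X %*% fit$beta))^2)

grid <- c(0, 1, 5, 10, 50, 100)                          # grid of lambda values
fits <- lapply(grid, function(l) ridge.fit(X[train, ], y[train], l))
train.mse <- sapply(fits, mse, X = X[train, ], y = y[train])
test.mse  <- sapply(fits, mse, X = X[test, ],  y = y[test])

round(rbind(lambda = grid, train = train.mse, test = test.mse), 3)
grid[which.min(test.mse)]   # lambda chosen by out-of-sample error
```

The training MSE never decreases as lambda grows, so it would always pick lambda = 0; only the out-of-sample error gives a meaningful basis for the choice.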
36 Example 2 We again use the Hitters data set to demonstrate the lasso. The R package glmnet performs lasso, ridge regression, and, more generally, elastic-net regularization. The main function of the package is glmnet(), which does not allow missing values; R factor variables must first be transformed to 0/1 dummy variables.
> library(ISLR)
> library(glmnet)
> head(Hitters)
[Output: columns AtBat, Hits, HmRun, Runs, RBI, Walks, Years, CAtBat, CHits, CHmRun, CRuns, CRBI, CWalks, League, Division, PutOuts, Assists, Errors, Salary, NewLeague; the surviving values are shown below.]
                  League Division Salary NewLeague
-Andy Allanson    A      E        NA     A
-Alan Ashby       N      W               N
-Alvin Davis      A      W               A
-Andre Dawson     N      E               N
-Andres Galarraga N      E        91.5   N
-Alfredo Griffin  A      W               A
> Hitters <- na.omit(Hitters) # remove lines with missing values
37 Function glmnet() requires its input as an x matrix (predictors) and a y vector (dependent variable); the formula syntax y ~ x does not work. The model.matrix() function generates the required x-matrix, also transforming factor variables to dummy variables. glmnet() performs lasso when alpha = 1 (which is also the default) and automatically selects a range for λ. Below we use this automatically generated grid for λ.
> x <- model.matrix(Salary ~ ., Hitters)[, -1] # x-variables, drop the constant term vector
> y <- Hitters$Salary
> head(x)
(results omitted)
> lasso.mod <- glmnet(x = x, y = y, alpha = 1) ## alpha = 1 performs lasso
                                                ## for a range of lambda values
38
> str(lasso.mod) # structure of the object produced by glmnet()
List of 12
 $ a0       : Named num [1:80]
  ..- attr(*, "names")= chr [1:80] "s0" "s1" "s2" "s3" ...
 $ beta     :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
  .. ..@ i       : int [1:882]
  .. ..@ p       : int [1:81]
  .. ..@ Dim     : int [1:2]
  .. ..@ Dimnames:List of 2
  .. .. ..$ : chr [1:19] "AtBat" "Hits" "HmRun" "Runs" ...
  .. .. ..$ : chr [1:80] "s0" "s1" "s2" "s3" ...
  .. ..@ x       : num [1:882]
  .. ..@ factors : list()
 $ df       : int [1:80]
 $ dim      : int [1:2]
 $ lambda   : num [1:80]
 $ dev.ratio: num [1:80]
 $ nulldev  : num
 $ npasses  : int 2851
 $ jerr     : int 0
 $ offset   : logi FALSE
 $ call     : language glmnet(x = x, y = y, alpha = 1)
 $ nobs     : int
 - attr(*, "class")= chr [1:2] "elnet" "glmnet"
39 Below are shown the number of values in the automatically generated λ-grid and some of the values.
> length(lasso.mod$lambda) # number of lambdas
[1] 80
> c(min = min(lasso.mod$lambda), max = max(lasso.mod$lambda)) # range of lambdas
min max
> head(lasso.mod$lambda) # a few first lambdas
[1]
> tail(lasso.mod$lambda) # a few last lambdas
[1]
Thus, the regression coefficients are computed for 80 values of λ. These results are stored in the beta matrix of the lasso.mod object and can be extracted directly or by using the coef() function.
> dim(coef(lasso.mod)) # dimension of the coefficient matrix (beta)
[1]
> lasso.mod$lambda[50] # the 50th value of lambda
[1]
> coef(lasso.mod)[, 50] # the corresponding beta estimates
(Intercept) AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
> sum(abs(coef(lasso.mod)[-1, 50])) # the corresponding L1-norm
[1]
40 The function coef() can be used to produce lasso estimates for any value of λ (the same can be done with the predict() function). For further information, see help(coef.glmnet) and help(predict.glmnet).
> drop(coef(lasso.mod, s = 5)) # lasso estimates for any value of lambda (arg s)
(Intercept) AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
41
> par(mfrow = c(1, 2))
> plot(lasso.mod, xvar = "lambda", xlab = expression(lambda))
> plot(lasso.mod, xvar = "norm", xlab = expression(log(sum(abs(beta[j])))))
[Figure: two panels of coefficient paths; "Coefficients" on the vertical axes.]
The coefficients are plotted against log(λ) in the left panel and against the logarithm of the $\ell_1$-norm of the coefficients in the right panel. The numbers at the top of the figure indicate the number of non-zero coefficients. The figure shows that, depending on the choice of λ, some of the coefficients will be exactly equal to zero.
42 Next we use cross-validation to identify the best λ in terms of the MSE. The glmnet package has the function cv.glmnet() for this purpose.
> set.seed(1) # for replication purposes
> train <- sample(1:nrow(x), nrow(x)/2) # random selection of about
                                        ## one half of the observations for a train set
> test <- -train # test set consists of observations not in the train set
> y.test <- y[test] # test y-values
> cv.out <- cv.glmnet(x[train, ], y[train], alpha = 1) # by default performs 10-fold CV
> plot(cv.out)
[Figure: Mean Squared Error plotted against log(lambda).]
43
> (best.lam <- cv.out$lambda.min) # lambda for which CV MSE is at minimum
[1]
> lasso.train <- glmnet(x[train, ], y[train], alpha = 1) # lasso fit on the train set (assumed; this definition is not shown in the transcript)
> lasso.pred <- predict(lasso.train, s = best.lam, newx = x[test, ]) # predicted values
                                        ## with coefficients corresponding to best.lam
> mean((lasso.pred - y.test)^2) # lasso test MSE
[1]
## for comparison, compute the test MSE for the OLS model estimated from the train set
> ols.train <- lm(Salary ~ ., data = Hitters, subset = train) # OLS train data estimates
> ols.pred <- predict(ols.train, newdata = Hitters[test, ]) # OLS prediction
> mean((ols.pred - y.test)^2) # OLS test MSE
[1]
Here lasso outperforms OLS in terms of test MSE. Below is a scatter plot of the realized and predicted salaries in the test set.
44
> plot(x = ols.pred, y = y.test, col = "red", xlim = c(0, 2500), ylim = c(0, 2500),
+      xlab = "Predicted", ylab = "Realized")
> abline(lm(y.test ~ ols.pred), col = "red")
> points(x = lasso.pred, y = y.test, col = "steel blue")
> abline(lm(y.test ~ lasso.pred), col = "steel blue")
> abline(a = 0, b = 1, lty = "dashed", col = "gray")
> legend("topleft", legend = c("OLS", "Lasso"), col = c("red", "steel blue"),
+        pch = c(1, 1), bty = "n")
[Figure: Realized plotted against Predicted salaries for OLS and Lasso.]
46 Dimension reduction
Variable selection and shrinkage methods aim to control variance while relying on the original predictors $x_1, \ldots, x_p$. Dimension reduction aims to find a smaller number of new variables $z_1, \ldots, z_M$, $M < p$, that are linear combinations of the original predictors, i.e.,
$$z_m = \sum_{j=1}^{p} \phi_{jm} x_j \qquad (11)$$
for some constants $\phi_{1m}, \ldots, \phi_{pm}$, $m = 1, \ldots, M$. Then y is regressed on these new variables:
$$y_i = \theta_0 + \sum_{m=1}^{M} \theta_m z_{im} + \epsilon_i, \qquad (12)$$
$i = 1, \ldots, n$. A proper selection of the φ-coefficients in (11) can make the dimension-reduced regression (12), with M + 1 coefficients, outperform the original regression of y on the x-variables, with p + 1 coefficients.
48 Dimension reduction Principal components
Principal component analysis (PCA) is a popular approach to reducing the dimension of the original p variables to a small number $M \ll p$ of linear combinations of the form (11), such that these linear combinations capture the variability of the original x-variables. Given an $n \times p$ data matrix X, PCA derives linear combinations of the variables such that:
- The first principal component defines the direction of the data along which the observations vary the most.
- The second principal component defines a direction that is orthogonal to the first one and along which the data vary the most (among directions that are orthogonal to the first).
- The third principal component is defined such that it is orthogonal to the first and second and captures most of the remaining variability.
- Generally, the mth principal component is defined such that it is orthogonal to the earlier m − 1 ones and captures most of the remaining variability.
49 Dimension reduction Principal components
Mathematically, this reduces to finding the eigenvalues of the covariance matrix $\mathrm{cov}(x) = \Sigma$. The eigenvector of the largest eigenvalue is the coefficient vector of the first component, the eigenvector of the second largest eigenvalue is the coefficient vector of the second component, and so forth. The eigenvectors are normed to unity, i.e., if $\phi_m = (\phi_{1m}, \ldots, \phi_{pm})'$ is the mth eigenvector, then $\phi_m'\phi_m = 1$. Because the eigenvalue problem is scale dependent, the solution is most often extracted from the correlation matrix (i.e., the covariance matrix of the standardized variables).
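The eigenvalue description can be sketched in base R on simulated data (an assumption; the variable names are illustrative). The sketch extracts the components from the correlation matrix with eigen() and compares them with prcomp() applied to standardized variables.

```r
set.seed(1)
n <- 200
x1 <- rnorm(n)
X <- cbind(x1 = x1,
           x2 = 0.8 * x1 + rnorm(n, sd = 0.5),   # correlated with x1
           x3 = rnorm(n))                        # unrelated to x1 and x2

eig <- eigen(cor(X))          # eigen decomposition of the correlation matrix
phi <- eig$vectors            # columns are the coefficient vectors phi_m
colSums(phi^2)                # each equals 1: eigenvectors are normed to unity

Z <- scale(X) %*% phi         # component scores from standardized variables
score.var <- apply(Z, 2, var) # variance of each component
cbind(score.var, eigenvalue = eig$values)

pc <- prcomp(X, scale. = TRUE)   # same solution, up to the signs of the columns
```

The variance of the mth component equals the mth largest eigenvalue, so the first component captures the largest share of the total variability.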
50 Dimension reduction Principal components
The figure below illustrates the PC solution in the case of two variables, population size (pop) and ad spending (ad) (Source: James et al. 2013, Fig 6.14).
[Figure: Ad Spending plotted against Population.]
The green line is the first PC and the dashed line is the second.
51 Dimension reduction Principal components
The computed scores
$$z_{im} = \phi_{1m}(x_{i1} - \bar{x}_1) + \cdots + \phi_{pm}(x_{ip} - \bar{x}_p), \qquad (13)$$
$i = 1, \ldots, n$, are called principal component scores. Another interpretation of PCA is that the first PC vector defines the line that is as close as possible to the data. This is illustrated by the figure below (Source: James et al. 2013, Fig 6.15).
[Figure: left panel, Ad Spending against Population; right panel, 2nd Principal Component against 1st Principal Component.]
52 Dimension reduction Principal components
Plotting the components against the original variables illustrates graphically how well a component represents the variability of each variable. For the advertising data the first PC represents both variables well, while the second PC is not much related to either of the variables (Source: James et al. 2013, Fig 6.16 & 6.17).
[Figure: Population and Ad Spending, each plotted against the 1st Principal Component.]
53 Dimension reduction Principal components
[Figure: Population and Ad Spending, each plotted against the 2nd Principal Component.]
Because the first PC here captures virtually all of the variability in the two variables of the advertising data, the component jointly reflects the population size and advertising spending of the cities. The components are linear combinations of the demeaned original variables (eq. (13)). Therefore, negative values reflect below-average population size and advertising budget, values close to zero reflect average size and budget, and positive values reflect above-average population size and budget.
54 Dimension reduction Principal component regression
Principal component regression (PCR) involves constructing the first M PCs and regressing y on these components. The key is that $M \ll p$. If M = p, then PCR amounts to the same fit as the least squares fit on the original variables. M, the number of components for PCR, is typically selected by cross-validation. Also, when using PCR, it is generally recommended to standardize the x-variables before extracting the PCs.
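A base R sketch of PCR on simulated data (an assumption), built from prcomp() and lm() rather than from the pls package; pcr.fitted() is a hypothetical helper. It also checks the statement that using all M = p components reproduces the least squares fit.

```r
set.seed(1)
n <- 100; p <- 4
X <- matrix(rnorm(n * p), n, p)
X[, 2] <- X[, 1] + 0.1 * rnorm(n)              # two nearly collinear predictors
y <- drop(1 + X %*% c(1, 1, 0.5, 0) + rnorm(n))

# Hypothetical helper: regress y on the first M principal component scores.
pcr.fitted <- function(X, y, M) {
  Z <- prcomp(X, scale. = TRUE)$x[, 1:M, drop = FALSE]  # scores of standardized x
  fitted(lm(y ~ Z))
}

fit.ols     <- fitted(lm(y ~ X))               # least squares on original variables
fit.pcr.all <- pcr.fitted(X, y, M = p)         # PCR with all p components
fit.pcr.2   <- pcr.fitted(X, y, M = 2)         # PCR with two components

all.equal(unname(fit.pcr.all), unname(fit.ols))            # TRUE
c(mse.ols = mean((y - fit.ols)^2), mse.pcr2 = mean((y - fit.pcr.2)^2))
```

The in-sample MSE of the two-component fit cannot be below that of OLS; any advantage of a small M shows up only out of sample, which is why M is chosen by cross-validation.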
55 Dimension reduction Principal component regression
Example 3 Consider again the Hitters data set.
> library(pls) # pls library contains pcr regression and more
> library(ISLR)
> Hitters <- na.omit(Hitters) # remove missing values
> pcr.fit <- pcr(Salary ~ ., data = Hitters, scale = TRUE, # standardize x-variables
+                validation = "CV") # use CV to select n of components, default 10 fold
> summary(pcr.fit) # summary of the results
Data: X dimension:
      Y dimension:
Fit method: svdpc
Number of components considered: 19
VALIDATION: RMSEP
Cross-validated using 10 random segments.
[Table: rows CV and adjCV; columns (Intercept), 1 comps, ..., 19 comps.]
TRAINING: % variance explained
[Table: rows X and Salary; columns 1 comps, ..., 19 comps.]
56 Dimension reduction Principal component regression
The RMSEP is the square root of the CV MSE. The smallest value is reached with six PCs, which is also shown by the plot of RMSEPs. Six components explain [value missing] % of the total variation among the predictors.
> validationplot(pcr.fit, val.type = "RMSEP") # plot RMSEPs
[Figure: Salary, RMSEP against number of components]
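The relation between the CV MSE and the RMSEP, and the choice of M as the minimizer, can be sketched with a hand-rolled K-fold cross-validation in base R. This is only an illustration under assumptions: the mtcars data, the five chosen predictors, K = 10 and all variable names are hypothetical, and the pls package's own cross-validation may differ in details such as how the segments are drawn and how scaling is handled.

```r
# Hand-rolled K-fold CV for PCR (assumed data: built-in mtcars)
set.seed(1)
X <- as.matrix(mtcars[, c("disp", "hp", "wt", "qsec", "drat")])
y <- mtcars$mpg
n <- nrow(X); p <- ncol(X); K <- 10
fold <- sample(rep(1:K, length.out = n))   # random fold label for each observation

sq_err <- matrix(NA_real_, n, p)           # squared CV errors; column M = number of PCs
for (k in 1:K) {
  tr <- fold != k
  Xs <- scale(X[tr, , drop = FALSE])       # standardize using the training folds only
  pc <- prcomp(Xs, center = FALSE)         # Xs is already centered
  # apply the training centering, scaling and rotation to the held-out fold
  Xte <- scale(X[!tr, , drop = FALSE],
               center = attr(Xs, "scaled:center"),
               scale  = attr(Xs, "scaled:scale"))
  Zte <- Xte %*% pc$rotation
  for (M in 1:p) {
    fit  <- lm.fit(cbind(1, pc$x[, 1:M, drop = FALSE]), y[tr])
    pred <- drop(cbind(1, Zte[, 1:M, drop = FALSE]) %*% fit$coefficients)
    sq_err[!tr, M] <- (y[!tr] - pred)^2
  }
}

cv_mse   <- colMeans(sq_err)    # CV estimate of the MSE for M = 1, ..., p
cv_rmsep <- sqrt(cv_mse)        # RMSEP is its square root
best_M   <- which.min(cv_rmsep) # number of components with the smallest RMSEP
```

Because the square root is monotone, minimizing the RMSEP and minimizing the CV MSE select the same M.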
57 Dimension reduction Principal component regression
Next we demonstrate, in terms of MSE, how PCR performs on a test set. First determine the best number of components from the training set, estimate the corresponding regression, and then compute the test MSE.
> set.seed(1) # for exact replication
> train <- sample(nrow(Hitters), size = nrow(Hitters) / 2) # training set
> head(train) # examples of observations in the training set
[1] [values missing]
> test <- -train # test set: the observations not included in the training set
> pcr.train <- pcr(Salary ~ ., data = Hitters, subset = train, scale = TRUE,
+ validation = "CV") # training set
> validationplot(pcr.train, val.type = "RMSEP")
> summary(pcr.train)
Data: X dimension: [values missing] Y dimension: [values missing]
Fit method: svdpc
Number of components considered: 19
VALIDATION: RMSEP
Cross-validated using 10 random segments.
[Table: CV and adjCV RMSEP for (Intercept) and 1 to 19 comps; values missing]
TRAINING: % variance explained
[Table: % variance explained in X and Salary for 1 to 19 comps; values missing]
58 Dimension reduction Principal component regression
On the basis of the CV RMSEP results from the training data, M = 7 principal components yield the best results. Next, compute the test MSE.
> pcr.pred <- predict(pcr.train, Hitters[test, ], ncomp = 7)
> head(pcr.pred) # the first few predictions
[1] [values missing]
> mean((pcr.pred - Hitters$Salary[test])^2) # test set MSE
[1] [value missing]
This test set MSE is slightly smaller than that of the lasso (114,470.6). PCR is useful for prediction purposes. If interpretation of the model is needed, however, PCR results may be difficult to interpret.
Finally, estimating the M = 7 components from the full data set yields the following results.
> summary(pcr.fit <- pcr(Salary ~ ., data = Hitters, scale = TRUE, ncomp = 7)) # estimate and summarize
Data: X dimension: [values missing] Y dimension: [values missing]
Fit method: svdpc
Number of components considered: 7
TRAINING: % variance explained
[Table: % variance explained in X and Salary for 1 to 7 comps; values missing]
60 Dimension reduction Partial least squares
PCR focuses on reducing the dimension of the predictors without using the dependent variable y. In this sense PCR can be considered unsupervised learning. As a result, nothing guarantees that the resulting components really help in predicting y. Partial least squares (PLS) can be considered a supervised alternative to PCR. Similar to PCR, PLS seeks to identify new features z_1, ..., z_M (M < p) that are linear combinations of the original variables, but unlike the PCs, they are also related to the response y. In short, PLS attempts to find directions that help explain both the response and the predictors.
61 Dimension reduction Partial least squares
In PLS the predictors are first standardized. The first PLS direction z_1 is defined by regressing y on each predictor x_j, one at a time, and using the resulting coefficients as φ_1j to define z_1 = φ_11 x_1 + ... + φ_1p x_p. The second PLS direction is defined by first regressing each x_j on z_1 and taking the residuals (i.e., the variation of x_j not explained by z_1). Using these orthogonalized data, z_2 is formed in the same fashion as z_1 was formed from the original data. This iterative procedure is repeated M times to produce the PLS components z_1, ..., z_M. M is chosen by cross-validation.
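The construction of the first two PLS directions described above can be sketched in base R as follows. This follows the verbal description literally and is not the algorithm behind plsr(), so numbers from the pls package need not match; the mtcars data, the chosen predictors and the helper slopes() are assumptions made only for illustration.

```r
# First two PLS directions by hand (assumed data: built-in mtcars)
X <- scale(as.matrix(mtcars[, c("disp", "hp", "wt", "qsec")]))  # standardized predictors
y <- mtcars$mpg - mean(mtcars$mpg)                              # centered response

# Hypothetical helper: slope of the simple regression of y on each centered column of A
slopes <- function(A, y) apply(A, 2, function(a) sum(a * y) / sum(a * a))

phi1 <- slopes(X, y)        # coefficients phi_11, ..., phi_1p
z1   <- drop(X %*% phi1)    # first PLS direction

# Residuals of each x_j after regressing it on z1
b  <- drop(crossprod(z1, X)) / sum(z1^2)   # slope of each x_j on z1
X2 <- X - outer(z1, b)                     # orthogonalized predictors

phi2 <- slopes(X2, y)       # same recipe applied to the residuals
z2   <- drop(X2 %*% phi2)   # second PLS direction

sum(z1 * z2)                # numerically zero: z2 is orthogonal to z1

fit <- lm(mtcars$mpg ~ z1 + z2)   # regress the response on the M = 2 components
```

Unlike the PC scores, z1 and z2 are built from the relation between each predictor and y, which is what makes PLS supervised.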
62 Dimension reduction Partial least squares
Example 4 Continuing the previous examples, we use the plsr() function of the pls library.
> set.seed(1) # initialize the seed again
> pls.train <- plsr(Salary ~ ., data = Hitters, subset = train, scale = TRUE,
+ validation = "CV")
> summary(pls.train)
Data: X dimension: [values missing] Y dimension: [values missing]
Fit method: kernelpls
Number of components considered: 19
VALIDATION: RMSEP
Cross-validated using 10 random segments.
[Table: CV and adjCV RMSEP for (Intercept) and 1 to 19 comps; values missing]
TRAINING: % variance explained
[Table: % variance explained in X and Salary for 1 to 19 comps; values missing]
> validationplot(pls.train, val.type = "RMSEP")
63 Dimension reduction Partial least squares
M = 2 produces the smallest CV RMSEP.
[Figure: Salary, RMSEP against number of components]
> pls.pred <- predict(pls.train, newdata = Hitters[test, ], ncomp = 2)
> mean((Hitters$Salary[test] - pls.pred)^2)
[1] [value missing]
The test MSE is comparable to, but slightly higher than, that of PCR.