Part V. Model Selection and Regularization. As of Nov 21, 2018


1 Part V Model Selection and Regularization As of Nov 21, 2018 Some of the figures in this presentation are taken from An Introduction to Statistical Learning, with applications in R (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani

2 1 Model Selection and Regularization Variable selection Ridge regression Lasso Dimension reduction Principal components regression Partial least squares

3 Model selection in regression Model selection in the linear regression y = β_0 + β_1 x_1 + ⋯ + β_p x_p + ɛ (1) typically refers to the selection of the most appropriate subset of the p explanatory variables that best predicts and captures the variability in y. Regularization refers to fitting a model with all p variables while shrinking the estimated coefficients towards zero relative to the least squares estimates, in order to decrease (reducible) variance. Depending on the type of shrinkage, some of the coefficients may be estimated to be exactly zero, thereby also performing variable selection. We discussed variable selection briefly earlier; it also serves as dimension reduction of the regression problem. In this section we demonstrate variable selection and briefly discuss other approaches to dimension reduction.

4 Variable selection 1 Model Selection and Regularization Variable selection Ridge regression Lasso Dimension reduction Principal components regression Partial least squares

5 Variable selection Example 1 Consider the Hitters data set in the ISLR library, which contains data on baseball players. We predict a player's Salary on the basis of the available background data. > library(ISLR) > str(Hitters) # structure data.frame : 322 obs. of 20 variables: $ AtBat : int $ Hits : int $ HmRun : int $ Runs : int $ RBI : int $ Walks : int $ Years : int $ CAtBat : int $ CHits : int $ CHmRun : int $ CRuns : int $ CRBI : int $ CWalks : int $ League : Factor w/ 2 levels "A","N": $ Division : Factor w/ 2 levels "E","W": $ PutOuts : int $ Assists : int $ Errors : int $ Salary : num NA $ NewLeague: Factor w/ 2 levels "A","N":

6 Variable selection Best subset Consider first selecting the best subset with respect to a given criterion. Earlier we utilized the car package; here we utilize the leaps package. The function regsubsets(), which is part of the package, can be used to identify the best subset in terms of RSS.ᵃ ᵃ Criterion functions can be derived from the RSS; for example, AIC is AIC_k = log(RSS_k) + 2k/n, (2) where RSS_k is the residual sum of squares for a regression with k explanatory variables.
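As a small sketch of how the criterion in (2) could be computed from the RSS values returned by regsubsets() (assuming the Hitters data with missing Salary values removed, as in the examples below):
library(leaps)
library(ISLR)
Hitters <- na.omit(Hitters)                        # drop rows with missing values
fit <- regsubsets(Salary ~ ., data = Hitters, nvmax = 19)
rss <- summary(fit)$rss                            # RSS_k for k = 1, ..., 19
n <- nrow(Hitters)
aic <- log(rss) + 2 * (1:19) / n                   # criterion (2)
which.min(aic)                                     # subset size minimizing (2)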

7 Variable selection Best subset Regression with all explanatory variables. > summary(lm(Salary ~ ., data = Hitters)) # full regression with all variables Call: lm(formula = Salary ~ ., data = Hitters) Residuals: Min 1Q Median 3Q Max Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) AtBat ** Hits ** HmRun Runs RBI Walks *** Years CAtBat CHits CHmRun CRuns CRBI CWalks * LeagueN DivisionW ** PutOuts *** Assists Errors NewLeagueN Signif. codes: 0 *** ** 0.01 * Residual standard error: on 243 degrees of freedom (59 observations deleted due to missingness) Multiple R-squared: ,Adjusted R-squared: F-statistic: on 19 and 243 DF, p-value: < 2.2e-16

8 Variable selection Best subset The significant t-values suggest that only AtBat, Hits, Walks, CWalks, Division, and PutOuts (and possibly CRuns and Assists, which are significant at the 10% level) have explanatory power. We will see next how this subset compares to the best subsets in terms of different criterion functions.

9 Variable selection Best subset > fit.full <- regsubsets(salary ~., data = Hitters, nvmax = 19) # all subsets > (sm.full <- summary(fit.full)) # with smallest RSS(k), k = 1,..., p, indicate variables included Subset selection object Call: regsubsets.formula(salary ~., data = Hitters, nvmax = 19) 19 Variables (and intercept)... 1 subsets of each size up to 19 Selection Algorithm: exhaustive AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI 1 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " "*" 2 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*" 3 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*" 4 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*" 5 ( 1 ) "*" "*" " " " " " " " " " " " " " " " " " " "*" 6 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " " " "*" 7 ( 1 ) " " "*" " " " " " " "*" " " "*" "*" "*" " " " " 8 ( 1 ) "*" "*" " " " " " " "*" " " " " " " "*" "*" " " 9 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*" "*" 10 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*" "*" 11 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*" "*" 12 ( 1 ) "*" "*" " " "*" " " "*" " " "*" " " " " "*" "*" 13 ( 1 ) "*" "*" " " "*" " " "*" " " "*" " " " " "*" "*" 14 ( 1 ) "*" "*" "*" "*" " " "*" " " "*" " " " " "*" "*" 15 ( 1 ) "*" "*" "*" "*" " " "*" " " "*" "*" " " "*" "*" 16 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*" " " "*" "*" 17 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*" " " "*" "*" 18 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" " " "*" "*" 19 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN 1 ( 1 ) " " " " " " " " " " " " " " 2 ( 1 ) " " " " " " " " " " " " " " 3 ( 1 ) " " " " " " "*" " " " " " " 4 ( 1 ) " " " " "*" "*" " " " " " " 5 ( 1 ) " " " " "*" "*" " " " " " " 6 ( 1 ) " " " " "*" "*" " " " " " " 7 ( 1 ) " " " " "*" "*" " " " " " " 8 ( 1 ) "*" " " "*" "*" " " " " " " 9 ( 1 ) "*" " " "*" "*" " " " " " " 10 ( 1 ) "*" " " "*" "*" "*" " " " " 11 ( 1 ) "*" "*" "*" "*" "*" " " " " 12 ( 1 ) "*" "*" "*" "*" "*" " " " " 13 ( 1 ) "*" "*" "*" "*" "*" "*" " " 14 ( 1 ) "*" "*" "*" "*" "*" "*" " " 15 ( 1 ) "*" "*" "*" "*" "*" "*" " " 16 ( 1 ) "*" "*" "*" "*" "*" "*" " " 17 ( 1 ) "*" "*" "*" "*" "*" "*" "*" 18 ( 1 ) "*" "*" "*" "*" "*" "*" "*" 19 ( 1 ) "*" "*" "*" "*" "*" "*" "*"

10 Variable selection Best subset An asterisk indicates the inclusion of a variable in a model. > names(sm.full) # objects in sm.full [1] "which" "rsq" "rss" "adjr2" "cp" "bic" "outmat" "obj" Printing, for example, the R-squares shows the highest values for the best combinations of k explanatory variables, k = 1,..., p (p = 19). > round(sm.full$rsq, digits = 3) # show R-squares (in 3 decimals) [1] [13] Thus, for example, the highest R² with one explanatory variable is 32.1%, attained when CRBI is in the regression.

11 Variable selection Best subset Plotting R², adjusted R², C_p (here equivalent to AIC), and BIC for all subset sizes can be used to decide the final model. > par(mfrow = c(2, 2)) > plot(x = 1:19, y = sm.full$rsq, type = "l", col = "steelblue", main = "R-squares", + xlab = "N of Variables", ylab = "R-squared") > plot(x = 1:19, y = sm.full$adjr2, type = "l", col = "steelblue", main = "Adjusted R-squares", + xlab = "N of Variables", ylab = "Adjusted R-squared") > (k.best <- which.max(sm.full$adjr2)) # model with best adj R-square [1] 11 > points(k.best, sm.full$adjr2[k.best], col = "red", cex = 2, pch = 20) # show the maximum > plot(x = 1:19, y = sm.full$cp, type = "l", col = "steelblue", main = "Cp", + xlab = "N of Variables", ylab = "Cp") > (k.best <- which.min(sm.full$cp)) # model with the smallest Cp [1] 10 > points(k.best, sm.full$cp[k.best], col = "red", cex = 2, pch = 20) # show the minimum > plot(x = 1:19, y = sm.full$bic, type = "l", col = "steelblue", main = "BIC", + xlab = "N of Variables", ylab = "BIC") > (k.best <- which.min(sm.full$bic)) # model with the smallest BIC [1] 6 > points(k.best, sm.full$bic[k.best], col = "red", cex = 2, pch = 20)

12 Variable selection Best subset [Figure: R-squared, adjusted R-squared, Cp, and BIC plotted against the number of variables, with the best model size marked in each panel.]

13 Variable selection Best subset For example, the six variables selected by BIC and the corresponding fitted model are: > names(coef(fit.full, 6)) [1] "(Intercept)" "AtBat" "Hits" "Walks" "CRBI" [6] "DivisionW" "PutOuts" > summary(lm(Salary ~ AtBat + Hits + Walks + CRBI + Division + PutOuts, data = Hitters)) Call: lm(formula = Salary ~ AtBat + Hits + Walks + CRBI + Division + PutOuts, data = Hitters) Residuals: Min 1Q Median 3Q Max Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) AtBat *** Hits e-06 *** Walks ** CRBI < 2e-16 *** DivisionW ** PutOuts *** --- Signif. codes: 0 *** ** 0.01 * Residual standard error: on 256 degrees of freedom (59 observations deleted due to missingness) Multiple R-squared: ,Adjusted R-squared: F-statistic: on 6 and 256 DF, p-value: < 2.2e-16 There are differences compared with the variables significant in the full model (e.g., the initially non-significant CRBI is included, while the initially significant CWalks is not).

14 Variable selection Validation set approach In particular, if our objective is to use the model for prediction, we can use a validation set to identify the best predictors. First split the sample into a training set and a test (validation) set, estimate the best explanatory-variable subsets of sizes 1, 2, ..., p using the training set, and find the subset size with the minimum test set MSE. > sum(is.na(Hitters$Salary)) # number of missing Salary values [1] 59 > Hitters <- na.omit(Hitters) # drop all rows with missing values > sum(is.na(Hitters)) [1] 0 > set.seed(2) # initialize random seed for exact replication ## a random vector of TRUEs and FALSEs with length equaling the rows in Hitters ## and about one half of the values are TRUE > train <- sample(x = c(TRUE, FALSE), size = nrow(Hitters), replace = TRUE) # > mean(train) # proportion of TRUEs [1] > test <- !train # complement set to identify the test set > mean(test) # fraction of observations in the test set [1]

15 Variable selection Validation set approach fit.best <- regsubsets(salary ~., Hitters[train, ], nvmax = 19) # best fitting models > test.mat <- model.matrix(salary ~., data = Hitters[test, ]) # generate model matrix > head(test.mat) # a few first lines of test.mat (Intercept) AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits -Alvin Davis Andre Dawson Alfredo Griffin Al Newman Andres Thomas Alan Trammell CHmRun CRuns CRBI CWalks LeagueN DivisionW PutOuts Assists -Alvin Davis Andre Dawson Alfredo Griffin Al Newman Andres Thomas Alan Trammell Errors NewLeagueN -Alvin Davis Andre Dawson 3 1 -Alfredo Griffin Al Newman 7 0 -Andres Thomas Alan Trammell 22 0 The function model.matrix() generates constant vector for the intercept and transforms factor variables to 0/1 dummy vectors by indicating also which class is labeled by 1.

16 Variable selection Validation set approach > test.mse <- double(19) # vector of length 19 for validation set MSEs > for (i in 1:length(test.mse)) { + betai <- coef(object = fit.best, id = i) # extract coefficients of the model with i x-vars + pred.salary <- test.mat[, names(betai)] %*% betai # pred y = X beta + test.mse[i] <- mean((Hitters$Salary[test] - pred.salary)^2) + } # end for > test.mse # print results [1] [8] [15] > which.min(test.mse) # find the minimum [1] 10 > coef(fit.best, id = 10) # slope coefficients of the best fitting model (Intercept) AtBat Hits Walks CAtBat CRuns CRBI CWalks DivisionW PutOuts Assists > coef(fit.full, id = 10) # slope coefficients of the best set of 10 variables from the full data set (Intercept) AtBat Hits Walks CAtBat CRuns CRBI CWalks DivisionW PutOuts Assists Thus, the best model is the one with 10 predictors. This set of predictors is also the best set of 10 variables from the full data, and is the set that would be selected by the Cp criterion. The results, however, can differ for different training and test sets (in fact, if we initialized the random generator with set.seed(1), as in the book, a slightly different set of predictors would have been selected).

17 Variable selection Validation set approach Remark 1 The practice is that the size k of the best set of predictors (here k = 10) is selected on the basis of the validation approach, while the final best k predictors are selected from the full sample and the corresponding regression is estimated (again from the full sample). Thus, the predictors in the final model may differ from the best predictors of the validation step; only their number is the same (above, the two sets of best predictors happened to coincide).
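A minimal sketch of this practice, assuming the objects Hitters and test.mse from the preceding slides are available:
k.best <- which.min(test.mse)                                   # subset size chosen on the validation set
fit.full <- regsubsets(Salary ~ ., data = Hitters, nvmax = 19)  # re-run best subset selection on the full sample
coef(fit.full, id = k.best)                                     # final model: the best k.best predictors, estimated from the full sample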

18 Variable selection Cross-validation approach In the same manner as in the validation set approach, we can identify the size of the set of best predictors on the basis of cross-validation. We demonstrate here k-fold CV with k = 10. First create a vector that indicates which of the 10 folds each observation belongs to. > set.seed(1) # for exact replication > k <- 10 # n of folds > folds <- sample(x = 1:k, size = nrow(Hitters), replace = TRUE) # randomly formed k folds > head(folds) [1] Thus, here the first observation falls into fold 3, the second into fold 4, etc. The following function produces predictions. > predict.regsubsets <- function(obj, # object produced by regsubsets() + newdata, # out of sample data + id, # id of the model + ... # potential additional arguments if needed + ) { # Source: James et al. (2013) ISL + if (class(obj) != "regsubsets") stop("obj must be produced by the regsubsets() function!") + fmla <- as.formula(obj$call[[2]]) # extract formula from the obj object + beta <- coef(object = obj, id = id) # coefficients corresponding to model id + xmat <- model.matrix(fmla, newdata) # data matrix for prediction computations + return(xmat[, names(beta)] %*% beta) # return predictions + } # pred.regsubsets

19 Variable selection Cross-validation approach Next we loop through the k folds and the best predictor sets to compute the MSEs. > cv.mse <- matrix(nrow = k, ncol = 19, dimnames = list(1:k, 1:19)) # matrix to store MSE-values > for (i in 1:k) { # over validation sets + best.fits <- regsubsets(Salary ~ ., data = Hitters[folds != i, ], nvmax = 19) + y <- Hitters$Salary[folds == i] # new y values + for (j in 1:19) { # MSEs over n of predictors + ypred <- predict.regsubsets(best.fits, newdata = Hitters[folds == i, ], id = j) # predictions + cv.mse[i, j] <- mean((y - ypred)^2) # store MSEs into cv.mse matrix + } # for j + } # for i MSEs for a j-predictor model, j = 1, ..., 19: > (mean.cv.mse <- apply(cv.mse, 2, mean)) # mean mse values, parentheses print the results > plot(mean.cv.mse, type = "b", col = "red", xlab = "N of Predictors", ylab = "MSE") Models with 10 and 11 predictors are close to each other; however, the 11-predictor model is slightly better, as also shown by the following figure, so a model with 11 predictors would be our choice. > coef(regsubsets(Salary ~ ., data = Hitters, nvmax = 19), id = 11) # the best model (Intercept) AtBat Hits Walks CAtBat CRuns CRBI CWalks LeagueN DivisionW PutOuts Assists > plot(mean.cv.mse, type = "b", col = "red", xlab = "N of Predictors", ylab = "MSE", main = "Best 10-fold Cross-Validation MSEs\nfor Different Number of Predictors")

20 Variable selection Cross-validation approach [Figure: best 10-fold cross-validation MSEs plotted against the number of predictors.]

21 1 Model Selection and Regularization Variable selection Ridge regression Lasso Dimension reduction Principal components regression Partial least squares

22 Shrinkage methods constrain or regularize the coefficient estimates, or equivalently shrink the coefficient estimates towards zero. It turns out that shrinking the estimated coefficients towards zero can significantly reduce their variance. The two best-known techniques are ridge regression and the lasso (least absolute shrinkage and selection operator, introduced into statistics by Tibshirani 2). 2 Tibshirani, Robert (1996). Regression shrinkage and selection via the lasso, Journal of the Royal Statistical Society, Series B, 58.

23 1 Model Selection and Regularization Variable selection Ridge regression Lasso Dimension reduction Principal components regression Partial least squares

24 Ridge regression OLS estimates the regression coefficients by minimizing RSS = Σ_{i=1}^n ( y_i − β_0 − Σ_{j=1}^p β_j x_{ij} )². (3) Ridge regression estimates β̂^R_λ are obtained by minimizing Σ_{i=1}^n ( y_i − β_0 − Σ_{j=1}^p β_j x_{ij} )² + λ Σ_{j=1}^p β_j² = RSS + λ Σ_{j=1}^p β_j², (4) where λ ≥ 0 is a tuning parameter, to be determined separately.
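Ridge regression itself is not demonstrated in R on these slides; the following is only a sketch of how the ridge criterion (4) could be minimized with the glmnet package (used for the lasso in Example 2 below) by setting alpha = 0. The x matrix and y vector are constructed as in Example 2.
library(glmnet)
library(ISLR)
Hitters <- na.omit(Hitters)
x <- model.matrix(Salary ~ ., Hitters)[, -1]         # predictor matrix without the intercept column
y <- Hitters$Salary
grid <- 10^seq(10, -2, length = 100)                 # a wide grid of tuning parameter values
ridge.mod <- glmnet(x, y, alpha = 0, lambda = grid)  # alpha = 0 gives the ridge penalty in (4)
coef(ridge.mod)[, 50]                                # coefficient estimates at the 50th lambda value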

25 Ridge regression The term λ Σ_{j=1}^p β_j² is called a shrinkage penalty; it becomes smaller the closer the β_j's are to zero. Thus, like OLS, ridge regression seeks an optimal fit by finding coefficients that minimize the RSS, but at the same time it penalizes large coefficients, so that at the optimum the coefficients are shrunken towards zero. The tuning parameter λ ≥ 0 serves to control the relative impact of these two terms; λ = 0 leads to OLS, while λ → ∞ drives the coefficient estimates towards zero. Selecting a good value for λ is obviously critical and depends on the application. Note that shrinkage is not applied to β_0.

26 Ridge regression Ridge regression's advantage over OLS is rooted in the bias-variance trade-off (in the MSE). As λ increases, the flexibility of the ridge regression fit decreases, leading to decreased variance but increased bias. By a suitable selection of λ the variance can decrease more than the bias increases, which leads to a gain over OLS. Ridge regression works best in cases where the OLS estimates have high variance (e.g., in the case of high multicollinearity, or when the number of explanatory variables is large relative to n). Remark 2 Because the slope coefficients β_j depend on the scale of the explanatory variables x_j, the common practice is to standardize the explanatory variables by scaling them by their (sample) standard deviations, i.e., to use the transformations x̃_j = x_j / s_j, where s_j is the (sample) standard deviation of variable x_j.
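A small sketch of the standardization in Remark 2, assuming the x matrix of Example 2 below (glmnet(), used there, performs this scaling internally by default through its standardize argument):
x.std <- scale(x, center = FALSE, scale = apply(x, 2, sd))  # x_j / s_j for each column
round(apply(x.std, 2, sd), 3)                               # every scaled column now has standard deviation 1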

27 1 Model Selection and Regularization Variable selection Ridge regression Lasso Dimension reduction Principal components regression Partial least squares

28 Lasso The lasso is a relatively recent alternative to ridge regression. The lasso estimates β̂^L_λ of regression (1) are obtained by minimizing Σ_{i=1}^n ( y_i − β_0 − Σ_{j=1}^p β_j x_{ij} )² + λ Σ_{j=1}^p |β_j| = RSS + λ Σ_{j=1}^p |β_j|. (5) The difference from ridge estimation is that in the penalty β_j² is replaced by |β_j| (i.e., the lasso uses the ℓ_1-norm and ridge the ℓ_2-norm). As with ridge, the lasso also shrinks the coefficients towards zero. However, unlike the ℓ_2 penalty, the ℓ_1 penalty has the effect of forcing some coefficients to be exactly equal to zero when the tuning parameter λ is large enough. Again, as in ridge regression, selecting λ is critical.

29 Lasso The property that the ℓ_1 penalty can force some coefficients to be exactly zero gives the lasso the variable selection property ("selection operator", i.e., the "so") indicated in its name. Ridge regression lacks this property, as it does not force any of the initially non-zero coefficients to zero.

30 Lasso Remark 3 A combination of ridge regression and the lasso, called elastic-net regularization, estimates the regression coefficients β_0, β_1, β_2, ..., β_p by minimizing Σ_{i=1}^n ( y_i − β_0 − x_i'β )² + λ P_α(β), (6) where x_i = (x_{i1}, ..., x_{ip})', β = (β_1, ..., β_p)', and P_α(β) = Σ_{j=1}^p ( ½(1 − α)β_j² + α|β_j| ). (7) Thus, α = 0 implies ridge regression and α = 1 the lasso.
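Elastic-net estimation is available in the same glmnet package used for the lasso in Example 2 below; a minimal sketch, assuming the x and y objects of Example 2, where alpha between 0 and 1 mixes the two penalties as in (7):
library(glmnet)
enet.mod <- glmnet(x, y, alpha = 0.5)    # equal weight on the ridge and lasso penalty terms
plot(enet.mod, xvar = "lambda")          # coefficient paths as a function of log(lambda)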

31 Another formulation for ridge regression and lasso The lasso and ridge regression solve the problems min_β { Σ_{i=1}^n (y_i − x_i'β)² } s.t. Σ_{j=1}^p |β_j| ≤ s (8) and min_β { Σ_{i=1}^n (y_i − x_i'β)² } s.t. Σ_{j=1}^p β_j² ≤ s, (9) where x_i = (1, x_{i1}, ..., x_{ip})', β = (β_0, β_1, ..., β_p)', and x_i'β = β_0 + Σ_{j=1}^p β_j x_{ij}. That is, for every λ there is some s such that equations (8) and (9) give the same lasso and ridge coefficients, respectively, and vice versa.

32 Another formulation for ridge regression and lasso For example, the lasso restriction in (8) can be thought of as a budget constraint that defines how large Σ_{j=1}^p |β_j| can be. Formulate the constraints of the lasso and ridge in equations (8) and (9) instead as Σ_{j=1}^p I(β_j ≠ 0) ≤ s, (10) where I(β_j ≠ 0) is an indicator function equaling 1 if β_j ≠ 0 and zero otherwise. Then the RSS is minimized under the constraint that no more than s coefficients can be nonzero, i.e., the problem becomes a best subset selection problem.

33 Variable selection property of lasso The figure below illustrates in the case of two variables the situation in which lasso tends to have a variable selection property, while ridge regression does not (Source: James et al. 2013, Fig 6.7).

34 Variable selection property of lasso Thus, if the OLS estimate (β̂_1, β̂_2) lies outside the regions |β_1| + |β_2| ≤ s and β_1² + β_2² ≤ s, the lasso can reach the boundary at a corner, at which one of the coefficients equals zero. The ellipses depict values at which the RSS remains constant. If the OLS solution lies inside the diamond (lasso) or the circle (ridge), the budget constraint is satisfied by the OLS solution and the lasso and ridge regression give the same values as OLS.

35 Selecting the tuning parameter λ The tuning parameter is chosen as follows: choose a grid of λ values; compute the cross-validation error for each value of λ; select the value of λ for which the cross-validation error is the smallest. Finally, the model is re-fit using all of the available observations and the selected value of the tuning parameter. Remark 4 Because the OLS estimate β̂ corresponds to the lasso (and ridge) with λ = 0, and OLS minimizes RSS = Σ_{i=1}^n (y_i − x_i'β̂)² in the training sample, we have Σ_{i=1}^n (y_i − x_i'β̂^L_λ)² ≥ RSS in the training sample for any λ > 0. Therefore, finding the optimal λ must be based on some sort of out-of-sample computation, like cross-validation.
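A sketch of these steps with the cv.glmnet() function used later in Example 2 (the x and y objects are constructed there; the grid of λ values here is only an illustrative choice):
library(glmnet)
grid <- 10^seq(2, -3, length = 100)                      # grid of candidate lambda values
cv.fit <- cv.glmnet(x, y, alpha = 1, lambda = grid)      # 10-fold CV error for each lambda
(best.lam <- cv.fit$lambda.min)                          # lambda with the smallest CV error
final.fit <- glmnet(x, y, alpha = 1, lambda = best.lam)  # re-fit using all observations
coef(final.fit)                                          # final coefficient estimates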

36 Example 2 We utilize again the Hitters data set to demonstrate the lasso. R has the glmnet package for performing lasso, ridge regression, and more generally elastic-net regularization estimation. The main function of the package is glmnet(), which does not allow missing values, and R factor variables must first be transformed to 0/1 dummy variables. > library(ISLR) > library(glmnet) > head(Hitters) AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun -Andy Allanson Alan Ashby Alvin Davis Andre Dawson Andres Galarraga Alfredo Griffin CRuns CRBI CWalks League Division PutOuts Assists Errors -Andy Allanson A E Alan Ashby N W Alvin Davis A W Andre Dawson N E Andres Galarraga N E Alfredo Griffin A W Salary NewLeague -Andy Allanson NA A -Alan Ashby N -Alvin Davis A -Andre Dawson N -Andres Galarraga 91.5 N -Alfredo Griffin A > Hitters <- na.omit(Hitters) # remove lines with missing values

37 Function glmnet() requires its input as an x matrix (predictors) and a y vector (dependent variable); the formula syntax y ~ x does not work. The model.matrix() function generates the required x matrix and also transforms factor variables to dummy variables. glmnet() performs the lasso when alpha = 1 (which is also the default) and automatically selects a range of λ values. Below we use this automatically generated grid for λ. > x <- model.matrix(Salary ~ ., Hitters)[, -1] # x-variables, drop the constant term vector > y <- Hitters$Salary > head(x) (results omitted) > lasso.mod <- glmnet(x = x, y = y, alpha = 1) ## alpha = 1 performs lasso ## for a range of lambda values

38 > str(lasso.mod) # structure of the object produced by glmnet() List of 12 $ a0 : Named num [1:80] attr(*, "names")= chr [1:80] "s0" "s1" "s2" "s3"... $ beta :Formal class dgcmatrix [package "Matrix"] with 6 slots....@ i : int [1:882] @ p : int [1:81] @ Dim : int [1:2] @ Dimnames:List of $ : chr [1:19] "AtBat" "Hits" "HmRun" "Runs" $ : chr [1:80] "s0" "s1" "s2" "s3" @ x : num [1:882] @ factors : list() $ df : int [1:80] $ dim : int [1:2] $ lambda : num [1:80] $ dev.ratio: num [1:80] $ nulldev : num $ npasses : int 2851 $ jerr : int 0 $ offset : logi FALSE $ call : language glmnet(x = x, y = y, alpha = 1) $ nobs : int attr(*, "class")= chr [1:2] "elnet" "glmnet"

39 Below are shown the number of values in the automatically generated λ-grid and some of the values. > length(lasso.mod$lambda) # number of lambdas [1] 80 > c(min = min(lasso.mod$lambda), max = max(lasso.mod$lambda)) # range of lambdas min max > head(lasso.mod$lambda) # a few first lambdas [1] > tail(lasso.mod$lambda) # a few last lambdas [1] Thus, the regression coefficients are computed for 80 values of λ. These results are stored in the beta matrix of the lasso.mod object and can be extracted directly or using the coef() function. > dim(coef(lasso.mod)) # dimension of the coefficient matrix (beta) [1] > lasso.mod$lambda[50] # the 50th value of lambda [1] > coef(lasso.mod)[, 50] # the corresponding beta estimates (Intercept) AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN > sum(abs(coef(lasso.mod)[-1, 50])) # the corresponding L1-norm [1]

40 The function coef() can be used to produce lasso estimates for any value of λ (the same can be done with the predict() function). For further information, see help(coef.glmnet) and help(predict.glmnet). > drop(coef(lasso.mod, s = 5)) # lasso estimates for any value of lambda (argument s) (Intercept) AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN

41 > par(mfrow = c(1, 2)) > plot(lasso.mod, xvar = "lambda", xlab = expression(lambda)) > plot(lasso.mod, xvar = "norm", xlab = expression(log(sum(abs(beta[j]))))) [Figure: lasso coefficient paths.] The coefficients are plotted against log(λ) in the left panel and against the logarithm of the ℓ_1-norm of the coefficients in the right panel. The numbers at the top of the figure indicate the number of non-zero coefficients. The figure shows that, depending on the choice of λ, some of the coefficients will be exactly equal to zero.

42 Next we use cross-validation to identify the best λ in terms of the MSE. The glmnet package has the function cv.glmnet() for this purpose. > set.seed(1) # for replication purposes > train <- sample(1:nrow(x), nrow(x)/2) # random selection of about ## one half of the observations for a train set > test <- -train # the test set consists of the observations not in the train set > y.test <- y[test] # test y-values > cv.out <- cv.glmnet(x[train, ], y[train], alpha = 1) # by default performs 10-fold CV > plot(cv.out) [Figure: cross-validation mean squared error against log(lambda).]

43 > (best.lam <- cv.out$lambda.min) # lambda for which CV MSE is at minimum [1] > lasso.train <- glmnet(x[train, ], y[train], alpha = 1) # lasso fit on the training set > lasso.pred <- predict(lasso.train, s = best.lam, newx = x[test, ]) # predicted values ## with coefficients corresponding to best.lam > mean((lasso.pred - y.test)^2) # lasso test MSE [1] ## for comparison, compute the test MSE for the OLS model estimated from the training set > ols.train <- lm(Salary ~ ., data = Hitters, subset = train) # OLS train data estimates > ols.pred <- predict(ols.train, newdata = Hitters[test, ]) # OLS prediction > mean((ols.pred - y.test)^2) # OLS test MSE [1] Here the lasso outperforms OLS in terms of test MSE. Below is a scatter plot of the test set realized and predicted salaries.

44 > plot(x = ols.pred, y = y.test, col = "red", xlim = c(0, 2500), ylim = c(0, 2500), + xlab = "Predicted", ylab = "Realized") > abline(lm(y.test ~ ols.pred), col = "red") > points(x = lasso.pred, y = y.test, col = "steelblue") > abline(lm(y.test ~ lasso.pred), col = "steelblue") > abline(a = 0, b = 1, lty = "dashed", col = "gray") > legend("topleft", legend = c("OLS", "Lasso"), col = c("red", "steelblue"), + pch = c(1, 1), bty = "n") [Figure: realized versus predicted test-set salaries for OLS and the lasso.]

45 Dimension reduction 1 Model Selection and Regularization Variable selection Ridge regression Lasso Dimension reduction Principal components regression Partial least squares

46 Dimension reduction Variable selection and shrinkage methods aim to control variance while relying on the original predictors x_1, ..., x_p. Dimension reduction aims to find a smaller number of new variables z_1, ..., z_M, M < p, that are linear combinations of the original predictors, i.e., z_m = Σ_{j=1}^p φ_{jm} x_j (11) for some constants φ_{1m}, ..., φ_{pm}, m = 1, ..., M. Then y is regressed on these new variables, y_i = θ_0 + Σ_{m=1}^M θ_m z_{im} + ɛ_i, i = 1, ..., n. (12) A proper selection of the φ-coefficients in (11) can lead the dimension-reduced regression (12), with M + 1 coefficients, to outperform the original regression of y on the x-variables with p + 1 coefficients.
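As an illustrative sketch of (11)-(12), using the x and y objects of Example 2 and principal components as one possible choice of the φ coefficients:
M <- 5                                 # number of new variables
pc <- prcomp(x, scale. = TRUE)         # linear combinations of the standardized predictors
z <- pc$x[, 1:M]                       # scores z_1, ..., z_M as in (11)
summary(lm(y ~ z))                     # regression (12) with M + 1 coefficients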

47 Dimension reduction 1 Model Selection and Regularization Variable selection Ridge regression Lasso Dimension reduction Principal components regression Partial least squares

48 Dimension reduction Principal components Principal component analysis (PCA) is a popular approach for reducing the dimension of the original p variables to a low number M ≪ p of linear combinations of the form (11), such that these linear combinations capture the variability of the original x-variables. Given an n × p data matrix X, PCA derives linear combinations of the variables such that: the first principal component defines the direction of the data along which the observations vary the most; the second principal component defines a direction that is orthogonal to the first one and along which the data vary the most among directions orthogonal to the first; the third principal component is defined such that it is orthogonal to the first and second and captures most of the remaining variability. Generally, the mth principal component is defined such that it is orthogonal to the earlier m − 1 components and captures most of the remaining variability.

49 Dimension reduction Principal components Mathematically this reduces to finding the eigenvalues of the covariance matrix cov(X) = Σ. The eigenvector of the largest eigenvalue is the coefficient vector of the first component, the eigenvector of the second largest eigenvalue is the coefficient vector of the second component, and so forth. The eigenvectors are normed to unity, i.e., if φ_m = (φ_{1m}, ..., φ_{pm})' is the mth eigenvector, then φ_m'φ_m = 1. Because the eigenvalue problem is scale dependent, the solution is most often extracted from the correlation matrix (i.e., the covariance matrix of the standardized variables).
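A sketch of this computation in R (x again as in Example 2): the loading vectors are the unit-norm eigenvectors of the correlation matrix, and prcomp() returns the same directions.
ev <- eigen(cor(x))                          # eigen decomposition of the correlation matrix
phi1 <- ev$vectors[, 1]                      # loadings of the first principal component
sum(phi1^2)                                  # normed to unity
ev$values[1] / sum(ev$values)                # share of the total variance captured by the first PC
head(prcomp(x, scale. = TRUE)$rotation[, 1]) # the same loadings (possibly with opposite sign)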

50 Dimension reduction Principal components The figure below illustrates the PC solution in the case of two variables, population size (pop) and ad spending (ad) (Source: James et al. 2013, Fig 6.14). [Figure: ad spending against population with the principal component directions.] The green line is the first PC and the dashed line is the second.

51 Dimension reduction Principal components The computed scores z_{im} = φ_{1m}(x_{i1} − x̄_1) + ⋯ + φ_{pm}(x_{ip} − x̄_p), (13) i = 1, ..., n, are called the principal component scores. Another interpretation of PCA is that the first PC vector defines the line that is as close as possible to the data. This is illustrated by the figure below (Source: James et al. 2013, Fig 6.15). [Figure: ad spending against population with the first and second principal component directions.]

52 Dimension reduction Principal components Plotting the components against the original variables illustrates graphically how well a component represents the variability of that variable. For the advertising data the first PC represents both variables well, while the second PC is not much related to either of the variables (Source: James et al. 2013, Fig 6.16 & 6.17). [Figure: population and ad spending plotted against the first principal component.]

53 Dimension reduction Principal components [Figure: population and ad spending plotted against the second principal component.] Because the first PC captures virtually all of the variability in the two variables of the advertising data, the component reflects jointly the population size and the advertising spending of the cities. The components are linear combinations of the demeaned original variables (eq. (13)). Therefore, negative values reflect below-average population size and below-average advertising budget, values close to zero reflect average size and budget, and positive values reflect above-average population size and budget.

54 Dimension reduction Principal component regression Principal component regression (PCR) involves first constructing the M PCs and then regressing y on the components. The key is that M ≪ p. If M = p, then PCR amounts to the same fit as the least squares fit on the original variables. M, the number of components for PCR, is typically selected by cross-validation. Also, when using PCR, it is generally recommended to standardize the x-variables before computing the PCs.

55 Dimension reduction Principal component regression Example 3 Consider again the Hitters data set. > library(pls) # the pls library contains pcr regression and more > library(ISLR) > Hitters <- na.omit(Hitters) # remove missing values > pcr.fit <- pcr(Salary ~ ., data = Hitters, scale = TRUE, # standardize x-variables + validation = "CV") # use CV to select n of components, default 10-fold > summary(pcr.fit) # summary of the results Data: X dimension: Y dimension: Fit method: svdpc Number of components considered: 19 VALIDATION: RMSEP Cross-validated using 10 random segments. (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps CV adjcv comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps CV adjcv comps 15 comps 16 comps 17 comps 18 comps 19 comps CV adjcv TRAINING: % variance explained 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps X Salary comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps X Salary comps 17 comps 18 comps 19 comps X Salary

56 Dimension reduction Principal component regression The RMSEP is the square root of the CV MSE, and the smallest value is reached with six PCs, as is also shown by the plot of the RMSEPs. Six components explain % of the total variation among the predictors. > validationplot(pcr.fit, val.type = "RMSEP") # plot RMSEPs [Figure: cross-validated RMSEP for Salary against the number of components.]

57 Dimension reduction Principal component regression Next we demonstrate in terms of test MSE how the PCR regression performs by using a test set. First define the best number of components from the training set, estimate the corresponding regression, and compute the test MSE. > set.seed(1) # for exact replication > train <- sample(nrow(Hitters), size = nrow(Hitters) / 2) # training set > head(train) # examples of observations in the training set [1] > test <- -train # the test set consists of the observations not in the train set > pcr.train <- pcr(Salary ~ ., data = Hitters, subset = train, scale = TRUE, + validation = "CV") # training set > validationplot(pcr.train, val.type = "RMSEP") > summary(pcr.train) Data: X dimension: Y dimension: Fit method: svdpc Number of components considered: 19 VALIDATION: RMSEP Cross-validated using 10 random segments. (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps CV adjcv comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps CV adjcv comps 15 comps 16 comps 17 comps 18 comps 19 comps CV adjcv TRAINING: % variance explained 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps X Salary comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps X Salary comps 17 comps 18 comps 19 comps X Salary

58 Dimension reduction Principal component regression On the basis of the training-data CV RMSEP, M = 7 principal components yields the best results. Next compute the test MSE. > pcr.pred <- predict(pcr.train, Hitters[test, ], ncomp = 7) > head(pcr.pred) # a few first predictions [1] > mean((pcr.pred - Hitters$Salary[test])^2) # test set MSE [1] This test set MSE is slightly smaller than that of the lasso (114,470.6). PCR is useful for prediction purposes; if interpretation of the model is needed, PCR results may be difficult to interpret. Finally, estimating the M = 7 component model from the full data set yields the following results. > summary(pcr.fit <- pcr(Salary ~ ., data = Hitters, scale = TRUE, ncomp = 7)) # estimate and summarize Data: X dimension: Y dimension: Fit method: svdpc Number of components considered: 7 TRAINING: % variance explained 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps X Salary

59 Dimension reduction 1 Model Selection and Regularization Variable selection Ridge regression Lasso Dimension reduction Principal components regression Partial least squares

60 Dimension reduction Partial least squares PCR focuses on reducing the dimension of the predictors without using the help of the dependent variable y. In this sense PCR can be considered unsupervised learning. As a result, nothing guarantees that the new directions really help in predicting y. Partial least squares (PLS) can be considered a supervised alternative to PCR. Similar to PCR, PLS seeks to identify new features z_1, ..., z_M (M < p) that are linear combinations of the original variables, but unlike the PCs, they are also related to the response y. In short, PLS attempts to find directions that help explain both the response and the predictors.

61 Dimension reduction Partial least squares In PLS the predictors are first standardized. The first PLS direction z_1 is defined by regressing y on each predictor x_j separately, one at a time, and using the resulting coefficients as the weights φ_{j1} to define z_1 = φ_{11} x_1 + ⋯ + φ_{p1} x_p. The second PLS direction is defined by first regressing each x_j on z_1 and taking the residuals (i.e., the variation of x_j not explained by z_1). Using these orthogonalized data, z_2 is formed in the same fashion as z_1 was formed from the original data. This iterative procedure is repeated M times to produce the PLS components z_1, ..., z_M. M is chosen by cross-validation.
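A minimal sketch of the first PLS direction, using the x and y objects of Example 2: the weight of each predictor is the slope from the simple regression of y on that (standardized) predictor.
x.std <- scale(x)                                           # standardized predictors
phi1 <- apply(x.std, 2, function(xj) coef(lm(y ~ xj))[2])   # simple-regression slopes as weights
z1 <- drop(x.std %*% phi1)                                  # first PLS direction z_1
cor(z1, y)                                                  # z_1 is constructed to be related to y
library(pls)
cor(z1, plsr(y ~ x.std)$scores[, 1])                        # agrees with the first plsr() component up to scale and sign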

62 Dimension reduction Partial least squares Example 4 Continuing the previous examples, we use plsr() function of the pls library. > set.seed(1) # initialize again the seed > pls.train <- plsr(salary ~., data = Hitters, subset = train, scale = TRUE, + validation = "CV") > summary(pls.train) Data: X dimension: Y dimension: Fit method: kernelpls Number of components considered: 19 VALIDATION: RMSEP Cross-validated using 10 random segments. (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps CV adjcv comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps CV adjcv comps 15 comps 16 comps 17 comps 18 comps 19 comps CV adjcv TRAINING: % variance explained 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps X Salary comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps X Salary comps 17 comps 18 comps 19 comps X Salary > validationplot(pls.train, val.type = "RMSEP")

63 Dimension reduction Partial least squares M = 2 produces the smallest CV RMSEP. [Figure: cross-validated RMSEP for Salary against the number of components.] > pls.pred <- predict(pls.train, newdata = Hitters[test, ], ncomp = 2) > mean((Hitters$Salary[test] - pls.pred)^2) [1] The test MSE is comparable to but slightly higher than that of PCR.


More information

Chris Fraley and Daniel Percival. August 22, 2008, revised May 14, 2010

Chris Fraley and Daniel Percival. August 22, 2008, revised May 14, 2010 Model-Averaged l 1 Regularization using Markov Chain Monte Carlo Model Composition Technical Report No. 541 Department of Statistics, University of Washington Chris Fraley and Daniel Percival August 22,

More information

Regularization and Variable Selection via the Elastic Net

Regularization and Variable Selection via the Elastic Net p. 1/1 Regularization and Variable Selection via the Elastic Net Hui Zou and Trevor Hastie Journal of Royal Statistical Society, B, 2005 Presenter: Minhua Chen, Nov. 07, 2008 p. 2/1 Agenda Introduction

More information

PENALIZED PRINCIPAL COMPONENT REGRESSION. Ayanna Byrd. (Under the direction of Cheolwoo Park) Abstract

PENALIZED PRINCIPAL COMPONENT REGRESSION. Ayanna Byrd. (Under the direction of Cheolwoo Park) Abstract PENALIZED PRINCIPAL COMPONENT REGRESSION by Ayanna Byrd (Under the direction of Cheolwoo Park) Abstract When using linear regression problems, an unbiased estimate is produced by the Ordinary Least Squares.

More information

Introduction to the genlasso package

Introduction to the genlasso package Introduction to the genlasso package Taylor B. Arnold, Ryan Tibshirani Abstract We present a short tutorial and introduction to using the R package genlasso, which is used for computing the solution path

More information

PRINCIPAL COMPONENTS ANALYSIS

PRINCIPAL COMPONENTS ANALYSIS 121 CHAPTER 11 PRINCIPAL COMPONENTS ANALYSIS We now have the tools necessary to discuss one of the most important concepts in mathematical statistics: Principal Components Analysis (PCA). PCA involves

More information

Consider fitting a model using ordinary least squares (OLS) regression:

Consider fitting a model using ordinary least squares (OLS) regression: Example 1: Mating Success of African Elephants In this study, 41 male African elephants were followed over a period of 8 years. The age of the elephant at the beginning of the study and the number of successful

More information

COS513: FOUNDATIONS OF PROBABILISTIC MODELS LECTURE 10

COS513: FOUNDATIONS OF PROBABILISTIC MODELS LECTURE 10 COS53: FOUNDATIONS OF PROBABILISTIC MODELS LECTURE 0 MELISSA CARROLL, LINJIE LUO. BIAS-VARIANCE TRADE-OFF (CONTINUED FROM LAST LECTURE) If V = (X n, Y n )} are observed data, the linear regression problem

More information

Introduction and Background to Multilevel Analysis

Introduction and Background to Multilevel Analysis Introduction and Background to Multilevel Analysis Dr. J. Kyle Roberts Southern Methodist University Simmons School of Education and Human Development Department of Teaching and Learning Background and

More information

Linear Regression Linear Regression with Shrinkage

Linear Regression Linear Regression with Shrinkage Linear Regression Linear Regression ith Shrinkage Introduction Regression means predicting a continuous (usually scalar) output y from a vector of continuous inputs (features) x. Example: Predicting vehicle

More information

Regression I: Mean Squared Error and Measuring Quality of Fit

Regression I: Mean Squared Error and Measuring Quality of Fit Regression I: Mean Squared Error and Measuring Quality of Fit -Applied Multivariate Analysis- Lecturer: Darren Homrighausen, PhD 1 The Setup Suppose there is a scientific problem we are interested in solving

More information

Introduction to Statistical modeling: handout for Math 489/583

Introduction to Statistical modeling: handout for Math 489/583 Introduction to Statistical modeling: handout for Math 489/583 Statistical modeling occurs when we are trying to model some data using statistical tools. From the start, we recognize that no model is perfect

More information

Stat 502X Exam 1 Spring 2014

Stat 502X Exam 1 Spring 2014 Stat 502X Exam 1 Spring 2014 I have neither given nor received unauthorized assistance on this exam. Name Signed Date Name Printed This is a long exam consisting of 11 parts. I'll score it at 10 points

More information

Regression Shrinkage and Selection via the Lasso

Regression Shrinkage and Selection via the Lasso Regression Shrinkage and Selection via the Lasso ROBERT TIBSHIRANI, 1996 Presenter: Guiyun Feng April 27 () 1 / 20 Motivation Estimation in Linear Models: y = β T x + ɛ. data (x i, y i ), i = 1, 2,...,

More information

Solutions to obligatorisk oppgave 2, STK2100

Solutions to obligatorisk oppgave 2, STK2100 Solutions to obligatorisk oppgave 2, STK2100 Vinnie Ko May 14, 2018 Disclaimer: This document is made solely for my own personal use and can contain many errors. Oppgave 1 We load packages and read data

More information

Prediction problems 3: Validation and Model Checking

Prediction problems 3: Validation and Model Checking Prediction problems 3: Validation and Model Checking Data Science 101 Team May 17, 2018 Outline Validation Why is it important How should we do it? Model checking Checking whether your model is a good

More information

Applied Regression Analysis

Applied Regression Analysis Applied Regression Analysis Chapter 3 Multiple Linear Regression Hongcheng Li April, 6, 2013 Recall simple linear regression 1 Recall simple linear regression 2 Parameter Estimation 3 Interpretations of

More information

R in Linguistic Analysis. Wassink 2012 University of Washington Week 6

R in Linguistic Analysis. Wassink 2012 University of Washington Week 6 R in Linguistic Analysis Wassink 2012 University of Washington Week 6 Overview R for phoneticians and lab phonologists Johnson 3 Reading Qs Equivalence of means (t-tests) Multiple Regression Principal

More information

EXTENDING PARTIAL LEAST SQUARES REGRESSION

EXTENDING PARTIAL LEAST SQUARES REGRESSION EXTENDING PARTIAL LEAST SQUARES REGRESSION ATHANASSIOS KONDYLIS UNIVERSITY OF NEUCHÂTEL 1 Outline Multivariate Calibration in Chemometrics PLS regression (PLSR) and the PLS1 algorithm PLS1 from a statistical

More information

Continuous soil attribute modeling and mapping: Multiple linear regression

Continuous soil attribute modeling and mapping: Multiple linear regression Continuous soil attribute modeling and mapping: Multiple linear regression Soil Security Laboratory 2017 1 Multiple linear regression Multiple linear regression (MLR) is where we regress a target variable

More information

Biostatistics Advanced Methods in Biostatistics IV

Biostatistics Advanced Methods in Biostatistics IV Biostatistics 140.754 Advanced Methods in Biostatistics IV Jeffrey Leek Assistant Professor Department of Biostatistics jleek@jhsph.edu Lecture 12 1 / 36 Tip + Paper Tip: As a statistician the results

More information

Density Temp vs Ratio. temp

Density Temp vs Ratio. temp Temp Ratio Density 0.00 0.02 0.04 0.06 0.08 0.10 0.12 Density 0.0 0.2 0.4 0.6 0.8 1.0 1. (a) 170 175 180 185 temp 1.0 1.5 2.0 2.5 3.0 ratio The histogram shows that the temperature measures have two peaks,

More information

MSA220/MVE440 Statistical Learning for Big Data

MSA220/MVE440 Statistical Learning for Big Data MSA220/MVE440 Statistical Learning for Big Data Lecture 7/8 - High-dimensional modeling part 1 Rebecka Jörnsten Mathematical Sciences University of Gothenburg and Chalmers University of Technology Classification

More information

A Blockwise Descent Algorithm for Group-penalized Multiresponse and Multinomial Regression

A Blockwise Descent Algorithm for Group-penalized Multiresponse and Multinomial Regression A Blockwise Descent Algorithm for Group-penalized Multiresponse and Multinomial Regression Noah Simon Jerome Friedman Trevor Hastie November 5, 013 Abstract In this paper we purpose a blockwise descent

More information

Iterative Selection Using Orthogonal Regression Techniques

Iterative Selection Using Orthogonal Regression Techniques Iterative Selection Using Orthogonal Regression Techniques Bradley Turnbull 1, Subhashis Ghosal 1 and Hao Helen Zhang 2 1 Department of Statistics, North Carolina State University, Raleigh, NC, USA 2 Department

More information

CMSC858P Supervised Learning Methods

CMSC858P Supervised Learning Methods CMSC858P Supervised Learning Methods Hector Corrada Bravo March, 2010 Introduction Today we discuss the classification setting in detail. Our setting is that we observe for each subject i a set of p predictors

More information

MS&E 226. In-Class Midterm Examination Solutions Small Data October 20, 2015

MS&E 226. In-Class Midterm Examination Solutions Small Data October 20, 2015 MS&E 226 In-Class Midterm Examination Solutions Small Data October 20, 2015 PROBLEM 1. Alice uses ordinary least squares to fit a linear regression model on a dataset containing outcome data Y and covariates

More information

A simulation study of model fitting to high dimensional data using penalized logistic regression

A simulation study of model fitting to high dimensional data using penalized logistic regression A simulation study of model fitting to high dimensional data using penalized logistic regression Ellinor Krona Kandidatuppsats i matematisk statistik Bachelor Thesis in Mathematical Statistics Kandidatuppsats

More information

Stat 401B Final Exam Fall 2015

Stat 401B Final Exam Fall 2015 Stat 401B Final Exam Fall 015 I have neither given nor received unauthorized assistance on this exam. Name Signed Date Name Printed ATTENTION! Incorrect numerical answers unaccompanied by supporting reasoning

More information

Exam Applied Statistical Regression. Good Luck!

Exam Applied Statistical Regression. Good Luck! Dr. M. Dettling Summer 2011 Exam Applied Statistical Regression Approved: Tables: Note: Any written material, calculator (without communication facility). Attached. All tests have to be done at the 5%-level.

More information

A Handbook of Statistical Analyses Using R. Brian S. Everitt and Torsten Hothorn

A Handbook of Statistical Analyses Using R. Brian S. Everitt and Torsten Hothorn A Handbook of Statistical Analyses Using R Brian S. Everitt and Torsten Hothorn CHAPTER 6 Logistic Regression and Generalised Linear Models: Blood Screening, Women s Role in Society, and Colonic Polyps

More information

Bayesian variable selection via. Penalized credible regions. Brian Reich, NCSU. Joint work with. Howard Bondell and Ander Wilson

Bayesian variable selection via. Penalized credible regions. Brian Reich, NCSU. Joint work with. Howard Bondell and Ander Wilson Bayesian variable selection via penalized credible regions Brian Reich, NC State Joint work with Howard Bondell and Ander Wilson Brian Reich, NCSU Penalized credible regions 1 Motivation big p, small n

More information

MATH 644: Regression Analysis Methods

MATH 644: Regression Analysis Methods MATH 644: Regression Analysis Methods FINAL EXAM Fall, 2012 INSTRUCTIONS TO STUDENTS: 1. This test contains SIX questions. It comprises ELEVEN printed pages. 2. Answer ALL questions for a total of 100

More information