UVA CS 4501: Machine Learning. Lecture 6 Extra: Optimization for Linear Regression Model with Regularizations. Dr. Yanjun Qi. University of Virginia


1 UVA CS 4501: Machine Learning. Lecture 6 Extra: Optimization for Linear Regression Model with Regularizations. Dr. Yanjun Qi, University of Virginia, Department of Computer Science

2 EXTRA (NOT REQUIRED IN EXAMS)

3 Extra Recap
- More about LR Model with Regularizations
- Ridge Regression
- Lasso Regression
- Extra: how to perform training
- Elastic net
- Extra: how to perform training

4 Why Invertible in Ridge Regression? $\beta^* = (X^T X + \lambda I)^{-1} X^T y$ (NOT AN EASY PROOF; it draws on many concepts: SVD, PCA). If shown through SVD: eigenvalues, relation to singular values. https://

5 Why Invertible in Ridge Regression?
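One way to see the invertibility: $X^T X$ is symmetric positive semidefinite, so its eigenvalues are the squared singular values $d_j^2 \ge 0$ of $X$, and adding $\lambda I$ shifts each of them to $d_j^2 + \lambda > 0$ whenever $\lambda > 0$. The sketch below checks this numerically with NumPy. The random design with p > n, the seed, and the value of λ are illustrative assumptions, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 5, 8, 0.1          # p > n, so X^T X (p x p) has rank at most n
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

gram = X.T @ X
eig_gram = np.linalg.eigvalsh(gram)                      # eigenvalues of X^T X (>= 0 up to rounding)
eig_ridge = np.linalg.eigvalsh(gram + lam * np.eye(p))   # the same eigenvalues shifted up by lam

rank_gram = np.linalg.matrix_rank(gram)   # at most n < p, so X^T X alone is singular
min_eig_ridge = eig_ridge.min()           # strictly positive, so the ridge matrix is invertible

# Ridge solution beta* = (X^T X + lam I)^{-1} X^T y, computed with a linear solve
beta_ridge = np.linalg.solve(gram + lam * np.eye(p), X.T @ y)
```

Using `np.linalg.solve` rather than forming the inverse explicitly is a common numerical choice; the result is the same estimator.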

6 Extra: two forms of Ridge Regression. The two forms are totally equivalent. http://stats.stackexchange.com/questions/ /how-to-find-regression-coefficients-beta-in-ridge-regression
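For reference, the two forms usually meant here are the penalized form and the constrained form. The display below is a sketch in the notation of the other slides; the budget symbol $s$ is borrowed from the lasso slide and is an assumption about the notation used on the original slide.

```latex
\hat{\beta}^{\text{ridge}}
  = \arg\min_{\beta}\; (y - X\beta)^T (y - X\beta) + \lambda \sum_{j} \beta_j^2
\qquad \text{(penalized form)}

\hat{\beta}^{\text{ridge}}
  = \arg\min_{\beta}\; (y - X\beta)^T (y - X\beta)
  \quad \text{subject to} \quad \sum_{j} \beta_j^2 \le s
\qquad \text{(constrained form)}
```

Each $\lambda > 0$ corresponds to a budget $s$ that yields the same solution, which is the sense in which the two forms are equivalent.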

7 Extra: Intercept Term is usually not shrunk. If the data is not centered, there exists a bias term. http://stats.stackexchange.com/questions/86991/reason-for-not-shrinking-the-bias-intercept-term-in-regression We normally assume x and y have been centered. If this is true, there is no need for a bias term, e.g., for lasso.
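The centering remark can be checked numerically. The sketch below uses ridge (because it has a closed form) and compares (a) centering x and y and fitting without a bias term against (b) keeping the raw data and fitting an intercept that is excluded from the penalty. The synthetic data, seed, and λ are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, lam = 50, 3, 2.0
X = rng.standard_normal((n, p)) + 5.0        # deliberately not centered
y = X @ np.array([1.0, -2.0, 0.5]) + 3.0 + 0.1 * rng.standard_normal(n)

# (a) Center x and y, then fit ridge with no bias term.
x_mean, y_mean = X.mean(axis=0), y.mean()
Xc, yc = X - x_mean, y - y_mean
beta_centered = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
intercept_centered = y_mean - x_mean @ beta_centered   # recover the bias afterwards

# (b) Keep the raw data, add a column of ones, and penalize every
#     coefficient except the intercept (its penalty entry is set to 0).
X1 = np.hstack([np.ones((n, 1)), X])
penalty = lam * np.eye(p + 1)
penalty[0, 0] = 0.0
theta = np.linalg.solve(X1.T @ X1 + penalty, X1.T @ y)
intercept_raw, beta_raw = theta[0], theta[1:]
```

Both routes give the same slopes and the same intercept, which is why the slides can drop the bias term once the data is centered.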


9 Due to the nature of the L_1 norm, the viable solutions are limited to corners, which lie on only a few axes; in the above case, on x1, with x2 = 0. This means that the solution has eliminated the role of x2, leading to sparsity.


11 http:// In mathematics, particularly in calculus, a stationary point or critical point of a differentiable function of one variable is a point of the domain of the function where the derivative is zero (equivalently, the slope of the graph at that point is zero).

12 How to train Parameter for Lasso: $\hat{\beta}^{\text{lasso}} = \arg\min_{\beta}\, (y - X\beta)^T (y - X\beta)$ subject to $\sum_j |\beta_j| \le s$. Here we assume x and y have been centered (as is normal), so no bias term is needed in the above.


15 We just need 0 to lie in the interval $[-c_j - \lambda,\; -c_j + \lambda]$ (subgradient calculus).
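The interval condition leads to the soft-thresholding update. The sketch below assumes the usual setup for this derivation, which is not visible in the surviving slides: the objective is $\|y - X\beta\|^2 + \lambda\|\beta\|_1$, and with all other coordinates fixed, the derivative of the squared-error part with respect to $\beta_j$ is $a_j \beta_j - c_j$ with $a_j = 2\sum_i x_{ij}^2$ and $c_j = 2\sum_i x_{ij}(y_i - \sum_{k \ne j} x_{ik}\beta_k)$. At $\beta_j = 0$ the subgradient set is $[-c_j - \lambda, -c_j + \lambda]$, so $\beta_j = 0$ is optimal exactly when $|c_j| \le \lambda$. The function names are hypothetical.

```python
import numpy as np

def soft_threshold(c, lam):
    """Return sign(c) * max(|c| - lam, 0): shrink c toward 0 by lam, clipping at 0."""
    return np.sign(c) * np.maximum(np.abs(c) - lam, 0.0)

def lasso_coordinate_update(a_j, c_j, lam):
    """Minimizer over beta_j of the one-dimensional lasso problem.

    Assumes the squared-error derivative is a_j * beta_j - c_j with a_j > 0:
      beta_j = (c_j - lam) / a_j   if c_j >  lam
      beta_j = 0                   if |c_j| <= lam   (0 lies in the subgradient interval)
      beta_j = (c_j + lam) / a_j   if c_j < -lam
    """
    return soft_threshold(c_j, lam) / a_j
```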

16 Lasso

18 Coordinate Descent based Learning of Lasso. Coordinate descent (Wikipedia): one does a line search along one coordinate direction at the current point in each iteration, and uses different coordinate directions cyclically throughout the procedure. For lasso, the per-coordinate update is soft-thresholding.
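A minimal sketch of cyclic coordinate descent for lasso follows, assuming the penalized objective $\|y - X\beta\|^2 + \lambda\|\beta\|_1$ with centered data and a fixed number of sweeps instead of a convergence test. The function name `lasso_cd`, the sweep count, and the synthetic data are illustrative assumptions.

```python
import numpy as np

def lasso_cd(X, y, lam, n_sweeps=200):
    """Cyclic coordinate descent for min_b ||y - X b||^2 + lam * ||b||_1."""
    p = X.shape[1]
    beta = np.zeros(p)
    a = 2.0 * (X ** 2).sum(axis=0)      # a_j = 2 * sum_i x_ij^2
    residual = y - X @ beta             # kept equal to y - X beta throughout
    for _ in range(n_sweeps):
        for j in range(p):              # visit the coordinates cyclically
            # c_j = 2 * x_j^T (residual with coordinate j's own contribution added back)
            c_j = 2.0 * X[:, j] @ (residual + X[:, j] * beta[j])
            # exact one-dimensional minimizer: soft-threshold c_j, then divide by a_j
            new_bj = np.sign(c_j) * max(abs(c_j) - lam, 0.0) / a[j]
            residual += X[:, j] * (beta[j] - new_bj)
            beta[j] = new_bj
    return beta

# Hypothetical usage on synthetic data with a sparse true coefficient vector.
rng = np.random.default_rng(2)
X_demo = rng.standard_normal((40, 5))
y_demo = X_demo @ np.array([3.0, 0.0, 0.0, -2.0, 0.0]) + 0.1 * rng.standard_normal(40)
beta_demo = lasso_cd(X_demo, y_demo, lam=20.0)
```

Keeping the residual up to date makes each coordinate update cost O(n) instead of recomputing $X\beta$ from scratch.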

19 Least Angle Regression (LARS) (State-of-the-art LASSO solver) http://statweb.stanford.edu/~tibs/ftp/lars.pdf

20 LARS: Least Angle Regression. Starts like classic Forward Selection: find the predictor x_{j1} most correlated with the current residual. Make a step (epsilon) just large enough that another predictor x_{j2} has as much correlation with the current residual. LARS now steps in the direction equiangular between the two predictors until x_{j3} earns its way into the correlated set.
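The first LARS step described above can be sketched as follows. This covers only the step with a single active predictor, not the full algorithm, and it assumes y is centered, the columns of X are centered with unit norm, and no two columns are collinear. The function name and return convention are hypothetical.

```python
import numpy as np

def lars_first_step(X, y):
    """First LARS step: move along the most correlated predictor until another ties with it."""
    c = X.T @ y                          # correlations with the initial residual r = y
    j1 = int(np.argmax(np.abs(c)))       # predictor most correlated with the residual
    C = abs(c[j1])
    u = np.sign(c[j1]) * X[:, j1]        # unit-norm direction in which the fit moves
    a = X.T @ u                          # rate at which each correlation changes per unit step
    gamma, j2 = np.inf, -1
    for j in range(X.shape[1]):
        if j == j1:
            continue
        # smallest positive step at which |corr_j| catches up with |corr_j1| = C - step
        for g in ((C - c[j]) / (1.0 - a[j]), (C + c[j]) / (1.0 + a[j])):
            if 0.0 < g < gamma:
                gamma, j2 = g, j
    residual = y - gamma * u             # residual after taking the step
    return j1, j2, gamma, residual
```

After this step the predictors `j1` and `j2` have equal absolute correlation with the residual, which is the point where LARS would switch to the equiangular direction.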


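22 Naïve elastic net. For any non-negative fixed λ_1 and λ_2, the naive elastic net criterion is $L(\lambda_1, \lambda_2, \beta) = \|y - X\beta\|^2 + \lambda_2 \|\beta\|^2 + \lambda_1 \|\beta\|_1$. The naive elastic net estimator $\hat{\beta}$ is the minimizer of this criterion. Let $\alpha = \lambda_2 / (\lambda_1 + \lambda_2)$; then the problem is equivalent to minimizing $\|y - X\beta\|^2$ subject to $(1 - \alpha)\|\beta\|_1 + \alpha\|\beta\|^2 \le t$ for some t.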

23 Geometry of elastic net

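24 Connecting LASSO and Naïve Elastic net. Lemma: Given (λ_1, λ_2), define an artificial data set (y*, X*) by $X^* = (1 + \lambda_2)^{-1/2} \begin{pmatrix} X \\ \sqrt{\lambda_2}\, I \end{pmatrix}$, of size (n+p) × p, and $y^* = \begin{pmatrix} y \\ 0 \end{pmatrix}$, of length n+p. Let $\gamma = \lambda_1 / \sqrt{1 + \lambda_2}$ and $\beta^* = \sqrt{1 + \lambda_2}\, \beta$. Then the naive elastic net criterion can be written as $L(\gamma, \beta^*) = \|y^* - X^* \beta^*\|^2 + \gamma \|\beta^*\|_1$.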


26 Advantage of Elastic net. The naive elastic net can be converted to a lasso problem on augmented data. In the augmented formulation, the sample size is n+p and X* has rank p, so it can potentially select all the predictors. The naive elastic net can perform automatic variable selection like lasso.
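The conversion to a lasso problem on augmented data can be checked numerically. The sketch below assumes the augmentation from Zou and Hastie's paper, $X^* = (1+\lambda_2)^{-1/2}[X; \sqrt{\lambda_2} I]$, $y^* = [y; 0]$, $\gamma = \lambda_1/\sqrt{1+\lambda_2}$, $\beta^* = \sqrt{1+\lambda_2}\,\beta$, and confirms that the two objectives agree for an arbitrary β. Function names, sizes, and the seed are illustrative assumptions.

```python
import numpy as np

def augment(X, y, lam2):
    """Build the artificial data set (X*, y*) with n + p rows."""
    p = X.shape[1]
    X_star = np.vstack([X, np.sqrt(lam2) * np.eye(p)]) / np.sqrt(1.0 + lam2)
    y_star = np.concatenate([y, np.zeros(p)])
    return X_star, y_star

def naive_enet_objective(X, y, beta, lam1, lam2):
    """||y - X b||^2 + lam2 * ||b||^2 + lam1 * ||b||_1"""
    r = y - X @ beta
    return r @ r + lam2 * (beta @ beta) + lam1 * np.abs(beta).sum()

def lasso_objective(X, y, beta, gamma):
    """||y - X b||^2 + gamma * ||b||_1"""
    r = y - X @ beta
    return r @ r + gamma * np.abs(beta).sum()

# Hypothetical check with p > n: the augmented matrix still has full column rank p.
rng = np.random.default_rng(6)
n, p, lam1, lam2 = 4, 7, 0.7, 1.3
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
beta = rng.standard_normal(p)

X_star, y_star = augment(X, y, lam2)
lhs = naive_enet_objective(X, y, beta, lam1, lam2)
rhs = lasso_objective(X_star, y_star, np.sqrt(1.0 + lam2) * beta, lam1 / np.sqrt(1.0 + lam2))
rank_star = np.linalg.matrix_rank(X_star)
```

Because `X_star` has rank p even when p > n, the lasso on the augmented data is not limited to selecting at most n predictors.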

27 Grouping Effect http://web.stanford.edu/~hastie/papers/b67.2%20(2005)%20301-320%20zou%20&%20hastie.pdf If there is a group of variables among which the pairwise correlations are very high, then the lasso tends to select only one variable from the group and does not care which one is selected.

28 Grouping Effect of Naïve Elastic net. Consider the following penalized regression model: $\hat{\beta} = \arg\min_{\beta} \|y - X\beta\|^2 + \lambda J(\beta)$, where J(·) is positive for β ≠ 0. There is a clear distinction between a strictly convex penalty function and the lasso: when two predictors are identical, the lasso does not even have a unique solution.


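30 Grouping Effect of Naïve Elastic net. Theorem: $D_{\lambda_1,\lambda_2}(i,j) = \frac{1}{\|y\|_1} \left| \hat{\beta}_i(\lambda_1,\lambda_2) - \hat{\beta}_j(\lambda_1,\lambda_2) \right| \le \frac{1}{\lambda_2} \sqrt{2(1 - \rho)}$, where ρ is the sample correlation of x_i and x_j. D is the difference between the coefficient paths of predictors i and j. If x_i and x_j are highly correlated (ρ ≈ 1), this theorem provides a quantitative description for the grouping effect of the Naive Elastic Net.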

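The distinction between a strictly convex penalty and the lasso penalty can be illustrated with two identical predictors. The sketch below uses ridge as the strictly convex example (an assumption made for simplicity, since it has a closed form) and shows that it assigns equal coefficients to the duplicated columns, while the lasso objective takes the same value for different ways of splitting weight between them. The data and seed are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 30
x = rng.standard_normal(n)
z = rng.standard_normal(n)
X = np.column_stack([x, x, z])           # columns 0 and 1 are identical predictors
y = 2.0 * x - z + 0.1 * rng.standard_normal(n)

# Strictly convex penalty (ridge): the unique minimizer gives equal coefficients
# to the two identical columns.
lam = 1.0
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

def lasso_objective(beta, lam1=1.0):
    """||y - X b||^2 + lam1 * ||b||_1 on the data above."""
    r = y - X @ beta
    return r @ r + lam1 * np.abs(beta).sum()

# Lasso penalty: moving weight between the identical columns (same signs, same sum)
# leaves both the fit and the L1 norm unchanged, so minimizers cannot be unique.
obj_a = lasso_objective(np.array([1.0, 0.0, -0.5]))
obj_b = lasso_objective(np.array([0.25, 0.75, -0.5]))
```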

32 Elastic Net http://web.stanford.edu/~hastie/papers/b67.2%20(2005)%20301-320%20zou%20&%20hastie.pdf Deficiency of the Naive Elastic Net: Empirical evidence shows the Naive Elastic Net does not perform satisfactorily. The reason is that there are two shrinkage procedures (Ridge and LASSO) in it. Double shrinkage introduces unnecessary bias. Re-scaling of the Naive Elastic Net gives better performance, yielding the Elastic Net solution: $\hat{\beta}(\text{elastic net}) = (1 + \lambda_2)\, \hat{\beta}(\text{naive elastic net})$. Reason: undo the shrinkage.


34 Computation of elastic net. First solve the Naive Elastic Net problem, then rescale it. For fixed λ_2, the Naive Elastic Net problem is equivalent to a LASSO problem, with a huge data matrix if p >> n. LASSO already has an efficient solver called LARS (Least Angle Regression); the resulting algorithm is LARS-EN.
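The two-stage recipe (solve the naive problem as a lasso on augmented data, then rescale) can be sketched as below. Note the swap: this sketch uses plain cyclic coordinate descent as the lasso solver instead of LARS, so it is not the LARS-EN algorithm itself. The function names, sweep count, and the rescaling factor $(1 + \lambda_2)$ from Zou and Hastie's paper are stated assumptions.

```python
import numpy as np

def lasso_cd(X, y, lam, n_sweeps=2000):
    """Cyclic coordinate descent for min_b ||y - X b||^2 + lam * ||b||_1."""
    p = X.shape[1]
    beta = np.zeros(p)
    a = 2.0 * (X ** 2).sum(axis=0)
    residual = y - X @ beta
    for _ in range(n_sweeps):
        for j in range(p):
            c_j = 2.0 * X[:, j] @ (residual + X[:, j] * beta[j])
            new_bj = np.sign(c_j) * max(abs(c_j) - lam, 0.0) / a[j]   # soft-thresholding
            residual += X[:, j] * (beta[j] - new_bj)
            beta[j] = new_bj
    return beta

def elastic_net(X, y, lam1, lam2):
    """Return (naive elastic net, rescaled elastic net) coefficient vectors."""
    p = X.shape[1]
    # Stage 1: naive elastic net as a lasso on the augmented data (n + p rows).
    X_star = np.vstack([X, np.sqrt(lam2) * np.eye(p)]) / np.sqrt(1.0 + lam2)
    y_star = np.concatenate([y, np.zeros(p)])
    gamma = lam1 / np.sqrt(1.0 + lam2)
    beta_star = lasso_cd(X_star, y_star, gamma)
    beta_naive = beta_star / np.sqrt(1.0 + lam2)
    # Stage 2: rescale to undo the extra (ridge-type) shrinkage.
    beta_enet = (1.0 + lam2) * beta_naive
    return beta_naive, beta_enet

# Hypothetical usage with more predictors than samples.
rng = np.random.default_rng(4)
X_demo = rng.standard_normal((8, 12))
y_demo = rng.standard_normal(8)
beta_naive, beta_enet = elastic_net(X_demo, y_demo, lam1=1.0, lam2=1.0)
```

The augmented matrix has n + p rows, which is the "huge data matrix" the slide warns about when p >> n; a practical solver would avoid forming it explicitly.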

35 Elastic Net interpreted as a stabilized Lasso


37 References
- Big thanks to Prof. Eric CMU for allowing me to reuse some of his slides
- Prof. Nando de Freitas's tutorial slides
- Regularization and variable selection via the elastic net, Hui Zou and Trevor Hastie, Stanford University, USA
