UVA CS 6316/4501 Fall 2016 Machine Learning — Lecture 6: Linear Regression Model with Regularizations. Dr. Yanjun Qi, University of Virginia, Department of Computer Science
Where are we? → Five major sections of this course:
- Regression (supervised)
- Classification (supervised)
- Unsupervised models
- Learning theory
- Graphical models
Today → Regression (supervised)
- Four ways to train / perform optimization for linear regression models: Normal Equation; Gradient Descent (GD); Stochastic GD; Newton's method
- Supervised regression models: Linear regression (LR); LR with non-linear basis functions; Locally weighted LR; LR with Regularizations
Today — Linear Regression Model with Regularizations:
- Ridge Regression
- Lasso Regression
- Elastic net
Review: Vector norms. A norm of a vector x is, informally, a measure of the length of the vector. Common norms: L1, L2 (Euclidean), L-infinity.
Review: Vector Norm (L2, when p=2): ||x||_2 = (Σ_i x_i²)^{1/2}
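The norms reviewed above can be checked numerically. A minimal sketch with NumPy (assuming it is available; the example vector is my own):

```python
import numpy as np

x = np.array([3.0, -4.0])

# L1 norm: sum of absolute values
l1 = np.linalg.norm(x, ord=1)         # 3 + 4 = 7
# L2 (Euclidean) norm: square root of the sum of squares
l2 = np.linalg.norm(x, ord=2)         # sqrt(9 + 16) = 5
# L-infinity norm: largest absolute entry
linf = np.linalg.norm(x, ord=np.inf)  # 4

print(l1, l2, linf)
```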
Review: Normal equation for LR

Write the cost function in matrix form:

J(θ) = (1/2) Σ_{i=1}^{n} (x_i^T θ − y_i)²
     = (1/2) (Xθ − y)^T (Xθ − y)
     = (1/2) (θ^T X^T X θ − θ^T X^T y − y^T X θ + y^T y)

where X is the matrix whose rows are x_1^T, …, x_n^T and y = (y_1, …, y_n)^T.

To minimize J(θ), take the derivative and set it to zero:

X^T X θ = X^T y   (the normal equations)

θ* = (X^T X)^{-1} X^T y   (assuming X^T X is invertible)
Comments on the normal equation. In most situations of practical interest, the number of data points n is larger than the dimensionality p of the input space, and the matrix X is of full column rank. If this condition holds, then it is easy to verify that X^T X is necessarily invertible. Since X^T X is always positive semi-definite, invertibility implies it is positive definite (→ SSE strictly convex), thus the critical point we have found is a minimum. What if X has less than full column rank? → regularization (later).
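The normal-equation solution above can be sketched on synthetic data (a toy illustration, assuming NumPy; the data and variable names are mine):

```python
import numpy as np

# Synthetic data with n > p and X of full column rank, so X^T X is invertible
rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.normal(size=(n, p))
theta_true = np.array([1.5, -2.0, 0.5])
y = X @ theta_true + 0.01 * rng.normal(size=n)

# Normal equations: X^T X theta = X^T y.
# Solving the linear system is numerically preferable to forming the inverse.
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(theta_hat)
```

With low noise, the estimate recovers the generating coefficients closely.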
Review: Page 17 of Linear-Algebra Handout → One important property of positive definite and negative definite matrices is that they are always full rank, and hence, invertible.
Page 11 of Handout
(1) Ridge Regression / L2

If X^T X is not invertible, a solution is to add a small element λ to the diagonal:

Ŷ = β̂_1 x_1 + … + β̂_p x_p

β* = (X^T X + λ I_{p×p})^{-1} X^T y

The ridge estimator is the solution of

β̂_ridge = argmin_β (y − Xβ)^T (y − Xβ) + λ β^T β

(to minimize, take the derivative and set it to zero). Equivalently:

β̂_ridge = argmin_β (y − Xβ)^T (y − Xβ) subject to Σ_j β_j² ≤ s

By convention, the bias/intercept term is typically not regularized. Here we assume the data has been centered, therefore no bias term. (Basic Model, HW2)
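The closed-form ridge estimator above can be computed directly (a sketch on synthetic data I made up, assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 4
X = rng.normal(size=(n, p))
y = X @ np.array([2.0, -1.0, 0.0, 3.0]) + 0.1 * rng.normal(size=n)

lam = 1.0
# Ridge estimator: (X^T X + lambda I)^{-1} X^T y
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
# Ordinary least squares for comparison (lambda = 0)
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Ridge shrinks the coefficient vector toward zero
print(np.linalg.norm(beta_ridge), np.linalg.norm(beta_ols))
```

Note that adding λI makes the system solvable even when X^T X alone is singular.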
Objective Function's Contour Lines from Ridge Regression

[Figure: contour lines of the least-squares objective together with the L2 constraint ball of radius s; the Ridge Regression solution lies where a contour first touches the ball, pulled toward the origin relative to the Least Square solution.]
Review: from L3
(1) Ridge Regression / L2

The parameter λ > 0 penalizes the β_j. The solution is

β̂_λ = (X^T X + λI)^{-1} X^T y

where I is the identity matrix. Note that λ = 0 gives the least squares estimator; if λ → ∞, then β̂ → 0.
Shrinkage?

β_OLS = (X^T X)^{-1} X^T y
β_Rg  = (X^T X + λI)^{-1} X^T y

When λ = 0, the ridge estimator coincides with OLS; when λ grows, the coefficients are shrunk toward 0.

Page 65 of the ESL book: http://statweb.stanford.edu/~tibs/ElemStatLearn/printings/ESLII_print10.pdf
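The shrinkage behavior is easy to see numerically: the norm of the ridge solution is non-increasing in λ (a toy sketch, assuming NumPy; data is synthetic):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 60, 5
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + 0.1 * rng.normal(size=n)

# ||beta_ridge|| shrinks as lambda grows: lambda = 0 recovers OLS,
# and the coefficients are driven toward 0 as lambda -> infinity.
norms = [np.linalg.norm(np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y))
         for lam in [0.0, 1.0, 100.0, 1e6]]
print(norms)
```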
Two forms of Ridge Regression — totally equivalent. http://stats.stackexchange.com/questions/190993/how-to-find-regression-coefficients-beta-in-ridge-regression
Positive Definiteness. One important property of positive definite and negative definite matrices is that they are always full rank, and hence, invertible.
β̂_λ = (X^T X + λI)^{-1} X^T y
Today — Linear Regression Model with Regularizations:
- Ridge Regression
- Lasso Regression
- Elastic net
(2) Lasso (least absolute shrinkage and selection operator) / L1

The lasso is a shrinkage method like ridge, but acts in a nonlinear manner on the outcome y. The lasso is defined by

β̂_lasso = argmin_β (y − Xβ)^T (y − Xβ) subject to Σ_j |β_j| ≤ s

By convention, the bias/intercept term is typically not regularized. Here we assume the data has been centered, therefore no bias term.
Lasso (least absolute shrinkage and selection operator)

Notice that the ridge penalty Σ_j β_j² is replaced by Σ_j |β_j|. Due to the nature of the constraint, if the tuning parameter s is chosen small enough, then the lasso will set some coefficients exactly to zero.
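The exact-zero behavior can be illustrated with the penalized form (1/2)||y − Xβ||² + λ Σ_j |β_j|, solved by coordinate descent with soft-thresholding. This is my own toy implementation, not an algorithm prescribed by the lecture; data and names are made up:

```python
import numpy as np

def soft_threshold(rho, lam):
    """Solution of the 1-D lasso subproblem: sign(rho) * max(|rho| - lam, 0)."""
    return np.sign(rho) * max(abs(rho) - lam, 0.0)

def lasso_cd(X, y, lam, n_sweeps=200):
    """Coordinate descent for (1/2)||y - X beta||^2 + lam * sum_j |beta_j|."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_sweeps):
        for j in range(p):
            # Partial residual excluding feature j's current contribution
            r_j = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ r_j
            beta[j] = soft_threshold(rho, lam) / (X[:, j] @ X[:, j])
    return beta

rng = np.random.default_rng(4)
n, p = 100, 5
X = rng.normal(size=(n, p))
beta_true = np.array([3.0, -2.0, 0.0, 0.0, 0.0])
y = X @ beta_true + 0.1 * rng.normal(size=n)

beta_hat = lasso_cd(X, y, lam=25.0)
print(beta_hat)  # irrelevant coefficients are exactly 0, not merely small
```

The soft-threshold step is what produces exact zeros, in contrast to ridge, which only shrinks.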
Lasso (least absolute shrinkage and selection operator)

Suppose we are in 2 dimensions, β = (β_1, β_2). The constraint region |β_1| + |β_2| ≤ s is a diamond bounded by the lines β_1 + β_2 = const, β_1 − β_2 = const, −β_1 + β_2 = const, −β_1 − β_2 = const.

[Figure: the Lasso solution is where the least-squares contours first touch the diamond, compared with the Least Square solution.]

If the data is not centered, see the stats.stackexchange question on not shrinking the bias/intercept term in regression.
Lasso Estimator vs. Ridge Regression

[Figure: constraint regions of radius s for the Lasso estimator (diamond) and Ridge Regression (circle), with the corresponding solutions.]
Today — Linear Regression Model with Regularizations:
- Ridge Regression
- Lasso Regression
- Elastic net
(3) Elastic Net: Hybrid of Ridge and Lasso

β̂_EN = argmin_β (y − Xβ)^T (y − Xβ) + λ_1 Σ_j |β_j| + λ_2 Σ_j β_j²

Normally x and y have been centered, therefore no bias term is needed in the above!
Geometry of the elastic net
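Because the elastic net is a hybrid penalty, a lasso-style coordinate descent handles it with a one-line change: the L2 term merely adds λ₂ to the update's denominator. A self-contained toy sketch (my own implementation, on synthetic data, not the reference algorithm):

```python
import numpy as np

def elastic_net_cd(X, y, lam1, lam2, n_sweeps=200):
    """Coordinate descent for
    (1/2)||y - X beta||^2 + lam1 * sum_j |beta_j| + (lam2/2) * sum_j beta_j^2."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_sweeps):
        for j in range(p):
            r_j = y - X @ beta + X[:, j] * beta[j]  # partial residual
            rho = X[:, j] @ r_j
            # L1 part soft-thresholds; L2 part only inflates the denominator
            beta[j] = (np.sign(rho) * max(abs(rho) - lam1, 0.0)) \
                      / (X[:, j] @ X[:, j] + lam2)
    return beta

rng = np.random.default_rng(5)
n, p = 100, 6
X = rng.normal(size=(n, p))
beta_true = np.array([3.0, -2.0, 0.0, 0.0, 0.0, 0.0])
y = X @ beta_true + 0.1 * rng.normal(size=n)

beta_en = elastic_net_cd(X, y, lam1=25.0, lam2=10.0)
print(beta_en)  # sparse like the lasso, shrunk like ridge
```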
Movie Reviews and Revenues: An Experiment in Text Regression, Proceedings of HLT '10: Human Language Technologies. Use linear regression to directly predict the opening weekend gross earnings, denoted y, based on features x extracted from the movie metadata and/or the text of the reviews.
More: A family of shrinkage estimators

β̂ = argmin_β Σ_{i=1}^{N} (y_i − x_i^T β)²  subject to  Σ_j |β_j|^q ≤ s

For q ≥ 0, contours of constant value of Σ_j |β_j|^q are shown for the case of two inputs. Here we assume x and y have been centered (normally), therefore no bias term is needed in the above!
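The penalty family Σ_j |β_j|^q above can be evaluated directly; e.g. at a made-up β = (0.6, −0.8), q = 1 gives the lasso penalty and q = 2 the ridge penalty:

```python
import numpy as np

beta = np.array([0.6, -0.8])
# Sum_j |beta_j|^q for several q; q < 1 gives non-convex constraint regions
penalties = {q: float(np.sum(np.abs(beta) ** q)) for q in [0.5, 1.0, 2.0, 4.0]}
print(penalties)
```

Since both entries have |β_j| < 1, smaller q yields a larger penalty value at this point.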
Summary: Regularized multivariate linear regression

Model:  Ŷ = β̂_0 + β̂_1 x_1 + … + β̂_p x_p

LR estimation:     argmin Σ_{i=1}^{n} (Y_i − Ŷ_i)²
LASSO estimation:  argmin Σ_{i=1}^{n} (Y_i − Ŷ_i)² + λ Σ_{j=1}^{p} |β_j|
Ridge estimation:  argmin Σ_{i=1}^{n} (Y_i − Ŷ_i)² + λ Σ_{j=1}^{p} β_j²
Regularized multivariate linear regression

- Task: Regression
- Representation: Y = weighted linear sum of X's
- Score Function: Least-squares + Regularization
- Search/Optimization: Linear algebra / GD
- Models, Parameters: Regression coefficients (constrained)
Regularization path of a Ridge Regression

[Figure: coefficient values β_j plotted against λ, starting from λ = 0.]
An example (ESL Fig 3.8): Ridge Regression — how β_j varies when varying λ. Choose the λ that generalizes well!

[Figure: coefficients shrink as λ increases from λ = 0.]
Regularization path of a Lasso Regression — how β_j varies when varying λ.

[Figure: lasso coefficient paths starting from λ = 0.]
An example (ESL Fig 3.10). Choose the λ that generalizes well!

[Figure: lasso coefficient paths starting from λ = 0.]
Lasso when p > n

Prediction accuracy and model interpretation are two important aspects of regression models. LASSO does shrinkage and variable selection simultaneously, for better prediction and model interpretation.

Disadvantages:
- In the p > n case, the lasso selects at most n variables before it saturates.
- If there is a group of variables among which the pairwise correlations are very high, the lasso tends to select only one variable from the group.
Bias/Intercept Term is not Shrunk

If the data is not centered, there exists a bias term: http://stats.stackexchange.com/questions/86991/reason-for-not-shrinking-the-bias-intercept-term-in-regression

We normally assume we have centered x and y. If this is true, there is no need for a bias term, e.g., for the lasso.
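The centering claim can be checked directly: fitting OLS with an explicit intercept column yields the same slope coefficients as centering x and y and fitting with no intercept at all (a sketch on synthetic data of my own making, assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 80
X = rng.normal(size=(n, 2)) + 5.0  # deliberately uncentered features
y = 7.0 + X @ np.array([1.0, -2.0]) + 0.1 * rng.normal(size=n)

# Fit with an explicit intercept column
X1 = np.hstack([np.ones((n, 1)), X])
coef_with_intercept = np.linalg.lstsq(X1, y, rcond=None)[0]  # [b0, b1, b2]

# Center x and y, then fit with no intercept term
Xc = X - X.mean(axis=0)
yc = y - y.mean()
coef_centered = np.linalg.lstsq(Xc, yc, rcond=None)[0]       # [b1, b2]

print(coef_with_intercept[1:], coef_centered)  # identical slopes
```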
Today Recap — Linear Regression Model with Regularizations:
- Ridge Regression (why invertible: next class)
- Lasso Regression (extra: how to perform training, next class)
- Elastic net (extra: how to perform training, next class)
References
- Big thanks to Prof. Eric Xing @ CMU for allowing me to reuse some of his slides
- Prof. Nando de Freitas's tutorial slides
- "Regularization and variable selection via the elastic net", Hui Zou and Trevor Hastie, Stanford University, USA
- ESL book: Elements of Statistical Learning