UVA CS 4501: Machine Learning. Lecture 6: Linear Regression Model with Regularizations. Dr. Yanjun Qi, University of Virginia, Department of Computer Science
Where are we? The five major sections of this course:
- Regression (supervised)
- Classification (supervised)
- Unsupervised models
- Learning theory
- Graphical models
Today: Regression (supervised)
- Four ways to train / perform optimization for linear regression models: Normal Equation, Gradient Descent (GD), Stochastic GD, Newton's method
- Supervised regression models: Linear regression (LR), LR with non-linear basis functions, Locally weighted LR, LR with Regularizations
Today: Linear Regression Model with Regularizations
- Review: (Ordinary) Least squares: squared loss (Normal Equation)
- Ridge regression: squared loss with L2 regularization
- Lasso regression: squared loss with L1 regularization
- Elastic-net regression: squared loss with L1 AND L2 regularization
- WHY, and influence of the regularization parameter
A dataset for regression has a continuous-valued target variable.
- Data / points / instances / examples / samples / records: [rows]
- Features / attributes / dimensions / independent variables / covariates / predictors / regressors: [columns, except the last]
- Target / outcome / response / label / dependent variable: the special column to be predicted [last column]
SUPERVISED Regression: the training dataset consists of input-output pairs; the task is regression when the target Y is a continuous variable, and we learn a function f(x) to predict it for new inputs.
Review: Normal equation for LR. Write the cost function in matrix form:
$J(\beta) = \frac{1}{2}\sum_{i=1}^{n}(x_i^T\beta - y_i)^2 = \frac{1}{2}(X\beta - \vec{y})^T(X\beta - \vec{y}) = \frac{1}{2}\left(\beta^T X^T X\beta - \beta^T X^T\vec{y} - \vec{y}^T X\beta + \vec{y}^T\vec{y}\right)$,
where $X = [x_1^T;\ x_2^T;\ \dots;\ x_n^T]$ stacks the samples as rows and $\vec{y} = (y_1, \dots, y_n)^T$. To minimize $J(\beta)$, take the derivative and set it to zero: $X^T X\beta = X^T\vec{y}$ (the normal equations), so $\beta^* = (X^T X)^{-1} X^T\vec{y}$. Assume that $X^T X$ is invertible.
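A minimal numerical sketch (not part of the original slides) of solving the normal equations; the synthetic data, dimensions, and noise level are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.normal(size=(n, p))
true_beta = np.array([2.0, -1.0, 0.5])
y = X @ true_beta + 0.1 * rng.normal(size=n)

# Normal equations: (X^T X) beta = X^T y  =>  beta* = (X^T X)^{-1} X^T y
beta_star = np.linalg.solve(X.T @ X, X.T @ y)

# np.linalg.lstsq solves the same least-squares problem, more stably
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_star, beta_lstsq)  # both should be close to true_beta
```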
Comments on the normal equation: what if X has less than full column rank? Then X^T X is not invertible, so we add regularization.
See page 11 of the handout.
e.g., a practical application of a regression model, from Proceedings of HLT 2010: Human Language Technologies:
e.g., "Movie Reviews and Revenues: An Experiment in Text Regression," Proceedings of HLT '10 (about 1.7k samples n, over 3k features; a feature is, e.g., the count of an n-gram in the review text).
e.g., QSAR: Drug Screening, binding to thrombin (DuPont Pharmaceuticals). n: 2543 compounds tested for their ability to bind to a target site on thrombin (a human enzyme); p: 139,351 binary features, which describe three-dimensional properties of the molecules. (Weston et al., Bioinformatics, 2002)
Today: Linear Regression Model with Regularizations
- Review: (Ordinary) Least squares: squared loss (Normal Equation)
- Ridge regression: squared loss with L2 regularization
- Lasso regression: squared loss with L1 regularization
- Elastic-net regression: squared loss with L1 AND L2 regularization
- Influence of the regularization parameter
Review: Vector norms. A norm of a vector x is, informally, a measure of the length of the vector. Common norms: L1, L2 (Euclidean), L-infinity.
Review: Vector norm (the p-norm, with p = 2): $\|x\|_p = \left(\sum_i |x_i|^p\right)^{1/p}$; for p = 2 this is the Euclidean norm $\|x\|_2 = \sqrt{\sum_i x_i^2}$.
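A quick numerical illustration of the norms reviewed above (the example vector is an arbitrary choice).

```python
import numpy as np

x = np.array([3.0, -4.0, 1.0])
l1   = np.linalg.norm(x, ord=1)       # sum of absolute values: 8.0
l2   = np.linalg.norm(x, ord=2)       # Euclidean length: sqrt(26), about 5.10
linf = np.linalg.norm(x, ord=np.inf)  # largest absolute entry: 4.0
print(l1, l2, linf)
```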
[Figure] Norm balls: the L1 ball (diamond, as in the Lasso penalty) versus the L2 ball (circle, the quadratic penalty).
Ridge Regression / L2 Regularization. $\beta^* = (X^T X)^{-1} X^T\vec{y}$; if $X^T X$ is not invertible, a classical solution is to add a small positive element to the diagonal: $\beta^* = (X^T X + \lambda I)^{-1} X^T\vec{y}$. By convention, the bias/intercept term is typically not regularized; here we assume the data has been centered, so there is no bias term.
Positive definite matrices: one important property of positive definite matrices is that they are always full rank, and hence invertible. (Extra: see the proof on pages 17-18 of the Linear Algebra handout.)
With $\lambda > 0$, $X^T X + \lambda I$ is positive definite, so $\beta^* = (X^T X + \lambda I)^{-1} X^T\vec{y}$ always exists.
Ridge Regression / Squared Loss + L2: $\hat{\beta}_{ridge} = \arg\min_\beta\ (y - X\beta)^T(y - X\beta) + \lambda\beta^T\beta$. To minimize, take the derivative and set it to zero; as in the HW2 solution, this gives $\beta^* = (X^T X + \lambda I)^{-1} X^T\vec{y}$. By convention, the bias/intercept term is typically not regularized; here we assume the data has been centered, so there is no bias term.
Ridge Regression / Squared Loss + L2, equivalent constrained form: $\hat{\beta}_{ridge} = \arg\min_\beta\ (y - X\beta)^T(y - X\beta)$ subject to $\sum_{j=1}^{p}\beta_j^2 \le s^2$. Again, the bias/intercept term is not regularized and the data is assumed centered.
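A minimal sketch of the ridge closed form $\beta^* = (X^T X + \lambda I)^{-1}X^T y$ on a deliberately rank-deficient design (a duplicated column), where plain $X^T X$ is not invertible; the data and the value of lambda are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
X = np.column_stack([x1, x1])            # two identical columns -> X^T X is singular
y = 3.0 * x1 + 0.1 * rng.normal(size=n)

lam = 1.0
p = X.shape[1]
# Adding lam * I makes the matrix positive definite, hence invertible
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
print(beta_ridge)  # a unique, finite solution; the weight is shared across the duplicated columns
```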
[Figures] Contour lines of the ridge-regression objective: elliptical contours of the squared loss centered at the least-squares solution, together with the circular L2 constraint region of radius s; the ridge solution is the point where the contours first touch the circle.
Ridge Regression / Squared Loss + L2: $\lambda > 0$ penalizes each $\beta_j$ (see whiteboard). If $\lambda = 0$ we recover the least-squares estimator; as $\lambda \to \infty$, each $\beta_j \to 0$.
Parameter shrinkage: $\beta_{OLS} = (X^T X)^{-1} X^T\vec{y}$ versus $\beta_{ridge} = (X^T X + \lambda I)^{-1} X^T\vec{y}$. When $\lambda = 0$, $\beta_{ridge} = \beta_{OLS}$; when $\lambda \to \infty$, $\beta_{ridge} \to 0$. (See page 65 of the ESL book: http://statweb.stanford.edu/~tibs/ElemStatLearn/printings/ESLII_print10.pdf)
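An illustrative check of this shrinkage behaviour (synthetic data, assuming a full-column-rank design so the OLS solution exists): as lambda grows, the ridge estimate moves from the OLS solution toward the zero vector.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=200)

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
for lam in [0.0, 1.0, 10.0, 100.0, 1e4]:
    beta_rg = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)
    print(lam, np.linalg.norm(beta_rg))  # the norm shrinks toward 0 as lambda grows
# lam = 0 reproduces beta_ols; very large lam drives all coefficients toward 0
```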
[Figures] Influence of the regularization parameter: with a large $\lambda$ the constraint region is small and the ridge solution is pulled strongly toward the origin; as $\lambda \to 0$ the solution approaches the least-squares estimate.
Today: Linear Regression Model with Regularizations
- Review: (Ordinary) Least squares: squared loss (Normal Equation)
- Ridge regression: squared loss with L2 regularization
- Lasso regression: squared loss with L1 regularization
- Elastic-net regression: squared loss with L1 AND L2 regularization
- WHY, and influence of the regularization parameter
(2) Lasso (least absolute shrinkage and selection operator) / Squared Loss + L1. The lasso is a shrinkage method like ridge, but acts in a nonlinear manner on the outcome y. The lasso is defined by
$\hat{\beta}_{lasso} = \arg\min_\beta\ \sum_{i=1}^{n}(y_i - x_i^T\beta)^2 = \arg\min_\beta\ (y - X\beta)^T(y - X\beta)$ subject to $\sum_j |\beta_j| \le s$.
By convention, the bias/intercept term is typically not regularized; here we assume the data has been centered, so there is no bias term.
Lasso (least absolute shrinkage and selection operator): suppose $\beta = (\beta_1, \beta_2)$ in two dimensions. The constraint boundary $|\beta_1| + |\beta_2| = \text{const}$ consists of the four line segments $\beta_1 + \beta_2 = \text{const}$, $\beta_1 - \beta_2 = \text{const}$, $-\beta_1 + \beta_2 = \text{const}$, and $-\beta_1 - \beta_2 = \text{const}$, i.e. a diamond of size s. [Figure: the lasso solution where the loss contours first touch the diamond; the least-squares solution sits at the center of the contours.]
In the figure, the lasso solution lands on a corner of the diamond and has eliminated the role of x2, leading to sparsity.
[Figure] Lasso estimator (diamond constraint region of size s) versus ridge regression (circular constraint region of size s).
Lasso (least absolute shrinkage and selection operator): notice that the ridge penalty $\sum_j \beta_j^2$ is replaced by $\sum_j |\beta_j|$. Due to the nature of this constraint, if the tuning parameter is chosen small enough, the lasso will set some coefficients exactly to zero.
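A small sketch of this exact-zero behaviour using scikit-learn (an assumed dependency, not part of the slides); note that scikit-learn's Lasso minimizes (1/2n)||y - Xb||^2 + alpha*||b||_1, so alpha corresponds to the slides' lambda only up to a constant factor.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
beta_true = np.zeros(10)
beta_true[:3] = [2.0, -3.0, 1.5]          # only 3 of the 10 features matter
y = X @ beta_true + 0.1 * rng.normal(size=100)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)
print(np.sum(lasso.coef_ == 0))  # lasso typically sets most irrelevant coefficients exactly to 0
print(np.sum(ridge.coef_ == 0))  # ridge shrinks coefficients but rarely hits exact 0
```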
Lasso when p > n: prediction accuracy and model interpretation are two important aspects of regression models, and LASSO does shrinkage and variable selection simultaneously, for better prediction and better interpretation. Disadvantages: in the p > n case, the lasso selects at most n variables before it saturates; and if there is a group of variables among which the pairwise correlations are very high, the lasso tends to select only one variable from the group.
[Figure] Lasso: implicit feature selection on an n-by-p design matrix X with p much larger than n; only a small subset of the p columns receives nonzero weights.
Today: Linear Regression Model with Regularizations
- Review: (Ordinary) Least squares: squared loss (Normal Equation)
- Ridge regression: squared loss with L2 regularization
- Lasso regression: squared loss with L1 regularization
- Elastic-net regression: squared loss with L1 AND L2 regularization
- WHY, and influence of the regularization parameter
(3) Hybrid of Ridge and Lasso: Elastic Net regularization. The L1 part of the penalty generates a sparse model; the L2 part of the penalty removes the limitation on the number of selected variables, encourages a grouping effect, and stabilizes the L1 regularization path.
Naïve elastic net: for any non-negative fixed $\lambda_1$ and $\lambda_2$, the naïve elastic net criterion is
$L(\lambda_1, \lambda_2, \beta) = (y - X\beta)^T(y - X\beta) + \lambda_2\sum_j \beta_j^2 + \lambda_1\sum_j |\beta_j|$,
and the naïve elastic net estimator is the minimizer of this criterion.
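A hedged sketch of fitting an elastic net with scikit-learn's ElasticNet (an assumed dependency). scikit-learn parameterizes the penalty as alpha*(l1_ratio*||b||_1 + 0.5*(1 - l1_ratio)*||b||_2^2) with the squared loss scaled by 1/(2n), so (alpha, l1_ratio) maps to (lambda_1, lambda_2) only up to constant factors; the data and settings below are illustrative.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 20))
y = X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=80)

enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)  # a mix of L1 and L2 penalties
print(np.sum(enet.coef_ == 0), np.round(enet.coef_[:3], 2))  # sparse, yet shrunk/stabilized
```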
[Figure] Geometry of the elastic net penalty: its constraint region is a compromise between the L1 diamond and the L2 circle.
"Movie Reviews and Revenues: An Experiment in Text Regression," Proceedings of HLT '10: Human Language Technologies. Uses linear regression to directly predict the opening weekend gross earnings, denoted y, based on features x extracted from the movie metadata and/or the text of the reviews.
Advantage of the elastic net (extra): the naïve elastic net can be converted to a lasso problem on augmented data. In the augmented formulation the sample size is n + p and the augmented design X* has rank p, so the method can potentially select all p predictors; the naïve elastic net can thus perform automatic variable selection like the lasso.
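A sanity check of the augmented-data idea, as a sketch under the assumption that no extra rescaling is applied: the quadratic part of the naïve elastic net criterion equals an ordinary squared loss on the augmented data X* = [X; sqrt(lambda_2) I], y* = [y; 0], leaving a plain lasso penalty.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 30, 5
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
beta = rng.normal(size=p)          # an arbitrary coefficient vector for the check
lam1, lam2 = 0.3, 0.7

# Augmented data: X* has n + p rows (and rank p), y* pads y with p zeros
X_star = np.vstack([X, np.sqrt(lam2) * np.eye(p)])
y_star = np.concatenate([y, np.zeros(p)])

naive_enet = np.sum((y - X @ beta) ** 2) + lam2 * np.sum(beta ** 2) + lam1 * np.sum(np.abs(beta))
lasso_aug  = np.sum((y_star - X_star @ beta) ** 2) + lam1 * np.sum(np.abs(beta))
print(np.isclose(naive_enet, lasso_aug))  # True: the two criteria coincide for any beta
```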
Regularized multivariate linear regression:
- Task: regression
- Representation: Y = weighted linear sum of the X's
- Score function: least squares + regularization, $\min_\beta J(\beta) = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 + \lambda\left(\sum_{j=1}^{p}|\beta_j|^q\right)^{1/q}$
- Search / Optimization: linear algebra for ridge; sub-gradient descent for lasso and elastic net
- Models, Parameters: regression coefficients (regularized weights)
Today: Linear Regression Model with Regularizations
- Review: (Ordinary) Least squares: squared loss (Normal Equation)
- Ridge regression: squared loss with L2 regularization
- Lasso regression: squared loss with L1 regularization
- Elastic-net regression: squared loss with L1 AND L2 regularization
- WHY, and influence of the regularization parameter
Why use simpler models? Because they are: simpler to use (lower computational complexity), easier to train (need fewer examples), less sensitive to noise, easier to explain (more interpretable), and they generalize better (lower variance; Occam's razor). More in future lectures! (Adapted from Dr. Christoph Eick's slides.)
Model Selection & Generalization. Generalization: learn a function / hypothesis from past data in order to explain, predict, model or control new data examples. Underfitting: the model is too simple, and both training and test errors are large. Overfitting: the model is too complex, and test errors are large although training errors are small; after learning, the model has effectively learned the noise.
Issue: overfitting and underfitting. Underfit: $y = \theta_0 + \theta_1 x$; looks good: $y = \theta_0 + \theta_1 x + \theta_2 x^2$; overfit: $y = \sum_{j=0}^{5}\theta_j x^j$. Generalization: learn a function / hypothesis from past data in order to explain, predict, model or control new data examples. Use k-fold cross-validation!
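An illustrative under/overfitting demo with polynomial fits of increasing degree; the synthetic data, noise level, and degrees are arbitrary choices, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.sort(rng.uniform(-1, 1, 20))
y_train = 1.0 + 2.0 * x_train + 1.5 * x_train**2 + 0.2 * rng.normal(size=20)
x_test = np.sort(rng.uniform(-1, 1, 100))
y_test = 1.0 + 2.0 * x_test + 1.5 * x_test**2 + 0.2 * rng.normal(size=100)

for deg in [1, 2, 9]:
    coefs = np.polyfit(x_train, y_train, deg)
    train_err = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    print(deg, round(train_err, 4), round(test_err, 4))
# degree 1 usually underfits (both errors high); degree 9 usually overfits (small train, larger test error)
```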
[Figure] Overfitting versus underfitting.
Overfitting and regularization: a regularizer is an additional criterion added to the loss function to make sure that we don't overfit. It's called a regularizer since it tries to keep the parameters more normal/regular. It is a bias on the model: it forces the learning to prefer certain types of weights over others, e.g.,
$\hat{\beta}_{ridge} = \arg\min_\beta\ \sum_{i=1}^{n}(y_i - x_i^T\beta)^2 + \lambda\beta^T\beta$.
Common regularizers. L2, $\sum_j \beta_j^2$ (squared weights): penalizes large values more. L1, $\sum_j |\beta_j|$ (sum of absolute weights): penalizes small values more. Generally, we don't want huge weights: if weights are large, a small change in a feature can result in a large change in the prediction. We might also prefer weights of exactly 0 for features that aren't useful.
Summary: regularized multivariate linear regression.
- Model: $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \dots + \hat{\beta}_p x_p$
- LR estimation: $\arg\min_\beta\ \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$
- LASSO estimation: $\arg\min_\beta\ \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 + \lambda\sum_{j=1}^{p}|\beta_j|$
- Ridge regression estimation: $\arg\min_\beta\ \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 + \lambda\sum_{j=1}^{p}\beta_j^2$
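A side-by-side sketch of the three estimators summarized above, using scikit-learn (an assumed dependency); the data and regularization strengths are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(2)
X = rng.normal(size=(120, 8))
beta_true = np.array([3.0, 0.0, 0.0, -2.0, 0.0, 0.0, 1.0, 0.0])
y = X @ beta_true + 0.2 * rng.normal(size=120)

for name, model in [("OLS", LinearRegression()),
                    ("Ridge", Ridge(alpha=5.0)),
                    ("Lasso", Lasso(alpha=0.1))]:
    model.fit(X, y)
    print(name, np.round(model.coef_, 2))
# OLS: no shrinkage; Ridge: all coefficients shrunk; Lasso: irrelevant ones typically set to 0
```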
More: a family of shrinkage estimators. $\hat{\beta} = \arg\min_\beta\ \sum_{i=1}^{N}(y_i - x_i^T\beta)^2$ subject to $\sum_j |\beta_j|^q \le s$, for $q \ge 0$; contours of constant value of $\sum_j |\beta_j|^q$ are shown for the case of two inputs.
Norms visualized: all q-norms penalize larger weights; q < 2 tends to create sparse solutions (i.e. lots of 0 weights), while q > 2 tends to push toward similar weights.
[Figures] Regularization paths (an example with 8 features): for ridge regression (weight decay) and for lasso regression, the plots show how each coefficient $\beta_j$ varies as $\lambda$ increases from 0; along the lasso path, coefficients reach exactly zero. A code sketch for computing such a path follows below.
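A sketch of computing a lasso regularization path with scikit-learn's lasso_path (an assumed dependency); the synthetic 8-feature data mirrors the slide's example.

```python
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))                       # 8 features, as in the slide's example
y = X @ np.array([3, 0, 0, -2, 0, 0, 1, 0.0]) + 0.1 * rng.normal(size=100)

alphas, coefs, _ = lasso_path(X, y)                 # coefs has shape (8, number of alphas)
print(alphas.shape, coefs.shape)
# Large alpha (strong penalty): all coefficients 0; alpha near 0: coefficients approach least squares
```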
WHY and how to select λ? (1) Generalization ability: use k-fold cross-validation to decide. (2) λ controls the bias and variance of the model (details in future lectures).
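A hedged sketch of choosing λ by k-fold cross-validation with scikit-learn's LassoCV and RidgeCV (assumed dependencies); the candidate grid and cv=5 are illustrative choices.

```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 10))
y = X @ np.array([2, 0, 0, -1, 0, 0, 0, 0, 0, 0.5]) + 0.2 * rng.normal(size=150)

lasso_cv = LassoCV(cv=5).fit(X, y)                        # grid of alphas chosen automatically
ridge_cv = RidgeCV(alphas=np.logspace(-3, 3, 13)).fit(X, y)
print(lasso_cv.alpha_, ridge_cv.alpha_)                   # the penalties that generalize best under CV
```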
[Figure] Regression: complexity versus goodness of fit. Panels: the training data; a too-simple fit (low variance / high bias); a too-complex fit (low bias / high variance); an about-right fit. What ultimately matters: GENERALIZATION.
[Figures] Choose a λ that generalizes well! An example of ridge regression with 8 features: how each $\beta_j$ varies as $\lambda$ increases from 0.
Today, recap: Linear Regression Model with Regularizations
- Review: (Ordinary) Least squares: squared loss (Normal Equation)
- Ridge regression: squared loss with L2 regularization
- Lasso regression: squared loss with L1 regularization
- Elastic-net regression: squared loss with L1 AND L2 regularization
- Influence of the regularization parameter
References
- Big thanks to Prof. Eric Xing @ CMU for allowing me to reuse some of his slides
- Prof. Nando de Freitas's tutorial slides
- "Regularization and variable selection via the elastic net," Hui Zou and Trevor Hastie, Stanford University, USA
- ESL book: The Elements of Statistical Learning
More (see the L6-extra slides):
- Optimization of regularized regression
- The relation between λ and s
- Why the elastic net has a few nice properties