UVA CS 4501: Machine Learning. Lecture 6: Linear Regression Model with Regularizations. Dr. Yanjun Qi. University of Virginia


1 UVA CS 4501: Machine Learning. Lecture 6: Linear Regression Model with Regularizations. Dr. Yanjun Qi, University of Virginia, Department of Computer Science

2 Where are we? → Five major sections of this course:
- Regression (supervised)
- Classification (supervised)
- Unsupervised models
- Learning theory
- Graphical models

3 Today → Regression (supervised)
- Four ways to train / perform optimization for linear regression models:
  - Normal Equation
  - Gradient Descent (GD)
  - Stochastic GD
  - Newton's method
- Supervised regression models:
  - Linear regression (LR)
  - LR with non-linear basis functions
  - Locally weighted LR
  - LR with Regularizations

4 Today: Linear Regression Model with Regularizations
- Review: (Ordinary) Least squares: squared loss (Normal Equation)
- Ridge regression: squared loss with L2 regularization
- Lasso regression: squared loss with L1 regularization
- Elastic net regression: squared loss with L1 AND L2 regularization
- WHY, and the influence of the regularization parameter

5 A Dataset for regression: a continuous-valued target variable.
- Data / points / instances / examples / samples / records: [rows]
- Features / attributes / dimensions / independent variables / covariates / predictors / regressors: [columns, except the last]
- Target / outcome / response / label / dependent variable: the special column to be predicted [last column]

6 SUPERVISED Regression: the training dataset consists of input-output pairs, where the target Y is a continuous variable. The learned function is then used to predict f(x?) for a new input x?.

7 Review: the Normal Equation for LR. Write the cost in matrix form:

J(β) = (1/2) Σ_{i=1}^n (x_i^T β − y_i)²
     = (1/2) (Xβ − y)^T (Xβ − y)
     = (1/2) (β^T X^T X β − β^T X^T y − y^T X β + y^T y)

where X is the matrix whose rows are x_1^T, …, x_n^T and y = (y_1, …, y_n)^T. To minimize J(β), take the derivative and set it to zero:

X^T X β = X^T y   (the normal equations)

β* = (X^T X)^{-1} X^T y

Assume that X^T X is invertible.
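To make the closed-form solution concrete, here is a minimal numpy sketch (my own illustration; the toy data and variable names are not from the slides):

```python
import numpy as np

# Toy regression data: n = 100 samples, p = 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_beta = np.array([2.0, -1.0, 0.5])
y = X @ true_beta + 0.1 * rng.normal(size=100)

# Normal equations: solve (X^T X) beta = X^T y.
# np.linalg.solve is preferred over forming the explicit inverse.
beta = np.linalg.solve(X.T @ X, X.T @ y)
print(beta)  # close to [2.0, -1.0, 0.5]
```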

8 Comments on the normal equations: what if X has less than full column rank? → X^T X is not invertible → add regularization.

9 See page 11 of the handout.

10 e.g., a practical application of regression models: Proceedings of HLT 2010: Human Language Technologies.

11 e.g., Movie Reviews and Revenues: An Experiment in Text Regression, Proceedings of HLT '10 (n ≈ 1.7k samples, >3k features, e.g., counts of an n-gram in the text).

12 e.g., QSAR: Drug Screening. Binding to Thrombin (DuPont Pharmaceuticals).
- n: 2543 compounds tested for their ability to bind to a target site on thrombin (a human enzyme)
- p: 139,351 binary features, which describe three-dimensional properties of the molecules
(Weston et al., Bioinformatics)

13 Today: Linear Regression Model with Regularizations
- Review: (Ordinary) Least squares: squared loss (Normal Equation)
- Ridge regression: squared loss with L2 regularization
- Lasso regression: squared loss with L1 regularization
- Elastic net regression: squared loss with L1 AND L2 regularization
- Influence of the regularization parameter

14 Review: Vector norms. A norm of a vector x is, informally, a measure of the length of the vector. Common norms: L1, L2 (Euclidean), and L-infinity.
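As a quick illustration (a sketch of my own, not from the slides), these norms in numpy:

```python
import numpy as np

x = np.array([3.0, -4.0, 0.0])

l1 = np.sum(np.abs(x))        # L1 norm: sum of absolute values -> 7.0
l2 = np.sqrt(np.sum(x ** 2))  # L2 (Euclidean) norm -> 5.0
linf = np.max(np.abs(x))      # L-infinity norm: largest magnitude -> 4.0

# The same values via np.linalg.norm:
assert np.isclose(l1, np.linalg.norm(x, 1))
assert np.isclose(l2, np.linalg.norm(x, 2))
assert np.isclose(linf, np.linalg.norm(x, np.inf))
```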

15 Review: Vector norm (L2, when p = 2). The general Lp norm is ||x||_p = (Σ_i |x_i|^p)^{1/p}; p = 2 gives the Euclidean norm ||x||_2 = (Σ_i x_i²)^{1/2}.

16 [Figure: the unit balls of different norms; the L1 ball is the one used by the lasso.]

17 Ridge Regression / L2 Regularization.

β* = (X^T X)^{-1} X^T y

If X^T X is not invertible, a classical solution is to add a small positive element to the diagonal:

β* = (X^T X + λI)^{-1} X^T y

By convention, the bias/intercept term is typically not regularized. Here we assume the data has been centered, so there is no bias term.
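A minimal numpy sketch of this closed form (my own illustration; `lam` stands for λ):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution: beta = (X^T X + lam * I)^{-1} X^T y.

    Assumes X and y have been centered, so no intercept is fit
    (matching the slide's convention of not regularizing the bias).
    """
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Even when X^T X is singular (here p > n), the lam * I term
# makes the linear system solvable:
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 10))   # n = 5 samples, p = 10 features
y = rng.normal(size=5)
print(ridge_fit(X, y, lam=1.0).shape)  # (10,)
```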

18 Positive definite matrices. One important property of positive definite matrices is that they are always full rank, and hence invertible. (Extra: see the proof in the Linear-Algebra Handout.)
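The one-line reason X^T X + λI is positive definite (and hence full rank and invertible) for any λ > 0:

```latex
z^T (X^T X + \lambda I)\, z \;=\; \|Xz\|_2^2 + \lambda \|z\|_2^2 \;>\; 0
\qquad \text{for all } z \neq 0 .
```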

19 β* = (X^T X + λI)^{-1} X^T y

20 Ridge Regression / Squared Loss + L2.

β* = (X^T X + λI)^{-1} X^T y

is the solution (as derived in HW2) of

β̂_ridge = argmin_β (y − Xβ)^T (y − Xβ) + λ β^T β

To minimize, take the derivative and set it to zero. By convention, the bias/intercept term is typically not regularized. Here we assume the data has been centered, so there is no bias term.

21 Ridge Regression / Squared Loss + L2.

β* = (X^T X + λI)^{-1} X^T y

is the solution (as derived in HW2) of

β̂_ridge = argmin_β (y − Xβ)^T (y − Xβ) + λ β^T β

(to minimize, take the derivative and set it to zero). Equivalently,

β̂_ridge = argmin_β (y − Xβ)^T (y − Xβ)  subject to  Σ_{j=1}^p β_j² ≤ s²

By convention, the bias/intercept term is typically not regularized. Here we assume the data has been centered, so there is no bias term.
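Carrying out the "take derivative and set to zero" step explicitly:

```latex
\begin{aligned}
J(\beta) &= (y - X\beta)^T (y - X\beta) + \lambda\, \beta^T \beta \\
\nabla_\beta J &= -2 X^T (y - X\beta) + 2\lambda \beta = 0 \\
\Rightarrow\; (X^T X + \lambda I)\, \beta &= X^T y
\;\Rightarrow\; \beta^* = (X^T X + \lambda I)^{-1} X^T y .
\end{aligned}
```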

22 [Figure: contour lines of the squared loss together with the ridge constraint region Σ_j β_j² ≤ s²; the least-squares solution sits at the center of the contours.]

23 [Figure: contour lines of the squared loss; the ridge solution is the point where the contours first touch the L2 ball of radius s, lying between the origin and the least-squares solution.]

24 [Figure: least squares + L2 constraint (ridge) in two dimensions (β1, β2), with constraint β1² + β2² ≤ s².]

25 Ridge Regression / Squared Loss + L2.
- λ > 0 penalizes each β_j (see whiteboard)
- If λ = 0, we get the least-squares estimator
- If λ → ∞, then β_j → 0

26 Parameter shrinkage.

β_OLS = (X^T X)^{-1} X^T y
β_Ridge = (X^T X + λI)^{-1} X^T y

When λ = 0, ridge reduces to OLS; as λ grows, the coefficients shrink toward zero. (See page 65 of ESL: http://statweb.stanford.edu/~tibs/ElemStatLearn/printings/ESLII_print10.pdf)

27 Influence of the regularization parameter λ [figure]

28 Influence of the regularization parameter λ: example 1 [figure]

29 Influence of the regularization parameter λ: example 2 [figure]

30 Today: Linear Regression Model with Regularizations
- Review: (Ordinary) Least squares: squared loss (Normal Equation)
- Ridge regression: squared loss with L2 regularization
- Lasso regression: squared loss with L1 regularization
- Elastic net regression: squared loss with L1 AND L2 regularization
- WHY, and the influence of the regularization parameter

31 (2) Lasso (least absolute shrinkage and selection operator) / Squared Loss + L1. The lasso is a shrinkage method like ridge, but it acts in a nonlinear manner on the outcome y. The lasso is defined by

β̂_lasso = argmin_β Σ_{i=1}^n (y_i − x_i^T β)²  subject to  Σ_j |β_j| ≤ s

By convention, the bias/intercept term is typically not regularized. Here we assume the data has been centered, so there is no bias term.

32 Lasso (least absolute shrinkage and selection operator). Suppose β = (β1, β2) in two dimensions. The constraint region Σ_j |β_j| ≤ s is a diamond whose four edges are β1 + β2 = const, β1 − β2 = const, −β1 + β2 = const, and −β1 − β2 = const. [Figure: the lasso solution is where the loss contours first touch this diamond; the least-squares solution lies outside it.]

33 In the figure, the lasso constraint has eliminated the role of x2 (its coefficient is exactly zero), leading to sparsity. [Figure: lasso vs. least-squares solutions for constraint size s.]

34 [Figure: the lasso (diamond) and ridge (disk) constraint regions compared side by side, each of size s.]

35 Lasso (least absolute shrinkage and selection operator). Notice that the ridge penalty Σ_j β_j² is replaced by Σ_j |β_j|. Due to the nature of this constraint, if the tuning parameter is chosen small enough, the lasso will set some coefficients exactly to zero.
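The "exactly zero" behavior can be seen in the soft-thresholding update used by coordinate-descent lasso solvers. A minimal sketch (my own, for the penalized form with strength `lam`, assuming centered data):

```python
import numpy as np

def soft_threshold(z, gamma):
    # S(z, gamma) = sign(z) * max(|z| - gamma, 0):
    # shrinks z toward 0 and sets it exactly to 0 when |z| <= gamma.
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for min ||y - X beta||^2 + lam * ||beta||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            # Residual with feature j's current contribution added back.
            r_j = y - X @ beta + X[:, j] * beta[j]
            z_j = X[:, j] @ r_j
            beta[j] = soft_threshold(z_j, lam / 2.0) / (X[:, j] @ X[:, j])
    return beta
```

A large enough `lam` drives many coordinates exactly to zero, which is the variable selection the slide describes.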

36 Lasso for the p > n case. Prediction accuracy and model interpretability are two important aspects of regression models. The LASSO does shrinkage and variable selection simultaneously, for better prediction and a more interpretable model.
Disadvantages:
- In the p > n case, the lasso selects at most n variables before it saturates
- If there is a group of variables among which the pairwise correlations are very high, the lasso tends to select only one variable from the group

37 Lasso: implicit feature selection. [Figure: an n × p data matrix X with p ≫ n; the lasso keeps only a sparse subset of the p columns.]

39 Today: Linear Regression Model with Regularizations
- Review: (Ordinary) Least squares: squared loss (Normal Equation)
- Ridge regression: squared loss with L2 regularization
- Lasso regression: squared loss with L1 regularization
- Elastic net regression: squared loss with L1 AND L2 regularization
- WHY, and the influence of the regularization parameter

40 (3) Hybrid of ridge and lasso: Elastic Net regularization.
- The L1 part of the penalty generates a sparse model
- The L2 part of the penalty:
  - removes the limitation on the number of selected variables
  - encourages the grouping effect
  - stabilizes the L1 regularization path

41 Naïve Elastic Net. For any fixed non-negative λ1 and λ2, the naive elastic net criterion is

L(λ1, λ2, β) = (y − Xβ)^T (y − Xβ) + λ2 Σ_j β_j² + λ1 Σ_j |β_j|

The naive elastic net estimator is the minimizer of this equation. Let α = λ2 / (λ1 + λ2); the penalty can then be written as the convex combination (1 − α) Σ_j |β_j| + α Σ_j β_j².
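In practice, scikit-learn's ElasticNet implements this kind of penalty (note that sklearn uses a slightly different parameterization: an overall strength `alpha` and a mixing weight `l1_ratio` between the L1 and L2 parts). A small usage sketch with made-up data:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 20))
y = X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=50)

# alpha: overall penalty strength; l1_ratio: share of the L1 part
# (sparsity) vs. the L2 part (grouping / stability).
model = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print(np.sum(model.coef_ != 0), "nonzero coefficients out of 20")
```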

42 Geometry of the Elastic Net. [Figure: the elastic net constraint ball, intermediate between the lasso diamond and the ridge disk, with corners on the axes.]

43 Movie Reviews and Revenues: An Experiment in Text Regression, Proceedings of HLT '10: Human Language Technologies. Use linear regression to directly predict the opening-weekend gross earnings, denoted y, based on features x extracted from the movie metadata and/or the text of the reviews.

44 Advantage of the Elastic Net (Extra). The elastic net problem can be converted to a lasso problem on augmented data. In the augmented problem the sample size is n + p and X* has rank p → the elastic net can potentially select all p predictors. The naïve elastic net can thus perform automatic variable selection like the lasso.

45 Regularized multivariate linear regression.
- Task: regression
- Representation: Y = weighted linear sum of the X's
- Score function: least squares + regularization,
  min_β J(β) = Σ_{i=1}^n (Y_i − Ŷ_i)² + λ (Σ_{j=1}^p |β_j|^q)^{1/q}
- Search/Optimization: linear algebra for ridge / sub-gradient descent for lasso & elastic net
- Models, Parameters: regression coefficients (regularized weights)

46 Today: Linear Regression Model with Regularizations
- Review: (Ordinary) Least squares: squared loss (Normal Equation)
- Ridge regression: squared loss with L2 regularization
- Lasso regression: squared loss with L1 regularization
- Elastic net regression: squared loss with L1 AND L2 regularization
- WHY, and the influence of the regularization parameter

47 Why use simpler models? Because they are:
- Simpler to use (lower complexity)
- Easier to train (need fewer examples)
- Less sensitive to noise
- Easier to explain (more interpretable)
- Better at generalizing (lower variance; Occam's razor)
More in future lectures! (Adapted from Dr. Christoph Eick's slides)

48 Model selection & generalization. Generalization: learn a function/hypothesis from past data in order to explain, predict, model or control new data examples.
- Underfitting: the model is too simple, and both training and test errors are large
- Overfitting: the model is too complex, and test errors are large although training errors are small; the overfit model has ended up learning the noise

49 Issue: overfitting and underfitting.

Under fit: y = θ0 + θ1 x
Looks good: y = θ0 + θ1 x + θ2 x²
Over fit: y = Σ_{j=0}^5 θ_j x^j

Generalization: learn a function/hypothesis from past data in order to explain, predict, model or control new data examples. Use k-fold cross-validation!
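A quick numpy illustration of these three model classes (toy data of my own):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 10)
y = 1.0 + 2.0 * x - 3.0 * x ** 2 + 0.05 * rng.normal(size=10)

for degree in (1, 2, 5):               # under fit, about right, over fit
    coeffs = np.polyfit(x, y, degree)  # least-squares polynomial fit
    mse = np.mean((np.polyval(coeffs, x) - y) ** 2)
    print(f"degree {degree}: training MSE = {mse:.6f}")
# Training error keeps dropping as the degree grows, but the degree-5
# fit is chasing the noise and will generalize poorly.
```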

50 Overfitting vs. underfitting. [Figure: example of an overfit model.]

51 Overfitting. A regularizer is an additional criterion added to the loss function to make sure that we don't overfit. It's called a regularizer since it tries to keep the parameters more normal/regular. It is a bias on the model: it forces the learning to prefer certain types of weights over others, e.g.,

β̂_ridge = argmin_β Σ_{i=1}^n (y_i − x_i^T β)² + λ β^T β

52 Common regularizers.
- L2: squared weights, Σ_j β_j² (penalizes large values more)
- L1: sum of absolute weights, Σ_j |β_j| (will penalize small values more)
Generally, we don't want huge weights: if weights are large, a small change in a feature can result in a large change in the prediction. We might also prefer weights of 0 for features that aren't useful.

53 Summary: regularized multivariate linear regression.

Model: Ŷ = β̂0 + β̂1 x1 + … + β̂p xp

LR estimation:    argmin Σ_{i=1}^n (Y_i − Ŷ_i)²
LASSO estimation: argmin Σ_{i=1}^n (Y_i − Ŷ_i)² + λ Σ_{j=1}^p |β_j|
Ridge estimation: argmin Σ_{i=1}^n (Y_i − Ŷ_i)² + λ Σ_{j=1}^p β_j²

54 More: a family of shrinkage estimators.

β̂ = argmin_β Σ_{i=1}^N (y_i − x_i^T β)²  subject to  Σ_j |β_j|^q ≤ s

for q ≥ 0; contours of constant value of Σ_j |β_j|^q are shown for the case of two inputs.

55 q-norms visualized: all q-norms penalize larger weights. q < 2 tends to create sparse solutions (i.e., lots of 0 weights), while q > 2 tends to push for similar weights. [Figure: constraint contours for several values of q.]

56 Regularization path of a ridge regression. [Figure: each coefficient β_j plotted as λ varies, starting from λ = 0 (least squares).]

57 Regularization path of a ridge regression (weight decay). [Figure: λ varies from 0 upward; an example with 8 features.]

58 Regularization path of a lasso regression: when varying λ, how each β_j varies. [Figure: λ varies from 0 upward; an example with 8 features.]
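These path plots can be reproduced numerically: sweep λ and record the coefficients at each value. A sketch reusing the `ridge_fit` helper defined earlier (X and y are assumed to be an already-loaded, centered dataset):

```python
import numpy as np

lams = np.logspace(-3, 3, 50)  # lambda from near 0 to very large
path = np.array([ridge_fit(X, y, lam) for lam in lams])  # shape (50, p)
# Each column path[:, j] traces beta_j: near the least-squares value
# for small lambda, shrinking toward 0 as lambda grows.
```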

59 WHY and how to select λ?
1. Generalization ability → use k-fold CV to decide
2. Control the bias and variance of the model (details in future lectures)
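A minimal k-fold CV sketch for choosing λ (my own illustration, again reusing the `ridge_fit` helper from above):

```python
import numpy as np

def cv_select_lambda(X, y, lams, k=5, seed=0):
    """Return the lambda with the lowest mean held-out squared error."""
    idx = np.random.default_rng(seed).permutation(X.shape[0])
    folds = np.array_split(idx, k)
    scores = []
    for lam in lams:
        errs = []
        for fold in folds:
            train = np.setdiff1d(idx, fold)
            beta = ridge_fit(X[train], y[train], lam)
            errs.append(np.mean((X[fold] @ beta - y[fold]) ** 2))
        scores.append(np.mean(errs))
    return lams[int(np.argmin(scores))]
```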

60 Regression: complexity versus goodness of fit. [Figure: the same training data fit by three models: too simple (low variance / high bias), about right, and too complex (low bias / high variance).] What ultimately matters: GENERALIZATION.

61 An example of a ridge regression path: choose the λ that generalizes well! When varying λ, how each β_j varies. [Figure: λ increases from λ = 0; an example with 8 features.]

62 Choose the λ that generalizes well! When varying λ, how each β_j varies. [Figure: λ varies from 0 upward; an example with 8 features.]

63 Today Recap: Linear Regression Model with Regularizations
- Review: (Ordinary) Least squares: squared loss (Normal Equation)
- Ridge regression: squared loss with L2 regularization
- Lasso regression: squared loss with L1 regularization
- Elastic net regression: squared loss with L1 AND L2 regularization
- Influence of the regularization parameter

64 References
- Big thanks to Prof. Eric Xing @ CMU for allowing me to reuse some of his slides
- Prof. Nando de Freitas's tutorial slides
- "Regularization and variable selection via the elastic net," Hui Zou and Trevor Hastie, Stanford University, USA
- ESL book: The Elements of Statistical Learning

65 More on regularized regression:
- Optimization for regularized regression: see the L6-extra slides
- The correspondence between λ and s: see the L6-extra slides
- Why the Elastic Net has a few nice properties: see the L6-extra slides
