LINEAR REGRESSION, RIDGE, LASSO, SVR Supervised Learning Katerina Tzompanaki
Linear regression, one feature*
What is the estimated price of a new house of area 30 m²?
[Figure: training points plotted as Price (y) against Area (x).]
*Also called univariate linear regression.
Linear regression, one feature
Find the line (or linear function) that fits the input points, e.g. y = ax + b.
[Figure: a candidate line drawn through the Price (y) vs Area (x) scatter plot.]
Linear regression, one feature
What is the estimated price of a new house of size 30? ŷ = a*30 + b.
[Figure: the fitted line, with the prediction ŷ read off at x = 30.]
Linear regression, one feature
We must calculate ŷ = ax + b, where:
- a is the coefficient (the slope of the line), and
- b is the intercept (the offset of the line, i.e. the point where it crosses the y axis).
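For illustration, here is a minimal sketch (not from the slides) of fitting a one-feature linear regression with scikit-learn and predicting the price of a 30 m² house; the house data below is made up.

```python
# Minimal sketch: fit y = ax + b on toy (area, price) data with scikit-learn,
# then predict the price of a 30 m^2 house.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[20], [35], [45], [60], [80]])   # areas in m^2 (toy data)
y = np.array([110, 160, 205, 270, 340])        # prices in k euros (toy data)

model = LinearRegression().fit(X, y)
print("slope a:", model.coef_[0], "intercept b:", model.intercept_)
print("estimated price for 30 m^2:", model.predict([[30]])[0])
```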
Linear regression, many features*
In general: h(x) = w_1 x_1 + w_2 x_2 + ... + w_n x_n + b
[Figure: example of a plane fitted to points described by two features.]
h(x) is called a line for one feature, a plane for two features, and a hyperplane for more features.
*Also called multivariate linear regression.
Linear regression, many features
h(x) = w_1 x_1 + w_2 x_2 + ... + w_n x_n + b is called the hypothesis, and it can also be written as:
h(x) = Σ_{i=0}^{n} w[i]x[i] = wᵀx
i.e. the inner product of the transpose of w and x, where wᵀ is a 1 × (n+1) matrix and x is an (n+1) × 1 matrix.
To obtain this, we define one extra feature x_0 = 1 in order to represent b as w_0, thus:
x = (x_0, ..., x_n) ∈ R^(n+1)
w = (w_0, ..., w_n) ∈ R^(n+1)   (the weight vector)
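A small sketch of the bias trick described above (plain NumPy, toy numbers): prepending x_0 = 1 lets the hypothesis be computed as a single inner product.

```python
# Sketch of the x_0 = 1 "bias trick": the intercept b becomes w_0,
# and the hypothesis reduces to an inner product h(x) = w^T x.
import numpy as np

w = np.array([5.0, 2.0, -1.0])      # (w_0 = b, w_1, w_2), toy values
x_raw = np.array([30.0, 4.0])       # original features (x_1, x_2)
x = np.concatenate(([1.0], x_raw))  # prepend x_0 = 1

h = w @ x                           # inner product w^T x
print(h)                            # same as b + w_1*x_1 + w_2*x_2 = 5 + 60 - 4 = 61
```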
Linear regression, many features
h(x) = Σ_{i=0}^{n} w[i]x[i] = wᵀx
We must calculate w: w[0] is the intercept and w[1], ..., w[n] are the coefficients.
Goal? Minimize a cost function!
Cost function
The cost function calculates the error. Minimizing the error can approximately be seen as: minimize the sum of the squares of the distances between the real values and the estimated values on the training data.
y - ŷ is also called the residual. Reminder: ŷ denotes an estimated (predicted) value.
[Figure: residuals shown as vertical distances between the points and the fitted line on the Price (y) vs Size (x) plot.]
Cost function: Ordinary Least Squares
Residual Sum of Squares (RSS): J(w) = Σ_{i=1}^{N} (h(x_i) - y_i)²
Goal: Find w = (w_0, ..., w_n) such that J(w) is minimized.
How? Gradient descent (we will not go into the details).
Intuition: We want to obtain predicted values as close as possible to the original values (for the training data). Simple linear regression uses this method, called Ordinary Least Squares.
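The slides do not detail gradient descent, but a minimal sketch of minimizing the RSS by batch gradient descent on toy data could look as follows; the learning rate and iteration count are arbitrary choices, and feature scaling would normally be used to speed up convergence.

```python
# Sketch: minimize J(w) = sum_i (w^T x_i - y_i)^2 by batch gradient descent.
# Toy data; learning rate and iteration count are illustrative only.
import numpy as np

X = np.array([[20.0], [35.0], [45.0], [60.0], [80.0]])   # one feature
y = np.array([110.0, 160.0, 205.0, 270.0, 340.0])

Xb = np.hstack([np.ones((len(X), 1)), X])   # bias trick: x_0 = 1
w = np.zeros(Xb.shape[1])

lr, n_iters = 5e-5, 200_000                 # small lr because features are unscaled
for _ in range(n_iters):
    residuals = Xb @ w - y                  # h(x_i) - y_i for all i
    grad = 2 * Xb.T @ residuals             # gradient of the RSS
    w -= lr * grad

print("intercept w[0]:", w[0], "coefficient w[1]:", w[1])
```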
Ordinary Least Squares
Pros:
- No parameters to tune.
- Performs well on larger datasets.
- Simple algorithm to understand.
Cons:
- Performs badly with little input data and few features.
- Can be slow on big datasets.
- High model complexity, as all the features are taken into account.
- Risk of over- or under-fitting.
Ordinary Least Squares
Underfitting may occur when we have few features and little training data. Overfitting may occur when we have many features (e.g. in the last diagram the hyperplane is projected onto the x feature).
To avoid overfitting: intervene with regularization!
Figure: Linear Regression and Support Vector Regression, Paisitkriangkrai, University of Adelaide
Ridge regression
Characteristics:
- Tries to shrink the estimated weights towards zero.
- L2 penalty: penalizes the L2 norm (Euclidean length) of the coefficient vector.
- A tuning parameter α controls the strength of the penalty.
Now, the cost function becomes:
J_r(w) = Σ_{i=1}^{N} (h(x_i) - y_i)² + α Σ_{j=1}^{n} w_j²
where the first sum is the error term and the second the penalty term.
Goal: Find w = (w_0, ..., w_n) such that J_r(w) is minimized.
Ridge regression
J_r(w) = Σ_{i=1}^{N} (h(x_i) - y_i)² + α Σ_{j=1}^{n} w_j²   (error + penalty)
The tuning parameter α ≥ 0 controls the strength of the penalty:
- α = 0: we get linear regression (J_r = J).
- α → ∞: we get w → 0.
- For other values of α: we try to fit a linear model h(x) and at the same time shrink the weights.
The penalized weights are only the coefficients of the features, not the intercept: note how the sum starts from j = 1 and not j = 0.
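A sketch of how the penalty strength behaves in practice, using scikit-learn's Ridge on synthetic data (the data and α values below are made up): larger α shrinks the coefficients towards zero.

```python
# Sketch: Ridge regression with scikit-learn, showing how larger alpha
# shrinks the coefficients. The data is synthetic, for illustration only.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([3.0, -2.0, 0.5, 0.0, 0.0]) + rng.normal(scale=0.5, size=50)

X = StandardScaler().fit_transform(X)   # regularized models need normalized attributes

for alpha in (0.01, 1.0, 100.0):
    ridge = Ridge(alpha=alpha).fit(X, y)
    print(f"alpha={alpha:6.2f}  coefficients={np.round(ridge.coef_, 3)}")
```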
Lasso regression
Ridge regression will never set the coefficients exactly to zero; only when α → ∞ do all coefficients become 0. What if some features are not as important as others? → Feature selection.
Lasso regression: a regularized model that penalizes the L1 norm of the coefficient vector. A tuning parameter α controls the strength of the penalty. Now, the cost function is:
J_l(w) = Σ_{i=1}^{N} (h(x_i) - y_i)² + α Σ_{j=1}^{n} |w_j|
Lasso regression
J_l(w) = Σ_{i=1}^{N} (h(x_i) - y_i)² + α Σ_{j=1}^{n} |w_j|   (error + penalty)
As for Ridge regression, for the tuning parameter α ≥ 0 we have:
- α = 0: we get linear regression (J_l = J).
- α → ∞: we get w → 0.
- For other values of α: we try to fit a linear model h(x) and at the same time shrink the weights, which can now be exactly 0 for some features, because of the L1 penalty. For bigger values of α, more weights become 0 and the rest get smaller.
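A similar sketch with scikit-learn's Lasso (same synthetic setup, made-up α values), showing that some coefficients become exactly zero as α grows.

```python
# Sketch: Lasso with scikit-learn on synthetic data, showing that some
# coefficients become exactly zero (implicit feature selection).
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([3.0, -2.0, 0.5, 0.0, 0.0]) + rng.normal(scale=0.5, size=50)
X = StandardScaler().fit_transform(X)

for alpha in (0.01, 0.1, 1.0):
    lasso = Lasso(alpha=alpha).fit(X, y)
    n_zero = np.sum(lasso.coef_ == 0.0)
    print(f"alpha={alpha:5.2f}  coefficients={np.round(lasso.coef_, 3)}  zeros={n_zero}")
```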
Lasso regression
Why does Lasso give some 0 weights? In the figure, the blue region for Lasso is the constraint area imposed by L1: |w_1| + |w_2| ≤ t. For Ridge the constraint region is: w_1² + w_2² ≤ t. The RSS has elliptical contours, centered at the OLS estimate ŵ. The Lasso diamond has corners, enabling some coefficients (here w_1) to become exactly zero when the elliptical contour of the RSS meets the blue constraint region.
Linear models for regression: summary
Ridge & Lasso:
- Regularized models (using L2 and L1 penalties respectively).
- Better prediction error than linear regression for small datasets.
- Need attribute normalization.
Ridge:
- Works best when a number of weights are already small or 0.
- Good for prediction, not for model interpretation (no feature selection is possible).
Lasso:
- Makes feature selection possible.
- Good for model prediction.
Combinations of L1 and L2 also exist (elastic net).
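A brief sketch of the elastic net combination mentioned above, using scikit-learn's ElasticNet; the dataset and parameter values are illustrative only (l1_ratio balances the L1 and L2 penalties).

```python
# Sketch: elastic net combines the L1 and L2 penalties in one model.
# l1_ratio=1.0 is pure Lasso, l1_ratio=0.0 is (close to) pure Ridge.
from sklearn.linear_model import ElasticNet
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=100, n_features=10, noise=5.0, random_state=0)
enet = ElasticNet(alpha=0.5, l1_ratio=0.5).fit(X, y)
print("coefficients:", enet.coef_.round(2))
```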
Beyond linear cases: Support Vector Regressor
Linear SVR: As usual, we try to find a line to fit the data by minimizing a loss function. Support vectors are used, as in the Support Vector Classifier.
Non-linear SVR: We use the kernel trick, with the kernel families that we saw for SVC (linear, polynomial, Gaussian), to fit data whose relationship to the target is not linear.
Beyond linear cases: Support Vector Regressor
[Figure: scikit-learn SVR example comparing kernels on the same data.*]
*http://scikit-learn.org/stable/auto_examples/svm/plot_svm_regression.html
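A small sketch in the spirit of the scikit-learn example linked above: fitting SVR with linear, polynomial, and RBF kernels to noisy 1-D data (the data and hyperparameters below are made up).

```python
# Sketch: SVR with different kernels on noisy 1-D data.
# Hyperparameters (C, epsilon) are arbitrary illustrative choices.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(5 * rng.random((60, 1)), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.normal(size=60)

for kernel in ("linear", "poly", "rbf"):
    svr = SVR(kernel=kernel, C=10.0, epsilon=0.1).fit(X, y)
    print(kernel, "support vectors:", len(svr.support_),
          "train R^2:", round(svr.score(X, y), 3))
```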