LINEAR REGRESSION, RIDGE, LASSO, SVR

LINEAR REGRESSION, RIDGE, LASSO, SVR Supervised Learning Katerina Tzompanaki

Linear regression, one feature*. What is the estimated price of a new house of area 30 m²? [Figure: training points plotted as Price (y) vs. Area (x), with the new house at x = 30.] *Also called univariate linear regression.

Linear regression, one feature. Find the line (or linear function) that fits the input points, e.g. y = ax + b. [Figure: a fitted line through the Price (y) vs. Area (x) points.]

Linear regression, one feature. What is the estimated price ŷ of a new house of area 30? ŷ = a·30 + b. [Figure: the fitted line evaluated at x = 30 on the Price (y) vs. Area (x) plot.]

Linear regression, one feature. For y = ax + b we must calculate: the coefficient a (the slope of the line) and the intercept b (the offset of the line, i.e. the point where it hits the y axis). [Figure: the fitted line on the Price (y) vs. Area (x) plot.]
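
To make the one-feature case concrete, here is a minimal sketch with scikit-learn's LinearRegression that fits y = ax + b and predicts the price of a 30 m² house; the area and price values are made-up toy numbers, not data from the lecture.

```python
# A minimal sketch, assuming toy area/price values (not the lecture's data):
# fit y = a*x + b with one feature and predict the price for a 30 m^2 house.
import numpy as np
from sklearn.linear_model import LinearRegression

areas = np.array([[20.0], [25.0], [35.0], [40.0], [50.0]])  # x: area in m^2
prices = np.array([60.0, 70.0, 95.0, 105.0, 130.0])         # y: price (e.g. in k euros)

model = LinearRegression().fit(areas, prices)
a, b = model.coef_[0], model.intercept_                     # slope and intercept
print(f"y = {a:.2f} * x + {b:.2f}")
print("Estimated price for a 30 m^2 house:", model.predict([[30.0]])[0])
```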

Linear regression, many features*. [Figure: example of a plane.] In general: h(x) = w_1 x_1 + w_2 x_2 + ... + w_n x_n + b. *Also called multivariate linear regression.

Linear regression, many features. [Figure: example of a plane.] In general: h(x) = w_1 x_1 + w_2 x_2 + ... + w_n x_n + b. h(x) is called a line for one feature, a plane for two features, or a hyperplane for more features.

Linear regression, many features. h(x) = w_1 x_1 + w_2 x_2 + ... + w_n x_n + b is called the hypothesis, and it can also be written as: h(x) = Σ_{i=0}^{n} w[i] x[i]. To obtain this, we have defined one extra feature x_0 = 1 in order to represent b as w_0: x = (x_0, ..., x_n) ∈ R^(n+1), and w = (w_0, ..., w_n) ∈ R^(n+1) is the weight vector.

Linear regression, many features. h(x) = w_1 x_1 + w_2 x_2 + ... + w_n x_n + b is called the hypothesis, and it can also be written as: h(x) = Σ_{i=0}^{n} w[i] x[i] = w^T x, the inner product of the transpose of w and x. To obtain this, we have defined one extra feature x_0 = 1, thus x = (x_0, ..., x_n) ∈ R^(n+1) and w = (w_0, ..., w_n) ∈ R^(n+1) is the weight vector.

Linear regression, many features. h(x) = w_1 x_1 + w_2 x_2 + ... + w_n x_n + b is called the hypothesis, and it can also be written as: h(x) = Σ_{i=0}^{n} w[i] x[i] = w^T x, the inner product of the transpose of w (a 1 × (n+1) matrix) and x (an (n+1) × 1 matrix). To obtain this, we have defined one extra feature x_0 = 1, thus x = (x_0, ..., x_n) ∈ R^(n+1) and w = (w_0, ..., w_n) ∈ R^(n+1) is the weight vector.
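
As a small illustration of the notation above, the following NumPy sketch folds the intercept b into the weight vector by prepending x_0 = 1 and evaluates h(x) = w^T x; the particular weights and feature values are arbitrary.

```python
# Sketch of the "bias trick": prepend x0 = 1 so that b becomes w0 and
# h(x) = w^T x. The weights and features below are arbitrary illustration values.
import numpy as np

w = np.array([2.0, 0.5, -1.0])       # w = (w0, w1, w2); w0 plays the role of b
x_raw = np.array([30.0, 4.0])        # original features (x1, x2)
x = np.concatenate(([1.0], x_raw))   # augmented x = (x0 = 1, x1, x2)

h = w @ x                            # inner product w^T x
assert np.isclose(h, 0.5 * 30.0 + (-1.0) * 4.0 + 2.0)
print("h(x) =", h)
```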


Linear regression, many features. h(x) = Σ_{i=0}^{n} w[i] x[i] = w^T x. We must calculate w: w[0] is the intercept and w[1], ..., w[n] are the coefficients.

Linear regression, many features. h(x) = Σ_{i=0}^{n} w[i] x[i] = w^T x. We must calculate w: w[0] is the intercept and w[1], ..., w[n] are the coefficients. Goal? Minimize a cost function!

Cost function. The cost function calculates the error. Minimizing the error can approximately be seen as: minimize the sum of the squares of the distances of the real values from the estimated values on the training data. [Figure: Price (y) vs. Size (x) plot showing the residuals, e.g. y_2 − ŷ_2 at x_2.] y − ŷ is also called the residual. Reminder: ŷ denotes an estimated (predicted) value.

Cost function: Ordinary Least Squares. Residual Sum of Squares (RSS): J(w) = Σ_{i=1}^{N} (h(x_i) − y_i)². Goal: find w = (w_0, ..., w_n) s.t. J(w) is minimized. How? Gradient descent (but we will not go into details).

Cost function: Ordinary Least Squares. Residual Sum of Squares (RSS): J(w) = Σ_{i=1}^{N} (h(x_i) − y_i)². Goal: find w = (w_0, ..., w_n) s.t. J(w) is minimized. How? Gradient descent (a small sketch follows below). Intuition: we want to obtain predicted values as close as possible to the original values (for the training data). Simple linear regression uses this method, called Ordinary Least Squares.
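
Although the lecture does not go into the details of gradient descent, here is a rough sketch of what minimizing J(w) by gradient descent could look like; the toy data, learning rate, and iteration count are arbitrary choices for illustration.

```python
# Sketch: minimize the RSS cost J(w) = sum_i (w^T x_i - y_i)^2 by gradient
# descent. X already contains the x0 = 1 column; the learning rate and the
# number of iterations are arbitrary illustration choices.
import numpy as np

def rss(w, X, y):
    return np.sum((X @ w - y) ** 2)

def gradient_descent(X, y, lr=0.001, n_iters=5000):
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = 2 * X.T @ (X @ w - y)   # gradient of the RSS with respect to w
        w -= lr * grad
    return w

# Toy data: y ~ 3 + 2*x1 plus noise, with a bias column prepended.
rng = np.random.default_rng(0)
x1 = rng.uniform(0, 1, size=50)
X = np.column_stack([np.ones(50), x1])
y = 3 + 2 * x1 + rng.normal(scale=0.1, size=50)

w = gradient_descent(X, y)
print("learned w:", w, " RSS:", rss(w, X, y))
```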

Ordinary Least Squares.
Pros:
- Has no parameters to tune.
- Performs well on bigger datasets.
- Simple algorithm to understand.
Cons:
- Performs badly when the input data is small and there are few features.
- Can be slow on big datasets.
- High model complexity, as all the features are taken into account.
- Risk of over- or under-fitting.

Ordinary Least Squares. Underfitting may occur when we have few features and little training data. Overfitting may occur when we have many features (e.g. in the last diagram the hyperplane is projected onto the x feature). To avoid overfitting: intervene with regularization! [Figure: from "Linear Regression and Support Vector Regression", Paisitkriangkrai, University of Adelaide.]

Ridge regression. Characteristics: it tries to shrink the estimated weights towards zero. L2 penalty: it penalizes the L2 norm (Euclidean length) of the coefficient vector. A tuning parameter α controls the strength of the penalty.

Ridge regression. Characteristics: it tries to shrink the estimated weights towards zero. L2 penalty: it penalizes the L2 norm (Euclidean length) of the coefficient vector. A tuning parameter α controls the strength of the penalty. Now, the cost function becomes: J_r(w) = Σ_{i=1}^{N} (h(x_i) − y_i)² + α Σ_{j=1}^{n} w_j². Goal: find w = (w_0, ..., w_n) s.t. J_r(w) is minimized.

Ridge regression. Characteristics: it tries to shrink the estimated weights towards zero. L2 penalty: it penalizes the L2 norm (Euclidean length) of the coefficient vector. A tuning parameter α controls the strength of the penalty. Now, the cost function becomes: J_r(w) = Σ_{i=1}^{N} (h(x_i) − y_i)² (the error term) + α Σ_{j=1}^{n} w_j² (the penalty term). Goal: find w = (w_0, ..., w_n) s.t. J_r(w) is minimized.

Ridge regression. J_r(w) = Σ_{i=1}^{N} (h(x_i) − y_i)² + α Σ_{j=1}^{n} w_j².

Ridge regression. J_r(w) = Σ_{i=1}^{N} (h(x_i) − y_i)² + α Σ_{j=1}^{n} w_j² (error term + penalty term). The tuning parameter α ≥ 0 controls the strength of the penalty. α = 0: we get linear regression (J_r = J). α → ∞: we get w → 0. For other values of α: we try to fit a linear model h(x) and at the same time shrink the weights. The weights here are only the coefficients of the features, not the intercept; note how the penalty sum starts from j = 1 and not j = 0.
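
A minimal sketch of ridge regression with scikit-learn, showing how a growing α shrinks the coefficients (scikit-learn's Ridge likewise does not penalize the intercept); the data below is random toy data, not from the lecture.

```python
# A small sketch, assuming random toy data: ridge regression for several
# values of alpha. alpha = 0 corresponds to plain linear regression;
# large alpha shrinks the coefficients towards 0.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
true_w = np.array([3.0, -2.0, 0.5, 0.0, 0.0])
y = X @ true_w + rng.normal(scale=0.5, size=40)

for alpha in [0.0, 1.0, 100.0]:
    model = LinearRegression() if alpha == 0.0 else Ridge(alpha=alpha)
    model.fit(X, y)
    print(f"alpha = {alpha:6.1f}  coefficients = {np.round(model.coef_, 3)}")
```

Printing the coefficients side by side makes the shrinkage visible: they get smaller as α grows, but none of them reaches exactly zero.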

Lasso regression. Ridge regression will never set the coefficients to zero: only when α is ∞ will all coefficients be 0. What if some features are not as important as others? → Feature selection. Lasso regression: a regularized model that penalizes the L1 norm of the coefficient vector. A tuning parameter α controls the strength of the penalty. Now, the cost function is: J_l(w) = Σ_{i=1}^{N} (h(x_i) − y_i)² + α Σ_{j=1}^{n} |w_j|.

Lasso regression. J_l(w) = Σ_{i=1}^{N} (h(x_i) − y_i)² + α Σ_{j=1}^{n} |w_j|.

Lasso regression. J_l(w) = Σ_{i=1}^{N} (h(x_i) − y_i)² + α Σ_{j=1}^{n} |w_j| (error term + penalty term). As for ridge regression, for the tuning parameter α ≥ 0 we have: α = 0: we get linear regression (J_l = J). α → ∞: we get w → 0. For other values of α: we try to fit a linear model h(x) and at the same time shrink the weights, which can now be exactly 0 for some features, because of L1. For bigger values of α, more weights become 0 and the rest get smaller.
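
The corresponding sketch for lasso with scikit-learn, on the same kind of toy data, shows coefficients becoming exactly 0 as α grows; the α values are arbitrary.

```python
# Companion sketch for lasso on random toy data: as alpha grows,
# more coefficients become exactly 0 (feature selection).
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
y = X @ np.array([3.0, -2.0, 0.5, 0.0, 0.0]) + rng.normal(scale=0.5, size=40)

for alpha in [0.01, 0.1, 1.0]:
    model = Lasso(alpha=alpha).fit(X, y)
    zeros = int(np.sum(model.coef_ == 0.0))
    print(f"alpha = {alpha:5.2f}  coefficients = {np.round(model.coef_, 3)}  "
          f"zero weights: {zeros}")
```

Counting the exactly-zero coefficients is what makes the feature-selection behaviour of the L1 penalty visible.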

Lasso regression. Why does lasso give some 0 weights? For lasso, the blue region represents the constraint area imposed by L1: |w_1| + |w_2| ≤ t. For ridge, the constraint region is: w_1² + w_2² ≤ t. The RSS has elliptical contours, centered at the OLS estimate ŵ. We can see that the lasso diamond has corners, enabling some coefficients (here w_1) to become zero when the elliptical contours of the RSS meet the blue constraint region. [Figure: the diamond (lasso) and disc (ridge) constraint regions with the RSS contours.]

Linear models for regression - summary.
Ridge & Lasso:
- Regularized models (using the L2 and L1 penalties, respectively).
- Better prediction error than linear regression on small datasets.
- Need attribute normalization.
Ridge:
- Works best when a number of weights are already small or 0.
- Good for prediction, not for model interpretation (no feature selection is possible).
Lasso:
- Makes feature selection possible.
- Good for model prediction.
Combinations of L1 and L2 also exist (elastic net).
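
As a rough illustration of the summary points, the sketch below scales the attributes first (ridge and lasso are sensitive to feature scales) and then fits an elastic net, which combines the L1 and L2 penalties; the feature scales, α, and l1_ratio values are arbitrary choices.

```python
# A rough sketch, assuming arbitrary toy data: normalize the attributes,
# then fit an elastic net (a combination of the L1 and L2 penalties).
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
scales = np.array([1.0, 10.0, 100.0, 0.1])        # features on very different scales
X = rng.normal(size=(60, 4)) * scales
y = X @ np.array([1.5, 0.2, 0.01, -8.0]) + rng.normal(scale=0.5, size=60)

model = make_pipeline(StandardScaler(), ElasticNet(alpha=0.1, l1_ratio=0.5))
model.fit(X, y)
print("coefficients on the scaled features:",
      np.round(model.named_steps["elasticnet"].coef_, 3))
```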

Beyond linear cases: Support Vector Regressor. Linear SVR: as usual, we try to find a line (not necessarily a straight one) to fit the data, by minimizing a loss function, using support vectors as for the Support Vector Classifier. Non-linear SVR: we use the kernel trick, and the kernel families that we saw for SVC (linear, polynomial, Gaussian), to fit data that a linear model cannot capture.

Beyond linear cases: Support Vector Regressor. [Figure: SVR example from http://scikit-learn.org/stable/auto_examples/svm/plot_svm_regression.html]
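
In the spirit of the scikit-learn example referenced above, here is a small sketch fitting a linear and an RBF-kernel SVR on toy 1-D data; C, gamma, and epsilon are kept near their defaults and would normally be tuned.

```python
# Sketch: linear vs. RBF-kernel SVR on toy 1-D data, loosely following the
# linked scikit-learn example. Hyperparameters are near their defaults.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0.0, 5.0, size=(60, 1)), axis=0)   # 1-D inputs
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=60)     # non-linear target

svr_linear = SVR(kernel="linear", C=1.0, epsilon=0.1).fit(X, y)
svr_rbf = SVR(kernel="rbf", C=1.0, gamma="scale", epsilon=0.1).fit(X, y)

X_test = np.linspace(0.0, 5.0, 5).reshape(-1, 1)
print("linear SVR predictions:", np.round(svr_linear.predict(X_test), 2))
print("RBF SVR predictions:   ", np.round(svr_rbf.predict(X_test), 2))
```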