CS 1: Machine Learning, Spring 15
College of Computer and Information Science, Northeastern University
Lecture 3, February 3
Instructor: Bilal Ahmed
Scribe: Bilal Ahmed & Virgil Pavlu

1 Introduction

Ridge Regression¹

Consider the polynomial regression task: fit a polynomial of a pre-chosen degree to a sample of data drawn from the function

    f(x) = 7.5 sin(2.5πx)

to which some random noise is added, so that the training labels can be represented as

    t = f(x) + ε                                                        (1)

where ε ~ Normal(0, σ^2) is additive Gaussian noise. The training dataset D is therefore represented as {(x_i, t_i)}, i ∈ {1, 2, ..., N}.

[Figure 1 omitted: panel (a) labeled P = 1 and panel (b) labeled P = 3, both plotted over x from 0 to 1.]
Figure 1: Fitting a polynomial to data sampled from f(x) = 7.5 sin(2.5πx). The sampled data is shown as red circles, the underlying function as the solid blue curve, and the fitted functions as dotted black lines. For the degree-1 polynomial shown in (a), the chosen polynomial class is unable to correctly represent the target function, while in (b) the degree-3 polynomial closely resembles the shape of the target function even though it does not pass through all the points perfectly.

¹ These lecture notes are intended for in-class use only.
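As a concrete illustration of Equation (1), the following is a minimal sketch (my own, not part of the notes) that generates such a training set with NumPy; the sample size N, the noise level σ, and the sine frequency shown are assumptions rather than values stated in the notes.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Target function from the notes (the exact sine frequency is an assumption).
    return 7.5 * np.sin(2.5 * np.pi * x)

N, sigma = 8, 1.0                            # assumed sample size and noise level
x = rng.uniform(0.0, 1.0, size=N)            # inputs x_i on [0, 1]
eps = rng.normal(0.0, sigma, size=N)         # additive Gaussian noise
t = f(x) + eps                               # training labels, Equation (1)
D = list(zip(x, t))                          # the training dataset D = {(x_i, t_i)}
print(D[:2])                                 # peek at the first two training pairs
```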

[Figure 2 omitted: panel (a) labeled P = 8, plotted over x from 0 to 1; panel (b) plots training and test error against model complexity.]
Figure 2: (a) Fitting a degree-8 polynomial to data sampled from f(x) = 7.5 sin(2.5πx). The sampled data is shown as red circles, the underlying function as the solid blue curve, and the fitted function as the dotted black line. The learned function passes through the training data perfectly and will have a very low training error. But can we say anything about how it will behave on out-of-sample (test) data? (b) Training and test error as we over- and under-fit the training data.

Figure 1 shows an example of fitting a degree-1 (straight line) and a degree-3 (cubic) polynomial to a training dataset comprising eight instances. It can be seen in Figure 1(a) that with a degree-1 polynomial as our hypothesis class we are unable to represent the function accurately; this is intuitive, because compared to the target function our chosen hypothesis class is very simple. This is an example of underfitting, where the hypothesis class fails to capture the complexities of the target concept. Figure 2(a) shows that a much higher-order polynomial (degree-8 in this case) also fails to accurately represent the target function, even though it perfectly predicts the label of each training instance. This is an example of overfitting: instead of learning the target concept, the learned concept aligns itself with the idiosyncrasies (noise) of the observed data. By contrast, the cubic polynomial (Figure 1(b)) represents the target function much better, while having a higher training error than the degree-8 polynomial.

1.1 Occam's Razor

What happens when we overfit to the training data? Figure 2(b) shows that in the case of overfitting we achieve lower training error, but the error on unseen (test) data is higher. So what can we do to avoid overfitting? We have previously seen that, in the case of decision trees, we used pre-pruning and post-pruning to obtain shorter/simpler trees. Preferring simpler hypotheses that may have higher training error over more complex hypotheses that have lower training error is an instance of Occam's Razor. William of Occam was an English friar and philosopher who put forward the principle that "if one can explain a phenomenon without assuming this or that hypothetical entity, then there is no ground for assuming it", i.e., one should always opt for an explanation in terms of the fewest possible causes, factors, or variables. In terms of machine learning, this simply implies that we should always prefer the simplest hypothesis that fits our training data, because simpler hypotheses tend to generalize well.
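The train/test gap described above is easy to observe numerically. The sketch below (my own illustration, continuing the earlier data-generating snippet; the sample sizes, noise level, and sine frequency are assumptions) fits polynomials of increasing degree to a tiny training set and measures error on a separate test set.

```python
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: 7.5 * np.sin(2.5 * np.pi * x)      # assumed target function

def sample(n, sigma=1.0):
    x = rng.uniform(0.0, 1.0, n)
    return x, f(x) + rng.normal(0.0, sigma, n)   # t = f(x) + eps

x_train, t_train = sample(8)       # small training set, as in the figures
x_test, t_test = sample(200)       # held-out data used to estimate test error

for degree in (1, 3, 5, 7):
    w = np.polyfit(x_train, t_train, deg=degree)           # least-squares fit
    train_mse = np.mean((np.polyval(w, x_train) - t_train) ** 2)
    test_mse = np.mean((np.polyval(w, x_test) - t_test) ** 2)
    print(f"degree {degree}: train MSE {train_mse:8.3f}   test MSE {test_mse:8.3f}")
```

In a typical run the training MSE shrinks toward zero as the degree grows, while the test MSE eventually increases; that widening gap is the overfitting regime sketched in Figure 2(b).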

How can we apply Occam's Razor to linear regression? In other words, how can we characterize the complexity of a linear regression model? Recall that in the case of decision trees the model complexity was represented by the depth of the tree: a shorter tree corresponds to a simpler hypothesis. For linear regression, the model complexity is associated with the number of features. As the number of features grows, the model becomes more complex and is able to capture local structure in the data (compare the three plots for the polynomial regression example in Figures 1 and 2). This results in high variance of the estimated feature coefficients w_i: small changes in the training data cause large changes in the final coefficient estimates. In order to apply Occam's Razor to linear regression, we therefore need to restrict/constrain the feature coefficients.

2 Regularizing Linear Regression

For linear regression, we can impose a constraint on the feature coefficients so that we penalize solutions with large estimates. This can be achieved by augmenting the original linear regression objective with a constraint:

    argmin_{w ∈ R^m}  Σ_{i=1}^{N} (t_i − w^T x_i)^2
    subject to:       Σ_{j=1}^{m} w_j^2 ≤ c                             (2)

The objective function is exactly the same as before for linear regression: minimize the sum of squared errors over the training data. However, we now have an added constraint on the feature weights that keeps them from growing abnormally large. By forming the Lagrangian of Equation (2) we get the following optimization problem:

    w_ridge = argmin_{w ∈ R^m} J(w) = Σ_{i=1}^{N} (t_i − w^T x_i)^2 + λ Σ_{j=1}^{m} w_j^2        (3)

for λ ≥ 0. There is a one-to-one correspondence between λ in Equation (3) and c in Equation (2). For a two-dimensional case, i.e., w ∈ R^2, we can visualize the optimization problem as shown in Figure 3. We know that

    Σ_{j=1}^{m} w_j^2 = w^T w = ‖w‖_2^2

so we can re-write the problem as:

    w_ridge = argmin_{w ∈ R^m} J(w) = Σ_{i=1}^{N} (t_i − w^T x_i)^2 + λ ‖w‖_2^2                  (4)

Including an L2 penalty term in the optimization problem is also known as Tikhonov regularization.
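To make the penalized objective in Equations (3)-(4) concrete, here is a small sketch (my own, not from the notes) that evaluates J(w) and minimizes it by plain gradient descent; the toy data, step size, and value of λ are made-up.

```python
import numpy as np

def ridge_objective(w, X, t, lam):
    """J(w) = sum_i (t_i - w^T x_i)^2 + lam * ||w||_2^2   (Equations 3-4)."""
    r = t - X @ w
    return r @ r + lam * (w @ w)

def ridge_gradient(w, X, t, lam):
    """Gradient of J(w): -2 X^T (t - X w) + 2 lam w."""
    return -2.0 * X.T @ (t - X @ w) + 2.0 * lam * w

# Toy example: rows of X are the feature vectors x_i.
X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.0]])
t = np.array([1.0, 2.0, 3.0])
w, lam, step = np.zeros(2), 0.1, 0.01
for _ in range(500):                         # plain gradient descent on J(w)
    w = w - step * ridge_gradient(w, X, t, lam)
print("J(w) =", ridge_objective(w, X, t, lam), "  w =", w)
```

Because J(w) is a strictly convex quadratic for λ > 0, gradient descent with a small enough step size converges to the same unique minimizer that the closed-form solution derived next computes directly.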

Figure 3: The red circle represents the constraint imposed on the coefficients, while the contour plot shows the sum-of-squares objective; w_ridge is the ridge regression solution.

2.1 An Analytical Solution

We can obtain an analytical solution in much the same way we derived the normal equations. Before we proceed, we need to address the fact that the intercept w_0 is part of the weight vector we are optimizing. If we place a constraint on w_0, then our solution will change if we re-scale the t_i's. By convention, we therefore drop the constant feature (all ones) from the design matrix and center each feature, i.e., subtract from each feature its mean value. Let Z be the centered data matrix and let t be the vector of observed target values (generated according to Equation (1)). Writing the residual vector as E = Zw − t, we have J(w) = E^T E + λ w^T w, and

    ∇_w J(w) = ∇_w (Zw − t)^T (Zw − t) + λ ∇_w w^T w
             = ∇_w (w^T Z^T Z w − w^T Z^T t − t^T Z w + t^T t) + λ ∇_w w^T w
             = ∇_w (w^T Z^T Z w − 2 w^T Z^T t + t^T t) + λ ∇_w w^T w
             = 2 (Z^T Z w − Z^T t) + 2 λ w
             = 2 Z^T Z w + 2 λ I w − 2 Z^T t
             = 2 (Z^T Z + λ I) w − 2 Z^T t

where in the third step we used t^T Z w = w^T Z^T t, which follows from (AB)^T = B^T A^T. Setting the gradient to zero, we obtain

    (Z^T Z + λ I) w = Z^T t,    or    w_ridge = (Z^T Z + λ I)^{-1} Z^T t.

This solution corresponds to a particular λ, which is a hyper-parameter chosen beforehand. Algorithm 1 shows the steps involved in estimating the ridge regression solution for a given training set.
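These steps translate directly into code. The following sketch (my own, with made-up data; it anticipates Algorithm 1 below) centers the features, sets the intercept to the mean target, and solves (Z^T Z + λI) w = Z^T t.

```python
import numpy as np

def ridge_fit(X, t, lam):
    """Closed-form ridge regression on centered features (the steps of Algorithm 1)."""
    x_mean = X.mean(axis=0)
    Z = X - x_mean                                   # z_ij = x_ij - mean_j (centering)
    w0 = t.mean()                                    # intercept w_0 = average target value
    m = X.shape[1]
    w = np.linalg.solve(Z.T @ Z + lam * np.eye(m),   # solve (Z^T Z + lam I) w = Z^T t
                        Z.T @ t)
    return w0, w, x_mean

def ridge_predict(X_new, w0, w, x_mean):
    return w0 + (X_new - x_mean) @ w

# Made-up data just to exercise the functions.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
t = X @ np.array([1.0, -2.0, 0.5]) + 4.0 + 0.1 * rng.normal(size=20)
w0, w, mu = ridge_fit(X, t, lam=1.0)
print("intercept:", round(w0, 3), "  weights:", np.round(w, 3))
print("predictions vs targets:", np.round(ridge_predict(X[:2], w0, w, mu), 2), t[:2])
```

Using np.linalg.solve rather than forming (Z^T Z + λI)^{-1} explicitly yields the same w_ridge but is numerically better conditioned.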

Data: training data (x_i, t_i), i ∈ {1, 2, ..., N}
Result: ridge regression estimate ŵ_ridge
  center the data: z_ij = x_ij − x̄_j;
  set the intercept: ŵ_0 = t̄ = (1/N) Σ_{i=1}^{N} t_i;
  choose λ;
  estimate the solution vector: ŵ_ridge = (Z^T Z + λI)^{-1} Z^T t;
Algorithm 1: Estimating the ridge regression solution for a fixed λ.

3 Generalizing Ridge Regression

We can cast the objective function of ridge regression into a more general framework as follows:

    ŵ_reg = argmin_{w ∈ R^m} Loss(D, w) + λ ‖w‖_q^q

where ‖·‖_q is the q-th vector norm. The regularizers corresponding to some different values of q are shown in Figure 4.

Figure 4: Regularizers corresponding to different values of q.

When q = 1 we get the Lasso penalty, which produces sparse solutions by driving most of the coefficients to zero. The Lasso can therefore be used for feature selection, but it cannot be solved analytically; instead we have to resort to approximate (iterative) methods. Setting q = 2 gives Tikhonov regularization, resulting in ridge regression.
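The sparsity-inducing behavior of the q = 1 penalty is easy to observe empirically. The sketch below is my own illustration: scikit-learn is not part of the notes, it is simply a convenient iterative solver for the Lasso, and the data are made up so that only three of ten features matter.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
true_w = np.zeros(10)
true_w[:3] = [3.0, -2.0, 1.5]                 # only three informative features
t = X @ true_w + 0.5 * rng.normal(size=100)

ridge = Ridge(alpha=1.0).fit(X, t)            # q = 2: shrinks every coefficient a little
lasso = Lasso(alpha=0.1).fit(X, t)            # q = 1: typically sets many coefficients to 0

print("ridge coefficients:", np.round(ridge.coef_, 3))
print("lasso coefficients:", np.round(lasso.coef_, 3))
print("coefficients driven to zero by the lasso:", int(np.sum(lasso.coef_ == 0.0)))
```

The ridge coefficients for the irrelevant features are small but nonzero, while the Lasso usually zeroes most of them out, which is why the q = 1 penalty is useful for feature selection.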