Week 3: Linear Regression


Instructor: Sergey Levine

1 Recap

In the previous lecture we saw how linear regression can solve the following problem: given a dataset D = {(x_1, y_1), ..., (x_N, y_N)}, learn to predict y from x. In linear regression, we learn a function f(x) = x^T w = ŷ or, when using features, f(x) = h(x)^T w = ŷ, where h(x) is the feature or basis function. We saw that linear regression corresponds to maximum likelihood estimation under the model y ~ N(x^T w, σ²), and that the optimal parameters can be obtained according to ŵ = (X^T X)^{-1} X^T Y, or, equivalently, according to ŵ = (H^T H)^{-1} H^T Y when using features. In today's lecture, we'll analyze overfitting in linear regression, and see how it can be addressed by imposing a prior on w.

2 Overfitting & regularization

Let's imagine that we are trying to learn a 1D function, where x is one-dimensional and h(x) corresponds to monomials up to some power d:

h(x) = (x, x², ..., x^d)^T.

If our dataset has size N, then we can always fit the dataset perfectly (with zero error) if d ≥ N. However, as d increases, a zero-error fit might not actually be desirable, because it might produce an extremely jagged and multimodal function that is unlikely to reflect the actual trends present in the data. More generally, whenever we have a high-dimensional input space or a highly expressive feature set, such that the dimensionality of w is large, we are liable to overfit. Recall the definition of overfitting: if we find a hypothesis w, but there exists some other hypothesis w′ such that its training error is worse but its test error is better, then we are overfitting.
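To make this concrete, here is a small illustrative sketch (not from the lecture): it fits monomial features of increasing degree d to a noisy 1D dataset with ordinary least squares and prints the training error, the test error on a dense grid, and the size of the learned weights. The data-generating function and all numbers are made up for the example.

import numpy as np

rng = np.random.default_rng(0)
N = 10
x = np.linspace(-1.0, 1.0, N)
y = np.sin(np.pi * x) + 0.1 * rng.standard_normal(N)   # noisy 1D training data (made up)

x_test = np.linspace(-1.0, 1.0, 200)                   # dense grid for test error
y_test = np.sin(np.pi * x_test)

def monomial_features(x, d):
    # h(x) = (x, x^2, ..., x^d), stacked as rows of the feature matrix H
    return np.stack([x ** j for j in range(1, d + 1)], axis=1)

for d in [2, 5, 9]:
    H = monomial_features(x, d)
    # ordinary least squares fit; lstsq is a numerically safer stand-in for (H^T H)^{-1} H^T Y
    w, *_ = np.linalg.lstsq(H, y, rcond=None)
    train_mse = np.mean((H @ w - y) ** 2)
    test_mse = np.mean((monomial_features(x_test, d) @ w - y_test) ** 2)
    print(f"d={d}: train MSE={train_mse:.2e}, test MSE={test_mse:.2e}, max |w_j|={np.abs(w).max():.1e}")

As d approaches N, the training error drops toward zero while the test error and the weight magnitudes typically grow, which is exactly the symptom discussed next.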

In linear regression, one of the most recognizable symptoms of overfitting is the existence of very large values in w. This would happen, for example, when erroneously fitting a high-degree polynomial with near-perfect accuracy to a noisy dataset. Note that this overfitting is quite similar to something we discussed last week in the context of maximum likelihood estimation: if we flip a coin and accidentally observe heads five times in a row, MLE might lead us to conclude the coin would always come up heads. But that is unreasonable.

Question. How can we mitigate overfitting in linear regression?

Answer. Same as last week, we can switch from MLE to a Bayesian approach, and compute the maximum a posteriori (MAP) estimate of the parameters w instead. This involves imposing a prior on w: our reasonable prior belief about what the parameters should be, before we've even seen the data. A reasonable prior belief is that the parameters w should be small: this would prevent the sort of huge parameters we might see when fitting a high-degree polynomial with zero error.

Question. What kind of distribution might be suitable for representing the prior on w?

Answer. Since each entry in w is continuous, real-valued, and unconstrained, the Gaussian distribution is a good choice. In general, we could place a full multivariate Gaussian prior on the entire vector w, but for now let's assume that we'll place an independent Gaussian prior on each dimension of w, with prior mean zero and prior variance σ_0², such that

log p(w) = -(1/(2σ_0²)) Σ_{j=1}^d w_j² + const.

This means that for each dimension j of w, we have w_j ~ N(0, σ_0²); equivalently, w ~ N(0, σ_0² I), that is, w is distributed according to a d-dimensional multivariate Gaussian. Combining this prior with the likelihood, we get

log p(w|D) = -(1/(2σ²)) Σ_{i=1}^N (y_i - x_i^T w)² - (1/(2σ_0²)) Σ_{j=1}^d w_j² + const.

From the form of this expression, we can see that the posterior is also Gaussian. Just like before, we can compute the derivative of this quantity and set it to zero to determine the optimal weights:

d/dw [ -(1/(2σ²)) Σ_{i=1}^N (y_i - x_i^T w)² - (1/(2σ_0²)) Σ_{j=1}^d w_j² ] = (1/σ²) Σ_{i=1}^N x_i (y_i - x_i^T w) - (1/σ_0²) w = 0.
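As a sanity check on this derivation (my own addition, not part of the original notes), the sketch below evaluates the log posterior above, up to its constant, on random data and compares the analytic gradient (1/σ²) Σ_i x_i (y_i − x_i^T w) − (1/σ_0²) w against a finite-difference estimate. The data and the values of σ² and σ_0² are arbitrary.

import numpy as np

rng = np.random.default_rng(1)
N, d = 20, 3
X = rng.standard_normal((N, d))
Y = rng.standard_normal(N)
sigma2, sigma02 = 0.5, 2.0   # noise variance sigma^2 and prior variance sigma_0^2 (arbitrary)

def log_posterior(w):
    # log p(w | D) up to its additive constant
    return -np.sum((Y - X @ w) ** 2) / (2 * sigma2) - np.sum(w ** 2) / (2 * sigma02)

def grad_log_posterior(w):
    # analytic gradient: (1/sigma^2) X^T (Y - X w) - (1/sigma_0^2) w
    return X.T @ (Y - X @ w) / sigma2 - w / sigma02

w = rng.standard_normal(d)
eps = 1e-6
finite_diff = np.array([(log_posterior(w + eps * e) - log_posterior(w - eps * e)) / (2 * eps)
                        for e in np.eye(d)])
print(np.allclose(finite_diff, grad_log_posterior(w), atol=1e-4))   # expect True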

Rewriting this in matrix notation like before, we get

(1/σ²) X^T (Y - Xw) - (1/σ_0²) w = 0
(1/σ²) X^T Y - (1/σ²) X^T X w - (1/σ_0²) w = 0
X^T Y = X^T X w + (σ²/σ_0²) w
X^T Y = (X^T X + (σ²/σ_0²) I) w
(X^T X + (σ²/σ_0²) I)^{-1} X^T Y = w.

Our solution is therefore given by w = (X^T X + (σ²/σ_0²) I)^{-1} X^T Y. The only change from standard linear regression is that we've added the term (σ²/σ_0²) I to the matrix that we are inverting. In practice, we will often use a single parameter λ = σ²/σ_0², so that the solution has the form w = (X^T X + λI)^{-1} X^T Y. We will discuss how to choose the parameter λ in the next section. This method corresponds to maximum a posteriori (MAP) estimation of the optimal parameters w under the objective log p(w|D), and it is often referred to as ridge regression. But we can see here that it is simply the natural consequence of imposing a zero-mean Gaussian prior on the parameters w.

In applying ridge regression in practice, we might also impose a different prior variance σ_{0,j}² on each dimension w_j of w. For example, if we use features h(x_i) (recall that the math is exactly the same if we use features!), we might have a constant feature that is equal to 1, called the bias feature. We often do not want to regularize the weight on this feature, to allow for whatever bias best fits the data, so we might set its regularization weight to 0 (which corresponds to σ_{0,j}² = ∞). In the case where we use different weights on different features, the solution becomes w = (X^T X + Λ)^{-1} X^T Y, where Λ is a diagonal matrix of regularization weights.
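Here is a minimal sketch of the closed-form ridge solution above, written with NumPy (the data and λ values are placeholders I made up); it also shows the Λ variant that leaves a bias feature unregularized.

import numpy as np

def ridge(X, Y, lam):
    # w = (X^T X + lam * I)^{-1} X^T Y, computed with solve() rather than an explicit inverse
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

def ridge_diag(X, Y, lam_vec):
    # w = (X^T X + Lambda)^{-1} X^T Y with a per-feature diagonal Lambda
    return np.linalg.solve(X.T @ X + np.diag(lam_vec), X.T @ Y)

# toy usage (all numbers made up); the first column of H is the bias feature
rng = np.random.default_rng(0)
N = 50
H = np.hstack([np.ones((N, 1)), rng.standard_normal((N, 3))])
w_true = np.array([2.0, 1.0, -1.0, 0.5])
Y = H @ w_true + 0.1 * rng.standard_normal(N)

print(ridge(H, Y, lam=1e-2))
print(ridge_diag(H, Y, np.array([0.0, 1e-2, 1e-2, 1e-2])))   # leave the bias weight unregularized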

3 LASSO

This is covered in the slides.

4 Choosing the regularization amount

The value λ (or Λ) in ridge regression is a hyperparameter: it is not learned by our learning algorithm, but rather must be specified in advance. Hyperparameters can be set by hand using domain knowledge, or they can be optimized by using a hold-out set.

First, let's try to understand how the setting of λ changes the weights that we get. As λ → 0, ridge regression turns into ordinary linear regression (and our prior approaches the uniform prior). That means that we will fit the training data better (our training error will decrease), but we might experience more overfitting if we have too many parameters and too little data (our test error might increase). As λ → ∞, the w_j² terms in the objective dominate, and w → 0. All of our weights zero out, and we just get a constant prediction of zero (or a constant if we don't regularize the bias term). In this case, we are least likely to see overfitting, but we will also experience very high training and test error, because we're essentially ignoring the input x_i in making our predictions. For best results, we need to find a value of λ that gives the model enough expressive power to get low training and test error, but not so much expressive power as to overfit to the training data.

In practice, even guessing a very low value of λ, such as λ = 10^{-4}, can already help a lot. For example, if X^T X is nearly rank-deficient (that is, it has eigenvalues close to zero, making it very hard to invert), adding λI to it before inversion can make it much easier to invert, making linear regression much more stable. It also quickly removes the really pathological solutions that have coefficients in the millions or billions. So a quick fix to an ill-conditioned linear regression problem that is easy and often effective is to choose λ = 10^{-4}.

However, if we want to find a better setting of λ to get the best performance, we need to use our hold-out data. This can be done either manually or automatically. In the manual approach, we simply try a few different settings of λ that we think are reasonable, fit to the training data, and test how well we do on the hold-out data. We then take the best one. The automated approach consists of automating this process. Performance on the hold-out set does not necessarily follow a unimodal curve, but in practice treating it as unimodal is often good enough to find a good value, so we could simply choose a lower and an upper bound for λ and then perform a search, recursively updating the lower bound λ_0 and upper bound λ_1 to find the best value of λ. Letting E_holdout(λ) denote the error on the hold-out set for the optimal solution for hyperparameter λ, the search might look like this (one good choice for the constant ρ is based on the golden ratio: ρ = (3 − √5)/2 ≈ 0.382):

Algorithm 1 Hyperparameter search
1: start with minimum λ_0 and maximum λ_1
2: while not converged (e.g. |E_holdout(λ_1) − E_holdout(λ_0)| > ε) do
3:   λ'_0 ← λ_0 + ρ(λ_1 − λ_0)
4:   λ'_1 ← λ_1 − ρ(λ_1 − λ_0)
5:   if E_holdout(λ'_1) < E_holdout(λ'_0) then
6:     λ_0 ← λ'_0
7:   else
8:     λ_1 ← λ'_1
9:   end if
10: end while
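Below is a small Python rendering of this search (my own sketch, not code from the course). It takes a function that maps λ to hold-out error and searches over log λ, which is a common but assumed choice; the data and train/hold-out split in the usage example are made up.

import numpy as np

RHO = (3 - np.sqrt(5)) / 2   # golden-ratio-based shrink factor, roughly 0.382

def hyperparameter_search(holdout_error, lam_lo, lam_hi, eps=1e-3):
    # Golden-section-style search over log(lambda); assumes the hold-out error is
    # roughly unimodal in log(lambda). Errors are recomputed each iteration for clarity.
    a, b = np.log(lam_lo), np.log(lam_hi)
    while abs(holdout_error(np.exp(b)) - holdout_error(np.exp(a))) > eps:
        x1 = a + RHO * (b - a)   # lower interior point
        x2 = b - RHO * (b - a)   # upper interior point
        if holdout_error(np.exp(x2)) < holdout_error(np.exp(x1)):
            a = x1               # the minimum lies in [x1, b]
        else:
            b = x2               # the minimum lies in [a, x2]
    return np.exp((a + b) / 2)

# toy usage: ridge regression on synthetic data with a 70/30 train/hold-out split (made up)
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
Y = X @ rng.standard_normal(10) + 0.5 * rng.standard_normal(100)
Xtr, Ytr, Xho, Yho = X[:70], Y[:70], X[70:], Y[70:]

def holdout_error(lam):
    w = np.linalg.solve(Xtr.T @ Xtr + lam * np.eye(10), Xtr.T @ Ytr)
    return np.mean((Xho @ w - Yho) ** 2)

print(hyperparameter_search(holdout_error, 1e-6, 1e2))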

5 K-fold cross-validation

Using a hold-out set to manually or automatically optimize hyperparameters such as λ is reasonably effective, but it requires us to carve out a large enough hold-out set from our data to provide an accurate estimate of the generalization error of our model. This means we have less data to use for actually fitting the model. One idea that lets us reduce the size of the hold-out set and still get a good estimate of the generalization error for optimizing hyperparameters is to use K-fold cross-validation. In this approach, we partition the dataset into K folds of equal size, each one somewhat smaller than an ordinary hold-out set. When we want to evaluate E_holdout(λ), we evaluate it as the average of K separate errors:

E_holdout(λ) = (1/K) Σ_{k=1}^K E_holdout,k(λ),

where E_holdout,k(λ) is the error we get by testing on the k-th fold a model that is trained on all of the other folds (with hyperparameter λ). Since we average together the hold-out errors on many folds to get E_holdout(λ), each fold can be smaller than a standard hold-out set. In fact, if we set K = N (the total size of our dataset), we train N models, and test each of them on just one datapoint. This is called hold-one-out cross-validation. It is computationally expensive, but involves the least loss of data to a hold-out set.
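To make the averaging concrete, here is a short illustrative sketch of K-fold cross-validation for choosing λ in ridge regression (again my own example, with made-up data; the fold split is a simple contiguous partition).

import numpy as np

def kfold_cv_error(X, Y, lam, K=5):
    # average of K hold-out errors: for each fold k, train ridge on the other folds and test on fold k
    N, d = X.shape
    folds = np.array_split(np.arange(N), K)
    errs = []
    for k in range(K):
        test_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(K) if j != k])
        Xtr, Ytr = X[train_idx], Y[train_idx]
        w = np.linalg.solve(Xtr.T @ Xtr + lam * np.eye(d), Xtr.T @ Ytr)
        errs.append(np.mean((X[test_idx] @ w - Y[test_idx]) ** 2))
    return np.mean(errs)

# toy usage: compare a few lambda values on synthetic data (all values made up)
rng = np.random.default_rng(0)
X = rng.standard_normal((60, 8))
Y = X @ rng.standard_normal(8) + 0.3 * rng.standard_normal(60)
for lam in [1e-4, 1e-2, 1.0, 1e2]:
    print(lam, kfold_cv_error(X, Y, lam))

Setting K equal to N in this sketch would recover hold-one-out cross-validation, at the cost of fitting N separate models.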