CSC 411 Lecture 6: Linear Regression

Roger Grosse, Amir-massoud Farahmand, and Juan Carrasquilla
University of Toronto

A Timely XKCD

Overview

So far, we've talked about procedures for learning: KNN, decision trees, bagging, boosting. For the remainder of this course, we'll take a more modular approach:
- choose a model describing the relationships between the variables of interest
- define a loss function quantifying how bad the fit to the data is
- choose a regularizer saying how much we prefer different candidate explanations
- fit the model, e.g. using an optimization algorithm
By mixing and matching these modular components, your ML skills become combinatorially more powerful!

Problem Setup

Want to predict a scalar t as a function of a scalar x.
Given a dataset of pairs $\{(x^{(i)}, t^{(i)})\}_{i=1}^N$.
The $x^{(i)}$ are called inputs, and the $t^{(i)}$ are called targets.

Problem Setup

Model: y is a linear function of x:
    $y = wx + b$
- y is the prediction
- w is the weight
- b is the bias
- w and b together are the parameters
Settings of the parameters are called hypotheses.

Problem Setup

Loss function: squared error (says how bad the fit is)
    $\mathcal{L}(y, t) = \tfrac{1}{2}(y - t)^2$
$y - t$ is the residual, and we want to make this small in magnitude.
The 1/2 factor is just to make the calculations convenient.
Cost function: loss function averaged over all training examples
    $\mathcal{J}(w, b) = \frac{1}{2N} \sum_{i=1}^N \left(y^{(i)} - t^{(i)}\right)^2 = \frac{1}{2N} \sum_{i=1}^N \left(wx^{(i)} + b - t^{(i)}\right)^2$

Problem Setup

[Figure not transcribed.]

Problem Setup

Suppose we have multiple inputs $x_1, \ldots, x_D$. This is referred to as multivariable regression.
This is no different from the single-input case, just harder to visualize.
Linear model:
    $y = \sum_j w_j x_j + b$

Vectorization

Computing the prediction using a for loop works, but for-loops in Python are slow, so we vectorize algorithms by expressing them in terms of vectors and matrices:
    $\mathbf{w} = (w_1, \ldots, w_D)^\top \qquad \mathbf{x} = (x_1, \ldots, x_D)^\top$
This is simpler and much faster:
    $y = \mathbf{w}^\top \mathbf{x} + b$
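The for-loop code shown on the slide is not transcribed; the following is a minimal sketch of the contrast, assuming NumPy arrays and illustrative function names (not the lecture's own code):

```python
import numpy as np

def predict_loop(w, x, b):
    """Prediction y = sum_j w_j x_j + b with an explicit Python for loop (slow)."""
    y = b
    for j in range(len(w)):
        y += w[j] * x[j]
    return y

def predict_vectorized(w, x, b):
    """The same prediction as a dot product, y = w^T x + b (fast)."""
    return np.dot(w, x) + b

# Tiny check with made-up numbers: both give the same answer.
w, x, b = np.array([0.5, -1.0, 2.0]), np.array([1.0, 2.0, 3.0]), 0.1
assert np.isclose(predict_loop(w, x, b), predict_vectorized(w, x, b))
```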

Vectorization

Why vectorize?
- The equations, and the code, will be simpler and more readable. Gets rid of dummy variables/indices!
- Vectorized code is much faster:
  - cuts down on Python interpreter overhead
  - uses highly optimized linear algebra libraries
  - matrix multiplication is very fast on a Graphics Processing Unit (GPU)

Vectorization

We can take this a step further. Organize all the training examples into the design matrix $\mathbf{X}$ with one row per training example, and all the targets into the target vector $\mathbf{t}$.
Computing the predictions for the whole dataset:
    $\mathbf{X}\mathbf{w} + b\mathbf{1} = \begin{pmatrix} \mathbf{w}^\top \mathbf{x}^{(1)} + b \\ \vdots \\ \mathbf{w}^\top \mathbf{x}^{(N)} + b \end{pmatrix} = \begin{pmatrix} y^{(1)} \\ \vdots \\ y^{(N)} \end{pmatrix} = \mathbf{y}$

Vectorization

Computing the squared error cost across the whole dataset:
    $\mathbf{y} = \mathbf{X}\mathbf{w} + b\mathbf{1}$
    $\mathcal{J} = \frac{1}{2N} \|\mathbf{y} - \mathbf{t}\|^2$
In Python:
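The Python snippet on the slide is not transcribed; a minimal NumPy sketch of the vectorized cost, assuming X has shape (N, D), w has shape (D,), and t has shape (N,) (the function name is illustrative), might look like:

```python
import numpy as np

def cost(X, w, b, t):
    """Squared error cost J = 1/(2N) * ||y - t||^2 over the whole dataset."""
    N = X.shape[0]
    y = X @ w + b                     # y = Xw + b1, with b broadcast over examples
    return np.sum((y - t) ** 2) / (2 * N)
```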

Solving the optimization problem

We defined a cost function. This is what we'd like to minimize.
Recall from calculus class: the minimum of a smooth function (if it exists) occurs at a critical point, i.e. a point where the derivative is zero.
Multivariate generalization: set the partial derivatives to zero. We call this the direct solution.

Direct solution

Partial derivatives: derivatives of a multivariate function with respect to one of its arguments.
    $\frac{\partial}{\partial x_1} f(x_1, x_2) = \lim_{h \to 0} \frac{f(x_1 + h, x_2) - f(x_1, x_2)}{h}$
To compute, take the single-variable derivative, pretending the other arguments are constant.
Example: partial derivatives of the prediction y
    $\frac{\partial y}{\partial w_j} = \frac{\partial}{\partial w_j} \left[ \sum_{j'} w_{j'} x_{j'} + b \right] = x_j$
    $\frac{\partial y}{\partial b} = \frac{\partial}{\partial b} \left[ \sum_{j'} w_{j'} x_{j'} + b \right] = 1$

Direct solution

Chain rule for derivatives:
    $\frac{\partial \mathcal{L}}{\partial w_j} = \frac{d\mathcal{L}}{dy} \frac{\partial y}{\partial w_j} = \frac{d}{dy} \left[ \tfrac{1}{2}(y - t)^2 \right] x_j = (y - t)\, x_j$
    $\frac{\partial \mathcal{L}}{\partial b} = y - t$
Cost derivatives (average over data points):
    $\frac{\partial \mathcal{J}}{\partial w_j} = \frac{1}{N} \sum_{i=1}^N \left(y^{(i)} - t^{(i)}\right) x_j^{(i)}$
    $\frac{\partial \mathcal{J}}{\partial b} = \frac{1}{N} \sum_{i=1}^N \left(y^{(i)} - t^{(i)}\right)$
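As a sanity check on these gradient formulas, one can compare them against a finite-difference approximation. A minimal NumPy sketch (the data and variable names are made up for illustration, not taken from the lecture):

```python
import numpy as np

def cost(X, w, b, t):
    """Average squared error cost J(w, b)."""
    N = X.shape[0]
    return np.sum((X @ w + b - t) ** 2) / (2 * N)

def grads(X, w, b, t):
    """Analytic gradients dJ/dw and dJ/db from the formulas above."""
    N = X.shape[0]
    err = X @ w + b - t                # residuals y - t
    return (X.T @ err) / N, np.mean(err)

# Check dJ/dw_0 against a forward difference on made-up data.
rng = np.random.default_rng(0)
X, t = rng.normal(size=(5, 3)), rng.normal(size=5)
w, b, eps = rng.normal(size=3), 0.0, 1e-6
dJ_dw, _ = grads(X, w, b, t)
w_plus = w.copy()
w_plus[0] += eps
numeric = (cost(X, w_plus, b, t) - cost(X, w, b, t)) / eps
assert np.isclose(dJ_dw[0], numeric, atol=1e-4)
```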

Direct solution

The minimum must occur at a point where the partial derivatives are zero:
    $\frac{\partial \mathcal{J}}{\partial w_j} = 0 \qquad \frac{\partial \mathcal{J}}{\partial b} = 0$
If $\partial \mathcal{J} / \partial w_j \neq 0$, you could reduce the cost by changing $w_j$.
This turns out to give a system of linear equations, which we can solve efficiently. Full derivation in the readings.
Optimal weights:
    $\mathbf{w} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{t}$
Linear regression is one of only a handful of models in this course that permit a direct solution.
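A minimal NumPy sketch of the direct solution, under the assumption that the bias is absorbed by appending a column of ones to X; the normal equations are solved with np.linalg.solve rather than forming the inverse explicitly (function and variable names are illustrative):

```python
import numpy as np

def fit_direct(X, t):
    """Least squares via the normal equations: solve (X^T X) w = X^T t."""
    X1 = np.hstack([X, np.ones((X.shape[0], 1))])   # append a bias column of ones
    w_full = np.linalg.solve(X1.T @ X1, X1.T @ t)   # same solution as (X^T X)^-1 X^T t
    return w_full[:-1], w_full[-1]                  # weights w and bias b
```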

Gradient Descent

Now let's see a second way to minimize the cost function, one which is more broadly applicable: gradient descent.
Gradient descent is an iterative algorithm, which means we apply an update repeatedly until some criterion is met.
We initialize the weights to something reasonable (e.g. all zeros) and repeatedly adjust them in the direction of steepest descent.

Gradient descent

Observe:
- if $\partial \mathcal{J} / \partial w_j > 0$, then increasing $w_j$ increases $\mathcal{J}$.
- if $\partial \mathcal{J} / \partial w_j < 0$, then increasing $w_j$ decreases $\mathcal{J}$.
The following update decreases the cost function:
    $w_j \leftarrow w_j - \alpha \frac{\partial \mathcal{J}}{\partial w_j} = w_j - \frac{\alpha}{N} \sum_{i=1}^N \left(y^{(i)} - t^{(i)}\right) x_j^{(i)}$
α is a learning rate. The larger it is, the faster w changes.
We'll see later how to tune the learning rate, but values are typically small, e.g. 0.01 or 0.0001.

Gradient descent

This gets its name from the gradient:
    $\frac{\partial \mathcal{J}}{\partial \mathbf{w}} = \begin{pmatrix} \partial \mathcal{J} / \partial w_1 \\ \vdots \\ \partial \mathcal{J} / \partial w_D \end{pmatrix}$
This is the direction of fastest increase in $\mathcal{J}$.
Update rule in vector form:
    $\mathbf{w} \leftarrow \mathbf{w} - \alpha \frac{\partial \mathcal{J}}{\partial \mathbf{w}} = \mathbf{w} - \frac{\alpha}{N} \sum_{i=1}^N \left(y^{(i)} - t^{(i)}\right) \mathbf{x}^{(i)}$
Hence, gradient descent updates the weights in the direction of fastest decrease.
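Applying this update repeatedly gives the full training loop; a minimal NumPy sketch (the learning rate and iteration count are illustrative defaults, not values prescribed by the lecture):

```python
import numpy as np

def fit_gd(X, t, alpha=0.01, num_iters=1000):
    """Full-batch gradient descent on the squared error cost."""
    N, D = X.shape
    w, b = np.zeros(D), 0.0                # initialize to something reasonable
    for _ in range(num_iters):
        err = X @ w + b - t                # residuals y - t
        w -= alpha * (X.T @ err) / N       # w <- w - alpha * dJ/dw
        b -= alpha * np.mean(err)          # b <- b - alpha * dJ/db
    return w, b
```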

Gradient descent

Visualization: http://www.cs.toronto.edu/~guerzhoy/321/lec/w01/linear_regression.pdf#page=21

Gradient descent

Why gradient descent, if we can find the optimum directly?
- GD can be applied to a much broader set of models.
- GD can be easier to implement than direct solutions, especially with automatic differentiation software.
- For regression in high-dimensional spaces, GD is more efficient than the direct solution (matrix inversion is an $O(D^3)$ algorithm).

Feature mappings

Suppose we want to model the following data. [Figure: training data t vs. x, from Pattern Recognition and Machine Learning, Christopher Bishop.]
One option: fit a low-degree polynomial; this is known as polynomial regression
    $y = w_3 x^3 + w_2 x^2 + w_1 x + w_0$
Do we need to derive a whole new algorithm?

Feature mappings

We get polynomial regression for free! Define the feature map
    $\psi(x) = \begin{pmatrix} 1 \\ x \\ x^2 \\ x^3 \end{pmatrix}$
Polynomial regression model:
    $y = \mathbf{w}^\top \psi(x)$
All of the derivations and algorithms so far in this lecture remain exactly the same!
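A minimal sketch of the feature map in NumPy, showing that the fitting code itself is unchanged; the helper name and the toy sinusoidal data are illustrative assumptions, not the lecture's:

```python
import numpy as np

def poly_features(x, degree=3):
    """Map a 1-D array x to features [1, x, x^2, ..., x^degree] per example."""
    return np.vstack([x ** d for d in range(degree + 1)]).T

# Toy 1-D dataset; any linear regression fit can be reused on the mapped features.
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + 0.1 * np.random.randn(10)
Psi = poly_features(x, degree=3)
w = np.linalg.solve(Psi.T @ Psi, Psi.T @ t)   # direct solution on psi(x)
y = Psi @ w                                   # predictions y = w^T psi(x)
```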

Fitting polynomials

$y = w_0$   [Figure: M = 0 fit, from Pattern Recognition and Machine Learning, Christopher Bishop.]

$y = w_0 + w_1 x$   [Figure: M = 1 fit.]

$y = w_0 + w_1 x + w_2 x^2 + w_3 x^3$   [Figure: M = 3 fit.]

$y = w_0 + w_1 x + w_2 x^2 + w_3 x^3 + \ldots + w_9 x^9$   [Figure: M = 9 fit.]

Generalization

Underfitting: the model is too simple, so it does not fit the data. [Figure: M = 0 fit.]
Overfitting: the model is too complex, so it fits the training data perfectly but does not generalize. [Figure: M = 9 fit.]

Generalization

Training and test error as a function of the number of training examples and the number of parameters. [Figures not transcribed.]

Regularization

The degree of the polynomial is a hyperparameter, just like k in KNN. We can tune it using a validation set.
But restricting the size of the model is a crude solution, since you'll never be able to learn a more complex model, even if the data support it.
Another approach: keep the model large, but regularize it.
Regularizer: a function that quantifies how much we prefer one hypothesis vs. another.

L2 Regularization

Observation: polynomials that overfit often have large coefficients.
    $y = 0.1x^5 + 0.2x^4 + 0.75x^3 - x^2 - 2x + 2$
    $y = -7.2x^5 + 10.4x^4 + 24.5x^3 - 37.9x^2 - 3.6x + 12$
So let's try to keep the coefficients small.

L2 Regularization

Another reason we want weights to be small:
Suppose inputs $x_1$ and $x_2$ are nearly identical for all training examples. The following two hypotheses make nearly the same predictions:
    $\mathbf{w} = \begin{pmatrix} 1 \\ 1 \end{pmatrix} \qquad \mathbf{w} = \begin{pmatrix} -9 \\ 11 \end{pmatrix}$
But the second hypothesis might make weird predictions if the test distribution is slightly different (e.g. $x_1$ and $x_2$ match less closely).

L2 Regularization

We can encourage the weights to be small by choosing as our regularizer the L2 penalty:
    $\mathcal{R}(\mathbf{w}) = \tfrac{1}{2}\|\mathbf{w}\|^2 = \tfrac{1}{2} \sum_j w_j^2$
Note: to be pedantic, the L2 norm is the Euclidean distance, so we're really regularizing the squared L2 norm.
The regularized cost function makes a tradeoff between fit to the data and the norm of the weights:
    $\mathcal{J}_{\text{reg}} = \mathcal{J} + \lambda \mathcal{R} = \mathcal{J} + \frac{\lambda}{2} \sum_j w_j^2$
Here, λ is a hyperparameter that we can tune using a validation set.

L2 Regularization

The geometric picture: [figure not transcribed]

L2 Regularization

Recall the gradient descent update:
    $\mathbf{w} \leftarrow \mathbf{w} - \alpha \frac{\partial \mathcal{J}}{\partial \mathbf{w}}$
The gradient descent update of the regularized cost has an interesting interpretation as weight decay:
    $\mathbf{w} \leftarrow \mathbf{w} - \alpha \left( \frac{\partial \mathcal{J}}{\partial \mathbf{w}} + \lambda \frac{\partial \mathcal{R}}{\partial \mathbf{w}} \right) = \mathbf{w} - \alpha \left( \frac{\partial \mathcal{J}}{\partial \mathbf{w}} + \lambda \mathbf{w} \right) = (1 - \alpha\lambda)\,\mathbf{w} - \alpha \frac{\partial \mathcal{J}}{\partial \mathbf{w}}$
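A minimal NumPy sketch of the weight decay update inside the gradient descent loop; the alpha and lambda defaults are illustrative, and the bias is left unregularized (a common, but not mandatory, choice):

```python
import numpy as np

def fit_gd_l2(X, t, alpha=0.01, lam=0.1, num_iters=1000):
    """Gradient descent on the L2-regularized cost, written as weight decay."""
    N, D = X.shape
    w, b = np.zeros(D), 0.0
    for _ in range(num_iters):
        err = X @ w + b - t
        # Shrink the weights by (1 - alpha*lam), then take the usual gradient step.
        w = (1 - alpha * lam) * w - alpha * (X.T @ err) / N
        b -= alpha * np.mean(err)
    return w, b
```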

L1 vs. L2 Regularization

The L1 norm, or sum of absolute values, is another regularizer that encourages weights to be exactly zero. (How can you tell?)
We can design regularizers based on whatever property we'd like to encourage. [Figure from Bishop, Pattern Recognition and Machine Learning.]

Conclusion

Linear regression exemplifies recurring themes of this course:
- choose a model and a loss function
- formulate an optimization problem
- solve the optimization problem using one of two strategies:
  - direct solution (set derivatives to zero)
  - gradient descent
- vectorize the algorithm, i.e. represent it in terms of linear algebra
- make a linear model more powerful using features
- improve generalization by adding a regularizer