CSC321 Lecture 2: Linear Regression

Roger Grosse

Overview
First learning algorithm of the course: linear regression.
Task: predict scalar-valued targets, e.g. stock prices (hence "regression").
Architecture: linear function of the inputs (hence "linear").
Example of recurring themes throughout the course:
choose an architecture and a loss function
formulate an optimization problem
solve the optimization problem using one of two strategies: direct solution (set derivatives to zero) or gradient descent
vectorize the algorithm, i.e. represent it in terms of linear algebra
make a linear model more powerful using features
understand how well the model generalizes

Problem Setup
Want to predict a scalar $t$ as a function of a scalar $x$.
Given a training set of pairs $\{(x^{(i)}, t^{(i)})\}_{i=1}^{N}$.

Problem Setup
Model: $y$ is a linear function of $x$: $y = wx + b$
$y$ is the prediction, $w$ is the weight, $b$ is the bias.
$w$ and $b$ together are the parameters.
Settings of the parameters are called hypotheses.

Problem Setup
Loss function: squared error
$\mathcal{L}(y, t) = \frac{1}{2}(y - t)^2$
$y - t$ is the residual, and we want to make this small in magnitude.
Cost function: the loss function averaged over all training examples
$\mathcal{E}(w, b) = \frac{1}{2N} \sum_{i=1}^{N} \left( y^{(i)} - t^{(i)} \right)^2 = \frac{1}{2N} \sum_{i=1}^{N} \left( w x^{(i)} + b - t^{(i)} \right)^2$

Problem Setup
[Figure]

Problem Setup
Suppose we have multiple inputs $x_1, \ldots, x_D$. This is referred to as multivariable regression.
This is no different than the single input case, just harder to visualize.
Linear model: $y = \sum_j w_j x_j + b$

Vectorization
Computing the prediction with an explicit for loop works, but for-loops in Python are slow, so we vectorize algorithms by expressing them in terms of vectors and matrices:
$\mathbf{w} = (w_1, \ldots, w_D)^\top \qquad \mathbf{x} = (x_1, \ldots, x_D)^\top$
This is simpler and much faster: $y = \mathbf{w}^\top \mathbf{x} + b$
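The for-loop code shown on the slide is not reproduced in this transcription; the following is a minimal sketch of the contrast it draws, assuming NumPy arrays `w` and `x` of length D (all names are illustrative):

```python
import numpy as np

def predict_loop(w, b, x):
    # Prediction with an explicit Python for-loop (slow).
    y = b
    for j in range(len(w)):
        y += w[j] * x[j]
    return y

def predict_vectorized(w, b, x):
    # The same computation as a dot product: y = w^T x + b (fast).
    return np.dot(w, x) + b
```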

Vectorization
We can take this a step further. Organize all the training examples into a matrix $\mathbf{X}$ with one row per training example, and all the targets into a vector $\mathbf{t}$.
Computing the squared error cost across the whole dataset:
$\mathbf{y} = \mathbf{X}\mathbf{w} + b \qquad \mathcal{E} = \frac{1}{2N} \| \mathbf{y} - \mathbf{t} \|^2$
In Python: see the sketch below. Example in tutorial.
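The slide's own Python snippet is not in the transcription; here is a minimal sketch of the vectorized cost, assuming `X` is the N x D matrix of inputs and `t` the length-N vector of targets:

```python
import numpy as np

def cost(w, b, X, t):
    # Predictions for the whole dataset: y = Xw + b (b is broadcast over the N rows).
    y = np.dot(X, w) + b
    # Squared-error cost: ||y - t||^2 / (2N).
    return np.sum((y - t) ** 2) / (2 * len(t))
```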

Solving the optimization problem
We defined a cost function. This is what we'd like to minimize.
Recall from calculus class: the minimum of a smooth function (if it exists) occurs at a critical point, i.e. a point where the derivative is zero.
Multivariate generalization: set the partial derivatives to zero.
We call this the direct solution.

Direct solution
Partial derivatives: derivatives of a multivariate function with respect to one of its arguments.
$\frac{\partial}{\partial x_1} f(x_1, x_2) = \lim_{h \to 0} \frac{f(x_1 + h, x_2) - f(x_1, x_2)}{h}$
To compute, take the single-variable derivative, pretending the other arguments are constant.
Example: partial derivatives of the prediction $y$:
$\frac{\partial y}{\partial w_j} = \frac{\partial}{\partial w_j} \left[ \sum_{j'} w_{j'} x_{j'} + b \right] = x_j$
$\frac{\partial y}{\partial b} = \frac{\partial}{\partial b} \left[ \sum_{j'} w_{j'} x_{j'} + b \right] = 1$

Direct solution
Chain rule for derivatives:
$\frac{\partial \mathcal{L}}{\partial w_j} = \frac{d\mathcal{L}}{dy} \frac{\partial y}{\partial w_j} = \frac{d}{dy} \left[ \frac{1}{2}(y - t)^2 \right] x_j = (y - t)\, x_j$
$\frac{\partial \mathcal{L}}{\partial b} = y - t$
We will give a more precise statement of the chain rule in a few weeks. It's actually pretty complicated.
Cost derivatives (average over data points):
$\frac{\partial \mathcal{E}}{\partial w_j} = \frac{1}{N} \sum_{i=1}^{N} \left( y^{(i)} - t^{(i)} \right) x_j^{(i)}$
$\frac{\partial \mathcal{E}}{\partial b} = \frac{1}{N} \sum_{i=1}^{N} \left( y^{(i)} - t^{(i)} \right)$
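These derivative formulas can be checked numerically. Below is a hedged sketch (not from the slides) that computes the cost gradients in vectorized form and compares one component against a finite-difference estimate; the function names and the random test data are illustrative:

```python
import numpy as np

def gradients(w, b, X, t):
    # Residuals y - t for every training example.
    r = np.dot(X, w) + b - t
    N = len(t)
    dE_dw = np.dot(X.T, r) / N   # dE/dw_j = (1/N) sum_i (y_i - t_i) x_ij
    dE_db = np.mean(r)           # dE/db   = (1/N) sum_i (y_i - t_i)
    return dE_dw, dE_db

def cost(w, b, X, t):
    return np.sum((np.dot(X, w) + b - t) ** 2) / (2 * len(t))

# Finite-difference check on the first weight.
rng = np.random.default_rng(0)
X, t = rng.normal(size=(50, 3)), rng.normal(size=50)
w, b, eps = rng.normal(size=3), 0.5, 1e-6
w_plus = w.copy()
w_plus[0] += eps
analytic = gradients(w, b, X, t)[0][0]
numeric = (cost(w_plus, b, X, t) - cost(w, b, X, t)) / eps
print(analytic, numeric)  # the two estimates should agree closely
```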

Direct solution
The minimum must occur at a point where the partial derivatives are zero:
$\frac{\partial \mathcal{E}}{\partial w_j} = 0 \qquad \frac{\partial \mathcal{E}}{\partial b} = 0$
If $\partial \mathcal{E} / \partial w_j \neq 0$, you could reduce the cost by changing $w_j$.
This turns out to give a system of linear equations, which we can solve efficiently. Full derivation in tutorial.
Optimal weights:
$\mathbf{w} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{t}$
Linear regression is one of only a handful of models in this course that permit direct solution.
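A minimal sketch of the direct solution in NumPy, under the common convention of appending a column of ones to X so that the bias is absorbed into the weight vector (this is not the tutorial's code, just one way to do it):

```python
import numpy as np

def fit_direct(X, t):
    # Append a column of ones so the bias b becomes one more weight.
    X1 = np.hstack([X, np.ones((X.shape[0], 1))])
    # Solve (X^T X) w = X^T t rather than forming the inverse explicitly.
    return np.linalg.solve(np.dot(X1.T, X1), np.dot(X1.T, t))
```

Solving the linear system is preferred over computing the matrix inverse; `np.linalg.lstsq` is the more robust choice when the system is ill-conditioned.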

Gradient descent
Now let's see a second way to minimize the cost function which is more broadly applicable: gradient descent.
Observe:
if $\partial \mathcal{E} / \partial w_j > 0$, then increasing $w_j$ increases $\mathcal{E}$.
if $\partial \mathcal{E} / \partial w_j < 0$, then increasing $w_j$ decreases $\mathcal{E}$.
The following update decreases the cost function:
$w_j \leftarrow w_j - \alpha \frac{\partial \mathcal{E}}{\partial w_j} = w_j - \frac{\alpha}{N} \sum_{i=1}^{N} \left( y^{(i)} - t^{(i)} \right) x_j^{(i)}$

Gradient descent
This gets its name from the gradient:
$\frac{\partial \mathcal{E}}{\partial \mathbf{w}} = \begin{pmatrix} \frac{\partial \mathcal{E}}{\partial w_1} \\ \vdots \\ \frac{\partial \mathcal{E}}{\partial w_D} \end{pmatrix}$
This is the direction of fastest increase in $\mathcal{E}$.
Update rule in vector form:
$\mathbf{w} \leftarrow \mathbf{w} - \alpha \frac{\partial \mathcal{E}}{\partial \mathbf{w}} = \mathbf{w} - \frac{\alpha}{N} \sum_{i=1}^{N} \left( y^{(i)} - t^{(i)} \right) \mathbf{x}^{(i)}$
Hence, gradient descent updates the weights in the direction of fastest decrease.
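A minimal sketch of this update rule as a full-batch training loop; the learning rate and number of iterations are illustrative defaults, not values from the lecture:

```python
import numpy as np

def fit_gradient_descent(X, t, alpha=0.1, num_steps=1000):
    N, D = X.shape
    w, b = np.zeros(D), 0.0
    for _ in range(num_steps):
        r = np.dot(X, w) + b - t            # residuals y - t
        w = w - alpha * np.dot(X.T, r) / N  # w <- w - alpha * dE/dw
        b = b - alpha * np.mean(r)          # b <- b - alpha * dE/db
    return w, b
```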

Gradient descent
Visualization: http://www.cs.toronto.edu/~guerzhoy/321/lec/w01/linear_regression.pdf#page=2

Gradient descent
Why gradient descent, if we can find the optimum directly?
GD can be applied to a much broader set of models.
GD can be easier to implement than direct solutions, especially with automatic differentiation software.
For regression in high-dimensional spaces, GD is more efficient than the direct solution (matrix inversion is an $\mathcal{O}(D^3)$ algorithm).

Feature mappings
Suppose we want to model the following data.
[Figure: one-dimensional dataset, $t$ versus $x$, from Pattern Recognition and Machine Learning, Christopher Bishop.]
One option: fit a low-degree polynomial; this is known as polynomial regression, e.g.
$y = w_3 x^3 + w_2 x^2 + w_1 x + w_0$
Do we need to derive a whole new algorithm?

Feature mappings
We get polynomial regression for free!
Define the feature map
$\phi(x) = \begin{pmatrix} 1 \\ x \\ x^2 \\ x^3 \end{pmatrix}$
Polynomial regression model:
$y = \mathbf{w}^\top \phi(x)$
All of the derivations and algorithms so far in this lecture remain exactly the same!
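A minimal sketch of this feature map in NumPy; `np.vander` with `increasing=True` produces the columns $(1, x, x^2, x^3)$, and the resulting matrix can be handed to the same direct-solution or gradient-descent code unchanged:

```python
import numpy as np

def polynomial_features(x, degree=3):
    # Map each scalar input x to the row (1, x, x^2, ..., x^degree).
    return np.vander(x, degree + 1, increasing=True)

# Polynomial regression is linear regression on the transformed inputs.
x = np.linspace(0.0, 1.0, 10)
Phi = polynomial_features(x, degree=3)   # shape (10, 4)
```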

Fitting polynomials
$y = w_0$ ($M = 0$)
[Figure: polynomial fit of degree $M = 0$, from Pattern Recognition and Machine Learning, Christopher Bishop.]

Fitting polynomials
$y = w_0 + w_1 x$ ($M = 1$)
[Figure: polynomial fit of degree $M = 1$, from Pattern Recognition and Machine Learning, Christopher Bishop.]

Fitting polynomials
$y = w_0 + w_1 x + w_2 x^2 + w_3 x^3$ ($M = 3$)
[Figure: polynomial fit of degree $M = 3$, from Pattern Recognition and Machine Learning, Christopher Bishop.]

Fitting polynomials
$y = w_0 + w_1 x + w_2 x^2 + w_3 x^3 + \ldots + w_9 x^9$ ($M = 9$)
[Figure: polynomial fit of degree $M = 9$, from Pattern Recognition and Machine Learning, Christopher Bishop.]

Generalization
Underfitting: the model is too simple and does not fit the data.
[Figure: $M = 0$ fit.]
Overfitting: the model is too complex; it fits the training data perfectly but does not generalize.
[Figure: $M = 9$ fit.]

Generalization
We would like our models to generalize to data they haven't seen before.
The degree of the polynomial is an example of a hyperparameter, something we can't include in the training procedure itself.
We can tune hyperparameters using validation: partition the data into three subsets.
Training set: used to train the model.
Validation set: used to evaluate the generalization error of a trained model.
Test set: used to evaluate final performance once, after the hyperparameters are chosen.
Tune the hyperparameters on the validation set, then report the performance on the test set.
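A hedged sketch of this tuning loop for the polynomial degree, fitting on the training set and scoring on the validation set; the candidate degrees and the use of `np.linalg.lstsq` are illustrative choices, not part of the lecture:

```python
import numpy as np

def choose_degree(x_train, t_train, x_val, t_val, degrees=range(10)):
    best_degree, best_err = None, np.inf
    for M in degrees:
        Phi = np.vander(x_train, M + 1, increasing=True)
        w = np.linalg.lstsq(Phi, t_train, rcond=None)[0]    # fit on the training set
        Phi_val = np.vander(x_val, M + 1, increasing=True)
        err = np.mean((np.dot(Phi_val, w) - t_val) ** 2)    # validation error
        if err < best_err:
            best_degree, best_err = M, err
    return best_degree   # final performance is then reported once on the test set
```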

Foreshadowing
Feature maps aren't a silver bullet:
It's not always easy to pick good features.
In high dimensions, polynomial expansions can get very large!
Until the last few years, a large fraction of the effort of building a good machine learning system was feature engineering.
We'll see that neural networks are able to learn nonlinear functions directly, avoiding hand-engineering of features.