Least Mean Squares Regression. Machine Learning Fall 2018

Similar documents
Least Mean Squares Regression

Least Mean Squares Regression. Machine Learning Fall 2017

Support Vector Machines: Training with Stochastic Gradient Descent. Machine Learning Fall 2017

Linear Regression. S. Sumitra

Classification with Perceptrons. Reading:

ECS171: Machine Learning

Optimization and Gradient Descent

Linear Models for Regression CS534

Linear Regression. CSL603 - Fall 2017 Narayanan C Krishnan

Linear Regression. CSL465/603 - Fall 2016 Narayanan C Krishnan

Introduction to Machine Learning Prof. Sudeshna Sarkar Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur

The perceptron learning algorithm is one of the first procedures proposed for learning in neural network models and is mostly credited to Rosenblatt.

Linear Models for Regression CS534

Error Functions & Linear Regression (1)

Comparison of Modern Stochastic Optimization Algorithms

Linear Discrimination Functions

CSC321 Lecture 2: Linear Regression

Neural Networks: Backpropagation

Regression with Numerical Optimization. Logistic

CS545 Contents XVI. l Adaptive Control. l Reading Assignment for Next Class

Simple Techniques for Improving SGD. CS6787 Lecture 2 Fall 2017

Linear Models for Regression CS534

CSCI567 Machine Learning (Fall 2014)

COMPUTATIONAL INTELLIGENCE (INTRODUCTION TO MACHINE LEARNING) SS16

Gradient Descent. Sargur Srihari

CSC 411 Lecture 6: Linear Regression

CSC321 Lecture 4: Learning a Classifier

COMP 551 Applied Machine Learning Lecture 2: Linear Regression

CSC2515 Winter 2015 Introduction to Machine Learning. Lecture 2: Linear regression

Learning in State-Space Reinforcement Learning CIS 32

Overfitting, Bias / Variance Analysis

CSC321 Lecture 8: Optimization

Logistic Regression Trained with Different Loss Functions. Discussion

CS545 Contents XVI. Adaptive Control. Reading Assignment for Next Class. u Model Reference Adaptive Control. u Self-Tuning Regulators

The Perceptron Algorithm 1

Fundamentals of Machine Learning

Reading Group on Deep Learning Session 1

Kernelized Perceptron Support Vector Machines

Lecture 2 - Learning Binary & Multi-class Classifiers from Labelled Training Data

CS260: Machine Learning Algorithms

Machine Learning CS 4900/5900. Lecture 03. Razvan C. Bunescu School of Electrical Engineering and Computer Science

Value Function Methods. CS : Deep Reinforcement Learning Sergey Levine

CSC321 Lecture 4: Learning a Classifier

Adaptive Filtering Part II

Simple Neural Nets For Pattern Classification

CSC 411: Lecture 02: Linear Regression

Stochastic Gradient Descent. CS 584: Big Data Analytics

PMR5406 Redes Neurais e Lógica Fuzzy Aula 3 Single Layer Percetron

The classifier. Theorem. where the min is over all possible classifiers. To calculate the Bayes classifier/bayes risk, we need to know

The classifier. Linear discriminant analysis (LDA) Example. Challenges for LDA

Linear smoother. ŷ = S y. where s ij = s ij (x) e.g. s ij = diag(l i (x))

CSE 417T: Introduction to Machine Learning. Lecture 11: Review. Henry Chai 10/02/18

Sequence Modelling with Features: Linear-Chain Conditional Random Fields. COMP-599 Oct 6, 2015

1 Review of Winnow Algorithm

Error Functions & Linear Regression (2)

Machine Learning. Lecture 04: Logistic and Softmax Regression. Nevin L. Zhang

Optimization Tutorial 1. Basic Gradient Descent

Midterm. Introduction to Machine Learning. CS 189 Spring Please do not open the exam before you are instructed to do so.

Lecture 2: Linear regression

Logistic Regression. COMP 527 Danushka Bollegala

Answers Machine Learning Exercises 4

Bayesian Machine Learning

In the Name of God. Lecture 11: Single Layer Perceptrons

DATA MINING AND MACHINE LEARNING

Neural Networks (Part 1) Goals for the lecture

Linear Models for Regression

Linear Regression (continued)

CSC2541 Lecture 5 Natural Gradient

Linear discriminant functions

Series 6, May 14th, 2018 (EM Algorithm and Semi-Supervised Learning)

Introduction to Natural Computation. Lecture 9. Multilayer Perceptrons and Backpropagation. Peter Lewis

Artifical Neural Networks

HOMEWORK #4: LOGISTIC REGRESSION

Machine Learning

Linear Classification. CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington

Clustering. Professor Ameet Talwalkar. Professor Ameet Talwalkar CS260 Machine Learning Algorithms March 8, / 26

CPSC 340: Machine Learning and Data Mining

STA 4273H: Statistical Machine Learning

Lab 5: 16 th April Exercises on Neural Networks

min f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term;

MIDTERM SOLUTIONS: FALL 2012 CS 6375 INSTRUCTOR: VIBHAV GOGATE

Sample questions for Fundamentals of Machine Learning 2018

Introduction to Logistic Regression and Support Vector Machine

Linear Models for Classification

Midterm, Fall 2003

Logistic Regression Logistic

Introduction to Machine Learning HW6

Neural networks and optimization

Nonlinear Optimization Methods for Machine Learning

EE 381V: Large Scale Optimization Fall Lecture 24 April 11

Learning with Momentum, Conjugate Gradient Learning

Regression and Classification" with Linear Models" CMPSCI 383 Nov 15, 2011!

COMP 551 Applied Machine Learning Lecture 2: Linear regression

CE213 Artificial Intelligence Lecture 14

ECE521 lecture 4: 19 January Optimization, MLE, regularization

J. Sadeghi E. Patelli M. de Angelis

2015 Todd Neller. A.I.M.A. text figures 1995 Prentice Hall. Used by permission. Neural Networks. Todd W. Neller

CSC321 Lecture 7: Optimization

HOMEWORK #4: LOGISTIC REGRESSION

Multilayer Neural Networks. (sometimes called Multilayer Perceptrons or MLPs)

Transcription:

Least Mean Squares Regression Machine Learning Fall 2018 1

Where are we? Least Squares Method for regression Examples The LMS objective Gradient descent Incremental/stochastic gradient descent Exercises 2

What s the mileage? Suppose we want to predict the mileage of a car from its weight and age Weight (x 100 lb) x 1 Age (years) x 2 Mileage 31.5 6 21 What we want: A function that can predict mileage using x 1 and x 2 36.2 2 25 43.1 0 18 27.6 2 30 3

Linear regression: The strategy Predicting continuous values using a linear model Assumption: The output is a linear function of the inputs Mileage = w 0 + w 1 x 1 + w 2 x 2 Learning: Using the training data to find the best possible value of w Prediction: Given the values for x 1, x 2 for a new car, use the learned w to predict the Mileage for the new car 4

Linear regression: The strategy Predicting continuous values using a linear model Assumption: The output is a linear function of the inputs Mileage = w 0 + w 1 x 1 + w 2 x 2 Parameters of the model Also called weights Collectively, a vector Learning: Using the training data to find the best possible value of w Prediction: Given the values for x 1, x 2 for a new car, use the learned w to predict the Mileage for the new car 5

Linear regression: The strategy Inputs are vectors: x 2 < d Outputs are real numbers: y 2 < We have a training set D = { (x 1, y 1 ), (x 2, y 2 ),!, (x m, y m )} We want to approximate y as y = f w (x) = w 1 + w 2 x 2 +! + w n x n = w T x For simplicity, we will assume that x 1 is always 1. That is x = [1 x 2 x 3... x d ] T This lets makes notation easier w is the learned weight vector in < d 6

Examples y One dimensional input x 1 7

Examples y Predict using y = w 1 + w 2 x 2 One dimensional input x 1 8

Examples y Predict using y = w 1 + w 2 x 2 The linear function is not our only choice. We could have tried to fit the data as another polynomial One dimensional input x 1 9

Examples y Predict using y = w 1 + w 2 x 2 The linear function is not our only choice. We could have tried to fit the data as another polynomial One dimensional input x 1 Two dimensional input Predict using y = w 1 + w 2 x 2 +w 3 x 3 10

What is the best weight vector? Question: How do we know which weight vector is the best one for a training set? For an input (x i, y i ) in the training set, the cost of a mistake is Define the cost (or loss) for a particular weight vector w to be Sum of squared costs over the training set One strategy for learning: Find the w with least cost on this data 11

What is the best weight vector? Question: How do we know which weight vector is the best one for a training set? For an input (x i, y i ) in the training set, the cost of a mistake is Define the cost (or loss) for a particular weight vector w to be Sum of squared costs over the training set One strategy for learning: Find the w with least cost on this data 12

What is the best weight vector? Question: How do we know which weight vector is the best one for a training set? For an input (x i, y i ) in the training set, the cost of a mistake is Define the cost (or loss) for a particular weight vector w to be One strategy for learning: Find the w with least cost on this data 13

What is the best weight vector? Question: How do we know which weight vector is the best one for a training set? For an input (x i, y i ) in the training set, the cost of a mistake is Define the cost (or loss) for a particular weight vector w to be Sum of squared costs over the training set One strategy for learning: Find the w with least cost on this data 14

What is the best weight vector? Question: How do we know which weight vector is the best one for a training set? For an input (x i, y i ) in the training set, the cost of a mistake is Define the cost (or loss) for a particular weight vector w to be Sum of squared costs over the training set One strategy for learning: Find the w with least cost on this data 15

Least Mean Squares (LMS) Regression Learning: minimizing mean squared error 16

Least Mean Squares (LMS) Regression Learning: minimizing mean squared error Different strategies exist for learning by optimization Gradient descent is a popular algorithm (For this minimization objective, there is also an analytical solution) 17

We are trying to minimize Gradient descent J(w) General strategy for minimizing a function J(w) Start with an initial guess for w, say w 0 Iterate till convergence: Compute the gradient of the gradient of J at w t Update w t to get w t+1 by taking a step in the opposite direction of the gradient Intuition: The gradient is the direction of steepest increase in the function. To get to the minimum, go in the opposite direction w 18

We are trying to minimize Gradient descent J(w) General strategy for minimizing a function J(w) Start with an initial guess for w, say w 0 Iterate till convergence: Compute the gradient of the gradient of J at w t Update w t to get w t+1 by taking a step in the opposite direction of the gradient w 0 Intuition: The gradient is the direction of steepest increase in the function. To get to the minimum, go in the opposite direction w 19

We are trying to minimize Gradient descent J(w) General strategy for minimizing a function J(w) Start with an initial guess for w, say w 0 Iterate till convergence: Compute the gradient of the gradient of J at w t Update w t to get w t+1 by taking a step in the opposite direction of the gradient w 1 w 0 Intuition: The gradient is the direction of steepest increase in the function. To get to the minimum, go in the opposite direction w 20

We are trying to minimize Gradient descent J(w) General strategy for minimizing a function J(w) Start with an initial guess for w, say w 0 Iterate till convergence: Compute the gradient of the gradient of J at w t Update w t to get w t+1 by taking a step in the opposite direction of the gradient w 2 w 1 w 0 Intuition: The gradient is the direction of steepest increase in the function. To get to the minimum, go in the opposite direction w 21

We are trying to minimize Gradient descent J(w) General strategy for minimizing a function J(w) Start with an initial guess for w, say w 0 Iterate till convergence: Compute the gradient of the gradient of J at w t Update w t to get w t+1 by taking a step in the opposite direction of the gradient w 3 w 2 w 1 w 0 Intuition: The gradient is the direction of steepest increase in the function. To get to the minimum, go in the opposite direction w 22

We are trying to minimize Gradient descent J(w) General strategy for minimizing a function J(w) Start with an initial guess for w, say w 0 Iterate till convergence: Compute the gradient of the gradient of J at w t Update w t to get w t+1 by taking a step in the opposite direction of the gradient w 3 w 2 w 1 w 0 Intuition: The gradient is the direction of steepest increase in the function. To get to the minimum, go in the opposite direction w 23

We are trying to minimize Gradient descent for LMS 1. Initialize w 0 2. For t = 0, 1, 2,. 1. Compute gradient of J(w) at w t. Call it r J(w t ) 2. Update w as follows: r: Called the learning rate (For now, a small constant. We will get to this later) 24

Gradient descent for LMS We are trying to minimize 1. Initialize w 0 2. For t = 0, 1, 2,. What is the gradient of J? 1. Compute gradient of J(w) at w t. Call it r J(w t ) 2. Update w as follows: r: Called the learning rate (For now, a small constant. We will get to this later) 25

Gradient of the cost We are trying to minimize The gradient is of the form Remember that w is a vector with d elements w = [w 1, w 2, w 3,! w j,!, w d ] 26

Gradient of the cost We are trying to minimize The gradient is of the form 27

Gradient of the cost We are trying to minimize The gradient is of the form 28

Gradient of the cost We are trying to minimize The gradient is of the form 29

Gradient of the cost We are trying to minimize The gradient is of the form 30

Gradient of the cost We are trying to minimize The gradient is of the form 31

Gradient of the cost We are trying to minimize The gradient is of the form One element of the gradient vector 32

Gradient of the cost We are trying to minimize The gradient is of the form One element of the gradient vector Sum of Error Input 33

We are trying to minimize Gradient descent for LMS 1. Initialize w 0 2. For t = 0, 1, 2,. 1. Compute gradient of J(w) at w t. Call it r J(w t ) Evaluate the function for each training example to compute the error and construct the gradient vector 2. Update w as follows: r: Called the learning rate (For now, a small constant. We will get to this later) 34

Gradient descent for LMS We are trying to minimize 1. Initialize w 0 2. For t = 0, 1, 2,. 1. Compute gradient of J(w) at w t. Call it r J(w t ) Evaluate the function for each training example to compute the error and construct the gradient vector One element of r J 2. Update w as follows: r: Called the learning rate (For now, a small constant. We will get to this later) 35

Gradient descent for LMS We are trying to minimize 1. Initialize w 0 2. For t = 0, 1, 2,. 1. Compute gradient of J(w) at w t. Call it r J(w t ) Evaluate the function for each training example to compute the error and construct the gradient vector One element of r J 2. Update w as follows: r: Called the learning rate (For now, a small constant. We will get to this later) 36

Gradient descent for LMS We are trying to minimize 1. Initialize w 0 2. For t = 0, 1, 2,. 1. Compute gradient of J(w) at w t. Call it r J(w t ) Evaluate the function for each training example to compute the error and construct the gradient vector One element of r J 2. Update w as follows: (until total error is below a threshold) r: Called the learning rate (For now, a small constant. We will get to this later) 37

Gradient descent for LMS We are trying to minimize 1. Initialize w 0 2. For t = 0, 1, 2,. 1. Compute gradient of J(w) at w t. Call it r J(w t ) Evaluate the function for each training example to compute the error and construct the gradient vector One element of r J 2. Update w as follows: (until total error is below a threshold) r: Called the learning rate (For now, a small constant. We will get to this later) 38

Gradient descent for LMS We are trying to minimize 1. Initialize w 0 2. For t = 0, 1, 2,. 1. Compute gradient of J(w) at w t. Call it r J(w t ) Evaluate the function for each training example to compute the error and construct the gradient vector One element of r J 2. Update w as follows: (until total error is below a threshold) r: Called the learning rate (For now, a small constant. We will get to this later) This algorithm is guaranteed to converge to the minimum of J if r is small enough. Why? The objective J is a convex function 39

We are trying to minimize Gradient descent for LMS 1. Initialize w 0 2. For t = 0, 1, 2,. (until total error is below a threshold) 1. Compute gradient of J(w) at w t. Call it r J(w t ) Evaluate the function for each training example to compute the error and construct the gradient vector 2. Update w as follows: 40

We are trying to minimize Gradient descent for LMS 1. Initialize w 0 2. For t = 0, 1, 2,. (until total error is below a threshold) 1. Compute gradient of J(w) at w t. Call it r J(w t ) Evaluate the function for each training example to compute the error and construct the gradient vector 2. Update w as follows: The weight vector is not updated until all errors are calculated Why not make early updates to the weight vector as soon as we encounter errors instead of waiting for a full pass over the data? 41

Incremental/Stochastic gradient descent Repeat for each example (x i, y i ) Pretend that the entire training set is represented by this single example Use this example to calculate the gradient and update the model Contrast with batch gradient descent which makes one update to the weight vector for every pass over the data 42

Incremental/Stochastic gradient descent 1. Initialize w 2. For t = 0, 1, 2,. (until error below some threshold) For each training example (x i, y i ): Update w. For each element of the weight vector (w j ): 43

Incremental/Stochastic gradient descent 1. Initialize w 2. For t = 0, 1, 2,. (until error below some threshold) For each training example (x i, y i ): Update w. For each element of the weight vector (w j ): Contrast with the previous method, where the weights are updated only after all examples are processed once 44

Incremental/Stochastic gradient descent 1. Initialize w 2. For t = 0, 1, 2,. (until error below some threshold) For each training example (x i, y i ): Update w. For each element of the weight vector (w j ): This update rule is also called the Widrow-Hoff rule in the neural networks literature 45

Incremental/Stochastic gradient descent 1. Initialize w 2. For t = 0, 1, 2,. (until error below some threshold) For each training example (x i, y i ): Update w. For each element of the weight vector (w j ): This update rule is also called the Widrow-Hoff rule in the neural networks literature Online/Incremental algorithms are often preferred when the training set is very large May get close to optimum much faster than the batch version 46

Learning Rates and Convergence In the general (non-separable) case the learning rate r must decrease to zero to guarantee convergence The learning rate is called the step size. More sophisticated algorithms choose the step size automatically and converge faster Choosing a better starting point can also have impact Gradient descent and its stochastic version are very simple algorithms Yet, almost all the algorithms we will learn in the class can be traced back to gradient decent algorithms for different loss functions and different hypotheses spaces 47

Linear regression: Summary What we want: Predict a real valued output using a feature representation of the input Assumption: Output is a linear function of the inputs Learning by minimizing total cost Gradient descent and stochastic gradient descent to find the best weight vector This particular optimization can be computed directly by framing the problem as a matrix problem 48

Exercises 1. Use the gradient descent algorithms to solve the mileage problem (on paper, or write a small program) 2. LMS regression can be solved analytically. Given a dataset D = { (x 1, y 1 ), (x 2, y 2 ),!, (x m, y m )}, define matrix X and vector Y as follows: Show that the optimization problem we saw earlier is equivalent to This can be solved analytically. Show that the solution w* is Hint: You have to take the derivative of the objective with respect to the vector w and set it to zero. 49