Least Mean Squares Regression

Least Mean Squares Regression. Machine Learning, Spring 2018. The slides are mainly from Vivek Srikumar.

Lecture Overview
  Linear classifiers
  What functions do linear classifiers express?
  Least Squares Method for regression

Where are we?
  Linear classifiers
  What functions do linear classifiers express?
  Least Squares Method for regression
    Examples
    The LMS objective
    Gradient descent
    Stochastic gradient descent

What's the mileage?
Suppose we want to predict the mileage of a car from its weight and age.

  Weight (x 100 lb), $x_1$    Age (years), $x_2$    Mileage
  31.5                        6                     21
  36.2                        2                     25
  43.1                        0                     18
  27.6                        2                     30

What we want: A function that can predict mileage using $x_1$ and $x_2$.

Linear regression: The strategy
Predicting continuous values using a linear model.
Assumption: The output is a linear function of the inputs:
  Mileage = $w_0 + w_1 x_1 + w_2 x_2$
Here $w_0$, $w_1$, $w_2$ are the parameters of the model; collectively, they form a vector $\mathbf{w}$.
Learning: Using the training data to find the best possible value of $\mathbf{w}$.
Prediction: Given the values of $x_1$, $x_2$ for a new car, use the learned $\mathbf{w}$ to predict the mileage of the new car.

Linear regression: The strategy
Inputs are vectors: $\mathbf{x} \in \mathbb{R}^d$. Outputs are real numbers: $y \in \mathbb{R}$.
We have a training set $D = \{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_m, y_m)\}$.
For simplicity, we will assume that the first feature $x_1$ is always 1, that is, $\mathbf{x} = [1, x_2, x_3, \ldots, x_d]^T$. This makes the notation easier.
We want to approximate $y$ as
  $y = f_{\mathbf{w}}(\mathbf{x}) = w_1 + w_2 x_2 + \cdots + w_d x_d = \mathbf{w}^T \mathbf{x}$
where $\mathbf{w} \in \mathbb{R}^d$ is the learned weight vector.
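
As a rough illustration of the prediction step $y = \mathbf{w}^T \mathbf{x}$, here is a minimal Python sketch. The helper names `add_bias` and `predict` and the weight values are hypothetical, chosen only for demonstration; they are not learned from the mileage table.

```python
# Illustrative sketch: predict with a linear model y = w^T x.
# The weights below are made-up numbers, not values learned from data.

def add_bias(features):
    """Prepend the constant feature x_1 = 1 to a list of raw features."""
    return [1.0] + list(features)

def predict(w, x):
    """Return the dot product w^T x of weight vector w and input x."""
    if len(w) != len(x):
        raise ValueError("w and x must have the same length")
    return sum(w_j * x_j for w_j, x_j in zip(w, x))

# Hypothetical weights: constant term, weight coefficient, age coefficient.
w = [50.0, -0.5, -1.0]
x = add_bias([30.0, 5.0])  # a car weighing 30 (x 100 lb) that is 5 years old
y_hat = predict(w, x)      # 50 - 0.5 * 30 - 1.0 * 5
```

Folding the constant 1 into the input means the constant term is handled by the same dot product as every other weight.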

Examples
One-dimensional input: predict using $y = w_1 + w_2 x_2$. [Figure: $y$ plotted against $x_2$]
Two-dimensional input: predict using $y = w_1 + w_2 x_2 + w_3 x_3$.

What is the best weight vector?
Question: How do we know which weight vector is the best one for a training set?
For an example $(\mathbf{x}_i, y_i)$ in the training set, the cost of a mistake is
  $|y_i - \mathbf{w}^T \mathbf{x}_i|$
Define the cost (or loss) for a particular weight vector $\mathbf{w}$ to be the sum of squared costs over the training set:
  $J(\mathbf{w}) = \frac{1}{2} \sum_{i=1}^{m} (y_i - \mathbf{w}^T \mathbf{x}_i)^2$
One strategy for learning: Find the $\mathbf{w}$ with the least cost on this data.
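
The cost can be evaluated directly from its definition as half the sum of squared errors, $J(\mathbf{w}) = \frac{1}{2} \sum_i (y_i - \mathbf{w}^T \mathbf{x}_i)^2$. The sketch below does this for the four cars in the mileage table, with the constant feature 1 prepended to each input. The function names `predict` and `lms_cost` are illustrative choices, and the two weight vectors are arbitrary guesses used only to compare costs.

```python
# Illustrative sketch: evaluate the LMS cost J(w) on the mileage table.

def predict(w, x):
    """Return w^T x."""
    return sum(w_j * x_j for w_j, x_j in zip(w, x))

def lms_cost(w, data):
    """J(w) = 1/2 * sum over examples of (y_i - w^T x_i)^2."""
    return 0.5 * sum((y - predict(w, x)) ** 2 for x, y in data)

# Each example is ([1, weight, age], mileage); the leading 1 is the constant feature.
data = [
    ([1.0, 31.5, 6.0], 21.0),
    ([1.0, 36.2, 2.0], 25.0),
    ([1.0, 43.1, 0.0], 18.0),
    ([1.0, 27.6, 2.0], 30.0),
]

cost_zero = lms_cost([0.0, 0.0, 0.0], data)    # always predict 0
cost_const = lms_cost([23.5, 0.0, 0.0], data)  # always predict the mean mileage, 23.5
```

Predicting the mean mileage for every car gives a much lower cost than predicting zero, which is the sense in which one weight vector is "better" than another.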

Least Mean Squares (LMS) Regression
Learning: minimizing mean squared error
  $\min_{\mathbf{w}} J(\mathbf{w}) = \min_{\mathbf{w}} \frac{1}{2} \sum_{i=1}^{m} (y_i - \mathbf{w}^T \mathbf{x}_i)^2$
Different strategies exist for learning by optimization. Gradient descent is a popular algorithm. (For this minimization objective, there is also an analytical solution.)

Gradient descent
We are trying to minimize $J(\mathbf{w})$.
General strategy for minimizing a function $J(\mathbf{w})$:
  Start with an initial guess for $\mathbf{w}$, say $\mathbf{w}^0$.
  Iterate till convergence:
    Compute the gradient of $J$ at $\mathbf{w}^t$.
    Update $\mathbf{w}^t$ to get $\mathbf{w}^{t+1}$ by taking a step in the opposite direction of the gradient.
Intuition: The gradient is the direction of steepest increase in the function. To get to the minimum, go in the opposite direction.
[Figure: $J(\mathbf{w})$ plotted against $\mathbf{w}$, with successive iterates $\mathbf{w}^0, \mathbf{w}^1, \mathbf{w}^2, \mathbf{w}^3$ moving toward the minimum]
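
The general strategy can be sketched on a one-variable function. The example below is an assumption made for illustration: it minimizes $J(w) = (w - 3)^2$, whose derivative is $2(w - 3)$, and it stops when the gradient is nearly zero. The function name `gradient_descent`, the step size, and the tolerance are all hypothetical choices.

```python
# Illustrative sketch: gradient descent on a function of one variable.

def gradient_descent(grad, w0, r=0.1, tol=1e-8, max_iters=10_000):
    """Minimize a one-variable function given its derivative `grad`.

    Starts at w0 and repeatedly steps against the gradient with step size r.
    """
    w = w0
    for _ in range(max_iters):
        g = grad(w)
        if abs(g) < tol:  # gradient is almost zero: treat as converged
            break
        w = w - r * g     # step in the direction opposite to the gradient
    return w

# J(w) = (w - 3)^2 has derivative 2 * (w - 3) and its minimum at w = 3.
w_min = gradient_descent(lambda w: 2.0 * (w - 3.0), w0=0.0)
```

Each step here shrinks the distance to the minimum by a constant factor, so the iterates approach 3 from the starting guess of 0.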

Gradient descent for LMS
We are trying to minimize $J(\mathbf{w}) = \frac{1}{2} \sum_{i=1}^{m} (y_i - \mathbf{w}^T \mathbf{x}_i)^2$.
1. Initialize $\mathbf{w}^0$.
2. For $t = 0, 1, 2, \ldots$
   1. Compute the gradient of $J(\mathbf{w})$ at $\mathbf{w}^t$. Call it $\nabla J(\mathbf{w}^t)$.
   2. Update $\mathbf{w}$ as follows: $\mathbf{w}^{t+1} = \mathbf{w}^t - r \nabla J(\mathbf{w}^t)$
$r$ is called the learning rate (for now, a small constant; we will get to this later).
What is the gradient of $J$?

Gradient of the cost
The gradient is of the form
  $\nabla J(\mathbf{w}) = \left[ \frac{\partial J}{\partial w_1}, \frac{\partial J}{\partial w_2}, \ldots, \frac{\partial J}{\partial w_d} \right]$
Remember that $\mathbf{w}$ is a vector with $d$ elements: $\mathbf{w} = [w_1, w_2, w_3, \ldots, w_j, \ldots, w_d]$.
One element of the gradient vector:
  $\frac{\partial J}{\partial w_j} = \frac{\partial}{\partial w_j} \frac{1}{2} \sum_{i=1}^{m} (y_i - \mathbf{w}^T \mathbf{x}_i)^2$
  $= \sum_{i=1}^{m} (y_i - \mathbf{w}^T \mathbf{x}_i) \frac{\partial}{\partial w_j} (y_i - \mathbf{w}^T \mathbf{x}_i)$
  $= -\sum_{i=1}^{m} (y_i - \mathbf{w}^T \mathbf{x}_i) x_{ij}$
That is, each element is (the negative of) a sum of error × input, where $y_i - \mathbf{w}^T \mathbf{x}_i$ is the error on example $i$ and $x_{ij}$ is the $j$-th input feature of that example.
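
One way to sanity-check the "sum of error × input" form is to compare it against a finite-difference estimate of the gradient. The sketch below assumes the cost $J(\mathbf{w}) = \frac{1}{2} \sum_i (y_i - \mathbf{w}^T \mathbf{x}_i)^2$ and uses a tiny made-up dataset; `lms_gradient` and `numerical_gradient` are hypothetical helper names.

```python
# Illustrative sketch: analytic LMS gradient versus a finite-difference check.

def predict(w, x):
    """Return w^T x."""
    return sum(w_j * x_j for w_j, x_j in zip(w, x))

def lms_cost(w, data):
    """J(w) = 1/2 * sum_i (y_i - w^T x_i)^2."""
    return 0.5 * sum((y - predict(w, x)) ** 2 for x, y in data)

def lms_gradient(w, data):
    """Element j is -sum_i (y_i - w^T x_i) * x_ij, i.e. error times input."""
    grad = [0.0] * len(w)
    for x, y in data:
        error = y - predict(w, x)
        for j, x_j in enumerate(x):
            grad[j] -= error * x_j
    return grad

def numerical_gradient(w, data, eps=1e-6):
    """Central-difference estimate of each partial derivative of J."""
    grad = []
    for j in range(len(w)):
        w_plus, w_minus = list(w), list(w)
        w_plus[j] += eps
        w_minus[j] -= eps
        grad.append((lms_cost(w_plus, data) - lms_cost(w_minus, data)) / (2 * eps))
    return grad

# Made-up data generated from y = 1 + 2 * x; inputs are [1, x].
data = [([1.0, 0.0], 1.0), ([1.0, 1.0], 3.0), ([1.0, 2.0], 5.0)]
w = [0.5, 1.0]
analytic = lms_gradient(w, data)
numeric = numerical_gradient(w, data)
```

If the derivation is right, the two lists agree up to small rounding error.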

Gradient descent for LMS
1. Initialize $\mathbf{w}^0$.
2. For $t = 0, 1, 2, \ldots$ (until total error is below a threshold):
   1. Compute the gradient of $J(\mathbf{w})$ at $\mathbf{w}^t$. Call it $\nabla J(\mathbf{w}^t)$. Evaluate the function for each training example to compute the error and construct the gradient vector. One element of $\nabla J(\mathbf{w}^t)$ is
      $\frac{\partial J}{\partial w_j} = -\sum_{i=1}^{m} (y_i - (\mathbf{w}^t)^T \mathbf{x}_i) x_{ij}$
   2. Update $\mathbf{w}$ as follows: $\mathbf{w}^{t+1} = \mathbf{w}^t - r \nabla J(\mathbf{w}^t)$
$r$ is called the learning rate (for now, a small constant; we will get to this later).
This algorithm is guaranteed to converge to the minimum of $J$ if $r$ is small enough. Why? The objective $J$ is a convex function.
The weight vector is not updated until all errors are calculated. Why not make early updates to the weight vector as soon as we encounter errors, instead of waiting for a full pass over the data?
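
The batch algorithm above can be sketched in a few lines of Python. This is a minimal version under stated assumptions: the data is a made-up noiseless set generated from $y = 1 + 2x$ (so the total error can actually fall below the threshold), the learning rate 0.1 was picked by hand for this data, and an iteration cap is added as a safeguard. The name `batch_gradient_descent` is hypothetical.

```python
# Illustrative sketch: batch gradient descent for the LMS objective.

def predict(w, x):
    """Return w^T x."""
    return sum(w_j * x_j for w_j, x_j in zip(w, x))

def lms_cost(w, data):
    """J(w) = 1/2 * sum_i (y_i - w^T x_i)^2."""
    return 0.5 * sum((y - predict(w, x)) ** 2 for x, y in data)

def lms_gradient(w, data):
    """Element j is -sum_i (y_i - w^T x_i) * x_ij."""
    grad = [0.0] * len(w)
    for x, y in data:
        error = y - predict(w, x)
        for j, x_j in enumerate(x):
            grad[j] -= error * x_j
    return grad

def batch_gradient_descent(data, r=0.1, threshold=1e-10, max_iters=100_000):
    """Update w once per full pass over the data until J(w) < threshold."""
    w = [0.0] * len(data[0][0])  # initialize w^0 to the zero vector
    for _ in range(max_iters):
        if lms_cost(w, data) < threshold:
            break
        grad = lms_gradient(w, data)  # uses every training example
        w = [w_j - r * g_j for w_j, g_j in zip(w, grad)]
    return w

# Made-up noiseless data from y = 1 + 2 * x; inputs are [1, x].
data = [([1.0, 0.0], 1.0), ([1.0, 1.0], 3.0), ([1.0, 2.0], 5.0)]
w_learned = batch_gradient_descent(data)
```

On noisy data the minimum cost is usually above zero, so a threshold on the change in cost or on the gradient norm would be a more practical stopping rule than a threshold on the total error.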

Stochastic gradient descent
Repeat for each example $(\mathbf{x}_i, y_i)$:
  Pretend that the entire training set is represented by this single example.
  Use this example to calculate the gradient and update the model.
Contrast with batch gradient descent, which makes one update to the weight vector for every pass over the data.

Stochastic gradient descent
1. Initialize $\mathbf{w}$.
2. For $t = 0, 1, 2, \ldots$ (until error below some threshold):
   For each training example $(\mathbf{x}_i, y_i)$:
     Update $\mathbf{w}$. For each element of the weight vector ($w_j$):
       $w_j \leftarrow w_j + r (y_i - \mathbf{w}^T \mathbf{x}_i) x_{ij}$
Contrast with the previous method, where the weights are updated only after all examples are processed once.
This update rule is also called the Widrow-Hoff rule in the neural networks literature.
Online/stochastic algorithms are often preferred when the training set is very large. They may get close to the optimum much faster than the batch version.
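
A minimal sketch of the per-example update $w_j \leftarrow w_j + r (y_i - \mathbf{w}^T \mathbf{x}_i) x_{ij}$ follows. Several details are assumptions rather than part of the slides: the examples are shuffled before each pass (a common practice), the loop runs for a fixed number of passes instead of checking an error threshold, and the data is a made-up noiseless set from $y = 1 + 2x$. The name `sgd` is hypothetical.

```python
# Illustrative sketch: stochastic gradient descent with the Widrow-Hoff update.
import random

def predict(w, x):
    """Return w^T x."""
    return sum(w_j * x_j for w_j, x_j in zip(w, x))

def sgd(data, r=0.05, epochs=2000, seed=0):
    """Update w after every single example instead of after a full pass."""
    rng = random.Random(seed)    # fixed seed so the run is repeatable
    w = [0.0] * len(data[0][0])
    examples = list(data)
    for _ in range(epochs):
        rng.shuffle(examples)    # assumption: visit examples in random order
        for x, y in examples:
            error = y - predict(w, x)  # error computed with the current w
            w = [w_j + r * error * x_j for w_j, x_j in zip(w, x)]
    return w

# Made-up noiseless data from y = 1 + 2 * x; inputs are [1, x].
data = [([1.0, 0.0], 1.0), ([1.0, 1.0], 3.0), ([1.0, 2.0], 5.0)]
w_sgd = sgd(data)
```

The error is computed once per example before any weight changes, so every element $w_j$ is updated using the same value of $\mathbf{w}^T \mathbf{x}_i$.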

Learning Rates and Convergence
In the general case, the learning rate $r$ must decrease to zero to guarantee convergence.
The learning rate is also called the step size. More sophisticated algorithms choose the step size automatically and converge faster.
Choosing a better starting point can also have an impact.
Gradient descent and its stochastic version are very simple algorithms. Yet almost all the algorithms we will learn in the class can be traced back to gradient descent algorithms for different loss functions and different hypothesis spaces.
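
The slides do not specify how the learning rate should decrease. One commonly used choice, shown here purely as an assumption, is $r_t = r_0 / (1 + \text{decay} \cdot t)$, where $t$ counts updates. The function name `decayed_rate` and the constants are hypothetical.

```python
# Illustrative sketch: a learning rate that decreases toward zero over time.

def decayed_rate(r0, t, decay=0.01):
    """Return r0 / (1 + decay * t), an assumed schedule (not from the slides)."""
    if r0 <= 0 or decay < 0 or t < 0:
        raise ValueError("r0 must be positive; decay and t must be non-negative")
    return r0 / (1.0 + decay * t)

# The rate starts at r0 and shrinks as the update count t grows.
rates = [decayed_rate(0.1, t) for t in (0, 100, 1000, 10000)]
```

In a stochastic gradient descent loop, the constant `r` would be replaced by `decayed_rate(r0, t)` with `t` incremented after every update.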

Linear regression: Summary
What we want: Predict a real-valued output using a feature representation of the input.
Assumption: The output is a linear function of the inputs.
Learning by minimizing total cost:
  Gradient descent and stochastic gradient descent to find the best weight vector.
  This particular optimization can be computed directly by framing the problem as a matrix problem.

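Exercises
1. LMS regression can be solved analytically. Given a dataset $D = \{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_m, y_m)\}$, define matrix $X$ (one example per row) and vector $Y$ as follows:
   $X = \begin{bmatrix} \mathbf{x}_1^T \\ \mathbf{x}_2^T \\ \vdots \\ \mathbf{x}_m^T \end{bmatrix}, \quad Y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix}$
   Show that the optimization problem we saw earlier is equivalent to
   $\min_{\mathbf{w}} \frac{1}{2} \| X \mathbf{w} - Y \|_2^2$
   This can be solved analytically. Show that the solution $\mathbf{w}^*$ is
   $\mathbf{w}^* = (X^T X)^{-1} X^T Y$
   Hint: You have to take the derivative of the objective with respect to the vector $\mathbf{w}$ and set it to zero.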
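
For a numerical check of the analytical route (not a substitute for the derivation), the sketch below assumes $X$ has one example per row, builds the normal equations $X^T X \mathbf{w} = X^T Y$, and solves them with a small Gaussian elimination routine. The names `normal_equations` and `solve` are hypothetical, and the data is a made-up noiseless set from $y = 1 + 2x$.

```python
# Illustrative sketch: solve LMS regression via the normal equations.

def normal_equations(data):
    """Build A = X^T X and b = X^T Y from (x, y) pairs (one example per row of X)."""
    d = len(data[0][0])
    A = [[0.0] * d for _ in range(d)]
    b = [0.0] * d
    for x, y in data:
        for j in range(d):
            b[j] += x[j] * y
            for k in range(d):
                A[j][k] += x[j] * x[k]
    return A, b

def solve(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]  # augmented matrix [A | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular; no unique solution")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(M[i][c] * w[c] for c in range(i + 1, n))
        w[i] = (M[i][n] - s) / M[i][i]
    return w

# Made-up noiseless data from y = 1 + 2 * x; inputs are [1, x].
data = [([1.0, 0.0], 1.0), ([1.0, 1.0], 3.0), ([1.0, 2.0], 5.0)]
A, b = normal_equations(data)
w_star = solve(A, b)
```

Solving the linear system directly avoids forming the inverse $(X^T X)^{-1}$ explicitly, which is generally the more numerically careful choice.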