Error Functions & Linear Regression (1) John Kelleher & Brian Mac Namee Machine Learning @ DIT

Overview
1. Introduction: Overview
2. Univariate Linear Regression: Linear Regression, Analytical Solution, Gradient Descent
3. Multivariate Linear Regression: Problem Formulation, Gradient Descent
4. Summary

Univariate Linear Regression

There is a family of machine learning approaches that can be thought of as directly searching for the set of parameters that maximises the performance of a particular prediction model. In this lecture we will look at linear regression and how it can be used for both categorical and continuous prediction problems.

Linear Regression
Remember that the equation of a line, y = mx + b, can be used to represent the relationship between two variables, x and y. For simplicity we will rewrite the equation of a line as:

h_w(x) = w_0 + w_1 x

where w is the vector [w_0, w_1], and w_0 and w_1 are referred to as weights. We now just need to find the weight values that maximise the performance of this model, i.e. the weight values that allow us to best predict y given x.

Linear Regression: Loss Function
In order to find the best values for [w_0, w_1] we can measure the fit of a model by measuring the squared loss function, L_2, over all training examples. The loss function can be defined as:

Loss(h_w) = Σ_{j=1}^{N} L_2(y_j, h_w(x_j))
          = Σ_{j=1}^{N} (y_j - h_w(x_j))^2
          = Σ_{j=1}^{N} (y_j - (w_0 + w_1 x_j))^2
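To make this concrete, here is a minimal Python sketch of the hypothesis and the L_2 loss over a training set (the function names h and l2_loss are ours, not from the slides):

```python
def h(w0, w1, x):
    """Univariate linear hypothesis h_w(x) = w0 + w1 * x."""
    return w0 + w1 * x

def l2_loss(w0, w1, data):
    """Squared (L2) loss summed over (x, y) training pairs."""
    return sum((y - h(w0, w1, x)) ** 2 for x, y in data)
```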

Linear Regression

House Size   House Price
500          320
550          380
620          400
630          390
665          385
700          410
770          480
880          600
920          570
1000         620

Practice: Assuming w_0 = 6.47, what would be the most accurate value of w_1 based on the given training set (select from 0.4, 0.5, 0.6, 0.7, 0.8)?

Linear Regression

w_1   L_2
0.4   272,436
0.5   85,424
0.6   8,006
0.7   40,183
0.8   181,955

The loss is smallest at w_1 = 0.6, so that is the most accurate of the candidate values.
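The table above can be reproduced with the l2_loss sketch from earlier by evaluating the loss for each candidate w_1 on the house-price data (small differences from the slide's figures are due to rounding):

```python
house_data = [(500, 320), (550, 380), (620, 400), (630, 390), (665, 385),
              (700, 410), (770, 480), (880, 600), (920, 570), (1000, 620)]

for w1 in (0.4, 0.5, 0.6, 0.7, 0.8):
    # With w0 fixed at 6.47, the smallest loss occurs at w1 = 0.6
    # (about 8,000, matching the table up to rounding)
    print(w1, round(l2_loss(6.47, w1, house_data)))
```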

Linear Regression
The loss as a function of the weights [w_0, w_1] can be viewed as a surface - in fact a surface that is convex and has a single global minimum. We need to find the lowest point on this surface and use the corresponding weights as our weight values.

Linear Regression

Analytical Solution
So we want to find w = argmin_w Loss(h_w). The sum Loss(h_w) is minimised when its partial derivatives with respect to w_0 and w_1 are equal to 0.

Partial what??? Who knows what a derivative is? Who knows what a partial derivative is?

Analytical Solution
In order to compute slopes we need to know some calculus.

Fundamental Calculus: In mathematics, the derivative is a measurement of how a function changes when the values of its inputs change, and a partial derivative (denoted by the symbol ∂) of a function of several variables is its derivative with respect to one of those variables, with the others held constant.

There are really only three things we need to know about calculus for working on linear regression. First:

∂/∂x x = 1
∂/∂x x^2 = 2x

Analytical Solution
Practice: Calculate the following:

∂/∂x (x + 1) =
∂/∂x 6x =
∂/∂x x^2 =
∂/∂x 3x^2 =
∂/∂x (x^2 + 2x + 4) =

Analytical Solution
Practice: Calculate the following:

∂/∂x (x + 1) = 1
∂/∂x 6x = 6
∂/∂x x^2 = 2x
∂/∂x 3x^2 = 6x
∂/∂x (x^2 + 2x + 4) = 2x + 2

Analytical Solution
Fundamental Calculus: The second thing is the Chain Rule. The Chain Rule allows us to calculate the derivative of a composition of two or more functions.

Chain rule: ∂/∂x g(f(x)) = ∂/∂f(x) g(f(x)) · ∂/∂x f(x)

Ex.1: ∂/∂x (x^2 + 1)^2 = 2(x^2 + 1)(2x)

Ex.2: ∂/∂x (1/2)(c - x)^2 = (2/2)(c - x)(∂/∂x (c - x)) = (c - x)(-1) = -(c - x)

Analytical Solution
Practice:

∂/∂x (x^2 + 2)^2 =
∂/∂x (3x^2 + 3x + 3)^2 =

Analytical Solution
Practice:

∂/∂x (x^2 + 2)^2 = 2(x^2 + 2)(2x)
∂/∂x (3x^2 + 3x + 3)^2 = 2(3x^2 + 3x + 3)(6x + 3)
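If you want to double-check answers like these mechanically, a symbolic package can differentiate them for you; a quick sketch using sympy (not part of the original slides):

```python
import sympy as sp

x, c = sp.symbols('x c')

# Chain-rule examples and practice questions from the slides
print(sp.diff((x**2 + 1)**2, x))                    # 2(x^2 + 1)(2x) = 4x(x^2 + 1)
print(sp.diff(sp.Rational(1, 2) * (c - x)**2, x))   # -(c - x) = x - c
print(sp.diff((x**2 + 2)**2, x))                    # 2(x^2 + 2)(2x)
print(sp.diff((3*x**2 + 3*x + 3)**2, x))            # 2(3x^2 + 3x + 3)(6x + 3)
```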

Analytical Solution
So we want to find w = argmin_w Loss(h_w). The sum Loss(h_w) is minimised when its partial derivatives with respect to w_0 and w_1 are equal to 0. So:

∂/∂w_0 Σ_{j=1}^{N} (y_j - (w_0 + w_1 x_j))^2 = 0

and

∂/∂w_1 Σ_{j=1}^{N} (y_j - (w_0 + w_1 x_j))^2 = 0

Analytical Solution
Fortunately, these equations have a unique solution*:

w_1 = (N Σ x_j y_j - (Σ x_j)(Σ y_j)) / (N Σ x_j^2 - (Σ x_j)^2)

and

w_0 = (Σ y_j - w_1 Σ x_j) / N

*based on assumptions that there is normally distributed noise, and the stationarity assumption, amongst other things

Analytical Solution
Practice:

House Size   House Price
500          320
550          380
620          400
630          390
665          385
700          410
770          480
880          600
920          570
1000         620

Let's work out the optimal values for [w_0, w_1] for the house price dataset we looked at previously. See the associated Excel sheet for the solution!
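Since the Excel sheet is not reproduced here, the closed-form formulas from the previous slide can also be evaluated directly; a Python sketch reusing the house_data list defined earlier (the variable names are ours):

```python
N = len(house_data)
sum_x  = sum(x for x, _ in house_data)
sum_y  = sum(y for _, y in house_data)
sum_xy = sum(x * y for x, y in house_data)
sum_x2 = sum(x * x for x, _ in house_data)

# Unique solution from the previous slide
w1 = (N * sum_xy - sum_x * sum_y) / (N * sum_x2 - sum_x ** 2)
w0 = (sum_y - w1 * sum_x) / N
print(w0, w1)
```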

Gradient Descent While it was great to have a unique solution to our system of equations in the previous example, this will not always be the case. We need an alternative way to find the best weights to use in our model. Any ideas?

Gradient Descent
We need to perform a search through the weight space for the best values to use. An exhaustive search is just not reasonable. Instead we can use the common approach of gradient descent:
- Start at any point in weight space
- Move to a neighbouring point that is downhill
- Repeat until we converge on the point of minimum loss

Gradient descent:
  w ← any point in the parameter space
  loop until convergence do
    for each w_i in w do
      w_i ← w_i - α ∂/∂w_i Loss(w)

Gradient Descent α is referred to as the learning rate and determines the amount by which weights are changed at each step. This can be a fixed constant or can decay over time as learning proceeds. Nice gradient descent animations: http://animation.yihui.name/compstat:gradient_descent_algorithm

Gradient Descent

Gradient Descent
Imagine there was just one training example (x, y). Then the gradient would be given as (note the use of the chain rule):

∂/∂w_i Loss(w) = ∂/∂w_i (y - h_w(x))^2
               = 2(y - h_w(x)) · ∂/∂w_i (y - h_w(x))
               = 2(y - h_w(x)) · ∂/∂w_i (y - (w_0 + w_1 x))

Applying this to both w_0 and w_1 we get:

∂/∂w_0 Loss(w) = -2(y - h_w(x))
∂/∂w_1 Loss(w) = -2(y - h_w(x)) · x

Gradient Descent
Plugging these into our weight update rule we get:

w_0 ← w_0 + α(y - h_w(x))
w_1 ← w_1 + α(y - h_w(x)) · x

Intuitively this makes sense: if h_w(x) is too high, reduce w_0 a little bit, and reduce w_1 if x is positive but increase it if x is negative; if h_w(x) is too low, do the opposite.

Gradient Descent
The previous example only considered a single training example. However, adjusting this to deal with a full training set of N examples simply involves introducing a sum:

w_0 ← w_0 + α Σ_j (y_j - h_w(x_j))
w_1 ← w_1 + α Σ_j (y_j - h_w(x_j)) · x_j

This is known as batch gradient descent.
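As a rough Python sketch of batch gradient descent for the univariate model (the learning rate and iteration count are arbitrary illustrative choices, not values from the lecture):

```python
def batch_gradient_descent(data, alpha=1e-7, iterations=10_000):
    """Batch gradient descent for h_w(x) = w0 + w1 * x."""
    w0, w1 = 0.0, 0.0
    for _ in range(iterations):
        # Errors y_j - h_w(x_j), all computed with the current weights
        errors = [y - (w0 + w1 * x) for x, y in data]
        w0 += alpha * sum(errors)
        w1 += alpha * sum(e * x for e, (x, _) in zip(errors, data))
    return w0, w1
```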

Gradient Descent An alternative to batch gradient descent is stochastic gradient descent in which the update rule is applied one example at a time and the examples are chosen, usually randomly, from the training set.
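A corresponding stochastic sketch applies the same update to one randomly chosen example at a time (again, the sampling scheme, learning rate and step count here are illustrative assumptions):

```python
import random

def stochastic_gradient_descent(data, alpha=1e-7, steps=100_000):
    """Stochastic gradient descent: update on one randomly chosen example per step."""
    w0, w1 = 0.0, 0.0
    for _ in range(steps):
        x, y = random.choice(data)
        error = y - (w0 + w1 * x)
        w0 += alpha * error
        w1 += alpha * error * x
    return w0, w1
```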

Multivariate Linear Regression

Problem Formulation We have seen how linear regression can be used to make continuous predictions from a single variable. However, we like big data so one variable is no good to us! A multivariate linear regression does the same job using multiple variables.

Problem Formulation: Multivariate Linear Regression
Our hypothesis space is of the form:

h_w(x) = w_0 + w_1 x_1 + ... + w_n x_n = w_0 + Σ_i w_i x_i

We can make this look a little neater by inventing a dummy input attribute, x_0, that is always equal to 1. Then:

h_w(x) = Σ_i w_i x_i = w · x

Gradient Descent for Multivariate Linear Regression
The best set of weights, w, minimises the squared error loss, L_2, over a set of training examples:

w = argmin_w Σ_j L_2(y_j, w · x_j)

Gradient descent works for the multivariate case just like it does for the univariate case and will minimise the loss function. The update rule for each weight is:

w_i ← w_i + α Σ_j x_{j,i} (y_j - h_w(x_j))
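Written with numpy, the multivariate update might look like the sketch below, with the dummy attribute x_0 = 1 added as a column of ones (the function name is ours, and the default learning rate assumes roughly unit-scaled features):

```python
import numpy as np

def multivariate_gradient_descent(X, y, alpha=0.01, iterations=1000):
    """Batch gradient descent for h_w(x) = w . x with a dummy x_0 = 1 attribute."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones(len(X)), X])  # prepend the x_0 = 1 column
    w = np.zeros(X.shape[1])
    for _ in range(iterations):
        errors = y - X @ w            # y_j - h_w(x_j) for every example j
        w += alpha * (X.T @ errors)   # w_i += alpha * sum_j x_{j,i} * (y_j - h_w(x_j))
    return w
```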

Summary

1. Introduction: Overview
2. Univariate Linear Regression: Linear Regression, Analytical Solution, Gradient Descent
3. Multivariate Linear Regression: Problem Formulation, Gradient Descent
4. Summary