Learning with Momentum, Conjugate Gradient Learning


Learning with Momentum, Conjugate Gradient Learning
Introduction to Neural Networks: Lecture 8
John A. Bullinaria, 2004

1. Visualising Learning
2. Learning with Momentum
3. Learning with Line Searches
4. Parabolic Interpolation
5. Conjugate Gradient Learning

Visualising Learning

Visualising neural network learning is difficult because there are so many weights being updated at once. However, we can plot error function contours for pairs of weights to get some idea of what is happening. The weight update equations will produce a series of steps in weight space from the starting position to the error minimum:

[Figure: error contours in the (w1, w2) plane, with a trajectory of weight-update steps from the start point towards the minimum.]

True gradient descent produces a smooth curve perpendicular to the contours. Weight updates with a small step size η will result in a good approximation to it, as shown.
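
To make this picture concrete, here is a minimal sketch (not from the original notes: the two-weight quadratic error surface and all variable names are invented for illustration) that records the trajectory of gradient descent steps across such an error surface; the recorded points could then be overlaid on a contour plot.

```python
import numpy as np

# Hypothetical two-weight error function: an elongated quadratic bowl whose
# contours are ellipses like those in the sketch above.
def error(w):
    return 0.5 * (w[0]**2 + 10.0 * w[1]**2)

def gradient(w):
    return np.array([w[0], 10.0 * w[1]])

eta = 0.05                      # small step size, so the path stays smooth
w = np.array([-4.0, 2.0])       # the "start" point in the figure
trajectory = [w.copy()]

for _ in range(100):            # repeated weight-update steps
    w = w - eta * gradient(w)
    trajectory.append(w.copy())

trajectory = np.array(trajectory)
print(trajectory[-1], error(trajectory[-1]))   # ends close to the minimum at (0, 0)
# trajectory[:, 0] and trajectory[:, 1] can be plotted over the error contours.
```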

Step Size Too Large

If our learning rate (step size) η is set too large, the approximation to true gradient descent will be poor, and we will end up with overshoots, or even divergence:

[Figure: error contours in the (w1, w2) plane, with large weight-update steps overshooting back and forth across the minimum.]

In practice, we don't need to plot error contours to see if this is happening. It is obvious that, if the error function is fluctuating, we should reduce the step size. It is also worth checking individual weights and output activations as well. On the other hand, if everything is smooth, it is worth trying to increase η and seeing if it stays smooth.
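
The rule of thumb above can be turned into a crude automatic adjustment. The following fragment is only an illustrative sketch (the heuristic, the thresholds, and names such as adjust_step_size are assumptions, not from the notes):

```python
def adjust_step_size(error_history, eta, shrink=0.5, grow=1.1):
    """Reduce eta if the recent error values fluctuate (a sign of overshooting);
    otherwise cautiously try a slightly larger eta, as suggested in the text."""
    if len(error_history) < 3:
        return eta
    e1, e2, e3 = error_history[-3:]
    if e2 > e1 or e3 > e2:        # error went up somewhere recently
        return eta * shrink
    return eta * grow             # error falling smoothly: try being more ambitious

# Example with made-up error traces:
print(adjust_step_size([1.0, 0.8, 0.9], eta=0.1))   # fluctuating -> 0.05
print(adjust_step_size([1.0, 0.8, 0.7], eta=0.1))   # smooth -> roughly 0.11
```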

On-line Learning

If we update the weights after each training pattern, rather than adding up the weight changes for all the patterns before applying them, the learning algorithm is no longer true gradient descent, and the weight changes will not be perpendicular to the contours:

[Figure: error contours in the (w1, w2) plane, with a noisy on-line weight-update trajectory that still heads towards the minimum.]

If we keep the step sizes small enough, the erratic behaviour of the weight updates will not be too much of a problem, and the increased number of weight changes will still get us to the minimum more quickly than true gradient descent (i.e. batch learning).
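
For concreteness, the sketch below contrasts the two update schemes on a toy linear model with a squared-error cost (the model, data, and names are all illustrative, not from the notes): batch learning accumulates the gradient over every pattern before one update, while on-line learning takes a (noisier) step after each pattern.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                    # 20 training patterns, 3 inputs
targets = X @ np.array([1.0, -2.0, 0.5])        # made-up linear targets

def grad_single(w, x, t):
    return (w @ x - t) * x                      # gradient of 0.5*(w.x - t)^2

eta = 0.05
w_batch = np.zeros(3)
w_online = np.zeros(3)

for epoch in range(100):
    # Batch (true gradient descent): sum over all patterns, then one step.
    g = sum(grad_single(w_batch, x, t) for x, t in zip(X, targets))
    w_batch -= eta * g / len(X)

    # On-line: update immediately after each individual pattern.
    for x, t in zip(X, targets):
        w_online -= eta * grad_single(w_online, x, t)

print(w_batch)   # both approach the true weights [1, -2, 0.5],
print(w_online)  # but on-line gets there in far fewer epochs here
```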

Learning with Momentum

A compromise that will smooth out the erratic behaviour of the on-line updates, without slowing down the learning too much, is to update the weights with the moving average of the individual weight changes corresponding to single training patterns. If we label everything by the time t (which we can conveniently measure in weight update steps), then implementing a moving average is easy:

$$\Delta w_{hl}^{(n)}(t) = \eta\,\delta_l^{(n)}(t)\,\mathrm{out}_h^{(n-1)}(t) + \alpha\,\Delta w_{hl}^{(n)}(t-1)$$

We simply add a momentum term $\alpha\,\Delta w_{hl}^{(n)}(t-1)$, which is the weight change of the previous step times a momentum parameter α. If α is zero, then we have the standard on-line training algorithm used before. As we increase α towards one, each step includes increasing contributions from previous training patterns. Obviously it makes no sense to have α less than zero, or greater than one. Good sizes of α depend on the size of the training data set and how variable it is. Usually, we will need to decrease η as we increase α, so that the total step sizes don't get too large.
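
As a concrete (and heavily simplified) illustration of this update rule, the sketch below applies the same momentum bookkeeping to a one-dimensional error function; in a real network the gradient term would be the delta-times-output product from back-propagation, and every name here is illustrative:

```python
def momentum_update(w, grad, prev_delta, eta=0.1, alpha=0.9):
    """One momentum step: delta_w(t) = -eta*grad(t) + alpha*delta_w(t-1).
    In a network, -grad would be the delta_l * out_h term from back-propagation."""
    delta_w = -eta * grad + alpha * prev_delta
    return w + delta_w, delta_w

# Toy usage on E(w) = 0.5*w**2, whose gradient is simply w.
w, prev_delta = 5.0, 0.0
for _ in range(200):
    w, prev_delta = momentum_update(w, grad=w, prev_delta=prev_delta)
print(w)   # close to the minimum at w = 0
```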

Learning with Line Searches

Learning algorithms which work by taking a sequence of steps in weight space all have two basic components: the step size size(t) and the direction dir(t), such that

$$\Delta w_{jk}^{(n)}(t) = \mathrm{size}(t)\cdot\mathrm{dir}_{jk}^{(n)}(t)$$

For gradient descent algorithms, such as Back-Propagation, the chosen direction is given by the partial derivatives of the error function,

$$\mathrm{dir}_{jk}^{(n)}(t) = -\,\frac{\partial E(\mathbf{w}^{(m)})}{\partial w_{jk}^{(n)}}$$

and the step size is simply the small constant learning rate parameter, size(t) = η.

A better procedure might be to carry on along the chosen direction until the error starts rising again. This involves performing a line search to determine the step size. The simplest procedure for performing a line search would be to take a series of small steps along the chosen direction until the error increases, and then go back one step. However, this is not likely to be much more efficient than standard gradient descent. Fortunately, there are many better procedures for performing line searches.
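
The simplest line-search procedure just described can be written down directly. This is an illustrative sketch only (function and variable names are invented) and, as the text notes, not an efficient method:

```python
def naive_line_search(error_fn, w, direction, step=0.01, max_steps=1000):
    """Take small fixed steps along `direction` until the error rises,
    then return the last point before the rise (i.e. go back one step)."""
    best_w, best_err = w, error_fn(w)
    for _ in range(max_steps):
        candidate = [wi + step * di for wi, di in zip(best_w, direction)]
        err = error_fn(candidate)
        if err >= best_err:              # error started rising: stop
            break
        best_w, best_err = candidate, err
    return best_w, best_err

# Example on E(w) = (w0 - 1)^2 + (w1 + 2)^2, searching along the negative gradient.
E = lambda w: (w[0] - 1)**2 + (w[1] + 2)**2
w_start = [0.0, 0.0]
direction = [2.0, -4.0]                  # -grad E at w_start
print(naive_line_search(E, w_start, direction))   # close to ([1.0, -2.0], 0.0)
```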

Parabolic Interpolation

We can plot the variation of the error function E(s) as we increase the step s taken along our chosen direction. Suppose we can find three points A, B, C such that E(A) > E(B) and E(B) < E(C). If A is where we are, and C is far away, this should not be difficult.

[Figure: error E plotted against step size s, with bracketing points A, B, C and the minimum D of the parabola through them.]

The three points are sufficient to define the parabola that passes through them, and we can then easily compute the minimum point D of that parabola. The step size corresponding to that point D is a good guess for the appropriate step size for the actual error function.
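
Computing D is a standard exercise: a unique parabola passes through three points, and its minimum has a closed form. A minimal sketch (the formula is the usual parabolic-interpolation expression, not taken from the notes):

```python
def parabolic_minimum(a, b, c, fa, fb, fc):
    """Position D of the minimum of the parabola through (a, fa), (b, fb), (c, fc),
    assuming the points bracket a minimum, i.e. fa > fb and fb < fc."""
    num = (b - a)**2 * (fb - fc) - (b - c)**2 * (fb - fa)
    den = (b - a) * (fb - fc) - (b - c) * (fb - fa)
    return b - 0.5 * num / den

# Example: error along the search direction is E(s) = (s - 1.7)^2.
E = lambda s: (s - 1.7)**2
a, b, c = 0.0, 1.0, 3.0                               # E(a) > E(b) < E(c)
print(parabolic_minimum(a, b, c, E(a), E(b), E(c)))   # 1.7: exact for a quadratic E
```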

Repeated Parabolic Interpolation

The process of parabolic interpolation can easily be repeated. We take our guess of the appropriate step size given by point D, together with the two original points of lowest error (i.e. B and C), to make up a new set of three points, and repeat the procedure:

[Figure: error E plotted against step size s, with the new bracketing points B, D, C and the next parabolic minimum.]

Each iteration of this process brings us closer to the minimum. However, the gradients change as we take each step, so it is not computationally efficient to get each step size too accurately. Often it is better to get it roughly right and move on to the next direction.
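
A short sketch of the repeated procedure (again purely illustrative; one reasonable bookkeeping choice is to keep the three lowest-error points after each fit, which reduces to the B, D, C choice in the figure):

```python
def line_minimise(E, a, b, c, iterations=5):
    """Repeat parabolic interpolation: fit a parabola through the current
    three points, add its minimum D, and keep the three lowest-error points."""
    for _ in range(iterations):
        fa, fb, fc = E(a), E(b), E(c)
        num = (b - a)**2 * (fb - fc) - (b - c)**2 * (fb - fa)
        den = (b - a) * (fb - fc) - (b - c) * (fb - fa)
        d = b - 0.5 * num / den
        a, b, c = sorted([a, b, c, d], key=E)[:3]
    return min([a, b, c], key=E)

# A non-quadratic error along the line: a few passes get close to the true minimum.
E = lambda s: (s - 1.3)**2 + 0.2 * (s - 1.3)**4
print(line_minimise(E, 0.0, 1.0, 3.0))        # close to 1.3
```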

Problems using Gradient Descent with Line Search

We have seen how we can determine appropriate step sizes by doing line searches. It is not obvious, though, that using the gradient descent direction is still the best thing to do. Remember that we chose our step size s(t−1) to minimise the new error E(w(t)):

$$E(\mathbf{w}(t)) = E\!\left(\mathbf{w}(t-1) - s(t-1)\,\frac{\partial E(\mathbf{w}(t-1))}{\partial \mathbf{w}(t-1)}\right)$$

But we know the derivative of this with respect to s(t−1) must be zero at the minimum, and we can use the chain rule for derivatives to compute that derivative, so we have

$$\frac{\partial E(\mathbf{w}(t))}{\partial s(t-1)} = -\sum_{i,j}\frac{\partial E(\mathbf{w}(t))}{\partial w_{ij}(t)}\cdot\frac{\partial E(\mathbf{w}(t-1))}{\partial w_{ij}(t-1)} = 0$$

The scalar (dot) product of the old direction ∂E(w(t−1))/∂w(t−1) and the new direction ∂E(w(t))/∂w(t) is therefore zero, which means the directions are orthogonal (perpendicular). This will result in a less than optimal zig-zag path through weight space.
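
This orthogonality is easy to check numerically. The sketch below (a made-up quadratic error, purely for illustration) does an exact line search along the negative gradient and shows that the new gradient has zero component along the old direction:

```python
import numpy as np

A = np.array([[3.0, 0.5], [0.5, 1.0]])      # made-up quadratic error E(w) = 0.5 w.A.w
grad = lambda w: A @ w

w = np.array([2.0, -1.0])
g_old = grad(w)
d = -g_old                                   # gradient descent direction

s = -(d @ g_old) / (d @ A @ d)               # step size that exactly minimises E(w + s*d)
w_new = w + s * d
g_new = grad(w_new)

print(g_new @ d)                             # ~0: new gradient orthogonal to old direction
```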

Finding a Better Search Direction

An obvious approach to avoid the zig-zagging that occurs when we use line searches and gradient directions is to make the new step direction dir(t) a compromise between the new gradient direction −∂E(w(t))/∂w(t) and the previous step direction dir(t−1):

$$\mathbf{dir}(t) = -\frac{\partial E(\mathbf{w}(t))}{\partial \mathbf{w}(t)} + \beta\cdot\mathbf{dir}(t-1)$$

[Figures: two error-contour plots in the (w1, w2) plane, contrasting the zig-zag path produced by line searches along successive gradient directions with the smoother path produced by the compromise directions.]

Conjugate Gradient Learning

The basis of Conjugate Gradient Learning is to find a value for β in the last equation so that each new search direction spoils as little as possible the minimisation achieved by the previous one. We thus want to find the new direction dir(t) such that the component of the gradient at the new point w(t) + s·dir(t) along the old direction dir(t−1) is zero, i.e.

$$\sum_{i,j}\mathrm{dir}_{ij}(t-1)\cdot\frac{\partial E(\mathbf{w}(t)+s\,\mathbf{dir}(t))}{\partial w_{ij}(t)} = 0$$

An appropriate value of β that satisfies this is given by the Polak-Ribière rule:

$$\beta = \frac{\displaystyle\sum_{i,j}\left(\frac{\partial E(\mathbf{w}(t))}{\partial w_{ij}(t)} - \frac{\partial E(\mathbf{w}(t-1))}{\partial w_{ij}(t-1)}\right)\cdot\frac{\partial E(\mathbf{w}(t))}{\partial w_{ij}(t)}}{\displaystyle\sum_{i,j}\frac{\partial E(\mathbf{w}(t-1))}{\partial w_{ij}(t-1)}\cdot\frac{\partial E(\mathbf{w}(t-1))}{\partial w_{ij}(t-1)}}$$

For most practical applications this is the fastest way to train our neural networks.
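
To tie the pieces together, here is a minimal sketch of conjugate gradient training on a toy quadratic error (everything here is illustrative: the exact line-search formula only works because the error is quadratic, and for a real network the step size would instead come from a line search such as the parabolic interpolation above):

```python
import numpy as np

A = np.array([[4.0, 1.0], [1.0, 2.0]])        # toy quadratic error E(w) = 0.5 w.A.w
grad = lambda w: A @ w

w = np.array([3.0, -2.0])
g = grad(w)
d = -g                                         # first direction: plain gradient descent

for _ in range(10):
    if np.linalg.norm(g) < 1e-10:              # gradient (essentially) zero: done
        break
    s = -(d @ g) / (d @ A @ d)                 # exact line search along d (quadratic E only)
    w = w + s * d
    g_new = grad(w)
    beta = (g_new @ (g_new - g)) / (g @ g)     # Polak-Ribiere rule, as in the equation above
    d = -g_new + beta * d                      # new search direction
    g = g_new

print(w, grad(w))                              # w at the minimum (0, 0), gradient ~ 0
```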

Overview and Reading

1. We began by looking at how we can visualise the process of stepping through weight space to find the error minimum.
2. We saw how adding a momentum term to the gradient descent weight update equations can smooth out and speed up on-line training.
3. We then looked at using line searches as an alternative to constant gradient descent step sizes.
4. We ended with a discussion of Conjugate Gradient Learning.

Reading

1. Bishop: Sections 4.8, 7.5, 7.6, 7.7
2. Hertz, Krogh & Palmer: Sections 6.1, 6.2
3. Haykin: Sections 4.3, 4.4, 4.16, 4.17, 4.18
4. Gurney: Sections 5.4, 6.5