Improving the Convergence of Back-Propagation Learning with Second Order Methods


Improving the Convergence of Back-Propagation Learning with Second Order Methods. Sue Becker and Yann le Cun, Sept 1988. Presented by Kasey Bray, October 2017.

Table of Contents: 1. Learning with Back-Propagation 2. Improving the Convergence of BP 3. A Computationally Feasible Approximation to Newton's Method 4. Analysis of the Approximation 5. Learning with the Pseudo-Newton Algorithm

Back-Propagation: The back-propagation algorithm performs a gradient descent search in weight space to determine the weights w_ij that minimize some cost function L: Δw_ij(t) = -ɛ ∂L/∂w_ij(t), where ɛ is a learning constant. Gradient-based numerical optimization methods are slow to converge. Can we apply second derivative-based methods to improve the convergence of the back-propagation algorithm?
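For concreteness, here is a minimal Python sketch of this plain gradient-descent update; the grad_L callable and the toy quadratic cost are assumptions for illustration, not the paper's network or cost function.

```python
import numpy as np

def gradient_descent_step(w, grad_L, eps=0.1):
    """One back-propagation style update: delta_w = -eps * dL/dw (eps is the learning constant)."""
    return w - eps * grad_L(w)

# Toy check on L(w) = 0.5 * ||w||^2, whose gradient is simply w.
w = np.array([1.0, -2.0])
for _ in range(50):
    w = gradient_descent_step(w, grad_L=lambda w: w, eps=0.1)
print(w)  # moves slowly toward the minimizer at the origin
```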

Table of Contents: 1. Learning with Back-Propagation 2. Improving the Convergence of BP 3. A Computationally Feasible Approximation to Newton's Method 4. Analysis of the Approximation 5. Learning with the Pseudo-Newton Algorithm

Acceleration Methods: Back-Propagation + Momentum. Add a fixed proportion of the previous weight change to the current weight change: Δw_ij(t) = -ɛ ∂L/∂w_ij(t) + α Δw_ij(t-1), where α is a momentum rate between 0 and 1. Accelerates learning once a stable descent direction has been found.
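A minimal sketch of the momentum update, again with an assumed grad_L callable; the state variable prev_dw carries the previous weight change Δw(t-1).

```python
import numpy as np

def momentum_step(w, prev_dw, grad_L, eps=0.1, alpha=0.9):
    """Momentum update: delta_w(t) = -eps * dL/dw(t) + alpha * delta_w(t-1)."""
    dw = -eps * grad_L(w) + alpha * prev_dw
    return w + dw, dw

# Toy usage on L(w) = 0.5 * ||w||^2 (gradient is w); the previous change starts at zero.
w, dw = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(50):
    w, dw = momentum_step(w, dw, grad_L=lambda w: w)
```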

Acceleration Methods: Conjugate Gradient Method. Generates a series of mutually conjugate search directions, and at each step the function is minimized along one of these conjugate directions: Δw_ij(t) = -ɛ ∂L/∂w_ij(t) + α_n Δw_ij(t-1). Good for large non-sparse optimization problems: uses only gradient information and doesn't require storage of a matrix. Converges linearly in n (where n is the number of weights), which is prohibitively slow for large problems.
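The slide does not say which conjugate-gradient variant is meant; below is a generic nonlinear CG sketch (Polak-Ribière+ coefficient and an Armijo backtracking line search, both my choices) that uses only gradient information and stores no matrix.

```python
import numpy as np

def nonlinear_cg(f, grad, w0, iters=200, tol=1e-8):
    """Nonlinear conjugate gradient (Polak-Ribiere+) with a simple backtracking line search."""
    w = np.asarray(w0, dtype=float)
    g = grad(w)
    d = -g                                           # first search direction: steepest descent
    for _ in range(iters):
        if np.linalg.norm(g) < tol:
            break
        alpha, c = 1.0, 1e-4                         # Armijo backtracking along d
        while f(w + alpha * d) > f(w) + c * alpha * (g @ d):
            alpha *= 0.5
        w = w + alpha * d
        g_new = grad(w)
        beta = max(0.0, g_new @ (g_new - g) / (g @ g))   # conjugacy coefficient (PR+)
        d = -g_new + beta * d                        # next direction, conjugate to the previous ones
        if g_new @ d >= 0:                           # safeguard: restart if not a descent direction
            d = -g_new
        g = g_new
    return w
```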

Second Derivative Methods: Require the Hessian matrix, ∂²L/∂w_ij². Newton's Method: Δw_ij(t) = -(∂L/∂w_ij(t)) / (∂²L/∂w_ij²(t)). Convergence is superlinear (quadratic under certain conditions). Requires a good starting point. Expensive in both storage and computation (O(n³) operations to invert the Hessian). If L is non-convex, can converge to a local maximum, saddle point, or minimum.
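A minimal sketch of the full Newton update for comparison, assuming hypothetical callables grad_L and hess_L that return the gradient vector and the full Hessian matrix.

```python
import numpy as np

def newton_step(w, grad_L, hess_L):
    """One full Newton update: solve H * delta = -grad, an O(n^3) linear solve per iteration."""
    return w + np.linalg.solve(hess_L(w), -grad_L(w))
```

When L is non-convex the Hessian need not be positive definite, so this step can move toward a saddle point or maximum, which is the caveat noted on the slide.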

Second Derivative Methods: Broyden-Fletcher-Goldfarb-Shanno (BFGS) Algorithm. Input x_0 and B_0. For k = 1, 2, ...
1. Solve B_k p_k = -∇f(x_k) for the direction p_k.
2. Perform a line search to obtain the step size α_k = argmin_α f(x_k + α p_k).
3. Set s_k = α_k p_k and update x_{k+1} = x_k + s_k.
4. Set y_k = ∇f(x_{k+1}) - ∇f(x_k).
5. Update B_{k+1} = B_k + (y_k y_k^T)/(y_k^T s_k) - (B_k s_k s_k^T B_k)/(s_k^T B_k s_k).
End.
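A sketch of the algorithm above; the Armijo backtracking line search and the curvature-skipping guard are my additions, since the slide only says "perform line search".

```python
import numpy as np

def bfgs(f, grad, x0, iters=100, tol=1e-8):
    """BFGS using the rank-two update of B_k from the slide and an Armijo backtracking line search."""
    x = np.asarray(x0, dtype=float)
    B = np.eye(len(x))                           # B_0: initial Hessian approximation
    for _ in range(iters):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        p = np.linalg.solve(B, -g)               # 1. solve B_k p_k = -grad f(x_k)
        alpha, c = 1.0, 1e-4                     # 2. backtracking line search for alpha_k
        while f(x + alpha * p) > f(x) + c * alpha * (g @ p):
            alpha *= 0.5
        s = alpha * p                            # 3. s_k = alpha_k p_k, x_{k+1} = x_k + s_k
        x = x + s
        y = grad(x) - g                          # 4. y_k = grad f(x_{k+1}) - grad f(x_k)
        if y @ s > 1e-12:                        # 5. rank-two update (skipped if curvature is not positive)
            Bs = B @ s
            B = B + np.outer(y, y) / (y @ s) - np.outer(Bs, Bs) / (s @ Bs)
    return x
```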

Second Derivative Methods: BFGS Algorithm. Good convergence properties. Invariant under linear transformations of the parameter space. Approximates the inverse Hessian in O(n²) operations. The Hessian update is SPD, hence the algorithm is numerically stable. Each iteration requires O(n²) operations, so the computational advantage probably only holds for small to moderately sized problems. Does not have locally computable weight update terms.

Table of Contents: 1. Learning with Back-Propagation 2. Improving the Convergence of BP 3. A Computationally Feasible Approximation to Newton's Method 4. Analysis of the Approximation 5. Learning with the Pseudo-Newton Algorithm

The Pseudo-Newton Step: Approximate the Hessian with a diagonal matrix: Δw_ij(t) = -(∂L/∂w_ij(t)) / (∂²L/∂w_ij²(t) + µ). Requires only O(n) operations. Hessian terms can be computed and stored locally. Trivial to invert. Can deal with negative curvature: no positive definite requirement. Do the diagonal elements of the Hessian well approximate the true Hessian?
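A minimal sketch of this diagonal update, assuming hypothetical grad_L and diag_hess_L callables that return, for every weight, ∂L/∂w_ij and ∂²L/∂w_ij².

```python
import numpy as np

def pseudo_newton_step(w, grad_L, diag_hess_L, mu=1.0):
    """Diagonal second-order update: delta_w_ij = -(dL/dw_ij) / (d2L/dw_ij^2 + mu), elementwise."""
    g = grad_L(w)                    # vector of first derivatives, one per weight
    h = diag_hess_L(w)               # vector of diagonal Hessian terms, one per weight
    return w - g / (h + mu)          # mu keeps the denominator away from zero; only O(n) work
```

The experiment slides later suggest the step is also scaled by a per-layer learning rate ɛ, which would enter here as an extra multiplicative factor.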

Table of Contents: 1. Learning with Back-Propagation 2. Improving the Convergence of BP 3. A Computationally Feasible Approximation to Newton's Method 4. Analysis of the Approximation 5. Learning with the Pseudo-Newton Algorithm

Three Small Problems: Compare the L1 norms and eigenvalues of the diagonal approximation, D, and the full Hessian, H, for three small problems: 1. A Tiny Problem: only three units (1 hidden) and one input pattern, with no thresholds. The input unit is always on and the desired output is 1. Two weights. 2. A 4:2:4 Encoder: four input patterns, 22 weights. The input vector has one bit on (the rest off), and the desired output pattern is identical to the input. 3. An 8:4:8 Encoder: eight input patterns, 76 weights. The input vector has one bit on (the rest off), and the desired output pattern is identical to the input.

Comparing Norms

Examining Eigenvalues: Eigenvalues of the Hessian yield information about the curvature of the error surface L. The spread between the largest and smallest eigenvalues indicates the eccentricity of L (or how badly the problem is conditioned). The clustering of the eigenvalues of D compared with those of H tells us to what extent the principal curvature of the error surface is captured in the diagonal terms.
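To make the comparison concrete, here is a hypothetical numpy sketch of this kind of check; it uses a random symmetric matrix as a stand-in Hessian (not the encoder networks' actual Hessians) and compares the spectrum of H with that of its diagonal D.

```python
import numpy as np

# Hypothetical stand-in Hessian: a random symmetric 10x10 matrix.
rng = np.random.default_rng(0)
A = rng.standard_normal((10, 10))
H = (A + A.T) / 2.0
D = np.diag(H)                          # diagonal approximation keeps only the d2L/dw_ij^2 terms

eig_H = np.sort(np.linalg.eigvalsh(H))  # full spectrum of H
eig_D = np.sort(D)                      # a diagonal matrix's eigenvalues are its diagonal entries

print("eigenvalue spread of H:", eig_H[-1] - eig_H[0])
print("eigenvalue spread of D:", eig_D[-1] - eig_D[0])
```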

Examining Eigenvalues Figure: Eigenvalue histogram for the 8:4:8 encoder. White bars for H, black bars for D. Top: at a random weight setting. Bottom: at the solution point.

Table of Contents: 1. Learning with Back-Propagation 2. Improving the Convergence of BP 3. A Computationally Feasible Approximation to Newton's Method 4. Analysis of the Approximation 5. Learning with the Pseudo-Newton Algorithm

Experiment Network Set-up: Create a problem which would be difficult for back-propagation to solve. Generate 64 patterns consisting of 32 bits randomly set to ±1, and assign each pattern to one of four classes. The network has 32 input units and 4 output units. Activation function g(x) = 1.7159 tanh(2x/3), so g(1) = 1 and g(-1) = -1. The cost function is the L2 norm. Desired outputs are +1 and -1. The relative learning rate set for each layer is ɛ/i (where i is the number of inputs to each unit in that layer).
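A small sketch of the stated activation function; the 1.7159 tanh(2x/3) scaling is taken from the slide, while the numerical check is my own.

```python
import numpy as np

def g(x):
    """Scaled tanh activation from the experiment: 1.7159 * tanh(2x/3)."""
    return 1.7159 * np.tanh(2.0 * x / 3.0)

print(g(1.0), g(-1.0))   # approximately +1 and -1, matching the desired outputs
```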

Compare with Back-propagation: Algorithms. Pseudo-Newton Method: ɛ = 1/i, and µ = 1. Online Back-propagation: weights adjusted after each pattern presentation; ɛ = .01/i. Batch Back-propagation: the gradient is accumulated over the whole training set, and then the weights are updated; ɛ = .04/i. Learning trials were repeated from 100 different random initial weight settings.

Compare with BP Figure: The mean and standard deviation of the error plotted over learning trials, for 100 repetitions and 1280 pattern presentations.

Discussion: With momentum and optimally tuned learning parameters, the algorithm is successful: the number of iterations required to learn to 100% correctness is reduced by a factor of 1.5 compared to batch BP and 2.5 compared to online BP. Under certain conditions the algorithm can fail: if the initial weights are very large or very small. Without momentum, it performs about the same as back-propagation. The values of µ and ɛ are critical in getting reasonable behavior with the pseudo-Newton algorithm.