Ch 12: Variations on Backpropagation


Ch 12: Variations on Backpropagation. The basic backpropagation algorithm is too slow for most practical applications; it may take days or weeks of computer time. We demonstrate why the backpropagation algorithm is slow in converging. We saw that steepest descent is the slowest minimization method; the conjugate gradient algorithm and Newton's method generally provide faster convergence.

Variations
Heuristic modifications: momentum and rescaling of variables, variable learning rate.
Standard numerical optimization: conjugate gradient, Newton's method (Levenberg-Marquardt).

Drawbacks of BP
We saw that the LMS algorithm is guaranteed to converge to a solution that minimizes the mean squared error, so long as the learning rate is not too large. For a single-layer linear network the performance index is quadratic: the Hessian matrix is constant, so the curvature is constant. Steepest descent backpropagation (SDBP) is a generalization of the LMS algorithm. A multilayer nonlinear network has many local minimum points, and the curvature can vary widely in different regions of the parameter space.

Performance Surface Example
Network architecture: 1-2-1 network. Nominal function parameter values: w^1_{1,1} = 10, w^1_{2,1} = 10, b^1_1 = -5, b^1_2 = 5, w^2_{1,1} = 1, w^2_{1,2} = 1, b^2 = -1.

Squared Error vs. w^1_{1,1} and w^2_{1,1}
The curvature varies drastically over the parameter space, so it is difficult to choose an appropriate learning rate for the steepest descent algorithm.

Squared Error vs. w^1_{1,1} and b^1_1 (optimum at w^1_{1,1} = 10, b^1_1 = -5)

Squared Error vs. b^1_1 and b^1_2 (optimum at b^1_1 = -5, b^1_2 = 5)

Convergence Example
We use a variation of the standard algorithm, called batching. In batching mode the parameters are updated only after the entire training set has been presented; the gradients calculated at each training example are averaged together to produce a more accurate estimate of the gradient. Batching smooths out training-sample outliers and makes learning independent of the order of sample presentation, but it is usually slower than sequential (incremental) mode.
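As a rough illustration of a batch-mode update (a minimal sketch; params, grad_fn, and the data arrays are hypothetical placeholders, not part of the original slides):

```python
import numpy as np

def batch_gradient_step(params, inputs, targets, grad_fn, lr=0.1):
    """One batch-mode update: average the per-example gradients, then take
    a single steepest descent step. grad_fn(params, x, t) is assumed to
    return the gradient of the squared error for one input/target pair."""
    grads = [grad_fn(params, x, t) for x, t in zip(inputs, targets)]
    avg_grad = np.mean(grads, axis=0)   # average over the whole training set
    return params - lr * avg_grad       # one update per presentation of the set
```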

Trajectory a converges to the optimal solution, but the convergence is slow. Trajectory b converges to a local minimum (w^1_{1,1} = 0.88, w^2_{1,1} = 38.6).


Learning Rate Too Large (demos: nnd12sd1, nnd12sd2)

Momentum Filter
y(k) = γ y(k-1) + (1 - γ) w(k),   0 ≤ γ < 1
Example input: w(k) = 1 + sin(2πk / 16)

Observations
The oscillation of the filter output is less than the oscillation in the filter input (low-pass filter). As γ is increased, the oscillation in the filter output is reduced. The average filter output is the same as the average filter input, although as γ is increased the filter output is slower to respond. To summarize, the filter tends to reduce the amount of oscillation while still tracking the average value.
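A minimal sketch of this filter applied to the example input (the loop structure and printed statistics are illustrative, not from the original slides):

```python
import numpy as np

# First-order momentum (low-pass) filter y(k) = gamma*y(k-1) + (1-gamma)*w(k)
# applied to the example input w(k) = 1 + sin(2*pi*k/16).
def momentum_filter(w, gamma):
    y = np.empty_like(w)
    y[0] = w[0]
    for k in range(1, len(w)):
        y[k] = gamma * y[k - 1] + (1 - gamma) * w[k]
    return y

k = np.arange(128)
w = 1 + np.sin(2 * np.pi * k / 16)
for gamma in (0.0, 0.9, 0.98):
    y = momentum_filter(w, gamma)[32:]   # skip the initial transient
    # Larger gamma: smaller oscillation, same average, slower response.
    print(f"gamma={gamma:4.2f}  mean={y.mean():.3f}  peak-to-peak={y.max() - y.min():.3f}")
```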

Momentum Backpropagation
Steepest Descent Backpropagation (SDBP):
ΔW^m(k) = -α s^m (a^{m-1})^T
Δb^m(k) = -α s^m
Momentum Backpropagation (MOBP):
ΔW^m(k) = γ ΔW^m(k-1) - (1 - γ) α s^m (a^{m-1})^T
Δb^m(k) = γ Δb^m(k-1) - (1 - γ) α s^m
with γ = 0.8.

The batching form of MOBP, in which the parameters are updated only after the entire example set has been presented. The same initial condition and learning rate are used as in the previous example, in which the algorithm was not stable. The algorithm is now stable, and it tends to accelerate convergence when the trajectory is moving in a consistent direction. (Demo: nnd12mo)
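A minimal sketch of the MOBP update for a single layer, assuming the sensitivity s^m and the previous-layer output a^{m-1} have already been produced by standard backpropagation (all names here are illustrative placeholders):

```python
import numpy as np

def mobp_update(W, b, dW_prev, db_prev, s_m, a_prev, alpha=0.1, gamma=0.8):
    """One momentum-backpropagation step for one layer: the new increment is a
    low-pass-filtered version of the steepest descent increment."""
    dW = gamma * dW_prev - (1 - gamma) * alpha * np.outer(s_m, a_prev)
    db = gamma * db_prev - (1 - gamma) * alpha * s_m
    return W + dW, b + db, dW, db   # keep dW, db for the next iteration
```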

Variable Learning Rate (VLBP)
1. If the squared error (over the entire training set) increases by more than some set percentage ζ after a weight update, then the weight update is discarded, the learning rate is multiplied by some factor ρ (0 < ρ < 1), and the momentum coefficient γ is set to zero.
2. If the squared error decreases after a weight update, then the weight update is accepted and the learning rate is multiplied by some factor η > 1. If γ has previously been set to zero, it is reset to its original value.
3. If the squared error increases by less than ζ, then the weight update is accepted, but the learning rate and the momentum coefficient are unchanged.

Example: η = 1.05, ρ = 0.7, ζ = 4%. (Demo: nnd12vl)
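A minimal sketch of the three VLBP rules above (train_error and proposed_step are hypothetical placeholders: the first returns the squared error over the whole training set, the second returns a candidate momentum-based update):

```python
def vlbp_step(params, lr, gamma, gamma0, train_error, proposed_step,
              eta=1.05, rho=0.7, zeta=0.04):
    """Apply one tentative update and accept/reject it per the VLBP rules."""
    old_err = train_error(params)
    candidate = params + proposed_step(params, lr, gamma)
    new_err = train_error(candidate)
    if new_err > old_err * (1 + zeta):      # error grew by more than zeta:
        return params, lr * rho, 0.0        # discard step, shrink lr, zero momentum
    if new_err < old_err:                   # error decreased:
        return candidate, lr * eta, gamma0  # accept step, grow lr, restore momentum
    return candidate, lr, gamma             # small increase: accept, change nothing
```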

Convergence characteristics of variable learning rate: plots of the squared error and the learning rate α versus iteration number.

Other algorithms
Adaptive learning rate (delta-bar-delta method) [Jacobs 88]: each weight w_{jk} has its own rate α_{jk}. If the update of w_{jk} keeps the same direction, increase α_{jk} (F has a smooth curve in the vicinity of the current W); if the update of w_{jk} changes direction, decrease α_{jk} (F has a rough curve in the vicinity of the current W). Delta-bar-delta also involves a momentum term.
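A minimal sketch of per-weight rate adaptation in the spirit of delta-bar-delta (the constants kappa, phi, and theta are illustrative choices, not values from the slides):

```python
import numpy as np

def delta_bar_delta_rates(rates, delta_bar, grad, kappa=0.01, phi=0.5, theta=0.7):
    """Update one learning rate per weight: grow it additively when the current
    gradient agrees in sign with the running average (delta-bar), shrink it
    multiplicatively when the sign flips."""
    rates = np.where(delta_bar * grad > 0, rates + kappa, rates)
    rates = np.where(delta_bar * grad < 0, rates * phi, rates)
    delta_bar = (1 - theta) * grad + theta * delta_bar   # running gradient average
    return rates, delta_bar
```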

Quickprop algorithm of Fahlman (1988): it assumes that the error surface is parabolic and concave upward around the minimum point, and that the effect of each weight can be considered independently. SuperSAB algorithm of Tollenaere (1990): it has more complex rules for adjusting the learning rates.
Drawbacks: in SDBP we have only one parameter to select, but in the heuristic modifications we sometimes have six parameters to select. Sometimes the modifications fail to converge on problems where SDBP will eventually find a solution.

Experimental Comparison
Training for the XOR problem (batch mode). 25 simulations; a run counts as a success if E, averaged over 50 consecutive epochs, is less than 0.04. Results:

Method                     Simulations   Successes   Mean epochs
BP                         25            24          6,859.8
BP with momentum           25            25          2,056.3
BP with delta-bar-delta    25            22          447.3

Conjugate Gradient
We saw that steepest descent is the simplest optimization method but is often slow in converging. Newton's method is much faster, but requires that the Hessian matrix and its inverse be calculated. The conjugate gradient method is a compromise: it does not require the calculation of second derivatives, and yet it still has the quadratic convergence property. We now describe how the conjugate gradient algorithm can be used to train multilayer networks. This algorithm is called Conjugate Gradient Backpropagation (CGBP).

Review of the CG Algorithm
1. The first search direction is the steepest descent direction:
   p_0 = -g_0,   where g_k ≡ ∇F(x)|_{x=x_k}
2. Take a step, choosing the learning rate α_k to minimize the function along the search direction:
   x_{k+1} = x_k + α_k p_k
3. Select the next search direction according to:
   p_k = -g_k + β_k p_{k-1}
   where
   β_k = (Δg_{k-1}^T g_k) / (Δg_{k-1}^T p_{k-1})   or   β_k = (g_k^T g_k) / (g_{k-1}^T g_{k-1})   or   β_k = (Δg_{k-1}^T g_k) / (g_{k-1}^T g_{k-1})

This cannot be applied directly to neural network training, because the performance index is not quadratic. We cannot use the quadratic line-search formula
α_k = -(∇F(x)^T|_{x=x_k} p_k) / (p_k^T ∇²F(x)|_{x=x_k} p_k) = -(g_k^T p_k) / (p_k^T A_k p_k)
to minimize the function along a line. Also, the exact minimum will not normally be reached in a finite number of steps, so the algorithm needs to be reset after some set number of iterations. Locating the minimum of a function along a line is done in two stages: interval location and interval reduction.
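A minimal sketch of CGBP's outer loop under these constraints, using the Fletcher-Reeves β and a periodic reset to steepest descent (grad_fn and line_minimize are hypothetical placeholders; line_minimize stands in for the interval location/reduction search described next):

```python
import numpy as np

def cg_minimize(x, grad_fn, line_minimize, n_reset, iters=100):
    """Conjugate gradient with approximate line searches and periodic resets."""
    g = grad_fn(x)
    p = -g
    for k in range(1, iters + 1):
        alpha = line_minimize(x, p)           # approximate 1-D minimization along p
        x = x + alpha * p
        g_new = grad_fn(x)
        if k % n_reset == 0:
            p = -g_new                        # reset to steepest descent
        else:
            beta = (g_new @ g_new) / (g @ g)  # Fletcher-Reeves
            p = -g_new + beta * p
        g = g_new
    return x
```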

Interval Location: step along the search direction with successively doubled step sizes until the function value increases, so that the minimum is bracketed.

Interval Reduction: repeatedly narrow the bracketing interval (e.g., by golden section search) until it is smaller than a set tolerance.

Golden Section Search (τ = 0.618)
Set c_1 = a_1 + (1 - τ)(b_1 - a_1), F_c = F(c_1)
    d_1 = b_1 - (1 - τ)(b_1 - a_1), F_d = F(d_1)
For k = 1, 2, ... repeat
    If F_c < F_d then
        Set a_{k+1} = a_k; b_{k+1} = d_k; d_{k+1} = c_k
            c_{k+1} = a_{k+1} + (1 - τ)(b_{k+1} - a_{k+1})
            F_d = F_c; F_c = F(c_{k+1})
    else
        Set a_{k+1} = c_k; b_{k+1} = b_k; c_{k+1} = d_k
            d_{k+1} = b_{k+1} - (1 - τ)(b_{k+1} - a_{k+1})
            F_c = F_d; F_d = F(d_{k+1})
    end
until b_{k+1} - a_{k+1} < tolerance
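A runnable version of the pseudocode above (a minimal sketch; f is any one-dimensional function and [a, b] is assumed to already bracket a minimum, as produced by the interval-location stage):

```python
def golden_section(f, a, b, tol=1e-6):
    """Golden section search: shrink [a, b] until it is smaller than tol."""
    tau = 0.618
    c = a + (1 - tau) * (b - a); Fc = f(c)
    d = b - (1 - tau) * (b - a); Fd = f(d)
    while (b - a) > tol:
        if Fc < Fd:
            b, d = d, c                   # keep [a, d], reuse c as the new d
            c = a + (1 - tau) * (b - a)
            Fd, Fc = Fc, f(c)
        else:
            a, c = c, d                   # keep [c, b], reuse d as the new c
            d = b - (1 - tau) * (b - a)
            Fc, Fd = Fd, f(d)
    return (a + b) / 2

# Example: the minimum of (x - 2)^2 on [0, 5] is found near x = 2.
print(golden_section(lambda x: (x - 2) ** 2, 0.0, 5.0))
```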

For quadratic functions the algorithm will converge to the minimum in at most n (the number of parameters) iterations; this normally does not happen for multilayer networks. The development of the CG algorithm does not indicate what search direction to use once a cycle of n iterations has been completed. The simplest method is to reset the search direction to the steepest descent direction after n iterations. In the following function approximation example we use the BP algorithm to compute the gradient and the CG algorithm to determine the weight updates. This is a batch mode algorithm.

Conjugate Gradient BP (CGBP). (Demos: nnd12ls, nnd12cg)

Newton's Method
x_{k+1} = x_k - A_k^{-1} g_k,   where A_k ≡ ∇²F(x)|_{x=x_k} and g_k ≡ ∇F(x)|_{x=x_k}
If the performance index is a sum-of-squares function:
F(x) = Σ_{i=1}^{N} v_i²(x) = v^T(x) v(x)
then the jth element of the gradient is
[∇F(x)]_j = ∂F(x)/∂x_j = 2 Σ_{i=1}^{N} v_i(x) ∂v_i(x)/∂x_j

Matrix Form
The gradient can be written in matrix form:
∇F(x) = 2 J^T(x) v(x)
where J(x) is the N × n Jacobian matrix:
J(x) = [ ∂v_1/∂x_1    ∂v_1/∂x_2    ...    ∂v_1/∂x_n
         ∂v_2/∂x_1    ∂v_2/∂x_2    ...    ∂v_2/∂x_n
         ...
         ∂v_N/∂x_1    ∂v_N/∂x_2    ...    ∂v_N/∂x_n ]

Now we want to find the Hessian matrix. Its (k, j) element is
[∇²F(x)]_{k,j} = ∂²F(x)/(∂x_k ∂x_j) = 2 Σ_{i=1}^{N} { (∂v_i(x)/∂x_k)(∂v_i(x)/∂x_j) + v_i(x) ∂²v_i(x)/(∂x_k ∂x_j) }
In matrix form:
∇²F(x) = 2 J^T(x) J(x) + 2 S(x),   where S(x) = Σ_{i=1}^{N} v_i(x) ∇²v_i(x)

Gauss-Newton Method
Approximate the Hessian matrix as (assuming that S(x) is small):
∇²F(x) ≈ 2 J^T(x) J(x)
We had ∇F(x) = 2 J^T(x) v(x), so Newton's method x_{k+1} = x_k - A_k^{-1} g_k becomes:
x_{k+1} = x_k - [2 J^T(x_k) J(x_k)]^{-1} 2 J^T(x_k) v(x_k) = x_k - [J^T(x_k) J(x_k)]^{-1} J^T(x_k) v(x_k)

We call this the Gauss-Newton method. Note that the advantage of Gauss-Newton over the standard Newton's method is that it does not require the calculation of second derivatives.
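A minimal sketch of a Gauss-Newton iteration for a generic least-squares problem (residual_fn and jacobian_fn are hypothetical placeholders returning v(x) and J(x)):

```python
import numpy as np

def gauss_newton(x, residual_fn, jacobian_fn, iters=20):
    """Repeatedly solve the normal equations (J^T J) dx = -J^T v rather than
    forming the matrix inverse explicitly."""
    for _ in range(iters):
        v = residual_fn(x)
        J = jacobian_fn(x)
        x = x + np.linalg.solve(J.T @ J, -J.T @ v)
    return x
```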

Levenberg-Marquardt
Gauss-Newton approximates the Hessian by H = J^T J. This matrix may be singular, but it can be made invertible as follows:
G = H + μI
If the eigenvalues and eigenvectors of H are {λ_1, λ_2, ..., λ_n} and {z_1, z_2, ..., z_n}, then
G z_i = [H + μI] z_i = H z_i + μ z_i = λ_i z_i + μ z_i = (λ_i + μ) z_i
so the eigenvectors of G are the same as the eigenvectors of H, and the eigenvalues of G are (λ_i + μ). G can be made positive definite by increasing μ until λ_i + μ > 0 for all i. This gives the Levenberg-Marquardt update:
x_{k+1} = x_k - [J^T(x_k) J(x_k) + μ_k I]^{-1} J^T(x_k) v(x_k)

Adjustment of μ_k
As μ_k → 0, LM becomes Gauss-Newton:
x_{k+1} = x_k - [J^T(x_k) J(x_k)]^{-1} J^T(x_k) v(x_k)
As μ_k → ∞, LM becomes steepest descent with a small learning rate:
x_{k+1} ≈ x_k - (1/μ_k) J^T(x_k) v(x_k) = x_k - (1/(2μ_k)) ∇F(x_k)
Therefore, begin with a small μ_k to use Gauss-Newton and speed convergence. If a step does not yield a smaller F(x), then repeat the step with an increased μ_k until F(x) is decreased. F(x) must decrease eventually, since we will be taking a very small step in the steepest descent direction.
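A minimal sketch of one Levenberg-Marquardt step with this μ adjustment rule (residual_fn and jacobian_fn are hypothetical placeholders, and theta is the grow/shrink factor for μ):

```python
import numpy as np

def lm_step(x, mu, residual_fn, jacobian_fn, theta=10.0):
    """Increase mu until the candidate step reduces the sum of squared errors,
    then accept the step and decrease mu for the next iteration."""
    v = residual_fn(x)
    J = jacobian_fn(x)
    F_old = v @ v
    while True:
        dx = np.linalg.solve(J.T @ J + mu * np.eye(len(x)), -J.T @ v)
        v_new = residual_fn(x + dx)
        if v_new @ v_new < F_old:
            return x + dx, mu / theta   # improvement: accept step, shrink mu
        mu *= theta                     # no improvement: grow mu and retry
```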

Application to Multilayer Networks
The performance index for the multilayer network (assuming each input occurs with equal probability) is:
F(x) = Σ_{q=1}^{Q} (t_q - a_q)^T (t_q - a_q) = Σ_{q=1}^{Q} e_q^T e_q = Σ_{q=1}^{Q} Σ_{j=1}^{S^M} (e_{j,q})² = Σ_{i=1}^{N} v_i²
where e_{j,q} is the jth element of the error for the qth input/target pair. This is similar to the performance index for which LM was designed. In standard BP we compute the derivatives of the squared errors with respect to the weights and biases; to create the matrix J we need to compute the derivatives of the individual errors.

The error vector is:
v^T = [v_1 v_2 ... v_N] = [e_{1,1} e_{2,1} ... e_{S^M,1} e_{1,2} ... e_{S^M,Q}]
The parameter vector is:
x^T = [x_1 x_2 ... x_n] = [w^1_{1,1} w^1_{1,2} ... w^1_{S^1,R} b^1_1 ... b^1_{S^1} w^2_{1,1} ... b^M_{S^M}]
The dimensions of the two vectors are:
N = Q × S^M,   n = S^1(R + 1) + S^2(S^1 + 1) + ... + S^M(S^{M-1} + 1)
If we make these substitutions into the Jacobian matrix for multilayer network training, we have:

Jacobian Matrix
J(x) = [ ∂e_{1,1}/∂w^1_{1,1}    ∂e_{1,1}/∂w^1_{1,2}    ...    ∂e_{1,1}/∂w^1_{S^1,R}    ∂e_{1,1}/∂b^1_1    ...
         ∂e_{2,1}/∂w^1_{1,1}    ∂e_{2,1}/∂w^1_{1,2}    ...    ∂e_{2,1}/∂w^1_{S^1,R}    ∂e_{2,1}/∂b^1_1    ...
         ...
         ∂e_{S^M,1}/∂w^1_{1,1}  ∂e_{S^M,1}/∂w^1_{1,2}  ...    ∂e_{S^M,1}/∂w^1_{S^1,R}  ∂e_{S^M,1}/∂b^1_1  ...
         ∂e_{1,2}/∂w^1_{1,1}    ∂e_{1,2}/∂w^1_{1,2}    ...    ∂e_{1,2}/∂w^1_{S^1,R}    ∂e_{1,2}/∂b^1_1    ...
         ... ]   (N × n)

Computing the Jacobian
SDBP computes terms like:
∂F̂(x)/∂x_l = ∂(e_q^T e_q)/∂x_l
using the chain rule:
∂F̂/∂w^m_{i,j} = (∂F̂/∂n^m_i)(∂n^m_i/∂w^m_{i,j})
where the sensitivity s^m_i ≡ ∂F̂/∂n^m_i is computed using backpropagation. For the Jacobian we need to compute terms like:
[J]_{h,l} = ∂v_h/∂x_l = ∂e_{k,q}/∂x_l

Marquardt Sensitivity
If we define a Marquardt sensitivity:
s̃^m_{i,h} ≡ ∂v_h/∂n^m_{i,q} = ∂e_{k,q}/∂n^m_{i,q},   where h = (q - 1) S^M + k
then we can compute the Jacobian as follows.
For a weight element x_l = w^m_{i,j}:
[J]_{h,l} = ∂v_h/∂x_l = ∂e_{k,q}/∂w^m_{i,j} = (∂e_{k,q}/∂n^m_{i,q})(∂n^m_{i,q}/∂w^m_{i,j}) = s̃^m_{i,h} × a^{m-1}_{j,q}
For a bias element x_l = b^m_i:
[J]_{h,l} = ∂v_h/∂x_l = ∂e_{k,q}/∂b^m_i = (∂e_{k,q}/∂n^m_{i,q})(∂n^m_{i,q}/∂b^m_i) = s̃^m_{i,h}

Computing the Sensitivities: Initialization
s̃^M_{i,h} = ∂v_h/∂n^M_{i,q} = ∂e_{k,q}/∂n^M_{i,q} = ∂(t_{k,q} - a^M_{k,q})/∂n^M_{i,q} = -∂a^M_{k,q}/∂n^M_{i,q}
so that
s̃^M_{i,h} = -ḟ^M(n^M_{i,q})  for i = k,   and   s̃^M_{i,h} = 0  for i ≠ k.
Therefore, when the input p_q has been applied to the network and the corresponding network output a^M_q has been computed, LMBP is initialized with
S̃^M_q = -Ḟ^M(n^M_q)

where Ḟ^M(n^M) is the diagonal matrix
Ḟ^M(n^M) = diag( ḟ^M(n^M_1), ḟ^M(n^M_2), ..., ḟ^M(n^M_{S^M}) )
Each column of the matrix S̃^M_q must be backpropagated through the network using the sensitivity recurrence from Chapter 11,
s^m = Ḟ^m(n^m) (W^{m+1})^T s^{m+1},
to produce one row of the Jacobian matrix.

The columns can also be backpropagated together using
S̃^m_q = Ḟ^m(n^m_q) (W^{m+1})^T S̃^{m+1}_q
The total Marquardt sensitivity matrices for each layer are then created by augmenting the matrices computed for each input:
S̃^m = [ S̃^m_1 | S̃^m_2 | ... | S̃^m_Q ]
Note that for each input we will backpropagate S^M sensitivity vectors, because we compute the derivatives of each individual error rather than the derivative of the sum of squares of the errors. For every input we have S^M errors, and for each error there will be one row of the Jacobian matrix.
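Because assembling J from Marquardt sensitivities is easy to get wrong, a finite-difference check is handy. A minimal sketch, assuming forward_error is a hypothetical function that returns the error vector v(x) for a stacked parameter vector x:

```python
import numpy as np

def numerical_jacobian(forward_error, x, eps=1e-6):
    """Finite-difference approximation of J(x) = dv/dx for comparison against
    a Jacobian assembled from backpropagated Marquardt sensitivities."""
    v0 = forward_error(x)
    J = np.zeros((v0.size, x.size))
    for l in range(x.size):
        x_pert = x.copy()
        x_pert[l] += eps
        J[:, l] = (forward_error(x_pert) - v0) / eps
    return J
```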

After the sensitivities have been backpropagated, the elements of the Jacobian matrix are computed using:
[J]_{h,l} = ∂e_{k,q}/∂w^m_{i,j} = s̃^m_{i,h} × a^{m-1}_{j,q}   (weight elements)
[J]_{h,l} = ∂e_{k,q}/∂b^m_i = s̃^m_{i,h}   (bias elements)

LMBP (summarized)
1. Present all inputs to the network and compute the corresponding network outputs and the errors e_q = t_q - a^M_q. Compute the sum of squared errors over all inputs:
   F(x) = Σ_q (t_q - a_q)^T (t_q - a_q) = Σ_q e_q^T e_q = Σ_q Σ_j (e_{j,q})² = Σ_i v_i²
2. Compute the Jacobian matrix: calculate the sensitivities with the backpropagation algorithm after initializing, augment the individual matrices into the Marquardt sensitivities, and compute the elements of the Jacobian matrix:

S̃^M_q = -Ḟ^M(n^M_q)
S̃^m_q = Ḟ^m(n^m_q) (W^{m+1})^T S̃^{m+1}_q,   m = M-1, ..., 2, 1
S̃^m = [ S̃^m_1 | S̃^m_2 | ... | S̃^m_Q ]
[J]_{h,l} = s̃^m_{i,h} × a^{m-1}_{j,q}   (for weight w^m_{i,j}),   [J]_{h,l} = s̃^m_{i,h}   (for bias b^m_i)

3. Solve the following equation to obtain the change in the weights:
   Δx_k = -[J^T(x_k) J(x_k) + μ_k I]^{-1} J^T(x_k) v(x_k)
4. Recompute the sum of squared errors with the new weights x_k + Δx_k. If this new sum of squares is smaller than that computed in step 1, then divide μ_k by ϑ, update the weights (x_{k+1} = x_k + Δx_k), and go back to step 1. If the sum of squares is not reduced, then multiply μ_k by ϑ and go back to step 3.
The algorithm is assumed to have converged when the norm of the gradient is less than some predetermined value, or when the sum of squares has been reduced to some error goal. See P12.5 for a numerical illustration of the Jacobian computation.
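Putting the pieces together, here is an end-to-end sketch that trains a tiny 1-2-1 network with an LM loop of this form. For brevity it uses a finite-difference Jacobian in place of the Marquardt-sensitivity computation, tanh in place of the log-sigmoid, and illustrative constants throughout; none of these specifics come from the original slides:

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.linspace(-2, 2, 21)                    # inputs
t = 1 + np.sin(np.pi * p / 4)                 # targets

def errors(x):
    """v(x): one error per training sample for a 1-2-1 network."""
    W1, b1, W2, b2 = x[0:2], x[2:4], x[4:6], x[6]
    a1 = np.tanh(np.outer(p, W1) + b1)        # hidden layer, shape (21, 2)
    a2 = a1 @ W2 + b2                         # linear output layer
    return t - a2

def numerical_jacobian(f, x, eps=1e-6):
    v0 = f(x)
    J = np.zeros((v0.size, x.size))
    for l in range(x.size):
        x_pert = x.copy()
        x_pert[l] += eps
        J[:, l] = (f(x_pert) - v0) / eps
    return J

x = rng.normal(scale=0.5, size=7)             # stacked parameter vector
mu, theta = 0.01, 10.0
for _ in range(50):                           # outer LM iterations
    v = errors(x)
    J = numerical_jacobian(errors, x)
    while True:                               # inner mu adjustment
        dx = np.linalg.solve(J.T @ J + mu * np.eye(x.size), -J.T @ v)
        v_new = errors(x + dx)
        if v_new @ v_new < v @ v:
            x, mu = x + dx, mu / theta        # accept step, shrink mu
            break
        mu *= theta                           # reject step, grow mu
        if mu > 1e10:                         # safeguard for this sketch
            break
print("final sum of squared errors:", errors(x) @ errors(x))
```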

Example LMBP Step
Black arrow: small μ_k (Gauss-Newton direction). Blue arrow: large μ_k (steepest descent direction). Blue curve: LM step for intermediate values of μ_k.

LMBP Trajectory (demos: nnd12ms, nnd12m)
Storage requirement: n × n for the approximate Hessian matrix J^T J.
HW9 - Ch 12: 3, 6, 8, 13, 15