Conjugate gradient algorithm for training neural networks


1. Introduction

Recall that in the steepest-descent neural network training algorithm, consecutive line-search directions are orthogonal, such that,

$g[w(t+1)]^T d(t) = 0,$  (1)

where $g[w(t+1)]$ denotes $\nabla E[w(t+1)]$, the gradient of the error $E$ with respect to the weights $w(t+1)$ at step $(t+1)$ of the training algorithm, and $d(t)$ denotes the line-search direction at step $t$ of the training algorithm. In steepest descent, of course,

$d(t) = -g(t),$  (2)

and,

$w(t+1) = w(t) - \eta(t)\,g(t),$  (3)

where $\eta(t)$ is determined at each step through line minimization.

2. Deriving the conjugate gradient algorithm

A. Introduction

As we have seen, choosing consecutive line-search directions as in equation (2) can lead to oscillatory behavior in the training algorithm, even for simple quadratic error surfaces. This can significantly slow down convergence of the algorithm. To combat this, we should ideally choose consecutive line-search directions that are non-interfering; in other words, consecutive search directions that do not undo the progress made by previous search directions. Suppose we have executed a line search along direction $d(t)$ at time step $t$; then the next search direction $d(t+1)$ should be chosen so that,

$g[w(t+1) + \eta\,d(t+1)]^T d(t) = 0, \quad \forall\eta.$  (4)

Condition (4) explicitly preserves the progress made in the previous line search (see the corresponding figure in [1]), and will guide us through our development of the conjugate gradient training algorithm. Below, we closely follow the derivation in [1], but provide somewhat greater detail in certain parts of the derivation.

B. First step

Lemma: Condition (4) implies,

$d(t+1)^T H\,d(t) = 0$  (5)

to first order, where $H$ is the Hessian matrix with elements $H(i,j)$,

$H(i,j) = \frac{\partial^2 E}{\partial\omega_i\,\partial\omega_j},$  (6)

evaluated at $w(t+1)$, where,

$w = [\,\omega_1\ \omega_2\ \cdots\ \omega_W\,]^T.$  (7)

Proof: First, let us approximate $g(w)$ by its first-order Taylor expansion about $w(t+1)$,

$g(w)^T \approx g[w(t+1)]^T + [w - w(t+1)]^T\,\nabla\{g[w(t+1)]\},$  (8)

and evaluate the right-hand side of equation (8) at $w(t+1) + \eta\,d(t+1)$:

$g[w(t+1) + \eta\,d(t+1)]^T \approx g[w(t+1)]^T + \eta\,d(t+1)^T H.$  (9)

Note from equation (9) that,

$H = \nabla\{\nabla E[w(t+1)]\} = \nabla\{g[w(t+1)]\}.$  (10)

Substituting equation (9) into (4),

$\{g[w(t+1)]^T + \eta\,d(t+1)^T H\}\,d(t) = 0.$  (11)

From (11),

$g[w(t+1)]^T d(t) + \eta\,d(t+1)^T H\,d(t) = 0.$  (12)

From (1),

$g[w(t+1)]^T d(t) = 0,$  (13)

so that equation (12) reduces to,

$\eta\,d(t+1)^T H\,d(t) = 0.$  (14)

Equation (14) implies either $\eta = 0$ and/or $d(t+1)^T H\,d(t) = 0$. We can dismiss the first of these implications ($\eta = 0$), since a zero learning parameter would mean that we are not changing the weights. Therefore, (5) follows from (4) and the proof is complete.

Note that for quadratic error surfaces, the first-order Taylor approximation in (8) is exact. For non-quadratic error surfaces, it is approximately correct in the locally quadratic neighborhood of the weight vector $w(t+1)$. Search directions that meet condition (5) are called H-orthogonal, or conjugate with respect to $H$. Our overall goal in developing the conjugate gradient algorithm is to construct consecutive search directions $d(t)$ that meet condition (5) without explicit computation of the large Hessian matrix $H$. Note that an algorithm that uses H-orthogonal search directions will converge to the minimum of the error surface (for quadratic or locally quadratic error surfaces) in at most $W$ steps, where $W$ is the number of weights in the neural network.

C. Linear algebra preliminaries

In our derivation of the conjugate gradient algorithm, we will assume a locally quadratic error surface of the following form:

$E(w) \approx E_0 + b^T w + \tfrac{1}{2}\,w^T H w,$  (15)

where $E_0$ is a constant scalar, $b$ is a constant vector, and $H$ is a constant matrix (i.e. the Hessian). Furthermore, we assume that $H$ is symmetric and positive definite, a property that we will sometimes denote by $H > 0$.

Definition: A square symmetric matrix $H$ is positive definite if and only if all its eigenvalues $\lambda_i$ are greater than zero. If a matrix is positive definite, then,

$v^T H v > 0, \quad \forall v \ne 0.$  (16)

Note that for a quadratic error surface, $H$ is constant; therefore, if it is positive definite at one point, it is positive definite everywhere.

Consider, for example, the following simple quadratic error surface of two weights,

$E = 2\omega_1^2 + \omega_2^2.$  (17)

For (17), we have previously computed the Hessian to be,

$H = \begin{bmatrix} 4 & 0 \\ 0 & 2 \end{bmatrix},$  (18)

with eigenvalues $\lambda_1 = 4$ and $\lambda_2 = 2$. Hence, $H$ is positive definite. Note also that for any error function with continuous second partial derivatives, the Hessian is guaranteed to be symmetric since,

$\frac{\partial^2 E}{\partial\omega_i\,\partial\omega_j} = \frac{\partial^2 E}{\partial\omega_j\,\partial\omega_i}.$  (19)
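A minimal NumPy sketch of this positive-definiteness check: it estimates the Hessian of the example surface (17) by central finite differences and verifies that all eigenvalues are positive. The finite-difference helper and the evaluation point are illustrative choices, not part of the derivation.

```python
import numpy as np

def E(w):
    # Quadratic error surface of equation (17): E = 2*w1^2 + w2^2.
    return 2.0 * w[0]**2 + w[1]**2

def hessian(f, w, eps=1e-4):
    # Central finite-difference estimate of the Hessian of f at w.
    n = len(w)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n); ei[i] = eps
            ej = np.zeros(n); ej[j] = eps
            H[i, j] = (f(w + ei + ej) - f(w + ei - ej)
                       - f(w - ei + ej) + f(w - ei - ej)) / (4.0 * eps**2)
    return H

w = np.array([1.0, 1.0])                 # arbitrary evaluation point
H = hessian(E, w)
eigvals = np.linalg.eigvalsh(H)          # eigenvalues of the symmetric Hessian
print("H =\n", np.round(H, 3))           # expect [[4, 0], [0, 2]]
print("eigenvalues:", np.round(eigvals, 3))
print("positive definite:", bool(np.all(eigvals > 0)))
```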

Theorem: For a positive-definite square matrix $H$, H-orthogonal vectors $\{d_1, d_2, \ldots, d_k\}$ are linearly independent.

Proof: For linear independence, we must show that,

$\alpha_1 d_1 + \alpha_2 d_2 + \cdots + \alpha_k d_k = 0$  (20)

if and only if $\alpha_i = 0$, $\forall i$. In equation (20), $0$ denotes a vector with the same dimensionality as the $d_i$ and all zero elements. Let us first pre-multiply equation (20) by $d_i^T H$:

$\alpha_1\, d_i^T H d_1 + \alpha_2\, d_i^T H d_2 + \cdots + \alpha_k\, d_i^T H d_k = d_i^T H\,0$  (21)

$\alpha_1\, d_i^T H d_1 + \alpha_2\, d_i^T H d_2 + \cdots + \alpha_k\, d_i^T H d_k = 0.$  (22)

Note that since we have assumed that the vectors $d_i$, $i \in \{1, 2, \ldots, k\}$, are H-orthogonal, i.e.,

$d_i^T H d_j = 0, \quad i \ne j,$  (23)

all terms in (22) drop out except the $\alpha_i\, d_i^T H d_i$ term, so that (22) reduces to,

$\alpha_i\, d_i^T H d_i = 0.$  (24)

Since $H$ is positive definite, however, we know from (16) that,

$d_i^T H d_i > 0,$  (25)

so that in order for (24) to be true,

$\alpha_i = 0, \quad \forall i \in \{1, 2, \ldots, k\},$  (26)

and the proof is complete.

From this theorem, it follows directly that in a $W$-dimensional space, $W$ H-orthogonal vectors $d_i$, $i \in \{1, 2, \ldots, W\}$, form a complete, albeit non-orthogonal, basis set. This means that any vector $v$ in that space may be expressed as a linear combination of the $d_i$ vectors, i.e.,

$v = \sum_{i=1}^{W} \alpha_i d_i.$  (27)

D. The setup

This section establishes the foundations of the conjugate gradient algorithm. In our derivation here and in subsequent sections, we will assume a locally quadratic error surface of form (15),

$E(w) = E_0 + b^T w + \tfrac{1}{2}\,w^T H w.$  (28)

For now, let us also assume a set of $W$ H-orthogonal (or conjugate) vectors $d_i$, $i \in \{1, 2, \ldots, W\}$. Given these conjugate vectors, let us see how we can move from some initial (nonzero) weight vector $w_1$ to the minimum $w^*$ of error surface (28). From (27) we can express the difference between $w^*$ and $w_1$ as,

$w^* - w_1 = \sum_{i=1}^{W} \alpha_i d_i,$  (29)

so that,

$w^* = w_1 + \sum_{i=1}^{W} \alpha_i d_i.$  (30)

Then, if we define,

$w_j = w_1 + \sum_{i=1}^{j-1} \alpha_i d_i,$  (31)

we can rewrite equation (30) recursively as,

$w_{j+1} = w_j + \alpha_j d_j.$  (32)

Note that equation (32) represents a sequence of steps in weight space from $w_1$ to $w^*$. These steps are parallel to the conjugate directions $d_j$, and the length of each step is controlled by $\alpha_j$. Equation (32) forms the basis of the conjugate gradient algorithm for training neural networks, and is similar to previous training algorithms that we have studied in that it offers a recursive minimization procedure for the weights $w$. The big difference from previous algorithms is, of course, that $d_j \ne -g_j$, whereas $d_j = -g_j$ for the gradient and steepest descent algorithms. Also, the step size $\alpha_j$ is not held constant, as is typical in gradient descent.

At this point, several big questions remain unanswered, however. First, how do we construct the $d_j$ vectors for a given error surface? Second, how do we determine the corresponding step sizes $\alpha_j$ such that we are guaranteed convergence in at most $W$ steps on a locally quadratic error surface? Finally, how can we do both without explicit computation of the Hessian $H$? Below, we answer these questions one by one.

E. Computing correct step sizes

Let us first attempt to find an expression for $\alpha_j$, assuming that we have a set of $W$ conjugate vectors $d_i$, $i \in \{1, 2, \ldots, W\}$. To do this, let us pre-multiply equation (29) by $d_j^T H$:

$d_j^T H (w^* - w_1) = d_j^T H \sum_{i=1}^{W} \alpha_i d_i$  (33)

$d_j^T H (w^* - w_1) = \sum_{i=1}^{W} \alpha_i\, d_j^T H d_i.$  (34)

By H-orthogonality, all the terms in the right-hand sum of equation (34) where $i \ne j$ are zero. Thus, equation (34) reduces to,

$d_j^T H (w^* - w_1) = \alpha_j\, d_j^T H d_j.$  (35)

Now, note that since the gradient of (28) is given by,

$g(w) = b + Hw,$  (36)

the minimum $w^*$ can be expressed implicitly as,

$g(w^*) = 0,$  (37)

so that,

$b + Hw^* = 0,$  (38)

$Hw^* = -b.$  (39)

Consequently, equation (35) further reduces to,

$d_j^T(-b - Hw_1) = \alpha_j\, d_j^T H d_j,$  (40)

$-d_j^T(b + Hw_1) = \alpha_j\, d_j^T H d_j.$  (41)

Thus,

$\alpha_j = \frac{-d_j^T(b + Hw_1)}{d_j^T H d_j}.$  (42)

Although equation (42) gives an expression for $\alpha_j$, the right-hand side of (42) depends on the initial weights $w_1$. We can remove this dependence. Starting with equation (31),

$w_j = w_1 + \sum_{i=1}^{j-1} \alpha_i d_i,$  (43)

we (once again) pre-multiply by $d_j^T H$,

$d_j^T H w_j = d_j^T H w_1 + \sum_{i=1}^{j-1} \alpha_i\, d_j^T H d_i.$  (44)

Since $i \ne j$ for all terms in the sum on the right-hand side of equation (44), the sum is zero by H-orthogonality. Thus, equation (44) reduces to,

$d_j^T H w_j = d_j^T H w_1.$  (45)

Substituting (45) into (42) and letting,

$g_j = b + H w_j,$  (46)

equation (42) further simplifies to,

$\alpha_j = \frac{-d_j^T(b + H w_j)}{d_j^T H d_j},$  (47)

$\alpha_j = \frac{-d_j^T g_j}{d_j^T H d_j}.$  (48)

Below, we prove that iteration (32), with step size $\alpha_j$ as given in (48), will converge to $w^*$ in at most $W$ steps from any nonzero initial weight vector $w_1$ on a quadratic error surface.

Theorem: Assuming a $W$-dimensional quadratic error surface,

$E(w) = E_0 + b^T w + \tfrac{1}{2}\,w^T H w,$  (49)

and H-orthogonal vectors $d_i$, $i \in \{1, 2, \ldots, W\}$, the following weight iteration,

$w_{j+1} = w_j + \alpha_j d_j,$  (50)

$\alpha_j = \frac{-d_j^T g_j}{d_j^T H d_j},$  (51)

will converge in at most $W$ steps to the minimum $w^*$ of the error surface $E(w)$ from any nonzero initial weight vector $w_1$.

Proof: We will prove this theorem by showing that the gradient vector $g_j$ at the $j$th iteration is orthogonal to all previous conjugate directions $d_i$, $i < j$. From this, it follows that after $W$ steps, the components of the gradient along all $W$ directions will have been made zero, and consequently, iteration (50) will have converged to the minimum $w^*$.

Since,

$g_j = b + H w_j,$  (52)

we can write,

$g_{j+1} - g_j = H(w_{j+1} - w_j).$  (53)

From (50),

$w_{j+1} - w_j = \alpha_j d_j,$  (54)

so that (53) reduces to,

$g_{j+1} - g_j = \alpha_j H d_j.$  (55)

Not surprisingly, we now pre-multiply equation (55) by $d_j^T$,

$d_j^T(g_{j+1} - g_j) = \alpha_j\, d_j^T H d_j,$  (56)

and substitute (51) for $\alpha_j$, so that,

$d_j^T(g_{j+1} - g_j) = \frac{-d_j^T g_j}{d_j^T H d_j}\, d_j^T H d_j$  (57)

$d_j^T(g_{j+1} - g_j) = -d_j^T g_j$  (58)

$d_j^T g_{j+1} - d_j^T g_j = -d_j^T g_j$  (59)

$d_j^T g_{j+1} = 0.$  (60)

Therefore, from the result in (60), we know that the gradient at the $(j+1)$th step is orthogonal to the previous direction $d_j$. To show that the same is also true for all previous conjugate directions $d_k$, $k < j$, let us pre-multiply equation (55) by $d_k^T$,

$d_k^T(g_{j+1} - g_j) = \alpha_j\, d_k^T H d_j.$  (61)

Since $k < j$, the right-hand side of equation (61) is zero by H-orthogonality. Thus, (61) reduces to,

$d_k^T(g_{j+1} - g_j) = 0$  (62)

$d_k^T g_{j+1} = d_k^T g_j, \quad k < j.$  (63)

Through induction on equations (63) and (60), we can easily show that,

$d_k^T g_j = 0, \quad \forall k < j.$  (64)

For example, let $k = j - 1$. From (63),

$d_{j-1}^T g_{j+1} = d_{j-1}^T g_j,$  (65)

and from (60),

$d_{j-1}^T g_j = 0,$  (66)

so that,

$d_{j-1}^T g_{j+1} = 0.$  (67)

Now we can let $k = j - 2$, $j - 3$, $\ldots$, and repeat the steps for $k = j - 1$. This completes the proof.
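A minimal numerical illustration of this theorem, assuming a randomly generated quadratic surface: here the H-orthogonal directions are taken to be the eigenvectors of the symmetric Hessian (one convenient conjugate set; the notes construct a different set in the next section), and the iteration (50)-(51) reaches the minimum after exactly $W$ steps. The dimension, random seed and surface are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# A random W-dimensional quadratic error surface E(w) = E0 + b^T w + 0.5 w^T H w,
# with H symmetric positive definite (equation (49)).
W = 5
A = rng.standard_normal((W, W))
H = A @ A.T + W * np.eye(W)          # symmetric positive-definite Hessian
b = rng.standard_normal(W)
w_star = -np.linalg.solve(H, b)      # true minimum, H w* = -b (equation (39))

# One convenient H-orthogonal set: the eigenvectors of the symmetric H
# (v_i^T H v_j = lambda_j v_i^T v_j = 0 for i != j).  Any H-orthogonal set works.
_, directions = np.linalg.eigh(H)
d_list = [directions[:, i] for i in range(W)]

w = rng.standard_normal(W)           # arbitrary nonzero initial weight vector w_1
for j, d in enumerate(d_list, start=1):
    g = b + H @ w                    # gradient g_j = b + H w_j (equation (46))
    alpha = -(d @ g) / (d @ H @ d)   # step size from equation (51)
    w = w + alpha * d                # weight update of equation (50)
    print(f"step {j}: ||w - w*|| = {np.linalg.norm(w - w_star):.2e}")
```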

F. Constructing conjugate vectors

So far in our development, we have assumed a set of mutually H-orthogonal vectors $\{d_j\}$, but have not discussed how to construct these vectors. The theorem below gives one iterative method for doing just that.

Theorem: Let $d_j$ be defined as follows:

1. Let,

$d_1 = -g_1.$  (68)

2. Let,

$d_{j+1} = -g_{j+1} + \beta_j d_j, \quad j \ge 1,$  (69)

where,

$\beta_j = \frac{g_{j+1}^T H d_j}{d_j^T H d_j}.$  (70)

The construction of (68), (69) and (70) generates $W$ mutually H-orthogonal vectors on a $W$-dimensional quadratic error surface, such that,

$d_j^T H d_i = 0, \quad \forall i \ne j.$  (71)

Proof: First, we will show that our choice of $\beta_j$ in (70) implies,

$d_{j+1}^T H d_j = 0.$  (72)

In other words, we will show that consecutive search directions are H-orthogonal. Pre-multiplying (69) by $d_j^T H$,

$d_j^T H d_{j+1} = -d_j^T H g_{j+1} + \beta_j\, d_j^T H d_j,$  (73)

and substituting (70) into (73), we get,

$d_j^T H d_{j+1} = -d_j^T H g_{j+1} + \frac{g_{j+1}^T H d_j}{d_j^T H d_j}\, d_j^T H d_j$  (74)

$d_j^T H d_{j+1} = -d_j^T H g_{j+1} + g_{j+1}^T H d_j = 0.$  (75)

Next, we demonstrate that if,

$d_j^T H d_i = 0 \;\Rightarrow\; d_{j+1}^T H d_i = 0, \quad \forall i < j$  (76)

($\Rightarrow$ means "implies"), then,

$d_j^T H d_i = 0, \quad \forall i \ne j.$  (77)

This can be shown by induction on (72) and (76); for example,

$d_2^T H d_1 = 0$ (from (72), by construction)  (78)

$d_3^T H d_1 = 0$ (from (78) and (76), by assumption)  (79)

$d_3^T H d_2 = 0$ (from (72), by construction)  (80)

$d_4^T H d_1 = 0$ (from (79) and (76), by assumption)  (81)

$d_4^T H d_2 = 0$ (from (80) and (76), by assumption)  (82)

$d_4^T H d_3 = 0$ (from (72), by construction), etc.  (83)

Thus, to complete the proof, we need to show that (76) is true. First, we transpose (69) and post-multiply by $H d_i$, $i < j$,

$d_{j+1}^T H d_i = -g_{j+1}^T H d_i + \beta_j\, d_j^T H d_i.$  (84)

Now, by assumption, $d_j^T H d_i = 0$ for $i < j$. Hence,

$d_{j+1}^T H d_i = -g_{j+1}^T H d_i.$  (85)

From (32), we have that,

$w_{j+1} - w_j = \alpha_j d_j,$  (86)

$H(w_{j+1} - w_j) = \alpha_j H d_j.$  (87)

Since,

$g_j = H w_j + b,$  (88)

$H(w_{j+1} - w_j) = g_{j+1} - g_j.$  (89)

Combining equations (87) and (89),

$\alpha_j H d_j = g_{j+1} - g_j$  (90)

$H d_j = \frac{1}{\alpha_j}(g_{j+1} - g_j).$  (91)

We substitute equation (91) into the right-hand side of equation (85) [switching the index from $j$ to $i$ in equation (91)], so that,

$d_{j+1}^T H d_i = -\frac{1}{\alpha_i}\, g_{j+1}^T (g_{i+1} - g_i)$  (92)

$d_{j+1}^T H d_i = -\frac{1}{\alpha_i}\,(g_{j+1}^T g_{i+1} - g_{j+1}^T g_i)$  (93)

$d_{j+1}^T H d_i = -\frac{1}{\alpha_i}\,(g_{i+1}^T g_{j+1} - g_i^T g_{j+1}).$  (94)

Now, we will show that the right-hand side of equation (94) is zero, by showing that,

$g_k^T g_j = 0, \quad \forall k < j.$  (95)

This will complete the proof. From (68) and (69), we see that $d_k$, $k \ge 1$, is a linear combination of the gradient directions $g_1, \ldots, g_k$,

$d_k = -g_k + \sum_{l=1}^{k-1} \gamma_l\, g_l,$  (96)

where the $\gamma_l$ are scalar coefficients. For example,

$d_1 = -g_1,$  (97)

$d_2 = -g_2 + \beta_1 d_1 = -g_2 - \beta_1 g_1,$  (98)

$d_3 = -g_3 + \beta_2 d_2 = -g_3 - \beta_2 g_2 - \beta_2\beta_1 g_1.$  (99)

We have already shown that,

$d_k^T g_j = 0, \quad \forall k < j,$  (100)

so that if we transpose equation (96) and post-multiply by $g_j$, $j > k$,

$d_k^T g_j = -g_k^T g_j + \sum_{l=1}^{k-1} \gamma_l\, g_l^T g_j$  (101)

$0 = -g_k^T g_j + \sum_{l=1}^{k-1} \gamma_l\, g_l^T g_j \quad$ [from (100)]  (102)

$g_k^T g_j = \sum_{l=1}^{k-1} \gamma_l\, g_l^T g_j.$  (103)

Since $d_1 = -g_1$,

$d_1^T g_j = -g_1^T g_j = 0, \quad j > 1 \quad$ [from (100)],  (104)

$g_1^T g_j = 0, \quad j > 1.$  (105)

By induction on equations (103) and (105), we can now show that (95) is true; for example,

$g_1^T g_2 = 0 \quad$ [from (105)]  (106)

$g_2^T g_3 = \gamma_1\, g_1^T g_3 = 0 \quad$ [from (103) and (105)]  (107)

$g_1^T g_4 = 0 \quad$ [from (105)]  (108)

$g_2^T g_4 = \gamma_1\, g_1^T g_4 = 0 \quad$ [from (103) and (105)]  (109)

$g_3^T g_4 = \gamma_1\, g_1^T g_4 + \gamma_2\, g_2^T g_4 = 0 \quad$ [from (103) and (109)], etc.  (110)

Thus,

$g_k^T g_j = 0, \quad \forall k < j,$  (111)

so that,

$d_{j+1}^T H d_i = -\frac{1}{\alpha_i}\,(g_{i+1}^T g_{j+1} - g_i^T g_{j+1}) = 0, \quad i < j.$  (112)

Recall that equation (112) presupposes $d_j^T H d_i = 0$, so that we have shown equation (76) to be true; i.e.,

$d_{j+1}^T H d_i = 0 \text{ when } d_j^T H d_i = 0, \quad \forall i < j.$  (113)

Thus, by induction on equations (72) and (76), the construction in (68), (69) and (70) generates mutually H-orthogonal vectors on a quadratic error surface, such that,

$d_j^T H d_i = 0, \quad \forall i \ne j,$  (114)

and the proof is complete.
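The construction (68)-(70) can be checked directly with a small NumPy sketch, written here with an explicit Hessian purely for verification (the following sections remove that requirement): it generates $W$ directions on a random quadratic surface and confirms that $d_j^T H d_i$ vanishes for $i \ne j$ and that the gradient is driven to zero within $W$ steps. The dimension and random seed are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)

# Random W-dimensional quadratic surface E(w) = E0 + b^T w + 0.5 w^T H w.
W = 6
A = rng.standard_normal((W, W))
H = A @ A.T + W * np.eye(W)              # symmetric positive-definite Hessian
b = rng.standard_normal(W)

# Construction of equations (68)-(70), written with the explicit Hessian.
w = rng.standard_normal(W)
g = b + H @ w                            # g_1
d = -g                                   # d_1 = -g_1 (equation (68))
directions = []
for j in range(W):
    alpha = -(d @ g) / (d @ H @ d)       # exact step size, equation (51)
    w = w + alpha * d                    # w_{j+1} = w_j + alpha_j d_j
    g_new = b + H @ w                    # g_{j+1}
    beta = (g_new @ H @ d) / (d @ H @ d) # equation (70)
    directions.append(d)
    d = -g_new + beta * d                # d_{j+1} = -g_{j+1} + beta_j d_j (equation (69))
    g = g_new

# Verify mutual H-orthogonality, d_j^T H d_i = 0 for i != j (equation (114)).
D = np.column_stack(directions)
M = D.T @ H @ D
off_diag = M - np.diag(np.diag(M))
print("max |d_j^T H d_i|, i != j:", np.abs(off_diag).max())
print("||g|| after W steps:", np.linalg.norm(b + H @ w))
```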

G. Computing $\beta_j$ without the Hessian

We now have the following algorithm for training the weights of a neural network:

1. Let,

$d_1 = -g_1,$  (115)

where $w_1$ is a random initial weight vector.

2. For $j \ge 1$ let,

$w_{j+1} = w_j + \alpha_j d_j,$  (116)

where,

$\alpha_j = \frac{-d_j^T g_j}{d_j^T H d_j}.$  (117)

3. Let,

$d_{j+1} = -g_{j+1} + \beta_j d_j,$  (118)

where,

$\beta_j = \frac{g_{j+1}^T H d_j}{d_j^T H d_j}.$  (119)

4. Iterate steps 2 and 3.

Ideally, we would like to compute both $\alpha_j$ and $\beta_j$ without explicit computation of $H$. Let us look at $\beta_j$ first. Substituting equation (91),

$H d_j = \frac{1}{\alpha_j}(g_{j+1} - g_j),$  (120)

into equation (119), we get,

$\beta_j = \frac{g_{j+1}^T(g_{j+1} - g_j)}{d_j^T(g_{j+1} - g_j)},$  (121)

which is known as the Hestenes-Stiefel expression for $\beta_j$. Expression (121) can be further simplified. From equation (118),

$d_j = -g_j + \beta_{j-1} d_{j-1}.$  (122)

Then, transposing equation (122) and post-multiplying by $g_j$, we get,

$d_j^T g_j = -g_j^T g_j + \beta_{j-1}\, d_{j-1}^T g_j.$  (123)

Since,

$d_k^T g_j = 0, \quad \forall k < j,$  (124)

equation (123) reduces to,

$d_j^T g_j = -g_j^T g_j,$  (125)

$g_j^T g_j = -d_j^T g_j.$  (126)

Consequently, equation (121) reduces to,

$\beta_j = \frac{g_{j+1}^T(g_{j+1} - g_j)}{g_j^T g_j},$  (127)

which is known as the Polak-Ribière expression for $\beta_j$. Finally, since,

$g_k^T g_j = 0, \quad \forall k < j,$  (128)

equation (127) further reduces to,

$\beta_j = \frac{g_{j+1}^T g_{j+1}}{g_j^T g_j},$  (129)

which is known as the Fletcher-Reeves expression for $\beta_j$.

Note that none of the three $\beta_j$ expressions (equations (121), (127) and (129), respectively) requires computation of the local Hessian $H$. Also, note that on a purely quadratic error surface, the three expressions above are identical. Since a neural network error function is typically only approximately quadratic, however, the different expressions for $\beta_j$ will give slightly different results. According to [1], the Polak-Ribière form in (127) usually gives slightly better results in practice than the other two expressions. With (127), when the algorithm is making poor progress, $\beta_j$ will be approximately zero, so that the search direction will be reset to the negative gradient, effectively re-initializing or restarting the conjugate gradient algorithm.
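A short numerical check of the claim that the three expressions coincide on a quadratic surface: the sketch below performs one exact line-minimization step on a random quadratic and evaluates the Hestenes-Stiefel, Polak-Ribière and Fletcher-Reeves forms of $\beta_j$; the random surface and seed are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)

# Random quadratic surface with gradient g(w) = b + H w.
W = 4
A = rng.standard_normal((W, W))
H = A @ A.T + W * np.eye(W)
b = rng.standard_normal(W)

w = rng.standard_normal(W)
g = b + H @ w                        # g_j
d = -g                               # d_j (first step: d_1 = -g_1)
alpha = -(d @ g) / (d @ H @ d)       # exact line minimization on a quadratic
g_new = b + H @ (w + alpha * d)      # g_{j+1}

beta_hs = (g_new @ (g_new - g)) / (d @ (g_new - g))   # Hestenes-Stiefel, eq. (121)
beta_pr = (g_new @ (g_new - g)) / (g @ g)             # Polak-Ribiere, eq. (127)
beta_fr = (g_new @ g_new) / (g @ g)                   # Fletcher-Reeves, eq. (129)
print(beta_hs, beta_pr, beta_fr)     # identical (up to round-off) on a quadratic
```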

H. Computing $\alpha_j$ without the Hessian

As we will show below, rather than compute $\alpha_j$ (the step size along each search direction) through equation (117),

$\alpha_j = \frac{-d_j^T g_j}{d_j^T H d_j},$  (130)

we can instead determine $\alpha_j$ through a line minimization along $d_j$. We will show below that for a quadratic error surface, this line minimization is equivalent to computing $\alpha_j$ explicitly. For a quadratic error surface,

$E(w) = E_0 + b^T w + \tfrac{1}{2}\,w^T H w,$  (131)

we have that,

$E(w_j + \alpha_j d_j) = E_0 + b^T(w_j + \alpha_j d_j) + \tfrac{1}{2}(w_j + \alpha_j d_j)^T H (w_j + \alpha_j d_j).$  (132)

We now differentiate equation (132) with respect to $\alpha_j$ and equate to zero:

$\frac{\partial E(w_j + \alpha_j d_j)}{\partial\alpha_j} = b^T d_j + \tfrac{1}{2}\,d_j^T H (w_j + \alpha_j d_j) + \tfrac{1}{2}(w_j + \alpha_j d_j)^T H d_j = 0.$  (133)

Since $H$ is symmetric,

$d_j^T H (w_j + \alpha_j d_j) = (w_j + \alpha_j d_j)^T H d_j,$  (134)

equation (133) reduces to,

$b^T d_j + d_j^T H (w_j + \alpha_j d_j) = 0$  (135)

$d_j^T b + d_j^T H w_j + \alpha_j\, d_j^T H d_j = 0$  (136)

$\alpha_j\, d_j^T H d_j = -d_j^T(b + H w_j).$  (137)

Since,

$g_j = b + H w_j,$  (138)

equation (137) reduces to,

$\alpha_j\, d_j^T H d_j = -d_j^T g_j$  (139)

$\alpha_j = \frac{-d_j^T g_j}{d_j^T H d_j}.$  (140)

Note that equation (140) is equivalent to equation (130), which was derived independently from theoretical considerations. Therefore, we have shown that we may safely replace explicit computation of $\alpha_j$ with a line minimization along $d_j$.

3. The conjugate gradient algorithm

We have now arrived at the complete conjugate gradient algorithm, which we summarize below:

1. Choose an initial weight vector $w_1$ (e.g. a vector of small, random weights).

2. Evaluate $g_1 = \nabla E[w_1]$ and let,

$d_1 = -g_1.$  (141)

3. At step $j$, $j \ge 1$, perform a line minimization along $d_j$ such that,

$E(w_j + \alpha^* d_j) \le E(w_j + \alpha\, d_j), \quad \forall\alpha.$  (142)

4. Let,

$w_{j+1} = w_j + \alpha^* d_j.$  (143)

5. Evaluate $g_{j+1}$.

6. Let,

$d_{j+1} = -g_{j+1} + \beta_j d_j,$  (144)

where,

$\beta_j = \frac{g_{j+1}^T(g_{j+1} - g_j)}{g_j^T g_j}.$  (145)

7. Let $j = j + 1$ and go to step 3.

Note that at no time does the above algorithm require explicit computation of the Hessian $H$. In fact, the conjugate gradient algorithm requires very little additional computation compared to the steepest descent algorithm. It does, however, exhibit significantly faster convergence properties. In fact, on a $W$-dimensional quadratic error surface, the conjugate gradient algorithm is guaranteed to converge to the minimum in at most $W$ steps; no such guarantee exists for the steepest descent algorithm. The main difference between the two algorithms is that the conjugate gradient algorithm exploits a reasonable assumption about the error surface (i.e. locally quadratic) to achieve faster convergence.

A neural network error surface will, in general, not be globally quadratic. Therefore, in practice, more than $W$ steps will usually be required before convergence to a local minimum. Consequently, in applying the conjugate gradient algorithm to neural network training, we typically reset the algorithm every $W$ steps; that is, we reset the search direction to the negative gradient [equation (141)].
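The seven steps above translate directly into code. The sketch below is a minimal illustration, not a reference implementation: it uses NumPy, performs the line minimization of step 3 with SciPy's bounded scalar minimizer (any one-dimensional search would do, and the bracket (0, 10) is an arbitrary choice), uses the Polak-Ribière $\beta_j$ of equation (145), and restarts the search direction every $W$ steps as described above. The function name, tolerance and step budget are illustrative.

```python
import numpy as np
from scipy.optimize import minimize_scalar   # used only for the 1-D line minimization

def conjugate_gradient_train(E, grad_E, w1, n_steps=100, tol=1e-6):
    """Minimal sketch of steps 1-7 above (Polak-Ribiere beta, restart every W steps)."""
    w = np.asarray(w1, dtype=float)
    W = w.size                               # number of weights
    g = grad_E(w)
    d = -g                                   # step 2: d_1 = -g_1
    for j in range(1, n_steps + 1):
        # Step 3: line minimization along d (a bounded 1-D search; the bracket
        # (0, 10) is an arbitrary illustrative choice).
        alpha = minimize_scalar(lambda a: E(w + a * d), bounds=(0.0, 10.0),
                                method='bounded').x
        w = w + alpha * d                    # step 4
        g_new = grad_E(w)                    # step 5
        if np.linalg.norm(g_new) < tol:
            break
        if j % W == 0:
            d = -g_new                       # periodic restart: reset to negative gradient
        else:
            beta = g_new @ (g_new - g) / (g @ g)   # step 6: Polak-Ribiere, eq. (145)
            d = -g_new + beta * d
        g = g_new                            # step 7: next iteration
    return w

# Example usage on the quadratic error surface E = 2*w1^2 + w2^2:
E = lambda w: 2.0 * w[0]**2 + w[1]**2
grad_E = lambda w: np.array([4.0 * w[0], 2.0 * w[1]])
print(conjugate_gradient_train(E, grad_E, [1.0, 1.0]))   # approaches the minimum at (0, 0)
```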

Finally, we note that throughout our development, we have assumed that the Hessian $H$ is positive definite. When $H$ is not positive definite, the line minimization in step 3 of the algorithm prevents the algorithm from making negative progress (i.e. increasing the error of the neural network).

4. Examples

In this section, we examine the conjugate gradient algorithm for some simple examples, including (1) a simple quadratic error surface, (2) a simple non-quadratic error surface, and (3) a simple neural network training example. We will see that the conjugate gradient algorithm significantly outperforms both the steepest descent and gradient descent algorithms in speed of convergence.

A. Quadratic example

Here we consider the simple quadratic error surface,

$E = 2\omega_1^2 + \omega_2^2.$  (146)

For this error surface, Figure 1 below plots the trajectory in weight space of (1) the conjugate gradient algorithm, (2) the steepest descent algorithm, and (3) the gradient descent algorithm with a learning rate of $\eta = 0.4$, all starting from the same initial weights. Each trajectory is superimposed on top of a contour plot of $E$ and is stopped when $E < 10^{-6}$. Note that the minimum of $E$ occurs at $(\omega_1, \omega_2) = (0, 0)$.

From Figure 1, we make several observations. First, as expected, the conjugate gradient algorithm does indeed converge in two steps for a two-dimensional quadratic error surface. By contrast, the steepest descent algorithm requires 5 steps to converge to an error $E < 10^{-6}$, even though the two algorithms require approximately the same amount of computation per step. The gradient descent algorithm comes in a distant third at 75 steps to convergence for a learning rate $\eta = 0.4$, and this does not count the amount of computation performed in searching for a near-optimal value of the learning rate $\eta$. In fact, one of the nicest features of the conjugate gradient algorithm is that it requires no hand-selection of any learning parameter, momentum term, etc.

[Figure 1: weight-space trajectories on the quadratic error surface of equation (146), superimposed on a contour plot of E: conjugate gradient (two steps to convergence), steepest descent (5 steps), and gradient descent with $\eta = 0.4$ (75 steps).]

A legitimate complaint about Figure 1 is that the conjugate gradient algorithm is specifically designed for a quadratic error surface, so that its superior performance in this case is not surprising; in other words, the skeptical reader might easily conclude that the comparison in Figure 1 is rigged. Therefore, we consider a decidedly non-quadratic error surface in the following section.
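The two-step convergence claim is easy to reproduce numerically. The sketch below runs two conjugate gradient steps on the quadratic surface (146), using the exact step size of equation (140), which is legitimate here because the surface is exactly quadratic; the starting point (1, 1) is an arbitrary illustrative choice.

```python
import numpy as np

H = np.array([[4.0, 0.0], [0.0, 2.0]])   # Hessian of E = 2*w1^2 + w2^2
E = lambda w: 0.5 * w @ H @ w            # the same surface written as (1/2) w^T H w
grad = lambda w: H @ w

w = np.array([1.0, 1.0])                 # an arbitrary nonzero starting point
g = grad(w)
d = -g                                   # d_1 = -g_1
for step in (1, 2):
    alpha = -(d @ g) / (d @ H @ d)       # exact line minimization on a quadratic, eq. (140)
    w = w + alpha * d
    g_new = grad(w)
    beta = g_new @ (g_new - g) / (g @ g) # Polak-Ribiere beta, eq. (145)
    d = -g_new + beta * d
    g = g_new
    print(f"after step {step}: w = {w}, E = {E(w):.2e}")
# After the second step, w = (0, 0) and E = 0 up to round-off.
```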

B. Non-quadratic example

Here we consider the simple non-quadratic error surface,

$E = 1 - \exp(-5\omega_1^2 - \omega_2^2),$  (147)

which is plotted in Figure 2. This error surface is characteristic of some regions in weight space for neural network training problems, in that for certain weights the gradient is close to zero.

[Figure 2: surface plot of the non-quadratic error function of equation (147).]

For this error surface, Figure 3 below plots the trajectory in weight space of (1) the conjugate gradient algorithm, (2) the steepest descent algorithm, and (3) the gradient descent algorithm with a hand-tuned learning rate $\eta$, all starting from the same initial weights. Each trajectory is superimposed on top of a contour plot of $E$ and is stopped when $E < 10^{-6}$. Note that the minimum of $E$ occurs at $(\omega_1, \omega_2) = (0, 0)$.

From Figure 3, we make several observations. First, the conjugate gradient algorithm no longer converges in just two steps (for a two-dimensional surface), since $E$ in (147) is not quadratic; it does, however, converge in four short steps to $E < 10^{-6}$. By contrast, the steepest descent algorithm requires substantially more steps to converge to an error $E < 10^{-6}$, and the gradient descent algorithm comes in a distant third; once again, the step count for gradient descent does not account for the computation performed in searching for a near-optimal value of the learning rate $\eta$.

The observant reader may still be skeptical. Why, you ask, did we choose this particular initial weight vector for this problem? Caught once again! It turns out that these particular initial weights fall just inside the region of weight space where the Hessian $H$ is positive definite. The region in Figure 4 encircled by the heavy line corresponds to the region in weight space where $H > 0$, and the initial weight vector used in Figure 3 falls just inside that region. So how does the conjugate gradient algorithm fare when $H$ is not positive definite? Figure 5 answers this question. In Figure 5, we initialize the weights outside the positive-definite region, and once again track the conjugate gradient, steepest descent and gradient descent algorithms. Note that once again, the conjugate gradient algorithm clearly outperforms its two competitors; in fact, the gradient descent algorithm requires by far the most steps to convergence. This poor performance is typical of gradient descent in flat regions of the error surface.

[Figure 3: weight-space trajectories on the error surface of equation (147): conjugate gradient (four steps to convergence), steepest descent, and gradient descent.]
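The positive-definite region discussed above (and outlined in Figure 4) can be probed numerically. The sketch below, assuming the error surface exactly as written in equation (147), estimates the Hessian by central finite differences at a few points in weight space and reports whether $H > 0$ at each; the specific sample points are chosen here only for illustration.

```python
import numpy as np

def E(w):
    # Non-quadratic error surface of equation (147).
    return 1.0 - np.exp(-5.0 * w[0]**2 - w[1]**2)

def hessian(f, w, eps=1e-4):
    # Central finite-difference Hessian (same helper as in the earlier sketch).
    n = len(w)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n); ei[i] = eps
            ej = np.zeros(n); ej[j] = eps
            H[i, j] = (f(w + ei + ej) - f(w + ei - ej)
                       - f(w - ei + ej) + f(w - ei - ej)) / (4.0 * eps**2)
    return H

for point in ([0.0, 0.0], [0.1, 0.3], [1.0, 1.0]):     # illustrative sample points
    eigs = np.linalg.eigvalsh(hessian(E, np.array(point)))
    print(point, "eigenvalues:", np.round(eigs, 3), "H > 0:", bool(np.all(eigs > 0)))
```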

[Figure 4: contour plot of the error surface of equation (147); the region encircled by the heavy line is the region of weight space where H > 0.]

[Figure 5: weight-space trajectories on the error surface of equation (147), starting outside the positive-definite region: conjugate gradient, steepest descent, and gradient descent.]

C. Neural-network training example

Finally, we look at a simple neural network training example. Here we are interested in approximating,

$y(x) = \tfrac{1}{2} + \tfrac{1}{2}\sin(2\pi x), \quad x \in [0, 1]$ (see Figure 6),  (148)

with a single-hidden-layer, three-hidden-unit neural network (see Figure 6) and symmetric sigmoidal activation functions,

$\gamma(u) = \frac{1}{1 + e^{-u}} - \frac{1}{2}.$  (149)

Our training data for this problem consists of $N$ evenly spaced $(x, y)$ points sampled from $x \in [0, 1]$. (Due to an oversight, no connection from the bias unit to the output unit was included in this network; thus, there are a total of nine trainable weights in this neural network.)
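For concreteness, the sketch below sets up this network and its sum-of-squares error in NumPy: a one-input, three-hidden-unit, one-output network with the symmetric sigmoid (149) and no bias on the output unit, giving the nine trainable weights mentioned above. The target follows equation (148) as written; the sample count N = 25 and the finite-difference gradient are illustrative assumptions (the notes rely on neither), and the resulting error and gradient functions could be handed to a conjugate gradient routine such as the sketch in Section 3.

```python
import numpy as np

def gamma(u):
    # Symmetric sigmoid of equation (149).
    return 1.0 / (1.0 + np.exp(-u)) - 0.5

def net_output(w, x):
    # 1-3-1 network with no bias on the output unit: nine weights total.
    # w = [v1, v2, v3, b1, b2, b3, w1, w2, w3] (input weights, hidden biases, output weights).
    v, b, wo = w[0:3], w[3:6], w[6:9]
    h = gamma(np.outer(x, v) + b)        # hidden-unit outputs, shape (N, 3)
    return h @ wo                        # z = sum of weighted hidden-unit outputs (no output bias)

def error(w, x, y):
    # Sum-of-squares training error E.
    return 0.5 * np.sum((net_output(w, x) - y) ** 2)

def grad_error(w, x, y, eps=1e-6):
    # Finite-difference gradient, used only to keep the sketch self-contained;
    # in practice the gradient would come from backpropagation.
    g = np.zeros_like(w)
    for i in range(w.size):
        e = np.zeros_like(w); e[i] = eps
        g[i] = (error(w + e, x, y) - error(w - e, x, y)) / (2 * eps)
    return g

# Training set: evenly spaced samples of the target y(x) of equation (148);
# the exact N used in the notes is not specified here, so N = 25 is an assumption.
N = 25
x = np.linspace(0.0, 1.0, N)
y = 0.5 + 0.5 * np.sin(2 * np.pi * x)

w = 0.1 * np.random.default_rng(3).standard_normal(9)   # small random initial weights
print("initial E/N:", error(w, x, y) / N)
# error(w, x, y) and grad_error(w, x, y) can now be handed to a conjugate
# gradient routine such as the conjugate_gradient_train() sketch in Section 3.
```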

[Figure 6: the target function y(x) of equation (148) and the three-hidden-unit network used to approximate it.]

Figure 7 plots,

$\log\frac{E}{N},$  (150)

as a function of the number of training steps (or epochs) of the conjugate gradient algorithm. By comparison, Figure 8 illustrates that steepest descent does not come close to approaching the conjugate gradient algorithm's performance, even when run for many more epochs. Finally, we note that gradient descent (with $\eta = 0.3$) requires far more epochs to reach approximately the same error as the conjugate gradient algorithm, as shown in Figure 9. Once again, the conjugate gradient algorithm converges in many fewer steps than either of its two competitors.

D. Postscript

Let us look at two more issues with regard to the neural network training example of the previous section. First, Figure 10 plots the neural network's progress in approximating $y$ as the number of epochs (steps) increases. Note how the conjugate gradient algorithm slowly begins to bend an initially linear approximation to ultimately generate a very good fit of the sine curve.

Second, it is interesting to examine how this neural network, with sigmoidal activation functions centered about the origin [see equation (149)], is able to approximate the function $y$ (with non-zero average value) without a bias connection to the output unit $z$. Let us denote $z_i$, $i \in \{1, 2, 3\}$, as the net contribution from hidden unit $i$ to the output $z$, so that,

$z = z_1 + z_2 + z_3.$  (151)

In other words,

$z_i = \omega_i h_i, \quad i \in \{1, 2, 3\},$  (152)

where $\omega_i$ denotes the weight from the $i$th hidden unit to the output, and $h_i$ denotes the output of hidden unit $i$. Figure 11 (top) plots each of the $z_i$, which are easily recognized as different parts of the symmetric sigmoid function in equation (149). To understand how they combine to form an approximation of $y$, we separately plot $z_1 + z_2$ and $z_3$ in Figure 11 (bottom). Note that $z_3$ is primarily responsible for the overall shape of the $z$ function, while $z_1 + z_2$ together pull the ends of $z$ to conform to the familiar sine curve shape.

5. Conclusion

We have derived the conjugate gradient algorithm and illustrated its properties on some simple examples. We have shown that it outperforms both the gradient descent and the steepest descent algorithms in terms of the number of computations required to reach convergence. This improvement has been achieved primarily by accounting for the local properties of error surfaces, allowing us to employ gradient information in a more intelligent manner; furthermore, the conjugate gradient algorithm makes unnecessary the often painful trial-and-error selection of tunable learning parameters.

[1] C. M. Bishop, "Chapter 7: Parameter Optimization Algorithms," Neural Networks for Pattern Recognition, Oxford University Press, Oxford, 1995.

[Figure 7: log(E/N) versus number of epochs for the conjugate gradient algorithm.]

[Figure 8: log(E/N) versus number of epochs for the steepest descent algorithm, compared with the conjugate gradient algorithm.]

[Figure 9: log(E/N) versus number of epochs for the gradient descent algorithm, compared with the conjugate gradient algorithm.]

[Figure 10: the network's approximation of y(x) after an increasing number of conjugate gradient training epochs (eight snapshots, ranging from 1 epoch to beyond 63 epochs).]

[Figure 11: (top) the individual hidden-unit contributions z_1, z_2 and z_3 as functions of x; (bottom) z_1 + z_2 and z_3 plotted separately.]