A theoretical comparison of batch-mode, on-line, cyclic, and almost cyclic learning

Tom Heskes and Wim Wiegerinck
RWC^1 Novel Functions SNN^2 Laboratory, Department of Medical Physics and Biophysics, University of Nijmegen, Geert Grooteplein 21, NL 6525 EZ Nijmegen, The Netherlands

Submitted to IEEE Transactions on Neural Networks

Abstract

We study and compare different neural-network learning strategies: batch-mode learning, on-line learning, cyclic learning, and almost cyclic learning. Incremental learning strategies require less storage capacity than batch-mode learning. However, due to the arbitrariness in the presentation order of the training patterns, incremental learning is a stochastic process, whereas batch-mode learning is deterministic. In zeroth order, i.e., as the learning parameter tends to zero, all learning strategies approximate the same ordinary differential equation, for convenience referred to as the "ideal behavior". Using stochastic methods valid for small learning parameters, we derive differential equations describing the evolution of the lowest-order deviations from this ideal behavior. We compute how the asymptotic misadjustment, measuring the average asymptotic distance from a stable fixed point of the ideal behavior, scales as a function of the learning parameter and the number of training patterns. Knowing the asymptotic misadjustment, we calculate the typical number of learning steps necessary to generate a weight within order \epsilon of this fixed point, both with fixed and time-dependent learning parameters. We conclude that almost cyclic learning (learning with random cycles) is a better alternative for batch-mode learning than cyclic learning (learning with a fixed cycle).

^1 Real World Computing Program
^2 Dutch Foundation for Neural Networks

1 Introduction

In most neural-network applications, learning plays an essential role. Through learning, the weights of the network are adapted to meet the requirements of its environment. Usually, the environment consists of a finite number of examples: the training set. We consider two popular ways of learning with this training set: incrementally and batch-mode. With incremental learning, a pattern x^\mu is presented to the network and the weight vector w is updated before the next pattern is considered:

\Delta w = \eta\, f(w, x^\mu),   (1)

with \eta the learning parameter and f(\cdot,\cdot) the learning rule. This learning rule can be either supervised, e.g. the backpropagation learning rule [1] where x^\mu represents an input-output combination, or unsupervised, e.g. the Kohonen learning rule [2] where x^\mu stands for an input vector. In batch-mode, we first average the learning rule over all training patterns before changing the weights:

\Delta w = \eta\, \frac{1}{P} \sum_{\mu=1}^{P} f(w, x^\mu) \equiv \eta\, F(w),   (2)

where we have defined the average learning rule or drift F(w). Both incremental and batch-mode learning can be viewed as an attempt to realize, or at least approximate, the ordinary differential equation

\frac{dw}{dt} = F(w).   (3)

In [3] it was rigorously established that the sequence of weight vectors following (1) can be approximated by the differential equation (3) in the limit \eta \to 0. However, choosing an infinitesimal learning parameter is not realistic, since the smaller the learning parameter, the longer the time needed to converge. In this paper, we will therefore go one step further and calculate the lowest-order deviations from the differential equation (3) for small learning parameters. For convenience, we will refer to equation (3) as the "ideal behavior". If the drift F(w) can be written as the gradient of an error potential E(w), i.e., if

F(w) = -\nabla E(w),

the ideal behavior will lead the weights w to a (local) minimum of E(w).

Batch-mode learning is completely deterministic, but requires additional storage for each weight, which can be inconvenient in hardware applications. Incremental learning strategies, on the other hand, are less demanding on the memory side, but the arbitrariness in the order in which the patterns are presented makes them stochastic. We will consider three popular incremental learning strategies: on-line learning, cyclic learning, and almost cyclic learning. At each on-line learning step one of the patterns is drawn at random from the training set and presented to the network. The training process for (almost) cyclic learning consists of training cycles in which each of the P patterns is presented exactly once. Cyclic learning is learning with a fixed cycle, i.e., before learning starts a particular order of pattern presentation is drawn at random and then fixed in time. In almost cyclic learning, on the other hand, the order of pattern presentation is continually drawn at random after each training cycle.

On-line learning has been studied using stochastic methods borrowed from statistical physics (see e.g. [4, 5, 6]). These studies are not restricted to on-line learning on finite pattern sets, but also discuss learning in changing environments [7, 8], learning with time-correlated patterns [9], and learning with a momentum term [10, 11]. The method we use in this paper to derive the (well-known) results on on-line learning can also be applied to learning with cycles of training patterns. Learning with cycles has been studied in [12] for the linear LMS learning rule.
Our results are valid for any (nonlinear) learning rule that can be written in the form (1). Furthermore, we will point out and quantify the important difference between cyclic and almost cyclic learning.
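To make the four strategies concrete, here is a minimal Python sketch of the corresponding update loops for a generic learning rule f(w, x); the function names and the NumPy-based structure are merely illustrative choices, not part of the paper's formal development.

import numpy as np

def batch_step(w, patterns, f, eta):
    # Batch-mode update (2): average the learning rule over all P patterns, then step once.
    return w + eta * np.mean([f(w, x) for x in patterns], axis=0)

def online_step(w, patterns, f, eta, rng):
    # On-line learning: incremental update (1) with one randomly drawn pattern per step.
    x = patterns[rng.integers(len(patterns))]
    return w + eta * f(w, x)

def run_cycle(w, ordered_patterns, f, eta):
    # One training cycle: incremental update (1) applied to each pattern exactly once, in the given order.
    for x in ordered_patterns:
        w = w + eta * f(w, x)
    return w

def cyclic_learning(w, patterns, f, eta, n_cycles, rng):
    # Cyclic learning: a single presentation order, drawn once and then kept fixed.
    order = rng.permutation(len(patterns))
    for _ in range(n_cycles):
        w = run_cycle(w, [patterns[i] for i in order], f, eta)
    return w

def almost_cyclic_learning(w, patterns, f, eta, n_cycles, rng):
    # Almost cyclic learning: a fresh random presentation order for every cycle.
    for _ in range(n_cycles):
        order = rng.permutation(len(patterns))
        w = run_cycle(w, [patterns[i] for i in order], f, eta)
    return w

With, for instance, an LMS-type rule f(w, x) acting on input-output pairs x, these loops realize the strategies whose deviations from the ideal behavior (3) are analyzed below.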

In Section 2, we will apply a mixture between the time-averaging method proposed in [5] and Van Kampen's expansion [13] explained in [6] to derive the lowest-order deviations from the ideal behavior (3) for the various learning strategies. At first reading, the reader may want to skip this section or view it as an appendix. Section 3 focusses on the asymptotic behavior. The asymptotic misadjustment measures the asymptotic local deviations between the weight vector and a stable fixed point of the ideal behavior (3) and is therefore a useful indication of the network's performance. Another closely related performance measure is the typical number of learning steps n_\epsilon necessary to generate a weight within order \epsilon of the stable fixed point. We will calculate n_\epsilon for the three incremental learning strategies, both with fixed and with time-dependent learning parameters.

2 Theory

In this section we will study batch-mode learning, on-line learning, cyclic learning, and almost cyclic learning in the limit of small learning parameters. We will focus on the lowest-order deviations from the ordinary differential equation found in the limit \eta \to 0. For convenience, we will use one-dimensional notation. It is straightforward to generalize the results to higher dimensions.

2.1 Batch-mode learning

In the following we will use subscripts n and m to indicate that time is measured in number of presentations of one pattern and in number of cycles of P patterns, respectively, i.e., n = P m. With this convention, the batch-mode learning rule (2) can be written

w_{n+P} - w_n = w_{m+1} - w_m = \eta \sum_{\mu=1}^{P} f(w_m, x^\mu) = \tilde\eta\, F(w_n),   (4)

where we have defined the rescaled learning parameter \tilde\eta \equiv \eta P. In order to turn the difference equation (4) into a set of differential equations, we make the ansatz

w_n = \phi(t) + \tilde\eta\, \psi(t), with t \equiv \tilde\eta m = \eta n.

Up to the two lowest orders in the learning parameter, the difference equation (4) is equivalent to

\frac{d\phi(t)}{dt} = F(\phi(t))   (5)

\frac{d\psi(t)}{dt} = F'(\phi(t))\, \psi(t) - \frac{1}{2} F'(\phi(t))\, F(\phi(t)).   (6)

For batch-mode learning, the deviation from the ideal behavior (3) is of order \tilde\eta = \eta P. This deviation, which is a consequence of the discretization of the learning steps, is well known as the error of Euler's method in numerical analysis [14].
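For completeness, a sketch of the expansion that connects (4) to (5) and (6); these intermediate lines are not in the original derivation and are written in the notation introduced above:

w_{m+1} - w_m = \phi(t+\tilde\eta) - \phi(t) + \tilde\eta\, [\psi(t+\tilde\eta) - \psi(t)] = \tilde\eta\, \dot\phi(t) + \frac{1}{2} \tilde\eta^2\, \ddot\phi(t) + \tilde\eta^2\, \dot\psi(t) + O(\tilde\eta^3),

\tilde\eta\, F(w_m) = \tilde\eta\, F(\phi(t)) + \tilde\eta^2\, F'(\phi(t))\, \psi(t) + O(\tilde\eta^3).

Matching the terms of order \tilde\eta gives (5); matching the terms of order \tilde\eta^2, and using \ddot\phi = F'(\phi)\, F(\phi), which follows from (5), gives (6).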

2.2 On-line learning

On-line learning is an incremental learning strategy where a weight update takes place after each presentation of one randomly drawn training pattern. Given training pattern x_n at learning step n, the weight change reads

w_{n+1} - w_n = \eta\, f(w_n, x_n).   (7)

We start with the ansatz that the fluctuations are small, i.e., we write (see e.g. [6])

w_n = \phi_n + \sqrt{\eta}\, \xi_n,

with \phi_n a deterministic part and \xi_n a noise term. After T iterations of the learning rule (7), we have

\phi_{n+T} - \phi_n + \sqrt{\eta}\, [\xi_{n+T} - \xi_n]
  = \eta \sum_{i=0}^{T-1} f(\phi_{n+i}, x_{n+i}) + \eta^{3/2} \sum_{i=0}^{T-1} f'(\phi_{n+i}, x_{n+i})\, \xi_{n+i} + O(\eta^2 T)   (8)
  = \eta \sum_{i=0}^{T-1} f(\phi_n, x_{n+i}) + \eta^{3/2} \left[ \sum_{i=0}^{T-1} f'(\phi_n, x_{n+i}) \right] \xi_n + O(\eta^2 T^2),   (9)

where in the last step we used that \phi_{n+i} = \phi_n + O(\eta i), and similarly for \xi_{n+i}. We write the first sum on the right-hand side as an average part plus a noise term:

\eta \sum_{i=0}^{T-1} f(\phi_n, x_{n+i}) = (\eta T)\, F(\phi_n) + \sqrt{\eta}\, \sqrt{\eta T}\, \zeta_n(\phi_n),   (10)

with the drift F(w) defined in (2) and the noise \zeta_n(w) defined by

\zeta_n(w) \equiv \frac{1}{\sqrt{T}} \sum_{i=0}^{T-1} \left[ f(w, x_{n+i}) - F(w) \right].

For large T the noise \zeta_n(w), consisting of T independent terms, is Gaussian distributed with zero average and variance

D(w) \equiv \left\langle f^2(w, x) \right\rangle_x - F^2(w) = \frac{1}{P} \sum_{\mu=1}^{P} f^2(w, x^\mu) - F^2(w).

From (9), we obtain the set of difference equations

\phi_{n+T} - \phi_n = (\eta T)\, F(\phi_n) + O(\eta^2 T^2)

\xi_{n+T} - \xi_n = (\eta T)\, F'(\phi_n)\, \xi_n + \sqrt{\eta T}\, \zeta_n(\phi_n) + O(\eta^{3/2} T^2).   (11)

For small learning parameters, we can replace the difference equation for \phi_n by the differential equation (5). The deviation due to discretization is of order \eta (see the analysis of batch-mode learning), which is negligible in comparison with the noise term of order \sqrt{\eta}. The difference equation (11) for the noise \xi_n is a discretized version of a Langevin equation, which in the limit \eta T \to 0 becomes a continuous-time Langevin equation (see e.g. [15]). The corresponding Fokker-Planck equation for the probability \rho(\xi, t) is

\frac{\partial \rho(\xi, t)}{\partial t} = -F'(\phi(t)) \frac{\partial}{\partial \xi} \left[ \xi\, \rho(\xi, t) \right] + \frac{1}{2} D(\phi(t)) \frac{\partial^2 \rho(\xi, t)}{\partial \xi^2}.   (12)

In "zeroth order" the weights follow the ideal behavior (5). The randomness of the sampling, however, leads to deviations of order \sqrt{\eta}. This result is not new and has been derived in many different ways (see e.g. [7, 16, 8]). Our derivation combines the ansatze suggested by Van Kampen's expansion [13, 6] with the time-averaging procedure applied in [5]. In the following section we will show how a similar procedure can be used to study learning with cycles.
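For later use we note the stationary solution of (12) at a stable fixed point w^* of (5), where F'(w^*) = -|F'(w^*)| < 0; this short computation is not part of the original text, but it is the only ingredient needed for the on-line result in Section 3.1.2:

0 = |F'(w^*)|\, \frac{d}{d\xi} \left[ \xi\, \rho_\infty(\xi) \right] + \frac{1}{2} D(w^*)\, \frac{d^2 \rho_\infty(\xi)}{d\xi^2}
\quad \Rightarrow \quad
\rho_\infty(\xi) \propto \exp\!\left( -\frac{|F'(w^*)|\, \xi^2}{D(w^*)} \right),
\qquad
\langle \xi^2 \rangle_\infty = \frac{D(w^*)}{2\, |F'(w^*)|}.

Since asymptotically w - w^* \approx \sqrt{\eta}\, \xi, this already gives the asymptotic misadjustment \eta D(w^*)/(2|F'(w^*)|) quoted in Section 3.1.2.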

2.3 Learning with cycles

Let \vec{x} \equiv \{x_0, \ldots, x_i, \ldots, x_{P-1}\} denote a cycle of P patterns. There are P! possible different training cycles. Given such a training cycle \vec{x}, the weight change can be written as

w_{n+P} - w_n = \eta \sum_{i=0}^{P-1} f(w_{n+i}, x_i),

which, after substitution of the ansatz

w_n = v_n + (\eta P)\, z_n,

can be turned into

v_{n+P} - v_n + (\eta P) \left[ z_{n+P} - z_n \right]
  = \eta \sum_{i=0}^{P-1} f(v_{n+i}, x_i) + \eta^2 P \sum_{i=0}^{P-1} f'(v_{n+i}, x_i)\, z_{n+i} + O(\eta^3 P^2)
  = \eta \sum_{i=0}^{P-1} f(v_n, x_i) + \eta^2 \sum_{i=1}^{P-1} f'(v_n, x_i) \sum_{j=0}^{i-1} f(v_n, x_j) + \eta^2 P \sum_{i=0}^{P-1} f'(v_n, x_i)\, z_n + O(\eta^3 P^3)
  = (\eta P)\, F(v_n) + (\eta P)^2 \left[ F'(v_n)\, z_n + b(v_n, \vec{x}) + \frac{1}{2} F'(v_n)\, F(v_n) \right] + O(\eta^3 P^3),   (13)

with the definition

b(w, \vec{x}) \equiv \frac{1}{P^2} \sum_{i=1}^{P-1} \sum_{j=0}^{i-1} f'(w, x_i)\, f(w, x_j) - \frac{1}{2} F'(w)\, F(w).

In terms of the rescaled learning parameter \tilde\eta = \eta P and the time m measured in cycles, equation (13) yields

v_{m+1} - v_m = \tilde\eta\, F(v_m)   (14)

z_{m+1} - z_m = \tilde\eta \left[ F'(v_m)\, z_m + b(v_m, \vec{x}) + \frac{1}{2} F'(v_m)\, F(v_m) \right] + O(\tilde\eta^2).   (15)

The difference equation (14) for v_m is just the batch-mode learning rule (4). The lowest-order correction (6) to the ideal behavior (5) can be incorporated in the difference equation for z_m. Up to order \tilde\eta^2, the expressions (14) and (15) are therefore equivalent to (5) and

z_{m+1} - z_m = \tilde\eta \left[ F'(\phi(t))\, z_m + b(\phi(t), \vec{x}_m) \right] + O(\tilde\eta^2),   (16)

with \vec{x}_m the particular cycle presented at "cycle step" m. Neglecting the higher-order terms, equation (16) can be viewed as an incremental learning rule for training cycle \vec{x}_m, just as equation (7) is the learning rule for training pattern x_n. All necessary information about the cycle \vec{x}_m is contained in the term b(\phi(t), \vec{x}_m). With cyclic learning, we have the same cycle \vec{x}_m = \vec{x} at all cycle steps; with almost cyclic learning we draw the cycle \vec{x}_m at random at each cycle step.

2.3.1 Almost cyclic learning

With almost cyclic learning subsequent training cycles are drawn at random: almost cyclic learning is on-line learning with training cycles instead of training patterns. We can apply a similar time-averaging procedure as in our study of on-line learning. Starting from the ansatz

z_m = \chi_m + \sqrt{\tilde\eta}\, \omega_m,

we obtain, after T iterations of the "learning rule" (16) and neglecting all terms of order \tilde\eta^2 T^2 and higher,

\chi_{m+T} - \chi_m = (\tilde\eta T) \left[ F'(\phi(t))\, \chi_m + B(\phi(t)) \right]   (17)

\omega_{m+T} - \omega_m = (\tilde\eta T)\, F'(\phi(t))\, \omega_m + \sqrt{\tilde\eta T}\, \Omega_m(\phi(t)),   (18)

with the definitions

B(w) \equiv \langle b(w, \vec{x}) \rangle_{\vec{x}}
\quad and \quad
\Omega_m(w) \equiv \sqrt{\frac{P}{T}} \sum_{i=0}^{T-1} \left[ b(w, \vec{x}_{m+i}) - \langle b(w, \vec{x}) \rangle_{\vec{x}} \right].

The variance Q(w) for the white noise \Omega_m(w) is given by

Q(w) \equiv P \left[ \langle b^2(w, \vec{x}) \rangle_{\vec{x}} - \langle b(w, \vec{x}) \rangle_{\vec{x}}^2 \right].   (19)

Calculation of the average B(w) and the variance Q(w) is tedious but straightforward. In terms of

C(w) \equiv \frac{1}{P} \sum_{\mu=1}^{P} f'(w, x^\mu)\, f(w, x^\mu)
\quad and \quad
G(w) \equiv \frac{1}{P} \sum_{\mu=1}^{P} \left[ f'(w, x^\mu) \right]^2,

we obtain

B(w) = -\frac{1}{2P}\, C(w)   (20)

Q(w) = \frac{1}{12} \left[ [F'(w)]^2 D(w) + F^2(w)\, G(w) + 2 F'(w) F(w)\, C(w) \right] + \frac{1}{12 P} \left[ D(w)\, G(w) - C^2(w) \right].

We can, similar to what we did for on-line learning, turn the difference equation (17) for \chi_m and the discretized Langevin equation (18) for \omega_m into a differential equation for \chi(t) and a Fokker-Planck equation for \rho(\omega, t):

\frac{d\chi(t)}{dt} = F'(\phi(t))\, \chi(t) + B(\phi(t))   (21)

\frac{\partial \rho(\omega, t)}{\partial t} = -F'(\phi(t)) \frac{\partial}{\partial \omega} \left[ \omega\, \rho(\omega, t) \right] + \frac{1}{2} Q(\phi(t)) \frac{\partial^2 \rho(\omega, t)}{\partial \omega^2}.   (22)

Recalling all our definitions and ansatze, we conclude that this set of equations, in combination with (5), can be used to predict the behavior of

w_n = \phi(\eta n) + \tilde\eta\, \chi(\eta n) + \tilde\eta^{3/2}\, \omega(\eta n) + O(\tilde\eta^2).

For almost cyclic learning, the deviation from the ideal behavior due to the discretization of learning steps is of order \tilde\eta and the deviation due to the randomness of the sampling is of order \tilde\eta^{3/2}.

2.3.2 Cyclic learning

With cyclic learning a particular cycle \vec{x} is drawn at random from the set of P! possible cycles and then kept fixed at all times. The "learning rule" is, up to order \tilde\eta^2, given in (16). Given a particular training cycle \vec{x} with corresponding b_{\vec{x}}(w) \equiv b(w, \vec{x}), the evolution of the deviation z is completely deterministic:

\frac{dz_{\vec{x}}(t)}{dt} = F'(\phi(t))\, z_{\vec{x}}(t) + b_{\vec{x}}(\phi(t)).

We can split z_{\vec{x}}(t) into an average part \chi(t), common to all cycles, and a specific part \pi_{\vec{x}}(t):

z_{\vec{x}}(t) = \chi(t) + \frac{1}{\sqrt{P}}\, \pi_{\vec{x}}(t).

The evolution of \chi(t) follows (21) and the evolution of \pi_{\vec{x}}(t) is given by

\frac{d\pi_{\vec{x}}(t)}{dt} = F'(\phi(t))\, \pi_{\vec{x}}(t) + q_{\vec{x}}(\phi(t)),   (23)

where we have defined the term

q_{\vec{x}}(w) \equiv \sqrt{P} \left[ b_{\vec{x}}(w) - \langle b(w, \vec{x}) \rangle_{\vec{x}} \right],

with zero average and variance Q(w) defined in (19) and computed in (20).
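As a consistency check on the first part of (20), not spelled out in the original text and relying on the form of b(w, \vec{x}) given above: for a random presentation order every ordered pair of distinct patterns occurs in either order with equal probability, so

\left\langle \sum_{i=1}^{P-1} \sum_{j=0}^{i-1} f'(w, x_i)\, f(w, x_j) \right\rangle_{\vec{x}}
 = \frac{1}{2} \sum_{\mu \neq \nu} f'(w, x^\mu)\, f(w, x^\nu)
 = \frac{1}{2} \left[ P^2 F'(w) F(w) - P\, C(w) \right],

and therefore B(w) = \langle b(w, \vec{x}) \rangle_{\vec{x}} = \frac{1}{2} F'(w) F(w) - \frac{1}{2P} C(w) - \frac{1}{2} F'(w) F(w) = -\frac{1}{2P}\, C(w), in agreement with (20).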

From

w_n = \phi(\eta n) + \tilde\eta\, \chi(\eta n) + \frac{\tilde\eta}{\sqrt{P}}\, \pi_{\vec{x}}(\eta n)

we conclude that, for learning with a particular fixed cycle \vec{x}, the weight vector w_n follows the ideal behavior \phi(\eta n) with correction terms of order \tilde\eta/\sqrt{P}. These correction terms for cyclic learning are larger than those for almost cyclic learning.

2.4 A note on the validity

Let us reconsider our ansatze. We assumed that deviations from the ideal behavior \phi(t) scale with some positive power of the learning parameter, i.e., the smaller \eta the smaller these deviations. Looking at the differential and Langevin-type equations for these deviations, we see that the deviations remain bounded if and only if F'(\phi(t)) < 0. Assuming that the drift F(w) can be written as minus the gradient of some error potential E(w), this implies that the theory is valid in regions of weight space where the error potential is convex, which is true in the vicinity of the local minima w^*. Outside these so-called "attraction regions" our derivations are only valid on short time scales (see [6] for a more detailed discussion on the validity of Fokker-Planck approaches of on-line learning processes).

A second notion concerns perfectly learnable problems. For these problems there exists a weight or a set of weights w^* such that f(w^*, x^\mu) = 0 for all patterns in the training set. An example is a perceptron learning a linearly separable problem. For these perfectly learnable problems the "perfect" weight w^* acts like a sink: all learning processes will end up in this state. In our analysis, a perfectly trainable network has a vanishing asymptotic diffusion. Most practical problems, however, are not perfectly learnable, and a minimum of the error potential w^* corresponds to the best compromise on the training set. In this paper, we therefore restrict ourselves to networks that are not perfectly trainable.

3 Asymptotic properties

The ideal behavior (3) leads w to a stable fixed point w^* obeying

F(w^*) = 0 and F'(w^*) < 0,

i.e., to a (local) minimum of the error potential E(w) (assuming such an error potential exists). In this section we will study asymptotic properties of the various learning strategies. We will focus on the asymptotic behavior of the misadjustment

M_n \equiv \left\langle (w - w^*)^2 \right\rangle_w = \left[ \langle w \rangle_w - w^* \right]^2 + \left[ \langle w^2 \rangle_w - \langle w \rangle_w^2 \right],   (24)

where the average is over the distribution of the weights after a large number of learning steps n. The two terms on the right-hand side are called the (squared) bias and the variance, respectively.

3.1 The asymptotic misadjustment

First, we will consider the asymptotic misadjustment A \equiv M_\infty for the various learning strategies. We will concentrate on training sets with a large number of patterns and small learning parameters, i.e., we will consider the situation 1 \ll P \ll 1/\eta. In the following we will only give the results in leading order.

3.1.1 Batch-mode learning

A stable fixed point w^* of the differential equation (5) is also a stable fixed point of the batch-mode equation (2). Therefore, all deviations from the ideal behavior will completely vanish, and the batch-mode learning rule yields zero asymptotic misadjustment.

3.1.2 On-line learning

The most important contribution to the asymptotic misadjustment for on-line learning stems from the noise due to the randomness of the sampling. The Fokker-Planck equation (12) yields

A = \eta\, \langle \xi^2 \rangle = \frac{\eta\, D(w^*)}{2\, |F'(w^*)|},

i.e., the asymptotic misadjustment is of order \eta. The diffusion D(w^*) measures the local fluctuations in the learning rule, the derivative F'(w^*) the local curvature of the error potential. The bias is of order \eta^2 and thus negligible [8].

3.1.3 Almost cyclic learning

For almost cyclic learning, the bias follows from the stationary solution of the average deviation. Equation (21) gives

bias^2 = \tilde\eta^2\, \bar\chi^2 = \left[ \frac{\eta\, C(w^*)}{2\, F'(w^*)} \right]^2,

where C(w^*) measures the correlation between the learning rule and its derivative. The variance is of higher order in \eta, but strongly depends on the number of patterns P:

variance = \tilde\eta^3\, \langle \omega^2 \rangle = \tilde\eta^3\, \frac{Q(w^*)}{2\, |F'(w^*)|} = \frac{|F'(w^*)|\, D(w^*)\, \tilde\eta^3}{24},

which follows from the Fokker-Planck equation (22) and the expression (20). Depending on the term \tilde\eta P^2, the asymptotic misadjustment is dominated by either the bias or the variance.

3.1.4 Cyclic learning

With cyclic learning, we first have to calculate the asymptotic misadjustment for a particular cycle \vec{x}. In lowest order we obtain

A_{\vec{x}} = \tilde\eta^2 \left[ \frac{\bar\pi_{\vec{x}}}{\sqrt{P}} \right]^2,

with \bar\pi_{\vec{x}} the asymptotic solution of the differential equation (23):

\bar\pi_{\vec{x}} = \frac{q_{\vec{x}}(w^*)}{|F'(w^*)|}.

The average asymptotic misadjustment, which is the average over all possible cycles, thus obeys

A \equiv \langle A_{\vec{x}} \rangle_{\vec{x}} = \frac{\tilde\eta^2\, D(w^*)}{12\, P}.   (25)

The (average) asymptotic misadjustment for cyclic learning is (for small learning parameters and a considerable number of patterns P) always an order of magnitude larger than the asymptotic misadjustment for almost cyclic learning. Almost cyclic learning is therefore a better alternative for batch-mode learning than cyclic learning.
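To illustrate these scalings, here is a small numerical sketch on a one-dimensional LMS-type problem; the learning rule, all parameter values, and all function names are illustrative choices rather than part of the paper, and only the relative orders of magnitude should be compared with the results above.

import numpy as np

rng = np.random.default_rng(0)
P, eta, n_cycles = 10, 0.002, 20000
a = rng.normal(size=P)                        # inputs
y = 0.5 * a + rng.normal(scale=0.3, size=P)   # noisy targets: not perfectly learnable

def f(w, mu):
    # LMS-type learning rule for pattern mu: f(w, x^mu) = (y^mu - w a^mu) a^mu.
    return (y[mu] - w * a[mu]) * a[mu]

w_star = np.dot(a, y) / np.dot(a, a)          # minimum of the quadratic error potential

def run(order_fn, skip=n_cycles // 2):
    # Train for n_cycles cycles of P presentations each; return the average of
    # (w - w*)^2 measured at the end of each of the last n_cycles - skip cycles.
    w, dev2 = 0.0, []
    for c in range(n_cycles):
        for mu in order_fn(c):
            w += eta * f(w, mu)
        if c >= skip:
            dev2.append((w - w_star) ** 2)
    return np.mean(dev2)

fixed_order = rng.permutation(P)              # the single cycle used for cyclic learning
print("on-line       :", run(lambda c: rng.integers(P, size=P)))
print("cyclic        :", run(lambda c: fixed_order))
print("almost cyclic :", run(lambda c: rng.permutation(P)))
print("reference scales: eta =", eta, " P*eta^2 =", P * eta**2, " eta^2 =", eta**2)

Because a single fixed cycle gives only one sample of the cycle-dependent misadjustment A_{\vec{x}}, and all constant prefactors have been dropped, one should expect order-of-magnitude agreement only: the on-line misadjustment well above the cyclic one, and the cyclic one above the almost cyclic one.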

3.2 Necessary number of learning steps

In this section we consider the decay of the misadjustment to its asymptotic solution for the three incremental learning strategies, both with fixed and with time-dependent learning parameters. We will work in the limit 1 \ll P^2 \ll 1/\eta_n \ll n. The analysis in Section 2 shows that the convergence rate is the same for all learning strategies discussed in this paper, i.e., in lowest order we can write

M_{n+1} - M_n = 2 \eta_n\, |F'(w^*)| \left[ A(\eta_n) - M_n \right],   (26)

with \eta_n the learning parameter at learning step n and A(\eta) the asymptotic misadjustment for the various learning strategies. We recall from Section 3.1 that A(\eta) \propto \eta for on-line learning, A(\eta) \propto \eta^2 for almost cyclic learning, and A(\eta) \propto P \eta^2 for cyclic learning. Following [12], we consider the concept of an \epsilon-optimal weight, i.e., a weight w within order \epsilon of the local minimum w^*. We will calculate the typical number of learning steps n_\epsilon necessary to reach such an \epsilon-optimal weight, i.e., the typical number of learning steps until the misadjustment is of order \epsilon^2. Our analysis is a generalization of [12] to general (nonlinear) learning rules and points out the important difference between the three incremental learning strategies.

3.2.1 Fixed learning parameters

With a fixed learning parameter, i.e., \eta_n = \eta for all n, the misadjustment M_n obeying (26) can be written

M_n = A(\eta) + O\!\left( e^{-2 \eta |F'(w^*)| n} \right).   (27)

To make sure that the asymptotic misadjustment is of order \epsilon^2 we have to choose \eta = O(A^{-1}(\epsilon^2)), with A^{-1}(\cdot) the inverse of A(\cdot). Substitution into (27) yields

e^{-\alpha n} = O(\epsilon^2), with \alpha/\eta = O(1).

Thus, the typical number of learning steps n_\epsilon necessary to generate an \epsilon-optimal weight is

n_\epsilon = O\!\left( \frac{1}{A^{-1}(\epsilon^2)} \log\frac{1}{\epsilon} \right).

We have n_\epsilon \sim (1/\epsilon^2) \log(1/\epsilon) for on-line learning, n_\epsilon \sim (1/\epsilon) \log(1/\epsilon) for almost cyclic learning, and n_\epsilon \sim (\sqrt{P}/\epsilon) \log(1/\epsilon) for cyclic learning.

3.2.2 Time-dependent learning parameters

With time-dependent learning parameters we can choose the learning parameter \eta_n yielding the fastest possible decay of the misadjustment M_n. Optimizing the right-hand side of (26) with respect to \eta_n, we obtain a relationship between M_n and \eta_n:

A(\eta_n) - M_n + \eta_n A'(\eta_n) = 0, and thus M_n = (r + 1)\, A(\eta_n),   (28)

with r = 1 for on-line learning and r = 2 for learning with cycles. Substitution of (28) into (26) yields a difference equation for \eta_n. For large n, the solution of this difference equation reads

\eta_n = \frac{r + 1}{2\, |F'(w^*)|\, n}.   (29)

Combining (28) and (29), we obtain M_n \sim 1/n and thus n_\epsilon \sim 1/\epsilon^2 for on-line learning, M_n \sim 1/n^2 and n_\epsilon \sim 1/\epsilon for almost cyclic learning, and M_n \sim P/n^2 and n_\epsilon \sim \sqrt{P}/\epsilon for cyclic learning. With time-dependent learning parameters, the necessary number of learning steps n_\epsilon is a factor of order \log(1/\epsilon) smaller than with fixed learning parameters. In both cases, cyclic learning requires about \sqrt{P} times more learning steps than almost cyclic learning.
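As a concrete instance of (28) and (29), not worked out in the original text, consider on-line learning (r = 1) with the asymptotic misadjustment A(\eta) = \eta D(w^*)/(2|F'(w^*)|) from Section 3.1.2:

\eta_n = \frac{2}{2\, |F'(w^*)|\, n} = \frac{1}{|F'(w^*)|\, n},
\qquad
M_n = 2\, A(\eta_n) = \frac{D(w^*)}{|F'(w^*)|^2\, n},

i.e., the familiar 1/n learning-parameter schedule, with a misadjustment decaying as 1/n, so that reaching M_n of order \epsilon^2 indeed requires n_\epsilon of order 1/\epsilon^2.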

4 Discussion

We have studied the consequences of different learning strategies. For local optimization, learning with cycles is a better alternative for batch-mode learning than on-line learning. The asymptotic misadjustment for learning with cycles scales with \eta^2, whereas the asymptotic misadjustment for on-line learning is proportional to \eta. Furthermore, learning with cycles requires fewer learning steps to get close to the minimum. Almost cyclic learning yields in this respect even better performance than cyclic learning.

Learning with cycles can be interpreted as a more "conservative" learning strategy, since within each cycle the network receives information about all training patterns. With on-line learning, on the other hand, the time span between two presentations of a particular pattern can be much larger than the period of one cycle. As a result the fluctuations in the network weights are larger for on-line learning than for learning with cycles. The asymptotic bias for cyclic learning can be viewed as an artefact of the fixed presentation order. Almost cyclic learning prevents this artefact by randomizing over the presentation orders, at the (lower) price of small asymptotic fluctuations.

In our presentation, we have used one-dimensional notation. However, it is straightforward to generalize the results to general high-dimensional weight vectors. The asymptotic misadjustment for the various learning strategies then depends on, most importantly, the Hessian matrix and the diffusion matrix. The Hessian matrix is related to the local curvature of the error potential, the diffusion matrix to the fluctuations in the learning rule. Training sets with lots of redundant information have a lower diffusion, and thus a lower asymptotic bias, than training sets with very specific and contradictory information. For backpropagation, as well as for other learning rules based on the loglikelihood of the training patterns, the diffusion matrix is, up to a global scale factor, proportional to the Hessian matrix (see e.g. [17]). As a consequence, for on-line learning the asymptotic distribution of the weights is isotropic (see [5]). Similarly, it can be shown that the asymptotic covariance matrix for cyclic learning (averaged over all possible presentation orders) is, in a lowest-order approximation, proportional to the diffusion matrix, as could be guessed from the one-dimensional result (25). Therefore, this covariance matrix is, for learning rules based on loglikelihood procedures, also proportional to the Hessian matrix. The asymptotic covariance matrix for almost cyclic learning is proportional to the square of the Hessian matrix. The covariance matrix that would result from a local lowest-order approximation of a Gibbs distribution is proportional to the inverse of the Hessian matrix.

Our analysis is valid in the limit of small learning parameters, but even then only locally, i.e., in the vicinity of local minima and on short time scales. The theory can therefore not be used to make quantitative predictions about global properties of the learning behavior, such as escape times out of local minima or stationary distributions. However, the above local description may be helpful to explain some aspects of global properties. For instance, the larger the local fluctuations, the higher the probability to escape (see e.g. [18, 19]).
This might be one of the reasons why on-line learning, the learning strategy with the largest fluctuations, often yields the best results, especially in complex problems with many local minima (see e.g. [20]).

References

[1] D. Rumelhart, J. McClelland, and the PDP Research Group. Parallel Distributed Processing: Explorations in the Microstructure of Cognition. MIT Press, 1986.
[2] T. Kohonen. Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43:59-69, 1982.

[3] C. Kuan and K. Hornik. Convergence of learning algorithms with constant learning rates. IEEE Transactions on Neural Networks, 2:484-489, 1991.
[4] G. Radons. On stochastic dynamics of supervised learning. Journal of Physics A, 26:3455-3461, 1993.
[5] L. Hansen, R. Pathria, and P. Salamon. Stochastic dynamics of supervised learning. Journal of Physics A, 26:63-71, 1993.
[6] T. Heskes. On Fokker-Planck approximations of on-line learning processes. Journal of Physics A, 27:5145-5160, 1994.
[7] S. Amari. A theory of adaptive pattern classifiers. IEEE Transactions on Electronic Computers, 16:299-307, 1967.
[8] T. Heskes and B. Kappen. Learning processes in neural networks. Physical Review A, 44:2718-2726, 1991.
[9] W. Wiegerinck and T. Heskes. On-line learning with time-correlated patterns. Europhysics Letters, 28:451-455, 1994.
[10] G. Orr and T. Leen. Momentum and optimal stochastic search. In M. Mozer, P. Smolensky, D. Touretzky, J. Elman, and A. Weigend, editors, Proceedings of the 1993 Connectionist Models Summer School, Hillsdale, 1993. Erlbaum Associates.
[11] W. Wiegerinck, A. Komoda, and T. Heskes. Stochastic dynamics of learning with momentum in neural networks. Journal of Physics A, 27:4425-4437, 1994.
[12] Z. Luo. On the convergence of the LMS algorithm with adaptive learning rate for linear feedforward networks. Neural Computation, 3:226-245, 1991.
[13] N. van Kampen. Stochastic Processes in Physics and Chemistry. North-Holland, Amsterdam, 1992.
[14] W. Press, B. Flannery, A. Teukolsky, and W. Vetterling. Numerical Recipes in C. Cambridge University Press, Cambridge, 1989.
[15] C. Gardiner. Handbook of Stochastic Methods. Springer, Berlin, second edition, 1985.
[16] G. Radons, H. Schuster, and D. Werner. Fokker-Planck description of learning in backpropagation networks. In International Neural Network Conference 90 Paris, pages 993-996, Dordrecht, 1990. Kluwer Academic.
[17] W. Buntine and A. Weigend. Computing second derivatives in feed-forward networks: A review. IEEE Transactions on Neural Networks, 5:480-488, 1994.
[18] T. Heskes, E. Slijpen, and B. Kappen. Learning in neural networks with local minima. Physical Review A, 46:5221-5231, 1992.
[19] T. Heskes, E. Slijpen, and B. Kappen. Cooling schedules for learning in neural networks. Physical Review E, 47:4457-4464, 1993.
[20] E. Barnard. Optimization for neural nets. IEEE Transactions on Neural Networks, 3:232-240, 1992.