Memory capacity of neural networks learning within bounds


J. Physique 48 (1987) 2053-2058, décembre 1987

Classification Physics Abstracts: 75.10H, 64.60, 87.30

Memory capacity of neural networks learning within bounds

Mirta B. Gordon
Centre d'Études Nucléaires de Grenoble, Département de Recherche Fondamentale / Service de Physique, Groupe Magnétisme et Diffraction Neutronique (*), 85 X, 38041 Grenoble Cedex, France

(Reçu le 7 juillet 1987, accepté le 12 août 1987)

(Article published online by EDP Sciences and available at http://dx.doi.org/10.1051/jphys:0198700480120205300)

Résumé (translated). We present a model of long-term memory: learning within irreversible bounds. The best bound values and the memory capacity are determined numerically. We show that it is in general possible to calculate the memory capacity analytically by solving the random-walk problem associated with each learning rule. Our estimates for several learning rules are in excellent agreement with the numerical and statistical-mechanics results.

Abstract. We present a model of long-term memory: learning within irreversible bounds. The best bound values and the memory capacity are determined numerically. We show that it is in general possible to calculate the memory capacity analytically by solving the random-walk problem associated with a given learning rule. Our estimates for several learning rules are in excellent agreement with numerical and analytical statistical-mechanics results.

In the last few years, a great amount of work has been done on the properties of networks of formal neurons, proposed by Hopfield [1] as models of associative memories. In these models, each neuron i is represented by a spin variable $\sigma_i$ which can take only two values, $\sigma_i = +1$ or $\sigma_i = -1$. Any state of the system is defined by the values $\{\sigma_1, \sigma_2, \ldots, \sigma_N\}$ taken by each one of the N spins, or neurons. Pairs of neurons i, j interact with strengths $C_{ij}$, the synaptic efficacies, which are modified by learning. As usual, we denote by $\xi^\nu$ ($\nu = 1, 2, \ldots$) the learnt states, or patterns.

Retrieval of patterns is a dynamic process in which each spin takes the sign of the local field acting on it:

$$h_i = {\sum_j}' C_{ij}\,\sigma_j . \eqno(1)$$

The primed sum means that the term j = i is omitted. A learnt state $\xi^\nu$ is said to be memorized, or retrieved, if, starting with the network in state $\xi^\nu$, it relaxes towards a final state close to $\xi^\nu$. In general, the final state can be very different from $\xi^\nu$, and will be denoted $\tilde\xi^\nu$. The overlap between both,

$$q^\nu = \frac{1}{N} \sum_i \xi_i^\nu\, \tilde\xi_i^\nu , \eqno(2)$$

gives a measure of retrieval quality. The simplest local learning prescription [2] for p learnt patterns is Hebb's rule:

$$C_{ij} = \frac{1}{N} \sum_{\nu=1}^{p} \xi_i^\nu \xi_j^\nu . \eqno(3)$$

Assuming that the values of $\xi_i^\nu$ are random and uncorrelated, it has been shown [3] that the maximum number of patterns p that can be memorized with Hebb's learning rule is proportional to the number of neurons: $p = \alpha N$, with $\alpha = 0.145 \pm 0.009$. If more than $\alpha N$ patterns are learnt, memory breaks down and none of the learnt patterns is retrieved. In order to avoid this catastrophic effect, different modifications of Hebb's rule were proposed [4-6]. The simplest one is the so-called learning within bounds [5]: synaptic efficacies are modified by learning in the same way as with Hebb's rule, but their values are constrained to remain within some chosen range. In the version proposed by Parisi [4] the bounds are reversible: once a $C_{ij}$ reaches a barrier, it remains at that value until a pattern is learnt that returns it inside the allowed range. This is a model of
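A minimal sketch may make these definitions concrete: Hebbian storage (3), zero-temperature retrieval dynamics (1), and the overlap (2). This is an illustration, not the paper's code; the network size, pattern load and noise level below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 400, 20                          # network size and load (p/N well below 0.145)
xi = rng.choice([-1, 1], size=(p, N))   # random uncorrelated patterns

# Hebb's rule: C_ij = (1/N) sum_nu xi_i^nu xi_j^nu, with no self-coupling
C = (xi.T @ xi) / N
np.fill_diagonal(C, 0.0)

def relax(sigma, sweeps=10):
    """Sequential zero-temperature dynamics: each spin takes the sign of its local field."""
    sigma = sigma.copy()
    for _ in range(sweeps):
        changed = False
        for i in range(N):
            s = 1 if C[i] @ sigma >= 0 else -1
            if s != sigma[i]:
                sigma[i], changed = s, True
        if not changed:                 # fixed point reached
            break
    return sigma

# start from a corrupted copy of pattern 0 and measure the retrieval overlap
noisy = xi[0] * rng.choice([1, -1], size=N, p=[0.9, 0.1])
q = (xi[0] @ relax(noisy)) / N
print(f"overlap q = {q:.3f}")
```

Well below saturation, the noisy key relaxes back onto the stored pattern and the overlap is close to 1.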

short-term memory: only the last learnt patterns are retrieved; old memories are gradually erased by learning. With this learning rule no deterioration occurs, but the storage capacity is smaller than with Hebb's rule.

In the first part of this paper, we present numerical simulations on a model of long-term memory, which is an irreversible version of learning within bounds: those synaptic efficacies that reach a bound remain at that value for ever [6]. The best bounds and the storage capacity are similar to those found with reversible bounds, but now the first, and not the last, learnt patterns are memorized. In the second part of the paper, we show that a quantitative analysis of the random walk associated with each learning rule gives a very good estimate of the network's memory capacity. We present results for the standard Hebb's rule and for different variants of learning within bounds. Generalization to other learning rules is straightforward, and is presented in section 3.

1. Learning within irreversible bounds. Numerical simulations.

The learning rule with irreversible bounds, or barriers, is

$$C_{ij}(\nu) = \begin{cases} C_{ij}(\nu - 1) + \dfrac{1}{N}\,\xi_i^\nu \xi_j^\nu & \text{if } \nu \le s_{ij} \\[4pt] \pm m/N & \text{if } \nu > s_{ij} \end{cases} \eqno(4)$$

with $C_{ij}(0) = 0$, where $s_{ij}$ is the pattern number for which $C_{ij}$ first reaches a bound $\pm m/N$. Patterns after $s_{ij}$ are not learnt, and the synaptic efficacy is saturated. For $m \to \infty$, the standard Hebb's rule is recovered. But, unlike in Hebbian learning, with rule (4) the number $\nu$, the « time » at which $\xi^\nu$ is learnt, is relevant.

In our numerical simulations, random patterns were learnt following (4). Each time a new pattern was added, the retrieval quality of all the previously stored patterns was tested: starting with the network in a learnt state, spins are allowed to flip with Monte Carlo sequential dynamics until relaxation to a state in which each spin takes the sign of the field (1) acting on it. A learnt pattern is considered as well memorized if its overlap q with the relaxed state is q > 0.97. Any other value would give nearly the same results, because patterns are either retrieved almost without error ($q \approx 1$) or with $q \ll 1$. The bound value giving the maximal number of well-retrieved patterns, $m_{\rm opt}$, was determined for networks with N = 100, 150, 200 and 400 neurons by testing different values of m.

Figure 1 shows the retrieval quality (2) as a function of the pattern number, for N = 400.

Fig. 1. Overlap between the learnt pattern and the retrieved state vs. $\nu$, the number of the learnt pattern, once p patterns were learnt with the best bound value $m = m_{\rm opt}$.

With the best bounds ($m_{\rm opt}$), the overlap jumps abruptly from 1 to a small value, showing that only the first learnt patterns are memorized. Figure 2 is a plot of the number of well-retrieved patterns versus the number of learnt patterns.

Fig. 2. Number of well-retrieved patterns (q > 0.97) vs. number of learnt patterns.

For $m < m_{\rm opt}$, a smaller number of patterns is retrieved in the asymptotic regime (p large), and for $m > m_{\rm opt}$ the number of retrieved patterns vanishes for large p, as it should, because in the large-m limit the standard Hebb's rule, with its catastrophic memory deterioration, is recovered. Optimal bound values increase with the network size, but we do not have enough accuracy to establish the law $m_{\rm opt}(N)$ numerically. In the next section it is shown that $m_{\rm opt} \approx 0.3\sqrt{N}$, and the numerical data are consistent with this prediction. With the optimal bounds, we find a storage capacity
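The simulation just described can be sketched as follows. This is an illustrative reimplementation, not the original code: N, p and the relaxation schedule are arbitrary choices, with the bound set at the predicted $m_{\rm opt} \approx 0.3\sqrt{N}$. The printed list should be dominated by the first learnt patterns.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 400
p = 40                                    # patterns presented (predicted capacity ~ 0.05 N)
m = round(0.3 * np.sqrt(N))               # predicted m_opt ~ 0.3 sqrt(N), in units of the 1/N step
xi = rng.choice([-1, 1], size=(p, N))

C = np.zeros((N, N))
frozen = np.zeros((N, N), dtype=bool)     # efficacies that reached a bound never move again
for nu in range(p):
    C = np.where(frozen, C, C + np.outer(xi[nu], xi[nu]) / N)
    hit = np.abs(C) >= m / N - 1e-12      # small slack against floating-point rounding
    C = np.where(hit & ~frozen, np.sign(C) * m / N, C)
    frozen |= hit
np.fill_diagonal(C, 0.0)

def retrieved(nu, sweeps=5):
    """Relax from pattern nu; count it as well memorized if the final overlap exceeds 0.97."""
    sigma = xi[nu].copy()
    for _ in range(sweeps):
        for i in range(N):
            sigma[i] = 1 if C[i] @ sigma >= 0 else -1
    return (xi[nu] @ sigma) / N > 0.97

good = [nu for nu in range(p) if retrieved(nu)]
print("well-retrieved patterns:", good)
```

With these parameters the early patterns relax back onto themselves while the late ones are lost, the long-term-memory behaviour of figure 1.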

of about 0.05 N.

These results show that learning within irreversible bounds is a model of long-term memory, in the sense that only the oldest learnt patterns are remembered. The catastrophic deterioration of Hebb's rule is avoided by stopping the acquisition of new patterns once the memory is saturated. The capacity and the « best » bound values are similar to those of the reversible « learning within bounds » [4], a memory which forgets.

2. Random walk analysis.

For uncorrelated random learnt patterns, the synaptic efficacies $C_{ij}$ perform random walks of steps $\pm 1/N$. In this section we show how a probabilistic analysis gives the maximum memory capacity of the network under a given learning rule. It is based on the following fact, observed in our numerical simulations: when the initial state of the network is a learnt state, then either it remains in this state upon relaxation (retrieval is then perfect, q = 1), or it moves away, and this from the very first Monte Carlo step, to a distant state (q small). This suggests that an analysis based on the first Monte Carlo step should be able to predict the memory capacity of a network with a given learning rule. That this is the case is shown in this and the following sections. We first present the method on Hebb's rule, for which analytic results and very accurate numerical simulations exist, to show how it works on a simple model, before applying it to learning within bounds.

2.1 HOPFIELD MODEL. The learning rule is given by (3). When the network is in the learnt state $\xi^\nu$, the field acting on neuron i, averaged over all the learnt patterns (assumed random and uncorrelated), is

$$\bar h_i\, \xi_i^\nu = 1 \eqno(5)$$

(neglecting terms of order 1/N). Therefore, when the network is allowed to relax, the spins should on average remain in state $\xi^\nu$. Note that if the initial state is not a learnt state, then $\bar h_i = 0$. The second moment of the field distribution for p learnt patterns is

$$\overline{h_i^2} = \frac{p}{N} + 1 . \eqno(6)$$

The first contribution to $\overline{h_i^2}$ comes from the terms j = k in $\overline{h_i^2} = \sum_{j,k} C_{ij} C_{ik}\,\sigma_j \sigma_k$; it exists also if the network is not in a learnt state [1, 7]. The second contribution comes from the terms j ≠ k. The variance of the field acting on a given neuron is then

$$\Delta = \overline{h_i^2} - \bar h_i^2 = \frac{p}{N} . \eqno(7)$$

Therefore, even if the initial state is a learnt state, say $\xi^\nu$, when p/N is large enough there is some probability that the sign of the field acting on a neuron i is opposite to $\xi_i^\nu$. This probability (we drop the subscript i, all neurons being equivalent) is a function of $x = \Delta / \bar h^2$:

$$P(x) = \frac{1}{2}\,{\rm erfc}\!\left(\frac{1}{\sqrt{2x}}\right) .$$

For small x, the function P(x) vanishes like $\exp(-1/2x)$, and is linear in x in the neighbourhood of $x^* = 1/3$, the inflexion point. It can be approximated (Fig. 3) by a straight line passing through $x^*$, where $P(x^*) \approx 0.042$, of slope $\left.\frac{dP}{dx}\right|_{x^*} = \frac{1}{\sqrt{\pi}} \left(\frac{3}{2}\right)^{3/2} e^{-3/2} \approx 0.2313$, which crosses the x axis at $x_0 \approx 0.153$. For $x \ll x_0$, $P(x) \approx 0$; beyond the crossover point at 0.153, errors in retrieval are expected. From (5) and (7), $\Delta / \bar h^2 = p/N$; the maximum number of patterns that can be learnt before errors in retrieval become important is therefore $p \approx 0.153\,N$, in excellent agreement with theoretical [3] and numerical results ($\alpha = 0.145 \pm 0.009$). The prescription for maximum storage capacity is then

$$\frac{\Delta}{\bar h^2} = 0.153 . \eqno(8)$$

Fig. 3. Probability of $h_i \xi_i^\nu < 0$ as a function of $x = \Delta / \bar h^2$.
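The numbers entering this analysis can be verified directly. The sketch below assumes the Gaussian-field form $P(x) = \frac{1}{2}{\rm erfc}(1/\sqrt{2x})$ and checks the inflexion point $x^* = 1/3$, the slope $\approx 0.2313$ and the crossover $x_0 \approx 0.153$; it then estimates, by Monte Carlo, the mean absorption time $\bar s = m^2$ of the bounded random walk quoted in equation (12) (the barrier height and trial count are arbitrary choices).

```python
import math
import random

def P(x):
    """Probability that the local field has the wrong sign, for x = Delta / hbar^2."""
    return 0.5 * math.erfc(1.0 / math.sqrt(2.0 * x))

x_star = 1.0 / 3.0                                   # inflexion point of P(x)
slope = (3.0 / 2.0) ** 1.5 * math.exp(-1.5) / math.sqrt(math.pi)
x0 = x_star - P(x_star) / slope                      # tangent at x* crosses the x axis here
print(f"P(x*) = {P(x_star):.4f}, slope = {slope:.4f}, x0 = {x0:.4f}")

# Mean absorption time of a unit-step symmetric walk between barriers at +/-m,
# which the random-walk analysis of the next subsections quotes as s_bar = m^2.
random.seed(2)

def absorption_time(m):
    pos, t = 0, 0
    while abs(pos) < m:
        pos += random.choice((-1, 1))
        t += 1
    return t

m, trials = 6, 20000
s_mean = sum(absorption_time(m) for _ in range(trials)) / trials
print(f"mean absorption time: {s_mean:.1f}   (prediction: m^2 = {m * m})")
```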

2.2 LEARNING WITHIN IRREVERSIBLE BOUNDS. When the network is in state $\xi^\nu$, the average field acting on a neuron i is, to lowest order in 1/N,

$$\bar h_i\, \xi_i^\nu = P(s > \nu) , \eqno(9)$$

where $P(s > \nu)$ is the probability of performing a random walk of more than $\nu$ steps between absorbing barriers at +m and −m, without absorption. For large $\nu$ (see Appendix Aa):

$$P(s > \nu) \simeq \frac{4}{\pi} \left(\cos\frac{\pi}{2m}\right)^{\nu} \simeq \frac{4}{\pi} \exp\!\left(-\frac{\pi^2 \nu}{8 m^2}\right) . \eqno(10)$$

The variance of the field is easily seen to be

$$\Delta = \bar s / N , \eqno(11)$$

where $\bar s = \sum_s s\,P(s)$ and P(s) is the probability that absorption takes place in s steps, so that $\bar s$ is the mean number of patterns learnt by a bond before its strength $C_{ij}$ sticks to the bounds. From the random walk problem (Appendix Aa):

$$\bar s = m^2 . \eqno(12)$$

Unlike in the Hebbian scheme of learning, in the present case the dispersion of the field values is kept constant by the bounds. Storage capacity is limited because the average field, constant with Hebb's rule, now decreases with the pattern number. Therefore, only the first learnt patterns have a field on each neuron large enough to ensure good retrieval. Introducing (9) to (12) into (8) gives the maximum number $\nu$ of patterns expected to be memorized, for a given m. After maximization of $\nu$ with respect to m, we find $m_{\rm opt} \simeq 0.3\sqrt{N}$ and $\nu(m_{\rm opt}) \simeq 0.05\,N$, in very good agreement with our numerical simulations.

2.3 LEARNING WITHIN REVERSIBLE BOUNDS. With this learning scheme [4], the synaptic efficacies show reversible saturation effects: they stick to the bounds and do not learn those patterns that would make them take values beyond the allowed range,

$$C_{ij}(\nu) = \begin{cases} C_{ij}(\nu-1) + \dfrac{1}{N}\,\xi_i^\nu \xi_j^\nu & \text{if } \left|C_{ij}(\nu-1) + \dfrac{1}{N}\,\xi_i^\nu \xi_j^\nu\right| \le m/N \\[4pt] C_{ij}(\nu-1) & \text{otherwise.} \end{cases} \eqno(14)$$

Let $s_{ij}$ be the pattern that produced the last saturation effect on bond ij. The values taken by $C_{ij}$ on learning the patterns that follow pattern $s_{ij}$ are all within the allowed range, as if the barriers did not exist. After learning a large number of patterns, the random walk between reversible barriers settles into an equilibrium distribution: $C_{ij}(p)$ takes any of the allowed values n/N (n = −m, −m+1, ..., m) with probability 1/(2m + 1). When the network is in state $\xi^\nu$, the field averaged over all the learnt patterns, and its variance, are given by (see Appendix Ab)

$$\bar h(\eta) = P(\tau > \eta) , \qquad \Delta = \frac{m(m+1)}{3N} , \eqno(15)$$

where $\eta = p - \nu$ is the pattern number counted starting from the last learnt one, and $P(\tau > \eta)$ is the probability of a random walk of more than $\eta$ steps, starting from +m or −m, without sticking to the barriers. For $\eta \geq 1$ we get (see Appendix Ab)

$$P(\tau > \eta) \propto \exp\!\left(-\frac{\pi^2 \eta}{8 (m+1)^2}\right) . \eqno(16)$$

The field is now a decreasing function of $\eta$: the effect of learning new patterns is to lower the local fields acting on older patterns, while the variance of the field distribution remains constant. Introducing (15) and (16) into (8), and maximizing $\eta$ with respect to m, gives, in good agreement with numerical results [4],

$$m_{\rm opt} \simeq 0.35\sqrt{N} , \qquad \eta(m_{\rm opt}) \simeq 0.04\,N . \eqno(17)$$

It is interesting to apply this analysis to learning without synaptic sign changes [5]. The learning rule is the same as (14), but half of the synaptic efficacies are constrained between −m/N and 0, the others between 0 and m/N. From the corresponding random walk, the field decreases faster with $\eta$ than (16), for a given m, because the allowed range for the $C_{ij}$ is half as large as before, and saturation effects therefore appear in fewer steps. However, the variance is of the same order of magnitude.

Therefore, the memory capacity will be smaller than when synaptic sign changes are allowed. Indeed, one finds the same value (Eq. (17)) for $m_{\rm opt}$ as before (this value depends only on $\Delta$), but $\eta$ is 4 times smaller, $\eta(m_{\rm opt}) \simeq 0.011\,N$, in fairly good agreement with numerical results [5].

3. Generalization to other learning rules.

The results of section 2 can easily be generalized to learning rules with variable acquisition intensities:

$$C_{ij} = \frac{1}{N} \sum_{\mu=1}^{p} \lambda_\mu\, \xi_i^\mu \xi_j^\mu . \eqno(18)$$

The average field on a neuron, when the network is in the learnt state $\xi^\nu$, is

$$\bar h_i\, \xi_i^\nu = \lambda_\nu , \eqno(19)$$

and the dispersion is given by

$$\Delta = \frac{1}{N} \sum_{\mu=1}^{p} \lambda_\mu^2 . \eqno(20)$$

With Hebb's rule, $\lambda_\mu = 1$ and the results of section 2.1 are recovered. Here, the condition for pattern $\nu$ to be well retrieved is

$$\lambda_\nu^2 \geq \frac{1}{0.153\,N} \sum_{\mu=1}^{p} \lambda_\mu^2 . \eqno(21)$$

An example of such a rule is marginalist learning [5, 9], in which the weights increase exponentially in order to ensure good retrieval of the last learnt pattern. Introducing $\lambda_\mu = \exp(\varepsilon^2 \mu / 2N)$ into (19) and (20) shows that within this scheme both the average field and its dispersion increase with learning. If good retrieval of only the last learnt pattern is imposed, then $\nu = p$ in (21), and the value of $\varepsilon^2$ that ensures this must satisfy $\varepsilon^2 = 1/0.153$. That is, $\varepsilon \approx 2.56$, which is the value estimated numerically in [5], and is in very good agreement with $\varepsilon = 2.465$, the replica-symmetric solution of this model [9]. But it is possible to do better, and ask that the last $\eta$ learnt patterns be retrieved. Introducing $\nu = p - \eta$ into (21), we find

$$\eta \leq \frac{N}{\varepsilon^2} \ln\!\left(0.153\,\varepsilon^2\right) . \eqno(22)$$

Maximising $\eta$ with respect to $\varepsilon^2$ gives $\varepsilon_{\rm opt}$, the « best » $\varepsilon$, and the number of well-retrieved states, again in excellent agreement with the theoretical predictions [9]: $\varepsilon_{\rm opt} = 4.108$, $\eta(\varepsilon_{\rm opt}) = 0.04895\,N$.

Result (21) shows that the normalization of the $C_{ij}$ that consists of dividing them by $\left(\sum_\mu \lambda_\mu^2\right)^{1/2}$ does not affect the memory capacity, and also suggests how other selective learning rules can be devised. It is possible, for example, to give stronger weights to the most « important » patterns, in order to keep them in memory even when other patterns are forgotten, or to reinforce [9] the memorization of a given pattern $\nu$ when it is at the limit of being erased (equality sign in (21)) by learning it again.

Conclusion.

We analysed different schemes of learning sequences of uncorrelated patterns. When the network is in a learnt state, the average value $\bar h$ of the field acting on a given neuron, produced by all the others, has the same sign as the neuron's spin, so the network should remain in the learnt state. The probability of a field of opposite sign is vanishingly small for a small number of stored patterns, but the crossover to a regime where this probability increases almost linearly sets an upper limit to the storage capacity. The maximum storage capacity is attained when $\Delta / \bar h^2 = 0.153$, where $\Delta$ is the mean square width of the field distribution. We tested this prescription on several models of learning within bounds, proposed as models of short- and long-term memory. The estimated storage capacities and the best bound values are in excellent agreement with the numerical results. With Hebb's rule, $\bar h = 1$ and remains constant with pattern acquisition, while $\Delta$ increases. At crossover, because $\Delta$ and $\bar h$ are the same for all learnt patterns, all of them are « forgotten » together. In learning within bounds, $\Delta$ is constant and $\bar h$ decreases with the pattern number: memory is lost only for those patterns that have small values of $\bar h$. Generalization to other learning schemes is straightforward: the storage capacity with a given rule can be estimated once $\bar h$ and $\Delta$ are known. The fact that our predictions, based on a first Monte Carlo step, are so successful suggests that the size of the basins of attraction at maximum storage capacity is $\sim N / [2 \times ({\rm max.\ storage\ capacity})]$. The factor 2 is there because the patterns $\xi$ and $-\xi$ cannot be distinguished in Hopfield's networks.
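The marginalist estimate of section 3 is also easy to check numerically. The sketch below assumes $\lambda_\mu = \exp(\varepsilon^2 \mu / 2N)$ and the first-step criterion $\Delta / \bar h^2 \leq 0.153$, and scans $\varepsilon^2$ on a grid; it finds a broad optimum around $\varepsilon \approx 4.1$-$4.4$ with a capacity near 0.056 N, close to the quoted replica values (the residual offset is expected, since this is the crude first-step estimate, not the replica calculation of [9]).

```python
import numpy as np

# Numerical check of the marginalist-learning estimate (an illustrative sketch;
# N, p and the grid are arbitrary choices, and the criterion Delta/h^2 <= 0.153
# is the first-Monte-Carlo-step prescription of the text).
N, p = 2000, 4000          # p >> capacity, so the asymptotic regime is reached
crit = 0.153

def n_retrievable(eps2):
    mu = np.arange(1, p + 1)
    lam2 = np.exp(eps2 * mu / N)                # lambda_mu^2
    delta = lam2.sum() / N                      # field dispersion
    return int(np.sum(lam2 >= delta / crit))    # patterns passing the retrieval condition

eps2_grid = np.linspace(8, 32, 97)
counts = [n_retrievable(e) for e in eps2_grid]
best = int(np.argmax(counts))
eps_opt = float(np.sqrt(eps2_grid[best]))
cap = counts[best] / N
print(f"eps_opt ~ {eps_opt:.2f}, capacity ~ {cap:.3f} N")
```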

This extends to other learning rules a result that is exact with Hebb's rule [10]. Finally, several authors [1, 3, 5, 7, 8] have already pointed out that memory deterioration is due to the increasing noise on the synaptic efficacies produced by the acquisition of new patterns. Our approach gives a quantitative estimate of the storage capacity, until now only available through numerical simulations or, in some special cases, statistical mechanics calculations.

Acknowledgments.

Useful discussions with Pierre Peretto, who suggested the model of learning within irreversible bounds, are gratefully acknowledged.

Appendix A.

The solution to the random walk between barriers and some intermediate results leading to formulae (10), (12) and (16) are summarized in this appendix.

A(a) ABSORBING BARRIERS. For a random walk [11] between absorbing barriers at +m and −m, the probability of performing a walk of n steps from state i to state j is

$$P_n(i \to j) = \sum_k \lambda_k^n\, v_i(k)\, v_j(k) , \eqno({\rm A.1})$$

where $\lambda_k = \cos(k\pi/2m)$ are the eigenvalues of the transition probability matrix, and $v_j(k) = \sin[(j+m)k\pi/2m]/\sqrt{m}$ ($j = -m+1, \ldots, m-1$) are the corresponding eigenvector components. The probability of a random walk of more than $\nu$ steps without absorption, starting from i = 0, is then

$$P(s > \nu) = \sum_j \sum_k \lambda_k^\nu\, v_0(k)\, v_j(k) . \eqno({\rm A.2})$$

The dominant contribution to this sum is the term k = 1, which gives equation (10). The mean number of patterns learnt by a given bond $C_{ij}$ before saturation is the mean time to absorption $\bar s$ in the random walk problem. It is the derivative of the generating function of the probability of absorption [11],

$$F(x) = \sum_s f_{0,m}(s)\, x^s = \frac{2}{A_+^m(x) + A_-^m(x)} , \eqno({\rm A.3})$$

where $f_{0,m}(s)$ is the probability of first passage from state 0 to the barriers in s steps, and $A_\pm(x) = \left(1 \pm \sqrt{1 - x^2}\right)/x$. It is then easy to check that $\bar s = \lim_{x \to 1} dF/dx = m^2$, which gives equation (12).

A(b) NON-ABSORBING BARRIERS. The stationary probability distribution is given by the eigenvector of eigenvalue 1 of the transition probability matrix. This gives the same probability for all the 2m + 1 allowed states, namely $(2m+1)^{-1}$. We are interested in the walks of more than $\eta$ steps, starting from +m or −m, that do not stick to the barriers. Their probability $P(\tau > \eta)$ can be deduced from the random walk between absorbing barriers at m + 1 and −(m + 1) as the sum of the following terms: 1) the random walks starting at m, making a first step of −1 and then $\eta - 1$ steps without absorption; 2) those starting at −m, making a first step of +1 and then $\eta - 1$ steps without absorption; 3) those walks starting at n ($-m+1 \leq n \leq m-1$) performing $\eta$ steps without absorption. Each of these terms enters the sum multiplied by the probability $(2m+1)^{-1}$ of starting the random walk at the corresponding point. The problem is therefore reduced to calculating sums of terms of the form (A.1), with m + 1 instead of m. The dominant term of the sum gives equation (16).

References

[1] HOPFIELD, J. J., Proc. Natl. Acad. Sci. USA 79 (1982) 2554.
[2] PERETTO, P., On learning rules and memory storage abilities of neural networks, preprint (1987).
[3] CRISANTI, A., AMIT, D. J., GUTFREUND, H., Europhys. Lett. 2 (1986) 337.
[4] PARISI, G., J. Phys. A 19 (1986) L617.
[5] NADAL, J. P., TOULOUSE, G., CHANGEUX, J. P. and DEHAENE, S., Europhys. Lett. 1 (1986) 535.
[6] This model has been suggested by P. Peretto.
[7] PERETTO, P., NIEZ, J. J., Biol. Cybern. 54 (1986) 1.
[8] WEISBUCH, G., FOGELMAN-SOULIÉ, F., J. Physique Lett. 46 (1985) L623.
[9] MÉZARD, M., NADAL, J. P. and TOULOUSE, G., J. Physique 47 (1986) 1457.
[10] COTTRELL, M., preprint (1987).
[11] COX, D. R., MILLER, H. D., The theory of stochastic processes (Chapman and Hall, London) 1977.