Generalization in a Hopfield network

Generalization in a Hopfield network J.F. Fontanari To cite this version: J.F. Fontanari. Generalization in a Hopfield network. Journal de Physique, 1990, 51 (21), pp.2421-2430. <10.1051/jphys:0199000510210242100>. <jpa-00212540> HAL Id: jpa-00212540 https://hal.archives-ouvertes.fr/jpa-00212540 Submitted on 1 Jan 1990 HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers. L archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.

J. Phys. France 51 (1990) 2421-2430 ler NOVEMBRE 1990, 2421 Classification Physics Abstracts - 87.10 64.60C Generalization in a Hopfield network J. F. Fontanari Instituto de Fisica e Quimica de São Carlos, Universidade de São Paulo, Caixa Postal 369, 13560 São Carlos SP, Brasil (Received 23 April 1990, accepted in final form 12 July 1990) 2014 Abstract. The performance of a Hopfield network in learning an extensive number of concepts having access only to a finite supply of typical data which exemplify the concepts is studied. The minimal number of examples which must be taught to the network in order it starts to create representations for the concepts is calculated analitically. It is shown that the mixture states play a crucial role in the creation of these representations. 1. Introduction Learning and generalization in neural networks has been the subject of intensive research in the past few years [1-7]. The most recent studies have been carried out in the context of supervised learning in single-layer feedforward neural networks [5-7], following the theoretical framework presented in Gardner s seminal paper [8]. Comparatively, little attention has been devoted to the ability of simple feedback neural networks, e.g. Hopfield s model [9], to perform computational tasks beyond the simple storage of a set of activity patterns. The learning process in Hopfield s model consists of setting the value of the coupling.l j between neurons i and j for all pairs of neurons such that a given set of activity patterns is memorized by the network. In this model the states of the neurons are represented by Ising spins, Si + 1 (firing) or Si - 1 (rest). Storage of an activity pattern {ç r ± 1, i 1,..., N } into the memory of the network is achieved by modifying the couplings according to the generalized Hebb rule After being exposed to p activity patterns the couplings are set to where we have assumed Jii 0 initially (tabula rasa). Article published online by EDP Sciences and available at http://dx.doi.org/10.1051/jphys:0199000510210242100

2422 Once the couplings are fixed, the dynamical retrieval process is Hamiltonian [9] governed by the The network can potentially retrieve a given activity pattern if it is a minimum or if it is very near a minimum of H. The natural parameters for measuring the performance of the network in retrieving the stored patterns are the overlaps where the state {Si, i 1,..., N } is a minimum of H. The properties of these minima have been fully studied by Amit et al. [10, 11 ] using statistical mechanics tools developed in the analysis of infinite range spin-glasses [12]. It has been shown that besides the retrieval states which have macroscopic overlaps, 0(1), with only one stored pattern there exist mixture states which have macroscopic overlaps with several stored patterns [10]. Since the interest in Hopfield s model is mainly due to its prospective use as an associative memory, the attention has been focused on the retrieval states, the mixtures states being regarded as a nuisance which can be eliminated either by adding external noise to the system [10] or by modifying the learning rule [13]. On the other hand, these spurious states have been seen as proof of the ability of the network to create new representations to handle the information contained in the stored patterns [14]. In this paper we show that the mixture states play a crucial role when the task posed to the network is to extract meaningful information from the activity patterns it is exposed to during the learning stage. We consider the following problem. Let us suppose that during the learning stage the network is exposed to s examples of a given concept. The examples are embedded in the memory of the network by the Hebbian learning process, equation (1.2). The question we address is whether the network can create a representation for the concept to which it had been exposed only through examples. We say that the network has a representation for a concept if the concept is a minimum or if it is very near a minimum of H. More specifically, we consider p a N concepts represented by the activity patterns {T}. concept, a finite number of examples {ç jjl}, V 1,..., s 1,..., p. For each is generated. Their components are statistically independent random variables drawn from the distribution with O:s; b :s; 1. The examples can be thought of as noisy versions of the concepts they exemplify. The parameter b measures the difficulty of the task posed to the network : Small values of b result in low correlations between examples and concepts, making the grasping of the concepts more difficult. For simplicity, the components of the concepts are randomly chosen as ± 1 with equal probability. As an alternative viewpoint, one may consider the concepts as defining p classes each one containing s individuals (examples). Thus, the task of the network would be to group the examples in their respective classes, i.e. the network should categorize the examples. However, in this paper we follow the point of view expressed in Denker et al. [1 ] that categorization is a particular case of rule extraction (generalization) in which the rule may be roughly described by «nearby inputs should produce nearby outputs». We return to this issue in the conclusion of the paper.

2423 Finished the learning stage, the couplings are set to The quantities we focus on are the generalization errors ejl defined by where m e, the generalization overlaps, are the overlaps between the concepts and a certain minimum of H which will be specified later. The aim of this paper is to calculate the dependence of the generalization error (el-") on the number of examples taught to the network (s) for a given task characterized by the parameter b. To achieve this we study the thermodynamics of Hopfield s Hamiltonian, equation (1.3), with the couplings set as in equation (1.6). Our analysis is restricted to the noiseless (zero temperature) limit. A first attempt to tackle this problem has been published recently [15]. This paper presents a simpler and more general approach. The paper is organized as follows. In section 2 we study the thermodynamics of the model in the limit a p /N - 0. The simplicity of this limit, which dispenses the use of the replica trick in the computation of the averaged free energy density, allows us to find analytical expressions relating ee with s and b. The analysis of the nonzero a limit is performed within the replica symmetric framework in section 3. We summarize our results and present some concluding remarks in section 4. 2. Finite number of concepts. The Hamiltonian of Hopfield s model with the couplings given by equation (1.6) written as can be where we have omitted a constant term and included an additional term in order to compute M,". In fact, writing the averaged free energy density as where Z Trs e- ph one has The parameter 8 - T-1 in the expression for the partition function Z is a measure of the amount of noise acting on the system. The noiseless limit is obtained by taking /3 - oo. The notation... > stands for the averages over the examples and over the concept taken in this order. The calculation of f is straightforward in the limit a 0 [10] so we present only the final result

2424 The order parameters, with (... ) T standing for the thermal average, are given by the saddle-point equations with hu 0. Using equation (2.3) we can write the generalization overlaps me in terms of the retrieval overlaps m p, JI We restrict our analysis to a particular class of solutions for m J.L 11, i.e. to a particular class of minima of f (or H for T 0). Since there are no macroscopic correlations between different Moreover, we choose concepts we consider solutions of the form m J.L 11 m 1 11 8 J.L 1. m1v to be of the form The motivation for choosing this solution is that it gives a bias to the network to behave as an associative memory : each example is singled out (the solution is s degenerate since we can select any of the examples to be ml 1) and treated as an independent piece of information to be stored. If any other behaviour emerges, it will be a spontaneous property of the network and not an artifice due to the particular choice expressed by equation (2.8). At T 0, m11 and ms - 1 satisfy the equations where Xs - 1 ç 1 JI. For s > 10 the binomial distribution of Xs - 1 can be replaced by a v > i Gaussian with mean (s - 1 ) bç 1 and variance 4) (s - 1 ) ( 1 - b 2). Performing the averages over ç Il, Xs - 1 i and ç 1 in this order, we find where Following the same procedure we can compute m from equation (2.7). The result is

The 2425 Next we discuss the solutions of equations (2.10). For small s the network retrieves the examples almost perfectly, i.e. m11 ~ 1 and ms - 1 b2. Hence the generalization error E E, equation (1.7), is Strictly, the retrieval is perfect for (s - 1 ) -- b-2 as can be easily seen from equations (2.9) since 1 xs - 1 1 -- s - 1. This result is not recovered by equations (2.10) because the replacement of the discrete distribution of xs _ 1 by a Gaussian makes 1 Xs - 11 [ unbounded. As s increases, m 11 decreases slightly until s reaches a critical value Sc, above which m11 jumps to a much smaller value, almost equating with ms - 1. This behaviour is reflected in the generalization error which we show in figure 1 as a function of s for several values of b. For s - Sc the network just memorizes the patterns it is exposed to. This behaviour is referred to as the retrieval phase (R). For S:> sc the network no longer treats the examples as independent pieces of information and starts to mix them, creating the representation for the concept. This is the generalization phase (G). The phase diagram in the (s, b ) plane indicating the regions where each regime occurs is shown in figure 2. Although mil m s - i is solution of equations (2.10) only in the limit s -+0 oo, in the generalization regime one has m ms ms where ms is the symmetric solution of equation (2.6), In fact, m 11 never equals ms _ 1 because when averaging over the examples we have explicitly ruled out this possibility by retaining the discrete nature of e 11 while making the Gaussian approximation for the distribution of Xs - 1. Nevertheless, since the differences between m 11, ms - and m, are negligible in the G phase we can approximately characterize this regime by the symmetric solution ms. Following similar steps to the ones leading to equations (2.10) one finds - Fig. 1. generalization error as function of the number of examples for b 0.18, 0.20, 0.25, 0.40 and a 0.

2426 where Thus, the generalization error in the G phase can be approximate to The values of 8 computed through this equation are indistinguishable from the exact generalization error, computed through equations (1.7) and (2.13), in the scale of figure 1. 3. Infinité number of concepts. In this section we consider the case where the network has to create representations for an extensive number of concepts, a p /N + 0, being exposed to a finite number of examples s of each concept. In this case we have to take into account the fact that the combined overlap of a concept with all the other concepts is of 0 (-,/à ). To handle this situation we follow Amit et al. [11] and assume that only the overlaps m JI condense, i.e. are of 0(1) while the others are of O(N- 1/2). The averaged free energy density is calculated through the replica trick where Z is the partition function replicated n times, Averaging over rjl(j.l > 1) explicitly and using the self-averaging property of ei yields where we have omitted multiplicative factors which vanish in the thermodynamic limit and withbjlà b2 + (1 - b2) 8J1À and Qpu qpu + (1 - qpu) 8pu. The integrals in equation (3.3) can be readily effected by saddle-point integration while the integrals in equation (3.4), the trace over SP and the limit n --+ 0 can be performed in the replica symmetric framework

2427 resulting in the following expression for the averaged free energy density where with C 6 (1 - q ) and Dz dz / J2; e - z 2/2 Thé order parameters are given by the saddlepoint equations which, in the limits f3 -+ oo and hl... 0, are written as The equations for the standard Hopfield model [11] are recovered in the cases s 1 and b 1 with an appropriate rescaling of C and r. For b 0 the network is effectively storing sp uncorrelated patterns and Hopfield s equations are obtained by rescaling a - sa. Once m " and C are known we can compute m through the equation Next we discuss the solutions of the saddle-point equations (3.10)-(3.12). In addition to the solutions considered in the a 0 limit, there exist a spin-glass solution m 1 JI 0 d v which is stabilized by the Gaussian noise due to the overlaps of O(N- 1/2) between 1. However, adding noise to the system has a destabilizing effect for the e " and el,» > retrieval phase [ 13, 16, 17], reducing its domain to a region much smaller than the one shown in figure 2. Therefore the phase diagram in the (s, b ) plane will be dominated by the generalization phase characterized by the symmetric solution, equation (2.15), and the spinglass phase. In the following we will focus only on the interplay between these two phases. For the symmetric solutions equations (3.10) and (3.11) reduce to and

The 2428 Fig. 2. Fig. 3. Fig. 2. - Phase diagram showing the generalization phase (G) and the retrieval phase (R) for a 0. The transition at s SC is discontinuous. The retrieval of the examples is perfect below the dashed curve, b 1 / Bls - 1. Fig. 3. - and b 0.4. generalization error as function of the number of examples for a/a 0 0.05, 0.2, 0.4, 0.5 respectively, where Xs L çl JI. For s > 1 0 one can replace Xs by v sbe 1 and variance s( 1 - b 2). Performing the averages as in section 2 yields a Gaussian variable of mean where 4) a r + m.; s (1 - b 2). Equations (3.12), (3.16) and (3.17) must be solved numerically in order to compute the generalization overlap, 1- m 1)/2. and, consequently, the generalization error ( e In the limit s - oo the noise in the examples is averaged out, i.e. Xs - sbe 1 and the standard Hopfield model with the concepts replacing the examples in equation (1.6) is recovered. This result can be easily obtained by rescaling C sb 2 C and r r / S2 b 4 which implies that ms bm 1 with m satisfying the standard Hopfield s equations. Next we discuss the behaviour of the generalization error as a function of s, b and a. In figure 3 we show e as a function of s for b 0.4 and several values of a / a o, where ao 0.138 gives the storage capacity of the standard Hopfield model. As in the a 0 case, there is a minimal number of examples which must be taught to the network in order it starts to generalize. The transition between the SG phase, where E 0.5 since the spin-glass states have no macroscopic correlations with the concepts, and the G phase is always discontinuous. The values of a lao for which the transition occurs are shown in figure 4 for several values of b. As b --+ 1 or s --> 00, ac tends to Hopfield s storage capacity afro. Generalization occurs for «: ac and the

2429 Fig. 4. Fig. 5. Fig. 4. - The critical values of a / a 0 below which the network starts to generalize number of examples for b 0.2, 0.4, 0.6, 0.8. Fig. 5. - The generalization error at criticality as function of the number of examples. as function of the generalization error at the transition (Be) is shown in figure 5. As b --+ 1 or s --+ oo, ec tends to 0.0165, the critical retrieval error for the standard Hopfield model [11]. 4. Conclusion. In this paper we have s*hown that a simple feedback neural network using a local Hebbian learning rule is able to learn a set of concepts having access only to a finite supply of typical data which exemplify the concepts. The network acomplishes that by creating representations, i.e. minima of the energy function governing the retrieval process, which capture the underlying statistical structure of the examples, allowing the extraction of meaningful information from the data supply. For the specific problem we have considered in this paper, the symmetric mixture states provide for the representations which allow the network to grasp the concepts. Since the research on Hopfield s model has focused mainly on the retrieval states, very little is known about the relation between the statistical structure of the patterns presented to the network during the learning stage and the structure of the representations created by the network [13, 17]. This seems to be a crucial issue if one intends to lead the study of Hopfield s model beyond its memorizing capabilities. Next we summarize our main results. For finite p we have found that there is a regime where the network simply memorizes the examples ignoring their statistical structure (retrieval phase). However, as the number of examples increases passing a certain critical number Sc the behaviour of the network undergoes an abrupt change entering a new regime where the representations of the concepts are created (generalization phase). For p a N (a >0) the retrieval phase is confined to a very small region of the (s, b ) plane. However, a spin-glass regime, where the network ignores both the examples and the concepts, appears to compete with the generalization regime. The interplay between these two regimes is qualitatively similar to the one discussed in the finite p limit with the spin-glass replacing the retrieval regime. It is interesting to compare our results with the ones presented in the literature for feedforward neural networks [1, 3, 6]. As mentioned above, the most remarkable outcome of

2430 the present work is the existence of a critical number of examples (sc) beyond which the network generalizes well (Figs. 1 and 3). Whether a similar behaviour occurs in the case of feedforward networks is an unsettled issue. On the one hand, simulations of multilayer networks for the contiguity problem and some general theoretical arguments point out for the existence of a critical size of the training set above which the generalization error (E) falls of exponentially fast [1, 3]. On the other hand, an analytical study of the performance of a single-layer perceptron in classifying examples according to their Hamming distance from a set of prototypes indicates that such a critical number does not exist [6]. However, since the behaviour of E seems to depend strongly on the architecture of the network considered [3] there may be no simple answer for this issue. A similar controversy could very well arise in the context of feedback neural networks if we use other leaming rules than the one considered in this paper. Finally, we should mention the relevance of our results to the problem of categorization in neural networks. As pointed out in the Introduction, the generalization regime may be interpreted as a categorization of the examples into the classes defined by the concepts. We have found that categorization emerges spontaneously when a critical number of examples is presented to the network during the learning stage. This result corroborates the viewpoint that categorization is related with the limitation of an associative memory. This point was beautifully expressed by Virasoro : «we categorize not because we want to but because we cannot do otherwise» [18]. Acknowledgments It is a pleasure to thank Ronny Meir for many helpful discussions. References [1] DENKER J., SCHWARTZ D., WITTNER B., SOLLA S., HOWARD R., JACKEL L. and HOPFIELD J. J., Complex Systems 1 (1987) 877. [2] PATARNELLO S. and CARNEVALI P., Europhys. Lett. 4 (1987) 503 ; CARNEVALI P. and PATARNELLO S., Europhys. Lett. 4 (1987) 1199. [3] TISHBY N., LEVIN E. and SOLLA S. A., Proceedings of the International Joint Conference on Neural Networks (1989). [4] VALLET F., Europhys. Lett. 8 (1989) 747. [5] DEL GIUDICE P., FRANZ S. and VIRASORO M. A., J. Phys. France 50 (1989) 121. [6] HANSEL D. and SOMPOLINSKY H., Europhys. Lett. 11 (1990) 687. [7] GYÖRGYI G. and TISHBY N., Proceedings of the STATPHYS-17 Workshop on Neural Networks and Spin Glasses, Eds. W. K. Theumann and R. Köberle (World Scientific, Singapore) 1990. [8] GARDNER E., J. Phys. A 21 (1988) 257. [9] HOPFIELD J. J., Proc. Natl. Acad. Sci. USA 79 (1982) 2554. [10] AMIT D. J., GUTFREUND H. and SOMPOLINSKY H., Phys. Rev. A 32 (1985) 1007. [11] AMIT D. J., GUTFREUND H. and SOMPOLINSKY H., Ann. Phys. N. Y. 173 (1987) 30. [12] MEZARD M., PARISI G. and VIRASORO M. A., Spin Glass Theory and Beyond (World Scientific, Singapore) 1987. [13] FONTANARI J. F. and THEUMANN W. K., J. Phys. France 51 (1990) 375. [14] ANDERSON J. A., IEEE Transactions on Systems, Man and Cybernetics 13 (1983) 799. [15] FONTANARI J. F. and MEIR R., Phys. Rev. A 40 (1989) 2806. [16] FONTANARI J. F. and KÖBERLE R., J. Phys. A 21 (1988) 2477. [17] ERICHSEN R. and THEUMANN W. K., Preprint IF-UFRGS (1990). [18] VIRASORO M. A., Preprint 608, Università di Roma «La Sapienza».