Hopfield and Potts-Hopfield Networks

Andy Somogyi
Dec 10, 2010

1 Introduction

Many early models of neural networks lacked mathematical rigor or analysis. They were, by and large, ad hoc models. John Hopfield was among the first to apply the formalism of statistical mechanics to artificial neural networks, developing a model that was mathematically tractable and that exhibited many of the key attributes found in biological neural networks. In this article, I follow the standard derivation of the free energy for such a network. Next, I introduce my extension, the Potts-Hopfield network, which I argue behaves more like a biological system than the original Hopfield network. I then discuss the details of the program I wrote to test and evaluate both networks, mine and Hopfield's, trained on and matching a variety of patterns. Finally, I display and discuss the results of various experiments with both networks.

2 Biological Plausibility

Nature is fundamentally stochastic. All continuum descriptions of nature are approximations at best. Suppose that we could know the exact position and momentum of every particle in the universe. Only then could we have a truly deterministic universe. The problem is that particles do not have exact positions and momenta: the Heisenberg uncertainty principle prevents that. We cannot, even in principle, hope to measure both exactly. Thus the state of any real system cannot be exactly specified, since the initial conditions cannot be exactly specified.

Continuum approximations work well when describing man-made objects: objects that are smooth, straight, and homogeneous; objects that are artificial. Nature rarely presents us with such objects. Objects in nature, and in particular in biology, are rough, bumpy, and inhomogeneous. Biological systems are thermally driven. At the molecular level, all reactions are thermally activated.
It simply does not make sense to model biological processes as continuum processes. It is not that mathematics cannot be applied to biology; it is just that classical, smooth mathematics does not make a lot of sense here. Instead, stochastic dynamics must be used when modeling biological systems.

Biological systems at the cellular level predominantly interact via local connections. There are obvious exceptions, such as the multitude of long-range axonal projections in the cortex. These long-range connections account for a significant portion of the cortical volume; however, they make up only a relatively small percentage of the total synaptic connection count. Consider that among the many constraints on cortical configuration, one of the most significant is the volume constraint (i.e., the entire cortical assembly must fit inside the skull). Long-range axonal projections occupy a significant proportion of that volume.

The Hopfield network, if evolved stochastically, does exhibit the first key trait of biological plausibility. This type of network, however, stores all information in a potentially all-to-all connected network. If only a fraction of the connections were long range, it would make sense biologically; there is simply no way that anything but the most trivially simple biological neural network can be all-to-all connected.

3 The Hopfield Network

We can say with certainty that the nervous system of every known animal consists of neurons and synapses. Out of the millions of animal species that exist on this planet, nearly all have control systems composed of neurons and synapses. Whilst it may be plausible to have intelligences using some other system, there is no way to prove a negative. But we can say that we have not seen any other type of intelligent system. If we accept that we, as humans, are intelligent, and if we can say that other animals have some varying degree of intelligence (e.g., the ability to learn and solve problems), and given that all animals have a neural system composed of neurons and synapses, then intelligence must somehow emerge out of incredibly complex interactions of neurons and synapses. This is the system found in nature: why not use it as our model?

Early researchers in connectionism, such as McCulloch and Pitts, were no doubt inspired by this fact, and that led them to create the first mathematical model of a neuron in 1943. This model combined neurophysiology and mathematical logic.
Even in 1943, neurons were known to behave in a binary way: they either fire or do not, depending on whether the input exceeds some threshold. Although researchers in their day did not fully understand the mechanics of real neurons (most would contend that we still do not fully understand them), it made sense to treat them as logical constructs. The McCulloch-Pitts neuron operates on a discrete time scale. At each time step, a value is computed from the inputs, and the single output is set to either high or low for the next time step. If this time scale were converted into real units, it would be on the order of a millisecond, as that is the typical period we see for spike trains recorded from actual neurons.

Research in connectionism thrived through the 1960s with contributions from scientists such as Hebb, Ashby, Rosenblatt, et al. However, early research focused on a very simple model of a neuron called a perceptron. Perceptrons are relatively limited in what they can do, and in 1969 Minsky and Papert demonstrated the limits on the sorts of functions that perceptrons can calculate, showing that even simple functions like the exclusive or (XOR) could not be handled properly. As a result of Minsky and Papert's work, very little research was done in biologically inspired computing until the early 1980s. In the intervening years, the AI community shifted its efforts toward symbol processing rather than connectionism.

In 1982, John Hopfield, likely inspired by the similarities between biological neural networks and the popular Ising model, devised a neural network based on the Sherrington-Kirkpatrick spin-glass [6] (which is a generalization of the Ising model) that could learn and recognize information [3]. The introduction of the Hopfield network, along with a new method for training neural networks called backpropagation, discovered by Paul Werbos and popularized by David Rumelhart, caused renewed interest in the field of connectionism.

3.1 The Hopfield Neuron

Biological neurons, in the absolute simplest abstraction, can be described as a two-state system: $S_i = +1$ if excited, $-1$ if quiescent. The synaptic efficacy from neuron $j$ to neuron $i$ is denoted by $J_{ij}$. The sum of signals arriving at the $i$th neuron is then

$$ h_i = \sum_j J_{ij} (S_j + 1). \tag{1} $$

3.2 Deterministic Time Evolution

Neuron $i$ becomes excited if the input signal exceeds a threshold $\theta_i$ at time $t$ and is not excited otherwise:

$$ S_i(t + \Delta t) = \mathrm{sgn}\left( \sum_j J_{ij} \big( S_j(t) + 1 \big) - \theta_i \right). \tag{2} $$

The simplest case is where the threshold $\theta_i$ is equal to the sum $\sum_j J_{ij}$, so that the constant term cancels:

$$ S_i(t + \Delta t) = \mathrm{sgn}\left( \sum_j J_{ij} S_j(t) \right). \tag{3} $$

A pattern of excitation is denoted by $\{\xi_i^\mu\}$. Here $i = 1, \ldots, N$ is the neuron index, $\mu = 1, \ldots, p$ is the excitation pattern index, and $\xi_i^\mu$ is an Ising variable $\pm 1$. If the $\mu$th pattern has the $i$th neuron in the excited state, $\xi_i^\mu = 1$. The $\mu$th excitation pattern can be written as $\{\xi_1^\mu, \xi_2^\mu, \ldots, \xi_N^\mu\}$, and $p$ such patterns are assumed to exist ($\mu = 1, 2, \ldots, p$).

If the system is evolved and there is no further change in state, the system has reached a stable fixed point (i.e., it sits in a stable basin of attraction) such that

$$ S_i(t) = \xi_i^\mu \;\Rightarrow\; S_i(t + \Delta t) = \xi_i^\mu. \tag{4} $$

The connectivity matrix $J$ is constructed via the Hebb rule such that

$$ J_{ij} = \frac{1}{N} \sum_{\mu=1}^{p} \xi_i^\mu \xi_j^\mu. \tag{5} $$

Here each stored pattern $\xi^\mu$ is one of the stable fixed-point patterns of eqn. 4. To see this, substitute the connectivity matrix $J$ into eqn. 3 and assume orthogonality amongst the patterns,

$$ \frac{1}{N} \sum_j \xi_j^\mu \xi_j^\nu = \delta_{\nu\mu} + O\!\left( \frac{1}{\sqrt{N}} \right). \tag{6} $$

The right-hand side of eqn. 3, evaluated at $S_j = \xi_j^\mu$, can then be expanded and simplified as

$$ \mathrm{sgn}\left( \sum_j J_{ij} \xi_j^\mu \right) = \mathrm{sgn}\left( \sum_\nu \xi_i^\nu \, \frac{1}{N} \sum_j \xi_j^\nu \xi_j^\mu \right) = \mathrm{sgn}\left( \sum_\nu \xi_i^\nu \delta_{\nu\mu} \right) = \mathrm{sgn}\left( \xi_i^\mu \right). \tag{7} $$

Assuming that the values of each pattern $\xi^\mu$ are approximately orthogonal, random, and uncorrelated, the maximum number of patterns $p$ that can be stored with the Hebbian rule [5] is proportional to the size of the network: $p = \alpha N$, with $\alpha \approx 0.145$ [1, 3]. If more than $\alpha N$ patterns are stored, the attractor states break down, forming a chimera: a state in which more than one pattern is merged together.

3.3 Statistical Mechanics

The behavior of such a spin system is entirely described by a Hamiltonian and a temperature. $J$ is an interconnection matrix organized according to the Hebb rule, into which $p$ patterns are stored. The diagonal elements of the connectivity matrix are assumed to be zero, i.e., self connections are forbidden. The traditional approach to such a system is to assume that all spins are free and that their dynamics are defined only by the action of a local field along which they are oriented.

The previous deterministic time evolution is equivalent to the zero-temperature dynamics of the Ising model with the Hamiltonian

$$ H = -\frac{1}{2} \sum_{i,j} J_{ij} S_i S_j = -\frac{1}{2} \sum_i S_i \sum_j J_{ij} S_j, \tag{8} $$

where $\sum_j J_{ij} S_j$ is the local field $h_i$ acting on the spin $S_i$, which aligns the spin (neuron state) $S_i$ with the direction of the local field at the next time step.

This section follows a fairly standard treatment [5, 2] of the derivation of the free energy from the Hamiltonian 8, with the addition of some clarifying steps. All life on Earth (and presumably elsewhere) exists because of the transduction of free energy. Knowing the form of the free energy of a system provides great insight into its inner workings. The free energy is a measure of how much useful energy a system has: it is the difference between the total energy of the system, $E$, and the product of its temperature and entropy, $TS$.
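Before the free energy is worked out, the construction in eqns. 3 to 8 can be tried on a toy example. The following is a minimal sketch in plain Python (chosen for brevity; it is not the program described later in this article). The two 64-bit patterns, the six flipped bits, and all function names are hypothetical choices for illustration, and the update is applied one neuron at a time.

```python
def hebb_matrix(patterns):
    """Eq. 5: J_ij = (1/N) * sum_mu xi_i^mu * xi_j^mu, with zero diagonal."""
    n = len(patterns[0])
    J = [[0.0] * n for _ in range(n)]
    for xi in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:  # self connections are forbidden
                    J[i][j] += xi[i] * xi[j] / n
    return J

def overlap(a, b):
    """Normalized overlap (1/N) * sum_j a_j * b_j, the left side of eq. 6."""
    return sum(x * y for x, y in zip(a, b)) / len(a)

def energy(J, S):
    """Eq. 8: H = -(1/2) * sum_ij J_ij * S_i * S_j."""
    n = len(S)
    return -0.5 * sum(J[i][j] * S[i] * S[j] for i in range(n) for j in range(n))

def recall(J, S, max_sweeps=20):
    """Eq. 3 applied asynchronously: S_i <- sgn(sum_j J_ij S_j)."""
    S = list(S)
    n = len(S)
    for _ in range(max_sweeps):
        changed = False
        for i in range(n):
            h = sum(J[i][j] * S[j] for j in range(n))  # local field
            new = 1 if h >= 0 else -1
            if new != S[i]:
                S[i] = new
                changed = True
        if not changed:  # fixed point in the sense of eq. 4
            break
    return S

# Two exactly orthogonal toy patterns (hypothetical, N = 64).
pat_a = [1] * 32 + [-1] * 32
pat_b = [1, -1] * 32
J = hebb_matrix([pat_a, pat_b])

# Corrupt pattern A by flipping six arbitrarily chosen neurons.
noisy = list(pat_a)
for i in (0, 5, 10, 37, 50, 63):
    noisy[i] = -noisy[i]

recalled = recall(J, noisy)
print(overlap(pat_a, pat_b), recalled == pat_a)
print(energy(J, noisy), energy(J, recalled))
```

In this toy case the corrupted input relaxes back to pattern A and the energy of eqn. 8 drops as it does so. The patterns here are orthogonal by construction, which is exactly the assumption behind eqn. 6; the same `overlap` helper can be used to check how far a real set of patterns departs from it.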
We can see that as either the temperature or the level of disorder of a system increases, the system has less useful energy with which to perform work. Formally, the free energy is defined as

$$ F = E - TS. \tag{9} $$

The free energy can be calculated from the partition function,

$$ Z = \sum_s e^{-\beta E_s}, \tag{10} $$

which is defined as the sum over all possible configurations $s$ of the Boltzmann factor for the configuration energy $E_s$, with $\beta = 1/kT$. Given the partition function, the free energy is calculated as

$$ F = -kT \ln Z. \tag{11} $$
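For a network small enough to enumerate, eqns. 9 to 11 can be checked against each other directly. The sketch below is an illustration only: the four-neuron network and the temperature are hypothetical, units are chosen so that $k = 1$, and the entropy is taken to be the Gibbs entropy $-k \sum_s p_s \ln p_s$, which the text above does not spell out.

```python
import math
from itertools import product

def energy(J, spins):
    """Eq. 8: H = -(1/2) * sum_ij J_ij * S_i * S_j."""
    n = len(spins)
    return -0.5 * sum(J[i][j] * spins[i] * spins[j]
                      for i in range(n) for j in range(n))

def thermodynamics(J, T, k=1.0):
    """Enumerate all 2^N states; return (Z, F, mean energy, entropy)."""
    n = len(J)
    beta = 1.0 / (k * T)
    energies = [energy(J, s) for s in product((-1, 1), repeat=n)]
    weights = [math.exp(-beta * e) for e in energies]
    Z = sum(weights)                                    # eq. 10
    F = -k * T * math.log(Z)                            # eq. 11
    probs = [w / Z for w in weights]
    mean_E = sum(p * e for p, e in zip(probs, energies))
    entropy = -k * sum(p * math.log(p) for p in probs)  # assumed Gibbs form
    return Z, F, mean_E, entropy

# Hypothetical 4-neuron network storing one pattern via the Hebb rule (eq. 5).
xi = [1, -1, 1, 1]
N = len(xi)
J = [[0.0 if i == j else xi[i] * xi[j] / N for j in range(N)] for i in range(N)]

T = 0.5
Z, F, mean_E, entropy = thermodynamics(J, T)
print(F, mean_E - T * entropy)  # the two expressions for F should agree
```

Brute-force enumeration costs $2^N$ energy evaluations, so it is only a sanity check for tiny networks; the analytic treatment that follows is what makes large $N$ tractable.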

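Once the free energy is obtained, any other thermodynamic quantity of the system can be derived directly from it. The partition function for the Hamiltonian 8 with no external field is

$$ Z = \mathrm{Tr} \exp\left( \frac{\beta}{2N} \sum_\mu \Big( \sum_i S_i \xi_i^\mu \Big)^2 \right). \tag{12} $$

This equation may seem daunting at first: the squared term expands into potentially $N + \binom{N}{2}$ terms. This, however, can be readily handled via the Hubbard-Stratonovich transformation, which follows from the result of a Gaussian integral. In the present case it takes the form

$$ \exp\left( \frac{\beta J}{2N} S^2 \right) = \sqrt{\frac{N \beta J}{2\pi}} \int \exp\left( -\frac{N \beta J}{2} m^2 + \beta J m S \right) dm, \tag{13} $$

which can be shown by completing the square

$$ -\frac{N \beta J}{2} m^2 + \beta J m S = -\frac{N \beta J}{2} \left( m - \frac{S}{N} \right)^2 + \frac{\beta J S^2}{2N}. \tag{14} $$

Here $S$ stands for the quantity being squared; in eqn. 12 this is $\sum_i S_i \xi_i^\mu$, with $J = 1$. Using one integration variable $m_\mu$ per pattern, as in eqn. 13, we can linearize the square in the exponent (omitting constant prefactors) as

$$ Z = \mathrm{Tr} \int \prod_{\mu=1}^{p} dm_\mu \exp\left( -\frac{1}{2} N \beta \sum_\mu m_\mu^2 + \beta \sum_\mu m_\mu \sum_i S_i \xi_i^\mu \right) = \int \prod_\mu dm_\mu \exp\left( -\frac{1}{2} N \beta \mathbf{m}^2 + \sum_i \log \big( 2 \cosh( \beta \mathbf{m} \cdot \boldsymbol{\xi}_i ) \big) \right), \tag{15} $$

where $\mathbf{m} = (m_1, \ldots, m_p)$ and $\boldsymbol{\xi}_i = (\xi_i^1, \ldots, \xi_i^p)$. In the limit of large $N$, the integral in eqn. 15 can be evaluated by steepest descent. The free energy per neuron is thus

$$ f = \frac{1}{2} \mathbf{m}^2 - \frac{T}{N} \sum_i \log \big( 2 \cosh( \beta \mathbf{m} \cdot \boldsymbol{\xi}_i ) \big). \tag{16} $$

The equation of state for this system can be calculated as usual by minimizing the free energy 16, arriving at

$$ \mathbf{m} = \frac{1}{N} \sum_i \boldsymbol{\xi}_i \tanh( \beta \mathbf{m} \cdot \boldsymbol{\xi}_i ). \tag{17} $$

In the large $N$ limit, the sum in eqn. 17 approximates the average over infinitely many random patterns $\boldsymbol{\xi}_i$. Thus in the large $N$ limit, the equation of state becomes

$$ \mathbf{m} = \langle\langle \boldsymbol{\xi} \tanh( \beta \mathbf{m} \cdot \boldsymbol{\xi} ) \rangle\rangle, \tag{18} $$

where $\langle\langle \cdot \rangle\rangle$ denotes the configuration average, or average over all possible patterns.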
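To get a feel for eqn. 18, consider the simplest case of a single stored pattern. Since $\xi = \pm 1$ and $\tanh$ is odd, $\xi \tanh(\beta m \xi) = \tanh(\beta m)$, so the equation of state reduces to $m = \tanh(\beta m)$. The sketch below solves this by fixed-point iteration in units where $k = 1$; the function name, starting value, and tolerances are illustrative assumptions rather than anything from the derivation above.

```python
import math

def solve_overlap(T, m0=1.0, tol=1e-12, max_iter=10000):
    """Iterate m <- tanh(m / T) until successive values agree to within tol."""
    m = m0
    for _ in range(max_iter):
        new = math.tanh(m / T)
        if abs(new - m) < tol:
            return new
        m = new
    return m  # not converged within max_iter (slow very close to T = 1)

m_cold = solve_overlap(0.5)  # below T = 1: a nonzero overlap survives
m_hot = solve_overlap(2.0)   # above T = 1: the overlap decays to zero
print(m_cold, m_hot)
```

In this reduced equation a nonzero overlap $m$ with the stored pattern exists only for $T < 1$; above that temperature the only solution is $m = 0$, i.e. the pattern is no longer retrieved.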

4 Potts-Hopfield Model

The traditional Hopfield model does incorporate many biologically plausible features. In spirit, it is closer to how we currently understand the cortex, namely as an assemblage of interconnected neurons, than to a symbol processing system of the kind advocated by the AI community in the 1960s and 70s. The Hopfield model is initialized in a certain state (the desired pattern to match) and is evolved without any further input until a steady state, a stable basin of attraction, is reached. Regardless of whether the Hopfield model is evolved via synchronous, asynchronous, or Boltzmann updates, it functions in a similar manner, much like an old calculating device such as a Babbage difference engine, where the device is given some input, the crank is turned, and one waits for the output to converge.

Biological systems do not function as such an isolated calculating device; rather, they are constantly fed some input. Thus, I believe that a more biologically plausible model is one where the pattern to be matched is always present, motivating and guiding the dynamics of the pattern matching system. So instead of the pattern being given to the pattern matching system as an initial state, it is presented as a boundary condition, one that is always present during the dynamic progression of the system.

Instead of treating the Hopfield network as a system that is started in a certain state and evolves deterministically towards one attractor, what if we treat the Hopfield system as a gas and treat the input pattern as a surface onto which the Hopfield gas can adsorb? This is a fairly common system to analyze. Irving Langmuir first analyzed the adsorption of gases onto surfaces with a finite number of binding sites in 1916. The derivation of the Langmuir equation is a common task in introductory statistical mechanics courses. In essence, what results is a system in which each Hopfield particle can now be either adsorbed or free.
If the particle is adsorbed onto the surface, there is an additional spin-spin interaction. If the particle spin matches the surface spin, it is energetically favorable; if it does not match, it is not favorable. Thus we move from a two-state system to a four-state system. Such a system can be described by the following Hamiltonian:

$$ H = -\frac{1}{2} \sum_i S_i \left( \sum_j J_{ij} S_j + \epsilon \alpha_i A_i \right), \tag{19} $$

where the spin variables $S_i$, $S_j$ are the Hopfield spin states, $\epsilon$ is the binding energy of the surface, and $\alpha_i$ is a Boolean variable indicating whether or not the $i$th Hopfield particle is bound to the $i$th pattern adsorption site $A_i$.

It is not immediately apparent how one would analytically calculate the free energy and equation of state from a Hamiltonian of the form 19. This Hamiltonian is not as amenable to the linearization method used for eqn. 8.

5 Implementation

The time evolution algorithm of the Hopfield and Potts-Hopfield networks is as follows. The initial spin directions (neuron states) are oriented according to the components of the input vector. The local field $h_i = -\partial E / \partial S_i$, which is produced by all the remaining connected spins of the network and acts on the $i$th spin at time $t$, is calculated via eqn. 1. If the spin direction is parallel to the direction of the local field, then its position is energetically favorable and remains unchanged. If, however, the spin is anti-parallel, the position is energetically unfavorable, and the local field acts as a torque and flips the spin state, $S_i(t + 1) = -S_i(t)$, with a probability proportional to the Boltzmann factor. At zero temperature, the total energy is reduced any time there is a spin flip. When there are no further spin flips, the network is in a stable state. The time evolution of the Potts-Hopfield network is nearly identical to the evolution discussed above, except that there are a total of four rather than two possible states.

Both Hopfield and Potts-Hopfield networks are extremely computationally intensive, as they both have all-to-all connectivity: an energy calculation at a single node requires querying the state of all other nodes. In effect, this is $O(N)$ operations per node, and $O(N^2)$ operations for each pass over the whole network. Thus a highly efficient programming language is required. All of the computational routines were written in C++, and the user interface was written in Objective-C. Sadly, there is no extant programming language that is well suited for both numerics and user interface development.

Each pattern is stored on disk as a .png file, which is read and converted to a 100 × 100 matrix. Each time a pattern is learned, the pattern is added to the connectivity matrix $J$ via the Hebbian rule, eqn. 5. For the Hopfield network, the matching pattern is set as the initial state of the network, and for the Potts-Hopfield network it is stored as the adsorbing surface.

The evolution of the simulation is performed via the Metropolis algorithm [4]. This algorithm is probably the most widely used algorithm in Ising/Potts simulations.
The Metropolis algorithm is designed to sample configurations that obey the Boltzmann distribution. Essentially, any system in or near thermal equilibrium has configurations which obey the Boltzmann distribution. The algorithm works by considering two configurations A and B, each of which occurs with probability proportional to its Boltzmann factor. The ratio of probabilities for two configurations with energies $E_A$ and $E_B$ is

$$ \frac{P(A)}{P(B)} = \frac{e^{-E_A/T}}{e^{-E_B/T}} = e^{-(E_A - E_B)/T}. \tag{20} $$

To generate such configurations, the algorithm proceeds as follows:

1. Starting from a configuration A with known energy $E_A$, make a change in the configuration to obtain a new configuration B.
2. Compute $E_B$.
3. If $E_B < E_A$, automatically accept the new configuration, since it has lower energy (a desirable thing, according to the Boltzmann factor).
4. If $E_B > E_A$, accept the new (higher energy) configuration with probability $\exp(-(E_B - E_A)/T)$.

This means that when the temperature is high, we do not mind taking steps uphill in energy, but as the temperature is lowered, we are forced to settle into the lowest configuration we can find.

The Metropolis algorithm is most commonly used to simulate the Ising model, but it is easily extended to work with a Potts model. The Potts model is a generalization of the Ising system; the $q = 2$ Potts model is equivalent to the Ising model. To use the Metropolis algorithm on a $q = n$ Potts model, take the current state of the chosen site as configuration A, with energy $E_A$. To determine $E_B$, simply pick at random any of the other $n - 1$ states for that site as configuration B.
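The steps above, applied to a four-state site, can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the C++ program described in this article: it assumes that a bound site contributes a surface term $-(\epsilon/2)\,\alpha_i S_i A_i$ to the energy (one reading of the Potts-Hopfield Hamiltonian), and the network size, the value `eps = 0.5`, the random seed, and all function names are hypothetical.

```python
import math
import random

# The four site states: (spin S_i, bound flag alpha_i).
STATES = [(-1, 0), (1, 0), (-1, 1), (1, 1)]

def hebb_matrix(patterns):
    """Hebb rule: J_ij = (1/N) * sum_mu xi_i^mu * xi_j^mu, zero diagonal."""
    n = len(patterns[0])
    J = [[0.0] * n for _ in range(n)]
    for xi in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    J[i][j] += xi[i] * xi[j] / n
    return J

def delta_energy(J, spins, bound, surface, eps, i, new_s, new_a):
    """E_B - E_A when site i moves to (new_s, new_a), all others held fixed.

    ASSUMPTION: site i contributes -S_i*h_i - (eps/2)*alpha_i*S_i*A_i.
    """
    h = sum(J[i][j] * spins[j] for j in range(len(spins)))  # O(N) local field
    old = -spins[i] * h - 0.5 * eps * bound[i] * spins[i] * surface[i]
    new = -new_s * h - 0.5 * eps * new_a * new_s * surface[i]
    return new - old

def metropolis_step(J, spins, bound, surface, eps, T, rng):
    """One Metropolis update of a random site; returns True if accepted."""
    i = rng.randrange(len(spins))
    current = (spins[i], bound[i])
    # Configuration B: any of the three other states, chosen at random.
    new_s, new_a = rng.choice([s for s in STATES if s != current])
    dE = delta_energy(J, spins, bound, surface, eps, i, new_s, new_a)
    # Accept downhill moves always, uphill moves with the Boltzmann probability.
    if dE <= 0 or (T > 0 and rng.random() < math.exp(-dE / T)):
        spins[i], bound[i] = new_s, new_a
        return True
    return False

rng = random.Random(1)
N = 50
stored = [rng.choice((-1, 1)) for _ in range(N)]
noisy = list(stored)
for i in rng.sample(range(N), 10):  # corrupt 10 of the 50 sites
    noisy[i] = -noisy[i]

J = hebb_matrix([stored])
spins, bound = list(noisy), [0] * N  # start unbound, in the noisy state
for _ in range(50 * N):              # 50 Monte Carlo steps per site, at T = 0
    metropolis_step(J, spins, bound, noisy, 0.5, 0.0, rng)
print(spins == stored)
```

Restricting the proposals to the two unbound states gives ordinary two-state Metropolis dynamics for the Hopfield network. Each update recomputes the local field from scratch, which is the $O(N)$ cost per node discussed above; for a 10,000-node network a compiled language, or an incrementally updated local field, would be the natural choice.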

6 Results

For the purposes of this article, three tests were run on each network type. The time evolution of the state and the time evolution of the total network energy are displayed. The size of the network was held at 10,000 nodes, which allows a 100 × 100 bit-mapped image to be used as a pattern. An image was captured every 5000 Monte Carlo (MC) steps, and the energy was calculated every 2500 MC steps.

6.1 Hopfield Network

The time evolution of the Hopfield network is shown in Figures 1 and 3. In these figures, blue pixels represent spin down (lower) and yellow pixels represent spin up. The Hopfield network tests were run for 100,000 Monte Carlo steps.

When the system has two stored patterns, A and B, and the system is in either one of these states, the energies are identical, with a value of −1.318. When the system is in the chalky A state, the total energy is −0.787. When the network was held at 300K, started from the chalky A state, run for approximately 150,000 Monte Carlo steps, and then cooled to 0K, it settled in either the A or the B state with equal probability. The implication here is that when the system is run with some level of disorder, the initial state configuration is eventually lost. As we will see, the Potts-Hopfield network does not exhibit this behavior.

For the first two tests, two patterns A and B were stored in the networks, and the matching pattern was a chalky A. The first test was run with the system held at 300K, and the second test was run with the system held at 0K. One should not place too much significance on the physical meaning of these temperatures: the value of the Boltzmann constant was chosen so that a temperature of 300K (room temperature) corresponds roughly to a system that exhibits some thermal fluctuations (i.e., is partially disordered). A value of 1000K corresponds to a fully disordered system. The third test was run at 0K with the patterns A, B, and C stored in the network.
When the system has three stored patterns A, B, and C and the system is in one of these pure states, the total energy is −1.648, −1.695, or −1.708, respectively. When the system has these three stored patterns and is in the chalky A state, the total energy is −1.031. When the system is in a pure state, the total energy is lower than when the system is set to a nonmatching pattern. Regardless of initial conditions, when the system has three stored patterns, it will always eventually end up at a chimera such as the last frame in Figure 5. This chimera state corresponds to a hypothesized global minimum energy value of −1.883. Even if the system is initialized to a pure state and evolved at 0K, the system will tend towards the chimera state. This implies that the pure states are not local minima. The energy-over-time profile of all tests for the Hopfield network was identical: a monotonic decrease of energy with no evidence of local minima.
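For scale, the capacity estimate quoted in Section 3.2 can be applied to this network size. The arithmetic below uses only the values stated in this article ($\alpha \approx 0.145$, $N = 10{,}000$) and holds only under the assumption, made in that section, that the stored patterns are random and uncorrelated.

```latex
p_{\max} = \alpha N \approx 0.145 \times 10{,}000 = 1450
```

The chimera observed here with three stored patterns therefore appears far below the load at which that estimate predicts the attractors should break down.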

Figure 1. Time evolution of a Hopfield network with two stored patterns A and B held at 300K, started from the chalky A state.

Figure 2. Total energy of the Hopfield network displayed in Figure 1. [Plot of Energy; horizontal axis from 0 to 80,000.]

Figure 3. Time evolution of a Hopfield network with two stored patterns A and B held at 0K, started from the chalky A state.

Figure 4. Total energy of the Hopfield network displayed in Figure 3. [Plot of Energy; horizontal axis from 0 to 80,000.]

Figure 5. Time evolution of a Hopfield network with three stored patterns A, B and C held at 0K, started from the chalky A state.

Figure 6. Total energy of the Hopfield network displayed in Figure 5. [Plot of Energy; horizontal axis from 0 to 80,000.]

6.2 Potts-Hopfield Network

The time evolution of the Potts-Hopfield network is shown in Figures 7 and 9. In these figures, black pixels represent bound, spin down; blue pixels represent unbound, spin down (identical to Hopfield spin down); yellow pixels represent unbound, spin up (identical to Hopfield spin up); and red pixels represent bound, spin up (the highest energy level). The Potts-Hopfield network tests were run for 150,000 Monte Carlo steps. This system has a longer convergence time than the previous one simply because there are more possible states for the system to explore. All tests were performed with a binding energy of $\epsilon = 100$.

All energies measured when the system was set to a pure state are identical to the previous values because in a pure state there are no additional binding energies. When the network was held at 300K, started from the chalky A state, run for approximately 150,000 Monte Carlo steps, and then cooled to 0K, it always converged to the correct A state.

Figure 7. Time evolution of a Potts-Hopfield network with two stored patterns A and B held at 300K, started from the chalky A state.

When the system has three stored patterns A, B, and C and the system is in one of these pure states, the energies are again identical to those of the previous Hopfield network. Again, regardless of initial conditions, when the system has three stored patterns, it will always eventually end up at a chimera such as the last frame in Figure 11. The chimera behavior is nearly identical to that of the Hopfield network. Thus we can conclude that the addition of a binding energy, i.e. an omnipresent input pattern that continuously influences the evolution of the system, has no effect on the total information storage capacity of a Hopfield-like network.

Figure 8. Total energy of the Potts-Hopfield network displayed in Figure 7. [Plot of Energy; horizontal axis from 0 to 140,000.]

Figure 9. Time evolution of a Potts-Hopfield network with two stored patterns A and B held at 0K, started from the chalky A state.

Figure 10. Total energy of the Potts-Hopfield network displayed in Figure 9. [Plot of Energy; horizontal axis from 0 to 140,000.]

Figure 11. Time evolution of a Potts-Hopfield network with three stored patterns A, B and C held at 0K, started from the chalky A state.

Figure 12. Total energy of the Potts-Hopfield network displayed in Figure 11. [Plot of Energy; horizontal axis from 0 to 140,000.]

7 Conclusions

At this point it is still unclear why the storage capacity of the Hopfield network as presently implemented is so much lower than the generally accepted theoretical value. It may be that the Hopfield network is not implemented properly. The theoretical value was previously calculated for both synchronous and asynchronous updates, so the fact that we used asynchronous updates is an unlikely cause.

Because of their inherent computational inefficiencies, neither the Hopfield nor the Potts-Hopfield network appears to have many practical prospects in either pattern matching or distributed information storage. The Potts-Hopfield extension did allow patterns to be matched after being randomized, but it did not increase the accuracy or information storage capacity of the Hopfield network. Despite the drawbacks of these networks, I firmly believe that biologically inspired designs can still be incredibly useful for distributed, fault-tolerant information storage and retrieval.

References

[1] D Amit and H Gutfreund. Saturation level of the Hopfield model for neural network. EPL (Europhysics Letters), Jan 1986.
[2] Viktor Dotsenko. An Introduction to the Theory of Spin Glasses and Neural Networks (World Scientific Lecture Notes in Physics). World Scientific Publishing Company, March 1995.
[3] JJ Hopfield. Neural networks and physical systems with emergent collective computational abilities. Proc Natl Acad Sci USA, 79(8):2554, 1982.
[4] D. P. Landau and K. Binder. A Guide to Monte Carlo Simulations in Statistical Physics, 2nd Edition. Cambridge University Press, September 2005. ISBN-10: 0521842387. ISBN-13: 9780521842389.
[5] H. Nishimori. Statistical Physics of Spin Glasses and Information Processing: An Introduction. Oxford University Press, USA, 2001.
[6] D Sherrington.... Solvable model of a spin-glass. Physical Review Letters, Jan 1975.