Hopfield Networks (Excerpt from a Basic Course at IK 2008). Herbert Jaeger, Jacobs University Bremen


Building a model of associative memory should be simple enough... Our brain is a neural network Individual neurons are quite well understood Almost universally shared belief: memories are coded in synaptic connectivity So it only remains to find out how the memories go there and how they are retrieved

Our topic for today How can "information" be "coded" in the "connectivity" of "neural networks"? Eerh... hrmm... What does this mean? All of these concepts are so imprecise... Somebody's gotta set the rules. We need a decision. A tough decision. And stick to it. http://www.sillyjokes.co.uk/

The Sheriff says... We can store discrete items in memory (finitely many, individual, fundamental patterns). Fundamental patterns are addressed by auto-association: the fundamental pattern is reconstituted from a "similar" cue input. Example: "similar" = "corrupted": pattern restoration. Example: "similar" = "partial": pattern completion. Images from: Hertz et al. 1991

Other "similarity" relationships cue - pattern Thinking only of visual patterns for simplicity... "similar" = distorted "similar" = shifted, mirrored, rotated "similar" = B/W (pattern in color) "similar" = line drawing (pattern photo) "similar" = preceding in time (pattern appears in animated sequence) "similar" = still (cueing an animated scene)

The rolling-downhill-ball model of memory Consider the space of all possible neural "pattern states" (in fig.: a 2-dim vector space spanned by V1, V2). Within this space, some points are the fundamental patterns (ξ^1, ξ^2, ξ^3 in fig.). Above the pattern space, an energy landscape (or a potential) E is defined. Fundamental patterns lie at the minima of the energy. Any pattern may serve as cue. The process of associative retrieval of a pattern from a "similar" cue is determined by gradient descent on E, from the energy of the cue down to the energy of the retrieved fundamental pattern. Image: www.ift.uib.no/~antonych/protein.html

Aerial view of the same This contour plot shows the energy landscape as in a map. The pattern space becomes divided into basins of attraction. E.g., all cue patterns that are "attracted" by ξ^2 form the basin of attraction of the fixed-point attractor ξ^2. One may say that all these are instances of the category (concept, class) represented by ξ^2.

Same, with "real" patterns... Pixel images from Haykin 1999

Structure of a Hopfield network (HN) A HN is made of N binary neurons (in figure: N = 4). Each neuron is connected to all other neurons (except itself - no auto-feedback). The connection between neurons i and j is symmetric (same as the connection between j and i) and has a weight w_ij = w_ji ∈ R. At a given time, neuron i has state s_i ∈ {−1, +1}. The entire network has a state S = (s_1, ..., s_N)^T (written as a column vector). A simple demo HN: w_12 = w_21 = 3, w_13 = w_31 = −0.5, w_14 = w_41 = −0.2, w_23 = w_32 = 1, w_24 = w_42 = 0.1, w_34 = w_43 = −2; states s_1 = s_2 = s_3 = s_4 = −1, so S = (s_1, s_2, s_3, s_4)^T = (−1, −1, −1, −1)^T.

Energy of a state By definition, a state S = (s_1, ..., s_N)^T has an energy E(S) = −(1/N) Σ_{i<j; i,j=1,...,N} w_ij s_i s_j. Example (the demo network, all s_i = −1, so every product s_i s_j = +1): E(S) = −1/4 (w_12 + w_13 + w_14 + w_23 + w_24 + w_34) = −1/4 (+3 − 0.5 − 0.2 + 1 + 0.1 − 2) = −0.35. Electric metaphor: higher energy means more "neighboring opposite charges" (almost like a charged battery).
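The energy of the demo network can be checked numerically. A minimal sketch; the weight matrix below is my reading of the (garbled) demo slide, and `energy` uses the slide's 1/N normalization:

```python
import numpy as np

# Weight matrix of the 4-neuron demo network (symmetric, zero diagonal);
# the numeric values are a reconstruction of the demo slide.
W = np.array([
    [ 0.0,  3.0, -0.5, -0.2],
    [ 3.0,  0.0,  1.0,  0.1],
    [-0.5,  1.0,  0.0, -2.0],
    [-0.2,  0.1, -2.0,  0.0],
])

def energy(W, S):
    """E(S) = -(1/N) * sum_{i<j} w_ij s_i s_j. The quadratic form S.W.S
    counts every pair twice, hence the extra factor 1/2."""
    return -(S @ W @ S) / (2 * len(S))

S = np.array([-1, -1, -1, -1])
print(energy(W, S))  # approx. -0.35, as on the slide
```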

A closeup on the neuron model The (discrete-time) HN is made of McCulloch-Pitts neurons (same as the perceptron). Update rule: neuron i first sums its connection-weighted inputs, then takes the difference to a threshold Θ_i, then passes the result through a binary decision function, the sign function: sgn(x) = −1 if x < 0, sgn(x) = +1 if x ≥ 0. s_i(t+1) = sgn( Σ_{j=1,...,N; j≠i} w_ij s_j(t) − Θ_i ). Biology view, engineering view, math view (figures from the Maida lecture notes).

Convention: in HNs, all Θ_i = 0. Updating neurons and networks Single-neuron update, for instance of neuron 3: s_3(t+1) = sgn( Σ_{j=1,2,4} w_3j s_j(t) ) = sgn( −0.5·(−1) + 1·(−1) − 2·(−1) ) = sgn(1.5) = +1. Entire network update: 1. Pick one neuron at random (stochastic choice). 2. Update it as above (deterministic update). 3. Iterate. HNs evolve over time by individual updates of randomly picked neurons.

Updating neurons and networks 2 From the previous slide: the update of s_3 was s_3(t) = −1 → s_3(t+1) = +1. This is a network update S(t) = (−1, −1, −1, −1)^T → S(t+1) = (−1, −1, +1, −1)^T. The energy at time t was E(S(t)) = −0.35. Calculate: E(S(t+1)) = −1.1. Observation: E(S(t+1)) < E(S(t)). Fact: whenever S(t+1) ≠ S(t), then E(S(t+1)) < E(S(t)). "All state changes reduce the energy in a Hopfield network."
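This update-and-check can be replayed in code. A sketch under the same assumptions about the demo weights as in the earlier energy example (my reconstruction of the garbled slide values; sgn(0) is taken as +1):

```python
import numpy as np

# Demo network weights (reconstructed values)
W = np.array([
    [ 0.0,  3.0, -0.5, -0.2],
    [ 3.0,  0.0,  1.0,  0.1],
    [-0.5,  1.0,  0.0, -2.0],
    [-0.2,  0.1, -2.0,  0.0],
])

def energy(W, S):
    return -(S @ W @ S) / (2 * len(S))

def update(W, S, i):
    """Deterministic update of neuron i (threshold 0, sgn(0) = +1)."""
    S = S.copy()
    S[i] = 1 if W[i] @ S >= 0 else -1
    return S

S_t  = np.array([-1, -1, -1, -1])
S_t1 = update(W, S_t, 2)                  # update neuron 3 (index 2)
print(S_t1)                               # [-1 -1  1 -1]
print(energy(W, S_t), energy(W, S_t1))    # approx. -0.35 and -1.1: the energy dropped
```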

Representing patterns by HN states A "pattern" is identified with a state (thus, every neuron takes part in representing a pattern). HN "patterns" are therefore just N-dimensional {−1, +1} vectors. To represent real-world patterns, one must code them as binary vectors. Example: the pixel image on the slide is coded as a {−1, +1} state vector S.
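Coding a black-and-white image as a {−1, +1} state vector is a one-liner. A minimal sketch with a hypothetical 3x3 image of my own (the slide's original image is not recoverable); black pixel → +1, white pixel → −1:

```python
import numpy as np

# Hypothetical 3x3 black/white image: 1 = black pixel, 0 = white pixel
img = np.array([[1, 0, 1],
                [0, 1, 0],
                [1, 0, 1]])

# Code black as +1 and white as -1, then flatten into a state vector S
S = np.where(img.flatten() == 1, 1, -1)
print(S)  # [ 1 -1  1 -1  1 -1  1 -1  1]
```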

HN as memory: problem statement Given: p fundamental patterns ξ^1, ..., ξ^p, coded as N-dimensional {−1, +1} vectors. Wanted: a HN with N neurons which has these fundamental patterns as the minima of its energy landscape. Reward: we would have a neural network which, when started in some pattern (state) "similar" to a fundamental pattern, would evolve toward it by the stochastic state update and the energy minimization principle.

Remember our intuition... (the contour plot with the basins of attraction, as before)

Hypercube state space of HNs The states of a HN are not continuous but discrete. A HN with N neurons can host 2^N different states, a finite number. These states can conveniently be arranged at the corners of an N-dimensional hypercube. (Figure: hypercubes for N = 2, 3, and 4, with corners labeled by {−1, +1} coordinates.)

"Rolling downhill" in a hypercube Just to give you an impression... Pattern states in an N = 16 HN (slightly misleading though: the images have N = 120!). The shading indicates the energy at each state (hypercube corner); a single-neuron update flips one component of the state vector.

Problem statement, again Given: p fundamental patterns ξ^1, ..., ξ^p, coded as N-dimensional {−1, +1} vectors. Wanted: a HN with N neurons which has these fundamental patterns as the minima of the energy values at the hypercube corners... Seems a tough problem.

Solution: the HN should learn that by itself! Observation: the dynamic behaviour of a HN depends only on the weights w_ij. Idea: train the weights by "showing" the fundamental patterns ξ^1, ..., ξ^p repeatedly to the network. At each exposure, the weights are adapted a little bit such that the energy for that pattern is reduced a little. After numerous exposures, the energy landscape should have deep troughs at the states ξ^1, ..., ξ^p. Where do we get such a clever weight adaptation rule from?

Donald O. Hebb (1904-1985) Started out as an English teacher. Turned to an academic career in psychology. 1949: The Organization of Behavior: A Neuropsychological Theory. (Photo: williamcalvin.com/bk9/bk9inter.htm)

Hebb's postulate (1949) "When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A's efficiency, as one of the cells firing B, is increased." (Hebb, The Organization of Behavior, 1949) Also known as "Cells that fire together wire together", or simply as "Hebb's learning rule".

Training HNs with Hebb's rule Hebb's rule spells out like this: "Presenting ξ = (ξ_1, ..., ξ_N) to the network" means just this: "set the N neuron states equal to (ξ_1, ..., ξ_N)." The neurobiological, Hebbian version of adaptation: "i and j fire together if sgn(ξ_i) = sgn(ξ_j). Then, increase w_ij a bit." This corresponds to energy reduction. Remember, E(ξ^μ) = −(1/N) Σ_{i<j; i,j=1,...,N} w_ij ξ_i^μ ξ_j^μ. Thus, if sgn(ξ_i) = sgn(ξ_j), increasing w_ij reduces E(ξ). This leads to the following training scheme. 1. Present ξ^1, ..., ξ^p to the HN repeatedly (in random or cyclic order). 2. When ξ^μ is presented, update all weights by w_ij ← w_ij + λ ξ_i^μ ξ_j^μ (where λ is a small learning rate).

Training HNs with Hebb's rule, continued Training scheme (repeated): 1. Present ξ^1, ..., ξ^p to the HN repeatedly. 2. When ξ^μ is presented, update all weights by w_ij ← w_ij + λ ξ_i^μ ξ_j^μ. Effect: on each presentation of some ξ^μ, the energy of the corresponding state is lowered a bit. HOPE: we will eventually get a network that has energy troughs at the fundamental pattern states (and thus can serve as a "ball-rolling-downhill" memory).
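The training scheme above can be sketched directly. In the following sketch the sizes N = 50, p = 3 and the learning rate λ = 0.01 are illustrative choices of my own; the patterns are presented cyclically, and the result is compared against the one-shot Hebbian weights w_ij = (1/N) Σ_μ ξ_i^μ ξ_j^μ, which it matches up to a scaling constant:

```python
import numpy as np

rng = np.random.default_rng(1)
N, p, lam = 50, 3, 0.01                        # illustrative sizes, learning rate
patterns = rng.choice([-1, 1], size=(p, N))    # random fundamental patterns

# Iterative Hebbian training: repeated cyclic presentation of the patterns
W = np.zeros((N, N))
for t in range(3000):
    xi = patterns[t % p]                # present pattern xi
    W += lam * np.outer(xi, xi)         # w_ij <- w_ij + lam * xi_i * xi_j
np.fill_diagonal(W, 0)                  # no self-connections

# One-shot Hebbian weights (the shortcut formula)
W1 = (patterns.T @ patterns) / N
np.fill_diagonal(W1, 0)

# Up to a scaling constant c, the two weight sets coincide
c = (3000 / p) * lam * N
print(np.allclose(W, c * W1))  # True
```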

A shortcut to the iterative training scheme Fact: the iterative learning scheme will converge (up to an irrelevant scaling constant) to a unique set of weights. Luck: there exists a simple one-shot computation for this asymptotic weight set, given by w_ij = (1/N) Σ_{μ=1,...,p} ξ_i^μ ξ_j^μ.

Summary so far Given p fundamental patterns ξ^1, ..., ξ^p (coded as N-dimensional {−1, +1} vectors), we can "store" them in an N-neuron HN by setting the weights according to w_ij = (1/N) Σ_{μ=1,...,p} ξ_i^μ ξ_j^μ. These weights are equivalent (up to irrelevant scaling) to the weights that we would ultimately get by using Hebbian learning, which at each step "tries" to reduce the energy of one of the states ξ^μ.
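Putting it together: the sketch below stores random patterns with the one-shot rule and retrieves one of them from a corrupted cue. The sizes N = 100, p = 5 and the 10-bit corruption are illustrative choices of my own (a load p/N = 0.05, which, as discussed on the later slides, is in the regime where the network works well):

```python
import numpy as np

rng = np.random.default_rng(7)
N, p = 100, 5                                  # load p/N = 0.05
patterns = rng.choice([-1, 1], size=(p, N))

# One-shot Hebbian weights: w_ij = (1/N) sum_mu xi_i^mu xi_j^mu
W = (patterns.T @ patterns) / N
np.fill_diagonal(W, 0)

def recall(W, S, steps=3000):
    """Iterated stochastic network update:
    pick a random neuron, update it deterministically."""
    S = S.copy()
    for _ in range(steps):
        i = rng.integers(len(S))
        S[i] = 1 if W[i] @ S >= 0 else -1
    return S

# Corrupt 10 bits of pattern 0 and let the network restore it
cue = patterns[0].copy()
cue[rng.choice(N, size=10, replace=False)] *= -1
out = recall(W, cue)
print((out == patterns[0]).mean())  # fraction of matching bits; usually 1.0 at this load
```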

Natural questions Do we really get local energy minima at the fundamental pattern states? How many patterns can we store? Do we get energy minima only at the fundamental pattern states (or do we also create other local minima, that is, "false" memories)? Answers to these questions are known, and the analysis behind all of this has made Hopfield Networks so famous.

Q: fundamental patterns = local minima? Test for this: present a fundamental pattern ξ^μ to the trained network. If (and only if) the state corresponds to a local minimum, then all bits ξ_i^μ of that state vector must be stable under the state update rule, that is, ξ_i^μ(t+1) = sgn( Σ_{j≠i} w_ij ξ_j^μ(t) ) = ξ_i^μ(t). Outcome: unfortunately, not always are all fundamental patterns stable. Analytic result: if p patterns are stored in a size-N network, the probability that a randomly picked bit in a randomly picked pattern will flip when updated is P(i unstable) = Φ(−√(N/p)). Note: Φ(a) is the area under the standard Gaussian density to the left of a.

More on stability... Situation: p patterns are stored in a size-N network. Then the probability that a pattern bit is unstable when updated is P(i unstable) = Φ(−√(N/p)). Consequence: the pattern-bit stability is governed by p/N, called the load of the HN. The more patterns stored relative to the network size, the higher the chance that bits of fundamental patterns flip. Beware of avalanches! The bit stability refers only to one isolated update within a presented pattern. But... if the entire network is "run" by iterated state-bit updates, flipped bits may pull further bits with them...
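The flip probability Φ(−√(N/p)) is easy to tabulate. A small sketch (N = 1000 is an arbitrary choice; Φ, the standard Gaussian cumulative distribution, is written via the error function):

```python
from math import erf, sqrt

def Phi(a):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(a / sqrt(2.0)))

def p_unstable(N, p):
    """P(i unstable) = Phi(-sqrt(N/p)): probability that one pattern bit flips."""
    return Phi(-sqrt(N / p))

# The flip probability grows with the load p/N
N = 1000
for load in [0.01, 0.05, 0.138, 0.3]:
    print(load, p_unstable(N, round(load * N)))
```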

Avalanche stability Situation: p patterns are stored in a size-N network. Pick a pattern and present it to the network (that is, make it the starting state). Some of its bits may be unstable under bit update. Two things may happen under iterated stochastic network update: 1. While some bits may flip, their flips don't induce more and more flips: the dynamics stabilizes in a pattern not too different from the original one. 2. The unstable bits trigger an avalanche, just like in a nuclear chain reaction. Analysis: whether or not avalanches occur depends on the load p/N. In the limit of large N, at a load of p/N > 0.138 every pattern becomes avalanche-unstable. Figure: % of changed pattern bits under iterated network update, vs. load (there denoted by α). (D. J. Amit et al., Storing infinite numbers of patterns in a spin-glass model of neural networks. Phys. Rev. Lett. 55(14), 1985)
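Whether flips avalanche can be checked empirically: store p random patterns, start the network exactly at one of them, let it run, and measure how many bits end up changed. A small simulation sketch (N = 300 and the sweep count are my own choices; at such modest N the transition is washed out, but the contrast between low and high load is already visible):

```python
import numpy as np

rng = np.random.default_rng(0)

def fraction_changed(N, p, sweeps=30):
    """Store p random patterns, start at pattern 0, relax the network,
    and return the fraction of bits that differ from pattern 0 at the end."""
    patterns = rng.choice([-1, 1], size=(p, N))
    W = (patterns.T @ patterns) / N        # one-shot Hebbian weights
    np.fill_diagonal(W, 0)
    S = patterns[0].copy()
    for _ in range(sweeps * N):            # iterated stochastic update
        i = rng.integers(N)
        S[i] = 1 if W[i] @ S >= 0 else -1
    return (S != patterns[0]).mean()

N = 300
for load in [0.05, 0.10, 0.25]:
    print(load, fraction_changed(N, round(load * N)))
```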

Avalanche stability: comments Load below 0.138: essentially stable memories. Load above 0.138: suddenly all memories are completely unstable. p/N = 0.138 is a critical value of the load parameter; when it is crossed, the behaviour of the system changes abruptly and dramatically. This sudden change of qualitative behaviour at critical values is common in nature; here we have a case of a phase transition.

Q: are there "false memories" (other local minima)? Let's train a 120-neuron HN on these 8 fundamental patterns. Which minimal-energy patterns will emerge in the HN? (How many local troughs will the energy landscape have?)

Some "false memories" Starting from 43,970 random states, the network ended in these final stable states (from Haykin 1998). There are three kinds of "false memories": 1. Inverted states (occur necessarily). 2. Mixtures of an odd number of fundamental patterns (example: a mixture of patterns 1, 4, and 9). 3. Spurious states (aka spin glass states), which are uncorrelated with the stored patterns.

False memories and network load For all loads p/N: stable spin glass states exist. For p/N > 0.138: spin glass states are the only stable ones. For 0 < p/N < 0.138: stable states close to the desired fundamental patterns exist. For 0 < p/N < 0.05: pattern-related stable states have lower energy than spin glass states. For 0.05 < p/N < 0.138: spin glass states dominate (some of them have lower energy than the pattern-related states). For 0 < p/N < 0.03: additional mixture states exist, with energies not quite as low as the pattern-related states. In sum, a HN works really well only for a load 0.03 < p/N < 0.05. Results on this slide reported after: MacKay 2003 (chapter 42).

Hopfield networks: Pros Simple model, can be mathematically analysed. Biologically not immediately unrealistic. Has strongly influenced how neuroscientists think about memory. Connections to other fields of computational physics (spin glass models). Robust against "brain damage" (not discussed here). Historically helped to salvage neural network research, and J. J. Hopfield was for several years touted as a Nobel prize candidate.

Hopfield networks: Cons Small memory capacity: can store only about 5-10% as many patterns as the network has neurons (but is this really "small"? Considering how many neurons we have...). All the "nice" results hold for uncorrelated fundamental patterns, an unrealistic assumption. Not technically useful. Unwanted spurious, inverted, and superimposed memories. Has strongly influenced how neuroscientists think about memory.

Variants, ramifications Continuous-time, continuous-state HNs (see the Hopfield chapters in Haykin's and MacKay's textbooks). Have also been used to tackle hard optimization problems (e.g., google "Traveling Salesman" + "Hopfield network"). Still, again and again, the subject of new studies. Biologically plausible connectivity patterns have been studied, as well as small-world connectivity patterns (Davey et al. 2005).