Lluís A. Belanche (belanche@lsi.upc.edu), Soft Computing Research Group, Dept. de Llenguatges i Sistemes Informàtics (Software Department), Universitat Politècnica de Catalunya, 2010-2011
Introduction

Content-addressable autoassociative memory: deals with incomplete/erroneous keys, e.g. "Julia xxxxázar: Rxxxela".

Energy function: attractors (memories) and basins of attraction. A mechanical system tends to minimum-energy states (e.g. a pole); symmetric Hopfield networks have an analogous behaviour: low-energy states = attractors = memories.
Introduction

A Hopfield network is a single-layer recurrent neural network, whose only layer has n neurons (as many as the input dimension).
Introduction

The data sample is a list of keys (aka patterns) L = {ξ_i : 1 ≤ i ≤ l}, with ξ_i ∈ {0,1}^n, that we want to associate with themselves.

What is the meaning of autoassociativity? When a new pattern Ψ is shown to the network, the output is the key in L closest (in Hamming distance) to Ψ. Evidently, this can be achieved by direct programming; yet in hardware it is quite complex (and the Hopfield net does it quite differently).

We therefore set two goals:
1. Content-addressable autoassociative memory (not easy)
2. Tolerance to small errors, which will be corrected (rather hard)
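The "direct programming" alternative mentioned above is just a nearest-key lookup under Hamming distance; a minimal sketch (the keys and function names are illustrative, not from the slides):

```python
# Direct programming of an autoassociative memory: return the stored key
# closest in Hamming distance to a probe pattern.

def hamming(a, b):
    """Number of positions where the two patterns differ."""
    return sum(x != y for x, y in zip(a, b))

def recall(keys, probe):
    """The key in the list closest to the probe (ties broken by list order)."""
    return min(keys, key=lambda k: hamming(k, probe))

keys = [(0, 1, 1, 0), (1, 1, 0, 0)]
print(recall(keys, (0, 1, 1, 1)))   # -> (0, 1, 1, 0), one bit away
```

The Hopfield network achieves the same input/output behaviour, but by relaxation dynamics rather than by explicit search.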
Intuitive analysis

The state of the network is the vector of activations e of the n neurons. The state space is {0,1}^n. [Figure: the 3-dimensional hypercube with corners 000, 001, …, 111; the keys ξ_1, ξ_2, ξ_3 sit at some of its corners.] When the network evolves, it traverses the hypercube.
Formal analysis (I)

For convenience, we change from {0,1} to {−1,+1}, by s = 2e − 1.

Dynamics of a neuron:

s_i := sgn( Σ_{j=1}^n w_ij s_j − θ_i ),   with sgn(z) = +1 if z ≥ 0, −1 if z < 0

Let us simplify the analysis by using θ_i = 0, and uncorrelated keys ξ_i (drawn from a uniform distribution).
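These dynamics can be sketched directly (a minimal illustration assuming a given weight matrix W; sgn(0) is taken as +1, as in the definition above):

```python
import numpy as np

# Asynchronous Hopfield dynamics: s_i := sgn(sum_j w_ij s_j - theta_i).

def sgn(z):
    return 1 if z >= 0 else -1

def sweep(W, s, theta=0.0):
    """One asynchronous pass over all neurons (in index order);
    returns True if any state changed."""
    changed = False
    for i in range(len(s)):
        new = sgn(W[i] @ s - theta)
        changed |= (new != s[i])
        s[i] = new
    return changed

def run(W, s, max_sweeps=100):
    """Iterate sweeps until a stable state (no changes) is reached."""
    for _ in range(max_sweeps):
        if not sweep(W, s):
            break
    return s
```

Note that each neuron sees the states already updated earlier in the same sweep; this is exactly the asynchronous regime assumed in the next slide.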
Formal analysis (II)

How are neurons chosen to update their activations?
1. Synchronous: all at once
2. Asynchronous: in a fixed sequential order, or according to a fair probability distribution

The memories do not change; the end attractor (the recovered memory) may. We will assume asynchronous updating.

When does the process stop? When there are no more changes in the network state (we are in a stable state).
Formal analysis (III)

PROCEDURE:
1. Find a formula for setting the weights w_ij so as to store the keys ξ_i ∈ L
2. Sanity check that the keys ξ_i ∈ L are stable (they are autoassociated with themselves)
3. Check that small perturbations of the keys ξ_i ∈ L are corrected (they are associated with the closest pattern in L)
Formal analysis (IV)

Case l = 1 (only one pattern ξ):

Present ξ to the net: s_i := ξ_i, 1 ≤ i ≤ n

Stability: s_i := sgn( Σ_{j=1}^n w_ij s_j ) = sgn( Σ_{j=1}^n w_ij ξ_j ) = … should be … = ξ_i

One way to get this is w_ij = α ξ_i ξ_j, with α > 0 (called Hebbian learning):

sgn( Σ_{j=1}^n w_ij ξ_j ) = sgn( Σ_{j=1}^n α ξ_i ξ_j ξ_j ) = sgn( α ξ_i Σ_{j=1}^n 1 ) = sgn(α n ξ_i) = sgn(ξ_i) = ξ_i
Formal analysis (V)

Take α = 1/n. This has a normalizing effect: ‖w_i‖_2 = 1/√n.

Therefore W_{n×n} = (w_ij) = (1/n) ξ ξ^T, i.e. w_ij = (1/n) ξ_i ξ_j.

If the perturbation consists of b flipped bits, with b ≤ (n−1)/2, then the sign of Σ_{j=1}^n w_ij s_j does not change! ⇒ if the number of correct bits exceeds that of incorrect bits, the network is able to correct the erroneous bits.
Formal analysis (VI)

Case l ≥ 1. Superposition:

w_ij = (1/n) Σ_{k=1}^l ξ_ki ξ_kj   and   W_{n×n} = (1/n) Σ_{k=1}^l ξ_k ξ_k^T

Defining E_{n×l} = [ξ_1, …, ξ_l], we get W_{n×n} = (1/n) E E^T. Note that W_{n×n} is a symmetric matrix.
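The superposition rule is a one-line matrix product; a sketch (the two example keys are illustrative):

```python
import numpy as np

# Hebbian storage of l patterns by superposition: W = (1/n) E E^T,
# where the columns of E are the +/-1 keys.

def hebbian_weights(patterns):
    E = np.array(patterns).T          # n x l matrix whose columns are the keys
    n = E.shape[0]
    return (E @ E.T) / n              # symmetric n x n weight matrix

xi1 = np.array([+1, -1, +1, -1])
xi2 = np.array([+1, +1, -1, -1])
W = hebbian_weights([xi1, xi2])
print(np.allclose(W, W.T))            # True: the weight matrix is symmetric
```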
Formal analysis (VII)

Stability of a pattern ξ_v ∈ L. Initialize the net to s_i := ξ_vi, 1 ≤ i ≤ n.

Stability: s_i := sgn(h_vi), where

h_vi = Σ_{j=1}^n w_ij ξ_vj
     = (1/n) Σ_{j=1}^n Σ_{k=1}^l ξ_ki ξ_kj ξ_vj
     = (1/n) Σ_{j=1}^n ( Σ_{k≠v} ξ_ki ξ_kj ξ_vj + ξ_vi ξ_vj ξ_vj )      [ξ_vj ξ_vj = +1]
     = ξ_vi + (1/n) Σ_{j=1}^n Σ_{k≠v} ξ_ki ξ_kj ξ_vj
     = ξ_vi + crosstalk(v, i)

If |crosstalk(v, i)| < 1 then sgn(h_vi) = sgn(ξ_vi + crosstalk(v, i)) = sgn(ξ_vi) = ξ_vi.
Formal analysis (and VIII)

Analogously to the case l = 1, if the number of correct bits of a pattern ξ_v ∈ L exceeds that of incorrect bits, the network is able to correct the bits in error (the pattern ξ_v ∈ L is indeed a memory).

When is |crosstalk(v, i)| < 1? If l ≪ n nothing goes wrong, but as l gets closer to n the crosstalk term becomes strong compared to ξ_v: there is no stability.

The million-euro question: given a network of n neurons, what is the maximum value of l? What is the capacity of the network?
Capacity analysis

For random uncorrelated patterns (for convenience):

l_max = n / (2 ln n + ln ln n)   for perfect recall
l_max ≈ 0.138 n                  allowing small errors (≈ 1.6 %)

Realistic patterns will not be random, though some coding can make them more so. An alternative setting is to have negatively correlated patterns: ⟨ξ_v, ξ_u⟩ ≤ 0, v ≠ u. This condition is equivalent to stating that the number of differing bits is at least the number of equal bits.
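The capacity breakdown can be seen empirically; a small illustrative experiment (not from the slides, sizes and seed chosen arbitrarily) that measures the fraction of stored random patterns that are exactly stable as l grows:

```python
import numpy as np

# Fraction of l random patterns that remain stable (one synchronous check)
# in an n-neuron Hopfield net with Hebbian weights and zero diagonal.
# The theory predicts near-perfect stability for l << n and a breakdown
# as l approaches 0.138 n.

rng = np.random.default_rng(0)

def stable_fraction(n, l):
    E = rng.choice([-1, +1], size=(n, l))      # n x l matrix of random keys
    W = (E @ E.T) / n
    np.fill_diagonal(W, 0)
    stable = 0
    for k in range(l):
        xi = E[:, k]
        if np.all(np.where(W @ xi >= 0, 1, -1) == xi):
            stable += 1
    return stable / l

for l in (5, 20, 60):
    print(l, stable_fraction(200, l))          # fraction drops as l grows
```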
Spurious states

When a pattern ξ is stored, the reversed pattern −ξ is also stored. In consequence, any initial configuration of ξ where the majority of bits are incorrect will end up in −ξ! Hardly surprising: from the point of view of −ξ, there are more correct bits than incorrect ones.

The network can be made more stable (fewer spurious states) if we set w_ii = 0 for all i:

W_{n×n} = (1/n) Σ_{k=1}^l ξ_k ξ_k^T − (l/n) I_{n×n} = (1/n) (E E^T − l I_{n×n})
Example

ξ_1 = (+1, −1, +1), ξ_2 = (−1, +1, −1),  E_{3×2} = [ξ_1 ξ_2] =
  +1 −1
  −1 +1
  +1 −1

ξ_1 ξ_1^T = ξ_2 ξ_2^T =
  +1 −1 +1
  −1 +1 −1
  +1 −1 +1

W_{3×3} = (1/3) (E E^T − 2 I) = (1/3) ·
   0 −2 +2
  −2  0 −2
  +2 −2  0

(+1 +1 +1), (−1 −1 +1), (+1 −1 −1)  →  ξ_1 = (+1 −1 +1)
(+1 +1 −1), (−1 −1 −1), (−1 +1 +1)  →  ξ_2 = (−1 +1 −1)

The network is able to correct 1 erroneous bit.
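This worked example can be checked numerically; a sketch (the recall loop simply runs a fixed number of asynchronous sweeps, which is enough for 3 neurons):

```python
import numpy as np

# The 3-neuron example: store xi1 = (+1,-1,+1) and xi2 = (-1,+1,-1)
# with zero diagonal, and verify that every one-bit corruption of xi1
# is corrected back to xi1.

xi1 = np.array([+1, -1, +1])
xi2 = np.array([-1, +1, -1])
E = np.column_stack([xi1, xi2])
W = (E @ E.T) / 3
np.fill_diagonal(W, 0)

def recall(probe, sweeps=10):
    s = probe.copy()
    for _ in range(sweeps):
        for i in range(len(s)):                 # asynchronous updates
            s[i] = 1 if W[i] @ s >= 0 else -1
    return s

for i in range(3):                              # flip one bit of xi1 at a time
    probe = xi1.copy()
    probe[i] = -probe[i]
    print(recall(probe))                        # each run recovers xi1
```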
Practical applications (I)

A Hopfield network of 120 nodes (12×10) that stores and recovers several digits: a set of eight patterns, and the output sequence along the iterations for a distorted input "3". (Figure from Redes Neuronales Artificiales, Facultat d'Informàtica, UPV, 2001-2002.)
Practical applications (II) A Hopfield network that stores fighter aircraft shapes is able to reconstruct them. The network operates on the video images supplied by a camera on board a USAF bomber
The energy function (I)

If the weights are symmetric, there exists a so-called energy function:

H(s) = − Σ_{i=1}^n Σ_{j>i} w_ij s_i s_j

which is a function of the state of the system. The central property of H is that it always decreases (or remains constant) as the system evolves. Therefore the attractors (the stored memories) have to be local minima of the energy function.
The energy function (II)

Proposition 1: a change of state in one neuron i yields a change ΔH < 0.

Let h_i = Σ_{j=1}^n w_ij s_j. If neuron i changes, it is because s_i ≠ sgn(h_i), and its new state is −s_i. Therefore

ΔH = −(−s_i) h_i − (−s_i h_i) = 2 s_i h_i < 0

since s_i ≠ sgn(h_i) implies s_i h_i < 0.

Proposition 2: H(s) ≥ − Σ_{i=1}^n Σ_{j=1}^n |w_ij| (the energy function is lower-bounded).
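Proposition 1 can be checked numerically; a sketch (the network size, patterns and seed are arbitrary illustrative choices):

```python
import numpy as np

# Check that H(s) = -sum_{i<j} w_ij s_i s_j never increases under
# asynchronous updates of a symmetric Hopfield network.

def energy(W, s):
    return -0.5 * s @ W @ s      # equals -sum_{i<j} w_ij s_i s_j when diag(W) = 0

rng = np.random.default_rng(1)
E = rng.choice([-1, +1], size=(30, 3))
W = (E @ E.T) / 30
np.fill_diagonal(W, 0)

s = rng.choice([-1, +1], size=30)      # random initial state
energies = [energy(W, s)]
for _ in range(3):                     # a few asynchronous sweeps
    for i in range(30):
        s[i] = 1 if W[i] @ s >= 0 else -1
        energies.append(energy(W, s))

print(all(b <= a + 1e-12 for a, b in zip(energies, energies[1:])))   # True
```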
The energy function (III)

The energy function H is also useful to derive the proposed weights. Consider the case of only one pattern ξ: we want H to be minimum when the correlation between the current state s and ξ is maximum, that is, when ⟨ξ, s⟩ = ⟨ξ, ξ⟩ = ‖ξ‖² = n. Therefore we choose H(s) = K − (1/n) ⟨ξ, s⟩², where K > 0 is a suitable constant.

For the many-pattern case, we can try to make each of the ξ_v into local minima of H:

H(s) = K − (1/n) Σ_{v=1}^l ⟨ξ_v, s⟩²
     = K − (1/n) Σ_{v=1}^l ( Σ_{i=1}^n ξ_vi s_i ) ( Σ_{j=1}^n ξ_vj s_j )
     = K − Σ_{i=1}^n Σ_{j=1}^n ( (1/n) Σ_{v=1}^l ξ_vi ξ_vj ) s_i s_j
     = K − Σ_{i=1}^n Σ_{j=1}^n w_ij s_i s_j

which, up to the constant and a factor of 2 (W is symmetric), is H(s) = − Σ_{i=1}^n Σ_{j>i} w_ij s_i s_j.
The energy function (and IV)

Exercises:
1. In the previous example,
   a) Check that H(s) = (2/3) (s_1 s_2 + s_2 s_3 − s_1 s_3)
   b) Show that H decreases as the network evolves towards the stored patterns
2. Prove Proposition 2
Spurious states (continued)

We have not shown that the explicitly stored memories are the only attractors of the system. We have seen that the reversed patterns −ξ are also stored (they have the same energy as ξ). There are also stable mixture states, linear combinations of an odd number of patterns, e.g.:

ξ_i^(mix) = sgn( Σ_{k=1}^α ± ξ_{v_k, i} ),  1 ≤ i ≤ n,  α odd
Dynamics (continued)

Let us change the dynamics of a neuron by using an activation with memory:

s_i(t+1) = +1      if h_i(t+1) > 0
           s_i(t)  if h_i(t+1) = 0
           −1      if h_i(t+1) < 0

where h_i(t+1) = Σ_{j=1}^n w_ij s_j(t)
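The modified activation differs from plain sgn only at a tie; a minimal sketch:

```python
# Activation with memory: when the field h is exactly zero, the neuron
# keeps its previous state instead of defaulting to +1.

def update_with_memory(h_new, s_prev):
    if h_new > 0:
        return +1
    if h_new < 0:
        return -1
    return s_prev                        # h == 0: the state is retained

print(update_with_memory(0.0, -1))       # -1: previous state kept at a tie
```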
Application to optimization problems

Problem: find the optimum (or something close to it) of a scalar function subject to a set of constraints, e.g. the traveling-salesman problem (TSP) or graph bipartitioning (GBP); often these are NP-complete problems.

TSP(n) = tour n cities with the minimum total distance. In general, the number of valid tours is TSP(n) = n!/(2n); for TSP(60) this is ≈ 6.93 × 10^79.

GBP(n) = given a graph with an even number n of nodes, partition it into two subgraphs of n/2 nodes each such that the number of crossing edges is minimum.

Idea: set up the problem as a function to be minimized, choosing the weights so that the obtained function is the energy function of a Hopfield network.
Application to graph bipartitioning

Representation: n graph nodes = n neurons, with

s_i = +1 if node i is in the left part, −1 if node i is in the right part

Connectivity matrix: c_ij = 1 indicates a connection between nodes i and j.
Application to graph bipartitioning

H(S) = − Σ_{i=1}^n Σ_{j>i} c_ij S_i S_j   +   (α/2) ( Σ_{i=1}^n S_i )²
       [connections between parts]            [balanced bipartition]

⇒ H(S) = − Σ_{i=1}^n Σ_{j>i} (c_ij − α) S_i S_j + const = − Σ_{i=1}^n Σ_{j>i} w_ij S_i S_j + Σ_{i=1}^n θ_i S_i + const

⇒ w_ij = c_ij − α, θ_i = 0

(general idea: read off w_ij from the coefficient of S_i S_j in the energy function)
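As a sketch (the 4-node graph and the value of α are illustrative choices, not from the slides), the weight assignment w_ij = c_ij − α and the resulting energy can be written directly:

```python
import numpy as np

# Graph bipartitioning as a Hopfield energy: w_ij = c_ij - alpha, theta_i = 0.

alpha = 1.0
C = np.array([[0, 1, 1, 0],          # connectivity of a small 4-node graph
              [1, 0, 0, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]])
W = C - alpha
np.fill_diagonal(W, 0)

def energy(W, s):
    return -0.5 * s @ W @ s          # -sum_{i<j} w_ij s_i s_j

s = np.array([1, 1, -1, -1])         # one candidate bipartition
print(energy(W, s))                  # -3.0 for this split
```

Running the asynchronous dynamics on W would then descend towards low-energy states, i.e. balanced partitions with few crossing edges.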
Application to the traveling-salesman problem

Representation: 1 city → n neurons, coding for position; hence n cities = n² neurons. S_{n×n} = (S_ij) (network state); D_{n×n} (distances between cities, given), with S_ij = +1 if city i is visited in position j and −1 otherwise.

Restrictions on S:
1. A city cannot be visited more than once ⇒ every row i must have only one +1:
   R_1(S) = Σ_{i=1}^n Σ_{j=1}^n Σ_{k>j} S_ij S_ik
2. Two cities cannot occupy the same position (no gift of ubiquity) ⇒ every column j must have only one +1: R_1(S^T)
3. Every city must be visited ⇒ there must be n +1s overall:
   R_2(S) = ( Σ_{i=1}^n Σ_{j=1}^n S_ij − n )²
Application to the traveling-salesman problem

4. The tour has to be the shortest possible ⇒ total sum of distances in the tour:
   R_3(S, D) = Σ_{k=1}^n Σ_{i=1}^n Σ_{j>i} D_ij S_ik (S_{j,k−1} + S_{j,k+1})

H(S) = − Σ_{ij} Σ_{kl} w_{ij,kl} S_ij S_kl + Σ_{ij} θ_ij S_ij = α R_1(S) + β R_1(S^T) + γ R_2(S) + ε R_3(S, D)

⇒ w_{ij,kl} = −α δ_ik (1 − δ_jl) − β δ_jl (1 − δ_ik) − γ − ε D_ik (δ_{j,l+1} + δ_{j,l−1})   and   θ_ij = −γ n

This is a continuous Hopfield network: S_ij ∈ (−1, 1), since S_ij := tanh( χ (h_ij − θ_ij) )

Problem: find good values for α, β, γ, ε > 0 and χ > 0.
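The 4-index weight array can be built mechanically from the delta formula; a sketch (the negative sign on each penalty term, the cyclic treatment of positions, and the tiny distance matrix are assumptions made here for illustration):

```python
import numpy as np

# TSP weights: w[i,j,k,l] couples "city i at position j" with "city k at
# position l", built from Kronecker deltas over a given distance matrix D.

def tsp_weights(D, alpha, beta, gamma, eps):
    n = D.shape[0]
    w = np.zeros((n, n, n, n))
    for i in range(n):
        for j in range(n):
            for k in range(n):
                for l in range(n):
                    w[i, j, k, l] = (
                        -alpha * (i == k) * (j != l)            # one position per city
                        - beta * (j == l) * (i != k)            # one city per position
                        - gamma                                 # global count penalty
                        - eps * D[i, k] * ((j == (l + 1) % n)   # tour length,
                                           + (j == (l - 1) % n))  # cyclic positions
                    )
    return w

D = np.array([[0.0, 2.0], [2.0, 0.0]])        # toy 2-city distance matrix
w = tsp_weights(D, alpha=1.0, beta=1.0, gamma=0.1, eps=0.5)
print(w.shape)                                 # (2, 2, 2, 2)
```

By construction w is symmetric under swapping the pairs (i,j) and (k,l), as required for an energy function to exist.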
Application to the traveling-salesman problem

Example for n = 5: 120 valid tour representations (12 genuinely different tours), 25 neurons ⇒ 2^25 potential attractors; actual number of attractors: 1740 (among them, the 120 valid ones). 1740 − 120 = 1620 spurious attractors (invalid tours).

The valid tours correspond to the states of lowest energy (the design is correct). But we have a network with more than 13 times as many invalid attractors as valid ones: a random initialization will end up in an invalid tour with probability 0.966. Moreover, we aim for the best tour!

A possible workaround is to use stochastic units with simulated annealing.