Neural Network Essentials
Draft: Associative Memory Chapter
Anthony S. Maida
University of Louisiana at Lafayette, USA
June 6, 2015
Copyright (c) 2000-2003 by Anthony S. Maida

Contents

10.1 Associative memory
    10.1.1 Linear threshold networks
        Notation
        Hypercubes and Hamming distance
        Improving symmetry
        Threshold functions
    10.1.2 Building an auto-associative memory
        Preliminaries
            Representing a pattern
            Fully connected network
            Notation
            Hologram metaphor
        Operation of an auto-associative memory
            Storing patterns
            Retrieving a stored pattern
            A Hopfield memory module
        Noiseless retrieval dynamics
            Synchronous update
            Asynchronous update
            Attractors
            Energy function
            Performance of attractor memories
            Limitations of the method
    10.1.3 Hetero-associative memory and central pattern generators
        Building a hetero-associative memory
        Limit cycles
        Pattern generators
    10.1.4 Significance of Hopfield networks
        Further reading
References

10.1 Associative memory

Association in humans refers to the phenomenon of one thought causing us to think of another. The phenomenon of association is partly responsible for the structure of our stream of consciousness and has been pondered both by philosophers, such as Aristotle and Locke, and by psychologists, such as Ebbinghaus and William James. This chapter presents mathematical tools which can be used to model distributed associative memory.

We will conceptualize a thought as some kind of semantic item. Any concept you are capable of thinking of or conceiving is an example of a semantic item. In distributed memories, the semantic items are known as patterns, and for the simple memories of this chapter they will be represented as binary feature vectors. The term distributed refers to the way the memory is built. Distributed models are designed so that the information corresponding to a stored pattern or semantic item is distributed throughout the neural units in the memory. Chapter 5 will discuss another type of memory model in which information for a particular semantic item is localized to a small number of units.

There are two main classifications of associative memories: auto-associative memories and hetero-associative memories. These models assume that an active thought in a neural network is represented by the instantaneous firing rate of the neurons in the network, sometimes called the network state vector. Hetero-associative memories provide a model of what we normally view as stream-of-consciousness association: how one thought sometimes reminds us of another. Auto-associative memory addresses the question of how thoughts can be reconstructive. Auto-associative memories also address the question of how a thought can be momentarily stable when it is realized in a network of independently firing neurons.

This section begins by describing a type of dynamic auto-associative memory known as a Hopfield network. Hopfield (1982) [8] showed that this type of associative memory has an exact analog in a magnetic system, which enabled theoretical analyses of neural systems that were previously impossible. The behavior of an auto-associative memory is similar to that of a spell-checking program: it offers the most likely correction to a presented pattern. The memory stores a large number of patterns. When you submit a degraded version of one of the stored patterns to the memory, it goes into a retrieval state that represents the most closely matching stored pattern. Many people feel that human memory is reconstructive. When one sees part of a pattern, one often thinks of the whole. Hopfield networks offer a concrete example of a brain-like system that operates as a reconstructive memory.

Hopfield networks illustrate how a set of neurons can collectively store information. These networks are not the first mathematical models of associative memory, nor do they have any direct biological realism. Their importance is that they are among the simplest models, are very well understood, and suggest how to approach the problem of memory storage in the brain.

10.1.1 Linear threshold networks

There is not a generally accepted name for the simplest classes of model neurons. We will simply call them units. Schematically, units always have the structure shown in Figure 1.

Figure 1: Idealized neuron (left) and the linear threshold unit (LTU) used as a mathematical model (right). The biological side is labeled with the axons of presynaptic neurons, pre-synaptic terminals, soma, and axon; the model side is labeled with the weights w_1, ..., w_n, the weighted sum h (Σ), the threshold Θ, the unit step function Ψ, and the output activation a.

The oldest type of unit, going back to McCulloch and Pitts (1943) [10], is the linear threshold unit (LTU). The LTU uses the unit step function as its activation function. The name comes from the fact that the unit is active if its linearly weighted sum of inputs reaches a threshold. Hopfield networks are made of LTUs, occasionally called McCulloch-Pitts neurons after their inventors. LTUs model drastically simplified biological neurons, as shown in Figure 1. We often use the term unit rather than neuron to avoid any misunderstanding about claiming biological realism. Networks that use LTUs may be inspired by biological neurons, but the modeling itself is divorced from brain structure and the biological implications of the models are very theoretical. Nevertheless, these models are interesting metaphors for brain function.

Notation

For the moment, assume that the units under discussion are embedded in a network having a synchronized, global clock. Assume further that the units are always in either the on or the off state, represented (for the moment) by the binary activation values 0 and 1. The formula below, adapted from Hertz et al. 1991, describes the LTU shown in Figure 1. It states that the activation state of a neuron is a deterministic function of the activation states of the incoming neurons at the previous time step.

    s_i(t+1) = Ψ(h_i(t) - Θ_i)    (1)

The expression s_i(t+1) denotes the activation state of unit i at time t+1. The state of unit i has value 1 if its net input at the previous time step, denoted h_i(t), reaches its threshold Θ_i ("theta"). The symbol Ψ(·) (spelled "psi", pronounced "sigh") denotes the activation function (the unit step function for now), which has value one if its input is greater than or equal to zero and has value zero otherwise.

(Footnote: There will be several changes shortly. We won't necessarily assume a global clock, units will take on the values -1 and +1, and Ψ(·) will denote an asymmetric sign function rather than the unit step function.)
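As an illustration of Equation (1), the short Python sketch below (added here as an example; the weights, activations, and threshold are hypothetical values, not taken from the text) computes the net input of one unit and its next 0/1 state.

    import numpy as np

    def unit_step(x):
        # unit step activation: 1 if x >= 0, else 0
        return 1 if x >= 0 else 0

    w_i = np.array([0.5, -1.0, 2.0])   # hypothetical incoming weights w_ij for unit i
    s_t = np.array([1, 0, 1])          # hypothetical 0/1 activations s_j(t) of the incoming units
    theta_i = 1.0                      # hypothetical threshold for unit i

    h_i = np.dot(w_i, s_t)                 # net input h_i(t)
    s_i_next = unit_step(h_i - theta_i)    # Equation (1): s_i(t+1) = Psi(h_i(t) - Theta_i)
    print(h_i, s_i_next)                   # 2.5 1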

The net input of unit i is determined by the activations of the n units connected to i (along with their associated (synaptic) connection weights, w) at the previous time step, t. Net input, denoted by h, is defined below.

    h_i(t) ≡ Σ_{j=1}^{n} w_ij s_j(t) = w_{i,1} s_1(t) + w_{i,2} s_2(t) + ... + w_{i,n} s_n(t)    (2)

The net input is defined to be the linear sum of the activations of the incoming units at time t, weighted by the strengths of the corresponding synaptic connections w. The notation w_ij denotes the strength of the connection from unit j to unit i. The subscripts in this notation may seem backwards, but there are notational reasons to use this convention. First, when w_ij is embedded in a summation formula, the iteration is over the rightmost subscript. Second, when the w_ij are embedded in a weight matrix, the ith row of the matrix holds the incoming connection weights to unit i.

We have interpreted t as referring to the tth clock tick. This presumes that there is a reference global clock. The expression below describes an update step for unit i without referring to a global clock.

    s_i = Ψ( Σ_{j=1}^{n} w_ij s_j )    (3)

Here the left-hand s_i refers to the next state of unit i, and the equation describes how the next state of unit i is determined by the previous states. In this example, we also assume that the threshold for unit i, Θ_i, is 0.

Hypercubes and Hamming distance

Consider the state of a network of LTUs having fixed connection weights (no learning). The only things that vary in such a network are the activation values of the individual units, so we can describe the state of the network by describing the activations of the individual units. It is geometrically useful to view the set of possible states of such a network as corresponding to the corners of a hypercube. For example, if you have a network of three units, then the set of possible states for the network as a whole corresponds to the set of eight (2^3) possible bit strings of length three. Each bit string specifies the Cartesian coordinates of one of the corners of a cube in three-dimensional space.

If you have two distinct bit strings a and b, each of length n, and you want to judge their differences, it is natural to ask how many of the corresponding bits in the strings do not match. The number of mismatches is the Hamming distance between the two strings. If the bits in the strings are interpreted as coordinates, they map the strings into two corners of an n-dimensional hypercube. If you wanted to travel from the corner for string a to the corner for string b by walking along the edges of the hypercube, the appropriate distance measure between the two points would be the city-block (as opposed to Euclidean) distance metric. It happens that the Hamming distance between two binary strings is the city-block distance between their coordinates in n-dimensional space.
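The equivalence between Hamming distance and city-block distance for 0/1 coordinates can be checked in a few lines of Python (an added illustration; the two bit strings are arbitrary examples):

    import numpy as np

    a = np.array([0, 0, 1, 1])            # two arbitrary bit strings of length 4
    b = np.array([1, 0, 1, 0])
    hamming = np.sum(a != b)              # number of mismatched bits
    city_block = np.sum(np.abs(a - b))    # city-block distance between the hypercube corners
    print(hamming, city_block)            # 2 2 -- the two measures agree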

Let us return to thinking about the state of a linear threshold network. Suppose that we are interested in two distinct states that the network could be in, and that we want to judge the distance between these states. For example, we might be interested in the distance between the current state of the system, state a, and the retrieval state, state b. We want to judge the distance from the current state to the retrieval state. When a unit in the network changes state, it corresponds to traversing an edge of the hypercube. For the network to make a transition from state a to state b, each of the units in the network that mismatches state b must make a state transition. The number of state transitions is the Hamming distance.

Improving symmetry

The hypercube specified by bit strings of length N has one corner at the origin. We can move the center of the hypercube to the origin if we use activation values of -1 and +1 instead of 0 and 1. With this modification, all of the corners of the hypercube have the same distance from the origin. There is a trade-off, however. Now the city-block distance between two corners is twice the Hamming distance. State transitions of units in the network, however, still correspond to walking along the edges of the hypercube. As a rule, the more symmetry there is, the easier the system is to analyze.

Threshold functions

The two most common threshold functions are the Heaviside unit step function and the sgn (pronounced "signum") function. The unit step function is defined below.

    H(x) ≡ { 0  if x < 0
           { 1  if x ≥ 0        (4)

This definition is known as a piecewise definition and should be read: if x < 0 then H(x) = 0; if x ≥ 0 then H(x) = 1. The sgn function is also defined using a piecewise definition.

    sgn(x) ≡ { -1  if x < 0
             { +1  if x ≥ 0      (5)

The two functions are related by the following formula.

    sgn(x) = 2H(x) - 1    (6)

The two functions are used to give binary-valued units their activation functions and, since they change values discretely, are sometimes classified as hard non-linearities. If the binary units take on the values 0 and 1, then you would use the unit step function. If they take on the values -1 and +1, then you would use the sgn function.
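A minimal Python sketch of the two threshold functions (added as an illustration) makes the relationship in Equation (6) concrete; the convention that an input of exactly zero maps to the "on" value matches the definitions above.

    import numpy as np

    def H(x):
        # Heaviside unit step function, Equation (4)
        return np.where(x < 0, 0, 1)

    def sgn(x):
        # asymmetric sign function, Equation (5)
        return np.where(x < 0, -1, 1)

    x = np.array([-2.0, 0.0, 3.0])
    print(H(x))                                  # [0 1 1]
    print(sgn(x))                                # [-1  1  1]
    print(np.array_equal(sgn(x), 2 * H(x) - 1))  # True, Equation (6)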

10.1.2 Building an auto-associative memory

Preliminaries

Representing a pattern

Normally, we think of a visual pattern, say on a biological or artificial retina, as being two dimensional. The appropriate data structure to represent such a pattern would also be two dimensional, such as a matrix if the shape of the retina were rectilinear. This approach was taken in the chapter on vision computations. If it is not necessary to preserve neighborhood information (such as which points are close to which others) on the retina, then we can represent the pattern as a vector. A Hopfield network does not preserve or encode locality information. Although units in the network could represent points in an input pattern, the connection strength between any two units would be independent of how close the points were on the retina. Consequently, the input pattern can be represented as a vector, sometimes called a pattern vector or feature vector, without losing information that the Hopfield network would need.

Figure 2: Fully connected network consisting of three units. Each neuron has exactly one connection to every other neuron. The network is necessarily recurrent. Connections from units to themselves are symmetric but other connections need not be symmetric.

Fully connected network

A fully connected neural network is one in which each neuron receives input from every other neuron, including itself, as shown in Figure 2. Let's suppose you have a vector of neurons and you are interested in describing the connection weights, strengths, or synaptic efficacies between the neurons. The natural way to do this is to use a matrix, called a connection matrix, such as the one shown below.

    W =
        [  0  -1  +1  +1 ]      [ w_{1,1}  w_{1,2}  ...  w_{1,n} ]
        [ -1   0  +1  -1 ]  =   [ w_{2,1}  w_{2,2}  ...  w_{2,n} ]
        [ +1  +1   0  +1 ]      [   ...      ...    ...    ...   ]
        [ +1  -1  +1   0 ]      [ w_{n,1}  w_{n,2}  ...  w_{n,n} ]

This matrix, W, describes a network consisting of n = 4 neurons. Let's assume the neurons, or units, are labeled u1, u2, u3, and u4. The first row of W gives the incoming connection strengths for unit u1, the second row gives the incoming connection strengths for unit u2, and so forth. The zeros along the diagonal indicate that the connection strength of each unit to itself is zero. This is a way of saying that there is no connection while still using the framework of a fully connected network. In fact, any connection topology is a special case of a fully connected network if you use the convention that the absence of a connection is represented by a connection strength of zero. The matrix is symmetric, meaning that the transpose of the matrix is equal to the matrix itself, that is,

    W = W^T.    (7)

The interpretation is that for any two units, a and b, the connection strength from unit a to unit b is the same as that from unit b to unit a. Hopfield networks require that the connection strengths be symmetric and that the weights along the diagonal are non-negative. This assumption of symmetric connection weights is a step away from biological reality that we will return to later.

Notation

A common convention (cf. [7]) is to represent a stored pattern using the symbol ξ (spelled "xi", pronounced "ksee") and an input pattern with the symbol ζ (zeta). We will also use S to refer to the instantaneous activations of the units, sometimes called the state vector. The set of possible values of the state vector forms the configuration space, and each point in the configuration space falls on a corner of a hypercube and is equidistant from the origin. All are vectors having n components, where n is the number of units. We want these vectors to be compatible with matrix notation, as is the convention in linear systems (see Appendix), so we will write them as column vectors, shown below.

    ξ ≡ (ξ_1, ..., ξ_n)^T    (8)
    ζ ≡ (ζ_1, ..., ζ_n)^T    (9)
    S ≡ (s_1, ..., s_n)^T    (10)

Since there may be many stored patterns in the network, one may use a superscript µ, as in ξ^µ. This vector refers to pattern µ and can be decomposed into components, as shown below.

    ξ^µ ≡ (ξ^µ_1, ..., ξ^µ_n)^T    (11)

Thus ξ^µ_i refers to the i-th component of pattern µ.

Hologram metaphor

Karl Pribram suggested in the 1960s [2] that the brain's memory system is like a hologram because:

1. it was content addressable (or reconstructive); that is, a part of a pattern could be used to access the whole pattern.

2. it was distributed; the information for a stored pattern was not localized in one part of the network. This was viewed as desirable at the time because of Lashley's principle of mass action.

3. it obeyed the principle of superposition; the same neurons could be reused to represent more than one pattern.

Except for the brain, the hologram was the only known physical device which had these properties at the time. It was of great interest to understand how holograms worked as a source of clues to understanding the operation of the brain. Hopfield networks are another concrete example of a mechanism which realizes this metaphor. The underlying mathematics of the basic Hopfield network is more accessible than that needed to understand a hologram, and the structure of a Hopfield network more closely resembles the structure of the brain than does a hologram.

Operation of an auto-associative memory

Storing patterns

Pattern vectors are stored in a Hopfield network by using a method known as correlation recording. Correlation recording sets the connection weights between the relevant units. For each matrix element w_ij, we set the connection weight to the product ξ_i ξ_j of the desired activations of the two units that are linked together by the synapse connecting unit j to unit i (recall that w_ij is the connection from unit j to unit i). This recipe is known as the generalized Hebbian learning rule.

Suppose we have a network consisting of four units and wish to store the two patterns below into the network.

    ξ^1 = (+1, -1, -1, +1)^T    (12)
    ξ^2 = (+1, +1, -1, -1)^T    (13)

To store pattern ξ^1, we compute the outer product ξ^1(ξ^1)^T. The outer product is a concise way of specifying the Hebbian rule. This is shown in formula (14).

    W = ξ^1(ξ^1)^T =
        [ ξ^1_1 ξ^1_1   ξ^1_1 ξ^1_2   ...   ξ^1_1 ξ^1_n ]       [ +1  -1  -1  +1 ]
        [ ξ^1_2 ξ^1_1   ξ^1_2 ξ^1_2   ...   ξ^1_2 ξ^1_n ]   =   [ -1  +1  +1  -1 ]    (14)
        [      ...           ...      ...        ...     ]       [ -1  +1  +1  -1 ]
        [ ξ^1_n ξ^1_1   ξ^1_n ξ^1_2   ...   ξ^1_n ξ^1_n ]       [ +1  -1  -1  +1 ]

Since ξ_i ξ_j = ξ_j ξ_i, the matrix is always symmetric. Since ξ_i ξ_i = 1, the weights on the diagonal are one. Consequently, the weights on the diagonal do not encode information about the specific stored pattern, and no information about the pattern is lost if one sets these weights to zero. Hopfield networks sometimes work better if they do not have self-coupling terms (the connection strengths of units to themselves are zero).

The matrix in (14) stores the single pattern ξ^1. What about storing both patterns ξ^1 and ξ^2 at the same time? To determine the proper connection weights for encoding these patterns, take the outer product of each pattern and then add the resulting matrices together. The general formula for correlation recording of multiple patterns is the following, where p is the number of patterns.

    W = Σ_{µ=1}^{p} ξ^µ (ξ^µ)^T    (15)

When one studies this formula, one sees that each w_ij equals Σ_{µ=1}^{p} ξ^µ_i ξ^µ_j. A worked-out example for pattern vectors ξ^1 and ξ^2 is shown below. Notice that the resulting connection matrices are symmetric and have non-negative weights along the diagonal.

    W = ξ^1(ξ^1)^T + ξ^2(ξ^2)^T
      =
        [ +1  -1  -1  +1 ]     [ +1  +1  -1  -1 ]
        [ -1  +1  +1  -1 ]  +  [ +1  +1  -1  -1 ]
        [ -1  +1  +1  -1 ]     [ -1  -1  +1  +1 ]
        [ +1  -1  -1  +1 ]     [ -1  -1  +1  +1 ]
      =
        [  2   0  -2   0 ]
        [  0   2   0  -2 ]
        [ -2   0   2   0 ]
        [  0  -2   0   2 ]

This example illustrates the distributed and superposition characteristics of the hologram metaphor. The connection matrix produced by the outer product operation causes each unit to participate in the representation of each bit in a pattern. Adding the two connection matrices together corresponds to superimposing the memories onto the same substrate. It is natural to wonder whether the two stored patterns will interfere with one another if they are superimposed. The answer is that they will interfere with each other if the patterns are highly correlated or if the capacity of the memory is too small. The four-unit memory in this example is very small and does not have the capacity to represent two different patterns, so they will interfere with each other.

Retrieving a stored pattern

Let's suppose that we have stored the single pattern ξ^1 from formula (12) in the memory, and therefore the connection matrix W is the one shown in equation (14). If we set the input pattern ζ equal to the stored pattern ξ^1, then, if the memory is working correctly, the input pattern should retrieve itself. To temporarily oversimplify, the procedure to retrieve a pattern from memory is to multiply the connection matrix with the input pattern and run the result through the asymmetric sign function.

    Ψ(W ζ) = Ψ(W ξ^1) = Ψ((4, -4, -4, 4)^T) = (+1, -1, -1, +1)^T = ξ^1    (16)

The value of the function Ψ(·) is 1 if its argument is greater than or equal to 0, and the value is -1 otherwise.

(Footnote 1: We assume the notational convention that when Ψ is given an array as argument, it computes the value for each component of the array and returns the array of results.)

(Footnote 2: We call the function asymmetric because it arbitrarily maps 0 into +1. When the units took on the values 0 and 1, Ψ(·) was used to denote the unit step function. Its purpose is essentially the same in both cases.)
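The storage and retrieval steps just described can be reproduced in a few lines of numpy (an added sketch; the "noisy" probe at the end is a made-up example, not from the text):

    import numpy as np

    def psi(x):
        # asymmetric sign function: +1 for x >= 0, -1 otherwise
        return np.where(x >= 0, 1, -1)

    xi1 = np.array([1, -1, -1, 1])     # xi^1, Formula (12)
    xi2 = np.array([1, 1, -1, -1])     # xi^2, Formula (13)

    W1 = np.outer(xi1, xi1)            # single-pattern matrix, Formula (14)
    W12 = W1 + np.outer(xi2, xi2)      # correlation recording of both patterns, Formula (15)
    print(W12)                         # matches the worked-out sum above

    print(psi(W1 @ xi1))               # [ 1 -1 -1  1] == xi^1, reproducing (16)

    zeta = np.array([1, 1, -1, 1])     # made-up probe: xi^1 with one bit flipped
    print(psi(W1 @ zeta))              # still recovers [ 1 -1 -1  1]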

Figure 3: A Hopfield memory module consists of a storage controller, a retrieval controller, and the memory proper, which holds the activation vector S and the connection matrix W. The storage controller stores a new pattern ξ^µ by computing ξ^µ(ξ^µ)^T and adding it to W. The retrieval controller initializes the state vector of the units to some input pattern ζ and then sends a message to the memory to start the retrieval dynamics. After the update process converges, the retrieval controller reads out the memory state vector.

There is a principled reason that the example in (16) gave the correct answer. Let's redo this example, substituting ξ^1(ξ^1)^T for W. This gives us the following.

    Ψ(W ζ) = Ψ(W ξ^1) = Ψ((ξ^1(ξ^1)^T) ξ^1)    (17)
                      = Ψ(ξ^1((ξ^1)^T ξ^1))    (18)
                      = Ψ(ξ^1 n)               (19)
                      = ξ^1

We can go from step (17) to (18) because matrix multiplication is associative. Matrix multiplication is not commutative, and ξ^1(ξ^1)^T is not equal to (ξ^1)^T ξ^1. In fact, since the stored patterns always consist of plus and minus ones, (ξ^1)^T ξ^1 simplifies to n. This gives us step (19).

A Hopfield memory module

If you were to implement a Hopfield memory module in software, it would consist of three parts. First, there would be the memory itself, consisting of a vector of units and an associated connection matrix. The connection matrix would be initialized to all zeros because the memory is not yet storing any patterns. The second and third parts of the module would be two controllers: a storage controller and a retrieval controller, as seen in Figure 3.

Whenever one would want to add a new pattern to the memory, the storage controller would accept a pattern vector ξ whose length is equal to the number of units in the memory. It would compute the outer product to generate a connection matrix, and then it would add this matrix to the connection matrix implementing the memory.

Whenever you would want to retrieve a pattern from the memory, you would submit an input pattern, ζ, to the retrieval controller. The retrieval controller would initialize the unit activations to match the bits in the input pattern. At this point, the unit activations in the memory would not necessarily be consistent with the values in the connection matrix. The retrieval process would update the units in the network to be as consistent as possible with the connection matrix. After the update process completes, the unit activations should be in a retrieval state corresponding to the stored pattern that has the least Hamming distance from the input pattern.
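A minimal Python sketch of such a module is given below. It is only an illustration of the organization just described; the class name, the fixed number of update sweeps, and the use of random asynchronous updates (introduced in the next subsection) are choices made here, not specifications from the text.

    import numpy as np

    class HopfieldMemory:
        # memory proper: a vector of units (state S) plus a connection matrix W
        def __init__(self, n):
            self.n = n
            self.W = np.zeros((n, n))            # no patterns stored yet

        # storage controller: add the outer product of the new pattern to W
        def store(self, xi):
            self.W += np.outer(xi, xi)

        # retrieval controller: initialize S to the probe zeta, run the update
        # dynamics, then read out the final state vector
        def retrieve(self, zeta, sweeps=10, seed=0):
            S = np.array(zeta, dtype=int)
            rng = np.random.default_rng(seed)
            for _ in range(sweeps * self.n):
                i = rng.integers(self.n)         # pick one unit at random
                S[i] = 1 if self.W[i] @ S >= 0 else -1
            return S

    mem = HopfieldMemory(4)
    mem.store(np.array([1, -1, -1, 1]))              # xi^1 from Formula (12)
    print(mem.retrieve(np.array([1, 1, -1, 1])))     # probe one bit away -> [ 1 -1 -1  1]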

Noiseless retrieval dynamics

Expression (16) illustrated an oversimplified retrieval process. We now describe the retrieval dynamics more thoroughly. There are two classes of update procedures associated with Hopfield networks, depending on whether you update all of the units simultaneously or one by one. The former is called synchronous update and the latter is called asynchronous update. These alternative procedures lead to different retrieval dynamics. Both types of update are based on equation (1). Since the effect of an update event is deterministic, without a random component, these dynamics are called noiseless.

Synchronous update

A single synchronous update step was illustrated in formula (16). In synchronous update, you assume that there is a global clock and each unit in the network updates itself simultaneously at each clock tick or time step. Normally, in a retrieval episode, the synchronous update step is repeated some number of times because the network rarely reaches a correct retrieval state after just one update step. If the units were linear, matrix multiplication alone would accurately model a single update step: you would simply multiply the connection matrix with the state vector describing the instantaneous activations of the units. Since the units are threshold units, we have to include the thresholds in the update procedure, as we did in expression (16). For the first update step, the unit activations have been initialized to the input vector ζ by the retrieval controller. Thus, the first update step is described as follows.

    S(1) = Ψ(W ζ)    (20)

When there is a global clock, we use the notation S(t) to refer to the unit activations at time step t. Of course, the unit activations at time t+1 can then be described thus.

    S(t+1) = Ψ(W S(t))    (21)
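Iterating Equation (21) is a one-liner in numpy. The sketch below (an added example) starts the network of Formula (14) one bit away from the stored pattern and shows that a single synchronous step already lands in the retrieval state.

    import numpy as np

    xi1 = np.array([1, -1, -1, 1])
    W = np.outer(xi1, xi1)                   # connection matrix from Formula (14)

    S = np.array([-1, -1, -1, 1])            # a probe one bit away from xi^1
    for t in range(3):
        S = np.where(W @ S >= 0, 1, -1)      # Equation (21): S(t+1) = Psi(W S(t))
        print(t + 1, S)
    # prints [ 1 -1 -1  1] == xi^1 on every step: the retrieval state is stable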

Asynchronous update

In asynchronous update, only one unit is updated at a time. The term asynchronous comes from the metaphor of an idealized physical system in which units update themselves randomly and the update event is instantaneous. Then, if the temporal grain size is taken to be small enough, no two units will update themselves simultaneously. In software, you would need a memory controller that randomly chooses one neuron to update at each time step. This procedure is called random asynchronous update. There is another procedure, called sequential asynchronous update, which updates one unit per time step but always cycles through the units in the same order. For a network with n units, there are n! possible sequential update regimens.

Attractors

Mapping out attractors helps us determine which of the two retrieval processes, synchronous or asynchronous update, gives better performance. Recall that a four-unit Hopfield network has 16 states, corresponding to the corners of a four-dimensional hypercube. If the network is effectively acting as a memory, the states corresponding to stored patterns should form basins of attraction, sometimes called fixed-point attractors. Let us map out the basins of attraction for the memory formed from the connection matrix given in Formula (14). The attractor map depends on the update procedure as well as the connection matrix. We'll start by assuming synchronous update.

We need to name the possible network states so that we can indicate which state we are talking about. Since the states derive from the binary activations of the units, we can interpret the unit activations as a binary code and translate that into a decimal number (and add the prefix s to signify state). Thus, the states have names s0 through s15. The state which is most important to us is s9, because it corresponds to the stored pattern ξ^1 given in formula (12). The state names are also displayed on the corners of the hypercube in Figure 8. The state transition table and the associated attractor map are shown in Table 1 and Figure 4.

    s(t):      0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
    s(t+1):   15   9   6  15   6  15   6   6   9   9  15   9  15   9   6  15
    distance:  4   1   1   2   1   2   0   1   1   0   2   1   2   1   1   0

Table 1: The top row shows the network state at time t. The middle row shows the network state at time t+1, after being updated. The bottom row shows the Hamming distance between the corresponding states. This data is used to construct the transition graph in Figure 4.

The state transition table (Table 1) has three rows. The first row shows each of the 16 states at time t. The second row shows the new state the network reaches at time t+1 if it is updated synchronously. The third row shows the Hamming distance between the corresponding states at times t and t+1. We can use this table to construct the very informative state transition graph shown in Figure 4.

The state transition graph tells us how well the memory works. This particular graph forms three groups. The nodes in the graph depict network states, and the arcs show the state transition when the network is updated in a given state. The labels on the arcs show the Hamming distance between the two adjoining states. State s9 is in the group on the left. If the network is in s9 when it is updated, it stays in s9, so the memory is stable. If the network is in any of s1, s8, s11, or s13, it goes into s9 when it is updated. In this sense s9 is a basin of attraction, because the network will stabilize in state s9 if it is in any of the surrounding states.
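The entries of Table 1 can be regenerated with the following sketch (added here; it assumes the naming convention above, with unit u1 supplying the most significant bit and +1 encoded as bit 1).

    import numpy as np

    xi1 = np.array([1, -1, -1, 1])
    W = np.outer(xi1, xi1)                       # connection matrix from Formula (14)

    def to_state(k, n=4):
        # decode the state name sk into a +/-1 activation vector (u1 = most significant bit)
        return np.array([1 if (k >> (n - 1 - i)) & 1 else -1 for i in range(n)])

    def to_name(S):
        # encode a +/-1 activation vector back into its decimal state name
        return int("".join("1" if s == 1 else "0" for s in S), 2)

    for k in range(16):
        S = to_state(k)
        S_next = np.where(W @ S >= 0, 1, -1)     # one synchronous step, Equation (21)
        d = int(np.sum(S != S_next))             # Hamming distance traveled
        print(f"s{k:<2} -> s{to_name(S_next):<2}  distance {d}")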

Figure 4: Transition graph showing state changes for the network using the weight matrix in Formula (14) and the synchronous update regimen given in Equation (21). Numbers on the edges (arrows) show the Hamming distance traveled by the update step.

This is encouraging because it shows that s9 is acting like a retrieval state. However, there seem to be two anomalies. Although we stored only one pattern in the memory, there are two other basins of attraction, surrounding s6 and s15.

State s6 is the complement state of s9, generated by flipping the bits. This is characteristic of Hopfield memories: when you store a pattern, you also store the pattern generated by complementing the bits. Notice that the complement of a pattern vector ξ = (ξ_1, ..., ξ_n)^T is represented as

    -ξ = (-ξ_1, ..., -ξ_n)^T.    (22)

From this, it follows that Equation (23) is true.

    (-ξ)(-ξ)^T = ξ(ξ)^T    (23)

Therefore, storing a pattern or its complement will generate the same connection matrix, and the memory loses information about which of the two was stored.

The other basin of attraction, centered around s15, is spurious and should be minimized in memory design. Examining the Hamming distances between the states offers a clue as to why there is such a large spurious attractor basin at s15. First, notice that the attractor basin for our stored pattern s9 doesn't have any states whose Hamming distance from s9 is larger than one. In fact, this basin contains all of the states whose distance from s9 is one (in a four-bit pattern, there are four ways to have a one-bit mismatch). The same situation holds for pattern s6. So it seems that any pattern that is not within a Hamming distance of one from the stored patterns is captured by the spurious attractor s15. To make the memory work better, we need to make the attractor basins for stored memories larger, and we need to make the spurious attractor basins smaller.

Figure 5 displays graphs that show how the attractor basins change when sequential, asynchronous update is used. With asynchronous update, the attractors have better characteristics. The graphs were constructed from the information in Table 2. If only one unit at a time is updated, then a state change in the network corresponds to traversing exactly one edge of the hypercube. Thus, the new state is guaranteed to have a Hamming distance of at most one from the previous state. More important, if the connection matrix is symmetric with non-negative weights along the diagonal, then each network state can be associated with an energy value.

Figure 5: Transition graph showing state changes for sequential, asynchronous update. An update step consists of updating the four units, one at a time, in the order u1, u2, u3, u4. Numbers on edges indicate the Hamming distance traveled by the update step.

    s:    0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
    E:    0  -2  -2   0  -2   0  -8  -2  -2  -8   0  -2   0  -2  -2   0
    u1:   8   9   2  11   4  13   6   7   8   9  10  11  12  13   6  15
    ΔE:  -2  -6   0  -2   0  -2   0   0   0   0   0   0   0   0  -6   0
    u2:   4   1   6   7   4   5   6   7   8   9  14  11  12   9  14  15
    ΔE:  -2   0  -6  -2   0   0   0   0   0   0  -2   0   0  -6   0   0
    u3:   2   1   2   3   6   7   6   7   8   9  10   9  14  13  14  15
    ΔE:  -2   0   0   0  -6  -2   0   0   0   0   0  -6  -2   0   0   0
    u4:   1   1   2   3   4   5   6   6   9   9  11  11  13  13  14  15
    ΔE:  -2   0   0   0   0   0   0  -6  -6   0  -2   0  -2   0   0   0

Table 2: Table showing the state changes when exactly one unit is updated. This information is used to construct Figure 6.

A single asynchronous update event is guaranteed never to lead to a state of higher energy. The basins of attraction in the network are energy minima. Table 2 describes the state changes when only one unit is updated at a time. The results are summarized in Figure 6. In the table, the row labeled s indicates which of the 16 states is under consideration. The row labeled E shows the energy of that state. The concept of energy will be explained in the next section. The row labeled u1 shows the new state that is reached if unit u1 is updated, and the row labeled ΔE shows the change in energy that results from this state transition. The rows labeled u2, u3, and u4 provide analogous information about the results of updating units u2, u3, and u4.

Energy function

The concept of energy will help us interpret Figure 6.

Figure 6: Graph showing state changes when one unit is updated at a time. Numbers on the edges of the graph indicate which of the four units was updated to cause the state transition. If a node has fewer than four arcs leaving it, then there is no state change associated with some of the unit updates. Nodes in the top row have an energy of 0. Nodes in the middle row have an energy of -2. The retrieval states in the bottom row have an energy of -8.

Recall that the network state is given by the set of instantaneous unit activations and that this is encoded by the state vector S. The energy associated with S, given by formula (24), is a measure of the global consistency between the instantaneous unit activations and the connection weights. The lower the energy, the more consistent the network state is with the connection weights. In a noiseless network, the energy does not depend on the update procedure.

    E(S) ≡ -(1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} w_ij s_i s_j    (24)
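The energies quoted in Table 2, and the claim that a single-unit update never raises the energy, can be checked with the sketch below (an added illustration using the four-unit example and the state-naming convention introduced earlier).

    import numpy as np

    xi1 = np.array([1, -1, -1, 1])
    W = np.outer(xi1, xi1)                   # connection matrix from Formula (14)

    def to_state(k, n=4):
        return np.array([1 if (k >> (n - 1 - i)) & 1 else -1 for i in range(n)])

    def energy(S):
        # Equation (24): E(S) = -1/2 * sum_ij w_ij s_i s_j
        return -0.5 * S @ W @ S

    for k in range(16):
        S = to_state(k)
        E = energy(S)
        deltas = []                          # energy change from updating each unit in isolation
        for i in range(4):
            S_new = S.copy()
            S_new[i] = 1 if W[i] @ S >= 0 else -1
            deltas.append(float(energy(S_new) - E))
        print(f"s{k:<2}  E = {E:+.0f}   dE per unit update: {deltas}")
        assert all(d <= 0 for d in deltas)   # a single-unit update never increases the energy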

Figure 7 helps to interpret this formula. The figure shows a fully connected, four-unit Hopfield network in four different states (self-coupling terms are not shown). Since the connection strengths are symmetric, each arc depicts the two weights w_ij and w_ji.

Let us examine what happens when the network stores one pattern ξ and the instantaneous state of the unit activations S is equal to ξ. This corresponds to the network labeled I in the figure. Recall that, when storing a single pattern, the connection weight needed on each w_ij is ξ_i ξ_j. When S = ξ, then s_i s_j = ξ_i ξ_j = w_ij, and the quantity w_ij s_i s_j within the scope of the double summation is equal to one. Thus the energy of the retrieval state of a single stored pattern is -(1/2) n^2. In a network of four units, the energy will be -8. Since the complement of ξ yields the same connection matrix, the energy associated with that state is also -8.

For the network labeled II in Figure 7, the state vector has a Hamming distance of one from the stored pattern ξ. The activation of unit u3 has been flipped from the value +ξ_3 to -ξ_3. The network is no longer globally consistent, and the dashed lines show which part of the network is inconsistent. Consider all the synaptic connections that make contact with u3. Since each of these weight values was consistent with the unit activations in panel I, they must now be inconsistent in panel II. If u3 is updated, it will flip back to the value +ξ_3 associated with the stored pattern. Since ten weights (including self-coupling terms) are consistent with the unit activations and six weights are inconsistent with the unit activations, the value of the double summation becomes 10 - 6 = 4 and the resulting energy is -2.

For the network labeled III, unit u2 has been flipped and S now has a Hamming distance of two from ξ. Notice that the arc depicting weights w_{2,3} and w_{3,2} is no longer dashed, indicating that it is now consistent with its adjacent units. In this case eight weights are consistent and eight weights are inconsistent, so the summation becomes 8 - 8 = 0, yielding an energy of zero. Part IV of the figure shows that if we flip the activation of one more unit, u1, the global consistency of the network increases. This is because the instantaneous state of the network now has a Hamming distance of one from the complemented stored pattern -ξ.

If the energies for all 16 states of our network are computed, one finds that the energy of a state is always 0, -2, or -8. Only the retrieval states s6 and s9 have energies of -8. The states with energies of -2 have a Hamming distance of one from a retrieval state, and those with energies of zero have a Hamming distance of two. Our discussion assumed that the connection matrix W was symmetric, but it never made reference to specific values of the weights in the connection matrix. Therefore we know that the network associated with the matrix in expression (14) has these energy values. Asynchronous update does not increase energy (the proof is in the appendix), and since, for this example, energy is positively correlated with Hamming distance, update events never increase the Hamming distance of the instantaneous state from a stored pattern. Figure 8 shows the single-unit-update transitions visualized on the corners of a hypercube.

Performance of attractor memories

A network having n units can store about 0.138n random patterns without exceeding the memory capacity. To put this in concrete terms, if you have a memory consisting of 100 units, you can store 13 patterns, where each pattern consists of 100 random bits. This yields a storage capacity of 1300 random bits. The weight matrix will have 10,000 entries. The memory capacity is proportional to the square root of the number of weights, which is proportional to the number of units.

Hopfield networks also have spurious attractor states. One type of spurious attractor state is called a mixture state. A mixture state can be obtained by superimposing an odd number of stored patterns (such as three): adding the corresponding bits together and taking the sign of each sum produces a new vector of the same length. Thus, it is a mixture pattern. If the network state is initialized to the mixture pattern, the network will remain in that state during the asynchronous update procedure. That is, a mixture state is stable. This is undesirable because a mixture state does not correspond to a stored pattern.

(Footnote: We haven't introduced the concept of temperature. However, it is known that mixture states are unstable if the network is run at a high enough temperature.)

Figure 7: How the Hamming distance from a stored pattern relates to the energy of a state. Arcs show the symmetric connections in a four-unit network (units u1 through u4); self-coupling terms are not shown. The stored pattern is ξ. (I) S = (ξ_1, ξ_2, ξ_3, ξ_4): since S = ξ, there is no inconsistency and the energy is at a minimum. (II) S = (ξ_1, ξ_2, -ξ_3, ξ_4): unit u3 is flipped, giving S a Hamming distance of one from ξ; three connection weights are inconsistent with S. (III) S = (ξ_1, -ξ_2, -ξ_3, ξ_4): unit u2 is flipped and now four weights are inconsistent, but the w_{2,3}/w_{3,2} connection, which was inconsistent, is now consistent, despite the fact that the Hamming distance of S from ξ has increased from one to two; the energy is at a maximum. (IV) S = (-ξ_1, -ξ_2, -ξ_3, ξ_4): flipping one more unit, u1, starts to make the network consistent with the complement retrieval state.

Figure 8: Single-unit-update transitions visualized on a hypercube. State s9 represents the stored pattern ξ^1 and s6 represents its complement. Arrows indicate the transition from one state to another as a result of updating one unit.

Limitations of the method

We have mapped out the attractor basins in a brute-force manner. This is impractical for networks large enough to store more than one pattern. In order to obtain a specific example of a Hopfield memory having spurious basins (called mixture states) for further analysis, one needs a memory large enough to store three patterns. In order for a memory to be reliable, the number of stored patterns cannot exceed 0.138 times the number of units. Therefore, you would need a memory consisting of 22 units to reliably store three patterns. The hypercube describing the instantaneous states of the network would have 2^22 corners, which is approximately four million.

10.1.3 Hetero-associative memory and central pattern generators

We have so far only discussed auto-associative memory. We now need to address the question of hetero-associative memory. Hetero-associative memory models the phenomenon of free and structured association: how does one idea or thought lead to the next? We can model this by having an attractor state become unstable after a certain period of time. But to do this right, we need some kind of (perhaps implicit) clocking mechanism in our network.

Let us explore this intuition in more detail. Suppose that we have stored p patterns in an auto-associative memory.

First, we would like the memory to go into a retrieval state for, say, pattern ξ^1 and stay in that state long enough to have a meaningful thought, so to speak. Next, we would like that retrieval state to become unstable and transition to the next retrieval state, say, the one for pattern ξ^2. These operating characteristics raise two problems. First, how do we make a retrieval state become unstable, given that we have designed it to be an attractor state? Second, once it is unstable, how do we control the direction of the transition?

Building a hetero-associative memory

Let us approach this problem by first trying to build a basic hetero-associative memory. Let us return to patterns ξ^1 and ξ^2 given in Formulas (12) and (13). We want to build a device that allows us to think of pattern ξ^2 when given the input pattern ξ^1. For simplicity, we shall assume a synchronous update procedure. For auto-association we had used the storage prescription in Formula (14), repeated below. The superscript aut stands for auto-association.

    W^aut = ξ^1(ξ^1)^T    (25)

For hetero-association, we shall use the storage prescription below. The superscript het stands for hetero-association.

    W^het = ξ^2(ξ^1)^T    (26)

To retrieve a hetero-associatively stored pattern we do the following.

    Ψ(W^het ζ) = Ψ(W^het ξ^1) = Ψ(ξ^2(ξ^1)^T ξ^1) = Ψ(ξ^2 n) = ξ^2    (27)

This logic is the same as that used in the derivation given in Formulas (17) through (19). We can store multiple hetero-associations by analogy with the correlation recording recipe given in Formula (15). Simply superimpose the hetero-associations as described below.

    W^het = Σ_{µ=1}^{p} ξ^{µ+1} (ξ^µ)^T    (28)

This builds a memory where pattern ξ^µ retrieves pattern ξ^{µ+1} (with the indices understood cyclically, so that ξ^{p+1} denotes ξ^1). When multiple hetero-associations are stored in the memory, the retrieval operation can be described as shown below.

    W^het ξ^1 = [ξ^2(ξ^1)^T + (W^het - ξ^2(ξ^1)^T)] ξ^1          (29)
              = ξ^2(ξ^1)^T ξ^1 + [(W^het - ξ^2(ξ^1)^T) ξ^1]      (30)
              = ξ^2 n + [(W^het - ξ^2(ξ^1)^T) ξ^1]               (31)

The expression (W^het - ξ^2(ξ^1)^T) ξ^1 is the crosstalk term.
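The hetero-associative prescription of Formulas (26)-(28) is easy to try out in numpy. The sketch below (an added illustration) uses ξ^1 and ξ^2 from Formulas (12) and (13) plus a third, made-up pattern ξ^3 so that a cyclic chain of three associations can be stored; with these particular patterns the crosstalk term happens to vanish.

    import numpy as np

    def psi(x):
        return np.where(x >= 0, 1, -1)

    xi1 = np.array([1, -1, -1, 1])      # Formula (12)
    xi2 = np.array([1, 1, -1, -1])      # Formula (13)
    xi3 = np.array([-1, 1, -1, 1])      # a hypothetical third pattern (not from the text)

    # single association, Formula (26): presenting xi^1 should evoke xi^2
    W_het = np.outer(xi2, xi1)
    print(psi(W_het @ xi1))             # [ 1  1 -1 -1] == xi^2, as in (27)

    # cyclic chain of three associations in the spirit of Formula (28)
    patterns = [xi1, xi2, xi3]
    W_chain = sum(np.outer(patterns[(m + 1) % 3], patterns[m]) for m in range(3))

    S = xi1
    for _ in range(3):
        S = psi(W_chain @ S)            # one synchronous retrieval step
        print(S)                        # xi^2, then xi^3, then back to xi^1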

Limit cycles

The state-transition graphs we have seen so far did not contain any limit cycles. Instead of looking at the matrix in Formula (14), let's look at the example connection matrix shown with Formula (7). If we use this matrix and apply a synchronous update step when the network is in state s1, the network goes to state s8. If we apply a synchronous update when in state s8, the network goes back into state s1. So the network oscillates between states s1 and s8. When drawn as a graph, this is called a limit cycle attractor.

In this example, the oscillator depends on synchronous update, which assumes a global clock. Since an asynchronous network is biologically more realistic, it would be nice if we could wire up an asynchronous network to act as an oscillator. This is important because the brain, consisting of asynchronous neurons, does exhibit oscillations, as measured by EEG, and does exhibit synchronized firing. How do these properties emerge from an asynchronous system?

Pattern generators

A central pattern generator is a system of neurons, perhaps consisting of a few clusters, that acts as an oscillator, perhaps generating repetitive motion. It can also cycle through a sequence of states, such as what happens when a musical tune plays over and over in your head. One way to build a central pattern generator is to assume that asynchronous neurons have a primitive memory of how long they have been in a particular state (e.g., a firing state). Biologically this is plausible because neurons do show fatigue after sustained firing. Let's call this property the clocking primitive. With this primitive, it is possible to build an asynchronous network that, when in a stable state, jumps to a new stable state because the stable state becomes unstable after the neurons begin to show fatigue.

Let us suppose that we have three patterns of interest, ξ^1, ξ^2, and ξ^3, and that we want to build the device described earlier. This system thinks of pattern ξ^1 long enough to have a meaningful thought and then gets reminded of ξ^2, then ξ^3, and finally cycles back to ξ^1. Let's suppose also that we have already built an auto-associative memory for these three patterns and that it is stored in the matrix W^aut. Let's suppose that we have also built a hetero-associative memory along the lines described in Formula (28), and that it is stored in the matrix W^het. Since superposition has worked in the past, let's try using the matrix W below for the central pattern generator and see if we can make it work.

    W = W^aut + W^het    (32)
      = ξ^1(ξ^1)^T + ξ^2(ξ^2)^T + ξ^3(ξ^3)^T + ξ^2(ξ^1)^T + ξ^3(ξ^2)^T + ξ^1(ξ^3)^T

The main obstacle to success is that the memory now has conflicting information. For instance, when pattern ξ^1 is input to the memory, should it retrieve itself or ξ^2? In order to solve this problem, we shall need to keep the matrices separate. We shall assume that the two matrix types have different hardware realizations, i.e., different types of synapses. Further, the hetero-associative weights operate on a different time scale than the auto-associative weights. In particular, the hetero-associative weights will be slower, as if they are realized by slow synapses, as compared to the auto-associative synapses. This means we shall have two sets of synapses, one to store the auto-associative weights and the other to store the hetero-associative weights.

Figure 9: Operation of the central pattern generator.

Figure 9 illustrates how the separate sets of weights mediate the transitions from thinking of ξ^1, to ξ^2, to ξ^3, cyclically in sequence. The nodes in the graph depict the attractor states for the three patterns. The arcs, labeled with either W^aut or W^het, show causal influences within the network dynamics. The W^aut weights try to keep the current pattern stable. The W^het weights try to get the network to make a transition to the next state.

Let's assume that it takes a W^aut synapse one time step to propagate its information from the pre-synaptic to the post-synaptic side, but that it takes a W^het synapse five time steps to do the same thing. Then, if the network goes into pattern ξ^1, it will be controlled by the stabilizing W^aut weights before the W^het weights start to exert their effect. If the W^het weights overwhelm the W^aut weights, then the system will transition to the state representing pattern ξ^2.

Equation (33) formally specifies how this idea works. Recall that we have used h_i(t) to refer to the net input to unit i at time t. Formula (33) says that we are going to separate the total net input into the part caused by the auto-associative weights and the part caused by the hetero-associative weights. Further, we are going to combine the two net inputs by superposition, and we are going to amplify the effect of the hetero-associative weights by the parameter λ, which is a positive number greater than one, so that the effect of the hetero-associative weights overwhelms that of the auto-associative weights. The parameter τ is a time delay, which in our previous example was five. This prevents the hetero-associative weights from kicking in for five time steps.

    h^total_i(t) ≡ h^aut_i(t) + λ h^het_i(t - τ)    (33)

Both h^aut_i(t) and h^het_i(t) are defined as you would expect, as shown below.

    h^aut_i(t) = Σ_{j=1}^{n} w^aut_ij s_j(t)    (34)

    h^het_i(t) = Σ_{j=1}^{n} w^het_ij s_j(t)    (35)

In summary, although the weight matrices for the auto- and hetero-associative mechanisms are kept separate, we do superimpose their effects when it is time to update a unit.
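To close, here is a small simulation sketch of Equation (33) (an added illustration, not the author's code). It uses the two patterns of Formulas (12) and (13) plus the same made-up third pattern as before, synchronous updates for simplicity, a gain of λ = 2, and a delay of τ = 5 time steps; the hetero-associative input is simply absent until a state that is τ steps old is available. The network then dwells in each attractor for several steps before the delayed, amplified hetero-associative input pushes it to the next pattern in the cycle.

    import numpy as np

    def psi(x):
        return np.where(x >= 0, 1, -1)

    # xi^1 and xi^2 from Formulas (12) and (13); xi^3 is a hypothetical third pattern
    xi = [np.array([1, -1, -1, 1]),
          np.array([1, 1, -1, -1]),
          np.array([-1, 1, -1, 1])]

    W_aut = sum(np.outer(p, p) for p in xi)                          # auto-associative weights
    W_het = sum(np.outer(xi[(m + 1) % 3], xi[m]) for m in range(3))  # cyclic hetero-associative weights

    lam, tau = 2.0, 5              # gain and delay, as in Equation (33)
    history = [xi[0].copy()]       # the network starts in the retrieval state for xi^1

    for t in range(18):
        S_now = history[-1]
        h = (W_aut @ S_now).astype(float)            # fast auto-associative net input, Equation (34)
        if t >= tau:                                 # slow synapses: their signal arrives tau steps late
            h += lam * (W_het @ history[t - tau])    # delayed, amplified term of Equation (33)
        history.append(psi(h))

    for t, S in enumerate(history):
        labels = [f"xi^{m + 1}" for m, p in enumerate(xi) if np.array_equal(S, p)]
        print(t, S, labels[0] if labels else "")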