
Chapter 6 Associative Models

Association is the task of mapping input patterns to target patterns ("attractors"). For instance, an associative memory may have to complete (or correct) an incomplete (or corrupted) pattern. Unlike computer memories, no "address" is known for each pattern. Learning consists of encoding the desired patterns as a weight matrix (network); retrieval (or "recall") refers to the generation of an output pattern when an input pattern is presented to the network.

1. Hetero-association: mapping input vectors to output vectors that range over a different vector space, e.g., translating English words to Spanish words.

2. Auto-association: input vectors and output vectors range over the same vector space, e.g., character recognition, eliminating noise from pixel arrays, as in Figure 6.1.

Figure 6.1: A noisy input pattern that resembles "3" and "8," but not other stored digits. This example illustrates the presence of noise in the input, and also that more than one (but not all) target patterns may be reasonable outputs for a given input pattern.

The difference between input and target patterns allows us to evaluate network outputs. Many associative learning models are based on variations of Hebb's observation: "When one cell repeatedly assists in firing another, the axon of the first cell develops synaptic knobs (or enlarges them if they already exist) in contact with the soma of the second cell." We first discuss non-iterative, "one-shot" procedures for association, then proceed to iterative models with better error-correction capabilities, in which node states may be updated several times, until they "stabilize."

6.1 Non-iterative Procedures

In non-iterative association, the output pattern is generated from the input pattern in a single iteration. Hebb's law may be used to develop associative "matrix memories," or gradient descent can be applied to minimize recall error. Consider hetero-association, using a two-layer network

developed using the training set T = {(i_p, d_p) : p = 1, ..., P}, where i_p ∈ {−1, 1}^n and d_p ∈ {−1, 1}^m. Presenting i_p at the first layer leads to the instant generation of d_p at the second layer. Each node of the network corresponds to one component of the input (i) or desired output (d) patterns.

* ::::::::::::::::::::::::::::::::::::::::::::::::::*

Matrix Associative Memories: A weight matrix W is first used to premultiply the input vector:

y = W x^(1).

If output node values must range over {−1, 1}, then the signum function is applied to each coordinate: x^(2)_j = sgn(y_j), for j = 1, ..., m. The Hebbian weight update rule is Δw_{j,k} ∝ i_{p,k} d_{p,j}. If all input and output patterns are available at once,

we can perform a direct calculation of the weights:

w_{j,k} = c Σ_{p=1}^{P} i_{p,k} d_{p,j},  for c > 0.

Each w_{j,k} measures the correlation between the kth component of the input vectors and the jth component of the associated output vectors. The multiplier c can be omitted if we are only interested in the sign of each component of y, so that

W = Σ_{p=1}^{P} d_p [i_p]^T = D I^T,

where the columns of matrix I are the input patterns i_p and the columns of D are the desired output patterns d_p. Non-iterative procedures have low error-correction capabilities: multiplying W with even a slightly corrupted input vector often results in "spurious" output that differs from the patterns intended to be stored.

Example 6.1: Associate input vector (1, 1) with output vector (−1, 1), and (1, −1) with (−1, −1):

W = D I^T = [ −1 −1 ; 1 −1 ] [ 1 1 ; 1 −1 ] = [ −2 0 ; 0 2 ]

(rows separated by semicolons). When the original input pattern (1, 1) is presented,

y = W x^(1) = [ −2 0 ; 0 2 ] (1, 1)^T = (−2, 2)^T,

and sgn(y) = (−1, 1) = x^(2), the correct output pattern associated with (1, 1). If the stimulus is a new input pattern (−1, −1), for which no stored association exists, the resulting output pattern is

y = W x^(1) = [ −2 0 ; 0 2 ] (−1, −1)^T = (2, −2)^T,

and sgn(y) = (1, −1), different from the stored patterns.

* ::::::::::::::::::::::::::::::::::::::::::::::::::*

Least squares procedure (Widrow-Hoff rule): When i_p is presented to the network, the resulting output M i_p must be as close as possible to the desired output pattern d_p. Hence the matrix M must be chosen to minimize

E = Σ_{p=1}^{P} ||d_p − M i_p||² = Σ_{p=1}^{P} [ (d_{p,1} − Σ_j M_{1,j} i_{p,j})² + ... + (d_{p,m} − Σ_j M_{m,j} i_{p,j})² ].

Since E is a quadratic function whose second derivative is positive, we obtain the weights that minimize E by solving, for each j and ℓ,

∂E/∂M_{j,ℓ} = −2 Σ_{p=1}^{P} ( d_{p,j} − Σ_{k=1}^{n} M_{j,k} i_{p,k} ) i_{p,ℓ} = 0,

obtaining

Σ_{p=1}^{P} d_{p,j} i_{p,ℓ} = Σ_{p=1}^{P} ( Σ_{k=1}^{n} M_{j,k} i_{p,k} ) i_{p,ℓ} = (jth row of M) · (ℓth column of I I^T).

Combining all such equations, D I^T = M (I I^T). When (I I^T) is invertible, the mean square error is minimized by

M = D I^T (I I^T)^{−1}.   (6.1)

The least squares method "normalizes" the Hebbian weight matrix D I^T using the inverse (I I^T)^{−1}. For auto-association, M = I I^T (I I^T)^{−1} is the identity matrix, which maps each vector to itself.

* ::::::::::::::::::::::::::::::::::::::::::::::::::*

What if I I^T has no inverse? Every matrix A has a unique "pseudo-inverse," defined to be the matrix A* that satisfies the following conditions:

A A* A = A,  A* A A* = A*,  A* A = (A* A)^T,  A A* = (A A*)^T.

The Optimal Linear Associative Memory (OLAM) generalizes Equation 6.1:

M = D [I*],

where I* is the pseudo-inverse of I. For auto-association, I = D, so that M = I [I*].
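The correlation (Hebbian) matrix memory and its sgn-based recall can be sketched in a few lines of pure Python. The two training pairs below are those of Example 6.1 as reconstructed here; treat the specific numbers as illustrative, not canonical.

```python
# Correlation ("Hebbian") matrix memory: W = sum_p d_p i_p^T,
# recall by y = W x followed by the signum function.

def outer_sum(inputs, targets):
    m, n = len(targets[0]), len(inputs[0])
    W = [[0] * n for _ in range(m)]
    for ip, dp in zip(inputs, targets):
        for j in range(m):
            for k in range(n):
                W[j][k] += dp[j] * ip[k]      # Hebbian correlation term
    return W

def recall(W, x):
    sgn = lambda v: 1 if v >= 0 else -1       # sgn(0) = +1, as in the text
    return [sgn(sum(wj[k] * x[k] for k in range(len(x)))) for wj in W]

# Example 6.1 pairs: (1,1) -> (-1,1) and (1,-1) -> (-1,-1)
W = outer_sum([[1, 1], [1, -1]], [[-1, 1], [-1, -1]])   # W = [[-2, 0], [0, 2]]
```

Presenting the unstored probe (−1, −1) yields the spurious response (1, −1), illustrating the low error-correction capability noted above.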

When the columns of I are orthonormal unit vectors,

[i_p]^T i_{p'} = 1 if p = p', and 0 otherwise,

hence I^T I is the identity matrix, so that all conditions defining the pseudo-inverse are satisfied with I* = I^T. Then M = D [I*] = D I^T.

* ::::::::::::::::::::::::::::::::::::::::::::::::::*

Example 6.2: Hetero-association, with three (i, d) pairs:

((1, 1, −1), (1, 1)),  ((1, −1, 1), (−1, −1)),  ((−1, 1, 1), (1, −1)).

With patterns as columns, the input and output matrices are

I = [ 1 1 −1 ; 1 −1 1 ; −1 1 1 ]  and  D = [ 1 −1 1 ; 1 −1 −1 ].

The inverse of I I^T exists, and

M = D I^T (I I^T)^{−1} = D I^T [ 0.5 0.25 0.25 ; 0.25 0.5 0.25 ; 0.25 0.25 0.5 ] = [ 0 1 0 ; 0 0 −1 ].

An input pattern (−1, −1, −1) yields (−1, 1) on premultiplying by M.

* ::::::::::::::::::::::::::::::::::::::::::::::::::*

Example 6.3: Consider hetero-association where the four input patterns are the columns of a 3 × 4 matrix I whose rows are mutually orthogonal, so that I I^T is 4 times the identity matrix and

(I I^T)^{−1} = [ 0.25 0 0 ; 0 0.25 0 ; 0 0 0.25 ].

Then M = D I^T (I I^T)^{−1} = (1/4) D I^T, where D is the 2 × 4 matrix of desired output patterns; premultiplying a test input vector by M yields its associated two-component output vector.

* ::::::::::::::::::::::::::::::::::::::::::::::::::*

Noise Extraction: Given the stored vectors, any vector x can be written as the sum of two components:

x = I x = W x + (I − W) x = x̂ + ã = Σ_{i=1}^{m} c_i i_i + ã

(here I denotes the identity matrix), where x̂ is the projection of x onto the vector space spanned by the stored vectors, and ã is the noise component orthogonal to this space.
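The projection view of recall can be checked directly. A minimal sketch, assuming two illustrative orthonormal stored vectors (not taken from the text):

```python
# Noise extraction: with orthonormal stored vectors, W = sum_p v_p v_p^T
# projects onto their span, while (Id - W) extracts the orthogonal noise part.

n = 4
v1 = [0.5, 0.5, 0.5, 0.5]           # an orthonormal pair in R^4
v2 = [0.5, -0.5, 0.5, -0.5]

W = [[v1[a] * v1[b] + v2[a] * v2[b] for b in range(n)] for a in range(n)]

def matvec(M, x):
    return [sum(M[a][b] * x[b] for b in range(n)) for a in range(n)]

x = [1.0, 2.0, 3.0, 4.0]
x_hat = matvec(W, x)                          # projection onto span{v1, v2}
noise = [x[a] - x_hat[a] for a in range(n)]   # the component (Id - W) x

# The noise component lies outside the stored span, so W maps it to ~0.
residual = matvec(W, noise)
```

Here x̂ = (2, 3, 2, 3) and the residual W ã is the zero vector, confirming that (I − W) isolates exactly the component outside the stored span.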

Matrix multiplication by W thus projects any vector onto the space of the stored vectors, whereas (I − W) extracts the noise component, which is to be minimized. Noise is suppressed if the number of patterns being stored is less than the number of neurons; otherwise, the noise may be amplified.

* ::::::::::::::::::::::::::::::::::::::::::::::::::*

Nonlinear Transformations:

x^(2) = f(W x^(1)) = ( f(w_{1,1} x^(1)_1 + ... + w_{1,n} x^(1)_n), ..., f(w_{m,1} x^(1)_1 + ... + w_{m,n} x^(1)_n) ),

where f is differentiable. The error E is minimized w.r.t. each weight w_{j,ℓ} by solving

Σ_{p=1}^{P} ( d_{p,j} − x^(2)_{p,j} ) f'( Σ_{k=1}^{n} w_{j,k} i_{p,k} ) i_{p,ℓ} = 0,

where x^(2)_{p,j} = f( Σ_{k=1}^{n} w_{j,k} i_{p,k} ) and f'(z) = df(z)/dz. Iterative procedures are needed to solve these nonlinear equations.

* ::::::::::::::::::::::::::::::::::::::::::::::::::*
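The natural iterative procedure here is plain gradient descent. A minimal sketch, assuming f = tanh; the learning rate, epoch count, and training pairs are illustrative choices, not from the text:

```python
import math
import random

def train(inputs, targets, epochs=2000, lr=0.1, seed=0):
    """Gradient descent for a one-layer nonlinear associator x2 = f(W x1), f = tanh."""
    rng = random.Random(seed)
    n, m = len(inputs[0]), len(targets[0])
    W = [[rng.uniform(-0.1, 0.1) for _ in range(n)] for _ in range(m)]
    for _ in range(epochs):
        for ip, dp in zip(inputs, targets):
            for j in range(m):
                net = sum(W[j][k] * ip[k] for k in range(n))
                out = math.tanh(net)
                err = dp[j] - out
                fprime = 1.0 - out * out            # derivative of tanh
                for l in range(n):
                    W[j][l] += lr * err * fprime * ip[l]   # delta rule with nonlinearity
    return W

def recall(W, x):
    return [math.tanh(sum(wj[k] * x[k] for k in range(len(x)))) for wj in W]

# Train on the two pairs of Example 6.1; tanh outputs approach, but never
# exactly reach, the +-1 targets, so only the signs are checked.
W = train([[1, 1], [1, -1]], [[-1, 1], [-1, -1]])
```

Because tanh saturates, the recalled values approach ±1 asymptotically; matching signs is the practical success criterion.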

6.2 Hopfield Networks

Hopfield networks are auto-associators in which node values are iteratively updated based on a local computation principle: the new state of each node depends only on its net weighted input at a given time. The network is fully connected, as shown in Figure 6.2, and the weights are obtained by Hebb's rule. The system may undergo many state transitions before converging to a stable state.

Figure 6.2: Hopfield network with five nodes; every connection is symmetric (w_{1,2} = w_{2,1}, w_{1,5} = w_{5,1}, w_{2,3} = w_{3,2}, w_{4,5} = w_{5,4}, w_{3,4} = w_{4,3}).

* ::::::::::::::::::::::::::::::::::::::::::::::::::*

Discrete Hopfield networks: One node corresponds to each vector dimension, taking values in {−1, 1}. Each node applies a step function to the sum of the external inputs and the weighted outputs of the other nodes. Node output values may change when an input vector is presented. Computation proceeds in "jumps": the values for the time variable range over the natural numbers, not the real numbers.

NOTATION:
T = {i_1, ..., i_P}: training set.
x_{p,j}(t): value generated by the jth node at time t, for the pth input pattern.
w_{k,j}: weight of the connection from the jth node to the kth node.
I_{p,k}: external input to the kth node, for the pth input vector (includes a threshold term).

The node output function is described by

x_{p,k}(t+1) = sgn( Σ_{j=1}^{n} w_{k,j} x_{p,j}(t) + I_{p,k} ),

where sgn(x) = 1 if x ≥ 0, and sgn(x) = −1 if x < 0.

Asynchronicity: at every time instant, precisely one node's output value is updated. Node selection may be

cyclic or random; each node in the system must have the opportunity to change state.

Network Dynamics: Select a node k ∈ {1, ..., n} to be updated:

x_{p,ℓ}(t+1) = x_{p,ℓ}(t),  if ℓ ≠ k;
x_{p,ℓ}(t+1) = sgn( Σ_{j=1}^{n} w_{ℓ,j} x_{p,j}(t) + I_{p,ℓ} ),  if ℓ = k.   (6.2)

By contrast, in Little's synchronous model, all nodes are updated simultaneously, at every time instant. Cyclic behavior may result when two nodes update their values simultaneously, each attempting to move towards a different attractor. The network may then repeatedly shuttle between two network states. Any such cycle consists of only two states and hence can be detected easily.

The Hopfield network can be used to retrieve a stored pattern when a corrupted version of the stored pattern is presented. The Hopfield network can also be used to "complete" a pattern when parts of the pattern are missing, e.g., using 0 for an unknown node input/activation value.

Figure 6.3: The initial network state (O) is equidistant from attractors X and Y, in (a). Asynchronous computation instantly leads to one of the attractors, (b) or (c). In synchronous computation, the object moves instead from the initial position to a non-attractor position (d). Cycling behavior results in the synchronous case, with the network state oscillating between (d) and (e).

* ::::::::::::::::::::::::::::::::::::::::::::::::::*

Example 6.4: Consider a 4-node Hopfield network whose weights store the patterns (1, 1, 1, 1) and (−1, −1, −1, −1). Each weight w_{ℓ,j} = 1 for ℓ ≠ j, and w_{j,j} = 0 for all j.

I. Corrupted Input Pattern (1, 1, 1, −1): I_1 = I_2 = I_3 = 1 and I_4 = −1 are the initial values for the node outputs.

1. Assume that the second node is randomly selected for possible update. Its net input is w_{2,1} x_1 + w_{2,3} x_3 + w_{2,4} x_4 + I_2 = 1 + 1 − 1 + 1 = 2. Since sgn(2) = 1, this node's output remains at 1.

2. If the fourth node is selected for possible update, its net input is 1 + 1 + 1 − 1 = 2, and sgn(2) = 1 implies that this node changes state to 1.

3. No further changes of state occur from this network configuration (1, 1, 1, 1). Thus, the network has successfully recovered one of the stored patterns from

the corrupted input vector.

II. Equidistant Case (1, 1, −1, −1): Both stored patterns are equally distant; one is chosen because the node function yields 1 when the net input is 0.

1. If the second node is selected for updating, its net input is 1 − 1 − 1 + 1 = 0; hence its state is not changed from 1.

2. If the third node is selected for updating, its net input is 1 + 1 − 1 − 1 = 0; hence its state is changed from −1 to 1.

3. Subsequently, the fourth node also changes state, resulting in the network configuration (1, 1, 1, 1).

Missing Input Element Case (0, 1, −1, −1): If the second node is selected for updating, its net input is w_{2,1} x_1 + w_{2,3} x_3 + w_{2,4} x_4 + I_2 = 0 − 1 − 1 + 1 < 0, implying that the updated output is −1 for this node. Subsequently, the first node also switches state to −1, resulting in the network configuration (−1, −1, −1, −1).

Multiple Missing Input Elements Case (0, 0, 0, −1): Though most of the initial inputs are unknown, the network succeeds in switching the states of three nodes, resulting in the stored pattern (−1, −1, −1, −1).

Thus, a significant amount of corruption, noise, or missing data can be successfully handled in problems with a small number of stored patterns.

* ::::::::::::::::::::::::::::::::::::::::::::::::::*

Energy Function: The Hopfield network dynamics attempt to minimize an "energy" (cost) function, derived as follows, assuming each desired output vector is the stored attractor pattern nearest the input vector.

If w_{ℓ,j} is positive and large, then we expect that the ℓth and jth nodes are frequently ON or OFF together in the attractor patterns, i.e., Σ_{p=1}^{P} (i_{p,ℓ} i_{p,j}) is positive and large. Similarly, w_{ℓ,j} is negative and large when the ℓth and jth nodes frequently have opposite activation values in the attractor patterns, i.e., Σ_{p=1}^{P} (i_{p,ℓ} i_{p,j}) is negative and large.
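Example 6.4's asynchronous dynamics, together with the energy function E = −a Σ_ℓ Σ_j w_{ℓ,j} x_ℓ x_j − b Σ_ℓ I_ℓ x_ℓ (a = 1/2, b = 1), can be checked with a short pure-Python sketch; the node-update order below is one arbitrary choice:

```python
# Example 6.4: 4-node Hopfield net storing (1,1,1,1) and (-1,-1,-1,-1).
# w[l][j] = 1 for l != j, w[j][j] = 0; external inputs I = corrupted pattern.

n = 4
w = [[0 if l == j else 1 for j in range(n)] for l in range(n)]

def sgn(v):                     # the text's convention: sgn(0) = +1
    return 1 if v >= 0 else -1

def energy(x, I, a=0.5, b=1.0):
    quad = sum(w[l][j] * x[l] * x[j] for l in range(n) for j in range(n))
    return -a * quad - b * sum(I[l] * x[l] for l in range(n))

def update(x, I, k):            # asynchronous update of node k (Eqn. 6.2)
    net = sum(w[k][j] * x[j] for j in range(n)) + I[k]
    x[k] = sgn(net)

I = [1, 1, 1, -1]               # corrupted version of (1,1,1,1)
x = list(I)                     # initial node outputs
e = energy(x, I)
for k in [1, 3, 0, 2]:          # any order; each update can never raise E
    update(x, I, k)
    assert energy(x, I) <= e
    e = energy(x, I)
```

Each asynchronous update leaves E unchanged or lowers it, and the final state (1, 1, 1, 1) matches step 3 of the example.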

So w_{ℓ,j} should be proportional to Σ_{p=1}^{P} (i_{p,ℓ} i_{p,j}). For strong positive or negative correlation, w_{ℓ,j} and Σ_p (i_{p,ℓ} i_{p,j}) have the same sign, hence Σ_{p=1}^{P} w_{ℓ,j} (i_{p,ℓ} i_{p,j}) > 0. The self-excitation coefficient w_{j,j} is often set to 0.

Summing over all pairs of nodes, Σ_ℓ Σ_j w_{ℓ,j} i_{p,ℓ} i_{p,j} is positive and large for input vectors almost identical to some attractor. We expect that node correlations present in the attractors are absent in an input vector i distant from all attractor patterns, so that Σ_ℓ Σ_j w_{ℓ,j} i_ℓ i_j is then low or negative. Therefore, the energy function contains the term

− Σ_ℓ Σ_j w_{ℓ,j} x_ℓ x_j.

When this is minimized, the final values of the various node outputs are expected to correspond to an attractor pattern. Network output should also be close to the input vector:

when presented with a corrupted image of "3," we do not want the system to generate an output corresponding to a different stored digit. For external inputs I_ℓ, another term − Σ_ℓ I_ℓ x_ℓ is included in the energy expression; I_ℓ x_ℓ > 0 iff input = output for the ℓth node. Combining the two terms, the following "energy" or Lyapunov function must be minimized by modifying the x_ℓ values:

E = −a Σ_ℓ Σ_j w_{ℓ,j} x_ℓ x_j − b Σ_ℓ I_ℓ x_ℓ,

where a, b ≥ 0. The values a = 1/2 and b = 1 correspond to a reduction in energy whenever a node update occurs, as described below. Even if each I_ℓ = 0, we can select the initial node inputs x_ℓ(0) so that the network settles into a state close to the input pattern components.

* ::::::::::::::::::::::::::::::::::::::::::::::::::*

Energy Minimization: Steadily reducing E will result in convergence to a stable state (which may or may not be one of the desired attractors). Let the kth node be selected for updating at time t. For the node update rule in Eqn. 6.2, the resulting change of energy is

ΔE(t) = E(t+1) − E(t)
      = −a Σ_ℓ Σ_{j≠ℓ} w_{ℓ,j} [ x_ℓ(t+1) x_j(t+1) − x_ℓ(t) x_j(t) ] − b Σ_ℓ I_ℓ ( x_ℓ(t+1) − x_ℓ(t) )
      = −a Σ_{j≠k} ( (w_{k,j} + w_{j,k}) ( x_k(t+1) − x_k(t) ) x_j(t) ) − b I_k ( x_k(t+1) − x_k(t) ),

because x_j(t+1) = x_j(t) for every node j ≠ k not selected for updating at this step. Hence

ΔE(t) = − [ a Σ_{j≠k} (w_{k,j} + w_{j,k}) x_j(t) + b I_k ] ( x_k(t+1) − x_k(t) ).

For ΔE(t) to be negative, ( x_k(t+1) − x_k(t) ) and ( a Σ_{j≠k} (w_{k,j} + w_{j,k}) x_j(t) + b I_k ) must have the same sign. The weights are chosen to be proportional to the correlation terms, i.e., w_{ℓ,j} = Σ_{p=1}^{P} i_{p,ℓ} i_{p,j} / P. Hence w_{j,k} = w_{k,j}, i.e., the weights are symmetric, and for the choice of the constants a = 1/2 and b = 1, the energy change expression simplifies to

ΔE(t) = − ( Σ_{j≠k} w_{j,k} x_j(t) + I_k ) ( x_k(t+1) − x_k(t) ) = −net_k(t) ( x_k(t+1) − x_k(t) ),

where net_k(t) = Σ_{j≠k} w_{j,k} x_j(t) + I_k is the net input to the kth node at time t. To reduce energy, the chosen (kth) node changes state iff its current state differs from the sign of its net input, i.e., iff net_k(t) x_k(t) < 0. Repeated application of this update rule results in a "stable state" in which all nodes stop changing their current values. Stable states may not be the desired attractor states, but may instead be "spurious."

* ::::::::::::::::::::::::::::::::::::::::::::::::::*

Example 6.5: The patterns (1, 1, −1, −1), (1, 1, 1, 1), and (−1, −1, 1, 1) are to be stored in a 4-node network. The first and second nodes have exactly the same values in every stored pattern, hence w_{1,2} = 1. Similarly, w_{3,4} = 1. The first and third nodes agree in one stored pattern but disagree in the other two stored patterns, hence

w_{1,3} = [ (no. of agreements) − (no. of disagreements) ] / (no. of patterns) = −1/3.

Similarly, w_{1,4} = w_{2,3} = w_{2,4} = −1/3.

If the input vector is (−1, −1, −1, −1), and the fourth node is selected for possible update, its net input is w_{1,4} x_1 + w_{2,4} x_2 + w_{3,4} x_3 + I_4 = (−1/3)(−1) + (−1/3)(−1) + (1)(−1) + (−1) = −4/3 < 0; hence the node does not change state. The same holds for every node in the network, so that the network configuration remains at (−1, −1, −1, −1), different from the patterns that were to be stored.

If the input vector is (−1, −1, −1, 0), representing the case when the fourth input value is missing, and the fourth node is selected for possible update, its net input is

w_{1,4} x_1 + w_{2,4} x_2 + w_{3,4} x_3 + I_4 = (−1/3)(−1) + (−1/3)(−1) + (1)(−1) + (0) = −1/3 < 0,

and the node changes state to −1, resulting in the spurious pattern (−1, −1, −1, −1).

* ::::::::::::::::::::::::::::::::::::::::::::::::::*

Example 6.6: Images of four different objects, labeled XII, IX, III, and VI, are shown in Figure 6.4. We treat each image as a 19 × 19 binary pixel array, as in Figure 6.5, stored using a Hopfield network with 19 × 19 neurons.

Figure 6.4: Four images stored in a Hopfield network.

Figure 6.5: Binary representation of the objects in Fig. 6.4.

Figure 6.6: Corrupted versions of the objects in Fig. 6.5.

The network is stimulated by a distorted version of a stored image, shown in Figure 6.6. In each case, as long as the total amount of distortion is less than 30% of the number of neurons, the network recovers the correct image, even when big parts of an image are filled with ones or zeroes. Poorer performance would be expected if the network were much smaller, or if the number of patterns to be stored were much larger.

One drawback of a Hopfield network is the assumption of full connectivity: a million weights are needed for a thousand-node network; such large cases arise in image-processing applications with one node per pixel.

* ::::::::::::::::::::::::::::::::::::::::::::::::::*
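Whether a given set of patterns is actually retrievable can be probed directly: a stored pattern is recoverable only if it is a fixed point of the sgn dynamics. A minimal sketch (pure Python; the two orthogonal test patterns are illustrative, chosen to sit well below the capacity bound discussed next):

```python
# Hebbian storage w[l][j] = (1/n) * sum_p p_l * p_j with zero diagonal,
# plus a stability check: is a stored pattern a fixed point of sgn(net)?

def store(patterns, n):
    w = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for l in range(n):
            for j in range(n):
                if l != j:
                    w[l][j] += p[l] * p[j] / n
    return w

def is_fixed_point(w, x):
    n = len(x)
    for l in range(n):
        net = sum(w[l][j] * x[j] for j in range(n))
        if (1 if net >= 0 else -1) != x[l]:
            return False
    return True

# Two mutually orthogonal patterns in n = 8 nodes: far below capacity,
# so both should be stable states of the network.
p1 = [1, 1, 1, 1, -1, -1, -1, -1]
p2 = [1, -1, 1, -1, 1, -1, 1, -1]
w = store([p1, p2], 8)
stable = is_fixed_point(w, p1) and is_fixed_point(w, p2)
```

As the number of random patterns approaches the capacity limits derived below, such fixed-point checks begin to fail with growing probability.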

Storage capacity refers to the quantity of information that can be stored and retrieved without error, and may be measured as

C = (no. of stored patterns) / (no. of neurons).

Capacity depends on the connection weights, the stored patterns, and the difference between the stimulus patterns and the stored patterns. Let the training set contain P randomly chosen vectors i_1, ..., i_P, where each i_p ∈ {1, −1}^n. These vectors are stored using the connection weights

w_{ℓ,j} = (1/n) Σ_{p=1}^{P} i_{p,ℓ} i_{p,j}.

How large can P be, so that the network responds to each i_p by correctly retrieving i_p?

Theorem 6.1: The maximum capacity of a Hopfield neural network (with n nodes) is bounded above by n/(4 ln n). In other words, if

β = Prob[ the ℓth bit of the pth stored vector is correctly retrieved, for each ℓ, p ],

then lim_{n→∞} β = 1 whenever P < n/(4 ln n).

Proof: For stimulus i_1, the output of the first node is

o = Σ_{j=2}^{n} w_{1,j} i_{1,j} = (1/n) Σ_{j=2}^{n} Σ_{p=1}^{P} i_{p,1} i_{p,j} i_{1,j}.

This output will be correctly decoded if o i_{1,1} > 0. Algebraic manipulation yields

o i_{1,1} = 1 + Z − 1/n,

where Z = (1/n) Σ_{j=2}^{n} Σ_{p=2}^{P} i_{p,1} i_{p,j} i_{1,j} i_{1,1}. The probability of correct retrieval of the first bit is

φ = Prob[ o i_{1,1} > 0 ] = Prob[ 1 + Z − 1/n > 0 ] ≥ Prob[ Z > −1 ].

By assumption, E(i_{ℓ,j}) = 0, and E(i_{ℓ,j} i_{ℓ',j'}) = 1 if ℓ = ℓ' and j = j', and 0 otherwise. By the central limit theorem, Z is approximately distributed as a Gaussian random variable with mean 0 and variance (P−1)(n−1)/n² ≈ P/n for large n and P, with density function (2πP/n)^{−1/2} exp(−n x²/(2P)), and

Prob[ Z > −1 ] = 1 − ∫_{1}^{∞} (2πP/n)^{−1/2} exp(−n x²/(2P)) dx,

since the density function of Z is symmetric. If n/P is large,

φ ≥ 1 − sqrt(P/(2πn)) exp(−n/(2P)).

The same probability expression φ applies for each bit of each pattern. Therefore, if P ≪ n, the probability of correct retrieval of all bits of all stored patterns is given by

β = (φ)^{nP} ≥ ( 1 − sqrt(P/(2πn)) e^{−n/(2P)} )^{nP} ≥ 1 − nP sqrt(P/(2πn)) e^{−n/(2P)}.

If P/n = 1/(2γ ln n) for some γ > 0, then

exp(−n/(2P)) = exp(−γ ln n) = n^{−γ},

which converges to zero as n → ∞, so that

nP sqrt(P/(2πn)) exp(−n/(2P)) ≈ n^{2−γ} / ( (2γ ln n)^{3/2} sqrt(2π) ),

which converges to zero if γ ≥ 2. A better bound can be obtained by considering the correction of errors during the iterative evaluations.

* ::::::::::::::::::::::::::::::::::::::::::::::::::*

Hopfield network dynamics are not deterministic: the node to be updated at any given time is chosen randomly. Different sequences of choices of nodes to be updated may lead to different stable states.

Stochastic version (of the Hopfield network): The output of node ℓ is +1 with probability 1/(1 + exp(−2 net_ℓ)), for net node input net_ℓ. Retrieval of stored patterns is then effective if P < 0.138 n.

* ::::::::::::::::::::::::::::::::::::::::::::::::::*

Continuous Hopfield networks:

1. Node outputs range over a continuous interval.

2. Time is continuous: each node constantly examines its net input and updates its output.

3. Changes in node outputs must be gradual over time.

For ease of implementation, assurance of convergence, and biological plausibility, we assume each node's output lies in [−1, 1], with the modified node update rule

dx_ℓ(t)/dt = 0,  if x_ℓ = 1 and f( Σ_j w_{j,ℓ} x_j(t) + I_ℓ ) > 0;
dx_ℓ(t)/dt = 0,  if x_ℓ = −1 and f( Σ_j w_{j,ℓ} x_j(t) + I_ℓ ) < 0;
dx_ℓ(t)/dt = f( Σ_j w_{j,ℓ} x_j(t) + I_ℓ ),  otherwise.

Proof of convergence is similar to that for the discrete Hopfield model. An energy function with a lower bound is constructed, and it is shown that every change made in a node's output decreases energy, assuming asynchronous dynamics. The energy function

E = −(1/2) Σ_ℓ Σ_{j≠ℓ} w_{ℓ,j} x_ℓ(t) x_j(t) − Σ_ℓ I_ℓ x_ℓ(t)

is minimized as x_1(t), ..., x_n(t) vary with time t. Given the weights and external inputs, E has a lower bound, since x_1(t), ..., x_n(t) have upper and lower bounds.

Since (1/2) w_{ℓ,j} x_ℓ x_j + (1/2) w_{j,ℓ} x_j x_ℓ = w_{ℓ,j} x_ℓ x_j for symmetric weights, we have

∂E/∂x_ℓ(t) = − ( Σ_j w_{ℓ,j} x_j + I_ℓ ).   (6.3)

The Hopfield net update rule requires

∂x_ℓ/∂t > 0  if  ( Σ_j w_{j,ℓ} x_j + I_ℓ ) > 0.   (6.4)

Equivalently, the update rule may be expressed in terms of changes occurring in the net input to a node, instead of the node output x_ℓ. Whenever f is a monotonically increasing function (such as tanh) with f(0) = 0,

f( Σ_j w_{j,ℓ} x_j + I_ℓ ) > 0  iff  ( Σ_j w_{j,ℓ} x_j + I_ℓ ) > 0.   (6.5)

From Equations 6.3, 6.4, and 6.5,

∂x_ℓ/∂t > 0  iff  ( Σ_j w_{j,ℓ} x_j + I_ℓ ) > 0,  i.e., iff  ∂E/∂x_ℓ < 0,

so that (∂E/∂x_ℓ)(∂x_ℓ/∂t) ≤ 0 for each ℓ, and therefore

dE/dt = Σ_ℓ (∂E/∂x_ℓ)(∂x_ℓ/∂t) ≤ 0.
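This descent argument can be checked numerically with a forward-Euler discretization. A minimal sketch: the step size, weights, and inputs below are illustrative choices, and the continuous-time guarantee holds under discretization only for a sufficiently small step:

```python
import math

# Euler simulation of the continuous Hopfield dynamics dx_l/dt = f(net_l),
# f = tanh, with outputs clipped to [-1, 1]; the energy E should not increase.

w = [[0.0, 1.0, -1.0],
     [1.0, 0.0, 0.5],
     [-1.0, 0.5, 0.0]]          # symmetric weights, zero diagonal
I = [0.1, -0.2, 0.0]            # external inputs
x = [0.3, -0.4, 0.9]            # initial node outputs
dt = 0.05

def energy(x):
    n = len(x)
    quad = sum(w[l][j] * x[l] * x[j] for l in range(n) for j in range(n) if j != l)
    return -0.5 * quad - sum(I[l] * x[l] for l in range(n))

energies = [energy(x)]
for _ in range(200):
    nets = [sum(w[l][j] * x[j] for j in range(len(x))) + I[l] for l in range(len(x))]
    x = [max(-1.0, min(1.0, x[l] + dt * math.tanh(nets[l]))) for l in range(len(x))]
    energies.append(energy(x))

# E is (numerically) non-increasing along the trajectory.
monotone = all(b <= a + 1e-9 for a, b in zip(energies, energies[1:]))
```

The clipping at ±1 plays the role of the first two branches of the update rule: a saturated node whose net input pushes it outward simply stays put, leaving E unchanged.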

Computation terminates, since (a) each node update decreases the (lower-bounded) energy E, (b) the number of possible states is finite, (c) the number of possible node updates is limited, and (d) node output values are bounded.

The continuous model generalizes the discrete model, but the size of the state space is larger, and the energy function may have many more local minima.

* ::::::::::::::::::::::::::::::::::::::::::::::::::*

The Cohen-Grossberg Theorem gives sufficient conditions for a network to converge asymptotically to a stable state. Let u_j be the net input to the jth node, f_j its node function, a_j a rate-of-change term, and b_j a "loss" term.

Theorem 6.2: Let a_j(u_j) ≥ 0 and df_j(u_j)/du_j ≥ 0, where u_j is the net input to the jth node in a neural network with symmetric weights w_{j,ℓ} = w_{ℓ,j}, whose behavior is governed by the differential equation

du_j/dt = a_j(u_j) [ b_j(u_j) − Σ_{ℓ=1}^{N} w_{j,ℓ} f_ℓ(u_ℓ) ],  for j = 1, ..., n.

Then there exists an energy function E for which dE/dt ≤ 0 for u_j ≠ 0, i.e., the network dynamics lead to a stable state in which energy ceases to change.

* ::::::::::::::::::::::::::::::::::::::::::::::::::*

6.3 Brain-State-in-a-Box (BSB) Network

The BSB network is similar to the Hopfield model, but all nodes are updated simultaneously. The node function used is the ramp function

f(net) = min(1, max(−1, net)),

which is bounded, continuous, and piecewise linear, as shown in Figure 6.7.

Figure 6.7: Ramp function, with output values in [−1, 1].

6.3. BRAIN-STATE-IN-A-BOX (BSB) NETWORK 35 Node update rule: initial activation is steadily amplied by positive feedback until saturation, jx`j. x`(t +)=f @ n X j= w` ` may be xed to equal. w` j x j (t) The state of the network always remains inside an n- dimensional \box," giving rise to the name of the network, \Brain-State-in-a-Box." Network state steadily moves from an arbitrary point inside the box(a in Figure 6.8(a)) towards one side of the box (B in gure), and then crawls along the side of the box to reach a corner of the box (C in gure), a stored pattern. Connections between nodes are Hebbian, representing correlations between node activations, and can be A obtained by the non-iterative computation: w` j = P where each i p j 2f; g. PX p= (i p ` i p j ) If the training procedure is iterative, training patterns are repeatedly presented to the network, and weights

Figure 6.8: State change trajectory (A → B → C) in a BSB: (a) the network, with 3 nodes; (b) three stored patterns (1,1,1), (−1,−1,−1), and (1,−1,−1), indicated by darkened circles.

are successively modified using the weight update rule

    Δw_{ℓj} = μ i_{p,j} ( i_{p,ℓ} − Σ_k w_{ℓk} i_{p,k} ),   for p = 1, ..., P,

where μ > 0 is a small constant. This rule steadily reduces

    E = Σ_{p=1}^{P} Σ_{ℓ=1}^{n} ( i_{p,ℓ} − Σ_k w_{ℓk} i_{p,k} )²

and is applied repeatedly for each pattern until E ≈ 0. When training is completed, we expect Σ_p Δw_{ℓj} = 0, implying that

    Σ_p i_{p,j} ( i_{p,ℓ} − Σ_k w_{ℓk} i_{p,k} ) = 0,

i.e.,

    Σ_p ( i_{p,j} i_{p,ℓ} ) = Σ_p i_{p,j} Σ_k ( w_{ℓk} i_{p,k} ),

an equality satisfied when

    i_{p,ℓ} = Σ_k ( w_{ℓk} i_{p,k} ).

The trained network is hence "stable" for the trained patterns, i.e., presentation of a stored pattern does not result in any change in the network.

* ::::::::::::::::::::::::::::::::::::::::::::::::::*

Example 6.7  Let the training set contain the 3 patterns {(1,1,1), (−1,−1,−1), (1,−1,−1)}, as in Figure 6.8, with network connection weights

    w_{1,2} = w_{2,1} = (1 + 1 − 1)/3 = 1/3,
    w_{1,3} = w_{3,1} = (1 + 1 − 1)/3 = 1/3,
    w_{2,3} = w_{3,2} = (1 + 1 + 1)/3 = 1.

If an input pattern (0.5, 0.6, 0.1) is presented, the next network state is

    ( f(0.5 + 0.6/3 + 0.1/3), f(0.6 + 0.5/3 + 0.1), f(0.1 + 0.5/3 + 0.6) )

= (0.73, 0.87, 0.87),

where f is the ramp function described earlier. Note that the second and third nodes exhibit identical states, since w_{2,3} = 1. The very next network state is (1, 1, 1), a stored pattern.

If (0.5, 0.6, −0.7) is the input pattern, the network state changes to (0.47, 0.07, 0.07), then to (0.52, 0.3, 0.3), eventually converging to the stable memory (1, 1, 1). If the input pattern presented is (1, 0.5, −0.5), the network state converges within two updates to (1, 1, 1), and does not change thereafter.

Network state may converge to a pattern which was not intended to be stored. E.g., if a 3-dimensional BSB network was intended to store only the two patterns {(1, 1, 1), (1, −1, −1)}, with weights

    w_{1,2} = w_{2,1} = (1 − 1)/2 = 0,
    w_{1,3} = w_{3,1} = (1 − 1)/2 = 0,  and
    w_{2,3} = w_{3,2} = (1 + 1)/2 = 1,

then (−1, −1, −1) is such a spurious attractor.

BSB computations steadily reduce −Σ_ℓ Σ_j w_{ℓj} x_ℓ x_j when the weight matrix is symmetric and positive definite (i.e., all eigenvalues of W are positive). Stability and the number of spurious attractors tend to increase with the self-excitatory weights w_{jj}. If the weight matrix is "diagonal-dominant," with

    w_{jj} ≥ Σ_{ℓ≠j} |w_{ℓj}|   for each j ∈ {1, ..., n},

then every vertex of the "box" is a stable memory.

The hetero-associative version of the BSB network contains two layers of nodes; the connection from the jth node of the input layer to the ℓth node of the output layer carries the weight

    w_{ℓj} = (1/P) Σ_{p=1}^{P} i_{p,j} d_{p,ℓ}.

Example application: clustering radar pulses, distinguishing meaningful signals from noise in a radar surveillance environment where a detailed description of the signal sources is not known.
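The BSB dynamics of Example 6.7 can be checked numerically. A minimal sketch (self-weights fixed at 1, as described above; here the Hebbian diagonal already equals 1):

```python
import numpy as np

def ramp(net):
    return np.clip(net, -1.0, 1.0)

# Hebbian weights for the patterns (1,1,1), (-1,-1,-1), (1,-1,-1) of Example 6.7:
# w_lj = (1/P) sum_p i_pl * i_pj, with the self-weights w_ll fixed at 1.
patterns = np.array([[1, 1, 1], [-1, -1, -1], [1, -1, -1]], dtype=float)
W = patterns.T @ patterns / len(patterns)
np.fill_diagonal(W, 1.0)

def bsb_converge(x, max_iters=100):
    """Apply x(t+1) = f(W x(t)) until the state stops changing."""
    for _ in range(max_iters):
        x_next = ramp(W @ x)
        if np.allclose(x_next, x):
            return x_next
        x = x_next
    return x

print(bsb_converge(np.array([0.5, 0.6, 0.1])))   # -> [1. 1. 1.], a stored pattern
```

Starting from (0.5, 0.6, 0.1) or (0.5, 0.6, −0.7), the trajectory ends at the stored corner (1, 1, 1), as in the example.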

6.4 Hetero-associators

A hetero-associator maps input patterns to a different set of output patterns. Consider the task of translating English word inputs into Spanish word outputs. A simple feedforward network trained by backpropagation may be able to assert that a particular word supplied as input to the system is the 25th word in the dictionary. The desired output pattern may then be obtained by concatenating the feedforward network function f : ℝⁿ → {C_1, ..., C_k} with a lookup table mapping g : {C_1, ..., C_k} → {P_1, ..., P_k} that associates each "address" C_i with a pattern P_i, as shown in Figure 6.9.

Two-layer networks with weights determined by Hebb's rule are much less computationally expensive. In a non-iterative model, output layer node activations are given by

    x^(2)_ℓ(t+1) = f( Σ_j w_{ℓj} x^(1)_j(t) + θ_ℓ ).
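The classifier-plus-lookup-table composition g ∘ f can be sketched as follows; the tiny nearest-centroid "network," the centroid values, and the pattern names are all illustrative assumptions, standing in for a trained feedforward network:

```python
import numpy as np

# f maps an input vector to an "address" C_i; g maps C_i to a stored pattern P_i.
centroids = {"C1": np.array([1.0, 1.0]), "C2": np.array([-1.0, -1.0])}
lookup = {"C1": "pattern-P1", "C2": "pattern-P2"}   # g: address -> output pattern

def f(x):
    """Stand-in classifier: return the address of the nearest centroid."""
    return min(centroids, key=lambda c: np.linalg.norm(x - centroids[c]))

def hetero_associate(x):
    """Concatenation g(f(x)): classify, then look up the output pattern."""
    return lookup[f(x)]

print(hetero_associate(np.array([0.9, 0.8])))   # -> pattern-P1
```

The point of the construction is that the feedforward part only has to produce a discrete address; the memory itself is an ordinary table.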

Figure 6.9: Association between input and output patterns using a feedforward network and a simple memory (at most one C_i is "ON" for any input vector presentation).
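The non-iterative recall rule x^(2)_ℓ = f(Σ_j w_{ℓj} x^(1)_j + θ_ℓ) can also be sketched directly; the choice of f = sign, zero thresholds, and the single stored pattern pair are illustrative assumptions:

```python
import numpy as np

def recall(W, x1, theta=None):
    """One-shot hetero-associative recall: x2_l = f(sum_j w_lj * x1_j + theta_l),
    here with f = sign and thresholds defaulting to zero."""
    if theta is None:
        theta = np.zeros(W.shape[0])
    return np.where(W @ x1 + theta >= 0, 1, -1)

# Hebbian weights for one stored association (illustrative patterns):
i1 = np.array([1, -1, 1, -1])    # input pattern
d1 = np.array([1, 1])            # desired output pattern
W = np.outer(d1, i1)             # w_lj = (1/P) sum_p i_pj * d_pl, with P = 1

print(recall(W, i1))             # -> [1 1], the stored output
```

With a single stored pair, presenting the stored input reproduces the stored output exactly; interference appears only when several pairs share the weights.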

For error correction, iterative models are more useful:

1. Compute output node activations using the above (non-iterative) update rule, and then perform iterative auto-association within the output layer, leading to a stored output pattern.

2. Perform iterative auto-association within the input layer, resulting in a stored input pattern, which is then fed into the second layer of the hetero-associator network.

3. Use a Bidirectional Associative Memory (BAM) with no intra-layer connections, as in Figure 6.10:

    REPEAT
        (a)  x^(2)_ℓ(t+1) = f( Σ_j w_{ℓj} x^(1)_j(t) + θ^(2)_ℓ )
        (b)  x^(1)_ℓ(t+2) = f( Σ_j w_{jℓ} x^(2)_j(t+1) + θ^(1)_ℓ )
    UNTIL no node in either layer changes state.

Weights can be chosen to be Hebbian (correlation) terms:

    w_{ℓj} = c Σ_{p=1}^{P} i_{p,j} d_{p,ℓ}.

Sigmoid node functions can be used in continuous BAM models.
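The BAM recall loop can be sketched compactly; a minimal version, assuming bipolar patterns, Hebbian weights with c = 1/P, and sgn ties broken toward +1 (a tie-breaking assumption):

```python
import numpy as np

def sgn(v):
    """Bipolar sign, with sgn(0) taken as +1 (a tie-breaking assumption)."""
    return np.where(v >= 0, 1, -1)

def bam_weights(inputs, outputs):
    """Hebbian BAM weights: w[l, j] = (1/P) * sum_p i_pj * d_pl."""
    P = len(inputs)
    return np.array(outputs).T @ np.array(inputs) / P

def bam_recall(W, x1, max_iters=100):
    """Alternate layer updates (zero thresholds) until both layers are stable."""
    x2 = sgn(W @ x1)
    for _ in range(max_iters):
        x2 = sgn(W @ x1)          # first layer -> second layer
        x1_new = sgn(W.T @ x2)    # second layer -> first layer
        if np.array_equal(x1_new, x1):
            break
        x1 = x1_new
    return x1, x2
```

With the three pattern pairs of Example 6.8 and first-layer input (−1, −1, −1, −1), this loop settles on the stored second-layer pattern (−1, +1) and reconstructs the associated first-layer pattern (−1, −1, +1, +1).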

Figure 6.10: Bidirectional Associative Memory (BAM), with a first layer and a second layer of nodes.

* ::::::::::::::::::::::::::::::::::::::::::::::::::*

Example 6.8  The goal is to establish the following three associations between 4-dimensional and 2-dimensional patterns:

    (+1, +1, −1, −1) → (+1, +1)
    (+1, +1, +1, +1) → (+1, −1)
    (−1, −1, +1, +1) → (−1, +1).

By the Hebbian rule (with c = 1/P = 1/3),

    w_{1,1} = ( Σ_{p=1}^{3} i_{p,1} d_{p,1} ) / 3 = 1.

Likewise, w_{1,2} = 1, and every other weight is −1/3; e.g.,

    w_{1,3} = ( Σ_{p=1}^{3} i_{p,3} d_{p,1} ) / 3 = −1/3.

These weights constitute the weight matrix

    W = (  1      1     −1/3   −1/3
          −1/3   −1/3   −1/3   −1/3 ).

When an input vector such as i = (−1, −1, −1, −1) is presented at the first layer,

    x^(2)_1 = sgn( x^(1)_1 w_{1,1} + x^(1)_2 w_{1,2} + x^(1)_3 w_{1,3} + x^(1)_4 w_{1,4} )
            = sgn( −1 − 1 + 1/3 + 1/3 ) = −1,

and

    x^(2)_2 = sgn( x^(1)_1 w_{2,1} + x^(1)_2 w_{2,2} + x^(1)_3 w_{2,3} + x^(1)_4 w_{2,4} )
            = sgn( 1/3 + 1/3 + 1/3 + 1/3 ) = 1.

The resulting vector (−1, 1) is one of the stored 2-dimensional patterns. The corresponding first layer pattern generated is (−1, −1, 1, 1), following computations such as

    x^(1)_1 = sgn( x^(2)_1 w_{1,1} + x^(2)_2 w_{2,1} ) = sgn( −1 − 1/3 ) = −1

(ties such as sgn(0), arising at the third and fourth nodes, being resolved as +1). No further changes occur.

* ::::::::::::::::::::::::::::::::::::::::::::::::::*

The "additive" variant of the BAM separates out the effect of the previous activation value and the external

input I_ℓ for the node under consideration, using the following state change rule:

    x^(2)_ℓ(t+1) = a_ℓ x^(2)_ℓ(t) + b_ℓ I_ℓ + f( Σ_j w_{ℓj} x^(1)_j(t) ),

where a_ℓ, b_ℓ are frequently chosen from {0, 1}. If the BAM is discrete, bivalent, as well as additive, then

    x^(2)_i(t+1) = 1   if  a_i x^(2)_i(t) + b_i I_i + Σ_j w_{ij} x^(1)_j(t) > θ_i,
                   0   otherwise,

where θ_i is the threshold for the ith node. A similar expression is used for the first layer updates.

BAM models have been shown to converge using a Lyapunov function such as

    L_1 = − Σ_ℓ Σ_j x^(1)_ℓ(t) x^(2)_j(t) w_{jℓ}.

This energy function can be modified, taking into account external node inputs I_{k,ℓ} as well as thresholds θ^(k)_ℓ for the nodes:

    L_2 = L_1 − Σ_{k=1}^{2} Σ_{ℓ=1}^{N(k)} x^(k)_ℓ ( I_{k,ℓ} − θ^(k)_ℓ ),

where N(k) denotes the number of nodes in the kth layer.

As in Hopfield nets, there is no guarantee that the system stabilizes to a desired output pattern. Stability is assured if nodes of one layer are unchanged while nodes of the other layer are being updated. All nodes in a layer may change state simultaneously, allowing greater parallelism than in Hopfield networks. When a new input pattern is presented, the rate at which the system stabilizes depends on the proximity of the new input pattern to a stored pattern, not on the number of patterns stored. The number of patterns that can be stored in a BAM is limited by network size.

* ::::::::::::::::::::::::::::::::::::::::::::::::::*

Importance Factor: The relative importance of different pattern-pairs may be adjusted by attaching a "significance factor" σ_p to each pattern pair (i_p, d_p) being used to modify the BAM weights:

    w_{ji} = w_{ij} = Σ_{p=1}^{P} σ_p i_{p,j} d_{p,i}.
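With significance factors, the Hebbian sum simply becomes a weighted sum over pattern pairs; a minimal sketch:

```python
import numpy as np

def weighted_bam_weights(inputs, outputs, sigma):
    """BAM weights with a significance factor sigma_p per pattern pair:
    w[i, j] = sum_p sigma_p * i_pj * d_pi."""
    I = np.asarray(inputs, dtype=float)    # P x n input patterns
    D = np.asarray(outputs, dtype=float)   # P x m output patterns
    s = np.asarray(sigma, dtype=float)     # P significance factors
    return (D * s[:, None]).T @ I

print(weighted_bam_weights([[1, -1]], [[1]], [2.0]))   # -> [[ 2. -2.]]
```

Doubling σ_p for a pair doubles that pair's contribution to every weight, so important associations dominate recall when patterns interfere.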

Decay: Memory may change with time, allowing previously stored patterns to decay, using a monotonically decreasing function for each importance factor:

    σ_p(t) = (1 − λ) σ_p(t−1)   or   σ_p(t) = max( σ_p(t−1) − λ, 0 ),

where 0 < λ ≪ 1 represents the "forgetting" rate. If the pattern pair (i_p, d_p) is added to the training set at time t, then

    w_{ij}(t) = (1 − λ) w_{ij}(t−1) + σ_p(t−1) i_{p,j} d_{p,i},

where (1 − λ) is the attenuation factor. Alternatively, the rate at which memory fades may be fixed to a global clock:

    w_{ij}(t) = (1 − λ) w_{ij}(t−1) + Σ_{p=1}^{π(t)} σ_p(t−1) i_{p,j} d_{p,i},

where π(t) is the number of new patterns being stored at time t. There may be many instants at which π(t) = 0, i.e., no new patterns are being stored, but existing memory continues to decay.

* ::::::::::::::::::::::::::::::::::::::::::::::::::*
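The multiplicative forgetting rule can be simulated directly; λ and the initial values below are illustrative:

```python
lam = 0.1      # forgetting rate lambda, 0 < lam << 1 (illustrative value)
sigma = 1.0    # importance factor of a stored pattern pair
w = 0.5        # one BAM weight (illustrative)

# sigma_p(t) = (1 - lam) * sigma_p(t-1); with no new patterns stored
# (pi(t) = 0), the weight is attenuated by the same factor each step.
for t in range(10):
    sigma = (1 - lam) * sigma
    w = (1 - lam) * w

print(round(sigma, 4))   # -> 0.3487, i.e., (1 - 0.1)**10
```

After 10 steps the stored pair has lost about two thirds of its influence, showing how old memories fade even when nothing new is learned.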

6.5 Boltzmann Machines

The memory capacity of Hopfield models can be increased by introducing hidden nodes. A stochastic learning process is needed to allow the weights between hidden nodes and other ("visible") nodes to change to optimal values, beginning from randomly chosen values.

Principles of simulated annealing are invoked to minimize the energy function E defined earlier for Hopfield networks. A node is randomly chosen, and changes state with a probability that depends on ΔE/T, where the temperature T > 0 is steadily lowered. The state change is accepted with probability 1/(1 + exp(ΔE/T)). Annealing terminates when T ≈ 0. Many state changes occur at each temperature.

If such a system is allowed to reach equilibrium at any temperature, the ratio of the probabilities of two states a, b with energies E_a and E_b is given by

    P(a)/P(b) = exp( (E_b − E_a)/T ).

This is the Boltzmann distribution, and

does not depend on the initial state or the path followed in reaching equilibrium.

Figure 6.11 describes the learning algorithm for hetero-association problems. For auto-association, no nodes are "clamped" in the second phase of the algorithm. The Boltzmann machine weight change rule conducts gradient descent on the relative entropy (cross-entropy)

    H(P⁺, P⁻) = Σ_s P⁺_s ln( P⁺_s / P⁻_s ),

where s ranges over all possible network states, P⁺_s is the probability of network state s when the visible nodes are clamped, and P⁻_s is the probability of network state s when no nodes are clamped. H(P⁺, P⁻) compares the probability distributions P⁺ and P⁻; note that H(P⁺, P⁻) = 0 when P⁺ = P⁻, the desired goal.

In using the BM, we cannot directly compute output node states from input node states, since the initial states of the hidden nodes are undetermined. Annealing the network states would result in a global optimum of the energy function, which may have nothing in common

Algorithm Boltzmann;
    while weights continue to change and computational bounds are not exceeded, do
        Phase 1:
        for each training pattern, do
            Clamp all input and output nodes;
            ANNEAL, changing hidden node states;
            Update {..., p⁺_{i,j}, ...}, the equilibrium probabilities with which nodes i, j have the same state;
        end-for;
        Phase 2:
        for each training pattern, do
            Clamp all input nodes;
            ANNEAL, changing hidden and output node states;
            Update {..., p⁻_{i,j}, ...}, the equilibrium probabilities with which nodes i, j have the same state;
        end-for;
        Increment each w_{i,j} by η ( p⁺_{i,j} − p⁻_{i,j} );
    end-while.

Figure 6.11: BM Learning Algorithm

with the input pattern. Since the input pattern may be corrupted, the best approach is to initially clamp the input nodes while annealing from a high temperature down to an intermediate temperature T_I. This leads the network towards a local minimum of the energy function near the input pattern. The visible nodes are then unclamped, and annealing continues from T_I to 0, allowing the visible node states also to be modified, correcting errors in the input pattern.

The cooling rate with which temperature decreases must be extremely slow to assure convergence to global minima of E. Faster cooling rates are often used due to computational limitations. The BM learning algorithm is extremely slow: many observations have to be made at many temperatures before computation concludes.

* ::::::::::::::::::::::::::::::::::::::::::::::::::*

Mean field annealing improves on the speed of execution of the BM, using a "mean field" approximation in

the weight change rule. For example, the weight update rule

    Δw_{ℓj} = η ( p⁺_{ℓj} − p⁻_{ℓj} )

(where p⁺_{ℓj} = E(x_ℓ x_j) when the visible nodes are clamped, while p⁻_{ℓj} = E(x_ℓ x_j) when no nodes are clamped) is approximated by

    Δw_{ℓj} = η ( q⁺_ℓ q⁺_j − q⁻_ℓ q⁻_j ),

where q⁺_ℓ is the average output of the ℓth node when the visible nodes are clamped, and q⁻_ℓ the average output without clamping.

For the Boltzmann distribution, the average output is

    q_ℓ = tanh( Σ_j w_{ℓj} x_j / T ).

The mean field approximation suggests replacing the random variable x_j by its expected value E(x_j), so that

    q_ℓ = tanh( Σ_j w_{ℓj} E(x_j) / T ).

These approximations improve the speed of execution of the Boltzmann machine, but convergence of the weight values to global optima is not assured.

* ::::::::::::::::::::::::::::::::::::::::::::::::::*
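The mean-field equations q_ℓ = tanh(Σ_j w_{ℓj} q_j / T) can be solved by simple fixed-point iteration; a sketch, where the two-node weight matrix, the temperatures, and the small symmetry-breaking start are illustrative assumptions:

```python
import numpy as np

def mean_field(W, T, iters=200):
    """Iterate q <- tanh(W q / T) to a fixed point: each q_l approximates
    the average (expected) output of node l at temperature T."""
    q = np.full(len(W), 0.01)   # small nonzero start to break symmetry
    for _ in range(iters):
        q = np.tanh(W @ q / T)
    return q

W = np.array([[0.0, 0.5], [0.5, 0.0]])   # symmetric weights (illustrative)
print(mean_field(W, T=1.0))   # high temperature: averages stay near 0
print(mean_field(W, T=0.1))   # low temperature: averages saturate near +1
```

The deterministic iteration replaces many stochastic annealing sweeps, which is exactly the speed-up mean field annealing offers, at the cost of the convergence guarantees noted above.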

6.6 Conclusion

The biologically inspired Hebbian learning principle shows how to make connection weights represent the similarities and differences inherent in the various attributes or input dimensions of available data. No extensive, slow "training" phase is required; the number of state changes executed before the network stabilizes is roughly proportional to the number of nodes.

Associative learning reinforces the magnitudes of connections between correlated nodes. Such networks can be used to respond to a corrupted input pattern with the correct output pattern. In auto-association, the input pattern space and the output space are identical; these spaces are distinct in hetero-association tasks. Hetero-associative systems may be bidirectional, with a vector from either vector space being generated when a vector from the other vector space is presented. These tasks can be accomplished using one-shot, non-iterative procedures, as well as iterative mechanisms that repeatedly modify the weights as new samples are presented.

In differential Hebbian learning, changes in a weight w_{ij} are caused by the change in the stimulation of the ith node by the jth node at the time the output of the ith node changes; also, the weight change is governed by the sum of such changes over a period of time.

Not too many pattern associations can be stably stored in these networks. If few patterns are to be stored, perfect retrieval is possible even when the input stimulus is significantly noisy or corrupted. However, such networks often store spurious memories.