Learning and Memory in Neural Networks

Guy Billings, Neuroinformatics Doctoral Training Centre, School of Informatics, University of Edinburgh, UK.

Neural networks consist of computational units (neurons) linked by a directed graph with some degree of connectivity (the network). The connections comprising the edges of the graph are termed weights. As the name suggests, the magnitude of a weight determines the magnitude of the effect that the connecting neuron can have upon its target partner. In caricature, neural networks use many parallel operations of simple units to perform computations with uncertain data, rather than serial operations of logical blocks to perform computations with exact data. Neural networks are useful computational devices for learning data classifications, for autoassociative (content addressable) memories and for associative (classical conditioning) memories. In this brief, a neural network performing each of these tasks is introduced: the multilayer perceptron, the Hopfield network and the associative network, respectively.

Two class data classification: The Perceptron

The perceptron (Rosenblatt, 1958) can be used to learn a distinction between two clusters within some data set, Fig. 1A. The aim of the perceptron is to classify data into two classes C_1 and C_2 by labelling each data point x with its output f(a) ∈ {−1, 1}, such that f(a) = 1 for class C_1 and f(a) = −1 for class C_2. As input the perceptron is passed a feature vector φ(x_i) ∈ {−1, 1}, i ∈ {1, ..., N}, constructed from the N-dimensional data point to be classified by means of some fixed non-linear transformation (for example, brightness thresholding of the pixels of an image). To perform a classification, the activation is first calculated,

a(x) = Σ_i w_i φ(x_i) − φ_0    (1)

where φ_0 is the bias. The activation is then passed through the step function

f(a) = 1 if a ≥ 0,  f(a) = −1 if a < 0.    (2)

What classifications of the data can the perceptron perform, given some feature vector φ(x)? The perceptron can only separate feature vectors that are linearly separable. The decision between attributing a given input vector to class C_1 or to class C_2 occurs where a(x) = 0. This criterion is satisfied by an (N − 1)-dimensional hyperplane within feature space. This surface is the decision boundary of the classification. Data that are separable with the perceptron are described as linearly separable because the decision boundary of a single linear thresholder is linear (a hyperplane). The bias determines the displacement of the decision boundary from the origin.

How does the decision boundary relate to the correct weight vector for a given classification? Consider two locations on the decision boundary, φ(x_1) and φ(x_2). Since a(x_1) = a(x_2) = 0, it follows that w^T (φ(x_1) − φ(x_2)) = 0, which can only be satisfied if the displacement from φ(x_1) to φ(x_2) is orthogonal to the weight vector (Bishop, 2006). Thus the weight vector associated with a decision boundary is normal to that boundary in feature space. The aim of learning with the perceptron is to find a weight vector corresponding to a decision boundary that separates the two classes of input data. One method for achieving this is the perceptron learning algorithm, an explanation of which is beyond the scope of this brief; see Rojas (1996) or Bishop (2006).
The perceptron convergence theorem states that the perceptron learning algorithm is guaranteed to find a weight vector corresponding to a separating decision boundary in a finite number of steps, provided the feature vectors are linearly separable (Minsky and Papert, 1969; Bishop, 2006). When the data are linearly separable there may be more than one valid decision boundary, in which case the one found depends upon the initial conditions. If the data are not linearly separable, the perceptron learning algorithm does not converge.

For any given data set there may be many possible decision boundaries. Assume that the learning algorithm has converged upon some successful decision boundary. The information transmitted about the data by the perceptron is then a disambiguation between the chosen labelling and the other possible labellings. By counting all possible classifications of a set of random points in feature space it can be shown that the limiting capacity of the perceptron is 2 bits per weight (MacKay, 2003).
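The classification rule of Eqs. (1)–(2), together with the learning algorithm that the text defers to Rojas (1996) and Bishop (2006), can be sketched in a few lines. The sketch below is a minimal illustration under stated assumptions, not the author's implementation: it uses the standard perceptron update (add the feature vector to the weights when a class C_1 example is misclassified, subtract it when a class C_2 example is misclassified) and folds the bias into the weight vector by appending a constant component to each feature vector. The toy data are hypothetical.

```python
import numpy as np

def step(a):
    """Step output of Eq. (2): +1 for a >= 0, -1 otherwise."""
    return np.where(a >= 0, 1, -1)

def classify(w, phi):
    """Activation of Eq. (1) followed by the step function.
    phi carries a constant 1 as its last component, so the last
    weight plays the role of the (negative) bias."""
    return step(phi @ w)

def train_perceptron(Phi, t, epochs=100):
    """Standard perceptron learning rule (an assumption; the text does
    not spell the algorithm out): on each misclassified pattern, move
    the weights towards t_n * phi_n.
    Phi: (n_samples, n_features) feature vectors, last column = 1.
    t:   target labels in {-1, +1}."""
    w = np.zeros(Phi.shape[1])
    for _ in range(epochs):
        errors = 0
        for phi_n, t_n in zip(Phi, t):
            if step(phi_n @ w) != t_n:
                w += t_n * phi_n          # update only on mistakes
                errors += 1
        if errors == 0:                   # converged: all patterns correct
            break
    return w

# Toy linearly separable problem (hypothetical data, for illustration only).
Phi = np.array([[1.0, 1.0, 1.0], [2.0, 1.5, 1.0], [-1.0, -1.0, 1.0], [-2.0, -0.5, 1.0]])
t = np.array([1, 1, -1, -1])
w = train_perceptron(Phi, t)
print(classify(w, Phi))                   # reproduces t when the data are separable
```

Consistent with the convergence theorem, the training loop terminates only because this toy data set is linearly separable; on non-separable data the updates would continue indefinitely.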

Figure 1: Diagrams contrasting the architectures of the neural networks discussed in this brief. A: The perceptron consists of a single unit (open circle) that takes a threshold of the scalar product of its feedforward weights (lines) with its inputs (filled circles). The perceptron can classify two-class linearly separable data. B: The multilayer perceptron (MLP) consists of one hidden layer of units (grey circles) between the input layer (filled circles) and the output layer (open circles). Each unit in the hidden layer computes a continuous threshold function of the scalar product of its feedforward weights (lines) with the values of the input layer. Each unit in the output layer takes the scalar product of its feedforward weights (lines) with the inputs received from the hidden layer. The MLP can fit nonlinear curves to classifications of data sets where the number of classes is determined by the number of outputs. C: The Hopfield network is a feedback network: the connections are drawn as directed edges (arrows) and the weights are symmetric. The Hopfield net functions as a content addressable memory by completing a disrupted pattern. Each unit operates as both an input and an output unit. D: The associative network learns an association between two binary patterns on the inputs (filled circles) and the outputs (open circles). At each crossing of an input line and an output line there is a binary synaptic weight. The weights in this diagram show the result of storing the association between the input pattern and the output pattern using the learning rule in the text. Each filled square represents a weight that has been set to one; all other weights on the grid are zero.

Multiclass data classification with a neural network: The multilayer perceptron

As we have seen, the classifications that can be performed by a single perceptron are limited to those that are linearly separable. This is an enormous restriction, since many interesting patterns in data give rise to feature vectors that are not linearly separable, a point demonstrated by Minsky and Papert (1969). However, this restriction does not apply to multilayer perceptrons (MLPs), and so these neural networks find far greater utility in learning data classifications.

The most common implementation of the multilayer perceptron uses three layers: an input layer, a hidden layer and an output layer, Fig. 1B. Multilayer perceptrons are fully feedforward networks, meaning that the graph describing their structure is acyclic. The input layer of the network provides the feature space into which the data must be mapped. Each component of the N-dimensional feature vector φ(x) is connected to every perceptron in the hidden layer, which contains H perceptrons. In turn, each perceptron in the hidden layer passes its output to every perceptron in the output layer, which contains K perceptrons. The classification of the input feature vector is read out from the output layer.

The activations of the units in the hidden layer (layer 1) are

a_j^(1) = Σ_l w_jl^(1) φ(x_l) + θ_j^(1)    (3)

where j ∈ {1, ..., H}, l ∈ {1, ..., N} and the θ_j^(1) are the biases of the hidden layer perceptrons. The activations of the units in the output layer (layer 2) are

a_i^(2) = Σ_j w_ij^(2) h_j + θ_i^(2)    (4)

where i ∈ {1, ..., K} and the θ_i^(2) are the biases of the output layer perceptrons. The outputs of the units in each layer are, respectively,

h_j = f^(1)(a_j^(1)),    y_i = f^(2)(a_i^(2))    (5)

where the output activation function f^(2) is the identity, f^(2)(a) = a. The form of the hidden activation function f^(1) can differ from implementation to implementation; for learning in the MLP, however, it is important that this function be a continuous non-linearity. Consequently the logistic sigmoid is chosen,

f^(1)(a) = 1 / (1 + exp(−a)).    (6)

Training the MLP adjusts the weights to minimise a sum-squared error function; in this manner the network fits a non-linear function to the data (the target classification). Again, a full discussion of the learning algorithm is beyond the scope of this brief, but see MacKay (2003) and Bishop (2006). One important algorithm is the backpropagation algorithm (Rumelhart, Hinton, and Williams, 1986).

The multilayer perceptron is a non-linear curve fitting device. The number of hidden units H that should be chosen in order to perform the fit is not constrained by the data but is a free parameter. Since increasing the number of hidden units increases the complexity of the model, we might expect the complexity of the function fit to the data to increase as well. Indeed this is the case, but it turns out that the complexity of the curve becomes independent of the number of hidden units as H → ∞ (Neal, 1996; MacKay, 2003) and is instead determined by the magnitude of the weights themselves. It may seem advantageous to choose the most complex fit possible with the available computational resources, but this is not the case. Since the input data contain noise that is unreproducible and peculiar to the example under consideration, it is possible for the neural network to overfit the data.
Overfitting results in a fit that is too close to one particular set of examples and leads to a decrease in performance when recognising examples that should belong to the same class but contain random variations (poor generalisation). The approach taken to prevent overfitting is to add a regulariser that penalises excessively large weights. One is then left with the problem of how to choose the regulariser so as to optimise the trade-off between specificity and generalisation performance (MacKay, 2003; Bishop, 2006).
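As an illustration of Eqs. (3)–(6) and of a sum-squared error with a weight penalty, the following sketch computes the MLP forward pass and a regularised error. It is a minimal example, not the author's code: the quadratic (weight-decay) penalty (λ/2) Σ w² and the value of λ are assumptions, since the text does not specify the form of the regulariser, and the network parameters are hypothetical.

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid of Eq. (6)."""
    return 1.0 / (1.0 + np.exp(-a))

def mlp_forward(phi, W1, theta1, W2, theta2):
    """Forward pass of the three-layer MLP, Eqs. (3)-(5).
    phi:    feature vector, shape (N,)
    W1:     hidden weights (H, N);  theta1: hidden biases (H,)
    W2:     output weights (K, H);  theta2: output biases (K,)"""
    a1 = W1 @ phi + theta1      # Eq. (3): hidden activations
    h = sigmoid(a1)             # Eq. (5): hidden outputs, f1 = logistic sigmoid
    a2 = W2 @ h + theta2        # Eq. (4): output activations
    return a2                   # Eq. (5): f2 is the identity, y = a2

def regularised_error(Y, T, weights, lam=0.01):
    """Sum-squared error plus a quadratic weight penalty (assumed form).
    Y, T: network outputs and targets, shape (n_samples, K)."""
    sse = 0.5 * np.sum((Y - T) ** 2)
    penalty = 0.5 * lam * sum(np.sum(w ** 2) for w in weights)
    return sse + penalty

# Tiny example with random (hypothetical) parameters: N = 4, H = 3, K = 2.
rng = np.random.default_rng(0)
W1, theta1 = rng.normal(size=(3, 4)), np.zeros(3)
W2, theta2 = rng.normal(size=(2, 3)), np.zeros(2)
phi = np.array([1.0, -1.0, 1.0, 1.0])
y = mlp_forward(phi, W1, theta1, W2, theta2)
print(y, regularised_error(y[None, :], np.array([[1.0, -1.0]]), [W1, W2]))
```

Training, for example by backpropagation, would adjust W1, W2 and the biases to reduce this regularised error; a larger λ penalises large weights more strongly and so favours smoother, less complex fits.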

Auto-associative memory: Hopfield networks

Hopfield networks can be used to create content addressable memories that store data in such a way that it can be retrieved by supplying a partial version of the original pattern. The graph of the connections in a Hopfield network is cyclic, and thus the network is a feedback rather than a feedforward network, Fig. 1C. A Hopfield network is constructed from N neurons, where each neuron i ∈ {1, ..., N} is connected to every other neuron j by a symmetric connection w_ij such that w_ij = w_ji. There are no self connections: w_ii = 0. Each neuron can also have a bias w_i0, which can be considered as resulting from feedforward input from a zeroth neuron with constant activity. The activation of each neuron is

a_i = Σ_j w_ij x_j.    (7)

For the binary Hopfield network the output is a threshold function of the activation, as in Eq. (2), with x_i = f(a_i) ∈ {−1, 1}. Alternatively, for a continuous Hopfield network whose outputs vary between −1 and 1, the output function is x_i = f(a_i) = tanh(a_i).

In feedback networks the output of each neuron is also an input to other neurons. Because of this, the updates of the neurons' outputs can be either synchronous or asynchronous. In the synchronous case all neurons calculate their activations according to Eq. (7) and then update their outputs only after the calculation of all other activations is complete. In the asynchronous case each neuron in turn calculates its activation and updates its output before the activations of the other neurons are calculated.

For some fixed set of weights, the Hopfield network stores data as features in the N-dimensional phase space defined by the activity variables x_i. Each memory is stored as a fixed point of the activity of the network. The partial pattern to be recalled is presented as the initial condition of the activity of the network. After some time the network converges upon the fixed point whose basin of attraction contains that initial condition. The aim is that this fixed point should be the complete pattern of which the initial condition is a partial version. The values of the weights in the Hopfield network determine the locations of the fixed points in the activity space. For an asynchronous Hopfield net to store a set of M patterns {x^(m)}, m ∈ {1, ..., M}, the weights are set with one-shot learning according to

w_ij = α Σ_m x_i^(m) x_j^(m)    (8)

where α is a constant.

How do we know that the simple operation in Eq. (8) ensures that the Hopfield net recalls the given patterns? Asynchronous Hopfield networks have Lyapunov functions of their dynamics. A Lyapunov function is a function that always decreases under the evolution of the system, the existence of which ensures that the system settles upon a fixed point. Proof of this is beyond the scope of this brief, but see MacKay (2003).

What about synchronous Hopfield nets? In the synchronous case we can guarantee that the activity of the Hopfield net settles upon a fixed point as long as time is continuous. In this case the activations a_i(t) and the activities of the neurons x_i(t) are defined as continuous functions of time,

a_i(t) = Σ_j w_ij x_j(t)    (9)

and each neuron's response to its activation is filtered with some time constant τ,

dx_i(t)/dt = −(1/τ) (x_i(t) − f(a_i(t)))    (10)

where the output function is the hyperbolic tangent, x_i = f(a_i) = tanh(a_i).
The synchronous continuous-time Hopfield net has a Lyapunov function and is thus guaranteed to settle to a fixed point if the weights are set with Eq. (8). As mentioned, due to the presence of the Lyapunov function, the Hopfield network is guaranteed to settle into some stable state (output pattern) when supplied with an initial activity state (input pattern). However, as the number of patterns stored in the network is increased, there comes a point at which the output patterns become garbled and are no longer valid completions of the input patterns. This failure of the Hopfield net is graceful rather than catastrophic: typically, as the limit of overload is approached, some memories survive with a minority of bits flipped. The limiting number of stored patterns at which this transition from memory storage to spurious spin-glass states occurs is M = 0.138N (Amit, Gutfreund, and Sompolinsky, 1985).
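The sketch below illustrates the binary asynchronous Hopfield net: weights set by the one-shot rule of Eq. (8), asynchronous threshold updates using Eq. (7), and recall of a stored pattern from a corrupted cue. It is a minimal illustration rather than a definitive implementation; the energy used to monitor convergence, E = −½ Σ_ij w_ij x_i x_j, is the standard Lyapunov function for this network (the text refers to the Lyapunov function without giving its form), and α = 1/N is an assumed choice of the constant in Eq. (8).

```python
import numpy as np

def store(patterns):
    """One-shot Hebbian storage, Eq. (8), with alpha = 1/N (assumed).
    patterns: array of shape (M, N) with entries in {-1, +1}."""
    M, N = patterns.shape
    W = (patterns.T @ patterns) / N
    np.fill_diagonal(W, 0.0)              # no self connections, w_ii = 0
    return W

def energy(W, x):
    """Standard Hopfield energy; it decreases under asynchronous updates."""
    return -0.5 * x @ W @ x

def recall(W, x0, sweeps=10):
    """Asynchronous recall: update one unit at a time with Eq. (7)
    followed by the step non-linearity of Eq. (2)."""
    x = x0.copy()
    for _ in range(sweeps):
        changed = False
        for i in np.random.permutation(len(x)):
            a_i = W[i] @ x                 # Eq. (7)
            new = 1 if a_i >= 0 else -1    # Eq. (2)
            if new != x[i]:
                x[i] = new
                changed = True
        if not changed:                    # fixed point reached
            break
    return x

# Store two hypothetical patterns, then recall one from a corrupted cue.
rng = np.random.default_rng(1)
patterns = rng.choice([-1, 1], size=(2, 100))
W = store(patterns)
cue = patterns[0].copy()
flip = rng.choice(100, size=15, replace=False)
cue[flip] *= -1                            # corrupt 15 of the 100 bits
out = recall(W, cue)
print("bits wrong before:", np.sum(cue != patterns[0]),
      "after:", np.sum(out != patterns[0]),
      "energy:", energy(W, out))
```

With only two patterns stored across 100 units the network is well below the M = 0.138N limit, so the corrupted bits should be restored; storing many more patterns would instead yield the garbled, spin-glass-like outputs described above.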

Associative networks

One distinctive feature of biological learning is the ability to form associations between stimuli (for example, in the famous case of Pavlovian, or classical, conditioning, where a dog learns to associate the ring of a bell with being fed). Associative networks are feedforward neural networks that learn an association between two patterns. When one pattern is presented to the inputs of the network, the outputs of the associative network produce the pattern that has been associated with it. Adjustment of the weights allows this pairing between an arbitrary input pattern and an arbitrary output pattern.

The associative network can be envisaged as a grid of horizontal input lines and vertical output lines. At the intersections between these lines, the points of the grid, are the weights. One edge of the input lines terminates in an array of points that are the N_I inputs; one edge of the output lines terminates in an array of points that are the N_O outputs, Fig. 1D. The inputs x_i ∈ {0, 1}, i ∈ {1, ..., N_I}, are binary, the outputs y_j ∈ {0, 1}, j ∈ {1, ..., N_O}, are binary, and the weights at each grid point, w_ij ∈ {0, 1}, are also binary. When a pattern x^(k), k ∈ {1, ..., K}, is presented, each output is set to one, y_j = 1, if its dendritic sum

d_j^(k) = Σ_i w_ij x_i^(k)    (11)

is equal to the input activity

a^(k) = Σ_i x_i^(k)    (12)

and is set to zero, y_j = 0, otherwise (Willshaw, Buneman, and Longuet-Higgins, 1969; Buckingham and Willshaw, 1992).

Associations are stored in the network by applying the input pattern to be associated to the inputs x while the output pattern to be associated is applied to the outputs y. The following rule is then applied to each weight w_ij connecting input line i and output line j:

w_ij = 1 if x_i = 1 and y_j = 1; w_ij is left unchanged otherwise (all weights start at zero).    (13)

There is a rich literature dealing with the optimisation of the storage capacity of associative networks under differing conditions (Dayan and Willshaw, 1991). An important factor in determining the performance of associative networks is the sparsity of the patterns: sparse patterns have only a small fraction of their units active. Consider N_I inputs, of which M_I are active on average in each stored pattern, associated with N_O outputs, of which M_O are active on average. For the simple network described here, the number of associations R that can be stored before the expected number of spuriously firing output units approaches one is limited by

R ≤ −(N_I N_O / M_I M_O) ln(1 − 1/N_O)

(Buckingham and Willshaw, 1992).
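The storage rule of Eq. (13) and the recall rule of Eqs. (11)–(12) translate directly into code. The example below is a minimal sketch with hypothetical sparse patterns: the weights are held in a binary matrix that starts at zero and is clipped to one, as described above.

```python
import numpy as np

def store_association(W, x, y):
    """Eq. (13): set w_ij to 1 wherever input line i and output line j
    are both active; previously set weights are left unchanged."""
    W |= np.outer(x, y)              # W has shape (N_I, N_O), entries in {0, 1}
    return W

def recall(W, x):
    """Eqs. (11)-(12): an output fires iff its dendritic sum equals
    the number of active inputs in the presented pattern."""
    d = x @ W                        # dendritic sums, one per output line
    a = x.sum()                      # input activity
    return (d == a).astype(int)

# Hypothetical sparse patterns: N_I = 12 inputs, N_O = 10 outputs.
rng = np.random.default_rng(2)
N_I, N_O = 12, 10
W = np.zeros((N_I, N_O), dtype=int)

pairs = []
for _ in range(3):                   # store three associations
    x = np.zeros(N_I, dtype=int); x[rng.choice(N_I, 3, replace=False)] = 1
    y = np.zeros(N_O, dtype=int); y[rng.choice(N_O, 2, replace=False)] = 1
    store_association(W, x, y)
    pairs.append((x, y))

for x, y in pairs:
    print(recall(W, x), y)           # recalled pattern vs. stored output pattern
```

With only three sparse associations stored, the recalled patterns should match the stored output patterns; as more associations are added, additional output units begin to fire spuriously, which is exactly the failure mode quantified by the capacity expression above.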

References

Amit, D.J., H. Gutfreund, and H. Sompolinsky (1985). Storing infinite numbers of patterns in a spin-glass model of neural networks. Physical Review Letters 55: 1530-1533.

Bishop, C.M. (2006). Pattern Recognition and Machine Learning. Springer-Verlag, New York.

Buckingham, J. and D. Willshaw (1992). Performance characteristics of the associative net. Network 3: 407-414.

Dayan, P. and D.J. Willshaw (1991). Optimising synaptic learning rules in linear associative memories. Biological Cybernetics 65: 253-265.

MacKay, D.J.C. (2003). Information Theory, Inference and Learning Algorithms. Cambridge University Press, Cambridge, UK.

Minsky, M. and S. Papert (1969). Perceptrons. MIT Press, Cambridge, Mass.

Neal, R. (1996). Bayesian Learning for Neural Networks. Springer, Berlin.

Rojas, R. (1996). Neural Networks. Springer, Berlin.

Rosenblatt, F. (1958). The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review 65: 386-408.

Rumelhart, D.E., G.E. Hinton, and R.J. Williams (1986). Learning internal representations by error propagation, pp. 318-362. MIT Press, Cambridge, Mass.

Willshaw, D.J., O.P. Buneman, and H.C. Longuet-Higgins (1969). Non-holographic associative memory. Nature 222: 960-962.