
15. NEURAL NETWORKS

15.1 Introduction to the Chapter

In this chapter, we provide a new perspective on neural networks, both feedforward and feedback networks. In section 15.2, we provide a small primer on neural networks. In addition to providing the necessary details (which will be of use to those who are not familiar with neural networks) in a concise manner, we try to remove the mystery and hype surrounding these networks and look at them in a more critical and objective way. Thus, we concentrate on what they can really do, how they do it, what their advantages and disadvantages are, and so on. One key aspect of neural networks we try to highlight in this and the next section is that they represent special classes of nonlinear networks, and hence we can learn quite a lot by understanding nonlinear networks and their dynamics. In section 15.3, we consider recurrent neural networks (RNNs), or neural networks with internal feedback, and discuss some of the well-known models. In the case of recurrent neural networks, stability, or the lack of it, is a major concern, and we discuss some of the existing approaches to prove the stability of RNNs. This section highlights the slow progress in the design of RNNs due to the lack of structured methods for obtaining RNN architectures with a guaranteed stability property and the problems in training such networks. In section 15.4, we discuss a new approach based on the building block concept (seen in earlier chapters for designing complex, stable nonlinear networks) to obtain new and complex RNN architectures and to develop training or learning algorithms. We also show that existing RNN architectures can be derived by placing specific constraints on the architectures obtained using the building block approach. We also indicate that in RNN architectures no energy function is being minimized; rather, a situation of power balance (between sources and sinks) is reached.

15.2 Neural Networks: A Primer

15.2.1 Basic Terminology and Functions

In chapter # and later chapters, we argued that a system could be viewed as performing a mapping from an input domain to an output domain. Here, we will look at neural networks from the same perspective. We start the discussion by looking at the different levels (in order of increasing complexity) of signal/information processing described in chapter #.

At the bottom level of the hierarchy, we have static, linear processing. For the special case of a one-input, one-output transformation, this can be mathematically represented as:

    y = a x + b        (15.1)

where a and b are two constants. Viewed as a functional mapping, this transformation can be represented graphically as a straight line connecting the input and the output, as shown in Fig. 15.1.

Figure 15.1. A simple one-input, one-output system with a linear functional relationship (a straight line of slope a and intercept b).

A number of points are worth repeating here. The input variable (and hence the output variable) can take values in the real space R. The mapping is restricted (to a straight line) and is represented by the two parameters a and b, representing the two degrees of freedom available. If observations of the input x at different times and the corresponding outputs y are available, the values of the parameters can be found by a curve (straight-line) fitting algorithm. The model with the resulting parameter values can be used at later times to predict the output given the input.
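As a concrete illustration of the straight-line fit described above, the sketch below estimates a and b in (15.1) from noisy input/output observations using ordinary least squares. It is only a minimal example of the "curve (straight-line) fitting algorithm" mentioned in the text; the data, noise level, and variable names are assumptions, not part of the original.

```python
import numpy as np

# Hypothetical training set: observations (x, y) generated by y = a*x + b plus noise.
rng = np.random.default_rng(0)
a_true, b_true = 2.0, -1.0
x = rng.uniform(-5, 5, size=50)                                  # stimulus samples
y = a_true * x + b_true + rng.normal(scale=0.3, size=x.shape)    # responses

# "Training" = least-squares estimation of the two degrees of freedom a and b.
A = np.column_stack([x, np.ones_like(x)])                        # design matrix [x, 1]
(a_hat, b_hat), *_ = np.linalg.lstsq(A, y, rcond=None)
print(f"estimated a = {a_hat:.3f}, b = {b_hat:.3f}")

# Prediction at a later time, given a new input.
x_new = 1.7
print("predicted y:", a_hat * x_new + b_hat)
```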
In the neural network terminology, the input is known as the stimulus, the output as the response (emitted by the system), the numerical I/O data as the training set, and the

parameter estimation process as learning or training. In this particular example, both the samples of the stimulus and the response are assumed to be available for training. Such a learning or training process is known as supervised or teacher-based learning. We will see later examples of the other learning category known as unsupervised learning, also known as self-organization. An important observation that can be made from this simple single-input, single-output system is that the number of degrees of freedom is restricted to two {or some constant for the general case} for linear processing and cannot be increased any further. In other words, any linear representation involving more than two variables {or degrees of freedom} can be reduced to a representation involving only two variables. This is not so for nonlinear processing, which makes it more complicated and interesting.

By relaxing the linearity constraint, we obtain the next level of processing: static, nonlinear processing. Depending upon the nonlinear primitive(s) employed, we can obtain different mathematical representations. For example, in the simple one-input, one-output case we could have:

    y = Σ_{i=0}^{M} a_i x^i        (15.2)

or

    y = Σ_{i=0}^{N} b_i tanh[c_i x]        (15.3)

The first expression simply involves a polynomial (of order M) representation with the various powers of the input as the nonlinear primitives, or as the basis of representation, whereas the second involves weighted transcendental functions of the scaled input. The coefficients a_i (i = 0 to M) and b_i, c_i (i = 0 to N) represent the degrees of freedom available in modeling the given system using the two representations. As we noted in chapter #4, and as the two examples above illustrate, the leap from linear to nonlinear is not that simple. The choice of nonlinear primitives is unlimited. The nonlinear primitives and nonlinear representations have to be chosen to represent the mapping process properly. That is, one representation may be good for one set of systems stimulated by a particular set of inputs, but may fail miserably with other sets of systems and/or inputs, even if the degrees of freedom are increased considerably. Further, the training data has to be a good representation of the stimuli seen by the system in its past as well as the inputs that will be seen in the future. Thus, input and output probability functions play a major role in the approximation (training or learning) process. Though researchers in the neural network area assume the probability functions are unknown and claim that they need not be known a priori, the conditions that good representative data must be available and must be used are very important.[1]

15.2.2 Primitives of Neural Architectures & Feedforward Neural Networks

The earliest neural networks fall under the class of static, nonlinear mappings. Neural network researchers handled the explosion in the choice of nonlinear primitives by assuming that there will be one and only one {or one class of} nonlinear primitive in their architectures. Thus, in addition to the two basic building blocks {a multiplier for multiplication of a signal by a constant value, and an adder} used in static linear processing, a third building block, called a neuron, is used in neural networks. A neuron, shown as a block diagram in Fig. 15.2a, is simply a one-input, one-output memoryless nonlinear system or function-primitive that transforms (or transduces) an unbounded input activation signal x into a bounded output signal y = s[x(t)] by the transformation s[·].
The bounded output signal is thus a main characteristic of the nonlinear primitives used in neural networks and is a result of practical considerations {and of interpretation from a neurophysiological perspective}. The mapping function s[·] is usually a sigmoidal or "s"-shaped curve. Some of the common functions are the hyperbolic tangent function and the logistic function,

    s[x] = tanh[x] = (e^x − e^(−x)) / (e^x + e^(−x))        (15.4)

    s[x] = 1 / (1 + e^(−x))        (15.5)

shown in Fig. 15.2b. It should be noted that the hyperbolic tangent function goes from (−1) to (+1), and the logistic function goes from 0 to 1, as the input variable x goes from −∞ to +∞. Both these functions are bounded and saturating, continuously differentiable, and monotonically increasing, characteristics commonly associated with the nonlinear primitives of neural networks. Further, by scaling the input x to cx (c > 0) prior to inputting it to the neuron[2], we can make the slope of the overall mapping (from x to y) small or large. For large values of c (c → ∞), both functions approach the threshold function, with outputs of {0, 1} or {−1, 1}, as shown in Fig. 15.2b.

[1] We will see shortly how these issues have been resolved in the neural network area.
[2] In practice, the scaling factor c is incorporated within the nonlinear primitive itself.
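A short sketch of the two activation functions in (15.4) and (15.5), including the effect of the scaling factor c discussed above: as c grows, s[cx] approaches a hard threshold. This is an illustrative example only; the sample points and the values of c are arbitrary choices.

```python
import numpy as np

def tanh_neuron(x):
    """Hyperbolic tangent neuron, eq. (15.4): output in (-1, 1)."""
    return np.tanh(x)

def logistic_neuron(x):
    """Logistic neuron, eq. (15.5): output in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-4, 4, 9)
for c in (1.0, 5.0, 50.0):   # larger c -> steeper slope -> nearly a threshold function
    print(f"c = {c:5.1f}  logistic(c*x) =", np.round(logistic_neuron(c * x), 3))
```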

Thus, the output of the neurons becomes binary or bipolar. A neuron whose output signal approaches the value one might be considered a winning neuron, whereas one whose signal level remains near zero (or minus one) a losing neuron.

Figure 15.2. a) The block diagram of a neuron, a nonlinear primitive used in neural networks; b) Commonly used characteristics for the neuron (logistic, sigmoidal, and threshold functions).

We indicated earlier that the choice of the nonlinear primitives is crucial for proper modeling of a given nonlinear mapping. Thus, we may wonder what effect fixing the nonlinear primitives as neurons will have on the modeling and what the neural researchers have to say about this. Neural network researchers argue that the use of a very large number {approaching millions and more} of the same nonlinearity is sufficient to model any kind of nonlinear mapping seen in practice. The key phrase here is "in practice". We can construct arbitrary mappings to show that architectures composed of only neurons cannot do the job of approximation properly. Thus, care has to be exercised in the use of neural architectures, and a one-size-fits-all approach will only lead to failure and disappointment concerning the real potential of neural networks.

In general, a static neural network architecture[3] {also known as a feedforward neural net} is formed from the three primitives mentioned above and fixed bias inputs. Similar to the human brain with roughly 100 billion neurons, it is assumed that a useful artificial neural net or ANN {artificial, to differentiate it from the real neural nets} will have millions of nonlinear neurons. The neurons are grouped into different fields {or layers for feedforward nets}, and neurons in a particular field receive, as inputs, the sum of weighted signals from neurons in preceding fields and bias signals, as shown in Fig. 15.3. The summing junction is known as the synaptic junction, the weights as synaptic weights or synaptic efficacies {classified as excitatory if the weights are positive and as inhibitory if the weights are negative}, and the matrix consisting of the synaptic weights as the synaptic matrix or connection matrix. The synaptic junction of each neuron may receive signals from a large number {10^4 in the case of the human brain} of neurons from the preceding field. Thus, the factors that make a neural net unique among all static, nonlinear mapping systems are: 1) the use of few primitives or building blocks; 2) the use of a large number of the only nonlinear building block, the neuron; 3) the use of a very large number of trainable synaptic weights; and 4) the use of a very large number of interconnections.

Referring to Fig. 15.3a of a feedforward neural net, we find a number of inputs going into the input layer {a distribution terminal}, an output layer, and a number of intermediate layers known as hidden layers. From a nonlinear functional mapping perspective, the presence of hidden layers provides the capability to implement reasonable mappings using the simple, and only, nonlinear primitive. In fact, earlier research in neural nets stalled temporarily when it was shown that even a simple mapping such as the two-input Exclusive-Or problem {two binary inputs and one binary output, as shown in Fig. 15.4} cannot be implemented using a single-layer network. It is now widely accepted that two hidden layers are sufficient for an acceptable mapping.
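To make the XOR discussion concrete, the sketch below checks one possible realization of the two-input XOR map using a single hidden layer of threshold neurons. The particular weights and thresholds are a common textbook choice and an assumption here, not necessarily the realization drawn in Fig. 15.4c.

```python
import numpy as np

def step(v):
    """Threshold neuron: 1 if the activation is positive, else 0."""
    return (v > 0).astype(int)

def xor_net(x1, x2):
    x = np.array([x1, x2])
    # Hidden layer: h1 fires for "x1 OR x2", h2 fires for "x1 AND x2".
    h = step(np.array([[1, 1], [1, 1]]) @ x - np.array([0.5, 1.5]))
    # Output: OR minus AND, i.e. fires when exactly one input is 1.
    y = step(np.array([1, -1]) @ h - 0.5)
    return int(y)

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", xor_net(x1, x2))
```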
As indicated before, the weights are found using I/O data samples and a training procedure. Neural network researchers term such estimation model-free estimation since (according to them) they do not use a mathematical model describing how a system's outputs depend on its inputs. However, considering Fig. 15.3b, where we have shown a three-layer network, the outputs can be written in terms of the inputs as:

[3] This definition doesn't take into consideration that the inputs to this network might come from a time series and be updated continuously.

    x_{1i} = s[ Σ_{j=1}^{N} w_{1ji} x_j + I_{1i} ] ;   i = 1 to N_1
    x_{2i} = s[ Σ_{j=1}^{N_1} w_{2ji} x_{1j} + I_{2i} ] ;   i = 1 to N_2
    y_i = s[ Σ_{j=1}^{N_2} w_{3ji} x_{2j} + I_{3i} ] ;   i = 1 to M        (15.6)

Figure 15.3. a) The various elements (input nodes, synaptic weights, synaptic junction, bias input, fields or layers, output nodes) leading to a feedforward neural network; b) An N-input, M-output three-layer neural network architecture (input layer, first and second hidden layers, output layer).

Figure 15.4. a) Logic symbol of an XOR function; b) The truth table or input/output map, assuming that both the inputs and the output take only binary values; c) A neural network realization of the XOR function; d) The I/O map as the inputs are varied from 0 to 1.
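The layered map in (15.6) can be written compactly as repeated matrix multiplications followed by the neuron nonlinearity. The sketch below is a generic forward pass under assumed layer sizes and random weights; it is not tied to any particular trained network.

```python
import numpy as np

def forward(x, weights, biases, s=np.tanh):
    """Evaluate eq. (15.6): each layer computes s(W^T x + I)."""
    a = x
    for W, I in zip(weights, biases):
        a = s(W.T @ a + I)
    return a

rng = np.random.default_rng(1)
N, N1, N2, M = 4, 6, 5, 3                              # assumed layer widths
sizes = [(N, N1), (N1, N2), (N2, M)]
weights = [rng.normal(size=sz) for sz in sizes]        # w_1ji, w_2ji, w_3ji
biases  = [rng.normal(size=sz[1]) for sz in sizes]     # I_1i, I_2i, I_3i

y = forward(rng.normal(size=N), weights, biases)
print("outputs:", np.round(y, 3))                      # M bounded outputs in (-1, 1)
```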

Thus, there is an underlying model involving the various primitives, though the model may be at a micro level. The contention that there is an underlying model becomes more valid in the case of other, more complex neural networks such as the recurrent neural networks that we will see shortly. Before we do that, let's look at possible applications of neural nets and the training of FFNN weights.

15.2.3 Implementational Issues

As we learned in the previous subsection, an artificial neural network is formed from dense interconnection among a large number of elements taken from a very limited primitive set. Within that domain, we do have a number of choices concerning the signals to be processed, implementational issues, and applications. The inputs and/or the outputs can be continuous signals, digital signals, or binary/bipolar signals. The continuous signals can be processed using analog electronic or analog optical implementations, or using digital technology with A/D converters at the front end and D/A converters at the output end. The digital signals can be processed using digital hardware based on binary logic. The binary/bipolar signals can be processed using analog or digital architectures. Also, we can either time-multiplex one or a few processing units or use a massively parallel architecture. It may be true that the human brain {the true neural net} performs its tasks using a nonlinear, massively parallel, asynchronous {no underlying clock} and feedback architecture. However, some or any of these may or may not apply to artificial neural nets, depending on how they have been designed and implemented. Hence, caution must be exercised when coming across such hyperbole.

15.2.4 Techniques for Signal Encoding

Another unique property of neural architectures is the technique used for signal encoding. We all know that B bits (and B channels in an implementational sense) can be used to uniquely denote 2^B symbols, or to denote 2^B quantization levels in a weighted scheme. Here, the representation is obtained by minimizing (to the lowest level) the number of bits or channels needed in the representation. Of course, such a representation is neither fault tolerant[4], nor does it provide the ability for graceful degradation[5]. In digital computing and communication, an accepted solution to these problems is the use of error detection and correction schemes, where we basically add a few more redundant (from an information point of view) bits to the code corresponding to each symbol in a systematic manner. We know that the higher the number of bits we add per symbol, the better the error correction / error detection properties will be. For example, the addition of just one bit per symbol code {through the use of a bit-parity scheme, for example} enables us to detect whether an odd number of bits have changed. Adding one bit alone doesn't help us in detecting a change in an even number of bits, or in determining which bits were corrupted. We can, of course, add two or more bits to have better error detection and correction capability. Neural architectures take this concept to the extreme, where B bits are used to denote just B symbols or B quantization levels. Such an encoding is known as distributed coding and leads to a redundant encoding (we will have more than one code representing the same symbol).

[4] Outputs remain unchanged even if some of the parameters change.
[5] Minimal change in the performance for some changes in the parameters.
The redundancy leads to better fault tolerance (deviations in the weights, for example, will have minimal effect on the input/output mapping) and graceful degradation (removing certain connections, for example, will not drastically change the mapping) properties. Of course, as the number of channels and the associated elements has increased enormously, it can be argued that the chances for failure are higher. More on this as we look at the applications.

15.2.5 Applications of Neural Nets

By constraining the inputs and outputs to special domains (of values), we can gain some knowledge about possible applications for artificial neural nets. For example, if both the inputs and outputs are multivalued (in amplitude or magnitude), an ANN architecture can be configured to handle such a situation by: 1) making the input layer and the output layer linear and using an analog implementation (see Fig. 15.5a), or 2) encoding the inputs to binary/bipolar signals at the input layer, using a digital neural net implementation, and decoding to continuous values at the output layer (Fig. 15.5b). Such architectures can be used a) for modeling of physical nonlinear systems (Fig. 15.5c), and b) as nonlinear controller architectures, as shown in Fig. 15.5d. In the latter application, though the controller is a feedforward or static architecture with a guaranteed stability property, the controller and plant/process combination forms a closed-loop system whose stability may be open to question. This problem will be addressed as we look at feedback neural network architectures. Another example of multivalued-inputs to multivalued-outputs mapping is that of cleaning up a noisy signal (of N samples) or a noisy image (of N × N samples). Here, the training will be performed using (deliberately) corrupted images as the inputs and the original images as the outputs. It should be added here that, unlike linear filtering, where scaling and delaying (and/or rotation in the case of images) will not have any effect on the filtering capability of the system, the performance of ANNs (nonlinear systems) can vary widely depending on the signal strength. Hence, techniques such as prescaling before inputting the data to the ANN need to be considered.
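Since the performance of a nonlinear ANN depends on signal strength, inputs are usually prescaled to a fixed range before being presented to the network. The sketch below shows one common choice, min-max scaling to [0, 1]; the exact scheme is an assumption on my part, as the text only says such prescaling needs to be considered.

```python
import numpy as np

def minmax_scale(data, lo=0.0, hi=1.0):
    """Rescale each input channel (column) linearly to the range [lo, hi]."""
    dmin = data.min(axis=0)
    dmax = data.max(axis=0)
    span = np.where(dmax > dmin, dmax - dmin, 1.0)   # avoid division by zero
    return lo + (hi - lo) * (data - dmin) / span

raw = np.array([[10.0, -3.0], [250.0, 0.5], [120.0, 4.0]])   # hypothetical samples
print(minmax_scale(raw))
```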

A second possibility is where the inputs are multivalued and the outputs are constrained to be binary or bipolar (see Fig. 15.6). Such architectures are useful in classification and recognition applications. An example belonging to this category is the recognition of a particular image (an F-16 fighter plane, for example) from a (known) set of M possible images, given a gray-scale image as input. We can use this gray-scale image, or the features extracted from it, as the inputs and M binary outputs to indicate the presence of one of the M possible images. As discussed in the case of phoneme recognition, the recognition system has to take into account the possibility that the input image can be a scaled, translated, and rotated version of the original, known image of the same object. This problem can be solved by choosing features that are scale/rotation/translation invariant or by letting the neural net learn such possibilities through training with different versions of the same image. Another example of multivalued-inputs to binary/bipolar-outputs mapping is in speech processing (Fig. 15.7). In vowel recognition, the first two formant frequencies are used to identify the vowel phonemes. Here, we have two multivalued inputs and binary outputs to indicate one of the forty or so phonemes.

Figure 15.5. a) An analog neural net implementation with continuous inputs and outputs; b) A digital architecture with continuous inputs and outputs (sampler and encoder at the input, decoder and D/A converter at the output); c) A neural network used for modeling the behavior of a given plant; d) A neural net used as a nonlinear controller of a given plant.

Figure 15.6. a) A neural net with multivalued inputs and binary or bipolar outputs; b) Gray-scale (multivalued) images belonging to the class of planes. The neural net may have binary/bipolar outputs indicating whether a given image belongs to the class of planes or not, or outputs indicating the subclasses within the plane class.

Figure 15.7. a) The first two formant frequencies (F1, F2) and their relationship to some vowel phonemes in the English language; b) A block diagram of a neural network that identifies the occurrence of a particular vowel in a segment of speech based on the first two formant frequencies.

Finally, both the inputs and outputs can be binary. A typical example is again in image recognition. In computers, for example, the English letters are represented by a binary matrix of M × N bits (M = 12, N = 9). Assuming that only the upper-case letters are possible, we have only 26 possibilities. A neural net used to recognize the occurrence of one of these 26 letters will have the M × N (108) bits as inputs and 26 bits as outputs (under distributed encoding), as shown in Fig. 15.8. Another use is in the "associative memory" application, where an N-bit binary string (vector a) is associated with an M-bit binary string (vector b) (M, N very large), with the association known in advance. The ANN can be used to a) provide the associated vector b given the vector a, or b) provide both a and b given partial information on a and b. We examine the implications of such applications below.

15.2.6 Redundancy in Coding & Redundancy in Problem Domain

Earlier, we discussed redundancy introduced through distributed encoding and other similar methods. By studying the two problems discussed above, we can see redundancy arising from the problem domain.
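A small sketch of the distributed encoding used in the letter recognition task: each of the 26 classes gets its own output line, so 26 bits denote just 26 symbols, whereas a weighted binary code would need only 5 bits. The mapping below is illustrative only and is not taken from the chapter.

```python
import numpy as np

LETTERS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def distributed_code(letter):
    """26-bit one-per-class (distributed) code for an upper-case letter."""
    code = np.zeros(len(LETTERS), dtype=int)
    code[LETTERS.index(letter)] = 1
    return code

def weighted_code(letter, bits=5):
    """Minimal weighted binary code: 2^5 = 32 >= 26 symbols."""
    index = LETTERS.index(letter)
    return [(index >> b) & 1 for b in reversed(range(bits))]

print("G distributed:", distributed_code("G"))
print("G weighted   :", weighted_code("G"))
```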

Figure 15.8. a) A 12 × 9 binary representation of the English letters; b) Block diagram of a neural net architecture (108 binary inputs, 26 binary outputs) to identify one of the 26 possible upper-case letters. Both x_i and y_i are binary.

Considering the alphabet recognition task, we have 108 input bits (assuming M = 12, N = 9) and 26 output bits. Thus, the input can take 2^108 (about 3.2 × 10^32) unique combinations, the output can take 2^26 (about 6.7 × 10^7) unique combinations, and the system I/O state can be one of the 2^(108+26) possible states. Out of these over 10^32 input states, only 26 states are valid states under the ideal condition (no input bits corrupted). If one is daring enough, he (she) can go ahead and design a digital logic circuit (with 108 bits as inputs and 26 bits as outputs). In such a design, the other input combinations will be treated either as a) don't-cares, implying that the outputs corresponding to these combinations can be anything[6], a common practice in digital design, or b) by forcing the outputs corresponding to input combinations that are close to the 26 ideal combinations to the representations of one of those 26 valid states. The latter approach is sort of a trademark of neural networks. As we train the system, the weights converge to values such that the ANN points to the same class (here, one of the 26 classes of letters) even if some input bits are in error. We can also specifically train the net as we do in the case of digital design. Because of this, and the presence of a very large number of weights, etc., the correct class is identified even when certain weights are changed and/or the interconnections are disturbed. Hence the fault tolerance and graceful degradation properties of neural architectures.

[6] Of course, the digital system designed will produce a specific output corresponding to any (and all) input combinations.

15.2.7 Storage Capacity & Crosstalk

It should be clear that properties such as fault tolerance and graceful degradation are achieved by introducing redundancy, a tradeoff omnipresent in all engineering applications. In the above example, we store only 26 classes. Suppose we excite the network with all possible combinations of the input, observe the corresponding outputs, and classify them into unique classes or vectors. We can generally expect a) still only 26 classes, or b) c (26 << c << 2^26) classes. That is, the network splits the entire input combination space into a) 26 classes or b) c classes. This value (26 or c) can be considered the storage capacity of the network and will be much smaller than the total number of possible input combinations. Thus, fault tolerance is achieved at the cost of reduced storage capacity. We will look at storage capacity further when we study feedback NNs. Suppose the same network (which has already been trained to recognize the 26 upper-case letters) is trained further to recognize the 26 lower-case letters as well (52 classes in all). After the new training, it is quite possible that the new classes make it difficult to recognize the older classes, or that similar classes get represented by the same class, etc. In NN terminology, such a possibility is known as crosstalk.

15.2.8 Approximation or Training or Learning

15.3 Recurrent Neural Nets

15.3.1 Basic Concepts

A key feature of the human brain is the presence of feedback, which gives rise to its dynamical behavior and memory capability. ANNs that are patterned after the human brain thus need to incorporate feedback as well.
From earlier chapters, we know that feedback provides the memory capability, and that for a feedback system to be realizable there should not be any delay-free loops {closed paths}. Thus, in a continuous ANN architecture, each and every loop will contain at least one integrating (or differentiating, though the former is preferred due to its low sensitivity to noise) element, whereas the loops in a discrete ANN architecture will have at least one delay

unit. Thus, a mathematical model for a feedback neural net {also known as a recurrent neural net or RNN} can be written in state-space form as:

    continuous case:   dx_c/dt = f_c[x_c]        (15.7a)
    discrete case:     x_D((n+1)T) = f_D[x_D(nT)]        (15.7b)

where x_c, f_c[·], etc., are vectors. The models given above are the same as the representations for nonlinear dynamical systems. Thus, RNNs are indeed special[7] cases of nonlinear dynamical systems. The models can be rewritten to indicate the presence of the synaptic weight matrix M and the input (bias or external) signals i as:

    continuous case:   dx_c/dt = f_c[x_c, M, i]        (15.8a)
    discrete case:     x_D((n+1)T) = f_D[x_D(nT), M(nT), i]        (15.8b)

and the training or learning property can be represented as:

    continuous case:   dm_c/dt = f̂_c[x_c, m_c]        (15.9a)
    discrete case:     m_D((n+1)T_1) = f̂_D[x_D(nT), m_D(nT_1)]        (15.9b)

where we have used m to indicate the column vector formed from the elements of the matrix M. The above equations simply indicate that the differentials of the weights go to zero as the weights reach their constant, steady-state values. From a careful look at the above representations, especially the discrete ones, two important observations can be made. First, we have represented the discrete models as synchronous (due to our familiarity with such models and the relative ease with which such models can be handled), whereas, on reflection, it should be clear that the human brain works asynchronously. Second, the update times for the state variables and the weights are different (T and T_1 respectively), with T_1 >> T[8]. That is, the state variables change much faster than the synaptic weights, as they should. In neural network terminology, the state values (x) are thus called short-term memory (STM), whereas the synapses are called long-term memory (LTM).

[7] Special since the nonlinear terms used are limited. Later, we will also study some other properties that make RNNs unique nonlinear systems.
[8] If we are forming discrete-domain RNNs from continuous-domain RNNs, both values must be small enough to prevent aliasing.

Of course, not all nonlinear dynamical systems are neural systems. We need to incorporate the same constraints indicated while discussing feedforward neural networks, and a few more. Thus, 1) the use of few primitives or building blocks; 2) the use of the only nonlinear building block, the neuron, with a saturating output; 3) the use of a very large number of these building blocks; 4) dense interconnection; 5) feedback; and 6) learnability or trainability characterize recurrent or feedback neural networks. We will add later some more constraints that arise due to the presence of feedback and the specific tasks expected of RNNs.

15.3.2 RNN Architectures

Different approaches can be used to arrive at numerous RNN architectures. A straightforward approach is to add feedback to a feedforward neural network. Another approach is to derive architectures from a careful study of biological models. A third approach, proposed by this author, is to use nonlinear electrical network theory and the building block approach discussed in this book. A simple RNN architecture derived using the first approach is shown in Fig. 15.9. In the figure, we have an FFNN with no hidden layer and feedback between its inputs and outputs through integrator units (shown with self-feedback). The outputs of the integrators, x, form the state of the neuronal dynamical system and can take any value in the real vector space R^n.
They are fed to the neurons to produce the bounded signal state s(x) (this should explain why the nonlinear functions appear first in the FFNN subsystem). The signal state space is thus an n-dimensional hypercube. The synaptic matrix or connection matrix M = [m_ij] multiplies the signal state. A bias or input signal is added to this output, and the result, along with self-feedback from the outputs of the integrators, becomes the input to the integrators. The dynamics of the RNN architecture in Fig. 15.9 can be written as:

    ẋ_i = −f_i[x_i] + Σ_{j=1}^{n} m_ij s[x_j] + I_i ;   i = 1 to n        (15.10)

where f_i[x_i] is some nonlinear mapping of the independent variable x_i. In practice, this mapping is constrained to be linear, f_i[x_i] = a_i x_i, a_i > 0, and the dynamics corresponding to this case can be written in matrix form as:

    ẋ = −Ax + M s[x] + i        (15.11)

where A is a diagonal matrix.

Figure 15.9. A continuous RNN architecture based on a feedforward neural network and feedback through integrator units (weigh-and-sum M·s[x], bias input addition, and integration units with self-feedback).

In the above model, all the signal states are assumed to be in a signal field F_x and are used collectively to affect the future state. Such an architecture is known as autoassociative or unidirectional. By adding another field F_y (with an associated state y of size m) and allowing for cross connections, as shown in Fig. 15.10, we arrive at a dynamics given by:

    d/dt [x; y] = −[A 0; 0 B] [x; y] + [P N; M Q] [s[x]; s[y]] + [i; j]        (15.12)

Depending upon the characteristics of the synaptic submatrices P, M, N, and Q, we arrive at different network categories. When P = Q = [0], the resulting architecture is called heteroassociative. M = N^t leads to bidirectional networks (F_x feeding into the inputs of the integrators of field F_y, and F_y feeding into the inputs of the integrators of field F_x). Here M is called the forward projection or feedforward from neuron field F_x to neuron field F_y, and N the backward or feedback projection from neuron field F_y to neuron field F_x[9]. A special case of bidirectional networks is the bidirectional associative memory or BAM (continuous additive bidirectional associative memory or CABAM, since the dynamics is described in the continuous domain). In this architecture, the signal states {s[x], s[y]} reach certain stable values given x[0], y[0], and hence {s[x], s[y]} can be considered as associative pairs. When P and Q are not equal to zero, they are assumed to be symmetric (this applies to M of the autoassociative architecture as well), with positive diagonal entries {p_ii, q_ii > 0} (excitatory connections) and nonpositive off-diagonal elements {p_ij, q_ij ≤ 0, i ≠ j} (inhibitory connections). The symmetry is considered to be a reflection of lateral inhibition or a competitive connection topology[10]. The strength of the inhibitory connections is assumed to be a decreasing function of the distance |i − j|. The inputs i_i and j_j, when present, are added directly to the dynamics in the above models. Hence, such models are known as additive activation models. We will see other models as we study some well-known RNNs shortly. But let us first look at some central issues related to RNNs.

[9] The reader may note that these classifications differ vastly from the definitions found in the classical network and system theory literature.
[10] We will discuss later the real requirement for symmetry using network interpretations.
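The additive dynamics (15.11) can be explored numerically with a simple forward-Euler integration. The sketch below uses a small symmetric connection matrix, a tanh signal function, and zero input, and shows the state settling toward an equilibrium; all parameter values are assumptions chosen only for illustration.

```python
import numpy as np

def simulate_rnn(A, M, i, x0, dt=0.01, steps=2000, s=np.tanh):
    """Forward-Euler integration of eq. (15.11): dx/dt = -A x + M s(x) + i."""
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        x = x + dt * (-A @ x + M @ s(x) + i)
    return x

n = 3
A = np.eye(n)                                   # diagonal decay matrix
M = np.array([[0.0, 1.2, -0.5],
              [1.2, 0.0, 0.8],
              [-0.5, 0.8, 0.0]])                # symmetric synaptic matrix
i = np.zeros(n)                                 # no external (DC) input

x_eq = simulate_rnn(A, M, i, x0=[0.4, -0.2, 0.1])
print("state after integration:", np.round(x_eq, 4))
print("signal state s(x):      ", np.round(np.tanh(x_eq), 4))
```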

Figure 15.10. A heteroassociative RNN architecture with two fields F_x and F_y (cross projections M and N, within-field connections P and Q, decay terms −A and −B, integrators, and inputs I and J). Vector/matrix notations are used to represent the various operations.

15.3.3 Response & Stability Definitions for RNNs

An arbitrarily chosen RNN, being a dynamical feedback system, can be: 1) absolutely, globally stable with x_e = 0 as the equilibrium point; 2) locally stable with multiple, nonzero equilibrium points; 3) a system that exhibits a limit cycle response or even a chaotic response; or 4) completely unstable, using the various classifications of general nonlinear dynamical systems seen in earlier chapters. Thus, we need to know which of the above behavior(s) is (are) acceptable for practical RNNs. Further, as we know from previous chapters, the above behavior classifications are based on the transient response, or the response to initial conditions. Thus, we also need to ask whether we should restrict our network operation to such situations only, or whether we should be concerned with the network operation when external stimuli are present, and if so, what kind of stimuli: constant-valued (DC) inputs or inputs that vary with respect to time? Let us develop the answers to these questions using intuition and the possible applications of RNNs as our guide.

Consider the problem of English alphabet (upper case only) recognition, where we might have a corrupted n_r by n_c bit matrix to start with. We can expect the information regarding the 26 correct combinations to be somehow stored in the synaptic matrix M. We can think of applying the corrupted matrix: 1) as the input i to the system and keeping it there permanently (DC input), or 2) as the initial signal state s[x(0)] with no other stimuli.[11] Case (1) is possible with an absolutely, globally stable system {which also possesses the BIBO, or bounded-input, bounded-output, stability property}, where we force the DC response to one of the 26 desired states[12]. For case (2), the network cannot be an absolutely, globally stable system, since the transient response of such systems tends to zero (the equilibrium point) as time progresses. Thus, we need to use locally stable systems with multiple nonzero equilibrium points, some of which are attracting points[13]. In earlier chapters, we learned that multiple equilibrium points are possible using 1) networks consisting of reactive elements with multiple relaxation points, or 2) networks consisting of both positive and negative resistors. The RNNs proposed so far fall under the latter category, as we will see in section 15.4.

The above example points to the use of the response of RNNs to initial conditions. Another problem where the response of RNNs to both initial conditions and external stimuli (DC input) is used is in the area of functional minimization. The networks used for this application category also correspond to networks with nonzero equilibrium points. We will look at this problem after we look at some well-known RNN models. In summary, the RNN architectures proposed so far do not correspond to globally, absolutely stable systems in the classical, nonlinear dynamics sense, and their applications are limited to cases involving the response to stored energy or the response to DC stimuli. The number of stable equilibria in a given RNN network represents the storage capacity of that network, and the parameters are adjusted {trained or learned, in NN terminology} so as to place the stable equilibria at values dictated by the application under consideration.
[11] We will have to preprocess the signal so that it lies in the 0 to 1 range of the s[.] function. Thus, we may change all zeros to 0.01 and all ones to 0.99, etc.
[12] How we design such a system is another question, and we have not yet addressed that problem.
[13] The NN literature uses the term globally stable to describe the behavior of such networks, to imply that those equilibrium points are stable.
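A one-neuron version of the locally stable, multiple-equilibria behavior described above can be seen in the scalar dynamics dx/dt = -x + m tanh(x): for m > 1 the origin becomes unstable and two nonzero attracting points appear, which is the kind of structure the alphabet-recall application relies on. The sketch below simply integrates this toy system from two initial states; it is an illustration, not a model taken from the text.

```python
import numpy as np

def settle(x0, m=2.0, dt=0.01, steps=5000):
    """Integrate dx/dt = -x + m*tanh(x) from x0 and return the final state."""
    x = x0
    for _ in range(steps):
        x += dt * (-x + m * np.tanh(x))
    return x

# For m = 2 the equilibria are x = 0 (unstable) and x = +/- x*, two attracting points.
for x0 in (0.1, -0.1):
    print(f"x(0) = {x0:+.1f}  ->  x(inf) ~ {settle(x0):+.4f}")
```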

15.3.4 Some Well-Known RNNs

Continuous Domain Models

Hopfield Model

A simple RNN model known as the Hopfield circuit is given by:

    c_i ẋ_i = −x_i / R_i + Σ_{j=1}^{n} m_ij s_j[x_j] + I_i ;   i = 1 to n        (15.13)

where M = M^t (a symmetric synaptic matrix) and s_j[x_j] > 0, monotonically increasing, and bounded. That is, the Hopfield circuit corresponds to a special case of the autoassociative model seen before. This circuit, which reignited the interest in NNs in the early 1980's, uses a synaptic matrix M that is learned (or computed) offline and has been shown to be useful in two applications: 1) cleaning up corrupted binary (or bipolar) data, and 2) functional optimization. In both applications, the function s_j[x_j] is chosen to be the threshold function (with output values of zero and one). In application #1, the corrupted data is used as the initial value of the signal state, s[x], and the external stimulus i is set equal to zero. After some iterations, the signal state s[x] will settle to a fixed value, providing the filtered data. Thus, as indicated earlier, the Hopfield circuit represents a locally stable system with multiple equilibrium points, one of which is reached depending upon the initial condition. In fact, it has been shown that a specific M corresponds to making a certain number of binary (bipolar) vectors the attracting equilibrium points of the circuit, and the resulting equilibrium point corresponding to an initial state is simply the stored vector with the least Hamming distance to the initial state[15].

The coefficients c_i, R_i, I_i, m_ij (i, j = 1 to n) of (15.13) represent the degrees of freedom available in the Hopfield dynamics. By varying their values properly, we can control the number of stable equilibria and their values. We have approximately n^2 (for n large) coefficients or degrees of freedom, whereas we have 2^n possible signal state values. The growth in the number of degrees of freedom becomes very small compared to the growth in possible signal state values as n increases. Therefore, we can expect the maximum number of stable equilibria that is possible by the variation of this limited number of coefficients to be very small compared to the total number of signal states. It has been shown through simulations and rigorous analytical approaches that the maximum number of stable equilibria, or the storage capacity of the Hopfield network, is approximately equal to 0.15n, a very small number compared to the total number of signal states of the dynamics. This can be considered a major limitation of Hopfield (and other RNN) dynamics, but we have to keep in mind that the limited number of stable equilibria gives rise to better fault tolerance and graceful degradation properties. Later we will consider whether the storage capacity can be increased through the use of more complex nonlinear mapping functions using the network approach (section 15.3.4).

[15] Thus, there is also the possibility for oscillation, as more than one stored vector can have the same, least Hamming distance to the initial state.
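A minimal sketch of the first application (cleaning up corrupted data), using a discrete-time, bipolar ({-1, +1}) threshold variant of the Hopfield idea: the connection matrix is formed from the stored patterns by a Hebbian outer-product rule, and the corrupted pattern is used as the initial signal state. The outer-product learning rule and the asynchronous update schedule are standard textbook choices assumed here; the chapter itself only states that M is computed offline.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 16
patterns = np.sign(rng.standard_normal((2, n)))       # two stored bipolar vectors

# Hebbian (outer-product) synaptic matrix with zero diagonal.
M = sum(np.outer(p, p) for p in patterns) / n
np.fill_diagonal(M, 0.0)

def recall(state, sweeps=5):
    """Asynchronous threshold updates; the state settles to an equilibrium point."""
    s = state.copy()
    for _ in range(sweeps):
        for k in rng.permutation(n):
            s[k] = 1.0 if M[k] @ s >= 0 else -1.0
    return s

corrupted = patterns[0].copy()
corrupted[:3] *= -1                                    # flip three bits
recovered = recall(corrupted)
print("bits differing from stored pattern:", int(np.sum(recovered != patterns[0])))
```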
Tank and Hopfield recognized that their circuit could be used for a special case of functional optimization known as linear programming (LP). Given a variable vector x (x ∈ R^n), constant vectors a and c, and a constant matrix B, the task in LP is to find the value of the vector x such that the scalar cost function

    φ[x] = a^t x        (15.14)

is minimized, where the solution x satisfies the p constraints

    f[x] = Bx − c ≥ 0        (15.15)

A simple example of this LP problem is the one-variable case (n = 1) with

    φ[x] = x        (15.16)

as the linear function to be minimized, subject to a single constraint (p = 1)

    f[x] = x ≥ 0        (15.17)

It can be noted that this simple problem has a unique solution, given by x = 0. A first-order Hopfield circuit can be used to find the solution to this problem. The first-order dynamics is given by:

    c ẋ = −x/R + m s[x] + i        (15.18)

Tank and Hopfield argued that the network minimizes a pseudo-energy function[16] E[x] given by:

    E[x] = −i x + x²/(2R) − m ∫₀ˣ s[v] dv        (15.19)

That is, the equilibrium point x_e obtained from the dynamics of the Hopfield circuit corresponds to the minimum of E[x]. The circuit corresponding to the dynamics in (15.18) will provide the solution to the LP problem if we make the identifications i = −1, m = −1, and

    s[x] = 0 if x ≥ 0,   s[x] = x if x < 0        (15.20)

leading to E[x] as:

    E[x] = x + x²/(2R) + ∫₀ˣ s[v] dv        (15.21)

The first term in the above expression corresponds to the LP function to be minimized, and the other two terms are added to make the energy function larger as and when the constraint is not satisfied (a penalty function or constraint). If the circuit is simulated with x(0) > 0, x will tend to x_e = 0, which is the solution to the LP problem[17]. In summary, in this application we excite the network with an external DC stimulus and some initial value and hope that the obtained solution is independent of the initial condition. Of course, this is possible only if the error function is bowl shaped with only one minimum. However, the dynamics, and hence the cost function being minimized, is in general nonlinear, and hence the minimum reached may not be the global minimum.

[16] Tank and Hopfield called this an energy function, similar to a Lyapunov function, as its time derivative evaluated along the circuit dynamics is negative semidefinite. We have added the term pseudo to stress the possibility of the function E[x] becoming negative. Also, we will find later that the function that really gets minimized has the dimension of power and not energy. We discussed this in general terms in chapter 3 and came across the term 'mixed potential', which is what really gets minimized.
[17] The choice of s[.] as given here will make x → −∞ as t → ∞ for any x(0) < 0. We will analyze the cause of this problem and look at an alternate energy function as we study RNNs from an 'electrical nets' perspective in section 15.4.

Continuous Additive Bidirectional Associative Memory (CABAM)

We have seen this model as we discussed the evolution of general RNN architectures from feedforward NNs. The dynamics is given by:

    ẋ_i = −a_i x_i + Σ_{j=1}^{m} m_ij s_j[y_j] + I_i ;   i = 1 to n
    ẏ_j = −â_j y_j + Σ_{i=1}^{n} m_ij s_i[x_i] + J_j ;   j = 1 to m        (15.22)

It can be observed that this model is a heteroassociative analog of the Hopfield circuit. We will look at this model once again from an electrical network perspective in section 15.4.

Grossberg Models

In this section, we present a number of biologically inspired RNN models to illustrate the slow and painful evolution leading to the various models. Restricting to an autoassociative model with a field F_x of n neurons, Grossberg models restrict the neuronal output to a finite range [0, B_i] and hence are mostly useful for pattern classification and recognition problems. A pattern is denoted as p = [p_1 p_2 ... p_n]^t, where the p_i (i = 1 to n) are normalized such that Σ_{i=1}^{n} p_i = 1, and thus p can be considered a probability distribution corresponding to a pattern. Letting I be the background illumination and I_i the reflectance from the i-th element, we have I_i = p_i I with Σ_{i=1}^{n} I_i = I, and a simple additive activation model can be written as:

    ẋ_i = −(a_i + I_i) x_i + b_i I_i ;   i = 1 to n        (15.23)

The constant a_i, which represents the term corresponding to self-feedback, leads to a stable system with a decaying response when I_i = 0, and is called the passive decay rate. From the expression, we can note that x_i is bounded in the range [0, b_i]. Since in general I_i is a constant, this dynamics represents a simple linear, uncoupled model whose solution can be written down easily. Since I_i affects only x_i, we have the term additive activation. We can note that x_i → b_i as I → ∞, regardless of the value of p_i. Grossberg calls this saturation (an undesirable effect) of large inputs. He went on to suggest that the inputs I_j (j = 1 to n, j ≠ i) should also be used in the expression for ẋ_i, in an inhibitory sense. The corresponding simplest model is:

    ẋ_i = −(a_i + I_i) x_i + b_i I_i − x_i Σ_{j≠i} I_j = −(a_i + I) x_i + b_i I_i        (15.24)

Thus, the dynamics represents an uncoupled, linear model for fixed I_i, and hence a closed-form solution for x_i can easily be written. It can be shown that x_i → p_i b_i as I → ∞. That is, we no longer have saturating outputs but outputs that represent the pattern. Thus, the pattern information gets stored in the neuron outputs. Grossberg calls this the multiplicative activation model. The multiplicative model can be modified to allow the activation variable x_i to assume negative values in a finite range [−c_i, b_i], where c_i > 0 and c_i << b_i. This leads to what is known as a multiplicative, shunting activation model:

    ẋ_i = −(a_i + I) x_i + b_i I_i − c_i Σ_{j≠i} I_j        (15.25)

A more general shunting activation is possible by allowing cross coupling between the various states x_i. Separating the inputs into excitatory inputs I_i and inhibitory inputs J_i, the model can be written as:

    ẋ_i = −(a_i + I_i + J_i) x_i + b_i I_i − c_i J_i + b_i s_i[x_i] − c_i Σ_{j≠i} s_j[x_j] − x_i ( s_i[x_i] + Σ_{j≠i} s_j[x_j] )        (15.26)

That is, for the first time in the evolution of the Grossberg models, we have a true nonlinear model involving cross coupling between the various states, and it took almost 15 years to get to this point. In the next section we will show how these and other exotic RNN models can be derived easily using the "building block concept" advanced in this book.

Cohen-Grossberg Model

An RNN model proposed by Cohen and Grossberg has the activation dynamics given by:

    ẋ_i = −a_i[x_i] ( b_i[x_i] − Σ_{j=1}^{n} m_ij s_j[x_j] )        (15.27)

where the function a_i[x_i] is assumed to be nonnegative and bounded, s_j[x_j] is a bounded, monotone nondecreasing function, and b_i[·] belongs to the set of functions that ensure the stability of the activation dynamics.

Continuous Bidirectional Associative Memories

This model results from the extension of the above Cohen-Grossberg model to the heteroassociative case and has the dynamics given by:

    ẋ_i = −a_i[x_i] ( b_i[x_i] − Σ_{j=1}^{m} m_ij s_j[y_j] ) ;   i = 1 to n
    ẏ_j = −â_j[y_j] ( b̂_j[y_j] − Σ_{i=1}^{n} m_ij ŝ_i[x_i] ) ;   j = 1 to m        (15.28)

where we have used '^' to differentiate the functions used for the F_x field dynamics from those of F_y. In practice, all the nonlinear functions are made identical to simplify the design problem. In summary, it can be noted that the RNN dynamics known so far correspond to nonlinear dynamics that are not globally stable in the classical system theory[18] sense. They correspond to dynamics with more than one locally stable equilibrium point that produce stable outputs under the application of external, DC excitation. More importantly, they are carefully hand crafted to lead to the desired behavior.

[18] Another way to put it is to say that the electrical network architectures corresponding to such dynamics are non-passive. We will use this line of thinking in the next section.

Discrete RNN Models

Discrete RNN models often use neurons with threshold signal functions (binary or bipolar output) and hence can be called bivalent RNN models. They can be derived: 1) by a direct approach, that is, by representing the dynamics with nonlinear difference equations, or 2) from an analog model and a suitable discretization procedure. We will use the latter approach and present one specific model here. Considering the CABAM seen before (equation 15.22) and using the substitutions:

    a_i, â_j = 1/T,   d/dt → (z − 1)/T,   m̂_ij = T·m_ij,   Î_i = T·I_i,   Ĵ_j = T·J_j        (15.29)

we obtain a discrete heteroassociative dynamics as:

    x_i(k+1) = Σ_{j=1}^{m} m̂_ij s_j[y_j(k)] + Î_i(k) ;   i = 1 to n
    y_j(k+1) = Σ_{i=1}^{n} m̂_ij s_i[x_i(k)] + Ĵ_j(k) ;   j = 1 to m        (15.30)

where k represents the time or iteration index. When Î_i(k) and Ĵ_j(k) are constant (DC excitation), we expect x_i(k) and y_j(k) to stabilize to some steady-state values as k → ∞. When the neuron signal functions s_i[·], s_j[·] are constrained to be threshold functions, the model reduces to the bidirectional associative memory or BAM model. The signal functions s_i[·], s_j[·] can also be made more complex while retaining the threshold function property. For example:

    s_i[x_i(k+1)] = 1 if x_i(k+1) > U_i ;   s_i[x_i(k)] if x_i(k+1) = U_i ;   0 if x_i(k+1) < U_i        (15.31)

and similarly for s_j[·], where U_i (V_j for s_j[·]) is a threshold value (chosen to be equal to zero for all i and j). That is, memory is introduced into the signal functions themselves (in addition to the memory present due to feedback) {see Fig. 15.11}, and thus a continuous-domain network equivalent of such dynamics needs to have delay elements in addition to other lumped passive elements.

Figure 15.11. A discrete-time threshold function with one memory (unit delay) used as a neuron.

We can combine the two sets of equations and write them together as:

    x_i(k+1) = Σ_j m̂_ij s_j[y_j(k)] + Î_i(k)
    s_i[x_i(k+1)] = 1 if x_i(k+1) > U_i ;   s_i[x_i(k)] if x_i(k+1) = U_i ;   0 if x_i(k+1) < U_i
    y_j(k+1) = Σ_i m̂_ij s_i[x_i(k)] + Ĵ_j(k)
    s_j[y_j(k+1)] = 1 if y_j(k+1) > V_j ;   s_j[y_j(k)] if y_j(k+1) = V_j ;   0 if y_j(k+1) < V_j        (15.32)

The memory in the signal functions s_i[·], s_j[·] and the resulting possible actions lead to what is known as a "stay-at-the-same-value" capability, and act as a tie breaker that makes a steady-state value possible. Using the model above, we can contemplate different implementations. Starting with x_i(k_0), y_j(k_0) (where k_0 is some initial time index), if we do the update simultaneously or as a block, we obtain a synchronous model. Alternately, we can update the state information for field F_x, followed by

ARTIFICIAL NEURAL NETWORK PART I HANIEH BORHANAZAD

ARTIFICIAL NEURAL NETWORK PART I HANIEH BORHANAZAD ARTIFICIAL NEURAL NETWORK PART I HANIEH BORHANAZAD WHAT IS A NEURAL NETWORK? The simplest definition of a neural network, more properly referred to as an 'artificial' neural network (ANN), is provided

More information

NONLINEAR AND ADAPTIVE (INTELLIGENT) SYSTEMS MODELING, DESIGN, & CONTROL A Building Block Approach

NONLINEAR AND ADAPTIVE (INTELLIGENT) SYSTEMS MODELING, DESIGN, & CONTROL A Building Block Approach NONLINEAR AND ADAPTIVE (INTELLIGENT) SYSTEMS MODELING, DESIGN, & CONTROL A Building Block Approach P.A. (Rama) Ramamoorthy Electrical & Computer Engineering and Comp. Science Dept., M.L. 30, University

More information

CHAPTER 3. Pattern Association. Neural Networks

CHAPTER 3. Pattern Association. Neural Networks CHAPTER 3 Pattern Association Neural Networks Pattern Association learning is the process of forming associations between related patterns. The patterns we associate together may be of the same type or

More information

Artificial Neural Network and Fuzzy Logic

Artificial Neural Network and Fuzzy Logic Artificial Neural Network and Fuzzy Logic 1 Syllabus 2 Syllabus 3 Books 1. Artificial Neural Networks by B. Yagnanarayan, PHI - (Cover Topologies part of unit 1 and All part of Unit 2) 2. Neural Networks

More information

Lecture 7 Artificial neural networks: Supervised learning

Lecture 7 Artificial neural networks: Supervised learning Lecture 7 Artificial neural networks: Supervised learning Introduction, or how the brain works The neuron as a simple computing element The perceptron Multilayer neural networks Accelerated learning in

More information

Neural Networks and Fuzzy Logic Rajendra Dept.of CSE ASCET

Neural Networks and Fuzzy Logic Rajendra Dept.of CSE ASCET Unit-. Definition Neural network is a massively parallel distributed processing system, made of highly inter-connected neural computing elements that have the ability to learn and thereby acquire knowledge

More information

Using a Hopfield Network: A Nuts and Bolts Approach

Using a Hopfield Network: A Nuts and Bolts Approach Using a Hopfield Network: A Nuts and Bolts Approach November 4, 2013 Gershon Wolfe, Ph.D. Hopfield Model as Applied to Classification Hopfield network Training the network Updating nodes Sequencing of

More information

Learning and Memory in Neural Networks

Learning and Memory in Neural Networks Learning and Memory in Neural Networks Guy Billings, Neuroinformatics Doctoral Training Centre, The School of Informatics, The University of Edinburgh, UK. Neural networks consist of computational units

More information

EEE 241: Linear Systems

EEE 241: Linear Systems EEE 4: Linear Systems Summary # 3: Introduction to artificial neural networks DISTRIBUTED REPRESENTATION An ANN consists of simple processing units communicating with each other. The basic elements of

More information

Neural Networks Introduction

Neural Networks Introduction Neural Networks Introduction H.A Talebi Farzaneh Abdollahi Department of Electrical Engineering Amirkabir University of Technology Winter 2011 H. A. Talebi, Farzaneh Abdollahi Neural Networks 1/22 Biological

More information

Christian Mohr

Christian Mohr Christian Mohr 20.12.2011 Recurrent Networks Networks in which units may have connections to units in the same or preceding layers Also connections to the unit itself possible Already covered: Hopfield

More information

Artificial Neural Networks Examination, June 2004

Artificial Neural Networks Examination, June 2004 Artificial Neural Networks Examination, June 2004 Instructions There are SIXTY questions (worth up to 60 marks). The exam mark (maximum 60) will be added to the mark obtained in the laborations (maximum

More information

ARTIFICIAL NEURAL NETWORKS گروه مطالعاتي 17 بهار 92

ARTIFICIAL NEURAL NETWORKS گروه مطالعاتي 17 بهار 92 ARTIFICIAL NEURAL NETWORKS گروه مطالعاتي 17 بهار 92 BIOLOGICAL INSPIRATIONS Some numbers The human brain contains about 10 billion nerve cells (neurons) Each neuron is connected to the others through 10000

More information

Artificial Neural Networks Examination, March 2004

Artificial Neural Networks Examination, March 2004 Artificial Neural Networks Examination, March 2004 Instructions There are SIXTY questions (worth up to 60 marks). The exam mark (maximum 60) will be added to the mark obtained in the laborations (maximum

More information

Simple Neural Nets for Pattern Classification: McCulloch-Pitts Threshold Logic CS 5870

Simple Neural Nets for Pattern Classification: McCulloch-Pitts Threshold Logic CS 5870 Simple Neural Nets for Pattern Classification: McCulloch-Pitts Threshold Logic CS 5870 Jugal Kalita University of Colorado Colorado Springs Fall 2014 Logic Gates and Boolean Algebra Logic gates are used

More information

Artificial Neural Networks Examination, June 2005

Artificial Neural Networks Examination, June 2005 Artificial Neural Networks Examination, June 2005 Instructions There are SIXTY questions. (The pass mark is 30 out of 60). For each question, please select a maximum of ONE of the given answers (either

More information

Introduction to Neural Networks

Introduction to Neural Networks Introduction to Neural Networks What are (Artificial) Neural Networks? Models of the brain and nervous system Highly parallel Process information much more like the brain than a serial computer Learning

More information

Neural Networks. Associative memory 12/30/2015. Associative memories. Associative memories

Neural Networks. Associative memory 12/30/2015. Associative memories. Associative memories //5 Neural Netors Associative memory Lecture Associative memories Associative memories The massively parallel models of associative or content associative memory have been developed. Some of these models

More information

Pattern Recognition Prof. P. S. Sastry Department of Electronics and Communication Engineering Indian Institute of Science, Bangalore

Pattern Recognition Prof. P. S. Sastry Department of Electronics and Communication Engineering Indian Institute of Science, Bangalore Pattern Recognition Prof. P. S. Sastry Department of Electronics and Communication Engineering Indian Institute of Science, Bangalore Lecture - 27 Multilayer Feedforward Neural networks with Sigmoidal

More information

3.3 Discrete Hopfield Net An iterative autoassociative net similar to the nets described in the previous sections has been developed by Hopfield

3.3 Discrete Hopfield Net An iterative autoassociative net similar to the nets described in the previous sections has been developed by Hopfield 3.3 Discrete Hopfield Net An iterative autoassociative net similar to the nets described in the previous sections has been developed by Hopfield (1982, 1984). - The net is a fully interconnected neural

More information

Data Mining Part 5. Prediction

Data Mining Part 5. Prediction Data Mining Part 5. Prediction 5.5. Spring 2010 Instructor: Dr. Masoud Yaghini Outline How the Brain Works Artificial Neural Networks Simple Computing Elements Feed-Forward Networks Perceptrons (Single-layer,

More information

Computational Intelligence Lecture 6: Associative Memory

Computational Intelligence Lecture 6: Associative Memory Computational Intelligence Lecture 6: Associative Memory Farzaneh Abdollahi Department of Electrical Engineering Amirkabir University of Technology Fall 2011 Farzaneh Abdollahi Computational Intelligence

More information

Machine Learning. Neural Networks

Machine Learning. Neural Networks Machine Learning Neural Networks Bryan Pardo, Northwestern University, Machine Learning EECS 349 Fall 2007 Biological Analogy Bryan Pardo, Northwestern University, Machine Learning EECS 349 Fall 2007 THE

More information

Artificial Neural Networks D B M G. Data Base and Data Mining Group of Politecnico di Torino. Elena Baralis. Politecnico di Torino

Artificial Neural Networks D B M G. Data Base and Data Mining Group of Politecnico di Torino. Elena Baralis. Politecnico di Torino Artificial Neural Networks Data Base and Data Mining Group of Politecnico di Torino Elena Baralis Politecnico di Torino Artificial Neural Networks Inspired to the structure of the human brain Neurons as

More information

Artificial Neural Networks

Artificial Neural Networks Artificial Neural Networks Oliver Schulte - CMPT 310 Neural Networks Neural networks arise from attempts to model human/animal brains Many models, many claims of biological plausibility We will focus on

More information

Lecture 4: Feed Forward Neural Networks

Lecture 4: Feed Forward Neural Networks Lecture 4: Feed Forward Neural Networks Dr. Roman V Belavkin Middlesex University BIS4435 Biological neurons and the brain A Model of A Single Neuron Neurons as data-driven models Neural Networks Training

More information

(Feed-Forward) Neural Networks Dr. Hajira Jabeen, Prof. Jens Lehmann

(Feed-Forward) Neural Networks Dr. Hajira Jabeen, Prof. Jens Lehmann (Feed-Forward) Neural Networks 2016-12-06 Dr. Hajira Jabeen, Prof. Jens Lehmann Outline In the previous lectures we have learned about tensors and factorization methods. RESCAL is a bilinear model for

More information

Artificial Intelligence

Artificial Intelligence Artificial Intelligence Jeff Clune Assistant Professor Evolving Artificial Intelligence Laboratory Announcements Be making progress on your projects! Three Types of Learning Unsupervised Supervised Reinforcement

More information

Neural Networks. Fundamentals Framework for distributed processing Network topologies Training of ANN s Notation Perceptron Back Propagation

Neural Networks. Fundamentals Framework for distributed processing Network topologies Training of ANN s Notation Perceptron Back Propagation Neural Networks Fundamentals Framework for distributed processing Network topologies Training of ANN s Notation Perceptron Back Propagation Neural Networks Historical Perspective A first wave of interest

More information

In biological terms, memory refers to the ability of neural systems to store activity patterns and later recall them when required.

In biological terms, memory refers to the ability of neural systems to store activity patterns and later recall them when required. In biological terms, memory refers to the ability of neural systems to store activity patterns and later recall them when required. In humans, association is known to be a prominent feature of memory.

More information

Iterative Autoassociative Net: Bidirectional Associative Memory

Iterative Autoassociative Net: Bidirectional Associative Memory POLYTECHNIC UNIVERSITY Department of Computer and Information Science Iterative Autoassociative Net: Bidirectional Associative Memory K. Ming Leung Abstract: Iterative associative neural networks are introduced.

More information

Artificial Intelligence Hopfield Networks

Artificial Intelligence Hopfield Networks Artificial Intelligence Hopfield Networks Andrea Torsello Network Topologies Single Layer Recurrent Network Bidirectional Symmetric Connection Binary / Continuous Units Associative Memory Optimization

More information

CSE 352 (AI) LECTURE NOTES Professor Anita Wasilewska. NEURAL NETWORKS Learning

CSE 352 (AI) LECTURE NOTES Professor Anita Wasilewska. NEURAL NETWORKS Learning CSE 352 (AI) LECTURE NOTES Professor Anita Wasilewska NEURAL NETWORKS Learning Neural Networks Classifier Short Presentation INPUT: classification data, i.e. it contains an classification (class) attribute.

More information

22c145-Fall 01: Neural Networks. Neural Networks. Readings: Chapter 19 of Russell & Norvig. Cesare Tinelli 1

22c145-Fall 01: Neural Networks. Neural Networks. Readings: Chapter 19 of Russell & Norvig. Cesare Tinelli 1 Neural Networks Readings: Chapter 19 of Russell & Norvig. Cesare Tinelli 1 Brains as Computational Devices Brains advantages with respect to digital computers: Massively parallel Fault-tolerant Reliable

More information

Lecture 15: Exploding and Vanishing Gradients

Lecture 15: Exploding and Vanishing Gradients Lecture 15: Exploding and Vanishing Gradients Roger Grosse 1 Introduction Last lecture, we introduced RNNs and saw how to derive the gradients using backprop through time. In principle, this lets us train

More information

Machine Learning for Large-Scale Data Analysis and Decision Making A. Neural Networks Week #6

Machine Learning for Large-Scale Data Analysis and Decision Making A. Neural Networks Week #6 Machine Learning for Large-Scale Data Analysis and Decision Making 80-629-17A Neural Networks Week #6 Today Neural Networks A. Modeling B. Fitting C. Deep neural networks Today s material is (adapted)

More information

CS:4420 Artificial Intelligence

CS:4420 Artificial Intelligence CS:4420 Artificial Intelligence Spring 2018 Neural Networks Cesare Tinelli The University of Iowa Copyright 2004 18, Cesare Tinelli and Stuart Russell a a These notes were originally developed by Stuart

More information

Artificial Neural Networks. Q550: Models in Cognitive Science Lecture 5

Artificial Neural Networks. Q550: Models in Cognitive Science Lecture 5 Artificial Neural Networks Q550: Models in Cognitive Science Lecture 5 "Intelligence is 10 million rules." --Doug Lenat The human brain has about 100 billion neurons. With an estimated average of one thousand

More information

Part 8: Neural Networks

Part 8: Neural Networks METU Informatics Institute Min720 Pattern Classification ith Bio-Medical Applications Part 8: Neural Netors - INTRODUCTION: BIOLOGICAL VS. ARTIFICIAL Biological Neural Netors A Neuron: - A nerve cell as

More information

Neural Networks and the Back-propagation Algorithm

Neural Networks and the Back-propagation Algorithm Neural Networks and the Back-propagation Algorithm Francisco S. Melo In these notes, we provide a brief overview of the main concepts concerning neural networks and the back-propagation algorithm. We closely

More information

CSC321 Lecture 5: Multilayer Perceptrons

CSC321 Lecture 5: Multilayer Perceptrons CSC321 Lecture 5: Multilayer Perceptrons Roger Grosse Roger Grosse CSC321 Lecture 5: Multilayer Perceptrons 1 / 21 Overview Recall the simple neuron-like unit: y output output bias i'th weight w 1 w2 w3

More information

Sample Exam COMP 9444 NEURAL NETWORKS Solutions

Sample Exam COMP 9444 NEURAL NETWORKS Solutions FAMILY NAME OTHER NAMES STUDENT ID SIGNATURE Sample Exam COMP 9444 NEURAL NETWORKS Solutions (1) TIME ALLOWED 3 HOURS (2) TOTAL NUMBER OF QUESTIONS 12 (3) STUDENTS SHOULD ANSWER ALL QUESTIONS (4) QUESTIONS

More information

Feedforward Neural Nets and Backpropagation

Feedforward Neural Nets and Backpropagation Feedforward Neural Nets and Backpropagation Julie Nutini University of British Columbia MLRG September 28 th, 2016 1 / 23 Supervised Learning Roadmap Supervised Learning: Assume that we are given the features

More information

EE04 804(B) Soft Computing Ver. 1.2 Class 2. Neural Networks - I Feb 23, Sasidharan Sreedharan

EE04 804(B) Soft Computing Ver. 1.2 Class 2. Neural Networks - I Feb 23, Sasidharan Sreedharan EE04 804(B) Soft Computing Ver. 1.2 Class 2. Neural Networks - I Feb 23, 2012 Sasidharan Sreedharan www.sasidharan.webs.com 3/1/2012 1 Syllabus Artificial Intelligence Systems- Neural Networks, fuzzy logic,

More information

2- AUTOASSOCIATIVE NET - The feedforward autoassociative net considered in this section is a special case of the heteroassociative net.

2- AUTOASSOCIATIVE NET - The feedforward autoassociative net considered in this section is a special case of the heteroassociative net. 2- AUTOASSOCIATIVE NET - The feedforward autoassociative net considered in this section is a special case of the heteroassociative net. - For an autoassociative net, the training input and target output

More information

DEEP LEARNING AND NEURAL NETWORKS: BACKGROUND AND HISTORY

DEEP LEARNING AND NEURAL NETWORKS: BACKGROUND AND HISTORY DEEP LEARNING AND NEURAL NETWORKS: BACKGROUND AND HISTORY 1 On-line Resources http://neuralnetworksanddeeplearning.com/index.html Online book by Michael Nielsen http://matlabtricks.com/post-5/3x3-convolution-kernelswith-online-demo

More information

Lecture 4: Perceptrons and Multilayer Perceptrons

Lecture 4: Perceptrons and Multilayer Perceptrons Lecture 4: Perceptrons and Multilayer Perceptrons Cognitive Systems II - Machine Learning SS 2005 Part I: Basic Approaches of Concept Learning Perceptrons, Artificial Neuronal Networks Lecture 4: Perceptrons

More information

Lecture 5: Logistic Regression. Neural Networks

Lecture 5: Logistic Regression. Neural Networks Lecture 5: Logistic Regression. Neural Networks Logistic regression Comparison with generative models Feed-forward neural networks Backpropagation Tricks for training neural networks COMP-652, Lecture

More information

Integer weight training by differential evolution algorithms

Integer weight training by differential evolution algorithms Integer weight training by differential evolution algorithms V.P. Plagianakos, D.G. Sotiropoulos, and M.N. Vrahatis University of Patras, Department of Mathematics, GR-265 00, Patras, Greece. e-mail: vpp

More information

Artificial Neural Networks. Edward Gatt

Artificial Neural Networks. Edward Gatt Artificial Neural Networks Edward Gatt What are Neural Networks? Models of the brain and nervous system Highly parallel Process information much more like the brain than a serial computer Learning Very

More information

C4 Phenomenological Modeling - Regression & Neural Networks : Computational Modeling and Simulation Instructor: Linwei Wang

C4 Phenomenological Modeling - Regression & Neural Networks : Computational Modeling and Simulation Instructor: Linwei Wang C4 Phenomenological Modeling - Regression & Neural Networks 4040-849-03: Computational Modeling and Simulation Instructor: Linwei Wang Recall.. The simple, multiple linear regression function ŷ(x) = a

More information

Course 395: Machine Learning - Lectures

Course 395: Machine Learning - Lectures Course 395: Machine Learning - Lectures Lecture 1-2: Concept Learning (M. Pantic) Lecture 3-4: Decision Trees & CBC Intro (M. Pantic & S. Petridis) Lecture 5-6: Evaluating Hypotheses (S. Petridis) Lecture

More information

Unit III. A Survey of Neural Network Model

Unit III. A Survey of Neural Network Model Unit III A Survey of Neural Network Model 1 Single Layer Perceptron Perceptron the first adaptive network architecture was invented by Frank Rosenblatt in 1957. It can be used for the classification of

More information

ARTIFICIAL INTELLIGENCE. Artificial Neural Networks

ARTIFICIAL INTELLIGENCE. Artificial Neural Networks INFOB2KI 2017-2018 Utrecht University The Netherlands ARTIFICIAL INTELLIGENCE Artificial Neural Networks Lecturer: Silja Renooij These slides are part of the INFOB2KI Course Notes available from www.cs.uu.nl/docs/vakken/b2ki/schema.html

More information

Recurrent Neural Networks

Recurrent Neural Networks Recurrent Neural Networks Datamining Seminar Kaspar Märtens Karl-Oskar Masing Today's Topics Modeling sequences: a brief overview Training RNNs with back propagation A toy example of training an RNN Why

More information

AI Programming CS F-20 Neural Networks

AI Programming CS F-20 Neural Networks AI Programming CS662-2008F-20 Neural Networks David Galles Department of Computer Science University of San Francisco 20-0: Symbolic AI Most of this class has been focused on Symbolic AI Focus or symbols

More information

4. Multilayer Perceptrons

4. Multilayer Perceptrons 4. Multilayer Perceptrons This is a supervised error-correction learning algorithm. 1 4.1 Introduction A multilayer feedforward network consists of an input layer, one or more hidden layers, and an output

More information

Introduction to Machine Learning Spring 2018 Note Neural Networks

Introduction to Machine Learning Spring 2018 Note Neural Networks CS 189 Introduction to Machine Learning Spring 2018 Note 14 1 Neural Networks Neural networks are a class of compositional function approximators. They come in a variety of shapes and sizes. In this class,

More information

Introduction to Artificial Neural Networks

Introduction to Artificial Neural Networks Facultés Universitaires Notre-Dame de la Paix 27 March 2007 Outline 1 Introduction 2 Fundamentals Biological neuron Artificial neuron Artificial Neural Network Outline 3 Single-layer ANN Perceptron Adaline

More information

CS 4700: Foundations of Artificial Intelligence

CS 4700: Foundations of Artificial Intelligence CS 4700: Foundations of Artificial Intelligence Prof. Bart Selman selman@cs.cornell.edu Machine Learning: Neural Networks R&N 18.7 Intro & perceptron learning 1 2 Neuron: How the brain works # neurons

More information

Chapter 9: The Perceptron

Chapter 9: The Perceptron Chapter 9: The Perceptron 9.1 INTRODUCTION At this point in the book, we have completed all of the exercises that we are going to do with the James program. These exercises have shown that distributed

More information

2015 Todd Neller. A.I.M.A. text figures 1995 Prentice Hall. Used by permission. Neural Networks. Todd W. Neller

2015 Todd Neller. A.I.M.A. text figures 1995 Prentice Hall. Used by permission. Neural Networks. Todd W. Neller 2015 Todd Neller. A.I.M.A. text figures 1995 Prentice Hall. Used by permission. Neural Networks Todd W. Neller Machine Learning Learning is such an important part of what we consider "intelligence" that

More information

Chapter 3 Supervised learning:

Chapter 3 Supervised learning: Chapter 3 Supervised learning: Multilayer Networks I Backpropagation Learning Architecture: Feedforward network of at least one layer of non-linear hidden nodes, e.g., # of layers L 2 (not counting the

More information

On the Hopfield algorithm. Foundations and examples

On the Hopfield algorithm. Foundations and examples General Mathematics Vol. 13, No. 2 (2005), 35 50 On the Hopfield algorithm. Foundations and examples Nicolae Popoviciu and Mioara Boncuţ Dedicated to Professor Dumitru Acu on his 60th birthday Abstract

More information

Instituto Tecnológico y de Estudios Superiores de Occidente Departamento de Electrónica, Sistemas e Informática. Introductory Notes on Neural Networks

Instituto Tecnológico y de Estudios Superiores de Occidente Departamento de Electrónica, Sistemas e Informática. Introductory Notes on Neural Networks Introductory Notes on Neural Networs Dr. José Ernesto Rayas Sánche April Introductory Notes on Neural Networs Dr. José Ernesto Rayas Sánche BIOLOGICAL NEURAL NETWORKS The brain can be seen as a highly

More information

Neural Networks. Nicholas Ruozzi University of Texas at Dallas

Neural Networks. Nicholas Ruozzi University of Texas at Dallas Neural Networks Nicholas Ruozzi University of Texas at Dallas Handwritten Digit Recognition Given a collection of handwritten digits and their corresponding labels, we d like to be able to correctly classify

More information

Handout 2: Invariant Sets and Stability

Handout 2: Invariant Sets and Stability Engineering Tripos Part IIB Nonlinear Systems and Control Module 4F2 1 Invariant Sets Handout 2: Invariant Sets and Stability Consider again the autonomous dynamical system ẋ = f(x), x() = x (1) with state

More information

Artificial Neural Networks

Artificial Neural Networks Artificial Neural Networks Threshold units Gradient descent Multilayer networks Backpropagation Hidden layer representations Example: Face Recognition Advanced topics 1 Connectionist Models Consider humans:

More information

Artificial Neural Networks

Artificial Neural Networks Artificial Neural Networks 鮑興國 Ph.D. National Taiwan University of Science and Technology Outline Perceptrons Gradient descent Multi-layer networks Backpropagation Hidden layer representations Examples

More information

Artificial Neural Networks

Artificial Neural Networks Introduction ANN in Action Final Observations Application: Poverty Detection Artificial Neural Networks Alvaro J. Riascos Villegas University of los Andes and Quantil July 6 2018 Artificial Neural Networks

More information

Artificial Neural Network

Artificial Neural Network Artificial Neural Network Contents 2 What is ANN? Biological Neuron Structure of Neuron Types of Neuron Models of Neuron Analogy with human NN Perceptron OCR Multilayer Neural Network Back propagation

More information

PV021: Neural networks. Tomáš Brázdil

PV021: Neural networks. Tomáš Brázdil 1 PV021: Neural networks Tomáš Brázdil 2 Course organization Course materials: Main: The lecture Neural Networks and Deep Learning by Michael Nielsen http://neuralnetworksanddeeplearning.com/ (Extremely

More information

Artificial Neural Network

Artificial Neural Network Artificial Neural Network Eung Je Woo Department of Biomedical Engineering Impedance Imaging Research Center (IIRC) Kyung Hee University Korea ejwoo@khu.ac.kr Neuron and Neuron Model McCulloch and Pitts

More information

Artificial Neural Networks" and Nonparametric Methods" CMPSCI 383 Nov 17, 2011!

Artificial Neural Networks and Nonparametric Methods CMPSCI 383 Nov 17, 2011! Artificial Neural Networks" and Nonparametric Methods" CMPSCI 383 Nov 17, 2011! 1 Todayʼs lecture" How the brain works (!)! Artificial neural networks! Perceptrons! Multilayer feed-forward networks! Error

More information

Lecture 5 Neural models for NLP

Lecture 5 Neural models for NLP CS546: Machine Learning in NLP (Spring 2018) http://courses.engr.illinois.edu/cs546/ Lecture 5 Neural models for NLP Julia Hockenmaier juliahmr@illinois.edu 3324 Siebel Center Office hours: Tue/Thu 2pm-3pm

More information

Neural Networks with Applications to Vision and Language. Feedforward Networks. Marco Kuhlmann

Neural Networks with Applications to Vision and Language. Feedforward Networks. Marco Kuhlmann Neural Networks with Applications to Vision and Language Feedforward Networks Marco Kuhlmann Feedforward networks Linear separability x 2 x 2 0 1 0 1 0 0 x 1 1 0 x 1 linearly separable not linearly separable

More information

Neural Networks biological neuron artificial neuron 1

Neural Networks biological neuron artificial neuron 1 Neural Networks biological neuron artificial neuron 1 A two-layer neural network Output layer (activation represents classification) Weighted connections Hidden layer ( internal representation ) Input

More information

SPSS, University of Texas at Arlington. Topics in Machine Learning-EE 5359 Neural Networks

SPSS, University of Texas at Arlington. Topics in Machine Learning-EE 5359 Neural Networks Topics in Machine Learning-EE 5359 Neural Networks 1 The Perceptron Output: A perceptron is a function that maps D-dimensional vectors to real numbers. For notational convenience, we add a zero-th dimension

More information

Reification of Boolean Logic

Reification of Boolean Logic 526 U1180 neural networks 1 Chapter 1 Reification of Boolean Logic The modern era of neural networks began with the pioneer work of McCulloch and Pitts (1943). McCulloch was a psychiatrist and neuroanatomist;

More information

18.6 Regression and Classification with Linear Models

18.6 Regression and Classification with Linear Models 18.6 Regression and Classification with Linear Models 352 The hypothesis space of linear functions of continuous-valued inputs has been used for hundreds of years A univariate linear function (a straight

More information

Neural Nets and Symbolic Reasoning Hopfield Networks

Neural Nets and Symbolic Reasoning Hopfield Networks Neural Nets and Symbolic Reasoning Hopfield Networks Outline The idea of pattern completion The fast dynamics of Hopfield networks Learning with Hopfield networks Emerging properties of Hopfield networks

More information

Neural Networks Lecture 6: Associative Memory II

Neural Networks Lecture 6: Associative Memory II Neural Networks Lecture 6: Associative Memory II H.A Talebi Farzaneh Abdollahi Department of Electrical Engineering Amirkabir University of Technology Winter 2011. A. Talebi, Farzaneh Abdollahi Neural

More information

Simple neuron model Components of simple neuron

Simple neuron model Components of simple neuron Outline 1. Simple neuron model 2. Components of artificial neural networks 3. Common activation functions 4. MATLAB representation of neural network. Single neuron model Simple neuron model Components

More information

CSC 411 Lecture 10: Neural Networks

CSC 411 Lecture 10: Neural Networks CSC 411 Lecture 10: Neural Networks Roger Grosse, Amir-massoud Farahmand, and Juan Carrasquilla University of Toronto UofT CSC 411: 10-Neural Networks 1 / 35 Inspiration: The Brain Our brain has 10 11

More information

From perceptrons to word embeddings. Simon Šuster University of Groningen

From perceptrons to word embeddings. Simon Šuster University of Groningen From perceptrons to word embeddings Simon Šuster University of Groningen Outline A basic computational unit Weighting some input to produce an output: classification Perceptron Classify tweets Written

More information

Neural Networks for Machine Learning. Lecture 11a Hopfield Nets

Neural Networks for Machine Learning. Lecture 11a Hopfield Nets Neural Networks for Machine Learning Lecture 11a Hopfield Nets Geoffrey Hinton Nitish Srivastava, Kevin Swersky Tijmen Tieleman Abdel-rahman Mohamed Hopfield Nets A Hopfield net is composed of binary threshold

More information

7 Rate-Based Recurrent Networks of Threshold Neurons: Basis for Associative Memory

7 Rate-Based Recurrent Networks of Threshold Neurons: Basis for Associative Memory Physics 178/278 - David Kleinfeld - Fall 2005; Revised for Winter 2017 7 Rate-Based Recurrent etworks of Threshold eurons: Basis for Associative Memory 7.1 A recurrent network with threshold elements The

More information

Multilayer Perceptrons (MLPs)

Multilayer Perceptrons (MLPs) CSE 5526: Introduction to Neural Networks Multilayer Perceptrons (MLPs) 1 Motivation Multilayer networks are more powerful than singlelayer nets Example: XOR problem x 2 1 AND x o x 1 x 2 +1-1 o x x 1-1

More information

Neural Networks for Machine Learning. Lecture 2a An overview of the main types of neural network architecture

Neural Networks for Machine Learning. Lecture 2a An overview of the main types of neural network architecture Neural Networks for Machine Learning Lecture 2a An overview of the main types of neural network architecture Geoffrey Hinton with Nitish Srivastava Kevin Swersky Feed-forward neural networks These are

More information

Deep Feedforward Networks

Deep Feedforward Networks Deep Feedforward Networks Yongjin Park 1 Goal of Feedforward Networks Deep Feedforward Networks are also called as Feedforward neural networks or Multilayer Perceptrons Their Goal: approximate some function

More information

An artificial neural networks (ANNs) model is a functional abstraction of the

An artificial neural networks (ANNs) model is a functional abstraction of the CHAPER 3 3. Introduction An artificial neural networs (ANNs) model is a functional abstraction of the biological neural structures of the central nervous system. hey are composed of many simple and highly

More information

Neural networks. Chapter 20. Chapter 20 1

Neural networks. Chapter 20. Chapter 20 1 Neural networks Chapter 20 Chapter 20 1 Outline Brains Neural networks Perceptrons Multilayer networks Applications of neural networks Chapter 20 2 Brains 10 11 neurons of > 20 types, 10 14 synapses, 1ms

More information

Convolutional Associative Memory: FIR Filter Model of Synapse

Convolutional Associative Memory: FIR Filter Model of Synapse Convolutional Associative Memory: FIR Filter Model of Synapse Rama Murthy Garimella 1, Sai Dileep Munugoti 2, Anil Rayala 1 1 International Institute of Information technology, Hyderabad, India. rammurthy@iiit.ac.in,

More information

7 Recurrent Networks of Threshold (Binary) Neurons: Basis for Associative Memory

7 Recurrent Networks of Threshold (Binary) Neurons: Basis for Associative Memory Physics 178/278 - David Kleinfeld - Winter 2019 7 Recurrent etworks of Threshold (Binary) eurons: Basis for Associative Memory 7.1 The network The basic challenge in associative networks, also referred

More information

CS 6501: Deep Learning for Computer Graphics. Basics of Neural Networks. Connelly Barnes

CS 6501: Deep Learning for Computer Graphics. Basics of Neural Networks. Connelly Barnes CS 6501: Deep Learning for Computer Graphics Basics of Neural Networks Connelly Barnes Overview Simple neural networks Perceptron Feedforward neural networks Multilayer perceptron and properties Autoencoders

More information

Simple Neural Nets For Pattern Classification

Simple Neural Nets For Pattern Classification CHAPTER 2 Simple Neural Nets For Pattern Classification Neural Networks General Discussion One of the simplest tasks that neural nets can be trained to perform is pattern classification. In pattern classification

More information

Neural networks. Chapter 19, Sections 1 5 1

Neural networks. Chapter 19, Sections 1 5 1 Neural networks Chapter 19, Sections 1 5 Chapter 19, Sections 1 5 1 Outline Brains Neural networks Perceptrons Multilayer perceptrons Applications of neural networks Chapter 19, Sections 1 5 2 Brains 10

More information

Artificial Neural Networks. Part 2

Artificial Neural Networks. Part 2 Artificial Neural Netorks Part Artificial Neuron Model Folloing simplified model of real neurons is also knon as a Threshold Logic Unit x McCullouch-Pitts neuron (943) x x n n Body of neuron f out Biological

More information

Artificial Neural Networks. MGS Lecture 2

Artificial Neural Networks. MGS Lecture 2 Artificial Neural Networks MGS 2018 - Lecture 2 OVERVIEW Biological Neural Networks Cell Topology: Input, Output, and Hidden Layers Functional description Cost functions Training ANNs Back-Propagation

More information

Object Recognition Using a Neural Network and Invariant Zernike Features

Object Recognition Using a Neural Network and Invariant Zernike Features Object Recognition Using a Neural Network and Invariant Zernike Features Abstract : In this paper, a neural network (NN) based approach for translation, scale, and rotation invariant recognition of objects

More information