ARTIFICIAL NEURAL NETWORK PART I HANIEH BORHANAZAD


WHAT IS A NEURAL NETWORK? The simplest definition of a neural network, more properly referred to as an 'artificial' neural network (ANN), is provided by the inventor of one of the first neurocomputers, Dr. Robert Hecht-Nielsen. He defines a neural network as: "...a computing system made up of a number of simple, highly interconnected processing elements, which process information by their dynamic state response to external inputs."

WHAT ARE ARTIFICIAL NEURAL NETWORKS? The brain basically learns from experience. Advances in biological research promise an initial understanding of the natural thinking mechanism. This research shows that brains store information as patterns. Some of these patterns are very complicated and give us the ability to recognize individual faces from many different angles. This process of storing information as patterns, utilizing those patterns, and then solving problems encompasses a new field in computing. This field, as mentioned before, does not utilize traditional programming but involves the creation of massively parallel networks and the training of those networks to solve specific problems. This field also uses words very different from traditional computing, words like behave, react, self-organize, learn, generalize, and forget.

HOW DO NEURONS WORK? The design of an Artificial Neural Network (ANN) is inspired by the human brain, so it is important to have a rough idea of what is going on there. Therefore, before we look into the artificial one, let's have a look at the real one.

WHAT ARE ARTIFICIAL NEURAL NETWORKS? Neurons in ANNs tend to have fewer connections than biological neurons. Each neuron in an ANN receives a number of inputs. An activation function is applied to these inputs, which determines the activation level of the neuron (the output value of the neuron). Knowledge about the learning task is given in the form of examples called training examples.

ARTIFICIAL NEURAL NETWORKS An Artificial Neural Network is specified by: a neuron model, the information processing unit of the NN; an architecture, a set of neurons and links connecting the neurons, where each link has a weight; and a learning algorithm, used for training the NN by modifying the weights in order to model a particular learning task correctly on the training examples. The aim is to obtain a NN that is trained and generalizes well, i.e., behaves correctly on new instances of the learning task.

THE NEURON DIAGRAM

NEURON The neuron is the basic information processing unit of a NN. It consists of: (1) a set of links, describing the neuron inputs, with (real-valued) weights $w_1, w_2, \ldots, w_m$; (2) an adder function (linear combiner) for computing the weighted sum of the inputs: $u = \sum_{j=1}^{m} w_j x_j$; (3) an activation function $\varphi$ for limiting the amplitude of the neuron output: $y = \varphi(u + b)$, where $b$ denotes the bias.

BIAS OF A NEURON The bias $b$ has the effect of applying a transformation to the weighted sum $u$: $v = u + b$. The bias is an external parameter of the neuron. It can be modeled by adding an extra input $x_0 = +1$ with weight $w_0 = b$. $v$ is called the induced local field of the neuron: $v = \sum_{j=0}^{m} w_j x_j$.

HOW DOES THE NEURON DETERMINE ITS OUTPUT? The neuron computes the weighted sum of the input signals and compares the result with a threshold value, θ. If the net input is less than the threshold, the neuron output is -1. But if the net input is greater than or equal to the threshold, the neuron becomes activated and its output attains the value +1 (McCulloch and Pitts, 1943).

HOW DOES THE NEURON DETERMINE ITS OUTPUT? In other words, the neuron uses the following transfer or activation function: $X = \sum_{i=1}^{n} x_i w_i$, with $Y = +1$ if $X \geq \theta$ and $Y = -1$ if $X < \theta$, where $X$ is the net weighted input to the neuron, $x_i$ is the value of input $i$, $w_i$ is the weight of input $i$, $n$ is the number of neuron inputs, and $Y$ is the output of the neuron. This type of activation function is called a sign function. Thus the actual output of the neuron with a sign activation function can be represented as $Y = \operatorname{sign}\left(\sum_{i=1}^{n} x_i w_i - \theta\right)$.
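As a minimal sketch of the sign-activation neuron just described (the input values, weights, and threshold below are illustrative assumptions, not from the slides):

```python
import numpy as np

def sign_neuron(x, w, theta):
    """Sign-activation neuron: Y = +1 if the net weighted input
    X = sum_i x_i * w_i meets the threshold theta, else Y = -1."""
    X = np.dot(x, w)
    return 1 if X >= theta else -1

# Illustrative values (assumptions):
x = np.array([0.5, -0.2, 0.1])
w = np.array([0.4, 0.7, -0.3])
print(sign_neuron(x, w, theta=0.2))  # net input 0.03 < 0.2, so output is -1
```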

IS THE SIGN FUNCTION THE ONLY ACTIVATION FUNCTION USED BY NEURONS? No. The choice of activation function determines the neuron model. Examples: 1. step function 2. ramp function 3. sigmoid function 4. Gaussian function. Note: The step and sign activation functions, also called hard limit functions, are often used in decision-making neurons for classification and pattern recognition tasks.

Step Function: $\varphi(v) = \begin{cases} a & \text{if } v < c \\ b & \text{if } v \geq c \end{cases}$

Ramp Function: $\varphi(v) = \begin{cases} a & \text{if } v \leq c \\ b & \text{if } v \geq d \\ a + \frac{(v - c)(b - a)}{d - c} & \text{otherwise} \end{cases}$

Sigmoid Function: $\varphi(v) = z + \frac{1}{1 + \exp(-xv + y)}$

The Gaussian function is the probability density function of the normal distribution, sometimes also called the frequency curve: $\varphi(v) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{v^2}{2}\right)$

$u = \sum_{j=1}^{m} w_j x_j$, $y = \varphi(u + b)$; the activation function $\varphi$ may be: 1. a step function 2. a ramp function 3. a sigmoid function 4. a Gaussian function.
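The four activation functions can be sketched as follows; the parameter names (a, b, c, d, x, y, z) follow the slides, and the default values are illustrative assumptions:

```python
import numpy as np

def step(v, a=-1.0, b=1.0, c=0.0):
    # a below the threshold c, b at or above it
    return a if v < c else b

def ramp(v, a=-1.0, b=1.0, c=-1.0, d=1.0):
    # a up to c, b from d on, linear interpolation in between
    if v <= c:
        return a
    if v >= d:
        return b
    return a + (v - c) * (b - a) / (d - c)

def sigmoid(v, x=1.0, y=0.0, z=0.0):
    return z + 1.0 / (1.0 + np.exp(-x * v + y))

def gaussian(v):
    # standard normal density
    return np.exp(-0.5 * v ** 2) / np.sqrt(2.0 * np.pi)

for f in (step, ramp, sigmoid, gaussian):
    print(f.__name__, f(0.5))
```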

NETWORK ARCHITECTURES Three different classes of network architectures: single-layer feed-forward, multi-layer feed-forward, and recurrent. The architecture of a neural network is closely linked with the learning algorithm used to train it.

WHAT IS A FEED-FORWARD NN? In a feed-forward network, information always moves in one direction; it never goes backwards. A feedforward neural network is an artificial neural network where connections between the units do not form a directed cycle. This is different from recurrent neural networks. The feedforward neural network was the first and simplest type of artificial neural network devised. The information moves only forward, from the input nodes, through the hidden nodes (if any), to the output nodes. There are no cycles or loops in the network.

SINGLE-LAYER FEED-FORWARD The simplest kind of neural network is a single-layer perceptron network, which consists of a single layer of output nodes; the inputs are fed directly to the outputs via a series of weights. It can be considered the simplest kind of feed-forward network. A perceptron can be created using any values for the activated and deactivated states as long as the threshold value lies between the two.

SINGLE-LAYER FEED-FORWARD The sum of the products of the weights and the inputs is calculated in each node, and if the value is above some threshold (typically 0) the neuron fires and takes the activated value (typically 1); otherwise it takes the deactivated value (typically -1). The output node has a "threshold" $v$. Rule: if the summed input $\geq v$, the node "fires" (output $y = 1$); else (summed input $< v$) it does not fire (output $y = 0$). [Diagram: inputs $x_1, \ldots, x_n$ with weights $w_1, \ldots, w_n$ and a bias $b$ feed the combiner $v$ and the activation $\varphi(v)$, producing output $y$, where $\varphi(v) = 1$ if $v \geq 0$ and $\varphi(v) = -1$ if $v < 0$.]

SINGLE-LAYER LIMITATION The perceptron is used for binary classification. First train a perceptron for a classification task: find suitable weights such that the training examples are correctly classified; geometrically, try to find a hyper-plane that separates the examples of the two classes. The perceptron can only model linearly separable classes. When the two classes are not linearly separable, it may be desirable to obtain a linear separator that minimizes the mean squared error. Given training examples of classes $C_1$ and $C_2$, train the perceptron in such a way that: if the output of the perceptron is +1, the input is assigned to class $C_1$; if the output is -1, the input is assigned to $C_2$. Although a single threshold unit is quite limited in its computational power, it has been shown that networks of parallel threshold units can approximate any continuous function from a compact interval of the real numbers into the interval [-1, 1].

MULTI LAYER FEED-FORWARD A multi-layer FFNN is a more general network architecture, in which there are hidden layers between the input and output layers. Hidden nodes do not directly receive inputs from, nor send outputs to, the external environment. Multi-layer FFNNs overcome the limitation of single-layer NNs: they can handle learning tasks that are not linearly separable. [Diagram: a 3-4-2 network, with a 3-node input layer, a 4-node hidden layer, and a 2-node output layer.]
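A minimal sketch of a forward pass through such a 3-4-2 network (the random weights and the sigmoid activation are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

# 3 inputs -> 4 hidden neurons -> 2 output neurons
W1, b1 = rng.uniform(-0.5, 0.5, (4, 3)), np.zeros(4)
W2, b2 = rng.uniform(-0.5, 0.5, (2, 4)), np.zeros(2)

def forward(x):
    h = sigmoid(W1 @ x + b1)      # hidden layer activations
    return sigmoid(W2 @ h + b2)   # output layer activations

print(forward(np.array([0.1, 0.5, -0.3])))  # two output values
```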

CAN A NEURAL NETWORK INCLUDE MORE THAN TWO HIDDEN LAYERS? Commercial ANNs incorporate three and sometimes four layers, including one or two hidden layers. Each layer can contain from 10 to 1000 neurons. Experimental neural networks may have five or even six layers, including three or four hidden layers, and utilise millions of neurons, but most practical applications use only three layers, because each additional layer increases the computational burden exponentially.

TRAINING IN SINGLE LAYER The model consists of a linear combiner followed by a hard limiter. The weighted sum of the inputs is applied to the hard limiter, which produces an output equal to +1 if its input is positive and -1 if it is negative. The aim of the perceptron is to classify inputs, in other words externally applied stimuli $x_1, x_2, \ldots, x_n$, into one of two classes, say $A_1$ and $A_2$.

TRAINING IN SINGLE LAYER For the case of two inputs, $x_1$ and $x_2$, the decision boundary takes the form of a straight line (shown in bold in the figure). Point 1, which lies above the boundary line, belongs to class $A_1$; point 2, which lies below the line, belongs to class $A_2$. The threshold θ can be used to shift the decision boundary.

HOW DOES THE PERCEPTRON LEARN ITS CLASSIFICATION TASKS? The perceptron is trained (i.e., the weights and threshold values are calculated) based on an iterative training phase involving training data. Training data are composed of a list of input values and their associated desired output values. In the training phase, the inputs and related outputs of the training data are repeatedly submitted to the perceptron. The perceptron calculates an output value for each set of input values.

TRAINING IN SINGLE LAYER (EXAMPLE) Example: If the output of a particular training case is labelled 1 when it should be labelled 0, the threshold value (theta) is increased by 1, and all weight values associated with inputs of 1 are decreased by 1. The opposite is performed if the output of a training case is labelled 0 when it should be labelled 1. No changes are made to the threshold value or weights if a particular training case is correctly classified.

TRAINING IN SINGLE LAYER (EXAMPLE) This set of training rules is summarized as: If OUTPUT is correct, then no changes are made to the threshold or weights. If OUTPUT = 1 but should be 0, then {theta = theta + 1} and {weight_x = weight_x - 1, if input_x = 1}. If OUTPUT = 0 but should be 1, then {theta = theta - 1} and {weight_x = weight_x + 1, if input_x = 1}.

TRAINING IN SINGLE LAYER (EXAMPLE) [Diagram: an example of a perceptron. The system consists of binary activations. Weights are identified by w's, and inputs are identified by i's. A variable threshold value (theta) is used at the output.]

THE PERCEPTRON LEARNING RULE If at iteration $p$ the actual output is $Y(p)$ and the desired output is $Y_d(p)$, then the error is given by: $e(p) = Y_d(p) - Y(p)$. Iteration $p$ here refers to the $p$th training example presented to the perceptron. If the error $e(p)$ is positive, we need to increase the perceptron output $Y(p)$; if it is negative, we need to decrease $Y(p)$. Taking into account that each perceptron input contributes $x_i(p) \cdot w_i(p)$ to the total input $X(p)$, we find that if the input value $x_i(p)$ is positive, an increase in its weight $w_i(p)$ tends to increase the perceptron output $Y(p)$, whereas if $x_i(p)$ is negative, an increase in $w_i(p)$ tends to decrease $Y(p)$. Thus, the following perceptron learning rule can be established: $w_i(p+1) = w_i(p) + \alpha \cdot x_i(p) \cdot e(p)$, where $\alpha$ is the learning rate, a positive constant less than unity.

TRAINING IN SINGLE LAYER Once the network is trained, it can be used to classify new data sets whose input/output associations are similar to those that characterize the training data set. Thus, through an iterative training stage in which the weights and threshold gradually migrate to useful values (i.e., values that minimize or eliminate error), the perceptron can be said to learn how to solve simple problems.

EXAMPLE: TRAIN A PERCEPTRON TO PERFORM BASIC LOGICAL OPERATIONS (AND) Truth table for AND: $x_1 = 0, x_2 = 0 \Rightarrow y = 0$; $x_1 = 0, x_2 = 1 \Rightarrow y = 0$; $x_1 = 1, x_2 = 0 \Rightarrow y = 0$; $x_1 = 1, x_2 = 1 \Rightarrow y = 1$.

EXAMPLE: TRAIN A PERCEPTRON TO PERFORM BASIC LOGICAL OPERATIONS (AND) The perceptron output $Y$ is 1 only if the total weighted input $X$ is greater than or equal to the threshold value $\theta$. This means that the entire input space is divided in two along a boundary defined by $X = \theta$. If we substitute values for the weights $w_1$ and $w_2$ and the threshold $\theta = 0.2$, we obtain one of the possible separating lines, $x_1 w_1 + x_2 w_2 = \theta$. Thus, the region below the boundary line, where the output is 0, is given by $x_1 w_1 + x_2 w_2 < \theta$, and the region above this line, where the output is 1, is given by $x_1 w_1 + x_2 w_2 \geq \theta$. The fact that a perceptron can learn only linearly separable functions is rather bad news, because there are not many such functions.
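A hedged sketch of this training, using the perceptron learning rule $w_i(p+1) = w_i(p) + \alpha \, x_i(p) \, e(p)$ with $\theta = 0.2$ as in the slide; the learning rate, initial weights, and epoch cap are illustrative assumptions:

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
Yd = np.array([0, 0, 0, 1])           # desired outputs for AND
w, theta, alpha = np.zeros(2), 0.2, 0.1

for epoch in range(20):
    errors = 0
    for x, yd in zip(X, Yd):
        y = 1 if np.dot(x, w) >= theta else 0   # step activation
        e = yd - y
        w += alpha * x * e                       # perceptron learning rule
        errors += abs(e)
    if errors == 0:                              # all examples classified correctly
        break

print(w)  # converges to weights that separate AND, here [0.1, 0.1]
```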

TRAINING IN MULTILAYER More than a hundred different learning algorithms are available to train an MLP, but the most popular method is back-propagation. With back-propagation, the input data is repeatedly presented to the neural network. With each presentation, the output of the neural network is compared to the desired output and an error is computed. This error is then fed back (back-propagated) to the neural network and used to adjust the weights so that the error decreases with each iteration and the neural model gets closer and closer to producing the desired output. This process is known as "training". During the training session, the neural network receives a number of different input patterns, discovers significant features in these patterns, and learns how to classify input data into appropriate categories.

TRAINING IN MULTILAYER

TRAINING ALGORITHM: BACKPROPAGATION The backpropagation algorithm learns in the same way as a single perceptron. It searches for weight values that minimize the total error of the network over the set of training examples (training set). Backpropagation consists of the repeated application of the following two passes: Forward pass: the network is activated on one example and the error of (each neuron of) the output layer is computed. Backward pass: the network error is used for updating the weights. The error is propagated backwards from the output layer through the network, layer by layer, by recursively computing the local gradient of each neuron.
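A minimal backpropagation sketch built around these two passes, for a small network trained on XOR (a non-linearly separable task); the 2-4-1 architecture, sigmoid activations, learning rate, and epoch count are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Yd = np.array([[0], [1], [1], [0]], dtype=float)   # XOR targets

W1, b1 = rng.uniform(-0.5, 0.5, (4, 2)), np.zeros(4)
W2, b2 = rng.uniform(-0.5, 0.5, (1, 4)), np.zeros(1)
alpha = 0.5

for epoch in range(20000):
    # Forward pass: activate the network and compute the output error.
    H = sigmoid(X @ W1.T + b1)
    Y = sigmoid(H @ W2.T + b2)
    E = Yd - Y
    # Backward pass: compute local gradients layer by layer and update weights.
    dY = E * Y * (1 - Y)                 # output-layer local gradient
    dH = (dY @ W2) * H * (1 - H)         # hidden-layer local gradient
    W2 += alpha * dY.T @ H; b2 += alpha * dY.sum(axis=0)
    W1 += alpha * dH.T @ X; b1 += alpha * dH.sum(axis=0)

print(Y.round(2))  # should approach the XOR targets [[0], [1], [1], [0]]
```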

BACK-PROPAGATION The back-propagation training algorithm alternates a forward step (network activation) with a backward step (error propagation). Backpropagation adjusts the weights of the NN in order to minimize the network's total mean squared error.

BACK-PROPAGATION With one hidden layer, we can represent any continuous function of the input signals, and with two hidden layers even discontinuous functions can be represented. Please go through the PDF file.

NN DESIGN ISSUES Data representation Network Topology Network Parameters Training Validation

Data Representation Data representation depends on the problem. In general, ANNs work on continuous (real-valued) attributes, so symbolic attributes are encoded into continuous ones. Attributes of different types may have different ranges of values, which affects the training process. Normalization may be used, like the following one, which scales each attribute to assume values between 0 and 1: $x_i' = \frac{x_i - \min_i}{\max_i - \min_i}$ for each value $x_i$ of the $i$th attribute, where $\min_i$ and $\max_i$ are the minimum and maximum values of that attribute over the training set.
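A short sketch of this min-max normalization (the data values are illustrative):

```python
import numpy as np

# Scale each attribute (column) to [0, 1] using
# x_i' = (x_i - min_i) / (max_i - min_i).
data = np.array([[2.0, 100.0],
                 [4.0, 300.0],
                 [3.0, 200.0]])
mins, maxs = data.min(axis=0), data.max(axis=0)
scaled = (data - mins) / (maxs - mins)
print(scaled)  # each column now spans [0, 1]
```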

Network Topology The number of layers and neurons depend on the specific task. In practice this issue is solved by trial and error. Two types of adaptive algorithms can be used: start from a large network and successively remove some neurons and links until network performance degrades. begin with a small network and introduce new neurons until performance is satisfactory.

Network parameters How are the weights initialized? How is the learning rate chosen? How many hidden layers and how many neurons? How many examples in the training set?

INITIALIZATION OF WEIGHTS In general, initial weights are randomly chosen, with typical values between -1.0 and 1.0 or -0.5 and 0.5. If some inputs are much larger than others, random initialization may bias the network to give much more importance to larger inputs. In such a case, weights can be initialized as follows: $w_{ij} = \frac{1}{2N} \cdot \frac{1}{x_i}$, $i = 1, \ldots, N$, for weights from the input to the first layer, and $w_{jk} = \frac{1}{2N} \cdot \frac{1}{\sum_i w_{ij} x_i}$, $i = 1, \ldots, N$, for weights from the first to the second layer.
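A minimal sketch of plain random initialization in one of the typical ranges mentioned above (the layer sizes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(42)
n_inputs, n_hidden = 3, 4
# Uniform random weights in [-0.5, 0.5], one of the typical ranges above.
W1 = rng.uniform(-0.5, 0.5, size=(n_hidden, n_inputs))
print(W1)
```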

Choice of learning rate The right value of α depends on the application. Values between 0.1 and 0.9 have been used in many applications. Another heuristic is to adapt α during training, as described in the previous slides.

NUMBER OF TRAINING EXAMPLES Rule of thumb: the number of training examples should be at least five to ten times the number of weights of the network. Another rule: $N \geq \frac{W}{1 - a}$, where $W$ is the number of weights and $a$ is the expected accuracy on the test set. For example, with $W = 100$ weights and an expected accuracy $a = 0.9$, at least $N = 1000$ training examples are needed.

RECURRENT NETWORK An FFNN is acyclic: data passes from the input to the output nodes and not vice versa. Once the FFNN is trained, its state is fixed and does not alter as new data is presented to it; it does not have memory. A recurrent network can have connections that go backward from output to input nodes, and can model dynamic systems. In this way, a recurrent network's internal state can be altered as sets of input data are presented; it can be said to have memory. This is useful in solving problems where the solution depends not just on the current inputs but on all previous inputs. Applications: stock market prediction, weather forecasting.

RECURRENT NETWORK ARCHITECTURE [Diagram: a recurrent network with hidden neurons; a unit delay operator $d$ is used to model the dynamic system, with delayed feedback connections among the input, hidden, and output layers.]

LEARNING AND TRAINING During the learning phase, a recurrent network feeds its inputs through the network, including feeding data back from the outputs to the inputs, and the process is repeated until the values of the outputs no longer change. This state is called equilibrium or stability. Recurrent networks can be trained with the back-propagation algorithm: at each step, the activation of the output is compared with the desired activation and the errors are propagated backward through the network. Once this training process is completed, the network becomes capable of performing a sequence of actions.
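A hedged sketch of the recurrent idea: the hidden state is fed back, so the output after each step depends on all previous inputs, not just the current one (the sizes, weights, and tanh activation are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
W_in = rng.uniform(-0.5, 0.5, (4, 2))    # input -> hidden weights
W_rec = rng.uniform(-0.5, 0.5, (4, 4))   # hidden -> hidden feedback weights

h = np.zeros(4)  # initial state: no memory yet
for x in [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]:
    h = np.tanh(W_in @ x + W_rec @ h)    # new state depends on current and past inputs
print(h)
```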

SUMMARY ANN: Neuron model (activation functions: step, ramp, sigmoid, Gaussian); Architecture (single-layer feed-forward, multi-layer feed-forward, recurrent); Learning algorithm (back-propagation).