Backpropagation Neural Net

As is the case with most neural networks, the aim of backpropagation is to train the net to achieve a balance between the ability to respond correctly to the input patterns that are used for training (memorization) and the ability to give reasonable (good) responses to input that is similar, but not identical, to that used in training (generalization). After training, application of the net involves only the computations of the feedforward phase: even if training is slow, a trained net can produce its output very rapidly.

The training of a network by backpropagation involves three stages:
1- The feedforward of the input training pattern.
2- The calculation and backpropagation of the associated error.
3- The adjustment of the weights.

1- Architecture
A multilayer neural network with one layer of hidden units (the Z units) is shown in Figure (1). The output units (the Y units) and the hidden units may also have biases (as shown). The bias on a typical output unit Y_k is denoted by w_0k; the bias on a typical hidden unit Z_j is denoted v_0j. Only the direction of information flow for the feedforward phase of operation is shown; during the backpropagation phase of learning, signals are sent in the reverse direction. The algorithm is presented for one hidden layer, which is adequate for a large number of applications.

2- Algorithm
As mentioned earlier, training a network by backpropagation involves three stages: the feedforward of the input training pattern, the backpropagation of the associated error, and the adjustment of the weights.
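The three stages above can be sketched in code. The following is a minimal NumPy sketch of one training cycle for a net with one hidden layer; the function names, array shapes, sample data, and the use of the logistic sigmoid are illustrative assumptions, not taken from the text.

```python
import numpy as np

def sigmoid(x):
    # Logistic (binary sigmoid) activation; an assumption for this sketch.
    return 1.0 / (1.0 + np.exp(-x))

def train_step(x, t, V, v0, W, w0, alpha=0.25):
    """One feedforward / backpropagation / weight-update cycle.

    x : input (n,)    t : target (m,)
    V : input->hidden weights (n, p)   v0 : hidden biases (p,)
    W : hidden->output weights (p, m)  w0 : output biases (m,)
    """
    # Stage 1: feedforward
    z = sigmoid(v0 + x @ V)            # hidden activations z_j
    y = sigmoid(w0 + z @ W)            # output activations y_k

    # Stage 2: backpropagate the error
    delta_k = (t - y) * y * (1 - y)          # f'(y_in) = f(1 - f) for logistic
    delta_j = (delta_k @ W.T) * z * (1 - z)  # error distributed to hidden units

    # Stage 3: adjust weights and biases (in place)
    W += alpha * np.outer(z, delta_k)
    w0 += alpha * delta_k
    V += alpha * np.outer(x, delta_j)
    v0 += alpha * delta_j
    return y

# Illustrative run on a single made-up pattern.
rng = np.random.default_rng(0)
n, p, m = 2, 4, 1
V = rng.uniform(-0.5, 0.5, (n, p)); v0 = rng.uniform(-0.5, 0.5, p)
W = rng.uniform(-0.5, 0.5, (p, m)); w0 = rng.uniform(-0.5, 0.5, m)
x = np.array([1.0, 0.0]); t = np.array([1.0])

before = abs(t - train_step(x, t, V, v0, W, w0))
for _ in range(200):
    y = train_step(x, t, V, v0, W, w0)
after = abs(t - y)
```

Repeating the cycle on the same pattern drives the output toward the target, which is the memorization behavior described above.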
Figure (1) Backpropagation neural network with one hidden layer

During feedforward:
- Each input unit (X_i) receives an input signal and broadcasts this signal to each of the hidden units Z_1, ..., Z_p.
- Each hidden unit then computes its activation and sends its signal z_j to each output unit.
- Each output unit Y_k computes its activation y_k to form the response of the net for the given input pattern.

During training:
- Each output unit compares its computed activation y_k with its target value t_k to determine the associated error for that pattern with that unit.
- Based on this error, the factor δ_k (k = 1, ..., m) is computed.
- δ_k is used to distribute the error at output unit Y_k back to all units in the previous layer (the hidden units that are connected to Y_k).
- It is also used (later) to update the weights between the output and the hidden layer.
- In a similar manner, the factor δ_j (j = 1, ..., p) is computed for each hidden unit Z_j.
- It is not necessary to propagate the error back to the input layer, but δ_j is used to update the weights between the hidden layer and the input layer.

Adjustment of the weights:
- After all of the δ factors have been determined, the weights for all layers are adjusted simultaneously.
- The adjustment to the weight w_jk (from hidden unit Z_j to output unit Y_k) is based on the factor δ_k and the activation z_j of hidden unit Z_j.
- The adjustment to the weight v_ij (from input unit X_i to hidden unit Z_j) is based on the factor δ_j and the activation x_i of the input unit.

The nomenclature we use in the training algorithm for the backpropagation net is as follows:
x    Input training vector: x = (x_1, ..., x_i, ..., x_n).
t    Output target vector: t = (t_1, ..., t_k, ..., t_m).
δ_k  Portion of the error correction weight adjustment for w_jk that is due to an error at output unit Y_k; also, the information about the error at unit Y_k that is propagated back to the hidden units that feed into unit Y_k.
δ_j  Portion of the error correction weight adjustment for v_ij that is due to the backpropagation of error information from the output layer to hidden unit Z_j.
α    Learning rate.
X_i  Input unit i:
For an input unit, the input signal and output signal are the same, namely, x_i.
v_0j  Bias on hidden unit Z_j.
Z_j   Hidden unit j: The net input to Z_j is denoted z_in_j:
z_in_j = v_0j + Σ_{i=1}^{n} x_i v_ij
The output signal (activation) of Z_j is denoted z_j:
z_j = f(z_in_j)
w_0k  Bias on output unit Y_k.
Y_k   Output unit k: The net input to Y_k is denoted y_in_k:
y_in_k = w_0k + Σ_{j=1}^{p} z_j w_jk
The output signal (activation) of Y_k is denoted y_k:
y_k = f(y_in_k)

Activation function
An activation function for a backpropagation net should have several important characteristics: it should be continuous, differentiable, and nondecreasing. Furthermore, for computational efficiency, it is desirable that its derivative be easy to compute. One of the most typical activation functions is the binary sigmoid function, which has range (0, 1) and is defined as
f_1(x) = 1 / (1 + e^(-x))
with
f_1'(x) = f_1(x)[1 - f_1(x)]
Another common activation function is the bipolar sigmoid, which has range (-1, 1) and is defined as
f_2(x) = 2 / (1 + e^(-x)) - 1
with
f_2'(x) = ½ [1 + f_2(x)][1 - f_2(x)]
Note that the bipolar sigmoid function is closely related to the function
tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))

Training algorithm
Either of the activation functions defined in the previous section can be used in the standard backpropagation algorithm given here. The form of the data (especially the target values) is an important factor in choosing the appropriate function. The algorithm is as follows:
1- Initialize weights. (Set to small random values.)
2- For each training pair:
Feedforward:
2.1 Each input unit (X_i, i = 1, ..., n) receives input signal x_i and broadcasts this signal to all units in the layer above (the hidden units).
2.2 Each hidden unit (Z_j, j = 1, ..., p) sums its weighted input signals,
z_in_j = v_0j + Σ_{i=1}^{n} x_i v_ij
applies its activation function to compute its output signal,
z_j = f(z_in_j)
and sends this signal to all units in the layer above (the output units).
2.3 Each output unit (Y_k, k = 1, ..., m) sums its weighted input signals,
y_in_k = w_0k + Σ_{j=1}^{p} z_j w_jk
and applies its activation function to compute its output signal,
y_k = f(y_in_k)
Backpropagation of error:
2.4 Each output unit (Y_k, k = 1, ..., m) receives a target pattern corresponding to the input training pattern, computes its error information term,
δ_k = (t_k - y_k) f'(y_in_k)
calculates its weight correction term (used to update w_jk later),
Δw_jk = α δ_k z_j
calculates its bias correction term (used to update w_0k later),
Δw_0k = α δ_k
and sends δ_k to units in the layer below.
2.5 Each hidden unit (Z_j, j = 1, ..., p) sums its delta inputs (from units in the layer above),
δ_in_j = Σ_{k=1}^{m} δ_k w_jk
multiplies by the derivative of its activation function to calculate its error information term,
δ_j = δ_in_j f'(z_in_j)
calculates its weight correction term (used to update v_ij later),
Δv_ij = α δ_j x_i
and calculates its bias correction term (used to update v_0j later),
Δv_0j = α δ_j
Update weights and biases:
2.6 Each output unit (Y_k, k = 1, ..., m) updates its bias and weights (j = 0, ..., p):
w_jk(new) = w_jk(old) + Δw_jk
Each hidden unit (Z_j, j = 1, ..., p) updates its bias and weights (i = 0, ..., n):
v_ij(new) = v_ij(old) + Δv_ij
Note that in implementing this algorithm, separate arrays should be used for the deltas for the output units (δ_k) and the deltas for the hidden units (δ_j).

Choice of initial weights and biases
Random Initialization: A common procedure is to initialize the weights (and biases) to random values between -0.5 and 0.5 (or between -1 and 1 or some other suitable interval).
Nguyen-Widrow Initialization: The following simple modification of the common random weight initialization just presented typically gives much faster learning. Weights from the hidden units to the output units (and biases on the output units) are initialized to random values between -0.5 and 0.5, as is commonly the case. The initialization of the weights from the input units to the hidden units is designed to improve the ability of the hidden units to learn. The definitions we use are as follows:
n  number of input units
p  number of hidden units
γ  scale factor: γ = 0.7 (p)^(1/n)
The procedure consists of the following steps, for each hidden unit (j = 1, ..., p):
Initialize its weight vector (from the input units):
v_ij(old) = random number between -0.5 and 0.5 (or between -γ and γ).
Compute the Euclidean norm (length) of the vector v_j:
||v_j(old)|| = sqrt( v_1j(old)^2 + ... + v_nj(old)^2 )
Reinitialize the weights:
v_ij = γ v_ij(old) / ||v_j(old)||
Set the bias:
v_0j = random number between -γ and γ.
The Nguyen-Widrow analysis is based on the activation function tanh(x).

Application procedure
After training, a backpropagation neural net is applied by using only the feedforward phase of the training algorithm. The application procedure is as follows:
1- Initialize weights (from the training algorithm).
2- For each input vector:
2.1 For i = 1, ..., n: set the activation of input unit X_i.
2.2 For j = 1, ..., p:
z_in_j = v_0j + Σ_{i=1}^{n} x_i v_ij
z_j = f(z_in_j)
2.3 For k = 1, ..., m:
y_in_k = w_0k + Σ_{j=1}^{p} z_j w_jk
y_k = f(y_in_k)

Example-1: Find the new weights when the net illustrated in Figure (2) is presented the input pattern (-1, 1) and the target output is 1. Use a learning rate of α = 0.25, and the bipolar sigmoid activation function.
Sol:

Z_j  x_1  x_2  v_0j  v_1j  v_2j  z_in_j  z_j     δ_in_j  δ_j   Δv_0j  Δv_1j   Δv_2j
1    -1   1    0.4   0.7   -0.2  -0.5    -0.245  0.285   0.13  0.03   -0.03   0.03
2    -1   1    0.6   -0.4  0.3   1.3     0.57    0.057   0.02  0.005  -0.005  0.005

Y_k  z_1     z_2   w_0k  w_1k  w_2k  y_in_k  y_k    δ_k   Δw_0k  Δw_1k   Δw_2k
1    -0.245  0.57  -0.3  0.5   0.1   -0.365  -0.18  0.57  0.143  -0.035  0.081

Feedforward:
z_in_j = v_0j + Σ_{i=1}^{n} x_i v_ij
z_in_1 = 0.4 + (-1)(0.7) + (1)(-0.2) = -0.5
z_in_2 = 0.6 + (-1)(-0.4) + (1)(0.3) = 1.3
z_j = f(z_in_j)
The activation function is the bipolar sigmoid,
f(z_in_j) = 2 / (1 + e^(-z_in_j)) - 1
z_1 = -0.245
z_2 = 0.57
y_in_k = w_0k + Σ_{j=1}^{p} z_j w_jk
y_in_1 = -0.3 + (-0.245)(0.5) + (0.57)(0.1) = -0.365
y_k = f(y_in_k)
f(y_in_k) = 2 / (1 + e^(-y_in_k)) - 1
y_1 = f(-0.365) = -0.18

Backpropagation of error:
δ_k = (t_k - y_k) f'(y_in_k)
δ_k = (t_k - y_k)[0.5{1 + f(y_in_k)}{1 - f(y_in_k)}]
    = (1 - (-0.18))(0.5)(1 + (-0.18))(1 - (-0.18)) = 0.57
Δw_jk = α δ_k z_j
Δw_1 = 0.25 * 0.57 * (-0.245) = -0.035
Δw_2 = 0.25 * 0.57 * 0.57 = 0.081
Δw_0 = α δ_k = 0.25 * 0.57 = 0.143
δ_in_j = Σ_{k=1}^{m} δ_k w_jk
δ_in_1 = 0.57 * 0.5 = 0.285
δ_in_2 = 0.57 * 0.1 = 0.057
δ_j = δ_in_j f'(z_in_j)
δ_1 = 0.285 * 0.5 * (1 + f(z_in_1))(1 - f(z_in_1))
δ_1 = 0.285 * 0.5 * (1 + (-0.245)) * (1 - (-0.245)) = 0.13
δ_2 = 0.057 * 0.5 * (1 + 0.57) * (1 - 0.57) = 0.02
Δv_ij = α δ_j x_i
Δv_11 = 0.25 * 0.13 * (-1) = -0.03
Δv_21 = 0.25 * 0.13 * (1) = 0.03
Δv_12 = 0.25 * 0.02 * (-1) = -0.005
Δv_22 = 0.25 * 0.02 * (1) = 0.005
Δv_0j = α δ_j
Δv_01 = 0.25 * 0.13 = 0.03
Δv_02 = 0.25 * 0.02 = 0.005

Update weights and biases:
w_jk(new) = w_jk(old) + Δw_jk
w_1(new) = 0.5 - 0.035 = 0.465
w_2(new) = 0.1 + 0.081 = 0.181
w_0(new) = -0.3 + 0.143 = -0.157
v_ij(new) = v_ij(old) + Δv_ij
v_11(new) = 0.7 - 0.03 = 0.67
v_21(new) = -0.2 + 0.03 = -0.17
v_12(new) = -0.4 - 0.005 = -0.405
v_22(new) = 0.3 + 0.005 = 0.305
v_01(new) = 0.4 + 0.03 = 0.43
v_02(new) = 0.6 + 0.005 = 0.605

The Use of Momentum as Alternative Weight Update Procedure
In backpropagation, convergence is sometimes faster if a momentum term is added to the weight update formulas. In order to use momentum, weights (or weight updates) from one or more previous training patterns must be saved. For example, in the simplest form of backpropagation with momentum, the new weights for training step t + 1 are based on the weights at training steps t and t - 1. The weight update formulas for backpropagation with momentum are
w_jk(t + 1) = w_jk(t) + α δ_k z_j + μ [w_jk(t) - w_jk(t - 1)]
v_ij(t + 1) = v_ij(t) + α δ_j x_i + μ [v_ij(t) - v_ij(t - 1)]
where the momentum parameter μ is constrained to be in the range from 0 to 1, exclusive of the end points.

Batch updating of weights
In some cases it is advantageous to accumulate the weight correction terms for several patterns (or even an entire epoch, if there are not too many patterns) and make a single weight adjustment (equal to the average of the weight correction terms) for each weight, rather than updating the weights after each pattern is presented. This procedure has a smoothing effect on the correction terms. In some cases, this smoothing may increase the chances of convergence to a local minimum.
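The two alternative update procedures can be sketched as follows; this is a minimal NumPy sketch for a single weight array, and the function names, the sample numbers, and the momentum value μ = 0.9 are illustrative assumptions, not from the text.

```python
import numpy as np

def momentum_update(W, correction, prev_delta, alpha=0.25, mu=0.9):
    """w(t+1) = w(t) + alpha * (correction term) + mu * [w(t) - w(t-1)].

    `correction` is the usual term (e.g. delta_k * z_j); `prev_delta`
    is the weight change applied at the previous step.
    """
    delta_W = alpha * correction + mu * prev_delta
    return W + delta_W, delta_W  # keep delta_W for the next step

def batch_update(W, corrections, alpha=0.25):
    """Accumulate per-pattern correction terms and apply their average once."""
    return W + alpha * np.mean(corrections, axis=0)

# Example: the momentum term keeps the weight moving even when the
# current correction term is zero.
W = np.array([1.0])
W, d = momentum_update(W, np.array([2.0]), np.array([0.0]), alpha=0.5)
W, d = momentum_update(W, np.array([0.0]), d, alpha=0.5)
print(W)  # prints [2.9]: carried forward by the momentum term alone
```

Note the design choice this illustrates: momentum needs one extra saved quantity per weight (the previous update), while batch updating needs the correction terms for all patterns of the batch before any weight changes.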