Network training: Training a neural network involves determining the weight parameter vector w that minimizes a cost function. Given a training set comprising input vectors {x_n}, n = 1, ..., N, together with corresponding targets {t_n}, we minimize the error function

E(w) = (1/2) Σ_{n=1}^{N} ||y(x_n, w) - t_n||^2.    (1)

Batch gradient descent: Gradient descent is a method for minimizing a given cost or objective function E(w):

w^(τ+1) = w^(τ) - η ∇E(w^(τ))

where η > 0 is the learning rate. After each such update, the gradient is re-evaluated and the process repeated. Note that each step requires the whole data set to be processed; this is called the batch method.
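As a minimal sketch of the batch update above, consider a toy linear model y(x, w) = wᵀx fitted to synthetic data (the model, data, and learning rate are assumptions for illustration, not from the notes):

```python
import numpy as np

# Toy setup (an assumption for illustration): a linear model y(x, w) = w @ x
# trained on synthetic data, minimizing E(w) = 1/2 * sum_n (y(x_n, w) - t_n)^2.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))          # N = 50 inputs with 2 features
w_true = np.array([2.0, -1.0])        # hypothetical "true" weights
t = X @ w_true                        # targets generated from w_true

w = np.zeros(2)                       # initial weight vector
eta = 0.01                            # learning rate

for _ in range(200):                  # each step processes the whole data set
    grad = X.T @ (X @ w - t)          # gradient of E(w), summed over all n
    w = w - eta * grad                # w^(tau+1) = w^(tau) - eta * grad E

print(w)                              # approaches w_true
```

Because the gradient sums over every data point, one update here costs a full pass over the data set.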
On-line gradient descent: When the error function comprises a sum of terms, one for each data point of a set of independent observations,

E(w) = Σ_{n=1}^{N} E_n(w),    (2)

the weight vector can be updated one data point at a time:

w^(τ+1) = w^(τ) - η ∇E_n(w^(τ))

The process is repeated by cycling through the data either in sequence or by selecting points at random.

- Gradient descent always moves downwards in a hilly error landscape; local minima can trap the movement.
- Initialisation is important to avoid local minima.
- The choice of the learning rate η is crucial for the speed of convergence.
- For the on-line gradient descent algorithm, the direction of movement fluctuates strongly between steps, but the average direction approximates the steepest-descent direction of the batch version.
- On-line gradient descent has a lower computational cost per step, but is slower in the sense that it needs more iterations to converge.
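A minimal sketch of the on-line variant, again for a toy linear model y(x, w) = wᵀx on synthetic data (both are assumptions for illustration): the weights are updated after every single data point rather than after a full pass.

```python
import numpy as np

# Toy setup (an assumption for illustration): on-line gradient descent on
# E(w) = sum_n E_n(w) with E_n(w) = 1/2 * (w @ x_n - t_n)^2.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
w_true = np.array([2.0, -1.0])        # hypothetical "true" weights
t = X @ w_true

w = np.zeros(2)
eta = 0.05

for epoch in range(100):
    for n in rng.permutation(len(X)):  # visit the points in random order
        err = X[n] @ w - t[n]          # y(x_n, w) - t_n
        w = w - eta * err * X[n]       # w <- w - eta * grad E_n(w)

print(w)                               # approaches w_true
```

Each update is cheap (one data point), but the trajectory fluctuates from step to step, as the bullet points above describe.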
Example: Rosenblatt's Perceptron. Rosenblatt's Perceptron is a linear discriminant classifier whose activation function takes the form

φ(v) = sgn(v) = { +1 if v ≥ 0; -1 if v < 0 }    (3)

The training data set comprises N input vectors x_1, ..., x_N, with corresponding target values t_1, ..., t_N, where t_n ∈ {-1, +1}.
1. Initialization: Set w(0) = 0. Then perform the following computation for steps n = 1, 2, ...
2. Compute the response of the Perceptron as

y_n = sgn(w(n)^T x_n)    (4)

3. Apply the on-line gradient descent update

w(n+1) = w(n) + η [t_n - y_n] x_n    (5)

or, equivalently,

w(n+1) = { w(n) if t_n = y_n; w(n) + 2η t_n x_n if t_n ≠ y_n }    (6)

4. Continuation: Increment the time step by one and go back to step 2.

The Perceptron is guaranteed to converge in a finite number of steps if the data set is linearly separable.
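The algorithm above can be sketched as follows; the small linearly separable data set is made up for illustration:

```python
import numpy as np

# Sketch of Rosenblatt's perceptron rule on a tiny linearly separable
# data set (the data values are assumptions for illustration).
def sgn(v):
    return 1 if v >= 0 else -1

X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
t = np.array([1, 1, -1, -1])
w = np.zeros(2)                              # initialization: w(0) = 0
eta = 0.5

converged = False
while not converged:                         # finite, since data is separable
    converged = True
    for n in range(len(X)):
        y = sgn(w @ X[n])                    # perceptron response (4)
        if y != t[n]:
            w = w + eta * (t[n] - y) * X[n]  # update (5); t_n - y_n = 2 t_n here
            converged = False

print(w)                                     # separates the two classes
```

Note that weights change only on misclassified points, exactly as the case split in (6) states.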
Backpropagation: Backpropagation is the application of the gradient descent algorithm to training an MLP. One needs to calculate the derivative of the squared error function with respect to the weights of the network. Consider the squared error function

E(w) = Σ_{n=1}^{N} E_n(w),    (7)

from which we evaluate ∇E_n(w) as used in sequential optimization, where

E_n = (1/2) Σ_k (y_nk - t_nk)^2    (8)

and y_nk = y_k(x_n, w). If the model is a simple linear one,

y_k = Σ_i w_ki x_i,    (9)

the gradient of this error function with respect to a weight w_ji is given by

∂E_n/∂w_ji = (y_nj - t_nj) x_ni    (10)
which can be interpreted as a local computation involving the product of an error signal (y_nj - t_nj), at the output end of the link w_ji, and the variable x_ni, at the input end of the link.

In a general feed-forward network, each unit computes

a_j = Σ_i w_ji z_i,   z_j = h(a_j)    (11)

where h(·) is a nonlinear activation function. This step is often called forward propagation, and can be regarded as a forward flow of information through the network. Applying the chain rule for partial derivatives gives

∂E_n/∂w_ji = (∂E_n/∂a_j)(∂a_j/∂w_ji) = δ_j z_i    (12)
where δ_j ≡ ∂E_n/∂a_j is referred to as the error. Equation (12) tells us that the required derivative is obtained by multiplying the value of δ for the unit at the output end of the weight by the value of z at the input end. For an output unit we have

δ_k = y_k - t_k    (13)

For the hidden units, we again use the chain rule:

δ_j = ∂E_n/∂a_j = Σ_k (∂E_n/∂a_k)(∂a_k/∂a_j)    (14)

where the sum runs over all units k to which unit j sends connections. We obtain the backpropagation formula

δ_j = h'(a_j) Σ_k w_kj δ_k    (15)

which tells us that the value of δ for a particular hidden unit is obtained by propagating the δs backwards from units higher up in the network.
Summary of Backpropagation
1. Apply an input vector x_n to the network and forward propagate through the network to find the activations of all hidden and output units.
2. Evaluate δ_k for all the output units using

δ_k = y_k - t_k    (16)

3. Backpropagate the δs to obtain δ_j for each hidden unit in the network, using

δ_j = h'(a_j) Σ_k w_kj δ_k    (17)

4. Evaluate the derivatives using

∂E_n/∂w_ji = δ_j z_i    (18)
Example: A two-layer network

y_k(x, w) = Σ_{j=0}^{M} w^(2)_kj h( Σ_{i=0}^{D} w^(1)_ji x_i )

where the output activation function is a linear function, and the hidden units have the sigmoidal activation function

h(a) = tanh(a) = (e^a - e^{-a}) / (e^a + e^{-a})    (19)

A useful feature of this function is that its derivative can be expressed in a simple form:

h'(a) = 1 - h(a)^2    (20)

For each pattern in the training set, we first perform a forward propagation using

a_j = Σ_{i=0}^{D} w^(1)_ji x_i,   z_j = tanh(a_j),   y_k = Σ_{j=0}^{M} w^(2)_kj z_j
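The identity (20) is easy to verify numerically; the sketch below checks it against a central finite difference (the test point a = 0.7 is an arbitrary choice):

```python
import math

# Numerical check of h'(a) = 1 - h(a)^2 for h(a) = tanh(a),
# using a central finite-difference approximation.
a = 0.7                                  # arbitrary test point
eps = 1e-6
numeric = (math.tanh(a + eps) - math.tanh(a - eps)) / (2 * eps)
analytic = 1 - math.tanh(a) ** 2
print(abs(numeric - analytic))           # the two agree to high precision
```

This simple form of the derivative is what makes tanh hidden units convenient for backpropagation: h'(a_j) can be computed from the already-stored activation z_j.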
Next we compute the δ for each output unit using

δ_k = y_k - t_k.    (21)

Then we backpropagate to obtain the δs for the hidden units using

δ_j = (1 - z_j^2) Σ_{k=1}^{K} w^(2)_kj δ_k.    (22)

Finally, the derivatives with respect to the first-layer and second-layer weights are given by

∂E_n/∂w^(1)_ji = δ_j x_i,   ∂E_n/∂w^(2)_kj = δ_k z_j