Tutorial on Tangent Propagation



Yichuan Tang
Centre for Theoretical Neuroscience
February 5,

1 Introduction

Tangent propagation is a learning technique for artificial neural networks (ANNs) that enforces soft constraints on the first-order partial derivatives of the output vector [2]. Normally, when an ANN is trained with the mean squared error (MSE), the discrepancy between the target function $F(x^0)$ and the network function $G(x^0; W)$ is minimized only at the training data locations $x^0_k$, where $k$ indexes the training examples.

2 Feedforward Neural Network

Before adding the tangent constraint to the training of the network, it is useful to see why backpropagation is needed when training a normal network. Let the summed input to the nonlinearity of the $i$-th artificial neuron (henceforth referred to as a node) in the $l$-th layer be $a^l_i$, and let the activity of that node be $x^l_i$. Let $W^l_{ij}$ be the weight of the $l$-th layer that connects $x^{l-1}_j$ to $a^l_i$. This ANN is shown in figure 1. In mathematical terms,

$$a^{lT} = x^{(l-1)T} W^{lT} + b^{lT} \tag{1}$$

$$x^l = \sigma(a^l) \tag{2}$$

The above two equations apply to all layers from $a^0$ to $x^L$, where 0 is the first (input) node layer and $L$ is the last (output) node layer. It is most common to define the error as

$$E_p = \frac{1}{2}(x^L - t)^T (x^L - t) \tag{3}$$

where $t$ is the correct output vector. In order to do first-order optimization on the error surface, we need to find $\partial E_p / \partial W^l$ for all layers of weights.
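As a minimal sketch of equations 1 to 3, the forward pass can be written in plain Python. The logistic sigmoid as $\sigma$, the 2-3-1 layer sizes, and all weight values below are assumptions chosen for illustration; the tutorial does not fix any of them.

```python
import math

def sigmoid(a):
    """Logistic nonlinearity, one possible choice for sigma in equation 2."""
    return 1.0 / (1.0 + math.exp(-a))

def forward(x0, weights, biases):
    """Apply equations 1 and 2 layer by layer.

    weights[l-1][i][j] plays the role of W^l_ij and biases[l-1][i] of b^l_i.
    Returns (pre, xs): pre[l-1] is a^l, xs[l] is x^l, with xs[0] the input.
    """
    xs, pre = [list(x0)], []
    for W, b in zip(weights, biases):
        x_prev = xs[-1]
        a = [sum(w * x for w, x in zip(row, x_prev)) + b_i
             for row, b_i in zip(W, b)]          # equation 1
        pre.append(a)
        xs.append([sigmoid(a_i) for a_i in a])   # equation 2
    return pre, xs

def squared_error(x_out, t):
    """E_p of equation 3."""
    return 0.5 * sum((x - t_i) ** 2 for x, t_i in zip(x_out, t))

# Hypothetical 2-3-1 network with made-up weights.
W1 = [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]]
b1 = [0.0, 0.1, -0.1]
W2 = [[0.3, -0.1, 0.2]]
b2 = [0.05]
pre, xs = forward([1.0, 2.0], [W1, W2], [b1, b2])
E_p = squared_error(xs[-1], [1.0])
```

Keeping both `pre` and `xs` mirrors the distinction between $a^l$ and $x^l$ in the text, since the derivatives of the error are taken with respect to the former.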

Figure 1: Graphical representation of an ANN. The green node is the bias with output of 1. The squares are the activities before the nonlinear activation and the circles represent the activities after the nonlinear activation.

Using matrix calculus,

$$\frac{\partial E_p}{\partial W^l_{ij}} = \frac{\partial a^l_i}{\partial W^l_{ij}} \frac{\partial E_p}{\partial a^l_i} \tag{4}$$

If we denote $\partial E_p / \partial a^l \equiv \delta^l$, then

$$\frac{\partial E_p}{\partial W^l_{ij}} = x^{l-1}_j \delta^l_i \tag{5}$$

$$\frac{\partial E_p}{\partial W^l} = \delta^l x^{(l-1)T} \tag{6}$$

Therefore, to get the first-order partial derivatives of $E_p$ w.r.t. $W^l$, all you need is the activity $x^{l-1}$ of the previous layer and the partial derivatives $\delta^l$ of the layer above it. Since an ANN is a function made up of nested compositions of equations 1 and 2, the chain rule lets us write $\delta^l$ as the product of a matrix and $\delta^{l+1}$:

$$\delta^l = \frac{\partial a^{l+1}}{\partial a^l} \delta^{l+1} \tag{7}$$

where the matrix $\partial a^{l+1} / \partial a^l$ has the entry $\partial a^{l+1}_i / \partial a^l_j = W^{l+1}_{ij} \sigma'(a^l_j)$ in row $j$ and column $i$. Applying equation 7 recursively from the output layer downwards is what is known as the backpropagation algorithm.

3 Tangent Penalty

Notice that $E_p$ as defined in equation 3 only requires that $G(x^0; W)$ be as close to $F(x^0)$ as possible at the input training points. However, if we know the partial derivatives of $F(\cdot)$ at some input locations, then we can try to match the partial derivatives of $G(\cdot; W)$ at those same input locations. Since $G(\cdot; W)$ is a vector function, the first partial derivative $\partial G(x^0; W) / \partial x^0$ is a matrix, and it can be defined as $J^T$, where $J$ is the Jacobian matrix of $G$.

However, we often do not have all the information needed to constrain the Jacobian matrix at various input locations. What we often have is the directional Jacobian at input locations. The directional Jacobian is a direct extension of directional derivatives for scalar functions. Intuitively, we often know how $F$ changes as $x^0 \to x^0 + \tau$. Let us define

$$S(x^0; \alpha) = x^0 + \alpha\tau \tag{8}$$

The directional Jacobians for $F$ and $G$ will be

$$\kappa^L \equiv \left.\frac{\partial F(S(x^0; \alpha))}{\partial \alpha}\right|_{\alpha=0} \tag{9}$$

and

$$\varepsilon^L \equiv \left.\frac{\partial G(S(x^0; \alpha); W)}{\partial \alpha}\right|_{\alpha=0} \tag{10}$$

respectively. We can add a tangent penalty term

$$E_t = \frac{1}{2}(\varepsilon^L - \kappa^L)^T (\varepsilon^L - \kappa^L) \tag{11}$$

This is known as the tangent penalty because the method was introduced for image recognition, where $\tau_k$ is the tangent vector of the $k$-th input vector $x^0_k$ on a high-dimensional class-specific manifold. For classification purposes, it is desired that the output of $G$ remain unchanged even though $S(x^0; \alpha)$ is applied in the input space. Therefore, $\kappa^L = 0$ and we have

$$E_t = \frac{1}{2}\varepsilon^{LT} \varepsilon^L \tag{12}$$

Notice that because of the compositional nature of an ANN, the directional Jacobian can be calculated as a series of matrix multiplications:

$$\varepsilon^L = \frac{\partial S(x^0; \alpha)}{\partial \alpha} \frac{\partial x^0}{\partial a^0} \frac{\partial a^1}{\partial x^0} \frac{\partial x^1}{\partial a^1} \cdots \frac{\partial a^L}{\partial x^{L-1}} \frac{\partial x^L}{\partial a^L} \tag{13}$$

$$\varepsilon^L = \underbrace{\underbrace{\tau \frac{\partial x^0}{\partial a^0}}_{\varepsilon^0} \frac{\partial a^1}{\partial x^0} \frac{\partial x^1}{\partial a^1}}_{\varepsilon^1} \cdots \frac{\partial a^L}{\partial x^{L-1}} \frac{\partial x^L}{\partial a^L} \tag{14}$$

The fact that $\varepsilon$ can be computed in a composed manner suggests that it would be easy to construct a virtual network which takes $\tau$ as input and outputs $E_t$.
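The chain of matrix products in equations 13 and 14 can be sketched as a second quantity carried along the forward pass. The example below is an illustration under assumptions that are not in the tutorial: a logistic sigmoid on every non-input layer, a linear input layer (so the tangent starts as $\tau$), and made-up weights. It cross-checks the propagated $\varepsilon^L$ against a central difference in $\alpha$, following the definition in equation 10.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def forward_with_tangent(x0, tau, weights, biases):
    """Propagate the activity x and the directional Jacobian eps together.

    Each layer multiplies the tangent by da^l/dx^(l-1) (the weights) and
    then by dx^l/da^l (the sigmoid slope), one factor pair of equation 13.
    The input layer is assumed linear, so the tangent starts as tau.
    """
    x, eps = list(x0), list(tau)
    for W, b in zip(weights, biases):
        a = [sum(w * xj for w, xj in zip(row, x)) + bi for row, bi in zip(W, b)]
        lin = [sum(w * ej for w, ej in zip(row, eps)) for row in W]
        x = [sigmoid(ai) for ai in a]
        # For the logistic sigmoid, sigma'(a) = x * (1 - x).
        eps = [xi * (1.0 - xi) * li for xi, li in zip(x, lin)]
    return x, eps

def tangent_penalty(eps):
    """E_t of equation 12 (target directional Jacobian is zero)."""
    return 0.5 * sum(e * e for e in eps)

# Hypothetical 2-3-2 network, input point and tangent direction.
W1 = [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]]
b1 = [0.0, 0.1, -0.1]
W2 = [[0.3, -0.1, 0.2], [0.2, 0.4, -0.3]]
b2 = [0.05, -0.05]
x0, tau = [1.0, 2.0], [0.5, -1.0]

out, eps_L = forward_with_tangent(x0, tau, [W1, W2], [b1, b2])
E_t = tangent_penalty(eps_L)

# Central difference in alpha around alpha = 0, using S(x0; alpha) = x0 + alpha * tau.
alpha = 1e-5
plus, _ = forward_with_tangent([x + alpha * t for x, t in zip(x0, tau)],
                               tau, [W1, W2], [b1, b2])
minus, _ = forward_with_tangent([x - alpha * t for x, t in zip(x0, tau)],
                                tau, [W1, W2], [b1, b2])
eps_fd = [(p - m) / (2 * alpha) for p, m in zip(plus, minus)]
```

Here `eps_L` and `eps_fd` should agree up to the truncation error of the finite difference, which is the sense in which the propagated tangent is a directional derivative of the network output.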

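Since the 0-th layer nodes usually have a linear activation, $\partial x^0 / \partial a^0 = I$ and $\varepsilon^0 = \tau$. Then, treating $\varepsilon^l$ as a row vector,

$$\varepsilon^l = \varepsilon^{l-1} W^{lT} D^l \tag{15}$$

where we have denoted $W^{lT} \equiv \partial a^l / \partial x^{l-1}$ and $D^l \equiv \partial x^l / \partial a^l$. Note that $D^l$ is a diagonal matrix with

$$D^l_{ii} = \sigma'(a^l_i) \tag{16}$$

We should also note that even though the biases $b^l$ determine $a^l$, they are constant w.r.t. $x^{l-1}$, so they do not appear in $\partial a^l / \partial x^{l-1}$.

As described in the feedforward section, what we want is $\partial E_t / \partial W^l$:

$$\frac{\partial E_t}{\partial W^l} = \frac{\partial \varepsilon^l}{\partial W^l} \frac{\partial E_t}{\partial \varepsilon^l} \tag{17}$$

It is critical to notice that, because of equation 16, $\varepsilon^l$ is a nonlinear function of $W^l$: the weights enter both directly and through $a^l_i$. Therefore, the derivative of $\varepsilon^l_i$ w.r.t. $W^l_{ij}$ is

$$\frac{\partial \varepsilon^l_i}{\partial W^l_{ij}} = \sigma'(a^l_i)\,\varepsilon^{l-1}_j + \varepsilon^{l-1} [W^l_{(i,:)}]^T \sigma''(a^l_i)\, x^{l-1}_j \tag{18}$$

where $[W^l_{(i,:)}]$ is the $i$-th row vector of $W^l$. For linear nodes of layer $l$, $D^l = I$, and we have

$$\frac{\partial \varepsilon^l_i}{\partial W^l_{ij}} = \varepsilon^{l-1}_j \tag{19}$$

Now, we also want to find the partial derivatives for the biases. Looking back at equation 15, $b^l_i$ only affects $D^l$ through $a^l_i$. Therefore, using equations 15 and 16, for a nonlinear layer of nodes:

$$\frac{\partial \varepsilon^l_i}{\partial b^l_i} = \varepsilon^{l-1} [W^l_{(i,:)}]^T \sigma''(a^l_i) \tag{20}$$

For a linear layer of nodes, $D^l = I$, therefore:

$$\frac{\partial \varepsilon^l_i}{\partial b^l_i} = 0 \tag{21}$$

Since this virtual network for finding $E_t$ through $\varepsilon^0 \to \varepsilon^L$ has the same structure as a feedforward ANN, we have:

$$\frac{\partial E_t}{\partial \varepsilon^l} = \frac{\partial \varepsilon^{l+1}}{\partial \varepsilon^l} \frac{\partial E_t}{\partial \varepsilon^{l+1}} \tag{22}$$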

Equation 22, obtained by backpropagating through the $\varepsilon$ network, is named tangent prop. As for the feedforward network, we need $\partial \varepsilon^{l+1} / \partial \varepsilon^l$ for both the nonlinear and the linear case. For the nonlinear case, since $\varepsilon^{l+1} = \varepsilon^l W^{(l+1)T} D^{l+1}$, and changing $\varepsilon^l$ has no effect on $W^{l+1}$ and $D^{l+1}$,

$$\frac{\partial \varepsilon^{l+1}}{\partial \varepsilon^l} = W^{(l+1)T} D^{l+1} \tag{23}$$

Likewise, for the linear case, $\varepsilon^{l+1} = \varepsilon^l W^{(l+1)T} I$, so

$$\frac{\partial \varepsilon^{l+1}}{\partial \varepsilon^l} = W^{(l+1)T} \tag{24}$$

4 Gradient Descent Optimization

Using the derived partial derivatives of $E = \eta E_p + \mu E_t$, the update rule is:

$$\Delta W^l_{ij} = -\left(\eta \frac{\partial E_p}{\partial W^l_{ij}} + \mu \frac{\partial E_t}{\partial W^l_{ij}}\right) \tag{25}$$

5 Finite Differencing

It is often useful to use finite differencing techniques to verify that the analytically derived partial derivatives are correct.

5.1 Feedforward Network

For the feedforward network, it is often enough to set $\epsilon = 10^{-5}$ and re-run the network once for every weight and bias by setting $W_{ij} \leftarrow W_{ij} + \epsilon$. Central differencing requires twice as many evaluations but is more accurate [1].

5.2 Tangent Prop Network

For the tangent prop network, we first need to verify that the Jacobian, or equivalently $\partial G(x^0; W) / \partial x^0$, is equal to the analytically derived version. We do this by fixing the weights $W$ and setting $x^0_i \leftarrow x^0_i + \epsilon$ to find the change in $G(\cdot; W)$. By changing each component of $x^0$ in turn, we generate a finite difference matrix.

To make sure that $\varepsilon^L = \tau\, \partial G(x^0; W) / \partial x^0$ is correct, we run the network using $x^0_k \leftarrow x^0_k + \tau$ and compare the resulting change in the output with $\varepsilon^L$. Note that the norm of $\tau$ must be small in this case.

Finally, to test the accuracy of the partial derivatives of $E_t$ w.r.t. the weights, we must do two things. Note that the weights determine $a^l_i$ in equation 16. Therefore, every time we make a small change to $W^l_{ij}$, we must first run the ANN again to see the changes in $a^l_i$, and only then rerun the tangent network defined by equation 15.
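The checks of Section 5 and the update of equation 25 can be sketched together for a small case. The example below is only an illustration under stated assumptions: a single logistic-sigmoid hidden layer followed by a linear output layer, made-up weights, and hypothetical helper names (`passes`, `total_error`, `total_gradients`). With a linear output layer, $D^2 = I$, so equations 18, 19, 22 and 24 give the full dependence of $E_t$ on the weights; with further nonlinear layers above, $W^l$ would also influence $E_t$ through the $a$ of those layers, which this sketch does not cover.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def passes(x0, tau, W1, b1, W2, b2):
    """Forward pass and tangent pass: sigmoid hidden layer, linear output."""
    s = [sigmoid(sum(w * x for w, x in zip(row, x0)) + b) for row, b in zip(W1, b1)]
    out = [sum(w * si for w, si in zip(row, s)) + b for row, b in zip(W2, b2)]
    lin = [sum(w * tj for w, tj in zip(row, tau)) for row in W1]  # eps^0 [W^1_(i,:)]^T
    eps1 = [si * (1 - si) * li for si, li in zip(s, lin)]         # equation 15
    eps2 = [sum(w * e for w, e in zip(row, eps1)) for row in W2]  # linear layer, D = I
    return s, out, lin, eps1, eps2

def total_error(x0, t, tau, W1, b1, W2, b2, eta, mu):
    """E = eta * E_p + mu * E_t, recomputing a^l before the tangent pass."""
    _, out, _, _, eps2 = passes(x0, tau, W1, b1, W2, b2)
    E_p = 0.5 * sum((o - ti) ** 2 for o, ti in zip(out, t))
    E_t = 0.5 * sum(e * e for e in eps2)
    return eta * E_p + mu * E_t

def total_gradients(x0, t, tau, W1, b1, W2, b2, eta, mu):
    s, out, lin, eps1, eps2 = passes(x0, tau, W1, b1, W2, b2)
    n_out, n_hid = len(W2), len(W1)
    d1 = [si * (1 - si) for si in s]                  # sigma'(a^1)
    d2 = [si * (1 - si) * (1 - 2 * si) for si in s]   # sigma''(a^1)
    delta2 = [o - ti for o, ti in zip(out, t)]        # dE_p/da^2 for a linear output
    delta1 = [d1[i] * sum(W2[k][i] * delta2[k] for k in range(n_out))
              for i in range(n_hid)]                  # backpropagated delta
    g1 = [sum(W2[k][i] * eps2[k] for k in range(n_out))
          for i in range(n_hid)]                      # dE_t/d eps^1, equations 22 and 24
    dW1 = [[eta * delta1[i] * x0[j]                   # equation 5
            + mu * g1[i] * (d1[i] * tau[j] + lin[i] * d2[i] * x0[j])  # equations 17, 18
            for j in range(len(x0))] for i in range(n_hid)]
    dW2 = [[eta * delta2[k] * s[i] + mu * eps2[k] * eps1[i]           # equations 5, 19
            for i in range(n_hid)] for k in range(n_out)]
    return dW1, dW2

# Hypothetical 2-3-2 network, one training point and one tangent vector.
W1 = [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]]
b1 = [0.0, 0.1, -0.1]
W2 = [[0.3, -0.1, 0.2], [0.2, 0.4, -0.3]]
b2 = [0.05, -0.05]
x0, t, tau = [1.0, 2.0], [0.5, -0.5], [0.5, -1.0]
eta, mu, h = 0.1, 0.05, 1e-5

dW1, dW2 = total_gradients(x0, t, tau, W1, b1, W2, b2, eta, mu)

def central_diff_W1(i, j):
    """Central difference of E w.r.t. W^1_ij (two evaluations per weight)."""
    Wp = [row[:] for row in W1]
    Wm = [row[:] for row in W1]
    Wp[i][j] += h
    Wm[i][j] -= h
    return (total_error(x0, t, tau, Wp, b1, W2, b2, eta, mu)
            - total_error(x0, t, tau, Wm, b1, W2, b2, eta, mu)) / (2 * h)

max_gap = max(abs(dW1[i][j] - central_diff_W1(i, j))
              for i in range(3) for j in range(2))

# One update of equation 25; eta and mu are already folded into dW1.
W1_new = [[W1[i][j] - dW1[i][j] for j in range(2)] for i in range(3)]
```

Because `total_error` calls `passes` from scratch for every perturbed weight, the forward network is rerun before the tangent quantities are recomputed, which is the two-step procedure the last paragraph of Section 5.2 asks for.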

References

[1] C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press.

[2] P. Y. Simard, Y. A. LeCun, J. S. Denker, and B. Victorri. Transformation invariance in pattern recognition: tangent distance and tangent propagation. Lecture Notes in Computer Science, 1524.
