Backpropagation: The Good, the Bad and the Ugly

Backpropagation: The Good, the Bad and the Ugly The Norwegian University of Science and Technology (NTNU Trondheim, Norway keithd@idi.ntnu.no October 3, 2017

Supervised Learning Constant feedback from an instructor, indicating not only right/wrong, but also the correct answer for each training case. Many cases (i.e., input-output pairs to be learned. Weights are modified by a complex procedure (back-propagation based on output error. Feed-forward networks with back-propagation learning are the standard implementation. 99% of neural network applications use this. Typical usage: problems with a lots of input-output training data, and b goal of a mapping (function from inputs to outputs. Not biologically plausible, although the cerebellum appears to exhibit some aspects. But, the result of backprop, a trained ANN to perform some function, can be very useful to neuroscientists as a sufficiency proof.

Backpropagation Overview Training/Test Cases: {(d1, r1 (d2, r2 (d3, r3...} d3 r3 Encoder Decoder r* E = r3 - r* de/dw Feed-Forward Phase - Inputs sent through the ANN to compute outputs. Feedback Phase - Error passed back from output to input layers and used to update weights along the way.

Training -vs- Testing Cases Training N times, with learning Neural Net Test 1 time, without learning Generalization - correctly handling test cases (that ANN has not been trained on. Over-Training - weights become so fine-tuned to the training cases that generalization suffers: failure on many test cases.

Widrow-Hoff (a.k.a. Delta Rule 1 2 3 X w T = target output value δ = error 1 Y Node N 0 0 S Δw = ηδx Y δ = T - Y Delta (δ = error; Eta (η = learning rate Goal: change w so as to reduce δ. Intuitive: If δ > 0, then we want to decrease it, so we must increase Y. Thus, we must increase the sum of weighted inputs to N, and we do that by increasing (decreasing w if X is positive (negative. Similar for δ < 0 Assumes derivative of N s transfer function is everywhere non-negative.

Gradient Descent Goal = minimize total error across all output nodes Method = modify weights throughout the network (i.e., at all levels to follow the route of steepest descent in error space. min ΔE Error(E Weight Vector (W w ij = η E i w ij

Computing E i w ij 1 x 1d w i1 f T t id sum id i o id n x nd w in E id Sum of Squared Errors (SSE E i = 1 2 (t id o id 2 d D E = 1 w ij 2 2(t id o id (t id o id = w d D ij (t id o id ( o id w d D ij

Computing ( o id w ij 1 t id x 1d w i1 f T sum id i o id n x nd w in E id Since output = f(sum weighted inputs where E = w ij (t id o id ( f T (sum id w d D ij n sum id = w ik x kd k=1 Using Chain Rule: f (g(x = f x g(x g(x x (f T (sum id = f T (sum id sum id = f T (sum id x w ij sum id w ij sum jd id

Computing sum id w ij - Easy!! sum id w ij = ( n k=1 w ik x kd = ( wi1 x 1d + w i2 x 2d +... + w ij x jd +... + w in x nd w ij w ij = (w i1x 1d w ij + (w i2x 2d w ij +... + (w ijx jd w ij +... + (w inx nd w ij = 0 + 0 +... + x jd +... + 0 = x jd

Computing f T (sum id sum id - Harder for some f T f T = Identity function: f T (sum id = sum id Thus: f T (sum id sum id = 1 (f T (sum id = f T (sum id w ij sum id sum id w ij = 1 x jd = x jd f T = Sigmoid: f T (sum id = 1 1+e sum id Thus: (f T (sum id = f T (sum id w ij sum id f T (sum id sum id = o id (1 o id sum id w ij = o id (1 o id x jd = o id (1 o id x jd

The only non-trivial calculation f T (sum id sum id But notice that: = ( (1 + e sum id 1 = ( 1 (1 + e sumid (1 + e sum id 2 sum id sum id = ( 1( 1e sum id (1 + e sum id 2 = e sum id (1 + e sum id 2 e sum id (1 + e sum id 2 = f T (sum id (1 f T (sum id = o id (1 o id

Putting it all together E i = w ij (t id o id ( f ( T (sum id = w d D ij d D (t id o id f T (sum id sum id sum id w ij So for f T = Identity: E i = w ij (t id o id x jd d D and for f T = Sigmoid: E i = w ij (t id o id o id (1 o id x jd d D

Weight Updates (f T = Sigmoid Batch: update weights after each training epoch w ij = η E i = η w ij (t id o id o id (1 o id x jd d D The weight changes are actually computed after each training case, but w ij is not updated until the epoch s end. Incremental: update weights after each training case w ij = η E i w ij = η(t id o id o id (1 o id x jd A lower learning rate (η recommended here than for batch method. Can be dependent upon case-presentation order. So randomly sort the cases after each epoch.

Backpropagation in Multi-Layered Neural Networks d(e d d(sum 1d 1 d(sum 1d d(o jd d(o jd d(sum jd w 1j sum jd j o jd E d w nj d(sum nd d(o jd n d(e d d(sum nd For each node (j and each training case (d, backpropagation computes an error term: δ jd = E d sum jd by calculating the influence of sum jd along each connection from node j to the next downstream layer.

Computing δ jd d(e d d(sum 1d 1 d(sum 1d d(o jd d(o jd d(sum jd w 1j sum jd j o jd E d w nj d(sum nd d(o jd n d(e d d(sum nd Along the upper path, the contribution to So summing along all paths: E d sum jd is: o jd sum jd sum 1d o jd E d sum 1d E d sum jd = o jd sum jd n sum kd o k=1 jd E d sum kd

Computing δ jd Just as before, most terms are 0 in the derivative of the sum, so: sum kd o jd = w kj Assuming f T = a sigmoid: Thus: o jd = f T (sum jd = o sum jd sum jd (1 o jd jd δ jd = E d = o n jd sum sum jd sum jd kd E d o k=1 jd sum kd n n = o jd (1 o jd w kj ( δ kd = o jd (1 o jd w kj δ kd k=1 k=1

Computing δ jd Note that δ jd is defined recursively in terms of the δ values in the next downstream layer: δ jd = o jd (1 o jd n k=1 w kj δ kd So all δ values in the network can be computed by moving backwards, one layer at a time.

Computing E d w ij from δ jd - Easy!! 1 w i1 d(e d d(sum id j o jd w ij sum id i The only effect of w ij upon the error is via its effect upon sum id, which is: sum id w ij = o jd So: E d = sum id E d = sum id ( δ w ij w ij sum id w id = o jd δ id ij

Computing w ij Given an error term, δ id (for node i on training case d, the update of w ij for all nodes j that feed into i is: w ij = η E d w ij = η( o jd δ id = ηδ id o jd So given δ i, you can easily calculate w ij for all incoming arcs.

Learning XOR 1.2 Sum-Squared-Error 1.2 Sum-Squared-Error 1.2 Sum-Squared-Error 1.0 1.0 1.0 0.8 0.8 0.8 Error 0.6 Error Error 0.6 Error Error 0.6 Error 0.4 0.4 0.4 0.2 0.2 0.2 0.0 0 200 400 600 800 1000 Epoch 0.0 0 200 400 600 800 1000 Epoch 0.0 0 200 400 600 800 1000 Epoch Epoch = All 4 entries of the XOR truth table. 2 (inputs X 2 (hidden X 1 (output network Random init of all weights in [-1 1]. Not linearly separable, so it takes awhile! Each run is different due to random weight init.

Learning to Classify Wines Class Properties 1 14.23 1.71 2.43 15.6 127 2.8 1 13.2 1.78 2.14 11.2 100 2.65 2 13.11 1.01 1.7 15 78 2.98 3 13.17 2.59 2.37 20 120 1.65. Wine Properties 1 2 1 Wine Class 5 1 13 Hidden Layer

Wine Runs 0.007 Sum-Squared-Error 0.010 Sum-Squared-Error 0.006 0.008 0.005 Error 0.004 0.003 Error Error 0.006 0.004 Error 0.002 0.001 0.002 0.000 0 20 40 60 80 100 Epoch 0.16 0.000 0 20 40 60 80 100 Epoch 13x5x1 (lrate = 0.3 13x5x1 (lrate = 0.1 Sum-Squared-Error 0.014 Sum-Squared-Error 0.14 0.012 0.12 0.010 Error 0.10 0.08 0.06 Error Error 0.008 0.006 Error 0.04 0.004 0.02 0.002 0.00 0 20 40 60 80 100 Epoch 0.000 0 20 40 60 80 100 Epoch 13 x 10 x 1 (lrate = 0.3 13 x 25 x 1 (lrate = 0.3

WARNING: Notation Change When working with matrices, it is often easier to let w ij denote the weight on the connection FROM node i TO node j. Then, the normal syntax for matrix row-column indices implies that: The values in row i are written as w i? and denote the weights on the outputs of node i. The values in column j are written as w?j and denote the weights on the inputs to node j. w 11 w 12 w 13 Weights from node 2 w 21 w 22 w 23 w 31 w 32 w 33 Weights to node 3

Backpropagation using Tensors X V Y g: W Z f: v 11 v 12 v 21 v W = V = 22 R = XV Y = g(r S = YW Z = f(s w 11 w 21 X = [x 1, x 2, x 3 ] Y = [y 1, y 2 ] Z = [z 1 ] f = g = sigmoid v 31 v 32 v ij = weight from i to j

Goal: Compute Z W Using the Chain Rule of Tensor Calculus: Z f (S = W W = f (S S S f (S = W S (Y W W f(s is a sigmoid, with derivative f(s(1-f(s, so simplify to: Using the product rule: z 1 (1 z 1 (Y W W z 1 (1 z 1 (Y W W = z 1(1 z 1 [ Y W W + Y W W ] Y is independent of W, so red term = 0. But W W 1 0 0 1 is not simply 1, but:

Computing Z W Y W W = (y 1,y 2 1 0 0 1 ( = y1 y 2 Putting it all together: Z W = z 1(1 z 1 Y W W = z 1 (1 z 1 y 1 z 1 (1 z 1 y 2 = Z w 1,1 Z w 2,1 For each case, c k, in a minibatch, the forward pass will produce values for X, Y, and Z. Those values will be used to compute actual numeric gradients by filling in these symbolic gradients: z Z 1 (1 z 1 y 1 = W ck z 1 (1 z 1 y 2 ck

Tensor Calculations across Several Layers Goal: Compute Z V Expanding using the Chain Rule: Z f (S = V V Using the Product Rule: (Y W V = f (S S S V = z 1(1 z 1 (Y W V = Y V W + Y W V = (g(r V W is indep. of V, so red term = 0; and Y = g(r. Using the Chain Rule, and R = X V, yields: (g(r V g(r is also a sigmoid, so: W = [ g(r R R V ] W W g(r R = g(r(1 g(r = Y (1 Y = [y 1(1 y 1,y 2 (1 y 2 ]

Computing Z V Since R = X V, and using the Product Rule: R V = (X V = X V V V + X V V X is indep. of V, so red term = 0; V is 4-dimensional: V 1 0 0 1 0 0 0 0 0 0 0 0 V V = 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1

Computing Z V Take dot product of X = [x 1,x 2,x 3 ] with each internal maxtrix of V V : 1 0 0 1 0 0 0 0 0 0 0 0 X V V = [x 0 0 0 0 1,x 2,x 3 ] 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 = x 1 x 2 x 3 0 0 0 0 x 1 0 x 2 0 x 3

Computing Z V Putting the pieces back together: g(r R R V = [y 1(1 y 1,y 2 (1 y 2 ] = x 1 y 1 (1 y 1 x 2 y 1 (1 y 1 x 3 y 1 (1 y 1 0 0 0 0 0 0 x 1 x 2 x 3 0 0 0 x 1 y 2 (1 y 2 x 2 y 2 (1 y 2 x 3 y 2 (1 y 2 0 x 1 0 x 2 0 x 3

Computing Z V (g(r V W = [ g(r R R V ] W = x 1 y 1 (1 y 1 x 2 y 1 (1 y 1 x 3 y 1 (1 y 1 = 0 0 0 0 0 0 x 1 y 2 (1 y 2 x 2 y 2 (1 y 2 x 3 y 2 (1 y 2 x 1 y 1 (1 y 1 w 11 x 1 y 2 (1 y 2 w 21 x 2 y 1 (1 y 1 w 11 x 2 y 2 (1 y 2 w 21 x 3 y 1 (1 y 1 w 11 x 3 y 2 (1 y 2 w 21 ( w11 w 21 = (Y W V

Computing Z V...and finally... = Z f (S = V V = z 1(1 z 1 (Y W V x 1 y 1 (1 y 1 w 11 x 1 y 2 (1 y 2 w 21 = z 1 (1 z 1 x 2 y 1 (1 y 1 w 11 x 2 y 2 (1 y 2 w 21 x 3 y 1 (1 y 1 w 11 x 3 y 2 (1 y 2 w 21 z 1 (1 z 1 x 1 y 1 (1 y 1 w 11 z 1 (1 z 1 x 1 y 2 (1 y 2 w 21 z 1 (1 z 1 x 2 y 1 (1 y 1 w 11 z 1 (1 z 1 x 2 y 2 (1 y 2 w 21 z 1 (1 z 1 x 3 y 1 (1 y 1 w 11 z 1 (1 z 1 x 3 y 2 (1 y 2 w 21 = Z v 11 Z v 21 Z v 31 Z v 12 Z v 22 Z v 32

Gradients and the Minibatch For every case c k in a minibatch, the values of X, Y and Z are computed during the forward pass. The symbolic gradients in Z V are filled in using these values in X,Y and Z (along with W, producing one numeric, 3 x 2 matrix per case: Z Z v 11 v 12 = ( Z V ck = Z v 21 Z v 31 Z v 22 Z v 32 z 1 (1 z 1 x 1 y 1 (1 y 1 w 11 z 1 (1 z 1 x 1 y 2 (1 y 2 w 21 z 1 (1 z 1 x 2 y 1 (1 y 1 w 11 z 1 (1 z 1 x 2 y 2 (1 y 2 w 21 z 1 (1 z 1 x 3 y 1 (1 y 1 w 11 z 1 (1 z 1 x 3 y 2 (1 y 2 w 21 ck ck

Gradients and the Minibatch In most Deep Learning situations, the gradients will be based on a loss function, L, not simply the output of the final layer. But that s just one more level of derivative calculations. At the completion of a minibatch, the numeric gradient matrices are added together to yield the complete gradient, which is then used to update the weights. For any weight matrix U in the network, and minibatch M, update weight u i,j as follows: ( L u M = i,j ( L u ck c k M i,j η = learning rate u i,j = η( L u i,j M Tensorflow and Theano do all of this for you. Everytime you write Deep Learning code, be grateful!!

The General Algorithm: Jacobian Product Chain Weights (W Layer M Goal: Compute dl / dw Layer N Targets Loss (L L W = L N N M + 1 N 1 M M W

Distant Gradients = Sums of Path Products X X 2 dy 1 dy 3 dx 2 dy 2 dx 2 X Y dx 2 dz 3 Y Z Z 3 dz 3 dy 2 dy 1 dz 3 dy 3 Z More generally: z 3 x 2 = z 3 y 1 y 1 x 2 + z 3 y 2 y 2 x 2 + z 3 y 3 y 3 x 2 z i z = x j i y k y k Y k x j This results from the multiplication of two Jacobian matrices.

The Jacobian Matrix J Z Y = z 1 y 1 z 1 y 2 z 2 y 1 z 2.. z m y 1 y 2.... z m y 2 z 1 y n z 2 y n.. z m y n

Multiplying Jacobian Matrices. Jy z Jx y z = i y 1... =... z i y 2. z i y 1 + z i y 1 x j. z i y 3.. y 2 y 2 x j. + z i y 3 y 3 y 1 x j y 2 x j y 3 x j. x j... = Jx z We can do this repeatedly to form J n m, the Jacobian relating the activation levels of upstream layer m with those of downstream layer n: R = Identity Matrix For q = n down to m+1 do: J n m R R R J q q 1

Last of the Jacobians Once we ve computed J n m, we need one final Jacobian: J m w, where w are the weights feeding into layer m. Then we have the standard Jacobian, J n w needed for updating all weights in matrix w. J n w J n m J m w Weights V from layer X to Y (from the earlier example ( T ( y1 y 2 y1 y 2 JV Y = Y v 11 v 11 v 12 v V = ( 12 T y1 y 2 v 21 v 21 ( T y1 y 2 v 31 v 31 T Since a the act func for Y is sigmoid, and b y k = 0 when j k: v ij ( JV Y = Y y1 (1 y 1 x 1 0 T ( T 0 y2 (1 y 2 x V = ( 1 y1 (1 y 1 x 2 0 T ( T 0 y2 (1 y 2 x 2 ( y1 (1 y 1 x 3 0 T ( T 0 y2 (1 y 2 x 3 Inner vectors are transposed to fit them on the page.

Last of the Jacobian Multiplications From the previous example (the network with layer sizes 3-2-1, once we have JY Z and JY V, we can multiply (taking dot products of JZ Y with the inner vectors of JV Y to produce JZ V : the matrix of gradients that backprop uses to modify the weights in V. J Z V = JZ Y JY V = z 1 y 1 z 1 y 2 Y V Since a the act func for Z is sigmoid, and b sum(z k = w y i ik (where sum(z k = sum of weighted inputs to node z k : JV Z = = z 1 (1 z 1 w 11 z 1 (1 z 1 w 21 Y V z 1 (1 z 1 x 1 y 1 (1 y 1 w 11 z 1 (1 z 1 x 1 y 2 (1 y 2 w 21 z 1 (1 z 1 x 2 y 1 (1 y 1 w 11 z 1 (1 z 1 x 2 y 2 (1 y 2 w 21 z 1 (1 z 1 x 3 y 1 (1 y 1 w 11 z 1 (1 z 1 x 3 y 2 (1 y 2 w 21

First of the Jacobians This iterative process is very general, and permits the calculation of JM N (M < N for any layers M and N, or JW N for an weights (W upstream of N. However, the standard situation in backpropagation is to compute J L W i i where L is the objective (loss function and W i are the weight matrices. This follows the same procedure as sketched above, but the series of dot products begins with JN L : the Jacobian of derivatives of the loss function w.r.t. the activations of the output layer. For example, assume L = Mean Squared Error (MSE, Z is an output layer of 3 sigmoid nodes, and t i are the target values for those 3 nodes for a particular case (c: L(c = 1 3 3 (z i t i 2 k=1 Taking partial derivatives of L(c w.r.t. the z i yields: J L Z = L Z = 2 3 (z 1 t 1 2 3 (z 2 t 2 2 3 (z 3 t 3

First of the Jacobian Multiplications Continuing the example: Assume that layer Y (of size 2 feeds into layer Z, using weights W, then: JY Z = z 1 (1 z 1 w 11 z 1 (1 z 1 w 21 z 2 (1 z 2 w 12 z 2 (1 z 2 w 22 z 3 (1 z 3 w 13 z 3 (1 z 3 w 23 Our first Jacobian multiplication yields JY L, a 1 x 2 row vector (shown transposed to fit on the page: J L Z JZ Y = JL Y = 2 3 (z 1 t 1 z 1 (1 z 1 w 11 + 2 3 (z 2 t 2 z 2 (1 z 2 w 12 + 2 3 (z 3 t 3 z 3 (1 z 3 w 13 2 3 (z 1 t 1 z 1 (1 z 1 w 21 + 2 3 (z 2 t 2 z 2 (1 z 2 w 22 + 2 3 (z 3 t 3 z 3 (1 z 3 w 23 T

Backpropagation with Tensors: The Big Picture General Algorithm Assume Layer M is upstream of Layer N (the output layer. So M < N. Assume V is the tensor of weights feeding into Layer M. Assume L is the loss function. Goal: Compute J L V = L V R = JN L (the partial derivatives of the loss function w.r.t. the output layer For q = N down to M + 1 do: R R J q q 1 J L V R JM V Use JV L to update the weights in V.

Practical Tips 1 Only add as many hidden layers and hidden nodes as necessary. Too many more weights to learn + increased chance of over-specialization. 2 Scale all input values to the same range, typically [0 1] or [-1 1]. 3 Use target values of 0.1 (for zero and 0.9 (for 1 to avoid saturation effects of sigmoids. 4 Beware of tricky encodings of input (and decodings of output values. Don t combine too much info into a single node s activation value (even though it s fun to try, since this can make proper weights difficult (or impossible to learn. 5 For discrete (e.g. nominal values, one (input or output node per value is often most effective. E.g. car model and city of residence -vs- income and education for assessing car-insurance risk. 6 All initial weights should be relatively small: [-0.1 0.1] 7 Bias nodes can be helpful for complicated data sets. 8 Check that all your layer sizes, activation functions, activation ranges, weight ranges, learning rates, etc. make sense in terms of each other and your goals for the ANN. One improper choice can ruin the results.

Supervised Learning in the Cerebellum Parallel Fibers Granular Cells Golgi Cells Mossy Fibers Climbing Fibers Purkinje Cells Inferior Olive Sensory + Cortical Inputs Somatosensory (touch, pain, body position + Cortical Inputs Efference Copy Inhibition of deep cerebellar To neurons Spinal Cord To Cerebral Cortex Granular cells detect contexts. Parallel fibers and Purkinje cells map contexts to actions Climbing fibers from Inferior Olive provide (supervisory feedback signals for LTD on Parallel-Purkinje synapses