8-1: Backpropagation (Prof. J.C. Kao, UCLA)

- Chain rule for derivatives
- Backpropagation graphs
- Examples

8-2: Motivation for backpropagation

To do gradient descent, we need to calculate the gradient of the objective with respect to the parameters, θ. However, in a neural network, some parameters are not directly connected to the output. How, then, do we calculate the gradient of the objective with respect to these parameters? Backpropagation answers this question; it is a systematic application of the chain rule for derivatives.

8-3: Nomenclature

Forward propagation: Forward propagation is the act of calculating the values of the hidden and output units of the neural network given an input. It involves taking an input x and propagating it through each hidden layer sequentially until you arrive at the output y. From forward propagation, we can also calculate the cost function J(θ). The forward-propagated signals are the activations. With the input as the start and the output as the end, information propagates in a forward manner.

Backpropagation (colloquially called backprop): As its name suggests, information is now passed backward through the network, from the cost function and outputs to the inputs. The signals that are backpropagated are the gradients. This enables the calculation of gradients at every stage, going back to the input layer.

8-4: Why do we need backpropagation?

It is possible in many cases to calculate an analytical gradient. So why use backpropagation?

- Evaluating analytical gradients may be computationally expensive.
- Backpropagation is generalizable and is often inexpensive.

A few further notes on backpropagation:

- Backpropagation is not the learning algorithm; it is the method of computing gradients.
- Backpropagation is not specific to multilayer neural networks; it is a general way to compute derivatives of functions.

8-5: Computational graphs

Computational graphs help to visualize and contextualize how we aim to backpropagate derivatives. In this class, we will let each inner node denote an operation. Examples:

- y = x · z. (Note: we let · denote the dot product.)
  [Graph: x and z feed a dot-product node whose output is y.]
- y = softplus(x).
  [Graph: x → exp → (+1) → log → y.]
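
(Not part of the original slides.) To make the node-as-operation picture concrete, here is a minimal Python sketch of the softplus graph x → exp → (+1) → log → y, where each inner node is an operation object. The Node class and its op names are illustrative choices, not anything defined in the lecture.

import math

class Node:
    """Each inner node denotes an operation; leaf nodes name graph inputs."""
    def __init__(self, op, *parents):
        self.op = op
        self.parents = parents

    def forward(self, **inputs):
        if self.op == "input":
            return inputs[self.parents[0]]            # look up the named input
        vals = [p.forward(**inputs) for p in self.parents]
        if self.op == "exp":
            return math.exp(vals[0])
        if self.op == "add1":
            return vals[0] + 1.0
        if self.op == "log":
            return math.log(vals[0])
        raise ValueError(f"unknown op: {self.op}")

# Build the graph for y = softplus(x): x → exp → (+1) → log → y
x = Node("input", "x")
y = Node("log", Node("add1", Node("exp", x)))

print(y.forward(x=2.0))                # 2.1269...
print(math.log(1.0 + math.exp(2.0)))   # same value, computed directly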

8-6: Computational graphs (cont.)

- y = σ(w^T x) = 1 / (1 + exp(-w^T x)). (Note: we let · denote the dot product.)
  [Graph: w and x feed a mult node; the result is negated, passed through exp, increased by 1, and inverted to give y.]
- H = ReLU(WX + b). (Note: we let · denote matrix multiplication with the appropriate order.)
  [Graph: W and X feed a matrix-multiply node; b is added; the result passes through relu to give H.]

8-7: Calculus chain rule

Scalar version. Let x, y, z ∈ R and let f(·), g(·) be R → R functions. Suppose y = g(x) and z = f(y), so that z = f(g(x)). The calculus chain rule states:

    dz/dx = (dz/dy)(dy/dx)

Vector version. Let x ∈ R^m and y ∈ R^n. Suppose g : R^m → R^n and f : R^n → R, with y = g(x) and z = f(y), so that z = f(g(x)). The multivariate chain rule states:

    ∂z/∂x_i = Σ_{j=1}^{n} (∂z/∂y_j)(∂y_j/∂x_i)

This can be equivalently stated as:

    ∇_x z = J(g)(x)^T ∇_y z

where J(g)(x) is the Jacobian of g evaluated at x.
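
(Not in the original slides.) A quick numerical sanity check of the vector chain rule ∇_x z = J(g)(x)^T ∇_y z. The particular choices g(x) = sin(Ax) and f(y) = Σ_j y_j^2 are arbitrary illustrations.

import numpy as np

rng = np.random.default_rng(0)
m, n = 4, 3
A = rng.standard_normal((n, m))

def g(x):                                 # g : R^m -> R^n
    return np.sin(A @ x)

def f(y):                                 # f : R^n -> R
    return np.sum(y ** 2)

x = rng.standard_normal(m)
y = g(x)

grad_y = 2.0 * y                          # ∇_y z for f(y) = sum of y_j^2
J_g = np.cos(A @ x)[:, None] * A          # Jacobian of g at x, shape (n, m)
grad_x_chain = J_g.T @ grad_y             # chain rule: J(g)(x)^T ∇_y z

# Finite-difference check of ∇_x z
eps = 1e-6
grad_x_fd = np.array([
    (f(g(x + eps * np.eye(m)[i])) - f(g(x - eps * np.eye(m)[i]))) / (2 * eps)
    for i in range(m)
])
print(np.allclose(grad_x_chain, grad_x_fd, atol=1e-6))   # True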

8-8: Backpropagation example: softplus

y = softplus(x) = log(1 + exp(x)). We can analytically calculate the derivative:

    dy/dx = exp(x) / (1 + exp(x))

Let's now calculate it via backpropagation.

[Graph: x → exp → b; b → (+1) → a; a → log → y.]

Here, we have that b = exp(x) and that a = 1 + exp(x). Applying the chain rule, we have:

    dy/da = d/da [log a] = 1/a
    dy/db = (dy/da)(da/db) = dy/da          (since da/db = 1)
    dy/dx = (dy/db)(db/dx) = (dy/db) exp(x) = exp(x) / (1 + exp(x))
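
(Not in the original slides.) The softplus example translated into a short NumPy sketch: the forward pass caches the intermediates b and a, and the backward pass reuses them exactly as in the chain-rule steps above.

import numpy as np

def softplus_forward(x):
    b = np.exp(x)            # b = exp(x)
    a = 1.0 + b              # a = 1 + exp(x)
    y = np.log(a)            # y = log(1 + exp(x))
    return y, (x, b, a)      # cache intermediates for the backward pass

def softplus_backward(cache):
    x, b, a = cache
    dy_da = 1.0 / a          # y = log(a)
    dy_db = dy_da            # a = 1 + b, so da/db = 1
    dy_dx = dy_db * b        # b = exp(x), so db/dx = exp(x) = b
    return dy_dx

x = 0.7
y, cache = softplus_forward(x)
print(softplus_backward(cache))          # gradient from backpropagation
print(np.exp(x) / (1.0 + np.exp(x)))     # analytic derivative; same value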

8-9: Backpropagation example: sigmoid

y = σ(w^T x) = 1 / (1 + exp(-w^T x)).

[Graph: w and x feed a mult node producing e; e is negated to give c; c passes through exp to give b; b is increased by 1 to give a; a is inverted to give y.]

Apply the chain rule:

    dy/da = -1/a^2
    dy/db = dy/da
    dy/dc = exp(c) dy/db
    dy/de = -dy/dc

8-10: Backpropagation example: sigmoid (cont.)

Finally, we have that:

    ∇_x y = (∇_x e) dy/de = w dy/de
    ∇_w y = (∇_w e) dy/de = x dy/de

Putting it all together, we have that:

    ∇_w y = x dy/de
          = -x dy/dc
          = -exp(-w^T x) x dy/db
          = -exp(-w^T x) x dy/da
          = [exp(-w^T x) / (1 + exp(-w^T x))^2] x

A similar computation could be done to compute ∇_x y.
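
(Not in the original slides.) The same bookkeeping for the sigmoid example, as a sketch: the forward pass caches e, c, b, a, and the backward pass walks from dy/da back to dy/de to produce ∇_w y and ∇_x y. The final check uses the standard closed-form sigmoid derivative σ(1 − σ).

import numpy as np

def sigmoid_forward(w, x):
    e = w @ x                # e = w^T x
    c = -e                   # c = -e
    b = np.exp(c)            # b = exp(-w^T x)
    a = 1.0 + b              # a = 1 + exp(-w^T x)
    y = 1.0 / a              # y = sigma(w^T x)
    return y, (w, x, e, c, b, a)

def sigmoid_backward(cache):
    w, x, e, c, b, a = cache
    dy_da = -1.0 / a**2
    dy_db = dy_da
    dy_dc = np.exp(c) * dy_db
    dy_de = -dy_dc
    grad_w = x * dy_de       # e = w^T x, so de/dw = x
    grad_x = w * dy_de       # and de/dx = w
    return grad_w, grad_x

rng = np.random.default_rng(1)
w, x = rng.standard_normal(3), rng.standard_normal(3)
y, cache = sigmoid_forward(w, x)
grad_w, grad_x = sigmoid_backward(cache)
print(np.allclose(grad_w, y * (1.0 - y) * x))   # True: matches sigma(1 - sigma) x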

8-11: Backpropagation: a few intermediate notes

A few things to note thus far:

- So far, all multiplications have been scalar-scalar or vector-scalar, where order does not matter. It may be the case that we have matrix-vector or matrix-matrix multiplies, in which case the order must be correct. While one could do the rigorous math to determine the order, a quick trick to get the order correct with less cognitive effort is to look at the dimensionality.
- Prior to backpropagation, we would have performed a forward pass to calculate intermediate values. These intermediate values were denoted a, b, c, e, etc. in the prior graphs. We can cache these so that we do not have to recalculate them during backpropagation.

8-12: Backpropagation: neural network layer

Here, h = ReLU(Wx + b), with h ∈ R^h, x ∈ R^m, W ∈ R^{h×m}, and b ∈ R^h. While many output cost functions might be used, let's consider a simple squared-loss output l = ‖h − z‖^2, where z is some target value.

[Graph: W and x feed a matrix-multiply node producing c; b is added to give a; a passes through relu to give h; h and the target z feed the squared-loss node producing l.]

A few things to note:

- Note that ∇_h l = 2(h − z). We will start backpropagation at h rather than at l.
- In the following backpropagation, we'll need elementwise multiplication. This is formally called a Hadamard product, and we will denote it via ⊙. Concretely, the ith entry of x ⊙ y is given by x_i y_i.

8-13: Backpropagation: neural network layer (cont.)

Applying the chain rule, we have:

    ∇_a l = I(a > 0) ⊙ ∇_h l
    ∇_b l = ∇_a l
    ∇_x l = W^T ∇_a l

A few notes:

- For ∇_x l, see the example in the Tools notes.
- Why was the chain rule written right to left instead of left to right? This turns out to be a result of our convention for how we defined derivatives (using the "denominator layout" notation) in the linear algebra notes. Though maybe not the most satisfying answer, you can always check that the order of operations is correct by considering non-square matrices, where the dimensionality must work out.
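
(Not in the original slides.) A sketch of the full layer, l = ‖ReLU(Wx + b) − z‖^2, implementing the gradients above. The ∇_W l line anticipates the outer-product result derived on the following slides, and the dimensions (h = 4, m = 3) are arbitrary.

import numpy as np

rng = np.random.default_rng(2)
h_dim, m_dim = 4, 3
W = rng.standard_normal((h_dim, m_dim))
b = rng.standard_normal(h_dim)
x = rng.standard_normal(m_dim)
z = rng.standard_normal(h_dim)

# Forward pass (cache a and h)
a = W @ x + b
h = np.maximum(a, 0.0)
l = np.sum((h - z) ** 2)

# Backward pass
grad_h = 2.0 * (h - z)                      # ∇_h l
grad_a = (a > 0).astype(float) * grad_h     # ∇_a l = I(a > 0) ⊙ ∇_h l
grad_b = grad_a                             # ∇_b l = ∇_a l
grad_x = W.T @ grad_a                       # ∇_x l = W^T ∇_a l
grad_W = np.outer(grad_a, x)                # ∇_W l = (∇_a l) x^T (next slides)

# Finite-difference spot check on one entry of W
eps = 1e-6
W_pert = W.copy()
W_pert[0, 0] += eps
l_pert = np.sum((np.maximum(W_pert @ x + b, 0.0) - z) ** 2)
print(np.isclose((l_pert - l) / eps, grad_W[0, 0], atol=1e-4))   # True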

8-14: Backpropagation: neural network layer (cont.)

What about ∇_W l? In doing backpropagation, we have to calculate ∂c/∂W, which is a 3-dimensional tensor. However, we can also intuit the following (informally):

- ∇_W l is a matrix that is h × m.
- The derivative of Wx with respect to W ought to look like x.
- Since ∇_a l ∈ R^h and x ∈ R^m, then, intuitively, ∇_W l = (∇_a l) x^T.

This intuition turns out to be correct, and it is common to use it. However, the first time around we should do this rigorously, and then see the general pattern of why this intuition works.

8-15: Calculating a tensor derivative

∂c/∂W is an h × m × h tensor. Letting c_i denote the ith element of c, we see that each ∂c_i/∂W is an h × m matrix, and that there is one of these for each element c_i, for i = 1, ..., h. The matrix ∂c_i/∂W can be calculated as follows:

    c_i = Σ_{j=1}^{m} w_ij x_j

and thus, the (i, j)th element of this matrix is:

    (∂c_i/∂W)_{i,j} = ∂c_i/∂w_ij = x_j

It is worth noting that (∂c_i/∂W)_{k,j} = 0 for k ≠ i.

8-16: Calculating a tensor derivative (cont.)

Hence, ∂c_i/∂W is a matrix whose ith row is x^T and whose other rows are zeros (we denote the zero vector by 0). That is, ∂c_1/∂W has x^T as its first row and 0^T everywhere else, ∂c_2/∂W has x^T as its second row and 0^T everywhere else, and so on.

Now applying the chain rule, ∇_W l = (∂c/∂W) ∇_c l is a tensor product between an (h × m × h) tensor and an (h × 1) vector, whose resulting dimensionality is (h × m × 1), or equivalently, an (h × m) matrix.

8-17: Calculating a tensor derivative (cont.)

We carry out this tensor-vector multiply in the standard way. Writing [r_1; r_2; ...; r_h] for the matrix with rows r_1, ..., r_h:

    ∇_W l = Σ_{i=1}^{h} (∇_c l)_i (∂c_i/∂W)
          = (∇_c l)_1 [x^T; 0^T; ...; 0^T] + (∇_c l)_2 [0^T; x^T; ...; 0^T] + ... + (∇_c l)_h [0^T; 0^T; ...; x^T]
          = (∇_c l) x^T

8-18: A few notes on tensor derivatives

In general, the simpler rule can be inferred via pattern intuition / looking at the dimensionality of the matrices, and these tensor derivatives need not be explicitly derived. (See the intuition three slides prior.) Indeed, actually calculating these tensor derivatives, storing them, and then doing, e.g., a tensor-vector multiply is usually not a good idea for both memory and computation. In this example, storing all these zeros and performing the multiplications is unnecessary. If we know the end result is simply an outer product of two vectors, we need not even calculate an additional derivative in this step of backpropagation, or store an extra value (assuming the inputs were previously cached).
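
(Not in the original slides.) A sketch contrasting the explicit per-element matrices ∂c_i/∂W with the outer-product shortcut. It materializes the mostly-zero tensor only to confirm that the tensor-vector multiply and (∇_c l) x^T agree; as the slide notes, in practice you would use the outer product directly and never build the tensor.

import numpy as np

rng = np.random.default_rng(3)
h, m = 4, 3
x = rng.standard_normal(m)
grad_c = rng.standard_normal(h)            # stand-in for the upstream gradient ∇_c l

# Explicit tensor: dc_dW[i] is the h x m matrix ∂c_i/∂W (x^T in row i, zeros elsewhere)
dc_dW = np.zeros((h, h, m))
for i in range(h):
    dc_dW[i, i, :] = x

# Tensor-vector multiply: ∇_W l = Σ_i (∇_c l)_i ∂c_i/∂W
grad_W_tensor = np.tensordot(grad_c, dc_dW, axes=1)

# Outer-product shortcut: no tensor ever materialized
grad_W_outer = np.outer(grad_c, x)

print(np.allclose(grad_W_tensor, grad_W_outer))   # True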
