How the backpropagation algorithm works Srikumar Ramalingam School of Computing University of Utah

1 How the backpropagation algorithm works Srikumar Ramalingam School of Computing University of Utah

2 Reference Most of the slides are taken from the second chapter of the online book by Michael Nielsen: neuralnetworksanddeeplearning.com

3 Introduction First discovered in [...]. First influential paper in 1986: Rumelhart, Hinton and Williams, Learning representations by backpropagating errors, Nature, 1986.

4 Perceptron (Reminder)

5 Sigmoid neuron (Reminder) A sigmoid neuron can take real numbers ($x_1, x_2, x_3$) within 0 to 1 and returns a number within 0 to 1. The weights ($w_1, w_2, w_3$) and the bias term $b$ are real numbers. Sigmoid function: $\sigma(z) = \frac{1}{1 + e^{-z}}$.
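A minimal sketch of a single sigmoid neuron, assuming the usual form $\sigma(w \cdot x + b)$. The function names and the numeric values are illustrative and do not come from the slides.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real z into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_neuron(x, w, b):
    """Output of one sigmoid neuron: sigma(w . x + b)."""
    z = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
    return sigmoid(z)

# Illustrative inputs, weights and bias (assumed values).
out = sigmoid_neuron(x=[0.2, 0.5, 0.9], w=[1.0, -2.0, 0.5], b=0.1)
```

Whatever the weights and bias are, the output stays strictly between 0 and 1.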

6 Matrix equations for neural networks The indices j and k seem a little counter-intuitive!

7 Layer to layer relationship $a^l_j = \sigma(z^l_j)$, $z^l_j = \sum_k w^l_{jk} a^{l-1}_k + b^l_j$. $b^l_j$ is the bias term in the jth neuron in the lth layer. $a^l_j$ is the activation in the jth neuron in the lth layer. $z^l_j$ is the weighted input to the jth neuron in the lth layer.
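The layer-to-layer relationship can be sketched as follows. This is an illustration under assumed names (`layer_forward`, nested lists for the weight matrix), not code from the slides. Storing the weight as `w[j][k]`, with j indexing the current layer and k the previous one, is what lets each row of `w` produce one weighted input $z^l_j$.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(w, b, a_prev):
    """Return (z, a) for one layer: z_j = sum_k w[j][k] * a_prev[k] + b[j]."""
    z = [sum(w_jk * a_k for w_jk, a_k in zip(row, a_prev)) + b_j
         for row, b_j in zip(w, b)]
    a = [sigmoid(z_j) for z_j in z]
    return z, a

# Hypothetical layer with 3 inputs and 2 neurons (assumed values).
w = [[0.1, -0.2, 0.3],
     [0.0, 0.5, -0.5]]
b = [0.0, 0.1]
z, a = layer_forward(w, b, a_prev=[1.0, 0.5, 0.0])
```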

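8 Cost function from the network $C = \frac{1}{2n} \sum_x \| y(x) - a^L(x) \|^2$, where $y(x)$ is the ground truth for each input, $a^L(x)$ is the output activation vector for a specific training sample x, n is the number of input samples, and the sum runs over each input sample.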

9 Backpropagation and stochastic gradient descent The goal of the backpropagation algorithm is to compute the gradients $\partial C / \partial w$ and $\partial C / \partial b$ of the cost function C with respect to each and every weight and bias parameter. Note that backpropagation is only used to compute the gradients. Stochastic gradient descent is the training algorithm.

10 Assumptions on the cost function 1. We assume that the cost function can be written as the average over the cost functions from individual training samples: $C = \frac{1}{n} \sum_x C_x$. The cost function for the individual training sample is given by $C_x = \frac{1}{2} \| y(x) - a^L(x) \|^2$. - Why do we need this assumption? Backpropagation will only allow us to compute the gradients with respect to a single training sample, as given by $\partial C_x / \partial w$ and $\partial C_x / \partial b$. We then recover $\partial C / \partial w$ and $\partial C / \partial b$ by averaging the gradients from the different training samples.
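The averaging assumption can be sketched in a few lines. The helper names and the toy targets and outputs are assumptions made for illustration.

```python
def sample_cost(y, a_out):
    """C_x = 0.5 * ||y - a_out||^2 for one training sample."""
    return 0.5 * sum((y_i - a_i) ** 2 for y_i, a_i in zip(y, a_out))

def total_cost(targets, outputs):
    """C = (1/n) * sum of C_x over the n training samples."""
    n = len(targets)
    return sum(sample_cost(y, a) for y, a in zip(targets, outputs)) / n

# Two toy samples (assumed values): the first is off by 0.5 in each
# component, the second is predicted exactly.
targets = [[1.0, 0.0], [0.0, 1.0]]
outputs = [[0.5, 0.5], [0.0, 1.0]]
C = total_cost(targets, outputs)  # (0.25 + 0.0) / 2 = 0.125
```

Because C is a plain average, its gradient is the average of the per-sample gradients, which is what lets backpropagation work one sample at a time.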

11 Assumptions on the cost function (continued) 2. We assume that the cost function can be written as a function of the output from the neural network. We assume that the input x and its associated correct labeling y(x) are fixed and treated as constants.

12 Hadamard product Let s and t be two vectors of the same dimension. The Hadamard product is given by: $(s \odot t)_j = s_j t_j$. Such elementwise multiplication is also referred to as the Schur product.
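As a small illustration, the Hadamard product of two equal-length vectors can be written as below. The function name `hadamard` is an assumption.

```python
def hadamard(s, t):
    """Elementwise (Hadamard / Schur) product of two equal-length vectors."""
    if len(s) != len(t):
        raise ValueError("vectors must have the same length")
    return [s_j * t_j for s_j, t_j in zip(s, t)]

result = hadamard([1, 2], [3, 4])  # [1*3, 2*4] = [3, 8]
```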

13 Backpropagation Our goal is to compute the partial derivatives $\partial C / \partial w^l_{jk}$ and $\partial C / \partial b^l_j$. We compute some intermediate quantities while doing so: $\delta^l_j = \partial C / \partial z^l_j$

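14 Four equations of the BP (backpropagation) BP1: $\delta^L = \nabla_a C \odot \sigma'(z^L)$. BP2: $\delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l)$. BP3: $\partial C / \partial b^l_j = \delta^l_j$. BP4: $\partial C / \partial w^l_{jk} = a^{l-1}_k \delta^l_j$.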

15 Chain Rule in differentiation In order to differentiate a function $z = f(g(x))$ w.r.t. x, we can do the following: Let $y = g(x)$, $z = f(y)$, then $\frac{dz}{dx} = \frac{dz}{dy} \frac{dy}{dx}$

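16 Chain Rule in differentiation (vector case) Let $x \in R^m$, $y \in R^n$, g maps from $R^m$ to $R^n$, and f maps from $R^n$ to $R$. If $y = g(x)$ and $z = f(y)$, then $\frac{\partial z}{\partial x_i} = \sum_k \frac{\partial z}{\partial y_k} \frac{\partial y_k}{\partial x_i}$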

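17 Chain Rule in differentiation (computation graph) $\frac{\partial z}{\partial x} = \sum_{j:\, x \in \mathrm{Parent}(y_j),\ y_j \in \mathrm{Ancestor}(z)} \frac{\partial z}{\partial y_j} \frac{\partial y_j}{\partial x}$ (Figure: computation graph with nodes $x$, $y_1$, $y_2$, $y_3$ and $z$.)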

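18 BP1 $a^L = \sigma(z^L)$, $C = f(a^L)$. Here L is the last layer. (Figure: Layer L-1 and Layer L.) BP1: $\delta^L = \nabla_a C \odot \sigma'(z^L)$, where $\delta^L = \frac{\partial C}{\partial z^L}$, $\sigma'(z^L) = \frac{\partial \sigma(z^L)}{\partial z^L}$ and $\nabla_a C = \frac{\partial C}{\partial a^L} = \left( \frac{\partial C}{\partial a^L_1}, \frac{\partial C}{\partial a^L_2}, \ldots, \frac{\partial C}{\partial a^L_n} \right)^T$. Proof: $\delta^L_j = \frac{\partial C}{\partial z^L_j} = \sum_k \frac{\partial C}{\partial a^L_k} \frac{\partial a^L_k}{\partial z^L_j} = \frac{\partial C}{\partial a^L_j} \frac{\partial a^L_j}{\partial z^L_j}$, because when $j \neq k$ the term $\frac{\partial a^L_k}{\partial z^L_j}$ vanishes. Hence $\delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j)$. Thus we have $\delta^L = \frac{\partial C}{\partial a^L} \odot \sigma'(z^L)$.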
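A minimal sketch of BP1, assuming the quadratic cost $C_x = \frac{1}{2} \| y - a^L \|^2$ used earlier in the slides, for which $\partial C / \partial a^L_j = a^L_j - y_j$. The function names and input values are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    """Derivative of the logistic function: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

def output_error(z_L, y):
    """BP1 for the quadratic cost: delta^L = (a^L - y) ⊙ sigma'(z^L)."""
    a_L = [sigmoid(z_j) for z_j in z_L]
    grad_a = [a_j - y_j for a_j, y_j in zip(a_L, y)]  # dC/da^L
    return [g_j * sigmoid_prime(z_j) for g_j, z_j in zip(grad_a, z_L)]

# Assumed weighted inputs and targets for a 2-neuron output layer.
delta_L = output_error(z_L=[0.0, 2.0], y=[1.0, 0.0])
```

With a different cost function only the `grad_a` line would change; the Hadamard product with $\sigma'(z^L)$ stays the same.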

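19 BP2 $z^{l+1} = w^{l+1} a^l + b^{l+1}$. (Figure: Layer l and Layer l+1.) BP2: $\delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l)$. Proof: $\delta^l_j = \frac{\partial C}{\partial z^l_j} = \sum_k \frac{\partial C}{\partial z^{l+1}_k} \frac{\partial z^{l+1}_k}{\partial z^l_j} = \sum_k \delta^{l+1}_k \frac{\partial z^{l+1}_k}{\partial z^l_j}$. We have $z^{l+1}_k = \sum_j w^{l+1}_{kj} a^l_j + b^{l+1}_k = \sum_j w^{l+1}_{kj} \sigma(z^l_j) + b^{l+1}_k$. By differentiating we have: $\frac{\partial z^{l+1}_k}{\partial z^l_j} = w^{l+1}_{kj} \sigma'(z^l_j)$. Therefore $\delta^l_j = \sum_k w^{l+1}_{kj} \delta^{l+1}_k \sigma'(z^l_j)$.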
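BP2 can be sketched as one step that moves the error from layer l+1 back to layer l. The function name, the nested-list weight layout `w_next[k][j]` and the numbers are assumptions made for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def backpropagate_error(w_next, delta_next, z):
    """BP2: delta^l = ((w^{l+1})^T delta^{l+1}) ⊙ sigma'(z^l).

    w_next[k][j] is the weight from neuron j in layer l to neuron k in layer l+1.
    """
    # Transposed product: entry j sums over the neurons k of layer l+1.
    wt_delta = [sum(w_next[k][j] * delta_next[k] for k in range(len(delta_next)))
                for j in range(len(z))]
    return [v_j * sigmoid_prime(z_j) for v_j, z_j in zip(wt_delta, z)]

# Assumed values: two neurons in each layer, z^l = 0 so sigma'(z^l) = 0.25.
delta_l = backpropagate_error(w_next=[[1.0, 2.0], [3.0, 4.0]],
                              delta_next=[0.5, -1.0],
                              z=[0.0, 0.0])
```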

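20 BP3 $z^l = w^l a^{l-1} + b^l$. (Figure: Layer l-1 and Layer l.) BP3: $\frac{\partial C}{\partial b^l_j} = \delta^l_j$. Proof: $\frac{\partial C}{\partial b^l_j} = \sum_k \frac{\partial C}{\partial z^l_k} \frac{\partial z^l_k}{\partial b^l_j} = \frac{\partial C}{\partial z^l_j} \frac{\partial z^l_j}{\partial b^l_j} = \delta^l_j \frac{\partial \left( \sum_k w^l_{jk} a^{l-1}_k + b^l_j \right)}{\partial b^l_j} = \delta^l_j$.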

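21 BP4 $z^l = w^l a^{l-1} + b^l$. (Figure: Layer l-1 and Layer l.) BP4: $\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j$. Proof: $\frac{\partial C}{\partial w^l_{jk}} = \sum_m \frac{\partial C}{\partial z^l_m} \frac{\partial z^l_m}{\partial w^l_{jk}} = \frac{\partial C}{\partial z^l_j} \frac{\partial z^l_j}{\partial w^l_{jk}} = \delta^l_j \frac{\partial \left( \sum_k w^l_{jk} a^{l-1}_k + b^l_j \right)}{\partial w^l_{jk}} = \delta^l_j a^{l-1}_k$.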
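Once the error of a layer is known, BP3 and BP4 give the parameter gradients directly. A small sketch with assumed names and values:

```python
def layer_gradients(delta, a_prev):
    """Return (grad_w, grad_b) for one layer from its error vector.

    BP3: dC/db_j    = delta_j
    BP4: dC/dw_{jk} = a_prev_k * delta_j
    """
    grad_b = list(delta)
    grad_w = [[d_j * a_k for a_k in a_prev] for d_j in delta]
    return grad_w, grad_b

# Assumed error for a 2-neuron layer fed by 3 activations.
grad_w, grad_b = layer_gradients(delta=[0.5, -1.0], a_prev=[1.0, 2.0, 3.0])
```

The weight gradient is the outer product of the layer's error with the previous layer's activations, so it has the same shape as the weight matrix.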

22 The backpropagation algorithm The word backpropagation comes from the fact that we compute the error vectors $\delta^l$ in the backward direction.
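Putting the four equations together, a full backward pass for one training sample might look like the sketch below. It assumes sigmoid activations in every layer and the quadratic cost $C_x = \frac{1}{2} \| y - a^L \|^2$. The function names, the nested-list layout and the tiny 2-2-1 network are illustrative, not taken from the slides.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def feedforward(weights, biases, x):
    """Return the activations of every layer, the input included."""
    activations = [list(x)]
    for w, b in zip(weights, biases):
        a_prev = activations[-1]
        z = [sum(w_jk * a_k for w_jk, a_k in zip(row, a_prev)) + b_j
             for row, b_j in zip(w, b)]
        activations.append([sigmoid(z_j) for z_j in z])
    return activations

def backprop(weights, biases, x, y):
    """Gradients of C_x for one sample, shaped like weights and biases.

    weights[i][j][k] connects neuron k of layer i to neuron j of layer i+1.
    """
    activations = feedforward(weights, biases, x)
    a_out = activations[-1]
    # BP1, using sigma'(z) = a * (1 - a) for the logistic function.
    delta = [(a - t) * a * (1.0 - a) for a, t in zip(a_out, y)]
    grad_w = [None] * len(weights)
    grad_b = [None] * len(biases)
    for i in reversed(range(len(weights))):
        a_prev = activations[i]
        grad_b[i] = list(delta)                                       # BP3
        grad_w[i] = [[d_j * a_k for a_k in a_prev] for d_j in delta]  # BP4
        if i > 0:
            # BP2: move the error one layer back.
            delta = [
                sum(weights[i][k][j] * delta[k] for k in range(len(delta)))
                * a_prev[j] * (1.0 - a_prev[j])
                for j in range(len(a_prev))
            ]
    return grad_w, grad_b

# Hypothetical 2-2-1 network and one training sample.
weights = [[[0.1, -0.2], [0.4, 0.3]], [[0.5, -0.6]]]
biases = [[0.0, 0.1], [0.2]]
x, y = [1.0, 0.5], [1.0]
grad_w, grad_b = backprop(weights, biases, x, y)
```

The loop visits the layers from last to first, which is the backward direction the name refers to.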

23 Stochastic gradient descent with BP
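To show how the two roles separate, here is a minimal sketch of mini-batch stochastic gradient descent in which the gradient of each sample comes from the backpropagation equations. To stay short it assumes a network with a single sigmoid neuron, so only BP1, BP3 and BP4 are needed. The toy data (logical OR), learning rate, epoch count and all names are assumptions.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sample_gradient(w, b, x, y):
    """Gradient of C_x = 0.5 * (y - a)^2 for one sigmoid neuron."""
    a = sigmoid(sum(w_k * x_k for w_k, x_k in zip(w, x)) + b)
    delta = (a - y) * a * (1.0 - a)            # BP1
    return [delta * x_k for x_k in x], delta   # BP4, BP3

def mean_cost(data, w, b):
    total = 0.0
    for x, y in data:
        a = sigmoid(sum(w_k * x_k for w_k, x_k in zip(w, x)) + b)
        total += 0.5 * (y - a) ** 2
    return total / len(data)

def sgd(data, w, b, eta, epochs, batch_size, rng):
    """Mini-batch SGD: average the per-sample gradients, then take a step."""
    data = list(data)
    for _ in range(epochs):
        rng.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            gw_sum = [0.0] * len(w)
            gb_sum = 0.0
            for x, y in batch:
                gw, gb = sample_gradient(w, b, x, y)
                gw_sum = [s + g for s, g in zip(gw_sum, gw)]
                gb_sum += gb
            w = [w_k - eta * s / len(batch) for w_k, s in zip(w, gw_sum)]
            b = b - eta * gb_sum / len(batch)
    return w, b

# Toy data: logical OR.
data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0),
        ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]
w0, b0 = [0.0, 0.0], 0.0
w, b = sgd(data, w0, b0, eta=1.0, epochs=200, batch_size=2,
           rng=random.Random(0))
```

Comparing `mean_cost(data, w0, b0)` with `mean_cost(data, w, b)` shows whether training reduced the cost.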

24 Gradients using finite differences $\frac{\partial C}{\partial w_j} \approx \frac{C(w + \epsilon e_j) - C(w)}{\epsilon}$. Here $\epsilon$ is a small positive number and $e_j$ is the unit vector in the jth direction. Conceptually very easy to implement. In order to compute this derivative w.r.t. one parameter, we need to do one forward pass, so for millions of variables we will have to do millions of forward passes. - Backpropagation can get all the gradients in just one forward and one backward pass, and forward and backward passes are roughly equivalent in computation. With a million parameters, the derivatives using finite differences would be about a million times slower!
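The comparison can be sketched on a single sigmoid neuron, where the analytic gradient follows from BP1 and BP4. The cost, the names and the values below are assumptions chosen for illustration. The finite-difference routine needs one extra cost evaluation per parameter.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost(w, x, y):
    """C(w) = 0.5 * (y - sigma(w . x))^2 for one neuron (bias omitted for brevity)."""
    a = sigmoid(sum(w_k * x_k for w_k, x_k in zip(w, x)))
    return 0.5 * (y - a) ** 2

def finite_difference_gradient(cost_fn, w, eps=1e-6):
    """dC/dw_j ~ (C(w + eps * e_j) - C(w)) / eps, one evaluation per parameter."""
    base = cost_fn(w)
    grad = []
    for j in range(len(w)):
        w_shift = list(w)
        w_shift[j] += eps  # step along the unit vector e_j
        grad.append((cost_fn(w_shift) - base) / eps)
    return grad

def analytic_gradient(w, x, y):
    """Same gradient from BP1 and BP4: delta = (a - y) * sigma'(z), dC/dw_k = delta * x_k."""
    a = sigmoid(sum(w_k * x_k for w_k, x_k in zip(w, x)))
    delta = (a - y) * a * (1.0 - a)
    return [delta * x_k for x_k in x]

x, y = [1.0, 0.5, -1.0], 1.0
w = [0.2, -0.4, 0.1]
fd = finite_difference_gradient(lambda v: cost(v, x, y), w)
bp = analytic_gradient(w, x, y)
max_diff = max(abs(p - q) for p, q in zip(fd, bp))
```

The two gradients should agree closely, which also makes finite differences a useful check on a backpropagation implementation even though they are too slow for training.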

25 Backpropagation: the big picture To compute the total change in C we need to consider all possible paths from the weight to the cost. We are computing the rate of change of C w.r.t. a weight w. Every edge between two neurons in the network is associated with a rate factor that is just the ratio of partial derivatives of one neuron's activation with respect to another neuron's activation. The rate factor for a path is just the product of the rate factors of the edges in the path. The total change is the sum of the rate factors of all the paths from the weight to the cost.

26 Thank You

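27 Chain Rule in differentiation (vector case) Let $x \in R^m$, $y \in R^n$, g maps from $R^m$ to $R^n$, and f maps from $R^n$ to $R$. If $y = g(x)$ and $z = f(y)$, then $\frac{\partial z}{\partial x_i} = \sum_k \frac{\partial z}{\partial y_k} \frac{\partial y_k}{\partial x_i}$, or in vector form $\nabla_x z = \left( \frac{\partial y}{\partial x} \right)^T \nabla_y z$. Here $\frac{\partial y}{\partial x}$ is the $n \times m$ Jacobian matrix of g.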
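A worked instance of the vector chain rule, with hypothetical functions g and f chosen only so the result can be checked by hand (here m = n = 2):

```python
def g(x):
    """y = g(x): y_0 = x_0 * x_1, y_1 = x_0 + x_1."""
    return [x[0] * x[1], x[0] + x[1]]

def jacobian_g(x):
    """n x m Jacobian of g: row k holds dy_k / dx_i."""
    return [[x[1], x[0]],
            [1.0, 1.0]]

def grad_f(y):
    """Gradient of z = f(y) = y_0^2 + 3 * y_1 with respect to y."""
    return [2.0 * y[0], 3.0]

def grad_z_wrt_x(x):
    """dz/dx_i = sum_k dz/dy_k * dy_k/dx_i, i.e. Jacobian^T times grad_f."""
    y = g(x)
    J = jacobian_g(x)
    gy = grad_f(y)
    return [sum(gy[k] * J[k][i] for k in range(len(y))) for i in range(len(x))]

grad = grad_z_wrt_x([2.0, 3.0])
# By hand: z = (x_0 x_1)^2 + 3 (x_0 + x_1), so
# dz/dx_0 = 2 x_0 x_1^2 + 3 = 39 and dz/dx_1 = 2 x_0^2 x_1 + 3 = 27.
```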
