How the backpropagation algorithm works. Srikumar Ramalingam, School of Computing, University of Utah
Reference: Most of the slides are taken from the second chapter of the online book by Michael Nielsen: neuralnetworksanddeeplearning.com
Introduction. First discovered in 1970. First influential paper in 1986: Rumelhart, Hinton and Williams, "Learning representations by back-propagating errors", Nature, 1986.
Perceptron (Reminder)
Sigmoid neuron (Reminder). A sigmoid neuron can take real numbers $(x_1, x_2, x_3)$ within 0 to 1 and returns a number within 0 to 1. The weights $(w_1, w_2, w_3)$ and the bias term $b$ are real numbers. Sigmoid function: $\sigma(0) = 0.5$, $\sigma(-\infty) = 0$, $\sigma(+\infty) = 1$.
Matrix equations for neural networks. The indices $j$ and $k$ seem a little counter-intuitive! Notations are used in this manner to enable matrix multiplications.
Layer-to-layer relationship. $a_j^{\ell} = \sigma(z_j^{\ell})$, e.g. $a_1^3 = \sigma(z_1^3)$, $a_2^3 = \sigma(z_2^3)$, and $z_j^{\ell} = \sum_k w_{jk}^{\ell} a_k^{\ell-1} + b_j^{\ell}$. Examples: $z_j^3 = \sum_{k=1}^{4} w_{jk}^{3} a_k^{2} + b_j^{3}$ for $j \in \{1,2\}$, and $z_j^2 = \sum_{k=1}^{4} w_{jk}^{2} a_k^{1} + b_j^{2}$ for $j \in \{1,\dots,4\}$. Here $b_j^{\ell}$ is the bias term of the $j$th neuron in the $\ell$th layer, $a_j^{\ell}$ is the activation of the $j$th neuron in the $\ell$th layer, and $z_j^{\ell}$ is the weighted input to the $j$th neuron in the $\ell$th layer.
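A minimal NumPy sketch (not part of the original slides) of this layer-to-layer relationship; the 4-4-2 shape mirrors the example equations above, and all function and variable names here are my own illustrative choices.

```python
# Forward pass through a feedforward network: z^l = W^l a^{l-1} + b^l, a^l = sigma(z^l).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(a, weights, biases):
    """Propagate an input activation vector `a` through every layer.

    weights[l] has shape (n_l, n_{l-1}) so that row j holds w_{jk}^l,
    and biases[l] has shape (n_l,). Returns the final activation a^L.
    """
    for W, b in zip(weights, biases):
        z = W @ a + b      # weighted input z^l
        a = sigmoid(z)     # activation a^l
    return a

# Example with the 4-4-2 shape used in the slide's example equations.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 4)), rng.normal(size=(2, 4))]
biases = [rng.normal(size=4), rng.normal(size=2)]
print(forward(rng.normal(size=4), weights, biases))
```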
Cost function from the network: $C = \frac{1}{2n}\sum_x \lVert y(x) - a^L(x)\rVert^2$, where $y(x)$ is the ground truth for each input, $a^L(x)$ is the output activation vector for a specific training sample $x$, $n$ is the number of input samples, and $x$ is the input vector for each training sample.
Backpropagation and stochastic gradient descent. The goal of the backpropagation algorithm is to compute the gradients $\frac{\partial C}{\partial w}$ and $\frac{\partial C}{\partial b}$ of the cost function $C$ with respect to every weight and bias parameter. Note that backpropagation is only used to compute the gradients; stochastic gradient descent is the training algorithm.
Assumptions on the cost function. 1. We assume that the cost function can be written as the average over the cost functions from individual training samples: $C = \frac{1}{n}\sum_x C_x$, where the cost function for an individual training sample is $C_x = \frac{1}{2}\lVert y(x) - a^L(x)\rVert^2$. Why do we need this assumption? Backpropagation only allows us to compute the gradients $\frac{\partial C_x}{\partial w}$ and $\frac{\partial C_x}{\partial b}$ with respect to a single training sample $x$. We then recover $\frac{\partial C}{\partial w}$ and $\frac{\partial C}{\partial b}$ by averaging the gradients over the different training samples.
Assumptions on the cost function (continued). 2. We assume that the cost function can be written as a function of the outputs from the neural network. We assume that the input $x$ and its associated correct labeling $y(x)$ are fixed and treated as constants.
Hadamard product. Let $s$ and $t$ be two vectors. The Hadamard product $s \odot t$ is the elementwise product, e.g. $(1, 2, 3) \odot (2, 2, 2) = (2, 4, 6)$. Such elementwise multiplication is also referred to as the Schur product.
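For concreteness, a one-line NumPy illustration (my own example, matching the numbers above):

```python
import numpy as np

s = np.array([1, 2, 3])
t = np.array([2, 2, 2])
print(s * t)  # [2 4 6] -- `*` on NumPy arrays multiplies elementwise (Hadamard/Schur product)
```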
Backpropagation. Our goal is to compute the partial derivatives $\frac{\partial C}{\partial w_{jk}^{\ell}}$ and $\frac{\partial C}{\partial b_j^{\ell}}$. We compute some intermediate quantities while doing so: $\delta_j^{\ell} = \frac{\partial C}{\partial z_j^{\ell}}$.
Four equations of BP (backpropagation). Summary: the equations of backpropagation ($L$ is the total number of layers):
1) $\delta^L = \nabla_a C \odot \sigma'(z^L)$ (BP1)
2) $\delta^{\ell} = \left((w^{\ell+1})^T \delta^{\ell+1}\right) \odot \sigma'(z^{\ell})$ (BP2)
3) $\frac{\partial C}{\partial b_j^{\ell}} = \delta_j^{\ell}$ (BP3)
4) $\frac{\partial C}{\partial w_{jk}^{\ell}} = a_k^{\ell-1}\,\delta_j^{\ell}$ (BP4)
Chain rule in differentiation. In order to differentiate a function $z = f(g(x))$ w.r.t. $x$, we can do the following: let $y = g(x)$ and $z = f(y)$; then $\frac{dz}{dx} = \frac{dz}{dy}\frac{dy}{dx}$.
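A small worked instance (my own example, not from the slides):

```latex
z = \sin(x^2): \quad y = g(x) = x^2,\ z = f(y) = \sin(y)
\;\Rightarrow\; \frac{dz}{dx} = \frac{dz}{dy}\,\frac{dy}{dx} = \cos(x^2)\cdot 2x.
```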
Chain rule in differentiation (computation graph). $\frac{\partial z}{\partial x} = \sum_{j \,:\, x \in \mathrm{Parent}(y_j),\ y_j \in \mathrm{Ancestor}(z)} \frac{\partial z}{\partial y_j}\frac{\partial y_j}{\partial x}$. Computation graph: $x \to (y_1, y_2, y_3) \to z$.
Chain rule in differentiation (vector case). Let $x \in \mathbb{R}^m$, $y \in \mathbb{R}^n$, $g$ map from $\mathbb{R}^m$ to $\mathbb{R}^n$, and $f$ map from $\mathbb{R}^n$ to $\mathbb{R}$. If $y = g(x)$ and $z = f(y)$, then $\frac{\partial z}{\partial x_i} = \sum_k \frac{\partial z}{\partial y_k}\frac{\partial y_k}{\partial x_i}$.
BP1. To show: $\delta^L = \nabla_a C \odot \sigma'(z^L)$. Variable association for applying the vector chain rule: $z^L \to a^L \to C$. Here $L$ is the last layer, $\delta^L = \frac{\partial C}{\partial z^L}$, $\nabla_a C = \left(\frac{\partial C}{\partial a_1^L}, \frac{\partial C}{\partial a_2^L}, \dots, \frac{\partial C}{\partial a_n^L}\right)^T$, and $\sigma'(z^L) = \left(\sigma'(z_1^L), \sigma'(z_2^L), \dots, \sigma'(z_n^L)\right)^T$. Proof: $\delta_j^L = \frac{\partial C}{\partial z_j^L} = \sum_k \frac{\partial C}{\partial a_k^L}\frac{\partial a_k^L}{\partial z_j^L} = \frac{\partial C}{\partial a_j^L}\frac{\partial a_j^L}{\partial z_j^L}$, because the other terms $\frac{\partial a_k^L}{\partial z_j^L}$ with $j \neq k$ vanish, since $a_k^L$ does not involve $z_j^L$. Hence $\delta_j^L = \frac{\partial C}{\partial a_j^L}\sigma'(z_j^L)$, and thus $\delta^L = \nabla_a C \odot \sigma'(z^L)$.
Example for BP1 (output layer $L = 3$ with three neurons): $\delta_1^3 = \frac{\partial C}{\partial z_1^3} = \sum_{k=1}^{3}\frac{\partial C}{\partial a_k^3}\frac{\partial a_k^3}{\partial z_1^3} = \frac{\partial C}{\partial a_1^3}\frac{\partial a_1^3}{\partial z_1^3}$ (the terms for all other values of $k$ are 0). Similarly, $\delta_2^3 = \sum_{k=1}^{3}\frac{\partial C}{\partial a_k^3}\frac{\partial a_k^3}{\partial z_2^3} = \frac{\partial C}{\partial a_2^3}\frac{\partial a_2^3}{\partial z_2^3}$ and $\delta_3^3 = \sum_{k=1}^{3}\frac{\partial C}{\partial a_k^3}\frac{\partial a_k^3}{\partial z_3^3} = \frac{\partial C}{\partial a_3^3}\frac{\partial a_3^3}{\partial z_3^3}$, so $\delta^L = \nabla_a C \odot \sigma'(z^L)$.
Derivative of the sigmoid activation function. $\sigma(z) = \frac{1}{1+e^{-z}}$, so
$\sigma'(z) = \frac{d\sigma(z)}{dz} = \frac{d}{dz}\left(\frac{1}{1+e^{-z}}\right) = \frac{e^{-z}}{(1+e^{-z})^2} = \frac{(1+e^{-z}) - 1}{(1+e^{-z})^2} = \frac{1}{1+e^{-z}}\left(1 - \frac{1}{1+e^{-z}}\right) = \sigma(z)\left(1-\sigma(z)\right)$.
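A quick NumPy check (my own sketch) that $\sigma'(z) = \sigma(z)(1-\sigma(z))$ agrees with a numerical derivative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)           # sigma'(z) = sigma(z)(1 - sigma(z))

z = np.linspace(-5.0, 5.0, 11)
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)  # central difference
print(np.max(np.abs(numeric - sigmoid_prime(z))))            # tiny (~1e-10): the two agree
```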
Derivative of the quadratic objective function. $C = \frac{1}{2}\lVert y - a^L\rVert^2 = \frac{1}{2}\left[(y_1 - a_1^L)^2 + (y_2 - a_2^L)^2 + \cdots + (y_n - a_n^L)^2\right]$, so $\frac{\partial C}{\partial a_j^L} = -(y_j - a_j^L)$ and $\nabla_a C = \left(-(y_1 - a_1^L),\, -(y_2 - a_2^L),\, \dots,\, -(y_n - a_n^L)\right)^T$.
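A concrete instance with made-up numbers ($n = 2$), just to check the signs:

```latex
y = (1, 0),\; a^L = (0.8, 0.3) \;\Rightarrow\;
C = \tfrac{1}{2}\left[(1-0.8)^2 + (0-0.3)^2\right] = 0.065,\qquad
\nabla_a C = \begin{pmatrix} -(1-0.8) \\ -(0-0.3) \end{pmatrix}
           = \begin{pmatrix} -0.2 \\ 0.3 \end{pmatrix}.
```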
BP2. To show: $\delta^{\ell} = \left((w^{\ell+1})^T\delta^{\ell+1}\right) \odot \sigma'(z^{\ell})$. Variable association for applying the vector chain rule: $z^{\ell} \to z^{\ell+1} \to C$. Proof: $\delta_j^{\ell} = \frac{\partial C}{\partial z_j^{\ell}} = \sum_k \frac{\partial C}{\partial z_k^{\ell+1}}\frac{\partial z_k^{\ell+1}}{\partial z_j^{\ell}} = \sum_k \frac{\partial z_k^{\ell+1}}{\partial z_j^{\ell}}\delta_k^{\ell+1}$. Since $z_k^{\ell+1} = \sum_j w_{kj}^{\ell+1}a_j^{\ell} + b_k^{\ell+1} = \sum_j w_{kj}^{\ell+1}\sigma(z_j^{\ell}) + b_k^{\ell+1}$, differentiating gives $\frac{\partial z_k^{\ell+1}}{\partial z_j^{\ell}} = w_{kj}^{\ell+1}\sigma'(z_j^{\ell})$. Therefore $\delta_j^{\ell} = \sum_k w_{kj}^{\ell+1}\delta_k^{\ell+1}\sigma'(z_j^{\ell})$.
BP2 example (layer 2 to layer 3, with four neurons in layer 3). Variable association for applying the vector chain rule: $z^2 \to z^3 \to C$.
$\delta_1^2 = \frac{\partial C}{\partial z_1^2} = \sum_{k=1}^{4}\frac{\partial C}{\partial z_k^3}\frac{\partial z_k^3}{\partial z_1^2} = \sum_{k=1}^{4}\delta_k^3\frac{\partial z_k^3}{\partial z_1^2}$. With $z_k^3 = \sum_{j=1}^{4} w_{kj}^3\sigma(z_j^2) + b_k^3$, we get $\frac{\partial z_k^3}{\partial z_1^2} = w_{k1}^3\sigma'(z_1^2)$, so
$\delta_1^2 = \sum_{k=1}^{4}\delta_k^3 w_{k1}^3\sigma'(z_1^2) = (\delta_1^3 w_{11}^3 + \delta_2^3 w_{21}^3 + \delta_3^3 w_{31}^3 + \delta_4^3 w_{41}^3)\,\sigma'(z_1^2)$,
$\delta_2^2 = \sum_{k=1}^{4}\delta_k^3 w_{k2}^3\sigma'(z_2^2) = (\delta_1^3 w_{12}^3 + \delta_2^3 w_{22}^3 + \delta_3^3 w_{32}^3 + \delta_4^3 w_{42}^3)\,\sigma'(z_2^2)$,
and in general $\delta^{\ell} = \left((w^{\ell+1})^T\delta^{\ell+1}\right) \odot \sigma'(z^{\ell})$.
BP3. To show: $\frac{\partial C}{\partial b_j^{\ell}} = \delta_j^{\ell}$. Variable association for applying the vector chain rule: $b^{\ell} \to z^{\ell} \to C$. Proof: $\frac{\partial C}{\partial b_j^{\ell}} = \sum_k \frac{\partial C}{\partial z_k^{\ell}}\frac{\partial z_k^{\ell}}{\partial b_j^{\ell}} = \frac{\partial C}{\partial z_j^{\ell}}\frac{\partial z_j^{\ell}}{\partial b_j^{\ell}} = \delta_j^{\ell}\frac{\partial z_j^{\ell}}{\partial b_j^{\ell}}$; the other terms $\frac{\partial z_k^{\ell}}{\partial b_j^{\ell}}$ vanish when $j \neq k$. Since $\frac{\partial z_j^{\ell}}{\partial b_j^{\ell}} = \frac{\partial}{\partial b_j^{\ell}}\left(\sum_k w_{jk}^{\ell}a_k^{\ell-1} + b_j^{\ell}\right) = 1$, we get $\frac{\partial C}{\partial b_j^{\ell}} = \delta_j^{\ell}$.
BP3 example. With $z^3 = w^3 a^2 + b^3$ and the variable association $b^3 \to z^3 \to C$:
$\frac{\partial C}{\partial b_1^3} = \sum_k \frac{\partial C}{\partial z_k^3}\frac{\partial z_k^3}{\partial b_1^3} = \frac{\partial C}{\partial z_1^3}\frac{\partial z_1^3}{\partial b_1^3} = \delta_1^3\,\frac{\partial}{\partial b_1^3}\left(\sum_k w_{1k}^3 a_k^2 + b_1^3\right) = \delta_1^3 \cdot 1 = \delta_1^3$.
BP4. To show: $\frac{\partial C}{\partial w_{jk}^{\ell}} = a_k^{\ell-1}\,\delta_j^{\ell}$. Variable association for applying the vector chain rule: $w^{\ell} \to z^{\ell} \to C$. Proof: $\frac{\partial C}{\partial w_{jk}^{\ell}} = \sum_m \frac{\partial C}{\partial z_m^{\ell}}\frac{\partial z_m^{\ell}}{\partial w_{jk}^{\ell}} = \frac{\partial C}{\partial z_j^{\ell}}\frac{\partial z_j^{\ell}}{\partial w_{jk}^{\ell}} = \delta_j^{\ell}\frac{\partial z_j^{\ell}}{\partial w_{jk}^{\ell}}$, and the other terms $\frac{\partial z_m^{\ell}}{\partial w_{jk}^{\ell}}$ vanish when $m \neq j$. Since $\frac{\partial z_j^{\ell}}{\partial w_{jk}^{\ell}} = \frac{\partial}{\partial w_{jk}^{\ell}}\left(\sum_k w_{jk}^{\ell}a_k^{\ell-1} + b_j^{\ell}\right) = a_k^{\ell-1}$, we get $\frac{\partial C}{\partial w_{jk}^{\ell}} = \delta_j^{\ell}\,a_k^{\ell-1}$.
BP4 example. To show: $\frac{\partial C}{\partial w_{12}^3} = a_2^2\,\delta_1^3$. Variable association for applying the vector chain rule: $w^3 \to z^3 \to C$.
$\frac{\partial C}{\partial w_{12}^3} = \sum_m \frac{\partial C}{\partial z_m^3}\frac{\partial z_m^3}{\partial w_{12}^3} = \frac{\partial C}{\partial z_1^3}\frac{\partial z_1^3}{\partial w_{12}^3} = \delta_1^3\,\frac{\partial}{\partial w_{12}^3}\left(\sum_k w_{1k}^3 a_k^2 + b_1^3\right) = \delta_1^3\,a_2^2$.
The backpropagation algorithm. Starting from $\delta^L = \nabla_a C \odot \sigma'(z^L)$ at the output layer, the error vectors $\delta^{\ell}$ are propagated backwards through the network using BP2, and the gradients are read off with BP3 and BP4. The word backpropagation comes from the fact that we compute the error vectors $\delta_j^{\ell}$ in the backward direction.
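A compact NumPy sketch (mine, not from the slides, though in the spirit of the code accompanying Nielsen's chapter) of the whole procedure for one training sample, assuming sigmoid activations and the quadratic cost used throughout these slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def backprop(weights, biases, x, y):
    """Return (grad_w, grad_b) = (dC_x/dw, dC_x/db) for one sample (x, y)."""
    # Forward pass: cache every z^l and a^l.
    activation, activations, zs = x, [x], []
    for W, b in zip(weights, biases):
        z = W @ activation + b
        zs.append(z)
        activation = sigmoid(z)
        activations.append(activation)
    # Backward pass.
    grad_w = [np.zeros_like(W) for W in weights]
    grad_b = [np.zeros_like(b) for b in biases]
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])            # BP1 (quadratic cost)
    grad_b[-1] = delta                                               # BP3
    grad_w[-1] = np.outer(delta, activations[-2])                    # BP4
    for l in range(2, len(weights) + 1):
        delta = (weights[-l + 1].T @ delta) * sigmoid_prime(zs[-l])  # BP2
        grad_b[-l] = delta                                           # BP3
        grad_w[-l] = np.outer(delta, activations[-l - 1])            # BP4
    return grad_w, grad_b
```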
Stochastic gradient descent with BP
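A hedged sketch of one mini-batch update for stochastic gradient descent; `backprop` is assumed to be a per-sample gradient routine such as the sketch above, and `eta` ($\eta$) is the learning rate (both names are my own):

```python
import numpy as np

def sgd_update(weights, biases, batch, eta, backprop):
    """One stochastic-gradient-descent step on a mini-batch.

    `batch` is a list of (x, y) pairs; backprop(weights, biases, x, y)
    returns per-sample gradients (grad_w, grad_b). The update is
    w -> w - (eta/m) * sum of per-sample gradients, likewise for b.
    """
    sum_w = [np.zeros_like(W) for W in weights]
    sum_b = [np.zeros_like(b) for b in biases]
    for x, y in batch:
        grad_w, grad_b = backprop(weights, biases, x, y)
        sum_w = [sw + gw for sw, gw in zip(sum_w, grad_w)]
        sum_b = [sb + gb for sb, gb in zip(sum_b, grad_b)]
    m = len(batch)
    weights = [W - (eta / m) * sw for W, sw in zip(weights, sum_w)]
    biases = [b - (eta / m) * sb for b, sb in zip(biases, sum_b)]
    return weights, biases
```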
Gradients using finite differences: $\frac{\partial C}{\partial w_j} \approx \frac{C(w + \varepsilon e_j) - C(w)}{\varepsilon}$. Here $\varepsilon$ is a small positive number and $e_j$ is the unit vector in the $j$th direction. Conceptually very easy to implement. However, computing this derivative w.r.t. one parameter requires one forward pass, so for millions of parameters we would have to do millions of forward passes. Backpropagation gets all the gradients in just one forward and one backward pass, and forward and backward passes are roughly equivalent in computation. The derivatives using finite differences would therefore be a million times slower!
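A hedged sketch of the finite-difference check described here, perturbing one parameter at a time; the helper names are assumptions, and `cost` stands for any scalar function of the parameter array:

```python
import numpy as np

def finite_difference_grad(cost, w, eps=1e-5):
    """Approximate dC/dw_j as (C(w + eps*e_j) - C(w)) / eps for every j.

    One extra evaluation of `cost` (one forward pass) per parameter --
    which is why this is roughly a million times slower than
    backpropagation for a network with a million weights.
    """
    grad = np.zeros_like(w)
    base = cost(w)
    for j in range(w.size):
        w_plus = w.copy()
        w_plus.flat[j] += eps                      # perturb along the unit vector e_j
        grad.flat[j] = (cost(w_plus) - base) / eps
    return grad

# Tiny usage example with C(w) = 0.5 * ||w||^2, whose exact gradient is w.
w = np.array([1.0, -2.0, 3.0])
print(finite_difference_grad(lambda v: 0.5 * np.sum(v**2), w))  # ~ [1, -2, 3]
```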
Backpropagation: the big picture. We are computing the rate of change of $C$ w.r.t. a weight $w$. To compute the total change in $C$ we need to consider all possible paths from the weight to the cost. Every edge between two neurons in the network is associated with a rate factor, which is just the partial derivative of one neuron's activation with respect to the other neuron's activation. The rate factor for a path is the product of the rate factors of its edges. The total rate of change is the sum of the rate factors of all the paths from the weight to the cost.
Thank You
Source: http://math.arizona.edu/~calc/rules.pdf