Notes on Backpropagation with Cross Entropy


I-Ta Lee, Dan Goldwasser, Bruno Ribeiro
Purdue University
October 3, 2017

1.1 Overview

This note introduces backpropagation for a common neural network multi-class classifier. Specifically, the network has $L$ layers with a general function $f$ as the hidden layer activation and Softmax activation at the final output layer. Cross Entropy is used as the objective function to measure prediction errors.

1.2 Notations and Definitions

Figure 1 visualizes the network architecture with the notations used in this note.

[Figure 1: Notation Visualization. The diagram runs from the input layer ($x_1, \dots, x_n$) through layer $L-2$ and layer $L-1$ to layer $l = L$, and labels the weighted sums $z$, activations $a$, weights $W$, and biases $b$.]

The explanations are listed:

- There are $M$ classes.
- $L$ indicates the last layer. The last layer has $M$ output neurons (one for each target class).
- $l$ indicates a specific layer. It could be equal to $L$, e.g., $l = L$, but that is not always the case.
- The subscript $k$ usually denotes neuron indices in the output layer (layer $L$).
- The subscript $j$ usually denotes neuron indices in layer $L-1$.
- The subscript $i$ usually denotes neuron indices in layer $L-2$.

- $z_k^l$ is the weighted sum of activations from the previous layer. That is,

\[ z_k^l = b_k^l + \sum_j W_{kj}^l a_j^{l-1}. \tag{1} \]

- $a_k^l$ is an activation:

\[ a_k^l = f(z_k^l), \tag{2} \]

where $f(\cdot)$ is the activation function. In this note, the last layer uses a softmax

\[ a_k^L = \mathrm{softmax}(z_k^L) = \frac{e^{z_k^L}}{\sum_c e^{z_c^L}}, \tag{3} \]

and the hidden layers use the generic activation $a_k^l = f(z_k^l)$, which could be the ReLU activation

\[ a_k^l = \mathrm{relu}(z_k^l) = \begin{cases} 0 & \text{if } z_k^l < 0 \\ z_k^l & \text{otherwise,} \end{cases} \tag{4} \]

the logistic function activation

\[ a_k^l = \sigma(z_k^l) = \frac{\exp(z_k^l)}{1 + \exp(z_k^l)}, \tag{5} \]

or the hyperbolic tangent activation

\[ a_k^l = \tanh(z_k^l) = \frac{\exp(z_k^l) - \exp(-z_k^l)}{\exp(z_k^l) + \exp(-z_k^l)}. \tag{6} \]

- $t_k(i) \in \{0, 1\}$ is the target class of the $i$-th example in the dataset. The classes are one-hot encoded, such that $t_k(i) = 1$ if and only if item $i$ has class $k \in \{1, \dots, M\}$. All other bits are zero.
- $E(i)$ is the loss (output error) of the $i$-th input example. We use Cross Entropy, which is the negative log likelihood, defined as

\[ E(i) = -\sum_{k=1}^{M} t_k(i) \log a_k^L = -\sum_{k=1}^{M} t_k(i) \left( z_k^L - \log \sum_{c=1}^{M} e^{z_c^L} \right). \tag{7} \]

1.3 Gradient Descent

Consider the training data $\{(x(i), t(i))\}_{i=1}^{N}$, where $t(i)$ is the one-hot encoded vector of the target of training example $i$. To update the weights $W_{kj}^l$, variants of gradient descent algorithms are applied. All of them have a common update rule, which is

\[ \Delta W_{kj}^l \propto -\sum_{i=1}^{N} \frac{\partial E(i)}{\partial W_{kj}^l}. \tag{8} \]
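The forward definitions in Equations (1), (3), (4) and (7) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the toy network shape, the weight values, and the helper names `dense`, `relu`, `softmax` and `cross_entropy` are hypothetical and do not come from the note.

```python
import math

def dense(W, b, a_prev):
    """Equation (1): z_k = b_k + sum_j W[k][j] * a_prev[j]."""
    return [b_k + sum(w * a for w, a in zip(row, a_prev))
            for row, b_k in zip(W, b)]

def relu(z):
    """Equation (4), applied element-wise."""
    return [max(0.0, v) for v in z]

def softmax(z):
    """Equation (3): exponentiate, then normalise so the outputs sum to 1."""
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(a, t):
    """Equation (7): E = -sum_k t_k * log(a_k) for a one-hot target t."""
    return -sum(t_k * math.log(a_k) for a_k, t_k in zip(a, t) if t_k)

# Hypothetical toy network: 2 inputs -> 2 hidden units (ReLU) -> 3 classes.
x = [1.0, 2.0]
W1, b1 = [[0.5, -0.5], [0.25, 0.25]], [0.0, 0.0]
W2, b2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.0, 0.0, 0.0]
t = [0, 0, 1]  # one-hot target: the example belongs to the third class

hidden = relu(dense(W1, b1, x))
probs = softmax(dense(W2, b2, hidden))
loss = cross_entropy(probs, t)
```

Because the target is one-hot, the loss reduces to the negative log of the probability assigned to the correct class.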

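In the simplest case, a learning rate $\eta$ is used to control the step size, and we can calculate the derivatives using the chain rule, so the update rule can be re-written as

\[ \Delta W_{kj}^l = -\frac{\eta}{N} \sum_{i=1}^{N} \frac{\partial E(i)}{\partial W_{kj}^l} = -\frac{\eta}{N} \sum_{i=1}^{N} \frac{\partial E(i)}{\partial z_k^l} \frac{\partial z_k^l}{\partial W_{kj}^l}. \tag{9} \]

1.4 Backpropagation

In what follows we will omit the dependence of $E$ on the training examples to simplify the exposition. Wherever $E$ appears, it should be understood as $\frac{1}{N} \sum_{i=1}^{N} E(i)$. Backpropagation provides an elegant way to calculate $\partial E / \partial W_{kj}^l$ for each layer using a recursive definition of $\delta_k^l$ and $\delta_j^{l-1}$ for the adjacent layers $l$ and $l-1$, so that the updates can be calculated, or propagated, in backward order. We will see how to derive $\delta_k^l$ and $\delta_j^{l-1}$ by deriving the updates for layers $l = L$ and $l = L-1$.

1.4.1 Update for the Last Layer

Instead of calculating $\partial E / \partial W_{kj}^L$ directly, we calculate

\[ \frac{\partial E}{\partial W_{kj}^L} = \frac{\partial E}{\partial z_k^L} \frac{\partial z_k^L}{\partial W_{kj}^L}, \tag{10} \]

because it simplifies the derivation a lot for Softmax and Cross Entropy.

Equation (7) has the definition of the error $E$, and we can calculate its derivative with respect to $z_k^L$ as

\[
\begin{aligned}
\frac{\partial E}{\partial z_k^L}
&= -\sum_j t_j \frac{\partial}{\partial z_k^L} \left( z_j^L - \log \sum_c e^{z_c^L} \right) \\
&= -\sum_j t_j \left( \mathbb{1}_{j=k} - \frac{e^{z_k^L}}{\sum_c e^{z_c^L}} \right) \\
&= -\sum_j t_j \left( \mathbb{1}_{j=k} - a_k^L \right) \\
&= a_k^L \sum_j t_j - \sum_j t_j \mathbb{1}_{j=k} \\
&= a_k^L - t_k,
\end{aligned}
\tag{11}
\]

where $j$ ranges over the $M$ output neurons, the last step uses $\sum_j t_j = 1$ (the target is one-hot), and $\mathbb{1}_{j=k}$ is an indicator function:

\[ \mathbb{1}_{j=k} = \begin{cases} 1 & \text{if } j = k \\ 0 & \text{otherwise.} \end{cases} \tag{12} \]

Then we can define $\delta_k^L$ as

\[ \delta_k^L = \frac{\partial E}{\partial z_k^L} = a_k^L - t_k. \tag{13} \]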
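The result in Equation (13) is easy to check numerically. The sketch below compares $a_k^L - t_k$ with a central finite difference of the loss in Equation (7). The input values and the function names are hypothetical, chosen only for illustration.

```python
import math

def softmax(z):
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def loss(z, t):
    """Cross entropy of softmax(z) against a one-hot target t (Equation (7))."""
    a = softmax(z)
    return -sum(t_k * math.log(a_k) for a_k, t_k in zip(a, t))

def analytic_delta(z, t):
    """Equation (13): delta_k = a_k - t_k."""
    return [a_k - t_k for a_k, t_k in zip(softmax(z), t)]

def numeric_delta(z, t, eps=1e-6):
    """Central finite difference of the loss with respect to each z_k."""
    grads = []
    for k in range(len(z)):
        up = z[:k] + [z[k] + eps] + z[k + 1:]
        down = z[:k] + [z[k] - eps] + z[k + 1:]
        grads.append((loss(up, t) - loss(down, t)) / (2 * eps))
    return grads

z = [0.2, -1.0, 0.7]  # hypothetical pre-activations of the last layer
t = [0, 1, 0]         # one-hot target

# Largest disagreement between the closed form and the numerical estimate.
max_gap = max(abs(a - n)
              for a, n in zip(analytic_delta(z, t), numeric_delta(z, t)))
```

If the derivation is right, `max_gap` should be tiny compared with the gradient entries themselves. The entries of `analytic_delta` also sum to zero, since both the softmax outputs and the one-hot target sum to one.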

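We got the first part of Equation (10), so we can move on to the second part, which is $\partial z_k^L / \partial W_{kj}^L$. Referring to the definition in Equation (1), this is

\[ \frac{\partial z_k^L}{\partial W_{kj}^L} = a_j^{L-1}. \tag{14} \]

As a result, the updates for the weights in the last layer are

\[ \frac{\partial E}{\partial W_{kj}^L} = \frac{\partial E}{\partial z_k^L} \frac{\partial z_k^L}{\partial W_{kj}^L} = \delta_k^L a_j^{L-1}. \tag{15} \]

We also need to do a similar derivation for the bias:

\[ \frac{\partial E}{\partial b_k^L} = \frac{\partial E}{\partial z_k^L} \frac{\partial z_k^L}{\partial b_k^L} = \delta_k^L (1) = \delta_k^L. \tag{16} \]

1.4.2 Update for the Second Last Layer

For the second last layer, the chain rule gives

\[ \frac{\partial E}{\partial W_{ji}^{L-1}} = \frac{\partial E}{\partial a_j^{L-1}} \frac{\partial a_j^{L-1}}{\partial z_j^{L-1}} \frac{\partial z_j^{L-1}}{\partial W_{ji}^{L-1}}. \tag{17} \]

Similarly, Equations (18) - (20) derive each component of Equation (17):

\[ \frac{\partial E}{\partial a_j^{L-1}} = \sum_k \frac{\partial E}{\partial z_k^L} \frac{\partial z_k^L}{\partial a_j^{L-1}} = \sum_k \delta_k^L W_{kj}^L, \tag{18} \]

\[ \frac{\partial a_j^{L-1}}{\partial z_j^{L-1}} = f'(z_j^{L-1}), \tag{19} \]

\[ \frac{\partial z_j^{L-1}}{\partial W_{ji}^{L-1}} = a_i^{L-2}. \tag{20} \]

Combining all together, we get

\[ \frac{\partial E}{\partial W_{ji}^{L-1}} = a_i^{L-2} f'(z_j^{L-1}) \sum_k \delta_k^L W_{kj}^L. \tag{21} \]

We can define

\[ \delta_j^{L-1} = \frac{\partial E}{\partial z_j^{L-1}} = f'(z_j^{L-1}) \sum_k \delta_k^L W_{kj}^L \tag{22} \]

and re-write

\[ \frac{\partial E}{\partial W_{ji}^{L-1}} = \delta_j^{L-1} a_i^{L-2}. \tag{23} \]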
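Equations (13), (15), (22) and (23) can be exercised on a tiny network and compared against finite differences. The following is a minimal sketch under several assumptions that are not in the note: biases are omitted, $f$ is taken to be tanh (so that $f'(z) = 1 - \tanh^2(z)$), the input $x$ plays the role of $a^{L-2}$, and all weights and function names are hypothetical.

```python
import math

def forward(W1, W2, x):
    """Tiny bias-free network: tanh hidden layer, softmax output."""
    z1 = [sum(w * v for w, v in zip(row, x)) for row in W1]
    a1 = [math.tanh(v) for v in z1]
    z2 = [sum(w * v for w, v in zip(row, a1)) for row in W2]
    exps = [math.exp(v) for v in z2]
    total = sum(exps)
    a2 = [e / total for e in exps]
    return a1, a2

def loss(W1, W2, x, t):
    _, a2 = forward(W1, W2, x)
    return -sum(t_k * math.log(a_k) for a_k, t_k in zip(a2, t))

def backprop(W1, W2, x, t):
    a1, a2 = forward(W1, W2, x)
    delta2 = [a_k - t_k for a_k, t_k in zip(a2, t)]          # Equation (13)
    grad_W2 = [[d * a for a in a1] for d in delta2]          # Equation (15)
    # Equation (22): f'(z_j) * sum_k delta_k * W_kj, with tanh' = 1 - tanh^2.
    delta1 = [(1 - a1[j] ** 2)
              * sum(delta2[k] * W2[k][j] for k in range(len(W2)))
              for j in range(len(a1))]
    grad_W1 = [[d * x_i for x_i in x] for d in delta1]       # Equation (23)
    return grad_W1, grad_W2

def numeric_grad_W1(W1, W2, x, t, j, i, eps=1e-6):
    """Central finite difference of the loss with respect to W1[j][i]."""
    up = [row[:] for row in W1]
    down = [row[:] for row in W1]
    up[j][i] += eps
    down[j][i] -= eps
    return (loss(up, W2, x, t) - loss(down, W2, x, t)) / (2 * eps)

# Hypothetical inputs and weights: 2 inputs, 2 hidden units, 3 classes.
x, t = [0.5, -1.0], [1, 0, 0]
W1 = [[0.1, -0.2], [0.4, 0.3]]
W2 = [[0.2, -0.1], [0.0, 0.5], [-0.3, 0.1]]

grad_W1, grad_W2 = backprop(W1, W2, x, t)
gap = max(abs(grad_W1[j][i] - numeric_grad_W1(W1, W2, x, t, j, i))
          for j in range(2) for i in range(2))
```

A smooth activation is used here only so that the finite-difference comparison is well behaved everywhere; with ReLU the same recursion applies, with $f'(z)$ equal to 0 for $z < 0$ and 1 for $z > 0$.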

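For the bias,

\[ \frac{\partial E}{\partial b_j^{L-1}} = \delta_j^{L-1} (1) = \delta_j^{L-1}. \tag{24} \]

1.5 Backpropagation Summary

To this point, we have all the derivatives we need to update our specific neural network (the one with ReLU activation, softmax output, and cross-entropy error), and they can be applied to an arbitrary number of layers. In fact, Backpropagation can be generalized and used with any activations and objectives. It is summarized in the following four equations:

\[ \delta_k^L = \frac{\partial E}{\partial z_k^L} \tag{25} \]

\[ \delta_j^{l-1} = f'(z_j^{l-1}) \sum_k \delta_k^l W_{kj}^l \tag{26} \]

\[ \frac{\partial E}{\partial W_{kj}^l} = \delta_k^l a_j^{l-1} \tag{27} \]

\[ \frac{\partial E}{\partial b_k^l} = \delta_k^l \tag{28} \]

1.6 Numerically Stable Softmax

This is a practical implementation issue. Calculating the exponentials in Softmax is numerically unstable, since the values could be extremely large and overflow. We can do a small trick by introducing a constant $C$ to mitigate this problem:

\[
\begin{aligned}
\mathrm{softmax}(x_i) &= \frac{e^{x_i}}{\sum_j e^{x_j}} && (29) \\
&= \frac{C e^{x_i}}{C \sum_j e^{x_j}} && (30) \\
&= \frac{e^{x_i + \log C}}{\sum_j e^{x_j + \log C}}. && (31)
\end{aligned}
\]

A common choice for the constant is $\log C = -\max_j x_j$, which makes every exponent non-positive.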
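The shift in Equation (31) can be illustrated with a short comparison in plain Python. The function names and input values below are hypothetical; the naive version is included only to show the overflow that the shift avoids.

```python
import math

def softmax_naive(x):
    """Equation (29), computed directly."""
    exps = [math.exp(v) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_stable(x):
    """Equation (31) with log C = -max(x): every exponent is <= 0."""
    shift = max(x)
    exps = [math.exp(v - shift) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

small = [1.0, 2.0, 3.0]
large = [1000.0, 1001.0, 1002.0]

# Both versions agree where the naive one does not overflow.
agree = all(abs(a - b) < 1e-12
            for a, b in zip(softmax_naive(small), softmax_stable(small)))

# math.exp raises OverflowError for arguments this large.
try:
    softmax_naive(large)
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True

stable_large = softmax_stable(large)
```

Since softmax depends only on the differences between its inputs, `large` and `small` give the same probabilities once the maximum is subtracted.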
