Neural Networks: Derivation

Compiled by Alvin Wan from Professor Jitendra Malik's lecture. This type of computation is called deep learning, and it is the most popular method for many problems, such as computer vision and image processing.

1 Model

For now, let us focus on a specific model of neurons. These are simplified from reality but can achieve remarkable results.

1.1 Single-Layer Neural Network

Let us call the inputs x_1, x_2, \ldots, x_n. We have several nodes v_1, v_2, \ldots, v_n, each of which receives a linear combination of the inputs, w_1 x_1 + w_2 x_2 + \cdots + w_n x_n. The convention for each edge weight is to list the index of the source in the first position and the index of the destination in the second. For example, the edge linking input 1 to node 2 would be w_{12}. The total input for node 2 would be

S_2 = \sum_{i=1}^{n} w_{i2} x_i

There are a variety of ways to translate S_2 into a decision, x_2 = g(S_2). We call g the activation function. We have several common options for activation functions:

1. Logistic: g(z) = \frac{1}{1 + e^{-z}}
2. ReLU (rectified linear unit): g(z) = 0 for z \le 0, and g(z) = z for z > 0

In a single-layer neural network, if g is logistic, we would simply have logistic regression.
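The single node above can be sketched in a few lines. This is a minimal illustration, not the lecture's code; the input and weight values are hypothetical, and NumPy is just one convenient implementation choice.

```python
import numpy as np

def logistic(z):
    """Logistic activation: 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    """ReLU activation: 0 for z <= 0, z for z > 0."""
    return np.maximum(0.0, z)

def node_output(x, w, g=logistic):
    """Total input S = sum_i w_i x_i, passed through activation g."""
    S = np.dot(w, x)
    return g(S)

# Hypothetical inputs and weights, for illustration only.
x = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, -0.25, 0.1])
out = node_output(x, w)            # logistic(0.5 - 0.5 + 0.3) = logistic(0.3)
```

Swapping `g=relu` into `node_output` gives the ReLU variant of the same node.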
1.2 Multi-Layer Neural Networks

The set of inputs could also be the output from another stage of the neural network. Now, we will consider a three-layer neural network. We call the inputs x, the second layer v, and the third layer o, where all activation functions g are the same. We can compute the outputs using the following:

o_i = g\left( \sum_j w_{ji} \, g\left( \sum_k w_{kj} x_k \right) \right)

Suppose the first stage and the second stage are both linear. Is this effective? Of course not: it is effectively one layer of linear combinations. As a result, it is important that you have some non-linearity. With that said, even the discontinuity in the ReLU's derivative is sufficient.

2 Training

What does training a neural network mean? It means finding a w such that the output o is as close as possible to y, the desired output: the target values for a regression problem, or the labels for a classification problem. Here is our 3-step approach:

1. Define the loss function L(w).
2. Compute \nabla_w L.
3. Update w_{\text{new}} \leftarrow w_{\text{old}} - \eta \nabla_w L.

Computing the gradient with respect to w means we compute partial derivatives. In other words, we compute \partial L / \partial w_{jk} for all edges using the chain rule. We want to know: how can we tweak each edge weight so that our loss is minimized?
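The claim that two linear stages collapse into one can be checked numerically. A minimal sketch, with arbitrary random weights (the shapes and seed are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4)
W1 = rng.normal(size=(3, 4))   # input -> hidden, no non-linearity
W2 = rng.normal(size=(2, 3))   # hidden -> output, no non-linearity

# Two purely linear stages...
two_stage = W2 @ (W1 @ x)

# ...are exactly one linear layer whose weight matrix is the product W2 @ W1.
one_stage = (W2 @ W1) @ x
```

The two results agree to machine precision, which is why a non-linear g between the stages is essential.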
2.1 Single-Layer Neural Networks

For binary classification, an effective loss function is the cross entropy. For regression, use squared error. Note that the following comes from the maximum likelihood estimate for the logistic model:

L = -\sum_i \left( y_i \log o_i + (1 - y_i) \log (1 - o_i) \right)

We can model the activation function g as a sigmoid, g(z) = \frac{1}{1 + e^{-z}}. Finding w then reduces to logistic regression, so we can use stochastic gradient descent.

2.2 Two-Layer Neural Network

Note that this generalizes to multiple layers. We can compute the gradient with respect to all the weights, from the input to the hidden layer and from the hidden layer to the output layer. The vector we get may be massive; its size is linear in the number of weights. We can use stochastic gradient descent. Despite the fact that it only finds local minima, this is sufficient for most applications. Computing gradients the naive way is quadratic in the number of weights. Backpropagation is a trick that allows us to compute the gradient in linear time. We can also add a regularization term to improve performance. Here is the simple idea. The following is an invocation of g with many arguments, one of which is w_{jk}:

o_i = g(\ldots, w_{jk}, \ldots, x)

We can compute the following.
o_i' = g(\ldots, w_{jk} + \Delta w_{jk}, \ldots, x)

Then, we can compute the losses L(o_i) and L(o_i'). We now have a numerical approximation for the gradient:

\frac{\partial L}{\partial w_{jk}} \approx \frac{L(o_i') - L(o_i)}{\Delta w_{jk}}

Each such evaluation is what is called a forward pass through the network, as we evaluate the entire network. The cost of a forward pass is linear in the number of weights. Given n weights, it would be O(n) for a single weight, but with n weights, this is O(n^2), which is extremely expensive.

3 Backpropagation Algorithm

First, let us consider the big picture. Computation can be, and should be, shared, so we will try to minimize work in this fashion. Consider an input layer, a hidden layer, and an output layer. We will compute a variable called \delta at the output layer. Using those deltas, we will compute the deltas for the hidden layer, and finally, compute the deltas at the input layer. This is why the algorithm is called backpropagation: we move from the outputs to the inputs. We can consider this a form of induction, where the output layer is the base case.

3.1 Chain Rule

Let us move on to finer details and develop a more precise structure. Recall the chain rule from calculus. Consider the following:

f = x^2 z
x = 2y
z = \sin y
To compute \partial f / \partial y, we apply the chain rule. Intuitively, a small change in y causes a small change in z, which in turn changes f. Alternatively, a small change in y could directly affect x, which changes f. There are, in effect, two paths to f. The chain rule tells us that

\frac{\partial f}{\partial y} = \frac{\partial f}{\partial x} \frac{\partial x}{\partial y} + \frac{\partial f}{\partial z} \frac{\partial z}{\partial y}

3.2 Notation

To distinguish between weights in different layers, we will use superscripts. w^{(l)}_{ij} is the link from node i in layer l-1 to node j in layer l. Note that l is the layer of the target node. Let L be the index of the highest layer, so if we have L = 10, l will range from 1 to 10. d^{(l)} will denote the number of nodes at layer l. The notation is fairly confusing, but we will try to follow the aforementioned conventions. To compute x_j for some node in layer l, we have the following. We simply took the equation from before and added superscripts to denote layers.

S^{(l)}_j = \sum_{i=1}^{d^{(l-1)}} w^{(l)}_{ij} x^{(l-1)}_i

x^{(l)}_j = g(S^{(l)}_j)
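The two-path chain rule above can be checked numerically with a finite difference. A small sketch (the evaluation point y = 0.7 and step size are arbitrary choices, not from the lecture):

```python
import math

def f_of_y(y):
    x = 2 * y           # x = 2y
    z = math.sin(y)     # z = sin y
    return x ** 2 * z   # f = x^2 z

def chain_rule_derivative(y):
    x = 2 * y
    z = math.sin(y)
    # df/dy = (df/dx)(dx/dy) + (df/dz)(dz/dy)
    #       = (2xz)(2)     + (x^2)(cos y)
    return (2 * x * z) * 2 + (x ** 2) * math.cos(y)

y = 0.7
h = 1e-6
numeric = (f_of_y(y + h) - f_of_y(y - h)) / (2 * h)  # central difference
analytic = chain_rule_derivative(y)
```

The two values agree to several decimal places, confirming that both paths through x and z must be summed.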
3.3 Defining \delta

We will use e for error, which is equivalent to loss. Let us first define the error at a specific node:

\delta^{(l)}_j = \frac{\partial e}{\partial S^{(l)}_j}

Using these deltas, we can then compute the gradients. Here is how. Recall that we are looking for the gradient of the error with respect to a weight. w affects S, and S then affects x. We can also say the reverse: the error at x is a function of the error in S, which is in turn a function of the error in w. More formally, we can apply the chain rule to re-express the derivative of e with respect to the weight:

\frac{\partial e}{\partial w^{(l)}_{ij}} = \frac{\partial e}{\partial S^{(l)}_j} \frac{\partial S^{(l)}_j}{\partial w^{(l)}_{ij}}

We can now use our delta notation. In fact, note that the first term is simply our delta:

\frac{\partial e}{\partial w^{(l)}_{ij}} = \delta^{(l)}_j \frac{\partial S^{(l)}_j}{\partial w^{(l)}_{ij}}

S is a linear combination of all the w, but we are taking the derivative with respect to a specific w. As a result, we have that \partial S^{(l)}_j / \partial w^{(l)}_{ij} = x^{(l-1)}_i, so

\frac{\partial e}{\partial w^{(l)}_{ij}} = \delta^{(l)}_j x^{(l-1)}_i

If we can compute the deltas at every node in the neural network, we have an extremely simple formula for the gradient at every node, for every weight. As a result, we will have achieved our goal, which was to compute the gradient of the error with respect to each weight. We will now compute the gradient of the error with respect to each node.
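In vectorized form, the formula \partial e / \partial w^{(l)}_{ij} = \delta^{(l)}_j x^{(l-1)}_i says the full gradient matrix for a layer is an outer product. A tiny sketch with hypothetical delta and activation values:

```python
import numpy as np

# Hypothetical deltas at layer l and activations at layer l-1.
delta = np.array([0.1, -0.3])        # delta_j^{(l)},  d^{(l)} = 2
x_prev = np.array([1.0, 0.5, 2.0])   # x_i^{(l-1)},    d^{(l-1)} = 3

# de/dw_{ij}^{(l)} = delta_j^{(l)} * x_i^{(l-1)}: one outer product
# gives the gradient for every weight into layer l at once.
grad = np.outer(x_prev, delta)       # shape (d^{(l-1)}, d^{(l)})
```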
3.4 Base Case

We will now build the full algorithm, using induction. First, run the entire network, so that we have the values g(S^{(L)}). We compute \delta in the final layer to start:

\delta^{(L)}_j = \frac{\partial e}{\partial S^{(L)}_j}

Given the following error function, we can plug in and take the derivative to compute \delta:

e = \frac{1}{2} \left( g(S^{(L)}_j) - y_j \right)^2

We compute the deltas for our output layer first. Apply the chain rule, and note that y_j is constant with respect to w:

\delta^{(L)}_j = \frac{\partial}{\partial S^{(L)}_j} \left( \frac{1}{2} \left( g(S^{(L)}_j) - y_j \right)^2 \right) = \left( g(S^{(L)}_j) - y_j \right) g'(S^{(L)}_j)

Suppose g is the ReLU. Then, we have the following possible outputs for g'(S^{(L)}_j):

g'(S^{(L)}_j) = 1 if S^{(L)}_j > 0, and 0 otherwise
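The base case can be sketched directly from the formula \delta^{(L)}_j = (g(S^{(L)}_j) - y_j) g'(S^{(L)}_j). The pre-activations and targets below are hypothetical values for illustration:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_prime(z):
    """g'(S) = 1 if S > 0, else 0, matching the ReLU defined earlier."""
    return (z > 0).astype(float)

# Hypothetical pre-activations S^{(L)} and targets y at the output layer.
S_L = np.array([0.5, -1.2, 2.0])
y = np.array([1.0, 0.0, 1.5])

# Base case: delta^{(L)} = (g(S^{(L)}) - y) * g'(S^{(L)})
delta_L = (relu(S_L) - y) * relu_prime(S_L)
```

Note how the middle node, whose pre-activation is negative, gets a delta of exactly zero: the ReLU is flat there, so no error signal flows back through it.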
3.5 Inductive Step

We compute \delta for an intermediate layer. Take an identical definition of \delta. Per the inductive hypothesis, assume that all \delta^{(l)}_j have been computed.

\delta^{(l-1)}_i = \frac{\partial e}{\partial S^{(l-1)}_i}

We examine x^{(l-1)}_i. Which quantities does it affect? The answer: it may affect any node in the layer above it, x^{(l)}_j. Specifically, x^{(l-1)}_i affects S^{(l)}_j, which in turn affects x^{(l)}_j. This is, again, expressed formally using the chain rule:

\delta^{(l-1)}_i = \sum_j \frac{\partial e}{\partial S^{(l)}_j} \frac{\partial S^{(l)}_j}{\partial x^{(l-1)}_i} \frac{\partial x^{(l-1)}_i}{\partial S^{(l-1)}_i}

Let us now simplify this expression. The first term is, in fact, \delta^{(l)}_j by definition:

\delta^{(l-1)}_i = \sum_j \delta^{(l)}_j \frac{\partial S^{(l)}_j}{\partial x^{(l-1)}_i} \frac{\partial x^{(l-1)}_i}{\partial S^{(l-1)}_i}

Recall that S is a linear combination of weights and nodes, so the second term is simply w:

\delta^{(l-1)}_i = \sum_j \delta^{(l)}_j w^{(l)}_{ij} \frac{\partial x^{(l-1)}_i}{\partial S^{(l-1)}_i}

Finally, the last term can be re-expressed using the activation function g:

\delta^{(l-1)}_i = \sum_j \delta^{(l)}_j w^{(l)}_{ij} \, g'(S^{(l-1)}_i)
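The inductive step is a matrix-vector product followed by an elementwise scaling. A minimal sketch, again with hypothetical deltas, weights, and pre-activations (using the ReLU derivative convention from the base case):

```python
import numpy as np

def relu_prime(z):
    return (z > 0).astype(float)

# Hypothetical quantities: deltas already computed at layer l, the weight
# matrix W[i, j] = w_{ij}^{(l)}, and pre-activations S^{(l-1)}.
delta_next = np.array([0.2, -0.1])       # delta_j^{(l)},   d^{(l)} = 2
W = np.array([[ 1.0, 0.5],
              [-1.0, 2.0],
              [ 0.0, 1.0]])              # shape (d^{(l-1)}, d^{(l)})
S_prev = np.array([0.3, -0.7, 1.1])      # S_i^{(l-1)}

# delta_i^{(l-1)} = sum_j delta_j^{(l)} w_{ij}^{(l)} * g'(S_i^{(l-1)})
delta_prev = (W @ delta_next) * relu_prime(S_prev)
```

Each backward step costs one pass over the weights between the two layers, which is where the linear-time claim in the next section comes from.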
3.6 Full Algorithm

Note that this algorithm is linear in the number of weights in the network, not linear in the number of nodes. The summation derived above demonstrates this.

1. We compute \delta at the output layer.
2. We compute \delta at the intermediate layers, working backwards.
3. Once this has concluded, we can use the following to compute the derivative with respect to any weight w:

\frac{\partial e}{\partial w^{(l)}_{ij}} = \delta^{(l)}_j x^{(l-1)}_i

This concludes the introduction to neural networks and their derivation.
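The three steps above can be assembled into a complete sketch for a two-layer network. This is one possible implementation under stated assumptions, not the lecture's code: sigmoid activations throughout, squared-error loss, and arbitrary random shapes and seed.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, W2):
    """Forward pass; returns pre-activations and activations for each layer."""
    S1 = W1.T @ x          # S^{(1)}
    x1 = sigmoid(S1)       # x^{(1)}
    S2 = W2.T @ x1         # S^{(2)}
    x2 = sigmoid(S2)       # outputs
    return S1, x1, S2, x2

def backprop(x, y, W1, W2):
    """One backward pass yields every gradient (linear in the weights).
    Uses sigmoid'(S) = g(S)(1 - g(S))."""
    S1, x1, S2, x2 = forward(x, W1, W2)
    delta2 = (x2 - y) * x2 * (1 - x2)        # base case at the output layer
    delta1 = (W2 @ delta2) * x1 * (1 - x1)   # inductive step
    # de/dw_{ij} = delta_j * x_i, as one outer product per layer.
    return np.outer(x, delta1), np.outer(x1, delta2)

rng = np.random.default_rng(0)
x = rng.normal(size=3)
y = np.array([0.0, 1.0])
W1 = rng.normal(size=(3, 4))   # input -> hidden
W2 = rng.normal(size=(4, 2))   # hidden -> output
gW1, gW2 = backprop(x, y, W1, W2)
```

Comparing `gW1` and `gW2` against the finite-difference approximation from Section 2.2 (one perturbed forward pass per weight) confirms the derivation, while costing only a single backward pass.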